Extending expert system technology to handle the text-based representations needed for legal reasoning and for similar text-based information processing.

3 sub-projects:

a) inventing and developing a knowledge representation which can integrate text, including huge text bodies

a) Jan 1, 1994 to Feb. 19, 1994 (continued from 1993)

a) the first project (continued from 1993) involved inventing and developing a text-based knowledge and information representation appropriate for huge text-based information sets. A scheme was developed, along with suitable technology to transform information to fit it. The scheme involves "naming" fairly small individual chunks of information so that they can be individually addressed and jumped to or retrieved. It extends the concept of cataloguing and can be used to time-stamp information. To the best of our knowledge the scheme is novel and unique. The development of the supporting technology and the first test were slated to be completed by the end of 1993, but in fact were not completed until mid-February 1994. A first report on this approach is appended.
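The naming idea can be illustrated with a minimal C sketch. The dotted name format and date-stamp layout below are assumptions for illustration only; the actual scheme is described in the appended report.

```c
#include <stddef.h>
#include <string.h>

/* Illustrative sketch only: the dotted name format and date-stamp
 * layout are assumptions, not the actual scheme.  Each small chunk
 * of text carries a unique name and a time stamp, so it can be
 * individually addressed, jumped to, or retrieved. */
struct chunk {
    const char *name;   /* e.g. "act.12.3" for s. 12(3) of an act */
    const char *stamp;  /* e.g. "1994-02-15" */
    const char *text;   /* the chunk's content */
};

/* Retrieve a chunk by its name; a real store would index the names
 * rather than scan linearly. */
const struct chunk *find_chunk(const struct chunk *c, size_t n,
                               const char *name)
{
    for (size_t i = 0; i < n; i++)
        if (strcmp(c[i].name, name) == 0)
            return &c[i];
    return NULL;
}
```

Because every chunk is addressable by name, hypertext jumps and retrieval reduce to name lookup.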

 

b) developing a set of software tools for automatic and semi-automatic text transformations

b) Feb 7, 1994 to June 22, 1994; -- phase 2: Aug. 8 to Nov. 18, 1994

b) the second project involved the development of a set of tools which allow automatic and semi-automatic text transformations. The first goal was to build a set of tools which could fully automatically transform the complete set of Ontario Statutes from a WordPerfect representation into a Folio electronic book representation for CDs. The transformation included the extraction of defined terms. Both English and French text would be transformed, and hypertext links would be built automatically. Development of the first tools started in early February. Full tool development started in April, with first-phase completion on June 22. (This project did not include building an automatic index.)

The first experiments used Rexx, but these failed because of slow performance and limitations in handling large texts. The second experiment used Awk and PolyAwk, which succeeded at the transformations but still performed inadequately. The goal was to do the transformation in 1.5 days, so that a partial update could be done overnight. This first version was too slow by a factor of 2.5 or 3.

The second phase of the project started in the middle of August with the development of special sort routines in C to help in building indexes and in speeding up the transformation. This phase was completed in the middle of November with a significant speed-up, though performance was not yet fully satisfactory. (The next phase is not anticipated until sometime in 1995, when transformation into HTML will be included.)

.sort & sort_l -- these DOS routines are for lexically sorting an index, e.g., an index of defined terms. Each entry to be sorted must be represented as a single line, and the lines are sorted relative to one another. The target string on which a line is sorted must always have the same offset from the beginning of the line. These routines can handle large files, are relatively fast, and can ignore initial tags (C based). Unlike commercially available sort routines, they can handle very long lines and do not suffer from the usual size restrictions.
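The fixed-offset keying can be sketched as follows. This is an illustrative reconstruction in C, not the routine's actual source; the tag width is simply a parameter the caller chooses.

```c
#include <stdlib.h>
#include <string.h>

/* Illustrative sketch of the sort/sort_l idea (not the actual
 * source): each entry is a single line, and the sort key begins at
 * a fixed offset from the start of the line, so a run of initial
 * tags can be skipped simply by choosing the offset. */
static size_t key_offset;

static int cmp_at_offset(const void *a, const void *b)
{
    const char *la = *(char *const *)a;
    const char *lb = *(char *const *)b;
    return strcmp(la + key_offset, lb + key_offset);
}

/* Lexically sort n lines on the key found at the given offset. */
void sort_lines(char **lines, size_t n, size_t offset)
{
    key_offset = offset;
    qsort(lines, n, sizeof *lines, cmp_at_offset);
}
```

Keying on an offset rather than parsing the tags keeps the comparator trivial, which matters when sorting very large index files.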

.sort_p -- this DOS routine is for lexically sorting larger multi-line chunks of information. The line on which the sorting is based must be uniquely identified by initial tags. This line must be the first line of the chunk (or a fixed number of lines from the beginning of the chunk). A chunk can contain a variable number of lines, as it is bounded by the next unique sorting line or the end of the file. To give an example, the "record" entries in FileLaw (a Carswell electronic publication soon to be published) were not in alphabetical order. This routine was used to sort the files within provinces and within industry types to put the entry records in the correct lexical order. (C based)
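The chunk-sorting idea can be sketched like this. It is an assumption-laden reconstruction: here each chunk is held as one string and keyed on its first line, whereas the real routine works on tagged lines within files.

```c
#include <stdlib.h>
#include <string.h>

/* Sketch of the sort_p idea (not the actual source): each chunk is
 * held here as one string, and chunks are ordered by their first
 * line only; everything after the first '\n' rides along unsorted. */
static int cmp_first_line(const void *a, const void *b)
{
    const char *ca = *(char *const *)a;
    const char *cb = *(char *const *)b;
    for (;; ca++, cb++) {
        /* Treat end-of-line as end-of-key. */
        int x = (*ca == '\n' || *ca == '\0') ? 0 : (unsigned char)*ca;
        int y = (*cb == '\n' || *cb == '\0') ? 0 : (unsigned char)*cb;
        if (x != y || x == 0)
            return x - y;
    }
}

/* Order n multi-line chunks by their sorting (first) line. */
void sort_chunks(char **chunks, size_t n)
{
    qsort(chunks, n, sizeof *chunks, cmp_first_line);
}
```

Bounding each chunk by the next sorting line (rather than a fixed length) is what lets records of varying size be reordered as units.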

.trans -- this DOS system translates from a specially formatted ASCII source format into Folio flat files for a given design. During the same pass it can report errors and prepare index files. An associated err routine translates and helps interpret the error reports. The system is complex and needs to be bundled with consulting for customization. It is also highly novel, running parallel automatic editorial processes in synchronization. The system should be protected as proprietary, and thus can be licensed without technology transfer. It is based on PolyAwk and comes with special utilities to prepare batch files automatically and to interpret error log files.

.trans2 -- this DOS system can also translate from source into Folio flat files, but it is much less powerful than trans. It is based on Rexx.

 

c) developing a set of tools for quality assurance and change management for huge bodies of text

c) Dec. 5, 1994 to Dec. 31, 1994 (continued into 1995)

c) the third project involves the development of a set of tools for quality assurance and change management of large bodies of text. It started at the beginning of December 1994 and is planned for completion in March or April 1995. Unlike in traditional engineering disciplines, there is very little formal change management control even in editorial production that demands high accuracy, e.g., the publication of legal texts. Quality assurance is handled in a very labour-intensive fashion by copy-reading galley proofs. This process can and must be automated if one is to handle large bodies of text in a timely and cost-effective fashion. The first approach is to automate and semi-automate the copy-to-copy reading and chunk checking that is now done on printed page proofs. A multiple-window approach lets the textual material flow in parallel, with differences noted in a change management report.

c) Here is a list of tools for quality assurance and change management of large bodies of text. (Note: Most of them were completed in 1995 or are still under development. None of them have yet been used by any client.)

.qa & qa_index -- these Windows routines are for quality assurance and change management. They compare two files and detect differences. The differences are noted in a file which can be saved or printed. qa_index can also automatically prepare a change index which indicates the differences between successive versions and allows users to follow jump links directly to the chunks of information which have changed in the new release. (C based)
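The core comparison can be sketched as below. This is a simplified illustration that pairs the two versions line by line; the shipped routines also have to align inserted and deleted lines, which requires a proper diff algorithm that this sketch does not attempt.

```c
#include <stddef.h>
#include <string.h>

/* Simplified sketch of the qa comparison (not the actual source):
 * pair the two versions line by line and record the 1-based numbers
 * of lines that differ.  A real comparison must also align inserted
 * and deleted lines (a proper diff), which this sketch does not do. */
size_t diff_lines(const char **va, size_t na,
                  const char **vb, size_t nb,
                  size_t *changed, size_t max)
{
    size_t n = (na > nb) ? na : nb;
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        const char *a = (i < na) ? va[i] : "";  /* past the end: empty */
        const char *b = (i < nb) ? vb[i] : "";
        if (strcmp(a, b) != 0 && count < max)
            changed[count++] = i + 1;           /* 1-based line number */
    }
    return count;
}
```

The recorded line numbers are what a change index would turn into jump links pointing at the changed chunks.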

Example 1: qa could compare two successive releases of TaxPartner (Carswell), indicating precisely what was changed in the new release. This change record can be used as a log of changes, and to obtain appropriate management approval and sign-off for the changes.

Example 2: qa_index can do what qa can do, and at the same time automatically prepare an index of the changes that can be integrated into the TaxPartner product. qa_index is still under development. It would need customization for each product to which it would be applied.

.char_map -- this Windows routine checks the character set of a document and translates extended ASCII characters into the equivalent ANSI Latin 1 characters required for Folio flat files. The ANSI character set is the standard for Windows and for the Mac. The routine can also map string- or tag-encoded characters used in SGML and other representations, and can map special characters into tag strings, such as characters in special symbol fonts or superscript and subscript tags. (C based)
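A table-driven mapping of this kind can be sketched as follows; the mapping pairs used in the example are illustrative only, not the product's actual tables.

```c
#include <stddef.h>
#include <string.h>

/* Sketch of a table-driven character mapper in the spirit of
 * char_map (the table entries are illustrative, not the actual
 * mapping): each source byte either passes through unchanged or is
 * replaced by a string, which may be a single target character or a
 * tag sequence. */
struct map_entry {
    unsigned char from;
    const char   *to;
};

/* Map in into out (outsz bytes), returning the output length. */
size_t map_chars(const char *in, char *out, size_t outsz,
                 const struct map_entry *tab, size_t ntab)
{
    size_t o = 0;
    for (const unsigned char *p = (const unsigned char *)in; *p; p++) {
        const char *rep = NULL;
        for (size_t i = 0; i < ntab; i++)
            if (tab[i].from == *p) { rep = tab[i].to; break; }
        if (rep) {
            size_t len = strlen(rep);
            if (o + len < outsz) { memcpy(out + o, rep, len); o += len; }
        } else if (o + 1 < outsz) {
            out[o++] = (char)*p;    /* unmapped bytes pass through */
        }
    }
    if (o < outsz) out[o] = '\0';
    return o;
}
```

Keeping the mapping in a table, rather than in code, is what makes the routine easy to customize per source document.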

.tag_ex -- this DOS routine extracts all SGML or Folio tags from a file and lists them. The list can be sorted with the sort routine above. The routine is used for document analysis and QA. It is also used when converting from strange typesetting systems, by first extracting the tags in preparation for subsequent translation and clean-up. (Rexx based)
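The extraction step can be sketched in C (the original is Rexx based; this is an illustrative reconstruction that scans for <...> pairs):

```c
#include <stddef.h>
#include <string.h>

/* Illustrative reconstruction of the tag_ex idea (the original is
 * Rexx based): copy every <...> tag found in the input into a
 * newline-separated list suitable for sorting and inspection. */
size_t extract_tags(const char *in, char *out, size_t outsz)
{
    size_t o = 0;
    for (const char *p = in; *p; p++) {
        if (*p != '<')
            continue;
        const char *end = strchr(p, '>');
        if (!end)
            break;                    /* unterminated tag: stop */
        size_t len = (size_t)(end - p) + 1;
        if (o + len + 1 < outsz) {
            memcpy(out + o, p, len);  /* the tag itself  */
            o += len;
            out[o++] = '\n';          /* one tag per line */
        }
        p = end;
    }
    out[o] = '\0';
    return o;
}
```

Sorting the resulting list collapses duplicates together, giving a quick inventory of every tag a document uses.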

.char_ex -- this DOS routine extracts all SGML-based special characters from a file and lists them (&xxx; format). The routine supports document analysis and QA, and can be used to customize char_map by assigning Folio characters and strings to electronic versions of documents. (Rexx based)
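The same scanning idea extracts the special characters; again, this is a C sketch of the Rexx-based routine, and it is naive in that it assumes the next semicolon closes the entity.

```c
#include <stddef.h>
#include <string.h>

/* Illustrative reconstruction of the char_ex idea (the original is
 * Rexx based): list every &xxx; special character found in the
 * input, one per line.  Naive: it assumes the next ';' closes the
 * entity, so a bare '&' followed much later by a ';' confuses it. */
size_t extract_entities(const char *in, char *out, size_t outsz)
{
    size_t o = 0;
    for (const char *p = in; *p; p++) {
        if (*p != '&')
            continue;
        const char *end = strchr(p, ';');
        if (!end)
            continue;                 /* no terminator: skip the '&' */
        size_t len = (size_t)(end - p) + 1;
        if (o + len + 1 < outsz) {
            memcpy(out + o, p, len);  /* the &xxx; sequence */
            o += len;
            out[o++] = '\n';          /* one entity per line */
        }
        p = end;
    }
    out[o] = '\0';
    return o;
}
```

The sorted, de-duplicated entity list is exactly what is needed to decide which char_map table entries a given document requires.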