Chapter 10 Handling Text

Strings, as you learned in the last chapter, are lists of characters. This means that you can bring to bear all the inherited features of the class List on a string. You can add, drop, reposition, and transpose characters. You can splice in a new substring and cut out an existing substring. You can examine characters at a specified position and, as they say in the commercials, much, much, more.

Strings, however, go beyond being mere lists of characters because lists of characters are typically text. And text--with its rich groupings of letters, spaces, numerals, symbols, and punctuation--resonates with much deeper meaning than simple lists.

To help you deal with some of the subtle complexities of text, the Telescript language endows the String class with features designed for strings of text. Telescript also provides the class Pattern, which you can use to perform sophisticated searches through the text of a string, and to modify the string according to the results of the search. In this chapter, you'll learn how to use both classes together to handle text in the Telescript world.

An Overview of Strings and Patterns

The following class hierarchy shows the inheritance of the two classes described in this chapter. Both classes appear in boldface.

Object

* Collection[Character, Equal] (Equal)
* * List[Character, Equal] (Ordered)
* * * String (Cased)
* Pattern

As you can see, the String class is a subclass of the derived class List[Character, Equal], so it is a list of characters. Because its items are defined by the class Character, each of its items is a Unicode character, one of approximately 65,000 possible characters defined in The Unicode Standard, Version 1.0 and The Unicode Standard, Version 1.1. The first 128 characters of the Unicode standard are the characters of the ASCII standard, a group of characters you'll use constantly if you're dealing with English-language text and many other European-language texts.

String also inherits from the mix-in class Cased which, as you'll recall from Chapter 5, offers operations that recognize and modify upper- and lower-case characters. These operations are specialized in String to handle strings of characters, not single characters as you read about in the Character class description of Chapter 5.

String's own features allow you to read a substring within a string and to convert a string to an identifier, an octet string, an integer, or a real number. These are all simple features. To perform more involved work on a string, you must use a pattern on the string.

The Pattern class is a direct subclass of Object and has no mix-in superclasses. When you initialize a Pattern object, you supply a string as its initialization argument. This string, called the pattern's search text, is a search specification for text within another string--the subject string. This search text is the pattern's sole property. It can't be changed once the pattern is initialized.

You can bring a pattern and its search text to bear on a subject string by calling an operation on the pattern. Each of the pattern's three operations accepts a subject string as one of its arguments, and then searches through the subject string for text that satisfies the pattern's search text. Text that satisfies the search text is called matching text.

A pattern's operations offer several ways to work on a subject string. One operation simply returns the location of matching text so you can do with it as you please. The second operation replaces each occurrence of matching text with text from a third string containing substitution text. The third operation splits the subject string wherever it finds matching text, and returns the substrings created by splitting.

The String Class

The String class, like its kindred List-descended classes BitString and OctetString, is so common and fundamental in use that Telescript syntax provides a literal form of a string. A string literal allows you to create and use a string without formally initializing an instance of String. A string literal starts with a double quote ("), follows with a series of characters that are the string's items, and ends with a second double quote. For example, "Kalamazoo Michigan" is a string literal composed of the characters between the quotes. You can assign it directly to a variable if you want:

cityState := "Kalamazoo Michigan"

A string literal, like an octet-string literal or a bit-string literal, returns a protected object. This means that you can't modify the contents of a string created by a literal, and you can't supply a string literal as the argument for an operation that requires an unprotected object--you must use an unprotected string created by initialization or a copy of a literal instead. This is particularly important when using the substitute operation of class Pattern, which you'll learn to do later in this chapter. It requires an unprotected string argument, and won't work if you supply a string literal.

Breaks in a String

As you read in Chapter 3, a Telescript source program has many breaks--space, horizontal tab, or line feed characters--that separate program tokens without changing their meaning. Spaces, for example, separate keywords, and line feeds often separate statements so you can easily see where each statement starts. When the code is compiled, these breaks are ignored unless they serve to separate alphanumeric tokens. In any case, breaks have no intrinsic syntactic meaning.

String literals, on the other hand, don't ignore breaks. They are a single program token, and all characters you enter within the double quotes of a string become characters of that string even if they are break characters. If you assign a string literal to a variable as shown in the following statement,

cityState: = "Kalamazoo
Michigan";

the string assigned to the variable is a single string with a line feed character between Kalamazoo and Michigan.

You can, if you'd like, include break characters in a string by listing them as escape sequences (which are discussed in Chapter 5). You can create the same string as you did in the previous statement by using this statement:

cityState: = "Kalamazoo\nMichigan";

The \n escape sequence denotes a line feed character that occurs between Kalamazoo and Michigan.

Strings accept the same escape sequences that you use for characters; you'll find the full table of escape sequences in Chapter 5 and in Appendix B. They're the escape sequences defined by the ANSI C standard.

Initializing a String

If you want to initialize a string using a constructor expression, the string takes a variable number of arguments, each of which must be either a character or a string. Upon initialization, the characters and strings provided are concatenated in the order supplied to create one large string. This initialization, for example,

text = String('A', 'B', 'C', " starts the alphabet.");

creates a single string that reads "ABC starts the alphabet."

Initializing a string using multiple arguments is an effective way to concatenate existing strings. Simply provide the strings you want concatenated as the initialization arguments and use the newly instantiated string as the concatenated result.

Ordering Strings

String specializes its inherited operation order so that when two strings are compared, the shorter of the two strings is treated as though it had enough NULs appended to it to equal the length of the longer string. (A NUL is ASCII character 0 or Unicode character U+0000.) This means, for example, that the string "Alpha" is before the string "Alphabet".
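The padding rule can be modeled in a few lines. This is a Python sketch, not Telescript; the function name compare_padded and its -1/0/1 return convention are assumptions made for illustration, since the text only says which string comes first.

```python
def compare_padded(a: str, b: str) -> int:
    """Return -1, 0, or 1 as a sorts before, equal to, or after b."""
    width = max(len(a), len(b))
    # Pad the shorter string with NUL (U+0000), as the text describes.
    padded_a = a.ljust(width, "\0")
    padded_b = b.ljust(width, "\0")
    return (padded_a > padded_b) - (padded_a < padded_b)

print(compare_padded("Alpha", "Alphabet"))  # -1: "Alpha" is before "Alphabet"
```

Because NUL has the lowest possible character code, a string that is a prefix of another always sorts first.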

Reading a Substring

String offers a single operation to examine the contents of the string: substring, which accepts two integers. These integers specify, respectively, the position of the first character in the substring and the position of the character following the last character in the substring. (The substring positions for a string are the same as for a list.) When the operation executes, it reads the specified substring and returns it as the result of the operation. The operation doesn't change the string itself.
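The two position arguments describe a half-open range: the first character wanted, and the character just past the last one wanted. The Python sketch below models that behavior. It assumes positions count from 1, which is inferred from this chapter's example program (where substring(10,13) returns characters 10 through 12); the helper is an illustration, not the Telescript operation itself.

```python
def substring(text: str, first: int, beyond: int) -> str:
    # Assumption: positions count from 1, so shift both to Python's 0-based slice.
    return text[first - 1 : beyond - 1]

sentence = "Toss ten tin toys to two tan tots."
print(substring(sentence, 10, 13))  # tin
print(sentence)                     # the original string is unchanged
```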

Dealing With Case

The String class, like the Character class, is defined with the mix-in class Cased so that it can distinguish between upper- and lowercase characters. Upper- and lowercase forms of some characters are defined by the Unicode standard; Cased follows those definitions. "M", for example, is the uppercase form of the lowercase character "m". And "SS" is the uppercase form of the lowercase German character "ß" (a somewhat unusual uppercase equivalent in that it turns one character into two).

String inherits four features from Cased. It specializes these features so they work with strings of characters instead of single characters (as Cased works with the class Character):

The isLower attribute is true if there are any lowercase characters in the string; it's false if there are no lowercase characters.
The isUpper attribute is true if there are any uppercase characters in the string; it's false if there are no uppercase characters.
The makeLower operation returns a new string that is a copy of the responder string with all uppercase characters converted to lowercase characters.
The makeUpper operation returns a new string that is a copy of the responder string with all lowercase characters converted to uppercase characters.

Note that isLower and isUpper can both return true on a string if the string contains both lower- and uppercase characters.
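The "any character" meaning of the two attributes is easy to get wrong, so here is a small Python model of it. The helper names is_lower and is_upper are stand-ins for the Telescript attributes; Python's own case rules are used as an approximation of the Unicode definitions that Cased follows.

```python
def is_lower(text: str) -> bool:
    # True if at least one character is lowercase.
    return any(ch.islower() for ch in text)

def is_upper(text: str) -> bool:
    # True if at least one character is uppercase.
    return any(ch.isupper() for ch in text)

mixed = "Toss ten tin toys"
print(is_lower(mixed), is_upper(mixed))  # True True
print("straße".upper())                  # STRASSE
```

The last line shows the one-character-to-two conversion mentioned above: the uppercase copy is one character longer than the original.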

Converting a String

The String class offers four conversion operations to convert a string into other types of objects:

asOctetString converts a string into an octet string. The Unicode characters in the string are encoded as octets using FSS-UTF (File System Safe Unicode character set Transformation Format, described in the Chapter 9 discussion of octet strings). Each character may be converted into one, two, or three octets as described in Appendix F of The Unicode Standard Version 1.1. (Note that the asString operation of class OctetString performs the converse action: it turns bytes of FSS-UTF encoding back into a string of Unicode characters.)
asIdentifier converts a string into an identifier. The string must follow the rules for acceptable identifier text as described in Chapter 5. If the text is malformed for use as an identifier, this operation throws the exception ConversionUnavailable. For example, the string "convertible" contains legal identifier text, so asIdentifier converts it to an identifier. The string "non.convertible" contains a period, an illegal identifier character, so asIdentifier throws ConversionUnavailable.
asInteger converts a string into an integer. The string must be one or more decimal numerals. If the characters in the string aren't a well-formed integer as described in Chapter 5, the operation throws ConversionUnavailable.
asReal converts a string into a real number. The characters of the string must be a well-formed real number as described in Chapter 5--which means it may optionally include a fractional component and a mantissa. If the string isn't well-formed real-number text, this operation throws ConversionUnavailable.

Each of these operations leaves the responder string in its original condition.

Example Code

This very simple program creates two string variables, concatenates them when initializing a third string, then demonstrates String and Cased features on the concatenated string:

do{
	firstText := "Toss ten tin toys ";
	secondText := "to two tan tots.";
	joinedText := String(firstText, secondText);
	"Here's a concatenated string:".dump();
	joinedText.dump();
	"Here are characters 10 through 12:".dump();
	joinedText.substring(10,13).dump();
	"True or false: uppercase characters in the string?".dump();
	joinedText.isUpper.dump();
	"Here's a copy of the string converted to all uppercase:".dump();
	joinedText.makeUpper().dump();
	"Here's what the string looks like converted to an octet string:".dump();
	joinedText.asOctetString().dump();
}

When it runs, you see these results:

String: <Here's a concatenated string:>
String: <Toss ten tin toys to two tan tots.>
String: <Here are characters 10 through 12:>
String: <tin>
String: <True or false: uppercase characters in the string?>
Boolean: <true>
String: <Here's a copy of the string converted to all uppercase:>
String: <TOSS TEN TIN TOYS TO TWO TAN TOTS.>
String: <Here's what the string looks like converted to an octet string:>
OctetString: <546F73732074656E2074696E20746F797320746F2074776F2074616E20746F74732E>
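The octets in the last line of output can be cross-checked outside Telescript. FSS-UTF is the encoding now known as UTF-8, so the Python sketch below uses Python's "utf-8" codec as a stand-in for asOctetString; the sample characters chosen to show two- and three-octet encodings are illustrative.

```python
joined_text = "Toss ten tin toys to two tan tots."

# Encode the string and show the octets as uppercase hexadecimal.
octets = joined_text.encode("utf-8")
print(octets.hex().upper())

# ASCII characters take one octet; other characters take two or three.
for ch in ("A", "Ж", "中"):
    print(ch, len(ch.encode("utf-8")))

# Decoding is the converse operation and restores the original string.
print(octets.decode("utf-8") == joined_text)  # True
```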

The Pattern Class

As you read earlier in the chapter, the class Pattern defines an object that you can use to search through and modify text in a subject string. The pattern's sole property is a search text that defines matching text to find within the subject string. When you call any of the three Pattern operations on a pattern, you supply the subject string as an argument. This brings the pattern's search text to bear on the subject string.

The search text of a pattern has its own syntax. That syntax allows you to specify wild-card characters, repetitions, string position, and other general requirements for matching text. To use a pattern, you must understand that syntax.

Creating Search Text

When you create search text for a pattern, you use search-text syntax to create a string of characters. This syntax uses the following metacharacters:

$%()*+-.?[]\^|

When a pattern translates its search text to look for matching text, it interprets these metacharacters as directions for the search--it doesn't read them as literal characters. For example, it interprets the metacharacter $ to mean that the search text preceding it must occur at the end of the subject string. You'll see what the various metacharacters do in the descriptions that follow. In the meantime, it's enough to know that they're not interpreted as literal characters, and that all other characters are interpreted as literal characters.

Search Elements

A search element is the most basic unit of search text. A search element specifies a matching element, the most basic unit of matching text that can be specified in search text. A search element is usually a search specification that requires a single character as the matching element. For example, the search text "M" has a single search element: the literal character "M", which specifies the matching element "M" in the subject string. (A search element may also be more complex than one that requires a single character as a matching element--but that's a complex topic you'll read about later in this chapter. We'll stick to single characters for now.)

Search elements may be strung together in succession to specify a corresponding succession of matching elements. For example, the search text "Look" has four search elements: the literal characters "L", "o", "o", and "k". They specify the four matching elements "L", "o", "o", and "k". Together, they specify the matching text "Look".

Search elements that specify a single character can take any one of these forms:

A named character
A wild-card character
An interval-specified character
A list-specified character
An attribute-specified character

Named Characters

Named characters are literal characters such as "Z", "B", "!", and ":". A named character specifies a matching character that is that character and no other. "Z" specifies "Z", "B" specifies "B", and so on. Named characters can be escape sequences (explained below and shown in Chapter 5) as well. You can, for example, use the escape sequence "\n" as a named character that names a line feed.

There are times when you'll want to use a search-text metacharacter as a literal character in search text. To do so, precede the metacharacter with a backslash. To place a question mark in search text, for example, use \?. To place a backslash in search text, use \\. (The first backslash tells the pattern to consider the second backslash as a literal character.) To represent the sentence "Are you there?", you would use this search text: Are you there\?

It's important to note that when you instantiate a pattern in your source code, you supply the search text as a string. Remember that a string may contain any of the escape sequences described in Chapter 5--and that you must use \\ to represent a backslash. This adds a few more hash marks to the metacharacter stew when you create a pattern whose search text contains a backslash. Each backslash in the search text must be written as two backslashes in the source-code string: one is the backslash that pattern syntax requires, the other satisfies the escape-sequence syntax of strings.

To see how this works, consider this example. When you write a string in source code to represent the search text that represents the sentence "Are you there?", you must use the string "Are you there\\?". When the source code is compiled, the compiler creates the string "Are you there\?". (It reads \\ as an escape sequence representing \.) When the string is used to instantiate a pattern, pattern syntax regards "\?" as a single question mark, so the specified pattern is "Are you there?".
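The same two layers of escaping exist in other languages that pass search text as ordinary strings. The sketch below uses Python, whose string literals and re module happen to treat the backslash the same way for this example; it is an analogy for the Telescript behavior described above, not Telescript code.

```python
import re

# Layer 1: the source-code literal. Two backslashes become one in the string.
source_literal = "Are you there\\?"
print(source_literal)       # Are you there\?
print(len(source_literal))  # 15 characters, only one of them a backslash

# Layer 2: the pattern. Backslash plus question mark means a literal "?".
pattern = re.compile(source_literal)
print(pattern.fullmatch("Are you there?") is not None)  # True
```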

The Wild-Card Character

A wild-card character is a search element that specifies any character as its matching element. It is, in other words, a placeholder in search text that says "The character that goes here can be any possible character."

The wild-card metacharacter is the period (.). In the search text "b.b", it represents any character so that matching text could, for example, be "bib", "bob", "bZb", "b\tb", or any other three characters starting and ending with "b".

Interval-Specified Characters

If you know the encoding order of Unicode characters, there may be times when you want to specify a single character from a range of those characters. For example, you might want to specify any character that falls in the range of Cyrillic characters (thereby specifying any Cyrillic character), or any character that falls in the range of the first half of the Latin alphabet (in uppercase). To do so, you use an interval-specified character as a search element.

An interval-specified character starts with the first character in the range you want to specify, follows with the metacharacter dash (-), and ends with the last character in the range. To specify a character from "A" through "M" in the Latin alphabet, you use the interval-specified character "A-M".

To see how an interval-specified character works, consider the search text "A-Mart". It specifies matching text that could be "Bart", "Cart", "Dart", and other interesting art-ending words that start with a capital letter from A through M.

List-Specified Characters

You may, at times, want to specify a single character from a set of characters that don't all occur in a single Unicode range. If so, you can define your own set of possible characters with a list-specified character.

A list-specified character provides a set of characters that will satisfy the search element. These characters are all enclosed in square brackets ([]). The list-specified character [369] can, for example, only be satisfied by a 3, 6, or 9.

A list-specified character can use interval-specified characters within its list to specify a range or ranges of characters. The list-specified character [A-Za-z], for example, can be satisfied by any single upper- or lowercase character of the Latin alphabet.

You may also create a list-specified character whose list defines which characters will not satisfy the search element. In such a search element, the first square bracket is followed by a caret (^). The list-specified character [^0-9ABC] can, for example, be satisfied by any character that is neither a numeral from 0 through 9 nor the letter A, B, or C.
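To experiment with the single-character search elements covered so far, you can approximate them with Python's re module. This is a sketch under stated assumptions: the regular expressions are hand-written Python equivalents, not Telescript search text. In particular, Python needs square brackets around a range, and needs the DOTALL flag before a period will match every character.

```python
import re

# Telescript search text (as described above) -> approximate Python equivalent.
examples = {
    "b.b": re.compile(r"b.b", re.DOTALL),    # wild-card character
    "A-Mart": re.compile(r"[A-M]art"),       # interval-specified character
    "[369]": re.compile(r"[369]"),           # list-specified character
    "[^0-9ABC]": re.compile(r"[^0-9ABC]"),   # excluding list
}

def matches(search_text: str, candidate: str) -> bool:
    # True if the whole candidate satisfies the search text.
    return examples[search_text].fullmatch(candidate) is not None

print(matches("b.b", "b\tb"))     # True
print(matches("A-Mart", "Cart"))  # True
print(matches("A-Mart", "Part"))  # False: P falls outside A through M
print(matches("[^0-9ABC]", "B"))  # False: B is on the excluded list
```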

Attribute-Specified Characters

Certain common groups of characters are frequently specified for a single character. The search text syntax provides attribute-specified characters that define these common groups. If you use them, you won't have to do the work of defining list-specified characters to define those groups.

An attribute-specified character starts with the metacharacter percent sign (%) followed by a single character to specify the attribute of the group you want to specify. The table that follows shows the different groups defined by attribute-specified characters. (Note that "attribute" in this case refers only to the characteristics of the characters in each group, not to the attributes of an object.)

Attribute-Specified Character	Character Group
%A	Any alphabetic character as defined by Unicode.
%L	Any lowercase character as defined by Unicode.
%U	Any uppercase character as defined by Unicode.
%D	Any decimal-digit character as defined by Unicode.
%P	Any punctuation character as defined by Unicode.
%S	Any space character as defined by Unicode.
%7	Any ASCII character.
%a	Any ASCII character that is an alphabetic character--in other words, A through Z and a through z.
%l	Any ASCII character that is a lowercase character--in other words, a through z.
%u	Any ASCII character that is an uppercase character--in other words, A through Z.
%d	Any ASCII character that is a decimal-digit character--in other words, 0 through 9.
%p	Any ASCII character that is a punctuation character.
%s	Any ASCII character that is a space character.

Notice that the first six attribute-specified characters in this table are much more broadly defined than their equivalents in the last six entries of this table. %D, for example, specifies any single character that is a Unicode-defined decimal-digit character. That includes the characters 0 through 9, but it also includes characters used as decimal digits--circled, inverse-circled, parenthesized, and other types of digits defined by Unicode. %d restricts the decimal digits specified to those belonging to ASCII--0 through 9, in other words.

An example of an attribute-specified character in use is the search text "Series%d", which specifies the matching text "Series1", "Series2", "Series3", and so on up to "Series9".

Adding an Occurrence Specification

You can add an occurrence specification to any search element with the exception of an interval-specified character. The occurrence specification defines a variable number of contiguous matching elements that will satisfy the search element. Each occurrence specification is a metacharacter added to the end of the search element. Occurrence specifications are shown in this table:

Occurrence Specification	Repetitions Defined
?	The matching element can occur once or not at all.
+	The matching element can occur one or more times.
*	The matching element can occur zero or more times.
(No metacharacter)	The matching element must occur once and once only.

To see how they work, consider these examples:

The search text "R%d?D2" has an occurrence specification of ? after the attribute-specified character %d. This means that a single decimal digit can appear in the midst of the matching text or no character at all can appear there. These texts match the search text: "RD2", "R1D2", "R2D2", and so on up to "R9D2". "R10D2" doesn't match because the two decimal digits of "10" are one digit too many for the occurrence specification.

The search text "Series%d+" uses the occurrence specification + at the end of %d to specify that there can be one or more decimal digits at the end of the matching text. That means it can be satisfied by "Series1", "Series35", "Series59238582935", and an infinite number of other matching texts, as long as they start with "Series" and end with a string of one or more decimal-digit characters. Note that "Series" won't match because it has no decimal digits.

The search text "ba*d" uses the occurrence specification * at the end of an 'a' between a 'b' and a 'd'. This means that matching text can have zero, one, or more 'a's between the 'b' and 'd'. These texts match: "bd", "bad", "baad", "baaad", "baaaad", and so on to an infinite number of 'a's.

Note that if there is no occurrence specification after a search element, then only a single occurrence of the matching element satisfies the search element.
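The three examples above can be tried out with an approximation in Python. In this sketch, the ASCII digit class [0-9] stands in for %d, and the ?, +, and * suffixes are assumed to behave as the table describes; the expressions are Python regular expressions, not Telescript search text.

```python
import re

DIGIT = "[0-9]"  # stand-in for the attribute-specified character %d

patterns = {
    "R%d?D2": re.compile("R" + DIGIT + "?D2"),        # once or not at all
    "Series%d+": re.compile("Series" + DIGIT + "+"),  # one or more times
    "ba*d": re.compile("ba*d"),                       # zero or more times
}

def satisfied(search_text: str, candidate: str) -> bool:
    # True if the whole candidate is matching text for the search text.
    return patterns[search_text].fullmatch(candidate) is not None

print(satisfied("R%d?D2", "RD2"))       # True
print(satisfied("R%d?D2", "R10D2"))     # False: one digit too many
print(satisfied("Series%d+", "Series")) # False: no digits at all
print(satisfied("ba*d", "baaaad"))      # True
```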

Search Strings

When you combine search elements in a contiguous string, as you've seen, you specify a contiguous string of corresponding matching elements. Search elements combined in such a way create the next higher level element of search-text syntax: a search string. The contiguous matching elements that satisfy a search string are called a matching string.

A search string can be any combination of search elements; each search element may or may not have an occurrence specification. You've just seen examples of these search strings in our discussion of occurrence specifications. Here are two other examples:

"Ozo Benton" is a search string that is only satisfied by the matching string "Ozo Benton". It's one of the simplest kinds of search strings.

"%U[^\.\?!]*[\.\?!]" is a search string you can use to find a whole sentence as a matching string. It starts with a single uppercase character (%U). It follows with an exclusive list character ([^\.\?!]) that specifies any character but a period, question mark, or exclamation point. It uses the occurrence specifier (*) to say that there can be zero or more of these matching non-punctuation characters. The search string ends with an inclusive list character ([\.\?!]) that says the matching string ends with a punctuation character: period, question mark, or exclamation point. This search string shows the flexibility and range of search-text syntax.

Anchoring a Search String

You can add the element of position to a search string if you wish by requesting that the string be anchored. An anchored string either starts at the beginning of the subject string or ends at the end of the subject string. You can specify either or both of these anchor positions.

To specify the beginning of the subject string, you add the metacharacter caret (^) to the beginning of the search string. For example, the search string "^Dear M" is only satisfied by a matching string "Dear M" whose 'D' is the first character of the subject string.

To specify the end of the subject string, you append the metacharacter dollar sign ($) to the end of the search string. For example, the search string "Transmission finished$" is only satisfied by a matching string "Transmission finished" whose final 'd' is the last character of the subject string.

Using Alternate Search Strings in Search Text

When you assemble the final search text for a pattern, it's often just a single search string. You may, however, provide more than one search string in search text. In that case, you separate the search strings with the metacharacter |, which acts as a logical OR specification. Search strings set up this way in search text become alternate search strings. Matching text for the search text can be a matching string that satisfies any one of the alternate search strings.

As an example, consider the search text "bad|terrible|awful". It offers the alternate search strings "bad", "terrible", and "awful". It can be satisfied by any occurrence of the matching text "bad", "terrible", or "awful".
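As a rough illustration, the same alternation can be tried in Python, whose re module also uses | as a logical OR between alternatives. The helper name first_match and the sample sentences are made up for this sketch.

```python
import re
from typing import Optional

verdict = re.compile(r"bad|terrible|awful")

def first_match(text: str) -> Optional[str]:
    # Return the first occurrence of any alternative, or None if there is none.
    match = verdict.search(text)
    return match.group(0) if match else None

print(first_match("The food was awful and the service was bad."))  # awful
print(first_match("Everything was fine."))                         # None
```

Note that the first match is decided by position in the subject text, not by the order in which the alternatives are listed.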

Turning Search Text into a Search Element

At the beginning of our discussion of search text, you read that a search element typically specifies a single character. You also read that it may, on some occasions, be more complex. This is how it works:

Any permissible search text may be turned into a search element by simply enclosing it in parentheses. Once you've turned search text into a search element, you may do all the things with that element that you can with a simple one-character search element: add an occurrence specification, combine it with other elements to create a search string, anchor that search string, and so on. This allows you tremendous search possibilities--as well as tremendous complexities to master. Take a look at a relatively simple example.

"^(ABC|abc)" is a search text that encloses two alternate search strings--ABC and abc--in parentheses to turn them into one search element. That search element is a one-element search string that has a caret before it, specifying that it has to be found at the beginning of a subject string. This search text is satisfied with either "ABC" or "abc" at the beginning of a subject string.

Metacharacter Recap

Now that our discussion of search text is finished, the following table recaps the metacharacters used in pattern syntax.

Metacharacter	Use
\	The escape character used before another metacharacter to use that metacharacter as a literal character
.	The wildcard character used to specify any character as a matching element
-	The dash character used between two characters to denote an interval-specified character
[]	Beginning and ending square brackets used to enclose a set of characters that denote a list-specified character
^	The character used after the first square bracket of a list-specified character to specify that the character can be anything but the characters enclosed in brackets. Also the anchored search string character used before a search string to specify that the matching string must occur at the beginning of the subject string.
%	The attribute character used before any one of the characters "ALUDPS7aludps" to create an attribute-specified character.
?	The once-or-not-at-all occurrence specification that may follow a search element
+	The one-or-more-times occurrence specification that may follow a search element
*	The zero-or-more-times occurrence specification that may follow a search element
$	The anchored search string character used after a search string to specify that the matching string must occur at the end of the subject string.
|	The alternate character used between two search strings as a logical OR specification between those strings.
()	Beginning and ending parentheses used to enclose search text and turn that search text into a single search element

Initializing a Pattern

Figuring out the search text for a pattern is the most involved part of initializing a pattern. You need to consider what subject strings you'll apply the pattern to, what it is you want to find in each subject string, and how to express that in search-text syntax. Once you've created your search text, initializing a pattern is easy: You simply provide a string containing your search text as the sole initialization argument for the pattern. The following initialization statement shows an example:

findDate = Pattern("%d%d?/%d%d?/%d%d%d?%d?");

This statement creates a pattern whose search text looks for matching text in the form of abbreviated dates such as 1/9/87 or 05/10/2001.
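To see which texts this search text accepts, here is an approximate Python equivalent. It assumes %d corresponds to the ASCII digit class [0-9] and that ? means "once or not at all", as described earlier; the name find_date mirrors the variable above but refers to a Python regular expression, not a Telescript pattern.

```python
import re

D = "[0-9]"  # stand-in for %d
# Month: 1-2 digits, day: 1-2 digits, year: 2-4 digits.
find_date = re.compile(f"{D}{D}?/{D}{D}?/{D}{D}{D}?{D}?")

print(find_date.fullmatch("1/9/87") is not None)      # True
print(find_date.fullmatch("05/10/2001") is not None)  # True
print(find_date.fullmatch("1/9/7") is not None)       # False: year needs 2 digits
print(find_date.search("Shipped 05/10/2001 by air.").group(0))  # 05/10/2001
```

The search text checks only the shape of a date. A text such as 99/99/123 also satisfies it, so validating the month and day values is a separate step.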

Note that if you supply a string that doesn't follow the search text syntax, the initialization throws PatternInvalid. Note also that you can't supply a nil as the initialization argument; a pattern can do nothing without search text.

Finding a Pattern in a Subject String

Once you've created a pattern, you can use it to find matching text in a subject string by calling find on the pattern. The find operation accepts a subject string and two integers as arguments. The subject string is the string to be searched for text that matches the pattern. The two integers specify a substring location within the subject string where searching will take place.

If you supply a nil for the first integer, you specify the beginning of the subject string; if you supply a nil for the second integer, you specify the end of the subject string. Supplying nils for both arguments specifies the whole subject string. (Remember that trailing nil arguments may be dropped when you call an operation, so you can supply a subject string as a single argument. find then fills in nils for the substring positions, and ends up searching the entire string.)

When find executes, it starts at the beginning of the specified substring and searches for matching text that satisfies the pattern. When it finds the first possible match, it returns a list of two integers: the beginning and beyond-the-end positions of the substring where the match occurs--in other words, the location of the matching text. If the operation doesn't find a match, it returns a nil instead of a list of two integers.

Note that if the pattern contains an anchored search string, the anchor specifies that the matching text is anchored to the beginning or end of the specified substring, not to the beginning or end of the full subject string--unless the substring specified is the full subject string.
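
The position conventions of find can be modeled in a few lines of Python. This is a sketch under stated assumptions, not Telescript: None stands in for nil, the second integer is assumed to be a beyond-the-end position like the one in the returned list, and Python's ^ stands in for a beginning-of-text anchor. The function name find is reused only for readability.

```python
import re

def find(regex, subject, first=None, beyond=None):
    """Model of find: positions are 1-based; None stands in for nil."""
    start = 0 if first is None else first - 1
    stop = len(subject) if beyond is None else beyond - 1
    # Search only the requested substring, so anchors apply to it alone.
    match = re.search(regex, subject[start:stop])
    if match is None:
        return None
    # Report beginning and beyond-the-end positions in the full subject.
    return [start + match.start() + 1, start + match.end() + 1]

subject = "One and two or three and four."
print(find(" and | or ", subject))      # [4, 9]
print(find(" and | or ", subject, 9))   # [12, 16]
print(find("^t", subject, 9, 12))       # [9, 10]
print(find("^t", subject))              # None
```

The last two calls show the anchoring rule: "two" begins the substring at positions 9 through 11, so the anchored search succeeds there, but it fails against the full subject string, which begins with "O".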

Finding and Substituting in a Subject String

If you want to find matching text in a subject string and then substitute new text for it, you can call substitute on the pattern. The substitute operation accepts an unprotected subject string, a replacement string, and a non-negative integer as arguments.

The subject string is the string to be searched and modified, so it must be unprotected. You can't supply a string literal (which is protected); you must supply a string created by a constructor expression. The replacement string is the text to be substituted for matching text and may be protected or unprotected because it remains unmodified. The integer specifies the maximum number of substitutions that may take place. If this argument is nil, it specifies that the operation should make substitutions for all occurrences of matching text that it finds.

When the operation executes, it finds the first occurrence of matching text in the subject string. It replaces that matching text with the text of the replacement string. The operation then looks for another match, and replaces that match if found. It does this for up to as many occurrences of matching text as are specified in the maximum-substitutions argument. When the operation is finished, it returns an integer that reports how many substitutions it made (which is zero if it made no substitutions).

The replacement string has its own simple syntax. It uses two metacharacters: the ampersand (&) and the backslash (\). The ampersand stands for the matching text: when matching text is removed from the subject string, it's inserted in the replacement string in place of the &. The resulting replacement string is then inserted in the subject string where the matching text was removed.

For example, consider a pattern with the search text "Rory". If you call substitute on this pattern and supply the replacement string "the culprit &", the operation replaces every occurrence of "Rory" with "the culprit Rory".

If you want to use a literal ampersand within the replacement string, use a backslash before the ampersand: \&. The backslash instructs the pattern to consider the following ampersand as a literal character, not a metacharacter.
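
The replacement-string rules can be sketched in Python as follows. This is a model, not Telescript: Python strings can't be modified in place, so the hypothetical substitute function returns the new text together with the count instead of changing its argument. It also assumes that a backslash makes any following character literal, and None stands in for nil.

```python
import re

def expand(replacement, matched):
    """Build the text to insert for one match."""
    out = []
    i = 0
    while i < len(replacement):
        ch = replacement[i]
        if ch == "\\" and i + 1 < len(replacement):
            out.append(replacement[i + 1])  # escaped character is literal
            i += 2
        elif ch == "&":
            out.append(matched)             # & stands for the matching text
            i += 1
        else:
            out.append(ch)
            i += 1
    return "".join(out)

def substitute(regex, subject, replacement, max_count=None):
    """Return (new_subject, substitutions_made); None means no limit."""
    if max_count == 0:
        return subject, 0
    limit = 0 if max_count is None else max_count  # re.subn treats 0 as "all"
    return re.subn(regex, lambda m: expand(replacement, m.group(0)),
                   subject, count=limit)

print(substitute("Rory", "Rory did it.", "the culprit &"))
# ('the culprit Rory did it.', 1)
print(substitute("Rory", "Ask Rory", "R\\&D"))
# ('Ask R&D', 1)
```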

Splitting a Subject String into Substrings

The operation split allows you to search a subject string for a pattern and--wherever you find an occurrence of matching text--split the string at that point. It returns a list of the strings created by splitting the subject string; it leaves the subject string unchanged.

split offers two methods of splitting: it can remove the matching text when it splits the subject string, or it can keep the matching text as separate split strings.

For example, consider a pattern that searches for the search text "AB" in the subject string "6235AB234652AB346743AB490209451AB". If split removes matching text when it operates, the operation returns a list of the split strings "6235", "234652", "346743", "490209451", and "". If split keeps matching text, it returns "6235", "AB", "234652", "AB", "346743", "AB", "490209451", "AB", and "". (The last empty string is returned in both cases because split always returns the string before the first match, the strings between matches, and the string following the last match, even if any of those strings are empty.)

The split operation takes three arguments: a subject string through which it searches; a boolean that, if true, asks the operation to keep the matching text when splitting; and an integer that specifies the maximum number of splits to perform.

If the boolean is false when split executes, the operation removes matching text. It searches through the subject string for occurrences of matching text. When it finds a match, it makes a string of the text that stretches from the end of the last match (or the beginning of the subject string if there is no previous match) up to the beginning of the current match. It stores the string as an item added to the end of a list of strings that is newly created for this purpose. It goes on to find the next match, where it performs the same actions. This continues until the operation finds no more matches or until it reaches the maximum specified number of matches. It then creates a string that stretches from the end of the last match to the end of the subject string and adds the new string as an item at the end of the list of strings. The operation returns the list of strings--the split strings derived from the subject string without the matching text. Remember that only the strings in the returned list have been split; the subject string is unchanged.

If the boolean is true when split executes, the operation keeps matching text as it searches and splits. When it finds a match, it adds the string of the previous unmatching text to its list of strings as described in the last paragraph with this important difference: it then adds a string of the current matching text to the list of strings. When the operation is finished, it returns the list of strings. The list consists of the unmatching text strings alternating with the matching text strings in the order in which they occurred in the subject string.
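
The two splitting procedures just described can be modeled in Python like this. It is a sketch rather than Telescript: the function name and parameter names are invented, None stands in for a nil maximum, and Python's re module stands in for the pattern.

```python
import re

def split(regex, subject, keep=False, max_splits=None):
    """Model of split: returns a new list and leaves subject unchanged."""
    pieces = []
    last_end = 0
    for n, match in enumerate(re.finditer(regex, subject)):
        if max_splits is not None and n >= max_splits:
            break
        # Text from the end of the previous match up to this match.
        pieces.append(subject[last_end:match.start()])
        if keep:
            pieces.append(match.group(0))  # the matching text itself
        last_end = match.end()
    # Text from the end of the last match to the end of the subject.
    pieces.append(subject[last_end:])
    return pieces

subject = "One and two or three and four."
print(split(" and | or ", subject))
# ['One', 'two', 'three', 'four.']
print(split(" and | or ", subject, keep=True))
# ['One', ' and ', 'two', ' or ', 'three', ' and ', 'four.']
```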

Example Code

This program creates a pattern and a subject string, then uses the pattern on the subject string in three different ways:

1	do{
2		/* Declare variables */
3		subject: String;
4		pattern: Pattern;
5		subLocation: List[Integer, Equal]|Nil;
6		position1, timesReplaced: Integer|Nil;

7		/* Initialize subject string and pattern. */
8		subject = String("One and two or three and four.");
9		pattern = Pattern(" and | or ");

10		/* Print out the locations of matching text. */
11		loop {
12				subLocation = pattern.find(subject, position1, nil);
13				if subLocation == nil {break;};
14				"Found matching text at:".dump();
15				subLocation.dump();
16				position1 = subLocation[2];
17		};

18		/* Split subject string into substrings w/ and w/out matching text. */
19		"This is the subject text split without keeping matching text:".dump();
20		pattern.split(subject, false).dump();
21		"This is the subject text split while keeping matching text:".dump();
22		pattern.split(subject, true).dump();
23		"This is the original subject string:".dump();
24		subject.dump();

25		/* Substitute text "plus" for "and" and "or" */
26		timesReplaced = pattern.substitute(subject, " plus ");
27		"This is the original subject string after substitution:".dump();
28		subject.dump();
29		String("There were ", timesReplaced.asString(),
				" substitutions.").dump();
30	}

When it runs, you see these results:

String: <Found matching text at:>
List: <2 elements>
        Integer: <4>
        Integer: <9>
String: <Found matching text at:>
List: <2 elements>
        Integer: <12>
        Integer: <16>
String: <Found matching text at:>
List: <2 elements>
        Integer: <21>
        Integer: <26>
String: <This is the subject text split without keeping matching text:>
List: <4 elements>
        String: <One>
        String: <two>
        String: <three>
        String: <four.>
String: <This is the subject text split while keeping matching text:>
List: <7 elements>
        String: <One>
        String: < and >
        String: <two>
        String: < or >
        String: <three>
        String: < and >
        String: <four.>
String: <This is the original subject string:>
String: <One and two or three and four.>
String: <This is the original subject string after substitution:>
String: <One plus two plus three plus four.>
String: <There were 3 substitutions.>

The first part of this program declares and initializes the variables used throughout the program. (Notice that subLocation and position1 are declared with a |Nil option so they can refer to a nil: position1 starts out as nil, and subLocation receives a nil when find returns one.) The subject string, subject, contains text with "and"s and "or"s sprinkled throughout. The pattern, pattern, has alternate search strings to find those "and"s and "or"s along with the spaces around them.

Line 11 starts a loop that--in line 12--calls find on the pattern to look for matching text in the subject string. Notice that position1 is nil the first time through because it was declared but never assigned a value, so the search range in the subject string runs from the beginning to the end of the string (nil, nil). Line 13 checks whether the find operation found anything. If it did, the condition in the if statement is false and the loop continues. Line 14 dumps a label line, and line 15 prints out the list of integers that gives the location of the matching text found. Line 16 sets position1 to the second of the found substring positions (the beyond-the-end position), which means that the next time through the loop the find operation searches the subject string starting at the text immediately following the found matching text. When the find operation finds no more matching text, it returns a nil and the loop breaks at line 13.

Line 20 splits the subject string at each occurrence of matching text without keeping the matching text, then dumps the list of substrings returned. Line 22 does the same, but keeps matching text this time. And line 24 dumps the original subject string so you can see that it remains unchanged. (Notice that split requires three arguments but receives only two arguments. That's because Telescript syntax allows you to leave out trailing nil arguments in a request expression--so the effect here is of supplying a nil for the third argument.)

Line 26 substitutes the string " plus " for every occurrence of matching text in the subject string. (The third argument, an implicit nil, specifies every occurrence.) It returns an integer reporting how many substitutions it made. Line 28 dumps the subject string so you can see it after substitutions. Line 29 then reports how many substitutions were made. Notice that it concatenates three strings to create a final string for dumping. And notice that the middle string is an integer variable that's converted to a string for the purpose of concatenation.

* * *

Now that you've seen how Telescript handles text, you can move on to the next chapter, which covers a timely subject.


TS Ref - 26 JUN 1996


Attribute-Specified Character	Character Group
%A	Any alphabetic character as defined by Unicode.
%L	Any lowercase character as defined by Unicode.
%U	Any uppercase character as defined by Unicode.
%D	Any decimal-digit character as defined by Unicode.
%P	Any punctuation character as defined by Unicode.
%S	Any space character as defined by Unicode.
%7	Any ASCII character.
%a	Any ASCII character that is an alphabetic character--in other words, A through Z and a through z.
%l	Any ASCII character that is a lowercase character--in other words, a through z.
%u	Any ASCII character that is an uppercase character--in other words, A through Z.
%d	Any ASCII character that is a decimal-digit character--in other words, 0 through 9.
%p	Any ASCII character that is a punctuation character.
%s	Any ASCII character that is a space character.