{**************************************************************************************************} { } { Project JEDI Code Library (JCL) } { } { The contents of this file are subject to the Mozilla Public License Version 1.1 (the "License"); } { you may not use this file except in compliance with the License. You may obtain a copy of the } { License at http://www.mozilla.org/MPL/ } { } { Software distributed under the License is distributed on an "AS IS" basis, WITHOUT WARRANTY OF } { ANY KIND, either express or implied. See the License for the specific language governing rights } { and limitations under the License. } { } { The Original Code is JclUnicode.pas. } { } { The Initial Developer of the Original Code is Mike Lischke (public att lischke-online dott de). } { Portions created by Mike Lischke are Copyright (C) 1999-2000 Mike Lischke. All Rights Reserved. } { } { Contributor(s): } { Marcel van Brakel } { Andreas Hausladen (ahuser) } { Mike Lischke } { Flier Lu (flier) } { Robert Marquardt (marquardt) } { Robert Rossmair (rrossmair) } { Olivier Sannier (obones) } { Matthias Thoma (mthoma) } { Petr Vones (pvones) } { Peter Schraut (http://www.console-dev.de) } { } {**************************************************************************************************} { } { Various Unicode related routines } { } {**************************************************************************************************} // Last modified: $Date: 2005/10/26 09:15:13 $ // For history see end of file unit JclUnicode; {$I jcl.inc} // Copyright (c) 1999-2000 Mike Lischke (public att lischke-online dott de) // // 10-JUL-2005: (changes by Peter Schraut) // - added CodeBlockName, returns the blockname as string // - added CodeBlockRange, returns the range of the specified codeblock // - updated TUnicodeBlock to reflect changes in unicode 4.1 // - updated CodeBlockFromChar to reflect changes in unicode 4.1 // - Notes: // Here are a few suggestions to reflect latest namechanges in unicode 4.1, // but they were not done due to compatibility with old code: // ubGreek should be renamed to ubGreekandCoptic // ubCombiningMarksforSymbols should be renamed to ubCombiningDiacriticalMarksforSymbols // ubPrivateUse should be renamed to ubPrivateUseArea // // // 19-SEP-2003: (changes by Andreas Hausladen) // - added OWN_WIDESTRING_MEMMGR for faster memory managment in TWideStringList // under Windows // - fixed: TWideStringList.Destroy does not set OnChange and OnChanging to nil before calling Clear // // // 29-MAR-2002: MT // - WideNormalize now returns strings with normalization mode nfNone unchanged. // - Bug fix in WideCompose: Raised exception when Result of WideComposeHangul was an // empty string. (#0000044) // - Bug fix in WideAdjustLineBreaks // - Added Asserts were needed. // - TWideStrings.IndexOfName now takes care of NormalizeForm as well. // - TWideStrings.IndexOf now takes care of NormalizeForm as well. // - TWideString.List Find now uses the same NormalizationForm for the search string as it uses // within the list itself. // // 29-NOV-2001: // - bug fix // 06-JUN-2001: // - small changes // 28-APR-2001: // - bug fixes // 05-APR-2001: // - bug fixes // 23-MAR-2001: // - WideSameText // - small changes // 10-FEB-2001: // - bug fix in StringToWideStringEx and WideStringToStringEx // 05-FEB-2001: // - TWideStrings.GetSeparatedText changed (no separator anymore after the last line) // 29-JAN-2001: // - PrepareUnicodeData // - LoadInProgress critical section is now created at init time to avoid critical thread races // - bug fixes // 26-JAN-2001: // - ExpandANSIString // - TWideStrings.SaveUnicode is by default True now // 20..21-JAN-2001: // - StrUpperW, StrLowerW and StrTitleW removed because they potentially would need // a reallocation to work correctly (use the WideString versions instead) // - further improvements related to internal data // - introduced TUnicodeBlock // - CodeBlockFromChar improved // 07-JAN-2001: // optimized access to character properties, combining class etc. // 06-JAN-2001: // TWideStrings and TWideStringList improved // APR-DEC 2000: versions 2.1 - 2.6 // - preparation for public rlease // - additional conversion routines // - JCL compliance // - character properties unified // - character properties data and lookup improvements // - reworked Unicode data resource file // - improved simple string comparation routines (StrCompW, StrLCompW etc., include surrogate fix) // - special case folding data for language neutral case insensitive comparations included // - optimized decomposition // - composition and normalization support // - normalization conformance tests applied // - bug fixes // FEB-MAR 2000: version 2.0 // - Unicode regular expressions (URE) search class (TURESearch) // - generic search engine base class for both the Boyer-Moore and the RE search class // - whole word only search in UTBM, bug fixes in UTBM // - string decompositon (including hangul) // OCT/99 - JAN/2000: version 1.0 // - basic Unicode implementation, more than 100 WideString/UCS2 and UCS4 core functions // - TWideStrings and TWideStringList classes // - Unicode Tuned Boyer-Moore search class (TUTBMSearch) // - low and high level Unicode/Wide* functions // - low level Unicode UCS4 data import and functions // - helper functions // // Version 2.9 // This unit contains routines and classes to manage and work with Unicode/WideString strings. // You need Delphi 4 or higher to compile this code. // // Publicly available low level functions are all preceded by "Unicode..." (e.g. // in UnicodeToUpper) while the high level functions use the Str... or Wide... // naming scheme (e.g. StrLICompW and WideUpperCase). // // The normalization implementation in this unit has successfully and completely passed the // official normative conformance testing as of Annex 9 in Technical Report #15 // (Unicode Standard Annex #15, http://www.unicode.org/unicode/reports/tr15, from 2000-08-31). // // Open issues: // - Yet to do things in the URE class are: // - check all character classes if they match correctly // - optimize rebuild of DFA (build only when pattern changes) // - set flag parameter of ExecuteURE // - add \d any decimal digit // \D any character that is not a decimal digit // \s any whitespace character // \S any character that is not a whitespace character // \w any "word" character // \W any "non-word" character // - The wide string classes still compare text with functions provided by the // particular system. This works usually fine under WinNT/W2K (although also // there are limitations like maximum text lengths). Under Win9x conversions // from and to MBCS are necessary which are bound to a particular locale and // so very limited in general use. These comparisons should be changed so that // the code in this unit is used. interface uses {$IFDEF MSWINDOWS} Windows, {$ENDIF MSWINDOWS} Classes, JclBase; {$IFNDEF FPC} {$IFDEF MSWINDOWS} {$DEFINE OWN_WIDESTRING_MEMMGR} {$ENDIF MSWINDOWS} {$ENDIF ~FPC} {$IFDEF SUPPORTS_WIDESTRING} const // definitions of often used characters: // Note: Use them only for tests of a certain character not to determine character // classes (like white spaces) as in Unicode are often many code points defined // being in a certain class. Hence your best option is to use the various // UnicodeIs* functions. WideNull = WideChar(#0); WideTabulator = WideChar(#9); WideSpace = WideChar(#32); // logical line breaks WideLF = WideChar(#10); WideLineFeed = WideChar(#10); WideVerticalTab = WideChar(#11); WideFormFeed = WideChar(#12); WideCR = WideChar(#13); WideCarriageReturn = WideChar(#13); WideCRLF: WideString = #13#10; WideLineSeparator = WideChar($2028); WideParagraphSeparator = WideChar($2029); // byte order marks for Unicode files // Unicode text files (in UTF-16 format) should contain $FFFE as first character to // identify such a file clearly. Depending on the system where the file was created // on this appears either in big endian or little endian style. BOM_LSB_FIRST = WideChar($FEFF); BOM_MSB_FIRST = WideChar($FFFE); type TSaveFormat = ( sfUTF16LSB, sfUTF16MSB, sfUTF8, sfAnsi ); const sfUnicodeLSB = sfUTF16LSB; sfUnicodeMSB = sfUTF16MSB; BOM_UTF16_LSB: array [0..1] of Byte = ($FF,$FE); BOM_UTF16_MSB: array [0..1] of Byte = ($FE,$FF); BOM_UTF8: array [0..2] of Byte = ($EF,$BB,$BF); BOM_UTF32_LSB: array [0..3] of Byte = ($FF,$FE,$00,$00); BOM_UTF32_MSB: array [0..3] of Byte = ($00,$00,$FE,$FF); // BOM_UTF7_1: array [0..3] of Byte = ($2B,$2F,$76,$38); // BOM_UTF7_2: array [0..3] of Byte = ($2B,$2F,$76,$39); // BOM_UTF7_3: array [0..3] of Byte = ($2B,$2F,$76,$2B); // BOM_UTF7_4: array [0..3] of Byte = ($2B,$2F,$76,$2F); // BOM_UTF7_5: array [0..3] of Byte = ($2B,$2F,$76,$38,$2D); type // Unicode transformation formats (UTF) data types PUTF7 = ^UTF7; UTF7 = Char; PUTF8 = ^UTF8; UTF8 = Char; PUTF16 = ^UTF16; UTF16 = WideChar; PUTF32 = ^UTF32; UTF32 = Cardinal; // UTF conversion schemes (UCS) data types PUCS4 = ^UCS4; UCS4 = Cardinal; PUCS2 = PWideChar; UCS2 = WideChar; TUCS2Array = array of UCS2; TUCS4Array = array of UCS4; // various predefined or otherwise useful character property categories TCharacterCategory = ( // normative categories ccLetterUppercase, ccLetterLowercase, ccLetterTitlecase, ccMarkNonSpacing, ccMarkSpacingCombining, ccMarkEnclosing, ccNumberDecimalDigit, ccNumberLetter, ccNumberOther, ccSeparatorSpace, ccSeparatorLine, ccSeparatorParagraph, ccOtherControl, ccOtherFormat, ccOtherSurrogate, ccOtherPrivate, ccOtherUnassigned, // informative categories ccLetterModifier, ccLetterOther, ccPunctuationConnector, ccPunctuationDash, ccPunctuationOpen, ccPunctuationClose, ccPunctuationInitialQuote, ccPunctuationFinalQuote, ccPunctuationOther, ccSymbolMath, ccSymbolCurrency, ccSymbolModifier, ccSymbolOther, // bidirectional categories ccLeftToRight, ccLeftToRightEmbedding, ccLeftToRightOverride, ccRightToLeft, ccRightToLeftArabic, ccRightToLeftEmbedding, ccRightToLeftoverride, ccPopDirectionalFormat, ccEuropeanNumber, ccEuropeanNumberSeparator, ccEuropeanNumberTerminator, ccArabicNumber, ccCommonNumberSeparator, ccBoundaryNeutral, ccSegmentSeparator, // this includes tab and vertical tab ccWhiteSpace, ccOtherNeutrals, // self defined categories, they do not appear in the Unicode data file ccComposed, // can be decomposed ccNonBreaking, ccSymmetric, // has left and right forms ccHexDigit, ccQuotationMark, ccMirroring, ccSpaceOther, ccAssigned // means there is a definition in the Unicode standard ); TCharacterCategories = set of TCharacterCategory; // four forms of normalization are defined: TNormalizationForm = ( nfNone, // do not normalize nfC, // canonical decomposition followed by canonical composition (this is most often used) nfD, // canonical decomposition nfKC, // compatibility decomposition followed by a canonical composition nfKD // compatibility decomposition ); // used to hold information about the start and end // position of a unicodeblock. TUnicodeBlockRange = record RangeStart, RangeEnd: Cardinal; end; // An Unicode block usually corresponds to a particular language script but // can also represent special characters, musical symbols and the like. TUnicodeBlock = ( ubUndefined, ubBasicLatin, ubLatin1Supplement, ubLatinExtendedA, ubLatinExtendedB, ubIPAExtensions, ubSpacingModifierLetters, ubCombiningDiacriticalMarks, //ubGreekandCoptic, ubGreek, ubCyrillic, ubCyrillicSupplement, ubArmenian, ubHebrew, ubArabic, ubSyriac, ubArabicSupplement, ubThaana, ubDevanagari, ubBengali, ubGurmukhi, ubGujarati, ubOriya, ubTamil, ubTelugu, ubKannada, ubMalayalam, ubSinhala, ubThai, ubLao, ubTibetan, ubMyanmar, ubGeorgian, ubHangulJamo, ubEthiopic, ubEthiopicSupplement, ubCherokee, ubUnifiedCanadianAboriginalSyllabics, ubOgham, ubRunic, ubTagalog, ubHanunoo, ubBuhid, ubTagbanwa, ubKhmer, ubMongolian, ubLimbu, ubTaiLe, ubNewTaiLue, ubKhmerSymbols, ubBuginese, ubPhoneticExtensions, ubPhoneticExtensionsSupplement, ubCombiningDiacriticalMarksSupplement, ubLatinExtendedAdditional, ubGreekExtended, ubGeneralPunctuation, ubSuperscriptsandSubscripts, ubCurrencySymbols, //ubCombiningDiacriticalMarksforSymbols, ubCombiningMarksforSymbols, ubLetterlikeSymbols, ubNumberForms, ubArrows, ubMathematicalOperators, ubMiscellaneousTechnical, ubControlPictures, ubOpticalCharacterRecognition, ubEnclosedAlphanumerics, ubBoxDrawing, ubBlockElements, ubGeometricShapes, ubMiscellaneousSymbols, ubDingbats, ubMiscellaneousMathematicalSymbolsA, ubSupplementalArrowsA, ubBraillePatterns, ubSupplementalArrowsB, ubMiscellaneousMathematicalSymbolsB, ubSupplementalMathematicalOperators, ubMiscellaneousSymbolsandArrows, ubGlagolitic, ubCoptic, ubGeorgianSupplement, ubTifinagh, ubEthiopicExtended, ubSupplementalPunctuation, ubCJKRadicalsSupplement, ubKangxiRadicals, ubIdeographicDescriptionCharacters, ubCJKSymbolsandPunctuation, ubHiragana, ubKatakana, ubBopomofo, ubHangulCompatibilityJamo, ubKanbun, ubBopomofoExtended, ubCJKStrokes, ubKatakanaPhoneticExtensions, ubEnclosedCJKLettersandMonths, ubCJKCompatibility, ubCJKUnifiedIdeographsExtensionA, ubYijingHexagramSymbols, ubCJKUnifiedIdeographs, ubYiSyllables, ubYiRadicals, ubModifierToneLetters, ubSylotiNagri, ubHangulSyllables, ubHighSurrogates, ubHighPrivateUseSurrogates, ubLowSurrogates, //ubPrivateUseArea, ubPrivateUse, ubCJKCompatibilityIdeographs, ubAlphabeticPresentationForms, ubArabicPresentationFormsA, ubVariationSelectors, ubVerticalForms, ubCombiningHalfMarks, ubCJKCompatibilityForms, ubSmallFormVariants, ubArabicPresentationFormsB, ubHalfwidthandFullwidthForms, ubSpecials, ubLinearBSyllabary, ubLinearBIdeograms, ubAegeanNumbers, ubAncientGreekNumbers, ubOldItalic, ubGothic, ubUgaritic, ubOldPersian, ubDeseret, ubShavian, ubOsmanya, ubCypriotSyllabary, ubKharoshthi, ubByzantineMusicalSymbols, ubMusicalSymbols, ubAncientGreekMusicalNotation, ubTaiXuanJingSymbols, ubMathematicalAlphanumericSymbols, ubCJKUnifiedIdeographsExtensionB, ubCJKCompatibilityIdeographsSupplement, ubTags, ubVariationSelectorsSupplement, ubSupplementaryPrivateUseAreaA, ubSupplementaryPrivateUseAreaB ); TWideStrings = class; TSearchFlag = ( sfCaseSensitive, // match letter case sfIgnoreNonSpacing, // ignore non-spacing characters in search sfSpaceCompress, // handle several consecutive white spaces as one white space // (this applies to the pattern as well as the search text) sfWholeWordOnly // match only text at end/start and/or surrounded by white spaces ); TSearchFlags = set of TSearchFlag; // a generic search class defininition used for tuned Boyer-Moore and Unicode // regular expression searches TSearchEngine = class(TObject) private FResults: TList; // 2 entries for each result (start and stop position) FOwner: TWideStrings; // at the moment unused, perhaps later to access strings faster protected function GetCount: Integer; virtual; public constructor Create(AOwner: TWideStrings); virtual; destructor Destroy; override; procedure AddResult(Start, Stop: Cardinal); virtual; procedure Clear; virtual; procedure ClearResults; virtual; procedure DeleteResult(Index: Cardinal); virtual; procedure FindPrepare(const Pattern: WideString; Options: TSearchFlags); overload; virtual; abstract; procedure FindPrepare(Pattern: PWideChar; PatternLength: Cardinal; Options: TSearchFlags); overload; virtual; abstract; function FindFirst(const Text: WideString; var Start, Stop: Cardinal): Boolean; overload; virtual; abstract; function FindFirst(Text: PWideChar; TextLen: Cardinal; var Start, Stop: Cardinal): Boolean; overload; virtual; abstract; function FindAll(const Text: WideString): Boolean; overload; virtual; abstract; function FindAll(Text: PWideChar; TextLen: Cardinal): Boolean; overload; virtual; abstract; procedure GetResult(Index: Cardinal; var Start, Stop: Integer); virtual; property Count: Integer read GetCount; end; // The Unicode Tuned Boyer-Moore (UTBM) search implementation is an extended // translation created from a free package written by Mark Leisher (mleisher att crl dott nmsu dott edu). // // The code handles high and low surrogates as well as case (in)dependency, // can ignore non-spacing characters and allows optionally to return whole // words only. // single pattern character PUTBMChar = ^TUTBMChar; TUTBMChar = record LoCase, UpCase, TitleCase: UCS4; end; PUTBMSkip = ^TUTBMSkip; TUTBMSkip = record BMChar: PUTBMChar; SkipValues: Integer; end; TUTBMSearch = class(TSearchEngine) private FFlags: TSearchFlags; FPattern: PUTBMChar; FPatternUsed: Cardinal; FPatternSize: Cardinal; FPatternLength: Cardinal; FSkipValues: PUTBMSkip; FSkipsUsed: Integer; FMD4: Cardinal; protected procedure ClearPattern; procedure Compile(Pattern: PUCS2; PatternLength: Integer; Flags: TSearchFlags); function Find(Text: PUCS2; TextLen: Cardinal; var MatchStart, MatchEnd: Cardinal): Boolean; function GetSkipValue(TextStart, TextEnd: PUCS2): Cardinal; function Match(Text, Start, Stop: PUCS2; var MatchStart, MatchEnd: Cardinal): Boolean; public procedure Clear; override; procedure FindPrepare(const Pattern: WideString; Options: TSearchFlags); overload; override; procedure FindPrepare(Pattern: PWideChar; PatternLength: Cardinal; Options: TSearchFlags); overload; override; function FindFirst(const Text: WideString; var Start, Stop: Cardinal): Boolean; overload; override; function FindFirst(Text: PWideChar; TextLen: Cardinal; var Start, Stop: Cardinal): Boolean; overload; override; function FindAll(const Text: WideString): Boolean; overload; override; function FindAll(Text: PWideChar; TextLen: Cardinal): Boolean; overload; override; end; // Regular expression search engine for text in UCS2 form taking surrogates // into account. This implementation is an improved translation from the URE // package written by Mark Leisher (mleisher att crl dott nmsu dott edu) who used a variation // of the RE->DFA algorithm done by Mark Hopkins (markh att csd4 dott csd dott uwm dott edu). // Assumptions: // o Regular expression and text already normalized. // o Conversion to lower case assumes a 1-1 mapping. // // Definitions: // Separator - any one of U+2028, U+2029, NL, CR. // // Operators: // . - match any character // * - match zero or more of the last subexpression // + - match one or more of the last subexpression // ? - match zero or one of the last subexpression // () - subexpression grouping // {m, n} - match at least m occurences and up to n occurences // Note: both values can be 0 or ommitted which denotes then a unlimiting bound // {,} and {0,} and {0, 0} correspond to * // {, 1} and {0, 1} correspond to ? // {1,} and {1, 0} correspond to + // {m} - match exactly m occurences // // Notes: // o The "." operator normally does not match separators, but a flag is // available that will allow this operator to match a separator. // // Literals and Constants: // c - literal UCS2 character // \x.... - hexadecimal number of up to 4 digits // \X.... - hexadecimal number of up to 4 digits // \u.... - hexadecimal number of up to 4 digits // \U.... - hexadecimal number of up to 4 digits // // Character classes: // [...] - Character class // [^...] - Negated character class // \pN1,N2,...,Nn - Character properties class // \PN1,N2,...,Nn - Negated character properties class // // POSIX character classes recognized: // :alnum: // :alpha: // :cntrl: // :digit: // :graph: // :lower: // :print: // :punct: // :space: // :upper: // :xdigit: // // Notes: // o Character property classes are \p or \P followed by a comma separated // list of integers between 0 and the maximum entry index in TCharacterCategory. // These integers directly correspond to the TCharacterCategory enumeration entries. // Note: upper, lower and title case classes need to have case sensitive search // be enabled to match correctly! // // o Character classes can contain literals, constants and character // property classes. Example: // // [abc\U10A\p0,13,4] // structure used to handle a compacted range of characters PUcRange = ^TUcRange; TUcRange = record MinCode, MaxCode: UCS4; end; TUcCClass = record Ranges: array of TUcRange; RangesUsed: Integer; end; // either a single character or a list of character classes TUcSymbol = record Chr: UCS4; CCL: TUcCClass; end; // this is a general element structure used for expressions and stack elements TUcElement = record OnStack: Boolean; AType, LHS, RHS: Cardinal; end; // this is a structure used to track a list or a stack of states PUcStateList = ^TUcStateList; TUcStateList = record List: array of Cardinal; ListUsed: Integer; end; // structure to track the list of unique states for a symbol during reduction PUcSymbolTableEntry = ^TUcSymbolTableEntry; TUcSymbolTableEntry = record ID, AType: Cardinal; Mods, Categories: TCharacterCategories; Symbol: TUcSymbol; States: TUcStateList; end; // structure to hold a single State PUcState = ^TUcState; TUcState = record ID: Cardinal; Accepting: Boolean; StateList: TUcStateList; Transitions: array of TUcElement; TransitionsUsed: Integer; end; // structure used for keeping lists of states TUcStateTable = record States: array of TUcState; StatesUsed: Integer; end; // structure to track pairs of DFA states when equivalent states are merged TUcEquivalent = record Left, Right: Cardinal; end; TUcExpressionList = record Expressions: array of TUcElement; ExpressionsUsed: Integer; end; TUcSymbolTable = record Symbols: array of TUcSymbolTableEntry; SymbolsUsed: Integer; end; TUcEquivalentList = record Equivalents: array of TUcEquivalent; EquivalentsUsed: Integer; end; // structure used for constructing the NFA and reducing to a minimal DFA PUREBuffer = ^TUREBuffer; TUREBuffer = record Reducing: Boolean; Error: Integer; Flags: Cardinal; Stack: TUcStateList; SymbolTable: TUcSymbolTable; // table of unique symbols encountered ExpressionList: TUcExpressionList; // tracks the unique expressions generated // for the NFA and when the NFA is reduced States: TUcStateTable; // the reduced table of unique groups of NFA states EquivalentList: TUcEquivalentList; // tracks states when equivalent states are merged end; TUcTransition = record Symbol, NextState: Cardinal; end; PDFAState = ^TDFAState; TDFAState = record Accepting: Boolean; NumberTransitions: Integer; StartTransition: Integer; end; TDFAStates = record States: array of TDFAState; StatesUsed: Integer; end; TUcTransitions = record Transitions: array of TUcTransition; TransitionsUsed: Integer; end; TDFA = record Flags: Cardinal; SymbolTable: TUcSymbolTable; StateList: TDFAStates; TransitionList: TUcTransitions; end; TURESearch = class(TSearchEngine) private FUREBuffer: TUREBuffer; FDFA: TDFA; protected procedure AddEquivalentPair(L, R: Cardinal); procedure AddRange(var CCL: TUcCClass; Range: TUcRange); function AddState(NewStates: array of Cardinal): Cardinal; procedure AddSymbolState(Symbol, State: Cardinal); function BuildCharacterClass(CP: PUCS2; Limit: Cardinal; Symbol: PUcSymbolTableEntry): Cardinal; procedure ClearUREBuffer; function CompileSymbol(S: PUCS2; Limit: Cardinal; Symbol: PUcSymbolTableEntry): Cardinal; procedure CompileURE(RE: PWideChar; RELength: Cardinal; Casefold: Boolean); procedure CollectPendingOperations(var State: Cardinal); function ConvertRegExpToNFA(RE: PWideChar; RELength: Cardinal): Cardinal; function ExecuteURE(Flags: Cardinal; Text: PUCS2; TextLen: Cardinal; var MatchStart, MatchEnd: Cardinal): Boolean; procedure ClearDFA; procedure HexDigitSetup(Symbol: PUcSymbolTableEntry); function MakeExpression(AType, LHS, RHS: Cardinal): Cardinal; function MakeHexNumber(NP: PUCS2; Limit: Cardinal; var Number: Cardinal): Cardinal; function MakeSymbol(S: PUCS2; Limit: Cardinal; var Consumed: Cardinal): Cardinal; procedure MergeEquivalents; function ParsePropertyList(Properties: PUCS2; Limit: Cardinal; var Categories: TCharacterCategories): Cardinal; function Peek: Cardinal; function Pop: Cardinal; function PosixCCL(CP: PUCS2; Limit: Cardinal; Symbol: PUcSymbolTableEntry): Cardinal; function ProbeLowSurrogate(LeftState: PUCS2; Limit: Cardinal; var Code: UCS4): Cardinal; procedure Push(V: Cardinal); procedure Reduce(Start: Cardinal); procedure SpaceSetup(Symbol: PUcSymbolTableEntry; Categories: TCharacterCategories); function SymbolsAreDifferent(A, B: PUcSymbolTableEntry): Boolean; public procedure Clear; override; procedure FindPrepare(const Pattern: WideString; Options: TSearchFlags); overload; override; procedure FindPrepare(Pattern: PWideChar; PatternLength: Cardinal; Options: TSearchFlags); overload; override; function FindFirst(const Text: WideString; var Start, Stop: Cardinal): Boolean; overload; override; function FindFirst(Text: PWideChar; TextLen: Cardinal; var Start, Stop: Cardinal): Boolean; overload; override; function FindAll(const Text: WideString): Boolean; overload; override; function FindAll(Text: PWideChar; TextLen: Cardinal): Boolean; overload; override; end; // Event used to give the application a chance to switch the way of how to save // the text in TWideStrings if the text contains characters not only from the // ANSI block but the save type is ANSI. On triggering the event the application // can change the property SaveUnicode as needed. This property is again checked // after the callback returns. TConfirmConversionEvent = procedure (Sender: TWideStrings; var Allowed: Boolean) of object; TWideStrings = class(TPersistent) private FUpdateCount: Integer; FLanguage: LCID; // language can usually left alone, the system's default is used FSaved: Boolean; // set in SaveToStream, True in case saving was successfull otherwise False FNormalizationForm: TNormalizationForm; // determines in which form Unicode strings should be stored FOnConfirmConversion: TConfirmConversionEvent; FSaveFormat: TSaveFormat; // overrides the FSaveUnicode flag, initialized when a file is loaded, // expect losses if it is set to sfAnsi before saving function GetCommaText: WideString; function GetName(Index: Integer): WideString; function GetValue(const Name: WideString): WideString; procedure ReadData(Reader: TReader); procedure SetCommaText(const Value: WideString); procedure SetNormalizationForm(const Value: TNormalizationForm); procedure SetValue(const Name, Value: WideString); procedure WriteData(Writer: TWriter); function GetSaveUnicode: Boolean; procedure SetSaveUnicode(const Value: Boolean); protected procedure DefineProperties(Filer: TFiler); override; procedure DoConfirmConversion(var Allowed: Boolean); virtual; procedure Error(const Msg: string; Data: Integer); function Get(Index: Integer): WideString; virtual; abstract; function GetCapacity: Integer; virtual; function GetCount: Integer; virtual; abstract; function GetObject(Index: Integer): TObject; virtual; function GetTextStr: WideString; virtual; procedure Put(Index: Integer; const S: WideString); virtual; abstract; procedure PutObject(Index: Integer; AObject: TObject); virtual; abstract; procedure SetCapacity(NewCapacity: Integer); virtual; procedure SetUpdateState(Updating: Boolean); virtual; procedure SetLanguage(Value: LCID); virtual; public constructor Create; function Add(const S: WideString): Integer; virtual; function AddObject(const S: WideString; AObject: TObject): Integer; virtual; procedure Append(const S: WideString); procedure AddStrings(Strings: TStrings); overload; virtual; procedure AddStrings(Strings: TWideStrings); overload; virtual; procedure Assign(Source: TPersistent); override; procedure AssignTo(Dest: TPersistent); override; procedure BeginUpdate; procedure Clear; virtual; abstract; procedure Delete(Index: Integer); virtual; abstract; procedure EndUpdate; function Equals(Strings: TWideStrings): Boolean; procedure Exchange(Index1, Index2: Integer); virtual; function GetSeparatedText(Separators: WideString): WideString; virtual; function GetText: PWideChar; virtual; function IndexOf(const S: WideString): Integer; virtual; function IndexOfName(const Name: WideString): Integer; function IndexOfObject(AObject: TObject): Integer; procedure Insert(Index: Integer; const S: WideString); virtual; abstract; procedure InsertObject(Index: Integer; const S: WideString; AObject: TObject); procedure LoadFromFile(const FileName: string); virtual; procedure LoadFromStream(Stream: TStream); virtual; procedure Move(CurIndex, NewIndex: Integer); virtual; procedure SaveToFile(const FileName: string); virtual; procedure SaveToStream(Stream: TStream; WithBOM: Boolean = True); virtual; procedure SetText(const Value: WideString); virtual; property Capacity: Integer read GetCapacity write SetCapacity; property CommaText: WideString read GetCommaText write SetCommaText; property Count: Integer read GetCount; property Language: LCID read FLanguage write SetLanguage; property Names[Index: Integer]: WideString read GetName; property NormalizationForm: TNormalizationForm read FNormalizationForm write SetNormalizationForm default nfC; property Objects[Index: Integer]: TObject read GetObject write PutObject; property Values[const Name: WideString]: WideString read GetValue write SetValue; property Saved: Boolean read FSaved; property SaveUnicode: Boolean read GetSaveUnicode write SetSaveUnicode default True; property SaveFormat: TSaveFormat read FSaveFormat write FSaveFormat default sfUnicodeLSB; property Strings[Index: Integer]: WideString read Get write Put; default; property Text: WideString read GetTextStr write SetText; property OnConfirmConversion: TConfirmConversionEvent read FOnConfirmConversion write FOnConfirmConversion; end; //----- TWideStringList class TDynWideCharArray = array of WideChar; TWideStringItem = record {$IFDEF OWN_WIDESTRING_MEMMGR} FString: PWideChar; // "array of WideChar"; {$ELSE} FString: WideString; {$ENDIF OWN_WIDESTRING_MEMMGR} FObject: TObject; end; TWideStringItemList = array of TWideStringItem; TWideStringList = class(TWideStrings) private FList: TWideStringItemList; FCount: Integer; FSorted: Boolean; FDuplicates: TDuplicates; FOnChange: TNotifyEvent; FOnChanging: TNotifyEvent; procedure ExchangeItems(Index1, Index2: Integer); procedure Grow; procedure QuickSort(L, R: Integer); procedure InsertItem(Index: Integer; const S: WideString); procedure SetSorted(Value: Boolean); {$IFDEF OWN_WIDESTRING_MEMMGR} procedure SetListString(Index: Integer; const S: WideString); {$ENDIF OWN_WIDESTRING_MEMMGR} protected procedure Changed; virtual; procedure Changing; virtual; function Get(Index: Integer): WideString; override; function GetCapacity: Integer; override; function GetCount: Integer; override; function GetObject(Index: Integer): TObject; override; procedure Put(Index: Integer; const S: WideString); override; procedure PutObject(Index: Integer; AObject: TObject); override; procedure SetCapacity(NewCapacity: Integer); override; procedure SetUpdateState(Updating: Boolean); override; procedure SetLanguage(Value: LCID); override; public destructor Destroy; override; function Add(const S: WideString): Integer; override; procedure Clear; override; procedure Delete(Index: Integer); override; procedure Exchange(Index1, Index2: Integer); override; function Find(const S: WideString; var Index: Integer): Boolean; virtual; function IndexOf(const S: WideString): Integer; override; procedure Insert(Index: Integer; const S: WideString); override; procedure Sort; virtual; property Duplicates: TDuplicates read FDuplicates write FDuplicates; property Sorted: Boolean read FSorted write SetSorted; property OnChange: TNotifyEvent read FOnChange write FOnChange; property OnChanging: TNotifyEvent read FOnChanging write FOnChanging; end; // result type for number retrieval functions TUcNumber = record Numerator, Denominator: Integer; end; TFontCharSet = 0..255; const ReplacementCharacter: UCS4 = $0000FFFD; MaximumUCS2: UCS4 = $0000FFFF; MaximumUTF16: UCS4 = $0010FFFF; MaximumUCS4: UCS4 = $7FFFFFFF; SurrogateHighStart: UCS4 = $D800; SurrogateHighEnd: UCS4 = $DBFF; SurrogateLowStart: UCS4 = $DC00; SurrogateLowEnd: UCS4 = $DFFF; // functions involving null-terminated strings // NOTE: PWideChars as well as WideStrings are NOT managed by reference counting under Win32. // In Kylix this is different. WideStrings are reference counted there, just like ANSI strings. function StrLenW(Str: PWideChar): Cardinal; function StrEndW(Str: PWideChar): PWideChar; function StrMoveW(Dest, Source: PWideChar; Count: Cardinal): PWideChar; function StrCopyW(Dest, Source: PWideChar): PWideChar; function StrECopyW(Dest, Source: PWideChar): PWideChar; function StrLCopyW(Dest, Source: PWideChar; MaxLen: Cardinal): PWideChar; function StrPCopyWW(Dest: PWideChar; const Source: WideString): PWideChar; overload; function StrPCopyW(Dest: PWideChar; const Source: string): PWideChar; function StrPLCopyWW(Dest: PWideChar; const Source: WideString; MaxLen: Cardinal): PWideChar; function StrPLCopyW(Dest: PWideChar; const Source: string; MaxLen: Cardinal): PWideChar; function StrCatW(Dest: PWideChar; const Source: PWideChar): PWideChar; function StrLCatW(Dest, Source: PWideChar; MaxLen: Cardinal): PWideChar; function StrCompW(const Str1, Str2: PWideChar): Integer; function StrICompW(const Str1, Str2: PWideChar): Integer; function StrLCompW(const Str1, Str2: PWideChar; MaxLen: Cardinal): Integer; function StrLICompW(const Str1, Str2: PWideChar; MaxLen: Cardinal): Integer; function StrNScanW(const Str1, Str2: PWideChar): Integer; function StrRNScanW(const Str1, Str2: PWideChar): Integer; function StrScanW(Str: PWideChar; Chr: WideChar): PWideChar; overload; function StrScanW(Str: PWideChar; Chr: WideChar; StrLen: Cardinal): PWideChar; overload; function StrRScanW(Str: PWideChar; Chr: WideChar): PWideChar; function StrPosW(Str, SubStr: PWideChar): PWideChar; function StrAllocW(WideSize: Cardinal): PWideChar; function StrBufSizeW(const Str: PWideChar): Cardinal; function StrNewW(const Str: PWideChar): PWideChar; overload; function StrNewW(const Str: WideString): PWideChar; overload; procedure StrDisposeW(Str: PWideChar); procedure StrDisposeAndNilW(var Str: PWideChar); procedure StrSwapByteOrder(Str: PWideChar); // functions involving Delphi wide strings function WideAdjustLineBreaks(const S: WideString): WideString; function WideCharPos(const S: WideString; const Ch: WideChar; const Index: Integer): Integer; //az function WideCompose(const S: WideString): WideString; function WideDecompose(const S: WideString; Compatible: Boolean): WideString; function WideExtractQuotedStr(var Src: PWideChar; Quote: WideChar): WideString; function WideQuotedStr(const S: WideString; Quote: WideChar): WideString; function WideStringOfChar(C: WideChar; Count: Cardinal): WideString; function WideCaseFolding(C: WideChar): WideString; overload; function WideCaseFolding(const S: WideString): WideString; overload; function WideLowerCase(C: WideChar): WideString; overload; function WideLowerCase(const S: WideString): WideString; overload; function WideNormalize(const S: WideString; Form: TNormalizationForm): WideString; function WideSameText(const Str1, Str2: WideString): Boolean; function WideTitleCase(C: WideChar): WideString; overload; function WideTitleCase(const S: WideString): WideString; overload; function WideTrim(const S: WideString): WideString; function WideTrimLeft(const S: WideString): WideString; function WideTrimRight(const S: WideString): WideString; function WideUpperCase(C: WideChar): WideString; overload; function WideUpperCase(const S: WideString): WideString; overload; // Low level character routines function UnicodeNumberLookup(Code: UCS4; var Number: TUcNumber): Boolean; function UnicodeComposePair(First, Second: UCS4; var Composite: UCS4): Boolean; function UnicodeCaseFold(Code: UCS4): TUCS4Array; function UnicodeToUpper(Code: UCS4): TUCS4Array; function UnicodeToLower(Code: UCS4): TUCS4Array; function UnicodeToTitle(Code: UCS4): TUCS4Array; // Character test routines function UnicodeIsAlpha(C: UCS4): Boolean; function UnicodeIsDigit(C: UCS4): Boolean; function UnicodeIsAlphaNum(C: UCS4): Boolean; function UnicodeIsCased(C: UCS4): Boolean; function UnicodeIsControl(C: UCS4): Boolean; function UnicodeIsSpace(C: UCS4): Boolean; function UnicodeIsWhiteSpace(C: UCS4): Boolean; function UnicodeIsBlank(C: UCS4): Boolean; function UnicodeIsPunctuation(C: UCS4): Boolean; function UnicodeIsGraph(C: UCS4): Boolean; function UnicodeIsPrintable(C: UCS4): Boolean; function UnicodeIsUpper(C: UCS4): Boolean; function UnicodeIsLower(C: UCS4): Boolean; function UnicodeIsTitle(C: UCS4): Boolean; function UnicodeIsHexDigit(C: UCS4): Boolean; function UnicodeIsIsoControl(C: UCS4): Boolean; function UnicodeIsFormatControl(C: UCS4): Boolean; function UnicodeIsSymbol(C: UCS4): Boolean; function UnicodeIsNumber(C: UCS4): Boolean; function UnicodeIsNonSpacing(C: UCS4): Boolean; function UnicodeIsOpenPunctuation(C: UCS4): Boolean; function UnicodeIsClosePunctuation(C: UCS4): Boolean; function UnicodeIsInitialPunctuation(C: UCS4): Boolean; function UnicodeIsFinalPunctuation(C: UCS4): Boolean; function UnicodeIsComposed(C: UCS4): Boolean; function UnicodeIsQuotationMark(C: UCS4): Boolean; function UnicodeIsSymmetric(C: UCS4): Boolean; function UnicodeIsMirroring(C: UCS4): Boolean; function UnicodeIsNonBreaking(C: UCS4): Boolean; // Directionality functions function UnicodeIsRightToLeft(C: UCS4): Boolean; function UnicodeIsLeftToRight(C: UCS4): Boolean; function UnicodeIsStrong(C: UCS4): Boolean; function UnicodeIsWeak(C: UCS4): Boolean; function UnicodeIsNeutral(C: UCS4): Boolean; function UnicodeIsSeparator(C: UCS4): Boolean; // Other character test functions function UnicodeIsMark(C: UCS4): Boolean; function UnicodeIsModifier(C: UCS4): Boolean; function UnicodeIsLetterNumber(C: UCS4): Boolean; function UnicodeIsConnectionPunctuation(C: UCS4): Boolean; function UnicodeIsDash(C: UCS4): Boolean; function UnicodeIsMath(C: UCS4): Boolean; function UnicodeIsCurrency(C: UCS4): Boolean; function UnicodeIsModifierSymbol(C: UCS4): Boolean; function UnicodeIsNonSpacingMark(C: UCS4): Boolean; function UnicodeIsSpacingMark(C: UCS4): Boolean; function UnicodeIsEnclosing(C: UCS4): Boolean; function UnicodeIsPrivate(C: UCS4): Boolean; function UnicodeIsSurrogate(C: UCS4): Boolean; function UnicodeIsLineSeparator(C: UCS4): Boolean; function UnicodeIsParagraphSeparator(C: UCS4): Boolean; function UnicodeIsIdentifierStart(C: UCS4): Boolean; function UnicodeIsIdentifierPart(C: UCS4): Boolean; function UnicodeIsDefined(C: UCS4): Boolean; function UnicodeIsUndefined(C: UCS4): Boolean; function UnicodeIsHan(C: UCS4): Boolean; function UnicodeIsHangul(C: UCS4): Boolean; // Utility functions function CharSetFromLocale(Language: LCID): TFontCharSet; function GetCharSetFromLocale(Language: LCID; out FontCharSet: TFontCharSet): Boolean; function CodePageFromLocale(Language: LCID): Integer; function CodeBlockName(const CB: TUnicodeBlock): string; function CodeBlockRange(const CB: TUnicodeBlock): TUnicodeBlockRange; function CodeBlockFromChar(const C: UCS4): TUnicodeBlock; function KeyboardCodePage: Word; function KeyUnicode(C: Char): WideChar; function StringToWideStringEx(const S: string; CodePage: Word): WideString; function TranslateString(const S: string; CP1, CP2: Word): string; function WideStringToStringEx(const WS: WideString; CodePage: Word): string; // WideString conversion routines procedure ExpandANSIString(const Source: PChar; Target: PWideChar; Count: Cardinal); function WideStringToUTF8(S: WideString): AnsiString; function UTF8ToWideString(S: AnsiString): WideString; type TCompareFunc = function (const W1, W2: WideString; Locale: LCID): Integer; var WideCompareText: TCompareFunc; {$ENDIF SUPPORTS_WIDESTRING} type EJclUnicodeError = class(EJclError); implementation {$IFDEF SUPPORTS_WIDESTRING} // Unicode data for case mapping, decomposition, numbers etc. This data is // loaded on demand which means only those parts will be put in memory which are // needed by one of the lookup functions. // Note: There is a little tool called UDExtract which creates a resouce script from // the Unicode database file which can be compiled to the needed res file. // This tool, including its source code, can be downloaded from www.lischke-online.de/Unicode.html. {$R JclUnicode.res} uses {$IFDEF HAS_UNIT_RTLCONSTS} RtlConsts, {$ELSE} {$IFNDEF FPC} Consts, {$ENDIF ~FPC} {$ENDIF HAS_UNIT_RTLCONSTS} SysUtils, JclResources, JclSynch; const {$IFDEF FPC} // declarations from unit [Rtl]Consts SDuplicateString = 'String list does not allow duplicates'; SListIndexError = 'List index out of bounds (%d)'; SSortedListError = 'Operation not allowed on sorted string list'; {$ENDIF FPC} // some predefined sets to shorten parameter lists below and ease repeative usage ClassLetter = [ccLetterUppercase, ccLetterLowercase, ccLetterTitlecase, ccLetterModifier, ccLetterOther]; ClassSpace = [ccSeparatorSpace, ccSpaceOther]; ClassPunctuation = [ccPunctuationConnector, ccPunctuationDash, ccPunctuationOpen, ccPunctuationClose, ccPunctuationOther, ccPunctuationInitialQuote, ccPunctuationFinalQuote]; ClassMark = [ccMarkNonSpacing, ccMarkSpacingCombining, ccMarkEnclosing]; ClassNumber = [ccNumberDecimalDigit, ccNumberLetter, ccNumberOther]; ClassSymbol = [ccSymbolMath, ccSymbolCurrency, ccSymbolModifier, ccSymbolOther]; ClassEuropeanNumber = [ccEuropeanNumber, ccEuropeanNumberSeparator, ccEuropeanNumberTerminator]; // used to negate a set of categories ClassAll = [Low(TCharacterCategory)..High(TCharacterCategory)]; var // As the global data can be accessed by several threads it should be guarded // while the data is loaded. LoadInProgress: TJclCriticalSection; //----------------- support for character categories ----------------------------------------------- // Character category data is quite a large block since every defined character in Unicode is assigned at least // one category. Because of this we cannot use a sparse matrix to provide quick access as implemented for // e.g. composition data. // The approach used here is based on the fact that an application seldomly uses all characters defined in Unicode // simultanously. In fact the opposite is true. Most application will use either Western Europe or Arabic or // Far East character data, but very rarely all together. Based on this fact is the implementation of virtual // memory using the systems paging file (aka file mapping) to load only into virtual memory what is used currently. // The implementation is not yet finished and needs a lot of improvements yet. type // start and stop of a range of code points TRange = record Start, Stop: Cardinal; end; TRangeArray = array of TRange; TCategoriesArray = array of TCharacterCategories; var // character categories, stored in the system's swap file and mapped on demand CategoriesLoaded: Boolean; Categories: array [Byte] of TCategoriesArray; procedure LoadCharacterCategories; // Loads the character categories data (as saved by the Unicode database extractor, see also // the comments about JclUnicode.res above). var Size: Integer; Stream: TResourceStream; Category: TCharacterCategory; Buffer: TRangeArray; First, Second: Byte; J, K: Integer; begin // Data already loaded? if not CategoriesLoaded then begin // make sure no other code is currently modifying the global data area LoadInProgress.Enter; try CategoriesLoaded := True; Stream := TResourceStream.Create(HInstance, 'CATEGORIES', 'UNICODEDATA'); try while Stream.Position < Stream.Size do begin // a) read which category is current in the stream Stream.ReadBuffer(Category, 1); // b) read the size of the ranges and the ranges themself Stream.ReadBuffer(Size, 4); if Size > 0 then begin SetLength(Buffer, Size); Stream.ReadBuffer(Buffer[0], Size * SizeOf(TRange)); // c) go through every range and add the current category to each code point for J := 0 to Size - 1 do for K := Buffer[J].Start to Buffer[J].Stop do begin if K > $FFFF then Break; First := (K shr 8) and $FF; Second := K and $FF; // add second step array if not yet done if Categories[First] = nil then SetLength(Categories[First], 256); Include(Categories[First, Second], Category); end; end; end; finally Stream.Free; end; finally LoadInProgress.Leave; end; end; end; function CategoryLookup(Code: Cardinal; Cats: TCharacterCategories): Boolean; overload; // determines whether the Code is in the given category var First, Second: Byte; begin // load property data if not already done if not CategoriesLoaded then LoadCharacterCategories; First := (Code shr 8) and $FF; Second := Code and $FF; if Categories[First] <> nil then Result := Categories[First, Second] * Cats <> [] else Result := False; end; //----------------- support for case mapping ------------------------------------------------------- type TCase = array [0..3] of TUCS4Array; // mapping for case fold, lower, title and upper in this order TCaseArray = array of TCase; var // An array for all case mappings (including 1 to many casing if saved by the extraction program). // The organization is a sparse, two stage matrix. // SingletonMapping is to quickly return a single default mapping. CaseDataLoaded: Boolean; CaseMapping: array [Byte] of TCaseArray; SingletonMapping: TUCS4Array; procedure LoadCaseMappingData; var Stream: TResourceStream; I, Code, Size: Cardinal; First, Second: Byte; begin if not CaseDataLoaded then begin // make sure no other code is currently modifying the global data area LoadInProgress.Enter; try SetLength(SingletonMapping, 1); CaseDataLoaded := True; Stream := TResourceStream.Create(HInstance, 'CASE', 'UNICODEDATA'); try // the first entry in the stream is the number of entries in the case mapping table Stream.ReadBuffer(Size, 4); for I := 0 to Size - 1 do begin // a) read actual code point Stream.ReadBuffer(Code, 4); Assert(Code < $10000, LoadResString(@RsCasedUnicodeChar)); // if there is no high byte entry in the first stage table then create one First := (Code shr 8) and $FF; Second := Code and $FF; if CaseMapping[First] = nil then SetLength(CaseMapping[First], 256); // b) read fold case array Stream.ReadBuffer(Size, 4); if Size > 0 then begin SetLength(CaseMapping[First, Second, 0], Size); Stream.ReadBuffer(CaseMapping[First, Second, 0, 0], Size * SizeOf(UCS4)); end; // c) read lower case array Stream.ReadBuffer(Size, 4); if Size > 0 then begin SetLength(CaseMapping[First, Second, 1], Size); Stream.ReadBuffer(CaseMapping[First, Second, 1, 0], Size * SizeOf(UCS4)); end; // d) read title case array Stream.ReadBuffer(Size, 4); if Size > 0 then begin SetLength(CaseMapping[First, Second, 2], Size); Stream.ReadBuffer(CaseMapping[First, Second, 2, 0], Size * SizeOf(UCS4)); end; // e) read upper case array Stream.ReadBuffer(Size, 4); if Size > 0 then begin SetLength(CaseMapping[First, Second, 3], Size); Stream.ReadBuffer(CaseMapping[First, Second, 3, 0], Size * SizeOf(UCS4)); end; end; finally Stream.Free; end; finally LoadInProgress.Leave; end; end; end; function CaseLookup(Code: Cardinal; CaseType: Cardinal): TUCS4Array; // Performs a lookup of the given code and returns its case mapping if found. // CaseType must be 0 for case folding, 1 for lower case, 2 for title case and 3 for upper case, respectively. // If Code could not be found (or there is no case mapping) then the result is a mapping of length 1 with the // code itself. Otherwise an array of code points is returned which represent the mapping. var First, Second: Byte; begin // load case mapping data if not already done if not CaseDataLoaded then LoadCaseMappingData; First := (Code shr 8) and $FF; Second := Code and $FF; // Check first stage table whether there is a mapping for a particular block and // (if so) then whether there is a mapping or not. if (CaseMapping[First] = nil) or (CaseMapping[First, Second, CaseType] = nil) then begin SingletonMapping[0] := Code; Result := SingletonMapping; end else Result := CaseMapping[First, Second, CaseType]; end; function UnicodeCaseFold(Code: UCS4): TUCS4Array; // This function returnes an array of special case fold mappings if there is one defined for the given // code, otherwise the lower case will be returned. This all applies only to cased code points. // Uncased code points are returned unchanged. begin Result := CaseLookup(Code, 0); end; function UnicodeToUpper(Code: UCS4): TUCS4Array; begin Result := CaseLookup(Code, 3); end; function UnicodeToLower(Code: UCS4): TUCS4Array; begin Result := CaseLookup(Code, 1); end; function UnicodeToTitle(Code: UCS4): TUCS4Array; begin Result := CaseLookup(Code, 2); end; //----------------- support for decomposition ------------------------------------------------------ const // constants for hangul composition and hangul-to-jamo decomposition SBase = $AC00; // hangul syllables start code point LBase = $1100; // leading syllable VBase = $1161; TBase = $11A7; // trailing syllable LCount = 19; VCount = 21; TCount = 28; NCount = VCount * TCount; // 588 SCount = LCount * NCount; // 11172 type TDecompositions = array of TUCS4Array; TDecompositionsArray = array [Byte] of TDecompositions; var // list of decompositions, organized (again) as two stage matrix // Note: there are two tables, one for canonical decompositions and the other one // for compatibility decompositions. DecompositionsLoaded: Boolean; CanonicalDecompositions, CompatibleDecompositions: TDecompositionsArray; procedure LoadDecompositionData; var Stream: TResourceStream; I, Code, Size: Cardinal; First, Second: Byte; begin if not DecompositionsLoaded then begin // make sure no other code is currently modifying the global data area LoadInProgress.Enter; try DecompositionsLoaded := True; Stream := TResourceStream.Create(HInstance, 'DECOMPOSITION', 'UNICODEDATA'); try // determine how many decomposition entries we have Stream.ReadBuffer(Size, 4); for I := 0 to Size - 1 do begin Stream.ReadBuffer(Code, 4); Assert((Code and not $40000000) < $10000, LoadResString(@RsDecomposedUnicodeChar)); // if there is no high byte entry in the first stage table then create one First := (Code shr 8) and $FF; Second := Code and $FF; // insert into the correct table depending on bit 30 // (if set then it is a compatibility decomposition) if Code and $40000000 <> 0 then begin if CompatibleDecompositions[First] = nil then SetLength(CompatibleDecompositions[First], 256); Stream.ReadBuffer(Size, 4); if Size > 0 then begin SetLength(CompatibleDecompositions[First, Second], Size); Stream.ReadBuffer(CompatibleDecompositions[First, Second, 0], Size * SizeOf(UCS4)); end; end else begin if CanonicalDecompositions[First] = nil then SetLength(CanonicalDecompositions[First], 256); Stream.ReadBuffer(Size, 4); if Size > 0 then begin SetLength(CanonicalDecompositions[First, Second], Size); Stream.ReadBuffer(CanonicalDecompositions[First, Second, 0], Size * SizeOf(UCS4)); end; end; end; finally Stream.Free; end; finally LoadInProgress.Leave; end; end; end; function UnicodeDecomposeHangul(Code: UCS4): TUCS4Array; // algorithmically decomposes hangul character var Rest: Integer; begin Dec(Code, SBase); Rest := Code mod TCount; if Rest = 0 then SetLength(Result, 2) else SetLength(Result, 3); Result[0] := LBase + (Code div NCount); Result[1] := VBase + ((Code mod NCount) div TCount); if Rest <> 0 then Result[2] := TBase + Rest; end; function UnicodeDecompose(Code: UCS4; Compatible: Boolean): TUCS4Array; var First, Second: Byte; begin // load decomposition data if not already done if not DecompositionsLoaded then LoadDecompositionData; Result := nil; // if the code is hangul then decomposition is algorithmically if UnicodeIsHangul(Code) then Result := UnicodeDecomposeHangul(Code) else begin First := (Code shr 8) and $FF; Second := Code and $FF; if Compatible then begin // Check first stage table whether there is a particular block and // (if so) then whether there is a decomposition or not. if (CompatibleDecompositions[First] = nil) or (CompatibleDecompositions[First, Second] = nil) then begin // if there is no compatibility decompositions try canonical if (CanonicalDecompositions[First] = nil) or (CanonicalDecompositions[First, Second] = nil) then Result := nil else Result := CanonicalDecompositions[First, Second]; end else Result := CompatibleDecompositions[First, Second]; end else begin if (CanonicalDecompositions[First] = nil) or (CanonicalDecompositions[First, Second] = nil) then Result := nil else Result := CanonicalDecompositions[First, Second]; end; end; end; //----------------- support for combining classes -------------------------------------------------- type TClassArray = array of Byte; var // canonical combining classes, again as two stage matrix CCCsLoaded: Boolean; CCCs: array [Byte] of TClassArray; procedure LoadCombiningClassData; var Stream: TResourceStream; I, J, K, Size: Cardinal; Buffer: TRangeArray; First, Second: Byte; begin // make sure no other code is currently modifying the global data area LoadInProgress.Enter; try if not CCCsLoaded then begin CCCsLoaded := True; Stream := TResourceStream.Create(HInstance, 'COMBINING', 'UNICODEDATA'); try while Stream.Position < Stream.Size do begin // a) determine which class is stored here Stream.ReadBuffer(I, 4); // b) determine how many ranges are assigned to this class Stream.ReadBuffer(Size, 4); // c) read start and stop code of each range if Size > 0 then begin SetLength(Buffer, Size); Stream.ReadBuffer(Buffer[0], Size * SizeOf(TRange)); // d) put this class in every of the code points just loaded for J := 0 to Size - 1 do for K := Buffer[J].Start to Buffer[J].Stop do begin Assert(K < $10000, LoadResString(@RsCombiningClassUnicodeChar)); First := (K shr 8) and $FF; Second := K and $FF; // add second step array if not yet done if CCCs[First] = nil then SetLength(CCCs[First], 256); CCCs[First, Second] := I; end; end; end; finally Stream.Free; end; end; finally LoadInProgress.Leave; end; end; function CanonicalCombiningClass(Code: Cardinal): Cardinal; var First, Second: Byte; begin // load combining class data if not already done if not CCCsLoaded then LoadCombiningClassData; First := (Code shr 8) and $FF; Second := Code and $FF; if CCCs[First] <> nil then Result := CCCs[First, Second] else Result := 0; end; //----------------- support for numeric values ----------------------------------------------------- type // structures for handling numbers TCodeIndex = record Code, Index: Cardinal; end; var // array to hold the number equivalents for specific codes NumberCodes: array of TCodeIndex; // array of numbers used in NumberCodes Numbers: array of TUcNumber; procedure LoadNumberData; var Stream: TResourceStream; Size: Cardinal; begin // make sure no other code is currently modifying the global data area LoadInProgress.Enter; try if NumberCodes = nil then begin Stream := TResourceStream.Create(HInstance, 'NUMBERS', 'UNICODEDATA'); // Numbers are special (compared to other Unicode data) as they utilize two // arrays, one containing all used numbers (in nominator-denominator format) and // another one which maps a code point to one of the numbers in the first array. // a) determine size of numbers array Stream.ReadBuffer(Size, 4); SetLength(Numbers, Size); // b) read numbers data Stream.ReadBuffer(Numbers[0], Size * SizeOf(TUcNumber)); // c) determine size of index array Stream.ReadBuffer(Size, 4); SetLength(NumberCodes, Size); // d) read index data Stream.ReadBuffer(NumberCodes[0], Size * SizeOf(TCodeIndex)); Stream.Free; end; finally LoadInProgress.Leave; end; end; function UnicodeNumberLookup(Code: UCS4; var Number: TUcNumber): Boolean; // Searches for the given code and returns its number equivalent (if there is one). // Typical cases are: '1/6' (U+2159), '3/8' (U+215C), 'XII' (U+216B) etc. // Result is set to True if the code could be found. var L, R, M: Integer; begin // load number data if not already done if NumberCodes = nil then LoadNumberData; Result := False; L := 0; R := High(NumberCodes); while L <= R do begin M := (L + R) shr 1; if Code > NumberCodes[M].Code then L := M + 1 else begin if Code < NumberCodes[M].Code then R := M - 1 else begin Number := Numbers[NumberCodes[M].Index]; Result := True; Break; end; end; end; end; //----------------- support for composition -------------------------------------------------------- type // maps between a pair of code points to a composite code point // Note: the source pair is packed into one 4 byte value to speed up search. TCompositionPair = record Code: Cardinal; Composition: UCS4; end; var // list of composition mappings Compositions: array of TCompositionPair; procedure LoadCompositionData; var Stream: TResourceStream; Size: Cardinal; begin // make sure no other code is currently modifying the global data area LoadInProgress.Enter; try if Compositions = nil then begin Stream := TResourceStream.Create(HInstance, 'COMPOSITION', 'UNICODEDATA'); // a) determine size of compositions array Stream.ReadBuffer(Size, 4); SetLength(Compositions, Size); // b) read data Stream.ReadBuffer(Compositions[0], Size * SizeOf(TCompositionPair)); Stream.Free; end; finally LoadInProgress.Leave; end; end; function UnicodeComposePair(First, Second: UCS4; var Composite: UCS4): Boolean; // Maps the sequence of First and Second to a composite. // Result is True if there was a mapping otherwise it is False. var L, R, M, C: Integer; Pair: Integer; begin if Compositions = nil then LoadCompositionData; Result := False; L := 0; R := High(Compositions); Pair := Integer((First shl 16) or Word(Second)); while L <= R do begin M := (L + R) shr 1; C := Integer(Compositions[M].Code) - Pair; if C < 0 then L := M + 1 else begin R := M - 1; if C = 0 then begin Result := True; L := M; end; end; end; if Result then Composite := Compositions[L].Composition; end; //=== { TSearchEngine } ====================================================== constructor TSearchEngine.Create(AOwner: TWideStrings); begin FOwner := AOwner; FResults := TList.Create; end; destructor TSearchEngine.Destroy; begin Clear; FResults.Free; inherited Destroy; end; procedure TSearchEngine.AddResult(Start, Stop: Cardinal); begin FResults.Add(Pointer(Start)); FResults.Add(Pointer(Stop)); end; procedure TSearchEngine.Clear; begin ClearResults; end; procedure TSearchEngine.ClearResults; begin FResults.Clear; end; procedure TSearchEngine.DeleteResult(Index: Cardinal); // explicitly deletes a search result begin with FResults do begin // start index Delete(2 * Index); // stop index Delete(2 * Index); end; end; function TSearchEngine.GetCount: Integer; // returns the number of matches found begin Result := FResults.Count div 2; end; procedure TSearchEngine.GetResult(Index: Cardinal; var Start, Stop: Integer); // returns the start position of a match (end position can be determined by // adding the length of the pattern to the start position) begin Start := Cardinal(FResults[2 * Index]); Stop := Cardinal(FResults[2 * Index + 1]); end; //----------------- TUTBSearch --------------------------------------------------------------------- procedure TUTBMSearch.ClearPattern; begin FreeMem(FPattern); FPattern := nil; FFlags := []; FPatternUsed := 0; FPatternSize := 0; FPatternLength := 0; FreeMem(FSkipValues); FSkipValues := nil; FSkipsUsed := 0; FMD4 := 0; end; function TUTBMSearch.GetSkipValue(TextStart, TextEnd: PUCS2): Cardinal; // looks up the SkipValues value for a character var I: Integer; C1, C2: UCS4; Sp: PUTBMSkip; begin Result := 0; if Cardinal(TextStart) < Cardinal(TextEnd) then begin C1 := UCS4(TextStart^); if (TextStart + 1) < TextEnd then C2 := UCS4((TextStart + 1)^) else C2 := $FFFFFFFF; if (SurrogateHighStart <= C1) and (C1 <= SurrogateHighEnd) and (SurrogateLowStart <= C2) and (C2 <= $DDDD) then C1 := $10000 + (((C1 and $03FF) shl 10) or (C2 and $03FF)); Sp := FSkipValues; for I := 0 to FSkipsUsed - 1 do begin if not (Boolean(C1 xor Sp.BMChar.UpCase) and Boolean(C1 xor Sp.BMChar.LoCase) and Boolean(C1 xor Sp.BMChar.TitleCase)) then begin if (TextEnd - TextStart) < Sp.SkipValues then Result := TextEnd - TextStart else Result := Sp.SkipValues; Exit; end; Inc(Sp); end; Result := FPatternLength; end; end; function TUTBMSearch.Match(Text, Start, Stop: PUCS2; var MatchStart, MatchEnd: Cardinal): Boolean; // Checks once whether the text at position Start (which points to the end of the // current text part to be matched) matches. // Note: If whole words only are allowed then the left and right border tests are // done here too. The keypoint for the right border is that the next character // after the search string is either the text end or a space character. // For the left side this is similar, but there is nothing like a string // start marker (like the string end marker #0). // // It seems not obvious, but we still can use the passed Text pointer to do // the left check. Although this pointer might not point to the real string // start (e.g. in TUTBMSearch.FindAll Text is incremented as needed) it is // still a valid check mark. The reason is that Text either points to the // real string start or a previous match (happend already, keep in mind the // search options do not change in the FindAll loop) and the character just // before Text is a space character. // This fact implies, though, that strings passed to Find (or FindFirst, // FindAll in TUTBMSearch) always really start at the given address. Although // this might not be the case in some circumstances (e.g. if you pass only // the selection from an editor) it is still assumed that a pattern matching // from the first position on (from the search string start) also matches // when whole words only are allowed. var CheckSpace: Boolean; C1, C2: UCS4; Count: Integer; Cp: PUTBMChar; begin // be pessimistic Result := False; // set the potential match endpoint first MatchEnd := (Start - Text) + 1; C1 := UCS4(Start^); if (Start + 1) < Stop then C2 := UCS4((Start + 1)^) else C2 := $FFFFFFFF; if (SurrogateHighStart <= C1) and (C1 <= SurrogateHighEnd) and (SurrogateLowStart <= C2) and (C2 <= SurrogateLowEnd) then begin C1 := $10000 + (((C1 and $03FF) shl 10) or (C2 and $03FF)); // Adjust the match end point to occur after the UTF-16 character. Inc(MatchEnd); end; // check special cases if FPatternUsed = 1 then begin MatchStart := Start - Text; Result := True; Exit; end; // Early out if entire words need to be matched and the next character // in the search string is neither the string end nor a space character. if (sfWholeWordOnly in FFlags) and not ((Start + 1)^ = WideNull) and not UnicodeIsWhiteSpace(UCS4((Start + 1)^)) then Exit; // compare backward Cp := FPattern; Inc(Cp, FPatternUsed - 1); Count := FPatternLength; while (Start >= Text) and (Count > 0) do begin // ignore non-spacing characters if indicated if sfIgnoreNonSpacing in FFlags then begin while (Start > Text) and UnicodeIsNonSpacing(C1) do begin Dec(Start); C2 := UCS4(Start^); if (Start - 1) > Text then C1 := UCS4((Start - 1)^) else C1 := $FFFFFFFF; if (SurrogateLowStart <= C2) and (C2 <= SurrogateLowEnd) and (SurrogateHighStart <= C1) and (C1 <= SurrogateHighEnd) then begin C1 := $10000 + (((C1 and $03FF) shl 10) or (C2 and $03FF)); Dec(Start); end else C1 := C2; end; end; // handle space compression if indicated if sfSpaceCompress in FFlags then begin CheckSpace := False; while (Start > Text) and (UnicodeIsWhiteSpace(C1) or UnicodeIsControl(C1)) do begin CheckSpace := UnicodeIsWhiteSpace(C1); Dec(Start); C2 := UCS4(Start^); if (Start - 1) > Text then C1 := UCS4((Start - 1)^) else C1 := $FFFFFFFF; if (SurrogateLowStart <= C2) and (C2 <= SurrogateLowEnd) and (SurrogateHighStart <= C1) and (C1 <= SurrogateHighEnd) then begin C1 := $10000 + (((C1 and $03FF) shl 10) or (C2 and $03FF)); Dec(Start); end else C1 := C2; end; // Handle things if space compression was indicated and one or // more member characters were found. if CheckSpace then begin if Cp.UpCase <> $20 then Exit; Dec(Cp); Dec(Count); // If Count is 0 at this place then the space character(s) was the first // in the pattern and we need to correct the start position. if Count = 0 then Inc(Start); end; end; // handle the normal comparison cases if (Count > 0) and (Boolean(C1 xor Cp.UpCase) and Boolean(C1 xor Cp.LoCase) and Boolean(C1 xor Cp.TitleCase)) then Exit; if C1 >= $10000 then Dec(Count, 2) else Dec(Count, 1); if Count > 0 then begin Dec(Cp); // get the next preceding character if Start > Text then begin Dec(Start); C2 := UCS4(Start^); if (Start - 1) > Text then C1 := UCS4((Start - 1)^) else C1 := $FFFFFFFF; if (SurrogateLowStart <= C2) and (C2 <= SurrogateLowEnd) and (SurrogateHighStart <= C1) and (C1 <= SurrogateHighEnd) then begin C1 := $10000 + (((C1 and $03FF) shl 10) or (C2 and $03FF)); Dec(Start); end else C1 := C2; end; end; end; // So far the string matched. Now check its left border for a space character // if whole word only are allowed. if not (sfWholeWordOnly in FFlags) or (Start <= Text) or UnicodeIsWhiteSpace(UCS4((Start - 1)^)) then begin // set the match start position MatchStart := Start - Text; Result := True; end; end; procedure TUTBMSearch.Compile(Pattern: PUCS2; PatternLength: Integer; Flags: TSearchFlags); var HaveSpace: Boolean; I, J, K, SLen: Integer; Cp: PUTBMChar; Sp: PUTBMSkip; C1, C2, Sentinel: UCS4; begin if (Pattern <> nil) and (Pattern^ <> #0) and (PatternLength > 0) then begin // do some initialization FFlags := Flags; // extra skip flag FMD4 := 1; Sentinel := 0; // allocate more storage if necessary FPattern := AllocMem(SizeOf(TUTBMChar) * PatternLength); FSkipValues := AllocMem(SizeOf(TUTBMSkip) * PatternLength); FPatternSize := PatternLength; // Preprocess the pattern to remove controls (if specified) and determine case. Cp := FPattern; I := 0; HaveSpace := False; while I < PatternLength do begin C1 := UCS4(Pattern[I]); if (I + 1) < PatternLength then C2 := UCS4(Pattern[I + 1]) else C2 := $FFFFFFFF; if (SurrogateHighStart <= C1) and (C1 <= SurrogateHighEnd) and (SurrogateLowStart <= C2) and (C2 <= SurrogateLowEnd) then C1 := $10000 + (((C1 and $03FF) shl 10) or (C2 and $03FF)); // Make sure the HaveSpace flag is turned off if the character is not an // appropriate one. if not UnicodeIsWhiteSpace(C1) then HaveSpace := False; // If non-spacing characters should be ignored, do it here. if (sfIgnoreNonSpacing in Flags) and UnicodeIsNonSpacing(C1) then begin Inc(I); Continue; end; // check if spaces and controls need to be compressed if sfSpaceCompress in Flags then begin if UnicodeIsWhiteSpace(C1) then begin if not HaveSpace then begin // Add a space and set the flag. Cp.UpCase := $20; Cp.LoCase := $20; Cp.TitleCase := $20; Inc(Cp); // increase the real pattern length Inc(FPatternLength); Sentinel := $20; HaveSpace := True; end; Inc(I); Continue; end; // ignore all control characters if UnicodeIsControl(C1) then begin Inc(I); Continue; end; end; // add the character if not (sfCaseSensitive in Flags) then begin { TODO : use the entire mapping, not only the first character } Cp.UpCase := UnicodeToUpper(C1)[0]; Cp.LoCase := UnicodeToLower(C1)[0]; Cp.TitleCase := UnicodeToTitle(C1)[0]; end else begin Cp.UpCase := C1; Cp.LoCase := C1; Cp.TitleCase := C1; end; Sentinel := Cp.UpCase; // move to the next character Inc(Cp); // increase the real pattern length appropriately if C1 >= $10000 then Inc(FPatternLength, 2) else Inc(FPatternLength); // increment the loop index for UTF-16 characters if C1 > $10000 then Inc(I, 2) else Inc(I); end; // set the number of characters actually used FPatternUsed := (PChar(Cp) - PChar(FPattern)) div SizeOf(TUTBMChar); // Go through and construct the skip array and determine the actual length // of the pattern in UCS2 terms. SLen := FPatternLength - 1; Cp := FPattern; K := 0; for I := 0 to FPatternUsed - 1 do begin // locate the character in the FSkipValues array Sp := FSkipValues; J := 0; while (J < FSkipsUsed) and (Sp.BMChar.UpCase <> Cp.UpCase) do begin Inc(J); Inc(Sp); end; // If the character is not found, set the new FSkipValues element and // increase the number of FSkipValues elements. if J = FSkipsUsed then begin Sp.BMChar := Cp; Inc(FSkipsUsed); end; // Set the updated FSkipValues value. If the character is UTF-16 and is // not the last one in the pattern, add one to its FSkipValues value. Sp.SkipValues := SLen - K; if (Cp.UpCase >= $10000) and ((K + 2) < SLen) then Inc(Sp.SkipValues); // set the new extra FSkipValues for the sentinel character if ((Cp.UpCase >= $10000) and ((K + 2) <= SLen) or ((K + 1) <= SLen) and (Cp.UpCase = Sentinel)) then FMD4 := SLen - K; // increase the actual index if Cp.UpCase >= $10000 then Inc(K, 2) else Inc(K); Inc(Cp); end; end; end; function TUTBMSearch.Find(Text: PUCS2; TextLen: Cardinal; var MatchStart, MatchEnd: Cardinal): Boolean; // this is the main matching routine using a tuned Boyer-Moore algorithm var K: Cardinal; Start, Stop: PUCS2; begin Result := False; if (FPattern <> nil) and (FPatternUsed > 0) and (Text <> nil) and (TextLen > 0) and (TextLen >= FPatternLength) then begin Start := Text + FPatternLength - 1; Stop := Text + TextLen; // adjust the start point if it points to a low surrogate if (SurrogateLowStart <= UCS4(Start^)) and (UCS4(Start^) <= SurrogateLowEnd) and (SurrogateHighStart <= UCS4((Start - 1)^)) and (UCS4((Start - 1)^) <= SurrogateHighEnd) then Dec(Start); while Start < Stop do begin repeat K := GetSkipValue(Start, Stop); if K = 0 then Break; Inc(Start, K); if (Start < Stop) and (SurrogateLowStart <= UCS4(Start^)) and (UCS4(Start^) <= SurrogateLowEnd) and (SurrogateHighStart <= UCS4((Start - 1)^)) and (UCS4((Start - 1)^) <= SurrogateHighEnd) then Dec(Start); until False; if (Start < Stop) and Match(Text, Start, Stop, MatchStart, MatchEnd) then begin Result := True; Break; end; Inc(Start, FMD4); if (Start < Stop) and (SurrogateLowStart <= UCS4(Start^)) and (UCS4(Start^) <= SurrogateLowEnd) and (SurrogateHighStart <= UCS4((Start - 1)^)) and (UCS4((Start - 1)^) <= SurrogateHighEnd) then Dec(Start); end; end; end; procedure TUTBMSearch.Clear; begin ClearPattern; inherited Clear; end; function TUTBMSearch.FindAll(const Text: WideString): Boolean; begin Result := FindAll(PWideChar(Text), Length(Text)); end; function TUTBMSearch.FindAll(Text: PWideChar; TextLen: Cardinal): Boolean; // Looks for all occurences of the pattern passed to FindPrepare and creates an // internal list of their positions. var Start, Stop: Cardinal; Run: PWideChar; RunLen: Cardinal; begin ClearResults; Run := Text; RunLen := TextLen; // repeat to find all occurences of the pattern while Find(Run, RunLen, Start, Stop) do begin // store this result (consider text pointer movement)... AddResult(Start + Run - Text, Stop + Run - Text); // ... and advance text position and length Inc(Run, Stop); Dec(RunLen, Stop); end; Result := Count > 0; end; function TUTBMSearch.FindFirst(const Text: WideString; var Start, Stop: Cardinal): Boolean; // Looks for the first occurence of the pattern passed to FindPrepare in Text and // returns True if one could be found (in which case Start and Stop are set to // the according indices) otherwise False. This function is in particular of // interest if only one occurence needs to be found. begin ClearResults; Result := Find(PWideChar(Text), Length(Text), Start, Stop); if Result then AddResult(Start, Stop); end; function TUTBMSearch.FindFirst(Text: PWideChar; TextLen: Cardinal; var Start, Stop: Cardinal): Boolean; // Same as the WideString version of this method. begin ClearResults; Result := Find(Text, TextLen, Start, Stop); if Result then AddResult(Start, Stop); end; procedure TUTBMSearch.FindPrepare(const Pattern: WideString; Options: TSearchFlags); begin FindPrepare(PWideChar(Pattern), Length(Pattern), Options); end; procedure TUTBMSearch.FindPrepare(Pattern: PWideChar; PatternLength: Cardinal; Options: TSearchFlags); // prepares following search by compiling the given pattern into an internal structure begin Compile(Pattern, PatternLength, Options); end; //----------------- Unicode RE search core --------------------------------------------------------- const // error codes _URE_OK = 0; _URE_UNEXPECTED_EOS = -1; _URE_CCLASS_OPEN = -2; _URE_UNBALANCED_GROUP = -3; _URE_INVALID_PROPERTY = -4; _URE_INVALID_RANGE = -5; _URE_RANGE_OPEN = -6; // options that can be combined for searching URE_IGNORE_NONSPACING = $01; URE_DONT_MATCHES_SEPARATORS = $02; const // Flags used internally in the DFA _URE_DFA_CASEFOLD = $01; _URE_DFA_BLANKLINE = $02; // symbol types for the DFA _URE_ANY_CHAR = 1; _URE_CHAR = 2; _URE_CCLASS = 3; _URE_NCCLASS = 4; _URE_BOL_ANCHOR = 5; _URE_EOL_ANCHOR = 6; // op codes for converting the NFA to a DFA _URE_SYMBOL = 10; _URE_PAREN = 11; _URE_QUEST = 12; _URE_STAR = 13; _URE_PLUS = 14; _URE_ONE = 15; _URE_AND = 16; _URE_OR = 17; _URE_NOOP = $FFFF; //----------------- TURESearch --------------------------------------------------------------------- procedure TURESearch.Clear; begin inherited Clear; ClearUREBuffer; ClearDFA; end; procedure TURESearch.Push(V: Cardinal); begin with FUREBuffer do begin // If the 'Reducing' parameter is True, check to see if the value passed is // already on the stack. if Reducing and ExpressionList.Expressions[Word(V)].OnStack then Exit; if Stack.ListUsed = Length(Stack.List) then SetLength(Stack.List, Length(Stack.List) + 8); Stack.List[Stack.ListUsed] := V; Inc(Stack.ListUsed); // If the 'reducing' parameter is True, flag the element as being on the Stack. if Reducing then ExpressionList.Expressions[Word(V)].OnStack := True; end; end; function TURESearch.Peek: Cardinal; begin if FUREBuffer.Stack.ListUsed = 0 then Result := _URE_NOOP else Result := FUREBuffer.Stack.List[FUREBuffer.Stack.ListUsed - 1]; end; function TURESearch.Pop: Cardinal; begin if FUREBuffer.Stack.ListUsed = 0 then Result := _URE_NOOP else begin Dec(FUREBuffer.Stack.ListUsed); Result := FUREBuffer.Stack.List[FUREBuffer.Stack.ListUsed]; if FUREBuffer.Reducing then FUREBuffer.ExpressionList.Expressions[Word(Result)].OnStack := False; end; end; function TURESearch.ParsePropertyList(Properties: PUCS2; Limit: Cardinal; var Categories: TCharacterCategories): Cardinal; // Parse a comma-separated list of integers that represent character properties. // Combine them into a set of categories and return the number of characters consumed. var N: Cardinal; Run, ListEnd: PUCS2; begin Run := Properties; ListEnd := Run + Limit; N := 0; Categories := []; while (FUREBuffer.Error = _URE_OK) and (Run < ListEnd) do begin if Run^ = ',' then begin // Encountered a comma, so take the number parsed so far as category and // reset the number. Include(Categories, TCharacterCategory(N)); N := 0; end else begin if (Run^ >= '0') and (Run^ <= '9') then begin // Encountered a digit, so start or continue building the cardinal that // represents the character category. N := (N * 10) + Cardinal(Word(Run^) - Ord('0')); end else begin // Encountered something that is not part of the property list. // Indicate that we are done. Break; end; end; // If the number is to large then there is a problem. // Most likely a missing comma separator. if Integer(N) > Ord(High(TCharacterCategory)) then FUREBuffer.Error := _URE_INVALID_PROPERTY; Inc(Run); end; // Return the number of characters consumed. Result := Run - Properties; end; function TURESearch.MakeHexNumber(NP: PUCS2; Limit: Cardinal; var Number: Cardinal): Cardinal; // Collect a hex number with 1 to 4 digits and return the number of characters used. var I: Integer; Run, ListEnd: PUCS2; begin Run := np; ListEnd := Run + Limit; Number := 0; I := 0; while (I < 4) and (Run < ListEnd) do begin if (Run^ >= '0') and (Run^ <= '9') then Number := (Number shl 4) + Cardinal(Word(Run^) - Ord('0')) else begin if (Run^ >= 'A') and (Run^ <= 'F') then Number := (Number shl 4) + Cardinal(Word(Run^) - Ord('A')) + 10 else begin if (Run^ >= 'a') and (Run^ <= 'f') then Number := (Number shl 4) + Cardinal(Word(Run^) - Ord('a')) + 10 else Break; end; end; Inc(I); Inc(Run); end; Result := Run - NP; end; procedure TURESearch.AddRange(var CCL: TUcCClass; Range: TUcRange); // Insert a Range into a character class, removing duplicates and ordering them // in increasing Range-start order. var I: Integer; Temp: UCS4; begin // If the `Casefold' flag is set, then make sure both endpoints of the Range // are converted to lower. if (FUREBuffer.Flags and _URE_DFA_CASEFOLD) <> 0 then begin { TODO : use the entire mapping, not only the first character } Range.MinCode := UnicodeToLower(Range.MinCode)[0]; Range.MaxCode := UnicodeToLower(Range.MaxCode)[0]; end; // Swap the Range endpoints if they are not in increasing order. if Range.MinCode > Range.MaxCode then begin Temp := Range.MinCode; Range.MinCode := Range.MaxCode; Range.MaxCode := Temp; end; I := 0; while (I < CCL.RangesUsed) and (Range.MinCode < CCL.Ranges[I].MinCode) do Inc(I); // check for a duplicate if (I < CCL.RangesUsed) and (Range.MinCode = CCL.Ranges[I].MinCode) and (Range.MaxCode = CCL.Ranges[I].MaxCode) then Exit; if CCL.RangesUsed = Length(CCL.Ranges) then SetLength(CCL.Ranges, Length(CCL.Ranges) + 8); if I < CCL.RangesUsed then Move(CCL.Ranges[I], CCL.Ranges[I + 1], SizeOf(TUcRange) * (CCL.RangesUsed - I)); CCL.Ranges[I].MinCode := Range.MinCode; CCL.Ranges[I].MaxCode := Range.MaxCode; Inc(CCL.RangesUsed); end; type PTrie = ^TTrie; TTrie = record Key: UCS2; Len, Next: Cardinal; Setup: Integer; Categories: TCharacterCategories; end; procedure TURESearch.SpaceSetup(Symbol: PUcSymbolTableEntry; Categories: TCharacterCategories); var Range: TUcRange; begin Symbol.Categories := Symbol.Categories + Categories; Range.MinCode := UCS4(WideTabulator); Range.MaxCode := UCS4(WideTabulator); AddRange(Symbol.Symbol.CCL, Range); Range.MinCode := UCS4(WideCarriageReturn); Range.MaxCode := UCS4(WideCarriageReturn); AddRange(Symbol.Symbol.CCL, Range); Range.MinCode := UCS4(WideLineFeed); Range.MaxCode := UCS4(WideLineFeed); AddRange(Symbol.Symbol.CCL, Range); Range.MinCode := UCS4(WideFormFeed); Range.MaxCode := UCS4(WideFormFeed); AddRange(Symbol.Symbol.CCL, Range); Range.MinCode := $FEFF; Range.MaxCode := $FEFF; AddRange(Symbol.Symbol.CCL, Range); end; procedure TURESearch.HexDigitSetup(Symbol: PUcSymbolTableEntry); var Range: TUcRange; begin Range.MinCode := UCS4('0'); Range.MaxCode := UCS4('9'); AddRange(Symbol.Symbol.CCL, Range); Range.MinCode := UCS4('A'); Range.MaxCode := UCS4('F'); AddRange(Symbol.Symbol.CCL, Range); Range.MinCode := UCS4('a'); Range.MaxCode := UCS4('f'); AddRange(Symbol.Symbol.CCL, Range); end; const CClassTrie: array [0..64] of TTrie = ( (Key: #$003A; Len: 1; Next: 1; Setup: 0; Categories: []), (Key: #$0061; Len: 9; Next: 10; Setup: 0; Categories: []), (Key: #$0063; Len: 8; Next: 19; Setup: 0; Categories: []), (Key: #$0064; Len: 7; Next: 24; Setup: 0; Categories: []), (Key: #$0067; Len: 6; Next: 29; Setup: 0; Categories: []), (Key: #$006C; Len: 5; Next: 34; Setup: 0; Categories: []), (Key: #$0070; Len: 4; Next: 39; Setup: 0; Categories: []), (Key: #$0073; Len: 3; Next: 49; Setup: 0; Categories: []), (Key: #$0075; Len: 2; Next: 54; Setup: 0; Categories: []), (Key: #$0078; Len: 1; Next: 59; Setup: 0; Categories: []), (Key: #$006C; Len: 1; Next: 11; Setup: 0; Categories: []), (Key: #$006E; Len: 2; Next: 13; Setup: 0; Categories: []), (Key: #$0070; Len: 1; Next: 16; Setup: 0; Categories: []), (Key: #$0075; Len: 1; Next: 14; Setup: 0; Categories: []), (Key: #$006D; Len: 1; Next: 15; Setup: 0; Categories: []), (Key: #$003A; Len: 1; Next: 16; Setup: 1; Categories: ClassLetter + ClassNumber), (Key: #$0068; Len: 1; Next: 17; Setup: 0; Categories: []), (Key: #$0061; Len: 1; Next: 18; Setup: 0; Categories: []), (Key: #$003A; Len: 1; Next: 19; Setup: 1; Categories: ClassLetter), (Key: #$006E; Len: 1; Next: 20; Setup: 0; Categories: []), (Key: #$0074; Len: 1; Next: 21; Setup: 0; Categories: []), (Key: #$0072; Len: 1; Next: 22; Setup: 0; Categories: []), (Key: #$006C; Len: 1; Next: 23; Setup: 0; Categories: []), (Key: #$003A; Len: 1; Next: 24; Setup: 1; Categories: [ccOtherControl, ccOtherFormat]), (Key: #$0069; Len: 1; Next: 25; Setup: 0; Categories: []), (Key: #$0067; Len: 1; Next: 26; Setup: 0; Categories: []), (Key: #$0069; Len: 1; Next: 27; Setup: 0; Categories: []), (Key: #$0074; Len: 1; Next: 28; Setup: 0; Categories: []), (Key: #$003A; Len: 1; Next: 29; Setup: 1; Categories: ClassNumber), (Key: #$0072; Len: 1; Next: 30; Setup: 0; Categories: []), (Key: #$0061; Len: 1; Next: 31; Setup: 0; Categories: []), (Key: #$0070; Len: 1; Next: 32; Setup: 0; Categories: []), (Key: #$0068; Len: 1; Next: 33; Setup: 0; Categories: []), (Key: #$003A; Len: 1; Next: 34; Setup: 1; Categories: ClassMark + ClassNumber + ClassLetter + ClassPunctuation + ClassSymbol), (Key: #$006F; Len: 1; Next: 35; Setup: 0; Categories: []), (Key: #$0077; Len: 1; Next: 36; Setup: 0; Categories: []), (Key: #$0065; Len: 1; Next: 37; Setup: 0; Categories: []), (Key: #$0072; Len: 1; Next: 38; Setup: 0; Categories: []), (Key: #$003A; Len: 1; Next: 39; Setup: 1; Categories: [ccLetterLowercase]), (Key: #$0072; Len: 2; Next: 41; Setup: 0; Categories: []), (Key: #$0075; Len: 1; Next: 45; Setup: 0; Categories: []), (Key: #$0069; Len: 1; Next: 42; Setup: 0; Categories: []), (Key: #$006E; Len: 1; Next: 43; Setup: 0; Categories: []), (Key: #$0074; Len: 1; Next: 44; Setup: 0; Categories: []), (Key: #$003A; Len: 1; Next: 45; Setup: 1; Categories: ClassMark + ClassNumber + ClassLetter + ClassPunctuation + ClassSymbol + [ccSeparatorSpace]), (Key: #$006E; Len: 1; Next: 46; Setup: 0; Categories: []), (Key: #$0063; Len: 1; Next: 47; Setup: 0; Categories: []), (Key: #$0074; Len: 1; Next: 48; Setup: 0; Categories: []), (Key: #$003A; Len: 1; Next: 49; Setup: 1; Categories: ClassPunctuation), (Key: #$0070; Len: 1; Next: 50; Setup: 0; Categories: []), (Key: #$0061; Len: 1; Next: 51; Setup: 0; Categories: []), (Key: #$0063; Len: 1; Next: 52; Setup: 0; Categories: []), (Key: #$0065; Len: 1; Next: 53; Setup: 0; Categories: []), (Key: #$003A; Len: 1; Next: 54; Setup: 2; Categories: ClassSpace), (Key: #$0070; Len: 1; Next: 55; Setup: 0; Categories: []), (Key: #$0070; Len: 1; Next: 56; Setup: 0; Categories: []), (Key: #$0065; Len: 1; Next: 57; Setup: 0; Categories: []), (Key: #$0072; Len: 1; Next: 58; Setup: 0; Categories: []), (Key: #$003A; Len: 1; Next: 59; Setup: 1; Categories: [ccLetterUppercase]), (Key: #$0064; Len: 1; Next: 60; Setup: 0; Categories: []), (Key: #$0069; Len: 1; Next: 61; Setup: 0; Categories: []), (Key: #$0067; Len: 1; Next: 62; Setup: 0; Categories: []), (Key: #$0069; Len: 1; Next: 63; Setup: 0; Categories: []), (Key: #$0074; Len: 1; Next: 64; Setup: 0; Categories: []), (Key: #$003A; Len: 1; Next: 65; Setup: 3; Categories: []) ); function TURESearch.PosixCCL(CP: PUCS2; Limit: Cardinal; Symbol: PUcSymbolTableEntry): Cardinal; // Probe for one of the POSIX colon delimited character classes in the static trie. var I: Integer; N: Cardinal; TP: PTrie; Run, ListEnd: PUCS2; begin Result := 0; // If the number of characters left is less than 7, // then this cannot be interpreted as one of the colon delimited classes. if Limit >= 7 then begin Run := cp; ListEnd := Run + Limit; TP := @CClassTrie[0]; I := 0; while (Run < ListEnd) and (I < 8) do begin N := TP.Len; while (N > 0) and (TP.Key <> Run^) do begin Inc(TP); Dec(N); end; if N = 0 then begin Result := 0; Exit; end; if (Run^ = ':') and ((I = 6) or (I = 7)) then begin Inc(Run); Break; end; if (Run + 1) < ListEnd then TP := @CClassTrie[TP.Next]; Inc(I); Inc(Run); end; Result := Run - CP; case TP.Setup of 1: Symbol.Categories := Symbol.Categories + TP.Categories; 2: SpaceSetup(Symbol, TP.Categories); 3: HexDigitSetup(Symbol); else Result := 0; end; end; end; function TURESearch.BuildCharacterClass(CP: PUCS2; Limit: Cardinal; Symbol: PUcSymbolTableEntry): Cardinal; // Construct a list of ranges and return the number of characters consumed. var RangeEnd: Integer; N: Cardinal; Run, ListEnd: PUCS2; C, Last: UCS4; Range: TUcRange; begin Run := cp; ListEnd := Run + Limit; if Run^ = '^' then begin Symbol.AType := _URE_NCCLASS; Inc(Run); end else Symbol.AType := _URE_CCLASS; Last := 0; RangeEnd := 0; while (FUREBuffer.Error = _URE_OK) and (Run < ListEnd) do begin // Allow for the special case []abc], where the first closing bracket would end an empty // character class, which makes no sense. Hence this bracket is treaded literally. if (Run^ = ']') and (Symbol.Symbol.CCL.RangesUsed > 0) then Break; C := UCS4(Run^); Inc(Run); // escape character if C = Ord('\') then begin if Run = ListEnd then begin // The EOS was encountered when expecting the reverse solidus to be followed by the character it is escaping. // Set an Error code and return the number of characters consumed up to this point. FUREBuffer.Error := _URE_UNEXPECTED_EOS; Result := Run - CP; Exit; end; C := UCS4(Run^); Inc(Run); case UCS2(C) of 'a': C := $07; 'b': C := $08; 'f': C := $0C; 'n': C := $0A; 'R': C := $0D; 't': C := $09; 'v': C := $0B; 'p', 'P': begin Inc(Run, ParsePropertyList(Run, ListEnd - Run, Symbol.Categories)); // Invert the bit mask of the properties if this is a negated character class or if 'P' is used to specify // a list of character properties that should *not* match in a character class. if C = Ord('P') then Symbol.Categories := ClassAll - Symbol.Categories; Continue; end; 'x', 'X', 'u', 'U': begin if (Run < ListEnd) and ((Run^ >= '0') and (Run^ <= '9') or (Run^ >= 'A') and (Run^ <= 'F') or (Run^ >= 'a') and (Run^ <= 'f')) then Inc(Run, MakeHexNumber(Run, ListEnd - Run, C)); end; end; end else begin if C = Ord(':') then begin // Probe for a POSIX colon delimited character class. Dec(Run); N := PosixCCL(Run, ListEnd - Run, Symbol); if N = 0 then Inc(Run) else begin Inc(Run, N); Continue; end; end; end; // Check to see if the current character is a low surrogate that needs // to be combined with a preceding high surrogate. if Last <> 0 then begin if (C >= SurrogateLowStart) and (C <= SurrogateLowEnd) then begin // Construct the UTF16 character code. C := $10000 + (((Last and $03FF) shl 10) or (C and $03FF)) end else begin // Add the isolated high surrogate to the range. if RangeEnd = 1 then Range.MaxCode := Last and $FFFF else begin Range.MinCode := Last and $FFFF; Range.MaxCode := Last and $FFFF; end; AddRange(Symbol.Symbol.CCL, Range); RangeEnd := 0; end; end; // Clear the Last character code. Last := 0; // This slightly awkward code handles the different cases needed to construct a range. if (C >= SurrogateHighStart) and (C <= SurrogateHighEnd) then begin // If the high surrogate is followed by a Range indicator, simply add it as the Range start. Otherwise, // save it in the next character is a low surrogate. if Run^ = '-' then begin Inc(Run); Range.MinCode := C; RangeEnd := 1; end else Last := C; end else begin if RangeEnd = 1 then begin Range.MaxCode := C; AddRange(Symbol.Symbol.CCL, Range); RangeEnd := 0; end else begin Range.MinCode := C; Range.MaxCode := C; if Run^ = '-' then begin Inc(Run); RangeEnd := 1; end else AddRange(Symbol.Symbol.CCL, Range); end; end; end; if (Run < ListEnd) and (Run^ = ']') then Inc(Run) else begin // The parse was not terminated by the character class close symbol (']'), so set an error code. FUREBuffer.Error := _URE_CCLASS_OPEN; end; Result := Run - CP; end; function TURESearch.ProbeLowSurrogate(LeftState: PUCS2; Limit: Cardinal; var Code: UCS4): Cardinal; // probes for a low surrogate hex code var I: Integer; Run, ListEnd: PUCS2; begin I := 0; Code := 0; Run := LeftState; ListEnd := Run + Limit; while (I < 4) and (Run < ListEnd) do begin if (Run^ >= '0') and (Run^ <= '9') then Code := (Code shl 4) + Cardinal(Word(Run^) - Ord('0')) else begin if (Run^ >= 'A') and (Run^ <= 'F') then Code := (Code shl 4) + Cardinal(Word(Run^) - Ord('A')) + 10 else begin if (Run^ >= 'a') and (Run^ <= 'f') then Code := (Code shl 4) + Cardinal(Word(Run^) - Ord('a')) + 10 else Break; end; end; Inc(Run); end; if (SurrogateLowStart <= Code) and (Code <= SurrogateLowEnd) then Result := Run - LeftState else Result := 0; end; function TURESearch.CompileSymbol(S: PUCS2; Limit: Cardinal; Symbol: PUcSymbolTableEntry): Cardinal; var C: UCS4; Run, ListEnd: PUCS2; begin Run := S; ListEnd := S + Limit; C := UCS4(Run^); Inc(Run); if C = Ord('\') then begin if Run = ListEnd then begin // The EOS was encountered when expecting the reverse solidus to be followed // by the character it is escaping. Set an Error code and return the number // of characters consumed up to this point. FUREBuffer.Error := _URE_UNEXPECTED_EOS; Result := Run - S; Exit; end; C := UCS4(Run^); Inc(Run); case UCS2(C) of 'p', 'P': begin if UCS2(C) = 'p' then Symbol.AType :=_URE_CCLASS else Symbol.AType :=_URE_NCCLASS; Inc(Run, ParsePropertyList(Run, ListEnd - Run, Symbol.Categories)); end; 'a': begin Symbol.AType := _URE_CHAR; Symbol.Symbol.Chr := $07; end; 'b': begin Symbol.AType := _URE_CHAR; Symbol.Symbol.Chr := $08; end; 'f': begin Symbol.AType := _URE_CHAR; Symbol.Symbol.Chr := $0C; end; 'n': begin Symbol.AType := _URE_CHAR; Symbol.Symbol.Chr := $0A; end; 'r': begin Symbol.AType := _URE_CHAR; Symbol.Symbol.Chr := $0D; end; 't': begin Symbol.AType := _URE_CHAR; Symbol.Symbol.Chr := $09; end; 'v': begin Symbol.AType := _URE_CHAR; Symbol.Symbol.Chr := $0B; end; else case UCS2(C) of 'x', 'X', 'u', 'U': begin // Collect between 1 and 4 digits representing an UCS2 code. if (Run < ListEnd) and ((Run^ >= '0') and (Run^ <= '9') or (Run^ >= 'A') and (Run^ <= 'F') or (Run^ >= 'a') and (Run^ <= 'f')) then Inc(Run, MakeHexNumber(Run, ListEnd - Run, C)); end; end; // Simply add an escaped character here. Symbol.AType := _URE_CHAR; Symbol.Symbol.Chr := C; end; end else begin if (UCS2(C) = '^') or (UCS2(C) = '$') then begin // Handle the BOL and EOL anchors. This actually consists simply of setting // a flag that indicates that the user supplied anchor match function should // be called. This needs to be done instead of simply matching line/paragraph // separators because beginning-of-text and end-of-text tests are needed as well. if UCS2(C) = '^' then Symbol.AType := _URE_BOL_ANCHOR else Symbol.AType := _URE_EOL_ANCHOR; end else begin if UCS2(C) = '[' then begin // construct a character class Inc(Run, BuildCharacterClass(Run, ListEnd - Run, Symbol)); end else begin if UCS2(C) = '.' then Symbol.AType := _URE_ANY_CHAR else begin Symbol.AType := _URE_CHAR; Symbol.Symbol.Chr := C; end; end; end; end; // If the symbol type happens to be a character and is a high surrogate, then // probe forward to see if it is followed by a low surrogate that needs to be added. if (Run < ListEnd) and (Symbol.AType = _URE_CHAR) and (SurrogateHighStart <= Symbol.Symbol.Chr) and (Symbol.Symbol.Chr <= SurrogateHighEnd) then begin if (SurrogateLowStart <= UCS4(Run^)) and (UCS4(Run^) <= SurrogateLowEnd) then begin Symbol.Symbol.Chr := $10000 + (((Symbol.Symbol.Chr and $03FF) shl 10) or (UCS4(Run^) and $03FF)); Inc(Run); end else begin if (Run^ = '\') and (((Run + 1)^ = 'x') or ((Run + 1)^ = 'X') or ((Run + 1)^ = 'u') or ((Run + 1)^ = 'U')) then begin Inc(Run, ProbeLowSurrogate(Run + 2, ListEnd - (Run + 2), C)); if (SurrogateLowStart <= C) and (C <= SurrogateLowEnd) then begin // Take into account the \[xu] in front of the hex code. Inc(Run, 2); Symbol.Symbol.Chr := $10000 + (((Symbol.Symbol.Chr and $03FF) shl 10) or (C and $03FF)); end; end; end; end; // Last, make sure any _URE_CHAR type symbols are changed to lower if the // 'Casefold' flag is set. { TODO : use the entire mapping, not only the first character and use the case fold abilities of the unit. } if ((FUREBuffer.Flags and _URE_DFA_CASEFOLD) <> 0) and (Symbol.AType = _URE_CHAR) then Symbol.Symbol.Chr := UnicodeToLower(Symbol.Symbol.Chr)[0]; // If the symbol constructed is anything other than one of the anchors, // make sure the _URE_DFA_BLANKLINE flag is removed. if (Symbol.AType <> _URE_BOL_ANCHOR) and (Symbol.AType <> _URE_EOL_ANCHOR) then FUREBuffer.Flags := FUREBuffer.Flags and not _URE_DFA_BLANKLINE; // Return the number of characters consumed. Result := Run - S; end; function TURESearch.SymbolsAreDifferent(A, B: PUcSymbolTableEntry): Boolean; begin Result := False; if (A.AType <> B.AType) or (A.Mods <> B.Mods) or (A.Categories <> B.Categories) then Result := True else begin if (A.AType = _URE_CCLASS) or (A.AType = _URE_NCCLASS) then begin if A.Symbol.CCL.RangesUsed <> B.Symbol.CCL.RangesUsed then Result := True else begin if (A.Symbol.CCL.RangesUsed > 0) and not CompareMem(@A.Symbol.CCL.Ranges[0], @B.Symbol.CCL.Ranges[0], SizeOf(TUcRange) * A.Symbol.CCL.RangesUsed) then Result := True;; end; end else begin if (A.AType = _URE_CHAR) and (A.Symbol.Chr <> B.Symbol.Chr) then Result := True; end; end; end; function TURESearch.MakeSymbol(S: PUCS2; Limit: Cardinal; var Consumed: Cardinal): Cardinal; // constructs a symbol, but only keep unique symbols var I: Integer; Start: PUcSymbolTableEntry; Symbol: TUcSymbolTableEntry; begin // Build the next symbol so we can test to see if it is already in the symbol table. FillChar(Symbol, SizeOf(TUcSymbolTableEntry), 0); Consumed := CompileSymbol(S, Limit, @Symbol); // Check to see if the symbol exists. I := 0; Start := @FUREBuffer.SymbolTable.Symbols[0]; while (I < FUREBuffer.SymbolTable.SymbolsUsed) and SymbolsAreDifferent(@Symbol, Start) do begin Inc(I); Inc(Start); end; if I < FUREBuffer.SymbolTable.SymbolsUsed then begin // Free up any ranges used for the symbol. if (Symbol.AType = _URE_CCLASS) or (Symbol.AType = _URE_NCCLASS) then Symbol.Symbol.CCL.Ranges := nil; Result := FUREBuffer.SymbolTable.Symbols[I].ID; Exit; end; // Need to add the new symbol. if FUREBuffer.SymbolTable.SymbolsUsed = Length(FUREBuffer.SymbolTable.Symbols) then begin SetLength(FUREBuffer.SymbolTable.Symbols, Length(FUREBuffer.SymbolTable.Symbols) + 8); end; Symbol.ID := FUREBuffer.SymbolTable.SymbolsUsed; Inc(FUREBuffer.SymbolTable.SymbolsUsed); FUREBuffer.SymbolTable.Symbols[Symbol.ID] := Symbol; Result := Symbol.ID; end; function TURESearch.MakeExpression(AType, LHS, RHS: Cardinal): Cardinal; var I: Integer; begin // Determine if the expression already exists or not. with FUREBuffer.ExpressionList do begin for I := 0 to ExpressionsUsed - 1 do begin if (Expressions[I].AType = AType) and (Expressions[I].LHS = LHS) and (Expressions[I].RHS = RHS) then begin Result := I; Exit; end; end; // Need to add a new expression. if ExpressionsUsed = Length(Expressions) then SetLength(Expressions, Length(Expressions) + 8); Expressions[ExpressionsUsed].OnStack := False; Expressions[ExpressionsUsed].AType := AType; Expressions[ExpressionsUsed].LHS := LHS; Expressions[ExpressionsUsed].RHS := RHS; Result := ExpressionsUsed; Inc(ExpressionsUsed); end; end; function IsSpecial(C: Word): Boolean; begin Result := C in [Word('+'), Word('*'), Word('?'), Word('{'), Word('|'), Word(')')]; end; procedure TURESearch.CollectPendingOperations(var State: Cardinal); // collects all pending AND and OR operations and make corresponding expressions var Operation: Cardinal; begin repeat Operation := Peek; if (Operation <> _URE_AND) and (Operation <> _URE_OR) then Break; // make an expression with the AND or OR operator and its right hand side Operation := Pop; State := MakeExpression(Operation, Pop, State); until False; end; function TURESearch.ConvertRegExpToNFA(RE: PWideChar; RELength: Cardinal): Cardinal; // Converts the regular expression into an NFA in a form that will be easy to // reduce to a DFA. The starting state for the reduction will be returned. var C: UCS2; Head, Tail: PUCS2; S: WideString; Symbol, State, LastState, Used, M, N: Cardinal; I: Integer; begin State := _URE_NOOP; Head := RE; Tail := Head + RELength; while (FUREBuffer.Error = _URE_OK) and (Head < Tail) do begin C := Head^; Inc(Head); case C of '(': Push(_URE_PAREN); ')': // check for the case of too many close parentheses begin if Peek = _URE_NOOP then begin FUREBuffer.Error := _URE_UNBALANCED_GROUP; Break; end; CollectPendingOperations(State); // remove the _URE_PAREN off the stack Pop; end; '*': State := MakeExpression(_URE_STAR, State, _URE_NOOP); '+': State := MakeExpression(_URE_PLUS, State, _URE_NOOP); '?': State := MakeExpression(_URE_QUEST, State, _URE_NOOP); '|': begin CollectPendingOperations(State); Push(State); Push(_URE_OR); end; '{': // expressions of the form {m, n} begin C := #0; M := 0; N := 0; // get first number while UnicodeIsWhiteSpace(UCS4(Head^)) do Inc(Head); S := ''; while Head^ in [WideChar('0')..WideChar('9')] do begin S := S + Head^; Inc(Head); end; if S <> '' then M := StrToInt(S); while UnicodeIsWhiteSpace(UCS4(Head^)) do Inc(Head); if (Head^ <> ',') and (Head^ <> '}') then begin FUREBuffer.Error := _URE_INVALID_RANGE; Break; end; // check for an upper limit if Head^ <> '}' then begin Inc(Head); // get second number while UnicodeIsWhiteSpace(UCS4(Head^)) do Inc(Head); S := ''; while Head^ in [WideChar('0')..WideChar('9')] do begin S := S + Head^; Inc(Head); end; if S <> '' then N := StrToInt(S); end else N := M; if Head^ <> '}' then begin FUREBuffer.Error := _URE_RANGE_OPEN; Break; end else Inc(Head); // N = 0 means unlimited number of occurences if N = 0 then begin case M of 0: // {,} {0,} {0, 0} mean the same as the star operator State := MakeExpression(_URE_STAR, State, _URE_NOOP); 1: // {1,} {1, 0} mean the same as the plus operator State := MakeExpression(_URE_PLUS, State, _URE_NOOP); else begin // encapsulate the expanded branches as would they be in parenthesis // in order to avoid unwanted concatenation with pending operations/symbols Push(_URE_PAREN); // {m,} {m, 0} mean M fixed occurences plus star operator // make E^m... for I := 1 to M - 1 do begin Push(State); Push(_URE_AND); end; // ...and repeat the last symbol one or more times State := MakeExpression(_URE_PLUS, State, _URE_NOOP); CollectPendingOperations(State); Pop; end; end; end else begin // check proper range limits if M > N then begin FUREBuffer.Error := _URE_INVALID_RANGE; Break; end; // check special case {0, 1} (which corresponds to the ? operator) if (M = 0) and (N = 1) then State := MakeExpression(_URE_QUEST, State, _URE_NOOP) else begin // handle the general case by expanding {m, n} into the equivalent // expression E^m | E^(m + 1) | ... | E^n // encapsulate the expanded branches as would they be in parenthesis // in order to avoid unwanted concatenation with pending operations/symbols Push(_URE_PAREN); // keep initial state as this is the one all alternatives start from LastState := State; // Consider the special case M = 0 first. Because there's no construct // to enter a pure epsilon-transition into the expression array I // work around with the question mark operator to describe the first // and second branch alternative. if M = 0 then begin State := MakeExpression(_URE_QUEST, State, _URE_NOOP); Inc(M, 2); // Mark the pending OR operation (there must always follow at // least on more alternative because the special case {0, 1} has // already been handled). Push(State); Push(_URE_OR); end; while M <= N do begin State := LastState; // create E^M for I := 1 to Integer(M) - 1 do begin Push(State); Push(_URE_AND); end; // finish the branch and mark it as pending OR operation if it // isn't the last one CollectPendingOperations(State); if M < N then begin Push(State); Push(_URE_OR); end; Inc(M); end; // remove the _URE_PAREN off the stack Pop; end; end; end; else Dec(Head); Symbol := MakeSymbol(Head, Tail - Head, Used); Inc(Head, Used); State := MakeExpression(_URE_SYMBOL, Symbol, _URE_NOOP); end; if (C <> '(') and (C <> '|') and (C <> '{') and (Head < Tail) and (not IsSpecial(Word(Head^)) or (Head^ = '(')) then begin Push(State); Push(_URE_AND); end; end; CollectPendingOperations(State); if FUREBuffer.Stack.ListUsed > 0 then FUREBuffer.Error := _URE_UNBALANCED_GROUP; if FUREBuffer.Error = _URE_OK then Result := State else Result := _URE_NOOP; end; procedure TURESearch.AddSymbolState(Symbol, State: Cardinal); var I, J: Integer; Found: Boolean; begin // Locate the symbol in the symbol table so the state can be added. // If the symbol doesn't exist, then we are in serious trouble. with FUREBuffer.SymbolTable do begin I := 0; while (I < SymbolsUsed) and (Symbol <> Symbols[I].ID) do Inc(I); Assert(I < SymbolsUsed); end; // Now find out if the state exists in the symbol's state list. with FUREBuffer.SymbolTable.Symbols[I].States do begin Found := False; for J := 0 to ListUsed - 1 do begin if State <= List[J] then begin Found := True; Break; end; end; if not Found then J := ListUsed; if not Found or (State < List[J]) then begin // Need to add the state in order. if ListUsed = Length(List) then SetLength(List, Length(List) + 8); if J < ListUsed then Move(List[J], List[J + 1], SizeOf(Cardinal) * (ListUsed - J)); List[J] := State; Inc(ListUsed); end; end; end; function TURESearch.AddState(NewStates: array of Cardinal): Cardinal; var I: Integer; Found: Boolean; begin Found := False; for I := 0 to FUREBuffer.States.StatesUsed - 1 do begin if (FUREBuffer.States.States[I].StateList.ListUsed = Length(NewStates)) and CompareMem(@NewStates[0], @FUREBuffer.States.States[I].StateList.List[0], SizeOf(Cardinal) * Length(NewStates)) then begin Found := True; Break; end; end; if not Found then begin // Need to add a new DFA State (set of NFA states). if FUREBuffer.States.StatesUsed = Length(FUREBuffer.States.States) then SetLength(FUREBuffer.States.States, Length(FUREBuffer.States.States) + 8); with FUREBuffer.States.States[FUREBuffer.States.StatesUsed] do begin ID := FUREBuffer.States.StatesUsed; if (StateList.ListUsed + Length(NewStates)) >= Length(StateList.List) then SetLength(StateList.List, Length(StateList.List) + Length(NewStates) + 8); Move(NewStates[0], StateList.List[StateList.ListUsed], SizeOf(Cardinal) * Length(NewStates)); Inc(StateList.ListUsed, Length(NewStates)); end; Inc(FUREBuffer.States.StatesUsed); end; // Return the ID of the DFA state representing a group of NFA States. if Found then Result := I else Result := FUREBuffer.States.StatesUsed - 1; end; procedure TURESearch.Reduce(Start: Cardinal); var I, J, Symbols: Integer; State, RHS, s1, s2, ns1, ns2: Cardinal; Evaluating: Boolean; begin FUREBuffer.Reducing := True; // Add the starting state for the reduction. AddState([Start]); // Process each set of NFA states that get created. I := 0; // further states are added in the loop while I < FUREBuffer.States.StatesUsed do begin with FUREBuffer, States.States[I], ExpressionList do begin // Push the current states on the stack. for J := 0 to StateList.ListUsed - 1 do Push(StateList.List[J]); // Reduce the NFA states. Accepting := False; Symbols := 0; J := 0; // need a while loop here as the stack will be modified within the loop and // so also its usage count used to terminate the loop while J < FUREBuffer.Stack.ListUsed do begin State := FUREBuffer.Stack.List[J]; Evaluating := True; // This inner loop is the iterative equivalent of recursively // reducing subexpressions generated as a result of a reduction. while Evaluating do begin case Expressions[State].AType of _URE_SYMBOL: begin ns1 := MakeExpression(_URE_ONE, _URE_NOOP, _URE_NOOP); AddSymbolState(Expressions[State].LHS, ns1); Inc(Symbols); Evaluating := False; end; _URE_ONE: begin Accepting := True; Evaluating := False; end; _URE_QUEST: begin s1 := Expressions[State].LHS; ns1 := MakeExpression(_URE_ONE, _URE_NOOP, _URE_NOOP); State := MakeExpression(_URE_OR, ns1, s1); end; _URE_PLUS: begin s1 := Expressions[State].LHS; ns1 := MakeExpression(_URE_STAR, s1, _URE_NOOP); State := MakeExpression(_URE_AND, s1, ns1); end; _URE_STAR: begin s1 := Expressions[State].LHS; ns1 := MakeExpression(_URE_ONE, _URE_NOOP, _URE_NOOP); ns2 := MakeExpression(_URE_PLUS, s1, _URE_NOOP); State := MakeExpression(_URE_OR, ns1, ns2); end; _URE_OR: begin s1 := Expressions[State].LHS; s2 := Expressions[State].RHS; Push(s1); Push(s2); Evaluating := False; end; _URE_AND: begin s1 := Expressions[State].LHS; s2 := Expressions[State].RHS; case Expressions[s1].AType of _URE_SYMBOL: begin AddSymbolState(Expressions[s1].LHS, s2); Inc(Symbols); Evaluating := False; end; _URE_ONE: State := s2; _URE_QUEST: begin ns1 := Expressions[s1].LHS; ns2 := MakeExpression(_URE_AND, ns1, s2); State := MakeExpression(_URE_OR, s2, ns2); end; _URE_PLUS: begin ns1 := Expressions[s1].LHS; ns2 := MakeExpression(_URE_OR, s2, State); State := MakeExpression(_URE_AND, ns1, ns2); end; _URE_STAR: begin ns1 := Expressions[s1].LHS; ns2 := MakeExpression(_URE_AND, ns1, State); State := MakeExpression(_URE_OR, s2, ns2); end; _URE_OR: begin ns1 := Expressions[s1].LHS; ns2 := Expressions[s1].RHS; ns1 := MakeExpression(_URE_AND, ns1, s2); ns2 := MakeExpression(_URE_AND, ns2, s2); State := MakeExpression(_URE_OR, ns1, ns2); end; _URE_AND: begin ns1 := Expressions[s1].LHS; ns2 := Expressions[s1].RHS; ns2 := MakeExpression(_URE_AND, ns2, s2); State := MakeExpression(_URE_AND, ns1, ns2); end; end; end; end; end; Inc(J); end; // clear the state stack while Pop <> _URE_NOOP do { nothing }; // generate the DFA states for the symbols collected during the current reduction if (TransitionsUsed + Symbols) > Length(Transitions) then SetLength(Transitions, Length(Transitions) + Symbols); // go through the symbol table and generate the DFA state transitions for // each symbol that has collected NFA states Symbols := 0; J := 0; while J < FUREBuffer.SymbolTable.SymbolsUsed do begin begin if FUREBuffer.SymbolTable.Symbols[J].States.ListUsed > 0 then begin Transitions[Symbols].LHS := FUREBuffer.SymbolTable.Symbols[J].ID; with FUREBuffer.SymbolTable.Symbols[J] do begin RHS := AddState(Copy(States.List, 0, States.ListUsed)); States.ListUsed := 0; end; Transitions[Symbols].RHS := RHS; Inc(Symbols); end; end; Inc(J); end; // set the number of transitions actually used // Note: we need again to qualify a part of the TransistionsUsed path since the // state array could be reallocated in the AddState call above and the // with ... do will then be invalid. States.States[I].TransitionsUsed := Symbols; end; Inc(I); end; FUREBuffer.Reducing := False; end; procedure TURESearch.AddEquivalentPair(L, R: Cardinal); var I: Integer; begin L := FUREBuffer.States.States[L].ID; R := FUREBuffer.States.States[R].ID; if L <> R then begin if L > R then begin I := L; L := R; R := I; end; // Check to see if the equivalence pair already exists. I := 0; with FUREBuffer.EquivalentList do begin while (I < EquivalentsUsed) and ((Equivalents[I].Left <> L) or (Equivalents[I].Right <> R)) do Inc(I); if I >= EquivalentsUsed then begin if EquivalentsUsed = Length(Equivalents) then SetLength(Equivalents, Length(Equivalents) + 8); Equivalents[EquivalentsUsed].Left := L; Equivalents[EquivalentsUsed].Right := R; Inc(EquivalentsUsed); end; end; end; end; procedure TURESearch.MergeEquivalents; // merges the DFA states that are equivalent var I, J, K, Equal: Integer; Done: Boolean; State1, State2, LeftState, RightState: PUcState; begin for I := 0 to FUREBuffer.States.StatesUsed - 1 do begin State1 := @FUREBuffer.States.States[I]; if State1.ID = Cardinal(I) then begin J := 0; while J < I do begin State2 := @FUREBuffer.States.States[J]; if State2.ID = Cardinal(J) then begin FUREBuffer.EquivalentList.EquivalentsUsed := 0; AddEquivalentPair(I, J); Done := False; Equal := 0; while Equal < FUREBuffer.EquivalentList.EquivalentsUsed do begin LeftState := @FUREBuffer.States.States[FUREBuffer.EquivalentList.Equivalents[Equal].Left]; RightState := @FUREBuffer.States.States[FUREBuffer.EquivalentList.Equivalents[Equal].Right]; if (LeftState.Accepting <> RightState.Accepting) or (LeftState.TransitionsUsed <> RightState.TransitionsUsed) then begin Done := True; Break; end; K := 0; while (K < LeftState.TransitionsUsed) and (LeftState.Transitions[K].LHS = RightState.Transitions[K].LHS) do Inc(K); if K < LeftState.TransitionsUsed then begin Done := True; Break; end; for K := 0 to LeftState.TransitionsUsed - 1 do AddEquivalentPair(LeftState.Transitions[K].RHS, RightState.Transitions[K].RHS); Inc(Equal); end; if not Done then Break; end; Inc(J); end; if J < I then begin with FUREBuffer do begin for Equal := 0 to EquivalentList.EquivalentsUsed - 1 do begin States.States[EquivalentList.Equivalents[Equal].Right].ID := States.States[EquivalentList.Equivalents[Equal].Left].ID; end; end; end; end; end; // Renumber the states appropriately State1 := @FUREBuffer.States.States[0]; Equal := 0; for I := 0 to FUREBuffer.States.StatesUsed - 1 do begin if State1.ID = Cardinal(I) then begin State1.ID := Equal; Inc(Equal); end else State1.ID := FUREBuffer.States.States[State1.ID].ID; Inc(State1); end; end; procedure TURESearch.ClearUREBuffer; var I: Integer; begin with FUREBuffer do begin // quite a few dynamic arrays to free Stack.List := nil; ExpressionList.Expressions := nil; // the symbol table has been handed over to the DFA and will be freed on // release of the DFA SymbolTable.SymbolsUsed := 0; for I := 0 to States.StatesUsed - 1 do begin States.States[I].Transitions := nil; States.States[I].StateList.List := nil; States.States[I].StateList.ListUsed := 0; States.States[I].TransitionsUsed := 0; end; States.StatesUsed := 0; States.States := nil; EquivalentList.Equivalents := nil; end; FillChar(FUREBuffer, SizeOf(FUREBuffer), 0); end; procedure TURESearch.CompileURE(RE: PWideChar; RELength: Cardinal; Casefold: Boolean); var I, J: Integer; State: Cardinal; Run: PUcState; TP: Integer; procedure UREError(Text: string; RE: PWideChar); var S: string; begin S := RE; raise EJclUnicodeError.CreateResFmt(@RsUREErrorFmt, [LoadResString(@RsUREBaseString), Text, S]); end; begin // be paranoid if (RE <> nil) and (RE^ <> WideNull) and (RELength > 0) then begin // Reset the various fields of the compilation buffer. Default the Flags // to indicate the presense of the "^$" pattern. If any other pattern // occurs, then this flag will be removed. This is done to catch this // special pattern and handle it specially when matching. ClearUREBuffer; ClearDFA; FUREBuffer.Flags := _URE_DFA_BLANKLINE; if Casefold then FUREBuffer.Flags := FUREBuffer.Flags or _URE_DFA_CASEFOLD; // Construct the NFA. If this stage returns a 0, then an error occured or an // empty expression was passed. State := ConvertRegExpToNFA(RE, RELength); if State <> _URE_NOOP then begin // Do the expression reduction to get the initial DFA. Reduce(State); // Merge all the equivalent DFA States. MergeEquivalents; // Construct the minimal DFA. FDFA.Flags := FUREBuffer.Flags and (_URE_DFA_CASEFOLD or _URE_DFA_BLANKLINE); // Free up the NFA state groups and transfer the symbols from the buffer // to the DFA. FDFA.SymbolTable := FUREBuffer.SymbolTable; FUREBuffer.SymbolTable.Symbols := nil; // Collect the total number of states and transitions needed for the DFA. State := 0; for I := 0 to FUREBuffer.States.StatesUsed - 1 do begin if FUREBuffer.States.States[I].ID = State then begin Inc(FDFA.StateList.StatesUsed); Inc(FDFA.TransitionList.TransitionsUsed, FUREBuffer.States.States[I].TransitionsUsed); Inc(State); end; end; // Allocate enough space for the states and transitions. SetLength(FDFA.StateList.States, FDFA.StateList.StatesUsed); SetLength(FDFA.TransitionList.Transitions, FDFA.TransitionList.TransitionsUsed); // Actually transfer the DFA States from the buffer. State := 0; TP := 0; Run := @FUREBuffer.States.States[0]; for I := 0 to FUREBuffer.States.StatesUsed - 1 do begin if Run.ID = State then begin FDFA.StateList.States[I].StartTransition := TP; FDFA.StateList.States[I].NumberTransitions := Run.TransitionsUsed; FDFA.StateList.States[I].Accepting := Run.Accepting; // Add the transitions for the state for J := 0 to FDFA.StateList.States[I].NumberTransitions - 1 do begin FDFA.TransitionList.Transitions[TP].Symbol := Run.Transitions[J].LHS; FDFA.TransitionList.Transitions[TP].NextState := FUREBuffer.States.States[Run.Transitions[J].RHS].ID; Inc(TP); end; Inc(State); end; Inc(Run); end; end else begin // there might be an error while parsing the pattern, show it if so case FUREBuffer.Error of _URE_UNEXPECTED_EOS: UREError(LoadResString(@RsUREUnexpectedEOS), RE); _URE_CCLASS_OPEN: UREError(LoadResString(@RsURECharacterClassOpen), RE); _URE_UNBALANCED_GROUP: UREError(LoadResString(@RsUREUnbalancedGroup), RE); _URE_INVALID_PROPERTY: UREError(LoadResString(@RsUREInvalidCharProperty), RE); _URE_INVALID_RANGE: UREError(LoadResString(@RsUREInvalidRepeatRange), RE); _URE_RANGE_OPEN: UREError(LoadResString(@RsURERepeatRangeOpen), RE); else // expression was empty raise EJclUnicodeError.CreateRes(@RsUREExpressionEmpty); end; end; end; end; procedure TURESearch.ClearDFA; var I: Integer; begin with FDFA do begin for I := 0 to SymbolTable.SymbolsUsed - 1 do begin if (SymbolTable.Symbols[I].AType = _URE_CCLASS) or (SymbolTable.Symbols[I].AType = _URE_NCCLASS) then SymbolTable.Symbols[I].Symbol.CCL.Ranges := nil; end; for I := 0 to SymbolTable.SymbolsUsed - 1 do begin FDFA.SymbolTable.Symbols[I].States.List := nil; FDFA.SymbolTable.Symbols[I].States.ListUsed := 0; end; SymbolTable.SymbolsUsed := 0; SymbolTable.Symbols := nil; StateList.States := nil; TransitionList.Transitions := nil; end; FillChar(FDFA, SizeOf(FDFA), 0); end; function IsSeparator(C: UCS4): Boolean; begin Result := (C = $D) or (C = $A) or (C = $2028) or (C = $2029); end; function TURESearch.ExecuteURE(Flags: Cardinal; Text: PUCS2; TextLen: Cardinal; var MatchStart, MatchEnd: Cardinal): Boolean; var I, J: Integer; Matched, Found: Boolean; Start, Stop: Integer; C: UCS4; Run, Tail, lp: PUCS2; LastState: PDFAState; Symbol: PUcSymbolTableEntry; Rp: PUcRange; begin Result := False; if Text <> nil then begin // Handle the special case of an empty string matching the "^$" pattern. if (Textlen = 0) and ((FDFA.Flags and _URE_DFA_BLANKLINE) <> 0) then begin MatchStart := 0; MatchEnd := 0; Result := True; Exit; end; Run := Text; Tail := Run + TextLen; Start := -1; Stop := -1; LastState := @FDFA.StateList.States[0]; Found := False; while not Found and (Run < Tail) do begin lp := Run; C := UCS4(Run^); Inc(Run); // Check to see if this is a high surrogate that should be combined with a // following low surrogate. if (Run < Tail) and (SurrogateHighStart <= C) and (C <= SurrogateHighEnd) and (SurrogateLowStart <= UCS4(Run^)) and (UCS4(Run^) <= SurrogateLowEnd) then begin C := $10000 + (((C and $03FF) shl 10) or (UCS4(Run^) and $03FF)); Inc(Run); end; // Determine if the character is non-spacing and should be skipped. if ((Flags and URE_IGNORE_NONSPACING) <> 0) and UnicodeIsNonSpacingMark(C) then begin Inc(Run); Continue; end; if (FDFA.Flags and _URE_DFA_CASEFOLD) <> 0 then { TODO : use the entire mapping, not only the first character } C := UnicodeToLower(C)[0]; // See if one of the transitions matches. I := LastState.NumberTransitions - 1; Matched := False; while not Matched and (I >= 0) do begin Symbol := @FDFA.SymbolTable.Symbols[FDFA.TransitionList.Transitions[LastState.StartTransition + I].Symbol]; case Symbol.AType of _URE_ANY_CHAR: if ((Flags and URE_DONT_MATCHES_SEPARATORS) <> 0) or not IsSeparator(C) then Matched := True; _URE_CHAR: if C = Symbol.Symbol.Chr then Matched := True; _URE_BOL_ANCHOR: if Lp = Text then begin Run := lp; Matched := True; end else begin if IsSeparator(C) then begin if (C = $D) and (Run < Tail) and (Run^ = #$A) then Inc(Run); Lp := Run; Matched := True; end; end; _URE_EOL_ANCHOR: if IsSeparator(C) then begin // Put the pointer back before the separator so the match end // position will be correct. This will also cause the `Run' // pointer to be advanced over the current separator once the // match end point has been recorded. Run := Lp; Matched := True; end; _URE_CCLASS, _URE_NCCLASS: with Symbol^ do begin if Categories <> [] then Matched := CategoryLookup(C, Categories); if Symbol.CCL.RangesUsed > 0 then begin Rp := @Symbol.CCL.Ranges[0]; for J := 0 to Symbol.CCL.RangesUsed - 1 do begin if (Rp.MinCode <= C) and (C <= Rp.MaxCode) then begin Matched := True; Break; end; Inc(Rp); end; end; if AType = _URE_NCCLASS then Matched := not Matched; end; end; if Matched then begin if Start = -1 then Start := Lp - Text else Stop := Run - Text; LastState := @FDFA.StateList.States[FDFA.TransitionList.Transitions[LastState.StartTransition + I].NextState]; // If the match was an EOL anchor, adjust the pointer past the separator // that caused the match. The correct match position has been recorded // already. if Symbol.AType = _URE_EOL_ANCHOR then begin // skip the character that caused the match Inc(Run); // handle the infamous CRLF situation if (Run < Tail) and (C = $D) and (Run^ = #$A) then Inc(Run); end; end; Dec(I); end; if not Matched then begin Found := LastState.Accepting; if not Found then begin // If the last state was not accepting, then reset and start over. LastState := @FDFA.StateList.States[0]; Start := -1; Stop := -1; end else begin // set start and stop pointer if not yet done if Start = -1 then begin Start := Lp - Text; Stop := Run - Text; end else begin if Stop = -1 then Stop := Lp - Text; end; end; end else begin if Run = Tail then begin if not LastState.Accepting then begin // This ugly hack is to make sure the end-of-line anchors match // when the source text hits the end. This is only done if the last // subexpression matches. for I := 0 to LastState.NumberTransitions - 1 do begin if Found then Break; Symbol := @FDFA.SymbolTable.Symbols[FDFA.TransitionList.Transitions[LastState.StartTransition + I].Symbol]; if Symbol.AType =_URE_EOL_ANCHOR then begin LastState := @FDFA.StateList.States[FDFA.TransitionList.Transitions[LastState.StartTransition + I].NextState]; if LastState.Accepting then begin Stop := Run - Text; Found := True; end else Break; end; end; end else begin // Make sure any conditions that match all the way to the end of // the string match. Found := True; Stop := Run - Text; end; end; end; end; if Found then begin MatchStart := Start; MatchEnd := Stop; end; Result := Found; end; end; function TURESearch.FindAll(const Text: WideString): Boolean; begin Result := FindAll(PWideChar(Text), Length(Text)); end; function TURESearch.FindAll(Text: PWideChar; TextLen: Cardinal): Boolean; // Looks for all occurences of the pattern passed to FindPrepare and creates an // internal list of their positions. var Start, Stop: Cardinal; Run: PWideChar; RunLen: Cardinal; begin ClearResults; Run := Text; RunLen := TextLen; // repeat to find all occurences of the pattern while ExecuteURE(0, Run, RunLen, Start, Stop) do begin // store this result (consider text pointer movement)... AddResult(Start + Run - Text, Stop + Run - Text); // ... and advance text position and length Inc(Run, Stop); Dec(RunLen, Stop); end; Result := FResults.Count > 0; end; function TURESearch.FindFirst(const Text: WideString; var Start, Stop: Cardinal): Boolean; begin Result := FindFirst(PWideChar(Text), Length(Text), Start, Stop); end; function TURESearch.FindFirst(Text: PWideChar; TextLen: Cardinal; var Start, Stop: Cardinal): Boolean; // Looks for the first occurence of the pattern passed to FindPrepare in Text and // returns True if one could be found (in which case Start and Stop are set to // the according indices) otherwise False. This function is in particular of // interest if only one occurence needs to be found. begin ClearResults; Result := ExecuteURE(0, PWideChar(Text), Length(Text), Start, Stop); if Result then AddResult(Start, Stop); end; procedure TURESearch.FindPrepare(Pattern: PWideChar; PatternLength: Cardinal; Options: TSearchFlags); begin CompileURE(Pattern, PatternLength, not (sfCaseSensitive in Options)); end; procedure TURESearch.FindPrepare(const Pattern: WideString; Options: TSearchFlags); begin CompileURE(PWideChar(Pattern), Length(Pattern), not (sfCaseSensitive in Options)); end; //=== { TWideStrings } ======================================================= constructor TWideStrings.Create; begin inherited Create; FLanguage := GetUserDefaultLCID; FNormalizationForm := nfC; FSaveFormat := sfUnicodeLSB; end; procedure TWideStrings.SetLanguage(Value: LCID); begin FLanguage := Value; end; function TWideStrings.GetSaveUnicode: Boolean; begin Result := SaveFormat = sfUnicodeLSB; end; procedure TWideStrings.SetSaveUnicode(const Value: Boolean); begin if Value then SaveFormat := sfUnicodeLSB else SaveFormat := sfAnsi; end; function TWideStrings.Add(const S: WideString): Integer; begin Result := GetCount; Insert(Result, S); end; function TWideStrings.AddObject(const S: WideString; AObject: TObject): Integer; begin Result := Add(S); PutObject(Result, AObject); end; procedure TWideStrings.Append(const S: WideString); begin Add(S); end; procedure TWideStrings.AddStrings(Strings: TStrings); var I: Integer; S: WideString; CP: Integer; begin BeginUpdate; try CP := CodePageFromLocale(FLanguage); for I := 0 to Strings.Count - 1 do begin S := StringToWideStringEx(Strings[I], CP); AddObject(S, Strings.Objects[I]); end; finally EndUpdate; end; end; procedure TWideStrings.AddStrings(Strings: TWideStrings); var I: Integer; SourceCP, TargetCP: Integer; S: WideString; begin Assert(Strings <> nil); BeginUpdate; try if Strings.FLanguage <> FLanguage then begin SourceCP := CodePageFromLocale(Strings.FLanguage); TargetCP := CodePageFromLocale(FLanguage); for I := 0 to Strings.Count - 1 do begin S := TranslateString(Strings[I], SourceCP, TargetCP); AddObject(S, Strings.Objects[I]); end; end else begin for I := 0 to Strings.Count - 1 do AddObject(Strings[I], Strings.Objects[I]); end; finally EndUpdate; end; end; procedure TWideStrings.Assign(Source: TPersistent); // usual assignment routine, but able to assign wide and small strings begin if Source is TWideStrings then begin BeginUpdate; try Clear; AddStrings(TWideStrings(Source)); finally EndUpdate; end; end else begin if Source is TStrings then begin BeginUpdate; try Clear; AddStrings(TStrings(Source)); finally EndUpdate; end; end else inherited Assign(Source); end; end; procedure TWideStrings.AssignTo(Dest: TPersistent); // need to do also assignment to old style TStrings, but this class doesn't know // TWideStrings, so we need to do it from here var I: Integer; S: string; CP: Integer; begin if Dest is TStrings then begin with Dest as TStrings do begin BeginUpdate; try CP := CodePageFromLocale(FLanguage); Clear; for I := 0 to Self.Count - 1 do begin S := WideStringToStringEx(Self[I], CP); AddObject(S, Self.Objects[I]); end; finally EndUpdate; end; end; end else begin if Dest is TWideStrings then begin with Dest as TWideStrings do begin BeginUpdate; try Clear; AddStrings(Self); finally EndUpdate; end; end; end else inherited AssignTo(Dest); end; end; procedure TWideStrings.BeginUpdate; begin if FUpdateCount = 0 then SetUpdateState(True); Inc(FUpdateCount); end; procedure TWideStrings.DefineProperties(Filer: TFiler); // Defines a private property for the content of the list. // There's a bug in the handling of text DFMs in Classes.pas which prevents // WideStrings from loading under some circumstances. Zbysek Hlinka // (zhlinka att login dott cz) brought this to my attention and supplied also a solution. // See ReadData and WriteData methods for implementation details. //--------------- local function -------------------------------------------- function DoWrite: Boolean; begin if Filer.Ancestor <> nil then begin Result := True; if Filer.Ancestor is TWideStrings then Result := not Equals(TWideStrings(Filer.Ancestor)) end else Result := Count > 0; end; //--------------- end local function ---------------------------------------- begin Filer.DefineProperty('WideStrings', ReadData, WriteData, DoWrite); end; procedure TWideStrings.DoConfirmConversion(var Allowed: Boolean); begin if Assigned(FOnConfirmConversion) then FOnConfirmConversion(Self, Allowed); end; procedure TWideStrings.EndUpdate; begin Dec(FUpdateCount); if FUpdateCount = 0 then SetUpdateState(False); end; function TWideStrings.Equals(Strings: TWideStrings): Boolean; var I, Count: Integer; begin Assert(Strings <> nil); Result := False; Count := GetCount; if Count <> Strings.GetCount then Exit; { TODO : use internal comparation routine as soon as composition is implemented } for I := 0 to Count - 1 do if Get(I) <> Strings.Get(I) then Exit; Result := True; end; procedure TWideStrings.Error(const Msg: string; Data: Integer); function ReturnAddr: Pointer; asm MOV EAX, [EBP + 4] end; begin raise EStringListError.CreateFmt(Msg, [Data]) at ReturnAddr; end; procedure TWideStrings.Exchange(Index1, Index2: Integer); var TempObject: TObject; TempString: WideString; begin BeginUpdate; try TempString := Strings[Index1]; TempObject := Objects[Index1]; Strings[Index1] := Strings[Index2]; Objects[Index1] := Objects[Index2]; Strings[Index2] := TempString; Objects[Index2] := TempObject; finally EndUpdate; end; end; function TWideStrings.GetCapacity: Integer; // Descendants may optionally override/replace this default implementation. begin Result := Count; end; function TWideStrings.GetCommaText: WideString; var S: WideString; P: PWideChar; I, Count: Integer; begin Count := GetCount; if (Count = 1) and (Get(0) = '') then Result := '""' else begin Result := ''; for I := 0 to Count - 1 do begin S := Get(I); P := PWideChar(S); while not (P^ in [WideNull..WideSpace, WideChar('"'), WideChar(',')]) do Inc(P); if P^ <> WideNull then S := WideQuotedStr(S, '"'); Result := Result + S + ','; end; System.Delete(Result, Length(Result), 1); end; end; function TWideStrings.GetName(Index: Integer): WideString; var P: Integer; begin Result := Get(Index); P := Pos('=', Result); if P > 0 then SetLength(Result, P - 1) else Result := ''; end; function TWideStrings.GetObject(Index: Integer): TObject; begin Result := nil; end; function TWideStrings.GetSeparatedText(Separators: WideString): WideString; // Same as GetText but with customizable separator characters. var I, L, Size, Count, SepSize: Integer; P: PWideChar; S: WideString; begin Count := GetCount; SepSize := Length(Separators); Size := 0; for I := 0 to Count - 1 do Inc(Size, Length(Get(I)) + SepSize); // set one separator less, the last line does not need a trailing separator SetLength(Result, Size - SepSize); if Size > 0 then begin P := Pointer(Result); I := 0; while True do begin S := Get(I); L := Length(S); if L <> 0 then begin // add current string System.Move(Pointer(S)^, P^, 2 * L); Inc(P, L); end; Inc(I); if I = Count then Break; // add separators System.Move(Pointer(Separators)^, P^, SizeOf(WideChar) * SepSize); Inc(P, SepSize); end; end; end; function TWideStrings.GetTextStr: WideString; begin Result := GetSeparatedText(WideCRLF); end; function TWideStrings.GetText: PWideChar; begin Result := StrNewW(GetTextStr); end; function TWideStrings.GetValue(const Name: WideString): WideString; var I: Integer; begin I := IndexOfName(Name); if I >= 0 then Result := Copy(Get(I), Length(Name) + 2, MaxInt) else Result := ''; end; function TWideStrings.IndexOf(const S: WideString): Integer; var NormString: WideString; begin NormString := WideNormalize(S, FNormalizationForm); for Result := 0 to GetCount - 1 do if WideCompareText(Get(Result), NormString, FLanguage) = 0 then Exit; Result := -1; end; function TWideStrings.IndexOfName(const Name: WideString): Integer; var P: Integer; S: WideString; NormName: WideString; begin NormName := WideNormalize(Name, FNormalizationForm); for Result := 0 to GetCount - 1 do begin S := Get(Result); P := Pos('=', S); if (P > 0) and (WideCompareText(Copy(S, 1, P - 1), NormName, FLanguage) = 0) then Exit; end; Result := -1; end; function TWideStrings.IndexOfObject(AObject: TObject): Integer; begin for Result := 0 to GetCount - 1 do if GetObject(Result) = AObject then Exit; Result := -1; end; procedure TWideStrings.InsertObject(Index: Integer; const S: WideString; AObject: TObject); begin Insert(Index, S); PutObject(Index, AObject); end; procedure TWideStrings.LoadFromFile(const FileName: string); var Stream: TStream; begin try Stream := TFileStream.Create(FileName, fmOpenRead or fmShareDenyNone); try LoadFromStream(Stream); finally Stream.Free; end; except RaiseLastOSError; end; end; procedure TWideStrings.LoadFromStream(Stream: TStream); // usual loader routine, but enhanced to handle byte order marks in stream var Size, BytesRead: Integer; ByteOrderMask: array [0..5] of Byte; // BOM size is max 5 bytes (cf: wikipedia) // but it is easier to implement with a multiple of 2 Loaded: Boolean; SW: WideString; SA: string; begin BeginUpdate; try Loaded := False; Size := Stream.Size - Stream.Position; BytesRead := Stream.Read(ByteOrderMask[0],SizeOf(ByteOrderMask)); // UTF16 LSB = Unicode LSB if (BytesRead >= 2) and (ByteOrderMask[0] = BOM_UTF16_LSB[0]) and (ByteOrderMask[1] = BOM_UTF16_LSB[1]) then begin FSaveFormat := sfUTF16LSB; SetLength(SW, (Size - 2) div SizeOf(WideChar)); Assert((Size and 1) <> 1,'Number of chars must be a multiple of 2'); System.Move(ByteOrderMask[2],SW[1],BytesRead-2); // max 4 bytes = 2 widechars Stream.Read(SW[3], Size-BytesRead); // first 2 chars were copied by System.Move SetText(SW); Loaded := True; end; // UTF16 MSB = Unicode MSB if (BytesRead >= 2) and (ByteOrderMask[0] = BOM_UTF16_MSB[0]) and (ByteOrderMask[1] = BOM_UTF16_MSB[1]) then begin FSaveFormat := sfUTF16MSB; SetLength(SW, (Size - 2) div SizeOf(WideChar)); Assert((Size and 1) <> 1,'Number of chars must be a multiple of 2'); System.Move(ByteOrderMask[2],SW[1],BytesRead-2); // max 4 bytes = 2 widechars Stream.Read(SW[3], Size-BytesRead); // first 2 chars were copied by System.Move StrSwapByteOrder(PWideChar(SW)); SetText(SW); Loaded := True; end; // UTF8 if (BytesRead >= 3) and (ByteOrderMask[0] = BOM_UTF8[0]) and (ByteOrderMask[1] = BOM_UTF8[1]) and (ByteOrderMask[2] = BOM_UTF8[2]) then begin FSaveFormat := sfUTF8; SetLength(SA, (Size-3) div SizeOf(Char)); System.Move(ByteOrderMask[3],SA[1],BytesRead-3); // max 3 bytes = 3 chars Stream.Read(SA[4], Size-BytesRead); // first 3 chars were copied by System.Move SW := UTF8ToWideString(SA); SetText(SW); Loaded := True; end; // default case (Ansi) if not Loaded then begin FSaveFormat := sfAnsi; SetLength(SA, Size div SizeOf(Char)); System.Move(ByteOrderMask[0],SA[1],BytesRead); // max 6 bytes = 6 chars Stream.Read(SA[7], Size-BytesRead); // first 6 chars were copied by System.Move SetText(SA); end; finally EndUpdate; end; end; procedure TWideStrings.Move(CurIndex, NewIndex: Integer); var TempObject: TObject; TempString: WideString; begin if CurIndex <> NewIndex then begin BeginUpdate; try TempString := Get(CurIndex); TempObject := GetObject(CurIndex); Delete(CurIndex); InsertObject(NewIndex, TempString, TempObject); finally EndUpdate; end; end; end; procedure TWideStrings.ReadData(Reader: TReader); begin case Reader.NextValue of vaLString, vaString: SetText(Reader.ReadString); else SetText(Reader.ReadWideString); end; end; procedure TWideStrings.SaveToFile(const FileName: string); var Stream: TStream; begin Stream := TFileStream.Create(FileName, fmCreate); try SaveToStream(Stream); finally Stream.Free; end; end; procedure TWideStrings.SaveToStream(Stream: TStream; WithBOM: Boolean = True); // Saves the currently loaded text into the given stream. WithBOM determines whether to write a // byte order mark or not. Note: when saved as ANSI text there will never be a BOM. var SW: WideString; SA: string; Allowed: Boolean; Run: PWideChar; begin // The application can decide in which format to save the content. // If FSaveUnicode is False then all strings are saved in standard ANSI format // which is also loadable by TStrings but you should be aware that all Unicode // strings are then converted to ANSI based on the current system locale. // An extra event is supplied to ask the user about the potential loss of // information when converting Unicode to ANSI strings. SW := GetTextStr; Allowed := True; FSaved := False; // be pessimistic // A check for potential information loss makes only sense if the application has // set an event to be used as call back to ask about the conversion. if (FSaveFormat = sfAnsi) and Assigned(FOnConfirmConversion) then begin // application requests to save only ANSI characters, so check the text and // call back in case information could be lost Run := PWideChar(SW); // only ask if there's at least one Unicode character in the text while Run^ in [WideChar(#1)..WideChar(#255)] do Inc(Run); // Note: The application can still set FSaveUnicode to True in the callback. if Run^ <> WideNull then DoConfirmConversion(Allowed); end; if Allowed then begin // only save if allowed case SaveFormat of sfUTF16LSB : begin Stream.WriteBuffer(BOM_UTF16_LSB[0],SizeOf(BOM_UTF16_LSB)); Stream.WriteBuffer(SW[1],Length(SW)*SizeOf(UTF16)); FSaved := True; end; sfUTF16MSB : begin Stream.WriteBuffer(BOM_UTF16_MSB[0],SizeOf(BOM_UTF16_MSB)); StrSwapByteOrder(PWideChar(SW)); Stream.WriteBuffer(SW[1],Length(SW)*SizeOf(UTF16)); FSaved := True; end; sfUTF8 : begin Stream.WriteBuffer(BOM_UTF8[0],SizeOf(BOM_UTF8)); SA := WideStringToUTF8(SW); Stream.WriteBuffer(SA[1],Length(SA)*SizeOf(UTF8)); FSaved := True; end; sfAnsi : begin SA := WideStringToStringEx(SW,CodePageFromLocale(FLanguage)); Stream.WriteBuffer(SA[1],Length(SA)*SizeOf(Char)); FSaved := True; end; end; end; end; procedure TWideStrings.SetCapacity(NewCapacity: Integer); begin // do nothing - descendants may optionally implement this method end; procedure TWideStrings.SetCommaText(const Value: WideString); var P, P1: PWideChar; S: WideString; begin BeginUpdate; try Clear; P := PWideChar(Value); while P^ in [WideChar(#1)..WideSpace] do Inc(P); while P^ <> WideNull do begin if P^ = '"' then S := WideExtractQuotedStr(P, '"') else begin P1 := P; while (P^ > WideSpace) and (P^ <> ',') do Inc(P); SetString(S, P1, P - P1); end; Add(S); while P^ in [WideChar(#1)..WideSpace] do Inc(P); if P^ = ',' then begin repeat Inc(P); until not (P^ in [WideChar(#1)..WideSpace]); end; end; finally EndUpdate; end; end; procedure TWideStrings.SetText(const Value: WideString); var Head, Tail: PWideChar; S: WideString; begin BeginUpdate; try Clear; Head := PWideChar(Value); while Head^ <> WideNull do begin Tail := Head; while not (Tail^ in [WideNull, WideLineFeed, WideCarriageReturn, WideVerticalTab, WideFormFeed]) and (Tail^ <> WideLineSeparator) and (Tail^ <> WideParagraphSeparator) do Inc(Tail); SetString(S, Head, Tail - Head); Add(S); Head := Tail; if Head^ <> WideNull then begin Inc(Head); if (Tail^ = WideCarriageReturn) and (Head^ = WideLineFeed) then Inc(Head); end; end; finally EndUpdate; end; end; procedure TWideStrings.SetUpdateState(Updating: Boolean); begin end; procedure TWideStrings.SetNormalizationForm(const Value: TNormalizationForm); var I: Integer; begin if FNormalizationForm <> Value then begin FNormalizationForm := Value; if FNormalizationForm <> nfNone then begin // renormalize all strings according to the new form for I := 0 to GetCount - 1 do Put(I, WideNormalize(Get(I), FNormalizationForm)); end; end; end; procedure TWideStrings.SetValue(const Name, Value: WideString); var I : Integer; begin I := IndexOfName(Name); if Value <> '' then begin if I < 0 then I := Add(''); Put(I, Name + '=' + Value); end else begin if I >= 0 then Delete(I); end; end; procedure TWideStrings.WriteData(Writer: TWriter); begin Writer.WriteWideString(GetTextStr); end; //=== { TWideStringList } ==================================================== destructor TWideStringList.Destroy; begin FOnChange := nil; FOnChanging := nil; Clear; inherited Destroy; end; function TWideStringList.Add(const S: WideString): Integer; begin if not Sorted then Result := FCount else begin if Find(S, Result) then begin case Duplicates of dupIgnore: Exit; dupError: Error(SDuplicateString, 0); end; end; end; InsertItem(Result, S); end; procedure TWideStringList.Changed; begin if (FUpdateCount = 0) and Assigned(FOnChange) then FOnChange(Self); end; procedure TWideStringList.Changing; begin if (FUpdateCount = 0) and Assigned(FOnChanging) then FOnChanging(Self); end; procedure TWideStringList.Clear; {$IFDEF OWN_WIDESTRING_MEMMGR} var I: Integer; {$ENDIF OWN_WIDESTRING_MEMMGR} begin if FCount <> 0 then begin Changing; {$IFDEF OWN_WIDESTRING_MEMMGR} for I := 0 to FCount - 1 do with FList[I] do if TDynWideCharArray(FString) <> nil then TDynWideCharArray(FString) := nil; {$ENDIF OWN_WIDESTRING_MEMMGR} // this will automatically finalize the array FList := nil; FCount := 0; SetCapacity(0); Changed; end; end; procedure TWideStringList.Delete(Index: Integer); begin if Cardinal(Index) >= Cardinal(FCount) then Error(SListIndexError, Index); Changing; {$IFDEF OWN_WIDESTRING_MEMMGR} SetListString(Index, ''); {$ELSE} FList[Index].FString := ''; {$ENDIF OWN_WIDESTRING_MEMMGR} Dec(FCount); if Index < FCount then begin System.Move(FList[Index + 1], FList[Index], (FCount - Index) * SizeOf(TWideStringItem)); Pointer(FList[FCount].FString) := nil; // avoid freeing the string, the address is now used in another element end; Changed; end; procedure TWideStringList.Exchange(Index1, Index2: Integer); begin if Cardinal(Index1) >= Cardinal(FCount) then Error(SListIndexError, Index1); if Cardinal(Index2) >= Cardinal(FCount) then Error(SListIndexError, Index2); Changing; ExchangeItems(Index1, Index2); Changed; end; procedure TWideStringList.ExchangeItems(Index1, Index2: Integer); var Temp: TWideStringItem; begin Temp := FList[Index1]; FList[Index1] := FList[Index2]; FList[Index2] := Temp; end; function TWideStringList.Find(const S: WideString; var Index: Integer): Boolean; var L, H, I, C: Integer; NormString: WideString; begin Result := False; NormString := WideNormalize(S, FNormalizationForm); L := 0; H := FCount - 1; while L <= H do begin I := (L + H) shr 1; C := WideCompareText(FList[I].FString, NormString, FLanguage); if C < 0 then L := I+1 else begin H := I - 1; if C = 0 then begin Result := True; if Duplicates <> dupAccept then L := I; end; end; end; Index := L; end; function TWideStringList.Get(Index: Integer): WideString; {$IFDEF OWN_WIDESTRING_MEMMGR} var Len: Integer; {$ENDIF OWN_WIDESTRING_MEMMGR} begin if Cardinal(Index) >= Cardinal(FCount) then Error(SListIndexError, Index); {$IFDEF OWN_WIDESTRING_MEMMGR} with FList[Index] do begin Len := Length(TDynWideCharArray(FString)); if Len > 0 then begin SetLength(Result, Len - 1); // exclude #0 if Result <> '' then System.Move(FString^, Result[1], Len * SizeOf(WideChar)); end else Result := ''; end; {$ELSE} Result := FList[Index].FString; {$ENDIF OWN_WIDESTRING_MEMMGR} end; function TWideStringList.GetCapacity: Integer; begin Result := Length(FList); end; function TWideStringList.GetCount: Integer; begin Result := FCount; end; function TWideStringList.GetObject(Index: Integer): TObject; begin if Cardinal(Index) >= Cardinal(FCount) then Error(SListIndexError, Index); Result := FList[Index].FObject; end; procedure TWideStringList.Grow; var Delta, Len: Integer; begin Len := Length(FList); if Len > 64 then Delta := Len div 4 else begin if Len > 8 then Delta := 16 else Delta := 4; end; SetCapacity(Len + Delta); end; function TWideStringList.IndexOf(const S: WideString): Integer; begin if not Sorted then Result := inherited IndexOf(S) else if not Find(S, Result) then Result := -1; end; procedure TWideStringList.Insert(Index: Integer; const S: WideString); begin if Sorted then Error(SSortedListError, 0); if Cardinal(Index) > Cardinal(FCount) then Error(SListIndexError, Index); InsertItem(Index, S); end; {$IFDEF OWN_WIDESTRING_MEMMGR} procedure TWideStringList.SetListString(Index: Integer; const S: WideString); var Len: Integer; A: TDynWideCharArray; begin with FList[Index] do begin A := TDynWideCharArray(FString); if A <> nil then A := nil; // free memory Len := Length(S); if Len > 0 then begin SetLength(A, Len + 1); // include #0 System.Move(S[1], A[0], Len * SizeOf(WideChar)); A[Len] := #0; end; FString := PWideChar(A); Pointer(A) := nil; // do not release the array on procedure exit end; end; {$ENDIF OWN_WIDESTRING_MEMMGR} procedure TWideStringList.InsertItem(Index: Integer; const S: WideString); begin Changing; if FCount = Length(FList) then Grow; if Index < FCount then System.Move(FList[Index], FList[Index + 1], (FCount - Index) * SizeOf(TWideStringItem)); with FList[Index] do begin Pointer(FString) := nil; // avoid freeing the string, the address is now used in another element FObject := nil; if (FNormalizationForm <> nfNone) and (Length(S) > 0) then {$IFDEF OWN_WIDESTRING_MEMMGR} SetListString(Index, WideNormalize(S, FNormalizationForm)) else SetListString(Index, S); {$ELSE} FString := WideNormalize(S, FNormalizationForm) else FString := S; {$ENDIF OWN_WIDESTRING_MEMMGR} end; Inc(FCount); Changed; end; procedure TWideStringList.Put(Index: Integer; const S: WideString); begin if Sorted then Error(SSortedListError, 0); if Cardinal(Index) >= Cardinal(FCount) then Error(SListIndexError, Index); Changing; if (FNormalizationForm <> nfNone) and (Length(S) > 0) then {$IFDEF OWN_WIDESTRING_MEMMGR} SetListString(Index, WideNormalize(S, FNormalizationForm)) else SetListString(Index, S); {$ELSE} FList[Index].FString := WideNormalize(S, FNormalizationForm) else FList[Index].FString := S; {$ENDIF OWN_WIDESTRING_MEMMGR} Changed; end; procedure TWideStringList.PutObject(Index: Integer; AObject: TObject); begin if Cardinal(Index) >= Cardinal(FCount) then Error(SListIndexError, Index); Changing; FList[Index].FObject := AObject; Changed; end; procedure TWideStringList.QuickSort(L, R: Integer); var I, J: Integer; P: WideString; begin repeat I := L; J := R; P := FList[(L + R) shr 1].FString; repeat while WideCompareText(FList[I].FString, P, FLanguage) < 0 do Inc(I); while WideCompareText(FList[J].FString, P, FLanguage) > 0 do Dec(J); if I <= J then begin ExchangeItems(I, J); Inc(I); Dec(J); end; until I > J; if L < J then QuickSort(L, J); L := I; until I >= R; end; procedure TWideStringList.SetCapacity(NewCapacity: Integer); begin SetLength(FList, NewCapacity); if NewCapacity < FCount then FCount := NewCapacity; end; procedure TWideStringList.SetSorted(Value: Boolean); begin if FSorted <> Value then begin if Value then Sort; FSorted := Value; end; end; procedure TWideStringList.SetUpdateState(Updating: Boolean); begin if Updating then Changing else Changed; end; procedure TWideStringList.Sort; begin if not Sorted and (FCount > 1) then begin Changing; QuickSort(0, FCount - 1); Changed; end; end; procedure TWideStringList.SetLanguage(Value: LCID); begin inherited SetLanguage(Value); if Sorted then Sort; end; //----------------- functions for null terminated strings ------------------------------------------ function StrLenW(Str: PWideChar): Cardinal; // returns number of characters in a string excluding the null terminator asm MOV EDX, EDI MOV EDI, EAX MOV ECX, 0FFFFFFFFH XOR AX, AX REPNE SCASW MOV EAX, 0FFFFFFFEH SUB EAX, ECX MOV EDI, EDX end; function StrEndW(Str: PWideChar): PWideChar; // returns a pointer to the end of a null terminated string asm MOV EDX, EDI MOV EDI, EAX MOV ECX, 0FFFFFFFFH XOR AX, AX REPNE SCASW LEA EAX, [EDI - 2] MOV EDI, EDX end; function StrMoveW(Dest, Source: PWideChar; Count: Cardinal): PWideChar; // Copies the specified number of characters to the destination string and returns Dest // also as result. Dest must have enough room to store at least Count characters. asm PUSH ESI PUSH EDI MOV ESI, EDX MOV EDI, EAX MOV EDX, ECX CMP EDI, ESI JG @@1 JE @@2 SHR ECX, 1 REP MOVSD MOV ECX, EDX AND ECX, 1 REP MOVSW JMP @@2 @@1: LEA ESI, [ESI + 2 * ECX - 2] LEA EDI, [EDI + 2 * ECX - 2] STD AND ECX, 1 REP MOVSW SUB EDI, 2 SUB ESI, 2 MOV ECX, EDX SHR ECX, 1 REP MOVSD CLD @@2: POP EDI POP ESI end; function StrCopyW(Dest, Source: PWideChar): PWideChar; // copies Source to Dest and returns Dest asm PUSH EDI PUSH ESI MOV ESI, EAX MOV EDI, EDX MOV ECX, 0FFFFFFFFH XOR AX, AX REPNE SCASW NOT ECX MOV EDI, ESI MOV ESI, EDX MOV EDX, ECX MOV EAX, EDI SHR ECX, 1 REP MOVSD MOV ECX, EDX AND ECX, 1 REP MOVSW POP ESI POP EDI end; function StrECopyW(Dest, Source: PWideChar): PWideChar; // copies Source to Dest and returns a pointer to the null character ending the string asm PUSH EDI PUSH ESI MOV ESI, EAX MOV EDI, EDX MOV ECX, 0FFFFFFFFH XOR AX, AX REPNE SCASW NOT ECX MOV EDI, ESI MOV ESI, EDX MOV EDX, ECX SHR ECX, 1 REP MOVSD MOV ECX, EDX AND ECX, 1 REP MOVSW LEA EAX, [EDI - 2] POP ESI POP EDI end; function StrLCopyW(Dest, Source: PWideChar; MaxLen: Cardinal): PWideChar; // copies a specified maximum number of characters from Source to Dest asm PUSH EDI PUSH ESI PUSH EBX MOV ESI, EAX MOV EDI, EDX MOV EBX, ECX XOR AX, AX TEST ECX, ECX JZ @@1 REPNE SCASW JNE @@1 INC ECX @@1: SUB EBX, ECX MOV EDI, ESI MOV ESI, EDX MOV EDX, EDI MOV ECX, EBX SHR ECX, 1 REP MOVSD MOV ECX, EBX AND ECX, 1 REP MOVSW STOSW MOV EAX, EDX POP EBX POP ESI POP EDI end; function StrPCopyWW(Dest: PWideChar; const Source: WideString): PWideChar; // copies a Pascal-style WideString to a null-terminated wide string begin Result := StrLCopyW(Dest, PWideChar(Source), Length(Source)); end; function StrPCopyW(Dest: PWideChar; const Source: string): PWideChar; // copies a Pascal-style string to a null-terminated wide string begin Result := StrPLCopyW(Dest, Source, Cardinal(Length(Source))); Result[Length(Source)] := WideNull; end; function StrPLCopyWW(Dest: PWideChar; const Source: WideString; MaxLen: Cardinal): PWideChar; // copies characters from a Pascal-style WideString into a null-terminated wide string begin Result := StrLCopyW(Dest, PWideChar(Source), MaxLen); end; function StrPLCopyW(Dest: PWideChar; const Source: string; MaxLen: Cardinal): PWideChar; // copies characters from a Pascal-style string into a null-terminated wide string asm PUSH EDI PUSH ESI MOV EDI, EAX MOV ESI, EDX MOV EDX, EAX XOR AX, AX @@1: LODSB STOSW DEC ECX JNZ @@1 MOV EAX, EDX POP ESI POP EDI end; function StrCatW(Dest: PWideChar; const Source: PWideChar): PWideChar; // appends a copy of Source to the end of Dest and returns the concatenated string begin StrCopyW(StrEndW(Dest), Source); Result := Dest; end; // appends a specified maximum number of WideCharacters to string function StrLCatW(Dest, Source: PWideChar; MaxLen: Cardinal): PWideChar; asm PUSH EDI PUSH ESI PUSH EBX MOV EDI, Dest MOV ESI, Source MOV EBX, MaxLen SHL EBX, 1 CALL StrEndW MOV ECX, EDI ADD ECX, EBX SUB ECX, EAX JBE @@1 MOV EDX, ESI SHR ECX, 1 CALL StrLCopyW @@1: MOV EAX, EDI POP EBX POP ESI POP EDI end; const // data used to bring UTF-16 coded strings into correct UTF-32 order for correct comparation UTF16Fixup: array [0..31] of Word = ( 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, $2000, $F800, $F800, $F800, $F800 ); function StrCompW(const Str1, Str2: PWideChar): Integer; // Binary comparation of Str1 and Str2 with surrogate fix-up. // Returns < 0 if Str1 is smaller in binary order than Str2, = 0 if both strings are // equal and > 0 if Str1 is larger than Str2. // // This code is based on an idea of Markus W. Scherer (IBM). // Note: The surrogate fix-up is necessary because some single value code points have // larger values than surrogates which are in UTF-32 actually larger. var C1, C2: Word; Run1, Run2: PWideChar; begin Run1 := Str1; Run2 := Str2; repeat C1 := Word(Run1^); C1 := Word(C1 + UTF16Fixup[C1 shr 11]); C2 := Word(Run2^); C2 := Word(C2 + UTF16Fixup[C2 shr 11]); // now C1 and C2 are in UTF-32-compatible order Result := Integer(C1) - Integer(C2); if(Result <> 0) or (C1 = 0) or (C2 = 0) then Break; Inc(Run1); Inc(Run2); until False; // If the strings have different lengths but the comparation returned equity so far // then adjust the result so that the longer string is marked as the larger one. if Result = 0 then Result := (Run1 - Str1) - (Run2 - Str2); end; function StrICompW(const Str1, Str2: PWideChar): Integer; // Compares Str1 to Str2 without case sensitivity. // See also comments in StrCompW, but keep in mind that case folding might result in // one-to-many mappings which must be considered here. var C1, C2: Word; S1, S2: PWideChar; Run1, Run2: PWideChar; Folded1, Folded2: WideString; begin // Because of size changes of the string when doing case folding // it is unavoidable to convert both strings completely in advance. S1 := Str1; S2 := Str2; Folded1 := ''; while S1^ <> #0 do begin Folded1 := Folded1 + WideCaseFolding(S1^); Inc(S1); end; Folded2 := ''; while S2^ <> #0 do begin Folded2 := Folded2 + WideCaseFolding(S2^); Inc(S2); end; Run1 := PWideChar(Folded1); Run2 := PWideChar(Folded2); repeat C1 := Word(Run1^); C1 := Word(C1 + UTF16Fixup[C1 shr 11]); C2 := Word(Run2^); C2 := Word(C2 + UTF16Fixup[C2 shr 11]); // now C1 and C2 are in UTF-32-compatible order Result := Integer(C1) - Integer(C2); if(Result <> 0) or (C1 = 0) or (C2 = 0) then Break; Inc(Run1); Inc(Run2); until False; // If the strings have different lengths but the comparation returned equity so far // then adjust the result so that the longer string is marked as the larger one. if Result = 0 then Result := (Run1 - PWideChar(Folded1)) - (Run2 - PWideChar(Folded2)); end; function StrLICompW(const Str1, Str2: PWideChar; MaxLen: Cardinal): Integer; // compares strings up to MaxLen code points // see also StrICompW var S1, S2: PWideChar; C1, C2: Word; Run1, Run2: PWideChar; Folded1, Folded2: WideString; begin if MaxLen > 0 then begin // Because of size changes of the string when doing case folding // it is unavoidable to convert both strings completely in advance. S1 := Str1; S2 := Str2; Folded1 := ''; while S1^ <> #0 do begin Folded1 := Folded1 + WideCaseFolding(S1^); Inc(S1); end; Folded2 := ''; while S2^ <> #0 do begin Folded2 := Folded2 + WideCaseFolding(S2^); Inc(S2); end; Run1 := PWideChar(Folded1); Run2 := PWideChar(Folded2); repeat C1 := Word(Run1^); C1 := Word(C1 + UTF16Fixup[C1 shr 11]); C2 := Word(Run2^); C2 := Word(C2 + UTF16Fixup[C2 shr 11]); // now C1 and C2 are in UTF-32-compatible order { TODO : surrogates take up 2 words and are counted twice here, count them only once } Result := Integer(C1) - Integer(C2); Dec(MaxLen); if(Result <> 0) or (C1 = 0) or (C2 = 0) or (MaxLen = 0) then Break; Inc(Run1); Inc(Run2); until False; end else Result := 0; end; function StrLCompW(const Str1, Str2: PWideChar; MaxLen: Cardinal): Integer; // compares strings up to MaxLen code points // see also StrCompW var S1, S2: PWideChar; C1, C2: Word; begin if MaxLen > 0 then begin S1 := Str1; S2 := Str2; repeat C1 := Word(S1^); C1 := Word(C1 + UTF16Fixup[C1 shr 11]); C2 := Word(S2^); C2 := Word(C2 + UTF16Fixup[C2 shr 11]); // now C1 and C2 are in UTF-32-compatible order { TODO : surrogates take up 2 words and are counted twice here, count them only once } Result := Integer(C1) - Integer(C2); Dec(MaxLen); if(Result <> 0) or (C1 = 0) or (C2 = 0) or (MaxLen = 0) then Break; Inc(S1); Inc(S2); until False; end else Result := 0; end; function StrNScanW(const Str1, Str2: PWideChar): Integer; // Determines where (in Str1) the first time one of the characters of Str2 appear. // The result is the length of a string part of Str1 where none of the characters of // Str2 do appear (not counting the trailing #0 and starting with position 0 in Str1). var Run: PWideChar; begin Result := -1; if (Str1 <> nil) and (Str2 <> nil) then begin Run := Str1; while Run^ <> #0 do begin if StrScanW(Str2, Run^) <> nil then Break; Inc(Run); end; Result := Run - Str1; end; end; function StrRNScanW(const Str1, Str2: PWideChar): Integer; // This function does the same as StrRNScanW but uses Str1 in reverse order. This // means Str1 points to the last character of a string, is traversed reversely // and terminates with a starting #0. This is useful for parsing strings stored // in reversed macro buffers etc. var Run: PWideChar; begin Result := -1; if (Str1 <> nil) and (Str2 <> nil) then begin Run := Str1; while Run^ <> #0 do begin if StrScanW(Str2, Run^) <> nil then Break; Dec(Run); end; Result := Str1 - Run; end; end; // returns a pointer to first occurrence of a specified character in a string function StrScanW(Str: PWideChar; Chr: WideChar): PWideChar; asm PUSH EDI PUSH EAX MOV EDI, Str MOV ECX, 0FFFFFFFFH XOR AX, AX REPNE SCASW NOT ECX POP EDI MOV AX, Chr REPNE SCASW MOV EAX, 0 JNE @@1 MOV EAX, EDI SUB EAX, 2 @@1: POP EDI end; // Returns a pointer to first occurrence of a specified character in a string // or nil if not found. // Note: this is just a binary search for the specified character and there's no // check for a terminating null. Instead at most StrLen characters are // searched. This makes this function extremly fast. // // on enter EAX contains Str, EDX contains Chr and ECX StrLen // on exit EAX contains result pointer or nil function StrScanW(Str: PWideChar; Chr: WideChar; StrLen: Cardinal): PWideChar; asm TEST EAX, EAX JZ @@Exit // get out if the string is nil or StrLen is 0 JCXZ @@Exit @@Loop: CMP [EAX], DX // this unrolled loop is actually faster on modern processors JE @@Exit // than REP SCASW ADD EAX, 2 DEC ECX JNZ @@Loop XOR EAX, EAX @@Exit: end; // returns a pointer to the last occurance of Chr in Str function StrRScanW(Str: PWideChar; Chr: WideChar): PWideChar; asm PUSH EDI MOV EDI, Str MOV ECX, 0FFFFFFFFH XOR AX, AX REPNE SCASW NOT ECX STD SUB EDI, 2 MOV AX, Chr REPNE SCASW MOV EAX, 0 JNE @@1 MOV EAX, EDI ADD EAX, 2 @@1: CLD POP EDI end; // returns a pointer to the first occurance of SubStr in Str function StrPosW(Str, SubStr: PWideChar): PWideChar; asm PUSH EDI PUSH ESI PUSH EBX OR EAX, EAX JZ @@2 OR EDX, EDX JZ @@2 MOV EBX, EAX MOV EDI, EDX XOR AX, AX MOV ECX, 0FFFFFFFFH REPNE SCASW NOT ECX DEC ECX JZ @@2 MOV ESI, ECX MOV EDI, EBX MOV ECX, 0FFFFFFFFH REPNE SCASW NOT ECX SUB ECX, ESI JBE @@2 MOV EDI, EBX LEA EBX, [ESI - 1] @@1: MOV ESI, EDX LODSW REPNE SCASW JNE @@2 MOV EAX, ECX PUSH EDI MOV ECX, EBX REPE CMPSW POP EDI MOV ECX, EAX JNE @@1 LEA EAX, [EDI - 2] JMP @@3 @@2: XOR EAX, EAX @@3: POP EBX POP ESI POP EDI end; function StrAllocW(WideSize: Cardinal): PWideChar; // Allocates a buffer for a null-terminated wide string and returns a pointer // to the first character of the string. begin WideSize := SizeOf(WideChar) * WideSize + SizeOf(Cardinal); Result := AllocMem(WideSize); Cardinal(Pointer(Result)^) := WideSize; Inc(Result, SizeOf(Cardinal) div SizeOf(WideChar)); end; function StrBufSizeW(const Str: PWideChar): Cardinal; // Returns max number of wide characters that can be stored in a buffer // allocated by StrAllocW. var P: PWideChar; begin if Str <> nil then begin P := Str; Dec(P, SizeOf(Cardinal) div SizeOf(WideChar)); Result := (Cardinal(PInteger(P)^) - SizeOf(Cardinal)) div SizeOf(WideChar); end else Result := 0; end; function StrNewW(const Str: PWideChar): PWideChar; // Duplicates the given string (if not nil) and returns the address of the new string. var Size: Cardinal; begin if Str = nil then Result := nil else begin Size := StrLenW(Str) + 1; Result := StrMoveW(StrAllocW(Size), Str, Size); end; end; function StrNewW(const Str: WideString): PWideChar; begin Result := StrNewW(PWideChar(Str)); end; procedure StrDisposeW(Str: PWideChar); // releases a string allocated with StrNewW or StrAllocW begin if Str <> nil then begin Dec(Str, SizeOf(Cardinal) div SizeOf(WideChar)); FreeMem(Str); end; end; procedure StrDisposeAndNilW(var Str: PWideChar); begin StrDisposeW(Str); Str := nil; end; // exchanges in each character of the given string the low order and high order // byte to go from LSB to MSB and vice versa. // EAX contains address of string procedure StrSwapByteOrder(Str: PWideChar); asm PUSH ESI PUSH EDI MOV ESI, EAX MOV EDI, ESI XOR EAX, EAX // clear high order byte to be able to use 32bit operand below @@1: LODSW OR EAX, EAX JZ @@2 XCHG AL, AH STOSW JMP @@1 @@2: POP EDI POP ESI end; function WideAdjustLineBreaks(const S: WideString): WideString; var Source, SourceEnd, Dest: PWideChar; begin Source := Pointer(S); SourceEnd := Source + Length(S); Source := Pointer(S); SetString(Result, nil, SourceEnd - Source); Dest := Pointer(Result); while Source < SourceEnd do begin case Source^ of WideLineFeed: begin Dest^ := WideLineSeparator; Inc(Dest); Inc(Source); end; WideCarriageReturn: begin Dest^ := WideLineSeparator; Inc(Dest); Inc(Source); if Source^ = WideLineFeed then Inc(Source); end; else Dest^ := Source^; Inc(Dest); Inc(Source); end; end; SetLength(Result, (Integer(Dest) - Integer(Result)) div 2); end; function WideQuotedStr(const S: WideString; Quote: WideChar): WideString; // works like QuotedStr from SysUtils.pas but can insert any quotation character var P, Src, Dest: PWideChar; AddCount: Integer; begin AddCount := 0; P := StrScanW(PWideChar(S), Quote); while (P <> nil) do begin Inc(P); Inc(AddCount); P := StrScanW(P, Quote); end; if AddCount = 0 then Result := Quote + S + Quote else begin SetLength(Result, Length(S) + AddCount + 2); Dest := PWideChar(Result); Dest^ := Quote; Inc(Dest); Src := PWideChar(S); P := StrScanW(Src, Quote); repeat Inc(P); Move(Src^, Dest^, 2 * (P - Src)); Inc(Dest, P - Src); Dest^ := Quote; Inc(Dest); Src := P; P := StrScanW(Src, Quote); until P = nil; P := StrEndW(Src); Move(Src^, Dest^, 2 * (P - Src)); Inc(Dest, P - Src); Dest^ := Quote; end; end; function WideExtractQuotedStr(var Src: PWideChar; Quote: WideChar): WideString; // extracts a string enclosed in quote characters given by Quote var P, Dest: PWideChar; DropCount: Integer; begin Result := ''; if (Src = nil) or (Src^ <> Quote) then Exit; Inc(Src); DropCount := 1; P := Src; Src := StrScanW(Src, Quote); while Src <> nil do // count adjacent pairs of quote chars begin Inc(Src); if Src^ <> Quote then Break; Inc(Src); Inc(DropCount); Src := StrScanW(Src, Quote); end; if Src = nil then Src := StrEndW(P); if (Src - P) <= 1 then Exit; if DropCount = 1 then SetString(Result, P, Src - P - 1) else begin SetLength(Result, Src - P - DropCount); Dest := PWideChar(Result); Src := StrScanW(P, Quote); while Src <> nil do begin Inc(Src); if Src^ <> Quote then Break; Move(P^, Dest^, 2 * (Src - P)); Inc(Dest, Src - P); Inc(Src); P := Src; Src := StrScanW(Src, Quote); end; if Src = nil then Src := StrEndW(P); Move(P^, Dest^, 2 * (Src - P - 1)); end; end; function WideStringOfChar(C: WideChar; Count: Cardinal): WideString; // returns a string of Count characters filled with C var I: Integer; begin SetLength(Result, Count); for I := 1 to Count do Result[I] := C; end; function WideTrim(const S: WideString): WideString; var I, L: Integer; begin L := Length(S); I := 1; while (I <= L) and (UnicodeIsWhiteSpace(UCS4(S[I])) or UnicodeIsControl(UCS4(S[I]))) do Inc(I); if I > L then Result := '' else begin while UnicodeIsWhiteSpace(UCS4(S[L])) or UnicodeIsControl(UCS4(S[L])) do Dec(L); Result := Copy(S, I, L - I + 1); end; end; function WideTrimLeft(const S: WideString): WideString; var I, L: Integer; begin L := Length(S); I := 1; while (I <= L) and (UnicodeIsWhiteSpace(UCS4(S[I])) or UnicodeIsControl(UCS4(S[I]))) do Inc(I); Result := Copy(S, I, Maxint); end; function WideTrimRight(const S: WideString): WideString; var I: Integer; begin I := Length(S); while (I > 0) and (UnicodeIsWhiteSpace(UCS4(S[I])) or UnicodeIsControl(UCS4(S[I]))) do Dec(I); Result := Copy(S, 1, I); end; // returns the index of character Ch in S, starts searching at index Index // Note: This is a quick memory search. No attempt is made to interpret either // the given charcter nor the string (ligatures, modifiers, surrogates etc.) // Code from Azret Botash. function WideCharPos(const S: WideString; const Ch: WideChar; const Index: Integer): Integer; asm TEST EAX,EAX // make sure we are not null JZ @@StrIsNil DEC ECX // make index zero based JL @@IdxIsSmall PUSH EBX PUSH EDI MOV EDI, EAX // EDI := S XOR EAX, EAX MOV AX, DX // AX := Ch MOV EDX, [EDI - 4] // EDX := Length(S) * 2 SHR EDX, 1 // EDX := EDX div 2 MOV EBX, EDX // save the length to calc. result SUB EDX, ECX // EDX = EDX - Index = # of chars to scan JLE @@IdxIsBig SHL ECX, 1 // two bytes per char ADD EDI, ECX // point to index'th char MOV ECX, EDX // loop counter REPNE SCASW JNE @@NoMatch MOV EAX, EBX // result := saved length - SUB EAX, ECX // loop counter value POP EDI POP EBX RET @@IdxIsBig: @@NoMatch: XOR EAX,EAX POP EDI POP EBX RET @@IdxIsSmall: XOR EAX, EAX @@StrIsNil: end; function WideComposeHangul(const Source: WideString): WideString; var Len: Integer; Ch, Last: WideChar; I: Integer; LIndex, VIndex, SIndex, TIndex: Integer; begin Result := ''; Len := Length(Source); if Len > 0 then begin Last := Source[1]; Result := Last; for I := 2 to Len do begin Ch := Source[I]; // 1. check to see if two current characters are L and V LIndex := Word(Last) - LBase; if (0 <= LIndex) and (LIndex < LCount) then begin VIndex := Word(Ch) - VBase; if (0 <= VIndex) and (VIndex < VCount) then begin // make syllable of form LV Last := WideChar((SBase + (LIndex * VCount + VIndex) * TCount)); Result[Length(Result)] := Last; // reset last Continue; // discard Ch end; end; // 2. check to see if two current characters are LV and T SIndex := Word(Last) - SBase; if (0 <= SIndex) and (SIndex < SCount) and ((SIndex mod TCount) = 0) then begin TIndex := Word(Ch) - TBase; if (0 <= TIndex) and (TIndex <= TCount) then begin // make syllable of form LVT Inc(Word(Last), TIndex); Result[Length(Result)] := Last; // reset last Continue; // discard Ch end; end; // if neither case was true, just add the character Last := Ch; Result := Result + Ch; end; end; end; // Returns canonical composition of characters in S. function WideCompose(const S: WideString): WideString; var StarterPos, CompPos, DecompPos: Integer; Composite: UCS4; Ch, StarterChar: WideChar; LastClass, CurrentClass: Cardinal; begin // Set an arbitrary length for the result. This is automatically done when checking // for hangul composition. Result := WideComposeHangul(S); if Result = '' then Exit; StarterPos := 1; CompPos := 2; StarterChar := Result[StarterPos]; LastClass := CanonicalCombiningClass(UCS4(StarterChar)); if LastClass <> 0 then LastClass := 256; // fix for irregular combining sequence // Loop on the (decomposed) characters, combining where possible. for DecompPos := 2 to Length(Result) do begin Ch := Result[DecompPos]; CurrentClass := CanonicalCombiningClass(UCS4(Ch)); if UnicodeComposePair(UCS4(StarterChar), UCS4(Ch), Composite) and ((LastClass < CurrentClass) or (LastClass = 0)) then begin Result[StarterPos] := UCS2(Composite); StarterChar := UCS2(Composite); end else begin if CurrentClass = 0 then begin StarterPos := CompPos; StarterChar := Ch; end; LastClass := CurrentClass; Result[CompPos] := Ch; Inc(CompPos); end; end; // since we have likely shortened the source string we have to set the correct length on exit SetLength(Result, CompPos - 1); end; procedure FixCanonical(var S: WideString); // Examines S and reorders all combining marks in the string so that they are in canonical order. var I: Integer; Temp: WideChar; CurrentClass, LastClass: Cardinal; begin I := Length(S); if I > 1 then begin CurrentClass := CanonicalCombiningClass(UCS4(S[I])); repeat Dec(I); LastClass := CurrentClass; CurrentClass := CanonicalCombiningClass(UCS4(S[I])); // A swap is presumed to be rare (and a double-swap very rare), // so don't worry about efficiency here. if (CurrentClass > LastClass) and (LastClass > 0) then begin // swap characters Temp := S[I]; S[I] := S[I + 1]; S[I + 1] := Temp; // if not at end, backup (one further, to compensate for loop) if I < Length(S) - 1 then Inc(I, 2); // reset type, since we swapped. CurrentClass := CanonicalCombiningClass(UCS4(S[I])); end; until I = 1; end; end; procedure GetDecompositions(Compatible: Boolean; Code: UCS4; var Buffer: TUCS4Array); // helper function to recursively decompose a code point var Decomp: TUCS4Array; I: Integer; begin Decomp := UnicodeDecompose(Code, Compatible); if Assigned(Decomp) then begin for I := 0 to High(Decomp) do GetDecompositions(Compatible, Decomp[I], Buffer); end else // if no decomp, append begin I := Length(Buffer); SetLength(Buffer, I + 1); Buffer[I] := Code; end; end; function WideDecompose(const S: WideString; Compatible: Boolean): WideString; // returns a string with all characters of S but decomposed, e.g. Ê is returned as E^ etc. var I, J: Integer; Decomp: TUCS4Array; begin Result := ''; Decomp := nil; // iterate through each source code point for I := 1 to Length(S) do begin Decomp := nil; GetDecompositions(Compatible, UCS4(S[I]), Decomp); if Decomp = nil then Result := Result + S[I] else for J := 0 to High(Decomp) do Result := Result + WideChar(Decomp[J]); end; // combining marks must be sorted according to their canonical combining class FixCanonical(Result); end; //----------------- general purpose case mapping --------------------------------------------------- // Note that most of the assigned code points don't have a case mapping and are therefore // returned as they are. Other code points, however, might be converted into several characters // like the german ß (eszett) whose upper case mapping is SS. function WideCaseFolding(C: WideChar): WideString; // Special case folding function to map a string to either its lower case or // to special cases. This can be used for case-insensitive comparation. var I: Integer; Mapping: TUCS4Array; begin Mapping := UnicodeCaseFold(UCS4(C)); SetLength(Result, Length(Mapping)); for I := 0 to High(Mapping) do Result[I + 1] := WideChar(Mapping[I]); end; function WideCaseFolding(const S: WideString): WideString; var I: Integer; begin Result := ''; for I := 1 to Length(S) do Result := Result + WideCaseFolding(S[I]); end; function WideLowerCase(C: WideChar): WideString; var I: Integer; Mapping: TUCS4Array; begin Mapping := UnicodeToLower(UCS4(C)); SetLength(Result, Length(Mapping)); for I := 0 to High(Mapping) do Result[I + 1] := WideChar(Mapping[I]); end; function WideLowerCase(const S: WideString): WideString; var I: Integer; begin Result := ''; for I := 1 to Length(S) do Result := Result + WideLowerCase(S[I]); end; function WideNormalize(const S: WideString; Form: TNormalizationForm): WideString; var Temp: WideString; Compatible: Boolean; begin Result := S; if Form = nfNone then Exit; // No normalization needed. Compatible := Form in [nfKC, nfKD]; if Form in [nfD, nfKD] then Result := WideDecompose(S, Compatible) else begin Temp := WideDecompose(S, Compatible); Result := WideCompose(Temp); end; end; function WideSameText(const Str1, Str2: WideString): Boolean; // Compares both strings case-insensitively and returns True if both are equal, otherwise False is returned. begin Result := Length(Str1) = Length(Str2); if Result then Result := StrICompW(PWideChar(Str1), PWideChar(Str2)) = 0; end; function WideTitleCase(C: WideChar): WideString; var I: Integer; Mapping: TUCS4Array; begin Mapping := UnicodeToTitle(UCS4(C)); SetLength(Result, Length(Mapping)); for I := 0 to High(Mapping) do Result[I + 1] := WideChar(Mapping[I]); end; function WideTitleCase(const S: WideString): WideString; var I: Integer; begin Result := ''; for I := 1 to Length(S) do Result := Result + WideTitleCase(S[I]); end; function WideUpperCase(C: WideChar): WideString; var I: Integer; Mapping: TUCS4Array; begin Mapping := UnicodeToUpper(UCS4(C)); SetLength(Result, Length(Mapping)); for I := 0 to High(Mapping) do Result[I + 1] := WideChar(Mapping[I]); end; function WideUpperCase(const S: WideString): WideString; var I: Integer; begin Result := ''; for I := 1 to Length(S) do Result := Result + WideUpperCase(S[I]); end; //----------------- character test routines -------------------------------------------------------- function UnicodeIsAlpha(C: UCS4): Boolean; // Is the character alphabetic? begin Result := CategoryLookup(C, ClassLetter); end; function UnicodeIsDigit(C: UCS4): Boolean; // Is the character a digit? begin Result := CategoryLookup(C, [ccNumberDecimalDigit]); end; function UnicodeIsAlphaNum(C: UCS4): Boolean; // Is the character alphabetic or a number? begin Result := CategoryLookup(C, ClassLetter + [ccNumberDecimalDigit]); end; function UnicodeIsCased(C: UCS4): Boolean; // Is the character a "cased" character, i.e. either lower case, title case or upper case begin Result := CategoryLookup(C, [ccLetterLowercase, ccLetterTitleCase, ccLetterUppercase]); end; function UnicodeIsControl(C: UCS4): Boolean; // Is the character a control character? begin Result := CategoryLookup(C, [ccOtherControl, ccOtherFormat]); end; function UnicodeIsSpace(C: UCS4): Boolean; // Is the character a spacing character? begin Result := CategoryLookup(C, ClassSpace); end; function UnicodeIsWhiteSpace(C: UCS4): Boolean; // Is the character a white space character (same as UnicodeIsSpace plus // tabulator, new line etc.)? begin Result := CategoryLookup(C, ClassSpace + [ccWhiteSpace, ccSegmentSeparator]); end; function UnicodeIsBlank(C: UCS4): Boolean; // Is the character a space separator? begin Result := CategoryLookup(C, [ccSeparatorSpace]); end; function UnicodeIsPunctuation(C: UCS4): Boolean; // Is the character a punctuation mark? begin Result := CategoryLookup(C, ClassPunctuation); end; function UnicodeIsGraph(C: UCS4): Boolean; // Is the character graphical? begin Result := CategoryLookup(C, ClassMark + ClassNumber + ClassLetter + ClassPunctuation + ClassSymbol); end; function UnicodeIsPrintable(C: UCS4): Boolean; // Is the character printable? begin Result := CategoryLookup(C, ClassMark + ClassNumber + ClassLetter + ClassPunctuation + ClassSymbol + [ccSeparatorSpace]); end; function UnicodeIsUpper(C: UCS4): Boolean; // Is the character already upper case? begin Result := CategoryLookup(C, [ccLetterUppercase]); end; function UnicodeIsLower(C: UCS4): Boolean; // Is the character already lower case? begin Result := CategoryLookup(C, [ccLetterLowercase]); end; function UnicodeIsTitle(C: UCS4): Boolean; // Is the character already title case? begin Result := CategoryLookup(C, [ccLetterTitlecase]); end; function UnicodeIsHexDigit(C: UCS4): Boolean; // Is the character a hex digit? begin Result := CategoryLookup(C, [ccHexDigit]); end; function UnicodeIsIsoControl(C: UCS4): Boolean; // Is the character a C0 control character (< 32)? begin Result := CategoryLookup(C, [ccOtherControl]); end; function UnicodeIsFormatControl(C: UCS4): Boolean; // Is the character a format control character? begin Result := CategoryLookup(C, [ccOtherFormat]); end; function UnicodeIsSymbol(C: UCS4): Boolean; // Is the character a symbol? begin Result := CategoryLookup(C, ClassSymbol); end; function UnicodeIsNumber(C: UCS4): Boolean; // Is the character a number or digit? begin Result := CategoryLookup(C, ClassNumber); end; function UnicodeIsNonSpacing(C: UCS4): Boolean; // Is the character non-spacing? begin Result := CategoryLookup(C, [ccMarkNonSpacing]); end; function UnicodeIsOpenPunctuation(C: UCS4): Boolean; // Is the character an open/left punctuation (e.g. '[')? begin Result := CategoryLookup(C, [ccPunctuationOpen]); end; function UnicodeIsClosePunctuation(C: UCS4): Boolean; // Is the character an close/right punctuation (e.g. ']')? begin Result := CategoryLookup(C, [ccPunctuationClose]); end; function UnicodeIsInitialPunctuation(C: UCS4): Boolean; // Is the character an initial punctuation (e.g. U+2018 LEFT SINGLE QUOTATION MARK)? begin Result := CategoryLookup(C, [ccPunctuationInitialQuote]); end; function UnicodeIsFinalPunctuation(C: UCS4): Boolean; // Is the character a final punctuation (e.g. U+2019 RIGHT SINGLE QUOTATION MARK)? begin Result := CategoryLookup(C, [ccPunctuationFinalQuote]); end; function UnicodeIsComposed(C: UCS4): Boolean; // Can the character be decomposed into a set of other characters? begin Result := CategoryLookup(C, [ccComposed]); end; function UnicodeIsQuotationMark(C: UCS4): Boolean; // Is the character one of the many quotation marks? begin Result := CategoryLookup(C, [ccQuotationMark]); end; function UnicodeIsSymmetric(C: UCS4): Boolean; // Is the character one that has an opposite form (i.e. <>)? begin Result := CategoryLookup(C, [ccSymmetric]); end; function UnicodeIsMirroring(C: UCS4): Boolean; // Is the character mirroring (superset of symmetric)? begin Result := CategoryLookup(C, [ccMirroring]); end; function UnicodeIsNonBreaking(C: UCS4): Boolean; // Is the character non-breaking (i.e. non-breaking space)? begin Result := CategoryLookup(C, [ccNonBreaking]); end; function UnicodeIsRightToLeft(C: UCS4): Boolean; // Does the character have strong right-to-left directionality (i.e. Arabic letters)? begin Result := CategoryLookup(C, [ccRightToLeft]); end; function UnicodeIsLeftToRight(C: UCS4): Boolean; // Does the character have strong left-to-right directionality (i.e. Latin letters)? begin Result := CategoryLookup(C, [ccLeftToRight]); end; function UnicodeIsStrong(C: UCS4): Boolean; // Does the character have strong directionality? begin Result := CategoryLookup(C, [ccLeftToRight, ccRightToLeft]); end; function UnicodeIsWeak(C: UCS4): Boolean; // Does the character have weak directionality (i.e. numbers)? begin Result := CategoryLookup(C, ClassEuropeanNumber + [ccArabicNumber, ccCommonNumberSeparator]); end; function UnicodeIsNeutral(C: UCS4): Boolean; // Does the character have neutral directionality (i.e. whitespace)? begin Result := CategoryLookup(C, [ccSeparatorParagraph, ccSegmentSeparator, ccWhiteSpace, ccOtherNeutrals]); end; function UnicodeIsSeparator(C: UCS4): Boolean; // Is the character a block or segment separator? begin Result := CategoryLookup(C, [ccSeparatorParagraph, ccSegmentSeparator]); end; function UnicodeIsMark(C: UCS4): Boolean; // Is the character a mark of some kind? begin Result := CategoryLookup(C, ClassMark); end; function UnicodeIsModifier(C: UCS4): Boolean; // Is the character a letter modifier? begin Result := CategoryLookup(C, [ccLetterModifier]); end; function UnicodeIsLetterNumber(C: UCS4): Boolean; // Is the character a number represented by a letter? begin Result := CategoryLookup(C, [ccNumberLetter]); end; function UnicodeIsConnectionPunctuation(C: UCS4): Boolean; // Is the character connecting punctuation? begin Result := CategoryLookup(C, [ccPunctuationConnector]); end; function UnicodeIsDash(C: UCS4): Boolean; // Is the character a dash punctuation? begin Result := CategoryLookup(C, [ccPunctuationDash]); end; function UnicodeIsMath(C: UCS4): Boolean; // Is the character a math character? begin Result := CategoryLookup(C, [ccSymbolMath]); end; function UnicodeIsCurrency(C: UCS4): Boolean; // Is the character a currency character? begin Result := CategoryLookup(C, [ccSymbolCurrency]); end; function UnicodeIsModifierSymbol(C: UCS4): Boolean; // Is the character a modifier symbol? begin Result := CategoryLookup(C, [ccSymbolModifier]); end; function UnicodeIsNonSpacingMark(C: UCS4): Boolean; // Is the character a non-spacing mark? begin Result := CategoryLookup(C, [ccMarkNonSpacing]); end; function UnicodeIsSpacingMark(C: UCS4): Boolean; // Is the character a spacing mark? begin Result := CategoryLookup(C, [ccMarkSpacingCombining]); end; function UnicodeIsEnclosing(C: UCS4): Boolean; // Is the character enclosing (i.e. enclosing box)? begin Result := CategoryLookup(C, [ccMarkEnclosing]); end; function UnicodeIsPrivate(C: UCS4): Boolean; // Is the character from the Private Use Area? begin Result := CategoryLookup(C, [ccOtherPrivate]); end; function UnicodeIsSurrogate(C: UCS4): Boolean; // Is the character one of the surrogate codes? begin Result := CategoryLookup(C, [ccOtherSurrogate]); end; function UnicodeIsLineSeparator(C: UCS4): Boolean; // Is the character a line separator? begin Result := CategoryLookup(C, [ccSeparatorLine]); end; function UnicodeIsParagraphSeparator(C: UCS4): Boolean; // Is th character a paragraph separator; begin Result := CategoryLookup(C, [ccSeparatorParagraph]); end; function UnicodeIsIdentifierStart(C: UCS4): Boolean; // Can the character begin an identifier? begin Result := CategoryLookup(C, ClassLetter + [ccNumberLetter]); end; function UnicodeIsIdentifierPart(C: UCS4): Boolean; // Can the character appear in an identifier? begin Result := CategoryLookup(C, ClassLetter + [ccNumberLetter, ccMarkNonSpacing, ccMarkSpacingCombining, ccNumberDecimalDigit, ccPunctuationConnector, ccOtherFormat]); end; function UnicodeIsDefined(C: UCS4): Boolean; // Is the character defined (appears in one of the data files)? begin Result := CategoryLookup(C, [ccAssigned]); end; function UnicodeIsUndefined(C: UCS4): Boolean; // Is the character undefined (not assigned in the Unicode database)? begin Result := not CategoryLookup(C, [ccAssigned]); end; function UnicodeIsHan(C: UCS4): Boolean; // Is the character a Han ideograph? begin Result := ((C >= $4E00) and (C <= $9FFF)) or ((C >= $F900) and (C <= $FAFF)); end; function UnicodeIsHangul(C: UCS4): Boolean; // Is the character a pre-composed Hangul syllable? begin Result := (C >= $AC00) and (C <= $D7FF); end; // I need to fix a problem (introduced by MS) here. The first parameter can be a pointer // (and is so defined) or can be a normal DWORD, depending on the dwFlags parameter. // As usual, lpSrc has been translated to a var parameter. But this does not work in // our case, hence the redeclaration of the function with a pointer as first parameter. function TranslateCharsetInfoEx(lpSrc: PDWORD; var lpCs: TCharsetInfo; dwFlags: DWORD): BOOL; stdcall; external 'gdi32.dll' name 'TranslateCharsetInfo'; function GetCharSetFromLocale(Language: LCID; out FontCharSet: TFontCharSet): Boolean; var CP: Cardinal; CSI: TCharsetInfo; begin CP:= CodePageFromLocale(Language); Result := TranslateCharsetInfoEx(Pointer(CP), CSI, TCI_SRCCODEPAGE); if Result then FontCharset := CSI.ciCharset; end; function CharSetFromLocale(Language: LCID): TFontCharSet; begin if not GetCharSetFromLocale(Language, Result) then RaiseLastOSError; end; function CodePageFromLocale(Language: LCID): Integer; // determines the code page for a given locale var Buf: array [0..6] of Char; begin GetLocaleInfo(Language, LOCALE_IDefaultAnsiCodePage, Buf, 6); Result := StrToIntDef(Buf, GetACP); end; function KeyboardCodePage: Word; begin Result := CodePageFromLocale(GetKeyboardLayout(0) and $FFFF); end; function KeyUnicode(C: Char): WideChar; // converts the given character (as it comes with a WM_CHAR message) into its // corresponding Unicode character depending on the active keyboard layout begin MultiByteToWideChar(KeyboardCodePage, MB_USEGLYPHCHARS, @C, 1, @Result, 1); end; function CodeBlockRange(const CB: TUnicodeBlock): TUnicodeBlockRange; // http://www.unicode.org/Public/4.1.0/ucd/Blocks.txt begin case CB of ubBasicLatin: begin Result.RangeStart := $0000; Result.RangeEnd := $007F; end; ubLatin1Supplement: begin Result.RangeStart := $0080; Result.RangeEnd := $00FF; end; ubLatinExtendedA: begin Result.RangeStart := $0100; Result.RangeEnd := $017F; end; ubLatinExtendedB: begin Result.RangeStart := $0180; Result.RangeEnd := $024F; end; ubIPAExtensions: begin Result.RangeStart := $0250; Result.RangeEnd := $02AF; end; ubSpacingModifierLetters: begin Result.RangeStart := $02B0; Result.RangeEnd := $02FF; end; ubCombiningDiacriticalMarks: begin Result.RangeStart := $0300; Result.RangeEnd := $036F; end; ubGreek: begin Result.RangeStart := $0370; Result.RangeEnd := $03FF; end; ubCyrillic: begin Result.RangeStart := $0400; Result.RangeEnd := $04FF; end; ubCyrillicSupplement: begin Result.RangeStart := $0500; Result.RangeEnd := $052F; end; ubArmenian: begin Result.RangeStart := $0530; Result.RangeEnd := $058F; end; ubHebrew: begin Result.RangeStart := $0590; Result.RangeEnd := $05FF; end; ubArabic: begin Result.RangeStart := $0600; Result.RangeEnd := $06FF; end; ubSyriac: begin Result.RangeStart := $0700; Result.RangeEnd := $074F; end; ubArabicSupplement: begin Result.RangeStart := $0750; Result.RangeEnd := $077F; end; ubThaana: begin Result.RangeStart := $0780; Result.RangeEnd := $07BF; end; ubDevanagari: begin Result.RangeStart := $0900; Result.RangeEnd := $097F; end; ubBengali: begin Result.RangeStart := $0980; Result.RangeEnd := $09FF; end; ubGurmukhi: begin Result.RangeStart := $0A00; Result.RangeEnd := $0A7F; end; ubGujarati: begin Result.RangeStart := $0A80; Result.RangeEnd := $0AFF; end; ubOriya: begin Result.RangeStart := $0B00; Result.RangeEnd := $0B7F; end; ubTamil: begin Result.RangeStart := $0B80; Result.RangeEnd := $0BFF; end; ubTelugu: begin Result.RangeStart := $0C00; Result.RangeEnd := $0C7F; end; ubKannada: begin Result.RangeStart := $0C80; Result.RangeEnd := $0CFF; end; ubMalayalam: begin Result.RangeStart := $0D00; Result.RangeEnd := $0D7F; end; ubSinhala: begin Result.RangeStart := $0D80; Result.RangeEnd := $0DFF; end; ubThai: begin Result.RangeStart := $0E00; Result.RangeEnd := $0E7F; end; ubLao: begin Result.RangeStart := $0E80; Result.RangeEnd := $0EFF; end; ubTibetan: begin Result.RangeStart := $0F00; Result.RangeEnd := $0FFF; end; ubMyanmar: begin Result.RangeStart := $1000; Result.RangeEnd := $109F; end; ubGeorgian: begin Result.RangeStart := $10A0; Result.RangeEnd := $10FF; end; ubHangulJamo: begin Result.RangeStart := $1100; Result.RangeEnd := $11FF; end; ubEthiopic: begin Result.RangeStart := $1200; Result.RangeEnd := $137F; end; ubEthiopicSupplement: begin Result.RangeStart := $1380; Result.RangeEnd := $139F; end; ubCherokee: begin Result.RangeStart := $13A0; Result.RangeEnd := $13FF; end; ubUnifiedCanadianAboriginalSyllabics: begin Result.RangeStart := $1400; Result.RangeEnd := $167F; end; ubOgham: begin Result.RangeStart := $1680; Result.RangeEnd := $169F; end; ubRunic: begin Result.RangeStart := $16A0; Result.RangeEnd := $16FF; end; ubTagalog: begin Result.RangeStart := $1700; Result.RangeEnd := $171F; end; ubHanunoo: begin Result.RangeStart := $1720; Result.RangeEnd := $173F; end; ubBuhid: begin Result.RangeStart := $1740; Result.RangeEnd := $175F; end; ubTagbanwa: begin Result.RangeStart := $1760; Result.RangeEnd := $177F; end; ubKhmer: begin Result.RangeStart := $1780; Result.RangeEnd := $17FF; end; ubMongolian: begin Result.RangeStart := $1800; Result.RangeEnd := $18AF; end; ubLimbu: begin Result.RangeStart := $1900; Result.RangeEnd := $194F; end; ubTaiLe: begin Result.RangeStart := $1950; Result.RangeEnd := $197F; end; ubNewTaiLue: begin Result.RangeStart := $1980; Result.RangeEnd := $19DF; end; ubKhmerSymbols: begin Result.RangeStart := $19E0; Result.RangeEnd := $19FF; end; ubBuginese: begin Result.RangeStart := $1A00; Result.RangeEnd := $1A1F; end; ubPhoneticExtensions: begin Result.RangeStart := $1D00; Result.RangeEnd := $1D7F; end; ubPhoneticExtensionsSupplement: begin Result.RangeStart := $1D80; Result.RangeEnd := $1DBF; end; ubCombiningDiacriticalMarksSupplement: begin Result.RangeStart := $1DC0; Result.RangeEnd := $1DFF; end; ubLatinExtendedAdditional: begin Result.RangeStart := $1E00; Result.RangeEnd := $1EFF; end; ubGreekExtended: begin Result.RangeStart := $1F00; Result.RangeEnd := $1FFF; end; ubGeneralPunctuation: begin Result.RangeStart := $2000; Result.RangeEnd := $206F; end; ubSuperscriptsandSubscripts: begin Result.RangeStart := $2070; Result.RangeEnd := $209F; end; ubCurrencySymbols: begin Result.RangeStart := $20A0; Result.RangeEnd := $20CF; end; ubCombiningMarksforSymbols: begin Result.RangeStart := $20D0; Result.RangeEnd := $20FF; end; ubLetterlikeSymbols: begin Result.RangeStart := $2100; Result.RangeEnd := $214F; end; ubNumberForms: begin Result.RangeStart := $2150; Result.RangeEnd := $218F; end; ubArrows: begin Result.RangeStart := $2190; Result.RangeEnd := $21FF; end; ubMathematicalOperators: begin Result.RangeStart := $2200; Result.RangeEnd := $22FF; end; ubMiscellaneousTechnical: begin Result.RangeStart := $2300; Result.RangeEnd := $23FF; end; ubControlPictures: begin Result.RangeStart := $2400; Result.RangeEnd := $243F; end; ubOpticalCharacterRecognition: begin Result.RangeStart := $2440; Result.RangeEnd := $245F; end; ubEnclosedAlphanumerics: begin Result.RangeStart := $2460; Result.RangeEnd := $24FF; end; ubBoxDrawing: begin Result.RangeStart := $2500; Result.RangeEnd := $257F; end; ubBlockElements: begin Result.RangeStart := $2580; Result.RangeEnd := $259F; end; ubGeometricShapes: begin Result.RangeStart := $25A0; Result.RangeEnd := $25FF; end; ubMiscellaneousSymbols: begin Result.RangeStart := $2600; Result.RangeEnd := $26FF; end; ubDingbats: begin Result.RangeStart := $2700; Result.RangeEnd := $27BF; end; ubMiscellaneousMathematicalSymbolsA: begin Result.RangeStart := $27C0; Result.RangeEnd := $27EF; end; ubSupplementalArrowsA: begin Result.RangeStart := $27F0; Result.RangeEnd := $27FF; end; ubBraillePatterns: begin Result.RangeStart := $2800; Result.RangeEnd := $28FF; end; ubSupplementalArrowsB: begin Result.RangeStart := $2900; Result.RangeEnd := $297F; end; ubMiscellaneousMathematicalSymbolsB: begin Result.RangeStart := $2980; Result.RangeEnd := $29FF; end; ubSupplementalMathematicalOperators: begin Result.RangeStart := $2A00; Result.RangeEnd := $2AFF; end; ubMiscellaneousSymbolsandArrows: begin Result.RangeStart := $2B00; Result.RangeEnd := $2BFF; end; ubGlagolitic: begin Result.RangeStart := $2C00; Result.RangeEnd := $2C5F; end; ubCoptic: begin Result.RangeStart := $2C80; Result.RangeEnd := $2CFF; end; ubGeorgianSupplement: begin Result.RangeStart := $2D00; Result.RangeEnd := $2D2F; end; ubTifinagh: begin Result.RangeStart := $2D30; Result.RangeEnd := $2D7F; end; ubEthiopicExtended: begin Result.RangeStart := $2D80; Result.RangeEnd := $2DDF; end; ubSupplementalPunctuation: begin Result.RangeStart := $2E00; Result.RangeEnd := $2E7F; end; ubCJKRadicalsSupplement: begin Result.RangeStart := $2E80; Result.RangeEnd := $2EFF; end; ubKangxiRadicals: begin Result.RangeStart := $2F00; Result.RangeEnd := $2FDF; end; ubIdeographicDescriptionCharacters: begin Result.RangeStart := $2FF0; Result.RangeEnd := $2FFF; end; ubCJKSymbolsandPunctuation: begin Result.RangeStart := $3000; Result.RangeEnd := $303F; end; ubHiragana: begin Result.RangeStart := $3040; Result.RangeEnd := $309F; end; ubKatakana: begin Result.RangeStart := $30A0; Result.RangeEnd := $30FF; end; ubBopomofo: begin Result.RangeStart := $3100; Result.RangeEnd := $312F; end; ubHangulCompatibilityJamo: begin Result.RangeStart := $3130; Result.RangeEnd := $318F; end; ubKanbun: begin Result.RangeStart := $3190; Result.RangeEnd := $319F; end; ubBopomofoExtended: begin Result.RangeStart := $31A0; Result.RangeEnd := $31BF; end; ubCJKStrokes: begin Result.RangeStart := $31C0; Result.RangeEnd := $31EF; end; ubKatakanaPhoneticExtensions: begin Result.RangeStart := $31F0; Result.RangeEnd := $31FF; end; ubEnclosedCJKLettersandMonths: begin Result.RangeStart := $3200; Result.RangeEnd := $32FF; end; ubCJKCompatibility: begin Result.RangeStart := $3300; Result.RangeEnd := $33FF; end; ubCJKUnifiedIdeographsExtensionA: begin Result.RangeStart := $3400; Result.RangeEnd := $4DBF; end; ubYijingHexagramSymbols: begin Result.RangeStart := $4DC0; Result.RangeEnd := $4DFF; end; ubCJKUnifiedIdeographs: begin Result.RangeStart := $4E00; Result.RangeEnd := $9FFF; end; ubYiSyllables: begin Result.RangeStart := $A000; Result.RangeEnd := $A48F; end; ubYiRadicals: begin Result.RangeStart := $A490; Result.RangeEnd := $A4CF; end; ubModifierToneLetters: begin Result.RangeStart := $A700; Result.RangeEnd := $A71F; end; ubSylotiNagri: begin Result.RangeStart := $A800; Result.RangeEnd := $A82F; end; ubHangulSyllables: begin Result.RangeStart := $AC00; Result.RangeEnd := $D7AF; end; ubHighSurrogates: begin Result.RangeStart := $D800; Result.RangeEnd := $DB7F; end; ubHighPrivateUseSurrogates: begin Result.RangeStart := $DB80; Result.RangeEnd := $DBFF; end; ubLowSurrogates: begin Result.RangeStart := $DC00; Result.RangeEnd := $DFFF; end; ubPrivateUse: begin Result.RangeStart := $E000; Result.RangeEnd := $F8FF; end; ubCJKCompatibilityIdeographs: begin Result.RangeStart := $F900; Result.RangeEnd := $FAFF; end; ubAlphabeticPresentationForms: begin Result.RangeStart := $FB00; Result.RangeEnd := $FB4F; end; ubArabicPresentationFormsA: begin Result.RangeStart := $FB50; Result.RangeEnd := $FDFF; end; ubVariationSelectors: begin Result.RangeStart := $FE00; Result.RangeEnd := $FE0F; end; ubVerticalForms: begin Result.RangeStart := $FE10; Result.RangeEnd := $FE1F; end; ubCombiningHalfMarks: begin Result.RangeStart := $FE20; Result.RangeEnd := $FE2F; end; ubCJKCompatibilityForms: begin Result.RangeStart := $FE30; Result.RangeEnd := $FE4F; end; ubSmallFormVariants: begin Result.RangeStart := $FE50; Result.RangeEnd := $FE6F; end; ubArabicPresentationFormsB: begin Result.RangeStart := $FE70; Result.RangeEnd := $FEFF; end; ubHalfwidthandFullwidthForms: begin Result.RangeStart := $FF00; Result.RangeEnd := $FFEF; end; ubSpecials: begin Result.RangeStart := $FFF0; Result.RangeEnd := $FFFF; end; ubLinearBSyllabary: begin Result.RangeStart := $10000; Result.RangeEnd := $1007F; end; ubLinearBIdeograms: begin Result.RangeStart := $10080; Result.RangeEnd := $100FF; end; ubAegeanNumbers: begin Result.RangeStart := $10100; Result.RangeEnd := $1013F; end; ubAncientGreekNumbers: begin Result.RangeStart := $10140; Result.RangeEnd := $1018F; end; ubOldItalic: begin Result.RangeStart := $10300; Result.RangeEnd := $1032F; end; ubGothic: begin Result.RangeStart := $10330; Result.RangeEnd := $1034F; end; ubUgaritic: begin Result.RangeStart := $10380; Result.RangeEnd := $1039F; end; ubOldPersian: begin Result.RangeStart := $103A0; Result.RangeEnd := $103DF; end; ubDeseret: begin Result.RangeStart := $10400; Result.RangeEnd := $1044F; end; ubShavian: begin Result.RangeStart := $10450; Result.RangeEnd := $1047F; end; ubOsmanya: begin Result.RangeStart := $10480; Result.RangeEnd := $104AF; end; ubCypriotSyllabary: begin Result.RangeStart := $10800; Result.RangeEnd := $1083F; end; ubKharoshthi: begin Result.RangeStart := $10A00; Result.RangeEnd := $10A5F; end; ubByzantineMusicalSymbols: begin Result.RangeStart := $1D000; Result.RangeEnd := $1D0FF; end; ubMusicalSymbols: begin Result.RangeStart := $1D100; Result.RangeEnd := $1D1FF; end; ubAncientGreekMusicalNotation: begin Result.RangeStart := $1D200; Result.RangeEnd := $1D24F; end; ubTaiXuanJingSymbols: begin Result.RangeStart := $1D300; Result.RangeEnd := $1D35F; end; ubMathematicalAlphanumericSymbols: begin Result.RangeStart := $1D400; Result.RangeEnd := $1D7FF; end; ubCJKUnifiedIdeographsExtensionB: begin Result.RangeStart := $20000; Result.RangeEnd := $2A6DF; end; ubCJKCompatibilityIdeographsSupplement: begin Result.RangeStart := $2F800; Result.RangeEnd := $2FA1F; end; ubTags: begin Result.RangeStart := $E0000; Result.RangeEnd := $E007F; end; ubVariationSelectorsSupplement: begin Result.RangeStart := $E0100; Result.RangeEnd := $E01EF; end; ubSupplementaryPrivateUseAreaA: begin Result.RangeStart := $F0000; Result.RangeEnd := $FFFFF; end; ubSupplementaryPrivateUseAreaB: begin Result.RangeStart := $100000; Result.RangeEnd := $10FFFF; end; else begin Result.RangeStart := 0; Result.RangeEnd := 0; end; end; end; // Returns the CodeBlockName of the Block specified by CB // Names taken from http://www.unicode.org/Public/4.1.0/ucd/Blocks.txt function CodeBlockName(const CB: TUnicodeBlock): string; begin case CB of ubBasicLatin: Result := 'Basic Latin'; ubLatin1Supplement: Result := 'Latin-1 Supplement'; ubLatinExtendedA: Result := 'Latin Extended-A'; ubLatinExtendedB: Result := 'Latin Extended-B'; ubIPAExtensions: Result := 'IPA Extensions'; ubSpacingModifierLetters: Result := 'Spacing Modifier Letters'; ubCombiningDiacriticalMarks: Result := 'Combining Diacritical Marks'; //ubGreekandCoptic: ubGreek: Result := 'Greek and Coptic'; ubCyrillic: Result := 'Cyrillic'; ubCyrillicSupplement: Result := 'Cyrillic Supplement'; ubArmenian: Result := 'Armenian'; ubHebrew: Result := 'Hebrew'; ubArabic: Result := 'Arabic'; ubSyriac: Result := 'Syriac'; ubArabicSupplement: Result := 'Arabic Supplement'; ubThaana: Result := 'Thaana'; ubDevanagari: Result := 'Devanagari'; ubBengali: Result := 'Bengali'; ubGurmukhi: Result := 'Gurmukhi'; ubGujarati: Result := 'Gujarati'; ubOriya: Result := 'Oriya'; ubTamil: Result := 'Tamil'; ubTelugu: Result := 'Telugu'; ubKannada: Result := 'Kannada'; ubMalayalam: Result := 'Malayalam'; ubSinhala: Result := 'Sinhala'; ubThai: Result := 'Thai'; ubLao: Result := 'Lao'; ubTibetan: Result := 'Tibetan'; ubMyanmar: Result := 'Myanmar'; ubGeorgian: Result := 'Georgian'; ubHangulJamo: Result := 'Hangul Jamo'; ubEthiopic: Result := 'Ethiopic'; ubEthiopicSupplement: Result := 'Ethiopic Supplement'; ubCherokee: Result := 'Cherokee'; ubUnifiedCanadianAboriginalSyllabics: Result := 'Unified Canadian Aboriginal Syllabics'; ubOgham: Result := 'Ogham'; ubRunic: Result := 'Runic'; ubTagalog: Result := 'Tagalog'; ubHanunoo: Result := 'Hanunoo'; ubBuhid: Result := 'Buhid'; ubTagbanwa: Result := 'Tagbanwa'; ubKhmer: Result := 'Khmer'; ubMongolian: Result := 'Mongolian'; ubLimbu: Result := 'Limbu'; ubTaiLe: Result := 'Tai Le'; ubNewTaiLue: Result := 'New Tai Lue'; ubKhmerSymbols: Result := 'Khmer Symbols'; ubBuginese: Result := 'Buginese'; ubPhoneticExtensions: Result := 'Phonetic Extensions'; ubPhoneticExtensionsSupplement: Result := 'Phonetic Extensions Supplement'; ubCombiningDiacriticalMarksSupplement: Result := 'Combining Diacritical Marks Supplement'; ubLatinExtendedAdditional: Result := 'Latin Extended Additional'; ubGreekExtended: Result := 'Greek Extended'; ubGeneralPunctuation: Result := 'General Punctuation'; ubSuperscriptsandSubscripts: Result := 'Superscripts and Subscripts'; ubCurrencySymbols: Result := 'Currency Symbols'; //ubCombiningDiacriticalMarksforSymbols: ubCombiningMarksforSymbols: Result := 'Combining Diacritical Marks for Symbols'; ubLetterlikeSymbols: Result := 'Letterlike Symbols'; ubNumberForms: Result := 'Number Forms'; ubArrows: Result := 'Arrows'; ubMathematicalOperators: Result := 'Mathematical Operators'; ubMiscellaneousTechnical: Result := 'Miscellaneous Technical'; ubControlPictures: Result := 'Control Pictures'; ubOpticalCharacterRecognition: Result := 'Optical Character Recognition'; ubEnclosedAlphanumerics: Result := 'Enclosed Alphanumerics'; ubBoxDrawing: Result := 'Box Drawing'; ubBlockElements: Result := 'Block Elements'; ubGeometricShapes: Result := 'Geometric Shapes'; ubMiscellaneousSymbols: Result := 'Miscellaneous Symbols'; ubDingbats: Result := 'Dingbats'; ubMiscellaneousMathematicalSymbolsA: Result := 'Miscellaneous Mathematical Symbols-A'; ubSupplementalArrowsA: Result := 'Supplemental Arrows-A'; ubBraillePatterns: Result := 'Braille Patterns'; ubSupplementalArrowsB: Result := 'Supplemental Arrows-B'; ubMiscellaneousMathematicalSymbolsB: Result := 'Miscellaneous Mathematical Symbols-B'; ubSupplementalMathematicalOperators: Result := 'Supplemental Mathematical Operators'; ubMiscellaneousSymbolsandArrows: Result := 'Miscellaneous Symbols and Arrows'; ubGlagolitic: Result := 'Glagolitic'; ubCoptic: Result := 'Coptic'; ubGeorgianSupplement: Result := 'Georgian Supplement'; ubTifinagh: Result := 'Tifinagh'; ubEthiopicExtended: Result := 'Ethiopic Extended'; ubSupplementalPunctuation: Result := 'Supplemental Punctuation'; ubCJKRadicalsSupplement: Result := 'CJK Radicals Supplement'; ubKangxiRadicals: Result := 'Kangxi Radicals'; ubIdeographicDescriptionCharacters: Result := 'Ideographic Description Characters'; ubCJKSymbolsandPunctuation: Result := 'CJK Symbols and Punctuation'; ubHiragana: Result := 'Hiragana'; ubKatakana: Result := 'Katakana'; ubBopomofo: Result := 'Bopomofo'; ubHangulCompatibilityJamo: Result := 'Hangul Compatibility Jamo'; ubKanbun: Result := 'Kanbun'; ubBopomofoExtended: Result := 'Bopomofo Extended'; ubCJKStrokes: Result := 'CJK Strokes'; ubKatakanaPhoneticExtensions: Result := 'Katakana Phonetic Extensions'; ubEnclosedCJKLettersandMonths: Result := 'Enclosed CJK Letters and Months'; ubCJKCompatibility: Result := 'CJK Compatibility'; ubCJKUnifiedIdeographsExtensionA: Result := 'CJK Unified Ideographs Extension A'; ubYijingHexagramSymbols: Result := 'Yijing Hexagram Symbols'; ubCJKUnifiedIdeographs: Result := 'CJK Unified Ideographs'; ubYiSyllables: Result := 'Yi Syllables'; ubYiRadicals: Result := 'Yi Radicals'; ubModifierToneLetters: Result := 'Modifier Tone Letters'; ubSylotiNagri: Result := 'Syloti Nagri'; ubHangulSyllables: Result := 'Hangul Syllables'; ubHighSurrogates: Result := 'High Surrogates'; ubHighPrivateUseSurrogates: Result := 'High Private Use Surrogates'; ubLowSurrogates: Result := 'Low Surrogates'; //ubPrivateUseArea: ubPrivateUse: Result := 'Private Use Area'; ubCJKCompatibilityIdeographs: Result := 'CJK Compatibility Ideographs'; ubAlphabeticPresentationForms: Result := 'Alphabetic Presentation Forms'; ubArabicPresentationFormsA: Result := 'Arabic Presentation Forms-A'; ubVariationSelectors: Result := 'Variation Selectors'; ubVerticalForms: Result := 'Vertical Forms'; ubCombiningHalfMarks: Result := 'Combining Half Marks'; ubCJKCompatibilityForms: Result := 'CJK Compatibility Forms'; ubSmallFormVariants: Result := 'Small Form Variants'; ubArabicPresentationFormsB: Result := 'Arabic Presentation Forms-B'; ubHalfwidthandFullwidthForms: Result := 'Halfwidth and Fullwidth Forms'; ubSpecials: Result := 'Specials'; ubLinearBSyllabary: Result := 'Linear B Syllabary'; ubLinearBIdeograms: Result := 'Linear B Ideograms'; ubAegeanNumbers: Result := 'Aegean Numbers'; ubAncientGreekNumbers: Result := 'Ancient Greek Numbers'; ubOldItalic: Result := 'Old Italic'; ubGothic: Result := 'Gothic'; ubUgaritic: Result := 'Ugaritic'; ubOldPersian: Result := 'Old Persian'; ubDeseret: Result := 'Deseret'; ubShavian: Result := 'Shavian'; ubOsmanya: Result := 'Osmanya'; ubCypriotSyllabary: Result := 'Cypriot Syllabary'; ubKharoshthi: Result := 'Kharoshthi'; ubByzantineMusicalSymbols: Result := 'Byzantine Musical Symbols'; ubMusicalSymbols: Result := 'Musical Symbols'; ubAncientGreekMusicalNotation: Result := 'Ancient Greek Musical Notation'; ubTaiXuanJingSymbols: Result := 'Tai Xuan Jing Symbols'; ubMathematicalAlphanumericSymbols: Result := 'Mathematical Alphanumeric Symbols'; ubCJKUnifiedIdeographsExtensionB: Result := 'CJK Unified Ideographs Extension B'; ubCJKCompatibilityIdeographsSupplement: Result := 'CJK Compatibility Ideographs Supplement'; ubTags: Result := 'Tags'; ubVariationSelectorsSupplement: Result := 'Variation Selectors Supplement'; ubSupplementaryPrivateUseAreaA: Result := 'Supplementary Private Use Area-A'; ubSupplementaryPrivateUseAreaB: Result := 'Supplementary Private Use Area-B'; else Result := 'Undefined'; end; end; // Returns an ID for the Unicode code block to which C belongs. // If C does not belong to any of the defined blocks then ubUndefined is returned. // Note: the code blocks listed here are based on Unicode Version 3.1. function CodeBlockFromChar(const C: UCS4): TUnicodeBlock; // http://www.unicode.org/Public/4.1.0/ucd/Blocks.txt begin case C of $0000..$007F: Result := ubBasicLatin; $0080..$00FF: Result := ubLatin1Supplement; $0100..$017F: Result := ubLatinExtendedA; $0180..$024F: Result := ubLatinExtendedB; $0250..$02AF: Result := ubIPAExtensions; $02B0..$02FF: Result := ubSpacingModifierLetters; $0300..$036F: Result := ubCombiningDiacriticalMarks; $0370..$03FF: Result := ubGreek; //ubGreekandCoptic; $0400..$04FF: Result := ubCyrillic; $0500..$052F: Result := ubCyrillicSupplement; $0530..$058F: Result := ubArmenian; $0590..$05FF: Result := ubHebrew; $0600..$06FF: Result := ubArabic; $0700..$074F: Result := ubSyriac; $0750..$077F: Result := ubArabicSupplement; $0780..$07BF: Result := ubThaana; $0900..$097F: Result := ubDevanagari; $0980..$09FF: Result := ubBengali; $0A00..$0A7F: Result := ubGurmukhi; $0A80..$0AFF: Result := ubGujarati; $0B00..$0B7F: Result := ubOriya; $0B80..$0BFF: Result := ubTamil; $0C00..$0C7F: Result := ubTelugu; $0C80..$0CFF: Result := ubKannada; $0D00..$0D7F: Result := ubMalayalam; $0D80..$0DFF: Result := ubSinhala; $0E00..$0E7F: Result := ubThai; $0E80..$0EFF: Result := ubLao; $0F00..$0FFF: Result := ubTibetan; $1000..$109F: Result := ubMyanmar; $10A0..$10FF: Result := ubGeorgian; $1100..$11FF: Result := ubHangulJamo; $1200..$137F: Result := ubEthiopic; $1380..$139F: Result := ubEthiopicSupplement; $13A0..$13FF: Result := ubCherokee; $1400..$167F: Result := ubUnifiedCanadianAboriginalSyllabics; $1680..$169F: Result := ubOgham; $16A0..$16FF: Result := ubRunic; $1700..$171F: Result := ubTagalog; $1720..$173F: Result := ubHanunoo; $1740..$175F: Result := ubBuhid; $1760..$177F: Result := ubTagbanwa; $1780..$17FF: Result := ubKhmer; $1800..$18AF: Result := ubMongolian; $1900..$194F: Result := ubLimbu; $1950..$197F: Result := ubTaiLe; $1980..$19DF: Result := ubNewTaiLue; $19E0..$19FF: Result := ubKhmerSymbols; $1A00..$1A1F: Result := ubBuginese; $1D00..$1D7F: Result := ubPhoneticExtensions; $1D80..$1DBF: Result := ubPhoneticExtensionsSupplement; $1DC0..$1DFF: Result := ubCombiningDiacriticalMarksSupplement; $1E00..$1EFF: Result := ubLatinExtendedAdditional; $1F00..$1FFF: Result := ubGreekExtended; $2000..$206F: Result := ubGeneralPunctuation; $2070..$209F: Result := ubSuperscriptsandSubscripts; $20A0..$20CF: Result := ubCurrencySymbols; $20D0..$20FF: Result := ubCombiningMarksforSymbols; //ubCombiningDiacriticalMarksforSymbols; $2100..$214F: Result := ubLetterlikeSymbols; $2150..$218F: Result := ubNumberForms; $2190..$21FF: Result := ubArrows; $2200..$22FF: Result := ubMathematicalOperators; $2300..$23FF: Result := ubMiscellaneousTechnical; $2400..$243F: Result := ubControlPictures; $2440..$245F: Result := ubOpticalCharacterRecognition; $2460..$24FF: Result := ubEnclosedAlphanumerics; $2500..$257F: Result := ubBoxDrawing; $2580..$259F: Result := ubBlockElements; $25A0..$25FF: Result := ubGeometricShapes; $2600..$26FF: Result := ubMiscellaneousSymbols; $2700..$27BF: Result := ubDingbats; $27C0..$27EF: Result := ubMiscellaneousMathematicalSymbolsA; $27F0..$27FF: Result := ubSupplementalArrowsA; $2800..$28FF: Result := ubBraillePatterns; $2900..$297F: Result := ubSupplementalArrowsB; $2980..$29FF: Result := ubMiscellaneousMathematicalSymbolsB; $2A00..$2AFF: Result := ubSupplementalMathematicalOperators; $2B00..$2BFF: Result := ubMiscellaneousSymbolsandArrows; $2C00..$2C5F: Result := ubGlagolitic; $2C80..$2CFF: Result := ubCoptic; $2D00..$2D2F: Result := ubGeorgianSupplement; $2D30..$2D7F: Result := ubTifinagh; $2D80..$2DDF: Result := ubEthiopicExtended; $2E00..$2E7F: Result := ubSupplementalPunctuation; $2E80..$2EFF: Result := ubCJKRadicalsSupplement; $2F00..$2FDF: Result := ubKangxiRadicals; $2FF0..$2FFF: Result := ubIdeographicDescriptionCharacters; $3000..$303F: Result := ubCJKSymbolsandPunctuation; $3040..$309F: Result := ubHiragana; $30A0..$30FF: Result := ubKatakana; $3100..$312F: Result := ubBopomofo; $3130..$318F: Result := ubHangulCompatibilityJamo; $3190..$319F: Result := ubKanbun; $31A0..$31BF: Result := ubBopomofoExtended; $31C0..$31EF: Result := ubCJKStrokes; $31F0..$31FF: Result := ubKatakanaPhoneticExtensions; $3200..$32FF: Result := ubEnclosedCJKLettersandMonths; $3300..$33FF: Result := ubCJKCompatibility; $3400..$4DBF: Result := ubCJKUnifiedIdeographsExtensionA; $4DC0..$4DFF: Result := ubYijingHexagramSymbols; $4E00..$9FFF: Result := ubCJKUnifiedIdeographs; $A000..$A48F: Result := ubYiSyllables; $A490..$A4CF: Result := ubYiRadicals; $A700..$A71F: Result := ubModifierToneLetters; $A800..$A82F: Result := ubSylotiNagri; $AC00..$D7AF: Result := ubHangulSyllables; $D800..$DB7F: Result := ubHighSurrogates; $DB80..$DBFF: Result := ubHighPrivateUseSurrogates; $DC00..$DFFF: Result := ubLowSurrogates; $E000..$F8FF: Result := ubPrivateUse; //ubPrivateUseArea; $F900..$FAFF: Result := ubCJKCompatibilityIdeographs; $FB00..$FB4F: Result := ubAlphabeticPresentationForms; $FB50..$FDFF: Result := ubArabicPresentationFormsA; $FE00..$FE0F: Result := ubVariationSelectors; $FE10..$FE1F: Result := ubVerticalForms; $FE20..$FE2F: Result := ubCombiningHalfMarks; $FE30..$FE4F: Result := ubCJKCompatibilityForms; $FE50..$FE6F: Result := ubSmallFormVariants; $FE70..$FEFF: Result := ubArabicPresentationFormsB; $FF00..$FFEF: Result := ubHalfwidthandFullwidthForms; $FFF0..$FFFF: Result := ubSpecials; $10000..$1007F: Result := ubLinearBSyllabary; $10080..$100FF: Result := ubLinearBIdeograms; $10100..$1013F: Result := ubAegeanNumbers; $10140..$1018F: Result := ubAncientGreekNumbers; $10300..$1032F: Result := ubOldItalic; $10330..$1034F: Result := ubGothic; $10380..$1039F: Result := ubUgaritic; $103A0..$103DF: Result := ubOldPersian; $10400..$1044F: Result := ubDeseret; $10450..$1047F: Result := ubShavian; $10480..$104AF: Result := ubOsmanya; $10800..$1083F: Result := ubCypriotSyllabary; $10A00..$10A5F: Result := ubKharoshthi; $1D000..$1D0FF: Result := ubByzantineMusicalSymbols; $1D100..$1D1FF: Result := ubMusicalSymbols; $1D200..$1D24F: Result := ubAncientGreekMusicalNotation; $1D300..$1D35F: Result := ubTaiXuanJingSymbols; $1D400..$1D7FF: Result := ubMathematicalAlphanumericSymbols; $20000..$2A6DF: Result := ubCJKUnifiedIdeographsExtensionB; $2F800..$2FA1F: Result := ubCJKCompatibilityIdeographsSupplement; $E0000..$E007F: Result := ubTags; $E0100..$E01EF: Result := ubVariationSelectorsSupplement; $F0000..$FFFFF: Result := ubSupplementaryPrivateUseAreaA; $100000..$10FFFF: Result := ubSupplementaryPrivateUseAreaB; else Result := ubUndefined; end; end; function CompareTextWin95(const W1, W2: WideString; Locale: LCID): Integer; // special comparation function for Win9x since there's no system defined // comparation function, returns -1 if W1 < W2, 0 if W1 = W2 or 1 if W1 > W2 var S1, S2: string; CP: Integer; L1, L2: Integer; begin L1 := Length(W1); L2 := Length(W2); SetLength(S1, L1); SetLength(S2, L2); CP := CodePageFromLocale(Locale); WideCharToMultiByte(CP, 0, PWideChar(W1), L1, PChar(S1), L1, nil, nil); WideCharToMultiByte(CP, 0, PWideChar(W2), L2, PChar(S2), L2, nil, nil); Result := CompareStringA(Locale, NORM_IGNORECASE, PChar(S1), Length(S1), PChar(S2), Length(S2)) - 2; end; function CompareTextWinNT(const W1, W2: WideString; Locale: LCID): Integer; // Wrapper function for WinNT since there's no system defined comparation function // in Win9x and we need a central comparation function for TWideStringList. // Returns -1 if W1 < W2, 0 if W1 = W2 or 1 if W1 > W2 begin Result := CompareStringW(Locale, NORM_IGNORECASE, PWideChar(W1), Length(W1), PWideChar(W2), Length(W2)) - 2; end; function StringToWideStringEx(const S: string; CodePage: Word): WideString; var InputLength, OutputLength: Integer; begin InputLength := Length(S); OutputLength := MultiByteToWideChar(CodePage, 0, PChar(S), InputLength, nil, 0); SetLength(Result, OutputLength); MultiByteToWideChar(CodePage, 0, PChar(S), InputLength, PWideChar(Result), OutputLength); end; function WideStringToStringEx(const WS: WideString; CodePage: Word): string; var InputLength, OutputLength: Integer; begin InputLength := Length(WS); OutputLength := WideCharToMultiByte(CodePage, 0, PWideChar(WS), InputLength, nil, 0, nil, nil); SetLength(Result, OutputLength); WideCharToMultiByte(CodePage, 0, PWideChar(WS), InputLength, PChar(Result), OutputLength, nil, nil); end; function TranslateString(const S: string; CP1, CP2: Word): string; begin Result:= WideStringToStringEx(StringToWideStringEx(S, CP1), CP2); end; //----------------- conversion routines ------------------------------------------------------------ // Converts the given source ANSI string into a Unicode string by expanding each character // from one byte to two bytes. // EAX contains Source, EDX contains Target, ECX contains Count procedure ExpandANSIString(const Source: PChar; Target: PWideChar; Count: Cardinal); asm JECXZ @@Finish // go out if there is nothing to do PUSH ESI MOV ESI, EAX XOR EAX, EAX @@1: MOV AL, [ESI] INC ESI MOV [EDX], AX ADD EDX, 2 DEC ECX JNZ @@1 POP ESI @@Finish: end; const HalfShift: Integer = 10; HalfBase: UCS4 = $0010000; HalfMask: UCS4 = $3FF; OffsetsFromUTF8: array [0..5] of UCS4 = ($00000000, $00003080, $000E2080, $03C82080, $FA082080, $82082080); BytesFromUTF8: array [0..255] of Byte = (0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, 2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2, 3,3,3,3,3,3,3,3,4,4,4,4,5,5,5,5); FirstByteMark: array [0..6] of Byte = ($00, $00, $C0, $E0, $F0, $F8, $FC); function WideStringToUTF8(S: WideString): AnsiString; var Ch: UCS4; L, J, T, BytesToWrite: Cardinal; ByteMask: UCS4; ByteMark: UCS4; begin if S = '' then Result := '' else begin SetLength(Result, Length(S) * 6); // assume worst case T := 1; ByteMask := $BF; ByteMark := $80; for J := 1 to Length(S) do begin Ch := UCS4(S[J]); if Ch < $80 then BytesToWrite := 1 else if Ch < $800 then BytesToWrite := 2 else if Ch < $10000 then BytesToWrite := 3 else if Ch < $200000 then BytesToWrite := 4 else if Ch < $4000000 then BytesToWrite := 5 else if Ch <= MaximumUCS4 then BytesToWrite := 6 else begin BytesToWrite := 2; Ch := ReplacementCharacter; end; for L := BytesToWrite downto 2 do begin Result[T + L - 1] := Char((Ch or ByteMark) and ByteMask); Ch := Ch shr 6; end; Result[T] := Char(Ch or FirstByteMark[BytesToWrite]); Inc(T, BytesToWrite); end; SetLength(Result, T - 1); // set to actual length end; end; function UTF8ToWideString(S: AnsiString): WideString; var L, J, T: Cardinal; Ch: UCS4; ExtraBytesToWrite: Word; begin if S = '' then Result := '' else begin SetLength(Result, Length(S)); // create enough room L := 1; T := 1; while L <= Cardinal(Length(S)) do begin Ch := 0; ExtraBytesToWrite := BytesFromUTF8[Ord(S[L])]; for J := ExtraBytesToWrite downto 1 do begin Ch := Ch + Ord(S[L]); Inc(L); Ch := Ch shl 6; end; Ch := Ch + Ord(S[L]); Inc(L); Ch := Ch - OffsetsFromUTF8[ExtraBytesToWrite]; if Ch <= MaximumUCS2 then begin Result[T] := WideChar(Ch); Inc(T); end else if Ch > MaximumUCS4 then begin Result[T] := WideChar(ReplacementCharacter); Inc(T); end else begin Ch := Ch - HalfBase; Result[T] := WideChar((Ch shr HalfShift) + SurrogateHighStart); Inc(T); Result[T] := WideChar((Ch and HalfMask) + SurrogateLowStart); Inc(T); end; end; SetLength(Result, T - 1); // now fix up length end; end; procedure PrepareUnicodeData; // Prepares structures which are globally needed. begin LoadInProgress := TJclCriticalSection.Create; if (Win32Platform and VER_PLATFORM_WIN32_NT) <> 0 then @WideCompareText := @CompareTextWinNT else @WideCompareText := @CompareTextWin95; end; procedure FreeUnicodeData; // Frees all data which has been allocated and which is not automatically freed by Delphi. begin FreeAndNil(LoadInProgress); end; initialization PrepareUnicodeData; finalization FreeUnicodeData; {$ENDIF SUPPORTS_WIDESTRING} // History: // $Log: JclUnicode.pas,v $ // Revision 1.31 2005/10/26 09:15:13 marquardt // most functions now have the same const parameters as their Ansi counterparts // // Revision 1.30 2005/10/26 08:36:29 marquardt // StrPCopyWW and StrPLCopyWW introduced to solve overloaded problem // // Revision 1.29 2005/10/25 18:20:10 outchy // IT3174: UTF8-file support in JclUnicode.pas // // Revision 1.28 2005/10/25 16:27:36 marquardt // StrPCopyW and StrPLCopyW overloaded versions deactivated because of Delphi5 compiler problems // // Revision 1.27 2005/10/25 10:33:40 marquardt // made StrPCopyW and StrPLCopyW compatible with the original Unicode.pas by adding overloaded versions // // Revision 1.26 2005/10/25 09:47:04 marquardt // minor fixes and cleanups // // Revision 1.25 2005/10/25 08:54:57 marquardt // make a union of the Str*W family of functions in JclUnicode and JclWideStrings // // Revision 1.24 2005/10/16 05:16:51 marquardt // TWideStrings now has GetText and GetTextStr like TStrings // // Revision 1.23 2005/07/19 21:28:26 outchy // IT 3066: JclUnicode.pas updated to Unicode 4.1 // // Revision 1.22 2005/03/08 08:33:23 marquardt // overhaul of exceptions and resourcestrings, minor style cleaning // // Revision 1.21 2005/03/01 15:37:40 marquardt // addressing Mantis 0714, 0716, 0720, 0731, 0740 partly or completely // // Revision 1.20 2005/02/24 16:34:53 marquardt // remove divider lines, add section lines (unfinished) // // Revision 1.19 2005/02/24 07:36:25 marquardt // resolved the compiler warnings, style cleanup, removed code from JclContainerIntf.pas // // Revision 1.18 2005/02/14 03:20:59 rrossmair // - fixed issues #0000713 ( make CompareTextWin95/NT functions use const string parameters) and #0001909 (JclUnicode.CharSetFromLocale - result ignored) // // Revision 1.17 2004/11/22 19:17:18 ahuser // Fixed memory leak // Style cleaning // // Revision 1.16 2004/10/17 21:00:16 mthoma // cleaning // // Revision 1.15 2004/08/01 11:40:24 marquardt // move constructors/destructors // // Revision 1.14 2004/07/31 06:21:03 marquardt // fixing TStringLists, adding BeginUpdate/EndUpdate, finalization improved // // Revision 1.13 2004/07/28 18:00:54 marquardt // various style cleanings, some minor fixes // // Revision 1.12 2004/06/16 07:30:31 marquardt // added tilde to all IFNDEF ENDIFs, inherited qualified // // Revision 1.11 2004/06/14 13:05:22 marquardt // style cleaning ENDIF, Tabs // // Revision 1.10 2004/06/14 11:05:53 marquardt // symbols added to all ENDIFs and some other minor style changes like removing IFOPT // // Revision 1.9 2004/05/05 07:33:49 rrossmair // header updated according to new policy: initial developers & contributors listed // // Revision 1.8 2004/04/06 04:55:18 // adapt compiler conditions, add log entry // end.