Something about iterating through UTF-16 Unicode strings (.NET, C#)

Every character in Unicode has its own unique number, called a code point. Unicode now defines more than 100,000 characters, so a code point may not fit into the 2 bytes of a single string element of type Char. Such characters come from the supplementary planes. In a UTF-16 string their code points are encoded in a special way and occupy two string elements (4 bytes), which together are called a surrogate pair. The first element is called the high surrogate, the second the low surrogate.
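
As a small illustration of surrogate pairs, the sketch below uses the standard char.ConvertFromUtf32 and char.ConvertToUtf32 methods. The class name SurrogateDemo and its helper methods are made up for this example; they are not part of .NET.

```csharp
using System;

// Hypothetical helper class, for illustration only.
public static class SurrogateDemo
{
    // Number of Char elements (UTF-16 code units) needed for one code point.
    public static int CodeUnitCount(int codePoint)
    {
        return char.ConvertFromUtf32(codePoint).Length;
    }

    // True if the Char at pos is the first half (high surrogate) of a pair.
    public static bool StartsSurrogatePair(string text, int pos)
    {
        return pos + 1 < text.Length &&
               char.IsSurrogatePair(text[pos], text[pos + 1]);
    }

    // Code point starting at pos; a surrogate pair is combined into one value.
    public static int CodePointAt(string text, int pos)
    {
        return char.ConvertToUtf32(text, pos);
    }
}
```

For example, U+1D11E (musical symbol G clef) lies in a supplementary plane: SurrogateDemo.CodeUnitCount(0x1D11E) gives 2, and the resulting string is "\uD834\uDD1E", while a character such as U+0041 needs only one element.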

Apart from that, a visual character (grapheme) can be represented by multiple code points. These are characters (including those from the supplementary planes) modified by combining diacritical marks, which have code points of their own.
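
To see the difference between the number of string elements and the number of visual characters, here is a minimal sketch based on the StringInfo.LengthInTextElements property. The class name GraphemeDemo and its method names are assumptions made for this example.

```csharp
using System.Globalization;

// Hypothetical helper class, for illustration only.
public static class GraphemeDemo
{
    // Number of Char elements (UTF-16 code units) in the string.
    public static int CodeUnits(string text)
    {
        return text.Length;
    }

    // Number of visual characters, as counted by StringInfo.
    public static int VisualCharacters(string text)
    {
        return new StringInfo(text).LengthInTextElements;
    }
}
```

For the string "a\u0300\u0332" (the letter a with a combining grave accent and a combining low line), CodeUnits returns 3 while VisualCharacters returns 1.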

So the representation of a visual character in a UTF-16 string has variable length; in .NET such a representation is called a text element. In general it is therefore wrong to iterate through a string by simply incrementing the current position by 1.

The System.Globalization namespace contains the StringInfo and CharUnicodeInfo classes. StringInfo splits a string into text elements and lets you iterate through them. CharUnicodeInfo retrieves information about a Unicode character.
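
One way to iterate through text elements is StringInfo.GetTextElementEnumerator, sketched below together with a CharUnicodeInfo lookup. The class name TextElementDemo and its methods are hypothetical wrappers written for this example.

```csharp
using System.Collections.Generic;
using System.Globalization;

// Hypothetical helper class, for illustration only.
public static class TextElementDemo
{
    // Collects the text elements of a string, one substring per element.
    public static List<string> Split(string text)
    {
        var result = new List<string>();
        TextElementEnumerator e = StringInfo.GetTextElementEnumerator(text);
        while (e.MoveNext())
        {
            result.Add(e.GetTextElement());
        }
        return result;
    }

    // Unicode category of the character that starts at the given index.
    public static UnicodeCategory CategoryAt(string text, int index)
    {
        return CharUnicodeInfo.GetUnicodeCategory(text, index);
    }
}
```

Split("H\u0302=T\u0302") yields three elements: "H\u0302", "=" and "T\u0302". CategoryAt("H\u0302", 1) reports UnicodeCategory.NonSpacingMark for the combining circumflex.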

Splitting a string into text elements

The StringInfo class has a static method, ParseCombiningCharacters, for splitting a string into text elements.

public static int[] ParseCombiningCharacters(string str)

This method returns an array of the starting indexes of the text elements within the given string. If the string has no surrogate pairs or characters with diacritical marks, the array will be {0,1,2,3,4,5..}. If the string has a surrogate pair in the second position (index 1), the array will be {0,1,3,4,5..}. If the string has an underlined character with a grave accent (for example à̲, "\u0061\u0300\u0332") in the second position, the array will be {0,1,4,5..}.
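
The three cases above can be checked with a short sketch. The class name ParseDemo, the Starts method and the sample strings are chosen for this example only; the call to StringInfo.ParseCombiningCharacters is the real API.

```csharp
using System.Globalization;

// Hypothetical helper class, for illustration only.
public static class ParseDemo
{
    // Starting indexes of the text elements, joined as "0,1,3,..." for easy reading.
    public static string Starts(string text)
    {
        int[] indexes = StringInfo.ParseCombiningCharacters(text);
        return string.Join(",", indexes);
    }
}
```

Starts("abcd") gives "0,1,2,3". Starts("a\uD834\uDD1Ebc"), with a surrogate pair at index 1, gives "0,1,3,4". Starts("xa\u0300\u0332yz"), with an underlined a with a grave accent at index 1, gives "0,1,4,5".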

Skipping a sequence of alpha characters

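First, let's define what an alpha character is. The CharUnicodeInfo.GetUnicodeCategory(String, Int32) method returns the Unicode category of the character at a given position in a string. If that position holds the first element of a surrogate pair, the category of the whole pair is returned.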

using System.Text;
using System.Globalization;

...

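// Returns true if the character starting at pos is an upper-, lower- or titlecase letter.
public static bool IsCharAlpha(string text, int pos)
{
	// The (string, int) overload also handles a surrogate pair starting at pos.
	UnicodeCategory uc = CharUnicodeInfo.GetUnicodeCategory(text, pos);
	
	return uc == UnicodeCategory.UppercaseLetter || 
		uc == UnicodeCategory.LowercaseLetter || 
		uc == UnicodeCategory.TitlecaseLetter;
}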
...

Now skip the alpha characters at the start of the string.

	...
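	string S = "H\u0302=T\u0302+V\u0302"; // "Ĥ=T̂+V̂" (Hamiltonian)
	// Starting index of every text element in S: {0,2,3,5,6}
	int[] TextElements = StringInfo.ParseCombiningCharacters(S);
	
	int n = TextElements.Length;
	int i = 0;
	
	// Advance one text element at a time while its base character is a letter.
	while (i < n && IsCharAlpha(S, TextElements[i])) i++;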
	...

Now TextElements[i] is the index of the "=" character in S (here i is 1 and TextElements[i] is 2).
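
The fragments above can be put together into one compilable sketch. The class name AlphaSkipper and the SkipAlphas method are made up for this example, and returning text.Length when the whole string consists of alpha characters is an assumption of this sketch, since the loop above leaves that case open.

```csharp
using System.Globalization;

// Hypothetical helper class, for illustration only.
public static class AlphaSkipper
{
    // Same definition of an alpha character as above.
    public static bool IsCharAlpha(string text, int pos)
    {
        UnicodeCategory uc = CharUnicodeInfo.GetUnicodeCategory(text, pos);

        return uc == UnicodeCategory.UppercaseLetter ||
               uc == UnicodeCategory.LowercaseLetter ||
               uc == UnicodeCategory.TitlecaseLetter;
    }

    // Returns the string index of the first text element that is not alpha,
    // or text.Length if every text element is alpha (assumption of this sketch).
    public static int SkipAlphas(string text)
    {
        int[] starts = StringInfo.ParseCombiningCharacters(text);
        int i = 0;

        // Step through text elements, not through individual Char elements.
        while (i < starts.Length && IsCharAlpha(text, starts[i])) i++;

        return i < starts.Length ? starts[i] : text.Length;
    }
}
```

AlphaSkipper.SkipAlphas("H\u0302=T\u0302+V\u0302") returns 2, the index of "=". Note that only the base character of each text element is tested, so a letter followed by combining marks still counts as alpha.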