Unicode caret

12/24/2023

The basic idea has been to incorporate the piece-table in a manner that caused minimal impact on the rest of the design. To achieve this aim I layered the piece-table directly on top of the raw file content. The piece-table (or sequence class) therefore presents the underlying file in its raw form - as a sequence of BYTEs rather than WCHAR units. This makes sense for the moment, because the TextDocument can still access the file in its raw form prior to converting it to UTF-16. The only difference now is that the TextDocument can make changes to the raw file through the piece-table as well as read its content. The image below illustrates the various components of the text editor as it stands at the moment. I’m not completely certain it’s the best way to do things in an editor, but at least it shows that the idea can work in practice. Not everything is as smooth as it seems, however.
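To make the idea of a piece-table layered over raw bytes more concrete, here is a minimal sketch. It is not Neatpad's sequence class: the name BytePieceTable and its three methods are hypothetical, it keeps its pieces in a plain vector, and it leaves out undo/redo entirely. It only shows how inserts and erases can be recorded as byte offsets without ever modifying the original file content.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical minimal piece table over raw bytes (not Neatpad's sequence class).
class BytePieceTable {
public:
    explicit BytePieceTable(std::string original)
        : original_(std::move(original)) {
        if (!original_.empty())
            pieces_.push_back({false, 0, original_.size()});
    }

    // Insert `bytes` at byte offset `pos` (offsets past the end append).
    void insert(std::size_t pos, const std::string& bytes) {
        if (bytes.empty()) return;
        Piece added{true, added_.size(), bytes.size()};
        added_ += bytes;                       // the add buffer only ever grows
        std::size_t index = splitAt(pos);
        pieces_.insert(pieces_.begin() + index, added);
    }

    // Erase `count` bytes starting at byte offset `pos`.
    void erase(std::size_t pos, std::size_t count) {
        std::size_t first = splitAt(pos);
        std::size_t last = splitAt(pos + count);
        pieces_.erase(pieces_.begin() + first, pieces_.begin() + last);
    }

    // Assemble the current document content as raw bytes.
    std::string render() const {
        std::string out;
        for (const Piece& p : pieces_)
            out.append(p.fromAdded ? added_ : original_, p.offset, p.length);
        return out;
    }

private:
    struct Piece { bool fromAdded; std::size_t offset; std::size_t length; };

    // Make sure a piece boundary exists at byte offset `pos` and return the
    // index of the piece starting there (pieces_.size() if pos is at the end).
    std::size_t splitAt(std::size_t pos) {
        std::size_t start = 0;
        for (std::size_t i = 0; i < pieces_.size(); ++i) {
            Piece& p = pieces_[i];
            if (pos == start) return i;
            if (pos < start + p.length) {
                std::size_t head = pos - start;
                Piece tail{p.fromAdded, p.offset + head, p.length - head};
                p.length = head;
                pieces_.insert(pieces_.begin() + i + 1, tail);
                return i + 1;
            }
            start += p.length;
        }
        return pieces_.size();
    }

    std::string original_;        // read-only file content
    std::string added_;           // append-only buffer for inserted bytes
    std::vector<Piece> pieces_;   // the document, in order
};
```

Because every offset here is a byte offset, a caller working in UTF-16 would have to translate between WCHAR positions and byte positions before editing.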

My first attempt at integrating the piece-table into Neatpad went quite smoothly. Regardless of how the text is entered, all we need to do is receive each WM_CHAR as it is sent to the TextView and process it accordingly. Usually the processing is where the tricky part comes in, but fortunately we now have the piece-chain sequence class, which will enable us to make edits to the underlying document each time a character is received. It is very unlikely, however, that a user will manually enter the two surrogate values of a character separately - more than likely they will be using an Input Method Editor, and it will be the IME that breaks their keyboard input into UTF-16 units.
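If the sequence stores the file's raw bytes, a received character has to be converted into the file's encoding before it can be inserted. The sketch below assumes, purely for illustration, that the file is UTF-8; the function name encodeUtf8 is hypothetical and this is not necessarily how Neatpad performs the conversion.

```cpp
#include <string>

// Encode one Unicode code point as UTF-8 bytes.
// Assumes `cp` is a valid scalar value (not a surrogate, not above 0x10FFFF).
std::string encodeUtf8(char32_t cp) {
    std::string out;
    if (cp < 0x80) {                       // 1 byte: plain ASCII
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {               // 2 bytes
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {             // 3 bytes: rest of the BMP
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                               // 4 bytes: supplementary planes
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

The returned bytes are what would be handed to the sequence's insert operation at the caret's byte offset.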

We already looked at keyboard navigation in Part 16 - Keyboard Navigation, in which we discussed caret movement within a Unicode document, and we briefly looked at the various Win32 character-input messages that a program can encounter when receiving keyboard input.

Even though the WM_CHAR message has been around since the first versions of Windows, it is still the most appropriate way for a Win32 application to receive character input. For any UNICODE application, the WM_CHAR message sends a single UTF-16 character value instead of a plain ANSI character. This is perfect for us, because Neatpad is already a UTF-16 (wide-character) application. Even complex scripts will be handled seamlessly, because keyboard input for these languages is usually associated with an Input Method Editor (IME) - which will translate any ‘complex’ key-strokes into the appropriate stream of UTF-16 characters, without any extra work on our part. The Windows Input Method Editor will be the subject of a future tutorial.

The other messages look interesting but are not really necessary. Supposedly the WM_UNICHAR message sends UTF-32 characters rather than 16-bit WCHARs - however I have never seen WM_UNICHAR being sent to a program, even on an XP machine. I suspect that this is a message that is sent by other applications (such as IMEs) rather than the OS itself. Likewise the WM_IME_CHAR message is only sent under special circumstances. Although we will ignore these additional input-messages, we will not be losing any functionality by simply handling WM_CHAR at this point.

The code below shows the standard method of handling character-input in a Win32 program:

    case WM_CHAR:
        return OnChar(wParam, lParam);

We don’t need to do anything special to receive Unicode input. As long as we compile with the UNICODE macro defined we will receive UTF-16 characters. Characters outside the Basic Multilingual Plane (i.e. > 0xFFFF in value) will be sent as two separate messages, one for each surrogate character.
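Since a character above 0xFFFF arrives as two WM_CHAR messages, a handler that wants whole code points has to remember the first surrogate until the second one turns up. The following is a minimal sketch of that bookkeeping, written without windows.h so it stands alone. SurrogateCombiner and feed are hypothetical names, not part of Neatpad, and substituting U+FFFD for an unpaired low surrogate is my own assumption about error handling.

```cpp
// Hypothetical helper: joins UTF-16 units that arrive one per WM_CHAR message.
class SurrogateCombiner {
public:
    // Feed one UTF-16 unit (for example the wParam of a WM_CHAR message).
    // Returns true and sets `out` when a complete code point is available;
    // returns false when the unit was a high surrogate awaiting its partner.
    bool feed(char16_t unit, char32_t& out) {
        if (unit >= 0xD800 && unit <= 0xDBFF) {     // high surrogate: remember it
            pendingHigh_ = unit;
            return false;
        }
        if (unit >= 0xDC00 && unit <= 0xDFFF) {     // low surrogate
            if (pendingHigh_ == 0) {
                out = 0xFFFD;                       // unpaired low: replacement char
                return true;
            }
            out = 0x10000
                + ((static_cast<char32_t>(pendingHigh_) - 0xD800) << 10)
                + (static_cast<char32_t>(unit) - 0xDC00);
            pendingHigh_ = 0;
            return true;
        }
        pendingHigh_ = 0;                           // drop any dangling high surrogate
        out = unit;                                 // ordinary BMP character
        return true;
    }

private:
    char16_t pendingHigh_ = 0;                      // 0 means nothing is pending
};
```

An OnChar handler could keep one such object per TextView and only act on the document when feed returns true.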

Unicode character input presents some unique problems for text-editors - issues that did not have to be considered when the first ASCII editors were written. The main difficulties are the Unicode ‘combining sequences’ - where multiple code-points are combined to form a single selectable ‘character cluster’. Modifications to a Unicode text-file require careful coding to ensure that character cluster-boundaries are preserved and that no invalid sequences are inadvertently introduced into the document. The Uniscribe API will again be used to aid us in this area.

Character input (of any kind) is not possible without some form of data-structure to manage and represent any alterations to the document. The last tutorial saw the implementation of a piece-table data structure which implements three basic edit operations: insert, erase, and replace. Unlimited undo and redo are also supported. The sequence class was presented which encapsulates the piece-table and these basic editing operations within a single C++ object. The purpose of this tutorial is therefore to document the modifications required by Neatpad to support the piece-table editing model.
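To illustrate what preserving cluster boundaries means in practice, the sketch below moves an edit offset back so that it never lands inside a surrogate pair or between a base character and a following combining mark. It is a deliberately crude stand-in: it only recognises the Combining Diacritical Marks block (U+0300 to U+036F), whereas real cluster boundaries for arbitrary scripts need a shaping engine such as Uniscribe. The function names are hypothetical.

```cpp
#include <cstddef>
#include <string>

inline bool isHighSurrogate(char16_t u) { return u >= 0xD800 && u <= 0xDBFF; }
inline bool isLowSurrogate(char16_t u)  { return u >= 0xDC00 && u <= 0xDFFF; }

// Crude approximation: only the Combining Diacritical Marks block.
inline bool isCombiningMark(char16_t u) { return u >= 0x0300 && u <= 0x036F; }

// Return the nearest offset at or before `offset` (in UTF-16 units) where an
// edit would not split a surrogate pair or detach a combining mark.
std::size_t snapToClusterStart(const std::u16string& text, std::size_t offset) {
    if (offset >= text.size())
        return text.size();                 // the end of the text is always safe
    // Step back over combining marks so they stay attached to their base.
    while (offset > 0 && isCombiningMark(text[offset]))
        --offset;
    // Never stop between the two halves of a surrogate pair.
    if (offset > 0 && isLowSurrogate(text[offset]) && isHighSurrogate(text[offset - 1]))
        --offset;
    return offset;
}
```

An insert or erase would run its offsets through a check like this first, so that no half of a pair and no orphaned combining mark is left behind in the document.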
