`unicode.script`. Initially, I used a relatively simple representation of the data: there was a byte array, where the index was the code point and the elements were bytes corresponding to scripts. (It's possible to use a byte array because there are only seventy-some scripts to care about.) Lookup consisted of `char>num-table nth num>name-table nth`.

But this was pretty inefficient. The largest code point (that I wanted to represent here) was something around number 195,000, meaning that the byte array took up almost 200KB. Even if I somehow got rid of that empty space (and I don't see an obvious way how, without a bunch of overhead), there are 100,000 code points whose script I wanted to encode. But we can do better than taking up 100KB. The thing about this data is that scripts are in a bunch of contiguous ranges. That is, two characters that are next to each other in code point order are very likely to have the same script. The file in the Unicode Character Database encoding this information actually uses special syntax to denote a range, rather than writing out each code point individually. So what if we store these intervals directly rather than storing each element of the intervals?
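To make the space argument concrete, here is a rough sketch in Python (not Factor, and not code from the library) of collapsing a per-code-point table into closed ranges. The function name `collapse_runs` and the toy table are made up for illustration, and `None` stands in for a code point with no script recorded.

```python
def collapse_runs(table):
    """Collapse a per-code-point table into (start, end, value) runs.

    table[i] is the value for code point i, or None if nothing is recorded.
    Ranges are closed: both start and end are included.
    """
    runs = []
    start = None  # index where the current run began, if any
    for cp, value in enumerate(table):
        # Close the current run as soon as the value changes.
        if start is not None and value != table[start]:
            runs.append((start, cp - 1, table[start]))
            start = None
        # Open a new run on the first non-empty entry.
        if start is None and value is not None:
            start = cp
    if start is not None:
        runs.append((start, len(table) - 1, table[start]))
    return runs


# Hypothetical toy table: seven entries collapse into three runs.
toy = ["A", "A", None, "B", "B", "B", "A"]
runs = collapse_runs(toy)
```

The flat table needs one entry per code point, while the collapsed form needs one entry per run, which is where the savings come from when neighbouring code points usually share a script.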

A data structure to hold intervals with O(log n) lookup and insertion has already been developed: interval trees. They're described in Chapter 14 of Introduction to Algorithms starting on page 311, but I won't describe them here. At first, I tried to implement these, but I realized that, for my purposes, they're overkill. They're really easy to get wrong: if you implement them on top of another kind of balanced binary tree, you have to make sure that balancing preserves certain invariants about annotations on the tree. Still, if you need fast insertion and deletion, they make the most sense.

A much simpler solution is to just have a sorted array of intervals, each associated with a value. The right interval, and then the corresponding value, can be found by simple binary search. I don't even need to know how to do binary search, because it's already in the Factor library! This is efficient as long as the interval map is constructed all at once, which it is in this case. By a high constant factor, this is also more space-efficient than using binary trees. The whole solution takes less than 30 lines of code.

(Note: the intervals here are closed and must be disjoint, and `<=>` must be defined on their endpoints. They don't use the intervals in `math.intervals`, both to save space and because those are overkill here. Interval maps don't follow the assoc protocol because intervals aren't discrete, e.g. floats are acceptable as keys.)

First, the tuples we'll be using: an `interval-map` is the whole associative structure, containing a single slot for the underlying array.

    TUPLE: interval-map array ;

That array consists of `interval-node`s, which have a beginning, an end and a corresponding value.

    TUPLE: interval-node from to value ;

Let's assume we already have a sorted interval map. Given a key and an interval map, `find-interval` will give the index of the interval which might contain the given key.

    : find-interval ( key interval-map -- i )
        [ from>> <=> ] binsearch ;

`interval-contains?` tests if a node contains a given key.

    : interval-contains? ( object interval-node -- ? )
        [ from>> ] [ to>> ] bi between? ;

Finally, `interval-at*` searches an interval map to find a key, finding the correct interval and returning its value only if the interval contains the key.

    : fixup-value ( value ? -- value/f ? )
        [ drop f f ] unless* ;

    : interval-at* ( key map -- value ? )
        array>> [ find-interval ] 2keep swapd nth
        [ nip value>> ] [ interval-contains? ] 2bi
        fixup-value ;

A few convenience words, analogous to those for assocs:

    : interval-at ( key map -- value ) interval-at* drop ;

    : interval-key? ( key map -- ? ) interval-at* nip ;
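For readers who don't know Factor, the lookup can be sketched in Python roughly as follows. This is an illustration under my own assumptions, not a translation of the words above: it stores starts, ends and values in three parallel sorted lists, uses the standard `bisect` module in place of `binsearch`, and the name `interval_at` and the sample data are invented.

```python
from bisect import bisect_right


def interval_at(key, starts, ends, values):
    """Return (value, found) for key, loosely mirroring a ( value ? ) result.

    starts, ends, values are parallel lists describing closed, disjoint
    intervals sorted by start.
    """
    # Index of the last interval whose start is <= key, or -1 if there is none.
    i = bisect_right(starts, key) - 1
    # That interval is the only candidate; it matches only if key <= its end.
    if i >= 0 and key <= ends[i]:
        return values[i], True
    return None, False


# Hypothetical data: [1,5] -> "a", [10,10] -> "b", [20,30] -> "c"
starts = [1, 10, 20]
ends = [5, 10, 30]
values = ["a", "b", "c"]

hit = interval_at(3, starts, ends, values)
miss = interval_at(7, starts, ends, values)
```

The two-step shape is the same as in the Factor version: binary search picks the one interval that could contain the key, and a separate containment test decides whether it really does. Non-integer keys such as `2.5` work too, since only comparisons are used.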

So, to construct an interval map, there are a few things that have to be done. The input is an abstract specification, consisting of an assoc where the keys are either (1) 2arrays, where the first element is the beginning of an interval and the second is the end, or (2) numbers, representing an interval of the form [a,a]. This can be converted into a form where all keys are of kind (1) with the following:

    : all-intervals ( sequence -- intervals )
        [ >r dup number? [ dup 2array ] when r> ] assoc-map
        { } assoc-like ;

Once that is done, the objects should be converted to intervals:

    : >intervals ( specification -- intervals )
        [ >r first2 r> interval-node boa ] { } assoc>map ;
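The same two normalization steps can be sketched in Python, purely as an illustration. The `IntervalNode` class and the `to_nodes` function are names I made up here, and the specification is assumed to be a list of `(key, value)` pairs where a key is either a `(start, end)` pair or a bare number.

```python
from typing import Any, NamedTuple


class IntervalNode(NamedTuple):
    start: Any
    end: Any
    value: Any


def to_nodes(spec):
    """Turn [(key, value), ...] into a list of IntervalNode records.

    A key is either a (start, end) pair or a single number a,
    which is treated as the one-point interval [a, a].
    """
    nodes = []
    for key, value in spec:
        if isinstance(key, (int, float)):
            key = (key, key)  # widen a bare number into [a, a]
        start, end = key
        nodes.append(IntervalNode(start, end, value))
    return nodes


# Hypothetical specification mixing both kinds of key.
nodes = to_nodes([((1, 5), "a"), (10, "b")])
```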

After that, and after the intervals are sorted, we need to make sure that all intervals are disjoint. For this, we can use the `monotonic?` combinator, which checks that all adjacent pairs in a sequence satisfy a predicate. (This is more useful than it sounds at first.)

    : disjoint? ( node1 node2 -- ? )
        [ to>> ] [ from>> ] bi* < ;

    : ensure-disjoint ( intervals -- intervals )
        dup [ disjoint? ] monotonic?
        [ "Intervals are not disjoint" throw ] unless ;
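The adjacent-pairs idea is easy to show outside Factor as well. Below is a small Python sketch, with invented names `monotonic` and `ensure_disjoint`, that assumes the intervals are plain `(start, end)` pairs which have already been sorted.

```python
def monotonic(seq, pred):
    """True if pred(a, b) holds for every adjacent pair (a, b) in seq."""
    return all(pred(a, b) for a, b in zip(seq, seq[1:]))


def ensure_disjoint(intervals):
    """Check sorted, closed (start, end) pairs for overlap.

    Because the intervals are closed, two neighbours are disjoint only if
    the first one ends strictly before the second one starts.
    """
    if not monotonic(intervals, lambda a, b: a[1] < b[0]):
        raise ValueError("Intervals are not disjoint")
    return intervals


ok = ensure_disjoint([(1, 5), (6, 9), (20, 30)])
```

Checking only neighbours is enough here because the sequence is sorted: if every interval ends before the next one begins, no interval can reach any later one either. Note that `(1, 5)` followed by `(5, 9)` is rejected, since both closed intervals contain 5.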

And, to put it all together, using a tuple array for improved space efficiency:

    : <interval-map> ( specification -- map )
        all-intervals [ [ first second ] compare ] sort
        >intervals ensure-disjoint >tuple-array
        interval-map boa ;

All in all, in the case of representing the table of scripts, a table which was previously 200KB is now 20KB. Yay!

## 3 comments:

Hi,

Have you read "Unicode Demystified" by Richard Gillam?

That book offers a number of techniques for these and other related problems.

IIRC, inversion list is what he uses.

Cheers!

Krishna,

I'm in the middle of that book, but hadn't gotten up to that part yet. So thanks for the pointer! I may end up switching to that, though this representation seems reasonable to me.

Good brief and this fill someone in on helped me alot in my college assignement. Say thank you you seeking your information.
