<?xml version='1.0' encoding='UTF-8'?><?xml-stylesheet href="http://www.blogger.com/styles/atom.css" type="text/css"?><feed xmlns='http://www.w3.org/2005/Atom' xmlns:openSearch='http://a9.com/-/spec/opensearchrss/1.0/' xmlns:georss='http://www.georss.org/georss' xmlns:gd='http://schemas.google.com/g/2005' xmlns:thr='http://purl.org/syndication/thread/1.0'><id>tag:blogger.com,1999:blog-273593670040001243</id><updated>2012-01-24T08:03:41.211-08:00</updated><category term='xml'/><category term='scheme'/><category term='math'/><category term='theory'/><category term='module systems'/><category term='operating systems'/><category term='introduction'/><category term='ignorable'/><category term='macros'/><category term='factor'/><category term='icfp'/><category term='college'/><category term='oop'/><category term='event'/><category term='parsing'/><category term='encodings'/><category term='databases'/><category term='sequences'/><category term='meta'/><category term='pattern matching'/><category term='essay'/><category term='data structures'/><category term='garbage collection'/><category term='bragging'/><category term='license'/><category term='link'/><category term='unicode'/><category term='naming'/><category term='compiler'/><category term='rant'/><title type='text'>Useless Factor</title><subtitle type='html'>Adventures in computing and the Factor programming language</subtitle><link rel='http://schemas.google.com/g/2005#feed' type='application/atom+xml' href='http://useless-factor.blogspot.com/feeds/posts/default'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/273593670040001243/posts/default?max-results=100'/><link rel='alternate' type='text/html' href='http://useless-factor.blogspot.com/'/><link rel='hub' href='http://pubsubhubbub.appspot.com/'/><link rel='next' type='application/atom+xml' href='http://www.blogger.com/feeds/273593670040001243/posts/default?start-index=101&amp;max-results=100'/><author><name>Daniel 
Ehrenberg</name><uri>http://www.blogger.com/profile/00902922561603041049</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><generator version='7.00' uri='http://www.blogger.com'>Blogger</generator><openSearch:totalResults>136</openSearch:totalResults><openSearch:startIndex>1</openSearch:startIndex><openSearch:itemsPerPage>100</openSearch:itemsPerPage><entry><id>tag:blogger.com,1999:blog-273593670040001243.post-1762056878099601099</id><published>2011-08-31T08:26:00.000-07:00</published><updated>2011-08-31T14:15:42.512-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='databases'/><title type='text'>Startup idea: Cloud storage platform with configurable SLAs</title><content type='html'>Different applications have different performance and reliability requirements for their underlying storage system. Requirements even change over time for a particular application: when a webapp gets really popular, the creators might not just want to scale out to a larger data set, but also bring the total server response time down to 200ms for 99% of users, and give users the feeling that their data will never be lost.&lt;br /&gt;&lt;br /&gt;Today, the situation with databases is messy. If you want different performance properties, you often have to switch database interfaces. Several different interfaces exist for largely historical reasons; today, they can all be used to serve a similar purpose: a key-value store.&lt;br /&gt;&lt;br /&gt;In addition to the duplication of effort needed to maintain things across different interfaces, users don't have good control over SLAs. Durability, availability, median latency, and tail latency requirements are available in only a few broad classes. S3 is really durable and available but has high latency. Memcached has high availability, no durability, and really short latencies. 
MySQL on EC2 has durability that's somewhere between Memcached and S3, and latency that's also in the middle. But other combinations would also be useful.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;The idea&lt;/strong&gt;&lt;br /&gt;&lt;br /&gt;There should be a storage-as-a-service provider with two distinguishing features:&lt;br /&gt;&lt;ol&gt;&lt;li&gt;All the different classes of storage should be accessible from the same interface, probably building off an existing one.&lt;/li&gt;&lt;li&gt;The customer asks for an SLA (or rather several SLAs for different tiers of storage), and the service provider will give a price for it.&lt;/li&gt;&lt;/ol&gt;&lt;br /&gt;With a detailed SLA indicating durability, availability, tail latency and median latency requirements, a Sufficiently Smart Database will be able to optimize for it. By optimizing for the customer's real requirements, the service provider lowers its costs and can pass some of these savings on to customers. It should also be easier to evolve applications that use this storage service.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;An aside about nice interfaces&lt;/strong&gt;&lt;br /&gt;&lt;br /&gt;There's been a lot of innovation in database performance and architecture recently, but there's also been innovation in interfaces. Some applications make great use of the special features of Redis, various SQL dialects, CouchDB and others to add a significant amount of value over a plain key-value store.&lt;br /&gt;&lt;br /&gt;The best thing for these interfaces would be if they were usable on top of several SLA classes. 
Right now, the implementations are coupled, but if the configurable key-value store had just a few extra features (ordered indexes, triggers, transactions, etc.) then it should be possible to build all these nice interfaces on top.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Some unsolved problems&lt;/strong&gt;&lt;br /&gt;&lt;br /&gt;There are more problems, but here are some:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;How do you take advantage of the SLA to the fullest degree?&lt;/strong&gt;&lt;br /&gt;It's not as simple as using a deadline scheduler for your disk when a single disk read might take too long--some data will have to sit in RAM, and some data will have to sit in flash. Disk and flash both have cases where the latency of a request can be unexpectedly long. And how can you evaluate the reliability of existing systems to stay within the SLA?&lt;/li&gt;&lt;li&gt;&lt;strong&gt;How do you price things?&lt;/strong&gt;&lt;br /&gt;You really want the price to accurately reflect how much things cost. But with a complicated database system, it's difficult to tease out which costs are attributable to which customer, especially when there are at least three places where data might sit, and several customers will be sharing a cluster.&lt;/li&gt;&lt;li&gt;&lt;strong&gt;How do you explain the pricing to the customer?&lt;/strong&gt;&lt;br /&gt;They don't want to know all about this fancy database, they just want to figure out the cheapest thing that works.&lt;/li&gt;&lt;li&gt;&lt;strong&gt;What's the policy when the SLA is violated?&lt;/strong&gt;&lt;br /&gt;It'd be cool if this never happened, but when it does, you have to compensate your customer somehow, and this creates a potential for abuse if you compensate them too well.&lt;/li&gt;&lt;/ul&gt;&lt;strong&gt;Who's gonna do it?&lt;/strong&gt;&lt;br /&gt;&lt;br /&gt;I really believe that a system like this (though maybe not in full generality) is what we'll eventually get from cloud storage providers. The question is just, who will implement it? 
A few possibilities:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Google or Amazon, as an extension of their existing cloud storage offerings&lt;/li&gt;&lt;li&gt;An enterprise storage company like EMC, though I have trouble believing they'll be able to set the prices low enough&lt;/li&gt;&lt;li&gt;An existing database startup like CouchBase or RethinkDB&lt;/li&gt;&lt;li&gt;Academia could figure out the database design; this would be a great paper&lt;/li&gt;&lt;li&gt;Your startup!&lt;/li&gt;&lt;/ul&gt;I've never heard of a service which does this. Existing companies (whether small or large) have their existing business as a priority and may not see the possibility here. So why not a startup? I might do this if I had nothing else to do, but unfortunately Google is awesome and I have way too much interesting stuff to do there. What about you?&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;How I'd implement this&lt;/strong&gt;&lt;br /&gt;&lt;br /&gt;Somehow, this system has to get off the ground, going from no customers and no advanced technology to lots of customers and really efficient technology. I'd work on both in parallel. At the beginning, buy some really high-performance enterprise database system with tons of flash and RAM that you run all of your customers on. Provide them with several SLAs of various prices that you think you'll eventually be able to support cost-effectively, and serve them all from your high-performance high-reliability system at a huge loss. If the customer buys a cheaper SLA, then insert delays and losses into their results so they don't start depending on the performance properties of the real implementation.&lt;br /&gt;&lt;br /&gt;Now that momentum is starting to build up, and there has been a confused writeup in TechCrunch about you, you can get more capital to scale this hugely loss-making model up. You can also hire people to build the real system. 
At that point, you'll have a lot more data on real customer requirements and the properties of their workloads. You'll have the urgency that your financial statements give you to produce this system, make it work in a real way and provide improvements. And you'll be visible enough to get good people to work for you.&lt;br /&gt;&lt;br /&gt;&lt;em&gt;This post reflects my own personal opinions and not those of my &lt;a href="http://google.com/"&gt;current&lt;/a&gt; or &lt;a href="http://rethinkdb.com/"&gt;former&lt;/a&gt; employer. I have no insider knowledge of any plans or lack-of-plans in this area at either company.&lt;/em&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/273593670040001243-1762056878099601099?l=useless-factor.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://useless-factor.blogspot.com/feeds/1762056878099601099/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=273593670040001243&amp;postID=1762056878099601099' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/273593670040001243/posts/default/1762056878099601099'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/273593670040001243/posts/default/1762056878099601099'/><link rel='alternate' type='text/html' href='http://useless-factor.blogspot.com/2011/08/startup-idea-cloud-storage-platform.html' title='Startup idea: Cloud storage platform with configurable SLAs'/><author><name>Daniel Ehrenberg</name><uri>http://www.blogger.com/profile/00902922561603041049</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-273593670040001243.post-3985484522624050828</id><published>2011-05-18T08:37:00.000-07:00</published><updated>2011-05-18T13:14:48.306-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='operating systems'/><category scheme='http://www.blogger.com/atom/ns#' term='databases'/><title type='text'>Why not mmap?</title><content type='html'>&lt;code&gt;mmap()&lt;/code&gt; is a beautiful interface. Rather than accessing files through a series of read and write operations, &lt;code&gt;mmap()&lt;/code&gt; lets you virtually load the whole file into a big array and access whatever part you want just like you would with other RAM. (It lets you do other things, too—in particular, it's the basis of memory allocation. See &lt;a href="http://linux.die.net/man/2/mmap"&gt;the man page&lt;/a&gt; for details.) In this article, I'll be discussing &lt;code&gt;mmap()&lt;/code&gt; on Linux, as it works in virtual memory systems like x86.&lt;br /&gt;&lt;br /&gt;&lt;code&gt;mmap()&lt;/code&gt; doesn't actually load the whole file in when you call it. Instead, it loads nothing in but file metadata. In the memory page table, all of the mapped pages are given the setting to make a page fault if they are read or written. The page fault handler loads the page and puts it into main memory, modifying the page table to not fault for this page later. In this way, the file is lazily read into memory. The file is written back out through the same writeback mechanism used for the page cache in buffered I/O: after some time or under some memory pressure, the contents of memory are automatically synchronized with the disk.&lt;br /&gt;&lt;br /&gt;&lt;code&gt;mmap()&lt;/code&gt; is a system call, implemented by the kernel. Why? 
As far as I can tell, what I described above could be implemented in user-space: user-space has page fault handlers and file read/write operations. But implementing it in the kernel has several advantages:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;If a file is being manipulated by &lt;code&gt;mmap()&lt;/code&gt; as well as something else at the same time, the kernel can keep these in sync&lt;/li&gt;&lt;li&gt;The kernel can do it faster, with specialized implementations for different file systems and fewer context switches between kernel-space and user-space&lt;/li&gt;&lt;li&gt;The kernel can do a better job, using its internal statistics to determine when to write back to disk and when to prefetch extra pages of the file&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;One situation where &lt;code&gt;mmap()&lt;/code&gt; looks useful is databases. What could be easier for a database implementor than an array of "memory" that's transparently persisted to disk? Database authors often think they know better than the OS, so they like to have explicit control over caching policy. And various file and memory operations give you this, in conjunction with &lt;code&gt;mmap()&lt;/code&gt;:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;&lt;a href="http://linux.die.net/man/2/mlock"&gt;&lt;code&gt;mlock()&lt;/code&gt;&lt;/a&gt; lets you force a series of pages to be held in physical memory, and &lt;code&gt;munlock()&lt;/code&gt; lets you release them. Memory locking here is basically equivalent to making part of the file present in the user-space cache, when no swap is configured on the server.&lt;br /&gt;&lt;br /&gt;Memory locking can be dangerous in an environment with many processes running because the out-of-memory killer (OOM killer) might kill some other process as a result of your profligate use of memory. 
However, the use of &lt;a href="http://en.wikipedia.org/wiki/Cgroups"&gt;cgroups&lt;/a&gt; or virtualization can mitigate this possibility and provide isolation.&lt;/li&gt;&lt;li&gt;&lt;a href="http://linux.die.net/man/2/madvise"&gt;&lt;code&gt;madvise()&lt;/code&gt;&lt;/a&gt; and &lt;a href="http://linux.die.net/man/2/posix_fadvise"&gt;&lt;code&gt;posix_fadvise()&lt;/code&gt;&lt;/a&gt; let you give the OS hints about how to behave with respect to the file. These can be used to encourage things to be pulled into memory or pushed out. For anonymous memory, &lt;code&gt;MADV_DONTNEED&lt;/code&gt; is a quick call to zero a series of pages completely (for shared file mappings it just drops the cached pages, which are re-read from the file on the next access), and a similar hint could in principle be translated into TRIM on SSDs.&lt;/li&gt;&lt;li&gt;&lt;a href="http://linux.die.net/man/2/fdatasync"&gt;&lt;code&gt;fdatasync()&lt;/code&gt;&lt;/a&gt; lets a process force some data onto the disk right now, rather than trusting writeback to get it there eventually. This is useful for implementing durable transactions.&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;Great! And in Linux, you can open up a raw block device just by opening a file like &lt;code&gt;/dev/hda1&lt;/code&gt; and use &lt;code&gt;mmap()&lt;/code&gt; straight from there, so this gives database implementors a way to control the whole disk with the same interface. This is great if you're a typical database developer who doesn't like the OS and doesn't trust the file system.&lt;br /&gt;&lt;br /&gt;So this sounds like a nice, clean way to write a database or something else that does serious file manipulation. Some databases use this, for example &lt;a href="https://github.com/mongodb/mongo/blob/master/db/mongommf.cpp"&gt;MongoDB&lt;/a&gt;. But the more advanced database implementations tend to open the database file in &lt;code&gt;O_DIRECT&lt;/code&gt; mode and implement their own caching system in user-space. 
Whereas &lt;code&gt;mmap()&lt;/code&gt; lets you use the hardware (on x86) page tables for the indirection between the logical address of the data and where it's stored in physical memory, these databases force you to go through an &lt;em&gt;extra&lt;/em&gt; indirection in their own data structures. And these databases have to implement their own caches, even though the resulting caches often aren't smarter than the default OS cache. (The logic that makes the caching smarter is often encoded in an application-specific prefetcher, which can be done pretty clearly through memory mapping.)&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;A problem with &lt;code&gt;mmap()&lt;/code&gt;&lt;/strong&gt;&lt;br /&gt;&lt;br /&gt;High-performance databases often get lots of requests. So many requests that, if they were to spawn a thread for each one of them, the overhead of a kernel task per request would slow them down (where task means 'thread or process', in Linux terminology). There's a bit of overhead for threads:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Each thread must have its own stack, which takes up memory (though this is mitigated by the lazy allocation of physical memory to back stacks, which is done by default)&lt;/li&gt;&lt;li&gt;Some Linux CPU schedulers use a lot of CPU themselves. So blocking and then getting resumed has a certain amount of overhead. In particular, overhead is incurred so that the scheduler can be completely fair, and so that it can load-balance between cores.&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;To solve these issues, database implementors often respond to each request with a user-level coroutine, or even with an explicitly managed piece of state sent around through various callbacks.&lt;br /&gt;&lt;br /&gt;Let's say we have a coroutine responding to a database request, and this coroutine wants to read from the database in a location that is currently stored on disk. If it accesses the big array, then it will cause a memory fault leading to a disk read. 
This will make the current task block until the disk read can be completed. But we don't want the whole task to block; we just want to switch to another coroutine when we have to wait, and we want to execute that coroutine from the same task.&lt;br /&gt;&lt;br /&gt;The typical way around this problem is using asynchronous or non-blocking operations. For non-blocking I/O, there's &lt;a href="http://linux.die.net/man/4/epoll"&gt;&lt;code&gt;epoll&lt;/code&gt;&lt;/a&gt;, which works for some kinds of files. For direct I/O on disk, Linux provides a different interface called asynchronous I/O, with system calls like &lt;a href="http://linux.die.net/man/2/io_submit"&gt;&lt;code&gt;io_submit&lt;/code&gt;&lt;/a&gt;. These two mechanisms can be hooked up with an eventfd, which is triggered whenever there are AIO results, using the undocumented libaio helper &lt;code&gt;io_set_eventfd&lt;/code&gt;. The basic idea is that you set up a bunch of requests in an object, and then you have a main loop, driven by &lt;code&gt;epoll&lt;/code&gt;, where you repeatedly ask for the next available event. The coroutine scheduler resumes the coroutine that had the event complete on it, and executes that coroutine until it blocks again. Details about using this mechanism are a bit obscure, but not very deep or complicated.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;A proposed solution&lt;/strong&gt;&lt;br /&gt;&lt;br /&gt;What the &lt;code&gt;mmap()&lt;/code&gt; interface is missing is a non-blocking way to access memory. Maybe this would take the form of a call based around &lt;code&gt;mlock&lt;/code&gt;, like&lt;br /&gt;&lt;pre&gt;    int mlock_eventfd(const void *addr, size_t len, int eventfd);&lt;/pre&gt;&lt;br /&gt;which would trigger the eventfd once the &lt;code&gt;len&lt;/code&gt; bytes of memory starting at &lt;code&gt;addr&lt;/code&gt; were locked in memory. The eventfd could be placed in an &lt;code&gt;epoll&lt;/code&gt; loop and then the memory requested would be dereferenced for real once it was locked. 
A similar mechanism would be useful for &lt;code&gt;fdatasync&lt;/code&gt;.&lt;br /&gt;&lt;br /&gt;We could implement &lt;code&gt;mlock_eventfd&lt;/code&gt; in user-space using a thread pool, and the same goes for &lt;code&gt;fdatasync&lt;/code&gt;. But this would probably eliminate the performance advantages of using coroutines in the first place, since accessing the disk is pretty frequent in databases.&lt;br /&gt;&lt;br /&gt;As databases and the devices that underlie them grow more complex, it becomes difficult to manage this complexity. The operating system provides a useful layer of indirection between the database and the drive, but old and messy interfaces make the use of the OS more difficult. Clean, high-level OS interfaces which let applications take full advantage of the hardware and kernel-internal mechanisms and statistics would be a great boon to further database development, allowing the explosion of new databases and solid-state drives to be fully exploited.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/273593670040001243-3985484522624050828?l=useless-factor.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://useless-factor.blogspot.com/feeds/3985484522624050828/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=273593670040001243&amp;postID=3985484522624050828' title='21 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/273593670040001243/posts/default/3985484522624050828'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/273593670040001243/posts/default/3985484522624050828'/><link rel='alternate' type='text/html' href='http://useless-factor.blogspot.com/2011/05/why-not-mmap.html' title='Why not mmap?'/><author><name>Daniel 
Ehrenberg</name><uri>http://www.blogger.com/profile/00902922561603041049</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>21</thr:total></entry><entry><id>tag:blogger.com,1999:blog-273593670040001243.post-5255341427981553650</id><published>2011-04-16T13:58:00.000-07:00</published><updated>2011-04-16T14:07:35.469-07:00</updated><title type='text'>I have been sucked into the vortex</title><content type='html'>Last Monday, I started at Google. I'm working on the kernel storage team, trying to optimize Linux asynchronous I/O for flash, which we are experimenting with. I really love it at Google; the food is great and the people are extremely smart. In the kernel, many top Linux hackers are employed by Google, and it's amazing that I can work with them.&lt;br /&gt;&lt;br /&gt;I haven't been posting much here, partly out of laziness and partly out of perfectionism, but I have a few half-written blog posts that I'd really like to get out. Now that I am working, I'll have even less time than before. On the other hand, since what I'm doing is for the kernel, you can expect to see detailed explanations of my changes here once they're ready to go upstream. I'm really excited to be working on a project so deep in the stack, and it will be especially satisfying to release changes as open-source. 
Wish me luck!&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/273593670040001243-5255341427981553650?l=useless-factor.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://useless-factor.blogspot.com/feeds/5255341427981553650/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=273593670040001243&amp;postID=5255341427981553650' title='3 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/273593670040001243/posts/default/5255341427981553650'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/273593670040001243/posts/default/5255341427981553650'/><link rel='alternate' type='text/html' href='http://useless-factor.blogspot.com/2011/04/i-have-been-sucked-into-vortex.html' title='I have been sucked into the vortex'/><author><name>Daniel Ehrenberg</name><uri>http://www.blogger.com/profile/00902922561603041049</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>3</thr:total></entry><entry><id>tag:blogger.com,1999:blog-273593670040001243.post-7319373844094211214</id><published>2010-12-23T12:33:00.000-08:00</published><updated>2010-12-31T15:04:59.623-08:00</updated><title type='text'>Work at RethinkDB</title><content type='html'>I've been working at RethinkDB for the past month. My project has been to convert the code base from using callbacks for asynchronous I/O&amp;mdash;essentially continuation-passing style&amp;mdash;to a more natural style of concurrency using coroutines. 
You can read about my progress on the RethinkDB blog:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;&lt;a href="http://www.rethinkdb.com/blog/2010/12/improving-a-large-c-project-with-coroutines/"&gt;Improving a large C++ project with coroutines&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="http://www.rethinkdb.com/blog/2010/12/making-coroutines-fast/"&gt;Making coroutines fast&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="http://www.rethinkdb.com/blog/2010/12/handling-stack-overflow-on-custom-stacks/"&gt;Handling stack overflow on custom stacks&lt;/a&gt;&lt;/li&gt;&lt;/ul&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/273593670040001243-7319373844094211214?l=useless-factor.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://useless-factor.blogspot.com/feeds/7319373844094211214/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=273593670040001243&amp;postID=7319373844094211214' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/273593670040001243/posts/default/7319373844094211214'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/273593670040001243/posts/default/7319373844094211214'/><link rel='alternate' type='text/html' href='http://useless-factor.blogspot.com/2010/12/work-at-rethinkdb.html' title='Work at RethinkDB'/><author><name>Daniel Ehrenberg</name><uri>http://www.blogger.com/profile/00902922561603041049</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-273593670040001243.post-8522638278890422814</id><published>2010-10-03T06:38:00.000-07:00</published><updated>2010-12-08T09:15:03.960-08:00</updated><category 
scheme='http://www.blogger.com/atom/ns#' term='rant'/><category scheme='http://www.blogger.com/atom/ns#' term='bragging'/><category scheme='http://www.blogger.com/atom/ns#' term='essay'/><title type='text'>Tired of academia, gonna work for Google</title><content type='html'>&lt;strong&gt;Update&lt;/strong&gt;: I forgot to mention, before this I'll be doing an internship with RethinkDB, which I'm really looking forward to. I'll be in the Bay Area in December 2010 just for the month and then I'll move there indefinitely in April 2011.&lt;br /&gt;&lt;br /&gt;Since high school, or maybe before that, I had dreams of becoming a computer science professor after getting a PhD in programming languages. Reading people on Slashdot and thedailywtf talk about real-world jobs made me never want to have one. So, to see what research is like, build relationships with professors and help my resume, I've done research internships with professors each summer instead. There's a special NSF program called REU to fund these, and because of this I've never found it difficult to find these positions.&lt;br /&gt;&lt;br /&gt;The research topics always sound really interesting, but somehow the mechanics of doing research never really worked out for me. I could make &lt;em&gt;some&lt;/em&gt; progress, but didn't make a significant piece of software or contribute a polished component to someone else's software. I learned some ideas, and read and modified some significant pieces of code, but I always lost motivation somewhere in there.&lt;br /&gt;&lt;br /&gt;I take this to be a product of the incentive structures in academia: &lt;strong&gt;everyone has to work for themselves&lt;/strong&gt;. Sure, most people are motivated to some degree by the idea of science and technological progress. But that's not what the system is set up to do. Academia isn't about ideas directly; it's about publishing many papers in prestigious venues. 
You don't just do what you want, you do what the program committee will like. If you don't do this, then you won't be able to stay in academia: at every step--getting into grad school, getting an academic position once you have a PhD, getting tenure--most people are kicked out of the system. You have to get your name first on enough papers to survive, so collaborative groups are often very small and technical interaction, outside conferences, can be limited.&lt;br /&gt;&lt;br /&gt;On the upside, there are tons of smart, interesting people in academia, and the work can be challenging and interesting. I've just realized, recently, that they don't have a monopoly on these things. Working, if it's for the right company, should have these things too. I know there's gruntwork, but if you're working on a real project in academia, this exists there too. In fact, it can be really bad: grad students need to get something publishable written as quickly as possible, even if the code is messy and buggy. It doesn't really matter anyway: nobody's going to use it and the PC won't see the code when reviewing the paper.&lt;br /&gt;&lt;br /&gt;Interviewing with companies was so easy. It was so much less work than the mess of personal statements--one for each of 10 grad schools--and recommendations that I was preparing to do. You just send in your resume (which I just wrote in a couple of hours), they respond asking when you're free, you interview on the phone once or twice, and then meet them in person for a few hours. If the process takes two months because of delays on their end, everyone complains about the slowness. For grad school applications, it can take months to assemble your part of the application materials, and more months for them to respond. 
And you have to apply 9 months before you'll start.&lt;br /&gt;&lt;br /&gt;Within a couple months of getting tired of academia, I had a couple job offers and a few more promising interviews that felt like they'd lead to an offer if I kept going. It was hard for me to choose Google, but I think it should be the right environment for me as a recent graduate. They (should I start saying 'we' soon?) have amazingly advanced technology, great perks and what sounds like a good support and training structure for employees. There are many people working for Google as software engineers that don't need to work, and could probably find a good job at lots of places.&lt;br /&gt;&lt;br /&gt;And, I guess this isn't so important, but it doesn't hurt that the starting salary for recent graduates at Google is something like what a tenured professor makes, or that I'll be able to live in San Francisco, my favorite city. Maybe the work will be difficult and have long hours, but my life will be nice in ways that I never expected to happen so quickly.&lt;br /&gt;&lt;br /&gt;If I have any readers in the Bay Area, contact me and we should chat! 
I love talking with other programmers.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/273593670040001243-8522638278890422814?l=useless-factor.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://useless-factor.blogspot.com/feeds/8522638278890422814/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=273593670040001243&amp;postID=8522638278890422814' title='7 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/273593670040001243/posts/default/8522638278890422814'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/273593670040001243/posts/default/8522638278890422814'/><link rel='alternate' type='text/html' href='http://useless-factor.blogspot.com/2010/10/tired-of-academia-gonna-work-for-google.html' title='Tired of academia, gonna work for Google'/><author><name>Daniel Ehrenberg</name><uri>http://www.blogger.com/profile/00902922561603041049</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>7</thr:total></entry><entry><id>tag:blogger.com,1999:blog-273593670040001243.post-7246596667295783470</id><published>2010-05-30T05:22:00.000-07:00</published><updated>2010-06-01T08:14:48.867-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='factor'/><title type='text'>Paper for DLS 2010</title><content type='html'>&lt;a href="http://factor-language.blogspot.com/"&gt;Slava Pestov&lt;/a&gt;, &lt;a href="http://duriansoftware.com/joe/"&gt;Joe Groff&lt;/a&gt; and I are writing a paper about Factor for the &lt;a href="http://www.dynamic-languages-symposium.org/dls-10/"&gt;Dynamic Languages Symposium&lt;/a&gt;, part of OOPSLA. 
It's a survey discussing both the language design and its implementation, and you can read the draft on &lt;a href="http://factorcode.org/littledan/dls.pdf"&gt;the Factor website&lt;/a&gt;. The paper is due on Tuesday, and we would really appreciate any comments you have on it.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Update&lt;/strong&gt;: I've submitted the paper. Thanks everyone for your helpful suggestions! I'll hear back about this July 15, and report here when I do.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/273593670040001243-7246596667295783470?l=useless-factor.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://useless-factor.blogspot.com/feeds/7246596667295783470/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=273593670040001243&amp;postID=7246596667295783470' title='3 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/273593670040001243/posts/default/7246596667295783470'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/273593670040001243/posts/default/7246596667295783470'/><link rel='alternate' type='text/html' href='http://useless-factor.blogspot.com/2010/05/paper-for-dls-2010.html' title='Paper for DLS 2010'/><author><name>Daniel Ehrenberg</name><uri>http://www.blogger.com/profile/00902922561603041049</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>3</thr:total></entry><entry><id>tag:blogger.com,1999:blog-273593670040001243.post-3827749739081079306</id><published>2010-04-25T00:27:00.000-07:00</published><updated>2010-04-25T14:48:22.121-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='factor'/><category 
scheme='http://www.blogger.com/atom/ns#' term='compiler'/><title type='text'>Guarded method inlining for Factor</title><content type='html'>As a compiler optimization to make code faster, Factor's compiler tries to eliminate calls to generic words, replacing them with direct calls to the appropriate method. If the method is declared inline, then its definition will also be inlined. Together, these two optimizations are informally called 'method inlining'. Method inlining is essential for making object-oriented code in Factor fast. And since basic operations like accessing elements of a sequence or adding two numbers are implemented by method calls, this is needed for any Factor code to be fast.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;How things worked&lt;/strong&gt;&lt;br /&gt;&lt;br /&gt;Here's how the current algorithm works. Method inlining takes place within Factor's sparse conditional constant propagation (SCCP) pass. SCCP infers upper bounds for the classes of values that the code manipulates. When SCCP processes a generic word call, it examines the class of the receiver as well as the list of classes that have methods on the generic word. If SCCP can tell that a particular method will always be called, it can select that method. Below is some pseudocode for how it detects that.&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;For each method on the generic word:&lt;br /&gt;    If the class for that method intersects the receiver class:&lt;br /&gt;        If the class for that method is a superclass of the receiver class:&lt;br /&gt;            Put it on a list&lt;br /&gt;        Else:&lt;br /&gt;            We don't know whether this method will be called at runtime&lt;br /&gt;              or not, so bail out and fail to inline a method&lt;br /&gt;Inline the method for the smallest class on the list&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;There's an additional complication. 
It might be that a word is compiled, with the method inlining optimization applied, but then after this, more vocabs get loaded that add additional methods to the generic word. This might invalidate the correctness of the method inlining, and the word should get recompiled to fix this problem. Factor uses a simple system to track these dependencies, in &lt;code&gt;stack-checker.dependencies&lt;/code&gt;.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;My new addition&lt;/strong&gt;&lt;br /&gt;&lt;br /&gt;A few days ago, &lt;a href="http://factorcode.org/slava/"&gt;another Factor developer&lt;/a&gt; told me about a hack he'd added to SCCP to make a particular benchmark faster. The hack was, if &lt;code&gt;&gt;fixnum&lt;/code&gt; is called on an object that SCCP knows is either a &lt;code&gt;fixnum&lt;/code&gt; or &lt;code&gt;f&lt;/code&gt;, then the &lt;code&gt;&gt;fixnum&lt;/code&gt; call is replaced with &lt;code&gt;dup [ \ &gt;fixnum no-method ] unless&lt;/code&gt;. This works because &lt;code&gt;&gt;fixnum&lt;/code&gt; doesn't have any methods on &lt;code&gt;f&lt;/code&gt;, and &lt;code&gt;&gt;fixnum&lt;/code&gt; on &lt;code&gt;fixnum&lt;/code&gt;s is a no-op.&lt;br /&gt;&lt;br /&gt;My immediate instinct here was to generalize this solution. The first step is to convert that code into &lt;code&gt;dup fixnum? [ M\ fixnum &gt;fixnum ] [ \ &gt;fixnum no-method ] if&lt;/code&gt;, which we can do since &lt;code&gt;&gt;fixnum&lt;/code&gt; doesn't have any other methods on the union of &lt;code&gt;f&lt;/code&gt; and &lt;code&gt;fixnum&lt;/code&gt;. Unlike the kind of method inlining described earlier, this requires the insertion of a guard. 
Later code will know (through SCCP) that the object is a &lt;code&gt;fixnum&lt;/code&gt;, even after the conditional exits, since it can tell that an exception would be thrown otherwise.&lt;br /&gt;&lt;br /&gt;The second step is to convert &lt;code&gt;fixnum?&lt;/code&gt; into &lt;code&gt;&gt;boolean&lt;/code&gt;, which we can do because we know that the value on the top of the stack is either &lt;code&gt;f&lt;/code&gt; or a &lt;code&gt;fixnum&lt;/code&gt;. With these two transformations, we should generate the same code that the hack generated, but the transformations should also work in other cases.&lt;br /&gt;&lt;br /&gt;The second change was easy. I just added custom inlining for the &lt;code&gt;instance?&lt;/code&gt; word to detect basically this exact case, and convert the test into &lt;code&gt;&gt;boolean&lt;/code&gt; when this is valid.&lt;br /&gt;&lt;br /&gt;The first change was a lot more work. First, I had to come up with the algorithm for choosing the right method (even if it seems obvious now in retrospect).&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;Make a list of methods on the generic word whose class intersects the receiver class&lt;br /&gt;If this list consists of one element, then return it&lt;br /&gt;If the list consists of multiple or zero elements, then there is no method to inline&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;Once this method is found, propagation should generate code like this:&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;dup method's-class instance? [&lt;br /&gt;    M\ method's-class generic-word execute&lt;br /&gt;] [&lt;br /&gt;    \ generic-word no-method&lt;br /&gt;] if&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;This is valid because we know that, if the receiver fails the test, then it must be of some class that has no method on the generic word.&lt;br /&gt;&lt;br /&gt;An extra requirement for the correct implementation of this compiler optimization is to track dependencies. For this, I made two new types of dependencies. 
One corresponds to the test that the generic word only has one particular method intersecting the class that's on the stack. The other tracks if a method that's inlined is overwritten. I wasn't able to reuse the dependency tracking for the other kind of method inlining, but it all fits into the same framework and didn't take much code.&lt;br /&gt;&lt;br /&gt;All of this was pretty hard for me to debug, but it was a fun thing to work on. Among the code loaded in a basic development image, over 2400 methods are inlined using this technique, which were previously impossible to inline. There was probably also more method inlining done following this, due to improved type information, though I haven't collected statistics. And most importantly, I was able to eliminate the hack with &lt;code&gt;&gt;fixnum&lt;/code&gt; with no regression in performance.&lt;br /&gt;&lt;br /&gt;Once I get a couple kinks worked out, this should be merged into mainline Factor. For now, it's in the &lt;a href="http://github.com/littledan/Factor/tree/propagation"&gt;propagation&lt;/a&gt; branch of my repository.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/273593670040001243-3827749739081079306?l=useless-factor.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://useless-factor.blogspot.com/feeds/3827749739081079306/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=273593670040001243&amp;postID=3827749739081079306' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/273593670040001243/posts/default/3827749739081079306'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/273593670040001243/posts/default/3827749739081079306'/><link rel='alternate' type='text/html' 
href='http://useless-factor.blogspot.com/2010/04/guarded-method-inlining-for-factor.html' title='Guarded method inlining for Factor'/><author><name>Daniel Ehrenberg</name><uri>http://www.blogger.com/profile/00902922561603041049</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-273593670040001243.post-8865402568388300247</id><published>2010-04-06T20:04:00.000-07:00</published><updated>2010-04-07T14:21:02.460-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='factor'/><title type='text'>A couple language design ideas</title><content type='html'>I know I shouldn't really write about something I haven't implemented yet, but here are a couple ideas that I've been talking about with some of the other Factor developers that I'd like to share with you. Credit for these ideas goes mostly to Slava Pestov and Joe Groff.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Multimethods in Factor&lt;/strong&gt;&lt;br /&gt;&lt;br /&gt;Factor should have multimethods. They'd clean up a lot of code. There already are multimethods, in &lt;code&gt;extra/multimethods&lt;/code&gt;, but these are extremely slow (in the average case, you have to search through half of the method list to find the right one, which is unacceptable). They could also have their syntax cleaned up a bit. These multimethods, as they're already implemented, combine dispatching on things on the stack with dispatching on the values of dynamically scoped variables. The latter is called 'hooks' in Factor, and is useful for what other languages do using conditional compilation.&lt;br /&gt;&lt;br /&gt;Here's a sample of what the syntax might look like:&lt;pre&gt;! Before the -- is upper bounds on the allowable methods&lt;br /&gt;! 
after the -- is a type declaration that is checked.&lt;br /&gt;GENERIC: like ( seq: sequence exemplar: sequence -- newseq: sequence ) &lt;br /&gt;&lt;br /&gt;! Parameters with specific types in the parentheses--this is not a stack effect&lt;br /&gt;M: like ( array array ) drop ;&lt;br /&gt;&lt;br /&gt;! The single parameter refers to the top of the stack&lt;br /&gt;! and the second element isn't used for dispatch&lt;br /&gt;! so it defaults to sequence&lt;br /&gt;M: like ( array ) &gt;array ;&lt;br /&gt;&lt;br /&gt;GENERIC: generate-insn ( instruction: insn -- | cpu ) ! after the | is the hook variables&lt;br /&gt;&lt;br /&gt;! Before the | comes from the stack, after is variables. x86 is a singleton class.&lt;br /&gt;M: ( ##add-float | cpu: x86 )&lt;br /&gt;    [ dst&gt;&gt; ] [ src1&gt;&gt; ] [ src2&gt;&gt; ] tri&lt;br /&gt;    double-rep two-operand ADDSD ;&lt;/pre&gt;&lt;br /&gt;I've implemented some of this syntax in the &lt;code&gt;multimethods&lt;/code&gt; branch of my git repository. Another cool part, language-design-wise, is that multimethods can replace a few different other language features.&lt;br /&gt;&lt;br /&gt;For one, they can replace Joe Groff's &lt;code&gt;TYPED:&lt;/code&gt;. That lets you declare the types of arguments of a word, and have those be automatically checked, with checks removed if the compiler can prove they're unnecessary. In the ideal design, &lt;code&gt;:&lt;/code&gt; will replace &lt;code&gt;TYPED:&lt;/code&gt;, and if any parameters have type declarations, then the colon definition is macro-expanded into a generic word with only one method. This implements the type checking.&lt;br /&gt;&lt;br /&gt;Multimethods could also be used to implement 'hints' better than they are right now. Hints are a feature of the compiler where a programmer can instruct the compiler to create specialized versions of a function for certain parameter types. 
Currently, hints use a very inefficient dispatch mechanism: when you call a word with hints, it goes down the list of specialized implementations of the word, testing the stack to see if the contents of the stack are members of those classes. But if efficient multimethod dispatch were worked out, this could be much faster. Also, method inlining would make it so that the hints dispatch would be eliminated if types are known by the compiler. This isn't done right now.&lt;br /&gt;&lt;br /&gt;&lt;a href="http://useless-factor.blogspot.com/2009/07/simple-interprocedural-optimization.html"&gt;Callsite specialization&lt;/a&gt; could also be implemented with the help of multimethods. The difference between this and hints is just that a new hint would be generated if type inference found certain types for the arguments of a function, but there wasn't already a hint for that type. The basic idea for callsite specialization is that the specialized versions would only be used when the compiler can prove that they're appropriate. But if efficient multimethod dispatch were used, then callsite specialization could be used to generate hints that are also available on values whose types are unknown until runtime. &lt;a href="http://useless-factor.blogspot.com/2010/01/type-feedback-in-factor.html"&gt;Runtime type feedback&lt;/a&gt; could work this way too.&lt;br /&gt;&lt;br /&gt;There are some implementation challenges in making multimethods fast. From what I've read, it seems like the best way is something like &lt;a href="http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.37.6735&amp;rep=rep1&amp;type=pdf"&gt;this algorithm by Chambers and Chen used by the Cecil compiler&lt;/a&gt;. It generates a dispatch DAG for each multiple-dispatch generic word, where the nodes are single dispatches and the leaves are methods. 
The algorithm described directly in the paper looks a little impractical, since it includes iterating over all classes, but I think if you just iterated over classes that are method arguments, together with the intersections of these, then that'd be enough.&lt;br /&gt;&lt;br /&gt;There's a problem, though, which is that Factor's object system is complicated, and it's not obvious how to get the intersection of classes. The current implementation is probably buggy; if it's not, then I can't understand why it works. So class algebra will have to be fixed first, before multimethods can work. I hope I can figure out how to do this. I bet it has something to do with Binary Decision Diagrams, since we're talking about intersections, unions and negation. We might want to make things simpler by factoring predicate classes out in the way that the Cecil paper does. Either way, things are complicated.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;A protocol for variables&lt;/strong&gt;&lt;br /&gt;&lt;br /&gt;I never would have expected it, but the stack-based programming language Factor uses tons of variables. And lots of different kinds, too. &lt;ul&gt;&lt;li&gt;Lexically scoped variables in the &lt;code&gt;locals&lt;/code&gt; vocab&lt;/li&gt;&lt;li&gt;Dynamically scoped variables in the &lt;code&gt;namespaces&lt;/code&gt; vocab&lt;/li&gt;&lt;li&gt;Two types of globals, using &lt;code&gt;VALUE:&lt;/code&gt; and using words for dynamically scoped variables in the global scope&lt;/li&gt;&lt;li&gt;Thread-local variables in the &lt;code&gt;thread&lt;/code&gt; vocab&lt;/li&gt;&lt;li&gt;For web programming with Furnace, there are session variables and conversation variables and scope variables&lt;/li&gt;&lt;/ul&gt; These are all useful and necessary (though maybe we don't need two kinds of globals), but it's not necessary that they all use different syntactic conventions. 
They could all use the same syntax.&lt;br /&gt;&lt;br /&gt;Some of these variables always correspond to words (locals and values) and some of them don't. Probably it'd be better if all variables corresponded to words, to reduce the risk of typos and to make it possible to have terser syntax to use them, the way we have terse syntax to use locals and values.&lt;br /&gt;&lt;br /&gt;Here's my proposal for syntax. Whenever you want to use a variable (unless it's a local), you have to declare it beforehand in some vocab. For example, the following would declare a dynamically scoped variable &lt;code&gt;foo&lt;/code&gt; and a thread-local variable &lt;code&gt;bar&lt;/code&gt;. Currently, you wouldn't declare these ahead of time, and instead you would just use symbols for these purposes.&lt;pre&gt;DYNAMIC: foo&lt;br /&gt;THREAD-LOCAL: bar&lt;/pre&gt;Then, to read the values of these variables, the following syntax would be used&lt;pre&gt;foo&lt;br /&gt;bar&lt;/pre&gt;Reading a variable should just be done by executing the word that is the variable. This is how it works in almost all programming languages. Since the variable has been declared ahead of time, we know what kind of variable it is and how to read it. There could be a single parsing word for writing to variables, such as &lt;code&gt;to:&lt;/code&gt; (this is what values uses right now), as in the following.&lt;pre&gt;4 to: foo&lt;/pre&gt;Again, since the variable is declared ahead of time, we know how to write to it, and the right code is generated at parsetime. We could also have combinators like &lt;code&gt;change&lt;/code&gt;, written in a way that works with all variable types:&lt;pre&gt;[ 1 + ] change: foo&lt;/pre&gt;&lt;br /&gt;Another change I'm interested in is the scoping rules for dynamically scoped variables. 
Right now, when &lt;code&gt;with-variable&lt;/code&gt; is invoked, a totally new scope is made such that if any dynamic variable is set, that setting is only kept for the duration of the innermost &lt;code&gt;with-variable&lt;/code&gt; call. These semantics were designed back when the fundamental construct for variables was &lt;code&gt;bind&lt;/code&gt;, but in modern Factor code, this isn't so important anymore.&lt;br /&gt;&lt;br /&gt;The current semantics are a frequent source of bugs. Basically, it's the funarg problem all over again: if you have a higher-order function which includes a &lt;code&gt;with-variable&lt;/code&gt; call, and if the quotation sets any variables, then these variable settings don't escape the higher-order function call. This is usually not what the programmer intended! There are various hacks, euphemistically called idioms, used in Factor to get around this.&lt;br /&gt;&lt;br /&gt;The semantics should really be that &lt;code&gt;set&lt;/code&gt; modifies the binding in the scope that defined the variable, with the original binding always coming from &lt;code&gt;with-scope&lt;/code&gt;. But this change would break tons of Factor code that already exists, so I don't expect it to be adopted any time soon. Also, it would make things less efficient if each variable binding had its own scope, since right now things are often grouped together.&lt;br /&gt;&lt;br /&gt;One of my far-out hopes with variables is that maybe dynamically scoped variables could be optimized so that they would be as efficient as lexically scoped variables. But this would require lots of changes to Factor's implementation. For example, right now no level of the Factor compiler knows anything about any kind of variables; they're either subroutine calls or stack shuffling by the time the compiler sees them. 
Still, a sufficiently smart compiler could translate a usage of &lt;code&gt;make&lt;/code&gt; into a version that stores the partially built sequence on the stack, and there have been compilers that do the equivalent of this.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/273593670040001243-8865402568388300247?l=useless-factor.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://useless-factor.blogspot.com/feeds/8865402568388300247/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=273593670040001243&amp;postID=8865402568388300247' title='5 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/273593670040001243/posts/default/8865402568388300247'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/273593670040001243/posts/default/8865402568388300247'/><link rel='alternate' type='text/html' href='http://useless-factor.blogspot.com/2010/04/couple-language-design-ideas.html' title='A couple language design ideas'/><author><name>Daniel Ehrenberg</name><uri>http://www.blogger.com/profile/00902922561603041049</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>5</thr:total></entry><entry><id>tag:blogger.com,1999:blog-273593670040001243.post-3594733755794257874</id><published>2010-03-17T23:15:00.000-07:00</published><updated>2010-03-17T23:32:08.423-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='factor'/><category scheme='http://www.blogger.com/atom/ns#' term='module systems'/><title type='text'>Expressing joint behavior of modules</title><content type='html'>Sometimes, special code is needed to make two modules interact together. 
This seems to come up all the time in Factor. For example, there are two language features, locals and prettyprinting, implemented in the Factor library. If they are both loaded, then there is special code to make locals prettyprint. But neither one depends on the other.&lt;br /&gt;&lt;br /&gt;The old way to handle this situation was to have code that looked like this, as a top-level form in the &lt;code&gt;locals&lt;/code&gt; vocab:&lt;pre&gt;"prettyprint" vocab [&lt;br /&gt;    "locals.prettyprint" require&lt;br /&gt;] when&lt;/pre&gt;But this is problematic. What if locals is loaded first, and then prettyprint? Well, this actually happened, due to a change in some other code. Then locals don't prettyprint correctly! You could fix this by adding top-level statements in both vocabs, as mirrors and specialized arrays did, but then this single joint dependency is duplicated in two places. Maintenance is more difficult than it should be.&lt;br /&gt;&lt;br /&gt;To fix this problem, I added a new word, &lt;code&gt;require-when&lt;/code&gt;. The code above, as a top-level form in the locals vocab, would be replaced with &lt;pre&gt;"prettyprint" "locals.prettyprint" require-when&lt;/pre&gt;The logic is: if the &lt;code&gt;prettyprint&lt;/code&gt; vocab is already loaded, then &lt;code&gt;locals.prettyprint&lt;/code&gt; will be loaded. Otherwise, the dependency is registered with the module system, so if &lt;code&gt;prettyprint&lt;/code&gt; is loaded later by something else, then &lt;code&gt;locals.prettyprint&lt;/code&gt; will be as well.&lt;br /&gt;&lt;br /&gt;I'm pretty satisfied with this solution to a somewhat long-standing problem in Factor. I wonder how other module systems solve the same problem. 
I have this implemented in &lt;a href="http://github.com/littledan/Factor/tree/conditional"&gt;a branch in my repository&lt;/a&gt; and it should be incorporated into mainline Factor soon.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/273593670040001243-3594733755794257874?l=useless-factor.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://useless-factor.blogspot.com/feeds/3594733755794257874/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=273593670040001243&amp;postID=3594733755794257874' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/273593670040001243/posts/default/3594733755794257874'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/273593670040001243/posts/default/3594733755794257874'/><link rel='alternate' type='text/html' href='http://useless-factor.blogspot.com/2010/03/expressing-joint-behavior-of-modules.html' title='Expressing joint behavior of modules'/><author><name>Daniel Ehrenberg</name><uri>http://www.blogger.com/profile/00902922561603041049</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-273593670040001243.post-7743409328412682020</id><published>2010-03-08T20:01:00.000-08:00</published><updated>2010-05-26T19:13:51.834-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='factor'/><category scheme='http://www.blogger.com/atom/ns#' term='college'/><category scheme='http://www.blogger.com/atom/ns#' term='compiler'/><title type='text'>Paper about call( -- ) inlining</title><content type='html'>Since I think my method of inlining closures in Factor is 
stronger than previous similar optimizations, and since I wouldn't mind a potentially free trip to Toronto, I wrote an abstract on what I did for the &lt;a href="http://cs.stanford.edu/pldi10/src/"&gt;PLDI student research contest&lt;/a&gt;. It's called &lt;a href="http://factorcode.org/littledan/abstract.pdf"&gt;Closure Elimination as Constant Propagation&lt;/a&gt; (not sure if that's the best title). It's supposed to be under 800 words, and it's about double that now. Readers, I'd really appreciate your help in fixing errors, figuring out what to cut, and finding what is unclear and what I left out. Send me any suggestions as comments here or by email at ehrenbed@carleton.edu. Thanks!&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Update:&lt;/strong&gt; I edited the paper, and changed the link above.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Update 2:&lt;/strong&gt; There's also a severely abridged version, which is more like what I'll submit, &lt;a href="http://factorcode.org/littledan/abstract.pdf"&gt;here&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Update 3:&lt;/strong&gt; I got accepted to PLDI! 
If you're coming, I'll see you there!&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/273593670040001243-7743409328412682020?l=useless-factor.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://useless-factor.blogspot.com/feeds/7743409328412682020/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=273593670040001243&amp;postID=7743409328412682020' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/273593670040001243/posts/default/7743409328412682020'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/273593670040001243/posts/default/7743409328412682020'/><link rel='alternate' type='text/html' href='http://useless-factor.blogspot.com/2010/03/paper-about-call-inlining.html' title='Paper about call( -- ) inlining'/><author><name>Daniel Ehrenberg</name><uri>http://www.blogger.com/profile/00902922561603041049</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-273593670040001243.post-5578683435432693811</id><published>2010-03-04T12:19:00.000-08:00</published><updated>2010-03-04T17:58:49.389-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='factor'/><category scheme='http://www.blogger.com/atom/ns#' term='oop'/><title type='text'>A protocol for sets</title><content type='html'>It's bothered me for some time that Factor has a protocol (an informal abstract class) for sequences and associative mappings, but nothing for sets. 
In &lt;code&gt;core&lt;/code&gt; and &lt;code&gt;basis&lt;/code&gt;, there are three set implementations: you can use a hashtable as a set, a bit array, or a plain old sequence. Different words are used to manipulate these, and each set type has a rather incomplete set of operations.&lt;br /&gt;&lt;br /&gt;In &lt;a href="http://github.com/littledan/Factor/tree/bags"&gt;a branch in my repository&lt;/a&gt;, I've fixed this omission. Now all three are under a common protocol, and more can easily be added. If you're interested in reading about the details of the protocol, you can pull from this repository, bootstrap and read the documentation included.&lt;br /&gt;&lt;br /&gt;There are a number of benefits to this. &lt;ol&gt;&lt;li&gt;All sets now support all operations using names that correspond to the intuitive meanings of the operations&lt;/li&gt;&lt;li&gt;It is much easier to change set representations within a piece of code: just change the code that initializes the set&lt;/li&gt;&lt;li&gt;It's easier to implement new types of sets because you get many operations 'for free' and a nice common API.&lt;/li&gt;&lt;/ol&gt;&lt;br /&gt;The design I went with had a few conflicting goals:&lt;ul&gt;&lt;li&gt;The protocol should be clean and simple, and therefore easy to implement&lt;/li&gt;&lt;li&gt;The new protocol should be mostly compatible with the existing &lt;code&gt;sets&lt;/code&gt; vocabulary.&lt;/li&gt;&lt;li&gt;There should be no performance overhead (besides method dispatch) for using this protocol&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;Each of these had to be sacrificed a little. For performance, I had to make the protocol include many different operations, since different set implementations might have a way to override them for efficiency. To keep things simple and easy to implement, I did this only for operations that were currently in use in the Factor code base, and I included default methods for as many operations as I could. 
For compatibility, I made unordered sequences act as sets in a way that's very similar to the old &lt;code&gt;sets&lt;/code&gt; vocabulary, and the major operations are generalizations of the operations on sequences.&lt;br /&gt;&lt;br /&gt;There is not total backwards compatibility, though. If that had been a design requirement (say, if this were a change made on Factor 1.0), the design would be much cruftier. One change is that the word &lt;code&gt;prune&lt;/code&gt; is gone, subsumed by &lt;code&gt;members&lt;/code&gt;, which generates a sequence of the members of a set. Given a sequence, this gives a sequence with the same elements but only one copy of each.&lt;br /&gt;&lt;br /&gt;A bigger change is that hashtables are no longer meant to be used as sets. In their place is a new data structure, the hash set. Hash sets have literal syntax like &lt;code&gt;HS{ 1 2 3 }&lt;/code&gt;, similar to other collections. In their current implementation, they use a hashtable underneath, but in a future implementation a more memory-efficient construction may be used. Hash sets implement set operations and hashtables do not. Previously, words like &lt;code&gt;conjoin&lt;/code&gt;, &lt;code&gt;conjoin-at&lt;/code&gt; and &lt;code&gt;key?&lt;/code&gt; were used to query hashtables as sets, but now these are subsumed by the set words &lt;code&gt;adjoin&lt;/code&gt;, &lt;code&gt;adjoin-at&lt;/code&gt; and &lt;code&gt;in?&lt;/code&gt;. There is a lot of code that uses hashtables as sets, and it's not easy to sort out the set uses of hashtables from the non-set uses for someone who hasn't written the code. So for now, words to manipulate hashtables as sets are still present. 
&lt;code&gt;conjoin&lt;/code&gt; and &lt;code&gt;conjoin-at&lt;/code&gt; will be eliminated when all code in the Factor repository is updated to use hash sets instead.&lt;br /&gt;&lt;br /&gt;It's somewhat awkward to have a language evolution process where the language is not guaranteed to be compatible from one version to the next, as is the case for Factor right now. For the sake of clean organization of the language, there will continue to be incompatibilities until version 1.0 is released. There is a tradeoff here: incompatibilities make Factor harder to use now, and prevent adoption today, but the resulting system will be better-organized and easier to use. Factor would probably be in much worse shape today if a policy of backwards compatibility had been adopted a few years ago, and it's a little too soon to start freezing the language now.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Update&lt;/strong&gt;: By the way, I forgot to mention in the original post: many other high-level programming languages don't have such a nice generic collection of data structures. Factor's isn't complete, but in many ways it's more advanced than those of many other popular programming languages. Java, C++ and C# are languages with generic data structures libraries, but using the data structures is more verbose due to certain missing language features, like the lack of syntax for literal hashtables. Popular scripting languages like Python, Ruby and Perl tend to privilege hashtables and resizable arrays over other data structures. It's possible to create data types that look just like arrays or hashtables using Python or Ruby in terms of their interface. But the higher methods for these data structures will only work for the builtin types and there's no way to make them work for your data type but to reimplement them. Functional languages like Scheme and Haskell have libraries for lists and arrays, but the interfaces are different for different data types. 
Even though Haskell has type classes, both of these languages' standard libraries are written with lists in mind for the most common operations. Factor used to resemble scripting languages in its support of data structures, but experience in writing large programs with these data structures has led to a better thought-out, more object-oriented model.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/273593670040001243-5578683435432693811?l=useless-factor.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://useless-factor.blogspot.com/feeds/5578683435432693811/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=273593670040001243&amp;postID=5578683435432693811' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/273593670040001243/posts/default/5578683435432693811'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/273593670040001243/posts/default/5578683435432693811'/><link rel='alternate' type='text/html' href='http://useless-factor.blogspot.com/2010/03/protocol-for-sets.html' title='A protocol for sets'/><author><name>Daniel Ehrenberg</name><uri>http://www.blogger.com/profile/00902922561603041049</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-273593670040001243.post-7466326246377296505</id><published>2010-02-10T22:26:00.000-08:00</published><updated>2010-02-24T21:19:49.299-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='factor'/><category scheme='http://www.blogger.com/atom/ns#' term='compiler'/><title type='text'>Instruction scheduling for register 
pressure</title><content type='html'>Today, I finished the first version of a compiler optimization for Factor's compiler that reorders instructions in order to reduce the number of spills and reloads that the register allocator has to emit. I've been working on this since last fall, so I'm really happy to have a working version. In the current version, it eliminates 20% of those operations in a full development image on the 32-bit x86 architecture. 32-bit x86 is the main target for this optimization because it has so few registers. Scheduling for register allocation isn't an optimization most compilers do, but it looks like it's a profitable one.&lt;br /&gt;&lt;br /&gt;The algorithm I used is from the paper &lt;a href="http://portal.acm.org/citation.cfm?id=377849"&gt;Register-sensitive selection, duplication, and sequencing of instructions&lt;/a&gt; by &lt;a href="http://www.cs.rice.edu/~vsarkar/"&gt;Vivek Sarkar&lt;/a&gt; et al. (I only did the sequencing part.) The main idea of this paper is to adapt the &lt;a href="http://en.wikipedia.org/wiki/Sethi-Ullman_algorithm"&gt;Sethi-Ullman algorithm&lt;/a&gt;, which finds the instruction ordering with minimum register pressure for a tree, to the more general case of a &lt;a href="http://en.wikipedia.org/wiki/Directed_acyclic_graph"&gt;DAG&lt;/a&gt; where some edges represent data dependence and other edges represent control dependence. It uses a backwards &lt;a href="http://en.wikipedia.org/wiki/Instruction_scheduling#Algorithms"&gt;list scheduling&lt;/a&gt; pass to order the instructions based on a heuristic derived from the Sethi-Ullman numbering.&lt;br /&gt;&lt;br /&gt;The whole implementation took around 300 lines of high-level Factor code. It's in two vocabs &lt;code&gt;compiler.cfg.dependence&lt;/code&gt; and &lt;code&gt;compiler.cfg.scheduling&lt;/code&gt;. 
The first vocab defines procedures for constructing the dependence graph of a basic block, as well as partitioning that graph into fan-in trees and calculating the Sethi-Ullman numbering for these trees. The second vocab defines a word &lt;code&gt;schedule-instructions&lt;/code&gt; which schedules the instructions of each basic block of a CFG. This word is called after &lt;a href="http://factor-language.blogspot.com/2008/11/new-low-level-optimizer-and-code.html"&gt;write barrier elimination&lt;/a&gt; and before &lt;a href="http://factor-language.blogspot.com/2009/08/global-float-unboxing-and-some-other.html"&gt;representation selection&lt;/a&gt;. To save time, only blocks which might cause register spilling are scheduled.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Update&lt;/strong&gt;: The code is now up in &lt;a href="http://github.com/littledan/Factor/tree/s3/basis/compiler/"&gt;a branch&lt;/a&gt; of my git repository.&lt;br /&gt;&lt;br /&gt;On &lt;a href="http://www.reddit.com/r/programming/comments/b0tl0/factor_instruction_scheduling_for_register/"&gt;Reddit&lt;/a&gt;, Slava Pestov explains why scheduling instructions can reduce register pressure.&lt;br /&gt;&lt;blockquote&gt;&lt;br /&gt;This is a really cool optimization, but Dan doesn't explain why re-ordering instructions can reduce the number of spills and reloads. If this doesn't make sense to you, here's a really basic example. Suppose you have this sequence of operations in a hypothetical SSA IR:&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;x = 1&lt;br /&gt;y = 2&lt;br /&gt;z[0] = x&lt;br /&gt;z[1] = y&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;This needs three registers to compile without spilling, one for each one of x y and z. 
If your CPU only has two registers, there will be additional memory accesses generated.&lt;br /&gt;If we re-arrange the instructions like so, and assuming that x and y are not referenced anywhere else, then x and y can share a register, since their live ranges no longer overlap:&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;x = 1&lt;br /&gt;z[0] = x&lt;br /&gt;y = 2&lt;br /&gt;z[1] = y&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;This code no longer generates spills on a two-register CPU.&lt;br /&gt;&lt;/blockquote&gt;&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Update 2&lt;/strong&gt;: The scheduling doesn't actually produce much of a speedup yet. It makes the fasta benchmark about 6% faster, and other benchmarks are unchanged as far as I can tell. Oops! I should have benchmarked before blogging. Anyway, I think I might be able to improve this by making a more accurate dependence graph, so there aren't so many false control dependences. At this point, the optimization isn't very useful, but I'm just happy that I managed to get any speedup at all.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Update 3&lt;/strong&gt;: With a couple of minor modifications, it actually makes the fasta benchmark 9% faster, and removes 29% of spills and 26% of reloads, but this still isn't so great. Also, the total size of compiled code in the development image decreased by 1.5% with this change.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Update 4&lt;/strong&gt;: I modified the dependence graph construction code to reorder stack instructions (peek and replace). To do this without extra false dependences, I revived the old stack height normalization code, which is necessary because stack heights may change multiple times within a basic block due to block joining. Stack height normalization was removed from the compiler when the new DCN was added. 
Now, around 35% of spills and reloads are eliminated, compared to the version without scheduling.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/273593670040001243-7466326246377296505?l=useless-factor.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://useless-factor.blogspot.com/feeds/7466326246377296505/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=273593670040001243&amp;postID=7466326246377296505' title='3 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/273593670040001243/posts/default/7466326246377296505'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/273593670040001243/posts/default/7466326246377296505'/><link rel='alternate' type='text/html' href='http://useless-factor.blogspot.com/2010/02/instruction-scheduling-for-register.html' title='Instruction scheduling for register pressure'/><author><name>Daniel Ehrenberg</name><uri>http://www.blogger.com/profile/00902922561603041049</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>3</thr:total></entry><entry><id>tag:blogger.com,1999:blog-273593670040001243.post-144155204039243418</id><published>2010-01-30T12:45:00.000-08:00</published><updated>2010-01-30T15:17:41.010-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='factor'/><category scheme='http://www.blogger.com/atom/ns#' term='compiler'/><title type='text'>Runtime type feedback in Factor</title><content type='html'>I realized today that, thanks to all the great machinery that Slava Pestov has built, Factor already has everything needed to support feedback-directed optimizations. 
To demonstrate this, I implemented a simple version of &lt;a href="http://research.sun.com/self/papers/type-feedback.html"&gt;type feedback&lt;/a&gt;. The &lt;a href="http://paste.factorcode.org/paste?id=1216"&gt;whole thing&lt;/a&gt; is fewer than 30 lines of code. It could definitely use a bit of improvement, but in a simple benchmark, I recorded a speedup of more than a factor of two compared to the purely statically optimized version.&lt;br /&gt;&lt;br /&gt;The idea underlying runtime type feedback is that we can produce specialized versions of code based on the types it is used with. For example, consider a word like &lt;code&gt;append&lt;/code&gt;. On each iteration of the main loop, it calls a method on one of the two inputs to read from the input sequence. It also calls a method on the output to write to it. In general, none of these types are known until the program runs. However, if we know that the inputs are both arrays, then the output will also be an array and all of these method calls can be inlined using the static analysis of the compiler's propagation pass. This makes the program run significantly faster.&lt;br /&gt;&lt;br /&gt;The existing &lt;a href="http://docs.factorcode.org/content/word-HINTS__colon__%2Chints.html"&gt;hints&lt;/a&gt; mechanism capitalizes on this. Hints make it possible to declare, "this word is likely to be called with inputs from the following classes". With this declaration, the compiler can make a few specialized versions of a word annotated with hints, one for each predicted combination of classes. When the word gets called, it does a series of type checks to choose which version to use, based on the types of the arguments. If none of the types match, it calls the default version.&lt;br /&gt;&lt;br /&gt;The problem with this is that you have to declare, in advance, what types you expect to come up. Sometimes, you don't really know what types to expect until the program runs. 
This is where runtime type feedback comes in. In this system, while the program runs, the language runtime collects profile information about what types words are being called with. If a word is called with a particular type enough times, then hints are added to the word to make it specialized on that type. The word is recompiled with these hints, and future calls to the word have access to this specialized version.&lt;br /&gt;&lt;br /&gt;When I first heard about type feedback, it sounded really hard to implement. But it turns out, because Factor already has the hints mechanism and supports runtime redefinition of words, that type feedback is easy. All I did was use a hashtable to store the frequencies of types that a word was called with, and add hints based on this.&lt;br /&gt;&lt;br /&gt;There are some disadvantages to this approach. The first is that the compiler must be packaged with any program that wants to use type feedback at runtime. This takes up memory at runtime and makes the program take longer to start up because of the bigger executable. Additionally, any runtime optimization makes programs run slower at the beginning. The system has a certain degree of "warmup time" where it is not running at full speed because runtime optimizations have not yet been applied. An extreme case of this is in a full just-in-time (JIT) compiler, where all compilation takes place at runtime, and the program is initially running in an interpreter or something similar. This would not be quite as big of an issue here, where runtime type feedback is added to an ahead-of-time (AOT) compiler, but it's still a problem.&lt;br /&gt;&lt;br /&gt;My implementation of type feedback is far from perfect, and should be considered a proof of concept at this point. It won't be included as a default feature of Factor any time soon. First, the code that performs profiling might be faster if this code were written in a lower-level style. 
On some benchmarks, the cost of profiling in this implementation outweighs the benefits of the type feedback. Second, the way I use hints is pretty inefficient. I add a hint and then recompile the entire word with all the hints. So if n type combinations are added, then it spends O(n&lt;sup&gt;2&lt;/sup&gt;) time compiling, where it should have only spent O(n) time. Third, hints are not implemented as well as they could be in terms of runtime performance. A better implementation, based on multimethods, would probably improve the performance of type feedback. Finally, the Factor compiler itself isn't as fast as it could be. Compilation speed matters much more in the presence of runtime compilation than for AOT compilation, so this hasn't been the top priority so far.&lt;br /&gt;&lt;br /&gt;Despite these flaws, runtime type feedback, even implemented naively, creates a significant speedup. It seems like dynamic optimizations have a significant benefit for dynamically typed, object-oriented languages like Factor, that static optimizations alone can't completely match.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/273593670040001243-144155204039243418?l=useless-factor.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://useless-factor.blogspot.com/feeds/144155204039243418/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=273593670040001243&amp;postID=144155204039243418' title='4 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/273593670040001243/posts/default/144155204039243418'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/273593670040001243/posts/default/144155204039243418'/><link rel='alternate' type='text/html' href='http://useless-factor.blogspot.com/2010/01/type-feedback-in-factor.html' 
title='Runtime type feedback in Factor'/><author><name>Daniel Ehrenberg</name><uri>http://www.blogger.com/profile/00902922561603041049</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>4</thr:total></entry><entry><id>tag:blogger.com,1999:blog-273593670040001243.post-209438821498437353</id><published>2009-10-10T10:51:00.000-07:00</published><updated>2009-10-12T17:55:01.613-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='factor'/><category scheme='http://www.blogger.com/atom/ns#' term='compiler'/><title type='text'>Bitfields in Factor structs and the special style</title><content type='html'>Sometimes, it's useful to pack multiple small numbers into a small amount of memory. Say you have a tuple like&lt;br /&gt;&lt;pre&gt;TUPLE: nums a b c ;&lt;/pre&gt;where you know that &lt;code&gt;a&lt;/code&gt; is always between 0 and 31, &lt;code&gt;b&lt;/code&gt; is always between -2 and 1, and &lt;code&gt;c&lt;/code&gt; is always either 0 or 1. It should be possible to store this in a single byte, since &lt;code&gt;a&lt;/code&gt; could be represented in 5 bits, &lt;code&gt;b&lt;/code&gt; can be represented in 2 bits, and &lt;code&gt;c&lt;/code&gt; can be represented in 1 bit.&lt;br /&gt;&lt;br /&gt;One way to implement this is to define words to store in a single fixnum what a &lt;code&gt;nums&lt;/code&gt; tuple stores. These words would consist of shifts, masks and maybe calls to a sign extension routine. But this code would be boring and annoying to write.&lt;br /&gt;&lt;br /&gt;I've automated this process by allowing Factor structs to have bit fields, like C structs can. 
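&lt;br /&gt;&lt;br /&gt;For illustration only, the manual approach described above could be sketched like this. The sketch is in Python rather than Factor, the names &lt;code&gt;pack_nums&lt;/code&gt; and &lt;code&gt;unpack_nums&lt;/code&gt; are hypothetical, and the layout (&lt;code&gt;a&lt;/code&gt; in the low 5 bits, then &lt;code&gt;b&lt;/code&gt;, then &lt;code&gt;c&lt;/code&gt; in the top bit) is an assumption made for the example. The shifts and masks are written as multiplication, division and remainder by powers of two, which compute the same values.&lt;br /&gt;&lt;pre&gt;
```python
def pack_nums(a, b, c):
    """Pack a (0..31), b (-2..1) and c (0..1) into a single byte."""
    if a not in range(32) or b not in range(-2, 2) or c not in range(2):
        raise ValueError("field out of range")
    b_bits = b % 4  # two's complement encoding of b in 2 bits
    # Multiplying by 32 and 128 is the same as shifting left by 5 and 7.
    return a + b_bits * 32 + c * 128


def unpack_nums(byte):
    """Recover (a, b, c) from a byte produced by pack_nums."""
    a = byte % 32                   # low 5 bits
    b_bits = (byte // 32) % 4       # next 2 bits
    c = byte // 128                 # top bit
    b = b_bits - 4 * (b_bits // 2)  # sign-extend the 2-bit field
    return a, b, c


packed = pack_nums(5, -2, 0)
print(packed)               # prints 69
print(unpack_nums(packed))  # prints (5, -2, 0)
```
&lt;/pre&gt;Even for three fields, the sign extension and the range checks are easy to get wrong, which is the kind of boilerplate that bit fields in structs take care of.&lt;br /&gt;&lt;br /&gt;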
Here's how the implementation of &lt;code&gt;nums&lt;/code&gt; would look:&lt;br /&gt;&lt;pre&gt;STRUCT: nums&lt;br /&gt;   { a uint bits: 5 }&lt;br /&gt;   { b int bits: 2 }&lt;br /&gt;   { c uint bits: 1 } ;&lt;/pre&gt;Thanks to Joe Groff's changes to how structs work, this is a class, with prettyprinting and accessors just like tuple classes. Structs were originally a feature of Factor for the foreign function interface, but they're actually useful in pure Factor programs too: they're an efficient way to manipulate binary data.&lt;br /&gt;&lt;br /&gt;Bitfields deviate from the FFI origins of structs because they don't follow any standard convention on bitfield layout. In C, unlike for structs, there is no single standard ABI for bitfields, even on a single OS/architecture platform. Different compilers act differently. So why not make up my own convention? I store everything little-endian, and the first bitfield is stored in the least significant bits of the byte. Because bitfield access only reads the underlying memory one byte at a time in the current implementation, this is just as efficient on big-endian hardware as on little-endian hardware.&lt;br /&gt;&lt;br /&gt;Factor supports efficient homogeneous arrays of structs, allowing lots of data to be packed together compactly. Because I extended structs for bitfields, rather than creating a new type of class, struct arrays can be used with structs that have bitfields. This worked immediately; I didn't have to modify struct arrays.&lt;br /&gt;&lt;br /&gt;The actual code for the setters and getters is implemented in a high-level style. There are no type declarations, and all math is done with generic arithmetic calls. Factor's compiler is smart enough to translate this into fixnum arithmetic with no overflow checks in the case that the bitfields are small enough. If you make a field 100 bits wide, it'll use generic integer arithmetic, to take into account possible overflow into bignums. 
But this will only be used for certain calculations, the ones that could overflow.&lt;br /&gt;&lt;br /&gt;The Factor code generated for accessors is naive: it doesn't declare types, and it also does some things that could be special-cased for more efficient code. For example, in every byte read of the array, a mask is applied to the result with bitwise and, so that the relevant bits can be extracted. Sometimes, that mask will be 255, so it won't actually mask away any bits, and does nothing. Rather than special-casing a mask of 255 and generating code that doesn't call bitand, I extended the compiler to understand the algebraic identity that, in the range of numbers from 0 to 255, &lt;code&gt;255 bitand&lt;/code&gt; is the identity function. (The same holds for a mask of (2^n)-1 on the range from 0 to (2^n)-1, for any n.)&lt;br /&gt;&lt;br /&gt;Not all code can be compiled as efficiently by the compiler as the bitfield accessors. There is a special style of code that the compiler can compile more efficiently. As the compiler becomes more advanced, the subset grows. Really, there's a continuum of programs that the compiler can compile more or less efficiently. Originally, the special style consisted of code whose stack effect could be inferred by the compiler, allowing optimizations to take place. Now, all valid Factor code has an inferrable stack effect, but the compiler is advanced enough that it can do further optimizations when more information is available.&lt;br /&gt;&lt;br /&gt;Code written in the special style has to have something indicating all of this information about the code. In the case of bitfield accessors, we want the compiler to be able to infer that it can use fixnum arithmetic without overflow checks if the bitfield is small enough that the result won't ever overflow into a bignum. 
The compiler can figure this out in the case of bitfield getters because it is only doing shifts, bitands and bitors on the result of &lt;code&gt;alien-unsigned-1&lt;/code&gt;, a word to read one byte of memory. The shift amounts are all constants.&lt;br /&gt;&lt;br /&gt;In the sparse conditional constant propagation pass, the compiler tracks the possible interval of values that a virtual register could hold. The compiler understands how bitands, bitors and shifts by a literal relate the interval of their output to the interval of their input. Another pass, the modular arithmetic pass, propagates information backwards about the modulus that earlier calculations could be done in, based on how the data is used later. This also allows overflow checks to be eliminated.&lt;br /&gt;&lt;br /&gt;This aspect of the special style allows overflow checks to be eliminated, allowing the use of more direct machine arithmetic. Other aspects of special style allow other efficient code to be generated. In code where types can be inferred, and a float or SIMD type is used, the compiler can allocate float or SIMD registers, and inline efficient code for arithmetic operations. Code using immutable tuples gets the benefit of tuple unboxing, so repeated lookup of slots is cheap, and sometimes allocation can be eliminated. Code that uses mutable data structures gets unboxed at another point in the compiler, but since it is harder to reason about, unboxing right now has only been implemented within a restricted piece of code. When the class of the input to a generic word is known, the correct method is called directly, and sometimes inlined. There are many other optimizations that have been implemented over the years, too many to describe in this blog post.&lt;br /&gt;&lt;br /&gt;Not all code could be written in the special style, and it would be unreasonable to expect most Factor programmers to learn about it. 
The compiler and the UI framework, for example, would be difficult to translate into a form taking heavy advantage of special style. But a lot of the code in the standard library could be written this way. For example, the code for appending sequences is written in a style that allows the inner loop to have no subroutine calls inside of it and register allocation to be performed, when operating on certain types of sequences. There is only one piece of code that implements append, but using the hints feature lets specialized versions be generated for certain types which are often appended together. Append is used in many places, so code that's written in a dynamic style will benefit from this, even if the dynamic-style code itself isn't optimized any better.&lt;br /&gt;&lt;br /&gt;Special style is an extremely important feature of Factor, and without it, Factor would be as slow as many scripting languages. Without special style, many libraries would have to be implemented in C, as they are in scripting languages. Because of special style, Factor can be self-hosting and use many libraries written in Factor without an overwhelming performance penalty. Without special style, to implement bitfields as fast as they are right now, it would have been necessary to generate machine code on each platform.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Update&lt;/strong&gt;: I guess I forgot to tell you exactly what special style is, and how to write code in it. Well, it's complicated. I'll write about it in the future. 
Special style grows as the compiler becomes more advanced, but I can't describe how it is right now.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/273593670040001243-209438821498437353?l=useless-factor.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://useless-factor.blogspot.com/feeds/209438821498437353/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=273593670040001243&amp;postID=209438821498437353' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/273593670040001243/posts/default/209438821498437353'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/273593670040001243/posts/default/209438821498437353'/><link rel='alternate' type='text/html' href='http://useless-factor.blogspot.com/2009/10/bitfields-in-factor-structs-and-special.html' title='Bitfields in Factor structs and the special style'/><author><name>Daniel Ehrenberg</name><uri>http://www.blogger.com/profile/00902922561603041049</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-273593670040001243.post-269736305554763576</id><published>2009-08-13T19:55:00.000-07:00</published><updated>2009-08-13T21:28:00.076-07:00</updated><title type='text'>Hoisting write barriers out of loops</title><content type='html'>In a generational garbage collector, pointers from the old generation to the young generation need to be tracked. In every minor collection, these need to be considered as roots. A small piece of code called a write barrier is run on each pointer write. The write barrier records the object as modified. 
Each minor collection considers modified objects as potential roots into the youngest generation.&lt;br /&gt;&lt;br /&gt;Write barriers don't have to be run on every single write, actually. There are two cases where they don't have to be run:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;If a small enough object has been allocated, and no GC could have been run since the allocation, then it must be in the nursery, so writes to it don't need a write barrier.&lt;/li&gt;&lt;li&gt;If a write barrier has been run on an object and the GC hasn't been run after that, then the write barrier does not need to run on further writes to the object.&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;These observations don't hold across subroutine calls, since there might be a garbage collection there. They're also invalid across GC checks. But there's still a lot of code that can be improved with these observations.&lt;br /&gt;&lt;br /&gt;For example, the word &lt;code&gt;reverse&lt;/code&gt;, if specialized for arrays, doesn't have any subroutine calls or allocations in its inner loop, after enough compiler passes have run. But a naive code generator would put a write barrier call on each loop iteration. It's enough to just call the write barrier once, outside of the loop, and doing this gives you a 15% speedup.&lt;br /&gt;&lt;br /&gt;I implemented this as a compiler optimization on Factor's low-level IR, extending Slava's local write barrier elimination pass, described &lt;a href="http://factor-language.blogspot.com/2008/11/new-low-level-optimizer-and-code.html"&gt;here&lt;/a&gt;. Slava's pass eliminates redundant write barriers within a basic block, based on the two facts I just mentioned. For the local case, Slava's implementation is optimal, but with control flow we can do much better.&lt;br /&gt;&lt;br /&gt;Here's the idea: first, insert a call to the write barrier outside of any loop that calls the write barrier on each iteration. Next, delete redundant write barriers using a dataflow analysis. 
With Factor's new dataflow analysis and loop analysis frameworks, both of these tasks are pretty easy.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Inserting write barriers outside of loops&lt;/strong&gt;&lt;br /&gt;&lt;br /&gt;Slava just implemented loop detection in low-level IR. For each loop, I want to find the set of registers that each iteration will definitely call the write barrier on. Once I find this set, I just place the write barriers right before the start of the loop. Deleting the ones inside the loop comes as the next step.&lt;br /&gt;&lt;br /&gt;The output of loop detection is a list of loops, and each loop has a header (the entry point), a list of ends (sources for jumps back to the header) and a list of basic blocks contained in the loop. If a basic block dominates each end, then it must be run on each iteration of the loop. So a conservative approximation of the list of write barriers that must be run on each iteration is the list of write barriers contained in basic blocks that dominate each end of the loop. It turns out this is enough to get all of the meaningful, practical cases like &lt;code&gt;append&lt;/code&gt; and &lt;code&gt;reverse&lt;/code&gt;.&lt;br /&gt;&lt;br /&gt;We have to be a little bit careful, though. You can't always insert a write barrier outside of a loop, because you can't run the write barrier on something like a fixnum. If you do, the VM might crash. Because type information isn't available in low-level IR, I reconstruct what can have the write barrier called on it by seeing what has had a slot lookup. This is a simple dataflow analysis.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Removing redundant write barriers&lt;/strong&gt;&lt;br /&gt;&lt;br /&gt;Write barriers can be removed with another dataflow analysis. Here, for each basic block, we want to calculate the registers where the write barrier does not need to be called. 
Once we have this set, we can run the old write barrier removal algorithm.&lt;br /&gt;&lt;br /&gt;This is a forward analysis. I call the set of registers where the write barrier does not need to be called again the "safe" set. Safe-in for a basic block is the intersection of the safe-outs of the predecessors if the current block has no allocation, and it is the empty set if it does have allocation. Safe-out is safe-in plus all registers that have been allocated in the block, and those that have had the write barrier run on them. Factor's dataflow framework handles this perfectly.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Conclusion&lt;/strong&gt;&lt;br /&gt;&lt;br /&gt;Hoisting write barriers out of loops was easier than I expected, just two evenings of work. Unfortunately, it isn't as general as it should be. The type reconstruction by looking at slot calls doesn't work as well as it should, since there is a subroutine call between the call to slot and the loop in the case of &lt;code&gt;append&lt;/code&gt; and &lt;code&gt;push-all&lt;/code&gt;. I think the solution to this would be for high-level IR to pass type information down to low-level IR. It would be useful here, and I bet as the low-level optimizer becomes more advanced, it will become useful in other places too. 
The code implementing the optimization is &lt;a href="http://paste.factorcode.org/paste?id=819#435"&gt;in the Factor pastebin&lt;/a&gt; and should be in the main Factor repository soon.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/273593670040001243-269736305554763576?l=useless-factor.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://useless-factor.blogspot.com/feeds/269736305554763576/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=273593670040001243&amp;postID=269736305554763576' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/273593670040001243/posts/default/269736305554763576'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/273593670040001243/posts/default/269736305554763576'/><link rel='alternate' type='text/html' href='http://useless-factor.blogspot.com/2009/08/hoisting-write-barriers-out-of-loops.html' title='Hoisting write barriers out of loops'/><author><name>Daniel Ehrenberg</name><uri>http://www.blogger.com/profile/00902922561603041049</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-273593670040001243.post-4149167537400153478</id><published>2009-07-16T18:10:00.000-07:00</published><updated>2009-07-16T19:45:42.772-07:00</updated><title type='text'>A simple interprocedural optimization</title><content type='html'>An important feature of a compiler is that it compile code fast enough so that you don't feel like you're waiting forever. 
For this reason, most optimizations stop at the boundary between words.&lt;br /&gt;&lt;br /&gt;If one word calls another word, then the callee, in general, doesn't get the benefit of any information collected about what its arguments look like. And the caller doesn't get any information about what might be returned. There are exceptions to this, for specific words that have special optimization behavior in the compiler. For example, class predicates interact specially with the propagation pass to fold to constants whenever the compiler can prove that they'll always evaluate to true or false.&lt;br /&gt;&lt;br /&gt;One optimization the compiler works hard to do is eliminating type checks and runtime generic dispatch. It likes to turn virtual method calls into direct jumps, both because this is faster and because it enables further optimizations. Type inference in the SCCP pass is what drives the elimination of dispatch.&lt;br /&gt;&lt;br /&gt;But type inference has to stop at procedure boundaries, in general. We can't know all of the possible inputs to a word, since it can be called from anywhere, including the listener. And it would take too much time for callers to trace through every procedure they call to see what they can deduce about the output from what they know about the input.&lt;br /&gt;&lt;br /&gt;On the other hand, there would sometimes be a lot of benefit if callers and callees could interact to perform optimizations. It'd be especially helpful for words like &lt;code&gt;append&lt;/code&gt;, which would benefit the most from type inference. &lt;code&gt;append&lt;/code&gt; consists of many generic calls (to &lt;code&gt;length&lt;/code&gt; and &lt;code&gt;nth-unsafe&lt;/code&gt;), and the dispatch can be eliminated if the types of the inputs are known. 
Additionally, the type of the output follows from the types of the inputs.&lt;br /&gt;&lt;br /&gt;Maybe interprocedural analysis is too much work in general, but for something like &lt;code&gt;append&lt;/code&gt;, it would be helpful to have versions specialized for several different types, which are used when the type of the inputs is known. I implemented in Factor a simple system where words can be annotated to do this. The code is &lt;a href="http://paste.factorcode.org/paste?id=786#412"&gt;in the Factor pastebin&lt;/a&gt;; this is just a prototype and needs some changes before it's fully ready to use.&lt;br /&gt;&lt;br /&gt;With this system, to make the word &lt;code&gt;append&lt;/code&gt; automatically create specialized versions based on the types of its two inputs, you can use the declaration&lt;br /&gt;&lt;pre&gt;SPECIALIZED: append 2&lt;/pre&gt;&lt;br /&gt;This doesn't immediately compile a ton of different versions of the word. Instead, it compiles them "on demand", whenever the propagation pass finds that append is used with certain types.&lt;br /&gt;&lt;br /&gt;When I applied this to the nbody benchmark, part of the Programming Language Shootout, by making certain words in &lt;code&gt;math.vectors&lt;/code&gt; specialized, the running time went from around 4.3 seconds to around 4.0 seconds. This is a modest gain, but it's on top of something already highly optimized--there is some code which gives the compiler special knowledge of how to run vector operations on the kind of array used in the benchmark. 
I hope that this technique can make most of that code go away.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/273593670040001243-4149167537400153478?l=useless-factor.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://useless-factor.blogspot.com/feeds/4149167537400153478/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=273593670040001243&amp;postID=4149167537400153478' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/273593670040001243/posts/default/4149167537400153478'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/273593670040001243/posts/default/4149167537400153478'/><link rel='alternate' type='text/html' href='http://useless-factor.blogspot.com/2009/07/simple-interprocedural-optimization.html' title='A simple interprocedural optimization'/><author><name>Daniel Ehrenberg</name><uri>http://www.blogger.com/profile/00902922561603041049</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-273593670040001243.post-1675280860821166741</id><published>2009-07-14T12:35:00.001-07:00</published><updated>2009-07-15T22:49:08.650-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='factor'/><category scheme='http://www.blogger.com/atom/ns#' term='compiler'/><title type='text'>Some new compiler optimizations</title><content type='html'>I've been working on Factor's optimizing compiler, adding a few new simple optimizations. 
I've made &lt;code&gt;call(&lt;/code&gt; and &lt;code&gt;execute(&lt;/code&gt; do more inlining, extended dead code elimination, increased the number of cases where overflow checks can be eliminated, and made object instantiation fast in more cases. Here, I'll explain what the optimizations are and how they're implemented.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Inlining with &lt;code&gt;call(&lt;/code&gt; and &lt;code&gt;execute(&lt;/code&gt;&lt;/strong&gt;&lt;br /&gt;&lt;br /&gt;&lt;code&gt;call( -- )&lt;/code&gt; and &lt;code&gt;execute( -- )&lt;/code&gt; are words which let you call quotations or execute words. Slava explained them &lt;a href="http://factor-language.blogspot.com/2009/03/better-static-safety-for-higher-order.html"&gt;in a blog post&lt;/a&gt;. They differ from &lt;code&gt;call&lt;/code&gt; and &lt;code&gt;execute&lt;/code&gt; in that they don't require that the word or quotation is available through combinator inlining. But they require an explicit stack effect to be given, to ensure that it takes and returns the right number of parameters. This is nice, because the versions with a stack effect have an additional safety property: they'll only run if the code has the right stack effect.&lt;br /&gt;&lt;br /&gt;Until now, these combinators carried with them a performance penalty over using &lt;code&gt;call&lt;/code&gt; or &lt;code&gt;execute&lt;/code&gt; with known quotations. The penalty is that the stack effect of the quotation must be checked, at runtime, to match the stack effect of the callsite. With an optimization I implemented, the performance is the same when the quotation is known. 
With matching performance in this case (they're both completely free in the case where either would work), it should be easier to write code that uses the checked versions.&lt;br /&gt;&lt;br /&gt;For example, the following implementation of an absolute value function compiles down to the same code that you'd write with the normal &lt;code&gt;if&lt;/code&gt; combinator.&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;:: iff ( x cond true-quot false-quot -- x' )&lt;br /&gt;    cond&lt;br /&gt;    [ x true-quot call( x -- x' ) ]&lt;br /&gt;    [ x false-quot call( x -- x' ) ] if ; inline&lt;br /&gt;: abs ( n -- abs(n) )&lt;br /&gt;    dup 0 &lt; [ neg ] [ ] iff ;&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;This is implemented as part of the sparse conditional constant propagation (SCCP) pass in Factor's compiler. &lt;code&gt;call-effect&lt;/code&gt; and &lt;code&gt;execute-effect&lt;/code&gt; have custom inlining behavior there, which takes advantage of information collected by the propagation pass. If enough is known about the quotation at this time (if it is a literal quotation, or a literal quotation with something (literal or non-literal) curried on to it, or two literal quotations composed together) so that the stack effect can be inferred and the code can be inlined, then it is inlined.&lt;br /&gt;&lt;br /&gt;I could have implemented this as a transform in the stack checker, but this strategy gives a stronger optimization, since it can interact with everything in constant propagation. For example, it interacts with method inlining. 
This will help improve performance in the &lt;a href="http://factor-language.blogspot.com/2009/04/sup-dawg-we-heard-you-like-smalltalk-so.html"&gt;Factor Smalltalk implementation&lt;/a&gt;, where previously combinator inlining would have been impossible without special support from the Smalltalk-to-Factor translator.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Eliminating overflow checks&lt;/strong&gt;&lt;br /&gt;&lt;br /&gt;In Factor, unlike C and Java, calculations on integers never overflow. Instead, numbers that are too big are converted to a representation that scales to arbitrary size. The smaller integers which are faster to calculate on are called "fixnums" and the larger ones, which are slower to use, are called "bignums".&lt;br /&gt;&lt;br /&gt;Factor's compiler does a lot of work to convert general arithmetic to arithmetic on fixnums when possible. One thing the compiler does is try to infer that integers will be in a small enough range that no overflow can happen. This is part of the SCCP pass.&lt;br /&gt;&lt;br /&gt;Another compiler pass checks if a value is only used in places where overflowing doesn't matter. For example, if the code &lt;code&gt;+ &gt;fixnum&lt;/code&gt; is run in a context where it's known that its arguments will both be fixnums, then the overflow check can be skipped because of how modular arithmetic works.&lt;br /&gt;&lt;br /&gt;I extended this pass to take into account a couple other words that have this property that &lt;code&gt;&gt;fixnum&lt;/code&gt; has. Specifically, these are the words which set memory at a given pointer address, like &lt;code&gt;set-alien-unsigned-1&lt;/code&gt;, and which take integers as arguments. Depending on the word size of the platform (32 vs 64 bits), this optimization is valid for a different set of words.&lt;br /&gt;&lt;br /&gt;One time that this comes up is in &lt;code&gt;v+n&lt;/code&gt; when called with a byte array and a fixnum. 
Adding two fixnums gives back something that can be either a fixnum or a bignum, but the form of addition without an overflow check can be used since the result is going to be stored back into a byte array. Storing into a byte array is implemented with &lt;code&gt;set-alien-unsigned-1&lt;/code&gt;, so the optimization applies.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Inlining for &lt;code&gt;new&lt;/code&gt;&lt;/strong&gt;&lt;br /&gt;&lt;br /&gt;The word &lt;code&gt;new&lt;/code&gt; instantiates a tuple with default values for all slots, given a tuple class. By default, this is done dynamically: the tuple class does not need to be known before runtime. But usually it is known ahead of time. If it is known ahead of time, then code can be inlined which instantiates the specific tuple that is there.&lt;br /&gt;&lt;br /&gt;This was previously implemented as part of the stack checker. I moved it to the propagation pass, which makes the optimization stronger. I plan on moving more transforms like this (for example, the one for &lt;code&gt;member?&lt;/code&gt;) to the propagation pass.&lt;br /&gt;&lt;br /&gt;[&lt;strong&gt;Update&lt;/strong&gt;: I did this, adding a utility called &lt;code&gt;define-partial-eval&lt;/code&gt;. Its interface is identical to &lt;code&gt;define-transform&lt;/code&gt;, but it operates on the propagation pass IR. Transformations which don't need to interact with stack checking should use &lt;code&gt;define-partial-eval&lt;/code&gt; rather than &lt;code&gt;define-transform&lt;/code&gt;, since it creates a stronger optimization.]&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Better dead code elimination&lt;/strong&gt;&lt;br /&gt;&lt;br /&gt;I extended dead code elimination in the low-level IR to be more accurate. Now, it realizes that an allocation is dead if it is only written to, and never read from. With this, together with alias analysis, the code &lt;code&gt;2array first2&lt;/code&gt; compiles into a no-op, and no allocation takes place. 
(This isn't optimized out by tuple unboxing in the high-level IR, because arrays are mutable.)&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/273593670040001243-1675280860821166741?l=useless-factor.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://useless-factor.blogspot.com/feeds/1675280860821166741/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=273593670040001243&amp;postID=1675280860821166741' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/273593670040001243/posts/default/1675280860821166741'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/273593670040001243/posts/default/1675280860821166741'/><link rel='alternate' type='text/html' href='http://useless-factor.blogspot.com/2009/07/some-new-compiler-optimizations.html' title='Some new compiler optimizations'/><author><name>Daniel Ehrenberg</name><uri>http://www.blogger.com/profile/00902922561603041049</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-273593670040001243.post-4369911903168028159</id><published>2009-03-17T11:19:00.000-07:00</published><updated>2009-03-17T15:25:34.259-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='factor'/><category scheme='http://www.blogger.com/atom/ns#' term='parsing'/><title type='text'>The implementation of Factor's regexp library</title><content type='html'>I've been working on Factor's regular expression library, initially written by Doug Coleman, for the past few weeks. Recently, the library became good enough that I've pushed it to Factor's main repository. 
The latest Factor binaries have this new library.&lt;br /&gt;&lt;br /&gt;The library uses a standard algorithm of converting a regular expression into an NFA, and that into a DFA which can be executed. This is a tradeoff: the code generated will be faster than you would get from a backtracking search or an NFA interpreter, but it takes exponential time, in the worst case, to generate the DFA. I might revisit this later.&lt;br /&gt;&lt;br /&gt;The main features missing now are&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Possessive and reluctant matching&lt;/li&gt;&lt;li&gt;Group capture&lt;/li&gt;&lt;li&gt;Unicode support, in the form of &lt;a href="http://unicode.org/unicode/reports/tr18/"&gt;UTS 18&lt;/a&gt; level 1 compliance with some level 2 features&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;Right now, I'm working on Unicode support. Regexps already use Unicode strings, because all strings in Factor represent a sequence of Unicode code points, but not many Unicode properties are exposed now. I plan on working on this, and implementing more Unicode algorithms and properties, to reach level 1 compliance. &lt;br /&gt;&lt;br /&gt;The rest of this article is an overview of how the regexp engine works. It is implemented as a series of passes, where the first pass takes a string as input, and the last pass outputs a word which runs the code of the regexp. In this way, it is rather like any other compiler, where the parse tree, DFA table and NFA table are just intermediate representations.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;The parser&lt;/strong&gt;&lt;br /&gt;&lt;br /&gt;The parser is implemented with Chris Double's packrat parsing library. This makes it not very efficient, but the time spent in parsing is much less than the time in later processing stages, so the cost isn't very large. 
Things like &lt;code&gt;/a{2,4}/&lt;/code&gt; are expanded into the equivalent, but simpler, form &lt;code&gt;/aa|aaa|aaaa/&lt;/code&gt;.&lt;br /&gt;&lt;br /&gt;If I were working only with ASCII, then ranges would be expanded into disjunctions as well, but the alphabet is far too big for that. Instead, something like &lt;code&gt;/[ac-z]/&lt;/code&gt; is represented in the syntax tree as an item, a character class object, representing the information that it matches the character a, or something in the class which is the range c-z. For a character class like &lt;code&gt;/[^\dc]/&lt;/code&gt;, an object is created which represents a character which is not a digit or c.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Constructing an NFA&lt;/strong&gt;&lt;br /&gt;&lt;br /&gt;From the syntax tree, a nondeterministic finite-state automaton is built. The algorithm is described &lt;a href="http://swtch.com/~rsc/regexp/regexp1.html"&gt;here&lt;/a&gt;, and there is nothing special about the implementation.&lt;br /&gt;&lt;br /&gt;Lookahead, lookbehind and anchors (like $ and ^) expand to entries in the syntax tree called tagged epsilons. When these are encountered in building the NFA, an epsilon transition is created which is annotated with this information.&lt;br /&gt;&lt;br /&gt;Negation is implemented here. If a negation syntax node is encountered, then the NFA builder constructs an NFA for the enclosed term, disambiguates it, converts it to a DFA, minimizes it, and attaches it back to the larger NFA that is being constructed.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Disambiguation&lt;/strong&gt;&lt;br /&gt;&lt;br /&gt;As I previously &lt;a href="http://useless-factor.blogspot.com/2009/02/regular-languages-with-large-alphabets.html"&gt;described&lt;/a&gt;, since the implementation doesn't iterate over every element of the alphabet, there needs to be a procedure to make transitions over characters have disjoint labels. 
Transitions are labeled by sets, and the output from creating an NFA might have intersecting outgoing sets from a state.&lt;br /&gt;&lt;br /&gt;The best way I've thought of doing this is to get all of the intersections of &lt;em&gt;all&lt;/em&gt; of the edge labels, basically forming a Venn diagram. This is, unfortunately, exponential time and space to do. But I see no way of avoiding it when compiling a regular expression like &lt;code&gt;/\p{letter}a|[0-9a-z]b|\p{script=latin}c|&lt;/code&gt;...&lt;code&gt;/&lt;/code&gt; where there are a large number of incomparable character classes used.&lt;br /&gt;&lt;br /&gt;I implemented a small optimization for this: numbers (ie literal characters) are set aside at the beginning and treated specially, so work isn't wasted intersecting them with other classes. The complexity of the algorithm stays exponential, but instead of being exponential in the total number of character classes in the regexp, it becomes exponential in just the non-literal classes.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Constructing a DFA&lt;/strong&gt;&lt;br /&gt;&lt;br /&gt;This is also a standard algorithm. My only modification is to support the tagged epsilon transitions created by lookaround and anchors. I described the modification &lt;a href="http://useless-factor.blogspot.com/2009/03/naive-lookaround-in-dfa.html"&gt;in a previous blog post&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Minimization&lt;/strong&gt;&lt;br /&gt;&lt;br /&gt;Next, the resulting DFA is minimized. I wrote about regexp minimization &lt;a href="http://useless-factor.blogspot.com/2009/02/dfa-minimization.html"&gt;before&lt;/a&gt;. The algorithm had to be modified slightly to allow for the conditional transitions introduced by processing lookaround in the previous step.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Compilation to Factor&lt;/strong&gt;&lt;br /&gt;&lt;br /&gt;At this point, we have a nice minimal DFA with disjoint outward transitions. 
Translating it into Factor code is actually quite easy. For each state, we make a gensym. The gensym takes as arguments a string and an index. If the index is at the end of the string, the word returns a boolean, indicating whether the current state is an accepting state. If the index is not at the end of the string, the current character is found, and the word figures out which transition to take. A transition is taken by incrementing the index and then making a tail call to another state word.&lt;br /&gt;&lt;br /&gt;The strategy for finding the right transition is somewhat complicated. First, the literal transitions (over constant characters) are partitioned out from the non-literal transitions. The literal transitions are formed into a case statement, where the default case handles non-literal transitions.&lt;br /&gt;&lt;br /&gt;Non-literal transitions are all boolean expressions, built with the class algebra described below. They have a number of logic variables (classes of characters). So we can build a truth table over the logic variables, and test each condition exactly once to figure out which transition to take. For example, in the regexp &lt;code&gt;/\p{script=latin}b|\p{lower}c/&lt;/code&gt;, the DFA will have three transitions from the start: one over characters which are Latin script and lower-cased, one over characters which are lower-cased but not in Latin script, and one over characters which are in Latin script but not lower-cased. Rather than having the compiled DFA check if the character is in the composite classes directly (which would duplicate cost, since it would be looked up multiple times if a character is lower-cased or Latin script), the compiler builds nested if statements that figure out the composite class while testing each property only once. 
This leads directly to the transition.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Class algebra&lt;/strong&gt;&lt;br /&gt;&lt;br /&gt;There is a system for simplifying the intersections built in disambiguation, as well as character class literals. It is built off simplifying logical expressions built with and, or, not. The things contained are true (the whole set of characters), false (the empty set), and all possible character classes.&lt;br /&gt;&lt;br /&gt;There are constructors for these three logical operations, and then a few simple tactics are used to reduce them. Reducing the expression to simplest form is equivalent to circuit minimization. A friend told me that this is on the second level of the polynomial hierarchy. So I'll just live with the heuristics.&lt;br /&gt;&lt;br /&gt;The not constructor is simple. If it's given true, it outputs false. If it's given false, it outputs true. If it's given a negation as input, it returns the contents. If it's given an and class, it uses De Morgan's law and negates each entry, returning an or. And vice versa.&lt;br /&gt;&lt;br /&gt;The and/or constructors are slightly more complicated. I will describe how the and constructor works; the or constructor can be easily derived using De Morgan's law. The input is a sequence of classes, and we want to get their intersection. First, if the input contains intersection (and) classes, these are flattened into the larger sequence. Next, the sequence is sorted into categories: integers, negations of integers, simple classes (like the class digits), negations of those, union classes (ors), and booleans. Delete true from the booleans list, if it's there, as it cannot affect the outcome. If there is a false in the booleans list, then the answer is false. If there is more than one integer, the answer is immediately false. If there is exactly one integer, then the answer is that integer if it is contained in all of the other classes, otherwise false. 
Now, we are working with a sequence which does not have integer literals, or true or false. If there is a simple class and a not-simple class for the same class, we know that their intersection is false, so the entire expression is false. We can remove not-integers where the integer is contained in an existing not-simple class, as these are redundant. Finally, the or classes within the and class can be simplified in the case where they have logic variables overlapping with other things in the and class: these can all be substituted with true. For example, if you have and(lowercase, or(lowercase, latin)), this can be simplified to lowercase. This is because true is substituted for lowercase in the or expression, or(true, latin) simplifies to true, and and(lowercase, true) is just lowercase.&lt;br /&gt;&lt;br /&gt;Previously, class algebra was not this strong in what reductions it did. This caused problems. For example, nested negations (useful in implementing conjunction) would result in multiple nested disambiguations, which would cause a very fast blowup in the size of the DFA. Now, running disambiguation twice gives the same results as running it once. (In other words, disambiguation is idempotent.) At least I think so.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Conclusion&lt;/strong&gt;&lt;br /&gt;&lt;br /&gt;Implementing regular expressions took a lot more thinking than I expected, but I'm satisfied with the result so far. Unlike traditional regular expressions, where pathologies might make regexp matching slow, in this system pathologies make regexp compilation time slow. That seems more acceptable to me. Factor's extensible syntax allows me to make regexp literals, which compile before the program runs, even though regexps are a library rather than built in.&lt;br /&gt;&lt;br /&gt;If I were starting from scratch, I might instead use the algorithm of constructing a DFA directly from a regular expression using regexp derivatives. 
It's described in &lt;a href="http://www.ccs.neu.edu/home/turon/re-deriv.pdf"&gt;this paper [PDF]&lt;/a&gt;. I'm not sure how performance in practice compares, but it implements negation and conjunction in a much cleaner way, and basically eliminates the need for minimization. Most importantly, it allows for a variation of disambiguation which avoids exponential behavior in many cases.&lt;br /&gt;&lt;br /&gt;In a later blog post, I'll talk about the API that I expose for regexps.&lt;br /&gt;&lt;hr/&gt;&lt;br /&gt;&lt;small&gt;I haven't actually implemented &lt;code&gt;\p{script=foo}&lt;/code&gt; yet, but I will very soon.&lt;/small&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/273593670040001243-4369911903168028159?l=useless-factor.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://useless-factor.blogspot.com/feeds/4369911903168028159/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=273593670040001243&amp;postID=4369911903168028159' title='5 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/273593670040001243/posts/default/4369911903168028159'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/273593670040001243/posts/default/4369911903168028159'/><link rel='alternate' type='text/html' href='http://useless-factor.blogspot.com/2009/03/implementation-of-factors-regexp.html' title='The implementation of Factor&apos;s regexp library'/><author><name>Daniel Ehrenberg</name><uri>http://www.blogger.com/profile/00902922561603041049</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>5</thr:total></entry><entry><id>tag:blogger.com,1999:blog-273593670040001243.post-7313167227624788108</id><published>2009-03-08T18:22:00.000-07:00</published><updated>2009-03-08T19:10:47.544-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='encodings'/><title type='text'>Text encodings and API design</title><content type='html'>This blog post is about a problem that I haven't figured out the answer to: How should vendor extensions to encodings be exposed to programmers?&lt;br /&gt;&lt;br /&gt;It seems like basically all pre-Unicode text encodings have proprietary Microsoft extensions that become, in practice, the next version of the standard. Sometimes these are basically supersets, but sometimes the extensions are backwards-incompatible in particular tiny places. One example of this that should be familiar to Westerners is the distinction between Latin 1 (ISO 8859-1) and Windows 1252. The only difference is in the range 0x80-0x9F, where Latin 1 has a set of basically useless control characters.&lt;br /&gt;&lt;br /&gt;A similar situation exists with many East Asian character sets (eg. KS X 1001/1003 for Korean, JIS X 208 for Japanese). In these cases, the backslash (\ 0x5C) is mapped to a national currency symbol in the official standard. But in the Microsoft extension, 0x5C is mapped to backslash to maintain compatibility with ASCII, and another character represents the national currency symbol.&lt;br /&gt;&lt;br /&gt;Some websites mark themselves as ISO 8859-1, but are in fact encoded in Windows 1252, and many web browsers take this interpretation into account. Similarly, the Microsoft versions of East Asian text encodings are often used in contexts where the standard versions are declared. 
In some cases, the Microsoft versions are registered with IANA separately for use in internet protocols (eg Shift-JIS/Windows_31J, and Latin1/Windows-1252), but in other cases there is only one registered encoding (eg EUC-KR).&lt;br /&gt;&lt;br /&gt;So, there are two questions.&lt;br /&gt;&lt;ol&gt;&lt;li&gt;When interpreting an HTTP response, or receiving an email, where the encoding is declared in the header, should the text be interpreted with the standard encoding or the Microsoft extension?&lt;/li&gt;&lt;li&gt;In the encodings API for a programming language, if a particular encoding is used to read or write a file, should the standard be used or the Microsoft extension?&lt;/li&gt;&lt;/ol&gt;&lt;br /&gt;When I started reading about encodings, I assumed that everything could be done reasonably by following the standards precisely. Now I'm not sure what to do.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/273593670040001243-7313167227624788108?l=useless-factor.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://useless-factor.blogspot.com/feeds/7313167227624788108/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=273593670040001243&amp;postID=7313167227624788108' title='5 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/273593670040001243/posts/default/7313167227624788108'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/273593670040001243/posts/default/7313167227624788108'/><link rel='alternate' type='text/html' href='http://useless-factor.blogspot.com/2009/03/text-encodings-and-api-design.html' title='Text encodings and API design'/><author><name>Daniel Ehrenberg</name><uri>http://www.blogger.com/profile/00902922561603041049</uri><email>noreply@blogger.com</email><gd:image 
rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>5</thr:total></entry><entry><id>tag:blogger.com,1999:blog-273593670040001243.post-1215713541800298435</id><published>2009-03-05T15:45:00.000-08:00</published><updated>2009-03-05T16:47:25.580-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='parsing'/><title type='text'>Naive lookaround in a DFA</title><content type='html'>&lt;strong&gt;Note:&lt;/strong&gt; If you don't understand what's below, read &lt;a href="http://swtch.com/~rsc/regexp/regexp1.html"&gt;this&lt;/a&gt; for background on implementing regular expressions with DFAs.&lt;br /&gt;&lt;br /&gt;I just implemented lookahead and lookbehind in the Factor regexp module. This lets you write regular expressions like &lt;code&gt;/(?&lt;=foo)bar/&lt;/code&gt;, to match "bar" when it's preceded by "foo". I couldn't think of a better way to do it for the general case, so I just did it the naive way. There should be a better way, though, because lookahead and lookbehind don't extend the set of possible strings matched. This algorithm isn't so good because it makes things worse than linear time in the general case.&lt;br /&gt;&lt;br /&gt;For a lookahead or lookbehind clause, there is a little regular expression compiled. This regular expression annotates an epsilon transition. If I had an NFA interpreter, then the interpreter would just run the regular expression on the input string starting at the current point when it wants to know if it can use the epsilon transition. I'm using a DFA, so I needed to modify the subset construction to make this work.&lt;br /&gt;&lt;br /&gt;What needs to be modified is the procedure to get the epsilon closure of an NFA state. Instead of returning a set of NFA states, this procedure should return a set of states and the conditions for reaching each state. 
Usually, there will be no conditions, but sometimes there will be a conjunction or disjunction of lookarounds. It could be the conjunction in a regular expression like &lt;code&gt;/(?&lt;=foo)(?=...bar)baz/&lt;/code&gt;, and it could be the disjunction in a regular expression like &lt;code&gt;/((?&lt;=foo)|(?=...bar))bar/&lt;/code&gt;, and since I'm trying to be fully general, all of this is supported.&lt;br /&gt;&lt;br /&gt;The epsilon closure procedure is usually something like this*. Start with your state on a list, and look at all of the epsilon transitions outwards. Add these to your list. Now, if anything on your list has more epsilon transitions outwards, add these to your list. Keep going until there's nothing to add.&lt;br /&gt;&lt;br /&gt;With the modification: Start with your state on a list, with the associated information that there are no conditions that have to be met for it. Then, look at all of the outward epsilon transitions from all the states on your list. For each new transition that you're considering, the requirement to get from the original state to the target of the transition is to meet the requirement of the source, plus the requirement of the transition. If you already had a way to get to the target of the transition, then now there are two conditions, and either one can be used. Keep repeating this until examining epsilon transitions doesn't change anything.&lt;br /&gt;&lt;br /&gt;Now, a little post-processing can turn the result of this into a tree of nested conditionals, which can quickly tell you what states you can get to given which conditions are met. In the usual case, where there are no conditions, this tree is just a single leaf, a list of states. From here, transitions in the DFA go not to sets of states but to trees of conditions and states. The start state is also one of these trees.&lt;br /&gt;&lt;br /&gt;The DFA interpreter** needs a little bit of extra logic to test the conditions and traverse the tree. 
The interpreter gives the condition the input string and index, and the condition can do whatever it wants with that information. I've implemented anchors (eg. $ and ^) this way. They could be lookaround, but it's easier to just implement them directly as a predicate on the string and index.&lt;br /&gt;&lt;br /&gt;&lt;hr/&gt;&lt;br /&gt;*No, I don't implement it this way, because it would be a lot of repeated work, but this expresses the idea.&lt;br /&gt;** Actually, everything compiles down to Factor code, which is compiled with the optimizing compiler if you want.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/273593670040001243-1215713541800298435?l=useless-factor.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://useless-factor.blogspot.com/feeds/1215713541800298435/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=273593670040001243&amp;postID=1215713541800298435' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/273593670040001243/posts/default/1215713541800298435'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/273593670040001243/posts/default/1215713541800298435'/><link rel='alternate' type='text/html' href='http://useless-factor.blogspot.com/2009/03/naive-lookaround-in-dfa.html' title='Naive lookaround in a DFA'/><author><name>Daniel Ehrenberg</name><uri>http://www.blogger.com/profile/00902922561603041049</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-273593670040001243.post-7708608601455882245</id><published>2009-02-24T20:37:00.000-08:00</published><updated>2010-05-26T19:16:35.299-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='theory'/><category scheme='http://www.blogger.com/atom/ns#' term='factor'/><category scheme='http://www.blogger.com/atom/ns#' term='pattern matching'/><title type='text'>Draft paper about inverse</title><content type='html'>I recently submitted an abstract about &lt;a href="http://useless-factor.blogspot.com/2007/06/concatenative-pattern-matching.html"&gt;inverse&lt;/a&gt; to &lt;a href="http://mics.sdsmt.edu/"&gt;an undergraduate computer science conference&lt;/a&gt;, and I just got word that it was accepted. I've written a draft paper, available &lt;a href="http://factorcode.org/littledan/match.pdf"&gt;on factorcode.org&lt;/a&gt;, and I'd like to hear your comments on it. I've written it for a completely general audience, not assuming knowledge of functional programming or stack-based programming languages. I didn't get into formal semantics or anything like that, partly because it'd make the paper too long and partly because I don't completely understand the point.&lt;br /&gt;&lt;br /&gt;Speaking of undergraduate research, if you're interested in doing research in practical computer science this summer, I recommend applying to &lt;a href="http://www.cs.hmc.edu/reu/"&gt;Harvey Mudd's CS REU&lt;/a&gt;. Applications are due Sunday. But only if you're a US citizen or permanent resident! The funding agencies don't like foreigners. (This is actually a serious problem for several of my friends, who have difficulty finding the same summer academic and internship opportunities because they are actively discriminated against as international students. 
I hope that, one day, discrimination based on citizenship is as illegal as discrimination against racial minorities or women, but we are very far away from that today. Now I'm getting off-topic.)&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/273593670040001243-7708608601455882245?l=useless-factor.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://useless-factor.blogspot.com/feeds/7708608601455882245/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=273593670040001243&amp;postID=7708608601455882245' title='5 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/273593670040001243/posts/default/7708608601455882245'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/273593670040001243/posts/default/7708608601455882245'/><link rel='alternate' type='text/html' href='http://useless-factor.blogspot.com/2009/02/draft-paper-about-inverse.html' title='Draft paper about inverse'/><author><name>Daniel Ehrenberg</name><uri>http://www.blogger.com/profile/00902922561603041049</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>5</thr:total></entry><entry><id>tag:blogger.com,1999:blog-273593670040001243.post-1064790149624082448</id><published>2009-02-23T11:29:00.000-08:00</published><updated>2009-02-23T12:39:29.693-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='parsing'/><title type='text'>Regular languages with large alphabets</title><content type='html'>In implementing regular expressions, the formal theory of regular languages can be useful, especially when compiling regexps to DFAs. 
But the standard definition of a regular language is a little inconvenient when working with Unicode. In a standard DFA, a transition is over a single character. But for the range &lt;code&gt;/[\0-\u0f423f]/&lt;/code&gt;, I don't really want to compile a DFA with a million transition edges!&lt;br /&gt;&lt;br /&gt;A slightly modified formalism is useful here, where transitions in NFAs and DFAs happen over sets of letters, rather than individual letters. Then, things like ranges and character classes (eg &lt;code&gt;\p{Lower}&lt;/code&gt;) are represented as sets which annotate transition edges.&lt;br /&gt;&lt;br /&gt;It's perfectly straightforward to translate such a regular expression into an NFA with the standard construction, and the typical algorithm for executing an NFA by running all possible states in parallel works. For DFAs, though, there's a small complication: we have to remove ambiguity about where to go next.&lt;br /&gt;&lt;br /&gt;Say you have the regexp &lt;code&gt;/\p{InLatin}b|\p{Lower}c/&lt;/code&gt; This matches strings like "ab", "ac", "Ab", "πc" but not "Ac" or "πb". The simple textbook algorithm for regular expressions would have me expand &lt;code&gt;/\p{InLatin}/&lt;/code&gt; out to &lt;code&gt;/a|A|b|B|.../&lt;/code&gt;, and expand &lt;code&gt;/\p{Lower}/&lt;/code&gt; to &lt;code&gt;/a|π|.../&lt;/code&gt;. This strategy would work, but the size of the resulting NFA and DFA would be gigantic.&lt;br /&gt;&lt;br /&gt;What we actually want to do is change the two outward transitions from the start state--transitions over the sets &lt;code&gt;InLatin&lt;/code&gt; and &lt;code&gt;Lower&lt;/code&gt;--to transitions over &lt;code&gt;InLatin ∩ Lower&lt;/code&gt;, &lt;code&gt;InLatin - Lower&lt;/code&gt; and &lt;code&gt;Lower - InLatin&lt;/code&gt;. Since these are disjoint, there's no ambiguity about which one to take. 
In general, for a state with n outward transitions, you have to look at 2&lt;sup&gt;n&lt;/sup&gt; possibilities, since for each subset, you have to make a transition for characters which are in each of those transition groups, but not in any of the others.&lt;br /&gt;&lt;br /&gt;Implemented naively, this would make the size of DFAs blow up. For the regexp &lt;code&gt;/ab|cd/&lt;/code&gt;, you'd have a transition on characters that are a but not c, characters that are c but not a, and characters that are both c and a. Fortunately, it's simple to work out a system which recognizes that a - c = a, c - a = c and c ∩ a = ∅. With this in place, the resulting DFA (which didn't have any ambiguity in the first place) is just like what it would be without the system, but the DFA for &lt;code&gt;ab|\p{Lower}c&lt;/code&gt; has two transitions from the start state: one over a, and one over lower cased letters that aren't a.&lt;br /&gt;&lt;br /&gt;I've implemented all of this in the Factor regexp library. If you want to see the code, it's in the main repository, in the &lt;code&gt;regexp&lt;/code&gt; branch.&lt;br /&gt;&lt;br /&gt;&lt;small&gt;&lt;strong&gt;PS.&lt;/strong&gt; When you're considering transitions over sets, it's possible to consider "regular" languages over an infinite alphabet. It might be convenient to think of Unicode as infinite, since it's so large. But it's easy to prove that for any such "regular" language, there is a string homomorphism to a finite-alphabet regular language where a string is in the original language if and only if its homomorphic image is in the smaller regular language. So, in this way, it's a mathematically boring formalism to study. Other people have studied regular language formalisms with infinite alphabets that actually do have more power--they have the power to compare characters for equality in certain contexts. 
But that's completely different.&lt;/small&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/273593670040001243-1064790149624082448?l=useless-factor.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://useless-factor.blogspot.com/feeds/1064790149624082448/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=273593670040001243&amp;postID=1064790149624082448' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/273593670040001243/posts/default/1064790149624082448'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/273593670040001243/posts/default/1064790149624082448'/><link rel='alternate' type='text/html' href='http://useless-factor.blogspot.com/2009/02/regular-languages-with-large-alphabets.html' title='Regular languages with large alphabets'/><author><name>Daniel Ehrenberg</name><uri>http://www.blogger.com/profile/00902922561603041049</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-273593670040001243.post-2031387665487799540</id><published>2009-02-18T23:11:00.000-08:00</published><updated>2009-02-19T00:10:24.351-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='factor'/><category scheme='http://www.blogger.com/atom/ns#' term='parsing'/><title type='text'>DFA minimization</title><content type='html'>I've been working on the &lt;code&gt;regexp&lt;/code&gt; vocabulary by Doug Coleman, cleaning it up and adding new features with the goal of making it usable for the &lt;code&gt;xmode&lt;/code&gt; syntax highlighter. 
This means compatibility with most of Java's regexp syntax.&lt;br /&gt;&lt;br /&gt;The implementation strategy Doug and I are using is standard textbook stuff, the way Lex works: from the regular expression, a nondeterministic finite automaton (NFA) is constructed, and this is converted to a deterministic finite automaton (DFA). This can be used to efficiently match the string in linear time. Russ Cox wrote &lt;a href="http://swtch.com/~rsc/regexp/regexp1.html"&gt;a good introduction to all of this&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;There's a limitation: backreferences (like &lt;code&gt;/(.*)\1/&lt;/code&gt;) are not supported, since they're incompatible with the NFA/DFA model. But there's no good way to implement backreferences, anyway: parsing with them is NP-complete. Perl uses a backtracking model where backreferences are easy to implement, but in certain cases the backtracking gets out of hand and performance is worse than linear.&lt;br /&gt;&lt;br /&gt;Today, I worked out the code to minimize DFAs. The DFA minimization algorithm is really nice, so I thought I would share it with you. The implementation is just 65 lines of Factor code, which is &lt;a href="http://paste.factorcode.org/paste?id=449"&gt;in the Factor pastebin&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;The issue is that sometimes, naively constructing a DFA gives you duplicate states which are really the same thing. For example, if you have two final states which each have no outward transitions, they can be consolidated. If you have the regular expression &lt;code&gt;/ac|bc/&lt;/code&gt;, then there's a DFA for this in just 3 states. The naive way, however, would give you 5 states.&lt;br /&gt;&lt;br /&gt;What we want is to partition the states into sets of states that are all the same, and then use only one of these. In mathematical language, we want to create an equivalence relation and quotient out by it. 
Here's how we figure out which states are the same.&lt;br /&gt;&lt;ol&gt;&lt;li&gt;Assume that all states are the same*.&lt;/li&gt;&lt;li&gt;Separate the final states from the non-final states.&lt;/li&gt;&lt;li&gt;Repeat the following until it doesn't make a change in the partitioning:&lt;ul&gt;&lt;li&gt;Separate any two states which have a transition with the same label to two states which are separated.&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;/ol&gt;&lt;br /&gt;Once this doesn't change anything, the states have been divided into the sets that we want.&lt;br /&gt;&lt;br /&gt;Interestingly, this algorithm is pretty similar to optimistic global value numbering, a compiler optimization. Optimistic value numbering is a technique for eliminating duplicated computations. It works by assuming that all registers hold the same value, and then iterating until a fixpoint is reached, separating registers which are actually different from each other. When faced with loops, this can catch more than so-called pessimistic value numbering, which first assumes that everything is different, until it can prove that two registers hold the same value.&lt;br /&gt;&lt;br /&gt;In the simple, controlled environment of a DFA, it's been proved that this minimization actually produces the smallest possible DFA matching the same set of strings. It's even been shown that, for a given regular language, the minimal DFA recognizing it is unique up to isomorphism. Such a nice analysis isn't possible in the more complicated world of compiler optimizations, however, where better optimizations than GVN exist.&lt;br /&gt;&lt;hr/&gt;&lt;br /&gt;*Actually, we assume that any two states with the same labeled outward transitions are the same. For example, if a state A goes to state B on character x and state C on character y, and state D goes to E on either x or y, then we'll assume at the beginning that A and D are the same. 
This is a simple modification I came up with to deal with the effectively infinite alphabet of Unicode, since it would be impossible to compare the transition on each Unicode code point.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/273593670040001243-2031387665487799540?l=useless-factor.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://useless-factor.blogspot.com/feeds/2031387665487799540/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=273593670040001243&amp;postID=2031387665487799540' title='5 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/273593670040001243/posts/default/2031387665487799540'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/273593670040001243/posts/default/2031387665487799540'/><link rel='alternate' type='text/html' href='http://useless-factor.blogspot.com/2009/02/dfa-minimization.html' title='DFA minimization'/><author><name>Daniel Ehrenberg</name><uri>http://www.blogger.com/profile/00902922561603041049</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>5</thr:total></entry><entry><id>tag:blogger.com,1999:blog-273593670040001243.post-7555961450826980273</id><published>2009-02-05T11:12:00.000-08:00</published><updated>2009-02-05T13:15:26.304-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='rant'/><category scheme='http://www.blogger.com/atom/ns#' term='factor'/><category scheme='http://www.blogger.com/atom/ns#' term='xml'/><title type='text'>XML pattern matching</title><content type='html'>I've implemented syntax for pattern matching on interpolated XML literals. 
This is a Scala-inspired feature which may or may not be useful, but is definitely cool-looking. Here's a sample of code:&lt;br /&gt;&lt;pre&gt;: dispatch ( xml -- string )&lt;br /&gt;    {&lt;br /&gt;        { [ [XML &amp;lt;a&gt;&amp;lt;-&gt;&amp;lt;/a&gt; XML] ] [ "a" prepend ] }&lt;br /&gt;        { [ [XML &amp;lt;b&gt;&amp;lt;-&gt;&amp;lt;/b&gt; XML] ] [ "b" prepend ] }&lt;br /&gt;        { [ [XML &amp;lt;b val='yes'/&gt; XML] ] [ "yes" ] }&lt;br /&gt;        { [ [XML &amp;lt;b val=&amp;lt;-&gt;/&gt; XML] ] [ "no" prepend ] }&lt;br /&gt;    } switch ;&lt;/pre&gt;&lt;br /&gt;And here are some examples of what it does:&lt;br /&gt;&lt;pre&gt;[XML &amp;lt;a&gt;pple&amp;lt;/a&gt; XML] dispatch&lt;br /&gt;        =&gt; "apple"&lt;br /&gt;[XML &amp;lt;b&gt;anana&amp;lt;/b&gt; XML] dispatch&lt;br /&gt;        =&gt; "banana"&lt;br /&gt;[XML &amp;lt;b val="yes"/&gt; XML] dispatch&lt;br /&gt;        =&gt; "yes"&lt;br /&gt;[XML &amp;lt;b val="where"/&gt; XML] dispatch&lt;br /&gt;        =&gt; "nowhere"&lt;/pre&gt;&lt;br /&gt;The pattern matching here is based on my &lt;code&gt;inverse&lt;/code&gt; library. Hopefully you get the high-level idea of how XML pattern matching works. A caveat is that namespaces are ignored, as in Scala's XML pattern matching, and I haven't thought of a good way to incorporate namespaces.&lt;br /&gt;&lt;br /&gt;Jeff Atwood recently wrote a &lt;a href="http://www.codinghorror.com/blog/archives/001223.html"&gt;blog post&lt;/a&gt; about XML literal syntax*, and how it's a big help. In part, I agree with his statement that it lets you write cleaner code to have XML literals mixed in with everything. That's why I implemented them in Factor.&lt;br /&gt;&lt;br /&gt;But I think it's insane to have them built in the way that Scala and, apparently, VB.NET have. Just think if the designers of Java had this idea 15 years ago. 
They would have made SGML literal syntax rather than XML, and there would be no way to remove this because of backwards compatibility concerns. Every implementation of Java would have to contain a full-featured SGML parser. Generating HTML would be easier and allow for very pretty code, though.&lt;br /&gt;&lt;br /&gt;XML won't be so prominent forever. It'll always be there, of course, just like SGML is still there in non-XML HTML and DocBook files which refuse to die. But I wouldn't be surprised if, in 20 years, what we want is a literal syntax for something other than XML. We'll still be using a lot of the same programming languages that exist today, though.&lt;br /&gt;&lt;br /&gt;Factor's XML literal syntax and XML pattern matching are just libraries, and I wouldn't want it any other way. Factor's parser needed no adjustments at all to allow for XML literals. If an XML literal is written when the vocabulary &lt;code&gt;xml.literals&lt;/code&gt; isn't in scope, it's a syntax error. Sure, you need delimiters &lt;code&gt;[XML XML]&lt;/code&gt; around the XML literals, but this is a small price to pay.&lt;br /&gt;&lt;br /&gt;In 20 years, if XML isn't useful any longer, then Factor programmers will be just as well off as they are now. If there's a new data format, it'll just be a new library to download, and you'll have literal syntax for that data format. If you were using VB.NET, though, you'd have to upgrade to the most recent version of the language, with a parser that's been vastly complicated.&lt;br /&gt;&lt;hr/&gt;&lt;br /&gt;*He actually talked about HTML literal syntax, but that's such a ridiculous idea that I know he just mistyped. There's no way to tell where an HTML fragment ends, for one, and the HTML DTD would have to be part of the core grammar of the language. You would need delimiters around where the HTML starts and ends, and none of his code fragments have those delimiters. The X key is right next to the letters H and T, so it's an honest typo. 
His pseudocode sample just before the VB.NET fragment must have also been a simple mistake, as it would seem to imply that either XML literals get printed immediately or that there is some implicit way of collecting them.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/273593670040001243-7555961450826980273?l=useless-factor.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://useless-factor.blogspot.com/feeds/7555961450826980273/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=273593670040001243&amp;postID=7555961450826980273' title='4 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/273593670040001243/posts/default/7555961450826980273'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/273593670040001243/posts/default/7555961450826980273'/><link rel='alternate' type='text/html' href='http://useless-factor.blogspot.com/2009/02/xml-pattern-matching.html' title='XML pattern matching'/><author><name>Daniel Ehrenberg</name><uri>http://www.blogger.com/profile/00902922561603041049</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>4</thr:total></entry><entry><id>tag:blogger.com,1999:blog-273593670040001243.post-3580034269687959527</id><published>2009-01-25T22:19:00.000-08:00</published><updated>2009-01-25T22:42:42.528-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='factor'/><category scheme='http://www.blogger.com/atom/ns#' term='xml'/><title type='text'>Factor supports XML literal syntax</title><content type='html'>Factor can now express XML as literals in code. 
There's a new library, &lt;code&gt;xml.interpolate&lt;/code&gt;, which lets you create an XML chunk or document by interpolating values, either with locals or with a fry-like syntax. Here's a taste, from the &lt;code&gt;syndication&lt;/code&gt; vocab, to create an Atom file:&lt;pre&gt;: feed&gt;xml ( feed -- xml )&lt;br /&gt;    [ title&gt;&gt; ]&lt;br /&gt;    [ url&gt;&gt; present ]&lt;br /&gt;    [ entries&gt;&gt; [ entry&gt;xml ] map ] tri&lt;br /&gt;    &amp;lt;XML&lt;br /&gt;        &amp;lt;feed xmlns="http://www.w3.org/2005/Atom"&gt;&lt;br /&gt;            &amp;lt;title&gt;&amp;lt;-&gt;&amp;lt;/title&gt;&lt;br /&gt;            &amp;lt;link href=&amp;lt;-&gt; /&gt;&lt;br /&gt;            &amp;lt;-&gt;&lt;br /&gt;        &amp;lt;/feed&gt; &lt;br /&gt;    XML&gt; ;&lt;/pre&gt;This could also be written with locals:&lt;pre&gt;:: feed&gt;xml ( feed -- xml )&lt;br /&gt;    feed title&gt;&gt; :&gt; title&lt;br /&gt;    feed url&gt;&gt; present :&gt; url&lt;br /&gt;    feed entries&gt;&gt; [ entry&gt;xml ] map :&gt; entries&lt;br /&gt;    &amp;lt;XML&lt;br /&gt;        &amp;lt;feed xmlns="http://www.w3.org/2005/Atom"&gt;&lt;br /&gt;            &amp;lt;title&gt;&amp;lt;-title-&gt;&amp;lt;/title&gt;&lt;br /&gt;            &amp;lt;link href=&amp;lt;-url-&gt; /&gt;&lt;br /&gt;            &amp;lt;-entries-&gt; &lt;br /&gt;        &amp;lt;/feed&gt;&lt;br /&gt;    XML&gt; ;&lt;/pre&gt;Here's an example with more complicated logic:&lt;br /&gt;&lt;pre&gt;"one two three" " " split&lt;br /&gt;[ [XML &amp;lt;item&gt;&amp;lt;-&gt;&amp;lt;/item&gt; XML] ] map&lt;br /&gt;&amp;lt;XML &amp;lt;doc&gt;&amp;lt;-&gt;&amp;lt;/doc&gt; XML&gt;&lt;/pre&gt;whose prettyprinted output (using &lt;code&gt;xml.writer:pprint-xml&lt;/code&gt;) is&lt;br /&gt;&lt;pre&gt;&amp;lt;?xml version="1.0" encoding="UTF-8"?&gt;&lt;br /&gt;&amp;lt;doc&gt;&lt;br /&gt;  &amp;lt;item&gt;&lt;br /&gt;    one&lt;br /&gt;  &amp;lt;/item&gt;&lt;br /&gt;  &amp;lt;item&gt;&lt;br /&gt;    two&lt;br /&gt;  
&amp;lt;/item&gt;&lt;br /&gt;  &amp;lt;item&gt;&lt;br /&gt;    three&lt;br /&gt;  &amp;lt;/item&gt;&lt;br /&gt;&amp;lt;/doc&gt;&lt;/pre&gt;The word &lt;code&gt;&amp;lt;XML&lt;/code&gt; starts a literal XML document, and &lt;code&gt;[XML&lt;/code&gt; starts a literal XML chunk. (A document has a prolog and exactly one tag, whereas a chunk can be any kind of snippet, as long as tags are balanced.) The syntax for splicing things in is using a tag like &lt;code&gt;&amp;lt;-foo-&gt;&lt;/code&gt;. This syntax is a strict superset of XML, as a tag name in XML 1.0 is not allowed to start with &lt;code&gt;-&lt;/code&gt;.&lt;br /&gt;&lt;br /&gt;It took me just an evening to hack this up. It's less than 15 lines of code modifying the XML parser, and all the interpolation is less than 75 lines of code. Best of all, unlike Scala's XML literals, this doesn't affect the core of the language or make anything else more complicated. It's just nicely tucked away in its own little vocabulary. I plan on replacing usages of &lt;code&gt;html.components&lt;/code&gt; with this, once I make some tweaks.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/273593670040001243-3580034269687959527?l=useless-factor.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://useless-factor.blogspot.com/feeds/3580034269687959527/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=273593670040001243&amp;postID=3580034269687959527' title='7 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/273593670040001243/posts/default/3580034269687959527'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/273593670040001243/posts/default/3580034269687959527'/><link rel='alternate' type='text/html' 
href='http://useless-factor.blogspot.com/2009/01/factor-supports-xml-literal-syntax.html' title='Factor supports XML literal syntax'/><author><name>Daniel Ehrenberg</name><uri>http://www.blogger.com/profile/00902922561603041049</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>7</thr:total></entry><entry><id>tag:blogger.com,1999:blog-273593670040001243.post-1421830288703297336</id><published>2009-01-17T13:02:00.000-08:00</published><updated>2009-01-17T14:22:55.518-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='garbage collection'/><title type='text'>Quote of the day</title><content type='html'>&lt;blockquote&gt;&lt;em&gt;It should be obvious that while garbage collection (or compaction) is going on, the execution of the user's program must be suspended. Hence out of common courtesy to the user, we must provide for some notification--perhaps a flashing message in the corner of the screen--that garbage collection is in progress. Otherwise when the program stops running, the user may suspect a bug, and an inexperienced user may panic.&lt;/em&gt;&lt;/blockquote&gt;&lt;br /&gt;&lt;div style="text-align:right"&gt;--&lt;em&gt;Introduction to Compiler Construction&lt;/em&gt; by Thomas Parsons, Chapter 8.2, "Runtime memory management"&lt;br /&gt;(The book spends about half a page on garbage collection, mostly to describe the mark-sweep algorithm.)&lt;/div&gt;&lt;br /&gt;&lt;br /&gt;With Lisp machines, it used to be that people would talk about things like running the garbage collector overnight, when the machine was not in use. Virtual memory in use was typically several times the size of the physical memory. 
Within its context, the quote is accurate (or at least 10 years earlier it was): running mark-sweep on mostly virtual memory with a slow disc will take a lot of time.&lt;br /&gt;&lt;br /&gt;I'm happy that this concern is very outdated. If it weren't outdated, Factor (or Java) could never be a reasonable choice of programming language. There are two main reasons for the improvement.&lt;br /&gt;&lt;ol&gt;&lt;li&gt;Memory size has increased. Programs that would have used tons of swap space 15 years ago now fit comfortably in RAM.&lt;/li&gt;&lt;li&gt;Algorithms have improved. Not only do we have generational garbage collection to improve throughput, but there are also incremental techniques to spread out the pause, and various methods to take advantage of multiple processors.&lt;/li&gt;&lt;/ol&gt;&lt;br /&gt;So now, the idea of a flashing message for garbage collection in progress is hilariously quaint. If I were going to write a compiler textbook, I'd spend a whole chapter on garbage collection. 
It is essential for any modern programming language, and a good algorithm is essential.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/273593670040001243-1421830288703297336?l=useless-factor.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://useless-factor.blogspot.com/feeds/1421830288703297336/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=273593670040001243&amp;postID=1421830288703297336' title='5 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/273593670040001243/posts/default/1421830288703297336'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/273593670040001243/posts/default/1421830288703297336'/><link rel='alternate' type='text/html' href='http://useless-factor.blogspot.com/2009/01/quote-of-day.html' title='Quote of the day'/><author><name>Daniel Ehrenberg</name><uri>http://www.blogger.com/profile/00902922561603041049</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>5</thr:total></entry><entry><id>tag:blogger.com,1999:blog-273593670040001243.post-3933423084419663081</id><published>2009-01-15T13:25:00.000-08:00</published><updated>2009-01-24T13:37:20.558-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='factor'/><category scheme='http://www.blogger.com/atom/ns#' term='unicode'/><category scheme='http://www.blogger.com/atom/ns#' term='xml'/><title type='text'>XML encoding auto-detection</title><content type='html'>The Factor XML parser now auto-detects the encodings of XML documents. This is implemented for all of the encodings that are implemented in Factor. 
To see how it's implemented, look at &lt;a href="http://www.w3.org/TR/REC-xml/#sec-guessing-no-ext-info"&gt;the XML standard&lt;/a&gt;, because it explains it much better than my blog post, which was below.&lt;br /&gt;&lt;br /&gt;&lt;strike&gt;I was mystified myself when I first read that XML documents can specify what encoding they are in, &lt;em&gt;in the document itself&lt;/em&gt;. The encoding, if it's not UTF-8, must be specified in the prolog like this:&lt;br /&gt;&lt;pre&gt;&amp;lt;?xml version="1.0" encoding="ISO-8859-1"?&gt;&lt;/pre&gt;&lt;br /&gt;The idea of the algorithm is simple. It just goes by cases. First, check for a byte order mark (BOM), which would indicate UTF-16 and a particular endianness, or UTF-8. If there's no BOM, then the first character must be &amp;lt; or whitespace. If it's &amp;lt;, we can differentiate between UTF-16BE, UTF-16LE (without BOMs) and an 8-bit encoding. If it's one of the first two, we can tell by the fact that there's a null byte before or after the &amp;lt;. If it's an 8-bit encoding, we can be sure that there won't be any non-ASCII in the prolog, so just read the prolog as if it's UTF-8, and if an encoding is declared, use that.&lt;br /&gt;&lt;br /&gt;To implement it, I just read byte by byte and have a case statement for each level. After just two octets, it's possible to differentiate between UTF-8, UTF-16 (with a BOM, for both endiannesses), UTF-16BE and UTF-16LE. A similar process could also identify UTF-32 and friends after 4 octets. In my implementation, I had to do a little bit of hacking inside the XML code itself to get this integrated properly. All together, it's about 40 or 50 lines of code. It's available now in the Factor git repository.&lt;br /&gt;&lt;br /&gt;[&lt;strong&gt;Update&lt;/strong&gt;: Thanks for pointing out my error, Subbu Allamaraju. 
Fixed a typo, see comments.]&lt;/strike&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/273593670040001243-3933423084419663081?l=useless-factor.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://useless-factor.blogspot.com/feeds/3933423084419663081/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=273593670040001243&amp;postID=3933423084419663081' title='6 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/273593670040001243/posts/default/3933423084419663081'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/273593670040001243/posts/default/3933423084419663081'/><link rel='alternate' type='text/html' href='http://useless-factor.blogspot.com/2009/01/xml-encoding-auto-detection.html' title='XML encoding auto-detection'/><author><name>Daniel Ehrenberg</name><uri>http://www.blogger.com/profile/00902922561603041049</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>6</thr:total></entry><entry><id>tag:blogger.com,1999:blog-273593670040001243.post-4909877039539870216</id><published>2009-01-08T08:21:00.000-08:00</published><updated>2009-01-08T08:52:33.482-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='factor'/><category scheme='http://www.blogger.com/atom/ns#' term='unicode'/><category scheme='http://www.blogger.com/atom/ns#' term='xml'/><category scheme='http://www.blogger.com/atom/ns#' term='garbage collection'/><title type='text'>I'm back</title><content type='html'>Hi again! I've been gone for the past few months because in the fall I was away in Budapest studying math, and it was really hard. 
Now, I've decided to take three months off from school to work on Factor full-time. My plan is to first work on finishing up Unicode and XML, and then try to improve Factor's garbage collector.&lt;br /&gt;&lt;br /&gt;For Unicode, over the past couple of days, I fixed bugs in normalization and grapheme breaking, and I implemented word breaking. This was all made a lot easier by the test suites that Unicode includes, and these are now used as unit tests. I also wrote docs for everything. I still need to optimize everything and clean up the code, though. There are a lot more algorithms that I could implement, and plan on implementing eventually, but I'm just going to do this for now.&lt;br /&gt;&lt;br /&gt;For XML, I plan on hooking up the XML conformance test suite and fixing any conformance issues that come up, for starters. I've been terribly informal about testing in the past, and I'll try to change this. Instead of optimizing the current XML code directly, I plan on replacing it with a generated lexer and parser: I'll try to make a system like lex/yacc in Factor. Previously, Chris Double has made some pretty cool stuff for parsing with parser combinators and parsing expression grammars, but these both have efficiency problems. I know that a traditional system like lex/yacc can solve the problem of parsing XML, and while it's not as flexible as PEGs, it still might be possible to add high-level OMeta-like features. Parsing is something that I don't know a lot about, so I'll be doing some reading for this.&lt;br /&gt;&lt;br /&gt;For garbage collection, I want to actually implement something after studying it for so long! I plan on first making Factor's GC generational mark-sweep, and investigating changes to the generational policy to reduce thrashing. Then, there are two things that I want to do: make it incremental, and make some kind of compaction. 
Maybe this will take the form of mark-copy, or maybe it'll just be an incremental mark-sweep collector with occasional stop-the-world compaction (which could be disabled for certain applications). Either way, expect the footprint of Factor programs in the future to be reduced by almost half, and hopefully GC pauses will be shorter most of the time.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/273593670040001243-4909877039539870216?l=useless-factor.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://useless-factor.blogspot.com/feeds/4909877039539870216/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=273593670040001243&amp;postID=4909877039539870216' title='9 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/273593670040001243/posts/default/4909877039539870216'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/273593670040001243/posts/default/4909877039539870216'/><link rel='alternate' type='text/html' href='http://useless-factor.blogspot.com/2009/01/im-back.html' title='I&apos;m back'/><author><name>Daniel Ehrenberg</name><uri>http://www.blogger.com/profile/00902922561603041049</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>9</thr:total></entry><entry><id>tag:blogger.com,1999:blog-273593670040001243.post-328310167094601712</id><published>2008-10-31T01:09:00.000-07:00</published><updated>2008-11-01T07:12:50.136-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='garbage collection'/><title type='text'>Garbage collection camp post-mortem</title><content type='html'>This summer, I went to &lt;a href="http://www.hmc.edu/"&gt;Harvey 
Mudd&lt;/a&gt;'s &lt;a href="http://www.cs.hmc.edu/reu/"&gt;REU&lt;/a&gt; (an &lt;a href="http://www.nsf.gov/"&gt;NSF&lt;/a&gt;-funded research program) in computer science to study garbage collection. My advisor was &lt;a href="http://www.cs.hmc.edu/~oneill/"&gt;Melissa O'Neill&lt;/a&gt;. I didn't really accomplish anything, but I guess I learned a lot.&lt;br /&gt;&lt;br /&gt;My focus was on the memory behavior of garbage collection algorithms. Simple analyses of GC algorithms, like those of most algorithms, assume a &lt;a href="http://en.wikipedia.org/wiki/RAM_model"&gt;random access machine&lt;/a&gt; (RAM) model: that all memory accesses have constant cost. In truth, different orders of access have different costs in a more or less predictable way, based on the way that memory is implemented.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Locality&lt;/strong&gt;&lt;br /&gt;&lt;br /&gt;I won't go into the details of the &lt;a href="http://en.wikipedia.org/wiki/Memory_hierarchy"&gt;memory hierarchy&lt;/a&gt;, the reason for this difference, in this post. A good but very long reference is &lt;a href="http://people.redhat.com/drepper/cpumemory.pdf"&gt;What Every Programmer Should Know about Memory [PDF]&lt;/a&gt; by Ulrich Drepper. It goes well beyond what every programmer needs to know. Basically, there are two properties which a program should have to use the memory hierarchy efficiently: spatial locality and temporal locality.&lt;br /&gt;&lt;br /&gt;If you've just accessed a particular location of memory, it is fast to access it again if you do it soon enough. A program which accesses the same locations in a pattern to maximize this is said to have good temporal locality. This isn't something that can be easily changed in programs, most of the time.&lt;br /&gt;&lt;br /&gt;A more flexible property is spatial locality. If you've just accessed a particular piece of memory, it is also fast to access nearby locations in memory. 
Programs that take advantage of this have good spatial locality. There are many ways to modify programs to take advantage of this property, so this is what I and most other people are talking about when they talk about improving programs' locality.&lt;br /&gt;&lt;br /&gt;Memory is implemented this way because, naturally, most programs have pretty good spatial and temporal locality. Only a small percentage of memory reads and writes actually interact with the main memory, and usually faster mechanisms can be used. But, still, even though most reads and writes use this faster mechanism, a good portion of programs' running time is taken up by the slower method. Improvements on this could potentially have a very significant effect on performance.&lt;br /&gt;&lt;br /&gt;A note: sometimes, people refer to giving programs "good cache behavior." The CPU cache is one part of the memory hierarchy that gives rise to some of these properties of spatial and temporal locality.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Copying orders&lt;/strong&gt;&lt;br /&gt;&lt;br /&gt;There are a couple of ways that the garbage collector can improve locality. One way is that the traversal that the GC does can be optimized for good spatial locality. Another way is that the garbage collector can reorganize the data during copying such that the mutator has good spatial locality. Both of these have been researched in the past, to varying degrees.&lt;br /&gt;&lt;br /&gt;My research was on the second strategy. There are a few simple copying orders that we already know about: breadth-first copying, depth-first copying, sliding mark-compact (which leaves things in the exact order of their creation time), and the control: not copying. My advisor for the summer noticed that, on some example graphs, it appeared to improve locality to put things in a sort of "topological sort" order (with cycles arbitrarily broken). 
There's also a technique called hierarchical decomposition, which attempts to decompose a tree into a number of subtrees with as few edges between them as possible. (That's what we want! But it might not be the best if we don't have trees.) For a little while, I was even trying to figure out the "optimal" compaction using a genetic algorithm, but it turned out to be extremely computationally expensive.&lt;br /&gt;&lt;br /&gt;If I were less lazy, this would be the point where I draw a graph and show you the copying orders indicated by each of these. Most of it is described in the Jones and Lins book, &lt;em&gt;Garbage Collection Algorithms&lt;/em&gt;.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;The results&lt;/strong&gt;&lt;br /&gt;&lt;br /&gt;My goal was to implement all of these and collect empirical data on their performance. I implemented several of them, but I found out very close to the end of the summer that, in fact, the system I was using (Jikes RVM) puts data in a different order than I had assumed. Apparently, it uses some kind of hybrid between breadth-first and depth-first order that one paper calls "approximate depth-first order", but my code had been based on the assumption that it used depth-first order, and I couldn't figure out how to fix it.&lt;br /&gt;&lt;br /&gt;Nevertheless, I collected a bunch of data on the performance of the DaCapo benchmark suite on non-generational copying (in approximate depth-first order), mark-compact and mark-sweep. I used perfctr to measure the cache misses that the mutator encountered. 
I expected that mark-compact would have the fewest misses, and then copying and then mark-sweep.&lt;br /&gt;&lt;br /&gt;I assumed this because many papers I read said things like, "It is widely assumed that sliding mark-compact is the optimal compaction order, so we'll assume without further comment that that's true, and that's all the justification we need to keep going with this design."&lt;br /&gt;&lt;br /&gt;&lt;em&gt;But&lt;/em&gt;, as it turns out, mark-compact does worse than copying collection, by a margin that's greater than published results for the difference between depth-first and breadth-first copying. Well, it does this for a bunch of the DaCapo benchmarks, at least. And, umm, I didn't do all that many trials, and I set them up wrong so the running time was probably dominated by the JIT rather than the program actually running. But still, I got a new result! This contradicts what's been assumed in certain papers published just a few months earlier! And with a little more work and experience with Jikes, I could make the results rigorous and conference-worthy, right?&lt;br /&gt;&lt;br /&gt;Oh, wait, no. Someone at a conference &lt;em&gt;that June&lt;/em&gt; published a &lt;a href="http://cs.anu.edu.au/~Steve.Blackburn/pubs/papers/immix-pldi-2008.pdf"&gt;paper&lt;/a&gt; with a little graph in the introduction that proved the same thing. Except with a good methodology, and with a bunch of tiny, disjoint 99% confidence intervals. Oops! Better luck next time, I guess.&lt;br /&gt;&lt;br /&gt;&lt;hr/&gt;&lt;br /&gt;&lt;br /&gt;[A little note: I haven't been writing very much, or programming Factor very much, because this semester I'm in Budapest in an intense math study abroad program for American undergraduates. I'm taking 5 math classes, the hardest of which is a group theory course whose midterm I probably just failed this morning. 
(A random fun question: If G is a group and H is a subgroup, where |G:H| = 55, and a is an element of G such that the order of a is 89, prove that a is in H. One of the few questions I could actually answer on the test...) Most of my time is spent studying with people, and though I might physically have time for programming, I've been feeling a little burned out. But I'm planning on devoting a lot of time next semester to working on Factor's garbage collector, first making an incremental generational mark-sweep collector (by snapshot-at-the-beginning) and then deciding where to go from there.]&lt;br /&gt;&lt;br /&gt;[&lt;strong&gt;Update:&lt;/strong&gt; Thank you all for correcting my typos. Spatial, not spacial, and I meant that &lt;em&gt;spatial&lt;/em&gt; locality is more flexible, not temporal. Sorry if I confused anyone!]&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/273593670040001243-328310167094601712?l=useless-factor.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://useless-factor.blogspot.com/feeds/328310167094601712/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=273593670040001243&amp;postID=328310167094601712' title='7 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/273593670040001243/posts/default/328310167094601712'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/273593670040001243/posts/default/328310167094601712'/><link rel='alternate' type='text/html' href='http://useless-factor.blogspot.com/2008/10/garbage-collection-camp-post-mortem.html' title='Garbage collection camp post-mortem'/><author><name>Daniel Ehrenberg</name><uri>http://www.blogger.com/profile/00902922561603041049</uri><email>noreply@blogger.com</email><gd:image 
rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>7</thr:total></entry><entry><id>tag:blogger.com,1999:blog-273593670040001243.post-6774015044698849079</id><published>2008-08-03T16:25:00.000-07:00</published><updated>2008-08-03T19:26:03.912-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='garbage collection'/><title type='text'>New mark-compact algorithms</title><content type='html'>It'd be good for a garbage collector to have low space overhead, good cache behavior and low fragmentation. Low space overhead is useful wherever there's a limited amount of RAM. Eliminating fragmentation is important, since fragmentation reduces the usable amount of memory and can make allocation slower. Improving cache behavior of ordinary programs is always good for performance, and surprisingly a garbage collector can help. These three goals point to mark-compact garbage collection as a good way to go.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;The idea&lt;/strong&gt;&lt;br /&gt;&lt;br /&gt;First, mark all of the reachable data by traversing the heap. Then, slide the live (reachable) data as far over to one side as possible. This will (hopefully) free up some space on the other side, so we have a big contiguous free area. With this free space, we can allocate by just keeping a pointer that shows where free space begins, and incrementing it when allocation happens. Pointer-bumping allocation eliminates fragmentation and creates good cache behavior, since all new data is allocated close together. In a practical implementation, you'd want to use a generational system, where the oldest generation is maintained by mark-compact.&lt;br /&gt;&lt;br /&gt;Many recently developed GC algorithms fit into this class in a more general way. The heap is marked, and then through some means, the live data is clumped together. 
There are many different ways to do this, but to me they all feel fundamentally similar. The new algorithms have a greater space overhead, getting a speed improvement or incrementality (short pause times) as a tradeoff. The resulting overhead is much less than semispace (copying) collection, which takes more than two times as much space as the program uses.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Basic implementation&lt;/strong&gt;&lt;br /&gt;&lt;br /&gt;The marking part of this isn't too hard. You can perform depth-first search on the heap using an explicit stack. For locality reasons, a bit vector can be maintained off to the side to hold the information about what's been marked. If the mark stack overflows (a rare condition), additional pushes to the stack can be ignored and the heap can be rescanned afterwards for marked objects with unmarked children. Marking can be done in parallel with a load balancing technique known as work-stealing queues, and it can be done incrementally or concurrently using a certain write barrier.&lt;br /&gt;&lt;br /&gt;The hard part is the compacting. Traditional mark-compact algorithms, like the ones discussed in &lt;a href="http://www.amazon.com/Garbage-Collection-Algorithms-Automatic-Management/dp/0471941484"&gt;the Jones book on GC&lt;/a&gt;, try to use little or no additional space for the process. The easiest way to do it is to have an extra word per object which is used to store forwarding addresses. After marking, there are three passes over the heap. In the first pass, the live data is iterated over in address order, and new addresses for all objects are calculated and stored in the forwarding addresses. On the second pass of the heap, every pointer is replaced with the new forwarding address. 
On the third pass, objects are moved to this new forwarding address.&lt;br /&gt;&lt;br /&gt;Other methods (table-based methods and threading methods) eliminate the word per object overhead and reduce the number of passes to two, but they remain relatively slow. It's not obvious how to make them incremental, concurrent or parallel in the compacting stage, though incremental, concurrent and parallel marking are well-studied. The difficulty comes from the fact that, in all of these methods, pointers temporarily point to locations that aren't filled with the right thing. Unlike with concurrent/incremental copying, it's difficult to design even a read barrier for this. Additionally, sliding compaction is difficult to parallelize because of the way pointers are calculated from an accumulator, and the sliding itself must be done in address order, otherwise uncopied things could be overwritten.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;New ideas&lt;/strong&gt;&lt;br /&gt;&lt;br /&gt;There are a number of ways to go about changing mark-compact. Here are some of them that I've read papers about:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;Reserve some additional space off to the side to calculate forwarding addresses from the mark bitmap, so only one pass over the heap is required.&lt;/strong&gt; This is the idea developed in &lt;a href="http://www.cs.technion.ac.il/~erez/Papers/compressor-pldi.pdf"&gt;The Compressor [PDF]&lt;/a&gt;, by Haim Kermany. Rather than compact the heap onto itself, it is copied in page-size chunks. Since the to-pointers are calculated off to the side beforehand, pointers can be updated at the same time as copying happens, and only one pass over the heap is required. Once a page is copied, the old page can be reused, making space overhead bounded. It can even run in parallel, copying to multiple pages at once without needing any synchronization beyond handing out the pages to copy. 
The paper also provides a concurrent version of this.&lt;/li&gt;&lt;br /&gt;&lt;li&gt;&lt;strong&gt;Divide the heap up into chunks, copying over by chunks, keeping track of some of the pointers between chunks.&lt;/strong&gt; This is the approach taken by the &lt;a href="http://www.cs.utah.edu/~eeide/compilers/old/papers/oopsla03-sachindran.ps.gz"&gt;mark-copy [PS]&lt;/a&gt; algorithm by Narendran Sachindran, further developed as &lt;a href="http://www.cs.umass.edu/~emery/pubs/04-15.pdf"&gt;MC&lt;sup&gt;2&lt;/sup&gt; [PDF]&lt;/a&gt;. In order to allow the copying to go on independently, there are remembered sets that each chunk has of pointers going into the chunk. Then, copying can proceed without looking at anything else, with the remsets going and updating just those parts afterward. The remsets are built during marking. To reduce the size of these remsets, the chunks are ordered, and pointers only need to be maintained going in a certain direction. This algorithm can be made incremental because the mutator can run in between copying windows, as long as a write barrier maintains the remsets. The MC&lt;sup&gt;2&lt;/sup&gt; paper describes a specific implementation where space overhead is bounded to 12.5%.&lt;/li&gt;&lt;br /&gt;&lt;li&gt;&lt;strong&gt;Switch from pointers to something else temporarily during compaction to let things go incrementally&lt;/strong&gt; The grandiosely titled paper &lt;a href="http://portal.acm.org/citation.cfm?id=1296907.1296928&amp;coll=ACM&amp;dl=ACM"&gt;Mark-Sweep or Copying? A "Best of Both Worlds" Algorithm and a Hardware-Supported Real-Time Implementation&lt;/a&gt; [ACM] does this to implement a very concurrent mark-compact algorithm on specialized hardware using a read barrier. On one pass, all live objects are given "handles" in a separate part of the heap. These handles point back to the original object, and the original object points to the handle. Next, all pointers are replaced with handles. 
Objects are slid as far as possible towards the beginning of the heap, and their handles are updated appropriately. Finally, all objects have their pointers in the form of handles replaced with what those handles point to. This is given a concurrent implementation on specialized "object-oriented" hardware using multiple read barriers at different points in execution. During compaction, the read barrier makes sure that the mutator sees only handles, and afterwards, the read barrier makes sure that the mutator sees only pointers.&lt;/li&gt;&lt;br /&gt;&lt;li&gt;&lt;strong&gt;Use mark-sweep and a free list-based allocator, but occasionally compact areas that are extremely fragmented.&lt;/strong&gt; This is perhaps the most widely studied idea, developed in a number of places. I can't find a good reference, but it's pretty clear that you can use mark-sweep over several GC cycles until the heap gets a certain amount fragmented, and then compact the entire heap. This can be done by compacting every n cycles, or by compacting when the free list gets to a certain length. But it seems like it'd be better to compact just the areas that are particularly fragmented. This technique is used in the hard real-time &lt;a href="http://domino.research.ibm.com/comm/research_projects.nsf/pages/metronome.metronomegc.html"&gt;Metronome&lt;/a&gt; garbage collector as well as the mark-region &lt;a href="http://cs.anu.edu.au/techreports/2007/TR-CS-07-04.pdf"&gt;Immix [PDF]&lt;/a&gt; collector. The compaction in both of these is actually done by evacuation, that is, copying to a different place. This involves some space overhead. Any old way of copying will do, including any incremental, concurrent or parallel method. &lt;a href="http://portal.acm.org/citation.cfm?id=1296907.1296927"&gt;A&lt;/a&gt; &lt;a href="http://portal.acm.org/citation.cfm?id=1296907.1296927"&gt;few&lt;/a&gt; [both ACM, sorry] concurrent copying schemes have been developed especially for this purpose. 
All of these deserve a full explanation, because they're really interesting, but I'll leave that for another blog post. One possibility related to this would be reference counting (with something to collect cycles) with occasional compaction, but I've never seen anything written about this. In these systems, the allocator has to be more complicated than pointer bumping, but it can still be very efficient.&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;Some of these ideas also use sliding compaction order, but this isn't really a necessary trait of mark-compact GC. Earlier algorithms (the two finger algorithm, linearizing compaction) used different orders. Nevertheless, the creators of these algorithms often don't refer to their own work as "mark-compact" when it doesn't use this order. I think it makes sense to put them in one category, since they all share the quality that the heap is marked and then it is compacted, at least partly.&lt;br /&gt;&lt;br /&gt;When I tell people who know about CS that I've been trying to research garbage collection this summer, the reaction is often something like, "Hasn't that been pretty much solved?" There may be solutions, but interesting improvements in terms of time and space overhead are being developed all the time. All of the papers I've referenced are from the past decade.&lt;br /&gt;&lt;br /&gt;[&lt;strong&gt;A historical note&lt;/strong&gt;: Imagine if everywhere I said "compacting" I instead said "compactifying." &lt;a href="http://scholar.google.com/scholar?q=compactifying+garbage+collection&amp;hl=en&amp;lr=&amp;btnG=Search"&gt;People actually did this&lt;/a&gt; in the 70s and 80s in the GC literature. I'm very glad they came to their senses. 
&lt;strong&gt;An honesty note&lt;/strong&gt;: I don't actually understand the concurrent algorithms very well.]&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/273593670040001243-6774015044698849079?l=useless-factor.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://useless-factor.blogspot.com/feeds/6774015044698849079/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=273593670040001243&amp;postID=6774015044698849079' title='6 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/273593670040001243/posts/default/6774015044698849079'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/273593670040001243/posts/default/6774015044698849079'/><link rel='alternate' type='text/html' href='http://useless-factor.blogspot.com/2008/08/new-mark-compact-algorithms.html' title='New mark-compact algorithms'/><author><name>Daniel Ehrenberg</name><uri>http://www.blogger.com/profile/00902922561603041049</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>6</thr:total></entry><entry><id>tag:blogger.com,1999:blog-273593670040001243.post-2587164573099155792</id><published>2008-07-10T11:04:00.000-07:00</published><updated>2008-07-20T12:13:21.416-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='factor'/><category scheme='http://www.blogger.com/atom/ns#' term='data structures'/><title type='text'>Persistent heaps in Factor</title><content type='html'>Inspired by &lt;a href="http://sourceforge.net/mailarchive/forum.php?thread_name=806f58f20807051647y48802b9brcb55fe6a9a4b9418%40mail.gmail.com&amp;forum_name=factor-talk"&gt;an email&lt;/a&gt; where Slava proposed that 
Factor begin to use more persistent data structures, I implemented a functional &lt;a href="http://en.wikipedia.org/wiki/Priority_queue"&gt;priority queue&lt;/a&gt; (minheap). It's in the main Factor repository in &lt;a href="http://factorcode.org/responder/cgi/gitweb.cgi?p=factor.git;a=blob;f=extra/persistent-heaps/persistent-heaps.factor;hb=HEAD"&gt;extra/persistent-heaps&lt;/a&gt;. Strangely, the &lt;a href="http://www.haskell.org/ghc/docs/latest/html/libraries/"&gt;Haskell Hierarchical Libraries&lt;/a&gt; don't seem to have a priority queue implementation, but there's a nice &lt;a href="http://en.literateprograms.org/Priority_Queue_(Haskell)"&gt;literate program&lt;/a&gt; by Michael Richter implementing them in Haskell, and my implementation is generally similar to that. (Neither one really takes advantage of laziness.)&lt;br /&gt;&lt;br /&gt;They're based on a really cool idea that's described in the book &lt;a href="http://www.cl.cam.ac.uk/~lp15/MLbook/"&gt;ML for the Working Programmer&lt;/a&gt; by L. C. Paulson. Instead of pushing onto a heap by putting an element at the end of the array and percolating up, we can add it going down from the top. This frees us from the use of the array, allowing a functional pointer-based structure to be used. The heap can be balanced if we alternate back and forth which side we push onto. 
This can be done easily if we always push on to the right side and then swap the left and the right sides (unless the priority is the least, in which case we relegate the old top value to go to the right side and swap the sides).&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/273593670040001243-2587164573099155792?l=useless-factor.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://useless-factor.blogspot.com/feeds/2587164573099155792/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=273593670040001243&amp;postID=2587164573099155792' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/273593670040001243/posts/default/2587164573099155792'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/273593670040001243/posts/default/2587164573099155792'/><link rel='alternate' type='text/html' href='http://useless-factor.blogspot.com/2008/07/persistent-heaps-in-factor.html' title='Persistent heaps in Factor'/><author><name>Daniel Ehrenberg</name><uri>http://www.blogger.com/profile/00902922561603041049</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-273593670040001243.post-8569253969494473125</id><published>2008-06-23T23:22:00.000-07:00</published><updated>2008-06-25T19:45:38.639-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='factor'/><category scheme='http://www.blogger.com/atom/ns#' term='garbage collection'/><title type='text'>Messing around with Factor's GC</title><content type='html'>Unfortunately, my summer project is going ahead using the MMTk component of Jikes RVM. 
At this point, it looks like we're going to focus on an attempt to parallelize MC&lt;sup&gt;2&lt;/sup&gt;. So, work on Factor's GC is just taking place in my free time until after the project is over. I haven't gotten much done yet, just a couple micro-optimizations and minor doc fixes. But I've gotten an understanding of the structure, which I'll try to explain to you.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Basic Factor object demographics&lt;/strong&gt;&lt;br /&gt;&lt;br /&gt;To figure out what we need in a garbage collector, it's useful to know some things about the data usage patterns in Factor code. A &lt;a href="http://www.cs.kent.ac.uk/pubs/2008/2749/content.pdf"&gt;very detailed study of this&lt;/a&gt; has been done for Java. I hypothesize that these results, crucially the ones about allocation site predicting lifespan, are not entirely accurate for functional languages like Factor. But that's a topic for another time.&lt;br /&gt;&lt;br /&gt;I want to find something really basic: what's the distribution of object sizes in Factor? This can be done easily without delving into C code:&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;USING: kernel math.statistics memory sequences ;&lt;br /&gt;: object-data ( -- mean median std-dev )&lt;br /&gt;    gc&lt;br /&gt;    [ drop t ] instances [ size ] map&lt;br /&gt;    [ mean ] [ median ] [ std ] tri ;&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;When I ran that code on the current version of Factor on a 32-bit x86 processor at the command line with an unmodified image (you have to run it with an extra-big tenured space), I got a mean of about 64, median 24 and standard deviation of 4514. This should be interpreted to mean, roughly, that a typical Factor object takes up six words, but that some objects are far bigger. The distribution is skewed far to the right.&lt;br /&gt;&lt;br /&gt;There are only 242 objects in the heap which are larger than a page (4kB) in size, totaling 10MB out of the heap's total 37MB of live data. 
The biggest object is about 3MB. Two of these are byte arrays, three are regular arrays, and the rest are hashtables.&lt;br /&gt;&lt;br /&gt;So, in the Factor GC, we're dealing with lots of little objects, and a few big ones. Both of these need to be dealt with efficiently. This also gives us the information that, in terms of space overhead, it wouldn't be completely out of the question for the GC to use, say, an extra word per object, as this would take up only a relatively small proportion of the heap space. The median object size, actually, is a bit bigger than I expected.&lt;br /&gt;&lt;br /&gt;(This data might not be representative, because it consists of things in tenured space, recorded while nothing useful was going on. Further study would be useful.)&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;The structure of Factor's heap&lt;/strong&gt;&lt;br /&gt;&lt;br /&gt;Factor's heap consists of two parts: the code heap and the data heap. The data heap is maintained by a three-generational semispace (aka copying) collector, and the code heap is maintained by a mark-sweep collector. Separation here isn't unusual. In MMTk, there are several (maybe 7 or 8, IIRC) different heap subspaces.&lt;br /&gt;&lt;br /&gt;There are a few reasons for this separation. For a platform like x86, with separate data and instruction caches, it is beneficial to keep these two things separate. If data and code are mixed together, the icache will be frequently and unnecessarily cleared due to modifications of the data. This will mess up pipelining and all kinds of other hardware optimizations that I don't really understand. Additionally, on other platforms like PowerPC, jumps can only be within 32 megabytes of code space, to keep instruction width fixed. Keeping the code heap separate makes it so that code is all together, so no extra jumps have to be inserted.&lt;br /&gt;&lt;br /&gt;Within the data heap, the structure is relatively simple. 
The nursery is a single semispace and the aging and tenured spaces consist of two semispaces. Card marking is used to track old-to-young pointers. A little optimization called decks, which Slava apparently invented, makes it so that fewer cards have to be scanned. They're basically cards of cards. The generational write barrier used to just mark a card corresponding to the pointer modified; it now marks two cards: the small card corresponding to less than a kilobyte of memory, and the deck corresponding to a subsection of the cards. This makes nursery collection faster, since it's easier to scan the cards for roots.&lt;br /&gt;&lt;br /&gt;Originally, the collections of two spaces were not completely coordinated. The data heap could be collected without the code heap being collected. But this is insufficient: since the data and code heap can both point to each other, a collection of one has to take place at the same time as a collection of the other. If this weren't the case, then, for example, the data heap could be prematurely exhausted: imagine that there are 1000 quotations made and immediately discarded, each of which points to a 10MB array. The data heap will fill up but the code heap won't be collected, so none of the arrays could be deleted.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Future plans&lt;/strong&gt;&lt;br /&gt;&lt;br /&gt;I plan on continuing to try to work on Factor's garbage collector, both in small optimizations and bigger changes to policy. By "policy" I mean the strategies the GC uses to be efficient. Policy ideas that I'm looking at include:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;Reexamining generational policy&lt;/strong&gt;&lt;br /&gt;Factor uses three generations to make sure that not just anything gets into tenured space. This idea isn't new, but most modern collectors use two generations. 
This is combined with a more subtle strategy to make sure objects have to survive a few collections before being promoted to aging space.&lt;br /&gt;&lt;br /&gt;The JVM's (Java) nursery is, in effect, like a nursery and an aging space, except that when the nursery fills up, there is an aging space collection. This simplifies the cards needed, and the write barrier can possibly be made more efficient. But when I made a naive implementation of this in Factor, I got mixed results on benchmarks, so I didn't commit the changes.&lt;br /&gt;&lt;br /&gt;Another possibility is to say that an object has to survive &lt;em&gt;n&lt;/em&gt; nursery collections to be promoted to tenured space. This can be done most easily by having &lt;em&gt;n&lt;/em&gt; nursery-sized spaces that things are copied through. On each nursery collection, objects are promoted one level, and objects on the highest nursery level are promoted to tenured space. Ideally there's some way to take up less space than that. GHC's (Haskell) runtime uses this strategy, and the heap's structure allows the higher levels to only use as many pages of memory as they need.&lt;/li&gt;&lt;br /&gt;&lt;li&gt;&lt;strong&gt;Bounded variable-sized nursery&lt;/strong&gt;&lt;br /&gt;Right now, Factor's nursery (and aging space) is a fixed size. But it's been known for some time that this isn't optimal: a tenured collection can be delayed slightly if the nursery shrinks when tenured space starts to fill up, allowing more of the heap to be used at once. A slight delay in tenured space collection translates into higher throughput, which we want.&lt;/li&gt;&lt;br /&gt;&lt;li&gt;&lt;strong&gt;Lower space overhead for tenured space&lt;/strong&gt;&lt;br /&gt;Right now, with a semispace copying collector managing tenured space, Factor needs about three times as much memory to run as there is live data. I say three times rather than two because, with less than that, copying becomes too frequent and performance degrades. 
Mark-sweep and mark-compact are alternatives for managing the old generation which have much less space overhead, but they both have disadvantages. Mark-sweep can cause memory fragmentation, and allocation by free list isn't very fast (though this isn't necessarily true). And it's difficult to make mark-compact collection efficient.&lt;br /&gt;&lt;br /&gt;Nevertheless, they'd be in some ways an improvement over the existing collector, because a relatively small footprint is very important for the desktop and embedded domains. Once one of these two collectors is implemented (or maybe some simple combination of the two), it could be used as a basis for implementing something like &lt;a href="http://www.cs.umass.edu/~emery/pubs/04-15.pdf"&gt;MC&lt;sup&gt;2&lt;/sup&gt;&lt;/a&gt;, &lt;a href="http://cs.anu.edu.au/techreports/2007/TR-CS-07-04.pdf"&gt;Immix&lt;/a&gt;, &lt;a href="http://www.cs.technion.ac.il/~erez/Papers/compressor-pldi.pdf"&gt;the Compressor&lt;/a&gt; or a parallel or concurrent mark-sweep collector.&lt;/li&gt;&lt;/ul&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/273593670040001243-8569253969494473125?l=useless-factor.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://useless-factor.blogspot.com/feeds/8569253969494473125/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=273593670040001243&amp;postID=8569253969494473125' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/273593670040001243/posts/default/8569253969494473125'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/273593670040001243/posts/default/8569253969494473125'/><link rel='alternate' type='text/html' href='http://useless-factor.blogspot.com/2008/06/messing-around-with-factors-gc.html' title='Messing around 
with Factor&apos;s GC'/><author><name>Daniel Ehrenberg</name><uri>http://www.blogger.com/profile/00902922561603041049</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-273593670040001243.post-4079578118675732413</id><published>2008-06-10T11:01:00.000-07:00</published><updated>2008-06-10T21:00:40.039-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='factor'/><category scheme='http://www.blogger.com/atom/ns#' term='garbage collection'/><title type='text'>An idea for garbage collection research</title><content type='html'>This summer, I'm participating in an REU (Research Experience for Undergraduates) at Harvey Mudd College in southern California. For the past week or so, we've been trying to come up with ideas for what to research. I think I've stumbled on a reasonable one, though it might take a long time to implement. Hopefully, I'll spend the summer implementing it with a co-conspirator, Tony Leguia.&lt;br /&gt;&lt;br /&gt;The idea is to combine MC&lt;sup&gt;2&lt;/sup&gt; and Sapphire. MC&lt;sup&gt;2&lt;/sup&gt; is a copying algorithm which can be incremental and which has much lower memory overhead than the normal semispace algorithm, and Sapphire is an algorithm for concurrent copying GC. Together, they could form an efficient algorithm for concurrent compacting garbage collection with minimal space overhead. This has been done before, but this idea also allows pointer-bumping allocation and no reliance on hardware for things like memory page protection. I haven't found a paper which does all of these things at once.&lt;br /&gt;&lt;h3&gt;Motivation&lt;/h3&gt;&lt;br /&gt;For me, but of course not for anyone else except maybe Tony, the motivation is to see how far this mark-copy algorithm can go. 
As far as I know (and as far as the co-author of the paper that I contacted knows), there have only been two papers published on it, the original one and MC&lt;sup&gt;2&lt;/sup&gt;, neither of which mentions multiple processors.&lt;br /&gt;&lt;br /&gt;One idea is that this would be useful for large servers with many cores that are under memory pressure. Existing algorithms which depend on page faults can be slow when those faults occur. Normal semispace collectors usually do poorly under memory pressure. With a concurrent collector like this, lower pause times could be achieved, giving faster response times to clients. But, for servers, other techniques might actually be more relevant, like making separately collected nurseries for each thread.&lt;br /&gt;&lt;br /&gt;In theory, a concurrent collector would increase throughput, too, especially for programs that have fewer threads than the machine they're running on has cores. But this is a somewhat different use case than the server. Picture a single-threaded media player running on a two-core desktop machine. Really, on a desktop, all programs should have as small a footprint and as short pause times as possible, and this idea can hopefully provide that. The lack of dependence on page protections should also help performance as compared to other algorithms.&lt;br /&gt;&lt;br /&gt;The third use case is an embedded system with multiple processors, which is apparently increasingly common. A concurrent version of MC&lt;sup&gt;2&lt;/sup&gt; would be useful for the same reasons as MC&lt;sup&gt;2&lt;/sup&gt; is on uniprocessor embedded systems.&lt;br /&gt;&lt;br /&gt;I'm not positive that this is the best way to go about it. Maybe I really want a parallel collector, along the lines of &lt;a href="http://research.microsoft.com/~simonpj/papers/parallel-gc/index.htm"&gt;GHC's recent parallel copying collector&lt;/a&gt;. 
As that paper discusses, it's pretty tricky to actually get a performance improvement, but it ends up being worth the effort throughput-wise. Maybe I should be making mark-copy parallel. This is a different research project, but it might be one that we end up looking into more. Maybe we could even make a parallel concurrent collector this way! But definitely not over the next 9 weeks, which is how much time I have left.&lt;br /&gt;&lt;h3&gt;Details&lt;/h3&gt;&lt;br /&gt;I &lt;a href="http://useless-factor.blogspot.com/2008/05/couple-gc-algorithms-in-more-detail.html"&gt;wrote about&lt;/a&gt; MC&lt;sup&gt;2&lt;/sup&gt; earlier, but not about Sapphire. &lt;a href="http://citeseer.ist.psu.edu/hudson01sapphire.html"&gt;Sapphire&lt;/a&gt; is a mostly concurrent copying collector that works without a read barrier of any kind. I say mostly concurrent because some synchronization needs to be done surrounding the program stack, but not very much. Previous concurrent copying collectors worked with a read barrier that preserved the invariant that the mutator can only see things that have been copied into tospace. Sapphire, on the other hand, uses a write barrier to preserve the invariant that, roughly, during copying, fromspace and tospace are consistent. &lt;br /&gt;&lt;br /&gt;The collector consists of four phases: (1) mark (2) allocate (3) copy (4) flip. In the mark phase, a concurrent snapshot-at-the-beginning mark is done. In the allocate phase, space in tospace is set up for each object and non-clobbering pointers are made to tospace from everything in fromspace. They can be non-clobbering because they take up some space in the header, and the old header can be put in tospace. Next, in the copy phase, fromspace is copied to tospace. The allocate step laid things out in breadth-first order, so Cheney's algorithm works here. In the flip step, the stack and other things outside of the space that's being copied have their pointers changed from pointing to fromspace to pointing to tospace. 
Throughout this, the write barrier ensures that modifications to fromspace are propagated to tospace.&lt;br /&gt;&lt;br /&gt;As far as I can tell, it takes nothing special idea-wise to combine these two things. The real work would be in the implementation details, and the benchmarks proving that this is possible and advantageous. One simplification we could make (though I'm not sure if it's a good idea in terms of locality) is that forwarding pointers and "reverse forwarding pointers" could be held each in window-sized blocks. So, overall, the collector would consist of&lt;br /&gt;&lt;ol&gt;&lt;li&gt;A mark phase, run concurrently with the mutator, using either the incremental update or the snapshot variant. This would collect the additional data which &lt;/li&gt;&lt;li&gt;A grouping phase as in MC&lt;sup&gt;2&lt;/sup&gt;&lt;/li&gt;&lt;li&gt;For each group:&lt;ol&gt;&lt;li&gt;Allocate tospace pointers in the empty window&lt;/li&gt;&lt;li&gt;Copy fromspace to tospace&lt;/li&gt;&lt;li&gt;Flip all of the pointers recorded in the remembered set for fromspace&lt;/li&gt;&lt;/ol&gt;&lt;/li&gt;&lt;/ol&gt;&lt;br /&gt;The Sapphire paper discusses condensing mark, allocate and copy phases into a single replicate phase. Within this framework, we could only combine the allocate and copy phases.&lt;br /&gt;&lt;h3&gt;Related work&lt;/h3&gt;&lt;br /&gt;The field of garbage collection systems trying to achieve the same goals is so dense and diverse that I feel hesitant to join in. But here are a bunch of related things, aside from what's already been mentioned.&lt;br /&gt;&lt;ul&gt;&lt;li&gt;One of the first incremental (and extendible to concurrent) copying collectors is that of &lt;a href="http://citeseer.ist.psu.edu/baker78list.html"&gt;Henry Baker&lt;/a&gt;. 
But this uses a read barrier, a little bit of code inserted before every pointer read, which ensures that it's not looking at fromspace.&lt;/li&gt;&lt;li&gt;An &lt;a href="http://citeseer.ist.psu.edu/appel88realtime.html"&gt;Appel-Ellis-Li&lt;/a&gt;-style copying collector (yes, they actually call it that) uses memory page permissions to make, in effect, a free-unless-it-gets-triggered read barrier maintaining the invariant that everything the mutator sees is in to-space. If it gets triggered, it's expensive, but the hope is that it won't get triggered very often because the concurrent copying will go fast enough.&lt;/li&gt;&lt;li&gt;&lt;a href="http://portal.acm.org/citation.cfm?id=1134023"&gt;The Compressor&lt;/a&gt; is a garbage collector which, I believe, maintains invariants similar to the Appel-Ellis-Li collector but uses this for a unique type of mark-compact collector which operates in just one pass. The algorithm is both parallel and concurrent.&lt;/li&gt;&lt;li&gt;There's been a lot of work in concurrent mark-sweep, but this leads to fragmentation over time. So a system for occasional &lt;a href="http://portal.acm.org/citation.cfm?doid=1029873.1029877"&gt;mostly concurrent compaction&lt;/a&gt; has been developed.&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;In all but the last of these systems, a read may be expensive, due either to an explicit read barrier or the possibility of a page protection fault. The last one makes allocation (into the generation that is mark-swept) expensive due to a free list. Nevertheless, it might be reasonable to look into using an Appel-Ellis-Li-style barrier in place of Sapphire's write barrier.&lt;br /&gt;&lt;h3&gt;Our plan&lt;/h3&gt;&lt;br /&gt;The plan is to implement a basic mark-copy collector, maybe on the Factor runtime, and then make it concurrent (or maybe parallel) somehow. At each step, we'll try to do the most basic thing possible. 
If we could get a concurrent or parallel version of mark-copy done by the end of this summer, I'll be happy. This'll be done while writing the paper (for which this could be considered the first draft). Optimizations along the lines of MC&lt;sup&gt;2&lt;/sup&gt; and the finishing touches on the paper can be done after the summer, as long as we get most things done over the next 9 weeks.&lt;br /&gt;&lt;br /&gt;It's an exciting time to be involved in academic computer science because so many basic results have been elaborated by now. The only problem is the minefield of patents (especially dense in the practical field of garbage collection) and the fact that everyone else has thought of your idea before you.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/273593670040001243-4079578118675732413?l=useless-factor.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://useless-factor.blogspot.com/feeds/4079578118675732413/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=273593670040001243&amp;postID=4079578118675732413' title='6 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/273593670040001243/posts/default/4079578118675732413'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/273593670040001243/posts/default/4079578118675732413'/><link rel='alternate' type='text/html' href='http://useless-factor.blogspot.com/2008/06/idea-for-garbage-collection-research.html' title='An idea for garbage collection research'/><author><name>Daniel Ehrenberg</name><uri>http://www.blogger.com/profile/00902922561603041049</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>6</thr:total></entry><entry><id>tag:blogger.com,1999:blog-273593670040001243.post-1635576312072334772</id><published>2008-06-07T15:21:00.000-07:00</published><updated>2008-06-07T16:45:48.519-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='unicode'/><title type='text'>A second introduction to Unicode</title><content type='html'>If you're a normal programmer, you probably really don't want to have to think about Unicode. In what you're doing, text processing probably isn't a very important aspect, and most of your users will be using English. Nevertheless, text has a tendency to creep its way into almost everything, as the basic computer/human interface. So it might be a little beneficial to know about the basics of text processing and the Unicode character set.&lt;br /&gt;&lt;br /&gt;A lot of people have written blog posts which are introductions to Unicode, and I didn't want to write another one with no new information in it. A popular one is &lt;a href="http://www.joelonsoftware.com/articles/Unicode.html"&gt;Joel (on Software)'s&lt;/a&gt; one, which describes what Unicode is and why it's important. You've likely already read an introduction to Unicode, so I'll just summarize the most important points:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;You can't assume text is in ASCII anymore&lt;/strong&gt; This isn't just about being nice to non-English speakers. Even Americans enjoy their &amp;ldquo;curly quotes&amp;rdquo;, their caf&lt;em&gt;é&lt;/em&gt;s&amp;mdash;and their em dashes. User input might come with these non-ASCII characters, and it must be handled properly by robust applications.&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Unicode is the character set to use internally&lt;/strong&gt; A bunch of character sets have been developed over the years for different purposes, but Unicode can represent more scripts than any one other character set. 
Unicode was designed to be able to include the characters from basically all other character sets in use. If you're using C or C++, wchar_t rather than char for strings works for most cases. If you're using a higher level language, then strings should already be stored in some representation that allows for Unicode uniformly.&lt;/li&gt;&lt;li&gt;&lt;strong&gt;There are several text encodings&lt;/strong&gt; Not all text is in ASCII, and very little text is in the most common 8-bit extension, Latin 1 (ISO-8859-1). Lots of input is in UTF-8, which can represent all of Unicode, but there are other Unicode encodings, as well as specific East Asian encodings like GB 2312 and Shift JIS, in active use. Generally, UTF-8 should be used for output, and it's on the rise in terms of usage. Depending on the programming language or library used, you might have to account for the encoding when doing text processing internally. UTF-16 and UTF-8 are the most common, and careless programming can get meaningless results in non-ASCII or non-BMP cases if the encoding is ignored.&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Unicode encodes characters, not glyphs&lt;/strong&gt; Unicode can be seen as a mapping between numbers, called code points, and characters, where a code point is the basic unit of Unicode text. It's been decided that this basic unit is for characters, like letters and spaces, rather than specific presentation forms, which are referred to as glyphs. Glyphs are something that only font designers and people who work on text rendering have to care about.&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;But there's a little bit more that programmers have to know about. Unicode is part of a bigger program of internationalization within a single framework of encodings and algorithms. The Unicode standard includes several important algorithms that programmers should be aware of. 
They don't have to be able to implement them, just to figure out where in the library they are.&lt;br /&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;Normalization&lt;/strong&gt; Because of complications in the design, some Unicode strings have more than one possible form, all of which are actually equivalent. There are a number of normalization forms that have been defined to get rid of these differences, and the one you should use is probably NFC. Usually, you should normalize before doing something like comparing for equality. This is independent of locale.&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Grapheme, word, sentence and line breaks&lt;/strong&gt; It's not true, anymore, that a single character forms a single unit for screen display. If you have a q with an umlaut over it, this needs to be represented as two characters, yet it is one &lt;em&gt;grapheme&lt;/em&gt;. If you're dealing with screen units (imagine an internationalized Hangman), a library should be used for grapheme breaks. Similarly, you can't identify words as things separated by spaces or punctuation, or line break opportunities by looking for whitespace, or sentence breaks by looking just at punctuation marks. It's easy to write a regular expression which tries to do one of these things but does it wrong for English, and it's even easier to do it wrong for other languages, which use other conventions. So use a Unicode library for this. The locale affects how these breaks happen.&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Bidirectional text&lt;/strong&gt; When displaying text on a screen, it doesn't always go left to right as in most languages. Some scripts, like Hebrew and Arabic, go right to left. To account for this, use the Unicode Bidirectional Text Algorithm (BIDI), which should be implemented in your Unicode library. Locale doesn't matter here.&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Case conversion&lt;/strong&gt; Putting a string in lowercase is more complicated than replacing [A-Z] with [a-z]. 
Accent marks and other scripts should be taken into account, as well as a few weird cases like the character ß going to SS in upper case. The locale is also relevant in case conversion, to handle certain dots in Turkish, Azeri and Lithuanian.&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Collation&lt;/strong&gt; There's an algorithm for Unicode collation that works much better than sorting by ASCII value, and works reasonably for most languages. Depending on the locale, it should be modified. Even in English, the Unicode Collation Algorithm produces much more natural results. Parts of the collation key can be used for insensitive comparisons, e.g. ignoring case.&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;For further reading, you can look at &lt;a href="http://scripts.sil.org/cms/scripts/page.php?site_id=nrsi&amp;item_id=IWS-Chapter04a"&gt;this much more in-depth article from SIL&lt;/a&gt;, or the &lt;a href="http://unicode.org/versions/Unicode5.1.0/"&gt;Unicode 5.1 standard itself&lt;/a&gt;, which isn't that bad. Most programmers can't be expected to know all of this stuff, and they shouldn't.
But it'd be nice if everyone used the appropriate library for text processing when needed, so that applications could be more easily internationalized.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/273593670040001243-1635576312072334772?l=useless-factor.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://useless-factor.blogspot.com/feeds/1635576312072334772/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=273593670040001243&amp;postID=1635576312072334772' title='5 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/273593670040001243/posts/default/1635576312072334772'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/273593670040001243/posts/default/1635576312072334772'/><link rel='alternate' type='text/html' href='http://useless-factor.blogspot.com/2008/06/second-introduction-to-unicode.html' title='A second introduction to Unicode'/><author><name>Daniel Ehrenberg</name><uri>http://www.blogger.com/profile/00902922561603041049</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>5</thr:total></entry><entry><id>tag:blogger.com,1999:blog-273593670040001243.post-806747126262622886</id><published>2008-05-25T19:28:00.000-07:00</published><updated>2008-05-29T13:12:12.169-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='factor'/><category scheme='http://www.blogger.com/atom/ns#' term='unicode'/><title type='text'>Unicode collation works now</title><content type='html'>This morning I got Unicode collation to pass all of the 130,000+ unit tests. 
It was a bit more difficult than I imagined, and it's still far from complete for a number of reasons. The whole thing is less than 200 lines (the core algorithm in about 100) in the &lt;code&gt;unicode.collation&lt;/code&gt; vocab in the working version of Factor in git. Here's what I figured out:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;Weird format of &lt;code&gt;UnicodeData.txt&lt;/code&gt;&lt;/strong&gt; It's not documented anywhere that I can find, but the &lt;code&gt;UnicodeData.txt&lt;/code&gt; resource file has a weird range format for specifying certain properties, including character classes, which are used in collation. It looks just like two regular lines, but they have weird names for the characters that apparently need to be parsed. For example, lines that look like&lt;br /&gt;&lt;pre&gt;100000;&amp;lt;Plane 16 Private Use, First&gt;;Co;0;L;;;;;N;;;;;&lt;br /&gt;10FFFD;&amp;lt;Plane 16 Private Use, Last&gt;;Co;0;L;;;;;N;;;;;&lt;/pre&gt; mean that all of the characters in the range U+100000 to U+10FFFD have the category Co, the combining class 0, etc.&lt;/li&gt;&lt;li&gt;&lt;strong&gt;My normalization bugs&lt;/strong&gt; Working on this uncovered a bunch of bugs in older code, including that my conjoining Jamo behavior inserted nonexistent terminator characters.&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Triple contractions&lt;/strong&gt; The UCA specifies that collation graphemes should be found by checking if an adjacent character or non-blocked combining character has a contraction with a previous character. But this incremental approach doesn't work with four of the contractions listed in the DUCET which consist of three, not two, elements, without the first two forming a contraction. So a simple identity contraction for the first two of each of those must be added.&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Combining character contractions&lt;/strong&gt; Apparently, two combining marks can form a contraction.
A straight reading of the UCA wouldn't predict this, but not all of the UCA tests pass unless you check for non-adjacent combining marks being in a contraction together, without a noncombining mark to start it off.&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;And here's what I still have to do:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;Korean stuff&lt;/strong&gt; Because of some disagreement with the ISO people, the DUCET doesn't actually decide the best way to sort Korean. Instead, they provide three methods, all of which require modifying the table. I don't really understand the issue right now.&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Tailoring for locales&lt;/strong&gt; Actually, heh, the default algorithm is inaccurate for any specific locale you might be in. And, for human interfaces, it's actually pretty important that the sort order corresponds to expectations. So, if you want to do sorting that's correct, you have to modify the data. Unfortunately, the standard doesn't go into the specific algorithms for tailoring the table, though there is data available through the Common Locale Data Repository (CLDR).&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Efficiency&lt;/strong&gt; My implementation is both time and space inefficient, because I paid absolutely no attention to those; solving the basic problem is hard enough (for me). Collation keys should be made shorter, and they should be made in fewer passes over the string.&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Update&lt;/strong&gt;: Here's an overview of the words that are defined in the vocabulary.
It's about the minimum that any UCA implementation should have, in my opinion:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;&lt;code&gt;sort-strings&lt;/code&gt; This word takes a sequence of strings and sorts them according to the UCA, using code point order as a tie-breaker.&lt;/li&gt;&lt;li&gt;&lt;code&gt;collation-key&lt;/code&gt; This takes a string and gives a representation of the collation key, which can be compared with &lt;code&gt;&amp;lt;=&gt;&lt;/code&gt;.&lt;/li&gt;&lt;li&gt;&lt;code&gt;string&amp;lt;=&gt;&lt;/code&gt; This word takes two strings and compares them using the UCA with the DUCET, using code point order as a tie-breaker.&lt;/li&gt;&lt;li&gt;&lt;code&gt;primary=&lt;/code&gt; This checks whether the first level of collation is identical. This is the least specific kind of equality test. In Latin script, it can be understood as ignoring case, punctuation and accent marks.&lt;/li&gt;&lt;li&gt;&lt;code&gt;secondary=&lt;/code&gt; This checks whether the first two levels of collation are equal.
For Latin script, this means accent marks are significant again.&lt;/li&gt;&lt;li&gt;&lt;code&gt;tertiary=&lt;/code&gt; Along the same lines as &lt;code&gt;secondary=&lt;/code&gt;, but case is significant.&lt;/li&gt;&lt;li&gt;&lt;code&gt;quaternary=&lt;/code&gt; This is a little less typical (and definitely non-essential, unlike the other things), and it's my own nomenclature, but it makes punctuation significant again, while still leaving out things like null bytes and Hebrew vowel marks, which mean absolutely nothing in collation.&lt;/li&gt;&lt;/ul&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/273593670040001243-806747126262622886?l=useless-factor.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://useless-factor.blogspot.com/feeds/806747126262622886/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=273593670040001243&amp;postID=806747126262622886' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/273593670040001243/posts/default/806747126262622886'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/273593670040001243/posts/default/806747126262622886'/><link rel='alternate' type='text/html' href='http://useless-factor.blogspot.com/2008/05/unicode-collation-works-now.html' title='Unicode collation works now'/><author><name>Daniel Ehrenberg</name><uri>http://www.blogger.com/profile/00902922561603041049</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-273593670040001243.post-1564123885498545413</id><published>2008-05-23T11:30:00.000-07:00</published><updated>2008-05-23T14:07:37.200-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='factor'/><category scheme='http://www.blogger.com/atom/ns#' term='sequences'/><category scheme='http://www.blogger.com/atom/ns#' term='unicode'/><category scheme='http://www.blogger.com/atom/ns#' term='macros'/><title type='text'>Little things I've been working on</title><content type='html'>I've been working on a few different things that, individually, are too insignificant for a blog post, so I'll put them together.&lt;br /&gt;&lt;br /&gt;One is, I expanded my &lt;a href="http://useless-factor.blogspot.com/2008/01/matching-diffing-and-merging-xml.html"&gt;previous survey&lt;/a&gt; of algorithms for XML diffing, and the result is &lt;a href="http://factorforge.org/dan/XHTML_diff_survey.pdf"&gt;here [PDF]&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;I've been working on a few libraries in Factor. One is &lt;a href="http://useless-factor.blogspot.com/2007/12/extradelegation-annotated-vocabulary.html"&gt;&lt;code&gt;extra/delegate&lt;/code&gt;&lt;/a&gt;, which now interacts properly with reloading. For example, if you define a protocol, then say that a class consults something for that protocol, and then redefine the protocol to include more generic words, the consulting class will be automatically updated. Unfortunately, this doubled the size of the code, give or take. Slava changed duplex-streams to use extra/delegate, and the code is much simpler now, as it previously amounted to manual delegation. 
I got rid of mimic because it's unsafe and violates encapsulation in unexpected ways.&lt;br /&gt;&lt;br /&gt;Another little thing I made was &lt;code&gt;extra/lcs&lt;/code&gt;, a library for calculating Levenshtein distance between two strings, the longest common subsequence of two strings, and an edit script between two strings. Because the LCS problem and Levenshtein distance are duals, I was able to share most of the code between them. I used the simple quadratic time and space algorithm that &lt;a href="http://en.wikipedia.org/wiki/Longest_common_subsequence_problem"&gt;Wikipedia describes&lt;/a&gt; rather than the better &lt;a href="http://www.xmailserver.org/diff2.pdf"&gt;O(nd) algorithm [PDF]&lt;/a&gt; commonly called the GNU diff algorithm. I'll upgrade it to this once I understand the algorithm. This replaces &lt;code&gt;extra/levenshtein&lt;/code&gt;. I expected it to be significantly faster, because the old code used dynamically scoped variables and this uses statically scoped locals, but the speed improvement turned out to be just around 4% in small informal benchmarks on short strings.&lt;br /&gt;&lt;br /&gt;Now, I'm working on the &lt;a href="http://unicode.org/reports/tr10/"&gt;Unicode Collation Algorithm&lt;/a&gt;. The basics were simple, but I'm still unsure how to recognize collation graphemes efficiently in general. Either way, I discovered a bug in normalization: my insertion sort, used for canonical ordering of combining marks, wasn't a stable sort as required for normalization. It was actually an anti-stable sort: it &lt;em&gt;reversed&lt;/em&gt; subsequences which were of the same sort key. That was really stupid of me. I'm going to work on incorporating existing test suites for things like this. 
For the test suite for collation, all but 8000 of 130,000 tests pass, making it far from ready.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/273593670040001243-1564123885498545413?l=useless-factor.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://useless-factor.blogspot.com/feeds/1564123885498545413/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=273593670040001243&amp;postID=1564123885498545413' title='3 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/273593670040001243/posts/default/1564123885498545413'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/273593670040001243/posts/default/1564123885498545413'/><link rel='alternate' type='text/html' href='http://useless-factor.blogspot.com/2008/05/little-things-ive-been-working-on.html' title='Little things I&apos;ve been working on'/><author><name>Daniel Ehrenberg</name><uri>http://www.blogger.com/profile/00902922561603041049</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>3</thr:total></entry><entry><id>tag:blogger.com,1999:blog-273593670040001243.post-5476529418682853626</id><published>2008-05-18T09:56:00.000-07:00</published><updated>2008-05-19T09:16:02.457-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='parsing'/><title type='text'>Writings on regexp group capture</title><content type='html'>So, in researching regular expression group capture, I had a little bit of trouble. It turns out that some people call it "capture groups", others call it "submatch extraction" and some people call it "subexpression match". 
In Google, it looks like "submatch extraction" gets the most research hits, and "subexpression match" is the most broadly used.&lt;br /&gt;&lt;br /&gt;That behind me, I'm not the first one to come up with an algorithm for group capture in regular expressions in linear time and space. (My algorithm was, basically: annotate the NFA states which lie on a group boundary, then turn this into a DFA which marks a location in the string when that state could be entered. Run this, and then run the same thing on the reverse regular expression, putting the string in backwards, and find the intersection between the possible points of group boundary. Then, get the first possible group boundary point for each one, or the last. This can be proven correct easily in the case of one boundary point: if a proposed boundary is in the set marked for the forward pass and the backward pass, then the part before the boundary matches the first part of the regexp, and the part after the boundary matches the second part.)&lt;br /&gt;&lt;br /&gt;Actually, there's been a bit of research here over the past 20 years. I haven't read the following papers very closely (though I plan to), but for anyone interested in understanding how to process regular expressions efficiently to get a parse tree, here are a few interesting papers:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;&lt;a href="http://citeseer.ist.psu.edu/kearns91extending.html"&gt;Extending Regular Expressions with Context Operators and Parse Extraction&lt;/a&gt; by Steven Kearns, 1991. This does something like the algorithm I was developing, but it's further thought-out.&lt;/li&gt;&lt;br /&gt;&lt;li&gt;&lt;a href="http://citeseer.ist.psu.edu/340667.html"&gt;Efficiently building a parse tree from a regular expression&lt;/a&gt; by Danny Dubé, Marc Feeley, 2000.
This goes into more depth on building parse trees, but their algorithm is apparently less efficient than the one just below.&lt;/li&gt;&lt;br /&gt;&lt;li&gt;&lt;a href="http://laurikari.net/ville/regex-submatch.pdf"&gt;Efficient submatch addressing for regular expressions [PDF]&lt;/a&gt; by Ville Laurikari, 2001. This is a master's thesis, so it's easier to read and presents background information. The formal model of a tagged NFA is introduced. Benchmarks are provided, showing the system to be much faster than other widely used libraries.&lt;/li&gt;&lt;br /&gt;&lt;li&gt;&lt;a href="http://citeseer.ist.psu.edu/frisch04greedy.html"&gt;Greedy Regular Expression Matching&lt;/a&gt; by Alain Frisch, Luca Cardelli, 2004. This takes an interesting axiomatic approach to the issue, and develops a different way to resolve ambiguity.&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;All of these papers go about submatch extraction in somewhat difficult ways. I hope I helped someone avoid a difficult literature search like I had.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Update&lt;/strong&gt;: It seems the best way to do a literature search is to blog about something, and have commenters give you relevant papers. &lt;a href="http://citeseer.ist.psu.edu/emir04compiling.html"&gt;Here's one&lt;/a&gt; by Burak Emir describing how to get the shortest match (think non-greedy, but globally optimal) with group capture, taking advantage of transformations of regexes.
Thanks, Alain Frisch!&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/273593670040001243-5476529418682853626?l=useless-factor.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://useless-factor.blogspot.com/feeds/5476529418682853626/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=273593670040001243&amp;postID=5476529418682853626' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/273593670040001243/posts/default/5476529418682853626'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/273593670040001243/posts/default/5476529418682853626'/><link rel='alternate' type='text/html' href='http://useless-factor.blogspot.com/2008/05/regexp-research.html' title='Writings on regexp group capture'/><author><name>Daniel Ehrenberg</name><uri>http://www.blogger.com/profile/00902922561603041049</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-273593670040001243.post-238500383383454362</id><published>2008-05-10T11:05:00.000-07:00</published><updated>2008-05-17T10:47:08.323-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='parsing'/><title type='text'>Parsing with regular expressions and group capture</title><content type='html'>&lt;strong&gt;Update&lt;/strong&gt;: This idea is completely not new. See  Ville Laurikari's master's thesis, &lt;a href="http://citeseer.ist.psu.edu/480392.html"&gt;Efficient Submatch Addressing for Regular Expressions&lt;/a&gt;, especially chapter 2.&lt;br /&gt;&lt;br /&gt;Though I'd rather avoid it, string parsing is a crucial part of programming. 
Since we're more than 60 years into the use of modern computers, it seems like we should have a pretty good handle on how to build abstractions over parsing. Indeed, there are tons of great tools out there, like GNU Yacc, Haskell's Parsec, newer Packrat-based parsers like Factor's EBNF syntax for PEGs, and a bunch of other high level parsing libraries. These libraries are relatively easy to use once you understand the underlying structure (each one parses a different subset of context-free grammars), because they expose the programmer to a tree-like view of the string.&lt;br /&gt;&lt;br /&gt;However, these incur too much overhead to be used for certain domains, like the parsing that goes on in an HTTP server or client. They're really overkill when, as in the case of HTTP interchanges, what you're dealing with is a regular language and processing can be done on-line. (I'll get back later to what I mean by those two things.) The main tools that exist to deal with this are Lex and Ragel.&lt;br /&gt;&lt;br /&gt;Ragel seems like a really interesting solution for this domain. The entire description of parsing is eventually put in one regular expression, which is compiled to a DFA, where states and transitions can be annotated by actions. Fine-grained control is given to limit non-determinism. But the user must be acutely aware of how regular expressions correspond to DFAs in order to use the abstraction. So it is somewhat leaky. Also, it's difficult to get a tree-like view on things: actions are used purely for their side effect.&lt;br /&gt;&lt;br /&gt;So, here's an idea: let's find a middle ground. Let's try to use regular expressions, with all of their efficiency advantages, but get an abstract tree-like view of the grammar and an ability to use parsing actions like high-level abstractions allow. 
Ideally, the user won't have to know about the implementation beyond two simple facts: regular languages can't use general recursion, and nondeterminism should be minimized.&lt;br /&gt;&lt;br /&gt;This isn't something that I've implemented, but I have a pretty good idea for the design of such a system, and I wanted to share it with you. First, a little background.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;DFAs and regular expressions&lt;/h3&gt;&lt;br /&gt;I'm taking a computer science class about this right now, so I'm gonna be totally pedantic. When I say regular expression, I mean an expression that describes a regular language. Perl regexes aren't regular expressions (and Larry Wall knows this). If you don't feel like putting on your theoretician's hat, this blog post will be mostly meaningless.&lt;br /&gt;&lt;br /&gt;What's a regular language? First off, a language is a set of strings. We care about infinite sets of strings, since finite sets are trivial to represent. If a string is in the language, that means that the language matches the string, intuitively. A regular language is one which can be represented by a deterministic finite automaton (DFA) without extensions, also called a finite state machine (FSM) for some reason. Many useful languages are regular, and many are not.&lt;br /&gt;&lt;br /&gt;The idea of a DFA is a finite number of states, one of which is the start state, and a transition function, which takes the current state and a character of a string and returns the next state. The transition function is defined on all states and all characters in the alphabet. There is a set of final states, and if the string runs out when the machine is in a final state, then the string is accepted. The language of the DFA is the set of strings accepted by the DFA. For a given DFA, that DFA can be run in linear time with respect to the length of the input string and constant space.
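&lt;br /&gt;&lt;br /&gt;To make that definition concrete, here is a minimal sketch in Python. The machine itself (it accepts strings of 0s and 1s with an even number of 1s), the state names and the helper names &lt;code&gt;step&lt;/code&gt; and &lt;code&gt;run_dfa&lt;/code&gt; are all made up for illustration; they aren't from any library.&lt;br /&gt;

```python
# Hypothetical example DFA: accepts binary strings with an even number of 1s.
# The states, alphabet and names below are illustrative assumptions.
START = "even"
FINAL = {"even"}

# The transition function is total: one entry per (state, character) pair.
DELTA = {
    ("even", "0"): "even",
    ("even", "1"): "odd",
    ("odd", "0"): "odd",
    ("odd", "1"): "even",
}


def step(state, char):
    """Apply the transition function to a single character."""
    return DELTA[(state, char)]


def run_dfa(chars):
    """Run the DFA over any iterable of characters.

    Only the current state is remembered, so the run takes time linear in
    the length of the input and constant space.
    """
    state = START
    for char in chars:
        state = step(state, char)
    # Accept if the input ran out while the machine is in a final state.
    return state in FINAL


print(run_dfa("1001"))  # input with two 1s
print(run_dfa("1101"))  # input with three 1s
```

&lt;br /&gt;&lt;br /&gt;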
It can also be run "on-line", that is, without the whole string in advance, going incrementally.&lt;br /&gt;&lt;br /&gt;A related construction is an NFA, or nondeterministic finite automaton. Imagine the previous idea, but instead of a transition function, there is a transition relation. That is, for any character and current state, there are zero or more next states to go to, and the NFA always picks the right one. This is called nondeterminism (at least that's what it means here). Amazingly, NFAs can accept only regular languages and nothing more, because NFAs can be translated into DFAs. Basically, you build a DFA which picks all possible states at once, given all possible paths through the NFA. Potentially, though, there's an exponential blowup in the number of states.&lt;br /&gt;&lt;br /&gt;Every regular expression can be converted into an equivalent NFA, which can be converted into a DFA, which can then be converted back into a regular expression. They're all equivalent. So then what's a regular expression? There are different ways to define it. One is that you can build up a regular expression from the following elements: the epsilon regex (matching the empty string), the empty regex (matching nothing), single character regexes (matching just a single character), concatenation (one followed by another), disjunction (or) and the Kleene star (0 or more copies of something). Counterintuitively, it's possible to construct regexes which support negation, conjunction, lookahead and other interesting things.&lt;br /&gt;&lt;br /&gt;The most important distinction from Perl regexes is that &lt;em&gt;regular expressions cannot contain backreferences&lt;/em&gt;, because these are provably impossible to express in a DFA. It's impossible to parse something with backreferences in the same linear time and constant space that you get from regexes which are regular. 
In fact, &lt;a href="http://perl.plover.com/NPC/"&gt;parsing patterns with backreferences is NP-hard&lt;/a&gt; and not believed possible in polynomial time (with respect to the length of the input string). Since regular expressions which are regular give us such nice properties, I'm going to stick to them.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Regular expressions in practice in parsing today&lt;/h3&gt;&lt;br /&gt;The formal study of regular languages is a basically solved problem within the formalism itself: they are equivalent to DFAs, and satisfy a convenient set of properties summarized by the &lt;a href="http://en.wikipedia.org/wiki/Pumping_lemma_for_regular_languages"&gt;pumping lemma&lt;/a&gt; and the &lt;a href="http://en.wikipedia.org/wiki/Myhill-Nerode_theorem"&gt;Myhill-Nerode theorem&lt;/a&gt;. The problem is just, is the given string a member of the language? What languages are regular?&lt;br /&gt;&lt;br /&gt;This was solved in the 1950s and 1960s, and the basic results are in most introductory compiler books. Those books use the solution to build lexers, like &lt;a href="http://dinosaur.compilertools.net/lex/index.html"&gt;Lex&lt;/a&gt;. Lex basically takes a list of regular expressions, each associated with an action, finds one of them to match maximally with the current input, executes the associated action on the portion that matches, and then repeats with the rest of the string. This is useful to build lexers, but the programmer has very little context for things, so it's difficult to use for much else.&lt;br /&gt;&lt;br /&gt;More recently, &lt;a href="http://www.cs.queensu.ca/~thurston/ragel/"&gt;Ragel&lt;/a&gt; has been used as a way to parse more complicated things using regular expressions. Its strategy is to turn its compile-time input into one big regular expression, annotated with actions on certain states or transitions. The actions are fragments of C code, and they form the processing power of the machine.
However, their behavior can get a little unintuitive if too much nondeterminism is used, so Ragel provides a bunch of tools to limit that. Also, Ragel lets you explicitly specify a DFA through transitions, which seems useful but low-level.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Group capture with regexes&lt;/h3&gt;&lt;br /&gt;One of the most useful features of Perl's regular expressions is group capture. By this, I mean how you can do something like &lt;code&gt;s/^(1*)(0*)$/$2$1/&lt;/code&gt; to swap ones and zeros in a string. This is different from backreferences (like the non-regular expression &lt;code&gt;/(.*)$1/&lt;/code&gt;) because it's only used in subsequent code, to figure out what got matched to what part of the regex. It doesn't parse any languages which aren't regular, but it's a useful tool for processing.&lt;br /&gt;&lt;br /&gt;Curiously, this has been ignored both by academics and DFA implementors so far. I hypothesize that it's been ignored by theorists for two reasons: (1) It's easy to confuse with backreferences, which make the language non-regular, which is totally uninteresting to theorists (2) They're not part of the formalism of regular expressions as previously expressed. &lt;br /&gt;&lt;br /&gt;Implementors of (non-Perl-based) regular expression-based parsing mechanisms tend to avoid group capture because, in the general case, it's not fast enough and can't be done on-line. Also, as far as I can tell, it hasn't been implemented any other way than interpreting an NFA, using backtracking, and keeping track of where the parser is within the regex to determine group boundaries. This would be terrible for the domain of Lex and Ragel. 
By "on-line" I don't mean on the internet but rather an algorithm that can be performed incrementally, getting little pieces (characters, say) of the input and doing computation incrementally, as the input is received, without storing the whole thing and running the algorithm all at once.&lt;br /&gt;&lt;br /&gt;So how can we do group capture on-line? Well, in general, we can't. Consider the regular expression (1*)1 where you're trying to capture the group 1*. As the input is being processed, we don't know when we've gotten to the end of the group until the entire input is over, since if there are two more 1's, then the first one must be in the group. However, in many cases group capture can in fact be done on-line, as in (0*)(1*), where the groups captured are 0* and 1*. As the regex processes the string, it knows that, if there is a match, the group boundary is just before the first 1. This can be formalized as a "boundary of determinism": a point where the subset construction to form a DFA from an NFA gets a subset of exactly one state.&lt;br /&gt;&lt;br /&gt;I believe this can handle most cases of group capture in practice, if the regular expression is well-written, but surely not all of them. I have an idea for how to do group capture in the few remaining circumstances, but unfortunately it takes linear space and it's not on-line. I'll blog about it once I have a proof of correctness.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Hierarchical parsing using group capture&lt;/h3&gt;&lt;br /&gt;Using this group capture mechanism, we can build a hierarchical parsing mechanism with actions on different things, which can be built to parse regular languages in a higher-level way. Regular expressions can't use arbitrary recursion like context-free grammars can, so the parse tree will be of fixed size, but it could still be useful. In designing this, I'm thinking specifically about making a SAX-like XML parser.
It'd be awkward to write everything out as one big regular expression, but split into smaller things, each with their own little steps in processing, it could be much more elegant. My goal for syntax is something like EBNF syntax, as Chris Double's PEGs library in Factor does. Here's some future pseudocode for how it could look in parsing an XML tag, simplified. (In this code, &gt; is used like Ragel :&gt;&gt;, to indicate that when the expression afterwards can be matched by the regex, it is, as soon as possible (basically).)&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;REG: tag&lt;br /&gt;chars = "&amp;" entity:any* &gt; ";" [[ entity lookup-entity ]]&lt;br /&gt;    | any&lt;br /&gt;string = "\"" str:chars &gt; "\"" [[ str ]]&lt;br /&gt;    | "'" str:chars &gt; "'" [[ str ]]&lt;br /&gt;xml-name = name-start-char name-char*&lt;br /&gt;attribute = name:xml-name ws "=" ws str:string [[ name str 2array ]]&lt;br /&gt;tag = "&lt;" ws closer?:("/"?) name:xml-name attrs:(attribute*) ws contained?:("/"?) ws "&gt;" [[ ... ]]&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;h3&gt;Conclusion&lt;/h3&gt;&lt;br /&gt;Though I haven't implemented this yet, and probably shouldn't even be talking about it, I'm really excited about this idea. I even came up with a stupid little name for it: Hegel, both for High-level Ragel and because it represents the synthesis of the dialectic (as described by Hegel) of slower, higher-level parsing and fast low-level parsing into fast, high-level parsing of regular languages.
I hope it works.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/273593670040001243-238500383383454362?l=useless-factor.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://useless-factor.blogspot.com/feeds/238500383383454362/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=273593670040001243&amp;postID=238500383383454362' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/273593670040001243/posts/default/238500383383454362'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/273593670040001243/posts/default/238500383383454362'/><link rel='alternate' type='text/html' href='http://useless-factor.blogspot.com/2008/05/parsing-with-regular-expressions-and.html' title='Parsing with regular expressions and group capture'/><author><name>Daniel Ehrenberg</name><uri>http://www.blogger.com/profile/00902922561603041049</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-273593670040001243.post-408995982675986504</id><published>2008-05-07T21:50:00.000-07:00</published><updated>2008-05-07T21:54:15.352-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='introduction'/><category scheme='http://www.blogger.com/atom/ns#' term='factor'/><category scheme='http://www.blogger.com/atom/ns#' term='data structures'/><category scheme='http://www.blogger.com/atom/ns#' term='unicode'/><title type='text'>Interval maps in Factor</title><content type='html'>Recently, I wrote a little library in Factor to get the script of a Unicode code point. 
It's in the Factor git repository in the vocab &lt;code&gt;unicode.script&lt;/code&gt;. Initially, I used a relatively simple representation of the data: there was a byte array, where the index was the code point and the elements were bytes corresponding to scripts. (It's possible to use a byte array because there are only seventy-some scripts to care about.) Lookup consisted of &lt;code&gt;char&gt;num-table nth num&gt;name-table nth&lt;/code&gt;. But this was pretty inefficient. The largest code point (that I wanted to represent here) was something around number 195,000, meaning that the byte array took up almost 200KB. Even if I somehow got rid of that empty space (and I don't see an obvious way how, without a bunch of overhead), there are 100,000 code points whose script I wanted to encode, so the table would still take 100KB. &lt;br /&gt;&lt;br /&gt;But we can do better than taking up 100KB. The thing about this data is that scripts are in a bunch of contiguous ranges. That is, two characters that are next to each other in code point order are very likely to have the same script. The &lt;a href="http://unicode.org/Public/UNIDATA/Scripts.txt"&gt;file&lt;/a&gt; in the Unicode Character Database encoding this information actually uses special syntax to denote a range, rather than write out each one individually. So what if we store these intervals directly rather than store each element of the intervals?&lt;br /&gt;&lt;br /&gt;A data structure to hold intervals with O(log n) lookup and insertion has already been developed: interval trees. They're described in Chapter 14 of &lt;a href="http://books.google.com/books?id=NLngYyWFl_YC&amp;dq=&amp;pg=PP1&amp;ots=BwOmAE4oG5&amp;sig=EP2XL5q4OCbvCdHfj44WGN8Nhpg&amp;hl=en&amp;sa=X&amp;oi=print&amp;ct=title&amp;cad=one-book-with-thumbnail"&gt;Introduction to Algorithms&lt;/a&gt; starting on page 311, but I won't describe them here. At first, I tried to implement these, but I realized that, for my purposes, they're overkill. 
They're really easy to get wrong: if you implement them on top of another kind of balanced binary tree, you have to make sure that balancing preserves certain invariants about annotations on the tree. Still, if you need fast insertion and deletion, they make the most sense.&lt;br /&gt;&lt;br /&gt;A much simpler solution is to just have a sorted array of intervals, each associated with a value. The right interval, and then the corresponding value, can be found by simple &lt;a href="http://en.wikipedia.org/wiki/Binary_search"&gt;binary search&lt;/a&gt;. I don't even need to know how to do binary search, because it's already in the Factor library! This is efficient as long as the interval map is constructed all at once, which it is in this case. By a high constant factor, this is also more space-efficient than using binary trees. The whole solution takes less than 30 lines of code.&lt;br /&gt;&lt;br /&gt;(Note: the intervals here are closed and must be disjoint. &amp;lt;=&gt; must be defined on them. They don't use the intervals in &lt;code&gt;math.intervals&lt;/code&gt;, both to save space and because those are overkill here. Interval maps don't follow the assoc protocol because intervals aren't discrete, e.g. floats are acceptable as keys.)&lt;br /&gt;&lt;br /&gt;First, the tuples we'll be using: an &lt;code&gt;interval-map&lt;/code&gt; is the whole associative structure, containing a single slot for the underlying array.&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;TUPLE: interval-map array ;&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;That array consists of &lt;code&gt;interval-node&lt;/code&gt;s, which have a beginning, end and corresponding value.&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;TUPLE: interval-node from to value ;&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;Let's assume we already have a sorted interval map. 
Given a key and an interval map, &lt;code&gt;find-interval&lt;/code&gt; will give the index of the interval which might contain the given key.&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;: find-interval ( key interval-map -- i )&lt;br /&gt;    [ from&gt;&gt; &amp;lt;=&gt; ] binsearch ;&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;code&gt;interval-contains?&lt;/code&gt; tests if a node contains a given key.&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;: interval-contains? ( object interval-node -- ? )&lt;br /&gt;    [ from&gt;&gt; ] [ to&gt;&gt; ] bi between? ;&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;Finally, &lt;code&gt;interval-at*&lt;/code&gt; searches an interval map to find a key, finding the correct interval and returning its value only if the interval contains the key.&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;: fixup-value ( value ? -- value/f ? )&lt;br /&gt;    [ drop f f ] unless* ;&lt;br /&gt;&lt;br /&gt;: interval-at* ( key map -- value ? )&lt;br /&gt;    array&gt;&gt; [ find-interval ] 2keep swapd nth&lt;br /&gt;    [ nip value&gt;&gt; ] [ interval-contains? ] 2bi&lt;br /&gt;    fixup-value ;&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;A few convenience words, analogous to those for assocs:&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;: interval-at ( key map -- value ) interval-at* drop ;&lt;br /&gt;: interval-key? ( key map -- ? ) interval-at* nip ;&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;So, to construct an interval map, there are a few things that have to be done. The input is an abstract specification, consisting of an assoc where the keys are either (1) 2arrays, where the first element is the beginning of an interval and the second is the end, or (2) numbers, representing an interval of the form [a,a]. This can be converted into a form of all (1) with the following:&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;: all-intervals ( sequence -- intervals )&lt;br /&gt;    [ &gt;r dup number? 
[ dup 2array ] when r&gt; ] assoc-map&lt;br /&gt;    { } assoc-like ;&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;Once that is done, the objects should be converted to intervals:&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;: &gt;intervals ( specification -- intervals )&lt;br /&gt;    [ &gt;r first2 r&gt; interval-node boa ] { } assoc&gt;map ;&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;After that, and after the intervals are sorted, we need to make sure that all intervals are disjoint. For this, we can use the &lt;code&gt;monotonic?&lt;/code&gt; combinator, which checks to make sure that all adjacent pairs in a sequence satisfy a predicate. (This is more useful than it sounds at first.)&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;: disjoint? ( node1 node2 -- ? )&lt;br /&gt;    [ to&gt;&gt; ] [ from&gt;&gt; ] bi* &amp;lt; ;&lt;br /&gt;&lt;br /&gt;: ensure-disjoint ( intervals -- intervals )&lt;br /&gt;    dup [ disjoint? ] monotonic?&lt;br /&gt;    [ "Intervals are not disjoint" throw ] unless ;&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;And, to put it all together, using a tuple array for improved space efficiency:&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;: &amp;lt;interval-map&gt; ( specification -- map )&lt;br /&gt;    all-intervals [ [ first second ] compare ] sort&lt;br /&gt;    &gt;intervals ensure-disjoint &gt;tuple-array&lt;br /&gt;    interval-map boa ;&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;All in all, in the case of representing the table of scripts, a table which was previously 200KB is now 20KB. 
Yay!&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/273593670040001243-408995982675986504?l=useless-factor.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://useless-factor.blogspot.com/feeds/408995982675986504/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=273593670040001243&amp;postID=408995982675986504' title='3 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/273593670040001243/posts/default/408995982675986504'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/273593670040001243/posts/default/408995982675986504'/><link rel='alternate' type='text/html' href='http://useless-factor.blogspot.com/2008/05/interval-maps-in-factor.html' title='Interval maps in Factor'/><author><name>Daniel Ehrenberg</name><uri>http://www.blogger.com/profile/00902922561603041049</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>3</thr:total></entry><entry><id>tag:blogger.com,1999:blog-273593670040001243.post-4823195136324497460</id><published>2008-05-03T00:51:00.000-07:00</published><updated>2008-05-03T02:45:07.506-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='garbage collection'/><title type='text'>A couple GC algorithms in more detail</title><content type='html'>In previous posts on garbage collection, I've given a pretty cursory overview as to how things actually work. 
In this post, I hope to give a somewhat more specific explanation of two incremental (and potentially concurrent or parallel, but we'll ignore that for now) GC algorithms: Yuasa's snapshot-at-the-beginning incremental mark-sweep collector, and the MC&lt;sup&gt;2&lt;/sup&gt; algorithm. Yuasa's collector is very widely used, for example in Java 5 when an incremental collector is requested. MC&lt;sup&gt;2&lt;/sup&gt; is a more recent algorithm designed to reduce the fragmentation that mark-sweep creates, and appears to get great performance, though it isn't used much yet. In their practical implementation, both collectors are generational.&lt;br /&gt;&lt;h3&gt;Yuasa's mark-sweep collector&lt;/h3&gt;&lt;br /&gt;The idea is pretty simple: take a mark-sweep collector and split up the work, doing a little bit on each allocation. When the heap occupancy passes a certain threshold, say 80%, switch into "mark phase", and on each allocation, mark the right amount of the heap so that everything's marked by the time the heap is full. (You can ensure this by making the amount of marking proportional to the amount of memory allocated.) Then, switch into sweep phase, and on each allocation sweep the heap by a certain amount. If a big object is allocated, sweeping continues until there's enough room. Once sweeping is done, the collector returns to a neutral state and allocation takes place without any special collection actions until the free space dips below the threshold.&lt;br /&gt;&lt;h4&gt;Making this work&lt;/h4&gt;&lt;br /&gt;This is a neat little way to specify a GC algorithm. The implementor has three knobs at their disposal: the threshold to begin collection, the speed of marking, and the speed of sweeping. But there's a problem: the algorithm, as I described it, doesn't work. See, the graph of interconnections in the heap may change during the course of marking, and that's a problem. 
As I described &lt;a href="http://useless-factor.blogspot.com/2008/03/some-more-advanced-gc-techniques.html"&gt;in a previous post&lt;/a&gt;, if a pointer gets moved to another location, it might evade marking and get swept, causing memory corruption.&lt;br /&gt;&lt;br /&gt;In a snapshot-at-the-beginning incremental marking GC, the technique to fix this is to trap all pointer writes and execute a little bit of code: if the collector is in the marking phase, and if the old pointer value isn't marked, it needs to get marked and get pushed on the marking stack so that its children get marked. (The marking stack is the explicit stack used for depth-first traversal of the heap, to mark everything it reaches.) This piece of code is called the write barrier, and it goes on in addition to the generational write barrier, if one is necessary.&lt;br /&gt;&lt;h4&gt;Conservativeness&lt;/h4&gt;&lt;br /&gt;One more thing: objects are allocated as marked, if an object is allocated during a GC cycle. This means that they can't be collected until the next time around. Unfortunately, this means that any generational GC will be ineffective while marking is going on: everything is effectively allocated in the oldest generation. Nevertheless, generations still provide a significant performance advantage, since most time is spent in the neutral non-GC state.&lt;br /&gt;&lt;br /&gt;This is called snapshot-at-the-beginning not because an actual snapshot is made, but because everything is saved that had something referring to it at the beginning of the marking phase. (Everything that gets a reference to it during the cycle is also saved.) Of all incremental mark-sweep GC algorithms, a snapshot-at-the-beginning collector is the most conservative, causing the most floating garbage to lie around and wait, uncollected, until the next cycle. 
Other algorithms have techniques to avoid this, but it often comes at other costs.&lt;br /&gt;&lt;h3&gt;MC&lt;sup&gt;2&lt;/sup&gt;&lt;/h3&gt;&lt;br /&gt;Unfortunately, no matter what strategy is used to minimize fragmentation, there is a program which will cause bad fragmentation of the heap, making it less usable and allocation more expensive. For this reason, a compaction strategy is helpful, and the MC&lt;sup&gt;2&lt;/sup&gt; algorithm (Memory-Constrained Compaction), created by Narendran Sachindran, provides one within an incremental and generational system. The details are somewhat complicated, and in this blog post I'll offer a simplified view. You can also look at the &lt;a href="http://www.cs.umass.edu/~emery/pubs/04-15.pdf"&gt;full paper&lt;/a&gt;.&lt;br /&gt;&lt;h4&gt;MC&lt;/h4&gt;&lt;br /&gt;The idea is based on the Mark-Copy (MC) algorithm. The heap is divided up into a number of equally sized windows, say 40. One of these is the nursery, and the others act as tenured space. (I don't know why, but the papers about this seem to use a two-generation rather than three-generation model. I think it could easily be updated to use three generations, but I'll stick with this for now.) Each window has a logical number, with the nursery having the highest number.&lt;br /&gt;&lt;br /&gt;Nursery collections go on as I've described in &lt;a href="http://useless-factor.blogspot.com/2008/03/little-more-about-garbage-collection.html"&gt;a previous post&lt;/a&gt;. A tenured space collection is triggered when there is only one (non-nursery) window left free. At this point, the heap is fully marked. During marking, remembered sets of pointers into each window are made. In turn, each window is copied (using Cheney's copying collector) to the open space that exists, starting in the one free window. The remembered sets can be used to update pointers that go to things that were moved. 
If the lowest-numbered window is copied first, the remembered sets only need to contain pointers from higher windows to lower windows.&lt;br /&gt;&lt;h4&gt;New modifications&lt;/h4&gt;&lt;br /&gt;MC&lt;sup&gt;2&lt;/sup&gt; adds a few things to this, to make the algorithm incremental and give low upper bounds on space overhead. The first change is that marking is done incrementally. This is similar to the incremental snapshot-at-the-beginning marker described above, though the creators of MC&lt;sup&gt;2&lt;/sup&gt; opted for a version called incremental update, which is less conservative and more complicated but equally sound. The next change is in the copying technique. If a window is determined to have high occupancy (like more than 95%), it is left as it is without copying. Otherwise, windows are collected into groups whose remaining data can fit into one window. Those groups are incrementally copied into a new window.&lt;br /&gt;&lt;br /&gt;Other changes make sure that the space overhead is bounded. The size of remembered sets is limited by switching to a card marking system in the event of an overflow. Objects with many references to them are put in semi-permanent storage in the lowest possible window number, minimizing the size of the remembered sets that they need.&lt;br /&gt;&lt;br /&gt;In a benchmark included in the MC&lt;sup&gt;2&lt;/sup&gt; paper, it is demonstrated that MC&lt;sup&gt;2&lt;/sup&gt; has the same or slightly better performance compared to &lt;em&gt;non-incremental&lt;/em&gt; generational mark-sweep or generational mark-compact, the alternatives for the domain of memory-constrained systems. 
Pauses more than 30ms are rare, and performance appears to be consistent over a wide range of Java programs.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/273593670040001243-4823195136324497460?l=useless-factor.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://useless-factor.blogspot.com/feeds/4823195136324497460/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=273593670040001243&amp;postID=4823195136324497460' title='3 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/273593670040001243/posts/default/4823195136324497460'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/273593670040001243/posts/default/4823195136324497460'/><link rel='alternate' type='text/html' href='http://useless-factor.blogspot.com/2008/05/couple-gc-algorithms-in-more-detail.html' title='A couple GC algorithms in more detail'/><author><name>Daniel Ehrenberg</name><uri>http://www.blogger.com/profile/00902922561603041049</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>3</thr:total></entry><entry><id>tag:blogger.com,1999:blog-273593670040001243.post-5828951335622190756</id><published>2008-04-29T15:48:00.000-07:00</published><updated>2008-04-29T16:37:54.525-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='factor'/><category scheme='http://www.blogger.com/atom/ns#' term='parsing'/><category scheme='http://www.blogger.com/atom/ns#' term='pattern matching'/><category scheme='http://www.blogger.com/atom/ns#' term='garbage collection'/><title type='text'>Potential ideas to explore</title><content type='html'>I haven't written in a while, and it's a little hard to 
get started back up, so here are just a bunch of random ideas in my head that I'd like to share with you guys. Sorry if it's a little incoherent...&lt;br /&gt;&lt;h3&gt;Possible extensions to Inverse&lt;/h3&gt;I've been thinking about possible ways to generalize my system for concatenative pattern matching, currently in &lt;code&gt;extra/inverse&lt;/code&gt;. There are two ways to go about it: making a more general constraint solving system, and giving access to the old input when inverting something, as in the Harmony project. A third way is to add backtracking (in a different place than constraint solving would put it). To someone familiar with Inverse, these might seem like they're coming from nowhere, but they're actually very closely related. (To someone not familiar with it, see &lt;a href="http://useless-factor.blogspot.com/2007/06/concatenative-pattern-matching.html"&gt;my previous blog post describing Inverse&lt;/a&gt;.)&lt;br /&gt;&lt;h4&gt;Constraint solving&lt;/h4&gt;The idea of resolving constraints is to figure out as much as you can about a situation given certain facts. This is easy in some cases, but impossible in others, even if enough facts are known to, potentially, figure out what everything is. For example, Diophantine equations can be stated as constraints for a fully general constraint-solving system, but solving them is known to be undecidable in general.&lt;br /&gt;&lt;br /&gt;So what can constraint solving get you in Inverse? Well, imagine an inverse to &lt;code&gt;bi&lt;/code&gt;. It's not difficult to make one within the current framework, but some information is lost: everything must be completely determined. Think about inverting &lt;code&gt;[ first ] [ second ] bi&lt;/code&gt;. Inverting this should get the same result as &lt;code&gt;first2&lt;/code&gt; (which has a hard-coded inverse right now, inverting to &lt;code&gt;2array&lt;/code&gt;). 
But it won't work.&lt;br /&gt;&lt;br /&gt;A way for &lt;code&gt;[ first ] [ second ] bi&lt;/code&gt; to work would be using the following steps:&lt;br /&gt;&lt;ol&gt;&lt;li&gt;Initialize a logic variable X as unbound&lt;/li&gt;&lt;li&gt;Unify X with the information, "the first element is what's second from the top of the stack (at runtime)". Now it's known that X is a sequence of length at least one.&lt;/li&gt;&lt;li&gt;Unify X with the information, "the second element is what's on the top of the stack (at runtime)". Now it's known that X is a sequence of length at least two.&lt;/li&gt;&lt;li&gt;From the information we have about X, produce a canonical representation, since the inverted quotation is over: an array of the minimum possible length.&lt;/li&gt;&lt;/ol&gt;&lt;br /&gt;This isn't easy to do in general, but it should be possible, in theory. It'd be extremely cool if it worked out.&lt;br /&gt;&lt;br /&gt;Formally, you can think of Inverse as already a reasonable constraint solving system, for a limited problem domain. Given [ f ], and the statement about stacks A and B that f(A) = B, and given B, find a possible value for A. The strategy used right now is mathematically sound, and I hope to write it up some day. But a more general use of logic variables is possible: explicit logic variables in code. This could be used to make a better-integrated logic language in Factor.&lt;br /&gt;&lt;h4&gt;The Harmony Project&lt;/h4&gt;&lt;br /&gt;The &lt;a href="http://www.seas.upenn.edu/~harmony/"&gt;Harmony Project&lt;/a&gt;, led by Benjamin C. Pierce, is an attempt to solve the "view-update problem" using a new programming language and type system which is largely invertible. The view-update problem is that we want to convert different storage formats into an abstract representation, manipulate that representation and put it back without duplicating code about the representation. 
Everything operates on edge-labeled trees.&lt;br /&gt;&lt;br /&gt;Within the Harmony framework, it's possible to do all your work in bijections (one-to-one onto functions, similar but not identical to the domain of Inverse right now), but there's extra power included: the function to put the abstract representation back into the original form has access to the original. This adds a huge amount of power, giving the possibility of conditionals and recursion, in limited cases. Also, it gives the power to ignore certain things about the surface structure when looking at the abstract form. (Harmony also has ideas about tree merging, and of course a new type system, but I'm not as interested in that right now.)&lt;br /&gt;&lt;br /&gt;So far, only relatively trivial things have been made with Harmony, but the idea looks really useful, though there are two problems: (1) I don't really understand it fully (like constraints) and (2) I have no idea how it can fit together with Inverse as it is right now.&lt;br /&gt;&lt;h4&gt;Backtracking&lt;/h4&gt;In &lt;a href="http://citeseer.ist.psu.edu/337368.html"&gt;Mark Tullsen's paper on first-class patterns&lt;/a&gt;, there was an interesting idea that Inverse could adopt. Tullsen used monads to sequence the patterns. It's simplest to use the Maybe monad, and that corresponds to how pattern matching systems normally work. But if the List monad is used instead, then you easily get backtracking. This could be ported to Factor either by using monads or, maybe easier, by using continuations. Years ago, Chris Double implemented amb in Factor using continuations, though the code won't work anymore. The sequencing and backtracking I'm talking about is relevant in things like &lt;code&gt;switch&lt;/code&gt; statements, rather than &lt;code&gt;undo&lt;/code&gt; itself. 
I'm not sure if it'd actually be useful in practice.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Garbage collection research ideas&lt;/h3&gt;Because the summer's coming up, and I'll be participating in Harvey Mudd's Garbage Collection REU, I've been coming up with a few research ideas. The suggested one is to continue with the work of previous years' REUs and think about simplifiers and collecting certain persistent data structures and weak hashtables, but here are a couple more:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;Figure out how efficient garbage collection on Non-Uniform Memory Access systems can work.&lt;/strong&gt; The problem (if it is a problem) is that plain old garbage collection on multiprocessor NUMA systems isn't as fast as it could be, because a chunk of memory allocated for a thread may be far away from where it's used. One way to ensure locality is to give each processor (at least) its own heap, where the heap is guaranteed to be stored in the closest memory. But if data needs to be shared between processors, this can be too limiting. A piece of data can be kept in the RAM closest to the processor which made the allocating call, but maybe it'd be beneficial to collect data on which processor is using which data, and dynamically move data around to different places in RAM to put it closest to where it's used. A related issue is maximizing locality when actually performing the tracing in the GC, which I have no ideas about.&lt;/li&gt;&lt;br /&gt;&lt;li&gt;&lt;strong&gt;Run a real benchmark comparing several GC algorithms.&lt;/strong&gt; Probably the most annoying thing for programming language implementors trying to pick a good GC algorithm is that &lt;em&gt;there's no comprehensive benchmark to refer to&lt;/em&gt;. No one really knows which algorithm is the fastest, so there are two strategies remaining: pick the one that sounds the fastest, or do trial and error among just a few. 
Each paper about a new algorithm reports speed improvements&amp;mdash;over significantly older algorithms. It'd be a big project, but I think it's possible to make a good benchmark suite and test how long it takes for these algorithms to run, in terms of absolute throughput and pause length and frequency, given different allocation strategies. If it's possible, it'd be nice to know what kind of GC performs best given a particular memory use pattern.&lt;/li&gt;&lt;br /&gt;&lt;li&gt;&lt;strong&gt;Garbage collector implementation in proof-carrying code.&lt;/strong&gt; There are a couple invariants that garbage collectors have, that must be preserved. For example, the user can't be exposed to any forwarding pointers, and a new garbage collection can't be started when forwarding pointers exist. The idea of proof-carrying code (an explicit proof, which is type-checked to be accurate, is given with the code) isn't new; it's mostly been used to prove memory consistency safety given untrusted code. But maybe it could be used to prove that a GC implementation is correct.&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;These ideas are really difficult, but I think they're interesting, and with four other smart people working with me, maybe in a summer we can do something really cool, like this or whatever other idea they come up with.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Ragel-style state machines in Factor&lt;/h3&gt;In my Automata and Computability class at Carleton, we've been studying (what else) finite automata, and it got me thinking about regular expressions and their utility in Factor. By regular expression, I mean an expression denoting a regular language: a real, academic regexp. A regular language is one that can be written as a deterministic finite automaton (finite state machine). 
Hopefully, I'll explain more about this in a future blog post.&lt;br /&gt;&lt;br /&gt;Anyway, if you've heard of &lt;a href="http://www.cs.queensu.ca/~thurston/ragel/"&gt;Ragel&lt;/a&gt;, it's basically what I want to do. But the form it'd take is basically the same as PEGs (Chris Double's Packrat parser), with the one restriction that no recursion is allowed. In return for this restriction, there is no linear space overhead. Basically everything else, as far as I know, could stay the same.&lt;br /&gt;&lt;br /&gt;I'm thinking I'll redo the XML parser with this. The SAX-like view will be done with this regular languages parser (since all that's needed is a tokenizer), and then that can be formed into a tree using PEGs (since linear space overhead is acceptable there). Linear space overhead, by the way, is unacceptable for the SAX-like view, since it should be usable for extremely large documents that couldn't easily fit in memory all at once.&lt;br /&gt;&lt;br /&gt;(By the way, I know Ragel also allows you to explicitly make state charts, but I won't include this until I see a place where I want to use it.)&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/273593670040001243-5828951335622190756?l=useless-factor.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://useless-factor.blogspot.com/feeds/5828951335622190756/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=273593670040001243&amp;postID=5828951335622190756' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/273593670040001243/posts/default/5828951335622190756'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/273593670040001243/posts/default/5828951335622190756'/><link rel='alternate' type='text/html' 
href='http://useless-factor.blogspot.com/2008/04/potential-ideas-to-explore.html' title='Potential ideas to explore'/><author><name>Daniel Ehrenberg</name><uri>http://www.blogger.com/profile/00902922561603041049</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-273593670040001243.post-6085191992759956830</id><published>2008-04-06T19:06:00.000-07:00</published><updated>2008-04-07T17:38:12.185-07:00</updated><title type='text'>Programming in a series of trivial one-liners</title><content type='html'>Among Perl programmers, a one-line program is considered a useful piece of hackage, something to show off to your friends as a surprisingly simple way to do a particular Unix or text-processing task. Outsiders tend to deride these one-liners as line noise, but there's a certain virtue to them: in just one line, in certain programming languages, it's possible to create meaningful functionality.&lt;br /&gt;&lt;br /&gt;APL, which lives on through derivatives like K, Q, J and Dyalog, pioneered the concept of writing entire programs in a bunch of one-liners. Because their syntax is so terse and because of the powerful and high-level constructs of array processing, you can pack a lot into just 80 characters. In most K programs I've seen, each line does something non-trivial, though this isn't always the case. It can take some time to decode just a single line. Reading Perl one-liners is the same way.&lt;br /&gt;&lt;br /&gt;Factor continues the one-line tradition. In general, it's considered good style to write your words in one, or sometimes two or three, lines each. But this isn't because we like to pack a lot into each line. Instead, each word is rather trivial, using the words defined before it. 
After enough simple things are combined, something non-trivial can result, but each step is easy to understand.&lt;br /&gt;&lt;br /&gt;Because Factor is concatenative (concatenation of programs denotes composition) it's easier to split things into these trivial one-liners. It can be done by copy and paste after the initial code is already written; there are no local variables whose names have to be changed. One-liners in Factor aren't exceptional or an eccentric trait of the community. They're the norm, and programs written otherwise are considered bad style.&lt;br /&gt;&lt;br /&gt;Enough philosophizing. How does this work in practice? I'm working on encodings right now, so I'll break down how this worked out in implementing 8-bit encodings like ISO-8859 and Windows-1252. These encodings are just a mapping of bytes to characters. Conveniently, a bunch of resource files describing these mappings, all in exactly the same format, &lt;a href="ftp://ftp.unicode.org/Public/MAPPINGS/ISO8859/"&gt;already exist&lt;/a&gt; on the Unicode website. &lt;br /&gt;&lt;br /&gt;The first thing to do in implementing this is to parse and process the resource file, turning it into two tables for fast lookup in either direction. Instead of putting this in one word, it's defined in five, each one or two lines long. First, &lt;code&gt;tail-if&lt;/code&gt; is a utility word which works like &lt;code&gt;tail&lt;/code&gt; but leaves the sequence as it is if it's shorter than n.&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;: tail-if ( seq n -- newseq )&lt;br /&gt;    2dup swap length &lt;= [ tail ] [ drop ] if ;&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;Using that, &lt;code&gt;process-contents&lt;/code&gt; takes an array of lines and turns it into an associative mapping (in the form of an array of pairs) from octets to code points.&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;: process-contents ( lines -- assoc )&lt;br /&gt;    [ "#" split1 drop ] map [ empty? 
not ] subset&lt;br /&gt;    [ "\t" split 2 head [ 2 tail-if hex&gt; ] map ] map ;&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;code&gt;byte&gt;ch&lt;/code&gt; takes this assoc, the product of &lt;code&gt;process-contents&lt;/code&gt;, and produces an array which can be used to get the code point corresponding to a byte.&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;: byte&gt;ch ( assoc -- array )&lt;br /&gt;    256 replacement-char &amp;lt;array&gt;&lt;br /&gt;    [ [ swapd set-nth ] curry assoc-each ] keep ;&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;code&gt;ch&gt;byte&lt;/code&gt; is the opposite, taking the original assoc and producing an efficiently indexable mapping from code points to octets.&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;: ch&gt;byte ( assoc -- newassoc )&lt;br /&gt;    [ swap ] assoc-map &gt;hashtable ;&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;Finally, &lt;code&gt;parse-file&lt;/code&gt; puts these all together and makes both mappings, given a stream for the resource file.&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;: parse-file ( stream -- byte&gt;ch ch&gt;byte )&lt;br /&gt;    lines process-contents&lt;br /&gt;    [ byte&gt;ch ] [ ch&gt;byte ] bi ;&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;Next, the structure of the encoding itself is defined. A single tuple named &lt;code&gt;8-bit&lt;/code&gt; is used to represent all 8-bit encodings. It contains the encoding and decoding tables, as well as the name of the encoding. 
The &lt;code&gt;encode-8-bit&lt;/code&gt; and &lt;code&gt;decode-8-bit&lt;/code&gt; words just take some encoding or decoding information and look the code point or octet up in the given table.&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;TUPLE: 8-bit name decode encode ;&lt;br /&gt;&lt;br /&gt;: encode-8-bit ( char stream assoc -- )&lt;br /&gt;    swapd at* [ encode-error ] unless swap stream-write1 ;&lt;br /&gt;&lt;br /&gt;M: 8-bit encode-char&lt;br /&gt;    encode&gt;&gt; encode-8-bit ;&lt;br /&gt;&lt;br /&gt;: decode-8-bit ( stream array -- char/f )&lt;br /&gt;    swap stream-read1 dup&lt;br /&gt;    [ swap nth [ replacement-char ] unless* ] [ nip ] if ;&lt;br /&gt;&lt;br /&gt;M: 8-bit decode-char&lt;br /&gt;    decode&gt;&gt; decode-8-bit ;&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;I wanted to design this, like existing Unicode functionality, to read resource files at parsetime rather than to generate Factor source code. Though I don't expect these encodings to change, the result is still more maintainable as it leaves a lower volume of code. If I were implementing this in C or Java or R5RS Scheme or Haskell98, this wouldn't be possible. So &lt;code&gt;make-8-bit&lt;/code&gt; defines an encoding given a word and the lookup tables to use:&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;: make-8-bit ( word byte&gt;ch ch&gt;byte -- )&lt;br /&gt;    [ 8-bit construct-boa ] 2curry dupd curry define ;&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;code&gt;define-8-bit-encoding&lt;/code&gt; puts everything together. 
It takes a string for the name of an encoding to be defined and a stream, reads the appropriate resource file and defines an 8-bit encoding.&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;: define-8-bit-encoding ( name stream -- )&lt;br /&gt;    &gt;r in get create r&gt; parse-file make-8-bit ;&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;To top it all off, here's what's needed to define all the 8-bit encodings we want:&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;: mappings {&lt;br /&gt;    { "latin1" "8859-1" }&lt;br /&gt;    { "latin2" "8859-2" }&lt;br /&gt;    ! ...&lt;br /&gt;} ;&lt;br /&gt;&lt;br /&gt;: encoding-file ( file-name -- stream )&lt;br /&gt;    "extra/io/encodings/8-bit/" ".TXT"&lt;br /&gt;    swapd 3append resource-path ascii &amp;lt;file-reader&gt; ;&lt;br /&gt;&lt;br /&gt;[&lt;br /&gt;    "io.encodings.8-bit" in [&lt;br /&gt;        mappings [ encoding-file define-8-bit-encoding ] assoc-each&lt;br /&gt;    ] with-variable&lt;br /&gt;] with-compilation-unit&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;So by combining these trivial one-liners or two-liners, you can make something that's not as trivial. The end product is that hard things are made easy, which is the goal of every practical programming language. The point of this isn't to say that this code is perfect (it's very far from that), but just to demonstrate how clear things become when they're broken down in this way.&lt;br /&gt;&lt;br /&gt;When I first started programming Factor, I thought that it only made sense to define things separately when it was conceivable that something else would use them, or that it'd be individually useful for testing, or something like that. But actually, it's useful for more than that: for just making your program clear. In a way, the hardest thing to do when programming in Factor once you have the basics is to name each of these pieces and factor them out properly from your program. 
The result is far more maintainable and readable than if the factoring process has not been done.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/273593670040001243-6085191992759956830?l=useless-factor.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://useless-factor.blogspot.com/feeds/6085191992759956830/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=273593670040001243&amp;postID=6085191992759956830' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/273593670040001243/posts/default/6085191992759956830'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/273593670040001243/posts/default/6085191992759956830'/><link rel='alternate' type='text/html' href='http://useless-factor.blogspot.com/2008/04/programming-in-series-of-trivial-one.html' title='Programming in a series of trivial one-liners'/><author><name>Daniel Ehrenberg</name><uri>http://www.blogger.com/profile/00902922561603041049</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-273593670040001243.post-7953888503404760418</id><published>2008-03-29T22:59:00.000-07:00</published><updated>2008-03-30T10:51:54.395-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='event'/><category scheme='http://www.blogger.com/atom/ns#' term='factor'/><title type='text'>How the Factor meeting went in New York</title><content type='html'>I &lt;a href="http://useless-factor.blogspot.com/2008/03/another-factorcon-in-nyc.html"&gt;invited&lt;/a&gt; all of you, at the very last minute, to come meet me in New York to talk about Factor and 
stuff, and at least two people asked me to post in detail about what happened... so here's my best shot. Dan McCarthy was the brave soul who attended, and we had a really interesting conversation about various aspects of programming. One thing we discussed was the &lt;code&gt;inverse&lt;/code&gt; pattern matching library. I showed Dan how it works, and he found it really interesting that quotations were sequences at runtime&amp;mdash;similar to s-expressions, but directly executed. Dan works as a programmer/sysadmin at a company that provides closed captioning services for media companies, and it seems like a more interesting task than I would have thought. There are some text encoding issues there (the HD captioning standard, if I understand it correctly, actually has encoding left unspecified for characters outside of Windows-1252, though it leaves room for two-byte and three-byte characters) and Dan has been researching them for a project for a Korean client. I explained the encoding definition protocol to Dan, and I'm going to try to get him to implement East Asian encodings, which there seem to be quite a few of in use (Shift-JIS, ISO 2022-JP, GB 2312, Big5, EUC-JP, EUC-KR, GB 18030). These all need big tables for encoding and decoding, and some require state to decode. Many have multiple possible representations of the same string for output, which complicates things somewhat. 
So, there's not much to report, but I've definitely learned my lesson about organizing things: I need to announce things more than 11 days in advance, and I need to advertise them better.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/273593670040001243-7953888503404760418?l=useless-factor.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://useless-factor.blogspot.com/feeds/7953888503404760418/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=273593670040001243&amp;postID=7953888503404760418' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/273593670040001243/posts/default/7953888503404760418'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/273593670040001243/posts/default/7953888503404760418'/><link rel='alternate' type='text/html' href='http://useless-factor.blogspot.com/2008/03/how-factor-meeting-went-in-new-york.html' title='How the Factor meeting went in New York'/><author><name>Daniel Ehrenberg</name><uri>http://www.blogger.com/profile/00902922561603041049</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-273593670040001243.post-4361684192035085520</id><published>2008-03-23T12:55:00.000-07:00</published><updated>2008-03-30T18:21:12.458-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='garbage collection'/><title type='text'>Some more advanced GC techniques</title><content type='html'>After my &lt;a href="http://useless-factor.blogspot.com/2008/02/quick-intro-to-garbage-collection.html"&gt;last&lt;/a&gt; &lt;a 
href="http://useless-factor.blogspot.com/2008/03/little-more-about-garbage-collection.html"&gt;two&lt;/a&gt; posts about garbage collection, some people suggested that more advanced techniques be used to solve the pausing problem. Here's a quick* overview of some of those techniques, some of which can eliminate noticeable pauses and some of which can solve other problems in GC.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;The train algorithm&lt;/strong&gt;&lt;br /&gt;&lt;br /&gt;The idea of the train algorithm is to break the heap (or, really, just the oldest generation) into small chunks that can be collected individually. Each chunk needs to keep a list, or remembered set, of things from the outside that point into it. (Card marking doesn't work so well in the presence of many separate chunks.) Then, crucially, cyclic data structures need to be put in the same chunk, or at least the same group of chunks which get collected together. You can think of the chunks as cars and the groups of chunks as trains.&lt;br /&gt;&lt;br /&gt;It's a bit difficult to get this whole thing sound, though. The precise strategy is described really well in &lt;a href="http://www.ssw.uni-linz.ac.at/General/Staff/TW/Wuerthinger05Train.pdf"&gt;this term paper by Würthinger&lt;/a&gt;. That PDF has tons of great diagrams. Java used to use the train algorithm optionally, but it was deprecated because the train algorithm has high overhead in terms of throughput and it can take several GC cycles to delete a cyclic data structure: as many as there are elements in the cycle.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Incremental mark-sweep&lt;/strong&gt;&lt;br /&gt;&lt;br /&gt;Another thing we can try is making mark-sweep incremental. Mark-sweep isn't that good for general collection, since it can cause memory fragmentation and make allocation slow. However, for just the oldest generation (aka tenured space) in a generational system, it's not completely horrible. 
Because the oldest generation can be pretty large, compaction takes a long time: everything has to be copied. (This is true whether you're using mark-compact or copying collection.)&lt;br /&gt;&lt;br /&gt;So, can we base something off mark-sweep that eliminates long pauses? Well, going by the heading, I guess we could try to make it incremental. There are two pieces to this: incremental mark and incremental sweep. Actually, instead of incremental sweep, we can do either lazy sweep (sweep as much as we need whenever there's an allocation) or concurrent sweep (sweep in a concurrent thread, and have allocating threads block on allocation until there's enough space swept).&lt;br /&gt;&lt;br /&gt;The marking phase is more difficult because of a consistency problem. Imagine this situation with objects A, B and C. A and C are pointed to by something that we know we're keeping. When marking starts, A points to B, and B and C both point to null. The marker visits C and marks it as already visited. Then, before the marker visits A or B, C is changed to point to B and A is changed to point to null. Then, the marker visits A, and A gets marked. But B is never marked, because A no longer points to it and the already-visited C is not scanned again, so it is lost!&lt;br /&gt;&lt;br /&gt;The easiest solution is to trap all pointer writes to watch out for cases like this, making sure that B gets marked when A is changed. This is called a snapshot-at-the-beginning write barrier. But this makes it so that, if A and C both end up pointing to null, B still won't be collected until the next time around. That phenomenon is called &lt;em&gt;floating garbage&lt;/em&gt;, and more subtle strategies remedy it. 
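&lt;br /&gt;&lt;br /&gt;The A, B, C race described above, and the barrier that repairs it, can be sketched in a few lines of Python. This is only a toy model under loud assumptions: every object has a single pointer field, the marker's work list is a plain Python list, and the names &lt;code&gt;Obj&lt;/code&gt;, &lt;code&gt;write_ref&lt;/code&gt;, &lt;code&gt;mark_step&lt;/code&gt; and &lt;code&gt;run&lt;/code&gt; are made up for illustration rather than taken from any real collector.&lt;br /&gt;&lt;pre&gt;
# Toy model of incremental marking. Every name here is hypothetical.

class Obj:
    def __init__(self, name):
        self.name = name
        self.ref = None       # one pointer field per object, for simplicity
        self.marked = False


def write_ref(obj, new_target, barrier_on, work_list):
    """Store a pointer. With the snapshot-at-the-beginning barrier on, the
    value being overwritten is queued so the marker still reaches it."""
    old = obj.ref
    if barrier_on and old is not None and not old.marked:
        work_list.append(old)
    obj.ref = new_target


def mark_step(work_list):
    """One increment of marking: visit a single object and queue its child."""
    obj = work_list.pop()
    if not obj.marked:
        obj.marked = True
        if obj.ref is not None:
            work_list.append(obj.ref)


def run(barrier_on):
    a, b, c = Obj("A"), Obj("B"), Obj("C")
    a.ref = b                                  # A points to B; B and C point to null
    work_list = [a, c]                         # the roots; pop() visits C first
    mark_step(work_list)                       # the marker visits C
    write_ref(c, b, barrier_on, work_list)     # mutator: C now points to B
    write_ref(a, None, barrier_on, work_list)  # mutator: A now points to null
    while work_list:                           # the marker finishes its work
        mark_step(work_list)
    return {o.name: o.marked for o in (a, b, c)}
&lt;/pre&gt;&lt;br /&gt;With the barrier off, &lt;code&gt;run(False)&lt;/code&gt; leaves B unmarked even though C still points to it. With the barrier on, &lt;code&gt;run(True)&lt;/code&gt; marks all three objects, because the write to A queues the pointer it overwrites.&lt;br /&gt;&lt;br /&gt;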
Most of these incremental algorithms can be parallelized with a little bit of work.&lt;br /&gt;&lt;br /&gt;Aside from the &lt;a href="http://www.amazon.com/gp/redirect.html?ie=UTF8&amp;location=http%3A%2F%2Fwww.amazon.com%2FGarbage-Collection-Algorithms-Automatic-Management%2Fdp%2F0471941484%3Fie%3DUTF8%26s%3Dbooks%26qid%3D1202357267%26sr%3D1-1&amp;tag=uselfact-20&amp;linkCode=ur2&amp;camp=1789&amp;creative=9325"&gt;book&lt;/a&gt; I recommended before, a good resource on incremental techniques is &lt;a href="ftp://ftp.cs.utexas.edu/pub/garbage/bigsurv.ps"&gt;this ACM survey on garbage collection&lt;/a&gt;. [&lt;strong&gt;Update&lt;/strong&gt;: There's also &lt;a href="http://chaoticjava.com/posts/parallel-and-concurrent-garbage-collectors/"&gt;this great blog post&lt;/a&gt; which I forgot to mention before. It has lots of pretty diagrams.]&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Garbage-first&lt;/strong&gt;&lt;br /&gt;&lt;br /&gt;The people at Sun have come up with a new garbage collection strategy for Java called garbage-first garbage collection (G1GC). The idea is somewhat similar to the train algorithm: the heap is split up into small chunks which can be collected separately, maintaining a remembered set of inward references. But the G1GC uses all kinds of crazy heuristics to figure out what chunks are most likely to have a small remembered set. This works so well that this "youngness heuristic" can completely replace the generational mechanism. The whole thing is led by user-specified parameters about the maximum allowable pause time and throughput goals.&lt;br /&gt;&lt;br /&gt;There's a &lt;a href="http://research.sun.com/jtech/pubs/04-g1-paper-ismm.pdf"&gt;paper&lt;/a&gt; describing G1GC [&lt;strong&gt;Update&lt;/strong&gt;: link no longer requires ACM access], but I can't really understand it. A more intelligible source is FAQ #4 of the most recent blog post on &lt;a href="http://blogs.sun.com/jonthecollector/"&gt;Jon Masamitsu's blog&lt;/a&gt;. 
(Jon works in Java's GC group.)&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Reference counting with concurrency and cycles&lt;/strong&gt;&lt;br /&gt;&lt;br /&gt;In a nondeterministically multithreaded environment, reference counting has problems. Increment and decrement operations need to be atomic, or else there will be consistency issues. For example, if two concurrent threads try to increment the reference count of a single object at the same time, and it works out that they both read and then both write, then the reference count will only increase by one. This might mean that the memory is freed while there are still references to it! In the same way, a decrement could be missed, and the memory would never be freed.&lt;br /&gt;&lt;br /&gt;A bad solution is to put a lock on each reference count. This is bad because it's slow: every time there's a new reference, you need to not only increment the refcount but also acquire and then release a lock. Another solution is to have a worker thread which handles all increments and decrements; all other threads send messages to it.&lt;br /&gt;&lt;br /&gt;To handle cycles, you could use a hybrid approach, running mark-sweep when memory runs out in order to collect cycles. But there are other approaches. In an explicit refcounting system (where increments and decrements are manual), the user could be expected to insert a "weak reference", one which doesn't increase the refcount, whenever completing a cycle. Another way is to perform a small local marking trace when refcounts are decremented but not set to zero, to make sure there isn't an unreferenced cycle. That's described in &lt;a href="http://www.research.ibm.com/people/d/dfb/papers/Bacon01Concurrent.pdf"&gt;this recent paper&lt;/a&gt;, which also handles concurrency. 
&lt;a href="http://www.jucs.org/jucs_9_8/lazy_cyclic_reference_counting/Lins_R_D.pdf"&gt;Here&lt;/a&gt;'s an optimization on that with a proof of correctness.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Hard real-time GC&lt;/strong&gt;&lt;br /&gt;&lt;br /&gt;So far, I've been talking about minimizing pauses in a vague, general sense. We just want them to be a fraction of how long it takes to do a full tracing collection. But this isn't enough for some applications. Say you're making a video game where a 50ms GC pause (as the best incremental mark-sweep collectors benchmark at, I've heard) means a skipped frame or two. That can be noticeable, and it presents a real disadvantage compared to explicit allocation. Even refcounting doesn't always give really short pauses, since it causes memory fragmentation (making allocation take longer) and deallocation is not scheduled incrementally. That is, if you have a long linked list with just one reference to the head, and that reference ends, then the entire linked list gets deallocated in a chain, with no incremental scheduling.&lt;br /&gt;&lt;br /&gt;What this situation needs is a garbage collector that follows hard real-time guarantees. One way that this guarantee could be phrased is that pauses are at most 1ms, and that at least 7 out of every 10 milliseconds are spent running the main program. This guarantee will be met even if the main program is acting "adversarially", bringing out the worst-case behavior in the collector. It's possible to specify a requirement like this that's unachievable for a particular application, but this requirement works for most things. Different applications can specify different requirements based on, say, their frame rate and how long it takes to render a frame. 
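&lt;br /&gt;&lt;br /&gt;To make the example guarantee above concrete, here is a small Python check of a schedule against it. It is a sketch under stated assumptions, not how any real-time collector verifies itself: time is chopped into whole milliseconds, the schedule is a string with one character per millisecond ("M" for the main program, "G" for the collector), and the function name &lt;code&gt;meets_guarantee&lt;/code&gt; and its parameters are hypothetical.&lt;br /&gt;&lt;pre&gt;
# Hypothetical helper: check a schedule against the example guarantee.
# The schedule has one character per millisecond:
# "M" when the main program runs, "G" when the collector runs.

def meets_guarantee(ticks, max_pause=1, window=10, min_main=7):
    # Longest run of consecutive collector milliseconds.
    longest_pause = max(len(run) for run in ticks.split("M"))
    # Every stretch of consecutive milliseconds that is one window long.
    windows = [ticks[i:i + window] for i in range(len(ticks) - window + 1)]
    enough_main = all(w.count("M") >= min_main for w in windows)
    return max_pause >= longest_pause and enough_main
&lt;/pre&gt;&lt;br /&gt;A schedule such as &lt;code&gt;"MMMG" * 5&lt;/code&gt; passes, since no pause exceeds 1ms and every 10ms window has at least 7ms of main program. &lt;code&gt;"MG" * 6&lt;/code&gt; fails on the window rule, and a schedule containing &lt;code&gt;"GG"&lt;/code&gt; fails on the pause rule. The whole-millisecond granularity is a simplification chosen for the sketch.&lt;br /&gt;&lt;br /&gt;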
For applications like this, absolute throughput (which is basically maximized by a well-tuned generational collector) can be sacrificed in favor of better scheduling.&lt;br /&gt;&lt;br /&gt;This sounds like an impossible dream, but it's actually been implemented in the &lt;a href="http://domino.watson.ibm.com/comm/research_projects.nsf/pages/metronome.index.html"&gt;Metronome&lt;/a&gt; system, implemented for Jikes by IBM. Metronome has been &lt;a href="http://domino.watson.ibm.com/comm/research_projects.nsf/pages/metronome.pubs.html/$FILE/Bacon07Realtime.pdf"&gt;written about&lt;/a&gt; in the ACM Queue and there's also a &lt;a href="http://domino.watson.ibm.com/comm/research_projects.nsf/pages/metronome.pubs.html/$FILE/Bacon03Realtime.pdf"&gt;paper&lt;/a&gt; which is harder to understand but explains more. The goal of the Metronome project is to allow high-level languages to be used for real-time applications on uniprocessor machines. While Java isn't what I'd choose, the GC seems to be the biggest barrier, and it's great that this research is being done.&lt;br /&gt;&lt;br /&gt;The idea is to have an incremental mark-sweep collector (not generational) which segregates the heap into chunks (just for allocation purposes) of roughly the same size data. This minimizes fragmentation. However, fragmentation can still occur, and when a heap segment is too fragmented, it is incrementally copied and compacted to a different piece of memory. Large objects are split up into chunks called arraylets. By all of these techniques, garbage collection can be broken up into small tasks, and an innovative scheduler makes it satisfy the hard real-time guarantees.&lt;br /&gt;&lt;br /&gt;Because the collector isn't generational, and because of the overhead of the scheduler and the floating garbage that's left by the incremental collector, this is far from optimal for applications that don't really need things to be so predictably well-spaced. 
But maybe, if there were more knobs on this algorithm (e.g., if the scheduler could be turned off and more generations could be added), this could be a general-purpose GC that's really useful.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;GC and language features&lt;/strong&gt;&lt;br /&gt;&lt;br /&gt;In the most basic sense, a garbage collection system consists of one exposed function, &lt;code&gt;allocate&lt;/code&gt;, which takes a number of bytes and allocates a region of memory that's that big. But there are some other things that can be useful. For example, for tracing collectors, a &lt;code&gt;collect-garbage&lt;/code&gt; function can be used to do a major collection when the program knows it's idle.&lt;br /&gt;&lt;br /&gt;Another useful feature is finalizers. For most things, it's sufficient to just deallocate memory when it's collected. But think about files. You should always explicitly close a file when you're done with it, but if the programmer makes an error, the file should still be closed once it is unreachable. With a reference counting or mark-sweep collector, this is relatively easy: just have a generic function &lt;code&gt;finalize&lt;/code&gt; that gets called on everything that's collected. With copying collection, the collector maintains a list of objects that have finalizers, and on each copying cycle, this list is traversed to check whether each object has a forwarding pointer in fromspace. If an object with a finalizer doesn't have a forwarding pointer, it has been deleted and the finalizer should be invoked. This avoids a full traversal of the heap.&lt;br /&gt;&lt;br /&gt;Actually invoking the finalizer isn't so simple, because now the object might contain some dead pointers. With a reference counting collector, if you're not collecting a cycle, you can call the finalizers in top-down order (also called topological order), and then the dead pointer issue doesn't exist. 
But this breaks down in the presence of cycles, and is difficult to calculate with a tracing collector. An easier-to-implement strategy is to call the finalizers in arbitrary order, but call them all before garbage is actually collected. Alternatively, everything the finalizer references can be considered a root. But in this situation, programmers have to be very careful not to retain the objects forever.&lt;br /&gt;&lt;br /&gt;This summer, I hope to join this research community in a small way by participating in a &lt;a href="http://www.cs.hmc.edu/reu/projects/garbage-collection/"&gt;Harvey Mudd REU&lt;/a&gt; (summer undergraduate research project) in garbage collection. In previous summers, an idea of &lt;em&gt;blobs&lt;/em&gt; was developed, a generalization of a concept called ephemerons to make weak hashtables without memory leaks. They wrote and published &lt;a href="http://www.cs.hmc.edu/~oneill/papers/Blobs-SPACE.pdf"&gt;a paper&lt;/a&gt; about it. They also researched and &lt;a href="http://www.cs.hmc.edu/~oneill/papers/Trailers-SPACE.pdf"&gt;wrote about&lt;/a&gt; garbage collection techniques for certain hard-to-collect persistent data structures. &lt;br /&gt;&lt;br /&gt;Another project of the leader of this group is called a &lt;a href="http://www.cs.hmc.edu/~oneill/papers/Simplifiers-MSPC.pdf"&gt;simplifier&lt;/a&gt;, which is something that gets invoked in a copying collector to simplify a datastructure when it gets copied. This is a technique that is used in an ad-hoc way in graph-reduction runtimes in purely functional languages: you don't want to copy the whole graph if there's an easy way to simplify it without allocating any new nodes. It should be really fun to research these more advanced techniques for making garbage collection more correct and efficient.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Where to now for more research?&lt;/strong&gt;&lt;br /&gt;&lt;br /&gt;Academia has been working on this since the '60s. 
But recently, big companies like Sun, IBM and Microsoft have been doing more here in order to augment their Java and .NET platforms. Some academic resources to look at to learn more about GC are &lt;a href="ftp://ftp.cs.utexas.edu/pub/garbage/"&gt;at UT Austin's website&lt;/a&gt; (especially bigsurv.ps). There are conferences which discuss memory management, like the &lt;a href="http://www.eecs.harvard.edu/~greg/ismm07/"&gt;International Symposium on Memory Management&lt;/a&gt; and &lt;a href="http://www.cs.cornell.edu/Conferences/space2008/"&gt;SPACE&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;When implementing a garbage collector for research purposes, you probably don't want to build a whole programming language runtime yourself. &lt;a href="http://jikesrvm.org"&gt;Jikes RVM&lt;/a&gt; provides an open-source research virtual machine that you can easily plug different garbage collectors into. Jikes RVM's &lt;a href="http://jikesrvm.org/MMTk"&gt;MMTk&lt;/a&gt; (Memory Manager Toolkit) makes this possible. There are visualization tools, and the heap can be split up into different segments which use different collectors.&lt;br /&gt;&lt;br /&gt;These advanced garbage collection algorithms haven't been implemented many times and aren't well understood by many people. There also hasn't been much work in formally analyzing and comparing these algorithms. This is partly because they're hard to analyze; constant factors have a gigantic effect. Someone came up with &lt;a href="http://www.research.ibm.com/people/d/dfb/papers/Bacon04Unified.pdf"&gt;a unified theory of garbage collection&lt;/a&gt;, though, which analyzes all garbage collection strategies as some combination of marking and reference counting, which can be seen as duals. 
Just like with differential equations, there's no general solution which meets all of our requirements (maximizing throughput and locality, minimizing and organizing pause times, making allocation cheap) at once, though our understanding is always improving. &lt;br /&gt;&lt;br /&gt;&lt;hr/&gt;&lt;br /&gt;* You may be wondering why I keep saying that these posts are short when they're something like four pages long printed out. Mainly, it's because the things that I'm reading are much longer. It'd be kinda hard for me to describe anything meaningful in 500 words or less.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/273593670040001243-4361684192035085520?l=useless-factor.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://useless-factor.blogspot.com/feeds/4361684192035085520/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=273593670040001243&amp;postID=4361684192035085520' title='7 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/273593670040001243/posts/default/4361684192035085520'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/273593670040001243/posts/default/4361684192035085520'/><link rel='alternate' type='text/html' href='http://useless-factor.blogspot.com/2008/03/some-more-advanced-gc-techniques.html' title='Some more advanced GC techniques'/><author><name>Daniel Ehrenberg</name><uri>http://www.blogger.com/profile/00902922561603041049</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>7</thr:total></entry><entry><id>tag:blogger.com,1999:blog-273593670040001243.post-800818812873539753</id><published>2008-03-19T00:42:00.000-07:00</published><updated>2008-03-19T10:52:37.570-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='theory'/><category scheme='http://www.blogger.com/atom/ns#' term='factor'/><title type='text'>Three open language design problems in Factor</title><content type='html'>Factor is progressing at a surprisingly rapid pace for its small developer base. The compiler and virtual machine make code run relatively fast, and they keep getting faster, and there's a large and growing standard library. As far as language design methods go, Factor is fairly unusual. With most languages, when they get to a certain state of maturity, code gets written and backwards compatibility for that code needs to be maintained so that the code base still works.&lt;br /&gt;&lt;br /&gt;In Factor, nearly all code written is distributed in &lt;code&gt;extra/&lt;/code&gt; with Factor itself. We can keep making changes to the language and update the entire code base whenever we want. It's pretty amazing, and allows us to remove unused things and make major changes in the API like &lt;a href="http://useless-factor.blogspot.com/2008/02/designing-api-for-encoded-streams.html"&gt;the recent one with encodings&lt;/a&gt;. So the resulting language design and standard library can be as clean as some abstract, mostly &lt;em&gt;a priori&lt;/em&gt; language standard like Haskell98 or R5RS while being as useful as we want. (Of course, after 1.0, stability will be a major goal.)&lt;br /&gt;&lt;br /&gt;The library is progressing amazingly. 
But within this context, there are a few things that need to be worked out language-design-wise.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Mixing currying with macros&lt;/strong&gt;&lt;br /&gt;&lt;br /&gt;Macros (by which I mean the things you define with &lt;a href="http://factorcode.org/responder/help/show-word?word=MACRO%3a&amp;vocab=macros"&gt;&lt;code&gt;MACRO:&lt;/code&gt;&lt;/a&gt;, described &lt;a href="http://factor-language.blogspot.com/2007/08/factorcon-summer-2007.html"&gt;here&lt;/a&gt;) are really an optimization. You could, instead of using &lt;code&gt;MACRO:&lt;/code&gt;, do a normal word definition with &lt;code&gt;call&lt;/code&gt; at the end: Macros just take a number of things from the stack and generate a quotation which is called. The thing you put in the body of the definition is everything but the &lt;code&gt;call&lt;/code&gt;. The thing macros do is move this calculation of the quotation as early as possible. If it's possible, the quotation is calculated when the optimizing compiler passes through the code; otherwise, it's calculated at runtime and the result is memoized. This way, it's only calculated once for any given input.&lt;br /&gt;&lt;br /&gt;Macros work great as a way to describe things like &lt;a href="http://useless-factor.blogspot.com/2007/06/concatenative-pattern-matching.html"&gt;&lt;code&gt;undo&lt;/code&gt;&lt;/a&gt;, &lt;code&gt;cond&lt;/code&gt; and &lt;code&gt;case&lt;/code&gt;. But what if we want to use locals in a &lt;code&gt;cond&lt;/code&gt;? Or curry something onto the quotation used by &lt;code&gt;undo&lt;/code&gt;? 
These are both instances of the same problem, since locals boil down to currying and a quotation transformation and so does &lt;code&gt;undo&lt;/code&gt;.&lt;br /&gt;&lt;br /&gt;Consider the following code:&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;:: foo ( bar -- baz )&lt;br /&gt;    {&lt;br /&gt;        { [ bar 1 = ] [ 2 ] }&lt;br /&gt;        { [ t ] [ bar 1 + ] }&lt;br /&gt;    } cond ;&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;The double colon (&lt;code&gt;::&lt;/code&gt;) indicates that this definition uses locals, and that &lt;code&gt;bar&lt;/code&gt; is a lexically scoped local variable taken from the stack. The code I wrote won't work, because the locals implementation doesn't know how to deal with &lt;code&gt;cond&lt;/code&gt;. When dealing with ordinary quotations, it can curry the locals that get used onto the front, and do the typical locals transformation inside the quotation. But &lt;code&gt;cond&lt;/code&gt; takes an array from the stack. What should it do there?&lt;br /&gt;&lt;br /&gt;Similarly, consider this code:&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;: second-if-tagged ( pair first-tag -- second )&lt;br /&gt;    [ swap 2array ] curry undo ;&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;This code will work as expected, but the inverse of the quotation has to be compiled separately on each run. Even worse, there's a memory leak as macros memoize their input: the &lt;code&gt;first-tag&lt;/code&gt; would never be freed.&lt;br /&gt;&lt;br /&gt;Ideally, &lt;code&gt;undo&lt;/code&gt; and &lt;code&gt;cond&lt;/code&gt; would specify some way to deal with currying and the transformation that locals use. But how should this be structured?&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Formalizing protocols&lt;/strong&gt;&lt;br /&gt;&lt;br /&gt;Right now, mixins are used for two things. Some, like &lt;code&gt;plain-writer&lt;/code&gt; (in the latest development sources), implement a bunch of generic words. 
Others, like &lt;code&gt;immutable-sequence&lt;/code&gt;, have a bunch of generic words defined on them, which their instances are expected to implement. Some things, like &lt;code&gt;virtual-sequence&lt;/code&gt;, do both.&lt;br /&gt;&lt;br /&gt;For things like &lt;code&gt;sequence&lt;/code&gt;, there's only an implicit relationship with the generic words that are associated with them. In &lt;a href="http://factorcode.org/responder/help/show-vocab?vocab=delegate"&gt;extra/delegate&lt;/a&gt;, there's a formally-defined set of generic words, but it's not associated with the &lt;code&gt;sequence&lt;/code&gt; mixin.&lt;br /&gt;&lt;br /&gt;It'd be even nicer if we could make some static verification tools based on this, just to throw some kind of error, some time before runtime, if a necessary method isn't defined. But it's a little harder to do this than it sounds. For example, for sequences, there is a default method for both &lt;code&gt;nth&lt;/code&gt; and &lt;code&gt;nth-unsafe&lt;/code&gt;. These are mutually recursive, so that only one has to be implemented. But every sequence has to implement at least one of these to be valid. How should this be represented? It couldn't be done automatically. There are also some things, like &lt;code&gt;like&lt;/code&gt;, whose implementation is completely optional, and &lt;code&gt;length&lt;/code&gt;, whose implementation is mandatory.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;A soft type system&lt;/strong&gt;&lt;br /&gt;&lt;br /&gt;Factor is dynamically typed, and to some degree, this is a strength. Inherent in any fully static type system is a limitation preventing certain programs from being run. It's much easier to have a dynamic interactive development environment that uses dynamic typing. If Factor had a static type system, it's unlikely that it could have had the gradual development of programming concepts that it did.&lt;br /&gt;&lt;br /&gt;But, at the same time, there are some advantages to static typing. 
For one, it opens up a bunch of optimization opportunities for the compiler. The Factor optimizing compiler already infers some things about types. This happens in two ways. The compiler infers the stack effect of arbitrary quotations, that is, how many things a block of code takes and leaves on the stack. This is exposed to the user through the &lt;code&gt;infer.&lt;/code&gt; word. A weakness of this is that stack effects can only be inferred if all quotations can be inlined at compile time, and deeper knowledge about quotations could fix this.&lt;br /&gt;&lt;br /&gt;The other way type inference happens is that the compiler eliminates type checks and method dispatch where it can prove that something is of a particular class. This happens in a more ad-hoc way and is not exposed to the user because it doesn't infer which types are required. If there were a stronger type system, then it might be possible to remove even more type checks, and it might be possible to automatically insert what's now done by the hints mechanism.&lt;br /&gt;&lt;br /&gt;Another advantage is that a user-visible static type system can reject programs that won't work, and can provide useful metadata about programs based on their type. Factor can currently reject programs whose stack effect comment doesn't match the stack effect that's inferred. Usually, once you get the hang of stack programming, you don't need this, but occasionally it still comes in handy. Type checking does the same thing.&lt;br /&gt;&lt;br /&gt;I'm not sure if Factor will ever have a type system that is this expressive, but I think it's pretty cool that in Haskell, you can have a function which dispatches (at compile time) off its expected &lt;em&gt;return value&lt;/em&gt;. 
This is seriously cool, and without it, monads would be a bit more awkward.&lt;br /&gt;&lt;br /&gt;Anyway, none of these things represents an urgent need, but there's also something to consider: if we make a static type system for Factor, we don't want to restrict the flexibility of Factor at all. There would need to be a number of things. First, a staged approach similar to the way metaprogramming currently works. Second, the type system would have to be a soft typing system, which can infer many things but doesn't reject everything that it can't infer. It can warn the user if it can tell that there'll be a type mismatch, though. Neither of these things has been attempted in a concatenative language yet.&lt;br /&gt;&lt;br /&gt;[&lt;strong&gt;Update&lt;/strong&gt;: A commenter pointed out that it's also possible to have explicit declarations and no inference. A version of this is planned to happen before 1.0, in the form of a better syntax on top of multimethods. Basically, you have a generic word with only one method, and that method is for the types of arguments that you specify. This allows for earlier failure and some generic dispatch/type check eliminations. It's still not clear how more general, before-runtime type checking (with or without inference) can be done, though.]&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Conclusion&lt;/strong&gt;&lt;br /&gt;&lt;br /&gt;There are a bunch more language design problems to come, but these three seem to be the biggest right now. It's possible that these won't be solved, and for the last one, I'm not sure if that'd be too horrible. I hope that the currying problem can be resolved before Factor 1.0, and I expect everything to have some sort of resolution by Factor 2.0. The sooner we solve the problems, the easier it will be to update the code to conform with the solution. 
I just wish I had good enough ideas to figure them out right now.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/273593670040001243-800818812873539753?l=useless-factor.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://useless-factor.blogspot.com/feeds/800818812873539753/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=273593670040001243&amp;postID=800818812873539753' title='5 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/273593670040001243/posts/default/800818812873539753'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/273593670040001243/posts/default/800818812873539753'/><link rel='alternate' type='text/html' href='http://useless-factor.blogspot.com/2008/03/three-open-language-design-problems-in.html' title='Three open language design problems in Factor'/><author><name>Daniel Ehrenberg</name><uri>http://www.blogger.com/profile/00902922561603041049</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>5</thr:total></entry><entry><id>tag:blogger.com,1999:blog-273593670040001243.post-4441878924733381321</id><published>2008-03-17T20:30:00.000-07:00</published><updated>2008-03-18T19:53:16.882-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='event'/><category scheme='http://www.blogger.com/atom/ns#' term='factor'/><title type='text'>Another FactorCon in NYC</title><content type='html'>&lt;a href="http://zedshaw.com/"&gt;Zed Shaw&lt;/a&gt; suggested to me recently that we hold a Factor-related meeting this month in New York, since I'll be there on March 28th. 
Since he hasn't done anything to organize it, I thought I'd take the lead: let's meet at&lt;br /&gt;&lt;br /&gt;&lt;a href="http://earthmatters.com/"&gt;Earth Matters&lt;/a&gt;&lt;br /&gt;&lt;a href="http://maps.google.com/maps?ie=UTF-8&amp;oe=utf-8&amp;rls=org.mozilla:en-US:official&amp;client=firefox-a&amp;um=1&amp;q=earth+matters&amp;near=New+York,+NY&amp;fb=1&amp;cid=0,0,12589332362799576843&amp;sa=X&amp;oi=local_result&amp;resnum=1&amp;ct=image"&gt;177 Ludlow St, Manhattan, NY, USA&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;This'll happen at 7 PM, on Friday the 28th. The idea of the event is to talk about Factor, meaning (depending on who comes) sharing current projects in Factor, or a Factor tutorial for beginners, or a little of both. It'll all be very informal. You can come whether or not you know Factor. &lt;br /&gt;&lt;br /&gt;Unlike the recent PyCon, this Factor convention will &lt;em&gt;not&lt;/em&gt; be beholden to our sponsors' interests! But if we had any sponsors we might be... it's a little too late to solicit them.&lt;br /&gt;&lt;br /&gt;If you can come, please &lt;a href="mailto:ehrenbed@carleton.edu"&gt;send me an email&lt;/a&gt; so I can get an idea of how many people are going to show up. (Sorry about the short notice.) 
I hope to see you there!&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/273593670040001243-4441878924733381321?l=useless-factor.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://useless-factor.blogspot.com/feeds/4441878924733381321/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=273593670040001243&amp;postID=4441878924733381321' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/273593670040001243/posts/default/4441878924733381321'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/273593670040001243/posts/default/4441878924733381321'/><link rel='alternate' type='text/html' href='http://useless-factor.blogspot.com/2008/03/another-factorcon-in-nyc.html' title='Another FactorCon in NYC'/><author><name>Daniel Ehrenberg</name><uri>http://www.blogger.com/profile/00902922561603041049</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-273593670040001243.post-177972849193175197</id><published>2008-03-14T10:13:00.000-07:00</published><updated>2008-03-16T14:13:43.383-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='factor'/><category scheme='http://www.blogger.com/atom/ns#' term='unicode'/><title type='text'>A protocol for creating encodings</title><content type='html'>I &lt;a href="http://useless-factor.blogspot.com/2008/02/designing-api-for-encoded-streams.html"&gt;previously&lt;/a&gt; wrote about the API that I designed for creating streams with encodings in Factor. 
I'm not sure if that's going to stick around permanently in this form, due to concerns about easily changing stream encodings and grouping encodings with a pathname as one object on the stack.&lt;br /&gt;&lt;br /&gt;Either way, I wanted to describe the protocol I'm developing for actually defining new encodings in Factor. This code isn't completely debugged but it should be done and in the main Factor development repository very soon. There are four words in the encoding protocol:&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;GENERIC: &amp;lt;encoder&gt; ( stream encoding -- encoder-stream )&lt;br /&gt;GENERIC: &amp;lt;decoder&gt; ( stream decoding -- decoder-stream )&lt;br /&gt;GENERIC: encode-char ( char stream encoding -- )&lt;br /&gt;GENERIC: decode-char ( stream decoding -- char/f )&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;Let's go through these. First, the constructors &lt;code&gt;&amp;lt;encoder&gt;&lt;/code&gt; and &lt;code&gt;&amp;lt;decoder&gt;&lt;/code&gt;. These are very rarely called directly by the library user, more often by stream constructors. For example, when you do &lt;code&gt;"filename" utf8 &amp;lt;file-reader&gt;&lt;/code&gt;, what's going on underneath is &lt;code&gt;"filename" (file-reader) utf8 &amp;lt;decoder&gt;&lt;/code&gt;. &lt;code&gt;(file-reader)&lt;/code&gt; is a low-level constructor that gives you a binary stream, and &lt;code&gt;&amp;lt;decoder&gt;&lt;/code&gt; wraps it in a decoded stream using the specified encoding descriptor, &lt;code&gt;utf8&lt;/code&gt;.&lt;br /&gt;&lt;br /&gt;I have some slightly funny methods on &lt;code&gt;&amp;lt;encoder&gt;&lt;/code&gt; and &lt;code&gt;&amp;lt;decoder&gt;&lt;/code&gt;. See, right now, all encodings are tuples, and their abstract descriptors are tuple classes. All tuple class symbols are in the class &lt;code&gt;tuple-class&lt;/code&gt;, and all tuples are in the class &lt;code&gt;tuple&lt;/code&gt;. 
So we can define two methods on each constructor word: for tuple classes, one which makes an empty instance of the encoding tuple class and calls the constructor again; and for encoding &lt;em&gt;tuples&lt;/em&gt;, one which actually puts together the instance of the physical &lt;code&gt;encoder&lt;/code&gt; or &lt;code&gt;decoder&lt;/code&gt; tuple. Here's how it looks:&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;M: tuple-class &amp;lt;decoder&gt; construct-empty &amp;lt;decoder&gt; ;&lt;br /&gt;M: tuple &amp;lt;decoder&gt; f decoder construct-boa ;&lt;br /&gt;&lt;br /&gt;M: tuple-class &amp;lt;encoder&gt; construct-empty &amp;lt;encoder&gt; ;&lt;br /&gt;M: tuple &amp;lt;encoder&gt; encoder construct-boa ;&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;One reason these need to be generic is for things like binary streams, where methods on these generic words are implemented as dummies: a binary encoding is just the lack of encoding.&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;TUPLE: binary ;&lt;br /&gt;M: binary &amp;lt;encoder&gt; drop ;&lt;br /&gt;M: binary &amp;lt;decoder&gt; drop ;&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;Another reason is that certain encodings require processing at the beginning. For example, UTF-16 should write a byte order mark (BOM) immediately when it's initialized for writing, and read a BOM immediately when it's initialized for reading.&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;M: utf16 &amp;lt;decoder&gt; ( stream utf16 -- decoder )&lt;br /&gt;    2 rot stream-read bom&gt;le/be &amp;lt;decoder&gt; ;&lt;br /&gt;&lt;br /&gt;M: utf16 &amp;lt;encoder&gt; ( stream utf16 -- encoder )&lt;br /&gt;    drop bom-le over stream-write utf16le &amp;lt;encoder&gt; ;&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;Now, let's look at the other words. The idea of &lt;code&gt;encode-char&lt;/code&gt; and &lt;code&gt;decode-char&lt;/code&gt; is that it's simpler for encodings to encode or decode one character than to implement all the relevant functions of the stream protocol. 
&lt;code&gt;encode-char&lt;/code&gt; takes an encoding, an underlying stream and a character to write to that underlying stream.&lt;br /&gt;&lt;br /&gt;The inverse, &lt;code&gt;decode-char&lt;/code&gt;, takes an underlying stream and an encoding and uses the encoding to pull a character from the stream. For everything I've implemented so far, the encoding is dropped after method dispatch, but when things like Shift JIS, which require state in decoding, are implemented, the state will be stored in the tuple.&lt;br /&gt;&lt;br /&gt;This is all much simpler than my previous design, which required looping to decode a single character and forced encodings to adopt a complicated state-machine-based model. This is something like the third iteration of the encoding protocol I've made, and the code is finally starting to look good.&lt;br /&gt;&lt;br /&gt;In Factor, it takes a little bit of work to make certain things, like encodings, have clean code. The appropriate abstractions don't fall out as immediately obvious, but eventually they're found. The result is far more maintainable and clean. I'm not sure what this would imply on big projects with bad programmers. (But I plan to never work in an environment like that; better to be an academic if good work in a small company can't be found.)&lt;br /&gt;&lt;br /&gt;Anyway, future pie-in-the-sky plans for encodings include treating cryptographic protocols and compression as encodings (under different protocols, of course). This is really cool: there are five orthogonal layers: stream, cryptography, compression, text encoding and usage. It'll be possible to compose them and factor out their compositions in any way you want! But this doesn't exist, so I probably shouldn't even be talking about it.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Update&lt;/strong&gt;: Fixed stupid typos. 
Thanks Slava!&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/273593670040001243-177972849193175197?l=useless-factor.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://useless-factor.blogspot.com/feeds/177972849193175197/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=273593670040001243&amp;postID=177972849193175197' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/273593670040001243/posts/default/177972849193175197'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/273593670040001243/posts/default/177972849193175197'/><link rel='alternate' type='text/html' href='http://useless-factor.blogspot.com/2008/03/protocol-for-creating-encodings.html' title='A protocol for creating encodings'/><author><name>Daniel Ehrenberg</name><uri>http://www.blogger.com/profile/00902922561603041049</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-273593670040001243.post-673312772573787359</id><published>2008-03-09T12:00:00.000-07:00</published><updated>2008-03-09T19:15:31.050-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='ignorable'/><category scheme='http://www.blogger.com/atom/ns#' term='garbage collection'/><title type='text'>Explaining garbage collection</title><content type='html'>Imagine you have a lot of garbage in your house. You have so much garbage that there's no room to put any new stuff. 
This isn't garbage that you've thrown in the garbage can, and it's a little unclear what's garbage and what's not.&lt;br /&gt;&lt;br /&gt;See, you've built a bunch of Rube Goldberg contraptions in your house, and some machines share parts. Your washing machine might use a pencil somewhere to write down how much longer it has until it's done, and your dryer might use the same pencil on a different pad of paper. Even if you decide you don't want the washing machine, you can't just throw out the pencil (though you can throw out the washer's pad of paper). So how do you determine what you can safely throw away?&lt;br /&gt;&lt;br /&gt;It'd be really tedious to do this yourself, so you bring out your handy dandy garbage collection robot. You give the robot a list of Rube Goldberg machines that you use, and the robot will look at what uses what. It'll throw out everything that is unused once it figures everything out.&lt;br /&gt;&lt;br /&gt;One strategy the robot could use is to write down a list of all of the objects in your house. The robot will then go down the list of machines that you use and put a mark next to each one you say you use. Then, until every marked object has been visited, the robot goes to each marked object and marks everything that it uses. When it's done, it can go back to all of the unmarked objects and throw them out. This is known as "mark and sweep" garbage collection.&lt;br /&gt;&lt;br /&gt;Another strategy is, when there's no more room, to build a new house, and then put in it a copy of everything that you use. Then, you look at everything that those Rube Goldberg machines use and make a copy of that too, for use in the new house. This doesn't have to be so inefficient, since you can actually reuse the first house again the next time garbage collection happens. 
This is known as "copying" garbage collection.&lt;br /&gt;&lt;br /&gt;The last strategy is to just tell the robot, when you're building your machines, what uses what, and when you start and stop using things. The robot will keep a count of how many things use what. If only one machine is using something, or if you're using it directly and nothing else is, and then that usage stops, then you tell the robot and it throws out that thing. When it throws that piece of garbage out, it has to remember that each thing that that object uses is now used by one fewer thing. This is called "reference counting", because you count how many things refer to a particular object.&lt;br /&gt;&lt;br /&gt;Now imagine all of this in a computer. Computer programs often amount to Rube Goldberg machines on a much larger scale. Different pieces of RAM refer to other pieces of RAM, and it's hard for the programmer to tell what she can safely throw out. By using a garbage collector, this can all be handled automatically.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/273593670040001243-673312772573787359?l=useless-factor.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://useless-factor.blogspot.com/feeds/673312772573787359/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=273593670040001243&amp;postID=673312772573787359' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/273593670040001243/posts/default/673312772573787359'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/273593670040001243/posts/default/673312772573787359'/><link rel='alternate' type='text/html' href='http://useless-factor.blogspot.com/2008/03/explaining-garbage-collection.html' title='Explaining garbage collection'/><author><name>Daniel 
Ehrenberg</name><uri>http://www.blogger.com/profile/00902922561603041049</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-273593670040001243.post-1859765155110203891</id><published>2008-03-07T01:44:00.000-08:00</published><updated>2008-03-07T12:29:39.642-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='factor'/><category scheme='http://www.blogger.com/atom/ns#' term='garbage collection'/><title type='text'>A little more about garbage collection</title><content type='html'>In my &lt;a href="http://useless-factor.blogspot.com/2008/02/quick-intro-to-garbage-collection.html"&gt;last post&lt;/a&gt; about garbage collection, I mentioned that there are more complicated and efficient algorithms to do GC better. A few people asked me for a followup describing those, so here's my best shot. For inspiration, let's look at an excellent runtime system for a horrible programming language.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Java's garbage collection&lt;/strong&gt;&lt;br /&gt;&lt;br /&gt;Sun's HotSpot JVM gives us a good example of how to go about garbage collection. Java may not be the best programming language, but for more than 10 years Sun has been working on a good implementation of a garbage collector. What they have now is pretty good, and programs can expect 95-99% throughput with few noticeable pauses. Everything can be tweaked by advanced users through command-line options.&lt;br /&gt;&lt;br /&gt;Originally, the JVM used a very inefficient, naive GC algorithm. (I think it was mark-sweep, but I can't say for sure because the docs are fairly sketchy.) By version 1.2, Java switched to a generational scheme where copying (originally mark-compact) is used for the younger generations and mark-compact is used for the oldest generation. 
In version 1.3, there was an optional implementation of the train algorithm for applications which needed to avoid a pause, but this has since been discontinued. In version 1.4, an incremental mark-sweep algorithm was introduced for programs with harder real-time constraints, along with "heap ergonomics"--smart runtime heap and generation resizing. In Java 1.5, the main maximum-throughput generational collector was upgraded to use multiple threads in minor collections. Java 7 will have a new collector, invented at Sun, called Garbage First, which can be seen as an extremely complicated generalization of generational GC which I don't completely understand. A listing of the collectors in Java 6, along with information about G1GC, is &lt;a href="http://blogs.sun.com/jonthecollector/entry/our_collectors"&gt;here&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;Java uses some interesting concepts in its garbage collector, and in this post I hope to describe enough to understand the ideas behind these things, though implementation details will vary wildly. I haven't described the incremental algorithms here (train, G1, concurrent mark-sweep), but hope to do so in a later post.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Increasing throughput with generational collection&lt;/strong&gt;&lt;br /&gt;&lt;br /&gt;We want to make our garbage collector such that programmers don't really have to worry about memory allocation costs. In copying and mark-compact collection systems, the cost of allocation is just incrementing a pointer, so this is great. Or is it? In truth, the cost is higher: each time we allocate memory, we move closer to having to do a full garbage collection, which results in either a traversal of the heap or copying of all the data on the heap. This isn't so good if we have to do it a lot, so memory allocation isn't so cheap.&lt;br /&gt;&lt;br /&gt;With a quick realization, we can make this more efficient: most objects die young. This is called the weak generational hypothesis. 
(There's a stronger variant of this, but it's not really true, so we'll ignore it.) Anyway, once we have this realization, we can use &lt;em&gt;generational garbage collection&lt;/em&gt;, or garbage collection with objects segregated by age.&lt;br /&gt;&lt;br /&gt;Here's the idea, in somewhat better-thought-out form: let's take the heap and split it into three generations. The first generation is called the nursery, the second is called aging space and the last tenured space. Aging space is bisected into a "fromspace" and a "tospace" in the way that copying garbage collection works. When objects get allocated, they are first put in the nursery. When the nursery fills up, there is a &lt;em&gt;minor garbage collection&lt;/em&gt;, where everything in the nursery which is referenced from the roots is copied to the next generation up, the place in aging space where allocation happens. When aging space fills up, we copy to the other aging space, back and forth in the style that copying GC normally works. A minor GC needs to happen whenever this takes place. When aging space is full (or almost full) after one of these back-and-forth aging space GC cycles, an &lt;em&gt;intermediate collection&lt;/em&gt; takes place, where things in aging space are moved to tenured space, and things in the nursery are moved to aging space. Tenured space can be managed by a mark-compact algorithm, or potentially other things instead. When all of tenured space (and everything below it) collects garbage, it's called a &lt;em&gt;major garbage collection&lt;/em&gt;.&lt;br /&gt;&lt;br /&gt;There are variations to this, but that's the basic idea. Except what I wrote above is unsound, potentially: in a minor or intermediate GC, it's insufficient to only consider the roots; we also have to treat pointers from older generations to younger generations as roots. 
The idea is that we can figure out where these roots are &lt;em&gt;without&lt;/em&gt; traversing the entire heap.&lt;br /&gt;&lt;br /&gt;There are two ways to do this, using a remembered set or a card marking system. Both of these require a write barrier, or a small piece of code that's executed when writing a pointer. With a remembered set, there's a list of memory addresses stored which contain old-to-young pointers. The write barrier for this strategy must check each write operation to see if a pointer is being stored, and if so, if it's an old-to-young pointer. A more efficient strategy is card marking: divide the heap for older generations into cards of, say, 128 bytes each, and keep one mark bit per card. When an object is modified, the write barrier unconditionally sets the bit for the card containing it. Then, when a garbage collection occurs, only the marked cards need to be scanned for old-to-young pointers. If no such pointer is found in a card, the card can be unmarked.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;A few more spaces to put stuff&lt;/strong&gt;&lt;br /&gt;&lt;br /&gt;What I've described so far interacts terribly with concurrency. This is because there has to be a global lock on allocation, which can form a bottleneck if there are many threads and allocation is relatively frequent. To fix this, just make a separate nursery for each thread. This is called a thread-local allocation buffer (TLAB). The only thing you have to watch out for is that, whenever there's a GC going on from one TLAB, other allocation must stop or there is a risk that more than one TLAB could try to move to aging space at a time. Also, if something's referenced from more than one thread, it has to be moved outside of a TLAB (or else the write barrier has to be adapted to deal with inter-thread pointers).&lt;br /&gt;&lt;br /&gt;It's fairly inefficient to copy large objects, so a collection strategy more like mark-sweep might be more appropriate. 
To have this, we can make a &lt;em&gt;large object area&lt;/em&gt; where the bodies of large objects are put immediately when they are allocated. Though it's managed by a free list, fragmentation won't be so severe (though it still happens) because of the size of the objects involved. Headers for these objects aren't in the LOA itself but rather passed around the heap like regular objects, promoted through various age groups in the normal way.&lt;br /&gt;&lt;br /&gt;Alternatively, large objects could just be immediately placed in the oldest generation. This strategy works best when the oldest generation isn't compacting.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;How big should things be?&lt;/strong&gt;&lt;br /&gt;&lt;br /&gt;Counterintuitively, it can actually speed things up for the nursery and aging space to be relatively small. It's good for the nursery to be small because, if it's small enough, it can fit entirely in the CPU's L2 cache for faster access. Also, since most of it is expected to be garbage (since most objects die young), there's no benefit to having it be large. It's good for the aging space to be relatively small, too, since if it's too big, then the same long-lived objects will bounce back and forth inside of it, needlessly copied and increasing pause times. It'd be better if these objects were promoted to tenured space earlier. Also, aging space represents objects allocated at around the same time, so if they're moved to tenured space at the same time, they'll be close together in memory, improving locality.&lt;br /&gt;&lt;br /&gt;When the tenured space fills up and stays filled up (or almost filled up) after a major garbage collection, the heap needs more space. 
You could increase the size of all of the generations, but because of the reasons in the previous paragraph, it's usually best just to increase the size of the tenured space, and maybe the large object area if you're using one.&lt;br /&gt;&lt;br /&gt;So, say the heap size required was very high, but now not as much is needed. How can we go back and make the heap smaller? It wouldn't be good to immediately free part of the heap when it's not being used, because it might be needed soon again. A good strategy is to set a soft goal for throughput: shrink the heap as long as the total throughput, or percentage of time spent not collecting garbage, is below a certain bound. This will avoid a cycle of growing and shrinking the heap while allowing the heap to shrink as long as heap shrinking doesn't occupy too much time.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Why this is still insufficient&lt;/strong&gt;&lt;br /&gt;&lt;br /&gt;Is this the end of what we need to know about garbage collection? When I started writing this post, I thought it would be. Generational collection, along with TLABs and a LOA, can significantly improve the performance of garbage collection, but there's still the risk of long, noticeable pauses caused by a full traversal of the heap.&lt;br /&gt;&lt;br /&gt;So I still need to do a bit more research before being able to implement a perfect garbage collection algorithm for Factor which can avoid noticeable pauses. Maybe G1 is the answer here (but it's really complicated), or maybe something simpler like the train algorithm or &lt;a href="http://chaoticjava.com/posts/parallel-and-concurrent-garbage-collectors/"&gt;concurrent mark-sweep&lt;/a&gt; (but these both have significant performance disadvantages). 
Until then, heap ergonomics, mark-compact or mark-sweep in the oldest generation and maybe a large object area would be good places to start in improving Factor's memory management system.&lt;br /&gt;&lt;br /&gt;Right now, Factor's system consists of a three-generational copying collector which grows all generations by a particular factor when the heap is exhausted. The heap never shrinks. There is a separate section of the heap for compiled code, which uses a separate mark-sweep system, which can potentially make the system run out of memory when it doesn't have to. I hope this can be fixed by Factor 1.0, but if not, the GC will definitely be redone in Factor 2.0 when multiple system threads are supported.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/273593670040001243-1859765155110203891?l=useless-factor.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://useless-factor.blogspot.com/feeds/1859765155110203891/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=273593670040001243&amp;postID=1859765155110203891' title='13 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/273593670040001243/posts/default/1859765155110203891'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/273593670040001243/posts/default/1859765155110203891'/><link rel='alternate' type='text/html' href='http://useless-factor.blogspot.com/2008/03/little-more-about-garbage-collection.html' title='A little more about garbage collection'/><author><name>Daniel Ehrenberg</name><uri>http://www.blogger.com/profile/00902922561603041049</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>13</thr:total></entry><entry><id>tag:blogger.com,1999:blog-273593670040001243.post-1299193571159156082</id><published>2008-02-26T22:21:00.000-08:00</published><updated>2008-03-04T11:32:44.052-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='garbage collection'/><title type='text'>A quick intro to garbage collection</title><content type='html'>I've been reading about garbage collection algorithms to potentially create a better one for Factor, so I thought I'd share the basics of what I've learned so far. Garbage collection, by the way, is the automatic reclamation of unreferenced heap-allocated memory. This means you can malloc without worrying about free, since the runtime system does it all for you.&lt;br /&gt;&lt;br /&gt;Modern programming languages like Lisp, Java, Python, Prolog, Haskell and Factor all have some form of garbage collection built in, whereas C, C++ and Forth require explicit freeing of unused memory.* Because all heap-allocated data is handled by the garbage collector (and stack allocation is rarely, if ever, used), garbage collector performance has an impact on all programs written in a language with GC.&lt;br /&gt;&lt;br /&gt;For this reason, a lot of effort has been put into maximizing throughput (the fraction of time spent not doing garbage collection), and more recently, minimizing the length of individual pauses, possibly at the expense of some throughput. Other good things a garbage collector can do are improving locality (for both cache and virtual memory paging purposes), reducing memory fragmentation, and providing fast allocation. When everything is considered, a well-constructed garbage collector can actually make a program have greater throughput than manual memory management. When there's efficient allocation, then programmers can be free to use higher-level techniques. 
First, let's look at the simplest garbage collection algorithms.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Reference counting&amp;mdash;the naive algorithm&lt;/strong&gt;&lt;br /&gt;&lt;br /&gt;Here's a pretty easy way to automatically collect unused memory. Put a counter on each object, signifying how many outside references there are to this object. When it gets to zero, you know there are no references to it, so it can be deleted. When another object starts referencing that object, it needs to increment the reference count by one. When it stops referencing it, the referrer needs to decrement the reference count.&lt;br /&gt;&lt;br /&gt;This adds just a little overhead: each time a pointer is stored or overwritten, you have to adjust a reference count. But this cost is distributed throughout the program, and it's not that large; no noticeable pause is caused. Another piece of overhead is that each reference-counted object has to have a counter which is as big as a pointer, since potentially, everything in memory points to a particular object. In practice, it doesn't need to be made this big for most applications, but in the case of overflow, you need some other mechanism to clean up the mess.&lt;br /&gt;&lt;br /&gt;There's a bigger problem, though. Reference counting fails to collect cyclic data structures. Say I have a doubly linked list, or a tree with parent pointers. Then simple reference counting will fail to delete it when there are no outside references. Here's why: say we have three objects, A, B and C. A points to B. B points to C. C points to B. There are no other pointers to B and C. We'll assume A is a local variable, so its reference count is 1. B's reference count is 2, since A and C point to it. C's reference count is 1, since B points to it. When A goes out of scope and stops pointing to B, then B's reference count will be decremented. 
So B and C will have a count of 1 and they won't be deleted!&lt;br /&gt;&lt;!--&lt;br /&gt;&lt;small&gt;&lt;br /&gt;&lt;img src="http://bp3.blogger.com/_JAVrje_IwhQ/R8bQeEK75GI/AAAAAAAAAAM/fo9vL2_nxIU/s320/refcount1.png"/&gt;&lt;br /&gt;The initial cyclic data structure. There is an external pointer to A, so everything is reachable&lt;br /&gt;&lt;img src="http://bp2.blogger.com/_JAVrje_IwhQ/R8bQ00K75HI/AAAAAAAAAAU/yAdtD8EFfb4/s320/refcount2.png"/&gt;&lt;br /&gt;After node A loses its external reference, B and C still have nonzero refcounts so they will not be freed, though A will be&lt;br /&gt;&lt;/small&gt;&lt;br /&gt;--&gt;&lt;br /&gt;There are a couple of different ways to deal with this. One is to detect cycles from time to time. This isn't so good because cycles might be kinda long, and you don't want to go chasing pointers around unless you know where you're going. There's no good way to do this. Another way is to use a hybrid technique, where cycles are collected with something which doesn't have this weirdness. The easiest way to do this is with partial mark-sweep collection.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Ensuring soundness with mark-sweep&lt;/strong&gt;&lt;br /&gt;&lt;br /&gt;Another way to go about this is to just allocate memory as normal, without worrying about freeing it, until we run out of memory. At that time, you can scan the heap and free everything that's not being referenced.&lt;br /&gt;&lt;br /&gt;Say that each object has a mark bit, determining whether it's being referenced. At the beginning of the heap scan, every mark bit is set to zero. Then, we do a depth-first search of the heap, starting at things that we know have references to them, and at each item we visit, we set the mark bit to 1. 
After the marking process is done, we can scan through the heap again and add everything that's not referenced to the free list.&lt;br /&gt;&lt;br /&gt;One problem with this is that the stack space needed for the depth-first search is, potentially, as big as all of the memory. Even if you don't use the host programming language's stack and instead use a manually-maintained stack, things might not always work out. So the Deutsch-Schorr-Waite algorithm uses the space inside the pointers themselves to store what to do next. It uses &lt;em&gt;pointer reversal&lt;/em&gt; during the depth-first search: each object has a flag bit to determine whether it's reversed or not, and when it's visited, it is changed to a pointer to its referent after reading its value and deciding to recurse on it. Pointer reversal is somewhat expensive, but it comes with a constant space guarantee, which is valuable.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;The free list: not as efficient as it sounds&lt;/strong&gt;&lt;br /&gt;&lt;br /&gt;Any good C programmer will tell you, don't allocate memory in an inner loop. Why is this? It's because malloc'ing memory isn't a simple process, and in the worst case can become as slow, asymptotically, as traversing the entire heap. Here's how it works: when the program starts, there's this big free area called the heap where memory is available. There's something called a free list, a circular singly linked list holding all of the free blocks of memory. Initially, it contains just this one big block. When you allocate memory, the free list is searched to find a block that's big enough to hold what you want. Then the free list is altered to say that that memory is taken. This operation preserves the length of the free list.&lt;br /&gt;&lt;br /&gt;Freeing memory is where the trouble starts. When you free memory, the old location is added to the free list. 
If memory is allocated in the same order that it's freed, this is no problem: adjacent free blocks of memory in the free list are joined, so there will only ever be two things in the free list. But this generally doesn't happen. Generally, as the program runs, things get freed in different orders, so the free list grows longer and longer, with pieces in it getting smaller and less usable. When the free list grows, it takes longer to allocate more memory. This is called heap fragmentation, and it's one reason why memory allocation is avoided in C.&lt;br /&gt;&lt;br /&gt;There are ways around it. The simplest way to allocate memory is to look for the first place in the free list that the requested block size will fit, but you can also look for the best fit. Another strategy is to partition the heap into areas for different sizes of allocated memory. But, fundamentally, heap fragmentation in general is unavoidable because the allocator isn't free to move things around. Once a block is allocated, it's supposed to stay in the same place forever. This limitation is maintained by refcounting and mark-sweep collectors. More advanced compacting collectors, like copying collectors and mark-compact collectors, don't have this limitation, making allocation cheaper.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Copying collection with Cheney's algorithm&lt;/strong&gt;&lt;br /&gt;&lt;br /&gt;Instead of dealing with a free list, we could just increment a pointer for allocation, and when memory fills up, move all the data that's referenced someplace else. This is the basic idea behind copying collection. We have two spaces, Fromspace and Tospace. All of the memory is allocated in Fromspace by incrementing a pointer, since everything in it above that pointer is free. When it fills up, copy each referenced thing in Fromspace to Tospace. 
We can preserve the structure of the pointers by making sure that if something gets moved to Tospace, then its old value in Fromspace is set to a special value that indicates where it's already been copied to. When we're all done, swap the titles of Fromspace and Tospace.&lt;br /&gt;&lt;br /&gt;This has the same basic problem as mark-sweep collection: we can't just do a depth-first search of the heap. The easiest way to solve this is by instead using breadth-first search. No, we don't want an external queue to store this stuff in; we can use Tospace itself.&lt;br /&gt;&lt;br /&gt;Here's how it works with Cheney's algorithm. First, copy the stuff we know is referenced into Tospace. (These are called the roots, by the way.) Maintain one pointer, initially at the beginning of Tospace, called "scan", and another, initially after the copied roots, called "free". "Scan" represents where we're looking for new things to copy, and "free" represents unallocated memory in Tospace. Look at the object at "scan", and for each thing it points to that hasn't been copied yet, copy it to "free" and advance "free" by the size of what you just copied. Then advance "scan" past that object. Continue this until scan and free are equal, that is, until nothing more is referenced that hasn't already been copied. To me, this is a very elegant instance of breadth-first search, where a queue emerges naturally between the "scan" and "free" pointers.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Sliding mark-compact using less space&lt;/strong&gt;&lt;br /&gt;&lt;br /&gt;Copying is pretty good because everything is compacted, and allocation is as simple as incrementing a pointer. But there's a cost: it takes twice as much space as the earlier algorithms because half of the space just isn't used at any given time. Another technique is to use something like mark-sweep (with the same marking phase), but instead of putting things on a free list, to move memory back towards the beginning. 
This is called mark-(sweep-)compact collection.&lt;br /&gt;&lt;br /&gt;One goal in compaction is to preserve locality. The breadth-first search of copying collection doesn't do this well for most applications, but mark-compact collection can. The key to modern techniques for this, called "table-based", is to store information about free space that's been discovered in tables which are located in the free space itself. Once this information has been compiled, new locations for each piece of data can be calculated, pointers updated, and finally objects moved. The details are a little obscure, though.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Further ideas&lt;/strong&gt;&lt;br /&gt;&lt;br /&gt;This is only the beginning. There's a big problem with the last two methods discussed: they can cause long pauses, since they require a scan of the whole heap or all of the data that's in use. To address this, there are a number of possible improvements. The most basic one is generational garbage collection, and there are also incremental and concurrent methods. These get a lot more complicated, though, and I can't discuss them in this blog post. But before it's possible to understand those, you need the basics.&lt;br /&gt;&lt;br /&gt;&lt;hr/&gt;&lt;br /&gt;* Caveat: &lt;a href="http://www.pipeline.com/~hbaker1/LinearLisp.html"&gt;Linear Lisp&lt;/a&gt; (a Lisp dialect designed by Henry Baker) and &lt;a href="http://repetae.net/computer/jhc/"&gt;JHC&lt;/a&gt; (a Haskell implementation) don't use garbage collection, instead using only stack allocation (in Linear Lisp) or region inference (in JHC). &lt;a href="http://www.cs.mu.oz.au/research/mercury/"&gt;Mercury&lt;/a&gt;, a Prolog derivative, also uses region inference. But this is for another blog post which I'm not qualified to write. 
Also, you can define a garbage collector for C, C++ or Forth, but it has to treat each word with the precaution that it might be a pointer, even if it isn't.&lt;br /&gt;&lt;br /&gt;If any of this was hard to understand (especially for lack of diagrams), I strongly recommend reading &lt;a href="http://www.amazon.com/gp/redirect.html?ie=UTF8&amp;location=http%3A%2F%2Fwww.amazon.com%2FGarbage-Collection-Algorithms-Automatic-Management%2Fdp%2F0471941484%3Fie%3DUTF8%26s%3Dbooks%26qid%3D1202357267%26sr%3D1-1&amp;tag=uselfact-20&amp;linkCode=ur2&amp;camp=1789&amp;creative=9325"&gt;Garbage Collection: Algorithms for Automatic Dynamic Memory Management&lt;/a&gt; by Richard Jones and Rafael Lins.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Update&lt;/strong&gt;: Minor changes in wording in response to comments.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/273593670040001243-1299193571159156082?l=useless-factor.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://useless-factor.blogspot.com/feeds/1299193571159156082/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=273593670040001243&amp;postID=1299193571159156082' title='10 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/273593670040001243/posts/default/1299193571159156082'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/273593670040001243/posts/default/1299193571159156082'/><link rel='alternate' type='text/html' href='http://useless-factor.blogspot.com/2008/02/quick-intro-to-garbage-collection.html' title='A quick intro to garbage collection'/><author><name>Daniel Ehrenberg</name><uri>http://www.blogger.com/profile/00902922561603041049</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>10</thr:total></entry><entry><id>tag:blogger.com,1999:blog-273593670040001243.post-75356137539953320</id><published>2008-02-17T01:45:00.000-08:00</published><updated>2008-03-07T12:14:18.633-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='factor'/><category scheme='http://www.blogger.com/atom/ns#' term='unicode'/><title type='text'>Designing an API for encoded streams</title><content type='html'>When I started looking at Unicode to design a good library for Factor, I wanted to make an API such that the programmer never needed to think about Unicode at all. I now see that that's impossible, for a number of reasons. One thing that the programmer needs to explicitly think about is the encoding of files. For this reason, I'm in the middle of changing lots of words which deal with streams to take an extra mandatory parameter specifying the encoding. The encodings supported so far are &lt;code&gt;binary&lt;/code&gt;, &lt;code&gt;ascii&lt;/code&gt;, &lt;code&gt;latin1&lt;/code&gt;, &lt;code&gt;utf8&lt;/code&gt;, &lt;code&gt;utf16&lt;/code&gt; and some more. In the library, we'll eventually put Shift JIS, more 8-bit encodings like MacRoman, Windows 1252 and the other ISO-8859s, UTF-32, etc. Internally, all strings are already in Unicode; this is only for external communication.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Mandatory?!&lt;/strong&gt;&lt;br /&gt;&lt;br /&gt;Some people objected to this. Why should there be a new mandatory parameter when everything worked already? This makes code longer, rather than shorter! The rationale is that this expands the functionality of streams. With the old functionality, everything is treated as if it's encoded in Latin 1. But in 99% of cases, this is just wrong. When a text file isn't in plain old ASCII, it's almost always in UTF-8 (though occasionally it's in UTF-16 or Shift JIS). 
Things are rarely in an 8-bit encoding because of its ambiguity; 8-bit non-ASCII encodings can safely be labeled "legacy" except on specialized low-resource applications. Right now things work as if Latin 1 were the encoding for all streams, but if we want to do much actual text processing, things will come out wrong.&lt;br /&gt;&lt;br /&gt;Even if UTF-8 is used most of the time, could we use a heuristic to determine what encoding things are in? If we know the file is in either ASCII, UTF-8, UTF-16 or UTF-32, it's not too hard to come up with some kind of heuristic that works in almost all cases. But once things get generalized to Shift JIS and 8-bit encodings, it's basically impossible to determine generally how things are encoded. And it's completely impossible if there are binary streams, or for output streams altogether.&lt;br /&gt;&lt;br /&gt;So let's make UTF-8 the default encoding. Any stream which doesn't want to use UTF-8 should have its instantiation followed by some &lt;code&gt;set-encoding&lt;/code&gt; word. But what about binary streams? These aren't uncommon and are needed for things like audio, video, compressed data and Microsoft Word documents. If UTF-8 is the default encoding, it'd be easy to open a file for reading or writing, forgetting that it's in UTF-8, and then writing stuff to it as if it's a binary stream. But if we make the encoding a mandatory explicit parameter, then nobody will forget: if you want to open a stream reading it as UTF-8, you can do &lt;code&gt;utf8 &amp;lt;file-reader&gt;&lt;/code&gt;, and if you want to open it as binary, you can do &lt;code&gt;binary &amp;lt;file-reader&gt;&lt;/code&gt;. Writing &lt;code&gt;utf8&lt;/code&gt; or &lt;code&gt;binary&lt;/code&gt; isn't just boilerplate: it actually indicates some information about how things should work. 
And for those situations where you want some other encoding, that can be specified just as easily.&lt;br /&gt;&lt;br /&gt;Now, do we really want to prefix each stream constructor with an encoding, or can it be determined, explicitly, in the context somehow? There are two ways to scope this, lexical and dynamic, and they both fail. With dynamic scoping, composability is broken: one piece of code might make some assumption about the encoding&amp;mdash;say, that the encoding is UTF-8, which could be the default global encoding&amp;mdash;but then the caller sets it to something else. So the encoding must be set lexically. But when I looked at actual code samples, I saw it'd be more trouble than it's worth to have a lexically scoped encoding: nearly all words which open streams which need an encoding only need one or two. You're at most writing the same encoding twice, but it ends up being fewer words than a whole scope declaration (which needs, at a minimum, brackets, the encoding name and something declaring that this is a scope for encoding purposes). What about vocab-level scoping? It could work, but it'd have to be overridden in too many cases to be useful, since it's not unrealistic to have a vocab which uses UTF-8 for half of its streams and binary for the other half.&lt;br /&gt;&lt;br /&gt;One other thing that's useful and not particularly common in these sorts of libraries is the fact that the encoding can be changed after the stream is initialized. This is useful for things like XML, where the encoding can be declared in the prolog, and certain network protocols like HTTP and SMTP which allow the encoding to be specified in the header, so the encoding needs to change on the fly. 
I can only assume that previous implementations of this took everything as binary and used string processing routines to get things in and out of the right encoding.&lt;br /&gt;&lt;br /&gt;You might think of this as a standard library cruelly forcing everyone to specify every little detail, but I think of it a little differently: the file I/O API encourages programmers to think about the encodings of their files. We could go the other way, still, and use UTF-8 as the default, but it'd create some strange and unreadable bugs. Any default is bad. All other stream APIs I've looked at make this optional, but no matter which way you go this makes misleading assumptions for programmers.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Specifics&lt;/strong&gt;&lt;br /&gt;&lt;br /&gt;&lt;code&gt;&amp;lt;file-reader&gt;&lt;/code&gt;, &lt;code&gt;&amp;lt;file-writer&gt;&lt;/code&gt;, &lt;code&gt;&amp;lt;file-appender&gt;&lt;/code&gt;, &lt;code&gt;&amp;lt;client&gt;&lt;/code&gt; and &lt;code&gt;&amp;lt;server&gt;&lt;/code&gt; will now take an extra argument of an encoding descriptor, making them have the stack effect &lt;code&gt;( path/addrspec encoding -- stream )&lt;/code&gt;. &lt;code&gt;file-contents&lt;/code&gt; and &lt;code&gt;file-lines&lt;/code&gt; also take an encoding from the top of the stack. &lt;code&gt;process-stream&lt;/code&gt;'s encodings are in the descriptor, as a possible value for &lt;code&gt;stdin&lt;/code&gt;, &lt;code&gt;stdout&lt;/code&gt; or &lt;code&gt;stderr&lt;/code&gt;, indicating that those values will be sent to/from Factor as a stream of the given encoding. If you're dealing with files, the process you call should handle all encoding issues. Some streams, like HTML streams and pane streams, don't need changes, since their encoding is unambiguous. 
You also don't need to specify the encodings of file and process names, since those are OS-specific and handled by the Factor library.&lt;br /&gt;&lt;br /&gt;In addition to &lt;code&gt;&amp;lt;string-reader&gt;&lt;/code&gt;s and &lt;code&gt;&amp;lt;string-writer&gt;&lt;/code&gt;s that already exist and remain unchanged (they don't need an encoding since everything that goes on there is in Factor's internal Unicode encoding), there are now also &lt;code&gt;&amp;lt;byte-reader&gt;&lt;/code&gt;s and &lt;code&gt;&amp;lt;byte-writer&gt;&lt;/code&gt;s which do have an encoding as a parameter. Byte readers and writers work on an underlying byte vector, and provide the same encodable interface that files do, because an array of bytes, unlike a string, can take multiple interpretations as to the code points it contains.&lt;br /&gt;&lt;br /&gt;I renamed &lt;code&gt;with-file-out&lt;/code&gt; to &lt;code&gt;with-file-writer&lt;/code&gt;, &lt;code&gt;with-file-in&lt;/code&gt; to &lt;code&gt;with-file-reader&lt;/code&gt;, &lt;code&gt;string-in&lt;/code&gt; to &lt;code&gt;with-string-reader&lt;/code&gt; and &lt;code&gt;string-out&lt;/code&gt; to &lt;code&gt;with-string-writer&lt;/code&gt; for consistency. Additionally, there are now also words &lt;code&gt;with-byte-reader&lt;/code&gt; and &lt;code&gt;with-byte-writer&lt;/code&gt;. Since byte and file readers and writers need an encoding, in these combinators I've put the encoding before the quotation. It could be the other way around, and really this was an arbitrary choice. Conceptually, you can think of it like the file name or byte array and the encoding form a sort of unit, so they're consistently adjacent in the words which use them.&lt;br /&gt;&lt;br /&gt;I've made all the updates to everyone's software in my local branch, so you don't have to worry about implementing these changes. You might want to go back and look at your code to make sure the encoding I chose was sane. 
90% of the time it's binary or UTF-8, occasionally ASCII. It's usually clear-cut. Also, I never had to make more than 3 or 4 updates in a single file.&lt;br /&gt;&lt;br /&gt;It'd be nice if things were simpler, and nobody had to consider encodings at all except for Unicode library writers. Theoretically, this could be solved by a standard way to denote, inside the file, what encoding the rest of the file is in. But if we did that, then multiple competing encoding encodings might emerge, and we'd have to explicitly choose among them! It'd be even better if the filesystem had metadata on this, but it doesn't. Maybe, on the Factor end, there's a place for having an abstraction over the locations of resources grouped with a description of their type (either encoding or filetype). But either way, encodings just aren't simple enough to allow programmers not to think about them.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Update&lt;/strong&gt;: Added more info about specifics. &lt;strike&gt;It's been taking me a little longer than I initially thought to get this whole thing working with Factor, so this stuff still isn't in the main branch, though you can see the progress in the unicode branch of my repository. Bootstrapping will take a little work, though.&lt;/strike&gt; The changes have been integrated into Factor! 
Thanks, Slava, for making it all work.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/273593670040001243-75356137539953320?l=useless-factor.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://useless-factor.blogspot.com/feeds/75356137539953320/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=273593670040001243&amp;postID=75356137539953320' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/273593670040001243/posts/default/75356137539953320'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/273593670040001243/posts/default/75356137539953320'/><link rel='alternate' type='text/html' href='http://useless-factor.blogspot.com/2008/02/designing-api-for-encoded-streams.html' title='Designing an API for encoded streams'/><author><name>Daniel Ehrenberg</name><uri>http://www.blogger.com/profile/00902922561603041049</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-273593670040001243.post-7274295878802391756</id><published>2008-02-11T17:30:00.000-08:00</published><updated>2008-02-12T08:55:44.974-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='theory'/><category scheme='http://www.blogger.com/atom/ns#' term='ignorable'/><category scheme='http://www.blogger.com/atom/ns#' term='math'/><title type='text'>Entscheidungsproblem gelöst! 
New answer and FAQ</title><content type='html'>Alan Turing wasn't quite right: anything that can be executed in the physical universe runs in constant time, and it's simple to design a mechanism, itself running in constant time, which tests whether an algorithm will halt (assuming it takes less than half of the universe to run).&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Q:&lt;/strong&gt; How can you make such a bold, stupid claim?&lt;br /&gt;&lt;strong&gt;A:&lt;/strong&gt; Because Turing assumed an infinite space to solve problems in. In the real world, there is a finite number of bits we can work with. Let's call this n. If a problem can be solved in the physical universe, it must be solvable using n or fewer bits of memory. There are 2&lt;sup&gt;n&lt;/sup&gt; possible states the bits can be in, so after 2&lt;sup&gt;n&lt;/sup&gt; memory writes, either an answer must be produced or the bits have returned to a state they were in before, which means an infinite loop. Since n is a constant, 2&lt;sup&gt;n&lt;/sup&gt; is a constant, so we know everything runs in worst-case O(1) time. For a more practical simulation, take n to be the number of bits in addressable memory, including virtual memory.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Q:&lt;/strong&gt; How can you check if an algorithm will halt or not in constant time?&lt;br /&gt;&lt;strong&gt;A:&lt;/strong&gt; Divide the available memory in two. Half will be used for running the algorithm, and half will be a counter. Increment the counter on each memory write to the algorithm's memory space. If this counter overflows before the algorithm halts, we know there is an infinite loop. 
(Note: This can only solve things that take a maximum of half of the available memory, which is a smaller class of programs.)&lt;br /&gt;&lt;br /&gt;Remember the big thing that Turing proved: we can't have a program which tests if other programs halt, because if we did have that program, and ran it on a program which halted if it didn't halt and didn't halt if it halted (by the first program's test), then the halting test program couldn't halt or there would be a contradiction. This implicitly rests on the idea that program memory is unbounded. To run this program (the second one) using the halting test method described above would require an unbounded amount of memory because it would result in an unbounded recursion. We can just say that the program crashes, since it runs out of memory. In this world, not everything that's computable can be computed, but everything halts (or repeats a previous memory state exactly, which can be caught, as explained above).&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Q:&lt;/strong&gt; What do you mean by unbounded? And halt?&lt;br /&gt;&lt;strong&gt;A:&lt;/strong&gt; By "unbounded" I mean countably infinite (meaning in bijection with the naturals). Basically, there is no limit, but no element itself is infinite. By "halt" I mean that the algorithm will stop processing and come to an answer. In this theoretical world of algorithms, note that there's no additional external input once the algorithm begins.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Q:&lt;/strong&gt; But the system you describe isn't Turing-complete!&lt;br /&gt;&lt;strong&gt;A:&lt;/strong&gt; Yes. The finite universe isn't strictly Turing-complete, either, and I think this is interesting to note. For example, there's no way we could model another universe larger than our own, down to quantum states, from this universe.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Q:&lt;/strong&gt; So my computer science professor was lying to me all along! Computational complexity is useless. 
Why didn't they tell me this?&lt;br /&gt;&lt;strong&gt;A:&lt;/strong&gt; Not quite. Since this constant is so large (even when we shrink it down to 2&lt;sup&gt;the size of the available memory&lt;/sup&gt;), we can actually get a tighter upper bound than O(1) if we use, say, O(m) for the same algorithm. This is why you sometimes have to be careful what your constants are in analysis; sometimes something which looks asymptotically faster is actually slower in not just small but &lt;em&gt;all&lt;/em&gt; cases. Anyway, your professor probably didn't tell you this because it's both obvious and vacuous.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Q:&lt;/strong&gt; So, if you were just tricking me, does this idea mean anything?&lt;br /&gt;&lt;strong&gt;A:&lt;/strong&gt; Actually, it comes up fairly often. Say you're doing arithmetic on integers. Each operation takes constant time, right? Well, it does if you're using machine integers. That's because machine integers are of a certain fixed maximum size. The reason we can say it takes constant time is the same reason that the universe can calculate everything in constant time! In fact, even if we use bignums, if we say "this is only going on in numbers that can be represented in 256 bits," it &lt;em&gt;still&lt;/em&gt; makes some sense to say that things go on in constant time. It's only when things are unbounded that it makes total sense. Sometimes you have to look at very large data points to find that, say, n&lt;sup&gt;log&lt;sub&gt;4&lt;/sub&gt; 3&lt;/sup&gt; grows faster than n log&lt;sup&gt;2&lt;/sup&gt; n. If the size of the input is bounded at 10 billion, there's no clear winner.&lt;br /&gt;&lt;br /&gt;Think about sorting algorithms. If we have k possible items in an array of length n, we can sort the array in O(n) time and O(k) space using &lt;a href="http://en.wikipedia.org/wiki/Counting_sort"&gt;counting sort&lt;/a&gt;. Say we're sorting an array of machine integers, and we have no idea what the distribution is. 
Great! Just make an array of integers which has one counter for each machine integer. Now iterate through the list and increment the array at the integer's index when you encounter that integer. What's the problem? Oh yeah, we just used up more than all of the addressable memory. So just because we can, theoretically, construct something that will solve our problem in linear (or, with the original example, constant) time doesn't mean it'll work.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Update&lt;/strong&gt;: Added another question. Note to readers: The headline and first paragraph aren't to be taken seriously or directly! This is a joke to demonstrate a point, not a claim of mathematical discovery, and the original solution to the halting problem is an eternal mathematical truth.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/273593670040001243-7274295878802391756?l=useless-factor.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://useless-factor.blogspot.com/feeds/7274295878802391756/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=273593670040001243&amp;postID=7274295878802391756' title='12 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/273593670040001243/posts/default/7274295878802391756'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/273593670040001243/posts/default/7274295878802391756'/><link rel='alternate' type='text/html' href='http://useless-factor.blogspot.com/2008/02/entscheidungsproblem-gelst-new-answer.html' title='Entscheidungsproblem gelöst! 
New answer and FAQ'/><author><name>Daniel Ehrenberg</name><uri>http://www.blogger.com/profile/00902922561603041049</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>12</thr:total></entry><entry><id>tag:blogger.com,1999:blog-273593670040001243.post-9209077169635400517</id><published>2008-02-09T08:50:00.000-08:00</published><updated>2009-03-30T19:30:03.303-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='factor'/><category scheme='http://www.blogger.com/atom/ns#' term='math'/><title type='text'>Factor[ial] 102</title><content type='html'>Previously, I wrote an introduction to Factor through Factorial called &lt;a href="http://useless-factor.blogspot.com/2007/07/factorial-101.html"&gt;Factor[ial] 101&lt;/a&gt;. If you saw that, you probably thought that's all you'd see of that totally un-compelling example. Well, you're wrong! We can actually implement factorial more simply and more efficiently with large integers.&lt;br /&gt;&lt;br /&gt;Let's look at the solution from last time:&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;: factorial ( n -- n! )&lt;br /&gt;    1 [ 1+ * ] reduce ;&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;When you saw this, you might have been thinking, "Why not just get a list of the numbers from 1 to n and do their product? Why bother with 1+? In Haskell, it's just &lt;code&gt;product [1..n]&lt;/code&gt;." We can actually use this strategy in Factor using the &lt;code&gt;math.ranges&lt;/code&gt; library, which has a word called &lt;code&gt;[1,b]&lt;/code&gt; which creates a virtual sequence containing the integers 1 through its argument. We can also use the word &lt;code&gt;product&lt;/code&gt; in &lt;code&gt;math.vectors&lt;/code&gt; to get the product of a sequence. So the new factorial is just&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;: factorial ( n -- n! 
)&lt;br /&gt;    [1,b] product ;&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;strong&gt;Efficiency and bignums&lt;/strong&gt;&lt;br /&gt;&lt;br /&gt;I &lt;a href="http://useless-factor.blogspot.com/2007/09/using-bigums-efficiently.html"&gt;previously&lt;/a&gt; talked about how to make some simple mathematical functions work well with bignums. The example I used there was a string&gt;number conversion procedure, but it applies equally to getting the product of a list. In short: when multiplying two bignums of size (in bits) n and m, we know there's a lower bound of &amp;Omega;(n+m), since a bignum of about n+m bits must be constructed. So if we go about finding the product of a sequence of machine-word-sized numbers by starting with 1, then multiplying that by the first element, then the second, and so on for the entire sequence, then this will take &amp;Omega;(n&lt;sup&gt;2&lt;/sup&gt;) time where n is the size of the resulting product, because a growing partial product is rebuilt at every step!&lt;br /&gt;&lt;br /&gt;We can do better with binary recursion similar to mergesort: split the sequence in half, find the products of both halves, and multiply them. Then the easy lower bound is lower: &amp;Omega;(n log n). (Note: the upper bounds for these algorithms are something like the lower bound times log n, with a reasonably efficient multiplication algorithm.)&lt;br /&gt;&lt;br /&gt;So here's a better implementation of &lt;code&gt;product&lt;/code&gt;, using this strategy:&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;: halves ( seq -- beginning end )&lt;br /&gt;    dup length 2 /i cut-slice ;&lt;br /&gt;&lt;br /&gt;: product ( seq -- num )&lt;br /&gt;    dup length {&lt;br /&gt;        { 0 [ drop 1 ] }&lt;br /&gt;        { 1 [ first ] }&lt;br /&gt;        [ drop halves [ product ] bi@ * ]&lt;br /&gt;    } case ;&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;strong&gt;Abstraction and combinators&lt;/strong&gt;&lt;br /&gt;&lt;br /&gt;If we want to implement a word to sum an array, we'll be repeating a lot of code. 
So let's abstract the basic idea of this kind of recursion into a combinator, or higher order function, that we can supply the starting value and combining function to. With this, we should be able to write&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;: product ( seq -- num ) 1 [ * ] split-reduce ;&lt;br /&gt;: sum ( seq -- num ) 0 [ + ] split-reduce ;&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;where &lt;code&gt;split-reduce&lt;/code&gt; is this new combinator. Now, it's no harder to write code using the binary-recursive strategy than the original naive strategy, if &lt;code&gt;split-reduce&lt;/code&gt; is somewhere in the library. Here's how you can implement it:&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;: split-reduce ( seq start quot -- value )&lt;br /&gt;    pick length {&lt;br /&gt;        { 0 [ drop nip ] }&lt;br /&gt;        { 1 [ 2drop first ] }&lt;br /&gt;        [ drop [ halves ] 2dip [ [ split-reduce ] 2curry bi@ ] keep call ]&lt;br /&gt;    } case ; inline&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;This looks a little messy, as combinators sometimes get. Let's see how it looks using local variables: (I'm using the &lt;code&gt;locals&lt;/code&gt; vocab, which allows the syntax &lt;code&gt;:: word-name ( input variables -- output ) code... ;&lt;/code&gt; for lexical scoping. I usually don't use this very often, but in implementing combinators it can make things much cleaner.)&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;:: split-reduce ( seq start quot -- seq' )&lt;br /&gt;    seq empty? [ start ] [&lt;br /&gt;        seq singleton? [ seq first ]&lt;br /&gt;        [ seq halves [ start quot split-reduce ] bi@ quot call ] if&lt;br /&gt;    ] if ; inline&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;Which one of these you prefer is purely a matter of taste.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;A tangent to mergesort&lt;/strong&gt;&lt;br /&gt;&lt;br /&gt;What if we wanted to use &lt;code&gt;split-reduce&lt;/code&gt; to implement mergesort? 
It might look like you can do this:&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;: mergesort ( seq -- sorted )&lt;br /&gt;    { } [ merge ] split-reduce ;&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;However, there's a problem here: in the base case, if we have &lt;code&gt;{ 1 }&lt;/code&gt;, it'll be changed into &lt;code&gt;1&lt;/code&gt;. But we need the base case to output sequences! (Ignore the fact that 1 is a sequence; it's of the wrong type.) So the cleanest way to do this is to make a new word, &lt;code&gt;split-process&lt;/code&gt;, which does the same thing as &lt;code&gt;split-reduce&lt;/code&gt; but takes a new parameter specifying what to do in the base case. With this we're able to do&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;: split-reduce ( seq start quot -- value )&lt;br /&gt;    [ first ] split-process ; inline&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;To implement this, we just need to modify &lt;code&gt;split-reduce&lt;/code&gt;, factoring out the base case code:&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;:: split-process ( seq start quot base-quot -- seq' )&lt;br /&gt;    seq empty? [ start ] [&lt;br /&gt;        seq singleton? [ seq base-quot call ] [&lt;br /&gt;            seq halves&lt;br /&gt;            [ start quot base-quot split-process ] bi@&lt;br /&gt;            quot call&lt;br /&gt;        ] if&lt;br /&gt;    ] if ; inline&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;Now mergesort can be implemented as&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;: mergesort ( seq -- sorted )&lt;br /&gt;    { } [ merge ] [ ] split-process ;&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;for some suitable implementation of &lt;code&gt;merge&lt;/code&gt;.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;To &lt;code&gt;binrec&lt;/code&gt; and beyond&lt;/strong&gt;&lt;br /&gt;&lt;br /&gt;What if we took this even further: why restrict this to binary recursion on sequences? We can do binary recursion on everything that needs binary recursion! So let's make a combinator out of this, calling it &lt;code&gt;binrec&lt;/code&gt;. 
&lt;code&gt;binrec&lt;/code&gt; takes four&amp;mdash;&lt;em&gt;four!&lt;/em&gt;&amp;mdash;quotations. The first one specifies the termination (base case) condition. The second specifies what to do in the base case. The third specifies how to split up the data in the inductive case, and the fourth specifies how to put the two pieces back together after the recursion takes place. Here's how we can implement &lt;code&gt;binrec&lt;/code&gt; for a totally general binary recursion combinator:&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;:: binrec ( data test end split rejoin -- value )&lt;br /&gt;    data test call [ data end call ] [&lt;br /&gt;        data split call&lt;br /&gt;        [ test end split rejoin binrec ] bi@&lt;br /&gt;        rejoin call&lt;br /&gt;    ] if ; inline&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;In the abstract, this isn't too bad. But how can you read code that uses binrec? You have to remember four quotations, their intended stack effects and their roles in the calculation. For me, this is too difficult to do in most cases.&lt;br /&gt;&lt;br /&gt;Look at how we can define &lt;code&gt;split-process&lt;/code&gt; in terms of &lt;code&gt;binrec&lt;/code&gt;:&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;:: split-process ( seq start rejoin-quot end-quot -- value )&lt;br /&gt;    seq [ dup singleton? swap empty? or ]&lt;br /&gt;    [ dup singleton? [ end-quot call ] [ drop start ] if ]&lt;br /&gt;    [ halves ] rejoin-quot binrec ; inline&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;This isn't actually easier than defining &lt;code&gt;split-process&lt;/code&gt; directly, and you can argue that it's worse than the original version. Still, it provides an interesting way to avoid explicit recursion.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Pulling it all together&lt;/strong&gt;&lt;br /&gt;&lt;br /&gt;Complicated combinators like &lt;code&gt;binrec&lt;/code&gt; can be useful, sometimes, as long as you don't use them directly. 
One of the great things about Factor is that it's so easy to specialize these things. So why not? Almost every case where you're using &lt;code&gt;binrec&lt;/code&gt; follows a particular pattern.&lt;br /&gt;&lt;br /&gt;We can tell everyone more loudly about &lt;code&gt;split-reduce&lt;/code&gt;, which is much easier to use, and have &lt;code&gt;binrec&lt;/code&gt; be hidden in the library for advanced users who want to implement their own similar combinators without repeating the code that's already written in &lt;code&gt;binrec&lt;/code&gt;. It's not that recursion is difficult to do, it's just that there's no reason to write this code more than once.&lt;br /&gt;&lt;br /&gt;So that's how you implement factorial in Factor. Except once all this is in the library, all you have to worry about is &lt;code&gt;[1,b] product&lt;/code&gt;.&lt;br /&gt;&lt;br /&gt;(&lt;strong&gt;BTW&lt;/strong&gt; If you actually want to use factorial for something practical, where it'll be called multiple times, a memoizing table-based approach might be faster. Or maybe Stirling's approximation is appropriate, depending on the use case. Or maybe &lt;a href="http://www.luschny.de/math/factorial/FastFactorialFunctions.htm"&gt;one of these algorithms&lt;/a&gt;. But that's a topic for another blog post!)&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Update&lt;/strong&gt;: I fixed some typos, found by Slava. 
Also, Slava added &lt;code&gt;split-reduce&lt;/code&gt; into the core as &lt;code&gt;binary-reduce&lt;/code&gt; and implemented sum and product with it.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Update 2&lt;/strong&gt;: Updated the code samples for the current version of Factor, as of late March 2009.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/273593670040001243-9209077169635400517?l=useless-factor.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://useless-factor.blogspot.com/feeds/9209077169635400517/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=273593670040001243&amp;postID=9209077169635400517' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/273593670040001243/posts/default/9209077169635400517'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/273593670040001243/posts/default/9209077169635400517'/><link rel='alternate' type='text/html' href='http://useless-factor.blogspot.com/2008/02/factorial-102.html' title='Factor[ial] 102'/><author><name>Daniel Ehrenberg</name><uri>http://www.blogger.com/profile/00902922561603041049</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-273593670040001243.post-5730559444479386235</id><published>2008-02-06T12:15:00.000-08:00</published><updated>2008-02-11T11:36:19.927-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='xml'/><title type='text'>XML and its alternatives</title><content type='html'>I started writing Factor's XML parser thinking that its purpose was to interface with legacy protocols which made the mistake of choosing 
XML, and that the people at the W3C were a bunch of idiots for pushing such a bad, unoriginal format on innocent programmers who would do better without it. At this point, though, I think it might not actually be that bad. Let's look at the alternatives for representing human-readable structured information for standardized protocols.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Flat text files&lt;/strong&gt;&lt;br /&gt;&lt;br /&gt;In the recent past, many protocols and file formats were written with a flat text file, binary or human-readable, each requiring an individually specialized parser. Many are still written this way. Does it make any sense to impose a tree structure on something as simple as blog syndication, documents or remote procedure calls? Or was it a wrong turn to put all of that in the same verbose, complicated syntax?&lt;br /&gt;&lt;br /&gt;I think it was a good idea to specify these formats in terms of a common tree-based human-readable format. Maybe for some low-level network protocols, a flat text or binary file makes sense, but many other things work out well using a tree structure. For example, the Atom syndication format is a way to store things like blog feeds in XML. The structure is pretty simple: there's a bunch of metadata about the blog, and then there are a bunch of nodes corresponding to items, with roughly the same fields as the feed itself has. (I'm oversimplifying, here.) Atom uses a tree structure to store this, and the tree is in XML syntax. A tree structure makes sense, because there are a couple different sub-levels: there's the level of items, and then underneath that, the level of data about each item. These can be cleanly separated in a tree model.&lt;br /&gt;&lt;br /&gt;Using a pre-existing XML parser, Atom is fairly easy to parse and generate. 
I wrote a simple library for Atom parsing and generation &lt;a href="http://factorcode.org/responder/source/extra/rss/rss.factor"&gt;here&lt;/a&gt; in not much code.&lt;br /&gt;&lt;br /&gt;An additional benefit of a tree structure in a standard syntax is that standard tools can be used on it. On the most basic level, you can use a parsing library. But is this really necessary if the format is simple enough anyway? When there is a large amount of information, parsing inevitably becomes harder, and a consistent encoding of hierarchical structure makes this easier.&lt;br /&gt;&lt;br /&gt;A new alternative to Atom is Zed Shaw's &lt;a href="http://www.zedshaw.com/blog/2008-01-13.html"&gt;ZSFF&lt;/a&gt;, where information is in a simple flat-file format. (&lt;strong&gt;Update&lt;/strong&gt;: Zed says this should be taken as a joke.) Originally, this only had basic information about the blog overall, and the URL of each post in chronological order. But when things were extended to show the article contents, Zed's solution was to have the flat file link to a special source format he uses to generate HTML. He didn't provide anything to get the date things are posted, which, in his case, can be deduced from the URL.&lt;br /&gt;&lt;br /&gt;I don't mean to criticize Zed, but this new format will actually be more difficult for programmers to process than regular Atom feeds, if they want to have a "river of news" aggregator format. A ZSFF aggregator (as &lt;a href="http://planet.factorcode.org"&gt;Planet Factor&lt;/a&gt; is an Atom aggregator) would have to parse the URLs to figure out the dates to get a correct ordering and follow the URLs with a .page extension to get content. Those pages must also be parsed to get the title and content in HTML form. Is it easier to write a ZSFF generator? 
Yes, but it's much harder to read, and that must be taken into consideration just as much.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;S-expressions&lt;/strong&gt;&lt;br /&gt;&lt;br /&gt;Many smart Lispers have complained about XML. They claim, with a bunch of good reasons, that s-expressions (sexprs), the basis for Lisp syntax, are better for most human-readable data serialization purposes. Some of these are:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Sexprs are simpler to parse&amp;mdash;just use &lt;code&gt;read&lt;/code&gt;.&lt;/li&gt;&lt;li&gt;They're easier to process, since they're just nested lists.&lt;/li&gt;&lt;li&gt;Sexprs encode real data types in a direct way, not just strings but also integers and floats.&lt;/li&gt;&lt;li&gt;XML is unnecessarily complicated, including things like the distinction between attributes and children and unnecessary redundancy in closing tags.&lt;/li&gt;&lt;li&gt;Sexprs came first and they do everything XML does, so why should we use XML?&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;Defending XML's utility against Lispers' criticisms is difficult, largely because at least half of the criticisms are correct: XML is completely isomorphic to sexprs but with a more complicated and verbose syntax. Still, it has a few advantages besides its legacy status. XML is very well-specified and is guaranteed not to differ between conforming implementations. The general idea of s-expressions is fairly universal, but implementations differ in what characters can be included in a symbol, what characters and escapes can be used in strings, the precision of floats and the identity of symbols. Within a well-specified Lisp (I'm using Lisp in the general sense here), there's not much ambiguity, but in communicating between computers programmed with different parser implementations, XML is more robust. XML supports Unicode very consistently, with explicit mention of it in the standard. 
Sexprs can't be depended on for this.&lt;br /&gt;&lt;br /&gt;This might not be helpful in general, but one really great thing about XML is that you can embed other XML documents inside of it very cleanly. I never thought I'd use XML for anything but others' protocols until it turned out to be the right tool for a quick hackish job: a simple resource file to generate the Factor FAQ. Previously, I maintained the FAQ in HTML directly, but it became tedious to update the table of contents and the start values for ordered lists. So in a couple hours I came up with and implemented a simple XML schema to represent the questions and answers consisting of XHTML fragments. I could parse the HTML I was already using for the FAQ and convert it to this XML format, and convert it back.&lt;br /&gt;&lt;br /&gt;Could I have used s-expressions, or an equivalent using the Factor reader? Sure, but it would have taken more effort to construct the HTML, and I'm lazy. One reason is really superficial: I would have had to backslash string literals. Another reason is that I had an efficient XML parser lying around. But one thing which came in handy which I didn't expect was that because of XML's redundancy, the parser caught my mistakes in missing closing tags and pointed them out at the location that they occurred. Anyway, these are all small things, but they added up to XML being an easier-to-hack-up solution than s-expressions or ad-hoc parsing.&lt;br /&gt;&lt;br /&gt;(&lt;strong&gt;JSON&lt;/strong&gt;: I have nothing to say about JSON, but I don't think anyone's ever tried to use it for standardized protocols, just for AJAX. And I don't really know much about it. But I feel like I should mention it because it's out there, and it's definitely a reasonable option because it's strongly standardized. The same goes for YAML, basically. 
It's important to note, though, that JSON and YAML are in a way more complicated (though arguably more useful) because they include explicit data types.)&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Conclusion&lt;/strong&gt;&lt;br /&gt;&lt;br /&gt;While XML isn't perfect, it is the right tool for some jobs, even some things it isn't currently used for. Of course, there are definitely cases where flat files or s-expressions are more appropriate; it would be stupid to reflexively use XML for everything where you want a human-readable data format. But format standards, while annoying, are great when someone else takes the time to implement them for you. This way, you don't have to worry about things like character encodings or parsing more complicated grammars as the format grows. The biggest benefits of XML are the tree structure and its standardization.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Update&lt;/strong&gt;: For a more complete look at the alternatives to XML, check out &lt;a href="http://web.archive.org/web/20060325012720/www.pault.com/xmlalternatives.html"&gt;this page&lt;/a&gt; which William Tanksley pointed out (unfortunately only on the Internet Archive).&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/273593670040001243-5730559444479386235?l=useless-factor.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://useless-factor.blogspot.com/feeds/5730559444479386235/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=273593670040001243&amp;postID=5730559444479386235' title='12 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/273593670040001243/posts/default/5730559444479386235'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/273593670040001243/posts/default/5730559444479386235'/><link rel='alternate' 
type='text/html' href='http://useless-factor.blogspot.com/2008/02/xml-and-its-alternatives.html' title='XML and its alternatives'/><author><name>Daniel Ehrenberg</name><uri>http://www.blogger.com/profile/00902922561603041049</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>12</thr:total></entry><entry><id>tag:blogger.com,1999:blog-273593670040001243.post-8787532487261541018</id><published>2008-01-31T23:36:00.000-08:00</published><updated>2008-01-31T19:56:05.201-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='unicode'/><title type='text'>The most common Unicode-processing bug</title><content type='html'>The most common Unicode-processing bug is as pervasive as it is trivial: UTF-8 is confused for some 8-bit repertoire and encoding, or the other way around. Most commonly it's something like &lt;a href="http://en.wikipedia.org/wiki/Windows-1252"&gt;Windows-1252&lt;/a&gt;, a poorly specified superset of ISO 8859-1 (Latin 1). This, I'm fairly sure, is the source of all of those "funny Unicode characters".&lt;br /&gt;&lt;br /&gt;For example, on the Unicode mailing list digest as read in GMail, I received a letter a few days ago which opened, "I donâ€™t understand..." I'm not sure where in the transmission line this happened, but I'm sure that nobody typed those characters. More likely, they used a right single quotation mark U+2019 (or rather their email program did), which would be encoded in UTF-8 as 0010 0000 0001 1001 ==&gt; 11100010 10000000 10011001 = 0xE2 0x80 0x99 = â€™ in Windows-1252.&lt;br /&gt;&lt;br /&gt;(Note: the giveaway that this is Windows-1252 and not Latin 1 is the Euro sign, which is a relatively recent invention and not encoded into any official ISO 8859 that people actually use*. 
In all ISO 8859s, 0x80 is reserved as a kind of fake control character.)&lt;br /&gt;&lt;br /&gt;Here's how it might have happened: the email program declared that everything was in Windows-1252, though it was not, and the mailing list server correctly decoded that encoding into the corresponding Unicode code points. Alternatively, maybe no encoding was specified, and since Windows-1252 is a superset of Latin 1, which in turn is a superset of ASCII, it was used as a "maximally tolerant" assumed encoding where ASCII would be the normal default. Alternatively, maybe the mailing list itself failed to specify what encoding it was using, and GMail made a similar mistake. This is more likely, as things consistently appear this way for me when reading the Unicode mailing list.&lt;br /&gt;&lt;br /&gt;This bug is at once easy to fix and impossible. Because compatibility with legacy applications needs to be maintained, it's difficult to change the default encoding of things. So, everywhere it's possible, things need to say explicitly what their encoding is, and applications need to process this information properly.&lt;br /&gt;&lt;br /&gt;Still, do we care more about maintaining compatibility with legacy applications or getting things working today? In practice, almost everything is done in UTF-8. So it should be fine to just assume that encodings are always UTF-8, legacy programs be damned, to get maximum utility out of new things.&lt;br /&gt;&lt;br /&gt;Well, as it turns out, that's not always the best thing to do. Someone recently told me that he suspects a Unicode bug in &lt;a href="http://amarok.kde.org/"&gt;Amarok&lt;/a&gt;: it wasn't correctly loading his songs that had German in the title, though it worked correctly on a previous installation. Instead, I think the bug was in incompatible default settings for GNU tar or ISO format. The songs used to have accented letters, but the files were transferred onto a different computer. 
Now, those letters were single question marks when viewed in Konqueror, and Amarok refused to open them, giving a cryptic error message.&lt;br /&gt;&lt;br /&gt;UTF-8 is fault-tolerant, and a good decoder will replace malformed octets with a question mark and move on. This is probably exactly what happened: the title of a song contained a character, say ö, which was encoded in Latin 1 as 0xF6, followed by something in ASCII. The song title was encoded in Latin 1 when the file system expected UTF-8. The UTF-8 decoder in Konqueror replaced the 0xF6 with a &amp;#xfffd; (replacement character, U+FFFD), but Amarok threw an error for some reason.&lt;br /&gt;&lt;br /&gt;So, for all of this, there's no good solution but to be more careful and mindful of different encodings. In most cases, you can use a heuristic to determine whether something is in Windows-1252 or UTF-8, but this can never be completely accurate, especially if other encodings are considered at the same time.&lt;br /&gt;&lt;hr/&gt;&lt;br /&gt;* ISO 8859-15 and -16 actually have the Euro sign, but I really doubt many people use them, as they were released around the turn of the millennium, when Unicode was already in broad use, and come with the difficulties of telling them apart from other ISO 8859s.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Update&lt;/strong&gt;: An anonymous commenter pointed out that it's not too hard to use a heuristic to differentiate between Windows-1252 and UTF-8.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/273593670040001243-8787532487261541018?l=useless-factor.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://useless-factor.blogspot.com/feeds/8787532487261541018/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=273593670040001243&amp;postID=8787532487261541018' title='4 
Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/273593670040001243/posts/default/8787532487261541018'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/273593670040001243/posts/default/8787532487261541018'/><link rel='alternate' type='text/html' href='http://useless-factor.blogspot.com/2008/01/most-common-unicode-processing-bug.html' title='The most common Unicode-processing bug'/><author><name>Daniel Ehrenberg</name><uri>http://www.blogger.com/profile/00902922561603041049</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>4</thr:total></entry><entry><id>tag:blogger.com,1999:blog-273593670040001243.post-7998438314037640248</id><published>2008-01-22T22:36:00.000-08:00</published><updated>2009-04-21T07:58:59.651-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='theory'/><category scheme='http://www.blogger.com/atom/ns#' term='xml'/><title type='text'>Matching, diffing and merging XML</title><content type='html'>&lt;strong&gt;Update&lt;/strong&gt;: A newer, more complete version is &lt;a href="http://www.scribd.com/doc/14482474/XML-diff-survey"&gt;here&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;I've said bad things about my job working on Carleton College's website, but fundamentally it's a really sound work environment we have. Just before winter break, one of the full-time employees came to me and asked if I could make a diff between two XHTML documents for use in Carleton's CMS, Reason. This would be useful for (a) comparing versions of a document in the CMS (b) merging documents, in case two people edit the same document at the same time, so as to avoid locks and the need for manual merges. 
They came to me because I told them I'd written an XML parser.&lt;br /&gt;&lt;br /&gt;I may know about XML, but I know (or rather knew) nothing about the algorithms required for such a diff and merge. I looked on Wikipedia, but there was &lt;em&gt;no article&lt;/em&gt; about this kind of stuff. So for the past three weeks, I've been paid to read academic papers, mostly from Citeseer, my favorite website, about matching, diffing, and merging trees. Here's some of what I've found.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;A naive algorithm&lt;/strong&gt;&lt;br /&gt;&lt;br /&gt;Here's something that won't work: a line-by-line diff and merge, along the lines of the Unix utilities &lt;code&gt;diff&lt;/code&gt; and &lt;code&gt;patch&lt;/code&gt;. You can split things up so that each tag is on a different line, as many have suggested, but it still won't work. Everyone who I've mentioned this problem to, including the boss who gave the assignment, thought of doing something like this, but it's unworkable.&lt;br /&gt;&lt;br /&gt;Most obviously, this can easily break the tree structure. Here's one example of a failing three-way merge. (By a three-way merge, I mean a merge of two documents where you have the original document which the other two are derived from. 
&lt;code&gt;diff3&lt;/code&gt; does a line-by-line three-way merge, which I'm using for this example.)&lt;br /&gt;&lt;table&gt;&lt;tr&gt;&lt;td&gt;&lt;pre&gt;&lt;br /&gt;&amp;lt;p&gt;&lt;br /&gt;This is&lt;br /&gt;some&lt;br /&gt;text&lt;br /&gt;&amp;lt;/p&gt;&lt;br /&gt;&lt;/pre&gt;&lt;/td&gt;&lt;td&gt;&lt;pre&gt;&lt;br /&gt;&amp;lt;p&gt;&lt;br /&gt;&amp;lt;b&gt;&lt;br /&gt;This is&lt;br /&gt;some&lt;br /&gt;&amp;lt;/b&gt;&lt;br /&gt;text&lt;br /&gt;&amp;lt;/p&gt;&lt;br /&gt;&lt;/pre&gt;&lt;/td&gt;&lt;td&gt;&lt;pre&gt;&lt;br /&gt;&amp;lt;p&gt;&lt;br /&gt;This is&lt;br /&gt;&amp;lt;i&gt;&lt;br /&gt;some&lt;br /&gt;text&lt;br /&gt;&amp;lt;/i&gt;&lt;br /&gt;&amp;lt;/p&gt;&lt;br /&gt;&lt;/pre&gt;&lt;/td&gt;&lt;td&gt;&lt;pre&gt;&lt;br /&gt;&amp;lt;p&gt;&lt;br /&gt;&amp;lt;b&gt;&lt;br /&gt;This is&lt;br /&gt;&amp;lt;i&gt;&lt;br /&gt;some&lt;br /&gt;&amp;lt;/b&gt;&lt;br /&gt;text&lt;br /&gt;&amp;lt;/i&gt;&lt;br /&gt;&amp;lt;/p&gt;&lt;br /&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;th&gt;Original&lt;/th&gt;&lt;th&gt;Part bolded&lt;/th&gt;&lt;th&gt;Part italicized&lt;/th&gt;&lt;th&gt;Line-by-line merge&lt;/th&gt;&lt;/tr&gt;&lt;/table&gt;&lt;br /&gt;This XML is not well-formed! This is unacceptable. I'm not saying &lt;code&gt;diff3&lt;/code&gt; is a bad tool, it's just not what we need.&lt;br /&gt;&lt;br /&gt;Another problem is that a standard Unix diff only takes two things into account: insertions and deletions. In document diffing and merging, it'd be helpful if we supported another operation: moves. Say we take two paragraphs and swap them. In a model with only deletions and insertions, the paragraph would have to be deleted from one place and inserted in another. This can lead to sub-optimal diffs as presented to the user. It can also lead to bad merges: say in one branch, a paragraph was edited, and in another branch, that paragraph was moved to a different location. 
An optimal merge would put the edited paragraph in the new location, which requires tracking these moves directly.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Background to a better solution&lt;/strong&gt;&lt;br /&gt;&lt;br /&gt;XML can be viewed as ordered &lt;a href="http://en.wikipedia.org/wiki/Tree_(data_structure)"&gt;trees&lt;/a&gt;, where each node has an unbounded number of children and each internal node has a label. (In some situations, it acts more like an unordered tree, but in XHTML it's definitely ordered. Also, finding the optimal matching for unordered trees is known to be NP-hard, so we don't want to go there.) So we can solve these problems of diffing and merging XML by solving a more general problem on ordered trees. There are some specific aspects of XML which deserve mention (attributes, which are guaranteed to have unique names within a node; IDs, which are guaranteed to be unique within a document), but these are minor aspects which we can ignore most of the time.&lt;br /&gt;&lt;br /&gt;To avoid confusion, I'll define some terms I've been using or will soon start using. When I talk about "diffing", what I mean, formally, is generating an "edit script", or list of changes between two documents that can be used to get the modified document from the original. Sometimes, these edit scripts are invertible, but not always. When I talk about a "merge", I mean a way to reconcile the changes between documents to incorporate both of these changes. A merge can be an operation on edit scripts or it can be done directly on a tree matching. A "matching" is a set of correspondences between nodes in different trees; it is the basis for doing either a diff or a merge, and it's difficult to do efficiently.&lt;br /&gt;&lt;br /&gt;(It might help to have a basic understanding of &lt;a href="http://en.wikipedia.org/wiki/Levenshtein_distance"&gt;Levenshtein distance&lt;/a&gt; and the implied string diff algorithm. 
The &lt;a href="http://en.wikipedia.org/wiki/Longest_common_subsequence_problem"&gt;longest common subsequence problem&lt;/a&gt; will also crop up here from time to time.)&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;The Zhang-Shasha algorithm and extensions&lt;/strong&gt;&lt;br /&gt;&lt;br /&gt;The Zhang-Shasha algorithm is the basic starting point when thinking about tree matching, diffing and merging. Except it isn't that basic. Dennis Shasha and Kaizhong Zhang created an algorithm to solve the approximate tree matching problem, which they described in the book &lt;a href="http://books.google.com/books?id=mFd_grFyiT4C"&gt;Pattern Matching Algorithms&lt;/a&gt;. Their chapter on trees is available from &lt;a href="http://citeseer.ist.psu.edu/shasha95approximate.html"&gt;Citeseer&lt;/a&gt;. Here's the basic idea: we want to see how similar two trees are, by a weighted edit distance metric. The edit script has three operations, similar to Levenshtein distance: add a node (optionally including a contiguous subsequence of the parent node), delete a node (putting children in the parent node), and relabel a node.&lt;br /&gt;&lt;br /&gt;With this, they were able to come up with an algorithm of complexity (basically) O((n log n)^2), where n is the number of nodes in the tree. So this can get you a matching and an edit script, or diff between two trees. This isn't great, but it's much better than previous algorithms. The two also worked on the more difficult problem of supporting "Variable-length don't-cares", or the equivalent of * in Google/Unix file globbing, in tree matching, but we don't care about that here.&lt;br /&gt;&lt;br /&gt;But this doesn't describe all of the changes that might take place in a tree structure that we might want to record. For example, in Zhang-Shasha proper, moving a node from one place to another is recorded as deleting the node from one place and inserting it into another place. 
Another issue is that inserting or deleting a subtree is recorded as inserting the node, then inserting each of its children, or deleting the leaf nodes recursively up until you delete their parent. This all leads to counterintuitive diffs, as far as human readability goes, as well as inflated edit distances.&lt;br /&gt;&lt;br /&gt;So David Barnard, Gwen Clarke and Nicholas Duncan got together to create a modified algorithm that accommodated this modified definition of edit distance in &lt;a href="http://citeseer.ist.psu.edu/47676.html"&gt;this paper&lt;/a&gt;. It adds three additional operations: insertTree, deleteTree, and swap. Unfortunately, this doesn't account for copying nodes, or for moves that aren't within the same parent.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Some tree matching heuristics&lt;/strong&gt;&lt;br /&gt;&lt;br /&gt;So, it's not very good that the Zhang-Shasha algorithm is quadratic in most cases. In fact, in many cases, it's unacceptable. For example, in my case, where I might sometimes have to compare XHTML documents which are very long, it's unacceptable. But there are some algorithms which run in a time which is dominated by the number of nodes multiplied by the edit distance, or O(ne).&lt;br /&gt;&lt;br /&gt;One algorithm called FastMatch, which has insert leaf, delete leaf, update and general move operations is presented by a bunch of people from Stanford in &lt;a href="http://citeseer.ist.psu.edu/chawathe96change.html"&gt;this paper&lt;/a&gt;. They work on getting an edit script and matching at the same time, but the algorithm starts by matching as much as possible, top-down, before proceeding to calculate the differences. This yields a complexity of O(ne+e^2). 
A related algorithm, described in Chapter 7 of &lt;a href="http://www.cs.hut.fi/~ctl/3dm/thesis.pdf"&gt;Tancred Lindholm's master's thesis [PDF]&lt;/a&gt;, incorporates tree insertion and deletion operations for a complexity of O(ne log n).&lt;br /&gt;&lt;br /&gt;It's important to note that both of these will be O(n^2) in the complete worst case. A different XML matching algorithm was described by Grégory Cobéna in his master's thesis (and also in &lt;a href="http://citeseer.ist.psu.edu/cobena01detecting.html"&gt;this paper&lt;/a&gt;, which is basically the relevant segment). Cobéna calls his algorithm BULD, which stands for bottom-up lazy-down. The key to the algorithm is that, for each node, there is a hash value and a weight, both calculated bottom-up. Exact equivalence between nodes can be approximated by equal hash values, and you search for the equal hash values of nodes that have been inserted on a maxheap by weight. In Cobéna's &lt;a href="http://gregory.cobena.free.fr/www/Publications/thesis_draft.pdf"&gt;full thesis [PDF]&lt;/a&gt;, he goes into more depth about his invertible edit script format. This algorithm doesn't necessarily generate the optimal diff, but in experiments it generates a very good one, and with a worst-case time of O(n log n).&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Three-document merge&lt;/strong&gt;&lt;br /&gt;&lt;br /&gt;Creating an edit script is all well and good, but it's only half of the problem; the other half is the merge. Remember that a three-document merge is one where we have the original document and two modified versions, and we want to create a fourth version with both modifications together. Here was my idea: create an edit script for both modified versions with respect to the original, then do one followed by another, with repeated modifications done only once. 
We know there's a conflict if the order matters, in terms of which comes first in applying to the original document.&lt;br /&gt;&lt;br /&gt;But this will come up with more conflicts than actually exist. For example, say some node A has four children, B C D and E. In one change, we insert a new node X after B as a child of A, and in another change, we insert a node Y after D as a child of A. So a sensible merge would have A's new children be B X C D Y E, in that order. But with the model described above, there would be an edit conflict!&lt;br /&gt;&lt;br /&gt;One solution to this is the more general strategy of &lt;a href="http://en.wikipedia.org/wiki/Operational_transformation"&gt;operational transformation&lt;/a&gt;. The basic idea for this technique as applied here is that, if we insert Y after inserting X, we have to add 1 to the index at which Y is inserted. If, on the other hand, Y is inserted first, we don't have to add one to the index at which X is inserted. This technique leads to fewer conflicting merges, or in OT lingo, it converges in more cases. There are a few formal properties of an operational transformation that have only recently been proven correct in the best-known algorithms. Pascal Molli used operational transformation, together with Cobéna's diff algorithm and format, in his So6 synchronization framework, which he described &lt;a href="http://www.loria.fr/~molli/pmwiki/uploads/Main/OsterICEIS07.pdf"&gt;in a deceptively short paper [PDF]&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;Tancred Lindholm went a different route altogether in creating a three-way merge, throwing out the edit script and basing it on a tree matching. 
His algorithm is described in &lt;a href="http://citeseer.ist.psu.edu/741860.html"&gt;a paper&lt;/a&gt; that I don't understand well enough to faithfully summarize here.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Conclusion&lt;/strong&gt;&lt;br /&gt;&lt;br /&gt;At this point, you might be thinking, "I read all that and he didn't even tell me how the algorithms work?!" I would tell you all about that, but there are two problems. First, that'd take thousands more words, and this post already exceeds 2000 words. This is just an introduction to the problem and the literature. But more importantly, I don't fully understand these algorithms. They're very complicated to follow and even harder to implement.&lt;br /&gt;&lt;br /&gt;So, overall, it looks like this is a much harder problem than I initially thought. Each time I read a paper on this, I think, "Oh, so &lt;em&gt;this&lt;/em&gt; is what I needed all along. And I wasted so much time looking at the other papers before it!" But there is always another more complicated layer. Before I get to each new layer, I think, "No one's ever done this before!" and then I suddenly find a paper about it, published increasingly recently.&lt;br /&gt;&lt;br /&gt;The problem I'm looking at right now is the last step, the interface. In particular, I'm not sure how to explain a diff to non-technical users and ask questions about how to resolve a merge conflict. Though I haven't found information about these things yet, I'm sure I will in the near future.&lt;br /&gt;&lt;br /&gt;(There's also the problem of doing a three-way merge on the text mentioned earlier, which should come out bolding one part and italicizing another. But this problem, in the general case, is significantly harder to solve because it requires knowledge of XHTML. Imagine how the desired behavior would differ if one of the tags were an anchor. 
It also makes the matching task more complex as text nodes can be expected to be matched when they are all together in one version, and split up in subnodes in another version. I'm not sure if anyone has researched this yet. Another difficult problem is to merge edits within a text node, which can be done with a standard LCS procedure and operational transformation but requires a fuzzy matching of text nodes.)&lt;br /&gt;&lt;br /&gt;Maybe, continuing in the trend of increasing dates, the answer to these questions hasn't come yet. In that case, I might have to solve this problem myself (and publish the results) so that later people don't have to solve it again. I've already started meeting with a Carleton computer science professor to talk about this problem in general.&lt;br /&gt;&lt;br /&gt;Academic computer science is amazing.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Update&lt;/strong&gt;: I've been getting lots of comments suggesting certain tools for XML diffing or merging. I should have made it more clear that this is a &lt;em&gt;solved problem&lt;/em&gt;, or rather, that all current XML diff/merge tools do roughly the same thing. What isn't a solved problem is a schema-sensitive diff for XHTML which doesn't treat text nodes as atomic, and no comment given so far has anything like that. Most of the research going on in XML diffing is about improving heuristics to give faster algorithms, but my main concern is diff quality: a diff over XHTML that's so good even an untrained person could understand it.&lt;br /&gt;&lt;br /&gt;If you're trying to solve the first problem of a general XML diff or merge, you shouldn't be using a closed-source implementation, because these don't generally tell you what algorithm they're using. 
Three good open-source implementations are &lt;a href="http://gemo.futurs.inria.fr/software/XyDiff/cdrom/www/xydiff/index-eng.htm"&gt;XyDiff&lt;/a&gt; (and its Java clone &lt;a href="http://potiron.loria.fr/projects/jxydiff"&gt;jXyDiff&lt;/a&gt;) implementing Cobena's algorithm in C++, Logilab's &lt;a href="http://www.logilab.org/859"&gt;xmldiff&lt;/a&gt; implementing the FastMatch algorithm in Python, and the &lt;a href="http://tdm.berlios.de/3dm/doc/index.html"&gt;3DM project&lt;/a&gt; implementing Lindholm's algorithms. Unfortunately, none of these seems to have been updated since 2006.&lt;br /&gt;&lt;br /&gt;Originally, my description of edit distances in the Zhang-Shasha algorithm was incorrect (I described Selkow's operations), but it's now fixed.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/273593670040001243-7998438314037640248?l=useless-factor.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://useless-factor.blogspot.com/feeds/7998438314037640248/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=273593670040001243&amp;postID=7998438314037640248' title='27 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/273593670040001243/posts/default/7998438314037640248'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/273593670040001243/posts/default/7998438314037640248'/><link rel='alternate' type='text/html' href='http://useless-factor.blogspot.com/2008/01/matching-diffing-and-merging-xml.html' title='Matching, diffing and merging XML'/><author><name>Daniel Ehrenberg</name><uri>http://www.blogger.com/profile/00902922561603041049</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>27</thr:total></entry><entry><id>tag:blogger.com,1999:blog-273593670040001243.post-8598823082615528767</id><published>2008-01-13T23:04:00.000-08:00</published><updated>2008-01-14T11:48:12.340-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='introduction'/><category scheme='http://www.blogger.com/atom/ns#' term='theory'/><category scheme='http://www.blogger.com/atom/ns#' term='factor'/><category scheme='http://www.blogger.com/atom/ns#' term='scheme'/><title type='text'>until: The Ultimate Enumerator</title><content type='html'>Recently, via &lt;a href="http://programming.reddit.com"&gt;programming.reddit&lt;/a&gt;, I found this &lt;a href="http://okmij.org/ftp/papers/LL3-collections-enumerators.txt"&gt;paper&lt;/a&gt; by &lt;a href="http://okmij.org/ftp/"&gt;Oleg Kiselyov&lt;/a&gt; about the best way to build a generic collection protocol to go through a sequence. We want to do this so that operations on collections can be implemented once generically to work on any kind of collection.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Oleg's idea&lt;/strong&gt;&lt;br /&gt;&lt;br /&gt;There are two main ways to go about this: one is a cursor, which is called an iterator in Java and C#: basically, you have an object which represents a particular point in a collection, and there are operations to get the current value and the next cursor position, if one exists. The alternative is to have a higher order function called an enumerator which represents, basically, a left fold. Oleg's conclusion is that the left fold operator is the best way to go about things, and he proceeds to demonstrate how this works out in languages with and without call/cc. 
He shows how cursors and enumerators are isomorphic, defining each in terms of the other.&lt;br /&gt;&lt;br /&gt;Oleg's function is something like this: you have a coll-fold-left function which takes a collection, a function, and an unbounded number of seeds. This will return those seeds after this left fold processing is done. The function takes an element and the values of the seed inputs and returns an indicator with new values for the seeds. The indicator determines whether to keep going or to stop now and return the current values from coll-fold-left.&lt;br /&gt;&lt;br /&gt;An enumeration function like this, which is fully general, is much better than a cursor, because cursors depend on mutable state and they are much more difficult to implement, say, when traversing a tree with no links to parent nodes. Also, enumerators handle termination of the collection much better than iterators, where it must either be manually checked whether the collection has terminated, or an exception thrown when reading past the end. There are also efficiency problems, and some interesting research into doing compiler optimizations like deforestation based on enumerators.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Concatenativity and accumulators&lt;/strong&gt;&lt;br /&gt;&lt;br /&gt;I'm going to talk about how this can all work in Factor. Factor does have call/cc, so we can use the simpler iterator which depends on call/cc for full control. (I won't go into how to stop and start an iteration in the middle in this article; Oleg's example can be translated directly into Factor.) But we also have something else to take advantage of, which neither Scheme nor Haskell has at its disposal: a stack.&lt;br /&gt;&lt;br /&gt;Remember, Factor is a concatenative language, and that means that whenever we make a function call, it operates on a huge built-in accumulator. 
You could think of the stack as one accumulator ("all words are functions from stack to stack") or as an unbounded number of accumulators (which makes more sense thinking about things from a dataflow perspective, that the stack is used to glue together the inputs and outputs of various words).&lt;br /&gt;&lt;br /&gt;Here's an example. You know the regular old left fold? As in, taking a list like [1, 2, 3, 4], an initial value (called the identity) 0, and a function like (+), and transforming this into ((((0+1)+2)+3)+4)? Well, in Factor we call that &lt;code&gt;reduce&lt;/code&gt;, and here's how it's defined:&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;: reduce ( sequence identity quotation -- value )&lt;br /&gt;    ! quotation: accumulator element -- new-value&lt;br /&gt;    swapd each ; inline&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;code&gt;swapd&lt;/code&gt; is a stack shuffler with the effect &lt;code&gt;( x y z -- y x z )&lt;/code&gt; and &lt;code&gt;each&lt;/code&gt; is a word which takes a sequence and a quotation, and calls the quotation on each element of the sequence in order. &lt;code&gt;each&lt;/code&gt; normally takes a quotation with stack effect &lt;code&gt;( element -- )&lt;/code&gt;.&lt;br /&gt;&lt;br /&gt;But how does this work? &lt;code&gt;each&lt;/code&gt;, like nearly every combinator, is defined to expose its quotation to values further down on the stack, so we can mess around with that stuff all we want as long as the result is balanced. Here, the quotation that we give the whole thing is messing with the top thing on the stack, right under the element, each time it's called. So, by the end, the top of the stack has our answer. Keep in mind that all of this is happening completely without mutable state.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;&lt;code&gt;until&lt;/code&gt;&lt;/strong&gt;&lt;br /&gt;&lt;br /&gt;So, that's the great simplification that you get from concatenative languages: there is no need to maintain the accumulators. 
Here's all we need to define the perfect enumeration protocol: a word which I call &lt;code&gt;until ( collection quotation -- )&lt;/code&gt;. It takes a quotation and a collection from the stack and leaves nothing. The quotation is of the effect &lt;code&gt;( element -- ? )&lt;/code&gt;. If the quotation returns true, the iteration through the list stops there.&lt;br /&gt;&lt;br /&gt;&lt;code&gt;until&lt;/code&gt; is significantly easier to define than &lt;code&gt;coll-fold-left&lt;/code&gt; because we don't have to worry about the accumulators. Here's a simple definition with linked lists, if &lt;code&gt;f&lt;/code&gt; is used as the empty list:&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;TUPLE: cons car cdr ;&lt;br /&gt;&lt;br /&gt;: until ( list quot -- )&lt;br /&gt;    over [&lt;br /&gt;        [ &gt;r cons-car r&gt; call ] 2keep rot ! list quot ?&lt;br /&gt;        [ 2drop ] [ &gt;r cons-cdr r&gt; until ] if&lt;br /&gt;    ] [ 2drop ] if ; inline&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;Of course, we'd probably want to define it as a generic word, if we're doing this in Factor and will use it for more than one thing.&lt;br /&gt;&lt;br /&gt;(&lt;code&gt;until&lt;/code&gt; is kinda like the generic iteration protocol I've defined for assocs, using &lt;code&gt;assoc-find&lt;/code&gt;, except &lt;code&gt;until&lt;/code&gt; should be easier to implement.)&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Derived operations&lt;/strong&gt;&lt;br /&gt;&lt;br /&gt;Using &lt;code&gt;until&lt;/code&gt; as a base, we can easily define other basic operations like &lt;code&gt;each&lt;/code&gt;, &lt;code&gt;find&lt;/code&gt; and &lt;code&gt;assoc-each&lt;/code&gt; as they currently exist in Factor.&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;: each ( seq quot -- )&lt;br /&gt;    [ f ] compose until ; inline&lt;br /&gt;&lt;br /&gt;: find ( seq quot -- i elt ) ! This is hideous, and should be factored&lt;br /&gt;    &gt;r &gt;r -1 f r&gt; r&gt; [ ! i f elt quot&lt;br /&gt;        2swap &gt;r &gt;r keep r&gt; 1+ r&gt; drop ! ? elt i&lt;br /&gt;        rot [ rot and ] keep ! 
i f/elt ?&lt;br /&gt;    ] curry until dup [ nip f swap ] unless ; inline&lt;br /&gt;&lt;br /&gt;: assoc-each ( assoc quot -- )&lt;br /&gt;    ! This assumes that assocs implement until, giving 2arrays for each element&lt;br /&gt;    [ first2 ] swap compose each ; inline&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Evaluation&lt;/strong&gt;&lt;br /&gt;&lt;br /&gt;You might notice that &lt;code&gt;find&lt;/code&gt; will involve a little bit of redundant arithmetic on integer-indexed sequences. I'm not sure how to avoid this without complicating &lt;code&gt;until&lt;/code&gt; and leaking implementation details. Also, you might notice that there's no obvious way to implement &lt;code&gt;find*&lt;/code&gt; (without doing a bunch of redundant iteration), which takes an index to start as an argument. This is a result of the fact that &lt;code&gt;find*&lt;/code&gt; can only really be implemented efficiently on collections with random access indexing. There's also no &lt;code&gt;find-last&lt;/code&gt;.&lt;br /&gt;&lt;br /&gt;&lt;code&gt;until&lt;/code&gt; isn't everything. As I pointed out just now, some operations, like &lt;code&gt;find&lt;/code&gt;, may have to do redundant arithmetic when operating on certain kinds of collections to get an index, when that index must already be calculated for the traversal itself. Another limitation is that there's no obvious way to implement &lt;code&gt;map&lt;/code&gt;. (I would still have to suggest going with Factor's current scheme of the words &lt;code&gt;new&lt;/code&gt;, &lt;code&gt;new-resizable&lt;/code&gt; and &lt;code&gt;like&lt;/code&gt;. But this scheme doesn't work with, say, linked lists.) 
This could be implemented in place of the assoc iteration protocol, but that might result in some unnecessary allocation of 2arrays when iterating through things like hashtables.&lt;br /&gt;&lt;br /&gt;Nevertheless, something like &lt;code&gt;until&lt;/code&gt; could definitely help us make our sequence protocol more generic, allowing for (lazy) lists, assocs, and what we currently call sequences to be used all through the same protocol.&lt;br /&gt;&lt;br /&gt;&lt;hr/&gt;If you don't understand the title allusion, you should read some of &lt;a href="http://library.readscheme.org/page1.html"&gt;the original Scheme papers&lt;/a&gt;.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/273593670040001243-8598823082615528767?l=useless-factor.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://useless-factor.blogspot.com/feeds/8598823082615528767/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=273593670040001243&amp;postID=8598823082615528767' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/273593670040001243/posts/default/8598823082615528767'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/273593670040001243/posts/default/8598823082615528767'/><link rel='alternate' type='text/html' href='http://useless-factor.blogspot.com/2008/01/until-ultimate-enumerator.html' title='&lt;code&gt;until&lt;/code&gt;: The Ultimate Enumerator'/><author><name>Daniel Ehrenberg</name><uri>http://www.blogger.com/profile/00902922561603041049</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-273593670040001243.post-7477927383128943389</id><published>2008-01-11T06:22:00.000-08:00</published><updated>2008-01-12T15:59:21.583-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='event'/><category scheme='http://www.blogger.com/atom/ns#' term='factor'/><title type='text'>Factor hack in NYC</title><content type='html'>This is kinda short notice, but if anyone's in the New York area today, Zed Shaw is organizing another one of his Factor hackfests in Manhattan, which I'll be going to. The details are at &lt;a href="http://www.zedshaw.com/blog/2008-01-10.html"&gt;his blog&lt;/a&gt; (unfortunately I can't find a link to the particular article, just the blog index). So it'll be at &lt;a href="http://www.earthmatters.com/"&gt;Earth Matters&lt;/a&gt; (177 Ludlow St) at 7 PM on Friday, January 11th.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Update&lt;/strong&gt;: Sorry to those of you who couldn't come on such short notice, or on this continent. The whole meeting was really fun, though. In all, nine people came, four of whom had barely seen Factor at all before then. So I gave a little tutorial introduction along the lines of &lt;a href="http://useless-factor.blogspot.com/2007/07/factorial-101.html"&gt;this blog post&lt;/a&gt; using factorial as an example. I also explained &lt;a href="http://useless-factor.blogspot.com/2007/06/concatenative-pattern-matching.html"&gt;inverse&lt;/a&gt;, demoed &lt;a href="http://useless-factor.blogspot.com/2007/07/thats-dirtiest-macro-ive-ever-seen.html"&gt;shufflers&lt;/a&gt;, and talked about &lt;a href="http://useless-factor.blogspot.com/2007/09/using-bigums-efficiently.html"&gt;stuff about bignums and complexity&lt;/a&gt;, but I don't think I explained those things in enough depth to be clear. 
Luckily it wasn't a complete monologue on my part, as Zed and a couple others made helpful comments to help my descriptions and stuff. Then we devolved into random discussions about various things, some Factor-related and others not, which occupied the majority of the evening and was really fun. There's a good chance I'll be back in New York next April for the first few days of Passover, and I hope to see many of you there at a later Factor meetup!&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/273593670040001243-7477927383128943389?l=useless-factor.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://useless-factor.blogspot.com/feeds/7477927383128943389/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=273593670040001243&amp;postID=7477927383128943389' title='3 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/273593670040001243/posts/default/7477927383128943389'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/273593670040001243/posts/default/7477927383128943389'/><link rel='alternate' type='text/html' href='http://useless-factor.blogspot.com/2008/01/factor-hack-in-nyc.html' title='Factor hack in NYC'/><author><name>Daniel Ehrenberg</name><uri>http://www.blogger.com/profile/00902922561603041049</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>3</thr:total></entry><entry><id>tag:blogger.com,1999:blog-273593670040001243.post-8904932166050225446</id><published>2008-01-08T10:14:00.000-08:00</published><updated>2008-01-08T10:23:54.399-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='ignorable'/><title type='text'>Useless Factor is one 
year old!</title><content type='html'>Actually, it's a little more than a year old; I started this blog on December 19, 2006 with a still-relevant post on &lt;a href="http://useless-factor.blogspot.com/2006/12/variants-of-if-in-factor.html"&gt;Factor's forms of if&lt;/a&gt;. I meant to put up a "happy birthday me!" post on the right date, but I forgot, so now: Happy Birthday Me!&lt;br /&gt;&lt;br /&gt;Many of you brave, intrepid blog readers have endured more than a year of dry, technical posts approaching 2000 words a pop just to learn what I have in my head, giving me a refreshing boost of self-importance. Be proud of yourselves!&lt;br /&gt;&lt;br /&gt;(As a little side-comment about Steve Yegge's &lt;a href="http://steve-yegge.blogspot.com/2008/01/blogging-theory-201-size-does-matter.html"&gt;recent blog post&lt;/a&gt; about long blog posts, I write at length for a somewhat different reason. I don't try to make readers endure one huge sitting, but rather try to explain everything as fully as is needed to give a good understanding of things. Because of the topics I usually write about, this sometimes takes a lot of words. I do this completely without noticing; I only think about it when someone points it out to me. 
I will never limit myself to 1000-word blog posts, because there's just too much important stuff to say.)&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/273593670040001243-8904932166050225446?l=useless-factor.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://useless-factor.blogspot.com/feeds/8904932166050225446/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=273593670040001243&amp;postID=8904932166050225446' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/273593670040001243/posts/default/8904932166050225446'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/273593670040001243/posts/default/8904932166050225446'/><link rel='alternate' type='text/html' href='http://useless-factor.blogspot.com/2008/01/useless-factor-is-one-year-old.html' title='Useless Factor is one year old!'/><author><name>Daniel Ehrenberg</name><uri>http://www.blogger.com/profile/00902922561603041049</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-273593670040001243.post-608794531426110771</id><published>2008-01-04T19:54:00.000-08:00</published><updated>2008-01-09T19:02:30.376-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='factor'/><category scheme='http://www.blogger.com/atom/ns#' term='ignorable'/><title type='text'>Core Factor developers and their projects</title><content type='html'>Here's a quick listing of the main current Factor developers and their projects. 
Hopefully, it'll help beginners navigate the community better.&lt;br /&gt;&lt;br /&gt;I wanted to keep this list short, but I hope I'm not offending anyone by leaving them off, so please tell me if there's some glaring omission here. There's definitely no official "core" group of Factor developers; these are just the people who have contributed a lot in the past. I put the names in order of seniority (in terms of time using Factor).&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Slava Pestov&lt;/strong&gt;&lt;br /&gt;IRC nick: &lt;code&gt;slava&lt;/code&gt;&lt;br /&gt;&lt;br /&gt;Slava is Factor's creator. He wrote almost all of the core, the compiler and the Factor virtual machine, the frontend of the Factor UI and the backend on X11 and Mac OS X, a detailed math library and the Factor HTTP server. Slava's definitely the leader of the Factor project, but he doesn't tend to operate in a dictatorial way, instead taking input from many people on language design decisions.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Chris Double&lt;/strong&gt;&lt;br /&gt;IRC nick: &lt;code&gt;doublec&lt;/code&gt;&lt;br /&gt;&lt;br /&gt;Chris was Factor's first adopter. He wrote a bunch of interesting libraries, including one for lazy lists, another for parsing expression grammars (PEGs), a compiler from a subset of Factor to JavaScript, and an 8086 emulator suitable for playing old console games. He also wrote a bunch of useful bindings to various multimedia libraries and a bunch of other things too numerous to list here. Unfortunately for us, Chris has a full-time job, limiting his time to hack on Factor.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Eduardo Cavazos&lt;/strong&gt;&lt;br /&gt;IRC nick: &lt;code&gt;dharmatech&lt;/code&gt;&lt;br /&gt;&lt;br /&gt;Ed is a non-conformist and proud of it. He devised Factor's current module system, where vocabs in a USE: clause that haven't already been loaded are loaded automatically. 
He also wrote a window manager in Factor, a really cool artificial life demo called "boids", a chat server/client, an art programming language implementation and an alternative object system for Factor. &lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Daniel Ehrenberg&lt;/strong&gt; (me)&lt;br /&gt;IRC nick: &lt;code&gt;littledan&lt;/code&gt;&lt;br /&gt;&lt;br /&gt;I'm working on Factor's XML library, Unicode support, and a concatenative pattern matching library. I also have this blog, where I try to write useful (or useless) things about Factor.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Doug Coleman&lt;/strong&gt;&lt;br /&gt;IRC nick: &lt;code&gt;erg&lt;/code&gt;&lt;br /&gt;&lt;br /&gt;Doug made a bunch of important libraries including the calendar library, SQL bindings (he's currently working on an abstraction layer), a Mersenne Twister random number generator, some other math libraries, a cryptography library and integration for several text editors. Doug took over maintaining Windows I/O after another contributor, Mackenzie Straight, stopped maintaining it. Doug's really friendly to the beginners in Factor, even the ones who ask stupid questions.&lt;br /&gt;&lt;br /&gt;Mackenzie Straight, Elie Chaftari and Alex Chapman, among others, also contributed a lot. You can see what everyone's done by looking in the vocabulary browser included in Factor.&lt;br /&gt;&lt;br /&gt;At the risk of repeating myself, when you have questions about anything related to Factor, including any of the libraries discussed here, you should pose the question to the general Factor forums: the IRC channel #concatenative on Freenode and the mailing list. 
We'd be happy to answer them.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Update&lt;/strong&gt;: Fixed attributions for Doug and Elie, and changed last paragraph.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/273593670040001243-608794531426110771?l=useless-factor.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://useless-factor.blogspot.com/feeds/608794531426110771/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=273593670040001243&amp;postID=608794531426110771' title='3 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/273593670040001243/posts/default/608794531426110771'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/273593670040001243/posts/default/608794531426110771'/><link rel='alternate' type='text/html' href='http://useless-factor.blogspot.com/2008/01/core-factor-developers-and-their.html' title='Core Factor developers and their projects'/><author><name>Daniel Ehrenberg</name><uri>http://www.blogger.com/profile/00902922561603041049</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>3</thr:total></entry><entry><id>tag:blogger.com,1999:blog-273593670040001243.post-5668129434880833118</id><published>2008-01-03T23:53:00.000-08:00</published><updated>2009-11-24T21:13:41.374-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='introduction'/><category scheme='http://www.blogger.com/atom/ns#' term='factor'/><title type='text'>Learning Factor</title><content type='html'>Learning Factor is tough. One reason for this is that Factor is very different from other programming languages. 
Programmers today are used to imperative programming languages where data is stored and passed around in named variables (or function calls, which name their variables). Factor is the opposite of this. A lot of code tends to be written in a functional style, and even more jarringly, variables are rare, only referenced in a small fraction of words. Nobody intends to change any of this; it's a feature, not a bug!&lt;br /&gt;&lt;br /&gt;What we do need to change is the other reason Factor is difficult to learn: lack of documentation and organization thereof. It's hard for me to say this when a lot of effort has been put into documentation, especially creating a comprehensive reference for the Factor core and libraries. When I got started on Factor, there was basically no documentation at all; I just asked questions on the IRC channel. Now there's tons of it, but it's in many places.&lt;br /&gt;&lt;br /&gt;Two starting places (not necessarily in this order) are the &lt;a href="http://concatenative.org/wiki/view/Factor/FAQ"&gt;FAQ&lt;/a&gt; and the Factor cookbook. The Factor cookbook is included in the Factor distribution, accessed by running Factor, selecting the "Browser" tab and clicking on Factor cookbook, or &lt;a href="http://factorcode.org/responder/help/show-help?topic=cookbook"&gt;online&lt;/a&gt;. It was written by Slava Pestov, and illustrates really clearly the basics of Factor.&lt;br /&gt;&lt;br /&gt;But once you get done with that, where should you go? 
Here are a bunch of options, depending on what exactly you want to do:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Factor's included reference documentation, written mostly by Slava, is sound and basically complete but not very tutorial-like.&lt;/li&gt;&lt;li&gt;Slava's &lt;a href="http://factor-language.blogspot.com/"&gt;blog&lt;/a&gt; is the best place to learn about new language features as they're developed.&lt;/li&gt;&lt;li&gt;Aaron Schaefer wrote a series of blog posts that introduce Factor, starting &lt;a href="http://elasticdog.com/2008/11/beginning-factor-introduction/"&gt;here&lt;/a&gt;.&lt;/li&gt;&lt;li&gt;I wrote a bunch of &lt;a href="http://useless-factor.blogspot.com/search/label/introduction"&gt;introductory blog posts&lt;/a&gt;.&lt;/li&gt;&lt;li&gt;Elie Chaftari wrote an &lt;a href="http://fun-factor.blogspot.com/2007/03/factor-magic.html"&gt;introduction&lt;/a&gt; to programming Factor, and he wrote about &lt;a href="http://fun-factor.blogspot.com/2007/10/getting-started-with-factor-easy-ffi.html"&gt;the FFI&lt;/a&gt;.&lt;/li&gt;&lt;li&gt;You can look at many other blogs about Factor through the &lt;a href="http://planet.factorcode.org/"&gt;Planet Factor aggregator&lt;/a&gt;. Scattered around here are many more introductions to Factor.&lt;/li&gt;&lt;li&gt;The &lt;a href="http://www.latrobe.edu.au/philosophy/phimvt/joy.html"&gt;Joy papers&lt;/a&gt; by Manfred von Thun provide a theoretical basis for concatenative languages.&lt;/li&gt;&lt;li&gt;You could try reading some source code in the libraries. Simple demos have the tag &lt;code&gt;demo&lt;/code&gt;, which you can query in the vocab browser. The core sequence library is a good place to look for clean, idiomatic code.&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;The best way to learn Factor, after you have a grasp of the basic syntax, is to play around with it and try to build something useful. 
Here are some project ideas:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Write a date parsing library, accepting ISO 8601 format or a natural English (and maybe internationalized) format. A calendar backend already exists.&lt;/li&gt;&lt;li&gt;Create an embedded, lexically scoped, infix/prefix DSL for complex math calculations&lt;/li&gt;&lt;li&gt;Make an analog clock with the Factor UI library, or a GUI for Factor server administration&lt;/li&gt;&lt;li&gt;Solve some &lt;a href="http://projecteuler.net/"&gt;Project Euler&lt;/a&gt; problems in Factor&lt;/li&gt;&lt;li&gt;Whenever you encounter some part of Factor that you don't like, or some missing language feature, implement it. It's very likely that you'll be able to do whatever you want in the library&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;These problems may sound a bit difficult, but most of them have simple part-way solutions so you can begin to get your feet wet. And with any of these, if you have useful working code, we'd be happy to take it into the Factor library, if you feel like releasing it under a BSD-style license.&lt;br /&gt;&lt;br /&gt;If you get stuck, don't despair! The friendly people on the Factor mailing list and irc.freenode.net's #concatenative channel are here to answer all of your questions. You could also just &lt;a href="mailto:ehrenbed@carleton.edu"&gt;email me&lt;/a&gt;. 
Eventually, we'll have a Factor book, but it hasn't been started yet, in part because Factor's not at version 1.0 and in part because none of us have enough free time to write it.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Update&lt;/strong&gt;: Factor documentation isn't as bad as I originally put it, so I changed this post a little.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/273593670040001243-5668129434880833118?l=useless-factor.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://useless-factor.blogspot.com/feeds/5668129434880833118/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=273593670040001243&amp;postID=5668129434880833118' title='4 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/273593670040001243/posts/default/5668129434880833118'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/273593670040001243/posts/default/5668129434880833118'/><link rel='alternate' type='text/html' href='http://useless-factor.blogspot.com/2008/01/learning-factor.html' title='Learning Factor'/><author><name>Daniel Ehrenberg</name><uri>http://www.blogger.com/profile/00902922561603041049</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>4</thr:total></entry><entry><id>tag:blogger.com,1999:blog-273593670040001243.post-6765027990678895117</id><published>2007-12-31T20:25:00.000-08:00</published><updated>2008-01-01T11:00:53.233-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='factor'/><category scheme='http://www.blogger.com/atom/ns#' term='unicode'/><title type='text'>Unicode implementer's guide part 6: Input and output</title><content type='html'>Until 
now, in my blog posts on Unicode, I've been focusing on internal processing which has only an indirect effect on users. But now I'm thinking about something else: how can you write a text editing widget which supports as many locales as possible? This should generally be the goal when writing internationalized applications, and almost all applications should eventually be internationalized, so why not a text editor?&lt;br /&gt;&lt;br /&gt;As it turns out, this is an extremely complicated task, combining the difficulties of input methods and rendering. In this article I'll try to touch on as many relevant topics as possible, trying not to ramble too much on any particular one. Really, each section should probably be given its own blog post.&lt;br /&gt;&lt;br /&gt;To test the &lt;a href="http://useless-factor.blogspot.com/2007/09/when-unicode-just-works-itself-out.html"&gt;Unicode-blind&lt;/a&gt; approach, I'm looking at the Factor .91 editor gadget, which pays no attention at all to internationalization and basically parrots back whatever characters it receives. Different keyboard layouts were tested on Mac OS X 10.5, though that shouldn't matter here.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Typing European scripts&lt;/strong&gt;&lt;br /&gt;&lt;br /&gt;European scripts are pretty easy. You just type letters on the keyboard and they appear, always in left-to-right order, on the screen, one character appearing for each keystroke in many languages. This works for English, German, Russian, Georgian and many other languages, but not for, say, Spanish or Vietnamese. In those languages, there are characters like é or ở which require multiple keystrokes to input. With a typical Spanish keyboard, to write é, first the ´ key is pressed and then the e key. For ở, you first press the ơ key and then the  ̉ key.&lt;br /&gt;&lt;br /&gt;There are more complicated input methods for, say, typing é with a US keyboard, and these are system-dependent. 
On Mac OS X, I use Alt-e to get that first combining ´ mark, and then an e to constitute the character under it. This can be used in any application. In Microsoft Office applications on Windows, you can use Ctrl-' to make a combining ´ mark. Elsewhere, you can't use that key chord, just some complicated keystrokes to input the right hex code. I believe Gnome and KDE each define their own mechanisms for this kind of input.&lt;br /&gt;&lt;br /&gt;In a Unicode-blind text editor, none of these multiple-keystroke inputs work. When in the Spanish keyboard mode in the Factor editor gadget, pressing ´ does nothing, and when in Vietnamese mode, pressing ơ does nothing. This is because the input system expects something more complicated to happen next. The OS-specific keys, of course, don't work properly either unless there is special handling for them. All of these things must be taken into account in an internationalized editor widget.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Typing East Asian scripts&lt;/strong&gt;&lt;br /&gt;&lt;br /&gt;When looking at the possible input methods in the Mac keyboard layout chooser, you're immediately struck by the fact that four East Asian languages (Chinese, Japanese, Korean and Vietnamese) have far more input methods listed than any other individual language. This is probably, more than anything else, because of the difficulty involved in inputting Han ideographs.&lt;br /&gt;&lt;br /&gt;Han ideographs (Chinese characters) are used in Chinese and Japanese (as Kanji), occasionally in Korean and historically in Vietnamese. Unicode encodes over 70,000 Han ideographs, but at least 90% of these are extremely rare. Still, thousands come up in everyday situations and need to be typed into computers.&lt;br /&gt;&lt;br /&gt;There are a number of ways of inputting Han ideographs. One way is through a Romanization. 
This requires several dictionaries, because each language pronounces the ideographs differently, and there are many Romanization systems for each language. One Romanization may correspond to more than one ideograph, so users are presented with a list of numbered choices. Another input method works by specifying the radicals, or component pieces, of the Han ideographs.&lt;br /&gt;&lt;br /&gt;Japanese and Korean (and Vietnamese, discussed above) also have alphabets. Japanese has the kana (katakana and hiragana), syllabaries which can be entered by a special keyboard or through Romanization. Korean has Hangul (I &lt;a href="http://useless-factor.blogspot.com/2007/08/unicode-implementers-guide-part-3.html"&gt;blogged&lt;/a&gt; about this earlier), whose syllables are made of the 24 letters in the jamo alphabet. These can be entered by Romanization or by typing jamo. In either case, things are expected to coalesce into Hangul syllables automatically during typing.&lt;br /&gt;&lt;br /&gt;The input method is usually chosen outside of the current application, and applications are expected to work perfectly with that choice. A good text editing widget must communicate with the OS settings to pick the right one, and it must implement it properly.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Rendering difficult scripts&lt;/strong&gt;&lt;br /&gt;&lt;br /&gt;Hebrew, European scripts and East Asian scripts are relatively simple to render. Well, maybe not that simple, when it comes to placing (and stacking!) combining marks and laying out Hangul syllables or Han ideographs. But at least two adjacent letters don't tend to combine with each other spontaneously, forming several types of ligatures depending on what's on the left and what's on the right. 
At least letters go in their prevailing order, not rearranging themselves visually.&lt;br /&gt;&lt;br /&gt;Certain Middle Eastern scripts like Arabic and Syriac, and several Indic scripts like Devanagari, don't follow these rules. Arabic and Syriac are written in a cursive form. Each letter has up to four (or sometimes five) forms in the simplified form that most computer typesetting systems use, and vowels are written above or below the characters as combining marks. In some computer typesetting systems, these shapes can be inferred from context, but others will display things incorrectly unless special code points are used which include information about which form is meant.&lt;br /&gt;&lt;br /&gt;Indic scripts are also complicated, but differently. The term "Indic script" refers not only to scripts used in India but to a whole family of scripts used around South and Southeast Asia. There are 19 of these scripts encoded in Unicode, all with somewhat similar properties. Devanagari, the script used for Hindi, Nepali and Sanskrit, among other things, is one of these Indic scripts, and others include Malayalam, Tibetan, Tamil, Khmer and Thai. Among consonants in most Indic scripts, there are a large number of ligatures, but these are not encoded into Unicode; they must be inferred by a suitable font. Vowels in Indic scripts can be written after a consonant group, above it, below it, or even &lt;em&gt;before&lt;/em&gt; the group, and this must also be inferred by the font. (I'm oversimplifying in both of those sentences, of course. Things are much more complicated.) In most Indic scripts' Unicode encodings, vowels which are placed before a consonant group come &lt;em&gt;afterwards&lt;/em&gt; in memory, following logical order. Thai and Lao scripts don't follow these rules, and put those vowels before, but this causes complications in collation. 
All of these properties of Indic scripts create difficulties not only in rendering but also in input methods, but I haven't read much specifically on this.&lt;br /&gt;&lt;br /&gt;Some or all of these issues might be solved by choosing a suitable font and display system rather than the editor widget itself.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Line breaking&lt;/strong&gt;&lt;br /&gt;&lt;br /&gt;I &lt;a href="http://useless-factor.blogspot.com/2007/08/unicode-implementers-guide-part-4.html"&gt;wrote about grapheme breaking&lt;/a&gt; previously. Unicode line breaking is sort of like grapheme breaking, except much much more complicated. By "line breaking" I mean what you might call word wrap. It's described in &lt;a href="http://unicode.org/reports/tr14/"&gt;UAX #14&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;This document describes several classes of characters with different line-breaking properties. Some force a line break before or after their display, others passively allow a line break before or after, some forbid a line break and others have even more complicated properties. All in all, there are 38 categories. Even though the algorithm is complicated, it (like generating a collation key) can be done in linear time with respect to the length of the string.&lt;br /&gt;&lt;br /&gt;With respect to line breaking, different scripts and locales have different rules. In Chinese, Japanese and Korean text, where spaces between words are rare or nonexistent, you can do a line break between any two Han ideographs, kana or Hangul syllables. Often, arbitrary line breaks are even allowed within words written in Roman script. In Korean, we have to watch out that we don't break up a Hangul syllable, so conjoining jamo behavior rules are used.&lt;br /&gt;&lt;br /&gt;In Khmer, Lao and Thai scripts, there are also no spaces between words, but line breaks must not interrupt words. 
This requires some linguistic analysis and dictionary lookup, but Unicode provides a code point to indicate line breaking opportunities. This may sound a little weird and impractical, but a similar thing also exists for European scripts which may hyphenate words between lines: the soft hyphen.&lt;br /&gt;&lt;br /&gt;Even if things are restricted to American English, Unicode line breaking properties provide for better visual display than separating things at word breaking points, or just where there's whitespace. Still, it provides a significant implementation challenge for the writer of an internationalized text editor.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Bidirectional text&lt;/strong&gt;&lt;br /&gt;&lt;br /&gt;In Arabic, Hebrew and some other scripts, text runs from right to left rather than left to right. But what if we have a document that combines Hebrew and English: how do we organize text running in two directions? In an even more common situation, we could have text containing Hebrew and numbers. (Even in Hebrew and Arabic, decimal numbers are written left to right.) According to Unicode principles, both left-to-right (LTR) and right-to-left (RTL) scripts should be written in logical order, yet their display order is different.&lt;br /&gt;&lt;br /&gt;To negotiate these issues, the Unicode Consortium specified the &lt;a href="http://www.unicode.org/reports/tr9/"&gt;Bidirectional (BIDI) Algorithm&lt;/a&gt; (UAX #9). The BIDI Algorithm could easily fill up a whole blog post by itself, and I don't fully understand it, so I won't try to explain it all here. There are a number of different categories that Unicode code points go in for BIDI purposes: characters can be not only LTR or RTL, but also neutral, mirroring, strongly or weakly directioned, or an invisible special-purpose mark. 
These categories sometimes overlap, and there are several levels of directioning.&lt;br /&gt;&lt;br /&gt;The BIDI algorithm is difficult to implement for display, but it gets even more complicated in a text editor. When placing the cursor somewhere, there can be ambiguity as to where it is. When selecting things, there is sometimes a design decision whether to support logical selection (which could be discontiguous) or visual selection (which is hard to implement). The best resource to learn more about this is the &lt;a href="http://www.unicode.org/reports/tr9/"&gt;UAX document&lt;/a&gt; itself.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Conclusion&lt;/strong&gt;&lt;br /&gt;&lt;br /&gt;For Factor, it will be impossible to keep the editor widget written completely in Factor. It would require a huge amount of duplicated effort and system-specific configuration to get it right, so we'll probably end up having to use native (meaning Gtk, Cocoa and Windows UI) editor widgets.&lt;br /&gt;&lt;br /&gt;Clearly, the Unicode-blind approach is insufficient here. Because of all this, internationalized editor widgets are implemented pretty rarely. English-speakers can get complacent with simple input methods, but these will not work for others.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Warning&lt;/strong&gt;: Readers should be aware that I haven't studied any one of these topics in depth. 
This is just an introduction and shouldn't be taken very seriously.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/273593670040001243-6765027990678895117?l=useless-factor.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://useless-factor.blogspot.com/feeds/6765027990678895117/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=273593670040001243&amp;postID=6765027990678895117' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/273593670040001243/posts/default/6765027990678895117'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/273593670040001243/posts/default/6765027990678895117'/><link rel='alternate' type='text/html' href='http://useless-factor.blogspot.com/2007/12/unicode-implementers-guide-part-6-input.html' title='Unicode implementer&apos;s guide part 6: Input and output'/><author><name>Daniel Ehrenberg</name><uri>http://www.blogger.com/profile/00902922561603041049</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-273593670040001243.post-4886195175443119239</id><published>2007-12-23T21:15:00.000-08:00</published><updated>2008-03-04T11:35:00.637-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='factor'/><category scheme='http://www.blogger.com/atom/ns#' term='ignorable'/><category scheme='http://www.blogger.com/atom/ns#' term='unicode'/><title type='text'>Books that should exist</title><content type='html'>As you can probably guess by the existence of this blog, I love to write. 
Part of what I love about it is the actual act of writing, but it's really more that I want things to be written that hadn't been written before. I want to write stuff that I wish I could have read. Right now, there are three books and one journal article that I think should exist. Hopefully, I'll be able to write some of these some time in my life.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;&lt;em&gt;Implementing Unicode: A Practical Guide&lt;/em&gt;&lt;/strong&gt;&lt;br /&gt;&lt;br /&gt;One thing that struck me when beginning to write the Unicode library is that there aren't many books about Unicode. The two I found in my local Barnes and Noble were the &lt;a href="http://www.amazon.com/Unicode-Standard-Version-5-0-5th/dp/0321480910/ref=pd_bbs_sr_1?ie=UTF8&amp;s=books&amp;qid=1198444661&amp;sr=8-1"&gt;Unicode 5.0 Standard&lt;/a&gt; and &lt;a href="http://www.amazon.com/Unicode-Explained-Jukka-Korpela/dp/059610121X/ref=pd_bbs_sr_2?ie=UTF8&amp;s=books&amp;qid=1198444661&amp;sr=8-2"&gt;Unicode Explained&lt;/a&gt;. Looking on Amazon.com, I can't find any other books that address Unicode 3.1 (the version that moved Unicode from 1 to 17 planes) or newer in detail, ignoring more specialized books.&lt;br /&gt;&lt;br /&gt;Both of these are great books, but they aren't optimal for figuring out how to implement a Unicode library. Unicode Explained focuses on understanding Unicode for software development, but shies away from details of implementation. The Unicode Standard explains everything, but it often gets overly technical and can be difficult to read for people not already familiar with the concepts described. A Unicode library implementor needs something in the middle. 
&lt;a href="http://www.amazon.com/Unicode-Demystified-Practical-Programmers-Encoding/dp/0201700522/ref=pd_bbs_sr_4?ie=UTF8&amp;s=books&amp;qid=1198444661&amp;sr=8-4"&gt;Unicode Demystified&lt;/a&gt; might be the right one, but it describes Unicode 3.0, so it is in many ways outdated.&lt;br /&gt;&lt;br /&gt;I wish a book existed which explained Unicode in suitable detail for most library implementors. If this book continues to not exist for many years, I might just have to write it myself. This, however, would be more difficult and less fun than my other book ideas.&lt;br /&gt;&lt;br /&gt;[&lt;strong&gt;Update&lt;/strong&gt;: After getting my hands on Unicode Demystified, I realize that I should not have thrown it aside so quickly. It's a great book, and nearly all of it is still relevant and current. It looks like the ebook version I have is updated for Unicode 3.1.]&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;&lt;em&gt;Programming with Factor&lt;/em&gt;&lt;/strong&gt;&lt;br /&gt;&lt;br /&gt;Factor's included documentation is great, but it's not enough, by itself, to learn Factor. I, and probably most people who know Factor, learned through a combination of experimenting, reading blog posts and the mailing list, and talking on #concatenative. Many people will continue to learn Factor this way, but it still seems somehow insufficient. It should be possible to learn a programming language without participating in its community.&lt;br /&gt;&lt;br /&gt;Of course, we can't write a book on Factor until we get to see what Factor will look like in &lt;a href="http://useless-factor.blogspot.com/2007/12/roadmap-to-factor-10.html"&gt;version 1.0&lt;/a&gt;. But I'm confident that this book will be written, even if it goes unpublished in print, and I'm fairly sure that I'll have at least some part in it. 
Maybe it'll be a big group effort by the whole Factor community.&lt;br /&gt;&lt;br /&gt;What I'm thinking is that we could have a book which teaches programming &lt;em&gt;through&lt;/em&gt; Factor, to people who aren't yet programmers. I've talked a lot about this with Doug Coleman, and we agree that most programming books are bad; we should make a new book that does things very differently. But we can't agree or imagine just how...&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;&lt;em&gt;The Story of Programming&lt;/em&gt;&lt;/strong&gt;&lt;br /&gt;&lt;br /&gt;I, like many of you reading this blog, have an unhealthy interest in programming languages. Mine may be a little more unhealthy than yours. Whenever I hear the name of a programming language that I don't know of, I immediately need to read about it, to get some basic knowledge of its history, syntax, semantics and innovations.&lt;br /&gt;&lt;br /&gt;Most study of programming languages works by examining the properties of the languages themselves: how functional programming languages are different from imperative object-oriented languages and logic languages and the like. But what about a historical perspective? The history of programming languages is useful for the same reason other kinds of historical inquiry are useful. When we know about the past, we know more about the way things are in the present, and we can better tell what will happen in the future. The history of programming languages could tell us what makes things popular and what makes things ultimately useful.&lt;br /&gt;&lt;br /&gt;Unfortunately, not much has been done on this. Knowledge of programming language history is passed on, unsourced, with as much verifiability as folk legend. The ACM has held three conferences called &lt;a href="http://research.ihost.com/hopl/HOPL.html"&gt;HOPL&lt;/a&gt; on this subject over the past 30 years, so all the source material is there. 
But apart from a book published in 1969, this is all I can find as far as a survey history of programming languages goes.&lt;br /&gt;&lt;br /&gt;There is a limit to how much academic research papers can provide. The proceedings of the HOPL conferences aren't bedtime reading, and they don't provide much by way of a strong narrative. A new book could present the whole history of programming from the first writings about algorithms to modern scripting languages and functional programming languages so it's both accessible to non-programmers and interesting to programmers. As far as I know, no one's really tried. But it would be really fun to try.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;&lt;em&gt;Inverses in Concatenative Languages&lt;/em&gt;&lt;/strong&gt;&lt;br /&gt;&lt;br /&gt;Most of my ideas are either bad or unoriginal. But there's one idea that I came up with that seems to be both original and not horrible, and that's the idea of a particular kind of concatenative pattern matching (which I &lt;a href="http://useless-factor.blogspot.com/2007/06/concatenative-pattern-matching.html"&gt;blogged&lt;/a&gt; about, and Slava also &lt;a href="http://factor-language.blogspot.com/2007/08/units-and-reversable-computation.html"&gt;wrote&lt;/a&gt; about in relation to &lt;code&gt;units&lt;/code&gt;).&lt;br /&gt;&lt;br /&gt;The basic idea is that, in a concatenative language, the inverse of &lt;code&gt;foo bar&lt;/code&gt; is the inverse of &lt;code&gt;bar&lt;/code&gt; followed by the inverse of &lt;code&gt;foo&lt;/code&gt;. Since there are some stacks that we know &lt;code&gt;foo&lt;/code&gt; and &lt;code&gt;bar&lt;/code&gt; will never return (imagine &lt;code&gt;bar&lt;/code&gt; is &lt;code&gt;2array&lt;/code&gt; and the top of the stack is &lt;code&gt;1&lt;/code&gt;), this can fail. From this, we get a sort of pattern matching. 
Put more mathematically, if &lt;em&gt;f&lt;/em&gt; and &lt;em&gt;g&lt;/em&gt; are functions, then (&lt;em&gt;f&lt;/em&gt; o &lt;em&gt;g&lt;/em&gt;)&lt;sup&gt;-1&lt;/sup&gt; = &lt;em&gt;g&lt;/em&gt;&lt;sup&gt;-1&lt;/sup&gt; o &lt;em&gt;f&lt;/em&gt;&lt;sup&gt;-1&lt;/sup&gt;. &lt;br /&gt;&lt;br /&gt;In my implementation of this, I made it so you can invert and call a quotation with the word &lt;code&gt;undo&lt;/code&gt;. We don't actually need a full inverse; we only need a right inverse. That is, it's necessary that &lt;code&gt;[ foo ] undo foo&lt;/code&gt; be a no-op, but maybe &lt;code&gt;foo [ foo ] undo&lt;/code&gt; returns something different. Also, we're taking all functions to be partial.&lt;br /&gt;&lt;br /&gt;Anyway, I think this is a cool idea, but I doubt it could fill a book. I want to write an article about it for an academic journal so that I can explain it to more people and expand the idea. It could be made more rigorous, and it could use a lot more thought. I hope this works.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;When will I write these?&lt;/strong&gt;&lt;br /&gt;&lt;br /&gt;So, I hope I'll be able to write these things. Unfortunately, I'm not sure when I could. I need to finish my next 10 years of education, which won't leave tons of free time unless I give up blogging and programming and my job. Also, I'm not sure if I'm capable of writing a book or an article for an academic journal, though maybe I will be in 10 years when I'm done with my education. It wouldn't be so bad if someone stole my thunder and wrote one of these things because what I really want to do is read these books.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Update&lt;/strong&gt;: Here are a couple more books (or maybe just long surveys) that should exist but don't: something about operational transformation, and something about edit distance calculations and merge operations for things more complicated than strings. 
Right now, you can only learn about these things from scattered research papers. (I never thought I'd find the references at the end so useful!)&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/273593670040001243-4886195175443119239?l=useless-factor.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://useless-factor.blogspot.com/feeds/4886195175443119239/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=273593670040001243&amp;postID=4886195175443119239' title='8 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/273593670040001243/posts/default/4886195175443119239'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/273593670040001243/posts/default/4886195175443119239'/><link rel='alternate' type='text/html' href='http://useless-factor.blogspot.com/2007/12/books-that-should-exist.html' title='Books that should exist'/><author><name>Daniel Ehrenberg</name><uri>http://www.blogger.com/profile/00902922561603041049</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>8</thr:total></entry><entry><id>tag:blogger.com,1999:blog-273593670040001243.post-5255528362180207929</id><published>2007-12-13T11:51:00.000-08:00</published><updated>2007-12-14T23:17:50.432-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='unicode'/><title type='text'>Alphanum sort and Unicode</title><content type='html'>So, I'll just be the Unicode troll here and say, &lt;a href="http://code-factor.blogspot.com/2007/12/sorting-filenames-containing-numbers.html"&gt;all&lt;/a&gt; &lt;a href="http://www.davekoelle.com/alphanum.html"&gt;of&lt;/a&gt; &lt;a 
href="http://nedbatchelder.com/blog/200712.html#e20071211T054956"&gt;your&lt;/a&gt; &lt;a href="http://www.codinghorror.com/blog/archives/001018.html"&gt;solutions&lt;/a&gt; &lt;a href="http://sourcefrog.net/projects/natsort/"&gt;to&lt;/a&gt; the &lt;a href="http://weblog.masukomi.org/2007/12/10/alphabetical-asciibetical"&gt;problem&lt;/a&gt; of sorting alphanumeric file names are insufficient. To recap, if you haven't been following Reddit: it's annoying how, when looking at a directory, the file named z10.txt is sorted between z1.txt and z2.txt, and we programmers should come up with a solution. The gist of the solution is that you split up the file name into a list of alphabetical and numeric parts, parse the numbers, and then sort it all.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;What's wrong with that?&lt;/strong&gt;&lt;br /&gt;&lt;br /&gt;Every one of these implementations, including a slightly &lt;a href="http://personal.inet.fi/cool/operator/Human%20Sort.py"&gt;internationalized&lt;/a&gt; version (just for some Scandinavian locale, apparently), fails to sort the way humans expect. Remember, in ASCII, capital Z comes before lower-case a. Jeff Atwood hinted at internationalization problems, and that's just part of the problem. We should use the &lt;a href="http://unicode.org/reports/tr10/"&gt;Unicode Collation Algorithm&lt;/a&gt; (which I previously &lt;a href="http://useless-factor.blogspot.com/2007/10/unicode-implementers-guide-part-5.html"&gt;blogged&lt;/a&gt; about). The UCA provides a locale-dependent, mostly linguistically accurate collation for most locales (basically all that don't use Han ideographs).&lt;br /&gt;&lt;br /&gt;The basic idea that we need here is that the UCA is based on a number of levels of comparison. This is useful not only in "international" locales, but also in United States English. 
First, we compare two strings with most features removed, then we add back in accent marks, and then capitalization, and then punctuation/spacing. So, even if we want "B" to come before "b", we can have "bench" come before "Bundle"--before comparing for capitalization, we compare for the letters themselves, and "bench" obviously precedes "bundle". For a better description of the UCA, see my &lt;a href="http://useless-factor.blogspot.com/2007/10/unicode-implementers-guide-part-5.html"&gt;past blog post&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;So, the most basic way to put the UCA together with the Alphanum algorithm is to just use the same algorithm, except with the UCA to compare the strings rather than an ASCII comparison like most other implementations have done. But this would completely destroy the UCA's subtle level effect: even if "A" is sorted before "a", we would want "a1" to come before "A2". We also want "A1" to come before "a2", and we also don't want to ignore case completely; it's still significant.&lt;br /&gt;&lt;br /&gt;The UCA solves this problem when we're working with "a1" and "A2", but not when we're comparing "a2" and "a10". For this, we need a slightly more complicated solution based on something like the &lt;a href="http://www.davekoelle.com/alphanum.html"&gt;Alphanum algorithm&lt;/a&gt;. But breaking the string up by itself won't allow for the layering kind of behavior that the UCA depends on.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;What's a solution for this?&lt;/strong&gt;&lt;br /&gt;&lt;br /&gt;One way to fix this is to break the string up into its segments, right-pad the alphabetical strings and left-pad the numbers. We can just break things up into segments of consecutive digits and non-digits. The length that we'll pad to for the nth subsequence is the maximum of the lengths of the nth subsequences for all of the strings.  
For numbers, we can left-pad with plain old 0, but for strings, we want to right-pad with something that's lower than any other character. This sounds hard to do, but by tailoring the DUCET (data table for the UCA), we can just define a character in a private use area as a new, special padding character. This padding character will be completely non-ignorable, fixed weight and have a lower primary, secondary and tertiary weight than any other character by definition.&lt;br /&gt;&lt;br /&gt;OK, that sounds really complicated. Let's step back a minute and see how this padding could work out to provide functionality equivalent to the existing Alphanum algorithm. For this, we'll assume access to an ASCII-order sorting algorithm, and just figure out how to pad the string in the right way to come up with a suitable sort key. Instead of some new character, we can just use a null byte. So, if we have the strings "apple1b" and "the2345thing1", the fully padded strings should look like "apple0001b\0\0\0\00" and "the\0\02345thing1", where '\0' is the null byte.&lt;br /&gt;&lt;br /&gt;I can make a simple Factor implementation by stealing Doug's &lt;code&gt;cut-all&lt;/code&gt; code from his &lt;a href="http://code-factor.blogspot.com/2007/12/sorting-filenames-containing-numbers.html"&gt;description&lt;/a&gt; of a solution to this problem.&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;: pad-seq ( seq quot -- newseq ) &lt;br /&gt;    &gt;r dup [ length ] map supremum r&gt; &lt;br /&gt;    curry map ; inline&lt;br /&gt;&lt;br /&gt;: pad-quote ( seq -- newseq )&lt;br /&gt;    [ "" pad-right ] pad-seq ;&lt;br /&gt;&lt;br /&gt;: pad-number ( seq -- newseq )&lt;br /&gt;    [ CHAR: 0 pad-left ] pad-seq ;&lt;br /&gt;&lt;br /&gt;: pad-string ( seq -- newseq )&lt;br /&gt;    [ 0 pad-right ] pad-seq ;&lt;br /&gt;&lt;br /&gt;: pieces ( strings -- pieces )&lt;br /&gt;    [ [ digit? ] [ ] cut-all ] map pad-quote flip ;&lt;br /&gt;&lt;br /&gt;: pad ( pieces ? -- ? 
newpieces )&lt;br /&gt;    [ pad-string f ] [ pad-number t ] if swap ;&lt;br /&gt;&lt;br /&gt;: pad-strings ( strings -- newstrings )&lt;br /&gt;    pieces t swap [ swap pad ] map nip flip [ concat ] map ;&lt;br /&gt;&lt;br /&gt;: alphanum-sort ( strings -- sorted )&lt;br /&gt;    dup pad-strings [ 2array ] 2map sort-values keys ;&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;This is far from the most concise implementation, but the advantage is that it can easily be adapted for a better collation algorithm.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;But where's the real Unicode implementation of this?&lt;/strong&gt;&lt;br /&gt;&lt;br /&gt;I'm not sure where to implement this in terms of tailoring and hooking it up with the UCA. I'd either have to (a) go into C++ and do this with ICU, or (b) write my own Unicode library which has tailoring, and do it there. I'm too scared to do the first, and the second isn't done yet. I've looked at Java's tailoring, but I don't see how I could define a character that's lower than &lt;em&gt;all&lt;/em&gt; other characters. 
I feel a little bit lame/trollish putting this blog post out there without a real implementation of the solution, but I hope I was able to expand on people's knowledge of the concerns of internationalization (yes, I'll write that word all out; I don't need to write i18n) in text-based algorithms.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/273593670040001243-5255528362180207929?l=useless-factor.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://useless-factor.blogspot.com/feeds/5255528362180207929/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=273593670040001243&amp;postID=5255528362180207929' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/273593670040001243/posts/default/5255528362180207929'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/273593670040001243/posts/default/5255528362180207929'/><link rel='alternate' type='text/html' href='http://useless-factor.blogspot.com/2007/12/alphanum-sort-and-unicode.html' title='Alphanum sort and Unicode'/><author><name>Daniel Ehrenberg</name><uri>http://www.blogger.com/profile/00902922561603041049</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-273593670040001243.post-2761438217007845441</id><published>2007-12-10T22:40:00.000-08:00</published><updated>2007-12-10T23:09:18.108-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='factor'/><category scheme='http://www.blogger.com/atom/ns#' term='parsing'/><category scheme='http://www.blogger.com/atom/ns#' term='macros'/><title type='text'>Multiline string literals in 
Factor</title><content type='html'>It's always annoyed me somewhat that Factor strings can only be on one line and that there was no mechanism for anything like the "here documents" that Perl has. So I decided to write it myself. At this point, I don't really need it and have forgotten what I wanted it for, but it was still a fun exercise.&lt;br /&gt;&lt;br /&gt;I started out thinking I should do something similar to how some other languages do it: write a word, maybe called &lt;code&gt;&amp;lt;&amp;lt;&amp;lt;&lt;/code&gt;, which is followed by some string which is used to delimit a multiline string literal expression. But I realized this wouldn't be the most idiomatic way in Factor. First, if you're making a multiline string literal, why would you ever have it be within a word? For constants like this, it's considered best practice to put them in their own separate words. Second, why do I need to give the option of choosing your own ending? What's wrong with just using a semicolon, like other Factor definitions?&lt;br /&gt;&lt;br /&gt;Putting this together, I came up with the following syntax:&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;STRING: something&lt;br /&gt;Hey, I can type all the text I want here&lt;br /&gt;And it can be over multiple lines&lt;br /&gt;But without the backslash-n escape sequence!&lt;br /&gt;;&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;The final line, the semicolon, has to have no whitespace before or after it. This allows for practically any useful string to be written this way. The syntax will compile into something like this:&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;: something&lt;br /&gt;    "Hey, I can type all the text I want here\nAnd it can be over multiple lines\nBut without the backslash-n escape sequence!" ;&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;With a previous parser design, multiline string literals like this were impossible, but now they can be done in 11 lines. 
I packaged this up and put it in my repository under &lt;code&gt;extra/multiline&lt;/code&gt; so others can use it.&lt;br /&gt;&lt;hr/&gt;&lt;br /&gt;Using the internals of the parser, the first word advances the parser state one line and returns the text of the new line.&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;: next-line-text ( -- str )&lt;br /&gt;    lexer get dup next-line line-text ;&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;The next two words do the bulk of the parsing. They advance the parser's current line until reaching a line consisting only of ";", and then advance the line one more time. While moving forward, a string is built up consisting of all of the lines, interspersed with newlines.&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;: (parse-here) ( -- )&lt;br /&gt;    next-line-text dup ";" =&lt;br /&gt;    [ drop lexer get next-line ] [ % "\n" % (parse-here) ] if ;&lt;br /&gt;&lt;br /&gt;: parse-here ( -- str )&lt;br /&gt;    [ (parse-here) ] "" make 1 head*&lt;br /&gt;    lexer get next-line ;&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;Finally, the word &lt;code&gt;STRING:&lt;/code&gt; puts it all together, defining a new word using a string gathered with &lt;code&gt;parse-here&lt;/code&gt;.&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;: STRING:&lt;br /&gt;    CREATE dup reset-generic&lt;br /&gt;    parse-here 1quotation define-compound ; parsing&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;There are downsides to having an extremely flexible syntax like Factor's. Things can be less predictable when libraries can alter the fundamentals of syntax. It'd be impossible to create a fixed BNF description of Factor syntax. Additionally, the particular structure of Factor sometimes encourages syntax extension that's heavily dependent on the details of the current implementation. But the upside is that we can do things like this. 
I think it's worth it.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/273593670040001243-2761438217007845441?l=useless-factor.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://useless-factor.blogspot.com/feeds/2761438217007845441/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=273593670040001243&amp;postID=2761438217007845441' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/273593670040001243/posts/default/2761438217007845441'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/273593670040001243/posts/default/2761438217007845441'/><link rel='alternate' type='text/html' href='http://useless-factor.blogspot.com/2007/12/multiline-string-literals-in-factor.html' title='Multiline string literals in Factor'/><author><name>Daniel Ehrenberg</name><uri>http://www.blogger.com/profile/00902922561603041049</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-273593670040001243.post-8812955257342852094</id><published>2007-12-06T13:31:00.000-08:00</published><updated>2007-12-08T23:13:35.152-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='factor'/><title type='text'>Roadmap to Factor 1.0</title><content type='html'>According to the Factor website, Factor 1.0 is coming out some time in 2008. That's pretty scary. 2008 is coming in less than a month, and we currently have no solid plan for how we're going to go about reaching 1.0. Depending on the amount of available time that contributors have over the next year, it'll take a different amount of time to achieve this goal. 
Nevertheless, I'm proposing a roadmap of development goals for Factor 1.0.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Figuring out the goals&lt;/strong&gt;&lt;br /&gt;&lt;br /&gt;We first need to figure out the exact goals for Factor 1.0, and how we're going to go about completing them. From the Factor homepage, we have a list of goals for Factor 1.0:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;New object system with inheritance and multiple dispatch&lt;/li&gt;&lt;li&gt;Incremental garbage collection&lt;/li&gt;&lt;li&gt;Drastically reduce compile time&lt;/li&gt;&lt;li&gt;Continue improving existing libraries, such as databases, multimedia, networking, web, etc&lt;/li&gt;&lt;li&gt;Full Unicode support, Unicode text display in UI (in progress)&lt;/li&gt;&lt;li&gt;Better UI development tools&lt;/li&gt;&lt;li&gt;Get the UI running on Windows CE&lt;/li&gt;&lt;li&gt;Add support for Windows 64-bit and Mac OS X Intel  64-bit&lt;/li&gt;&lt;li&gt;Use kqueue (BSD, Mac OS X), epoll (Linux) for high-performance I/O&lt;/li&gt;&lt;li&gt;Directory change notification&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;This contains many different things, which will take a lot of work. For each of these tasks (or pieces of them), we need to decide who will be in charge of them and what a target date for completion will be. There are also other goals that aren't on this list, and we need to identify what these are.&lt;br /&gt;&lt;br /&gt;One list item is particularly involved: &lt;em&gt;Continue improving existing libraries, such as databases, multimedia, networking, web, etc&lt;/em&gt;. Right now, the libraries in &lt;code&gt;extra/&lt;/code&gt; are of inconsistent quality. We should set a standard for quality for things that are in &lt;code&gt;extra/&lt;/code&gt; and designate more experimental libraries as such, maybe by putting them in a different directory like &lt;code&gt;playground/&lt;/code&gt;. 
We should have an audit of all the libraries in &lt;code&gt;extra/&lt;/code&gt; to improve them as much as possible and decide whether they should go into &lt;code&gt;playground/&lt;/code&gt;.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Timetable&lt;/strong&gt;&lt;br /&gt;&lt;br /&gt;&lt;em&gt;By the end of this year&lt;/em&gt; (December 31, 2007): We need to decide, in more detail, how the path to Factor 1.0 will go. We need a full list of goals, a person assigned to each one, and a target date for completion. We also need to decide the terms for an audit of the libraries.&lt;br /&gt;&lt;br /&gt;&lt;em&gt;One-third through next year&lt;/em&gt; (April 30, 2008): We should be done with the library audit. All new libraries should be correctly sorted into &lt;code&gt;extra/&lt;/code&gt; or &lt;code&gt;playground/&lt;/code&gt;. At this point, around half of the remaining critical features of Factor 1.0 should be done. Language features, like the form of the object system, should be completed and frozen.&lt;br /&gt;&lt;br /&gt;&lt;em&gt;Two-thirds through the year&lt;/em&gt; (August 31, 2008): This date is the target for all of the key features to be completed. After this point, we will focus on bug fixes, performance and more complete documentation. If possible, 1.0 alpha/beta 1 should be released, and further alpha/beta releases should be made through the year. 
Here, Factor's runtime should be frozen with no significant changes in implementation.&lt;br /&gt;&lt;br /&gt;&lt;em&gt;At the end of the year&lt;/em&gt; (December 31, 2008): By this date, if all goes well, we will release Factor 1.0, with no major additional features added after the alpha stage.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Specifics&lt;/strong&gt;&lt;br /&gt;&lt;br /&gt;Here are guidelines that all vocabs in &lt;code&gt;extra/&lt;/code&gt; should eventually follow:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Each vocab foo should consist of six files: foo.factor, foo-tests.factor, foo-docs.factor, summary.txt, authors.txt, tags.txt&lt;/li&gt;&lt;li&gt;Every vocab's documentation should explain each externally useful word, and it should have a main article telling users where to start.&lt;/li&gt;&lt;li&gt;Every vocab's unit tests should be as complete as possible.&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;Eventually, some vocabs in &lt;code&gt;extra/&lt;/code&gt; and &lt;code&gt;playground/&lt;/code&gt; will be hosted outside of the main Factor distribution. But right now, the total quantity of Factor code is such that everything that's publicly distributed can be included. Of course, when more people start writing code that they refuse
