The idea of the algorithm is simple: it just goes by cases. First, check for a byte order mark (BOM), which would indicate UTF-8 or UTF-16 in a particular endianness. If there's no BOM, then the first character must be < or whitespace. If it's <, we can differentiate between UTF-16BE, UTF-16LE (without BOMs) and an 8-bit encoding: BOM-less UTF-16 betrays itself with a null byte next to the < (before it for big-endian, after it for little-endian). If it's an 8-bit encoding, we can be sure that there won't be any non-ASCII in the prolog, so just read the prolog as if it were UTF-8, and if an encoding is declared, use that.
To implement it, I just read byte by byte and have a case statement for each level. After just two octets, it's possible to differentiate between UTF-8, UTF-16 (with a BOM, in either endianness), UTF-16BE and UTF-16LE. A similar process could also identify UTF-32 and friends after 4 octets. In my implementation, I had to do a little bit of hacking inside the XML code itself to get this integrated properly. Altogether, it's about 40 or 50 lines of code. It's available now in the Factor git repository.
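The 4-octet extension might look like the following (again an illustrative Python sketch, not the Factor code). One wrinkle worth noting: the four-octet patterns have to be checked before the two-octet UTF-16 ones, since \xff\xfe\x00\x00 also begins with the UTF-16LE BOM.

    # Hypothetical lookup table for the UTF-32 cases.
    UTF32_PATTERNS = {
        b"\x00\x00\xfe\xff": "UTF-32BE",  # BOM, big-endian
        b"\xff\xfe\x00\x00": "UTF-32LE",  # BOM, little-endian
        b"\x00\x00\x00<":    "UTF-32BE",  # no BOM: nulls before '<'
        b"<\x00\x00\x00":    "UTF-32LE",  # no BOM: nulls after '<'
    }

    def sniff_utf32(data: bytes):
        # Returns None if the first four octets don't look like UTF-32.
        return UTF32_PATTERNS.get(data[:4])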
[Update: Thanks for pointing out my error, Subbu Allamaraju. Fixed a typo; see the comments.]