About Fakespeare

When I was much younger, I used to love reading works of fiction. I read Jules Verne and H.G. Wells and Judy Blume and whatever else was located on the shelves near me. Unsurprisingly, I also tried my hand at writing books. It turns out that it's much harder to write a book than to read a book. As a result, my book-writing dreams never really got anywhere.

Of course, we are no longer constrained by our human limitations.

We live in modern times! It's the Roaring '20s, and we are able to make full use of automation!

The GitHub repo at hand.

Let's go!

The idea

Everyone has seen examples of Markov chains [link] creating plausible-ish text. Some of us have even written our own. I conjectured that, through judicious application of silly math stuff, we can create a Markov chain that performs quite reasonably. In particular, I thought I could maximize the amount of information [link] carried by each token by using byte-pair encoding [link] to tokenize my text.

You see, in a Markov model, the next state depends entirely on the state that came before it. It therefore stands to reason that increasing the amount of information encoded in each token should improve the quality of the model's output.
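To make this concrete, here is a minimal Python sketch (not the repo's actual code) of how such a model can be estimated: count which token follows which. The richer each token is, the more each row of this table tells us about what comes next.

```python
# A minimal sketch (not the repo's code): build a first-order Markov model
# over a token list by counting which token follows which.
from collections import Counter, defaultdict

def build_transitions(tokens):
    """Map each token to a Counter of the tokens observed right after it."""
    transitions = defaultdict(Counter)
    for current, following in zip(tokens, tokens[1:]):
        transitions[current][following] += 1
    return transitions

# With word-level tokens, each row records what tends to follow a given word.
table = build_transitions("the purple ball hit the tiny pencil".split())
print(table["the"])  # Counter({'purple': 1, 'tiny': 1})
```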

Suppose we are capable of splitting the text into tokens that represent entire words. Each word carries a large amount of meaning. For instance, adjectives tend to be followed by other adjectives or by nouns: you will see phrases like "purple ball" or "tiny pencil" but rarely or never "purple cooking" or "tiny typing". As it turns out, this extends to subwords. Words that end in "y" tend to be adjectives and will often be followed by a noun; a word that ends in "ing" is probably a verb. These high-frequency subwords do a very good job of carrying information about what can follow them. Therefore a scheme that tokenizes text into the richest subwords should be very useful for training a Markov model.

As it turns out, such a scheme already exists. Byte-pair encoding [link] lets us maximize the information in each token in a very simple way. Suppose we have a token `x` and a token `y`. Furthermore, suppose `x` is very frequently followed by `y`. In fact, the pair `xy` is so frequent that `y` is the most common token to occur after `x`. Then we can replace the pair `xy` with a new token `z`. This new token `z` lets us encode far more information about the text, because knowing that the previous token is `z` is equivalent to knowing that the previous two tokens were `xy`. This effectively doubles our context window. By repeating this many times (possibly thousands), we can create extremely rich tokens that represent affixes, words, and even phrases. This lets our Markov model perform far better than it otherwise could.
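As a rough sketch of a single merge step (assuming the usual greedy "merge the most frequent adjacent pair" rule, which is not necessarily exactly what `generator.py` does), it could look like this in Python:

```python
# Illustrative only: one byte-pair-encoding merge step over a token list.
from collections import Counter

def most_frequent_pair(tokens):
    """Find the adjacent pair (x, y) that occurs most often."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return max(pairs, key=pairs.get)

def merge_pair(tokens, pair):
    """Replace every occurrence of `pair` with a single combined token z."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])  # the new token z
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

# Start from single characters and merge repeatedly to grow richer tokens.
tokens = list("to be or not to be")
for _ in range(10):
    tokens = merge_pair(tokens, most_frequent_pair(tokens))
print(tokens)
```

Running merges like this thousands of times over the full corpus is what grows the affix-, word-, and phrase-level tokens described above.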

The implementation

I split the implementation into two parts: a file called `main.c` and a file called `generator.py`. The former uses the Markov chain to predict tokens, and in doing so generates the output text. This is what is actually executed when you run `./fakespeare`. The latter file conducts the byte-pair encoding and the statistical analysis that generate the Markov chain.

To encode the Markov data, I came up with a scheme where I inline the data in the code using a header file. Essentially, each token is stored in a struct that contains a string describing what text the token corresponds to, as well as an array of probabilities that describes a CDF. This CDF is used to predict the next token. By starting with a randomly selected token, finding that token in an array, and then using its CDF, we are able to generate new text. We can then use each token's corresponding string to convert an array of tokens to printed text. Since we used byte-pair encoding, even a small number of tokens can produce a fairly lengthy span of text.
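The actual generation loop lives in `main.c`, but the idea is easy to show in a few lines of Python with made-up token data; none of these names or numbers come from the generated header.

```python
# Illustrative sketch of the generation loop (the real version is in C).
# Each entry pairs a token's text with a CDF over the index of the next token.
import bisect
import random

tokens = [
    ("KING ",   [0.6, 0.9, 1.0]),  # index 0
    ("HENRY ",  [0.2, 0.7, 1.0]),  # index 1
    ("speaks ", [0.5, 0.8, 1.0]),  # index 2
]

def next_index(cdf):
    """Pick the next token index by seeing where a uniform draw lands in the CDF."""
    return bisect.bisect(cdf, random.random())

current = random.randrange(len(tokens))  # start from a randomly selected token
pieces = []
for _ in range(20):
    text, cdf = tokens[current]
    pieces.append(text)                  # each token knows its own text
    current = next_index(cdf)
print("".join(pieces))
```

The real program stores the same string-plus-CDF data in C structs in the generated header, as described above.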

Some examples

Here are a few of the choicest examples. I have chosen them specifically because they represent the most interesting or entertaining text this model can produce; they do not represent the average output.

If you want to make your own, check out the repo.

The GitHub repo at hand.

Let's go!
SEBASTIAN. Prithfield many fresh enough,'Jesday pen to my youth may; we know your extend Thanill for no pody villain. I love thee. Your servicle, That croop of bond be" was such ere. Fortune en 'twere in slaim go t'st out upon't, OW. Where is this ringless blood, We shall light bled at such As to rother like aking, are no obde. A gentleman, more noble natural, not to five hundred were England into this. Beseech and mark) The plan's name, Lord Cuck.
ANGELO. My lord, there's the tears, not make one said ' sadisperch too; thy suspect not what time let their safter fools to sooth come, sir, had been satis. Our peather in thee by the less; And there? Hath over the possesby'r your peasurrects, if he stars, To a face that so proctasks. Kate, you licensure and goes ther he have cut their thither, I was his plain I that ever broke, Bassian, Swell, daughter: I in her grave! What never see you whate what thou hast hanging thee.
TRANIO. Even with the loss function a roth she dens, here with him; bid the base prince, will I a sworder lawless the princese Which over a greaterpreced must be Fie, fiery of Bround my gain shall have so she smil'd in his humblown too curtorn.