Content Generation with N-grams

Although this is an outdated method, I thought I would post some content generation code I wrote a while ago. Google possesses the n-gram data (more on those later) and algorithms to detect content generated in this fashion. It’s a cool method for text generation, but I haven’t found much available source code for it. Sure, there is code to take some text and generate n-grams (there’s a Perl module for it!), but no sample code to run the n-grams in “reverse” to generate statistically-equivalent text.

The steps for generating statistically-equivalent text to some document are as follows:

  1. Generate a database of n-grams from source document(s) that are similar in nature to what you want to generate. If you want to generate content about male pattern baldness, use articles and content about male pattern baldness. You must record how often each n-gram appears in your source text.
  2. For each n-gram, create a new record that has the first n-1 characters as the key, and the last character and how often it occurred as the value. For example, if the 4-gram “then” occurred 15 times in your source text, your new database entry would have the key “the” with the value ( “n”, 15 ).
  3. Group the same keys together, from step 2. This new database is what you will use to generate the content. For example, step 1 gave you the following 4-grams: ( “then” => 10, “ther” => 20, “thes” => 30 ). Grouping the results from step 2 would give you ( “the” => ( ( “n”, 10 ), ( “r”, 20 ), ( “s”, 30 ) ) ).
  4. Now to generate text, simply start with a random key from your database at step 3, and use the occurrence values as weights to a random number generator to decide which character ( n, r, or s ) above should be chosen. Append that character to the output, then use the last n-1 characters of the output so far as the next key into your dictionary at step 3 and lather, rinse, repeat until you have enough text.
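
The four steps above can be sketched in a few lines of Python. This is only an illustration of the idea, not the gendict.pl and gentext.pl scripts themselves. The function names build_dict and generate_text are made up for this example, and restarting from a random key when a key has no recorded follower is an assumption.

```python
import random
from collections import Counter, defaultdict


def build_dict(text, n):
    """Steps 1-3: count n-grams, then group them by their first n-1 characters."""
    if n < 2:
        raise ValueError("n must be at least 2")
    # Step 1: how often each character n-gram occurs in the source text.
    ngrams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    # Steps 2 and 3: key = first n-1 characters, value = {last character: count}.
    table = defaultdict(dict)
    for gram, count in ngrams.items():
        table[gram[:-1]][gram[-1]] = count
    return dict(table)


def generate_text(table, length, rng=random):
    """Step 4: weighted random walk over the grouped table."""
    if not table:
        raise ValueError("the table is empty; the source text was shorter than n")
    keys = sorted(table)
    key = rng.choice(keys)
    k = len(key)  # every key is n-1 characters long
    out = key
    while len(out) < length:
        followers = table.get(key)
        if not followers:
            # Dead end: this key was only seen at the very end of the source.
            # Assumption: restart from a fresh random key.
            key = rng.choice(keys)
            out += key
            continue
        chars = list(followers)
        weights = [followers[c] for c in chars]
        out += rng.choices(chars, weights=weights)[0]
        key = out[-k:]  # the last n-1 characters become the next key
    return out[:length]
```

With a source in which “then”, “ther” and “thes” occur 10, 20 and 30 times, build_dict(text, 4)["the"] would hold the counts for “n”, “r” and “s”, matching the example in step 3.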

Here is the source code to generate content. To generate 1,000 characters of text, put all your source content into a file (we’ll call it source.txt), and do the following:

$ gendict.pl 8 source.txt > s_dict.txt
$ gentext.pl s_dict.txt 1000

Obviously you can play around with the ‘n’ parameter (I chose 8 as a starting point). If you go too small, you’ll end up generating garbage words. If you go too big, the output will make more sense, but only because you’ll be reproducing large portions of your source text verbatim.
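
One rough way to see this trade-off is to measure how many keys leave the generator no choice at all. The Python sketch below is an illustration only; the name forced_fraction is invented here. It reports the fraction of (n-1)-character keys that were followed by just one distinct character in the source. As that fraction approaches 1, the generator can only copy the source.

```python
def forced_fraction(text, n):
    """Fraction of (n-1)-character keys that have exactly one distinct follower."""
    if n < 2:
        raise ValueError("n must be at least 2")
    followers = {}
    for i in range(len(text) - n + 1):
        gram = text[i:i + n]
        # Record which characters were seen after this (n-1)-character key.
        followers.setdefault(gram[:-1], set()).add(gram[-1])
    if not followers:
        raise ValueError("the source text is shorter than n")
    forced = sum(1 for seen in followers.values() if len(seen) == 1)
    return forced / len(followers)
```

Running it over your source text for a few values of n gives a quick feel for where the output stops being garbage and starts being a copy.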

I used character-level n-grams in this code, but word-level n-grams would also work well, given a large enough body of source text. This is similar to the Dissociated Press algorithm, except we do a pre-processing step and build an n-gram database first. This n-gram database can be used for other things, such as duplicate content detection, generated content detection, and source author recognition to detect cheaters and people using essay writing services.
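
For comparison, here is a hedged Python sketch of what a word-level version might look like. It is not part of the original scripts. The names build_word_dict and generate_words are hypothetical, splitting on whitespace is an assumed tokenization, and the random restart on a dead end is an assumption.

```python
import random
from collections import Counter, defaultdict


def build_word_dict(text, n):
    """Map each (n-1)-word tuple to a Counter of the words that follow it."""
    if n < 2:
        raise ValueError("n must be at least 2")
    words = text.split()  # assumption: plain whitespace tokenization
    table = defaultdict(Counter)
    for i in range(len(words) - n + 1):
        gram = tuple(words[i:i + n])
        table[gram[:-1]][gram[-1]] += 1
    return dict(table)


def generate_words(table, count, rng=random):
    """Weighted random walk, one word at a time."""
    if not table:
        raise ValueError("the table is empty; the source had fewer than n words")
    keys = sorted(table)
    out = list(rng.choice(keys))
    k = len(out)  # every key is n-1 words long
    while len(out) < count:
        followers = table.get(tuple(out[-k:]))
        if not followers:
            # Dead end: restart from a random key (an assumption).
            out.extend(rng.choice(keys))
            continue
        choices = list(followers)
        weights = [followers[w] for w in choices]
        out.append(rng.choices(choices, weights=weights)[0])
    return " ".join(out[:count])
```

The structure is the same as the character-level approach; only the unit changes from a character to a word, which is why it needs much more source text before the keys have more than one follower.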

Modifying the code to stick the generated text into a MySQL database, and then generating an RSS feed from that, would allow you to use a technique like Affiliate Marketing through RSS Feeds easily. The key to this method is giving it enough source content, and playing around with the size of the n-grams.

Here is some sample text I generated using this document as source with n=8:

baldness, use articles and content detection, generated in this code, but
word-level n-gram appears in your source text, so your new database of
n-grams in thi. It's a cool method, I thought I would work well, for a
large source to generaly-equivalent text.