Searching for Patterns
In our example, we picked out all the repeated words and put those in a dictionary. To us, this is the most obvious way to write a dictionary. But a compression program sees it quite differently: It doesn't have any concept of separate words -- it only looks for patterns. And in order to reduce the file size as much as possible, it carefully selects which patterns to include in the dictionary.
If we approach the phrase from this perspective, we end up with a completely different dictionary.
If the compression program scanned Kennedy's phrase, the first redundancy it would come across would be only a couple of letters long. In "ask not what your," there is a repeated pattern of the letter "t" followed by a space -- in "not" and "what." If the compression program wrote this to the dictionary, it could write a "1" every time a "t" were followed by a space. But in this short phrase, this pattern doesn't occur enough to make it a worthwhile entry, so the program would eventually overwrite it.
The next thing the program might notice is "ou," which appears in both "your" and "country." If this were a longer document, writing this pattern to the dictionary could save a lot of space -- "ou" is a fairly common combination in the English language. But as the compression program worked through this sentence, it would quickly discover a better choice for a dictionary entry: Not only is "ou" repeated, but the entire words "your" and "country" are both repeated, and they are actually repeated together, as the phrase "your country." In this case, the program would overwrite the dictionary entry for "ou" with the entry for "your country."
The phrase "can do for" is also repeated, one time followed by "your" and one time followed by "you," giving us a repeated pattern of "can do for you." This lets us write 15 characters (including spaces) with one number value, while "your country" only lets us write 13 characters (with spaces) with one number value, so the program would overwrite the "your country" entry as just "r country," and then write a separate entry for "can do for you." The program proceeds in this way, picking up all repeated bits of information and then calculating which patterns it should write to the dictionary. This ability to rewrite the dictionary is the "adaptive" part of LZ adaptive dictionary-based algorithm. The way a program actually does this is fairly complicated, as you can see by the discussions on this site.
No matter what specific method you use, this in-depth searching system lets you compress the file much more efficiently than you could by just picking out words. Using the patterns we picked out above, and adding "__" for spaces, we come up with this larger dictionary: