## Surprising Strings

I have been grabbed (a bit late) by the December 2003 Puzzling Adventures by Dennis E. Shasha in Scientific American: the surprising strings.

On this page you can find the results of a day-and-a-half session during my internet-less holidays on which I concentrated on solving some parts of this puzzle.

I have concentrated on 2-surprising strings. My main conclusion is that I think the longest string for 26 different symbols is 69 characters, and the number of symbols that is needed for a surprising string of 100 characters is 38.

First thing I did was what Shasha describes on the solution webpage: I made a systematic iterator that uses an object for a surprising string, and an automatically backtracking recursive algorithm to scan solution space. One restriction that I quickly put in is that the characters are introduced in "order": The surprising string "AAB" is not different from "BBA". This saves a factor of N! in the calculation time. But: as reported on Shasha's webpage about this puzzle, even with this shortcut this systematic approach is rather time consuming. Actually, the performance seems to be slightly worse than exponential. On my notebook, generating all surprising strings for short lengths took:

• for N=3 ~0.01 seconds
• for N=4 0.07 seconds
• for N=5 0.7 seconds
• for N=6 6 seconds
• for N=7 2 minutes
• for N=8 1 hour

Roughly extrapolating, N=16 would take me 8 billion years, and bigger N quite a bit more than the age of the universe.

Even sampling starting from a random starting point for the longer strings does not sample reasonably independent areas of the solution space, so I implemented another solver:

METHOD 2

• Choose a random, valid surprising string
• repeat:
• Shrink the last 10-15 characters off the end
• Make a point mutation that does not invalidate the string
• Iterate using the recursive procedure from there.
• Choose the most successful string from the recursed set

This takes a bit of time running the systematic approach ("small steps"), and then uses the mutation to start somewhere else in solution space ("big step"). This approach is very useful to get a quick feel of how large solutions can get for a given number of symbols. But it still failed to give me strings that were long enough fast enough. So I went for a third method.

METHOD 3

• Choose a random, nonsurprising string of the desired length.
• Calculate the "error" score for the string (how many violating pairs can be found).
• repeat:
• Make a point mutation.
• Use the error score of the new and old strings and a Metropolis-like monte-carlo approach to decide whether to continue with the original or the new string

Obviously this needs to know the length before it can start looking. And: even this approach can take long (hours) to find a solution. Solution space is VERY rough.

When I came back from my holidays I did read the page by David Joyce that describes a maximal length of 2-surprising strings as 3N-1. But my numbers point at a more strict limit: I performed tests for N=3..26, and the longest examples I have found are:

` N:L(N) L(N)/N Example 3:  7, 2.3333 AABCBAC 4: 10, 2.5000 ABBCADCDBA 5: 12, 2.4000 AABCDCEBAEDB 6: 15, 2.5000 AABCDBEFCFADEBA 7: 18, 2.5714 AABCDBEFGCGEADFBAC 8: 21, 2.6250 ABACDEFGDHECHGBBFEADC 9: 24, 2.6667 ABCDEFGGHIBHEDICFDAEACBG10: 26, 2.6000 ABCDEFGHIJACEJIICBHBDGADFE11: 29, 2.6364 ABCDEFAGHEIGCJJHKDKGBIFBAEDCH12: 31, 2.5833 ABCDEFGHIJKLLEKDAJBGJAIFBDCHCKF13: 34, 2.6154 ABCDEFGHIJKLMMBHGIDAGJLDCEAEKFCBFI14: 37, 2.6429 ABCDEFGHIJKLHMKEMLFNGNEADBJIMCCAHDFBK15: 39, 2.6000 ABCDEFBGHIHGAJEKLCMMNOKFODIJCNGDLAIKBHE16: 42, 2.6250 ABCDEFGHBIJKLMNKOCGNJMPEPOADDKHJFCLFHIGAEB17: 45, 2.6470 ABCDEFGFHIJKLLMNOAMPCJQODNBQCIMEPKNGHEGDBAFLJ18: 47, 2.6111 ABCDEFGHIJKLMENMLDOPKBQGIOQRHNDCCRJAFAPLHDIBFEK19: 50, 2.6315 ABCDEFGHIJKILMNOOBPQRSKEQSRGNLHJSDFCACLIGDPEBMAKHF20: 53, 2.6500 ABCDEFGHEIJKLAMINOPQRPKSQTODNCLSLHGMBRFOJTDBGKIHCAPEE21: 55, 2.6190 ABCDEFGHIJKLKMNLOEHJGPDQRSTUQCASROOBMFBUNITPFELDUGAMK               CH22: 57, 2.6364 ABCDEFGHIJKLMNMOPLDQRFNSGTPRHBQERTUVVOUCSJAIKFIDBPQOC               HGMLE23: 61, 2.6522 ABCDEFGHIJKLMNJOPOQRSTFBSUQVKWPMUTTEHVECNPADIWLRKDQAF               JGGBMOHC24: 63, 2.6250 ABCDEFGHIJKKLBMNOPQREQSTSGPRUDNVJWUHXOCMWIACPFSLVFXIM               AEHKGRDTJB25: 66, 2.6400 ABCDEFGHIJKLEMNOPQRRSOGTANUJSPVWVCMDLXWQTWHBYRHJLYKFX               IDUCBAPFNMGEO26: 69, 2.6538 ABCDEFGHIJKLAMNOPQRESTOJSUUQVLNWBXYXDHKRYZFYCGZTPIVOQ               GWKDCRBIHMANEJLS`

Now, this is slightly biased, because I let my computer run for hours on the lowest length-ratio series, but it does show the trend that the limit length is the smallest integer below 8/3*N. The one notable exception is for N=9 where an L=24 solution actually exists. Other exact 8/3 solutions for larger N may exist, but be very rare; I did not put a lot of effort in locating such solutions.

Finally I tackled the implicit challenge of finding a surprising string of length 100. 3/8*100 = 37.5, so we need 38 different symbols to manage and L(38) in fact should be 101. Method 3 has located an L=100 solution after a number of hours, and I am not planning to try and find an L=101 solution:

`N=38 L=100: ABCDEFGHIJKLMNBOPQRSTUVWXWYZaHZbPGEcdefghiARjJehNT            kkDlKjfXldieDVYbUaNQfbICRHOQcFZSKFLCgUEIBGMaXPJTMA`

All of this was programmed in python. For anyone interested in this code, I put it up for grabs.

Update: Recently (July 2005), Dan M. Shaw contacted me with his results on the same puzzle. His programs in C# are about 4 orders of magnitude faster than my python code, and he has been able to continue the systematic search much longer, until 12 different symbols. For 12 different symbols he found 1712864 strings of length 32, the first one being AABCDEFGHGIJKELFCLKBDJIHBAFDGACE. With this result he improved my length for 12 symbols by 1, and made it more likely that results for the other threefolds can be improved as well.

There is an entry in the On-Line Encyclopedia of Integer Sequences for the lengths of 2-surprising strings.