
LPS-LCS-Algorithm-Analysis

The longest common substring (LCS) problem is a classic computational problem with widespread applications. This project explores LCS, one of its special cases, the longest palindromic substring (LPS) problem, and its generalizations, alongside how different problem domains can affect algorithm performance. I implement a variety of solutions to these problems, discuss their theoretical performance, and then quantitatively examine their capabilities.

Algorithm and data-generation code is written in Java and visualized with PyPlot. These visualizations evaluate performance over string lengths from 2^7 to 2^25, comparing random data with structured data in the form of articles written as a random series of pseudo-sentences via a weighted context-free grammar and a small dictionary. Additionally, the alphabet (character set size) and dictionary (word set size) are varied to test for any performance effect.

Note: for the LCS cases, both strings are of equal length in these visualizations.

Setup

The Java source code requires Java 8 or later. The test code uses JUnit 4.12, which must be added to the project via Maven or another preferred method. The Python visualization code is included and requires Matplotlib; install it with your preferred package manager if needed.

Longest Common Substring (LCS)

Naive Implementation

This basic solution checks every pair of starting indices between the input strings and computes the length of the common substring beginning there. While it has the benefit of running in place, the runtime approaches cubic at O(nmL), where L is the length of the optimal solution, and it is, unsurprisingly, the worst of the four LCS solutions here.
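The all-pairs approach above can be sketched as follows. This is a minimal illustration of the technique, not the repository's exact implementation; the class and method names are hypothetical.

```java
// Naive LCS sketch: for every pair of starting indices (i, j), extend the
// common run as far as it matches. O(n*m*L) time, O(1) extra space.
public class NaiveLcs {
    public static String lcs(String a, String b) {
        int bestLen = 0, bestStart = 0;
        for (int i = 0; i < a.length(); i++) {
            for (int j = 0; j < b.length(); j++) {
                int k = 0; // length of the common run starting at (i, j)
                while (i + k < a.length() && j + k < b.length()
                        && a.charAt(i + k) == b.charAt(j + k)) {
                    k++;
                }
                if (k > bestLen) { bestLen = k; bestStart = i; }
            }
        }
        return a.substring(bestStart, bestStart + bestLen);
    }
}
```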

Most of this domain is not tractable with the naive solution. Additionally, it is sensible that alphabet size does not affect its performance, since this implementation exploits nothing about its input format or intermediate findings.

Dynamic Programming

A classic dynamic programming solution builds, bottom-up, an n × m matrix in which each entry is the length of the longest common substring ending at that pair of indices. This is a well-known O(nm) time and space solution; however, since only the previous row is needed to fill the next, this first implementation improves the space to O(min(n, m)).
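A minimal sketch of this rolling-row DP, under the same two-row space optimization described above (hypothetical names, not the repository's API):

```java
// Bottom-up LCS DP keeping only the previous row, so extra space is
// O(min(n, m)). O(nm) time.
public class DpLcs {
    public static int lcsLength(String a, String b) {
        if (a.length() < b.length()) { String t = a; a = b; b = t; } // roll over the shorter string
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        int best = 0;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                // curr[j] = length of the longest common substring ending at a[i-1], b[j-1]
                curr[j] = (a.charAt(i - 1) == b.charAt(j - 1)) ? prev[j - 1] + 1 : 0;
                best = Math.max(best, curr[j]);
            }
            int[] t = prev; prev = curr; curr = t; // swap rows instead of reallocating
        }
        return best;
    }
}
```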

At first glance it may be surprising how similar this looks to the naive performance; however, recalling the respective runtimes, O(nmL) vs. O(nm), validates this finding: typically L <<< n + m, and the vast majority of naive iterations terminate extremely quickly.

Improving upon this classic algorithm, notice that if the characters at indices i and j do not match, and l is the current best length, then any optimal common substring lying between these bounds must include indices i + l and j + l respectively. This is called skipping, and we can use it to reduce the number of locations we check. In the best case, the optimal solution is found immediately and the runtime drops to O(nm/L), while the worst case remains O(nm).

My standout favorite result is this graph. First, it shows a vast performance boost over the standard dynamic programming approach, since only one long substring needs to be found to enable sizeable skips. Second, it confirms my suspicion that these implementations behave differently on different types of strings. The performance difference is noticeable, and sensible: smaller alphabets lend themselves to more shared substrings simply by chance.

Suffix Trees

Suffix trees provide quick solutions to many string-based operations, and the discovery of linear-time construction algorithms in the 1990s, such as Ukkonen's algorithm, from which this solution is derived, enables an LCS solution that finds the deepest node in the suffix tree shared by both strings.

Below we see a solidly linear time complexity, with even scaling on the graph below. Besides the obvious comparisons to the other implementations, note the dependence on alphabet size. This is likely the result of two forces: the GST node implementation uses HashMaps, which need to be resized, and more characters imply more nodes with shorter edges, increasing indirection within the algorithm. This effect is less pronounced on the structured data, as its edges are longer (often words or fragments) and even the smallest dictionary size yields a sizeable 15-20 character alphabet.

Performance on Uneven String Lengths

Clearly, from the theoretical runtimes, the dynamic programming solutions are faster when even just one string is small. From the graphs below, it is clear that dynamic programming is competitive up to a surprising threshold: even standard dynamic programming is no worse than the (albeit abstraction-heavy) suffix tree implementation so long as one string is shorter than 2^12 characters.

Longest Palindromic Substring (LPS)

LPS is a special case of LCS in which the two input strings are each other's reverse. Interestingly, we will see this gives dynamic programming much better execution times for large but bounded (< 2^24 character) strings.

Naive Implementation

While this is not a truly naive solution, it is certainly the simplest of these three. The algorithm runs several passes over the string, each time searching for a longer palindrome, up to the optimal length L. Since palindromes always contain at least n/2 inner palindromes, the algorithm can terminate after L < n iterations, giving a runtime of O(nL^2).
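One way to realize the iterate-by-length strategy described above is the sketch below. It is an illustrative reconstruction, not the repository's code; since palindrome lengths of the same parity differ by two, it only gives up after two consecutive lengths fail.

```java
// Simple LPS by increasing length: scan all windows of each length and
// test palindromicity directly. O(n * L^2) time, O(1) extra space.
public class NaiveLps {
    public static String lps(String s) {
        if (s.isEmpty()) return "";
        String best = s.substring(0, 1);
        int misses = 0; // consecutive lengths with no palindrome found
        for (int len = 2; len <= s.length() && misses < 2; len++) {
            boolean found = false;
            for (int i = 0; i + len <= s.length(); i++) {
                if (isPal(s, i, i + len - 1)) {
                    best = s.substring(i, i + len);
                    found = true;
                    break;
                }
            }
            misses = found ? 0 : misses + 1;
        }
        return best;
    }

    private static boolean isPal(String s, int lo, int hi) {
        while (lo < hi) {
            if (s.charAt(lo++) != s.charAt(hi--)) return false;
        }
        return true;
    }
}
```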

Notice that the performance appears worse with smaller alphabets; however, this is just a symptom of the fact that the expected palindrome length in a random string is longer with fewer distinct characters.

Improved

The best performing algorithm for LPS (for strings under 2^24 characters) was a space-cheap version of Manacher's algorithm, which runs in O(P) time, where P is the number of palindromic substrings. Typically this number is O(n), and the algorithm is very basic, so there is little overhead. It iterates through the string, at each index i finding the longest odd-length palindrome centered on character i and the longest even-length palindrome centered between characters i and i+1.
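The expand-around-center strategy just described can be sketched as below (illustrative names, not the repository's API). Each expansion step discovers one palindromic substring, which is why the total work is proportional to P.

```java
// Expand-around-center LPS: grow the largest odd- and even-length
// palindromes at every center. O(P) time, O(1) extra space.
public class CenterLps {
    public static String lps(String s) {
        if (s.isEmpty()) return "";
        int bestStart = 0, bestLen = 1;
        for (int i = 0; i < s.length(); i++) {
            // even = 0: odd palindrome centered on i;
            // even = 1: even palindrome centered between i and i+1
            for (int even = 0; even <= 1; even++) {
                int lo = i, hi = i + even;
                while (lo >= 0 && hi < s.length() && s.charAt(lo) == s.charAt(hi)) {
                    lo--; hi++;
                }
                int len = hi - lo - 1; // length of the palindrome just found
                if (len > bestLen) { bestLen = len; bestStart = lo + 1; }
            }
        }
        return s.substring(bestStart, bestStart + bestLen);
    }
}
```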

Manacher's Algorithm

Manacher's algorithm is the dynamic programming upgrade to the Improved algorithm above. It is a simple O(n)-time algorithm, improving on the O(P) bound above at the cost of linear space.
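For reference, a compact sketch of the standard formulation of Manacher's algorithm (again with hypothetical names, assuming the sentinel character `|` does not occur in the input):

```java
// Manacher's algorithm: interleave a sentinel so every palindrome has an
// odd-length center in t, then reuse mirrored radii to avoid re-expanding.
// O(n) time, O(n) extra space.
public class Manacher {
    public static String lps(String s) {
        if (s.isEmpty()) return "";
        char[] t = new char[2 * s.length() + 1];
        for (int i = 0; i < t.length; i++) {
            t[i] = (i % 2 == 0) ? '|' : s.charAt(i / 2); // '|' assumed absent from s
        }
        int[] radius = new int[t.length]; // palindrome radius around each center of t
        int center = 0, right = 0, bestCenter = 0;
        for (int i = 0; i < t.length; i++) {
            if (i < right) { // reuse the radius of i's mirror about the current center
                radius[i] = Math.min(right - i, radius[2 * center - i]);
            }
            while (i - radius[i] - 1 >= 0 && i + radius[i] + 1 < t.length
                    && t[i - radius[i] - 1] == t[i + radius[i] + 1]) {
                radius[i]++;
            }
            if (i + radius[i] > right) { center = i; right = i + radius[i]; }
            if (radius[i] > radius[bestCenter]) bestCenter = i;
        }
        int start = (bestCenter - radius[bestCenter]) / 2; // map back to s
        return s.substring(start, start + radius[bestCenter]);
    }
}
```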

Performance on certain strings, such as those with long runs of repeated characters, shows speed improvements over the previous algorithm; on average, however, the additional linear space buys little on random strings.

Suffix Trees

Despite having the best asymptotic runtime, suffix trees performed the worst over this sample space due to their large overhead and the abstraction-heavy implementation I used. The algorithm uses a modified LCS solution, feeding in the target string and its reverse. This requires constant-time LCA retrieval after linear-time preprocessing; my solution uses the O(n)-preprocessing, O(1)-query LCA algorithm presented by Bender & Farach-Colton (2000).

Despite having the same linear runtime as Manacher's algorithm, the massive overhead of suffix trees makes this approach far worse on a case-by-case basis; however, if the trees are already built and LCA-prepared, the actual LPS computation is very fast.

Further Notes

The generalized k-common substring problem truly displays the strength of GSTs: the given algorithm requires no refactoring* and the cost of adding strings scales additively with their lengths (maintaining linear time). On the other hand, the naive and dynamic programming solutions scale multiplicatively and require more refactoring. I conducted a brief DP vs. suffix tree test on finding the LCS of 3 strings: on inputs of only 2^12 characters, dynamic programming execution times already exceeded 10s, compared to the suffix trees' average sub-1ms execution time. The GSTs required inputs of 2^20 characters or more before showing sluggish >500ms runs.

Extended Dynamic Programming Solution

Extended Suffix Tree Solution

* No refactoring is needed when finding only the LCS common to all strings; finding the array of all 2..k LCSs requires only slight refactoring of the traversal.
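The multiplicative scaling of the DP approach is easy to see in a sketch for k = 3 (hypothetical names, not the repository's extended solution): the table becomes three-dimensional, so time and space grow with the product of the lengths.

```java
// 3-string common substring DP: dp[i][j][k] is the length of the longest
// common substring ending at a[i-1], b[j-1], c[k-1]. O(n1*n2*n3) time/space.
public class Lcs3 {
    public static int lcsLength(String a, String b, String c) {
        int[][][] dp = new int[a.length() + 1][b.length() + 1][c.length() + 1];
        int best = 0;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                for (int k = 1; k <= c.length(); k++) {
                    if (a.charAt(i - 1) == b.charAt(j - 1)
                            && b.charAt(j - 1) == c.charAt(k - 1)) {
                        dp[i][j][k] = dp[i - 1][j - 1][k - 1] + 1;
                        best = Math.max(best, dp[i][j][k]);
                    }
                }
            }
        }
        return best;
    }
}
```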

Possible improvements and extensions for these implementations include:

  • Loops on naive and DP solutions could be optimized for locality.
  • Using primitive node representations in the GST can reduce indirection for speed at the cost of readability.
  • Dynamic programming solutions could utilize maps over the sparse arrays, saving space at the cost of some overhead.
  • All the solutions excluding suffix trees have straightforward parallelizable variants that give scalable speedups in the base case.
  • Suffix trees in the general case can be built in parallel with nodes as critical sections at the cost of O(k) sized nodes.
  • A dedicated LPS GST constructor could eliminate the need to copy and reverse the target string.

(All timed trials were run on a measly 16GB of 1800MHz DDR3 with a 2.7GHz Intel i5 processor)
