Press ESC to close

Implement the KMP Algorithm for Substring Search in Python

Photo by VERPEX

The Knuth-Morris-Pratt (KMP) algorithm is an efficient pattern matching algorithm that searches for occurrences of a substring (or “pattern”) within a main string. It achieves this by preprocessing the pattern to create a longest prefix-suffix (LPS) array that helps avoid redundant comparisons, making it faster than the brute-force approach.

Here’s a Python implementation of the KMP algorithm:

KMP Algorithm in Python

def compute_lps(pattern):
    """
    Computes the Longest Prefix Suffix (LPS) array for the given pattern.
    The LPS array is used to skip characters while matching.
    """
    length = 0  # length of the previous longest prefix suffix
    i = 1
    lps = [0] * len(pattern)

    # Loop to calculate lps[i] for i = 1 to len(pattern) - 1
    while i < len(pattern):
        if pattern[i] == pattern[length]:
            length += 1
            lps[i] = length
            i += 1
        else:
            if length != 0:
                # Try the previous lps value
                length = lps[length - 1]
            else:
                lps[i] = 0
                i += 1

    return lps


def kmp_search(text, pattern):
    """
    KMP algorithm for pattern searching in the text.
    Returns a list of starting indices where the pattern is found in the text.
    """
    n = len(text)
    m = len(pattern)

    # Preprocess the pattern (compute LPS array)
    lps = compute_lps(pattern)

    result = []
    i = 0  # index for text
    j = 0  # index for pattern

    while i < n:
        if pattern[j] == text[i]:
            i += 1
            j += 1

        if j == m:
            # Pattern found at index i - j
            result.append(i - j)
            j = lps[j - 1]  # Reset j to check for more occurrences

        elif i < n and pattern[j] != text[i]:
            # Mismatch after j matches
            if j != 0:
                j = lps[j - 1]  # Use lps to avoid unnecessary comparisons
            else:
                i += 1

    return result
# Example usage:
if __name__ == "__main__":
    text = "ababcabcabababd"
    pattern = "ababd"
    result = kmp_search(text, pattern)
    if result:
        print(f"Pattern found at indices: {result}")
    else:
        print("Pattern not found")

Explanation:

  1. LPS Array (Longest Prefix Suffix):
    • This array stores the length of the longest proper prefix of the pattern that is also a suffix for every position in the pattern.
    • It helps to avoid redundant comparisons in the main string.
  2. compute_lps(pattern) Function:
    • It calculates the LPS array for the given pattern, which helps optimize the matching process by allowing jumps in the pattern rather than scanning every character.
  3. kmp_search(text, pattern) Function:
    • This function searches for occurrences of the pattern in the given text using the LPS array to skip unnecessary comparisons.
    • It returns a list of starting indices where the pattern is found.

Example Output:

For the input text “ababcabcabababd” and pattern “ababd”, the output will be:

Pattern found at indices: [10]

How the KMP Algorithm Works:

  • The algorithm begins by comparing the pattern with the text.
  • When a mismatch occurs, instead of moving the pattern by one character as in the brute-force approach, the algorithm uses the LPS array to determine the next position to continue the comparison.
  • This reduces the number of comparisons significantly.

Time Complexity:

  • Preprocessing (LPS Array): O(m), where m is the length of the pattern.
  • Searching: O(n), where n is the length of the text.
  • The overall time complexity is O(n + m), which is much more efficient than the brute-force O(n * m) for substring search.

Leave a Reply

Your email address will not be published. Required fields are marked *