kenyan programmer

mostly about programming


String searching

[orgmode source]

1. String Search

We define a function that takes in a pattern to search for and the string to search and returns a list of all found matches. Each match is in the form of a tuple of the index in the string at which the pattern was found, and the length of the substring matching the pattern.

string_to_search = "the quick brown fox jumps over the lazy dog"
pattern = "the"

# matches = search(pattern, string_to_search)
# assert list(matches) == [(0, 3), (31, 3)]

1.1. Naive string search

Here, we iterate throught each character of the string to check whether it matches the the first character of the pattern.

When we find one that matches, we check whether the characters that follow match the full pattern.

def search(pattern, to_search):
    pattern_length = len(pattern)
    search_length = len(to_search)

    if not pattern_length and search_length:
        return

    index = 0
    while index < search_length - pattern_length + 1:

        inner_index = 0
        while inner_index < pattern_length and pattern[inner_index] == to_search[index + inner_index]:
            inner_index += 1

        if inner_index == pattern_length:
            # matching substring found
            yield (index, pattern_length)
            index += (inner_index - 1)

        index += 1


matches = search(pattern, string_to_search)
assert list(matches) == [(0, 3), (31, 3)]
print(list(search('a', 'aaaa')))
[(0, 1), (1, 1), (2, 1), (3, 1)]

1.1.1. Complexity

With m as size of pattern and n as the size of the string, best case can have a time complexity of O(n+m), because mismatches are detected early.

Worst case can have a time complexity of O(mn) for example in case where there's a high similarity between characters in the pattern and the string, causing the pattern to be iterated through for each character in the string.

1.2. Boyer-Moore-Horspool Search Algorithm

With this approach, instead of iterating character by character in the string being searched looking for the pattern, we use a more optimized shifting strategy that reduces the number of comparisons we need to make.

We start by shifing by the length of the pattern being searched for, then iterate back checking whether the pattern matches from right to left.

If there is a mismatch, the first character of the mismatch will be used to determine by how much we shift, determined by its presence in the pattern being searched for. We create what is called a bad-match table that will tell us for any character, how much we'll shift by.

def search(pattern, to_search):

    bad_match_table = BadMatchTable(pattern)
    pattern_length = len(pattern)
    search_length = len(to_search)

    if not pattern_length and search_length:
        return

    search_index = pattern_length - 1
    while search_index < search_length:

        pattern_index = 0
        while pattern_index != pattern_length and pattern[pattern_length - 1 - pattern_index] == to_search[search_index - pattern_index]:
            pattern_index += 1

        if pattern_index == pattern_length:
            # we found a match, so shift by pattern length
            yield (search_index - (pattern_index - 1), pattern_length)
            shift = pattern_length
        else:
            # shift using bad-match table
            nonmatching_character = to_search[search_index - pattern_index]
            shift = bad_match_table.get_shift(nonmatching_character)

        search_index += shift

We then define the bad-match table. If the nonmatching character exists in pattern, we only need to shift by a specific number of steps so that the matching positions in the pattern and the search string are aligned, and we can check whether the whole pattern matches.

If the nonmatching character doesn't exist in the pattern, we shift by the default value, which is the length of the pattern.

However, the last character of the pattern is excluded from the bad match table.

class BadMatchTable:
    def __init__(self, pattern):
        self.default_shift = len(pattern)
        self.table = {}
        for i, character in enumerate(pattern[:-1]):
            self.table[character] = len(pattern) - 1 - i

    def get_shift(self, character):
        return self.table.get(character, self.default_shift)

We can now test search using the new algorithm.

matches = search(pattern, string_to_search)
assert list(matches) == [(0, 3), (31, 3)]
print(list(search('a', 'aaaa')))
[(0, 1), (1, 1), (2, 1), (3, 1)]

1.2.1. Complexity

Best case will have a time complexity O(n/m), where n is size of string and m is the size of the pattern; mismatches are detected early and result in shifts of size m, with the pattern and the string not having similar characters.

Worst case can have a time complexity of O(mn) when, for example, most but not all of the pattern matches and similar characters occur between the two causing small shifts.