Thursday, October 10, 2013

String Searching Algorithm


String searching algorithms are "an important class of string algorithms that try to find a place where one or several strings (also called patterns) are found within a larger string or text".
Some important concepts that you need to understand before reading this entry are pattern, string and alphabet:
  • Pattern: Set of elements that represent a predictable manner.
  • Alphabet: Set of symbols. For example human alphabet (A~Z), binary alphabet (1,0) and DNA alphabet (A,C,G,T).
  • String: Finite set of elements chose from an alphabet.
There are multiple categories of SSA. They basically depend on the number of patterns you chose. For example, there is single pattern, finite set of patterns and infinite set of patterns algorithms. The most common single pattern SSA are:
  • Brute-force searching algorithm.
  • Knuth-Morris-Pratt algorithm.
  • Boyer-Moore algorithm.
Brute-force searching algorithm (BFSA
According to Wikipedia brute-force search is "a trivial but very general problem-solving technique that consists of systematically enumerating all possible candidates for the solution and checking whether each candidate satisfies the problem's statement".

The BFSA aligns the first position of the pattern with the first position of the text and the algorithm compares their characters one by one until all pattern characters are found or a mismatch is detected.
In conclusion, the BFSA has the following characteristics:
  • The pattern is not preprocessed.
  • The algorithm compares from left to right character by character.
  • The worst time efficiency is Θ(mn) comparisons.
  • The algorithm returns the first occurrence of the pattern.

Advantages

  • Simple.
  • Rapid approach.
  • It is widely used.

Disadvantages

  • Too slow.
  • Not scalable.

Java Code


public class BruteForceSearch {

private char[] text;
private char[] pattern;
private int n;
private int m;
public void setString(String t, String p) {
this.text = t.toCharArray();
this.pattern = p.toCharArray();
this.n = t.length();
this.m = p.length();
}
public int search() {
for (int i = 0; i < n-m; i++) {
int j = 0;
while (j < m && text[i+j] == pattern[j]) {
j++;
}
if (j == m) return i;
}
return -1;
}
}

class Test {

public static void main(String[] args) {
BruteForceSearch bfs = new BruteForceSearch();
String text = "Lorem ipsum dolor sit amet";
String pattern = "ipsum";
bfs.setString(text, pattern);
int first_occur_position = bfs.search();
System.out.println("The text '" + pattern + "' is first found after the " + first_occur_position + " position.");
}

}


Knuth-Morris-Pratt (KMP)


Overview

According to Wikipedia, the Knuth-Morris-Pratt algorithm "searches for occurrences of a "word" W within a main "text string" S by employing the observation that when a mismatch occurs, the word itself embodies sufficient information to determine where the next match could begin, thus bypassing re-examination of previously matched characters".

Being 'A' the portion of the pattern that fits with the text, 'B' is part of the text and 'f' is the length of 'A'. The KMP algorithm moves the pattern according to the characters of the pattern that fits with the text.
The main components of the KPM algorithm are the prefix function (Π) and the KMP matcher:
  • The prefix fuction (Π) keeps information about how pattern matches against shifts of itself. This information is to avoid useless shifts of the pattern.
  • The KMP matcher uses the prefix function, the pattern and the text as inputs to find the occurrence of the pattern within the text and returns the number of shifts after the first occurrence.
In conclusion, the KMP algorithm has the following characteristics:
  • The algorithm compares from left to right.
  • The algorithm preprocess the pattern.
  • The preprocessing phase time efficiency is Θ(m).
  • The searching phase time efficiency is Θ(m+n).
  • The algorithm accomplish at most 2n - 1 comparisons.

Advantages

  • Fast.
  • Simple.
  • It is good for processing large files.
  • Avoid comparisons with elements of 'B' that have previously been involved in comparison with some element of the pattern.

Disadvantages

  • Chances of mismatch increases with big sized alphabets.

Java Code


public class KnuthMorrisPratt {
public int[] prekmp(String pattern) {
int[] next = new int[pattern.length()];
int i=0, j=-1;
next[0]=-1;
while (i while (j>=0 && pattern.charAt(i)!=pattern.charAt(j))
j = next[j];
i++; 
j++;
next[i] = j;
}
return next;
}
public int kmp(String text, String pattern) {
int[] next = prekmp(pattern);
int i=0, j=0;
while (i while (j>=0 && text.charAt(i)!=pattern.charAt(j))
j = next[j];
i++; j++;
if (j==pattern.length()) return i-pattern.length();
}
return -1;
}

}

public class Test {
public static void main(String[] args) {
KnuthMorrisPratt k = new KnuthMorrisPratt();
String text = "Lorem ipsum dolor sit amet";
String pattern = "ipsum";
int first_occur_position = k.kmp(text, pattern);
System.out.println("The text '" + pattern + "' is first found on the " 
                                   + first_occur_position + " position.");
}
}


Boyer-Moore (BM)

Overview

According to Wikipedia, the Boyer-Moore algorithm "is an efficient string searching algorithm that is the standard benchmark for practical string search literature. It was developed by Robert S. Boyer and J Strother Moore in 1977. The algorithm preprocesses the string being searched for (the pattern), but not the string being searched in (the text). It is thus well-suited for applications in which the text does not persist across multiple searches. The Boyer-Moore algorithm uses information gathered during the preprocess step to skip sections of the text, resulting in a lower constant factor than many other string algorithms. In general, the algorithm runs faster as the pattern length increases".

The BM algorithm have two shifting functions: bad character rule (occurrence shift) and the good suffix rule (matching shift):
  • The bad character rule keeps information about how pattern matches against shifts of itself. This information is to avoid useless shifts of the pattern.


  • Case #1. If mismatch is within pattern.


    Case #2. If mismatch is not within pattern.
  • The good suffix rule uses the prefix function, the pattern and the text as inputs to find the occurrence of the pattern within the text and returns the number of shifts after the first occurrence.

In conclusion, the BM algorithm has the following characteristics:
  • The algorithm searches for a class of patterns given by a pattern description instead of searching for a fixed pattern.
  • The algorithm compares from right to left.
  • The algorithm preprocesses the pattern.
  • The preprocessing phase time efficiency is Θ(m+σ).
  • The searching phase time efficiency is Θ(m*n).
  • The algorithm accomplish at most 3n text character comparisons.

Advantages

  • Most efficient algorithm to search for a pattern inside a string.
  • It is used in many search and replace operations in text editors.

Disadvantages

  • Not simple.
  • The preprocessing for the good suffix rule is difficult to understand and implement.

Java Code


public class BoyerMoore {
    private final int BASE;
    private int[] occurrence;
    private String pattern;

    public BoyerMoore(String pattern) {
        this.BASE = 256;
        this.pattern = pattern;

        occurrence = new int[BASE];
        for (int c = 0; c < BASE; c++)
            occurrence[c] = -1;
        for (int j = 0; j < pattern.length(); j++)
            occurrence[pattern.charAt(j)] = j;
    }

    public int search(String text) {
    int n = text.length();
    int m = pattern.length();
        int skip;
        for (int i = 0; i <= n - m; i += skip) {
            skip = 0;
            for (int j = m-1; j >= 0; j--) {
                if (pattern.charAt(j) != text.charAt(i+j)) {
                    skip = Math.max(1, j - occurrence[text.charAt(i+j)]);
                    break;
                }
            }
            if (skip == 0) return i;
        }
        return n;
    }
}


public class Test {

public static void main(String[] args) {
String text = "Lorem ipsum dolor sit amet";
String pattern = "ipsum";
BoyerMoore bm = new BoyerMoore(pattern);
int first_occur_position = bm.search(text);
System.out.println("The text '" + pattern + "' is first found after the " 
                                    + first_occur_position + " position.");
}

}


Rabin-Karp String Searching Algorithm

coming soon......

No comments:

Post a Comment