Skip to main content

String Matching

Introduction

Search for a string (pattern) in a large body of text

  • T[0...n − 1]: The text (or haystack) being searched within
  • P[0...m − 1]: The pattern (or needle) being searched for
  • Return smallest i such that P[j]=T[i+j]P[j] = T[i+j] for 0jm10\leq j\leq m-1

This is the first occurrence of P in T

Brute Force: Θ(mn)

Knuth-Morris-Pratt Algorithm

When a mismatch occurs, the most we can shift the pattern: the largest prefix of P[0...j] that is a suffix of P[1...j]

KMP Failure Array

The failure array : fail[j] is defined as the length of the largest prefix of P[0...j] that is also a suffix of P[1...j]

  • fail[0] = 0
  • If a mismatch occurs at P[j] \neq T[i] we set j = fail[j − 1]

20230531170333

Construct the FailArray: almost same as KMP, pattern becomes part of P

Algorithm

  • Runtime: Θ(m + n)
  • Space: Θ(m + n)
int KMP(){
//T: String of length n (text), P: String of length m (pattern)
// Construct FailArray, but each case add F[i]
int i = 1, j = 0;
while(i < m){
if(p[j] == p[i]){
fail[i] = j + 1; // 0...j
++i;
++j;
}
else{
if(j > 0)j = fail[j-1];
else{
fail[i] = 0;
++i;
}
}
}
// KMP
i = 0, j = 0;
while(i < n){
if(p[j] == t[i]){
++j;
++i;
if(j == m){
j = fail[j - 1];
result.push_back(i - m); // starting point
}
}
else{
if(j > 0)j = fail[j - 1];
else ++i;
}
}
}

Regular Expression

Given an input string s and a pattern p, implement regular expression matching with support for '.' and '*' where:

'.' Matches any single character.​​​​

'*' Matches zero or more of the preceding element.

Solution:

The matching should cover the entire input string (not partial).

Iterate Two strings, match one by one

  • case 0: 一般情况judge(str[i-1],pattern[j-1])

    • dp[i][j]=dp[i-1][j-1]
  • case 1: pattern[j-1]=='*':

    两种小情况:

    (1)not use preceding char

    (2)preceding char use multiple times

    • dp[i][j]=dp[i][j-2]||judge(str[i-1],pattern[j-2]&&dp[i-1][j])

Code

class Solution {
public:
bool match(string s, string p) {
int n=p.size();
int m=s.size();
vector<vector<int>> dp(m+1,vector<int>(n+1,0));
dp[0][0]=1;
for(int i=2;i<=n;++i){
if(p[i-1]=='*')dp[0][i]=dp[0][i-2];
}
for(int i=1;i<=m;++i)for(int j=1;j<=n;++j){
if(judge(s[i-1],p[j-1]))
dp[i][j]=dp[i-1][j-1];
else if(p[j-1]=='*')
dp[i][j]=dp[i][j-2]||judge(s[i-1],p[j-2])&&dp[i-1][j];
}
return dp[m][n];
}
private:
bool judge(char a,char b){return a==b||b=='.';}
};

Regular Expressions

CategoryPatternMatches
Disjuncitons/[A-Z|a-z]/An upper/lower case letter
/[0-9]/A single digit
/[^A-Z]/Negations
/\d/any digit
/\D/any non-digit
/./Any character except for newline
Anchors/^/Start of Input
/$/End of Input
Frequencies/?/Optional(one or none)
/*/zero or more
/+/one or more
/{m}/m repetitions
/{m,n}/m to n repetitions
Others/()/capture group