String Matching
Introduction
Search for a string (pattern) in a large body of text
- T[0...n − 1]: The text (or haystack) being searched within
- P[0...m − 1]: The pattern (or needle) being searched for
- Return smallest i such that for
This is the first occurrence of P in T
Brute Force: Θ(mn)
Knuth-Morris-Pratt Algorithm
When a mismatch occurs, the most we can shift the pattern: the largest prefix of P[0...j] that is a suffix of P[1...j]
KMP Failure Array
The failure array : fail[j] is defined as the length of the largest prefix of P[0...j] that is also a suffix of P[1...j]
- fail[0] = 0
- If a mismatch occurs at P[j] T[i] we set j = fail[j − 1]
Construct the FailArray: almost same as KMP, pattern becomes part of P
Algorithm
- Runtime: Θ(m + n)
- Space: Θ(m + n)
int KMP(){
//T: String of length n (text), P: String of length m (pattern)
// Construct FailArray, but each case add F[i]
int i = 1, j = 0;
while(i < m){
if(p[j] == p[i]){
fail[i] = j + 1; // 0...j
++i;
++j;
}
else{
if(j > 0)j = fail[j-1];
else{
fail[i] = 0;
++i;
}
}
}
// KMP
i = 0, j = 0;
while(i < n){
if(p[j] == t[i]){
++j;
++i;
if(j == m){
j = fail[j - 1];
result.push_back(i - m); // starting point
}
}
else{
if(j > 0)j = fail[j - 1];
else ++i;
}
}
}
Regular Expression
Given an input string s and a pattern p, implement regular expression matching with support for '.' and '*' where:
'.' Matches any single character.
'*' Matches zero or more of the preceding element.
Solution:
The matching should cover the entire input string (not partial).
Iterate Two strings, match one by one
case 0: 一般情况judge(str[i-1],pattern[j-1])
- dp[i][j]=dp[i-1][j-1]
case 1: pattern[j-1]=='*':
两种小情况:
(1)not use preceding char
(2)preceding char use multiple times
- dp[i][j]=dp[i][j-2]||judge(str[i-1],pattern[j-2]&&dp[i-1][j])
Code
class Solution {
public:
bool match(string s, string p) {
int n=p.size();
int m=s.size();
vector<vector<int>> dp(m+1,vector<int>(n+1,0));
dp[0][0]=1;
for(int i=2;i<=n;++i){
if(p[i-1]=='*')dp[0][i]=dp[0][i-2];
}
for(int i=1;i<=m;++i)for(int j=1;j<=n;++j){
if(judge(s[i-1],p[j-1]))
dp[i][j]=dp[i-1][j-1];
else if(p[j-1]=='*')
dp[i][j]=dp[i][j-2]||judge(s[i-1],p[j-2])&&dp[i-1][j];
}
return dp[m][n];
}
private:
bool judge(char a,char b){return a==b||b=='.';}
};
Regular Expressions
Category | Pattern | Matches |
---|---|---|
Disjuncitons | /[A-Z|a-z]/ | An upper/lower case letter |
/[0-9]/ | A single digit | |
/[^A-Z]/ | Negations | |
/\d/ | any digit | |
/\D/ | any non-digit | |
/./ | Any character except for newline | |
Anchors | /^/ | Start of Input |
/$/ | End of Input | |
Frequencies | /?/ | Optional(one or none) |
/*/ | zero or more | |
/+/ | one or more | |
/{m}/ | m repetitions | |
/{m,n}/ | m to n repetitions | |
Others | /()/ | capture group |