C++ Program to Implement Aho-Corasick Algorithm for String Matching
/* C++ Program to Implement Aho-Corasick Algorithm for String Matching
In computer science, the Aho-Corasick algorithm is a string-searching algorithm of the dictionary-matching kind: it locates all occurrences of a finite set of strings (the "dictionary") within an input text, matching all patterns simultaneously. Its running time is linear in the total length of the patterns plus the length of the searched text plus the number of reported matches. Because every match is reported, the number of matches can be quadratic in the text length when every substring matches (e.g. dictionary = a, aa, aaa, aaaa and input string aaaa). */
#include <algorithm>
#include <iostream>
#include <iterator>
#include <numeric>
#include <sstream>
#include <fstream>
#include <cassert>
#include <climits>
#include <cstdlib>
#include <cstring>
#include <string>
#include <cstdio>
#include <vector>
#include <cmath>
#include <queue>
#include <deque>
#include <stack>
#include <list>
#include <map>
#include <set>
using namespace std; // Placed after the includes, so that namespace std is already declared.
#define foreach(x, v) for (typeof (v).begin() x=(v).begin(); x !=(v).end(); ++x)
#define For(i, a, b) for (int i=(a); i<(b); ++i)
#define D(x) cout << #x " is " << x << endl
const int MAXS = 6 * 50 + 10; // Max number of states in the matching machine.
// Must be at least the sum of the lengths of all keywords, plus one for state 0.
const int MAXC = 26; // Number of characters in the alphabet.
int out[MAXS]; // Output for each state, as a bitwise mask: bit i is set when words[i] ends at that state.
int f[MAXS]; // Failure function
int g[MAXS][MAXC]; // Goto function, or -1 if fail.
int buildMatchingMachine(const vector<string> &words, char lowestChar = 'a',
                         char highestChar = 'z')
{
    memset(out, 0, sizeof out);
    memset(f, -1, sizeof f);
    memset(g, -1, sizeof g);
    int states = 1; // Initially, we just have the 0 state
    for (int i = 0; i < words.size(); ++i)
    {
        const string &keyword = words[i];
        int currentState = 0;
        for (int j = 0; j < keyword.size(); ++j)
        {
            int c = keyword[j] - lowestChar;
            if (g[currentState][c] == -1)
            { // Allocate a new node
                g[currentState][c] = states++;
            }
            currentState = g[currentState][c];
        }
        out[currentState] |= (1 << i); // There's a match of words[i] at node currentState.
    }
    // State 0 should have an outgoing edge for all characters.
    for (int c = 0; c < MAXC; ++c)
    {
        if (g[0][c] == -1)
        {
            g[0][c] = 0;
        }
    }
    // Now, let's build the failure function
    queue<int> q;
    for (int c = 0; c <= highestChar - lowestChar; ++c)
    { // Iterate over every possible input
        // All nodes s of depth 1 have f[s] = 0
        if (g[0][c] != -1 and g[0][c] != 0)
        {
            f[g[0][c]] = 0;
            q.push(g[0][c]);
        }
    }
    while (q.size())
    {
        int state = q.front();
        q.pop();
        for (int c = 0; c <= highestChar - lowestChar; ++c)
        {
            if (g[state][c] != -1)
            {
                int failure = f[state];
                // Follow failure links until a state with an edge on c is found.
                // This terminates because state 0 has an edge for every character.
                while (g[failure][c] == -1)
                {
                    failure = f[failure];
                }
                failure = g[failure][c];
                f[g[state][c]] = failure;
                out[g[state][c]] |= out[failure]; // Merge out values
                q.push(g[state][c]);
            }
        }
    }
    return states;
}
int findNextState(int currentState, char nextInput, char lowestChar = 'a')
{
    int answer = currentState;
    int c = nextInput - lowestChar;
    while (g[answer][c] == -1)
        answer = f[answer];
    return g[answer][c];
}
int main()
{
    vector<string> keywords;
    keywords.push_back("he");
    keywords.push_back("she");
    keywords.push_back("hers");
    keywords.push_back("his");
    string text = "ahishers";
    buildMatchingMachine(keywords, 'a', 'z');
    int currentState = 0;
    for (int i = 0; i < text.size(); ++i)
    {
        currentState = findNextState(currentState, text[i], 'a');
        if (out[currentState] == 0)
            continue; // Nothing new, let's move on to the next character.
        for (int j = 0; j < keywords.size(); ++j)
        {
            if (out[currentState] & (1 << j))
            { // Matched keywords[j]
                cout << "Keyword " << keywords[j] << " appears from " << i
                        - keywords[j].size() + 1 << " to " << i << endl;
            }
        }
    }
    return 0;
}
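One way to sanity-check the matches reported by the program is to compare them against a brute-force search. The sketch below is not part of the original program: the `Match` struct and the `naiveMatches` helper are hypothetical names introduced here, and the approach simply calls `std::string::find` once per keyword, so it is far slower than Aho-Corasick on large inputs.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// One occurrence of a keyword in the text (inclusive character positions).
struct Match {
    std::size_t keyword; // index into the keyword list
    std::size_t first;   // position of the first matched character
    std::size_t last;    // position of the last matched character
};

// Hypothetical helper: finds every occurrence of every keyword by brute force.
std::vector<Match> naiveMatches(const std::string &text,
                                const std::vector<std::string> &keywords)
{
    std::vector<Match> result;
    for (std::size_t k = 0; k < keywords.size(); ++k) {
        if (keywords[k].empty())
            continue; // an empty keyword would match everywhere; skip it
        std::size_t pos = text.find(keywords[k]);
        while (pos != std::string::npos) {
            result.push_back({k, pos, pos + keywords[k].size() - 1});
            pos = text.find(keywords[k], pos + 1); // allow overlapping matches
        }
    }
    return result;
}
```

Results are grouped by keyword rather than by position in the text, so they need to be sorted before being compared line by line with the output of the program above.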
#include is a preprocessor directive that inserts the contents of a standard or user-defined header file into the program at the point where the directive appears. It is usually written at the beginning of a C/C++ source file. The preprocessor reads the directive and replaces it with the text of the named file before compilation proper begins; this process is known as file inclusion. System headers are written in angle brackets, as in #include <vector>, while user-defined headers are usually written in double quotes.
The iostream standard library provides the cin and cout stream objects for reading from standard input and writing to standard output, respectively. To read from and write to files, C++ provides the fstream library, which defines the following data types: • ofstream: represents an output file stream and is used to create files and to write information to them. • ifstream: represents an input file stream and is used to read information from files. • fstream: represents a general file stream with the capabilities of both ofstream and ifstream, so it can create files, write information to files, and read information from files.
Return size. Returns the number of elements in the queue. This member function effectively calls the member function size of the underlying container object. In the program above, while (q.size()) keeps looping for as long as the queue still holds at least one element.
An array is a collection of data items, all of the same type, accessed using a common name. A one-dimensional array is like a list; a two-dimensional array is like a table. The C++ language places no limit on the number of dimensions in an array, though specific implementations may. Some texts refer to one-dimensional arrays as vectors and two-dimensional arrays as matrices, and use the general term array when the number of dimensions is unspecified or unimportant. In C/C++, a multidimensional array is simply an array of arrays, and its elements are stored in row-major order: all elements of the first row come first in memory, followed by all elements of the second row, and so on.
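The row-major layout can be made visible by copying a two-dimensional array into a flat one. This is a minimal sketch, assuming a hypothetical `flatten` helper and fixed dimensions `ROWS` and `COLS` chosen only for illustration.

```cpp
#include <cstddef>
#include <cstring>

const std::size_t ROWS = 2;
const std::size_t COLS = 3;

// Copies a 2D array into a flat 1D array, byte for byte.
// Because storage is row-major, grid[r][c] lands at flat[r * COLS + c].
void flatten(const int (&grid)[ROWS][COLS], int (&flat)[ROWS * COLS])
{
    std::memcpy(flat, grid, sizeof grid);
}
```

The same index arithmetic, row * COLS + col, is what the compiler applies when the program above evaluates g[state][c].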
Return size. Returns the number of elements in the vector. This is the number of actual objects held in the vector, which is not necessarily equal to its storage capacity. vector::size() is declared in the <vector> header, takes no parameters, and returns the total number of elements. Because a vector is a dynamic array, elements can be inserted or removed at run time, and size() always reflects the current count.
Strings are objects that represent sequences of characters. The standard string class provides support for such objects with an interface similar to that of a standard container of bytes, but adding features specifically designed to operate with strings of single-byte characters. The string class is an instantiation of the basic_string class template that uses char (i.e., bytes) as its character type, with its default char_traits and allocator types. Note that this class handles bytes independently of the encoding used: If used to handle sequences of multi-byte or variable-length characters (such as UTF-8), all members of this class (such as length or size), as well as its iterators, will still operate in terms of bytes (not actual encoded characters).
A program shall contain a global function named main, which is the designated start of the program in a hosted environment. The main() function is the entry point of a C++ program: when the program is executed, control passes to main() once static initialization is complete. Every hosted C++ program has exactly one main() function.
In C++, vectors are used to store elements of the same data type. Unlike arrays, however, a vector can grow dynamically, so its size can change during the execution of a program as required. Vectors are part of the C++ Standard Template Library; to use them, include the <vector> header. The vector class provides member functions for the common operations. Add Elements to a Vector: to add a single element, use the push_back() function, which inserts the element at the end of the vector. Access Elements of a Vector: elements are accessed by index, either with operator[] (as the program above does with words[i]) or with the at() function, which additionally checks that the index is in range.
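The difference between the two access styles shows up only when the index is out of range: `at()` throws `std::out_of_range`, while `operator[]` performs no check. The following sketch illustrates this with a hypothetical helper, `valueOr`, which is not part of the program above.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Returns the element at `index`, or `fallback` when the index is out of range.
int valueOr(const std::vector<int> &values, std::size_t index, int fallback)
{
    try {
        return values.at(index); // at() checks bounds and throws on failure
    } catch (const std::out_of_range &) {
        return fallback;
    }
}
```

In a tight loop with indices that are known to be valid, such as the loops in the program above, `operator[]` is the usual choice.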
In a while loop, the condition is evaluated first. If it is true, the statements inside the loop execute, and this repeats until the condition becomes false. Control then leaves the loop and continues with the next statement after it. The important point when using a while loop is that something inside the body must change the state the condition depends on (for example by incrementing a counter or removing an element from a queue), so that the condition eventually becomes false. Otherwise the loop never terminates and is called an infinite loop.
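As a small illustration of a loop whose body guarantees progress toward termination, consider counting decimal digits. The function name `countDigits` is chosen here purely for the example.

```cpp
// Counts the decimal digits of a non-negative number.
int countDigits(unsigned int n)
{
    int digits = 1;   // every number has at least one digit
    while (n >= 10) { // the condition is checked before each iteration
        n /= 10;      // progress: n shrinks, so the loop must terminate
        ++digits;
    }
    return digits;
}
```

The failure-link loops in the program above follow the same pattern: each iteration moves to a state closer to state 0, where the condition is guaranteed to become false.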
In computer programming, we use the if statement to run a block of code only when a certain condition is met. An if statement can be followed by an optional else statement, which executes when the boolean expression is false. There are three forms of if...else statements in C++: • if statement • if...else statement • if...else if...else statement. The if statement evaluates the condition inside the parentheses ( ). If the condition evaluates to true, the code inside the body of the if is executed; if it evaluates to false, that code is skipped.
Fill block of memory. memset(ptr, value, num) sets the first num bytes of the block of memory pointed to by ptr to the specified value, which is converted to unsigned char before being copied. Because the fill works byte by byte, only values whose bytes are all identical can be set this way in an int array: 0 and -1 work, which is how the program above initializes out, f and g, whereas memset with 1 does not produce ints equal to 1. If num is larger than the size of the object pointed to, the behavior is undefined.
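The byte-wise behaviour of `memset` is easy to get wrong with `int` arrays, so a short sketch may help. The two helper names below are hypothetical, and the -1 result assumes a two's complement representation of `int`, which holds on all mainstream platforms.

```cpp
#include <cstring>

// Fills `count` ints with the byte 0xFF; each int then reads as -1
// (assuming two's complement).
void fillMinusOne(int *data, std::size_t count)
{
    std::memset(data, -1, count * sizeof(int));
}

// Fills `count` ints with the byte 0x01; each int then holds 0x01 in
// every byte, which is NOT the value 1 when int is wider than one byte.
void fillByteOne(int *data, std::size_t count)
{
    std::memset(data, 1, count * sizeof(int));
}
```

For any fill value other than 0 or -1, `std::fill` from `<algorithm>` assigns whole elements and is the safer tool.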
In computer programming, loops are used to repeat a block of code. For example, to display the numbers from 1 to 100 you can set a variable to 1 and display it 100 times, increasing its value by 1 on each iteration. When you know in advance how many times a block of code should run, a for loop is usually clearer than a while loop. A for loop is a repetition control structure that keeps the initialization, the condition and the update of the loop variable together in one place.
As the name suggests, assignment operators assign values to variables. The simple assignment operator is '='; the value on the right-hand side must be convertible to the type of the variable on the left-hand side. C++ also provides compound assignment operators, which combine an arithmetic or bitwise operation with assignment: +=, -=, *=, /=, %=, <<=, >>=, &=, ^= and |=. For example, a += b is equivalent to a = a + b. The program above uses |= to merge one output mask into another.
Bitwise operators work on the individual bits of their integer operands; each bit is either 0 or 1. Programming with them is also known as bit-level programming. C++ provides the following bitwise operators: AND (&), OR (|), XOR (^), NOT (~), left shift (<<) and right shift (>>). The bitwise AND operator is written as a single ampersand (&) between two integer operands. Each bit of the result is 1 if the corresponding bits of both operands are 1, and 0 otherwise. The program above uses 1 << i to build a mask with only bit i set, and & to test whether that bit is set in out[currentState].
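The way the program stores matched keywords in a bit mask can be sketched with two small helpers. The names `markMatched` and `isMatched` are introduced here for illustration only; the sketch uses `unsigned int` to keep the shifts well defined.

```cpp
// Marks keyword number `index` as matched in `mask`.
unsigned int markMatched(unsigned int mask, int index)
{
    mask |= (1u << index); // compound assignment: mask = mask | (1u << index)
    return mask;
}

// Reports whether keyword number `index` is marked in `mask`.
bool isMatched(unsigned int mask, int index)
{
    return (mask & (1u << index)) != 0;
}
```

A mask of this kind can track only as many keywords as the integer type has bits, which is the practical limit of the out array in the program above.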
Consider a situation in which two people in the same class share the name John. To tell them apart we need some additional information along with the name, such as the area each one lives in or a parent's name. The same situation can arise in C++ applications. For example, you might write a function called xyz() while a library you use also defines a function called xyz(). The compiler then has no way of knowing which version of xyz() you are referring to. Namespaces solve this problem by giving each name an enclosing scope: all standard library names live in namespace std, and the directive using namespace std; lets a program refer to them without the std:: prefix.
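The xyz() scenario can be sketched as follows. The namespace names `app` and `vendor` and the function `callBoth` are hypothetical and serve only to show how qualified names remove the ambiguity.

```cpp
#include <string>

namespace app {
    std::string xyz() { return "app::xyz"; }
}

namespace vendor {
    std::string xyz() { return "vendor::xyz"; }
}

// Both functions are called xyz, but the qualified names are distinct.
std::string callBoth()
{
    return app::xyz() + " and " + vendor::xyz();
}
```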
Add element at the end. Adds a new element at the end of the vector, after its current last element. The content of val is copied (or moved) to the new element. This effectively increases the container size by one, which causes an automatic reallocation of the allocated storage space if -and only if- the new vector size surpasses the current vector capacity. push_back() function is used to push elements into a vector from the back. The new value is inserted into the vector at the end, after the current last element and the container size is increased by 1. This function does not return any value.
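The relationship between size, capacity and reallocation can be observed directly. The sketch below uses a hypothetical helper, `countReallocations`, and treats a change in `capacity()` as a sign that a reallocation took place; the exact growth pattern is implementation-defined.

```cpp
#include <cstddef>
#include <vector>

// Appends the numbers 0..count-1 and counts how often capacity() changed.
std::size_t countReallocations(std::vector<int> &values, int count)
{
    std::size_t reallocations = 0;
    for (int i = 0; i < count; ++i) {
        const std::size_t before = values.capacity();
        values.push_back(i); // size grows by exactly one
        if (values.capacity() != before)
            ++reallocations; // storage was reallocated to make room
    }
    return reallocations;
}
```

Calling `reserve()` with the expected number of elements before the loop avoids these reallocations altogether.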
Iterators are objects that, like pointers, refer to the elements of a container. They are one of the main components of the C++ Standard Template Library (STL), alongside containers and algorithms. An iterator can be dereferenced to access the element it refers to and incremented to move to the next element. Iterators act as a bridge that connects algorithms to STL containers and allows the data inside a container to be read and modified. • Iterators are used to traverse from one element to another, a process known as iterating through the container. • The main advantage of iterators is that they provide a common interface for all container types. • Iterators make an algorithm independent of the type of container used.
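The container independence described above can be illustrated with one function template that works on any iterator pair. The name `sumRange` is an assumption made for this example; the standard library offers `std::accumulate` in `<numeric>` for the same job.

```cpp
#include <list>
#include <vector>

// Sums any range of ints described by a pair of iterators.
template <typename Iterator>
int sumRange(Iterator first, Iterator last)
{
    int total = 0;
    for (Iterator it = first; it != last; ++it)
        total += *it; // dereference the iterator like a pointer
    return total;
}
```

The same function accepts iterators from a `std::vector`, a `std::list`, or plain pointers into an array.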
Return length of string. Returns the length of the string, in terms of bytes. This is the number of actual bytes that make up the contents of the string, which is not necessarily equal to its storage capacity. Note that string objects handle bytes without knowledge of the encoding that may be used for the characters they contain. Therefore, the value returned may not correspond to the actual number of encoded characters in sequences of multi-byte or variable-length characters (such as UTF-8). string::size and string::length are synonyms and return the same value.
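The gap between bytes and characters can be seen with a short UTF-8 string. The helper `countCodePoints` below is a hypothetical sketch that assumes its input is valid UTF-8; it counts every byte that is not a continuation byte.

```cpp
#include <cstddef>
#include <string>

// Counts UTF-8 code points by skipping continuation bytes (10xxxxxx).
// Assumes `text` holds valid UTF-8.
std::size_t countCodePoints(const std::string &text)
{
    std::size_t count = 0;
    for (std::size_t i = 0; i < text.size(); ++i) {
        const unsigned char byte = static_cast<unsigned char>(text[i]);
        if ((byte & 0xC0) != 0x80)
            ++count; // this byte starts a new code point
    }
    return count;
}
```

For the string "caf\xC3\xA9" (the word "café" encoded as UTF-8), size() and length() both report five bytes, while the helper counts four code points.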
FIFO queue. Queues are a type of container adaptor, specifically designed to operate in a FIFO context (first-in first-out), where elements are inserted into one end of the container and extracted from the other. Queues are implemented as container adaptors, which are classes that use an encapsulated object of a specific container class as their underlying container and provide a specific set of member functions to access its elements. Elements are pushed into the "back" of the underlying container and popped from its "front". The underlying container may be one of the standard container class templates or some other specifically designed container class. It shall support at least the following operations: empty, size, front, back, push_back and pop_front. The standard containers deque and list fulfill these requirements; if no container class is specified, deque is used.
In the C++ programming language, the #define directive allows the definition of macros within your source code. A macro definition gives a name to a piece of text, and the preprocessor replaces every later occurrence of the name with that text before compilation. Macros are not variables and cannot be changed by program code. This syntax is generally used to create constants that represent numbers, strings or expressions. The syntax for creating a constant using #define in C++ is: #define token value
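Because the replacement is purely textual, a macro whose value is an expression should be wrapped in parentheses. The following sketch shows the difference; the macro and function names are made up for the example.

```cpp
#define LIMIT_UNSAFE 6 * 50 + 10 // expands textually, without grouping
#define LIMIT_SAFE (6 * 50 + 10) // parentheses keep the value intact

int doubledUnsafe() { return 2 * LIMIT_UNSAFE; } // becomes 2 * 6 * 50 + 10
int doubledSafe() { return 2 * LIMIT_SAFE; }     // becomes 2 * (6 * 50 + 10)
```

Here `doubledUnsafe()` yields 610 and `doubledSafe()` yields 620. A typed constant such as `const int MAXS = 6 * 50 + 10;`, as used in the program above, avoids the problem entirely.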
Inserts a new element at the end of the queue, after its current last element. The content of this new element is initialized to val. This member function effectively calls the member function push_back of the underlying container object. In the C++ STL, a queue is a container adaptor that follows a FIFO (first-in first-out) arrangement, i.e. the elements that are inserted first are removed first. Elements are inserted at one end, known as the "back", and removed from the other end, known as the "front". In data structure terminology, "push" is the operation that inserts an element into a container and "pop" is the operation that removes one.
Continue statement is used inside loops. Whenever a continue statement is encountered inside a loop, control directly jumps to the beginning of the loop for next iteration, skipping the execution of statements inside loop's body for the current iteration. The continue statement works somewhat like the break statement. Instead of forcing termination, however, continue forces the next iteration of the loop to take place, skipping any code in between. For the for loop, continue causes the conditional test and increment portions of the loop to execute. For the while and do...while loops, program control passes to the conditional tests.
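A minimal sketch of `continue` in a for loop, using a hypothetical function `sumNonNegative`: negative values are skipped, and control jumps straight to the increment and the condition test.

```cpp
#include <cstddef>
#include <vector>

// Sums only the non-negative values, skipping the rest with `continue`.
int sumNonNegative(const std::vector<int> &values)
{
    int total = 0;
    for (std::size_t i = 0; i < values.size(); ++i) {
        if (values[i] < 0)
            continue; // jump to ++i, then re-test the loop condition
        total += values[i];
    }
    return total;
}
```

The program above uses `continue` in the same way to skip text positions at which no keyword ends.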
Check whether eofbit is set. Returns true if the eofbit error state flag is set for the stream. This flag is set by all standard input operations when the End-of-File is reached in the sequence associated with the stream. Note that the value returned by this function depends on the last operation performed on the stream (and not on the next). Operations that attempt to read at the End-of-File fail, and thus both the eofbit and the failbit end up set. This function can be used to check whether the failure is due to reaching the End-of-File or to some other reason.
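Since eof() describes the last operation rather than the next one, a common pattern is to test the read itself in the loop condition and to consult eof() only afterwards. The sketch below uses `std::istringstream` in place of a file so that it stays self-contained; `readInts` is a hypothetical helper.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Reads whitespace-separated ints until an extraction fails.
// `reachedEnd` reports whether the failure was caused by end of input.
std::vector<int> readInts(const std::string &input, bool &reachedEnd)
{
    std::istringstream stream(input);
    std::vector<int> values;
    int value;
    while (stream >> value) // test the read itself, not eof()
        values.push_back(value);
    reachedEnd = stream.eof(); // false if a malformed token stopped the loop
    return values;
}
```

With well-formed input the loop stops at the end of the stream and `reachedEnd` is true; when a non-numeric token stops the loop, it is false.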
Remove next element. Removes the next element in the queue, effectively reducing its size by one. The element removed is the "oldest" element in the queue, whose value can be retrieved by calling member queue::front. This calls the removed element's destructor. This member function effectively calls the member function pop_front of the underlying container object. Note that pop() removes the element at the front of the queue and does not return it; call front() first if the value is needed.
Access next element. Returns a reference to the next element in the queue. The next element is the "oldest" element in the queue and the same element that is removed when queue::pop is called. This member function effectively calls the member function front of the underlying container object. Because a queue follows a FIFO (first-in first-out) arrangement, front() always refers to the element that was inserted earliest among those still in the queue.
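The queue operations described above (push, front, pop and the emptiness test) fit together as in the following sketch. The function `drainInOrder` is a hypothetical helper written for this illustration.

```cpp
#include <cstddef>
#include <queue>
#include <vector>

// Pushes the values in order, then drains the queue and records the
// order in which the elements come out.
std::vector<int> drainInOrder(const std::vector<int> &values)
{
    std::queue<int> q;
    for (std::size_t i = 0; i < values.size(); ++i)
        q.push(values[i]); // insert at the back

    std::vector<int> order;
    while (!q.empty()) {            // same effect as `while (q.size())`
        order.push_back(q.front()); // read the oldest element
        q.pop();                    // remove it; pop() returns nothing
    }
    return order;
}
```

Elements leave the queue in the order in which they were inserted, which is what makes the queue suitable for the breadth-first traversal that builds the failure function in the program above.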