In one of our projects we needed to search a file for multiple words and return the words that were found.
These files would be large, and there might also be a large number of words to search for.
Ruby has good support for regular expressions, but we expected it to be slower at this kind of pattern search than Unix commands, so we decided to look at Unix commands for doing these searches.
We tried “grep”, “awk” and a “Boyer-Moore searching algorithm in Ruby” for doing this search.
Each of these was used to search for patterns in a file.
grep and awk return the whole line on which the matching pattern is present (the content of the line, not the line number). Working further with grep and awk, we found different ways of searching files with them.
Using grep:
To search a file for a word with grep, we can say:
    grep "word" filepath
e.g.
    grep "cool" fun.txt
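To search a file for multiple words, we can use:
    grep -E "word1|word2" filepath
e.g.
    grep -E "cool|hot" fun.txt
The -E option enables extended regular expressions, in which | means "or"; without it grep treats | as a literal character.
This returns every line that contains cool or hot anywhere in it, even as part of a longer word (i.e. a line containing hotter also counts as a match).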
To make it a whole-word comparison we have to add the -w option.
    grep -E -w "word1|word2" filepath
This matches cool or hot only as whole words, so hotter is no longer a match.
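The difference between a substring match and a whole-word match can be seen in a small sketch. The sample file contents and the variable names below are made up for illustration; only the grep options come from the text above.

```shell
# Build a small sample file (hypothetical contents, for illustration only).
sample=$(mktemp)
printf '%s\n' 'the tea is hot' 'the soup is hotter' 'a cool breeze' 'nothing here' > "$sample"

# -E enables extended regular expressions, so | means "or".
# Without -w, "hot" also matches inside "hotter".
substring_matches=$(grep -E "cool|hot" "$sample")

# -w additionally requires each match to be a whole word.
whole_word_matches=$(grep -E -w "cool|hot" "$sample")

printf 'substring matches:\n%s\n' "$substring_matches"
printf 'whole-word matches:\n%s\n' "$whole_word_matches"

rm -f "$sample"
```

Here the first search returns three lines, while the second drops the line that only contains hotter.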
We can also keep the words in a list file and search another file for them.
    grep -f list.txt filepath
e.g.
    grep -f list.txt file.txt
Here list.txt contains the list of words (one word per line) and file.txt is the file to be searched.
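A minimal sketch of the list-file approach follows. The file contents are invented for illustration, and the -F and -w options are additions that the original command did not use: -F treats each line of the list as a fixed string rather than a regular expression, and -w keeps the whole-word behaviour described earlier.

```shell
# Work in a temporary directory so no existing files are touched.
work=$(mktemp -d)

# list.txt: one search word per line (hypothetical contents).
printf '%s\n' 'cool' 'hot' > "$work/list.txt"

# file.txt: the file to be searched (hypothetical contents).
printf '%s\n' 'the tea is hot' 'the soup is hotter' 'a cool breeze' 'nothing here' > "$work/file.txt"

# -f reads the patterns from list.txt, one per line.
# -F treats them as fixed strings; -w matches whole words only.
list_matches=$(grep -F -w -f "$work/list.txt" "$work/file.txt")

printf '%s\n' "$list_matches"

rm -rf "$work"
```

Keeping the words in a file avoids building one very long alternation on the command line when the list is large.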
All of these commands return the lines on which the words are present.
We wanted to know which words matched rather than the whole line, so we added the -o option, which prints only the matched part of each line.
So the final grep command was:
    grep -o -w -E "word1|word2" filepath
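As a rough illustration of the final command, the sketch below runs it on an invented sample file. The `sort -u` step is an addition, assuming that a de-duplicated list of found words is what is wanted; the variable names are hypothetical.

```shell
# Sample input (hypothetical contents, for illustration only).
sample=$(mktemp)
printf '%s\n' 'hot tea and hot soup' 'a cool breeze' 'hotter than before' > "$sample"

# -o prints each match on its own line instead of the whole line.
# -w keeps whole words only, so "hotter" contributes nothing.
found_words=$(grep -o -w -E "cool|hot" "$sample")

# Optional extra step: collapse repeated matches into a distinct list.
unique_words=$(printf '%s\n' "$found_words" | sort -u)

printf 'all matches:\n%s\n' "$found_words"
printf 'distinct words:\n%s\n' "$unique_words"

rm -f "$sample"
```

Note that -o reports a word once per occurrence, so a word that appears twice on a line is printed twice.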
Using AWK:
To search a file for a word with awk, we can say:
    awk '/word/' filepath
To match only the whole word, and not words that merely contain it, we can use the word-boundary operators \< and \> (this is similar to using -w with grep; these operators are a GNU awk extension):
    awk '/\<word\>/' filepath
To search for multiple words, we can use:
    awk '/\<word1\>|\<word2\>/' filepath
Here too the result is the whole line on which the word is present, so to print only the matched words the command was:
    awk '{for(i=1;i<=NF;i++){if($i~/\<word1\>|\<word2\>/){print $i}}}' filepath
This goes through each whitespace-separated field of every line, compares it against the pattern, and prints the fields that match.
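The field-by-field idea can be sketched in a way that does not depend on GNU awk. This variant is an assumption rather than the command used in the project: instead of \< and \>, it anchors the pattern with ^ and $ so that the whole field must equal one of the words. The sample contents and variable names are made up.

```shell
# Sample input (hypothetical contents, for illustration only).
sample=$(mktemp)
printf '%s\n' 'hot tea and hot soup' 'a cool breeze' 'hotter than before' > "$sample"

# Loop over every whitespace-separated field of every line.
# ^ and $ anchor the pattern to the whole field, so "hotter" is skipped.
awk_words=$(awk '{
  for (i = 1; i <= NF; i++) {
    if ($i ~ /^(cool|hot)$/) {
      print $i
    }
  }
}' "$sample")

printf '%s\n' "$awk_words"

rm -f "$sample"
```

Because fields are split on whitespace only, a word followed directly by punctuation (for example hot,) would not match this anchored pattern.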
Using the Boyer-Moore Searching Algorithm:
We also tried a searching algorithm written in Ruby, namely the “Boyer-Moore searching algorithm”.
Here we had to require its library file and call it as:
    BoyerMoore.search( "ANPANMAN" , "ANP" )
This returns the position at which the match starts, which is 0 in this case.
Benchmarking these three searching methods, we found that grep and awk performed better.
The time difference between the two of them was minor.
So grep and awk are both good solutions for searching for patterns in a file.