How to make searching a string in text files quicker - python

I want to search for a list of strings (anywhere from 2k up to 10k strings in the list) in thousands of text files (there may be as many as 100k text files, each ranging from 1 KB to 100 MB in size) saved in a folder, and output a csv file with the filenames of the text files that matched.
I have written code that does the job, but it takes around 8-9 hours to search 2000 strings across around 2000 text files totalling ~2.5 GB.
Also, this method exhausts the system's memory, so I sometimes need to split the 2000 text files into smaller batches for the code to run.
The code is below (Python 2.7).
# -*- coding: utf-8 -*-
import pandas as pd
import os

def match(searchterm):
    global result
    filenameText = ''
    matchrateText = ''
    for i, content in enumerate(TextContent):
        matchrate = search(searchterm, content)
        if matchrate:
            filenameText += str(listoftxtfiles[i]) + ";"
            matchrateText += str(matchrate) + ";"
    result.append([searchterm, filenameText, matchrateText])

def search(searchterm, content):
    if searchterm.lower() in content.lower():
        return 100
    else:
        return 0

listoftxtfiles = os.listdir("Txt/")
TextContent = []
for txt in listoftxtfiles:
    with open("Txt/" + txt, 'r') as txtfile:
        TextContent.append(txtfile.read())

result = []
for i, searchterm in enumerate(searchlist):  # searchlist is the list of strings to search for
    print("Checking for " + str(i + 1) + " of " + str(len(searchlist)))
    match(searchterm)

df = pd.DataFrame(result, columns=["String", "Filename", "Hit%"])
Sample Input below.
List of strings -
["Blue Chip", "JP Morgan Global Healthcare","Maximum Horizon","1838 Large Cornerstone"]
Text file -
Usual text file containing different lines separated by \n
Sample Output below.
String,Filename,Hit%
JP Morgan Global Healthcare,000032.txt;000031.txt;000029.txt;000015.txt;,100;100;100;100;
Blue Chip,000116.txt;000126.txt;000114.txt;,100;100;100;
1838 Large Cornerstone,NA,NA
Maximum Horizon,000116.txt;000126.txt;000114.txt;,100;100;100;
As in the example above, the first string was matched in 4 files (separated by ;), the second string was matched in 3 files and the third string was not matched in any of the files.
Is there a quicker way to search without any splitting of text files?

Your code pushes large amounts of data around in memory because you load all the files into memory and then search them.
Performance aside, your code could use some cleaning up. Try to write functions that are as self-contained as possible, without depending on global variables (for input or output).
I rewrote your code using list comprehensions and it became a lot more compact.
# -*- coding: utf-8 -*-
from os import listdir
from os.path import isfile

def search_strings_in_files(path_str, search_list):
    """ Returns a list of lists, where each inner list contains three fields:
    the filename (without path), a string in search_list and the
    frequency (number of occurrences) of that string in that file"""
    filelist = listdir(path_str)
    return [[filename, s, open(path_str + filename, 'r').read().lower().count(s)]
            for filename in filelist
            if isfile(path_str + filename)
            for s in [sl.lower() for sl in search_list]]

if __name__ == '__main__':
    print search_strings_in_files('/some/path/', ['some', 'strings', 'here'])
Mechanisms that I use in this code:
a list comprehension to loop through search_list and through the files.
a filter condition (the if clause in the comprehension) to loop only through the entries in the directory that are files (and not through subdirectories).
method chaining to directly call a method on the object that is returned.
Tip for reading the list comprehension: try reading it from bottom to top, so:
I convert all items in search_list to lower case using a list comprehension.
Then I loop over that list (for s in ...).
Then I filter out the directory entries that are not files (if isfile ...).
Then I loop over all files (for filename ...).
In the top line, I create the sublist containing three items:
filename
s, that is the lower-case search string
a method-chained call that opens the file, reads all its contents, converts it to lower case and counts the number of occurrences of s.
This code uses all the power there is in "standard" Python functions. If you need more performance, you should look into specialised libraries for this task.
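If memory is the bigger problem, another option is to stream the files one at a time instead of holding them all in memory, and to check every search term against each file as it is read. This is only a sketch under the assumption that a plain case-insensitive substring test (the search() above) is all that is needed; the folder name "Txt" comes from the question, the rest is illustrative:

# Sketch only: read each file once, lower-case it once, and test all search
# terms against it, collecting the matching filenames per term.
import os
from collections import defaultdict

def search_folder(folder, searchlist):
    lowered_terms = [(term, term.lower()) for term in searchlist]
    hits = defaultdict(list)                 # term -> filenames containing it
    for name in os.listdir(folder):
        path = os.path.join(folder, name)
        if not os.path.isfile(path):
            continue
        with open(path, 'r') as f:
            content = f.read().lower()       # only one file in memory at a time
        for term, lowered in lowered_terms:
            if lowered in content:
                hits[term].append(name)
    return hits

hits = search_folder("Txt", ["Blue Chip", "Maximum Horizon"])

Each file is read and lower-cased once instead of once per search term, and only one file is held in memory at a time, so there should be no need to split the input into batches.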

Related

How to show the output in multiple text files in python, for example: print digits from 1 to 10 in 1.txt,2.txt,3.txt

How to show the output in multiple text files in python, for example: print digits from 1 to 10 in 1.txt(o/p-1),2.txt(o/p-2),3.txt(o/p-3) files...
Here is a piece of code that I tried, to get the image files as text and store them in individual files.
c = -1
outFile = []  # created a list, as it would overwrite the same outfile multiple times if not used (expecting it to be a different file when a list is used)
for i in range(0, len(onlyfiles)):
    text = pytesseract.image_to_string(images[i])
    c = c + 1
    outFile.append(c)
    outFile[c] = open(str1 + "outputfile.txt", "w")  # str values could be incremented using a different loop/function to get different names
    outFile[c].write(text)
    outFile[c].close()
Any modification or new approach is really appreciated.
for n, image in enumerate(images, start=1):
    text = pytesseract.image_to_string(image)
    with open(f"{n}.txt", "w") as f:
        f.write(text)
Assumptions:
We will save the text files (1.txt, 2.txt, ....) in the current directory.
images is an array containing the images themselves (not their locations on disk)
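If images still has to be built from the image files on disk, a minimal sketch could look like the following; the folder name "scans" is hypothetical, and Pillow is assumed to be the imaging library in use:

# Sketch only: build the images list from the image files in a hypothetical folder.
import os
from PIL import Image

image_dir = "scans"  # hypothetical folder holding the image files
onlyfiles = sorted(f for f in os.listdir(image_dir)
                   if f.lower().endswith((".png", ".jpg", ".jpeg")))
images = [Image.open(os.path.join(image_dir, f)) for f in onlyfiles]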

Trying to read text file and count words within defined groups

I'm a novice Python user. I'm trying to create a program that reads a text file and searches it for certain words that are grouped (I predefine the groups by reading them from a csv). For example, if I wanted to create my own definition for "positive" containing the words "excited", "happy", and "optimistic", the csv would contain those terms. I know the code below is messy - the txt file I am reading from contains 7 occurrences of the three "positive" tester words I read from the csv, yet the result prints as 25. I think it's returning a character count, not a word count. Code:
import csv
import string
import re
from collections import Counter

remove = dict.fromkeys(map(ord, '\n' + string.punctuation))

# Read the .txt file to analyze.
with open("test.txt", "r") as f:
    textanalysis = f.read()
    textresult = textanalysis.lower().translate(remove).split()

# Read the CSV list of terms.
with open("positivetest.csv", "r") as senti_file:
    reader = csv.reader(senti_file)
    positivelist = list(reader)

# Convert term list into flat chain.
from itertools import chain
newposlist = list(chain.from_iterable(positivelist))

# Convert chain list into string.
posstring = ' '.join(str(e) for e in newposlist)
posstring2 = posstring.split(' ')
posstring3 = ', '.join('"{}"'.format(word) for word in posstring2)

# Count number of words as defined in list category
def positive(str):
    counts = dict()
    for word in posstring3:
        if word in counts:
            counts[word] += 1
        else:
            counts[word] = 1
    total = sum(counts.values())
    return total

# Print result; will write to CSV eventually
print ("Positive: ", positive(textresult))
I'm a beginner as well, but I stumbled upon a process that might help. After you read in the file, split the text at every space, tab, and newline. In your case, I would keep all the words lowercase and split on punctuation as well. Save this as a list and then parse it with some sort of loop to get the number of instances of each 'positive' (or other) word.
Look at this, specifically the "train" function:
https://github.com/G3Kappa/Adjustable-Markov-Chains/blob/master/markovchain.py
Also, this link, ignore the JSON stuff at the beginning, the article talks about sentiment analysis:
https://dev.to/rodolfoferro/sentiment-analysis-on-trumpss-tweets-using-python-
Same applies with this link:
http://adilmoujahid.com/posts/2014/07/twitter-analytics/
Good luck!
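A minimal sketch of that approach, reusing test.txt and the three "positive" tester words from the question (everything else is illustrative):

# Sketch: lower-case the text, treat punctuation as spaces, split into words,
# then count how many words fall into the "positive" group.
import string

positive_words = {"excited", "happy", "optimistic"}   # e.g. loaded from the csv

with open("test.txt", "r") as f:
    text = f.read().lower()

for ch in string.punctuation:
    text = text.replace(ch, " ")          # so "happy," still counts as "happy"

words = text.split()
positive_count = sum(1 for word in words if word in positive_words)
print("Positive: %d" % positive_count)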
I looked at your code and worked through some of my own as a sample.
I have 2 ideas for you, based on what I think you may want.
First assumption: you want a basic sentiment count?
Getting to 'textresult' is great. Then you did the same with the positive lexicon - to [positivelist], which I thought would be the perfect step. But then you converted [positivelist] into essentially one big sentence.
Would you not just:
1. Pass a 'stop_words' list over [textresult]
2. Merge the two dataframes [textresult (less stopwords) and positivelist] on common words - as in an 'inner join' (see the sketch after this answer)
3. Then basically do your term frequency
4. It is much easier to aggregate the score then
Second assumption: you are focusing on "excited", "happy", and "optimistic",
and you are trying to isolate text themes into those 3 categories?
1. Again, stop at [textresult]
2. Download the 'nrc' and/or 'syuzhet' emotional valence dictionaries;
they break emotive words down into 8 emotional groups,
so if you only want 3 of the 8 emotive groups, take that subset
3. Process it like you did to get [positivelist]
4. Do another join
Sorry, this is a bit hashed up, but if I was anywhere near what you were thinking, let me know and we can make contact.
Second apology: I'm also a novice Python user; I am adapting what I use in R to Python in the above (it's not subtle either :) )
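A rough sketch of the first idea (stop-word filtering, the 'inner join' against the positive lexicon, then term frequency); the stop-word list and sample tokens here are stand-ins, not real data from the question:

# Sketch: drop stop words, keep only words that also appear in the positive
# lexicon (the "inner join"), then take term frequencies and aggregate.
from collections import Counter

textresult = ["i", "am", "happy", "and", "excited", "happy", "today"]   # as produced by the question's preprocessing
positivelist = ["excited", "happy", "optimistic"]                        # the positive lexicon from the csv
stop_words = {"i", "am", "and", "the", "a", "today"}                     # stand-in stop-word list

filtered = [w for w in textresult if w not in stop_words]
joined = [w for w in filtered if w in set(positivelist)]    # the "inner join"
term_freq = Counter(joined)                                 # term frequency per positive word
print(term_freq)                   # Counter({'happy': 2, 'excited': 1})
print(sum(term_freq.values()))     # aggregated positive score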

What is a good data structure to use for a very long list of strings? [duplicate]

This question already has answers here:
Processing Large Files in Python [ 1000 GB or More]
(8 answers)
Closed 5 years ago.
I have a very large file (80 GB) containing one sentence per line. I want to search for a user-given string for a match in this file (spaces, hyphens, case to be ignored).
Right now I have the file as text and I am using grep but it's taking a lot of time. What could be a better solution?
Example of contents of text file:
applachian
rocky mountains
andes
sierra nevada
long mountain ranges of the world
Example of search query:
rocky (no match)
sierra nevada (match found)
Based on your comment that you're searching for entire sentences:
Build an index of prefixes.
Sort the file. Next, process your file one time. Compute the length of the prefix needed to reduce a search to, say, 1000 sentences. That is, how many characters of prefix do you need to get within about 1000 sentences of a given sentence.
For example: "The" is probably a common starting word in English. But "The quick" is probably enough to get close, because "q" is low-frequency, to anything like "The quick brown fox ... etc."
One way to do this would be to put all prefixes up to a certain length (say, 40) into a collections.Counter. Find the maximum count at each length, and pick your length so that the max is <= 1000. There may be other ways. ;-)
Now, process the file a second time. Build a separate index file, consisting of prefix-length (in the file header), prefixes and offsets. All sentences that start with prefix K begin at offset V. Because the file is sorted, the index will also be sorted.
Your program can read the index into memory, open the file, and start processing searches. For each search, chop off the prefix, look that up in the index, seek to the file offset, and scan for a match.
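A small sketch of that scheme over the sorted file; the prefix length, file name and in-memory index format are made up for illustration, and normalisation (case, spaces, hyphens) is left out here and would have to be applied consistently on both sides:

# Sketch: build a prefix -> byte-offset index over a sorted sentence file,
# then answer a query by seeking to the first sentence with that prefix.
PREFIX_LEN = 12                              # assumed prefix length

def build_index(path):
    index = {}
    offset = 0
    with open(path, 'rb') as f:
        for line in f:
            prefix = line[:PREFIX_LEN]
            if prefix not in index:          # file is sorted, so the first hit is the block start
                index[prefix] = offset
            offset += len(line)
    return index

def find_sentence(path, index, query):
    key = query.encode()[:PREFIX_LEN]
    if key not in index:
        return False
    with open(path, 'rb') as f:
        f.seek(index[key])
        for line in f:
            if not line.startswith(key):     # walked past the block of matching prefixes
                return False
            if line.rstrip(b'\n') == query.encode():
                return True
    return False

index = build_index('sentences_sorted.txt')          # hypothetical sorted file
print(find_sentence('sentences_sorted.txt', index, 'sierra nevada'))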
You can build a shardable DB by mapping each sentence to a hash, and then seeking into your data at the candidate locations.
from collections import defaultdict
from cStringIO import StringIO

DATA = """applachian
rocky mountains
andes
sierra nevada
long mountain ranges of the world"""

def normalize(sentence):
    # ignore case and whitespace (per the question's requirements)
    return "".join(sentence.lower().split())

def create_db(inf):
    db = defaultdict(list)
    offset = 0
    for line in inf:
        l = len(line)
        db[hash(normalize(line))].append((offset, l))
        offset += l
    return db

def main():
    db = create_db(StringIO(DATA))
    # save this db, and in a different script, load it to retrieve:
    for needle in ["rocky", "sierra nevada"]:
        key = hash(normalize(needle))
        for offset, length in db.get(key, []):
            print "possibly found at", offset, length

if __name__ == '__main__':
    main()
This demonstrates the idea: you build a database (stored as a pickle, for example) of all normalised search keys, mapping each key to the locations where it is found. Then you can quickly retrieve the offset and length, seek to that position in the real file, and make a proper ==-based comparison.
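That verification step might look like the following sketch, which reuses the normalize function and db from the snippet above; the file name sentences.txt is hypothetical:

# Sketch of the verification step: look up the candidate (offset, length) pairs,
# seek to each one in the real file, and do an exact comparison there.
def find(db, path, needle):
    key = hash(normalize(needle))
    with open(path, 'r') as f:
        for offset, length in db.get(key, []):
            f.seek(offset)
            candidate = f.read(length)
            if normalize(candidate) == normalize(needle):   # proper == check, not just the hash
                return offset
    return None

# e.g. find(db, 'sentences.txt', 'sierra nevada')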

Can't get unique word/phrase counter to work - Python

I'm having trouble getting anything written to my output file (word_count.txt).
I expect the script to review all 500 phrases in my phrases.txt document, and output a list of all the words and how many times they appear.
from re import findall,sub
from os import listdir
from collections import Counter

# path to folder containing all the files
str_dir_folder = '../data'

# name and location of output file
str_output_file = '../data/word_count.txt'

# the list where all the words will be placed
list_file_data = '../data/phrases.txt'

# loop through all the files in the directory
for str_each_file in listdir(str_dir_folder):
    if str_each_file.endswith('data'):

        # open file and read
        with open(str_dir_folder+str_each_file,'r') as file_r_data:
            str_file_data = file_r_data.read()

        # add data to list
        list_file_data.append(str_file_data)

# clean all the data so that we don't have all the nasty bits in it
str_full_data = ' '.join(list_file_data)
str_clean1 = sub('t','',str_full_data)
str_clean_data = sub('n',' ',str_clean1)

# find all the words and put them into a list
list_all_words = findall('w+',str_clean_data)

# dictionary with all the times a word has been used
dict_word_count = Counter(list_all_words)

# put data in a list, ready for output file
list_output_data = []

for str_each_item in dict_word_count:
    str_word = str_each_item
    int_freq = dict_word_count[str_each_item]
    str_out_line = '"%s",%d' % (str_word,int_freq)

    # populates output list
    list_output_data.append(str_out_line)

# create output file, write data, close it
file_w_output = open(str_output_file,'w')
file_w_output.write('n'.join(list_output_data))
file_w_output.close()
Any help would be great (especially if I'm able to actually output 'single' words within the output list).
Thanks very much.
It would be helpful if we got more information, such as what you've tried and what sorts of error messages you received. As kaveh commented above, this code has some major indentation issues. Once I got around those, there were a number of other logic errors to work through. I've made some assumptions:
list_file_data is assigned to '../data/phrases.txt', but there is then a loop through all the files in a directory. Since you don't have any handling for multiple files elsewhere, I've removed that logic and referenced the file listed in list_file_data (and added a small bit of error handling). If you do want to walk through a directory, I'd suggest using os.walk() (http://www.tutorialspoint.com/python/os_walk.htm)
You named your file 'phrases.txt' but then check whether the files end with 'data'. I've removed this logic.
You've placed the data set into a list, but findall works just fine with strings and ignores the special characters that you've manually removed. Test here: https://regex101.com/ to make sure.
Changed 'w+' to '\w+' - check out the above link
Converting to a list outside of the output loop isn't necessary - your dict_word_count is a Counter object, which has an 'iteritems' method to roll through each key and value. Also changed the variable name to 'counter_word_count' to be slightly more accurate.
Instead of manually generating csv rows, I've imported csv and used the writerow method (and quoting options).
Code below, hope this helps:
import csv
import os
from collections import Counter
from re import findall,sub

# name and location of output file
str_output_file = '../data/word_count.txt'

# the list where all the words will be placed
list_file_data = '../data/phrases.txt'

if not os.path.exists(list_file_data):
    raise OSError('File {} does not exist.'.format(list_file_data))

with open(list_file_data, 'r') as file_r_data:
    str_file_data = file_r_data.read()

# find all the words and put them into a list
list_all_words = findall('\w+',str_file_data)

# dictionary with all the times a word has been used
counter_word_count = Counter(list_all_words)

with open(str_output_file, 'w') as output_file:
    fieldnames = ['word', 'freq']
    writer = csv.writer(output_file, quoting=csv.QUOTE_ALL)
    writer.writerow(fieldnames)
    for key, value in counter_word_count.iteritems():
        output_row = [key, value]
        writer.writerow(output_row)
Something like this?
from collections import Counter
from glob import glob

def extract_words_from_line(s):
    # make this as complicated as you want for extracting words from a line
    return s.strip().split()

tally = sum(
    (Counter(extract_words_from_line(line))
     for infile in glob('../data/*.data')
     for line in open(infile)),
    Counter())

for k in sorted(tally, key=tally.get, reverse=True):
    print k, tally[k]

Faster way to parse file to array, compare to array in second file

I currently have an MGF file containing MS2 spectral data (QE_2706_229_sequest_high_conf.mgf). The file template is at the link below, along with a snippet of an example:
http://www.matrixscience.com/help/data_file_help.html
BEGIN IONS
TITLE=File3249 Spectrum10594 scans: 11084
PEPMASS=499.59366 927079.3
CHARGE=3+
RTINSECONDS=1710
SCANS=11084
104.053180 3866.360000
110.071530 178805.000000
111.068610 1869.210000
111.074780 10738.600000
112.087240 13117.900000
113.071150 7148.790000
114.102690 4146.490000
115.086840 11835.600000
116.070850 6230.980000
... ...
END IONS
This unannotated spectral file contains thousands of these entries; the total file size is ~150 MB.
I then have a series of text files that I need to parse. Each file is similar to the format above, with the first column being read into a numpy array. The unannotated spectra file is then parsed, entry by entry, until an entry's array matches one of the arrays from the annotated text files.
(Filename GRPGPVAGHHQMPR)
m/z i matches
104.05318 3866.4
110.07153 178805.4
111.06861 1869.2
111.07478 10738.6
112.08724 13117.9
113.07115 7148.8
114.10269 4146.5
115.08684 11835.6
116.07085 6231.0
Once a match is found, an annotated MGF file is written that contains the full entry information from the unannotated file, plus a line specifying the filename of the annotated text file that matched that particular entry. The output is below:
BEGIN IONS
SEQ=GRPGPVAGHHQMPR
TITLE=File3249 Spectrum10594 scans: 11084
PEPMASS=499.59366 927079.3
... ...
END IONS
There may be a much less computationally expensive way to parse. Given 2,000 annotated files to search through against the large unannotated file above, parsing currently takes ~12 hrs on a 2.6 GHz quad-core Intel Haswell CPU.
Here is the working code:
import numpy as np
import sys
from pyteomics import mgf
from glob import glob

def main():
    """
    Usage: python mgf_parser
    """
    pep_files = glob('*.txt')
    mgf_file = 'QE_2706_229_sequest_high_conf.mgf'
    process(mgf_file, pep_files)

def process(mgf_file, pep_files):
    """Parses spectra from annotated text file. Converts m/z values to numpy array.
    If spectra array matches entry in MGF file, writes annotated MGF file.
    """
    ann_arrays = {}
    for ann_spectra in pep_files:
        a = np.genfromtxt(ann_spectra, dtype=float, invalid_raise=False,
                          usemask=False, filling_values=0.0, usecols=(0))
        b = np.delete(a, 0)
        ann_arrays[ann_spectra] = b

    with mgf.read(mgf_file) as reader:
        for spectrum in reader:
            for ann_spectra, array in ann_arrays.iteritems():
                if np.array_equal(array, spectrum['m/z array']):
                    print '> Spectral match found for file {}.txt'.format(ann_spectra[:-4])
                    file_name = '{}.mgf'.format(ann_spectra[:-4])
                    spectrum['params']['seq'] = file_name[52:file_name.find('-') - 1]
                    mgf.write((spectrum,), file_name)

if __name__ == '__main__':
    main()
This was run in a bash loop so that only a given number of files was parsed at a time. Are there any suggestions for more efficient parsing methods?
I see room for improvement in the fact that you are parsing the whole MGF file repeatedly for each of the small files. If you refactor the code so that it is only parsed once, you may get a decent speedup.
Here's how I would tweak your code, simultaneously getting rid of the bash loop, and also using the mgf.write function, which is probably a bit slower than np.savetxt, but easier to use:
from pyteomics import mgf
import sys
import numpy as np

def process(mgf_file, pep_files):
    ann_arrays = {}
    for ann_spectra in pep_files:
        a = np.genfromtxt(ann_spectra, invalid_raise=False,
                          filling_values=0.0, usecols=(0,))
        b = np.delete(a, 0)
        ann_arrays[ann_spectra] = b

    with mgf.read(mgf_file) as reader:
        for spectrum in reader:
            for ann_spectra, array in ann_arrays.iteritems():
                if np.allclose(array, spectrum['m/z array']):
                    # allclose may be better for floats than array_equal
                    file_name = 'DeNovo/good_training_seq/{}.mgf'.format(
                        ann_spectra[:-4])
                    spectrum['params']['seq'] = ann_spectra[
                        :ann_spectra.find('-') - 1]
                    mgf.write((spectrum,), file_name)

if __name__ == '__main__':
    pep_files = sys.argv[1:]
    mgf_file = '/DeNovo/QE_2706_229_sequest_high_conf.mgf'
    process(mgf_file, pep_files)
Then to achieve the same as your bash loop did, you would call it as
python2.7 mgf_parser.py *.txt
If the expanded argument list is too long, you can use glob instead of relying on bash to expand it:
from glob import iglob
pep_files = iglob(sys.argv[1])
And call it like this to prevent expansion by bash:
python2.7 mgf_parser.py '*.txt'
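If the m/z values in the annotated text files and in the MGF agree exactly after rounding, a further idea (an untested sketch reusing ann_arrays, mgf_file and mgf from the code above; the number of decimals is an assumption) is to index the annotated arrays by a rounded tuple of their values, so each spectrum costs one dictionary lookup instead of a comparison against every annotated file:

# Sketch only: replace the inner loop over ann_arrays with a dict lookup
# keyed on the rounded m/z values. Assumes both sources agree after rounding.
DECIMALS = 4

def key_for(mz_values):
    return tuple(round(float(v), DECIMALS) for v in mz_values)

ann_by_key = {key_for(arr): name for name, arr in ann_arrays.items()}

with mgf.read(mgf_file) as reader:
    for spectrum in reader:
        name = ann_by_key.get(key_for(spectrum['m/z array']))
        if name is not None:
            spectrum['params']['seq'] = name[:name.find('-') - 1]
            mgf.write((spectrum,), '{}.mgf'.format(name[:-4]))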
