Python large text file searching

I have a 500 MB text file that was made a long time ago. It has what look like HTML or XML tags, but they are not consistent throughout the file. I am trying to find the information between two tags that do not match. What I am using currently works but is very slow. myDict holds the keywords; it has 18,000 keys. I can only guarantee that '<X>' plus the key and '</N>' exist; there are no other tags that are consistent.
for key in myDict:
    start_position = 0
    start_position = the_whole_file.find('<X>' + key, start_position)
    end_position = the_whole_file.find('</N>', start_position)
    date = the_whole_file[start_position:end_position]
Is there a way to do this faster?

Reverse the way you are doing it: instead of iterating through the dictionary and searching the file for each key, iterate through the potential matches in the file and look each one up in the dictionary:
import re

for part in re.findall(r"<X>(.*?)</N>", the_whole_text):
    key = part.split(" ", 1)[0]
    if key in my_dict:
        do_something(part)
Dictionary lookup is O(1), whereas searching a string is O(N), and searching the whole file for every key is expensive: scanning your file contents is ~O(500,000,000), and you are doing that 18,000 times.
This way you only search the file once to find all the potential matches, and then look each one up to see if it is in your dictionary.
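For what it's worth, here is a variant of the same idea (my own sketch, assuming the whole text is already in memory as the_whole_text, as above): precompile the pattern and use finditer so matches are produced lazily instead of collected into one big list, and add re.DOTALL so a record can span line breaks.

import re

# Sketch: same lookup-by-key idea, with a precompiled, non-greedy pattern.
# re.DOTALL is an assumption -- it lets a record span multiple lines.
pattern = re.compile(r"<X>(.*?)</N>", re.DOTALL)

for match in pattern.finditer(the_whole_text):
    part = match.group(1)
    key = part.split(" ", 1)[0]
    if key in my_dict:
        do_something(part)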

You can always read the file line by line instead of storing the whole file in memory:
inside_tag = False
data = ''
with open(your_file, 'r') as fil:  # your_file is the path to the text file
    for line in fil:
        if '</N>' in line:
            data += line.split('</N>')[0]
            print data
            inside_tag = False
        if inside_tag:
            data += line
        if '<X>' in line:
            data = line.split('<X>')[-1]
            inside_tag = True
Note that this does not work when the beginning and end tags are on the same line.
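A rough sketch of one way to cover that case as well (my own variant, not part of the answer above; it assumes at most one <X>...</N> pair per line):

inside_tag = False
data = ''
with open(your_file, 'r') as fil:
    for line in fil:
        if not inside_tag and '<X>' in line:
            # drop everything up to and including the opening tag
            line = line.split('<X>', 1)[1]
            inside_tag = True
        if inside_tag:
            if '</N>' in line:
                # closing tag found: keep only the part before it
                data += line.split('</N>', 1)[0]
                print data
                data = ''
                inside_tag = False
            else:
                data += line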

Related

python keeping track of set of values on disk, is there a better approach? (no pandas)

I am parsing thousands of logging documents and I need to keep track of the collection of unique user_ids that appear thousands of times in those files.
What I thought of is keeping a text file containing that list of user_ids.
The text is read into a list, the new values are merged in, the set is extracted, and the list is saved back to the file:
from pathlib import Path

def add_value_to_csv_text_set_file(file,
                                   values,  # list of strings
                                   raise_error=True,
                                   verbose=False,
                                   DEBUG=False):
    # check if file exists, otherwise create an empty file
    file = Path(file)
    if file.is_file() == False:
        file.write_text("")
    # read the contents
    with open(file, 'r') as f:
        registered_values = f.read().split(',')
    registered_values = [value for value in registered_values if value != '']
    if DEBUG: print('now: ', registered_values)
    set_of_values = set([str(value) for value in values])
    new_values = [value for value in set_of_values if value not in registered_values]
    if DEBUG: print("to_add", new_values)
    new_text = ','.join(sorted(registered_values + new_values))
    with open(file, 'w') as f:
        f.write(new_text)
Somehow this does not seem very efficient. For one, I read the whole text into memory; second, I use set(list), which I think is not very fast; third, I convert lists back and forth into text; and fourth, I check every single time whether the file exists and also whether there are empty elements (because an empty element is created the first time, at the beginning, i.e. by file.write_text("")).
Would someone point me to a better and more pythonic solution?
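For what it's worth, a minimal sketch of a simpler version (my own rewrite, not an accepted answer; it assumes the file comfortably fits in memory and stays a single comma-separated line):

from pathlib import Path

def add_values_to_set_file(path, values):
    # Read the existing values (if any), dropping empty strings left by an empty file.
    path = Path(path)
    existing = set()
    if path.is_file():
        existing = {v for v in path.read_text().split(',') if v}
    # Set union replaces the manual "is it already registered?" list comprehension.
    merged = existing | {str(v) for v in values}
    path.write_text(','.join(sorted(merged)))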

Using python (acora) to find lines containing keywords

I'm writing a program that reads in a directory of text files and finds a specific combination of strings that are overlapping (i.e. shared among all files). My current approach is to take one file from this directory, parse it, build a list of every string combo, and then search for this string combo in the other files. For instance, if I had ten files, I'd read one file, parse it, store the keywords I need, then search the other nine files for this combination. I'd repeat this for every file (making sure that a file doesn't search itself). To do this, I'm trying to use python's acora module.
The code I have so far is:
from acora import AcoraBuilder

def match_lines(f, *keywords):
    """Taken from [https://pypi.python.org/pypi/acora/], FAQs and Recipes #3."""
    builder = AcoraBuilder('\r', '\n', *keywords)
    ac = builder.build()
    line_start = 0
    matches = False
    for kw, pos in ac.filefind(f):  # Modified from original function; search a file, not a string.
        if kw in '\r\n':
            if matches:
                yield f[line_start:pos]
                matches = False
            line_start = pos + 1
        else:
            matches = True
    if matches:
        yield f[line_start:]

def find_overlaps(f_in, fl_in, f_out):
    """f_in: input file to extract string combo from & use to search other files.
    fl_in: list of other files to search against.
    f_out: output file that'll have all lines and file names that contain the matching string combo from f_in.
    """
    string_list = build_list(f_in)  # Open the first file, read each line & build a list of tuples (string #1, string #2). The "build_list" function isn't shown in my pasted code.
    found_lines = []  # Create a list to hold all the lines (and file names, from fl_in) that are found to have the matching (string #1, string #2).
    for keywords in string_list:  # For each tuple (string #1, string #2) in the list of tuples
        for f in fl_in:  # For each file in the input file list
            for line in match_lines(f, *keywords):
                found_lines.append(line)
As you can probably tell, I used the function match_lines from the acora web page, "FAQ and recipes" #3. I also used it in the mode that parses files (using ac.filefind()), which is also from that page.
The code seems to work, but it's only yielding me the file name that has the matching string combination. My desired output is to write out the entire line from the other files that contain my matching string combination (tuple).
I'm not seeing what here would produce filenames, as you say it does.
Regardless, to get line numbers, you just need to count them as you pass them in match_lines():
line_start = 0
line_number = 0
matches = False
text = open(f, 'r').read()
for kw, pos in ac.filefind(f):  # Modified from original function; search a file, not a string.
    if kw in '\r\n':
        if matches:
            yield line_number, text[line_start:pos]
            matches = False
        line_start = pos + 1
        line_number += 1
    else:
        matches = True
if matches:
    yield line_number, text[line_start:]
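With that change, the loop in find_overlaps would unpack the pair; a hypothetical usage sketch (assuming match_lines now yields (line_number, line) tuples):

for keywords in string_list:
    for f in fl_in:
        for line_number, line in match_lines(f, *keywords):
            found_lines.append((f, line_number, line))  # keep the file name alongside the line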

Python - Opening files for comparison

I am attempting to open two files and check if the first word in file_1 is in any line in file_2. If the first word in a line from file_1 matches the first word in a line in file_2, I'd like to print both lines out. However, with the code below I am not getting any result. I will be dealing with very large files, so I'd like to avoid putting the files into memory using a list or dictionary. I can only use the built-in functions in Python 3.3. Any advice would be appreciated; also, if there is a better way, please advise.
Steps I am trying to perform:
1.) Open file_1
2.) Open file_2
3.) Check if the first Word is in ANY line of file_2.
4.) If the first word in both files match print the line from both file_1 and file_2.
Contents of files:
file_1:
Apples: 5 items in stock
Pears: 10 items in stock
Bananas: 15 items in stock
file_2:
Watermelon: 20 items in stock
Oranges: 30 items in stock
Pears: 25 items in stock
Code Attempt:
with open('file_1', 'r') as a, open('file_2', 'r') as b:
    for x, y in zip(a, b):
        if any(x.split()[0] in item for item in b):
            print(x, y)
Desired Output:
('Pears: 10 items in stock', 'Pears: 25 items in stock')
Try:
for i in open('[Your File]'):
    for x in open('[Your File 2]'):
        if i == x:
            print(i)
I would actually strongly advise against storing data in 1 GB text files rather than in some sort of database or standard data-storage file format. If your data were more complex, I'd suggest CSV or some sort of delimited format at minimum. If you can split and store the data in much smaller chunks, maybe a markup language like XML, HTML, or JSON (which would make navigation and extraction of data easy), which are far more organized and already optimized to handle what you're trying to do (locating matching keys and returning their values).
That said, you could use the "readline" method found in section 7.2.1 of the Python 3 docs to efficiently do what you're trying to do: https://docs.python.org/3/tutorial/inputoutput.html#reading-and-writing-file.
Or, you could just iterate over the file:
def _get_key(string, delim):
    #Split key out of string
    key = string.split(delim)[0].strip()
    return key

def _clean_string(string, charToReplace):
    #Remove garbage from string
    for character in charToReplace:
        string = string.replace(character, '')
    #Strip leading and trailing whitespace
    string = string.strip()
    return string

def get_matching_key_values(file_1, file_2, delim, charToReplace):
    #Open the files to be compared
    with open(file_1, 'r') as a, open(file_2, 'r') as b:
        #Create an object to hold our matches
        matches = []
        #Iterate over file 'a' and extract the keys, one-at-a-time
        for lineA in a:
            keyA = _get_key(lineA, delim)
            #Iterate over file 'b' and extract the keys, one-at-a-time
            for lineB in b:
                keyB = _get_key(lineB, delim)
                #Compare the keys. You might need upper, but I usually prefer
                #to compare all uppercase to all uppercase
                if keyA.upper() == keyB.upper():
                    cleanedOutput = (_clean_string(lineA, charToReplace),
                                     _clean_string(lineB, charToReplace))
                    #Append the match to the 'matches' list
                    matches.append(cleanedOutput)
            #Reset file 'b' pointer to start of file and try again
            b.seek(0)
    #Return our final list of matches
    #--NOTE: this method CAN return an empty 'matches' object!
    return matches
This is not really the best/most efficient way to go about this:
- ALL matches are saved to a list object in memory
- There is no handling of duplicates
- No speed optimization
- Iteration over file 'b' occurs 'n' times, where 'n' is the number of lines in file 'a'. Ideally, you would only iterate over each file once (a single-pass sketch is below).
Even only using base Python, I'm sure there is a better way to go about it.
For the Gist: https://gist.github.com/MetaJoker/a63f8596d1084b0868e1bdb5bdfb5f16
I think the Gist also has a link to the repl.it I used to write and test the code if you want a copy to play with in your browser.
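For reference, a sketch of that single-pass idea (my own variant, not part of the gist): store only the first word of each line of file_1 in a dict, then scan file_2 once. It assumes the keys of file_1 fit in memory, and duplicate keys in file_1 keep only their last line.

def get_matching_key_values_single_pass(file_1, file_2, delim):
    #Map uppercased key -> full line for file_1 (one pass over file_1)
    keys = {}
    with open(file_1, 'r') as a:
        for lineA in a:
            keys[lineA.split(delim)[0].strip().upper()] = lineA.strip()
    #One pass over file_2, looking each key up in the dict
    matches = []
    with open(file_2, 'r') as b:
        for lineB in b:
            keyB = lineB.split(delim)[0].strip().upper()
            if keyB in keys:
                matches.append((keys[keyB], lineB.strip()))
    return matches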

How to get text between 2 lines with Python

So I have a text file that is structured like this:
Product ID List:
ABB:
578SH8
EFC025
EFC967
CNC:
HDJ834
HSLA87
...
...
This file continues on with many companies' names and IDs below them. I then need to get the IDs of the chosen company and append them to a list, where they will be used to search a website. Here is the current line I have to get the data:
PID = open('PID.txt').read().split()
This works great if there are only the product IDs of one company in there and no other text, but it does not work for what I plan on doing. How can I have the reader read from, for example, after where it says ABB: to before the next company? I was thinking of adding some kind of marker in the file, like ABB END, to know where to cut to, but I still don't know how to cut out between lines in the first place. If you could let me know, that would be great!
Two consecutive newlines act as a delimiter, so just split there and construct a dictionary of the data:
data = {i.split()[0]: i.split()[1:] for i in open('PID.txt').read().split('\n\n')}
Since the file is structured like that you could follow these steps:
Split based on the two newline characters \n\n into a list
Split each list on a single newline character \n
Drop the first element for a list containing the IDs for each company
Use the first element (mentioned above) as needed for the company name (make sure to remove the colon)
Also, take a look at regular expressions for parsing data like this.
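For example, a regex sketch along those lines (an illustration only; it assumes each company block ends at a blank line or at the end of the file, as the splitting answer above assumes):

import re

text = open('PID.txt').read()
# Grab everything between 'ABB:' and the next blank line (or end of file).
match = re.search(r'ABB:\s*\n(.*?)(?:\n\s*\n|\Z)', text, re.DOTALL)
if match:
    PID = match.group(1).split()
    print(PID)  # e.g. ['578SH8', 'EFC025', 'EFC967']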
with open('file.txt', 'r') as f:  # open the file
    next(f)  # skip the first line
    results = {}  # initialize a dictionary
    for line in f:  # iterate through the remainder of the file
        if ':' in line:  # if the line contains a :
            current = line.strip()  # strip the whitespace
            results[current] = []  # and add it as a dictionary entry
        elif line.strip():  # otherwise, and if content remains after stripping whitespace,
            results[current].append(line.strip())  # append this line to the relevant list
This should at least get you started, you will likely have better luck using dictionaries than lists, at least for the first part of your logic. By what method will you pass the codes along?
a = {}
f1 = open(r"C:\sotest.txt", 'r')
current_key = ''
for row in f1:
    strrow = row.strip('\n')
    if strrow == "":
        pass
    elif ":" in strrow:
        current_key = strrow.strip(':')
        a[current_key] = []
    else:
        a[current_key].append(strrow)

for key in a:
    print key
    for item in a[key]:
        print item

Simplify python code for txt searching

I am a beginner at Python and I need to check for the presence of a given set of strings in a huge txt file. I've written this code so far and it runs with no problems on a light subsample of my database. The problem is that it takes more than 10 hours when searching through the whole database, and I'm looking for a way to speed up the process.
The code so far reads a list of strings from a txt file I've put together (list.txt) and searches for every item in every line of the database (hugedataset.txt). My final output should be a list of items which are present in the database (or, alternatively, a list of items which are NOT present). I bet there is a more efficient way to do things though...
Thank you for your support!
import re

fobj_in = open('hugedataset.txt')
present = []
with open('list.txt', 'r') as f:
    list1 = [line.strip() for line in f]
print list1
for l in fobj_in:
    for title in list1:
        if title in l:
            print title
            present.append(title)
set = set(present)
print set
Since you don't need any per-line information, you can search the whole thing in one go for each string:
data = open('hugedataset.txt').read()  # Assuming it fits in memory
present = []  # As #svk points out, you could make this a set
with open('list.txt', 'r') as f:
    list1 = [line.strip() for line in f]
print list1
for title in list1:
    if title in data:
        print title
        present.append(title)
set = set(present)
print set
You could use a regexp to check for all the substrings in a single pass. Look, for example, at this answer: Check to ensure a string does not contain multiple values
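A sketch of that regex idea (my own illustration, not from the linked answer): build one alternation pattern from all the titles and make a single pass over the file. re.escape assumes the titles should be matched literally.

import re

with open('list.txt', 'r') as f:
    list1 = [line.strip() for line in f]

# One combined pattern; longest titles first so a title that is a prefix of
# another title still gets matched.
pattern = re.compile('|'.join(re.escape(t) for t in sorted(list1, key=len, reverse=True)))

present = set()
with open('hugedataset.txt') as fobj_in:
    for line in fobj_in:
        present.update(pattern.findall(line))

print present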
