How to efficiently iterate over two files in Python?

I have two text files which should have a lot of matching lines, and I want to find out exactly how many lines match between the files. The problem is that both files are quite large (one is about 3 GB and the other is over 16 GB), so reading them into memory using read() or readlines() could be very problematic. Any tips? The code I'm writing is basically just two loops and an if statement to compare the lines.

Since the input files are very large, if you care about performance, you should consider simply using grep -f. The -f option reads patterns from a file, so depending on the exact semantics you're after, it may do what you need. You probably want the -x option too, to take only whole-line matches. So the whole thing in Python might look something like this:
import subprocess

child = subprocess.Popen(['grep', '-xf', file1, file2], stdout=subprocess.PIPE)
for line in child.stdout:
    print line
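Since the question asks for a count of matching lines, a small variation that lets grep do the counting might look like this (a sketch; file1 and file2 are the same placeholders as above, and -c makes grep print only the number of matching lines):

import subprocess

# -c prints the count of matching lines instead of the lines themselves
child = subprocess.Popen(['grep', '-cxf', file1, file2], stdout=subprocess.PIPE)
count = int(child.stdout.read())
child.wait()
print(count)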

Why not use Unix grep? If you need your solution to be platform independent then this won't work, but on Unix it does. Run this command from your Python script:
grep --fixed-strings --file=file_B file_A > result_file
Also, this problem looks like a good candidate for map-reduce.
UPDATE 0: To elucidate: --fixed-strings means "interpret PATTERN as a list of fixed strings, separated by newlines, any of which is to be matched", and --file means "obtain patterns from FILE, one per line".
So what we are doing is matching patterns from file_B against the content of file_A, and --fixed-strings treats them as a sequence of literal patterns, exactly as they appear in the file. Hope this makes it clearer.
Since you want the count of matching lines, a slight modification of the above grep gives the count:
grep --fixed-strings --file=file_B file_A | wc -l
UPDATE 1: You could do this: first go through each file separately, line by line, without reading the entire file into memory. For each line, compute its md5 hash and write it to another file. After doing this for both files, you get two new files filled with md5 hashes. I am hoping these two files are substantially smaller than the originals, since an md5 hash is 16 bytes irrespective of the input string. Now you can probably run grep or other diffing techniques with little or no memory problem.
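A minimal sketch of that hashing pass, assuming one hex digest per line and made-up file names:

import hashlib

def write_line_hashes(in_path, out_path):
    # stream the input line by line and write one md5 hex digest per line
    with open(in_path, 'rb') as src, open(out_path, 'w') as dst:
        for line in src:
            dst.write(hashlib.md5(line).hexdigest() + '\n')

write_line_hashes('file_A', 'file_A.md5')
write_line_hashes('file_B', 'file_B.md5')

The two .md5 files can then be fed to the grep commands above, or compared with any of the other techniques in this thread.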
UPDATE 2: (a few days later) Can you do this? Create two tables, table1 and table2, in MySQL, each with just two fields, id and data. Insert both files into these tables, line by line, then run a query to find the count of duplicates. You have to go through both files; that's a given, and we can't get away from that fact. The optimisation lies in how the duplicates are found, and MySQL is one such option: it takes care of a lot of things for you, like RAM usage, index creation, etc.
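To make that idea self-contained, here is a rough sketch using Python's built-in sqlite3 in place of MySQL (my substitution, purely for illustration); the table and file names are made up:

import sqlite3

conn = sqlite3.connect('lines.db')
conn.execute('CREATE TABLE IF NOT EXISTS table1 (data TEXT)')
conn.execute('CREATE TABLE IF NOT EXISTS table2 (data TEXT)')

def load(path, table):
    # insert the file line by line so it is never held in memory all at once
    with open(path) as f:
        conn.executemany('INSERT INTO %s (data) VALUES (?)' % table,
                         ((line,) for line in f))
    conn.commit()

load('file_A', 'table1')
load('file_B', 'table2')

# INTERSECT yields the distinct lines that appear in both tables
count, = conn.execute('SELECT COUNT(*) FROM '
                      '(SELECT data FROM table1 INTERSECT SELECT data FROM table2)').fetchone()
print(count)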

Well, thanks all for your input! What I ended up doing was painfully simple. I had been trying things like this:
file = open(xxx, "r")
for line in file:
    if .....
What I ended up doing was
for line in open(xxx):
    if .....
which takes the file line by line. It's very time consuming, but I've pretty much accepted that there isn't some magical way to do this that will take very little time :(
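For reference, if the smaller file's lines (or their hashes, as in UPDATE 1 above) fit in memory, a single pass over each file is one possible sketch; the file names are placeholders:

# build a set from the smaller (~3 GB) file, then stream the larger one once
small_lines = set()
with open('small_file.txt', 'rb') as f:
    for line in f:
        small_lines.add(line)

matches = 0
with open('big_file.txt', 'rb') as f:
    for line in f:
        if line in small_lines:
            matches += 1
print(matches)

This needs roughly as much RAM as the smaller file; hashing each line first, as suggested above, shrinks that considerably.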

Related

Summarizing huge amounts of data

I have a problem that I have not been able to solve. I have 4 .txt files each between 30-70GB. Each file contains n-gram entries as follows:
blabla1/blabla2/blabla3
word1/word2/word3
...
What I'm trying to do is count how many times each item appears, and save this data to a new file, e.g.:
blabla1/blabla2/blabla3 : 1
word1/word2/word3 : 3
...
My attempt so far has been simply to save all entries in a dictionary and count them, i.e.:
from collections import defaultdict

entry_count_dict = defaultdict(int)
with open(file) as f:
    for line in f:
        entry_count_dict[line] += 1
However, using this method I run into memory errors (I have 8GB RAM available). The data follows a Zipfian distribution, i.e. the majority of the items occur only once or twice.
The total number of entries is unclear, but a (very) rough estimate is that there are somewhere around 15,000,000 entries in total.
In addition to this, I've tried h5py, where all the entries are saved as an h5py dataset containing the array [1], which is then updated, e.g.:
import h5py
import numpy as np

entry_count_file = h5py.File(filename, 'a')
with open(file) as f:
    for line in f:
        if line in entry_count_file:
            entry_count_file[line][0] += 1
        else:
            entry_count_file.create_dataset(line,
                                            data=np.array([1]),
                                            compression="lzf")
However, this method is way too slow, and the writing speed gets slower and slower. As such, unless the writing speed can be increased, this approach is not feasible. Also, processing the data in chunks and opening/closing the h5py file for each chunk did not show any significant difference in processing speed.
I've been thinking about saving entries which start with certain letters in separate files, i.e. all the entries which start with a are saved in a.txt, and so on (this should be doable using defaultdict(int)). However, to do this the file would have to be iterated once for every letter, which is not feasible given the file sizes (max = 69GB).
Perhaps when iterating over the file, one could open a pickle and save the entry in a dict, and then close the pickle. But doing this for each item slows down the process quite a lot, due to the time it takes to open, load and close the pickle file.
One way of solving this would be to sort all the entries during one pass, then iterate over the sorted file and count the entries alphabetically. However, even sorting the file is painfully slow using the Linux command:
sort file.txt > sorted_file.txt
And I don't really know how to solve this in Python, given that loading the whole file into memory for sorting would cause memory errors. I have some superficial knowledge of different sorting algorithms, but they all seem to require that the whole object to be sorted gets loaded into memory.
Any tips on how to approach this would be much appreciated.
There are a number of algorithms for performing this type of operation. They all fall under the general heading of External Sorting.
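As an illustration of the idea (not the asker's code), a minimal external sort-and-count sketch might look like this, assuming a chunk of about a million lines fits comfortably in memory; the function name and chunk size are made up:

import heapq
import itertools
import os
import tempfile

def external_sort_count(in_path, out_path, chunk_lines=1000000):
    # Phase 1: sort fixed-size chunks in memory and spill each to a temporary file
    chunk_paths = []
    with open(in_path) as src:
        while True:
            chunk = list(itertools.islice(src, chunk_lines))
            if not chunk:
                break
            chunk.sort()
            tmp = tempfile.NamedTemporaryFile('w', delete=False, suffix='.chunk')
            tmp.writelines(chunk)
            tmp.close()
            chunk_paths.append(tmp.name)

    # Phase 2: k-way merge the sorted chunks and count runs of equal lines
    handles = [open(p) for p in chunk_paths]
    try:
        with open(out_path, 'w') as out:
            for line, group in itertools.groupby(heapq.merge(*handles)):
                out.write('%s : %d\n' % (line.rstrip('\n'), sum(1 for _ in group)))
    finally:
        for fh in handles:
            fh.close()
        for p in chunk_paths:
            os.remove(p)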
What you did there with "saving entries which start with certain letters in separate files" is actually called bucket sort, which should, in theory, be faster. Try it with sliced data sets.
Or, try Dask, a DARPA- and Anaconda-backed distributed computing library with interfaces familiar from numpy and pandas, which works much like Apache Spark (and runs on a single machine too). By the way, it scales.
I suggest trying dask.array, which cuts a large array into many small ones and implements the numpy ndarray interface with blocked algorithms, using all of your cores when computing on larger-than-memory data.
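For line-oriented counting specifically, dask.bag may be the closer fit than dask.array (my suggestion, not part of the original answer); a sketch, assuming a newline-delimited text file:

import dask.bag as db

lines = db.read_text('file.txt')             # lazily partitions the file into blocks
counts = lines.map(str.strip).frequencies()  # (item, count) pairs across all partitions
for item, count in counts.compute():
    print(item, count, sep='\t')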
I've been thinking about saving entries which start with certain letters in separate files, i.e. all the entries which start with a are saved in a.txt, and so on (this should be doable using defaultdict(int)). However, to do this the file would have to be iterated once for every letter, which is not feasible given the file sizes (max = 69GB).
You are almost there with this line of thinking. What you want to do is to split the file based on a prefix - you don't have to iterate once for every letter. This is trivial in awk. Assuming your input files are in a directory called input:
mkdir output
awk '/./ {print $0 > ("output/" substr($0,1,1))}' input/*
This will append each line to a file named with the first character of that line (note this will be weird if your lines can start with a space; since these are n-grams I assume that's not relevant). You could also do this in Python, but managing the opening and closing of files is somewhat more tedious, as sketched below.
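One way the same prefix split might look in Python (a sketch; it keeps one open handle per first character, and the input file name is made up):

import os

os.makedirs('output', exist_ok=True)
handles = {}
try:
    with open('input/ngrams.txt') as src:   # hypothetical input file
        for line in src:
            if not line.strip():
                continue                    # skip blank lines, like the /./ in the awk version
            prefix = line[0]
            if prefix not in handles:
                handles[prefix] = open(os.path.join('output', prefix), 'a')
            handles[prefix].write(line)
finally:
    for fh in handles.values():
        fh.close()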
Because the files have been split up they should be much smaller now. You could sort them but there's really no need - you can read the files individually and get the counts with code like this:
from collections import Counter

ngrams = Counter()
for line in open(filename):
    ngrams[line.strip()] += 1
for key, val in ngrams.items():
    print(key, val, sep='\t')
If the files are still too large you can increase the length of the prefix used to bucket the lines until the files are small enough.

String search across multiple documents - grep?

If you are given a list of documents, each containing strings, how do you search the documents and return the list of documents that contain the string you are searching for?
How would I go about implementing a program in Python or C for this problem statement? I've considered grep, but I'm not sure how implementing that inside a native Python/C application would work.
My thought process at the moment is simply to loop through the documents, then loop through all of their strings, etc., but it seems a little inefficient.
Any help appreciated.
The simple solution is just as you stated: loop through the files and search through each one.
Naive approach
for file in files:
    for line in file:
        if pattern in line:
            print(file.name)
If you wanted to be a little better, you could immediately bail out of the file as soon as you found a match.
Slightly better
for file in files:
    for line in file:
        if pattern in line:
            print(file.name)
            break  # found what we were looking for; continue to next file
At this point you could attempt to distribute the problem across multiple threads. However, you will probably be I/O bound and may even see worse performance, because multiple threads would be trying to read different parts of the disk at the same time.
Threaded approach
from concurrent.futures import ThreadPoolExecutor

def search(file):
    # worker thread: return the filename on the first matching line, else None
    for line in file:
        if pattern in line:
            return file.name
    return None

with ThreadPoolExecutor() as pool:
    results = pool.map(search, files)
# wait for all threads to finish, collect and display data
print([name for name in results if name])
But if you are concerned about performance, you should either use grep or copy how it works. It saves time by reading the files as raw binary (rather than breaking them up line by line) and makes use of a string-searching algorithm called the Boyer–Moore algorithm. Refer to this other SO question about how grep runs fast.
Probably What You Want™ approach
grep -l pattern files
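If you do want to drive grep from Python rather than the shell, a thin subprocess wrapper is one possibility (a sketch; the pattern and file list are placeholders):

import subprocess

pattern = 'needle'                 # hypothetical search string
files = ['doc1.txt', 'doc2.txt']   # hypothetical document paths
result = subprocess.run(['grep', '-l', pattern] + files,
                        capture_output=True, text=True)
matching_files = result.stdout.splitlines()
print(matching_files)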

How can I read four specific lines of a file without reading the whole file in Python?

I need to read 4 specific lines of a file in Python. I don't want to read the whole file and then get four lines out of it (for the sake of memory). Does anyone know how to do that?
Thanks!
P.S. I used the following code, but apparently it reads the whole file and then takes 4 lines out of it.
a = open("file", "r")
b = a.readlines()[c:d]
You have to read at least up to the lines you are interested in. You can use itertools.islice to grab a slice:
import itertools
interesting_lines = list(itertools.islice(a, c, d))
but it still reads up to those lines.
Files, at least on Macs and Windows and Linux and other UNIXy systems, are just streams of bytes; there's no concept of "line" in the file structure, just bytes that happen to represent newline characters. So the only way to find the Nth line in the file is to start at the beginning and read until you've found (N-1) newlines. You don't have to store all the content you scan through, but you do have to read it.
Then you have to read and store from that point until you find 4 more newlines.
You can do this in Python, but it's not clear to me that it's a win compared to the straightforward approach that reads more than it needs to; it feels like premature optimization to me.
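A sketch of that scan-and-discard approach, assuming the four wanted lines start at zero-based index c:

wanted = []
with open("file", "r") as f:
    for i, line in enumerate(f):
        if i < c:
            continue              # read past earlier lines without storing them
        wanted.append(line)
        if len(wanted) == 4:      # stop as soon as the four lines are collected
            break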

What is the best way to do a find and replace of multiple queries on multiple files?

I have a file that has over 200 lines in this format:
name old_id new_id
The name is useless for what I'm trying to do currently, but I still want it there because it may become useful for debugging later.
Now I need to go through every file in a folder and find all the instances of old_id and replace them with new_id. The files I'm scanning are code files that could be thousands of lines long. I need to scan every file with each of the 200+ ids that I have, because some may be used in more than one file, and multiple times per file.
What is the best way to go about doing this? So far I've been creating Python scripts to figure out the list of old ids and new ids and which ones match up with each other, but I've been doing it very inefficiently: I basically scanned the first file line by line to get the current id, then scanned the second file line by line until I found a match, and repeated this for each line in the first file, which meant reading the second file many times over. I didn't mind doing this inefficiently because they were small files.
Now that I'm searching somewhere around 30-50 files that can each have thousands of lines of code, I want it to be a little more efficient. This is just a hobbyist project, so it doesn't need to be super good; I just don't want it to take more than 5 minutes to find and replace everything, only to look at the result, see that I made a little mistake, and have to do it all over again. Taking a few minutes is fine (although I'm sure computers nowadays can do this almost instantly), I just don't want it to be ridiculous.
So what's the best way to go about doing this? So far I've been using Python, but it doesn't need to be a Python script. I don't care about elegance in the code or the way I do it; I just want an easy way to replace all of my old ids with my new ids using whatever tool is easiest to use or implement.
Examples:
Here is a line from the list of ids. The first part is the name and can be ignored, the second part is the old id, and the third part is the new id that needs to replace the old id.
unlock_music_play_grid_thumb_01 0x108043c 0x10804f0
Here is an example line in one of the files to be replaced:
const v1, 0x108043c
I need to be able to replace that id with the new id so it looks like this:
const v1, 0x10804f0
Use something like multiwordReplace (I've edited it for your situation) with mmap.
import os
import os.path
import re
from contextlib import closing
from mmap import mmap, ACCESS_READ

id_filename = 'path/to/id/file'
directory_name = 'directory/to/replace/in'

# read the ids into a dictionary mapping old to new
with open(id_filename) as id_file:
    ids = dict(line.split()[1:] for line in id_file)

# compile a regex that matches any of the old ids
id_regex = re.compile('|'.join(map(re.escape, ids)))

def translate(match):
    return ids[match.group(0)]

def multiwordReplace(text):
    return id_regex.sub(translate, text)

for code_filename in os.listdir(directory_name):
    path = os.path.join(directory_name, code_filename)
    # memory-map the file so the whole text is available to the regex at once
    with open(path, 'rb') as code_file:
        with closing(mmap(code_file.fileno(), 0, access=ACCESS_READ)) as code_map:
            new_file = multiwordReplace(code_map[:].decode())
    with open(path, 'w') as code_file:
        code_file.write(new_file)

String replacement on a whole text file in Python 3.x?

How can I replace a string with another string within a given text file? Do I just loop through readline() and run the replacement while saving out to a new file? Or is there a better way?
I'm thinking that I could read the whole thing into memory, but I'm looking for a more elegant solution...
Thanks in advance
fileinput is the module from the Python standard library that supports "what looks like in-place updating of text files" as well as various other related tasks.
import fileinput

for line in fileinput.input(['thefile.txt'], inplace=True):
    print(line.replace('old stuff', 'shiny new stuff'), end='')
This code is all you need for the specific task you mentioned -- it deals with all of the issues (writing to a different file, removing the old one when done and replacing it with the new one). You can also add a further parameter such as backup='.bk' to automatically preserve the old file as (in this case) thefile.txt.bk, as well as process multiple files, take the filenames to process from the commandline, etc, etc -- do read the docs, they're quite good (and so is the module I'm suggesting!-).
If the file can be read into memory at once, I'd say that
old = myfile.read()
new = old.replace("find this", "replace by this")
output.write(new)
is at least as readable as
for line in myfile:
    output.write(line.replace("find this", "replace by this"))
and it might be a little faster, but in the end it probably doesn't really matter.
