How to jump to the same line in two huge text files? - python

I am trying to use Python to do some manipulations on huge text files, and by huge I mean over 100GB. Specifically, I'd like to take samples from the lines of the files. For example, let's say I have a file with ~300 million lines; I want to take just a million of them, write them to a new file and analyze them later to get some statistics. The problem is, I can't start from the first line, since the first fraction of the file does not represent the rest of it well enough. Therefore, I have to get about 20% into the file and then start extracting lines. If I do it the naive way, it takes a very long time (20-30 minutes on my machine) just to get to the 20% mark. For example, let's assume again that my file has 300 million lines, and I want to start sampling from the 60,000,000th line (20%). I could do something like:
start_in_line = 60000000
sample_size = 1000000
with open(huge_file, 'r') as f, open(out_file, 'w') as fo:
    for x in range(start_in_line):
        f.readline()
    for y in range(sample_size):
        # readline() keeps the trailing newline, so suppress print's own
        print(f.readline(), end='', file=fo)
But as I said, this is very slow. I tried using some less naive ways, for example the itertools functions, but the improvement in running time was rather slight.
Therefore, I went for another approach - random seeks into the file. What I do is get the size of the file in bytes, calculate 20% of it and then seek to that byte. For example:
import os
huge_file_size = os.stat(huge_file).st_size
offset_percent = 20
sample_size = 1000000
start_point_byte = int(huge_file_size * offset_percent / 100)
with open(huge_file) as f, open(out_file, 'w') as fo:
    f.seek(start_point_byte)
    f.readline()  # skip the (possibly partial) line so we start at the next line boundary
    for y in range(sample_size):
        print(f.readline(), end='', file=fo)
This approach works very nicely, BUT!
I always work with pairs of files. Let's call them R1 and R2. R1 and R2 will always have the same number of lines, and I run my sampling script on each one of them. It is crucial for my downstream analyses that the samples taken from R1 and R2 are coordinated with respect to the lines sampled. For example, if I ended up starting sampling from line 60,111,123 of R1, I must start sampling from the very same line in R2. Even if I miss by one line, my analyses are doomed. If R1 and R2 are of exactly the same size (which is sometimes the case), then I have no problem, because my f.seek() will get me to the same place in both files. However, if the line lengths differ between the files, i.e. the total sizes of R1 and R2 are different, then I have a problem.
So, do you have any idea for a workaround, without having to resort to the naive iteration solution? Maybe there is a way to tell which line I am at, after performing the seek? (couldn't find one...) I am really out of ideas at this point, so any help/hint would be appreciated.
Thanks!

If the lines in each file can have different lengths, there is really no other way than to scan them first (unless there is some form of unique identifier on each line which is the same in both files).
Even if both files have the same length, there could still be lines with different lengths inside.
Now, if you're doing those statistics more than once on different parts of the same files, you could do the following:
do a one-time scan of both files and store the file position of each line in a third file (preferably in binary form, e.g. 2 x 64-bit values per line, or at least a fixed width, so you can jump directly to the position pair of the line you want by computing its offset in the index); a sketch of this follows below.
then just use those file positions to access the lines in both files (you could even calculate the size of the block you need from the difference between consecutive file positions in your third file).
When scanning both files at the same time, make sure you use some buffering to avoid a lot of harddisk seeks.
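A rough Python sketch of that one-time scan, assuming R1 and R2 have the same number of lines (the file names and the index path are placeholders):

import struct

def build_pair_index(r1_path, r2_path, index_path):
    # One fixed-width record per line: two little-endian uint64 values,
    # the byte offset of line i in R1 and in R2.
    with open(r1_path, 'rb') as r1, open(r2_path, 'rb') as r2, \
         open(index_path, 'wb') as idx:
        while True:
            pos1, pos2 = r1.tell(), r2.tell()
            line1, line2 = r1.readline(), r2.readline()
            if not line1 or not line2:
                break
            idx.write(struct.pack('<QQ', pos1, pos2))

Because every index record is exactly 16 bytes, the offsets of line n sit at byte n * 16, so a single seek into the index recovers the matching pair of positions in R1 and R2.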
edit:
I don't know Python (I'm a C++ programmer), but I did a quick search and it seems Python also supports memory mapped files (mmap).
Using mmap you could speed things up dramatically (no need to do a readline each time just to know the positions of the lines): just map a view onto part(s) of your file and scan through that mapped memory for the newline character (\n, 0x0a in hexadecimal). This should take no longer than the time it takes to read the file.
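A rough sketch of that scan using Python's mmap, assuming a 64-bit Python build so the whole file can be mapped; the generator yields the byte offset at which each line starts and could feed the index file described above:

import mmap

def iter_line_offsets(path):
    with open(path, 'rb') as f, \
         mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        pos = 0
        size = len(mm)
        while pos < size:
            yield pos                    # this line starts here
            nl = mm.find(b'\n', pos)     # scan for 0x0a
            if nl == -1:
                break                    # last line has no trailing newline
            pos = nl + 1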

Unix files are just streams of characters, so there is no way to seek to a given line, or find the line number corresponding to a given character, or anything else of that form.
You can use standard utilities to find the character position of a line. For example,
head -n 60000000 /path/to/file | wc -c
will print the number of characters in the first 60,000,000 lines of /path/to/file.
While that may well be faster than using python, it is not going to be fast; it is limited by the speed of reading from disk. If you need to read 20GB, it's going to take minutes. But it would be worth trying at least once to calibrate your python programs.
If your files don't change, you could create indexes mapping line numbers to character position. Once you build the index, it would be extremely fast to seek to the desired line number. If it takes half an hour to read 20% of the file, it would take about five hours to construct two indexes, but if you only needed to do it once, you could leave it running overnight.
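A sketch of what using such an index might look like, assuming it stores one little-endian 64-bit byte offset per line, with one index file per data file:

import struct

RECORD = struct.calcsize('<Q')   # 8 bytes per stored offset

def read_line(data_path, index_path, line_number):
    with open(index_path, 'rb') as idx:
        idx.seek(line_number * RECORD)
        (offset,) = struct.unpack('<Q', idx.read(RECORD))
    with open(data_path, 'rb') as f:
        f.seek(offset)
        return f.readline()

Once both indexes exist, sampling the same line range from R1 and R2 is just two seeks per line.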

OK, so thanks for the interesting answers, but this is what I actually ended up doing:
First, I estimate the number of lines in the file without actually counting them. Since my files are ASCII, I know that each character takes 1 byte, so I get the number of characters in, say, the first 100 lines, then get the size of the file and use these numbers to get a (quite rough) estimate of the number of lines. I should say here that although my lines might be of different lengths, they are within a limited range, so this estimate is reasonable.
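A rough sketch of that estimation step (ASCII assumed, so bytes equal characters; probing the first 100 lines is an arbitrary choice):

import os
from itertools import islice

def estimate_line_count(path, probe_lines=100):
    with open(path, 'rb') as f:
        probe = list(islice(f, probe_lines))
    avg_line_len = sum(len(line) for line in probe) / len(probe)
    return int(os.stat(path).st_size / avg_line_len)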
Once I have that, I use a system call to the Linux sed command to extract a range of lines. So let's say that my file really has 300 million lines, and I estimated it to have 250 million lines (I get much better estimates, but it doesn't really matter in my case). I use an offset of 20%, so I'd like to start sampling from line 50,000,000 and take 1,000,000 lines. I do:
os.system("sed -n '50000000,51000000p;51000000q' in_file > out_file")
Note the 51000000q - without it, sed would keep scanning all the way to the end of the file.
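For reference, the same sed invocation could also be run through subprocess instead of os.system, which avoids building the shell string by hand (a sketch with placeholder file names):

import subprocess

start, end = 50_000_000, 51_000_000
with open("out_file", "w") as fo:
    subprocess.run(
        ["sed", "-n", f"{start},{end}p;{end}q", "in_file"],
        stdout=fo, check=True,
    )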
This solution is not as fast as using random seeks, but it's good enough for me. It also includes some inaccuracy, but it doesn't bother me in this specific case.
I'd be glad to hear your opinion on this solution.

Related

Finding least commonly occurring fuzzy string

I've got a 10K+ line log file that I'm using to debug an issue. I'm looking for 'aberrant' log lines that occur infrequently relative to the other lines in the file to hopefully extract interesting happenings that may be occurring. These log lines are highly variable and diverse.
My initial approach was to do a fuzzy comparison of each line against the remaining lines in the file, get an average of those ratios and assign it to each line, sort those ratios and return the smallest N items in that set.
However, this takes a very, very long time on my machine when using Python (I'm using fuzzywuzzy).
Any alternative suggestions?
Instead of that comparison, make one pass of the file to categorize the lines by their distinctive features. Store a reference to each line in a dict, keyed by category.
Then make a pass over the dict, eliminating any keys with too many references (i.e. boring categories). The remaining categories are the interesting ones.
This is an O(N) process, rather than the O(N^2) process with which you started.
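A rough sketch of that single pass; the signature() function here is a hypothetical stand-in for "distinctive features" (it just masks digits and hex ids), and the threshold is arbitrary:

import re
from collections import defaultdict

def signature(line):
    # mask the variable parts so similar lines share a category key
    return re.sub(r'0x[0-9a-fA-F]+|\d+', '#', line.strip())

def rare_lines(path, max_per_category=3):
    buckets = defaultdict(list)
    with open(path) as f:
        for line in f:
            buckets[signature(line)].append(line)
    # sparsely populated categories are the interesting ones
    return [line
            for lines in buckets.values()
            if len(lines) <= max_per_category
            for line in lines]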

Analyzing huge dataset from 80GB file with only 32GB of RAM

I have a very huge text file (around 80GB). The file contains only numbers (integers + floats) and has 20 columns. Now I have to analyze each column. By analyze I mean I have to do some basic calculations on each column, like finding the mean, plotting histograms, checking whether a condition is satisfied or not, etc. I am reading the file like the following:
with open(filename) as original_file:
    all_rows = [[float(digit) for digit in line.split()] for line in original_file]
all_rows = np.asarray(all_rows)
After this I do all the analysis on specific columns. I use a 'good' configuration server/workstation (with 32GB RAM) to execute my program. The problem is that I am not able to finish my job. I waited almost a day for the program to finish, but it was still running after 1 day. I had to kill it manually later on. I know my script is correct, without any error, because I have tried the same script on smaller files (around 1GB) and it worked nicely.
My initial guess is it will have some memory problem. Is there any way I can run such job? Some different method or some other way ?
I tried splitting the file into smaller files and then analyzing them individually in a loop, as follows:
pre_name = "split_file"
for k in range(10):  # there are 10 files of almost 8GB each
    filename = pre_name + str(k).zfill(3)  # my files are of the form "split_file000, split_file001 ..."
    with open(filename) as original_file:
        all_rows = [[float(digit) for digit in line.split()] for line in original_file]
    all_rows = np.asarray(all_rows)
    # Some analysis here
    plt.hist(all_rows[:, 8], 100)  # plotting a histogram for the 9th column
    all_rows = None
I have tested the above code on a bunch of smaller files and it works fine. However, I hit the same problem again when I used it on the big files. Any suggestions? Is there any other, cleaner way to do it?
For such lengthy operations (when the data don't fit in memory), it might be useful to use libraries like dask (http://dask.pydata.org/en/latest/), particularly dask.dataframe.read_csv to read the data; you can then perform your operations as you would with the pandas library (another useful package to mention).
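A rough sketch of what that might look like, assuming whitespace-separated, headerless columns; the glob pattern is a placeholder, and the regex separator falls back to pandas' slower Python parser:

import dask.dataframe as dd

df = dd.read_csv("split_file*", sep=r"\s+", header=None)

means = df.mean().compute()                     # one mean per column, computed out of core
frac_positive = (df[8] > 0).mean().compute()    # e.g. a condition check on the 9th column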
Two alternatives come to my mind:
You should consider performing your computation with online algorithms:
In computer science, an online algorithm is one that can process its input piece-by-piece in a serial fashion, i.e., in the order that the input is fed to the algorithm, without having the entire input available from the start.
It is possible to compute the mean and variance, and a histogram with pre-specified bins, this way with constant memory complexity; see the sketch after this answer's links.
You should throw your data into a proper database and make use of that database system's statistical and data handling capabilities, e.g., aggregate functions and indexes.
Random links:
http://www.postgresql.org/
https://www.mongodb.org/
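A rough sketch of the first alternative: Welford's online algorithm gives per-column mean and variance in a single pass with constant memory (20 columns assumed, as in the question):

def column_stats(path, ncols=20):
    n = 0
    mean = [0.0] * ncols
    m2 = [0.0] * ncols                 # running sum of squared deviations
    with open(path) as f:
        for line in f:
            n += 1
            for i, x in enumerate(map(float, line.split())):
                delta = x - mean[i]
                mean[i] += delta / n
                m2[i] += delta * (x - mean[i])
    variance = [m / (n - 1) for m in m2] if n > 1 else None
    return mean, variance

A histogram with pre-specified bins can be accumulated the same way: one pass over the file, incrementing per-bin counters.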

Removing duplicates from BIG text file

I have a rather big text file, around 30GB on average. I want to remove duplicate lines from this file. What is a good, efficient algorithm to do this? For small files, I usually use dictionaries, e.g. Python dictionaries, to store unique keys. But this time the file is rather big. Any language suggestion is fine (I am thinking of using C, or is it not the language but rather the algorithm that matters more?). Thanks.
If you can't just fire up an instance on amazon with enough memory to hold everything in RAM, this is the strategy I would use:
Step 1 - go through and generate a checksum/hashvalue for each line. I'd probably use SIPHASH. Output these to a file.
Step 2 - sort the file of siphash values, and throw away any that only have one entry. Output the result as a set of hashvalues & number of matches.
Step 3 - read through the file again, regenerating the hash value for each line. If it's a line whose hash has a match, hold onto it in memory. If there's another line already in memory with the same hash value, compare them to see if the lines themselves match. Output "match" if true. If you've already seen all N lines that share the same hash value and they didn't match, go ahead and dispose of the record.
This strategy depends on the number of duplicates being only a small fraction of the number of total lines. If that's not the case, then I would use some other strategy, like a divide and conquer.
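A rough sketch of step 1 in Python; blake2b with an 8-byte digest stands in for SipHash here, since hashlib does not expose SipHash directly:

import hashlib

def write_line_hashes(in_path, hash_path):
    # one fixed-width 8-byte hash per input line, in file order
    with open(in_path, 'rb') as f, open(hash_path, 'wb') as out:
        for line in f:
            out.write(hashlib.blake2b(line, digest_size=8).digest())

Step 2 can then be an external sort of these fixed-size records, and step 3 rereads the original file, keeping in memory only the lines whose hash appeared more than once.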

Read flat file as transpose, python

I'm interested in reading fixed width text files in Python in as efficient a manner as I can. Specifically, most of the time I'm interested in one or more columns in the flat file but not entire records.
It strikes me as inefficient to read the file a line at a time and extract the desired columns after reading the entire line into memory. I think I'd rather have the option of reading only the desired columns, top to bottom, left to right (instead of reading left to right, top to bottom).
Is such a thing desirable, and if so, is it possible?
Files are laid out as a (one-dimensional) sequence of bits. 'Lines' are just a convenience we added to make things easy to read for humans. So, in general, what you're asking is not possible on plain files. To pull this off, you would need some way of finding where a record starts. The two most common ways are:
Search for newline symbols (in other words, read the entire file).
Use a specially spaced layout, so that each record is laid out using a fixed width. That way, you can use low-level file operations, like seek, to go directly to where you need to go. This avoids reading the entire file, but is painful to do manually.
I wouldn't worry too much about file reading performance unless it becomes a problem. Yes, you could memory map the file, but your OS probably already caches for you. Yes, you could use a database format (e.g., the sqlite3 file format through sqlalchemy), but it probably isn't worth the hassle.
Side note on "fixed width:" What precisely do you mean by this? If you really mean 'every column always starts at the same offset relative to the start of the record' then you can definitely use Python's seek to skip past data that you are not interested in.
How big are the lines? Unless each record is huge, it's probably likely to make little difference only reading in the fields you're interested in rather than the whole line.
For big files with fixed formatting, you might get something out of mmapping the file. I've only done this with C rather than Python, but it seems like mmapping the file then accessing the appropriate fields directly is likely to be reasonably efficient.
Flat files are not good with what you're trying to do. My suggestion is to convert the files to SQL database (using sqlite3) and then reading just the columns you want. SQLite3 is blazing fast.
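A rough sketch of that conversion, with hypothetical column slices, table layout and file names; once loaded, a single column is one SELECT away:

import sqlite3

con = sqlite3.connect("records.db")
con.execute("CREATE TABLE IF NOT EXISTS rec (col_a TEXT, col_b TEXT)")
with open("fixed_width.txt") as f:
    con.executemany(
        "INSERT INTO rec VALUES (?, ?)",
        ((line[0:10].strip(), line[10:20].strip()) for line in f),
    )
con.commit()
col_a = [row[0] for row in con.execute("SELECT col_a FROM rec")]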
If it's truly fixed width, then you should be able to just call read(N) to skip past the fixed number of bytes from the end of your column on one line to the start of it on the next.
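A rough sketch of that skipping with seek(), assuming every record is exactly record_len bytes (newline included) and the column of interest occupies col_len bytes starting at col_start within each record:

def read_column(path, record_len, col_start, col_len):
    values = []
    with open(path, 'rb') as f:
        f.seek(col_start)
        while True:
            field = f.read(col_len)
            if len(field) < col_len:
                break                           # ran off the end of the file
            values.append(field.decode('ascii').strip())
            f.seek(record_len - col_len, 1)     # jump to the same column in the next record
    return values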

Which is faster?

Is opening a large file once and reading it completely into a list faster, or is opening smaller files whose total size equals the large file and loading them into lists, manipulating them one by one, faster?
Which is faster? Is the difference in time large enough to impact my program?
A total time difference of less than 30 seconds is negligible for me.
It depends if your data fit in your available memory. If you need to resort to paging, or virtual memory, then opening a single giant file might become slower than opening more smaller files. This will be even more true if the computation you need to make creates intermediate variables that won't fit in the physical RAM either.
So, as long as the file is not that big, one open will be faster, but if this is not true, then many opens may be faster.
Finally, note that if you can do many opens, you might be able to do them in parallel and process various parts in different processes, which might make things faster again.
Obviously one open and close is going to be faster than n opens and closes if you are reading the same amount of data. Plus, when reading a single file the I/O classes you use can take advantage of things like buffering, etc, which makes it even faster.
If you are reading the file sequentially from start until end, one open/close is faster than multiple open/close operations.
However keep in mind that if you need to do a lot of seeking in your 1 big file, then maybe storing separate files won't be slower in that case.
Also keep in mind that no matter which approach you are using, you shouldn't read the entire file in at once. Do it in chunks.
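A rough sketch of chunked reading; the 1 MiB chunk size and the handle() callback are placeholders:

def process_in_chunks(path, handle, chunk_size=1024 * 1024):
    with open(path, 'rb') as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            handle(chunk)      # do whatever per-chunk work is needed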
Working with a single file is almost certainly going to be faster: you have to read the same amount of data in both cases, but when working with multiple files, you have that much more housekeeping operations slowing you down.
Additionally, you can read data from a single file at the maximum speed the disk can handle, using the disk buffer to the maximum etc., whereas with multiple files, the disk head does a lot more dancing jumping from file to file.
A 30-second time difference? Define large. Anything that fits into an average computer's RAM would probably not take much more than 30 seconds in total.
Why do you think you need to read the file(s) into a list?
If you can open several small files and process each independently, then surely that means:
(a) that you don't need to read into a list, you can process any file (including 1 large file) a line at a time (avoiding running-out-of-real-memory problems)
or
(b) what you need to do is more complicated than you have told us.
