Summarizing huge amounts of data

Summarizing huge amounts of data - python

I have a problem that I have not been able to solve. I have 4 .txt files each between 30-70GB. Each file contains n-gram entries as follows:
blabla1/blabla2/blabla3
word1/word2/word3
...
What I'm trying to do is count how many times each item appear, and save this data to a new file, e.g:
blabla1/blabla2/blabla3 : 1
word1/word2/word3 : 3
...
My attempts so far has been simply to save all entries in a dictionary and count them, i.e.
entry_count_dict = defaultdict(int)
with open(file) as f:
for line in f:
entry_count_dict[line] += 1
However, using this method I run into memory errors (I have 8GB RAM available). The data follows a zipfian distribution, e.g. the majority of the items occur only once or twice.
The total number of entries is unclear, but a (very) rough estimate is that there is somewhere around 15,000,000 entries in total.
In addition to this, I've tried h5py where all the entries are saved as a h5py dataset containing the array [1], which is then updated, e.g:
import h5py
import numpy as np
entry_count_dict = h5py.File(filename)
with open(file) as f:
for line in f:
if line in entry_count_dict:
entry_count_file[line][0] += 1
else:
entry_count_file.create_dataset(line,
data=np.array([1]),
compression="lzf")
However, this method is way to slow. The writing speed gets slower and slower. As such, unless the writing speed can be increased this approach is implausible. Also, processing the data in chunks and opening/closing the h5py file for each chunk did not show any significant difference in processing speed.
I've been thinking about saving entries which start with certain letters in separate files, i.e. all the entries which start with a are saved in a.txt, and so on (this should be doable using defaultdic(int)).
However, to do this the file have to iterated once for every letter, which is implausible given the file sizes (max = 69GB).
Perhaps when iterating over the file, one could open a pickle and save the entry in a dict, and then close the pickle. But doing this for each item slows down the process quite a lot due to the time it takes to open, load and close the pickle file.
One way of solving this would be to sort all the entries during one pass, then iterate over the sorted file and count the entries alphabetically. However, even sorting the file is painstakingly slow using the linux command:
sort file.txt > sorted_file.txt
And, I don't really know how to solve this using python given that loading the whole file into memory for sorting would cause memory errors. I have some superficial knowledge of different sorting algorithms, however they all seem to require that the whole object to be sorted needs get loaded into memory.
Any tips on how to approach this would be much appreciated.

There are a number of algorithms for performing this type of operation. They all fall under the general heading of External Sorting.
What you did there with "saving entries which start with certain letters in separate files" is actually called bucket sort, which should, in theory, be faster. Try it with sliced data sets.
or,
try Dask, a DARPA + Anaconda backed distributive computing library, with interfaces familiar to numpy, pandas, and works like Apache-Spark. (works on single machine too)
btw it scales
I suggest trying dask.array,
which cuts the large array into many small ones, and implements numpy ndarray interface with blocked algorithms to utilize all of your cores when computing these larger-than-memory datas.

I've been thinking about saving entries which start with certain letters in separate files, i.e. all the entries which start with a are saved in a.txt, and so on (this should be doable using defaultdic(int)). However, to do this the file have to iterated once for every letter, which is implausible given the file sizes (max = 69GB).
You are almost there with this line of thinking. What you want to do is to split the file based on a prefix - you don't have to iterate once for every letter. This is trivial in awk. Assuming your input files are in a directory called input:
mkdir output
awk '/./ {print $0 > ( "output/" substr($0,0,1))}` input/*
This will append each line to a file named with the first character of that line (note this will be weird if your lines can start with a space; since these are ngrams I assume that's not relevant). You could also do this in Python but managing the opening and closing of files is somewhat more tedious.
Because the files have been split up they should be much smaller now. You could sort them but there's really no need - you can read the files individually and get the counts with code like this:
from collections import Counter
ngrams = Counter()
for line in open(filename):
ngrams[line.strip()] += 1
for key, val in ngrams.items():
print(key, val, sep='\t')
If the files are still too large you can increase the length of the prefix used to bucket the lines until the files are small enough.

Related

Tips for working with large quantity .txt files (and overall large size) - python?

I'm working on a script to parse txt files and store them into a pandas dataframe that I can export to a CSV.
My script works easily when I was using <100 of my files - but now when trying to run it on the full sample, I'm running into a lot of issues.
Im dealing with ~8000 .txt files with an average size of 300 KB, so in total about 2.5 GB in size.
I was wondering if I could get tips on how to make my code more efficient.
for opening and reading files, I use:
filenames = os.listdir('.')
dict = {}
for file in filenames:
with open(file) as f:
contents = f.read()
dict[file.replace(".txt", "")] = contents
Doing print(dict) crashes (at least it seems like it) my python.
Is there a better way to handle this?
Additionally, I also convert all the values in my dict to lowercase, using:
def lower_dict(d):
lcase_dict = dict((k, v.lower()) for k, v in d.items())
return lcase_dict
lower = lower_dict(dict)
I haven't tried this yet (can't get passed the opening/reading stage), but I was wondering if this would cause problems?
Now, before I am marked as duplicate, I did read this: How can I read large text files in Python, line by line, without loading it into memory?
however, that user seemed to be working with 1 very large file which was 5GB, whereas I am working with multiple small files totalling 2.5GB (and actually my ENTIRE sample is something like 50GB and 60,000 files). So I was wondering if my approach would need to be different.
Sorry if this is a dumb question, unfortunately, I am not well versed in the field of RAM and computer processing methods.
Any help is very much appreciated.
thanks

I believe the thing slowing your code down the most is the .replace() method your are using. I believe this is because the built-in replace method is iterative, and as a result is very inefficient. Try using the re module in your for loops. Here is an example of how I used the module recently to replace the keys "T", ":" and "-" with "" which in this case removed them from the file:
for line in lines:
line = re.sub('[T:-]', '', line)
Let me know if this helps!

Loading / Streaming 8GB txt file?? And tokenize

I have a pretty large file (about 8 GB).. now I read this post: How to read a large file line by line and this one Tokenizing large (>70MB) TXT file using Python NLTK. Concatenation & write data to stream errors
But this still doesnt do the job.. when I run my code, my pc gets stuck.
Am I doing something wrong?
I want to get all words into a list (tokenize them). Further, doesnt the code reads each line and tokenizes the line? Doesnt this might prevent the tokenizer from tokenizing words properly since some words (and sentences) do not end after just one line?
I considered splitting it up into smaller files, but doesnt this still consume my RAM if I just have 8GB Ram since the list of words will probably be equally big (8GB) like the initial txt file?
word_list=[]
number = 0
with open(os.path.join(save_path, 'alldata.txt'), 'rb',encoding="utf-8") as t:
for line in t.readlines():
word_list+=nltk.word_tokenize(line)
number = number + 1
print(number)

By using the following line:
for line in t.readlines():
# do the things
You are forcing python to read the whole file with t.readlines(), then return an array of strings that represents the whole file, thus bringing the whole file into memory.
Instead, if you do as the example you linked states:
for line in t:
# do the things
The Python VM will natively process the file line-by-line, like you want.
the file will act like a generator, yielding each line one at a time.
After looking at your code again, I see that you are constantly appending to the word list, with word_list += nltk.word_tokenize(line). This means that even if you do import the file one line at a time, you are still retaining that data in your memory, even after the file has moved on. You will likely need to find a better way of doing whatever this is, as you will still be consuming massive amounts of memory, because the data has not been dropped from memory.
For data this large, you will have to either
find a way to store an intermediate version of your tokenized data, or
design your code in a way that you can handle one, or just a few tokenized words at a time.
Some thing like this might do the trick:
def enumerated_tokens(filepath):
index = 0
with open(filepath, rb, encoding="utf-8") as t:
for line in t:
for word in nltk.word_tokenize(line):
yield (index, word)
index += 1
for index, word in enumerated_tokens(os.path.join(save_path, 'alldata.txt')):
print(index, word)
# Do the thing with your word.
Notice how this never actually stores the word anywhere. This doesn't mean that you can't temporarily store anything, but if you're memory constrained, generators are the way to go. This approach will likely be faster, more stable, and use less memory overall.

How can I efficiently open 30gb of file and process pieces of it without slowing down?

I have a some large files (more than 30gb) with pieces of information which I need to do some calculations on, like averaging. The pieces I mention are the slices of file, and I know the beginning line numbers and count of following lines for each slice.
So I have a dictionary with keys as beginning line numbers and values as count of following rows, and I use this dictionary to loop through the file and get slices over it. for each slice, I create a table, make some conversions and averaging, create a new table an convert it into a dictionary. I use islice for slicing and pandas dataframe to create tables from each slice.
however, in time process is getting slower and slower, even the size of the slices are more or less the same.
First 1k slices - processed in 1h
Second 1k slices - processed in 4h
Third 1k slices - processed in 8h
Second 1k slices - processed in 17h
And I am waiting for days to complete the processes.
Right now I am doing this on a windows 10 machine, 1tb SSD, 32 GB ram. Previously I also tried on a Linux machine (ubuntu 18.4) with 250gb SSD and 8gb ram + 8gb virtual ram. Both resulted more or less the same.
What I noticed in windows is, 17% of CPU and 11% of memory is being used, but disk usage is 100%. I do not fully know what disk usage means and how I can improve it.
As a part of the code I was also importing data into mongodb while working on Linux, and I thought maybe it was because of indexing in mongodb. but when I print the processing time and import time I noticed that almost all time is spent on processing, import takes few seconds.
Also to gain time, I am now doing the processing part on a stronger windows machine and writing the docs as txt files. I expect that writing on disk slows down the process a bit but txt file sizes are not more than 600kb.
Below is the piece of code, how I read the file:
with open(infile) as inp:
for i in range(0,len(seg_ids)):
inp.seek(0)
segment_slice = islice(inp,list(seg_ids.keys())[i], (list(seg_ids.keys())[i]+list(seg_ids.values())[i]+1))
segment = list(segment_slice)
for _, line in enumerate(segment[1:]):
#create dataframe and perform calculations
So I want to learn if there is a way to improve the processing time. I suppose my code reads whole file from beginning for each slice, and going through the end of the file reading time goes longer and longer.
As a note, because of the time constraints, I started with the most important slices I have to process first. So the rest will be more random slices on the files. So solution should be applicable for random slices, if there are any (I hope).
I am not experienced in scripting so please forgive me if I am asking a silly question, but I really could not find any answer.

A couple of things come to mind.
First, if you bring the data into a pandas DataFrame there is a 'chunksize' argument for importing large data. It allows you to process / dump what you need / don't, while proving information such as df.describe which will give you summary stats.
Also, I hear great things about dask. It is a scalable platform via parallel, multi-core, multi-machine processing and is almost as simple as using pandas and numpy with very little management of resources required.

Use pandas or dask and pay attention to the options for read_csv(). mainly: chunck_size, nrows, skiprows, usecols, engine (use C), low_memory, memory_map
[https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html][1]

The problem here is you are rereading a HUGE unindexed file multiple times line by line from the start of the file. No wonder this takes HOURS.
Each islice in your code is starting from the very beginning of the file -- every time -- and reads and discards all the lines from the file before it even reaches the start of the data of interest. That is very slow and inefficient.
The solution is to create a poor man's index to that file and then read smaller chucks for each slice.
Let's create a test file:
from pathlib import Path
p=Path('/tmp/file')
with open(p, 'w') as f:
for i in range(1024*1024*500):
f.write(f'{i+1}\n')
print(f'{p} is {p.stat().st_size/(1024**3):.2f} GB')
That creates a file of roughly 4.78 GB. Not as large as 30 GB but big enough to be SLOW if you are not thoughtful.
Now try reading that entire file line-by-line with the Unix utility wc to count the total lines (wc being the fastest way to count lines, in general):
$ time wc -l /tmp/file
524288000 file
real 0m3.088s
user 0m2.452s
sys 0m0.636s
Compare that with the speed for Python3 to read the file line by line and print the total:
$ time python3 -c 'with open("file","r") as f: print(sum(1 for l in f))'
524288000
real 0m53.880s
user 0m52.849s
sys 0m0.940s
Python is almost 18x slower to read the file line by line than wc.
Now do a further comparison. Check out the speed of the Unix utility tail to print the last n lines of a file:
$ time tail -n 3 file
524287998
524287999
524288000
real 0m0.007s
user 0m0.003s
sys 0m0.004s
The tail utility is 445x faster than wc (and roughly 8,000x faster than Python) to get to the last three lines of the file because it is using a windowed index buffer. ie, tail reads some number of bytes at the end of the file and then gets the last n lines from the buffer it read.
It is possible to use the same tail approach to your application.
Consider this photo:
The approach you are using the equivalent to reading every tape on that rack to find the data that is on two of the middle tapes only -- and doing it over and over and over...
In the 1950's (the era of the photo) each tape was roughly indexed for what it held. Computers would call for a specific tape in the rack -- not ALL the tapes in the rack.
The solution to your issue (in oversight) is to build a tape-like indexing scheme:
Run through the 30 GB file ONCE and create an index of sub-blocks by starting line number for the block. Think of each sub-block as roughly one tape (except it runs easily all the way to the end...)
Instead of using int.seek(0) before every read, you would seek to the block that contains the line number of interest (just like tail does) and then use islice offset adjusted to where that block's starting line number is in relationship to the start of the file.
You have a MASSIVE advantage compared with what they had to do in the 50's and 60's: You only need to calculate the starting block since you have access to the entire remaining file. 1950's tape index would call for tapes x,y,z,... to read data larger than one tape could hold. You only need to find x which contains the starting line number of interest.
BTW, since each IBM tape of this type held roughly 3 MB, your 30 GB file would be more than 10 million of these tapes...
Correctly implemented (and this is not terribly hard to do) it would speed up the read performance by 100x or more.
Constructing a useful index to a text file by line offset might be as easy as something like this:
def index_file(p, delimiter=b'\n', block_size=1024*1024):
index={0:0}
total_lines, cnt=(0,0)
with open(p, 'rb') as f:
while buf:=f.raw.read(block_size):
cnt=buf.count(delimiter)
idx=buf.rfind(delimiter)
key=cnt+total_lines
index[key]=f.tell()-(len(buf)-idx)+len(delimiter)
total_lines+=cnt
return index
# this index is created in about 4.9 seconds on my computer...
# with a 4.8 GB file, there are ~4,800 index entries
That constructs an index correlating starting line number (in that block) with byte offset from the beginning of the file:
>>> idx=index_file(p)
>>> idx
{0: 0, 165668: 1048571, 315465: 2097150, 465261: 3145722,
...
524179347: 5130682368, 524284204: 5131730938, 524288000: 5131768898}
Then if you want access to lines[524179347:524179500] you don't need to read 4.5 GB to get there; you can just do f.seek(5130682368) and start reading instantly.

Is the index_db method from the SeqIO object in Biopython slow?

I've got this:
files = glob.glob(str(dir_path) + "*.fa")
index = SeqIO.index_db(index_filename, files, "fasta")
seq = index[accession] # Slow
index.close()
return seq
and i'm working on big files (gene sequences) but for some reasons, it takes about 4 secondes to get the sequence I'm looking for. I'm wondering if the index_db method is suppose to be that slow? Am I using the right method?
Thanks.

The first time the database is created it can take some time. The next times, if you don't delete the index_filename created, it should go faster.
Lets say you have your 25 files each with some genes. This method creates a SQLite DB that helps locating the sequences among the files, like "Get me the gene XXX" and the SQLite/index_db knows that the gene is in the file 12.fasta and its exact location inside the file. So Biopython opens the file and scans quickly to the gene position.
Without that index_db you have to load every Records into memory, which is fast but some files might not fit in the RAM.
If you want speed to fetch regions you can use FastaFile from pysam and samtools. Like this:
You have to index all the fasta files with faidx:
$ samtools faidx big_fasta.fas
From your code write something like this:
from pysam import FastaFile
rec = FastaFile("big_fasta.fas") # big_fasta.fas.fai must exist.
seq = rec.fetch(reference=gene_name, start=1000, end= 1200)
print(s)
In my computer this times 2 orders of magnitude faster than Biopython for the same operation, but you only get the pure sequence of bases.

Confusing loop problem (python)

this is similar to the question in merge sort in python
I'm restating because I don't think I explained the problem very well over there.
basically I have a series of about 1000 files all containing domain names. altogether the data is > 1gig so I'm trying to avoid loading all the data into ram. each individual file has been sorted using .sort(get_tld) which has sorted the data according to its TLD (not according to its domain name. sorted all the .com's together, .orgs together, etc)
a typical file might look like
something.ca
somethingelse.ca
somethingnew.com
another.net
whatever.org
etc.org
but obviosuly longer.
I now want to merge all the files into one, maintaining the sort so that in the end the one large file will still have all the .coms together, .orgs together, etc.
What I want to do basically is
open all the files
loop:
read 1 line from each open file
put them all in a list and sort with .sort(get_tld)
write each item from the list to a new file
the problem I'm having is that I can't figure out how to loop over the files
I can't use with open() as because I don't have 1 file open to loop over, I have many. Also they're all of variable length so I have to make sure to get all the way through the longest one.
any advice is much appreciated.

Whether you're able to keep 1000 files at once is a separate issue and depends on your OS and its configuration; if not, you'll have to proceed in two steps -- merge groups of N files into temporary ones, then merge the temporary ones into the final-result file (two steps should suffice, as they let you merge a total of N squared files; as long as N is at least 32, merging 1000 files should therefore be possible). In any case, this is a separate issue from the "merge N input files into one output file" task (it's only an issue of whether you call that function once, or repeatedly).
The general idea for the function is to keep a priority queue (module heapq is good at that;-) with small lists containing the "sorting key" (the current TLD, in your case) followed by the last line read from the file, and finally the open file ready for reading the next line (and something distinct in between to ensure that the normal lexicographical order won't accidentally end up trying to compare two open files, which would fail). I think some code is probably the best way to explain the general idea, so next I'll edit this answer to supply the code (however I have no time to test it, so take it as pseudocode intended to communicate the idea;-).
import heapq
def merge(inputfiles, outputfile, key):
"""inputfiles: list of input, sorted files open for reading.
outputfile: output file open for writing.
key: callable supplying the "key" to use for each line.
"""
# prepare the heap: items are lists with [thekey, k, theline, thefile]
# where k is an arbitrary int guaranteed to be different for all items,
# theline is the last line read from thefile and not yet written out,
# (guaranteed to be a non-empty string), thekey is key(theline), and
# thefile is the open file
h = [(k, i.readline(), i) for k, i in enumerate(inputfiles)]
h = [[key(s), k, s, i] for k, s, i in h if s]
heapq.heapify(h)
while h:
# get and output the lowest available item (==available item w/lowest key)
item = heapq.heappop(h)
outputfile.write(item[2])
# replenish the item with the _next_ line from its file (if any)
item[2] = item[3].readline()
if not item[2]: continue # don't reinsert finished files
# compute the key, and re-insert the item appropriately
item[0] = key(item[2])
heapq.heappush(h, item)
Of course, in your case, as the key function you'll want one that extracts the top-level domain given a line that's a domain name (with trailing newline) -- in a previous question you were already pointed to the urlparse module as preferable to string manipulation for this purpose. If you do insist on string manipulation,
def tld(domain):
return domain.rsplit('.', 1)[-1].strip()
or something along these lines is probably a reasonable approach under this constraint.
If you use Python 2.6 or better, heapq.merge is the obvious alternative, but in that case you need to prepare the iterators yourself (including ensuring that "open file objects" never end up being compared by accident...) with a similar "decorate / undecorate" approach from that I use in the more portable code above.

You want to use merge sort, e.g. heapq.merge. I'm not sure if your OS allows you to open 1000 files simultaneously. If not you may have to do it in 2 or more passes.

Why don't you divide the domains by first letter, so you would just split the source files into 26 or more files which could be named something like: domains-a.dat, domains-b.dat. Then you can load these entirely into RAM and sort them and write them out to a common file.
So:
3 input files split into 26+ source files
26+ source files could be loaded individually, sorted in RAM and then written to the combined file.
If 26 files are not enough, I'm sure you could split into even more files... domains-ab.dat. The point is that files are cheap and easy to work with (in Python and many other languages), and you should use them to your advantage.

Your algorithm for merging sorted files is incorrect. What you do is read one line from each file, find the lowest-ranked item among all the lines read, and write it to the output file. Repeat this process (ignoring any files that are at EOF) until the end of all files has been reached.

#! /usr/bin/env python
"""Usage: unconfuse.py file1 file2 ... fileN
Reads a list of domain names from each file, and writes them to standard output grouped by TLD.
"""
import sys, os
spools = {}
for name in sys.argv[1:]:
for line in file(name):
if (line == "\n"): continue
tld = line[line.rindex(".")+1:-1]
spool = spools.get(tld, None)
if (spool == None):
spool = file(tld + ".spool", "w+")
spools[tld] = spool
spool.write(line)
for tld in sorted(spools.iterkeys()):
spool = spools[tld]
spool.seek(0)
for line in spool:
sys.stdout.write(line)
spool.close()
os.remove(spool.name)

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.