I have a pretty large file (about 8 GB). I have read this post: How to read a large file line by line, and this one: Tokenizing large (>70MB) TXT file using Python NLTK. Concatenation & write data to stream errors.
But this still doesn't do the job: when I run my code, my PC freezes.
Am I doing something wrong?
I want to get all the words into a list (i.e. tokenize them). Also, doesn't the code read each line and tokenize it separately? Might this prevent the tokenizer from tokenizing words properly, since some words (and sentences) do not end at a line break?
I considered splitting it into smaller files, but wouldn't that still consume my RAM, given that I only have 8 GB of RAM and the resulting list of words will probably be about as big (8 GB) as the initial txt file?
word_list = []
number = 0
with open(os.path.join(save_path, 'alldata.txt'), 'r', encoding="utf-8") as t:
    for line in t.readlines():
        word_list += nltk.word_tokenize(line)
        number = number + 1
        print(number)
By using the following line:
for line in t.readlines():
    # do the things
You are forcing Python to read the whole file with t.readlines() and build a list of strings representing the entire file, which brings the whole file into memory at once.
Instead, if you do what the example you linked suggests:
for line in t:
    # do the things
Python will process the file line by line, as you want: the file object acts like a generator, yielding one line at a time.
Looking at your code again, I see that you are also constantly appending to the word list with word_list += nltk.word_tokenize(line). This means that even if you read the file one line at a time, you still keep all of that data in memory after the file has moved on. You will need to find a better way of doing whatever this is, because you are still consuming massive amounts of memory: the data is never dropped.
For data this large, you will have to either
find a way to store an intermediate version of your tokenized data, or
design your code in a way that you can handle one, or just a few tokenized words at a time.
Something like this might do the trick:
import os
import nltk

def enumerated_tokens(filepath):
    index = 0
    with open(filepath, 'r', encoding="utf-8") as t:
        for line in t:
            for word in nltk.word_tokenize(line):
                yield (index, word)
                index += 1

for index, word in enumerated_tokens(os.path.join(save_path, 'alldata.txt')):
    print(index, word)
    # Do the thing with your word.
Notice how this never actually stores the word anywhere. This doesn't mean that you can't temporarily store anything, but if you're memory constrained, generators are the way to go. This approach will likely be faster, more stable, and use less memory overall.
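If you do need the other option (storing an intermediate version of the tokenized data), a minimal sketch is to stream the tokens straight to a second file, one per line, so nothing accumulates in memory; the output name tokens.txt is just an assumption:

import nltk

def tokenize_to_file(in_path, out_path):
    # Stream tokens to disk one per line; only one input line is in memory at a time.
    with open(in_path, 'r', encoding="utf-8") as src, \
         open(out_path, 'w', encoding="utf-8") as dst:
        for line in src:
            for word in nltk.word_tokenize(line):
                dst.write(word + '\n')

tokenize_to_file('alldata.txt', 'tokens.txt')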
Related
I have a problem that I have not been able to solve. I have four .txt files, each between 30 and 70 GB. Each file contains n-gram entries as follows:
blabla1/blabla2/blabla3
word1/word2/word3
...
What I'm trying to do is count how many times each item appears, and save this data to a new file, e.g.:
blabla1/blabla2/blabla3 : 1
word1/word2/word3 : 3
...
My attempt so far has been simply to save all entries in a dictionary and count them, i.e.:
from collections import defaultdict

entry_count_dict = defaultdict(int)
with open(file) as f:
    for line in f:
        entry_count_dict[line] += 1
However, using this method I run into memory errors (I have 8 GB of RAM available). The data follows a Zipfian distribution, e.g. the majority of the items occur only once or twice.
The total number of entries is unclear, but a (very) rough estimate is that there are somewhere around 15,000,000 entries in total.
In addition to this, I've tried h5py, where all the entries are saved as an h5py dataset containing the array [1], which is then updated, e.g.:
import h5py
import numpy as np

entry_count_file = h5py.File(filename)
with open(file) as f:
    for line in f:
        if line in entry_count_file:
            entry_count_file[line][0] += 1
        else:
            entry_count_file.create_dataset(line,
                                            data=np.array([1]),
                                            compression="lzf")
However, this method is way too slow. The writing speed gets slower and slower, so unless it can be increased this approach is not feasible. Also, processing the data in chunks and opening/closing the h5py file for each chunk did not make any significant difference in processing speed.
I've been thinking about saving entries which start with certain letters in separate files, i.e. all the entries which start with a are saved in a.txt, and so on (this should be doable using defaultdict(int)).
However, to do this the file would have to be iterated over once for every letter, which is not feasible given the file sizes (max = 69 GB).
Perhaps when iterating over the file, one could open a pickle and save the entry in a dict, and then close the pickle. But doing this for each item slows down the process quite a lot due to the time it takes to open, load and close the pickle file.
One way of solving this would be to sort all the entries in one pass, then iterate over the sorted file and count the entries alphabetically. However, even sorting the file is painfully slow using the Linux command:
sort file.txt > sorted_file.txt
And I don't really know how to solve this in Python, given that loading the whole file into memory for sorting would cause memory errors. I have some superficial knowledge of different sorting algorithms, but they all seem to require that the whole object being sorted be loaded into memory.
Any tips on how to approach this would be much appreciated.
There are a number of algorithms for performing this type of operation. They all fall under the general heading of External Sorting.
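As a hedged sketch of what an external merge sort can look like in pure Python (the chunk size and output name are assumptions): sort the input in memory-sized chunks, write each sorted run to a temporary file, merge the runs lazily with heapq.merge, and count consecutive duplicates with itertools.groupby:

import heapq
import itertools
import tempfile

def sorted_runs(path, chunk_size=1_000_000):
    # Sort the input in fixed-size chunks and write each sorted chunk (a "run") to a temp file.
    runs = []
    with open(path) as f:
        while True:
            chunk = list(itertools.islice(f, chunk_size))
            if not chunk:
                break
            chunk.sort()
            run = tempfile.TemporaryFile(mode='w+')
            run.writelines(chunk)
            run.seek(0)
            runs.append(run)
    return runs

with open('counts.txt', 'w') as out:
    merged = heapq.merge(*sorted_runs('file.txt'))        # lazily merge the sorted runs
    for entry, group in itertools.groupby(merged):        # identical lines are now adjacent
        out.write('{} : {}\n'.format(entry.strip(), sum(1 for _ in group)))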
What you did there with "saving entries which start with certain letters in separate files" is actually called bucket sort, which should, in theory, be faster. Try it with sliced data sets.
Or try Dask, a DARPA- and Anaconda-backed distributed computing library with interfaces familiar from numpy and pandas that works much like Apache Spark (it runs on a single machine too, and it scales).
I suggest trying dask.array, which cuts a large array into many small ones and implements the numpy ndarray interface with blocked algorithms, so it can use all of your cores when computing on larger-than-memory data.
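For line-oriented counting like this, dask.bag is arguably a better fit than dask.array; a minimal sketch, where the paths and block size are assumptions:

import dask.bag as db

lines = db.read_text('input/*.txt', blocksize='64MB')       # lazy, partitioned view of the files
counts = lines.map(str.strip).frequencies()                 # (ngram, count) pairs, merged across blocks
counts.map(lambda kv: '{} : {}'.format(kv[0], kv[1])).to_textfiles('counts-*.txt')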
I've been thinking about saving entries which start with certain letters in separate files, i.e. all the entries which start with a are saved in a.txt, and so on (this should be doable using defaultdict(int)). However, to do this the file would have to be iterated over once for every letter, which is not feasible given the file sizes (max = 69 GB).
You are almost there with this line of thinking. What you want to do is to split the file based on a prefix - you don't have to iterate once for every letter. This is trivial in awk. Assuming your input files are in a directory called input:
mkdir output
awk '/./ {print $0 > ("output/" substr($0, 1, 1))}' input/*
This will append each line to a file named with the first character of that line (note this will be weird if your lines can start with a space; since these are ngrams I assume that's not relevant). You could also do this in Python but managing the opening and closing of files is somewhat more tedious.
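If you would rather stay in Python, here is a hedged sketch of the same bucketing (directory and file names are assumptions), keeping a small cache of open output files keyed by the first character:

import os

os.makedirs('output', exist_ok=True)
buckets = {}                                    # first character -> open file handle
with open('file.txt') as f:
    for line in f:
        if not line.strip():                    # skip blank lines, like the /./ pattern in awk
            continue
        prefix = line[0]
        if prefix not in buckets:
            buckets[prefix] = open(os.path.join('output', prefix), 'a')
        buckets[prefix].write(line)
for handle in buckets.values():
    handle.close()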
Because the files have been split up they should be much smaller now. You could sort them but there's really no need - you can read the files individually and get the counts with code like this:
from collections import Counter

ngrams = Counter()
for line in open(filename):
    ngrams[line.strip()] += 1

for key, val in ngrams.items():
    print(key, val, sep='\t')
If the files are still too large you can increase the length of the prefix used to bucket the lines until the files are small enough.
I want to do a textual analysis of multiple text files (>50,000 files), some of which are HTML. My program (below) iterates over these files, opening each one in turn, analyzing the content with the NLTK module, and writing the output to a CSV file before continuing with the next file.
The program runs fine for single files, but the loop almost stalls after the 8th run, even though the 9th file to analyze is no larger than the 8th. E.g. the first eight iterations took 10 minutes in total, whereas the 9th took 45 minutes. The 10th took even longer than 45 minutes (and the file is much smaller than the first ones).
I am sure the program could be optimized further, as I am still relatively new to Python, but I don't understand why it becomes so slow after the 8th run. Any help would be appreciated. Thanks!
# import necessary modules
import urllib, csv, re, nltk
from string import punctuation
from bs4 import BeautifulSoup
from nltk.stem import WordNetLemmatizer
import glob

# Define bags of words (there are more variables, e.g. word counts, that are calculated)
adaptability = ['adaptability', 'flexibility']

csvfile = open("test.csv", "w", newline='', encoding='cp850', errors='replace')
writer = csv.writer(csvfile)

for filename in glob.glob('*.txt'):
    ### Open files and arrange them so that they are ready for pre-processing
    review = open(filename, encoding='utf-8', errors='ignore').read()
    soup = BeautifulSoup(review)
    text = soup.get_text()

    wnl = WordNetLemmatizer()

    adaptability_counts = []
    adaptability_counter = 0

    review_processed = text.lower().replace('\r', ' ').replace('\t', ' ').replace('\n', ' ').replace('. ', ' ').replace(';', ' ').replace(', ', ' ')
    words = review_processed.split(' ')
    word_l1 = [word for word in words if word not in stopset]   # stopset (stop words) is defined elsewhere
    word_l = [x for x in word_l1 if x != ""]
    word_count = len(word_l)

    for word in words:
        wnl.lemmatize(word)
        if word in adaptability:
            adaptability_counter = adaptability_counter + 1
    adaptability_counts.append(adaptability_counter)

    # I then repeat the analysis with 2 subsections of the text files
    # (e.g. calculate adaptability_counts for Part I only)

    output = zip(adaptability_counts)
    writer = csv.writer(open('test_10.csv', 'a', newline='', encoding='cp850', errors='replace'))
    writer.writerows(output)
    csvfile.flush()
You're never closing the files once you open them. My guess is you are running out of memory and it's taking so long because your machine has to swap data from the page file (on disk). Instead of just calling open(), you either have to close() the file when you are finished with it or use the with open construct, which will close the file automatically when you are done. See this page for more information: http://effbot.org/zone/python-with-statement.htm
If it were me, I would change this line:
review=open(filename, encoding='utf-8', errors='ignore').read()
to this:
with open(filename, encoding='utf-8', errors='ignore') as f:
    review = f.read()
    ...
and make sure you indent appropriately. The code you execute with the file open will need to be indented within the with block.
Since the accepted answer hasn't quite solved your problem, here's a follow-up:
You have a list adaptability in which you look up each word in your inputs. Never look up words in a list! Replace the list with a set and you should see a huge improvement. (If you are using the list to count individual words, replace it with collections.Counter, or the NLTK's FreqDist.) If your adaptability list grows with each file you read (does it? was it supposed to?), this is definitely enough to cause your problem.
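For example, a minimal sketch of both suggestions, reusing the words list from the question:

from collections import Counter

adaptability = {'adaptability', 'flexibility'}      # a set: O(1) membership test instead of scanning a list

word_counts = Counter(words)                         # counts every word in a single pass
adaptability_counter = sum(word_counts[w] for w in adaptability)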
But there might be more than one culprit. You left out a lot of your code, so there is no telling which of your other data structures grow with each file you read, or whether that growth is intended. It's pretty clear that your code is "quadratic": it gets slower as your data gets bigger, not because of memory size but because you need more steps.
Don't bother switching to arrays and CountVectorizer, you'll just postpone the problem a little. Figure out how to process each file in constant time. If your algorithm doesn't require collecting words from more than one file, the quickest solution is to run it on each file separately (it's not hard to automate that).
I know there are many questions about editing lines of a file, but my problem is quite specific, and in two days I couldn't find a question/answer here that addresses it.
The Problem
How do I replace a few (contiguous) characters s1 of one specific line in a file with another few characters s2 meeting the following conditions?
1. The line number is always the same. (number 5)
2. The part of the line in front of s1 is always the same. (and therefore has a constant length of 18)
3. The part of the line in front of s1 won't occur anywhere else in the file.
4. s1 and s2 are both not constant and can even have different lengths.
5. s1 and s2 may both occur anywhere else in the file.
6. The file can be very long, so I don't want to load the whole file into memory.
7. For the same reason as 6., I want to avoid copying the file contents into a new file. I'm just changing a few characters, so rewriting the whole file would be quite an overhead, wouldn't it?
8. I'm using Python 3.X.
Most of the similar approaches I found so far didn't meet either 6. or 7. I found this (opening the file with r+ and performing a write(s2) right before s1), but it doesn't work for me because of 4. Is it even possible in Python to achieve what I want, or do I have to copy my file somehow and modify the line along the way after all?
The Background
I have a text file consisting of a few lines of metadata followed by a potentially large number of datasets. The metadata contains a line saying No. of patterns : n, where n is the number of datasets in the file. Among other things, my script should be able to append additional datasets to an existing file, by appending the sets themselves and updating n.
The design of this file that my script generates/extends is not mine, so I mustn't change it. The file will serve as input for another application that is also not mine: JavaNNS.
The answer you linked states:
you can only extend and truncate a file at the end, not at the head
With this limitation, Python just mirrors the restrictions imposed by the data storage abstraction we call a "file system". All programs, no matter the programming language, are bound by this when using the file system. Some just hide this fact from the user by rewriting complete files in the background.
If due to the size of the file this causes performance problems when updating the file, then that's really a problem of that crude file format, even though you aren't the one to be blamed for that: The file format doesn't seem to be suited for in-place updates of the file that change the number of patterns.
How to avoid (re)writing large amounts of data
Pipes
If the program which will consume the updated file (JavaNNS) accepts the file contents on standard input, consider keeping the metadata and the patterns in separate files. That way, you can append to the patterns file and only have to rewrite the (hopefully small) metadata file. Then just pipe both files into JavaNNS in a single call:
cat metadata.txt patterns.txt | JavaNNS
If JavaNNS does not accept the required file content on standard input but insists on opening the file itself, you can probably still use a named pipe and pass that as the file to open. (This might not work if JavaNNS does random access on the file instead of just reading it sequentially.)
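A hedged sketch of the named-pipe idea on Unix; the javanns command line here is purely hypothetical, so adjust it to however the application is actually launched:

import os
import shutil
import subprocess

pipe_path = 'network.pat'                              # hypothetical pipe name
os.mkfifo(pipe_path)                                   # create a named pipe (Unix only)
proc = subprocess.Popen(['javanns', pipe_path])        # hypothetical launcher that reads the pipe

with open(pipe_path, 'w') as pipe:                     # blocks until the reader opens the pipe
    for part in ('metadata.txt', 'patterns.txt'):
        with open(part) as src:
            shutil.copyfileobj(src, pipe)              # stream each file through without loading it

proc.wait()
os.remove(pipe_path)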
Padding
If you'll be appending to the file several times and the file format is flexible enough to allow some padding, then just pad the count so that there is room for n to grow more digits in future writes. That way, you only have to rewrite the file completely when the padding turns out not to be large enough.
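For example, a minimal sketch of what a padded count line might look like when the file is first written (the width and file name are assumptions); updating it later is then a fixed-length in-place write, as in the answer below:

WIDTH = 8                                              # assumed number of characters reserved for n

with open('network.pat', 'w') as f:                    # hypothetical file name
    # ... write the metadata lines that precede the count ...
    f.write('No. of patterns : {:<{w}d}\n'.format(n, w=WIDTH))   # padded with trailing spaces so n can grow
    # ... write the patterns themselves ...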
You can't edit in place and just change s1 for s2 as they can be different lengths. You will need to write out the rest of the file, and this will be safer with a replacement file.
If s1 and s2 are guaranteed to be the same length, then you could do it in place, e.g. if the value is padded to the maximum size of s1/s2. (Note: Python 3 does not allow relative seeks on text-mode files, so the sketch below opens the file in binary mode and tracks the byte offset itself.)
with open('<file>', 'rb+') as f:
    offset = 0
    for line_no, line in enumerate(f, 1):             # line numbers starting at 1
        if line_no == 5:                              # the target line
            f.seek(offset + 18)                       # skip the constant 18-character prefix
            f.write("{: 8d}".format(s2).encode())     # overwrite with padded s2 (int)
            break
        offset += len(line)                           # byte offset of the start of the next line
With different lengths you will need a different file:
with open('<file>', 'r') as r, open('<file-new>', 'w') as w:
    for line_no, line in enumerate(r, 1):             # line numbers starting at 1
        if line_no == 5:                              # the target line
            w.write(line[:18] + str(s2) + line[18 + len(s1):])
        else:
            w.write(line)
I have a text file containing 7,000 lines of strings, and I have to search for a specific string based on a few parameters.
Some people say that the code below isn't efficient (in terms of speed and memory usage).
f = open("file.txt")
data = f.read().split() # strings as list
First of all, if I don't even make it into a list, how would I start searching at all?
Is it efficient to load the entire file? If not, how should I do it?
To filter anything, we need to search for it, and to search we need to read the file, right?
I'm a bit confused.
Iterate over each line of the file without storing the whole thing; this keeps the program memory-efficient.
with open(filename) as f:
    for line in f:
        if "search_term" in line:
            break
Say I have an absurdly large text file. I don't think my file would grow larger than ~500 MB, but for the sake of scalability and my own curiosity, let's say it is on the order of a few gigabytes.
My end goal is to map it to an array of sentences (separated by '?', '!', '.' and, for all intents and purposes, ';'), and each sentence to an array of words. I was then going to use numpy for some statistical analysis.
What would be the most scalable way to go about doing this?
PS: I thought of rewriting the file to have one sentence per line, but I ran into problems trying to load the file into memory. I know of the solution where you read off chunks of data from one file, manipulate them, and write them to another, but that seems inefficient in terms of disk space. I know most people would not worry about using 10 GB of scratch space nowadays, but it does seem like there ought to be a way of directly editing chunks of the file.
My first thought would be to use a stream parser: basically you read in the file a piece at a time and do the statistical analysis as you go. This is typically done with markup languages like HTML and XML, so you'll find a lot of parsers for those languages out there, including in the Python standard library. A simple sentence parser is something you can write yourself, though; for example:
import re, collections

sentence_terminator = re.compile(r'(?<=[.!?;])\s*')

class SentenceParser(object):
    def __init__(self, filelike):
        self.f = filelike
        self.buffer = collections.deque([''])

    def __next__(self):
        while len(self.buffer) < 2:
            data = self.f.read(512)          # pull in the next 512-byte block
            if not data:                     # end of file
                raise StopIteration()
            # Re-split the unfinished tail plus the new data into sentences.
            self.buffer += sentence_terminator.split(self.buffer.pop() + data)
        return self.buffer.popleft()

    next = __next__                          # Python 2 compatibility

    def __iter__(self):
        return self
This will only read data from the file as needed to complete a sentence. It reads in 512-byte blocks so you'll be holding less than a kilobyte of file contents in memory at any one time, no matter how large the actual file is.
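Usage might look something like this (a sketch; the file name and the statistics are just placeholders):

import numpy as np

word_counts = []                                 # per-sentence word counts for the numpy analysis
with open('big.txt') as f:                       # hypothetical file name
    for sentence in SentenceParser(f):
        words = sentence.split()                 # map each sentence to its words
        word_counts.append(len(words))

print(np.mean(word_counts), np.std(word_counts))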
After a stream parser, my second thought would be to memory map the file. That way you could go through and replace the space that (presumably) follows each sentence terminator by a newline; after that, each sentence would start on a new line, and you'd be able to open the file and use readline() or a for loop to go through it line by line. But you'd still have to worry about multi-line sentences; plus, if any sentence terminator is not followed by a whitespace character, you would have to insert a newline (instead of replacing something else with it) and that could be horribly inefficient for a large file.
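For completeness, a hedged sketch of that memory-mapped variant, assuming every sentence terminator is followed by a single space, so each replacement is the same length and the file never grows:

import mmap
import re

with open('big.txt', 'r+b') as f:                       # hypothetical file name
    mm = mmap.mmap(f.fileno(), 0)                       # map the whole file read/write
    for m in re.finditer(rb'(?<=[.!?;]) ', mm):         # a space right after a terminator
        mm[m.start():m.start() + 1] = b'\n'             # same-length, in-place replacement
    mm.flush()
    mm.close()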