Say I have an absurdly large text file. I don't expect my file to grow larger than ~500 MB, but for the sake of scalability and my own curiosity, let's say it is on the order of a few gigabytes.
My end goal is to map it to an array of sentences (separated by '?' '!' '.' and for all intents and purposes ';') and each sentence to an array of words. I was then going to use numpy for some statistical analysis.
What would be the most scalable way to go about doing this?
PS: I thought of rewriting the file to have one sentence per line, but I ran into problems trying to load the file into memory. I know of the solution where you read off chunks of data from one file, manipulate them, and write them to another, but that seems inefficient with disk space. I know most people wouldn't worry about using 10 GB of scratch space nowadays, but it does seem like there ought to be a way of directly editing chunks of the file.
My first thought would be to use a stream parser: basically you read in the file a piece at a time and do the statistical analysis as you go. This is typically done with markup languages like HTML and XML, so you'll find a lot of parsers for those languages out there, including in the Python standard library. A simple sentence parser is something you can write yourself, though; for example:
import re, collections

# Split points: any run of whitespace preceded by a sentence terminator
sentence_terminator = re.compile(r'(?<=[.!?;])\s*')

class SentenceParser(object):
    def __init__(self, filelike):
        self.f = filelike
        self.buffer = collections.deque([''])

    def next(self):
        # Keep reading until the buffer holds at least one complete sentence;
        # the last deque entry is always the possibly-incomplete tail.
        while len(self.buffer) < 2:
            data = self.f.read(512)
            if not data:
                raise StopIteration()
            self.buffer += sentence_terminator.split(self.buffer.pop() + data)
        return self.buffer.popleft()

    __next__ = next  # so the same class also iterates on Python 3

    def __iter__(self):
        return self
This will only read data from the file as needed to complete a sentence. It reads in 512-byte blocks so you'll be holding less than a kilobyte of file contents in memory at any one time, no matter how large the actual file is.
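For completeness, usage might look roughly like this (the filename is just a placeholder); each sentence arrives as a plain string that you can split into words yourself:

# Minimal usage sketch: stream sentences out of a huge file one at a time.
with open('huge.txt') as f:
    for sentence in SentenceParser(f):
        words = sentence.split()
        # ... feed `words` into your numpy statistics here ...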
After a stream parser, my second thought would be to memory map the file. That way you could go through and replace the space that (presumably) follows each sentence terminator by a newline; after that, each sentence would start on a new line, and you'd be able to open the file and use readline() or a for loop to go through it line by line. But you'd still have to worry about multi-line sentences; plus, if any sentence terminator is not followed by a whitespace character, you would have to insert a newline (instead of replacing something else with it) and that could be horribly inefficient for a large file.
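For what it's worth, a minimal sketch of that mmap idea, assuming every terminator really is followed by a single space (so the replacement is byte-for-byte and the file never changes size); 'huge.txt' is a placeholder and the file is edited in place:

import mmap

with open('huge.txt', 'r+b') as f:
    mm = mmap.mmap(f.fileno(), 0)
    for terminator in b'.!?;':
        # find each "terminator + space" pair and overwrite the space with a newline
        pos = mm.find(bytes([terminator]) + b' ')
        while pos != -1:
            mm[pos + 1:pos + 2] = b'\n'
            pos = mm.find(bytes([terminator]) + b' ', pos + 1)
    mm.flush()
    mm.close()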
Related
I have a pretty large file (about 8 GB). I have read this post: How to read a large file line by line, and this one: Tokenizing large (>70MB) TXT file using Python NLTK. Concatenation & write data to stream errors.
But this still doesn't do the job. When I run my code, my PC gets stuck.
Am I doing something wrong?
I want to get all words into a list (tokenize them). Also, doesn't the code read each line and tokenize that line on its own? Might that prevent the tokenizer from tokenizing words properly, since some words (and sentences) do not end after just one line?
I considered splitting it up into smaller files, but wouldn't that still consume my RAM, given I only have 8 GB of RAM, since the list of words will probably end up about as big (8 GB) as the initial txt file?
word_list = []
number = 0
with open(os.path.join(save_path, 'alldata.txt'), 'rb', encoding="utf-8") as t:
    for line in t.readlines():
        word_list += nltk.word_tokenize(line)
        number = number + 1
        print(number)
By using the following line:
for line in t.readlines():
    # do the things
You are forcing Python to read the whole file with t.readlines(), which returns a list of strings representing the whole file, thus bringing the whole file into memory.
Instead, if you do as the example you linked states:
for line in t:
    # do the things
Python will process the file line by line, the way you want: the file object acts like a generator, yielding one line at a time.
Looking at your code again, I see that you are constantly appending to the word list with word_list += nltk.word_tokenize(line). This means that even if you do read the file one line at a time, you still retain all of that data in memory after the reader has moved on. You will likely need to find a better way of doing this, as you will otherwise keep consuming massive amounts of memory: the data is never dropped.
For data this large, you will have to either
find a way to store an intermediate version of your tokenized data, or
design your code in a way that you can handle one, or just a few tokenized words at a time.
Something like this might do the trick:
import os
import nltk

def enumerated_tokens(filepath):
    index = 0
    with open(filepath, encoding="utf-8") as t:
        for line in t:
            for word in nltk.word_tokenize(line):
                yield (index, word)
                index += 1
for index, word in enumerated_tokens(os.path.join(save_path, 'alldata.txt')):
    print(index, word)
    # Do the thing with your word.
Notice how this never actually stores the word anywhere. This doesn't mean that you can't temporarily store anything, but if you're memory constrained, generators are the way to go. This approach will likely be faster, more stable, and use less memory overall.
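If you do decide to store an intermediate version of the tokenized data rather than keep it in RAM, one simple option is to stream the tokens out to a file, one per line, and read that file back lazily in later passes. A rough sketch, reusing the enumerated_tokens generator above (the output filename is just a placeholder):

# Sketch: persist tokens one per line so later passes can stream them again
# without ever holding the full token list in memory.
with open('alldata.tokens.txt', 'w', encoding='utf-8') as out:
    for index, word in enumerated_tokens(os.path.join(save_path, 'alldata.txt')):
        out.write(word + '\n')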
I want to do a textual analysis of multiple text files (>50,000 files), some of which are in html script. My program (below) iterates over these files, opening each one in turn, analyzing the content with NLTK module and writing the output to a CSV file, before continuing the analysis with the second file.
The program runs fine for single files, but the loop almost stalls after the 8th run, even though the 9th file to analyse is no larger than the 8th. E.g. the first eight iterations took 10 minutes in total, whereas the 9th took 45 minutes. The 10th took even longer than 45 minutes (and that file is much smaller than the first ones).
I am sure the program could be optimized further, as I am still relatively new to Python, but I don't understand why it becomes so slow after the 8th run. Any help would be appreciated. Thanks!
# import necessary modules
import urllib, csv, re, nltk
from string import punctuation
from bs4 import BeautifulSoup
import glob

# Define bags of words (there are more variables, e.g. word counts, that are calculated)
adaptability = ['adaptability', 'flexibility']

csvfile = open("test.csv", "w", newline='', encoding='cp850', errors='replace')
writer = csv.writer(csvfile)

for filename in glob.glob('*.txt'):
    ### Open files and arrange them so that they are ready for pre-processing
    review = open(filename, encoding='utf-8', errors='ignore').read()
    soup = BeautifulSoup(review)
    text = soup.get_text()

    from nltk.stem import WordNetLemmatizer
    wnl = WordNetLemmatizer()

    adaptability_counts = []
    adaptability_counter = 0

    review_processed = text.lower().replace('\r', ' ').replace('\t', ' ').replace('\n', ' ').replace('. ', ' ').replace(';', ' ').replace(', ', ' ')
    words = review_processed.split(' ')
    word_l1 = [word for word in words if word not in stopset]
    word_l = [x for x in word_l1 if x != ""]
    word_count = len(word_l)

    for word in words:
        wnl.lemmatize(word)
        if word in adaptability:
            adaptability_counter = adaptability_counter + 1
    adaptability_counts.append(adaptability_counter)

    # I then repeat the analysis with 2 subsections of the text files
    # (e.g. calculate adaptability_counts for Part I only)

    output = zip(adaptability_counts)
    writer = csv.writer(open('test_10.csv', 'a', newline='', encoding='cp850', errors='replace'))
    writer.writerows(output)
    csvfile.flush()
You're never closing the files once you open them. My guess is you are running out of memory and it's taking so long because your machine has to swap data from the page file (on disk). Instead of just calling open(), you either have to close() the file when you are finished with it or use the with open construct, which will close the file automatically when you are done. See this page for more information: http://effbot.org/zone/python-with-statement.htm
If it were me, I would change this line:
review=open(filename, encoding='utf-8', errors='ignore').read()
to this:
with open(filename, encoding='utf-8', errors='ignore') as f:
    review = f.read()
    ...
and make sure you indent appropriately. The code you execute with the file open will need to be indented within the with block.
Since the accepted answer hasn't quite solved your problem, here's a follow-up:
You have a list adaptability in which you look up each word in your inputs. Never look up words in a list! Replace the list with a set and you should see a huge improvement. (If you are using the list to count individual words, replace it with collections.Counter, or NLTK's FreqDist.) If your adaptability list grows with each file you read (does it? was it supposed to?), that alone is definitely enough to cause your problem.
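To make that concrete, here is a minimal sketch of both suggestions (the variable names mirror your code; `words` stands for the tokens extracted from one file):

from collections import Counter

adaptability = {'adaptability', 'flexibility'}   # a set: O(1) membership tests

word_freq = Counter(words)                       # count every word in one pass
adaptability_count = sum(word_freq[w] for w in adaptability)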
But there might be more than one culprit. You leave out a lot of your code, so there's no telling which of your other data structures grow with each file you see, or whether that's intended. What is clear is that your code is "quadratic": it gets slower as your data gets bigger, not because of memory size but because you need more steps.
Don't bother switching to arrays and CountVectorizer; you'll just postpone the problem a little. Figure out how to process each file in constant time. If your algorithm doesn't require collecting words from more than one file, the quickest solution is to run it on each file separately (it's not hard to automate that).
I'm reading through a large file, and processing it.
I want to be able to jump to the middle of the file without it taking a long time.
Right now I am doing:
f = gzip.open(input_name)
for i in range(1000000):
    f.read()  # just skipping the first 1M rows
for line in f:
    do_something(line)
Is there a faster way to skip the lines in the zipped file?
If I have to unzip it first, I'll do that, but there has to be a way.
It's of course a text file, with \n separating lines.
The nature of gzipping is such that there is no longer the concept of lines when the file is compressed -- it's just a binary blob. Check out this for an explanation of what gzip does.
To read the file, you'll need to decompress it -- the gzip module does a fine job of it. Like other answers, I'd also recommend itertools to do the jumping, as it will carefully make sure you don't pull things into memory, and it will get you there as fast as possible.
import gzip
import itertools

with gzip.open(filename) as f:
    # jump to `initial_row`, then iterate line by line from there
    for line in itertools.islice(f, initial_row, None):
        pass  # have a party: process `line` here
Alternatively, if this is a CSV that you're going to be working with, you could also try pandas' parser, as it can handle decompressing gzip for you. That would look like: parsed_csv = pd.read_csv(filename, compression='gzip').
Also, to be extra clear: when you iterate over file objects in Python -- like the f variable above -- you iterate over lines. You do not need to think about the '\n' characters.
You can use itertools.islice, passing the file object f and a starting point. It will still advance the iterator, but more efficiently than calling next 1,000,000 times:
from itertools import islice

for line in islice(f, 1000000, None):
    print(line)
I'm not overly familiar with gzip, but I imagine f.read() reads the whole file, so the next 999,999 calls are doing nothing. If you wanted to manually advance the iterator, you would call next on the file object, i.e. next(f).
Calling next(f) won't mean all the lines are read into memory at once either; it advances the iterator one line at a time, so if you want to skip a line or two in a file, or a header, it can be useful.
The consume recipe that @wwii suggested is also worth checking out.
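For reference, the consume recipe from the itertools documentation looks like this; skipping a million lines then becomes consume(f, 1000000):

import collections
from itertools import islice

def consume(iterator, n=None):
    "Advance the iterator n steps ahead. If n is None, consume entirely."
    if n is None:
        # feed the entire iterator into a zero-length deque
        collections.deque(iterator, maxlen=0)
    else:
        # advance to the empty slice starting at position n
        next(islice(iterator, n, n), None)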
Not really.
If you know the number of bytes you want to skip, you can use .seek(amount) on the file object, but in order to skip a number of lines, Python has to go through the file byte by byte to count the newline characters.
The only alternative that comes to my mind is if you handle a certain static file, that won't change. In that case, you can index it once, i.e. find out and remember the positions of each line. If you have that in e.g. a dictionary that you save and load with pickle, you can skip to it in quasi-constant time with seek.
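A sketch of that indexing idea for an uncompressed file (the filenames and the line number are placeholders; a plain list of offsets works just as well as a dict here):

import pickle

# Build the index once: the byte offset at which each line starts.
offsets = []
with open('big.txt', 'rb') as f:
    pos = 0
    for line in f:
        offsets.append(pos)
        pos += len(line)
with open('big.txt.idx', 'wb') as idx:
    pickle.dump(offsets, idx)

# Later runs: load the index and jump straight to line 1,000,000.
with open('big.txt.idx', 'rb') as idx:
    offsets = pickle.load(idx)
with open('big.txt', 'rb') as f:
    f.seek(offsets[1000000])
    for line in f:
        ...  # process from here on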
It is not possible to seek to a random position within a gzip file. Gzip is a stream format, so it must always be decompressed from the start up to where your data of interest lies.
It is not possible to jump to a specific line without an index. Lines can be scanned forward or scanned backwards from the end of the file in continuing chunks.
You should consider a different storage format for your needs. What are your needs?
I am currently programming a game that requires reading and writing lines in a text file. I was wondering if there is a way to read a specific line in the text file (e.g. the first line). Also, is there a way to write a line at a specific location (e.g. change the first line of the file, write a couple of other lines, and then change the first line again)? I know that we can read lines sequentially by calling:
f.readline()
Edit: Based on responses, apparently there is no way to read specific lines if they are different lengths. I am only working on a small part of a large group project and to change the way I'm storing data would mean a lot of work.
But is there a method to change specifically the first line of the file? I know calling:
f.write('text')
writes something into the file, but it writes the line at the end of the file instead of the beginning. Is there a way for me to specifically rewrite the text at the beginning?
If all your lines are guaranteed to be the same length, then you can use f.seek(N) to position the file pointer at the N'th byte (where N is LINESIZE*line_number) and then f.read(LINESIZE). Otherwise, I'm not aware of any way to do it in an ordinary ASCII file (which I think is what you're asking about).
Of course, you could store some sort of record information in the header of the file and read that first to let you know where to seek to in your file -- but at that point you're better off using some external library that has already done all that work for you.
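As a quick sketch of the fixed-length-record idea (LINESIZE, the filename, and the replacement text are all assumptions; LINESIZE counts the trailing newline):

LINESIZE = 80  # assumed fixed length of every line, newline included

def read_line(f, line_number):
    # jump straight to the start of the requested line and read exactly one record
    f.seek(LINESIZE * line_number)
    return f.read(LINESIZE)

def write_line(f, line_number, text):
    # overwrite a record in place; text must be exactly LINESIZE characters
    assert len(text) == LINESIZE
    f.seek(LINESIZE * line_number)
    f.write(text)

with open('records.txt', 'r+') as f:
    first = read_line(f, 0)
    write_line(f, 0, 'new first line'.ljust(LINESIZE - 1) + '\n')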
Unless your text file is really big, you can always store each line in a list:
with open('textfile', 'r') as f:
    lines = [L[:-1] for L in f.readlines()]
(note I've stripped off the newline so you don't have to remember to keep it around)
Then you can manipulate the list by adding entries, removing entries, changing entries, etc.
At the end of the day, you can write the list back to your text file:
with open('textfile', 'w') as f:
    f.write('\n'.join(lines))
Here's a little test which works for me on OS-X to replace only the first line.
test.dat
this line has n characters
this line also has n characters
test.py
# First, I get the length of the first line -- if you already know it, skip this block
f = open('test.dat', 'r')
l = f.readline()
linelen = len(l) - 1
f.close()

# Apparently mode='a+' doesn't work on all systems :( so I use 'r+' instead
f = open('test.dat', 'r+')
f.seek(0)
f.write('a' * linelen + '\n')  # 'a'*linelen = 'aaaaaaaaa...'
f.close()
These days, jumping within files in an optimized fashion is a task for high performance applications that manage huge files.
Are you sure that your software project requires reading/writing random places in a file during runtime? I think you should consider changing the whole approach:
If the data is small, you can keep / modify / generate the data at runtime in memory within appropriate container formats (list or dict, for instance) and then write it entirely at once (on change, or only when your program exits). You could consider looking at simple databases. Also, there are nice data exchange formats like JSON, which would be the ideal format in case your data is stored in a dictionary at runtime.
An example, to make the concept more clear. Consider you already have data written to gamedata.dat:
[{"playtime": 25, "score": 13, "name": "rudolf"}, {"playtime": 300, "score": 1, "name": "peter"}]
This is utf-8-encoded and JSON-formatted data. Read the file during runtime of your Python game:
with open("gamedata.dat") as f:
s = f.read().decode("utf-8")
Convert the data to Python types:
gamedata = json.loads(s)
Modify the data (add a new user):
user = {"name": "john", "score": 1337, "playtime": 1}
gamedata.append(user)
John really is a 1337 gamer. However, at this point, you also could have deleted a user, changed the score of Rudolf or changed the name of Peter, ... In any case, after the modification, you can simply write the new data back to disk:
with open("gamedata.dat", "w") as f:
f.write(json.dumps(gamedata).encode("utf-8"))
The point is that you manage (create/modify/remove) data during runtime within appropriate container types. When writing data to disk, you write the entire data set in order to save the current state of the game.
I am a complete beginner to Python or any serious programming language for that matter. I finally got a prototype code to work but I think it will be too slow.
My goal is to find and replace some Chinese characters across all files (they are csv) in a directory with integers as per a csv file I have. The files are nicely numbered by year-month, for example 2000-01.csv, and will be the only files in that directory.
I will be looping across about 25 files that are in the neighborhood of 500mb each (and about a million lines). The dictionary I will be using will have about 300 elements and I will be changing unicode (Chinese character) to integers. I tried with a test run and, assuming everything scales up linearly (?), it looks like it would take about a week for this to run.
Thanks in advance. Here is my code (don't laugh!):
# -*- coding: utf-8 -*-
import os, codecs

dir = "C:/Users/Roy/Desktop/test/"
Dict = {'hello' : 'good', 'world' : 'bad'}

for dirs, subdirs, files in os.walk(dir):
    for file in files:
        inFile = codecs.open(dir + file, "r", "utf-8")
        inFileStr = inFile.read()
        inFile.close()
        inFile = codecs.open(dir + file, "w", "utf-8")
        for key in Dict:
            inFileStr = inFileStr.replace(key, Dict[key])
        inFile.write(inFileStr)
        inFile.close()
In your current code, you're reading the whole file into memory at once. Since they're 500 MB files, that means 500 MB strings. And then you do repeated replacements of them, which means Python has to create a new 500 MB string with the first replacement, then destroy the first string, then create a second 500 MB string for the second replacement, then destroy the second string, et cetera, for each replacement. That turns out to be quite a lot of copying of data back and forth, not to mention using a lot of memory.
If you know the replacements will always be contained in a line, you can read the file line by line by iterating over it. Python will buffer the read, which means it will be fairly optimized. You should open a new file, under a new name, for writing the new file simultaneously. Perform the replacement on each line in turn, and write it out immediately. Doing this will greatly reduce the amount of memory used and the amount of memory copied back and forth as you do the replacements:
for file in files:
    fname = os.path.join(dir, file)
    inFile = codecs.open(fname, "r", "utf-8")
    outFile = codecs.open(fname + ".new", "w", "utf-8")
    for line in inFile:
        newline = do_replacements_on(line)
        outFile.write(newline)
    inFile.close()
    outFile.close()
    os.rename(fname + ".new", fname)
If you can't be certain if they'll always be on one line, things get a little harder; you'd have to read in blocks manually, using inFile.read(blocksize), and keep careful track of whether there might be a partial match at the end of the block. Not as easy to do, but usually still worth it to avoid the 500Mb strings.
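A sketch of that blockwise approach, assuming you know the length of your longest key (max_key_len is an assumption you have to supply) and have a compiled pattern plus a replacement callable like the ones shown further down in this answer:

def replace_in_blocks(inFile, outFile, pattern, repl, max_key_len, blocksize=1 << 20):
    # Stream inFile to outFile applying pattern.sub(repl, ...), even when a key
    # straddles two read() blocks.
    tail = u""
    while True:
        block = inFile.read(blocksize)
        if not block:
            outFile.write(pattern.sub(repl, tail))  # flush whatever is left
            return
        buf = tail + block
        # A key that has only partially arrived must start inside this window.
        cut = max(len(buf) - (max_key_len - 1), 0)
        # But never split a key we can already see in full across the cut.
        for m in pattern.finditer(buf):
            if m.start() >= cut:
                break
            if m.end() > cut:
                cut = m.end()
                break
        outFile.write(pattern.sub(repl, buf[:cut]))
        tail = buf[cut:]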
Another big improvement would be if you could do the replacements in one go, rather than trying a whole bunch of replacements in order. There are several ways of doing that, but which fits best depends entirely on what you're replacing and with what. For translating single characters into something else, the translate method of unicode objects may be convenient. You pass it a dict mapping unicode codepoints (as integers) to unicode strings:
>>> u"\xff and \ubd23".translate({0xff: u"255", 0xbd23: u"something else"})
u'255 and something else'
For replacing substrings (and not just single characters), you could use the re module. The re.sub function (and the sub method of compiled regexps) can take a callable (a function) as the replacement argument, which will then be called for each match:
>>> import re
>>> d = {u'spam': u'spam, ham, spam and eggs', u'eggs': u'saussages'}
>>> p = re.compile("|".join(re.escape(k) for k in d))
>>> def repl(m):
...     return d[m.group(0)]
...
>>> p.sub(repl, u"spam, vikings, eggs and vikings")
u'spam, ham, spam and eggs, vikings, saussages and vikings'
I think you can lower memory use greatly (and thus limit swap use and make things faster) by reading a line at a time and writing it (after the regexp replacements already suggested) to a temporary file - then moving the file to replace the original.
A few things (unrelated to the optimization problem):
dir + file should be os.path.join(dir, file)
You might want to not reuse infile, but instead open (and write to) a separate outfile. This also won't increase performance, but is good practice.
I don't know if you're I/O bound or cpu bound, but if your cpu utilization is very high, you may want to use threading, with each thread operating on a different file (so with a quad core processor, you'd be reading/writing 4 different files simultaneously).
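If you go that route, a minimal sketch using the standard library's pool abstractions might look like this; process_one_file is a placeholder for the per-file read/replace/write logic discussed above, and note that for CPU-bound replacement work a process pool sidesteps the GIL where plain threads would not:

from concurrent.futures import ProcessPoolExecutor  # or ThreadPoolExecutor if I/O-bound
import glob
import os

def process_one_file(path):
    ...  # read, do the replacements, write back (as in the per-file loop above)

if __name__ == "__main__":
    paths = glob.glob(os.path.join("C:/Users/Roy/Desktop/test/", "*.csv"))
    with ProcessPoolExecutor(max_workers=4) as pool:
        pool.map(process_one_file, paths)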
Open the files read/write ('r+') and avoid the double open/close (and the likely associated buffer flush). Also, if possible, don't write back the entire file; seek and write back only the changed areas after doing the replace on the file's contents. Read, replace, write changed areas (if any).
That still won't help performance too much, though: I'd profile and determine where the performance hit actually is, and then move on to optimising it. It could just be that reading the data from disk is very slow, and there's not much you can do about that in Python.
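For the profiling step, the standard library's cProfile is one quick option (main() here is an assumed entry point that wraps your replacement loop):

import cProfile
import pstats

cProfile.run("main()", "replace.prof")  # run the whole job under the profiler
pstats.Stats("replace.prof").sort_stats("cumulative").print_stats(20)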