How to load a big text file efficiently in Python

I have a text file containing 7000 lines of strings, and I need to search it for a specific string based on a few parameters.
Some people are saying that the code below is not efficient (in speed and memory usage).
f = open("file.txt")
data = f.read().split() # strings as list
First of all, if I don't even turn it into a list, how would I start searching at all?
Is it efficient to load the entire file? If not, how should I do it?
To filter anything, we need to search, and to search we need to read the file, right?
I'm a bit confused.

Iterate over each line of the file without storing the whole thing. This keeps the program memory efficient.
with open(filename) as f:
    for line in f:
        if "search_term" in line:
            break
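If you need to keep the matching lines (for example to filter on several parameters), you can collect them as you go. This is only a rough sketch, assuming your parameters are plain substrings; the names find_matches and search_terms are made up for illustration:
def find_matches(filename, search_terms):
    matches = []
    with open(filename) as f:
        for line in f:  # only one line is held in memory at a time
            if all(term in line for term in search_terms):
                matches.append(line.strip())
    return matches

hits = find_matches("file.txt", ["foo", "bar"])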

Related

Loading / streaming an 8 GB txt file and tokenizing it

I have a pretty large file (about 8 GB). I have read this post: How to read a large file line by line, and this one: Tokenizing large (>70MB) TXT file using Python NLTK. Concatenation & write data to stream errors.
But this still doesn't do the job; when I run my code, my PC gets stuck.
Am I doing something wrong?
I want to get all the words into a list (i.e. tokenize them). Also, doesn't the code read each line and tokenize the line? Might that prevent the tokenizer from tokenizing words properly, since some words (and sentences) do not end at the end of a line?
I considered splitting it up into smaller files, but wouldn't that still consume my RAM, since I only have 8 GB of RAM and the list of words will probably be about as big (8 GB) as the initial txt file?
word_list = []
number = 0
with open(os.path.join(save_path, 'alldata.txt'), 'r', encoding="utf-8") as t:
    for line in t.readlines():
        word_list += nltk.word_tokenize(line)
        number = number + 1
        print(number)
By using the following line:
for line in t.readlines():
    # do the things
you are forcing Python to read the whole file with t.readlines() and build a list of strings representing the whole file, thus bringing the whole file into memory.
Instead, if you do as the example you linked states:
for line in t:
    # do the things
the Python VM will process the file line by line, as you want: the file object acts like a generator, yielding each line one at a time.
After looking at your code again, I see that you are constantly appending to the word list with word_list += nltk.word_tokenize(line). This means that even if you do read the file one line at a time, you still retain all of that data in memory after you have moved past each line. You will likely need to find a better way of doing this, because otherwise you will still consume massive amounts of memory; the data is never dropped from memory.
For data this large, you will have to either
find a way to store an intermediate version of your tokenized data (see the sketch at the end of this answer), or
design your code in a way that handles one, or just a few, tokenized words at a time.
Something like this might do the trick:
def enumerated_tokens(filepath):
    index = 0
    with open(filepath, 'r', encoding="utf-8") as t:
        for line in t:
            for word in nltk.word_tokenize(line):
                yield (index, word)
                index += 1

for index, word in enumerated_tokens(os.path.join(save_path, 'alldata.txt')):
    print(index, word)
    # Do the thing with your word.
Notice how this never actually stores the word anywhere. This doesn't mean that you can't temporarily store anything, but if you're memory constrained, generators are the way to go. This approach will likely be faster, more stable, and use less memory overall.
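If you go with the first option instead (storing an intermediate version of the tokenized data), a rough sketch is to stream the tokens straight to disk so the full word list never sits in RAM. Assumptions here: one token per line is acceptable, the output name alldata_tokens.txt is just an example, and save_path comes from the question's code:
import os
import nltk

def tokenize_to_file(in_path, out_path):
    with open(in_path, 'r', encoding="utf-8") as src, \
            open(out_path, 'w', encoding="utf-8") as dst:
        for line in src:
            for word in nltk.word_tokenize(line):
                dst.write(word + "\n")  # one token per line, nothing accumulated in memory

tokenize_to_file(os.path.join(save_path, 'alldata.txt'),
                 os.path.join(save_path, 'alldata_tokens.txt'))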

Python pointers

I was asked to write a program to find the string "error" in a file and print the matched lines in Python.
My plan: first open the file in read mode,
use fh.readlines() and store the result in a variable,
then use a for loop to iterate line by line, check for the string "error", and print the lines that match.
I was asked to use pointers in Python, since assigning the file content to a variable takes time when the log file contains huge output.
I did some research on Python pointers but found nothing useful.
Could anyone help me write the above code using pointers instead of storing the whole content in a variable?
There are no pointers in Python. Something like a pointer can be implemented, but it is not worth the effort for your case.
As pointed out in the answer to this question,
Read large text files in Python, line by line without loading it in to memory,
you can use something like:
with open("log.txt") as infile:
for line in infile:
if "error" in line:
print(line.strip()) .
The context manager will close the file automatically, and the loop reads only one line at a time. When the next line is read, the previous one will be garbage collected unless you have stored a reference to it somewhere else.
You can use a dictionary with key-value pairs: dump the log file into a dictionary where each key is a word and the value is the line number(s) it appears on. Then, if you search for the string "error", you get the line numbers it is present on and can print those lines accordingly. Since lookup in a dictionary (hash table) is constant time, O(1), the search will be fast. Building the dictionary takes time, though, depending on how you handle collisions.
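A rough sketch of that index idea, assuming whitespace-separated words is good enough for your log format (build_index is a made-up name); building it still reads the whole file once, but later lookups are then roughly O(1):
def build_index(path):
    index = {}
    with open(path) as f:
        for line_no, line in enumerate(f, start=1):
            for word in line.split():
                index.setdefault(word, []).append(line_no)
    return index

index = build_index("log.txt")
print(index.get("error", []))  # line numbers on which "error" appears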
I used the code below instead of putting the data in a variable and then looping over it.
for line in open('c182573.log', 'r').readlines():
    if 'Executing' in line:
        print(line)
So there is no way to implement pointers or references in Python.
Thanks all.
There are no pointers in Python.
Something like a pointer can be implemented, but it isn't required for your case.
Try the code below:
with open('test.txt') as f:
    content = f.readlines()

for i in content:
    if "error" in i:
        print(i.strip())
If you want to understand Python variables as pointers, see this link:
http://scottlobdell.me/2013/08/understanding-python-variables-as-pointers/

Where should I declare a list of 5,000+ words?

I am writing a game in Python in which I must periodically pull a random word from a list of words. When I prototyped my game, I declared a word_list = ['cat','dog','rat','house'] of ten words at the top of one of my modules, and I use choice(word_list) to get a random word. However, I must change this temporary hack into something more elegant, because I need to increase the size of the word list to 5,000+ words. If I do this in my current module, it will look ridiculous.
Should I put all of these words in a flat txt file and then read from that file as I need words? If so, how would I best do that? Put each word on a separate line and then read one random line? I'm not sure what the most efficient way is.
I'd put all of the words in a flat text file, one per line:
cat
dog
....
and just load it in whenever you need it with the following one-liner:
word_list = [word.rstrip() for word in open("words.txt","r")]
See: http://docs.python.org/tutorial/datastructures.html#list-comprehensions
This solution is a tad more elegant since it doesn't depend on anything but built-in functions. No importing modules required.
Be sure to cache the list once it's loaded, though; you don't want to reread the words from the file every time you need a new word.
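For example (a minimal sketch, assuming the file is called words.txt): load the list once, inside a with block so the file handle gets closed, and then draw from the cached list as often as you need:
import random

with open("words.txt") as f:
    word_list = [word.rstrip() for word in f]  # loaded once and kept in memory

print(random.choice(word_list))  # the file is never reread
print(random.choice(word_list))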
Read the words from the file at startup (or at least the line indexes), and use as required.
I would create a separate module called random_words, or something like that, hiding the list inside it and encapsulating the choice(word_list) call behind an interface function.
As for loading them from a file: since I would need to type them anyway, and a Python file is just a text file in the end, I would type them right there, probably one per line for easy maintenance.
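A rough sketch of what such a module might look like (the module and function names random_words and get_word are just illustrative):
# random_words.py
from random import choice

_word_list = [
    'cat',
    'dog',
    'rat',
    'house',
    # ... one word per line, thousands more
]

def get_word():
    return choice(_word_list)
The game code then only does from random_words import get_word and never sees the giant list.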
You could store each word on a separate line in a text file, and then use reader (from the csv module) to load the file in at startup into a list. You could then randomly choose a word from the list:
import csv

FILENAME = 'word_list.txt'
word_list = []

# Open the word list and read the words
with open(FILENAME, 'r', newline='') as f:
    reader = csv.reader(f)
    for row in reader:
        word_list.append(row[0])  # each row is a one-element list
Even with five thousand words you shouldn't go over fifty kilobytes of RAM, so I would consider this an efficient way to do it.

Update strings in a text file at a specific location

I would like to find a better solution to achieve the following three steps:
read the strings at a given row
update the strings
write the updated strings back
Below is my code, which works, but I am wondering whether there is a better (simpler) solution.
new = '99999'
f = open('C:/Users/th/Dropbox/com/MS1Ctt-P-temp.INP', 'r+')
lines = f.readlines()
# the row number we want to update is given, so just load the content
x = lines[95]
print(x)
f.close()
# replace
f1 = open('C:/Users/th/Dropbox/com/MS1Ctt-P-temp.INP')
con = f1.read()
print(con)
con1 = con.replace(x[2:8], new)  # only certain columns in this row need to be updated
print(con1)
f1.close()
# write
f2 = open('C:/Users/th/Dropbox/com/MS1Ctt-P-temp.INP', 'w')
f2.write(con1)
f2.close()
Thanks!
UPDATE: Taking the idea from jtmoulia's answer, this becomes easier:
def replace_line(file_name, line_num, col_s, col_e, text):
    lines = open(file_name, 'r').readlines()
    temp = lines[line_num]
    temp = temp.replace(temp[col_s:col_e], text)
    lines[line_num] = temp
    out = open(file_name, 'w')
    out.writelines(lines)
    out.close()
The problem with textual data, even when tabulated, is that the byte offsets are not predictable. For example, when representing numbers as strings you need one byte per digit, whereas with a binary representation (e.g. two's complement) you always need the same four or eight bytes, whether the integer is small or large.
Nevertheless, if your text format is strict enough that you can get along by replacing bytes without changing the size of the file, you can try the standard mmap module. With it, you can treat the file as a mutable byte string, modify parts of it in place, and let the kernel save the file for you.
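A rough sketch of the mmap approach, assuming the replacement text is exactly as long as the slice it overwrites (here six bytes at columns 2-8 of line 96, to mirror the original code):
import mmap

with open('C:/Users/th/Dropbox/com/MS1Ctt-P-temp.INP', 'r+b') as f:
    with mmap.mmap(f.fileno(), 0) as mm:
        # walk to the start of line 96 (index 95) by skipping 95 newlines
        offset = 0
        for _ in range(95):
            offset = mm.find(b'\n', offset) + 1
        mm[offset + 2:offset + 8] = b'999999'  # must be exactly as long as the slice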
Otherwise, any of the other answers is better suited to the problem.
Well, to begin with, you don't need to keep reopening and re-reading the file every time. The r+ mode lets you read and write to the given file.
Perhaps something like:
with open('C:/Users/th/Dropbox/com/MS1Ctt-P-temp.INP', 'r+') as f:
    lines = f.readlines()
    # ... perform whatever replacement you'd like on lines
    f.seek(0)
    f.writelines(lines)
    f.truncate()  # in case the replacement made the content shorter
Also see: Editing specific line in text file in python
When I had to do something similar (for a Webmin customization), I did it entirely in Perl, because that's what the Webmin framework used, and I found it quite easy. I assume (but don't know for sure) that there are equivalent things in Python. First, read the entire file into memory all at once (the Perl idiom for this is usually called a "slurp"). (Holding the entire file in memory rather than just one line used to make little sense, or even be impossible, but these days RAM is so large it's the simplest way to go.) Then use the split operator to divide the file into lines and put each line in a different element of a giant array. You can then use the desired line number as an index into the array (remember that array indices usually start at 0). Finally, use regular-expression processing to change the text of the line. Then change another line, and another, and another (or make another change to the same line). When you're all done, use join to put all the lines in the array back together into one giant string, and write the whole modified file out.
While I don't have the complete code handy, here's an approximate fragment of some of the Perl code so you can see what I mean:
our @filelines = ();
our $lineno = 43;
our $oldstring = 'foobar';
our $newstring = 'fee fie fo fum';

$filelines[$lineno-1] =~ s/$oldstring/$newstring/ig;
# "ig" modifiers for case-insensitivity and possibly multiple occurrences in the line
# use different modifiers at the end of the s/// construct as needed
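For reference, a rough Python translation of that Perl fragment (the file path is taken from the question; the slurp, split, substitute, and join steps mirror the Perl idea, and the values of lineno, oldstring, and newstring are just the same illustrative placeholders):
import re

with open('C:/Users/th/Dropbox/com/MS1Ctt-P-temp.INP') as f:
    filelines = f.read().split('\n')   # "slurp" the file and split into lines

lineno = 43
oldstring = 'foobar'
newstring = 'fee fie fo fum'
filelines[lineno - 1] = re.sub(oldstring, newstring,
                               filelines[lineno - 1], flags=re.IGNORECASE)

with open('C:/Users/th/Dropbox/com/MS1Ctt-P-temp.INP', 'w') as f:
    f.write('\n'.join(filelines))      # join and write the whole file back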
FILENAME = 'C:/Users/th/Dropbox/com/MS1Ctt-P-temp.INP'
lines = list(open(FILENAME))
lines[95] = lines[95][:2] + '99999' + lines[95][8:]  # strings don't support slice assignment
open(FILENAME, 'w').write(''.join(lines))

python join "large" file

In Python I have read a file into a list using file.readlines(). Later on, after some logic, I would like to put it back together into a string using fileString = ''.join(file). For some reason, even without a print call, it prints fileString to the console up to a certain point and then just stops; it does not run the rest of the program, which is not useful for me.
Why does join do this? How do I perhaps pre-allocate how much memory I would like my list/string to use so that it does not stop? Or is there some other solution?
Thank you
file is your file object in memory, not the list of lines you read; when you attempt to join on it, you don't actually have the strings you processed to work with.
How about this?
with open(file, 'r') as myfile:
    strings = myfile.readlines()

# do your stuff to strings
filestring = ''.join(strings)
Note that strings is a list of lines like this:
['my line\n', 'my other line!\n']
And as such, a large file will require quite a bit of memory. You may be better served by building a mini filter.
You should also consider what you are going to do with the resulting string. If you just want to write the contents back to a file, there is no need to join the parts first; you can use file.writelines(strings) directly.
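For example (a minimal sketch, with output.txt as a placeholder name and strings being the list returned by readlines()):
with open("output.txt", "w") as out:
    out.writelines(strings)  # writes the lines back without joining them first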
