python join "large" file - python

In python I have read in a file into a list using file.readlines() , later on after some logic, I would like to put it back together in a string using fileString = ''.join(file), for some reason, even without a print function, it prints the fileString out to the console up to a certain point, then it just stops. It does not run the rest of the program which is not useful for me.
Why does join do this, how do I perhaps pre-allocate how much memory I would like my list/string to use so that it does not stop. Or some other solution too.
Thank you

File is your file pointer in memory. When you attempt to join on it, you don't actually have a string to work with.
How about this?
with open(file, 'rb') as myfile:
strings = myfile.readlines()
# do your stuff to strings
filestring = ''.join(strings)
Note that strings is a list of lines like this:
['my line\n', 'my other line!\n']
And as such, a large file will require quite a bit of memory. You may be better served by building a mini filter.

You should also consider what you are going to do with the resulting string. If you just want to write the contents back to a file, there is no need to join the parts first, you can use file.writelines(strings) directly.

Related

How to read a specific range of lines in an external file in python?

Lets say you have a python file with 50 lines of code in it, and you want to read a specific range lines into a list. If you want to read ALL the lines in the file, you can just use the code from this answer:
with open('yourfile.py') as f:
content = f.readlines()
print(content)
But what if you want to read a specific range of lines, like reading line 23-27?
I tried this, but it doesn't work:
f.readlines(23:27)
You were close. readlines returns a list and you can slice that, but it's invalid syntax to try and pass the slice directly in the function call.
f.readlines()[23:27]
If the file is very large, avoid the memory overhead of reading the entire file:
start, stop = 23, 27
for i in range(start):
next(f)
content = []
for i in range(stop-start):
content.append(next(f))
Try this:
sublines = content[23:27]
If there are lots and lots of lines in your file, I believe you should consider using f.readline() (without an s) 27 times, and only save your lines starting wherever you want. :)
Else, the other ones solution is what I would have done too (meaning : f.readlines()[23:28]. 28, because as far as I remember, outer range is excluded.

How to load a big text file efficiently in python

I have a text file containing 7000 lines of strings. I got to search for a specific string based upon few params.
Some are saying that the below code wouldn't be efficient (speed and memory usage).
f = open("file.txt")
data = f.read().split() # strings as list
First of all, if don't even make it as a list, how would I even start searching at all?
Is it efficient to load the entire file? If not, how to do it?
To filter anything, we need to search for that we need to read it right!
A bit confused
iterate over each line of the file, without storing it. This will make for program memory Efficient.
with open(filname) as f:
for line in f:
if "search_term" in line:
break

Update strings in a text file at a specific location

I would like to find a better solution to achieve the following three steps:
read strings at a given row
update strings
write the updated strings back
Below are my code which works but I am wondering is there any better (simple) solutions?
new='99999'
f=open('C:/Users/th/Dropbox/com/MS1Ctt-P-temp.INP','r+')
lines=f.readlines()
#the row number we want to update is given, so just load the content
x = lines[95]
print(x)
f.close()
#replace
f1=open('C:/Users/th/Dropbox/com/MS1Ctt-P-temp.INP')
con = f1.read()
print con
con1 = con.replace(x[2:8],new) #only certain columns in this row needs to be updated
print con1
f1.close()
#write
f2 = open('C:/Users/th/Dropbox/com/MS1Ctt-P-temp.INP', 'w')
f2.write(con1)
f2.close()
Thanks!
UPDATE: get an idea from jtmoulia this time it becomes easier
def replace_line(file_name, line_num, col_s, col_e, text):
lines = open(file_name, 'r').readlines()
temp=lines[line_num]
temp = temp.replace(temp[col_s:col_e],text)
lines[line_num]=temp
out = open(file_name, 'w')
out.writelines(lines)
out.close()
The problem with textual data, even when tabulated, is that the byte offsets are not predictable. For example, when representing numbers with strings you have one byte per digit, whereas when using binary (e.g. two's complement) you always need four or eight bytes either for small and large integers.
Nevertheless, if your text format is strict enough you can get along by replacing bytes without changing the size of the file, you can try using the standard mmap module. With it, you'll be able to treat a file as a mutable byte string and modify parts of it inplace and letting the kernel do the file saving for you.
Otherwise, whatever of the other answers are much better suited for the problem.
Well, to begin with you don't need to keep reopening and reading from the file every time. The r+ mode allows you to read and write to the given file.
Perhaps something like
with open('C:/Users/th/Dropbox/com/MS1Ctt-P-temp.INP', 'r+') as f:
lines = f.readlines()
#... Perform whatever replacement you'd like on lines
f.seek(0)
f.writelines(lines)
Also, Editing specific line in text file in python
When I had to do something similar (for a Webmin customization), I did it entirely in PERL because that's what the Webmin framework used, and I found it quite easy. I assume (but don't know for sure) there are equivalent things in Python. First read the entire file into memory all at once (the PERL way to do this is probably called "slurp"). (This idea of holding the entire file in memory rather than just one line used to make little sense {or even be impossible}. But these days RAM is so large it's the only way to go.) Then use the split operator to divide the file into lines and put each line in a different element of a giant array. You can then use the desired line number as an index into the array (remember array indices usually start with 0). Finally, use "regular expression" processing to change the text of the line. Then change another line, and another, and another (or make another change to the same line). When you're all done, use join to put all the lines in the array back together into one giant string. Then write the whole modified file out.
While I don't have the complete code handy, here's an approximate fragment of some of the PERL code so you can see what I mean:
our #filelines = ();
our $lineno = 43;
our $oldstring = 'foobar';
our $newstring = 'fee fie fo fum';
$filelines[$lineno-1] =~ s/$oldstring/$newstring/ig;
# "ig" modifiers for case-insensitivity and possible multiple occurences in the line
# use different modifiers at the end of the s/// construct as needed
FILENAME = 'C:/Users/th/Dropbox/com/MS1Ctt-P-temp.INP'
lines = list(open(FILENAME))
lines[95][2:8] = '99999'
open(FILENAME, 'w').write(''.join(lines))

Optimizing find and replace over large files in Python

I am a complete beginner to Python or any serious programming language for that matter. I finally got a prototype code to work but I think it will be too slow.
My goal is to find and replace some Chinese characters across all files (they are csv) in a directory with integers as per a csv file I have. The files are nicely numbered by year-month, for example 2000-01.csv, and will be the only files in that directory.
I will be looping across about 25 files that are in the neighborhood of 500mb each (and about a million lines). The dictionary I will be using will have about 300 elements and I will be changing unicode (Chinese character) to integers. I tried with a test run and, assuming everything scales up linearly (?), it looks like it would take about a week for this to run.
Thanks in advance. Here is my code (don't laugh!):
# -*- coding: utf-8 -*-
import os, codecs
dir = "C:/Users/Roy/Desktop/test/"
Dict = {'hello' : 'good', 'world' : 'bad'}
for dirs, subdirs, files in os.walk(dir):
for file in files:
inFile = codecs.open(dir + file, "r", "utf-8")
inFileStr = inFile.read()
inFile.close()
inFile = codecs.open(dir + file, "w", "utf-8")
for key in Dict:
inFileStr = inFileStr.replace(key, Dict[key])
inFile.write(inFileStr)
inFile.close()
In your current code, you're reading the whole file into memory at once. Since they're 500Mb files, that means 500Mb strings. And then you do repeated replacements of them, which means Python has to create a new 500Mb string with the first replacement, then destroy the first string, then create a second 500Mb string for the second replacement, then destroy the second string, et cetera, for each replacement. That turns out to be quite a lot of copying of data back and forth, not to mention using a lot of memory.
If you know the replacements will always be contained in a line, you can read the file line by line by iterating over it. Python will buffer the read, which means it will be fairly optimized. You should open a new file, under a new name, for writing the new file simultaneously. Perform the replacement on each line in turn, and write it out immediately. Doing this will greatly reduce the amount of memory used and the amount of memory copied back and forth as you do the replacements:
for file in files:
fname = os.path.join(dir, file)
inFile = codecs.open(fname, "r", "utf-8")
outFile = codecs.open(fname + ".new", "w", "utf-8")
for line in inFile:
newline = do_replacements_on(line)
outFile.write(newline)
inFile.close()
outFile.close()
os.rename(fname + ".new", fname)
If you can't be certain if they'll always be on one line, things get a little harder; you'd have to read in blocks manually, using inFile.read(blocksize), and keep careful track of whether there might be a partial match at the end of the block. Not as easy to do, but usually still worth it to avoid the 500Mb strings.
Another big improvement would be if you could do the replacements in one go, rather than trying a whole bunch of replacements in order. There are several ways of doing that, but which fits best depends entirely on what you're replacing and with what. For translating single characters into something else, the translate method of unicode objects may be convenient. You pass it a dict mapping unicode codepoints (as integers) to unicode strings:
>>> u"\xff and \ubd23".translate({0xff: u"255", 0xbd23: u"something else"})
u'255 and something else'
For replacing substrings (and not just single characters), you could use the re module. The re.sub function (and the sub method of compiled regexps) can take a callable (a function) as the first argument, which will then be called for each match:
>>> import re
>>> d = {u'spam': u'spam, ham, spam and eggs', u'eggs': u'saussages'}
>>> p = re.compile("|".join(re.escape(k) for k in d))
>>> def repl(m):
... return d[m.group(0)]
...
>>> p.sub(repl, u"spam, vikings, eggs and vikings")
u'spam, ham, spam and eggs, vikings, saussages and vikings'
I think you can lower memory use greatly (and thus limit swap use and make things faster) by reading a line at a time and writing it (after the regexp replacements already suggested) to a temporary file - then moving the file to replace the original.
A few things (unrelated to the optimization problem):
dir + file should be os.path.join(dir, file)
You might want to not reuse infile, but instead open (and write to) a separate outfile. This also won't increase performance, but is good practice.
I don't know if you're I/O bound or cpu bound, but if your cpu utilization is very high, you may want to use threading, with each thread operating on a different file (so with a quad core processor, you'd be reading/writing 4 different files simultaneously).
Open the files read/write ('r+') and avoid the double open/close (and likely associated buffer flush). Also, if possible, don't write back the entire file, seek and write back only the changed areas after doing the replace on the file's contents. Read, replace, write changed areas (if any).
That still won't help performance too much though: I'd profile and determine where the performance hit actually is and then move onto optimising it. It could just be the reading of the data from disk that's very slow, and there's not much you can do about that in Python.

Delineating a Read File

Not really too sure how to word this question, therefore if you don't particularly understand it then I can try again.
I have a file called example.txt and I'd like to import this into my Python program. Here I will do some calculations with what it contains and other things that are irrelevant.
Instead of me importing this file, going through it line-by-line and extracting the information I want.. can Python do it instead? As in, if I structure the .txt correctly (whether it be key / value pairs seperated by an equals on each line), is there a current Python 'way' where it can handle it all and I work with that?
with open("example.txt") as f:
for line in f:
key, value = line.strip().split("=")
do_something(key,value)
looks like a starting point if I understand you correctly. You need Python 2.6 or 3.x for this.
Another place to look is the csv module that can parse comma-separated value files - and you can tell it to use = as a separator instead. This will abstract away some of the "manual work" in that previous example - but it seems your example doesn't especially need that kind of abstraction.
Another idea:
with open("example.txt") as f:
d = dict([line.strip().split("=") for line in f])
Now that's concise and pythonic :)
for line in open("file")
key, value = line.strip().split("=")
key=key.strip()
value=value.strip()
do_something(key,value)
There's also another method - you can create a valid python file (let it be a list, dict definition or whatever else), read its content using
f = open('file.txt', r)
content = f.read() #assuming file isn't too long
And then just parse it:
parsedContent = eval(content)
You can pass any environment to eval (see docs), so it might not have access to your globals and locals. This is evil and wrong, but in small program that won't be distributed and won't get 'file.txt' from network or from so called malicious user - you can use it.

Categories