How can I change a huge file into csv in python

I'm a beginner in Python. I have a huge text file (hundreds of GB) and I want to convert it into a CSV file. In my text file, I know the row delimiter is the string "<><><><><><><>". If a line contains that string, I want to replace it with a double quote ("). Is there a way to do this without having to read the old file and rewrite a new one?
Normally I thought I'd need to do something like this:
fin = open("input", "r")
fout = open("output", "w")
line = fin.readline()
while line != "":
    if "<><><><><><><>" in line:
        fout.write("\"")
    else:
        fout.write(line)
    line = fin.readline()
but copying hundreds of GB is wasteful. Also, I don't know if open will eat lots of memory (does it treat the file handle as a stream?)
Any help is greatly appreciated.
Note: an example of the file would be
file.txt
<><><><><><><>
abcdefeghsduai
asdjliwa
1231214 ""
<><><><><><><>
would be one row and one column in csv.

@richard-levasseur
I agree, sed seems like the right way to go. Here's a rough cut at what the OP describes:
sed -i -e 's/<><><><><><><>/"/g' foo.txt
This will do the replacement in-place in the existing foo.txt. For that reason, I recommend having the original file under some sort of version control; any of the DVCS should fit the bill.

Yes, open() treats the file as a stream, as does readline(). It'll only read the next line. If you call read(), however, it'll read everything into memory.
Your example code looks OK at first glance. Almost every solution will require you to copy the file elsewhere. It's not exactly easy to modify the contents of a file in place without a 1:1 replacement.
It may be faster to use some standard unix utilities (awk and sed most likely), but I lack the unix and bash-fu necessary to provide a full solution.

It's only wasteful if you don't have disk to spare. That is, fix it when it's a problem. Your solution looks ok as a first attempt.
It's not wasteful of memory because a file handle is a stream.

Reading lines is simply done using a file iterator:
for line in fin:
    if "<><><><><><><>" in line:
        fout.write("\"")
Also consider the CSV writer object to write CSV files, e.g.:
import csv
writer = csv.writer(open("some.csv", "wb"))
writer.writerows(someiterable)
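In case it helps to see the pieces together, here's a rough sketch (Python 3, hence the newline='' argument; input.txt and some.csv are placeholder names) that treats each sentinel-separated block as a single-column row and lets the csv module handle the quoting:
import csv
ROW_DELIM = "<><><><><><><>"
def rows(fin):
    # gather lines until the sentinel, then yield the block as a one-column row
    buf = []
    for line in fin:
        if ROW_DELIM in line:
            if buf:
                yield ["".join(buf)]
                buf = []
        else:
            buf.append(line)
    if buf:  # don't lose the final block
        yield ["".join(buf)]
with open("input.txt") as fin, open("some.csv", "w", newline="") as fout:
    csv.writer(fout).writerows(rows(fin))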

With Python you will have to create a new file for safety's sake; it will cause a lot fewer headaches than trying to write in place.
The code below reads your input one line at a time and buffers the columns (from what I understood of your test input file, one row), and once the end-of-row delimiter is hit it writes that buffer to disk, flushing manually every 1000 lines of the original file. This will save some IO as well: instead of writing every segment, 1000 writes of 32 bytes each will be faster than 4000 writes of 8 bytes.
fin = open(input_fn, "rb")
fout = open(output_fn, "wb")
row_delim = "<><><><><><><>"
write_buffer = []
for i, line in enumerate(fin):
    if not i % 1000:
        fout.flush()
    if row_delim in line:
        if write_buffer:  # skip the delimiter that opens the file
            fout.write('"%s"\r\n' % '","'.join(write_buffer))
            write_buffer = []
    else:
        write_buffer.append(line.strip())
if write_buffer:  # don't lose the final row
    fout.write('"%s"\r\n' % '","'.join(write_buffer))
fin.close()
fout.close()
Hope that helps.
EDIT: Forgot to mention: while using .readline() is not a bad thing, don't use .readlines(), which will read the entire content of the file into a list containing each line, which is incredibly inefficient. Using the built-in iterator that comes with a file object gives the best memory usage and speed.

@Constatin suggests that if you would be satisfied with replacing '<><><><><><><>\n' by '"             \n' (a double quote padded with spaces to the same length),
then the replacement string is the same length, and in that case you can craft a solution for in-place editing with mmap. You will need Python 2.6. It's vital that the file is opened in the right mode!
import mmap, os
CHUNK = 2**20
oldStr = '<><><><><><><>\n'
newStr = '"' + ' ' * 13 + '\n'  # quote padded with spaces to the same length
strLen = len(oldStr)
assert strLen == len(newStr)
f = open("myfilename", "r+")
size = os.fstat(f.fileno()).st_size
for offset in range(0, size, CHUNK):
    map = mmap.mmap(f.fileno(),
                    length=min(CHUNK + strLen, size - offset),  # not beyond EOF
                    offset=offset)
    index = 0  # start at beginning
    while 1:
        index = map.find(oldStr, index)  # find next match
        if index == -1:  # no more matches in this map
            break
        map[index:index+strLen] = newStr
f.close()
This code is not debugged! It works for me on a 3 MB test case, but it may not work on a large (> 2 GB) file - the mmap module still seems a bit immature, so I wouldn't rely on it too much.
Looking at the bigger picture, from what you've posted it isn't clear that your file will end up as valid CSV. Also be aware that the tool you're planning to use to actually process the CSV may be flexible enough to deal with the file as it stands.

If you're delimiting fields with double quotes, it looks like you need to escape the double quotes occurring in your elements (for example, 1231214 "" would need to become 1231214 \"\").
Something like
fin = open("input", "r")
fout = open("output", "w")
for line in fin:
if line.contains("<><><><><><><>"):
fout.writeline("\"")
else:
fout.writeline(line.replace('"',r'\"')
fin.close()
fout.close()
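One caveat: most CSV consumers expect an embedded quote to be doubled rather than backslash-escaped, and the csv module does that for you by default. A quick check (Python 3 here, io.StringIO just to show the output):
import csv, io
buf = io.StringIO()
csv.writer(buf).writerow(['1231214 ""'])
print(buf.getvalue())  # the two inner quotes come out doubled, not backslashed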

[For the problem exactly as stated] There's no way that this can be done without copying the data, in python or any other language. If your processing always replaced substrings with new substrings of equal length, maybe you could do it in-place. But whenever you replace <><><><><><><> with " you are changing the position of all subsequent characters in the file. Copying from one place to another is the only way to handle this.
EDIT:
Note that the use of sed won't actually save any copying...sed doesn't really edit in-place either. From the GNU sed manual:
-i[SUFFIX]
--in-place[=SUFFIX]
This option specifies that files are to be edited in-place. GNU sed does this by creating a temporary file and sending output to this file rather than to the standard output.
(emphasis mine.)

Related

Python: read a line and write back to that same line

I am using python to make a template updater for html. I read a line and compare it with the template file to see if there are any changes that need to be updated. Then I want to write any changes (if there are any) back to the same line I just read from.
Reading the file, my file pointer is now positioned on the next line after a readline(). Is there any way I can write back to the same line without having to open two file handles for reading and writing?
Here is a code snippet of what I want to do:
cLine = fp.readline()
if cLine != templateLine:
    # Here is where I would like to write back to the line I read from
    # in cLine
Updating lines in place in a text file - very difficult
Many questions on SO are trying to read the file and update it at once.
While this is technically possible, it is very difficult.
(Text) files are not organized on disk by lines, but by bytes.
The problem is that the number of bytes in an old line very often differs from the new one, and this messes up the resulting file.
Update by creating a new file
While it sounds inefficient, it is the most effective way from a programming point of view.
Just read from the file on one side, write to another file on the other side, close the files, and copy the content of the newly created file over the old one.
Or build the new content in memory and write it over the old file after you close it.
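For illustration, here's a minimal sketch of the create-new-file-and-swap variant (assuming Python 3.3+ for os.replace; data.txt and the upper-casing are just placeholders for your file and your real transformation):
import os
import tempfile
src = "data.txt"  # placeholder path
# write the transformed copy alongside the original, then swap it in;
# os.replace makes the final step atomic on POSIX
fd, tmp_path = tempfile.mkstemp(dir=os.path.dirname(os.path.abspath(src)))
with os.fdopen(fd, "w") as fout, open(src) as fin:
    for line in fin:
        fout.write(line.upper())  # stand-in for the real per-line update
os.replace(tmp_path, src)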
At the OS level the things are a bit different from how it looks from Python - from Python a file looks almost like a list of strings, with each string having arbitrary length, so it seems to be easy to swap a line for something else without affecting the rest of the lines:
l = ["Hello", "world"]
l[0] = "Good bye"
In reality, though, any file is just a stream of bytes, with strings following each other without any "padding". So you can only overwrite the data in-place if the resulting string has exactly the same length as the source string - otherwise it'll simply overwrite the following lines.
If that is the case (your processing guarantees not to change the length of strings), you can "rewind" the file to the start of the line and overwrite the line with new data. The script below converts all lines in the file to uppercase in place:
def eof(f):
    cur_loc = f.tell()
    f.seek(0, 2)
    eof_loc = f.tell()
    f.seek(cur_loc, 0)
    if cur_loc >= eof_loc:
        return True
    return False

with open('testfile.txt', 'r+t') as fp:
    while True:
        last_pos = fp.tell()
        line = fp.readline()
        new_line = line.upper()
        fp.seek(last_pos)
        fp.write(new_line)
        print "Read %s, Wrote %s" % (line, new_line)
        if eof(fp):
            break
Somewhat related: Undo a Python file readline() operation so file pointer is back in original state
This approach is only justified when your output lines are guaranteed to have the same length, and when, say, the file you're working with is really huge so you have to modify it in place.
In all other cases it would be much easier and more performant to just build the output in memory and write it back at once. Another option is to write to a temporary file, then delete the original and rename the temporary file so it replaces the original file.

Replacing a line on Python

I'm trying to convert PHP code to Python, and I have problems replacing lines. Although I find it easier to do in Python, I'm absolutely lost; I can find the line to replace, and I can add something to the end of the line, but I can't write the line back to the file.
file = open("cache.ucb", 'rb')
for line in file:
if line.split('~!')[0] == ex[4]:
line += "~!" + mask[0]
line = line.rstrip() + "\n"
# Write on the file here!
Basically, the file uses ~! as a separator, and I read each line. If the first ~!-separated token of the line is ex[4], which could be for example Catbuntu, I want to append mask[0], which could be Bousie, to the end of that line. Then I remove the newline characters and add one to the end.
And there's the problem. I want to write the file as it was, but changing only that line. Is that possible?
Assuming you're on python >=2.7, the following should work a treat
original = open(filename)
newfile = []
for line in original:
    if line.split('~!')[0] == ex[4]:
        # strip the newline before appending, then put it back
        line = line.rstrip() + "~!" + mask[0] + "\n"
    newfile.append(line)
original.close()
amended = open(filename, "w")
amended.writelines(newfile)
amended.close()
If for whatever reason you are on python 2.6 or lower, replace the second to last line with:
amended.write("".join(newfile))
EDIT: Fixed to remove a mistake copied from the question and to factor out the filename.
You cannot modify a file in-place, at least not if you want to insert characters to a line. You'll just end up overwriting the start of the next line.
There are two different ways to do this:
Read the file into memory, close it, then write back the new version.
Write a new temporary file as you go along, then move it over the original version.
So, how do you choose between them? I'll try to summarize the differences, ordered so that each one typically trumps the ones below if it's important (but that's just "typically"—you have to think through your own use case):
2 doesn't require holding the entire thing in memory. If your file is, say, 20GB long, this is obviously a huge win; if it's 16KB, it doesn't matter.
2 makes the entire operation atomic. Even if it fails halfway through, or some other process tries to read the file while you're in the middle of changing it, there is no way anyone can see some invalid half-modified file; they will see either the original file, or the new one.
2 requires some free disk space (because there are, temporarily, two copies of the file at the same time).
2 is a huge pain in the neck if you care about both Windows and POSIX.
2 can involve copying across filesystems if the original file and the temp directory are on different filesystems, unless you're careful about it.
1 is simpler if none of the above are an issue.
Drakekin's answer tells you how to do #1.
Here's how to do #2 if you don't care about Windows or about cross-filesystem issues:
infile = open("cache.ucb", 'rb')
outfile = tempfile.NamedTemporaryFile(delete=False)
for line in infile:
if line.split('~!')[0] == ex[4]:
line += "~!" + mask[0]
line = line.rstrip() + "\n"
outfile.write(line)
infile.close()
os.rename(outfile.name, "cache.ucb")
outfile.close()
You can solve the cross-filesystem problem by, e.g., passing dir=os.path.dirname(original path) to the NamedTemporaryFile constructor, but only if you're sure you'll always have permissions to create a new file alongside the original (which isn't always guaranteed, just because you have permission to rewrite the original—UNIX permissions, Windows ACLs, the OS X sandbox, etc. all provide ways for that to be false).
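For concreteness, the dir= variant might look like this (a sketch only, using the question's cache.ucb):
import os
import tempfile
original = "cache.ucb"
# keep the temp file on the same filesystem as the original,
# so the final rename is a cheap same-device move
outfile = tempfile.NamedTemporaryFile(
    dir=os.path.dirname(os.path.abspath(original)), delete=False)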
To solve the Windows problem… well, start with Is an atomic file rename (with overwrite) possible on Windows, and similar discussions all over the internet.
Open a separate output file in mode 'wb' and put outfile.write(line) at the end of your loop.
You don't have your file open for writing.
file = open("cache.ucb", 'rb')
This line opens a file for reading in binary mode. You need to open it for writing also.
Try opening the file in write mode, 'w', and writing the lines back.
Or you can open your file for reading and writing at the beginning and write inside your loop:
file = open("cache.ucb", 'r+')
(Note that 'a+' forces every write to the end of the file on most platforms, so 'r+' is the mode you want for overwriting in place.)

Update strings in a text file at a specific location

I would like to find a better solution to achieve the following three steps:
read strings at a given row
update strings
write the updated strings back
Below is my code, which works, but I am wondering whether there are any better (simpler) solutions.
new='99999'
f=open('C:/Users/th/Dropbox/com/MS1Ctt-P-temp.INP','r+')
lines=f.readlines()
#the row number we want to update is given, so just load the content
x = lines[95]
print(x)
f.close()
#replace
f1=open('C:/Users/th/Dropbox/com/MS1Ctt-P-temp.INP')
con = f1.read()
print con
con1 = con.replace(x[2:8],new) #only certain columns in this row needs to be updated
print con1
f1.close()
#write
f2 = open('C:/Users/th/Dropbox/com/MS1Ctt-P-temp.INP', 'w')
f2.write(con1)
f2.close()
Thanks!
UPDATE: I got an idea from jtmoulia; this time it becomes easier:
def replace_line(file_name, line_num, col_s, col_e, text):
    lines = open(file_name, 'r').readlines()
    temp = lines[line_num]
    temp = temp.replace(temp[col_s:col_e], text)
    lines[line_num] = temp
    out = open(file_name, 'w')
    out.writelines(lines)
    out.close()
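With the question's original numbers, the call would presumably be something like (path shortened here):
replace_line('MS1Ctt-P-temp.INP', 95, 2, 8, '99999')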
The problem with textual data, even when tabulated, is that the byte offsets are not predictable. For example, when representing numbers as strings you use one byte per digit, whereas in a binary representation (e.g. two's complement) you always need four or eight bytes, whether the integer is small or large.
Nevertheless, if your text format is strict enough that you can get along by replacing bytes without changing the size of the file, you can try the standard mmap module. With it, you'll be able to treat the file as a mutable byte string and modify parts of it in place, letting the kernel do the file saving for you.
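To make that concrete, here's a compact sketch of a same-length mmap replacement (Python 3; the field values are placeholders, and the old and new strings must be exactly the same length):
import mmap
with open("MS1Ctt-P-temp.INP", "r+b") as f:
    with mmap.mmap(f.fileno(), 0) as m:
        old, new = b"12345", b"99999"  # must be equal lengths
        i = m.find(old)
        if i != -1:
            m[i:i + len(new)] = new  # patched in place, file size unchanged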
Otherwise, any of the other answers are much better suited for the problem.
Well, to begin with you don't need to keep reopening and reading from the file every time. The r+ mode allows you to read and write to the given file.
Perhaps something like
with open('C:/Users/th/Dropbox/com/MS1Ctt-P-temp.INP', 'r+') as f:
    lines = f.readlines()
    # ... Perform whatever replacement you'd like on lines
    f.seek(0)
    f.writelines(lines)
Also, Editing specific line in text file in python
When I had to do something similar (for a Webmin customization), I did it entirely in PERL because that's what the Webmin framework used, and I found it quite easy. I assume (but don't know for sure) there are equivalent things in Python. First read the entire file into memory all at once (the PERL way to do this is probably called "slurp"). (This idea of holding the entire file in memory rather than just one line used to make little sense {or even be impossible}. But these days RAM is so large it's the only way to go.) Then use the split operator to divide the file into lines and put each line in a different element of a giant array. You can then use the desired line number as an index into the array (remember array indices usually start with 0). Finally, use "regular expression" processing to change the text of the line. Then change another line, and another, and another (or make another change to the same line). When you're all done, use join to put all the lines in the array back together into one giant string. Then write the whole modified file out.
While I don't have the complete code handy, here's an approximate fragment of some of the PERL code so you can see what I mean:
our @filelines = ();
our $lineno = 43;
our $oldstring = 'foobar';
our $newstring = 'fee fie fo fum';
$filelines[$lineno-1] =~ s/$oldstring/$newstring/ig;
# "ig" modifiers for case-insensitivity and possible multiple occurrences in the line
# use different modifiers at the end of the s/// construct as needed
FILENAME = 'C:/Users/th/Dropbox/com/MS1Ctt-P-temp.INP'
lines = list(open(FILENAME))
# strings don't support slice assignment, so rebuild the line instead
lines[95] = lines[95][:2] + '99999' + lines[95][8:]
open(FILENAME, 'w').write(''.join(lines))

How do I remove lines from a big file in Python, within limited environment

Say I have a 10GB HDD Ubuntu VPS in the USA (and I live somewhere else), and I have a 9GB text file on the hard drive. I have 512MB of RAM, and about the same amount of swap.
Given the fact that I cannot add more HDD space and cannot move the file to somewhere else to process, is there an efficient method to remove some lines from the file using Python (preferably, but any other language will be acceptable)?
How about this? It edits the file in place. I've tested it on some small text files (in Python 2.6.1), but I'm not sure how well it will perform on massive files because of all the jumping around, but still...
I've used an indefinite while loop with a manual EOF check, because for line in f: didn't work correctly (presumably all the jumping around messes up the normal iteration). There may be a better way to check this, but I'm relatively new to Python, so someone please let me know if there is.
Also, you'll need to define the function isRequired(line).
writeLoc = 0
readLoc = 0
with open("filename", "r+") as f:
    while True:
        line = f.readline()
        # manual EOF check; not sure of the correct
        # Python way to do this manually...
        if line == "":
            break
        # save how far we've read
        readLoc = f.tell()
        # if we need this line, write it and
        # update the write location
        if isRequired(line):
            f.seek(writeLoc)
            f.write(line)
            writeLoc = f.tell()
        f.seek(readLoc)
    # finally, chop off the rest of file that's no longer needed
    f.truncate(writeLoc)
Try this:
currentReadPos = 0
removedLinesLength = 0
for line in file:
    currentReadPos = file.tell()
    if remove(line):
        removedLinesLength += len(line)
    else:
        # write the kept line at its compacted position
        file.seek(currentReadPos - len(line) - removedLinesLength)
        file.write(line)
        file.flush()
        file.seek(currentReadPos)
I have not run this, but the idea is to modify the file in place by overwriting the lines you want to remove with lines you want to keep. I am not sure how the seeking and modifying interact with the iteration over the file.
Update:
I have tried fileinput with inplace by creating a 1GB file. What I expected was different from what happened. I read the documentation properly this time.
Optional in-place filtering: if the keyword argument inplace=1 is passed to fileinput.input() or to the FileInput constructor, the file is moved to a backup file and standard output is directed to the input file (if a file of the same name as the backup file already exists, it will be replaced silently).
from docs/fileinput
So, this doesn't seem to be an option now for you. Please check other answers.
Before Edit:
If you are looking to edit the file in place, then check out Python's fileinput module - Docs.
I am really not sure about its efficiency when used with a 10 GB file, but to me this seemed to be the only option you have using Python.
Just sequentially read and write to the files.
f.readlines() returns a list containing all the lines of data in the file. If given an optional parameter sizehint, it reads that many bytes from the file and enough more to complete a line, and returns the lines from that. This is often used to allow efficient reading of a large file by lines, but without having to load the entire file in memory. Only complete lines will be returned.
Source
Process the file in chunks of 10-20 MB or more of complete lines.
This would be the fastest way.
Another way of doing this is to stream the file and filter it using AWK, for example.
example pseudo code:
file = open(rw)
linesCnt = 50
newReadOffset = 0
tmpWrtOffset = 0
rule = 1

processFile()
{
    while (rule)
    {
        (lines, newReadOffset) = getLines(file, newReadOffset)
        if lines:
            kept = [line for line in lines if line == cool]
            # writeBackToFile should return the new offset to write at next time
            tmpWrtOffset = writeBackToFile(file, kept, tmpWrtOffset)
        else:
            rule = 0
    }
}
To resize the file at the end, use truncate(size=None).
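For what it's worth, here is a runnable Python rendering of the same idea, under the assumption that lines are only ever dropped (so the write offset can never overtake the read offset); big.txt and keep() are placeholders:
def keep(line):
    return b"cool" in line  # stand-in for the real filter
with open("big.txt", "r+b") as f:
    read_pos = write_pos = 0
    while True:
        f.seek(read_pos)
        lines = f.readlines(20 * 2 ** 20)  # ~20 MB of complete lines per batch
        if not lines:
            break
        read_pos = f.tell()
        f.seek(write_pos)
        f.writelines(line for line in lines if keep(line))
        write_pos = f.tell()
    f.truncate(write_pos)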

How do I modify the last line of a file?

The last line of my file is:
29-dez,40,
How can I modify that line so that it reads:
29-Dez,40,90,100,50
Note: I don't want to write a new line. I want to take the same line and put new values after 29-Dez,40,
I'm new at python. I'm having a lot of trouble manipulating files and for me every example I look at seems difficult.
Unless the file is huge, you'll probably find it easier to read the entire file into a data structure (which might just be a list of lines), and then modify the data structure in memory, and finally write it back to the file.
On the other hand, maybe your file is really huge - multiple GBs at least. In which case: the last line is probably terminated with a newline character; if you seek to just before it, you can overwrite it with the new text at the end of the last line.
So perhaps:
f = open("foo.file", "wb")
f.seek(-len(os.linesep), os.SEEK_END)
f.write("new text at end of last line" + os.linesep)
f.close()
(Modulo line endings on different platforms)
To expand on what Doug said, in order to read the file contents into a data structure you can use the readlines() method of the file object.
The below code sample reads the file into a list of "lines", edits the last line, then writes it back out to the file:
#!/usr/bin/python
MYFILE="file.txt"
# read the file into a list of lines
lines = open(MYFILE, 'r').readlines()
# now edit the last line of the list of lines
new_last_line = lines[-1].rstrip() + "90,100,50"  # the line already ends with a comma
lines[-1] = new_last_line
# now write the modified list back out to the file
open(MYFILE, 'w').writelines(lines)
If the file is very large then this approach will not work well, because this reads all the file lines into memory each time and writes them back out to the file, which is very inefficient. For a small file however this will work fine.
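For the huge-file case, a hedged sketch that seeks in from the end and rewrites only the tail; it assumes the last line fits in the final 4 KB and, per the question, already ends with a comma:
with open("file.txt", "r+b") as f:
    f.seek(0, 2)  # jump to EOF
    size = f.tell()
    pos = max(size - 4096, 0)
    f.seek(pos)
    tail = f.read()
    # locate the start of the last line (ignoring a trailing newline, if any)
    start = pos + tail.rfind(b"\n", 0, len(tail) - 1) + 1
    last = tail[start - pos:].rstrip(b"\r\n")
    f.seek(start)
    f.write(last + b"90,100,50\n")  # the line already ends with a comma
    f.truncate()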
Don't work with files directly, make a data structure that fits your needs in form of a class and make read from/write to file methods.
I recently wrote a script to do something very similar to this. It would traverse a project, find all module dependencies and add any missing import statements. I won't clutter this post up with the entire script, but I'll show how I went about modifying my files.
import os
from mmap import mmap
def insert_import(filename, text):
    if len(text) < 1:
        return
    f = open(filename, 'r+')
    m = mmap(f.fileno(), os.path.getsize(filename))
    origSize = m.size()
    # grow the map so there is room for the inserted text
    m.resize(origSize + len(text))
    pos = 0
    while True:
        l = m.readline()
        if l.startswith(('import', 'from')):
            continue
        else:
            # back up to the start of the first non-import line
            pos = m.tell() - len(l)
            break
    # shift everything after the insertion point down, then drop the text in
    m[pos+len(text):] = m[pos:origSize]
    m[pos:pos+len(text)] = text
    m.close()
    f.close()
Summary: This snippet takes a filename and a blob of text to insert. It finds the last import statement already present, and sticks the text in at that location.
The part I suggest paying most attention to is the use of mmap. It lets you work with files in the same manner you may work with a string. Very handy.
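As a tiny illustration of that string-like interface (Python 3, where mmap exposes bytes; demo.txt stands in for any non-empty file):
import mmap
with open("demo.txt", "r+b") as f:
    m = mmap.mmap(f.fileno(), 0)
    print(m[:6])  # slice it like a byte string
    print(m.find(b"import"))  # search it like one, too
    m.close()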
