I have a file in CSV format where the delimiter is the ASCII unit separator ^_ and the line terminator is the ASCII record separator ^^ (obviously, since these are nonprinting characters, I've just used one of the standard ways of writing them here). I've written plenty of code that reads and writes CSV files, so my issue isn't with Python's csv module per se. The problem is that the csv module doesn't support reading (but it does support writing) line terminators other than a carriage return or line feed, at least as of Python 2.6 where I just tested it. The documentation says that this is because it's hard coded, which I take to mean it's done in the C code that underlies the module, since I didn't see anything in the csv.py file that I could change.
Does anyone know a way around this limitation (patch, another CSV module, etc.)? I really need to read in a file where I can't use carriage returns or new lines as the line terminator because those characters will appear in some of the fields, and I'd like to avoid writing my own custom reader code if possible, even though that would be rather simple to meet my needs.
Why not supply a custom iterable to the csv.reader function? Here is a naive implementation which reads the entire contents of the CSV file into memory at once (which may or may not be desirable, depending on the size of the file):
import csv

def records(path):
    # The record separator written '^^' in the question is really '\x1e'.
    with open(path) as f:
        contents = f.read()
    return (record for record in contents.split('\x1e'))

# The unit-separator delimiter written '^_' is '\x1f'.
reader = csv.reader(records('input.csv'), delimiter='\x1f')
I think that should work.
Is it possible to parse a file line by line, and edit a line in-place while going through the lines?
It can be simulated using a backup file, as the stdlib's fileinput module does.
Here's an example script that removes lines that do not satisfy some_condition from files given on the command line or stdin:
#!/usr/bin/env python
# grep_some_condition.py
import fileinput

for line in fileinput.input(inplace=True, backup='.bak'):
    if some_condition(line):
        print line,  # trailing comma avoids a second newline; this goes to the current file
Example:
$ python grep_some_condition.py first_file.txt second_file.txt
On completion, first_file.txt and second_file.txt will contain only the lines that satisfy the some_condition() predicate.
The fileinput module has a rather ugly API; I find the in_place module much nicer for this task. Example for Python 3:
import in_place

with in_place.InPlace('data.txt') as file:
    for line in file:
        line = line.replace('test', 'testZ')
        file.write(line)
The main differences from fileinput:
Instead of hijacking sys.stdout, a new filehandle is returned for writing.
The filehandle supports all of the standard I/O methods, not just readline().
Important Notes:
This solution drops every line from the file that you don't write back with file.write().
Also, if the process is interrupted, you lose any line in the file that has not already been re-written.
No. You cannot safely write to a file you are also reading, as any changes you make to the file could overwrite content you have not read yet. To do it safely you'd have to read the file into a buffer, updating any lines as required, and then re-write the file.
If you're replacing the file's content byte-for-byte (i.e. the text you are replacing is the same length as the new string you are replacing it with), then you can get away with it, but it's a hornets' nest, so save yourself the hassle: read the full file, replace the content in memory (or via a temporary file), and write it out again.
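For example, here is a minimal sketch of the temporary-file variant (the file name and the replacement performed are purely illustrative):

import os
import tempfile

# Write the updated content to a temp file in the same directory, then
# swap it into place with an atomic rename.
src = 'data.txt'
with open(src) as f, tempfile.NamedTemporaryFile(
        'w', dir=os.path.dirname(os.path.abspath(src)), delete=False) as tmp:
    for line in f:
        tmp.write(line.replace('old', 'new'))
os.replace(tmp.name, src)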
If you only intend to perform localized changes that do not change the length of the part of the file that is modified (e.g. changing all characters to lower case), then you can actually overwrite the old contents of the file dynamically.
To do that, you can use random file access with the seek() method of a file object.
Alternatively, you may be able to use an mmap object to treat the whole file as a mutable string. Keep in mind that mmap objects may impose a maximum file-size limit in the 2-4 GB range on a 32-bit CPU, depending on your operating system and its configuration.
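As a rough sketch of that mmap approach (the length-preserving edit here, upper-casing the bytes, is only an illustration):

import mmap

# Map the file and rewrite it in place; this only works because the
# transformation leaves the length of the data unchanged.
with open('data.txt', 'r+b') as f:
    with mmap.mmap(f.fileno(), 0) as mm:
        mm[:] = mm[:].upper()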
You have to back up by the length of the line in bytes. Assuming you used readline() on a file opened in binary mode (so lengths and offsets agree), you can get the length of the line and back up using:
file.seek(offset[, whence])
Set whence to SEEK_CUR, set offset to -length.
See Python Docs or look at the manpage for seek.
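Putting that together, a small sketch (binary mode, since relative seeks are not allowed on text-mode files in Python 3; the upper-casing edit is only illustrative):

import os

with open('data.txt', 'r+b') as f:
    line = f.readline()
    f.seek(-len(line), os.SEEK_CUR)  # back up by the length of the line just read
    f.write(line.upper())            # same number of bytes, so nothing else is disturbed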
I have a Python script that is supposed to read a file. The issue is that the file is very large, so for efficiency I decided that my script should only read from line 650000 onward, since the previous lines do not contain relevant information.
Is there any way to modify only lines 650000 to EOF, so that, for example, if I read() this file only those specific lines would appear?
Files are not line-oriented, they are blocks of bytes.
There's no way, short of reading the data in, to figure out how many bytes make up those first 650,000 lines, so you'd have to do that just in order to skip them.
Starting to modify a file at a certain offset is possible, but that offset will be in bytes, which is the addressing unit used by files.
Skipping lines can be done easily enough:
with open("myfile.txt", "w+t") as f:
for i in xrange(650000):
f.readline() # Read a line and throw it away
f.write("hello")
This will truncate the file so that there will be no data after the hello (but 650,000 lines before it, of course).
I need to read 4 specific lines of a file in Python. I don't want to read the whole file and then pick four lines out of it (for the sake of memory). Does anyone know how to do that?
Thanks!
P.S. I used the following code, but apparently it reads the whole file and then takes 4 lines out of it.
a=open("file", "r")
b=a.readlines() [c:d]
You have to read at least up to the lines you are interested in ... you can use itertools.islice to grab a slice:
import itertools
interesting_lines = list(itertools.islice(a, c, d))
but it still reads up to those lines
Files, at least on Macs and Windows and Linux and other UNIXy systems, are just streams of bytes; there's no concept of "line" in the file structure, just bytes that happen to represent newline characters. So the only way to find the Nth line in the file is to start at the beginning and read until you've found (N-1) newlines. You don't have to store all the content you scan through, but you do have to read it.
Then you have to read and store from that point until you find 4 more newlines.
You can do this in Python, but it's not clear to me that it's a win compared to using the straightforward approach that reads more than it needs to; feels like premature optimization to me.
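For concreteness, a minimal sketch of that scan-and-stop approach (the function name and 0-based numbering are just for illustration):

def read_four_lines(path, start):
    # Return lines start .. start+3 (0-based) without keeping earlier lines
    # in memory; they still have to be read past.
    wanted = []
    with open(path) as f:
        for i, line in enumerate(f):
            if i >= start:
                wanted.append(line)
                if len(wanted) == 4:
                    break
    return wanted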
I'm reading through a large file, and processing it.
I want to be able to jump to the middle of the file without it taking a long time.
Right now I am doing:
f = gzip.open(input_name)
for i in range(1000000):
    f.read()  # just skipping the first 1M rows
for line in f:
    do_something(line)
Is there a faster way to skip the lines in the zipped file?
If I have to unzip it first, I'll do that, but there has to be a way.
It's of course a text file, with \n separating lines.
The nature of gzipping is such that there is no longer the concept of lines when the file is compressed -- it's just a binary blob. Check out this for an explanation of what gzip does.
To read the file, you'll need to decompress it -- the gzip module does a fine job of it. Like other answers, I'd also recommend itertools to do the jumping, as it will carefully make sure you don't pull things into memory, and it will get you there as fast as possible.
import gzip
import itertools

with gzip.open(filename) as f:
    # jump to `initial_row`
    for line in itertools.islice(f, initial_row, None):
        do_something(line)  # have a party
Alternatively, if this is a CSV that you're going to be working with, you could also try timing pandas' parsing, as it can handle decompressing gzip. That would look like: parsed_csv = pd.read_csv(filename, compression='gzip').
Also, to be extra clear: when you iterate over a file object in Python (i.e. the f variable above), you iterate over lines. You do not need to think about the '\n' characters.
You can use itertools.islice, passing a file object f and a starting point; it will still advance the iterator, but more efficiently than calling next 1000000 times:
from itertools import islice

for line in islice(f, 1000000, None):
    print(line)
Not overly familiar with gzip but I imagine f.read() reads the whole file so the next 999999 calls are doing nothing. If you wanted to manually advance the iterator you would call next on the file object i.e next(f).
Calling next(f) won't mean all the lines are read into memory at once either, it advances the iterator one line at a time so if you want to skip a line or two in a file or a header it can be useful.
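A tiny sketch of that (the file name and process() are placeholders):

with open('data.csv') as f:
    next(f)            # skip a single header line
    for line in f:
        process(line)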
The consume recipe that #wwii suggested is also worth checking out.
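For reference, the consume recipe from the itertools documentation looks roughly like this; applied to a file object it skips n lines without building a list:

import collections
from itertools import islice

def consume(iterator, n=None):
    # Advance the iterator n steps ahead; if n is None, consume it entirely.
    if n is None:
        collections.deque(iterator, maxlen=0)  # exhaust at C speed
    else:
        next(islice(iterator, n, n), None)     # advance to the empty slice at position n

# e.g. consume(f, 1000000) skips the first million lines of f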
Not really.
If you know the number of bytes you want to skip, you can use .seek(amount) on the file object, but in order to skip a number of lines, Python has to go through the file byte by byte to count the newline characters.
The only alternative that comes to mind is if you are handling a static file that won't change. In that case, you can index it once, i.e. find and remember the position of each line. If you keep that in e.g. a dictionary that you save and load with pickle, you can skip to any line in quasi-constant time with seek.
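A rough sketch of that indexing idea (names are illustrative; a list indexed by line number works as well as a dictionary, and it only helps on the uncompressed data, since a gzip stream cannot be seeked into directly):

import pickle

def build_line_index(path):
    # Record the byte offset at which each line starts, so a later run can
    # f.seek(offsets[n]) straight to line n.
    offsets = []
    with open(path, 'rb') as f:
        pos = f.tell()
        line = f.readline()
        while line:
            offsets.append(pos)
            pos = f.tell()
            line = f.readline()
    return offsets

# with open('big.idx', 'wb') as idx:
#     pickle.dump(build_line_index('big.txt'), idx)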
It is not possible to randomly seek within a gzip file. Gzip is a stream algorithm and so it must always be uncompressed from the start until where your data of interest lies.
It is not possible to jump to a specific line without an index. Lines can be scanned forward, or scanned backwards from the end of the file, in successive chunks.
You should consider a different storage format for your needs. What are your needs?
I am working with cPickle to convert structured data into a data-stream format and pass it to a library. What I have to do is read the contents of a manually written file named "targetstrings.txt" and convert the contents of the file into the format that the Netcdf library needs, in the following manner.
Note: targetstrings.txt contains Latin characters
op=open("targetstrings.txt",'rb')
targetStrings=cPickle.load(op)
The Netcdf library takes the contents as strings.
While loading the file, it fails with the following error:
cPickle.UnpicklingError: invalid load key, 'A'.
Please tell me how I can rectify this error; I have googled around but did not find an appropriate solution.
Any suggestions?
pickle is not for reading/writing generic text files, but for serializing/deserializing Python objects to a file. If you want to read text data you should use Python's usual IO functions.
with open('targetstrings.txt', 'r') as f:
    fileContent = f.read()
If, as it seems, the library just wants to have a list of strings, taking each line as a list element, you just have to do:
with open('targetstrings.txt', 'r') as f:
    lines = [l for l in f]
    # now `lines` holds the lines read from the file
As stated, pickle is not meant to be used in this way.
If you need to manually edit complex Python objects that are to be read and passed as Python objects to another function, there are plenty of other formats to use, for example XML, JSON, or Python files themselves. Pickle uses a Python-specific protocol that, while not binary in version 0 of the protocol and stable across Python versions, is not meant for this, and is not even the recommended way to record Python objects for persistence or communication (although it can be used for those purposes).
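For instance, a small sketch using JSON instead (the file name and example strings are purely illustrative; ensure_ascii=False keeps Latin characters readable in the file):

import json

# Store the target strings in a human-editable JSON file instead of a pickle,
# then read them back as a list of strings.
with open('targetstrings.json', 'w', encoding='utf-8') as f:
    json.dump(["público", "général", "straße"], f, ensure_ascii=False)

with open('targetstrings.json', 'r', encoding='utf-8') as f:
    target_strings = json.load(f)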