Issues reading in large .gz files - python

I am reading in a large zipped json file ~4GB. I want to read in the first n lines.
import ast
import gzip

with gzip.open('/path/to/my/data/data.json.gz', 'rt') as f:
    line_n = f.readlines(1)
    print(ast.literal_eval(line_n[0])['events'])  # a dictionary object
This works fine when I want to read a single line. If I now try to read in a loop, e.g.
no_of_lines = 1
with gzip.open('/path/to/my/data/data.json.gz', 'rt') as f:
    for line in range(no_of_lines):
        line_n = f.readlines(line)
        print(ast.literal_eval(line_n[0])['events'])
My code takes forever to execute, even if that loop is of length 1. I'm assuming this behaviour has something to do with how gzip reads files; perhaps when I loop it tries to obtain information about the file length, which causes the long execution time? Can anyone shed some light on this and potentially provide an alternative way of doing this?
An edited first line of my data:
['{"events": {"category": "EVENT", "mac_address": "123456", "co_site": "HSTH"}}\n']

You are using the readlines() method, which reads all lines from the file at once unless it is given a positive sizehint. In your loop, range(no_of_lines) starts at 0, so you end up calling f.readlines(0), and a sizehint of 0 means "no limit": Python decompresses and loads the entire 4GB file into memory, which is why even a single iteration takes forever.
An alternative is to iterate over the file object itself, which yields one line at a time without loading everything into memory at once:
with gzip.open('/path/to/my/data/data.json.gz', 'rt') as f:
    for line in f:
        print(ast.literal_eval(line)['events'])
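If you literally only need the first n lines, itertools.islice keeps the loop bounded without touching the rest of the file. A minimal sketch, reusing the path from the question (no_of_lines is just an example value):

import ast
import gzip
from itertools import islice

no_of_lines = 5  # however many lines you want to inspect

with gzip.open('/path/to/my/data/data.json.gz', 'rt') as f:
    for line in islice(f, no_of_lines):
        print(ast.literal_eval(line)['events'])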

Related

python jump to a line in a txt file (a gzipped one)

I'm reading through a large file, and processing it.
I want to be able to jump to the middle of the file without it taking a long time.
right now I am doing:
f = gzip.open(input_name)
for i in range(1000000):
    f.read()  # just skipping the first 1M rows
for line in f:
    do_something(line)
is there a faster way to skip the lines in the zipped file?
If I have to unzip it first, I'll do that, but there has to be a way.
It's of course a text file, with \n separating lines.
The nature of gzipping is such that there is no longer the concept of lines when the file is compressed -- it's just a binary blob. Check out this for an explanation of what gzip does.
To read the file, you'll need to decompress it -- the gzip module does a fine job of it. Like other answers, I'd also recommend itertools to do the jumping, as it will carefully make sure you don't pull things into memory, and it will get you there as fast as possible.
import itertools

with gzip.open(filename) as f:
    # jump to `initial_row`
    for line in itertools.islice(f, initial_row, None):
        do_something(line)  # have a party
Alternatively, if this is a CSV that you're going to be working with, you could also try pandas, as it can handle decompressing gzip for you. That would look like: parsed_csv = pd.read_csv(filename, compression='gzip').
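If you do go the pandas route, the skiprows and nrows parameters can do the jumping for you. A sketch with made-up numbers (note that skiprows will also skip the header row unless you account for it):

import pandas as pd

# parse 10,000 rows starting after the first 1,000,000 lines of the gzipped CSV
parsed_csv = pd.read_csv(filename, compression='gzip',
                         skiprows=1000000, nrows=10000)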
Also, to be extra clear, when you iterate over file objects in python -- i.e. like the f variable above -- you iterate over lines. You do not need to think about the '\n' characters.
You can use itertools.islice, passing the file object f and a starting point. It will still advance the iterator, but far more efficiently than calling next 1,000,000 times:
from itertools import islice

for line in islice(f, 1000000, None):
    print(line)
Not overly familiar with gzip, but I imagine f.read() reads the whole file, so the next 999,999 calls are doing nothing. If you wanted to manually advance the iterator you would call next on the file object, i.e. next(f).
Calling next(f) won't mean all the lines are read into memory at once either, it advances the iterator one line at a time so if you want to skip a line or two in a file or a header it can be useful.
The consume recipe that #wwii suggested is also worth checking out; it is sketched below.
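For reference, that consume recipe from the itertools documentation looks roughly like this:

import collections
from itertools import islice

def consume(iterator, n=None):
    "Advance the iterator n steps ahead. If n is None, consume it entirely."
    if n is None:
        # feed the entire iterator into a zero-length deque
        collections.deque(iterator, maxlen=0)
    else:
        # advance to the empty slice starting at position n
        next(islice(iterator, n, n), None)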
Not really.
If you know the number of bytes you want to skip, you can use .seek(amount) on the file object, but in order to skip a number of lines, Python has to go through the file byte by byte to count the newline characters.
The only alternative that comes to my mind is if you are dealing with a static file that won't change. In that case, you can index it once, i.e. find out and remember the position of each line. If you keep that in, e.g., a list that you save and load with pickle, you can skip to any line in quasi-constant time with seek.
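A minimal sketch of that indexing idea, assuming a plain (already decompressed) text file and made-up filenames; the offsets are gathered once, pickled, and reused with seek():

import pickle

def build_line_index(path):
    # byte offset at which each line starts
    offsets = []
    pos = 0
    with open(path, 'rb') as f:
        for line in f:
            offsets.append(pos)
            pos += len(line)
    return offsets

offsets = build_line_index('data.txt')
with open('line_index.pkl', 'wb') as f:
    pickle.dump(offsets, f)

# later: jump straight to line 1,000,000 (assuming the file has that many lines)
with open('data.txt', 'rb') as f:
    f.seek(offsets[1000000])
    print(f.readline())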
It is not possible to randomly seek within a gzip file. Gzip is a stream algorithm and so it must always be uncompressed from the start until where your data of interest lies.
It is not possible to jump to a specific line without an index. Lines can only be found by scanning forward, or scanning backwards from the end of the file, in successive chunks.
You should consider a different storage format for your needs. What are your needs?

How to load a big text file efficiently in python

I have a text file containing 7000 lines of strings. I have to search for a specific string based on a few params.
Some are saying that the below code wouldn't be efficient (speed and memory usage).
f = open("file.txt")
data = f.read().split() # strings as list
First of all, if I don't even make it a list, how would I even start searching at all?
Is it efficient to load the entire file? If not, how to do it?
To filter anything, we need to search for it, and to search we need to read it, right?
A bit confused
Iterate over each line of the file without storing it all. This keeps the program memory-efficient.
with open(filename) as f:
    for line in f:
        if "search_term" in line:
            break
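If you need all the matching lines rather than just the first, you can still avoid read().split() by building a list from the file iterator. A small sketch ("search_term" is a placeholder):

with open(filename) as f:
    matches = [line for line in f if "search_term" in line]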

Reading lines in text files using python

I am currently programming a game that requires reading and writing lines in a text file. I was wondering if there is a way to read a specific line in the text file (i.e. the first line in the text file). Also, is there a way to write a line in a specific location (i.e. change the first line in the file, write a couple of other lines and then change the first line again)? I know that we can read lines sequentially by calling:
f.readline()
Edit: Based on responses, apparently there is no way to read specific lines if they are different lengths. I am only working on a small part of a large group project and to change the way I'm storing data would mean a lot of work.
But is there a method to change specifically the first line of the file? I know calling:
f.write('text')
Writes something into the file, but it writes the line at the end of the file instead of the beginning. Is there a way for me to specifically rewrite the text at the beginning?
If all your lines are guaranteed to be the same length, then you can use f.seek(N) to position the file pointer at the N'th byte (where N is LINESIZE*line_number) and then f.read(LINESIZE). Otherwise, I'm not aware of any way to do it in an ordinary ASCII file (which I think is what you're asking about).
Of course, you could store some sort of record information in the header of the file and read that first to let you know where to seek to in your file -- but at that point you're better off using some external library that has already done all that work for you.
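As a sketch of that fixed-length idea (LINESIZE, the line number, and the filename here are made up; this only works if every line really has the same length):

LINESIZE = 28      # hypothetical fixed record length, newline included
line_number = 3    # zero-based line to fetch

with open('records.txt', 'rb') as f:
    f.seek(LINESIZE * line_number)   # jump straight to the start of that line
    record = f.read(LINESIZE)        # bytes of exactly one record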
Unless your text file is really big, you can always store each line in a list:
with open('textfile', 'r') as f:
    lines = [L[:-1] for L in f.readlines()]
(note I've stripped off the newline so you don't have to remember to keep it around)
Then you can manipulate the list by adding entries, removing entries, changing entries, etc.
At the end of the day, you can write the list back to your text file:
with open('textfile', 'w') as f:
    f.write('\n'.join(lines))
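For the specific case in the question (changing just the first line before writing everything back), that could look like the following sketch; 'textfile' is the same placeholder name as above:

with open('textfile', 'r') as f:
    lines = f.read().splitlines()

lines[0] = 'my new first line'   # replace the first entry in the list

with open('textfile', 'w') as f:
    f.write('\n'.join(lines) + '\n')   # re-add the trailing newline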
Here's a little test which works for me on OS-X to replace only the first line.
test.dat
this line has n characters
this line also has n characters
test.py
#First, I get the length of the first line -- if you already know it, skip this block
f=open('test.dat','r')
l=f.readline()
linelen=len(l)-1
f.close()
#apparently mode='a+' doesn't work on all systems :( so I use 'r+' instead
f=open('test.dat','r+')
f.seek(0)
f.write('a'*linelen+'\n') #'a'*linelen = 'aaaaaaaaa...'
f.close()
These days, jumping within files in an optimized fashion is a task for high performance applications that manage huge files.
Are you sure that your software project requires reading/writing random places in a file during runtime? I think you should consider changing the whole approach:
If the data is small, you can keep / modify / generate the data at runtime in memory within appropriate container formats (list or dict, for instance) and then write it entirely at once (on change, or only when your program exits). You could consider looking at simple databases. Also, there are nice data exchange formats like JSON, which would be the ideal format in case your data is stored in a dictionary at runtime.
An example, to make the concept more clear. Consider you already have data written to gamedata.dat:
[{"playtime": 25, "score": 13, "name": "rudolf"}, {"playtime": 300, "score": 1, "name": "peter"}]
This is utf-8-encoded and JSON-formatted data. Read the file during runtime of your Python game:
with open("gamedata.dat") as f:
s = f.read().decode("utf-8")
Convert the data to Python types:
gamedata = json.loads(s)
Modify the data (add a new user):
user = {"name": "john", "score": 1337, "playtime": 1}
gamedata.append(user)
John really is a 1337 gamer. However, at this point, you also could have deleted a user, changed the score of Rudolf or changed the name of Peter, ... In any case, after the modification, you can simply write the new data back to disk:
with open("gamedata.dat", "w") as f:
f.write(json.dumps(gamedata).encode("utf-8"))
The point is that you manage (create/modify/remove) data during runtime within appropriate container types. When writing data to disk, you write the entire data set in order to save the current state of the game.

Python: efficient file io

What is the most efficient (fastest) way to simultaneously read in two large files and do some processing?
I have two files; a.txt and b.txt, each containing about a hundred thousand corresponding lines. My goal is to read in the two files and then do some processing on each line pair
def kernel():
    a_file = open('a.txt', 'r')
    b_file = open('b.txt', 'r')
    a_line = a_file.readline()
    b_line = b_file.readline()
    while a_line:
        process(a_line, b_line)  # process requiring both corresponding file lines
        a_line = a_file.readline()
        b_line = b_file.readline()
I looked into xreadlines and readlines, but I'm wondering if I can do better. Speed is of paramount importance for this task.
Thank you.
The below code does not accumulate data from the input files in memory, unless the process function does that by itself.
from itertools import izip  # Python 2; on Python 3 the built-in zip is already lazy

def process(line1, line2):
    pass  # process a line from each input

with open(file1, 'r') as f1:
    with open(file2, 'r') as f2:
        for a, b in izip(f1, f2):
            process(a, b)
If the process function is efficient, this code should run quickly enough for most purposes. The for loop will terminate when the end of one of the files is reached. If either file contains an extraordinarily long line (e.g. XML, JSON), or if the files are not text, this code may not work well.
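If the two files can have different lengths and you still need every line, izip_longest pads the shorter file instead of stopping early. A sketch under that assumption:

from itertools import izip_longest  # Python 2; use itertools.zip_longest on Python 3

with open(file1, 'r') as f1:
    with open(file2, 'r') as f2:
        for a, b in izip_longest(f1, f2, fillvalue=''):
            process(a, b)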
You can use the with statement to make sure your files are closed after the execution. From this blog entry:
to open a file, process its contents, and make sure to close it, you can simply do:
with open("x.txt") as f:
data = f.read()
do something with data
String IO can be pretty fast -- probably your processing will be what slows things down. Consider a simple input loop to feed a queue like:
import itertools
import multiprocessing

queue = multiprocessing.Queue(100)
a_file = open('a.txt')
b_file = open('b.txt')
for pair in itertools.izip(a_file, b_file):  # izip is Python 2; use zip on Python 3
    queue.put(pair)  # blocks here on full queue
You can set up a pool of processes pulling items from the queue and taking action on each, assuming your problem can be parallelised this way.
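A sketch of what the consuming side of that queue could look like, assuming a process(a_line, b_line) function as in the question; the worker count and sentinel value are made up:

import multiprocessing

N_WORKERS = 4
SENTINEL = None  # pushed once per worker to tell it to stop

def worker(queue):
    while True:
        pair = queue.get()
        if pair is SENTINEL:
            break
        a_line, b_line = pair
        process(a_line, b_line)

# on Windows, wrap the following in an `if __name__ == '__main__':` guard
workers = [multiprocessing.Process(target=worker, args=(queue,))
           for _ in range(N_WORKERS)]
for w in workers:
    w.start()

# ... run the feeding loop shown above, then shut the workers down:
for _ in workers:
    queue.put(SENTINEL)
for w in workers:
    w.join()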
I'd change your while condition to the following so that it doesn't fail when a has more lines than b:
while a_line and b_line:
Otherwise, that looks good. You are reading in the two lines that you need, then processing. You could even multithread this by reading in N pairs of lines and sending each pair off to a new thread or similar.

How do I remove lines from a big file in Python, within limited environment

Say I have a 10GB HDD Ubuntu VPS in the USA (and I live in some where else), and I have a 9GB text file on the hard drive. I have 512MB of RAM, and about the same amount of swap.
Given the fact that I cannot add more HDD space and cannot move the file to somewhere else to process, is there an efficient method to remove some lines from the file using Python (preferably, but any other language will be acceptable)?
How about this? It edits the file in place. I've tested it on some small text files (in Python 2.6.1), but I'm not sure how well it will perform on massive files because of all the jumping around, but still...
I've used an indefinite while loop with a manual EOF check, because for line in f: didn't work correctly (presumably all the jumping around messes up the normal iteration). There may be a better way to check this, but I'm relatively new to Python, so someone please let me know if there is.
Also, you'll need to define the function isRequired(line).
writeLoc = 0
readLoc = 0
with open("filename", "r+") as f:
    while True:
        line = f.readline()
        # manual EOF check; not sure of the correct
        # Python way to do this manually...
        if line == "":
            break
        # save how far we've read
        readLoc = f.tell()
        # if we need this line, write it and
        # update the write location
        if isRequired(line):
            f.seek(writeLoc)
            f.write(line)
            writeLoc = f.tell()
            f.seek(readLoc)
    # finally, chop off the rest of the file that's no longer needed
    f.truncate(writeLoc)
Try this:
currentReadPos = 0
removedLinesLength = 0
for line in file:
    currentReadPos = file.tell()
    if remove(line):
        removedLinesLength += len(line)
    else:
        file.seek(file.tell() - removedLinesLength)
        file.write(line)  # line already ends with "\n"
        file.flush()
        file.seek(currentReadPos)
I have not run this, but the idea is to modify the file in place by overwriting the lines you want to remove with lines you want to keep. I am not sure how the seeking and modifying interacts with the iterating over the file.
Update:
I have tried fileinput with inplace by creating a 1GB file. What I expected was different from what happened. I read the documentation properly this time.
Optional in-place filtering: if the keyword argument inplace=1 is passed to fileinput.input() or to the FileInput constructor, the file is moved to a backup file and standard output is directed to the input file (if a file of the same name as the backup file already exists, it will be replaced silently).
from docs/fileinput
So, this doesn't seem to be an option now for you. Please check other answers.
Before Edit:
If you are looking to edit the file in place, then check out Python's fileinput module - Docs.
I am really not sure about its efficiency when used with a 10gb file. But, to me, this seemed to be the only option you have using Python.
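For completeness, the fileinput in-place idiom looks roughly like the sketch below (is_required is a made-up predicate); as the update above explains, the backup copy it creates is exactly what a nearly full 10GB disk cannot afford:

import fileinput
import sys

for line in fileinput.input("filename", inplace=True):
    if is_required(line):
        sys.stdout.write(line)  # whatever is written to stdout replaces the line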
Just sequentially read and write to the files.
f.readlines() returns a list containing all the lines of data in the file. If given an optional parameter sizehint, it reads that many bytes from the file and enough more to complete a line, and returns the lines from that. This is often used to allow efficient reading of a large file by lines, but without having to load the entire file in memory. Only complete lines will be returned.
Source
Process the file in chunks of 10-20 MB (or more) of lines at a time, as sketched below.
This would be the fastest way.
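A sketch of that chunked approach using the sizehint argument quoted above; the chunk size, filenames, and keep() filter are made up, and it writes to a second file, so it only fits if you can spare the space (or stream the output elsewhere):

CHUNK = 20 * 1024 * 1024  # read roughly 20 MB of complete lines per call

with open("input.txt") as src, open("output.txt", "w") as dst:
    while True:
        lines = src.readlines(CHUNK)
        if not lines:
            break
        dst.writelines(line for line in lines if keep(line))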
Another way of doing this is to stream the file and filter it using AWK, for example.
example pseudo code:
file = open(path, "r+")
linesCnt = 50
newReadOffset = 0
tmpWrtOffset = 0

def processFile():
    global newReadOffset, tmpWrtOffset
    while True:
        lines, newReadOffset = getLines(file, newReadOffset, linesCnt)
        if not lines:
            break
        kept = [line for line in lines if line == cool]  # keep only the lines that pass your filter
        # writeBackToFile should return the new offset to write at next time
        tmpWrtOffset = writeBackToFile(file, kept, tmpWrtOffset)
To resize the file at the end, use truncate(size=None), passing the final write offset as size.
