python: read lines from compressed text files

python: read lines from compressed text files - python

Is it possible to read a line from a gzip-compressed text file using Python without extracting the file completely? I have a text.gz file which is around 200 MB. When I extract it, it becomes 7.4 GB. And this is not the only file I have to read. For the total process, I have to read 10 files. Although this will be a sequential job, I think it will a smart thing to do it without extracting the whole information. How can this be done using Python? I need to read the text file line-by-line.

Using gzip.GzipFile:
import gzip
with gzip.open('input.gz','rt') as f:
for line in f:
print('got line', line)
Note: gzip.open(filename, mode) is an alias for gzip.GzipFile(filename, mode).
I prefer the former, as it looks similar to with open(...) as f: used for opening uncompressed files.

You could use the standard gzip module in python. Just use:
gzip.open('myfile.gz')
to open the file as any other file and read its lines.
More information here: Python gzip module

Have you tried using gzip.GzipFile? Arguments are similar to open.

The gzip library (obviously) uses gzip, which can be a bit slow. You can speed things up with a system call to pigz, the parallelized version of gzip. The downsides are you have to install pigz and it will take more cores during the run, but it is much faster and not more memory intensive. The call to the file then becomes os.popen('pigz -dc ' + filename) instead of gzip.open(filename,'rt'). The pigz flags are -d for decompress and -c for stdout output which can then be grabbed by os.popen.
The following code take in a file and a number (1 or 2) and counts the number of lines in the file with the different calls while measuring the time the code takes. Defining the following code in the unzip-file.py:
#!/usr/bin/python
import os
import sys
import time
import gzip
def local_unzip(obj):
t0 = time.time()
count = 0
with obj as f:
for line in f:
count += 1
print(time.time() - t0, count)
r = sys.argv[1]
if sys.argv[2] == "1":
local_unzip(gzip.open(r,'rt'))
else:
local_unzip(os.popen('pigz -dc ' + r))
Calling these using /usr/bin/time -f %M which measures the maximum memory usage of the process on a 28G file we get:
$ /usr/bin/time -f %M ./unzip-file.py $file 1
(3037.2604110240936, 1223422024)
5116
$ /usr/bin/time -f %M ./unzip-file.py $file 2
(598.771901845932, 1223422024)
4996
Showing that the system call is about five times faster (10 minutes compared to 50 minutes) using basically the same maximum memory. It is also worth noting that depending on what you are doing per line reading in the file might not be the limiting factor, in which case the option you take does not matter.

Related

Reduce runtime, file reading, string manipulation of every line and file writing

I'm writing on a script that reads all lines from multiple files, reads in a number at the beginning of each block and puts that number in front of every line of the block until the next number and so on. Afterwards it writes all read lines into a single .csv file.
The files I am reading look like this:
13368:
2385003,4,2004-07-08
659432,3,2005-03-16
13369:
751812,2,2002-12-16
2625420,2,2004-05-25
And the output file should look like this:
13368,2385003,4,2004-07-08
13368,659432,3,2005-03-16
13369,751812,2,2002-12-16
13369,2625420,2,2004-05-25
Currently my script is this:
from asyncio import Semaphore, ensure_future, gather, run
import time
limit = 8
async def read(file_list):
tasks = list()
result = None
sem = Semaphore(limit)
for file in file_list:
task = ensure_future(read_bounded(file,sem))
tasks.append(task)
result = await gather(*tasks)
return result
async def read_bounded(file,sem):
async with sem:
return await read_one(file)
async def read_one(filename):
result = list()
with open(filename) as file:
dataList=[]
content = file.read().split(":")
file.close()
j=1
filmid=content[0]
append=result.append
while j<len(content):
for entry in content[j].split("\n"):
if len(entry)>10:
append("%s%s%s%s" % (filmid,",",entry,"\n"))
else:
if len(entry)>0:
filmid=entry
j+=1
return result
if __name__ == '__main__':
start=time.time()
write_append="w"
files = ['combined_data_1.txt', 'combined_data_2.txt', 'combined_data_3.txt', 'combined_data_4.txt']
res = run(read(files))
with open("output.csv",write_append) as outputFile:
for result in res:
outputFile.write(''.join(result))
outputFile.flush()
outputFile.close()
end=time.time()
print(end-start)
It has a runtime of about 135 Seconds (The 4 files that are read are each 500MB big and the output file has 2.3GB). Running the script takes about 10GB of RAM. I think this might be a problem.
The biggest amount of time is needed to create the list of all lines, I think.
I would like to reduce the runtime of this program, but I am new to python and not sure how to do this. Can you give me some advice?
Thanks
Edit:
I measured the times for the following commands in cmd (I have only Windows installed on my Computer, so I used hopefully equivalent cmd-Commands):
sequential writing to NUL
timecmd "type combined_data_1.txt combined_data_2.txt combined_data_3.txt combined_data_4.txt > NUL"
combined_data_1.txt
combined_data_2.txt
combined_data_3.txt
combined_data_4.txt
command took 0:1:25.87 (85.87s total)
sequential writing to file
timecmd "type combined_data_1.txt combined_data_2.txt combined_data_3.txt combined_data_4.txt > test.csv"
combined_data_1.txt
combined_data_2.txt
combined_data_3.txt
combined_data_4.txt
command took 0:2:42.93 (162.93s total)
parallel
timecmd "type combined_data_1.txt > NUL & type combined_data_2.txt > NUL & type combined_data_3.txt >NUL & type combined_data_4.txt > NUL"
command took 0:1:25.51 (85.51s total)

In this case you're not gaining anything by using asyncio for two reasons:
asyncio is single-threaded and doesn't parallelize processing (and, in Python, neither can threads)
the IO calls access the file system, which asyncio doesn't cover - it is primarily about network IO
The giveaway that you're not using asyncio correctly is the fact that your read_one coroutine doesn't contain a single await. That means that it never suspends execution, and that it will run to completion before ever yielding to another coroutine. Making it an ordinary function (and dropping asyncio altogether) would have the exact same result.
Here is a rewritten version of the script with the following changes:
byte IO throughout, for efficiency
iterates through the file rather than loading all at once
sequential code
import sys
def process(in_filename, outfile):
with open(in_filename, 'rb') as r:
for line in r:
if line.endswith(b':\n'):
prefix = line[:-2]
continue
outfile.write(b'%s,%s' % (prefix, line))
def main():
in_files = sys.argv[1:-1]
out_file = sys.argv[-1]
with open(out_file, 'wb') as out:
for fn in in_files:
process(fn, out)
if __name__ == '__main__':
main()
On my machine and Python 3.7, this version performs at approximately 22 s/GiB, tested on four randomly generated files, of 550 MiB each. It has a negligible memory footprint because it never loads the whole file into memory.
The script runs on Python 2.7 unchanged, where it clocks at 27 s/GiB. Pypy (6.0.0) runs it much faster, taking only 11 s/GiB.
Using concurrent.futures in theory ought to allow processing in one thread while another is waiting for IO, but the result ends up being significantly slower than the simplest sequential approach.

You want to read 2 GiB and write 2 GiB with low elapsed time and low memory consumption.
Parallelism, for core and for spindle, matters.
Ideally you would tend to keep all of them busy.
I assume you have at least four cores available.
Chunking your I/O matters, to avoid excessive malloc'ing.
Start with the simplest possible thing.
Please make some measurements and update your question to include them.
sequential
Please make sequential timing measurements of
$ cat combined_data_[1234].csv > /dev/null
and
$ cat combined_data_[1234].csv > big.csv
I assume you will have low CPU utilization, and thus will be measuring read & write I/O rates.
parallel
Please make parallel I/O measurements:
cat combined_data_1.csv > /dev/null &
cat combined_data_2.csv > /dev/null &
cat combined_data_3.csv > /dev/null &
cat combined_data_4.csv > /dev/null &
wait
This will let you know if overlapping reads offers a possibility for speedup.
For example, putting the files on four different physical filesystems might allow this -- you'd be keeping four spindles busy.
async
Based on these timings, you may choose to ditch async I/O, and instead fork off four separate python interpreters.
logic
content = file.read().split(":")
This is where much of your large memory footprint comes from.
Rather than slurping in the whole file at once, consider reading by lines, or in chunks.
A generator might offer you a convenient API for that.
EDIT:
compression
It appears that you are I/O bound -- you have idle cycles while waiting on the disk.
If the final consumer of your output file is willing to do decompression, then
consider using gzip, xz/lzma, or snappy.
The idea is that most of the elapsed time is spent on I/O, so you want to manipulate smaller files to do less I/O.
This benefits your script when writing 2 GiB of output,
and may also benefit the code that consumes that output.
As a separate item, you might possibly arrange for the code that produces the four input files to produce compressed versions of them.

I have tried to solve your problem. I think this is very easy and simple way if don't have any prior knowledge of any special library.
I just took 2 input files named input.txt & input2.txt with following contents.
Note: All files are in same directory.
input.txt
13368:
2385003,4,2004-07-08
659432,3,2005-03-16
13369:
751812,2,2002-12-16
2625420,2,2004-05-25
input2.txt
13364:
2385001,5,2004-06-08
659435,1,2005-03-16
13370:
751811,2,2023-12-16
2625220,2,2015-05-26
I have written the code in modular way so that you could easily import and use it in your project. Once you run the below code from terminal using python3 csv_writer.py, it will read all the files provided in list file_names and generate output.csv will the result that you're looking for.
csv_writer.py
# https://stackoverflow.com/questions/55226823/reduce-runtime-file-reading-string-manipulation-of-every-line-and-file-writing
import re
def read_file_and_get_output_lines(file_names):
output_lines = []
for file_name in file_names:
with open(file_name) as f:
lines = f.readlines()
for new_line in lines:
new_line = new_line.strip()
if not re.match(r'^\d+:$', new_line):
output_line = [old_line]
output_line.extend(new_line.split(","))
output_lines.append(output_line)
else:
old_line = new_line.rstrip(":")
return output_lines
def write_lines_to_csv(output_lines, file_name):
with open(file_name, "w+") as f:
for arr in output_lines:
line = ",".join(arr)
f.write(line + '\n')
if __name__ == "__main__":
file_names = [
"input.txt",
"input2.txt"
]
output_lines = read_file_and_get_output_lines(file_names)
print(output_lines)
# [['13368', '2385003', '4', '2004-07-08'], ['13368', '659432', '3', '2005-03-16'], ['13369', '751812', '2', '2002-12-16'], ['13369', '2625420', '2', '2004-05-25'], ['13364', '2385001', '5', '2004-06-08'], ['13364', '659435', '1', '2005-03-16'], ['13370', '751811', '2', '2023-12-16'], ['13370', '2625220', '2', '2015-05-26']]
write_lines_to_csv(output_lines, "output.csv")
output.csv
13368,2385003,4,2004-07-08
13368,659432,3,2005-03-16
13369,751812,2,2002-12-16
13369,2625420,2,2004-05-25
13364,2385001,5,2004-06-08
13364,659435,1,2005-03-16
13370,751811,2,2023-12-16
13370,2625220,2,2015-05-26

How can I speed up this python script to read and process a csv file?

I am trying to process a relatively large (about 100k lines) csv file in python. This is what my code looks like:
#!/usr/bin/env python
import sys
reload(sys)
sys.setdefaultencoding("utf8")
import csv
import os
csvFileName = sys.argv[1]
with open(csvFileName, 'r') as inputFile:
parsedFile = csv.DictReader(inputFile, delimiter=',')
totalCount = 0
for row in parsedFile:
target = row['new']
source = row['old']
systemLine = "some_curl_command {source}, {target}".format(source = source, target = target)
os.system(systemLine)
totalCount += 1
print "\nProcessed number: " + str(totalCount)
I'm not sure how to optimize this script. Should I use something besides DictReader?
I have to use Python 2.7, and cannot upgrade to Python 3.

If you want to avoid multiprocessing it is possible to split your long csv file into few smaller csvs and run them simultaneously. Like
$ python your_script.py 1.csv &
$ python your_script.py 2.csv &
Ampersand stands for background execution in linux envs. More details here. I don't have enough knowledge about anything similar in Windows, but it's possible to open few cmd windows, lol.
Anyway it's much better to stick with multiprocessing, ofc.
What about to use requests instead of curl?
import requests
response = requests.get(source_url)
html = response.content
with open(target, "w") as file:
file.write(html)
Here's the doc.
Avoid print statements, in long-term run they're slow as hell. For development and debugging that's ok, but when you decide to start final execution of your script you can remove it and check count of processed files directly in the target folder.

running
subprocess.Popen(systemLine)
instead of
os.system(systemLine)
should speed things up. Please note that systemLine has to be a list of strings e.g ['some_curl_command', 'source', 'target'] in order to work. If you want to limit the number of concurrent commands have a look at that.

How do I pipe to a file or shell program via Pythons subprocess?

I am working with some fairly large gzipped text files that I have to unzip, edit and re-zip. I use Pythons gzip module for unzipping and zipping, but I have found that my current implementation is far from optimal:
input_file = gzip.open(input_file_name, 'rb')
output_file = gzip.open(output_file_name, 'wb')
for line in input_file:
# Edit line and write to output_file
This approach is unbearably slow – probably because there is a huge overhead involved in doing per line iteration with the gzip module: I initially also run a line-count routine where I - using the gzip module - read chunks of the file and then count the number of newline chars in each chunk and that is very fast!
So one of the optimizations should definitely be to read my files in chunks and then only do per line iterations once the chunks have been unzipped.
As an additional optimization, I have seen a few suggestions to unzip in a shell command via subprocess. Using this approach, the equivalent of the first line in the above could be:
from subprocess import Popen, PIPE
file_input = Popen(["zcat", fastq_filename], stdout=PIPE)
input_file = file_input.stdout
Using this approach input_file becomes a file-like object. I don't know exactly how it is different to a real file object in terms of available attributes and methods, but one difference is that you obviously cannot use seek since it is a stream rather than a file.
This does run faster and it should - unless you run your script in a single core machine the claim is. The latter must mean that subprocess automatically ships different threads to different cores if possible, but I am no expert there.
So now to my current problem: I would like to zip my output in a similar fashion. That is, instead of using Pythons gzip module, I would like to pipe it to a subprocess and then call the shell gzip. This way I could potentially get reading, editing and writing in separate cores, which sounds wildly effective to me.
I have made a puny attempt at this, but attempting to write to output_file resulted in an empty file. Initially, I create an empty file using the touch command because Popen fails if the file does not exist:
call('touch ' + output_file_name, shell=True)
output = Popen(["gzip", output_file_name], stdin=PIPE)
output_file = output.stdin
Any help is greatly appreciated, I am using Python 2.7 by the way. Thanks.

Here is a working example of how this can be done:
#!/usr/bin/env python
from subprocess import Popen, PIPE
output = ['this', 'is', 'a', 'test']
output_file_name = 'pipe_out_test.txt.gz'
gzip_output_file = open(output_file_name, 'wb', 0)
output_stream = Popen(["gzip"], stdin=PIPE, stdout=gzip_output_file) # If gzip is supported
for line in output:
output_stream.stdin.write(line + '\n')
output_stream.stdin.close()
output_stream.wait()
gzip_output_file.close()
If our script only wrote to console and we wanted the output zipped, a shell command equivalent of the above could be:
script_that_writes_to_console | gzip > output.txt.gz

You meant output_file = gzip_process.stdin. After that you can use output_file as you've used gzip.open() object previously (no-seeking).
If the result file is empty then check that you call output_file.close() and gzip_process.wait() at the end of your Python script. Also, the usage of gzip may be incorrect: if gzip writes the compressed output to its stdout then pass stdout=gzip_output_file where gzip_output_file = open(output_file_name, 'wb', 0).

Extracting data from file performance wise (subprocess vs file read) Python

Wondering what is the most efficient method to read data from a locally hosted file using python.
Either using subprocesses and just cat the contents of the file:
ssh = subprocess.Popen(['cat', dir_to_file],
stdout=subprocess.PIPE)
for line in ssh.stdout:
print line
OR simply read contents of the file:
f = open(dir_to_file)
data = f.readlines()
f.close()
for line in data:
print line
I am creating a script that has to read the contents of many files and I was wondering which method is most efficient in terms of CPU usage and also which is the fastest in terms of runtime.
This is my first post here at stackoverflow, apologies on formatting.
Thanks

#chrisd1100 is correct that printing line by line is the bottleneck. After a quick experiment, here is what I found.
I ran and timed the two methods above repeatedly (A - subprocess, B - readline) on two different file sizes (~100KB and ~10MB).
Trial 1: ~100KB
subprocess: 0.05 - 0.1 seconds
readline: 0.02 - 0.026 seconds
Trial 2: ~10MB
subprocess: ~7 seconds
readlin: ~7 seconds
At the larger file size, printing line by line becomes by far the most expensive operation. On smaller file sizes, it seems that readline has about 2x speed performance. Tentatively, I'd say that readline is faster.
These were all run on Python 2.7.10, OSX 10.11.13, 2.8 Ghz i7.

Python or Bash - Iterate all words in a text file over itself

I have a text file that contains thousands of words, e.g:
laban
labrador
labradors
lacey
lachesis
lacy
ladoga
ladonna
lafayette
lafitte
lagos
lagrange
lagrangian
lahore
laius
lajos
lakeisha
lakewood
I want to iterate every word over itself so i get:
labanlaban
labanlabrador
labanlabradors
labanlacey
labanlachesis
etc...
In bash i can do the following, but it is extremely slow:
#!/bin/bash
( cat words.txt | while read word1; do
cat words.txt | while read word2; do
echo "$word1$word2" >> doublewords.txt
done; done )
Is there a faster and more efficient way to do this?
Also, how would i iterate two different text files in this manner?

If you can fit the list into memory:
import itertools
with open(words_filename, 'r') as words_file:
words = [word.strip() for word in words_file]
for words in itertools.product(words, repeat=2):
print(''.join(words))
(You can also do a double-for loop, but I was feeling itertools tonight.)
I suspect the win here is that we can avoid re-reading the file over and over again; the inner loop in your bash example will cat the file one for each iteration of the outer loop. Also, I think Python just tends to execute faster than bash, IIRC.
You could certainly pull this trick with bash (read the file into an array, write a double-for loop), it's just more painful.

It looks like sed is pretty efficient to append a text to each line.
I propose:
#!/bin/bash
for word in $(< words.txt)
do
sed "s/$/$word/" words.txt;
done > doublewords.txt
(Do you confuse $ which means end of line for sed and $word which is a bash variable).
For a 2000 line file, this runs in about 20 s on my computer, compared to ~2 min for you solution.
Remark: it also looks like you are slightly better off redirecting the standard output of the whole program instead of forcing writes at each loop.
(Warning, this is a bit off topic and personal opinion!)
If you are really going for speed, you should consider using a compiled language such as C++. For example:
vector<string> words;
ifstream infile("words.dat");
for(string line ; std::getline(infile,line) ; )
words.push_back(line);
infile.close();
ofstream outfile("doublewords.dat");
for(auto word1 : data)
for(auto word2 : data)
outfile << word1 << word2 << "\n";
outfile.close();
You need to understand that both bash and python are bad at double for loops: that's why you use tricks (#Thanatos) or predefined commands (sed). Recently, I came across a double for loop problem (given a set of 10000 points in 3d, compute all the distances between pairs) and I successful solved it using C++ instead of python or Matlab.

If you have GHC available, Cartesian products are a synch!
Q1: One file
-- words.hs
import Control.Applicative
main = interact f
where f = unlines . g . words
g x = map (++) x <*> x
This splits the file into a list of words, and then appends each word to each other word with the applicative <*>.
Compile with GHC,
ghc words.hs
and then run with IO redirection:
./words <words.txt >out
Q2: Two files
-- words2.hs
import Control.Applicative
import Control.Monad
import System.Environment
main = do
ws <- mapM ((liftM words) . readFile) =<< getArgs
putStrLn $ unlines $ g ws
where g (x:y:_) = map (++) x <*> y
Compile as before and run with the two files as arguments:
./words2 words1.txt words2.txt > out
Bleh, compiling?
Want the convenience of a shell script and the performance of a compiled executable? Why not do both?
Simply wrap the Haskell program you want in a wrapper script which compiles it in /var/tmp, and then replaces itself with the resulting executable:
#!/bin/bash
# wrapper.sh
cd /var/tmp
cat > c.hs <<CODE
# replace this comment with haskell code
CODE
ghc c.hs >/dev/null
cd - >/dev/null
exec /var/tmp/c "$#"
This handles arguments and IO redirection as though the wrapper didn't exist.
Results
Timing against some of the other answers with two 2000 word files:
$ time ./words2 words1.txt words2.txt >out
3.75s user 0.20s system 98% cpu 4.026 total
$ time ./wrapper.sh words1.txt words2.txt > words2
4.12s user 0.26s system 97% cpu 4.485 total
$ time ./thanatos.py > out
4.93s user 0.11s system 98% cpu 5.124 total
$ time ./styko.sh
7.91s user 0.96s system 74% cpu 11.883 total
$ time ./user3552978.sh
57.16s user 29.17s system 93% cpu 1:31.97 total

You can do this in pythonic way by creating a tempfile and write data to it while reading the existing file and finally remove the original file and move the new file to original file.
import sys
from os import remove
from shutil import move
from tempfile import mkstemp
def data_redundent(source_file_path):
fh, target_file_path = mkstemp()
with open(target_file_path, 'w') as target_file:
with open(source_file_path, 'r') as source_file:
for line in source_file:
target_file.write(line.replace('\n', '')+line)
remove(source_file_path)
move(target_file_path, source_file_path)
data_redundent('test_data.txt')

I'm not sure how efficient this is, but a very simple way, using the Unix tool specifically designed for this sort of thing, would be
paste -d'\0' <file> <file>
The -d option specifies the delimiter to be used between the concatenated parts, and \0 indicates a NULL character (i.e. no delimiter at all).

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.