I have a folder containing 100 GB of CSV files that I want to merge into a single CSV. The file names sort in row order. I've written a single-threaded script to tackle this, but it is understandably slow.
def JoinRows(rows_to_join, init=True):
    # rows_to_join is a list of csv paths.
    for i, row in enumerate(rows_to_join):
        with open('join_rows.csv', 'a') as f1:
            # join_rows.csv is just the output file with all the rows
            with open(row, 'r') as f2:
                for line in f2:
                    f1.write('\n'+line)
I also wrote a recursive function that doesn't work and isn't parallel (yet). My thought was to join each csv with another, delete the second of the two, and keep repeating until only one file was left. This way the task could be split up among different available threads. Any suggestions?
def JoinRows(rows_to_join, init=False):
    if init==True: rows_to_join.sort()
    LEN = len(rows_to_join)
    print(LEN)
    if len(rows_to_join) == 2:
        with open(rows_to_join[0], 'a') as f1:
            with open(rows_to_join[1], 'rb') as f2:
                for line in f2:
                    f1.write('\n'+line)
        subprocess.check_call(['rm '+rows_to_join[1]], shell=True)
        return(rows_to_join[1])
    else:
        rows_to_join.remove(JoinRows(rows_to_join[:LEN//2]))
        rows_to_join.remove(JoinRows(rows_to_join[LEN//2:]))
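Since the merge is a straight concatenation, the task is usually disk-bound rather than CPU-bound, so parallelism may not help much. A minimal sequential baseline for comparison, using `shutil.copyfileobj` to stream each file in large chunks (function and path names are illustrative; it assumes the inputs have no header rows and each file ends with a newline):

```python
import shutil

def concat_csvs(paths, out_path):
    """Concatenate CSV files in sorted order in one sequential pass.

    Assumes no header rows to skip and that each input file ends
    with a newline, so rows never run together.
    """
    with open(out_path, "wb") as out:
        for path in sorted(paths):
            with open(path, "rb") as src:
                # copyfileobj streams in fixed-size chunks, so memory
                # use stays constant regardless of file size.
                shutil.copyfileobj(src, out)
```

Opening the output once and streaming each input avoids the per-file reopen in the original loop, and binary mode sidesteps any newline translation.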
I think this should be easy, but I have not been able to solve it. I have two files, shown below, and I want to merge them so that the lines starting with > in file1 become the headers of the corresponding lines in file2.
file1:
>seq12
ACGCTCGCA
>seq34
GCATCGCGT
>seq56
GCGATGCGC
file2:
ATCGCGCATGATCTCAG
AGCGCGCATGCGCATCG
AGCAAATTTAGCAACTC
so the desired output should be:
>seq12
ATCGCGCATGATCTCAG
>seq34
AGCGCGCATGCGCATCG
>seq56
AGCAAATTTAGCAACTC
I have tried this code so far, but in the output all the lines coming from file2 are the same:
from Bio import SeqIO

with open(file1) as fw:
    with open(file2,'r') as rv:
        for line in rv:
            items = line
            for record in SeqIO.parse(fw, 'fasta'):
                print('>' + record.id)
                print(line)
If you cannot store your files in memory, you need a solution that reads line by line from each file and writes accordingly to the output file. The following program does that; the comments try to clarify it, though I believe it is clear from the code.
with open("file1.txt") as first, open("file2.txt") as second, open("output.txt", "w+") as output:
    while True:
        line_first = first.readline()    # line from file1 (header)
        line_second = second.readline()  # line from file2 (body)
        if not (line_first and line_second):
            # if either file has ended
            break
        # write to output file
        output.writelines([line_first, line_second])
        # skip one line from file1
        first.readline()
Note that this will only work if file1.txt has the specific format you presented (odd lines are headers, even lines are useless).
In order to allow a bit more customization, you can wrap it up in a function as:
def merge_files(header_file_path, body_file_path, output_file="output.txt", every_n_lines=2):
    with open(header_file_path) as first, open(body_file_path) as second, open(output_file, "w+") as output:
        while True:
            line_first = first.readline()    # line from header
            line_second = second.readline()  # line from body
            if not (line_first and line_second):
                # if either file has ended
                break
            # write to output file
            output.writelines([line_first, line_second])
            # skip n-1 lines from the header file
            for _ in range(every_n_lines - 1):
                first.readline()
And then calling merge_files("file1.txt", "file2.txt") should do the trick.
If both files are small enough to fit in memory simultaneously, you can simply read them simultaneously and interleave them.
# Open two file handles.
with open("f1", mode="r") as f1, open("f2", mode="r") as f2:
    lines_first = f1.readlines()   # Read all lines in f1.
    lines_second = f2.readlines()  # Read all lines in f2.
    lines_out = []
    # For each line in the file without headers...
    for idx in range(len(lines_second)):
        # Take every even (1-based) line from the first file and
        # prepend it to the corresponding line from the second.
        lines_out.append(lines_first[2 * idx + 1].rstrip() + lines_second[idx].rstrip())
You can generate the seq headers very easily given idx: I leave this as an exercise to the reader.
If either or both files are too large to fit in memory, you can repeat the above process line-by-line over both handles (using one variable to store information from the file with headers).
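The line-by-line variant mentioned above can be sketched as follows (function and path names are illustrative; it assumes headers are exactly the lines starting with `>`, as in the question's sample):

```python
def merge_streaming(header_path, body_path, out_path):
    """Pair each body line with the next header, using constant memory.

    For each line of the body file, advance through the header file
    until the next line starting with '>' and write the pair.
    """
    with open(header_path) as headers, open(body_path) as bodies, \
            open(out_path, "w") as out:
        for body_line in bodies:
            # File objects are iterators, so this inner loop resumes
            # where it left off on the previous body line.
            for header_line in headers:
                if header_line.startswith(">"):
                    out.write(header_line)
                    break
            out.write(body_line)
```

Because both files are consumed as iterators, only one line from each is held in memory at a time.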
Ok, I couldn't really find an answer to this anywhere else, so I figured I'd ask.
I'm working with some .csv files that have about 74 million lines right now, and I'm trying to insert columns from one file into another.
ex.
Week,Sales Depot,Sales Channel,Route,Client,Product,Units Sold,Sales,Units Returned,Returns,Adjusted Demand
3,1110,7,3301,15766,1212,3,25.14,0,0,3
3,1110,7,3301,15766,1216,4,33.52,0,0,4
combined with
Units_cat
0
1
so that
Week,Sales Depot,Sales Channel,Route,Client,Product,Units Sold,Units_cat,Sales,Units Returned,Returns,Adjusted Demand
3,1110,7,3301,15766,1212,3,0,25.14,0,0,3
3,1110,7,3301,15766,1216,4,1,33.52,0,0,4
I've been using pandas to read in and output the .csv files, but the issue I keep running into is that the program crashes because creating the DataFrame overloads my memory. I've tried using Python's csv library, but I'm not sure how to merge the files the way I want (not just append).
Anyone know a more memory efficient method of combining these files?
Something like this might work for you:
Using csv.DictReader()
import csv
from itertools import izip

with open('file1.csv') as file1:
    with open('file2.csv') as file2:
        with open('result.csv', 'w') as result:
            file1 = csv.DictReader(file1)
            file2 = csv.DictReader(file2)

            # Get the field order correct here:
            fieldnames = file1.fieldnames
            index = fieldnames.index('Units Sold')+1
            fieldnames = fieldnames[:index] + file2.fieldnames + fieldnames[index:]

            result = csv.DictWriter(result, fieldnames)

            def dict_merge(a,b):
                a.update(b)
                return a

            result.writeheader()
            result.writerows(dict_merge(a,b) for a,b in izip(file1, file2))
Using csv.reader()
import csv
from itertools import izip

with open('file1.csv') as file1:
    with open('file2.csv') as file2:
        with open('result.csv', 'w') as result:
            file1 = csv.reader(file1)
            file2 = csv.reader(file2)
            result = csv.writer(result)
            result.writerows(a[:7] + b + a[7:] for a,b in izip(file1, file2))
Notes:
This is for Python 2. You can use the normal zip() function in Python 3. If the two files are not the same length, consider itertools.izip_longest() (zip_longest() in Python 3).
The memory efficiency comes from passing a generator expression to .writerows() instead of a list. This way, only the current line is under consideration at any moment in time, not the entire file. If a generator expression isn't appropriate, you'll get the same benefit from a for loop: for a,b in izip(...): result.writerow(...)
The dict_merge function is not required starting from Python3.5. In sufficiently new Pythons, try result.writerows({**a,**b} for a,b in zip(file1, file2)) (See this explanation).
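For completeness, the csv.reader approach translated to Python 3 might look like this (the function name and the column index 7 are assumptions based on the sample data, where Units_cat goes right after 'Units Sold'; newline='' is what the csv module expects on Python 3):

```python
import csv

def insert_columns(main_path, extra_path, out_path, at=7):
    """Splice each row of extra_path into the matching row of
    main_path at column index `at`, streaming row by row."""
    with open(main_path, newline="") as main, \
            open(extra_path, newline="") as extra, \
            open(out_path, "w", newline="") as result:
        writer = csv.writer(result)
        # A generator expression keeps only the current row pair
        # in memory, never the whole files.
        writer.writerows(a[:at] + b + a[at:]
                         for a, b in zip(csv.reader(main), csv.reader(extra)))
```

The header rows line up too, since 'Units_cat' is the first line of the second file.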
I've got 2 txt files that are structured like this:
File 1
LINK1;FILENAME1
LINK2;FILENAME2
LINK3;FILENAME3
File 2
FILENAME1
FILENAME2
FILENAME3
And I use this code to print the "unique" lines contained in both files:
with open('1.txt', 'r') as f1, open('2.txt', 'r') as f2:
    a = f1.readlines()
    b = f2.readlines()
    non_duplicates = [line for line in a if line not in b]
    non_duplicates += [line for line in b if line not in a]
    for i in range(1, len(non_duplicates)):
        print non_duplicates[i]
The problem is that this prints all the lines of both files. What I actually want is to check whether FILENAME1 appears in some line of file 1 (the one with both links and filenames) and delete that line.
You need to first load all lines of 2.txt and then filter out the lines in 1.txt that contain a line from the former. Use a set or frozenset to hold the "blacklist", so that each not in check runs in O(1) on average. Also note that f1 and f2 are already iterable:
with open('2.txt', 'r') as f2:
    blacklist = frozenset(line.strip() for line in f2)

with open('1.txt', 'r') as f1:
    non_duplicates = [x.strip() for x in f1 if x.strip().split(";")[1] not in blacklist]
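The same idea wrapped in a small function for easier reuse (the name filter_lines is illustrative; lines are stripped on both sides so trailing newlines can't spoil the membership test):

```python
def filter_lines(pairs_path, names_path):
    """Return the lines of pairs_path (LINK;FILENAME format) whose
    filename part is not listed in names_path.

    Set membership makes each check O(1) on average, so the whole
    run is linear in the total number of lines.
    """
    with open(names_path) as f2:
        blacklist = frozenset(line.strip() for line in f2)
    with open(pairs_path) as f1:
        return [line.strip() for line in f1
                if line.strip().split(";")[1] not in blacklist]
```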
If file2 is not too big, create a set of all its lines, then split each line of file1 and check whether the second element is in that set:
import fileinput
import sys

with open("file2.txt") as f:
    lines = set(map(str.rstrip, f))  # itertools.imap on python2

for line in fileinput.input("file1.txt", inplace=True):
    # if FILENAME1 etc.. is not in the line, write the line
    if line.rstrip().split(";")[1] not in lines:
        sys.stdout.write(line)
file1:
LINK1;FILENAME1
LINK2;FILENAME2
LINK3;FILENAME3
LINK1;FILENAME4
LINK2;FILENAME5
LINK3;FILENAME6
file2:
FILENAME1
FILENAME2
FILENAME3
file1 after:
LINK1;FILENAME4
LINK2;FILENAME5
LINK3;FILENAME6
fileinput.input with inplace=True modifies the original file in place. You don't need to store the lines in a list.
You can also write to a tempfile, writing the unique lines to it and using shutil.move to replace the original file:
from tempfile import NamedTemporaryFile
from shutil import move

with open("file2.txt") as f, open("file1.txt") as f2, NamedTemporaryFile(dir=".", delete=False) as out:
    lines = set(map(str.rstrip, f))
    for line in f2:
        if line.rstrip().split(";")[1] not in lines:
            out.write(line)

move(out.name, "file1.txt")
If your code errors out, you won't lose any data in the original file when using a tempfile.
Using a set to store the lines means we have on average O(1) lookups; storing all the lines in a list would make the solution quadratic rather than linear, which for larger files is significantly less efficient. There is also no need to store all the lines of the other file in a list with readlines, since you can write as you iterate over the file object and do your lookups.
Unless the files are too large, you may print the lines in file1.txt (which I'll call entries) whose filename part is not listed in file2.txt with something like this:
with open('file1.txt') as f1:
    entries = f1.read().splitlines()

with open('file2.txt') as f2:
    filenames_to_delete = f2.read().splitlines()

print [entry for entry in entries if entry.split(';')[1] not in filenames_to_delete]
If file1.txt is large and file2.txt is small, then you may load the filenames in file2.txt entirely in memory, and then open file1.txt and go through it, checking against the in-memory list.
If file1.txt is small and file2.txt is large, you may do it the other way round.
If file1.txt and file2.txt are both excessively large, then if it is known that both files’ lines are sorted by filename, one could write some elaborate code to take advantage of that sorting to get the task done without loading the entire files in memory, as in this SO question. But if this is not an issue, you’ll be better off loading everything in memory and keeping things simple.
P.S. Since it is never necessary to have the two files open simultaneously, we avoid it: open a file, read it, close it, then repeat for the next. That keeps the code simpler to follow.
I can open two files at once using with open. Now, if I am going through two directories using this same method,
f = open(os.path.join('./directory/', filename1), "r")
f2 = open(os.path.join('./directory2/', filename1), "r")

with open(file1, 'a') as x:
    for line in f:
        if "string" in line:
            x.write(line)

with open(file2, 'a') as y:
    for line in f2:
        if "string" in line:
            y.write(line)
how can I merge these into one method?
Your pseudocode (for line in f and f1: x.write(line in f); y.write(line in f1)) has the same effect as the original code you posted, and isn't useful unless there is something about the corresponding lines in the two files that you want to process together.
But you can use zip to combine iterables to get what you want
import itertools

with open(os.path.join('./directory', filename1)) as r1, \
        open(os.path.join('./directory2', filename1)) as r2, \
        open(file1, 'a') as x, \
        open(file2, 'a') as y:
    for r1_line, r2_line in itertools.izip_longest(r1, r2):
        if r1_line and "string" in r1_line:
            x.write(r1_line)
        if r2_line and "string" in r2_line:
            y.write(r2_line)
I put all of the file objects in a single with statement, using \ to escape the newlines so that Python sees it as a single line.
The various permutations of zip combine iterables into a sequence of tuples.
I chose izip_longest because it will continue to emit lines from both files, using None for the file that empties first, until all lines are consumed. The if r1_line ... checks just make sure we aren't looking at the None values for a file that has been fully consumed.
This is a strange way to do things; for the example you've given, it's not the best choice.
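For reference, the difference between zip and izip_longest (zip_longest on Python 3) is easy to see with unequal-length inputs; a minimal Python 3 illustration:

```python
from itertools import zip_longest

a = ["x1", "x2", "x3"]
b = ["y1"]

# zip stops at the shorter input; zip_longest keeps going and
# pads the exhausted side with None (or a chosen fillvalue).
assert list(zip(a, b)) == [("x1", "y1")]
assert list(zip_longest(a, b)) == [("x1", "y1"), ("x2", None), ("x3", None)]
```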
I routinely use PowerShell to split larger text or CSV files into smaller files for quicker processing. However, I have a few files that come over in an unusual format. These are basically print files dumped to text. Each record starts with a single line that contains a 1 and nothing else.
What I need is to split a file based on the number of statements. So if I want to split the file into chunks of 3000 statements, I would go down until I see the 3001st occurrence of a 1 in position 1 and copy everything before that to the new file. I can run this from Windows, Linux, or OS X, so pretty much anything is open for the split.
Any ideas would be greatly appreciated.
Maybe try recognizing record boundaries by the fact that each is a '1' on a line of its own?
with open(input_file, 'r') as f:
    my_string = f.read()

my_list = my_string.split('\n1\n')
Separates each record to a list assuming it has the following format:
1
....
....
1
....
....
....
You can then output each element in the list to a separate file.
for x in range(len(my_list)):
    with open(str(x) + '.txt', 'w') as out:
        out.write(my_list[x])
To avoid loading the whole file into memory, you could define a function that generates records incrementally, and then use itertools' grouper recipe to write each 3000 records to a new file:
#!/usr/bin/env python3
from itertools import zip_longest

with open('input.txt') as input_file:
    files = zip_longest(*[generate_records(input_file)]*3000, fillvalue=())
    for n, records in enumerate(files):
        with open('output{n}.txt'.format(n=n), 'w') as output_file:
            output_file.writelines(''.join(lines)
                                   for r in records for lines in r)
where generate_records() yields one record at a time, each record itself being an iterator over lines in the input file:
from itertools import chain

def generate_records(input_file, start='1\n', eof=[]):
    def record(yield_start=True):
        if yield_start:
            yield start
        for line in input_file:
            if line == start:  # start new record
                break
            yield line
        else:  # EOF
            eof.append(True)

    # the first record may include lines before the first 1\n
    yield chain(record(yield_start=False),
                record())
    while not eof:
        yield record()
generate_records() is a generator that yields generators, like itertools.groupby() does.
For performance reasons, you could read/write chunks of multiple lines at once.
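If the generator machinery feels heavy, the same split can be sketched with a plain stateful loop (function and file names are illustrative; it assumes a record starts at a line containing only a 1, as described in the question):

```python
def split_records(in_path, per_file=3000, prefix="chunk"):
    """Split a print file into files of per_file records each.

    A record starts at a line whose content is exactly '1'. Lines
    before the first such marker are skipped. Returns the number
    of output files written, named chunk0.txt, chunk1.txt, ...
    """
    out, seen, part = None, 0, 0
    with open(in_path) as src:
        for line in src:
            if line.rstrip("\n") == "1":
                # Start a new output file every per_file records.
                if seen % per_file == 0:
                    if out:
                        out.close()
                    out = open("{}{}.txt".format(prefix, part), "w")
                    part += 1
                seen += 1
            if out:
                out.write(line)
    if out:
        out.close()
    return part
```

It reads one line at a time, so memory use is constant regardless of file size.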