Memory efficient way to add columns to .csv files - python

Ok, I couldn't really find an answer to this anywhere else, so I figured I'd ask.
I'm working with some .csv files that have about 74 million lines right now, and I'm trying to add columns from one file into another.
ex.
Week,Sales Depot,Sales Channel,Route,Client,Product,Units Sold,Sales,Units Returned,Returns,Adjusted Demand
3,1110,7,3301,15766,1212,3,25.14,0,0,3
3,1110,7,3301,15766,1216,4,33.52,0,0,4
combined with
Units_cat
0
1
so that
Week,Sales Depot,Sales Channel,Route,Client,Product,Units Sold,Units_cat,Sales,Units Returned,Returns,Adjusted Demand
3,1110,7,3301,15766,1212,3,0,25.14,0,0,3
3,1110,7,3301,15766,1216,4,1,33.52,0,0,4
I've been using pandas to read in and write out the .csv files, but the program keeps crashing because building the DataFrame overloads my memory. I've tried Python's csv module, but I'm not sure how to merge the files the way I want (insert columns, not just append rows).
Anyone know a more memory efficient method of combining these files?

Something like this might work for you:
Using csv.DictReader()
import csv
from itertools import izip

with open('file1.csv') as file1:
    with open('file2.csv') as file2:
        with open('result.csv', 'w') as result:
            file1 = csv.DictReader(file1)
            file2 = csv.DictReader(file2)
            # Get the field order correct here:
            fieldnames = file1.fieldnames
            index = fieldnames.index('Units Sold') + 1
            fieldnames = fieldnames[:index] + file2.fieldnames + fieldnames[index:]
            result = csv.DictWriter(result, fieldnames)

            def dict_merge(a, b):
                a.update(b)
                return a

            result.writeheader()
            result.writerows(dict_merge(a, b) for a, b in izip(file1, file2))
Using csv.reader()
import csv
from itertools import izip

with open('file1.csv') as file1:
    with open('file2.csv') as file2:
        with open('result.csv', 'w') as result:
            file1 = csv.reader(file1)
            file2 = csv.reader(file2)
            result = csv.writer(result)
            result.writerows(a[:7] + b + a[7:] for a, b in izip(file1, file2))
Notes:
This is for Python 2. In Python 3, use the built-in zip() instead of izip(). If the two files are not the same length, consider itertools.izip_longest() (zip_longest() in Python 3).
The memory efficiency comes from passing a generator expression to .writerows() instead of a list. This way, only the current line is under consideration at any moment in time, not the entire file. If a generator expression isn't appropriate, you'll get the same benefit from a for loop: for a,b in izip(...): result.writerow(...)
The dict_merge function is not required from Python 3.5 onward. In sufficiently new Pythons, try result.writerows({**a, **b} for a, b in zip(file1, file2)) (see this explanation). A full Python 3 version of the snippet is sketched below.
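A rough Python 3 translation of the DictReader snippet above (same placeholder file names; treat it as an untested sketch):
import csv

with open('file1.csv', newline='') as f1, \
     open('file2.csv', newline='') as f2, \
     open('result.csv', 'w', newline='') as out:
    r1 = csv.DictReader(f1)
    r2 = csv.DictReader(f2)
    # Splice the second file's columns in right after 'Units Sold'.
    index = r1.fieldnames.index('Units Sold') + 1
    fieldnames = r1.fieldnames[:index] + r2.fieldnames + r1.fieldnames[index:]
    writer = csv.DictWriter(out, fieldnames)
    writer.writeheader()
    writer.writerows({**a, **b} for a, b in zip(r1, r2))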

Related

How to search a string in a file in another file

I need to scan 2 files in Python and say which words in file1 are also in file2. I made a list with all the words from file2 and then check, for each line of file1, whether it is in that list.
This works perfectly, but with large files (around 500k lines) it can take over an hour, and I was wondering if there is a faster way.
Thanks in advance
(var, new_file, filter, bad, etc. are defined earlier)
a = []
for line in var:
    a += [line]

teller = 0
for line1 in new_file:
    if line1 not in a:
        print(line1, file=filter, end='')
    else:
        teller += 1
        print(line1, file=bad, end='')
print('There were', teller, 'lines that were in the old file.')
A faster alternative is using sets (as long as you can keep the content of both files in memory):
with open('a.txt', 'r') as a, open('b.txt', 'r') as b:
    a_content = set(a)
    b_content = set(b)
    result = a_content.intersection(b_content)
If you're worried about speed, then you should be using your OS facilities, not Python loops. Typically, the fastest way to look for individual lines is to sort both files and then do a simple file diff. If you insist on using Python, sorting both files and then comparing them line by line would likewise be much quicker than nested loops.
Your method will work, but it's super inefficient because you traverse all of file2 for every single word/line in file1. Try turning the lines of both files into sets and then comparing the sets; Python's set type has an .intersection() method for exactly this. A sketch of how that could look with the question's variables follows below.
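As a rough, untested sketch (reusing the names var, new_file, filter and bad from the question), here is the original loop with the old file's lines held in a set instead of a list:
# Membership tests against a set are O(1) on average, so the old file is no
# longer rescanned for every line of the new file.
old_lines = set(var)   # 'var' is the already-opened old file from the question
teller = 0
for line1 in new_file:
    if line1 not in old_lines:
        print(line1, file=filter, end='')
    else:
        teller += 1
        print(line1, file=bad, end='')
print('There were', teller, 'lines that were in the old file.')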

How can I parallelize a csv merging algorithm in python?

I have a folder with 100 GB of csv files that I want to merge into a single csv. The file names are in order of row position. I've written a single-threaded script to tackle this, but it is understandably slow.
def JoinRows(rows_to_join, init=True):
    # rows_to_join is a list of csv paths.
    for i, row in enumerate(rows_to_join):
        with open('join_rows.csv', 'a') as f1:
            # join_rows.csv is just the output file with all the rows
            with open(row, 'r') as f2:
                for line in f2:
                    f1.write('\n'+line)
I also wrote a recursive function that doesn't work and isn't parallel (yet). My thought was to join each csv with another, delete the second of the two, and keep repeating until only one file was left. This way the task could be split up among different available threads. Any suggestions?
def JoinRows(rows_to_join, init=False):
    if init == True:
        rows_to_join.sort()
    LEN = len(rows_to_join)
    print(LEN)
    if len(rows_to_join) == 2:
        with open(rows_to_join[0], 'a') as f1:
            with open(rows_to_join[1], 'rb') as f2:
                for line in f2:
                    f1.write('\n'+line)
        subprocess.check_call(['rm '+rows_to_join[1]], shell=True)
        return(rows_to_join[1])
    else:
        rows_to_join.remove(JoinRows(rows_to_join[:LEN//2]))
        rows_to_join.remove(JoinRows(rows_to_join[LEN//2:]))
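As a rough, untested sketch of the pairwise-merge idea described in the question, split across worker threads (the helper names, the worker count, and the use of shutil are assumptions, not part of the post):
import os
import shutil
from concurrent.futures import ThreadPoolExecutor

def merge_pair(pair):
    # Append the second file onto the first, delete the second, and return
    # the surviving path. Assumes each csv already ends with a newline.
    first, second = pair
    with open(first, 'a') as dst, open(second, 'r') as src:
        shutil.copyfileobj(src, dst)
    os.remove(second)
    return first

def parallel_join(paths, workers=4):
    # Repeatedly merge adjacent pairs until one file is left; each round's
    # merges run concurrently, and order is preserved by pool.map().
    paths = sorted(paths)
    while len(paths) > 1:
        pairs = [(paths[i], paths[i + 1]) for i in range(0, len(paths) - 1, 2)]
        leftover = [paths[-1]] if len(paths) % 2 else []
        with ThreadPoolExecutor(max_workers=workers) as pool:
            paths = list(pool.map(merge_pair, pairs)) + leftover
    return paths[0]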

Picking out and writing some fractions of a list with a for-loop

There is a dataset in a csv file which contains some tabular data.
I want to pick out the fractions (groups) with the same number.
For example, I have a list
a = [1,1,2,2,3,3,4,4,4,5,5,5,5,6]
and I want a loop that writes text files, each containing one group of equal numbers:
file_1.txt contains 1,1
file_2.txt contains 2,2
file_3.txt contains 3,3
file_4.txt contains 4,4,4
file_5.txt contains 5,5,5,5
file_6.txt contains 6
I still have no real result, because everything I have tried so far is wrong.
A much cleaner approach would be to use itertools.groupby and str.join:
from itertools import groupby

for num, group in groupby(a):
    filename = "file_%d.txt" % num
    with open(filename, 'w') as f:
        f.write(",".join(map(str, group)) + "\n")
Another important point is that you should always use the with statement when reading and writing to files.
Using groupby assumes that equal values are adjacent, i.e. that the data is already sorted. Another approach, which does not require sorting, is to use collections.Counter:
from collections import Counter

for num, count in Counter(a).items():
    filename = "file_%d.txt" % num
    with open(filename, 'w') as f:
        f.write(",".join([str(num)] * count) + "\n")
If I understood correctly, this should work:
for x in set(a):
    text_file = open("file_" + str(x) + ".txt", "w")
    text_file.write(((str(x) + ',') * a.count(x))[:-1])
    text_file.close()
The [:-1] in the third line removes the trailing comma ;)

Trying to copy column1 from a csv file to another empty file using python

I'm looking for a way using python to copy the first column from a csv into an empty file. I'm trying to learn python so any help would be great!
So if this is test.csv
A 32
D 21
C 2
B 20
I want this output
A
D
C
B
I've tried the following commands in Python, but the output file is empty:
import csv

f = open("test.csv", 'r')
reader = csv.reader(f, delimiter="\t")
names = ""
for each_line in reader:
    names = each_line[0]
First, you want to open your files. A good practice is to use the with statement (which, technically speaking, introduces a context manager), so that when your code exits the with block all the files are closed automatically.
with open('test.csv') as inpfile, open('out.csv', 'w') as outfile:
Next you want a loop over the lines of the input file (note the indentation: we are inside the with block); splitting the file into lines is automatic when you iterate over a text file.
    for line in inpfile:
Each line is a string, but you want to treat it as two fields separated by white space. This is so common that strings have a method for it (note again the increasing indent: we are now inside the for loop block).
        fields = line.split()
By default .split() splits on white space, but you can use, e.g., split(',') to split on commas. The result, fields, is a list of strings; for your first record it is equal to ['A', '32'], and you want to output just the first field in this list. For this purpose a file object has the .write() method, which writes a string, just a string, to the file, and fields[0] IS a string; we only have to add a newline character to it because, in this respect, .write() is different from print().
            outfile.write(fields[0]+'\n')
That's all, but if you omit my comments it's 4 lines of code
with open('test.csv') as inpfile, open('out.csv', 'w') as outfile:
    for line in inpfile:
        fields = line.split()
        outfile.write(fields[0]+'\n')
When you are done with learning (some) Python, ask for an explanation of this...
with open('test.csv') as ifl, open('out.csv', 'w') as ofl:
    ofl.write('\n'.join(line.split()[0] for line in ifl))
Addendum
The csv module, in such a simple case, adds the additional conveniences of
- auto-splitting each line into a list of strings
- taking care of the details of output (newlines, etc.)
and when learning Python it's more fruitful to see how these steps can be done with the bare language, or at least that is my opinion…
The situation is different when your data file is complex: it has headers, quoted strings possibly containing quoted delimiters, and so on. In those cases the use of csv is recommended, as it takes care of all the gory details. For complex data analysis requirements you will need other packages not included in the standard library, e.g., numpy and pandas, but that is another story.
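For completeness, a hedged csv-module version of the same 4-line program (assuming the fields are separated by a single space, as in the sample data):
import csv

with open('test.csv', newline='') as inpfile, \
     open('out.csv', 'w', newline='') as outfile:
    reader = csv.reader(inpfile, delimiter=' ')
    writer = csv.writer(outfile)
    for row in reader:
        writer.writerow([row[0]])   # keep only the first field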
This answer reads the CSV file with pandas, treating a column as delimited by a space character. You have to pass header=None, otherwise the first row will be taken as the header / column names.
ss is a slice: the 0th column, taking all rows, as denoted by the ':'.
The last line writes the slice to a new filename.
import pandas as pd

df = pd.read_csv('test.csv', sep=' ', header=None)
ss = df.iloc[:, 0]   # .ix has been removed from modern pandas; iloc selects by position
ss.to_csv('new_path.csv', sep=' ', index=False, header=False)
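If the file is large, a small variation (my assumption, not part of the original answer) is to read only the first column with usecols, so the remaining columns are never loaded into memory:
import pandas as pd

# Load only column 0; header=None keeps the first row as data.
first_col = pd.read_csv('test.csv', sep=' ', header=None, usecols=[0])
first_col.to_csv('new_path.csv', index=False, header=False)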
import csv

reader = csv.reader(open("test.csv", "rb"), delimiter='\t')
writer = csv.writer(open("output.csv", "wb"))
for e in reader:
    writer.writerow([e[0]])   # wrap the field in a list so it is written as one column, not split into characters
(This is Python 2 style; in Python 3, open the files in text mode with newline=''.)
The best you can do is create an empty list, append the first-column values to it, and then write that list out to another csv, for example:
import csv

def writetocsv(l):
    # write each collected value on its own row
    b = list(l)
    print(b)
    with open("newfile.csv", 'w', newline='') as f:
        w = csv.writer(f, delimiter=',')
        for value in b:
            w.writerow([value])

adcb_list = []
f = open("test.csv", 'r')
reader = csv.reader(f, delimiter="\t")
for each_line in reader:
    adcb_list.append(each_line[0])   # keep only the first column
writetocsv(adcb_list)
hope this works for you :-)

Nested with blocks in Python, level of nesting variable

I would like to combine columns from various csv files into one csv file, concatenated horizontally, with new headings. I want to select only certain columns, chosen by heading. There are different columns in each of the files to be combined.
Example input:
freestream.csv:
static pressure,static temperature,relative Mach number
1.01e5,288,5.00e-02
fan.csv:
static pressure,static temperature,mass flow
0.9e5,301,72.9
exhaust.csv:
static pressure,static temperature,mass flow
1.7e5,432,73.1
Desired output:
combined.csv:
P_amb,M0,Ps_fan,W_fan,W_exh
1.01e5,5.00e-02,0.9e5,72.9,73.1
Possible call to the function:
reorder_multiple_CSVs(["freestream.csv","fan.csv","exhaust.csv"],
"combined.csv",["static pressure,relative Mach number",
"static pressure,mass flow","mass flow"],
"P_amb,M0,Ps_fan,W_fan,W_exh")
Here is a previous version of the code, with only one input file allowed. I wrote this with help from write CSV columns out in a different order in Python:
def reorder_CSV(infilename, outfilename, oldheadings, newheadings):
    with open(infilename) as infile:
        with open(outfilename, 'w') as outfile:
            reader = csv.reader(infile)
            writer = csv.writer(outfile)
            readnames = reader.next()
            name2index = dict((name, index) for index, name in enumerate(readnames))
            writeindices = [name2index[name] for name in oldheadings.split(",")]
            reorderfunc = operator.itemgetter(*writeindices)
            writer.writerow(newheadings.split(","))
            for row in reader:
                towrite = reorderfunc(row)
                if isinstance(towrite, str):
                    writer.writerow([towrite])
                else:
                    writer.writerow(towrite)
So what I have figured out, in order to adapt this to multiple files, is:
- I need infilename, oldheadings, and newheadings to be lists now (all of the same length)
- I need to iterate over the list of input files to make a list of readers
- readnames can also be a list, iterating over the readers
- which means I can make name2index a list of dictionaries
One thing I don't know how to do is use the with keyword nested n levels deep when n is known only at run time. I read this: How can I open multiple files using "with open" in Python?, but that seems to work only when you know how many files you need to open.
Or is there a better way to do this?
I am quite new to python so I appreciate any tips you can give me.
I am only replying to the part about opening multiple files with with, where the number of files is not known in advance. It shouldn't be too hard to write your own context manager, something like this (completely untested):
from contextlib import contextmanager

@contextmanager
def open_many_files(filenames):
    files = [open(filename) for filename in filenames]
    try:
        yield files
    finally:
        for f in files:
            f.close()
Which you would use like this:
innames = ['file1.csv', 'file2.csv', 'file3.csv']
outname = 'out.csv'
with open_many_files(innames) as infiles, open(outname, 'w') as outfile:
    for infile in infiles:
        do_stuff(infile)
There is also a standard-library function that did something similar (contextlib.nested), but it is deprecated.
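On Python 3, contextlib.ExitStack covers the same need without writing your own context manager; a minimal sketch with placeholder file names:
from contextlib import ExitStack

filenames = ['file1.csv', 'file2.csv', 'file3.csv']
with ExitStack() as stack:
    # enter_context() registers each file so they are all closed when the block exits
    infiles = [stack.enter_context(open(name)) for name in filenames]
    for infile in infiles:
        print(infile.readline(), end='')   # placeholder for real per-file work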
I am not sure if this is the correct way to do this, but I wanted to expand on Bas Swinckels' answer. He had a couple of small inconsistencies in his very helpful answer, and I wanted to give the correct code.
Here is what I did, and it worked.
from contextlib import contextmanager
import csv
import operator
import itertools as IT

@contextmanager
def open_many_files(filenames):
    files = [open(filename, 'r') for filename in filenames]
    try:
        yield files
    finally:
        for f in files:
            f.close()

def reorder_multiple_CSV(infilenames, outfilename, oldheadings, newheadings):
    with open_many_files(filter(None, infilenames.split(','))) as handles:
        with open(outfilename, 'w') as outfile:
            readers = [csv.reader(f) for f in handles]
            writer = csv.writer(outfile)
            reorderfunc = []
            for i, reader in enumerate(readers):
                readnames = reader.next()
                name2index = dict((name, index) for index, name in enumerate(readnames))
                writeindices = [name2index[name] for name in filter(None, oldheadings[i].split(","))]
                reorderfunc.append(operator.itemgetter(*writeindices))
            writer.writerow(filter(None, newheadings.split(",")))
            for rows in IT.izip_longest(*readers, fillvalue=['']*2):
                towrite = []
                for i, row in enumerate(rows):
                    selected = reorderfunc[i](row)
                    # itemgetter returns a bare string when only one column was requested
                    if isinstance(selected, str):
                        towrite.append(selected)
                    else:
                        towrite.extend(selected)
                writer.writerow(towrite)
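A usage sketch of my own matching the example data from the question (note that this version takes the input file names as a single comma-separated string rather than a list):
reorder_multiple_CSV("freestream.csv,fan.csv,exhaust.csv",
                     "combined.csv",
                     ["static pressure,relative Mach number",
                      "static pressure,mass flow",
                      "mass flow"],
                     "P_amb,M0,Ps_fan,W_fan,W_exh")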
