I would like to combine columns of various CSV files into one CSV file, concatenated horizontally, with a new heading. I want to select only certain columns, chosen by heading. Each of the files to be combined has different columns.
Example input:
freestream.csv:
static pressure,static temperature,relative Mach number
1.01e5,288,5.00e-02
fan.csv:
static pressure,static temperature,mass flow
0.9e5,301,72.9
exhaust.csv:
static pressure,static temperature,mass flow
1.7e5,432,73.1
Desired output:
combined.csv:
P_amb,M0,Ps_fan,W_fan,W_exh
1.01e5,5.00e-02,0.9e5,72.9,73.1
Possible call to the function:
reorder_multiple_CSVs(["freestream.csv","fan.csv","exhaust.csv"],
"combined.csv",["static pressure,relative Mach number",
"static pressure,mass flow","mass flow"],
"P_amb,M0,Ps_fan,W_fan,W_exh")
Here is a previous version of the code, which allows only one input file. I wrote this with help from write CSV columns out in a different order in Python:
import csv
import operator

def reorder_CSV(infilename, outfilename, oldheadings, newheadings):
    with open(infilename) as infile:
        with open(outfilename, 'w') as outfile:
            reader = csv.reader(infile)
            writer = csv.writer(outfile)
            readnames = reader.next()
            name2index = dict((name, index) for index, name in enumerate(readnames))
            writeindices = [name2index[name] for name in oldheadings.split(",")]
            reorderfunc = operator.itemgetter(*writeindices)
            writer.writerow(newheadings.split(","))
            for row in reader:
                towrite = reorderfunc(row)
                # itemgetter returns a single string when only one column
                # is selected, so wrap it in a list before writing
                if isinstance(towrite, str):
                    writer.writerow([towrite])
                else:
                    writer.writerow(towrite)
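For illustration, a call to this single-file version using the freestream.csv example above might look like this (the output file name here is just an illustration):

reorder_CSV("freestream.csv", "freestream_out.csv",
            "static pressure,relative Mach number",
            "P_amb,M0")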
So what I have figured out, in order to adapt this to multiple files, is:
- I need infilename, oldheadings, and newheadings to be lists now (all of the same length)
- I need to iterate over the list of input files to make a list of readers
- readnames can also be a list, built by iterating over the readers
- which means I can make name2index a list of dictionaries
One thing I don't know how to do is use the keyword with, nested n levels deep, when n is known only at run time. I read How can I open multiple files using "with open" in Python?, but that seems to work only when you know how many files you need to open.
Or is there a better way to do this?
I am quite new to python so I appreciate any tips you can give me.
I am only replying to the part about opening multiple files with with, where the number of files is not known beforehand. It shouldn't be too hard to write your own context manager, something like this (completely untested):
from contextlib import contextmanager

@contextmanager
def open_many_files(filenames):
    files = [open(filename) for filename in filenames]
    try:
        yield files
    finally:
        for f in files:
            f.close()
Which you would use like this:
innames = ['file1.csv', 'file2.csv', 'file3.csv']
outname = 'out.csv'

with open_many_files(innames) as infiles, open(outname, 'w') as outfile:
    for infile in infiles:
        do_stuff(infile)
There is also contextlib.nested, which does something similar, but it is deprecated.
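For readers on Python 3.3 or later, the standard library's contextlib.ExitStack handles a runtime-determined number of files directly; a minimal sketch (do_stuff is a placeholder, as above):

from contextlib import ExitStack

innames = ['file1.csv', 'file2.csv', 'file3.csv']
with ExitStack() as stack:
    # enter_context registers each file so all of them are closed
    # when the with block exits, even on error
    infiles = [stack.enter_context(open(name)) for name in innames]
    for infile in infiles:
        do_stuff(infile)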
I am not sure if this is the correct way to do this, but I wanted to expand on Bas Swinckels' very helpful answer and give complete, working code.
Here is what I did, and it worked.
from contextlib import contextmanager
import csv
import operator
import itertools as IT

@contextmanager
def open_many_files(filenames):
    files = [open(filename, 'r') for filename in filenames]
    try:
        yield files
    finally:
        for f in files:
            f.close()

def reorder_multiple_CSV(infilenames, outfilename, oldheadings, newheadings):
    # infilenames and oldheadings are lists of the same length;
    # newheadings is a single comma-separated string
    with open_many_files(filter(None, infilenames)) as handles:
        with open(outfilename, 'w') as outfile:
            readers = [csv.reader(f) for f in handles]
            writer = csv.writer(outfile)
            reorderfunc = []
            for i, reader in enumerate(readers):
                readnames = reader.next()
                name2index = dict((name, index) for index, name in enumerate(readnames))
                writeindices = [name2index[name] for name in filter(None, oldheadings[i].split(","))]
                reorderfunc.append(operator.itemgetter(*writeindices))
            writer.writerow(filter(None, newheadings.split(",")))
            for rows in IT.izip_longest(*readers, fillvalue=['']*2):
                towrite = []
                for i, row in enumerate(rows):
                    picked = reorderfunc[i](row)
                    # itemgetter returns a single string when only one
                    # column is selected; append it so extend() does not
                    # split it into characters
                    if isinstance(picked, str):
                        towrite.append(picked)
                    else:
                        towrite.extend(picked)
                writer.writerow(towrite)
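With the example files from the top of the question, the corrected function is called like this:

reorder_multiple_CSV(["freestream.csv", "fan.csv", "exhaust.csv"],
                     "combined.csv",
                     ["static pressure,relative Mach number",
                      "static pressure,mass flow",
                      "mass flow"],
                     "P_amb,M0,Ps_fan,W_fan,W_exh")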
Ok, I couldn't really find an answer to this anywhere else, so I figured I'd ask.
I'm working with some .csv files that have about 74 million lines right now and I'm trying to add columns into one file from another file.
ex.
Week,Sales Depot,Sales Channel,Route,Client,Product,Units Sold,Sales,Units Returned,Returns,Adjusted Demand
3,1110,7,3301,15766,1212,3,25.14,0,0,3
3,1110,7,3301,15766,1216,4,33.52,0,0,4
combined with
Units_cat
0
1
so that
Week,Sales Depot,Sales Channel,Route,Client,Product,Units Sold,Units_cat,Sales,Units Returned,Returns,Adjusted Demand
3,1110,7,3301,15766,1212,3,0,25.14,0,0,3
3,1110,7,3301,15766,1216,4,1,33.52,0,0,4
I've been using pandas to read in and output the .csv files, but the issue is that the program keeps crashing because creating the DataFrame overloads my memory. I've tried the csv library from Python, but I'm not sure how to merge the files the way I want (not just append them).
Anyone know a more memory efficient method of combining these files?
Something like this might work for you:
Using csv.DictReader()
import csv
from itertools import izip

with open('file1.csv') as file1:
    with open('file2.csv') as file2:
        with open('result.csv', 'w') as result:
            file1 = csv.DictReader(file1)
            file2 = csv.DictReader(file2)
            # Get the field order correct here:
            fieldnames = file1.fieldnames
            index = fieldnames.index('Units Sold') + 1
            fieldnames = fieldnames[:index] + file2.fieldnames + fieldnames[index:]
            result = csv.DictWriter(result, fieldnames)

            def dict_merge(a, b):
                a.update(b)
                return a

            result.writeheader()
            result.writerows(dict_merge(a, b) for a, b in izip(file1, file2))
Using csv.reader()
import csv
from itertools import izip

with open('file1.csv') as file1:
    with open('file2.csv') as file2:
        with open('result.csv', 'w') as result:
            file1 = csv.reader(file1)
            file2 = csv.reader(file2)
            result = csv.writer(result)
            # 7 is the position just after 'Units Sold' in the header
            result.writerows(a[:7] + b + a[7:] for a, b in izip(file1, file2))
Notes:
This is for Python 2. In Python 3 you can use the normal zip() function. If the two files are not of equal length, consider itertools.izip_longest() (zip_longest in Python 3).
The memory efficiency comes from passing a generator expression to .writerows() instead of a list. This way, only the current line is under consideration at any moment in time, not the entire file. If a generator expression isn't appropriate, you'll get the same benefit from a for loop: for a,b in izip(...): result.writerow(...)
The dict_merge function is not required starting from Python 3.5. In sufficiently new Pythons, try result.writerows({**a, **b} for a, b in zip(file1, file2)) (see this explanation).
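For completeness, a Python 3 version of the DictReader approach might look like this (an untested sketch; it assumes the same file names and the 'Units Sold' column from the question):

import csv

with open('file1.csv', newline='') as f1, \
     open('file2.csv', newline='') as f2, \
     open('result.csv', 'w', newline='') as out:
    reader1 = csv.DictReader(f1)
    reader2 = csv.DictReader(f2)
    # Splice the second file's columns in after 'Units Sold'
    fieldnames = reader1.fieldnames
    index = fieldnames.index('Units Sold') + 1
    fieldnames = fieldnames[:index] + reader2.fieldnames + fieldnames[index:]
    writer = csv.DictWriter(out, fieldnames)
    writer.writeheader()
    # The generator expression keeps memory usage at one row per file
    writer.writerows({**a, **b} for a, b in zip(reader1, reader2))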
I'm writing a script that has a for loop to extract a list of variables from each 'data_i.csv' file in a folder, then appends that list as a new row in a single 'output.csv' file.
My objective is to define the headers of the file once and then append data to the 'output.csv' container-file so it will function as a backlog for a standard measurement.
The first time I run the script it will add all the files in the folder. The next time I run it, I want it to append only files that have been added since. I thought one way of doing this would be to check for duplicates, but the code I found for that so far only searches for consecutive duplicates.
Do you have suggestions?
Here's how I made it so far:
import csv, os

# Find csv files
for csvFilename in os.listdir('.'):
    if not csvFilename.endswith('.csv'):
        continue
    # Read in csv file and choose certain cells
    csvRows = []
    csvFileObj = open(csvFilename)
    csvData = csv.reader(csvFileObj, delimiter=' ', skipinitialspace=True)
    csvLines = list(csvData)
    cellID = csvLines[4][3]
    # Read in several variables...
    csvRows = [cellID]
    csvFileObj.close()
    resultFile = open("Output.csv", 'a')  # open in 'append' mode
    wr = csv.writer(resultFile)
    wr.writerows([csvRows])
    resultFile.close()
This is the final script after mgc's answer:
import csv, os

f = open('Output.csv', 'r+')
# Materialize the rows already in the output file; the first column
# of each row holds the source filename, so it doubles as the log
already_merged = [row[0] for row in csv.reader(f) if row]
new_rows = []
for csvFilename in os.listdir('.'):
    if not csvFilename.endswith('_spm.txt'):
        continue
    if csvFilename in already_merged:
        continue
    csvFileObj = open(csvFilename)
    csvData = csv.reader(csvFileObj, delimiter=' ', skipinitialspace=True)
    csvLines = list(csvData)
    waferID = csvLines[4][3]
    temperature = csvLines[21][2]
    new_rows.append([csvFilename, waferID, temperature])
    csvFileObj.close()
wr = csv.writer(f)
wr.writerows(new_rows)
f.close()
You can keep track of the name of each file already handled. If this log file doesn't need to be human readable, you can use pickle. At the start of your script, you can do:
import pickle

try:
    with open('merged_log', 'rb') as f:
        merged_files = pickle.load(f)
except FileNotFoundError:
    merged_files = set()
Then you can add a condition to skip previously processed files:
if filename in merged_files: continue
Then, when you are processing a file, you can do:
merged_files.add(filename)
And save the variable at the end of your script (so it will be used on the next run):
with open('merged_log', 'wb') as f:
    pickle.dump(merged_files, f)
(However, there are other options for your problem; for example, you can slightly change the name of a file once it has been processed, like changing the extension from .csv to .csv_, or move processed files into a subfolder, etc.)
Also, in the example in your question, I don't think you need to open (and close) your output file on each iteration of your for loop. Open it once before your loop, write what you have to write, then close it once you have left the loop.
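Putting the pieces together, a minimal untested sketch (the file names, the '_spm.txt' suffix, and the cell indices are taken from the question):

import csv, os, pickle

# Load the set of already-processed filenames, if the log exists
try:
    with open('merged_log', 'rb') as log:
        merged_files = pickle.load(log)
except FileNotFoundError:
    merged_files = set()

# Open the output file once, outside the loop
with open('Output.csv', 'a') as resultFile:
    wr = csv.writer(resultFile)
    for csvFilename in os.listdir('.'):
        if not csvFilename.endswith('_spm.txt') or csvFilename in merged_files:
            continue
        with open(csvFilename) as csvFileObj:
            csvLines = list(csv.reader(csvFileObj, delimiter=' ', skipinitialspace=True))
        wr.writerow([csvLines[4][3], csvLines[21][2]])  # waferID, temperature
        merged_files.add(csvFilename)

# Save the updated log for the next run
with open('merged_log', 'wb') as log:
    pickle.dump(merged_files, log)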
I would like to use the Python CSV module to open a CSV file for appending. Then, from a list of CSV files, I would like to read each CSV file and write it to the appended CSV file. My script works great, except that I cannot find a way to remove the headers from all but the first CSV file being read. I am certain that my else block of code is not executing properly. Perhaps my syntax for my if/else code is the problem? Any thoughts would be appreciated.
writeFile = open(append_file, 'a+b')
writer = csv.writer(writeFile, dialect='excel')
for files in lstFiles:
    readFile = open(input_file, 'rU')
    reader = csv.reader(readFile, dialect='excel')
    for i in range(0, len(lstFiles)):
        if i == 0:
            oldHeader = readFile.readline()
            newHeader = writeFile.write(oldHeader)
            for row in reader:
                writer.writerow(row)
        else:
            reader.next()
            for row in reader:
                row = readFile.readlines()
                writer.writerow(row)
    readFile.close()
writeFile.close()
You're effectively iterating over lstFiles twice. For each file in your list, you're running your inner for loop up from 0. You want something like:
writeFile = open(append_file, 'a+b')
writer = csv.writer(writeFile, dialect='excel')
headers_needed = True
for input_file in lstFiles:
    readFile = open(input_file, 'rU')
    reader = csv.reader(readFile, dialect='excel')
    oldHeader = reader.next()
    if headers_needed:
        writer.writerow(oldHeader)
        headers_needed = False
    for row in reader:
        writer.writerow(row)
    readFile.close()
writeFile.close()
You could also use enumerate over lstFiles to iterate over tuples containing the iteration count and the filename, but I think the boolean shows the logic more clearly; a sketch of the enumerate variant follows below.
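For illustration, the enumerate variant might look like this (same assumptions as the code above):

writeFile = open(append_file, 'a+b')
writer = csv.writer(writeFile, dialect='excel')
for i, input_file in enumerate(lstFiles):
    readFile = open(input_file, 'rU')
    reader = csv.reader(readFile, dialect='excel')
    oldHeader = reader.next()
    if i == 0:
        # Only the first file contributes its header row
        writer.writerow(oldHeader)
    for row in reader:
        writer.writerow(row)
    readFile.close()
writeFile.close()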
You probably do not want to mix iterating over the csv reader and directly calling readline on the underlying file.
I think you're iterating too many times (over various things: both your list of files and the files themselves). You've definitely got some consistency problems; it's a little hard to be sure since we can't see your variable initializations. This is what I think you want:
with open(append_file, 'a+b') as writeFile:
    need_headers = True
    for input_file in lstFiles:
        with open(input_file, 'rU') as readFile:
            headers = readFile.readline()
            if need_headers:
                # Write the headers only if we need them
                writeFile.write(headers)
                need_headers = False
            # Now write the rest of the input file.
            for line in readFile:
                writeFile.write(line)
I took out all the csv-specific stuff since there's no reason to use it for this operation. I also cleaned the code up considerably to make it easier to follow, using the files as context managers and a well-named boolean instead of the "magic" i == 0 check. The result is a much nicer block of code that (hopefully) won't have you jumping through hoops to understand what's going on.
I am new to python but I have searched on Stack Overflow, google, and CodeAcademy for an answer or inspiration for my obviously very simple problem. I thought finding a simple example where a for loop is used to save every interation would be easy to find but I've either missed it or don't have the vocab to ask the right question. So please don't loudly sigh in front of your monitor at this simple question. Thanks.
I would like to simply write a csv file with each iteration of the two print lines in the code below as separate columns. An output example might look like:
##################
andy.dat, 8
brett.dat, 9
candice.dat, 11
#################
the code I have so far is:
import sys
import os.path

image_path = "C:\\"
for filename in os.listdir(image_path):
    print filename
    print len(filename)
If I try to do x = filename, then I only get the last iteration of the loop written to x. How do I write all of them to x using a for loop? Also, how do I write it as a column in a csv with the print result of len(filename) next to it? Thanks.
Although for this task you don't need it, I would take advantage of standard library modules when you can, like csv. Try something like this:
import os
import csv

csvfile = open('outputFileName.csv', 'wb')
writer = csv.writer(csvfile)
for filename in os.listdir('/'):  # or "C:\\" if on Windows
    writer.writerow([filename, len(filename)])
csvfile.close()
I'd probably change this:
for filename in os.listdir(image_path):
    print filename
    print len(filename)
To something like
lines = list()
for filename in os.listdir(image_path):
    lines.append("%s, %d" % (filename, len(filename)))
My version creates a python list, then on each iteration of your for loop, appends an entry to it.
After you're done, you could print the lines with something like:
for line in lines:
    print(line)
Alternatively, you could initially create a list of tuples in the first loop, then format the output in the second loop. This approach might look like:
lines = list()
# Populate list
for filename in os.listdir(image_path):
    lines.append((filename, len(filename)))

# Print list
for line in lines:
    print("%s, %d" % (line[0], line[1]))

# Or more simply
for line in lines:
    print("%s, %d" % line)
Lastly, you don't really need to explicitly store the filename length; you can just calculate it and display it on the fly. In fact, you don't even really need to create a list and use two loops.
Your code could be as simple as:

import sys, os

image_path = "C:\\"
for filename in os.listdir(image_path):
    print("%s, %d" % (filename, len(filename)))
import sys
import os

image_path = "C:\\"
output = file("output.csv", "a")
for filename in os.listdir(image_path):
    output.write("%s,%d\n" % (filename, len(filename)))
output.close()
Here "a" in the file constructor opens the file for appending; you can read more about the different modes in which you can use a file object here.
Try this:
# Part 1
import csv
import os

# Part 2
image_path = "C:\\"

# Part 3
li = []  # empty list
for filename in os.listdir(image_path):
    li.append((filename, len(filename)))  # populating the list

# Part 4
with open('test.csv', 'w') as f:
    f.truncate()
    writer = csv.writer(f)
    writer.writerows(li)
Explanation:
In Part 1, we import the csv and os modules.
In Part 2, we declare image_path.
In Part 3, we declare an empty list (li), then go into a for loop in which we populate the list with every item in image_path and its length.
In Part 4, we move on to writing the csv file: inside the with statement, we write all the data from li into our file.
I'd like to read the contents from several files into unique lists that I can call later - ultimately, I want to convert these lists to sets and perform intersections and subtraction on them. This must be an incredibly naive question, but after poring over the iterators and loops sections of Lutz's "Learning Python," I can't seem to wrap my head around how to approach this. Here's what I've written:
#!/usr/bin/env python
import sys

OutFileName = 'test.txt'
OutFile = open(OutFileName, 'w')
FileList = sys.argv[1: ]
Len = len(FileList)
print Len
for i in range(Len):
    sys.stderr.write("Processing file %s\n" % (i))
    FileNum = i

for InFileName in FileList:
    InFile = open(InFileName, 'r')
    PathwayList = InFile.readlines()
    print PathwayList
    InFile.close()
With a couple of simple test files, I get output like this:
Processing file 0
Processing file 1
['alg1\n', 'alg2\n', 'alg3\n', 'alg4\n', 'alg5\n', 'alg6']
['csr1\n', 'csr2\n', 'csr3\n', 'csr4\n', 'csr5\n', 'csr6\n', 'csr7\n', 'alg2\n', 'alg6']
These lists are correct, but how do I assign each one to a unique variable so that I can call them later (for example, by including the index # from range in the variable name)?
Thanks so much for pointing a complete programming beginner in the right direction!
#!/usr/bin/env python
import sys

FileList = sys.argv[1: ]
PathwayList = []
for InFileName in FileList:
    sys.stderr.write("Processing file %s\n" % InFileName)
    InFile = open(InFileName, 'r')
    PathwayList.append(InFile.readlines())
    InFile.close()
Assuming you read in two files, the following will do a line-by-line comparison (it won't pick up any extra lines in the longer file, but then the files wouldn't be the same if one had more lines than the other):
for i, s in enumerate(zip(PathwayList[0], PathwayList[1]), 1):
    if s[0] == s[1]:
        print i, 'match', s[0]
    else:
        print i, 'non-match', s[0], '!=', s[1]
For what you're wanting to do, you might want to take a look at the difflib module in Python; a small sketch follows below. For sorting, look at Mutable Sequence Types; someListVar.sort() will sort the contents of someListVar in place.
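A minimal difflib sketch, assuming PathwayList holds the two files' lines as above (the file labels are placeholders):

import difflib

for line in difflib.unified_diff(PathwayList[0], PathwayList[1],
                                 fromfile='file1', tofile='file2'):
    # unified_diff yields diff lines that are already newline-terminated
    print line,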
You could do it like this if you don't need to remember where the contents came from:
PathwayList = []
for InFileName in FileList:
    sys.stderr.write("Processing file %s\n" % InFileName)
    InFile = open(InFileName, 'r')
    PathwayList.append(InFile.readlines())
    InFile.close()

for contents in PathwayList:
    # do something with contents which is a list of strings
    print contents
Or, if you want to keep track of the file names, you could use a dictionary:
PathwayList = {}
for InFileName in FileList:
    sys.stderr.write("Processing file %s\n" % InFileName)
    InFile = open(InFileName, 'r')
    PathwayList[InFileName] = InFile.readlines()
    InFile.close()

for filename, contents in PathwayList.items():
    # do something with contents which is a list of strings
    print filename, contents
You might want to check out Python's fileinput module, which is a part of the standard library and allows you to process multiple files at once.
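For illustration, a minimal fileinput sketch (with no arguments, fileinput.input() reads the files named in sys.argv[1:]):

import fileinput

for line in fileinput.input():
    # fileinput.filename() names the file the current line came from
    print fileinput.filename(), line,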
Essentially, you have a list of files and you want to turn it into a list of the lines of those files...
Several ways:
result = [ list(open(n)) for n in sys.argv[1:] ]
This would get you a result like [['alg1', 'alg2', 'alg3'], ['csr1', 'csr2', ...]]. Accessing would be like result[0], which would give ['alg1', 'alg2', 'alg3']...
Somewhat better might be a dictionary:
result = dict( (n, list(open(n))) for n in sys.argv[1:] )
If you just want to concatenate them, you would need to chain them:
import itertools
result = list(itertools.chain.from_iterable(open(n) for n in sys.argv[1:]))
# -> ['alg1', 'alg2', 'alg3', 'csr1', 'csr2', ...]
These are not one-liners for a beginner; however, it would now be a good exercise to try to understand what's going on :)
You need to dynamically create the variable name for each file 'number' that you're reading. (I'm being deliberately vague here; knowing how to build variables like this is quite valuable and more readily remembered if you discover it yourself.) Something like the sketch below will give you a start.
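One guess at what this hint points toward: build the "variable names" as dictionary keys rather than real variables (the pathway_%d key pattern is hypothetical, not from the original answer):

data = {}
for i, InFileName in enumerate(FileList):
    InFile = open(InFileName, 'r')
    data['pathway_%d' % i] = InFile.readlines()
    InFile.close()

print data['pathway_0']  # the first file's lines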
You need a list which holds your PathwayList lists, that is, a list of lists.
One remark: it is quite uncommon to use capitalized variable names. There is no strict rule for that, but by convention most people use capitalized names only for classes.