Merging multiple CSV files without headers being repeated (using Python)

I am a beginner with Python. I have multiple CSV files (more than 10), and all of them have the same columns. I would like to merge all of them into a single CSV file, with no repeated headers.
So essentially I need just the first row with all the headers, and from then on all the rows from all the CSV files merged. How do I do this?
Here's what I tried so far.
import glob
import csv

with open('output.csv','wb') as fout:
    wout = csv.writer(fout,delimiter=',')

interesting_files = glob.glob("*.csv")
for filename in interesting_files:
    print 'Processing',filename
    # Open and process file
    h = True
    with open(filename,'rb') as fin:
        fin.next()  # skip header
        for line in csv.reader(fin,delimiter=','):
            wout.writerow(line)

If you are on a Linux system:
head -1 directory/one_file.csv > output.csv   ## write the header to the final file
tail -n +2 -q directory/*.csv >> output.csv   ## append every csv starting from its second line (-q suppresses the per-file headers tail would otherwise print)

While I think that the best answer is the one from @valentin, you can do this without using the csv module at all:
import glob

interesting_files = glob.glob("*.csv")
header_saved = False
with open('output.csv','wb') as fout:
    for filename in interesting_files:
        with open(filename) as fin:
            header = next(fin)
            if not header_saved:
                fout.write(header)
                header_saved = True
            for line in fin:
                fout.write(line)
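One caveat with copying raw lines (my addition, not part of the original answer): if a file does not end with a trailing newline, its last row and the next file's first row get glued together. A defensive variant of the same loop, in Python 3 open modes:

import glob

interesting_files = glob.glob("*.csv")
header_saved = False
with open('output.csv', 'w') as fout:
    for filename in interesting_files:
        with open(filename) as fin:
            header = next(fin)
            if not header_saved:
                fout.write(header)
                header_saved = True
            for line in fin:
                # guard against files whose last line lacks a newline
                if not line.endswith('\n'):
                    line += '\n'
                fout.write(line)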

If you don't mind the overhead, you could use pandas, which ships with common Python distributions. If you plan to do more with spreadsheet-like tables, I recommend using pandas rather than trying to write your own library.
import pandas as pd
import glob

interesting_files = glob.glob("*.csv")
df_list = []
for filename in sorted(interesting_files):
    df_list.append(pd.read_csv(filename))
full_df = pd.concat(df_list)
full_df.to_csv('output.csv', index=False)  # index=False keeps pandas from adding a row-number column
Just a little more on pandas. Because it is made to deal with spreadsheet-like data, it knows that the first line is a header. When reading a CSV it separates the data table from the header, which is kept as metadata of the dataframe, the standard datatype in pandas. If you concat several of these dataframes, pandas aligns the columns by header name; where the headers differ, the missing columns are filled with NaN rather than silently shifted, which makes it easy to notice when your directory is polluted with CSV files from another source.
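A minimal sketch of that alignment behavior (column names made up):

import pandas as pd

# Two tiny frames standing in for two CSV files with different headers
a = pd.DataFrame({'city': ['omaha'], 'agr': [1.0]})
b = pd.DataFrame({'city': ['lincoln'], 'yield': [2.0]})

print(pd.concat([a, b]))
# columns are aligned by name; 'agr' and 'yield' get NaN where a file lacks them:
#       city  agr  yield
# 0    omaha  1.0    NaN
# 0  lincoln  NaN    2.0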
Another thing: I just added sorted() around interesting_files. I assume your files are named in order and that this order should be kept. glob does not guarantee any particular order; the underlying os functions do not necessarily return files sorted by name.
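For instance (the unsorted output order shown is made up):

import glob

files = glob.glob("*.csv")   # arbitrary, filesystem-dependent order, e.g. ['2.csv', '10.csv', '1.csv']
print(sorted(files))         # deterministic, but lexicographic: '10.csv' sorts before '2.csv'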

Your attempt is almost working, but there are a few issues:
you're opening the output file for writing, but the with block ends before the loop, so the file is already closed when you write the rows.
you're never writing the header. You have to write it once.
Also you have to exclude output.csv from the glob, else the output is also an input!
Here's the corrected code, passing the csv reader directly to the writer.writerows method for shorter and faster code, and writing the header from the first file to the output file.
import glob
import csv

output_file = 'output.csv'
header_written = False

with open(output_file, 'w', newline='') as fout:  # just 'wb' in Python 2
    wout = csv.writer(fout, delimiter=',')
    # filter out the output file itself
    interesting_files = [x for x in glob.glob("*.csv") if x != output_file]
    for filename in interesting_files:
        print('Processing {}'.format(filename))
        with open(filename) as fin:
            cr = csv.reader(fin, delimiter=',')
            header = next(cr)  # read the header (cr.next() in Python 2)
            if not header_written:
                wout.writerow(header)
                header_written = True
            wout.writerows(cr)
Note that solutions using raw line-by-line processing miss an important point: if the header is multi-line, they miserably fail, botching the title line or repeating part of it several times, effectively corrupting the file.
The csv module (or pandas) handles those cases gracefully.
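A quick demonstration of that failure mode, with a made-up two-column file whose header contains a quoted newline:

import csv
import io

# A quoted header field containing an embedded newline
data = '"multi\nline header",value\n1,2\n'

f = io.StringIO(data)
print(repr(next(f)))        # '"multi\n' -- raw line iteration cuts the record in half
f.seek(0)
print(next(csv.reader(f)))  # ['multi\nline header', 'value'] -- csv sees one record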

Your indentation is wrong; you need to put the loop inside the with block. You can also pass a csv.reader over the input file to writer.writerows (passing the raw file object would split each line into individual characters).
import csv
import glob

with open('output.csv','wb') as fout:
    wout = csv.writer(fout)
    interesting_files = glob.glob("*.csv")
    for filename in interesting_files:
        print 'Processing',filename
        with open(filename,'rb') as fin:
            next(fin)  # skip header
            wout.writerows(csv.reader(fin))

Related

Combining multiple csv files into one csv file

I am trying to combine multiple csv files into one, and have tried a number of methods but I am struggling.
I import the data from multiple csv files, and when I compile them together into one csv file, the first few rows come out nicely, but then gaps of variable size start appearing randomly between the rows, and the combined csv file never finishes being written; it just seems to continuously get data added to it, which does not make sense to me because I am trying to compile a finite amount of data.
I have already tried writing close statements for the file, and I still get the same result: my designated combined csv file never stops getting data, and the data is randomly spaced throughout the file. I just want a normally compiled csv.
Is there an error in my code? Is there any explanation as to why my csv file is behaving this way?
csv_file_list = glob.glob(Dir + '/*.csv')  # returns the file list
print (csv_file_list)
with open(Avg_Dir + '.csv','w') as f:
    wf = csv.writer(f, delimiter = ',')
    print (f)
    for files in csv_file_list:
        rd = csv.reader(open(files,'r'),delimiter = ',')
        for row in rd:
            print (row)
            wf.writerow(row)
Your code works for me.
Alternatively, you can merge files as follows:
csv_file_list = glob.glob(Dir + '/*.csv')
with open(Avg_Dir + '.csv','w') as wf:
    for file in csv_file_list:
        with open(file) as rf:
            for line in rf:
                if line.strip():  # if line is not empty
                    if not line.endswith("\n"):
                        line += "\n"
                    wf.write(line)
Or, if the files are not too large, you can read each file in at once. But in this case all empty lines and headers will be copied:
csv_file_list = glob.glob(Dir + '/*.csv')
with open(Avg_Dir + '.csv','w') as wf:
    for file in csv_file_list:
        with open(file) as rf:
            wf.write(rf.read().strip() + "\n")
Consider several adjustments:
Use a context manager, with, for both the read and the write process. This avoids the need to close() file objects, which you do not do on the read objects.
For the blank-lines issue: use either the newline='' argument in open() or the lineterminator="\n" argument in csv.writer(). See SO answers for the former and the latter.
Use os.path.join() to properly concatenate folder and file paths. This method is OS-agnostic, so it accounts for Windows and Unix machines using forward or backward slashes.
Adjusted script:
import os
import csv, glob

Dir = r"C:\Path\To\Source"
Avg_Dir = r"C:\Path\To\Destination\Output"

csv_file_list = glob.glob(os.path.join(Dir, '*.csv'))  # returns the file list
print (csv_file_list)

with open(os.path.join(Avg_Dir, 'Output.csv'), 'w', newline='') as f:
    wf = csv.writer(f, lineterminator='\n')
    for files in csv_file_list:
        with open(files, 'r') as r:
            next(r)  # SKIP HEADERS
            rr = csv.reader(r)
            for row in rr:
                wf.writerow(row)

How can I open multiple csv files in a folder, take the average of a column and save in a separate file using python?

I am extremely new at python and need some help with this one. I've tried various codes and none seem to work, so suggestions would be awesome.
I have a folder with about 1500 csv files that each contain multiple columns of data. I need to take the average of the first column, called "agr", and save this value in a different excel or csv file. It would be great if I could also somehow save the name of the file with its averaged value so that I can keep track of which file it came from. The names of the files follow the pattern crop_city (e.g. corn_omaha).
import glob
import csv
import numpy as np
import pandas as pd

path = ('C:/test/*.csv')
for fname in glob.glob(path):
    with open(fname) as csvfile:
        agr = []
        reader = csv.DictReader(fname)
        print row['agr']
I know the code above is extremely rudimentary, so any help would be great thanks everyone!
Assuming the first column in these CSV files is a decimal or float, you don't really need to parse the entire line. Just split at the first separator and parse the first token. There is no real advantage to numpy or pandas here either. Just use the built-in sum function.
import glob
import os

path = 'test/*.csv'  # using local dir for test

with open('output.csv', 'w', newline='') as outfile:
    outfile.write("Filename,Sum\r\n")  # header for output
    for fname in glob.glob(path):
        with open(fname) as csvfile:
            next(csvfile)  # skip header
            outfile.write("{},{}\r\n".format(os.path.basename(fname),
                                             sum(float(line.split(',', 1)[0].strip())
                                                 for line in csvfile)))
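To make the trick explicit: split(',', 1) cuts the line only once, so the rest of the row is never parsed (sample row made up):

# A made-up row, with the numeric column first as the question describes
line = "3.5,corn,omaha\n"
print(line.split(',', 1))                     # ['3.5', 'corn,omaha\n'] -- one split only
print(float(line.split(',', 1)[0].strip()))   # 3.5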
Contrary to the answer by @tdelaney, I would not advise you to limit your code by relying on the fact that you are adding up the first column; what if you need to work with the third column next week? It's easy to do this properly by building on the code you provide. Parsing a couple of thousand text files is not going to slow you down.
The csv.DictReader constructor will automatically treat the first row of its input as a header (unless you explicitly specify a list of column names with the fieldnames parameter). So your code can look like this:
import csv
import glob

path = 'C:/test/*.csv'  # as in your code
averages = []
for fname in glob.glob(path):
    with open(fname, "rb") as csvfile:
        reader = csv.DictReader(csvfile)
        values = [float(row["agr"]) for row in reader]
        avg = sum(values) / len(values)
        averages.append((fname, avg))
The list averages now contains the numbers you want. This is how you write it out to another CSV file:
with open("avegages.csv", "wb") as outfile:
writer = csv.writer(outfile)
writer.writerow(["File", "Average agr"])
for row in averages:
writer.writerow(row)
PS. Since you included pandas in your imports, here's one way to do the same thing with pandas. However, I recommend sticking with csv for now. The pandas object model is complex, and hard to wrap your head around.
averages = []
for fname in glob.glob(path):
    data = pd.read_csv(fname)  # pd.DataFrame.from_csv is deprecated; read_csv does the job
    averages.append((fname, data["agr"].mean()))
df_out = pd.DataFrame.from_records(averages, columns=["File", "Average agr"])
df_out.to_csv("averages.csv", index=False)
As you can see the code is a lot shorter, since file i/o and calculations can be done with one statement.

Merging several csv files and storing the file names as a variable - Python

I am trying to append several csv files into a single csv file using python, while adding the file name (or, even better, a sub-string of the file name) as a new variable. All files have headers. The following script does the trick of merging the files, but does not address the file-name-as-variable issue:
import glob

filenames = glob.glob("/filepath/*.csv")
outputfile = open("out.csv","a")
for line in open(str(filenames[1])):
    outputfile.write(line)
for i in range(1,len(filenames)):
    f = open(str(filenames[i]))
    f.next()
    for line in f:
        outputfile.write(line)
outputfile.close()
I was wondering if there are any good suggestions. I have about 25k small csv files (less than 100KB each).
You can use Python's csv module to parse the CSV files for you, and to format the output. Example code (untested):
import csv

with open(output_filename, "wb") as outfile:
    writer = None
    for input_filename in filenames:
        with open(input_filename, "rb") as infile:
            reader = csv.DictReader(infile)
            if writer is None:
                field_names = ["Filename"] + reader.fieldnames
                writer = csv.DictWriter(outfile, field_names)
                writer.writeheader()
            for row in reader:
                row["Filename"] = input_filename
                writer.writerow(row)
A few notes:
Always use with to open files. This makes sure they will get closed again when you are done with them. Your code doesn't correctly close the input files.
CSV files should be opened in binary mode (in Python 2; in Python 3, open them in text mode with newline='').
Indices start at 0 in Python. Your code skips the first file and includes the lines from the second file twice. If you just want to iterate over a list, you don't need to bother with indices in Python. Simply use for x in my_list instead.
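To make that off-by-one concrete (file names made up):

filenames = ["a.csv", "b.csv", "c.csv"]

print(filenames[1])                 # 'b.csv' -- the second file, not the first
for i in range(1, len(filenames)):
    print(filenames[i])             # 'b.csv', 'c.csv' -- b.csv handled twice overall, a.csv never

# Idiomatic iteration needs no indices at all:
for fname in filenames:
    print(fname)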
Simple changes will achieve what you want:
For the header line
outputfile.write(line) -> outputfile.write(line.rstrip('\n')+',file\n')
and later
outputfile.write(line.rstrip('\n')+','+filenames[i]+'\n')
(the rstrip('\n') is needed because line already ends with a newline; appending text after it would spill onto the next row)
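Put together, those two changes might look like this sketch (my code, not the answerer's; the column name 'file' and the use of the bare file name are assumptions):

import glob
import os

filenames = glob.glob("/filepath/*.csv")

with open("out.csv", "w") as outputfile:
    # header from the first file, plus the new column name
    with open(filenames[0]) as f:
        outputfile.write(next(f).rstrip('\n') + ',file\n')
    # data rows from every file, each tagged with its source file name
    for fname in filenames:
        with open(fname) as f:
            next(f)  # skip header
            for line in f:
                outputfile.write(line.rstrip('\n') + ',' + os.path.basename(fname) + '\n')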

Loop through multiple csv files, copying only certain columns to new files

I have a number of .csv files in a folder (1.csv, 2.csv, 3.csv, etc.) and I need to loop over them all. The output should be a corresponding NEW file for each existing one, but each should only contain 2 columns.
Here is a sample of the csv files:
004,444.444.444.444,448,11:16 PDT,11-24-15
004,444.444.444.444,107,09:55 PDT,11-25-15
004,444.444.444.444,235,09:45 PDT,11-26-15
004,444.444.444.444,241,11:00 PDT,11-27-15
And here is how I would like the output to look:
448,11-24-15
107,11-25-15
235,11-26-15
241,11-27-15
Here is my working attempt at achieving this with Python:
import csv
import os
import glob

path = '/csvs/'
for infile in glob.glob( os.path.join(path, '*csv') ):
    inputfile = open(infile, 'r')
    output = os.rename(inputfile + ".out", 'w')

    #Extracts the important columns from the .csv into a new file
    with open(infile, 'r') as source:
        readr = csv.reader(source)
        with open(output,"w") as result:
            writr = csv.writer(result)
            for r in readr:
                writr.writerow((r[4], r[2]))
Using only the second half of this code, I am able to get the desired output by specifying the input files in the code. However, this Python script will be a small part of a much larger bash script that will be (hopefully) fully automated.
How can I adjust the input of this script to loop over each file and create a new one with just the 2 specified columns?
Please let me know if there is anything I need to clarify.
inputfile is a file object you opened, but then you are doing -
os.rename(inputfile + ".out", 'w')
This does not work; you are trying to add a string and an opened file with the + operator. I am not even sure why you need that line, or even the line inputfile = open(infile, 'r'): you are opening the file again in the with statement.
Another issue -
You specify your path as path = '/csvs/'; it is highly unlikely that you have a 'csvs' directory under the root directory. You most likely wanted a relative path, so drop the leading slash.
You can just do -
path = 'csvs/'
for infile in glob.glob( os.path.join(path, '*csv') ):
    output = infile + '.out'
    with open(infile, 'r') as source:
        readr = csv.reader(source)
        with open(output,"w") as result:
            writr = csv.writer(result)
            for r in readr:
                writr.writerow((r[4], r[2]))
You can use the pandas library. It offers a lot of functionality for dealing with csv files. read_csv will read the csv file for you and give you a dataframe object. Visit this link for an example of how to write a csv file from a pandas dataframe. Moreover, there are lots of tutorials available on the net.
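A sketch of what that suggestion could look like (my code, not from the answer; it mirrors the (r[4], r[2]) column order of the accepted loop above):

import glob
import os

import pandas as pd

path = 'csvs/'
for infile in glob.glob(os.path.join(path, '*.csv')):
    # header=None: the sample rows in the question have no header line
    df = pd.read_csv(infile, header=None)
    # keep columns 4 and 2, in that order, matching writerow((r[4], r[2]))
    df[[4, 2]].to_csv(infile + '.out', header=False, index=False)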

Extracting Rows of Data from a CSV-like File Using Python

I have a large file from a proprietary archive format. Unzipping this archive gives a file that has no extension, but the data inside is comma-delimited. Adding a .csv extension or simply opening the file with Excel will work.
I have about 375-400 of these files, and I'm trying to extract a chunk of rows (about 13,500 out of 1.2M+ rows) between a keyword "Point A" and another keyword "Point B".
I found some code on this site that I think is extracting the data correctly, but I'm getting an error:
AttributeError: 'list' object has no attribute 'rows'
when trying to save out the file. Can somebody help me get this data to save into a csv?
import re
import csv
import time

print(time.ctime())

file = open('C:/Users/User/Desktop/File with No Extension That\'s Very Similar to CSV', 'r')
data = file.read()

x = re.findall(r'Point A(.*?)Point B', data, re.DOTALL)

name = "C:/Users/User/Desktop/testoutput.csv"
with open(name, 'w', newline='') as file2:
    savefile = csv.writer(file2)
    for i in x.rows:
        savefile.writerow([cell.value for cell in i])

print(time.ctime())
Thanks in advance, any help would be much appreciated.
The following should work nicely. As mentioned, your regex usage was almost correct. It is possible to still use the Python CSV library to do the CSV processing by converting the found text into a StringIO object and passing that to the CSV reader:
import re
import csv
import time
import StringIO

print(time.ctime())

input_name = "C:/Users/User/Desktop/File with No Extension That's Very Similar to CSV"
output_name = "C:/Users/User/Desktop/testoutput.csv"

with open(input_name, 'r') as f_input, open(output_name, 'wb') as f_output:
    # Read whole file in
    all_input = f_input.read()

    # Extract interesting lines
    ab_input = re.findall(r'Point A(.*?)Point B', all_input, re.DOTALL)[0]

    # Convert into a file object and parse using the CSV reader
    fab_input = StringIO.StringIO(ab_input)
    csv_input = csv.reader(fab_input)
    csv_output = csv.writer(f_output)

    # Iterate a row at a time from the input
    for input_row in csv_input:
        # Skip any empty rows
        if input_row:
            # Write row at a time to the output
            csv_output.writerow(input_row)

print(time.ctime())
You have not given us an example from your CSV file, so if there are problems, you might need to configure the CSV 'dialect' to process it better.
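For example, a semicolon-separated file with single-quote quoting could be handled like this (made-up dialect parameters, purely for illustration):

import csv
import io

# Made-up semicolon-separated sample, just to show the dialect options
sample = io.StringIO("1;'a;b';2\n")
for row in csv.reader(sample, delimiter=';', quotechar="'"):
    print(row)   # ['1', 'a;b', '2']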
Tested using Python 2.7
You have 2 problems here: the first is related to the regular expression and the other is about the list syntax.
Getting what you want
The way you are using the regular expression returns a list with a single value (all of the lines joined into a single string).
Probably there is a better way of doing this, but for now I would go with something like:
with open('bla', 'r') as input:
    data = input.read()
x = re.findall(r'Point A(.*?)Point B', data, re.DOTALL)[0]
x = x.splitlines(False)[1:]
That's not pretty but will return a list with all values between those two points.
Working with lists
There is no rows attribute on lists. You just have to iterate over it:
for i in x:
    # do what you have to do
See, I'm not familiar with the csv library, but it looks like you will have to perform some manipulation on the i value before adding it to the output.
IMHO, I would avoid using the CSV format, since it is kind of locale dependent, so it may not work as expected depending on the settings your end users have on their OS.
Updating the code so that @Martin Evans' answer works on the latest Python version:
import re
import csv
import time
import io

print(time.ctime())

input_name = "C:/Users/User/Desktop/File with No Extension That's Very Similar to CSV"
output_name = "C:/Users/User/Desktop/testoutput.csv"

with open(input_name, 'r') as f_input, open(output_name, 'wt') as f_output:
    # Read whole file in
    all_input = f_input.read()

    # Extract interesting lines
    ab_input = re.findall(r'Point A(.*?)Point B', all_input, re.DOTALL)[0]

    # Convert into a file object and parse using the CSV reader
    fab_input = io.StringIO(ab_input)
    csv_input = csv.reader(fab_input)
    csv_output = csv.writer(f_output)

    # Iterate a row at a time from the input
    for input_row in csv_input:
        # Skip any empty rows
        if input_row:
            # Write row at a time to the output
            csv_output.writerow(input_row)

print(time.ctime())
Also, by using 'wt' instead of 'wb', one avoids "TypeError: a bytes-like object is required, not 'str'".
