Take average of each column in multiple csv files using Python - python

I am a beginner in Python. I have searched for my problem but could not find an answer that matches my exact requirement.
I have a folder in which multiple files are stored, one for each experimental measurement. Their names follow a pattern, e.g. XY0001.csv, XY0002.csv ... XY0040.csv. I want to read all of these files and take the average of each column in all files, storing it in 'result.csv' in the same format.

I would suggest using pandas (import pandas as pd). Start by reading each file with pd.read_csv(). Exactly how to read the files depends on how your CSV files are formatted, which I cannot tell from here. If you want to read all files in a directory (which may be the easiest solution for this problem), look up how to read all files in a directory.
Then you can concatenate all the files using pd.concat(). Lastly, you can calculate the metrics you want (use the search functionality to find how to calculate each specific metric). A useful method that computes several summary statistics at once, including the mean, is DataFrame.describe().

To access multiple files, you can use the glob module.
import glob
path = r'/home/root/csv_directory'
filenames = glob.glob(path + "/*.csv")
Python's pandas module has a function for parsing CSV files, along with options for managing and processing them.
import pandas as pd
dfs = []
for filename in filenames:
    dfs.append(pd.read_csv(filename))
pd.read_csv() parses a CSV file into a DataFrame.
df = pd.concat(dfs, ignore_index=True)
pd.concat() concatenates all the data into one DataFrame, which makes it easy to process.
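The answer above stops at the concatenated DataFrame. A minimal sketch of the remaining step follows, assuming that "average of each column in all files" means one mean per column over the rows of every file combined. The function names are hypothetical and not part of the original answer.

```python
import glob

import pandas as pd


def column_means(frames):
    """Stack the frames row-wise and return the mean of every numeric column."""
    combined = pd.concat(frames, ignore_index=True)
    return combined.mean(numeric_only=True)


def column_means_from_files(pattern):
    """Read every CSV file matching pattern and average its columns together."""
    frames = [pd.read_csv(name) for name in sorted(glob.glob(pattern))]
    return column_means(frames)


# Possible use with the file names from the question (not executed here):
# means = column_means_from_files("XY*.csv")
# means.to_frame().T.to_csv("result.csv", index=False)
```

Converting the resulting Series with to_frame().T writes the column names as a header row followed by a single row of averages, which is one reading of "in the same format".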

The following makes use of the glob module to get a list of all files in the current folder of the form X*.csv, i.e. all CSV files starting with X. For each file it finds, it first skips a header row (optional) and then loads all remaining rows, using a zip() trick to transpose the list of rows into a list of columns.
For each column, it converts each cell into an integer and sums the values, dividing this total by the number of elements found, thus giving an average for each column. It then writes the values to your output result.csv in the format filename, av_col1, av_col2 etc:
import glob
import csv
with open('result.csv', 'w', newline='') as f_output:
    csv_output = csv.writer(f_output)
    for filename in glob.glob('X*.csv'):
        print(filename)
        with open(filename, newline='') as f_input:
            csv_input = csv.reader(f_input)
            header = next(csv_input)
            averages = []
            for col in zip(*csv_input):
                averages.append(sum(int(x) for x in col) / len(col))
        csv_output.writerow([filename] + averages)
So if you had XY0001.csv containing:
Col1,Col2,Col3
6,1,10
2,1,20
5,2,30
result.csv would be written as follows:
XY0001.csv,4.333333333333333,1.3333333333333333,20.0
Tested using Python 3.5.2
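The script above averages each file separately and assumes integer cells. If a single set of averages across all files is wanted instead, the same zip() transpose can be applied to the rows of every file at once. This is a sketch under that assumption: the function names are hypothetical, cells are parsed as floats, and statistics.fmean needs Python 3.8 or later.

```python
import csv
import glob
from statistics import fmean


def column_averages(rows):
    """Average each column of an iterable of equal-length rows of numeric strings."""
    columns = zip(*rows)  # transpose: list of rows -> list of columns
    return [fmean(float(cell) for cell in column) for column in columns]


def data_rows(pattern):
    """Yield the data rows of every file matching pattern, skipping one header row per file."""
    for filename in sorted(glob.glob(pattern)):
        with open(filename, newline='') as f_input:
            reader = csv.reader(f_input)
            next(reader, None)  # skip the header row, if there is one
            yield from reader


# Possible use with the pattern from the answer (not executed here):
# print(column_averages(data_rows('X*.csv')))
```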

Related

How to convert multiple .txt files into .csv file in Python

I'm trying to convert multiple text files into a single .csv file using Python. My current code is this:
import pandas
import glob
# Collects the file names of all .txt files in a given directory.
file_names = glob.glob("./*.txt")
# [Middle Step] Merges the text files into a single file titled 'output_file'.
with open('output_file.txt', 'w') as out_file:
    for i in file_names:
        with open(i) as in_file:
            for j in in_file:
                out_file.write(j)
# Reading the merged file and creating dataframe.
data = pandas.read_csv("output_file.txt", delimiter = '/')
# Store dataframe into csv file.
data.to_csv("convert_sample.csv", index = None)
So as you can see, I'm reading from all the files and merging them into a single .txt file. Then I convert it into a single .csv file. Is there a way to accomplish this without the middle step? Is it necessary to concatenate all my .txt files into a single .txt to convert it to .csv, or is there a way to directly convert multiple .txt files to a single .csv?
Thank you very much.
Of course it is possible. And you really don't need to involve pandas here, just use the standard library csv module. If you know the column names ahead of time, the most painless way is to use csv.DictWriter and csv.DictReader objects:
import csv
import glob
column_names = ['a','b','c'] # or whatever
with open("convert_sample.csv", 'w', newline='') as target:
    writer = csv.DictWriter(target, fieldnames=column_names)
    writer.writeheader() # if you want a header
    for path in glob.glob("./*.txt"):
        with open(path, newline='') as source:
            reader = csv.DictReader(source, delimiter='/', fieldnames=column_names)
            writer.writerows(reader)
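When the column names are not known ahead of time, the same direct conversion can be done with plain csv.reader and csv.writer objects. A minimal sketch follows; the function name is hypothetical and the sample data is made up for illustration.

```python
import csv
import io


def merge_delimited(sources, target, delimiter='/'):
    """Copy every row from the delimited sources into one comma-separated target."""
    writer = csv.writer(target)
    for source in sources:
        writer.writerows(csv.reader(source, delimiter=delimiter))


# In-memory demonstration; real code would pass open file objects instead.
first = io.StringIO("1/2/3\n4/5/6\n")
second = io.StringIO("7/8/9\n")
merged = io.StringIO(newline='')
merge_delimited([first, second], merged)
print(merged.getvalue())
```

With real files, each source would be opened with open(path, newline='') inside the glob loop and the target with open("convert_sample.csv", 'w', newline=''), as in the answer above.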

Extract text from multiple PDFs and write to a single CSV

I want to loop through all the PDFs in a directory, extract the text from each one using PDFminer, and then write the output to a single CSV file. I am able to extract the text from each PDF individually by passing it to the function defined here. I am also able to get a list of all the PDF filenames in a given directory. But when I try to put the two together and write the results to a single CSV, I get a CSV with headers but no data.
Here is my code:
import os
pdf_files = [name for name in os.listdir("C:\\My\\Directory\\Path") if name.endswith(".pdf")] #get all files in directory
pdf_files_path = ["C:\\My\\Directory\\Path\\" + pdf_files[i] for i in range(len(pdf_files))] #add directory path
import pandas as pd
df = pd.DataFrame(columns=['FileName','Text'])
for i in range(len(pdf_files)):
    scraped_text = convert_pdf_to_txt(pdf_files_path[i])
    df.append({ 'FileName': pdf_files[i], 'Text': scraped_text[i]},ignore_index=True)
df.to_csv('output.csv')
The variables have the following values:
pdf_files: ['12280_2007_Article_9000.pdf', '12280_2007_Article_9001.pdf', '12280_2007_Article_9002.pdf', '12280_2007_Article_9003.pdf', '12280_2007_Article_9004.pdf', '12280_2007_Article_9005.pdf', '12280_2007_Article_9006.pdf', '12280_2007_Article_9007.pdf', '12280_2007_Article_9008.pdf', '12280_2007_Article_9009.pdf']
pdf_files_path: ['C:\\My\\Directory Path\\12280_2007_Article_9000.pdf', etc...]
Empty DataFrame
Columns: [FileName, Text]
Index: []
Update: based on a suggestion from @AMC I checked the contents of scraped_text in the loop. For the Text column, it seems that I'm looping through the characters in the first PDF file, rather than looping through each file in the directory. Also, the contents of the loop are not getting written to the dataframe or CSV.
12280_2007_Article_9000.pdf E
12280_2007_Article_9001.pdf a
12280_2007_Article_9002.pdf s
12280_2007_Article_9003.pdf t
12280_2007_Article_9004.pdf
12280_2007_Article_9005.pdf A
12280_2007_Article_9006.pdf s
12280_2007_Article_9007.pdf i
12280_2007_Article_9008.pdf a
12280_2007_Article_9009.pdf n
I guess you don't need pandas for that. You can make it simpler by using the standard library csv.
Another thing that can be improved, if you are using Python 3.4+, is to replace os with pathlib.
Here is an almost complete example:
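import csv
from pathlib import Path
folder = Path('c:/My/Directory/Path')
csv_file = Path('c:/path/to/output.csv')
# newline='' lets the csv module control the line endings itself
with csv_file.open('w', encoding='utf-8', newline='') as f:
    # quoting must be passed by keyword; the second positional argument is the dialect
    writer = csv.writer(f, quoting=csv.QUOTE_ALL)
    writer.writerow(['FileName', 'Text'])
    for pdf_file in folder.glob('*.pdf'):
        # convert_pdf_to_txt is the extraction function from the question
        pdf_text = convert_pdf_to_txt(pdf_file).replace('\n', '|')
        writer.writerow([pdf_file.name, pdf_text])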
Another thing to bear in mind is to make sure pdf_text is a single line, or else your csv file will be kind of broken. One way to work around that is to pick an arbitrary character to use in place of the newline characters. If you pick the pipe character, for example, then you can do something like this before calling writer.writerow (str.replace returns a new string, so the result has to be assigned):
pdf_text = pdf_text.replace('\n', '|')
It is not meant to be a complete example but a starting point. I hope it helps.
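As a hedged aside on the single-line concern: the csv module quotes fields that contain line breaks, and csv.reader reads such a field back as one value. Replacing the newlines is therefore mainly needed when the output will be processed line by line by tools that are not CSV-aware. A minimal in-memory sketch, with a made-up file name and text:

```python
import csv
import io

# Made-up sample: one extracted text that spans two lines.
buffer = io.StringIO(newline='')
writer = csv.writer(buffer, quoting=csv.QUOTE_ALL)
writer.writerow(['FileName', 'Text'])
writer.writerow(['example.pdf', 'line one\nline two'])

# Reading the same data back with csv.reader restores the embedded newline.
buffer.seek(0)
rows = list(csv.reader(buffer))
print(rows[1])
```

For real files the same round trip relies on opening the file with newline='' for both writing and reading.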

How can I open multiple csv files in a folder, take the average of a column and save in a separate file using python?

I am extremely new at python and need some help with this one. I've tried various codes and none seem to work, so suggestions would be awesome.
I have a folder with about 1500 csv files that each contain multiple columns of data. I need to take the average of the first column, called "agr", and save this value in a different excel or csv file. It would be great if I could also somehow save the name of the file with its averaged value so that I can keep track of which file it came from. The files are named crop_city (e.g. corn_omaha).
import glob
import csv
import numpy as np
import pandas as pd
path = ('C:/test/*.csv')
for fname in glob.glob(path):
    with open(fname) as csvfile:
        agr = []
        reader = csv.DictReader(fname)
        print row['agr']
I know the code above is extremely rudimentary, so any help would be great thanks everyone!
Assuming the first column in these CSV files is a decimal or float, you don't really need to parse the entire line. Just split at the first separator and parse the first token. There is no real advantage to numpy or pandas either. Just use the builtin sum function.
import glob
import os
path = 'test/*.csv'  # using local dir for test
with open('output.csv', 'w', newline='') as outfile:
    # the header must be written after the output file has been opened
    outfile.write("Filename,Sum\r\n")  # header for output
    for fname in glob.glob(path):
        with open(fname) as csvfile:
            next(csvfile)  # skip header
            outfile.write("{},{}\r\n".format(
                os.path.basename(fname),
                sum(float(line.split(',', 1)[0].strip())
                    for line in csvfile)))
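The snippet above writes the sum of the first column, while the question asks for the average. A minimal sketch of the same split-at-the-first-separator idea that returns the mean instead follows. The function name is hypothetical, and skipping blank lines and returning None for a file without data rows are assumptions.

```python
def first_column_average(lines):
    """Average the first comma-separated field of each line after the header."""
    lines = iter(lines)
    next(lines, None)  # skip the header line
    values = [float(line.split(',', 1)[0]) for line in lines if line.strip()]
    return sum(values) / len(values) if values else None


# Possible use inside the loop above (not executed here):
# with open(fname) as csvfile:
#     average = first_column_average(csvfile)
```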
Contrary to the answer by @tdelaney, I would not advise you to limit your code by relying on the fact that you are adding up the first column; what if you need to work with the third column next week? It's easy to do this properly by building on the code you provide. Parsing a couple of thousand text files is not going to slow you down.
The csv.DictReader constructor will automatically treat the first row of its input as a header (unless you explicitly specify a list of column names with the fieldnames parameter). So your code can look like this:
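import csv
import glob
path = "C:/test/*.csv"  # pattern from the question
averages = []
for fname in glob.glob(path):
    # In Python 3 the csv module needs a text-mode file opened with newline=""
    with open(fname, newline="") as csvfile:
        reader = csv.DictReader(csvfile)
        values = [float(row["agr"]) for row in reader]
    avg = sum(values) / len(values)
    averages.append((fname, avg))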
The list averages now contains the numbers you want. This is how you write it out to another CSV file:
with open("averages.csv", "w", newline="") as outfile:
    writer = csv.writer(outfile)
    writer.writerow(["File", "Average agr"])
    for row in averages:
        writer.writerow(row)
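The header behaviour of csv.DictReader described above can be checked with a small in-memory sketch. The sample data and the second column name are made up for illustration.

```python
import csv
import io

sample = "agr,yield\n1.5,10\n2.5,20\n"  # made-up data for illustration

# Without fieldnames, the first row supplies the keys and is not returned as data.
auto = list(csv.DictReader(io.StringIO(sample)))

# With explicit fieldnames, the first row is treated as an ordinary data row.
explicit = list(csv.DictReader(io.StringIO(sample), fieldnames=["agr", "yield"]))

print(len(auto), auto[0]["agr"])
print(len(explicit), explicit[0]["agr"])
```

This is why passing fieldnames for a file that already has a header row would make float(row["agr"]) fail on the first row.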
PS. Since you included pandas in your imports, here's one way to do the same thing with pandas. However, I recommend sticking with csv for now. The pandas object model is complex, and hard to wrap your head around.
averages = []
for fname in glob.glob(path):
    # pd.read_csv replaces DataFrame.from_csv, which was removed in pandas 1.0
    data = pd.read_csv(fname)
    averages.append((fname, data["agr"].mean()))
df_out = pd.DataFrame.from_records(averages, columns=["File", "Average agr"])
df_out.to_csv("averages.csv", index=False)
As you can see the code is a lot shorter, since file i/o and calculations can be done with one statement.

Merge two CSV files in python iteratively

I have a set of data saved across multiple .csv files with a fixed number of columns. Each column corresponds to a different measurement.
I would like to add a header to each file. The header will be identical for all files, and is comprised of three rows. Two of these rows are used to identify their corresponding columns.
I am thinking that I could save the header in a separate .csv file, then iteratively merge it with each data file using a for loop.
How can I do this in python? I am new to the language.
Yeah, you can do that easily with pandas. It will be faster and easier than what you're currently thinking of, which may create problems.
Three simple commands will be used for reading, merging and writing the result to a new file:
pandas.read_csv()
pandas.merge()
DataFrame.to_csv()
The pandas documentation describes which arguments to use and gives more details about them.
For your case, you may first need to create new files that contain the header. Then you would do another loop to add the rows, skipping the header.
import csv
with open("data_out.csv","a") as fout:
    # first file:
    with open("data.csv") as f: # your header file
        for line in f:
            fout.write(line)
    with open("data_2.csv") as f:
        next(f) # this will skip first line
        for line in f:
            fout.write(line)
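Since the question asks for the same multi-row header on every data file, the copying step can be wrapped in a small helper and called once per file. This is a sketch, assuming the header file ends with a newline and that each result is written to a new file. The function name and sample rows are hypothetical.

```python
import io
import shutil


def write_with_header(header, data, out):
    """Copy the header lines, then the unchanged data lines, into out."""
    shutil.copyfileobj(header, out)
    shutil.copyfileobj(data, out)


# In-memory demonstration with a made-up three-row header.
header = io.StringIO("run,1\nname,a,b\nunit,V,A\n")
data = io.StringIO("1,2\n3,4\n")
result = io.StringIO()
write_with_header(header, data, result)
print(result.getvalue())
```

With real files, the header file would be reopened (or rewound with seek(0)) for each data file found by glob, and each output opened under a new name so that the original data is not overwritten.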
Instead of running a for loop appending two files for multiple files, an easier solution would be to put all the csv files you want to merge into a single folder and feed the path to the program. This will merge all the csv files into a single csv file.
(Note: The columns of each file must be the same.)
import os
import pandas as pd
# Give the path to the folder containing the multiple csv files (placeholder value)
path = "csv_folder"
# Name of the merged output file (placeholder value)
csvfile = "merged.csv"
# Put the names of all csv files in the folder into a list
filenames = [item for item in os.listdir(path) if item.endswith(".csv")]
# Read each file into a dataframe and concatenate them all
# (DataFrame.append was removed in pandas 2.0, so pd.concat is used instead)
df1 = pd.concat([pd.read_csv(os.path.join(path, f)) for f in filenames],
                ignore_index=True)
# Convert the dataframe into a single csv file
df1.to_csv(csvfile, encoding='utf-8', index=False)

Reading CSV files and rewriting them without certain rows Python

I am new to programming. I have hundreds of CSV files in a folder and certain files have the letters DIF in the second column. I want to rewrite the CSV files without those lines in them. I have attempted doing that for one file and have put my attempt below. I also need help getting the program to do that for all the files in my directory. Any help would be appreciated.
Thank you
import csv
reader=csv.reader(open("40_5.csv","r"))
for row in reader:
    if row[1] == 'DIF':
        csv.writer(open('40_5N.csv', 'w')).writerow(row)
I made some changes to your code:
import csv
import glob
import os
fns = glob.glob('*.csv')
for fn in fns:
    # In Python 3 the csv module expects text-mode files opened with newline=''
    with open(fn, newline='') as src, \
         open(os.path.join('out', fn), 'w', newline='') as f:
        reader = csv.reader(src)
        w = csv.writer(f)
        for row in reader:
            if 'DIF' not in row:
                w.writerow(row)
The glob command produces a list of all files ending with .csv in the current directory. If you want to give the source directory as an argument to your program, have a look into sys.argv or argparse (especially the latter is very powerful for command line parsing).
You also have to be careful when opening a file in 'w' mode: it truncates the file, so in your loop you would overwrite the existing file on every iteration and end up with only one CSV line.
The directory 'out' must exist or the script will produce an IOError.
Links:
open
sys.argv
argparse
glob
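The two remarks above, taking the source directory from the command line and needing an existing output directory, can be combined in one sketch. The function names, the argument names and the default "out" are assumptions for illustration. os.makedirs(..., exist_ok=True) creates the output directory if it is missing.

```python
import argparse
import csv
import glob
import os


def rows_without(rows, token="DIF"):
    """Yield only the rows in which no cell equals token."""
    return (row for row in rows if token not in row)


def filter_directory(src, out, token="DIF"):
    """Write a filtered copy of every CSV file in src into out."""
    os.makedirs(out, exist_ok=True)  # create the output directory if needed
    for path in glob.glob(os.path.join(src, "*.csv")):
        target = os.path.join(out, os.path.basename(path))
        with open(path, newline="") as f_in, open(target, "w", newline="") as f_out:
            csv.writer(f_out).writerows(rows_without(csv.reader(f_in), token))


def build_parser():
    """Command-line interface: source directory plus optional output directory."""
    parser = argparse.ArgumentParser(
        description="Copy CSV files, dropping rows that contain a token.")
    parser.add_argument("src", help="directory containing the source CSV files")
    parser.add_argument("--out", default="out", help="directory for the filtered copies")
    return parser


# Possible entry point (not executed here):
# args = build_parser().parse_args()
# filter_directory(args.src, args.out)
```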
Most sequence types support the in or not in operators, which are much simpler to use to test for values than figuring index positions.
with open('40_5N.csv', 'w', newline='') as f:
    writer = csv.writer(f)  # open the output once, not once per row
    for row in reader:
        if 'DIF' not in row:
            writer.writerow(row)
If you're willing to install numpy, you can also read a csv file into the convenient numpy array format with either recfromcsv or the more general genfromtxt (genfromtxt requires you specify the comma delimiter), and you can specify which rows and columns to ignore. Documentation can be found here for genfromtxt:
http://docs.scipy.org/doc/numpy/user/basics.io.genfromtxt.html
And here for recfromcsv: http://nullege.com/codes/search/numpy.recfromcsv?fulldoc=1
