Merge two CSV files in Python iteratively

I have a set of data saved across multiple .csv files with a fixed number of columns. Each column corresponds to a different measurement.
I would like to add a header to each file. The header will be identical for all files and comprises three rows, two of which identify their corresponding columns.
I am thinking that I could save the header in a separate .csv file, then iteratively merge it with each data file using a for loop.
How can I do this in python? I am new to the language.

Yes, you can do that easily with pandas. It will be faster and simpler than the approach you're currently considering, which may create problems.
Three simple commands will handle reading, combining, and writing to a new file:
pandas.read_csv()
pandas.concat()
DataFrame.to_csv()
You can read about their arguments and other details in the pandas documentation.
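For the original task of prepending the three header rows to each data file, here is a minimal sketch, assuming the header rows live in header.csv, the data files match data_*.csv, and all files have the same number of columns (the names are placeholders). Note that stacking rows is a job for pd.concat(); pd.merge() performs a column join:
import glob
import pandas as pd

# read the three header rows; header=None keeps them as plain rows
header = pd.read_csv("header.csv", header=None)

for path in glob.glob("data_*.csv"):
    data = pd.read_csv(path, header=None)
    # stack the header rows on top of the data rows
    merged = pd.concat([header, data], ignore_index=True)
    # write without pandas' own header/index so only the file content remains
    merged.to_csv(path.replace(".csv", "_labelled.csv"), header=False, index=False)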

For your case, you may need to first create a new file with the header in it, and then loop over the data files to append their rows, skipping the header line. For two files (a generalization to many files follows this block):
with open("data_out.csv", "a") as fout:
    # first file: your header file
    with open("data.csv") as f:
        for line in f:
            fout.write(line)
    # second file
    with open("data_2.csv") as f:
        next(f)  # this will skip the first line
        for line in f:
            fout.write(line)
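To generalize this to a whole set of data files, something like the following sketch works (header.csv and the data_*.csv pattern are placeholder names, and it assumes header.csv ends with a newline; it writes a new copy of each file rather than appending everything into one):
import glob

with open("header.csv") as f:
    header = f.read()

for path in sorted(glob.glob("data_*.csv")):
    with open(path) as f:
        body = f.read()
    # write header + original rows into a new file alongside the old one
    with open(path.replace(".csv", "_with_header.csv"), "w") as fout:
        fout.write(header)
        fout.write(body)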

Instead of running a for loop appending two files at a time, an easier solution is to put all the CSV files you want to merge into a single folder and feed that path to the program. This will merge all the CSV files into a single CSV file.
(Note: the columns of each file must be the same.)
import os
import pandas as pd
# give the path to the folder containing the multiple csv files
dirList = os.listdir(path)
# put all their names into a list
filenames = []
for item in dirList:
    if item.endswith(".csv"):
        filenames.append(item)
# convert each file to a dataframe and collect them all in one list
frames = []
for f in filenames:
    frames.append(pd.read_csv(os.path.join(path, f)))
# concatenate the list into a single dataframe
# (DataFrame.append was removed in pandas 2.0, so use pd.concat)
df1 = pd.concat(frames, ignore_index=True)
# write the combined dataframe to a single csv file (the name is up to you)
df1.to_csv('merged.csv', encoding='utf-8', index=False)

Related

How to save the variables as different files in a for loop?

I have a list of CSV file pathnames, and I am trying to load each one as a dataframe. How can I do it?
import pandas as pd
import os
import glob
# use glob to get all the csv files in the folder
path = "/Users/azmath/Library/CloudStorage/OneDrive-Personal/Projects/LESA/2022 HY/All"
csv_files = glob.glob(os.path.join(path, "*.xlsx"))
# loop over the list of csv files
for f in csv_files:
    # read the csv file
    df = pd.read_excel(f)
    display(df)
    print()
The issue is that it only prints; I don't know how to save the results. I would like to save all the dataframes as variables, preferably named after their files.
By “save” I think you mean store dataframes in variables. I would use a dictionary for this instead of separate variables.
import os
data = {}
for f in csv_files:
    name = os.path.basename(f)
    # read the excel file, keyed by its filename
    data[name] = pd.read_excel(f)
    display(data[name])
    print()
Now all your dataframes are stored in the data dictionary, where you can iterate over them (and easily handle them all together if needed). Each key is the basename (filename) of the input file.
Recall that dictionaries remember insertion order, so the order in which the files were inserted is also preserved. I'd recommend sorting the input files before parsing; this way you get a reproducible script and sequence of operations!
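For example, a small variation of the loop above (same names as before):
for f in sorted(csv_files):  # sorted() makes the insertion order deterministic
    data[os.path.basename(f)] = pd.read_excel(f)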
Try this:
a = [pd.read_excel(file) for file in csv_files]
Then a will be a list of all your dataframes. If you want a dictionary instead of a list:
a = {file: pd.read_excel(file) for file in csv_files}

How to convert multiple .txt files into a .csv file in Python

I'm trying to convert multiple text files into a single .csv file using Python. My current code is this:
import pandas
import glob
# Collects the file names of all .txt files in a given directory.
file_names = glob.glob("./*.txt")
# [Middle step] Merges the text files into a single file titled 'output_file'.
with open('output_file.txt', 'w') as out_file:
    for i in file_names:
        with open(i) as in_file:
            for j in in_file:
                out_file.write(j)
# Reading the merged file and creating the dataframe.
data = pandas.read_csv("output_file.txt", delimiter='/')
# Store the dataframe in a csv file.
data.to_csv("convert_sample.csv", index=None)
So as you can see, I'm reading from all the files and merging them into a single .txt file. Then I convert it into a single .csv file. Is there a way to accomplish this without the middle step? Is it necessary to concatenate all my .txt files into a single .txt to convert it to .csv, or is there a way to directly convert multiple .txt files to a single .csv?
Thank you very much.
Of course it is possible. And you really don't need to involve pandas here, just use the standard library csv module. If you know the column names ahead of time, the most painless way is to use csv.DictWriter and csv.DictReader objects:
import csv
import glob
column_names = ['a', 'b', 'c']  # or whatever
with open("convert_sample.csv", 'w', newline='') as target:
    writer = csv.DictWriter(target, fieldnames=column_names)
    writer.writeheader()  # if you want a header
    for path in glob.glob("./*.txt"):
        with open(path, newline='') as source:
            reader = csv.DictReader(source, delimiter='/', fieldnames=column_names)
            writer.writerows(reader)
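If the column names are not known ahead of time, plain csv.reader and csv.writer objects work just as well; this sketch (same file names as above) simply re-delimits every row:
import csv
import glob

with open("convert_sample.csv", 'w', newline='') as target:
    writer = csv.writer(target)
    for path in glob.glob("./*.txt"):
        with open(path, newline='') as source:
            # stream each slash-delimited row straight into the comma-delimited output
            writer.writerows(csv.reader(source, delimiter='/'))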

Read in a list of URLs from a file, then use pandas to make changes and output a new CSV file

I have this script that reads in a list of CSVs from a text file; the files are hosted online. The script then makes changes to each of the files and creates a new CSV from the changes. At the moment it works for a single file, but I can't get it to run through all the links and make a new CSV file from each.
import pandas as pd
f = open("urls.txt", "r")
lines = []
url = lines
for line in f:
    lines.append(line)
    df = pd.read_csv(url, skiprows=7)
    # rename the top column names
    df = df.rename(columns={'*.conradbrothers.com/*_20180501': 'Ranking', '*.conradbrothers.com/*_difference': 'Difference'})
    # find - and replace with 0
    df = df.replace(to_replace=["-"], value="0")
    print(df)
    df.to_csv('seo.csv', index=False)  # output file as a csv
    print(line)
f.close()
Seeing the code, it seems like you are reading from an empty list:
df = pd.read_csv(url,skiprows=7)
Wouldn't it be
df = pd.read_csv(line,skiprows=7)
I am just guessing; the information provided in the question is limited.
Oh, and by the way, you are saving the CSV each time with the same name, so you are overwriting it. Consider changing the name in each iteration of the loop.
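Putting both fixes together, a sketch of the corrected script (the column names come from the question; deriving the output name from the loop index is one way to avoid the overwriting):
import pandas as pd

# assume one CSV url per line in urls.txt
with open("urls.txt") as f:
    urls = [line.strip() for line in f if line.strip()]

for i, url in enumerate(urls):
    df = pd.read_csv(url, skiprows=7)
    df = df.rename(columns={'*.conradbrothers.com/*_20180501': 'Ranking',
                            '*.conradbrothers.com/*_difference': 'Difference'})
    df = df.replace(to_replace=["-"], value="0")
    df.to_csv('seo_{}.csv'.format(i), index=False)  # one output file per url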

Take average of each column in multiple csv files using Python

I am a beginner in Python. I have searched for my problem but could not find the exact requirement.
I have a folder in which multiple files are stored, one for each experimental measurement. Their names follow a pattern, e.g. XY0001.csv, XY0002.csv ... XY0040.csv. I want to read all of these files and take the average of each column across all the files, storing the result in 'result.csv' in the same format.
I would suggest using pandas (import pandas as pd). Start by reading a file with pd.read_csv(); exactly how depends on how your CSV files are formatted, which I cannot tell from here. If you want to read all the files in a directory (which may be the easiest solution for this problem), glob them first, as in the next answer.
Then you can concatenate all the files using pd.concat(). Lastly, calculate the metrics you want to generate (use the search functionality to find how to calculate each specific metric). A nice function that does a lot of the work for you is describe(). A minimal sketch is below.
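The sketch assumes the files share one header row, all columns are numeric, and the XY*.csv pattern matches them (the averages are taken across all files combined):
import glob
import pandas as pd

frames = [pd.read_csv(f) for f in sorted(glob.glob("XY*.csv"))]
combined = pd.concat(frames, ignore_index=True)
# one row of column means, written in the same column format
combined.mean().to_frame().T.to_csv("result.csv", index=False)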
To access multiple files you can use the glob module.
import glob
path = r'/home/root/csv_directory'
filenames = glob.glob(path + "/*.csv")
Python's pandas module has a method to parse CSV files, along with options to manage and process them.
import pandas as pd
dfs = []
for filename in filenames:
    dfs.append(pd.read_csv(filename))
The .read_csv() method is used to parse CSV files.
df = pd.concat(dfs, ignore_index=True)
.concat() concatenates all the data into one dataframe, which is easy to process further.
The following makes use of the glob module to get a list of all files in the current folder of the form X*.csv, i.e. all CSV files whose names start with X. For each file it finds, it first skips the header row (optional) and then loads all remaining rows, using a zip() trick to transpose the list of rows into a list of columns.
For each column, it converts each cell to an integer and sums the values, dividing the total by the number of elements found, giving an average for each column. It then writes the values to your output result.csv in the format filename, av_col1, av_col2, etc.:
import glob
import csv
with open('result.csv', 'w', newline='') as f_output:
    csv_output = csv.writer(f_output)
    for filename in glob.glob('X*.csv'):
        print(filename)
        with open(filename, newline='') as f_input:
            csv_input = csv.reader(f_input)
            header = next(csv_input)  # skip the header row
            averages = []
            for col in zip(*csv_input):
                averages.append(sum(int(x) for x in col) / len(col))
        csv_output.writerow([filename] + averages)
So if you had XY0001.csv containing:
Col1,Col2,Col3
6,1,10
2,1,20
5,2,30
result.csv would be written as follows:
XY0001.csv,4.333333333333333,1.3333333333333333,20.0
Tested using Python 3.5.2

Looping through multiple Excel files in Python using pandas

I know this type of question is asked all the time. But I am having trouble figuring out the best way to do this.
I wrote a script that reformats a single excel file using pandas.
It works great.
Now I want to loop through multiple Excel files, perform the same reformat, and place the newly reformatted data from each Excel sheet at the bottom, one after another.
I believe the first step is to make a list of all excel files in the directory.
There are so many different ways to do this so I am having trouble finding the best way.
Below is the code I am currently using to import multiple .xlsx files and create a list.
import os
import glob
os.chdir('C:\ExcelWorkbooksFolder')
for FileList in glob.glob('*.xlsx'):
    print(FileList)
I am not sure if the previous glob code actually created the list that I need.
Then I have trouble understanding where to go from there.
The code below fails at pd.ExcelFile(File)
I believe I am missing something....
# create for loop
for File in FileList:
    for x in File:
        # Import the excel file and call it xlsx_file
        xlsx_file = pd.ExcelFile(File)
        # View the excel file's sheet names
        xlsx_file.sheet_names
        # Load the xlsx file's Data sheet as a dataframe
        df = xlsx_file.parse('Data', header=None)
        # select important rows
        df_NoHeader = df[4:]
        # ...then it does some more reformatting
Any help is greatly appreciated
I solved my problem. Instead of using the glob function I used os.listdir to read all my Excel files, looped through each file, reformatted it, then appended the final data to the end of a combined table.
# first create an empty appended_data list to store the info
appended_data = []
folder = r'C:\ExcelFiles'
for name in os.listdir(folder):
    # listdir gives bare names, so build the full path before checking it
    WorkingFile = os.path.join(folder, name)
    if os.path.isfile(WorkingFile):
        # Import the excel file and call it xlsx_file
        xlsx_file = pd.ExcelFile(WorkingFile)
        # Load the xlsx file's sheet1 as a dataframe
        df = xlsx_file.parse('sheet1', header=None)
        # .... do some reformatting, call the finished sheet reformatedDataSheet
        appended_data.append(reformatedDataSheet)
appended_data = pd.concat(appended_data)
And that's it; it does everything I wanted.
you need to change
os.chdir('C:\ExcelWorkbooksFolder')
for FileList in glob.glob('*.xlsx'):
    print(FileList)
to just
os.chdir('C:\ExcelWorkbooksFolder')
FileList = glob.glob('*.xlsx')
print(FileList)
Why does this fix it? glob returns a single list. Since you put for FileList in glob.glob(...), you're going to walk that list one by one and put the result into FileList. At the end of your loop, FileList is a single filename - a single string.
When you do this code:
for File in FileList:
    for x in File:
the first line will assign File to the first character of the last filename (as a string). The second line will assign x to the first (and only) character of File. This is not likely to be a valid filename, so it throws an error.
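Putting the fix together with the rest of the question's code, a sketch of the corrected loop (the 'Data' sheet name and the df[4:] row selection come from the question; the extra reformatting step is elided):
import glob
import os
import pandas as pd

os.chdir(r'C:\ExcelWorkbooksFolder')
FileList = glob.glob('*.xlsx')

appended_data = []
for File in FileList:
    xlsx_file = pd.ExcelFile(File)
    df = xlsx_file.parse('Data', header=None)
    appended_data.append(df[4:])  # keep the important rows
# stack each workbook's data at the bottom, one after another
result = pd.concat(appended_data, ignore_index=True)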
