Python - Merge Excel Files with missing column names - python

Help & valuable advise needed from the experts please.
I have 100s of csv files which i would like to merge but table columns aren't same. For example
File 1: Header1, Header2, Header3
File 2: Header1, Header2
Files 3: Header3, Header4
Files 4: Header1, Header3, Header4
I would like to merge the data from all of these csv files.
I thought of a logic but I am struggling to implement it in a python code. I think it requires a Dataframe with pre-defined headers. (Header1, Header2, Header3, Header4). Then i should loop through each csv file to search for that header. If the header exists then the data is appended to the dataframe else skip it.
Can someone please advise if there is a function in Python that can make it simple ? I have tried to use read_csv but the data structure should be the same when looping throw the csv files.
import pandas as pd
import os
import glob
# Location of CSV Files
data_location='C:\\Tableau Reports\\ST Database\\Files\\Downloads\\CSV Files\\*.csv'
# This will give path location to all CSV files
csv_files=glob.glob(data_location)
for csv_file in csv_files:
# Encoding to be changed (UTF-8).
with open(csv_file, newline='', encoding='cp1252') as csvfile:
df_file = pd.read_csv(csvfile)
Combined_df = pd.concat(df_file)
print(Combined_df)
I tried to following advise given in this forum but getting error at line 12 df_file = pd.read_csv(csvfile).
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 114934: character maps to
Any help or advise is greatly appreciated. Thank you

EDIT
The following code you posted seems overly complex. You don't have to use open with pd.read_csv. You can directly pass in a file name as a string
for csv_file in csv_files:
# Encoding to be changed (UTF-8).
with open(csv_file, newline='', encoding='cp1252') as csvfile:
df_file = pd.read_csv(csvfile)
The following should work using a list-comprehension
df = pd.concat(pd.read_csv(csv_file) for csv_file in csv_files)
If you're not a fan of list-comprehensions then you can instead start with an empty df and keep concatenating your small dfs onto it
large_df = pd.DataFrame()
for csv_file in csv_files:
small_df = pd.read_csv(csv_file)
large_df = pd.concat((large_df,small_df))
ORIGINAL
pd.concat can already do this for you! I've made example data. f1_r1 means "file1, row1" just to show that it's working as expected
import pandas as pd
import io
#The io.StringIO makes these "pretend" .csv files for illustration
#All of this is just to create the example data, you won't do this
file1_name = io.StringIO("""
Header1,Header2,Header3
f1_r1,f1_r1,f1_r1
f1_r2,f1_r2,f1_r2
""")
file2_name = io.StringIO("""
Header1,Header2
f2_r1,f2_r1
f2_r2,f2_r2
""")
file3_name = io.StringIO("""
Header3,Header4
f3_r1,f3_r1
f3_r2,f3_r2
""")
file4_name = io.StringIO("""
Header1,Header3,Header4
f4_r1,f4_r1,f4_r1
f4_r2,f4_r2,f4_r2
""")
#this would be a list of your .csv file names
files = [file1_name, file2_name, file3_name, file4_name]
#pd.concat already handles this use-case for you!
combined_df = pd.concat(pd.read_csv(f_name,encoding='utf8') for f_name in files)
print(combined_df)

Related

Python - for all csv files in folder make new csv file with header names and the file they came from

I have a code where I am writing to five csv files, and after all of the CSV files are created, I would like to run a function to put all of the headers into a csv or xlsx file where each row represents a header in a file.
So in a folder called "Example" there are 5 csv files, called "1.csv", "2.csv"... "5.csv"; for the code I would like to have, a new file would be created called "Headers of files in Example", where the first column is the name of the csv file the header came from, and the second column contains the headers. Ultimately looking like this:contents of Headers of files in example, where the headers of 1.csv are a,b,c and so on.
My python coding is fairly basic at this point, but I definitely think what I would like to do is possible. Any suggestions to help would be greatly appreciated!
After some more digging I was able to find some code that did what I wanted it to, after some slight modifications:
import csv
import glob
import pandas as pd
def headers():
path = r'path to folder containing csv files/'
all_files = glob.glob(path + "*.csv")
files = all_files
myheaders = ['filename', 'header']
with open("Headers of foldername.csv", "w", newline='') as fw:
cw = csv.writer(fw, delimiter=",")
for filename in files:
with open(filename, 'r') as f:
cr = csv.reader(f)
# get title
for column_name in (x.strip() for x in next(cr)):
cw.writerow([filename, column_name])
file = pd.read_csv("Headers of foldername.csv")
file.to_csv("Headers of foldername.csv", header=myheaders, index=False)
Given you have the DataFrames in the memory, you just need to create a new DataFrame, I like to use dictionaries of lists to create it, then for each file/dataframe you extract the columns and upload it to the mock DataFrame.
Later you can save the new DataFrame to a file.
summary_df = {
'file_name': list(),
'headers': list()}
for file, filename in zip(list_of_files, list_of_names):
aux_headers = file.columns.to_list()
summary_df['headers'] += aux_headers
summary_df['file_name'] += [filename] * len(aux_headers)
summary_df = pd.DataFrame(summary_df)
I hope this piece of code helps. Essentially what it does is to iterate over all files you want, their names in file_names then read them using pandas. Once the csv is loaded you extract the headers with df.columns and store it in a list which is then saves as a new csv by pandas.
import pandas as pd
header_names = []
file_names = ['1.csv', '2.csv']
for file_name in file_names:
df = pd.read_csv(file_name)
header_names.extend(list(df.columns))
new_df = pd.DataFrame(l)
new_df.to_csv("headers.csv")

Sort out columns of multiple csv files at once in Python

really appreciate your help.
I have around 200 csv files with same header.
eg of headers are x , y, z, time, id, type
I would like to sort out time colums of all csv files and save them again.
This is so far I have tried. But it doesn't work.
Could you please help me ??
Thank you
import csv
import operator
import glob
import pandas as pd
data = dict() # filename : lists
path="./*.csv"
files=glob.glob(path)
for filename in files:
# process each file
with open(filename, 'r') as f:
# read file to a list of lists
lists = [row for row in csv.reader(f, delimiter=',')]
# sort and save into a dict
sorted_df = lists.sort_values(by=["time"], ascending=True)
sorted_df.to_csv('%.csv', index=False)
I don't have much knowledge about the csv module but you're using pandas and it supports reading csv files with pd.read_csv, why not utilize that..
for filename in files:
df = pd.read_csv(filename)
df.sort_values('time', inplace=True)
df.to_csv(filename, index=False)
This would overwrite all the files with same data sorted by time.

Pandas - Trying to store multiple .txt files in a .csv

I have a folder with about 500 .txt files. I would like to store the content in a csv file, with 2 columns, column 1 being the name of the file and column 2 being the file content in string. So I'd end up with a CSV file with 501 rows.
I've snooped around SO and tried to find similar questions, and came up with the following code:
import pandas as pd
from pandas.io.common import EmptyDataError
import os
def Aggregate_txt_csv(path):
for files in os.listdir(path):
with open(files, 'r') as file:
try:
df = pd.read_csv(file, header=None, delim_whitespace=True)
except EmptyDataError:
df = pd.DataFrame()
return df.to_csv('file.csv', index=False)
However it returns an empty .csv file. Am I doing something wrong?
There are several problems on your code. One of them is that pd.read_csv is not opening file because you're not passing the path to the given file. I think you should try to play from this code
import os
import pandas as pd
from pandas.io.common import EmptyDataError
def Aggregate_txt_csv(path):
files = os.listdir(path)
df = []
for file in files:
try:
d = pd.read_csv(os.path.join(path, file), header=None, delim_whitespace=True)
d["file"] = file
except EmptyDataError:
d = pd.DataFrame({"file":[file]})
df.append(d)
df = pd.concat(df, ignore_index=True)
df.to_csv('file.csv', index=False)
Use pathlib
Path.glob() to find all the files
When using path objects, file.stem returns the file name from the path.
Use pandas.concat to combine the dataframes in df_list
from pathlib import Path
import pandas as pd
p = Path('e:/PythonProjects/stack_overflow') # path to files
files = p.glob('*.txt') # get all txt files
df_list = list() # create an empty list for the dataframes
for file in files: # iterate through each file
with file.open('r') as f:
text = '\n'.join([line.strip() for line in f.readlines()]) # join all rows in list as a single string separated with \n
df_list.append(pd.DataFrame({'filename': [file.stem], 'contents': [text]})) # create and append a dataframe
df_all = pd.concat(df_list) # concat all the dataframes
df_all.to_csv('files.txt', index=False) # save to csv
I noticed there's already an answer, but I've gotten it to work with a relatively simple piece of code. I've only edited the file read-in a little bit, and the dataframe is outputting successfully.
Link here
import pandas as pd
from pandas.io.common import EmptyDataError
import os
def Aggregate_txt_csv(path):
result = []
print(os.listdir(path))
for files in os.listdir(path):
fullpath = os.path.join(path, files)
if not os.path.isfile(fullpath):
continue
with open(fullpath, 'r', errors='replace') as file:
try:
content = '\n'.join(file.readlines())
result.append({'title': files, 'body': content})
except EmptyDataError:
result.append({'title': files, 'body': None})
df = pd.DataFrame(result)
return df
df = Aggregate_txt_csv('files')
print(df)
df.to_csv('result.csv')
Most importantly here, I am appending to an array so as not to run pandas' concatenate function too much, as that would be pretty bad for performance. Additionally, reading in the file should not need read_csv, as there isn't a set format for the file. So using '\n'.join(file.readlines()) allows you to read in the file plainly and take out all lines into a string.
At the end, I convert the array of dictionaries into a final dataframe, and it returns the result.
EDIT: for paths that aren't the current directory, I updated it to append the path so that it could find the necessary files, apologies for the confusion

Copy column,add some text and write in a new csv file

I want to make a script that would copy 2nd column from multiple csv files in a folder and add some text before saving it to a single csv file .
here is what i want to do :
1.) Grab data in the 2nd column from all csv files
2.) Append text "hello" & "welcome" to each row at start and end
3.) Write the data into a single file
I tried creating it using pandas
import os
import pandas as pd
dataframes = [pd.read_csv(p, index_col=2, header=None) for p in ('1.csv','2.csv','3.csv')]
merged_dataframe = pd.concat(dataframes, axis=0)
merged_dataframe.to_csv("all.csv", index=False)
The Problem is -
In above code I am forced to mention the file names manually which is very difficult, as a solution I need to include all csv file *.csv
Need to use something like writr.writerow(("Hello"+r[1]+"welcome"))
As there are multiple csv files with many rows(around 100k) in each file so i need to speed up as well.
Here is a sample of the csv files:
"1.csv" "2.csv" "3.csv"
a,Jac b,William c,James
And here is how I would like the output to look all.csv:
Hello Jac welcome
Hello William welcome
Hello James welcome
Any solution using .merge() .append() or .concat() ??
How can I achieve this using python ?
You don't need pandas for this. Here's a really simple way of doing this with csv
import csv
import glob
with open("path/to/output", 'w') as outfile:
for fpath in glob.glob('path/to/directory/*.csv'):
with open(fpath) as infile:
for row in csv.reader(infile):
outfile.write("Hello {} welcome\n".format(row[1]))
1) If you would like to import all .csv files in a folder, you can just use
for i in [a in os.listdir() if a[-4:] == '.csv']:
#code to read in .csv file and concatenate to existing dataframe
2) To append the text and write to a file, you can map a function to each element of the dataframe's column 2 to add the text.
#existing dataframe called df
df[df.columns[1]].map(lambda x: "Hello {} welcome".format(x)).to_csv(<targetpath>)
#replace <targetpath> with your target path
See http://pandas.pydata.org/pandas-docs/version/0.17.0/generated/pandas.Series.to_csv.html for all the various parameters you can pass in to to_csv.
Here is a non-pandas solution using built in csv module. Not sure about speed.
import os
import csv
path_to_files = "path to files"
all_csv = os.path.join(path_to_files, "all.csv")
file_list = os.listdir(path_to_files)
names = []
for file in file_list:
if file.endswith(".csv"):
path_to_current_file = os.path.join(path_to_files, file)
with open(path_to_current_file, "r") as current_csv:
reader = csv.reader(current_csv, delimiter=',')
for row in reader:
names.append(row[1])
with open(all_csv, "w") as out_csv:
writer = csv.writer(current_csv, delimiter=',')
for name in names:
writer.writerow(["Hello {} welcome".format(name))

How to extract a single row from multiple CSV files to a new file

I have hundreds of CSV files on my disk, and one file added daily and I want to extract one row from each of them and put them in a new file. Then I want to daily add values to that same file. CSV files looks like this:
business_day,commodity,total,delivery,total_lots
.
.
20160831,CTC,,201710,10
20160831,CTC,,201711,10
20160831,CTC,,201712,10
20160831,CTC,Total,,385
20160831,HTC,,201701,30
20160831,HTC,,201702,30
.
.
I want to fetch the row that contains 'Total' from each file. The new file should look like:
business_day,commodity,total,total_lots
20160831,CTC,Total,385
20160901,CTC,Total,555
.
.
The raw files on my disk are named '20160831_foo.CSV', '20160901_foo.CSV etc..
After Googling this I have yet not seen any examples on how to extract only one value from a CSV file. Any hints/help much appreciated. Happy to use pandas if that makes life easier.
I ended up with the following:
import pandas as pd
import glob
list_ = []
filenames = glob.glob('c:\\Financial Data\\*_DAILY.csv')
for filename in filenames:
df = pd.read_csv(filename, index_col = None, usecols = ['business_day', 'commodity', 'total', 'total_lots'], parse_dates = ['business_day'], infer_datetime_format = True)
df = df[((df['commodity'] == 'CTC') & (df['total'] == 'Total'))]
list_.append(df)
df = pd.concat(list_, ignore_index = True)
df['total_lots'] = df['total_lots'].astype(int)
df = df.sort_values(['business_day'])
df = df.set_index('business_day')
Then I save it as my required file.
Read the csv files and process them directly like so:
with open('some.csv', newline='') as f:
reader = csv.reader(f)
for row in reader:
# do something here with `row`
break
I would recommend appending rows onto a list after processing for the rows that you desire, and then passing it onto a pandas Dataframe that will simplify your data manipulations a lot.

Categories