merge multiple CSV files in numerical order - python

I have 1000 CSV files with the same column names, and I want to merge them in numerical order. I am using the code below, but it merges the CSV files in an arbitrary order.
import glob
import os
import pandas as pd

files = os.path.join(path_files, "*_a.csv")
files = glob.glob(files)
df = pd.concat(map(pd.read_csv, files), ignore_index=True)
For example, it puts 1000_a.csv first, then 1_a.csv, and so on. But I want to merge them in order and then remove the first 100 of them, ending up with a single dataframe or CSV file built in this order:
1_a.csv, 2_a.csv, 3_a.csv, ..., 1000_a.csv
Could you please let me know how this can be done?

I hope this will be useful for you:
import pandas as pd

df = pd.DataFrame()
for csv_file in sorted(list_filenames):  # list_filenames is your list of CSV file names; plain sorted() is lexicographic, so use a numeric key for 1, 2, ..., 1000 order
    temp_df = pd.read_csv(csv_file)
    df = pd.concat([df, temp_df])

You can sort the filenames by the integer before the underscore, or by removing the _a.csv suffix (the last 6 characters):
files = os.path.join(path_files, "*_a.csv")
# os.path.basename strips the directory so int() parses only the numeric prefix
files = sorted(glob.glob(files), key=lambda x: int(os.path.basename(x).split('_')[0]))
#alternative 1
#files = sorted(glob.glob(files), key=lambda x: int(os.path.basename(x).replace('_a.csv', '')))
#alternative 2
#files = sorted(glob.glob(files), key=lambda x: int(os.path.basename(x)[:-6]))
df = pd.concat(map(pd.read_csv, files), ignore_index=True)
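If, as in the question, you also want to drop the first 100 files after sorting, a minimal sketch (assuming "remove first 100" means the first 100 files; the merged.csv name is just an example):
files = sorted(glob.glob(os.path.join(path_files, "*_a.csv")), key=lambda x: int(os.path.basename(x).split('_')[0]))
df = pd.concat(map(pd.read_csv, files[100:]), ignore_index=True)  # slice off the first 100 files
df.to_csv("merged.csv", index=False)  # or keep df as a single dataframe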

You should re-order the glob.glob() results like this:
files_path = os.path.join(base_path, "*_a.csv")
files = sorted(glob.glob(files_path), key=lambda name: int(os.path.basename(name)[0:-6]))
df = pd.concat(map(pd.read_csv, files), ignore_index=True)
And there are similar questions about natural sort:
Is there a built in function for string natural sort?
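For reference, a minimal natural-sort key built on the standard re module (a sketch, not code from the linked question):
import re
def natural_key(s):
    # split into digit and non-digit runs so that 2_a.csv sorts before 10_a.csv
    return [int(t) if t.isdigit() else t for t in re.split(r'(\d+)', s)]
files = sorted(glob.glob(files_path), key=natural_key)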

This is an alternative solution. Note that a plain lexicographic sort puts 10_a.csv before 2_a.csv, so use a numeric key if the order matters:
os.chdir(path_files)
all_filenames = sorted(glob.glob('*.csv'))
df = pd.concat([pd.read_csv(f) for f in all_filenames]).reset_index(drop=True)

Related

Creating one unique file from many others saved in a folder

I have a list of CSV files (approx. 100) that I'd like to combine into one single CSV file.
The list is found using:
PATH_DATA_FOLDER = 'mypath/'
list_files = os.listdir(PATH_DATA_FOLDER)
for f in list_files:
    list_columns = list(pd.read_csv(os.path.join(PATH_DATA_FOLDER, f)).columns)
    df = pd.DataFrame(columns=list_columns)
    print(df)
which returns the files (this is just a sample, since I have more than 100 files):
['file1.csv', 'name2.csv', 'example.csv', '.DS_Store']
This, unfortunately, also includes hidden files, which I'd like to exclude.
Each file has the same columns:
Columns: [Name, Surname, Country]
I'd like to find a way to create one unique file with all these fields, plus information about the original file (e.g., a new column with the file name).
I've tried with
df1 = pd.read_csv(os.path.join(PATH_DATA_FOLDER, f))
df1['File'] = f # file name
df = df.append(df1)
df = df.reset_index(drop=True).drop_duplicates() # I'd like to drop duplicates in both Name and Surname
but it returns a dataframe with the last entry, so I guess the problem is in the for loop.
I hope you can provide some help.
import glob
import pandas as pd

extension = 'csv'
all_filenames = glob.glob('*.{}'.format(extension))
#combine all files in the list
combined_csv = pd.concat([pd.read_csv(f) for f in all_filenames])
#drop duplicates and reset index
combined_csv = combined_csv.drop_duplicates().reset_index(drop=True)
#Save the combined file
combined_csv.to_csv("combined_csv.csv", index=False, encoding='utf-8-sig')
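A variant of the same idea that also tags each row with its source file and de-duplicates on Name and Surname, as the question asks (a sketch; the File column name is illustrative):
combined_csv = pd.concat([pd.read_csv(f).assign(File=f) for f in all_filenames], ignore_index=True)
combined_csv = combined_csv.drop_duplicates(subset=['Name', 'Surname']).reset_index(drop=True)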
Have you tried using glob?
filenames = glob.glob("mypath/*.csv")  # list of all your csv files
df = pd.DataFrame(columns=["Name", "Surname", "Country"])
for filename in filenames:
    df = df.append(pd.read_csv(filename))  # note: DataFrame.append was removed in pandas 2.0; use pd.concat instead
df = df.drop_duplicates().reset_index(drop=True)
Another way would be to concatenate the CSV files with the cat command after removing the headers, and then read the concatenated file with pd.read_csv.
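A rough Python equivalent of that cat approach, keeping only the first file's header (a sketch, assuming every file starts with an identical one-line header; combined.csv is an example name):
with open('combined.csv', 'w') as out:
    for i, name in enumerate(all_filenames):
        with open(name) as src:
            header = src.readline()  # read (and skip) the header line
            if i == 0:
                out.write(header)    # keep the header from the first file only
            out.write(src.read())    # copy the remaining rows
df = pd.read_csv('combined.csv')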

Merge all the CSV files in one folder and add a new column to the pandas dataframe with the partial file name in Python

I have a folder containing over 100 CSV files. They all have the same prefix name, e.g.:
School.Math001.csv
School.Math002.csv
School.Physics001.csv
etc. They all contain the same number of columns.
How can I merge all the CSV files into one data frame in Python and add a new column holding each file's name, with the prefix "School." removed?
I found some example code online, but it did not solve my problem:
path = r'C:\Users\me\data'
all_files = glob.glob(os.path.join(path, "*"))
df_from_each_file = (pd.read_csv(f, sep='\t') for f in all_files)
concatdf = pd.concat(df_from_each_file, ignore_index=True)
Try this (I haven't tested it):
import os
import pandas as pd

path = '<folder path to CSVs>'
dfs = []
for filename in os.listdir(path):
    sample_df = pd.read_csv(os.path.join(path, filename))  # join with path so this works from any working directory
    sample_df['filename'] = filename[7:]  # strip the 'School.' prefix (7 characters)
    dfs.append(sample_df)
df = pd.concat(dfs, axis=0, ignore_index=True)
Add DataFrame.assign in a generator comprehension to add the new column:
path = r'C:\Users\me\data'
all_files = glob.glob(os.path.join(path, "*"))
df_from_each_file = (pd.read_csv(f, sep='\t').assign(New=os.path.basename(f)[7:].split('.')[0]) for f in all_files)
concatdf = pd.concat(df_from_each_file, ignore_index=True)

Elegant way to read multiple files but perform summary on one in python

I have multiple files, as shown below. My task is to read all those files, merge them, and create one final dataframe. However, one file (Measurement_table_sep_13th.csv) has to be summarized before being used for the merge: it is too large, so we summarize it first and then merge it.
filenames = sorted(glob.glob('*.csv'))
filenames  # gives the list of files (output omitted here)
for f in filenames:
    print(f)
    if f == 'Measurement_table_sep_13th.csv':
        df = spark.read.csv(f, sep=",", inferSchema=True, header=True)
        df = df.groupby("person_id", "visit_occurrence_id").pivot("measurement_concept_id").agg(
            F.mean(F.col("value_as_number")), F.min(F.col("value_as_number")), F.max(F.col("value_as_number")),
            F.count(F.col("value_as_number")), F.stddev(F.col("value_as_number")),
            F.expr('percentile_approx(value_as_number, 0.25)').alias("25_pc"),
            F.expr('percentile_approx(value_as_number, 0.75)').alias("75_pc"))
    else:
        df = spark.read.csv(f, sep=",", inferSchema=True, header=True)
    try:
        JKeys = ['person_id', 'visit_occurrence_id'] if 'visit_occurrence_id' in df.columns else ['person_id']
        print(JKeys)
        df_final = df_final.join(df, on=JKeys, how='left')
        print("success in try")
    except:
        df_final = df
        print("success in except")
As you can see, I am summarizing the Measurement_table_sep_13th.csv file before merging, but is there a more elegant and efficient way to write this?
If you do not want to save the one file in a different folder, you can also exclude it directly with glob (following this post: glob exclude pattern):
files = glob.glob('files_path/[!_]*')
You can use this to glob all the files except your measurement file, read and concatenate them, and join the summarized table afterwards; that way you avoid the long if-block.
It would look like this (following this post: Loading multiple csv files of a folder into one dataframe):
files = glob.glob("[!M]*.csv")
dfs = [pd.read_csv(f, sep=",") for f in files]  # header=True/inferSchema are Spark options, not pandas ones
df2 = pd.concat(dfs,ignore_index=True)
f = 'Measurement_table_sep_13th.csv'
df = spark.read.csv(f, sep=",", inferSchema=True, header=True)
df = df.groupby("person_id", "visit_occurrence_id").pivot("measurement_concept_id").agg(
    F.mean(F.col("value_as_number")), F.min(F.col("value_as_number")), F.max(F.col("value_as_number")),
    F.count(F.col("value_as_number")), F.stddev(F.col("value_as_number")),
    F.expr('percentile_approx(value_as_number, 0.25)').alias("25_pc"),
    F.expr('percentile_approx(value_as_number, 0.75)').alias("75_pc"))
JKeys = ['person_id', 'visit_occurrence_id'] if 'visit_occurrence_id' in df.columns else ['person_id']
df_final = df.join(df2, on=JKeys, how='left')
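Note that df2 here is a pandas DataFrame while df is a Spark DataFrame, so the join above only works after converting one of them; a hedged sketch:
df2_spark = spark.createDataFrame(df2)  # convert the pandas dataframe to Spark
df_final = df.join(df2_spark, on=JKeys, how='left')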

How to append Dataframe by rows in Python

I would like to merge (using df.append()) some Python dataframes by rows.
The code reported below starts by reading all the JSON files in the input json_dir_path. For each one it reads input_fn = json_data["accPreparedCSVFileName"], which contains the full path where the CSV file is stored, and reads that file into the data frame df_i. When I try to merge with df_output = df_i.append(df_output), I do not obtain the desired results.
def __merge(self, json_dir_path):
    if os.path.exists(json_dir_path):
        filelist = [f for f in os.listdir(json_dir_path)]
        df_output = pd.DataFrame()
        for json_fn in filelist:
            json_full_name = os.path.join(json_dir_path, json_fn)
            # print("[TrainficationWorkflow::__merge] We are merging the json file ", json_full_name)
            if os.path.exists(json_full_name):
                with open(json_full_name, 'r') as in_json_file:
                    json_data = json.load(in_json_file)
                    input_fn = json_data["accPreparedCSVFileName"]
                    df_i = pd.read_csv(input_fn)
                    df_output = df_i.append(df_output)
        return df_output
    else:
        return pd.DataFrame(data=[], columns=self.DATA_FORMAT)
Only 2 files out of 12 get merged. What am I doing wrong?
Any help would be very appreciated.
Best Regards,
Carlo
You can also set ignore_index=True when appending.
df_output = df_i.append(df_output, ignore_index=True)
You can also concatenate the dataframes:
df_output = pd.concat((df_output, df_i), axis=0, ignore_index=True)
As @jpp suggested in his answer, you can load the dataframes into a list and concatenate them in one go.
I strongly recommend you do not concatenate dataframes in a loop.
It is much more efficient to store your dataframes in a list, then concatenate the items of the list in a single call. For example:
lst = []
for fn in csv_filenames:  # csv_filenames: the list of CSV paths collected from the JSON files (name is illustrative)
    lst.append(pd.read_csv(fn))
df_output = pd.concat(lst, ignore_index=True)

Appending several pandas dataframes is not working

I have this code
import os
import pandas as pd
path = r'c:\Temp\factory'
os.chdir(path)
files = os.listdir()
files_csv = [f for f in files if f[-3:] == 'csv']
x = pd.DataFrame()
for f in files_csv:
    data = pd.read_csv(f, sep=';', encoding='latin-1')
    x = x.append(data, ignore_index=True)
I have used the same code before to concatenate CSV files, but now it just does not work.
The problem I face is that only the content of one file makes it into the dataframe named x.
I know I process all the files, and I expect the x dataframe to contain about 10000 rows in total, but I only get the content of one file, approximately 2000 rows.
My files typically look like this:
Computer;Managed by;Given Name
cp1;user1;olle
cp2;user2;niklas
cp3;user3;kalle
I've needed to do something similar before. My solution would be:
x = pd.DataFrame()
for f in files_csv:
    data = pd.read_csv(f, sep=';', encoding='latin-1')
    x = pd.concat([x, data], ignore_index=True, axis=0)  # axis=0 stacks rows; axis=1 would put the files side by side as columns
