merge multiple CSV files in numerical order - python

I have 1000 CSV files with the same column names, and I want to merge them in numerical order. I am using the code below, but it merges the CSV files in an arbitrary order.
import glob
import os
import pandas as pd

files = os.path.join(path_files, "*_a.csv")
files = glob.glob(files)
df = pd.concat(map(pd.read_csv, files), ignore_index=True)
For example, it puts 1000_a.csv first, then 1_a.csv, and so on. But I want to merge them in order and then remove the first 100 of them, ending up with a single dataframe or CSV file built in this order:
1_a.csv, 2_a.csv, 3_a.csv, ..., 1000_a.csv
Could you please let me know how this can be done?

I hope this will be useful for you:
import pandas as pd

df = pd.DataFrame()
for csv_file in sorted(list_filenames):  # list_filenames is your list of CSV file names; plain sorted() is lexicographic, so use a numeric key for 1, 2, ..., 1000 order
    temp_df = pd.read_csv(csv_file)
    df = pd.concat([df, temp_df])

You can sort the filenames by the integer before the underscore, or by removing the _a.csv suffix (the last 6 characters):
files = os.path.join(path_files, "*_a.csv")
# os.path.basename strips the directory so int() parses only the numeric prefix
files = sorted(glob.glob(files), key=lambda x: int(os.path.basename(x).split('_')[0]))
#alternative 1
#files = sorted(glob.glob(files), key=lambda x: int(os.path.basename(x).replace('_a.csv', '')))
#alternative 2
#files = sorted(glob.glob(files), key=lambda x: int(os.path.basename(x)[:-6]))
df = pd.concat(map(pd.read_csv, files), ignore_index=True)
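If, as in the question, you also want to drop the first 100 files after sorting, a minimal sketch (assuming "remove first 100" means the first 100 files; the merged.csv name is just an example):
files = sorted(glob.glob(os.path.join(path_files, "*_a.csv")), key=lambda x: int(os.path.basename(x).split('_')[0]))
df = pd.concat(map(pd.read_csv, files[100:]), ignore_index=True)  # slice off the first 100 files
df.to_csv("merged.csv", index=False)  # or keep df as a single dataframe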

You should re-order the glob.glob() results like this:
files_path = os.path.join(base_path, "*_a.csv")
files = sorted(glob.glob(files_path), key=lambda name: int(os.path.basename(name)[0:-6]))
df = pd.concat(map(pd.read_csv, files), ignore_index=True)
And there are similar questions about natural sort:
Is there a built in function for string natural sort?
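For reference, a minimal natural-sort key built on the standard re module (a sketch, not code from the linked question):
import re
def natural_key(s):
    # split into digit and non-digit runs so that 2_a.csv sorts before 10_a.csv
    return [int(t) if t.isdigit() else t for t in re.split(r'(\d+)', s)]
files = sorted(glob.glob(files_path), key=natural_key)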

This is an alternative solution. Note that a plain lexicographic sort puts 10_a.csv before 2_a.csv, so use a numeric key if the order matters:
os.chdir(path_files)
all_filenames = sorted(glob.glob('*.csv'))
df = pd.concat([pd.read_csv(f) for f in all_filenames]).reset_index(drop=True)

Related

Creating one unique file from many others saved in a folder

I have a list of CSV files (approx. 100) that I'd like to combine into one single CSV file.
The list is found using:
PATH_DATA_FOLDER = 'mypath/'
list_files = os.listdir(PATH_DATA_FOLDER)
for f in list_files:
    list_columns = list(pd.read_csv(os.path.join(PATH_DATA_FOLDER, f)).columns)
    df = pd.DataFrame(columns=list_columns)
    print(df)
which returns the files (this is just a sample, since I have more than 100 files):
['file1.csv', 'name2.csv', 'example.csv', '.DS_Store']
This, unfortunately, also includes hidden files, which I'd like to exclude.
Each file has the same columns:
Columns: [Name, Surname, Country]
I'd like to find a way to create one unique file with all these fields, plus information about the original file (e.g., a new column with the file name).
I've tried with
df1 = pd.read_csv(os.path.join(PATH_DATA_FOLDER, f))
df1['File'] = f # file name
df = df.append(df1)
df = df.reset_index(drop=True).drop_duplicates() # I'd like to drop duplicates in both Name and Surname
but it returns a dataframe with the last entry, so I guess the problem is in the for loop.
I hope you can provide some help.
import glob
import pandas as pd

extension = 'csv'
all_filenames = glob.glob('*.{}'.format(extension))
#combine all files in the list
combined_csv = pd.concat([pd.read_csv(f) for f in all_filenames])
#drop duplicates and reset index
combined_csv = combined_csv.drop_duplicates().reset_index(drop=True)
#Save the combined file
combined_csv.to_csv("combined_csv.csv", index=False, encoding='utf-8-sig')
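A variant of the same idea that also tags each row with its source file and de-duplicates on Name and Surname, as the question asks (a sketch; the File column name is illustrative):
combined_csv = pd.concat([pd.read_csv(f).assign(File=f) for f in all_filenames], ignore_index=True)
combined_csv = combined_csv.drop_duplicates(subset=['Name', 'Surname']).reset_index(drop=True)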
Have you tried using glob?
filenames = glob.glob("mypath/*.csv")  # list of all your csv files
df = pd.DataFrame(columns=["Name", "Surname", "Country"])
for filename in filenames:
    df = df.append(pd.read_csv(filename))  # note: DataFrame.append was removed in pandas 2.0; use pd.concat instead
df = df.drop_duplicates().reset_index(drop=True)
Another way would be to concatenate the CSV files with the cat command after removing the headers, and then read the concatenated file with pd.read_csv.
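A rough Python equivalent of that cat approach, keeping only the first file's header (a sketch, assuming every file starts with an identical one-line header; combined.csv is an example name):
with open('combined.csv', 'w') as out:
    for i, name in enumerate(all_filenames):
        with open(name) as src:
            header = src.readline()  # read (and skip) the header line
            if i == 0:
                out.write(header)    # keep the header from the first file only
            out.write(src.read())    # copy the remaining rows
df = pd.read_csv('combined.csv')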

Merge all the CSV files in one folder and add a new column to the pandas dataframe with the partial file name in Python

I have a folder containing over 100 CSV files. They all have the same prefix name, e.g.:
School.Math001.csv
School.Math002.csv
School.Physics001.csv
etc. They all contain the same number of columns.
How can I merge all the CSV files into one data frame in Python and add a new column holding each file's name, with the prefix "School." removed?
I found some example code online, but it did not solve my problem:
path = r'C:\Users\me\data'
all_files = glob.glob(os.path.join(path, "*"))
df_from_each_file = (pd.read_csv(f, sep='\t') for f in all_files)
concatdf = pd.concat(df_from_each_file, ignore_index=True)
Try this (I haven't tested it):
import os
import pandas as pd

path = '<folder path to CSVs>'
dfs = []
for filename in os.listdir(path):
    sample_df = pd.read_csv(os.path.join(path, filename))  # join with path so this works from any working directory
    sample_df['filename'] = filename[7:]  # strip the 'School.' prefix (7 characters)
    dfs.append(sample_df)
df = pd.concat(dfs, axis=0, ignore_index=True)
Add DataFrame.assign in a generator comprehension to add the new column:
path = r'C:\Users\me\data'
all_files = glob.glob(os.path.join(path, "*"))
df_from_each_file = (pd.read_csv(f, sep='\t').assign(New=os.path.basename(f)[7:].split('.')[0]) for f in all_files)
concatdf = pd.concat(df_from_each_file, ignore_index=True)

Elegant way to read multiple files but perform summary on one in python

I have multiple files, as shown below. My task is to read all those files, merge them, and create one final dataframe. However, one file (Measurement_table_sep_13th.csv) has to be summarized before being used for the merge: it is too large, so we summarize it first and then merge it.
filenames = sorted(glob.glob('*.csv'))
filenames  # gives the list of files (output omitted here)
for f in filenames:
    print(f)
    if f == 'Measurement_table_sep_13th.csv':
        df = spark.read.csv(f, sep=",", inferSchema=True, header=True)
        df = df.groupby("person_id", "visit_occurrence_id").pivot("measurement_concept_id").agg(
            F.mean(F.col("value_as_number")), F.min(F.col("value_as_number")), F.max(F.col("value_as_number")),
            F.count(F.col("value_as_number")), F.stddev(F.col("value_as_number")),
            F.expr('percentile_approx(value_as_number, 0.25)').alias("25_pc"),
            F.expr('percentile_approx(value_as_number, 0.75)').alias("75_pc"))
    else:
        df = spark.read.csv(f, sep=",", inferSchema=True, header=True)
    try:
        JKeys = ['person_id', 'visit_occurrence_id'] if 'visit_occurrence_id' in df.columns else ['person_id']
        print(JKeys)
        df_final = df_final.join(df, on=JKeys, how='left')
        print("success in try")
    except:
        df_final = df
        print("success in except")
As you can see, I am summarizing the Measurement_table_sep_13th.csv file before merging, but is there a more elegant and efficient way to write this?
If you do not want to save the one file in a different folder, you can also exclude it directly with glob (following this post: glob exclude pattern):
files = glob.glob('files_path/[!_]*')
You can use this to glob all the files except your measurement file, read and concatenate them, and join the summarized table afterwards; that way you avoid the long if-block.
It would look like this (following this post: Loading multiple csv files of a folder into one dataframe):
files = glob.glob("[!M]*.csv")
dfs = [pd.read_csv(f, sep=",") for f in files]  # header=True/inferSchema are Spark options, not pandas ones
df2 = pd.concat(dfs,ignore_index=True)
f = 'Measurement_table_sep_13th.csv'
df = spark.read.csv(f, sep=",", inferSchema=True, header=True)
df = df.groupby("person_id", "visit_occurrence_id").pivot("measurement_concept_id").agg(
    F.mean(F.col("value_as_number")), F.min(F.col("value_as_number")), F.max(F.col("value_as_number")),
    F.count(F.col("value_as_number")), F.stddev(F.col("value_as_number")),
    F.expr('percentile_approx(value_as_number, 0.25)').alias("25_pc"),
    F.expr('percentile_approx(value_as_number, 0.75)').alias("75_pc"))
JKeys = ['person_id', 'visit_occurrence_id'] if 'visit_occurrence_id' in df.columns else ['person_id']
df_final = df.join(df2, on=JKeys, how='left')
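Note that df2 here is a pandas DataFrame while df is a Spark DataFrame, so the join above only works after converting one of them; a hedged sketch:
df2_spark = spark.createDataFrame(df2)  # convert the pandas dataframe to Spark
df_final = df.join(df2_spark, on=JKeys, how='left')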

How to append Dataframe by rows in Python

I would like to merge (using df.append()) some Python dataframes by rows.
The code reported below starts by reading all the JSON files in the input json_dir_path. For each one it reads input_fn = json_data["accPreparedCSVFileName"], which contains the full path where the CSV file is stored, and reads that file into the data frame df_i. When I try to merge with df_output = df_i.append(df_output), I do not obtain the desired results.
def __merge(self, json_dir_path):
    if os.path.exists(json_dir_path):
        filelist = [f for f in os.listdir(json_dir_path)]
        df_output = pd.DataFrame()
        for json_fn in filelist:
            json_full_name = os.path.join(json_dir_path, json_fn)
            # print("[TrainficationWorkflow::__merge] We are merging the json file ", json_full_name)
            if os.path.exists(json_full_name):
                with open(json_full_name, 'r') as in_json_file:
                    json_data = json.load(in_json_file)
                    input_fn = json_data["accPreparedCSVFileName"]
                    df_i = pd.read_csv(input_fn)
                    df_output = df_i.append(df_output)
        return df_output
    else:
        return pd.DataFrame(data=[], columns=self.DATA_FORMAT)
Only 2 files out of 12 get merged. What am I doing wrong?
Any help would be very appreciated.
Best Regards,
Carlo
You can also set ignore_index=True when appending.
df_output = df_i.append(df_output, ignore_index=True)
You can also concatenate the dataframes:
df_output = pd.concat((df_output, df_i), axis=0, ignore_index=True)
As @jpp suggested in his answer, you can load the dataframes into a list and concatenate them in one go.
I strongly recommend you do not concatenate dataframes in a loop.
It is much more efficient to store your dataframes in a list, then concatenate the items of the list in a single call. For example:
lst = []
for fn in csv_filenames:  # csv_filenames: the list of CSV paths collected from the JSON files (name is illustrative)
    lst.append(pd.read_csv(fn))
df_output = pd.concat(lst, ignore_index=True)

Appending several pandas dataframes is not working

I have this code
import os
import pandas as pd
path = r'c:\Temp\factory'
os.chdir(path)
files = os.listdir()
files_csv = [f for f in files if f[-3:] == 'csv']
x = pd.DataFrame()
for f in files_csv:
    data = pd.read_csv(f, sep=';', encoding='latin-1')
    x = x.append(data, ignore_index=True)
I have used the same code before to concatenate CSV files, but now it just does not work.
The problem I face is that only the content of one file makes it into the dataframe named x.
I know I process all the files, and I expect the x dataframe to contain about 10000 rows in total, but I only get the content of one file, approximately 2000 rows.
My files typically look like this:
Computer;Managed by;Given Name
cp1;user1;olle
cp2;user2;niklas
cp3;user3;kalle
I've needed to do something similar before. My solution would be:
x = pd.DataFrame()
for f in files_csv:
    data = pd.read_csv(f, sep=';', encoding='latin-1')
    x = pd.concat([x, data], ignore_index=True, axis=0)  # axis=0 stacks rows; axis=1 would put the files side by side as columns
