I am using a for loop to merge CSV files in a Jupyter notebook; however, my result is a list instead of a DataFrame. Could someone help me and tell me what I am doing wrong? Thank you in advance.
files = ['babd_light_z1.csv','babd_light_z2.csv','babd_light_z3.csv']
data = []
for f in files:
    data.append(pd.read_csv(f))
type(data) # returns list
You can simply use pd.concat(data, axis=0, ignore_index=True) outside your loop to merge your csv files as in:
files = ['babd_light_z1.csv', 'babd_light_z2.csv', 'babd_light_z3.csv']
data = []
for f in files:
    data.append(pd.read_csv(f))
df = pd.concat(data, axis=0, ignore_index=True)
type(df) should return pandas.core.frame.DataFrame
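As a side note, the reading loop and the concat can also be collapsed into a single call; a minimal sketch using the same files list as above:
# Read every CSV and stack the resulting frames in one go.
df = pd.concat((pd.read_csv(f) for f in files), ignore_index=True)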
Give this a shot:
combined = pd.concat(data, axis=0)
I have 1000 CSV files with the same column names. I want to merge them in order. I am using the code below, but it merges the CSV files in an arbitrary order.
files = os.path.join(path_files, "*_a.csv")
files = glob.glob(files)
df = pd.concat(map(pd.read_csv, files), ignore_index=True)
For example, it puts 1000_a.csv first, then 1_a.csv, and so on.
But I want to merge them in order and then remove the first 100 of them,
like this, as a dataframe or a single csv file:
1_a.csv, 2_a.csv, 3_a.csv, ..., 1000_a.csv
Could you please let me know how this is possible?
I hope this will be useful for you:
import pandas as pd
df = pd.DataFrame()
for csv_file in sorted(list_filenames):
    temp_df = pd.read_csv(csv_file)
    df = pd.concat([df, temp_df])
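Note that list_filenames is not defined in the snippet above; a minimal sketch of how it could be built from the question's path_files is shown below. Also note that the plain sorted() in the loop above orders names lexicographically (1000_a.csv before 2_a.csv); for numeric order, use a key like the ones in the answers below.
import glob
import os

# Hypothetical: collect the *_a.csv paths from the question's folder.
list_filenames = glob.glob(os.path.join(path_files, "*_a.csv"))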
You can sort the filenames by the integer before _, or by removing _a.csv (the last 6 characters):
files = os.path.join(path_files, "*_a.csv")
# os.path.basename strips the directory part so int() only sees the number
files = sorted(glob.glob(files), key=lambda x: int(os.path.basename(x).split('_')[0]))
# alternative 1
# files = sorted(glob.glob(files), key=lambda x: int(os.path.basename(x).replace('_a.csv', '')))
# alternative 2
# files = sorted(glob.glob(files), key=lambda x: int(os.path.basename(x)[:-6]))
df = pd.concat(map(pd.read_csv, files), ignore_index=True)
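The question also mentions removing the first 100; assuming that means skipping the first 100 files after sorting (rather than dropping rows), you could slice the sorted list before reading:
# Hypothetical: drop the first 100 files once they are in numeric order,
# then concatenate the rest into one dataframe.
files_to_read = files[100:]
df = pd.concat(map(pd.read_csv, files_to_read), ignore_index=True)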
You should re-order the glob.glob() results like this:
files_path = os.path.join(base_path, "*_a.csv")
# strip the directory with os.path.basename, then drop the trailing "_a.csv" (6 characters)
files = sorted(glob.glob(files_path), key=lambda name: int(os.path.basename(name)[:-6]))
df = pd.concat(map(pd.read_csv, files), ignore_index=True)
And there are similar questions about natural sort:
Is there a built in function for string natural sort?
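For reference, a minimal natural-sort key (a sketch based on the usual approach of splitting names into digit and non-digit chunks) could look like this:
import glob
import os
import re

def natural_key(path):
    # Split the basename into digit and non-digit chunks so that
    # "2_a.csv" sorts before "10_a.csv".
    name = os.path.basename(path)
    return [int(chunk) if chunk.isdigit() else chunk
            for chunk in re.split(r'(\d+)', name)]

files = sorted(glob.glob(files_path), key=natural_key)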
This is an alternative solution.
os.chdir(path_files)
all_filenames = sorted(glob.glob('*.csv'), key=lambda f: int(f.split('_')[0]))
df = pd.concat([pd.read_csv(f) for f in all_filenames]).reset_index(drop=True)
I am running a loop to open and modify a set of files in a directory using pandas. I am testing on a subset of 10 files, and one of them is somehow transposing onto another, and I have no idea why. I have a column for the filename, and it shows the correct file, but the data comes from the other file. It's only this file and I can't figure out why. In the end I get a concatenated dataset in which a subset of rows is identical except for the "filename". It seems to be happening before line 8, because that output file has the incorrect info as well. The source files are indeed different and the names of the files are not the same.
Thank you for any help!
for filename in os.listdir(directory):
    if filename.endswith(".xlsx"):
        df = pd.read_excel(filename, header = None)
        for i, row in df.iterrows():
            if row.notnull().all():
                df2 = df.iloc[(i+1):].reset_index(drop=True)
                df2.columns = list(df.iloc[i])
        df2.to_excel(filename+"test.xlsx", index=filename)
all_filenames = glob.glob(os.path.join(directory,'*test2.xlsx'))
CAT = pd.concat([pd.read_excel(f) for f in all_filenames ], ignore_index=True, sort=False)
CAT.pop("Unnamed: 0")
CAT.to_excel("All_DF.xlsx", index=filename)
CAT.to_csv("All_DF.csv", index=filename)
I have multiple files as shown below. My task is to read all those files, merge them and create one final dataframe. However, one file (Measurement_table_sep_13th.csv) has to be summarized before being used for the merge. It is too huge, so we summarize it and then merge it.
filenames = sorted(glob.glob('*.csv'))
for f in filenames:
    print(f)
    if f == 'Measurement_table_sep_13th.csv':
        df = spark.read.csv(f, sep=",", inferSchema=True, header=True)
        df = df.groupby("person_id", "visit_occurrence_id").pivot("measurement_concept_id").agg(
            F.mean(F.col("value_as_number")), F.min(F.col("value_as_number")), F.max(F.col("value_as_number")),
            F.count(F.col("value_as_number")), F.stddev(F.col("value_as_number")),
            F.expr('percentile_approx(value_as_number, 0.25)').alias("25_pc"),
            F.expr('percentile_approx(value_as_number, 0.75)').alias("75_pc"))
    else:
        df = spark.read.csv(f, sep=",", inferSchema=True, header=True)
    try:
        JKeys = ['person_id', 'visit_occurrence_id'] if 'visit_occurrence_id' in df.columns else ['person_id']
        print(JKeys)
        df_final = df_final.join(df, on=JKeys, how='left')
        print("success in try")
    except:
        df_final = df
        print("success in except")
As you can see, I am summarizing the Measurement_table_sep_13th.csv file before merging, but is there a more elegant and efficient way to write this?
If you do not want to save the one file in a different folder, you can also exclude it directly with glob, following this post:
glob exclude pattern
files = glob.glob('files_path/[!_]*')
You can use this to glob all the files except your measurement file and then join them, so you can avoid the long if-code. It would look like this (following this post: Loading multiple csv files of a folder into one dataframe):
files = glob.glob("[!M]*.csv")
dfs = [pd.read_csv(f, header=True, sep=";", inferShema=True) for f in files]
df2 = pd.concat(dfs,ignore_index=True)
df = spark.read.csv(f, sep=",",inferSchema=True, header=True)
df = df.groupby("person_id","visit_occurrence_id").pivot("measurement_concept_id").agg(F.mean(F.col("value_as_number")), F.min(F.col("value_as_number")), F.max(F.col("value_as_number")),
F.count(F.col("value_as_number")),F.stddev(F.col("value_as_number")),
F.expr('percentile_approx(value_as_number, 0.25)').alias("25_pc"),
F.expr('percentile_approx(value_as_number, 0.75)').alias("75_pc"))
JKeys = ['person_id', 'visit_occurrence_id'] if 'visit_occurrence_id' in df.columns else ['person_id']
df_final = df(df2, on=JKeys, how='left')
I would like to merge (using df.append()) some Python dataframes by rows.
The code reported below starts by reading all the JSON files in the input json_dir_path; it reads input_fn = json_data["accPreparedCSVFileName"], which contains the full path where the CSV file is stored, and reads that file into the dataframe df_i. When I merge with df_output = df_i.append(df_output), I do not obtain the desired result.
def __merge(self, json_dir_path):
    if os.path.exists(json_dir_path):
        filelist = [f for f in os.listdir( json_dir_path )]
        df_output = pd.DataFrame()
        for json_fn in filelist:
            json_full_name = os.path.join( json_dir_path, json_fn )
            # print("[TrainficationWorkflow::__merge] We are merging the json file ", json_full_name)
            if os.path.exists(json_full_name):
                with open(json_full_name, 'r') as in_json_file:
                    json_data = json.load(in_json_file)
                    input_fn = json_data["accPreparedCSVFileName"]
                    df_i = pd.read_csv(input_fn)
                    df_output = df_i.append(df_output)
        return df_output
    else:
        return pd.DataFrame(data=[], columns=self.DATA_FORMAT)
Only 2 files out of 12 end up merged. What am I doing wrong?
Any help would be very appreciated.
Best Regards,
Carlo
You can also set ignore_index=True when appending.
df_output = df_i.append(df_output, ignore_index=True)
You can also concatenate the dataframes:
df_output = pd.concat((df_output, df_i), axis=0, ignore_index=True)
As @jpp suggested in his answer, you can load the dataframes into a list and concatenate them in one go.
I strongly recommend you do not concatenate dataframes in a loop.
It is much more efficient to store your dataframes in a list, then concatenate items of your list in one call. For example:
lst = []
for fn in input_fn:
    lst.append(pd.read_csv(fn))
df_output = pd.concat(lst, ignore_index=True)
I have 100 dataframes (formatted exactly the same) saved on my disk as 100 pickle files. These dataframes are each roughly 250,000 rows long. I want to save all 100 dataframes in 1 dataframe which I want to save on my disk as 1 pickle file.
This is what I am doing so far:
path = '/Users/srayan/Desktop/MyData/Pickle'
df = pd.DataFrame()
for filename in glob.glob(os.path.join(path, '*.pkl')):
    newDF = pd.read_pickle(filename)
    df = df.append(newDF)
df.to_pickle("/Users/srayan/Desktop/MyData/Pickle/MergedPickle.pkl")
I understand that pickle serializes the dataframe, but is it necessary for me to take each pickle file, deserialize it, append the dataframe, and then serialize it again? Or is there a faster way to do this? With all the data I have, this is getting slow.
You can use a list comprehension to read each df into a list and call concat only once:
files = glob.glob('files/*.pkl')
df = pd.concat([pd.read_pickle(fp) for fp in files], ignore_index=True)
which is the same as:
dfs = []
for filename in glob.glob('files/*.pkl'):
    newDF = pd.read_pickle(filename)
    dfs.append(newDF)
df = pd.concat(dfs, ignore_index=True)
A more compact version in one line:
df = pd.concat(map(pd.read_pickle, glob.glob(os.path.join(path, '*.pkl'))))
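If the goal is still a single pickle file on disk, as in the question, a short follow-up (reusing the path variable and the MergedPickle.pkl name from the question) would be:
# Write the concatenated dataframe back to one pickle file.
df.to_pickle(os.path.join(path, 'MergedPickle.pkl'))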