Trying to merge different CSV files and label the columns - python

I'm trying to get a single dataset by merging several CSV files from one folder. So I would like to merge the different files, each of which has 4 columns. I would also like to label the four columns using names=[] in pd.concat.
I'm using this code:
import glob
import pandas as pd

path = r'C:\Users\chiar\Desktop\folder' # defining the path
all_files = glob.glob(path + "/*.csv")
df = pd.concat((pd.read_csv(f) for f in all_files), ignore_index=True, names=['quat_1', 'quat_2', 'quat_3', 'quat_4'])
The problem is that instead of getting 4 columns I get 25, and the labels are not applied.
Could someone tell me what I'm doing wrong? Thank you very much!

Use the parameter names in read_csv if there is no header in the files. In pd.concat, names labels the levels of the resulting hierarchical index, not the columns, which is why your labels were not applied:
names = ['quat_1', 'quat_2', 'quat_3', 'quat_4']
df = pd.concat((pd.read_csv(f, names=names) for f in all_files), ignore_index=True)
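For completeness, here is a full runnable version of that fix (a sketch; whether your files carry a header row of their own is an assumption you should check):

import glob
import pandas as pd

path = r'C:\Users\chiar\Desktop\folder'
all_files = glob.glob(path + "/*.csv")

names = ['quat_1', 'quat_2', 'quat_3', 'quat_4']
# pass header=0 as well if the files contain their own header row,
# so it is replaced instead of being read as a data row
df = pd.concat(
    (pd.read_csv(f, names=names) for f in all_files),
    ignore_index=True,
)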

Related

How to read multiple csv files with specific name from a folder and merge them?

I am trying to read multiple files with specific names (1.car.csv, 2.car.csv, and so on) from a folder, add a new label at the rightmost end of each dataset on each iteration, and merge all the csv files into one. As the ".car.csv" part is constant, I think I can use a for loop with .format(index) to run over the csv files. All of the csv files have the same attributes.
Kindly help me!
glob is used to get all files in the folder that match the pattern *.csv.
pd.read_csv is used to read each file as a DataFrame.
index_col=None tells Pandas not to use any of the columns as the index and instead to create a default index for the DataFrame.
header=0 tells Pandas to use the first row of the CSV file as the header row.
pd.concat is used to merge all the DataFrames into a single DataFrame merged_df.
axis=0 means that the concatenation happens along the rows (vertically).
ignore_index=True discards the original indices of the individual DataFrames and creates a new default index for the resulting DataFrame.
import glob
import pandas as pd

path = r'<path to folder containing csv files>'
all_files = glob.glob(path + "/*.csv")

lst = []
for filename in all_files:
    df = pd.read_csv(filename, index_col=None, header=0)
    lst.append(df)

merged_df = pd.concat(lst, axis=0, ignore_index=True)
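As a side note, building the list first and calling pd.concat once avoids the repeated copying you get when concatenating inside a loop; the same result can also be written compactly with a generator expression:

merged_df = pd.concat(
    (pd.read_csv(f, index_col=None, header=0) for f in all_files),
    axis=0,
    ignore_index=True,
)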
This can be easily done with a CSV tool like miller:
mlr --csv cat --filename bla1.csv *.car.csv
This will concatenate the files (without repeating the header) and prepend the filename as the first column.
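For comparison, a rough pandas equivalent of that one-liner (a sketch; the column name file and the output filename merged.csv are my own choices, not part of the Miller answer):

import glob
import pandas as pd

files = sorted(glob.glob("*.car.csv"))
merged = pd.concat(
    (pd.read_csv(f).assign(file=f) for f in files),
    ignore_index=True,
)
# move the file column to the front, mirroring Miller's --filename behaviour
merged = merged[["file"] + [c for c in merged.columns if c != "file"]]
merged.to_csv("merged.csv", index=False)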
You can use the pandas library this way:
import pandas as pd
import os

# path to folder where the csv files are stored
path = '/path/to/folder'

result = pd.DataFrame()
for i in range(1, n + 1):
    filename = "{}.car.csv".format(i)
    file_path = os.path.join(path, filename)
    df = pd.read_csv(file_path)
    df['new_label'] = i
    result = pd.concat([result, df], ignore_index=True)

result.to_csv('final_result.csv', index=False)
The n in the code above should be replaced with the number of csv files you have in the folder.
If you need any explanation of the code (in case you're new to python or dataframes) just comment below.
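If you'd rather not count the files by hand, n can also be derived with glob (a small sketch, assuming every file really is named <number>.car.csv):

import glob
import os

# count the matching files instead of hard-coding n
n = len(glob.glob(os.path.join(path, "*.car.csv")))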
Using pathlib and pandas, you can use .assign() to add the new column and finally .concat() to concatenate all the files into one.
from pathlib import Path
import pandas as pd

input_path = Path("path/to/car/files/").glob("*car.csv")
output_path = "path/to/output"

pd.concat(
    (pd.read_csv(x).assign(new_label="new data") for x in input_path),
    ignore_index=True,
).to_csv(f"{output_path}/final.csv", index=False)
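If the label should identify the source file instead of being the constant "new data", the same pattern works with the file's stem (this variant is my adaptation, not part of the original answer):

from pathlib import Path
import pandas as pd

input_path = Path("path/to/car/files/").glob("*car.csv")

pd.concat(
    (pd.read_csv(x).assign(new_label=x.stem) for x in input_path),  # x.stem = filename without extension
    ignore_index=True,
).to_csv("path/to/output/final.csv", index=False)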

Data from one dataframe going to another dataframe Pandas

I am running a loop to open and modify a set of files in a directory using pandas. I am testing on a subset of 10 files, and one of them is somehow transposing onto another, and I have no idea why. I have a column for filename and it is the correct file, but it is using data from the other file. It's only this file and I can't figure out why. In the end I get a concatenated dataset where a subset are identical except for the "filename". It seems to be happening before line 8, because that output file has the incorrect info as well. The source files are indeed different, and the names of the files are not the same.
Thank you for any help!
for filename in os.listdir(directory):
    if filename.endswith(".xlsx"):
        df = pd.read_excel(filename, header=None)
        for i, row in df.iterrows():
            if row.notnull().all():
                df2 = df.iloc[(i+1):].reset_index(drop=True)
                df2.columns = list(df.iloc[i])
        df2.to_excel(filename + "test.xlsx", index=filename)

all_filenames = glob.glob(os.path.join(directory, '*test2.xlsx'))
CAT = pd.concat([pd.read_excel(f) for f in all_filenames], ignore_index=True, sort=False)
CAT.pop("Unnamed: 0")
CAT.to_excel("All_DF.xlsx", index=filename)
CAT.to_csv("All_DF.csv", index=filename)
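One thing worth checking in the loop above: df2 is only reassigned when a fully non-null row is found, so a file that never triggers that condition silently reuses the df2 left over from the previous file, which would produce exactly the duplicated-data symptom described. Note also that the glob pattern '*test2.xlsx' does not match the 'test.xlsx' suffix being written. A minimal restructured sketch (my reading of the intent, not a confirmed fix) that scopes df2 to each file:

import os
import pandas as pd

# directory is assumed to be defined as in the question
for filename in os.listdir(directory):
    if not filename.endswith(".xlsx"):
        continue
    df = pd.read_excel(os.path.join(directory, filename), header=None)
    df2 = None  # reset per file so stale data cannot leak between files
    for i, row in df.iterrows():
        if row.notnull().all():
            df2 = df.iloc[(i + 1):].reset_index(drop=True)
            df2.columns = list(df.iloc[i])
            break  # stop at the first fully populated (header) row
    if df2 is not None:
        df2.to_excel(os.path.join(directory, filename + "test.xlsx"), index=False)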

Concatenate files into one Dataframe while adding identifier for each file

The first part of this question has been asked many times and the best answer I found was here: Import multiple csv files into pandas and concatenate into one DataFrame.
But what I essentially want to do is be able to add another variable to each dataframe that has participant number, such that when the files are all concatenated, I will be able to have participant identifiers.
The files are named with participant identifiers (ucsd1, etc.), so perhaps I could just add a column with the ucsd1, etc. to identify each participant?
Here's code that I've gotten to work for Excel files:
path = r"/Users/jamesades/desktop/Watch_data_1/Re__Personalized_MH_data_call"
all_files = glob.glob(path + "/*.xlsx")
li = []
for filename in all_files:
df = pd.read_excel(filename, index_col=None, header=0)
li.append(df)
frame = pd.concat(li, axis=0, ignore_index=True)
If I understand you correctly, it's simple:
import glob
import pandas as pd
import re  # <-------------- Add this line

path = r"/Users/jamesades/desktop/Watch_data_1/Re__Personalized_MH_data_call"
all_files = glob.glob(path + "/*.xlsx")

li = []
for filename in all_files:
    df = pd.read_excel(filename, index_col=None, header=0)
    participant_number = int(re.search(r'(\d+)', filename).group(1))  # <-------------- Add this line
    df['participant_number'] = participant_number  # <-------------- Add this line
    li.append(df)

frame = pd.concat(li, axis=0, ignore_index=True)
That way, each dataframe loaded from an Excel file will have a column called participant_number, and the value of that column in each row will be the number found in the name of the file that the dataframe was loaded from.
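One caveat: glob returns full paths, and this particular path contains a digit of its own (Watch_data_1), so re.search(r'(\d+)', filename) can match a directory name instead of the participant number. Restricting the search to the basename sidesteps that (a defensive tweak, not part of the original answer):

import os
import re

# search only the file's own name, not the directories above it
participant_number = int(re.search(r'(\d+)', os.path.basename(filename)).group(1))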

How to combine a large number of dataframes?

I have many .txt files in a folder.
For example, each .txt file is like below.
FileA = pd.DataFrame({'Id': ["a", "b", "c"], 'Id2': ["a", "b", "z"], 'Amount': [10, 30, 50]})
FileB = pd.DataFrame({'Id': ["d", "e", "f", "z"], 'Id2': ["g", "h", "i", "j"], 'Amount': [10, 30, 50, 100]})
FileC = pd.DataFrame({'Id': ["r", "e"], 'Id2': ["o", "i"], 'Amount': [6, 33]})
FileD...
I want to extract the first row of each dataframe in the folder, and then combine all of them.
So what I did is below.
To make a list of the txt files, I did the following.
txtfiles = []
for file in glob.glob("*.txt"):
    txtfiles.append(file)
To extract the first row of each and combine all of them, I did the following.
pd.read_table(txtfiles[0])[:1].append([pd.read_table(txtfiles[1])[:1],pd.read_table(txtfiles[2])[:1]],pd.read_table.......)
If the number of .txt files is small I can do it this way, but when there are many .txt files I need an automated method.
Does anyone know how to automate this?
Thanks for your help!
Based on Sid's answer to this post:
import glob
import os
import pandas as pd

input_path = r"insert/your/path"  # use the path where you stored the txt files
all_files = glob.glob(os.path.join(input_path, "*.txt"))
df_from_each_file = (pd.read_csv(f, nrows=1) for f in all_files)
concatenated_df = pd.concat(df_from_each_file, ignore_index=True)
Update: pd.read_csv was not properly ingesting the files. Replacing read_csv with read_table should give the expected results:
input_path = r"insert/your/path"  # use the path where you stored the txt files
all_files = glob.glob(os.path.join(input_path, "*.txt"))
df_from_each_file = (pd.read_table(f, nrows=1) for f in all_files)
concatenated_df = pd.concat(df_from_each_file, ignore_index=True)
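For what it's worth, pd.read_table is essentially pd.read_csv with sep='\t' as its default, so if the .txt files are tab-delimited you can also stay on read_csv and pass the separator explicitly:

df_from_each_file = (pd.read_csv(f, sep="\t", nrows=1) for f in all_files)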

How to pass labels to concatenated data frame to specify which data came from which CSV file while reading multiple CSV files using Glob

I have 27 CSV files which contain data about various state GDPs. I am reading those CSV files using glob and then concatenating them into a single data frame. Now the problem is that I want to specify labels so that in the concatenated data frame I can identify which dataset is for which state.
I have already tried to pass the list of states as the keys parameter of the pd.concat() method, which does the required labeling, but in my case it is not working.
path = r'C:\folder A' # use your path (raw string, so the backslash is not treated as an escape)
all_files = glob.glob(path + "/*.csv")
df_from_each_file = (pd.read_csv(f, encoding = "ISO-8859-1", index_col=None, header=0, sep=",") for f in all_files)
concatenated_df = pd.concat(df_from_each_file, ignore_index=True,keys=['Andhra_Pradesh','Arunachal_Pradesh','Assam','Bihar','Chhattisgarh','Goa','Gujarat','Haryana','Himachal_Pradesh','Jharkhand','Karnataka','Kerala','Madhya_Pradesh','Maharashtra','Manipur','Meghalaya','Mizoram','Nagaland','Odisha','Punjab','Rajasthan','Sikkim','Tamil_Nadu','Telangana','Tripura','Uttar_Pradesh','Uttarakhand'], sort=True)
The keys parameter is ignored when ignore_index=True, because a new default index replaces the keyed one. I think you need to manually add the states first:
keys = ['Andhra_Pradesh','Arunachal_Pradesh','Assam','Bihar','Chhattisgarh','Goa','Gujarat','Haryana','Himachal_Pradesh','Jharkhand','Karnataka','Kerala','Madhya_Pradesh','Maharashtra','Manipur','Meghalaya','Mizoram','Nagaland','Odisha','Punjab','Rajasthan','Sikkim','Tamil_Nadu','Telangana','Tripura','Uttar_Pradesh','Uttarakhand']
df_list = list(df_from_each_file)  # materialize the generator so the loop does not exhaust it
for df, state in zip(df_list, keys):
    df['state'] = state
concatenated_df = pd.concat(df_list, sort=True, ignore_index=True)
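Alternatively, you can let keys do the labeling by dropping ignore_index=True (which is what was suppressing the keys in the original attempt) and then promoting the key level to a column. A sketch of that route; the column name state and the assumption that glob yields the files in the same order as the keys list are mine:

import glob
import pandas as pd

all_files = glob.glob(path + "/*.csv")
dfs = (pd.read_csv(f, encoding="ISO-8859-1", index_col=None, header=0) for f in all_files)

# keys adds an outer index level per file; reset_index turns it into a column
concatenated_df = (
    pd.concat(dfs, keys=keys, sort=True)
    .reset_index(level=0)
    .rename(columns={'level_0': 'state'})
    .reset_index(drop=True)
)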
