Concatenate files into one DataFrame while adding an identifier for each file - Python

The first part of this question has been asked many times and the best answer I found was here: Import multiple csv files into pandas and concatenate into one DataFrame.
But what I essentially want to do is add another variable to each dataframe that holds the participant number, so that when the files are all concatenated I still have a participant identifier for every row.
The files are named like this:
So perhaps I could just add a column with the ucsd1, etc. to identify each participant?
Here's code that I've gotten to work for Excel files:
path = r"/Users/jamesades/desktop/Watch_data_1/Re__Personalized_MH_data_call"
all_files = glob.glob(path + "/*.xlsx")
li = []
for filename in all_files:
df = pd.read_excel(filename, index_col=None, header=0)
li.append(df)
frame = pd.concat(li, axis=0, ignore_index=True)

If I understand you correctly, it's simple:
import re  # <-------------- Add this line

path = r"/Users/jamesades/desktop/Watch_data_1/Re__Personalized_MH_data_call"
all_files = glob.glob(path + "/*.xlsx")

li = []
for filename in all_files:
    df = pd.read_excel(filename, index_col=None, header=0)
    participant_number = int(re.search(r'(\d+)', filename).group(1))  # <-------------- Add this line
    df['participant_number'] = participant_number  # <-------------- Add this line
    li.append(df)

frame = pd.concat(li, axis=0, ignore_index=True)
That way, each dataframe loaded from an Excel file will have a column called participant_number, and the value of that column in each row will be the number found in the name of the file that the dataframe was loaded from.
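If the identifier in the file name is not purely numeric (the question mentions names like ucsd1), a hedged variation is to keep the whole file stem instead of extracting digits; the participant column name below is just for illustration:
from pathlib import Path

li = []
for filename in all_files:
    df = pd.read_excel(filename, index_col=None, header=0)
    df['participant'] = Path(filename).stem  # e.g. ".../ucsd1.xlsx" -> "ucsd1"
    li.append(df)

frame = pd.concat(li, axis=0, ignore_index=True)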

Related

Combining CSV files into Dataframe python

I am trying to add data from several files in a folder to a data frame. Each .csv file has a varying length but the same number of columns. I am trying to add all of them to one data frame, ignoring the index, so that the new data frame is just the files stacked vertically. For some reason, every time I try to concatenate the data I am left with ~363 columns when there should only be 9. Each csv file has the same number of columns, so I am confused.
import os
import glob
import pandas as pd

cwd = os.getcwd()
folder = cwd + '\\downloads\\prepared_csv_files\\prepared_csv_files\\'
all_files = glob.glob(folder + "/*.csv")

li = []
for filename in all_files:
    df = pd.read_csv(filename, index_col=None, header=0)
    li.append(df)

frame = pd.concat(li, axis=0, ignore_index=True)
I have also tried
final_df = pd.DataFrame(li, columns=['tool', 'pressure'])
# where I name all the columns (not doing it here)
Here, final is the name of the final dataset. I am assuming tool and pressure are the column names in all your .csv files. Using join="inner" keeps only the columns that appear in every file, which is what prevents the column count from blowing up to ~363.
final = pd.DataFrame(columns=['tool', 'pressure'])
for filename in all_files:
    df = pd.read_csv(filename)
    final = pd.concat([final, df], ignore_index=True, join="inner")
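Growing final inside the loop copies the accumulated data on every iteration; a sketch of an equivalent approach (assuming the same all_files list) that collects the frames first and concatenates once:
li = []
for filename in all_files:
    li.append(pd.read_csv(filename))

# join="inner" keeps only the columns common to every file
final = pd.concat(li, ignore_index=True, join="inner")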

Read single column from csv file and rename with the name of the text file

I'm using a for loop to cycle through numerous text files, select a single column from each (named ppm), and append these columns to a new data frame. I'd like the columns in the new data frame to have the name of the text file, but I'm not sure how to do this.
My code is:
all_files = glob.glob(os.path.join(path, "*.txt"))

df1 = pd.DataFrame()
for file in all_files:
    file_name = os.path.basename(file)
    df = pd.read_csv(file, index_col=None, sep=r'\s+', header=0, usecols=['ppm'])
    df1 = pd.concat([df, df1], axis=1)
At the moment every column in the new dataframe is called 'ppm'.
I used to have this code:
df1 = pd.DataFrame()
for file in all_files:
    file_name = os.path.basename(file)
    df = pd.read_csv(file, index_col=None, sep=r'\s+', header=0)
    df1[file_name] = df['ppm']
But when I tried to run it for a large number of files (hundreds), the line df1[file_name] = df['ppm'] raised the warning "PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling frame.insert many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use newframe = frame.copy()".
Assuming the index is equal across files, add all your data into a dictionary:
all_files = glob.glob(os.path.join(path, "*.txt"))

data_dict = {}
for file in all_files:
    file_name = os.path.basename(file)
    df = pd.read_csv(file, index_col=None, sep=r'\s+', header=0, usecols=['ppm'])
    data_dict[file_name] = df['ppm']

df1 = pd.DataFrame(data_dict)
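The "assuming index is equal" part matters: building a DataFrame from a dict of Series aligns on the index, so files whose indexes differ would leave NaN gaps. A hedged tweak if you only want the columns stacked by position:
data_dict = {}
for file in all_files:
    file_name = os.path.basename(file)
    df = pd.read_csv(file, index_col=None, sep=r'\s+', header=0, usecols=['ppm'])
    # drop the original index so the columns line up by position, not by index label
    data_dict[file_name] = df['ppm'].reset_index(drop=True)

df1 = pd.DataFrame(data_dict)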
Use concat outside the loop: append the DataFrames to a list, renaming the ppm column to the file name as you go:
all_files = glob.glob(os.path.join(path, "*.txt"))

dfs = []
for file in all_files:
    file_name = os.path.basename(file)
    df = pd.read_csv(file, index_col=None, sep=r'\s+', header=0, usecols=['ppm'])
    dfs.append(df.rename(columns={'ppm': file_name}))

df_big = pd.concat(dfs, axis=1)
Use df.rename() to rename the column of each dataframe before concatenating:
import pandas

df1 = pandas.DataFrame()
for file in all_files:
    file_name = os.path.basename(file)
    print(file_name)
    df = pandas.read_csv(file, index_col=None, sep=',', header=0, usecols=['ppm'])
    df.rename(columns={'ppm': file_name}, inplace=True)
    df1 = pandas.concat([df, df1], axis=1)
Output:
   two.txt  one.txt
0        9        3
1        0        6
Rather than concatenating and appending dataframes as you iterate over your list of files, you could consider building a dictionary of the relevant data and then constructing your dataframe just once. Like this:
import csv
import glob
import os
import pandas as pd

PATH = ''
COL = 'ppm'
FILENAME = 'filename'

D = {COL: [], FILENAME: []}
for file in glob.glob(os.path.join(PATH, '*.csv')):
    with open(file, newline='') as infile:
        for row in csv.DictReader(infile):
            if COL in row:
                D[COL].append(row[COL])
                D[FILENAME].append(file)

df = pd.DataFrame(D)
print(df)

Read multiple CSV files then rename files based on the filenames

Currently the code below reads all the csv files in the path, then saves them in a list.
I want to save each dataframe with the name of the filename e.g. echo.csv
path = r'M:\Work\Experimental_datasets\device_ID\IoT_device_captures\packet_header_features'  # use your path
all_files = glob.glob(os.path.join(path, "*.csv"))

li = []
for filename in all_files:
    df = pd.read_csv(filename, skiprows=15, sep='[|]',
                     skipfooter=2, engine='python', header=None,
                     names=["sum_frame_len", "avg_frame_len", "max_frame_len", "sum_ip_len"],
                     usecols=[2, 3, 4, 5])
    li.append(df)
The output I get is a list of dataframes, but I want each of these dataframes to carry the name of its file, e.g. echo.
How do I access each dataframe from the dictionary?
As you mentioned, a dictionary would be useful for this task. For example:
import os

all_files = glob.glob(os.path.join(path, "*.csv"))

df_dict = {}
for filename in all_files:
    df = pd.read_csv(filename, skiprows=15, sep='[|]',
                     skipfooter=2, engine='python', header=None,
                     names=["sum_frame_len", "avg_frame_len", "max_frame_len", "sum_ip_len"],
                     usecols=[2, 3, 4, 5])
    name = os.path.basename(filename).split('.')[0]
    df_dict[name] = df
What you will be left with is the dictionary df_dict where the keys correspond to the name of the file and the value corresponds to the data within a given file.
You can view all the keys in the dictionary with df_dict.keys() and select a given DataFrame with df_dict[key].
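For example, assuming one of the capture files was named echo.csv (the name used in the question), access looks like this:
print(df_dict.keys())      # dict_keys(['echo', ...])
echo_df = df_dict['echo']  # the DataFrame loaded from echo.csv
print(echo_df.head())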

Import multiple excel files into pandas and create a column based on name of file

I have multiple Excel files in one folder which I want to read and concat together, but while concatenating I want to add a column based on the name of each file:
'D:\\156667_Report.xls',
'D:\\192059_Report.xls',
'D:\\254787_Report.xls',
'D:\\263421_Report.xls',
'D:\\273554_Report.xls',
'D:\\280163_Report.xls',
'D:\\307928_Report.xls'
I can read these files in pandas with the following script:
import glob
import pandas as pd

path = 'D:\\'  # use your path (a raw string cannot end in a single backslash)
allFiles = glob.glob(path + "/*.xls")

frame = pd.DataFrame()
list_ = []
for file_ in allFiles:
    df = pd.read_excel(file_, index_col=None, header=0)
    list_.append(df)

frame = pd.concat(list_)
I want to add a column called Code to each file I read; Code will be the number from the filename, e.g. 156667, 192059.
Why not just match the number with a regex inside your loop?
import re

num = re.search(r'(\d+)_Report', file_).group(1)  # e.g. "156667" from 156667_Report.xls
df['Code'] = num
One way you could do this is by using join and isdigit inside a generator expression.
isdigit keeps only the digit characters from the file name, and join glues them back together into one string.
To be clear, you could change your for loop to this:
for file_ in allFiles:
    df = pd.read_excel(file_, index_col=None, header=0)
    df['Code'] = ''.join(str(i) for i in file_ if i.isdigit())
    list_.append(df)
which will add a column called Code to each df.
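One caveat: this scans the entire path, so it only works cleanly when the directory part contains no digits (as with D:\ here). A hedged variant that looks only at the file name itself:
import os

for file_ in allFiles:
    df = pd.read_excel(file_, index_col=None, header=0)
    base = os.path.basename(file_)                        # e.g. "156667_Report.xls"
    df['Code'] = ''.join(c for c in base if c.isdigit())  # -> "156667"
    list_.append(df)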

Adding file name in a column while merging multiple csv files to pandas - Python

I have multiple csv files in the same folder, all with the same data columns:
20100104 080100;5369;5378.5;5365;5378;2368
20100104 080200;5378;5385;5377;5384.5;652
20100104 080300;5384.5;5391.5;5383;5390;457
20100104 080400;5390.5;5391;5387;5389.5;392
I want to merge the csv files into pandas and add a column with the file name to each line, so I can track where it came from later. There seem to be similar threads, but I haven't been able to adapt any of the solutions. This is what I have so far; merging the data into one data frame works, but I'm stuck on adding the file name column:
import os
import glob
import pandas as pd

path = r'/filepath/'
all_files = glob.glob(os.path.join(path, "*.csv"))
names = [os.path.basename(x) for x in all_files]

list_ = []
for file_ in all_files:
    list_.append(pd.read_csv(file_, sep=';', parse_dates=[0],
                             infer_datetime_format=True, header=None))

df = pd.concat(list_)
Instead of using a list, just use DataFrame's append.
df = pd.DataFrame()
for file_ in all_files:
    file_df = pd.read_csv(file_, sep=';', parse_dates=[0], infer_datetime_format=True, header=None)
    file_df['file_name'] = file_
    df = df.append(file_df)
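Note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so on current pandas an equivalent sketch collects the frames and concatenates once:
frames = []
for file_ in all_files:
    file_df = pd.read_csv(file_, sep=';', parse_dates=[0], header=None)
    file_df['file_name'] = file_  # or os.path.basename(file_) for just the file name
    frames.append(file_df)

df = pd.concat(frames, ignore_index=True)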
