Currently the code below reads all the CSV files in the path and saves them in a list.
I want to save each dataframe under the name of its file, e.g. echo.csv
import glob
import os
import pandas as pd

path = r'M:\Work\Experimental_datasets\device_ID\IoT_device_captures\packet_header_features' # use your path
all_files = glob.glob(os.path.join(path, "*.csv"))
li = []
for filename in all_files:
    df = pd.read_csv(filename, skiprows=15, sep='[|]',
                     skipfooter=2, engine='python', header=None,
                     names=["sum_frame_len","avg_frame_len","max_frame_len","sum_ip_len"],
                     usecols=[2,3,4,5]
                     )
    li.append(df)
The output I get is a list of dataframes, but I want each dataframe to be stored under the name of its file, e.g. echo.
How do I access each dataframe from the dictionary?
As you mentioned, a dictionary would be useful for this task. For example:
import glob
import os
import pandas as pd
all_files = glob.glob(os.path.join(path, "*.csv"))
df_dict = {}
for filename in all_files:
    df = pd.read_csv(filename, skiprows=15, sep='[|]',
                     skipfooter=2, engine='python', header=None,
                     names=["sum_frame_len","avg_frame_len","max_frame_len","sum_ip_len"],
                     usecols=[2,3,4,5]
                     )
    # key each frame by the file name without its extension, e.g. "echo"
    name = os.path.basename(filename).split('.')[0]
    df_dict[name] = df
What you will be left with is the dictionary df_dict where the keys correspond to the name of the file and the value corresponds to the data within a given file.
You can view all the keys in the dictionary with df_dict.keys() and select a given DataFrame with df_dict[key].
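As a rough, self-contained sketch of that access pattern: the helper name load_frames and the two tiny demo files below are hypothetical, and plain pd.read_csv defaults stand in for the skiprows/sep options used above. It also uses os.path.splitext for the key, which is an assumption about what you want for file names containing more than one dot.

```python
import glob
import os
import tempfile

import pandas as pd


def load_frames(folder):
    """Return {file name without extension: DataFrame} for every CSV in folder."""
    frames = {}
    for filename in sorted(glob.glob(os.path.join(folder, "*.csv"))):
        # splitext removes only the final extension
        key = os.path.splitext(os.path.basename(filename))[0]
        frames[key] = pd.read_csv(filename)
    return frames


# Hypothetical demo data: two small CSV files in a temporary folder.
with tempfile.TemporaryDirectory() as folder:
    demo = {"echo": "a,b\n1,2\n", "hue": "a,b\n3,4\n5,6\n"}
    for stem, text in demo.items():
        with open(os.path.join(folder, stem + ".csv"), "w") as fh:
            fh.write(text)
    df_dict = load_frames(folder)

# Look up one frame by key, or walk over all of them.
echo_df = df_dict["echo"]
for name, frame in df_dict.items():
    print(name, frame.shape)
```

Iterating over df_dict.items() gives you the name and the frame together, which is usually more convenient than looping over df_dict.keys() and indexing again.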
The first part of this question has been asked many times and the best answer I found was here: Import multiple csv files into pandas and concatenate into one DataFrame.
But what I essentially want to do is be able to add another variable to each dataframe that has participant number, such that when the files are all concatenated, I will be able to have participant identifiers.
The files are named like this:
So perhaps I could just add a column with the ucsd1, etc. to identify each participant?
Here's code that I've gotten to work for Excel files:
import glob
import pandas as pd
path = r"/Users/jamesades/desktop/Watch_data_1/Re__Personalized_MH_data_call"
all_files = glob.glob(path + "/*.xlsx")
li = []
for filename in all_files:
    df = pd.read_excel(filename, index_col=None, header=0)
    li.append(df)
frame = pd.concat(li, axis=0, ignore_index=True)
If I understand you correctly, it's simple:
import os
import re # <-------------- Add this line
path = r"/Users/jamesades/desktop/Watch_data_1/Re__Personalized_MH_data_call"
all_files = glob.glob(path + "/*.xlsx")
li = []
for filename in all_files:
    df = pd.read_excel(filename, index_col=None, header=0)
    # search only the file name: the folder path itself contains a digit (Watch_data_1)
    participant_number = int(re.search(r'(\d+)', os.path.basename(filename)).group(1)) # <-------------- Add this line
    df['participant_number'] = participant_number # <-------------- Add this line
    li.append(df)
frame = pd.concat(li, axis=0, ignore_index=True)
That way, each dataframe loaded from an Excel file will have a column called participant_number, and the value of that column in every row will be the number found in the name of the file that the dataframe was loaded from.
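One thing to watch when pulling a number out of a path: a search over the whole path picks up the first digit anywhere in it, including digits in folder names. The sketch below shows the difference; the function name participant_number and the example path are hypothetical, and the "ucsd" file naming is assumed from the question.

```python
import os
import re


def participant_number(filename):
    """Return the first run of digits in the file name, ignoring the folders."""
    match = re.search(r"(\d+)", os.path.basename(filename))
    if match is None:
        raise ValueError(f"no participant number in {filename!r}")
    return int(match.group(1))


# Hypothetical path: the folder name holds one digit, the file name another.
example = "/data/Watch_data_1/ucsd7.xlsx"

from_full_path = int(re.search(r"(\d+)", example).group(1))
from_file_name = participant_number(example)

print(from_full_path)   # 1, taken from the folder name
print(from_file_name)   # 7, taken from the file name
```

Raising on a missing number is a design choice here; returning None and filtering afterwards would work just as well if some files are expected to have no identifier.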
I have a folder containing over 100 CSV files. They all have the same prefix in their names.
e.g.:
School.Math001.csv
School.Math002.csv
School.Physics001.csv
etc. They all contain the same number of columns.
How can I merge all the CSV files into one data frame in Python and add a new column with the file names, with the prefix "School." removed?
I found some example code online, but it did not solve my problem:
path = r'C:\\Users\\me\\data\\'
all_files = glob.glob(os.path.join(path, "*"))
df_from_each_file = (pd.read_csv(f, sep='\t') for f in all_files)
concatdf = pd.concat(df_from_each_file, ignore_index=True)
Try this (I haven't tested it):
import os
import pandas as pd
path = '<folder path to CSVs>'
dfs = []
for filename in os.listdir(path):
    # os.listdir returns bare names, so join them with the folder before reading
    sample_df = pd.read_csv(os.path.join(path, filename))
    sample_df['filename'] = filename[7:]  # drop the 7-character "School." prefix
    dfs.append(sample_df)
df = pd.concat(dfs, axis=0, ignore_index=True)
Use DataFrame.assign inside the generator expression to add the new column:
path = r'C:\\Users\\me\\data\\'
all_files = glob.glob(os.path.join(path, "*"))
# slice the base name, not the full path, to drop the 7-character "School." prefix
df_from_each_file = (pd.read_csv(f, sep='\t').assign(New=os.path.basename(f)[7:].split('.')[0]) for f in all_files)
concatdf = pd.concat(df_from_each_file, ignore_index=True)
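Slicing a fixed seven characters works only while every file really starts with "School.". A slightly more defensive sketch is shown below; the helper name label_from_filename is hypothetical, and it assumes that the label you want is whatever remains after removing the prefix and the final extension.

```python
import os


def label_from_filename(filename, prefix="School."):
    """Strip the folder, an optional prefix and the final extension."""
    base = os.path.basename(filename)
    if base.startswith(prefix):
        base = base[len(prefix):]
    return os.path.splitext(base)[0]


labels = [
    label_from_filename(name)
    for name in ["data/School.Math001.csv", "data/School.Physics001.csv", "data/Other.csv"]
]
print(labels)
```

The result of this helper can be passed straight to assign, for example .assign(New=label_from_filename(f)).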
To sum up, I need each column to be named after the CSV file it came from.
This is what I have done so far:
path = r'C:\Users\dfgdfsgsfg\Untitled Folder\tickers' # use your path
all_files = glob.glob(os.path.join(path, "*.csv")) # advisable to use os.path.join as this makes concatenation OS independent
df_from_each_file = (pd.read_csv(f , parse_dates=True, index_col="date") for f in all_files)
concat = pd.concat(df_from_each_file, axis=1)
df = concat['PriceUSD']
df.columns = [ ??????? ] #what do I put in here?
This is what I get when I don't name the columns:
I also tried this, but did not quite get the results I wanted:
all_files = glob.glob(os.path.join(path, "*.csv")) # advisable to use os.path.join as this makes concatenation OS independent
df_from_each_file = (pd.read_csv(f , parse_dates=True, index_col="date").assign(filename = f) for f in all_files)
concat = pd.concat(df_from_each_file, axis=1)
df = concat['PriceUSD']
df.columns = all_files[:-2]
df
RESULT
If you are really interested in only the single column from all those CSV files, then while parsing the csv just trim it to just the column you want:
def getPriceUSD(filename):
    """reads csv file then returns dataframe with just the column 'PriceUSD'
    with the filename as the column title"""
    data = pd.read_csv(filename, parse_dates=True, index_col="date")
    data = data[["PriceUSD"]]  # double brackets keep a DataFrame, so .columns can be set
    data.columns = [filename]
    return data
Then concat all the already parsed and formatted columns together:
df = pd.concat(map(getPriceUSD, all_files), axis=1)
And before you ask: if you don't want the full path, use os.path.basename(filename) for the column name instead of filename.
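Here is a small sketch of the same approach that runs without any files on disk. The in-memory CSV text, the ticker names and the helper price_column are all hypothetical stand-ins; it selects the column as a Series and names it with Series.rename, which is one of several ways to get the label onto the column.

```python
import io
import os

import pandas as pd

# Hypothetical stand-ins for CSV files on disk: path -> file contents.
sources = {
    "tickers/btc.csv": "date,PriceUSD\n2020-01-01,7200\n2020-01-02,6985\n",
    "tickers/eth.csv": "date,PriceUSD\n2020-01-01,130\n2020-01-02,127\n",
}


def price_column(name, text):
    """Return the PriceUSD column as a Series named after the file."""
    data = pd.read_csv(io.StringIO(text), parse_dates=True, index_col="date")
    label = os.path.splitext(os.path.basename(name))[0]
    return data["PriceUSD"].rename(label)


# Concatenating named Series along axis=1 uses the names as column labels.
prices = pd.concat([price_column(n, t) for n, t in sources.items()], axis=1)
print(prices)
```

With real files, pd.read_csv(name, ...) would replace the io.StringIO call and the dictionary would become the list returned by glob.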
I have multiple csv files in the same folder with all the same data columns,
20100104 080100;5369;5378.5;5365;5378;2368
20100104 080200;5378;5385;5377;5384.5;652
20100104 080300;5384.5;5391.5;5383;5390;457
20100104 080400;5390.5;5391;5387;5389.5;392
I want to merge the csv files into pandas and add a column with the file name to each line so I can track where it came from later. There seems to be similar threads but I haven't been able to adapt any of the solutions. This is what I have so far. The merge data into one data frame works but I'm stuck on the adding file name column,
import os
import glob
import pandas as pd
path = r'/filepath/'
all_files = glob.glob(os.path.join(path, "*.csv"))
names = [os.path.basename(x) for x in all_files]
list_ = []
for file_ in all_files:
    list_.append(pd.read_csv(file_,sep=';', parse_dates=[0], infer_datetime_format=True,header=None ))
df = pd.concat(list_)
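Add the file name as a column to each frame as you read it. On pandas older than 2.0 you could then grow the result with DataFrame.append instead of a list; that method was removed in 2.0, so collect the frames and concatenate once:
frames = []
for file_ in all_files:
    file_df = pd.read_csv(file_,sep=';', parse_dates=[0], infer_datetime_format=True,header=None )
    file_df['file_name'] = file_
    frames.append(file_df)
df = pd.concat(frames)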
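A compact way to tag every row with its source is DataFrame.assign inside a list comprehension, followed by a single pd.concat. The sketch below uses the four sample rows from the question, split across two hypothetical in-memory "files" (jan.csv and feb.csv are invented names), and leaves out date parsing to keep it short.

```python
import io

import pandas as pd

# Hypothetical file names mapped to contents in the semicolon-separated format.
files = {
    "jan.csv": "20100104 080100;5369;5378.5;5365;5378;2368\n"
               "20100104 080200;5378;5385;5377;5384.5;652\n",
    "feb.csv": "20100104 080300;5384.5;5391.5;5383;5390;457\n"
               "20100104 080400;5390.5;5391;5387;5389.5;392\n",
}

# Read each source, add the file name as a constant column, then concatenate once.
frames = [
    pd.read_csv(io.StringIO(text), sep=";", header=None).assign(file_name=name)
    for name, text in files.items()
]
df = pd.concat(frames, ignore_index=True)
print(df["file_name"].tolist())
```

Concatenating once at the end also avoids copying the growing frame on every iteration, which matters when there are many files.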
I'm trying to read a list of files into a list of Pandas DataFrames in Python. However, the code below doesn't work.
files = [file1, file2, file3]
df1 = pd.DataFrame()
df2 = pd.DataFrame()
df3 = pd.DataFrame()
dfs = [df1, df2, df3]
# Read in data files
for file,df in zip(files, dfs):
    if file_exists(file):
        with open(file, 'rb') as in_file:
            df = pd.read_csv(in_file, low_memory=False)
            print(df) #the file is getting read properly
print(df1) #empty
print(df2) #empty
print(df3) #empty
How do I get the original DataFrames to update if I pass them into a for-loop as a list of DataFrames?
Try this:
dfs = [pd.read_csv(f, low_memory=False) for f in files]
If you want to check whether each file exists:
import os
dfs = [pd.read_csv(f, low_memory=False) for f in files if os.path.isfile(f)]
And if you want to concatenate all of them into one data frame:
df = pd.concat([pd.read_csv(f, low_memory=False)
                for f in files if os.path.isfile(f)],
               ignore_index=True)
When you iterate over the list, the loop variable df is only a name bound to each element in turn. Assigning to it rebinds that name; it changes neither the list nor df1, df2 and df3.
You need to store the new DataFrames in the list itself, by index or by appending. One possibility:
files = [file1, file2, file3]
dfs = [None] * 3 # Just a placeholder
# Read in data files
for i, file in enumerate(files): # Enumeration instead of zip
    if file_exists(file):
        with open(file, 'rb') as in_file:
            dfs[i] = pd.read_csv(in_file, low_memory=False) # Setting the list element
            print(dfs[i]) #the file is getting read properly
This updates the list elements and should work.
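The difference between rebinding the loop variable and assigning through an index can be seen without pandas at all. In this minimal sketch, plain strings stand in for the DataFrames; the names after_rebinding and after_indexing are only there to capture the list at each stage.

```python
df1, df2 = "empty-1", "empty-2"   # stand-ins for the empty DataFrames
dfs = [df1, df2]

# Assigning to the loop variable rebinds that name only; the list is untouched.
for df in dfs:
    df = "loaded"
after_rebinding = list(dfs)

# Assigning through an index stores the new object in the list itself.
for i in range(len(dfs)):
    dfs[i] = "loaded"
after_indexing = list(dfs)

print(after_rebinding)  # ['empty-1', 'empty-2']
print(after_indexing)   # ['loaded', 'loaded']
print(df1)              # empty-1
```

Note that df1 keeps its original value in both cases: the list is updated, but the separate names df1 and df2 are never rebound, so the loaded data should be read back from dfs.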
Your code seems overcomplicated; you can just do:
files = [file1, file2, file3]
dfs = []
# Read in data files
for file in files:
    if file_exists(file):
        dfs.append(pd.read_csv(file, low_memory=False))
You will end up with a list of DataFrames, as desired.
You can try list comprehension:
files = [file1, file2, file3]
dfs = [pd.read_csv(x, low_memory=False) for x in files if file_exists(x)]
A custom Python function that handles both CSV and JSON files:
import os
def generate_list_of_dfs(incoming_files):
    """
    Accepts a list of csv and json file/path names.
    Returns a list of DataFrames.
    """
    outgoing_files = []
    for filename in incoming_files:
        # splitext takes the text after the last dot, so dotted names and paths still work
        file_extension = os.path.splitext(filename)[1].lstrip('.')
        if file_extension == 'json':
            with open(filename, mode='r') as incoming_file:
                outgoing_json = pd.DataFrame(json.load(incoming_file))
            outgoing_files.append(outgoing_json)
        if file_extension == 'csv':
            outgoing_csv = pd.read_csv(filename)
            outgoing_files.append(outgoing_csv)
    return outgoing_files
How to Call this Function
import pandas as pd
import json
files_to_be_read = ['filename1.json', 'filename2.csv', 'filename3.json', 'filename4.csv']
dataframes_list = generate_list_of_dfs(files_to_be_read)
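When dispatching on the extension, note that splitting on "." and taking element 1 only gives the extension if the name contains exactly one dot. A sketch of the difference, using a hypothetical helper extension_of and an invented file name:

```python
import os


def extension_of(filename):
    """Return the final extension, lower-cased and without the leading dot."""
    return os.path.splitext(filename)[1].lstrip(".").lower()


name = "reports/2022.01.sales.CSV"   # hypothetical file name with several dots

naive = name.split(".")[1]
robust = extension_of(name)

print(naive)    # 01
print(robust)   # csv
```

Lower-casing is an extra assumption here: it makes "data.CSV" and "data.csv" take the same branch, which may or may not be what you want.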
Here is a simple solution that avoids using a list to hold all the data frames, if you don't need them in a list.
import fnmatch
# get the CSV files only
files = fnmatch.filter(os.listdir('.'), '*.csv')
files
Output which is now a list of the names:
['Feedback Form Submissions 1.21-1.25.22.csv',
'Feedback Form Submissions 1.21.22.csv',
'Feedback Form Submissions 1.25-1.31.22.csv']
Now create a simple list of new names to make working with them easier:
# use a simple format
names = []
for i in range(0,len(files)):
    names.append('data' + str(i))
names
['data0', 'data1', 'data2']
You can use any list of names that you want. The next step takes the file names and the list of names and assigns each loaded file to its name.
# i is the incrementor for the list of names
i = 0
# iterate through the file names
for file in files:
    # make an empty dataframe
    df = pd.DataFrame()
    # load the current file in
    df = pd.read_csv(file, low_memory=False)
    # get the matching name from the list, this will be a string
    new_name = names[i]
    # assign the string to the variable and assign it to the dataframe
    # (writing to locals() only creates a variable at module level, not inside a function)
    locals()[new_name] = df.copy()
    # increment the list of names
    i = i + 1
You now have 3 separate dataframes named data0, data1 and data2, and can run commands like
data2.info()
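A caveat on the locals() trick: it creates variables only when the code runs at module level (for example in a notebook cell). Inside a function, writing to locals() does not create a real variable. The sketch below demonstrates that and shows a dictionary keyed by the same names as an alternative; the function name assign_via_locals and the file names are hypothetical, and file names stand in for the DataFrames so that the example runs without pandas.

```python
def assign_via_locals():
    # Inside a function, writing to the locals() mapping creates no variable.
    locals()["data0"] = "frame"
    try:
        return data0  # looked up as a global, which does not exist here
    except NameError:
        return None


files = ["a.csv", "b.csv", "c.csv"]   # hypothetical file names

# A dictionary gives the same name-based access and works in any scope.
# In real use the value would be pd.read_csv(name) rather than the name itself.
frames = {f"data{i}": name for i, name in enumerate(files)}

result = assign_via_locals()
print(result)            # None
print(frames["data2"])   # c.csv
```

With the dictionary, frames["data2"].info() plays the role of data2.info().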