Read multiple CSVs and name each column after its CSV file - python

To sum up, I need each column to have the name of the CSV file it came from.
This is what I have done so far:
path = r'C:\Users\dfgdfsgsfg\Untitled Folder\tickers' # use your path
all_files = glob.glob(os.path.join(path, "*.csv")) # advisable to use os.path.join as this makes concatenation OS independent
df_from_each_file = (pd.read_csv(f , parse_dates=True, index_col="date") for f in all_files)
concat = pd.concat(df_from_each_file, axis=1)
df = concat['PriceUSD']
df.columns = [ ??????? ] #what do I put in here?
This is what I get when I don't name the columns:

I also tried this, but it did not quite give the results I wanted:
all_files = glob.glob(os.path.join(path, "*.csv")) # advisable to use os.path.join as this makes concatenation OS independent
df_from_each_file = (pd.read_csv(f , parse_dates=True, index_col="date").assign(filename = f) for f in all_files)
concat = pd.concat(df_from_each_file, axis=1)
df = concat['PriceUSD']
df.columns = all_files[:-2]
df
RESULT

If you are really interested in only a single column from all those CSV files, then trim each file down to just that column as you parse it:
def getPriceUSD(filename):
    """Read a CSV file and return a DataFrame holding only the 'PriceUSD'
    column, with the filename as the column title."""
    data = pd.read_csv(filename, parse_dates=True, index_col="date")
    # double brackets keep a one-column DataFrame, so .columns can be renamed
    data = data[["PriceUSD"]]
    data.columns = [filename]
    return data
Then concat all the already parsed and formatted columns together:
df = pd.concat(map(getPriceUSD, all_files), axis=1)
And before you ask, if you don't want the full path then use os.path.basename(filename) for the column instead of just filename
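A self-contained sketch of the same idea, assuming each file has a date column and a PriceUSD column as in the question. The file names btc.csv and eth.csv, the sample prices, and the helper name read_price_column are made up for illustration.

```python
import glob
import os
import tempfile

import pandas as pd


def read_price_column(filename, column="PriceUSD"):
    """Return one column of a CSV as a Series named after the file (hypothetical helper)."""
    data = pd.read_csv(filename, parse_dates=True, index_col="date",
                       usecols=["date", column])
    # "/some/dir/btc.csv" -> "btc": drop the directory and the extension
    name = os.path.splitext(os.path.basename(filename))[0]
    return data[column].rename(name)


# Build two tiny sample files so the example runs anywhere (made-up data).
tmp_dir = tempfile.mkdtemp()
samples = {
    "btc.csv": "date,PriceUSD,Volume\n2021-01-01,100.0,5\n2021-01-02,110.0,6\n",
    "eth.csv": "date,PriceUSD,Volume\n2021-01-01,10.0,7\n2021-01-02,11.0,8\n",
}
for file_name, text in samples.items():
    with open(os.path.join(tmp_dir, file_name), "w") as handle:
        handle.write(text)

# sorted() makes the column order predictable; glob itself gives no ordering guarantee.
all_files = sorted(glob.glob(os.path.join(tmp_dir, "*.csv")))
prices = pd.concat(map(read_price_column, all_files), axis=1)
print(prices)
```

Renaming the Series before concatenating means the column labels are already correct when pd.concat runs, so nothing has to be assigned to df.columns afterwards.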

Related

Combining CSV files into Dataframe python

I am trying to add data from several files in a folder to a data frame. The .csv files vary in length but all have the same number of columns. I want to combine them vertically into one data frame, ignoring the index. For some reason, every time I concatenate the data I am left with about 363 columns when there should only be 9, which confuses me because each file has the same number of columns.
import os
import pandas as pd
import glob
cwd = os.getcwd()
folder = cwd +'\\downloads\\prepared_csv_files\\prepared_csv_files\\'
all_files = glob.glob(folder + "/*.csv")
li = []
for filename in all_files:
    df = pd.read_csv(filename, index_col=None, header=0)
    li.append(df)
frame = pd.concat(li, axis=0, ignore_index=True)
I have also tried
final_df = pd.DataFrame(li, columns = ['tool','pressure'])
# and I name all columns not doing it now
Here final is the name of the final dataset.
I am assuming tool and pressure are the column names in all your .csv files.
final = pd.DataFrame(columns=['tool', 'pressure'])
for filename in all_files:
    df = pd.read_csv(filename)  # read_csv already returns a DataFrame
    # join="inner" keeps only the columns that both frames share
    final = pd.concat([final, df], ignore_index=True, join="inner")
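One way to find out where extra columns come from is to compare each file's header with the first file's before concatenating. pd.concat with axis=0 aligns frames on column names, so a header that differs by even one character adds a new column to the result. The sketch below uses two made-up in-memory frames in place of the CSV files; whether mismatched headers are the actual cause in the question is an assumption.

```python
import pandas as pd

# Two made-up frames standing in for CSV files; the second header has a
# trailing space in "pressure ", which is easy to miss by eye.
first = pd.DataFrame({"tool": ["a", "b"], "pressure": [1.0, 2.0]})
second = pd.DataFrame({"tool": ["c"], "pressure ": [3.0]})
frames = {"first.csv": first, "second.csv": second}

# Default outer join: the union of all column names is kept.
outer = pd.concat(list(frames.values()), axis=0, ignore_index=True)
print(list(outer.columns))

# Report the columns in each frame that the first frame does not have.
expected = set(first.columns)
unexpected = {
    name: sorted(set(frame.columns) - expected)
    for name, frame in frames.items()
    if set(frame.columns) - expected
}
print(unexpected)

# Normalising the headers before concatenating removes the extra column.
cleaned = [frame.rename(columns=str.strip) for frame in frames.values()]
stacked = pd.concat(cleaned, axis=0, ignore_index=True)
print(stacked.shape)
```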

Read single column from csv file and rename with the name of the text file

I'm using a for loop to cycle through numerous text files, select a single column (named ppm) from each one, and append these columns to a new data frame. I'd like the columns in the new data frame to have the name of the text file, but I'm not sure how to do this.
My code is:
all_files = glob.glob(os.path.join(path, "*.txt"))
df1 = pd.DataFrame()
for file in all_files:
    file_name = os.path.basename(file)
    df = pd.read_csv(file, index_col=None, sep=r'\s+', header=0, usecols=['ppm'])
    df1 = pd.concat([df, df1], axis=1)
At the moment every column in the new dataframe is called 'ppm'.
I used to have this code
df1 = pd.DataFrame()
for file in all_files:
    file_name = os.path.basename(file)
    df = pd.read_csv(file, index_col=None, sep=r'\s+', header=0)
    df1[file_name] = df['ppm']
But when I ran the code on a large number of files (hundreds), I got this warning: 'PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling frame.insert many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use newframe = frame.copy() df1[file_name] = df['ppm'].copy()'.
Assuming the index is the same in every file, collect all your data in a dictionary:
all_files = glob.glob(os.path.join(path, "*.txt"))
data_dict = {}
for file in all_files:
    file_name = os.path.basename(file)
    df = pd.read_csv(file, index_col=None, sep=r'\s+', header=0, usecols=['ppm'])
    data_dict[file_name] = df['ppm']
df1 = pd.DataFrame(data_dict)
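As a variation, pd.concat itself accepts a dictionary and uses its keys as the column labels when joining Series along axis=1. It aligns on the index, so files of different length still combine. A minimal sketch with made-up in-memory Series in place of the files:

```python
import pandas as pd

# Made-up stand-ins for the 'ppm' column of two files of different length.
columns_by_file = {
    "one.txt": pd.Series([3, 6, 1], name="ppm"),
    "two.txt": pd.Series([9, 0], name="ppm"),
}

# The dictionary keys become the column labels; the shorter column is
# padded with NaN where it has no row.
df1 = pd.concat(columns_by_file, axis=1)
print(df1)
```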
Append each DataFrame to a list, renaming the ppm column as you go, and call concat once outside the loop:
all_files = glob.glob(os.path.join(path, "*.txt"))
dfs = []
for file in all_files:
    file_name = os.path.basename(file)
    df = pd.read_csv(file, index_col=None, sep=r'\s+', header=0, usecols=['ppm'])
    dfs.append(df.rename(columns={'ppm': file_name}))
df_big = pd.concat(dfs, axis=1)
Use df.rename() to rename the column of the dataframe.
import pandas
df1 = pandas.DataFrame()  # start empty; each file's column is prepended to it
for file in all_files:
    file_name = os.path.basename(file)
    print(file_name)
    df = pandas.read_csv(file, index_col=None, sep=',', header=0, usecols=['ppm'])
    df.rename(columns={'ppm': file_name}, inplace=True)
    df1 = pandas.concat([df, df1], axis=1)
Output:
   two.txt  one.txt
0        9        3
1        0        6
Rather than concatenating and appending dataframes as you iterate over your list of files, you could consider building a dictionary of the relevant data then construct your dataframe just once. Like this:
import csv
import pandas as pd
import glob
import os
PATH = ''
COL = 'ppm'
FILENAME = 'filename'
D = {COL: [], FILENAME: []}
for file in glob.glob(os.path.join(PATH, '*.csv')):
    with open(file, newline='') as infile:
        for row in csv.DictReader(infile):
            if COL in row:
                D[COL].append(row[COL])
                D[FILENAME].append(file)
df = pd.DataFrame(D)
print(df)

Concatenate files into one Dataframe while adding identifier for each file

The first part of this question has been asked many times and the best answer I found was here: Import multiple csv files into pandas and concatenate into one DataFrame.
But what I essentially want to do is be able to add another variable to each dataframe that has participant number, such that when the files are all concatenated, I will be able to have participant identifiers.
The files are named like this:
So perhaps I could just add a column with the ucsd1, etc. to identify each participant?
Here's code that I've gotten to work for Excel files:
path = r"/Users/jamesades/desktop/Watch_data_1/Re__Personalized_MH_data_call"
all_files = glob.glob(path + "/*.xlsx")
li = []
for filename in all_files:
    df = pd.read_excel(filename, index_col=None, header=0)
    li.append(df)
frame = pd.concat(li, axis=0, ignore_index=True)
If I understand you correctly, it's simple:
import os  # <-------------- Add this line
import re  # <-------------- Add this line
path = r"/Users/jamesades/desktop/Watch_data_1/Re__Personalized_MH_data_call"
all_files = glob.glob(path + "/*.xlsx")
li = []
for filename in all_files:
    df = pd.read_excel(filename, index_col=None, header=0)
    # search only the file name: the folder "Watch_data_1" also contains a digit
    participant_number = int(re.search(r'(\d+)', os.path.basename(filename)).group(1))  # <-------------- Add this line
    df['participant_number'] = participant_number  # <-------------- Add this line
    li.append(df)
frame = pd.concat(li, axis=0, ignore_index=True)
That way, each dataframe loaded from an Excel file will have a column called participant_number, and the value of that column in every row of each dataframe will be the number found in the name of the file that the dataframe was loaded from.
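One thing to watch with this pattern: re.search returns the first run of digits it finds, and the folder name Watch_data_1 itself contains a digit. Searching os.path.basename(filename) instead of the full path avoids picking that up. A small sketch with plain strings; the two file names are hypothetical (the question only mentions ucsd1), while the folder is the one from the question.

```python
import os
import re

folder = "/Users/jamesades/desktop/Watch_data_1/Re__Personalized_MH_data_call"
# Hypothetical file names following the "ucsd1" pattern from the question.
paths = [folder + "/ucsd1.xlsx", folder + "/ucsd12.xlsx"]

# Searching the full path stops at the "1" in "Watch_data_1" every time.
from_full_path = [int(re.search(r"(\d+)", p).group(1)) for p in paths]

# Searching only the file name finds the participant number.
from_base_name = [
    int(re.search(r"(\d+)", os.path.basename(p)).group(1)) for p in paths
]

print(from_full_path)  # [1, 1]
print(from_base_name)  # [1, 12]
```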

Merge all the csv file in one folder and add a new column to pandas dataframe with partial file name in Python

I have a folder containing over 100 CSV files. They all have the same prefix in their names, e.g.:
School.Math001.csv
School.Math002.csv
School.Physics001.csv
etc. They all contain the same number of columns.
How can I merge all the CSV files into one data frame in Python and add a new column with the file names, with the prefix "School." removed?
I found some example code online, but it did not solve my problem:
path = r'C:\\Users\\me\\data\\'
all_files = glob.glob(os.path.join(path, "*"))
df_from_each_file = (pd.read_csv(f, sep='\t') for f in all_files)
concatdf = pd.concat(df_from_each_file, ignore_index=True)
Try this (I haven't tested it):
import os
import pandas as pd
path = '<folder path to CSVs>'
dfs = []
for filename in os.listdir(path):
    # os.listdir returns bare file names, so join them with the folder
    sample_df = pd.read_csv(os.path.join(path, filename))
    sample_df['filename'] = filename[7:]  # drop the 7-character prefix "School."
    dfs.append(sample_df)
df = pd.concat(dfs, axis=0, ignore_index=True)
Use DataFrame.assign inside the generator expression to add the new column:
path = r'C:\\Users\\me\\data\\'
all_files = glob.glob(os.path.join(path, "*"))
df_from_each_file = (pd.read_csv(f, sep='\t').assign(New=os.path.basename(f)[7:].split('.')[0]) for f in all_files)
concatdf = pd.concat(df_from_each_file, ignore_index=True)
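Slicing with [7:] depends on the prefix being exactly seven characters long. A slightly more defensive sketch strips a named prefix and the extension instead. The helper name label_from_filename is made up for illustration, and the case-insensitive match is an assumption about how consistently the files are named.

```python
import os


def label_from_filename(path, prefix="School."):
    """Return the file name without its directory, prefix, and extension (hypothetical helper)."""
    stem = os.path.splitext(os.path.basename(path))[0]
    # Compare case-insensitively so "school.Math001.csv" is handled too.
    if stem.lower().startswith(prefix.lower()):
        stem = stem[len(prefix):]
    return stem


print(label_from_filename("data/School.Math001.csv"))     # Math001
print(label_from_filename("data/school.Physics001.csv"))  # Physics001
print(label_from_filename("data/notes.csv"))              # notes
```

It can be used inside assign in the same way, for example .assign(New=label_from_filename(f)).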

Read multiple CSV files then rename files based on the filenames

Currently, the code below reads all the CSV files in the path and saves them in a list.
I want to save each dataframe under the name of its file, e.g. echo.csv
path = r'M:\Work\Experimental_datasets\device_ID\IoT_device_captures\packet_header_features' # use your path
all_files = glob.glob(os.path.join(path, "*.csv"))
li = []
for filename in all_files:
    df = pd.read_csv(filename, skiprows=15, sep='[|]',
                     skipfooter=2, engine='python', header=None,
                     names=["sum_frame_len","avg_frame_len","max_frame_len","sum_ip_len"],
                     usecols=[2,3,4,5]
                     )
    li.append(df)
The output I get is a list of dataframes, but I want each of these dataframes to be stored under the name of its file, e.g. echo.
How do I access each dataframe from the dictionary?
As you mentioned, a dictionary would be useful for this task. For example:
import glob
import os
import pandas as pd
all_files = glob.glob(os.path.join(path, "*.csv"))
df_dict = {}
for filename in all_files:
    df = pd.read_csv(filename, skiprows=15, sep='[|]',
                     skipfooter=2, engine='python', header=None,
                     names=["sum_frame_len","avg_frame_len","max_frame_len","sum_ip_len"],
                     usecols=[2,3,4,5]
                     )
    # "echo.csv" -> "echo"
    name = os.path.basename(filename).split('.')[0]
    df_dict[name] = df
What you will be left with is the dictionary df_dict where the keys correspond to the name of the file and the value corresponds to the data within a given file.
You can view all the keys in the dictionary with df_dict.keys() and select a given DataFrame with df_dict[key].
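A runnable sketch of the dictionary approach, using two small made-up CSV files (echo.csv and hub.csv) in a temporary folder and a plain read_csv call in place of the question's parsing options:

```python
import glob
import os
import tempfile

import pandas as pd

# Create two tiny sample files so the example runs anywhere (made-up data).
tmp_dir = tempfile.mkdtemp()
samples = {"echo.csv": "a,b\n1,2\n", "hub.csv": "a,b\n3,4\n5,6\n"}
for file_name, rows in samples.items():
    with open(os.path.join(tmp_dir, file_name), "w") as handle:
        handle.write(rows)

# Key each DataFrame by its file name without the extension.
df_dict = {
    os.path.splitext(os.path.basename(f))[0]: pd.read_csv(f)
    for f in glob.glob(os.path.join(tmp_dir, "*.csv"))
}

print(sorted(df_dict.keys()))
print(df_dict["echo"])

# Looping over the dictionary gives the name and the data together.
for name, frame in df_dict.items():
    print(name, frame.shape)
```

os.path.splitext removes only the last extension, whereas split('.')[0] cuts at the first dot, so the two differ for names that contain more than one dot.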
