Pandas Reading CSV With Common Path but Different Names - python

I am trying to write a faster way to read in a group of CSV files. The files share a common partial path that leads to a group of subfolders, each named by some identifier; the file name starts with that identifier and ends with a common phrase.
For example, let's say I have group names A, B, C. The file paths would be:
C:\Users\Name\Documents\A\A-beginninggroup.csv
C:\Users\Name\Documents\A\A-middlegroup.csv
C:\Users\Name\Documents\A\A-endinggroup.csv
C:\Users\Name\Documents\B\B-beginninggroup.csv
C:\Users\Name\Documents\B\B-middlegroup.csv
C:\Users\Name\Documents\B\B-endinggroup.csv
C:\Users\Name\Documents\C\C-beginninggroup.csv
C:\Users\Name\Documents\C\C-middlegroup.csv
C:\Users\Name\Documents\C\C-endinggroup.csv
I am trying to write code where I can just change the name of the subgroup without having to change it in each read_csv line. The following code shows the logic, but I am not sure how to make it work, or if it is possible.
intro = 'C:\\Users\\Name\\Documents\\'   # backslashes must be escaped in a normal string
subgroup = 'C'
ending1 = '-endinggroup.csv'
ending2 = '-middlegroup.csv'
ending3 = '-beginninggroup.csv'
filename_1 = intro + subgroup + '\\' + subgroup + ending1
filename_2 = intro + subgroup + '\\' + subgroup + ending2
filename_3 = intro + subgroup + '\\' + subgroup + ending3
file1 = pd.read_csv(filename_1)
file2 = pd.read_csv(filename_2)
file3 = pd.read_csv(filename_3)

I am not sure exactly what you are after, but you can use an f-string in this case.
You first define your variable (names in your case):
location = r'somewhere\anywhere'  # raw string so the backslash is not treated as an escape
group = 'A'
csv = 'A-beginninggroup.csv'
Now you combine these variables in an F-string:
file_location = f"{location}\\{group}\\{csv}"  # double the backslash inside the f-string
And pass the file_location to your pandas csv reader. You can freely change the group variable and the csv variable.
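Putting it together, a minimal sketch (the path is the example one from the question; the read_csv call is left commented out since the file only exists on the asker's machine):

```python
import pandas as pd

# example values taken from the question
location = r'C:\Users\Name\Documents'
group = 'A'
csv = 'A-beginninggroup.csv'

# doubled backslash so it is a literal separator inside the f-string
file_location = f"{location}\\{group}\\{csv}"
# df = pd.read_csv(file_location)  # uncomment once the path points at a real file
```

Changing `group` (and `csv`) is then the only edit needed to read a different subfolder.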

I can just change the name of the subgroup without having to change it in each read_csv line.
You can define a function to handle the logic of joining the path:
import os
import pandas as pd

intro = 'C:\\Users\\Name\\Documents'
ending1 = '-endinggroup.csv'
ending2 = '-middlegroup.csv'
ending3 = '-beginninggroup.csv'

def read_file(subgroup, ending):
    csv_path = os.path.join(intro, subgroup, subgroup + ending)
    df = pd.read_csv(csv_path)
    return df

file1 = read_file('A', ending1)
file2 = read_file('A', ending2)
file3 = read_file('B', ending1)

Related

How to extract data from a filename in python? - convert file name to string?

I am trying to extract the meta data for some experiments I'm helping conduct at school. We are naming our data files something like this:
name_date_sample_environment_run#.csv
What I need to do is write a function that separates each piece to a list that'll be output like this:
['name', 'date', 'sample', 'environment', 'run#']
Though I haven't quite figured it out. I think I need to figure out how to load the file, convert the name to a string, then use a delimiter for each underscore to separate each into the given list. I don't know how to load the file so that I can convert it to a string. Any help will be appreciated!
P.S - I will eventually need to figure out a way to save this data into a spreadsheet so we can see how many experiments we do with certain conditions, who performed them, etc. but I can figure that out later. Thanks!
If you're just asking how to break down the string into all the components separated by an underscore, then the easiest way would be using the split function.
x = 'name_date_sample_environment_run#.csv'
y = x.split('_')
# y = ['name', 'date', 'sample', 'environment', 'run#.csv']
The split function simply breaks down the string every time it sees the underscore. If you want to remove the .csv part from 'run#.csv' then you can process the original string to remove the last 4 characters.
x = 'name_date_sample_environment_run#.csv'
x = x[:-4]
y = x.split('_')
# y = ['name', 'date', 'sample', 'environment', 'run#']
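As an alternative (my suggestion, not part of the answer above), os.path.splitext strips the extension regardless of its length:

```python
import os

x = 'name_date_sample_environment_run#.csv'
# splits at the last dot: ('name_date_sample_environment_run#', '.csv')
stem, ext = os.path.splitext(x)
y = stem.split('_')
# y == ['name', 'date', 'sample', 'environment', 'run#']
```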
If all your files are structured and in the same folder, you can do it this way:
import os

files = os.listdir('.')  # insert folder path
structured_files = []
for file in files:
    name, date, sample, environment, run = os.path.splitext(file)[0].split('_')
    structured_files.append({'name': name, 'date': date, 'sample': sample,
                             'env': environment, 'run': run})
Then you'll have a list of dicts with your file info.
If you want to, you can import it into pandas and save it to an Excel sheet:
import os
import pandas as pd

files = os.listdir('.')  # insert folder path
structured_files = []
for file in files:
    name, date, sample, environment, run = os.path.splitext(file)[0].split('_')
    structured_files.append({'name': name, 'date': date, 'sample': sample,
                             'env': environment, 'run': run})
pd.DataFrame(structured_files).to_excel('files.xlsx')

How to create a dataframe from multiple csv files?

I am loading a csv file in pandas as
premier10 = pd.read_csv('./premier_league/pl_09_10.csv')
However, I have 20+ csv files, which I was hoping to load as separate dfs (one df per csv) using a loop and predefined names, something similar to:
import pandas as pd
file_names = ['pl_09_10.csv','pl_10_11.csv']
names = ['premier10','premier11']
for i in range(0, len(file_names)):
    names[i] = pd.read_csv('./premier_league/{}'.format(file_names[i]))
(Note: here I provide only two csv files as an example.) Unfortunately, this doesn't work (no error messages, but the pd dfs don't exist).
Any tips/links to previous questions would be greatly appreciated as I haven't found anything similar on Stackoverflow.
Use pathlib to set a Path, p, to the files
Use the .glob method to find the files matching the pattern
Create a dataframe with pandas.read_csv
Use a dict comprehension to create a dict of dataframes, where each file will have its own key-value pair.
Use the dict like any other dict; the keys are the file names and the values are the dataframes.
Alternatively, use a list comprehension with pandas.concat to create a single dataframe from all the files.
In the for-loop in the OP, objects (variables) may not be created in that way (e.g. names[i]).
This is equivalent to 'premier10' = pd.read_csv(...), where 'premier10' is a str type.
from pathlib import Path
import pandas as pd
# set the path to the files
p = Path('some_path/premier_league')
# create a list of the files matching the pattern
files = list(p.glob('pl_*.csv'))
# creates a dict of dataframes, where each file has a separate dataframe
df_dict = {f.stem: pd.read_csv(f) for f in files}
# alternative, creates 1 dataframe from all files
df = pd.concat([pd.read_csv(f) for f in files])
names = ['premier10','premier11'] creates a list, not a dictionary, so names[i] = pd.read_csv(...) only overwrites list elements; it never creates variables called premier10 or premier11. Either collect the results in a dict (names = dict(), then names['premier10'] = pd.read_csv(...)) or append each dataframe to a list.
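A runnable sketch of the dict approach (two tiny throwaway CSV files stand in for the real ones, and the column names are invented for the example):

```python
import os
import tempfile
import pandas as pd

# create two small stand-in files so the example runs anywhere
folder = tempfile.mkdtemp()
for fname in ['pl_09_10.csv', 'pl_10_11.csv']:
    with open(os.path.join(folder, fname), 'w') as f:
        f.write('team,points\nArsenal,75\n')

file_names = ['pl_09_10.csv', 'pl_10_11.csv']
names = ['premier10', 'premier11']

# map each chosen name to its dataframe instead of trying to create variables
dfs = {key: pd.read_csv(os.path.join(folder, fname))
       for key, fname in zip(names, file_names)}
```

Afterwards dfs['premier10'] and dfs['premier11'] hold the two dataframes.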
This is what you want:
import os
import pandas as pd

# create a variable and look through the contents of the directory
files = [f for f in os.listdir("./your_directory") if f.endswith('.csv')]

# initialize an empty data frame
all_data = pd.DataFrame()

# iterate through the files and concatenate their data into the data frame above
for file in files:
    df = pd.read_csv('./your_directory/' + file)
    all_data = pd.concat([all_data, df])

# call the new data frame and verify that contents were transferred
all_data.head()

Use for loop to create dataframes from a list

Python/Pandas beginner here. I have a list with names which each represent a csv file on my computer. I would like to create a separate pandas dataframe for each of these csv files and use the same names for the dataframes. I can do this in a very inefficient way by creating a separate line of code for each name in the list and adding/removing these lines of code manually as the list changes over time, something like this when I have 3 names Mark, Frank and Peter:
path = 'C:\\Users\\Me\\Desktop\\Names\\'
Mark = pd.read_csv(path + "Mark.csv")
Frank = pd.read_csv(path + "Frank.csv")
Peter = pd.read_csv(path + "Peter.csv")
Problem is that I will usually have a dozen or so names and they change frequently, so this is not very efficient. Instead I figured I would keep a list of the names to update when needed and use a for loop to do the rest:
path = 'C:\\Users\\Me\\Desktop\\Names\\'
names = ['Mark','Frank','Peter']
for name in names:
    name = pd.read_csv(path + name + '.csv')
This does not produce an error, but instead of creating 3 different dataframes Mark, Frank and Peter, it creates a single dataframe 'name' using only the data from the first entry in the list. How do make this work so that it creates a separate dataframe for each name in the list and give each dataframe the same name as the csv file that was read?
it creates a single dataframe 'name' using only the data from the first entry in the list.
It uses the last entry, because each time through the loop, name is replaced with the result of the next read_csv call. (Actually, it's first replaced with one of the values from the list, and then with the read_csv result; to avoid confusion, you should use a separate name for your output rather than reusing the loop variable. Especially since name doesn't make any sense as the thing to call your result :) )
How do make this work
You had a list of input values, and thus you want a list of output values as well. The simplest approach is to use a list comprehension, describing the list you want in terms of the list you start with:
csvs = [
    pd.read_csv(f'{path}{name}.csv')
    for name in names
]
It works the same way as the explicit loop, except it builds a list automatically from the value that's computed each time through. It means what it says, in order: "csvs is a list of these pd.read_csv results, computed once for each of the name values that is in names".
name here is the variable used to iterate over the list. Modifying it won't make any noticeable changes.
path = 'C:\\Users\\Me\\Desktop\\Names\\'
names = ['Mark','Frank','Peter']

dfs = []
for name in names:
    dfs.append(pd.read_csv(path + name + '.csv'))

# OR

dfs = [
    pd.read_csv(path + name + '.csv')
    for name in names
]
Or, you can use a dict to map the name with the file.
path = 'C:\\Users\\Me\\Desktop\\Names\\'
names = ['Mark','Frank','Peter']

dfs = {}
for name in names:
    dfs[name] = pd.read_csv(path + name + '.csv')

# OR

dfs = {
    name: pd.read_csv(path + name + '.csv')
    for name in names
}
Two options:
If you know the names of all your csv files, you can edit your code and only add a list to hold all your dataframes.
Example
path = 'C:\\Users\\Me\\Desktop\\Names\\'
names = ['Mark','Frank','Peter']
dfs = []
for name in names:
    dfs.append(pd.read_csv(path + name + '.csv'))
Otherwise, you can look for all the files with csv extension and open all of them using listdir()
import os
import pandas as pd

path = 'C:\\Users\\Me\\Desktop\\Names'
files = os.listdir(path)
dfs = []
for file in files:
    if file.endswith('.csv'):
        dfs.append(pd.read_csv(os.path.join(path, file)))
If you really want variables named after each file, you can write them into globals(), though a dict is usually the cleaner choice:
for name in names:
    globals()[name] = pd.read_csv(path + name + '.csv')

Read in multiple csv into separate dataframes in Pandas

I have a long list of csv files that I want to read as dataframes and name them by their file name. For example, I want to read in the file status.csv and assign its dataframe the name status. Is there a way I can efficiently do this using Pandas?
Looking at this, I still have to write the name of each csv in my loop. I want to avoid that.
Looking at this, that allows me to read multiple csv into one dataframe instead of many.
You can list all csv under a directory using os.listdir(dirname) and combine it with os.path.basename to parse the file name.
import os
# current directory csv files
csvs = [x for x in os.listdir('.') if x.endswith('.csv')]
# stats.csv -> stats
fns = [os.path.splitext(os.path.basename(x))[0] for x in csvs]
d = {}
for i in range(len(fns)):
    d[fns[i]] = pd.read_csv(csvs[i])
you could create a dictionary of DataFrames:
d = {}  # dictionary that will hold them
for file_name in list_of_csvs:  # loop over files
    # read csv into a dataframe and add it to the dict with file_name as its key
    d[file_name] = pd.read_csv(file_name)
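Combining the two ideas into one self-contained sketch (a temporary status.csv stands in for the real files):

```python
import os
import tempfile
import pandas as pd

# a throwaway folder with one example file
folder = tempfile.mkdtemp()
with open(os.path.join(folder, 'status.csv'), 'w') as f:
    f.write('id,state\n1,ok\n')

# key each dataframe by its file stem, e.g. 'status.csv' -> 'status'
d = {os.path.splitext(f)[0]: pd.read_csv(os.path.join(folder, f))
     for f in os.listdir(folder) if f.endswith('.csv')}
```

Looking up d['status'] then plays the role of the status variable the question asks for.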

How to append multiple CSV files and add an additional column indicating file name in Python?

I have over 20 CSV files in a single folder. All files have the same structure, they just represent different days.
Example:
Day01.csv
Day02.csv
Day03.csv
Day04.csv (and so on...)
The files contain just two numeric columns: x and y. I would like to append all of these csv files together into one large file and add a column for the file name (day). I have explored similar examples to generate the following code but this code adds each y to a separate column (Y1, Y2, Y3, Y4...and so on). I would like to simply have this appended file as three columns: x, y, file name. How can I modify the code to do the proper append?
I have tried the code from this example: Read multiple csv files and Add filename as new column in pandas
import pandas as pd
import os
os.chdir('C:....path to my folder')
files = os.listdir()
df = pd.concat([pd.read_csv(fp).assign(New=os.path.basename(fp)) for fp in files])
However, this code does not append all Y values under one column. (all other aspects seem to work, however). Can someone help with the code so that all Y values are under a single column?
The following should work by creating the filename column before appending the dataframe to your list.
import os
import pandas as pd

file_list = []
for file in os.listdir():
    if file.endswith('.csv'):
        df = pd.read_csv(file, sep=";")
        df['filename'] = file
        file_list.append(df)

all_days = pd.concat(file_list, ignore_index=True)
all_days.to_csv("all.txt")
Python is great at these simple tasks, almost too good to be true…
fake_files = lambda n: '\n'.join('%d\t%d' % (i, i + 1) for i in range(n, n + 3))
file_name = 'fake_me%s.csv'
with open('my_new.csv', 'wt') as new:
    for number in range(3):  # os.listdir()
        # with open(number) as to_add:
        #     rows = to_add.readlines()
        rows_fake = fake_files(number * 2).split('\n')
        adjusted_rows = [file_name % number + '\t' + row for row in rows_fake]
        new.write('\n'.join(adjusted_rows) + '\n')
With adjustments to your specific I/O and naming, this is all you need. You can just copy the code, run it, and study how it works.
