Use for loop to create dataframes from a list - python

Python/Pandas beginner here. I have a list with names which each represent a csv file on my computer. I would like to create a separate pandas dataframe for each of these csv files and use the same names for the dataframes. I can do this in a very inefficient way by creating a separate line of code for each name in the list and adding/removing these lines of code manually as the list changes over time, something like this when I have 3 names Mark, Frank and Peter:
path = 'C:\\Users\\Me\\Desktop\\Names\\'
Mark = pd.read_csv(path + "Mark.csv")
Frank = pd.read_csv(path + "Frank.csv")
Peter = pd.read_csv(path + "Peter.csv")
Problem is that I will usually have a dozen or so names and they change frequently, so this is not very efficient. Instead I figured I would keep a list of the names to update when needed and use a for loop to do the rest:
path = 'C:\\Users\\Me\\Desktop\\Names\\'
names = ['Mark','Frank','Peter']
for name in names:
    name = pd.read_csv(path + name + '.csv')
This does not produce an error, but instead of creating 3 different dataframes Mark, Frank and Peter, it creates a single dataframe 'name' using only the data from the first entry in the list. How do I make this work so that it creates a separate dataframe for each name in the list and gives each dataframe the same name as the csv file that was read?

it creates a single dataframe 'name' using only the data from the first entry in the list.
It uses the last entry, because each time through the loop, name is replaced with the result of the next read_csv call. (Actually, it's first bound to one of the values from the list, and then replaced with the read_csv result; to avoid confusion, use a different name for your loop variable than for your output. Especially since name doesn't make much sense as the thing to call your result :) )
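A minimal illustration of the rebinding, using throwaway values:
names = ['Mark', 'Frank', 'Peter']
for name in names:
    name = name.upper()  # rebinds the loop variable only
print(names)  # ['Mark', 'Frank', 'Peter'] -- the original list is unchanged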
How do I make this work
You had a list of input values, and thus you want a list of output values as well. The simplest approach is to use a list comprehension, describing the list you want in terms of the list you start with:
csvs = [
    pd.read_csv(f'{path}{name}.csv')
    for name in names
]
It works the same way as the explicit loop, except it builds a list automatically from the value that's computed each time through. It means what it says, in order: "csvs is a list of these pd.read_csv results, computed once for each of the name values that is in names".
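If you also want to look each dataframe up by its original name, one option (a small sketch building on the two lists above) is to zip them into a dict:
csvs_by_name = dict(zip(names, csvs))
csvs_by_name['Mark']  # the dataframe read from Mark.csv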

name here is just the loop variable used to iterate over the list. Reassigning it doesn't change the list; each pass simply rebinds the variable, so nothing outside the loop sees the result.
path = 'C:\\Users\\Me\\Desktop\\Names\\'
names = ['Mark','Frank','Peter']
dfs = []
for name in names:
    dfs.append(pd.read_csv(path + name + '.csv'))
# OR
dfs = [
    pd.read_csv(path + name + '.csv')
    for name in names
]
Or, you can use a dict to map each name to its dataframe.
path = 'C:\\Users\\Me\\Desktop\\Names\\'
names = ['Mark','Frank','Peter']
dfs = {}
for name in names:
    dfs[name] = pd.read_csv(path + name + '.csv')
# OR
dfs = {
    name: pd.read_csv(path + name + '.csv')
    for name in names
}
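Lookup then works by name, and you can iterate over everything at once; for example:
for name, df in dfs.items():
    print(name, df.shape)  # which file it was, and its row/column counts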

Two options:
If you know the names of all your csv files, you can keep your code and just add a list to hold all your dataframes.
Example
path = 'C:\\Users\\Me\\Desktop\\Names\\'
names = ['Mark','Frank','Peter']
dfs = []
for name in names:
    dfs.append(pd.read_csv(path + name + '.csv'))
Otherwise, you can look for all the files with a csv extension and open all of them using os.listdir():
import os
import pandas as pd

path = 'C:\\Users\\Me\\Desktop\\Names'
files = os.listdir(path)
dfs = []
for file in files:
    if file.endswith('.csv'):
        dfs.append(pd.read_csv(os.path.join(path, file)))
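An equivalent, slightly shorter sketch using the glob module instead of filtering os.listdir() by hand (same folder assumption as above):
import glob
import pandas as pd

# glob matches the extension for us, so no manual check is needed
dfs = [pd.read_csv(f) for f in glob.glob('C:\\Users\\Me\\Desktop\\Names\\*.csv')]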

If you really do want a separate variable per file, you can assign into globals(), though a dict (as shown above) is usually the safer choice:
for name in names:
    globals()[name] = pd.read_csv(path + name + '.csv')

Related

Pandas Reading CSV With Common Path but Different Names

I am trying to write a faster way to read in a group of CSV files. The files share a common partial path leading to a set of subfolders, each identified by some identifier; within each subfolder, the file name starts with that identifier and ends with a common phrase.
For example, let's say I have group names A, B, C. The file paths would be:
C:\Users\Name\Documents\A\A-beginninggroup.csv
C:\Users\Name\Documents\A\A-middlegroup.csv
C:\Users\Name\Documents\A\A-endinggroup.csv
C:\Users\Name\Documents\B\B-beginninggroup.csv
C:\Users\Name\Documents\B\B-middlegroup.csv
C:\Users\Name\Documents\B\B-endinggroup.csv
C:\Users\Name\Documents\C\C-beginninggroup.csv
C:\Users\Name\Documents\C\C-middlegroup.csv
C:\Users\Name\Documents\C\C-endinggroup.csv
I am trying to write code where I can just change the name of the subgroup without having to change it in each read_csv line. The following code shows the logic, but I'm not sure how to make it work/if it's possible.
intro = 'C:\\Users\\Name\\Documents\\'
subgroup = 'C'
ending1 = '-endinggroup.csv'
ending2 = '-middlegroup.csv'
ending3 = '-beginninggroup.csv'
filename_1 = intro + subgroup + '\\' + subgroup + ending1
filename_2 = intro + subgroup + '\\' + subgroup + ending2
filename_3 = intro + subgroup + '\\' + subgroup + ending3
file1 = pd.read_csv(filename_1)
file2 = pd.read_csv(filename_2)
file3 = pd.read_csv(filename_3)
I am not sure exactly what you are after, but you can use an f-string in this case.
You first define your variables (names in your case):
location = 'somewhere\\anywhere'
group = 'A'
csv = 'A-beginninggroup.csv'
Now you combine these variables in an F-string:
file_location = f"{location}\\{group}\\{csv}"
And pass the file_location to your pandas csv reader. You can freely change the group variable and the csv variable.
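Putting it together, a minimal sketch (assuming the paths from the question) that reads all three files for one subgroup:
import pandas as pd

location = 'C:\\Users\\Name\\Documents'
group = 'A'
csv_endings = ['-beginninggroup.csv', '-middlegroup.csv', '-endinggroup.csv']

# change `group` once and every path below follows
files = [pd.read_csv(f"{location}\\{group}\\{group}{ending}") for ending in csv_endings]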
I can just change the name of the subgroup without having to change it in each read_csv line.
You can define a function to handle the logic of joining the path:
import os
import pandas as pd

intro = 'C:\\Users\\Name\\Documents\\'
subgroup = 'C'
ending1 = '-endinggroup.csv'
ending2 = '-middlegroup.csv'
ending3 = '-beginninggroup.csv'

def read_file(subgroup, ending):
    csv_path = os.path.join(intro, subgroup, subgroup + ending)
    df = pd.read_csv(csv_path)
    return df

file1 = read_file('A', ending1)
file2 = read_file('A', ending2)
file3 = read_file('B', ending1)
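Building on that, a sketch that reads every subgroup/ending combination at once into a dict (the ['A', 'B', 'C'] list is an assumption taken from the example paths):
endings = [ending1, ending2, ending3]
files = {
    (sg, ending): read_file(sg, ending)
    for sg in ['A', 'B', 'C']
    for ending in endings
}
# e.g. files[('A', '-middlegroup.csv')] is the middle file for group A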

Saving extracted column to a txt file in ascending order

I need some help writing the values from a column to a text file in ascending order.
The code I currently have creates a directory called values and saves the values extracted from the column to a .txt file, but not in ascending order as I would like.
values_dir = os.path.join(cwd, 'values')
if not os.path.exists(values_dir):
    os.mkdir(values_dir)
with open(os.path.join(values_dir, 'values.txt'), "w") as txt_file:
    for name, group in split_location:
        txt_file.write(str(name) + '\n')
The code saves my values as
data23
data17
data88
I would like it to save as
data17
data23
data88
If someone could point me in the right direction, it would be much appreciated. Thank you.
Edit
I split 2 large dataframes by unique values in fields Data and Data_Unit
datafile = pd.read_csv('location.csv')
datafile_large = pd.read_csv('large.csv')
split_location = datafile.groupby('Data')
split_large = datafile_large.groupby('Data_Unit')
I then loop through the groups and save the split dataframes to sub-directories based on their unique values, whilst maintaining the parent file name.
for name, group in split_location:
    sub_dir = os.path.join(cwd, name)
    if not os.path.exists(sub_dir):
        os.mkdir(sub_dir)
    group = group.drop(['Data'], axis=1)
    group.to_csv(sub_dir + "/location.csv", index=0)
for name, group in split_large:
    sub_dir = os.path.join(cwd, name)
    if not os.path.exists(sub_dir):
        os.mkdir(sub_dir)
    group = group.drop(['Data_Unit'], axis=1)
    group.to_csv(sub_dir + "/large.csv", index=0)
Lastly I create the values.txt file as mentioned in the beginning. But would like the values saved in the .txt file in ascending order.
values_dir = os.path.join(cwd, 'values')
if not os.path.exists(values_dir):
    os.mkdir(values_dir)
with open(os.path.join(values_dir, 'values.txt'), "w") as txt_file:
    for name, group in split_location:
        txt_file.write(str(name) + '\n')
Try this:
names, groups = map(list, zip(*split_location))
names.sort()
for name in names:
    txt_file.write(str(name) + '\n')
Instead of:
for name, group in split_location:
    txt_file.write(str(name) + '\n')
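In context, the sorted loop sits inside the same with open(...) block from the question; a minimal sketch of the combined result:
with open(os.path.join(values_dir, 'values.txt'), "w") as txt_file:
    for name in sorted(str(name) for name, group in split_location):
        txt_file.write(name + '\n')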
You can use Python's built-in sorted function or the sort method of a list. Another answer shows the sort method, so I'm using sorted here.
Also, use pathlib on Python 3.
from pathlib import Path
values_dir = Path.home() / 'values'
values_dir.mkdir(exist_ok=True)
# step one: get a list of names
# from your example, split_location looks like
# an iterable of two-item tuple or list
names = sorted([str(item[0]) for item in split_location])
# step two: write the list of sorted names
# you can write just one string by joining your
# list of names with newline characters
newf = values_dir / 'values.txt'
newf.write_text('\n'.join(names))

How to create a dataframe from multiple csv files?

I am loading a csv file in pandas as
premier10 = pd.read_csv('./premier_league/pl_09_10.csv')
However, I have 20+ csv files, which I was hoping to load as separate dfs (one df per csv) using a loop and predefined names, something similar to:
import pandas as pd
file_names = ['pl_09_10.csv','pl_10_11.csv']
names = ['premier10','premier11']
for i in range(0, len(file_names)):
    names[i] = pd.read_csv('./premier_league/{}'.format(file_names[i]))
(Note: here I provide only two csv files as an example.) Unfortunately, this doesn't work (no error messages, but the pd dfs don't exist).
Any tips/links to previous questions would be greatly appreciated as I haven't found anything similar on Stackoverflow.
Use pathlib to set a Path, p, to the files
Use the .glob method to find the files matching the pattern
Create a dataframe with pandas.read_csv
Use a dict comprehension to create a dict of dataframes, where each file will have its own key-value pair.
Use the dict like any other dict; the keys are the file names and the values are the dataframes.
Alternatively, use a list comprehension with pandas.concat to create a single dataframe from all the files.
In the for-loop in the OP, variables can't be created that way: names[i] = pd.read_csv(...) just replaces the string 'premier10' inside the names list with a dataframe. It never creates a variable called premier10.
from pathlib import Path
import pandas as pd
# set the path to the files
p = Path('some_path/premier_league')
# create a list of the files matching the pattern
files = list(p.glob('pl_*.csv'))
# creates a dict of dataframes, where each file has a separate dataframe
df_dict = {f.stem: pd.read_csv(f) for f in files}
# alternative, creates 1 dataframe from all files
df = pd.concat([pd.read_csv(f) for f in files])
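Each dataframe is then keyed by its file stem; with the file names from the question, access looks like:
premier10 = df_dict['pl_09_10']  # the dataframe read from pl_09_10.csv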
names = ['premier10','premier11'] does not create a dictionary but a list. If you want the dataframes stored under those names, use a dict instead: start with names = dict() and assign with names['premier10'] = pd.read_csv(...).
This is what you want:
# create a variable and look through contents of the directory
import os
import pandas as pd

files = [f for f in os.listdir("./your_directory") if f.endswith('.csv')]
# initialize an empty data frame
all_data = pd.DataFrame()
# iterate through the files and concatenate their contents into the data frame initialized above
for file in files:
    df = pd.read_csv('./your_directory/' + file)
    all_data = pd.concat([all_data, df])
# call the new data frame and verify that contents were transferred
all_data.head()
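Note that concatenating inside the loop re-copies all_data on every iteration; collecting the frames first and concatenating once is usually faster. A sketch under the same directory assumption:
import os
import pandas as pd

files = [f for f in os.listdir('./your_directory') if f.endswith('.csv')]
all_data = pd.concat([pd.read_csv('./your_directory/' + f) for f in files])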

Pandas name dataframe from a string in csv name

I have several csv files with a string in their names (e.g. a city name) and want to read them into dataframes named after that string.
example of csv names: data_paris.csv , data_berlin.csv
How can I read them in a loop to get df_paris and df_berlin?
What I tried so far:
all_files = glob.glob("./*.csv")
for filename in all_files:
    city_name = re.split("[_.]", filename)[1]  # to extract city name from filename
    dfname = {'df' + str(city_name)}
    print(dfname)
    dfname = pd.read_csv(filename)
I expect to have df_paris and df_berlin, but I get just dfname. Why?
A related question: Name a dataframe based on csv file name?
Thank you!
I would recommend against automatic dynamic naming like df_paris, df_berlin. Instead, you should do:
all_files = glob.glob("./*.csv")
# dictionary of dataframes
dfs = dict()
for filename in all_files:
    city_name = re.split("[_.]", filename)[1]  # to extract city name from filename
    dfs[city_name] = pd.read_csv(filename)  # assign to the dataframe dictionary
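Each dataframe is then retrievable by its city key; for example:
dfs['paris'].head()  # first rows of the dataframe read from data_paris.csv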
You are mixing your concepts. If you want to dynamically reference data frames that have been loaded, use a dict:
all_files = glob.glob("./*.csv")
dfname={}
for filename in all_files:
    city_name = re.split("[_.]", filename)[1]  # to extract city name from filename
    dfname['df' + str(city_name)] = pd.read_csv(filename)
print(list(dfname.keys()))
The only dataframe you're creating is dfname; you just keep overwriting it each time through the loop. You could do this using globals(), though honestly I'd just create a list or a dict of dataframes (as it seems others have suggested while I was typing this), or else create a named 'city' column in a master dataframe that I keep appending to. But, keeping with what you're specifically asking, you could do it like so:
all_files = glob.glob("./*.csv")
for filename in all_files:
    city_name = re.split("[_.]", filename)[1]  # 'paris' out of './data_paris.csv'
    globals()['df_' + city_name] = pd.read_csv(filename)

Why is my for loop overwriting instead of appending?

I have multiple (25k) .csv files that I'm trying to append into an HDFStore file. They all share identical headers. I am using the code below, but for some reason whenever I run it the stored dataframe doesn't end up containing all of the files; it only contains the last file in the list.
filenames = []  # list of .csv file paths that I've already populated
dtypes = {dict of datatypes}
store = pd.HDFStore('store.h5')
# store one data frame
store.put('df', pd.read_csv(filenames[0], dtype=dtypes, parse_dates=["date"]))
for f in filenames:
    try:
        temp_csv = pd.DataFrame()
        temp_csv = pd.read_csv(f, dtype=dtypes, parse_dates=["trade_date"])
        store.append('df', temp_csv)
    except:
        pass
I've tried using a subset of the filenames list, but I always get the last entry. For some reason, the loop is not appending my file but rather overwriting it every single time. Any advice would be appreciated as this is driving me bonkers. (Python 3, Windows)
I think the problem is related to:
store.append('df', temp_csv)
If I correctly understand what you're trying to do, 'df' should change every iteration; you're just overwriting it now.
You're creating/storing a new DataFrame with each iteration, as @SeaMonkey said. Your consolidated dataframe should be instantiated outside your loop, something like this:
filenames = []  # list of .csv file paths that I've already populated
dtypes = {dict of datatypes}
df = pd.DataFrame()
for f in filenames:
    df_tmp = pd.read_csv(f, dtype=dtypes, parse_dates=["trade_date"])
    df = pd.concat([df, df_tmp])  # DataFrame.append was removed in newer pandas
store = pd.HDFStore('store.h5')
store.put('df', df)
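Reading the consolidated frame back out of the store is then straightforward:
store = pd.HDFStore('store.h5')
df = store['df']  # retrieve the stored dataframe
store.close()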
