I have several CSV files with a string in their names (e.g. a city name) and want to read each one into a dataframe whose name is derived from that city name.
Example CSV names: data_paris.csv, data_berlin.csv
How can I read them in a loop to get df_paris and df_berlin?
What I tried so far:
import glob
import re
import pandas as pd
all_files = glob.glob("./*.csv")
for filename in all_files:
    city_name = re.split("[_.]", filename)[1]  # to extract city name from filename
    dfname = {'df' + str(city_name)}
    print(dfname)
    dfname = pd.read_csv(filename)
I expected to get df_paris and df_berlin, but I only get dfname. Why?
A related question: Name a dataframe based on csv file name?
Thank you!
I would recommend against automatic dynamic naming like df_paris, df_berlin. Instead, you should do:
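import glob
import os
import re
import pandas as pd
all_files = glob.glob("./*.csv")
# dictionary of dataframes
dfs = dict()
for filename in all_files:
    # take the base name first so the leading "./" does not affect the split
    city_name = re.split("[_.]", os.path.basename(filename))[1]  # extract city name from filename
    dfs[city_name] = pd.read_csv(filename)  # assign to the dataframe dictionary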
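Once the dataframes are in a dictionary, each one is retrieved by its city key rather than by a variable name. A minimal sketch, assuming two small hypothetical files written to a temporary directory so that it runs on its own (the temp column and its values are made up for illustration):

```python
import glob
import os
import tempfile

import pandas as pd

# Hypothetical sample files, created only so the sketch runs on its own.
tmp_dir = tempfile.mkdtemp()
pd.DataFrame({"temp": [21, 23]}).to_csv(
    os.path.join(tmp_dir, "data_paris.csv"), index=False
)
pd.DataFrame({"temp": [18, 17, 19]}).to_csv(
    os.path.join(tmp_dir, "data_berlin.csv"), index=False
)

dfs = {}
for filename in glob.glob(os.path.join(tmp_dir, "*.csv")):
    # "data_paris.csv" -> "data_paris" -> "paris"
    stem = os.path.splitext(os.path.basename(filename))[0]
    city_name = stem.split("_", 1)[1]
    dfs[city_name] = pd.read_csv(filename)

# Look up one dataframe by its city name.
print(dfs["paris"].head())

# Loop over all of them.
for city, df in sorted(dfs.items()):
    print(city, len(df))
```

Adding a new city then only requires dropping another file into the directory; no code has to change.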
You are mixing up concepts: a string built at runtime is not a variable name. If you want to reference the loaded dataframes dynamically, use a dict:
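import glob
import os
import re
import pandas as pd
all_files = glob.glob("./*.csv")
dfname = {}
for filename in all_files:
    # take the base name first so the leading "./" does not affect the split
    city_name = re.split("[_.]", os.path.basename(filename))[1]  # extract city name from filename
    dfname['df_' + city_name] = pd.read_csv(filename)
print(list(dfname.keys()))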
The only dataframe you're creating is dfname; you overwrite it on every pass through the loop. You could do what you ask with globals(), though honestly I'd just create a list or a dict of dataframes (as others suggested while I was typing this), or add a 'city' column and keep everything in one master dataframe. But, sticking to what you specifically asked, you could do it like so:
all_files = glob.glob("*.csv")  # no "./" prefix, so the slice below sees the bare file name
for filename in all_files:
    globals()["df_" + filename[5:-4]] = pd.read_csv(filename)  # "data_paris.csv" -> df_paris
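The master dataframe alternative mentioned above could look roughly like this. It is only a sketch: the in-memory buffers stand in for data_paris.csv and data_berlin.csv, and the temp column and its values are hypothetical.

```python
import io

import pandas as pd

# Hypothetical in-memory stand-ins for data_paris.csv and data_berlin.csv.
sources = {
    "paris": io.StringIO("temp\n21\n23\n"),
    "berlin": io.StringIO("temp\n18\n17\n19\n"),
}

frames = []
for city, source in sources.items():
    df = pd.read_csv(source)
    df["city"] = city  # tag every row with the city it came from
    frames.append(df)

# Concatenate once at the end rather than growing a dataframe inside the loop.
master = pd.concat(frames, ignore_index=True)

# Select one city's rows when needed.
paris = master[master["city"] == "paris"]
print(paris)
```

With everything in one frame, per-city work can also be done with master.groupby("city") instead of separate variables.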
Related
I am loading a csv file in pandas as
premier10 = pd.read_csv('./premier_league/pl_09_10.csv')
However, I have 20+ csv files, which I was hoping to load as separate dfs (one df per csv) using a loop and predefined names, something similar to:
import pandas as pd
file_names = ['pl_09_10.csv', 'pl_10_11.csv']
names = ['premier10', 'premier11']
for i in range(0, len(file_names)):
    names[i] = pd.read_csv('./premier_league/{}'.format(file_names[i]))
(Note: I list only two CSV files here as an example.) Unfortunately, this doesn't work: there are no error messages, but the dataframes premier10 and premier11 don't exist.
Any tips or links to previous questions would be greatly appreciated, as I haven't found anything similar on Stack Overflow.
Use pathlib to set a Path, p, to the files
Use the .glob method to find the files matching the pattern
Create a dataframe with pandas.read_csv
Use a dict comprehension to create a dict of dataframes, where each file will have its own key-value pair.
Use the dict like any other dict; the keys are the file names and the values are the dataframes.
Alternatively, use a list comprehension with pandas.concat to create a single dataframe from all the files.
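The for-loop in the OP cannot create variables that way (e.g. names[i] = ...).
The intent amounts to writing 'premier10' = pd.read_csv(...), where 'premier10' is a str, not a variable name; what actually happens is that names[i] = ... replaces the string in the list with the dataframe.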
from pathlib import Path
import pandas as pd
# set the path to the files
p = Path('some_path/premier_league')
# create a list of the files matching the pattern
files = list(p.glob('pl_*.csv'))
# creates a dict of dataframes, where each file has a separate dataframe
df_dict = {f.stem: pd.read_csv(f) for f in files}
# alternative, creates 1 dataframe from all files
df = pd.concat([pd.read_csv(f) for f in files])
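If the single combined dataframe should still record which file each row came from, one option is to pass the dict itself to pandas.concat, which uses the dict keys as an outer index level. A small sketch, assuming in-memory stand-ins for the files and made-up team and points values:

```python
import io

import pandas as pd

# Hypothetical in-memory stand-ins for pl_09_10.csv and pl_10_11.csv.
df_dict = {
    "pl_09_10": pd.read_csv(io.StringIO("team,points\nA,10\nB,7\n")),
    "pl_10_11": pd.read_csv(io.StringIO("team,points\nA,12\n")),
}

# Passing a dict to pd.concat uses its keys as the outer index level.
combined = pd.concat(df_dict, names=["file", "row"])

# Rows from one file can then be selected by that key.
print(combined.loc["pl_10_11"])
```

This keeps the convenience of one dataframe while still allowing a per-file lookup.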
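names = ['premier10','premier11'] creates a list, not a dictionary. Inside the loop, names[i] = ... replaces the i-th string in that list with a dataframe, so the list ends up holding the dataframes, but no variables called premier10 or premier11 are created. To look the dataframes up by name, use a dictionary instead: start with dfs = dict() and assign dfs[names[i]] = pd.read_csv(...) in the loop.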
This is what you want:
import os
import pandas as pd
# list the CSV files in the directory
files = [f for f in os.listdir("./your_directory") if f.endswith('.csv')]
# initialize an empty data frame
all_data = pd.DataFrame()
# iterate through the files and concatenate their contents into the data frame initialized above
for file in files:
    df = pd.read_csv(os.path.join('./your_directory', file))  # join supplies the path separator
    all_data = pd.concat([all_data, df])
# display the new data frame to verify that the contents were transferred
all_data.head()
I am running a data analysis that reads many CSV files.
I used the code below:
filelist = [r"C:\Users\jan.csv", r"C:\Users\feb.csv", r"C:\Users\mar.csv"]  # raw strings keep the backslashes literal
for location in filelist:
    df = pd.read_csv(location)
    out_put, productivity = timeresult.input_data.outbuild(df, year, days)
    filelist.append(productivity)
Is there a way to have the index be the CSV name and not have the filename in the filelist anymore?
The filelist then ends up holding the results of my data analysis, but I want each result to be indexed by the name of the CSV file it came from.
I did not understand this part: "Is there a way to have the index be the CSV name"
For the last part, instead of doing filelist.append(productivity), append it to an empty list like:
filelist = [r"C:\Users\jan.csv", r"C:\Users\feb.csv", r"C:\Users\mar.csv"]  # raw strings keep the backslashes literal
filelist_new = []
for location in filelist:
    df = pd.read_csv(location)
    out_put, productivity = timeresult.input_data.outbuild(df, year, days)
    filelist_new.append(productivity)
Added for your followup question:
I don't know what timeresult.input_data.outbuild does, but you can append a [file name, result] pair to the list like this:
filelist = [r"C:\Users\jan.csv", r"C:\Users\feb.csv", r"C:\Users\mar.csv"]  # raw strings keep the backslashes literal
filelist_new = []
for location in filelist:
    df = pd.read_csv(location)
    out_put, productivity = timeresult.input_data.outbuild(df, year, days)
    # pair the file name (the part after the last backslash) with its result
    filelist_new.append([location.split('\\')[-1], productivity])
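One possible reading of "have the index be the CSV name" is a result that is looked up by file name. The sketch below takes that reading; compute_productivity and load are hypothetical stand-ins for timeresult.input_data.outbuild and pd.read_csv, since neither the function nor the files are available here, and the hours column is made up.

```python
from pathlib import PureWindowsPath

import pandas as pd

filelist = [r"C:\Users\jan.csv", r"C:\Users\feb.csv", r"C:\Users\mar.csv"]


def compute_productivity(df):
    """Hypothetical stand-in for timeresult.input_data.outbuild."""
    return df["hours"].sum()


def load(location):
    """Hypothetical stand-in for pd.read_csv(location), so the sketch runs without the files."""
    return pd.DataFrame({"hours": [3, 1]})


results = {}
for location in filelist:
    # PureWindowsPath parses backslash-separated paths on any operating system.
    csv_name = PureWindowsPath(location).stem  # "jan", "feb", "mar"
    results[csv_name] = compute_productivity(load(location))

# A Series built from the dict uses the CSV names as its index.
productivity = pd.Series(results, name="productivity")
print(productivity)
```

The original filelist is left untouched, and productivity["jan"] returns the result for jan.csv.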
Python/Pandas beginner here. I have a list with names which each represent a csv file on my computer. I would like to create a separate pandas dataframe for each of these csv files and use the same names for the dataframes. I can do this in a very inefficient way by creating a separate line of code for each name in the list and adding/removing these lines of code manually as the list changes over time, something like this when I have 3 names Mark, Frank and Peter:
path = 'C:\\Users\\Me\\Desktop\\Names\\'  # trailing separator so that path + "Mark.csv" is a valid file path
Mark = pd.read_csv(path+"Mark.csv")
Frank = pd.read_csv(path+"Frank.csv")
Peter = pd.read_csv(path+"Peter.csv")
Problem is that I will usually have a dozen or so names and they change frequently, so this is not very efficient. Instead I figured I would keep a list of the names to update when needed and use a for loop to do the rest:
path = 'C:\\Users\\Me\\Desktop\\Names\\'  # trailing separator so that path + name forms a valid file path
names = ['Mark', 'Frank', 'Peter']
for name in names:
    name = pd.read_csv(path + name + '.csv')
This does not produce an error, but instead of creating 3 different dataframes Mark, Frank and Peter, it creates a single dataframe 'name' using only the data from the first entry in the list. How do I make this work so that it creates a separate dataframe for each name in the list and gives each dataframe the same name as the CSV file that was read?
it creates a single dataframe 'name' using only the data from the first entry in the list.
It uses the last entry, because each time through the loop, name is replaced with the result of the next read_csv call. (Actually, it is first rebound to the next value from the list and then to the read_csv result; to avoid confusion, use different names for your loop variable and your outputs, especially since name makes no sense as the thing to call your result :) )
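The rebinding can be seen without pandas at all. In this small sketch a placeholder string stands in for the read_csv result; that substitution is an assumption made only to keep the example self-contained.

```python
names = ["Mark", "Frank", "Peter"]

for name in names:
    # The loop first binds `name` to the next string from the list,
    # then this assignment rebinds the same variable to a new object.
    name = f"<dataframe for {name}>"  # placeholder standing in for pd.read_csv(...)

# After the loop, only the result of the final iteration is left in `name`.
print(name)

# The list itself is untouched; nothing was stored under the individual names.
print(names)
```

Each assignment to name discards the previous object, so nothing from the earlier iterations survives the loop.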
How do I make this work
You had a list of input values, and thus you want a list of output values as well. The simplest approach is to use a list comprehension, describing the list you want in terms of the list you start with:
import os
csvs = [
    pd.read_csv(os.path.join(path, f'{name}.csv'))  # os.path.join supplies the path separator
    for name in names
]
It works the same way as the explicit loop, except it builds a list automatically from the value that's computed each time through. It means what it says, in order: "csvs is a list of these pd.read_csv results, computed once for each of the name values that is in names".
name here is the variable used to iterate over the list. Reassigning it inside the loop does not change the list and does not create any new variables.
import os
import pandas as pd
path = 'C:\\Users\\Me\\Desktop\\Names'
names = ['Mark', 'Frank', 'Peter']
dfs = []
for name in names:
    dfs.append(pd.read_csv(os.path.join(path, name + '.csv')))
# OR
dfs = [
    pd.read_csv(os.path.join(path, name + '.csv'))
    for name in names
]
Or, you can use a dict to map each name to its dataframe.
import os
import pandas as pd
path = 'C:\\Users\\Me\\Desktop\\Names'
names = ['Mark', 'Frank', 'Peter']
dfs = {}
for name in names:
    dfs[name] = pd.read_csv(os.path.join(path, name + '.csv'))
# OR
dfs = {
    name: pd.read_csv(os.path.join(path, name + '.csv'))
    for name in names
}
Two options:
If you know the names of all your CSV files, you can keep your code and just add a list to hold the dataframes.
Example
path = 'C:\\Users\\Me\\Desktop\\Names\\'  # trailing separator so that path + name forms a valid file path
names = ['Mark', 'Frank', 'Peter']
dfs = []
for name in names:
    dfs.append(pd.read_csv(path + name + '.csv'))
Otherwise, you can look for all the files with csv extension and open all of them using listdir()
import os
import pandas as pd
path = 'C:\\Users\\Me\\Desktop\\Names'
files = os.listdir(path)
dfs = []
for file in files:
    if file.endswith(".csv"):
        dfs.append(pd.read_csv(os.path.join(path, file)))
for name in names:
    # path is assumed to end with a path separator
    globals()[name] = pd.read_csv(path + name + '.csv')
I am trying to read multiple csv files from a list of file paths and save them all as separate pandas dataframes.
I feel like there should be a way to do this, however I cannot find a succinct explanation.
import pandas as pd
data_list = [['df_1', 'filepath1.csv'],
             ['df_2', 'filepath2.csv'],
             ['df_3', 'filepath3.csv']]
for name, filepath in data_list:
    name = pd.read_csv(filepath)
I have also tried:
data_list = [[df_1, 'filepath1.csv'],
             [df_2, 'filepath2.csv'],
             [df_3, 'filepath3.csv']]
for name, filepath in data_list:
    name = pd.read_csv(filepath)
I would like to be able to call each dataframe by its assigned name.
For example:
df_1.head()
df_dct = {name:pd.read_csv(filepath) for name, filepath in data_list}
would create a dictionary of DataFrames. This may help you organize your data.
You may also want to look into glob.glob to create your list of files. For example, to get all CSV files in a directory:
file_paths = glob.glob(my_file_dir+"/*.csv")
I recommend NumPy: read the CSV files with numpy.genfromtxt.
from numpy import genfromtxt
my_data = genfromtxt('my_file.csv', delimiter=',')
You will get ndarrays. After that, you can wrap them in pandas dataframes.
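The step from the array to pandas could be sketched as follows. The CSV content and the column names a and b are hypothetical, and an in-memory buffer stands in for my_file.csv.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical CSV content; an in-memory buffer stands in for 'my_file.csv'.
csv_text = "a,b\n1,2\n3,4\n"

# skip_header=1 drops the header row, leaving a 2-D float array.
my_data = np.genfromtxt(io.StringIO(csv_text), delimiter=",", skip_header=1)

# Wrap the array in a DataFrame, supplying the column names by hand.
df = pd.DataFrame(my_data, columns=["a", "b"])
print(df)
```

Note that genfromtxt parses every field as a float by default, so text columns come back as nan; for mixed-type files, pd.read_csv is usually the more direct route.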
You can make use of a dictionary for this:
import pandas as pd
data_list = ['filepath1.csv', 'filepath2.csv', 'filepath3.csv']
d = {}
for i, filepath in enumerate(data_list, start=1):
    file_name = "df_" + str(i)  # df_1, df_2, df_3, matching the names in the question
    d[file_name] = pd.read_csv(filepath)
Here d is the dictionary which contains all your dataframes.
I have a long list of csv files that I want to read as dataframes and name them by their file name. For example, I want to read in the file status.csv and assign its dataframe the name status. Is there a way I can efficiently do this using Pandas?
Looking at this, I still have to write the name of each csv in my loop. I want to avoid that.
Looking at this, that allows me to read multiple csv into one dataframe instead of many.
You can list all csv under a directory using os.listdir(dirname) and combine it with os.path.basename to parse the file name.
import os
import pandas as pd
# current directory csv files
csvs = [x for x in os.listdir('.') if x.endswith('.csv')]
# status.csv -> status
fns = [os.path.splitext(os.path.basename(x))[0] for x in csvs]
d = {}
for fn, csv in zip(fns, csvs):
    d[fn] = pd.read_csv(csv)
You could create a dictionary of DataFrames:
d = {}  # dictionary that will hold them
for file_name in list_of_csvs:  # loop over files
    # read csv into a dataframe and add it to the dict with file_name as its key
    d[file_name] = pd.read_csv(file_name)