Pandas read_csv multiple files - python

What's the best way to loop through a bunch of files and create separate data frames for each file? I've looked through other questions, but it seems the point in each of those is to concatenate files into one data frame.
For example, if I have mylist = ['a.csv','b.csv','c.csv'], and I want each of my data frames to take the name of the file (a, b,c), I can't do this because the left side of the assignment statement is treated as a string. How do I correct this so that it is interpreted as a dataframe assignment?
mylist = ['a.csv','b.csv','c.csv']
import pandas as pd
for file in mylist:
file.rsplit('.csv',1)[0] = pd.read_csv(file)

Use a dictionary comprehension:
dfs = {f.rsplit('.csv',1)[0]: pd.read_csv(file)
for f in mylist}

It is generally considered bad practice to name a variable using a formula. A better solution would be to use a dictionary:
mylist = ['a.csv','b.csv','c.csv']
mydict = {}
import pandas as pd
for file in mylist:
mydict[file.rsplit('.csv',1)[0]] = pd.read_csv(file)
Once you do this, you can access each dataframe by saying:
mydict['a']
mydict['b']
etc...

I think you can create dictionary of DataFrames:
import pandas as pd
mylist = ['a.csv','b.csv','c.csv']
dfs = {}
for f in mylist:
dfs.update({f.rsplit('.csv',1)[0]: pd.read_csv(f)})
print dfs['a']

Related

Using pandas, how do I turn one csv file column into list and then filter a different csv with the created list?

Basically I have one csv file called 'Leads.csv' and it contains all the sales leads we already have. I want to turn this csv column 'Leads' into a list and then check a 'Report' csv to see if any of the leads are already in there and then filter it out.
Here's what I have tried:
import pandas as pd
df_leads = pd.read_csv('Leads.csv')
leads_list = df_leads['Leads'].values.tolist()
df = pd.read_csv('Report.csv')
df = df.loc[(~df['Leads'].isin(leads_list))]
df.to_csv('Filtered Report.csv', index=False)
Any help is much appreciated!
You can try:
import pandas as pd
df_leads = pd.read_csv('Leads.csv')
df = pd.read_csv('Report.csv')
set_filtered = set(df['Leads'])-(set(df_leads['Leads']))
df_filtered = df[df['Leads'].isin(set_filtered)]
Note: Sets, are significantly faster than lists for this operation.

How to create a dataframe from multiple csv files?

I am loading a csv file in pandas as
premier10 = pd.read_csv('./premier_league/pl_09_10.csv')
However, I have 20+ csv files, which I was hoping to load as separate dfs (one df per csv) using a loop and predefined names, something similar to:
import pandas as pd
file_names = ['pl_09_10.csv','pl_10_11.csv']
names = ['premier10','premier11']
for i in range (0,len(file_names)):
names[i] = pd.read_csv('./premier_league/{}'.format(file_names[i]))
(Note, here I provide only two csv files as example) Unfortunately, this doesn't work (no error messages, but the the pd dfs don't exist).
Any tips/links to previous questions would be greatly appreciated as I haven't found anything similar on Stackoverflow.
Use pathlib to set a Path, p, to the files
Use the .glob method to find the files matching the pattern
Create a dataframe with pandas.read_csv
Use a dict comprehension to create a dict of dataframes, where each file will have its own key-value pair.
Use the dict like any other dict; the keys are the file names and the values are the dataframes.
Alternatively, use a list comprehension with pandas.concat to create a single dataframe from all the files.
In the for-loop in the OP, objects (variables) may not be created in that way (e.g. names[i]).
This is equivalent to 'premier10' = pd.read_csv(...), where 'premier10' is a str type.
from pathlib import Path
import pandas as pd
# set the path to the files
p = Path('some_path/premier_league')
# create a list of the files matching the pattern
files = list(p.glob(f'pl_*.csv'))
# creates a dict of dataframes, where each file has a separate dataframe
df_dict = {f.stem: pd.read_csv(f) for f in files}
# alternative, creates 1 dataframe from all files
df = pd.concat([pd.read_csv(f) for f in files])
names = ['premier10','premier11'] does not create a dictionary but a list. Simply replace it with names = dict() or replace names = ['premier10','premier11'] by names.append(['premier10','premier11'])
This is what you want:
#create a variable and look through contents of the directory
files=[f for f in os.listdir("./your_directory") if f.endswith('.csv')]
#Initalize an empty data frame
all_data = pd.DataFrame()
#iterate through files and their contents, then concatenate their data into the data frame initialized above
for file in files:
df = pd.read_csv('./your_directory' + file)
all_data = pd.concat([all_data, df])
#Call the new data frame and verify that contents were transferred
all_data.head()

Converting a string representation of dicts to an actual dict

I have a CSV file with 100K+ lines of data in this format:
"{'foo':'bar' , 'foo1':'bar1', 'foo3':'bar3'}"
"{'foo':'bar' , 'foo1':'bar1', 'foo4':'bar4'}"
The quotes are there before the curly braces because my data came in a CSV file.
I want to extract the key value pairs in all the lines to create a dataframe like so:
Column Headers: foo, foo1, foo3, foo...
Rows: bar, bar1, bar3, bar...
I've tried implementing something similar to what's explained here ( Python: error parsing strings from text file with Ast module).
I've gotten the ast.literal_eval function to work on my file to convert the contents into a dict but now how do I get the DataFrame function to work? I am very much a beginner so any help would be appreciated.
import pandas as pd
import ast
with open('file_name.csv') as f:
for string in f:
parsed = ast.literal_eval(string.rstrip())
print(parsed)
pd.DataFrame(???)
You can turn a dictionary into a pandas dataframe using pd.DataFrame.from_dict, but it will expect each value in the dictionary to be in a list.
for key, value in parsed.items():
parsed[key] = [value]
df = pd.DataFrame.from_dict(parsed)
You can do this iteratively by appending to your dataframe.
df = pd.DataFrame()
for string in f:
parsed = ast.literal_eval(string.rstrip())
for key, value in parsed.items():
parsed[key] = [value]
df.append(pd.DataFrame.from_dict(parsed))
parsed is a dictionary, you make a dataframe from it, then join all the frames together:
df = []
with open('file_name.csv') as f:
for string in f:
parsed = ast.literal_eval(string.rstrip())
if type(parsed) != dict:
continue
subDF = pd.DataFrame(parsed, index=[0])
df.append(subDF)
df = pd.concat(df, ignore_index=True, sort=False)
Calling pd.concat on a list of dataframes is faster than calling DataFrame.append repeatedly. sort=False means that pd.concat will not sort the column names when it encounters a few one, like foo4 on the second row.

Read and save multiple csv files from a for-loop

I am trying to read multiple csv files from a list of file paths and save them all as separate pandas dataframes.
I feel like there should be a way to do this, however I cannot find a succinct explanation.
import pandas as pd
data_list = [['df_1','filepath1.csv'],
['df_2','filepath2.csv'],
['df_3','filepath3.csv']]
for name, filepath in data_list:
name = pd.read_csv(filepath)
I have also tried:
data_list = [[df_1,'filepath1.csv'],[df_2,'filepath2.csv'],
[df_3,'filepath3.csv']]
for name, filepath in data_list:
name = pd.read_csv(filepath)
I would like to be able to call each dataframe by its assigned name.
Ex):
df_1.head()
df_dct = {name:pd.read_csv(filepath) for name, filepath in data_list}
would create a dictionary of DataFrames. This may help you organize your data.
You may also want to look into glob.glob to create your list of files. For example, to get all CSV files in a directory:
file_paths = glob.glob(my_file_dir+"/*.csv")
I recommend you numpy. Read the csv files with numpy.
from numpy import genfromtxt
my_data = genfromtxt('my_file.csv', delimiter=',')
You will get nd-array's. After that you can include them into pandas.
You can make sure of dictionary for this...
import pandas as pd
data_list = ['filepath1.csv', 'filepath2.csv', 'filepath3.csv']
d = {}
for _, i in enumerate(data_list):
file_name = "df" + str(_)
d[file_name] = pd.read_csv(filepath)
Here d is the dictionary which contains all your dataframes.

Is it possible to generate an object complete with keys using generator/ comprehension

import pandas as pd
import os
data = {}
for f in os.listdir('schools/'):
data[f.replace('.csv','')] = pd.read_csv('schools/'+f)
I run into this often where I want to use a list comprehension of some kind, which i believe is possible using the format of...
{pd.read_csv('schools/'+f) for f in os.listdir('schools/')}
However, not sure how to get the keys in there? Is it possible to generate an object this way?
you want a dictionary comprehension
for f in os.listdir('schools/'):
data[f.replace('.csv','')] = pd.read_csv('schools/'+f)
becomes:
data = {f.replace('.csv',''):pd.read_csv(os.path.join('schools',f)) for f in os.listdir('schools')}
maybe made safer and more readable using glob.glob so you filter out non-csv files and you don't have to join:
data = {f.replace('.csv',''):pd.read_csv(f) for f in glob.glob(os.path.join('schools',"*.csv"))}
A slight extension, to convert this to a pandas dataframe (assuming each csv only has one column):
pd.DataFrame({f.replace('.csv','') : pd.read_csv(os.path.join('schools',f)).values.reshape(-1, ) for f in os.listdir('schools')})

Categories