Is it possible to create a new df after every iteration? With [i] being the iteration index, it should generate df0, df1, df2, etc., up to MAX_NUMBER, as in the example:
for i in range(MAX_NUMBER + 1):
    df[i] = pd.read_csv(f"C:/Users/Desktop/{i}.csv")
The original code consists of functions that loop multiple times; for simplicity, I've used read_csv in the example.
Kindly advise. Many thanks!
Try creating a list and appending each df as you progress through the for loop, like this:
df = []
for i in range(MAX_NUMBER + 1):
    df.append(pd.read_csv(f"C:/Users/Desktop/{i}.csv"))
When you need to access one, you can use an index like df[0] or df[1].
Read each file and assign it to a dictionary key, with the key being the name of the dataframe, as follows:
dfs = {}
for i in range(MAX_NUMBER + 1):
    dfs[f'df{i}'] = pd.read_csv(f"C:/Users/Desktop/{i}.csv")
Then you can access each df by its name:
dfs['df0']
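If you later need to work through all of them in one pass, you can also iterate over the dictionary directly rather than looking each one up by name (a minimal sketch, assuming the dfs dict built above):
for name, frame in dfs.items():
    print(name, frame.shape)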
Related
I have a use case with an unknown number of dfs generated from a groupby. The groupby columns are contained in a list, and once the groupby runs, a unique df is created for each iteration.
I can dynamically create a list and a dictionary of the dataframe names; however, I cannot figure out how to use the position in the list/dictionary as the name of the dataframe. Data and code below:
Code so far:
data_list = [['A','F','B','existing'],['B','F','W','new'],['C','M','H','new'],['D','M','B','existing'],['E','F','A','existing']]
# Create the pandas DataFrame
data_long = pd.DataFrame(data_list, columns=['PAT_ID', 'sex', 'race_ethnicity', 'existing_new_client'])
groupbyColumns = [['sex','existing_new_client'], ['race_ethnicity','existing_new_client']]
def sex_race_summary(groupbyColumns):
    grouplist = groupbyColumns[grouping].copy()
    # group by the current pair of columns and count unique patients
    df = data_long.groupby(groupbyColumns[grouping]).agg(total_unique_patients=('PAT_ID', 'nunique')).reset_index()
    # create a new column to capture the variable names for aggregation and lowercase the fields
    df['variables'] = df[grouplist[0]].str.lower() + '_' + df['existing_new_client'].str.lower()
    return df
for grouping in range(len(groupbyColumns)):
    exec(f'df_{grouping} = sex_race_summary(groupbyColumns)')
print(df_0)
# create a dictionary
dict_of_dfs = dict()
# a list of the dataframe names
df_names = []
for i in range(len(groupbyColumns)):
    df_names.append('df_' + str(i))
print(df_names)
What I'd like to do next is:
# loop through every dataframe and transpose the dataframe
for i in df_names:
    df_transposed = i.pivot_table(index='sex', columns='existing_new_client', values='total_unique_patients', aggfunc='sum').reset_index()
    print(i)
The index of the list matches the suffix of the dataframe name.
But i is being passed as a string, which throws an error. The reason I need to build it this way rather than hard-coding the dataframes is that I will not know in advance how many df_x will be created.
Thanks for your help!
Update based on R Y A N's comment:
Thank you so much for this! You actually gave me another idea, so I made some tweaks. This is my final code, which produces what I want: a summary and a transposed table for each iteration. However, I am learning that globals() is bad practice and that I should use a dictionary instead. How could I convert this to a dictionary-based process?
# create summary tables for each pair of groupby columns
groupbyColumns = [['sex','existing_new_client'], ['race_ethnicity','existing_new_client']]
for grouping in range(len(groupbyColumns)):
    grouplist = groupbyColumns[grouping].copy()
    prefix = grouplist[0]
    # group by the current pair of columns and count unique patients
    df = data_long.groupby(groupbyColumns[grouping]).agg(total_unique_patients=('PAT_ID', 'nunique')).reset_index().fillna(0)
    # create a new column to capture the variable names for aggregation and lowercase the fields
    df['variables'] = df[grouplist[0]].str.lower() + '_' + df['existing_new_client'].str.lower()
    # transpose the dataframe from long to wide
    df_transposed = df.pivot_table(index=df.columns[0], columns='existing_new_client', values='total_unique_patients', aggfunc='sum').reset_index().fillna(0)
    # create new global variables using the prefix as part of the name
    globals()['%s_summary' % prefix] = df
    globals()['%s_summary_transposed' % prefix] = df_transposed
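For what it's worth, a dictionary-based version of that same loop could look like the sketch below; summaries is a name introduced here for illustration, and data_long and groupbyColumns are assumed from the code above:
summaries = {}
for grouping in range(len(groupbyColumns)):
    grouplist = groupbyColumns[grouping].copy()
    prefix = grouplist[0]
    # same aggregation and pivot as above
    df = data_long.groupby(groupbyColumns[grouping]).agg(total_unique_patients=('PAT_ID', 'nunique')).reset_index().fillna(0)
    df['variables'] = df[grouplist[0]].str.lower() + '_' + df['existing_new_client'].str.lower()
    df_transposed = df.pivot_table(index=df.columns[0], columns='existing_new_client', values='total_unique_patients', aggfunc='sum').reset_index().fillna(0)
    # store both tables under keys derived from the prefix instead of creating globals
    summaries[f'{prefix}_summary'] = df
    summaries[f'{prefix}_summary_transposed'] = df_transposed
Each table is then available as e.g. summaries['sex_summary'] or summaries['race_ethnicity_summary_transposed'].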
You can use the globals() function to look up a global variable (here a DataFrame) by the string value i in the loop. See below:
# loop through every dataframe and transpose the dataframe
for i in df_names:
    # call globals() to return a dictionary of global vars and index on i
    i_df = globals()[i]
    # reference the first column indirectly, since it's 'sex' in df_0 but 'race_ethnicity' in df_1
    df_transposed = i_df.pivot_table(index=i_df.columns[0], columns='existing_new_client', values='total_unique_patients', aggfunc='sum')
    print(df_transposed)
However, with only the globals() lookup added, the .pivot_table() call gives an index error, since the 'sex' column exists only in df_0 and not in df_1. To fix this, you can indirectly reference the first column (i.e. index=i_df.columns[0]) in the .pivot_table() call to handle the different column names of the dfs coming through the loop.
I would like to yield multiple empty dataframes from a function in Python.
import pandas as pd

df_list = []

def create_multiple_df(num):
    for i in range(num):
        df = pd.DataFrame()
        df_name = "df_" + str(num)
        exec(df_name + " = df")
        df_list.append(eval(df_name))
    for i in df_list:
        yield i
e.g. when I call create_multiple_df(3), I would like to have df_1, df_2 and df_3 returned.
However, it didn't work.
I have two questions:
How to store multiple dataframes in a list (i.e. without evaluating the contents of the dataframes)?
How to yield multiple variable elements from a list?
Thanks!
It's very likely that you do not want to have df_1, df_2, df_3 ... etc. This is often a design pursued by beginners for some reason, but trust me that a dictionary or simply a list will do the trick without the need to hold different variables.
Here, it sounds like you simply want a list comprehension:
dfs = [pd.DataFrame() for _ in range(n)]
This will create n empty dataframes and store them in a list. To retrieve or modify them, you simply access them by position. This means that instead of having a dataframe saved in a variable df_1, you have it in the list dfs and can use dfs[1] to get or edit it.
Another option is a dictionary comprehension:
dfs = {i: pd.DataFrame() for i in range(n)}
It works in a similar fashion: you can access entries with dfs[0] or dfs[1] (or even use real names, e.g. {genre: pd.DataFrame() for genre in ['romance', 'action', 'thriller']}; here you could do dfs['romance'] or dfs['thriller'] to retrieve the corresponding df).
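As for the second question in the post (yielding multiple dataframes from a function), a generator needs no named variables at all; a minimal sketch, with create_multiple_df mirroring the function name from the question:
import pandas as pd

def create_multiple_df(num):
    # yield one empty dataframe per iteration instead of naming each one
    for _ in range(num):
        yield pd.DataFrame()

dfs = list(create_multiple_df(3))  # materialize the generator into a list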
I am slicing a dataframe by an identifier column and creating subset dataframes in a for loop using globals(). Finally, I combine all the sliced dataframes into a tuple. As you can see, the tuple-creating part is manual, but I need to scale this code to a much larger dataset and can't do that by hand, so I want to add this step to the for loop and get tup in one step without typing "tup = (TT_a,TT_b,TT_c,TT_d,TT_e)". I just need the output, so please suggest any way to achieve it; it does not need to use globals().
# create the dataframe
import pandas as pd

loc = [100,200,300,400,500,600,700,800,900,1000]
identifier = ['a','a','a','a','b','b','c','d','e','f']
d = {'loc': loc, 'identifier': identifier}
df = pd.DataFrame(d)
# create sliced dataframes by identifier, 6 unique
for i in df['identifier'].unique():
    globals()['TT_%s' % i] = df[df['identifier'] == i].reset_index()[['loc','identifier']]
%who
TT_a TT_b TT_c TT_d TT_e TT_f d df i
identifier loc pd
# Final output needed
tup = (TT_a, TT_b, TT_c, TT_d, TT_e)
First of all, please do not use globals like that...
Use a dictionary:
d = {}
for i in df['identifier'].unique():
    if len(df.loc[df['identifier'] == i, 'identifier']) > 1:
        d['TT_%s' % i] = df.loc[df['identifier'] == i, ['loc','identifier']].reset_index()
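To get the tuple output the question asks for, you can then build it from the dictionary's values in one step, with no need to type the names out (plain dicts preserve insertion order in Python 3.7+):
tup = tuple(d.values())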
I have three different Pandas dataframes
df_1
df_2
df_3
I would like to loop over the dataframes, do some computations, and store the output using the name of the dataframe. In other words, something like this
for my_df in [df_1, df_2, df_3]:
    my_df.reset_index(inplace=True)
    my_df.to_csv('mypath/' + my_df + '.csv')
Output files expected:
'mypath/df_1.csv', 'mypath/df_2.csv' and 'mypath/df_3.csv'
I am struggling to do this because df_1 is an object, not a string.
Any ideas how to do that?
Thanks!
A more general solution is to create a dict mapping the dataframe names to the dataframes, then loop over its items():
d = {'df_1': df_1, 'df_2': df_2, 'df_3': df_3}
for k, my_df in d.items():
    my_df.reset_index(inplace=True)
    my_df.to_csv('mypath/' + k + '.csv')
Another possible solution is to use a separate list of dataframe names, then enumerate and look up each name by position:
names = ['a','b','c']
print({names[i]: df for i, df in enumerate([df_1, df_2, df_3])})
To store a df as a csv with the to_csv method, we need a string path. So we enumerate over the list, which gives us two variables in the for loop: the first is the index of the iteration, acting as a counter. We use that counter value to build a string suffix, which gives each file a unique name to store under.
for idx, my_df in enumerate([df_1, df_2, df_3]):
    my_df.reset_index(inplace=True)
    my_df.to_csv('mypath/df_' + str(idx + 1) + '.csv')
I have several data frames that all contain the same column names. I want to append them into a master data frame. I also want to create a column that denotes the original data frame, flooded with that data frame's name. I have some code that works.
df_combine = df_breakfast.copy()
df_combine['X_ORIG_DF'] = 'Breakfast'
df_combine = df_combine.append(df_lunch, ignore_index=True)
df_combine['X_ORIG_DF'] = df_combine['X_ORIG_DF'].fillna('Lunch')
# Rinse and repeat
However, it seems inelegant. I was hoping someone could point me to a more elegant solution. Thank you in advance for your time!
Note: Edited to reflect comment!
I would definitely consider restructuring your data in a way that lets the names be accessed neatly, rather than as variable names (if the frames must be separate to begin with).
For example a dictionary:
d = {'breakfast': df_breakfast, 'lunch': df_lunch}
Create a function to give each DataFrame a new column:
def add_col(df, col_name, col_entry):
df = df.copy() # so as not to change df_lunch etc.
df[col_name] = col_entry
return df
and combine the list of DataFrames, each with the appended column ('X_ORIG_DF'):
In [3]: df_combine = pd.DataFrame().append(list(add_col(v, 'X_ORIG_DF', k)
                                                for k, v in d.items()))

In [4]: df_combine
Out[4]:
   0  1  X_ORIG_DF
0  1  2      lunch
1  3  4      lunch
0  1  2  breakfast
1  3  4  breakfast
In this example: df_lunch = df_breakfast = pd.DataFrame([[1, 2], [3, 4]]).
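As a side note, a similar result is possible with pd.concat and its keys argument, avoiding the helper function entirely; a sketch, assuming the same dict d as above (the names become an index level, so reset_index turns them into the 'X_ORIG_DF' column):
df_combine = pd.concat(list(d.values()), keys=list(d.keys()), names=['X_ORIG_DF', None]).reset_index(level=0)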
I've encountered a similar problem when trying to combine multiple files for analysis in a master dataframe. Here is one method for creating that master dataframe: load each dataframe independently, give each one an identifier in a column called 'ID', and combine them. If your data is a list of files in a directory called datadir, I would do the following:
import os
import pandas as pd

data_list = os.listdir(datadir)
df_dict = {}
for data_file in data_list:
    # join with the directory so the path resolves correctly
    df = pd.read_table(os.path.join(datadir, data_file))
    # add an ID column based on the file name.
    # you could use some other naming scheme of course
    df['ID'] = data_file
    df_dict[data_file] = df
# the concat function is great for combining lots of dfs.
# it takes an iterable of dfs as an argument.
combined_df_with_named_column = pd.concat(df_dict.values())