I have a use case where I have an unknown number of DataFrames generated from a groupby. The group-by column sets are held in a list, and each iteration over that list produces its own DataFrame.
I can dynamically create a list and a dictionary of the DataFrame names; however, I cannot figure out how to use an entry in the list/dictionary as the name of the DataFrame. Data and code below:
Code so far:
import pandas as pd

data_list = [['A','F','B','existing'],['B','F','W','new'],['C','M','H','new'],['D','M','B','existing'],['E','F','A','existing']]
# Create the pandas DataFrame
data_long = pd.DataFrame(data_list, columns = ['PAT_ID', 'sex', 'race_ethnicity','existing_new_client'])
groupbyColumns = [['sex','existing_new_client'], ['race_ethnicity','existing_new_client']]
def sex_race_summary(groupbyColumns):
    # note: relies on the global loop variable `grouping` set by the for loop below
    grouplist = groupbyColumns[grouping].copy()
    # count unique patients for each combination of the group-by columns
    df = data_long.groupby(groupbyColumns[grouping]).agg(total_unique_patients=('PAT_ID', 'nunique')).reset_index()
    # create new column to capture the variable names for aggregation and lowercase the fields
    df['variables'] = df[grouplist[0]].str.lower() + '_' + df['existing_new_client'].str.lower()
    return df
for grouping in range(len(groupbyColumns)):
    exec(f'df_{grouping} = sex_race_summary(groupbyColumns)')
print(df_0)
# create a dictionary
dict_of_dfs = dict();
# a list of the dataframes
df_names = [];
for i in range(len(groupbyColumns)):
    df_names.append('df_'+ str(i))
print(df_names)
What I'd like to do next is:
# loop through every dataframe and transpose the dataframe
for i in df_names:
    df_transposed = i.pivot_table(index='sex', columns='existing_new_client', values='total_unique_patients', aggfunc = 'sum').reset_index()
    print(i)
The index of the list matches the suffix of the dataframe.
But i is being passed as a string, which throws an error. I need to build it this way rather than hard-code the DataFrames because I will not know in advance how many df_x will be created.
Thanks for your help!
Update based on R Y A N comment:
Thank you so much for this! You actually gave me another idea, so I made some tweaks. This is my final code, which produces what I want: a summary table and a transposed table for each iteration. However, I am learning that using globals() is bad practice and that I should use a dictionary instead. How could I convert this to a dictionary-based process?
# create summary tables for each pair of group by columns
groupbyColumns = [['sex','existing_new_client'], ['race_ethnicity','existing_new_client']];
for grouping in range(len(groupbyColumns)):
    grouplist = groupbyColumns[grouping].copy()
    prefix = grouplist[0]
    # count unique patients for each combination of the group-by columns
    df = data_long.groupby(groupbyColumns[grouping]).agg(total_unique_patients=('PAT_ID', 'nunique')).reset_index().fillna(0)
    # create new column to capture the variable names for aggregation and lowercase the fields
    df['variables'] = df[grouplist[0]].str.lower() + '_' + df['existing_new_client'].str.lower()
    # transpose the dataframe from long to wide
    df_transposed = df.pivot_table(index=df.columns[0], columns='existing_new_client', values='total_unique_patients', aggfunc='sum').reset_index().fillna(0)
    # create new dfs with the first group-by column as prefix
    globals()['%s_summary' % prefix] = df
    globals()['%s_summary_transposed' % prefix] = df_transposed
You can use the globals() method to index the global variable (here a DataFrame) based on the string value i in the loop. See below:
# loop through every dataframe and transpose the dataframe
for i in df_names:
    # call globals() to return a dictionary of global vars and index on i
    i_df = globals()[i]
    # changing the index= argument to indirectly reference the first column since it's 'sex' in df_0 but 'race_ethnicity' in df_1
    df_transposed = i_df.pivot_table(index=i_df.columns[0], columns='existing_new_client', values='total_unique_patients', aggfunc = 'sum')
    print(df_transposed)
However, with only the globals() lookup added, .pivot_table() raises a KeyError, because the 'sex' column exists in df_0 but not in df_1. To fix this, reference the first column indirectly (i.e. index=i_df.columns[0]) in the .pivot_table() call so that it handles the different column names of the DataFrames coming through the loop.
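For the follow-up question about replacing globals() with a dictionary, one possible sketch is shown here. It rebuilds the sample data from the question; the dictionary names `summaries` and `transposed` are assumptions made for illustration and do not appear in the original code.

```python
import pandas as pd

data_long = pd.DataFrame(
    [['A', 'F', 'B', 'existing'], ['B', 'F', 'W', 'new'], ['C', 'M', 'H', 'new'],
     ['D', 'M', 'B', 'existing'], ['E', 'F', 'A', 'existing']],
    columns=['PAT_ID', 'sex', 'race_ethnicity', 'existing_new_client'],
)
groupbyColumns = [['sex', 'existing_new_client'],
                  ['race_ethnicity', 'existing_new_client']]

# Hypothetical names: two plain dicts keyed by the first group-by column,
# used in place of the globals() assignments.
summaries = {}
transposed = {}
for cols in groupbyColumns:
    prefix = cols[0]
    # long-format summary: unique patients per group
    df = (data_long.groupby(cols)
          .agg(total_unique_patients=('PAT_ID', 'nunique'))
          .reset_index())
    summaries[prefix] = df
    # wide-format version of the same summary
    transposed[prefix] = (df.pivot_table(index=prefix,
                                         columns='existing_new_client',
                                         values='total_unique_patients',
                                         aggfunc='sum')
                          .reset_index()
                          .fillna(0))

print(transposed['sex'])
```

Each table is then looked up by key, e.g. summaries['race_ethnicity'], and looping over all of them is just a loop over summaries.items(), however many group-by pairs there are.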
Related
I have a dataframe with a ton of columns. I would like to change a subset of the column names, given as a list, to all uppercase.
The code below doesn't change the column names and the other code I've tried produces errors:
df[cols_to_cap].columns = df[cols_to_cap].columns.str.upper()
What am I missing?
Try the code below, which uses the rename function.
rename_dict = {}
for each_column in list_of_cols_in_lower_case:
    rename_dict[each_column] = each_column.upper()
df.rename(columns=rename_dict, inplace=True)  # inplace=True applies the change to df itself
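As a compact variant of the same idea, the mapping can be built with a dict comprehension. The sketch below uses a made-up three-column frame and a hypothetical `cols_to_cap` list. It also hints at why the original attempt had no effect: df[cols_to_cap] returns a new DataFrame, so assigning to its .columns never touches df.

```python
import pandas as pd

# Made-up example data; cols_to_cap is a hypothetical subset of columns.
df = pd.DataFrame({'a': [1], 'b': [2], 'c': [3]})
cols_to_cap = ['a', 'c']

# Build the old -> new mapping and let rename return the renamed frame.
df = df.rename(columns={c: c.upper() for c in cols_to_cap})
print(list(df.columns))
```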
I have a list of names. For each name, I start with my dataframe df and use the elements in the list to define new columns for the df. After my data manipulation is complete, I eventually create a new data frame whose name is partially derived from the list element.
list = ['foo','bar']
for x in list :
    df = prior_df
    (long code for manipulating df)
    new_df_x = df
    new_df_x.to_parquet('new_df_x.parquet')
    del new_df_x
new_df_foo = pd.read_parquet('new_df_foo.parquet')
new_df_bar = pd.read_parquet('new_df_bar.parquet')
new_df = pd.merge(new_df_foo ,new_df_bar , ...)
The reason I am using this approach is that, if I don't use a loop and just add the foo and bar columns one after another to the original df, my data gets really big and highly fragmented before I go from wide to long and I encounter insufficient memory error. The workaround for me is to create a loop and store the data frame for each element and then at the very end join the long-format data frames together. Therefore, I cannot use the approach suggested in other answers such as creating dictionaries etc.
I am stuck at the line
new_df_x = df
where within the loop, I am using the list element in the name of the data frame.
I'd appreciate any help.
IIUC, you only want the file names, i.e. the stored parquet files, to carry the foo and bar markers, and you can reuse the variable name itself.
list = ['foo','bar']
for x in list :
    df = prior_df
    (long code for manipulating df)
    df.to_parquet(f'new_df_{x}.parquet')
    del df
new_df_foo = pd.read_parquet('new_df_foo.parquet')
new_df_bar = pd.read_parquet('new_df_bar.parquet')
new_df = pd.merge(new_df_foo ,new_df_bar , ...)
Here is an example if you want to define DataFrame variable names from a list element.
import pandas as pd
data = {"A": [42, 38, 39],"B": [13, 25, 45]}
prior_df=pd.DataFrame(data)
list= ['foo','bar']
variables = locals()  # at module level this is the same dictionary as globals()
for x in list :
    df = prior_df.copy() # assign a dataframe copy to the variable df.
    # (sample code for manipulating df)
    #-----------------------------------
    if x=='foo':
        df['B']=df['A']+df['B'] #
    if x=='bar':
        df['B']=df['A']-df['B'] #
    #-----------------------------------
    new_df_x="new_df_{0}".format(x)
    variables[new_df_x]=df
    #del variables[new_df_x]
print(new_df_foo) # print the 1st df variable.
print(new_df_bar) # print the 2nd df variable.
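One caveat worth illustrating: writing into locals() only creates real variables at module level. Inside a function the write is silently ignored, so the same trick fails there. The sketch below uses hypothetical helper names (`try_locals`, `build_frames`) to show the failure and a dictionary-based alternative that works in both places.

```python
import pandas as pd

def try_locals():
    # Inside a function, writing to locals() does not create a variable.
    locals()["new_df_foo"] = pd.DataFrame({"A": [1]})
    try:
        return new_df_foo  # looked up as a global name, which does not exist
    except NameError:
        return None

def build_frames(names):
    # Dictionary keyed by the generated name; works anywhere.
    frames = {}
    for x in names:
        frames[f"new_df_{x}"] = pd.DataFrame({"A": [42, 38, 39]})
    return frames

print(try_locals())
print(list(build_frames(["foo", "bar"])))
```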
I've got a list of dataframes that I want filtered depending on the values in one column that all three of them have. I want to split all three dataframes into three each; one sub-dataframe for each value in that one column. So I want to make 9 dataframes out of 3.
I've tried:
df_list=[df_a,df_b,df_c]
for df_tmp in df_list:
    for i, g in df_tmp.groupby('COLUMN'):
        globals()[str(df_tmp) + str(i)] = g
But I get super weird results. Can someone help me fix that code?
Thanks!
This should give you a list with dictionaries: One dictionary for each of the original dataframes, each one containing one dataframe referenced with the unique name from 'COLUMN'.
tables = [{'df_' + name: df[df['COLUMN'] == name].copy() for name in df['COLUMN'].unique()} for df in df_list]
So, for example, you can use tables[0] to get the three dataframes derived from df_a, or tables[0]['df_foo'] to get the table from df_a with all the rows that have the value 'foo' in the column 'COLUMN'.
Or, if you want to use a dictionary to have all the df associated with keys instead of indexes in a list:
tables = {'df_' + str(i): {'df_' + name: df_list[i][df_list[i]['COLUMN'] == name].copy() for name in df_list[i]['COLUMN'].unique()} for i in range(len(df_list))}
and then you can call them as tables['df_0']['df_foo'].
You can of course create a list of names and use it to assign the keys:
df_names = ['df_a', 'df_b', 'df_c']
tables = {outer: {'df_' + name: df[df['COLUMN'] == name].copy() for name in df['COLUMN'].unique()} for outer, df in zip(df_names, df_list)}
And now you do tables['df_a']['df_foo'].
Let's say you choose to use one of the dictionaries and want to apply a single operation to all the dataframes, for example, let's say that each dataframe has a column called 'price' and you want to apply a function called get_discount(), then
for key1 in tables: # top level corresponding to [df_a,df_b,df_c]
    for key2 in tables[key1]: # bottom level corresponding to each filtered df
        tables[key1][key2]['price'] = tables[key1][key2]['price'].apply(get_discount)
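The same nested dictionary can also be built from groupby directly, which avoids filtering each frame once per unique value. This is only a sketch: the two small frames, the `frames` mapping, and the 50% get_discount function are made up for illustration.

```python
import pandas as pd

# Made-up stand-ins for df_a and df_b from the question.
df_a = pd.DataFrame({'COLUMN': ['foo', 'bar', 'foo'], 'price': [10.0, 20.0, 30.0]})
df_b = pd.DataFrame({'COLUMN': ['bar', 'baz'], 'price': [5.0, 15.0]})
frames = {'df_a': df_a, 'df_b': df_b}

# Outer key: source frame name. Inner key: 'df_' + value found in COLUMN.
tables = {
    outer: {f'df_{value}': group.copy() for value, group in df.groupby('COLUMN')}
    for outer, df in frames.items()
}

def get_discount(price):
    # Hypothetical discount function: half price.
    return price * 0.5

# Apply one operation to every sub-frame.
for parts in tables.values():
    for part in parts.values():
        part['price'] = part['price'].apply(get_discount)

print(tables['df_a']['df_foo'])
```

Because each group is copied, the original df_a and df_b are left unchanged.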
I'm checking if a dataframe is empty and then assigning a value if it is. Dataframe has columns "NAME" and "ROLE"
df = pd.DataFrame(columns = ['NAME', 'ROLE'])
if df.empty:
    df["NAME"] = "Jake"
After assigning "Jake" to "NAME", the dataframe is still empty, like so:
NAME  ROLE
but I want the dataframe to look like this:
NAME  ROLE
Jake
Assigning a scalar to a pandas dataframe column sets each value in that column to the scalar. Since you have zero rows, df["NAME"] = "Jake" doesn't assign anything. If you assign a list, however, the dataframe is extended to fit that list. To get a single row in the dataframe:
df["NAME"] = ["Jake"]
You could create more rows by adding additional values to the list being assigned.
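A small sketch of the difference, using the column names from the question (the `rows_after_*` variables are just for illustration):

```python
import pandas as pd

df = pd.DataFrame(columns=['NAME', 'ROLE'])

df['NAME'] = 'Jake'          # scalar: broadcast over zero rows, so nothing is added
rows_after_scalar = len(df)

df['NAME'] = ['Jake']        # list: the empty frame grows to match the list length
rows_after_list = len(df)

print(rows_after_scalar, rows_after_list)
print(df)
```

The ROLE column has no value for the new row, so pandas fills it with a missing value.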
As people are saying in the comments, there are no rows in your empty dataframe to assign the value "Jake" to the "Name" column. Showing that in the first example:
df = pd.DataFrame(columns=['Name','Role'])
df['Name'] = 'Jake'
print(df)
I'm guessing instead you want to add a row:
df = pd.DataFrame(columns=['Name','Role'])
df.loc[len(df)] = ['Jake', None]  # DataFrame.append was removed in pandas 2.0; assigning to a new .loc label adds a row
print(df)
If you want to change the value in more than one row, you can use the .loc indexer in pandas:
change_index = data_["sample_column"].sample(150).index # index of rows whose value will change
data_.loc[change_index, "sample_column"] = 1 # value at the specified rows and column is set to 1
Note: writing data_.loc[change_index]["sample_column"] = 1 does not work, because the chained indexing assigns to a temporary copy rather than to data_. Pass the row labels and the column name to .loc together.
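A runnable sketch of that pattern on a made-up ten-row frame (the frame contents and the sample size of 3 are assumptions for illustration):

```python
import pandas as pd

# Made-up data: ten rows, all zeros in sample_column.
data_ = pd.DataFrame({'sample_column': [0] * 10, 'other': range(10)})

# Pick three row labels at random (fixed seed so the run is repeatable).
change_index = data_['sample_column'].sample(3, random_state=0).index

# Row labels and column name go into a single .loc call.
data_.loc[change_index, 'sample_column'] = 1

print(data_)
```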
I have a data frame in PySpark with more than 100 columns. For every column name, I would like to add backticks (`) at the start and the end of the name.
For example:
column name is testing user. I want `testing user`
Is there a method to do this in pyspark/python. when we apply the code it should return a data frame.
Use list comprehension in python.
from pyspark.sql import functions as F
df = ...
df_new = df.select([F.col(c).alias("`"+c+"`") for c in df.columns])
This method also gives you the option to add custom python logic within the alias() function like: "prefix_"+c+"_suffix" if c in list_of_cols_to_change else c
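The naming logic itself is plain Python and can be tried without a Spark session. The helper below, `rename_selected`, is a hypothetical name introduced for illustration. With a real Spark DataFrame, the resulting list could then be applied through the alias() pattern above or passed to df.toDF(*new_names).

```python
def rename_selected(columns, to_change, prefix="prefix_", suffix="_suffix"):
    """Return new column names, altering only those listed in to_change."""
    to_change = set(to_change)
    return [prefix + c + suffix if c in to_change else c for c in columns]

# Example column names are made up; only two of the three are renamed.
new_names = rename_selected(["testing user", "age", "city"],
                            ["testing user", "city"])
print(new_names)
```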
To add prefix or suffix:
Use df.columns to get the list of columns ([col_1, col_2...]). This is the dataframe whose columns we want to prefix or suffix.
df.columns
Iterate through the list above and create another list of columns with aliases that can be used inside a select expression.
from pyspark.sql.functions import col
select_list = [col(col_name).alias("prefix_" + col_name) for col_name in df.columns]
When using the list inside select, do not forget to unpack it with an asterisk (*). We can assign the result back to the same or a different df.
df.select(*select_list).show()
df = df.select(*select_list)
df.columns will now return list of new columns(aliased).
If you would like to add a prefix or suffix to multiple columns in a pyspark dataframe, you could use a for loop and .withColumnRenamed().
As an example, you might like:
def add_prefix(sdf, prefix):
    for c in sdf.columns:
        sdf = sdf.withColumnRenamed(c, '{}{}'.format(prefix, c))
    return sdf
You can amend sdf.columns as you see fit.
You can use the withColumnRenamed method of the dataframe to create a new dataframe:
df.withColumnRenamed('testing user', '`testing user`')
edit : suppose you have list of columns, you can do like -
old = "First Last Age"
new = ["`"+field+"`" for field in old.split()]
df.rdd.toDF(new)
output :
DataFrame[`First`: string, `Last`: string, `Age`: string]
Here is how one can solve similar problems:
from pyspark.sql.functions import col
df.select([col(col_name).alias('prefix' + col_name + 'suffix') for col_name in df.columns])
I had a dataframe that I duplicated twice and then joined together. Since both had the same column names, I used:
from functools import reduce
df = reduce(lambda df, idx: df.withColumnRenamed(list(df.schema.names)[idx],
                                                 list(df.schema.names)[idx] + '_prec'),
            range(len(list(df.schema.names))),
            df)
Every column in my dataframe then had the '_prec' suffix, which allowed me to do sweet stuff.