I've got a list of dataframes that I want filtered depending on the values in one column that all three of them have. I want to split all three dataframes into three each; one sub-dataframe for each value in that one column. So I want to make 9 dataframes out of 3.
I've tried:
df_list = [df_a, df_b, df_c]
for df_tmp in df_list:
    for i, g in df_tmp.groupby('COLUMN'):
        globals()[str(df_tmp) + str(i)] = g
But I get super weird results. Can someone help me fix that code?
Thanks!
This should give you a list of dictionaries: one dictionary for each of the original dataframes, each containing one sub-dataframe per unique value of 'COLUMN', keyed by that value.
tables = [{'df_' + name: df[df['COLUMN'] == name].copy() for name in df['COLUMN'].unique()} for df in df_list]
So, for example, you can call tables[0] to get the three dataframes derived from df_a, or tables[0]['df_foo'] to get the table from df_a with all the rows whose value in the column 'COLUMN' is 'foo'.
Or, if you want to use a dictionary so that all the dataframes are associated with keys instead of positions in a list:
tables = {'df_' + str(i): {'df_' + name: df[df['COLUMN'] == name].copy() for name in df['COLUMN'].unique()} for i, df in enumerate(df_list)}
and then you can call them as tables['df_0']['df_foo'].
You can of course create a list of names and use it to assign the keys:
df_names = ['df_a', 'df_b', 'df_c']
tables = {df_name: {'df_' + value: df[df['COLUMN'] == value].copy() for value in df['COLUMN'].unique()} for df_name, df in zip(df_names, df_list)}
And now you do tables['df_a']['df_foo'].
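As an aside, since the original attempt already uses groupby, each per-dataframe dictionary can also be built directly from it; a minimal sketch, shown for df_a (the same works for the others):
# each key is one unique value of 'COLUMN', each value the matching sub-dataframe
tables_a = {'df_' + str(key): group.copy() for key, group in df_a.groupby('COLUMN')}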
Let's say you choose to use one of the dictionaries and want to apply a single operation to all the dataframes, for example, let's say that each dataframe has a column called 'price' and you want to apply a function called get_discount(), then
for key1 in tables:  # top level, corresponding to df_a, df_b, df_c
    for key2 in tables[key1]:  # bottom level, corresponding to each filtered df
        tables[key1][key2]['price'] = tables[key1][key2]['price'].apply(get_discount)
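For completeness, a minimal sketch of what get_discount() might look like (a hypothetical flat 10% discount on a single price value) so that the loop above runs end to end:
def get_discount(price):
    # hypothetical example: apply a flat 10% discount to one price value
    return price * 0.9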
I am not sure how to build a data frame here but I am looking for a way to take the data from multiple columns and combine them into 1 column. Not as a sum but as a joined value.
Ex. MB|Val|34567|W123 -> MB|Val|34567|W123|MB_Val_34567_W123.
What I have tried so far is creating a conditions variable that checks whether a particular column equals a given value
conditions = [(Groupings_df['GroupingCriteria1'] == 'MB')]
then a values variable that would include what I want in the new column
values = ['MB_Val_34567_W123']
and lastly grouping it
Groupings_df['GroupingColumn'] = np.select(conditions,values)
This works for 1 row but it would be inefficient to keep manually changing the number in the values variable (34567) over a df with thousands of rows
IIUC, you want to create a new column as a concatenation of each row:
df = pd.DataFrame({'GC1': ['MB'], 'GC2': ['Val'], 'GC3': [34567], 'GC4': ['W123'],
                   'Dummy': [10], 'Other': ['Hello']})
df['GC'] = df.filter(like='GC').astype(str).apply(lambda x: '_'.join(x), axis=1)
print(df)
# Output
GC1 GC2 GC3 GC4 Dummy Other GC
0 MB Val 34567 W123 10 Hello MB_Val_34567_W123
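If you prefer to be explicit about which columns get joined (rather than relying on filter(like='GC')), the same result can be obtained by listing them; a minimal variant:
df['GC'] = df[['GC1', 'GC2', 'GC3', 'GC4']].astype(str).agg('_'.join, axis=1)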
I have a use case where I have an unknown list of dfs that are generated from a groupby. The group-by column sets are contained in a list. Once the groupby is done, a unique df is created for each iteration.
I can dynamically create a list and a dictionary of the dataframe names; however, I cannot figure out how to use the position in the list/dictionary as the name of the dataframe. Data and code below:
Code so far:
data_list = [['A', 'F', 'B', 'existing'], ['B', 'F', 'W', 'new'], ['C', 'M', 'H', 'new'], ['D', 'M', 'B', 'existing'], ['E', 'F', 'A', 'existing']]

# Create the pandas DataFrame
data_long = pd.DataFrame(data_list, columns=['PAT_ID', 'sex', 'race_ethnicity', 'existing_new_client'])

groupbyColumns = [['sex', 'existing_new_client'], ['race_ethnicity', 'existing_new_client']]

def sex_race_summary(groupbyColumns):
    grouplist = groupbyColumns[grouping].copy()
    # aggregate unique patient counts for the current group-by columns
    df = data_long.groupby(groupbyColumns[grouping]).agg(total_unique_patients=('PAT_ID', 'nunique')).reset_index()
    # create new column to capture the variable names for aggregation and lowercase the fields
    df['variables'] = df[grouplist[0]].str.lower() + '_' + df['existing_new_client'].str.lower()
    return df

for grouping in range(len(groupbyColumns)):
    exec(f'df_{grouping} = sex_race_summary(groupbyColumns)')
print(df_0)
# create a dictionary
dict_of_dfs = dict()

# a list of the dataframe names
df_names = []
for i in range(len(groupbyColumns)):
    df_names.append('df_' + str(i))

print(df_names)
What I'd like to do next is:
# loop through every dataframe and transpose the dataframe
for i in df_names:
    df_transposed = i.pivot_table(index='sex', columns='existing_new_client', values='total_unique_patients', aggfunc='sum').reset_index()
    print(i)
The index of the list matches the suffix of the dataframe.
But i is being passed as a string and is thus throwing an error. The reason I need to build it this way, rather than hard-coding the dataframes, is that I will not know in advance how many df_x will be created.
Thanks for your help!
Update based on R Y A N's comment:
Thank you so much for this! You actually gave me another idea, so I made some tweaks. This is my final code, which results in what I want: a summary and a transposed table for each iteration. However, I am learning that globals() is bad practice and that I should use a dictionary instead. How could I convert this to a dictionary-based process?
# create summary tables for each pair of group-by columns
groupbyColumns = [['sex', 'existing_new_client'], ['race_ethnicity', 'existing_new_client']]

for grouping in range(len(groupbyColumns)):
    grouplist = groupbyColumns[grouping].copy()
    prefix = grouplist[0]
    # aggregate unique patient counts for the current group-by columns
    df = data_long.groupby(groupbyColumns[grouping]).agg(total_unique_patients=('PAT_ID', 'nunique')).reset_index().fillna(0)
    # create new column to capture the variable names for aggregation and lowercase the fields
    df['variables'] = df[grouplist[0]].str.lower() + '_' + df['existing_new_client'].str.lower()
    # transpose the dataframe from long to wide
    df_transposed = df.pivot_table(index=df.columns[0], columns='existing_new_client', values='total_unique_patients', aggfunc='sum').reset_index().fillna(0)
    # create new global dataframes named with the grouping prefix
    globals()['%s_summary' % prefix] = df
    globals()['%s_summary_transposed' % prefix] = df_transposed
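For reference, a minimal dictionary-based sketch of the same loop (the container names summaries and summaries_transposed are my own choices, not from the original code):
summaries = {}
summaries_transposed = {}

for grouping in range(len(groupbyColumns)):
    grouplist = groupbyColumns[grouping].copy()
    prefix = grouplist[0]
    df = data_long.groupby(groupbyColumns[grouping]).agg(total_unique_patients=('PAT_ID', 'nunique')).reset_index().fillna(0)
    df['variables'] = df[grouplist[0]].str.lower() + '_' + df['existing_new_client'].str.lower()
    df_transposed = df.pivot_table(index=df.columns[0], columns='existing_new_client', values='total_unique_patients', aggfunc='sum').reset_index().fillna(0)
    # store the results under the prefix instead of creating global variables
    summaries[prefix] = df
    summaries_transposed[prefix] = df_transposed

# e.g. summaries['sex'] or summaries_transposed['race_ethnicity']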
You can use the globals() function to index the global variable (here a DataFrame) based on the string value i in the loop. See below:
# loop through every dataframe and transpose the dataframe
for i in df_names:
    # call globals() to return a dictionary of global vars and index on i
    i_df = globals()[i]
    # change the index= argument to indirectly reference the first column, since it's 'sex' in df_0 but 'race_ethnicity' in df_1
    df_transposed = i_df.pivot_table(index=i_df.columns[0], columns='existing_new_client', values='total_unique_patients', aggfunc='sum')
    print(df_transposed)
However, with just this line added, the .pivot_table() call raises an index error, since the 'sex' column exists only in df_0 and not in df_1. To fix this, you can indirectly reference the first column (i.e. index=i_df.columns[0]) in the .pivot_table() call so it handles the different column names of the dfs coming through the loop.
I have a list of unique names (4,300 to be exact). unique_names = ['James', 'Erika', 'Akshay', 'Neil', etc..].
I have a column in a dataframe, where every row has its own list of names.
I have to find out which rows in this column contain a name from my unique_names list.
I have tried masking, but every time it only gives back 2 rows rather than all the rows that contain a name from my list unique_names.
for name in unique_names:
    if name in unique_names:
        mask = df['names'].apply(lambda x: name in x)

df1 = df[mask]
My expected result is every row that contains a name from my list unique_names; instead, I only get back two rows, which contain the name 'Akshay' in their list of names, even though I can see other rows containing names like 'Neil' and 'Erika' that are not returned.
I would expect that the following would suffice.
mask = df['names'].apply(lambda x: any(name in x for name in unique_names))
If unique_names is a set and the number of names per row is small:
mask = df['names'].apply(lambda x: any(name in unique_names for name in x))
Or:
mask = df['names'].apply(lambda x: not unique_names.isdisjoint(x))
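Whichever variant you use, the boolean mask then filters the frame in one step, e.g. df1 = df[mask].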
I would rethink how you are doing this problem. First things first: your original code iterates over names from a container called unique_names, then checks whether each one is in unique_names. Every single iteration will pass that test because you pull the names from the same container you test for membership of.
My best advice would be to iterate over the rows rather than the names. A rough sketch of that idea:
rows_with_unique = []
for idx, row in df.iterrows():
    for name in unique_names:
        if name in row['names']:
            rows_with_unique.append(row)  # or whatever you are trying to extract
            break
I have a list composed of DataFrames, and I would like to iterate over it and insert a column into each DataFrame based on an array.
Below is a small example that I have created for illustrative purposes. I would do this manually if it was only 4 DataFrames but my dataset is much larger:
#Create dataframes
df1 = pd.DataFrame(list(range(0,10)))
df2 = pd.DataFrame(list(range(10,20)))
df3 = pd.DataFrame(list(range(20,30)))
df4 = pd.DataFrame(list(range(30,40)))
#Create list of Dataframes
listed_dfs = [df1,df2,df3,df4]
#Create list of dates
Dates = ['2015-05-15','2015-02-17', '2014-11-14', '2014-08-14']
#Objective: Sequentially append each instance of "Dates" to a new column in each dataframe
#First, create list of locations for iterations
locations = [0,1,2,3]
#Second, create for loop to iterate over [Need help here]
#Example: for the 1st Dataframe in the list of dataframes, add a column 'Date' that
# has the 1st instance of the 'Dates' list for every row,
# then for the 2nd DataFrame in the list of dataframes, add the 2nd instance of the 'Dates' list for every row
for i in Dates:
    for a in locations:
        listed_dfs[a]['Date'] = i

print(listed_dfs)
The problem with the above for loop is that it applies the last date first, then it does not apply the 2nd date to the 2nd DataFrame, only the 1st date for each DataFrame.
Desired Output from for loop:
listed_dfs[0]['Date'] = Dates[0]
listed_dfs[1]['Date'] = Dates[1]
listed_dfs[2]['Date'] = Dates[2]
listed_dfs[3]['Date'] = Dates[3]
pd.concat(listed_dfs)
Change your for loop to
for i, j in zip(Dates, locations):
    listed_dfs[j]['Date'] = i
Going with your desired output:
listed_dfs[0]['Date'] = Dates[0]
listed_dfs[1]['Date'] = Dates[1]
listed_dfs[2]['Date'] = Dates[2]
listed_dfs[3]['Date'] = Dates[3]
pd.concat(listed_dfs)
Notice that the index values are the same for a row: 0 and 0, 1 and 1, and so on. That's essentially what you need.
for i in range(len(Dates)):
    listed_dfs[i]['Date'] = Dates[i]

pd.concat(listed_dfs)
If I have understood it well, the problem is that you are overwriting the column 'Date' in all four dataframes on each iteration over Dates. A solution may be a single for loop like this:
for a in locations:
    listed_dfs[a]['Date'] = Dates[a]
If, as in your example, you loop through your dataframes sequentially, you can zip dataframes and dates as below.
for df, date in zip(listed_dfs, Dates):
    df['Date'] = date
This removes the need for the locations list.
I have a data frame in pyspark with more than 100 columns. What I want to do is add backticks (`) at the start and end of every column name.
For example:
column name is testing user. I want `testing user`
Is there a method to do this in pyspark/python? When we apply the code, it should return a data frame.
Use a list comprehension in Python.
from pyspark.sql import functions as F
df = ...
df_new = df.select([F.col(c).alias("`"+c+"`") for c in df.columns])
This method also gives you the option to add custom python logic within the alias() function like: "prefix_"+c+"_suffix" if c in list_of_cols_to_change else c
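For example, a minimal sketch of that conditional renaming (list_of_cols_to_change is a hypothetical list of the columns you actually want to touch):
from pyspark.sql import functions as F

# hypothetical list of columns that should get the prefix/suffix; all others keep their names
list_of_cols_to_change = ['testing user']

df_new = df.select([
    F.col(c).alias('prefix_' + c + '_suffix' if c in list_of_cols_to_change else c)
    for c in df.columns
])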
To add a prefix or suffix:
Refer to df.columns for the list of columns ([col_1, col_2, ...]) of the dataframe whose columns we want to prefix/suffix.
df.columns
Iterate through the above list and create another list of aliased columns that can be used inside a select expression.
from pyspark.sql.functions import col
select_list = [col(col_name).alias("prefix_" + col_name) for col_name in df.columns]
When using it inside select, do not forget to unpack the list with an asterisk (*). We can assign the result back to the same or a different df for use.
df.select(*select_list).show()
df = df.select(*select_list)
df.columns will now return the list of new (aliased) columns.
If you would like to add a prefix or suffix to multiple columns in a pyspark dataframe, you could use a for loop and .withColumnRenamed().
As an example, you might like:
def add_prefix(sdf, prefix):
    for c in sdf.columns:
        sdf = sdf.withColumnRenamed(c, '{}{}'.format(prefix, c))
    return sdf
You can amend sdf.columns as you see fit.
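For instance, a quick usage sketch (the prefix string is just an example):
sdf = add_prefix(sdf, 'prefix_')
print(sdf.columns)  # e.g. ['prefix_col_1', 'prefix_col_2', ...]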
You can use the withColumnRenamed method of the dataframe to create a new dataframe
df.withColumnRenamed('testing user', '`testing user`')
Edit: suppose you have a list of columns; you can do it like this -
old = "First Last Age"
new = ["`"+field+"`" for field in old.split()]
df.rdd.toDF(new)
Output:
DataFrame[`First`: string, `Last`: string, `Age`: string]
Here is how one can solve similar problems:
df.select([col(col_name).alias('prefix' + col_name + 'suffix') for col_name in df.columns])
I had a dataframe that I duplicated twice and then joined together. Since both had the same column names, I used:
from functools import reduce

df = reduce(lambda df, idx: df.withColumnRenamed(list(df.schema.names)[idx],
                                                 list(df.schema.names)[idx] + '_prec'),
            range(len(list(df.schema.names))),
            df)
Every column in my dataframe then had the '_prec' suffix, which allowed me to do sweet stuff.