Creating a function to pass dataframes as parameters - python

I have a hypothetical dataframe 'country_sales_df':
import pandas as pd
import numpy as np

country_sales_list_of_lists = [
    ['Australia', 21421324, 342343, 'Pacific', 'Y'],
    ['England', 124233431, 43543464, 'Europe', 'Y'],
    ['Japan', 12431241341, 34267545, 'Asia', 'N'],
    ['India', 214343421, 342343, 'Asia', 'Y'],
]
country_sales_df = pd.DataFrame(country_sales_list_of_lists, columns=['Country', 'Sales', 'Profit', 'Region', 'Otac_Group'])
I then define a series of dataframes from the original country sales dataframe:
otac_df= country_sales_df.query('Otac_Group == "Y"')
asia_df= country_sales_df.query('Region == "Asia"')
europe_df= country_sales_df.query('Region == "Europe"')
pacific_df= country_sales_df.query('Region == "Pacific"')
For each of the dataframes I want to aggregate all the numeric fields and create an additional dataframe with the aggregated information. I don't want to repeat the agg code for each dataframe, as the actual project I'm working on will have significantly more lines of code; this is just a smaller example.
How would I create a function to do this? I tried the below, but it returns the error TypeError: 'DataFrameGroupBy' object is not callable:
def country_report_func(df_name, region_df):
    df_name = region_df.groupby('Country')(['Sales','Profit']).agg([np.sum])
country_report_func('pacific_df_agg',pacific_df)
country_report_func('europe_df_agg',europe_df)
country_report_func('asia_df_agg',asia_df)
country_report_func('otac_df_agg',otac_df)
I'm basically just trying to get a piece of code to run for each of the dataframes I have defined and produce an additional dataframe for each. Does anyone have any recommendations on the best way to do this, i.e. loop through a list of dataframes, etc.?
Update:
I have now updated the function so it applies the agg function to a dataframe object and returns the dataframe from within the function. This now returns pacific_df_agg, however I'm unable to print it. The Europe, Asia and otac dataframes are also not created.
def country_report_func(df_name, region_df):
    df_name = region_df.groupby('Country')[['Sales', 'Profit']].agg([np.sum])
    return df_name
country_report_func('pacific_df_agg',pacific_df)
country_report_func('europe_df_agg',europe_df)
country_report_func('asia_df_agg',asia_df)
country_report_func('otac_df_agg',otac_df)
Update 2:
I think I have solved it, as I am now returning multiple dataframes from the function using the below code. I'm unsure if this is the easiest way to do it, so any further suggestions are welcome:
def country_report_func(df_name, region_df):
    df_name = region_df.groupby('Country')[['Sales', 'Profit']].agg([np.sum])
    return df_name
pacific_df_agg = country_report_func('pacific_df_agg',pacific_df)
europe_df_agg = country_report_func('europe_df_agg',europe_df)
asia_df_agg = country_report_func('asia_df_agg',asia_df)
otac_df_agg = country_report_func('otac_df_agg',otac_df)
print(pacific_df_agg)
print(europe_df_agg)
print(asia_df_agg)
print(otac_df_agg)
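If you'd rather not repeat the four assignments at all, one option is to keep the region dataframes in a dict and build the aggregates in a comprehension; a minimal sketch, assuming the dataframes defined above:
# map output names to the input dataframes defined earlier
region_dfs = {
    'pacific_df_agg': pacific_df,
    'europe_df_agg': europe_df,
    'asia_df_agg': asia_df,
    'otac_df_agg': otac_df,
}

# aggregate each one; the results stay addressable by name
agg_dfs = {name: df.groupby('Country')[['Sales', 'Profit']].agg('sum')
           for name, df in region_dfs.items()}

for name, agg_df in agg_dfs.items():
    print(name)
    print(agg_df)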

Related

How to rename a column while merging in pandas

I am using a for loop to merge many different dataframes. Each dataframe contains values from a specific time period. As such the column in each df is named "balance". In order to avoid creating multiple balance_x, balance_y... I want to name the columns using the name of the df.
So far, I have the following:
top = topaccount_2021_12
top = top.rename(columns={"balance": "topaccount_2021_12"})
for i in [topaccount_2021_09, topaccount_2021_06, topaccount_2021_03,
          topaccount_2020_12, topaccount_2020_09, topaccount_2020_06, topaccount_2020_03,
          topaccount_2019_12, topaccount_2019_09, topaccount_2019_06, topaccount_2019_03,
          topaccount_2018_12, topaccount_2018_09, topaccount_2018_06, topaccount_2018_03,
          topaccount_2017_12, topaccount_2017_09, topaccount_2017_06, topaccount_2017_03,
          topaccount_2016_12, topaccount_2016_09, topaccount_2016_06, topaccount_2016_03,
          topaccount_2015_12, topaccount_2015_09]:
    top = top.merge(i, on='address', how='left')
    top = top.rename(columns={'balance': i})
But I get the error message:
TypeError: Cannot convert bool to numpy.ndarray
Any idea how to solve this? Thanks!
I assume each topaccount_* is a dataframe. I'm a bit confused by top = top.rename(columns={'balance': i}): what do you want to achieve here? The rename function renames a column given a key (the original column name) and a value (the new column name), but instead of passing a string as the value, you are passing a dataframe.
Edit
# store the dataframes in a dictionary, keyed by name
dictOfDf = {
    'topaccount_2021_09': topaccount_2021_09,
    'topaccount_2021_06': topaccount_2021_06,
    ...
    'topaccount_2015_09': topaccount_2015_09,
}

# use the first key to initialise the merged dataframe
keys = list(dictOfDf.keys())
top = dictOfDf[keys[0]]
top = top.rename(columns={"balance": keys[0]})

# iterate through the remaining keys; i is now a string, so it can be
# used both to look up the dataframe and to name the merged column
for i in keys[1:]:
    top = top.merge(dictOfDf[i], on='address', how='left')
    top = top.rename(columns={'balance': i})
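If you want to avoid the explicit loop, a possible alternative is functools.reduce over the dict items; a sketch under the same assumptions (every frame has an address and a balance column):
from functools import reduce

def merge_one(left, item):
    # item is a (name, dataframe) pair; rename balance before merging
    name, df = item
    return left.merge(df.rename(columns={'balance': name}), on='address', how='left')

items = list(dictOfDf.items())
first_name, first_df = items[0]
top = reduce(merge_one, items[1:], first_df.rename(columns={'balance': first_name}))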

How to get dataframe from groupby

I am doing a groupby practice, but it is returning a dict, not a dataframe. I followed some of the solutions from Stack Overflow with no luck.
My code:
result[comNewColName] = sourceDF.groupby(context, as_index=False)[aggColumn].agg(aggOperation).reset_index()
and I tried:
result[comNewColName] = sourceDF.groupby(context)[aggColumn].agg(aggOperation).reset_index()
and
result[comNewColName] = sourceDF.groupby(context, as_index=False)[aggColumn].agg(aggOperation)
In all three cases I am getting a dict only, but I should get a dataframe.
Here:
comNewColName = "totalAmount"
context =['clientCode']
aggColumn = 'amount'
aggOperation = 'sum'
If you need a new column created from the aggregated values, use GroupBy.transform, and assign it to sourceDF:
sourceDF[comNewColName] = sourceDF.groupby(context)[aggColumn].transform(aggOperation)
Your solution already returns a DataFrame:
df = sourceDF.groupby(context)[aggColumn].agg(aggOperation).reset_index()
print(type(df))
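A toy example contrasting the two (illustrative data only): agg with reset_index gives one row per group, while transform broadcasts the group result back onto the original rows:
import pandas as pd

sourceDF = pd.DataFrame({'clientCode': ['A', 'A', 'B'], 'amount': [10, 20, 5]})

# one row per clientCode
agg_df = sourceDF.groupby(['clientCode'])['amount'].agg('sum').reset_index()
print(agg_df)

# same length as sourceDF, so it can be assigned as a new column
sourceDF['totalAmount'] = sourceDF.groupby(['clientCode'])['amount'].transform('sum')
print(sourceDF)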

Why is the function not applying to the dataframe? Series object has no attribute 'query' (Pandas)

I have two dataframes, users and calls, where the common column is user_id. I need to drop rows in the users dataframe where churn_date is not null, and remove those user_id rows from calls.
users = user_id,first_name,last_name,age,city,reg_date,plan,churn_date
1000,Anamaria,Bauer,45,"Atlanta-Sandy Springs-Roswell, GA MSA",2018-12-24,ultimate,
1001,Mickey,Wilkerson,28,"Seattle-Tacoma-Bellevue, WA MSA",2018-08-13,surf,
1002,Carlee,Hoffman,36,"Las Vegas-Henderson-Paradise, NV MSA",2018-10-21,surf,
1003,Reynaldo,Jenkins,52,"Tulsa, OK MSA",2018-01-28,surf,
1004,Leonila,Thompson,40,"Seattle-Tacoma-Bellevue, WA MSA",2018-05-23,surf,
1005,Livia,Shields,31,"Dallas-Fort Worth-Arlington, TX MSA",2018-11-29,surf,
1007,Eusebio,Welch,42,"Grand Rapids-Kentwood, MI MSA",2018-07-11,surf,
1008,Emely,Hoffman,53,"Orlando-Kissimmee-Sanford, FL MSA",2018-08-03,ultimate,
1009,Gerry,Little,19,"San Jose-Sunnyvale-Santa Clara, CA MSA",2018-04-22,surf,
1010,Wilber,Blair,52,"Dallas-Fort Worth-Arlington, TX MSA",2018-03-09,surf,
calls = id,user_id,call_date,duration
1000_93,1000,2018-12-27,8.52
1000_145,1000,2018-12-27,13.66
1000_247,1000,2018-12-27,14.48
1000_309,1000,2018-12-28,5.76
1000_380,1000,2018-12-30,4.22
1000_388,1000,2018-12-31,2.2
1000_510,1000,2018-12-27,5.75
1000_521,1000,2018-12-28,14.18
1000_530,1000,2018-12-28,5.77
1000_544,1000,2018-12-26,4.4
filter_user = users[users['churn_date'].notnull()]["user_id"].tolist()
I am creating a function that uses the list of user_ids from filter_user:
def new(df):
    df = df.query('user_id != @filter_user')
    return df
I want to remove the rows containing user_ids from filter_user in the other dataframe, which is why I am applying the above function to it:
calls.apply(new,axis=1)
AttributeError: 'Series' object has no attribute 'query'
Why is this error occurring?
When you run calls.apply(some_action, axis=1), it calls the function some_action on every row of your dataframe calls, and each row is passed in as a pd.Series, which has no query method.
So you should either change your new function to work with a pd.Series of rows, or filter the users using another technique. The easiest way is the df.isin() method:
df = df[~df.user_id.isin(filter_user)]
df.isin checks whether each element in the DataFrame is contained in the given values; the ~ negates the mask, so the rows whose user_id appears in filter_user are dropped.
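If you want to keep the query approach from the question, call the function on the whole dataframe rather than through apply; in query strings, local variables are referenced with @. A sketch using the names above:
def new(df):
    # @filter_user refers to the local list; no apply needed
    return df.query('user_id not in @filter_user')

filtered_calls = new(calls)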
Try this:
users_to_remove = users.loc[users.churn_date.notnull(), 'user_id']
filtered_calls = calls[~calls.user_id.isin(users_to_remove)]

How to assign objects to a column using the .between_time function

I am trying to label a data frame with on-peak, mid-peak, off-peak, etc. I managed to select the values I want to assign 'Mid-Peak' to with df['Peak'][df['func'] == 'Winter_Weekend']. However, when I include .between_time I get the error SyntaxError: can't assign to function call. I am not sure how to fix this. My goal is for the code to work like the line below. Do I need another function, or do I need to change the syntax? Thank you for the help.
df['Peak'][df['func'] == 'Winter_Weekend'].between_time('16:00','21:00', include_end=False) = 'Mid-Peak'
In general, you can't assign a result to a function call, so need a different syntax. You could try
selection = df[df['func'] == 'Winter_Weekend'].between_time('16:00','21:00', include_end=False)
selection["Peak"] = "Mid-Peak"
But this doesn't update your original df, only the rows copied into selection.
To update the original dataframe, one way is to use loc to select both rows and a column, and .index to apply the between_time selection to the original dataframe:
ww = df["func"] == "Winter_Weekend"
df.loc[df[ww].between_time('16:00', '21:00', include_end=False).index, "Peak"] = "Mid-Peak"
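A minimal runnable sketch of that pattern (illustrative data; note that between_time requires a DatetimeIndex):
import pandas as pd

df = pd.DataFrame(
    {'func': ['Winter_Weekend', 'Winter_Weekend', 'Summer'],
     'Peak': ['Off-Peak', 'Off-Peak', 'Off-Peak']},
    index=pd.to_datetime(['2021-01-02 17:00', '2021-01-02 22:00', '2021-07-03 17:00']),
)

ww = df["func"] == "Winter_Weekend"
df.loc[df[ww].between_time('16:00', '21:00', include_end=False).index, "Peak"] = "Mid-Peak"
print(df)  # only the 17:00 Winter_Weekend row becomes Mid-Peak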
I would recommend leveraging np.where() here. Note that the condition must be a boolean mask aligned with df (assuming numpy is imported as np), so combine the func test with an index check against between_time, as follows:
in_window = df.index.isin(df.between_time('16:00', '21:00', include_end=False).index)
df['Peak'] = np.where((df['func'] == 'Winter_Weekend') & in_window, 'Mid-Peak', df['Peak'])

For loop error in PySpark

I am facing the following problem:
I have a list which I need to compare with the elements of a column in a dataframe (acc_name). I am using the following loop, but it only returns 1 record when it should return 30.
Using PySpark:
bs_list = [
    'AC_E11', 'AC_E12', 'AC_E13', 'AC_E135', 'AC_E14', 'AC_E15', 'AC_E155', 'AC_E157',
    'AC_E16', 'AC_E163', 'AC_E165', 'AC_E17', 'AC_E175', 'AC_E180', 'AC_E185', 'AC_E215',
    'AC_E22', 'AC_E225', 'AC_E23', 'AC_E23112', 'AC_E235', 'AC_E245', 'AC_E258', 'AC_E25',
    'AC_E26', 'AC_E265', 'AC_E27', 'AC_E275', 'AC_E31', 'AC_E39', 'AC_E29']

for i in bs_list:
    bs_acc1 = (acc
               .filter(i == acc.acc_name)
               .select(acc.acc_name, acc.acc_description))
The bs_list elements are a subset of the acc_name column. I am trying to create a new DF with the two columns acc_name and acc_description, containing only the rows whose acc_name is present in bs_list.
Please let me know where I am going wrong.
That's because in the loop, every time you filter on i you create a brand-new dataframe bs_acc1, overwriting the previous one. So at the end it only shows you the 1 row belonging to the last value in bs_list, i.e. the row for 'AC_E29'.
One way to do it is to union each filtered result into an accumulator dataframe, so the previous results also remain:
# create an empty dataframe; give a schema appropriate to your data
bs_acc1 = sqlContext.createDataFrame(sc.emptyRDD(), schema)
for i in bs_list:
    bs_acc1 = bs_acc1.union(
        acc
        .filter(i == acc.acc_name)
        .select(acc.acc_name, acc.acc_description)
    )
A better way is not to loop at all; isin is a Column method, so no extra imports are needed:
bs_acc1 = acc.where(acc.acc_name.isin(bs_list))
You can also transform bs_list into a dataframe with a column acc_name and then just join it to the acc dataframe:
from pyspark.sql import Row

bs_rdd = spark.sparkContext.parallelize(bs_list)
bs_df = bs_rdd.map(lambda x: Row(acc_name=x)).toDF()
bs_join_df = bs_df.join(acc, on='acc_name')
bs_join_df.show()
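A small simplification of the same idea (a sketch, assuming a SparkSession named spark): build the lookup dataframe directly, without going through an RDD:
# each list element becomes a one-column row
bs_df = spark.createDataFrame([(x,) for x in bs_list], ['acc_name'])
bs_join_df = bs_df.join(acc, on='acc_name')
bs_join_df.show()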
