How to get dataframe from groupby - python

I am practising groupby, but it is returning a dict, not a DataFrame. I followed some of the solutions from Stack Overflow, but had no luck.
My code:
result[comNewColName] = sourceDF.groupby(context, as_index=False)[aggColumn].agg(aggOperation).reset_index()
and I tried:
result[comNewColName] = sourceDF.groupby(context)[aggColumn].agg(aggOperation).reset_index()
and
result[comNewColName] = sourceDF.groupby(context, as_index=False)[aggColumn].agg(aggOperation)
In all three cases I am getting a dict only, but I should get a DataFrame.
Here:
comNewColName = "totalAmount"
context =['clientCode']
aggColumn = 'amount'
aggOperation = 'sum'

If you need a new column created from the aggregated values, use GroupBy.transform, but assign it to sourceDF:
sourceDF[comNewColName] = sourceDF.groupby(context)[aggColumn].transform(aggOperation)
Your solution already returns a DataFrame:
df = sourceDF.groupby(context)[aggColumn].agg(aggOperation).reset_index()
print (type(df))
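For example, a minimal sketch with invented clientCode/amount data; it suggests that the dict you are seeing is the result container itself, not the groupby output:
import pandas as pd

# Invented sample data using the column names from the question
sourceDF = pd.DataFrame({'clientCode': ['A', 'A', 'B'], 'amount': [10, 20, 5]})

df = sourceDF.groupby(['clientCode'])['amount'].agg('sum').reset_index()
print(type(df))  # <class 'pandas.core.frame.DataFrame'>

# Storing that DataFrame under a key is what makes `result` a dict
result = {}
result['totalAmount'] = df
print(type(result))                  # <class 'dict'>
print(type(result['totalAmount']))   # <class 'pandas.core.frame.DataFrame'>

# transform keeps the original length, so it can be assigned back as a column
sourceDF['totalAmount'] = sourceDF.groupby(['clientCode'])['amount'].transform('sum')
print(sourceDF)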

Creating a function to pass dataframes as parameters

I have a hypothetical dataframe 'country_sales_df':
country_sales_list_of_lists = [
    ['Australia', 21421324, 342343, 'Pacific', 'Y'],
    ['England', 124233431, 43543464, 'Europe', 'Y'],
    ['Japan', 12431241341, 34267545, 'Asia', 'N'],
    ['India', 214343421, 342343, 'Asia', 'Y']
]
country_sales_df = pd.DataFrame(country_sales_list_of_lists, columns=['Country', 'Sales', 'Profit', 'Region', 'Otac_Group'])
I then define a series of dataframes from the original country sales dataframe:
otac_df= country_sales_df.query('Otac_Group == "Y"')
asia_df= country_sales_df.query('Region == "Asia"')
europe_df= country_sales_df.query('Region == "Europe"')
pacific_df= country_sales_df.query('Region == "Pacific"')
For each of the dataframes I want to aggregate all the numeric fields and create an additional dataframe with the aggregated information. I don't want to repeat the agg code for each dataframe, as the actual project I'm working on will have significantly more lines of code; this is just a smaller example.
How would I create a function to do this? I tried the below, but it returns the error 'TypeError: 'DataFrameGroupBy' object is not callable':
def country_report_func(df_name, region_df):
    df_name = region_df.groupby('Country')(['Sales','Profit']).agg([np.sum])
country_report_func('pacific_df_agg',pacific_df)
country_report_func('europe_df_agg',europe_df)
country_report_func('asia_df_agg',asia_df)
country_report_func('otac_df_agg',otac_df)
I'm basically just trying to get a piece of code to run for each of the dataframes I have defined and produce an additional dataframe for each. Does anyone have any recommendations on the best way to do this, e.g. looping through a list of dataframes?
Update:
I have now updated the function so it applies the agg function to a dataframe object and returns the dataframe from within the function. This now returns pacific_df_agg; however, I'm unable to print it. The Europe, Asia and otac dataframes are also not created.
def country_report_func(df_name, region_df):
    df_name = region_df.groupby('Country')[['Sales','Profit']].agg([np.sum])
    return df_name
country_report_func('pacific_df_agg',pacific_df)
country_report_func('europe_df_agg',europe_df)
country_report_func('asia_df_agg',asia_df)
country_report_func('otac_df_agg',otac_df)
Update 2:
I think I have solved it, as I am now returning multiple dataframes from the function using the code below. I'm unsure whether this is the easiest way to do it, so any further suggestions are welcome:
def country_report_func(df_name, region_df):
    df_name = region_df.groupby('Country')[['Sales','Profit']].agg([np.sum])
    return df_name
pacific_df_agg = country_report_func('pacific_df_agg',pacific_df)
europe_df_agg = country_report_func('europe_df_agg',europe_df)
asia_df_agg = country_report_func('asia_df_agg',asia_df)
otac_df_agg = country_report_func('otac_df_agg',otac_df)
print(pacific_df_agg)
print(europe_df_agg)
print(asia_df_agg)
print(otac_df_agg)
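If you want to avoid repeating the four calls, here is a minimal sketch driven by a dict, reusing the regional dataframes defined above ('sum' is passed as a string here, which keeps flat column names, whereas [np.sum] adds a second column level):
region_frames = {
    'pacific_df_agg': pacific_df,
    'europe_df_agg': europe_df,
    'asia_df_agg': asia_df,
    'otac_df_agg': otac_df,
}

# One aggregated dataframe per region, keyed by the same names
agg_frames = {
    name: frame.groupby('Country')[['Sales', 'Profit']].agg('sum')
    for name, frame in region_frames.items()
}

print(agg_frames['pacific_df_agg'])
Adding another region then only needs one extra dictionary entry rather than another function call.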

How to edit/ sort a non-column column in Python?

I wrote the script below, and I'm 98% content with the output. However, the unorganised order of the 'Approved' field bugs me. As you can see, I tried to sort the values using .sort_values() but was unsuccessful. The output of the script is below, as is the list of fields in the data frame.
df = df.replace({'Citizen': {1: 'Yes',
0: 'No'}})
grp_by_citizen = pd.DataFrame(df.groupby(['Citizen']).agg({i: 'value_counts' for i in ['Approved']}).fillna(0))
grp_by_citizen.rename(columns = {'Approved': 'Count'}, inplace = True)
grp_by_citizen.sort_values(by = 'Approved')
grp_by_citizen
Do let me know if you need further clarification or insight as to my objective.
You need to reassign the result of sort_values or use inplace=True. From documentation:
Returns: DataFrame or None
      DataFrame with sorted values or None if inplace=True.
grp_by_citizen = grp_by_citizen.sort_values(by = 'Approved')
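For example, a minimal sketch on a toy frame (the 'Count' column and values here are made up) showing both working patterns:
import pandas as pd

grp = pd.DataFrame({'Count': [3, 1, 2]}, index=['b', 'c', 'a'])

grp.sort_values(by='Count')            # returns a sorted copy; grp itself is unchanged
grp = grp.sort_values(by='Count')      # reassign to keep the sorted result
# or, equivalently, sort in place:
# grp.sort_values(by='Count', inplace=True)
print(grp)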
First go with:
f = A.columns.values.tolist()
to see what the actual names of your columns are. Then you can try:
A.sort_values(by=f[:2])
And if you sort by column name, keep in mind that 2L is a long int (Python 2 syntax), so just go:
A.sort_values(by=[2L])

How to rename a column while merging in pandas

I am using a for loop to merge many different dataframes. Each dataframe contains values from a specific time period. As such the column in each df is named "balance". In order to avoid creating multiple balance_x, balance_y... I want to name the columns using the name of the df.
So far, I have the following:
top = topaccount_2021_12
top = top.rename(columns={"balance": "topaccount_2021_12"})
for i in [topaccount_2021_09, topaccount_2021_06, topaccount_2021_03,
topaccount_2020_12, topaccount_2020_09, topaccount_2020_06, topaccount_2020_03,
topaccount_2019_12, topaccount_2019_09, topaccount_2019_06, topaccount_2019_03,
topaccount_2018_12, topaccount_2018_09, topaccount_2018_06, topaccount_2018_03,
topaccount_2017_12, topaccount_2017_09, topaccount_2017_06, topaccount_2017_03,
topaccount_2016_12, topaccount_2016_09, topaccount_2016_06, topaccount_2016_03,
topaccount_2015_12, topaccount_2015_09]:
    top = top.merge(i, on='address', how='left')
    top = top.rename(columns={'balance': i})
But I get the error message:
TypeError: Cannot convert bool to numpy.ndarray
Any idea how to solve this? Thanks!
I assume each topaccount_* is a dataframe. I'm a bit confused by top = top.rename(columns={'balance': i}): what do you want to achieve here? The rename function renames a column given the original column name as the key and the new column name as the value, but instead of passing a string you are passing a dataframe as the new name.
Edit
# store the frames in a dictionary keyed by name
dictOfDf = {
    'topaccount_2021_09': topaccount_2021_09,
    'topaccount_2021_06': topaccount_2021_06,
    ...
    'topaccount_2015_09': topaccount_2015_09,
}
# pick the first key to declare the base dataframe
keys = list(dictOfDf.keys())
top = dictOfDf[keys[0]]
top = top.rename(columns={"balance": keys[0]})
# iterate through the remaining keys, merging the frame and renaming by its key
for i in keys[1:]:
    top = top.merge(dictOfDf[i], on='address', how='left')
    top = top.rename(columns={'balance': i})
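For what it's worth, here is a slightly different sketch under the same assumptions (every frame has 'address' and 'balance' columns, and dictOfDf is the dictionary above) that renames each frame before merging, so a bare 'balance' column never collides:
from functools import reduce

# Rename each frame's 'balance' column to the frame's own name first
renamed = [df.rename(columns={'balance': name}) for name, df in dictOfDf.items()]

# Then fold them together on 'address'
top = reduce(lambda left, right: left.merge(right, on='address', how='left'), renamed)
Renaming up front means the merge never has to invent balance_x / balance_y suffixes.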

Adding new column from aggregated via .loc returns NaN

I'm having difficulty understanding why my code does not work as expected.
I have a dataframe structured like this:
Screenshot of the dataframe
(Sorry, I don't have a high enough reputation to post images)
And I aggregate as follows to get sum of testBytes:
aggregation = {'testBytes' : ['sum']}
tests_DL_groupped = tests_DL_short.groupby(['measDay','_p_live','_p_compositeId','Latitude','Longitude','testType']).agg(aggregation).reset_index()
And now the actual question: why does this code not work as expected, producing NaN:
tests_DL_groupped.loc[:,'testMBytes'] = tests_DL_groupped['testBytes']/1000/1000
while this works fine:
tests_DL_groupped['testMBytes'] = tests_DL_groupped['testBytes']/1000/1000
And which is the preferred pandas way to do it?
Thank You very much!
The problem is a MultiIndex in the columns (passing a list of functions to agg creates one).
The solution is to change:
aggregation = {'testBytes' : ['sum']}
to:
aggregation = {'testBytes' : 'sum'}
to avoid it.
Or use GroupBy.sum:
cols = ['measDay','_p_live','_p_compositeId','Latitude','Longitude','testType']
tests_DL_groupped = tests_DL_short.groupby(cols)['testBytes'].sum().reset_index()
# or, with as_index=False, the reset_index is not needed:
tests_DL_groupped = tests_DL_short.groupby(cols, as_index=False)['testBytes'].sum()
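If you prefer to keep the list-of-functions aggregation, here is a minimal sketch on toy data (the 'g' column and values are invented) of why the .loc assignment produces NaN and how flattening the MultiIndex columns restores normal behaviour:
import pandas as pd

df = pd.DataFrame({'g': ['a', 'a', 'b'], 'testBytes': [1_000_000, 2_000_000, 3_000_000]})

grouped = df.groupby('g').agg({'testBytes': ['sum']}).reset_index()
print(grouped.columns.tolist())   # MultiIndex: [('g', ''), ('testBytes', 'sum')]

# With MultiIndex columns, grouped['testBytes'] is itself a one-column DataFrame,
# and .loc aligns on its column label ('sum' vs 'testMBytes'), hence the NaN.
# Flatten the columns once, and .loc behaves as usual:
grouped.columns = ['_'.join(c).strip('_') for c in grouped.columns]
grouped.loc[:, 'testMBytes'] = grouped['testBytes_sum'] / 1000 / 1000
print(grouped)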

Pandas dataFrame.nunique() : ("unhashable type : 'list'", 'occured at index columns')

I want to apply the .nunique() function to a full dataFrame.
On the following screenshot, we can see that it contains 130 features. Screenshot of shape and columns of the dataframe.
The goal is to get the number of different values per feature.
I use the following code (that worked on another dataFrame).
def nbDifferentValues(data):
    total = data.nunique()
    total = total.sort_values(ascending=False)
    percent = (total / data.shape[0] * 100)
    return pd.concat([total, percent], axis=1, keys=['Total', 'Pourcentage'])
diffValues = nbDifferentValues(dataFrame)
The code fails at the first line of the function, and I get the following error, which I don't know how to solve ("unhashable type : 'list'", 'occured at index columns'):
Trace of the error
You probably have a column whose contents are lists.
Since lists in Python are mutable, they are unhashable, so nunique (which needs to hash values to count distinct ones) fails.
import pandas as pd
df = pd.DataFrame([
    (0, [1, 2]),
    (1, [2, 3])
])
# raises "unhashable type : 'list'" error
df.nunique()
SOLUTION: Don't use mutable structures (like lists) in your dataframe:
df = pd.DataFrame([
    (0, (1, 2)),
    (1, (2, 3))
])
df.nunique()
# 0 2
# 1 2
# dtype: int64
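If the lists are already in the frame, a small follow-up sketch (same toy data as above) converts the existing column in place so nunique works:
import pandas as pd

df = pd.DataFrame([
    (0, [1, 2]),
    (1, [2, 3])
])

# Convert the list column (column 1 here) to hashable tuples, then count uniques
df[1] = df[1].apply(tuple)
print(df.nunique())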
To get nunique or unique on a pandas.Series, my preferred approaches are:
Quick Approach
NOTE: This works whether the column values are lists or strings. Also, nested lists might need to be flattened.
_unique_items = df.COL_LIST.explode().unique()
or
_unique_count = df.COL_LIST.explode().nunique()
Alternate Approach
Alternatively, if I wish not to explode the items,
# If col values are strings
_unique_items = df.COL_STR_LIST.apply("|".join).unique()
# A lambda helps if the col values are non-strings
_unique_items = df.COL_LIST.apply(lambda _l: "|".join([str(_y) for _y in _l])).unique()
Bonus
df.COL.apply(json.dumps) might handle all the cases.
OP's solution
df['uniqueness'] = df.apply(lambda _x: json.dumps(_x.to_list()), axis=1)
...
# Plug more code
...
I have come across this problem with .nunique() when converting results from a REST API from dict (or list) to a pandas dataframe. The problem is that one of the columns is stored as a list or dict (a common situation with nested JSON results). Here is some sample code to identify and remove the columns causing the error.
# this is the dataframe that is causing your issues
df = data.copy()
print(f"Rows and columns: {df.shape} \n")
print(f"Null values per column: \n{df.isna().sum()} \n")
# check which columns error when counting number of uniques
ls_cols_nunique = []
ls_cols_error_nunique = []
for each_col in df.columns:
    try:
        df[each_col].nunique()
        ls_cols_nunique.append(each_col)
    except:
        ls_cols_error_nunique.append(each_col)
print(f"Unique values per column: \n{df[ls_cols_nunique].nunique()} \n")
print(f"Columns error nunique: \n{ls_cols_error_nunique} \n")
This code should split your dataframe columns into 2 lists:
Columns on which .nunique() can be calculated
Columns that raise an error when running .nunique()
Then just calculate the .nunique() on the columns without errors.
As for converting the columns with errors, there are other resources that address that with .apply(pd.Series).
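For completeness, here is a small sketch with invented column names ('ok' and 'bad') showing one way to convert the offending columns rather than drop them, e.g. serialising list values to JSON strings (as in the answers above) so they become hashable:
import json
import pandas as pd

df = pd.DataFrame({'ok': [1, 2, 2], 'bad': [[1, 2], [2, 3], [2, 3]]})

# Serialise the list-valued columns (e.g. the ones collected in ls_cols_error_nunique)
for col in ['bad']:
    df[col] = df[col].apply(json.dumps)

print(df.nunique())   # now every column can be counted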
