Summary of categorical variables pandas - python

As stated in the title, I want to conduct some summary analysis of categorical variables in pandas, but have not come across a satisfying solution after searching for a while. So I developed the following code as a kind of self-answered question, in the hope that someone on SO can help improve it.
import pandas as pd
import numpy as np

test_df = pd.DataFrame({'x': ['a', 'b', 'b', 'c'],
                        'y': [1, 0, 0, np.nan],
                        'z': ['Jay', 'Jade', 'Jia', ''],
                        'u': [1, 2, 3, 3]})
def cat_var_describe(input_df, var_list):
    df = input_df.copy()
    # dataframe to store the result
    res = pd.DataFrame(columns=['var_name', 'unique_values', 'counts'])
    for var in var_list:
        temp_res = df[var].value_counts(dropna=False).rename_axis('unique_values').reset_index(name='counts')
        temp_res['var_name'] = var
        if var == var_list[0]:
            res = temp_res.copy()
        else:
            res = pd.concat([res, temp_res], axis=0)
    res = res[['var_name', 'unique_values', 'counts']]
    return res
cat_des_test = cat_var_describe(test_df, ['x','y','z','u'])
cat_des_test
Any helpful suggestions will be deeply appreciated.

You can use the pandas DataFrame describe() method.
describe() includes only numerical data by default.
To include categorical variables, you must use the include argument.
Using include='object' returns only the non-numerical columns:
test_df.describe(include='object')
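For the sample test_df above, this gives roughly the following (a sketch; the 'top' value for z is arbitrary because all four of its values are unique):

        x    z
count   4    4
unique  3    4
top     b  Jay
freq    2    1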
Using include='all' returns a summary of all columns, with NaN where a statistic is not applicable to the column's datatype:
test_df.describe(include='all')
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.describe.html

You can use the unique() method to get the distinct values of a column, for example:
test_df['x'].unique()
For getting the number of occurrences of values in a column, you can use value_counts():
test_df['x'].value_counts()
A simplified loop over all columns of the DataFrame could look like this:
for col in list(test_df):
    print('variable:', col)
    print(test_df[col].value_counts(dropna=False).to_string())
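If you would rather have everything in one tidy DataFrame, as the question asks for, one minimal sketch (assuming pandas is imported as pd) is to concatenate the per-column value counts:

# build one long table: var_name | unique_values | counts
summary = pd.concat(
    {col: test_df[col].value_counts(dropna=False) for col in test_df},
    names=['var_name', 'unique_values']
).reset_index(name='counts')
print(summary)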

You can use the describe() function:
test_df.describe()
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.describe.html

Related

How to edit/ sort a non-column column in Python?

I wrote the script below, and I'm 98% content with the output. However, the unsorted order of the 'Approved' field bugs me. As you can see, I tried to sort the values using .sort_values() but was unsuccessful. The output of the script is below, as is the list of fields in the data frame.
df = df.replace({'Citizen': {1: 'Yes',
0: 'No'}})
grp_by_citizen = pd.DataFrame(df.groupby(['Citizen']).agg({i: 'value_counts' for i in ['Approved']}).fillna(0))
grp_by_citizen.rename(columns = {'Approved': 'Count'}, inplace = True)
grp_by_citizen.sort_values(by = 'Approved')
grp_by_citizen
Do let me know if you need further clarification or insight as to my objective.
You need to reassign the result of sort_values or use inplace=True. From documentation:
Returns: DataFrame or None
      DataFrame with sorted values or None if inplace=True.
grp_by_citizen = grp_by_citizen.sort_values(by = 'Approved')
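Or, keeping the same object, as the note about inplace=True suggests:

grp_by_citizen.sort_values(by='Approved', inplace=True)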
First go with:
f = A.columns.values.tolist()
To see what the actual names of your columns are. Then you can try:
A.sort_values(by=f[:2])
And if you sort by column name, keep in mind that 2L is a long int (Python 2), so just go:
A.sort_values(by=[2L])

How to get dataframe from groupby

I am practicing groupby, but it is returning a dict, not a DataFrame. I followed some of the solutions from Stack Overflow, but had no luck.
My code:
result[comNewColName] = sourceDF.groupby(context, as_index=False)[aggColumn].agg(aggOperation).reset_index()
and I tried:
result[comNewColName] = sourceDF.groupby(context)[aggColumn].agg(aggOperation).reset_index()
and
result[comNewColName] = sourceDF.groupby(context, as_index=False)[aggColumn].agg(aggOperation)
In all three cases I am getting a dict only, but I should get a DataFrame.
Here are the values used:
comNewColName = "totalAmount"
context =['clientCode']
aggColumn = 'amount'
aggOperation = 'sum'
If you need a new column created from the aggregated values, use GroupBy.transform, but assign it to sourceDF:
sourceDF[comNewColName] = sourceDF.groupby(context)[aggColumn].transform(aggOperation)
Your solution returns a DataFrame:
df = sourceDF.groupby(context)[aggColumn].agg(aggOperation).reset_index()
print (type(df))
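For the values given above, this should print:

<class 'pandas.core.frame.DataFrame'>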

Dask DataFrame Groupby: Most frequent value of column in aggregate

A custom dask GroupBy Aggregation is very handy, but I am having trouble defining one that works for the most frequent value in a column.
What do I have:
So from the example here, we can define custom aggregate functions like this:
custom_sum = dd.Aggregation('custom_sum', lambda s: s.sum(), lambda s0: s0.sum())
my_aggregate = {
    'A': custom_sum,
    'B': custom_most_often_value,  # <<< This is the goal.
    'C': ['max', 'min', 'mean'],
    'D': ['max', 'min', 'mean']
}
col_name = 'Z'
ddf_agg = ddf.groupby(col_name).agg(my_aggregate).compute()
While this works for custom_sum (as on the example page), the adaptation to the most frequent value could look like this (from the example here):
custom_most_often_value = dd.Aggregation('custom_most_often_value', lambda x:x.value_counts().index[0], lambda x0:x0.value_counts().index[0])
but it yields
ValueError: Metadata inference failed in `_agg_finalize`.
You have supplied a custom function and Dask is unable to
determine the type of output that that function returns.
Then I tried to find the meta keyword in the dd.Aggregation implementation so I could define it, but could not find it. And the fact that it is not needed in the custom_sum example makes me think the error is somewhere else.
So my question is: how do I get the most frequently occurring value of a column in a df.groupby(..).agg(..)? Thanks!
A quick clarification rather than an answer: the meta parameter is used in the .agg() method, to specify the column data types you expect, best expressed as a zero-length pandas dataframe. Dask will supply dummy data to your function otherwise, to try to guess those types, but this doesn't always work.
The issue you're running into is that the separate stages of the aggregation can't be the same function applied recursively, as in the custom_sum example that you're looking at.
I've modified code from this answer, leaving the comments from user8570642 because they are very helpful. Note that this method will solve for a list of groupby keys:
https://stackoverflow.com/a/46082075/3968619
def chunk(s):
    # for the comments, assume only a single grouping column; the
    # implementation can handle multiple group columns.
    #
    # s is a grouped series. value_counts creates a multi-series like
    # (group, value): count
    return s.value_counts()

def agg(s):
    # print('agg', s.apply(lambda s: s.groupby(level=-1).sum()))
    # s is a grouped multi-index series. In .apply the full sub-df will be
    # passed, multi-index and all. Group on the value level and sum the
    # counts. The result of the lambda function is a series. Therefore, the
    # result of the apply is a multi-index series like (group, value): count
    return s.apply(lambda s: s.groupby(level=-1).sum())

    # faster version using pandas internals (unreachable alternative)
    s = s._selected_obj
    return s.groupby(level=list(range(s.index.nlevels))).sum()

def finalize(s):
    # s is a multi-index series of the form (group, value): count. First
    # manually group on the group part of the index. The lambda will receive a
    # sub-series with multi-index. Next, drop the group part from the index.
    # Finally, determine the index with the maximum value, i.e. the mode.
    level = list(range(s.index.nlevels - 1))
    return (
        s.groupby(level=level)
        .apply(lambda s: s.reset_index(level=level, drop=True).idxmax())
    )

max_occurence = dd.Aggregation('mode', chunk, agg, finalize)
chunk will count the values for the groupby object in each partition. agg will take the results from chunk, group by the original groupby key(s) again, and sum the value counts, so that we have the value counts for every group. finalize will take the multi-index series provided by agg and return the most frequently occurring value of B for each group from Z.
Here's a test case:
import pandas as pd
import dask.dataframe as dd

df = dd.from_pandas(
    pd.DataFrame({"A": [1, 1, 1, 1, 2, 2, 3] * 10, "B": [5, 5, 5, 5, 1, 1, 1] * 10,
                  'Z': ['mike', 'amy', 'amy', 'amy', 'chris', 'chris', 'sandra'] * 10}),
    npartitions=10)
res = df.groupby(['Z']).agg({'B': max_occurence}).compute()
print(res)
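For that input, the result should look roughly like this (column labelling and row order may differ slightly):

        B
Z
amy     5
chris   1
mike    5
sandra  1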

Pandas dataFrame.nunique() : ("unhashable type : 'list'", 'occured at index columns')

I want to apply the .nunique() function to a full DataFrame.
As the screenshot below shows, it contains 130 features. [Screenshot of the shape and columns of the dataframe]
The goal is to get the number of different values per feature.
I use the following code (that worked on another dataFrame).
def nbDifferentValues(data):
    total = data.nunique()
    total = total.sort_values(ascending=False)
    percent = (total / data.shape[0] * 100)
    return pd.concat([total, percent], axis=1, keys=['Total', 'Pourcentage'])
diffValues = nbDifferentValues(dataFrame)
And the code fails at the first line and I get the following error which I don't know how to solve ("unhashable type : 'list'", 'occured at index columns'):
[Trace of the error]
You probably have a column whose content are lists.
Since lists in Python are mutable they are unhashable.
import pandas as pd

df = pd.DataFrame([
    (0, [1, 2]),
    (1, [2, 3])
])
# raises "unhashable type : 'list'" error
df.nunique()
SOLUTION: Don't use mutable structures (like lists) in your dataframe:
df = pd.DataFrame([
    (0, (1, 2)),
    (1, (2, 3))
])
df.nunique()
# 0 2
# 1 2
# dtype: int64
To get nunique or unique in a pandas.Series, my preferred approaches are:
Quick Approach
NOTE: This works whether the column values are lists or plain strings. Nested lists might need to be flattened first.
_unique_items = df.COL_LIST.explode().unique()
or
_unique_count = df.COL_LIST.explode().nunique()
Alternate Approach
Alternatively, if I wish not to explode the items,
# If col values are strings
_unique_items = df.COL_STR_LIST.apply("|".join).unique()
# A lambda helps if the col values are non-strings
_unique_items = df.COL_LIST.apply(lambda _l: "|".join([str(_y) for _y in _l])).unique()
Bonus
df.COL.apply(json.dumps) might handle all the cases.
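For instance, serialising a list column to JSON strings makes each value hashable, so nunique works again (a sketch, assuming a list-valued column named COL_LIST):

import json

# each list becomes a plain string such as "[1, 2]", which is hashable
df.COL_LIST.apply(json.dumps).nunique()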
OP's solution
df['uniqueness'] = df.apply(lambda _x: json.dumps(_x.to_list()), axis=1)
...
# Plug more code
...
I have come across this problem with .nunique() when converting results from a REST API from dict (or list) to a pandas DataFrame. The problem is that one of the columns is stored as a list or dict (a common situation with nested JSON results). Here is some sample code to identify the columns causing the error.
# this is the dataframe that is causing your issues
df = data.copy()
print(f"Rows and columns: {df.shape} \n")
print(f"Null values per column: \n{df.isna().sum()} \n")
# check which columns error when counting number of uniques
ls_cols_nunique = []
ls_cols_error_nunique = []
for each_col in df.columns:
    try:
        df[each_col].nunique()
        ls_cols_nunique.append(each_col)
    except TypeError:
        ls_cols_error_nunique.append(each_col)
print(f"Unique values per column: \n{df[ls_cols_nunique].nunique()} \n")
print(f"Columns error nunique: \n{ls_cols_error_nunique} \n")
This code should split your dataframe columns into two lists:
Columns on which .nunique() can be calculated
Columns that error when running .nunique()
Then just calculate .nunique() on the columns without errors.
As far as converting the columns with errors goes, there are other resources that address that with .apply(pd.Series).
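For example, a sketch of both options, assuming the offending column is called 'nested_col' (a hypothetical name):

# expand each list into its own set of columns
expanded = df['nested_col'].apply(pd.Series)

# or serialise the lists to strings so nunique can hash them
n_unique = df['nested_col'].apply(str).nunique()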

Pandas: get the most count label

My dataframe has a column that contains various 'type' values, and I want to get the most frequent one:
In this case, I want to get the label FM-15, so later on I can query only the data labelled with it.
How can I do that?
Now I can get away with:
most_count = df['type'].value_counts().max()
s = df['type'].value_counts()
s[s == most_count].index
This returns
Index([u'FM-15'], dtype='object')
But I feel this is too ugly, and I don't know how to use this Index() object to query df. I only know something like df = df[(df['type'] == 'FM-15')].
Use idxmax on the value counts:
lbl = df['type'].value_counts().idxmax()
To query,
df.query("type == @lbl")
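Equivalently, plain boolean indexing works without query (a quick sketch):

df[df['type'] == lbl]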
