I am using the pandas DataFrame groupby feature. I want to group by column c_b and (1) count the unique values in columns c_a and c_c, and (2) get the maximum value of column c_d. Is there a way to achieve both goals with a single line of groupby code? I tried the following line, but it does not seem to be correct.
sampleGroup = sample.groupby('c_b')(['c_a', 'c_d'].agg(pd.Series.nunique), ['c_d'].agg(pd.Series.max))
My expected results are:
c_b,c_a_unique_count,c_c_unique_count,c_d_max
python,2,2,1.0
c++,2,2,0.0
Thanks.
Input file,
c_a,c_b,c_c,c_d
hello,python,numpy,0.0
hi,python,pandas,1.0
ho,c++,vector,0.0
ho,c++,std,0.0
go,c++,std,0.0
Source code,
sample = pd.read_csv('123.csv', header=None, skiprows=1,
dtype={0:str, 1:str, 2:str, 3:float})
sample.columns = pd.Index(data=['c_a', 'c_b', 'c_c', 'c_d'])
sample['c_d'] = sample['c_d'].astype('int64')
sampleGroup = sample.groupby('c_b')(['c_a', 'c_d'].agg(pd.Series.nunique), ['c_d'].agg(pd.Series.max))
results.to_csv(sampleGroup, index= False)
You can pass a dict to agg():
df.groupby('c_b').agg({'c_a':'nunique', 'c_c':'nunique', 'c_d':'max'})
If you don't want c_b as index, you can pass as_index=False to groupby:
df.groupby('c_b', as_index=False).agg({'c_a':'nunique', 'c_c':'nunique', 'c_d':'max'})
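To also get the column names from the expected output, named aggregation (available in pandas 0.25 and later) is one option. The sketch below is self-contained and loads the question's sample data from an in-memory string; the names csv_text and summary are assumptions made for illustration.

```python
import io

import pandas as pd

# Sample data from the question, loaded from a string instead of '123.csv'
csv_text = """c_a,c_b,c_c,c_d
hello,python,numpy,0.0
hi,python,pandas,1.0
ho,c++,vector,0.0
ho,c++,std,0.0
go,c++,std,0.0
"""
sample = pd.read_csv(io.StringIO(csv_text))

# Named aggregation: output_column=(input_column, aggregation)
summary = (
    sample.groupby('c_b')
    .agg(
        c_a_unique_count=('c_a', 'nunique'),
        c_c_unique_count=('c_c', 'nunique'),
        c_d_max=('c_d', 'max'),
    )
    .reset_index()  # turn the group key c_b back into a regular column
)
print(summary)
```

Note that groupby sorts the group keys by default, so the c++ row comes before the python row; pass sort=False to groupby if the original order matters.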
I have a dataframe whose column names all share the same format, date_sensor, where the date is in yymmdd format. Here is a subset of it:
Considering the last date (180722), I would like to keep only one column according to a predefined sensor priority. For example, I would like to define that SN1 is more important than SK3, so the desired result would be the same dataframe without column 180722_SK3. More than two columns can share the same last date.
This is the solution I implemented:
sensorsImportance = ['SN1', 'SK3']  # list of importance, first item is the most important
sensorsOrdering = {word: i for i, word in enumerate(sensorsImportance)}
def remove_duplicate_last_date(df, sensorsOrdering):
    s = []
    lastDate = df.columns.tolist()[-1].split('_')[0]
    for i in df.columns.tolist():
        if lastDate in i:
            s.append(i.split('_')[1])
    if len(s) > 1:
        keepCol = lastDate + '_' + sorted(s, key=sensorsOrdering.get)[0]
        dropCols = [lastDate + '_' + i for i in sorted(s, key=sensorsOrdering.get)[1:]]
        df.drop(dropCols, axis=1, inplace=True)
    return df
It works fine; however, I feel that this is too cumbersome. Is there a better way?
This can be done by splitting the column names into (date, sensor) tuples, using argsort with the priority list to reorder the columns, taking the first column per date with groupby, and then joining the column names back together:
df.columns=df.columns.str.split('_').map(tuple)
sensorsImportance = ['SN1', 'SK3']
idx=df.columns.get_level_values(1).map(dict(zip(sensorsImportance,range(len(sensorsImportance))))).argsort()
df=df.iloc[:,idx].T.groupby(level=0).head(1).T
df.columns=df.columns.map('_'.join)
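For comparison, here is a plain-Python sketch of the same idea that avoids the MultiIndex round trip. It is a different technique from the answer above, not a restatement of it. The function name keep_best_sensor_per_date and the toy dataframe are assumptions for illustration, since the original data subset is not shown. Like the groupby version, it keeps the highest-priority sensor for every date, not only the last one, and it assumes every sensor appears in the priority list.

```python
import pandas as pd


def keep_best_sensor_per_date(df, priority):
    """Keep, for each date prefix, only the column whose sensor ranks highest."""
    rank = {sensor: i for i, sensor in enumerate(priority)}
    best = {}  # date -> column name of the best-ranked sensor seen so far
    for col in df.columns:
        date, sensor = col.split('_', 1)
        current = best.get(date)
        if current is None or rank[sensor] < rank[current.split('_', 1)[1]]:
            best[date] = col
    # Preserve the original column order while dropping lower-priority columns
    keep = [col for col in df.columns if best[col.split('_', 1)[0]] == col]
    return df[keep]


# Hypothetical toy data in the date_sensor naming scheme
toy = pd.DataFrame(
    [[1, 2, 3, 4]],
    columns=['180720_SN1', '180721_SK3', '180722_SK3', '180722_SN1'],
)
result = keep_best_sensor_per_date(toy, ['SN1', 'SK3'])
print(result.columns.tolist())
```

A sensor that is missing from the priority list raises a KeyError here, which may or may not be the behaviour you want.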
I've consulted a bunch of previous related SO posts, but I could not adapt them to solve my question.
Here is an example dataframe.
# Using pandas 0.24.2
import pandas as pd
data = {'customer_id': [1, 2, 3],
        'prev_due_date': ['Jun-2010', 'Apr-2019', 'Dec-1999'],
        'current_due_date': ['Aug-2019', 'Dec-2045', 'Jan-2000'],
        'next_due_date': ['Feb-2025', 'Nov-2065', 'Sep-2001']
        }
df = pd.DataFrame(data)
Here is what the dataframe looks like. There are many more such columns to parse in the actual dataframe, hence my question.
   customer_id prev_due_date current_due_date next_due_date
0            1      Jun-2010         Aug-2019      Feb-2025
1            2      Apr-2019         Dec-2045      Nov-2065
2            3      Dec-1999         Jan-2000      Sep-2001
I have created a function to parse one column (i.e., it adds two parsed columns, a month column and a year column, to the supplied df):
def parse_column(df, col_parse):
    col_parse_mmm = col_parse + '_mmm'
    col_parse_yyyy = col_parse + '_yyyy'
    df[[col_parse_mmm, col_parse_yyyy]] = df[col_parse].str.split('-', expand=True)
    return df
Calling this function below does the job for the supplied column:
parse_column(df, 'prev_due_date')
Now, my question is:
How can I do this for an arbitrary number of columns of my choosing (e.g., a list of tens or hundreds of columns that I want to parse), using apply?
Is it possible to avoid using apply?
for c in df.columns:
    if c.endswith('_date'):
        parse_column(df, c)
(You don't need to return the df from your parse_column function, since it modifies df in place.)
If you already have the list with the column names you're interested in:
for c in my_columns_list:
    parse_column(df, c)
You don't need any apply.
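As a self-contained illustration of the loop approach, the sketch below builds the split columns for every *_date column and concatenates them once at the end, leaving the original frame untouched. The names date_cols, parts, and out are assumptions for this example and do not come from the question.

```python
import pandas as pd

df = pd.DataFrame({
    'customer_id': [1, 2, 3],
    'prev_due_date': ['Jun-2010', 'Apr-2019', 'Dec-1999'],
    'current_due_date': ['Aug-2019', 'Dec-2045', 'Jan-2000'],
    'next_due_date': ['Feb-2025', 'Nov-2065', 'Sep-2001'],
})

# Select the columns to parse; any explicit list of names works as well
date_cols = [c for c in df.columns if c.endswith('_date')]

parts = []
for c in date_cols:
    # 'Jun-2010' -> two columns: 'Jun' and '2010'
    split = df[c].str.split('-', expand=True)
    split.columns = [c + '_mmm', c + '_yyyy']
    parts.append(split)

# One concat at the end instead of repeated in-place column assignment
out = pd.concat([df] + parts, axis=1)
print(out.columns.tolist())
```

The year columns remain strings after the split; convert them with astype(int) if numeric years are needed.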
I want to apply the .nunique() function to a full dataFrame.
In the following screenshot, we can see that it contains 130 features. (Screenshot: shape and columns of the dataframe.)
The goal is to get the number of different values per feature.
I use the following code (that worked on another dataFrame).
def nbDifferentValues(data):
    total = data.nunique()
    total = total.sort_values(ascending=False)
    percent = (total/data.shape[0]*100)
    return pd.concat([total, percent], axis=1, keys=['Total','Pourcentage'])
diffValues = nbDifferentValues(dataFrame)
The code fails at the first line with the following error, which I don't know how to solve ("unhashable type : 'list'", 'occurred at index columns'):
Trace of the error
You probably have a column whose values are lists.
Since lists in Python are mutable, they are unhashable.
import pandas as pd
df = pd.DataFrame([
(0, [1,2]),
(1, [2,3])
])
# raises "unhashable type : 'list'" error
df.nunique()
SOLUTION: Don't use mutable structures (like lists) in your dataframe:
df = pd.DataFrame([
(0, (1,2)),
(1, (2,3))
])
df.nunique()
# 0 2
# 1 2
# dtype: int64
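If the lists are already in the dataframe, one way to apply this advice is to convert them to tuples before counting. The following is a minimal sketch on made-up data; the helper name to_hashable and the column names are assumptions, and it only handles flat lists (a list nested inside a list would still be unhashable after conversion).

```python
import pandas as pd

# Hypothetical frame with one ordinary column and one column of lists
df = pd.DataFrame({
    'id': [0, 1, 2],
    'tags': [[1, 2], [2, 3], [1, 2]],
})


def to_hashable(value):
    """Convert a list to a tuple; leave every other value unchanged."""
    return tuple(value) if isinstance(value, list) else value


# Apply the conversion to each column, then count distinct values
hashable_df = df.apply(lambda col: col.map(to_hashable))
counts = hashable_df.nunique()
print(counts)
```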
To get nunique or unique values in a pandas.Series, my preferred approaches are:
Quick Approach
NOTE: This works whether the column values are lists or strings. Nested lists might need to be flattened first.
_unique_items = df.COL_LIST.explode().unique()
or
_unique_count = df.COL_LIST.explode().nunique()
Alternate Approach
Alternatively, if I wish not to explode the items,
# If the list items are strings
_unique_items = df.COL_STR_LIST.apply("|".join).unique()
# A lambda handles the case where the list items are not strings
_unique_items = df.COL_LIST.apply(lambda _l: "|".join([str(_y) for _y in _l])).unique()
Bonus
df.COL.apply(json.dumps) might handle all the cases.
OP's solution
df['uniqueness'] = df.apply(lambda _x: json.dumps(_x.to_list()), axis=1)
...
# Plug more code
...
I have come across this problem with .nunique() when converting results from a REST API from dict (or list) to a pandas dataframe. The problem is that one of the columns is stored as a list or dict (a common situation in nested JSON results). Here is sample code to identify the columns causing the error.
# this is the dataframe that is causing your issues
df = data.copy()
print(f"Rows and columns: {df.shape} \n")
print(f"Null values per column: \n{df.isna().sum()} \n")
# check which columns error when counting number of uniques
ls_cols_nunique = []
ls_cols_error_nunique = []
for each_col in df.columns:
    try:
        df[each_col].nunique()
        ls_cols_nunique.append(each_col)
    except TypeError:  # raised for unhashable values such as lists or dicts
        ls_cols_error_nunique.append(each_col)
print(f"Unique values per column: \n{df[ls_cols_nunique].nunique()} \n")
print(f"Columns error nunique: \n{ls_cols_error_nunique} \n")
This code should split your dataframe columns into 2 lists:
1. Columns on which .nunique() can be calculated
2. Columns that raise an error when running .nunique()
Then just calculate .nunique() on the columns without errors.
As for converting the columns with errors, there are other resources that address that with .apply(pd.Series).
I have some DataFrames with information about some elements, for instance:
my_df1=pd.DataFrame([[1,12],[1,15],[1,3],[1,6],[2,8],[2,1],[2,17]],columns=['Group','Value'])
my_df2=pd.DataFrame([[1,5],[1,7],[1,23],[2,6],[2,4]],columns=['Group','Value'])
I have used something like dfGroups = df.groupby('group').apply(my_agg).reset_index(), so now I have DataFrames with information on groups of the previous elements, say
my_df1_Group=pd.DataFrame([[1,57],[2,63]],columns=['Group','Group_Value'])
my_df2_Group=pd.DataFrame([[1,38],[2,49]],columns=['Group','Group_Value'])
Now I want to clean my groups according to properties of their elements. Let's say that I want to discard groups containing an element with Value greater than 16. So in my_df1_Group, there should only be the first group left, while both groups qualify to stay in my_df2_Group.
As I don't know how to get my_df1_Group and my_df2_Group from my_df1 and my_df2 in Python (I know other languages where it would simply be name+"_Group" with name looping in [my_df1,my_df2], but how do you do that in Python?), I build a list of lists:
SampleList = [[my_df1,my_df1_Group],[my_df2,my_df2_Group]]
Then, I simply try this:
my_max=16
Bad=[]
for Sample in SampleList:
    for n in Sample[1]['Group']:
        df=Sample[0].loc[Sample[0]['Group']==n] #This is inelegant, but trying to work
                                                #with Sample[1] in the for doesn't work
        if (df['Value'].max()>my_max):
            Bad.append(1)
        else:
            Bad.append(0)
    Sample[1] = Sample[1].assign(Bad_Row=pd.Series(Bad))
    Sample[1] = Sample[1].query('Bad_Row == 0')
This runs without errors, but doesn't work. In particular, it neither adds the column Bad_Row to my df nor modifies my DataFrame (yet the query runs smoothly even though the Bad_Row column doesn't seem to exist...). On the other hand, if I run this technique manually on a df (i.e. not in a loop), it works.
How should I do this?
Based on your comment below, I think you want to check whether a Group in your aggregated data frame has a Value in the input data greater than 16. One solution is to perform a row-wise calculation using a criterion on the input data. To accomplish this, my_func accepts a row from the aggregated data frame and the input data as a pandas groupby object. For each group in your aggregated data frame, it subsets your initial data and uses boolean logic to see whether any 'Value' in your input data meets your specified criterion.
def my_func(row,grouped_df1):
    if (grouped_df1.get_group(row['Group'])['Value']>16).any():
        return 'Bad Row'
    else:
        return 'Good Row'
my_df1=pd.DataFrame([[1,12],[1,15],[1,3],[1,6],[2,8],[2,1],[2,17]],columns=['Group','Value'])
my_df1_Group=pd.DataFrame([[1,57],[2,63]],columns=['Group','Group_Value'])
grouped_df1 = my_df1.groupby('Group')
my_df1_Group['Bad_Row'] = my_df1_Group.apply(lambda x: my_func(x,grouped_df1), axis=1)
Returns:
   Group  Group_Value   Bad_Row
0      1           57  Good Row
1      2           63   Bad Row
Based on dubbbdan's idea, here is code that works:
my_max=16
def my_func(row,grouped_df1):
    if (grouped_df1.get_group(row['Group'])['Value']>my_max).any():
        return 1
    else:
        return 0
SampleList = [[my_df1,my_df1_Group],[my_df2,my_df2_Group]]
for Sample in SampleList:
    grouped_df = Sample[0].groupby('Group')
    Sample[1]['Bad_Row'] = Sample[1].apply(lambda x: my_func(x,grouped_df), axis=1)
    Sample[1].drop(Sample[1][Sample[1]['Bad_Row']!=0].index, inplace=True)
    Sample[1].drop(['Bad_Row'], axis = 1, inplace = True)
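A possible alternative that avoids the row-wise apply is to compute the maximum Value per group once and filter the aggregated frame with isin. This is a sketch under the assumption that "bad" means "contains any Value above my_max"; the names group_max, good_groups, and cleaned are made up for illustration. It returns a new dataframe instead of modifying the aggregated one in place.

```python
import pandas as pd

my_df1 = pd.DataFrame(
    [[1, 12], [1, 15], [1, 3], [1, 6], [2, 8], [2, 1], [2, 17]],
    columns=['Group', 'Value'],
)
my_df1_Group = pd.DataFrame([[1, 57], [2, 63]], columns=['Group', 'Group_Value'])

my_max = 16

# Largest element value in each group of the element-level data
group_max = my_df1.groupby('Group')['Value'].max()

# Groups whose largest value does not exceed the threshold
good_groups = group_max[group_max <= my_max].index

# Keep only those groups in the aggregated frame
cleaned = my_df1_Group[my_df1_Group['Group'].isin(good_groups)]
print(cleaned)
```

To process several pairs, the same three lines can run inside a loop over (elements, groups) tuples, collecting each cleaned frame in a list or dict.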
I have written some code to essentially do an Excel-style vlookup on pandas dataframes and want to speed it up.
The structure of the data frames is as follows:
dbase1_df.columns:
'VALUE', 'COUNT', 'GRID', 'SGO10GEO'
merged_df.columns:
'GRID', 'ST0', 'ST1', 'ST2', 'ST3', 'ST4', 'ST5', 'ST6', 'ST7', 'ST8', 'ST9', 'ST10'
sgo_df.columns:
'mkey', 'type'
To combine them, I do the following:
1. For each row in dbase1_df, find the row where its 'SGO10GEO' value matches the 'mkey' value of sgo_df. Obtain the 'type' from that row in sgo_df.
2. 'type' contains an integer ranging from 0 to 10. Create a column name by prepending 'ST' to type.
3. Find the value in merged_df where its 'GRID' value matches the 'GRID' value in dbase1_df and the column name is the one we obtained in step 2. Output this value into a csv file.
# Read in dbase1 dbf into data frame
# (pandas.read_csv replaces DataFrame.from_csv, which was removed in pandas 1.0)
dbase1_df = pandas.read_csv(dbase1_file)
merged_df = pandas.read_csv('merged.csv')
lup_out.writerow(["VALUE","TYPE",EXTRACT_VAR.upper()])
# For each unique value in dbase1 data frame:
for index, row in dbase1_df.iterrows():
    # 1. Find the soil type corresponding to the mukey
    tmp = sgo_df.type.values[sgo_df['mkey'] == int(row['SGO10GEO'])]
    if tmp.size > 0:
        s_type = 'ST'+str(tmp[0])
        val = int(row['VALUE'])
        # 2. Obtain hmu value
        tmp_val = merged_df[s_type].values[merged_df['GRID'] == int(row['GRID'])]
        if tmp_val.size > 0:
            hmu_val = tmp_val[0]
            # 3. Output into data frame: VALUE, hmu value
            lup_out.writerow([val,s_type,hmu_val])
        else:
            err_out.writerow([merged_df['GRID'], s_type, row['GRID']])
Is there anything here that might be a speed bottleneck? Currently it takes around 20 minutes with ~500,000 rows in dbase1_df, ~1,000 rows in merged_df and ~500,000 rows in sgo_df.
thanks!
You need to use the merge operation in pandas to get better performance. I'm not able to test the code below since I don't have the data, but at a minimum it should give you the idea:
import pandas as pd
# DataFrame.from_csv was removed in pandas 1.0; read_csv is its replacement
dbase1_df = pd.read_csv('dbase1_file.csv')
sgo_df = pd.read_csv('sgo_df.csv')
merged_df = pd.read_csv('merged_df.csv')
# Common columns need the same name for the merge, so SGO10GEO is renamed to mkey
dbase1_df.columns = ['VALUE', 'COUNT', 'GRID', 'mkey']
# Merge the two dataframes on their common column (mkey)
Step1_Merge = pd.merge(dbase1_df, sgo_df)
# Add a new column that concatenates 'ST' and type
Step1_Merge['type_2'] = Step1_Merge['type'].map(lambda x: 'ST'+str(x))
# Reshape merged_df, moving columns to rows, to allow another merge
grid = merged_df[['GRID']]  # replaces the removed .ix indexer
a = pd.merge(merged_df.stack(0).reset_index(1), grid, left_index=True, right_index=True)
# Rename the automatically generated columns so the next merge can use type_2
a.columns = ['type_2', 0, 'GRID']
result = pd.merge(Step1_Merge, a, on=['type_2', 'GRID'])
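The same two-merge idea can also be sketched with melt for the wide-to-long reshape. The tiny dataframes below are made-up stand-ins for the real files, and the output column names TYPE and HMU are assumptions chosen to mirror the question's CSV header; treat this as an illustration and not a tested drop-in replacement.

```python
import pandas as pd

# Hypothetical miniature versions of the three tables
dbase1_df = pd.DataFrame({
    'VALUE': [10, 11, 12],
    'GRID': [1, 2, 1],
    'SGO10GEO': [100, 101, 999],  # 999 has no match in sgo_df
})
sgo_df = pd.DataFrame({'mkey': [100, 101], 'type': [0, 1]})
merged_df = pd.DataFrame({'GRID': [1, 2], 'ST0': [0.5, 0.6], 'ST1': [0.7, 0.8]})

# Step 1: look up the soil type for every row (inner join drops unmatched keys)
step1 = dbase1_df.merge(sgo_df, left_on='SGO10GEO', right_on='mkey', how='inner')

# Step 2: build the column name to look up, e.g. 0 -> 'ST0'
step1['TYPE'] = 'ST' + step1['type'].astype(str)

# Step 3: reshape merged_df from wide (one column per ST*) to long
long_df = merged_df.melt(id_vars='GRID', var_name='TYPE', value_name='HMU')

# Join on both GRID and the computed column name
result = step1.merge(long_df, on=['GRID', 'TYPE'], how='inner')[['VALUE', 'TYPE', 'HMU']]
print(result)
```

Rows without a match are silently dropped by the inner joins; using how='left' together with indicator=True would be one way to recover them for an error log.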