How to count unique occurrences of a particular value per row? - python

I have a dataframe as follows:
df =
DATA TYPE_1 TYPE_2 TYPE_3 EVALUATED
WW A234 456 456 0
AA 456 123A 567 1
BB 456 123A 456 1
df = pd.DataFrame({"DATA":["WW","AA","BB"],"TYPE_1":["A234","456","456"],"TYPE_2":["456","123A","123A"],"TYPE_3":["456","567","456"],"EVALUATED":[0,1,1]})
I am calculating the number of times each TYPE* value appeared with EVALUATED equal to 0 and to 1. This is the code that I am using:
grouped = (df.set_index('EVALUATED')
.filter(like='TYPE_')
.stack()
.to_frame('TYPE')
.reset_index()
.pivot_table(index='TYPE', columns='EVALUATED', aggfunc='size', fill_value=0)
).reset_index()
I need to slightly improve this code so that it counts only unique occurrences of TYPE* values per row.
For example, 456 appeared two times in the first row, but it must be counted only once.
The result should be this one:
grouped =
TYPE 0 1
------------
A234 1 0
456 1 2
123A 0 2
567 0 1

My solution is similar to yours. I only added .apply(lambda x: pd.Series(x.unique()), axis=1) to remove duplicates within each row. Another possible, faster way to get the sizes is groupby with size and unstack:
grouped = (df.set_index('EVALUATED')
    .filter(like='TYPE_')
    .apply(lambda x: pd.Series(x.unique()), axis=1)
    .stack()
    .to_frame('TYPE')
    .reset_index()
    .groupby(['TYPE', 'EVALUATED'])
    .size()
    .unstack(fill_value=0))
print (grouped)
EVALUATED 0 1
TYPE
123A 0 2
456 1 2
567 0 1
A234 1 0
Your solution, with the same line added:
grouped = (df.set_index('EVALUATED')
.filter(like='TYPE_')
.apply(lambda x: pd.Series(x.unique()), axis=1) # added line: keep only unique values per row
.stack()
.to_frame('TYPE')
.reset_index()
.pivot_table(index='TYPE', columns='EVALUATED', aggfunc='size', fill_value=0)
).reset_index()
print (grouped)
EVALUATED TYPE 0 1
0 123A 0 2
1 456 1 2
2 567 0 1
3 A234 1 0
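To see what the added apply step does on its own, here is a minimal sketch using the question's data. The names dedup and per_row are only illustrative. Rows with fewer distinct values are padded with NaN, and those cells do not contribute to the counts later in the chain.

```python
import pandas as pd

df = pd.DataFrame({
    "DATA": ["WW", "AA", "BB"],
    "TYPE_1": ["A234", "456", "456"],
    "TYPE_2": ["456", "123A", "123A"],
    "TYPE_3": ["456", "567", "456"],
    "EVALUATED": [0, 1, 1],
})

# Keep each row's distinct TYPE_* values; shorter rows are padded with NaN.
dedup = (df.set_index("EVALUATED")
           .filter(like="TYPE_")
           .apply(lambda row: pd.Series(row.unique()), axis=1))

# Number of distinct values kept in each row (illustrative check).
per_row = dedup.notna().sum(axis=1).tolist()
print(dedup)
print(per_row)
```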

Steps:
1) Subset the columns whose names contain "TYPE_".
2) Construct a new DF from their unique values across columns. Let its new index be the contents of the EVALUATED column.
3) Perform a groupby on the index by specifying level=0. Using apply, flatten each group so that a series is created, and take its distinct counts using value_counts. Unstack the result with fill_value=0. Finally, transpose the resulting dataframe.
TYPE_cols = df.filter(like="TYPE_")
d = pd.DataFrame([pd.unique(x) for x in TYPE_cols.values], df['EVALUATED'].values)
result = d.groupby(level=0).apply(lambda x: pd.value_counts(x.values.ravel()))
result.unstack(fill_value=0).T.reset_index().rename(columns={"index":"TYPE"})
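The same idea (deduplicate within each row, then count per EVALUATED value) can also be sketched with melt, drop_duplicates and pd.crosstab. This is an alternative formulation, not the method used in the answers above, and the helper column name row is an assumption introduced only for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "DATA": ["WW", "AA", "BB"],
    "TYPE_1": ["A234", "456", "456"],
    "TYPE_2": ["456", "123A", "123A"],
    "TYPE_3": ["456", "567", "456"],
    "EVALUATED": [0, 1, 1],
})

type_cols = df.filter(like="TYPE_").columns.tolist()

# Long format: one line per (row, TYPE value); "row" is a hypothetical helper column.
long = (df.assign(row=range(len(df)))
          .melt(id_vars=["row", "EVALUATED"], value_vars=type_cols, value_name="TYPE")
          .drop_duplicates(subset=["row", "TYPE"]))  # count each value once per row

# Cross-tabulate TYPE against EVALUATED to get the counts.
grouped = pd.crosstab(long["TYPE"], long["EVALUATED"])
print(grouped)
```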

Related

Defining an aggregation function with groupby in pandas

I would like to collapse my dataset using groupby and agg, however after collapsing, I want the new column to show a string value only for the grouped rows.
For example, the initial data is:
df = pd.DataFrame([["a",1],["a",3],["b",2]], columns=['category','value'])
category value
0 a 1
1 a 3
2 b 2
Desired output:
category value
0 a grouped
1 b 2
How should I modify my code (to show "grouped" instead of 3):
df=df.groupby(['category'], as_index=False).agg({'value':'max'})
You can use a lambda with a conditional expression:
(df.groupby("category", as_index=False)
    .agg({"value": lambda x: "grouped" if len(x) > 1 else x.iloc[0]}))
This outputs:
category value
0 a grouped
1 b 2
Another possible solution:
(df.assign(value = np.where(
df.duplicated(subset=['category'], keep=False), 'grouped', df['value']))
.drop_duplicates())
Output:
category value
0 a grouped
2 b 2
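A third variant, sketched here under the assumption that the first row of each category is the one to keep, computes the group sizes with transform and replaces the values of multi-row groups. The variable names sizes and out are illustrative only.

```python
import pandas as pd

df = pd.DataFrame([["a", 1], ["a", 3], ["b", 2]], columns=["category", "value"])

# Size of the group each row belongs to, aligned with the original rows.
sizes = df.groupby("category")["value"].transform("size")

# Keep the value for single-row groups, otherwise write "grouped".
# astype(object) lets the column hold both integers and strings.
out = (df.assign(value=df["value"].astype(object).where(sizes == 1, "grouped"))
         .drop_duplicates(subset="category")
         .reset_index(drop=True))
print(out)
```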

Creating a new map from existing maps in python

This question might be common but I am new to python and would like to learn more from the community. I have 2 map files which have data mapping like this:
map1 : A --> B
map2 : B --> C,D,E
I want to create a new map file which will be A --> C
What is the most efficient way to achieve this in python? A generic approach would be very helpful as I need to apply the same logic on different files and different columns
Example:
Map1:
1,100
2,453
3,200
Map2:
100,25,30,
200,300,,
250,190,20,1
My map3 should be:
1,25
2,0
3,300
As 453 is not present in map2, our map3 contains value 0 for key 2.
First create DataFrames:
df1 = pd.read_csv(Map1, header=None)
df2 = pd.read_csv(Map2, header=None)
Then use Series.map on the second column of df1, with a Series built from df2 by setting its first column as the index and selecting its second column. Finally, replace the missing values of unmatched keys with 0:
df1[1] = df1[1].map(df2.set_index(0)[1]).fillna(0).astype(int)
print (df1)
0 1
0 1 25
1 2 0
2 3 300
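Since the question asks for a generic approach, the mapping step could be wrapped in a small function. This is only a sketch: compose_maps and its parameters are hypothetical names, the file contents are inlined with io.StringIO instead of being read from disk, and it assumes the keys in the second map are unique and the mapped values are integers.

```python
import io
import pandas as pd

def compose_maps(first, second, key_col=0, value_col=1, default=0):
    """Hypothetical helper: send the values of first[1] through second.

    Assumes the keys in second[key_col] are unique and the values are integers.
    """
    lookup = second.set_index(key_col)[value_col]
    out = first.copy()
    # Unmatched keys become NaN after map, then are replaced by the default.
    out[1] = out[1].map(lookup).fillna(default).astype(int)
    return out

# Inlined copies of the example files from the question.
map1 = io.StringIO("1,100\n2,453\n3,200\n")
map2 = io.StringIO("100,25,30,\n200,300,,\n250,190,20,1\n")

df1 = pd.read_csv(map1, header=None)
df2 = pd.read_csv(map2, header=None)

map3 = compose_maps(df1, df2)
print(map3)
```

Passing a different value_col would pick another target column of the second map, which is one way to reuse the same logic on different files and columns.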
EDIT: To map multiple columns, use a left join on columns b and c, remove the columns that contain only missing values with DataFrame.dropna, drop the join columns, and finally replace the remaining missing values:
df1.columns=['a','b']
df2.columns=['c','d','e','f']
df = (df1.merge(df2, how='left', left_on='b', right_on='c')
.dropna(how='all', axis=1)
.drop(['b','c'], axis=1)
.fillna(0)
.convert_dtypes())
print (df)
a d e
0 1 25 30
1 2 0 0
2 3 300 0

How to count data in a column based on another column separately?

I have two dataframe like this:
df1 = pd.DataFrame({'a':[1,2]})
df2 = pd.DataFrame({'a':[1,1,1,2,2,3,4,5,6,7,8]})
I want to count how often each of the two numbers in df1 occurs in df2, separately. The correct answer looks like this:
No Amount
1 3
2 2
Instead of:
No Amount
1 5
2 5
How can I solve this problem?
First filter df2 for values that are contained in df1['a'], then apply value_counts. The rest of the code just presents the data in your desired format.
result = (
df2[df2['a'].isin(df1['a'].unique())]['a']
.value_counts()
.reset_index()
)
result.columns = ['No', 'Amount']
>>> result
No Amount
0 1 3
1 2 2
Since pandas 0.21.0 you can use set_axis to rename columns as a chained method (on pandas 2.0 and later the inplace argument no longer exists and must be dropped). Here's a one-line solution:
df2[df2.a.isin(df1.a)]\
.squeeze()\
.value_counts()\
.reset_index()\
.set_axis(['No','Amount'], axis=1, inplace=False)
Output:
No Amount
0 1 3
1 2 2
You can simply compute the value_counts of the second df and map them onto the first df, i.e.:
df1['Amount'] = df1['a'].map(df2['a'].value_counts())
df1 = df1.rename(columns={'a':'No'})
Output :
No Amount
0 1 3
1 2 2
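If df1 may contain numbers that never occur in df2, the map approach leaves NaN for them. A minimal sketch that fills those with 0 follows. The extra value 9 in df1 is an assumption added here only to show that case and is not part of the question's data.

```python
import pandas as pd

# 9 is a hypothetical extra value that does not occur in df2.
df1 = pd.DataFrame({"a": [1, 2, 9]})
df2 = pd.DataFrame({"a": [1, 1, 1, 2, 2, 3, 4, 5, 6, 7, 8]})

# Occurrences of every value in df2, indexed by the value itself.
counts = df2["a"].value_counts()

# Look up each number of df1; numbers missing from df2 get a count of 0.
result = pd.DataFrame({
    "No": df1["a"],
    "Amount": df1["a"].map(counts).fillna(0).astype(int),
})
print(result)
```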

how to get distinct value while using apply.join in Pandas DataFrame

I am joining different values from a dataset into one column of a pandas DataFrame, but the result contains lots of duplicates. How can I get rid of them without deleting any rows?
example:
newCol
------
123,456,129,123,123
237,438,365,432,438
using df.newCol.drop_duplicates() removes the entire rows but I want the result to be:
newCol
------
123,456,129
237,438,365,432
...
thank you
First split the string, then apply set, and finally join. Note that set does not preserve the original order:
df.newCol = df.newCol.apply(lambda x: ','.join(set(str(x).split(','))))
print (df)
newCol
0 129,123,456
1 432,365,438,237
Alternatively, if the values are still in separate columns, apply set to each row before joining:
print (df)
0 1 2 3 4
0 123 456 129 123 123
1 237 438 365 432 438
df = df.apply(lambda x: ','.join(set(x.astype(str))), axis=1)
print (df)
0 129,123,456
1 432,365,438,237
dtype: object
Or use unique, which also keeps the original order of the values:
df = df.apply(lambda x: ','.join((x.astype(str)).unique()), axis=1)
print (df)
0 123,456,129
1 237,438,365,432
dtype: object
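When the values are already joined into one string, the original order can also be kept by using dict.fromkeys instead of set, since dictionaries preserve insertion order. A minimal sketch, assuming the column holds comma-separated strings as in the question:

```python
import pandas as pd

df = pd.DataFrame({"newCol": ["123,456,129,123,123", "237,438,365,432,438"]})

# dict.fromkeys drops repeated items while keeping the first occurrence of each.
df["newCol"] = df["newCol"].apply(
    lambda s: ",".join(dict.fromkeys(str(s).split(",")))
)
print(df)
```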

In Pandas, after groupby the grouped column is gone

I have the following dataframe named ttm:
usersidid clienthostid eventSumTotal LoginDaysSum score
0 12 1 60 3 1728
1 11 1 240 3 1331
3 5 1 5 3 125
4 6 1 16 2 216
2 10 3 270 3 1000
5 8 3 18 2 512
When I do
ttm.groupby(['clienthostid'], as_index=False, sort=False)['LoginDaysSum'].count()
I get what I expected (though I would've wanted the results to be under a new label named 'ratio'):
clienthostid LoginDaysSum
0 1 4
1 3 2
But when I do
ttm.groupby(['clienthostid'], as_index=False, sort=False)['LoginDaysSum'].apply(lambda x: x.iloc[0] / x.iloc[1])
I get:
0 1.0
1 1.5
Why did the labels go? I still need the grouped 'clienthostid' column, and I also need the results of the apply to be under a label.
Sometimes when I do a groupby, some of the other columns still appear. Why do columns sometimes disappear and sometimes stay? Is there a flag I'm missing that controls this?
In the example that I gave, when I did count, the results were shown under the label 'LoginDaysSum'. Is there a way to add a new label for the results instead?
Thank you,
There are two possible ways to get a DataFrame back after groupby:
1) The parameter as_index=False, which works nicely with functions such as count, sum and mean.
2) reset_index, which creates new columns from the index levels and is the more general solution.
df = ttm.groupby(['clienthostid'], as_index=False, sort=False)['LoginDaysSum'].count()
print (df)
clienthostid LoginDaysSum
0 1 4
1 3 2
df = ttm.groupby(['clienthostid'], sort=False)['LoginDaysSum'].count().reset_index()
print (df)
clienthostid LoginDaysSum
0 1 4
1 3 2
For the second case, remove as_index=False and add reset_index instead:
#output is `Series`
a = ttm.groupby(['clienthostid'], sort=False)['LoginDaysSum'] \
.apply(lambda x: x.iloc[0] / x.iloc[1])
print (a)
clienthostid
1 1.0
3 1.5
Name: LoginDaysSum, dtype: float64
print (type(a))
<class 'pandas.core.series.Series'>
print (a.index)
Int64Index([1, 3], dtype='int64', name='clienthostid')
df1 = (ttm.groupby(['clienthostid'], sort=False)['LoginDaysSum']
    .apply(lambda x: x.iloc[0] / x.iloc[1]).reset_index(name='ratio'))
print (df1)
clienthostid ratio
0 1 1.0
1 3 1.5
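Both labelling needs from the question can be sketched together: named aggregation gives the count a custom label, and reset_index(name=...) labels the result of apply. This is a minimal example that rebuilds only the two relevant columns of ttm. The label ratio comes from the question, the other variable names are illustrative.

```python
import pandas as pd

# Only the columns needed here, in the row order shown in the question.
ttm = pd.DataFrame(
    {"clienthostid": [1, 1, 1, 1, 3, 3], "LoginDaysSum": [3, 3, 3, 2, 3, 2]},
    index=[0, 1, 3, 4, 2, 5],
)

# Named aggregation: the keyword becomes the label of the output column.
counts = ttm.groupby("clienthostid", as_index=False, sort=False).agg(
    ratio=("LoginDaysSum", "count")
)

# apply returns a Series indexed by the group key; reset_index names its values.
ratios = (ttm.groupby("clienthostid", sort=False)["LoginDaysSum"]
             .apply(lambda s: s.iloc[0] / s.iloc[1])
             .reset_index(name="ratio"))

print(counts)
print(ratios)
```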
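Why are some columns gone?
I think the cause is the automatic exclusion of nuisance columns: in pandas versions before 2.0, non-numeric columns can be silently dropped from aggregations such as sum: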
#convert column to str
ttm.usersidid = ttm.usersidid.astype(str) + 'aa'
print (ttm)
usersidid clienthostid eventSumTotal LoginDaysSum score
0 12aa 1 60 3 1728
1 11aa 1 240 3 1331
3 5aa 1 5 3 125
4 6aa 1 16 2 216
2 10aa 3 270 3 1000
5 8aa 3 18 2 512
# the str column usersidid is removed from the result
a = ttm.groupby(['clienthostid'], sort=False).sum()
print (a)
eventSumTotal LoginDaysSum score
clienthostid
1 321 11 3400
3 288 5 1512
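To make the exclusion explicit rather than relying on version-specific defaults, numeric_only=True can be passed to sum. A minimal sketch with the question's data, where totals is only an illustrative name:

```python
import pandas as pd

ttm = pd.DataFrame(
    {
        "usersidid": [12, 11, 5, 6, 10, 8],
        "clienthostid": [1, 1, 1, 1, 3, 3],
        "eventSumTotal": [60, 240, 5, 16, 270, 18],
        "LoginDaysSum": [3, 3, 3, 2, 3, 2],
        "score": [1728, 1331, 125, 216, 1000, 512],
    },
    index=[0, 1, 3, 4, 2, 5],
)

# Turn usersidid into a string column, as in the answer above.
ttm["usersidid"] = ttm["usersidid"].astype(str) + "aa"

# Explicitly ask for numeric columns only, so the string column is left out.
totals = ttm.groupby("clienthostid", sort=False).sum(numeric_only=True)
print(totals)
```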
What is the difference between size and count in pandas?
count is a built-in method of the groupby object, and pandas knows what to do with it. Two other things that you specify determine what the output looks like.
# For a built in method, when
# you don't want the group column
# as the index, pandas keeps it in
# as a column.
# |----||||----|
ttm.groupby(['clienthostid'], as_index=False, sort=False)['LoginDaysSum'].count()
clienthostid LoginDaysSum
0 1 4
1 3 2
# For a built in method, when
# you do want the group column
# as the index, then...
# |----||||---|
ttm.groupby(['clienthostid'], as_index=True, sort=False)['LoginDaysSum'].count()
# |-----||||-----|
# the single brackets tells
# pandas to operate on a series
# in this case, count the series
clienthostid
1 4
3 2
Name: LoginDaysSum, dtype: int64
ttm.groupby(['clienthostid'], as_index=True, sort=False)[['LoginDaysSum']].count()
# |------||||------|
# the double brackets tells pandas
# to operate on the dataframe
# specified by these columns and will
# return a dataframe
LoginDaysSum
clienthostid
1 4
3 2
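The effect of single versus double brackets can be checked directly by looking at the returned types. A minimal sketch, rebuilding only the two relevant columns of ttm:

```python
import pandas as pd

ttm = pd.DataFrame(
    {"clienthostid": [1, 1, 1, 1, 3, 3], "LoginDaysSum": [3, 3, 3, 2, 3, 2]},
    index=[0, 1, 3, 4, 2, 5],
)

grouped = ttm.groupby(["clienthostid"], as_index=True, sort=False)

# Single brackets select one column as a Series, so count returns a Series.
as_series = grouped["LoginDaysSum"].count()

# Double brackets select a list of columns, so count returns a DataFrame.
as_frame = grouped[["LoginDaysSum"]].count()

print(type(as_series).__name__, type(as_frame).__name__)
```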
When you use apply, pandas no longer knows what to do with the group column when you say as_index=False. It has to trust that if you use apply you want returned exactly what you say to return, so it will just throw the group column away. Also, you have single brackets around your column, which says to operate on a series. Instead, use as_index=True to keep the grouping column information in the index. Then follow it up with a reset_index to transfer it from the index back into the dataframe. At this point, it will not have mattered that you used single brackets, because after the reset_index you'll have a dataframe again.
ttm.groupby(['clienthostid'], as_index=True, sort=False)['LoginDaysSum'].apply(lambda x: x.iloc[0] / x.iloc[1])
clienthostid
1 1.0
3 1.5
Name: LoginDaysSum, dtype: float64
ttm.groupby(['clienthostid'], as_index=True, sort=False)['LoginDaysSum'].apply(lambda x: x.iloc[0] / x.iloc[1]).reset_index()
clienthostid LoginDaysSum
0 1 1.0
1 3 1.5
Reading the groupby documentation, I found out that automatic exclusion of columns after groupby is usually caused by null values in the excluded columns.
Try filling the nulls with some value.
Like this:
df.fillna('')
You simply need this instead:
ttm.groupby(['clienthostid'], as_index=False, sort=False)[['LoginDaysSum']].apply(lambda x: x.iloc[0] / x.iloc[1]).reset_index()
The double brackets [[]] turn the output into a pd.DataFrame instead of a pd.Series.
