Finding unique combinations of columns from a dataframe - python

In the data set below, I need to find the unique combinations of age and maritalstatus and assign each one a serial number.
Data set:
user age maritalstatus product
A Young married 111
B young married 222
C young Single 111
D old single 222
E old married 111
F teen married 222
G teen married 555
H adult single 444
I adult single 333
Expected output:
young married 0
young single 1
old single 2
old married 3
teen married 4
adult single 5
After finding the unique values as shown above, if I pass a new user like the one below,
user age maritalstatus
X young married
it should return the products as a list:
X : [111, 222]
If there is no matching combination, as below,
user age maritalstatus
Y adult married
it should return an empty list:
Y : []

First select only the columns needed for the output and call drop_duplicates, then add the new column from a range:
df = df[['age','maritalstatus']].drop_duplicates()
df['no'] = range(len(df.index))
print (df)
age maritalstatus no
0 Young married 0
1 young married 1
2 young Single 2
3 old single 3
4 old married 4
5 teen married 5
7 adult single 6
If you want to convert all values to lowercase first:
df = df[['age','maritalstatus']].apply(lambda x: x.str.lower()).drop_duplicates()
df['no'] = range(len(df.index))
print (df)
age maritalstatus no
0 young married 0
2 young single 1
3 old single 2
4 old married 3
5 teen married 4
7 adult single 5
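A related alternative (not part of the answer above, just a sketch on reconstructed data) is groupby().ngroup(), which assigns each combination a group number directly, without dropping duplicates first:

```python
import pandas as pd

# reconstructed from the question's data set, already lower-cased
df = pd.DataFrame({
    'user': list('ABCDEFGHI'),
    'age': ['young', 'young', 'young', 'old', 'old',
            'teen', 'teen', 'adult', 'adult'],
    'maritalstatus': ['married', 'married', 'single', 'single', 'married',
                      'married', 'married', 'single', 'single'],
    'product': [111, 222, 111, 222, 111, 222, 555, 444, 333],
})

# ngroup numbers each (age, maritalstatus) group; sort=False keeps
# first-appearance order, matching the expected output
df['no'] = df.groupby(['age', 'maritalstatus'], sort=False).ngroup()
print(df[['age', 'maritalstatus', 'no']].drop_duplicates())
```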
EDIT:
First convert to lowercase:
df[['age','maritalstatus']] = df[['age','maritalstatus']].apply(lambda x: x.str.lower())
print (df)
user age maritalstatus product
0 A young married 111
1 B young married 222
2 C young single 111
3 D old single 222
4 E old married 111
5 F teen married 222
6 G teen married 555
7 H adult single 444
8 I adult single 333
And then use merge to get the unique products converted to a list:
df2 = pd.DataFrame([{'user':'X', 'age':'young', 'maritalstatus':'married'}])
print (df2)
age maritalstatus user
0 young married X
a = pd.merge(df, df2, on=['age','maritalstatus'])['product'].unique().tolist()
print (a)
[111, 222]
df2 = pd.DataFrame([{'user':'X', 'age':'adult', 'maritalstatus':'married'}])
print (df2)
age maritalstatus user
0 adult married X
a = pd.merge(df, df2, on=['age','maritalstatus'])['product'].unique().tolist()
print (a)
[]
But if you need a new column, use transform:
df['prod'] = df.groupby(['age', 'maritalstatus'])['product'].transform('unique')
print (df)
user age maritalstatus product prod
0 A young married 111 [111, 222]
1 B young married 222 [111, 222]
2 C young single 111 [111]
3 D old single 222 [222]
4 E old married 111 [111]
5 F teen married 222 [222, 555]
6 G teen married 555 [222, 555]
7 H adult single 444 [444, 333]
8 I adult single 333 [444, 333]
EDIT1:
a = (pd.merge(df, df2, on=['age','maritalstatus'])
.groupby('user_y')['product']
.apply(lambda x: x.unique().tolist())
.to_dict())
print (a)
{'X': [111, 222]}
Detail:
print (pd.merge(df, df2, on=['age','maritalstatus']))
user_x age maritalstatus product user_y
0 A young married 111 X
1 B young married 222 X
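The merge step above can also be wrapped in a small lookup helper (a sketch on reconstructed data; the function name products_for is just illustrative) that returns an empty list for unseen combinations:

```python
import pandas as pd

# reconstructed from the question's data set, already lower-cased
df = pd.DataFrame({
    'user': list('ABCDEFGHI'),
    'age': ['young'] * 3 + ['old'] * 2 + ['teen'] * 2 + ['adult'] * 2,
    'maritalstatus': ['married', 'married', 'single', 'single', 'married',
                      'married', 'married', 'single', 'single'],
    'product': [111, 222, 111, 222, 111, 222, 555, 444, 333],
})

# precompute unique products per (age, maritalstatus) combination
lookup = df.groupby(['age', 'maritalstatus'])['product'].unique()

def products_for(age, maritalstatus):
    # Series.get returns None for unseen keys, so fall back to []
    prods = lookup.get((age, maritalstatus))
    return [] if prods is None else prods.tolist()

print(products_for('young', 'married'))  # [111, 222]
print(products_for('adult', 'married'))  # []
```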

One way is pd.factorize. Note that I convert the columns to lowercase first so the results make sense.
for col in ['user', 'age', 'maritalstatus']:
df[col] = df[col].str.lower()
df['category'] = list(zip(df.age, df.maritalstatus))
df['category'] = pd.factorize(df['category'])[0]
# user age maritalstatus product category
# 0 a young married 111 0
# 1 b young married 222 0
# 2 c young single 111 1
# 3 d old single 222 2
# 4 e old married 111 3
# 5 f teen married 222 4
# 6 g teen married 555 4
# 7 h adult single 444 5
# 8 i adult single 333 5
Finally, drop duplicates:
df_cats = df[['age', 'maritalstatus', 'category']].drop_duplicates()
# age maritalstatus category
# 0 young married 0
# 2 young single 1
# 3 old single 2
# 4 old married 3
# 5 teen married 4
# 7 adult single 5
To map a list of products, try this:
s = df.groupby(['age', 'maritalstatus'])['product'].apply(list)
df['prod_catwise'] = list(map(s.get, zip(df.age, df.maritalstatus)))
Another option is to use categorical data, which I highly recommend for workflows. You can easily extract codes from a categorical series via pd.Series.cat.codes.
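A minimal sketch of that categorical approach, on reconstructed data: combine the two columns into one key and take cat.codes. Note that cat.codes numbers categories in sorted label order, so the codes differ from pd.factorize's first-appearance order:

```python
import pandas as pd

# reconstructed from the question's data set, already lower-cased
df = pd.DataFrame({
    'age': ['young', 'young', 'young', 'old', 'old',
            'teen', 'teen', 'adult', 'adult'],
    'maritalstatus': ['married', 'married', 'single', 'single', 'married',
                      'married', 'married', 'single', 'single'],
})

# build a single key column, then let the categorical dtype assign codes
key = df['age'] + '_' + df['maritalstatus']
df['category'] = key.astype('category').cat.codes
print(df.drop_duplicates())
```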

Related

pandas groupby column to list and keep certain values

I have the following dataframe:
id occupations
111 teacher
111 student
222 analyst
333 cook
111 driver
444 lawyer
I create a new column with a list of all the occupations:
df['occupation_list'] = df['id'].map(df.groupby('id')['occupations'].agg(list))
How do I only include teacher and student values in occupation_list?
You can filter before groupby:
to_map = (df[df['occupations'].isin(['teacher', 'student'])]
.groupby('id')['occupations'].agg(list)
)
df['occupation_list'] = df['id'].map(to_map)
Output:
id occupations occupation_list
0 111 teacher [teacher, student]
1 111 student [teacher, student]
2 222 analyst NaN
3 333 cook NaN
4 111 driver [teacher, student]
5 444 lawyer NaN
You can also do
df.groupby('id')['occupations'].transform(' '.join).str.split()
You would just do a groupby and agg the column to a list:
df.groupby('id',as_index=False).agg({'occupations':lambda x: x.tolist()})
Output:
>>> df
id occupations
0 111 teacher
1 111 student
2 222 analyst
3 333 cook
4 111 driver
5 444 lawyer
>>> df.groupby('id',as_index=False).agg({'occupations':lambda x: x.tolist()})
id occupations
0 111 [teacher, student, driver]
1 222 [analyst]
2 333 [cook]
3 444 [lawyer]
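A variant of the groupby/apply approach above (a sketch, not from the answers) filters inside the lambda, so every id keeps a list — empty rather than NaN when nothing matches:

```python
import pandas as pd

# reconstructed from the question's dataframe
df = pd.DataFrame({
    'id': [111, 111, 222, 333, 111, 444],
    'occupations': ['teacher', 'student', 'analyst',
                    'cook', 'driver', 'lawyer'],
})

keep = {'teacher', 'student'}
# filter inside the per-group function so unmatched ids get []
out = df.groupby('id')['occupations'].apply(
    lambda x: [v for v in x if v in keep])
print(out)
```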

Pandas: Group by two parameters and sort by third parameter

I want to group my dataframe by two columns (Name and Budget) and then sort the aggregated results by a third parameter (Prio).
Name Budget Prio Quantity
peter A 2 12
B 1 123
joe A 3 34
B 1 51
C 2 43
I already checked this post, which was very helpful and leads to the following output. However, I cannot manage sorting by the third parameter (Prio).
df_agg = df.groupby(['Name','Budget','Prio']).agg({'Quantity':sum})
g = df_agg['Quantity'].groupby(level=0, group_keys=False)
res = g.apply(lambda x: x.sort_values(ascending=True))
I would now like to sort the prio in ascending order within each of the groups. To get something like:
Name Budget Prio Quantity
peter B 1 123
A 2 12
joe B 1 51
C 2 43
A 3 34
IIUC,
df.groupby(['Name','Budget','Prio']).agg({'Quantity':sum}).sort_values(['Name','Prio'])
Output:
Quantity
Name Budget Prio
joe B 1 51
C 2 43
A 3 34
peter B 1 123
A 2 12
If you want to sort only by Prio within each name (keeping peter first), you can use sort_index:
(df.groupby(['Name','Budget','Prio'])
.agg({'Quantity':'sum'})
.sort_index(level=['Name', 'Prio'],
ascending=[False, True])
)
Output:
Quantity
Name Budget Prio
peter B 1 123
A 2 12
joe B 1 51
C 2 43
A 3 34

Compare 2 dataframes Pandas, returns wrong values

There are 2 dfs
datatypes are the same
df1 =
ID city name value
1 LA John 111
2 NY Sam 222
3 SF Foo 333
4 Berlin Bar 444
df2 =
ID city name value
1 NY Sam 223
2 LA John 111
3 SF Foo 335
4 London Foo1 999
5 Berlin Bar 444
I need to compare them and produce a new df containing only the rows that are in df2 but not in df1.
For some reason, the results after applying different methods are wrong.
So far I've tried
pd.concat([df1, df2], join='inner', ignore_index=True)
but it returns all values together
pd.merge(df1, df2, how='inner')
it returns df1
then this one
df1[~(df1.iloc[:, 0].isin(list(df2.iloc[:, 0])))]
it returns df1
The desired output is
ID city name value
1 NY Sam 223
2 SF Foo 335
3 London Foo1 999
Use DataFrame.merge on all columns except the first (ID), with the indicator parameter:
c = df1.columns[1:].tolist()
Or:
c = ['city', 'name', 'value']
df = (df2.merge(df1,on=c, indicator = True, how='left', suffixes=('','_'))
.query("_merge == 'left_only'")[df1.columns])
print (df)
ID city name value
0 1 NY Sam 223
2 3 SF Foo 335
3 4 London Foo1 999
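An index-based alternative to the merge/indicator anti-join (a sketch on reconstructed data, under the same assumption that rows match on all columns except ID):

```python
import pandas as pd

# reconstructed from the question's df1 and df2
df1 = pd.DataFrame({'ID': [1, 2, 3, 4],
                    'city': ['LA', 'NY', 'SF', 'Berlin'],
                    'name': ['John', 'Sam', 'Foo', 'Bar'],
                    'value': [111, 222, 333, 444]})
df2 = pd.DataFrame({'ID': [1, 2, 3, 4, 5],
                    'city': ['NY', 'LA', 'SF', 'London', 'Berlin'],
                    'name': ['Sam', 'John', 'Foo', 'Foo1', 'Bar'],
                    'value': [223, 111, 335, 999, 444]})

c = ['city', 'name', 'value']
# keep df2 rows whose (city, name, value) tuple never appears in df1
mask = ~df2.set_index(c).index.isin(df1.set_index(c).index)
print(df2[mask])
```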
Try this:
common = (df1.merge(df2, on=["city", "name"])
             .rename(columns={"value_y": "value", "ID_y": "ID"})
             .drop(columns=["value_x", "ID_x"]))
print(common)
OUTPUT:
city name ID value
0 LA John 2 111
1 NY Sam 1 223
2 SF Foo 3 335
3 Berlin Bar 5 444

pandas: transform based on count of row value in another dataframe

I have two dataframes:
df1:
Gender Registered
female 1
male 0
female 0
female 1
male 1
male 0
df2:
Gender
female
female
male
male
I want to modify df2, so that there is a new column 'Count' with the count of registered = 1 for corresponding gender values from df1. For example, in df1 there are 2 registered females and 1 registered male. I want to transform the df2 so that the output is as follows:
output:
Gender Count
female 2
female 2
male 1
male 1
I tried many things and got close but couldn't make it fully work.
sum + map:
v = df1.groupby('Gender').Registered.sum()
df2.assign(Count=df2.Gender.map(v))
Gender Count
0 female 2
1 female 2
2 male 1
3 male 1
pd.merge
pd.merge(df2, df1.groupby('Gender', as_index=False).sum())
Gender Registered
0 female 2
1 female 2
2 male 1
3 male 1
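An equivalent filter-then-value_counts sketch (reconstructed data; this generalizes beyond a 0/1 Registered flag, where summing would no longer equal the count):

```python
import pandas as pd

# reconstructed from the question's dataframes
df1 = pd.DataFrame({
    'Gender': ['female', 'male', 'female', 'female', 'male', 'male'],
    'Registered': [1, 0, 0, 1, 1, 0],
})
df2 = pd.DataFrame({'Gender': ['female', 'female', 'male', 'male']})

# count only rows where Registered == 1, then map onto df2
counts = df1.loc[df1['Registered'].eq(1), 'Gender'].value_counts()
df2['Count'] = df2['Gender'].map(counts)
print(df2)
```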

Counting elements in Pandas

Let's say I have a pandas DataFrame like this
import pandas as pd
a=pd.Series({'Country':'Italy','Name':'Augustina','Gender':'Female','Number':1})
b=pd.Series({'Country':'Italy','Name':'Piero','Gender':'Male','Number':2})
c=pd.Series({'Country':'Italy','Name':'Carla','Gender':'Female','Number':3})
d=pd.Series({'Country':'Italy','Name':'Roma','Gender':'Female','Number':4})
e=pd.Series({'Country':'Greece','Name':'Sophia','Gender':'Female','Number':5})
f=pd.Series({'Country':'Greece','Name':'Zeus','Gender':'Male','Number':6})
df=pd.DataFrame([a,b,c,d,e,f])
Then I set a MultiIndex, like
df.set_index(['Country','Gender'],inplace=True)
Now, I would like to know how to count how many people are from Italy, or how many Greek females I have in the dataframe.
I've tried
df['Italy'].count()
and
df['Greece']['Female'].count()
Neither of them works.
Thanks
I think you need groupby with aggregating size:
What is the difference between size and count in pandas?
a=pd.DataFrame([{'Country':'Italy','Name':'Augustina','Gender':'Female','Number':1}])
b=pd.DataFrame([{'Country':'Italy','Name':'Piero','Gender':'Male','Number':2}])
c=pd.DataFrame([{'Country':'Italy','Name':'Carla','Gender':'Female','Number':3}])
d=pd.DataFrame([{'Country':'Italy','Name':'Roma','Gender':'Female','Number':4}])
e=pd.DataFrame([{'Country':'Greece','Name':'Sophia','Gender':'Female','Number':5}])
f=pd.DataFrame([{'Country':'Greece','Name':'Zeus','Gender':'Male','Number':6}])
df=pd.concat([a,b,c,d,e,f], ignore_index=True)
print (df)
Country Gender Name Number
0 Italy Female Augustina 1
1 Italy Male Piero 2
2 Italy Female Carla 3
3 Italy Female Roma 4
4 Greece Female Sophia 5
5 Greece Male Zeus 6
print (df.groupby('Country').size())
Country
Greece 2
Italy 4
dtype: int64
print (df.groupby(['Country', 'Gender']).size())
Country Gender
Greece Female 1
Male 1
Italy Female 3
Male 1
dtype: int64
If you need only some sizes, select by MultiIndex with xs or slicers:
df.set_index(['Country','Gender'],inplace=True)
print (df)
Name Number
Country Gender
Italy Female Augustina 1
Male Piero 2
Female Carla 3
Female Roma 4
Greece Female Sophia 5
Male Zeus 6
print (df.xs('Italy', level='Country'))
Name Number
Gender
Female Augustina 1
Male Piero 2
Female Carla 3
Female Roma 4
print (len(df.xs('Italy', level='Country').index))
4
print (df.xs(('Greece', 'Female'), level=('Country', 'Gender')))
Name Number
Country Gender
Greece Female Sophia 5
print (len(df.xs(('Greece', 'Female'), level=('Country', 'Gender')).index))
1
# Without sort_index first, slicers raise:
# KeyError: 'MultiIndex Slicing requires
# the index to be fully lexsorted tuple len (2), lexsort depth (0)'
df.sort_index(inplace=True)
idx = pd.IndexSlice
print (df.loc[idx['Italy', :],:])
Name Number
Country Gender
Italy Female Augustina 1
Female Carla 3
Female Roma 4
Male Piero 2
print (len(df.loc[idx['Italy', :],:].index))
4
print (df.loc[idx['Greece', 'Female'],:])
Name Number
Country Gender
Greece Female Sophia 5
print (len(df.loc[idx['Greece', 'Female'],:].index))
1
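As a compact alternative to xs and slicers when all you need are the counts (a sketch on the same reconstructed data), pd.crosstab tabulates every Country/Gender combination at once:

```python
import pandas as pd

# reconstructed from the question's data
df = pd.DataFrame({
    'Country': ['Italy', 'Italy', 'Italy', 'Italy', 'Greece', 'Greece'],
    'Gender': ['Female', 'Male', 'Female', 'Female', 'Female', 'Male'],
    'Name': ['Augustina', 'Piero', 'Carla', 'Roma', 'Sophia', 'Zeus'],
    'Number': [1, 2, 3, 4, 5, 6],
})

# crosstab gives a Country x Gender table of row counts
table = pd.crosstab(df['Country'], df['Gender'])
print(table)
# single cells answer questions like "how many Greek females?"
print(table.loc['Greece', 'Female'])
```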