pandas groupby column to list and keep certain values - python

I have the following dataframe:
id occupations
111 teacher
111 student
222 analyst
333 cook
111 driver
444 lawyer
I create a new column with a list of all the occupations:
new_df['occupation_list'] = df['id'].map(df.groupby('id')['occupations'].agg(list))
How do I only include teacher and student values in occupation_list?

You can filter before groupby:
to_map = (df[df['occupations'].isin(['teacher', 'student'])]
            .groupby('id')['occupations'].agg(list))
df['occupation_list'] = df['id'].map(to_map)
Output:
id occupations occupation_list
0 111 teacher [teacher, student]
1 111 student [teacher, student]
2 222 analyst NaN
3 333 cook NaN
4 111 driver [teacher, student]
5 444 lawyer NaN

You can also do it with transform, joining the values per group and splitting back into lists:
df.groupby('id')['occupations'].transform(' '.join).str.split()
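If you also need the filtering here, one option (a sketch) is to blank out the unwanted values first, so only teacher and student survive the split; groups with no match end up with an empty list rather than NaN:
# Replace values outside the wanted set with '' before joining per id
keep = df['occupations'].where(df['occupations'].isin(['teacher', 'student']), '')
df['occupation_list'] = keep.groupby(df['id']).transform(' '.join).str.split()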

You would just do a groupby and agg the column to a list:
df.groupby('id',as_index=False).agg({'occupations':lambda x: x.tolist()})
out:
>>> df
id occupations
0 111 teacher
1 111 student
2 222 analyst
3 333 cook
4 111 driver
5 444 lawyer
>>> df.groupby('id',as_index=False).agg({'occupations':lambda x: x.tolist()})
id occupations
0 111 [teacher, student, driver]
1 222 [analyst]
2 333 [cook]
3 444 [lawyer]
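Note that the lambda is not needed: agg accepts the list builtin directly.
# Equivalent and slightly more idiomatic
df.groupby('id', as_index=False).agg({'occupations': list})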

Related

Edit columns based on duplicate values found in Pandas

I have below dataframe:
No: Fee:
111 500
111 500
222 300
222 300
123 400
If data in No is duplicate, I want to keep only one fee and remove others.
Should look like below:
No: Fee:
111 500
111
222 300
222
123 400
I actually have no idea where to start, so please guide here.
Thanks.
Use DataFrame.duplicated to build a mask, then set an empty string with DataFrame.loc:
# flag rows duplicated across both columns
mask = df.duplicated(['No', 'Fee'])
df.loc[mask, 'Fee'] = ''
print (df)
No Fee
0 111 500
1 111
2 222 300
3 222
4 123 400
But this loses the numeric dtype, because numbers are now mixed with strings:
print (df['Fee'].dtype)
object
A possible solution, if you need a numeric column, is to use missing values instead:
import numpy as np

df.loc[mask, 'Fee'] = np.nan
print (df)
No Fee
0 111 500.0
1 111 NaN
2 222 300.0
3 222 NaN
4 123 400.0
print (df['Fee'].dtype)
float64
Or convert to the nullable Int64 dtype to keep integer values alongside missing ones:
df.loc[mask, 'Fee'] = np.nan
df['Fee'] = df['Fee'].astype('Int64')
print (df)
No Fee
0 111 500
1 111 <NA>
2 222 300
3 222 <NA>
4 123 400
print (df['Fee'].dtype)
Int64
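For reference, Series.mask condenses the NaN variant into a single step, using the same duplicated mask as above:
# mask replaces values with NaN wherever the condition is True
df['Fee'] = df['Fee'].mask(df.duplicated(['No', 'Fee']))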

Making a Unique Indicator Column in Pandas Based off of Two Columns

This has been a tricky one for me. So my data structure is roughly the following:
column_1 column_2
111      a
111      a
111      a
111      a
111      b
111      b
222      a
222      b
222      b
222      b
222      b
222      b
I want to get to this in the most efficient way possible from a processing perspective:
column_1 column_2 unique_id
111      a        1
111      a        2
111      a        3
111      a        4
111      b        1
111      b        2
222      a        1
222      b        1
222      b        2
222      b        3
222      b        4
222      b        5
In summary I want to create a column that will turn each row into a unique occurrence. The preference is that this unique_id column starts at 1 for each new combination of column_1 and column_2.
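A minimal sketch of one way to get there, assuming the rows are already in the desired order: GroupBy.cumcount numbers the rows within each group starting at 0, so adding 1 gives the wanted counter.
# Number rows within each (column_1, column_2) combination, starting at 1
df['unique_id'] = df.groupby(['column_1', 'column_2']).cumcount() + 1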

Transpose by grouping a Dataframe having both numeric and string variables

I have a DataFrame and I want to convert it into the following:
import pandas as pd
df = pd.DataFrame({'ID': [111, 111, 111, 222, 222, 333],
                   'class': ['merc', 'humvee', 'bmw', 'vw', 'bmw', 'merc'],
                   'imp': [1, 2, 3, 1, 2, 1]})
print(df)
ID class imp
0 111 merc 1
1 111 humvee 2
2 111 bmw 3
3 222 vw 1
4 222 bmw 2
5 333 merc 1
Desired output:
ID 0 1 2
0 111 merc humvee bmw
1 111 1 2 3
2 222 vw bmw
3 222 1 2
4 333 merc
5 333 1
I wish to transpose the entire dataframe, grouped by a particular column (ID in this case), while maintaining the row order.
My attempt: I tried using .set_index() and .unstack(), but it did not work.
Use GroupBy.cumcount to create a counter, then reshape with DataFrame.stack and Series.unstack:
df1 = (df.set_index(['ID', df.groupby('ID').cumcount()])
         .stack()
         .unstack(1, fill_value='')
         .reset_index(level=1, drop=True)
         .reset_index())
print (df1)
ID 0 1 2
0 111 merc humvee bmw
1 111 1 2 3
2 222 vw bmw
3 222 1 2
4 333 merc
5 333 1
Another method is to use groupby and concat. Although this is not totally dynamic, it works fine if you only have two columns to work with, namely class and imp:
s = df.set_index([df['ID'],df.groupby('ID').cumcount()]).unstack(1)
df1 = pd.concat([s['class'],s['imp']],axis=0).sort_index().fillna('')
print(df1)
idx 0 1 2
ID
111 merc humvee bmw
111 1 2 3
222 vw bmw
222 1 2
333 merc
333 1

How to compare a list with dataframe column headers and substitute the header's name?

There is a df
df_example =
id city street house flat
0 NY street_ny 111 01
1 LA street_la 222 02
2 SF street_sf 333 03
3 Vegas street_vg 444 04
4 Boston street_bs 555 05
And in a database there is a table where every column name matches a column id (without the id column):
sql_table (as df) =
column_name column_id
city 0
street 1
house 2
flat 3
I need to substitute in df_example column names with column ids from sql_table
Like this
id 0 1 2 3
0 NY street_ny 111 01
1 LA street_la 222 02
2 SF street_sf 333 03
3 Vegas street_vg 444 04
4 Boston street_bs 555 05
So far I got the list of column names without id column name
column_names_list = list(df_example)[1:]
column_names_list = ['city', 'street', 'house', 'flat']
But I have no idea how to proceed.
The .isin method isn't really what I need.
Appreciate any help.
Use rename with a dictionary created by zip (here df refers to the sql_table dataframe):
df_example = df_example.rename(columns=dict(zip(df['column_name'], df['column_id'])))
print (df_example)
id 0 1 2 3
0 0 NY street_ny 111 1
1 1 LA street_la 222 2
2 2 SF street_sf 333 3
3 3 Vegas street_vg 444 4
4 4 Boston street_bs 555 5
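An equivalent way to build the mapping, assuming the lookup table is loaded as a dataframe named sql_table as in the question:
# Build the {column_name: column_id} dictionary from the lookup table
mapping = sql_table.set_index('column_name')['column_id'].to_dict()
df_example = df_example.rename(columns=mapping)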

Compare 2 dataframes Pandas, returns wrong values

There are 2 dfs
datatypes are the same
df1 =
ID city name value
1 LA John 111
2 NY Sam 222
3 SF Foo 333
4 Berlin Bar 444
df2 =
ID city name value
1 NY Sam 223
2 LA John 111
3 SF Foo 335
4 London Foo1 999
5 Berlin Bar 444
I need to compare them and produce a new df containing only the rows that are in df2 but not in df1.
For some reason, the results of the different methods I've tried are wrong.
So far I've tried
pd.concat([df1, df2], join='inner', ignore_index=True)
but it returns all values together
pd.merge(df1, df2, how='inner')
it returns df1
then this one
df1[~(df1.iloc[:, 0].isin(list(df2.iloc[:, 0])))]
it returns df1
The desired output is
ID city name value
1 NY Sam 223
2 SF Foo 335
3 London Foo1 999
Use DataFrame.merge on all columns except the first, with the indicator parameter:
c = df1.columns[1:].tolist()
Or:
c = ['city', 'name', 'value']
df = (df2.merge(df1, on=c, indicator=True, how='left', suffixes=('', '_'))
         .query("_merge == 'left_only'")[df1.columns])
print (df)
ID city name value
0 1 NY Sam 223
2 3 SF Foo 335
3 4 London Foo1 999
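The same anti-join can also be sketched with a MultiIndex membership test: keep the df2 rows whose (city, name, value) tuple never appears in df1.
# True where the row's key tuple exists in df1; negate to keep the rest
c = ['city', 'name', 'value']
df = df2[~df2.set_index(c).index.isin(df1.set_index(c).index)]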
Try this:
print("------------------------------")
print(df1)
# df2 is built the same way as df1 in the question
print("------------------------------")
print(df2)
common = (df1.merge(df2, on=["city", "name"])
             .rename(columns={"value_y": "value", "ID_y": "ID"})
             .drop(columns=["value_x", "ID_x"]))
print("------------------------------")
print(common)
OUTPUT:
------------------------------
ID city name value
0 ID city name value
1 1 LA John 111
2 2 NY Sam 222
3 3 SF Foo 333
4 4 Berlin Bar 444
------------------------------
ID city name value
0 1 NY Sam 223
1 2 LA John 111
2 3 SF Foo 335
3 4 London Foo1 999
4 5 Berlin Bar 444
------------------------------
city name ID value
0 LA John 2 111
1 NY Sam 1 223
2 SF Foo 3 335
3 Berlin Bar 5 444
