Convert dataframe to dictionary of list of tuples - python

I have a dataframe that looks like the following
                                       user                             item  rating
0  b80344d063b5ccb3212f76538f3d9e43d87dca9e          The Cove - Jack Johnson       1
1  b80344d063b5ccb3212f76538f3d9e43d87dca9e  Entre Dos Aguas - Paco De Lucia       2
2  b80344d063b5ccb3212f76538f3d9e43d87dca9e            Stronger - Kanye West       1
3  b80344d063b5ccb3212f76538f3d9e43d87dca9e    Constellations - Jack Johnson       1
4  b80344d063b5ccb3212f76538f3d9e43d87dca9e      Learn To Fly - Foo Fighters       1
and would like to achieve the following structure:
dict -> list of tuples
user -> [(item, rating), ...]
b80344d063b5ccb3212f76538f3d9e43d87dca9e -> [('The Cove - Jack Johnson', 1), ..., ]
I can do:
item_set = dict((user, set(items)) for user, items in data.groupby('user')['item'])
But that only gets me halfway. How do I get the corresponding "rating" value from the groupby?

Set user as the index, convert each row to a tuple with df.apply, group by the index with df.groupby(level=0), aggregate into a list with agg, and convert to a dictionary with to_dict:
In [1417]: df
Out[1417]:
                                       user                             item  rating
0  b80344d063b5ccb3212f76538f3d9e43d87dca9e          The Cove - Jack Johnson       1
1  b80344d063b5ccb3212f76538f3d9e43d87dca9e  Entre Dos Aguas - Paco De Lucia       2
2  b80344d063b5ccb3212f76538f3d9e43d87dca9e            Stronger - Kanye West       2
3  b80344d063b5ccb3212f76538f3d9e43d87dca9e    Constellations - Jack Johnson       2
4  b80344d063b5ccb3212f76538f3d9e43d87dca9e      Learn To Fly - Foo Fighters       2
In [1418]: df.set_index('user').apply(tuple, 1)\
.groupby(level=0).agg(lambda x: list(x.values))\
.to_dict()
Out[1418]:
{'b80344d063b5ccb3212f76538f3d9e43d87dca9e': [('The Cove - Jack Johnson', 1),
('Entre Dos Aguas - Paco De Lucia', 2),
('Stronger - Kanye West', 2),
('Constellations - Jack Johnson', 2),
('Learn To Fly - Foo Fighters', 2)]}
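For reference, a plain dict comprehension over the groupby gives the same mapping without the apply/agg chain; this is just an alternative sketch using the column names from the question:
item_ratings = {
    user: list(zip(grp['item'], grp['rating']))  # pair each item with its rating
    for user, grp in df.groupby('user')
}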

Related

Update DataFrame based on matching rows in another DataFrame

Say there is a group of people who can choose an English and / or a Spanish word. Let's say they chose like this:
>>> pandas.DataFrame(dict(person=['mary','james','patricia','robert','jennifer','michael'],
...                       english=['water',None,'hello','thanks',None,'green'],
...                       spanish=[None,'agua',None,None,'bienvenido','verde']))
person english spanish
0 mary water None
1 james None agua
2 patricia hello None
3 robert thanks None
4 jennifer None bienvenido
5 michael green verde
Say I also have an English-Spanish dictionary (assume no duplicates, i.e. one-to-one relationship):
>>> pandas.DataFrame(dict(english=['hello','bad','green','thanks','welcome','water'],
...                       spanish=['hola','malo','verde','gracias','bienvenido','agua']))
english spanish
0 hello hola
1 bad malo
2 green verde
3 thanks gracias
4 welcome bienvenido
5 water agua
How can I fill in any missing words, i.e. update the first DataFrame using the second DataFrame where either english or spanish is None, to arrive at this:
>>> pandas.DataFrame(dict(person=['mary','james','patricia','robert','jennifer','michael'],
...                       english=['water','water','hello','thanks','welcome','green'],
...                       spanish=['agua','agua','hola','gracias','bienvenido','verde']))
person english spanish
0 mary water agua
1 james water agua
2 patricia hello hola
3 robert thanks gracias
4 jennifer welcome bienvenido
5 michael green verde
You can build a lookup from the dictionary DataFrame with map and fill the gaps with fillna:
df['english'] = df['english'].fillna(df['spanish'].map(df2.set_index('spanish')['english']))
df['spanish'] = df['spanish'].fillna(df['english'].map(df2.set_index('english')['spanish']))
df
Out[200]:
person english spanish
0 mary water agua
1 james water agua
2 patricia hello hola
3 robert thanks gracias
4 jennifer welcome bienvenido
5 michael green verde
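For reference, a self-contained sketch of the same map/fillna idea using the data from the question (the lookup Series are built once from df2; filling english first means the second fill can rely on a complete english column):
import pandas as pd

df = pd.DataFrame(dict(person=['mary','james','patricia','robert','jennifer','michael'],
                       english=['water',None,'hello','thanks',None,'green'],
                       spanish=[None,'agua',None,None,'bienvenido','verde']))
df2 = pd.DataFrame(dict(english=['hello','bad','green','thanks','welcome','water'],
                        spanish=['hola','malo','verde','gracias','bienvenido','agua']))

# Build both lookup Series from the dictionary DataFrame, then fill each
# column from a translation of the other.
es_to_en = df2.set_index('spanish')['english']
en_to_es = df2.set_index('english')['spanish']
df['english'] = df['english'].fillna(df['spanish'].map(es_to_en))
df['spanish'] = df['spanish'].fillna(df['english'].map(en_to_es))
print(df)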

How to group by several columns and get the values of another column? [duplicate]

This question already has answers here:
Get rows based on distinct values from one column
(2 answers)
Closed 1 year ago.
I have a dataframe with thousands of rows like this:
city zip_code name
paris 1 John
paris 1 Eric
paris 2 David
LA 3 David
LA 4 David
LA 4 NaN
How can I group by city and zip_code and get the names belonging to each (city, zip_code) group?
Expected output: a dataframe with rows identified by unique city and zip_code and the corresponding names in another column (one row per name):
city zip_code name
paris 1 John
Eric
paris 2 David
LA 3 David
LA 4 David
IIUC, you want to know the existing combinations of city and zip_code?
[k for k,_ in df.groupby(['city', 'zip_code'])]
output: [('LA', 3), ('LA', 4), ('paris', 1), ('paris', 2)]
edit following your change in the question:
It looks like you want:
df.drop_duplicates().dropna()
output:
city zip_code name
0 paris 1 John
1 paris 1 Eric
2 paris 2 David
3 LA 3 David
4 LA 4 David
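If the goal is instead one row per (city, zip_code) group with all names collected together, a possible sketch (not part of the answer above) is:
# Drop the NaN names first, then gather each group's names into a list.
names_per_group = (df.dropna(subset=['name'])
                     .groupby(['city', 'zip_code'])['name']
                     .apply(list)
                     .reset_index())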

How to create a duplicate flag (column) that counts duplicate rows based on two columns?

I have the following dataframe and would like to create a column at the end called "dup" showing the number of times the row shows up based on the "Seasons" and "Actor" columns. Ideally the dup column would look like this:
Name Seasons Actor dup
0 Stranger Things 3 Millie 1
1 Game of Thrones 8 Emilia 1
2 La Casa De Papel 4 Sergio 1
3 Westworld 3 Evan Rachel 1
4 Stranger Things 3 Millie 2
5 La Casa De Papel 4 Sergio 1
This should do what you need:
df['dup'] = df.groupby(['Seasons', 'Actor']).cumcount() + 1
Output:
Name Seasons Actor dup
0 Stranger Things 3 Millie 1
1 Game of Thrones 8 Emilia 1
2 La Casa De Papel 4 Sergio 1
3 Westworld 3 Evan Rachel 1
4 Stranger Things 3 Millie 2
5 La Casa De Papel 4 Sergio 2
As Scott Boston mentioned, according to your criteria the last row should also be 2 in the dup column.
Here is a similar post that provides more information: SQL-like window functions in PANDAS
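If what you want on every row is the total number of occurrences of its (Seasons, Actor) pair, rather than a running count, a sketch using transform would be (dup_total is a made-up column name for illustration):
# Each row gets the size of its (Seasons, Actor) group, so both
# 'Stranger Things' rows would show 2.
df['dup_total'] = df.groupby(['Seasons', 'Actor'])['Name'].transform('size')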

pandas - how to extract top three rows from the dataframe provided

My pandas DataFrame df produces the result below:
grouped = df[(df['X'] == 'venture') & (df['company_code'].isin(['TDS','XYZ','UVW']))].groupby(['company_code','sector'])['X_sector'].count()
The output of this is as follows:
company_code sector
TDS Meta 404
Electrical 333
Mechanical 533
Agri 453
XYZ Sports 331
Electrical 354
Movies 375
Manufacturing 355
UVW Sports 505
Robotics 345
Movies 56
Health 3263
Manufacturing 456
Others 524
Name: X_sector, dtype: int64
What I want to get is the top three sectors within the company codes.
What is the way to do it?
You will have to chain a groupby here. Consider this example:
import pandas as pd
import numpy as np

np.random.seed(111)
names = [
    'Robert Baratheon',
    'Jon Snow',
    'Daenerys Targaryen',
    'Theon Greyjoy',
    'Tyrion Lannister'
]
df = pd.DataFrame({
    'season': np.random.randint(1, 7, size=100),
    'actor': np.random.choice(names, size=100),
    'appearance': 1
})
s = df.groupby(['season','actor'])['appearance'].count()
print(s.sort_values(ascending=False).groupby('season').head(1)) # <-- head(3) for 3 values
Returns:
season actor
4 Daenerys Targaryen 7
6 Robert Baratheon 6
3 Robert Baratheon 6
5 Jon Snow 5
2 Theon Greyjoy 5
1 Jon Snow 4
Where s is (clipped at 4)
season actor
1 Daenerys Targaryen 2
Jon Snow 4
Robert Baratheon 2
Theon Greyjoy 3
Tyrion Lannister 4
2 Daenerys Targaryen 4
Jon Snow 3
Robert Baratheon 1
Theon Greyjoy 5
Tyrion Lannister 3
3 Daenerys Targaryen 2
Jon Snow 1
Robert Baratheon 6
Theon Greyjoy 3
Tyrion Lannister 3
4 ...
Why make things complicated when a simple chain of methods works (the grouping column is company_code, as in the question):
Z = (df.groupby('company_code')['sector'].value_counts()
       .groupby(level=0).head(3)
       .sort_values(ascending=False)
       .to_frame('counts').reset_index())
Z
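Another possible route, starting from the MultiIndex Series produced by the original groupby (called grouped in the question), is nlargest per group; this is only a sketch, not taken from either answer:
# Keep the three largest counts within each company_code.
top3 = grouped.groupby(level=0, group_keys=False).nlargest(3)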

Pandas: Concatenate two dataframes with different column names

I have two data frames
df1 =
actorID actorName
0 annie_potts Annie Potts
1 bill_farmer Bill Farmer
2 don_rickles Don Rickles
3 erik_von_detten Erik von Detten
4 greg-berg Greg Berg
df2 =
directorID directorName
0 john_lasseter John Lasseter
1 joe_johnston Joe Johnston
2 donald_petrie Donald Petrie
3 forest_whitaker Forest Whitaker
4 charles_shyer Charles Shyer
What I ideally want is a concatenation of these two dataframes, like pd.concat((df1, df2)):
actorID-directorID actorName-directorName
0 annie_potts Annie Potts
1 bill_farmer Bill Farmer
2 don_rickles Don Rickles
3 erik_von_detten Erik von Detten
4 greg-berg Greg Berg
5 john_lasseter John Lasseter
6 joe_johnston Joe Johnston
7 donald_petrie Donald Petrie
8 forest_whitaker Forest Whitaker
9 charles_shyer Charles Shyer
however, I want an easy way to specify that df1.actorName and df2.directorName should be stacked into one column, and likewise actorID / directorID. How can I do this?
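One possible sketch, assuming the intent is simply to stack the two frames under shared column names (the combined names below are taken from the desired output, not from an accepted answer):
import pandas as pd

# Rename both frames to the combined column names, then concatenate.
cols = ['actorID-directorID', 'actorName-directorName']
combined = pd.concat([df1.set_axis(cols, axis=1),
                      df2.set_axis(cols, axis=1)],
                     ignore_index=True)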
