Finding average in pandas from a value in a dictionary? - python

I am experimenting on a project with a large dataset of movies. I have a large data frame, with one column named "Genres" and one named "Vote Average". My goal is to find the 20 highest-rated genres based on "Vote Average".
I would use a group by but I can't seem to figure it out because the genre information looks like this in the column "Genres" :
[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'name': 'Drama'}, {'id': 10749, 'name': 'Romance'}]
How can I extract Comedy, Drama and Romance from the list above?
How can I group by individual genres while assigning each row's "Vote Average" to each of its genres, so I can print the top 20 rated genres in the data frame?
Genres Vote Average
1 [{'id': 16, 'name': 'Animation'}, {'id': 35, '... 7.7
2 [{'id': 12, 'name': 'Adventure'}, {'id': 14, '... 6.9
3 [{'id': 10749, 'name': 'Romance'}, {'id': 35, ... 6.5
4 [{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam... 6.1
5 [{'id': 35, 'name': 'Comedy'}] 5.7
... ... ...
32255 [{'id': 878, 'name': 'Science Fiction'}] 3.5
32256 [{'id': 18, 'name': 'Drama'}, {'id': 28, 'name... 5.7
32257 [{'id': 28, 'name': 'Action'}, {'id': 18, 'nam... 3.8
32258 [] 0.0
32259 [] 0.0
EDIT: Example from Data frame is above. movies_metadata.csv from https://www.kaggle.com/rounakbanik/the-movies-dataset

EDIT:
Now that I have seen all the information on Kaggle, I think it may need a totally different method, because these genres are assigned to titles and they can't be in separate rows.
OLD:
First you have to convert it to a proper DataFrame with genres in separate rows instead of
[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'name': 'Drama'}, ...]
Here is my example data
import pandas as pd
df = pd.DataFrame([
{'Genre': [{'id': 16, 'name': 'Animation'}], 'Vote Average': 7.7},
{'Genre': [{'id': 35, 'name': 'Comedy'}, {'id': 18, 'name': 'Drama'}, {'id': 10749, 'name': 'Romance'}], 'Vote Average': 6.1},
{'Genre': [{'id': 10749, 'name': 'Romance'}], 'Vote Average': 6.5},
])
print(df)
result:
Genre Vote Average
0 [{'id': 16, 'name': 'Animation'}] 7.7
1 [{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam... 6.1
2 [{'id': 10749, 'name': 'Romance'}] 6.5
You can iterate over every row and use pd.DataFrame(row['Genre']) to build a small DataFrame per row, then combine the pieces into a new DataFrame
frames = []
for index, row in df.iterrows():
    temp_df = pd.DataFrame(row['Genre'])
    temp_df['Vote Average'] = row['Vote Average']
    frames.append(temp_df)
# concatenate once at the end (DataFrame.append was removed in pandas 2.0)
new_df = pd.concat(frames)
print(new_df)
result:
id name Vote Average
0 16 Animation 7.7
0 35 Comedy 6.1
1 18 Drama 6.1
2 10749 Romance 6.1
0 10749 Romance 6.5
and now you can do whatever you like.
Another method to reshape the data:
First convert the list of dictionaries to separate rows, one dictionary per row
new_df = df.explode('Genre')
print(new_df)
result:
Genre Vote Average
0 {'id': 16, 'name': 'Animation'} 7.7
1 {'id': 35, 'name': 'Comedy'} 6.1
1 {'id': 18, 'name': 'Drama'} 6.1
1 {'id': 10749, 'name': 'Romance'} 6.1
2 {'id': 10749, 'name': 'Romance'} 6.5
and later convert every dictionary to columns
new_df['id'] = new_df['Genre'].str['id']
new_df['name'] = new_df['Genre'].str['name']
print(new_df)
result:
Genre Vote Average id name
0 {'id': 16, 'name': 'Animation'} 7.7 16 Animation
1 {'id': 35, 'name': 'Comedy'} 6.1 35 Comedy
1 {'id': 18, 'name': 'Drama'} 6.1 18 Drama
1 {'id': 10749, 'name': 'Romance'} 6.1 10749 Romance
2 {'id': 10749, 'name': 'Romance'} 6.5 10749 Romance
or using
new_df[['id','name']] = new_df['Genre'].apply(pd.Series)
Full example
import pandas as pd
df = pd.DataFrame([
{'Genre': [{'id': 16, 'name': 'Animation'}], 'Vote Average': 7.7},
{'Genre': [{'id': 35, 'name': 'Comedy'}, {'id': 18, 'name': 'Drama'}, {'id': 10749, 'name': 'Romance'}], 'Vote Average': 6.1},
{'Genre': [{'id': 10749, 'name': 'Romance'}], 'Vote Average': 6.5},
])
print('--- df ---')
print(df)
print('--- iterrows ---')
frames = []
for index, row in df.iterrows():
    temp_df = pd.DataFrame(row['Genre'])
    temp_df['Vote Average'] = row['Vote Average']
    frames.append(temp_df)
# concatenate once at the end (DataFrame.append was removed in pandas 2.0)
new_df = pd.concat(frames)
print(new_df)
print('--- explode #1 ---')
new_df = df.explode('Genre')
print(new_df)
print('--- columns #1 ---')
new_df['id'] = new_df['Genre'].str['id']
new_df['name'] = new_df['Genre'].str['name']
new_df.drop('Genre', inplace=True, axis=1)
new_df.reset_index(inplace=True)
print(new_df)
print('--- explode #2 ---')
new_df = df.explode('Genre')
print(new_df)
print('--- columns #2 ---')
new_df[['id','name']] = new_df['Genre'].apply(pd.Series)
new_df.drop('Genre', inplace=True, axis=1)
new_df.reset_index(inplace=True)
print(new_df)
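The examples above stop at the reshaped frame, so here is a minimal sketch of the remaining step from the question: averaging "Vote Average" per genre and keeping the top 20. It reuses the toy data from this answer plus one made-up row with an empty list; the names exploded and top_genres are only illustrative. It assumes the column already holds real lists; if it was read from CSV as text, each value would first need parsing (for example with ast.literal_eval).

```python
import pandas as pd

df = pd.DataFrame([
    {'Genre': [{'id': 16, 'name': 'Animation'}], 'Vote Average': 7.7},
    {'Genre': [{'id': 35, 'name': 'Comedy'}, {'id': 18, 'name': 'Drama'}, {'id': 10749, 'name': 'Romance'}], 'Vote Average': 6.1},
    {'Genre': [{'id': 10749, 'name': 'Romance'}], 'Vote Average': 6.5},
    {'Genre': [], 'Vote Average': 0.0},  # made-up row: a film with no genres
])

# One row per (film, genre); an empty list explodes to NaN, which is dropped
exploded = df.explode('Genre').dropna(subset=['Genre'])
exploded = exploded.assign(name=exploded['Genre'].str['name'])

# Mean vote per genre, highest first, at most 20 genres
top_genres = exploded.groupby('name')['Vote Average'].mean().nlargest(20)
print(top_genres)
```

Note that a film with three genres contributes its vote to all three averages, which is usually what "rating per genre" means but is worth keeping in mind.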

Related

Change column format of DF, where some columns are dicts

I'm new to pandas and I need help. Below I described my DF, which I need to change.
id title \
0 121852 {'en': 'Hard Fork'}
1 123209 {'en': 'Quarterly Public Meeting'}
2 122436 {'en': 'Luxy NFT Marketplace'}
3 122995 {'en': 'Poloniex Listing'}
4 123391 {'en': 'Staking 3.0 Release'}
5 123355 {'en': 'BitMart Listing'}
6 122819 {'en': 'Amazy IGO'}
7 123470 {'en': 'YouTube Live AMA'}
8 123392 {'en': 'AMA'}
9 123319 {'en': 'LBank Listing'}
10 123306 {'en': 'Community Call'}
11 123465 {'en': 'Digifinex Listing'}
12 123469 {'en': 'MEXC Global Listing'}
13 123512 {'en': 'Metarun & Fabwelt AMA'}
14 123460 {'en': 'Digifinex Listing'}
15 123489 {'en': 'BitMart Listing'}
coins \
0 [{'id': 'gxchain', 'coingecko_id': '', 'name': 'GXChain', 'rank': 442, 'symbol': 'GXC', 'fullname': 'GXChain (GXC)'}, {'id': 'rei-network', 'coingecko_id': '', 'name': 'REI Network', 'rank': 376, 'symbol': 'REI', 'fullname': 'REI Network (REI)'}]
1 [{'id': 'filecoin', 'coingecko_id': '', 'name': 'Filecoin', 'rank': 45, 'symbol': 'FIL', 'fullname': 'Filecoin (FIL)'}]
2 [{'id': 'luxy', 'coingecko_id': '', 'name': 'Luxy', 'rank': 0, 'symbol': 'LUXY', 'fullname': 'Luxy (LUXY)'}, {'id': 'syscoin', 'coingecko_id': '', 'name': 'Syscoin', 'rank': 240, 'symbol': 'SYS', 'fullname': 'Syscoin (SYS)'}]
3 [{'id': 'bitkub-coin', 'coingecko_id': '', 'name': 'Bitkub Coin', 'rank': 125, 'symbol': 'KUB', 'fullname': 'Bitkub Coin (KUB)'}]
4 [{'id': 'sidus', 'coingecko_id': '', 'name': 'Sidus', 'rank': 1231, 'symbol': 'SIDUS', 'fullname': 'Sidus (SIDUS)'}]
5 [{'id': 'solve', 'coingecko_id': '', 'name': 'SOLVE', 'rank': 693, 'symbol': 'SOLVE', 'fullname': 'SOLVE (SOLVE)'}]
6 [{'id': 'seedify-fund', 'coingecko_id': '', 'name': 'Seedify.fund', 'rank': 389, 'symbol': 'SFUND', 'fullname': 'Seedify.fund (SFUND)'}]
7 [{'id': 'oasis-network', 'coingecko_id': '', 'name': 'Oasis Network', 'rank': 134, 'symbol': 'ROSE', 'fullname': 'Oasis Network (ROSE)'}]
8 [{'id': 'dydx', 'coingecko_id': '', 'name': 'dYdX', 'rank': 157, 'symbol': 'DYDX', 'fullname': 'dYdX (DYDX)'}]
9 [{'id': 'grove', 'coingecko_id': '', 'name': 'Grove', 'rank': 0, 'symbol': 'GVR', 'fullname': 'Grove (GVR)'}]
10 [{'id': 'perpetual-protocol', 'coingecko_id': '', 'name': 'Perpetual Protocol', 'rank': 373, 'symbol': 'PERP', 'fullname': 'Perpetual Protocol (PERP)'}]
11 [{'id': 'new-paradigm-assets-solution', 'coingecko_id': '', 'name': 'New Paradigm Assets Solution', 'rank': 0, 'symbol': 'NPAS', 'fullname': 'New Paradigm Assets Solution (NPAS)'}]
12 [{'id': 'handy', 'coingecko_id': '', 'name': 'Handy', 'rank': 0, 'symbol': 'HANDY', 'fullname': 'Handy (HANDY)'}]
13 [{'id': 'fabwelt', 'coingecko_id': '', 'name': 'Fabwelt', 'rank': 2626, 'symbol': 'WELT', 'fullname': 'Fabwelt (WELT)'}, {'id': 'metarun', 'coingecko_id': '', 'name': 'Metarun', 'rank': 3092, 'symbol': 'MRUN', 'fullname': 'Metarun (MRUN)'}]
14 [{'id': 'dungeon', 'coingecko_id': '', 'name': 'Dungeon', 'rank': 0, 'symbol': 'DGN', 'fullname': 'Dungeon (DGN)'}]
15 [{'id': 'monetha', 'coingecko_id': '', 'name': 'Monetha', 'rank': 1967, 'symbol': 'MTH', 'fullname': 'Monetha (MTH)'}]
date_event can_occur_before created_date \
0 2022-07-13T00:00:00Z False 2022-06-27T14:39:15Z
1 2022-07-13T00:00:00Z False 2022-07-09T13:27:25Z
2 2022-07-13T00:00:00Z False 2022-07-02T06:10:09Z
3 2022-07-13T00:00:00Z False 2022-07-07T13:55:34Z
4 2022-07-13T00:00:00Z False 2022-07-11T18:42:01Z
5 2022-07-13T00:00:00Z False 2022-07-11T18:16:08Z
6 2022-07-13T00:00:00Z False 2022-07-06T06:55:16Z
7 2022-07-13T00:00:00Z False 2022-07-12T13:59:23Z
8 2022-07-13T00:00:00Z False 2022-07-11T18:43:02Z
9 2022-07-13T00:00:00Z False 2022-07-11T14:12:23Z
10 2022-07-13T00:00:00Z False 2022-07-11T14:11:47Z
11 2022-07-13T00:00:00Z False 2022-07-12T13:49:28Z
12 2022-07-13T00:00:00Z False 2022-07-12T14:05:15Z
13 2022-07-13T00:00:00Z False 2022-07-12T18:46:28Z
14 2022-07-13T00:00:00Z False 2022-07-12T13:48:55Z
15 2022-07-13T00:00:00Z False 2022-07-12T23:33:03Z
categories \
0 [{'id': 14, 'name': 'Fork/Swap'}]
1 [{'id': 16, 'name': 'Team Update'}]
2 [{'id': 4, 'name': 'Exchange'}]
3 [{'id': 4, 'name': 'Exchange'}]
4 [{'id': 17, 'name': 'Staking/Farming'}]
5 [{'id': 4, 'name': 'Exchange'}]
6 [{'id': 7, 'name': 'Other'}]
7 [{'id': 9, 'name': 'AMA'}]
8 [{'id': 9, 'name': 'AMA'}]
9 [{'id': 4, 'name': 'Exchange'}]
10 [{'id': 16, 'name': 'Team Update'}]
11 [{'id': 4, 'name': 'Exchange'}]
12 [{'id': 4, 'name': 'Exchange'}]
13 [{'id': 9, 'name': 'AMA'}]
14 [{'id': 4, 'name': 'Exchange'}]
15 [{'id': 4, 'name': 'Exchange'}]
I need to change the column "title": remove the key 'en' and keep only the values.
I need to change the column "coins": extract the keys as separate columns and put their values there.
I need to change the column "categories": remove the key "id" and its values, remove the key "name", but keep the values from "name".
For columns with lists in the rows, I would use pandas.explode.
For columns with dicts in the rows, use .apply(pandas.Series),
and then rename the columns that share a name (like 'id') if you want to use them, or reformat the dicts when you get the parsed JSON.
It should look like this:
import pandas
df = pandas.DataFrame({
'1': [[{"id": 1, "a": 1}], [{"id": 2, "a": 1}]],
'2': [1, 2],
'3': [[{"id": 1, "name": "a"}], [{"id": 2, "name": "b"}]],
'4': [{"en": "a"}, {"en": "b"}]
})
df = df.explode(["1", "3"])
pandas.concat([
df.drop(["1", "3", "4"], axis=1),
df['1'].apply(pandas.Series),
df['3'].apply(pandas.Series),
df['4'].apply(pandas.Series)
], axis=1)
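Applied to the three columns named in the question, the same idea might look like the sketch below. The frame is a two-row stand-in with trimmed coin keys, and the coin_ prefix and the names exploded, coin_cols and result are assumptions chosen here to avoid a clash with the existing id column.

```python
import pandas as pd

# Small stand-in for the frame in the question (two rows, trimmed keys)
df = pd.DataFrame({
    'id': [121852, 123209],
    'title': [{'en': 'Hard Fork'}, {'en': 'Quarterly Public Meeting'}],
    'coins': [
        [{'id': 'gxchain', 'symbol': 'GXC'}, {'id': 'rei-network', 'symbol': 'REI'}],
        [{'id': 'filecoin', 'symbol': 'FIL'}],
    ],
    'categories': [[{'id': 14, 'name': 'Fork/Swap'}], [{'id': 16, 'name': 'Team Update'}]],
})

# title: keep only the value stored under 'en'
df['title'] = df['title'].str['en']

# categories: keep only the names, as a plain list per row
df['categories'] = df['categories'].apply(lambda cats: [c['name'] for c in cats])

# coins: one row per coin, then one column per key
exploded = df.explode('coins').reset_index(drop=True)
coin_cols = pd.DataFrame(exploded['coins'].tolist()).add_prefix('coin_')
result = pd.concat([exploded.drop(columns='coins'), coin_cols], axis=1)
print(result)
```

An event with two coins ends up as two rows here; whether that is desirable depends on what you do with the frame afterwards.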

How can I split elements into different columns in python?

I am working on a python program to output a dataset of ranking of some items.
This is my code:
import pandas as pd
# named "rows" rather than "list" so the built-in list type is not shadowed
rows = [{'ranking': 1, 'sku': 'WD-0215', 'name': 'Sofa', 'price': '$1,299.00', 'detail': 'Red'},
{'ranking': 1, 'sku': 'WD-0215', 'name': 'Sofa', 'price': '$1,299.00', 'detail': 'Cottom'},
{'ranking': 1, 'sku': 'WD-0215', 'name': 'Sofa', 'price': '$1,299.00', 'detail': 'Wood Lab'},
{'ranking': 2, 'sku': 'sfr20', 'name': 'TV', 'price': '$1,861.00 – $3,699.00', 'detail': 'W1360×D750×H710'},
{'ranking': 2, 'sku': 'sfr20', 'name': 'TV', 'price': '$1,861.00 – $3,699.00', 'detail': 'LED'},
{'ranking': 2, 'sku': 'sfr20', 'name': 'TV', 'price': '$1,861.00 – $3,699.00', 'detail': 'Made in Japan'},
{'ranking': 2, 'sku': 'sfr20', 'name': 'TV', 'price': '$1,861.00 – $3,699.00', 'detail': 'Nordic'}
]
df = pd.DataFrame(rows)
print(df)
df.to_csv('item.csv',encoding='utf_8_sig')
However my expected output should be like this:
ranking  sku      name  price              detail1          detail2  detail3        detail4
1        WD-0215  Sofa  $1299.00           Red              Cottom   Wood Lab       none
1        sfr20    TV    $1861.00-$3699.00  W1360×D750×H710  LED      Made in Japan  Nordic
How can I change the code to output this result?
Use GroupBy.cumcount as a counter and reshape with Series.unstack:
g = df.groupby(['ranking', 'sku', 'name', 'price']).cumcount().add(1)
df = df.set_index(['ranking', 'sku', 'name', 'price', g])['detail'].unstack().add_prefix('detail')
print (df)
detail1 detail2 \
ranking sku name price
1 WD-0215 Sofa $1,299.00 Red Cottom
2 sfr20 TV $1,861.00 – $3,699.00 W1360×D750×H710 LED
detail3 detail4
ranking sku name price
1 WD-0215 Sofa $1,299.00 Wood Lab NaN
2 sfr20 TV $1,861.00 – $3,699.00 Made in Japan Nordic
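The result above keeps ranking, sku, name and price in the index. If ordinary columns are wanted (for instance before to_csv), a reset_index at the end should do it. The sketch below uses a trimmed version of the question's data without the price column; keys, counter and wide are names made up for this example.

```python
import pandas as pd

rows = [
    {'ranking': 1, 'sku': 'WD-0215', 'name': 'Sofa', 'detail': 'Red'},
    {'ranking': 1, 'sku': 'WD-0215', 'name': 'Sofa', 'detail': 'Cottom'},
    {'ranking': 2, 'sku': 'sfr20', 'name': 'TV', 'detail': 'LED'},
    {'ranking': 2, 'sku': 'sfr20', 'name': 'TV', 'detail': 'Nordic'},
    {'ranking': 2, 'sku': 'sfr20', 'name': 'TV', 'detail': 'Made in Japan'},
]
df = pd.DataFrame(rows)
keys = ['ranking', 'sku', 'name']

# cumcount numbers the rows inside each group 0, 1, 2, ...; add 1 to get detail1, detail2, ...
counter = df.groupby(keys).cumcount().add(1)

wide = (
    df.set_index(keys + [counter])['detail']
    .unstack()               # the counter level becomes the columns
    .add_prefix('detail')
    .reset_index()           # group keys become ordinary columns again
)
print(wide)
```

Items with fewer details than the longest group get NaN in the unused detail columns.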

How to extract the first item in a set of dictionaries?

Please see the screenshot...
I'm trying to create another column pulling the first element of the column 'genre' (i.e. Animation for the first one, Adventure for the second one, Romance for the third one and so on...)
Could anyone please help?
You can split the data from the dictionary or from the dataframe. In this code, I convert the column to a list before splitting.
import pandas as pd
### from dictionary
dd = { 'genres':[
[{'id':16,'name':'Animation'},{'id':26,'name':'ChicFlick'}],
[{'id':12,'name':'Adventure'},{'id':22,'name':'Horror'}],
[{'id':18,'name':'Romance'},{'id':28,'name':'Crime'}],
]}
dd['genres2'] = [x[0]['name'] for x in dd['genres']]
print(dd)
### from dataframe
dd = { 'genres':[
[{'id':16,'name':'Animation'},{'id':26,'name':'ChicFlick'}],
[{'id':12,'name':'Adventure'},{'id':22,'name':'Horror'}],
[{'id':18,'name':'Romance'},{'id':28,'name':'Crime'}],
]}
df = pd.DataFrame(dd)
df['genres2'] = [x[0]['name'] for x in df['genres'].to_list()]
print(df.to_string(index=False))
Output
{'genres':
[[{'id': 16, 'name': 'Animation'}, {'id': 26, 'name': 'ChicFlick'}],
[{'id': 12, 'name': 'Adventure'}, {'id': 22, 'name': 'Horror'}],
[{'id': 18, 'name': 'Romance'}, {'id': 28, 'name': 'Crime'}]],
'genres2': ['Animation', 'Adventure', 'Romance']}
genres genres2
[{'id': 16, 'name': 'Animation'}, {'id': 26, 'name': 'ChicFlick'}] Animation
[{'id': 12, 'name': 'Adventure'}, {'id': 22, 'name': 'Horror'}] Adventure
[{'id': 18, 'name': 'Romance'}, {'id': 28, 'name': 'Crime'}] Romance
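The list comprehension above raises IndexError as soon as one row has an empty list. A possible alternative, sketched here with made-up data and a hypothetical first_genre column, is to chain the .str accessor, which returns NaN instead of failing.

```python
import pandas as pd

df = pd.DataFrame({'genres': [
    [{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}],
    [{'id': 12, 'name': 'Adventure'}],
    [],  # no genres recorded
]})

# .str[0] takes the first list element (NaN for an empty list);
# .str['name'] then looks up the key in that dict (NaN stays NaN)
df['first_genre'] = df['genres'].str[0].str['name']
print(df['first_genre'].tolist())
```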

DataFrame column containing list within double quotes

I have a dataframe about movies and one of the columns is genre.
The entries of this column are in the form of list like -
[{'id': 35, 'name': 'Comedy'},
{'id': 18, 'name': 'Drama'},
{'id': 10751, 'name': 'Family'},
{'id': 10749, 'name': 'Romance'}]
My aim is to extract the genre from the list and store them as a list such as ['Comedy', 'Drama', 'Family', 'Romance'].
When I print the entries of the column for example -
data['genres'][1] it returns the list within the quotes (datatype : string)
"[{'id': 35, 'name': 'Comedy'}]"
Can someone help to get the list without the quotes? like [{'id': 35, 'name': 'Comedy'}] I should be able to take it from there.
When I create my custom example, it works as expected and returns a list without quotes. For example -
ref = pd.DataFrame({'col':[[1,2,3],[4,3,2]]})
ref['col'][0]
This returns a list (without quotes).
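The problem is that the values are string representations of Python objects, so it is necessary to first convert each string back with ast.literal_eval and then extract the value with get (the example below uses one dict per row):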
a = [{'id': 35, 'name': 'Comedy'},
{'id': 18, 'name': 'Drama'},
{'id': 10751, 'name': 'Family'},
{'id': 10749, 'name': 'Romance'}]
df = pd.DataFrame({'col':a}).astype(str)
import ast
df['genres'] = df['col'].apply(lambda x: ast.literal_eval(x).get('name'))
print (df)
col genres
0 {'id': 35, 'name': 'Comedy'} Comedy
1 {'id': 18, 'name': 'Drama'} Drama
2 {'id': 10751, 'name': 'Family'} Family
3 {'id': 10749, 'name': 'Romance'} Romance
If it is necessary to get all values:
df = pd.DataFrame({'a':list('abcd'),'col':a}).astype(str)
df = df.join(pd.DataFrame([ast.literal_eval(x) for x in df.pop('col')], index=df.index))
print (df)
a id name
0 a 35 Comedy
1 b 18 Drama
2 c 10751 Family
3 d 10749 Romance
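The question's column holds a whole list of dicts per row as a string, rather than one dict per row as in the examples above. For that shape, a sketch along these lines may be closer to what was asked; the sample strings and the genre_names column are assumptions for illustration.

```python
import ast
import pandas as pd

data = pd.DataFrame({'genres': [
    "[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'name': 'Drama'}]",
    "[{'id': 35, 'name': 'Comedy'}]",
    "[]",
]})

# Parse each string into a real list of dicts (literal_eval only accepts Python literals)
parsed = data['genres'].apply(ast.literal_eval)

# Keep just the genre names for every row
data['genre_names'] = parsed.apply(lambda genres: [g['name'] for g in genres])
print(data['genre_names'].tolist())
```

ast.literal_eval raises an exception on malformed strings or missing values, so real data may need a small wrapper that handles those rows.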

python DataFrame split dict column to multiple columns

The column looks this way:
0 [{'id': 18, 'name': 'Drama'}, {'id': 10769, 'n...
1 [{'id': 18, 'name': 'Drama'}, {'id': 10749, 'n...
2 [{'id': 35, 'name': 'Comedy'}, {'id': 27, 'nam...
3 [{'id': 18, 'name': 'Drama'}]
4 [{'id': 99, 'name': 'Documentary'}]
5 [{'id': 18, 'name': 'Drama'}, {'id': 28, 'name...
6 [{'id': 10749, 'name': 'Romance'}, {'id': 18, ...
I wish to see ID columns with bool value for each genre:
index id=18 id=10769 id=35 id=27 ...
0 1 1 0 0 ...
1 1 0 0 0 ...
2 0 0 1 1 ...
3 1 0 0 0 ...
...
Use a list comprehension with flattening and then the DataFrame constructor:
df = pd.DataFrame({'col':[[{'id': 18, 'name': 'Drama'}, {'id': 10769}],
[{'id': 99, 'name': 'Documentary'}]]})
print (type(df.loc[0, 'col']))
<class 'list'>
df = pd.DataFrame([y for x in df['col'] for y in x])
print (df)
id name
0 18 Drama
1 10769 NaN
2 99 Documentary
#alternative
#df = pd.concat([pd.DataFrame(x) for x in df['col']], ignore_index=True)
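The flattened frame above loses track of which original row each genre came from. To get the one-column-per-id layout from the question, one option is to keep the row label while flattening and cross-tabulate. This is only a sketch: the sample rows, the names pairs, long_df and indicators, and the id= prefix are assumptions.

```python
import pandas as pd

df = pd.DataFrame({'col': [
    [{'id': 18, 'name': 'Drama'}, {'id': 10769}],
    [{'id': 18, 'name': 'Drama'}],
    [{'id': 35, 'name': 'Comedy'}, {'id': 27}],
    [],  # a row without genres
]})

# Flatten to (row label, genre id) pairs, keeping the row each id came from
pairs = [(row, g['id']) for row, genres in df['col'].items() for g in genres]
long_df = pd.DataFrame(pairs, columns=['row', 'id'])

# Count occurrences per row and id; rows without genres are restored as all zeros
indicators = (
    pd.crosstab(long_df['row'], long_df['id'])
    .reindex(df.index, fill_value=0)
    .add_prefix('id=')
)
print(indicators)
```

The cells are counts, so they are 0 or 1 as long as no genre is repeated within a row.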
