python DataFrame split dict column to multiple columns

The column looks this way:
0 [{'id': 18, 'name': 'Drama'}, {'id': 10769, 'n...
1 [{'id': 18, 'name': 'Drama'}, {'id': 10749, 'n...
2 [{'id': 35, 'name': 'Comedy'}, {'id': 27, 'nam...
3 [{'id': 18, 'name': 'Drama'}]
4 [{'id': 99, 'name': 'Documentary'}]
5 [{'id': 18, 'name': 'Drama'}, {'id': 28, 'name...
6 [{'id': 10749, 'name': 'Romance'}, {'id': 18, ...
I wish to see ID columns with bool value for each genre:
index id=18 id=10769 id=35 id=27 ...
0 1 1 0 0 ...
1 1 0 0 0 ...
2 0 0 1 1 ...
3 1 0 0 0 ...
...

Use a list comprehension with flattening and then the DataFrame constructor:
df = pd.DataFrame({'col':[[{'id': 18, 'name': 'Drama'}, {'id': 10769}],
[{'id': 99, 'name': 'Documentary'}]]})
print (type(df.loc[0, 'col']))
<class 'list'>
df = pd.DataFrame([y for x in df['col'] for y in x])
print (df)
id name
0 18 Drama
1 10769 NaN
2 99 Documentary
#alternative
#df = pd.concat([pd.DataFrame(x) for x in df['col']], ignore_index=True)
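The flattening above gives one row per genre dict, but not yet the 0/1 column per id that the question asks for. One possible way to get there is sketched below. The sample rows and the names pairs, long and flags are assumptions made for illustration, not part of the original answer.

```python
import pandas as pd

# Hypothetical sample shaped like the column in the question
df = pd.DataFrame({'col': [
    [{'id': 18, 'name': 'Drama'}, {'id': 10769}],
    [{'id': 35, 'name': 'Comedy'}, {'id': 27}],
    [{'id': 18, 'name': 'Drama'}],
]})

# Flatten to (row label, genre id) pairs
pairs = [(row, d['id']) for row, genres in df['col'].items() for d in genres]
long = pd.DataFrame(pairs, columns=['row', 'id'])

# Count occurrences per row and id, then cap at 1 to get 0/1 flags
flags = (pd.crosstab(long['row'], long['id'])
           .clip(upper=1)
           .reindex(df.index, fill_value=0)   # keep rows whose list is empty
           .add_prefix('id='))
print(flags)
```

The reindex step is only there so that rows with an empty genre list still appear, filled with zeros.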

Related

Change column format of DF, where some columns are dicts

I'm new to pandas and I need help. Below I described my DF, which I need to change.
id title \
0 121852 {'en': 'Hard Fork'}
1 123209 {'en': 'Quarterly Public Meeting'}
2 122436 {'en': 'Luxy NFT Marketplace'}
3 122995 {'en': 'Poloniex Listing'}
4 123391 {'en': 'Staking 3.0 Release'}
5 123355 {'en': 'BitMart Listing'}
6 122819 {'en': 'Amazy IGO'}
7 123470 {'en': 'YouTube Live AMA'}
8 123392 {'en': 'AMA'}
9 123319 {'en': 'LBank Listing'}
10 123306 {'en': 'Community Call'}
11 123465 {'en': 'Digifinex Listing'}
12 123469 {'en': 'MEXC Global Listing'}
13 123512 {'en': 'Metarun & Fabwelt AMA'}
14 123460 {'en': 'Digifinex Listing'}
15 123489 {'en': 'BitMart Listing'}
coins \
0 [{'id': 'gxchain', 'coingecko_id': '', 'name': 'GXChain', 'rank': 442, 'symbol': 'GXC', 'fullname': 'GXChain (GXC)'}, {'id': 'rei-network', 'coingecko_id': '', 'name': 'REI Network', 'rank': 376, 'symbol': 'REI', 'fullname': 'REI Network (REI)'}]
1 [{'id': 'filecoin', 'coingecko_id': '', 'name': 'Filecoin', 'rank': 45, 'symbol': 'FIL', 'fullname': 'Filecoin (FIL)'}]
2 [{'id': 'luxy', 'coingecko_id': '', 'name': 'Luxy', 'rank': 0, 'symbol': 'LUXY', 'fullname': 'Luxy (LUXY)'}, {'id': 'syscoin', 'coingecko_id': '', 'name': 'Syscoin', 'rank': 240, 'symbol': 'SYS', 'fullname': 'Syscoin (SYS)'}]
3 [{'id': 'bitkub-coin', 'coingecko_id': '', 'name': 'Bitkub Coin', 'rank': 125, 'symbol': 'KUB', 'fullname': 'Bitkub Coin (KUB)'}]
4 [{'id': 'sidus', 'coingecko_id': '', 'name': 'Sidus', 'rank': 1231, 'symbol': 'SIDUS', 'fullname': 'Sidus (SIDUS)'}]
5 [{'id': 'solve', 'coingecko_id': '', 'name': 'SOLVE', 'rank': 693, 'symbol': 'SOLVE', 'fullname': 'SOLVE (SOLVE)'}]
6 [{'id': 'seedify-fund', 'coingecko_id': '', 'name': 'Seedify.fund', 'rank': 389, 'symbol': 'SFUND', 'fullname': 'Seedify.fund (SFUND)'}]
7 [{'id': 'oasis-network', 'coingecko_id': '', 'name': 'Oasis Network', 'rank': 134, 'symbol': 'ROSE', 'fullname': 'Oasis Network (ROSE)'}]
8 [{'id': 'dydx', 'coingecko_id': '', 'name': 'dYdX', 'rank': 157, 'symbol': 'DYDX', 'fullname': 'dYdX (DYDX)'}]
9 [{'id': 'grove', 'coingecko_id': '', 'name': 'Grove', 'rank': 0, 'symbol': 'GVR', 'fullname': 'Grove (GVR)'}]
10 [{'id': 'perpetual-protocol', 'coingecko_id': '', 'name': 'Perpetual Protocol', 'rank': 373, 'symbol': 'PERP', 'fullname': 'Perpetual Protocol (PERP)'}]
11 [{'id': 'new-paradigm-assets-solution', 'coingecko_id': '', 'name': 'New Paradigm Assets Solution', 'rank': 0, 'symbol': 'NPAS', 'fullname': 'New Paradigm Assets Solution (NPAS)'}]
12 [{'id': 'handy', 'coingecko_id': '', 'name': 'Handy', 'rank': 0, 'symbol': 'HANDY', 'fullname': 'Handy (HANDY)'}]
13 [{'id': 'fabwelt', 'coingecko_id': '', 'name': 'Fabwelt', 'rank': 2626, 'symbol': 'WELT', 'fullname': 'Fabwelt (WELT)'}, {'id': 'metarun', 'coingecko_id': '', 'name': 'Metarun', 'rank': 3092, 'symbol': 'MRUN', 'fullname': 'Metarun (MRUN)'}]
14 [{'id': 'dungeon', 'coingecko_id': '', 'name': 'Dungeon', 'rank': 0, 'symbol': 'DGN', 'fullname': 'Dungeon (DGN)'}]
15 [{'id': 'monetha', 'coingecko_id': '', 'name': 'Monetha', 'rank': 1967, 'symbol': 'MTH', 'fullname': 'Monetha (MTH)'}]
date_event can_occur_before created_date \
0 2022-07-13T00:00:00Z False 2022-06-27T14:39:15Z
1 2022-07-13T00:00:00Z False 2022-07-09T13:27:25Z
2 2022-07-13T00:00:00Z False 2022-07-02T06:10:09Z
3 2022-07-13T00:00:00Z False 2022-07-07T13:55:34Z
4 2022-07-13T00:00:00Z False 2022-07-11T18:42:01Z
5 2022-07-13T00:00:00Z False 2022-07-11T18:16:08Z
6 2022-07-13T00:00:00Z False 2022-07-06T06:55:16Z
7 2022-07-13T00:00:00Z False 2022-07-12T13:59:23Z
8 2022-07-13T00:00:00Z False 2022-07-11T18:43:02Z
9 2022-07-13T00:00:00Z False 2022-07-11T14:12:23Z
10 2022-07-13T00:00:00Z False 2022-07-11T14:11:47Z
11 2022-07-13T00:00:00Z False 2022-07-12T13:49:28Z
12 2022-07-13T00:00:00Z False 2022-07-12T14:05:15Z
13 2022-07-13T00:00:00Z False 2022-07-12T18:46:28Z
14 2022-07-13T00:00:00Z False 2022-07-12T13:48:55Z
15 2022-07-13T00:00:00Z False 2022-07-12T23:33:03Z
categories \
0 [{'id': 14, 'name': 'Fork/Swap'}]
1 [{'id': 16, 'name': 'Team Update'}]
2 [{'id': 4, 'name': 'Exchange'}]
3 [{'id': 4, 'name': 'Exchange'}]
4 [{'id': 17, 'name': 'Staking/Farming'}]
5 [{'id': 4, 'name': 'Exchange'}]
6 [{'id': 7, 'name': 'Other'}]
7 [{'id': 9, 'name': 'AMA'}]
8 [{'id': 9, 'name': 'AMA'}]
9 [{'id': 4, 'name': 'Exchange'}]
10 [{'id': 16, 'name': 'Team Update'}]
11 [{'id': 4, 'name': 'Exchange'}]
12 [{'id': 4, 'name': 'Exchange'}]
13 [{'id': 9, 'name': 'AMA'}]
14 [{'id': 4, 'name': 'Exchange'}]
15 [{'id': 4, 'name': 'Exchange'}]
I need to change the column "title": drop the key 'en' and keep only its value.
I need to change the column "coins": turn the dict keys into separate columns holding their values.
I need to change the column "categories": drop the "id" key and its values, and drop the "name" key but keep its values.
For columns whose rows hold lists, I would use pandas.explode.
For columns whose rows hold dicts, use .apply(pandas.Series),
then rename any columns that end up with the same name (like 'id'), or reshape the dicts when you first parse the JSON.
It should look like this:
import pandas
df = pandas.DataFrame({
'1': [[{"id": 1, "a": 1}], [{"id": 2, "a": 1}]],
'2': [1, 2],
'3': [[{"id": 1, "name": "a"}], [{"id": 2, "name": "b"}]],
'4': [{"en": "a"}, {"en": "b"}]
})
df = df.explode(["1", "3"])
pandas.concat([
df.drop(["1", "3", "4"], axis=1),
df['1'].apply(pandas.Series),
df['3'].apply(pandas.Series),
df['4'].apply(pandas.Series)
], axis=1)
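Applied to the three columns named in the question, the same idea could look roughly like the sketch below. The two-row sample is a cut-down, hypothetical version of the asker's frame, and the coin_ prefix and variable names are assumptions chosen to avoid a clash with the existing 'id' column. It also assumes every row has at least one coin.

```python
import pandas as pd

# Hypothetical two-row sample shaped like the question's frame
df = pd.DataFrame({
    'id': [121852, 123209],
    'title': [{'en': 'Hard Fork'}, {'en': 'Quarterly Public Meeting'}],
    'coins': [
        [{'id': 'gxchain', 'symbol': 'GXC'}, {'id': 'rei-network', 'symbol': 'REI'}],
        [{'id': 'filecoin', 'symbol': 'FIL'}],
    ],
    'categories': [[{'id': 14, 'name': 'Fork/Swap'}], [{'id': 16, 'name': 'Team Update'}]],
})

# title: keep only the value stored under 'en'
df['title'] = [t['en'] for t in df['title']]

# categories: keep only the names, as a plain list per row
df['categories'] = [[c['name'] for c in cats] for cats in df['categories']]

# coins: one row per coin, then one column per key, prefixed to avoid clashing with 'id'
exploded = df.explode('coins').reset_index(drop=True)
coin_cols = pd.DataFrame(exploded['coins'].tolist()).add_prefix('coin_')
result = pd.concat([exploded.drop(columns='coins'), coin_cols], axis=1)
print(result)
```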

Finding average in pandas from a value in a dictionary?

I'm experimenting on a project with a large dataset of movies. I have a large data frame with one column named "Genres" and one named "Vote Average". My goal is to find the 20 highest rated genres based on "Vote Average".
I would use a group by but I can't seem to figure it out because the genre information looks like this in the column "Genres" :
[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'name': 'Drama'}, {'id': 10749, 'name': 'Romance'}]
How can I extract Comedy, Drama and Romance from the list above?
How can I group by individual genres while assigning each row's "Vote Average" to each of its genres, so I can print the top 20 rated genres in the data frame?
Genres Vote Average
1 [{'id': 16, 'name': 'Animation'}, {'id': 35, '... 7.7
2 [{'id': 12, 'name': 'Adventure'}, {'id': 14, '... 6.9
3 [{'id': 10749, 'name': 'Romance'}, {'id': 35, ... 6.5
4 [{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam... 6.1
5 [{'id': 35, 'name': 'Comedy'}] 5.7
... ... ...
32255 [{'id': 878, 'name': 'Science Fiction'}] 3.5
32256 [{'id': 18, 'name': 'Drama'}, {'id': 28, 'name... 5.7
32257 [{'id': 28, 'name': 'Action'}, {'id': 18, 'nam... 3.8
32258 [] 0.0
32259 [] 0.0
EDIT: Example from Data frame is above. movies_metadata.csv from https://www.kaggle.com/rounakbanik/the-movies-dataset
EDIT:
Now that I have seen all the information on Kaggle, I think this may need a totally different method, because these genres are assigned to titles and they can't be in separate rows.
OLD:
First you have to convert it to a proper DataFrame with genres in separate rows instead of
[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'name': 'Drama'}, ...]
Here is my example data
import pandas as pd
df = pd.DataFrame([
{'Genre': [{'id': 16, 'name': 'Animation'}], 'Vote Average': 7.7},
{'Genre': [{'id': 35, 'name': 'Comedy'}, {'id': 18, 'name': 'Drama'}, {'id': 10749, 'name': 'Romance'}], 'Vote Average': 6.1},
{'Genre': [{'id': 10749, 'name': 'Romance'}], 'Vote Average': 6.5},
])
print(df)
result:
Genre Vote Average
0 [{'id': 16, 'name': 'Animation'}] 7.7
1 [{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam... 6.1
2 [{'id': 10749, 'name': 'Romance'}] 6.5
You can iterate over every row, use pd.DataFrame(row['Genre']) to build a small dataframe for that row, and combine the pieces into one new dataframe:
frames = []
for index, row in df.iterrows():
    temp_df = pd.DataFrame(row['Genre'])
    temp_df['Vote Average'] = row['Vote Average']
    frames.append(temp_df)
# DataFrame.append was removed in pandas 2.0, so concatenate the pieces once
new_df = pd.concat(frames)
print(new_df)
result:
id name Vote Average
0 16 Animation 7.7
0 35 Comedy 6.1
1 18 Drama 6.1
2 10749 Romance 6.1
0 10749 Romance 6.5
and now you can do whatever you like.
Another way to reshape the data:
First convert each list of dictionaries into separate rows, one dictionary per row
new_df = df.explode('Genre')
print(new_df)
result:
Genre Vote Average
0 {'id': 16, 'name': 'Animation'} 7.7
1 {'id': 35, 'name': 'Comedy'} 6.1
1 {'id': 18, 'name': 'Drama'} 6.1
1 {'id': 10749, 'name': 'Romance'} 6.1
2 {'id': 10749, 'name': 'Romance'} 6.5
and later convert every dictionary to columns
new_df['id'] = new_df['Genre'].str['id']
new_df['name'] = new_df['Genre'].str['name']
print(new_df)
result:
Genre Vote Average id name
0 {'id': 16, 'name': 'Animation'} 7.7 16 Animation
1 {'id': 35, 'name': 'Comedy'} 6.1 35 Comedy
1 {'id': 18, 'name': 'Drama'} 6.1 18 Drama
1 {'id': 10749, 'name': 'Romance'} 6.1 10749 Romance
2 {'id': 10749, 'name': 'Romance'} 6.5 10749 Romance
or using
new_df[['id','name']] = new_df['Genre'].apply(pd.Series)
Full example
import pandas as pd
df = pd.DataFrame([
{'Genre': [{'id': 16, 'name': 'Animation'}], 'Vote Average': 7.7},
{'Genre': [{'id': 35, 'name': 'Comedy'}, {'id': 18, 'name': 'Drama'}, {'id': 10749, 'name': 'Romance'}], 'Vote Average': 6.1},
{'Genre': [{'id': 10749, 'name': 'Romance'}], 'Vote Average': 6.5},
])
print('--- df ---')
print(df)
print('--- iterrows ---')
frames = []
for index, row in df.iterrows():
    temp_df = pd.DataFrame(row['Genre'])
    temp_df['Vote Average'] = row['Vote Average']
    frames.append(temp_df)
# DataFrame.append was removed in pandas 2.0, so concatenate the pieces once
new_df = pd.concat(frames)
print(new_df)
print('--- explode #1 ---')
new_df = df.explode('Genre')
print(new_df)
print('--- columns #1 ---')
new_df['id'] = new_df['Genre'].str['id']
new_df['name'] = new_df['Genre'].str['name']
new_df.drop('Genre', inplace=True, axis=1)
new_df.reset_index(inplace=True)
print(new_df)
print('--- explode #2 ---')
new_df = df.explode('Genre')
print(new_df)
print('--- columns #2 ---')
new_df[['id','name']] = new_df['Genre'].apply(pd.Series)
new_df.drop('Genre', inplace=True, axis=1)
new_df.reset_index(inplace=True)
print(new_df)
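Once the genres are in separate rows, the ranking the question asks about could be computed along these lines. This is a minimal sketch on the same example data. The names long and top are assumptions, and the dropna step is there only because the real data contains rows with an empty genre list.

```python
import pandas as pd

df = pd.DataFrame([
    {'Genre': [{'id': 16, 'name': 'Animation'}], 'Vote Average': 7.7},
    {'Genre': [{'id': 35, 'name': 'Comedy'}, {'id': 18, 'name': 'Drama'}, {'id': 10749, 'name': 'Romance'}], 'Vote Average': 6.1},
    {'Genre': [{'id': 10749, 'name': 'Romance'}], 'Vote Average': 6.5},
])

# One row per genre; rows whose genre list was empty become NaN and are dropped
long = df.explode('Genre').dropna(subset=['Genre'])
long['name'] = [g['name'] for g in long['Genre']]

# Mean rating per genre name, highest first, limited to 20 entries
top = (long.groupby('name')['Vote Average']
           .mean()
           .sort_values(ascending=False)
           .head(20))
print(top)
```

If the column was read from a CSV file it may hold strings rather than lists, in which case it would need to be parsed (for example with ast.literal_eval) before explode can split it.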

How to replace multiple items in a list and make a dataframe of it?

I have a quite complex list in which I am trying to change 2 things:
id has to become ID
Field1 has to become Value1
After that I try to make a neat DataFrame of it. This is my expected outcome:
ID - Value1
0 1 - 1235
1 2 - 5631
2 3 - 9875
3 4 - 2683
4 5 - 97525
5 6 - 6614
my_list looks like this:
my_list = [('www.url1.com'), 1000, [{'id': 1, 'Field1': 1235}, {'id': 2, 'Field1': 5631}, {'id': 3, 'Field1': 9875}, 'www.google.com)'],
('www.url1.com'), 1000, [{'id': 4, 'Field1': 2683}, {'id': 5, 'Field1': 97525}, {'id': 6, 'Field1': 6614}, 'www.google.com)']]
This is the code I tried to use. I don't get any errors, but neither do I get the expected result.
import pandas as pd
my_list = [('www.url1.com'), 1000, [{'id': 1, 'Field1': 1235}, {'id': 2, 'Field1': 5631}, {'id': 3, 'Field1': 9875}, 'www.google.com)'],
('www.url1.com'), 1000, [{'id': 4, 'Field1': 2683}, {'id': 5, 'Field1': 97525}, {'id': 6, 'Field1': 6614}, 'www.google.com)']]
for n, i in enumerate(my_list):
    if i == 'id':
        my_list[n] = 'ID'
# print(my_list)
df = pd.DataFrame(my_list)
#print(df)
The data you need is inside the nested lists. Use isinstance to pick out the elements of my_list that are lists, then within each of those select the elements of type dict, and build new dictionaries using the renamed keys {'id': 'ID', 'Field1': 'Value1'}.
keys = {'id': 'ID', 'Field1': 'Value1'}
res = []
for x in my_list:
    if isinstance(x, list):
        res += [{keys[k]: y[k] for k in keys} for y in x if isinstance(y, dict)]
df = pd.DataFrame(res)
print(df)
Using list comprehension
keys = {'id': 'ID', 'Field1': 'Value1'}
res = [y for x in my_list if isinstance(x, list) for y in x if isinstance(y, dict)]
df = pd.DataFrame(res).rename(columns=keys)
Output:
ID Value1
0 1 1235
1 2 5631
2 3 9875
3 4 2683
4 5 97525
5 6 6614
You can select just the lists with the ids and put them together:
from functools import reduce
import pandas as pd
my_list = [('www.url1.com'), 1000, [{'id': 1, 'Field1': 1235}, {'id': 2, 'Field1': 5631}, {'id': 3, 'Field1': 9875}, 'www.google.com)'],
('www.url1.com'), 1000, [{'id': 4, 'Field1': 2683}, {'id': 5, 'Field1': 97525}, {'id': 6, 'Field1': 6614}, 'www.google.com)']]
a = reduce(lambda x,y: x+y,[my_list[2::3][i][-2::-1] for i in range(len(my_list[2::3]))])
pd.DataFrame(a).rename(columns = {"id":"ID","Field1":"Value1"})
Output:
ID Value1
0 3 9875
1 2 5631
2 1 1235
3 6 6614
4 5 97525
5 4 2683
Just sort by ID if you need it.

Aggregate each columns in a dataframe based on custom functions in python

here is my dataframe:
df = [{'id': 1, 'name': 'bob', 'apple': 45, 'grape': 10, 'rate':0},
{'id': 1, 'name': 'bob', 'apple': 45, 'grape': 20, 'rate':0},
{'id': 2, 'name': 'smith', 'apple': 5, 'grape': 30, 'rate':0},
{'id': 2, 'name': 'smith', 'apple': 10, 'grape': 40, 'rate':0}]
I would like to group by id and name, where apple = apple.sum(), grape = grape.sum() and rate = apple / grape * 100:
id name apple grape rate
0 1 bob 90 30 300
1 2 smith 15 70 21.4
I have attempted this with the following:
df = pd.DataFrame(df)
def cal_rate(rate):
    return df['apple'] / df['grape']*100
agg_funcs = {'apple':'sum',
'grape':'sum',
'rate' : cal_rate}
df=df.groupby(['id','name']).agg(agg_funcs).reset_index()
But got this result:
id name apple grape rate
0 1 bob 90 30 105
1 2 smith 15 70 105
Can you help me out? Thanks in advance.
Here you go:
import pandas as pd
df = [{'id': 1, 'name': 'bob', 'apple': 45, 'grape': 10, 'rate':0},
{'id': 1, 'name': 'bob', 'apple': 45, 'grape': 20, 'rate':0},
{'id': 2, 'name': 'smith', 'apple': 5, 'grape': 30, 'rate':0},
{'id': 2, 'name': 'smith', 'apple': 10, 'grape': 40, 'rate':0}]
df = pd.DataFrame(df)
def cal_rate(group):
    frame = df.loc[group.index]
    return frame['apple'].sum() / frame['grape'].sum() * 100
agg_funcs = {'apple':'sum',
'grape':'sum',
'rate' : cal_rate}
df=df.groupby(['id','name']).agg(agg_funcs).reset_index()
print(df)
Output
id name apple grape rate
0 1 bob 90 30 300.0
1 2 smith 15 70 21.4
You can also do it this way
df = df.groupby(['id', 'name']).agg({'apple':'sum', 'grape':'sum'}).reset_index()
df['rate'] = (df['apple'] / df['grape']) *100
just another way to do this
import pandas as pd
df = [{'id': 1, 'name': 'bob', 'apple': 45, 'grape': 10, 'rate':0},
{'id': 1, 'name': 'bob', 'apple': 45, 'grape': 20, 'rate':0},
{'id': 2, 'name': 'smith', 'apple': 5, 'grape': 30, 'rate':0},
{'id': 2, 'name': 'smith', 'apple': 10, 'grape': 40, 'rate':0}]
df = pd.DataFrame(df)
df=df.groupby(['id','name']).sum().reset_index()
df['rate']=round((df['apple'] / df['grape'])*100,1)
print(df)
output
id name apple grape rate
0 1 bob 90 30 300.0
1 2 smith 15 70 21.4

DataFrame column containing list within double quotes

I have a dataframe about movies and one of the columns is genre.
The entries of this column are in the form of list like -
[{'id': 35, 'name': 'Comedy'},
{'id': 18, 'name': 'Drama'},
{'id': 10751, 'name': 'Family'},
{'id': 10749, 'name': 'Romance'}]
My aim is to extract the genre from the list and store them as a list such as ['Comedy', 'Drama', 'Family', 'Romance'].
When I print the entries of the column for example -
data['genres'][1] it returns the list within the quotes (datatype : string)
"[{'id': 35, 'name': 'Comedy'}]"
Can someone help to get the list without the quotes? like [{'id': 35, 'name': 'Comedy'}] I should be able to take it from there.
When I create my custom example, it works as expected and returns a list without quotes. For example -
ref = pd.DataFrame({'col':[[1,2,3],[4,3,2]]})
ref['col'][0]
This returns a list (without quotes).
The problem is that the values are string representations, so you first need to convert each one back to a Python object with ast.literal_eval and then extract the name with get:
a = [{'id': 35, 'name': 'Comedy'},
{'id': 18, 'name': 'Drama'},
{'id': 10751, 'name': 'Family'},
{'id': 10749, 'name': 'Romance'}]
df = pd.DataFrame({'col':a}).astype(str)
import ast
df['genres'] = df['col'].apply(lambda x: ast.literal_eval(x).get('name'))
print (df)
col genres
0 {'id': 35, 'name': 'Comedy'} Comedy
1 {'id': 18, 'name': 'Drama'} Drama
2 {'id': 10751, 'name': 'Family'} Family
3 {'id': 10749, 'name': 'Romance'} Romance
If you need all the values:
df = pd.DataFrame({'a':list('abcd'),'col':a}).astype(str)
df = df.join(pd.DataFrame([ast.literal_eval(x) for x in df.pop('col')], index=df.index))
print (df)
a id name
0 a 35 Comedy
1 b 18 Drama
2 c 10751 Family
3 d 10749 Romance
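The examples above parse one dict per row. When each cell holds a whole list of dicts as a string, as in the question, the same ast.literal_eval step can be followed by a list comprehension to get a list of names per row. The sketch below uses made-up sample strings, and the helper name genre_names and the new column name are hypothetical.

```python
import ast
import pandas as pd

# Hypothetical sample: each cell is the string form of a list of dicts
data = pd.DataFrame({'genres': [
    "[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'name': 'Drama'}]",
    "[{'id': 35, 'name': 'Comedy'}]",
    "[]",
]})

def genre_names(text):
    """Parse a stringified list of dicts and return the 'name' values."""
    return [d['name'] for d in ast.literal_eval(text)]

data['genre_names'] = data['genres'].apply(genre_names)
print(data['genre_names'].tolist())
```

ast.literal_eval only accepts Python literals, so it is a safer choice than eval for text that comes from a file.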
