DataFrame column containing list within double quotes - python

I have a dataframe about movies and one of the columns is genre.
The entries of this column are in the form of list like -
[{'id': 35, 'name': 'Comedy'},
{'id': 18, 'name': 'Drama'},
{'id': 10751, 'name': 'Family'},
{'id': 10749, 'name': 'Romance'}]
My aim is to extract the genre from the list and store them as a list such as ['Comedy', 'Drama', 'Family', 'Romance'].
When I print the entries of the column for example -
data['genres'][1] it returns the list within the quotes (datatype : string)
"[{'id': 35, 'name': 'Comedy'}]"
Can someone help to get the list without the quotes? like [{'id': 35, 'name': 'Comedy'}] I should be able to take it from there.
When I create my custom example, it works as expected and returns a list without quotes. For example -
ref = pd.DataFrame({'col':[[1,2,3],[4,3,2]]})
ref['col'][0]
This returns a list (without quotes).

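The problem is that the column holds string representations of the data, so you first need to convert each string back to a Python object with ast.literal_eval and then extract the value with get (in this sample each row holds a single dict):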
import ast
import pandas as pd
a = [{'id': 35, 'name': 'Comedy'},
{'id': 18, 'name': 'Drama'},
{'id': 10751, 'name': 'Family'},
{'id': 10749, 'name': 'Romance'}]
# astype(str) simulates a column that stores each dict as a string
df = pd.DataFrame({'col':a}).astype(str)
df['genres'] = df['col'].apply(lambda x: ast.literal_eval(x).get('name'))
print (df)
                                col   genres
0      {'id': 35, 'name': 'Comedy'}   Comedy
1       {'id': 18, 'name': 'Drama'}    Drama
2   {'id': 10751, 'name': 'Family'}   Family
3  {'id': 10749, 'name': 'Romance'}  Romance
If you need all the values as separate columns:
df = pd.DataFrame({'a':list('abcd'),'col':a}).astype(str)
df = df.join(pd.DataFrame([ast.literal_eval(x) for x in df.pop('col')], index=df.index))
print (df)
   a     id     name
0  a     35   Comedy
1  b     18    Drama
2  c  10751   Family
3  d  10749  Romance
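When each cell holds the string form of a whole list of dicts, as in the question, the same idea can be applied per row. The following is a minimal sketch, assuming a column named genres as in the question; the sample strings and the genre_names column are made up for illustration.

```python
import ast
import pandas as pd

# Hypothetical sample: each cell is a *string* that looks like a list of dicts.
data = pd.DataFrame({'genres': [
    "[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'name': 'Drama'}]",
    "[{'id': 10751, 'name': 'Family'}]",
    "[]",
]})

# Turn every string back into a real Python list of dicts.
data['genres'] = data['genres'].apply(ast.literal_eval)

# Collect the 'name' value of every dict in each list.
data['genre_names'] = data['genres'].apply(
    lambda genres: [g['name'] for g in genres]
)
print(data['genre_names'].tolist())
```

ast.literal_eval only evaluates Python literals, so it is a safer choice than eval for data read from a file.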

Related

How to get specific values from a nested list in Python

Could you help me getting the specific values I wanted in the below list
list=[['Russia',[{'id': 250282,'d_id': 19553,'p_id': 1796,'value': 'silver'},
{'id': 250212,'d_id': 19553,'p_id': 1896,'value': 'gold'},
{'id': 250242,'d_id': 19553,'p_id': 1396,'value': 'iron'},
{'id': 250082,'d_id': 19553,'p_id': 1496,'value': 'cobalt'}]],
['China',[{'id': 210282,'d_id': 193,'p_id': 1196,'value': 'silver'},
{'id': 220212,'d_id': 193,'p_id': 1396,'value': 'iron'},
{'id': 240242,'d_id': 193,'p_id': 1586,'value': 'iron'},
{'id': 250082,'d_id': 193,'p_id': 1492,'value': 'gold'}]],
['Africa',[]],
['USA',[{'id': 200282,'d_id': 5531,'p_id': 1093,'value': 'iron'},
{'id': 253212,'d_id': 5531,'p_id': 1843,'value': 'gold'},
{'id': 255242,'d_id': 5531,'p_id': 1323,'value': 'iron'},
{'id': 257082,'d_id': 5531,'p_id': 1409,'value': 'cobalt'}]],
['UK',[]]]
output should be:
'Russia', 19553
'China', 193
'Africa', 0
'USA', 5531
'UK', 0
I am trying to get countries and unique values of d_id because it will be the same for all records and impute missing values with 0
I tried for loops and slicing of lists but nothing worked out
If anyone of you have a solution for this that would be much appreciated.
In the above output Africa and UK d_id values are imputed with 0
for item in list:
    country = item[0]
    metal = item[1]
    d_id = []
    for details in metal:
        for ids in details:
            if ids == 'd_id':
                d_id.append(details[ids])
    d_id = set(d_id)
    if len(d_id):
        print(f"{country},{d_id}")
    else:
        print(f"{country},0")
Your inner lists have 2 values: a country name and a list of records. You can iterate over the outer list, using tuple unpacking to get those two values. If the list of records is not empty, grab d_id from the first record; otherwise use zero.
list=[['Russia',[{'id': 250282,'d_id': 19553,'p_id': 1796,'value': 'silver'},
{'id': 250212,'d_id': 19553,'p_id': 1896,'value': 'gold'},
{'id': 250242,'d_id': 19553,'p_id': 1396,'value': 'iron'},
{'id': 250082,'d_id': 19553,'p_id': 1496,'value': 'cobalt'}]],
['China',[{'id': 210282,'d_id': 193,'p_id': 1196,'value': 'silver'},
{'id': 220212,'d_id': 193,'p_id': 1396,'value': 'iron'},
{'id': 240242,'d_id': 193,'p_id': 1586,'value': 'iron'},
{'id': 250082,'d_id': 193,'p_id': 1492,'value': 'gold'}]],
['Africa',[]],
['USA',[{'id': 200282,'d_id': 5531,'p_id': 1093,'value': 'iron'},
{'id': 253212,'d_id': 5531,'p_id': 1843,'value': 'gold'},
{'id': 255242,'d_id': 5531,'p_id': 1323,'value': 'iron'},
{'id': 257082,'d_id': 5531,'p_id': 1409,'value': 'cobalt'}]],
['UK',[]]]
output = []
for name, records in list:
    if records:
        d_id = records[0]['d_id']
    else:
        d_id = 0
    output.append((name, d_id))
for name, d_id in output:
    print(f" '{name}': {d_id}")
Next time you should include the code that you've tried with your question.
{l[0]: set([_d['d_id'] for _d in l[1]]) if len(l[1]) > 0 else set([0]) for l in your_list}
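The one-liner above returns a set per country. If plain integers are wanted instead, one possible variant takes the first d_id and falls back to 0 for empty record lists. This is only a sketch: the names records_by_country and first_d_id are made up here, and the data is a shortened copy of the list from the question.

```python
# Shortened copy of the nested list from the question.
records_by_country = [
    ['Russia', [{'id': 250282, 'd_id': 19553, 'p_id': 1796, 'value': 'silver'},
                {'id': 250212, 'd_id': 19553, 'p_id': 1896, 'value': 'gold'}]],
    ['China', [{'id': 210282, 'd_id': 193, 'p_id': 1196, 'value': 'silver'}]],
    ['Africa', []],
]

def first_d_id(records, default=0):
    """Return the d_id of the first record, or `default` if there are none."""
    return next((record['d_id'] for record in records), default)

result = [(name, first_d_id(records)) for name, records in records_by_country]
for name, d_id in result:
    print(f"'{name}', {d_id}")
```

This relies on the statement in the question that d_id is the same for all records of a country.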

Finding average in pandas from a value in a dictionary?

I am experimenting on a project with a large dataset of movies. I have a large data frame with one column named "Genres" and one named "Vote Average". My goal is to find the 20 highest-rated genres based on "Vote Average".
I would use a group by but I can't seem to figure it out because the genre information looks like this in the column "Genres" :
[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'name': 'Drama'}, {'id': 10749, 'name': 'Romance'}]
How can I extract Comedy, Drama and Romance from the list above?
How can I group by individual genres while assigning each row's "Vote Average" to each of its genres, so I can print the top 20 rated genres in the data frame?
Genres Vote Average
1 [{'id': 16, 'name': 'Animation'}, {'id': 35, '... 7.7
2 [{'id': 12, 'name': 'Adventure'}, {'id': 14, '... 6.9
3 [{'id': 10749, 'name': 'Romance'}, {'id': 35, ... 6.5
4 [{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam... 6.1
5 [{'id': 35, 'name': 'Comedy'}] 5.7
... ... ...
32255 [{'id': 878, 'name': 'Science Fiction'}] 3.5
32256 [{'id': 18, 'name': 'Drama'}, {'id': 28, 'name... 5.7
32257 [{'id': 28, 'name': 'Action'}, {'id': 18, 'nam... 3.8
32258 [] 0.0
32259 [] 0.0
EDIT: Example from Data frame is above. movies_metadata.csv from https://www.kaggle.com/rounakbanik/the-movies-dataset
EDIT:
Now that I have seen all the information on Kaggle, I think it may need a totally different method, because these genres are assigned to titles and they can't be in separate rows.
OLD:
Now you have to convert it to a correct DataFrame with the genres in separate rows instead of
[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'name': 'Drama'}, ...]
Here is my example data
import pandas as pd
df = pd.DataFrame([
{'Genre': [{'id': 16, 'name': 'Animation'}], 'Vote Average': 7.7},
{'Genre': [{'id': 35, 'name': 'Comedy'}, {'id': 18, 'name': 'Drama'}, {'id': 10749, 'name': 'Romance'}], 'Vote Average': 6.1},
{'Genre': [{'id': 10749, 'name': 'Romance'}], 'Vote Average': 6.5},
])
print(df)
result:
Genre Vote Average
0 [{'id': 16, 'name': 'Animation'}] 7.7
1 [{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam... 6.1
2 [{'id': 10749, 'name': 'Romance'}] 6.5
You can iterate every row and use pd.DataFrame(row['Genre']) to create correct dataframe which you will add to new global dataframe
frames = []
for index, row in df.iterrows():
    temp_df = pd.DataFrame(row['Genre'])
    temp_df['Vote Average'] = row['Vote Average']
    frames.append(temp_df)
# DataFrame.append was removed in pandas 2.0, so concatenate the pieces once
new_df = pd.concat(frames)
print(new_df)
result:
      id       name  Vote Average
0     16  Animation           7.7
0     35     Comedy           6.1
1     18      Drama           6.1
2  10749    Romance           6.1
0  10749    Romance           6.5
and now you can do whatever you like.
Another method to correct the data:
First convert the lists of dictionaries to separate rows, one dictionary per row
new_df = df.explode('Genre')
print(new_df)
result:
Genre Vote Average
0 {'id': 16, 'name': 'Animation'} 7.7
1 {'id': 35, 'name': 'Comedy'} 6.1
1 {'id': 18, 'name': 'Drama'} 6.1
1 {'id': 10749, 'name': 'Romance'} 6.1
2 {'id': 10749, 'name': 'Romance'} 6.5
and later convert every dictionary to columns
new_df['id'] = new_df['Genre'].str['id']
new_df['name'] = new_df['Genre'].str['name']
print(new_df)
result:
Genre Vote Average id name
0 {'id': 16, 'name': 'Animation'} 7.7 16 Animation
1 {'id': 35, 'name': 'Comedy'} 6.1 35 Comedy
1 {'id': 18, 'name': 'Drama'} 6.1 18 Drama
1 {'id': 10749, 'name': 'Romance'} 6.1 10749 Romance
2 {'id': 10749, 'name': 'Romance'} 6.5 10749 Romance
or using
new_df[['id','name']] = new_df['Genre'].apply(pd.Series)
Full example
import pandas as pd
df = pd.DataFrame([
{'Genre': [{'id': 16, 'name': 'Animation'}], 'Vote Average': 7.7},
{'Genre': [{'id': 35, 'name': 'Comedy'}, {'id': 18, 'name': 'Drama'}, {'id': 10749, 'name': 'Romance'}], 'Vote Average': 6.1},
{'Genre': [{'id': 10749, 'name': 'Romance'}], 'Vote Average': 6.5},
])
print('--- df ---')
print(df)
print('--- iterrows ---')
frames = []
for index, row in df.iterrows():
    temp_df = pd.DataFrame(row['Genre'])
    temp_df['Vote Average'] = row['Vote Average']
    frames.append(temp_df)
# DataFrame.append was removed in pandas 2.0, so concatenate the pieces once
new_df = pd.concat(frames)
print(new_df)
print('--- explode #1 ---')
new_df = df.explode('Genre')
print(new_df)
print('--- columns #1 ---')
new_df['id'] = new_df['Genre'].str['id']
new_df['name'] = new_df['Genre'].str['name']
new_df.drop('Genre', inplace=True, axis=1)
new_df.reset_index(inplace=True)
print(new_df)
print('--- explode #2 ---')
new_df = df.explode('Genre')
print(new_df)
print('--- columns #2 ---')
new_df[['id','name']] = new_df['Genre'].apply(pd.Series)
new_df.drop('Genre', inplace=True, axis=1)
new_df.reset_index(inplace=True)
print(new_df)
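Once every genre sits in its own row, the ranking asked for in the question can be sketched with groupby. This is a minimal sketch, assuming the Genres column already holds real lists (not strings) and that a plain mean of "Vote Average" per genre is the desired score; the sample rows and the genre column name are made up for illustration.

```python
import pandas as pd

# Hypothetical sample in the same shape as the question's data frame.
df = pd.DataFrame({
    'Genres': [
        [{'id': 16, 'name': 'Animation'}],
        [{'id': 35, 'name': 'Comedy'}, {'id': 18, 'name': 'Drama'},
         {'id': 10749, 'name': 'Romance'}],
        [{'id': 10749, 'name': 'Romance'}],
        [],  # movies without genres become NaN after explode
    ],
    'Vote Average': [7.7, 6.1, 6.5, 0.0],
})

# One row per (movie, genre); drop movies that had no genre at all.
exploded = df.explode('Genres').dropna(subset=['Genres'])

# Pull the genre name out of each dict.
exploded = exploded.assign(genre=exploded['Genres'].map(lambda g: g['name']))

# Mean rating per genre, highest first, limited to the top 20.
top = (exploded.groupby('genre')['Vote Average']
       .mean()
       .sort_values(ascending=False)
       .head(20))
print(top)
```

A movie with several genres contributes its rating to each of them, which may or may not be the weighting you want.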

Extract values from dicts inside lists

I'm trying to extract the values from this JSON file, but I'm having some trouble extracting the data inside the lists in the dict values. For example, for city and state I would like to get only the name values, then create a Pandas DataFrame and select only some of the keys.
I tried using some for with get methods techniques, but without success.
{'birthday': ['1987-07-13T00:00:00.000Z'],
'cpf': ['9999999999999'],
'rg': [],
'gender': ['Feminino'],
'email': ['my_user#bol.com.br'],
'phone_numbers': ['51999999999'],
'photo': [],
'id': 11111111,
'duplicate_id': -1,
'name': 'My User',
'cnpj': [],
'company_name': '[]',
'city': [{'id': '0001', 'name': 'Porto Alegre'}],
'state': [{'id': 100, 'name': 'Rio Grande do Sul', 'fs': 'RS'}],
'type': 'Private Person',
'tags': [],
'pending_tickets_count': 0}
In [123]: data
Out[123]:
{'birthday': ['1987-07-13T00:00:00.000Z'],
'cpf': ['9999999999999'],
'rg': [],
'gender': ['Feminino'],
'email': ['my_user#bol.com.br'],
'phone_numbers': ['51999999999'],
'photo': [],
'id': 11111111,
'duplicate_id': -1,
'name': 'My User',
'cnpj': [],
'company_name': '[]',
'city': [{'id': '0001', 'name': 'Porto Alegre'}],
'state': [{'id': 100, 'name': 'Rio Grande do Sul', 'fs': 'RS'}],
'type': 'Private Person',
'tags': [],
'pending_tickets_count': 0}
In [124]: data2 = {k:v for k,v in data.items() if k in required}
In [125]: data2
Out[125]:
{'birthday': ['1987-07-13T00:00:00.000Z'],
'gender': ['Feminino'],
'id': 11111111,
'name': 'My User',
'city': [{'id': '0001', 'name': 'Porto Alegre'}],
'state': [{'id': 100, 'name': 'Rio Grande do Sul', 'fs': 'RS'}]}
In [126]: pd.DataFrame(data2).assign(
...: city_name=lambda x: x['city'].str.get('name'),
...: state_name=lambda x: x['state'].str.get('name'),
...: state_fs=lambda x: x['state'].str.get('fs')
...: ).drop(['state', 'city'], axis=1)
Out[126]:
birthday gender id name city_name state_name state_fs
0 1987-07-13T00:00:00.000Z Feminino 11111111 My User Porto Alegre Rio Grande do Sul RS
data2 is needed because the columns of a DataFrame cannot differ in length: pd.DataFrame(data) won't work here, as rg has 0 items but birthday has 1 item. required is not defined above; it is assumed to be the list of keys to keep, i.e. the keys shown in Out[125].
If you are dealing directly with JSON files, pd.json_normalize is also worth a look.
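As a rough illustration of pd.json_normalize, the sketch below flattens the state entries and carries the top-level id and name along as metadata. It uses a trimmed copy of the record from the question; the state_ prefix and the variable name flat are choices made here, not part of the original answer.

```python
import pandas as pd

# Trimmed copy of the record from the question (only the keys used here).
data = {
    'id': 11111111,
    'name': 'My User',
    'city': [{'id': '0001', 'name': 'Porto Alegre'}],
    'state': [{'id': 100, 'name': 'Rio Grande do Sul', 'fs': 'RS'}],
}

# One row per entry in data['state']; the top-level 'id' and 'name' are
# repeated on each row as metadata. record_prefix avoids a name clash
# between the state's 'id'/'name' and the top-level 'id'/'name'.
flat = pd.json_normalize(
    data,
    record_path='state',
    meta=['id', 'name'],
    record_prefix='state_',
)
print(flat)
```

json_normalize handles one record_path at a time, so city would need a second call or the assign approach shown above.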

How to extract the first item in a set of dictionaries?

Please see the screenshot...
I'm trying to create another column pulling the first element of the column 'genre' (i.e. Animation for the first one, Adventure for the second one, Romance for the third one and so on...)
Could anyone please help?
You can split the data from the dictionary or from the dataframe. In this code, I convert the column to a list before splitting.
import pandas as pd
### from dictionary
dd = { 'genres':[
[{'id':16,'name':'Animation'},{'id':26,'name':'ChicFlick'}],
[{'id':12,'name':'Adventure'},{'id':22,'name':'Horror'}],
[{'id':18,'name':'Romance'},{'id':28,'name':'Crime'}],
]}
dd['genres2'] = [x[0]['name'] for x in dd['genres']]
print(dd)
### from dataframe
dd = { 'genres':[
[{'id':16,'name':'Animation'},{'id':26,'name':'ChicFlick'}],
[{'id':12,'name':'Adventure'},{'id':22,'name':'Horror'}],
[{'id':18,'name':'Romance'},{'id':28,'name':'Crime'}],
]}
df = pd.DataFrame(dd)
df['genres2'] = [x[0]['name'] for x in df['genres'].to_list()]
print(df.to_string(index=False))
Output
{'genres':
[[{'id': 16, 'name': 'Animation'}, {'id': 26, 'name': 'ChicFlick'}],
[{'id': 12, 'name': 'Adventure'}, {'id': 22, 'name': 'Horror'}],
[{'id': 18, 'name': 'Romance'}, {'id': 28, 'name': 'Crime'}]],
'genres2': ['Animation', 'Adventure', 'Romance']}
genres genres2
[{'id': 16, 'name': 'Animation'}, {'id': 26, 'name': 'ChicFlick'}] Animation
[{'id': 12, 'name': 'Adventure'}, {'id': 22, 'name': 'Horror'}] Adventure
[{'id': 18, 'name': 'Romance'}, {'id': 28, 'name': 'Crime'}] Romance
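Note that x[0] raises an IndexError for a row whose list is empty. A guarded variant could look like the sketch below; the sample rows and the first_genre column name are made up for illustration, and using None for missing genres is just one possible choice.

```python
import pandas as pd

# Hypothetical sample, including one row with no genres at all.
df = pd.DataFrame({'genres': [
    [{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}],
    [],
    [{'id': 10749, 'name': 'Romance'}],
]})

# Take the name of the first dict, or None when the list is empty.
df['first_genre'] = [
    genres[0]['name'] if genres else None
    for genres in df['genres']
]
print(df['first_genre'].tolist())
```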

python DataFrame split dict column to multiple columns

The column looks this way:
0 [{'id': 18, 'name': 'Drama'}, {'id': 10769, 'n...
1 [{'id': 18, 'name': 'Drama'}, {'id': 10749, 'n...
2 [{'id': 35, 'name': 'Comedy'}, {'id': 27, 'nam...
3 [{'id': 18, 'name': 'Drama'}]
4 [{'id': 99, 'name': 'Documentary'}]
5 [{'id': 18, 'name': 'Drama'}, {'id': 28, 'name...
6 [{'id': 10749, 'name': 'Romance'}, {'id': 18, ...
I wish to see ID columns with bool value for each genre:
index id=18 id=10769 id=35 id=27 ...
0 1 1 0 0 ...
1 1 0 0 0 ...
2 0 0 1 1 ...
3 1 0 0 0 ...
...
Use a list comprehension with flattening and then the DataFrame constructor:
df = pd.DataFrame({'col':[[{'id': 18, 'name': 'Drama'}, {'id': 10769}],
[{'id': 99, 'name': 'Documentary'}]]})
print (type(df.loc[0, 'col']))
<class 'list'>
df = pd.DataFrame([y for x in df['col'] for y in x])
print (df)
      id         name
0     18        Drama
1  10769          NaN
2     99  Documentary
#alternative
#df = pd.concat([pd.DataFrame(x) for x in df['col']], ignore_index=True)
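The flattening above produces one row per genre rather than the 0/1 column per id shown in the question. One way to get that layout is sketched below, assuming each cell is a real list of dicts; the sample rows and the indicators name are made up for illustration.

```python
import pandas as pd

# Hypothetical sample in the same shape as the question's column.
df = pd.DataFrame({'col': [
    [{'id': 18, 'name': 'Drama'}, {'id': 35, 'name': 'Comedy'}],
    [{'id': 18, 'name': 'Drama'}],
    [{'id': 99, 'name': 'Documentary'}],
    [],
]})

# Build one dict per row that marks every genre id present with 1.
rows = [{f"id={g['id']}": 1 for g in genres} for genres in df['col']]

# Missing keys become NaN in the constructor; turn them into 0.
indicators = pd.DataFrame(rows, index=df.index).fillna(0).astype(int)
print(indicators)
```

The result can be attached to the original frame with df.join(indicators) if the other columns are still needed.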
