I am working on a Python program that outputs a dataset of item rankings.
This is my code:
import pandas as pd

# Named `items` rather than `list` so the built-in list type is not shadowed
items = [
    {'ranking': 1, 'sku': 'WD-0215', 'name': 'Sofa', 'price': '$1,299.00', 'detail': 'Red'},
    {'ranking': 1, 'sku': 'WD-0215', 'name': 'Sofa', 'price': '$1,299.00', 'detail': 'Cottom'},
    {'ranking': 1, 'sku': 'WD-0215', 'name': 'Sofa', 'price': '$1,299.00', 'detail': 'Wood Lab'},
    {'ranking': 2, 'sku': 'sfr20', 'name': 'TV', 'price': '$1,861.00 – $3,699.00', 'detail': 'W1360×D750×H710'},
    {'ranking': 2, 'sku': 'sfr20', 'name': 'TV', 'price': '$1,861.00 – $3,699.00', 'detail': 'LED'},
    {'ranking': 2, 'sku': 'sfr20', 'name': 'TV', 'price': '$1,861.00 – $3,699.00', 'detail': 'Made in Japan'},
    {'ranking': 2, 'sku': 'sfr20', 'name': 'TV', 'price': '$1,861.00 – $3,699.00', 'detail': 'Nordic'},
]
df = pd.DataFrame(items)
print(df)
df.to_csv('item.csv', encoding='utf_8_sig')
However, my expected output should look like this:
ranking  sku      name  price              detail1          detail2  detail3        detail4
1        WD-0215  Sofa  $1299.00           Red              Cottom   Wood Lab       none
1        sfr20    TV    $1861.00-$3699.00  W1360×D750×H710  LED      Made in Japan  Nordic
How can I change the code to output this result?
Use GroupBy.cumcount to number the rows within each item, then reshape with Series.unstack:
g = df.groupby(['ranking', 'sku', 'name', 'price']).cumcount().add(1)
df = df.set_index(['ranking', 'sku', 'name', 'price', g])['detail'].unstack().add_prefix('detail')
print (df)
detail1 detail2 \
ranking sku name price
1 WD-0215 Sofa $1,299.00 Red Cottom
2 sfr20 TV $1,861.00 – $3,699.00 W1360×D750×H710 LED
detail3 detail4
ranking sku name price
1 WD-0215 Sofa $1,299.00 Wood Lab NaN
2 sfr20 TV $1,861.00 – $3,699.00 Made in Japan Nordic
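As a self-contained sketch of the same idea, the snippet below runs the cumcount/unstack reshape on a trimmed copy of the data and then calls reset_index so the grouping keys become ordinary columns again, which is closer to the flat table asked for. The names rows, keys, counter and wide are illustrative only, and the price column is left out to keep the lines short.

```python
import pandas as pd

# Trimmed, illustrative copy of the question's data (price column omitted)
rows = [
    {'ranking': 1, 'sku': 'WD-0215', 'name': 'Sofa', 'detail': 'Red'},
    {'ranking': 1, 'sku': 'WD-0215', 'name': 'Sofa', 'detail': 'Cottom'},
    {'ranking': 2, 'sku': 'sfr20', 'name': 'TV', 'detail': 'LED'},
    {'ranking': 2, 'sku': 'sfr20', 'name': 'TV', 'detail': 'Nordic'},
    {'ranking': 2, 'sku': 'sfr20', 'name': 'TV', 'detail': 'Made in Japan'},
]
df = pd.DataFrame(rows)

keys = ['ranking', 'sku', 'name']
# 1, 2, 3, ... within each item; this becomes the suffix of the detail columns
counter = df.groupby(keys).cumcount().add(1)

wide = (
    df.set_index(keys + [counter])['detail']
      .unstack()              # move the counter level into the columns
      .add_prefix('detail')   # 1 -> detail1, 2 -> detail2, ...
      .reset_index()          # turn ranking/sku/name back into columns
)
print(wide)
```

Items with fewer details than the widest one get NaN in the unused columns. Passing index=False to to_csv afterwards should keep the row numbers out of the file.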
Related
I am experimenting on a project with a large dataset of movies. I have a large data frame with one column named "Genres" and one named "Vote Average". My goal is to find the 20 highest-rated genres based on "Vote Average".
I would use a group by but I can't seem to figure it out because the genre information looks like this in the column "Genres" :
[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'name': 'Drama'}, {'id': 10749, 'name': 'Romance'}]
How can I extract Comedy, Drama and Romance from the list above?
How can I group by individual genres while assigning each row's "Vote Average" to each of its genres, so I can print the top 20 rated genres in the data frame?
Genres Vote Average
1 [{'id': 16, 'name': 'Animation'}, {'id': 35, '... 7.7
2 [{'id': 12, 'name': 'Adventure'}, {'id': 14, '... 6.9
3 [{'id': 10749, 'name': 'Romance'}, {'id': 35, ... 6.5
4 [{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam... 6.1
5 [{'id': 35, 'name': 'Comedy'}] 5.7
... ... ...
32255 [{'id': 878, 'name': 'Science Fiction'}] 3.5
32256 [{'id': 18, 'name': 'Drama'}, {'id': 28, 'name... 5.7
32257 [{'id': 28, 'name': 'Action'}, {'id': 18, 'nam... 3.8
32258 [] 0.0
32259 [] 0.0
EDIT: Example from Data frame is above. movies_metadata.csv from https://www.kaggle.com/rounakbanik/the-movies-dataset
EDIT:
Now that I have seen all the information on Kaggle, I think it may need a totally different method, because these genres are assigned to titles and cannot be put in separate rows.
OLD:
Now you have to convert it to a proper DataFrame with the genres in separate rows instead of
[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'name': 'Drama'}, ...]
Here is my example data
import pandas as pd
df = pd.DataFrame([
{'Genre': [{'id': 16, 'name': 'Animation'}], 'Vote Average': 7.7},
{'Genre': [{'id': 35, 'name': 'Comedy'}, {'id': 18, 'name': 'Drama'}, {'id': 10749, 'name': 'Romance'}], 'Vote Average': 6.1},
{'Genre': [{'id': 10749, 'name': 'Romance'}], 'Vote Average': 6.5},
])
print(df)
result:
Genre Vote Average
0 [{'id': 16, 'name': 'Animation'}] 7.7
1 [{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam... 6.1
2 [{'id': 10749, 'name': 'Romance'}] 6.5
You can iterate over every row and use pd.DataFrame(row['Genre']) to build a small DataFrame per movie, then combine them all into one new DataFrame
frames = []
for index, row in df.iterrows():
    temp_df = pd.DataFrame(row['Genre'])
    temp_df['Vote Average'] = row['Vote Average']
    frames.append(temp_df)
# pd.concat replaces DataFrame.append, which was removed in pandas 2.0
new_df = pd.concat(frames)
print(new_df)
result:
id name Vote Average
0 16 Animation 7.7
0 35 Comedy 6.1
1 18 Drama 6.1
2 10749 Romance 6.1
0 10749 Romance 6.5
and now you can do whatever you like.
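For the original goal (the top 20 genres by rating), one possible next step is to group the long table by genre name and average the votes. This is only a sketch: long_df is a hand-written stand-in for the result printed above, and using the plain mean as the ranking criterion is an assumption, since the question does not say how ratings should be aggregated.

```python
import pandas as pd

# Hand-written stand-in for the one-row-per-genre table shown above
long_df = pd.DataFrame({
    'name': ['Animation', 'Comedy', 'Drama', 'Romance', 'Romance'],
    'Vote Average': [7.7, 6.1, 6.1, 6.1, 6.5],
})

# Mean rating per genre, keeping the 20 highest
top = long_df.groupby('name')['Vote Average'].mean().nlargest(20)
print(top)
```

On the full dataset it may be worth dropping genres with very few movies before ranking, because a mean over a handful of rows is noisy.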
Another way to reshape the data:
First convert each list of dictionaries into separate rows, one dictionary per row
new_df = df.explode('Genre')
print(new_df)
result:
Genre Vote Average
0 {'id': 16, 'name': 'Animation'} 7.7
1 {'id': 35, 'name': 'Comedy'} 6.1
1 {'id': 18, 'name': 'Drama'} 6.1
1 {'id': 10749, 'name': 'Romance'} 6.1
2 {'id': 10749, 'name': 'Romance'} 6.5
and later convert every dictionary to columns
new_df['id'] = new_df['Genre'].str['id']
new_df['name'] = new_df['Genre'].str['name']
print(new_df)
result:
Genre Vote Average id name
0 {'id': 16, 'name': 'Animation'} 7.7 16 Animation
1 {'id': 35, 'name': 'Comedy'} 6.1 35 Comedy
1 {'id': 18, 'name': 'Drama'} 6.1 18 Drama
1 {'id': 10749, 'name': 'Romance'} 6.1 10749 Romance
2 {'id': 10749, 'name': 'Romance'} 6.5 10749 Romance
or using
new_df[['id','name']] = new_df['Genre'].apply(pd.Series)
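A further option, sketched here under the assumption that the Genre column already holds real Python lists (not strings), is pd.json_normalize with record_path and meta, which does the explode and the column split in one call. The variable names records and flat are illustrative.

```python
import pandas as pd

df = pd.DataFrame([
    {'Genre': [{'id': 16, 'name': 'Animation'}], 'Vote Average': 7.7},
    {'Genre': [{'id': 35, 'name': 'Comedy'}, {'id': 18, 'name': 'Drama'},
               {'id': 10749, 'name': 'Romance'}], 'Vote Average': 6.1},
    {'Genre': [{'id': 10749, 'name': 'Romance'}], 'Vote Average': 6.5},
])

# json_normalize works on a list of dicts, so convert the rows first
records = df.to_dict('records')
# One output row per dict inside 'Genre'; 'Vote Average' is repeated for each
flat = pd.json_normalize(records, record_path='Genre', meta=['Vote Average'])
print(flat)
```

Rows whose Genre list is empty contribute no output rows with this approach, which may or may not be what is wanted.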
Full example
import pandas as pd
df = pd.DataFrame([
{'Genre': [{'id': 16, 'name': 'Animation'}], 'Vote Average': 7.7},
{'Genre': [{'id': 35, 'name': 'Comedy'}, {'id': 18, 'name': 'Drama'}, {'id': 10749, 'name': 'Romance'}], 'Vote Average': 6.1},
{'Genre': [{'id': 10749, 'name': 'Romance'}], 'Vote Average': 6.5},
])
print('--- df ---')
print(df)
print('--- iterrows ---')
frames = []
for index, row in df.iterrows():
    temp_df = pd.DataFrame(row['Genre'])
    temp_df['Vote Average'] = row['Vote Average']
    frames.append(temp_df)
# pd.concat replaces DataFrame.append, which was removed in pandas 2.0
new_df = pd.concat(frames)
print(new_df)
print('--- explode #1 ---')
new_df = df.explode('Genre')
print(new_df)
print('--- columns #1 ---')
new_df['id'] = new_df['Genre'].str['id']
new_df['name'] = new_df['Genre'].str['name']
new_df.drop('Genre', inplace=True, axis=1)
new_df.reset_index(inplace=True)
print(new_df)
print('--- explode #2 ---')
new_df = df.explode('Genre')
print(new_df)
print('--- columns #2 ---')
new_df[['id','name']] = new_df['Genre'].apply(pd.Series)
new_df.drop('Genre', inplace=True, axis=1)
new_df.reset_index(inplace=True)
print(new_df)
[{'id': 523535,
'type': 'array',
'name': 'Index',
'value': '- rea - das - faA -\n'},
{'id': 425322,
'type': 'array',
'name': 'status',
'value': '321 - 323 - - B332\n'},
{'id': 425322, 'type': 'array', 'name': 'Index', 'value': 'I'},
{'id': 527942, 'type': 'array', 'name': 'status', 'value': 'BF'}]
I want a DataFrame that contains only the name and value entries, where the column names are Freigabestatus and Index and their values are BF and I, as you can see below.
_____________________
|Freigabestatus |Index|
_______________________
| BF |I |
_______________________
import pandas as pd
lst = [{'id': 1050881,
'type': 'array',
'name': 'Index',
'value': '- AF - H04 - SCA -\n'},
{'id': 1050882,
'type': 'array',
'name': 'Freigabestatus',
'value': 'U1 - 000 - I - BF\n'},
{'id': 1050909, 'type': 'array', 'name': 'Index', 'value': 'I'},
{'id': 1050949, 'type': 'array', 'name': 'Freigabestatus', 'value': 'BF'}]
df = pd.DataFrame({'Freigabestatus': [d['value'] for d in lst if d['name'] =='Freigabestatus']})
df['Index'] = [d['value'] for d in lst if d['name'] == 'Index']
df
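The code above keeps every value, so the result has two rows. If only the single row with BF and I is wanted, one possible sketch is to let later entries overwrite earlier ones in a dict. This assumes the last entry for each name is the one to keep, which matches the sample but is not stated in the question; latest and one_row are illustrative names.

```python
import pandas as pd

lst = [
    {'id': 1050881, 'type': 'array', 'name': 'Index', 'value': '- AF - H04 - SCA -\n'},
    {'id': 1050882, 'type': 'array', 'name': 'Freigabestatus', 'value': 'U1 - 000 - I - BF\n'},
    {'id': 1050909, 'type': 'array', 'name': 'Index', 'value': 'I'},
    {'id': 1050949, 'type': 'array', 'name': 'Freigabestatus', 'value': 'BF'},
]

# Later entries overwrite earlier ones, so each name maps to its last value
latest = {d['name']: d['value'] for d in lst}

# A list with one dict gives a one-row DataFrame; columns fixes the order
one_row = pd.DataFrame([latest], columns=['Freigabestatus', 'Index'])
print(one_row)
```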
I'm trying to extract values from this JSON file, but I'm having trouble extracting the data nested inside lists in the dict values. For example, for city and state I would like to get only the name values, then create a pandas DataFrame that keeps only some of the keys.
I tried using for loops with the get method, but without success.
{'birthday': ['1987-07-13T00:00:00.000Z'],
'cpf': ['9999999999999'],
'rg': [],
'gender': ['Feminino'],
'email': ['my_user#bol.com.br'],
'phone_numbers': ['51999999999'],
'photo': [],
'id': 11111111,
'duplicate_id': -1,
'name': 'My User',
'cnpj': [],
'company_name': '[]',
'city': [{'id': '0001', 'name': 'Porto Alegre'}],
'state': [{'id': 100, 'name': 'Rio Grande do Sul', 'fs': 'RS'}],
'type': 'Private Person',
'tags': [],
'pending_tickets_count': 0}
In [123]: data
Out[123]:
{'birthday': ['1987-07-13T00:00:00.000Z'],
'cpf': ['9999999999999'],
'rg': [],
'gender': ['Feminino'],
'email': ['my_user#bol.com.br'],
'phone_numbers': ['51999999999'],
'photo': [],
'id': 11111111,
'duplicate_id': -1,
'name': 'My User',
'cnpj': [],
'company_name': '[]',
'city': [{'id': '0001', 'name': 'Porto Alegre'}],
'state': [{'id': 100, 'name': 'Rio Grande do Sul', 'fs': 'RS'}],
'type': 'Private Person',
'tags': [],
'pending_tickets_count': 0}
In [124]: data2 = {k:v for k,v in data.items() if k in required}
In [125]: data2
Out[125]:
{'birthday': ['1987-07-13T00:00:00.000Z'],
'gender': ['Feminino'],
'id': 11111111,
'name': 'My User',
'city': [{'id': '0001', 'name': 'Porto Alegre'}],
'state': [{'id': 100, 'name': 'Rio Grande do Sul', 'fs': 'RS'}]}
In [126]: pd.DataFrame(data2).assign(
...: city_name=lambda x: x['city'].str.get('name'),
...: state_name=lambda x: x['state'].str.get('name'),
...: state_fs=lambda x: x['state'].str.get('fs')
...: ).drop(['state', 'city'], axis=1)
Out[126]:
birthday gender id name city_name state_name state_fs
0 1987-07-13T00:00:00.000Z Feminino 11111111 My User Porto Alegre Rio Grande do Sul RS
data2 is needed because a DataFrame cannot have columns of different lengths. Here pd.DataFrame(data) would fail, since rg has 0 items while birthday has 1.
If you are working directly with JSON files, pd.json_normalize is also worth a look.
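The session above never shows how required is defined. The script below is a runnable sketch of the same steps with required filled in as an assumption, taken from the keys that appear in Out[125]; the data dict is also shortened to the fields that matter.

```python
import pandas as pd

# Shortened copy of the question's data; 'rg' is kept to show an empty list
data = {
    'birthday': ['1987-07-13T00:00:00.000Z'],
    'rg': [],
    'gender': ['Feminino'],
    'id': 11111111,
    'name': 'My User',
    'city': [{'id': '0001', 'name': 'Porto Alegre'}],
    'state': [{'id': 100, 'name': 'Rio Grande do Sul', 'fs': 'RS'}],
}

# Assumption: the keys to keep, inferred from Out[125] in the session
required = ['birthday', 'gender', 'id', 'name', 'city', 'state']
data2 = {k: v for k, v in data.items() if k in required}

df = (
    pd.DataFrame(data2)
      .assign(
          # .str.get looks up a key in each dict stored in the column
          city_name=lambda x: x['city'].str.get('name'),
          state_name=lambda x: x['state'].str.get('name'),
          state_fs=lambda x: x['state'].str.get('fs'),
      )
      .drop(['state', 'city'], axis=1)
)
print(df)
```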
I have a dataframe about movies and one of the columns is genre.
The entries of this column are in the form of list like -
[{'id': 35, 'name': 'Comedy'},
{'id': 18, 'name': 'Drama'},
{'id': 10751, 'name': 'Family'},
{'id': 10749, 'name': 'Romance'}]
My aim is to extract the genre from the list and store them as a list such as ['Comedy', 'Drama', 'Family', 'Romance'].
When I print an entry of the column, for example
data['genres'][1], it returns the list wrapped in quotes (datatype: string)
"[{'id': 35, 'name': 'Comedy'}]"
Can someone help me get the list without the quotes, like [{'id': 35, 'name': 'Comedy'}]? I should be able to take it from there.
When I create my custom example, it works as expected and returns a list without quotes. For example -
ref = pd.DataFrame({'col':[[1,2,3],[4,3,2]]})
ref['col'][0]
This returns a list (without quotes).
The problem is that the column holds string representations, so first convert each string back to Python objects with ast.literal_eval and then extract the value with get:
a = [{'id': 35, 'name': 'Comedy'},
{'id': 18, 'name': 'Drama'},
{'id': 10751, 'name': 'Family'},
{'id': 10749, 'name': 'Romance'}]
df = pd.DataFrame({'col':a}).astype(str)
import ast
df['genres'] = df['col'].apply(lambda x: ast.literal_eval(x).get('name'))
print (df)
col genres
0 {'id': 35, 'name': 'Comedy'} Comedy
1 {'id': 18, 'name': 'Drama'} Drama
2 {'id': 10751, 'name': 'Family'} Family
3 {'id': 10749, 'name': 'Romance'} Romance
If you need all the values as columns:
df = pd.DataFrame({'a':list('abcd'),'col':a}).astype(str)
df = df.join(pd.DataFrame([ast.literal_eval(x) for x in df.pop('col')], index=df.index))
print (df)
a id name
0 a 35 Comedy
1 b 18 Drama
2 c 10751 Family
3 d 10749 Romance
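The examples above parse one dict per row. For the case described in the question, where each cell is the string form of a whole list of dicts, a minimal sketch could look like this. The column name genres follows the question; genre_names and parsed are illustrative names.

```python
import ast

import pandas as pd

# Each cell is a string that looks like a Python list of dicts
data = pd.DataFrame({'genres': [
    "[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'name': 'Drama'}]",
    "[{'id': 10749, 'name': 'Romance'}]",
    "[]",
]})

# literal_eval safely turns the string back into a real list of dicts
parsed = data['genres'].apply(ast.literal_eval)

# Keep only the 'name' of every dict in each list
data['genre_names'] = parsed.apply(lambda items: [d['name'] for d in items])
print(data['genre_names'].tolist())
```

ast.literal_eval only evaluates Python literals, so it is a safer choice than eval for data read from a file. Cells that are missing (NaN) would need to be handled before parsing.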
I have a complex JSON data structure and have to convert it to a data frame. The JSON structure is as follows:
{'fields': [{'id': 'a', 'label': 'Particulars', 'type': 'string'},
{'id': 'b', 'label': 'States', 'type': 'string'},
{'id': 'c', 'label': 'Gender', 'type': 'string'},
{'id': 'd', 'label': ' 11-2013', 'type': 'string'},
{'id': 'e', 'label': ' 12-2013', 'type': 'string'},
{'id': 'f', 'label': ' 1-2014', 'type': 'string'},
{'id': 'g', 'label': ' 2-2014', 'type': 'string'}],
'data': [['Animal Husbandry- incl Poultry, Dairy and Herdsman',
'Andhra Pradesh',
'Men',
'156.12',
'153.18',
'163.56',
'163.56'],
['Animal Husbandry- incl Poultry, Dairy and Herdsman',
'Bihar',
'Men',
'159.39',
'149.38',
'147.24',
'155.89'],
['Animal Husbandry- incl Poultry, Dairy and Herdsman',
'Gujarat',
'Men',
'157.08',
'145',
'145',
'145']]}
I want to make a dataframe from it in the following format:
I tried using the read_json function directly, which gives me an error. Then I tried json_normalize, which does not give me the desired output because I don't understand how it works. Can anyone tell me how I should use json_normalize() to get the output in my required format?
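Use json_normalize and set the column names with a list comprehension:
import pandas as pd
# d is the parsed JSON dictionary shown in the question
# pd.json_normalize is the top-level name since pandas 1.0; importing it from pandas.io.json is deprecated
df = pd.json_normalize(d, 'data')
df.columns = [x.get('label') for x in d['fields']]
print(df)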
Particulars States Gender \
0 Animal Husbandry- incl Poultry, Dairy and Herd... Andhra Pradesh Men
1 Animal Husbandry- incl Poultry, Dairy and Herd... Bihar Men
2 Animal Husbandry- incl Poultry, Dairy and Herd... Gujarat Men
11-2013 12-2013 1-2014 2-2014
0 156.12 153.18 163.56 163.56
1 159.39 149.38 147.24 155.89
2 157.08 145 145 145
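Because data is already a list of rows, the plain DataFrame constructor may be enough here without json_normalize. The sketch below uses a shortened copy of the structure with illustrative row labels, strips the leading spaces from the labels, and converts the month columns to floats; both of those clean-up steps are assumptions about what is wanted, not part of the answer above.

```python
import pandas as pd

# Shortened copy of the structure from the question
d = {
    'fields': [
        {'id': 'a', 'label': 'Particulars', 'type': 'string'},
        {'id': 'b', 'label': 'States', 'type': 'string'},
        {'id': 'c', 'label': 'Gender', 'type': 'string'},
        {'id': 'd', 'label': ' 11-2013', 'type': 'string'},
        {'id': 'e', 'label': ' 12-2013', 'type': 'string'},
    ],
    'data': [
        ['Animal Husbandry', 'Andhra Pradesh', 'Men', '156.12', '153.18'],
        ['Animal Husbandry', 'Bihar', 'Men', '159.39', '149.38'],
        ['Animal Husbandry', 'Gujarat', 'Men', '157.08', '145'],
    ],
}

# Column names come from the labels; strip() removes the leading spaces
columns = [field['label'].strip() for field in d['fields']]
df = pd.DataFrame(d['data'], columns=columns)

# The monthly values arrive as strings, so convert them to numbers
df = df.astype({col: float for col in columns[3:]})
print(df)
```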