Converting to dataframe, beginner question

Converting to dataframe, beginner question - python

I have a piece of data that looks like this
my_data[:5]
returns:
[{'key': ['Aaliyah', '2', '2016'], 'values': ['10']},
{'key': ['Aaliyah', '2', '2017'], 'values': ['26']},
{'key': ['Aaliyah', '2', '2018'], 'values': ['21']},
{'key': ['Aaliyah', '2', '2019'], 'values': ['26']},
{'key': ['Aaliyah', '2', '2020'], 'values': ['15']}]
The key represents Name, Gender, and Year. The value is number.
I do not manage to generate a data frame with columns name, gender, year, and number.
Can you help me?

Here is one way, using a generator:
from itertools import chain
pd.DataFrame.from_records((dict(zip(['name', 'gender', 'year', 'number'],
chain(*e.values())))
for e in my_data))
Without itertools:
pd.DataFrame(((E:=list(e.values()))[0]+E[1] for e in my_data),
columns=['name', 'gender', 'year', 'number'])
output:
name gender year number
0 Aaliyah 2 2016 10
1 Aaliyah 2 2017 26
2 Aaliyah 2 2018 21
3 Aaliyah 2 2019 26
4 Aaliyah 2 2020 15

Related

Match values of different columns and dataframes

Check if value from one dataframe exists in another dataframe
df_threads = pd.DataFrame({'conv_id': ['12', '15', '14', '23'], 'tweets': ['Nice', 'Test', 'Hi', 'Yes']})
df_orig = pd.DataFrame({'id': ['17', '12', '15', '6','67'], 'tweets': ['hola', 'mundo', 'yes', 'look', 'Beautiful']})
I would like to know if the value of the column "id" of the dataframe df_orig is in the column "conv_id" of df_threads and if it is true, add a column "File" in df_threads with the value "Spanish" and if it False set Portuguese.
Expected Output
df_threads
conv_id tweet File
---------------------------------------------
12 "Nice" "Spanish"
15 "Test" "Spanish"
14 "Hi" "Spanish"
23 "Yes" "Spanish"
56 "NotHappy" "Portuguese"

Use np.where and df.where:
df_threads['File'] = np.where(df_threads['conv_id'].isin(df_orig['id']),
'Spanish', 'Portuguese')
>>> df_threads
conv_id tweets File
0 12 Nice Spanish
1 15 Test Spanish
2 14 Hi Portuguese
3 23 Yes Portuguese
Note: check your input data and your outcome, they not match.

Several dictionary with duplicate key but different value and no limit in column

Here dataset with unlimited key in dictionary. The detail column in row may have different information products depending on customer.
ID Name Detail
1 Sara [{"Personal":{"ID":"001","Name":"Sara","Type":"01","TypeName":"Book"},"Order":[{"ID":"0001","Date":"20200222","ProductID":"C0123","ProductName":"ABC", "Price":"4"}]},{"Personal":{"ID":"001","Name":"Sara","Type":"02","TypeName":"Food"},"Order":[{"ID":"0004","Date":"20200222","ProductID":"D0123","ProductName":"Small beef", "Price":"15"}]},{"Personal":{"ID":"001","Name":"Sara","Type":"02","TypeName":"Food"},"Order":[{"ID":"0005","Date":"20200222","ProductID":"D0200","ProductName":"Shrimp", "Price":"28"}]}]
2 Frank [{"Personal":{"ID":"002","Name":"Frank","Type":"02","TypeName":"Food"},"Order":[{"ID":"0008","Date":"20200228","ProductID":"D0288","ProductName":"Salmon", "Price":"24"}]}]
My expected output is
ID Name Personal_ID Personal_Name Personal_Type Personal_TypeName Personal_Order_ID Personal_Order_Date Personal_Order_ProductID Personal_Order_ProductName Personal_Order_Price
1 Sara 001 Sara 01 Book 0001 20200222 C0123 ABC 4
2 Sara 001 Sara 02 Food 0004 20200222 D0123 Small beef 15
3 Sara 001 Sara 02 Food 0005 20200222 D0200 Shrimp 28
4 Frank 002 Frank 02 Food 0008 20200228 D0288 Salmon 24

So basically you have a nested JSON in your detail column that you need to break out into a df then merge with your original.
import pandas as pd
import json
from pandas import json_normalize
#create empty df to hold the detail information
detailDf = pd.DataFrame()
#We will need to loop over each row to read each JSON
for ind, row in df.iterrows():
#Read the json, make it a DF, then append the information to the empty DF
detailDf = detailDf.append(json_normalize(json.loads(row['Detail']), record_path = ('Order'), meta = [['Personal','ID'], ['Personal','Name'], ['Personal','Type'],['Personal','TypeName']]))
# Personally, you don't really need the rest of the code, as the columns Personal.Name
# and Personal.ID is the same information, but none the less.
# You will have to merge on name and ID
df = df.merge(detailDf, how = 'right', left_on = [df['Name'], df['ID']], right_on = [detailDf['Personal.Name'], detailDf['Personal.ID'].astype(int)])
#Clean up
df.rename(columns = {'ID_x':'ID', 'ID_y':'Personal_Order_ID'}, inplace = True)
df.drop(columns = {'Detail', 'key_1', 'key_0'}, inplace = True)
If you look through my comments, I recommend using detailDf as your final df as the merge really isnt necessary and that information is already in the Detail JSON.

First you need to create a function that processes the list of dicts in each row of Detail column. Briefly, pandas can process a list of dicts as a dataframe. So all I am doing here is processing the list of dicts in each row of Personal and Detail column, to get mapped dataframes which can be merged for each entry. This function when applied :
def processdicts(x):
personal=pd.DataFrame.from_dict(list(pd.DataFrame.from_dict(x)['Personal']))
personal=personal.rename(columns={"ID": "Personal_ID"})
personal['Personal_Name']=personal['Name']
orders=pd.DataFrame(list(pd.DataFrame.from_dict(list(pd.DataFrame.from_dict(x)['Order']))[0]))
orders=orders.rename(columns={"ID": "Order_ID"})
personDf=orders.merge(personal, left_index=True, right_index=True)
return personDf
Create an empty dataframe that will contain the compiled data
outcome=pd.DataFrame(columns=[],index=[])
Now process the data for each row of the DataFrame using the function we created above. Using a simple for loop here to show the process. 'apply' function can also be called for greater efficiency but with slight modification of the concat process. With an empty dataframe at hand where we will concat the data from each row, for loop is as simple as 2 lines below:
for details in yourdataframe['Detail']:
outcome=pd.concat([outcome,processdicts(details)])
Finally reset index:
outcome=outcome.reset_index(drop=True)
You may rename columns according to your requirement in the final dataframe. For example:
outcome=outcome.rename(columns={"TypeName": "Personal_TypeName","ProductName":"Personal_Order_ProductName","ProductID":"Personal_Order_ProductID","Price":"Personal_Order_Price","Date":"Personal_Order_Date","Order_ID":"Personal_Order_ID","Type":"Personal_Type"})
Order (or skip) the columns according to your requirement using:
outcome=outcome[['Name','Personal_ID','Personal_Name','Personal_Type','Personal_TypeName','Personal_Order_ID','Personal_Order_Date','Personal_Order_ProductID','Personal_Order_ProductName','Personal_Order_Price']]
Assign a name to the index of the dataframe:
outcome.index.name='ID'
This should help.

You can use explode to get all elements of lists in Details separatly, and then you can use Shubham Sharma's answer,
import io
import pandas as pd
#Creating dataframe:
s_e='''
ID Name
1 Sara
2 Frank
'''
df = pd.read_csv(io.StringIO(s_e), sep='\s\s+', engine='python')
df['Detail']=[[{"Personal":{"ID":"001","Name":"Sara","Type":"01","TypeName":"Book"},"Order":[{"ID":"0001","Date":"20200222","ProductID":"C0123","ProductName":"ABC", "Price":"4"}]},{"Personal":{"ID":"001","Name":"Sara","Type":"02","TypeName":"Food"},"Order":[{"ID":"0004","Date":"20200222","ProductID":"D0123","ProductName":"Small beef", "Price":"15"}]},{"Personal":{"ID":"001","Name":"Sara","Type":"02","TypeName":"Food"},"Order":[{"ID":"0005","Date":"20200222","ProductID":"D0200","ProductName":"Shrimp", "Price":"28"}]}],[{"Personal":{"ID":"002","Name":"Frank","Type":"02","TypeName":"Food"},"Order":[{"ID":"0008","Date":"20200228","ProductID":"D0288","ProductName":"Salmon", "Price":"24"}]}]]
#using explode
df = df.explode('Detail').reset_index()
df['Detail']=df['Detail'].apply(lambda x: [x])
print('using explode:', df)
#retrieved from #Shubham Sharma's answer:
personal = df['Detail'].str[0].str.get('Personal').apply(pd.Series).add_prefix('Personal_')
order = df['Detail'].str[0].str.get('Order').str[0].apply(pd.Series).add_prefix('Personal_Order_')
result = pd.concat([df[['ID', "Name"]], personal, order], axis=1)
#reset ID
result['ID']=[i+1 for i in range(len(result.index))]
print(result)
Output:
#Using explode:
index ID Name Detail
0 0 1 Sara [{'Personal': {'ID': '001', 'Name': 'Sara', 'Type': '01', 'TypeName': 'Book'}, 'Order': [{'ID': '0001', 'Date': '20200222', 'ProductID': 'C0123', 'ProductName': 'ABC', 'Price': '4'}]}]
1 0 1 Sara [{'Personal': {'ID': '001', 'Name': 'Sara', 'Type': '02', 'TypeName': 'Food'}, 'Order': [{'ID': '0004', 'Date': '20200222', 'ProductID': 'D0123', 'ProductName': 'Small beef', 'Price': '15'}]}]
2 0 1 Sara [{'Personal': {'ID': '001', 'Name': 'Sara', 'Type': '02', 'TypeName': 'Food'}, 'Order': [{'ID': '0005', 'Date': '20200222', 'ProductID': 'D0200', 'ProductName': 'Shrimp', 'Price': '28'}]}]
3 1 2 Frank [{'Personal': {'ID': '002', 'Name': 'Frank', 'Type': '02', 'TypeName': 'Food'}, 'Order': [{'ID': '0008', 'Date': '20200228', 'ProductID': 'D0288', 'ProductName': 'Salmon', 'Price': '24'}]}]
#result:
ID Name Personal_ID Personal_Name Personal_Type Personal_TypeName Personal_Order_ID Personal_Order_Date Personal_Order_ProductID Personal_Order_ProductName Personal_Order_Price
0 1 Sara 001 Sara 01 Book 0001 20200222 C0123 ABC 4
1 2 Sara 001 Sara 02 Food 0004 20200222 D0123 Small beef 15
2 3 Sara 001 Sara 02 Food 0005 20200222 D0200 Shrimp 28
3 4 Frank 002 Frank 02 Food 0008 20200228 D0288 Salmon 24

Python Pandas - Handling columns with nested dictionary(json) values

I have a column "data" which has json object as values. I would like to split them up.
source = {'_id':['SE-DATA-BB3A','SE-DATA-BB3E','SE-DATA-BB3F'], 'pi':['BB3A_CAP_BMLS','BB3E_CAP_BMLS','BB3F_CAP_PAS'], 'Datetime':['190725-122500', '190725-122500', '190725-122500'], 'data': [ {'bb3a_bmls':[{'name': 'WAG 01', 'id': '105F', 'state': 'available', 'nodes': 3,'volumes-': [{'state': 'available', 'id': '330172', 'name': 'q_-4144d4e'}, {'state': 'available', 'id': '275192', 'name': 'p_3089d821ae', }]}]}
, {'bb3b_bmls':[{'name': 'FEC 01', 'id': '382E', 'state': 'available', 'nodes': 4,'volumes': [{'state': 'unavailable', 'id': '830172', 'name': 'w_-4144d4e'}, {'state': 'unavailable', 'id': '223192', 'name': 'g_3089d821ae', }]}]}
, {'bb3c_bmls':[{'name': 'ASD 01', 'id': '303F', 'state': 'available', 'nodes': 6,'volumes': [{'state': 'unavailable', 'id': '930172', 'name': 'e_-4144d4e'}, {'state': 'unavailable', 'id': '245192', 'name': 'h_3089d821ae', }]}]}
] }
input_df = pd.DataFrame(source)
My input_df is as below:
I'm expecting the output_df as below:
I could manage to get the columns volume_id volume_name volume_state
name id state nodes using the below method.
input_df['data'] = input_df['data'].apply(pd.Series)
which will result as below
Test_df=pd.concat([json_normalize(input_df['bb3a_bmls'][key], 'volumes', ['name','id','state','nodes'], record_prefix='volume_') for key in input_df.index if isinstance(input_df['bb3a_bmls'][key],list)]).reset_index(drop=True)
Which will result for one "SERVER" - bb3a_bmls
Now, I don't have an idea how to get the parent columns "_id", "pi", "Datetime" back.

Idea is loop by each nested lists or by dicts and create list of dictionary for pass to DataFrame constructor:
out = []
zipped = zip(source['_id'], source['pi'], source['Datetime'], source['data'])
for a,b,c,d in zipped:
for k1, v1 in d.items():
for e in v1:
#get all values of dict with exlude volumes
di = {k2:v2 for k2, v2 in e.items() if k2 != 'volumes'}
#for each dict in volumes add volume_ to keys
for f in e['volumes']:
di1 = {f'volume_{k3}':v3 for k3, v3 in f.items()}
#create dict from previous values
di2 = {'_id':a, 'pi':b,'Datetime':c, 'SERVER':k1}
#add to list merged dictionaries
out.append({**di2, ** di1, **di})
df = pd.DataFrame(out)
print (df)
_id pi Datetime SERVER volume_state \
0 SE-DATA-BB3A BB3A_CAP_BMLS 190725-122500 bb3a_bmls available
1 SE-DATA-BB3A BB3A_CAP_BMLS 190725-122500 bb3a_bmls available
2 SE-DATA-BB3E BB3E_CAP_BMLS 190725-122500 bb3b_bmls unavailable
3 SE-DATA-BB3E BB3E_CAP_BMLS 190725-122500 bb3b_bmls unavailable
4 SE-DATA-BB3F BB3F_CAP_PAS 190725-122500 bb3c_bmls unavailable
5 SE-DATA-BB3F BB3F_CAP_PAS 190725-122500 bb3c_bmls unavailable
volume_id volume_name name id state nodes
0 330172 q_-4144d4e WAG 01 105F available 3
1 275192 p_3089d821ae WAG 01 105F available 3
2 830172 w_-4144d4e FEC 01 382E available 4
3 223192 g_3089d821ae FEC 01 382E available 4
4 930172 e_-4144d4e ASD 01 303F available 6
5 245192 h_3089d821ae ASD 01 303F available 6

Python: Json unpacking into a dataframe

I'm trying to parse json I've recieved from an api into a pandas DataFrame. That json is ierarchical, in this example I have city code, line name and list of stations for this line. Unfortunately I can't "unpack" it. Would be gratefull for help and explanation.
Json:
{'id': '1',
'lines': [{'hex_color': 'FFCD1C',
'id': '8',
'name': 'Калининская', <------Line name
'stations': [{'id': '8.189',
'lat': 55.745113,
'lng': 37.864052,
'name': 'Новокосино', <------Station 1
'order': 0},
{'id': '8.88',
'lat': 55.752237,
'lng': 37.814587,
'name': 'Новогиреево', <------Station 2
'order': 1},
etc.
I'm trying to recieve evrything from lowest level and the add all higher level information (starting from linename):
c = r.content
j = simplejson.loads(c)
tmp=[]
i=0
data1=pd.DataFrame(tmp)
data2=pd.DataFrame(tmp)
pd.concat
station['name']
for station in j['lines']:
data2 = data2.append(pd.DataFrame(station['stations'], station['name']),ignore_index=True)
data2
Once more - the questions are:
How to make it work?
Is this solution an optimal one, or there are some functions I should know about?
Update:
The Json parses normally:
json_normalize(j)
id lines name
1 [{'hex_color': 'FFCD1C', 'stations': [{'lat': ... Москва
Current DataFrame I can get:
data2 = data2.append(pd.DataFrame(station['stations']),ignore_index=True)
id lat lng name order
0 8.189 55.745113 37.864052 Новокосино 0
1 8.88 55.752237 37.814587 Новогиреево 1
Desired dataframe can be represented as:
id lat lng name order Line_Name Id_Top Name_Top
0 8.189 55.745113 37.864052 Новокосино 0 Калининская 1 Москва
1 8.88 55.752237 37.814587 Новогиреево 1 Калининская 1 Москва

In addition to MaxU's answer, I think you still need the highest level id, this should work:
json_normalize(data, ['lines','stations'], ['id',['lines','name']],record_prefix='station_')

Assuming you have the following dictionary:
In [70]: data
Out[70]:
{'id': '1',
'lines': [{'hex_color': 'FFCD1C',
'id': '8',
'name': 'Калининская',
'stations': [{'id': '8.189',
'lat': 55.745113,
'lng': 37.864052,
'name': 'Новокосино',
'order': 0},
{'id': '8.88',
'lat': 55.752237,
'lng': 37.814587,
'name': 'Новогиреево',
'order': 1}]}]}
Solution: use pandas.io.json.json_normalize:
In [71]: pd.io.json.json_normalize(data['lines'],
['stations'],
['name', 'id'],
meta_prefix='parent_')
Out[71]:
id lat lng name order parent_name parent_id
0 8.189 55.745113 37.864052 Новокосино 0 Калининская 8
1 8.88 55.752237 37.814587 Новогиреево 1 Калининская 8
UPDATE: reflects updated question
res = (pd.io.json.json_normalize(data,
['lines', 'stations'],
['id', ['lines', 'name']],
meta_prefix='Line_')
.assign(Name_Top='Москва'))
Result:
In [94]: res
Out[94]:
id lat lng name order Line_id Line_lines.name Name_Top
0 8.189 55.745113 37.864052 Новокосино 0 1 Калининская Москва
1 8.88 55.752237 37.814587 Новогиреево 1 1 Калининская Москва

Converting Data frame into a dict with columns as key inside key

I have a pandas data frame.
mac_address no. of co_visit no. of random_visit
0 00:02:1a:11:b0:b9 1 2
1 00:02:71:d6:04:84 1 1
2 00:05:33:34:2f:f2 1 3
3 00:08:22:04:c4:fb 1 4
4 00:08:22:06:7b:41 1 1
5 00:08:22:07:48:15 1 1
6 00:08:22:08:a8:54 1 3
7 00:08:22:0e:0a:fc 1 1
I want to convert it into a dictionary with mac_address as key and 'no. of co_visit' and 'no. of random_visit' as subkeys inside key and value across that column as value inside subkey. So, my output for first 2 row will be like.
00:02:1a:11:b0:b9:{no. of co_visit:1, no. of random_visit: 2}
00:02:71:d6:04:84:{no. of co_visit:1, no. of random_visit: 1}
I am using python2.7. Thank you.
I was able to set mac_address as key but the values were being added as list inside key, not key inside key.

You can use pandas.DataFrame.T and to_dict().
df.set_index('mac_address').T.to_dict()
Output:
{'00:02:1a:11:b0:b9': {'no. of co_visit': '1', 'no. of random_visit': '2'},
'00:02:71:d6:04:84': {'no. of co_visit': '1', 'no. of random_visit': '1'},
'00:05:33:34:2f:f2': {'no. of co_visit': '1', 'no. of random_visit': '3'},
'00:08:22:04:c4:fb': {'no. of co_visit': '1', 'no. of random_visit': '4'},
'00:08:22:06:7b:41': {'no. of co_visit': '1', 'no. of random_visit': '1'},
'00:08:22:07:48:15': {'no. of co_visit': '1', 'no. of random_visit': '1'},
'00:08:22:08:a8:54': {'no. of co_visit': '1', 'no. of random_visit': '3'},
'00:08:22:0e:0a:fc': {'no. of co_visit': '1', 'no. of random_visit': '1'}}

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Converting to dataframe, beginner question - python

Related

Match values of different columns and dataframes

Several dictionary with duplicate key but different value and no limit in column

Python Pandas - Handling columns with nested dictionary(json) values

Python: Json unpacking into a dataframe

Converting Data frame into a dict with columns as key inside key

Categories

Resources