I have a dataframe as follows:
lat  long  city  nameDisease  numberCases
0    2     rio   Dengue       1
0    2     rio   Chicungunha  2
1    3     sp    Dengue       3
1    3     sp    COVID        4
I want to aggregate the rows with same (lat,long,city) and generate a json as follows:
[{lat:0,long:2,city:"rio",diseases:[{nameDisease:"Dengue",numberCases:1},{nameDisease:"Chicungunha",numberCases:2}]},{lat:1,long:3,city:"sp",diseases:[{nameDisease:"Dengue",numberCases:3},{nameDisease:"COVID",numberCases:4}]}]
How can I do this kind of transformation with pandas?
A few to_dict + groupby calls:
cols = ['lat', 'long', 'city']
json = df.groupby(cols).apply(lambda g: g.drop(cols, axis=1).to_dict('records')).reset_index().rename({0:'diseases'}, axis=1).to_dict('records')
Output:
>>> json
[{'lat': 0,
'long': 2,
'city': 'rio',
'diseases': [{'nameDisease': 'Dengue', 'numberCases': 1},
{'nameDisease': 'Chicungunha', 'numberCases': 2}]},
{'lat': 1,
'long': 3,
'city': 'sp',
'diseases': [{'nameDisease': 'Dengue', 'numberCases': 3},
{'nameDisease': 'COVID', 'numberCases': 4}]}]
>>> json == expected_output
True
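The one-liner above relies on `apply` handing the grouping columns to the lambda, and some newer pandas releases change how `apply` treats those columns. An explicit loop over the groups is one alternative. This is only a sketch: the sample frame is rebuilt from the question, and `records` is a name introduced here for illustration.

```python
import pandas as pd

# Sample frame rebuilt from the question.
df = pd.DataFrame({
    'lat': [0, 0, 1, 1],
    'long': [2, 2, 3, 3],
    'city': ['rio', 'rio', 'sp', 'sp'],
    'nameDisease': ['Dengue', 'Chicungunha', 'Dengue', 'COVID'],
    'numberCases': [1, 2, 3, 4],
})
cols = ['lat', 'long', 'city']

records = []  # hypothetical name for the nested result
for key, group in df.groupby(cols, sort=False):
    # key is a tuple with one value per grouping column
    record = dict(zip(cols, key))
    # the remaining columns become the nested list of dicts
    record['diseases'] = group.drop(columns=cols).to_dict('records')
    records.append(record)

print(records)
```

Iterating over a groupby always yields the full sub-frame, so dropping `cols` inside the loop does not depend on the `apply` behaviour.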
I am working on a dataframe that displays information on property rentals in Brazil. This is a sample of the dataset:
data = {
'city': ['São Paulo', 'Rio', 'Recife'],
'area(m2)': [90, 120, 60],
'Rooms': [3, 2, 4],
'Bathrooms': [2, 3, 3],
'animal': ['accept', 'do not accept', 'accept'],
'rent($)': [2000, 3000, 800]}
df = pd.DataFrame(
data,
columns=['city', 'area(m2)', 'Rooms', 'Bathrooms', 'animal', 'rent($)'])
print(df)
This is how the sample looks:
city area(m2) Rooms Bathrooms animal rent($)
0 São Paulo 90 3 2 accept 2000
1 Rio 120 2 3 do not accept 3000
2 Recife 60 4 3 accept 800
I want to filter the dataset to select only the apartments that have at most 2 rooms and 2 bathrooms.
Do you know how I can do this?
Try with
out = df.loc[(df.Rooms<=2) & (df.Bathrooms<=2)]
You can use query() method:
out = df.query('Bathrooms<=2 and Rooms<=2')
You can filter the values on the dataframe
import pandas as pd
data = {
'city': ['São Paulo', 'Rio', 'Recife'],
'area(m2)': [90, 120, 60],
'Rooms': [3, 2, 4],
'Bathrooms': [2, 3, 3],
'animal': ['accept', 'do not accept', 'accept'],
'rent($)': [2000, 3000, 800]}
df = pd.DataFrame(
data,
columns=['city', 'area(m2)', 'Rooms', 'Bathrooms', 'animal', 'rent($)'])
df_filtered = df[(df['Rooms'] <= 2) & (df['Bathrooms'] <= 2)]
print(df_filtered)
Returns (no row in this sample has both at most 2 rooms and at most 2 bathrooms, so the result is empty):
Empty DataFrame
Columns: [city, area(m2), Rooms, Bathrooms, animal, rent($)]
Index: []
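To see the filter actually select something, here is a small sketch that adds one extra row to the sample. The 'Salvador' row and its values are made up for illustration and are not part of the original data. It also checks that the boolean-mask form and the `query()` form agree.

```python
import pandas as pd

df = pd.DataFrame({
    # the last entry in each list is a hypothetical row added for this example
    'city': ['São Paulo', 'Rio', 'Recife', 'Salvador'],
    'area(m2)': [90, 120, 60, 45],
    'Rooms': [3, 2, 4, 2],
    'Bathrooms': [2, 3, 3, 1],
    'animal': ['accept', 'do not accept', 'accept', 'accept'],
    'rent($)': [2000, 3000, 800, 900],
})

# boolean mask: both conditions must hold, so they are combined with &
mask_result = df[(df['Rooms'] <= 2) & (df['Bathrooms'] <= 2)]

# the same filter written as a query string
query_result = df.query('Rooms <= 2 and Bathrooms <= 2')

# the three original rows alone produce an empty result
original_result = df.iloc[:3].query('Rooms <= 2 and Bathrooms <= 2')

print(mask_result)
print(len(original_result))
```

Each comparison needs its own parentheses in the mask form because `&` binds more tightly than `<=` in Python.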
I have a list of dictionaries (sorry, it's a bit complex, but I'm trying to show the real data):
[{'alerts': [{'city': ' city name1',
'country': 'ZZ',
'location': {'x': 1, 'y': 3},
'milis': 1582337463000},
{'city': ' city name2',
'country': 'ZZ',
'location': {'x': 1, 'y': 3},
'pubMillis': 1582337573000,
'type': 'TYPE2'}],
'end': '11:02:00:000',
'start': '11:01:00:000'},
{'alerts': [{'city': ' city name3',
'country': 'ZZ',
'location': {'x': 1, 'y': 3},
'milis': 1582337463000}],
'end': '11:02:00:000',
'start': '11:01:00:000'}]
In general the list structure is like this :
[
{ [
{ {},
},
{ {},
}
],
},
{ [
{ {},
},
{ {},
}
],
}
]
To access city name1, I can use: alerts[0]['alerts'][0]['city'].
To access city name2, I can use: alerts[0]['alerts'][1]['city'].
How can I access this in a loop?
Use nested loops:
Where alerts equals the list of dicts
for x in alerts:
    for alert in x['alerts']:
        print(alert['city'])
Use pandas
data equals your sample list of dicts
import pandas as pd
# create the dataframe and explode the list of dicts
df = pd.DataFrame(data).explode('alerts').reset_index(drop=True)
# json_normalize the dicts and join back to df
df = df.join(pd.json_normalize(df.alerts))
# drop the alerts column as it's no longer needed
df.drop(columns=['alerts'], inplace=True)
# output
start end country city milis location.x location.y type pubMillis
0 11:01:00:000 11:02:00:000 ZZ city name1 1.582337e+12 1 3 NaN NaN
1 11:01:00:000 11:02:00:000 ZZ city name2 NaN 1 3 TYPE2 1.582338e+12
2 11:01:00:000 11:02:00:000 ZZ city name3 1.582337e+12 1 3 NaN NaN
What is the goal? To get all city names?
>>> for top_level_alert in alerts:
...     for nested_alert in top_level_alert['alerts']:
...         print(nested_alert['city'])
...
city name1
city name2
city name3
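If the goal is to collect the city names rather than print them, the same two loops can be folded into a list comprehension. A minimal sketch, using a trimmed copy of the sample data (only the keys needed here are kept) and a made-up variable name `cities`:

```python
# Trimmed copy of the sample list of dicts from the question.
alerts = [
    {'alerts': [{'city': ' city name1', 'country': 'ZZ'},
                {'city': ' city name2', 'country': 'ZZ'}],
     'start': '11:01:00:000', 'end': '11:02:00:000'},
    {'alerts': [{'city': ' city name3', 'country': 'ZZ'}],
     'start': '11:01:00:000', 'end': '11:02:00:000'},
]

# outer loop first, inner loop second, same order as the nested for loops
cities = [alert['city'] for group in alerts for alert in group['alerts']]

print(cities)
```

The values keep the leading space that the source data has; call `.strip()` on each one if that is not wanted.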
I have several tables that look like this:
ID YY ZZ
2 97 826
2 78 489
4 47 751
4 110 322
6 67 554
6 88 714
code:
raw = {'ID': [2, 2, 4, 4, 6, 6,],
'YY': [97,78,47,110,67,88],
'ZZ':[826,489,751,322,554,714]}
df = pd.DataFrame(raw)
For each of these dfs, I have to perform a number of operations.
First, group by id,
extract the length of the column zz and average of the column zz,
put results in new df
New df that looks like this
Cities length mean
Paris 0 0
Madrid 0 0
Berlin 0 0
Warsaw 0 0
London 0 0
code:
raw2 = {'Cities': ['Paris', 'Madrid', 'Berlin', 'Warsaw', 'London'],
'length': 0,
'mean': 0}
df2 = pd.DataFrame(raw2)
I pulled out the average and the size of individual groups
df_grouped = df.groupby('ID').ZZ.size()
df_grouped2 = df.groupby('ID').ZZ.mean()
The problem occurs when transferring the results to the new table: the grouped results do not contain every city, and the values must be matched to the right city by key.
I tried to use a dictionary:
dic_cities = {"Paris":df_grouped.loc[2],
"Madrid":df_grouped.loc[4],
"Warsaw":df_grouped.loc[6],
"Berlin":df_grouped.loc[8],
"London":df_grouped.loc[10]}
Unfortunately, I'm receiving KeyError: 8
I have 19 df's from which I have to extract this data and the final tables have to look like this:
Cities length mean
Paris 2 657.5
Madrid 2 536.5
Berlin 0 0.0
Warsaw 2 634.0
London 0 0.0
Does anyone know how to deal with it using groupby and the dictionary or knows a better way to do it?
First, you should index df2 on 'Cities':
raw2 = {'Cities': ['Paris', 'Madrid', 'Berlin', 'Warsaw', 'London'],
'length': 0,
'mean': 0}
df2 = pd.DataFrame(raw2).set_index('Cities')
Then you should reverse your dictionary:
dic_cities = {2: "Paris",
4: "Madrid",
6: "Warsaw",
8: "Berlin",
10: "London"}
Once this is done, the processing is as simple as a groupby:
import numpy as np
for i, sub in df.groupby('ID'):
    # look up the city for this ID and store [count, mean] of ZZ in its row
    df2.loc[dic_cities[i]] = sub.ZZ.agg([len, np.mean]).tolist()
Which gives for df2:
length mean
Cities
Paris 2.0 657.5
Madrid 2.0 536.5
Berlin 0.0 0.0
Warsaw 2.0 634.0
London 0.0 0.0
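The loop can also be replaced by a single aggregation followed by a reindex, which fills in the cities that have no rows. This is a sketch under the assumption that the ID-to-city mapping is the one from the question; `id_to_city`, `stats` and `result` are names introduced here.

```python
import pandas as pd

df = pd.DataFrame({'ID': [2, 2, 4, 4, 6, 6],
                   'YY': [97, 78, 47, 110, 67, 88],
                   'ZZ': [826, 489, 751, 322, 554, 714]})

# mapping taken from the question; cities lists the desired row order
id_to_city = {2: 'Paris', 4: 'Madrid', 6: 'Warsaw', 8: 'Berlin', 10: 'London'}
cities = ['Paris', 'Madrid', 'Berlin', 'Warsaw', 'London']

# one pass: group size and mean of ZZ per ID
stats = df.groupby('ID')['ZZ'].agg(length='size', mean='mean')

# relabel the IDs as city names, then add the missing cities with zeros
stats.index = stats.index.map(id_to_city)
result = stats.reindex(cities, fill_value=0).rename_axis('Cities')

print(result)
```

Because nothing is looked up by a missing key, IDs 8 and 10 no longer raise a KeyError; they simply appear as zero rows after the reindex. With 19 frames, the last three statements could be wrapped in a function and applied to each one.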
See this:
import pandas as pd
# setup raw data
raw = {'ID': [2, 2, 4, 4, 6, 6,], 'YY': [97,78,47,110,67,88], 'ZZ':[826,489,751,322,554,714]}
df = pd.DataFrame(raw)
# get mean values
mean_values = df.groupby('ID').mean()
# drop column
mean_values = mean_values.drop(['YY'], axis=1)
# get occurrence number
occurrence = df.groupby('ID').size()
# save data
result = pd.concat([occurrence, mean_values], axis=1, sort=False)
# rename columns
result.rename(columns={0:'length', 'ZZ':'mean'}, inplace=True)
# city data
raw2 = {'Cities': ['Paris', 'Madrid', 'Berlin', 'Warsaw', 'London'], 'length': 0, 'mean': 0}
df2 = pd.DataFrame(raw2)
# rename indexes
df2 = df2.rename(index={0: 2, 1:4, 2:8, 3:6, 4:10})
# merge data
df2['length'] = result['length']
df2['mean'] = result['mean']
Output:
Cities length mean
2 Paris 2.0 657.5
4 Madrid 2.0 536.5
8 Berlin NaN NaN
6 Warsaw 2.0 634.0
10 London NaN NaN
I'm trying to parse JSON I've received from an API into a pandas DataFrame. The JSON is hierarchical: in this example I have a city code, a line name, and a list of stations for that line. Unfortunately, I can't "unpack" it. I would be grateful for help and an explanation.
Json:
{'id': '1',
'lines': [{'hex_color': 'FFCD1C',
'id': '8',
'name': 'Калининская', <------Line name
'stations': [{'id': '8.189',
'lat': 55.745113,
'lng': 37.864052,
'name': 'Новокосино', <------Station 1
'order': 0},
{'id': '8.88',
'lat': 55.752237,
'lng': 37.814587,
'name': 'Новогиреево', <------Station 2
'order': 1},
etc.
I'm trying to retrieve everything from the lowest level and then add all the higher-level information (starting from the line name):
c = r.content
j = simplejson.loads(c)
tmp=[]
i=0
data1=pd.DataFrame(tmp)
data2=pd.DataFrame(tmp)
pd.concat
station['name']
for station in j['lines']:
    data2 = data2.append(pd.DataFrame(station['stations'], station['name']),ignore_index=True)
data2
Once more - the questions are:
How to make it work?
Is this solution an optimal one, or are there some functions I should know about?
Update:
The Json parses normally:
json_normalize(j)
id lines name
1 [{'hex_color': 'FFCD1C', 'stations': [{'lat': ... Москва
Current DataFrame I can get:
data2 = data2.append(pd.DataFrame(station['stations']),ignore_index=True)
id lat lng name order
0 8.189 55.745113 37.864052 Новокосино 0
1 8.88 55.752237 37.814587 Новогиреево 1
Desired dataframe can be represented as:
id lat lng name order Line_Name Id_Top Name_Top
0 8.189 55.745113 37.864052 Новокосино 0 Калининская 1 Москва
1 8.88 55.752237 37.814587 Новогиреево 1 Калининская 1 Москва
In addition to MaxU's answer, I think you still need the highest level id, this should work:
json_normalize(data, ['lines','stations'], ['id',['lines','name']],record_prefix='station_')
Assuming you have the following dictionary:
In [70]: data
Out[70]:
{'id': '1',
'lines': [{'hex_color': 'FFCD1C',
'id': '8',
'name': 'Калининская',
'stations': [{'id': '8.189',
'lat': 55.745113,
'lng': 37.864052,
'name': 'Новокосино',
'order': 0},
{'id': '8.88',
'lat': 55.752237,
'lng': 37.814587,
'name': 'Новогиреево',
'order': 1}]}]}
Solution: use pandas.io.json.json_normalize (exposed as pd.json_normalize since pandas 1.0):
In [71]: pd.io.json.json_normalize(data['lines'],
['stations'],
['name', 'id'],
meta_prefix='parent_')
Out[71]:
id lat lng name order parent_name parent_id
0 8.189 55.745113 37.864052 Новокосино 0 Калининская 8
1 8.88 55.752237 37.814587 Новогиреево 1 Калининская 8
UPDATE: reflects updated question
res = (pd.io.json.json_normalize(data,
['lines', 'stations'],
['id', ['lines', 'name']],
meta_prefix='Line_')
.assign(Name_Top='Москва'))
Result:
In [94]: res
Out[94]:
id lat lng name order Line_id Line_lines.name Name_Top
0 8.189 55.745113 37.864052 Новокосино 0 1 Калининская Москва
1 8.88 55.752237 37.814587 Новогиреево 1 1 Калининская Москва
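For reference, here is a self-contained sketch that goes all the way to the column names asked for in the question. It assumes the top-level dictionary also carries a 'name' key holding 'Москва', as the `json_normalize(j)` output in the question's update suggests; the 'meta_' prefix is an arbitrary choice made here to avoid clashes with the station columns.

```python
import pandas as pd

data = {
    'id': '1',
    'name': 'Москва',  # assumed top-level key, inferred from the question's update
    'lines': [{
        'hex_color': 'FFCD1C',
        'id': '8',
        'name': 'Калининская',
        'stations': [
            {'id': '8.189', 'lat': 55.745113, 'lng': 37.864052,
             'name': 'Новокосино', 'order': 0},
            {'id': '8.88', 'lat': 55.752237, 'lng': 37.814587,
             'name': 'Новогиреево', 'order': 1},
        ],
    }],
}

res = pd.json_normalize(
    data,
    record_path=['lines', 'stations'],       # one row per station
    meta=['id', 'name', ['lines', 'name']],  # values copied down from the parents
    meta_prefix='meta_',                     # keeps parent id/name apart from station id/name
).rename(columns={
    'meta_lines.name': 'Line_Name',
    'meta_id': 'Id_Top',
    'meta_name': 'Name_Top',
})

print(res)
```

The prefix is needed because the station records and the parent levels both have 'id' and 'name' keys; without it `json_normalize` reports conflicting metadata names.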
I have a list of dictionaries, where each dictionary represents a record. It is formatted as follows:
>>> ListOfData=[
... {'Name':'Andrew',
... 'number':4,
... 'contactinfo':{'Phone':'555-5555', 'Address':'123 Main St'}},
... {'Name':'Ben',
... 'number':6,
... 'contactinfo':{'Phone':'555-5554', 'Address':'124 2nd St'}},
... {'Name':'Cathy',
... 'number':1,
... 'contactinfo':{'Phone':'555-5556', 'Address':'126 3rd St'}}]
>>>
>>> import pprint
>>> pprint.pprint(ListOfData)
[{'Name': 'Andrew',
'contactinfo': {'Address': '123 Main St', 'Phone': '555-5555'},
'number': 4},
{'Name': 'Ben',
'contactinfo': {'Address': '124 2nd St', 'Phone': '555-5554'},
'number': 6},
{'Name': 'Cathy',
'contactinfo': {'Address': '126 3rd St', 'Phone': '555-5556'},
'number': 1}]
>>>
What is the best way to read this into a Pandas dataframe with multiindex columns for those attributes in the sub dictionaries?
For example, I'd ideally have 'Phone' and 'Address' columns nested under the 'contactinfo' columns.
I can read in the data as follows, but would like the contact info column to be broken into sub columns.
>>> pd.DataFrame.from_dict(ListOfData)
Name contactinfo number
0 Andrew {u'Phone': u'555-5555', u'Address': u'123 Main... 4
1 Ben {u'Phone': u'555-5554', u'Address': u'124 2nd ... 6
2 Cathy {u'Phone': u'555-5556', u'Address': u'126 3rd ... 1
>>>
How about this:
Declare an empty data frame:
df = pd.DataFrame(columns=('Name', 'contactinfo', 'number'))
Then iterate over the list and add rows:
for row in ListOfData:
    df.loc[len(df)] = row
Complete code:
import pandas as pd
ListOfData=[
{'Name':'Andrew',
'number':4,
'contactinfo':{'Phone':'555-5555', 'Address':'123 Main St'}},
{'Name':'Ben',
'number':6,
'contactinfo':{'Phone':'555-5554', 'Address':'124 2nd St'}}]
df = pd.DataFrame(columns=('Name', 'contactinfo', 'number'))
for row in ListOfData:
    df.loc[len(df)] = row
print(df)
This prints:
Name contactinfo number
0 Andrew {'Phone': '555-5555', 'Address': '123 Main St'} 4
1 Ben {'Phone': '555-5554', 'Address': '124 2nd St'} 6
Here is a pretty clunky workaround that I was able to get what I need. I loop through the columns, find those that are made of dicts and then divide it into multiple columns and merge it to the dataframe. I'd appreciate hearing any ways to improve this code. I'd imagine that ideally the dataframe would be constructed from the get-go without having dictionaries as values.
>>> df=pd.DataFrame.from_dict(ListOfData)
>>>
>>> for name,col in df.iteritems():
...     if any(isinstance(x, dict) for x in col.tolist()):
...         DividedDict=col.apply(pd.Series)
...         DividedDict.columns=pd.MultiIndex.from_tuples([(name,x) for x in DividedDict.columns.tolist()])
...         df=df.join(DividedDict)
...         df.drop(name,1, inplace=True)
...
>>> print df
Name number (contactinfo, Address) (contactinfo, Phone)
0 Andrew 4 123 Main St 555-5555
1 Ben 6 124 2nd St 555-5554
2 Cathy 1 126 3rd St 555-5556
>>>
Don't know about best or not, but you could do it in two steps:
>>> df = pd.DataFrame(ListOfData)
>>> df = df.join(pd.DataFrame.from_records(df.pop("contactinfo")))
>>> df
Name number Address Phone
0 Andrew 4 123 Main St 555-5555
1 Ben 6 124 2nd St 555-5554
2 Cathy 1 126 3rd St 555-5556
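Neither answer above produces real MultiIndex columns. One possible way to get them, sketched here rather than offered as the definitive approach, is to flatten the records with `pd.json_normalize` (pandas 1.0 or later) and then split the dotted column names into tuples. The variable name `flat` and the choice of an empty string as the second level for top-level keys are assumptions made for this example.

```python
import pandas as pd

ListOfData = [
    {'Name': 'Andrew', 'number': 4,
     'contactinfo': {'Phone': '555-5555', 'Address': '123 Main St'}},
    {'Name': 'Ben', 'number': 6,
     'contactinfo': {'Phone': '555-5554', 'Address': '124 2nd St'}},
    {'Name': 'Cathy', 'number': 1,
     'contactinfo': {'Phone': '555-5556', 'Address': '126 3rd St'}},
]

# nested dicts become dotted column names such as 'contactinfo.Phone'
flat = pd.json_normalize(ListOfData)

# turn 'contactinfo.Phone' into ('contactinfo', 'Phone');
# top-level keys get an empty second level, e.g. ('Name', '')
flat.columns = pd.MultiIndex.from_tuples(
    [tuple(c.split('.', 1)) if '.' in c else (c, '') for c in flat.columns]
)

print(flat['contactinfo'])
```

Selecting `flat['contactinfo']` then returns a sub-frame with the 'Phone' and 'Address' columns, which is the nesting the question asks for.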