I have a list of dictionaries (sorry, it's a bit complex, but I'm trying to show the real data):
[{'alerts': [{'city': ' city name1',
              'country': 'ZZ',
              'location': {'x': 1, 'y': 3},
              'milis': 1582337463000},
             {'city': ' city name2',
              'country': 'ZZ',
              'location': {'x': 1, 'y': 3},
              'pubMillis': 1582337573000,
              'type': 'TYPE2'}],
  'end': '11:02:00:000',
  'start': '11:01:00:000'},
 {'alerts': [{'city': ' city name3',
              'country': 'ZZ',
              'location': {'x': 1, 'y': 3},
              'milis': 1582337463000}],
  'end': '11:02:00:000',
  'start': '11:01:00:000'}]
In general the list structure is like this:
[
  { [
      { {}, },
      { {}, }
    ],
  },
  { [
      { {}, },
      { {}, }
    ],
  }
]
If I want to access city name1, I can use this line of code: alerts[0]['alerts'][0]['city'].
If I want to access city name2, I can use: alerts[0]['alerts'][1]['city'].
How can I access these in a loop?
Use nested loops, where alerts is the list of dicts:
for x in alerts:
    for alert in x['alerts']:
        print(alert['city'])
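If you only need to collect the names, a nested list comprehension flattens the structure in one expression; a minimal sketch over the same alerts list:
cities = [alert['city'] for x in alerts for alert in x['alerts']]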
Use pandas, where data is your sample list of dicts:
import pandas as pd
# create the dataframe and explode the list of dicts
df = pd.DataFrame(data).explode('alerts').reset_index(drop=True)
# json_normalize the dicts and join back to df
df = df.join(pd.json_normalize(df.alerts))
# drop the alerts column as it's no longer needed
df.drop(columns=['alerts'], inplace=True)
# output
start end country city milis location.x location.y type pubMillis
0 11:01:00:000 11:02:00:000 ZZ city name1 1.582337e+12 1 3 NaN NaN
1 11:01:00:000 11:02:00:000 ZZ city name2 NaN 1 3 TYPE2 1.582338e+12
2 11:01:00:000 11:02:00:000 ZZ city name3 1.582337e+12 1 3 NaN NaN
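Note that pd.json_normalize can also explode and flatten in one step; a sketch over the same data list, which should give one row per alert with start and end carried along as meta columns:
pd.json_normalize(data, record_path='alerts', meta=['start', 'end'])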
What is the goal? To get all city names?
>>> for top_level_alert in alerts:
...     for nested_alert in top_level_alert['alerts']:
...         print(nested_alert['city'])
city name1
city name2
city name3
So I have data in a .csv file and have turned all the columns in the .csv into data dictionaries containing the relevant information, e.g. data_dict['Date'] would give me all the date records. There are about 170k records.
What I am trying to do is identify all countries with a score above, let's say, 100, and print them. Countries is one column and Score is another, but there are about 50 columns in total.
My thought process was to find the numbers above 100 and then print the corresponding countries.
My data dictionaries look something like this (these are just examples):
['Countries'] = AAA, AAB, AAC, AAD......
['Score'] = 20, 30, 40, 50.....
Note: country AAA's score is 20; they are within the same record.
So the output I want should look like:
the countries with scores higher than 100 are x, y, z.......
I don't even know where to start, so I can't really provide code.
Bonus points if you can divide every 'Score' record by 10 before printing the countries.
I know this is a huge long shot, but any assistance would be appreciated :)
list_of_dicts is a list of dicts loaded from the CSV;
countries_with_score_higher_than_100 is your answer.
list_of_dicts = [
{'Country': 'Germany', 'Score': 50, 'Some_other_data': 3},
{'Country': 'Poland', 'Score': 90, 'Some_other_data': 7},
{'Country': 'Hungary', 'Score': 90, 'Some_other_data': 3},
{'Country': 'America', 'Score': 110, 'Some_other_data': 3},
{'Country': 'Spain', 'Score': 120, 'Some_other_data': 4},
]
countries_with_score_higher_than_100 = []
for dic in list_of_dicts:
    if dic['Score'] > 100:
        # bonus: divide the score by 10
        dic['Score'] = dic['Score'] / 10
        # collect the country name, not the whole dict
        countries_with_score_higher_than_100.append(dic['Country'])
print(countries_with_score_higher_than_100)
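The same filter fits in a list comprehension; a sketch over the same list_of_dicts that also formats the requested sentence:
high_scorers = [dic['Country'] for dic in list_of_dicts if dic['Score'] > 100]
print(f"the countries with scores higher than 100 are {', '.join(high_scorers)}")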
If I understand correctly:
data_dict = {
    "Countries": ['AAA', 'AAB', 'AAC', 'AAD'],
    "Score": [20, 30, 400, 500]
}
# first create an index:
i = 0
# now loop over all the countries:
while i < len(data_dict['Countries']):
    # get the country and its score:
    country = data_dict['Countries'][i]
    score = data_dict['Score'][i]
    # do the check
    if score > 100:
        # do the print, dividing the score by 10
        print(f"{country} has a score of: {score / 10}")
    # increment the index to process the next record
    i = i + 1
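For what it's worth, zip removes the manual index bookkeeping; an equivalent sketch over the same data_dict:
for country, score in zip(data_dict['Countries'], data_dict['Score']):
    if score > 100:
        print(f"{country} has a score of: {score / 10}")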
Fair warning. Your data should really not be structured like this. It makes it messy when you have to deal with a country that doesn't have a score or any scenario where the data isn't 100% perfect and ordered in the same way.
A cleaner data format would be:
data_dict = {
    "Countries": {
        'AAA': {"Score": 20, "Date": '2022-04-20'},
        'AAB': {"Score": 30, "Date": '2022-04-20'},
        'AAC': {"Score": 400, "Date": '2022-04-20'},
        'AAD': {"Score": 500, "Date": '2022-04-20'}
    }
}
That way you can keep track of each country's data points by country name:
for country_name, data_points in data_dict["Countries"].items():
    print(f"Country name: {country_name}")
    print(f"Score: {data_points['Score']}")
    print(f"Date: {data_points['Date']}")
    print("----")
output:
Country name: AAA
Score: 20
Date: 2022-04-20
----
Country name: AAB
Score: 30
Date: 2022-04-20
----
Country name: AAC
Score: 400
Date: 2022-04-20
----
Country name: AAD
Score: 500
Date: 2022-04-20
----
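The original score filter stays just as simple on this structure; a sketch against the data_dict above:
for country_name, data_points in data_dict["Countries"].items():
    if data_points["Score"] > 100:
        print(f"{country_name} has a score of: {data_points['Score'] / 10}")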
I have multiple locations under the Location column, and I want to add group numbers within each location. But the number of groups differs between locations.
e.g. df1
Location
Chicago
Minneapolis
Dallas
.
.
.
and df2
Location times
Chicago 2
Minneapolis 5
Dallas 1
. .
. .
. .
What I want to get is:
Location Group
Chicago 1
Chicago 2
Minneapolis 1
Minneapolis 2
Minneapolis 3
Minneapolis 4
Minneapolis 5
Dallas 1
.
.
.
What I have now repeats the same number of groups for every location: 17 groups within each location. But I just realized the group counts differ between locations, and I don't know what to do next.
filled_results['location'] = results['location'].unique()
filled_results['times'] = 17
filled_results = filled_results.loc[filled_results.index.repeat(filled_results.times)]
v = pd.Series(range(1, 18))
filled_results['group'] = np.tile(v, len(filled_results) // len(v) + 1)[:len(filled_results)]
filled_results = filled_results.drop(columns=['times'])
I was thinking about a for loop, but don't know how to write it: for each unique location in df1, assign group numbers 1 to x based on the number of groups in df2.
I think I found a solution myself. It's very easy if you treat this as adding an index within each group. Here's the solution:
df = pd.DataFrame()
df['location'] = df1['location'].unique()
df = pd.merge(df, df2, on='location', how='left')
df = df.loc[df.index.repeat(df.times)]
df["Group"] = df.groupby("location")["times"].rank(method="first", ascending=True)
df["Group"] = df["Group"].astype(int)
df = df.drop(columns=['times'])
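As a side note, groupby().cumcount() gives the same 1-to-n numbering a little more directly than rank; an equivalent sketch over the same repeated df:
df["Group"] = df.groupby("location").cumcount() + 1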
You can check out this code:
data = [
    {'name': 'Chicago', 'c': 2},
    {'name': 'Minneapolis', 'c': 5},
    {'name': 'Dallas', 'c': 1}
]

result = []
for location in data:
    for i in range(0, location['c']):
        result.append({'name': location['name'], 'group': i + 1})
result will be:
[{'group': 1, 'name': 'Chicago'},
 {'group': 2, 'name': 'Chicago'},
 {'group': 1, 'name': 'Minneapolis'},
 {'group': 2, 'name': 'Minneapolis'},
 {'group': 3, 'name': 'Minneapolis'},
 {'group': 4, 'name': 'Minneapolis'},
 {'group': 5, 'name': 'Minneapolis'},
 {'group': 1, 'name': 'Dallas'}]
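If you need this back in pandas, the list of dicts converts directly; a minimal sketch, assuming pandas is imported as pd:
df = pd.DataFrame(result)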
I have a pandas dataframe where one of the columns is in JSON format. It contains lists of movie production companies for a given title. Below is the sample structure:
ID | production_companies
---------------
1 | "[{'name': 'Paramount Pictures', 'id': 4}, {'name': 'United Artists', 'id': 60}, {'name': 'Metro-Goldwyn-Mayer (MGM)', 'id': 8411}]"
2 | "[{'name': 'Walt Disney Pictures', 'id': 2}]"
3 | "[{'name': 'Bold Films', 'id': 2266}, {'name': 'Blumhouse Productions', 'id': 3172}, {'name': 'Right of Way Films', 'id': 32157}]"
4 | nan
5 | nan
6 | nan
7 | "[{'name': 'Ghost House Pictures', 'id': 768}, {'name': 'North Box Productions', 'id': 22637}]"
As you can see, one movie (row) can have multiple production companies. For each movie I want to create separate columns containing the producers' names, named name_1, name_2, name_3, etc. If there is no second or third producer, the value should be NaN.
I don't have much experience working with JSON formats and I've tried a few methods (iterators with lambda functions), but they are not even close to what I need.
Therefore I hope for your help, guys!
EDIT:
The following code ("movies" is the main database):
from pandas.io.json import json_normalize
companies = list(movies['production_companies'])
json_normalize(companies)
gives me the following error:
AttributeError: 'str' object has no attribute 'values'
Adding on to @Andy's answer above to answer OP's question.
This part was by @Andy:
import pandas as pd
import numpy as np
import ast
import itertools
# dummy data
df = pd.DataFrame({
    "ID": [1, 2, 3],
    "production_companies": ["[{'name': 'Paramount Pictures', 'id': 4}, {'name': 'United Artists', 'id': 60}, {'name': 'Metro-Goldwyn-Mayer (MGM)', 'id': 8411}]", np.nan, "[{'name': 'Walt Disney Pictures', 'id': 2}]"]
})
# remove the nans
df.dropna(inplace=True)
# convert the strings into lists
df["production_companies"] = df["production_companies"].apply(lambda x: ast.literal_eval(x))
My additions to answer OP's requirements:
tmp_lst = []
for idx, item in df.groupby(by='ID'):
    # crediting this part to @Andy above
    tmp_df = pd.DataFrame(list(itertools.chain(*item["production_companies"].values.tolist()))).drop(columns='id')
    # transpose the dataframe
    tmp_df = tmp_df.T
    # add the movie ID back to tmp_df
    tmp_df['ID'] = item['ID'].values
    # accumulate tmp_df from all unique movie IDs
    tmp_lst.append(tmp_df)
pd.concat(tmp_lst, sort=False)
Result:
0 1 2 ID
name Paramount Pictures United Artists Metro-Goldwyn-Mayer (MGM) 1
name Walt Disney Pictures NaN NaN 3
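To get the exact name_1, name_2, ... headers the question asks for, the numbered columns can be renamed afterwards; a sketch, assuming the concatenated result above is kept in a variable:
out = pd.concat(tmp_lst, sort=False)
# rename the integer columns 0, 1, 2, ... to name_1, name_2, name_3, ...
out = out.rename(columns={i: f"name_{i + 1}" for i in range(out.shape[1] - 1)})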
This should do it
import pandas as pd
import numpy as np
import ast
import itertools
# dummy data
df = pd.DataFrame({
    "ID": [1, 2, 3],
    "production_companies": ["[{'name': 'Paramount Pictures', 'id': 4}, {'name': 'United Artists', 'id': 60}, {'name': 'Metro-Goldwyn-Mayer (MGM)', 'id': 8411}]", np.nan, "[{'name': 'Walt Disney Pictures', 'id': 2}]"]
})
# remove the nans
df.dropna(inplace=True)
# convert the strings into lists
df["production_companies"] = df["production_companies"].apply(lambda x: ast.literal_eval(x))
# flatten the column of lists into a single list, and convert to DataFrame
pd.DataFrame(list(itertools.chain(*df["production_companies"].values.tolist())))
Which yields:
id name
0 4 Paramount Pictures
1 60 United Artists
2 8411 Metro-Goldwyn-Mayer (MGM)
3 2 Walt Disney Pictures
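If you instead want the name_1, name_2, ... layout per movie rather than one row per company, one option is to expand each list into a Series; a sketch against the same df, after the literal_eval step:
# one column per producer, NaN-padded where a movie has fewer producers
names = df["production_companies"].apply(lambda lst: pd.Series([d["name"] for d in lst]))
names.columns = [f"name_{i + 1}" for i in names.columns]
df[["ID"]].join(names)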
I'm trying to parse JSON I've received from an API into a pandas DataFrame. The JSON is hierarchical; in this example I have a city code, a line name, and the list of stations for that line. Unfortunately I can't "unpack" it. I would be grateful for help and an explanation.
JSON:
{'id': '1',
 'lines': [{'hex_color': 'FFCD1C',
            'id': '8',
            'name': 'Калининская',              <------ line name
            'stations': [{'id': '8.189',
                          'lat': 55.745113,
                          'lng': 37.864052,
                          'name': 'Новокосино',  <------ station 1
                          'order': 0},
                         {'id': '8.88',
                          'lat': 55.752237,
                          'lng': 37.814587,
                          'name': 'Новогиреево', <------ station 2
                          'order': 1},
etc.
I'm trying to retrieve everything from the lowest level and then add all the higher-level information (starting from the line name):
c = r.content
j = simplejson.loads(c)
tmp = []
i = 0
data1 = pd.DataFrame(tmp)
data2 = pd.DataFrame(tmp)
pd.concat
station['name']
for station in j['lines']:
    data2 = data2.append(pd.DataFrame(station['stations'], station['name']), ignore_index=True)
data2
Once more, the questions are:
How do I make it work?
Is this solution an optimal one, or are there functions I should know about?
Update:
The JSON parses normally:
json_normalize(j)
id lines name
1 [{'hex_color': 'FFCD1C', 'stations': [{'lat': ... Москва
Current DataFrame I can get:
data2 = data2.append(pd.DataFrame(station['stations']),ignore_index=True)
id lat lng name order
0 8.189 55.745113 37.864052 Новокосино 0
1 8.88 55.752237 37.814587 Новогиреево 1
Desired dataframe can be represented as:
id lat lng name order Line_Name Id_Top Name_Top
0 8.189 55.745113 37.864052 Новокосино 0 Калининская 1 Москва
1 8.88 55.752237 37.814587 Новогиреево 1 Калининская 1 Москва
In addition to MaxU's answer, I think you still need the highest-level id; this should work:
json_normalize(data, ['lines','stations'], ['id',['lines','name']],record_prefix='station_')
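With record_prefix='station_', the station fields should come out as station_id, station_lat, and so on, while the meta arguments append the top-level id and the lines.name column.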
Assuming you have the following dictionary:
In [70]: data
Out[70]:
{'id': '1',
 'lines': [{'hex_color': 'FFCD1C',
            'id': '8',
            'name': 'Калининская',
            'stations': [{'id': '8.189',
                          'lat': 55.745113,
                          'lng': 37.864052,
                          'name': 'Новокосино',
                          'order': 0},
                         {'id': '8.88',
                          'lat': 55.752237,
                          'lng': 37.814587,
                          'name': 'Новогиреево',
                          'order': 1}]}]}
Solution: use pandas.io.json.json_normalize:
In [71]: pd.io.json.json_normalize(data['lines'],
    ...:                           ['stations'],
    ...:                           ['name', 'id'],
    ...:                           meta_prefix='parent_')
Out[71]:
id lat lng name order parent_name parent_id
0 8.189 55.745113 37.864052 Новокосино 0 Калининская 8
1 8.88 55.752237 37.814587 Новогиреево 1 Калининская 8
UPDATE: reflects updated question
res = (pd.io.json.json_normalize(data,
                                 ['lines', 'stations'],
                                 ['id', ['lines', 'name']],
                                 meta_prefix='Line_')
         .assign(Name_Top='Москва'))
Result:
In [94]: res
Out[94]:
id lat lng name order Line_id Line_lines.name Name_Top
0 8.189 55.745113 37.864052 Новокосино 0 1 Калининская Москва
1 8.88 55.752237 37.814587 Новогиреево 1 1 Калининская Москва
So, I have a pandas dataframe with a large number of rows, where one row might look like this:
data_store.iloc[0]
Out[5]:
mac_address 00:03:7f:05:c0:06
Visit 1/4/2016
storeid Ritika - Bhubaneswar
Mall or not High Street
Address 794, Sahid Nagar, Janpath, Bhubaneswar-751007
Unnamed: 4 OR
Locality JanPath
Affluence Index Locality 4
Lifestyle Index 5
Tourist Attraction In the locality? 0
City Bhubaneswar
Pop Density City 2131
Population Density of City NaN
City Affluence Index Medium
Mall / Shopping Complex High Street
Mall Premiumness Index NaN
Multiplex NaN
Offices Nearby NaN
Food Court NaN
Average Footfall NaN
Average Rental NaN
Size of the mall NaN
Area NaN
Upscale Street NaN
Place of Worship in vicinity NaN
High Street / Mall High Street
Brand Premiumness Index 4
Restaurant Nearby? 0
Store Size Large
Area.1 2600
There may be other values in place of NaN; just take this as an example. The unique key here is mac_address, so I want to start with an empty JSON document and update it for each row of data, like:
{
mac_address: "00:03:7f:05:c0:06"
{
"cities" : [
{
"City Name1" : "Wittenbergplatz",
"City count" : "12"
},
{
"City Name2" : "Spichernstrasse",
"City Count" : "19"
},
{
"City Name3" : "Weberwiese",
"City count" : "30"
}
]
}
}
City count is the number of times a mac_address visited a city. After reading this particular row, I would like to record the city Bhubaneswar with a count of 1. For each new row I would then check whether its mac_address is already in the JSON; for that I would probably have to load the JSON into a Python dictionary or something (suggestions welcome). If the mac_address is already there, I would like to update its info with that row; if not, I would like to add the mac_address as a new field and store the row's info under it. I have to do this in Python with pandas dataframes, since I have some familiarity with pandas. Any help on this?
Your desired JSON file is not valid JSON; here is the error message thrown by an online JSON validator:
Input:
{
"mac_address": "00:03:7f:05:c0:06" {
"cities": [{
"City Name1": "Wittenbergplatz",
"City count": "12"
}, {
"City Name2": "Spichernstrasse",
"City Count": "19"
}, {
"City Name3": "Weberwiese",
"City count": "30"
}]
}
}
Error
Error: Parse error on line 2:
..."00:03:7f:05:c0:06" { "cities": [{
-----------------------^
Expecting 'EOF', '}', ':', ',', ']', got '{'
This solution might help you get started:
In [440]: (df.groupby(['mac_address','City'])
.....: .size()
.....: .reset_index()
.....: .rename(columns={0:'count'})
.....: .groupby('mac_address')
.....: .apply(lambda x: x[['City','count']].to_dict('r'))
.....: .to_dict()
.....: )
Out[440]:
{'00:03:7f:05:c0:01': [{'City': 'aaa', 'count': 1}],
'00:03:7f:05:c0:02': [{'City': 'bbb', 'count': 1}],
'00:03:7f:05:c0:03': [{'City': 'ccc', 'count': 2}],
'00:03:7f:05:c0:05': [{'City': 'xxx', 'count': 1},
{'City': 'zzz', 'count': 1}],
'00:03:7f:05:c0:06': [{'City': 'aaa', 'count': 1},
{'City': 'bbb', 'count': 1}],
'00:03:7f:05:c0:07': [{'City': 'aaa', 'count': 3},
{'City': 'bbb', 'count': 1}]}
data:
In [441]: df
Out[441]:
mac_address City
0 00:03:7f:05:c0:06 aaa
1 00:03:7f:05:c0:06 bbb
2 00:03:7f:05:c0:07 aaa
3 00:03:7f:05:c0:07 bbb
4 00:03:7f:05:c0:07 aaa
5 00:03:7f:05:c0:01 aaa
6 00:03:7f:05:c0:02 bbb
7 00:03:7f:05:c0:03 ccc
8 00:03:7f:05:c0:03 ccc
9 00:03:7f:05:c0:07 aaa
10 00:03:7f:05:c0:05 xxx
11 00:03:7f:05:c0:05 zzz
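To persist the resulting mapping to disk, the standard json module can write it out; a minimal sketch, assuming the dict produced above is stored in result_dict and a hypothetical visits.json path:
import json

# result_dict is the {mac_address: [{'City': ..., 'count': ...}, ...]} mapping from above
with open('visits.json', 'w') as f:
    json.dump(result_dict, f, indent=2)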