I have one json file. I opened it with pd.read_json, but when parsing it to a geodataframe only some fields are considered and some are not. When I open it in QGIS, for instance, I can see multiple columns that I cannot get into the geodataframe.
So my file is called PT:
PT = pd.read_json('PT.json')
PT
type features
0 FeatureCollection {'id': 'osm-w96717521', 'type': 'Feature', 'pr...
1 FeatureCollection {'id': 'osm-w96850552', 'type': 'Feature', 'pr...
2 FeatureCollection {'id': 'osm-r1394361', 'type': 'Feature', 'pro...
and different lines of PT have different fields.
So for instance, for:
PT['features'][0]
{'id': 'osm-w96717521',
'type': 'Feature',
'properties': {'height': 24,
'heightSrc': 'manual',
'levels': 8,
'date': 201804},
'geometry': {'type': 'Polygon',
'coordinates': [[[-9.151539, 38.725054],
[-9.15148, 38.724906],
[-9.151281, 38.724918],
[-9.151254, 38.724867],
[-9.151142, 38.724699],
[-9.150984, 38.724783],
[-9.151081, 38.724918],
[-9.151152, 38.725076],
[-9.151539, 38.725054]]]}}
and for:
PT['features'][100000]
{'id': 'osm-w556092901',
'type': 'Feature',
'properties': {'date': 201801, 'orient': 95, 'height': 3, 'heightSrc': 'ai'},
'geometry': {'type': 'Polygon',
'coordinates': [[[-9.402381, 38.742663],
[-9.402342, 38.74261],
[-9.402215, 38.742667],
[-9.402281, 38.742706],
[-9.402381, 38.742663]]]}}
it also has the field 'orient'.
When I convert the features dict into columns of a df, some columns work:
df["coordinates"] = PT["features"].apply(lambda row: row["geometry"]["coordinates"])
df
but I cannot handle the fields that do not appear on every line. So for 'levels' or 'orient':
df["floors"] = PT["features"].apply(lambda row: row["properties"]["levels"])
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
Cell In [46], line 1
----> 1 df["floors"] = PT["features"].apply(lambda row: row["properties"]["levels"])
(...)
KeyError: 'levels'
How can I get all the columns contained in the features, even if some of their values have to be null?
You can use a conditional expression that returns NaN when the key does not exist:
import numpy as np
df["floors"] = nta["features"].apply(lambda row: row["properties"]["levels"] if 'levels' in list(row["properties"].keys()) else np.nan)
df["coordinates"] = nta["features"].apply(lambda row: row["geometry"]["coordinates"] if 'coordinates' in list(row["geometry"].keys()) else np.nan)
I have a json file built like this:
{"type":"FeatureCollection","features":[
{"type":"Feature","id":"010020000A0225","geometry":{"type":"Polygon","coordinates":[[[5.430767,46.0214267],[5.4310805,46.0220116],[5.4311205,46.0220864],[5.4312362,46.0223019],[5.4308994,46.0224141],[5.43087,46.0224242],[5.430774,46.0222401],[5.4304506,46.0223202],[5.4302885,46.021982],[5.4300391,46.0216054],[5.4299637,46.0216342],[5.4300862,46.0218401],[5.4299565,46.021902],[5.4298847,46.0218195],[5.4298545,46.0217829],[5.4297689,46.0216672],[5.4297523,46.0216506],[5.4297379,46.0216389],[5.4296432,46.0215854],[5.429517,46.0214509],[5.4294188,46.0213458],[5.4293757,46.0213128],[5.4291918,46.0211768],[5.4291488,46.0211448],[5.4291083,46.0211214],[5.429024,46.0210828],[5.4292965,46.0208202],[5.4294241,46.0208894],[5.4295183,46.0209623],[5.4295455,46.0209865],[5.429613,46.0210554],[5.4296428,46.0210813],[5.4298751,46.0212862],[5.429988,46.0213782],[5.430014,46.0213973],[5.4300746,46.0214318],[5.430124,46.0214542],[5.4302569,46.0215069],[5.4303111,46.0215192],[5.4303632,46.0215166],[5.4306127,46.0214642],[5.430767,46.0214267]]]},"properties":{"id":"010020000A0225","commune":"01002","prefixe":"000","section":"A","numero":"225","contenance":9440,"arpente":false,"created":"2005-06-03","updated":"2018-09-25"}},
{"type":"Feature","id":"010020000A0346","geometry":{"type":"Polygon","coordinates":[[[5.4241952,46.0255535],[5.4233594,46.0262031],[5.4232624,46.0262774],[5.4226259,46.0267733],[5.4227608,46.0268718],[5.4227712,46.0268789],[5.4226123,46.0269855],[5.422565,46.0270182],[5.4223546,46.027145],[5.4222957,46.0271794],[5.4221794,46.0272376],[5.4221383,46.0272585],[5.4221028,46.027152],[5.4220695,46.0270523],[5.4220378,46.026962],[5.4220467,46.0269265],[5.4220524,46.0268709],[5.4220563,46.0268474],[5.4222945,46.0268985],[5.4224161,46.0267746],[5.4224581,46.0267904],[5.4226286,46.02666],[5.4226811,46.02662],[5.4227313,46.0265803],[5.4227813,46.0265406],[5.4228535,46.0264868],[5.4229063,46.0264482],[5.4229741,46.0264001],[5.4234903,46.0260331],[5.4235492,46.0259893],[5.4235787,46.0259663],[5.423645,46.0259126],[5.4237552,46.0258198],[5.4237839,46.0257951],[5.4238321,46.0257547],[5.4239258,46.0256723],[5.4239632,46.0256394],[5.4241164,46.0255075],[5.4241952,46.0255535]]]},"properties":{"id":"010020000A0346","commune":"01002","prefixe":"000","section":"A","numero":"346","contenance":2800,"arpente":false,"created":"2005-06-03","updated":"2018-09-25"}},
I would like to get the properties and geometry for each feature, but I think I am looping over my json file badly. Here is my code:
data = pd.read_json(json_file_path)
for key, v in data.items():
    print(f"{key['features']['geometry']} : {v}",
          f"{key['features']['properties']} : {v}")
The values you are interested in are located in a list that is itself a value of your main dictionary.
If you want to be able to process these values with pandas, it would be better to build your dataframe directly from them:
import json
import pandas as pd
data = json.loads("""{"type":"FeatureCollection","features":[
{"type":"Feature","id":"010020000A0225","geometry":{"type":"Polygon","coordinates":[[[5.430767,46.0214267],[5.4310805,46.0220116],[5.4311205,46.0220864],[5.4312362,46.0223019],[5.4308994,46.0224141],[5.43087,46.0224242],[5.430774,46.0222401],[5.4304506,46.0223202],[5.4302885,46.021982],[5.4300391,46.0216054],[5.4299637,46.0216342],[5.4300862,46.0218401],[5.4299565,46.021902],[5.4298847,46.0218195],[5.4298545,46.0217829],[5.4297689,46.0216672],[5.4297523,46.0216506],[5.4297379,46.0216389],[5.4296432,46.0215854],[5.429517,46.0214509],[5.4294188,46.0213458],[5.4293757,46.0213128],[5.4291918,46.0211768],[5.4291488,46.0211448],[5.4291083,46.0211214],[5.429024,46.0210828],[5.4292965,46.0208202],[5.4294241,46.0208894],[5.4295183,46.0209623],[5.4295455,46.0209865],[5.429613,46.0210554],[5.4296428,46.0210813],[5.4298751,46.0212862],[5.429988,46.0213782],[5.430014,46.0213973],[5.4300746,46.0214318],[5.430124,46.0214542],[5.4302569,46.0215069],[5.4303111,46.0215192],[5.4303632,46.0215166],[5.4306127,46.0214642],[5.430767,46.0214267]]]},"properties":{"id":"010020000A0225","commune":"01002","prefixe":"000","section":"A","numero":"225","contenance":9440,"arpente":false,"created":"2005-06-03","updated":"2018-09-25"}},
{"type":"Feature","id":"010020000A0346","geometry":{"type":"Polygon","coordinates":[[[5.4241952,46.0255535],[5.4233594,46.0262031],[5.4232624,46.0262774],[5.4226259,46.0267733],[5.4227608,46.0268718],[5.4227712,46.0268789],[5.4226123,46.0269855],[5.422565,46.0270182],[5.4223546,46.027145],[5.4222957,46.0271794],[5.4221794,46.0272376],[5.4221383,46.0272585],[5.4221028,46.027152],[5.4220695,46.0270523],[5.4220378,46.026962],[5.4220467,46.0269265],[5.4220524,46.0268709],[5.4220563,46.0268474],[5.4222945,46.0268985],[5.4224161,46.0267746],[5.4224581,46.0267904],[5.4226286,46.02666],[5.4226811,46.02662],[5.4227313,46.0265803],[5.4227813,46.0265406],[5.4228535,46.0264868],[5.4229063,46.0264482],[5.4229741,46.0264001],[5.4234903,46.0260331],[5.4235492,46.0259893],[5.4235787,46.0259663],[5.423645,46.0259126],[5.4237552,46.0258198],[5.4237839,46.0257951],[5.4238321,46.0257547],[5.4239258,46.0256723],[5.4239632,46.0256394],[5.4241164,46.0255075],[5.4241952,46.0255535]]]},"properties":{"id":"010020000A0346","commune":"01002","prefixe":"000","section":"A","numero":"346","contenance":2800,"arpente":false,"created":"2005-06-03","updated":"2018-09-25"}}
]
}
""")
df = pd.DataFrame(data['features'])
print(df)
It'll give you the following DataFrame:
type id geometry properties
0 Feature 010020000A0225 {'type': 'Polygon', 'coordinates': [[[5.430767... {'id': '010020000A0225', 'commune': '01002', '...
1 Feature 010020000A0346 {'type': 'Polygon', 'coordinates': [[[5.424195... {'id': '010020000A0346', 'commune': '01002', '...
From there you can easily access the geometry and properties columns.
Furthermore, if you want geometric and other properties in their own columns, you can use json_normalize:
df = pd.json_normalize(data['features'])
print(df)
Output:
type id geometry.type geometry.coordinates ... properties.contenance properties.arpente properties.created properties.updated
0 Feature 010020000A0225 Polygon [[[5.430767, 46.0214267], [5.4310805, 46.02201... ... 9440 False 2005-06-03 2018-09-25
1 Feature 010020000A0346 Polygon [[[5.4241952, 46.0255535], [5.4233594, 46.0262... ... 2800 False 2005-06-03 2018-09-25
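In the actual script you would read the file from disk instead of a string literal; a minimal sketch, reusing json_file_path from the question:
import json
import pandas as pd

# parse the GeoJSON file, then flatten the features list
with open(json_file_path) as f:
    data = json.load(f)

df = pd.json_normalize(data['features'])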
I am working with Amazon Rekognition to do some image analysis.
With a simple Python script, I get - at every iteration - a response of this type:
(example for the image of a cat)
{'Labels':
[{'Name': 'Pet', 'Confidence': 96.146484375, 'Instances': [],
'Parents': [{'Name': 'Animal'}]}, {'Name': 'Mammal', 'Confidence': 96.146484375,
'Instances': [], 'Parents': [{'Name': 'Animal'}]},
{'Name': 'Cat', 'Confidence': 96.146484375.....
I get all the attributes I need in a list that looks like this:
[Pet, Mammal, Cat, Animal, Manx, Abyssinian, Furniture, Kitten, Couch]
Now, I would like to create a dataframe where the elements in the list above appear as columns and the rows take values 0 or 1.
I created a dictionary to which I add the elements of the list, so I get {'Cat': 1}; then I go to add it to the dataframe and I get the following error:
TypeError: Index(...) must be called with a collection of some kind, 'Cat' was passed.
Not only that, but I don't even seem to be able to add the information from different images to the same dataframe. For example, if I only insert the data into the dataframe (as rows, not columns), I get a series of n rows containing the n elements (identified by Amazon Rekognition) of only the last image, i.e. I start from an empty dataframe at each iteration.
The result I would like to get is something like:
Image Human Animal Flowers etc...
Pic1 1 0 0
Pic2 0 0 1
Pic3 1 1 0
For reference, this is the code I am using now (I should add that I am working on a software called KNIME, but this is just Python):
from pandas import DataFrame
import pandas as pd
import boto3
fileName=flow_variables['Path_Arr[1]'] #This is just to tell Amazon the name of the image
bucket= 'mybucket'
client=boto3.client('rekognition', region_name = 'us-east-2')
response = client.detect_labels(Image={'S3Object':
{'Bucket':bucket,'Name':fileName}})
data = [str(response)] # This is what I inserted in the first cell of this question
d = {}
for key, value in response.items():
    for el in value:
        if isinstance(el, dict):
            for k, v in el.items():
                if k == "Name":
                    d[v] = 1
print(d)
df = pd.DataFrame(d, ignore_index=True)
print(df)
output_table = df
I am definitely getting it all wrong both in the for loop and when adding things to my dataframe, but nothing really seems to work!
Sorry for the super long question, hope it was clear! Any ideas?
I do not know if this answers your question completely, because I do not know what your data can look like, but I think it is a good step that should help you. I added the same data multiple times, but the approach should be clear.
import pandas as pd
response = {'Labels': [{'Name': 'Pet', 'Confidence': 96.146484375, 'Instances': [], 'Parents': [{'Name': 'Animal'}]},
{'Name': 'Cat', 'Confidence': 96.146484375, 'Instances': [{'BoundingBox':
{'Width': 0.6686800122261047,
'Height': 0.9005332589149475,
'Left': 0.27255237102508545,
'Top': 0.03728689253330231},
'Confidence': 96.146484375}],
'Parents': [{'Name': 'Pet'}]
}]}
def handle_new_data(response_data: dict, image_name: str) -> pd.DataFrame:
    d = {"Image": image_name}
    for key, value in response_data.items():
        for el in value:
            if isinstance(el, dict):
                for k, v in el.items():
                    if k == "Name":
                        d[v] = 1
    # one row per image
    return pd.DataFrame([d])

# DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
df_all = pd.concat([
    handle_new_data(response, "image1"),
    handle_new_data(response, "image2"),
    handle_new_data(response, "image3"),
    handle_new_data(response, "image4"),
], ignore_index=True)
print(df_all)
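One detail worth noting: labels that were not detected for a given image end up as NaN rather than 0. To match the 0/1 matrix from the question, fill them afterwards:
# labels absent from an image are NaN after concatenation; make them 0
df_all = df_all.fillna(0)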
I have data like this and I want to write it into a dataframe so that I can convert it directly into a csv file.
Data =
[ {'event': 'User Clicked', 'properties': {'user_id': '123', 'page_visited': 'contact_us', etc}},
  {'event': 'User Clicked', 'properties': {'user_id': '456', 'page_visited': 'homepage', etc}}, ......
  {'event': 'User Clicked', 'properties': {'user_id': '789', 'page_visited': 'restaurant', etc}} ]
This is how I am able to access its values:
for item in list_of_dict_responses:
    print(item['event'])
    for key, value in item.items():
        if type(value) is dict:
            for k, v in value.items():
                print(k, v)
I want it in a dataframe where event is a column with the value User Clicked, and properties is another column with sub-columns of user_id and page_visited, and then the respective values of each sub-column.
Flatten the nested dictionaries and then just use the DataFrame constructor to create a data frame.
data = [
{'event': 'User Clicked', 'properties': {'user_id': '123', 'page_visited': 'contact_us'}},
{'event': 'User Clicked', 'properties': {'user_id': '456', 'page_visited': 'homepage'}},
{'event': 'User Clicked', 'properties': {'user_id': '789', 'page_visited': 'restaurant'}}
]
The flattened dictionary may be constructed in several ways. Here's one method using a generator; it is generic and will work with arbitrarily deep nested dictionaries (or at least until it hits the maximum recursion depth):
def flatten(kv, prefix=[]):
    for k, v in kv.items():
        if isinstance(v, dict):
            yield from flatten(v, prefix + [str(k)])
        else:
            if prefix:
                yield '_'.join(prefix + [str(k)]), v
            else:
                yield str(k), v
Then, using a comprehension to flatten all the records in data, construct the data frame:
pd.DataFrame({k:v for k, v in flatten(kv)} for kv in data)
#Out
event properties_page_visited properties_user_id
0 User Clicked contact_us 123
1 User Clicked homepage 456
2 User Clicked restaurant 789
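For this fixed two-level structure, pandas can also do the flattening itself; a sketch using the standard json_normalize, with sep='_' so the column names match the ones above:
import pandas as pd

# nested keys are joined with '_' instead of the default '.'
df = pd.json_normalize(data, sep='_')
print(df)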
You have 2 options: either use a MultiIndex for columns, or add a prefix for data in properties. The former, in my opinion, is not appropriate here, since you don't have a "true" hierarchical columnar structure. The second level, for example, would be empty for event.
Implementing the second idea, you can restructure your list of dictionaries before feeding to pd.DataFrame. The syntax {**d1, **d2} is used to combine two dictionaries.
data_transformed = [{**{'event': d['event']},
                     **{f'properties_{k}': v for k, v in d['properties'].items()}}
                    for d in Data]
res = pd.DataFrame(data_transformed)
print(res)
event properties_page_visited properties_user_id
0 User Clicked contact_us 123
1 User Clicked homepage 456
2 User Clicked restaurant 789
This also aids writing to and reading from CSV files, where a MultiIndex can be ambiguous.
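Since the end goal is a csv file, writing it out is then a one-liner (the filename here is just an example):
# index=False keeps the 0, 1, 2 row labels out of the file
res.to_csv('events.csv', index=False)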