I have a json file built like this:
{"type":"FeatureCollection","features":[
{"type":"Feature","id":"010020000A0225","geometry":{"type":"Polygon","coordinates":[[[5.430767,46.0214267],[5.4310805,46.0220116],[5.4311205,46.0220864],[5.4312362,46.0223019],[5.4308994,46.0224141],[5.43087,46.0224242],[5.430774,46.0222401],[5.4304506,46.0223202],[5.4302885,46.021982],[5.4300391,46.0216054],[5.4299637,46.0216342],[5.4300862,46.0218401],[5.4299565,46.021902],[5.4298847,46.0218195],[5.4298545,46.0217829],[5.4297689,46.0216672],[5.4297523,46.0216506],[5.4297379,46.0216389],[5.4296432,46.0215854],[5.429517,46.0214509],[5.4294188,46.0213458],[5.4293757,46.0213128],[5.4291918,46.0211768],[5.4291488,46.0211448],[5.4291083,46.0211214],[5.429024,46.0210828],[5.4292965,46.0208202],[5.4294241,46.0208894],[5.4295183,46.0209623],[5.4295455,46.0209865],[5.429613,46.0210554],[5.4296428,46.0210813],[5.4298751,46.0212862],[5.429988,46.0213782],[5.430014,46.0213973],[5.4300746,46.0214318],[5.430124,46.0214542],[5.4302569,46.0215069],[5.4303111,46.0215192],[5.4303632,46.0215166],[5.4306127,46.0214642],[5.430767,46.0214267]]]},"properties":{"id":"010020000A0225","commune":"01002","prefixe":"000","section":"A","numero":"225","contenance":9440,"arpente":false,"created":"2005-06-03","updated":"2018-09-25"}},
{"type":"Feature","id":"010020000A0346","geometry":{"type":"Polygon","coordinates":[[[5.4241952,46.0255535],[5.4233594,46.0262031],[5.4232624,46.0262774],[5.4226259,46.0267733],[5.4227608,46.0268718],[5.4227712,46.0268789],[5.4226123,46.0269855],[5.422565,46.0270182],[5.4223546,46.027145],[5.4222957,46.0271794],[5.4221794,46.0272376],[5.4221383,46.0272585],[5.4221028,46.027152],[5.4220695,46.0270523],[5.4220378,46.026962],[5.4220467,46.0269265],[5.4220524,46.0268709],[5.4220563,46.0268474],[5.4222945,46.0268985],[5.4224161,46.0267746],[5.4224581,46.0267904],[5.4226286,46.02666],[5.4226811,46.02662],[5.4227313,46.0265803],[5.4227813,46.0265406],[5.4228535,46.0264868],[5.4229063,46.0264482],[5.4229741,46.0264001],[5.4234903,46.0260331],[5.4235492,46.0259893],[5.4235787,46.0259663],[5.423645,46.0259126],[5.4237552,46.0258198],[5.4237839,46.0257951],[5.4238321,46.0257547],[5.4239258,46.0256723],[5.4239632,46.0256394],[5.4241164,46.0255075],[5.4241952,46.0255535]]]},"properties":{"id":"010020000A0346","commune":"01002","prefixe":"000","section":"A","numero":"346","contenance":2800,"arpente":false,"created":"2005-06-03","updated":"2018-09-25"}},
For each feature I would like to get its properties and geometry, but I think I am looping over my JSON file incorrectly. Here is my code:
data = pd.read_json(json_file_path)
for key, v in data.items():
print(f"{key['features']['geometry']} : {v}",
f"{key['features']['properties']} : {v}")
The values you are interested in are located in a list that is itself a value of your main dictionary.
If you want to be able to process these values with pandas, it would be better to build your dataframe directly from them:
import json
import pandas as pd
data = json.loads("""{"type":"FeatureCollection","features":[
{"type":"Feature","id":"010020000A0225","geometry":{"type":"Polygon","coordinates":[[[5.430767,46.0214267],[5.4310805,46.0220116],[5.4311205,46.0220864],[5.4312362,46.0223019],[5.4308994,46.0224141],[5.43087,46.0224242],[5.430774,46.0222401],[5.4304506,46.0223202],[5.4302885,46.021982],[5.4300391,46.0216054],[5.4299637,46.0216342],[5.4300862,46.0218401],[5.4299565,46.021902],[5.4298847,46.0218195],[5.4298545,46.0217829],[5.4297689,46.0216672],[5.4297523,46.0216506],[5.4297379,46.0216389],[5.4296432,46.0215854],[5.429517,46.0214509],[5.4294188,46.0213458],[5.4293757,46.0213128],[5.4291918,46.0211768],[5.4291488,46.0211448],[5.4291083,46.0211214],[5.429024,46.0210828],[5.4292965,46.0208202],[5.4294241,46.0208894],[5.4295183,46.0209623],[5.4295455,46.0209865],[5.429613,46.0210554],[5.4296428,46.0210813],[5.4298751,46.0212862],[5.429988,46.0213782],[5.430014,46.0213973],[5.4300746,46.0214318],[5.430124,46.0214542],[5.4302569,46.0215069],[5.4303111,46.0215192],[5.4303632,46.0215166],[5.4306127,46.0214642],[5.430767,46.0214267]]]},"properties":{"id":"010020000A0225","commune":"01002","prefixe":"000","section":"A","numero":"225","contenance":9440,"arpente":false,"created":"2005-06-03","updated":"2018-09-25"}},
{"type":"Feature","id":"010020000A0346","geometry":{"type":"Polygon","coordinates":[[[5.4241952,46.0255535],[5.4233594,46.0262031],[5.4232624,46.0262774],[5.4226259,46.0267733],[5.4227608,46.0268718],[5.4227712,46.0268789],[5.4226123,46.0269855],[5.422565,46.0270182],[5.4223546,46.027145],[5.4222957,46.0271794],[5.4221794,46.0272376],[5.4221383,46.0272585],[5.4221028,46.027152],[5.4220695,46.0270523],[5.4220378,46.026962],[5.4220467,46.0269265],[5.4220524,46.0268709],[5.4220563,46.0268474],[5.4222945,46.0268985],[5.4224161,46.0267746],[5.4224581,46.0267904],[5.4226286,46.02666],[5.4226811,46.02662],[5.4227313,46.0265803],[5.4227813,46.0265406],[5.4228535,46.0264868],[5.4229063,46.0264482],[5.4229741,46.0264001],[5.4234903,46.0260331],[5.4235492,46.0259893],[5.4235787,46.0259663],[5.423645,46.0259126],[5.4237552,46.0258198],[5.4237839,46.0257951],[5.4238321,46.0257547],[5.4239258,46.0256723],[5.4239632,46.0256394],[5.4241164,46.0255075],[5.4241952,46.0255535]]]},"properties":{"id":"010020000A0346","commune":"01002","prefixe":"000","section":"A","numero":"346","contenance":2800,"arpente":false,"created":"2005-06-03","updated":"2018-09-25"}}
]
}
""")
df = pd.DataFrame(data['features'])
print(df)
It'll give you the following DataFrame:
type id geometry properties
0 Feature 010020000A0225 {'type': 'Polygon', 'coordinates': [[[5.430767... {'id': '010020000A0225', 'commune': '01002', '...
1 Feature 010020000A0346 {'type': 'Polygon', 'coordinates': [[[5.424195... {'id': '010020000A0346', 'commune': '01002', '...
From there you can easily access the geometry and properties columns.
Furthermore, if you want geometric and other properties in their own columns, you can use json_normalize:
df = pd.json_normalize(data['features'])
print(df)
Output:
type id geometry.type geometry.coordinates ... properties.contenance properties.arpente properties.created properties.updated
0 Feature 010020000A0225 Polygon [[[5.430767, 46.0214267], [5.4310805, 46.02201... ... 9440 False 2005-06-03 2018-09-25
1 Feature 010020000A0346 Polygon [[[5.4241952, 46.0255535], [5.4233594, 46.0262... ... 2800 False 2005-06-03 2018-09-25
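If no DataFrame is needed, the loop from the question can also be written over the parsed JSON itself. The original loop likely fails because iterating `data.items()` on a DataFrame yields column names (plain strings) and columns, so `key['features']` indexes a string. The sketch below is one possible way to do it; the file name `parcelles.json`, the helper `iter_features`, and the shortened sample data are assumptions made for illustration.

```python
import json
import os
import tempfile

# Hypothetical two-feature file with the same layout as in the question
# (coordinate rings and properties shortened to keep the example small).
sample = {
    "type": "FeatureCollection",
    "features": [
        {
            "type": "Feature",
            "id": "010020000A0225",
            "geometry": {
                "type": "Polygon",
                "coordinates": [[[5.430767, 46.0214267], [5.4310805, 46.0220116], [5.430767, 46.0214267]]],
            },
            "properties": {"id": "010020000A0225", "commune": "01002", "contenance": 9440},
        },
        {
            "type": "Feature",
            "id": "010020000A0346",
            "geometry": {
                "type": "Polygon",
                "coordinates": [[[5.4241952, 46.0255535], [5.4233594, 46.0262031], [5.4241952, 46.0255535]]],
            },
            "properties": {"id": "010020000A0346", "commune": "01002", "contenance": 2800},
        },
    ],
}
json_file_path = os.path.join(tempfile.mkdtemp(), "parcelles.json")
with open(json_file_path, "w", encoding="utf-8") as fh:
    json.dump(sample, fh)


def iter_features(path):
    """Yield (properties, geometry) for every feature in a GeoJSON-style file."""
    with open(path, encoding="utf-8") as fh:
        collection = json.load(fh)
    # The features live in a list stored under the top-level "features" key.
    for feature in collection["features"]:
        yield feature["properties"], feature["geometry"]


pairs = list(iter_features(json_file_path))
for properties, geometry in pairs:
    print(properties["id"], geometry["type"], len(geometry["coordinates"][0]), "points")
```

Each item is a plain pair of dicts, so individual values such as `properties["contenance"]` can be read directly inside the loop.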
My database has a column where all the cells have a string of data. There are around 15-20 variables, where the information is assigned to the variables with an "=" and separated by a space. The number and names of the variables can differ in the individual cells... The issue I face is that the data is separated by spaces and so are some of the variables. The variable name is in every cell, so I can't just make the headers and add the values to the data frame like a csv. The solution also needs to be able to do this process automatically for all the new data in the database.
Example:
Cell 1: TITLE="Brothers Karamazov" AUTHOR="Fyodor Dostoevsky" PAGES="520"... RELEASED="1880".
Cell 2: TITLE="Moby Dick" AUTHOR="Herman Melville" PAGES="655"... MAIN CHARACTER="Ishmael".
I want to convert these strings of data into a structured dataframe like this:
TITLE               AUTHOR             PAGES  RELEASED  MAIN
Brothers Karamazov  Fyodor Dostoevsky  520    1880      NaN
Moby Dick           Herman Melville    655    NaN       Ishmael
Any tips on how to move forward? I have thought about converting it into JSON format by using the replace() function before turning it into a dataframe, but have not yet succeeded. Any tips or ideas are much appreciated.
Thanks,
I guess this sample is what you need.
import pandas as pd
# Helper function
def str_to_dict(cell) -> dict:
    # Turn each '" ' separator into a newline, drop the remaining quotes, then split into KEY=value pairs
    normalized_cell = cell.replace('" ', '\n').replace('"', '').split('\n')
    temp = {}
    for x in normalized_cell:
        # Split on the first '=' only, so a value that contains '=' stays intact
        key, value = x.split('=', 1)
        temp[key] = value
    return temp
list_of_cell = [
    'TITLE="Brothers Karamazov" AUTHOR="Fyodor Dostoevsky" PAGES="520" RELEASED="1880"',
    'TITLE="Moby Dick" AUTHOR="Herman Melville" PAGES="655" MAIN CHARACTER="Ishmael"'
]
dataset = [str_to_dict(i) for i in list_of_cell]
print(dataset)
"""
[{'TITLE': 'Brothers Karamazov', 'AUTHOR': 'Fyodor Dostoevsky', 'PAGES': '520', 'RELEASED': '1880'}, {'TITLE': 'Moby Dick', 'AUTHOR': 'Herman Melville', 'PAGES': '655', 'MAIN CHARACTER': 'Ishmael'}]
"""
df = pd.DataFrame(dataset)
df.head()
"""
TITLE AUTHOR PAGES RELEASED MAIN CHARACTER
0 Brothers Karamazov Fyodor Dostoevsky 520 1880 NaN
1 Moby Dick Herman Melville 655 NaN Ishmael
"""
The pandas library can read data from a .csv file into a DataFrame. Try this:
import pandas as pd
file = 'xx.csv'
data = pd.read_csv(file)
print(data)
Create a Python dictionary from your database rows.
Then create Pandas dataframe using the function: pandas.DataFrame.from_dict
Something like this:
import pandas as pd
# Assumed data from DB, structure it like this
data = [
    {
        'TITLE': 'Brothers Karamazov',
        'AUTHOR': 'Fyodor Dostoevsky'
    }, {
        'TITLE': 'Moby Dick',
        'AUTHOR': 'Herman Melville'
    }
]
# Dataframe as per your requirements
dt = pd.DataFrame.from_dict(data)
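For the KEY="value" cells in the question, a regular expression is another way to build the per-row dictionaries, since it can read keys that contain spaces and values that contain "=". This is only a sketch: the pattern assumes every value is wrapped in double quotes with no escaped quotes inside, and the names `PAIR`, `parse_cell`, and `raw` are made up for illustration.

```python
import re

import pandas as pd

# KEY="value" pairs; the key may itself contain spaces (e.g. MAIN CHARACTER).
PAIR = re.compile(r'(\w[\w ]*?)="([^"]*)"')


def parse_cell(cell: str) -> dict:
    """Return the KEY="value" pairs found in one cell as a dict."""
    return dict(PAIR.findall(cell))


# Hypothetical stand-in for the database column.
raw = pd.Series([
    'TITLE="Brothers Karamazov" AUTHOR="Fyodor Dostoevsky" PAGES="520" RELEASED="1880"',
    'TITLE="Moby Dick" AUTHOR="Herman Melville" PAGES="655" MAIN CHARACTER="Ishmael"',
])

# One dict per row; keys missing from a row become NaN in the DataFrame.
parsed = pd.DataFrame([parse_cell(cell) for cell in raw])
print(parsed)
```

Because the parsing lives in one function, newly arriving rows can be passed through the same `parse_cell` call before being appended to the table.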
I am working with Amazon Rekognition to do some image analysis.
With a simple Python script, I get, at every iteration, a response of this type:
(example for the image of a cat)
{'Labels':
[{'Name': 'Pet', 'Confidence': 96.146484375, 'Instances': [],
'Parents': [{'Name': 'Animal'}]}, {'Name': 'Mammal', 'Confidence': 96.146484375,
'Instances': [], 'Parents': [{'Name': 'Animal'}]},
{'Name': 'Cat', 'Confidence': 96.146484375.....
I got all the attributes I need in a list, that looks like this:
[Pet, Mammal, Cat, Animal, Manx, Abyssinian, Furniture, Kitten, Couch]
Now, I would like to create a dataframe where the elements in the list above appear as columns and the rows take values 0 or 1.
I created a dictionary in which I add the elements in the list, so I get {'Cat': 1}, then I go to add it to the dataframe and I get the following error:
TypeError: Index(...) must be called with a collection of some kind, 'Cat' was passed.
Not only that, but I don't even seem able to add to the same dataframe the information from different images. For example, if I only insert the data in the dataframe (as rows, not columns), I get a series with n rows with the n elements (identified by Amazon Rekognition) of only the last image, i.e. I start from an empty dataframe at each iteration.
The result I would like to get is something like:
Image Human Animal Flowers etc...
Pic1 1 0 0
Pic2 0 0 1
Pic3 1 1 0
For reference, this is the code I am using now (I should add that I am working on a software called KNIME, but this is just Python):
from pandas import DataFrame
import pandas as pd
import boto3
fileName=flow_variables['Path_Arr[1]'] #This is just to tell Amazon the name of the image
bucket= 'mybucket'
client=boto3.client('rekognition', region_name = 'us-east-2')
response = client.detect_labels(Image={'S3Object':
{'Bucket':bucket,'Name':fileName}})
data = [str(response)] # This is what I inserted in the first cell of this question
d= {}
for key, value in response.items():
    for el in value:
        if isinstance(el,dict):
            for k, v in el.items():
                if k == "Name":
                    d[v] = 1
print(d)
df = pd.DataFrame(d, ignore_index=True)
print(df)
output_table = df
I am definitely getting it all wrong both in the for loop and when adding things to my dataframe, but nothing really seems to work!
Sorry for the super long question, hope it was clear! Any ideas?
I do not know if this answers your question completely, because I do not know what your data can look like, but I think it is a good first step that should help you. I added the same data multiple times, but the approach should be clear.
import pandas as pd
response = {'Labels': [{'Name': 'Pet', 'Confidence': 96.146484375, 'Instances': [], 'Parents': [{'Name': 'Animal'}]},
{'Name': 'Cat', 'Confidence': 96.146484375, 'Instances': [{'BoundingBox':
{'Width': 0.6686800122261047,
'Height': 0.9005332589149475,
'Left': 0.27255237102508545,
'Top': 0.03728689253330231},
'Confidence': 96.146484375}],
'Parents': [{'Name': 'Pet'}]
}]}
def handle_new_data(response_data: dict, image_name: str) -> pd.DataFrame:
    # One row per image: the image name plus a 1 for every detected label
    d = {"Image": image_name}
    for key, value in response_data.items():
        for el in value:
            if isinstance(el, dict):
                for k, v in el.items():
                    if k == "Name":
                        d[v] = 1
    return pd.DataFrame([d])
# DataFrame.append was removed in pandas 2.0, so collect the one-row frames and concatenate them once
frames = [handle_new_data(response, name) for name in ("image1", "image2", "image3", "image4")]
df_all = pd.concat(frames, ignore_index=True)
print(df_all)
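The table requested in the question holds 0 or 1 for every label, and the attribute list there also includes parent labels such as Animal. A minimal sketch of that layout follows, assuming each response has the `Labels` / `Name` / `Parents` shape shown in the question; the two sample responses, their confidence values, and the helper name `label_names` are invented for illustration.

```python
import pandas as pd


def label_names(response: dict) -> set:
    """Collect every label name in a detect_labels-style response, parents included."""
    names = set()
    for label in response.get("Labels", []):
        names.add(label["Name"])
        names.update(parent["Name"] for parent in label.get("Parents", []))
    return names


# Hypothetical responses for two images, trimmed to the fields used here.
responses = {
    "Pic1": {"Labels": [
        {"Name": "Pet", "Confidence": 96.1, "Instances": [], "Parents": [{"Name": "Animal"}]},
        {"Name": "Cat", "Confidence": 96.1, "Instances": [], "Parents": [{"Name": "Pet"}]},
    ]},
    "Pic2": {"Labels": [
        {"Name": "Couch", "Confidence": 80.0, "Instances": [], "Parents": [{"Name": "Furniture"}]},
    ]},
}

# One {label: 1} dict per image; labels absent from an image come out as NaN.
rows = [{name: 1 for name in sorted(label_names(resp))} for resp in responses.values()]
table = pd.DataFrame(rows, index=pd.Index(list(responses), name="Image"))

# Replace the NaN gaps with 0 so every cell is an integer flag.
table = table.fillna(0).astype(int)
print(table)
```

Building all rows first and constructing the DataFrame once avoids starting from an empty frame at each iteration, which is the symptom described in the question.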
I am reading a file with one JSON object per line (ndjson)
dfjson = pd.read_json(path_or_buf=JsonFicMain,orient='records',lines=True)
Here is an example of 2 lines of the content of the dataframe (after dropping columns)
nomCommune codeCommune numeroComplet nomVoie codePostal meilleurePosition codesParcelles
0 Ablon-sur-Seine 94001 21 Rue Robert Schumann 94480 {'type': 'parcelle', 'geometry': {'type': 'Point', 'coordinates': [2.411247955172414, 48.726054248275865]}} [94001000AG0013]
1 Ablon-sur-Seine 94001 13 Rue Robert Schumann 94480 {'type': 'parcelle', 'geometry': {'type': 'Point', 'coordinates': [2.412065866666666, 48.72614911111111]}} [94001000AG0020]
It contains millions of rows. I want to extract one geographic coordinate pair, the one between square brackets, from a specific column (named meilleurePosition). The expected output is
[2.411247955172414, 48.726054248275865]
I tried to either extract the coordinate or replace all other unwanted characters
Using extractall, or extract does not match
test=dfjson['meilleurePosition'].str.extract(pat='(\d+\.\d+)')
test2=dfjson['meilleurePosition'].str.extractall(pat='(\d+\.\d+)')
Empty DataFrame
Columns: [0]
Index: []
Using replace, or str.replace does not work
test3=dfjson["meilleurePosition"].replace(to_replace=r'[^0-9.,:]',value='',regex=True)
0 {'type': 'parcelle', 'geometry': {'type': 'Point', 'coordinates': [2.411247955172414, 48.726054248275865]}}
1 {'type': 'parcelle', 'geometry': {'type': 'Point', 'coordinates': [2.412065866666666, 48.72614911111111]}}
Even a non-regex replacement does not work
test4=dfjson['meilleurePosition'].str.replace('type','whatever')
0 NaN
1 NaN
print(test)
I have tried to find out why this does not work at all.
Column type is 'object' (which is apparently good, as this is a string)
Using inplace=True without copying the dataframe leads to similar results
Why can't I manipulate this column? Is it because of the special characters in it?
How can I get these coordinates in the right format?
OK, after more investigation: the column contains nested dicts, not strings, which is why the string methods do not work.
The answer to "python pandas use map with regular expressions" helped me a lot.
I then used the following code to create a new column with the expected coordinates:
def extract_coord(meilleurepositiondict):
    if isinstance(meilleurepositiondict, dict):
        return meilleurepositiondict['geometry']['coordinates']
    else:
        return None
dfjson['meilleurePositionclean'] = dfjson['meilleurePosition'].apply(extract_coord)
I found the solution using the code below:
dfjson['meilleurePosition']=dfjson['meilleurePosition'].apply(lambda x: extract_coord(x) if x == x else defaultmeilleurepositionvalue)
The x == x test is false only for NaN, so empty rows get the default value; this was required because the empty rows led to an error that was not trapped in the function definition.
However, I am still convinced there is a much easier way to replace a column of dicts with one of the dict's values. Still trying...
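One shorter route that may fit here is the `.str.get` accessor, which looks up a key when an element is a dict and leaves missing rows as NaN. The sketch below assumes the column holds dicts shaped like the two rows in the question; the three-row `dfjson` is a hypothetical miniature, with one empty row added to show the NaN handling.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the column from the question; the last row is empty.
dfjson = pd.DataFrame({
    "meilleurePosition": [
        {"type": "parcelle", "geometry": {"type": "Point", "coordinates": [2.411247955172414, 48.726054248275865]}},
        {"type": "parcelle", "geometry": {"type": "Point", "coordinates": [2.412065866666666, 48.72614911111111]}},
        np.nan,
    ]
})

# First lookup pulls out the inner "geometry" dict, the second its "coordinates" list.
coords = dfjson["meilleurePosition"].str.get("geometry").str.get("coordinates")
print(coords)
```

This also hints at why `str.replace` returned NaN in the question: the string methods treat non-string elements as missing, while `str.get` is one of the few that accepts dicts.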
How can I turn a nested list with dict inside into extra columns in a dataframe in Python?
I received information within a dict from an API,
{'orders':
[
{ 'orderId': '2838168630',
'dateTimeOrderPlaced': '2020-01-22T18:37:29+01:00',
'orderItems': [{ 'orderItemId': 'BFC0000361764421',
'ean': '234234234234234',
'cancelRequest': False,
'quantity': 1}
]},
{ 'orderId': '2708182540',
'dateTimeOrderPlaced': '2020-01-22T17:45:36+01:00',
'orderItems': [{ 'orderItemId': 'BFC0000361749496',
'ean': '234234234234234',
'cancelRequest': False,
'quantity': 3}
]},
{ 'orderId': '2490844970',
'dateTimeOrderPlaced': '2019-08-17T14:21:46+02:00',
'orderItems': [{ 'orderItemId': 'BFC0000287505870',
'ean': '234234234234234',
'cancelRequest': True,
'quantity': 1}
]}
which I managed to turn into a simple dataframe by doing this:
pd.DataFrame(recieved_data.get('orders'))
output:
orderId date oderItems
1 1-12 [{orderItemId: 'dfs13', 'ean': '34234'}]
2 etc.
...
I would like to have something like this
orderId date oderItemId ean
1 1-12 dfs13 34234
2 etc.
...
I already tried to single out the orderItems column with iloc and then turn it into a list so I can try to extract the values again. However, I then still end up with a list from which I need to extract another list, which has the dict in it.
# Load the dataframe as you have already done.
temp_df = df['orderItems'].apply(pd.Series)
# concat the temp_df and original df side by side (axis=1; the default axis=0 would stack them as extra rows)
final_df = pd.concat([df, temp_df], axis=1)
# drop columns if required
Hope it works for you.
Cheers
By combining the answers to this question I reached my end goal. I did the following:
# Unlist the orderItems column
temp_df = df['orderItems'].apply(pd.Series)
# Put the items in orderItems into separate columns
temp_df_json = pd.json_normalize(temp_df[0].tolist())
# Join the tables
final_df = df.join(temp_df_json)
# Drop the old orderItems column for a clean table
final_df = final_df.drop(["orderItems"], axis=1)
Also, instead of .concat() I applied .join() to join both tables based on the existing index.
Just to make it clear: you are receiving JSON from the API, so you can try the json_normalize function.
Try this:
import pandas as pd
# DataFrame initialization
df = pd.DataFrame({"orderId": [1], "date": ["1-12"], "oderItems": [{ 'orderItemId': 'dfs13', 'ean': '34234'}]})
# Flatten the inner dicts into their own columns (pd.json_normalize replaces the old pandas.io.json import)
sub_df = pd.json_normalize(df["oderItems"].tolist())
# Dropping the original nested column
df = df.drop(["oderItems"], axis=1)
# Joining both dataframes
df.join(sub_df)
So the output is:
   orderId  date orderItemId    ean
0        1  1-12       dfs13  34234
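For the full API payload from the question, where orderItems is a list of dicts inside each order, `pd.json_normalize` can also flatten everything in one call through its `record_path` and `meta` parameters. A sketch under the assumption that every order has an `orderItems` list; the variable names `received_data` and `flat` are illustrative, and the data is copied from the question.

```python
import pandas as pd

# The payload from the question.
received_data = {"orders": [
    {"orderId": "2838168630",
     "dateTimeOrderPlaced": "2020-01-22T18:37:29+01:00",
     "orderItems": [{"orderItemId": "BFC0000361764421", "ean": "234234234234234",
                     "cancelRequest": False, "quantity": 1}]},
    {"orderId": "2708182540",
     "dateTimeOrderPlaced": "2020-01-22T17:45:36+01:00",
     "orderItems": [{"orderItemId": "BFC0000361749496", "ean": "234234234234234",
                     "cancelRequest": False, "quantity": 3}]},
    {"orderId": "2490844970",
     "dateTimeOrderPlaced": "2019-08-17T14:21:46+02:00",
     "orderItems": [{"orderItemId": "BFC0000287505870", "ean": "234234234234234",
                     "cancelRequest": True, "quantity": 1}]},
]}

# record_path names the nested list to expand into rows;
# meta lists the order-level fields to repeat on each of those rows.
flat = pd.json_normalize(
    received_data["orders"],
    record_path="orderItems",
    meta=["orderId", "dateTimeOrderPlaced"],
)
print(flat)
```

An order with several items produces one row per item, each carrying the same orderId, which the join-based approaches would not do without extra work.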
I have some json data that I want to put into a pandas dataframe. The json looks like this:
{'date': [20170629,
20170630,
20170703,
20170705,
20170706,
20170707],
'errorMessage': None,
'seriesarr': [{'chartOnlyFlag': 'false',
'dqMaxValidStr': None,
'expression': 'DB(FXO,V1,EUR,USD,7D,VOL)',
'freq': None,
'frequency': None,
'iDailyDates': None,
'label': '',
'message': None,
'plotPoints': [0.0481411225888,
0.0462401214563,
0.0587196848727,
0.0765737640932,
0.0678912611279,
0.0675766942022],
}
I am trying to create a pandas DataFrame with 'date' as the index and 'plotPoints' as a second column. I don't need any of the other information.
I've tried
df = pd.io.json.json_normalize(data, record_path = 'date', meta = ['seriesarr', ['plotPoints']])
When I do this I get the following error:
KeyError: ("Try running with errors='ignore' as key %s is not always present", KeyError('plotPoints',)
Any help with this is appreciated.
Thanks!
IIUC, json_normalize may not be able to help you here. It might instead just be easier to extract that data and then load it into a dataframe directly. If need be, convert to datetime using pd.to_datetime:
date = data.get('date')
plotPoints = data.get('seriesarr')[0].get('plotPoints')
df = pd.DataFrame({'date' : pd.to_datetime(date, format='%Y%m%d'),
'plotPoints' : plotPoints})
df
date plotPoints
0 2017-06-29 0.048141
1 2017-06-30 0.046240
2 2017-07-03 0.058720
3 2017-07-05 0.076574
4 2017-07-06 0.067891
5 2017-07-07 0.067577
This is under the assumption that your data is exactly as shown in the question.
As @COLDSPEED pointed out, it is simplest to read the values straight from the dictionary, since 'plotPoints' sits inside a list of dicts.
Below is a comprehension-based variation that has date as the index and plotPoints as the column:
col1 = data['date']
# Merge every dict in 'seriesarr' into one (items() replaces Python 2's iteritems())
adict = {k: v for d in data['seriesarr'] for k, v in d.items()}
col2 = adict['plotPoints']
pd.DataFrame(data=col2, index=col1)
>>> 0
20170629 0.048141
20170630 0.046240
20170703 0.058720
20170705 0.076574
20170706 0.067891
20170707 0.067577
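The two ideas in this thread can be combined: parse the dates and use them as the index, with plotPoints as a named column. A minimal sketch, assuming the data has the shape shown in the question and that only the first entry of 'seriesarr' is needed; the dict below is trimmed to the keys that are used.

```python
import pandas as pd

# Trimmed copy of the data from the question.
data = {
    "date": [20170629, 20170630, 20170703, 20170705, 20170706, 20170707],
    "errorMessage": None,
    "seriesarr": [{
        "expression": "DB(FXO,V1,EUR,USD,7D,VOL)",
        "plotPoints": [0.0481411225888, 0.0462401214563, 0.0587196848727,
                       0.0765737640932, 0.0678912611279, 0.0675766942022],
    }],
}

# Convert the integer dates to strings first so the format string applies unambiguously.
index = pd.to_datetime([str(d) for d in data["date"]], format="%Y%m%d")

df = pd.DataFrame({"plotPoints": data["seriesarr"][0]["plotPoints"]}, index=index)
df.index.name = "date"
print(df)
```

With a DatetimeIndex in place, date-based selection such as `df.loc["2017-07"]` becomes available.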