Python: String replacement with JSON dictionary

I need to write a Python script that replaces strings in a JSON file, based on a JSON dictionary. The file contains information about patents and looks like this:
{
"US-8163793-B2": {
"publication_date": "20120424",
"priority_date": "20090420",
"family_id": "42261969",
"country_code": "US",
"ipc_code": "C07D417/14",
"cpc_code": "C07D471/04",
"assignee_name": "Hoffman-La Roche Inc.",
"title": "Proline derivatives",
"abstract": "The invention relates to a compound of formula (I) wherein A, R 1 -R 6 are as defined in the description and in the claims. The compound of formula (I) can be used as a medicament."
}
}
Initially I used software that identifies, per entity type (e.g. COMPANY), all the terms that are written differently but mean the same thing. For example, the company "BMW" can be called "BMW Ag" as well as "BMW Group". Its output dictionary has a structure like this (only partially shown, otherwise it would be very long):
{
"RESP_META" : {
,"RESP_WARNINGS" : null
,"RESP_PAYLOAD":
{
"BIOCHEM": [
{
"hitID": "D011392",
"name": "L-Proline",
"frag_vector_array": [
"16#{!Proline!} derivatives"
],
...,
"sectionMeta": {
"8": "$.US-8163793-B2.title|"
}
},
{
(next hit...)
},
...
]
}
Given that the "sectionMeta" key gives me the patent ID and the field (for example abstract, title or assignee_name), I would like to use this information to find the patent in which the replacement should take place. Then, based on the "frag_vector_array" key, I want to find the word to be replaced, which is always between {! and !}, for example {!Proline!}, and replace it with the value of "name", for example L-Proline.
I've tried something to replace the company names, but I think I'm going the wrong way. Here is the code I started:
import json
patents = json.load(open("testset_patents.json"))
companies = json.load(open("termite_output.json"))
print(companies)
companies = companies['RESP_PAYLOAD']
# loop through companies data
for company in companies.values():
    company_list = company["COMPANY"]
    for comp in company_list:
        comp_name = comp["name"]
        # update patents "name" in "assignee_name"
        for patent in patents.values():
            patent['assignee_name'] = comp_name
print(patents)
# save output in new file
with open('company_replacement.json', 'w') as fp:
    json.dump(patents, fp)
Well any and all help is welcome.
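A minimal sketch of the approach described in the question, not a tested solution for the real file. It assumes that every hit carries "name", "frag_vector_array" and "sectionMeta" exactly as in the excerpt, that each "sectionMeta" value has the form "$.<patent id>.<field>|", and that the term to replace is whatever sits between {! and !}. The helper names parse_section_path and apply_hits are made up for illustration, and the sample data is cut down to a single hit.

```python
import re

# Matches the marked term inside a fragment, e.g. "{!Proline!}" -> "Proline".
MARKED_TERM = re.compile(r"\{!\s*(.*?)\s*!\}")


def parse_section_path(path):
    """Split "$.US-8163793-B2.title|" into ("US-8163793-B2", "title")."""
    cleaned = path.strip().rstrip("|")
    if cleaned.startswith("$."):
        cleaned = cleaned[2:]
    # The field is assumed to be the part after the last dot.
    patent_id, _, field = cleaned.rpartition(".")
    return patent_id, field


def apply_hits(patents, hits):
    """Replace each marked term with the hit's "name" in the referenced field."""
    for hit in hits:
        # Collect every term marked with {!...!} in this hit's fragments.
        terms = set()
        for fragment in hit.get("frag_vector_array", []):
            terms.update(MARKED_TERM.findall(fragment))
        for path in hit.get("sectionMeta", {}).values():
            patent_id, field = parse_section_path(path)
            text = patents.get(patent_id, {}).get(field)
            if not isinstance(text, str):
                continue  # unknown patent or field: leave it untouched
            for term in terms:
                text = text.replace(term, hit["name"])
            patents[patent_id][field] = text
    return patents


# Cut-down sample data based on the excerpts in the question.
patents = {"US-8163793-B2": {"title": "Proline derivatives"}}
hits = [
    {
        "name": "L-Proline",
        "frag_vector_array": ["16#{!Proline!} derivatives"],
        "sectionMeta": {"8": "$.US-8163793-B2.title|"},
    }
]

apply_hits(patents, hits)
print(patents["US-8163793-B2"]["title"])
```

With the real file, the list of hits for one entity type would presumably come from something like companies["RESP_PAYLOAD"]["BIOCHEM"]. Note that plain str.replace also matches a term inside an already normalized name, so running the script twice on the same data would turn "L-Proline" into "L-L-Proline".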

Related

How to write a Python script to automate API calls and retrieve a specific part of the result

I have a csv file of schools that contains one school per row for a total of 32091 schools. The name of the school is indicated in the 6th column, and the city code is indicated in the 7th column.
I would like to retrieve the latitude and longitude of the schools by using the geocoding API of the IGN (Institut Géographique National de France) whose documentation in French is here: https://geoservices.ign.fr/documentation/services/api-et-services-ogc/geocodage-beta-20/documentation-technique-de-lapi-de
This API allows me to indicate a string of characters as search terms, and to restrict the search with a filter on the city code. I have tested several queries and the results seem to be satisfactory. For example, for the school "ecole primaire privee st joseph de bonabry" located in Fougères (city code 35115), the following query:
https://wxs.ign.fr/essentiels/geoportail/geocodage/rest/0.1/search?q=ecole%20primaire%20privee%20st%20joseph%20de%20bonabry&index=poi&limit=1&returntruegeometry=false&postcode=35300
returns the following json:
{
"type": "FeatureCollection",
"features": [
{
"type": "Feature",
"properties": {
"postcode": [
"35300"
],
"citycode": [
"35115",
"35"
],
"city": [
"Fougères"
],
"toponym": "École Primaire Saint-Joseph de Bonabry",
"category": [
"area of activity or interest",
"primary education"
],
"extrafields": {
"cleabs": "SURFACTI0000000215529805",
"names": [
"saint joseph de bonabry elementary school"
]
},
"_score": 0.703030303030303,
"_type": "poi"
},
"geometry": {
"type": "Point",
"coordinates": [
-1.19610139955834,
48.3550652629677
]
}
}
]
}
So the coordinates to extract are located here: {"features":[{ "geometry":{"coordinates":[lon, lat]}}]}
I would like to go through a Python script to automate the task. From what I understand, the steps could be as follows:
Open the CSV
Read the value contained in the sixth column
Perform an http get request for each row, changing the URL based on the value in the sixth column
Extract longitude and latitude from the results
Update the longitude and latitude columns (already existing) with the previously extracted values.
Pandas allows me to read the CSV while Requests allows me to formulate the query. Being a beginner in programming, I don't really know how to write the script. I guess it can start this way:
import pandas as pd
import requests
df = pd.read_csv("myfile.csv")
...but I'm stuck on what to do next. I guess a loop would allow me to repeat the request, but how do you change the URL terms? In general, any help on the whole script will be greatly appreciated!
This is how I would do it.
Replace "name" and "post" with the actual column names from your CSV
import pandas as pd
import requests
# read the data CSV
# you have to replace "name" and "post" with the actual column names
df = pd.read_csv("data.csv", usecols=["name", "post"])
# define the request URL
url = "https://wxs.ign.fr/essentiels/geoportail/geocodage/rest/0.1/search"
# API call for each element
for i in range(len(df["name"])):
    # prepare the name for URL
    genName = df["name"][i].replace(" ", "%20")
    print(genName)
    # prepare request
    request = url + "?q=" + genName + "&index=poi&limit=1&returntruegeometry=false&postcode=" + str(df["post"][i])
    print(request)
    # do the request
    r = requests.get(request)
    # response
    result = r.text
    print(result)
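The loop above prints the raw response but does not yet pull out the coordinates. Below is a rough sketch of the two missing pieces: building a properly escaped URL and reading [lon, lat] from the parsed JSON. It uses only the standard library so it can be tried without network access. The function names build_search_url and extract_lon_lat are made up for this example, and the sample response is a trimmed copy of the one in the question.

```python
from urllib.parse import quote, urlencode

SEARCH_URL = "https://wxs.ign.fr/essentiels/geoportail/geocodage/rest/0.1/search"


def build_search_url(name, postcode):
    """Build the search URL, percent-encoding spaces and accented characters."""
    query = {
        "q": name,
        "index": "poi",
        "limit": 1,
        "returntruegeometry": "false",
        "postcode": postcode,
    }
    return SEARCH_URL + "?" + urlencode(query, quote_via=quote)


def extract_lon_lat(payload):
    """Return (lon, lat) of the first feature, or (None, None) if there is none."""
    features = payload.get("features") or []
    if not features:
        return None, None
    coordinates = features[0].get("geometry", {}).get("coordinates") or []
    if len(coordinates) < 2:
        return None, None
    return coordinates[0], coordinates[1]


# Trimmed copy of the response shown in the question.
sample_response = {
    "type": "FeatureCollection",
    "features": [
        {
            "type": "Feature",
            "geometry": {
                "type": "Point",
                "coordinates": [-1.19610139955834, 48.3550652629677],
            },
        }
    ],
}

print(build_search_url("ecole primaire privee st joseph de bonabry", 35300))
print(extract_lon_lat(sample_response))
```

Inside the loop, r.json() gives the parsed response, so lon, lat = extract_lon_lat(r.json()) could be followed by writing the values back with df.loc[i, ...], using whatever the existing longitude and latitude columns are called. Alternatively, requests.get(url, params=...) takes the same dictionary and does the escaping itself. With 32091 rows, a short pause between requests and a check of r.status_code are probably worth adding.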

Export Json to CSV with missing key in a dict via get()

I am very new to Python and need to use it to get the following done.
I am using Python to read the values from key/value pairs (the JSON response I get from an API call). Some of the entries have a value while others do not. An example of the JSON response is below:
"attributes": [
{
"key": "TK_GENIE_ACTUAL_TOTAL_HOURS_EXCLUDE_CORRECTIONS",
"alias": "Annual Leave"
},
{
"key": "TK_GENIE_ACTUAL_TOTAL_HOURS_EXCLUDE_CORRECTIONS",
"alias": "Other Non-Prod Hours"
},
{
"key": "TK_GENIE_ACTUAL_TOTAL_HOURS_EXCLUDE_CORRECTIONS",
"alias": "Non-Prod Hours"
},
{
"key": "EMP_COMMON_PRIMARY_JOB",
"alias": "Primary Job",
"rawValue": "RN",
"value": "RN"
},
{
"key": "TIMECARD_TRANS_APPLY_DATE",
"alias": "Apply Date",
"rawValue": "2022-05-19",
"value": "19/05/2022"
},
The above is one of several children in a nested structure. As you can see, this one has no value for "Annual Leave"; other children might have a valid value for "Annual Leave".
I am exporting this information to CSV, with "alias" as the column name and "value" as the row value.
So I am using the Python code below to extract the value for each key and write it to the CSV in the specified column.
al=item['attributes'][0]['value']
date=item['attributes'][4]['value']
spamwriter.writerow([al,'2','3',date,'5'])
However, it raised an error:
File "jsoncsv.py", line 47, in <module>
al=item['attributes'][0]['value']
KeyError: 'value'
I think I understand the error: there is no value for this particular "Annual Leave" key.
But how do I say something like: if there is no value for this key, then value = 0, put 0 in the CSV under the "Annual Leave" column, and move on to the next one (which is "Other Non-Prod Hours", which also has no value in this case, but might have one for some other children)?
I found get(), but I am not sure how I should code it. I was trying the code below:
value=it.get('value')
if len(value)>0:
AL=item['attributes'][0]value
Date=item['attributes'][4]value
spamwriter.writerow([AL,'2','3',date,'5'])
But result is syntax error.
Could please any Python expert provide help.
Much Appreciated.
WB
As user @richarddodson pointed out, you can do this:
al = item['attributes'][0].get('value', 0)
Instead of 0, you may want to consider using None, which is a clear indication that there is no value, avoiding confusion with the case where value actually is 0, i.e.:
al = item['attributes'][0].get('value', None)
Which is the same as:
al = item['attributes'][0].get('value')
As .get() returns None if there is no value to get.
A more explicit way to do the same would be:
al = item['attributes'][0]['value'] if 'value' in item['attributes'][0] else None
But the solution using .get() is simpler and faster and thus probably preferable.
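As a small illustration of .get() applied to the whole row, here is a sketch that looks attributes up by "alias" instead of by list position, so a missing "value" becomes a default. The sample item is a trimmed copy of the JSON in the question. The helper name values_by_alias, the chosen columns, and the in-memory buffer that stands in for the real CSV file are all assumptions made for this example.

```python
import csv
import io

# Trimmed copy of the "attributes" list from the question.
item = {
    "attributes": [
        {"key": "TK_GENIE_ACTUAL_TOTAL_HOURS_EXCLUDE_CORRECTIONS", "alias": "Annual Leave"},
        {"key": "EMP_COMMON_PRIMARY_JOB", "alias": "Primary Job", "rawValue": "RN", "value": "RN"},
        {"key": "TIMECARD_TRANS_APPLY_DATE", "alias": "Apply Date", "rawValue": "2022-05-19", "value": "19/05/2022"},
    ]
}


def values_by_alias(attributes, default=0):
    """Map each alias to its value, using `default` when "value" is missing."""
    return {attr["alias"]: attr.get("value", default) for attr in attributes}


row = values_by_alias(item["attributes"])

# Write one CSV row; an in-memory buffer stands in for the real output file.
columns = ["Annual Leave", "Primary Job", "Apply Date"]
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=columns, extrasaction="ignore")
writer.writeheader()
writer.writerow(row)
print(buffer.getvalue())
```

Looking up by alias avoids relying on [0] and [4] always pointing at the same attributes, which may or may not hold for every child in the response.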

Read data elements from JSON

I am trying to extract data elements from a JSON URL using Python. Below is the code. It only works partially when I try to extract elements:
response = urllib.request.urlopen(url)
data = json.loads(response.read())
print("planId",data[0]["planId"]) #Gives result as planId PWR93173MBE1
print("postcode",data[0]["postcode"]) # Gives result as postcode 2000
print("tariffType", data[0]["tariffType"]) # This gives me an error.
Also, if I want to extract other elements such as PlanType and other fields in Fees, how can I do it?
[
{
"planData":{
"planType":"M",
"tariffType":"SR",
"contract":[
{
"pricingModel":"SR",
"benefitPeriod":"Ongoing",
"coolingOffDays":10,
"additionalFeeInformation":"This offer provides access to wholesale prices, utilises your Powerbank to smooth wholesale market volatility and Powerwatch to warn of higher prices. For more information on this and any other standard fees, visit our website www.powerclub.com.au",
"fee":[
{
"description":"Annual Membership payable each year for each of your business premises taking supply.",
"amount":79,
"feeType":"MBSF",
"percent":0,
"feeTerm":"A"
},
{
"description":"Cost for providing a paper bill",
"amount":2.5,
"feeType":"PBF",
"percent":0,
"feeTerm":"F"
},
{
"description":"Disconnection fee",
"amount":59.08,
"feeType":"DiscoF",
"percent":0,
"feeTerm":"F"
},
{
"description":"Reconnection Fee",
"amount":59.08,
"feeType":"RecoF",
"percent":0,
"feeTerm":"F"
},
{
"description":"Meter Read - Requested by Customer",
"amount":12.55,
"feeType":"OF",
"percent":0,
"feeTerm":"F"
}
]
}
]
},
"planId":"PWR93173MBE1",
"planType":"E#B#PWR93173MBE1",
"postcode":2000
}
]
The tariffType property sits inside the planData property, so you need to do something like
print("tariffType", data[0]["planData"]["tariffType"])
You forgot one level of nesting. The correct line is:
print("tariffType", data[0]["planData"]["tariffType"])
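For the second part of the question (planType and the fields in the fees), the same idea applies one level at a time. The sketch below uses a heavily trimmed copy of the data and assumes the nesting is planData -> contract (a list) -> fee (a list), which is how the sample reads; the variable name fee_amounts is only illustrative.

```python
# Heavily trimmed copy of the JSON from the question (nesting is assumed).
data = [
    {
        "planData": {
            "planType": "M",
            "tariffType": "SR",
            "contract": [
                {
                    "pricingModel": "SR",
                    "fee": [
                        {"description": "Cost for providing a paper bill", "amount": 2.5, "feeType": "PBF"},
                        {"description": "Disconnection fee", "amount": 59.08, "feeType": "DiscoF"},
                    ],
                }
            ],
        },
        "planId": "PWR93173MBE1",
        "postcode": 2000,
    }
]

plan = data[0]["planData"]
print("planType", plan["planType"])

# "contract" and "fee" are lists, so loop over them instead of indexing by key.
for contract in plan["contract"]:
    for fee in contract["fee"]:
        print(fee["feeType"], fee["amount"], fee["description"])

# Collect the fee amounts by type for later use.
fee_amounts = {
    fee["feeType"]: fee["amount"]
    for contract in plan["contract"]
    for fee in contract["fee"]
}
```

If two fees ever share the same feeType, the dictionary keeps only the last one, so a list of tuples may be the safer container in that case.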

Importing JSON data from a file as class instances

class Company:
    def __init__(self, company_id, name):
        self.company_id = company_id
        self.name = name
I have a class named Company (class shown above), and I am trying to add companies into this program using this JSON file:
[
{
"id": "companyA",
"name": "Apple Farm"
},
{
"id": "companyB",
"name": "Buzzing Bees"
},
{
"id": "companyC",
"name": "Cat Corporate"
},
{
"id": "companyD",
"name": "Dog Dentists Limited"
}
]
Right now, I know how to add each company individually like this:
c = Company(company[0]['id'], company[0]['name'])
However, I want to create all of them with a for loop which creates these companies, and so that I can access them by their company_id.
How can I do this?
I didn't understand the last part of the question, where you say you want to "call them by company IDs".
If you don't want to handle each object independently and reference it by its own variable name, storing them in a dict is a useful approach. A dictionary comprehension is the easiest way to do it.
company_objects = {company['id']: Company(company['id'], company['name']) for company in companies}
Now you should be able to access them as company_objects['companyA']
If you change the JSON definition slightly as below:
{
"company_id": "companyA",
"name": "Apple Farm"
}
Because the keys then match the parameter names of __init__, you can simplify the dictionary comprehension further:
{company['company_id']: Company(**company) for company in companies}
You can first load the JSON data with the json library:
import json
with open("path/to/file.json") as jsonFile: # Replace "path/to/file.json" with the path to your JSON file
    companies = json.load(jsonFile)
Then, you can create a dictionary to store the companies:
companyDict = {}
You then iterate through all the companies loaded from the JSON file and, for every dictionary in the list, create a Company:
for company in companies:
    companyDict[company["id"]] = Company(company["id"], company["name"])
You can then access your company like this:
companyA = companyDict["companyA"] # You can replace the key with the company ID
From the looks of your class, it looks as though it is not meant to contain information pertaining to multiple companies. Rather, you should create a separate class instance for each company.
Using the object you provided as such:
companies = [
{
"id": "companyA",
"name": "Apple Farm"
},
{
"id": "companyB",
"name": "Buzzing Bees"
},
{
"id": "companyC",
"name": "Cat Corporate"
},
{
"id": "companyD",
"name": "Dog Dentists Limited"
}
]
you can create an instance for each company as such:
company_classes = [Company(company["id"], company["name"]) for company in companies]
This will create a list of Company objects.
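Putting the pieces from the answers together, a self-contained sketch could look like this. The JSON is read from a string here so the example runs without a file; with a real file, json.load on the open file object would take the place of json.loads. The name companies_by_id is just an illustrative choice.

```python
import json


class Company:
    def __init__(self, company_id, name):
        self.company_id = company_id
        self.name = name


# Shortened copy of the JSON from the question, inlined so the example runs as-is.
raw = """
[
    {"id": "companyA", "name": "Apple Farm"},
    {"id": "companyB", "name": "Buzzing Bees"}
]
"""

records = json.loads(raw)

# One Company instance per record, keyed by its ID for direct lookup.
companies_by_id = {
    record["id"]: Company(record["id"], record["name"]) for record in records
}

print(companies_by_id["companyA"].name)
```

A dict suits the "access them by their company_id" requirement; a list, as in the last answer, is enough if the companies are only ever iterated over.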

Json organization

I use JSON for one of my projects. For example, I have the following JSON structure:
{
"address":{
"streetAddress": {
"aptnumber" : "21",
"building_number" : "2nd",
"street" : "Wall Street"
},
"city":"New York"
},
"phoneNumber":
[
{
"type":"home",
"number":"212 555-1234"
}
]
}
Now I have a bunch of modules using this structure, and they expect to see certain fields in the received JSON. For the example above, I have two files: address_manager and phone_number_manager. Each will be passed the relevant information. So address_manager will expect a dict that has the keys 'streetAddress' and 'city'.
My question is: Is it possible to set up a constant structure so that every time I change the name of a field in my JSON structure (e.g. I want to change 'streetAddress' to 'address'), I don't have to make changes in several places?
My naive approach is to have a bunch of constants (e.g.
ADDRESS = "address"
ADDRESS_STREET_ADDRESS = "streetAddress"
..etc..
) so that if I want to change the name of one of the fields in my JSON structure, I only have to make the change in one place. However, this seems very unwieldy because my constant names would get terribly long once I reach the third or fourth layer of the JSON structure (e.g. ADDRESS_STREETADDRESS_APTNUMBER, ADDRESS_STREETADDRESS_BUILDINGNUMBER)
I am doing this in python, but any generic answer would be OK.
Thanks.
Like Cameron Sparr suggested in a comment, don't have your constant names include all levels of your JSON structure. If you have the same data in multiple places, it will actually be better if you reuse the same constant. For example, suppose your JSON has a phone number included in the address:
{
"address": {
"streetAddress": {
"aptnumber" : "21",
"building_number" : "2nd",
"street" : "Wall Street"
},
"city":"New York",
"phoneNumber":
[
{
"type":"home",
"number":"212 555-1234"
}
]
},
"phoneNumber":
[
{
"type":"home",
"number":"212 555-1234"
}
]
}
Why not have a single constant PHONES = 'phoneNumber' that you use in both places? Your constants will have shorter names, and it is more logically coherent. You would end up using it like this (assuming JSON is stored in person):
person[ADDRESS][PHONES][x] # Phone numbers associated with that address
person[PHONES][x] # Phone numbers associated with the person
Instead of
person[ADDRESS][ADDRESS_PHONES][x]
person[PHONE_NUMBERS][x]
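One way to keep such constants together is a small namespace class, sketched below. The class name Fields, the constant names and the two helper functions are illustrative assumptions, not something from the question; the sample person is a shortened copy of the JSON above.

```python
class Fields:
    """Single place where the JSON field names are spelled out."""

    ADDRESS = "address"
    STREET_ADDRESS = "streetAddress"
    APT_NUMBER = "aptnumber"
    CITY = "city"
    PHONES = "phoneNumber"
    NUMBER = "number"


# Shortened copy of the example structure from the answer.
person = {
    "address": {
        "streetAddress": {"aptnumber": "21", "street": "Wall Street"},
        "city": "New York",
        "phoneNumber": [{"type": "home", "number": "212 555-1234"}],
    },
    "phoneNumber": [{"type": "home", "number": "212 555-1234"}],
}


def apartment(person):
    """Read the apartment number through the shared constants."""
    return person[Fields.ADDRESS][Fields.STREET_ADDRESS][Fields.APT_NUMBER]


def phone_numbers(container):
    """Works for any dict that may carry a phone list (person or address)."""
    return [entry[Fields.NUMBER] for entry in container.get(Fields.PHONES, [])]


print(apartment(person))
print(phone_numbers(person[Fields.ADDRESS]))
```

Renaming a field in the JSON then means changing one string in Fields, and the same PHONES constant serves both levels, as suggested above.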
You can write a script that, whenever you change a constant, applies the same change to the structure in all your JSON files.
Example:
import json
# (old key name, new key name)
CHANGE = ('street', 'streetAddress')
with open('file.json') as jfile:
    json_data = json.load(jfile)
# move the value to the new key and drop the old one (top-level keys only)
if CHANGE[0] in json_data:
    json_data[CHANGE[1]] = json_data.pop(CHANGE[0])
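Since the fields in the question are nested, a rename usually has to walk the whole structure rather than only the top level. Here is a minimal recursive sketch; the function name rename_key and the new key name "street_address" are made up for the example, and it renames every key with the old name regardless of where it appears.

```python
def rename_key(node, old, new):
    """Return a copy of `node` in which every dict key `old` is renamed to `new`."""
    if isinstance(node, dict):
        return {
            (new if key == old else key): rename_key(value, old, new)
            for key, value in node.items()
        }
    if isinstance(node, list):
        return [rename_key(element, old, new) for element in node]
    return node  # strings, numbers, None: nothing to rename


# Shortened copy of the structure from the question.
person = {
    "address": {
        "streetAddress": {"aptnumber": "21", "street": "Wall Street"},
        "city": "New York",
    },
    "phoneNumber": [{"type": "home", "number": "212 555-1234"}],
}

renamed = rename_key(person, "streetAddress", "street_address")
print(renamed["address"]["street_address"]["street"])
```

The result can be written back with json.dump. Because the function returns a new structure, the original data stays unchanged until you decide to overwrite the file.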
