Extract from dynamic JSON response with Scrapy


I want to extract the 'avail' value from JSON output that looks like this:
{
"result": {
"code": 100,
"message": "Command Successful"
},
"domains": {
"yolotaxpayers.com": {
"avail": false,
"tld": "com",
"price": "49.95",
"premium": false,
"backorder": true
}
}
}
The problem is that the ['avail'] value is under ["domains"]["domain_name"] and I can't figure out how to get the domain name.
My spider is below. The first part works fine, but not the second one.
import scrapy
import json
from whois.items import WhoisItem
class whoislistSpider(scrapy.Spider):
    name = "whois_list"
    start_urls = []
    f = open('test.txt', 'r')
    global lines
    lines = f.read().splitlines()
    f.close()
    def __init__(self):
        for line in lines:
            self.start_urls.append('http://www.example.com/api/domain/check/%s/com' % line)
    def parse(self, response):
        for line in lines:
            jsonresponse = json.loads(response.body_as_unicode())
            item = WhoisItem()
            domain_name = list(jsonresponse['domains'].keys())[0]
            item["avail"] = jsonresponse["domains"][domain_name]["avail"]
            item["domain"] = domain_name
            yield item
Thank you in advance for your replies.

Currently, it tries to get the value by the "('%s.com' % line)" key.
You need to do the string formatting correctly:
domain_name = "%s.com" % line.strip()
item["avail"] = jsonresponse["domains"][domain_name]["avail"]

Assuming you are only expecting one result per response:
domain_name = list(jsonresponse['domains'].keys())[0]
item["avail"] = jsonresponse["domains"][domain_name]["avail"]
This will work even if there is a mismatch between the domain in the file "test.txt" and the domain in the result.
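If a response could ever contain more than one domain, iterating over the "domains" mapping avoids having to know or guess the key. A minimal sketch using the sample response from the question (the `sample` string and the `availability` name are illustrative and not part of the original spider):

```python
import json

# Sample payload copied from the question; in the spider this would be the response body.
sample = '''
{
  "result": {"code": 100, "message": "Command Successful"},
  "domains": {
    "yolotaxpayers.com": {"avail": false, "tld": "com", "price": "49.95",
                          "premium": false, "backorder": true}
  }
}
'''

jsonresponse = json.loads(sample)

# Map each returned domain name to its "avail" flag, whatever the key happens to be.
availability = {
    domain_name: info["avail"]
    for domain_name, info in jsonresponse["domains"].items()
}
print(availability)
```

Inside `parse`, the same loop over `.items()` could yield one item per domain instead of building a dict.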

To get the domain name from the JSON response above, you can use a list comprehension, e.g.:
domain_name = [x for x in jsonresponse["domains"].keys()]
To get the "avail" value, use the same method, e.g.:
avail = [x["avail"] for x in jsonresponse["domains"].values() if "avail" in x]
Both results are lists, so to get the individual values take index 0, e.g. domain_name[0] and avail[0].
More info on list comprehension

Related

Reading JSON data in Python using Pagination, max records 100

I am trying to extract data from a REST API using Python and put it into one neat JSON file, and I am having difficulty. The data is rather lengthy, with a total of nearly 4,000 records, but the maximum number of records the API allows per request is 100.
I've tried using some other examples to get through the code, and so far this is what I'm using (censoring the API URL and auth key, for the sake of confidentiality):
import requests
import json
from requests.structures import CaseInsensitiveDict
url = "https://api.airtable.com/v0/CENSORED/Vendors?maxRecords=100"
headers = CaseInsensitiveDict()
headers["Authorization"] = "Bearer CENSORED"
resp = requests.get(url, headers=headers)
resp.content.decode("utf-8")
vendors = []
new_results = True
page = 1
while new_results:
    centiblock = requests.get(url + f"&page={page}", headers=headers).json()
    new_results = centiblock.get("results", [])
    vendors.extend(centiblock)
    page += 1
full_directory = json.dumps(vendors, indent=4)
print(full_directory)
For the life of me, I cannot figure out why it isn't working. The output keeps coming out as just:
[
"records"
]
If I play around with the print statement at the end, I can get it to print centiblock (so named for being a block of 100 records at a time) just fine: it gives me 100 records in unformatted text. However, if I try printing vendors at the end, the output is:
['records']
...which leads me to guess that somehow, the vendors array is not getting filled with the data. I suspect that I need to modify the get request where I define new_results, but I'm not sure how.
For reference, this is a censored look at how the json data begins, when I format and print out one centiblock:
{
"records": [
{
"id": "XXX",
"createdTime": "2018-10-15T19:23:59.000Z",
"fields": {
"Vendor Name": "XXX",
"Main Phone": "XXX",
"Street": "XXX",
Can anyone see where I'm going wrong?
Thanks in advance!

When you extend vendors with centiblock, you are passing a dict to extend. extend expects an iterable, so that works, but iterating over a Python dict only yields its keys. In this case, that is ['records'].
Note as well that your loop condition becomes False after the first iteration: "results" is not a key in the API output, so centiblock.get("results", []) returns [], and an empty list is falsy.
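Both effects can be reproduced without calling the API at all. A small sketch, where the `centiblock` dict is a made-up stand-in for one page of the response:

```python
vendors = []
centiblock = {"records": [{"id": "a"}, {"id": "b"}]}  # made-up stand-in for one API page

# Iterating a dict yields its keys, so extend() adds the key, not the records.
vendors.extend(centiblock)
print(vendors)

# "results" is not a key here, so .get() returns the default [], which is falsy.
new_results = centiblock.get("results", [])
print(bool(new_results))
```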
To correct those errors, read the correct field from the API response into new_results, and extend vendors with new_results, which is itself a list. On the last iteration new_results will be the empty list, so nothing extra is added to vendors and it will contain exactly what you need. This should look like:
import requests
import json
from requests.structures import CaseInsensitiveDict
url = "https://api.airtable.com/v0/CENSORED/Vendors?maxRecords=100"
headers = CaseInsensitiveDict()
headers["Authorization"] = "Bearer CENSORED"
resp = requests.get(url, headers=headers)
resp.content.decode("utf-8")
vendors = []
new_results = True  # truthy placeholder so the loop runs at least once
page = 1
while new_results:
    centiblock = requests.get(url + f"&page={page}", headers=headers).json()
    # "records" is the key the API actually returns; .get() falls back to []
    new_results = centiblock.get("records", [])
    vendors.extend(new_results)
    page += 1
full_directory = json.dumps(vendors, indent=4)
print(full_directory)
Note that the loop condition stays as while new_results. Writing while len(new_results) > 0 is equivalent once new_results is a list, but it would raise a TypeError on the very first check, because new_results starts out as True and a bool has no len().
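The same loop can be written without a placeholder value by breaking out when a page comes back empty. The sketch below is only an illustration: `collect_records`, `fake_fetch`, and `fake_pages` are hypothetical names, and the HTTP call is replaced by an in-memory stand-in so the logic can be run offline. Whether the real API honors a page parameter is not confirmed in this thread; that part is carried over from the question's code as an assumption.

```python
def collect_records(fetch_page):
    """Call fetch_page(1), fetch_page(2), ... until a page has no records."""
    records = []
    page = 1
    while True:
        block = fetch_page(page)
        new_results = block.get("records", [])
        if not new_results:
            break  # an empty page means there is nothing left to read
        records.extend(new_results)
        page += 1
    return records


# Hypothetical stand-in for the HTTP call: two pages of data, then empty pages.
fake_pages = {1: {"records": [{"id": 1}, {"id": 2}]}, 2: {"records": [{"id": 3}]}}

def fake_fetch(page):
    return fake_pages.get(page, {"records": []})

all_records = collect_records(fake_fetch)
print(len(all_records))
```

In real use, `fetch_page` would wrap the `requests.get(...).json()` call from the answer.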

Error when defining a dictionary path as a variable: TypeError: string indices must be integers

I get this error "TypeError: string indices must be integers" when defining a variable.
def updateJson(fileName, pathToValue, updatedValue):
    # Opening JSON file
    f = open(fileName)
    # returns JSON object as a dictionary
    data = json.load(f)
    # Changes the ID value in JSON
    data[pathToValue] = updatedValue
    f.close()
    with open("template3.json", "w") as outfile:
        json.dump(data, outfile)
x = ['Something 1'][0]['ID']
updateJson("Temp\\random.json", x, 9)
JSON:
{
"Something 1": [
{
"ID": "placeholder",
"Music": "placeholder"
}
]
}
But if I don't pass it as variable and just use it in code like this: data['Something 1'][0]['ID'] = updatedValue it works as expected.
What I have tried:
Wrapping the variable in "", (), {} and some other minor things, in which case it kinda works, but the path gets interpreted wrong, and I can't successfully target the ID value in JSON.

The problem has nothing to do with your JSON.
Consider the following code:
y = "Some string"["ID"]
This wouldn't work, right? Something like y = "Some string"[1] would set y equal to "o", but indexing a string with "ID" is nonsensical.
When you are defining x, this is what's happening. Let's break it down:
x = ["Something 1"]
# x is a list, containing a single string
x = ["Something 1"][0]
# x is the first element of the list ["Something 1"], so x = "Something 1" - see for yourself!
x = ["Something 1"][0]["ID"]
# TypeError! This is equivalent to:
x = "Something 1"["ID"]
To get the functionality you're looking for, we need another way to pass this pathToValue. One way to do this is to pass the different parts as different parameters:
def updateJson(fileName, pathMain, pathIndex, pathMinor, updatedValue):
    ...
    data[pathMain][pathIndex][pathMinor] = updatedValue
    ...
updateJson("Temp\\random.json", "Something 1", 0, "ID", 9) # Would work
However, this would only work if your JSON file has a very consistent structure.
A slightly more concise way to do this would be:
def updateJson(fileName, pathToValue, updatedValue):
    ...
    pathMain, pathIndex, pathMinor = pathToValue # Extract the different components of pathToValue from the list
    data[pathMain][pathIndex][pathMinor] = updatedValue
    ...
x = ["Something 1", 0, "ID"]
updateJson("Temp\\random.json", x, 9) # Would work
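If the depth of the path is not always three, the list of keys can be walked in a loop instead of being unpacked into fixed names. A minimal sketch, assuming every element of the path except the last already exists in the data; `set_by_path` is a hypothetical helper name, not something from the question:

```python
import json

def set_by_path(data, path, value):
    """Walk `path` (a sequence of keys/indices) and assign `value` at the end."""
    target = data
    for key in path[:-1]:
        target = target[key]  # descend one level per path element
    target[path[-1]] = value

# Same structure as the JSON file in the question, loaded from a string here.
data = json.loads('{"Something 1": [{"ID": "placeholder", "Music": "placeholder"}]}')
set_by_path(data, ["Something 1", 0, "ID"], 9)
print(data)
```

Inside updateJson, the assignment line would become a call to this helper with the loaded data.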

The command below does the following:
Creates a list with one item of type str and value "Something 1"
Takes the first element of the list ("Something 1")
Tries to get the element "ID" from "Something 1", which raises the error
x = ['Something 1'][0]['ID']
You will need to get these values from another object, one that holds the JSON data you expect.
Try instead to define a function that applies the path to the right variable. Like this:
def updateJson(fileName, update, updatedValue):
    # Opening JSON file
    f = open(fileName)
    # returns JSON object as a dictionary
    data = json.load(f)
    # Changes the ID value in JSON
    update(data, updatedValue)
    f.close()
    with open("template3.json", "w") as outfile:
        json.dump(data, outfile)
# update() overwrites "ID" even when the key already exists (setdefault would leave it unchanged)
x = lambda data, value: data["Something 1"][0].update({"ID": value})
updateJson("Temp\\random.json", x, 9)

CSV to json convert

I have this data in .csv format:
I want to convert it into .json format like this :
{
"title": "view3",
"sharedWithOrganization": false,
"sharedWithUsers": [
"81241",
"81242",
"81245",
"81265"
],
"filters": [{"field":"Account ID","comparator":"==","value":"prod"}]
},
{
"title": "view3",
"sharedWithOrganization": true,
"sharedWithUsers": [],
"filters": [{"field":"Environment_AG","comparator":"=#","value":"Development"}]
}
The comparator values convert as follows:
'equals' means '=='
'not equal' means '!='
'contains' means '=#'
'does not contain' means '!=#'
Can you please help me convert the .csv to .json? I have not been able to do the conversion in Python.

Here is what I would do, without giving you the full answer (doing it yourself is better for learning).
First: create an object containing your information.
class View():
    def __init__(self, title, field, comparator, value, sharedWithOrganization, user1, user2, user3, user4, user5, user6):
        self.title = title
        self.field = field
        self.comparator = comparator
        self.value = value
        self.sharedWithOrganization = sharedWithOrganization
        self.user1 = user1
        ...
        self.user6 = user6
Then I would load the CSV, create an object for each line, and store them in a dict with the following structure:
loadedCsv = { "Your line title (ex : view3)" : [List of all the objects with the title view3] }
Yes, with this approach the title is stored redundantly; you can choose to remove it from the object.
When this is done, I would, for each title in the dictionary, get all the elements I need and format them as JSON using "import json" (cf. the Python documentation: https://docs.python.org/3/library/json.html).
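The grouping step described above might look roughly like the sketch below. It uses plain dicts from csv.DictReader rather than a View class, and the CSV text and its column names are invented for illustration, since the original file is not shown in the question:

```python
import csv
import io
import json
from collections import defaultdict

# Made-up CSV; the real column names come from the file in the question.
csv_text = """title,field,comparator,value
view3,Account ID,equals,prod
view3,Environment_AG,contains,Development
view4,Region,not equal,EU
"""

# Group every row under its title, as in the loadedCsv structure.
rows_by_title = defaultdict(list)
for row in csv.DictReader(io.StringIO(csv_text)):
    rows_by_title[row["title"]].append(row)

print(json.dumps(rows_by_title, indent=4))
```

With a real file, `io.StringIO(csv_text)` would be replaced by the opened file object.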

Here is my attempt at your question. I hope you and others find it helpful.
But I would encourage you to try it yourself first.
import csv
import json
# Map the CSV comparator wording to the symbols used in the JSON output
COMPARATORS = {
    "equals": "==",
    "not equal": "!=",
    "contains": "=#",
    "does not contain": "!=#",
}
def csv_to_json(csvFilePath, jsonFilePath):
    jsonArray = []
    with open(csvFilePath, encoding='utf-8') as csvf:
        csvReader = csv.DictReader(csvf)
        for row in csvReader:
            # Collect user1..user6, skipping empty cells
            users = [row["user%d" % i] for i in range(1, 7) if row["user%d" % i]]
            final_data = {
                "title": row["title"],
                # bool("false") would be True, so compare the text instead
                "sharedWithOrganization": row["sharedWithOrganization"].strip().lower() == "true",
                "sharedWithUsers": users,
                "filters": [{
                    "field": row["field"],
                    "comparator": COMPARATORS.get(row["comparator"], row["comparator"]),
                    "value": row["value"],
                }],
            }
            jsonArray.append(final_data)
    with open(jsonFilePath, 'w', encoding='utf-8') as jsonf:
        jsonString = json.dumps(jsonArray, indent=4)
        jsonf.write(jsonString)
csvFilePath = r'test.csv'
jsonFilePath = r'test11.json'
csv_to_json(csvFilePath, jsonFilePath)
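One detail worth checking when reading a column such as sharedWithOrganization: every CSV cell arrives as a string, and bool() on any non-empty string is True, including "false". A small illustration, where `parse_bool` is a hypothetical helper and the accepted spellings are an assumption about how the file writes its booleans:

```python
def parse_bool(text):
    """Interpret common CSV spellings of a boolean; anything else is False."""
    return text.strip().lower() in ("true", "1", "yes")

print(bool("false"))        # any non-empty string is truthy
print(parse_bool("false"))
print(parse_bool("TRUE"))
```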

Printing dictionary from inside a list puts one character on each line

Yes, yet another. I can't figure out what the issue is. I'm trying to iterate over a list that is a subsection of JSON output from an API call.
This is the section of JSON that I'm working with:
[
{
"created_at": "2017-02-22 17:20:29 UTC",
"description": "",
"id": 1,
"label": "FOO",
"name": "FOO",
"title": "FOO",
"updated_at": "2018-12-04 16:37:09 UTC"
}
]
The code that I'm running that retrieves this and displays it:
#!/usr/bin/python
import json
import sys
try:
    import requests
except ImportError:
    print "Please install the python-requests module."
    sys.exit(-1)
SAT_API = 'https://satellite6.example.com/api/v2/'
USERNAME = "admin"
PASSWORD = "password"
SSL_VERIFY = False # Ignore SSL for now
def get_json(url):
    # Performs a GET using the passed URL location
    r = requests.get(url, auth=(USERNAME, PASSWORD), verify=SSL_VERIFY)
    return r.json()
def get_results(url):
    jsn = get_json(url)
    if jsn.get('error'):
        print "Error: " + jsn['error']['message']
    else:
        if jsn.get('results'):
            return jsn['results']
        elif 'results' not in jsn:
            return jsn
        else:
            print "No results found"
    return None
def display_all_results(url):
    results = get_results(url)
    if results:
        return json.dumps(results, indent=4, sort_keys=True)
def main():
    orgs = display_all_results(KATELLO_API + "organizations/")
    for org in orgs:
        print org
if __name__ == "__main__":
    main()
I appear to be missing a concept because when I print org I get each character per line such as
[
{
"
c
r
e
a
t
e
d
_
a
t
"
It does this through to the final ]
I've also tried to print org['name'] which throws the TypeError: list indices must be integers, not str Python error. This makes me think that org is being seen as a list rather than a dictionary which I thought it would be due to the [{...}] format.
What concept am I missing?
EDIT: An explanation for why I'm not getting this: I'm working with a script in the Red Hat Satellite API Guide which I'm using to base another script on. I'm basically learning as I go.

display_all_results returns a string, because json.dumps(results, indent=4, sort_keys=True) serializes the parsed data (which came from r.json() in get_json) into a string.
You then end up iterating over the characters of that string in main, which is why you see one character per line.
Instead, just return results from display_all_results and the code will work as intended:
def display_all_results(url):
    # results is already parsed data (here, a list of dictionaries), so just return it
    results = get_results(url)
    if results:
        return results
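The difference is easy to see in isolation. The sketch below uses Python 3 syntax and a made-up `results` list shaped like the JSON in the question:

```python
import json

results = [{"id": 1, "name": "FOO"}]  # made-up data shaped like the API output

# json.dumps returns text, so looping over it walks through single characters.
as_text = json.dumps(results, indent=4)
print(type(as_text).__name__)
print(list(as_text)[:3])

# Looping over the parsed list yields dictionaries that can be indexed by key.
for org in results:
    print(org["name"])
```

Keeping the parsed data for processing and calling json.dumps only when printing avoids the problem.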

orgs is the result of json.dumps, which produces a string. So instead of this code:
for org in orgs:
    print(org)
simply print the whole string:
print(orgs)

python + json: parse to list

I'm somewhat new to parsing JSON data with Python (using Python 2.7). There is a service that I have to send API calls to, and the JSON response looks something like what I have below. The number of items in 'row' can vary. What I need to do is take only the 'content' from the second entry of 'FL', if there is a second entry, and put it into a list. Essentially, it is a list of only the 'campaign confirmation numbers' and nothing else. The number will also always be exactly 9 digits, if that helps. Any advice would be very much appreciated.
{"response":
{"result":
{"Potentials":
{"row":
[
{"no":"1","FL":
{"content":"523836000004148171","val":"POTENTIALID"}
},
{"no":"2","FL":
{"content":"523836000004924051","val":"POTENTIALID"}
},
{"no":"3","FL":
[
{"content":"523836000005318448","val":"POTENTIALID"},
{"content":"694275295","val":"Campaign Confirmation Number"}
]
},
{"no":"4","FL":
[
{"content":"523836000005318662","val":"POTENTIALID"},
{"content":"729545274","val":"Campaign Confirmation Number"}
]
},
{"no":"5","FL":
[
{"content":"523836000005318663","val":"POTENTIALID"},
{"content":"903187021","val":"Campaign Confirmation Number"}
]
},
{"no":"6","FL":
{"content":"523836000005322387","val":"POTENTIALID"}
},
{"no":"7","FL":
[
{"content":"523836000005332558","val":"POTENTIALID"},
{"content":"729416761","val":"Campaign Confirmation Number"}
]
}
]
}
},
"uri":"/crm/private/json/Potentials/getSearchRecords"}
}
EDIT: an example of the output for this example would be:
confs = [694275295, 729545274, 903187021, 729416761]
or
confs = ['694275295', '729545274', '903187021', '729416761']
it really doesn't matter if they're stored as strings or ints
EDIT 2: here's my code snip:
import urllib
import urllib2
import datetime
import json
key = '[removed]'
params = {
'[removed]'
}
final_URL = 'https://[removed]'
data = urllib.urlencode(params)
request = urllib2.Request(final_URL,data)
response = urllib2.urlopen(request)
content = response.read()
j = json.loads(content)  # content is a string, so json.loads (not json.load) is needed
confs = []
for no in j["response"]["result"]["Potentials"]["row"]:
    data = no["FL"]
    if isinstance(data, list) and len(data) > 1:
        confs.append(int(data[1]["content"]))
print confs

Assuming j is your JSON object which the above structure has been parsed into:
>>> results = []
>>> for no in j["response"]["result"]["Potentials"]["row"]:
...     data = no["FL"]
...     if isinstance(data, list) and len(data) > 1:
...         results.append(int(data[1]["content"]))
...
>>> results
[694275295, 729545274, 903187021, 729416761]

Assuming that 'response' holds the JSON string:
import json
data = json.loads(response)
rows = data['response']['result']['Potentials']['row']
output = []
for row in rows:
    contents = row['FL']
    # 'FL' is a dict when there is a single field, so check for a list first
    if isinstance(contents, list) and len(contents) > 1:
        output.append(contents[1]['content'])
That should do it.
EDIT:
I finally got some time to test this "one liner". It's fun to use Python's functional features:
import json
#initialize response to your string
data = json.loads(response)
rows = data['response']['result']['Potentials']['row']
output = [x['FL'][1]['content'] for x in rows if isinstance(x['FL'], list) and len(x['FL']) > 1]
print output
['694275295', '729545274', '903187021', '729416761']
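The answers above rely on the confirmation number always being the second entry of 'FL'. If that ordering is not guaranteed, matching on the 'val' label is a possible alternative. A minimal sketch in Python 3 syntax, with a shortened copy of the sample rows; `confirmation_numbers` is a hypothetical helper name:

```python
def confirmation_numbers(rows):
    """Collect the content of every "Campaign Confirmation Number" field."""
    confs = []
    for row in rows:
        fields = row["FL"]
        # In the sample, FL is a single dict when a row has one field, a list otherwise.
        if isinstance(fields, dict):
            fields = [fields]
        for field in fields:
            if field.get("val") == "Campaign Confirmation Number":
                confs.append(field["content"])
    return confs

# Shortened copy of the "row" list from the question.
rows = [
    {"no": "1", "FL": {"content": "523836000004148171", "val": "POTENTIALID"}},
    {"no": "3", "FL": [
        {"content": "523836000005318448", "val": "POTENTIALID"},
        {"content": "694275295", "val": "Campaign Confirmation Number"},
    ]},
]
confs = confirmation_numbers(rows)
print(confs)
```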
