python + json: parse to list

I'm somewhat new to parsing JSON data with Python (using Python 2.7). There is a service that I have to send API calls to, and the JSON response looks something like what I have below. The number of items in 'row' can vary. What I need to do is take only the 'content' from the second entry of 'FL', if there is a second entry, and put it into a list. Essentially, it is a list of only the 'campaign confirmation numbers' and nothing else. The number will also always be exactly 9 digits, if that helps. Any advice would be very much appreciated.
{"response":
{"result":
{"Potentials":
{"row":
[
{"no":"1","FL":
{"content":"523836000004148171","val":"POTENTIALID"}
},
{"no":"2","FL":
{"content":"523836000004924051","val":"POTENTIALID"}
},
{"no":"3","FL":
[
{"content":"523836000005318448","val":"POTENTIALID"},
{"content":"694275295","val":"Campaign Confirmation Number"}
]
},
{"no":"4","FL":
[
{"content":"523836000005318662","val":"POTENTIALID"},
{"content":"729545274","val":"Campaign Confirmation Number"}
]
},
{"no":"5","FL":
[
{"content":"523836000005318663","val":"POTENTIALID"},
{"content":"903187021","val":"Campaign Confirmation Number"}
]
},
{"no":"6","FL":
{"content":"523836000005322387","val":"POTENTIALID"}
},
{"no":"7","FL":
[
{"content":"523836000005332558","val":"POTENTIALID"},
{"content":"729416761","val":"Campaign Confirmation Number"}
]
}
]
}
},
"uri":"/crm/private/json/Potentials/getSearchRecords"}
}
EDIT: an example of the output for this example would be:
confs = [694275295, 729545274, 903187021, 729416761]
or
confs = ['694275295', '729545274', '903187021', '729416761']
it really doesn't matter if they're stored as strings or ints
EDIT 2: here's my code snip:
import urllib
import urllib2
import datetime
import json
key = '[removed]'
params = {
'[removed]'
}
final_URL = 'https://[removed]'
data = urllib.urlencode(params)
request = urllib2.Request(final_URL,data)
response = urllib2.urlopen(request)
content = response.read()
j = json.loads(content)  # loads() parses a string; load() expects a file object
confs = []
for no in j["response"]["result"]["Potentials"]["row"]:
    data = no["FL"]
    if isinstance(data, list) and len(data) > 1:
        confs.append(int(data[1]["content"]))
print confs

Assuming j is your JSON object which the above structure has been parsed into:
>>> results = []
>>> for no in j["response"]["result"]["Potentials"]["row"]:
...     data = no["FL"]
...     if isinstance(data, list) and len(data) > 1:
...         results.append(int(data[1]["content"]))
...
>>> results
[694275295, 729545274, 903187021, 729416761]

Assuming that 'response' holds the json string:
import json
data = json.loads(response)
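rows = data['response']['result']['Potentials']['row']
output = []
for row in rows:
    contents = row['FL']
    # 'FL' is a plain dict when a row has only one field, so check the type.
    if isinstance(contents, list) and len(contents) > 1:
        output.append(contents[1]['content'])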
That should do it.
EDIT:
I finally got some time to test this "one liner". It's fun to use Python's functional features:
import json
#initialize response to your string
data = json.loads(response)
rows = data['response']['result']['Potentials']['row']
output = [x['FL'][1]['content'] for x in rows if isinstance(x['FL'], list) and len(x['FL']) > 1]
print output
['694275295', '729545274', '903187021', '729416761']
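Both answers rely on the confirmation number being the second entry of 'FL'. A slightly more defensive sketch, assuming the "val" labels are stable (the question does not confirm this), picks the entry by its label instead of its position. The helper name extract_confs is made up for illustration, and the sample is a cut-down copy of the response from the question:

```python
import json

def extract_confs(parsed, label="Campaign Confirmation Number"):
    """Collect the 'content' of every FL entry whose 'val' equals label."""
    confs = []
    for row in parsed["response"]["result"]["Potentials"]["row"]:
        fields = row["FL"]
        # The service returns a bare dict when a row has a single field
        # and a list when it has several, so normalise to a list first.
        if isinstance(fields, dict):
            fields = [fields]
        for field in fields:
            if field.get("val") == label:
                confs.append(field["content"])
    return confs

# Cut-down copy of the response from the question.
sample = json.loads("""
{"response": {"result": {"Potentials": {"row": [
  {"no": "1", "FL": {"content": "523836000004148171", "val": "POTENTIALID"}},
  {"no": "3", "FL": [
    {"content": "523836000005318448", "val": "POTENTIALID"},
    {"content": "694275295", "val": "Campaign Confirmation Number"}]}
]}}}}
""")

print(extract_confs(sample))
```

Matching on the label costs one extra comparison per field, but it keeps working if the service ever returns the fields in a different order.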


Reading JSON data in Python using Pagination, max records 100

I am trying to extract data from a REST API using Python and put it into one neat JSON file, and I'm having difficulty. The data is rather lengthy, with a total of nearly 4,000 records, but the API returns at most 100 records per request.
I've tried using some other examples to get through the code, and so far this is what I'm using (censoring the API URL and auth key, for the sake of confidentiality):
import requests
import json
from requests.structures import CaseInsensitiveDict
url = "https://api.airtable.com/v0/CENSORED/Vendors?maxRecords=100"
headers = CaseInsensitiveDict()
headers["Authorization"] = "Bearer CENSORED"
resp = requests.get(url, headers=headers)
resp.content.decode("utf-8")
vendors = []
new_results = True
page = 1
while new_results:
    centiblock = requests.get(url + f"&page={page}", headers=headers).json()
    new_results = centiblock.get("results", [])
    vendors.extend(centiblock)
    page += 1
full_directory = json.dumps(vendors, indent=4)
print(full_directory)
For the life of me, I cannot figure out why it isn't working. The output keeps coming out as just:
[
"records"
]
If I play around with the print statement at the end, I can get it to print centiblock (so named for being a block of 100 records at a time) just fine: it gives me 100 records as unformatted text. However, if I try printing vendors at the end, the output is:
['records']
...which leads me to guess that somehow, the vendors array is not getting filled with the data. I suspect that I need to modify the get request where I define new_results, but I'm not sure how.
For reference, this is a censored look at how the json data begins, when I format and print out one centiblock:
{
"records": [
{
"id": "XXX",
"createdTime": "2018-10-15T19:23:59.000Z",
"fields": {
"Vendor Name": "XXX",
"Main Phone": "XXX",
"Street": "XXX",
Can anyone see where I'm going wrong?
Thanks in advance!
When you extend vendors with centiblock, you are passing a dict to extend. extend expects an iterable, so that works, but iterating over a Python dict yields only its keys. In this case that is just 'records', which is why vendors ends up as ['records'].
Note as well that your loop condition becomes False after the first iteration: centiblock.get("results", []) returns [] because "results" is not a key in the API output, and [] is falsy.
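A short illustration of both points, using a made-up dict in place of one page of the real API response:

```python
# Made-up stand-in for one page of the API response.
centiblock = {"records": [{"id": "rec1"}, {"id": "rec2"}]}

vendors = []
vendors.extend(centiblock)             # iterating a dict yields its keys
print(vendors)                         # ['records']

records = []
records.extend(centiblock["records"])  # extend with the list itself
print(len(records))                    # 2

# "results" is not a key, so .get falls back to [], which is falsy.
new_results = centiblock.get("results", [])
print(bool(new_results))               # False
```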
Hence to correct those errors you need to get the correct field from the API into new_results, and extend vendors with new_results, which is itself an array. Note that on the last iteration, new_results will be the empty list, which means vendors won't be extended with any null value, and will contain exactly what you need:
This should look like:
import requests
import json
from requests.structures import CaseInsensitiveDict
url = "https://api.airtable.com/v0/CENSORED/Vendors?maxRecords=100"
headers = CaseInsensitiveDict()
headers["Authorization"] = "Bearer CENSORED"
resp = requests.get(url, headers=headers)
resp.content.decode("utf-8")
vendors = []
new_results = True
page = 1
while new_results:
    centiblock = requests.get(url + f"&page={page}", headers=headers).json()
    new_results = centiblock.get("records", [])
    vendors.extend(new_results)
    page += 1
full_directory = json.dumps(vendors, indent=4)
print(full_directory)
Note that I kept the while new_results condition: an empty list is falsy, so the loop stops on the first empty page. Writing while len(new_results) > 0 looks equivalent, but it fails on the first pass with a TypeError, because new_results starts out as True and a bool has no len().
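One caveat I can't verify against this particular base: as far as I know, the Airtable list endpoint pages with an "offset" token returned in each response rather than a page number, and maxRecords caps the total rather than the page size. If that holds, a page parameter would be ignored and the loop above would never see an empty page. Here is a sketch of cursor-style paging under that assumption, with the HTTP call replaced by a stand-in so it runs offline (fetch_all, fetch_page and fake_fetch are hypothetical names):

```python
def fetch_all(fetch_page):
    """Collect records from every page.

    fetch_page(offset) must return a dict with a "records" list and,
    while more pages remain, an "offset" token for the next call.
    """
    records = []
    offset = None
    while True:
        page = fetch_page(offset)
        records.extend(page.get("records", []))
        offset = page.get("offset")
        if not offset:          # no token means this was the last page
            break
    return records


# Stand-in for the real request: three fake pages keyed by offset token.
FAKE_PAGES = {
    None: {"records": [{"id": "a"}, {"id": "b"}], "offset": "p2"},
    "p2": {"records": [{"id": "c"}], "offset": "p3"},
    "p3": {"records": [{"id": "d"}]},
}

def fake_fetch(offset):
    return FAKE_PAGES[offset]

all_records = fetch_all(fake_fetch)
print(len(all_records))   # 4
```

With requests, fetch_page could be a small function that calls requests.get(url, headers=headers, params={"offset": offset}) when an offset is present and returns .json(), again assuming the endpoint accepts an offset query parameter.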

Read a file and match lines above or below from the matching pattern

I'm reading an input json file, and capturing the array values into a dictionary, by matching tar.gz and printing a line above that (essentially the yaml file).
{"Windows": [
"/home/windows/work/input.yaml",
"/home/windows/work/windows.tar.gz"
],
"Mac": [
"/home/macos/required/utilities/input.yaml",
"/home/macos/required/utilities.tar.gz"
],
"Unix": [
"/home/unix/functional/plugins/input.yaml",
"/home/unix/functional/plugins/Plugin.tar.gz"
]
goes on..
}
Output of the dictionary:
{'/home/windows/work/windows.tar.gz': '/home/windows/work/input.yaml',
'/home/macos/required/utilities/utilities.tar.gz' : '/home/macos/required/input.yaml'
......
}
The problem is that the order of the entries in the JSON can change: (A) the tar.gz entry can come first in the list of values, or (B) the order can be mixed from one group to the next.
Irrespective of the order of the entries, how can I get the output dictionary in the format shown above?
{ "Windows": [
"/home/windows/work/windows.tar.gz",
"/home/windows/work/input.yaml"
],
"Mac": [
"/home/macos/required/utilities/utilities.tar.gz",
"/home/macos/required/input.yaml"
],
"Unix": [
"/home/unix/functional/plugins/Plugin.tar.gz",
"/home/unix/functional/plugins/input.yaml"
]
goes on.. }
mix and match scenario.
{ "Windows": [
"/home/windows/work/windows.tar.gz",
"/home/windows/work/input.yaml"
],
"Mac": [
"/home/macos/required/utilities/input.yaml",
"/home/macos/required/utilities.tar.gz"
],
"Unix": [
"/home/unix/functional/plugins/Plugin.tar.gz",
"/home/unix/functional/plugins/input.yaml"
] }
My code snippet.
import re

def read_input():
    files_to_be_processed = {}
    with open('input.json', 'r') as f:
        lines = f.read().splitlines()
    lines = [line.replace('"', '').replace(" ", '').replace(',', '') for line in lines]
    for i, line in enumerate(lines):
        # A .tar.gz line is paired with the line directly above it.
        if re.match(r".*\.tar\.gz$", line) and i > 0:
            files_to_be_processed[line] = lines[i - 1]
    print(files_to_be_processed)
One approach is the following:
1- Use Python's json module, which makes the whole process much easier.
2- Once the data is loaded, check each entry (i.e. Windows/Mac/Unix) for both the tar.gz and the yaml file.
3- Assign the pair to a new dictionary.
Here is a quick implementation:
import json
def read_input():
    files_to_be_processed = {}
    with open('input.json','r') as f:
        jsonObject = json.load(f)
    for value in jsonObject.items():
        tarGz = ""
        Yaml = ""
        for line in value[1]: #value[0] contains the key (e.g. Windows)
            if line.endswith('.tar.gz'):
                tarGz = line
            elif line.endswith('.yaml'):
                Yaml = line
        files_to_be_processed[tarGz] = Yaml
    print(files_to_be_processed)
read_input()
This code can be shortened and optimised using things like list comprehension and other methods, but it should be a good place to get started
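As one possible shortened variant (a sketch, not the answerer's code), the pairing can be done with two generator expressions per group. The function name pair_archives is invented here, and unlike the code above it skips a group that is missing either file instead of storing an empty string:

```python
import json

def pair_archives(groups):
    """Map each .tar.gz path to the .yaml path listed in the same group."""
    pairs = {}
    for paths in groups.values():
        archive = next((p for p in paths if p.endswith(".tar.gz")), None)
        config = next((p for p in paths if p.endswith(".yaml")), None)
        # Skip groups missing either file rather than storing a blank key.
        if archive and config:
            pairs[archive] = config
    return pairs

# Mix-and-match input in the same shape as the question's file.
groups = json.loads("""
{"Windows": ["/home/windows/work/windows.tar.gz",
             "/home/windows/work/input.yaml"],
 "Mac": ["/home/macos/required/utilities/input.yaml",
         "/home/macos/required/utilities.tar.gz"]}
""")

for archive, config in sorted(pair_archives(groups).items()):
    print("%s -> %s" % (archive, config))
```

Because each file is found by its extension, the order of the two entries inside a group no longer matters.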
One way could be to transform each list in your input json_dict into a dict with a key for "yaml" and one for "gz":
# Build a separate inner dict per key; dict.fromkeys(json_dict, dict())
# would share one dict across all keys, so every entry would end up identical.
json_dict_1 = {key: {} for key in json_dict}
for key in json_dict:
    list_val = json_dict[key]
    for entry in list_val:
        entry_key = 'yaml' if 'yaml' in entry[-4:] else 'gz'
        json_dict_1[key][entry_key] = entry
print(json_dict_1)
# Output for the first sample input:
#{'Windows': {'yaml': '/home/windows/work/input.yaml',
#             'gz': '/home/windows/work/windows.tar.gz'},
# 'Mac': {'yaml': '/home/macos/required/utilities/input.yaml',
#         'gz': '/home/macos/required/utilities.tar.gz'},
# 'Unix': {'yaml': '/home/unix/functional/plugins/input.yaml',
#          'gz': '/home/unix/functional/plugins/Plugin.tar.gz'}}

Reading JSON return values from an API

I'm getting below output from an API and I want to read all purchaseOrder data. Not really sure how to loop on this data. Also it comes with b' at the front.
b'[
{"purchaseOrder":
[
{
"id":"d01f0f6d-398f-4220-8a9a-44f47beedf04",
"installationNumber":null,
"peerId":"308866ba-90cb-47a7-8c73-589c0f355eb7",
"validFrom":"2019-06-07T12:51:15.000+0000",
"validTo":"2019-06-07T13:51:15.000+0000",
"originalQuantity":5,
"quantity":5,
"price":5,
"periodInitial":"2019-06-07T13:00:00.000+0000",
"periodFinal":"2019-06-07T14:00:00.000+0000"
}
],
"salesOrder":null,
"agreement":null,
"status":""
}
]'
I have tried things like loaded_json = json.load(r.content), but it didn't work.
This is the code I use to get the response:
r = requests.post(url=api_endpoint, data=json.dumps(json_post), headers=headers)
To get the JSON of a requests response, use data = r.json().
After that you can step through it like normal lists and dicts:
import json
data = r.json()
print(json.dumps(data, indent=2))  # If you want to see the data from the response
for dic in data:
    if 'purchaseOrder' in dic:
        for item in dic['purchaseOrder']:
            # item here is the `dict` for each purchaseOrder (PO).
            print(json.dumps(item, indent=2))  # This will print each item in PO.
Thanks all for the support. The following code works for me:
data = r.json()
print(json.dumps(data, indent=2))
for dic in data:
    if 'purchaseOrder' in dic:
        for itemdata in dic['purchaseOrder']:
            for key in itemdata:
                if key == 'id':
                    print("Id:")
                    print(itemdata['id'])
                    print("Price:")
                    print(itemdata['price'])
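On the b' prefix from the question: r.content is a bytes object, and json.load() expects a file-like object with a .read() method, which is why that call failed. A minimal sketch of parsing the bytes directly, using a shortened stand-in for the real payload (the values are abbreviated, and the guard for a null purchaseOrder is an assumption based on the other null fields in the sample):

```python
import json

# Shortened stand-in for r.content from the question (note the bytes literal).
raw = b'[{"purchaseOrder": [{"id": "d01f0f6d", "price": 5}], "status": ""}]'

# json.load() expects a file-like object with .read(); for bytes or str
# use json.loads(). Decoding first keeps this working on older Pythons.
data = json.loads(raw.decode("utf-8"))

for entry in data:
    # Assumption: "purchaseOrder" may be null like "salesOrder", hence "or []".
    for order in entry.get("purchaseOrder") or []:
        print("%s %s" % (order["id"], order["price"]))
```

r.json() does the same decoding and parsing in one step, so this is mainly useful when only the raw bytes are at hand.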

Creating a json output in python

I am trying to return a response for a function in json form. The output is a list with each element being a dictionary. I don't see any mistake when I print the output. The problem arises when I iterate through the output. I get all the characters in the output one by one. See the sample code and sample output for proper understanding.
code:
import requests
import json
import sys
from bs4 import BeautifulSoup
from collections import OrderedDict
class Cricbuzz():
    url = "http://synd.cricbuzz.com/j2me/1.0/livematches.xml"
    def __init__(self):
        pass
    def getxml(self,url):
        try:
            r = requests.get(url)
        except requests.exceptions.RequestException as e:
            print e
            sys.exit(1)
        soup = BeautifulSoup(r.text,"html.parser")
        return soup
    def matchinfo(self,match):
        d = OrderedDict()
        d['id'] = match['id']
        d['srs'] = match['srs']
        d['mchdesc'] = match['mchdesc']
        d['mnum'] = match['mnum']
        d['type'] = match['type']
        d['mchstate'] = match.state['mchstate']
        d['status'] = match.state['status']
        return d
    def matches(self):
        xml = self.getxml(self.url)
        matches = xml.find_all('match')
        info = []
        for match in matches:
            info.append(self.matchinfo(match))
        data = json.dumps(info)
        return data
c = Cricbuzz()
matches = c.matches()
print matches #print matches - output1
for match in matches:
    print match #print match - output2
"print matches" i.e output1 in above code gives me following output:
[
{
"status": "Coming up on Dec 24 at 01:10 GMT",
"mchstate": "nextlive",
"mchdesc": "AKL vs WEL",
"srs": "McDonalds Super Smash, 2016-17",
"mnum": "18TH MATCH",
"type": "ODI",
"id": "0"
},
{
"status": "Ind U19 won by 34 runs",
"mchstate": "Result",
"mchdesc": "INDU19 vs SLU19",
"srs": "Under 19 Asia Cup, 2016",
"mnum": "Final",
"type": "ODI",
"id": "17727"
},
{
"status": "PRS won by 48 runs",
"mchstate": "Result",
"mchdesc": "PRS vs ADS",
"srs": "Big Bash League, 2016-17",
"mnum": "5th Match",
"type": "T20",
"id": "16729"
}
]
But "print match" i.e output2 in above code inside the for loop gives this output:
[
{
"
i
d
"
:
"
0
"
,
"
s
r
s
"
:
"
M
c
D
o
n
a
l
d
s
S
u
p
e
r
S
m
a
s
h
,
2
0
1
6
-
1
7
"
,
"
m
c
h
d
e
s
As you can see, a single character from matches is printed on each line. I would like to get a dictionary object when printing match.
def matches(self):
    xml = self.getxml(self.url)
    matches = xml.find_all('match')
    info = []
    for match in matches:
        info.append(self.matchinfo(match))
    data = json.dumps(info) # This is a string
    return data # This is a string
c = Cricbuzz()
matches = c.matches() # This is a string
print matches
for match in matches: # Looping over all characters of a string
    print match
I think you just want return info, which is a list. You can json.dumps() outside of that function at a later point when you actually do need JSON.
Or if you do want that function to return a JSON string, then you have to parse it back into a list.
for match in json.loads(matches):
If you call json.dumps like you do on info before returning data, the value is converted to a json string. If you want to iterate over the iterable the json string represents, you have to load the data back out of the json.
Consider:
import json
info = [ { "a": 1}, { "b": 2} ]
data = json.dumps(info,indent=2)
print data
for i in data:
    print i
for i in json.loads(data):
    print i
$ python t.py
[
{
"a": 1
},
{
"b": 2
}
]
[
{
"
a
"
:
1
}
,
{
"
b
"
:
2
}
]
{u'a': 1}
{u'b': 2}
matches is a JSON string, not a list of dictionaries, so for match in matches: iterates over the characters in the string.
If you want the dictionaries, the function should return info rather than json.dumps(info). Or you could do:
for match in json.loads(matches):
to parse the JSON back into a list of dictionaries.
Normally you should move data around in the program as structured types like dictionaries and lists, and only convert them to/from JSON when you're sending over a network or storing into a file.
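A small sketch of that principle, with placeholder data standing in for the scraped matches (build_matches is a hypothetical name, not part of the asker's class):

```python
import json

def build_matches():
    """Return plain Python data; the values here are placeholders."""
    return [
        {"id": "0", "mchdesc": "AKL vs WEL"},
        {"id": "17727", "mchdesc": "INDU19 vs SLU19"},
    ]

matches = build_matches()

# Inside the program, iterate over the list of dicts directly.
for match in matches:
    print("%s %s" % (match["id"], match["mchdesc"]))

# Serialise only at the boundary, e.g. when writing a response or a file.
payload = json.dumps(matches)
assert json.loads(payload) == matches
```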
json.dumps returns a string.
If you expect to get each dict from the list while iterating, parse the response back first:
matches = json.loads(matches)
By the way, calling json.dumps is still useful as a simple validation step: it turns Python objects into valid JSON text (for example, it writes strings with double quotes, whereas Python's repr uses single quotes). That's why I suggest keeping json.dumps rather than skipping it.

Extract from dynamic JSON response with Scrapy

I want to extract the 'avail' value from JSON output that looks like this.
{
"result": {
"code": 100,
"message": "Command Successful"
},
"domains": {
"yolotaxpayers.com": {
"avail": false,
"tld": "com",
"price": "49.95",
"premium": false,
"backorder": true
}
}
}
The problem is that the ['avail'] value is under ["domains"]["domain_name"] and I can't figure out how to get the domain name.
My spider is below. The first part works fine, but the second one does not.
import scrapy
import json
from whois.items import WhoisItem
class whoislistSpider(scrapy.Spider):
    name = "whois_list"
    start_urls = []
    f = open('test.txt', 'r')
    global lines
    lines = f.read().splitlines()
    f.close()
    def __init__(self):
        for line in lines:
            self.start_urls.append('http://www.example.com/api/domain/check/%s/com' % line)
    def parse(self, response):
        for line in lines:
            jsonresponse = json.loads(response.body_as_unicode())
            item = WhoisItem()
            domain_name = list(jsonresponse['domains'].keys())[0]
            item["avail"] = jsonresponse["domains"][domain_name]["avail"]
            item["domain"] = domain_name
            yield item
Thank you in advance for your replies.
Currently, it tries to get the value by the "('%s.com' % line)" key.
You need to do the string formatting correctly:
domain_name = "%s.com" % line.strip()
item["avail"] = jsonresponse["domains"][domain_name]["avail"]
Assuming you are only expecting one result per response:
domain_name = list(jsonresponse['domains'].keys())[0]
item["avail"] = jsonresponse["domains"][domain_name]["avail"]
This will work even if there is a mismatch between the domain in the file "test.txt" and the domain in the result.
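If a response could ever list more than one domain (an assumption; the sample only shows one), iterating over the "domains" dict avoids indexing altogether. A sketch with a stand-in response body, where the second domain and the helper name iter_availability are invented for illustration:

```python
import json

# Stand-in for the response text; the second domain is invented.
body = """
{"result": {"code": 100, "message": "Command Successful"},
 "domains": {
   "yolotaxpayers.com": {"avail": false, "tld": "com"},
   "example-two.com": {"avail": true, "tld": "com"}}}
"""

jsonresponse = json.loads(body)

def iter_availability(parsed):
    """Yield (domain, avail) for every entry under "domains"."""
    for domain_name, info in parsed.get("domains", {}).items():
        yield domain_name, info.get("avail")

availability = dict(iter_availability(jsonresponse))
print(availability)
```

Inside the spider's parse method, the same loop could yield one item per (domain_name, avail) pair instead of building a dict.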
To get the domain name from the JSON response above, you can use a list comprehension, e.g.:
domain_name = [x for x in jsonresponse["domains"].keys()]
To get the "avail" value, use the same method, e.g.:
avail = [x["avail"] for x in jsonresponse["domains"].values() if "avail" in x]
Both results are lists, so to get the individual values use index 0, e.g. domain_name[0] and avail[0].
