Parse a JSON object in Python without the json library (using only regex)

I'm currently building a small application using the Instagram API, which replies with JSON "objects" to GET requests. To fetch the responses I'm currently using urllib2.
This is part of an assignment from one of the courses I'm currently attending, and the biggest challenge is that we are not allowed to use the json library to parse and retrieve the information from the Instagram response. We are forced to use the regex library (and only that) to parse the information.
The Instagram response format for obtaining a user's feed page, for example, follows the structure shown in this link.
I have honestly spent three hours trying to figure this out by myself and also searched the internet, but most answered questions just point to the json library.
Any tips or suggestions would come in handy.
Additionally, other than urllib2 (which may be considered external), I am not allowed to use any third-party library beyond those shipped with Python 2.7.
Thanks in advance.

It's not that complicated, really. When you do the GET request you get back a lot of text, of which you only need small parts. For example, if you want to parse a user's recent media feed and extract the images and their captions:
query = "https://api.instagram.com/v1/users/"+profile_id+"/media/recent?access_token="+token
response = urlopen(query)
the_page = response.read()
feed = {}
feed['images'] = []
feed['captions'] = []
matchImage = re.findall(r'"standard_resolution":{"url":"(.*?)"', the_page)
matchCaption = re.findall(r'"caption":(.*?),(.*?),', the_page)
if len(matchImage) > 0:
for x in xrange(0,len(matchImage)):
image = matchImage[x].replace('\\','')
if matchCaption[x][0] == 'null':
feed['images'].append(image)
feed['captions'].append('No Caption')
else:
caption = re.search(r'"text":"(.*?)"', matchCaption[x][1])
caption = caption.group(1).replace('\\','')
feed['images'].append(image)
feed['captions'].append(caption)
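A minimal usage sketch of the result (assuming profile_id and token are already defined and the code above has run):

# Pair each image URL with its caption
for url, text in zip(feed['images'], feed['captions']):
    print url, '-', text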

How about using a functional parser library and a bit of regex?
def parse(seq):
    'Sequence(Token) -> object'
    ...
    n = lambda s: a(Token('Name', s)) >> tokval

    def make_array(n):
        if n is None:
            return []
        else:
            return [n[0]] + n[1]
    ...
    null = n('null') >> const(None)
    true = n('true') >> const(True)
    false = n('false') >> const(False)
    number = toktype('Number') >> make_number
    string = toktype('String') >> make_string

    value = forward_decl()
    member = string + op_(':') + value >> tuple
    object = (
        op_('{') +
        maybe(member + many(op_(',') + member)) +
        op_('}')
        >> make_object)
    array = (
        op_('[') +
        maybe(value + many(op_(',') + value)) +
        op_(']')
        >> make_array)
    value.define(
        null
        | true
        | false
        | object
        | array
        | number
        | string)
    json_text = object | array
    json_file = json_text + skip(finished)

    return json_file.parse(seq)
You will need the funcparserlib library for this.
Note: doing this with pure regex alone is simply too hard, because regular expressions cannot match arbitrarily nested structures. You need to write some kind of "parser" -- so you may as well use a parser library to help with the boring bits.
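If third-party libraries are truly off limits (as in the assignment), the usual compromise is to use re only for tokenizing and to handle the nesting by hand. Here is a rough, simplified sketch of that idea for Python 2.7 (no real escape handling, no error recovery -- not a full JSON parser):

import re

# Tokenizer: one regex splits the JSON text into tokens, even though
# a single regex cannot express the nested grammar itself.
TOKEN_RE = re.compile(r'"(?:\\.|[^"\\])*"'                    # strings
                      r'|-?\d+(?:\.\d+)?(?:[eE][+-]?\d+)?'    # numbers
                      r'|true|false|null'                     # literals
                      r'|[{}\[\]:,]')                         # punctuation

def parse_value(tokens, i=0):
    # Recursive descent over the token list; returns (value, next index)
    tok = tokens[i]
    if tok == '{':
        obj, i = {}, i + 1
        while tokens[i] != '}':
            key, i = parse_value(tokens, i)       # a string token
            val, i = parse_value(tokens, i + 1)   # i + 1 skips the ':'
            obj[key] = val
            if tokens[i] == ',':
                i += 1
        return obj, i + 1
    if tok == '[':
        arr, i = [], i + 1
        while tokens[i] != ']':
            val, i = parse_value(tokens, i)
            arr.append(val)
            if tokens[i] == ',':
                i += 1
        return arr, i + 1
    if tok.startswith('"'):
        return tok[1:-1].replace('\\"', '"'), i + 1
    if tok in ('true', 'false', 'null'):
        return {'true': True, 'false': False, 'null': None}[tok], i + 1
    return (float(tok) if '.' in tok or 'e' in tok.lower() else int(tok)), i + 1

# findall silently skips whitespace and anything unmatched
tokens = TOKEN_RE.findall('{"a": [1, 2, {"b": null}]}')
print parse_value(tokens)[0]   # {'a': [1, 2, {'b': None}]}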

Related

How to use the Elsevier Article Retrieval API to get the full text of a paper

I want to use the Elsevier Article Retrieval API (https://dev.elsevier.com/documentation/FullTextRetrievalAPI.wadl) to get the full text of a paper.
I use httpx to get the paper's information, but the response contains only part of it. My code is below:
import httpx
import time

def scopus_paper_date(paper_doi, apikey):
    headers = {
        "X-ELS-APIKey": apikey,
        "Accept": 'text/xml'
    }
    timeout = httpx.Timeout(10.0, connect=60.0)
    client = httpx.Client(timeout=timeout, headers=headers)
    query = "&view=FULL"  # note: defined but never appended to the URL
    url = "https://api.elsevier.com/content/article/doi/" + paper_doi
    r = client.get(url)
    print(r)
    return r.text

y = scopus_paper_date('10.1016/j.solmat.2021.111326', myapikey)
y
the result is below:
<full-text-retrieval-response xmlns="http://www.elsevier.com/xml/svapi/article/dtd" xmlns:bk="http://www.elsevier.com/xml/bk/dtd" xmlns:cals="http://www.elsevier.com/xml/common/cals/dtd" xmlns:ce="http://www.elsevier.com/xml/common/dtd" xmlns:ja="http://www.elsevier.com/xml/ja/dtd" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:sa="http://www.elsevier.com/xml/common/struct-aff/dtd" xmlns:sb="http://www.elsevier.com/xml/common/struct-bib/dtd" xmlns:tb="http://www.elsevier.com/xml/common/table/dtd" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xocs="http://www.elsevier.com/xml/xocs/dtd" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:dcterms="http://purl.org/dc/terms/" xmlns:prism="http://prismstandard.org/namespaces/basic/2.0/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"><coredata><prism:url>https://api.elsevier.com/content/article/pii/S0927024821003688</prism:url>....
How can I get the full data of the paper? Many thanks!
That depends on the paper you want to download.
I modified the function you posted a bit. Now it gets the response as JSON rather than XML (this is just my personal preference; you can use whichever format you prefer).
import httpx
import time

def scopus_paper_date(paper_doi, apikey):
    headers = {
        "X-ELS-APIKey": apikey,
        "Accept": 'application/json'
    }
    timeout = httpx.Timeout(10.0, connect=60.0)
    client = httpx.Client(timeout=timeout, headers=headers)
    url = "https://api.elsevier.com/content/article/doi/" + paper_doi
    r = client.get(url)
    print(r)
    return r
Now you can retrieve the document you want, and then you will have to parse it:
# Get document
y = scopus_paper_date('10.1016/j.solmat.2021.111326',my_api_key)
# Parse document
import json
json_acceptable_string = y.text
d = json.loads(json_acceptable_string)
# Print document
print(d['full-text-retrieval-response']['coredata']['dc:description'])
The result will be the dc:description of the document, i.e. the abstract:
The production of molecular hydrogen by photoelectrochemical
dissociation (PEC) of water is a promising technique, which allows ... The width of the forbidden
bands and the position of the valence and conduction bands of the
different materials were determined by Mott - Schottky type
measurements.
For this document, that is all you can get; there are no more options.
However, if you try a different document, for example:
# Get document
y = scopus_paper_date('10.1016/j.nicl.2021.102600',my_api_key)
# Parse document
import json
json_acceptable_string = y.text
d = json.loads(json_acceptable_string)
You can then print the originalText key of the full-text-retrieval-response
# Print document
print(d['full-text-retrieval-response']['originalText'])
You will notice that this is a very long string containing a lot of text, probably more than you want; for example, it contains all the references as well.
As I said at the beginning, the information you can get depends on the individual paper. However, the full data will always be contained in the y variable defined in the code.
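To see which fields a given paper actually exposes, you can simply inspect the top-level keys of the response (a quick sketch; whether keys such as originalText are present varies per document):

# List the fields this particular document exposes
import json
d = json.loads(y.text)
print(sorted(d['full-text-retrieval-response'].keys()))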

String manipulation for JSON web scraping

I am trying to scrape a website; all the data needed is in very long matrices obtained through requests and json imports.
I am having trouble getting any output.
Is it because of the concatenation of the two strings in requests.get()?
Here is the part with the problem; everything used here was declared at the start of the code.
balance = []
for q in range(len(DepositMatrix)):
    address = requests.get('https://ethplorer.io/service/service.php?data=' + str(DepositMatrix[q][0]))
    data4 = address.json()
    TokenBalances = data4['balances']  # returns a dictionary
    balance.append(TokenBalances)
print(balance)
Example of DepositMatrix -- a list of lists, each with 4 elements ([string, float, int, int]):
[['0x2b5634c42055806a59e9107ed44d43c426e58258', 488040277.1535826, 660, 7103],
['0x05ee546c1a62f90d7acbffd6d846c9c54c7cf94c', 376515313.83254075, 2069, 12705]]
I think the error is in this part:
requests.get('https://ethplorer.io/service/service.php?data=' + str(DepositMatrix[q][0]))
This change doesn't help either:
requests.get('https://ethplorer.io/service/service.php?data=' + DepositMatrix[q][0])
Like I said in my comment, I tried your code and it worked for me. But I wanted to highlight some things that could help your code be clearer:
import requests
import pprint

DepositMatrix = [['0x2b5634c42055806a59e9107ed44d43c426e58258', 488040277.1535826, 660, 7103],
                 ['0x05ee546c1a62f90d7acbffd6d846c9c54c7cf94c', 376515313.83254075, 2069, 12705]]

balance = []
for deposit in DepositMatrix:
    address = requests.get('https://ethplorer.io/service/service.php?data=' + deposit[0])
    data4 = address.json()
    TokenBalances = data4['balances']  # returns a dictionary
    balance.append(TokenBalances)

pprint.pprint(balance)
For your loop, instead of creating a range over the length of your list (q) and then using q to index back into the list, it's simpler to iterate over the elements directly (for deposit in DepositMatrix:).
I've used the pprint module to ease the visualization of your data.
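As a side note, you could also let requests build the query string for you instead of concatenating it by hand (a sketch of the same call):

# Same request, with requests assembling the query string
address = requests.get('https://ethplorer.io/service/service.php',
                       params={'data': deposit[0]})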

Python: Joining and writing (XML.etrees) trees stored in a list

I'm looping over some XML files and producing trees that I would like to store in a defaultdict(list). With each loop, the next child found is stored in a separate slot of the dictionary.
from collections import defaultdict
import xml.etree.ElementTree as ET

d = defaultdict(list)
counter = 0
for child in root.findall(something):
    tree = ET.ElementTree(something)
    d[int(x)].append(tree)
    counter += 1
Repeating this for several files would then give nicely indexed results: the set of trees that were in position 1 across the different parsed files, and so on. The question is, how do I then join all of d and write the trees (as one cumulative tree) to a file?
I can loop through the dict to get each tree:
for x in d:
    for y in d[x]:
        print(y)
This gives a complete list of trees that were in my dict. Now, how do I produce one massive tree from this?
Sample input file 1
Sample input file 2
Required results from 1&2
Given the apparent difficulty in doing this, I'm happy to accept more general answers that show how I can otherwise get the result I am looking for from two or more files.
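For reference, a minimal stdlib-only sketch of the "one cumulative tree" idea (the wrapper tag name merged is made up; it assumes d holds ElementTree objects as above):

import xml.etree.ElementTree as ET

# Wrap every stored tree under one artificial root, then write it out once
merged_root = ET.Element('merged')
for x in d:
    for tree in d[x]:
        merged_root.append(tree.getroot())
ET.ElementTree(merged_root).write('merged.xml', encoding='utf-8', xml_declaration=True)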
Use Spyne:
from spyne.model.primitive import *
from spyne.model.complex import *

class GpsInfo(ComplexModel):
    UTC = DateTime
    Latitude = Double
    Longitude = Double
    DopplerTime = Double
    Quality = Unicode
    HDOP = Unicode
    Altitude = Double
    Speed = Double
    Heading = Double
    Estimated = Boolean

class Header(ComplexModel):
    Name = Unicode
    Time = DateTime
    SeqNo = Integer

class CTrailData(ComplexModel):
    index = UnsignedInteger
    gpsInfo = GpsInfo
    Header = Header

class CTrail(ComplexModel):
    LastError = AnyXml
    MaxTrial = Integer
    Trail = Array(CTrailData)

from lxml import etree
from spyne.util.xml import *

file_1 = get_xml_as_object(etree.fromstring(open('file1').read()), CTrail)
file_2 = get_xml_as_object(etree.fromstring(open('file2').read()), CTrail)

file_1.Trail.extend(file_2.Trail)
file_1.Trail.sort(key=lambda x: x.index)

elt = get_object_as_xml(file_1, no_namespace=True)
print etree.tostring(elt, pretty_print=True)
While doing this, Spyne also converts the data fields from strings to their native Python types, so it'll be much easier for you to work with the data from this XML document.
Also, if you don't mind using the latest version from git, you can do e.g.:
class GpsInfo(ComplexModel):
    # (...)
    doppler_time = Double(sub_name="DopplerTime")
    # (...)
so that you can get data from the CamelCased tags without having to violate PEP8.
Use lxml.objectify:
from lxml import etree, objectify
obj_1 = objectify.fromstring(open('file1').read())
obj_2 = objectify.fromstring(open('file2').read())
obj_1.Trail.CTrailData.extend(obj_2.Trail.CTrailData)
# .sort() won't work as objectify's lists are not regular python lists.
obj_1.Trail.CTrailData = sorted(obj_1.Trail.CTrailData, key=lambda x: x.index)
print etree.tostring(obj_1, pretty_print=True)
It doesn't do the additional conversion work that the Spyne variant does, but for your use case, that might be enough.
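If you want the merged document in a file rather than on stdout, a one-line variant of the final step (a sketch):

# Write the merged, sorted tree to disk instead of printing it
with open('merged.xml', 'w') as f:
    f.write(etree.tostring(obj_1, pretty_print=True))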

Parsing multiple JSON elements in Python

I'm trying to build a small script that will go through the Etsy API and retrieve certain information. The API returns 25 different listings, all in JSON, and I would appreciate it if someone could help me learn how to handle them one at a time.
Here is an example of the json I'm dealing with:
{"count":50100,"results":[{"listing_id":114179207,"state":"active"},{"listing_id":11344567,"state":"active"},
and so on.
Is there a simple way to handle only one of these listings at a time, to minimize the number of calls I must make to the API?
Here is some of the code showing how I deal with just one listing when I limit the results returned to 1:
r = requests.get('http://openapi.etsy.com/v2/listings/active?api_key=key&limit=1&offset=' + str(offset_param) + '&category=Clothing')
raw_json = r.json()
# note: r.json() already returns parsed data; the dumps/loads round-trip is redundant
encoded_json = json.dumps(raw_json)
dataObject = json.loads(encoded_json)

if dataObject["results"][0]["quantity"] > 1:
    if dataObject["results"][0]["listing_id"] not in already_done:
        already_done.append(dataObject["results"][0]["listing_id"])

        s = requests.get('http://openapi.etsy.com/v2/users/' + str(dataObject["results"][0]["user_id"]) + '/profile?api_key=key')
        raw_json2 = s.json()
        encoded_json2 = json.dumps(raw_json2)
        dataObject2 = json.loads(encoded_json2)

        t = requests.get('http://openapi.etsy.com/v2/users/' + str(dataObject["results"][0]["user_id"]) + '?api_key=key')
        raw_json3 = t.json()
        encoded_json3 = json.dumps(raw_json3)
        dataObject3 = json.loads(encoded_json3)
Seeing how the results field (or key) contains a list structure, you can simply iterate through it like the following:
json_str = { ...other key-values..., "results": [{"listing_id": 114179207, "state": "active"}, {"listing_id": 11344567, "state": "active"}, ...and so on] }
results = json_str['results']  # this gives you a list of dicts

# iterate through this list
for result in results:
    if result['state'] == 'active':
        do_something_with(result['listing_id'])
    else:
        do_someotherthing_with(result['listing_id'])  # or nothing at all
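To address the "minimize API calls" part: rather than requesting one listing per call, you could fetch a page of 25 and walk through it locally (a sketch; api_key=key and offset_param are placeholders carried over from the question):

# One request returns up to 25 listings; process them all locally
r = requests.get('http://openapi.etsy.com/v2/listings/active'
                 '?api_key=key&limit=25&offset=' + str(offset_param) +
                 '&category=Clothing')
for listing in r.json()['results']:
    if listing.get('quantity', 0) > 1 and listing['listing_id'] not in already_done:
        already_done.append(listing['listing_id'])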

Why do we add this code when using mmap in Google Maps API?

import urllib

net, cid, lac = 404415, 40962, 128  # net = MCC 404 & MNC 415

a = '000E00000000000000000000000000001B0000000000000000000000030000'
b = hex(cid)[2:].zfill(8) + hex(lac)[2:].zfill(8)
c = hex(divmod(net, 100)[1])[2:].zfill(8) + hex(divmod(net, 100)[0])[2:].zfill(8)
string = (a + b + c + 'FFFFFFFF00000000').decode('hex')

try:
    data = urllib.urlopen('http://www.google.com/glm/mmap', string)
    r = data.read().encode('hex')
    if len(r) > 14:
        print float(int(r[14:22], 16)) / 1000000, float(int(r[22:30], 16)) / 1000000
    else:
        print 'no data in google'
except:
    print 'connect error'
I need to understand why we need to send this specific format to mmap, especially regarding
a = '000E00000000000000000000000000001B0000000000000000000000030000'
and why 'FFFFFFFF00000000' is appended to the string. Can anyone please explain this?
This is just a guess.
But I'm assuming the a = '...' is the fixed part of the location request, and the 'FFFFFFFF00000000' is just padding at the end to tell the API that this is the end of your request.
The reason the "a" string looks so odd is most likely that Google Maps supports a huge bunch of things, such as Street View, Google Places, pictures, etc. At its core this is just a request for longitude and latitude, with a bunch of extra fields around it.
So yeah, the response is just longitude and latitude, and the rest is just padding.
Net, Cid and Lac are cellular network identifiers rather than GPS values: I think Cid is the cell ID and Lac is the location area code, so you're probably doing something for Android?
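For what it's worth, the hand-built hex string can be written more readably with struct (a sketch; the field meanings are unofficial guesses, since this is a reverse-engineered protocol, but the bytes come out identical to the original for these values):

import struct

# Fixed header, then cid / lac / the two halves of net as big-endian
# 32-bit unsigned ints, then the 0xFFFFFFFF 0x00000000 trailer.
header = '000E00000000000000000000000000001B0000000000000000000000030000'.decode('hex')
payload = header + struct.pack('>IIIIII',
                               cid, lac,
                               divmod(net, 100)[1], divmod(net, 100)[0],
                               0xFFFFFFFF, 0)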
http://code.google.com/apis/maps/documentation/webservices/
http://eng.xakep.ru/link/50814/
http://code.google.com/apis/maps/documentation/javascript/
http://www.cellumap.com/map.html
