Python: reading API responses and validating the result

Currently I am stuck: I can print the result returned from the API, but I cannot read or alter it without first parsing it into a text file. Furthermore, I don't need all of the information the API provides; it would be great if I could get only the match_id.
The response from the API: Result.
From the result I only need the match_id. Once I have it, I want to compare it against a list of strings, e.g. 3238829394, 3238829395 and more, to check whether any of those values match mine; if one does, the system should alert me.
So far I have found a way of doing this by writing the results to a text file and then comparing that with the list I have.
The code for getting the response:
import dota2api

api = dota2api.Initialise("API_KEY")
response = api.get_match_history_by_seq_num(start_at_match_seq_num=2829690055, matches_requested=1)

# Write the raw response to a text file so it can be inspected
with open('myfile.txt', 'w') as f:
    f.write(str(response))
However, I am hoping to find a faster and better way to do this, as the current process is very time-consuming and unstable. Thank you.

You are getting JSON back from that API. In Python the data can be accessed directly, without parsing it into a file first.
The response will be something like this (sorry, I cannot copy-paste from that image to reproduce the JSON properly):
for match in response['matches']:
    if is_similar(match['match_id']):
        do_something_cool_here
I think that should do what you need. If you post the response as text I can help you build the code properly, but I guess you get the idea of what I am trying to say :)
Hope it helps!
EDIT:
We talked in private and this works:
import dota2api

api = dota2api.Initialise("API_KEY")
response = api.get_match_history_by_seq_num(start_at_match_seq_num=SEQ_NUM, matches_requested=1)
match_id_check = MATCH_ID
for match in response['matches']:
    if match_id_check == match['match_id']:
        print(match)
with API_KEY, SEQ_NUM and MATCH_ID to configure
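Since the original goal was to compare the match_id against a whole list of IDs rather than a single value, a set membership test is a natural extension. A minimal sketch (the IDs and the response dict below are made-up placeholders, not real API output):

```python
# Sketch: check a returned match_id against a watch list.
# The IDs and the response structure below are illustrative placeholders.
watched_ids = {3238829394, 3238829395}

response = {'matches': [{'match_id': 3238829395}]}  # pretend parsed API response

for match in response['matches']:
    if match['match_id'] in watched_ids:
        print("Alert: found match", match['match_id'])
```

Using a set makes the membership test O(1) no matter how many IDs are on the watch list.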

Related

How do you correctly parse web links to avoid a 403 error when using Wget?

I just started learning Python yesterday and have very minimal coding skill. I am trying to write a Python script that will process a folder of PDFs. Each PDF contains at least 1, and maybe as many as 15 or more, web links to supplemental documents. I think I'm off to a good start, but I keep hitting "HTTP Error 403: Forbidden" errors when trying to use the wget function. I believe I'm just not parsing the web links correctly. I think the main issue is that the web links are mostly very long "s3.amazonaws.com" links.
For reference:
Link copied directly from PDF (works to download): https://s3.amazonaws.com/os_uploads/2169504_DFA%20train%20pass.PNG?AWSAccessKeyId=AKIAIPCTK7BDMEW7SP4Q&Expires=1909634500&Signature=aQlQXVR8UuYLtkzjvcKJ5tiVrZQ=&response-content-disposition=attachment;%20filename*=utf-8''DFA%2520train%2520pass.PNG
Link as it appears after trying to parse it in my code (doesn't work, gives "unknown url type" when trying to download): https%3A//s3.amazonaws.com/os_uploads/2169504_DFA%2520train%2520pass.PNG%3FAWSAccessKeyId%3DAKIAIPCTK7BDMEW7SP4Q%26Expires%3D1909634500%26Signature%3DaQlQXVR8UuYLtkzjvcKJ5tiVrZQ%253D%26response-content-disposition%3Dattachment%253B%2520filename%252A%253Dutf-8%2527%2527DFA%252520train%252520pass.PNG
Additionally, feel free to weigh in if I'm doing this in a silly way. Each PDF starts with a string of 6 digits, and once I download the supplemental documents I want to auto-save and name them as XXXXXX_attachY.*, where X is the identifying string of digits and Y just increases for each attachment. I haven't gotten my code to work well enough to test that, but I'm fairly certain I don't have it correct either.
Help!
#!/usr/bin/env python3
import os
import glob
import pdfx
import wget
import urllib.parse

## Accessing and Creating Six Digit File Code
pdf_dir = "/users/USERNAME/desktop/worky"
pdf_files = glob.glob("%s/*.pdf" % pdf_dir)

for file in pdf_files:
    ## Identify File Name and Limit to Digits
    filename = os.path.basename(file)
    newname = filename[0:6]

    ## Run PDFX to identify and download links
    pdf = pdfx.PDFx(filename)
    url_list = pdf.get_references_as_dict()
    attachment_counter = (1)
    for x in url_list["url"]:
        if x[0:4] == "http":
            parsed_url = urllib.parse.quote(x, safe='://')
            print(parsed_url)
            wget.download(parsed_url, '/users/USERNAME/desktop/worky/(newname)_attach(attachment_counter).*')
            ##os.rename(r'/users/USERNAME/desktop/worky/(filename).*', r'/users/USERNAME/desktop/worky/(newname)_attach(attachment_counter).*')
            attachment_counter += 1
    for x in url_list["pdf"]:
        print(parsed_url + "\n")
I prefer to use requests (https://requests.readthedocs.io/en/master/) when trying to grab text or files online. I tried your URL quickly with wget and got the same error (it might be linked to the User-Agent HTTP header that wget sends).
wget and HTTP header issues: download image from url using python urllib but receiving HTTP Error 403: Forbidden
HTTP headers: https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/User-Agent
The good thing with requests is that it lets you modify HTTP headers the way you want (https://requests.readthedocs.io/en/master/user/quickstart/#custom-headers).
import requests

r = requests.get("https://s3.amazonaws.com/os_uploads/2169504_DFA%20train%20pass.PNG?AWSAccessKeyId=AKIAIPCTK7BDMEW7SP4Q&Expires=1909634500&Signature=aQlQXVR8UuYLtkzjvcKJ5tiVrZQ=&response-content-disposition=attachment;%20filename*=utf-8''DFA%2520train%2520pass.PNG")
with open("myfile.png", "wb") as file:
    file.write(r.content)
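If the 403 really does come from the server rejecting wget's default User-Agent, requests lets you send a browser-like header instead. A sketch that only builds and prepares the request without sending it, to show where the header goes (the header value and URL are arbitrary examples):

```python
import requests

# Hypothetical browser-like User-Agent header (an arbitrary example value)
headers = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64)"}

# Build and prepare the request without sending it
req = requests.Request("GET", "https://example.com/file.png", headers=headers)
prepared = req.prepare()
print(prepared.headers["User-Agent"])
```

In practice you would simply pass `headers=headers` straight to `requests.get()`.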
I'm not sure I understand exactly what you're trying to do, but maybe you want to use formatted strings to build your URLs and filenames (https://docs.python.org/3/library/stdtypes.html?highlight=format#str.format)?
Checking string indexes may be fine in your case (if x[0:4] == "http":), but I think you should look at Python's re package and use regular expressions to catch the elements you want in a document (https://docs.python.org/3/library/re.html).
import re

regex = re.compile(r"^http://")
if re.match(regex, mydocument):
    <do something>
The reason for this behaviour lies inside the wget library: internally it encodes the URL with urllib.parse.quote() (https://docs.python.org/3/library/urllib.parse.html#urllib.parse.quote).
Basically it replaces characters with their appropriate %xx escape sequences. Your URL is already escaped, but the library does not know that. When it parses the %20 it sees % as a character that needs to be replaced, so the result is %2520 and a different URL, hence the 403 error.
You could decode the URL first and then pass it, but then you would have another problem with this library, because your URL has a filename*= parameter while the library expects filename=.
I would recommend doing something like this:
import requests

# get the file
req = requests.get(parsed_url)

# parse your URL to get its GET parameters
get_parameters = [x for x in parsed_url.split('?')[1].split('&')]

filename = ''
# find the GET parameter that contains the name
for get_parameter in get_parameters:
    if "filename*=" in get_parameter:
        # split it to get the name
        filename = get_parameter.split('filename*=')[1]

# save the file
with open(<path> + filename, 'wb') as file:
    file.write(req.content)
I would also recommend removing the utf-8'' prefix from that filename, because I don't think it is actually part of the filename. You could also use regular expressions to extract the filename, but this was easier for me.
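For reference, here is one way to strip that utf-8'' prefix and percent-decode the result with urllib.parse.unquote. This is only a sketch: the URL is a shortened stand-in with the same shape as the real one, whose filename happens to be percent-encoded twice.

```python
from urllib.parse import unquote

# Shortened stand-in URL with the same shape as the real one
url = ("https://example.com/file.png?AWSAccessKeyId=KEY"
       "&response-content-disposition=attachment;%20filename*=utf-8''my%2520file.png")

query = url.split("?", 1)[1]
# take the part after filename*= ...
filename = query.split("filename*=")[1]
# ... strip the utf-8'' prefix ...
if filename.startswith("utf-8''"):
    filename = filename[len("utf-8''"):]
# ... and percent-decode (twice, because this URL is double-encoded)
filename = unquote(unquote(filename))
print(filename)  # my file.png
```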

Python parsing JSON from url incomplete

I am trying to get JSON from a URL. I have succeeded in doing that, but when I print the result it seems like I only get half of the content.
This is what I wrote:
import json
import urllib.request

with urllib.request.urlopen("https://financialmodelingprep.com/api/v3/company/stock/list") as url:
    data = json.loads(url.read().decode())
    print(data)
I won't paste what I am getting because it's very long (if you look at the source and Ctrl+F for the fragment below, you will find it in the middle of the page), but it starts with:
ties Municipal Fund', 'price': 9.83,
Thanks for the help.
I think that your output has been cut off by your IDE.
When I run the same request and write it into a file, the data is fully written:
import requests

with requests.get("https://financialmodelingprep.com/api/v3/company/stock/list") as url:
    data = url.content.decode()

with open("file.txt", "w") as f:
    f.write(data)
When the response is large, the first read may only return what is available at that time. If you want to be sure to have the full data, you should loop reading until you get nothing (an empty byte string here).
Here, it would probably be more robust to pass the response stream to json.load and let it read until the json string is completed:
with urllib.request.urlopen(
        "https://financialmodelingprep.com/api/v3/company/stock/list") as url:
    data = json.load(url)
    print(data)
But I think that if the data was partial, json.loads should have choked on it. So the problem is likely elsewhere...
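For completeness, the read-until-empty loop mentioned above would look something like this (a sketch using an in-memory byte stream in place of the real network response):

```python
import io

# Stand-in for the HTTP response; in real code this comes from urlopen()
stream = io.BytesIO(b'{"price": 9.83}')

chunks = []
while True:
    chunk = stream.read(4096)
    if not chunk:  # an empty byte string signals end of stream
        break
    chunks.append(chunk)
body = b"".join(chunks)
print(body.decode())
```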

How to use wiktextract

I am trying to extract data from a Wiktionary XML dump using the wiktextract Python module. However, the website does not give me enough information. I could not use the command-line program that comes with it, since it isn't a Windows executable, so I tried the programmatic way. The following code takes a while to run, so it seems to be doing something, but then I'm not sure what to do with the ctx variable. Can anyone help me?
import wiktextract

def word_cb(data):
    print(data)

ctx = wiktextract.parse_wiktionary(
    r'myfile.xml', word_cb,
    languages=["English", "Translingual"])
You are on the right track, but don't have to worry too much about the ctx object.
As the documentation says:
The parse_wiktionary call will call word_cb(data) for words and redirects found in the
Wiktionary dump. data is information about a single word and part-of-speech as a dictionary (multiple senses of the same part-of-speech are combined into the same dictionary). It may also be a redirect (indicated by presence of a redirect key in the dictionary).
The returned ctx object mostly contains summary information (the number of sections processed, etc.); you can use dir(ctx) to see some of its fields.
The useful results are not the ones in the returned ctx object, but the ones passed to word_cb on a word-by-word basis. So you might just try something like the following to get a JSON dump from a wiktionary XML dump. Because the full dumps are many gigabytes, I put a small one on a server for convenience in this example.
import json
import wiktextract
import requests

xml_fn = 'enwiktionary-20190220-pages-articles-sample.xml'
print("Downloading XML dump to " + xml_fn)
response = requests.get('http://45.61.148.79/' + xml_fn, stream=True)
# Throw an error for bad status codes
response.raise_for_status()
with open(xml_fn, 'wb') as handle:
    for block in response.iter_content(4096):
        handle.write(block)
print("Downloaded XML dump, beginning processing...")

# Open in text mode, since json.dumps() returns a str
fh = open("output.json", "w")

def word_cb(data):
    fh.write(json.dumps(data))

ctx = wiktextract.parse_wiktionary(
    r'enwiktionary-20190220-pages-articles-sample.xml', word_cb,
    languages=["English", "Translingual"])

print("{} English entries processed.".format(ctx.language_counts["English"]))
print("{} bytes written to output.json".format(fh.tell()))
fh.close()
For me this produces:
Downloading XML dump to enwiktionary-20190220-pages-articles-sample.xml
Downloaded XML dump, beginning processing...
684 English entries processed.
326478 bytes written to output.json
with the small dump extract I placed on a server for convenience. It will take much longer to run on the full dump.

Python unable to read my csv due to extra carriage returns

Sorry if this is redundant, but I've tried very hard to find the answer to this and have been unable to. I'm very new to this, so please bear with me.
My objective is for a piece of code to read through a CSV full of URLs and return an HTTP status code for each. I have Python 2.7.5. The outcome for each row would give me the URL and the status code, something like this: www.stackoverflow.com: 200.
My CSV is a single-column CSV full of hundreds of URLs, one per row. The code I am using is below; when I run it, it gives me a \r joining two URLs, similar to this:
{http://www.stackoverflow.com/test\rhttp://www.stackoverflow.com/questions/': 404}
What I would like to see is the two URLs separated, each with its own HTTP status code:
{'http://www.stackoverflow.com': 200, 'http://www.stackoverflow.com/questions/': 404}
But there seems to be an extra \r when Python reads the CSV, so it doesn't read the URLs correctly. I know folks have said that strip() isn't an all-inclusive wiper-outer, so any advice on how to make this work would be very much appreciated.
import requests

def get_url_status(url):
    try:
        r = requests.head(url)
        return url, r.status_code
    except requests.ConnectionError:
        print "failed to connect"
        return url, 'error'

results = {}
with open('url2.csv', 'rb') as infile:
    for url in infile:
        url = url.strip()  # "http://datafox.co"
        url_status = get_url_status(url)
        results[url_status[0]] = url_status[1]

print results
You probably need to figure out how your CSV file is formatted before feeding it to Python.
First, make sure it has consistent line endings. If it has bare newlines in some places and CRLF in others, that's probably a problem that needs to be corrected.
If you're on a *nix system, tr might be useful for that.
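Alternatively, the mixed line endings can be handled inside Python itself: str.splitlines() recognizes \r, \n and \r\n, so splitting the whole file contents with it yields clean URLs. A sketch with inline sample data standing in for the real CSV contents:

```python
# Sample data with a stray carriage return, standing in for the file contents
raw = "http://www.stackoverflow.com/test\rhttp://www.stackoverflow.com/questions/\n"

# splitlines() handles \r, \n and \r\n, so each URL comes out on its own
urls = [line.strip() for line in raw.splitlines() if line.strip()]
print(urls)
```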

Python not parsing JSON API return prettily

I reviewed a handful of the questions related to mine and found this one slightly unique. I'm using Python 2.7.1 on OS X 10.7. One more note: I'm more of a hacker than a developer.
I snagged the syntax below from the Python documentation to try to do a "pretty print":
date = {}
data = urllib2.urlopen(url)
s = json.dumps(data.read(), sort_keys=True, indent=4)
print '\n'.join([l.rstrip() for l in s.splitlines()])
I expected that using the rstrip/splitlines calls would expand the output like in the example.
Also, not sure if it's relevant, but when trying to pipe the output to python -mjson.tool the reply is No JSON object could be decoded.
Here's a snippet of the cURL output I'm trying to parse:
{"data":[{"name":"Site Member","created_at":"2012-07-24T11:22:04-07:00","activity_id":"500ee7cbbaf02xxx8e011e2e",
And so on.
The main objective is to make this mess of data more legible so I can learn from it and start structuring some automatic scraping of data based on arguments. Any guidance to get me from green to successful is a huge help.
Thanks,
mjb
The output of urllib2.urlopen(url).read() is a string and needs to be converted to a Python object first before you can call json.dumps() on it.
Modified code:
date = {}
data = urllib2.urlopen(url)
data_obj = json.loads(data.read())
s = json.dumps(data_obj, sort_keys=True, indent=4)
print s
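To see the difference, here is a minimal reproduction using a fragment shaped like the cURL output from the question: json.loads() turns the string into an object, and only then does json.dumps(..., indent=4) produce the expected pretty-printed form.

```python
import json

# Fragment shaped like the cURL output in the question
raw = '{"data": [{"name": "Site Member", "activity_id": "500ee7cbbaf02xxx8e011e2e"}]}'

obj = json.loads(raw)               # str -> Python object
print(json.dumps(obj, sort_keys=True, indent=4))  # object -> indented str
```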
