Alternative to innerHTML that includes the header? - Python

I'm trying to extract data from the following page:
http://www.bmreports.com/servlet/com.logica.neta.bwp_PanBMDataServlet?param1=&param2=&param3=&param4=&param5=2009-04-22&param6=37#
Which, conveniently and inefficiently enough, includes all the data embedded as a CSV file in the header, set as a variable called gs_csv.
How do I extract this? Document.body.innerHTML skips the header where the data is; what is the alternative that includes the header (or, better yet, the value associated with gs_csv)?
(Sorry, new to all this, I've been searching through loads of documentation, and trying a lot of them, but nothing so far has worked).
Thanks to Sinan (this is mostly his solution transcribed into Python).
import win32com.client
import time
import os
import os.path

ie = win32com.client.Dispatch("InternetExplorer.Application")
ie.Visible = False
ie.Navigate("http://www.bmreports.com/servlet/com.logica.neta.bwp_PanBMDataServlet?param1=&param2=&param3=&param4=&param5=2009-04-22&param6=37#")
time.sleep(20)  # crude wait for the page and its scripts to finish loading

webpage = ie.document.body.innerHTML
s1 = ie.document.scripts(1).text      # the script block that defines gs_csv
s1 = s1[s1.find("gs_csv") + 8:-11]    # slice out the CSV text assigned to gs_csv

scriptfilepath = r"c:\FO Share\bmreports\script.txt"  # raw string so the backslashes stay literal
scriptfile = open(scriptfilepath, 'wb')
scriptfile.write(s1)
scriptfile.close()
ie.Quit()

Untested: Did you try looking at what Document.scripts contains?
UPDATE:
For some reason, I am having immense difficulty getting this to work using the Windows Scripting Host (but then, I don't use it very often, apologies). Anyway, here is the Perl source that works:
use strict;
use warnings;

use Win32::OLE qw( in );
$Win32::OLE::Warn = 3;

my $ie = get_ie();
$ie->{Visible} = 1;
$ie->Navigate(
    'http://www.bmreports.com/servlet/com.logica.neta.bwp_PanBMDataServlet?'
    . 'param1=&param2=&param3=&param4=&param5=2009-04-22&param6=37#'
);
sleep 1 until is_ready( $ie );

my $scripts = $ie->Document->{scripts};

for my $script ( in $scripts ) {
    print $script->text;
}

sub is_ready { $_[0]->{ReadyState} == 4 }

sub get_ie {
    Win32::OLE->new('InternetExplorer.Application',
        sub { $_[0] and $_[0]->Quit },
    );
}

__END__

C:\Temp> ie > output
output now contains everything within the script tags.

Fetch the source of that page using Ajax, and parse the response text like XML using jQuery. It should be simple enough to get the text of the first script tag you encounter inside the head.
I'm out of touch with jQuery, or I would have posted code examples.
EDIT: I assume you are talking about fetching the CSV on the client side.

If this is just a one-off script then extracting this CSV data is as simple as this:
import urllib2

response = urllib2.urlopen('http://www.bmreports.com/foo?bar?')
html = response.read()
# everything between "gs_csv=" and the closing </SCRIPT> tag is the CSV payload
csv_data = html.split('gs_csv=')[1].split('</SCRIPT>')[0]
# process csv_data here
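Once you have that string, a minimal sketch of turning it into rows with the standard csv module (this assumes the payload really is plain comma-separated text):
import csv
from StringIO import StringIO  # Python 2; use io.StringIO on Python 3

rows = list(csv.reader(StringIO(csv_data)))  # csv_data from the snippet above
for row in rows:
    print row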

Related

How do you correctly parse web links to avoid a 403 error when using Wget?

I just started learning python yesterday and have VERY minimal coding skill. I am trying to write a python script that will process a folder of PDFs. Each PDF contains at least 1, and maybe as many as 15 or more, web links to supplemental documents. I think I'm off to a good start, but I'm having consistent "HTTP Error 403: Forbidden" errors when trying to use the wget function. I believe I'm just not parsing the web links correctly. I think the main issue is coming in because the web links are mostly "s3.amazonaws.com" links that are SUPER long.
For reference:
Link copied directly from PDF (works to download): https://s3.amazonaws.com/os_uploads/2169504_DFA%20train%20pass.PNG?AWSAccessKeyId=AKIAIPCTK7BDMEW7SP4Q&Expires=1909634500&Signature=aQlQXVR8UuYLtkzjvcKJ5tiVrZQ=&response-content-disposition=attachment;%20filename*=utf-8''DFA%2520train%2520pass.PNG
Link as it appears after trying to parse it in my code (doesn't work, gives "unknown url type" when trying to download): https%3A//s3.amazonaws.com/os_uploads/2169504_DFA%2520train%2520pass.PNG%3FAWSAccessKeyId%3DAKIAIPCTK7BDMEW7SP4Q%26Expires%3D1909634500%26Signature%3DaQlQXVR8UuYLtkzjvcKJ5tiVrZQ%253D%26response-content-disposition%3Dattachment%253B%2520filename%252A%253Dutf-8%2527%2527DFA%252520train%252520pass.PNG
Additionally, people are welcome to weigh in if I'm doing this in a stupid way. Each PDF starts with a string of 6 digits, and once I download supplemental documents I want to auto-save and name them as XXXXXX_attachY.*, where X is the identifying string of digits and Y just increases for each attachment. I haven't gotten my code to work well enough to test that, but I'm fairly certain I don't have it correct either.
Help!
#!/usr/bin/env python3
import os
import glob
import pdfx
import wget
import urllib.parse

## Accessing and Creating Six Digit File Code
pdf_dir = "/users/USERNAME/desktop/worky"
pdf_files = glob.glob("%s/*.pdf" % pdf_dir)

for file in pdf_files:
    ## Identify File Name and Limit to Digits
    filename = os.path.basename(file)
    newname = filename[0:6]

    ## Run PDFX to identify and download links
    pdf = pdfx.PDFx(filename)
    url_list = pdf.get_references_as_dict()
    attachment_counter = (1)

    for x in url_list["url"]:
        if x[0:4] == "http":
            parsed_url = urllib.parse.quote(x, safe='://')
            print(parsed_url)
            wget.download(parsed_url, '/users/USERNAME/desktop/worky/(newname)_attach(attachment_counter).*')
            ##os.rename(r'/users/USERNAME/desktop/worky/(filename).*',r'/users/USERNAME/desktop/worky/(newname)_attach(attachment_counter).*')
            attachment_counter += 1

    for x in url_list["pdf"]:
        print(parsed_url + "\n")
I prefer to use requests (https://requests.readthedocs.io/en/master/) when trying to grab text or files online. I tried it quickly with wget and I got the same error (might be linked to user-agent HTTP headers used by wget).
wget and HTTP headers issues : download image from url using python urllib but receiving HTTP Error 403: Forbidden
HTTP headers : https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/User-Agent
The good thing with requests is that it lets you modify HTTP headers the way you want (https://requests.readthedocs.io/en/master/user/quickstart/#custom-headers).
import requests

r = requests.get("https://s3.amazonaws.com/os_uploads/2169504_DFA%20train%20pass.PNG?AWSAccessKeyId=AKIAIPCTK7BDMEW7SP4Q&Expires=1909634500&Signature=aQlQXVR8UuYLtkzjvcKJ5tiVrZQ=&response-content-disposition=attachment;%20filename*=utf-8''DFA%2520train%2520pass.PNG")
with open("myfile.png", "wb") as file:
    file.write(r.content)
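Since the 403 seems tied to wget's default User-Agent, here is a minimal sketch of sending a custom User-Agent with requests (the header value is only an illustrative browser string, and the URL is shortened here):
import requests

# assumption: the server rejects non-browser user agents; this value is just an example
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}
r = requests.get("https://s3.amazonaws.com/os_uploads/...", headers=headers)  # URL truncated
r.raise_for_status()
with open("myfile.png", "wb") as file:
    file.write(r.content)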
I'm not sure I understand what you're trying to do, but maybe you want to use formatted strings to build your URLs (https://docs.python.org/3/library/stdtypes.html?highlight=format#str.format) ?
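For example, a minimal str.format sketch for the XXXXXX_attachY naming scheme described in the question (the values and extension are placeholders):
# placeholder values, just to illustrate str.format
newname = "123456"
attachment_counter = 1
outpath = "/users/USERNAME/desktop/worky/{}_attach{}.png".format(newname, attachment_counter)
print(outpath)  # /users/USERNAME/desktop/worky/123456_attach1.png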
Maybe checking string indexes is fine in your case (if x[0:4] == "http":), but I think you should check python re package to use regular expressions to catch the elements you want in a document (https://docs.python.org/3/library/re.html).
import re

regex = re.compile(r"^http://")
if re.match(regex, mydocument):
    <do something>
The reason for this behavior is inside the wget library. Internally it encodes the URL with urllib.parse.quote() (https://docs.python.org/3/library/urllib.parse.html#urllib.parse.quote).
Basically it replaces characters with their appropriate %xx escape characters. Your URL is already escaped, but the library does not know that. When it parses the %20 it sees % as a character that needs to be replaced, so the result is %2520 and a different URL - therefore the 403 error.
You could decode that URL first and then pass it, but then you would have another problem with this library, because your URL has the parameter filename*= while the library expects filename=.
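A quick sketch of that double-encoding (the URL is a made-up example):
from urllib.parse import quote

url = "https://example.com/os_uploads/DFA%20train%20pass.PNG"
print(quote(url, safe='/:'))
# the already-escaped %20 becomes %2520, so the server is asked for a different object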
I would recommend doing something like this:
import requests

# get the file
req = requests.get(parsed_url)

# parse your URL to get the GET parameters
get_parameters = [x for x in parsed_url.split('?')[1].split('&')]

filename = ''
# find the GET parameter that carries the name
for get_parameter in get_parameters:
    if "filename*=" in get_parameter:
        # split it to get the name
        filename = get_parameter.split('filename*=')[1]

# save the file
with open(<path> + filename, 'wb') as file:
    file.write(req.content)
I would also recommend removing the utf-8'' in that filename because I don't think it is actually part of the filename. You could also use regular expressions for getting the filename, but this was easier for me.
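For the regular-expression route, a hedged sketch (the pattern assumes the filename*=utf-8''<name> shape visible in the example URL):
import re
import urllib.parse

m = re.search(r"filename\*=(?:utf-8'')?([^&]+)", parsed_url)
if m:
    filename = urllib.parse.unquote(m.group(1))
    print(filename)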

Python reading responses and validating the result

Currently I am stuck: I can print out the result returned by the API, but I can't alter or read it without first parsing it into a text file.
Furthermore, I don't need all of the information that the API provides; it would be great if I could get only the match_id.
The response from the API is shown in the linked Result screenshot.
From the result I only need the match_id. After I have the match_id, I want to compare it with a list of strings, e.g. 3238829394, 3238829395 and more, to check whether any of the values match mine; if one does, the system should alert me.
I have found a way of doing it by writing the results into a text file, then comparing it with the list that I have.
The code for getting the response:
import dota2api
import json
import requests

api = dota2api.Initialise("[Value API][2]")
hist = api.get_match_history_by_seq_num(start_at_match_seq_num=2829690055, matches_requested=1)
response = str(hist)

f = open('myfile.txt', 'w')
f.write(response)
f.close()
However I am hoping to find a faster and better way to do this process, as it is very time consuming and unstable. Thank you.
You are getting JSON back from that API, and the library already hands it to you as a Python dict, so all the data can be accessed directly without writing it to a file and parsing it.
The response will be something like this (sorry, I cannot copy-paste from that image to read the JSON properly):
for match in response['matches']:
    if is_similar(match['match_id']):
        do_something_cool_here
I think that should do what you need. If you post the response as text I can help you build the code properly, but I guess you get the idea of what I am trying to say there :)
Hope it helps!
EDIT:
We talked by private and this works:
import dota2api
import requests

api = dota2api.Initialise("API_KEY")
response = api.get_match_history_by_seq_num(start_at_match_seq_num=SEQ_NUM, matches_requested=1)
match_id_check = MATCH_ID

for match in response['matches']:
    if match_id_check == match['match_id']:
        print(match)
with API_KEY, SEQ_NUM and MATCH_ID to configure

Can I get the POST data from environment variables in Python?

I am trying to understand HTTP queries and was successfully getting the data from GET requests through the environment variables by first looking through the keys of the environment vars and then accessing 'QUERY_STRING' to get the actual data.
like this:
#!/usr/bin/python3
import sys
import cgi
import os

inputVars = cgi.FieldStorage()
f = open('test', 'w')
f.write(str(os.environ['QUERY_STRING']) + "\n")
f.close()
Is there a way to get the POST data (the equivalent of 'QUERY_STRING' for POST, so to say) as well, or is it not accessible because the POST data is sent in its own package? The keys of the environment variables did not give me any hint so far.
The possible duplicate link solved it, as syntonym pointed out in the comments and user Schien explains in one of the answers to the linked question:
the raw HTTP POST data (the stuff after the query) can be read through stdin,
so the sys.stdin.read() method can be used.
my code now works looking like this:
#!/usr/bin/python3
import sys
import os
f = open('test','w')
f.write(str(sys.stdin.read()))
f.close()
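A slightly more robust sketch (my own addition, not from the linked answer) reads exactly CONTENT_LENGTH bytes instead of relying on the stream ending:
#!/usr/bin/python3
import os
import sys

# CGI servers pass the size of the POST body in the CONTENT_LENGTH variable
length = int(os.environ.get('CONTENT_LENGTH', 0))
post_data = sys.stdin.read(length)

with open('test', 'w') as f:
    f.write(post_data)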

Access a JSON in Python

I'm new at Python as well as jQuery.
I have the following JSON in JS:
var intervalos = {"variables": {
    "nombreVariables": nombreVariables,
    "extremoInferior": fromarray,
    "extremoSuperior": toarray,
    "step": steparray,
    "random": randomarray
}};  //intervalos
In Python I did:
parsed_input = json.loads(self.intervalos)
How can I access the structure? (I know that this is a list/dictionary.) Like this?
intervalos['variables']['nombreVariables'][i];
Presumably, the JS is remote. Assuming it's accessible at http://example.com/json, one would import it into Python as follows:
import requests

parsed_input = requests.get('http://example.com/json').json()
for i in xrange(0, len(parsed_input['variables']['nombreVariables'])):
    print parsed_input['variables']['nombreVariables'][i]
You'll need requests to make this work in Python 2. With Python 3.x, your syntax will be slightly different.
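For reference, a minimal Python 3 sketch of the same idea (the URL is still the placeholder from above):
import requests

parsed_input = requests.get('http://example.com/json').json()
for nombre in parsed_input['variables']['nombreVariables']:
    print(nombre)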

Python urllib2 file upload problems

I'm currently trying to initiate a file upload with urllib2 and the urllib2_file library. Here's my code:
import sys
import urllib2_file
import urllib2
URL='http://aquate.us/upload.php'
d = [('uploaded', open(sys.argv[1:]))]
req = urllib2.Request(URL, d)
u = urllib2.urlopen(req)
print u.read()
I've placed this .py file in my My Documents directory and placed a shortcut to it in my Send To folder (the shortcut URL is ).
When I right click a file, choose Send To, and select Aquate (my python), it opens a command prompt for a split second and then closes it. Nothing gets uploaded.
I figured there was probably an error going on, so I typed the code into the command-line Python interpreter, line by line.
When I ran the u=urllib2.urlopen(req) line, I didn't get an error;
(screenshot: http://www.aquate.us/u/55245858877937182052.jpg)
instead, the cursor simply started blinking on a new line beneath that line. I waited a couple of minutes to see if something would happen but it just stayed like that. To get it to stop, I had to press ctrl+break.
What's up with this script?
Thanks in advance!
[Edit]
Forgot to mention -- when I ran the script without the request data (the file) it ran like a charm. Is it a problem with urllib2_file?
[edit 2]:
import MultipartPostHandler, urllib2, cookielib,sys
import win32clipboard as w
cookies = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookies),MultipartPostHandler.MultipartPostHandler)
params = {"uploaded" : open("c:/cfoot.js") }
a=opener.open("http://www.aquate.us/upload.php", params)
text = a.read()
w.OpenClipboard()
w.EmptyClipboard()
w.SetClipboardText(text)
w.CloseClipboard()
That code works like a charm if you run it through the command line.
If you're using Python 2.5 or newer, urllib2_file is both unnecessary and unsupported, so check which version you're using (and perhaps upgrade).
If you're using Python 2.3 or 2.4 (the only versions supported by urllib2_file), try running the sample code and see if you have the same problem. If so, there is likely something wrong either with your Python or urllib2_file installation.
EDIT:
Also, you don't seem to be using either of urllib2_file's two supported formats for POST data. Try using one of the following two lines instead:
d = ['uploaded', open(sys.argv[1:])]
## --OR-- ##
d = {'uploaded': open(sys.argv[1:])}
First, there's a third way to run Python programs.
From cmd.exe, type python myprogram.py. You get a nice log. You don't have to type stuff one line at a time.
Second, check the urllib2 documentation. You'll need to look at urllib, also.
A Request requires a URL and a urlencoded buffer of data. Quoting the docs:
    data should be a buffer in the standard application/x-www-form-urlencoded format. The urllib.urlencode() function takes a mapping or sequence of 2-tuples and returns a string in this format.
You need to encode your data.
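A minimal sketch of that encoding step (Python 2, reusing the field name and file path already shown in the question; note urlencode produces plain form fields, not a multipart file upload):
import urllib
import urllib2

URL = 'http://aquate.us/upload.php'
# urlencode turns a mapping into "key=value&..." suitable for a POST body
data = urllib.urlencode({'uploaded': open('c:/cfoot.js').read()})
req = urllib2.Request(URL, data)
u = urllib2.urlopen(req)
print u.read()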
If you're still on Python2.5, what worked for me was to download the code here:
http://peerit.blogspot.com/2007/07/multipartposthandler-doesnt-work-for.html
and save it as MultipartPostHandler.py
then use:
import urllib2, MultipartPostHandler
opener = urllib2.build_opener(MultipartPostHandler.MultipartPostHandler())
opener.open(url, {"file":open(...)})
or if you need cookies:
import urllib2, MultipartPostHandler, cookielib
cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj), MultipartPostHandler.MultipartPostHandler())
opener.open(url, {"file":open(...)})
