Understanding how to extract data from HTML File

Understanding how to extract data from HTML File - python

I am trying to access the "Yield Curve Data" available on this page. It has a radio button which upon clicking "Submit" results in a zip File, from which I am looking to get the data. I am looking to get the data from the "Retrieve all data" Option. My code is as follows, and from the statement print result.read() I realize that result is actually a HTML Document. My difficult is in understanding how to extract the data from result as I don't see any data in this. I am confused as to where to go from here.
import urllib, urllib2
import csv
from StringIO import StringIO
import pandas as pd
import os
from zipfile import ZipFile
my_url = 'http://www.bankofcanada.ca/rates/interest-rates/bond-yield-curves/'
data = urllib.urlencode({'lastchange': 'all'})
request = urllib2.Request(my_url, data)
result = urllib2.urlopen(request)
Thank You

Your going to need to generate a post request to the following endpoint:
http://www.bankofcanada.ca/stats/results/csv
With the following form data:
lookupPage: lookup_yield_curve.php
startRange: 1986-01-01
searchRange: all
This should give you the file.
You may also need to fake your useragent.

Related

download csv data through requests.get() retrieves me php text

I tried to download specific data as part of my work,
the data is located in link! .
The source indicates how to download through the get method, but when I make my requests:
import requests
import pandas as pd
url="https://estadisticas.bcrp.gob.pe/estadisticas/series/api/PN01210PM/csv/2015-01/2019-01"
r=pd.to_csv(url)
it doesnt read as it should be (open link in navigator).
When I try
s=requests.get(url,verify=False) # you can set verify=True
df=pd.DataFrame(s)
the data neither is good.
What else can I do? It suppose to download the data as csv avoiding me to clean the data.

to get the content as csv you can replace all HTML line breaks with newline chars.
please let me know if this works for you:
import requests
import pandas as pd
from io import StringIO
url = "https://estadisticas.bcrp.gob.pe/estadisticas/series/api/PN01210PM/csv/2015-01/2019-01"
content = requests.get(url,verify=False).text.replace("<br>","\n").strip()
csv = StringIO(content)
r = pd.read_csv(csv)
print(r)

PUT method using python3 and urbllib - headers

So I am trying to just receive the data from this json. I to use POST, GET on any link but the link I am currently trying to read. It needs [PUT]. So I wanted to know if I was calling this url correctly via urllib or am I missing something?
Request
{"DataType":"Word","Params":["1234"], "ID":"22"}
Response {
JSON DATA IN HERE
}
I feel like I am doing the PUT method call wrong since it is wrapped around Request{}.
import urllib.request, json
from pprint import pprint
header = {"DataType":"Word","Params":"[1234"]", "ID":"22"}
req = urllib.request.Request(url = "website/api/json.service", headers =
heaer, method = 'PUT')
with urllib.request.urlopen(req) as url:
data = json.loads(url.read(), decode())
pprint(data)
I am able to print json data as long as its anything but PUT. As soon as I get a site with put on it with the following JSON template I get an Internal Error 500. So I assumed it was was my header.
Thank you in advanced!

robobrowser to deal with response in json

I'm trying to programmably access a website
from robobrowser import RoboBrowser
import sys
browser = RoboBrowser(history=True)
browser.open('https://test.com/login')
loginForm = browser.get_form()
loginForm['UserName']='username'
loginForm['Password']='*'
browser.submit_form(loginForm)
if browser.response.ok:
if browser.response.content[2]=='false':
print browser.response.content[4]
sys.exit(1)
website returned json format ( at least i think it's json), but i can't seems to find robobrowser api for dealing with json.
{"RedirectUrl":null,"IsSuccess":false,"Message":null,"CustomMessage":null,"Errors":[{"Key":"CaptchaValue","Value":["Your response did not match. Please try again."]}],"Messages":{},"HasView":true.......}
As you can see I want to test if "isSuccess", and print error message, how can I proceed in this case?
thanks

found a solution using json
json.load(StringIO(browser.response.content))

and for python 3.x is functional
import io
import json
json.load(io.BytesIO(browser.response.content))

python urllib2.urlopen - html text is garbled - why?

The printed html returns garbled text... instead of what I expect to see as seen in "view source" in browser.
Why is that? How to fix it easily?
Thank you for your help.
Same behavior using mechanize, curl, etc.
import urllib
import urllib2
start_url = "http://www.ncert.nic.in/ncerts/textbook/textbook.htm"
response = urllib2.urlopen(start_url)
html = response.read()
print html

I got the same garbled text using curl
curl http://www.ncert.nic.in/ncerts/textbook/textbook.htm
The result appears to be gzipped. So this shows the correct HTML for me.
curl http://www.ncert.nic.in/ncerts/textbook/textbook.htm | gunzip
Here's a solutions on doing this in Python: Convert gzipped data fetched by urllib2 to HTML
Edited by OP:
The revised answer after reading above is:
import urllib
import urllib2
import gzip
import StringIO
start_url = "http://www.ncert.nic.in/ncerts/textbook/textbook.htm"
response = urllib2.urlopen(start_url)
html = response.read()
data = StringIO.StringIO(html)
gzipper = gzip.GzipFile(fileobj=data)
html = gzipper.read()
html now holds the HTML (Print it to see)

Try requests. Python Requests.
import requests
response = requests.get("http://www.ncert.nic.in/ncerts/textbook/textbook.htm")
print response.text
The reason for this is because the site uses gzip encoding. To my knowledge urllib doesn't support deflating so you end up with compressed html responses for certain sites that use that encoding. You can confirm this by printing the content headers from the response like so.
print response.headers
There you will see that the "Content-Encoding" is gzip format. In order to get around this using the standard urllib library you'd need to use the gzip module. Mechanize also does this because it uses the same urllib library. Requests will handle this encoding and format it nicely for you.

Python urllib POST response

I am trying to write script that searches an inchikey (ex: OBSSCZVQJAGPOE-KMKNQKDISA-N) to get a chemical structure from this website:
http://www.chemspider.com/inchi-resolver/Resolver.aspx
From the documentation my code looks like it should work, but instead it just returns the original search page.
Thanks for the help,
import urllib
inchi = 'OBSSCZVQJAGPOE-KMKNQKDISA-N'
url = 'http://www.chemspider.com/inchi-resolver/Resolver.aspx'
data = urllib.urlencode({'"ctl00$ContentPlaceHolder1$TextBox1"':inchi})
response = urllib.urlopen(url, data)
print response.read()

Your code is performing a GET request and not a POST request. Apart from that: the form contains various hidden fields with some strange values which might be necessary for the processing as well.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Understanding how to extract data from HTML File - python

Your going to need to generate a post request to the following endpoint: http://www.bankofcanada.ca/stats/results/csv With the following form data: lookupPage: lookup_yield_curve.php startRange: 1986-01-01 searchRange: all This should give you the file. You may also need to fake your useragent.

Related

download csv data through requests.get() retrieves me php text

PUT method using python3 and urbllib - headers

robobrowser to deal with response in json

python urllib2.urlopen - html text is garbled - why?

Python urllib POST response

Categories

Resources