How to modify pandas's read_html user-agent? - python

I'm trying to scrape English football stats from various HTML tables on the Transfermarkt website using the pandas.read_html() function.
Example:
import pandas as pd
url = r'http://www.transfermarkt.co.uk/en/premier-league/gegentorminuten/wettbewerb_GB1.html'
df = pd.read_html(url)
However, this code generates a "ValueError: Invalid URL" error.
I then attempted to fetch the same page using the urllib2.urlopen() function. This time I got an "HTTPError: HTTP Error 404: Not Found". After the usual trial-and-error fault finding, it turns out that urllib2 presents a Python-like user agent to the web server, which I presume the site doesn't accept.
Now if I modify urllib2's user agent and read the contents using BeautifulSoup, I'm able to read the table without a problem.
Example:
from BeautifulSoup import BeautifulSoup
import urllib2
opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
url = r'http://www.transfermarkt.co.uk/en/premier-league/gegentorminuten/wettbewerb_GB1.html'
response = opener.open(url)
html = response.read()
soup = BeautifulSoup(html)
table = soup.find("table")
How do I modify the urllib2 header that pandas uses so that Python can scrape this website?
Thanks

Currently you cannot. Relevant piece of code:
if _is_url(io):  # io is the url
    try:
        with urlopen(io) as url:
            raw_text = url.read()
    except urllib2.URLError:
        raise ValueError('Invalid URL: "{0}"'.format(io))
As you can see, it just passes the URL to urlopen and reads the data. You can file an issue requesting this feature, but I assume you don't have time to wait for it to be resolved, so I would suggest fetching the HTML yourself with a custom user agent and then loading it into a DataFrame.
import urllib2
import pandas as pd

url = 'http://www.transfermarkt.co.uk/en/premier-league/gegentorminuten/wettbewerb_GB1.html'
opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
response = opener.open(url)
tables = pd.read_html(response.read(), attrs={"class": "tabelle_grafik"})[0]
Or if you can use requests:
import requests

tables = pd.read_html(requests.get(url, headers={'User-agent': 'Mozilla/5.0'}).text,
                      attrs={"class": "tabelle_grafik"})[0]

Related

Download content of Webpage

I need to download the content of a web page using Python.
What I need is the TLE of a specific satellite from Space-Track.org website.
An example of the url I need to scrape is the following:
https://www.space-track.org/basicspacedata/query/class/gp/NORAD_CAT_ID/44235/format/tle/emptyresult/show
Below is the unsuccessful code I wrote/copied:
import requests
from bs4 import BeautifulSoup

url = 'https://www.space-track.org/basicspacedata/query/class/gp/NORAD_CAT_ID/44235/format/tle/emptyresult/show'
res = requests.post(url)
html_page = res.content

soup = BeautifulSoup(html_page, 'html.parser')
text = soup.find_all(text=True)
print(text)
requests.post(url) returns Response [204] and I can't access the content of the webpage.
Could this happen because of the required login?
I must admit that I am not experienced with Python and I don't have the knowledge to do this myself.
What I can do is manipulate text files, and from the DevTools page I can get the HTML file and extract the text, but how can I do this programmatically?
To access the URL you mentioned, you need username and password authorization.
To do this (customize it to your needs):
import mechanize
from bs4 import BeautifulSoup
import urllib2
import cookielib ## http.cookiejar in python3
cj = cookielib.CookieJar()
br = mechanize.Browser()
br.set_cookiejar(cj)
br.open("https://id.arduino.cc/auth/login/")
br.select_form(nr=0)
br.form['username'] = 'username'
br.form['password'] = 'password.'
br.submit()
print br.response().read()
I don't have access to this API, so take my advice with a grain of salt, but you should also try using requests.get instead of requests.post.
Why? Because requests.post POSTs data to the server, while requests.get GETs data from the server. GET and POST are known as HTTP methods, and to learn more about them, see https://www.tutorialspoint.com/http/http_methods.htm. Since web browsers use GET, you should give that a try.
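If it is indeed the missing login, a minimal requests sketch would look roughly like the following. The ajaxauth/login endpoint and the identity/password field names are my recollection of Space-Track's API documentation, so verify them against the current docs before relying on this:
import requests

LOGIN_URL = 'https://www.space-track.org/ajaxauth/login'   # assumed from Space-Track's API docs
QUERY_URL = ('https://www.space-track.org/basicspacedata/query/'
             'class/gp/NORAD_CAT_ID/44235/format/tle/emptyresult/show')

with requests.Session() as session:
    # Log in first; the session keeps the authentication cookie for later requests
    login = session.post(LOGIN_URL, data={'identity': 'your_username', 'password': 'your_password'})
    login.raise_for_status()

    # A plain GET of the query URL in the same session should return the TLE as text
    tle = session.get(QUERY_URL)
    tle.raise_for_status()
    print(tle.text)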

Python3 beautifulsoup doesn't parse anything

import requests
from bs4 import BeautifulSoup
url = "https://www.sahibinden.com/hyundai/"
req = requests.get(url)
context = req.content
soup = BeautifulSoup(context, "html.parser")
print(soup.prettify())
I am getting an error with the above code. If I try to parse another website it works, but there is a problem with sahibinden.com. When I run the program, it waits for about a minute and then throws an error. I have to parse this website. Could you please help me by explaining what the issue is?
Your problem is that the server expects a user agent and won't serve the request without one.
Is it possible that the error you're getting is a timeout?
Add the following to your code:
user_agent = 'Mozilla/5.0'   # any realistic browser user-agent string works
headers_dict = {'User-Agent': user_agent}
req = requests.get(url, headers=headers_dict)
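Putting it together with the original snippet, roughly (the 'Mozilla/5.0' user-agent string and the timeout value are placeholders, not required values):
import requests
from bs4 import BeautifulSoup

url = "https://www.sahibinden.com/hyundai/"
headers_dict = {'User-Agent': 'Mozilla/5.0'}   # placeholder browser-like user agent

# A timeout makes the request fail fast instead of hanging for a minute
req = requests.get(url, headers=headers_dict, timeout=10)
context = req.content

soup = BeautifulSoup(context, "html.parser")
print(soup.prettify())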

Login to a website with Google using BeautifulSoup and Python 2.7

I am writing a Python web-crawler for Quora, but need to log in using Google. I have searched the net, but nothing satisfies my problem. Here is my code:
# -*- coding: utf-8 -*-
import mechanize
import os
import requests
import urllib
import urllib2
from bs4 import BeautifulSoup
import cookielib
# Store the cookies and create an opener that will hold them
cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
# Add our headers
opener.addheaders = [('User-agent', 'RedditTesting')]
# Install our opener (note that this changes the global opener to the one
# we just made, but you can also just call opener.open() if you want)
urllib2.install_opener(opener)
# The action/ target from the form
authentication_url = 'https://quora.com'
# Input parameters we are going to send
payload = {
    'op': 'login-main',
    'user': '<username>',
    'passwd': '<password>'
}
# Use urllib to encode the payload
data = urllib.urlencode(payload)
# Build our Request object (supplying 'data' makes it a POST)
req = urllib2.Request(authentication_url, data)
# Make the request and read the response
resp = urllib2.urlopen(req)
contents = resp.read()
# specify the url
quote_page = "https://www.quora.com/"
# query the website and return the html to the variable 'page'
page = urllib2.urlopen(quote_page)
# parse the html using beautiful soup and store in variable `soup`
soup = BeautifulSoup(page, 'html.parser')
# Take out the <div> of name and get its value
name_box = soup.find('div', attrs={"class": "ContentWrapper"})
name = name_box.text.strip()  # strip() is used to remove starting and trailing whitespace
print name
for link in soup.find_all('img'):
    image = link.get("src")
    image_name = os.path.split(image)[1]
    print(image_name)
    r2 = requests.get(image)
    with open(image_name, "wb") as f:
        f.write(r2.content)
As I don't have an actual username for the site, I use my own Gmail account. In order to log in, I used some code from a different question, but that does not work.
Any indentation errors are due to my lousy formatting.
To log in and scrape, use a Session; make a POST request with your credentials as the payload and then scrape.
import requests
from bs4 import BeautifulSoup

with requests.Session() as s:
    p = s.post("https://quora.com", data={
        "email": '*******',
        "password": "*************"
    })
    print(p.text)

    base_page = s.get('https://quora.com')
    soup = BeautifulSoup(base_page.content, 'html.parser')
    print(soup.title)

fetch and save http response before redirect using mechanize

I am trying to fetch a sample page in Python.
import mechanize

def viewpage(url):
    browser = mechanize.Browser()
    page = browser.open(url)
    source_code = page.read()
    print source_code

viewpage('https://sama.com/index.php?req=1')
However, every time it gets redirected to index2.php (by a Location header from the web server), so the code prints the response from index2.php rather than index.php. Is there any way to avoid that?
You can use urllib2 or requests for more complex stuff.
import urllib2
response = urllib2.urlopen("http://google.com")
page_source = response.read()
urllib2 is a built-in module and requests is 3rd party.
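If the goal is specifically to see index.php's own response before the Location redirect is followed, requests can be told not to follow redirects. A minimal sketch with the question's URL:
import requests

url = 'https://sama.com/index.php?req=1'

# allow_redirects=False returns the 3xx response itself instead of following Location
response = requests.get(url, allow_redirects=False)

print(response.status_code)               # e.g. 301 or 302
print(response.headers.get('Location'))   # where the server wanted to redirect
print(response.text)                      # the body index.php returned, if any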

How to get contents of frames automatically if browser does not support frames + can't access frame directly

I am trying to automatically download PDFs from URLs like this to make a library of UN resolutions.
If I use beautiful soup or mechanize to open that URL, I get "Your browser does not support frames" -- and I get the same thing if I use the copy as curl feature in chrome dev tools.
The standard advice for the "Your browser does not support frames" when using mechanize or beautiful soup is to follow the source of each individual frame and load that frame. But if I do so, I get to an error message that the page is not authorized.
How can I proceed? I guess I could try this in zombie or phantom but I would prefer to not use those tools as I am not that familiar with them.
Okay, this was an interesting task to do with requests and BeautifulSoup.
There is a set of underlying calls to un.org and daccess-ods.un.org that are important and that set relevant cookies. This is why you need to maintain a requests.Session() and visit several URLs before getting access to the PDF.
Here's the complete code:
import re
from urlparse import urljoin
from bs4 import BeautifulSoup
import requests
BASE_URL = 'http://www.un.org/en/ga/search/'
URL = "http://www.un.org/en/ga/search/view_doc.asp?symbol=A/RES/68/278"
BASE_ACCESS_URL = 'http://daccess-ods.un.org'
# start session
session = requests.Session()
response = session.get(URL, headers={'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36'})
# get frame links
soup = BeautifulSoup(response.text)
frames = soup.find_all('frame')
header_link, document_link = [urljoin(BASE_URL, frame.get('src')) for frame in frames]
# get header
session.get(header_link, headers={'Referer': URL})
# get document html url
response = session.get(document_link, headers={'Referer': URL})
soup = BeautifulSoup(response.text)
content = soup.find('meta', content=re.compile('URL='))['content']
document_html_link = re.search('URL=(.*)', content).group(1)
document_html_link = urljoin(BASE_ACCESS_URL, document_html_link)
# follow html link and get the pdf link
response = session.get(document_html_link)
soup = BeautifulSoup(response.text)
# get the real document link
content = soup.find('meta', content=re.compile('URL='))['content']
document_link = re.search('URL=(.*)', content).group(1)
document_link = urljoin(BASE_ACCESS_URL, document_link)
print document_link
# follow the frame link with login and password first - would set the important cookie
auth_link = soup.find('frame', {'name': 'footer'})['src']
session.get(auth_link)
# download file
with open('document.pdf', 'wb') as handle:
    response = session.get(document_link, stream=True)
    for block in response.iter_content(1024):
        if not block:
            break
        handle.write(block)
You should probably extract separate blocks of code into functions to make it more readable and reusable.
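For example, the meta-refresh lookup that appears twice above could become a small helper; a sketch (the name follow_meta_refresh is introduced here, it is not part of the original answer):
def follow_meta_refresh(session, url, referer=None):
    """GET a page, read its <meta ... content="...URL=..."> tag and
    return the absolute URL it redirects to."""
    headers = {'Referer': referer} if referer else {}
    response = session.get(url, headers=headers)
    soup = BeautifulSoup(response.text)
    content = soup.find('meta', content=re.compile('URL='))['content']
    target = re.search('URL=(.*)', content).group(1)
    return urljoin(BASE_ACCESS_URL, target)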
FYI, all of this could be done more easily through a real browser with the help of selenium or Ghost.py.
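For the selenium route, a rough sketch (the frame index and browser choice are assumptions; I have not run this against the UN site):
from selenium import webdriver

driver = webdriver.Firefox()
driver.get("http://www.un.org/en/ga/search/view_doc.asp?symbol=A/RES/68/278")

# The real browser renders the frameset, so we can switch straight into a frame
driver.switch_to.frame(1)   # frame index/name is an assumption
print(driver.page_source)

driver.quit()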
Hope that helps.
