I'm pulling HTML from websites by sending headers that make the site think I'm just a user surfing the site, like so:
def page(goo):
    import fileinput
    import sys, heapq, array, urllib
    import BeautifulSoup
    from BeautifulSoup import BeautifulSoup
    import re
    from urllib import FancyURLopener

    # spoof a browser User-Agent so the request looks like ordinary surfing
    class MyOpener(FancyURLopener):
        version = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; it; rv:1.8.1.11) Gecko/20071127 Firefox/2.0.0.11'

    myopener = MyOpener()
    filehandle = myopener.open(goo)
    return filehandle.read()

page = page(WebSite)
This works perfectly with most websites, even Google and Wikipedia, but not with Tmart.com. Somehow, Tmart can see it's not a web browser, and returns an error. How can I fix this?
They might be detecting that you don't have a JavaScript interpreter; it's hard to tell without seeing the error message you are receiving. There is one method that is practically guaranteed to work, though: directly driving a real browser with Selenium WebDriver.
Selenium is normally used for functional testing of websites, but it also works very well for scraping sites that rely on JavaScript.
from selenium import webdriver
browser = webdriver.Chrome()
browser.get('http://www.someurl.com')
html = browser.page_source
See all the methods available on browser here: http://code.google.com/p/selenium/source/browse/trunk/py/selenium/webdriver/remote/webdriver.py
For this to work you will also need to have the chromedriver executable available: http://code.google.com/p/chromedriver/downloads/list
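If chromedriver isn't on your PATH, you can also point Selenium at the executable explicitly. A minimal sketch for the Selenium versions this answer targets (the path below is just a placeholder for wherever you unpacked the binary):
from selenium import webdriver

# placeholder path -- point it at your actual chromedriver executable
browser = webdriver.Chrome('/path/to/chromedriver')
browser.get('http://www.someurl.com')
html = browser.page_source
browser.quit()  # shut the browser down when finished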
I am trying to scrape and download files from the search results displayed when just clicking the search button on https://elibrary.ferc.gov/eLibrary/search. When the search results are displayed, the links look like https://elibrary.ferc.gov/eLibrary/filedownload?fileid=823F0FAB-E23A-CD5B-9FDD-7B3B7A700000, for example. Clicking on such a link on the search results page forces a download (content-disposition: attachment). I am saving the search results as an HTML page and then scraping the links. I am trying to fetch the file associated with a link and store it locally, but my code isn't working.
#!/usr/bin/env python3
import os
import sys
import psycopg2
from pathlib import Path
import urllib.request
import requests

session = requests.Session()
session.headers.update(
    {"User-Agent": "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)"}
)

r1 = session.get("https://elibrary.ferc.gov/eLibrary/search", verify=False)
dl_url = "https://elibrary.ferc.gov/eLibrary/filedownload?fileid=020E6084-66E2-5005-8110-C31FAFC91712"
req = session.get(dl_url, verify=False)

# writes the response body to a file literally named "dl_url"
with open("dl_url", "wb") as outfile:
    outfile.write(req.content)
I am not able to download the file contents at all (pdf, docx etc). The code above is just to try and solve the local download issue. Thanks in advance for any help.
Solved by using a JSON POST request; the original URL won't work.
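For reference, the general shape of that approach with requests looks something like the sketch below. The endpoint and payload fields are placeholders, not the real FERC API; copy the actual XHR request from the browser's Network tab.
import requests

session = requests.Session()
session.headers.update({"User-Agent": "Mozilla/5.0"})

# placeholders -- copy the real endpoint and JSON body from the
# XHR request shown in the browser's Network tab
api_url = "https://elibrary.ferc.gov/REPLACE_WITH_REAL_SEARCH_ENDPOINT"
payload = {"searchText": "example", "resultsPerPage": 25}

resp = session.post(api_url, json=payload)
resp.raise_for_status()
results = resp.json()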
I'm trying to learn how to parse HTML with Python, and I'm currently stuck: soup.findAll returns an empty array even though the elements I'm looking for are present on the page.
Here is my code:
import requests
import urllib.request
import time
from bs4 import BeautifulSoup
headers = {"User-Agent":'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 Safari/537.36'}
url = 'https://www.oddsportal.com/matches/tennis/20191114/'
responce = requests.get(url,headers=headers)
soup = BeautifulSoup(responce.text, 'html.parser')
info = soup.findAll('tr', {'class':'odd deactivate'})
print(info)
I'll appreciate any help, thanks in advance.
I'm trying to learn how to parse HTML with Python
You happened to pick a webpage which isn't very beginner-friendly when it comes to webscraping. Broadly speaking, most webpages use one or both of these two common methods for loading / displaying data:
1. The user makes a request to a server (visits a page, for example). The server gets the necessary data from a database, generates an HTML response using a templating engine, and returns the response for the user's browser to render.
2. The user makes a request to a server. The server returns an HTML-skeleton response which gets populated with data dynamically by making further requests / using APIs etc.
The webpage you picked is of the second type. Just because you can see the <tr> elements in the "Elements" tab of Chrome's Dev Tools doesn't mean that that's what the server sent you. By looking at the network tab of Chrome's Dev Tools you can see that a request is made to these two resources:
https://fb.oddsportal.com/ajax-next-games/2/0/1/20191114/yje3d.dat?=1574007087150
https://fb.oddsportal.com/ajax-next-games-odds/2/0/X0/20191114/1/yje3d.dat?=1574007087151
(The Query String parameters will not be the same for you. Visiting those urls also won't be very interesting unless you provide the right payload.)
The first resource seems to be a jQuery script which makes a request whose response contains HTML (this is your table).
You can see that they seem to have assigned unique IDs to each of the matches. Giron Marcos vs. Holt Brandon in this case has an ID of ATM9GmXG.
The second resource is similar: it's also a jQuery script, this time making a request to their main API. The response this time is JSON, which is always desirable for webscraping, and it contains the same match ID.
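If you would rather replay that second request yourself instead of using Selenium, the rough idea is sketched below. It is only a sketch: the URL parameters, the extra headers, and the callback-wrapper format are assumptions that you would need to confirm in the Network tab.
import json
import re
import requests

# URL copied from the Network tab -- its query-string values change per
# session, so treat this one as a placeholder
ajax_url = ('https://fb.oddsportal.com/ajax-next-games-odds/'
            '2/0/X0/20191114/1/yje3d.dat?=1574007087151')

headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 Safari/537.36',
    # assumption: the site checks where the request came from
    'Referer': 'https://www.oddsportal.com/matches/tennis/20191114/',
    'X-Requested-With': 'XMLHttpRequest',
}

resp = requests.get(ajax_url, headers=headers)

# assumption: the body is JSON wrapped in a JavaScript callback, so strip
# everything outside the outermost braces before parsing
match = re.search(r'\{.*\}', resp.text, re.DOTALL)
if match:
    data = json.loads(match.group(0))
    print(list(data.keys()))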
Apparently, the page only loads the "odds" part once it is opened in a browser, so you could use Selenium with the Chrome driver.
Note that you need to download the Chrome driver and place the driver in your .../python/ directory. Make sure you choose a matching driver version, meaning a version of Chrome driver that matches the version of the Chrome browser you have installed.
from bs4 import BeautifulSoup
from selenium import webdriver

# suppress most of chromedriver's console logging
options = webdriver.ChromeOptions()
options.add_argument('--log-level=3')
browser = webdriver.Chrome(options=options)

url = 'https://www.oddsportal.com/matches/tennis/20191114/'
browser.get(url)

# parse the fully rendered page, JavaScript-generated rows included
soup = BeautifulSoup(browser.page_source, "html.parser")
info = soup.findAll('tr', {'class': 'odd deactivate'})
print(info)
I am trying to scrape a site, https://www.searchiqs.com/nybro/ (you have to click "Log In as Guest" to get to the search form). If I search for a Party 1 term like "Andrew", the results are paginated, and because the request type is POST the URL does not change. The sessions also time out very quickly: if I wait ten minutes and refresh the search URL page, it gives me a timeout error.
I got started with scraping recently, so I have mostly been doing GET requests where I can decipher the URL. So far I have realized that I will have to look at the DOM. Using Chrome Tools, I have found the headers. From the Network tab, I have also found the following form data that is passed on from the search page to the results page:
__EVENTTARGET:
__EVENTARGUMENT:
__LASTFOCUS:
__VIEWSTATE:/wEPaA8FDzhkM2IyZjUwNzg...(i have truncated this for length)
__VIEWSTATEGENERATOR:F92D01D0
__EVENTVALIDATION:/wEdAJ8BsTLFDUkTVU3pxZz92BxwMddqUSAXqb... (i have truncated this for length)
BrowserWidth:1243
BrowserHeight:705
ctl00$ContentPlaceHolder1$scrollPos:0
ctl00$ContentPlaceHolder1$txtName:david
ctl00$ContentPlaceHolder1$chkIgnorePartyType:on
ctl00$ContentPlaceHolder1$txtFromDate:
ctl00$ContentPlaceHolder1$txtThruDate:
ctl00$ContentPlaceHolder1$cboDocGroup:(ALL)
ctl00$ContentPlaceHolder1$cboDocType:(ALL)
ctl00$ContentPlaceHolder1$cboTown:(ALL)
ctl00$ContentPlaceHolder1$txtPinNum:
ctl00$ContentPlaceHolder1$txtBook:
ctl00$ContentPlaceHolder1$txtPage:
ctl00$ContentPlaceHolder1$txtUDFNum:
ctl00$ContentPlaceHolder1$txtCaseNum:
ctl00$ContentPlaceHolder1$cmdSearch:Search
All the ones in caps are hidden. I have also managed to figure out the results structure.
My script thus far is really pathetic, as I am completely blank on what to do next. I still need to submit the form, analyze the pagination, and scrape the results, but I have absolutely no idea how to proceed.
import re
import urlparse
import mechanize
from bs4 import BeautifulSoup


class DocumentFinderScraper(object):
    def __init__(self):
        self.url = "https://www.searchiqs.com/nybro/SearchResultsMP.aspx"
        self.br = mechanize.Browser()
        self.br.addheaders = [('User-agent',
            'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8) AppleWebKit/535.7 (KHTML, like Gecko) Chrome/16.0.912.63 Safari/535.7')]

    ##TO DO
    ##submit form
    #get return URL
    #scrape results
    #analyze pagination


if __name__ == '__main__':
    scraper = DocumentFinderScraper()
    scraper.scrape()
Any help would be dearly appreciated
I disabled JavaScript and visited https://www.searchiqs.com/nybro/; with JavaScript off, the Log In and Log In as Guest buttons are rendered disabled. This makes it impossible for Mechanize to work, because it cannot process JavaScript, so you won't be able to submit the form.
For this kind of problem you can use Selenium, which drives a full browser, with the disadvantage of being slower than Mechanize.
This code should log you in using Selenium:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
usr = ""
pwd = ""
driver = webdriver.Firefox()
driver.get("https://www.searchiqs.com/nybro/")
assert "IQS" in driver.title
elem = driver.find_element_by_id("txtUserID")
elem.send_keys(usr)
elem = driver.find_element_by_id("txtPassword")
elem.send_keys(pwd)
elem.send_keys(Keys.RETURN)
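From there you can fill in the search form the same way. Below is a sketch that assumes the ASP.NET control names listed in the question map to element IDs with the usual "$" to "_" substitution (e.g. ctl00_ContentPlaceHolder1_txtName); verify the actual IDs in Dev Tools before relying on them.
from bs4 import BeautifulSoup

# assumed IDs -- ASP.NET usually renders ctl00$ContentPlaceHolder1$txtName
# with id="ctl00_ContentPlaceHolder1_txtName"; confirm them in Dev Tools
name_box = driver.find_element_by_id("ctl00_ContentPlaceHolder1_txtName")
name_box.send_keys("Andrew")
driver.find_element_by_id("ctl00_ContentPlaceHolder1_cmdSearch").click()

# hand the rendered results page to BeautifulSoup for scraping
soup = BeautifulSoup(driver.page_source, "html.parser")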
I would like to take a screenshot of this page: http://books.google.de/books?id=gikDAAAAMBAJ&pg=PA1&img=1&w=2500, or save the image that it outputs.
But I can't find a way: with wget/curl I get an "unavailable" error, and the same with other tools like webkit2png/wkhtmltoimage/wkhtmltopng.
Is there a clean way to do it with Python or from the command line?
Best regards!
You can use ghost.py if you like.
https://github.com/jeanphix/Ghost.py
Here is an example of how to use it.
from ghost import Ghost
ghost = Ghost(wait_timeout=4)
ghost.open('http://www.google.com')
ghost.capture_to('screen_shot.png')
The last line saves the image in your current directory.
Hope this helps
I had difficulty getting Ghost to take a screenshot consistently on a headless CentOS VM. Selenium and PhantomJS worked for me:
from selenium import webdriver
br = webdriver.PhantomJS()
br.get('http://www.stackoverflow.com')
br.save_screenshot('screenshot.png')
br.quit()
Sometimes you need extra HTTP headers, such as User-Agent, to get downloads to work. In Python 2.7, you can:
import urllib2

request = urllib2.Request(
    r'http://books.google.de/books?id=gikDAAAAMBAJ&pg=PA1&img=1&w=2500',
    headers={'User-Agent': 'Mozilla/5.0 (X11; U; Linux i686) Gecko/20071127 firefox/2.0.0.11'})
page = urllib2.urlopen(request)

with open('somefile.png', 'wb') as f:
    f.write(page.read())
Or you can look at the params for adding http headers in wget or curl.
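For example, wget takes --user-agent and curl takes -A (the User-Agent string below is just an illustrative value):
wget --user-agent="Mozilla/5.0 (X11; U; Linux i686) Gecko/20071127 firefox/2.0.0.11" -O somefile.png "http://books.google.de/books?id=gikDAAAAMBAJ&pg=PA1&img=1&w=2500"
curl -A "Mozilla/5.0 (X11; U; Linux i686) Gecko/20071127 firefox/2.0.0.11" -o somefile.png "http://books.google.de/books?id=gikDAAAAMBAJ&pg=PA1&img=1&w=2500"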
I'm writing a script with the mechanize.Browser module. Everything else works, but when I submit() the form it doesn't work, and I found a suspicious part of the page source.
In the HTML source I found the following:
<form method="post" onsubmit="return loginCheck(this)" name="FRMLOGIN"/>
I think loginCheck(this) is causing the problem when the form is submitted, but how do I handle this kind of JavaScript function with the mechanize module, so that I can successfully submit the form and receive the result?
The following is my current script. If anyone can help me, I'd much appreciate it!
# -*- coding: cp949 -*-
import sys,os
import mechanize, urllib
import cookielib
from BeautifulSoup import BeautifulSoup,BeautifulStoneSoup,Tag
import datetime, time, socket
import re,sys,os,mechanize,urllib,time
br = mechanize.Browser()
cj = cookielib.LWPCookieJar()
br.set_cookiejar(cj)
# Browser options
br.set_handle_equiv(True)
br.set_handle_gzip(True)
br.set_handle_redirect(True)
br.set_handle_referer(True)
br.set_handle_robots(False)
# Follows refresh 0 but not hangs on refresh > 0
br.set_handle_refresh(mechanize._http.HTTPRefreshProcessor(), max_time=1)
# Want debugging messages?
br.set_debug_http(True)
br.set_debug_redirects(True)
br.set_debug_responses(True)
# User-Agent (this is cheating, ok?)
br.addheaders = [('User-agent', 'Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.0.6)')]
br.open('http://user.buddybuddy.co.kr/Login/LoginForm.asp?URL=')
html = br.response().read()
print html
br.select_form(name='FRMLOGIN')
print br.viewing_html()
br.form['ID']='zero1zero2'
br.form['PWD']='012045'
br.submit()
print br.response().read()
mechanize doesn't support JavaScript at all. If you absolutely have to run that JavaScript, look into Selenium. It offers Python bindings to control a real, running browser like Firefox or IE.
You will either need to make use of the unmaintained DOMForm module and Spidermonkey (http://pypi.python.org/pypi/python-spidermonkey) to process the JavaScript, or figure out what loginCheck() is doing and perform its work in Python prior to form submission. If loginCheck() just checks the login data for obvious validity, that should be pretty easy.
Please note that the action attribute of the form tag shown is missing; it is probably set in the JavaScript.
Depending on what you intend, it might be easier to work with urllib2 only: assume the page's structure is static, POST the data with urllib2's methods, and read the result back the same way.
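A minimal sketch of that urllib2 approach follows. Note that the action URL is a guess (the form tag shown above doesn't include it), and the ID/PWD field names are simply copied from the mechanize script, so confirm both against the real page.
import urllib
import urllib2

# assumption: the real action URL must be copied from the browser's
# Network tab, since the form tag above doesn't show it
action_url = 'http://user.buddybuddy.co.kr/Login/Login.asp'

data = urllib.urlencode({'ID': 'your_id', 'PWD': 'your_password'})
request = urllib2.Request(
    action_url,
    data=data,
    headers={'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.0.6)'})
response = urllib2.urlopen(request)
print response.read()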
onsubmit is simply ignored by mechanize; no JavaScript interpretation is done.
You need to check what loginCheck() does; in some limited cases (e.g. simple client-side validation) you can reproduce in Python what the JavaScript does.