Now let me start off by saying that I know bs4, scrapy, Selenium and many other tools can do this, but that isn't what I want, for numerous reasons.
What I would like to do is open a web browser (Chrome, IE, Firefox) with the webbrowser module and extract the HTML from the page after the site has loaded in that browser.
import webbrowser

class ScreenCapture:
    url = 'https://www.google.com/'
    # open the URL in Chrome by giving webbrowser an explicit browser command
    webbrowser.get('C:/Program Files (x86)/Google/Chrome/Application/chrome.exe %s').open(url)
    # get html from browser that is open
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome(r'C:\chromedriver.exe')
url = 'https://www.sambav.com/hyderabad/doctors'
driver.get(url)
soup = BeautifulSoup(driver.page_source, 'html.parser')
for links in soup.find_all('div', class_='sambavdoctorname'):
    link = links.find('a')
    print(link['href'])
driver.close()
I am trying to scrape this page; the link is the same on every page. I am trying to extract the links from all of the multiple pages, but it gives no output and shows no error; the program just ends.
If you inspect that website with the developer tools in your browser (Chrome, Firefox, whatever), you will see that before the page loads, it fetches data from a few sources. One of these sources is "https://www.sambav.com/api/search/DoctorSearch?searchText=&city=Hyderabad&location=". Your code can be simplified (and there is no need to use Selenium):
import requests

# hit the JSON endpoint the page itself uses, instead of scraping rendered HTML
r = requests.get('https://www.sambav.com/api/search/DoctorSearch?searchText=&city=Hyderabad&location=')
BASE_URL_DOCTOR = 'https://www.sambav.com/hyderabad/doctor/'
for item in r.json():
    print(BASE_URL_DOCTOR + item['uniqueName'])
I'm trying to scrape a video from the website. I can find the video link using Chrome DevTools, but when I use BeautifulSoup to get the video link, the link is hidden. Please help me modify the code below to get the video link.
In Chrome DevTools I can see the element; basically, I need the 'src' of the 'video' tag.
import urllib.request
from bs4 import BeautifulSoup as BS

url_video = 'http://s.weibo.com/video?q=%23%E6%AC%A7%E9%98%B3%E5%A6%AE%E5%A6%AE%23&xsort=hot&hasvideo=1&tw=video&Refer=weibo_video'

# open and read the page
page = urllib.request.urlopen(url_video)
html = page.read()

# create BeautifulSoup parse-able "soup"
soup = BS(html, "lxml")
print(soup.body.find_all('div', class_='thumbnail')[0])
There is a possibility that the site is using some client-side JavaScript to load part of its HTML content. When you make a request using urllib.request, it won't execute any client-side JavaScript. So if the site does load some of its HTML content via client-side JavaScript, you'll need a JavaScript engine in order to run it (i.e., a web browser). You can use a headless browser to execute client-side JavaScript while scraping a web page. Here's a guide to using headless Chrome with Puppeteer:
https://medium.com/@e_mad_ehsan/getting-started-with-puppeteer-and-chrome-headless-for-web-scraping-6bf5979dee3e
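If you would rather stay in Python, a headless browser can also be driven with Selenium to the same effect. Here is a minimal sketch, assuming the selenium package and a chromedriver matching your Chrome version are installed:

from selenium import webdriver

# run Chrome without opening a visible window
options = webdriver.ChromeOptions()
options.add_argument('--headless')
driver = webdriver.Chrome(options=options)
driver.get('http://s.weibo.com/video?q=%23%E6%AC%A7%E9%98%B3%E5%A6%AE%E5%A6%AE%23&xsort=hot&hasvideo=1&tw=video&Refer=weibo_video')
html = driver.page_source  # the HTML after client-side javascript has run
driver.quit()

The page_source obtained this way can then be fed to BeautifulSoup exactly as in the question's code.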
I have been looking on the internet for an answer to this, but so far I haven't found quite what I was looking for. So far I'm able to open a webpage via Python's webbrowser module, but what I want to know is how to download the HTML of the webpage that Python has asked the browser (Firefox in this case) to open. This is because there are certain webpages with sections that I cannot fully access without a certain browser extension/add-on (MetaMask), as they also require logging in from within that extension, which happens automatically if I open Firefox normally or with the webbrowser module. This is why requesting the HTML directly from Python with code such as this doesn't work:
import requests

url = 'https://www.google.com/'
r = requests.get(url)
r.text

or:

from urllib.request import urlopen

with urlopen(url) as f:
    html = f.read()
The only solution I have found so far is to open the webpage with the webbrowser module and then use the pyautogui module to make my PC automatically press Ctrl+S (the Firefox hotkey for saving the current page's HTML) and then press Enter.
import webbrowser
import pyautogui
import time

def get_html():
    url = 'https://example.com/'
    webbrowser.open_new(url)  # open the webpage in the default browser (Firefox)
    time.sleep(1.2)
    pyautogui.hotkey('ctrl', 's')
    time.sleep(1)
    pyautogui.press('enter')

get_html()
However, I was wondering if there is a more sophisticated and efficient way that doesn't involve simulated key pressing with pyautogui.
Can you try the following:
import requests

url = 'https://www.google.com/'
r = requests.get(url)
with open('page.html', 'w', encoding='utf-8') as outfile:  # explicit encoding avoids platform-default surprises
    outfile.write(r.text)
If the above solution doesn't work, you can use the selenium library to open a browser:
import time
from selenium import webdriver

url = 'https://www.google.com/'
driver = webdriver.Firefox()
driver.get(url)
time.sleep(2)  # give the page time to finish loading
with open('page.html', 'w', encoding='utf-8') as f:
    f.write(driver.page_source)
I am using the Requests module to send GET and POST requests to websites and then processing their responses. If the Response.text meets a certain criterion, I want it to be opened in a browser. To do so, I am currently using the selenium package and resending the request to the webpage via the Selenium webdriver. However, this feels inefficient, as I have already obtained the response once. Is there a way to render this already-obtained Response object directly in the browser opened via Selenium?
EDIT
A hacky way that I could think of is to write the response.text to a temporary file and open that in the browser. Please let me know if there is a better way to do it than this.
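For reference, the temp-file idea looks roughly like this; it's a sketch that assumes `response` (the requests Response) and `driver` (the Selenium webdriver) already exist from the code described above:

import pathlib
import tempfile

# write the already-fetched body to a temporary .html file
with tempfile.NamedTemporaryFile('w', suffix='.html', delete=False, encoding='utf-8') as tmp:
    tmp.write(response.text)  # `response` is the Response obtained earlier
    temp_path = tmp.name

# point the Selenium-controlled browser at the local file
driver.get(pathlib.Path(temp_path).as_uri())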
To directly render some HTML with Selenium, you can use the data: URL scheme with the get method:
from selenium import webdriver
import requests

# use .text (a str) rather than .content (bytes): bytes cannot be concatenated to the URL string
content = requests.get("http://stackoverflow.com/").text
driver = webdriver.Chrome()
driver.get("data:text/html;charset=utf-8," + content)
Or you could write the page with a piece of script:
from selenium import webdriver
import requests

# again, .text so that a str (not bytes) is passed into the script
content = requests.get("http://stackoverflow.com/").text
driver = webdriver.Chrome()
driver.execute_script("""
    document.location = 'about:blank';
    document.open();
    document.write(arguments[0]);
    document.close();
""", content)
I would like to download (using Python 3.4) all (.zip) files on the Google Patent Bulk Download Page http://www.google.com/googlebooks/uspto-patents-grants-text.html
(I am aware that this amounts to a large amount of data.) I would like to save all files for a given year in a directory named after that year, so 1976 for all the (weekly) files from 1976, with those directories created inside the directory that my Python script is in.
I've tried using the urllib.request package, but I only got far enough to retrieve the HTML text of the page, not to "click" on a file to download it.
import urllib.request

url = 'http://www.google.com/googlebooks/uspto-patents-grants-text.html'
savename = 'google_patent_urltext'
urllib.request.urlretrieve(url, savename)
Thank you very much for your help.
As I understand it, you are looking for a command that will simulate left-clicking on a file and automatically download it. If so, you can use Selenium.
Something like:
from selenium import webdriver
from selenium.webdriver.firefox.firefox_profile import FirefoxProfile

profile = FirefoxProfile()
profile.set_preference("browser.download.folderList", 2)
profile.set_preference("browser.download.manager.showWhenStarting", False)
profile.set_preference("browser.download.dir", 'D:\\')  # choose the folder to download to
profile.set_preference("browser.helperApps.neverAsk.saveToDisk", 'application/octet-stream')
driver = webdriver.Firefox(firefox_profile=profile)
driver.get('https://www.google.com/googlebooks/uspto-patents-grants-text.html#2015')
filename = driver.find_element_by_xpath('//a[contains(text(),"ipg150106.zip")]')  # use a loop to list all zip files
filename.click()
UPDATED! The zip MIME type 'application/octet-stream' should be used instead of "application/zip". Now it should work. :)
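To fetch every archive rather than one hard-coded filename, the single find_element call could be replaced by a loop; this is an untested sketch that assumes every download link's href ends in .zip:

# collect every link whose href ends in .zip and click each to queue its download
zip_links = driver.find_elements_by_xpath('//a[contains(@href, ".zip")]')
for link in zip_links:
    link.click()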
The HTML you are downloading is the page of links. You need to parse the HTML to find all the download links. You could use a library like Beautiful Soup to do this.
However, the page is very regularly structured, so you could also use a regular expression to get all the download links:
import re
import urllib.request

url = 'http://www.google.com/googlebooks/uspto-patents-grants-text.html'
html = urllib.request.urlopen(url).read().decode('utf-8')  # decode to str so the str pattern matches
links = re.findall(r'<a href="(.*?)">', html)  # non-greedy, so each match stops at its own closing quote
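Building on that, a sketch for the original goal of saving each week's file into a per-year directory might look like the following; the assumption that each link's URL contains a four-digit year is mine, not something verified against the page:

import os
import re
import urllib.request
from urllib.parse import urljoin

url = 'http://www.google.com/googlebooks/uspto-patents-grants-text.html'
html = urllib.request.urlopen(url).read().decode('utf-8')

for link in re.findall(r'<a href="(.*?\.zip)">', html):
    match = re.search(r'(19|20)\d{2}', link)  # assumed: the year appears somewhere in the URL
    if not match:
        continue
    year = match.group(0)
    os.makedirs(year, exist_ok=True)  # one directory per year, next to the script
    filename = os.path.join(year, link.rsplit('/', 1)[-1])
    urllib.request.urlretrieve(urljoin(url, link), filename)  # download the zip into its year directory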