How to download a file using Selenium? - Python

I am trying to get the download link and download the files.
I have a log file which contains the following links:
http://www.downloadcrew.com/article/18631-aida64
http://www.downloadcrew.com/article/4475-sumo
http://www.downloadcrew.com/article/2174-iolo_system_mechanic_professional
...
...
I have code like this:
import urllib, time
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()
f = open("dcrewtest.txt")
for line in f.readlines():
    try:
        driver.find_element_by_xpath("//div/div[2]/div[2]/div[2]/div[3]/div/a/img").click()
        time.sleep(8)
    except:
        pass
    url = line.strip()
    pageurl = urllib.urlopen(url).read()
    soup = BeautifulSoup(pageurl)
    for a in soup.select("h1#articleTitle"):
        print a.contents[0].strip()
    for b in soup.findAll("th"):
        if b.text == "Date Updated:":
            print b.parent.td.text
        elif b.text == "Developer:":
            print b.parent.td.text
Up to here it works, but I do not know how to get the download link and download the file.
Is it possible to download the file using Selenium?

According to the documentation, you should configure the FirefoxProfile to automatically download files with a specified content type. Here's an example using the first URL in your txt file that saves the exe file to the current directory:
import os
from selenium import webdriver

fp = webdriver.FirefoxProfile()
fp.set_preference("browser.download.folderList", 2)
fp.set_preference("browser.download.manager.showWhenStarting", False)
fp.set_preference("browser.download.dir", os.getcwd())
fp.set_preference("browser.helperApps.neverAsk.saveToDisk", "application/x-msdos-program")

driver = webdriver.Firefox(firefox_profile=fp)
driver.get("http://www.downloadcrew.com/article/18631-aida64")
driver.find_element_by_xpath("//div[@class='downloadLink']/a/img").click()
Note that I've also simplified the XPath.
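If some of the files come back with a different Content-Type, browser.helperApps.neverAsk.saveToDisk accepts a comma-separated list of MIME types. A hedged variant (the extra types below are assumptions; check the Content-Type response header of your actual downloads):
# the extra MIME types are assumptions; inspect the Content-Type response
# header for your downloads and list whatever the server really sends
fp.set_preference(
    "browser.helperApps.neverAsk.saveToDisk",
    "application/x-msdos-program,application/octet-stream,application/exe"
)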

Related

Using selenium to scrape paginated table data (Python)

I have this table: https://www.londonstockexchange.com/indices/ftse-aim-all-share/constituents/table?page=1. It's paginated, and I want to scrape all the content from the table, starting from page 1 through to the very end. I am trying to use XPath but can't seem to get it to work.
Here is my code; any help welcome!
from selenium import webdriver
from selenium.webdriver.common.by import By
import os

co = webdriver.ChromeOptions()
co.add_argument('--headless')
# co.add_argument('--ignore-certificate-errors')
# co.add_argument('--no-proxy-server')
# co.add_argument("--proxy-server='direct://'")
# co.add_argument("--proxy-bypass-list=*")

driver = webdriver.Chrome(executable_path="C:/Users/user/Desktop/IG Trading/chromedriver.exe", chrome_options=co)
driver.get('https://www.londonstockexchange.com/indices/ftse-aim-all-share/constituents/table?page=1')

stock_names = driver.find_elements(By.XPATH, '/html/body/app-root/app-handshake/div/app-page-content/app-filter-toggle/app-ftse-index-table/section/table')
print(stock_names)

# for stock_name in stock_names:
#     print(stock_name)
#     text = stock_name.text
#     print(text)
This is one way you can obtain that information:
from selenium import webdriver
from selenium.webdriver.firefox.service import Service
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.firefox.options import Options as Firefox_Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support.ui import Select
from selenium.webdriver.support import expected_conditions as EC
import time as t
import pandas as pd
from tqdm import tqdm

firefox_options = Firefox_Options()
# firefox_options.add_argument("--width=1500")
# firefox_options.add_argument("--height=500")
# firefox_options.headless = True
driverService = Service('chromedriver/geckodriver')
browser = webdriver.Firefox(service=driverService, options=firefox_options)

big_df = pd.DataFrame()
browser.get('https://www.londonstockexchange.com/indices/ftse-aim-all-share/constituents/table')
try:
    WebDriverWait(browser, 3).until(EC.element_to_be_clickable((By.ID, "ccc-notify-accept"))).click()
    print('accepted cookies')
except Exception as e:
    print('no cookie button!')
t.sleep(2)

for i in tqdm(range(1, 40)):
    browser.get(f'https://www.londonstockexchange.com/indices/ftse-aim-all-share/constituents/table?page={i}')
    t.sleep(1)
    df = pd.read_html(WebDriverWait(browser, 10).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "table[class='full-width ftse-index-table-table']"))).get_attribute('outerHTML'))[0]
    big_df = pd.concat([big_df, df], axis=0, ignore_index=True)

print(big_df)
big_df.to_csv('lse_companies.csv')
print('all done')
browser.quit()
This will print the combined dataframe in the terminal once all pages are scraped, and also save it as a CSV file on disk (in the same folder you run the script from). The setup is Firefox/geckodriver on Linux; you can adapt it to your own setup, just observe the imports and the logic after defining the browser/driver.
Selenium docs: https://www.selenium.dev/documentation/
TQDM: https://pypi.org/project/tqdm/
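If you'd rather not hard-code the 39 pages in range(1, 40), here is a hedged variant of the same loop that keeps requesting pages until the table stops appearing. It assumes the site serves an empty or missing table past the last page, which I haven't verified, so the hard cap stays as a safety net:
from selenium.common.exceptions import TimeoutException

i = 1
while i < 100:  # safety cap; the site had roughly 39 pages at the time of writing
    browser.get(f'https://www.londonstockexchange.com/indices/ftse-aim-all-share/constituents/table?page={i}')
    try:
        html = WebDriverWait(browser, 10).until(EC.element_to_be_clickable(
            (By.CSS_SELECTOR, "table[class='full-width ftse-index-table-table']"))).get_attribute('outerHTML')
    except TimeoutException:
        break  # assumption: no table rendered past the last page
    df = pd.read_html(html)[0]
    if df.empty:
        break  # assumption: an empty table also means we ran off the end
    big_df = pd.concat([big_df, df], axis=0, ignore_index=True)
    i += 1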

Webscraping using Selenium in Python

I am trying to scrape data from the Sunshine List website (http://www.sunshinelist.ca/) using the BeautifulSoup library and the Selenium package (in order to deal with the 'Next' button on the webpage). I know there are several related posts but I just can't identify where and how I should explicitly ask the driver to wait.
Error: StaleElementReferenceException: Message: The element reference of <element> is stale: either the element is no longer attached to the DOM or the page has been refreshed
This is the code I have written:
import numpy as np
import pandas as pd
import requests
import re
import time
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.firefox.firefox_binary import FirefoxBinary
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.common.exceptions import StaleElementReferenceException
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
ffx_bin = FirefoxBinary(r'C:\Users\BhagatM\AppData\Local\Mozilla Firefox\firefox.exe')
ffx_caps = DesiredCapabilities.FIREFOX
ffx_caps['marionette'] = True
driver = webdriver.Firefox(capabilities=ffx_caps,firefox_binary=ffx_bin)
driver.get("http://www.sunshinelist.ca/")
driver.maximize_window()
tablewotags1 = []
while True:
    divs = driver.find_element_by_id('datatable-disclosures')
    divs1 = divs.find_elements_by_tag_name('tbody')
    for d1 in divs1:
        div2 = d1.find_elements_by_tag_name('tr')
        for d2 in div2:
            tablewotags1.append(d2.text)
    try:
        driver.find_element_by_link_text('Next →').click()
    except NoSuchElementException:
        break
year1=tablewotags1[0::10]
name1=tablewotags1[3::10]
position1=tablewotags1[4::10]
employer1=tablewotags1[1::10]
df1=pd.DataFrame({'Year':year1,'Name':name1,'Position':position1,'Employer':employer1})
df1.to_csv('Sunshine List-1.csv', index=False)
I think you just need to point to the correct Firefox binary. Also, which version of Firefox are you using? It looks like it's one of the newer versions; this should do if that's the case.
ffx_bin = FirefoxBinary(r'pathtoyourfirefox')
ffx_caps = DesiredCapabilities.FIREFOX
ffx_caps['marionette'] = True
driver = webdriver.Firefox(capabilities=ffx_caps,firefox_binary=ffx_bin)
Cheers
EDIT: To answer your new enquiry, "why is it not writing the CSV", you should do it like this:
import csv  # You are missing this import

ls_general_list = []

def csv_for_me(list_to_csv):
    with open(pathtocsv, 'a', newline='') as csvfile:
        sw = csv.writer(csvfile, delimiter=',', quotechar='|', quoting=csv.QUOTE_MINIMAL)
        for line in list_to_csv:
            sw.writerow(line)  # one row per tuple in the list
Then replace this line in your code, df = pd.DataFrame({'Year': year, 'Name': name, 'Position': position, 'Employer': employer}), with this one: ls_general_list.append((year, name, position, employer)), and then call:
csv_for_me(ls_general_list)
Please accept the answer if it's satisfactory; now you have a CSV.
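As for the StaleElementReferenceException itself, it is usually cured with explicit waits rather than a different binary. A minimal sketch reusing the names from the question (it assumes the table is re-rendered after clicking 'Next →'):
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

wait = WebDriverWait(driver, 10)
tablewotags1 = []
while True:
    # re-locate the table on every page; stale references come from
    # holding on to elements across a page re-render
    table = wait.until(EC.presence_of_element_located((By.ID, 'datatable-disclosures')))
    for row in table.find_elements_by_tag_name('tr'):
        tablewotags1.append(row.text)
    try:
        next_link = driver.find_element_by_link_text('Next →')
    except NoSuchElementException:
        break  # no Next link: last page reached
    next_link.click()
    wait.until(EC.staleness_of(table))  # block until the old table is replaced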

How to download an HTML webpage using Selenium with Python?

I want to download a webpage using Selenium with Python, using the following code:
from selenium import webdriver
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

chromeOptions = webdriver.ChromeOptions()
chromeOptions.add_argument('--save-page-as-mhtml')
d = DesiredCapabilities.CHROME
driver = webdriver.Chrome()
driver.get("http://www.yahoo.com")
saveas = ActionChains(driver).key_down(Keys.CONTROL)\
    .key_down('s').key_up(Keys.CONTROL).key_up('s')
saveas.perform()
print("done")
However, the above code isn't working. I am using Windows 7.
Is there any way by which I can bring up the 'Save As' dialog box?
Thanks,
Karan
You can use the code below to download the page HTML:
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("http://www.yahoo.com")
with open("/path/to/page_source.html", "w", encoding='utf-8') as f:
    f.write(driver.page_source)
Just replace "/path/to/page_source.html" with the desired file path and name.
Update
If you need to get the complete page source (including CSS, JS, ...), you can use the following solution:
pip install pyahk # from command line
Python code:
from selenium import webdriver
from selenium.webdriver.firefox.firefox_binary import FirefoxBinary
import ahk

firefox = FirefoxBinary("C:\\Program Files (x86)\\Mozilla Firefox\\firefox.exe")
driver = webdriver.Firefox(firefox_binary=firefox)
driver.get("http://www.yahoo.com")

ahk.start()
ahk.ready()
ahk.execute("Send,^s")
ahk.execute("WinWaitActive, Save As,,2")
ahk.execute("WinActivate, Save As")
ahk.execute("Send, C:\\path\\to\\file.htm")
ahk.execute("Send, {Enter}")

Click button, then scrape data on seemingly static webpage?

I'm trying to scrape the player statistics in the Totals table at this link: http://www.basketball-reference.com/players/j/jordami01.html. It's much more difficult to scrape the data as-is when you first land on that site, so you have the option of clicking 'CSV' right above the table. That format would be much easier to digest.
I'm having trouble getting Selenium to click that CSV button.
import urllib2
from bs4 import BeautifulSoup
from selenium import webdriver
player_link = "http://www.basketball-reference.com/players/j/jordami01.html"
browser = webdriver.Firefox()
browser.get(player_link)
elem = browser.find_element_by_xpath("//span[@class='tooltip' and @onlick='table2csv('totals')']")
elem.click()
When I run this, a Firefox window pops up, but the code never changes the table from its original format to CSV. The CSV table only pops up in the source code after I click CSV (obviously). How can I get selenium to click that CSV button and then BS to scrape the data?
You don't need BeautifulSoup here. Click the CSV button with selenium, extract the contents of the pre element that appears with the CSV data, and parse it with the built-in csv module:
import csv
from StringIO import StringIO

from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

player_link = "http://www.basketball-reference.com/players/j/jordami01.html"

browser = webdriver.Firefox()
wait = WebDriverWait(browser, 10)
browser.set_page_load_timeout(10)

# stop load after a timeout
try:
    browser.get(player_link)
except TimeoutException:
    browser.execute_script("window.stop();")

# click "CSV"
elem = wait.until(EC.presence_of_element_located((By.XPATH, "//div[@class='table_heading']//span[. = 'CSV']")))
elem.click()

# get CSV data
csv_data = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "pre#csv_totals"))).text.encode("utf-8")
browser.close()

# read CSV
reader = csv.reader(StringIO(csv_data))
for line in reader:
    print(line)
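Note that this snippet is Python 2 (the StringIO package, bytes from .encode). On Python 3 the same idea would look like this, as a hedged port:
import csv
from io import StringIO

csv_data = wait.until(EC.presence_of_element_located(
    (By.CSS_SELECTOR, "pre#csv_totals"))).text  # .text is already str in Python 3
for line in csv.reader(StringIO(csv_data)):
    print(line)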

Bypass 'Referral Denied' error in Selenium using Python

I was making a script to download images from Comic Naver and I'm kind of done with it; however, I can't seem to save the images.
I successfully grabbed the images via urllib and BeautifulSoup; now it seems they've introduced hotlink blocking, and I can't save the images on my system via urllib or selenium.
Update: I tried changing the user agent to see if that was causing problems... still the same.
Any fix or solution?
My code right now:
import requests
from bs4 import BeautifulSoup
import re
import urllib
import urllib2
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
from selenium.common.exceptions import TimeoutException

dcap = dict(DesiredCapabilities.PHANTOMJS)
dcap["phantomjs.page.settings.userAgent"] = (
    "Chrome/15.0.87"
)
url = "http://comic.naver.com/webtoon/detail.nhn?titleId=654817&no=44&weekday=tue"

driver = webdriver.PhantomJS(desired_capabilities=dcap)
soup = BeautifulSoup(urllib.urlopen(url).read())
scripts = soup.findAll('img', alt='comic content')

for links in scripts:
    Imagelinks = links['src']
    filename = Imagelinks.split('_')[-1]
    print 'Downloading Image : ' + filename
    driver.get(Imagelinks)
    driver.save_screenshot(filename)
driver.close()
Following 'MAI's' reply, I tried what I could with selenium, and got what I wanted. It's solved now. My code :
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
from selenium.common.exceptions import TimeoutException
from bs4 import BeautifulSoup
from selenium.webdriver.common.action_chains import ActionChains
driver = webdriver.Chrome()
url = "http://comic.naver.com/webtoon/detail.nhn?titleId=654817&no=44&weekday=tue"
driver.get(url)
elem = driver.find_elements_by_xpath("//div[@class='wt_viewer']//img[@alt='comic content']")
for links in elem:
    print links.get_attribute('src')
driver.quit()
But when I try to take screenshots of this, it says the "element is not attached to the page". Now how am I supposed to solve that :/
(Note: Apologies, I'm not able to comment, so I have to make this an answer.)
To answer your original question, I've just been able to download an image in cURL from Naver Webtoons (the English site) by adding a Referer: http://www.webtoons.com header like so:
curl -H "Referer: http://www.webtoons.com" [link to image] > img.jpg
I haven't tried, but you'll probably want to use http://comic.naver.com instead. To do this with urllib, create a Request object with the header required:
req = urllib.request.Request(url, headers={"Referer": "http://comic.naver.com"})
with urllib.request.urlopen(req) as response, open("image.jpg", "wb") as outfile:
    shutil.copyfileobj(response, outfile)
This saves the file using shutil.copyfileobj(src, dest) (note the shutil import). So instead of taking screenshots, you can simply get a list of all the images to download, then make a request for each one using the referer header.
Edit: I have a working script on GitHub which only requires urllib and BeautifulSoup.
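Since requests is already among your imports, the same Referer trick works there too. A small sketch, where image_url stands for one of the src values you collected:
import shutil
import requests

# image_url stands for one of the src values collected from the <img> tags
resp = requests.get(image_url, headers={"Referer": "http://comic.naver.com"}, stream=True)
with open("image.jpg", "wb") as outfile:
    shutil.copyfileobj(resp.raw, outfile)  # stream straight to disk, no screenshots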
I took a short look at the website with Chrome dev tools.
I would suggest you download the image directly instead of taking screenshots. Selenium WebDriver actually runs the JavaScript in the PhantomJS headless browser, so you should get the images loaded by JavaScript at the following path.
The path that I am getting by eyeballing the HTML is
html body #wrap #container #content div #comic_view_area div img
The image tags at the last level have IDs like content_image_N, with N counting from 0. So you can also get a specific picture by using img#content_image_0, for example.
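A short sketch of that suggestion (the selector is my assumption, built from the path and IDs described above):
# select every image under #comic_view_area whose id starts with content_image
imgs = driver.find_elements_by_css_selector("#comic_view_area img[id^='content_image']")
for img in imgs:
    print(img.get_attribute('src'))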
