I would like to download data from the http://ec.europa.eu/taxation_customs/vies/ site. The problem is that when I enter data on it through my program the URL doesn't change, so the file saved to disk contains the same page that was open at the beginning, without the data. Maybe I don't know how to access this site after submitting the data? I'm new to Python and tried to look for a solution but without result, so if there has been such an issue before, please link me to it. Here's my code. I appreciate all responses :)
import requests
import selenium
import select as something
from selenium import webdriver
from selenium.webdriver.support.ui import Select
import pdfkit
url = "http://ec.europa.eu/taxation_customs/vies/?locale=pl"
driver = webdriver.Chrome(executable_path ="C:\\Users\\Python\\Chromedriver.exe")
driver.get("http://ec.europa.eu/taxation_customs/vies/")
#wait = WebDriverWait(driver, 10)
obj = Select(driver.find_element_by_id("countryCombobox"))
obj = obj.select_by_index(1)
vies_r = requests.get(url)
vies_vat = driver.find_element_by_id("number")
vies_vat.send_keys('U54799909')
vies_verify = driver.find_element_by_id("submit")
vies_verify.click()
path_wkhtmltopdf = r'C:\Users\Python\wkhtmltox\wkhtmltox\bin\wkhtmltopdf.exe'
config = pdfkit.configuration(wkhtmltopdf=path_wkhtmltopdf)
print(driver.current_url)
pdfkit.from_url(driver.current_url, "out.pdf", configuration=config)
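For what it's worth, a hedged sketch of one way around this, on the assumption that the VIES result is rendered into the same page by JavaScript: wait for the result to appear, then feed the live page source to pdfkit instead of re-requesting the URL. The element id used below is a placeholder, not taken from the site.
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait for the verification result to render; 'vatResponseFormTable' is only a placeholder id.
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.ID, "vatResponseFormTable"))
)

# driver.current_url still points at the blank form, so convert the rendered DOM instead.
pdfkit.from_string(driver.page_source, "out.pdf", configuration=config)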
Note: This is a very different problem compared to other SO answers (Selenium Webdriver: How to Download a PDF File with Python?) available for similar questions.
This is because the URL https://webice.ongc.co.in/pay_adv?TRACKNO=8262# does not directly return the PDF; it makes several other calls, one of which is the URL that returns the PDF file.
I want to be able to call the URL with a variable for the query param TRACKNO and save the PDF file using Python.
I was able to do this using Selenium, but my code fails when the browser is used in headless mode, and I need it to work in headless mode. The code that I wrote is as follows:
import requests
from urllib3.exceptions import InsecureRequestWarning
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
import time

def extract_url(driver):
    advice_requests = driver.execute_script("var performance = window.performance || window.mozPerformance || window.msPerformance || window.webkitPerformance || {}; var network = performance.getEntries() || {}; return network;")
    print(advice_requests)
    for request in advice_requests:
        if(request.get('initiatorType',"") == 'object' and request.get('entryType',"") == 'resource'):
            link_split = request['name'].split('-')
            if(link_split[-1] == 'filedownload=X'):
                print("..... Successful")
                return request['name']
    print("..... Failed")

def save_advice(advice_url, tracking_num):
    requests.packages.urllib3.disable_warnings(category=InsecureRequestWarning)
    response = requests.get(advice_url, verify=False)
    with open(f'{tracking_num}.pdf', 'wb') as f:
        f.write(response.content)

def get_payment_advice(tracking_nums):
    options = webdriver.ChromeOptions()
    # options.add_argument('headless')  # DOES NOT WORK IN HEADLESS MODE SO COMMENTED OUT
    driver = webdriver.Chrome(options=options)
    for num in tracking_nums:
        print(num, end=" ")
        driver.get(f'https://webice.ongc.co.in/pay_adv?TRACKNO={num}#')
        try:
            WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.ID, 'ls-highlight-domref')))
            time.sleep(0.1)
            advice_url = extract_url(driver)
            save_advice(advice_url, num)
        except:
            pass
    driver.quit()

get_payment_advice(['8262'])
As can be seen, I get all the network calls that the browser makes in the first line of the extract_url function and then parse each request to find the correct one. However, this does not work in headless mode.
Is there any other way of doing this, as this seems like a workaround? If not, can this be fixed to work in headless mode?
I fixed it; I only changed one function. The correct URL is in the driver's page_source (with BeautifulSoup you can parse HTML, XML, etc.):
from bs4 import BeautifulSoup

def extract_url(driver):
    soup = BeautifulSoup(driver.page_source, "html.parser")
    object_element = soup.find("object")
    data = object_element.get("data")
    return f"https://webice.ongc.co.in{data}"
The hostname part may be extracted from the driver.
I don't think I changed anything else, but if it does not work for you, I can paste the full code.
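As a small sketch of how that hostname could be derived from the driver instead of being hard-coded, assuming the PDF lives on the same host as the current page (this helper variant is illustrative, not from the original answer):
from urllib.parse import urlparse
from bs4 import BeautifulSoup

def extract_url(driver):
    # Build the base URL from whatever page the driver is currently on,
    # instead of hard-coding "https://webice.ongc.co.in".
    parsed = urlparse(driver.current_url)
    base = f"{parsed.scheme}://{parsed.netloc}"
    soup = BeautifulSoup(driver.page_source, "html.parser")
    object_element = soup.find("object")
    return base + object_element.get("data")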
Old Answer:
If you print the text of the returned page (print(driver.page_source)), I think you would get a message that says something like:
"Because of your system configuration the pdf can't be loaded"
This is because the requested site checks some preferences to decide whether you are a robot or not. Maybe it helps to change some arguments (screen size, user agent) to fix this. Here is some information about how to detect a headless browser.
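For example, a minimal sketch of the kind of option tweaks meant here (the window size and user-agent string are just example values):
from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument('--headless')
# Give the headless browser a normal-looking screen size and user agent,
# so the site is less likely to treat it as a bot.
options.add_argument('--window-size=1920,1080')
options.add_argument('--user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                     'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36')
driver = webdriver.Chrome(options=options)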
And for next time, you should paste all relevant code (including the imports) into the question to make it easier to test.
I am trying to write a script to automate job applications on LinkedIn using Selenium and Python.
The steps are simple:
open the LinkedIn page, enter id and password and log in;
open https://linkedin.com/jobs and enter the search keyword and location and click search (directly opening links like https://www.linkedin.com/jobs/search/?geoId=101452733&keywords=python&location=Australia gets stuck on loading, probably due to the lack of some POST information from the previous page);
the click opens the job search page, but this doesn't seem to update the driver, as it still searches on the previous page.
import selenium
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from bs4 import BeautifulSoup
import pandas as pd
import yaml

driver = webdriver.Chrome("/usr/lib/chromium-browser/chromedriver")
url = "https://linkedin.com/"
driver.get(url)
content = driver.page_source

stream = open("details.yaml", 'r')
details = yaml.safe_load(stream)

def login():
    username = driver.find_element_by_id("session_key")
    password = driver.find_element_by_id("session_password")
    username.send_keys(details["login_details"]["id"])
    password.send_keys(details["login_details"]["password"])
    driver.find_element_by_class_name("sign-in-form__submit-button").click()

def get_experience():
    return "1%C22"

login()

jobs_url = f'https://www.linkedin.com/jobs/'
driver.get(jobs_url)

keyword = driver.find_element_by_xpath("//input[starts-with(@id, 'jobs-search-box-keyword-id-ember')]")
location = driver.find_element_by_xpath("//input[starts-with(@id, 'jobs-search-box-location-id-ember')]")
keyword.send_keys("python")
location.send_keys("Australia")
driver.find_element_by_xpath("//button[normalize-space()='Search']").click()
WebDriverWait(driver, 10)
# content = driver.page_source
# soup = BeautifulSoup(content)
# with open("a.html", 'w') as a:
#     a.write(str(soup))
print(driver.current_url)
driver.current_url returns https://linkedin.com/jobs/ instead of https://www.linkedin.com/jobs/search/?geoId=101452733&keywords=python&location=Australia as it should. I have tried printing the content to a file; it is indeed from the previous jobs page and not from the search page. I have also tried to search for elements from the search page, like the experience filter and the Easy Apply button, but the search results in a not-found error.
I am not sure why this isn't working.
Any ideas? Thanks in advance.
UPDATE
It works if I directly open something like https://www.linkedin.com/jobs/search/?f_AL=True&f_E=2&keywords=python&location=Australia but not https://www.linkedin.com/jobs/search/?f_AL=True&f_E=1%2C2&keywords=python&location=Australia
The difference between these two links is that one of them takes only one value for the experience level while the other takes two values. This means it's probably not a POST-values issue.
You are getting and printing the current URL immediately after clicking on the search button, before the page has changed with the response received from the server.
This is why it outputs https://linkedin.com/jobs/ instead of something like https://www.linkedin.com/jobs/search/?geoId=101452733&keywords=python&location=Australia.
WebDriverWait(driver, 10) or wait = WebDriverWait(driver, 20) will not cause any kind of delay the way time.sleep(10) does.
wait = WebDriverWait(driver, 20) only instantiates a wait object, an instance of the WebDriverWait class; it does not wait for anything until you call its until() method with an expected condition.
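A minimal sketch of what that could look like here, assuming the URL change is the condition worth waiting on:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver.find_element_by_xpath("//button[normalize-space()='Search']").click()
# Actually wait: block until the browser has navigated to the search results URL
# (or raise TimeoutException after 20 seconds).
WebDriverWait(driver, 20).until(EC.url_contains("/jobs/search"))
print(driver.current_url)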
I've got a collection of URLs in a CSV file and I want to loop through these links and open each link in the CSV one at a time. I'm getting several different errors depending on what I try, but nonetheless I can't get the browser to open the links. The print shows that the links are there.
When I run my code I get the following error:
Traceback (most recent call last):
File "/Users/Main/PycharmProjects/ScrapingBot/classpassgiit.py", line 26, in <module>
open = browser.get(link_loop)
TypeError: Object of type bytes is not JSON serializable
Can someone help me with my code below, in case I am missing something or doing it wrong?
My code:
import csv
from selenium import webdriver
from bs4 import BeautifulSoup as soup
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait as browser_wait
from selenium.webdriver.support import expected_conditions as EC
import requests

browser = webdriver.Chrome(executable_path=r'./chromedriver')

contents = []

with open('ClassPasslite.csv', 'rt') as cp_csv:
    cp_url = csv.reader(cp_csv)
    for row in cp_url:
        links = row[0]
        contents.append(links)

for link in contents:
    url_html = requests.get(links)

    for link_loop in url_html:
        print(contents)
        open = browser.get(link_loop)
Apparently, you are mixing something up with the names. Without having a copy of the .csv file, I cannot reproduce the error; hence, I will assume that you correctly extract the link from the text file.
In the second part of your code, you use requests.get to get links (mind the plural), but links apparently is an element that you define in the previous section (links = row[0]), whereas link is the actual object you define in the for loop. Below you can find a version of the code that might be a helpful starting point.
Let me add, though, that the simultaneous use of requests and selenium makes little sense in your context: why get an HTML page with requests and then loop over its elements to get other pages with selenium?
import csv
import requests
from selenium import webdriver

browser = webdriver.Chrome(executable_path=r'./chromedriver')

contents = []

with open('ClassPasslite.csv', 'rt') as cp_csv:
    cp_url = csv.reader(cp_csv)
    for row in cp_url:
        links = row[0]
        contents.append(links)

for link in contents:
    url_html = requests.get(link)  # now this is singular
    # Do what you have to do here with requests, in spite of using selenium #
Since you have not provided any example of what is contained in your variable contents, I will assume that it is a list of URL strings.
As @cap.py mentioned, you are messing up by using requests and selenium at the same time. When you do a GET web request, the server at the destination will send you a text response. This text can be simply some text, like Hello world!, or it can be some HTML. But this HTML code has to be interpreted on your computer, the one that sent the request.
That's the point of selenium over requests: requests returns the text gathered from the destination (URL), while selenium asks a browser (e.g. Chrome) to gather the text and, if this text is some HTML, to interpret it to give you a real, readable web page. Moreover, the browser runs the JavaScript inside your page, so dynamic pages work as well.
In the end, the only thing needed to run your code is this:
import csv
from selenium import webdriver
from bs4 import BeautifulSoup as soup
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait as browser_wait
from selenium.webdriver.support import expected_conditions as EC
import requests

browser = webdriver.Chrome(executable_path=r'./chromedriver')

contents = []

with open('ClassPasslite.csv', 'rt') as cp_csv:
    cp_url = csv.reader(cp_csv)
    for row in cp_url:
        links = row[0]
        contents.append(links)

# link should be something like "https://www.classpass.com/studios/forever-body-coaching-london?search-id=49534025882004019"
for link in contents:
    browser.get(link)
    # paste the code you have here
Tip: Don't forget that browsers take some time to load pages. Adding a time.sleep(3) will help you a lot.
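As an alternative to a fixed sleep, a sketch of an explicit wait could look like this; the (By.TAG_NAME, "body") locator is only a placeholder, use an element specific to the pages you load:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

for link in contents:
    browser.get(link)
    # Wait up to 10 seconds for the page body to be present instead of sleeping a fixed time.
    WebDriverWait(browser, 10).until(EC.presence_of_element_located((By.TAG_NAME, "body")))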
There is a web site from which I can get the data I need with Python / Selenium (I am new to Selenium and Python).
On the web page there are tabs; I can get the data on the first tab, as that one is active by default, but I cannot get the data on the second tab.
I attached an image: it shows the data in the Overview tab; I want to get the data in the Fundamental tab as well. The web page is investing.com.
As for the code (I did not use everything yet; some imports were added for future use):
from time import sleep, strftime
import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import smtplib
from email.mime.multipart import MIMEMultipart
from bs4 import BeautifulSoup
url = 'https://www.investing.com/stock-screener/?sp=country::6|sector::a|industry::a|equityType::a|exchange::a|last::1,1220|avg_volume::250000,15950000%3Ceq_market_cap;1'
chrome_path = 'E:\\BackUp\\IT\\__Programming\\Python\\_Scripts\\_Ati\\CSV\\chromedriver'
driver = webdriver.Chrome(chrome_path)
#driver = webdriver.Chrome()
driver.implicitly_wait(10)
driver.get(url)
my_name = driver.find_elements_by_xpath("//td[@data-column-name='name_trans']")
my_symbol = driver.find_elements_by_xpath("//td[@data-column-name='viewData.symbol']")
my_last = driver.find_elements_by_xpath("//td[@data-column-name='last']")
my_change = driver.find_elements_by_xpath("//td[@data-column-name='pair_change_percent']")
my_marketcap = driver.find_elements_by_xpath("//td[@data-column-name='eq_market_cap']")
my_volume = driver.find_elements_by_xpath("//td[@data-column-name='turnover_volume']")
The code above all works.
The XPath for the second tab does not work.
PE Ratio is in the second tab (in the Fundamental tab).
I tried these three:
my_peratio = driver.find_elements_by_xpath("//*[@id='resultsTable']/tbody/tr[1]/td[4]")
my_peratio = driver.find_elements_by_xpath("//*[@id='resultsTable']")
my_peratio = driver.find_elements_by_xpath("//td[@data-column-name='eq_pe_ratio']")
There are no error messages, but the variable my_peratio has nothing in it. It is empty.
I would really appreciate it if you could point me in the right direction.
Thanks a lot
Ati
Probably the data which is shown on the second tab is loaded dynamically.
In that case, you have to click on the second tab to show the data first.
driver.find_element_by_xpath("selector_for_second_tab").click()
After that it should be possible to get the data.
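A hedged sketch of how that might look here: the tab locator (the visible text 'Fundamental') is an assumption to adjust to the real page, while the eq_pe_ratio column name comes from the question itself.
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Click the second tab; locating it by its visible text is an assumption.
driver.find_element_by_xpath("//a[contains(text(), 'Fundamental')]").click()

# Wait until the dynamically loaded cells are present before reading them.
WebDriverWait(driver, 10).until(
    EC.presence_of_all_elements_located((By.XPATH, "//td[@data-column-name='eq_pe_ratio']"))
)
my_peratio = driver.find_elements_by_xpath("//td[@data-column-name='eq_pe_ratio']")
print([cell.text for cell in my_peratio])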
I was able to scrape the following website before, using driver = webdriver.PhantomJS() (for work reasons). What I was scraping were the price and the date.
https://www.cash.ch/fonds/swisscanto-ast-avant-bvg-portfolio-45-p-19225268/swc/chf
This stopped working some days ago due to a disclaimer page which I have to agree to first.
https://www.cash.ch/fonds-investor-disclaimer?redirect=fonds/swisscanto-ast-avant-bvg-portfolio-45-p-19225268/swc/chf
Once agreed, I can visually see the real content; the driver, however, seems not to, and the printout is [], so it must still be on the URL of the disclaimer.
Please see the code below.
from selenium import webdriver
from bs4 import BeautifulSoup
import csv
import os
driver = webdriver.PhantomJS()
driver.set_window_size(1120, 550)
#Swisscanto
driver.get("https://www.cash.ch/fonds/swisscanto-ast-avant-bvg- portfolio-45-p-19225268/swc/chf")
s_swisscanto = BeautifulSoup(driver.page_source, 'lxml')
nav_sc = s_swisscanto.find_all('span', {"data-field-entry": "value"})
date_sc = s_swisscanto.find_all('span', {"data-field-entry": "datetime"})
print(nav_sc)
print(date_sc)
print("Done Swisscanton")
This should work (I think the button you want to click is 'zustimmen'?):
driver = webdriver.PhantomJS()
driver.get("https://www.cash.ch/fonds/swisscanto-ast-avant-bvg-portfolio-45-p-19225268/swc/chf")
accept_button = driver.find_element_by_link_text('zustimmen')
accept_button.click()
content = driver.page_source
More details here
python selenium click on button
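As a follow-up sketch, the scraping from the question could then be repeated once the disclaimer has been accepted, reusing the selectors from the question (the short sleep is a crude placeholder for a proper wait on the redirect):
from bs4 import BeautifulSoup
import time

time.sleep(2)  # crude pause to let the redirect back to the fund page finish

s_swisscanto = BeautifulSoup(driver.page_source, 'lxml')
nav_sc = s_swisscanto.find_all('span', {"data-field-entry": "value"})
date_sc = s_swisscanto.find_all('span', {"data-field-entry": "datetime"})
print(nav_sc)
print(date_sc)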