There is a website from which I can get the data I need with Python/Selenium (I am new to Selenium and Python).
On the web page there are tabs. I can get the data on the first tab, as that one is active by default, but I cannot get the data on the second tab.
I attached an image: it shows the data in the Overview tab; I want to get the data in the Fundamental tab as well. The web page is investing.com.
As for the code (I have not used everything yet; some imports were added for future use):
from time import sleep, strftime
import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import smtplib
from email.mime.multipart import MIMEMultipart
from bs4 import BeautifulSoup
url = 'https://www.investing.com/stock-screener/?sp=country::6|sector::a|industry::a|equityType::a|exchange::a|last::1,1220|avg_volume::250000,15950000%3Ceq_market_cap;1'
chrome_path = 'E:\\BackUp\\IT\\__Programming\\Python\\_Scripts\\_Ati\\CSV\\chromedriver'
driver = webdriver.Chrome(chrome_path)
#driver = webdriver.Chrome()
driver.implicitly_wait(10)
driver.get(url)
my_name = driver.find_elements_by_xpath("//td[@data-column-name='name_trans']")
my_symbol = driver.find_elements_by_xpath("//td[@data-column-name='viewData.symbol']")
my_last = driver.find_elements_by_xpath("//td[@data-column-name='last']")
my_change = driver.find_elements_by_xpath("//td[@data-column-name='pair_change_percent']")
my_marketcap = driver.find_elements_by_xpath("//td[@data-column-name='eq_market_cap']")
my_volume = driver.find_elements_by_xpath("//td[@data-column-name='turnover_volume']")
The code above all works.
The XPath for the second tab does not work.
The P/E ratio is in the second tab (in Fundamental).
I tried these three:
my_peratio = driver.find_elements_by_xpath("//*[@id='resultsTable']/tbody/tr[1]/td[4]")
my_peratio = driver.find_elements_by_xpath("//*[@id='resultsTable']")
my_peratio = driver.find_elements_by_xpath("//td[@data-column-name='eq_pe_ratio']")
There are no error messages, but the list my_peratio has nothing in it. It is empty.
I would really appreciate it if you could point me in the right direction.
Thanks a lot,
Ati
Probably the data shown on the second tab is loaded dynamically.
In that case, you have to click on the second tab to render the data first.
driver.find_element_by_xpath("selector_for_second_tab").click()
After that it should be possible to get the data.
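For example, a minimal sketch, assuming the Fundamental tab can be located by its link text (verify the actual selector in the page source) and reusing the eq_pe_ratio column from your third attempt:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Click the Fundamental tab (the link text is an assumption; inspect the page to confirm)
driver.find_element_by_link_text("Fundamental").click()

# Wait until the P/E column has rendered before reading it
WebDriverWait(driver, 10).until(
    EC.presence_of_all_elements_located((By.XPATH, "//td[@data-column-name='eq_pe_ratio']"))
)
my_peratio = driver.find_elements_by_xpath("//td[@data-column-name='eq_pe_ratio']")
print([td.text for td in my_peratio])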
For the site https://www.wsop.com/tournaments/results/, the objective is to download all available PDFs in the REPORTS section, behind all the different drop-down options where they are available.
Currently I am trying to do this using Selenium, because I couldn't find an API, but I am open to other suggestions. For now the code is a patchwork of copy-paste from relevant questions and YouTube videos.
My plan of attack is to select an option in the drop-down menu, press 'GO' (to load the results), navigate to 'REPORTS' (if available) and download all the PDFs available, then iterate over all options. Challenge 2 is then to get the PDFs into something like a dataframe to do some analysis.
Below is my current code, which only manages to download the top PDF of the option selected by default in the drop-down:
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.support.ui import Select
from selenium.webdriver.chrome.options import Options
import os
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
#settings and loading webpage
options=Options()
options.headless=True
CD=ChromeDriverManager().install()
driver=webdriver.Chrome(CD,options=options)
params={'behavior':'allow','downloadPath':os.getcwd()+'\\PDFs'}
driver.execute_cdp_cmd('Page.setDownloadBehavior',params)
driver.get('https://www.wsop.com/tournaments/results/')
#Go through the dropdown
drp=Select(driver.find_element_by_id("CPHbody_aid"))
drp.select_by_index(0)
drp=Select(driver.find_element_by_id("CPHbody_grid"))
drp.select_by_index(1)
drp=Select(driver.find_element_by_id("CPHbody_tid"))
drp.select_by_index(5)
#Click the necessary buttons (section with issues)
driver.find_element_by_xpath('//*[@id="nav-tabs"]/a[6]').click()
#driver.find_element_by_name('GO').click()
#WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.LINK_TEXT, "GO"))).click()
#WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.LINK_TEXT, "REPORTS"))).click()
a=driver.find_element_by_id("reports").click()
I can navigate through the drop-downs just fine (and it should be easy to iterate over them). However, I cannot get the 'GO' button pressed. I tried it a bunch of different ways, a few of which I left as comments in the code.
I am able to press the REPORTS tab, but I think that breaks down when there is a different number of tabs; the line in the comments might work better. For now I am not able to download all the PDFs anyway; it just takes the first PDF of the page.
Many thanks to whoever can help :)
The website is structured in such a way that you can loop through the years a WSOP was played, then within each year loop through every event and read the data from each page into a pandas dataframe. This is far more efficient than downloading and parsing the PDFs.
You can control how far back you go with the from_year variable in line 5 of the script; going way back will obviously take more time. The script below outputs all the data to CSV. Note that not every event has POY points available. Also, you'll need to pip install requests, pandas and bs4 if you haven't already.
import requests
import pandas as pd
from bs4 import BeautifulSoup

from_year = 2020

wsop_tournament_url = 'https://www.wsop.com/tournaments/GetTournaments.aspx?aid=2'
wsop_resp = requests.get(wsop_tournament_url)
soup = BeautifulSoup(wsop_resp.text, 'html.parser')
years = [x['value'] for x in soup.find_all('option') if str(from_year) in x.text]

event_dfs = []
for year in years:
    event_resp = requests.get(f'https://www.wsop.com/tournaments/GetEvents.aspx?grid={str(year)}')
    soup = BeautifulSoup(event_resp.text, 'html.parser')
    event_ids = [x['value'] for x in soup.find_all('option')]
    for event in event_ids:
        page = 1
        while True:
            url = f'https://www.wsop.com/tournaments/results/?aid=2&grid={year}&tid={event}&rr=5&curpage={page}'
            results = requests.get(url)
            soup = BeautifulSoup(results.text, 'html.parser')
            info = soup.find('div', {'id': 'eventinfo'})
            dates = info.find('p').text
            name = info.find('h1').text
            year_name = soup.find('div', {'class': 'content'}).find('h3').text.strip()
            table = soup.find('div', {'id': 'results'})
            size = int(table.find('ul')['class'][0][-1])
            rows = table.find_all('li')
            if len(rows) <= size + 1:
                break
            print(f'processing {year_name} - {name} - page {page}')
            output = []
            headers = []
            for x in range(size):
                series = []
                for i, row in enumerate(rows):
                    if i == x:
                        headers.append(row.text)
                        continue
                    if i % size == x:
                        series.append(row.text)
                output.append(series)
            df = pd.DataFrame(output)
            df = df.transpose()
            df.columns = headers
            df['year_name'] = year_name
            df['event_id'] = event
            df['year_id'] = year
            df['event_name'] = name
            df['dates'] = dates
            df['url'] = url
            event_dfs.append(df)
            page += 1
    print(f'Scraped {year_name} successfully')

final_df = pd.concat(event_dfs)
final_df.to_csv('wsop_output.csv', index=False)
I am not going to write the whole script for you, but here's how to click on the 'GO' button:
We can see from the developer tools that the button is the only element with the class "submit-red-button", so we can access it with: driver.find_elements_by_class_name('submit-red-button')[0].click()
You say that you can access the REPORTS tab, but it did not work when I tested your program, so just in case, you can use driver.find_elements_by_class_name('taboff')[4] to get it.
Then all you need to do is click on each PDF link to download the files.
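Putting it together, a minimal sketch (the selector for the PDF links is an assumption; inspect the REPORTS page to confirm the actual markup):
# Press GO, then open the REPORTS tab
driver.find_elements_by_class_name('submit-red-button')[0].click()
driver.find_elements_by_class_name('taboff')[4].click()

# Assumed selector: any anchor whose href ends in .pdf
for link in driver.find_elements_by_css_selector("a[href$='.pdf']"):
    link.click()  # downloads land in the folder set via Page.setDownloadBehavior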
I am trying to write a script to automate job applications on Linkedin using selenium and python.
The steps are simple:
open the LinkedIn page, enter id password and log in
open https://linkedin.com/jobs, enter the search keyword and location, and click search (directly opening links like https://www.linkedin.com/jobs/search/?geoId=101452733&keywords=python&location=Australia gets stuck loading, probably due to the lack of some POST information from the previous page)
the click opens the job search page, but this doesn't seem to update the driver, as it still searches on the previous page.
import selenium
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from bs4 import BeautifulSoup
import pandas as pd
import yaml
driver = webdriver.Chrome("/usr/lib/chromium-browser/chromedriver")
url = "https://linkedin.com/"
driver.get(url)
content = driver.page_source
stream = open("details.yaml", 'r')
details = yaml.safe_load(stream)
def login():
    username = driver.find_element_by_id("session_key")
    password = driver.find_element_by_id("session_password")
    username.send_keys(details["login_details"]["id"])
    password.send_keys(details["login_details"]["password"])
    driver.find_element_by_class_name("sign-in-form__submit-button").click()

def get_experience():
    return "1%C22"
login()
jobs_url = f'https://www.linkedin.com/jobs/'
driver.get(jobs_url)
keyword = driver.find_element_by_xpath("//input[starts-with(@id, 'jobs-search-box-keyword-id-ember')]")
location = driver.find_element_by_xpath("//input[starts-with(@id, 'jobs-search-box-location-id-ember')]")
keyword.send_keys("python")
location.send_keys("Australia")
driver.find_element_by_xpath("//button[normalize-space()='Search']").click()
WebDriverWait(driver, 10)
# content = driver.page_source
# soup = BeautifulSoup(content)
# with open("a.html", 'w') as a:
# a.write(str(soup))
print(driver.current_url)
driver.current_url returns https://linkedin.com/jobs/ instead of https://www.linkedin.com/jobs/search/?geoId=101452733&keywords=python&location=Australia as it should. I have tried printing the page content to a file; it is indeed from the previous jobs page and not from the search page. I have also tried to find elements from the search page, like the experience filter and the Easy Apply button, but the search results in a not-found error.
I am not sure why this isn't working.
Any ideas? Thanks in advance.
UPDATE
It works if I directly open something like https://www.linkedin.com/jobs/search/?f_AL=True&f_E=2&keywords=python&location=Australia but not https://www.linkedin.com/jobs/search/?f_AL=True&f_E=1%2C2&keywords=python&location=Australia.
The difference between these links is that one of them takes only one value for experience level while the other takes two values. This means it's probably not a POST-values issue.
You are getting and printing the current URL immediately after clicking the search button, before the page has changed with the response received from the server.
That is why it outputs https://linkedin.com/jobs/ instead of something like https://www.linkedin.com/jobs/search/?geoId=101452733&keywords=python&location=Australia.
WebDriverWait(driver, 10) or wait = WebDriverWait(driver, 20) will not cause any kind of delay the way time.sleep(10) does.
wait = WebDriverWait(driver, 20) only instantiates a wait object, an instance of the WebDriverWait class; it does nothing until you call its until() method with an expected condition.
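For example, a minimal sketch of how to actually wait before reading the URL (url_contains is one possible expected condition; any condition that signals the results page has loaded will do):
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver.find_element_by_xpath("//button[normalize-space()='Search']").click()

# Blocks for up to 10 seconds, polling until the URL contains '/jobs/search'
WebDriverWait(driver, 10).until(EC.url_contains("/jobs/search"))
print(driver.current_url)  # now reflects the search results page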
I would like to download data from the http://ec.europa.eu/taxation_customs/vies/ site. The problem is that when I enter data on it through the program, the URL doesn't change, so the file saved to disk shows the same page that was opened at the beginning, without the data. Maybe I don't know how to access this site after submitting the data? I'm new to Python and tried to look for a solution, but with no result, so if there was such an issue already, please link me to it. Here's my code. I appreciate all responses :)
import requests
import selenium
import select as something
from selenium import webdriver
from selenium.webdriver.support.ui import Select
import pdfkit
url = "http://ec.europa.eu/taxation_customs/vies/?locale=pl"
driver = webdriver.Chrome(executable_path ="C:\\Users\\Python\\Chromedriver.exe")
driver.get("http://ec.europa.eu/taxation_customs/vies/")
#wait = WebDriverWait(driver, 10)
obj = Select(driver.find_element_by_id("countryCombobox"))
obj = obj.select_by_index(1)
vies_r = requests.get(url)
vies_vat = driver.find_element_by_id("number")
vies_vat.send_keys('U54799909')
vies_verify = driver.find_element_by_id("submit")
vies_verify.click()
path_wkhtmltopdf = r'C:\Users\Python\wkhtmltox\wkhtmltox\bin\wkhtmltopdf.exe'
config = pdfkit.configuration(wkhtmltopdf=path_wkhtmltopdf)
print(driver.current_url)
pdfkit.from_url(driver.current_url, "out.pdf", configuration=config)
Ukalo
I am trying to scrape the data in this DB. I asked a similar question about this previously, but my current question is more specific; I am starting to understand the issue better.
So far, with Selenium, I can type 22663 into the 'search by plant-based food' field, then click 'food-disease associations' underneath, and then click Submit, as shown here:
It's the next page that I have the issue with: I cannot click 'Plant-Disease Associations'.
I have tried numerous ideas from other SO posts:
import sys
import pandas as pd
from bs4 import BeautifulSoup
import selenium
from selenium import webdriver
from selenium.webdriver.support.ui import Select
import csv
from selenium.webdriver.firefox.firefox_binary import FirefoxBinary
from selenium.webdriver.common.by import By
#binary = FirefoxBinary('/Users/kela/Desktop/scripts/scraping/geckodriver')
url = 'http://147.8.185.62/services/NutriChem-2.0/'
driver = webdriver.Firefox(executable_path='/Users/kela/Desktop/scripts/scraping/geckodriver')
driver.get(url)
#input the tax ID
element = driver.find_element_by_id("input_food_name")
element.send_keys("22663")
#click food-disease association
element = Select(driver.find_element_by_css_selector('[name=food_search_section]'))
element.select_by_value('food_disease')
#click submit
submit_xpath = '/html/body/form/p[2]/input[1]'
destination_page_link = driver.find_element_by_xpath(submit_xpath)
destination_page_link.click()
# this is where it goes wrong
#click plant-disease associations
#table_data = driver.find_elements_by_xpath('//td[#class="likeabutton"]')
#driver.find_element_by_link_text("plant-disease").click()
#driver.find_element_by_link_text("nutrichem12587_disease.tsv").click()
#driver.find_element_by_xpath("//div[contains(#onclick'nutrichem12587_disease.tsv']").click()
#values = []
#for i in table_data.find_element_by_tag_name('Plant-Disease associations'):
# values.append(i.text)
#print(value)
#span = table_data.find_element_by_tag_name('Plant-Disease associations')
#print(span)
#select = Select(driver.find_element_by_xpath("/html/body/table/tbody/tr/td[3]"))
#select.click()
#submit_xpath = '/html/body/table/tbody/tr/td[3]/div/span'
#submit_xpath = '/html/body/table/tbody/tr/td[3]'
#destination_page_link = driver.find_element_by_xpath(submit_xpath)
#destination_page_link.click()
#element = driver.find_element_by_xpath("//select[#name='plant-disease']")
#element.select_by_value('Plant-Disease associations')
#xpath2 = '/html/body/table/tbody/tr/td[3]/div'
#destination_page_link = driver.find_element_by_xpath(xpath2)
#destination_page_link.click()
#xpath2 = '/html/body/table/tbody/tr/td[3]/div/span'
#destination_page_link = driver.find_element_by_xpath(xpath2)
#destination_page_link.click()
I've commented out all the lines that I've tried that don't work. You can see I've tried multiple options suggested in different SO posts. I'm aware that there are a lot of similar questions out there, but none of the solutions seem to work for me; the errors are all basically the same, 'cannot find element' (e.g. selenium.common.exceptions.NoSuchElementException: Message: Unable to locate element: nutrichem12587_disease.tsv).
Can someone please help me click on the 'Plant-Disease Associations' button? I'm wondering, is it because the page that I'm trying to click on is .php?
It is inside a frame. You need to switch to that frame first:
driver.find_element_by_css_selector('[value="Submit"]').click()
driver.switch_to.frame(driver.find_element_by_css_selector('frame'))
driver.find_element_by_css_selector('[onclick*="plant-disease"]').click()
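If you need to interact with the top-level page again afterwards, switch back out of the frame first; a minimal sketch:
# Return to the top-level document once you are done inside the frame
driver.switch_to.default_content()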
I created a script with Selenium to get information for a given GST number. I completed that program, and it gives me the required details as output without any problem.
Now I do not want it to interact with the Chrome browser anymore, so I'm trying to do this with BeautifulSoup.
BeautifulSoup is new to me, so I don't have much idea how to find elements, and I have searched a lot about how to send keys with BeautifulSoup, but I'm not getting it.
Now my script is stuck here.
from bs4 import BeautifulSoup
import requests
import urllib.request as urllib2
quote_page = 'https://my.gstzen.in/p/search-taxpayer'
page = urllib2.urlopen(quote_page)
soup = BeautifulSoup(page, 'html.parser')
Now, even if I manage to find the GST input element, I'm wondering how I send keys to it, like the 15-digit GST number followed by an Enter key press or a click on 'Search GST Details'.
If possible, let me know the solution so I can start my research on it.
Actually, I need to complete this tonight.
Also, here is my script, which does the same thing with Selenium easily. I want to do the same thing with BeautifulSoup, because I do not want Chrome to run every time while checking a GST number, and BeautifulSoup seems interesting.
import selenium
import time
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.keys import Keys
import csv
import requests
#import pyvirtualdisplay
#from pyvirtualdisplay import display
#display = Display(visible=0, size=(800, 600))
#display.start()
browser = webdriver.Chrome('E:\\Chrome Driver\\chromedriver_win32\\chromedriver.exe')
browser.set_window_position(-10000,0)
browser.get('https://my.gstzen.in/p/search-taxpayer/')
with open('product.csv') as csv_file:
    csv_reader = csv.reader(csv_file, delimiter=',')
    next(csv_reader)
    for row in csv_reader:
        name, phone = row
        time.sleep(1)
        gst = browser.find_element_by_name('gstin')
        gst.click()
        gst.send_keys(name)
        time.sleep(1)
        Details = browser.find_element_by_xpath("//*[contains(text(), ' Search GSTIN Details')]")
        Details.click()
        info = browser.find_element_by_class_name('col-sm-4')
        print(info.text)
        info2 = browser.find_element_by_xpath('/html/body/div[4]/div/div/div[1]/div[2]/div[2]/div[1]/div[2]')
        print(info2.text)

input('Press Enter to quit')
browser.quit()
BeautifulSoup is a library for parsing and formatting, not for interacting with web pages. For the latter, if the page requires JavaScript to work, you're stuck using a (headless) browser.
If it doesn't, you have at least two options:
Watch the Network tab in your browser's developer tools and see if you can recreate the request for the page you want using requests or urllib2 (see the sketch below)
Use mechanize, which is built specifically to work with forms on sites that don't depend on JavaScript
mechanize is a little more work if there's no CSRF token or similar mechanism (though, again, it'll fail if JavaScript is required) and a little less work if there is.
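As an illustration of the first option, a minimal sketch, assuming the form simply POSTs the GSTIN back to the same page (the actual endpoint, field names, and any hidden token must be confirmed in the Network tab):
import requests
from bs4 import BeautifulSoup

# Assumed endpoint and form field, taken from the Selenium version; verify
# both against the real request shown in the browser's Network tab.
resp = requests.post(
    'https://my.gstzen.in/p/search-taxpayer/',
    data={'gstin': '15-digit-gstin-here'},
)
soup = BeautifulSoup(resp.text, 'html.parser')

# Parse whatever the response page contains, e.g. the detail blocks
for div in soup.find_all('div', class_='col-sm-4'):
    print(div.get_text(strip=True))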