I'm trying to extract information from this page: https://www.bandsintown.com/e/103275458-nayo-jones-at-promise-of-justice-initiative?came_from=253&utm_medium=web&utm_source=city_page&utm_campaign=event
I'm trying to extract the time (6:30 PM).
My strategy is to find the second instance of the date (Mar. 31st, 2022), and then get the first following sibling of that element (the part that was boxed in yellow in my screenshot).
Here's what I've tried:
# Get the first date (the date at the top of the page)
try:
    first_date = driver.find_elements_by_css_selector('a[href^="https://www.bandsintown.com/a/"] + div + div')
    first_date = first_date[0].text
except (ElementNotVisibleException, NoSuchElementException, TimeoutException):
    print("first_date doesn't exist")
    continue

# Get the time. This will be the first sibling of the second instance of the date
try:
    event_time = driver.find_elements_by_xpath("//div[text()='" + first_date + "'][1]/following-sibling::div")
    print(event_time[0].text)
except (ElementNotVisibleException, NoSuchElementException, TimeoutException):
    continue
However, this is not getting me what I want. What am I doing wrong here? I'm looking for a way to get the first sibling of the second instance using XPath.
It seems the time is the first element containing "PM" / "AM", so I would use find_element with
'//div[contains(text(), " PM") or contains(text(), " AM")]'
like this:
item = driver.find_element(By.XPATH, '//div[contains(text(), " PM") or contains(text(), " AM")]')
print(item.text)
I use a space before PM/AM to make sure it is not part of a longer word.
Your XPath works when I add ( ) so that it first collects all matching divs and only then selects one by index.
Without (), the [1] in [text()="..."][1] is applied relative to each element's parent, not to the whole result set.
And it needs [2] instead of [1] because you want the second instance, and XPath starts counting at 1, not 0:
"(//div[text()='" + first_date + "'])[2]/following-sibling::div"
Full working example
from selenium import webdriver
from selenium.webdriver.common.by import By
#from webdriver_manager.chrome import ChromeDriverManager
from webdriver_manager.firefox import GeckoDriverManager
import time
url = 'https://www.bandsintown.com/e/103275458-nayo-jones-at-promise-of-justice-initiative?came_from=253&utm_medium=web&utm_source=city_page&utm_campaign=event'
#driver = webdriver.Chrome(executable_path=ChromeDriverManager().install())
driver = webdriver.Firefox(executable_path=GeckoDriverManager().install())
driver.get(url)
time.sleep(5)
item = driver.find_element(By.XPATH, '//div[contains(text(), " PM") or contains(text(), " AM")]')
print(item.text)
print('---')
first_date = driver.find_elements(By.CSS_SELECTOR, 'a[href^="https://www.bandsintown.com/a/"] + div + div')
first_date = first_date[0].text
event_time = driver.find_elements(By.XPATH, "(//div[text()='" + first_date + "'])[2]/following-sibling::div")
print(event_time[0].text)
The following XPath expressions will give you the date and time.
date:
print(driver.find_element_by_xpath("//a[text()='Promise of Justice Initiative']/following::div[4]").text)
time:
print(driver.find_element_by_xpath("//a[text()='Promise of Justice Initiative']/following::div[5]").text)
Or you can use:
print(driver.find_element_by_xpath("//a[contains(@href,'https://www.bandsintown.com/v/')]/following::div[contains(text(), 'PM') or contains(text(), 'AM')]").text)
I used web scraping through Python with Selenium in order to get daily price values for EEX French Power futures at the url "https://www.eex.com/en/market-data/power/futures#%7B%22snippetpicker%22%3A%2221%22%7D".
I guess they updated their website, as the url changed recently, and now my script doesn't work properly anymore: I can't find a way to click on each displayed product button (Year, Quarter, Month, Weekend, Day).
Here is my code up to the step that doesn't work (it simply doesn't click; it doesn't fail):
import time
import datetime
from datetime import date
from dateutil.relativedelta import relativedelta
import pyodbc
import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from webdriver_manager.chrome import ChromeDriverManager
url = "https://www.eex.com/en/market-data/power/futures#%7B%22snippetpicker%22%3A%2221%22%7D"
dico_product = ('Day', 'Weekend', 'Week', 'Month', 'Quarter', 'Year')
now = datetime.datetime.now()
date_prx = now.date()
options=Options()
d = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)
d.get(url)
time.sleep(6)
d.maximize_window()
cookies_button_str = "//input[@class='btn bordered uo_cookie_btn_type_1']"
d.find_element(By.XPATH, cookies_button_str).click()
time.sleep(4)
dateinput_button_str = "//div[@class = 'mv-date-input']//div[@class = 'mv-stack-block']//input[@class = 'mv-input-box']"
Date_input = date_prx
Date_input_str = str(Date_input.year) + '-' + str(Date_input.month) + '-' + str(Date_input.day)
element_view = d.find_element(By.CLASS_NAME, 'collapsed')
d.execute_script("arguments[0].scrollIntoView()", element_view)
WebDriverWait(d, 20).until(EC.presence_of_element_located((By.XPATH, dateinput_button_str)))
element = d.find_element(By.XPATH, dateinput_button_str)
time.sleep(2)
d.execute_script('arguments[0].value = "' + str(Date_input_str) + '";', element)
time.sleep(2)
element_button_str = './/div[contains(@class, "mv-button-base mv-hyperlink-button")]'
containers = d.find_elements(By.XPATH, element_button_str)
for item in containers:
    if item.text in dico_product:
        print('Processing ' + str(item.text) + ' for date ' + str(Date_input_str) + '.')
        element_button_str = './/div[contains(@class, "' + str(item.get_attribute("class")) + '") and contains(., "' + str(item.text) + '")]'
        product_button = d.find_element(By.XPATH, element_button_str)
        d.execute_script("arguments[0].click()", product_button)
It does find the element to click on, but it doesn't click.
What is surprising is that with the old url, which takes you to the Austrian futures by default, it works fine. But with the proper url, it doesn't.
I don't know if it can be done or if it's no use, but honestly I've tried everything I could think of. Could you kindly help me?
Thank you
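One direction worth trying, sketched under assumptions (the XPath is the one from the question; that these buttons need a clickability wait and a real mouse-style click on this page is unverified; it also relies on Selenium 4, where element_to_be_clickable accepts a WebElement):
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

wait = WebDriverWait(d, 20)
buttons_xpath = './/div[contains(@class, "mv-button-base mv-hyperlink-button")]'
for name in dico_product:
    # re-find the buttons on every pass, since a click may re-render the widget
    # and leave previously found elements stale
    for item in d.find_elements(By.XPATH, buttons_xpath):
        if item.text == name:
            # wait until the element is visible and enabled, then click at its
            # real on-screen position instead of through execute_script
            wait.until(EC.element_to_be_clickable(item))
            ActionChains(d).move_to_element(item).click().perform()
            break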
Hello, I'm trying to scrape some questions from a web forum.
I am able to scrape questions with
find_elements_by_xpath
It's something like this:
questions = driver.find_elements_by_xpath('//div[@class="auto-generated"]//div[@class="corpus"]//div[@class="body-bd"]//p')
My problem is that if I don't specify the auto-generated class in the XPath, it returns all the values from the other divs (which I don't want), and writing the auto-generated class manually, like I did to test, isn't viable because I'm scraping multiple questions with multiple classes.
Do you have any ideas on how to resolve this problem?
Here is the web forum: https://forum.bouyguestelecom.fr/. Thank you!
My code:
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
import pandas as pd

driver = webdriver.Chrome('/Users/ossama/Downloads/chromedriver_win32/chromedriver')
page = 1
# loop through the pages
while page <= 10:
    driver.get('https://forum.bouyguestelecom.fr/questions/browse?flow_state=published&order=created_at.desc&page='+str(page)+'&utf8=✓&search=&with_category%5B%5D=2483')
    # on the first page, click the cookies pop-up
    if page == 1:
        # waiting 10s for the pop-up to show up before accepting it
        time.sleep(10)
        driver.find_element_by_id('popin_tc_privacy_button_3').click()
        # store all the links in a list
        #question_links = driver.find_elements_by_xpath('//div[@class="corpus"]//a[@class="content_permalink"]')
        links = driver.find_elements_by_xpath('//div[@class="corpus"]//a[@class="content_permalink"]')
        forum_links = []
        for link in links:
            value = link.get_attribute("href")
            print(value)
            forum_links.append(value)
    else:
        links = driver.find_elements_by_xpath('//div[@class="corpus"]//a[@class="content_permalink"]')
        for link in links:
            value = link.get_attribute("href")
            print(value)
            forum_links.append(value)
    q_df = pd.DataFrame(forum_links)
    q_df.to_csv('forum_links.csv')
    page = page + 1

for link in forum_links:
    driver.get(link)
    #time.sleep(5)
    #driver.find_element_by_id('popin_tc_privacy_button_3').click()
    questions = driver.find_elements_by_xpath('//div[@class="corpus"]//div[@class="body-bd"]//p')
    authors = driver.find_elements_by_xpath('//div[@class="corpus"]//div[@class="metadata"]//dl[@class="author-name"]//dd//a')
    dates = driver.find_elements_by_xpath('//div[@class="corpus"]//div[@class="metadata"]//dl[@class="date"]//dd')
    questions_list = []
    for question in questions:
        for author in authors:
            for date in dates:
                questions_list.append([question.text, author.text, date.text])
                print(question.text)
                print(author.text)
                print(date.text)
    q_df = pd.DataFrame(questions_list)
    q_df.to_csv('colrow.csv')
Improved the XPath and removed the second loop.
page = 1
while page <= 10:
    driver.get('https://forum.bouyguestelecom.fr/questions/browse?flow_state=published&order=created_at.desc&page=' + str(page) + '&utf8=✓&search=&with_category%5B%5D=2483')
    driver.maximize_window()
    print("Page url: " + driver.current_url)
    time.sleep(1)
    if page == 1:
        AcceptButton = driver.find_element(By.ID, 'popin_tc_privacy_button_3')
        AcceptButton.click()
    questions = driver.find_elements(By.XPATH, '//div[@class="corpus"]//a[@class="content_permalink"]')
    for count, item in enumerate(questions, start=1):
        print(str(count) + ": question detail:")
        questionfount = driver.find_element(By.XPATH, "(//div[@class='corpus']//a[@class='content_permalink'])[" + str(count) + "]")
        questionfount.click()
        questionInPage = WebDriverWait(driver, 20).until(EC.visibility_of_element_located(
            (By.XPATH, "(//p[@class='old-h1']//following::div[contains(@__uid__, 'dim')]//div[@class='corpus']//a[@class='content_permalink'])[1]")))
        author = WebDriverWait(driver, 20).until(EC.visibility_of_element_located(
            (By.XPATH, "(//p[@class='old-h1']//following::div[contains(@__uid__, 'dim')]//div[@class='corpus']//div[contains(@class, 'metadata')]//dl[@class='author-name']//a)[1]")))
        date = WebDriverWait(driver, 20).until(EC.visibility_of_element_located(
            (By.XPATH, "(//p[@class='old-h1']//following::div[contains(@__uid__, 'dim')]//div[@class='corpus']//div[contains(@class, 'metadata')]//dl[@class='date']//dd)[1]")))
        print(questionInPage.text)
        print(author.text)
        print(date.text)
        print("-----------------------------------------------------------------------------------------------------------")
        driver.back()
        driver.refresh()
    page = page + 1
driver.quit()
Output (in Console):
Page url: https://forum.bouyguestelecom.fr/questions/browse?flow_state=published&order=created_at.desc&page=1&utf8=%E2%9C%93&search=&with_category%5B%5D=2483
1: question detail:
Comment annuler ma commande bbox
ELHADJI
17 novembre 2021
-----------------------------------------------------------------------------------------------------------
2: question detail:
BBOX adsl : Interruption Service Internet ?
GABRIELA
17 novembre 2021
-----------------------------------------------------------------------------------------------------------
To overcome this issue I found that the div with the auto-generated class has a __uid__ attribute,
so here is what the XPath looks like now:
questions = driver.find_elements_by_xpath('//div[@__uid__="dim2"]//div[@class="corpus"]//div[@class="body-bd"]//p')
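A hedged side note: hardcoding "dim2" has the same fragility as hardcoding the auto-generated class, and the uid values appear to follow a dim1, dim2, ... pattern, so a starts-with() match might generalize better (an untested assumption about this forum's markup):
# assumes every relevant container's __uid__ begins with "dim"; verify against the page source
questions = driver.find_elements_by_xpath('//div[starts-with(@__uid__, "dim")]//div[@class="corpus"]//div[@class="body-bd"]//p')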
Sometimes we just gotta focus right!
I am trying to extract the text from within a <strong> tag that is deeply nested in the HTML content of this webpage: https://www.marinetraffic.com/en/ais/details/ships/imo:9854612
For example, the strong tag is the only one on the webpage that will contain the string 'cubic meters'.
My objective is to extract the entire text, i.e., "138124 cubic meters Liquid Gas". When I try the following, I get an error:
url = "https://www.marinetraffic.com/en/ais/details/ships/imo:9854612"
driver.get(url)
time.sleep(3)
element = driver.find_element_by_link_text("//strong[contains(text(),'cubic meters')]").text
print(element)
Error:
NoSuchElementException: Message: no such element: Unable to locate element: {"method":"link text","selector":"//strong[contains(text(),'cubic meters')]"}
What am I doing wrong here?
The following also throws an error:
element = driver.find_element_by_xpath("//strong[contains(text(),'cubic')]").text
Your code works on Firefox(), but not on Chrome().
The page uses lazy loading, so you have to scroll down to the Summary section before it loads the text with the expected strong.
I used a slightly slower method: I search for all elements with class='lazyload-wrapper', and in a loop I scroll to each item and check whether it contains a strong. If it doesn't, I scroll to the next class='lazyload-wrapper'.
from selenium import webdriver
from selenium.webdriver.common.action_chains import ActionChains
import time

#driver = webdriver.Firefox()
driver = webdriver.Chrome()

url = "https://www.marinetraffic.com/en/ais/details/ships/imo:9854612"
driver.get(url)
time.sleep(3)

actions = ActionChains(driver)

elements = driver.find_elements_by_xpath("//span[@class='lazyload-wrapper']")
for number, item in enumerate(elements):
    print('--- item', number, '---')
    #print('--- before ---')
    #print(item.text)
    actions.move_to_element(item).perform()
    time.sleep(0.1)
    #print('--- after ---')
    #print(item.text)
    try:
        # ".//" keeps the search inside the current item; a bare "//" would search the whole page
        strong = item.find_element_by_xpath(".//strong[contains(text(), 'cubic')]")
        print(strong.text)
        break
    except Exception as ex:
        #print(ex)
        pass
Result:
--- item 0 ---
--- item 1 ---
--- item 2 ---
173400 cubic meters Liquid Gas
The result shows that I could use elements[2] to skip the first two elements, but I wasn't sure whether this text would always be in the third element.
Before I created my version I tested other approaches; here is the full working code:
from selenium import webdriver
import time

#driver = webdriver.Firefox()
driver = webdriver.Chrome()

url = "https://www.marinetraffic.com/en/ais/details/ships/imo:9854612"
driver.get(url)
time.sleep(3)

def test0():
    elements = driver.find_elements_by_xpath("//strong")
    for item in elements:
        print(item.text)
    print('---')
    item = driver.find_element_by_xpath("//strong[contains(text(), 'cubic')]")
    print(item.text)

def test1a():
    from selenium.webdriver.common.action_chains import ActionChains
    actions = ActionChains(driver)
    element = driver.find_element_by_xpath("//div[contains(@class,'MuiTypography-body1')][last()]//div")
    actions.move_to_element(element).build().perform()
    text = element.text
    print(text)

def test1b():
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(0.5)
    text = driver.find_element_by_xpath("//div[contains(@class,'MuiTypography-body1')][last()]//strong").text
    print(text)

def test2():
    from bs4 import BeautifulSoup
    import re
    soup = BeautifulSoup(driver.page_source, "html.parser")
    print(soup.find_all(string=re.compile(r"\d+ cubic meters")))

def test3():
    from selenium.webdriver.common.action_chains import ActionChains
    actions = ActionChains(driver)
    elements = driver.find_elements_by_xpath("//span[@class='lazyload-wrapper']")
    for number, item in enumerate(elements, 1):
        print('--- number', number, '---')
        #print('--- before ---')
        #print(item.text)
        actions.move_to_element(item).perform()
        time.sleep(0.1)
        #print('--- after ---')
        #print(item.text)
        try:
            # search inside the current item only
            strong = item.find_element_by_xpath(".//strong[contains(text(), 'cubic')]")
            print(strong.text)
            break
        except Exception as ex:
            #print(ex)
            pass

#test0()
#test1a()
#test1b()
#test2()
test3()
You can use Beautiful Soup for this, and more precisely the string argument; from the documentation, "you can search for strings instead of tags".
As an argument, you can also pass a regex pattern.
>>> from bs4 import BeautifulSoup
>>> import re
>>> soup = BeautifulSoup(driver.page_source, "html.parser")
>>> soup.find_all(string=re.compile(r"\d+ cubic meters"))
['173400 cubic meters Liquid Gas']
If you're sure there is only one result, or you need just the first, you can also use find instead of find_all.
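For example, a minimal sketch of the find variant (same soup as above; it returns the first matching string, or None, rather than a list):
>>> soup.find(string=re.compile(r"\d+ cubic meters"))
'173400 cubic meters Liquid Gas'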
Your XPath expression is correct and works in Chrome. You get NoSuchElementException because the element is not loaded within the 3 seconds you wait, so it does not exist yet.
To wait for the element, use the WebDriverWait class. It waits explicitly for a specific condition of the element, and in your case presence is enough.
In the code below, Selenium will wait up to 10 seconds for the element to be present in the HTML, polling every 500 milliseconds. You can read about WebDriverWait and the expected conditions here.
Some useful information:
Elements that are not visible return an empty string as their text. In such a case you need to wait for the visibility of the element, or, if the element requires a scroll, to scroll to it (example added below).
You can also get the text from a not-visible element using JavaScript.
from selenium.webdriver.common.by import By
from selenium.webdriver.remote.webelement import WebElement
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as ec
from selenium import webdriver

url = "https://www.marinetraffic.com/en/ais/details/ships/imo:9854612"
locator = "//strong[contains(text(),'cubic meters')]"

with webdriver.Chrome() as driver:
    wait = WebDriverWait(driver, 10)
    driver.get(url)
    cubic = wait.until(ec.presence_of_element_located((By.XPATH, locator)))  # type: WebElement
    print(cubic.text)

    # The examples below are just for information
    # and are not needed for this case

    # Example with scroll: scroll to the element to make it visible
    cubic.location_once_scrolled_into_view
    print(cubic.text)

    # Example using JavaScript: works for not-visible elements
    text = driver.execute_script("return arguments[0].textContent", cubic)
    print(text)
The more robust approach, though, would be to use the MarineTraffic API.
I guess you should first scroll to that element, and only after that try accessing it, including getting its text.
from selenium.webdriver.common.action_chains import ActionChains

actions = ActionChains(driver)
element = driver.find_element_by_xpath("//div[contains(@class,'MuiTypography-body1')][last()]//div")
actions.move_to_element(element).build().perform()
text = element.text
In case the above is still not good enough, you can scroll the whole page height once, like this:
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
time.sleep(0.5)
the_text = driver.find_element_by_xpath("//div[contains(@class,'MuiTypography-body1')][last()]//strong").text
I am trying to retrieve the daily temperature from a local weather site.
I built this loop using BeautifulSoup.
Unfortunately the loop breaks after the first round.
This is my code and the result:
code:
#coding: latin-1
import requests
from bs4 import BeautifulSoup
from selenium import webdriver

# create the file zamg-data.txt,
# comma-separated
f = open('zamg-data.txt','w')

# start webdriver
driver = webdriver.Chrome("/usr/local/bin/chromedriver")

# loop through months and days
for m in range(1,13):
    for d in range(1, 32):
        # stop after the last day of the month
        if (m==2 and d>28):
            break
        elif (m in [4,6,9,11] and d>30):
            break

        # open the zamg site
        timestamp = '2019' +'-'+ str(m) +'-'+ str(d)
        print("call page of "+timestamp)
        url = "https://www.zamg.ac.at/cms/de/klima/klima-aktuell/klimamonitoring/?param=t&period=period-ymd-"+timestamp
        driver.get(url)

        # extract temperature
        html = driver.execute_script("return document.documentElement.outerHTML")
        soup = BeautifulSoup(html, "html.parser")
        data = soup.find_all(class_='u-txt--big')[1].string
        print(len(data))
        print(data + '...okay')

        # format month for timestamp
        if(len(str(m)) < 2):
            mStamp = '0'+str(m)
        else:
            mStamp = str(m)

        # format day for timestamp
        if(len(str(d)) < 2):
            dStamp = '0'+ str(d)
        else:
            dStamp = str(d)

        # timestamp
        timestamp = '2019' + mStamp + dStamp

        # write time and value
        f.write(timestamp + ',' + data + '\n')

# data is extracted - close the file
f.close()
my result:
➜ weather-app python get-data-02.py
call page of 2019-1-1
5
+3,9 ...okay
call page of 2019-1-2
Traceback (most recent call last):
File "get-data-02.py", line 37, in <module>
data = soup.find_all(class_='u-txt--big')[1].string
IndexError: list index out of range
➜ weather-app
I don't understand what is wrong here. The 2nd page loads in the browser, but then it breaks.
Any ideas?
#coding: latin-1
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
import datetime
import time

# build a list of all dates in 2019
base = datetime.datetime(2019,1,1).date()
date_list = [base + datetime.timedelta(days=x) for x in range(365)]

# start webdriver
driver = webdriver.Chrome("/usr/local/bin/chromedriver")
base_url = "https://www.zamg.ac.at/cms/de/klima/klima-aktuell/klimamonitoring/?param=t&period=period-ymd-"

with open('zamg-data.txt','w') as file:
    for dt in date_list:
        timestamp = dt.strftime("%Y-%m-%d")
        print("call page of "+timestamp)
        url = f"{base_url}{timestamp}"
        driver.get(url)
        # wait until all elements of the class are present
        WebDriverWait(driver, timeout=40).until(EC.presence_of_all_elements_located((By.CLASS_NAME, "u-txt--big")))

        # extract temperature
        html = driver.execute_script("return document.documentElement.outerHTML")
        soup = BeautifulSoup(html, "html.parser")
        data = soup.find_all(class_='u-txt--big')[1].string
        print(len(data))
        print(data + '...okay')

        # timestamp for the output file
        timestamp_1 = dt.strftime("%Y%m%d")

        # write time and value
        file.write(timestamp_1 + ',' + data + '\n')
        time.sleep(3)

driver.quit()
print("Done!!!")
As someone in the comments mentioned, you need to make the browser wait till all elements of that class are detected. I've also added an explicit time delay after each page load so that the website is not overwhelmed with requests; hammering a site is a potential way to get your IP blocked. And it's best to always use a context manager whenever you can.
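For instance, the driver itself can also be used as a context manager, as in this small sketch (quit() is then called automatically, even if an exception is raised mid-scrape):
from selenium import webdriver

with webdriver.Chrome("/usr/local/bin/chromedriver") as driver:
    # ... same scraping loop as above ...
    driver.get("https://www.zamg.ac.at/cms/de/klima/klima-aktuell/klimamonitoring/?param=t&period=period-ymd-2019-01-01")
    print(driver.title)
# the browser has been closed automatically here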
Hi everyone. I want to scrape, but I get this error at item 59.
I have 1089 items in my xlsx file.
Error:
Traceback (most recent call last):
File ".\seleniuminform.py", line 28, in <module>
s.write(phone[i].text + "," + wevsite_link[i].text + "\n")
IndexError: list index out of range
Here is my python code:
import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

with open("Sans Fransico.csv","r") as s:
    s.read()

df = pd.read_excel('myfile.xlsx')  # Get all the urls from the excel
mylist = df['Urls'].tolist()  # Urls is the column name

driver = webdriver.Chrome()

for url in mylist:
    driver.get(url)
    wevsite_link = driver.find_elements_by_css_selector(".text--offscreen__373c0__1SeFX+ .link-size--default__373c0__1skgq")
    phone = driver.find_elements_by_css_selector(".text--offscreen__373c0__1SeFX+ .text-align--left__373c0__2pnx_")
    num_page_items = len(phone)
    with open("Sans Fransico.csv", 'a', encoding="utf-8") as s:
        for i in range(num_page_items):
            s.write(phone[i].text + "," + wevsite_link[i].text + "\n")

driver.close()
print("Done")
Link:
https://www.yelp.com/biz/daeho-kalbijjim-and-beef-soup-san-francisco-9?osq=Restaurants
This is the page where I get the error for the website and phone values.
I'm not very familiar with Selenium, so I can't comment on that aspect.
The first time you open "Sans Fransico.csv" you read the contents without assigning them to a variable.
As for your error, it's caused by the fact that your range is based on the length of phone, not on the length of wevsite_link. If wevsite_link is shorter than phone, you get an IndexError. In simple terms, you are finding fewer website links than phone numbers, yet your code assumes that you will always find exactly the same number of each.
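As an illustration only (a hedged sketch of the pairing problem, not a fix for the underlying selectors): zip() stops at the end of the shorter list, so it avoids the IndexError, at the cost of silently dropping the unpaired entries.
# zip pairs entries until the shorter list runs out, so no IndexError is raised,
# but phone numbers without a matching website link are silently skipped
for ph, site in zip(phone, wevsite_link):
    s.write(ph.text + "," + site.text + "\n")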
Can you explain your code a bit more? What are you trying to do?
It seems some items have no phone, so it found fewer phones than websites.
You should first find all ".text--offscreen__373c0__1SeFX+" containers and then use a for-loop to search for the phone and website in every item separately.
Using try/except you can recognize when an item has no phone and use an empty string as the phone number.
for url in mylist:
    driver.get(url)
    all_items = driver.find_elements_by_css_selector(".text--offscreen__373c0__1SeFX+")
    for item in all_items:
        try:
            wevsite_link = item.find_element_by_css_selector(".link-size--default__373c0__1skgq")
            wevsite_link = wevsite_link.text
        #except selenium.common.exceptions.NoSuchElementException:
        except:
            wevsite_link = ''

        try:
            phone = item.find_element_by_css_selector(".text-align--left__373c0__2pnx_")
            phone = phone.text
        #except selenium.common.exceptions.NoSuchElementException:
        except:
            phone = ''

        with open("Sans Fransico.csv", 'a', encoding="utf-8") as s:
            s.write(phone + "," + wevsite_link + "\n")
I didn't have a url to the page, so I couldn't test it.
At a glance, I suspect that
phone = driver.find_elements_by_css_selector(".text--offscreen__373c0__1SeFX+ .text-align--left__373c0__2pnx_")
is returning 0 matches on some pages. Perhaps the CSS selectors you're trying to find matches for aren't accurate.
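A quick diagnostic sketch to test that theory (a hypothetical addition to the question's loop; it only prints how many elements each selector matches per page):
for url in mylist:
    driver.get(url)
    wevsite_link = driver.find_elements_by_css_selector(".text--offscreen__373c0__1SeFX+ .link-size--default__373c0__1skgq")
    phone = driver.find_elements_by_css_selector(".text--offscreen__373c0__1SeFX+ .text-align--left__373c0__2pnx_")
    # if these two counts ever differ, indexing one list by the other's length will raise IndexError
    print(url, len(phone), len(wevsite_link))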