How to pull out text from a div class using Selenium headless [closed] - python

I'm trying to pull out the "0%" from the following div tag:
<div class="sem-report-header-td-diff ">0%</div>
my current code is:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
options = Options()
options.add_argument('--headless')
driver = webdriver.Chrome(executable_path='mypath/chrome.exe',
                          chrome_options=options)
url = 'https://www.semrush.com/info/burton.com'
driver.get(url)
driver.implicitly_wait(2)
change_elements = driver.find_elements_by_xpath(xpath='//div[@class="sem-report-header-td-diff "]')
Not sure what I'm doing wrong. This approach works with href tags, but it's not working here.

As per the HTML you have shared, to extract the text 0% you can use the method get_attribute("innerHTML") with either of the following locators:
css_selector:
myText = driver.find_element_by_css_selector("div.sem-report-header-td-diff").get_attribute("innerHTML")
xpath:
myText = driver.find_element_by_xpath("//div[@class='sem-report-header-td-diff ']").get_attribute("innerHTML")

First of all, it should be find_element (singular), not find_elements. Second, you didn't get the text; you only located the element.
So, here is the code:
element_text = driver.find_element_by_xpath("//div[@class='sem-report-header-td-diff ']").text
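If the value is rendered only after the page's scripts run, an explicit wait is usually more reliable than implicitly_wait. A minimal sketch of that approach, assuming chromedriver is on the PATH and reusing the class name from the HTML shared above:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

options = Options()
options.add_argument('--headless')
driver = webdriver.Chrome(options=options)
driver.get('https://www.semrush.com/info/burton.com')

# wait up to 10 seconds for the percentage element to become visible, then read it
element = WebDriverWait(driver, 10).until(
    EC.visibility_of_element_located((By.CSS_SELECTOR, 'div.sem-report-header-td-diff')))
print(element.text)                         # "0%"
print(element.get_attribute('innerHTML'))   # also "0%"
driver.quit()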

Related

I'm learning how to crawl a site in Python, but I don't know how to do it for a tree structure [closed]

When I click each item on https://dicom.innolitics.com/ciods (values like CR Image, Patient, Referenced Patient Sequence, and so on), I want to save the item's description shown in the right-hand layout into a variable.
I'm trying to save the values by clicking on the items on the left.
But I found that none of the values in the tree were crawled!
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
url = "https://dicom.innolitics.com/ciods"
driver.get(url)
wait = WebDriverWait(driver, 10)
wait.until(EC.presence_of_element_located((By.CLASS_NAME, 'tree-table')))
table_list = []
tree_table = driver.find_element(By.CLASS_NAME, 'tree-table')
tree_rows = tree_table.find_elements(By.TAG_NAME, 'tr')
for i, row in enumerate(tree_rows):
    row.click()
    td = row.find_element(By.TAG_NAME, 'td')
    a = td.find_element(By.CLASS_NAME, 'row-name')
    row_name = a.find_element(By.TAG_NAME, 'span').text
    print(f'Row {i+1} name: {row_name}')
driver.quit()
This is what I tried.
I want to know how to crawl the values in the tree.
It would be even better if you could show me how to crawl the layout on the right :)
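No answer was posted, but here is a rough sketch of one approach: click each row, then read whatever text is rendered in the detail area on the right. The 'tree-table' and 'row-name' class names come from the code above; the '.detail-pane' selector is a hypothetical placeholder that would need to be replaced after inspecting the page:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("https://dicom.innolitics.com/ciods")
wait = WebDriverWait(driver, 10)
wait.until(EC.presence_of_element_located((By.CLASS_NAME, 'tree-table')))

descriptions = {}
rows = driver.find_elements(By.CSS_SELECTOR, '.tree-table tr')
for row in rows:
    name = row.find_element(By.CLASS_NAME, 'row-name').text
    row.click()
    # '.detail-pane' is a placeholder for the real selector of the right-hand layout;
    # if clicking expands the tree, the row list may also need to be re-fetched
    detail = wait.until(
        EC.visibility_of_element_located((By.CSS_SELECTOR, '.detail-pane'))).text
    descriptions[name] = detail

print(descriptions)
driver.quit()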

How to always find the newest blog post with Selenium [closed]

How can I select the first post in the new posts section at publish0x.com/newposts with selenium?
The post will always be in the same spot on the page but the title and author will vary.
The page where the posts are located: https://www.publish0x.com/newposts
An example of a post: https://www.publish0x.com/dicecrypto-articles/welcome-to-altseason-as-major-altcoins-explodes-xmkmvkk
First get a list of all the posts, and then extract the first element:
driver.get('https://www.publish0x.com/newposts')
post = driver.find_element_by_css_selector('#main > div.infinite-scroll > div:nth-child(1) > div.content')
and now you can extract the author name and title from the post like:
title = post.find_element_by_css_selector('h2 > a').text
author = post.find_element_by_css_selector('p.text-secondary > small:nth-child(4) > a').text
driver.get("https://www.publish0x.com/newposts");
List<WebElement> LIST = driver.findElements(By.xpath("//div[@class='infinite-scroll']//div/descendant::div[@class='content']"));
System.out.println(LIST.size());
LIST.get(0).click();
Use the above XPath to click the first post (this is Java code, but you can use the same XPath in Python).
Try this and let me know.
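For reference, a rough Python equivalent of the Java snippet above, using the same XPath:
driver.get("https://www.publish0x.com/newposts")
posts = driver.find_elements_by_xpath(
    "//div[@class='infinite-scroll']//div/descendant::div[@class='content']")
print(len(posts))
posts[0].click()  # open the newest post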

Web scraping with Selenium not capturing full text [closed]

I'm trying to mine quite a bit of text from a list of links using Selenium/Python.
In this example, I scrape only one of the pages and that successfully grabs the full text:
page = 'https://xxxxxx.net/xxxxx/September%202020/2020-09-24'
driver = webdriver.Firefox()
driver.get(page)
elements = driver.find_element_by_class_name('text').text
elements
Then, when I loop through the whole list of links (all the per-day links on this page: https://overrustlelogs.net/Destinygg%20chatlog/September%202020) using the same method that worked for a single page, it does not grab the full text:
for i in tqdm(chat_links):
    driver.get(i)
    #driver.implicitly_wait(200)
    elements = driver.find_element_by_class_name('text').text
    #elements = driver.find_element_by_xpath('/html/body/main/div[1]/div[1]').text
    #elements = elements.text
    temp = {'elements': elements}
    chat_text.append(temp)
driver.close()
chat_text
My thought is that maybe the page doesn't get a chance to load completely, but it works on the single page. Also, driver.get is supposed to load the whole page.
Any ideas? Thanks, much appreciated.
The page is lazy loading; you need to scroll the page and add the data to the list.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time

driver = webdriver.Chrome()
driver.get("https://overrustlelogs.net/Destinygg%20chatlog/September%202020/2020-09-30")
WebDriverWait(driver, 10).until(EC.visibility_of_element_located((By.CSS_SELECTOR, ".text>span")))
height = driver.execute_script("return document.body.scrollHeight")
data = []
while True:
    driver.execute_script("window.scrollTo(0,document.body.scrollHeight)")
    time.sleep(1)
    for item in driver.find_elements_by_css_selector(".text>span"):
        if item.text in data:
            continue
        else:
            data.append(item.text)
    lastheight = driver.execute_script("return document.body.scrollHeight")
    if height == lastheight:
        break
    height = lastheight
print(data)
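To apply this to the original list of links, one option is to wrap the scroll-and-collect logic in a helper and call it per link. A rough sketch, assuming chat_links, tqdm and the imports above are already available; joining the span texts with newlines only approximates the single .text string the question collected:
def collect_chat_text(driver, url):
    # scroll until the page height stops growing, collecting the message spans
    driver.get(url)
    WebDriverWait(driver, 10).until(
        EC.visibility_of_element_located((By.CSS_SELECTOR, ".text>span")))
    height = driver.execute_script("return document.body.scrollHeight")
    data = []
    while True:
        driver.execute_script("window.scrollTo(0,document.body.scrollHeight)")
        time.sleep(1)
        for item in driver.find_elements_by_css_selector(".text>span"):
            if item.text not in data:
                data.append(item.text)
        lastheight = driver.execute_script("return document.body.scrollHeight")
        if height == lastheight:
            break
        height = lastheight
    return data

chat_text = []
for link in tqdm(chat_links):
    chat_text.append({'elements': '\n'.join(collect_chat_text(driver, link))})
driver.close()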

Syntax error when using find_elements_by_xpath [closed]

Web scraping beginner here.
I am trying to get the amount of items on this webpage: https://www.asos.com/dk/maend/a-to-z-of-brands/nike/cat/?cid=4766&refine=attribute_10992:61388&nlid=mw|sko|shop+efter+brand
However, when I use the len() function, it reports a syntax error.
from bs4 import BeautifulSoup
import requests
import selenium
from selenium.webdriver import Firefox
driver = Firefox()
url = "https://www.asos.com/dk/maend/a-to-z-of-brands/nike/cat/?cid=4766&refine=attribute_10992:61388&nlid=mw|sko|shop+efter+brand"
driver.get(url)
items = len(driver.find_elements_by_xpath(//*[@id="product-12257648"])
for item in range(items):
    price = item.find_element_by_xpath("/html/body/main/div/div/div/div[2]/div/div[1]/section/div/article[16]/a/p/span[1]")
    print(price)
It then outputs this error:
File "C:/Users/rasmu/PycharmProjects/du nu ffs/jsscrape.py", line 13
items = len(driver.find_elements_by_xpath(//*[@id="product-12257648"])
^
SyntaxError: invalid syntax
Process finished with exit code 1
Try this:
items = len(driver.find_elements_by_xpath("//*[@id='product-12257648']"))
You need double quotes surrounding the XPath.
If you want all prices, you can refactor your code as such --
from selenium import webdriver
# start driver, navigate to url
driver = webdriver.Firefox()
url = "https://www.asos.com/dk/maend/a-to-z-of-brands/nike/cat/?cid=4766&refine=attribute_10992:61388&nlid=mw|sko|shop+efter+brand"
driver.get(url)
# iterate product price elements
for item in driver.find_elements_by_xpath("//p[./span[@data-auto-id='productTilePrice']]"):
    # print price text of element
    print(item.text)
driver.close()
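If the goal is the total number of items rather than one hard-coded product id, the same selector can simply be counted (a small sketch building on the code above):
prices = driver.find_elements_by_xpath("//p[./span[@data-auto-id='productTilePrice']]")
print(len(prices))  # number of product tiles that show a price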

How to find backlinks in a website with python [closed]

I am kind of stuck with this situation: I want to find the backlinks of websites but cannot figure out how to do it. Here is my regex:
readh = BeautifulSoup(urllib.urlopen("http://www.google.com/").read()).findAll("a",href=re.compile("^http"))
To find backlinks, what I want is to find links that start with http but not links that include google, and I cannot figure out how to manage this.
from BeautifulSoup import BeautifulSoup
import re

html = """
<div>hello</div>
<a href="/relative/path">Not this one</a>
<a href="http://www.google.com/search">Link 1</a>
<a href="http://example.com/page">Link 2</a>
"""

def processor(tag):
    href = tag.get('href')
    if not href: return False
    return True if (href.find("google") == -1) else False

soup = BeautifulSoup(html)
back_links = soup.findAll(processor, href=re.compile(r"^http"))
print back_links
--output:--
[<a href="http://example.com/page">Link 2</a>]
However, it may be more efficient just to get all the links starting with http, then search those links for links that do not have 'google' in their hrefs:
http_links = soup.findAll('a', href=re.compile(r"^http"))
results = [a for a in http_links if a['href'].find('google') == -1]
print results
--output:--
[<a href="http://example.com/page">Link 2</a>]
Here is a regexp that matches http links but excludes ones that include google:
re.compile("(?!.*google)^http://(www.)?.*")
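For example, it could be plugged into findAll the same way as the earlier regex filter (a sketch following the pattern above):
pattern = re.compile("(?!.*google)^http://(www.)?.*")
back_links = soup.findAll('a', href=pattern)
print back_links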
