I have a test that sends thousands of API requests, which makes it take over 15 minutes to complete.
What I do over and over is find an element, find two elements within it, and then read the text of those two elements.
Selenium sends one API request to find the outer element, another to find the elements within it, and a final one to get each element's text. I could skip all of these requests if I made a single request for the DOM tree of the first element and then parsed the HTML for the text myself. Is there a library for this purpose?
Something along the lines of this?
import ElementParser
elems = map(ElementParser, driver.find_elements(By.CSS_SELECTOR, 'div.row'))
for elem in elems:
    name = elem.find_element(By.CSS_SELECTOR, 'div.content').text
    description = elem.find_element(By.CSS_SELECTOR, 'div.description').text
It would be great if this library took a WebElement as an argument and mimicked Selenium's syntax, as shown above. If nothing like this exists I could write my own library, but I'd rather use something that already exists.
I'm new to Stack Overflow, so hopefully advice questions like this are appropriate.
BeautifulSoup was exactly the sort of library I was looking for. I was easily able to override one of my functions to use BeautifulSoup instead of Selenium. This gave nearly a 100x speed increase, since the function no longer needs to wait for API responses.
Original function:
def find_value(self, cell):
    text = self.find_element(*self.get_row_selector(cell.cell_number)).get_attribute('innerHTML')
    if text:
        return text
New function:
def find_value(self, cell):
    text = BeautifulSoup(self.get_attribute('innerHTML'), 'html.parser').select(self.get_row_selector(cell.cell_number)[1])[0].decode_contents()
    if text:
        return text
(Code not copied verbatim, to make it more readable. Also, type(self) == WebElement.)
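For anyone who wants the whole pattern from the question rather than the single function, here is a minimal sketch of the same idea: fetch the HTML of the container once, then read every row locally with BeautifulSoup. The function name parse_rows and the sample markup are made up for illustration; in a real test the string would come from something like container.get_attribute("outerHTML").

```python
from bs4 import BeautifulSoup


def parse_rows(container_html):
    """Return (name, description) pairs for every div.row in the given HTML."""
    soup = BeautifulSoup(container_html, "html.parser")
    pairs = []
    for row in soup.select("div.row"):
        # select_one returns None when the row has no matching child.
        name = row.select_one("div.content")
        description = row.select_one("div.description")
        pairs.append((
            name.get_text(strip=True) if name is not None else None,
            description.get_text(strip=True) if description is not None else None,
        ))
    return pairs


# Invented markup standing in for the HTML fetched from the browser in one call.
sample_html = """
<div id="table">
  <div class="row"><div class="content">Alpha</div><div class="description">First row</div></div>
  <div class="row"><div class="content">Beta</div><div class="description">Second row</div></div>
</div>
"""

rows = parse_rows(sample_html)
print(rows)
```

The only round trip to the browser is the one that fetches the HTML string; everything after that is plain in-process parsing.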
Basic concept I know:
find_element finds a single element, so we can call .text or .get_attribute('href') on it to get a readable value. find_elements returns a list, so calling .text or .get_attribute('href') on it directly fails with an attribute error.
To get readable information out of find_elements, we can use a for loop:
vegetables_search = driver.find_elements(By.CLASS_NAME, "product-brief-wrapper")
for i in vegetables_search:
    print(i.text)
Here is my problem: when I use find_element inside the loop, it returns the same result every time. I searched online, and the answer was that find_element only ever returns a single result. Here is the line that I hoped would grab a different URL for each product:
links.append(driver.find_element(By.XPATH, ".//a[@rel='noopener']").get_attribute('href'))
I also don't know how to combine the results in pandas. With the code below, the Link column in the CSV file contains the same URL on every row...
vegetables_search = driver.find_elements(By.CLASS_NAME, "product-brief-wrapper")
Product_name = []
links = []
for search in vegetables_search:
    Product_name.append(search.find_element(By.TAG_NAME, "h4").text)
    links.append(driver.find_element(By.XPATH, ".//a[@rel='noopener']").get_attribute('href'))
# use the pandas module to export the information
df = pd.DataFrame({'Product': Product_name, 'Link': links})
df.to_csv('name.csv', index=False)
print(df)
Certainly, if I loop over find_elements on its own, it prints different links. (Does that mean my XPath is correct?)
product_link = driver.find_elements(By.XPATH, "//a[@rel='noopener']")
for i in product_link:
    print(i.get_attribute('href'))
My questions:
Besides using a for loop, how can I read values from find_elements directly, the way find_element(By.attribute, 'content').text works?
How do I take my code a step further? I cannot get it to output different URLs.
Thanks so much. ORZ
This is the HTML inspected from the website:
This line:
links.append(driver.find_element(By.XPATH, ".//a[@rel='noopener']").get_attribute('href'))
should be changed to
links.append(search.find_element(By.XPATH, ".//a[@rel='noopener']").get_attribute('href'))
driver.find_element(By.XPATH, ".//a[@rel='noopener']").get_attribute('href') always returns the first element in the whole DOM that matches the .//a[@rel='noopener'] XPath locator, whereas you want the match inside another element.
To do that, replace the WebDriver object driver with the WebElement object search that you want to search inside, as shown above.
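The difference can be reproduced without a browser. The sketch below uses the standard library's xml.etree.ElementTree as a stand-in for Selenium (an assumption made so the example runs anywhere), on invented markup that loosely mirrors the product cards; page plays the role of driver and each card plays the role of search.

```python
import xml.etree.ElementTree as ET

# Invented, well-formed markup standing in for the product page.
page = ET.fromstring(
    "<body>"
    "<div class='product-brief-wrapper'><h4>Carrot</h4>"
    "<a rel='noopener' href='/p/carrot'>view</a></div>"
    "<div class='product-brief-wrapper'><h4>Leek</h4>"
    "<a rel='noopener' href='/p/leek'>view</a></div>"
    "</body>"
)

cards = page.findall(".//div[@class='product-brief-wrapper']")

# Searching from the page root finds the first link in the document every time.
from_root = [page.find(".//a[@rel='noopener']").get("href") for _ in cards]

# Searching from each card finds that card's own link.
from_card = [card.find(".//a[@rel='noopener']").get("href") for card in cards]

print(from_root)  # ['/p/carrot', '/p/carrot']
print(from_card)  # ['/p/carrot', '/p/leek']
```

The locator string is identical in both list comprehensions; only the object the search starts from changes.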
I'm having some issues in crawling this website search:
https://www.simplyhired.com/search?q=data+engineer&l=United+States&pn=1&job=ZMzeXt6JW0jMuZc6H-3Af3sqOGzeQMLj7X5mnXXv9ZteeAoGm6oDdg
I'm trying to extract these elements from the SimplyHired job search for Data Engineer in the US:
But when I apply an XPath locator to any of them using the Selector module, I get different results, and in a different order.
The outputs also don't line up with each other (for example, the index of a job name in the job-name XPath results is not the index of its location in the location XPath results).
Here is my code:
from scrapy import Selector
import requests
response = requests.get('https://www.simplyhired.com/search?q=data+engineer&l=united+states&mi=exact&sb=dd&pn=1&job=X1yGOt2Y8QTJm0tYqyptbgV9Pu19ge0GkVZK7Im5WbXm-zUr-QMM-A').text
sel = Selector(text=response)
# job name
sel.xpath('//main[@id="job-list"]/div/article[contains(@class,"SerpJob")]/div/div[@class="jobposting-title-container"]/h2/a/text()').extract()
# company
sel.xpath('//main[@id="job-list"]/div/article/div/h3[@class="jobposting-subtitle"]/span[@class="JobPosting-labelWithIcon jobposting-company"]/text()').extract()
# location
sel.xpath('//main[@id="job-list"]//div/article/div/h3[@class="jobposting-subtitle"]/span[@class="JobPosting-labelWithIcon jobposting-location"]/span/span/text()').extract()
# salary estimates
sel.xpath('//main[@id="job-list"]//div/article/div/div[@class="SerpJob-metaInfo"]//div[@class="SerpJob-metaInfoLeft"]/span/text()[2]').extract()
I'm not quite sure whether you're trying to use Scrapy or requests. It looks like you want to use requests with XPath selectors.
For websites like this, it's best to treat each individual job advert as a 'card'. Loop over the cards and apply the XPath selectors you need to each one to get the data you want.
Code Example
card = sel.xpath('//div[@class="SerpJob-jobCard card"]')
for a in card:
    title = a.xpath('.//a[@class="card-link"]/text()').get()
    company = a.xpath('.//span[@class="JobPosting-labelWithIcon jobposting-company"]/text()').get()
    salary = a.xpath('.//span[@class="jobposting-salary"]/text()').get()
    location = a.xpath('.//span[@class="jobposting-location"]/text()').get()
Explanation
You want to search each card with relative XPath selectors. The leading .// restricts the search to the chunk of HTML beneath the current card (the loop variable a).
Prefer get() to extract(). get() returns a single value, the first match as a string (or None if nothing matches), which is what we want when looping over each card. extract() called on a selector list returns a list of all matches, even when there is only one, which is often not what you want; called on a single selector it returns a string. That ambiguity is not ideal. If you want multiple values, use getall(), which is explicit and always returns a list.
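To see why the card-by-card loop also fixes the index mismatch described in the question, consider what happens when one advert lacks a field. The sketch below uses the standard library's xml.etree.ElementTree instead of the Scrapy Selector so that it runs without a network connection; the markup and the salary class name are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Invented listing: the second advert has no salary.
listing = ET.fromstring(
    "<main>"
    "<article><a class='card-link'>Data Engineer</a><span class='salary'>$120k</span></article>"
    "<article><a class='card-link'>ETL Developer</a></article>"
    "<article><a class='card-link'>Analytics Engineer</a><span class='salary'>$95k</span></article>"
    "</main>"
)

# Column-wise extraction: the lists differ in length, so indices drift apart.
titles = [a.text for a in listing.findall(".//a[@class='card-link']")]
salaries = [s.text for s in listing.findall(".//span[@class='salary']")]

# Card-wise extraction: one record per advert, None where a field is absent.
records = []
for card in listing.findall(".//article"):
    records.append({
        "title": card.findtext(".//a[@class='card-link']"),
        "salary": card.findtext(".//span[@class='salary']"),
    })

print(len(titles), len(salaries))
for record in records:
    print(record)
```

In the column-wise lists, salaries[1] belongs to the third advert, not the second. The card-wise records keep the gap visible as None, which plays the same role as get() returning None for a missing match.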
Additional Information
If you find you're not getting the correct data in the right format, check whether content is being added to the page by JavaScript. Turn off JavaScript in your browser and refresh the page. On this particular site, none of the data you need is loaded by JavaScript, which makes it much easier to scrape.
Sadly, I couldn't find any resources online for my problem. I'm trying to store elements found by XPath in a list and then search within one of those elements. But instead of searching in the given element, Selenium seems to search the whole page again.
Does anyone know about this? I've seen this definition:
// Selects nodes in the document from the current node that match the selection no matter where they are
But I've also tried "/" and it didn't work either.
Instead of giving me the text for each div, it gives me the text from all divs.
My code:
from selenium import webdriver
driver = webdriver.Chrome()
result_text = []
# I'm looking for all divs with a specific class and store them in a list
divs_found = driver.find_elements_by_xpath("//div[@class='a-fixed-right-grid-col a-col-left']")
# Here seems to be the problem: instead of "divs_found[1]" it behaves like "driver" and looks at the whole site
hrefs_matching_in_div = divs_found[1].find_elements_by_xpath("//a[contains(@href, '/gp/product/')]")
# Now I'm looking in the found href matches to store the text from it
for href in hrefs_matching_in_div:
    result_text.append(href.text)
print(result_text)
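You need to add a leading . so that the XPath is evaluated relative to the current element; .// then searches only that element's descendants instead of the whole document. Try this:
hrefs_matching_in_div = divs_found[1].find_elements_by_xpath(".//a[contains(@href, '/gp/product/')]")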
Here, I want to scrape a website called "fundsnetservices.com." Specifically, I want to grab the text below each program — it's about a paragraph's worth of text.
Using the Google Chrome Inspect method, I was able to pull this...
'/html/body/div[3]/div/div/div[1]/div/p[2]/text()'
... as the xpath. However, every time I print the text out, it returns [ ]. Why might this be?
import urllib.request
from lxml import etree
response = urllib.request.urlopen('http://www.fundsnetservices.com/searchresult/30/International-Grants-&-Funders/18.html')
tree = etree.HTML(response.read().decode('utf-16'))
text = tree.xpath('/html/body/div[3]/div/div/div[1]/div/p[2]/text()')
It seems your code returns whitespace nodes. Correct your XPath to:
//p[@class="tdclass"]/text()[3]
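If the position of the paragraph among the text nodes is not stable, one alternative to a hard-coded index is to collect the element's direct text nodes and drop the whitespace-only ones. This is a minimal sketch using the standard library's xml.etree.ElementTree; the markup and the helper name direct_text_nodes are assumptions, because the real page structure isn't shown in the question.

```python
import xml.etree.ElementTree as ET


def direct_text_nodes(element):
    """Return the text nodes that are direct children of element, in order."""
    # ElementTree stores these as element.text plus the tail of each child.
    nodes = [element.text] + [child.tail for child in element]
    return [node for node in nodes if node is not None]


# Invented markup: a link, two line breaks, then the paragraph text.
p = ET.fromstring(
    "<p class='tdclass'>\n"
    "  <a href='/program'>Program name</a>\n"
    "  <br/>\n"
    "  <br/>A paragraph describing the program.\n"
    "</p>"
)

nodes = direct_text_nodes(p)
cleaned = [node.strip() for node in nodes if node.strip()]

print(len(nodes))
print(cleaned)
```

With lxml, the same filter can be applied to the list returned by tree.xpath('//p[@class="tdclass"]/text()').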
I'm new to Selenium with Python. I'm trying to scrape some data, but I can't figure out how to parse the output of commands like this:
driver.find_elements_by_css_selector("div.flightbox")
I tried to google a tutorial but found nothing for Python.
Could you give me a hint?
find_elements_by_css_selector() returns a list of WebElement instances. Each web element has a number of methods and attributes available. For example, to get the inner text of an element, use .text:
for element in driver.find_elements_by_css_selector("div.flightbox"):
    print(element.text)
You can also make a context-specific search to find other elements inside the current element. Since I know which site you are working with, here is example code that gets the departure and arrival times for the first-way flight in each result box:
for result in driver.find_elements_by_css_selector("div.flightbox"):
    departure_time = result.find_element_by_css_selector("div.departure p.p05 strong").text
    arrival_time = result.find_element_by_css_selector("div.arrival p.p05 strong").text
    print([departure_time, arrival_time])
Make sure you study the Getting Started, Navigating and Locating Elements documentation pages.