I am trying to scrape job postings, and I have Selenium visit each individual posting to get some text. The problem is that the page structure isn't the same for every posting, so I'm trying to tell Selenium to grab the text from the element that immediately follows a div containing the text "Job Scope", since that is where the text always is.
Here is one site; again, the structure is not the same for every posting, but the text always comes after Job Scope/Duties: https://recruitingbypaycor.com/career/JobIntroduction.action?clientId=8a87142e46c6fe710146f995773e6461&id=8a78839e812f7de70181456c3ad709ff&source=&lang=en
Here is my code:
job_descriptions = WebDriverWait(browser, 10).until(EC.presence_of_all_elements_located((By.ID,"gnewtonJobDescriptionText")))
for job in job_descriptions:
    job_description.append(job.find_elements(By.XPATH, "//div[contains(text(),'Job Scope')]/following-sibling::div"))
The script runs, but it produces an empty list. So when I add .text to the end, it errors out saying 'list' object has no attribute 'text' (obviously, since it's producing an empty list). The only thing I can think of is that the text Job Scope is actually within a b tag inside the preceding div, so maybe that's why?
Related
I'm doing some scraping with Selenium in Python. My problem is that when I call WebElement.text() it gives me a string on one line with no formatting. But I want to get that text just as the web page shows it, that is, with the line breaks.
For example, the element with text:
'Hello this is an example'
On the page it shows as:
'Hello this is an
example'
I want the second result, but Selenium gives me the first one. I tried to 'manually' format the text using the width of the words with PIL, but the results are quite inexact.
Instead of using the text attribute, you need to use the get_attribute("innerHTML") as follows:
print(WebElement.get_attribute("innerHTML"))
You can find a relevant discussion in How to retrieve the text of a WebElement using Selenium - Python
References
Link to useful documentation:
get_attribute() method Gets the given attribute or property of the element.
text attribute returns The text of the element.
Difference between text and innerHTML using Selenium
I am scraping this webpage and while trying to extract text from one element, I am hitting a dead end.
So the element in question is shown below in the image -
The text in this element is within the <p> tags inside the <div>. I tried extracting the text in the scrapy shell using the following code - response.css("div.home-hero-blurb no-select::text").getall(). I received an empty list as the result.
Alternatively, if I try going a bit further and reference the <p> tags individually, I can get the text. Why does this happen? Isn't the <div> a parent element and shouldn't my code extract the text?
Note - I wanted to use the div because I thought that'll help me get both the <p> tags in one query.
I can see two issues here.
The first is that if you separate the class names with a space, the CSS selector is read as looking for a descendant element named no-select. So the correct approach is "div.home-hero-blurb.no-select::text" instead of "div.home-hero-blurb no-select::text".
The second issue is that the text you want is inside a p element that is a child of that div. If you only select the div, the selector will return the text directly inside the div, but not the text in its children. Since there is also a strong element as a child of p, I would suggest using a general approach like:
response.css("div.home-hero-blurb.no-select *::text").getall()
This should return all text from the div's descendants.
It's relevant to point out that extracting text with ::text is an extension to standard CSS selectors. Scrapy mentions this here.
Edit
If you were to use XPath, this would be the equivalent expression:
response.xpath('//div[@class="home-hero-blurb no-select"]//text()').getall()
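The difference between a div's own text and the text of its descendants can be seen without Scrapy. The following is only a minimal sketch: it uses the standard library's xml.etree.ElementTree as a stand-in for Scrapy's selectors, and the markup is a hypothetical fragment that imitates the structure described in the question.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup that mimics the structure described in the question.
html = (
    '<div class="home-hero-blurb no-select">'
    "<p>First <strong>bold</strong> paragraph.</p>"
    "<p>Second paragraph.</p>"
    "</div>"
)
div = ET.fromstring(html)

# Text that sits directly inside the div, before its first child
# (comparable to what "div::text" targets). Here there is none.
own_text = div.text or ""

# Text of the div and of every descendant, in document order
# (roughly what "//text()" under the div returns).
all_text = list(div.itertext())

print(repr(own_text))
print(all_text)
```

The div itself contributes no text here, which is why selecting only the div gives an empty result, while walking the descendants picks up the text inside the p and strong elements.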
I know what it does, but can't understand HOW it does, if you know what I mean.
For example, the code below will pull out all links from the page OR it will timeout if it won't find any <a> tag on the page.
driver.get('https://selenium-python.readthedocs.io/waits.html')
links = WebDriverWait(driver, 30).until(EC.presence_of_all_elements_located((By.TAG_NAME, 'a')))
for link in links:
    print(link.get_attribute('href'))
driver.quit()
I'm wondering HOW Selenium knows for sure that presence_of_all_elements_located((By.TAG_NAME, 'a')) detected all <a> elements and the page won't dynamically load any more links?
BTW, pardon the following question, but can you also explain why we use double brackets here EC.presence_of_all_elements_located((By.TAG_NAME, 'a'))? Is that because presence_of_all_elements_located method accepts tuple as its parameter?
Selenium doesn't know the page won't dynamically load more links. When you use this presence_of_all_elements_located class (not a method!), then so long as there is 1 matching element on the page, it will return a list of all such elements.
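The behaviour just described can be sketched as a small polling loop. This is a simplified illustration and not Selenium's actual source: FakeDriver, presence_of_all and wait_until are hypothetical names, and the fake driver only pretends that links appear after a couple of polls.

```python
import time


class FakeDriver:
    """Hypothetical stand-in for a browser: links "appear" over time."""

    def __init__(self):
        self.calls = 0

    def find_elements(self, by, value):
        self.calls += 1
        # Nothing on the first two polls, then two links are present.
        return [] if self.calls < 3 else ["link-1", "link-2"]


def presence_of_all(locator):
    """Return a callable that reports the current matches (empty list if none)."""
    def check(driver):
        return driver.find_elements(*locator)
    return check


def wait_until(driver, condition, timeout=2.0, poll=0.01):
    """Call condition(driver) repeatedly until it returns something truthy."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition(driver)
        if result:
            return result
        if time.monotonic() > deadline:
            raise TimeoutError("condition not met in time")
        time.sleep(poll)


driver = FakeDriver()
links = wait_until(driver, presence_of_all(("tag name", "a")))
print(links, driver.calls)
```

The loop returns as soon as one poll yields a non-empty list, so anything the page adds after that moment is simply not part of the result.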
When you write EC.presence_of_all_elements_located((By.TAG_NAME, 'a')) you are instantiating this class with a single argument which is a tuple, as you say. This tuple is called a "locator".
"How this works" is kind of complicated and the only way to really understand is to read the source code. Selenium represents each element it finds as a WebElement object. These objects are created on demand and are only kept around if assigned to something. When you check for the presence of all elements matching your locator, Selenium asks the browser to find them, and the browser searches the current DOM. Waiting for the presence of something just repeats this lookup on a loop until it gets a non-empty result (which it returns as a list) or until the wait times out.
I would like to get exact text inside tag with selenium and python.
When I inspect the element, I can see the html below on the browser.
<div class="value ng-binding" ng-bind="currentEarning">£8.8</div> == $0
I have written the python code with selenium in order to get text as follows.
currentEaring = Ladbrokes.find_element_by_xpath('//div[@ng-bind="currentEarning"]').text
When I run this script several times, I occasionally get the result 0, which is wrong.
Rarely, I get £8.8, which is correct.
I guess I occasionally get 0 because of the == $0, but I'm not sure.
How can I get the text as £8.8? Using regex? If so, how?
This may be happening because it takes some time for the text to be populated after the page has loaded, and you are not waiting long enough.
You can use an explicit wait to wait until the element contains certain text.
For your case, the following example might work.
wait = WebDriverWait(driver, 30)
wait.until(EC.text_to_be_present_in_element((By.XPATH, "//div[@ng-bind='currentEarning']"), "£"))
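As for the regex part of the question: once the element's text has been read, the amount could be pulled out along these lines. This is only a sketch; parse_pounds is a hypothetical helper, and it assumes the value always looks like a pound sign followed by a plain decimal number.

```python
import re


def parse_pounds(text):
    """Return the first pound amount in text as a float, or None if absent."""
    match = re.search(r"£\s*(\d+(?:\.\d+)?)", text)
    return float(match.group(1)) if match else None


print(parse_pounds("£8.8"))
print(parse_pounds("0"))
```

Returning None when no pound sign is present makes it easy to tell a value that has not been populated yet apart from a real amount.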
Here is the Answer to your Question:
One reason for the improper results may be asynchronous rendering of the HTML DOM caused by JavaScript and AJAX calls. Because you rely on the ng-bind attribute alone, the intended node may not be the unique or first match in the HTML DOM. Hence we can refine the XPath to be more granular and unique by adding the class attribute alongside the ng-bind attribute, and use the get_attribute method to get the text £8.8 as follows:
currentEaring = Ladbrokes.find_element_by_xpath('//div[@class="value ng-binding" and @ng-bind="currentEarning"]').get_attribute("innerHTML")
Let me know if this Answers your Question.
I'm new to scrapy and XPath but have been programming in Python for some time. I would like to get the email, name of the person making the offer and phone number from the page https://www.germanystartupjobs.com/job/joblift-berlin-germany-3-working-student-offpage-seo-french-market/ using scrapy. As you see, the email and phone number are provided as text inside the <p> tag, which makes them hard to extract.
My idea is to first get the text inside the Job Overview, or at least all the text about this particular job, and then use regex to get the email, the phone number and, if possible, the name of the person.
So, I fired up the scrapy shell using the command: scrapy shell https://www.germanystartupjobs.com/job/joblift-berlin-germany-3-working-student-offpage-seo-french-market/ and get the response from there.
Now, I try to get all the text from the div job_description where I actually get nothing. I used
full_des = response.xpath('//div[@class="job_description"]/text()').extract()
It returns [u'\t\t\t\n\t\t ']
How do I get all the text from the page mentioned ? Obviously, the task will come afterwards to get the attributes mentioned before, but, first things first.
Update: This selection only returns [] response.xpath('//div[@class="job_description"]/div[@class="container"]/div[@class="row"]/text()').extract()
You were close with
full_des = response.xpath('//div[@class="job_description"]/text()').extract()
The div-tag actually does not have any text besides what you get.
<div class="job_description" (...)>
"This is the text you are getting"
<p>"This is the text you want"</p>
</div>
As you see, the text you are getting with response.xpath('//div[@class="job_description"]/text()').extract() is the text that sits directly inside the div-tag, not the text inside the tags nested within it. For that you would need:
response.xpath('//div[@class="job_description"]//*/text()').extract()
This selects every descendant element of div[@class="job_description"] and returns its text (see here for what the different xpaths do).
You will see that this returns a lot of useless text as well, as you are still getting all the \n and such. For this I suggest that you narrow your xpath down to the element that you want, instead of taking a broad approach.
For example the entire job description would be in
response.xpath('//div[@class="col-sm-5 justify-text"]//*/text()').extract()
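Whichever XPath is used, the extracted list usually still contains whitespace-only strings such as the [u'\t\t\t\n\t\t '] shown in the question. A minimal sketch of cleaning such a list and then applying the regex idea from the question follows; the raw list and the email address are made-up sample data, and the pattern is deliberately simple and not a full email validator.

```python
import re

# Hypothetical result of a //text() query, including whitespace-only nodes.
raw = [
    "\t\t\t\n\t\t ",
    "Working Student",
    "\n",
    " Contact: jane.doe@example.com ",
    "\t",
]

# Drop whitespace-only strings and trim the rest.
cleaned = [t.strip() for t in raw if t.strip()]

# Deliberately simple email pattern; real addresses can be more varied.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
emails = EMAIL_RE.findall(" ".join(cleaned))

print(cleaned)
print(emails)
```

Cleaning before matching keeps the regex from having to cope with stray tabs and newlines between text nodes.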