How to get the job description using scrapy? - python

I'm new to scrapy and XPath but have been programming in Python for some time. I would like to get the email, the name of the person making the offer, and the phone number from the page https://www.germanystartupjobs.com/job/joblift-berlin-germany-3-working-student-offpage-seo-french-market/ using scrapy. As you can see, the email and phone number are provided as plain text inside the <p> tag, which makes them hard to extract.
My idea is to first get the text inside the Job Overview, or at least all the text describing this job, and then use regex to get the email, the phone number and, if possible, the name of the person.
So, I fired up the scrapy shell using the command: scrapy shell https://www.germanystartupjobs.com/job/joblift-berlin-germany-3-working-student-offpage-seo-french-market/ and get the response from there.
Now, I try to get all the text from the div job_description, but I actually get nothing. I used
full_des = response.xpath('//div[@class="job_description"]/text()').extract()
It returns [u'\t\t\t\n\t\t ']
How do I get all the text from the page mentioned? Obviously, the task of getting the attributes mentioned before will come afterwards, but first things first.
Update: This selection only returns [] response.xpath('//div[@class="job_description"]/div[@class="container"]/div[@class="row"]/text()').extract()

You were close with
full_des = response.xpath('//div[@class="job_description"]/text()').extract()
The div-tag actually does not have any text besides what you get.
<div class="job_description" (...)>
"This is the text you are getting"
<p>"This is the text you want"</p>
</div>
As you see, the text you are getting with response.xpath('//div[@class="job_description"]/text()').extract() is the text directly inside the div-tag, not the text inside the tags nested in it. For this you would need:
response.xpath('//div[@class="job_description"]//*/text()').extract()
What this does is select all descendant elements of div[@class="job_description"] and return their text.
You will see that this returns a lot of useless text as well, as you are still getting all the \n and such. I suggest that you narrow your xpath down to the element that you want, instead of taking a broad approach.
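The difference between an element's own text and the text of its descendants can be checked offline. This is only a minimal sketch: it assumes the standard library's xml.etree.ElementTree as a stand-in for Scrapy's selectors, and the HTML snippet is made up, not taken from the real page. Here itertext() plays roughly the role of //text(), and whitespace-only fragments are filtered out.

```python
import xml.etree.ElementTree as ET

# Hypothetical snippet shaped like the job_description div; not the real page.
html = (
    '<div class="job_description">\n\t\t'
    "<p>Email: jane@example.com</p>\n\t"
    "<p>Phone: 123</p>\n"
    "</div>"
)
div = ET.fromstring(html)

# Text directly inside the div, before its first child: only whitespace.
direct_text = div.text

# Text of the div and all of its descendants, whitespace-only pieces dropped.
cleaned = [t.strip() for t in div.itertext() if t.strip()]

print(repr(direct_text))
print(cleaned)
```

The same filtering (strip each fragment and drop the empty ones) can be applied to the list that .extract() returns.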
For example the entire job description would be in
response.xpath('//div[@class="col-sm-5 justify-text"]//*/text()').extract()
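Once the text fragments are extracted, the regex step mentioned in the question could look roughly like this. It is a sketch under assumptions: the fragments list is a hypothetical stand-in for what .extract() returns, and both patterns are deliberately loose, so they will need tuning for real pages.

```python
import re

# Hypothetical stand-in for the list returned by .extract(); not real page content.
fragments = [
    "\n\t\t",
    "Contact: Jane Doe",
    "\n",
    "Email: jane.doe@example.com",
    "Phone: +49 30 1234567",
]

# Drop whitespace-only fragments and join the rest into one string.
text = " ".join(f.strip() for f in fragments if f.strip())

# Loose patterns (assumptions): real emails and phone numbers vary a lot.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE_RE = re.compile(r"\+?\d[\d\s/-]{6,}\d")

emails = EMAIL_RE.findall(text)
phones = PHONE_RE.findall(text)

print(emails)
print(phones)
```

Extracting the person's name is harder to do with a pattern alone, since it depends on how each posting phrases the contact line.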

Related

Scraping words within a string from websites

I'm pretty new to scrapy and Python. I'm making a web scraper attempting to scrape business owners' names from the HTML text of their websites. My issue is that I can't exactly use an xpath or css response to grab the text from the website code, because I'm scraping hundreds of different websites with different coding, classes, pages, etc. Here's what I have so far:
import re
html_text = str(response.text)
owner_name = re.findall("owner", html_text)
if owner_name:
    print("OWNER FOUND @ " + str(response.url))
All this really does, obviously, is let me know if the program has found a page mentioning the owner. I'm not really sure how to go about scraping their name from within the html code. I assume their name would immediately follow wherever owner was mentioned in the HTML, so I'm essentially trying to scrape the next word or two after the word owner.
Maybe you want to use the string's find method to get the position of the substring owner and then slice the string.
>>> string = "a lot of filler text and the owner is john doe"
>>> i = string.find("owner")
>>> print(string[i:i+30])
owner is john doe
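If only the word or two after "owner" is wanted, a regex with a capture group is one alternative to slicing. This is a sketch, not a robust name extractor: the pattern assumes the name follows "owner", optionally after "is" or a colon, which will not hold on every site.

```python
import re

text = "a lot of filler text and the owner is john doe"

# Capture up to two words after "owner", skipping an optional "is" or ":".
pattern = re.compile(r"owner(?:\s+is|:)?\s+(\w+(?:\s+\w+)?)", re.IGNORECASE)

match = pattern.search(text)
name = match.group(1) if match else None
print(name)
```

When nothing matches, search() returns None, so the guard avoids an AttributeError on pages that never mention an owner.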

How to access text element in selenium if it is split by body tags

I have a problem while trying to access some values on the website during the process of web scraping the data. The problem is that the text I want to extract is in the class which contains several texts separated by tags (these body tags also have texts which are also important for me).
So firstly, I tried to look for the tag with the text I needed ('Category' in this case) and then extract the exact category from the text below this body tag assignment. I could use precise XPath but here it is not the case because other pages I need to web scrape contain a different amount of rows in this sidebar so the locations, as well as XPaths, are different.
The expected output is 'utility' - the category in the sidebar.
The website and the text I need to extract look like this (see the sidebar on the right containing 'Category'):
The element looks like that:
And the code I tried:
from selenium import webdriver
driver = webdriver.Safari()
driver.get('https://www.statsforsharks.com/entry/MC_Squares')
element = driver.find_elements_by_xpath("//b[contains(text(), 'Category')]/following-sibling")
for value in element:
    print(value.text)
driver.close()
the link to the page with the data is https://www.statsforsharks.com/entry/MC_Squares.
Thank you!
You might be better off using regex here, as the whole text comes under the 'company-sidebar-body' class, where only some of the text is between b tags and some is not.
So, you can get the text of the class first:
sidebartext = driver.find_element_by_class_name("company-sidebar-body").text
That will give you the following:
"EOY Proj Sales: $1,000,000\r\nSales Prev Year: $200,000\r\nCategory: Utility\r\nAsking Deal\r\nEquity: 10%\r\nAmount: $300,000\r\nValue: $3,000,000\r\nEquity Deal\r\nSharks: Kevin O'Leary\r\nEquity: 25%\r\nAmount: $300,000\r\nValue: $1,200,000\r\nBite: -$1,800,000"
You can then use regex to target the category:
import re
# Raw string so that \s and \w reach the regex engine unchanged.
c = re.search(r"Category:\s\w+", sidebartext).group()
print(c)
c will result in 'Category: Utility' which you can then work with. This will also work if the value of the category ('Utility') is different on other pages.
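To get just the value ('utility', as the question expects) rather than the whole 'Category: Utility' string, a capture group can be added. A small sketch, using a shortened copy of the sidebar text shown above as the assumed input:

```python
import re

# Shortened copy of the sidebar text from the answer above.
sidebartext = (
    "EOY Proj Sales: $1,000,000\r\nSales Prev Year: $200,000\r\n"
    "Category: Utility\r\nAsking Deal\r\nEquity: 10%"
)

# The parentheses capture only the word after "Category:".
match = re.search(r"Category:\s*(\w+)", sidebartext)
category = match.group(1).lower() if match else None
print(category)
```

Note that \w+ stops at the first space, so a multi-word category would be cut short; capturing up to the end of the line would be one way around that.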
There are easier ways when it's a MediaWiki website. You could, for instance, access the page data through the API with a JSON request and parse it with a much more limited DOM.
Any particular reason you want to scrape my website?

Scrapy: How do I get text and text with <b> tag at the same time when using scrapy and xpath?

I need to get 183.7 from the html below
<span class="price"><b>183</b>.7</span>
but if I run the code below in the scrapy shell, only '.7' is returned
response.xpath('//span[@class="price"]/text()').get()
How should I write the code to get the complete number?
I have read the Scrapy tutorial at http://doc.scrapy.org/en/1.7/topics/selectors.html#topics-selectors
but it is still hard for me to understand the right xpath setting to get values I need.
If I try
response.xpath('//span[@class="price"]').get()
it returns
['<span class="price"><b>183</b>.7 </span>']
which is also not what exactly I need.
You can use "//" to get all the descendant text nodes of the element, like this:
"".join(response.xpath('//span[@class="price"]//text()').extract())

In Angular website, get exact text inside <div> tag with Selenium & Python?

I would like to get the exact text inside a <div> tag with selenium and python.
When I inspect the element, I can see the html below on the browser.
<div class="value ng-binding" ng-bind="currentEarning">£8.8</div> == $0
I have written the python code with selenium in order to get text as follows.
currentEaring = Ladbrokes.find_element_by_xpath('//div[@ng-bind="currentEarning"]').text
When I run this script several times, I occasionally get the result as 0, which is wrong.
Only rarely do I get £8.8, which is correct.
I guess I occasionally get 0 because of the == $0 but not sure.
How can I get the text as £8.8? - using regex? If then, how?
This may be happening because it takes some time for the text to be populated after the page has loaded, and it seems you are not waiting long enough.
You can use an explicit wait to wait until the element contains certain text.
For your case, the following example might work.
wait = WebDriverWait(driver, 30)
wait.until(EC.text_to_be_present_in_element((By.XPATH, "//div[@ng-bind='currentEarning']"), "£"))
Here is the Answer to your Question:
One reason for the improper results may be asynchronous rendering of the HTML DOM caused by JavaScript and AJAX calls. Also, since you rely on the ng-bind attribute alone, the intended node may not be the unique or first match in the HTML DOM. Hence we will refine the xpath to be more granular and unique by adding the class attribute along with the ng-bind attribute, and use the get_attribute method to get the text £8.8 as follows:
currentEaring = Ladbrokes.find_element_by_xpath('//div[@class="value ng-binding" and @ng-bind="currentEarning"]').get_attribute("innerHTML")
Let me know if this Answers your Question.

How to Collect the line with Selenium Python

I want to know how I can collect the line or mailto link that contains an email address using Selenium with Python. The emails on the contact page contain the @ sign. I tried the following XPath, but it works on some pages and not on others.
//*[contains(text(),"@")]
The email formats differ: sometimes it is <p>Email: name@domain.com</p>, sometimes <span>Email: name@domain.com</span>, or just name@domain.com.
Is there any way to collect them with one statement?
Thanks
Here is the XPath you are looking for, my friend.
//*[contains(text(),"@")]|//*[contains(@href,"@")]
You could create a collection of the link text values that contain @ on the page and then iterate through it to format them. You are going to have to format the span that has Email: name@domain.com anyway.
Use find_elements_by_partial_link_text to make the collection.
I think you need two XPath expressions: the first to find elements that contain the text "Email:", the second to find elements whose href attribute contains "mailto:".
//*[contains(text(),"Email:")]|//*[contains(@href,"mailto:")]
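Whichever XPath is used, the matched strings still have to be cleaned up, since some carry an "Email:" prefix and others a "mailto:" scheme. A minimal sketch of that formatting step, assuming the element texts and href values have already been gathered into a plain list (the collected values below are made up, and the pattern is intentionally simple):

```python
import re

# Hypothetical strings gathered from element text and href attributes.
collected = [
    "Email: name@domain.com",
    "mailto:info@domain.com?subject=Hi",
    "name@domain.com",
    "Follow us @ the fair",  # contains "@" but no address
]

# Simple address pattern (an assumption; it does not cover every valid email).
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")

# A set removes duplicates; sorting gives a stable order.
emails = sorted({m for s in collected for m in EMAIL_RE.findall(s)})
print(emails)
```

Filtering this way also discards elements that merely contain an "@" character without an actual address.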
