Scrapy: extract text from span without class or id - python

I have the following html structure:
I would like to extract the text ("“Business-Thinking”-Fokus im Master-Kurs") from the highlighted span using Scrapy, but I have trouble reaching it, as it does not have any specific class or id.
I tried to access it with the following absolute xPath:
sel.xpath('/html/body/div[4]/div[1]/div/div/h1/span/text()').extract()
I don't get any error, but it returns an empty result, meaning the text is not extracted.
Note: The parent classes are not unique, that's why I'm not using a relative path. As the text varies, I also cannot reach the span by looking for the text it contains.
Do you have any suggestion on how I should modify my xPath to extract the text? Thanks!

If you load the page with scrapy shell url, it is fetched without JavaScript.
When you look at the source without JavaScript, the XPath to the span is /html/body/div/div[1]/div/div/h1/span (note the first div is no longer div[4], because the rendered DOM differs from the raw source).
To load pages that need JavaScript in Scrapy, use Splash.

Related

Saving HTML Element Tree including CSS properties using Selenium

I'm using Python with Selenium.
I am attempting some web scraping. I have a WebElement (which contains child elements) that I would like to save to an offline file. So far, I have managed to get the raw HTML for my WebElement using WebElement.get_attribute('innerHTML'). This works, but no CSS is present in the final product, because the website uses a stylesheet. So I'd like to get these CSS properties converted to inline styles.
I found this stackoverflow solution which shows how to get the CSS properties of an element. However, getting this data, then parsing the HTML as a string to add these properties inside the HTML tag's style attribute, then doing this for all the child elements, feels like it'd be a significant undertaking.
So, I was wondering whether there was a more straightforward way of doing this.
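One relatively straightforward route is to let the browser do the work: inject a script with execute_script that walks the subtree, copies each element's computed style into an inline style attribute, and returns the resulting outerHTML. A sketch, assuming a live Selenium driver and a WebElement elem (both names are placeholders):

```python
# JavaScript that inlines computed styles for an element and its descendants
INLINE_CSS_JS = """
const root = arguments[0];
for (const el of [root, ...root.querySelectorAll('*')]) {
    const computed = window.getComputedStyle(el);
    let style = '';
    for (const prop of computed) {
        style += prop + ':' + computed.getPropertyValue(prop) + ';';
    }
    el.setAttribute('style', style);
}
return root.outerHTML;
"""

def save_with_inline_css(driver, elem, path):
    """Serialize `elem` with computed styles folded into style attributes."""
    html = driver.execute_script(INLINE_CSS_JS, elem)
    with open(path, "w", encoding="utf-8") as f:
        f.write(html)
```

The result won't be pixel-perfect (pseudo-elements and media queries are lost), but it avoids hand-parsing the HTML string and handles all child elements in one pass.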

Check if a particular tab or subsection is present

Is it possible, using web crawling with Scrapy and a base URL, to check whether a website has a particular section, subsection, or tab? For example, here
https://www.christiani.de/
one of the tabs is Service. This tab further contains sections, including Kataloge anfordern. I want to search the whole website for any Kataloge section, where the section name may also include other words, for example anfordern. Can I achieve this using Scrapy? The tutorials I have seen work with CSS selectors, but those might differ across websites. What else can I try?
So far I can see the text "Kataloge" is used in <a> and <span> tags. Based on this, you can use the following XPath (here with Selenium) to fetch every instance of the word "Kataloge" and print its text:
no_of_instances = driver.find_elements_by_xpath("//a[contains(text(),'Kataloge')] | //span[contains(text(),'Kataloge')]")
for i in no_of_instances:
    print(i.text)
The output will be text such as Kataloge, Kataloge anfordern, or Kataloge {{any_random_text}}. (In Selenium 4, use driver.find_elements(By.XPATH, ...) instead of the deprecated find_elements_by_xpath.)

Are there any selenium locators present which can scrape any content of a webpage?

Currently I use Python with Selenium for scraping. Selenium offers many ways to locate elements, and I used to use CSS selectors.
But then I realised that only tag names are reliably present on every website.
For example, not every website uses classes or IDs; take Wikipedia, which mostly uses bare tags
like <h1> or <a> without any class or id on them.
That is the limitation of scraping by tag name: it matches every element with that tag.
For example: if I want to scrape table contents that appear under a <p> tag, it scrapes the table contents as well as all the surrounding descriptions, which are not needed.
My question is: is it possible to scrape only the required elements under a tag, without copying every other element under that tag?
For instance, if I scrape, say, Amazon, I want to select only the product names in <h1> tags, not every other <h1> heading that is not a product name.
If you know any other method/locator to use, even beyond tag names, please tell me. The condition is that it must be present on every website, or most websites.
Any help would be appreciated 😊...

Selenium Python: How to get CSS without targeting a specific class/id/tag

I'm working on a scraper project, and one of the goals is to get every image link from the HTML & CSS of a website. I was using BeautifulSoup & TinyCSS for that, but now I'd like to switch everything to Selenium, since it can load the JS.
I can't find in the docs a way to target CSS properties without knowing the tag/id/class. I can get the images from the HTML easily, but I need to target every "background-image" property in the CSS in order to get the URL from it.
ex: background-image: url("paper.gif");
Is there a way to do it or should I loop into each element and check the corresponding CSS (which would be time-consuming)?
You can grab all the <style> tags and parse them, searching for what you need.
You can also download the CSS files, using their resource URLs, and parse those.
You can also create an XPath/CSS rule to search for nodes whose inline style contains the property you are looking for.
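For the background-image case specifically, once the stylesheet text is in hand (e.g. from the <style> tags via get_attribute('textContent'), or by downloading each <link rel="stylesheet"> href), a regex can pull out the url(...) values. A sketch on an invented stylesheet string:

```python
import re

# Pull url(...) values out of background-image declarations;
# the quotes around the URL are optional in CSS.
BG_URL = re.compile(r'background-image\s*:\s*url\(\s*["\']?([^"\')]+)["\']?\s*\)')

css = 'body { background-image: url("paper.gif"); } .hero { background-image: url(hero.png); }'
print(BG_URL.findall(css))
# → ['paper.gif', 'hero.png']
```

Relative URLs then need to be resolved against the stylesheet's own URL (urllib.parse.urljoin) before downloading the images.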

How to extract Google News with a specific keyword using Scrapy?

I am new to Scrapy and am trying to extract Google News results from the link below:
https://www.google.co.in/search?q=cholera+news&safe=strict&source=lnms&tbm=nws&sa=X&ved=0ahUKEwik0KLV-JfYAhWLpY8KHVpaAL0Q_AUICigB&biw=1863&bih=966
"cholera" key word was provided that shows small blocks of various news associated with cholera key world further I try this with scrapy to extract the each block that contents individual news.
fetch("https://www.google.co.in/search?q=cholera+news&safe=strict&source=lnms&tbm=nws&sa=X&ved=0ahUKEwik0KLV-JfYAhWLpY8KHVpaAL0Q_AUICigB&biw=1863&bih=966")
response.css(".ts._JGs._KHs._oGs._KGs._jHs::text").extract()
where .ts._JGs._KHs._oGs._KGs._jHs::text corresponds to the div class="ts _JGs _KHs _oGs _KGs _jHs" of each news block.
But it returns nothing.
After struggling, I found a way to scrape the desired data with a very simple trick:
fetch("https://www.google.co.in/search?q=cholera+news&safe=strict&source=lnms&tbm=nws&sa=X&ved=0ahUKEwik0KLV-JfYAhWLpY8KHVpaAL0Q_AUICigB&biw=1863&bih=966")
and the CSS class g can be used to extract each desired block like this:
response.css(".g").extract()
which returns a list of all the individual news blocks, which can be further accessed by list index like this:
response.css(".g").extract()[0]
or
response.css(".g").extract()[1]
In scrapy shell, use view(response) and you will see in a web browser what you fetch()ed.
Google uses JavaScript to display data, but it can also send a page which doesn't use JavaScript. A page without JavaScript usually has different tags and classes.
You can also turn off JavaScript in your browser and then open Google to see those tags.
Try this:
response.css('#search td ::text').extract()
