How to extract google news with specific key word using scrapy? - python

I am new to scrapy, trying to extract google news from the the given link bellow:
https://www.google.co.in/search?q=cholera+news&safe=strict&source=lnms&tbm=nws&sa=X&ved=0ahUKEwik0KLV-JfYAhWLpY8KHVpaAL0Q_AUICigB&biw=1863&bih=966
"cholera" key word was provided that shows small blocks of various news associated with cholera key world further I try this with scrapy to extract the each block that contents individual news.
fetch("https://www.google.co.in/search?q=cholera+news&safe=strict&source=lnms&tbm=nws&sa=X&ved=0ahUKEwik0KLV-JfYAhWLpY8KHVpaAL0Q_AUICigB&biw=1863&bih=966")
response.css(".ts._JGs._KHs._oGs._KGs._jHs::text").extract()
where .ts._JGs._KHs._oGs._KGs._jHs::text represent the div class="ts _JGs _KHs _oGs _KGs _jHs for each block of news.
but it return None.

After struggling I find out a way to scrap desired data with very simple trick,
fetch("https://www.google.co.in/search?q=cholera+news&safe=strict&source=lnms&tbm=nws&sa=X&ved=0ahUKEwik0KLV-JfYAhWLpY8KHVpaAL0Q_AUICigB&biw=1863&bih=966")
and css selector "class="g" tag can be used to extract desired block like this
response.css(".g").extract()
which return list of all the individual news blocks which can be further used on the basis of list index like this:
response.css(".g").extract()[0]
or
response.css(".g").extract()[1]

In scrapy shell uses view(response) and you will see in web browser what you fetch().
Google uses JavaScript to display data, but it can also send page which doesn't use JavaScript. But page without JavaScript usually has different tags and classes.
You can also turn off JavaScript in your browse and then open Google to see tags.
Try this:
response.css('#search td ::text').extract()

Related

check is a particular tab or subsection is present

Is it possible to use web crawling with scrapy and a base url to check if a website has a particular section, sub section or tab or not? For example, here
https://www.christiani.de/
on of the tabs is Service. This tab further contains sections including Kataloge anfordern. I want to search the whole website if there is a Kataloge section anywhere such that the section name can also include other words for example anforden. Can I achieve this using scrapy? The tutorials that I have seen work with css selectors but those might be different for all websites. What else can I try?
So far I can see the text "Kataloge" is being used in <a> and <span> tags. Based on this data you can use the following xpath to fetch the instances of word "Kataloge" used and then print the text part.
no_of_instances=driver.find_elements_by_xpath("//a[contains(text(),'Kataloge')] | //span[contains(text(),'Kataloge')]")
for i in no_of_instances:
print(i.text)
Output must be words: Kataloge, Kataloge anfordern or Kataloge {{any_random_text}}

Python web scraping: websites from google search result

A newbie to Python here. I want to extract info from multiple websites (e.g. 100+) from a google search page. I just want to extract the key info, e.g. those with <h1>, <h2> or <b> or <li> HTML tags etc. But I don't want to extract the entire paragraph <p>.
I know how to gather a list of website URLs from that google search; and I know how to web scrape individual website after looking at the page's HTML. I use the Request and BeautifulSoup for these tasks.
However, I want to know how can I extract key info from all these (100+ !) websites without having to look at their html one by one. Is there a way to automatically find out the HTML tags the website used to emphasize key messages? e.g. some websites may use <h1>, while some may use <b> , or something else...
All I can think of is to come up with a list of possible "emphasis-typed" HTML tags and then just use BeautifulSoup.find_all() to do a wide-scale extraction. But surely there must be an easier way?
It would seem that you must first learn how to do loops and function first. Every website is completely different and scraping a website alone to extract useful information is daunting. I'm a newb myself, but if I have to extract info from headers like you, this is what I would do: (this is just concept code, but hope you'll find it useful)
def getLinks(articleUrl):
html = urlopen('http://en.web.com{}'.format(articleUrl))
bs = BeautifulSoup(html, 'html.parser')
return bs.find('h1', {'class':'header'}).find_all('h1',
header=re.compile('^(/web/)((?!:).)*$'))

Are there any selenium locators present which can scrape any content of a webpage?

Currently i use Python with selenium for scraping purpose. There are many ways in selenium to scrape data. And I used to use css selectors.
But then I realised that,
Only tagNames are those things which always are on websites.
For example,
Not every website uses classes or Id's like, take an example of Wikipedia. They use normally just tags in it.
like <h1>, <a> without having any classes or id in it.
There comes the limitation for scraping USING tagNames, as they scrape every element under their tags.
For example : if I want to scrape table contents which are under <p> tag, then it scrapes the table contents as well as all the descriptions which are not needed.
My question is: is it possible to scrape the required elements under the tags which do not copy every other elements under their tags?
Like if I want to scrape content from, say Amazon then it will select only product names under h1 tags, not scraping all the headings under the h1 tag which are not product names.
If you find any other method/locator to use, even except the tagName also then also you can tell me. But the condition is that it must be present on every website/ most of the websites
Any help would be appreciated 😊...

Python webscraping how to get only the body html

Hey I am trying to implement a program that can get urls from the html of a website, but I only want the urls from the body. Basically, I want to avoid ads and menus on the website and only get links to the websites that are embedded in the actual article. Does anyone know of a good way of isolating the body html from the rest of the html without hardcoding how the body is designated for each website?
It is a simple process to scrape only specific parts of the html. For the most part you can choose elements from the page you want. Let's say you only want the <div id="example">example</div> you can specify your scraper to only pick up that div. Please check this example out.
https://realpython.com/beautiful-soup-web-scraper-python/

How to scrape value from page that loads dynamicaly?

The homepage of the website I'm trying to scrape displays four tabs, one of which reads "[Number] Available Jobs". I'm interested in scraping the [Number] value. When I inspect the page in Chrome, I can see the value enclosed within a <span> tag.
However, there is nothing enclosed in that <span> tag when I view the page source directly. I was planning on using the Python requests module to make an HTTP GET request and then use regex to capture the value from the returned content. This is obviously not possible if the content doesn't contain the number I need.
My questions are:
What is happening here? How can a value be dynamically loaded into a
page, displayed, and then not appear within the HTML source?
If the value doesn't appear in the page source, what can I do to
reach it?
If the content doesn't appear in the page source then it is probably generated using javascript. For example the site might have a REST API that lists jobs, and the Javascript code could request the jobs from the API and use it to create the node in the DOM and attach it to the available jobs. That's just one possibility.
One way to scrap this information is to figure out how that javascript works and make your python scraper do the same thing (for example, if there is a simple REST API it is using, you just need to make a request to that same URL). Often that is not so easy, so another alternative is to do your scraping using a javascript capable browser like selenium.
One final thing I want to mention is that regular expressions are a fragile way to parse HTML, you should generally prefer to use a library like BeautifulSoup.
1.A value can be loaded dynamically with ajax, ajax loads asynchronously that means that the rest of the site does not wait for ajax to be rendered, that's why when you get the DOM the elements loaded with ajax does not appear in it.
2.For scraping dynamic content you should use selenium, here a tutorial
for data that load dynamically you should look for an xhr request in the networks and if you can make that data productive for you than voila!!
you can you phantom js, it's a headless browser and it captures the html of that page with the dynamically loaded content.

Categories