A newbie to Python here. I want to extract info from multiple websites (e.g. 100+) found via a Google search page. I just want to extract the key info, e.g. text in <h1>, <h2>, <b> or <li> HTML tags etc., but I don't want to extract entire paragraphs in <p> tags.
I know how to gather a list of website URLs from that Google search, and I know how to scrape an individual website after looking at its HTML. I use Requests and BeautifulSoup for these tasks.
However, I want to know how I can extract key info from all these (100+!) websites without having to look at their HTML one by one. Is there a way to automatically find out which HTML tags a website uses to emphasize key messages? E.g. some websites may use <h1>, while some may use <b>, or something else...
All I can think of is to come up with a list of possible "emphasis-type" HTML tags and then just use BeautifulSoup.find_all() to do a wide-scale extraction. But surely there must be an easier way?
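That "list of emphasis tags" idea can be sketched in a few lines; find_all() accepts a list of tag names, so one call covers all of them. This is a minimal sketch, assuming the HTML has already been fetched (the sample string stands in for a real page):

```python
from bs4 import BeautifulSoup

EMPHASIS_TAGS = ["h1", "h2", "b", "strong", "li"]

def extract_key_info(html):
    """Return the text of common emphasis tags found in one page's HTML."""
    soup = BeautifulSoup(html, "html.parser")
    # find_all accepts a list of tag names, so one call covers them all
    return [tag.get_text(strip=True) for tag in soup.find_all(EMPHASIS_TAGS)]

# Quick check on a tiny page (the real HTML would come from requests.get(url).text)
sample = "<h1>Title</h1><p>long paragraph...</p><b>key point</b>"
print(extract_key_info(sample))  # ['Title', 'key point']
```

Looping this over the 100+ URLs is then just a `for` loop; the tag list is a guess that may need tuning per site.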
It would seem that you must first learn how to write loops and functions. Every website is completely different, and scraping even a single website to extract useful information is daunting. I'm a newb myself, but if I had to extract info from headers like you do, this is what I would do (this is just concept code, but I hope you'll find it useful):
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

def getLinks(articleUrl):
    html = urlopen('http://en.web.com{}'.format(articleUrl))
    bs = BeautifulSoup(html, 'html.parser')
    # narrow the search to the page's main header, then pull matching links
    return bs.find('h1', {'class': 'header'}).find_all(
        'a', href=re.compile('^(/web/)((?!:).)*$'))
Currently I use Python with Selenium for scraping. There are many ways in Selenium to locate data, and I used to use CSS selectors.
But then I realised that only tag names are reliably present on every website.
For example, not every website uses classes or IDs. Take Wikipedia: it mostly uses plain tags,
like <h1> or <a>, without any class or id on them.
That is the limitation of scraping by tag name: it scrapes every element under that tag.
For example: if I want to scrape table contents which are under a <p> tag, then it scrapes the table contents as well as all the descriptions which are not needed.
My question is: is it possible to scrape only the required elements under a tag, without copying every other element under that tag?
For instance, if I want to scrape content from, say, Amazon, it should select only the product names in <h1> tags, not every other heading that happens to be an <h1>.
If you know any other method/locator to use besides tag names, you can tell me too, but the condition is that it must be present on every website, or at least most websites.
Any help would be appreciated 😊...
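For illustration, one common workaround is to anchor the tag search inside a known container instead of searching the whole page. The sketch below uses BeautifulSoup for brevity (the same CSS selector works with Selenium's find_elements(By.CSS_SELECTOR, ...)); the ids and sample markup are made-up stand-ins, not any real site's HTML:

```python
from bs4 import BeautifulSoup

html = """
<div id="main">
  <h1 id="productTitle">Example Product</h1>
</div>
<div id="ads"><h1>Sponsored heading</h1></div>
"""

soup = BeautifulSoup(html, "html.parser")
# Scope the search: only <h1> tags inside the container we care about,
# so the <h1> in the ads block is never selected
titles = [h.get_text(strip=True) for h in soup.select("div#main h1")]
print(titles)  # ['Example Product']
```

The catch is that the container id still has to be found per site, so this only softens the "tag names only" limitation rather than removing it.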
Hey, I am trying to implement a program that can get URLs from the HTML of a website, but I only want the URLs from the body. Basically, I want to avoid ads and menus on the website and only get links to the websites that are embedded in the actual article. Does anyone know of a good way of isolating the body HTML from the rest of the HTML without hardcoding how the body is designated for each website?
Scraping only specific parts of the HTML is a simple process. For the most part you can choose which elements from the page you want. Let's say you only want <div id="example">example</div>; you can tell your scraper to pick up only that div. Please check out this example:
https://realpython.com/beautiful-soup-web-scraper-python/
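A minimal sketch of that idea, assuming the article body sits in an <article> tag (many sites use it; the fallback id "example" matches the illustration above, not any real site):

```python
from bs4 import BeautifulSoup

html = """
<nav><a href="/menu">Menu</a></nav>
<article>
  <p>Story text with a <a href="https://example.com/source">source link</a>.</p>
</article>
"""

soup = BeautifulSoup(html, "html.parser")
# Restrict link extraction to the article container, skipping nav/menu links
body = soup.find("article") or soup.find("div", id="example")
links = [a["href"] for a in body.find_all("a", href=True)]
print(links)  # ['https://example.com/source']
```

Sites that don't use <article> need their own container lookup, which is exactly the per-site hardcoding the question hopes to avoid; there is no fully general answer.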
I am new to Scrapy, trying to extract Google News results from the link below:
https://www.google.co.in/search?q=cholera+news&safe=strict&source=lnms&tbm=nws&sa=X&ved=0ahUKEwik0KLV-JfYAhWLpY8KHVpaAL0Q_AUICigB&biw=1863&bih=966
The keyword "cholera" was provided, which shows small blocks of various news items associated with it. I then tried Scrapy to extract each block that contains an individual news item:
fetch("https://www.google.co.in/search?q=cholera+news&safe=strict&source=lnms&tbm=nws&sa=X&ved=0ahUKEwik0KLV-JfYAhWLpY8KHVpaAL0Q_AUICigB&biw=1863&bih=966")
response.css(".ts._JGs._KHs._oGs._KGs._jHs::text").extract()
where .ts._JGs._KHs._oGs._KGs._jHs::text represents the div class="ts _JGs _KHs _oGs _KGs _jHs" of each block of news.
But it returns None.
After struggling, I found a way to scrape the desired data with a very simple trick:
fetch("https://www.google.co.in/search?q=cholera+news&safe=strict&source=lnms&tbm=nws&sa=X&ved=0ahUKEwik0KLV-JfYAhWLpY8KHVpaAL0Q_AUICigB&biw=1863&bih=966")
The CSS selector for class="g" can then be used to extract the desired blocks like this:
response.css(".g").extract()
which returns a list of all the individual news blocks; each one can then be accessed by list index like this:
response.css(".g").extract()[0]
or
response.css(".g").extract()[1]
In the Scrapy shell, use view(response) and you will see in a web browser what you fetch()ed.
Google uses JavaScript to display data, but it can also send a page which doesn't use JavaScript. A page without JavaScript usually has different tags and classes, though.
You can also turn off JavaScript in your browser and then open Google to see those tags.
Try this:
response.css('#search td ::text').extract()
I'm using BeautifulSoup to try to pull either the top links or simply the top headlines from different topics on the CNN homepage. I seem to be missing something here and would appreciate some assistance. I have managed to come up with a few web scrapers before, but it's always through a lot of resistance and is quite the uphill battle.
What it looks like to me is that the links I need are ultimately stored somewhere like this:
<article class="cd cd--card cd--article cd--idx-1 cd--extra-small cd--has-siblings cd--media__image" data-vr-contentbox="/2015/10/02/travel/samantha-brown-travel-channel-feat/index.html" data-eq-pts="xsmall: 0, small: 300, medium: 460, large: 780, full16x9: 1100" data-eq-state="small">
I can grab that link after data-vr-contentbox and append it to the end of www.cnn.com and it brings me to the page I need. My problem is in actually grabbing that link. I've tried various forms to grab them. My current iteration is as follows:
import requests
from bs4 import BeautifulSoup

r = requests.get("http://www.cnn.com/")
data = r.text
soup = BeautifulSoup(data, "html.parser")
for link in soup.findAll("article"):
    test = link.get("data-vr-contentbox")
    print(test)
My issue here is that it only seems to grab a small number of things that I actually need. I'm only seeing two articles from politics, none from travel, etc. I would appreciate some assistance in resolving this issue. I'm looking to grab all of the links under each topic. Right now I'm just looking at politics or travel as a base to get started.
Particularly, I want to be able to specify the topic (tech, travel, politics, etc.) and grab those headlines. Whether I could simply grab the links and use those to get the headline from the respective page, or simply grab the headlines from here... I seem unable to do either. It would be nice to be able to view everything in a single topic at once, but finding out how to narrow this down isn't proving very simple.
An example article is "IOS 9's Wi-Fi Assist feature costly", which can be found within tags.
I want to be able to find ALL articles under, say, the Tech heading on the homepage and isolate those tags to grab the headline. The tags for this headline look like this:
<div class="strip-rec-link-title ob-tcolor">IOS 9's Wi-Fi Assist feature costly</div>
Yet I don't know how to do BOTH of these things. I can't even seem to grab the headline, despite it being within tags when I try this:
for link in soup.findAll("div"):
    print("")
    print(link)
I feel like I have a fundamental misunderstanding somewhere, although I've managed to do some scrapers before.
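For illustration only: once the right markup is actually in the parsed HTML, the headline text can be pulled by the class seen in the snippet above. The sample string mimics the question's snippet; these class names are CNN's from 2015 and may well have changed since:

```python
from bs4 import BeautifulSoup

html = '<div class="strip-rec-link-title ob-tcolor">IOS 9\'s Wi-Fi Assist feature costly</div>'
soup = BeautifulSoup(html, "html.parser")
# class_ matches any one of a tag's multiple classes
headlines = [d.get_text() for d in soup.find_all("div", class_="strip-rec-link-title")]
print(headlines)
```

If this finds nothing against the live page, the markup likely isn't in the HTML the server sends, which is what the answer below diagnoses.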
My guess is that the cnn.com website has a bunch of JavaScript which renders a lot of the content, and BeautifulSoup never sees it. I opened cnn.com and looked at the source in Safari and there were 197 instances of data-vr-contentbox. However, when I ran it through BeautifulSoup and dumped it out, there were only 13 instances of data-vr-contentbox.
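You can run that comparison yourself: count how often the attribute appears in the raw HTML string versus how many tags the parser actually finds. Shown on a tiny local string to keep it self-contained; against cnn.com you would fetch the page with requests first and compare against what the browser shows:

```python
from bs4 import BeautifulSoup

raw_html = """
<article data-vr-contentbox="/a"></article>
<article data-vr-contentbox="/b"></article>
<script>/* JS would inject more articles here in a real browser */</script>
"""

# Count occurrences in the raw text the server sent...
raw_count = raw_html.count("data-vr-contentbox")
# ...versus tags the parser sees carrying that attribute
parsed_count = len(BeautifulSoup(raw_html, "html.parser")
                   .find_all(attrs={"data-vr-contentbox": True}))
print(raw_count, parsed_count)  # 2 2
```

A large gap between the browser's count and these counts points at JavaScript rendering, since requests/BeautifulSoup never execute scripts.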
There are a bunch of posts out there about handling it. You can start with the method used in this question: Scraping Javascript driven web pages with PyQt4 - how to access pages that need authentication?
I have a folder full of HTML documents that are saved copies of webpages, but I need to know what site they came from. What function can I use to extract the website name from the documents? I did not find anything in the BeautifulSoup module. Is there something specific I should be looking for in the document? I do not need the full URL; I just need to know the name of the website.
You can only do that if the URL is mentioned somewhere in the source...
First find out where the URL is mentioned, if it is at all. If it's there, it'll probably be in the <base> tag. Sometimes websites have nice headers with a link to their landing page, which could be used if all you want is the domain. Or it could be in a comment somewhere, depending on how you saved the page.
If the URL is mentioned in a similar way in all the pages, then your job is easy: use re, BeautifulSoup, or lxml with XPath to grab the info you need. There are other tools available, but any of those will do.
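A hedged sketch of those checks: look for a <base> tag, then a canonical <link>, then an og:url meta tag, and pull the hostname out of whichever turns up first. Which of these a saved page actually contains varies by site and by how it was saved, so treat this as a starting point, not a guarantee:

```python
from urllib.parse import urlparse
from bs4 import BeautifulSoup

def guess_site(html):
    """Try common places a page records its own URL; return the hostname or None."""
    soup = BeautifulSoup(html, "html.parser")
    candidates = []
    if soup.base and soup.base.get("href"):
        candidates.append(soup.base["href"])
    canonical = soup.find("link", rel="canonical")
    if canonical and canonical.get("href"):
        candidates.append(canonical["href"])
    og = soup.find("meta", property="og:url")
    if og and og.get("content"):
        candidates.append(og["content"])
    for url in candidates:
        host = urlparse(url).netloc
        if host:
            return host
    return None

sample = '<head><link rel="canonical" href="https://www.example.com/page"></head>'
print(guess_site(sample))  # www.example.com
```

Pages saved without any of these hints would still need the re-over-comments approach mentioned above.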