Python webscraping how to get only the body html - python

Hey, I am trying to write a program that extracts URLs from the HTML of a website, but I only want the URLs from the body. Basically, I want to avoid ads and menus and only get the links that are embedded in the actual article. Does anyone know a good way to isolate the body HTML from the rest of the HTML without hardcoding how the body is marked up on each website?

Scraping only specific parts of the HTML is straightforward: most scraping libraries let you select the elements you want from the page. Say you only want <div id="example">example</div>; you can tell your scraper to pick up just that div. This tutorial walks through an example:
https://realpython.com/beautiful-soup-web-scraper-python/
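Here is a rough sketch of that idea: collect only the links that sit inside one chosen element and ignore menus and footers around it. To keep it runnable without installing anything, it uses the standard library's html.parser instead of BeautifulSoup; the class name LinksInsideElement, the id "example", and the sample HTML are all made up for illustration. With BeautifulSoup the same selection would be roughly soup.find(id="example").find_all("a", href=True).

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not change the nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class LinksInsideElement(HTMLParser):
    """Collect href values of <a> tags nested inside the element with a given id."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.depth = 0      # 0 means we are outside the target element
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self.depth == 0:
            # Still outside: only react to the element we are looking for.
            if attrs.get("id") == self.target_id and tag not in VOID_TAGS:
                self.depth = 1
            return
        if tag not in VOID_TAGS:
            self.depth += 1
        if tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])

    def handle_endtag(self, tag):
        if self.depth > 0 and tag not in VOID_TAGS:
            self.depth -= 1


# Hypothetical page: one menu link, two article links, one footer link.
sample = """
<html><body>
  <nav><a href="/home">Home</a></nav>
  <div id="example">
    <p>Read <a href="https://example.com/a">this</a><br>
    and <a href="https://example.com/b">that</a>.</p>
  </div>
  <footer><a href="/about">About</a></footer>
</body></html>
"""

parser = LinksInsideElement("example")
parser.feed(sample)
parser.close()
print(parser.links)
```

This still requires knowing which element holds the article on a given site, so it does not remove the per-site hardcoding the question asks about; it only shows how to restrict the scrape once you have picked the element.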

Related

Python web scraping: websites from google search result

A newbie to Python here. I want to extract info from multiple websites (e.g. 100+) from a google search page. I just want to extract the key info, e.g. those with <h1>, <h2> or <b> or <li> HTML tags etc. But I don't want to extract the entire paragraph <p>.
I know how to gather a list of website URLs from that Google search, and I know how to scrape an individual website after looking at the page's HTML. I use Requests and BeautifulSoup for these tasks.
However, I want to know how I can extract key info from all these (100+!) websites without having to look at their HTML one by one. Is there a way to automatically find out which HTML tags a website uses to emphasize key messages? E.g. some websites may use <h1>, while others may use <b> or something else...
All I can think of is to come up with a list of possible "emphasis-typed" HTML tags and then just use BeautifulSoup.find_all() to do a wide-scale extraction. But surely there must be an easier way?
It sounds like you first need to get comfortable with loops and functions. Every website is structured differently, and scraping even a single website for useful information can be daunting. I'm a newbie myself, but if I had to extract info from headers like you, this is what I would do (this is just concept code, but I hope you find it useful):
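import re
from urllib.request import urlopen
from bs4 import BeautifulSoup
def getLinks(articleUrl):
    # Placeholders: the en.web.com host, the 'header' class, and the /web/ prefix.
    html = urlopen('http://en.web.com{}'.format(articleUrl))
    bs = BeautifulSoup(html, 'html.parser')
    # Inside the <h1 class="header"> element, keep links whose href starts with /web/ and has no colon.
    return bs.find('h1', {'class': 'header'}).find_all(
        'a', href=re.compile('^(/web/)((?!:).)*$'))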
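The "wide net" idea from the question (keep a list of emphasis tags and pull the text out of all of them on every page) can be sketched as one function plus a loop. This is only an illustration: the EMPHASIS_TAGS list and the two sample pages are assumptions, the standard library's html.parser stands in for BeautifulSoup.find_all() so the snippet runs without extra packages, and it assumes the emphasis tags are properly closed.

```python
from html.parser import HTMLParser

# Assumed list of "emphasis" tags; extend or trim it as needed.
EMPHASIS_TAGS = {"h1", "h2", "h3", "b", "strong", "em", "li"}


class EmphasisExtractor(HTMLParser):
    """Collect (tag, text) pairs for every emphasis tag in a page."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open emphasis tags as (tag, [text parts])
        self.found = []

    def handle_starttag(self, tag, attrs):
        if tag in EMPHASIS_TAGS:
            self.stack.append((tag, []))

    def handle_data(self, data):
        # Text counts toward every emphasis tag that is currently open.
        for _, parts in self.stack:
            parts.append(data)

    def handle_endtag(self, tag):
        # Only the innermost open emphasis tag is closed here.
        if self.stack and self.stack[-1][0] == tag:
            open_tag, parts = self.stack.pop()
            text = " ".join("".join(parts).split())
            if text:
                self.found.append((open_tag, text))


def extract_emphasis(html):
    parser = EmphasisExtractor()
    parser.feed(html)
    parser.close()
    return parser.found


# Hypothetical pages standing in for the 100+ downloaded search results.
pages = {
    "page1": "<h1>Big news</h1><p>Long paragraph.</p><ul><li>Point <b>one</b></li></ul>",
    "page2": "<div><h2>Summary</h2><p>Ignored text</p></div>",
}

results = {name: extract_emphasis(html) for name, html in pages.items()}
for name, items in results.items():
    print(name, items)
```

The paragraph text is skipped because <p> is not in the list. Whether a given site really uses these tags for its key messages is something the code cannot know, so expect noise on real pages.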

Webscraper in python where I provide a webpage that has a list of links which the scraper then visits individually

I am a beginner in programming and I am trying to make a scraper. As of right now I'm using the requests library and BeautifulSoup. I provide the program a link and I am able to extract any information I want from that single web page. What I am trying to accomplish is as follows... I want to provide a web page to the program, the web page that I provide is a search result where there is a list of links that could be clicked. I want the program to be able to get the links of those search results, and then scrape some information from each of those specific pages from the main web page that I provide.
If anyone can give me some sort of guidance on how I could achieve this I would appreciate it greatly! Are there some other libraries I should be using? Is there some reading material you could refer me to, maybe a video?
You can put all the URLs in a list and then have your request-sending function loop through it. Use the requests or urllib package for this.
To find the links on the search results page, look for <a> tags with an href attribute.
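A minimal sketch of that two-step flow, assuming the links on the results page are ordinary <a href="..."> tags: first collect the hrefs from the starting page, then loop over them and scrape each one. The function names (extract_links, fetch, scrape_linked_pages) and the fake site used for the demo are invented here; fetch uses urllib from the standard library, and the demo swaps it for a dictionary lookup so nothing touches the network.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkCollector(HTMLParser):
    """Record the href of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def extract_links(html, base_url):
    """Return absolute URLs for all links found in the given HTML."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return [urljoin(base_url, href) for href in collector.hrefs]


def fetch(url):
    """Download a page and return its text (real network access)."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def scrape_linked_pages(start_url, scrape, fetch_page=fetch):
    """Visit every page linked from start_url once and apply scrape() to it."""
    results = {}
    for link in extract_links(fetch_page(start_url), start_url):
        if link not in results:          # skip duplicate links
            results[link] = scrape(fetch_page(link))
    return results


# Demo with a made-up site held in a dict, so no request is actually sent.
fake_site = {
    "https://example.com/search": (
        '<a href="/item/1">One</a> <a href="/item/2">Two</a> '
        '<a href="/item/1">One again</a>'
    ),
    "https://example.com/item/1": "<h1>First</h1>",
    "https://example.com/item/2": "<h1>Second item</h1>",
}

results = scrape_linked_pages(
    "https://example.com/search",
    scrape=len,                          # stand-in for real extraction logic
    fetch_page=lambda url: fake_site[url],
)
print(results)
```

On a real results page you would probably want to filter the collected links (for example by URL prefix) before visiting them, since menus and footers contain <a> tags too.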

Is it possible to automatically scrape articles from websites - Python & Beautiful Soup

I'm trying to make a script that scrapes one or two articles (article URLs only) from different websites. I was able to write a Python script that uses BeautifulSoup to get the website's HTML, find the website's navbar menu via its class name, and loop through each website section. The problem is that each website has a different class name or XPath for the navbar menu and its sections.
Is there a way to make the script work for multiple websites with as little human intervention as possible?
Any suggestions are more than welcome,
Thanks
Solved it. I only needed Python and Selenium, plus one XPath for the navbar elements of each website and another XPath for all types of articles on the different website pages. I saved everything in a database, and the rest is just customized for our specific needs. It wasn't that complicated in the end. Thanks for the help <3
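The shape of that solution (one generic routine plus a small table of XPaths per website) might look something like the sketch below. It is not the poster's code: the site names, the XPaths, and the sample markup are invented, and Selenium is swapped out for xml.etree.ElementTree so the example runs without a browser. ElementTree only understands a limited XPath subset and needs well-formed markup; with Selenium the lookup would instead go through driver.find_elements(By.XPATH, xpath).

```python
import xml.etree.ElementTree as ET

# Hypothetical per-site rules: one XPath for navbar links, one for article links.
SITE_RULES = {
    "site-a": {"nav": ".//ul[@class='menu']/li/a", "articles": ".//article/h2/a"},
    "site-b": {"nav": ".//nav/a", "articles": ".//div[@class='post']/a"},
}


def links_matching(markup, xpath):
    """Return the href of every element matched by the XPath."""
    root = ET.fromstring(markup)
    return [el.get("href") for el in root.findall(xpath) if el.get("href")]


def site_links(site, markup, kind):
    """Look up the XPath configured for this site and apply it."""
    return links_matching(markup, SITE_RULES[site][kind])


# Made-up, well-formed sample pages for the two sites.
page_a = (
    "<html><body><ul class='menu'><li><a href='/world'>World</a></li></ul>"
    "<article><h2><a href='/world/story-1'>Story</a></h2></article></body></html>"
)
page_b = (
    "<html><body><nav><a href='/tech'>Tech</a></nav>"
    "<div class='post'><a href='/tech/post-9'>Post</a></div></body></html>"
)

for site, page in (("site-a", page_a), ("site-b", page_b)):
    print(site, site_links(site, page, "nav"), site_links(site, page, "articles"))
```

Adding a website then means adding one entry to SITE_RULES rather than writing new scraping code, which is about as little human intervention as this approach allows.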

fetch text from web with Angular JS tags such as ng-view

I'm trying to fetch all the visible text from a website using python-scrapy. However, I observe that Scrapy only works with HTML tags such as div, body, head, etc., and not with AngularJS tags such as ng-view. If there is any element within the ng-view tags and I right-click on the page and choose View Source, the content inside the tag doesn't appear; it is displayed as <ng-view> </ng-view>. So how can I use Python to scrape the elements within these ng-view tags? Thanks in advance.
To answer your question
how can I use python to scrap the elements within this ng-view tags
With Scrapy alone, you can't.
The content you want to scrape is rendered on the client side (in the browser). What Scrapy gets you is just the static content from the server. Your browser then interprets the HTML and runs the JS code, and that JS code in turn fetches more content from the server and builds the page from it.
Can it be done?
Yes!
One way is to use some sort of headless browser, like http://phantomjs.org/, to fetch all the content. Once you have the content, you can save it and scrape it as you wish. The catch is that this kind of web scraping is not as easy and straightforward as scraping regular HTML, because every page has to be rendered first. That cost is the reason search engines were long reluctant to index pages that render their content via JS.
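Before reaching for a headless browser, it can help to confirm that a page really is in this situation. The sketch below inspects the raw, unrendered HTML (what Scrapy or urllib would receive) and reports whether the <ng-view> element is present but empty. The names TagTextProbe and needs_js_rendering are made up, the sample strings are illustrative, and the check only covers the element form <ng-view>, not ng-view used as an attribute.

```python
from html.parser import HTMLParser


class TagTextProbe(HTMLParser):
    """Gather the text found inside one tag name in raw, unrendered HTML."""

    def __init__(self, tag_name):
        super().__init__()
        self.tag_name = tag_name
        self.depth = 0        # how many target tags are currently open
        self.seen = False     # whether the tag appeared at all
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == self.tag_name:
            self.seen = True
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == self.tag_name and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.parts.append(data)


def needs_js_rendering(raw_html, tag_name="ng-view"):
    """True if the tag exists in the static HTML but holds no text."""
    probe = TagTextProbe(tag_name)
    probe.feed(raw_html)
    probe.close()
    return probe.seen and not "".join(probe.parts).strip()


# What "view source" shows versus what the browser holds after JS has run.
static_html = "<html><body><ng-view> </ng-view></body></html>"
rendered_html = "<html><body><ng-view><p>Hello</p></ng-view></body></html>"

print(needs_js_rendering(static_html))
print(needs_js_rendering(rendered_html))
```

If the check comes back true for the HTML your scraper downloads, the text you are after only exists after the page's JavaScript has run, so a rendering step is needed before any parsing.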

How to use Python to make a web crawler to full-text RSS

I want to write a web crawler in Python that prints only the contents of, for example, a news article.
I tried to do it using BeautifulSoup, to print the content inside a <div> with a specific id, but every website has a different id for the 'entry' div.
I found this website and tried to make a crawler for it, but I have 2 problems:
1. I don't know if it's a good idea to build a crawler around this website, because it may stop working one day.
2. I tried to print only the text from the website (fivefilters.org), but it prints the HTML markup as well. Could someone please show me how to print just the text of the page?
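For the second problem, one possible approach is to keep only the text nodes and drop the tags, skipping <script> and <style> contents. The sketch below does that with the standard library's html.parser; the class and function names and the sample page are made up. With BeautifulSoup, soup.get_text() does a similar job.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Keep text nodes, ignoring anything inside <script> or <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.chunks.append(data)


def visible_text(html):
    """Return the page text with markup removed and whitespace collapsed."""
    parser = TextOnly()
    parser.feed(html)
    parser.close()
    # Join text nodes with spaces, then collapse runs of whitespace.
    return " ".join(" ".join(parser.chunks).split())


# Made-up article page.
page = (
    "<html><head><style>p{color:red}</style></head><body>"
    "<h1>Headline</h1><p>First <b>bold</b> line.</p>"
    "<script>var x = 1;</script></body></html>"
)

text = visible_text(page)
print(text)
```

Because text nodes are joined with a space, a word split by inline markup (such as wor<b>d</b>) would come out with a space in the middle; that trade-off keeps headings and paragraphs from running together.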
