I've scraped a few web articles using BeautifulSoup. After scraping them, I'd like to find out which country each article is talking about. My current method is this:
- Extract the raw text from the article.
- Keep a list of all 195 countries.
- Use BeautifulSoup's find_all() function to check how many occurrences of each country there are.
from urllib.request import urlopen
from bs4 import BeautifulSoup

def find_country(url_string):
    html = urlopen(url_string)
    bsObj = BeautifulSoup(html, "html.parser")
    # NOTE: string="UK" only matches <p> tags whose entire text is exactly "UK"
    countryList = bsObj.find_all("p", string="UK")
    print(len(countryList))
I tried this on a page such as https://www.bbc.co.uk/news/uk-politics-52701843 and didn't get the correct result.
However, I read online that you can specify which parent/child element the information should be obtained from, i.e. I want to count only the "UK"s inside the article region of the news page. I was wondering how I would implement this, so that find_all('p', string="UK") finds the correct number of occurrences of the keyword "UK" in the news article.
Thanks for any help, highly appreciated.
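One likely culprit: find_all("p", string="UK") only returns <p> elements whose entire text is exactly "UK", so paragraphs that merely mention the UK are skipped. A more robust sketch is to grab the article container's text and count substrings. This assumes the BBC article body sits inside an <article> element; verify that for other sites:

from urllib.request import urlopen
from bs4 import BeautifulSoup

def find_country(url_string, countries=("UK", "France", "Germany")):
    html = urlopen(url_string)
    soup = BeautifulSoup(html, "html.parser")
    # Narrow the search to the article body so navigation links,
    # footers and "related stories" don't inflate the counts
    article = soup.find("article") or soup
    text = article.get_text(" ")
    # Count raw substring occurrences of each country name
    return {country: text.count(country) for country in countries}

From the returned counts, the country with the highest value is the article's likely subject.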
Related
A newbie to Python here. I want to extract info from multiple websites (e.g. 100+) found via a Google search page. I just want to extract the key info, e.g. text in <h1>, <h2>, <b>, or <li> HTML tags, etc.; I don't want to extract entire <p> paragraphs.
I know how to gather a list of website URLs from that Google search, and I know how to scrape an individual website after looking at the page's HTML. I use the Requests and BeautifulSoup modules for these tasks.
However, I want to know how I can extract key info from all of these (100+!) websites without having to look at their HTML one by one. Is there a way to automatically find out which HTML tags a website uses to emphasize key messages? E.g. some websites may use <h1>, while some may use <b>, or something else.
All I can think of is to come up with a list of possible "emphasis-type" HTML tags and then just use BeautifulSoup.find_all() to do a wide-scale extraction. But surely there must be an easier way?
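For what it's worth, that wide-scale extraction is only a few lines, since find_all() accepts a list of tag names. A sketch, with the tag list itself just a guessed starting point:

import requests
from bs4 import BeautifulSoup

# Candidate "emphasis" tags -- a guessed list, extend as needed
EMPHASIS_TAGS = ["h1", "h2", "h3", "b", "strong", "li"]

def extract_key_info(url):
    soup = BeautifulSoup(requests.get(url).text, "html.parser")
    # find_all() accepts a list of names, so one pass covers every tag type
    return [tag.get_text(strip=True) for tag in soup.find_all(EMPHASIS_TAGS)]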
It would seem that you must first learn how to do loops and functions. Every website is different, and scraping even one website to extract useful information is daunting. I'm a newbie myself, but if I had to extract info from headers like you, this is what I would do (this is just concept code, but I hope you'll find it useful):
import re
from urllib.request import urlopen
from bs4 import BeautifulSoup

def getLinks(articleUrl):
    html = urlopen('http://en.web.com{}'.format(articleUrl))
    bs = BeautifulSoup(html, 'html.parser')
    # Grab links in the header block whose href matches a pattern
    # (adapt the class name and the regex to the real site)
    return bs.find('div', {'class': 'header'}).find_all(
        'a', href=re.compile('^(/web/)((?!:).)*$'))
I have decided to learn Python 2.7 for data analysis and have been watching many tutorials on YouTube to get a good understanding of the basics.
I am at the stage where I want to create simple web crawlers, for educational purposes only, to learn different techniques and just get used to some of the coding.
I am following a tutorial for a web crawler, but I am not sure about a few things. This is what I have so far:
import requests
from bs4 import BeautifulSoup

url = 'http://www.aflcio.org/Legislation-and-Politics/Legislative-Alerts'
r = requests.get(url)
plain_text = r.text
soup = BeautifulSoup(plain_text, 'html.parser')
# Each statement lives in a <div class="ec_statements">
statements = soup.find_all('div', 'ec_statements')
for link in statements:
    print(link.contents)
I can't seem to separate out the href links, or get the text and date information displayed.
I want it to look like this:
Name of Article
Link to Article
Date of Article
Could someone help with some information on what steps are needed here, please?
Much appreciated!
A little code to help you. In bs4, all nodes are connected: each "link" you get from the loop is actually a <div> node, and to get its child <a> tag you can simply write link.a.
A node then has two kinds of values: attributes, accessed like a['href'], and text content, accessed as a.text.
for link in statements:
    print(link.a.text)     # the article title
    print(link.a['href'])  # the relative link to the article
PS: this is the link variable:
<div id="legalert_title"><a href="/Legislation-and-Politics/Legislative-Alerts/Letter-to-Representatives-opposing-the-Fairness-in-Class-Action-Litigation-and-Furthering-Asbestos-Claim-Transparency-Act">Letter to Representatives opposing the "Fairness in Class Action Litigation and Furthering Asbestos Claim Transparency Act"</a></div>
this is link.a:
<a href="/Legislation-and-Politics/Legislative-Alerts/Letter-to-Representatives-opposing-the-Fairness-in-Class-Action-Litigation-and-Furthering-Asbestos-Claim-Transparency-Act">Letter to Representatives opposing the "Fairness in Class Action Litigation and Furthering Asbestos Claim Transparency Act"</a>
this is link.a['href']:
/Legislation-and-Politics/Legislative-Alerts/Letter-to-Representatives-opposing-the-Fairness-in-Class-Action-Litigation-and-Furthering-Asbestos-Claim-Transparency-Act
this is link.a.text:
Letter to Representatives opposing the "Fairness in Class Action Litigation and Furthering Asbestos Claim Transparency Act"
All HTML works this way; it may help to learn a little HTML first.
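Putting this together into the name / link / date output the question asked for, a sketch: the ec_statements class comes from the question's own code, but the legalert_date class for the date is a guess, so check the real markup.

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

base = 'http://www.aflcio.org'
r = requests.get(base + '/Legislation-and-Politics/Legislative-Alerts')
soup = BeautifulSoup(r.text, 'html.parser')
for statement in soup.find_all('div', 'ec_statements'):
    print(statement.a.text)                            # Name of Article
    print(urljoin(base, statement.a['href']))          # Link to Article
    date_div = statement.find('div', 'legalert_date')  # guessed class name
    print(date_div.text if date_div else 'date not found')  # Date of Article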
I'm using BeautifulSoup to try to pull either the top links or simply the top headlines from different topics on the CNN homepage. I seem to be missing something here and would appreciate some assistance. I have managed to write a few web scrapers before, but only with a lot of resistance; it's quite the uphill battle.
What it looks like to me is that the links I need are ultimately stored somewhere like this:
<article class="cd cd--card cd--article cd--idx-1 cd--extra-small cd--has-siblings cd--media__image" data-vr-contentbox="/2015/10/02/travel/samantha-brown-travel-channel-feat/index.html" data-eq-pts="xsmall: 0, small: 300, medium: 460, large: 780, full16x9: 1100" data-eq-state="small">
I can grab the link after data-vr-contentbox and append it to the end of www.cnn.com, and it brings me to the page I need. My problem is in actually grabbing that link. I've tried various approaches; my current iteration is as follows:
import requests
from bs4 import BeautifulSoup

r = requests.get("http://www.cnn.com/")
data = r.text
soup = BeautifulSoup(data, 'html.parser')
# Pull the article path out of each <article>'s data-vr-contentbox attribute
for link in soup.find_all("article"):
    test = link.get("data-vr-contentbox")
    print(test)
My issue here is that it only seems to grab a small number of things that I actually need. I'm only seeing two articles from politics, none from travel, etc. I would appreciate some assistance in resolving this issue. I'm looking to grab all of the links under each topic. Right now I'm just looking at politics or travel as a base to get started.
Particularly, I want to be able to specify the topic (tech, travel, politics, etc.) and grab those headlines. Whether I could simply grab the links and use those to get the headline from the respective page, or simply grab the headlines from here... I seem unable to do either. It would be nice to be able to view everything in a single topic at once, but finding out how to narrow this down isn't proving very simple.
An example article is "IOS 9's Wi-Fi Assist feature costly", which can be found within <div> tags.
I want to be able to find ALL articles under, say, the Tech heading on the homepage and isolate their <div> tags to grab the headline. The tags for this headline look like this:
<div class="strip-rec-link-title ob-tcolor">IOS 9's Wi-Fi Assist feature costly</div>
Yet I don't know how to do BOTH of these things. I can't even seem to grab the headline, despite it being within <div> tags, when I try this:

for link in soup.find_all("div"):
    print("")
    print(link)
I feel like I have a fundamental misunderstanding somewhere, although I've managed to write some scrapers before.
My guess is that the cnn.com website has a bunch of JavaScript that renders a lot of the content after BeautifulSoup reads it. I opened cnn.com and looked at the source in Safari, and there were 197 instances of data-vr-contentbox. However, when I ran it through BeautifulSoup and dumped the output, there were only 13 instances of data-vr-contentbox.
There are a bunch of posts out there about handling this. You can start with the method used in this question: Scraping Javascript driven web pages with PyQt4 - how to access pages that need authentication?
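A more common route nowadays is to let a real browser render the page and then hand the finished HTML to BeautifulSoup. A sketch, assuming Selenium and a Chrome driver are installed:

from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Chrome()  # requires chromedriver on your PATH
driver.get("http://www.cnn.com/")
# page_source holds the JavaScript-rendered DOM, not just the raw HTML
soup = BeautifulSoup(driver.page_source, "html.parser")
driver.quit()

for article in soup.find_all("article"):
    link = article.get("data-vr-contentbox")
    if link:
        print(link)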
So I am trying to learn scraping and was wondering how to get info from multiple webpages. I was practicing on http://www.cfbstats.com/2014/player/index.html. I want to retrieve all the teams, then follow each team's link to its roster, and then retrieve each player's info and, from their personal link, their stats.
What I have so far is:
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

r = requests.get("http://www.cfbstats.com/2014/player/index.html")
soup = BeautifulSoup(r.content, 'html.parser')
links = soup.find_all("a")
for link in links:
    college = link.text
    # hrefs here are relative paths, so join them with the base URL
    collegeurl = urljoin(r.url, link.get("href"))
    c = requests.get(collegeurl)
    campbells = BeautifulSoup(c.content, 'html.parser')
Then I am lost from there. I know I have to nest a for loop in there, but I don't want certain links, such as terms and conditions or social networks.
I'm just trying to get the player info and then the stats linked to each player's name.
You have to somehow filter the links and limit your for loop to the ones that correspond to teams. Then you need to do the same to get the links to players. Using Chrome's "Developer tools" (or your browser's equivalent), I suggest that you (right-click) inspect one of the links of interest, then try to find something that distinguishes it from the links that are not of interest. For instance, on the CFBstats page you'll find:
All team links are inside <div class="conference"> elements. Furthermore, they all contain the substring "/team/" in the href. So you can either XPath your way to a link contained in such a div, or filter for that substring, or both.
On team pages, player links are in <td class="player-name"> cells.
These two should suffice; a sketch follows below. If not, you get the gist. Web crawling is an experimental science...
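A rough sketch of that filtering (the class names and the "/team/" substring come from the observations above; the rest is illustrative):

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

base = "http://www.cfbstats.com"
soup = BeautifulSoup(requests.get(base + "/2014/player/index.html").content, "html.parser")

# Team links: anchors inside <div class="conference"> whose href contains "/team/"
for conference in soup.find_all("div", "conference"):
    for team_link in conference.find_all("a", href=lambda h: h and "/team/" in h):
        team_soup = BeautifulSoup(requests.get(urljoin(base, team_link["href"])).content,
                                  "html.parser")
        # Player links: anchors inside <td class="player-name"> cells
        for cell in team_soup.find_all("td", "player-name"):
            if cell.a:
                print(cell.a.text, urljoin(base, cell.a["href"]))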
I'm not familiar with BeautifulSoup, but you can certainly use regular expressions to retrieve the data you want.
I'm creating a type of news aggregator, and I would like to write a program (in Python) that correctly detects the headline of an article and displays it. How would I go about doing this? Is this a machine learning problem?
I would appreciate any articles or books that would point me in the right direction.
My past attempts have used the BeautifulSoup and Requests modules. Are there any other open-source tools I should check out?
Thank you,
Fernando
The direct way to scrape a web page requires human learning: look at the page, decide what you think are headlines, find out how they are tagged, and then look for those tags with a parser like BeautifulSoup. For example, the level 1 headlines on Techmeme are currently labeled:
<DIV CLASS="ii">
and the level 2 headlines are:
<STRONG CLASS="L1">
After your program fetches the page and matches the tags you're interested in, see if they identify what you're looking for. If some headlines are missed, add additional tags to your search list. If you get false positives (hits on links that aren't headlines), weeding them out will require extra page-dependent logic. There is no magic to reverse engineering, just grunt work, testing, and periodic revalidation to be sure the webmaster hasn't switched things up on you.
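That search list can be kept per site. A sketch: the Techmeme selectors are the ones quoted above; everything else is illustrative.

import requests
from bs4 import BeautifulSoup

# Per-site (tag, class) pairs that mark headlines; extend when headlines
# are missed, prune when you hit false positives
HEADLINE_TAGS = {
    "techmeme.com": [("div", "ii"), ("strong", "L1")],
}

def headlines(url, site):
    soup = BeautifulSoup(requests.get(url).text, "html.parser")
    found = []
    for tag, cls in HEADLINE_TAGS[site]:
        found += [el.get_text(strip=True) for el in soup.find_all(tag, cls)]
    return found

print(headlines("https://www.techmeme.com/", "techmeme.com"))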
After playing around a bit, I found that this works best, using the BeautifulSoup and Requests modules:
import requests
from bs4 import BeautifulSoup

r = requests.get('http://example.com')
soup = BeautifulSoup(r.text, 'html.parser')
title = soup.find('title')
if title:
    print(title.get_text())
The result is the title text, which may still need a bit of cleanup, e.g. with regular expressions.
Maybe it would be much easier to parse their RSS/Atom feeds. Google easily turns up these links: http://wiki.python.org/moin/RssLibraries and http://pypi.python.org/pypi/Atomisator/1.3
But those feeds are pure XML, so you could also use the built-in urllib and XML (DOM or SAX) libraries.
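For instance, pulling headline titles out of an RSS feed with nothing but the standard library might look like this (a sketch; the feed URL is a placeholder):

from urllib.request import urlopen
import xml.etree.ElementTree as ET

# Placeholder URL; substitute the real RSS feed of the site you aggregate
FEED_URL = "http://example.com/rss.xml"

tree = ET.parse(urlopen(FEED_URL))
# In RSS 2.0 each article is an <item> element with a <title> child
for item in tree.iterfind(".//item"):
    print(item.findtext("title"))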