I'm using BeautifulSoup to try to pull either the top links or simply the top headlines from different topics on the CNN homepage. I seem to be missing something here and would appreciate some assistance. I have managed to come up with a few web scrapers before, but it's always through a lot of resistance and is quite the uphill battle.
What it looks like to me is that the links I need are ultimately stored somewhere like this:
<article class="cd cd--card cd--article cd--idx-1 cd--extra-small cd--has-siblings cd--media__image" data-vr-contentbox="/2015/10/02/travel/samantha-brown-travel-channel-feat/index.html" data-eq-pts="xsmall: 0, small: 300, medium: 460, large: 780, full16x9: 1100" data-eq-state="small">
I can grab that link after data-vr-contentbox and append it to the end of www.cnn.com and it brings me to the page I need. My problem is in actually grabbing that link. I've tried various forms to grab them. My current iteration is as follows:
import requests
from bs4 import BeautifulSoup

r = requests.get("http://www.cnn.com/")
data = r.text
soup = BeautifulSoup(data, "html.parser")
for link in soup.findAll("article"):
    test = link.get("data-vr-contentbox")
    print(test)
My issue here is that it only seems to grab a small number of things that I actually need. I'm only seeing two articles from politics, none from travel, etc. I would appreciate some assistance in resolving this issue. I'm looking to grab all of the links under each topic. Right now I'm just looking at politics or travel as a base to get started.
Particularly, I want to be able to specify the topic (tech, travel, politics, etc.) and grab those headlines. I could either grab the links and use them to get the headline from each respective page, or grab the headlines directly from here, but I seem unable to do either. It would be nice to view everything in a single topic at once, but figuring out how to narrow this down isn't proving simple.
An example article is "IOS 9's Wi-Fi Assist feature costly", which can be found within <div> tags.
I want to be able to find ALL articles under, say, the Tech heading on the homepage and isolate those tags to grab the headline. The tags for this headline look like this:
<div class="strip-rec-link-title ob-tcolor">IOS 9's Wi-Fi Assist feature costly</div>
Yet I don't know how to do BOTH of these things. I can't even seem to grab the headline, despite it being within <div> tags, when I try this:
for link in soup.findAll("div"):
print("")
print(link)
I feel like I have a fundamental misunderstanding somewhere, although I've managed to do some scrapers before.
My guess is that the cnn.com website has a bunch of JavaScript which renders a lot of the content after BeautifulSoup reads it. I opened cnn.com and looked at the source in Safari and there were 197 instances of data-vr-contentbox. However, when I ran it through BeautifulSoup and dumped it out, there were only 13 instances of data-vr-contentbox.
There are a bunch of posts out there about handling it. You can start with the method used in this question: Scraping Javascript driven web pages with PyQt4 - how to access pages that need authentication?
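If the PyQt4 route feels heavy, another common option is Selenium: let a real browser render the JavaScript, then hand the finished HTML to BeautifulSoup. A minimal sketch, assuming you have Selenium and a Chrome driver installed:
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("http://www.cnn.com/")
html = driver.page_source  # HTML after the page's scripts have run
driver.quit()

soup = BeautifulSoup(html, "html.parser")
for article in soup.find_all("article"):
    link = article.get("data-vr-contentbox")
    if link:
        print("http://www.cnn.com" + link)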
Related
Basically, I'm trying to download some images from a site which has been "down for maintenance" for over a year. Attempting to visit any page on the website redirects to the forums. However, visiting image URLs directly will still take you to the specified image.
There's a particular post that I know contains more images than I've been able to brute force by guessing possible file names. I thought, rather than typing every combination of characters possible ad infinitum, I could program a web scraper to do it for me.
However, I'm brand new to Python, and I've been unable to overcome the JavaScript redirects. While I've been able to use requests and BeautifulSoup to scrape the page it redirects to for 'href' values, without circumventing the JS I can't pull the links from the news article itself.
import requests
from bs4 import BeautifulSoup
url = 'https://www.he-man.org/news_article.php?id=6136'
reqs = requests.get(url)
soup = BeautifulSoup(reqs.text, 'html.parser')
urls = []
for link in soup.find_all('a'):
    print(link.get('href'))
I've added allow_redirects=False to every request to no avail. I've tried searching for 'jpg' instead of 'href'. I'm currently attempting to install Selenium, though it's fighting me, and I expect I'll just be getting 3xx errors anyway.
The images from the news article all begin with the same 62 characters (https://www.he-man.org/assets/images/home_news/justinedantzer_), so I've also thought maybe there's a way to just infinite-monkeys-on-a-keyboard scrape the rest of it? Or the type of file (.jpg)? I'm open to suggestions here, I really have no idea what direction to come at this thing from now that my first six plans have failed. I've seen a few mentions of scraping spiders, but at this point I've sunk a lot of time into dead ends. Are they worth looking into? What would you recommend?
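If you do end up going the brute-force route, the basic loop is just to build candidate URLs and check the response status before downloading anything. A rough sketch, where the suffix list is purely hypothetical (you'd generate it from whatever pattern the known filenames follow):
import requests

BASE = 'https://www.he-man.org/assets/images/home_news/justinedantzer_'

# Hypothetical suffixes -- replace with whatever candidate names you generate
candidates = ['01.jpg', '02.jpg', '03.jpg']

found = []
for suffix in candidates:
    url = BASE + suffix
    resp = requests.head(url, allow_redirects=False)
    if resp.status_code == 200:  # the image exists; redirects and 404s are skipped
        found.append(url)

print(found)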
A newbie to Python here. I want to extract info from multiple websites (e.g. 100+) from a Google search page. I just want to extract the key info, e.g. text in <h1>, <h2>, <b> or <li> HTML tags, etc. But I don't want to extract entire paragraphs (<p>).
I know how to gather a list of website URLs from that google search; and I know how to web scrape individual website after looking at the page's HTML. I use the Request and BeautifulSoup for these tasks.
However, I want to know how I can extract key info from all these (100+!) websites without having to look at their HTML one by one. Is there a way to automatically find out which HTML tags a website uses to emphasize key messages? E.g. some websites may use <h1>, while some may use <b>, or something else...
All I can think of is to come up with a list of possible "emphasis-typed" HTML tags and then just use BeautifulSoup.find_all() to do a wide-scale extraction. But surely there must be an easier way?
It would seem that you first need to learn how to do loops and functions. Every website is completely different, and scraping a website just to extract useful information is daunting. I'm a newb myself, but if I had to extract info from headers like you, this is what I would do (this is just concept code, but I hope you'll find it useful):
import re
from urllib.request import urlopen
from bs4 import BeautifulSoup

def getLinks(articleUrl):
    html = urlopen('http://en.web.com{}'.format(articleUrl))
    bs = BeautifulSoup(html, 'html.parser')
    # find the page header, then collect the links inside it whose href matches the pattern
    return bs.find('h1', {'class': 'header'}).find_all(
        'a', href=re.compile('^(/web/)((?!:).)*$'))
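On your broader point about a list of "emphasis-typed" tags: find_all() accepts a list of tag names, so the wide-scale extraction you described really is only a few lines. A rough sketch (the example URL is just a placeholder for the ones from your Google search):
import requests
from bs4 import BeautifulSoup

EMPHASIS_TAGS = ['h1', 'h2', 'h3', 'b', 'strong', 'li']  # extend as needed

def extract_key_info(url):
    resp = requests.get(url)
    soup = BeautifulSoup(resp.text, 'html.parser')
    # get_text(strip=True) flattens each tag's contents and trims whitespace
    return [tag.get_text(strip=True) for tag in soup.find_all(EMPHASIS_TAGS)]

for url in ['http://example.com']:  # stand-in for your list of 100+ URLs
    print(extract_key_info(url))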
I am new to Python and web crawling. I intend to scrape links from the top stories of a website. I was told to look at its Ajax requests and send similar ones. The problem is that all requests for the links are the same: http://www.marketwatch.com/newsviewer/mktwheadlines
My question is how to extract links from an infinite scrolling box like this. I am using Beautiful Soup, but I think it's not suitable for this task. I am also not familiar with Selenium or JavaScript. I do know how to scrape certain requests with Scrapy, though.
It is indeed an AJAX request. If you take a look at the network tab in your browser's inspector, you can see that it's making a POST request to download the URLs of the articles.
Every value is self-explanatory here except maybe docid and timestamp. docid seems to indicate which box to pull articles for (there are multiple boxes on the page), and it appears to be the id attached to the <li> element under which the article URLs are stored.
Fortunately, in this case POST and GET are interchangeable, and the timestamp parameter doesn't seem to be required. So you can actually view the results in your browser by right-clicking the URL in the inspector and selecting "copy location with parameters":
http://www.marketwatch.com/newsviewer/mktwheadlines?blogs=true&commentary=true&docId=1275261016&premium=true&pullCount=100&pulse=true&rtheadlines=true&topic=All%20Topics&topstories=true&video=true
This example has the timestamp parameter removed and pullCount increased to 100, so simply request it and it will return 100 article URLs.
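Pulling the article links out of that response with the tools you already have might look like the sketch below. The structure of the returned fragment is an assumption here -- inspect what actually comes back and adjust the selectors:
import requests
from bs4 import BeautifulSoup

url = ('http://www.marketwatch.com/newsviewer/mktwheadlines'
       '?blogs=true&commentary=true&docId=1275261016&premium=true'
       '&pullCount=100&pulse=true&rtheadlines=true'
       '&topic=All%20Topics&topstories=true&video=true')

resp = requests.get(url)
soup = BeautifulSoup(resp.text, 'html.parser')

# Assuming the response is an HTML fragment of <li> items with the
# article links in <a> tags
for a in soup.find_all('a', href=True):
    print(a['href'])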
You can mess around more to reverse engineer how the website does it and what every keyword is used for, but this is a good start.
I am starting with the URL below:
http://www.imdb.com/chart/top
The structure of the HTML file seems so confusing; the piece I'm after boils down to an <h4> element whose text reads "Metascore:", followed by the value.
I am trying to use a format like this:
movie['metascore'] = self.get_text(soup.find('h4', attrs={' ':'Metascore'}))
I'll take a stab at this since it sounds like you're new to scraping. What it sounds like you're actually trying to do is to get the budget, gross, and metascore from each of the individual 250 movie pages on IMDB. You're on the right track by mentioning Scrapy because you do have to crawl to those pages from the initial URL you provided. Scrapy has some excellent documentation, so if you want to use it, I highly recommend you start there first.
However, if all you need is to scrape those 250 pages, you're better off just using Beautiful Soup for the whole job. Simply do a soup.findAll("td", {"class":"titleColumn"}), extract the links, then loop and have Beautiful Soup open each of those pages one at a time. If you're not sure how to do that, again, BS has excellent documentation.
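Roughly sketched, that loop might look like this (class names are taken from the chart page as it stands, so they may need adjusting):
import requests
from bs4 import BeautifulSoup

base = 'http://www.imdb.com'
chart = BeautifulSoup(requests.get(base + '/chart/top').text, 'html.parser')

# Each title cell holds a relative link to the movie's own page
for td in chart.findAll('td', {'class': 'titleColumn'}):
    movie_url = base + td.find('a')['href']
    movie_soup = BeautifulSoup(requests.get(movie_url).text, 'html.parser')
    # scrape budget, gross, and metascore from movie_soup here (see below)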
From there, it's just a matter of scraping the relevant data you want during each iteration. For instance, the metascore of each film is inside a <div> of class star-box-details. Do a .find for that, and then you'll have to do some regular expressions to extract the exact piece you want (regular-expressions.info has a great tutorial on regex, and if you really get into regex, you'll probably end up sinking hours into RexEgg).
I'm not going to code the whole thing since you'll learn a lot through the trial and error that comes with attempting to solve things, but hopefully that puts you on the right track. However, do note that IMDB forbids scraping, but for small projects I'm sure no one will care. But if you want to get serious, the "Does IMDB provide an API?" post has some excellent resources for how to do it via various third-party APIs (and some even directly from IMDB). In your case, the best might be to simply download the data as text files directly from IMDB. Click on any of the FTP links. The files you'll probably want are business.list.gz and ratings.list.gz. As for the metascore on each movie page, that rating actually comes from Metacritic, so you'll want to go there to pull that data.
Good luck!
I'm creating a type of news aggregator and I would like to write a program (Python) that correctly detects the headline and displays it. How would I go about doing this? Is this a machine learning problem?
I would appreciate any articles or books that would point me in the right direction.
My past attempts have included BeautifulSoup and Requests module. Any other open source models I should check out?
Thank you,
Fernando
The direct way to scrape a web page requires human learning - look at the page, decide what you think are headlines, find out how they are tagged, and then look for those tags using a parser like BeautifulSoup. For example, the level 1 headlines on Techmeme currently are labeled:
<DIV CLASS="ii">
and the level 2 headlines are:
<STRONG CLASS="L1">
After your program fetches the page and matches the tags you're interested in, see if they identify what you're looking for. If some headlines are missed, add additional tags to your search list. If you get false positives (hits on links that aren't headlines), weeding them out will require extra page-dependent logic. There is no magic to reverse engineering, just grunt work and testing and periodic revalidation to be sure the webmaster hasn't switched things up on you.
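As a rough sketch of that search-list idea, using the Techmeme tags above (these class names are exactly the kind of thing you'd revalidate periodically):
import requests
from bs4 import BeautifulSoup

HEADLINE_TAGS = [('div', 'ii'), ('strong', 'L1')]  # (tag name, class) pairs to look for

soup = BeautifulSoup(requests.get('http://www.techmeme.com/').text, 'html.parser')
for name, cls in HEADLINE_TAGS:
    for tag in soup.find_all(name, {'class': cls}):
        print(tag.get_text(strip=True))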
After playing around a bit I find that this works best:
Use the BeautifulSoup and Requests modules:
import requests
from bs4 import BeautifulSoup

r = requests.get('http://example.com')
soup = BeautifulSoup(r.text, 'html.parser')
title = soup.find('title')
if title:
    print(title.get_text())
What results is title text that should be cleaned up a bit using regular expressions.
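For that cleanup, collapsing whitespace usually covers most of it -- continuing from the snippet above:
import re

if title:
    cleaned = re.sub(r'\s+', ' ', title.get_text()).strip()
    print(cleaned)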
Maybe it would be much easier to parse their RSS/Atom feeds. Google easily delivers these links: http://wiki.python.org/moin/RssLibraries and http://pypi.python.org/pypi/Atomisator/1.3
But those are pure XML, so you could also use the built-in urllib and XML (DOM or SAX) libraries.
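A minimal sketch of that stdlib route, fetching a feed and reading the item titles out of the DOM (the feed URL is just a placeholder):
from urllib.request import urlopen
from xml.dom import minidom

feed = urlopen('http://example.com/rss.xml').read()
dom = minidom.parseString(feed)

# Each RSS <item> carries a <title> with the headline text
for item in dom.getElementsByTagName('item'):
    title = item.getElementsByTagName('title')[0]
    print(title.firstChild.nodeValue)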