Web data scraping (online news comments) with Scrapy (Python) - python

I want to scrape web comment data from online news, purely for research, and I noticed that I have to learn about Scrapy...
I usually program in Python, so I thought it would be easy to learn. But I ran into some problems.
I want to scrape the news comments at http://news.yahoo.com/congress-wary--but-unlikely-to-blow-up-obama-s-iran-deal-230545228.html.
The problem is that there is a button (>View Comments (452)) that has to be clicked to see the comments. In addition, I want to scrape all the comments on that article, but I have to keep clicking another button (View more comments) to see the next 10 comments.
How can I handle this problem?
The code I have so far is below. Sorry for the poor code.
#############################################
from scrapy.spider import BaseSpider
from scrapy.selector import Selector
from tutorial.items import DmozItem

class DmozSpider(BaseSpider):
    name = "dmoz"
    allowed_domains = ["news.yahoo.com"]
    start_urls = ["http://news.yahoo.com/blogs/oddnews/driver-offended-by-%E2%80%9Cwh0-r8x%E2%80%9D-license-plate-221720503.html",]

    def parse(self, response):
        sel = Selector(response)
        sites = sel.xpath('//div/p')
        items = []
        for site in sites:
            item = DmozItem()
            item['title'] = site.xpath('./text()').extract()
            items.append(item)
        return items
You can see how much is left to be done to solve my problem. But I have to hurry... I will do my best anyway.

Since you seem like the try-first, ask-questions-later type (that's a very good thing), I won't give you an answer, but a (very detailed) guide on how to find the answer.
The thing is, unless you are a Yahoo developer, you probably don't have access to the source code of the site you're trying to scrape. That is to say, you don't know exactly how the site is built and how your requests to it as a user are processed on the server side. You can, however, investigate the client side and try to emulate it. I like using Chrome Developer Tools for this, but you can use others such as Firefox's Firebug.
So first off, we need to figure out what's going on. The way it works is: you click 'View Comments', it loads the first ten, and then you need to keep clicking for the next ten comments each time. Notice, however, that all this clicking isn't taking you to a different link; it fetches the comments live, which is a very neat UI but for our case requires a bit more work. I can tell two things right away:
They're using JavaScript to load the comments (because I'm staying on the same page).
They load them dynamically with AJAX calls each time you click (meaning that instead of loading the comments with the page and just showing them to you, each click triggers another request to the database).
Now let's right-click that button and inspect the element. It's actually just a simple span with text:
<span>View Comments (2077)</span>
By looking at that we still don't know how it's generated or what it does when clicked. Fine. Now, keeping the devtools window open, let's click on it. That opened up the first ten comments. But in fact, a request was made to fetch them, and Chrome devtools recorded it. We look in the Network tab of the devtools and see a lot of confusing data. Wait, here's one that makes sense:
http://news.yahoo.com/_xhr/contentcomments/get_comments/?content_id=42f7f6e0-7bae-33d3-aa1d-3dfc7fb5cdfc&_device=full&count=10&sortBy=highestRated&isNext=true&offset=20&pageNumber=2&_media.modules.content_comments.switches._enable_view_others=1&_media.modules.content_comments.switches._enable_mutecommenter=1&enable_collapsed_comment=1
See? _xhr and then get_comments. That makes a lot of sense. Going to that link in the browser gave me a JSON object (it looks like a Python dictionary) containing all ten comments which that request fetched. That's the request you need to emulate, because it's the one that gives you what you want. First, let's translate it into a normal request that a human can read:
go to this url: http://news.yahoo.com/_xhr/contentcomments/get_comments/
include these parameters: {'_device': 'full',
'_media.modules.content_comments.switches._enable_mutecommenter': '1',
'_media.modules.content_comments.switches._enable_view_others': '1',
'content_id': '42f7f6e0-7bae-33d3-aa1d-3dfc7fb5cdfc',
'count': '10',
'enable_collapsed_comment': '1',
'isNext': 'true',
'offset': '20',
'pageNumber': '2',
'sortBy': 'highestRated'}
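For instance, a minimal sketch of emulating that request with the requests library might look like this (the parameter values are simply the ones captured above; the structure of the response is an assumption until you inspect it yourself):
import requests

# Sketch: emulate the get_comments XHR captured above.
url = 'http://news.yahoo.com/_xhr/contentcomments/get_comments/'
params = {'_device': 'full',
          '_media.modules.content_comments.switches._enable_mutecommenter': '1',
          '_media.modules.content_comments.switches._enable_view_others': '1',
          'content_id': '42f7f6e0-7bae-33d3-aa1d-3dfc7fb5cdfc',  # pull this from the article page
          'count': '10',
          'enable_collapsed_comment': '1',
          'isNext': 'true',
          'offset': '20',
          'pageNumber': '2',
          'sortBy': 'highestRated'}
response = requests.get(url, params=params)
data = response.json()  # a dict holding the ten comments this request fetched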
Now it's just a matter of trial and error. However, a few things to note here:
Obviously the count is what decides how many comments you're getting. I tried changing it to 100 to see what happens and got a bad request, and it was nice enough to tell me why: "Offset should be multiple of total rows". So now we understand how to use offset.
The content_id is probably something that identifies the article you are reading, meaning you need to fetch it from the original page somehow. Try digging around a little; you'll find it.
Also, you obviously don't want to fetch 10 comments at a time, so it's probably a good idea to find a way to fetch the total number of comments somehow (either find out how the page gets it, or just fetch it from within the article itself).
Using the devtools you have access to all the client-side scripts. By digging, you can find that the link to /get_comments/ is kept within a JavaScript object named YUI. You can then try to understand how it is making the request and emulate that (though you can probably figure it out yourself).
You might need to overcome some security measures. For example, you might need a session key from the original article before you can access the comments. This is used to prevent direct access to some parts of the site. I won't trouble you with the details, because it doesn't seem to be a problem in this case, but you do need to be aware of it in case it shows up.
Finally, you'll have to parse the JSON object (Python has excellent built-in tools for that) and then parse the HTML of the comments you are getting (for which you might want to check out BeautifulSoup).
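As a rough sketch of that last step (the key names here are assumptions; check the actual JSON you get back):
import json
from bs4 import BeautifulSoup

raw = response.text        # the body returned by the get_comments request above
data = json.loads(raw)     # equivalent to response.json()
for entry in data.get('comments', []):       # 'comments' is an assumed key
    html = entry.get('content', '')          # 'content' is an assumed key
    text = BeautifulSoup(html, 'html.parser').get_text()
    print(text)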
As you can see, this will require some work, but despite everything I've written, it's not an extremely complicated task either.
So don't panic.
It's just a matter of digging and digging until you find gold (also, having some basic web knowledge doesn't hurt). Then, if you hit a roadblock and really can't go any further, come back here to SO and ask again. Someone will help you.
Good luck!

I'm thankful for this question, as it got me started on trying to scrape Yahoo comments, and I just wanted to add an update, because Yahoo has changed the way they handle comments since this question was posted. First, there are 3 URLs of interest, depending on what you want to get: the main comment threads, the replies to a thread, or the comments from a user. These are:
urlComments = 'https://www.yahoo.com/news/_td/api/resource/canvass.getMessageListForContext_ns;context=%1s;count=10;index=%1s;lang=en-US;namespace=yahoo_content;oauthConsumerKey=frontpage.oauth.canvassKey;oauthConsumerSecret=frontpage.oauth.canvassSecret;rankingProfile=canvassHalfLifeDecayProfile;region=US;sortBy=popular;type=null;userActivity=true'
urlReply = 'https://www.yahoo.com/news/_td/api/resource/canvass.getReplies_ns;context=%1s;count=10;index=%1s;lang=en-US;messageId=%1s;namespace=yahoo_content;oauthConsumerKey=frontpage.oauth.canvassKey;oauthConsumerSecret=frontpage.oauth.canvassSecret;region=US;sortBy=createdAt;tags='
urlUser = 'https://www.yahoo.com/news/_td/api/resource/canvass.getUserMessageHistory;count=10;index=%1s;lang=en-US;oauthConsumerKey=frontpage.oauth.canvassKey;oauthConsumerSecret=frontpage.oauth.canvassSecret;region=US;sortBy=createdAt;userId=%1s'
Now, I've inserted a couple of %1s placeholders into the URLs so that the desired variables, such as the article id, index, or user id, can be substituted in. As before, a few parameters are needed:
params = {'bkt': ["news-d-202", "newsdmcntr"],
          'device': 'desktop',
          'feature': 'cacheContentCanvas,videoDocking,newContentAttribution,livecoverage,featurebar,deferModalCluster,specRetry,newLayout,sidepic,canvassOffnet,ntkFilmstrip,autoNotif,CanvassTags',
          'intl': 'us',
          'lang': 'en-US',
          'partner': 'none',
          'prid': '5t11qvhclanab',
          'region': 'US',
          'site': 'fp',
          'tz': 'America/PICKACITY',  # <-- insert a city
          'ver': '2.0.7765',
          'returnMeta': 'true'}
Using the requests library, we can pull out comments with, say:
response = requests.get(u, params=params) #u is a url from above
coms = response.json()['data']['canvassMessages'] #drop the ['canvassMessages'] if you want to get replies to a thread
From there, you can pull out whatever you want from the comment. Now, coms will only have 10 comments in it (if you look at the URLs, you will see count=10; unfortunately, the max appears to be 30). To get the next set of 10, insert the coms[-1]['index'] value into the desired URL and grab the next 10.
However, the problem I've come across is that you can only grab about 1000 comments before Yahoo taps out. For example, if you visit this comment page, you will see comments 1000-1009 (find an "index" value, and you get something like v=1:s=popular:sl=1498836633:off=1000). But if you visit this next comment page, you should see comments 1010-1019, yet no comments are actually loaded. This is rather annoying, and if someone is aware of how to overcome this problem, I'd be quite welcoming of the solution.
Lastly, in order to get the article id, open up the page source and search for "pstaid", then copy the value that follows it. For example, the above links are for comments from This article with lots of comments. If you search for "pstaid", you find the value 0efc85df-eb0b-373e-b6f3-4c513ed2a415, and this is the article id that you would insert into the URL.
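Putting those pieces together, a hedged sketch of paging through the main comment threads might look like the following (it reuses urlComments and params from above; the initial index value and the stopping condition are assumptions):
import requests

article_id = '0efc85df-eb0b-373e-b6f3-4c513ed2a415'  # the "pstaid" value from the page source
index = ''                # assumption: an empty index appears to start from the first batch
all_comments = []
while len(all_comments) < 1000:                      # Yahoo seems to stop serving around 1000
    u = urlComments % (article_id, index)
    coms = requests.get(u, params=params).json()['data']['canvassMessages']
    if not coms:
        break
    all_comments.extend(coms)
    index = coms[-1]['index']                        # feed the last index back in for the next 10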


How to extract hidden html content with scrapy?

I'm using Scrapy (on PyCharm v2020.1.3) to build a spider that crawls this webpage: "https://www.woolworths.com.au/shop/browse/drinks/cordials-juices-iced-teas/iced-teas". I want to extract the product names and the breadcrumb in a list format, and save the results in a CSV file.
I tried the following code but it returns empty brackets []. After inspecting the HTML code, I discovered that the content is hidden in AngularJS format.
If someone has a solution for that it would be great.
Thank you
import scrapy

class ProductsSpider(scrapy.Spider):
    name = 'products'
    start_urls = ['https://www.woolworths.com.au/shop/browse/drinks/cordials-juices-iced-teas/iced-teas']

    def parse(self, response):
        product = response.css('a.shelfProductTile-descriptionLink::text').extract()
        yield {'productnames': product}
You won't be able to get the desired products by parsing the HTML. The page is heavily JavaScript-oriented, and therefore Scrapy won't parse it.
The simplest way to get the product names (I'm not sure what you mean by breadcrumbs) is to re-engineer the HTTP requests. The Woolworths website generates the product details via an API. If we can mimic the request the browser makes to obtain that product information, we can get the information in a nice, neat format.
First you have to set ROBOTSTXT_OBEY = False within settings.py. Be careful about protracted scrapes of this data, because your IP will probably get banned at some point.
Code Example
import scrapy

class TestSpider(scrapy.Spider):
    name = 'test'
    allowed_domains = ['woolworths.com']
    data = {
        'excludeUnavailable': 'true',
        'source': 'RR-Best Sellers'}

    def start_requests(self):
        url = 'https://www.woolworths.com.au/apis/ui/products/58520,341057,305224,70660,208073,69391,69418,65416,305227,305084,305223,427068,201688,427069,341058,305195,201689,317793,714860,57624'
        yield scrapy.Request(url=url, meta=self.data, callback=self.parse)

    def parse(self, response):
        data = response.json()
        for a in data:
            yield {
                'name': a['Name'],
            }
Explanation
We start off with our defined URL in start_requests. This URL is the specific URL of the API Woolworths uses to obtain information for iced tea. For any other link on Woolworths, the part of the URL after /products/ will be specific to that part of the website.
The reason we're using this is that driving the browser is slow and prone to being brittle. This is fast, and the information we get is usually highly structured, which is much better for scraping.
So how do we get the URL, you may be asking? You need to inspect the page and find the correct request. If you click on the network tools and then reload the website, you'll see a bunch of requests. Usually the largest request is the one with all the data. Clicking it and then clicking Preview gives you a box on the right-hand side with all the details of the products.
In this next image, you can see a preview of the product data
We can then get the request URL and anything else from this request.
I will often copy this request as a cURL (bash command), as seen here,
and enter it into curl.trillworks.com. This converts the cURL into Python, giving you nicely formatted headers and any other data needed to mimic the request.
Now, putting this into Jupyter and playing about, you actually only need the params, NOT the headers, which is much better.
So back to the code. We do a request; using the meta argument we can pass on the data (remember, because it's defined outside the function we have to use self.data), and then we specify the callback as parse.
We can use the response.json() method to convert the JSON object to a set of Python dictionaries corresponding to each product. You MUST have Scrapy v2.2 to use this method. Otherwise you could use data = json.loads(response.text), but you'll have to put import json at the top of the script.
From the preview, and from playing about with the JSON in requests, we can see that these Python dictionaries are actually within a list, so we can use a for loop to loop round each product, which is what we are doing here.
We then yield a dictionary to extract the data: a refers to each product, which is its own dictionary, and a['Name'] refers to that specific dictionary key 'Name', giving us the correct value. To get a better feel for this, I always use the requests package in Jupyter to figure out the correct way to get the data I want; a minimal example is sketched below.
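For instance, a quick exploration outside Scrapy might look like this (the product-ID URL is a shortened version of the one from the spider above; whether the data dict needs to be sent as query parameters is an assumption, since the spider passes it via meta):
import requests

url = ('https://www.woolworths.com.au/apis/ui/products/'
       '58520,341057,305224,70660,208073')
data = {'excludeUnavailable': 'true', 'source': 'RR-Best Sellers'}
products = requests.get(url, params=data).json()   # a list of per-product dictionaries
print([a['Name'] for a in products])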
The only thing left to do is to use scrapy crawl test -o products.csv to output this to a CSV file.
I can't really help you more than this until you specify any other data you want from this page. Please remember that you're going against what the site wants you to scrape. Also, for any other pages on that website, you will need to find the specific URL of the API that serves those products. I have shown you the way to do this; if you want to automate it, it is worth your while struggling through it yourself. We are here to help, but an attempt on your part is how you're going to learn coding.
Additional Information on the Approach of Dynamic Content
There is a wealth of information on this topic. Here are some guidelines to think about when looking at JavaScript-oriented websites. The default should be to try to re-engineer the requests the browser makes to load the page's information. This is what the JavaScript on this site, and many other sites, is doing: it provides a dynamic way to display new information, without reloading the page, by making an HTTP request. If we can mimic that request, we can get the information we desire. This is the most efficient way to get dynamic content.
In order of preference
Re-engineering the HTTP requests
Scrapy-splash
Scrapy_selenium
importing selenium package into your scripts
Scrapy-splash is slightly better than the selenium package, as it pre-renders the page, giving you access to the selectors with the data. Selenium is slow and prone to errors, but it will allow you to mimic browser activity.
There are multiple ways to include selenium in your scripts; see below for an overview.
Recommended Reading/Research
Look at the Scrapy documentation with regard to dynamic content here.
This will give you an overview of the steps for handling dynamic content. Generally speaking, selenium should be thought of as a last resort; it's pretty inefficient when doing larger-scale scraping.
If you are considering adding the selenium package into your script: this might be the lower barrier of entry to getting your script working, but it is not necessarily efficient. At the end of the day, Scrapy is a framework, but there is a lot of flexibility in adding 3rd-party packages. The spider scripts are just Python classes importing the Scrapy architecture in the background. As long as you're mindful of the response and translate some of the selenium calls to work with Scrapy, you should be able to use selenium in your scripts. I would say this solution is probably the least efficient, though.
Consider using scrapy-splash; Splash pre-renders the page and allows you to add in JavaScript execution. Docs are here, and there is a good article from ScrapingHub here.
Scrapy-selenium is a package with a custom Scrapy downloader middleware that allows you to do selenium actions and execute JavaScript. Docs are here. You'll need to have a play around to work out the procedure from them; they don't have the same level of detail as the selenium package itself.
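As a quick illustration of the scrapy-splash route (this assumes you have a Splash instance running and the scrapy-splash middleware configured as per its docs; the selector is the one from the question and may still need adjusting to what Splash renders):
import scrapy
from scrapy_splash import SplashRequest

class ProductsSplashSpider(scrapy.Spider):
    name = 'products_splash'

    def start_requests(self):
        url = 'https://www.woolworths.com.au/shop/browse/drinks/cordials-juices-iced-teas/iced-teas'
        # give the JavaScript a couple of seconds to render the product tiles
        yield SplashRequest(url, self.parse, args={'wait': 2})

    def parse(self, response):
        for name in response.css('a.shelfProductTile-descriptionLink::text').getall():
            yield {'name': name}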

Automatically resolving disambiguation pages

The problem
I'm using the Wikipedia API to get page HTML, which I parse. I use queries like this one to get the HTML for the first section of a page.
The MediaWiki API provides a handy parameter, redirects, which causes the API to automatically follow pages that redirect to other pages. For example, if I search for 'Cats' with https://en.wikipedia.org/w/api.php?page=Cats&redirects, I will be shown the results for Cat, because Cats redirects to Cat.
I'd like a similar function for disambiguation pages such as this one, whereby if I arrive at a disambiguation page, I am automatically redirected to the first link. For example, if I make a request for a page like Mercury, I'd automatically be redirected to Mercury (element), as it is the first link listed on the page.
The Python HTML parser BeautifulSoup is fairly slow on large documents. By only requesting the first section of articles (that's all I need for my use case), using section=0, I can parse them quickly. This is perfect for most articles. But for disambiguation pages, the first section does not include any of the links to specific pages, making it a poor solution. And if I request more than the first section, the HTML loading slows down, which is unnecessary for most articles. See this query for an example of a disambiguation page in which the links are not included in the first section.
What I have so far
As of right now, I've gotten as far as detecting when a disambiguation page is reached. I use code like
bs4.BeautifulSoup(page_html, "html.parser").find("p", recursive=False).get_text().endswith(("refer to:", "refers to:"))
I also spent a while trying to write code that automatically followed a link, before I realized that the links are not included in the first section.
My constraints
I'd prefer to keep the number of requests made to a minimum. I also need to parse as little HTML as possible, because speed is essential for my application.
Possible solutions (which I need help executing)
I could envision several solutions to this problem:
A way within the MediaWiki API to automatically follow the first link from disambiguation pages
A method within the Mediawiki API that allows it to return different amounts of HTML content based on a condition (like presence of a disambiguation template)
A way to dramatically improve the speed of bs4 so that it doesn't matter if I end up having to parse the entire page HTML
As Tgr and everybody else said, no, such a feature doesn't exist, because it doesn't make sense. The first link on a disambiguation page doesn't have any special status or meaning.
As for the existing API, see https://www.mediawiki.org/wiki/Extension:Disambiguator#API_usage
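For example, the Disambiguator extension exposes a disambiguation page property, so one lightweight query can tell you whether a title is a disambiguation page before you fetch and parse any HTML (a rough sketch; adapt it to your client code):
import requests

API = 'https://en.wikipedia.org/w/api.php'
params = {'action': 'query',
          'prop': 'pageprops',
          'ppprop': 'disambiguation',   # only return the disambiguation prop
          'redirects': 1,
          'titles': 'Mercury',
          'format': 'json'}
pages = requests.get(API, params=params).json()['query']['pages']
# with ppprop=disambiguation, 'pageprops' is present only on disambiguation pages
is_disambig = any('pageprops' in p for p in pages.values())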
By the way, the "bot policy" you linked to does not really apply to crawlers/scrapers; the only relevant policy/guideline is the User-Agent policy.

Impossible to extract data from this url

This is my first post here. It has been 5 months since I started learning Python from scratch, on my own, and I acquired most of my knowledge thanks to this forum. I am now able to create web bots which can easily scrape all types of data, especially from sports betting sites.
However, for this particular need, there is one site from which I cannot extract what I am looking for:
winamax
I would like to get the links for all football events (on the left side), for example:
"https://www.winamax.fr/paris-sportifs#!/match/prelive/7894014"
but when I look at the source code, or when I print my soup, I just get nothing.
url = "https://www.winamax.fr/paris-sportifs#!/sports"
urlRequest = requests.get(url, proxies=proxies, headers=headers)
#of course, proxies and headers are defined beforehand
soup = BeautifulSoup(urlRequest.content, "html.parser")
print(soup)
For all bookmakers I have dealt with so far, there was always either a simple HTML tree structure in which all items were easy to find, or a hidden JavaScript file, or a JSON link.
But for this one, even when trying to catch the flow with Firebug, I cannot find anything relevant.
Thanks in advance if someone has an idea on how to get that (I considered using PhantomJS but have not tried it yet).
EDIT:
@ssundarraj:
Here is the header, the same one I have been using in all my projects, so not relevant in my opinion, but anyway, here it is:
AgentsFile = 'UserAgents.txt'
lines = open(AgentsFile).read().splitlines()
myline = random.choice(lines)
headers = {'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
           'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
           'Accept-Encoding': 'gzip,deflate,sdch',
           'Accept-Language': 'fr,fr-FR;q=0.8,en-US;q=0.5,en;q=0.3',
           'Referer': 'https://www.winamax.fr',
           'User-Agent': myline}
EDIT2:
@Chris Lear
using firebug, in the net panel, you can search through all the response bodies (there's a checkbox called "Response Bodies" that appears when you click the search box). That will show you that the data is being fetched by json. I'll leave you to try to make sense of it, but that might give you a start (searching for ids is probably best)
I checked the box you mentioned above, but with no effect :(
With or without a filter, nothing is displayed in my network panel, as you can see in the picture:
nothing caught
I used Firebug and found out this:
Make a POST request to https://www.winamax.fr/betting/slider/slider.php with the parameters:
key=050e42fb0761c96526e8510eda89248f
lang=FR
I don't know if the key changes, but for now it works.
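A minimal sketch of that request with the requests library (the key value is the one captured above and may well change or expire; whether the response is JSON is an assumption, so inspect r.text if .json() fails):
import requests

url = 'https://www.winamax.fr/betting/slider/slider.php'
payload = {'key': '050e42fb0761c96526e8510eda89248f', 'lang': 'FR'}
r = requests.post(url, data=payload)
print(r.json())   # inspect the returned structure for the event links you need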

Screen scraping with Python/Scrapy/Urllib2 seems to be blocked

To help me learn Python, I decided to screen-scrape the football commentaries from the ESPNFC website, from the 'live' page (such as here).
It was working up until a day ago, but having finally sorted some things out, I went to test it and the only piece of commentary I got back was [u'Commentary Not Available'].
Does anyone have any idea how they are doing this, and any easy and quick way around it? I am using Scrapy/XPath and urllib2.
Edit//
for game_id in processQueue:
    data_text = getInformation(game_id)
    clean_events_dict = getEvents(data_text)
    break
Doesn't work the same as
i = getInformation(369186)
j = getEvents(i)
In the first sample, processQueue is a list of game_ids. The first of these is given to the script to start scraping, and the loop breaks out before it has a chance to move on to another game_id.
In the second sample I use a single game id.
The first one fails and the second one works, and I have absolutely no idea why. Any ideas?
There are a few things you can try, assuming you can still access the data from your browser. Bear in mind, however, that web site operators generally are within their rights to block you; this is why projects that rely on scraping a single site are a risky proposition. Here they are:
Delay a few seconds between each scrape
Delay a random number of seconds between each scrape
Accept cookies during your scraping session
Run JavaScript during your session (not possible with Scrapy as far as I know)
Share the scraping load between several IP ranges
There are other strategies which, I generally argue, are less ethical:
Modify your User Agent string to make your scraper look like a browser
I suggest in this answer here that scrapers should be set up to obey robots.txt. However, if you program your scraper to be well behaved, site operators will have fewer reasons to go to the trouble of blocking you. The most frequent error I see in this Stack Overflow tag is simply that scrapers are being run far too fast, and they accidentally cause a (minor) denial of service. So, try slowing down your scrapes first, and see if that helps; one way to do that is sketched below.
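For example, a minimal way to slow a Scrapy project down is via a few standard settings in settings.py (the values here are just illustrative starting points):
# settings.py
DOWNLOAD_DELAY = 5               # wait about 5 seconds between requests
RANDOMIZE_DOWNLOAD_DELAY = True  # vary the delay between 0.5x and 1.5x of DOWNLOAD_DELAY
AUTOTHROTTLE_ENABLED = True      # back off automatically when the server slows down
COOKIES_ENABLED = True           # accept cookies during the scraping session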

Parsing from a website -- source code does not contain the info I need

I'm a little new to web crawlers and such, though I've been programming for a year already. So please bear with me as I try to explain my problem here.
I'm parsing info from Yahoo! News, and I've managed to get most of what I want, but there's a little portion that has me stumped.
For example: http://news.yahoo.com/record-nm-blaze-test-forest-management-225730172.html
I want to get the numbers beside the thumbs-up and thumbs-down icons in the comments. When I use "Inspect Element" in my Chrome browser, I can clearly see the things I have to look for, namely an em tag under the div class 'ugccmt-rate'. However, I'm not able to find this in my Python program. In trying to track down the root of the problem, I viewed the page source, and it seems that this tag is not there. Do you know how I should approach this problem? Does this have something to do with the JavaScript on the page that displays the info only after it runs? I'd appreciate some pointers in the right direction.
Thanks.
The page is being generated via JavaScript.
Check if there is a mobile version of the website first. If not, check for any APIs or RSS/Atom feeds. If there's nothing else, you'll either have to manually figure out what the JavaScript is loading and from where, or use Selenium to automate a browser that renders the JavaScript for you to parse.
Using the Web Console in Firefox, you can pretty easily see what requests the page is actually making as it runs its scripts, and figure out which URI returns the data you want. Then you can request that URI directly in your Python script and tease the data out of it. It is probably in a format that Python already has a library to parse, such as JSON.
Yahoo! may have some stuff on their server side to try to prevent you from accessing these data files from a script, such as checking the browser (user-agent header), cookies, or referrer. These can all be faked with enough perseverance, but you should take their existence as a sign that you should tread lightly. (They may also limit the number of requests you can make in a given time period, which is impossible to get around.)
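As a rough sketch of that approach (the endpoint URL here is a placeholder for whatever URI you find in the Web Console, and the header values are only examples of making the request look browser-like):
import requests

url = 'http://news.yahoo.com/...'   # placeholder: the URI you found in the Web Console
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',  # look like a browser
    'Referer': 'http://news.yahoo.com/record-nm-blaze-test-forest-management-225730172.html',
}
response = requests.get(url, headers=headers)
data = response.json()   # assuming the endpoint returns JSON; check response.text otherwise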
