Impossible to extract data from this url - python

This is my first post here. I have been learning Python from scratch, on my own, for five months, and I acquired most of my knowledge thanks to this forum. I can now write bots that scrape all kinds of data, especially from sports betting sites.
For this particular need, however, there is one site from which I cannot extract what I am looking for:
winamax
I would like to get the links of all football events (listed on the left side), for example:
"https://www.winamax.fr/paris-sportifs#!/match/prelive/7894014"
but when I look at the page source, or when I print my soup, I get nothing useful.
url = "https://www.winamax.fr/paris-sportifs#!/sports"
urlRequest = requests.get(url, proxies=proxies, headers=headers)
#of course, proxies and headers are defined beforehand
soup = BeautifulSoup(urlRequest.content)
print(soup)
For every bookmaker I have dealt with so far, there was always either a simple HTML tree structure in which all items were easy to find, a hidden JavaScript file, or a JSON link.
But for this one, even when trying to catch the traffic with Firebug, I cannot find anything relevant.
Thanks in advance if someone has an idea of how to get this (I have considered using PhantomJS but have not tried it yet).
EDIT:
@ssundarraj:
Below are the headers, the same ones I have been using in all my projects, so not relevant in my opinion, but here they are anyway:
import random

AgentsFile = 'UserAgents.txt'
lines = open(AgentsFile).read().splitlines()
myline = random.choice(lines)
headers = {'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
           'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
           'Accept-Encoding': 'gzip,deflate,sdch',
           'Accept-Language': 'fr,fr-FR;q=0.8,en-US;q=0.5,en;q=0.3',
           'Referer': 'https://www.winamax.fr',
           'User-Agent': myline}
EDIT2:
@Chris Lear:
using firebug, in the net panel, you can search through all the
response bodies (there's a checkbox called "Response Bodies" that
appears when you click the search box). That will show you that the
data is being fetched by json. I'll leave you to try to make sense of
it, but that might give you a start (searching for ids is probably
best)
I checked the box you mentioned above, but with no effect :(
With or without a filter, nothing is displayed in my network panel, as you can see in the picture:
nothing caught

I used Firebug and found this out.
Make a POST request to https://www.winamax.fr/betting/slider/slider.php with these parameters:
key=050e42fb0761c96526e8510eda89248f
lang=FR
I don't know whether the key changes over time, but for now it works.
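For completeness, a minimal sketch of that request with the requests library (assuming the key above is still accepted; if it rotates, it would have to be scraped from the page first):

import requests

url = "https://www.winamax.fr/betting/slider/slider.php"
# key value copied from above; it may change over time
payload = {"key": "050e42fb0761c96526e8510eda89248f", "lang": "FR"}
response = requests.post(url, data=payload)
print(response.text)  # inspect the body to find the event links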

Related

How to retrieve data from API Explorer?

My question is more on the "concept" side, as I don't have any code to show yet. I basically have access to an API Explorer for a website, but the information retrieved when I put a specific url into the API Explorer is not the same as the HTML information I get if I open a webpage with the same url and "inspect" the elements. I'm honestly lost on how to retrieve the data I need, as it is only present in the API Explorer and doesn't seem to be accessible via plain web scraping.
Here is an example to show you what I mean:
API Explorer link: https://platform.worldcat.org/api-explorer/apis/worldcatidentities/identity/Read,
and the specific url to request is: http://www.worldcat.org/identities/lccn-n80126307/
If I open the url (http://www.worldcat.org/identities/lccn-n80126307/) myself and "inspect element", the information I see there (screenshot omitted) does not contain all the data shown in the API Explorer (screenshot omitted).
For example, the language count, audLevel, oclcnum and many others do not exist in the HTML version but are in the API Explorer, and with other authors the genres count exists only in the API Explorer.
I realize that one is XML and the other HTML, so is that why the data differs between the two versions? And whatever the reason, what can I do to retrieve the data that is present only in the API Explorer (such as genres count, audLevel, oclcnum, etc.)?
Any insight would be really helpful.
It's not unusual for sites not to show all the data that is in the underlying JSON/XML. That sort of payload often holds interesting content that isn't displayed anywhere on the site.
In this case the server gives you what you ask for. If you're going after the data with Python, all you really have to do is specify in your headers what you're after. If you don't do that on this site, you get the HTML version.
If you do it like this, you'll get the XML data you're interested in:
import requests
import xml.dom.minidom

url = 'https://www.worldcat.org/identities/lccn-n80126307/'
r = requests.get(url, headers={'Accept': 'application/json'})
# a couple of lines for pretty-printing the xml
dom = xml.dom.minidom.parseString(r.text)
pretty_xml_as_string = dom.toprettyxml()
print(pretty_xml_as_string)
Then all you have to do is extract the content you're after. That can be done in many ways. Let me know if this helps you.
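Building on that, a small sketch of the extraction step with ElementTree (the tag names oclcnum and audLevel are taken from the question and are only guesses; check the pretty-printed output for the real structure):

import requests
import xml.etree.ElementTree as ET

url = 'https://www.worldcat.org/identities/lccn-n80126307/'
r = requests.get(url, headers={'Accept': 'application/json'})
root = ET.fromstring(r.text)
# Tag names below are guesses based on the question; adjust to what the XML actually contains.
for tag in ('oclcnum', 'audLevel'):
    for element in root.iter(tag):
        print(tag, element.text)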

Scraping the content of a box contains infinite scrolling in Python

I am new to Python and web crawling. I intend to scrape the links in the top stories of a website. I was told to look at its Ajax requests and send similar ones. The problem is that all the requests for the links are the same: http://www.marketwatch.com/newsviewer/mktwheadlines
My question is how to extract links from an infinite-scrolling box like this one. I am using Beautiful Soup, but I think it is not suitable for this task. I am also not familiar with Selenium or JavaScript. I do know how to scrape certain requests with Scrapy, though.
It is indeed an AJAX request. If you take a look at the network tab in your browser inspector:
You can see that it's making a POST request to download the urls to the articles.
Every value is self-explanatory here, except maybe docid and timestamp. docid seems to indicate which box to pull articles for (there are multiple boxes on the page), and it seems to be the id attached to the <li> element under which the article urls are stored.
Fortunately, in this case POST and GET are interchangeable. Also, the timestamp parameter doesn't seem to be required. So you can actually view the results in your browser by right-clicking the url in the inspector and selecting "copy location with parameters":
http://www.marketwatch.com/newsviewer/mktwheadlines?blogs=true&commentary=true&docId=1275261016&premium=true&pullCount=100&pulse=true&rtheadlines=true&topic=All%20Topics&topstories=true&video=true
This example has the timestamp parameter removed and pullCount increased to 100, so simply requesting it will return 100 article urls.
You can dig around further to reverse-engineer how the website does it and what every parameter is used for, but this is a good start.
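As a rough sketch of that request in Python (the parameter values are copied from the url above; it is an assumption that the response is an HTML fragment whose <a> tags point to the articles):

import requests
from bs4 import BeautifulSoup

params = {
    'blogs': 'true',
    'commentary': 'true',
    'docId': '1275261016',
    'premium': 'true',
    'pullCount': '100',
    'pulse': 'true',
    'rtheadlines': 'true',
    'topic': 'All Topics',
    'topstories': 'true',
    'video': 'true',
}
r = requests.get('http://www.marketwatch.com/newsviewer/mktwheadlines', params=params)
# Assumed: the response is an HTML fragment containing <a> tags for the articles.
soup = BeautifulSoup(r.text, 'html.parser')
links = [a['href'] for a in soup.find_all('a', href=True)]
print(links)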

getting specific images from page

I am pretty new to BeautifulSoup. I am trying to print the image links from http://www.bing.com/images?q=owl:
import urllib2
from bs4 import BeautifulSoup

redditFile = urllib2.urlopen("http://www.bing.com/images?q=owl")
redditHtml = redditFile.read()
redditFile.close()
soup = BeautifulSoup(redditHtml)
productDivs = soup.findAll('div', attrs={'class': 'dg_u'})
for div in productDivs:
    print div.find('a')['t1']     # works fine
    print div.find('img')['src']  # this raises KeyError: 'src'
But this only gives me the title, not the image source.
Is there anything wrong?
Edit:
I have edited my code, but I still cannot get the image url.
Bing is using some techniques to block automated scrapers. I tried to print
div.find('img')
and found that they are sending the source in an attribute named src2, so the following should work:
div.find('img')['src2']
This works for me. Hope it helps.
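For reference, the loop from the question adjusted along those lines, falling back to src when src2 is absent (productDivs as defined in the question):

for div in productDivs:
    img = div.find('img')
    if img is None:
        continue
    # some result tiles expose the real url in src2 rather than src
    src = img.get('src2') or img.get('src')
    if src:
        print(src)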
If you open up the browser developer tools, you'll see that there is an additional async XHR request issued to the http://www.bing.com/images/async endpoint, which contains the image search results.
This leads to the 3 main options you have:
simulate that XHR request in your code. You might want to use something more suitable for humans than urllib2; see the requests module. This is the so-called "low-level" approach, going down to the bare metal and the website-specific implementation, which makes this option unreliable, difficult, "heavy", error-prone and fragile
automate a real browser using selenium and stay at the high level. In other words, you don't care how the results are retrieved, what requests are made, or what javascript needs to be executed. You just wait for the search results to appear and extract them (see the sketch after this list)
use the Bing Search API (this should probably be option #1)
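A rough sketch of the selenium route, assuming a local Chrome driver is available; the div.dg_u selector is borrowed from the question and Bing's markup may well have changed since:

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()  # assumes a chromedriver is installed locally
driver.get('http://www.bing.com/images?q=owl')
# Selector borrowed from the question; adjust it to Bing's current markup.
for img in driver.find_elements(By.CSS_SELECTOR, 'div.dg_u img'):
    print(img.get_attribute('src') or img.get_attribute('src2'))
driver.quit()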

Web data scraping (online news comments) with Scrapy (Python)

I want to scrape web comment data from online news, purely for research. And I noticed that I have to learn about Scrapy...
Usually I program in Python. I thought it would be easy to learn, but I have run into some problems.
I want to scrape the news comments at http://news.yahoo.com/congress-wary--but-unlikely-to-blow-up-obama-s-iran-deal-230545228.html.
The problem is that there is a button (>View Comments (452)) you have to click to see the comments. In addition, I want to scrape all the comments on that article, and unfortunately I have to click another button (View more comments) each time to see 10 more comments.
How can I handle this problem?
The code I have written so far is below. Sorry for the poor code.
#############################################
from scrapy.spider import BaseSpider
from scrapy.selector import Selector
from tutorial.items import DmozItem

class DmozSpider(BaseSpider):
    name = "dmoz"
    allowed_domains = ["news.yahoo.com"]
    start_urls = ["http://news.yahoo.com/blogs/oddnews/driver-offended-by-%E2%80%9Cwh0-r8x%E2%80%9D-license-plate-221720503.html"]

    def parse(self, response):
        sel = Selector(response)
        sites = sel.xpath('//div/p')
        items = []
        for site in sites:
            item = DmozItem()
            item['title'] = site.xpath('text()').extract()
            items.append(item)
        return items
You can see how much is left to be done to solve my problem. But I am in a hurry. I will do my best anyway.
Since you seem like the try-first ask-question later type (that's a very good thing), I won't give you an answer, but a (very detailed) guide on how to find the answer.
The thing is, unless you are a yahoo developer, you probably don't have access to the source code you're trying to scrape. That is to say, you don't know exactly how the site is built and how your requests to it as a user are being processed on the server-side. You can, however, investigate the client-side and try to emulate it. I like using Chrome Developer Tools for this, but you can use others such as FF firebug.
So first off, we need to figure out what's going on. The way it works is: you click on 'show comments', it loads the first ten, and then you need to keep clicking to get the next ten comments each time. Notice, however, that all this clicking isn't taking you to a different link; it fetches the comments live, which is a very neat UI but for our case requires a bit more work. I can tell two things right away:
They're using javascript to load the comments (because I'm staying on the same page).
They load them dynamically with AJAX calls each time you click (meaning instead of loading the comments with the page and just showing them to you, with each click it does another request to the database).
Now let's right-click and inspect element on that button. It's actually just a simple span with text:
<span>View Comments (2077)</span>
By looking at that we still don't know how that's generated or what it does when clicked. Fine. Now, keeping the devtools window open, let's click on it. This opened up the first ten. But in fact, a request was being made for us to fetch them. A request that chrome devtools recorded. We look in the network tab of the devtools and see a lot of confusing data. Wait, here's one that makes sense:
http://news.yahoo.com/_xhr/contentcomments/get_comments/?content_id=42f7f6e0-7bae-33d3-aa1d-3dfc7fb5cdfc&_device=full&count=10&sortBy=highestRated&isNext=true&offset=20&pageNumber=2&_media.modules.content_comments.switches._enable_view_others=1&_media.modules.content_comments.switches._enable_mutecommenter=1&enable_collapsed_comment=1
See? _xhr and then get_comments. That makes a lot of sense. Going to that link in the browser gave me a JSON object (which looks like a python dictionary) containing all ten comments which that request fetched. Now that's the request you need to emulate, because that's the one that gives you what you want. First let's translate this into a normal request that a human can read:
go to this url: http://news.yahoo.com/_xhr/contentcomments/get_comments/
include these parameters: {'_device': 'full',
'_media.modules.content_comments.switches._enable_mutecommenter': '1',
'_media.modules.content_comments.switches._enable_view_others': '1',
'content_id': '42f7f6e0-7bae-33d3-aa1d-3dfc7fb5cdfc',
'count': '10',
'enable_collapsed_comment': '1',
'isNext': 'true',
'offset': '20',
'pageNumber': '2',
'sortBy': 'highestRated'}
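Put together with the requests library, a rough sketch might look like this (the params are copied verbatim from above; content_id would have to be pulled from the article page, as discussed below):

import requests

url = 'http://news.yahoo.com/_xhr/contentcomments/get_comments/'
params = {'_device': 'full',
          '_media.modules.content_comments.switches._enable_mutecommenter': '1',
          '_media.modules.content_comments.switches._enable_view_others': '1',
          'content_id': '42f7f6e0-7bae-33d3-aa1d-3dfc7fb5cdfc',  # identifies the article
          'count': '10',
          'enable_collapsed_comment': '1',
          'isNext': 'true',
          'offset': '20',
          'pageNumber': '2',
          'sortBy': 'highestRated'}
comments = requests.get(url, params=params).json()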
Now it's just a matter of trial-and-error. However, a few things to note here:
Obviously the count is what decides how many comments you're getting. I tried changing it to 100 to see what happens and got a bad request. And it was nice enough to tell me why: "Offset should be multiple of total rows". So now we understand how to use offset.
The content_id is probably something that identifies the article you are reading. Meaning you need to fetch that from the original page somehow. Try digging around a little, you'll find it.
Also, you obviously don't want to fetch 10 comments at a time, so it's probably a good idea to find a way to fetch the number of total comments somehow (either find out how the page gets it, or just fetch it from within the article itself)
Using the devtools you have access to all client-side scripts. So by digging you can find that that link to /get_comments/ is kept within a javascript object named YUI. You can then try to understand how it is making the request, and try to emulate that (though you can probably figure it out yourself)
You might need to overcome some security measures. For example, you might need a session-key from the original article before you can access the comments. This is used to prevent direct access to some parts of the sites. I won't trouble you with the details, because it doesn't seem like a problem in this case, but you do need to be aware of it in case it shows up.
Finally, you'll have to parse the JSON object (python has excellent built-in tools for that) and then parse the html comments you are getting (for which you might want to check out BeautifulSoup).
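As a purely illustrative sketch of that last step (the JSON structure and key names here are made up; the real ones have to be read from the actual response):

import json
from bs4 import BeautifulSoup

raw = '{"comments": [{"body": "<p>Nice <b>article</b>!</p>"}]}'  # made-up structure for illustration
for comment in json.loads(raw)["comments"]:
    print(BeautifulSoup(comment["body"], "html.parser").get_text())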
As you can see, this will require some work, but despite all I've written, it's not an extremely complicated task either.
So don't panic.
It's just a matter of digging and digging until you find gold (also, having some basic WEB knowledge doesn't hurt). Then, if you face a roadblock and really can't go any further, come back here to SO, and ask again. Someone will help you.
Good luck!
I'm thankful for this question, as it got me started on trying to scrape yahoo comments, and I just wanted to add an update, because yahoo has changed the way they handle comments since this question was posted. First, there are 3 URLs of interest, depending on what you want to get. With these, you can get main comment threads, replies to a thread, or comments from a user. These are
urlComments = 'https://www.yahoo.com/news/_td/api/resource/canvass.getMessageListForContext_ns;context=%1s;count=10;index=%1s;lang=en-US;namespace=yahoo_content;oauthConsumerKey=frontpage.oauth.canvassKey;oauthConsumerSecret=frontpage.oauth.canvassSecret;rankingProfile=canvassHalfLifeDecayProfile;region=US;sortBy=popular;type=null;userActivity=true'
urlReply = 'https://www.yahoo.com/news/_td/api/resource/canvass.getReplies_ns;context=%1s;count=10;index=%1s;lang=en-US;messageId=%1s;namespace=yahoo_content;oauthConsumerKey=frontpage.oauth.canvassKey;oauthConsumerSecret=frontpage.oauth.canvassSecret;region=US;sortBy=createdAt;tags='
urlUser = 'https://www.yahoo.com/news/_td/api/resource/canvass.getUserMessageHistory;count=10;index=%1s;lang=en-US;oauthConsumerKey=frontpage.oauth.canvassKey;oauthConsumerSecret=frontpage.oauth.canvassSecret;region=US;sortBy=createdAt;userId=%1s'
Now, I've inserted a couple of %1s into the URLs, to insert desired variables into the URL, such as the article id, index, or user id. As before, a few parameters are needed:
params = {'bkt': ["news-d-202","newsdmcntr"],
'device': 'desktop',
'feature': 'cacheContentCanvas,videoDocking,newContentAttribution,livecoverage,featurebar,deferModalCluster,specRetry,newLayout,sidepic,canvassOffnet,ntkFilmstrip,autoNotif,CanvassTags',
'intl': 'us',
'lang': 'en-US',
'partner': 'none',
'prid': '5t11qvhclanab',
'region': 'US',
'site': 'fp',
'tz': 'America/PICKACITY',  # <-- insert a city
'ver': '2.0.7765',
'returnMeta': 'true'}
Using the requests library, we can pull out the comments with, say:
response = requests.get(u, params=params) #u is a url from above
coms = response.json()['data']['canvassMessages'] #drop the ['canvassMessages'] if you want to get replies to a thread
From there, you can pull out whatever you want from the comment. Now, coms will only have 10 comments in it (if you look at the URLs, you will see a count=10; unfortunately, the max appears to be 30). To get the next set of 10, insert the coms[-1]['index'] value into the desired URL, and grab the next 10.
However, the problem I've come across is that you can only grab about 1000 comments before yahoo taps out. For example, if you visit this comment page, you will see comments 1000-1009 (find an "index" value, and you get something like v=1:s=popular:sl=1498836633:off=1000). But if you visit this next comment page, you should see comments 1010-1019, yet no comments are actually loaded. This is rather annoying, and if someone is aware of how to overcome this problem, I'd be quite welcoming of the solution.
Lastly, in order to get the article id, open up the page source and search for "pstaid", then copy the value that follows it. For example, the above links are for comments from This article with lots of comments. If you search for "pstaid", you find the value 0efc85df-eb0b-373e-b6f3-4c513ed2a415, and this is the article id that you would insert into the URL.
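To tie this together, a rough pagination sketch, reusing the urlComments and params definitions above (the article id is the example one from this answer, and it is only an assumption that an empty index returns the first page):

import requests

article_id = '0efc85df-eb0b-373e-b6f3-4c513ed2a415'  # example id from this answer
index = ''  # assumption: an empty index fetches the first page
collected = []
for _ in range(5):  # five pages of 10 comments, as a small example
    u = urlComments % (article_id, index)
    coms = requests.get(u, params=params).json()['data']['canvassMessages']
    if not coms:
        break
    collected.extend(coms)
    index = coms[-1]['index']  # feed the last index back in to get the next page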

How to change link url in a HTTP response?

I have an HTTP response from website A, and I need to change all the link urls in this response to urls pointing to website B, so that when users get this response in their browser and click on links, they are directed to website B, not A.
I'm using python and django. Is there a package or tool can do this trick?
Thanks in advance.
Depending upon the nature of the response you get from website A, what you want to do with it, and on how important it is that the replacement be efficient, there are a few possible ways of doing things. I'm not 100% clear on your situation and what you want to achieve.
If the links in the response from website A start with website A's hostname, then just get the response as a string and do response = response.replace('http://website-a.com', 'http://website-b.com') before you present the response to the user.
If the response is HTML, and the links are relative, the easiest solution to code would probably be to use lxml.rewrite_links (see http://lxml.de/lxmlhtml.html#working-with-links). I suspect this is what you're looking for.
If you've got some other situation, well, then I dunno what's appropriate. Maybe a regex. Maybe a custom algorithm of your own design. It depends upon what kind of content you're getting back from website A, how you can recognise links in it, and how you want to change them.
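A minimal sketch of the lxml.rewrite_links route mentioned above (the hostnames are placeholders for website A and website B):

import lxml.html

html_from_a = '<html><body><a href="http://website-a.com/page">link</a></body></html>'

def swap_host(link):
    # Placeholder hostnames; adapt them to the real sites.
    return link.replace('http://website-a.com', 'http://website-b.com')

doc = lxml.html.fromstring(html_from_a)
doc.rewrite_links(swap_host)
print(lxml.html.tostring(doc, encoding='unicode'))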
If you use Apache as your web server, you could use a module to replace text in the response, such as http://mod-replace.sourceforge.net/. This seems more reasonable than invoking Perl or Python for every request. But you have to be aware that all of the text might be replaced, not only the links, which could have unintended effects. It would therefore be a rather dirty solution.
