My problem begins when I try to crawl an app store, let's say Google Play.
For every app there are a lot of comments, and I want to crawl them FAST.
But the comment section in Google Play is generated by JavaScript.
Here is a link for example: https://play.google.com/store/apps/details?id=com.gameloft.android.ANMP.GloftAMHM In that link you can see that in order to generate more comments you need to click on a button several times; after roughly 5-6 clicks, the page generates more comments by executing JavaScript.
At first I solved this problem using a web driver (Firefox) to simulate a real person clicking on the button; this generates more comments, and it keeps pressing until all comments are generated.
The problems with this are: 1) it takes too much time, and 2) sometimes, after tons of clicks and JS generation, the browser fails to respond.
What I need is a way to generate all the comments per application in a better, faster way. Maybe there's some kind of technique, or just anything else, that would improve my solution.
I'm using a spider I've created in Scrapy.
Any kind of help will be much appreciated.
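Edit: for reference, my current (slow) approach is roughly the sketch below. The CSS selectors are placeholders, not the real ones from the page:

from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

driver = webdriver.Firefox()
driver.get("https://play.google.com/store/apps/details?id=com.gameloft.android.ANMP.GloftAMHM")
wait = WebDriverWait(driver, 10)

while True:
    try:
        # Placeholder selector: replace with the real "show more" button
        button = wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, "button.show-more")))
        button.click()
    except TimeoutException:
        break  # no more comments to load

# Placeholder selector for the comment text elements
comments = driver.find_elements(By.CSS_SELECTOR, "div.review-text")
print(len(comments))
driver.quit()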
One of the reasons they generate/show additional comments on demand is exactly that they do not want someone to crawl them... The other is so the initial page loads without them (faster), and a few more are shown only if someone actually starts reading the comments.
Unless they provide an API where you can pull all the comments at once, I do not see another quick way of pulling them, apart from simulating clicks and scrolls... (the slow way of doing it).
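That said, you can sometimes replay the XHR that the "show more" button fires instead of driving a real browser. A rough sketch with requests is below; the endpoint and form fields are only assumptions for illustration, so copy the real ones from the request you see in the browser's network panel:

import requests

session = requests.Session()
session.headers.update({"User-Agent": "Mozilla/5.0"})

payload = {
    "id": "com.gameloft.android.ANMP.GloftAMHM",  # the app's package name
    "reviewType": 0,   # assumed field names -- copy the real ones from the network panel
    "pageNum": 1,
    "xhr": 1,
}
# Hypothetical endpoint -- verify the real URL in the network panel
resp = session.post("https://play.google.com/store/getreviews", data=payload)
print(resp.status_code)
print(resp.text[:500])  # inspect the raw response to work out its format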
Are you respecting robots.txt? Why or why not?
I couldn't really find any questions similar to mine, but I was curious whether there's a way to redirect a URL you click on and make it go to a sub-link or sub-URL of a different website. For example:
If you click on the website URL "chess.com", it would redirect you to, for example, "google.com/ a random sublink", or "chess.dethgrr45dffrr/google.com", or something like that. I basically want it to load the selected website, but under a different URL than its own. This may seem confusing, so my apologies. I was wondering whether this could be done either in Python or simply in the web browser. I wanted to implement this in my script so it would just stay on one website rather than leaving and going to different websites. It doesn't have to be Google; it could be a different website. I know this is not the best explanation of what I was thinking. If someone could help me out, that would be great, thanks!
I am trying to build a scraper for YouTube comments. I already tried to build one with Selenium by opening a headless browser, scrolling down, opening comment responses, and then extracting the data. But a headless browser is too slow, and the scrolling also seems to be unreliable because the number of scraped comments does not match the number of comments given for each video. Maybe I did something wrong, but that is irrelevant to me right now: one of the main reasons why I would like to find another way is time: it is too slow.
I know there have been a lot of questions about scraping YouTube comments on Stack Overflow, but almost every answer I found suggested using some kind of headless browser, i.e. Selenium, to do the job. I don't like that, for the reasons mentioned above. Also, some references I found online suggest refraining from using Selenium for web scraping and instead trying reverse engineering, which means, as far as I understand, emulating the AJAX calls and getting the data, ideally as JSON.
When scrolling down a YouTube video (here's an example), the comments are loaded dynamically. If I have a look at the XHR activity, I get a lot of information.
Looking at the XHR activity, it seems like I can find all the information I need to make a request for the JSON data with the comments. But I actually struggle to construct the right request in order to obtain that JSON.
I read some tutorials online, which mostly offer simple examples that are easy to replicate. But none of them really helped me with my endeavour.
Can someone give me a hint and show me how to post the request in order to get the JSON file with the comments in Python? I am using the requests library.
I know there is a YouTube API which can do the job, but I want to find out whether there is a way of doing the same thing without the API. I think it should be possible.
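To make it concrete, this is the general pattern I am trying to follow. The URL, headers, and payload below are placeholders copied from whatever shows up in the XHR panel; YouTube also seems to require some kind of continuation/session token taken from the page source or a previous response:

import requests

session = requests.Session()
session.headers.update({
    "User-Agent": "Mozilla/5.0",
    "Referer": "https://www.youtube.com/watch?v=VIDEO_ID",  # placeholder video URL
})

# Placeholder endpoint and payload -- copy the real ones from the XHR panel
url = "https://www.youtube.com/placeholder_comment_endpoint"
payload = {"video_id": "VIDEO_ID", "page_token": "TOKEN_FROM_PAGE_SOURCE"}

resp = session.post(url, data=payload)
resp.raise_for_status()
data = resp.json()        # the XHR responses appear to be JSON
print(list(data.keys()))  # explore the structure to find the comment text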
I'm new to coding and trying to use Selenium with Python to click through a website and fill a shopping cart. I've got things working well except for the random ForeSee survey popup. When it appears (and it doesn't always appear in the same location), my code stops working at that point.
I read the ForeSee documentation and it says "...when the invitation is displayed, the fsr.r...cookie is dropped. This cookie prevents a user from being invited again for X days (default 90)."
Hoping for a quick fix, I created a separate Firefox profile, ran through the website, and got the ForeSee pop-up invitation; after that, there is no more pop-up when using that profile manually. But I still get the pop-up when using Selenium.
I used this code:
from selenium import webdriver

fp = webdriver.FirefoxProfile(r'C:\path\to\profile')  # raw string so backslashes in the Windows path are not treated as escapes
browser = webdriver.Firefox(firefox_profile=fp)
EDIT: I got the cookie working. I was using the Local folder instead of the Roaming folder in C:\path\to\profile. Using the roaming folder solved the problem.
My question edited to delete the part about the cookie not working:
Can someone suggest code to permanently handle the ForeSee pop up that appears randomly and on random pages?
I'm using Protractor with JS, so I can't give you actual Python code to handle the issue, but I can give you an idea of how to approach it.
In a nutshell
When the following script is executed in the browser's console:
window.FSR.setFSRVisibility(true);
it makes the ForeSee popup appear behind the rest of the HTML elements, so it no longer affects UI tests.
So my Protractor script looks like this:
await browser.executeScript(
`window.FSR.setFSRVisibility(true);`
);
Theory
So ForeSee is one of those services that can be integrated with any web app; it pulls JS code from their API and changes the HTML of your app by executing code in the scope of the website. Another example of such a company is WalkMe.
Obviously, in the modern world, if these guys can overlay a webpage, they should have a configuration option to make that optional (at least for lower environments), and they actually do. What I mentioned as a solution came from this page. But even if they didn't have such an option, one could reach out to their support and ask how to work around their popups; they would likely consider it as a feature for improvement.
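For a Python/Selenium setup like yours, a rough (untested on my side) equivalent would be to execute the same snippet once the page has loaded:

from selenium import webdriver

driver = webdriver.Firefox()
driver.get("https://example.com/page-under-test")  # placeholder URL for the site you automate
# Guard on window.FSR in case the ForeSee script has not loaded yet on this page
driver.execute_script("if (window.FSR) { window.FSR.setFSRVisibility(true); }")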
To help me learn Python, I decided to screen scrape the football commentaries from the ESPNFC website from the 'live' page (such as here).
It was working until a day ago, but after finally sorting some things out, I went to test it and the only piece of commentary I got back was [u'Commentary Not Available'].
Does anyone have any idea how they are doing this, and any easy, quick ways around it? I am using Scrapy/XPath and urllib2.
Edit//
for game_id in processQueue:
    data_text = getInformation(game_id)
    clean_events_dict = getEvents(data_text)
    break
Doesn't work the same as
i = getInformation(369186)
j = getEvents(i)
In the first sample, processQueue is a list containing game_ids. The first of these is given to the script to start scraping, and the loop is broken out of before it has a chance to move on to another game_id.
In the second sample I use a single game id.
The first one fails and the second one works and I have absolutely no idea why. Any ideas?
There are a few things you can try, assuming you can still access the data from your browser. Bear in mind, however, that website operators are generally within their rights to block you; this is why projects that rely on scraping a single site are a risky proposition. Here they are:
Delay a few seconds between each scrape
Delay a random number of seconds between each scrape
Accept cookies during your scraping session
Run JavaScript during your session (not possible with Scrapy as far as I know)
Share the scraping load between several IP ranges
There are other strategies which, I generally argue, are less ethical:
Modify your User Agent string to make your scraper look like a browser
I suggest in this answer here that scrapers should be set up to obey robots.txt. However, if you program your scraper to be well-behaved, site operators will have fewer reasons to go to the trouble of blocking you. The most frequent errors I see in this Stack Overflow tag are simply that scrapers are being run far too fast, and they are accidentally causing a (minor) denial of service. So, try slowing down your scrapes first, and see if that helps.
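Since you are using Scrapy, a few of these suggestions map directly onto its settings (in your project's settings.py); the values below are only illustrative:

ROBOTSTXT_OBEY = True            # be a well-behaved scraper
DOWNLOAD_DELAY = 3               # wait a few seconds between requests
RANDOMIZE_DOWNLOAD_DELAY = True  # vary the delay between 0.5x and 1.5x of DOWNLOAD_DELAY
COOKIES_ENABLED = True           # accept cookies during the scraping session
# Only change the User-Agent if you have decided the trade-off is acceptable:
# USER_AGENT = "Mozilla/5.0 (compatible; MyScraper/1.0)"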
I'm a little new to web crawlers and such, though I've been programming for a year already. So please bear with me as I try to explain my problem here.
I'm parsing info from Yahoo! News, and I've managed to get most of what I want, but there's a little portion that has stumped me.
For example: http://news.yahoo.com/record-nm-blaze-test-forest-management-225730172.html
I want to get the numbers beside the thumbs-up and thumbs-down icons in the comments. When I use "Inspect Element" in my Chrome browser, I can clearly see the things I have to look for - namely, an em tag under the div class 'ugccmt-rate'. However, I'm not able to find this in my Python program. In trying to track down the root of the problem, I clicked to view the source of the page, and it seems that this tag is not there. Do you guys know how I should approach this problem? Does this have something to do with the JavaScript on the page that displays the info only after it runs? I'd appreciate some pointers in the right direction.
Thanks.
The page is being generated via JavaScript.
Check if there is a mobile version of the website first. If not, check for any APIs or RSS/Atom feeds. If there's nothing else, you'll either have to manually figure out what the JavaScript is loading and from where, or use Selenium to automate a browser that renders the JavaScript for you for parsing.
Using the Web Console in Firefox you can pretty easily see what requests the page is actually making as it runs its scripts, and figure out what URI returns the data you want. Then you can request that URI directly in your Python script and tease the data out of it. It is probably in a format that Python already has a library to parse, such as JSON.
Yahoo! may have some stuff on their server side to try to prevent you from accessing these data files in a script, such as checking the browser (User-Agent header), cookies, or referrer. These can all be faked with enough perseverance, but you should take their existence as a sign that you should tread lightly. (They may also limit the number of requests you can make in a given time period, which is much harder to get around.)
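For example, once you have found the URI in the Web Console, the Python side can be as simple as this sketch. The URI below is a placeholder, and the headers are only needed if Yahoo! actually checks them:

import requests

# Placeholder URI -- use the one you find in the Web Console's network activity
url = "https://example.invalid/comments-ratings-endpoint"
headers = {
    "User-Agent": "Mozilla/5.0",  # mimic a browser if the server checks it
    "Referer": "http://news.yahoo.com/record-nm-blaze-test-forest-management-225730172.html",
}

resp = requests.get(url, headers=headers)
resp.raise_for_status()
data = resp.json()  # likely JSON, which Python can parse out of the box
print(data)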