I wanted to read an article online, and the idea popped into my head to extract it so I could read it offline... so here I am, after 4 weeks of trial and error, and the whole problem comes down to the crawler not being able to read the content of the webpages, even after all this ruckus...
The initial problem was that all of the info wasn't on one page, so I used the site's own navigation button to move through the content...
I've tried BeautifulSoup but it can't seem to parse the page very well. I'm using selenium and chromedriver at the moment.
The reason the crawler can't read the pages seems to be the robots.txt file (the crawl delay for a single page is 3600 seconds, and the article has about 10 pages, which is bearable, but what would happen if it were, say, 100+?), and I don't know how to bypass it or work around it.
Any help??
If robots.txt puts limitations then that's the end of it. You should be web-scraping ethically and this means if the owner of the site wants you to wait 3600 seconds between requests then so be it.
Even if robots.txt doesn't stipulate wait times, you should still be mindful. Small business and website owners might not know to set one, and hammering their website constantly could be costly for them.
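If you want to honour robots.txt programmatically rather than by hand, Python's standard library can read it for you. A minimal sketch, with a placeholder site and user agent:
import time
import urllib.robotparser

# Placeholder site and user agent -- swap in the real ones.
rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

agent = "my-article-crawler"
page = "https://example.com/article?page=2"

if rp.can_fetch(agent, page):
    delay = rp.crawl_delay(agent) or 1  # fall back to a polite 1 second
    time.sleep(delay)
    # ...fetch the page here...
else:
    print("robots.txt disallows this URL, so stop here.")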
Related
I'm trying to scrape data from a specific match from the website Sofascore (I use python & Selenium)
I can access it by first going to https://www.sofascore.com/football/2022-05-12 and then clicking the match Tottenham - Arsenal, which has the URL https://www.sofascore.com/arsenal-tottenham-hotspur/IsR
However, when I enter this link directly in my browser, I arrive at a completely different page for an upcoming match.
Is there a way to differentiate the 2 pages so I can scrape the original match?
Thanks
I checked the pages you talked about. The first page (https://www.sofascore.com/football/2022-05-12) sends a lot of information to the server.
That's why you get a certain page. If you want to solve this with requests, you'll need to record everything it sends with Burp Suite or a similar tool.
You're probably better off just opening it with selenium and then clicking on the first page and getting the page you want...
If you want to check what the current page is in selenium, you can check if the content is what you expect to be on that page...
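For example, a rough sketch of that click-then-verify approach; the locator is a guess, not taken from Sofascore's actual markup, so treat it as a placeholder:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("https://www.sofascore.com/football/2022-05-12")

wait = WebDriverWait(driver, 20)
# Placeholder locator -- inspect the page and pick something stable.
match_link = wait.until(
    EC.element_to_be_clickable((By.PARTIAL_LINK_TEXT, "Tottenham"))
)
match_link.click()

# Check the content is what we expect before scraping (as suggested above).
wait.until(lambda d: "Tottenham" in d.page_source and "Arsenal" in d.page_source)
print("Match page loaded:", driver.current_url)

driver.quit()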
Jonthan
I am a complete beginner with Python and have been trying to crawl a website with BeautifulSoup to collect product information.
pr_url = soup.findAll("li", {"class": "_3FUicfNemK"})
pr_url
Everything else is the same as in my other BeautifulSoup crawling code.
But the problem is that nothing comes back, even though I wrote the right selectors.
So what I thought is that the host has blocked the product area from being crawled,
because every element except that area is crawlable.
Do you know how to crawl this blocked area?
The site url is:
https://shopping.naver.com/living/homeliving/category?menu=10004487&sort=POPULARITY
Thank you for your comments in advance!
Notice how when you first load the page the outline of the site loads but the products take a while to load up? This is because the site is requesting the rest of the content to load in the background. This content isn't blocked, it's simply loaded later :)
2 options here i.m.o...
1) Figure out the background request and pass that into beautifulsoup. Using the Chrome dev tools network tab I can see that the request for the products is...
https://shopping.naver.com/v1/products?nc=1583366400000&subVertical=HOME_LIVING&page=1&pageSize=10&sort=POPULARITY&filter=ALL&displayType=CATEGORY_HOME&includeZzim=true&includeViewCount=true&includeStoreCardInfo=true&includeStockQuantity=false&includeBrandInfo=false&includeBrandLogoImage=false&includeRepresentativeReview=false&includeListCardAttribute=false&includeRanking=false&includeRankingByMenus=false&includeStoreCategoryName=false&menuId=10004487&standardSizeKeys=&standardColorKeys=&attributeValueIds=&attributeValueIdsAll=&certifications=&menuIds=&includeStoreInfoWithHighRatingReview=false
Should be able to guess the tweaks to the query string here and use that (see the first sketch after this list).
2) Use a tool like Selenium, which interacts with the browser and will execute any JavaScript for you, so you don't have to figure out that side of things. If you're new to this stuff, this might be less of a learning curve into web tech (see the second sketch below).
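For option 1, a rough sketch of calling that background endpoint directly. I'm assuming it returns JSON and works without extra cookies or headers beyond a browser-like User-Agent; in practice you may need to copy more of the request from the dev tools, and the parameter set below is a trimmed-down guess based on the URL above:
import requests

url = "https://shopping.naver.com/v1/products"
params = {
    "subVertical": "HOME_LIVING",
    "menuId": "10004487",
    "sort": "POPULARITY",
    "page": 1,
    "pageSize": 10,
    "filter": "ALL",
    "displayType": "CATEGORY_HOME",
}

resp = requests.get(url, params=params, headers={"User-Agent": "Mozilla/5.0"})
resp.raise_for_status()

# Assumption: the endpoint returns JSON; if it actually returns HTML,
# hand resp.text to BeautifulSoup instead.
data = resp.json()
print(list(data.keys()) if isinstance(data, dict) else type(data))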
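For option 2, a sketch of letting Selenium render the page and waiting for the products before parsing. The class name comes from the question above; such generated class names change often, so verify it in the dev tools first:
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("https://shopping.naver.com/living/homeliving/category?menu=10004487&sort=POPULARITY")

# Wait until the lazily loaded product list has actually rendered.
WebDriverWait(driver, 20).until(
    EC.presence_of_all_elements_located((By.CLASS_NAME, "_3FUicfNemK"))
)

soup = BeautifulSoup(driver.page_source, "html.parser")
products = soup.find_all("li", {"class": "_3FUicfNemK"})
print(len(products))

driver.quit()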
I am new to Selenium and web applications. Please bear with me for a second if my question seems way too obvious. Here is my story.
I have written a scraper in Python that uses the Selenium 2.0 WebDriver to crawl AJAX web pages. One of the biggest challenges (and ethical concerns) is that I do not want to overwhelm the website's server. Therefore I need a way to monitor the number of requests my webdriver is firing on each page parsed.
I have done some Google searches. It seems like only Selenium RC provides such functionality. However, I do not want to rewrite my code just for this reason. As a compromise, I decided to limit the rate of method calls that potentially lead to the headless browser firing requests to the server.
In the script, I have the following kind of method calls:
driver.find_element_by_XXXX()
driver.execute_script()
webElement.get_attribute()
webElement.text
I use the second function to scroll to the bottom of the window and get the AJAX content, like the following:
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
Based on my intuition, only the second function will trigger request firing, since the others just seem to parse the existing HTML content.
Is my intuition wrong?
Many many thanks
Perhaps I should elaborate more. I am automating a process of crawling a website in Python. There is a substantial amount of work done, and the script is running without large bugs.
My colleagues, however, reminded me that if, in the process of crawling a page, I make too many requests for the AJAX list within a short time, I may get banned by the server. This is why I started looking for a way to monitor the number of requests I am firing from my headless PhantomJS browser in the script.
Since I cannot find a way to monitor the number of requests in script, I made the compromise I mentioned above.
"Therefore I need a way to monitor the number of requests my webdriver is firing on each page parsed"
As far as I know, the number of requests depends on the webpage's design, i.e. the resources used by the webpage and the requests made by JavaScript/AJAX. WebDriver opens a browser and loads the webpage just like a normal user.
In Chrome, you can check the requests and responses using the Developer Tools panel. You can refer to this post. The current UI design of Developer Tools is different, but the basic functions are still the same. Alternatively, you can also use the Firebug plugin in Firefox.
Updated:
Another method to check the requests and responses is by using Wireshark. Please refer to these Wireshark filters.
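If you'd rather count the requests from the script itself, one option is Chrome's performance log through Selenium. This is Chrome-specific (so it won't help with PhantomJS), and the logging setup differs between Selenium versions, so treat it as a starting point rather than a drop-in solution:
import json
from selenium import webdriver

options = webdriver.ChromeOptions()
options.set_capability("goog:loggingPrefs", {"performance": "ALL"})
driver = webdriver.Chrome(options=options)

driver.get("https://example.com")  # placeholder URL

# Each performance-log entry is a JSON-encoded DevTools event;
# "Network.requestWillBeSent" marks an outgoing request.
entries = driver.get_log("performance")
requests_fired = sum(
    1 for e in entries
    if json.loads(e["message"])["message"]["method"] == "Network.requestWillBeSent"
)
print("Requests fired while loading the page:", requests_fired)

driver.quit()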
I have been playing around with windmill to try out some web scraping, however the API waits.forPageLoad is not able to check if the page is fully rendered.
And in a scenario where I need to reload a page with an existing DOM, I use waits.forElement to detect the DOM so the script can "decide" that the page has loaded. This occasionally detects the DOM even before the page has finished loading.
Also loading a page with windmill test client in firefox seems to take forever. The same page if I load with my regular firefox browser may take like 2 seconds but may take up to a minute in the test client. Is it normal for it to take so long?
Lastly, I was wondering if there are better alternatives to windmill for web scraping? The documentation seems a bit sparse.
Please advise. Thanks :P
client.waits.sleep(milliseconds=u'2000')
This is an absolute pause of 2 seconds.
client.waits.forPageLoad(timeout=u'20000')
This will block subsequent lines until the page loads or until 20 seconds have elapsed, whichever comes first. Think of it as a time-bounded assert: if the page loads in under 20 seconds it passes, if not it fails.
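Putting those together, a rough pattern for the reload case described above might look like this; the keyword arguments to waits.forElement are assumptions, so check them against your windmill version:
# Rough pattern for the reload case; keyword names are assumptions.
client.waits.forPageLoad(timeout=u'20000')                # give up after 20 seconds
client.waits.forElement(id=u'content', timeout=u'8000')   # hypothetical element id
client.waits.sleep(milliseconds=u'500')                   # small settle time for late JS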
I hope this helps,
TD
I'm trying to scrape some information from a web site, but am having trouble reading the relevant pages. The pages seem to first send a basic setup, then more detailed info. My download attempts only seem to capture the basic setup. I've tried urllib and mechanize so far.
Firefox and Chrome have no trouble displaying the pages, although I can't see the parts I want when I view page source.
A sample url is https://personal.vanguard.com/us/funds/snapshot?FundId=0542&FundIntExt=INT
I'd like, for example, average maturity and average duration from the lower right of the page. The problem isn't extracting that info from the page, it's downloading the page so that I can extract the info.
The page uses JavaScript to load the data. Firefox and Chrome are only working because you have JavaScript enabled - try disabling it and you'll get a mostly empty page.
Python isn't going to be able to do this by itself - your best compromise would be to control a real browser (Internet Explorer is easiest, if you're on Windows) from Python using something like Pamie.
The website loads the data via ajax. Firebug shows the ajax calls. For the given page, the data is loaded from https://personal.vanguard.com/us/JSP/Funds/VGITab/VGIFundOverviewTabContent.jsf?FundIntExt=INT&FundId=0542
See the corresponding javascript code on the original page:
<script>populator = new Populator({parentId:
"profileForm:vanguardFundTabBox:tab0",execOnLoad:true,
populatorUrl:"/us/JSP/Funds/VGITab/VGIFundOverviewTabContent.jsf?FundIntExt=INT&FundId=0542",
inline:false,type:"once"});
</script>
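Since urllib is already in the toolbox, one way to use that finding is to fetch the tab-content URL directly and parse the returned HTML. A sketch, assuming the request works without the main page's cookies and that the labels still appear in the markup:
import urllib.request
from bs4 import BeautifulSoup

url = ("https://personal.vanguard.com/us/JSP/Funds/VGITab/"
       "VGIFundOverviewTabContent.jsf?FundIntExt=INT&FundId=0542")

req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
with urllib.request.urlopen(req) as resp:
    html = resp.read()

soup = BeautifulSoup(html, "html.parser")
# The markup around the labels is an assumption -- inspect the response to adjust.
label = soup.find(string=lambda s: s and "Average maturity" in s)
print(label.find_parent().get_text(strip=True) if label else "Label not found")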
The reason is that the page performs AJAX calls after it loads. You will need to find those URLs and scrape their content as well.
As RichieHindle mentioned, your best bet on Windows is to use the WebBrowser class to create an instance of an IE rendering engine and then use that to browse the site.
The class gives you full access to the DOM tree, so you can do whatever you want with it.
http://msdn.microsoft.com/en-us/library/system.windows.forms.webbrowser(loband).aspx