I am trying to scrape basketball odds from bet365, but I am running into an issue where certain leagues' pages won't load, even when I simply load the page without automating anything else. It is very irregular as to which leagues load and which don't, but for a large portion of the leagues the site displays the message "Sorry, this page is no longer available. Betting has closed or has been suspended." I am fairly new to web scraping, so I am not using any tools to hide the fact that this is an automated script. I'm unsure whether the site is flagging the traffic as bot activity, but if that were the case I don't understand why some pages would still load.
I am simply using chromedriver and driver.get() with links like:
bet365.com/#/AC/B18/C20814074/D48/E1453/F10
bet365.com/#/AC/B18/C20604387/D48/E1453/F10/
Both of these work; however, these other links:
bet365.com/#/AC/B18/C20815826/D48/E1453/F10/
bet365.com/#/AC/B18/C20816919/D48/E1453/F10/
don't work, despite being near-identical pages.
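For reference, the whole script is essentially just this minimal sketch (assuming chromedriver is installed and on PATH):

from selenium import webdriver

# Minimal repro: nothing but loading the page, no further automation.
driver = webdriver.Chrome()  # assumes chromedriver is on PATH
driver.get("https://www.bet365.com/#/AC/B18/C20815826/D48/E1453/F10/")

# Print the start of the served HTML to see whether the
# "page is no longer available" message came back.
print(driver.page_source[:500])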
Just looking for some insight into why certain leagues would be blocked, and whether there is any way to work around it.
I've developed a web scraper that extracts reviews from a particular shopping website. It's written in Python, and the scraping is based on Selenium + BS4. But my client thinks it's TOO SLOW and wants it to use Requests instead. To scrape the reviews, I have to wait until the reviews show up (or click a review tab) and then page through every review. I'm guessing the review div is loaded by an XHR/AJAX call, because the whole page doesn't reload when I click the next page. All the parsing is done with BeautifulSoup.
I'm leaving a URL so you can all go and check:
https://smartstore.naver.com/hoskus/products/4351439834?NaPm=ct%3Dkeuq83q8%7Cci%3Df5e8bd34633b9b48d81db83b289f1b2e0512d2f0%7Ctr%3Dslsl%7Csn%3D888653%7Chk%3D9822b0c3e9b322fa2d874575218c223ce2454a42
I've always thought Requests reads HTML far faster than Selenium. But I don't know how to obtain the HTML when it's all hidden behind buttons. Does anybody have an idea, or something I can refer to?
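From what I've read, the usual pattern looks something like the sketch below, where the endpoint and parameters are placeholders that would have to be copied from the browser's Network tab, but I'm not sure how to find the right request:

import requests

# Hypothetical sketch of the pattern: the review "pages" are usually fetched
# as JSON by an XHR in the background, so you replicate that request directly.
# The URL and parameters below are placeholders, not the real Naver endpoint.
url = "https://smartstore.naver.com/example/review-endpoint"
params = {"page": 1, "pageSize": 20}
headers = {"User-Agent": "Mozilla/5.0"}

resp = requests.get(url, params=params, headers=headers, timeout=10)
reviews = resp.json()  # if the response is JSON, BeautifulSoup isn't even needed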
I wanted to read an article online, and it occurred to me that I'd like to read it offline once I had successfully extracted it... so here I am after 4 weeks of trials, and the whole problem comes down to the crawler not being able to read the content of the webpages...
The initial problem was that all of the info wasn't on one page, so I used the site's own buttons to navigate through its content...
I've tried BeautifulSoup, but it can't seem to parse the page very well. I'm using Selenium and chromedriver at the moment.
The reason the crawler can't read the pages seems to be the robots.txt file (the crawl delay for a single page is 3600 seconds; the article has about 10 pages, which is bearable, but what would happen if it were, say, 100+?), and I don't know how to bypass it or work around it.
Any help??
If robots.txt puts limitations on crawling, then that's the end of it. You should be web scraping ethically, and that means if the owner of the site wants you to wait 3600 seconds between requests, then so be it.
Even if robots.txt doesn't stipulate wait times, you should still be mindful. Small business and website owners might not know about scraping, and hammering their website constantly could be costly for them.
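If you want to check what a site's robots.txt actually asks of crawlers, Python's standard library can parse it for you; a minimal sketch (swap in the real site):

from urllib import robotparser

# Parse the site's robots.txt and see what it asks of crawlers.
rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")  # swap in the real site
rp.read()

print(rp.can_fetch("*", "https://example.com/some/article"))  # allowed at all?
print(rp.crawl_delay("*"))  # requested seconds between requests, or None if unset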
I am a complete beginner with Python and tried to crawl a website using BeautifulSoup to collect product information.
pr_url = soup.find_all("li", {"class": "_3FUicfNemK"})  # attrs must be a dict ({"class": ...}), not a set
pr_url
Everything else is the same as my other BeautifulSoup crawling code.
But the problem is that nothing comes back, even though I targeted the right elements.
So my guess is that the host has blocked the product area from being crawled, since every element except that area is crawlable.
Do you know how to crawl this blocked area?
The site URL is:
https://shopping.naver.com/living/homeliving/category?menu=10004487&sort=POPULARITY
Thank you in advance for your comments!
Notice how, when you first load the page, the outline of the site loads but the products take a while to show up? That's because the site requests the rest of the content in the background. This content isn't blocked; it's simply loaded later :)
Two options here, IMO:
1) Figure out the background request and hit it directly. Using the Chrome dev tools Network tab, I can see that the request for the products is:
https://shopping.naver.com/v1/products?nc=1583366400000&subVertical=HOME_LIVING&page=1&pageSize=10&sort=POPULARITY&filter=ALL&displayType=CATEGORY_HOME&includeZzim=true&includeViewCount=true&includeStoreCardInfo=true&includeStockQuantity=false&includeBrandInfo=false&includeBrandLogoImage=false&includeRepresentativeReview=false&includeListCardAttribute=false&includeRanking=false&includeRankingByMenus=false&includeStoreCategoryName=false&menuId=10004487&standardSizeKeys=&standardColorKeys=&attributeValueIds=&attributeValueIdsAll=&certifications=&menuIds=&includeStoreInfoWithHighRatingReview=false
You should be able to guess the tweaks to the query string here and reuse it; see the sketch after option 2 below.
2) Use a tool like Selenium, which drives the browser and executes any JavaScript for you, so you don't have to figure out that side of things. If you're new to this stuff, this route might be less of a learning curve into web tech.
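For option 1, a rough sketch of fetching that endpoint with Requests (assuming the endpoint still works the same way; it may also need cookies or headers copied from the Network tab):

import requests

# Replicate the background request found in the Network tab; the parameters
# come from the query string above, trimmed to the ones that look essential.
url = "https://shopping.naver.com/v1/products"
params = {
    "subVertical": "HOME_LIVING",
    "menuId": "10004487",
    "sort": "POPULARITY",
    "page": 1,
    "pageSize": 10,
    "filter": "ALL",
    "displayType": "CATEGORY_HOME",
}
resp = requests.get(url, params=params, headers={"User-Agent": "Mozilla/5.0"}, timeout=10)
products = resp.json()  # the response is JSON already, so no BeautifulSoup needed here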
Trying to figure out how to make Python play MP3s whenever a tag's text changes on an online fantasy draft board (ClickyDraft).
I know how to scrape elements from a website with Python and Beautiful Soup, and how to play MP3s. But how can I have it detect when a certain element changes, so it can play the appropriate MP3?
I was thinking of having the program scrape the site every 0.5 seconds to detect the changes, but I read that could cause problems? Is there any way of doing this?
The only way is to scrape the site on a regular basis, and 0.5 s is too fast. I don't know how time-sensitive this project is, but scraping every 1/5/10 minutes is usually good enough. If you need it quicker, just get a proxy (plenty of free ones out there) and you can scrape the site more often.
Just try to respect the site: don't consume too much of the site's resources by requesting every 0.5 seconds.
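A minimal polling sketch, where the URL, the CSS selector, and the playsound package are all assumptions you'd adapt:

import time
import requests
from bs4 import BeautifulSoup
from playsound import playsound  # assumption: pip install playsound

URL = "https://clickydraft.example/board"  # placeholder draft-board URL
SELECTOR = "div.current-pick"  # hypothetical selector for the watched tag

last_text = None
while True:
    soup = BeautifulSoup(requests.get(URL, timeout=10).text, "html.parser")
    tag = soup.select_one(SELECTOR)
    if tag is not None and tag.text != last_text:
        if last_text is not None:    # don't alert on the very first poll
            playsound("alert.mp3")   # choose the mp3 based on tag.text if needed
        last_text = tag.text
    time.sleep(60)  # poll once a minute, as suggested above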
I have used the Google Sheets function IMPORTXML to scrape specific parts of webpages, but it doesn't work properly with long XPaths and isn't reliable across tons of website URLs.
I have also tried the Distill extension and scraping web tables into Excel, but neither is a smooth long-term solution.
Please help me get notified when specific parts of tons of webpages change or are updated.
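One alternative, if Sheets keeps choking on long XPaths: evaluate the same XPath in Python with lxml and diff the result between runs. A rough sketch (the URL and XPath are placeholders):

import hashlib
import requests
from lxml import html

# Fetch the page and evaluate the same XPath you used in IMPORTXML.
page = requests.get("https://example.com/page", timeout=10)  # placeholder URL
tree = html.fromstring(page.content)
parts = tree.xpath("//div[@id='price']/text()")  # placeholder XPath

# Hash the extracted text; store the digest and compare it on the next
# run to detect whether that part of the page changed.
digest = hashlib.sha256("".join(parts).encode()).hexdigest()
print(digest)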