At https://nb-bet.com/Results I try to access every match page, but the site seems to block me for a while after the first batch of requests. I only manage to access 60-80 matches before I start getting 500 or 404 errors (even though the page is available and no protection page is shown if you open it in a browser; this happens only for match pages, while, for example, https://nb-bet.com/Results itself still opens normally). The block disappears after about 30-40 minutes and then I can access new matches again.
If I use time.sleep(random.uniform(5, 10)), I only get access to 5-7 matches. I've tried fake_headers, fake_useragent, and accessing the links in random order, but to no avail. I need to find a solution without using proxies, etc. Any help would be greatly appreciated.
As an example, I provide links to 158 matches and the way I go through them; the goal is simply to get a 200 response for each page in one pass (i.e. without a 30-40 minute break). The list of links had to be published on a separate site because this site does not allow posting the question with such a large amount of text, I hope you understand.
The list of links is here - https://pastebin.com/LPeCP5bQ
import requests

s = requests.session()
for i in range(len(links)):
    r = s.get(links[i])
    print(r.status_code)
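For reference, the delay and header-spoofing attempts mentioned above look roughly like this (the header values are only illustrative, and links is the list from the Pastebin):

import random
import time

import requests

s = requests.Session()
# browser-like headers, similar to what fake_headers/fake_useragent would generate
s.headers.update({
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/103.0 Safari/537.36',
    'Accept-Language': 'en-US,en;q=0.9',
})
for link in links:
    r = s.get(link)
    print(r.status_code)
    time.sleep(random.uniform(5, 10))  # randomized pause between requests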
I'm trying to scrape data for a specific match from the website Sofascore (I use Python & Selenium).
I can access it by first going to https://www.sofascore.com/football/2022-05-12 and then clicking the match Tottenham - Arsenal with the url https://www.sofascore.com/arsenal-tottenham-hotspur/IsR.
However, when I enter this link directly in my browser, I arrive at a completely different page for a future match.
Is there a way to differentiate the two pages so I can scrape the original match?
Thanks
I checked the pages you talked about. The first page (https://www.sofascore.com/football/2022-05-12) sends a lot of information to the server.
That's why you get a particular page. If you want to solve this with requests, you'll need to record everything it sends with Burp Suite or a similar tool.
You're probably better off just opening it with Selenium, clicking through from the first page, and getting the page you want.
If you want to check what the current page is in Selenium, you can check whether the content is what you expect to be on that page.
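A minimal sketch of that approach, assuming the match can be located by its link text (the locator is an assumption and may need adjusting to the actual markup):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("https://www.sofascore.com/football/2022-05-12")
# wait for the schedule to render, then click through to the match page
match_link = WebDriverWait(driver, 10).until(
    EC.element_to_be_clickable((By.PARTIAL_LINK_TEXT, "Tottenham"))  # assumed locator
)
match_link.click()
# sanity-check that we landed on the intended match before scraping
if "arsenal-tottenham-hotspur" not in driver.current_url:
    print("unexpected page:", driver.current_url)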
I am trying to crawl the website https://sec.report/, which seems to be protected by a certain server configuration (I need the data for my master's thesis).
I have a list of company names for which I would like to get certain identifiers (CIK) from the above website.
Landauer Inc --> 0000825410.
Starwood Waypoint Homes --> 0001579471.
Supreme Industries Inc --> 0000350846.
[and 2,000 more ...]
Example: Searching for the first entry in the list above (Landauer Inc), I can get the CIK using the following link: https://sec.report/CIK/Search/Landauer%20Inc. The generic link is https://sec.report/CIK/Search/{company_name}.
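For illustration, building those search urls from the name list could look like this (just a sketch of the url construction; it does not get past the protection described below):

from urllib.parse import quote

import requests

companies = ["Landauer Inc", "Starwood Waypoint Homes", "Supreme Industries Inc"]
for name in companies:
    url = "https://sec.report/CIK/Search/" + quote(name)
    r = requests.get(url)
    # returns 200, but only the "Please wait up to 5 seconds" page, as noted below
    print(name, r.status_code)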
Problem: When I send a simple request (Python) to the above URL, I get an HTTP 200 response, yet I am only shown a page saying "Please wait up to 5 seconds...". (The response is just the loading page shown while the request is being checked.)
I assume the website is protected by Cloudflare, based on https://checkforcloudflare.selesti.com/?q=https://sec.report/
Attempts: I have already tried to crawl the page using Python with:
(1) Tor proxies with full request headers (rotating).
(2) Selenium, including Cloudflare packages/extensions.
(3) A simple Scrapy spider (I've never used Scrapy before, so I may have missed a working solution).
Does anyone have an idea how I could bypass the protection to crawl the necessary data?
Thanks a lot in advance!
You may take a look at this: implicit wait
driver.implicitly_wait(10) # seconds
With that line of code, every time you try to select an element on the page, Selenium will keep trying to find it for up to 10 seconds (or longer if you want) and will raise an error if it is not found.
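In context that might look something like the sketch below (the CSS selector for the results table is just a placeholder):

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.implicitly_wait(10)  # every find_element call now retries for up to 10 seconds
driver.get("https://sec.report/CIK/Search/Landauer%20Inc")
# by the time the element is found, the "Please wait" interstitial has usually passed
results = driver.find_element(By.CSS_SELECTOR, "table")  # placeholder selector
print(results.text)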
I am trying to crawl pages like this http://cmispub.cicpa.org.cn/cicpa2_web/07/0000010F849E5F5C9F672D8232D275F4.shtml. Each of these pages contains certain information about an individual person.
There are two ways to get to these pages.
One is to coin their urls, which is what I used in my scrapy code. I had my scrapy POST request bodies look like ascGuid=&isStock=00&method=indexQuery&offName=&pageNum=&pageSize=&perCode=110001670258&perName=&queryType=2, sent to http://cmispub.cicpa.org.cn/cicpa2_web/public/query0/2/00.shtml.
These POSTs return responses where I can use xpath and regex to find strings like '0000010F849E5F5C9F672D8232D275F4' and coin the urls I really want:
next_url_part1 = 'http://cmispub.cicpa.org.cn/cicpa2_web/07/'
next_url_part2 = some xpath and regex...
next_url_part3 = '.shtml'
next_url_list.append(''.join([next_url_part1, i, next_url_part3]))
Finally, scrapy sent GET requests to these coined links and downloaded and crawled the information I wanted.
Since the pages I wanted are information about different individuals, I can change the perCode= part in those POST request bodies to coin corresponding urls of different persons.
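For reference, the first way described above looks roughly like this as a spider (a sketch only; the regex stands in for the xpath/regex step, and the parsing is simplified):

import re

import scrapy

class CicpaSketchSpider(scrapy.Spider):
    name = "cicpa_sketch"

    def start_requests(self):
        # same request body as above, with perCode swapped out per person
        formdata = {
            "ascGuid": "", "isStock": "00", "method": "indexQuery",
            "offName": "", "pageNum": "", "pageSize": "",
            "perCode": "110001670258", "perName": "", "queryType": "2",
        }
        yield scrapy.FormRequest(
            "http://cmispub.cicpa.org.cn/cicpa2_web/public/query0/2/00.shtml",
            formdata=formdata, callback=self.parse_index)

    def parse_index(self, response):
        # pull the 32-character hex ids out of the response and coin the detail urls
        for hex_id in set(re.findall(r"[0-9A-F]{32}", response.text)):
            yield scrapy.Request(
                "http://cmispub.cicpa.org.cn/cicpa2_web/07/" + hex_id + ".shtml",
                callback=self.parse_person)

    def parse_person(self, response):
        self.logger.info("%s -> %s", response.url, response.status)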
But this way sometimes doesn't work out. I have sent GET requests to about 100,000 urls I coined, and I got 5 404 responses. To figure out what was going on, I first pasted these failed urls into a browser and, not to my surprise, I still got 404. So I tried the other way on these 404 urls.
The other way is to manually access these pages in a browser like a real person. Since the pages I want contain information about different individuals, I can type a person's code into the bottom-left blank on this page http://cmispub.cicpa.org.cn/cicpa2_web/public/query0/2/00.shtml (it only works properly under IE) and click the orange search button at the bottom right (which I think is exactly like scrapy sending a POST request). A table then appears on screen, and by clicking the right-most blue words (the person's name), I can finally access these pages.
What confuses me is that after I used the 2nd way on those failed urls and got what I wanted, these previously-404 urls return 200 when I retry them the 1st way (to avoid the influence of cookies, I retried them both with scrapy shell and in the browser's InPrivate mode). I then compared the GET request headers of the 200 and 404 responses, and they look the same. I don't understand what's happening here. Could you please help me?
Here are the remaining failed urls that I haven't tried the 2nd way on, so they still return 404 (if you get 200, maybe someone else has tried the url the 2nd way):
http://cmispub.cicpa.org.cn/cicpa2_web/07/7694866B620EB530144034FC5FE04783.shtml
and the personal code of this person is 110001670258
http://cmispub.cicpa.org.cn/cicpa2_web/07/C003D8B431A5D6D353D8E7E231843868.shtml
and the personal code of this person is 110101301633
http://cmispub.cicpa.org.cn/cicpa2_web/07/B8960E3C85AFCF79BF0823A9D8BCABCC.shtml
and the personal code of this person is 110101480523
http://cmispub.cicpa.org.cn/cicpa2_web/07/8B51A9A73684ADF200A38A5D492A1FEA.shtml
and the personal code of this person is 110101500315
I am new to Python and web crawling. I intend to scrape the links in the top stories section of a website. I was told to look at its Ajax requests and send similar ones. The problem is that all the requests for the links are the same: http://www.marketwatch.com/newsviewer/mktwheadlines
My question is how to extract links from an infinitely scrolling box like this. I am using Beautiful Soup, but I think it's not suitable for this task. I am also not familiar with Selenium and JavaScript. I do know how to scrape certain requests with Scrapy, though.
It is indeed an AJAX request. If you take a look at the network tab in your browser's inspector, you can see that it's making a POST request to download the urls of the articles.
Every value here is self-explanatory except maybe docid and timestamp. docid seems to indicate which box to pull articles for (there are multiple boxes on the page), and it seems to be the id attached to the <li> element under which the article urls are stored.
Fortunately, in this case POST and GET are interchangeable. Also, the timestamp parameter doesn't seem to be required. So you can actually view the results in your browser by right-clicking the url in the inspector and selecting "copy location with parameters":
http://www.marketwatch.com/newsviewer/mktwheadlines?blogs=true&commentary=true&docId=1275261016&premium=true&pullCount=100&pulse=true&rtheadlines=true&topic=All%20Topics&topstories=true&video=true
This example has the timestamp parameter removed and pullCount increased to 100, so simply request it and it will return 100 article urls.
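For example, fetching it with requests using the same parameters might look like this:

import requests

params = {
    "blogs": "true", "commentary": "true", "docId": "1275261016",
    "premium": "true", "pullCount": "100", "pulse": "true",
    "rtheadlines": "true", "topic": "All Topics",
    "topstories": "true", "video": "true",
}
r = requests.get("http://www.marketwatch.com/newsviewer/mktwheadlines", params=params)
print(r.status_code)
print(r.text[:500])  # the response body contains the article urls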
You can dig around further to reverse engineer how the website does it and what every keyword is used for, but this is a good start.
First of all I apologize for the vague title, but the problem is that I'm not sure what is causing the error.
I'm using Python to extract some data from a website.
The code I created works perfectly when passing one link at a time, but somehow breaks when trying to collect the data from the 8000 pages I have (it actually breaks way before that). The process I need to follow is this:
Collect all the links from one single page (8000 links)
From each link, extract another link contained in an iframe
Scrape the data from the link in step 2
Point 1 is easy and works fine.
Points 2 and 3 work for a while and then I get errors, every time at a different point and never the same one. After some tests, I decided to try a different approach and run my code only up to point 2 on all the links from point 1, trying to collect all the links first. At this point I found out that I probably get the error during this stage.
The code works like this: in a for loop I pass each item of a list of urls to the function below. It's supposed to search for a link to the Disqus website. There should be only one link, and there is always one link. Because a library like lxml can't scan inside an iframe, I use Selenium and ChromeDriver.
def get_url(webpage_url):
    chrome_driver_path = '/Applications/chromedriver'
    driver = webdriver.Chrome(chrome_driver_path)
    driver.get(webpage_url)
    iframes = driver.find_elements_by_tag_name("iframe")
    list_urls = []
    urls = []
    # collects all the urls of all the iframe tags
    for iframe in iframes:
        driver.switch_to_frame(iframe)
        time.sleep(3)
        list_urls.append(driver.current_url)
        driver.switch_to_default_content()
    driver.quit()
    for item in list_urls:
        if item.startswith('http://disqus'):
            urls.append(item)
    if len(urls) > 1:
        print "too many urls collected in iframes"
    else:
        url = urls[0]
    return url
At the beginning there was no time.sleep and it worked for roughly 30 links. Then I put in a time.sleep(2) and it got to about 60. Now with time.sleep(3) it works for around 130 links. Of course, this cannot be a solution. The error I get now is always the same (index out of range at url=urls[0]), but each time with a different link. If I test my code with just the single link where it breaks, it works, so it can actually find the urls there. And of course, sometimes it passes a link where it stopped before and works with no issue.
I suspect I get this maybe because of a timeout, but of course I'm not sure.
So, how can I figure out what the issue is here?
If the problem is that I'm making too many requests (even with the sleep), how can I deal with that?
Thank you.
From your description of the problem, it may be that the host throttles your client when you issue too many requests in a given time. This is a common protection against DoS attacks and ill-behaved robots - like yours.
The clean solution here is to check whether the site has a robots.txt file and, if so, parse it and respect the rules - otherwise, set a large enough wait time between two requests so you don't get kicked.
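A quick sketch of that robots.txt check with Python 3's urllib.robotparser (the user-agent string and urls are placeholders):

from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")  # placeholder: the target site's robots.txt
rp.read()
print(rp.can_fetch("my-crawler", "https://example.com/some/page"))  # is this url allowed?
print(rp.crawl_delay("my-crawler"))  # declared crawl delay, or None if absent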
You can also run into quite a few other issues - 404s, lost network connections, etc. - and even page-load timing issues with selenium.webdriver, as documented here:
Dependent on several factors, including the OS/Browser combination,
WebDriver may or may not wait for the page to load. In some
circumstances, WebDriver may return control before the page has
finished, or even started, loading. To ensure robustness, you need to
wait for the element(s) to exist in the page using Explicit and
Implicit Waits.
Regarding your IndexError: you blindly assume that you'll get at least one url (which means at least one iframe), which might not be the case for any of the reasons above (and a few others too). First make sure you properly handle all the corner cases, then fix your code so it doesn't assume it has at least one url:
url = None
if len(urls) > 1:
    print "too many urls collected in iframes"
elif len(urls) == 1:
    url = urls[0]
else:
    print "no url found"
Also, if all you want is the first http://disqus url you can find, there's no need to collect them all, filter them, and then return the first:
def get_url(webpage_url):
    chrome_driver_path = '/Applications/chromedriver'
    driver = webdriver.Chrome(chrome_driver_path)
    driver.get(webpage_url)
    iframes = driver.find_elements_by_tag_name("iframe")
    # walk the iframes and return the first disqus url found
    for iframe in iframes:
        driver.switch_to_frame(iframe)
        time.sleep(3)
        if driver.current_url.startswith('http://disqus'):
            url = driver.current_url
            driver.quit()
            return url
        driver.switch_to_default_content()
    driver.quit()
    return None  # nothing found