I am trying to crawl the webpage https://sec.report/, which seems to be protected by a certain server configuration. (I need the data for my master's thesis.)
I have a list of company names for which I would like to retrieve certain identifiers (CIK numbers) from that website:
Landauer Inc --> 0000825410.
Starwood Waypoint Homes --> 0001579471.
Supreme Industries Inc --> 0000350846.
[and 2,000 more ...]
Example: searching for the first entry in the list above (Landauer Inc), I can get the CIK using the following link: https://sec.report/CIK/Search/Landauer%20Inc. The generic link is https://sec.report/CIK/Search/{company_name}.
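For reference, a minimal sketch of building those search URLs from the company names (assuming the names just need to be URL-encoded, as in the Landauer example):

from urllib.parse import quote

BASE_URL = "https://sec.report/CIK/Search/"
companies = ["Landauer Inc", "Starwood Waypoint Homes", "Supreme Industries Inc"]

# URL-encode each company name and append it to the search path
search_urls = [BASE_URL + quote(name) for name in companies]
# e.g. 'https://sec.report/CIK/Search/Landauer%20Inc'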
Problem: When I send a simple request from Python to such a URL, I get an HTTP 200 response, yet the body only contains a page saying "Please wait up to 5 seconds...". Please see the response here:
[Screenshot: loading page shown when the request is sent]
I assume the website is protected by Cloudflare, according to https://checkforcloudflare.selesti.com/?q=https://sec.report/
What I have tried: I have already attempted to crawl the page with Python using:
(1) Tor proxies with full (rotating) request headers.
(2) Selenium, including Cloudflare bypass packages/extensions.
(3) A simple Scrapy spider (I've never used Scrapy before, so I may have missed a working solution).
Does anyone have an idea how I could get past the protection to crawl the necessary data?
Thanks a lot in advance!
You may take a look at this: implicit wait
driver.implicitly_wait(10) # seconds
With that line of code, every time you try to select an element on the page, Selenium will keep trying to find it for up to 10 seconds (or longer if you want) and raise an error if it is not found.
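For example, a minimal sketch of that approach against the search URL from the question (the XPath is an assumption about the results page; the real structure may differ, and Cloudflare may still require a full browser profile to clear the challenge):

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.implicitly_wait(10)  # every element lookup now waits up to 10 seconds

driver.get("https://sec.report/CIK/Search/Landauer%20Inc")

# Assumption: once the "Please wait up to 5 seconds..." challenge finishes,
# the CIK shows up in a link pointing to /CIK/ on the results page.
link = driver.find_element(By.XPATH, "//a[contains(@href, '/CIK/')]")
print(link.get_attribute("href"))

driver.quit()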
I'm working on a project where the data from the following URL needs to be scraped: https://www.funda.nl/objectinsights/getdata/5628496/
The last part of the URL represents the ID of an object. Opening the link in the browser does work, but sometimes it returns a 404 error. The same happens when using scrapy shell in Python: sometimes I can scrape the URL, sometimes not.
When I managed to open the URL (without a 404 error), I went to Inspect > Network, but I'm not experienced enough to understand that information. Does anyone know the fix, or have additional information on this topic?
Extra urls you can try:
https://www.funda.nl/objectinsights/getdata/5819260/
https://www.funda.nl/objectinsights/getdata/5819578/
https://www.funda.nl/objectinsights/getdata/5819237/
https://www.funda.nl/objectinsights/getdata/5819359/
https://www.funda.nl/objectinsights/getdata/5819371/
https://www.funda.nl/objectinsights/getdata/5819386/
I tested these in scrapy shell and got response 200 each time.
This is not a Scrapy issue if you are getting intermittent 404 responses even from the browser.
They may well be limiting you to a small number of requests per IP address or per minute.
Try writing some code with a delay between requests, or use a rotating proxy (free trials are out there if you don't want to sign up for one).
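For example, a minimal sketch of the delay idea in a Scrapy spider (the spider name and the parse output are placeholders; DOWNLOAD_DELAY and AutoThrottle are standard Scrapy settings):

import scrapy

class FundaInsightsSpider(scrapy.Spider):
    name = "funda_insights"  # placeholder name
    start_urls = [
        "https://www.funda.nl/objectinsights/getdata/5819260/",
        "https://www.funda.nl/objectinsights/getdata/5819578/",
    ]

    # Slow the crawl down so the server is less likely to throttle or block you
    custom_settings = {
        "DOWNLOAD_DELAY": 3,               # seconds between requests
        "RANDOMIZE_DOWNLOAD_DELAY": True,  # adds jitter around the delay
        "AUTOTHROTTLE_ENABLED": True,      # back off automatically under load
    }

    def parse(self, response):
        # Placeholder: yield whatever fields you actually need from the page
        yield {"url": response.url, "status": response.status}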
I am trying to scrape data from Trip.com, specifically this page here. When you visit the page in a browser it shows results for 20 hotels, but as you scroll down more hotel details are loaded. What I want is to scrape the data for the first 50 hotels. However, I have been asked not to use Scrapy or Selenium. Any help is appreciated. Thanks in advance.
If you use the DevTools and look at the Network tab you can see that a request goes out to https://www.trip.com/restapi/soa2/16709/json/HotelSearch? This endpoint returns the results in a JSON format.
The next step would be to reverse-engineer that request and replicate it in Python using urllib, which is built into Python. This step might require some experience and knowledge of how HTTP requests work.
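A rough sketch of replaying such a request with urllib, assuming you have copied the request from DevTools (the payload below is a placeholder; the real field names, values, headers and cookies must be taken from the captured request):

import json
import urllib.request

URL = "https://www.trip.com/restapi/soa2/16709/json/HotelSearch"

# Placeholder payload: copy the real JSON body from the DevTools Network tab
payload = json.dumps({"pageIndex": 1, "pageSize": 50}).encode("utf-8")

req = urllib.request.Request(
    URL,
    data=payload,
    headers={
        "Content-Type": "application/json",
        "User-Agent": "Mozilla/5.0",  # mimic a browser
    },
    method="POST",
)

with urllib.request.urlopen(req) as resp:
    data = json.loads(resp.read().decode("utf-8"))

# The hotel list sits somewhere inside `data`; inspect the keys to find it
print(list(data.keys()))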
I am trying to crawl pages like this http://cmispub.cicpa.org.cn/cicpa2_web/07/0000010F849E5F5C9F672D8232D275F4.shtml. Each of these pages contains certain information about an individual person.
There are two ways to get to these pages.
One is to construct their URLs, which is what I did in my Scrapy code. My Scrapy POST request bodies looked like ascGuid=&isStock=00&method=indexQuery&offName=&pageNum=&pageSize=&perCode=110001670258&perName=&queryType=2, sent to http://cmispub.cicpa.org.cn/cicpa2_web/public/query0/2/00.shtml.
These POSTs return a response in which I can use XPath and regex to find strings like '0000010F849E5F5C9F672D8232D275F4' and construct the URLs I really want:
next_url_part1 = 'http://cmispub.cicpa.org.cn/cicpa2_web/07/'
next_url_part2 = ...  # list of IDs extracted from the response with XPath and regex (elided)
next_url_part3 = '.shtml'
for i in next_url_part2:
    next_url_list.append(''.join([next_url_part1, i, next_url_part3]))
Finally, Scrapy sent GET requests to these constructed links and downloaded the information I wanted.
Since the pages I want hold information about different individuals, I can change the perCode= part of those POST request bodies to construct the corresponding URLs for different persons.
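For illustration, a minimal sketch of how that POST could be issued from a Scrapy spider (the form fields are copied from the request body quoted above; the spider name, the regex-over-the-body extraction and the final parse are simplifications/placeholders):

import re
import scrapy

class CicpaSpider(scrapy.Spider):
    name = "cicpa"  # placeholder name

    def start_requests(self):
        # Form fields copied from the POST body shown above
        formdata = {
            "ascGuid": "", "isStock": "00", "method": "indexQuery",
            "offName": "", "pageNum": "", "pageSize": "",
            "perCode": "110001670258",  # swap in each person's code here
            "perName": "", "queryType": "2",
        }
        yield scrapy.FormRequest(
            "http://cmispub.cicpa.org.cn/cicpa2_web/public/query0/2/00.shtml",
            formdata=formdata,
            callback=self.parse_query,
        )

    def parse_query(self, response):
        # Pull the 32-character hex IDs out of the response and build the detail URLs
        for person_id in set(re.findall(r"[0-9A-F]{32}", response.text)):
            url = "http://cmispub.cicpa.org.cn/cicpa2_web/07/" + person_id + ".shtml"
            yield scrapy.Request(url, callback=self.parse_person)

    def parse_person(self, response):
        yield {"url": response.url}  # placeholder: extract the fields you need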
But this way sometimes doesn't work out. I have sent GET requests to about 100,000 URLs I constructed, and I got 5 404 responses. To figure out what is going on and get the information I want, I first pasted these failed URLs into a browser and, not to my surprise, still got 404. So I tried the other way on these 404 URLs.
The other way is to manually access these pages in a browser, like a real person. Since the pages I want hold information about different individuals, I can type their personal codes into the lower-left input box on this page http://cmispub.cicpa.org.cn/cicpa2_web/public/query0/2/00.shtml (it only works properly under IE), and click the orange search button at the lower right (which I think is exactly like Scrapy sending the POST request). A table then appears on screen, and by clicking the right-most blue text (the person's name), I can finally reach these pages.
What confuses me is that after I use the 2nd way on those failed URLs and get what I want, the previously 404 URLs return 200 when I retry them the 1st way (to rule out cookie effects, I retried them with both scrapy shell and the browser's InPrivate mode). I then compared the GET request headers of the 200 and 404 responses, and they look the same. I don't understand what is happening here. Could you please help me?
Here are the remaining failed URLs on which I haven't tried the 2nd way, so they still return 404 (if you get 200, maybe someone else has already tried the URL the 2nd way):
http://cmispub.cicpa.org.cn/cicpa2_web/07/7694866B620EB530144034FC5FE04783.shtml
and the personal code of this person is 110001670258
http://cmispub.cicpa.org.cn/cicpa2_web/07/C003D8B431A5D6D353D8E7E231843868.shtml
and the personal code of this person is 110101301633
http://cmispub.cicpa.org.cn/cicpa2_web/07/B8960E3C85AFCF79BF0823A9D8BCABCC.shtml
and the personal code of this person is 110101480523
http://cmispub.cicpa.org.cn/cicpa2_web/07/8B51A9A73684ADF200A38A5D492A1FEA.shtml
and the personal code of this person is 110101500315
I would like to know the best/preferred Python 3.x solution (fast to execute, easy to implement, with the option to specify a user agent and browser version to send to the web server so my IP doesn't get blacklisted) that can scrape data in all of the scenarios below (listed in order of complexity as I understand it).
1. Any static webpage with data in tables / divs.
2. Dynamic webpage which completes loading in one go.
3. Dynamic webpage which requires sign-in using username/password and completes loading in one go after we log in.
Sample URL for username/password: https://dashboard.janrain.com/signin?dest=http://janrain.com
4. Dynamic webpage which requires sign-in using OAuth from a popular service like LinkedIn, Google, etc. and completes loading in one go after we log in. I understand this involves some page redirects, token handling, etc.
Sample URL for OAuth-based logins: https://dashboard.janrain.com/signin?dest=http://janrain.com
5. All of bullet point 4 above combined with selecting some drop-down (let's say "sort by date") or some check-boxes, based on which the dynamic data displayed changes.
I need to scrape the data after the check-boxes/drop-downs have been acted on, as any user would do to change the display of the dynamic data.
Sample URL - https://careers.microsoft.com/us/en/search-results?rk=l-seattlearea
You have the option of a drop-down as well as some checkboxes on the page.
6. Dynamic webpage with AJAX loading in which data can keep loading as:
=> 6.1 we keep scrolling down, like the Facebook, Twitter or LinkedIn main page, to get data
Sample URL - facebook, twitter, linkedin, etc.
=> 6.2 or we keep clicking some button/div at the end of the AJAX container to get the next set of data;
Sample URL - https://www.linkedin.com/pulse/cost-climate-change-indian-railways-punctuality-more-editors-india-/
Here you have to click "Show Previous Comments" at the bottom of the page if you need to view and scrape all the comments.
I want to learn and build one exhaustive scraping solution which can be tweaked to cater to all options, from the easy task of bullet point 1 to the complex task of bullet point 6 above, as and when required.
I would recommend using BeautifulSoup for your problems 1 and 2.
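For example, a minimal sketch for a static page with a table (the URL and the selector are placeholders; the custom User-Agent addresses your concern about getting blocked):

import requests
from bs4 import BeautifulSoup

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}  # look like a browser
resp = requests.get("https://example.com/some-table-page", headers=headers)  # placeholder URL
soup = BeautifulSoup(resp.text, "html.parser")

# Print every row of every table on the page
for row in soup.select("table tr"):
    print([cell.get_text(strip=True) for cell in row.find_all(["td", "th"])])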
For 3 and 5 you can use Selenium WebDriver (available as a Python library).
Using Selenium you can perform all the operations you wish (e.g. logging in, changing drop-down values, navigating, etc.) and then access the web content via driver.page_source (you may need to use a sleep/wait to make sure the content is fully loaded).
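A rough sketch of that flow, using the sign-in page from the question (the element IDs, the drop-down name and the credentials are assumptions; inspect the actual page to find the real ones):

import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import Select
from bs4 import BeautifulSoup

driver = webdriver.Chrome()
driver.get("https://dashboard.janrain.com/signin?dest=http://janrain.com")

# Assumed element IDs -- inspect the page for the real ones
driver.find_element(By.ID, "email").send_keys("user@example.com")
driver.find_element(By.ID, "password").send_keys("secret")
driver.find_element(By.ID, "signin-button").click()

time.sleep(5)  # crude wait; explicit waits are cleaner

# Example of changing a drop-down before scraping (the name is an assumption)
Select(driver.find_element(By.NAME, "sort")).select_by_visible_text("Date")

html = driver.page_source  # the fully rendered page after login and interaction
soup = BeautifulSoup(html, "html.parser")
driver.quit()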
For 6 you can use their own APIs to get the list of news-feed items and their links (the returned object usually includes a link to each particular feed item); once you have the links you can use BeautifulSoup to get the web content.
Note: please do read each website's terms and conditions before scraping, because some of them describe automated data collection as unethical behavior, which we should not engage in as professionals.
Scrapy is for you if you are looking for a truly scalable, bulletproof solution. In fact, the Scrapy framework is an industry standard for Python crawling tasks.
By the way, I'd suggest you avoid JS rendering: all that stuff (chromedriver, Selenium, PhantomJS) is a last resort for crawling sites.
Most AJAX data can be fetched simply by forging the needed requests yourself.
Just spend more time in Chrome's "network" tab.
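For instance, a minimal sketch of forging an XHR request found in the Network tab (the endpoint, parameters and headers are placeholders; copy the real ones from the captured request):

import requests

# Placeholder endpoint and parameters -- copy them from the DevTools Network tab
url = "https://example.com/api/items"
params = {"page": 1, "sort": "date"}
headers = {
    "User-Agent": "Mozilla/5.0",
    "X-Requested-With": "XMLHttpRequest",  # many AJAX endpoints expect this
    "Referer": "https://example.com/items",
}

resp = requests.get(url, params=params, headers=headers)
data = resp.json()  # most AJAX endpoints return JSON directly
print(data)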
Is there a way to get all the items of a specific seller on Amazon?
When I try to submit requests using different forms of the store URL (the basic one is "https://www.amazon.com/shops/"), I get a 301 with no additional info.
Even before the spider itself, from the scrapy shell (some random shop on Amazon):
scrapy shell "https://www.amazon.com/shops/A3TJVJMBQL014A"
There is a 301 response code:
request <GET https://www.amazon.com/shops/A3TJVJMBQL014A>
response <301 https://www.amazon.com/shops/A3TJVJMBQL014A>
In the browser it will be redirected to https://www.amazon.com/s?marketplaceID=ATVPDKIKX0DER&me=A3TJVJMBQL014A&merchant=A3TJVJMBQL014A&redirect=true
Using the resulting URL also leads to a 301 response.
I was using scrapy shell which, as answered by @PadraicCunningham, doesn't follow the Location header.
Running the code from a spider resolved the issue.
Since you want a list of all goods sold by one specific seller, you can analyze the page of that seller specifically.
Here, I am going to take Kindle E-readers Seller as an example.
Open the console in your browser and inspect the max-page-count element on the seller's page; you can see that the seller's maximum page number sits inside a <span class="pagnLink"> </span> tag, so you can find this tag and extract the max page count from it.
You can see there is a slight change in the URL when you move to the next page of this seller's goods list (from page=1 to page=2), so you can easily construct a new URL whenever you want to move to the next page.
Set up a loop whose limit is the max page count you got in the first step.
Analyze the specific data you want from each page, work out which HTML tags it sits inside, and use some text-parsing libraries to help you extract it (re, BeautifulSoup, etc.).
Briefly, you have to analyze the page before writing code.
When you start coding, you should first make the requests, then get the responses, and then extract the useful data from each response (according to the rules you worked out before writing the code).
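A rough sketch of those steps (the listing URL pattern, the pagnLink markup and the title selector are assumptions based on the description above; Amazon's real markup changes often and usually requires a browser-like User-Agent):

import requests
from bs4 import BeautifulSoup

HEADERS = {"User-Agent": "Mozilla/5.0"}  # look like a browser
# Placeholder listing URL for the seller; the page number goes in the page parameter
BASE = "https://www.amazon.com/s?me=A3TJVJMBQL014A&page={page}"

# Step 1: read the max page count from the <span class="pagnLink"> element
first = BeautifulSoup(requests.get(BASE.format(page=1), headers=HEADERS).text, "html.parser")
pagn = first.find("span", class_="pagnLink")
max_pages = int(pagn.get_text(strip=True)) if pagn else 1

# Steps 2-4: loop over the pages and extract each item title (selector is an assumption)
for page in range(1, max_pages + 1):
    soup = BeautifulSoup(requests.get(BASE.format(page=page), headers=HEADERS).text, "html.parser")
    for title in soup.select("h2 a span"):
        print(title.get_text(strip=True))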