Scrapy's System Performance Requirement? - python

I am planning to create a website that compares the prices of a service offered by several companies.
The main idea is that a visitor enters their search criteria on the website and starts a search, and the crawl results are displayed on the site immediately rather than written out as a file.
I am new to Python and Scrapy and am not sure whether Scrapy can do this.
I will be running it daily, possibly several times a day, crawling 30+ websites. I'm afraid these searches may overload the server. Can shared web hosting support this kind of crawling? Are there any system performance requirements?
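For context, a crawl like this can return items in memory instead of writing a file. The sketch below is illustrative only: the spider name, target URL, and CSS selectors are placeholders, not taken from the question, and it runs the spider with Scrapy's CrawlerProcess.

# Minimal sketch: run a spider on demand and collect items in memory.
# Spider name, URL, and selectors are placeholders.
import scrapy
from scrapy.crawler import CrawlerProcess

results = []

class PriceSpider(scrapy.Spider):
    name = "price_spider"
    start_urls = ["https://example.com/prices"]  # hypothetical target

    def parse(self, response):
        for row in response.css(".price-row"):  # hypothetical selector
            item = {
                "company": row.css(".company::text").get(),
                "price": row.css(".price::text").get(),
            }
            results.append(item)  # keep results in memory, not a file
            yield item

process = CrawlerProcess(settings={"LOG_ENABLED": False})
process.crawl(PriceSpider)
process.start()  # blocks until the crawl finishes
print(results)

Note that CrawlerProcess can only be started once per Python process, so a web app would normally run crawls in a separate worker process rather than inside the request handler.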

Related

Can websites detect web scraping if I act like a human (Selenium, Python)?

I use Selenium in Python and want to scrape a lot of pages from one company's website (many hundreds). It shouldn't burden their system under any circumstances, and since this is a very large website anyway, it shouldn't be a problem for them.
My question is whether the company can somehow detect that I'm scraping if I behave like a human, i.e. I stay on each page for an unusually long time and let extra time pass between requests.
I don't think they can recognise me by my IP, because I spread this out over a very long period and I think it looks like normal traffic.
Are there any other ways websites can tell that I am web scraping or, more generally, running a script?
Many thanks
(P.S.: I know that a similar question has already been asked, but the answer there was simply that the asker didn't behave like a human and visited the website too quickly. My case is different...)
When you scrape, make sure you respect the robots.txt file located at the root of the website. It sets the rules for crawling: which parts of the website should not be scraped and how frequently it may be crawled.
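As a quick illustration (my example, not part of the original answer), Python's standard-library robotparser can check whether a given URL may be fetched; the site URL and user-agent string below are placeholders.

# Sketch: check robots.txt before fetching a URL (standard library only).
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://example.com/robots.txt")  # placeholder site
rp.read()

url = "https://example.com/some/page"
if rp.can_fetch("MyScraperBot/1.0", url):  # hypothetical user agent
    print("Allowed to fetch", url)
else:
    print("Disallowed by robots.txt:", url)

# Crawl-delay, if the site declares one (returns None otherwise):
print(rp.crawl_delay("MyScraperBot/1.0"))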
Large companies monitor user navigation patterns to detect bots and scraping attempts. There are many anti-scraping tools on the market that use AI to analyse these patterns and distinguish a human from a bot.
Apart from such monitoring software, some of the main techniques used to prevent scraping are:
Captchas,
Honeypot traps,
User-Agent (UA) monitoring,
IP monitoring,
JavaScript encryption/obfuscation, etc.
There are many more, so what I am saying is: yes, it can be detected.
One way they can tell is from your browser headers.
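To make that concrete (this example is mine, not the answerer's): the default headers a script sends usually differ from a real browser's, which is an easy giveaway. A minimal sketch with requests, using httpbin.org to echo back the headers it receives:

# Sketch: compare the default headers a script sends with browser-like headers.
import requests

url = "https://httpbin.org/headers"  # echoes back the request headers

# Default User-Agent will be something like "python-requests/2.x",
# which is trivially distinguishable from a real browser.
print(requests.get(url).json())

# Browser-like headers (values are illustrative, copied from a real browser session):
browser_headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
}
print(requests.get(url, headers=browser_headers).json())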

Efficient way to scrape images from website in Django/Python

First, I guess I should say I am still a bit of a Django/Python noob. I am in the midst of a project that allows users to enter a URL; the site scrapes the content from that page and returns the images over a certain size plus the page's title tag, so the user can then pick which image they want to use on their profile. A pretty standard scenario, I assume. I have this working by using Selenium (headless Chrome) to grab the destination page content, some Python to determine the file sizes, and a Django view that spits it all out into a template. I then have it coded so that the image the user selects is downloaded and stored locally.
However, I seriously doubt the scalability of this. It's currently just running locally, and I am very concerned about how it would cope if there were lots of users all running it at the same time. I am firing up that headless Chrome browser every time a request is made, which doesn't sound efficient, and I have to download each image just to determine its size and decide whether it's large enough. One example took 12 seconds from submitting the URL to displaying the results to the user, whereas the same destination URL put through www.kit.com (they have very similar web scraping functionality) took 3 seconds.
I have not provided any code because the code I have does what it should; I think the approach, however, is incorrect. To summarise, what I want is:
To allow a user to enter a URL and for it to return all images (or just the URLs to those images) from that page over a certain size (width/height), and the page title.
For this to be the most efficient solution, taking into account it would be run concurrently between many users at once.
For it to work in a Django (2.0) / Python (3+) environment.
I am not completely against using the API from a 3rd party service if one exists, but it would be my least preferred option.
Any help/pointers would be much appreciated.
You can use two Python solutions in your case:
1) BeautifulSoup - there is a good answer elsewhere on how to download images with it. You just have to wrap it in a separate function and pass the site in as an argument. It is also very easy to parse only the image links, as you said - it depends on the speed you need (obviously scraping the files themselves, especially when there are a lot of them, is much slower than scraping just the links). This tool is only for parsing and scraping the content of a page; see the sketch after this list.
2) Scrapy - this is a much more powerful tool, a full framework. With it you can connect your spiders to Django models and handle images much more efficiently using its built-in image pipeline. It is much more flexible, with a lot of features for working with scraped data. I am not sure whether you need it for your project or whether it would be overkill in your case.
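A minimal sketch of the BeautifulSoup route (my own example, not from any linked answer): fetch the page, grab the title, collect the image URLs, and filter them by dimensions. Pillow is assumed here for reading each image's size.

# Sketch: return the page title and the URLs of images above a minimum size.
# Uses requests + BeautifulSoup + Pillow; all assumed to be installed.
from io import BytesIO
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup
from PIL import Image

def scrape_images(page_url, min_width=200, min_height=200):
    resp = requests.get(page_url, timeout=10)
    soup = BeautifulSoup(resp.text, "html.parser")

    title = soup.title.string.strip() if soup.title and soup.title.string else ""
    large_images = []

    for img in soup.find_all("img", src=True):
        img_url = urljoin(page_url, img["src"])  # resolve relative URLs
        try:
            img_resp = requests.get(img_url, timeout=10)
            width, height = Image.open(BytesIO(img_resp.content)).size
        except Exception:
            continue  # skip images that fail to download or decode
        if width >= min_width and height >= min_height:
            large_images.append(img_url)

    return title, large_images

Note this still downloads every image to measure it; checking the width/height attributes in the HTML first, where they exist, avoids some of those downloads.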
Also, my advice is to run the spider in a background task queue such as Celery and fetch the result via AJAX, because parsing the content may take some time - don't make the user wait for the response.
P.S. You can even combine those two tools in some cases :)
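To illustrate the background-task advice (a sketch with made-up names, not the asker's code): a Celery task does the slow scraping, one Django view starts it and returns a task id, and a second view lets the front end poll for the result via AJAX.

# tasks.py - offload the scrape to a Celery worker (names are illustrative)
from celery import shared_task

@shared_task
def scrape_task(page_url):
    # scrape_images() is the hypothetical function sketched above
    title, images = scrape_images(page_url)
    return {"title": title, "images": images}

# views.py - start the task, then let the client poll until it finishes
from celery.result import AsyncResult
from django.http import JsonResponse

def start_scrape(request):
    task = scrape_task.delay(request.GET["url"])
    return JsonResponse({"task_id": task.id})

def scrape_status(request, task_id):
    result = AsyncResult(task_id)
    if result.ready():
        return JsonResponse({"done": True, "data": result.result})
    return JsonResponse({"done": False})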

Can I scrape all URL results using Python from a google search without getting blocked?

I realize that versions of this question have been asked and I spent several hours the other day trying a number of strategies.
What I would like to do is use Python to scrape all of the URLs from a Google search so I can use them in a separate script to do text analysis of a large corpus (mainly news sites). This seems relatively straightforward, but none of the attempts I've tried have worked properly.
This is as close as I got:
from google import search
for url in search('site:cbc.ca "kinder morgan" and "trans mountain" and protest*', stop=100):
    print(url)
This returned about 300 URLs before I got kicked. An actual search using these parameters provides about 1000 results and I'd like all of them.
First: is this possible? Second: does anyone have any suggestions to do this? I basically just want a txt file of all the URLs that I can use in another script.
It seems that this package uses screen scraping to retrieve search results from Google, so it doesn't play well with Google's Terms of Service, which could be the reason you've been blocked.
The relevant clause in Google's Terms of Service:
Don’t misuse our Services. For example, don’t interfere with our Services or try to access them using a method other than the interface and the instructions that we provide. You may use our Services only as permitted by law, including applicable export and re-export control laws and regulations. We may suspend or stop providing our Services to you if you do not comply with our terms or policies or if we are investigating suspected misconduct.
I haven't been able to find a definite number, but it seems their limit on the number of search queries per day is rather strict too - 100 search queries per day, according to the JSON Custom Search API documentation here.
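If the Custom Search JSON API route is acceptable, a minimal sketch looks like this (my example; you need your own API key and custom search engine ID, and the 100-queries/day free quota mentioned above still applies):

# Sketch: fetch result URLs via Google's Custom Search JSON API.
# API_KEY and CX are placeholders for your own credentials.
import requests

API_KEY = "YOUR_API_KEY"
CX = "YOUR_SEARCH_ENGINE_ID"

def search_urls(query, pages=3):
    urls = []
    for start in range(1, pages * 10, 10):  # the API returns 10 results per page
        resp = requests.get(
            "https://www.googleapis.com/customsearch/v1",
            params={"key": API_KEY, "cx": CX, "q": query, "start": start},
            timeout=10,
        )
        for item in resp.json().get("items", []):
            urls.append(item["link"])
    return urls

for url in search_urls('site:cbc.ca "kinder morgan" "trans mountain" protest'):
    print(url)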
Nonetheless, there's no harm trying out other alternatives to see if they work better:
BeautifulSoup
Scrapy
ParseHub - this one isn't a code library, but it is a useful piece of software with good documentation. Link to their tutorial on how to scrape a list of URLs.

Scraping site with limited connections

I have a python script, which scrapes some information from some site.
This site has a daily limit of 20 connections.
So I decided to use the requests module with the "proxies" parameter specified.
After a couple of hours testing different "proxy-list" sites, I found one and have been pulling proxies from http://free-proxy-list.net/.
It seems this site doesn't update its list very often, and after testing my script I've burned through all the proxies and can't access the target site anymore.
All this searching has left me exhausted, and I feel like my script completely sucks.
Is there any way I can avoid being detected by the site, or do I just need to find another list of proxies? If there are sites with fresh, daily-updated proxy lists, please let me know.
P.S. I have often stumbled upon sites like https://hide.me where I just enter a link and it gives me full access. Maybe I can just code this in Python? If it's possible, please show me how.
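For reference, the requests-with-proxies approach described above typically looks something like this minimal sketch; the proxy addresses and the target URL are placeholders and would have to come from a live, up-to-date list.

# Sketch: rotate through a list of proxies with requests.
# The proxy addresses below are placeholders, not working proxies.
import requests

proxies_list = [
    "http://203.0.113.10:8080",
    "http://203.0.113.11:3128",
]

def fetch_with_proxies(url):
    for proxy in proxies_list:
        try:
            resp = requests.get(
                url,
                proxies={"http": proxy, "https": proxy},
                timeout=5,
            )
            if resp.ok:
                return resp.text
        except requests.RequestException:
            continue  # dead or blocked proxy, try the next one
    raise RuntimeError("All proxies failed")

html = fetch_with_proxies("https://example.com/data")  # placeholder URL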

How to search for some specific links(which may be present in a pdf file) in a website and crawl those links for other information?

I have a task to complete. I need to make a web-crawler-like application. I pass a URL to my application; this URL is the website of a government agency. The site also contains links to the individual agencies approved by that government agency. I need to follow those links and collect some information about each agency from its site. I hope I've made myself clear. Now, I have to make this application generic, meaning I can't hard-code it for just one website (government agency). Given any such URL, it should check it, gather all the relevant links, and proceed. On some websites these links are inside PDFs, and on others they appear on a page.
I have to use Python for this and I don't know how to approach it. I spent some time on it using BeautifulSoup, but that requires a lot of parsing. Other options are Scrapy or twill. Honestly, I am new to Python and don't know which one is better for this task. Can anyone help me choose the right tool and the right approach to solve this problem? Thanks in advance.
There is plenty of information out there about building web scrapers with Python. Python is a great tool for the job.
There are also tons of posts about web scrapers on this website if you search for them.
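Since that answer stays general, here is a minimal sketch of the two pieces the question asks about (my example, not the answerer's): collecting links from an HTML page with BeautifulSoup, and pulling URLs out of a linked PDF. pdfminer.six is assumed for the PDF text extraction, and the regex only finds URLs that are written out in the PDF's text.

# Sketch: gather links from an HTML page, and extract URLs from linked PDFs.
# Assumes requests, beautifulsoup4, and pdfminer.six are installed.
import io
import re
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup
from pdfminer.high_level import extract_text

def links_from_page(page_url):
    soup = BeautifulSoup(requests.get(page_url, timeout=10).text, "html.parser")
    return [urljoin(page_url, a["href"]) for a in soup.find_all("a", href=True)]

def urls_from_pdf(pdf_url):
    pdf_bytes = requests.get(pdf_url, timeout=10).content
    text = extract_text(io.BytesIO(pdf_bytes))
    # Only catches URLs written out in the text, not embedded link annotations.
    return re.findall(r"https?://[^\s)>\]]+", text)

start_url = "https://agency.example.gov/"  # placeholder government-agency URL
for link in links_from_page(start_url):
    if link.lower().endswith(".pdf"):
        print(urls_from_pdf(link))  # links buried inside a PDF
    else:
        print(link)                 # ordinary page link to follow next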
