I am basically trying to start an HTTP server that responds with content from a website I crawl using Scrapy. To start crawling the website I need to log in to it, and to do so I need to access a DB with credentials and such. The main issue is that I need everything to be fully asynchronous, and so far I am struggling to find a combination that makes everything work properly without a lot of sloppy workarounds.
I already have Klein + Scrapy working, but when I get to implementing DB access I get all mixed up. Is there any way to make PyMongo asynchronous with Twisted or something similar? (Yes, I have seen TxMongo, but the documentation is quite bad and I would like to avoid it. I have also found an implementation with adbapi, but I would like something more similar to PyMongo.)
Thinking things through the other way around, I'm sure aiohttp has many more options for implementing async DB access, but then I find myself at an impasse with Scrapy integration.
I have seen things like scrapa, scrapyd and ScrapyRT but those don't really work for me. Are there any other options?
Finally, if nothing works, I'll just use aiohttp and, instead of Scrapy, make the requests to the website I want to scrape manually and use BeautifulSoup or something like that to get the info I need from the response. Any advice on how to proceed down that road?
Thanks for your attention, I'm quite a noob in this area so I don't know if I'm making complete sense. Regardless, any help will be appreciated :)
Is there any way to make PyMongo asynchronous with Twisted
No. PyMongo is designed as a synchronous library, and there is no way to make it asynchronous without basically rewriting it (you could use threads or processes, but that is not what you asked; you can also run into issues with the thread safety of the code).
Thinking things through the other way around, I'm sure aiohttp has many more options for implementing async DB access
It doesn't. aiohttp is an HTTP library - it can do HTTP asynchronously and that is all; it has nothing to help you access databases. You'd basically have to rewrite PyMongo on top of it.
Finally, if nothing works, I'll just use aiohttp and, instead of Scrapy, make the requests to the website I want to scrape manually and use BeautifulSoup or something like that to get the info I need from the response.
That means a lot of work just to avoid Scrapy, and it won't help you with the PyMongo issue - you would still have to rewrite PyMongo!
My suggestion is: learn TxMongo! If you can't, and you want to rewrite it yourself, use twisted.web instead of aiohttp, since then you can keep using Scrapy!
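For what it's worth, basic TxMongo usage ends up looking quite close to PyMongo once you get used to Deferreds. Here's a minimal sketch, assuming a MongoDB on localhost, TxMongo's MongoConnection helper, and a hypothetical "credentials" collection (all names are illustrative):

from twisted.internet import defer, reactor
import txmongo

@defer.inlineCallbacks
def get_credentials():
    # connect asynchronously; every call returns a Deferred instead of blocking
    connection = yield txmongo.MongoConnection()      # defaults to localhost:27017
    db = connection.mydb                              # hypothetical database name
    creds = yield db.credentials.find_one({"site": "example.com"})
    defer.returnValue(creds)

if __name__ == "__main__":
    d = get_credentials()
    d.addCallback(print)
    d.addBoth(lambda _: reactor.stop())
    reactor.run()

Inside a Klein route handler you can simply return (or yield) such a Deferred, which is what keeps the whole Klein + Scrapy + DB chain non-blocking.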
Related
So, as the title indicates, I'm trying to understand the simplest way to create a REST API web service that runs locally on my PC and can handle requests in parallel (just using my several local cores).
I was checking out Flask, but I'm not sure that's the right option. I know that by default it runs as a single instance, and that it has a "processes" parameter, which doesn't work for me and doesn't seem to be the solution.
I'm looking for something simple, but I couldn't find anything.
All I want is to be able to receive REST requests and handle them in parallel, using a process pool for example.
Note: I'm using Windows, so Gunicorn isn't an option, I think.
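For reference, one simple pattern that fits this description is Flask's built-in server with threaded=True accepting requests concurrently, while a concurrent.futures process pool does the CPU-bound work across cores. A rough sketch with illustrative names (on Windows the pool has to be created under the __main__ guard; for something more production-like, waitress is a pure-Python WSGI server that runs fine on Windows):

from concurrent.futures import ProcessPoolExecutor
from flask import Flask, jsonify

app = Flask(__name__)
executor = None  # created under the __main__ guard because of Windows's spawn start method

def heavy_work(n):
    # stand-in for whatever CPU-bound handling each request needs
    return sum(i * i for i in range(n))

@app.route("/compute/<int:n>")
def compute(n):
    # this request's thread waits while a worker process does the heavy part
    result = executor.submit(heavy_work, n).result()
    return jsonify(result=result)

if __name__ == "__main__":
    executor = ProcessPoolExecutor(max_workers=4)
    app.run(threaded=True)  # the dev server then accepts several requests at once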
Recently I decided to port a project from requests to grequests, to make asynchronous HTTP requests for efficiency. One problem I face now is that in my previous code, when there was a redirect, I could handle it with the following snippet:
import requests

req = requests.get('a domain')
if req.history:
    print(req.url)  # that would be the final URL after the redirect
That way I could retrieve the last-visited URL after the redirect.
So my question is whether it's possible to achieve something similar with grequests.
Reading the docs, I didn't manage to find a solution.
Thanks in advance!
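For what it's worth, grequests.map() gives back ordinary requests.Response objects, so the history/url attributes from the snippet above should behave the same. A small sketch (the URL is a placeholder, just like in the original snippet):

import grequests

reqs = [grequests.get('a domain')]          # placeholder URL, as above
responses = grequests.map(reqs)

for resp in responses:
    if resp is not None and resp.history:   # map() returns None for failed requests
        print(resp.url)                     # final URL after the redirects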
I'm using the Unirest library for making async web requests with Python. I've read the documentation, but I wasn't able to find out whether I can use a proxy with it. Maybe I'm just blind and there is a way to do it with Unirest?
Or is there some other way to specify a proxy in Python? The proxies need to be changed from the script itself after making some requests, so whatever approach I use has to allow that.
Thanks in advance.
I know nothing about Unirest, but in all the scripts I wrote that required proxy support I used the SocksiPy module (http://socksipy.sourceforge.net). It supports HTTP, SOCKS4 and SOCKS5 proxies and it's really easy to use. :)
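If it helps, classic SocksiPy usage is a socket-level monkey patch, so it applies to whatever HTTP client sits on top of the standard socket module. A rough sketch, assuming the classic setdefaultproxy API and a placeholder proxy address:

import socket
import socks  # SocksiPy

# route every new socket through a SOCKS5 proxy (placeholder host/port)
socks.setdefaultproxy(socks.PROXY_TYPE_SOCKS5, "127.0.0.1", 1080)
socket.socket = socks.socksocket

# any library built on the socket module now goes through that proxy;
# to switch proxies mid-script, call setdefaultproxy again with new values
# before opening new connections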
Would something like this [1] work for you?
[1] https://github.com/obriencj/python-promises
I would like to do a very simple thing, but I keep having trouble getting it to work.
I have a client and a server, both running Python. At a certain point in the code, the client needs to send a picture to the server, and the server uses Python to receive the picture, make some modifications to it, and then save it to disk.
How can I achieve this in the easiest way possible? Is Django a good idea?
My problem is that I keep getting an error from the Django server side, and it seems to be because I am not managing the cookies.
Can someone give me sample code for the client and for the server to authenticate and then send the file to the server over HTTPS?
Also, if you think it is best to use something other than Django, your comments are welcome :). In fact, I managed to get it working very easily with a Python client and a PHP server, but since I have to process everything in Python on the server, I would have preferred not to install Apache, PHP, ... and to use only Python to receive the picture as well.
Many thanks for your help,
John.
You don't need Django - a web framework - for this unless you really want Django's features (which, to sum them up, are "a bunch of website stuff").
You'd probably be best off just using something to transmit data over the network. There are a lot of ways to do this!
If your data is all local (same network) you can use something like ZeroMQ.
If you are not sure whether your traffic will stay local, or if you know it won't, you can use plain HTTP without a full web framework - the Requests library is great for the client side.
In both these scenarios, you'd need to have a "client" and a "server" which you already have a good handle on.
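On the HTTP route, the client side can stay very small with Requests: a Session keeps whatever cookies the server sets during login (which sounds like the error in the question), and a multipart POST carries the picture over HTTPS. A sketch with hypothetical login/upload endpoints and credentials:

import requests

LOGIN_URL = "https://example.com/login"    # hypothetical endpoints
UPLOAD_URL = "https://example.com/upload"

with requests.Session() as session:
    # the session stores the cookies set by the login response
    session.post(LOGIN_URL, data={"username": "john", "password": "secret"})

    # send the picture as a multipart/form-data upload
    with open("picture.jpg", "rb") as picture:
        response = session.post(UPLOAD_URL, files={"picture": picture})
    print(response.status_code)

On the server side, any small Python framework (or Django, if you do end up wanting it) just has to read the uploaded file from the request, modify it, and write it to disk.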
When scraping a site, which is preferable: using curl, or using Python's requests library?
I originally planned to use requests and explicitly specify a user agent. However, when I do this I often get an "HTTP 429 Too Many Requests" error, whereas curl seems to avoid that.
I need to update metadata information on 10,000 titles, and I need a way to pull down the information for each title in a parallelized fashion.
What are the pros and cons of using each for pulling down information?
Since you want to parallelize the requests, you should use requests with grequests (if you're using gevent) or erequests (if you're using eventlet). You may have to throttle how quickly you hit the website, though, since they may do some rate limiting and refuse you for requesting too much in too short a period of time.
Using requests would allow you to do it programmatically, which should result in a cleaner product.
If you use curl, you're shelling out with os.system calls, which is slower.
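As a sketch of the grequests approach: grequests.map() accepts a size argument that caps how many requests are in flight at once, which is a simple way to stay under the site's rate limit (the URLs and user agent are placeholders):

import grequests

urls = ["http://example.com/title/%d" % i for i in range(10000)]   # placeholder URLs
reqs = (grequests.get(u, headers={"User-Agent": "metadata-updater"}) for u in urls)

# size=5 keeps at most five requests running at a time, which helps avoid
# the HTTP 429 responses mentioned in the question
responses = grequests.map(reqs, size=5)

for resp in responses:
    if resp is not None:            # map() returns None for requests that failed
        print(resp.status_code, resp.url)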
I'd go for the in-language version over an external program any day, because it's less hassle.
Only if that turned out to be unworkable would I fall back to shelling out. Always consider that people's time is infinitely more valuable than machine time; any "performance gains" in such an application will probably be swamped by network delays anyway.