How to set IMAGES_STORE folder per Item in Scrapy 1.5 - python

Scrapy 1.5 allows setting an IMAGES_STORE setting for storing all downloaded media, as explained in the documentation.
I would like to be able to specify a custom folder per Item, based on some values in the Item. Not knowing much about the internals of Scrapy, I am not sure exactly which methods to override to accomplish this.
I thought about overriding from_settings(cls, settings), but there I do not have access to the Item yet.
Any ideas?

I solved the issue by overriding the file_path method. So in IMAGES_STORE I have the base path, and I control the variable part from file_path, something like the code below. Note that I had a typo at first, and Scrapy silently ignored it without printing any errors, even in debug mode; I don't know why, so it is best to start with a simple string for testing.
import hashlib

from scrapy.utils.python import to_bytes

def file_path(self, request, response=None, info=None):
    url = request.url
    image_guid = hashlib.sha1(to_bytes(url)).hexdigest()
    return '%s/full/%s.jpg' % ('my_custom_path', image_guid)
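To make the folder vary per Item rather than being a fixed string, one option is to pass the relevant Item value along in request.meta from get_media_requests and read it back in file_path. Below is a minimal sketch, assuming a hypothetical folder field on the item; the pipeline name is made up for illustration:

import hashlib

from scrapy import Request
from scrapy.pipelines.images import ImagesPipeline
from scrapy.utils.python import to_bytes

class PerItemImagesPipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        # carry the item's folder value along with each image request
        for url in item.get('image_urls', []):
            yield Request(url, meta={'folder': item.get('folder', 'default')})

    def file_path(self, request, response=None, info=None):
        image_guid = hashlib.sha1(to_bytes(request.url)).hexdigest()
        # the folder chosen in get_media_requests becomes part of the path
        return '%s/full/%s.jpg' % (request.meta['folder'], image_guid)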

Related

Get all parameters and their values in a request in Django [duplicate]

I am currently defining regular expressions in order to capture parameters in a URL, as described in the tutorial. How do I access parameters from the URL as part of the HttpRequest object?
My HttpRequest.GET currently returns an empty QueryDict object.
I'd like to learn how to do this without a library, so I can get to know Django better.
When a URL is like domain/search/?q=haha, you would use request.GET.get('q', '').
q is the parameter you want, and '' is the default value if q isn't found.
However, if you are instead just configuring your URLconf, then your captures from the regex are passed to the function as arguments (or named arguments).
Such as:
(r'^user/(?P<username>\w{0,50})/$', views.profile_page,),
Then in your views.py you would have
def profile_page(request, username):
    # Rest of the method
To clarify camflan's explanation, let's suppose you have
the rule url(regex=r'^user/(?P<username>\w{1,50})/$', view='views.profile_page')
an incoming request for http://domain/user/thaiyoshi/?message=Hi
The URL dispatcher rule will catch parts of the URL path (here "user/thaiyoshi/") and pass them to the view function along with the request object.
The query string (here message=Hi) is parsed and parameters are stored as a QueryDict in request.GET. No further matching or processing for HTTP GET parameters is done.
This view function would use both parts extracted from the URL path and a query parameter:
def profile_page(request, username=None):
    user = User.objects.get(username=username)
    message = request.GET.get('message')
As a side note, you'll find the request method (in this case "GET", and for submitted forms usually "POST") in request.method. In some cases, it's useful to check that it matches what you're expecting.
Update: When deciding whether to use the URL path or the query parameters for passing information, the following may help:
use the URL path for uniquely identifying resources, e.g. /blog/post/15/ (not /blog/posts/?id=15)
use query parameters for changing the way the resource is displayed, e.g. /blog/post/15/?show_comments=1 or /blog/posts/2008/?sort_by=date&direction=desc
to make human-friendly URLs, avoid using ID numbers and use e.g. dates, categories, and/or slugs: /blog/post/2008/09/30/django-urls/
Using GET
request.GET["id"]
Using POST
request.POST["id"]
You might wonder how to set up the path in the urls.py file for a URL such as
domain/search/?q=CA
so that the query can be invoked. The fact is that it is not necessary to include the query string in the route; you only need to set the base route in urls.py:
urlpatterns = [
    path('domain/search/', views.CityListView.as_view()),
]
When you input http://servername:port/domain/search/?q=CA, the query part '?q=CA' is automatically stored in a QueryDict, which you can access through
request.GET.get('q', None).
Here is an example (file views.py)
class CityListView(generics.ListAPIView):
    serializer_class = CityNameSerializer

    def get_queryset(self):
        if self.request.method == 'GET':
            queryset = City.objects.all()
            state_name = self.request.GET.get('q', None)
            if state_name is not None:
                queryset = queryset.filter(state__name=state_name)
            return queryset
In addition, when you write the query string in the URL:
http://servername:port/domain/search/?q=CA
do not wrap the query string in quotes. For example, avoid:
http://servername:port/domain/search/?q="CA"
def some_view(request, *args, **kwargs):
    if kwargs.get('q', None):
        pass  # Do something here ..
For situations where you only have the request object, you can use request.parser_context['kwargs']['your_param'].
There are two common ways to do that, in case your URL looks like this:
https://domain/method/?a=x&b=y
Version 1:
If a specific key is mandatory you can use:
key_a = request.GET['a']
This will return the value of a if the key exists, and raise an exception if not.
Version 2:
If your keys are optional:
request.GET.get('a')
You can call this without any default argument, and it will not crash.
With the first version, you can wrap the lookup in try: / except: and return HttpResponseBadRequest(), for example.
This is a simple way to make your code less complex, without special exception handling.
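A short sketch putting both versions into one view; the view name and parameter names are made up for illustration:

from django.http import HttpResponse, HttpResponseBadRequest

def my_view(request):
    # Version 1: mandatory key, with explicit error handling
    try:
        key_a = request.GET['a']
    except KeyError:
        return HttpResponseBadRequest("missing required parameter 'a'")

    # Version 2: optional key, never raises
    key_b = request.GET.get('b')

    return HttpResponse('a=%s, b=%s' % (key_a, key_b))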
I would like to share a tip that may save you some time.
If you plan to use something like this in your urls.py file:
url(r'^(?P<username>\w+)/$', views.profile_page,),
which basically means www.example.com/<username>. Be sure to place it at the end of your URL entries, because otherwise it is prone to conflict with the URL entries that follow it, i.e. accessing one of them will give you the nice error: User matching query does not exist.
I've just experienced it myself; hope it helps!
These queries are currently done in two ways. If you want to access the query parameters (GET) you can query the following:
http://myserver:port/resource/?status=1
request.query_params.get('status', None) => 1
If you want to access the parameters passed by POST, you need to access this way:
request.data.get('role', None)
Accessing the dictionary (QueryDict) with get(), you can set a default value. In the cases above, if 'status' or 'role' is not provided, the value is None.
If you don't know the names of the parameters and want to work with them all, you can use request.GET.keys() or dict(request.GET).
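For instance, here is a throwaway sketch of a view that echoes every query parameter it receives (the view name is hypothetical); getlist() is used because a key can repeat, as in ?tag=a&tag=b:

from django.http import JsonResponse

def echo_params(request):
    # getlist() keeps every value for a repeated key
    params = {key: request.GET.getlist(key) for key in request.GET.keys()}
    return JsonResponse(params)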
If you only have access to the view object, then you can get the parameters defined in the URL path this way:
view.kwargs.get('url_param')
If you only have access to the request object, use the following:
request.resolver_match.kwargs.get('url_param')
Tested on Django 3.
views.py
from rest_framework.decorators import api_view
from rest_framework.response import Response

# @api_view is required for a plain function to return a DRF Response
@api_view(['GET'])
def update_product(request, pk):
    return Response({"pk": pk})
pk means primary key.
urls.py
from products.views import update_product
from django.urls import path
urlpatterns = [
    ...,
    path('update/products/<int:pk>', update_product)
]
You might as well check the request.META dictionary to access many useful things, like PATH_INFO and QUERY_STRING:
# for example
request.META['QUERY_STRING']
# or to avoid any exceptions provide a fallback
request.META.get('QUERY_STRING', False)
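If you prefer to turn the raw QUERY_STRING back into a dictionary yourself, the standard library can parse it; a minimal sketch (the view name is hypothetical):

from urllib.parse import parse_qs

from django.http import JsonResponse

def show_query(request):
    raw = request.META.get('QUERY_STRING', '')
    # parse_qs('q=CA&tag=a&tag=b') -> {'q': ['CA'], 'tag': ['a', 'b']}
    return JsonResponse(parse_qs(raw))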
You said that it returns an empty QueryDict. I think you need to tune your URL to accept required or optional args or kwargs.
Django gives you all the power you need with regex, like:
url(r'^project_config/(?P<product>\w+)/$', views.foo),
More about this at django-optional-url-parameters.
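A common way to make such a parameter optional is to map two patterns onto the same view and give the view a default; a sketch in the same regex style (the view body is made up):

from django.conf.urls import url
from django.http import HttpResponse

def foo(request, product=None):
    # product is None when the shorter pattern matched
    return HttpResponse('product = %s' % product)

urlpatterns = [
    url(r'^project_config/$', foo),
    url(r'^project_config/(?P<product>\w+)/$', foo),
]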
This is another alternate solution that can be implemented:
In the URL configuration:
urlpatterns = [path('runreport/<str:queryparams>', views.get)]
In the views:
list2 = queryparams.split("&")
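Filled out into a complete (if unusual) sketch: here the query-like string travels inside the URL path, so the view has to split it apart itself. The view name and response format are made up:

from django.http import JsonResponse

def runreport(request, queryparams):
    # 'a=1&b=2' -> {'a': '1', 'b': '2'}
    pairs = dict(
        part.split('=', 1)
        for part in queryparams.split('&')
        if '=' in part
    )
    return JsonResponse(pairs)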
URL parameters may also be captured by request.query_params.
It seems more recommended to use request.query_params. For example, when a URL is like domain/search/?q=haha, you would use request.query_params.get('q', None).
https://www.django-rest-framework.org/api-guide/requests/
"request.query_params is a more correctly named synonym for request.GET.
For clarity inside your code, we recommend using request.query_params instead of the Django's standard request.GET. Doing so will help keep your codebase more correct and obvious - any HTTP method type may include query parameters, not just GET requests."

Issue with creating/retrieving cookies in Flask

When the class AnonUser is initialized, the code should check if a cookie exists and create a new one if it doesn't. The relevant code snippet is the following:
class AnonUser(object):
    """Anonymous/non-logged in user handling"""
    cookie_name = 'anon_user_v1'

    def __init__(self):
        self.cookie = request.cookies.get(self.cookie_name)
        if self.cookie:
            self.anon_id = verify_cookie(self.cookie)
        else:
            self.anon_id, self.cookie = create_signed_cookie()
            res = make_response()
            res.set_cookie(self.cookie_name, self.cookie)
For some reason, request.cookies.get(self.cookie_name) always returns None. Even if I log "request.cookies" immediately after res.set_cookie, the cookie is not there.
The strange thing is that this code works on another branch with identical code and, as far as I can tell, identical configuration settings (it's not impossible I'm missing something, but I've been searching for the past couple hours for any difference with no luck). The only thing different seems to be the domain.
Does anyone know why this might happen?
I figured out what the problem was. I was apparently wrong about it working on the other branch; for whatever reason, it would work if the anonymous user already had some saved collections (which is what the cookies are used for). I'm still not sure why that is, but the following ended up resolving the issue:
@app.after_request
def set_cookie(response):
    if not request.cookies.get(g.cookie_session.cookie_name):
        response.set_cookie(g.cookie_session.cookie_name, g.cookie_session.cookie)
    return response
The main things I needed to do were to import request from flask, and to realize that I could reference the cookie and cookie name by simply referring to the anonymous user ("cookie_session") class where they were set.
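The underlying gotcha is worth spelling out: request.cookies only contains what the client sent with the current request, so a cookie set on a response object does not appear there until the next request. A self-contained sketch illustrating the round trip (the route and cookie name are made up):

from flask import Flask, make_response, request

app = Flask(__name__)

@app.route('/')
def index():
    # request.cookies holds what the browser sent, not what we set below
    visits = int(request.cookies.get('visits', '0')) + 1
    resp = make_response('visit number %d' % visits)
    # this cookie only shows up in request.cookies on the *next* request
    resp.set_cookie('visits', str(visits))
    return resp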

Scrapy: overwrite DEPTH_LIMIT variable based on value read from custom config

I am using InitSpider and read a custom JSON configuration within the def __init__(self, *a, **kw): method.
The JSON config file contains a directive with which I can control the crawling depth. I can already successfully read this configuration file and extract the value. The main problem is how to tell Scrapy to use this value.
Note: I don't want to use a command line argument such as -s DEPTH_LIMIT=3; I actually want to parse it from my custom configuration.
DEPTH_LIMIT is used in scrapy.spidermiddlewares.depth.DepthMiddleware. If you have a quick look at the code, you'll see that the DEPTH_LIMIT value is read only when initializing that middleware.
I think this might be a good solution for you:
In the __init__ method of your spider, set a spider attribute max_depth with your custom value.
Override scrapy.spidermiddlewares.depth.DepthMiddleware and have it check the max_depth attribute.
Disable the default DepthMiddleware and enable your own one in the settings (see the settings sketch after the example below).
See also http://doc.scrapy.org/en/latest/topics/spider-middleware.html
A quick example of the overridden middleware described in step #2:
from scrapy.spidermiddlewares.depth import DepthMiddleware

class MyDepthMiddleware(DepthMiddleware):
    def process_spider_output(self, response, result, spider):
        if hasattr(spider, 'max_depth'):
            self.maxdepth = getattr(spider, 'max_depth')
        return super(MyDepthMiddleware, self).process_spider_output(response, result, spider)
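Step #3 would then be a settings change along these lines; the module path myproject.middlewares is an assumption about where the class above lives:

# settings.py
SPIDER_MIDDLEWARES = {
    # disable the stock depth middleware...
    'scrapy.spidermiddlewares.depth.DepthMiddleware': None,
    # ...and enable the override at the same position (900 is the default)
    'myproject.middlewares.MyDepthMiddleware': 900,
}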

Routes with trailing slashes in Pyramid

Let's say I have a route '/foo/bar/baz'.
I would also like to have another view corresponding to '/foo' or '/foo/'.
But I don't want to systematically append trailing slashes to other routes, only to /foo and a few others (/buz but not /biz).
From what I saw, I cannot simply define two routes with the same route_name.
I currently do this:
config.add_route('foo', '/foo')
config.add_route('foo_slash', '/foo/')
config.add_view(lambda _,__: HTTPFound('/foo'), route_name='foo_slash')
Is there something more elegant in Pyramid to do this?
Pyramid has a way for HTTPNotFound views to automatically append a slash and test the routes again for a match (the way Django's APPEND_SLASH=True works). Take a look at:
http://docs.pylonsproject.org/projects/pyramid/en/latest/narr/urldispatch.html#redirecting-to-slash-appended-routes
As per this example, you can use config.add_notfound_view(notfound, append_slash=True), where notfound is a function that defines your HTTPNotFound view. If a view is not found (because it didn't match due to a missing slash), the HTTPNotFound view will append a slash and try again. The example shown in the link above is pretty informative, but let me know if you have any additional questions.
Also, heed the warning that this should not be used with POST requests.
There are also many ways to skin a cat in Pyramid, so you can play around and achieve this in different ways too, but you have the concept now.
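For reference, a minimal imperative-style sketch of the append_slash setup described above, based on the linked documentation (the notfound view body is just a placeholder):

from pyramid.config import Configurator
from pyramid.httpexceptions import HTTPNotFound

def notfound(request):
    return HTTPNotFound('Not found')

config = Configurator()
config.add_route('foo', '/foo')
# if no route matches, Pyramid retries with a trailing slash appended
# and issues a redirect when that version does match
config.add_notfound_view(notfound, append_slash=True)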
I found this solution when I was looking for the same thing for my project:
from pyramid.httpexceptions import HTTPMovedPermanently

def add_auto_route(config, name, pattern, **kw):
    config.add_route(name, pattern, **kw)
    if not pattern.endswith('/'):
        config.add_route(name + '_auto', pattern + '/')
        def redirector(request):
            return HTTPMovedPermanently(request.route_url(name))
        config.add_view(redirector, route_name=name + '_auto')
And then during route configuration,
add_auto_route(config, 'events', '/events')
rather than doing config.add_route('events', '/events').
Basically, it is a hybrid of your methods. A new route with a name ending in _auto is defined, and its view redirects to the original route.
EDIT
This solution does not take dynamic URL components and GET parameters into account. For a URL like /abc/{def}?m=aasa, using add_auto_route() will throw a KeyError, because the redirector function does not take request.matchdict into account. The code below does; to access the GET parameters it also uses _query=request.GET.
def add_auto_route(config, name, pattern, **kw):
    config.add_route(name, pattern, **kw)
    if not pattern.endswith('/'):
        config.add_route(name + '_auto', pattern + '/')
        def redirector(request):
            return HTTPMovedPermanently(request.route_url(name, _query=request.GET, **request.matchdict))
        config.add_view(redirector, route_name=name + '_auto')
I found another solution. It looks like we can stack two @view_config decorators, so this is possible:
@view_config(route_name='foo_slash', renderer='myproject:templates/foo.mako')
@view_config(route_name='foo', renderer='myproject:templates/foo.mako')
def foo(request):
    pass  # do something
Its behavior also differs from the approach in the question: the question's solution performs a redirect, so the URL changes in the browser, whereas here both /foo and /foo/ can appear in the browser, depending on what the user entered. I don't really mind, but repeating the renderer path is also awkward.

How to access scrapy settings from item Pipeline

How do I access the Scrapy settings in settings.py from the item pipeline? The documentation mentions they can be accessed through the crawler in extensions, but I don't see how to access the crawler in the pipelines.
UPDATE (2021-05-04)
Please note that this answer is now ~7 years old, so its validity can no longer be ensured. In addition, it uses Python 2.
The way to access your Scrapy settings (as defined in settings.py) from within your_spider.py is simple. All other answers are way too complicated. The reason for this is the very poor maintenance of the Scrapy documentation, combined with many recent updates and changes. Neither in the "Settings" documentation "How to access settings", nor in the "Settings API", have they bothered to give any workable example. Here's an example of how to get your current USER_AGENT string.
Just add the following lines to your_spider.py:
# To get your settings from settings.py:
from scrapy.utils.project import get_project_settings
...

class YourSpider(BaseSpider):
    ...
    def parse(self, response):
        ...
        settings = get_project_settings()
        print "Your USER_AGENT is:\n%s" % (settings.get('USER_AGENT'))
        ...
As you can see, there's no need to use @classmethod or to re-define the from_crawler() or __init__() functions. Hope this helps.
PS. I'm still not sure why using from scrapy.settings import Settings doesn't work the same way, since it would be the more obvious choice of import.
OK, so the documentation at http://doc.scrapy.org/en/latest/topics/extensions.html says that:
The main entry point for a Scrapy extension (this also includes middlewares and pipelines) is the from_crawler class method, which receives a Crawler instance which is the main object controlling the Scrapy crawler. Through that object you can access settings, signals, stats, and also control the crawler behaviour, if your extension needs such a thing.
So then you can have a function to get the settings.
@classmethod
def from_crawler(cls, crawler):
    settings = crawler.settings
    my_setting = settings.get("MY_SETTING")
    return cls(my_setting)
The crawler engine then calls the pipeline's init function with my_setting, like so:
def __init__(self, my_setting):
    self.my_setting = my_setting
And other functions can access it with self.my_setting, as expected.
Alternatively, in the from_crawler() function you can pass the crawler.settings object to __init__(), and then access settings from the pipeline as needed instead of pulling them all out in the constructor.
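A sketch of that alternative, keeping the whole settings object around so any setting can be read later; MY_FEATURE_ENABLED is a hypothetical setting name used only for illustration:

class MyPipeline(object):
    def __init__(self, settings):
        self.settings = settings

    @classmethod
    def from_crawler(cls, crawler):
        # hand the whole Settings object to the pipeline
        return cls(crawler.settings)

    def process_item(self, item, spider):
        # read any setting lazily, e.g. a hypothetical boolean flag
        if self.settings.getbool('MY_FEATURE_ENABLED', False):
            pass  # do something extra with the item
        return item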
The correct answer is: it depends on where in the pipeline you wish to access the settings.
avaleske answered as if you wanted access to the settings outside of your pipeline's process_item method, but it's very likely that this is where you'll want the setting; in that case there is a much easier way, since the Spider instance itself gets passed in as an argument:
class PipelineX(object):
    def process_item(self, item, spider):
        wanted_setting = spider.settings.get('WANTED_SETTING')
        return item
If the project structure is quite flat, why not:
# pipeline.py
from myproject import settings
