How to see/edit/Avoid duplicates in scrapy?

I was wondering how I can reset the dupefilter so that a certain set of URLs is no longer filtered out.
I tested a crawler many times before it finally worked, and now I want to run it with something like scrapy crawl quotes -o test_new.csv -s JOBDIR=crawls/quotes_new-1
It keeps telling me that some URLs are duplicates and therefore does not visit them.
It would be perfectly fine to remove all the URLs recorded for that crawler.
I would appreciate knowing where the duplicate URLs are filtered (so that I could edit them).
Disabling filtering on the requests (dont_filter=True) is not an option for my problem, because the crawl would then loop.
I can add my code but as it's a general question I felt it would be more confusing than anything. Just ask if you need it :)
Thank you very much,

You can point Scrapy's DUPEFILTER_CLASS setting at your own dupefilter class, or simply extend the default RFPDupeFilter class with your changes.
The documentation explains a bit more:
The default (RFPDupeFilter) filters based on request fingerprint using the scrapy.utils.request.request_fingerprint function.
In order to change the way duplicates are checked you could subclass RFPDupeFilter and override its request_fingerprint method. This method should accept a Scrapy Request object and return its fingerprint (a string).
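To make the fingerprint idea concrete, here is a minimal stand-alone sketch of what a fingerprint-based duplicate filter does. It is not Scrapy's actual code and does not import Scrapy: the names fingerprint, SeenFilter and IGNORED_PARAMS are assumptions made up for illustration. In a real project the same kind of logic would go into the request_fingerprint override of an RFPDupeFilter subclass referenced by DUPEFILTER_CLASS.

```python
import hashlib
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical query parameters that should not make two URLs "different".
IGNORED_PARAMS = {"utm_source", "session"}


def fingerprint(method: str, url: str) -> str:
    """Return a stable hex digest for a request, ignoring IGNORED_PARAMS."""
    parts = urlsplit(url)
    # Drop ignored parameters and sort the rest so parameter order is irrelevant.
    query = sorted(
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in IGNORED_PARAMS
    )
    canonical = urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(query), "")
    )
    return hashlib.sha1(f"{method.upper()} {canonical}".encode()).hexdigest()


class SeenFilter:
    """Remembers fingerprints seen so far, the way a dupefilter does."""

    def __init__(self):
        self.seen = set()

    def request_seen(self, method: str, url: str) -> bool:
        """Return True if an equivalent request was already recorded."""
        fp = fingerprint(method, url)
        if fp in self.seen:
            return True
        self.seen.add(fp)
        return False
```

With this sketch, two URLs that differ only in an ignored parameter or in parameter order count as the same request, while a change in any other parameter produces a new fingerprint. As for where the default filter keeps its state: when JOBDIR is set, RFPDupeFilter persists the fingerprints it has seen in a requests.seen file inside that directory, so a fresh JOBDIR starts with an empty filter.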

Related

URL path parameters vs query parameters in Django

I've looked around for a little while now and can't seem to find anything that even touches on the differences. As the title states, I'm trying to find out what difference it actually makes whether you get your data via URL path parameters like /content/7, captured with a regex in your urls.py, or from query params like /content?num=7 using request.GET.get().
What are the pros and cons of each, and are there any scenarios where one would clearly be a better choice than the other?
Also, from what I can tell, the (Django's) preferred method seems to be using url path params with regex. Is there any reason for this, other than potentially cleaner URLs? Any additional information pertinent to the topic is welcome.

This would depend on which architectural pattern you would like to adhere to. For example, according to the REST architectural pattern (which we can argue is the most common), you want to design URLs such that, without query params, they point to "resources", which roughly correspond to nouns in your application. HTTP verbs then correspond to the actions you can perform on that resource.
If, for instance, your application has users, you would want to design URLs like this:
GET /users/ # gets all users
POST /users/ # creates a new user
GET /users/<id>/ # gets the user with that id; notice this URL still points to a user resource
PUT /users/<id>/ # updates an existing user's information
DELETE /users/<id>/ # deletes a user
You could then use query params to filter a set of users at a resource. For example, to get users that are active, your URL would look something like
/users?active=true
So to summarize, query params vs. path params depends on your architectural preference.
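The split described above can be sketched without any framework: the path picks the resource, and the query string only filters a collection. This is a rough illustration using the standard library, not Django's URL resolver. The USERS data and the resolve function are hypothetical.

```python
import re
from urllib.parse import parse_qs, urlsplit

# Hypothetical in-memory data standing in for a user table.
USERS = {
    1: {"name": "ann", "active": True},
    2: {"name": "bob", "active": False},
}

# Path parameter: /users/<id>/ identifies exactly one resource.
USER_DETAIL = re.compile(r"^/users/(?P<id>\d+)/?$")


def resolve(url):
    """Return one user, a (possibly filtered) list of users, or None."""
    parts = urlsplit(url)
    match = USER_DETAIL.match(parts.path)
    if match:
        return USERS.get(int(match.group("id")))
    if parts.path.rstrip("/") == "/users":
        users = list(USERS.values())
        # Query parameter: narrows the collection, does not name a resource.
        query = parse_qs(parts.query)
        if "active" in query:
            wanted = query["active"][0] == "true"
            users = [user for user in users if user["active"] == wanted]
        return users
    return None
```

Here resolve("/users/2") returns a single user, while resolve("/users?active=true") returns the filtered collection.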
A more detailed explanation of REST: http://www.vinaysahni.com/best-practices-for-a-pragmatic-restful-api
Roy Fielding's version if you want to get really academic: http://www.ics.uci.edu/~fielding/pubs/dissertation/rest_arch_style.htm

Dynamically add URL rules to Flask app

I am writing an app in which users will be able to store information that they can specify a REST interface for. For example, a user could store a list of products at /<username>/rest/products. Since the URLs are obviously not known beforehand, I was trying to think of the best way to implement dynamic URL creation in Flask. The first way I thought of would be to write a catch-all rule and route the URL from there. But then I am basically duplicating URL routing capabilities that Flask already has built in. So I was wondering if it would be a bad idea to use .add_url_rule() to attach them directly to the app. Is there a specific reason this shouldn't be done?

Every time you execute add_url_rule() the internal routing remaps the URL map. This is neither thread-safe nor fast. To be honest, I don't understand right now why you need user-specific URL rules. It sounds like you actually want user-specific applications mounted?
Maybe this is helpful: http://flask.pocoo.org/docs/patterns/appdispatch/

I have had a similar requirement for my application, where each endpoint /<SOMEID>/rest/other for a given SOMEID should be bound to a different function. One way to achieve this is to keep a lookup dictionary whose values are the functions that handle each specific SOMEID. For example, take a look at this snippet:
from flask import Flask, jsonify

app = Flask(__name__)

# Maps each SOMEID to the function that handles it.
func_look_up_dict = {...}

@app.route('/<SOMEID>/rest/other', methods=['GET'])
def multiple_func_router_endpoint(SOMEID):
    if SOMEID in func_look_up_dict:
        return jsonify({'result': func_look_up_dict[SOMEID]()}), 200
    else:
        return jsonify({'result': 'unknown', 'reason': 'invalid id in url'}), 404
So in this case you don't really need to "dynamically" add URL rules. Instead, use a URL rule with a parameter and handle the various cases within a single function. Another thing to consider is the use case of such a URL endpoint. If <username> is a parameter that needs to be passed in, why not use a URL rule such as /rest/product/<username>, or pass it in as an argument in the GET request?
Hope that helps.

Downloadlink for a file dynamically created by a Trac- Wikimacro

I've been given the task to write a plugin for Trac.
It should provide burndown data for the ticket count and the estimates filed in the issue tracking system.
The user writes the request as a wiki macro and is given a link/button for downloading the burndown as a CSV file. Output as a chart is also planned, but has lower priority.
I've got a working solution for processing the data, but I'm left with the following problem.
My Question
How can I provide a download link/button on the wiki page for a file that is dynamically created by the user's request?
I've seen some attempts to send files in the Trac source itself and in other plugins, but since I'm new to web programming that doesn't really help.
Update1
I've been trying to solve the problem the way Felix suggested, which opened up a new problem for me.
This (stupid) example should demonstrate my problem.
My Macro generates the following URL and adds it as a link to the wikipage.
//http://servername.com/projectname/wiki/page_name?teddy=bear
But the RequestHandler doesn't react, even if the condition returns true.
Edit: This piece of code now shows the working version for the example.
New URL:
# Example URL:
# http://127.0.0.1:8000/prove/files/new
from trac.core import Component, implements
from trac.web.api import IRequestHandler, RequestDone

class CustomRequestHandlerModule(Component):
    implements(IRequestHandler)

    def match_request(self, req):
        # Old, not working (path_info does not include the query string):
        # return "teddy=bear" == str(req.path_info).split('?')[1]
        # New: match on the path itself.
        return req.path_info == "/files/new"

    def process_request(self, req):
        csvfile = self.create_csv()
        req.send_response(200)
        req.send_header('Content-Type', 'text/csv')
        req.send_header('Content-Length', len(csvfile))
        # "attachment" asks the browser to download the response as a file.
        req.send_header('Content-Disposition', 'attachment; filename=lala.csv')
        req.end_headers()
        req.write(csvfile)
        raise RequestDone
Update2
Inserting logging statements shows that match_request never gets called.
What am I doing wrong? (Yes, the create_csv() exists already)
Update3
Thanks for helping =)

If match_request isn't getting called, then process_request never has a chance to execute. Assuming that there's nothing wrong with your plugin that's preventing Trac from loading it correctly, what's probably happening is that another handler is matching the URL before your version of match_request gets called. Try increasing your log level to "Debug" and see if it provides enough information to tell who is processing that request.
Another option is to create a custom "namespace" for your auto-generated files. Try replacing 'wiki' in the generated URLs with something like 'files'. This should prevent any of the built-in handlers from handling the request before your plugin's match_request method gets called.

Basically, you need to write your own IRequestHandler that handles a specific URL and returns your dynamically created data. Your macro should then return a URL that is configured for your request handler.

Best way to request-scope data in Django?

I'm wondering if there's a clever pattern for request-scoping arbitrary information without resorting to either TLS or putting the information in the session.
Really, this would be for contextual attributes that I'd like to not look up more than once in a request path, but which are tied to a request invocation and there's no good reason to let them thresh around in the session.
Something like a dict that's pinned to the request where I can shove things or lazy load them. I could write a wrapper for request and swap it out in a middleware, but I figured I'd check to see what best-practice might be here?

Just assign the dictionary directly to the request. You can do that in middleware or in your view, as you like.
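A minimal sketch of that idea as new-style middleware (a callable that takes get_response and is listed in the MIDDLEWARE setting). The attribute name scope and the helper scoped are assumptions made up for this example, not part of Django.

```python
class RequestScopeMiddleware:
    """Attach a fresh dict to every incoming request."""

    def __init__(self, get_response):
        self.get_response = get_response

    def __call__(self, request):
        # 'scope' is a hypothetical attribute name, not a Django API.
        request.scope = {}
        return self.get_response(request)


def scoped(request, key, factory):
    """Return request.scope[key], calling factory() at most once per request."""
    if key not in request.scope:
        request.scope[key] = factory()
    return request.scope[key]
```

A view or helper could then call scoped(request, "profile", load_profile), where load_profile is whatever expensive lookup you want to run only once per request. The dict disappears with the request object, so nothing leaks into the session.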

Context processors. They are called for each request whose template is rendered with a RequestContext and receive the actual request object, so you can add any data to the context, including data based on the current request.
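For reference, a context processor is just a function that takes the request and returns a dict to merge into the template context. It is enabled by adding its dotted path to the context_processors option in the TEMPLATES setting. The function name and the "section" key below are hypothetical.

```python
def current_section(request):
    """Expose the first path segment of the current URL to all templates."""
    # request.path is e.g. "/news/2024/"; fall back to "home" for "/".
    first_segment = request.path.strip("/").split("/")[0]
    return {"section": first_segment or "home"}
```

Note that this makes the value available to templates only. Data needed in view code would still have to live on the request itself.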

Dealing with URLs in Django

So, basically what I'm trying to do is a hockey pool application, and there are a ton of ways I should be able to filter to view the data. For example, filter by free agent, goals, assists, position, etc.
I'm planning on doing this with a bunch of query strings, but I'm not sure what the best approach would be to pass these query strings along. Let's say I wanted to be on page 2 (as I'm using pagination for splitting the pages), sort by goals, and only show forwards. I would have the following query string:
?page=2&sort=g&position=f
But if I were on that page, with all the corresponding info showing, and I clicked, say, points instead of goals, I would still want all my other filters intact, like this:
?page=2&sort=p&position=f
Since HTTP is stateless, I'm having trouble on what the best approach to this would be.. If anyone has some good ideas they would be much appreciated, thanks ;)
Shawn J

Firstly, think about whether you really want to save all the parameters each time. In the example you give, you change the sort order but preserve the page number. Does this really make sense, considering you will now have different elements on that page? What's more, if you change the filters, the currently selected page number might not even exist.
Anyway, assuming that is what you want, you don't need to worry about state or cookies or any of that, seeing as all the information you need is already in the GET parameters. All you need to do is to replace one of these parameters as required, then re-encode the string. Easy to do in a template tag, since GET parameters are stored as a QueryDict which is basically just a dictionary.
Something like (untested):
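from django import template

register = template.Library()

@register.simple_tag
def url_with_changed_parameter(request, param, value):
    # request.GET is immutable, so work on a mutable copy.
    params = request.GET.copy()
    params[param] = value
    return "%s?%s" % (request.path, params.urlencode())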
and you would use it in your template:
{% url_with_changed_parameter request "page" 2 %}
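The "replace one parameter, then re-encode" step can also be sketched with the standard library alone, which makes it easy to try outside a Django project. The function name with_changed_parameter and the /players path are hypothetical. Unlike a QueryDict, the plain dict used here keeps only the last value of a repeated parameter.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_changed_parameter(url, param, value):
    """Return url with query parameter `param` set to `value`."""
    parts = urlsplit(url)
    # A plain dict keeps insertion order but only one value per key.
    params = dict(parse_qsl(parts.query, keep_blank_values=True))
    params[param] = str(value)
    return urlunsplit(parts._replace(query=urlencode(params)))
```

Calling with_changed_parameter("/players?page=2&sort=g&position=f", "sort", "p") keeps the page and position filters and swaps only the sort key.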

Have you looked at django-filter? It's really awesome.

Check out filter mechanism in the admin application, it includes dealing with dynamically constructed URLs with filter information supplied in the query string.
In addition - consider saving actual state information in cookies/sessions.

If you want to save all the "parameters", I'd say they are resource identifiers and should normally be part of the URI.
