My application lists game server IP addresses.
I want to add a simple search feature that accepts a regular expression. I would type ^200. to list only the IP addresses beginning with 200.
The form would redirect me to the results page by sending a GET request like this:
/servers/search/^200./page/1/range/30/
This is the line I'm using in urls.py:
url(r'^servers/search/(?P<search>[a-zA-Z0-9.]+)/page/(?P<page>\d+)/range/(?P<count>\d+)/$', gameservers.views.index)
But it doesn't work the way I expected: no results are shown. I intentionally made a syntax error so I could inspect the local variables, and realized that the search variable's value is the following:
^200./page/1/range/30/
How can I fix this? I've thought about moving the search parameter to the end of the URL, but it would be interesting to know whether there is a way to stop the captured value at the next /.
Your regex doesn't match at all: you are not accepting the ^ character. But even if it were matching, there's no way the full URL could be captured in the search variable, because then the rest of the pattern wouldn't match.
However, I wouldn't try to fix this. Trying to capture complicated patterns in the URL itself is usually a mistake. For a search value, it's perfectly acceptable to move it to a GET query parameter, so that your URL would look something like this:
/servers/search/?search=^200.&page=1&range=30
or, if you like, you could still capture the page and range values in the URL, but leave the search value as a query param.
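To illustrate the query-parameter approach, here is a minimal sketch of the filtering a view could do once the pattern arrives as a GET parameter. The function name and the sample addresses are illustrative, not from the original code:

```python
import re

def filter_servers(addresses, pattern):
    """Return the addresses matching a user-supplied regex.

    In a Django view, `pattern` would come from request.GET.get('search'),
    e.g. after a request to /servers/search/?search=^200.&page=1&range=30
    """
    try:
        regex = re.compile(pattern)
    except re.error:
        return []  # invalid user regex: return no results rather than crash
    return [addr for addr in addresses if regex.search(addr)]

servers = ["200.1.2.3", "200.45.6.7", "62.200.1.1"]
print(filter_servers(servers, "^200."))  # ['200.1.2.3', '200.45.6.7']
```

Note the try/except around re.compile: since the pattern is user input, an unparseable regex should be handled rather than allowed to raise.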
I am working on getting information from https://www.corporationwiki.com/search/results?term=jim%20smith (just a random name I picked, please don't mind). I want to filter the results by using the drop-down menu to select a state.
However, the web page doesn't implement 'States' as a URL parameter, which means the URL doesn't change after I select a state.
I tried passing params into requests.get(), but the result didn't change.
Here's the code I used:
import requests

url = 'https://www.corporationwiki.com/search/results?term=jim%20smith'
r = requests.get(url,
                 params=dict(query="web scraping", page=2, states='Maryland'),
                 timeout=5)
There's no error message, but it also didn't show me the filtered results.
Can anyone help me pass the right parameters so I can filter the results by state?
Thanks :)
Actually, it looks like the website does implement the state as a parameter. The exact name is "stateFacet".
You can just send your GET request to:
https://www.corporationwiki.com/search/withfacets?term=jim%20smith&stateFacet=state_code
Just replace state_code with the correct value. For example:
https://www.corporationwiki.com/search/withfacets?term=jim%20smith&stateFacet=de
This link will filter for the state of Delaware.
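If you prefer building the URL yourself, the standard library's urlencode handles the escaping. A small sketch, using the parameter names mentioned above with example values:

```python
from urllib.parse import urlencode

# Build the search URL; urlencode escapes the space in the name
# (as '+', which is equivalent to %20 in a query string).
params = {"term": "jim smith", "stateFacet": "de"}
url = "https://www.corporationwiki.com/search/withfacets?" + urlencode(params)
print(url)  # https://www.corporationwiki.com/search/withfacets?term=jim+smith&stateFacet=de
```

The actual request would then be requests.get(url, timeout=5), or equivalently requests.get(base_url, params=params, timeout=5).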
If the endpoint doesn't support it, then you cannot get it via the URL. You will need to look into more complicated methods, or figure out the correct URL parameter if there is one.
You won't be able to do it with requests alone. You will probably need something like Selenium to simulate clicking the dropdown and picking the filter(s) you want, because the dropdown's logic is all JavaScript, which cannot be triggered through the URL alone.
First of all, English is not my native language.
Problem
I am trying to access and manipulate a form using MechanicalSoup, as described in the docs. I successfully logged in to the page using the given login form, which I found using the "debug mode" (F12) built into Chrome:
form action="https://www.thegoodwillout.de/customer/account/loginPost/"
The form can be found using the Chrome debugger.
This works fine and produces no errors. I tried to up my game and move on to a more complicated form on the same site. I managed to track it down to this snippet:
form action="https://www.thegoodwillout.de/checkout/cart/add/uenc/aHR0cHM6Ly93d3cudGhlZ29vZHdpbGxvdXQuZGUvbmlrZS1haXItdm9ydGV4LXNjaHdhcnotd2Vpc3MtYW50aHJheml0LTkwMzg5Ni0wMTA_X19fU0lEPVU,/product/115178/form_key/r19gQi8K03l21bYk/"
This will result in a
ValueError: No Closing quotation
which is weird, since it does not use any special characters, and I double-checked that every quotation mark is closed correctly.
What I have tried
I tried tracking down a more specific form that applies to the given shoe size, but this one form seems to manage all the content on the website. I searched the web and found several articles pointing to a bug inside Python, which I cannot believe is true!
Source Code with attached error log
import mechanicalsoup
browser = mechanicalsoup.StatefulBrowser()
browser.open("https://www.thegoodwillout.de/nike-air-vortex-schwarz-weiss-anthrazit-903896-010")
browser.select_form('form[action="https://www.thegoodwillout.de/checkout/cart/add/uenc/aHR0cHM6Ly93d3cudGhlZ29vZHdpbGxvdXQuZGUvbmlrZS1haXItdm9ydGV4LXNjaHdhcnotd2Vpc3MtYW50aHJheml0LTkwMzg5Ni0wMTA_X19fU0lEPVU,/product/115178/form_key/r19gQi8K03l21bYk/"]')
NOTE: it all seems to trace back to a module called shlex, which is causing the error.
Finally the error log
It would be really helpful if you could point me in the right direction and link some websites I may not have fully investigated yet.
It's actually an issue with BeautifulSoup4, the library MechanicalSoup uses to navigate HTML documents, and it is caused by the comma (,) in your CSS selector.
BeautifulSoup splits CSS selectors on commas, and therefore treats your query as two selectors: form[action="https://www.thegoodwillout.de/checkout/cart/add/uenc/aHR0cHM6Ly93d3cudGhlZ29vZHdpbGxvdXQuZGUvbmlrZS1haXItdm9ydGV4LXNjaHdhcnotd2Vpc3MtYW50aHJheml0LTkwMzg5Ni0wMTA_X19fU0lEPVU and /product/115178/form_key/r19gQi8K03l21bYk/"], parsed separately. When parsing the first one, it finds an opening " but no closing ", and errors out.
It's somewhat of a feature (you can pass multiple CSS selectors to select), but it's useless here (there's no point in providing several selectors when you expect a single element).
Solution: don't use commas in CSS selectors. You probably have other criteria to match your form.
You may try using %2C instead of the comma (untested).
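The role of shlex mentioned in the question can be reproduced in isolation: after the selector is split on the comma, the first half contains an opening double quote with no closing one, and tokenizing it fails with exactly this error. The string below is a truncated, illustrative version of the selector, not the full one:

```python
import shlex

# The comma split leaves the first half of the selector with an
# unbalanced double quote; shlex then fails exactly as in the question.
broken_half = 'form[action="https://www.thegoodwillout.de/checkout/cart/add/uenc/aHR0...U'
try:
    shlex.split(broken_half)
except ValueError as err:
    print(err)  # No closing quotation
```

This confirms the traceback ends in shlex not because of a Python bug, but because shlex is handed an unbalanced-quote fragment after the comma split.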
I am currently going through various Django tutorials in order to understand how URL mapping works. I came across an example like this:
this is in my urls.py
url(r'admin_page_edit$',"adminApp.views.showClientDetails",name="admin_page_edit"),
This is in my HTML page, which is currently displayed to the user:
<a href="{% url "admin_page_edit" %}?uname=SomeVal&par2=value" >
Now, this is the URL the browser shows when the above link is clicked. No problem there:
http://127.0.0.1:8000/admin_page_edit?uname=SomeVal&par2=value
And the above URL lands in the corresponding view:
adminApp.views.showClientDetails
Now here is the problem: this all seems to work, but I am confused as to why, since the URL in the browser is
http://127.0.0.1:8000/admin_page_edit?uname=SomeVal&par2=value
which does not match the regex string in the url
admin_page_edit$
(The above regex means the string must end with admin_page_edit), but the URL string does not end with admin_page_edit; instead it is
http://127.0.0.1:8000/admin_page_edit?uname=SomeVal&par2=value
thus ending with par2=value.
My question is: why does this hit the corresponding view when the URL regex does not match?
Query strings (the part following ?) are not processed by the Django URL resolver. Why? Because they don't have to be. You can append just about any query string to any URL.
For example, https://www.facebook.com/?request=pleasedonotwork works all the same. Unless redirects (or some logging) are based on the query string, you can consider the query part of a URL passive.
These query strings can be accessed in your Django views via the request.GET QueryDict.
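The same split can be seen with the standard library: only the path component is matched against the URL patterns, while the query string is a separate component, parsed on its own (which mirrors what request.GET exposes):

```python
from urllib.parse import urlparse, parse_qs

url = "http://127.0.0.1:8000/admin_page_edit?uname=SomeVal&par2=value"
parts = urlparse(url)

# Only the path is matched against the patterns in urls.py...
print(parts.path)              # /admin_page_edit
# ...while the query string is parsed separately, like request.GET does.
print(parse_qs(parts.query))   # {'uname': ['SomeVal'], 'par2': ['value']}
```

So the string that your admin_page_edit$ pattern sees really does end with admin_page_edit; the ?uname=...&par2=... part never reaches the resolver.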
Given the URL cnn.com, when I feed it into a browser, it resolves http://www.cnn.com as the correct URL.
However
r = requests.get('www.cnn.com')
gives the following error:
MissingSchema: Invalid URL u'www.cnn.com': No schema supplied
Is it possible to detect the right URL, just like a browser does?
Obviously the module you are using does not want to guess the scheme, so you have to provide it. If you build an interface yourself and want your users to be able to omit the scheme, you need to implement some "intelligent" approach yourself. One way to do so is to use urlparse (http://docs.python.org/2/library/urlparse.html) to check whether a scheme was given in the URL. If no scheme was provided, add your desired default scheme (e.g. http) and get the modified URL via ParseResult.geturl().
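A minimal sketch of that approach (ensure_scheme is an illustrative name, not part of any library):

```python
from urllib.parse import urlparse

def ensure_scheme(url, default_scheme="http"):
    """Prepend a default scheme when the URL does not specify one."""
    if urlparse(url).scheme:
        return url
    return f"{default_scheme}://{url}"

print(ensure_scheme("www.cnn.com"))       # http://www.cnn.com
print(ensure_scheme("https://cnn.com/"))  # https://cnn.com/ (unchanged)
```

The fixed-up URL can then be passed to requests.get() as usual.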
Yes, it's possible, or at least it's possible to make good guesses and test them.
To make a good guess, you could start by looking for "http://" at the start of the URL and add it if it's not there. To test that guess, you could try to hit the resulting domain and see if you get a successful response.
I know that with urllib you can parse a string and check whether it's a valid URL. But how would one go about checking whether a sentence contains a URL, and then extracting that URL? I've seen some huge regular expressions out there, but I would rather not use something I can't really comprehend.
So basically I have an input string, and I need to find and extract all the URLs within that string.
What's a clean way of going about this?
You can search for "words" containing : and then pass them to urlparse (renamed to urllib.parse in Python 3.0 and newer) to check whether they are valid URLs.
Example:
possible_urls = re.findall(r'\S+:\S+', text)
If you want to restrict yourself only to URLs starting with http:// or https:// (or anything else you want to allow) you can also do that with regular expressions, for example:
possible_urls = re.findall(r'https?://\S+', text)
You may also want to use some heuristics to determine where the URL starts and stops, because people sometimes add punctuation after a URL, producing new URLs that are valid but unintentionally incorrect, for example:
Have you seen the new look for http://example.com/? It's a total ripoff of http://example.org/!
Here the punctuation after each URL is not intended to be part of it. You can see from the automatically added links in the text above that Stack Overflow implements such heuristics.
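A simple version of such a heuristic (extract_urls is an illustrative name, and the list of trailing characters to strip is a guess that would need tuning for real input):

```python
import re

def extract_urls(text):
    """Find http(s) URLs, then strip trailing characters that are
    usually sentence punctuation rather than part of the URL."""
    candidates = re.findall(r'https?://\S+', text)
    return [url.rstrip('.,;:!?') for url in candidates]

text = ("Have you seen the new look for http://example.com/? "
        "It's a total ripoff of http://example.org/!")
print(extract_urls(text))  # ['http://example.com/', 'http://example.org/']
```

The trade-off: a URL that genuinely ends in one of those characters (e.g. a bare trailing ?) would also be trimmed, which is why this remains a heuristic rather than a correct parser.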
Plucking a URL out of "the wild" is a tricky endeavor (to do correctly). Jeff Atwood wrote a blog post on this subject: The Problem With URLs. John Gruber has also addressed the issue: An Improved Liberal, Accurate Regex Pattern for Matching URLs. I have also written some code which attempts to tackle this problem: URL Linkification (HTTP/FTP) (for PHP/JavaScript). (Note that my regex is particularly complex because it is designed to be applied to HTML markup and attempts to skip URLs that are already linkified, i.e. Link!)
Second, when it comes to validating a URI/URL, the document you want to look at is RFC 3986. I've been working on an article dealing with this very subject: Regular Expression URI Validation. You may want to take a look at that as well.
But when you get down to it, this is not a trivial task!