How to manually invoke SpellCheck avoiding Browser timeout - python

I'm trying to update the SpellChecker in my local MoinMoin installation according to this documentation page: https://moinmo.in/HelpOnSpellCheck.
I followed the steps, got a new dictionary file and symlinked it into the data/dict directory of the MoinMoin installation path. I then deleted /data/cache/spellchecker.dict, which should be rebuilt upon invoking the SpellCheck action. If I visit my wiki and use SpellCheck, the browser times out while the SpellCheck database is being built, as expected according to the link above.
In the documentation it says: "If your browser or the webserver timeouts before the file is completely built, one solution is to telnet into your webserver, and manually request the page." This is what I'm trying to do. Unfortunately the request does not seem to invoke the database creation and quickly returns the requested page.
Here is how I requested the page (I'm hosting it locally over port 8085):
telnet 192.168.1.199 8085
Trying 192.168.1.199...
Connected to 192.168.1.199.
Escape character is '^]'.
HEAD /wiki/FrontPage?action=SpellCheck HTTP/1.1
Host: 192.168.1.199
HTTP/1.1 200 OK
...
I would expect that the request invokes the database creation, as it does in a web browser. This should take a few minutes and I should afterwards be able to find the created database in /data/cache/. Unfortunately this does not happen.
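One thing worth checking before blaming timeouts: the transcript above sends a HEAD request, which returns headers only, and a server may not do the same work for a HEAD as for the GET a browser sends. A minimal sketch of issuing the same request as a GET from Python, with no client-side timeout (host, port and path taken from the question):

import http.client

# GET rather than HEAD, so the response body (and any work behind it)
# is actually produced; no timeout is set on the connection.
conn = http.client.HTTPConnection('192.168.1.199', 8085)
conn.request('GET', '/wiki/FrontPage?action=SpellCheck')
resp = conn.getresponse()
print(resp.status, resp.reason)
resp.read()  # consume the body so the server can finish the response
conn.close()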

If anyone else is interested, here is how I ended up solving the problem: I narrowed the possible timeouts down to either the webserver (nginx) or uwsgi, so I made the following changes to the config files:
In /etc/moin/uwsgi.ini:
harakiri = 9999
In /etc/nginx/nginx.conf:
uwsgi_read_timeout 9999;
uwsgi_send_timeout 9999;
I then used the Python package requests to send the GET request to the server:
import requests

r = requests.get('http://192.168.1.199:8085/wiki/FrontPage',
                 params={'action': 'SpellCheck'}, timeout=9999)
This ran for about 2-3 minutes. Afterwards, the word count shown when performing a spelling check was 629182, reflecting the number of words present in my dictionary.
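To confirm the rebuild actually happened, here is a quick sketch that checks the cache file mentioned above (the installation path is an assumption; adjust it to your setup):

import os

# Hypothetical location of the MoinMoin install; see data/cache above.
cache = '/path/to/moin/data/cache/spellchecker.dict'
if os.path.exists(cache):
    print(f'{cache}: {os.path.getsize(cache)} bytes')
else:
    print('dictionary cache not built yet')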

Related

Getting contents of a webpage with python fails, instead returning a FortiADC Error page

My Python code (which is on a Docker image, within an AWS EC2 box) is using the requests library to try and get the contents of a webpage. If I do something as simple as requests.get(url='https://google.com', verify=False).text, it will return the below, and of note is the "fortiadc_error_page".
/usr/local/lib/python3.8/site-packages/urllib3/connectionpool.py:1013: InsecureRequestWarning: Unverified HTTPS request is being made to host 'google.com'. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
warnings.warn(
<!DOCTYPE html>\n<html>\t\n\t<head>\n\t</head>\n\t<body>\n\t\t<iframe src="/fortiadc_error_page/index.html" frameborder="0" width="100%" scrolling="no" onload="function resizeIframe(obj) {\n\t obj.style.height = obj.contentWindow.document.body.scrollHeight + 10 + \'px\';\n\t };resizeIframe(this)"></iframe>\n\t</body>
If I remove verify=False from the request, like so: requests.get(url='https://google.com').text, I get back: ssl.SSLCertVerificationError: ("hostname 'google.com' doesn't match either of '*.fortinet.com', 'fortinet.com'",)
Anyone know how to fix these errors? I deployed this with a RHEL 8 AMI, so there shouldn't be any special security restrictions on it. And I've opened up ports 443 and 80.
It is very strange, as requests.get('https://stackoverflow.com') works, but Google and CNN fail, and the website that I really want to pull from fails.
For a bit more detail - again I have an EC2 deployed on AWS, and on it I've installed K3D. On K3D I've installed a set of Kubernetes manifest files, which includes a cronjob. When I run that cronjob, it runs on a docker image, which has Python and runs the requests code, and fails.
From the looks of things you have a DNS issue. The container needs to be able to resolve external addresses correctly.
https://docs.docker.com/config/containers/container-networking/
Otherwise, requests.get(url='https://google.com').text works perfectly.
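A quick way to test the DNS theory from inside the container (standard library only, nothing K3D-specific): if the names fail to resolve, or resolve to the Fortinet appliance's address rather than the real sites, the container's resolver is the culprit.

import socket

# Compare a site that works with ones that fail.
for host in ('stackoverflow.com', 'google.com', 'cnn.com'):
    try:
        print(host, socket.gethostbyname(host))
    except socket.gaierror as exc:
        print(host, 'failed to resolve:', exc)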

Django Hostname in Testing

I am wondering if there is a way to obtain the hostname of a Django application when running tests. That is, I would like the tests to pass both locally and when run on the staging server. Hence I need to know whether the host is http://localhost:<port> or http://staging.example.com, because some tests query particular URLs.
I found answers on how to do it inside templates, but that does not help since there is no response object to check the hostname.
How can one find out the hostname outside the views/templates? Is it stored in Django settings somewhere?
Why do you need to know the hostname? Tests can run just fine without it, if you use the test client. You do not need to know anything about the system they're running on.
You can also mark tests with a tag and then have the CI system run the tests including that tag.
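A small sketch of that tagging approach (the test body is hypothetical):

from django.test import TestCase, tag

@tag('staging')  # run only these with: python manage.py test --tag staging
class StagingOnlyTests(TestCase):
    def test_homepage(self):
        response = self.client.get('/')
        self.assertEqual(response.status_code, 200)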
And finally there is the LiveServerTestCase:
LiveServerTestCase does basically the same as TransactionTestCase with one extra feature: it launches a live Django server in the background on setup, and shuts it down on teardown. This allows the use of automated test clients other than the Django dummy client such as, for example, the Selenium client, to execute a series of functional tests inside a browser and simulate a real user’s actions.
The live server listens on localhost and binds to port 0 which uses a free port assigned by the operating system. The server’s URL can be accessed with self.live_server_url during the tests.
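For example (the URL path is hypothetical; live_server_url is the real Django API):

from django.test import LiveServerTestCase
import requests

class FrontPageTests(LiveServerTestCase):
    def test_front_page(self):
        # live_server_url looks like 'http://localhost:<port>', with the
        # port picked by the OS when the test server starts.
        response = requests.get(self.live_server_url + '/')
        self.assertEqual(response.status_code, 200)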
Additional information from comments:
You can test if the URL of an image file is present in your response by testing for the MEDIA_URL:
self.assertContains(response, f'{settings.MEDIA_URL}default-avatar.svg')
You can test for the existence of an upload in various ways, but the easiest one is to check if there's a file object associated with the FileField. It will throw ValueError if there is not.
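A sketch of that check, assuming a hypothetical Profile model with an avatar FileField:

def test_avatar_was_uploaded(self):
    profile = Profile.objects.get(pk=1)  # hypothetical model
    try:
        profile.avatar.file  # raises ValueError if no file is associated
        has_file = True
    except ValueError:
        has_file = False
    self.assertTrue(has_file)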

Python web scraping : urllib.error.URLError: urlopen error [Errno 11001] getaddrinfo failed

This is the first time I am trying to use Python for Web scraping. I have to extract some information from a website. I work in an institution, so I am using a proxy for Internet access.
I have used the code below, which works fine with URLs such as https://www.google.co.in or https://www.pythonprogramming.net.
But when I use this URL: http://www.genecards.org/cgi-bin/carddisp.pl?gene=APOA1 which I need for scraping data, it shows
urllib.error.URLError: <urlopen error [Errno 11001] getaddrinfo failed>
Here is my code.
import urllib.request as req

# Note: only plain-HTTP traffic is routed through the proxy here;
# https URLs are fetched directly.
proxy = req.ProxyHandler({'http': 'http://username:password@url:3128'})
auth = req.HTTPBasicAuthHandler()
opener = req.build_opener(proxy, auth, req.HTTPHandler)
req.install_opener(opener)

conn = req.urlopen('https://www.google.co.in')
return_str = conn.read()
print(return_str)
Please guide me on what the issue is here, as I am not able to understand it.
Also, while searching for the above error, I read something about absolute URLs. Is that related?
The problem is that your proxy server, and your own host, seem to use two different DNS resolvers, or two resolvers updated at different instants in time.
So when you pass www.genecards.org, the proxy does not know that address, and the attempt to get address information (getAddrInfo) fails. Hence the error.
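You can reproduce the failing lookup directly, without going through the proxy (standard library only):

import socket

# If this raises socket.gaierror on your machine but the proxy can resolve
# the name (or vice versa), the two hosts are using different resolvers.
print(socket.getaddrinfo('www.genecards.org', 80))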
The problem is quite a bit more awkward than that, though. GeneCards.org is an alias for an Incapsula DNS host:
$ host www.genecards.org
www.genecards.org is an alias for 6hevx.x.incapdns.net.
And that machine is itself a proxy, hiding the real GeneCards site behind (so you might use http://192.230.83.165/ as an address, and it would never work).
This kind of merry-go-round is used by those sites that, among other things - how shall I put it - take a dim view of being scraped.
So yes, you could try several things to make scraping work. Chances are that they will only work for a short time, before being shut down harder and harder. So in the best scenario, you would be forced to continuously update your scraping code. Which can, and will, break down whenever it's most inconvenient to you.
This is no accident: it is intentional on GeneCards' part, and clearly covered in their terms of service:
Misuse of the Services
7.2 LifeMap may restrict, suspend or terminate the account of any Registered Users who abuses or misuses the GeneCards Suite Products. Misuse of the GeneCards Suite Products includes scraping, spidering and/or crawling GeneCards Suite Products; creating multiple or false profiles...
I suggest you take a different approach - try enquiring about a consultation license. Scraping a web site that does not care (or is unable, or hasn't yet come around) to provide its information in an easier format is one thing - stealing that information is quite different.
Also, note that you're connecting to a Squid proxy that in all probability is logging the username you're using. Any scraping made through that proxy would immediately be traced back to that user, in the event that LifeMap files a complaint for unauthorized scraping.
Try to ping url:3128 from your terminal. Do you get responses? The problem seems to be related to security on the server side.

Timeout with Python Requests + Clojure HttpKit Server but not Ring server

I have some Ring routes which I'm running one of two ways.
lein ring server, with the lein-ring plugin
using org.httpkit.server, like (hs/run-server app {:port 3000})
It's a web app (being consumed by an Angular.js browser client).
I have some API tests written in Python using the Requests library:
my_r = requests.post(MY_ROUTE,
                     data=MY_DATA,
                     headers={"Content-Type": "application/json"},
                     timeout=10)
When I use lein ring server, this request works fine in the JS client and the Python tests.
When I use httpkit, this works fine in the JS client but the Python client times out with
socket.timeout: timed out
I can't figure out why the Python client is timing out. It happens with httpkit but not with lein-ring, so I can only assume that the cause is related to the difference.
I've looked at the traffic in WireShark and both look like they give the correct response. Both have the same Content-Length field (15 bytes).
I've raised the number of threads to 10 (it shouldn't be needed) and there was no change.
Any ideas what's wrong?
I found how to fix this, but no satisfactory explanation.
I was using wrap-json-response Ring middleware to take a HashMap and convert it to JSON. I switched to doing my own conversion in my handler with json/write-str, and this fixes it.
At a guess it might be something to do with the server handling output buffering, but that's speculation.
I've combed through the Wireshark dumps and I can't see any relevant differences between the two. The sent Content-Length fields are identical. The 'bytes in flight' differ, at 518 and 524.
No clue as to why the web browser was happy with this but Python Requests wasn't, nor whether or not this is a bug in Requests, httpkit, ring-middleware-format or my own code.
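A cheaper first check than Wireshark, in case someone hits the same thing: compare the framing headers the two servers send, since a missing Content-Length or a chunked Transfer-Encoding changes how Requests decides the body is complete (MY_ROUTE and MY_DATA as in the question):

import requests

my_r = requests.post(MY_ROUTE,
                     data=MY_DATA,
                     headers={"Content-Type": "application/json"},
                     timeout=10)
print(my_r.headers.get('Content-Length'),
      my_r.headers.get('Transfer-Encoding'),
      my_r.headers.get('Connection'))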

Inexplicable Urllib2 problem between virtualenvs

I have some test code (as a part of a webapp) that uses urllib2 to perform an operation I would usually perform via a browser:
Log in to a remote website
Move to another page
Perform a POST by filling in a form
I've created 4 separate, clean virtualenvs (with --no-site-packages) on 3 different machines, all with different versions of Python but the exact same packages (via a pip requirements file). The code only works on the two virtualenvs on my local development machine (2.6.1 and 2.7.2) - it won't work on either of my production VPSs.
In the failing cases, I can log in successfully, move to the correct page but when I submit the form, the remote server replies telling me that there has been an error - it's an application server error page ('we couldn't complete your request') and not a webserver error.
because I can successfully log in and maneuver to a second page, this doesn't seem to be a session or a cookie problem - it's particular to the final POST
because I can perform the operation on a particular machine with the EXACT same headers and data, this doesn't seem to be a problem with what I am requesting/posting
because I am trying the code on two separate VPSs rented from different companies, this doesn't seem to be a problem with the VPS physical environment
because the code works on 2 different Python versions, I can't imagine it being an incompatibility problem
I'm completely lost at this stage as to why this wouldn't work. I've even 'turned-it-off-and-turn-it-on-again' because I just can't see what the problem could be.
I think it has to be something to do with the final POST coming from a VPS that the remote server doesn't like, but I can't figure out what that could be. I feel like there is something going on under the hood of URLlib that is causing the remote server to dislike the reply.
EDIT
I've installed the exact same Python version (2.6.1) on the VPS as on my working local copy and it doesn't work remotely, so it must be something to do with originating from a VPS. How could this affect the HTTP request? Is it something lower level?
You might try setting debuglevel=1 for urllib2 and see what it comes up with:
import urllib2

# debuglevel=1 prints the outgoing request and incoming response headers to stdout
h = urllib2.HTTPHandler(debuglevel=1)
opener = urllib2.build_opener(h)
...
This is a total shot in the dark, but are your VPSs 64-bit and your home computer 32-bit, or vice versa? Maybe a difference in default sizes or accuracies of something could be freaking out the server.
Barring that, can you try to find out any information on the software stack the web server is using?
I had similar issues with urllib2 (working with Zimbra's REST API); in the end I switched to pycurl with success.
PS
For operations like login/navigate/post, I usually find Mechanize useful and easier to use. Maybe you can give it a shot.
Well, it looks like I know why the problem was happening, but I'm not 100% sure of the reason for it.
I simply had to make my code wait (time.sleep()) after it sent the 2nd request (move to another page) before doing the 3rd request (perform a POST by filling in a form), as sketched below.
I don't know if it's because of a condition on the 3rd-party server, or if it's some sort of odd issue with urllib2. The reason it seemed to work on my development machine is presumably that it was slower than the server at running the code.
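In outline, the fix looked like this (the delay length is a guess; tune it to the remote server):

import time

# ...request 1: log in; request 2: load the form page...
time.sleep(2)  # give the remote application a moment before the final POST
# ...request 3: submit the form...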
