Python - Run query to site, and print results in std out

I have been trying to write a Python script that checks IP addresses against www.blacklistalert.org. The goal is to run the script with one or more IP addresses as input, then print the site's response to standard output.
import urllib.parse
import urllib.request
ip = str(urllib.request'[$1]')
url = 'http://www.blacklistalert.org/'
values = { 'query': '$1' }
data = urllib.parse.urlencode(values)
url = '?'.join([url, data])
req = urllib.request.Request(url, binary_data)
response = urllib.request.urlopen(req)
the_page = response.read()
print (the_page)
I am running into the problem of sending the query to the page and getting the results, which come back on a separate page. I am currently getting this error:
ip = str(urllib.request'[$1]')
SyntaxError: invalid syntax
What's the best approach to run the IP address query and get the response on standard output? Thanks in advance.

Python is not bash, and does not automatically turn command-line arguments into variables like $1 ($ has no special meaning in Python syntax). Instead, Python places command-line arguments in the sys.argv list. You can either read them from there (sys.argv[0] is the script's name, sys.argv[1] and onward are the command-line parameters; read the docs for the full story), or use something like the argparse module, which I recommend. There's a bit more code involved, but it'll be worth it in the end.
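Putting that together, a minimal sketch of the script using sys.argv might look like the following. The 'query' field name is taken from the original snippet; I haven't verified it against the site's actual form, so treat it as an assumption:

```python
import sys
import urllib.parse
import urllib.request

BASE_URL = 'http://www.blacklistalert.org/'

def build_url(ip):
    # encode the IP address as the 'query' form field, as in the original snippet
    return BASE_URL + '?' + urllib.parse.urlencode({'query': ip})

if __name__ == '__main__':
    # each command-line argument is treated as one IP address to look up
    for ip in sys.argv[1:]:
        with urllib.request.urlopen(build_url(ip)) as response:
            print(response.read().decode('utf-8', errors='replace'))
```

Run it as, e.g., python check_blacklist.py 8.8.8.8 1.2.3.4 and each response page is printed to standard output in turn.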

Related

How do you correctly parse web links to avoid a 403 error when using Wget?

I just started learning Python yesterday and have VERY minimal coding skill. I am trying to write a Python script that will process a folder of PDFs. Each PDF contains at least one, and maybe as many as fifteen or more, web links to supplemental documents. I think I'm off to a good start, but I keep hitting "HTTP Error 403: Forbidden" errors when trying to use the wget function. I believe I'm just not parsing the web links correctly. I think the main issue is that the web links are mostly very long "s3.amazonaws.com" links.
For reference:
Link copied directly from PDF (works to download): https://s3.amazonaws.com/os_uploads/2169504_DFA%20train%20pass.PNG?AWSAccessKeyId=AKIAIPCTK7BDMEW7SP4Q&Expires=1909634500&Signature=aQlQXVR8UuYLtkzjvcKJ5tiVrZQ=&response-content-disposition=attachment;%20filename*=utf-8''DFA%2520train%2520pass.PNG
Link as it appears after trying to parse it in my code (doesn't work, gives "unknown url type" when trying to download): https%3A//s3.amazonaws.com/os_uploads/2169504_DFA%2520train%2520pass.PNG%3FAWSAccessKeyId%3DAKIAIPCTK7BDMEW7SP4Q%26Expires%3D1909634500%26Signature%3DaQlQXVR8UuYLtkzjvcKJ5tiVrZQ%253D%26response-content-disposition%3Dattachment%253B%2520filename%252A%253Dutf-8%2527%2527DFA%252520train%252520pass.PNG
Additionally, feel free to weigh in on whether I'm approaching this the wrong way. Each PDF starts with a string of 6 digits, and once I download the supplemental documents I want to auto-save and name them as XXXXXX_attachY.*, where X is the identifying string of digits and Y increments for each attachment. I haven't gotten my code to work well enough to test that, but I'm fairly certain I don't have it correct either.
Help!
#!/usr/bin/env python3
import os
import glob
import pdfx
import wget
import urllib.parse

## Accessing and Creating Six Digit File Code
pdf_dir = "/users/USERNAME/desktop/worky"
pdf_files = glob.glob("%s/*.pdf" % pdf_dir)

for file in pdf_files:
    ## Identify File Name and Limit to Digits
    filename = os.path.basename(file)
    newname = filename[0:6]
    ## Run PDFX to identify and download links
    pdf = pdfx.PDFx(filename)
    url_list = pdf.get_references_as_dict()
    attachment_counter = 1
    for x in url_list["url"]:
        if x[0:4] == "http":
            parsed_url = urllib.parse.quote(x, safe='://')
            print(parsed_url)
            wget.download(parsed_url, '/users/USERNAME/desktop/worky/(newname)_attach(attachment_counter).*')
            ##os.rename(r'/users/USERNAME/desktop/worky/(filename).*', r'/users/USERNAME/desktop/worky/(newname)_attach(attachment_counter).*')
            attachment_counter += 1
    for x in url_list["pdf"]:
        print(parsed_url + "\n")
I prefer to use requests (https://requests.readthedocs.io/en/master/) when grabbing text or files online. I tried your URL quickly with wget and got the same error (it might be linked to the User-Agent HTTP header wget sends).
wget and HTTP header issues: download image from url using python urllib but receiving HTTP Error 403: Forbidden
HTTP headers: https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/User-Agent
The good thing with requests is that it lets you modify HTTP headers the way you want (https://requests.readthedocs.io/en/master/user/quickstart/#custom-headers).
import requests

r = requests.get("https://s3.amazonaws.com/os_uploads/2169504_DFA%20train%20pass.PNG?AWSAccessKeyId=AKIAIPCTK7BDMEW7SP4Q&Expires=1909634500&Signature=aQlQXVR8UuYLtkzjvcKJ5tiVrZQ=&response-content-disposition=attachment;%20filename*=utf-8''DFA%2520train%2520pass.PNG")
with open("myfile.png", "wb") as file:
    file.write(r.content)
I'm not sure I understand everything you're trying to do, but maybe you want to use formatted strings to build your URLs (https://docs.python.org/3/library/stdtypes.html?highlight=format#str.format)?
Maybe checking string prefixes is fine in your case (if x[0:4] == "http":), but I think you should look at Python's re package and use regular expressions to catch the elements you want in a document (https://docs.python.org/3/library/re.html):
import re

regex = re.compile(r"^http://")
if re.match(regex, mydocument):
    <do something>
The reason for this behavior lies inside the wget library: internally it encodes the URL with urllib.parse.quote() (https://docs.python.org/3/library/urllib.parse.html#urllib.parse.quote).
Basically, it replaces characters with their appropriate %xx escape sequences. Your URL is already escaped, but the library does not know that: when it parses the %20 it sees % as a character that needs to be replaced, so the result is %2520 and a different URL, hence the 403 error.
You could decode that URL first and then pass it along, but then you would hit another problem with this library, because your URL has a filename*= parameter while the library expects filename=.
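The double-escaping described above is easy to reproduce with urllib.parse.quote alone:

```python
import urllib.parse

# quote() has no way to know the string is already percent-encoded,
# so the '%' of '%20' is itself escaped to '%25'
already_escaped = "file%20name.png"
print(urllib.parse.quote(already_escaped))  # -> file%2520name.png
```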
I would recommend doing something like this:
import requests

# get the file
req = requests.get(parsed_url)
# parse your URL to get the GET parameters
get_parameters = parsed_url.split('?')[1].split('&')
filename = ''
# find the GET parameter holding the name
for get_parameter in get_parameters:
    if "filename*=" in get_parameter:
        # split it to get the name
        filename = get_parameter.split('filename*=')[1]
# save the file
with open(<path> + filename, 'wb') as file:
    file.write(req.content)
I would also recommend removing the utf-8'' prefix from that filename, because I don't think it is actually part of the filename. You could also use regular expressions to extract the filename, but this was easier for me.
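As a variation on the manual splitting above, the standard library's urllib.parse can do the query-string parsing. The URL here is a made-up stand-in for the S3 links in the question:

```python
from urllib.parse import urlsplit, parse_qs

# hypothetical URL with a filename*= parameter like the S3 links above
url = "https://example.com/download?Expires=123&filename*=utf-8''report.png"
params = parse_qs(urlsplit(url).query)
filename = params["filename*"][0]
# drop the charset prefix, as recommended above
if filename.startswith("utf-8''"):
    filename = filename[len("utf-8''"):]
print(filename)  # -> report.png
```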

urllib2.open error in python

I can't get the contents of a URL:
base_url = "http://status.aws.amazon.com/"
socket.setdefaulttimeout(30)
htmldata = urllib2.urlopen(base_url)
for url in parser.url_list:
    get_rss_th = threading.Thread(target=parser.get_rss, name="get_rss_th", args=(url,))
    get_rss_th.start()
print htmldata
<addinfourl at 140176301032584 whose fp = <socket._fileobject object at 0x7f7d56a09750>>
When I call htmldata.read() instead (Python error when using urllib.open), I get a blank screen.
Python 2.7.
Whole code: https://github.com/tech-sketch/zabbix_aws_template/blob/master/scripts/AWS_Service_Health_Dashboard.py
The problem is that from the URL (an RSS feed) I can't get any output; the data variable from data = zbx_client.recv(4096) is empty, so there is no status.
There's no real problem with your code (except for a bunch of indentation errors and syntax errors that apparently aren't in your real code), only with your attempts to debug it.
First, you did this:
print htmldata
That's perfectly fine, but since htmldata is a urllib2 response object, printing it just prints that response object. Which apparently looks like this:
<addinfourl at 140176301032584 whose fp = <socket._fileobject object at 0x7f7d56a09750>>
That doesn't look like particularly useful information, but that's the kind of output you get when you print something that's only really useful for debugging purposes. It tells you what type of object it is, some unique identifier for it, and the key members (in this case, the socket fileobject wrapped up by the response).
Then you apparently tried this:
print htmldata.read()
But you already called read on the same object earlier:
parser.feed(htmldata.read())
When you read() the same file-like object twice, the first time gets everything in the file, and the second time gets everything after the end of the file, that is, nothing.
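The same double-read behavior shows up with any file-like object, for example an in-memory buffer:

```python
import io

buf = io.BytesIO(b"hello")
print(buf.read())  # b'hello'
print(buf.read())  # b'' -- the position is already at the end of the buffer
```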
What you want to do is read() the contents once, into a string, and then you can reuse that string as many times as you want:
contents = htmldata.read()
parser.feed(contents)
print contents
It's also worth noting that, as the urllib2 documentation said right at the top:
See also The Requests package is recommended for a higher-level HTTP client interface.
Using urllib2 can be a big pain in a lot of ways, and this is just one of the more minor ones. Occasionally you can't use requests, because you have to dig deep under the covers of HTTP, or handle some protocol it doesn't understand, or you just can't install third-party libraries, so urllib2 (or urllib.request, as it's renamed in Python 3.x) is still there. But when you don't have to use it, it's better not to. Even pip, which ships with Python, bundles requests internally rather than relying on urllib2.
With requests, the normal way to access the contents of a response is with the content (for binary) or text (for Unicode text) properties. You don't have to worry about when to read(); it does it automatically for you, and lets you access the text over and over. So, you can just do this:
import requests
base_url = "http://status.aws.amazon.com/"
response = requests.get(base_url, timeout=30)
parser.feed(response.content) # assuming it wants bytes, not unicode
print response.text
If I use this code:
import urllib2
import socket
base_url = "http://status.aws.amazon.com/"
socket.setdefaulttimeout(30)
htmldata = urllib2.urlopen(base_url)
print(htmldata.read())
I get the page's HTML code.

Using Proxies within Program

So currently I'm having trouble setting my program to use a different proxy for all Internet-related tasks, and I'm also getting an error stating: Failed to parse.
Now, with the error, I'm not quite sure of the correct approach to fix it, since I'm pulling the information out of a .dat file; I believe it may be an encoding issue.
Anyway, on the proxies: I've looked at other posts and submissions for a while now, but it's still not running properly. I also see posts relating to urllib2 all the time, and that's not what I want.
This is basically what I'm trying proxy wise:
proxy = [ . . . ]
os.environ['https_proxy'] = 'http://' + proxy
And this is also the error that follows:
Failed to parse: x.xxx.xxx.xxx:xxxx
Overall thanks for your help, I appreciate it.
As far as I know, there's no magic bullet for changing the proxy settings of a whole program. You have to find out which modules you are using (or which module another of your modules uses under the hood; e.g. pandas uses urllib2) and pass the proxy configuration to each. Below are examples for the two most commonly used modules: urllib2 and requests.
Example of using proxies in urllib2 (on a module level):
proxy = {"http":"10.111.111.111:8080", "https":"10.111.111.222:8080"}
opener = urllib2.build_opener(urllib2.ProxyHandler(proxy))
urllib2.install_opener(opener)
resp = urllib2.urlopen('http://www.google.com')
Example of using proxies in requests (on the function call level):
proxy = {"http":"10.111.111.111:8080", "https":"10.111.111.222:8080"}
resp = requests.get('http://www.google.com', proxies = proxy)
The reason I was receiving the error mentioned above is that, when retrieving the proxies from the .dat file, I wasn't stripping '\n' or ' ' (the ' ' matters if your data includes spaces, though proxy strings themselves obviously do not contain them).
So an example would be to use .strip on the retrieved information, such as:
fileList = []
with open('[ . . . ].dat') as f:
    for line in f:
        fileList.append(line.strip('\n').strip(' '))
return fileList[x]
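The stripping step can be checked in isolation; a bare str.strip() removes the trailing newline and any surrounding spaces in one call (the sample proxy lines here are made up):

```python
lines = ["10.0.0.1:8080\n", " 10.0.0.2:3128 \n"]
# strip() with no arguments removes all leading/trailing whitespace,
# covering both '\n' and ' ' at once
proxies = [line.strip() for line in lines]
print(proxies)  # ['10.0.0.1:8080', '10.0.0.2:3128']
```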

Passing Command Line argument to Python script to be able to open a connection to a web server

I'm writing some shell scripts to check whether I can manipulate information on a web server.
I use a bash shell script as a wrapper to kick off multiple Python scripts in sequential order.
The wrapper script takes the web server's host name, user name, and password, and after performing some tasks hands them to the Python scripts.
I have no problem getting the information passed to the Python scripts:
host% ./bash_script.sh host user passwd
Inside the wrapper script I do the following:
/usr/dir/python_script.py $1 $2 $3
Then, inside the Python script:
print "Parameters: "
print "Host: ", sys.argv[1]
print "User: ", sys.argv[2]
print "Passwd: ", sys.argv[3]
The values are printed correctly
Now, I try to pass the values to open a connection to the web server:
(I successfully opened a connection to the web server using the literal host name)
f_handler = urlopen("https://sys.argv[1]/path/menu.php")
Trying the variable substitution format throws the following error:
HTTP Response Arguments: (gaierror(8, 'nodename nor servname provided, or not known'),)
I tried different variations,
'single_quotes'
"double_quotes"
http://%s/path/file.php % value
for the variable substitution but I always get the same error
I assume that the urlopen function does not substitute the variable correctly.
Does anybody know a fix for this? Do I need to convert the host name variable first?
Roland
You need to quote the string when doing variable substitution:
http://%s/path/file.php % value # Does not work
"http://%s/path/file.php" % value # Works
'http://%s/path/file.php' % value # Also works
"""http://%s/path/file.php""" % value # Also works
Python does not do string interpolation the way that PHP does -- so there is no way to simply "embed" a variable in a string and have the variable be resolved to its value.
That is to say, in PHP, this works:
"http://$some_variable/path/to/file.php"
and this does not:
'http://$some_variable/path/to/file.php'
In Python, all strings behave like the single-quoted string in PHP.
This:
f_handler = urlopen("https://sys.argv[1]/path/menu.php")
Should be:
f_handler = urlopen("https://%s/path/menu.php"%(sys.argv[1]))
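On Python 3 the same substitution can also be written with str.format or an f-string; example.com here is a stand-in for sys.argv[1]:

```python
host = "example.com"  # stand-in for sys.argv[1]

url = "https://{}/path/menu.php".format(host)
print(url)  # -> https://example.com/path/menu.php

# or, on Python 3.6+:
url = f"https://{host}/path/menu.php"
```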

Python urllib2 file upload problems

I'm currently trying to initiate a file upload with urllib2 and the urllib2_file library. Here's my code:
import sys
import urllib2_file
import urllib2
URL='http://aquate.us/upload.php'
d = [('uploaded', open(sys.argv[1:]))]
req = urllib2.Request(URL, d)
u = urllib2.urlopen(req)
print u.read()
I've placed this .py file in my My Documents directory and put a shortcut to it in my Send To folder (the shortcut URL is ).
When I right-click a file, choose Send To, and select Aquate (my Python script), a command prompt opens for a split second and then closes. Nothing gets uploaded.
I knew there was probably an error going on, so I typed the code into the interactive Python interpreter, line by line.
When I ran the u=urllib2.urlopen(req) line, I didn't get an error;
instead, the cursor simply started blinking on a new line beneath that line. I waited a couple of minutes to see if something would happen but it just stayed like that. To get it to stop, I had to press ctrl+break.
What's up with this script?
Thanks in advance!
[Edit]
Forgot to mention -- when I ran the script without the request data (the file) it ran like a charm. Is it a problem with urllib2_file?
[edit 2]:
import MultipartPostHandler, urllib2, cookielib, sys
import win32clipboard as w
cookies = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookies),MultipartPostHandler.MultipartPostHandler)
params = {"uploaded" : open("c:/cfoot.js") }
a=opener.open("http://www.aquate.us/upload.php", params)
text = a.read()
w.OpenClipboard()
w.EmptyClipboard()
w.SetClipboardText(text)
w.CloseClipboard()
That code works like a charm if you run it through the command line.
If you're using Python 2.5 or newer, urllib2_file is both unnecessary and unsupported, so check which version you're using (and perhaps upgrade).
If you're using Python 2.3 or 2.4 (the only versions supported by urllib2_file), try running the sample code and see if you have the same problem. If so, there is likely something wrong either with your Python or urllib2_file installation.
EDIT:
Also, you don't seem to be using either of urllib2_file's two supported formats for POST data. Note as well that open() takes a single filename, so you want sys.argv[1], not the list sys.argv[1:]. Try using one of the following two lines instead:
d = ['uploaded', open(sys.argv[1])]
## --OR-- ##
d = {'uploaded': open(sys.argv[1])}
First, there's a third way to run Python programs.
From cmd.exe, type python myprogram.py. You get a nice log, and you don't have to type stuff one line at a time.
Second, check the urllib2 documentation. You'll need to look at urllib also.
A Request requires a URL and a urlencoded buffer of data:
data should be a buffer in the standard application/x-www-form-urlencoded format. The urllib.urlencode() function takes a mapping or sequence of 2-tuples and returns a string in this format.
You need to encode your data.
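For ordinary form fields (not file uploads) that encoding step looks like this; the function is urllib.urlencode in Python 2 and urllib.parse.urlencode in Python 3:

```python
import urllib.parse  # plain urllib.urlencode() on Python 2

data = urllib.parse.urlencode({'field': 'value', 'comment': 'a b'})
print(data)  # -> field=value&comment=a+b
```

Note that an actual file upload needs a multipart/form-data body, which urlencode does not produce; that is why the answers in this thread reach for MultipartPostHandler.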
If you're still on Python2.5, what worked for me was to download the code here:
http://peerit.blogspot.com/2007/07/multipartposthandler-doesnt-work-for.html
and save it as MultipartPostHandler.py
then use:
import urllib2, MultipartPostHandler
opener = urllib2.build_opener(MultipartPostHandler.MultipartPostHandler())
opener.open(url, {"file":open(...)})
or if you need cookies:
import urllib2, MultipartPostHandler, cookielib
cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj), MultipartPostHandler.MultipartPostHandler())
opener.open(url, {"file":open(...)})
