How to parse URLs using urlparse and split() in python?

Could someone explain the purpose of the line host = parsed.netloc.split('@')[-1].split(':')[0] in the following code? I understand that we are trying to get the host name from netloc, but I don't understand why we split on the @ delimiter and then again on the : delimiter.
# Python 2 code; in Python 3 the module is urllib.parse
import urlparse
parsed = urlparse.urlparse('https://www.google.co.uk/search?client=ubuntu&channel=fs')
print parsed
host = parsed.netloc.split('@')[-1].split(':')[0]
print host
Result:
ParseResult(scheme='https', netloc='www.google.co.uk', path='/search', params='', query='client=ubuntu&channel=fs', fragment='')
www.google.co.uk
Surely if one just needs the domain, we can get that from parsed.netloc?

In its full form, netloc can also carry HTTP authentication credentials and a port number:
login:password@www.google.co.uk:80
See RFC 1808 and RFC 1738.
So we potentially have to split that into ["login:password", "www.google.co.uk:80"], take the last part, split that into ["www.google.co.uk", "80"], and take the first element, which is the hostname.
If these parts are omitted, there's no harm in splitting on nonexistent delimiters: split() then returns a one-element list, so [-1] and [0] still pick the whole string, and there is no need to check whether the parts are present.
urlparse documentation
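As a rough illustration of the two splits, here is a minimal Python 3 sketch. The helper name host_from_netloc and the sample URLs are made up for this example; the hostname and port attributes are part of the standard urllib.parse result object and may be a simpler option where they are available.

```python
from urllib.parse import urlparse

def host_from_netloc(netloc):
    # Drop optional "user:password@" credentials, then the optional ":port".
    return netloc.split('@')[-1].split(':')[0]

# Hypothetical URLs: one with credentials and a port, one without either.
full = urlparse('https://login:password@www.google.co.uk:80/search?client=ubuntu')
bare = urlparse('https://www.google.co.uk/search?client=ubuntu&channel=fs')

print(full.netloc)                    # login:password@www.google.co.uk:80
print(host_from_netloc(full.netloc))  # www.google.co.uk
print(host_from_netloc(bare.netloc))  # www.google.co.uk

# The parse result also exposes the pieces directly.
print(full.hostname, full.port)       # www.google.co.uk 80
```

Note that the split-based helper assumes a plain host name; a bracketed IPv6 literal such as [::1]:80 contains extra colons and would need the hostname attribute instead.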

Related

How to use password with special character in Basic Auth

I'm using the Python Telegram Bot API (https://github.com/python-telegram-bot/python-telegram-bot) behind a proxy.
My task is to configure the proxy with Basic Auth.
It works perfectly like this:
REQUEST_KWARGS = {
    "proxy_url": "https://TechLogin:Pass#word!@192.168.111.38:5050/"  # proxy example
}
updater = Updater(token=my_bot_token, request_kwargs=REQUEST_KWARGS)
But the sysadmin changed the password for TechLogin, and it now contains some special characters:
new_login = "-th#kr123=,1"
and the requests library (or even urllib3) can't parse it:
telegram.vendor.ptb_urllib3.urllib3.exceptions.LocationParseError: Failed to parse: TechLogin:-th
It looks like it can't parse the hash symbol #.
How can I escape it?

You must use URL-encoding, as shown in this post:
Escaping username characters in basic auth URLs
This way the # in the password becomes %23
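A minimal sketch of doing the encoding in Python 3 with the standard library's urllib.parse.quote, using the login, password and proxy address from the question. Passing safe="" is a choice made here so that every reserved character, including "/", gets percent-encoded.

```python
from urllib.parse import quote

login = "TechLogin"
password = "-th#kr123=,1"

# Percent-encode everything that is not an unreserved character.
encoded_password = quote(password, safe="")
print(encoded_password)  # -th%23kr123%3D%2C1

# Only the credentials are encoded; the "@" separating them from the host stays literal.
proxy_url = "https://{}:{}@192.168.111.38:5050/".format(quote(login, safe=""), encoded_password)
print(proxy_url)
```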

You can use URL-encoded values. For example:
# = %23
Any URL encoder will convert the value for you:
"-th#kr123=,1" = "-th%23kr123%3D%2C1"
"proxy_url": "https://TechLogin:-th%23kr123%3D%2C1@192.168.111.38:5050/"

SSLError - Certificate verify failed [first question answered, code still not working]

import base64
import math
import bs4
from bs4 import BeautifulSoup
# Define Login Auth
usrPass = "un:pass"
b64Val = base64.b64encode(usrPass)
# Input Direct Link
r = requests.get('https://URL...../api/incident?id=34219"', headers ={"Authorization": "Basic dHJhZ286eGVsb2M=" % b64Val}).text
print(r)
I'm new to Python. What I'm trying to do is web-scrape a page that is an online/in-browser ticketing system. There is API documentation provided that details the 'Get Incident' function and spits out a full request URL and an Authorization header that converts my login to a base64 code. The problem is that I can't figure out how to format this into my code so it actually works. Currently, I'm getting the 'Not all Arguments converted to STR' error, which I suspect is because I'm misusing '%' somewhere.
Thank you in advance for your help and if there is a better way to execute this code please let me know, I have a feeling there's a better way to go about this that I'm not aware of.

You need to specify where the formatted value goes. Here, since b64Val is meant to be the encoded credentials, put %s in its place:
r = requests.get('https://URL...../api/incident?id=34219', headers={"Authorization": "Basic %s" % b64Val}).text
You can also use f-strings if you are on Python 3.6 or later (note that in Python 3 b64encode takes and returns bytes, so encode the input and decode the result first):
r = requests.get('https://URL...../api/incident?id=34219', headers={"Authorization": f"Basic {b64Val}"}).text

The purpose of the % operator is to replace special tokens such as %s in a string with actual values. For example:
name = "Andy"
print("Hello %s" % name) # prints Hello Andy
You are getting this error because the string you're using does not have any %s tokens.
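Putting the pieces together, a minimal Python 3 sketch of building the header might look like the following. The helper name basic_auth_header is hypothetical, and the credentials are the placeholders from the question. In Python 3, base64.b64encode takes and returns bytes, so the value is encoded before and decoded after.

```python
import base64

def basic_auth_header(username, password):
    # b64encode works on bytes, so encode the "user:password" pair first.
    credentials = "{}:{}".format(username, password).encode("utf-8")
    # Decode back to text so that %s inserts the token itself, not b'...'.
    token = base64.b64encode(credentials).decode("ascii")
    return {"Authorization": "Basic %s" % token}

headers = basic_auth_header("un", "pass")
print(headers)
```

The resulting dictionary can be passed as headers= to requests.get. The requests library also accepts auth=(username, password) on the same call and builds the Basic header itself, which may be the simpler route.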

Python parse log using regex

Hope someone might be able to help. I have a log being sent from a syslog server to Python which looks like this:
{'Raw': 'Nov 26 00:23:07 TEST 23856434232342 (2016-11-26T00:23:07) http-proxy[2063]: Allow 1-Trusted 0-External tcp 192.168.0.1 2.3.4.5 57405 80 msg="HTTP Request" proxy_act="HTTP-TEST" op="POST" dstname="www.google.com" arg="/" sent_bytes="351" rcvd_bytes="1400" (HTTP-proxy-TEST-00)'}
I need to be able to extract the IP address, dstname=, sent_bytes= and rcvd_bytes= and, if possible, parse them to JSON. I started trying to use the regex (["'])(?:(?=(\\?))\2.)*?\1 to match the double quotes, but it's not working correctly.
Any ideas how I might get the data I need, or how to parse the above to JSON?
Thanks

Assuming the IP, dstname, sent_bytes and rcvd_bytes always appear in that order, use re.findall to get them all:
import re
s = r"""{'Raw': 'Nov 26 00:23:07 TEST 23856434232342 (2016-11-26T00:23:07) http-proxy[2063]: Allow 1-Trusted 0-External tcp 192.168.0.1 2.3.4.5 57405 80 msg="HTTP Request" proxy_act="HTTP-TEST" op="POST" dstname="www.google.com" arg="/" sent_bytes="351" rcvd_bytes="1400" (HTTP-proxy-TEST-00)'}"""
# raw string, so that \s reaches the regex engine unchanged
match = re.findall(r'(?:tcp |dstname=|sent_bytes=|rcvd_bytes=)"?([^\s"]+)', s)
# match = ['192.168.0.1', 'www.google.com', '351', '1400']
(ip, dstname, sent_bytes, rcvd_bytes) = match
# use this to parse to json
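The last step could be sketched as follows in Python 3, using the sample log line from the question. The key names in fields are chosen for this example, and the approach still assumes the four values always appear in the same order.

```python
import json
import re

# The 'Raw' value from the sample log entry in the question.
raw = ('Nov 26 00:23:07 TEST 23856434232342 (2016-11-26T00:23:07) http-proxy[2063]: '
       'Allow 1-Trusted 0-External tcp 192.168.0.1 2.3.4.5 57405 80 msg="HTTP Request" '
       'proxy_act="HTTP-TEST" op="POST" dstname="www.google.com" arg="/" '
       'sent_bytes="351" rcvd_bytes="1400" (HTTP-proxy-TEST-00)')

# Key names are illustrative; the order must match the order in the log line.
fields = ("ip", "dstname", "sent_bytes", "rcvd_bytes")
values = re.findall(r'(?:tcp |dstname=|sent_bytes=|rcvd_bytes=)"?([^\s"]+)', raw)

record = dict(zip(fields, values))
as_json = json.dumps(record)
print(as_json)
```

All values stay strings here; convert sent_bytes and rcvd_bytes with int() before dumping if numeric JSON values are wanted.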

How to check whether the given url is valid or not in django

I need to check whether a given URL is valid in Django. I have used URLValidator to check the URL, but it's not working as I expected. My code is below.
from django.core.validators import URLValidator
from django.core.exceptions import ValidationError
a = "www.google.com"
url = URLValidator()
try:
    url(a)
    print("correct")
except ValidationError:
    print("wrong")
When I run the code above with a = "www.google.com", the output is "wrong". When I pass a = "http://www.google.com", the output is "correct". But when I pass a = "http://www.googlecom", the output is also "correct", which confuses me. Can anyone help me understand this?
Thanks in advance.

"www.google.com" is not a valid URL because it has no protocol (http or https, for example). It is malformed.
"http://www.googlecom" is treated as a valid URL by Django because it has the right format (protocol://something.something). Although this URL does not work, it is not malformed.
Django does not try to check whether googlecom is a valid top-level domain, because there are many of those and the set changes often. It only checks that the URL is not malformed.
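The difference between "malformed" and "does not work" can be seen with the standard library alone. The sketch below is not Django's implementation and is far less strict than URLValidator. The helper name has_scheme_and_host is made up, and the check only looks at whether a scheme and a host part are present.

```python
from urllib.parse import urlparse

def has_scheme_and_host(url):
    # A purely structural check: no DNS lookup, no TLD list, no request.
    parts = urlparse(url)
    return bool(parts.scheme) and bool(parts.netloc)

print(has_scheme_and_host("www.google.com"))         # False: no scheme, so no host is recognised
print(has_scheme_and_host("http://www.google.com"))  # True
print(has_scheme_and_host("http://www.googlecom"))   # True: well-formed, even if it does not resolve
```

Checking that a well-formed URL actually works needs a separate step, such as making a request to it.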

Python IMAP Search from or to designated email address

I am using this with Gmail's SMTP server, and I would like to search via IMAP for emails either sent to or received from an address.
This is what I have:
mail = imaplib.IMAP4_SSL('imap.gmail.com')
mail.login('user', 'pass')
mail.list()
mail.select("[Gmail]/All Mail")
status, email_ids = mail.search(None, 'TO "tech163@fusionswift.com" OR FROM "tech163@fusionswift.com"')
The last line of the error is: imaplib.error: SEARCH command error: BAD ['Could not parse command']
Not sure how I'm supposed to do that kind of OR statement within python's imaplib. If someone can quickly explain what's wrong or point me in the right direction, it'd be greatly appreciated.

The error is generated by the server because it can't parse the search query. To build a valid query, follow RFC 3501, which explains the structure of search commands in detail (page 49): OR is a prefix operator that takes two search keys.
For example, your search string should be:
'(OR (TO "tech163@fusionswift.com") (FROM "tech163@fusionswift.com"))'
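A small sketch of how that string could be built and passed to imaplib. Both helper names are hypothetical, and the code assumes the address contains no double quotes or backslashes that would need escaping. The second function expects an IMAP4 object that has already logged in and selected a mailbox, as in the question.

```python
def either_direction_query(address):
    # Prefix form from RFC 3501: OR <search-key1> <search-key2>
    return '(OR (TO "{0}") (FROM "{0}"))'.format(address)

def search_either_direction(mail, address):
    # mail: an imaplib.IMAP4 / IMAP4_SSL instance, logged in, mailbox selected.
    status, data = mail.search(None, either_direction_query(address))
    # On success, data[0] holds the matching message numbers separated by spaces.
    return data[0].split() if status == "OK" else []

query = either_direction_query("tech163@fusionswift.com")
print(query)  # (OR (TO "tech163@fusionswift.com") (FROM "tech163@fusionswift.com"))
```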

Try the IMAP query builder from https://github.com/ikvk/imap_tools:
import datetime
from imap_tools import A, AND, OR, NOT
# AND
A(text='hello', new=True) # '(TEXT "hello" NEW)'
# OR
OR(text='hello', date=datetime.date(2000, 3, 15)) # '(OR TEXT "hello" ON 15-Mar-2000)'
# NOT
NOT(text='hello', new=True) # 'NOT (TEXT "hello" NEW)'
# complex
A(OR(from_='from@ya.ru', text='"the text"'), NOT(OR(A(answered=False), A(new=True))), to='to@ya.ru')
Of course, you can use the rest of the library's tools as well.
