Python string.split() How to ignore empty spaces [duplicate] - python

This question already has answers here:
Why are empty strings returned in split() results?
(9 answers)
Closed 2 years ago.
I'm using split to parse HTTP requests and came across something I do not like, but I don't know a better way.
Imagine I have this GET: /url/hi
I'm splitting the URL simply like so:
fields = request['url'].split('/')
It's simple and it works, but the first element of the resulting list is an empty string. I know this is expected behavior.
The question is: can I change the call to split to account for this, or do I just live with it?

If you always want to remove the first entry of the list, you can do this:
fields = request['url'].split('/')[1:]
If you want to remove an empty string from the list, you can instead follow your initial call with this. Note that list.remove deletes only the first matching element and raises ValueError if the list contains no empty string:
fields.remove('')
Hope it helps!
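A minimal sketch of dropping every empty segment in one pass, which also covers trailing or doubled slashes. The path value here is a hypothetical example, not taken from the question:

```python
# Hypothetical request path with a leading, a doubled, and a trailing slash.
path = "/url//hi/"

# Keep only the non-empty pieces produced by split.
fields = [part for part in path.split("/") if part]

print(fields)  # ['url', 'hi']
```

Unlike slicing off the first element, this does not depend on where the empty strings occur, but it also hides the difference between "/url/hi" and "/url//hi".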

OK, if you are sure your string starts with '/',
you can skip the first character like this:
url = request['url']
fields = url[1:].split('/')  # slice from index 1 to the end
If you're not sure, simply check first:
url = request['url']
if url.startswith('/'):
    url = url[1:]
fields = url.split('/')
Happy coding 😎
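The check above can be wrapped in a small function so it is easy to reuse. This is only a sketch; the name split_path is made up for illustration:

```python
def split_path(url):
    """Split a URL path on '/' after dropping at most one leading slash."""
    if url.startswith("/"):
        url = url[1:]
    return url.split("/")


print(split_path("/url/hi"))  # ['url', 'hi']
print(split_path("url/hi"))   # ['url', 'hi']
```

A trailing slash still produces an empty last element, so "/url/hi/" would need separate handling.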

Related

Python getting the name of a website from its url [duplicate]

This question already has answers here:
Extract domain name from URL in Python
(8 answers)
Closed 5 months ago.
I want to get the name of a website from a url in a very simple way. Like, I have the URL "https://www.google.com/" or any other url, and I want to get the "google" part.
The issue is that there could be many pitfalls. Like, it could be www3 or it could be http for some reason. It could also be like the python docs where it says "https://docs.python.org/3/library/urllib.parse.html#module-urllib.parse". I only want "python" in that case.
Is there a simple way to do it? The only one I can think of is doing lots and lots of string.removeprefix calls or something like that, but that's ugly. I could not find anything that resembled what I was looking for in the urllib library, but maybe there is another one?
Here's an idea:
import re
url = 'https://python.org'
url_ext = ['.com', '.org', '.edu', '.net', '.co.uk', '.cc', '.info', '.io']
web_name = ''
# Cuts off extension and everything after
for ext in url_ext:
    if ext in url:
        web_name = url.split(ext)[0]
# Reverse the string to find first non-alphanumeric character
web_name = web_name[::-1]
final = re.search(r'\W+', web_name).start()
final = web_name[0 : final]
# Reverse string again, return final
print(final[::-1])
The code starts by cutting off the extension of the website and everything that follows it. It then reverses the string and looks for the first non-alphanumeric character and cuts off everything after that utilizing the regex library. It then reverses the string again to print out the final result.
This code is probably not going to work on every single website, as there are a million different ways to structure a URL, but it should work for you to some degree.
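An alternative sketch that lets urllib.parse find the host and then takes the label just before the last dot. This is a deliberately naive assumption: it returns "co" for a host like bbc.co.uk, so multi-part suffixes would need a public suffix list. The function name site_name is made up for illustration:

```python
from urllib.parse import urlsplit


def site_name(url):
    """Return the second-to-last label of the URL's host name (naive)."""
    host = urlsplit(url).hostname or ""
    labels = host.split(".")
    # With fewer than two labels there is nothing to strip off.
    return labels[-2] if len(labels) >= 2 else host


print(site_name("https://www.google.com/"))  # google
print(site_name("https://docs.python.org/3/library/urllib.parse.html#module-urllib.parse"))  # python
```

The scheme (http or https), any www or www3 prefix, the path, and the fragment never reach the label logic, because urlsplit separates them out first.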

find substrings between two string [duplicate]

This question already has answers here:
Parsing HTML using Python
(7 answers)
Closed 3 years ago.
I have a string like this:
string = r'''<img height="233" src="monline/" title="email example" width="500" ..
title="second example title" width="600"...
title="one more title"...> '''
I am trying to get anything that appears as title (title="Anything here")
I have already tried this but it does not work correctly.
re.findall(r'title=\"(.*)\"',string)
I think your regex is too greedy. You can try something like this:
re.findall(r'title=\"(?P<title>[\w\s]+)\"', string)
As @Austin and @Plato77 said in the comments, there is a better way to parse HTML in Python. See other SO answers for more context. There are a few common tools for this, like:
https://docs.python.org/3/library/html.parser.html
https://www.simplifiedpython.net/parsing-html-in-python/
https://github.com/psf/requests-html / Get html using Python requests?
If you would like to read more on performance testing of different python HTML parsers you can learn more here
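As a rough illustration of the parser route, here is a sketch using the standard library's html.parser. The class name TitleCollector and the sample markup are made up for the example, and the markup is cleaner than the fragment in the question:

```python
from html.parser import HTMLParser


class TitleCollector(HTMLParser):
    """Collect the value of every title attribute seen on a start tag."""

    def __init__(self):
        super().__init__()
        self.titles = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the current tag.
        for name, value in attrs:
            if name == "title":
                self.titles.append(value)


# Hypothetical, well-formed sample input.
html = '<img src="a.png" title="email example" width="500"><img src="b.png" title="second example title">'

collector = TitleCollector()
collector.feed(html)
print(collector.titles)  # ['email example', 'second example title']
```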
As @Austin and @Plato77 said in the comments, there is a better way to parse HTML in Python. I stand by this too, but if you want to get it done through regex, this may help:
c = re.finditer(r'title=[\"]([a-zA-Z0-9\s]+)[\" ]', string)
for i in c:
    print(i.group(1))
The problem here is that the next " symbol is parsed as an ordinary character and is considered part of the (.*) in your RE, so the match runs on to the last " on the line. For your use case, you can restrict the match to letters, digits, and whitespace.
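Another way to stop at the closing quote is a negated character class, which also allows punctuation inside the title. A small sketch using the string from the question:

```python
import re

string = r'''<img height="233" src="monline/" title="email example" width="500" ..
title="second example title" width="600"...
title="one more title"...> '''

# [^"]* matches anything except a double quote, so it cannot run past the closing one.
titles = re.findall(r'title="([^"]*)"', string)
print(titles)  # ['email example', 'second example title', 'one more title']
```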

Python String Spliting [duplicate]

This question already has answers here:
In Python, how do I split a string and keep the separators?
(19 answers)
Closed 9 years ago.
This code almost does what I need it to..
for line in all_lines:
    s = line.split('>')
Except it removes all the '>' delimiters.
So,
<html><head>
Turns into
['<html','<head']
Is there a way to use the split() method but keep the delimiter, instead of removing it?
With these results..
['<html>','<head>']
d = ">"
for line in all_lines:
    s = [e+d for e in line.split(d) if e]
If you are parsing HTML with splits, you are most likely doing it wrong, except if you are writing a one-shot script aimed at a fixed and secure content file. If it is supposed to work on any HTML input, how will you handle something like <a title='growth > 8%' href='#something'>?
Anyway, the following works for me:
>>> import re
>>> re.split('(<[^>]*>)', '<body><table><tr><td>')[1::2]
['<body>', '<table>', '<tr>', '<td>']
How about this:
import re
s = '<html><head>'
re.findall('[^>]+>', s)
Just split it, then for each element in the array/list (apart from the last one) add a trailing ">" to it.
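That last suggestion can be sketched as follows. The function name split_keep is made up for illustration:

```python
def split_keep(line, delimiter=">"):
    """Split on delimiter but keep it attached to each piece it terminated."""
    parts = line.split(delimiter)
    # Every piece except the last was followed by a delimiter in the input.
    pieces = [part + delimiter for part in parts[:-1]]
    # The last piece had no delimiter after it; keep it only if non-empty.
    if parts[-1]:
        pieces.append(parts[-1])
    return pieces


print(split_keep("<html><head>"))  # ['<html>', '<head>']
```

Because the last piece is handled separately, trailing text without a closing '>' is kept as-is instead of getting a delimiter it never had.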

Check if string in the exact form of “<int1>,<int2>” in Python, without using regex, or try/catch [duplicate]

This question already has an answer here:
Check if string in the exact form of "<int1>,<int2>" in Python
(1 answer)
Closed 6 years ago.
I'm converting a string of two integers into a tuple. I need to make sure my string is formatted exactly in the form of:
"<int1>,<int2>"
This is not a duplicate of an earlier question, since that one did not address restrictions I did not know about earlier. My parameter will be "4,5", for example. I'm not allowed to write other helper functions to check whether the string is formatted correctly. The checks must be done in a single function called convert_to_tuple.
I just looked at the project specs again, and I'm not allowed to import any new modules, so regex is off the table. I'm also not allowed to use try/except.
Can you point me in the right direction? Thanks
Here is my code for converting the string into a tuple, so I need some type of check before executing this code.
if foo:
    s1 = "12,24"
    string_li = s1.split(',')
    num_li = [int(x) for x in string_li]
    num_tuple = tuple(num_li)
    return num_tuple
else:
    empty_tuple = ()
    return empty_tuple
Does this work? (Edited to meet OP's requirements)
def is_int(string):
    return string and set(string).issubset(set('1234567890'))
def check_str(s):
    parts = s.split(',', 1)
    return len(parts) == 2 and is_int(parts[0]) and is_int(parts[1])
I believe for testing (without converting, and without regexes or exception handling) a simple:
vals = s1.split(',')
if len(vals) == 2 and all(map(str.isdigit, vals)):
would verify that there are two components and both of them are non-empty and composed solely of digits.
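Putting the check and the conversion into the single required function might look like the sketch below. It assumes only non-negative integers written with ASCII digits are valid, which the question does not state explicitly. The extra isascii() test (Python 3.7+) is there because str.isdigit also accepts characters such as superscript digits that int() rejects:

```python
def convert_to_tuple(s):
    """Return (int1, int2) for input like '12,24', or () if the format is wrong."""
    parts = s.split(",")
    # Exactly two parts, each non-empty and made only of ASCII digits.
    if len(parts) == 2 and all(p.isascii() and p.isdigit() for p in parts):
        return (int(parts[0]), int(parts[1]))
    return ()


print(convert_to_tuple("12,24"))  # (12, 24)
print(convert_to_tuple("12,"))    # ()
```

No imports, helper functions, or exception handling are needed, since every string that passes the check is safe to hand to int().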

Python generate string based on regex format [duplicate]

This question already has answers here:
Reversing a regular expression in Python
(8 answers)
Closed 1 year ago.
I have some difficulties learning regex in Python. I want to turn my Tornado web route configuration, along with its arguments, into a request path string without using the handler's request.path.
For example, I have route with patterns like:
/entities/([0-9]+)
/product/([0-9]+)/actions
The expected result combine with integer parameter (123) will be a string like:
/entities/123
/product/123/actions
How do I generate string based on that pattern?
Thank you very much in advance!
This might be a possible duplicate of:
Reversing a regular expression in Python
Generate a String that matches a RegEx in Python
Using the answer provided by @bjmc, a solution works like this:
>>> import rstr
>>> intermediate = rstr.xeger(r'\d+')
>>> path = '/product/' + intermediate + '/actions'
Depending on how long you want your intermediate integer to be, you could replace the regex with one such as \d{1,3}
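If the goal is to plug a known value such as 123 into the route rather than generate a random match, a standard-library sketch is to substitute each capture group with the supplied argument. This is a different technique from rstr, and it assumes simple, non-nested groups like the ones in the question. The name fill_route is made up for illustration, and the value is not validated against the group's pattern:

```python
import re


def fill_route(pattern, *values):
    """Replace each simple (...) group in pattern with the next value."""
    remaining = iter(values)
    # \([^()]*\) matches one parenthesised group with no nested parentheses.
    return re.sub(r"\([^()]*\)", lambda match: str(next(remaining)), pattern)


print(fill_route("/entities/([0-9]+)", 123))         # /entities/123
print(fill_route("/product/([0-9]+)/actions", 123))  # /product/123/actions
```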
