Stripping multiple characters from the start of a string - python

I'm trying to trim a substring from the beginning of a string based on a condition.
For instance, if the input is a domain name prefixed with http://, https:// and/or www., I need to strip these and return only the domain.
Here's what I have so far:
if my_domain.startswith("http://"):
    my_domain = my_domain[7:]
elif my_domain.startswith("https://"):
    my_domain = my_domain[8:]
if my_domain.startswith("www."):
    my_domain = my_domain[4:]
print(my_domain)
I've tried to use built-in string methods (.startswith) rather than regex.
While the code above works, I'm wondering if there is a more efficient way to combine the conditions to make the code shorter or have multiple checks in the same conditional statement?

I know regex is computationally slower than many of the built-in methods, but it is a lot easier to write :)
import re
re.sub("http[s]*://|www\." , "", my_domain)
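Edit:
As mentioned by @Dunes, a more correct way to solve this problem is:
re.sub(r"^https?://(www\.)?", "", my_domain)
This anchors the match to the start of the string, so an "http://" or "www." that appears later in the string is left alone.
Old answer left for reference so that Dunes's comment still has some context.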
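For comparison, here is a rough sketch of the same stripping without regex, using a loop over the known schemes. The helper name strip_prefixes is made up for this example, and it assumes at most one scheme followed by at most one "www." prefix.

```python
def strip_prefixes(domain):
    """Remove one leading scheme (http:// or https://), then one leading 'www.'."""
    for scheme in ("http://", "https://"):
        if domain.startswith(scheme):
            domain = domain[len(scheme):]
            break  # only one scheme can be at the start
    if domain.startswith("www."):
        domain = domain[len("www."):]
    return domain


print(strip_prefixes("https://www.example.com"))  # prints example.com
```

Using len(prefix) instead of hard-coded 7, 8 and 4 avoids off-by-one slips when a prefix changes. On Python 3.9 and later, str.removeprefix does the check and the slice in one call.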

Use urllib.parse (Python 3).
>>> from urllib import parse
>>> components = parse.urlsplit('http://stackoverflow.com/questions/38187220/stripping-multiple-characters-from-the-start-of-a-string')
>>> components[1]
'stackoverflow.com'
The Python 2.7 equivalent is named urlparse.
To cover the 'www.' case, you could simply do
*subdomains, domain, ending = components[1].split('.')
return '.'.join((domain, ending))
In Python 2.7 you don’t have access to * unpacking but you can use a list slice instead to get the same effect.
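Putting the pieces together, a minimal sketch might look like the following. The function name bare_domain is hypothetical, and unlike the split('.') approach above it only drops a leading "www.", so hosts with multi-part endings such as "example.co.uk" are left intact.

```python
from urllib.parse import urlsplit


def bare_domain(url):
    """Return the host part of a URL without a leading 'www.'."""
    host = urlsplit(url).netloc
    if host.startswith("www."):
        host = host[len("www."):]
    return host


print(bare_domain("http://www.stackoverflow.com/questions/38187220"))  # prints stackoverflow.com
```

One caveat: urlsplit only fills in netloc when the input contains "//", so a bare "www.example.com" would come back as an empty string and needs separate handling.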

Related

Returning a specific string from a JSON into separate variables | With Python

I am currently using a Roblox API that returns a very large JSON response, but I am only looking for specific data inside it. The data I need looks something like this:
gameinstanceId=f4beb4fc-82d1-4573-82f1-dd94c13a94eb
I only need the value after the "=", and I want to save every one it finds into separate variables. Basically, I just need to find ALL of them.
I don't know how to go about this. I thought of using substrings, but I have no idea how to do it.
Any pointers would be helpful.
If I've understood right, I think the regular expressions package (re) is your friend here. The following will return all instances found in a long string.
PS building regular expressions (regexes) can be tedious and I always forget the notation, so I always go to https://pythex.org/ to build my expressions.
import re
longstring = 'gameinstanceId=f4beb4fc-82d1-4573-82f1-dd94c13a94eb\ngameinstanceId=f4beb4fc-82d1-4573-82f1-dd94c13a94eb\n'
re.findall(r'gameinstanceId=([\w-]*)', longstring)
This code returns a list with all matches:
['f4beb4fc-82d1-4573-82f1-dd94c13a94eb',
'f4beb4fc-82d1-4573-82f1-dd94c13a94eb']
With further feedback and a URL, this approach is probably what you want:
import requests
resp = requests.get('https://rankbotddtgrcm.glitch.me/gameInstances?Place=2679871702')
re.findall(r'gameInstanceId=([\w-]*)', resp.text)
I use list comprehensions for this sort of thing:
mylist = [line.split("=",1)[1] for line in resp.text.splitlines() if line.startswith("gameinstanceId=")]
I typed this in on the fly, but it should be close.
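To check that idea without hitting the network, here is a small self-contained sketch that runs both the list comprehension and the regex over a made-up sample. The sample text and the second ID are invented for illustration; the real response may be laid out differently.

```python
import re

# Made-up sample standing in for resp.text
sample_text = (
    "gameinstanceId=f4beb4fc-82d1-4573-82f1-dd94c13a94eb\n"
    "unrelated=line\n"
    "gameinstanceId=0a1b2c3d-0000-1111-2222-333344445555\n"
)

key = "gameinstanceId="

# Keep only lines that start with the key, then take everything after the first "="
from_split = [
    line.split("=", 1)[1]
    for line in sample_text.splitlines()
    if line.startswith(key)
]

# Same extraction with a regex, which also finds keys that are not at the start of a line
from_regex = re.findall(r"gameinstanceId=([\w-]*)", sample_text)

print(from_split == from_regex)  # prints True
```

The split version only works when each key sits at the start of its own line. If the IDs are embedded inside longer lines of JSON, the regex is the safer of the two.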

Python3 - Generate string matching multiple regexes, without modifying them

I would like to generate a string matching my regexes using Python 3. For this I am using a handy library called rstr.
My regexes:
^[abc]+.
[a-z]+
My task:
I must find a generic way to create a string that matches both of my regexes.
What I cannot do:
Modify the regexes or join them in any way. I consider the following an ineffective solution, especially in the case of incompatible regexes:
import re
import rstr
regex1 = re.compile(r'^[abc]+.')
regex2 = re.compile(r'[a-z]+')
# Try up to 1000 random strings generated from regex1
for index in range(0, 1000):
    generated_string = rstr.xeger(regex1)
    if re.fullmatch(regex2, generated_string):
        break
else:
    # The loop finished without hitting break, so no candidate matched
    raise Exception('Regexes are probably incompatible.')
print('String matching both regexes is: {}'.format(generated_string))
Is there any workaround or any magical library that can handle this? Any insights appreciated.
Questions which are seemingly similar, but not helpful in any way:
Match a line with multiple regex using Python
The asker already has the string and just wants to check it against multiple regexes in the most elegant way. In my case we need to generate a string, in a smart way, that matches the regexes.
If you want a truly generic way, you can't really use a brute-force approach.
What you are looking for is to create some kind of representation of the regexps (as rstr does by calling sre_parse.py) and then call an SMT solver to satisfy both criteria.
For Haskell there is https://github.com/audreyt/regex-genex, which uses the Yices SMT solver to do just that, but I doubt there is anything like this for Python. If I were you, I'd bite the bullet and call it as an external program from your Python program.
I don't know if there is something that can fulfill your needs much more smoothly.
But I would do it something like this (as you've done already):
Create a Regex object with the re.compile() function.
Generate a string based on the 1st regex.
Pass the string you've got into the 2nd regex object using the search() method.
If that passes, you're done: the string matches both regexes.
Maybe you can create a function that takes both regexes as parameters and tests them "2 by 2" using the same logic.
And then if you have 8 regexes to match...
Just do:
call (regex1, regex2)
call (regex2, regex3)
call (regex4, regex5)
...
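The generate-and-test steps above can also be sketched with only the standard library. Instead of rstr, this version enumerates candidate strings over a small alphabet in order of length and returns the first one that every regex fully matches. The function name, the alphabet and the max_len cutoff are all assumptions made for illustration. It is still brute force, so it only suits short strings and small alphabets.

```python
import itertools
import re
import string


def first_string_matching_all(patterns, alphabet=string.ascii_lowercase, max_len=3):
    """Return the shortest string over `alphabet` that every pattern fully matches, or None."""
    compiled = [re.compile(p) for p in patterns]
    for length in range(1, max_len + 1):
        # itertools.product yields every combination of `length` characters
        for chars in itertools.product(alphabet, repeat=length):
            candidate = "".join(chars)
            if all(rx.fullmatch(candidate) for rx in compiled):
                return candidate
    return None  # nothing found within max_len characters


print(first_string_matching_all([r"^[abc]+.", r"[a-z]+"]))  # prints aa
```

Because it takes a list of patterns, the same call handles 8 regexes as easily as 2, without pairing them up. The number of candidates grows as len(alphabet) ** max_len, which is why a solver-based approach is needed for anything truly generic.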
I solved this using a slightly different approach. Notice that the second regex is basically insurance that only lowercase letters are generated in our new string.
I used Google's Python package sre_yield, which allows limiting the charset. The package is also available on PyPI. My code:
import sre_yield
import string
sre_yield.AllStrings(r'^[abc]+.', charset=string.ascii_lowercase)[0]
# returns `aa`

re.search() or 'in', re.match() or startswith()?

I am learning how to use the re library in Python and a question flashed through my mind. Please forgive me if this sounds stupid. I am new to this stuff. :)
According to this answer:
re.search - find something anywhere in the string
re.match - find something at the beginning of the string
Now I have this code:
from re import search
text = "Yay, I am on Stack Overflow. I am overjoyed!"
if search('am', text):  # not considering regex
    print('True')  # prints True
if 'am' in text:
    print('True')  # prints True
And this:
from re import match
text = "Yay, I am on Stack Overflow. I am overjoyed!"
if match('Yay', text):  # not considering regex
    print('True')  # prints True
if text.startswith('Yay'):
    print('True')  # prints True
So my question is: which one should I use when I am doing similar things (not involving regular expressions), such as fetching the contents of a webpage and searching within them? Should I use built-ins like above, or the standard re library? Which one will make the code more optimised/efficient?
Any help will be much appreciated. Thank you!
Regex is mostly used for complex match, search and replace operations, while built-ins such as the 'in' operator are mostly used for simple operations like checking whether a string contains a given word. Normally the 'in' operator is preferred. In terms of performance, 'in' is faster, but when a regex offers a much more elegant solution than typing a lot of 'if' statements, use the regex.
The same rule applies when you are fetching the contents of a webpage and searching within them.
Hope this helps.
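If you want to measure the difference yourself, a quick sketch with timeit could look like this. The variable names and the repetition count are arbitrary choices for this example, and the absolute numbers will depend on your machine and Python version.

```python
import re
import timeit

text = "Yay, I am on Stack Overflow. I am overjoyed!"

# Both pairs answer the same question for a plain substring
found_with_in = "am" in text
found_with_search = re.search("am", text) is not None
starts_with_method = text.startswith("Yay")
starts_with_match = re.match("Yay", text) is not None

# Time 100,000 repetitions of each containment check
in_seconds = timeit.timeit(lambda: "am" in text, number=100_000)
search_seconds = timeit.timeit(lambda: re.search("am", text), number=100_000)

print(f"in: {in_seconds:.4f}s  re.search: {search_seconds:.4f}s")
```

For literal substrings the two approaches give the same answer, so the choice comes down to readability and speed. The moment the thing you are looking for is a pattern rather than fixed text, only re can express it.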

Python: check if string meets specific format

Programming in Python3.
I am having difficulty checking whether a string meets a specific format.
I know that Python does not have a .contains() method like Java, but that we can use regex.
My code will probably look something like this, where lowpan_headers is a dictionary with a field holding a string that should meet a specific format:
import re
lowpan_headers = self.converter.lowpan_string_to_headers(lowpan_string)
pattern = re.compile("^([A-Z][0-9]+)+$")
pattern.match(lowpan_headers[dest_addrS])
However, my issue is in the format and I have not been able to get it right.
The format should be like bbbb00000000000000170d0000306fb6, where the first 4 characters should be bbbb and all the rest, with that exact length, should be hexadecimal values (so from 0-9 and a-f).
So two questions:
(1) any easier way of doing this except through importing re
(2) If not, can you help me out with the regex?
As for the regex you're looking for I believe that
^bbbb[0-9a-f]{28}$
should validate correctly for your requirements.
As for whether there is an easier way than using the re module, I would say there isn't really one for the result you're looking for. While the in keyword works the way you would expect a contains method to work for a string, what you actually want to know is whether a string is in the correct format. As such, the best solution, as it is relatively simple, is to use a regular expression, and thus the re module.
Here is a solution that does not use regex:
lowpan_headers = 'bbbb00000000000000170d0000306fb6'
if lowpan_headers[:4] == 'bbbb' and len(lowpan_headers) == 32:
    try:
        int(lowpan_headers[4:], 16)  # tries interpreting the last 28 characters as hexadecimal
        print('Input is valid!')
    except ValueError:
        print('Invalid Input')  # hex test failed!
else:
    print('Invalid Input')  # either length test or 'bbbb' prefix test failed!
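One thing to keep in mind with the int(..., 16) test is that int accepts more than bare hex digits: a leading sign, surrounding whitespace and uppercase letters all parse fine. If the format must be exactly lowercase hex, a stricter regex-free check could look like this sketch. The names HEX_DIGITS and is_valid_address are made up for the example.

```python
HEX_DIGITS = set("0123456789abcdef")


def is_valid_address(value):
    """True if value is 'bbbb' followed by exactly 28 lowercase hex digits."""
    return (
        len(value) == 32
        and value.startswith("bbbb")
        and all(ch in HEX_DIGITS for ch in value[4:])
    )


print(is_valid_address("bbbb00000000000000170d0000306fb6"))  # prints True
print(is_valid_address("bbbb" + "-" + "0" * 27))  # prints False
```

The second call shows the difference: int("-" + "0" * 27, 16) succeeds, so the try/except version would accept that string, while the per-character check rejects it.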
In fact, Python does have an equivalent to the .contains() method. You can use the in operator:
if 'substring' in long_string:
    return True
A similar question has already been answered here.
For your case, however, I'd still stick with regex, as you're indeed trying to validate a certain string format. To ensure that your string only has hexadecimal values, i.e. 0-9 and a-f, the following regex should do it: ^[a-fA-F0-9]+$. The additional "complication" is the four 'b' characters at the start of your string. I think an easy fix would be to include them as follows: ^(bbbb)?[a-fA-F0-9]+$.
>>> import re
>>> pattern = re.compile('^(bbbb)?[a-fA-F0-9]+$')
>>> test_1 = 'bbbb00000000000000170d0000306fb6'
>>> test_2 = 'bbbb00000000000000170d0000306fx6'
>>> pattern.match(test_1)
<_sre.SRE_Match object; span=(0, 32), match='bbbb00000000000000170d0000306fb6'>
>>> pattern.match(test_2)
>>>
The part that is currently missing is checking for the exact length of the string for which you could either use the string length method or extend the regex -- but I'm sure you can take it from here :-)
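For completeness, one way to fold the length check into the regex is sketched below. It assumes the prefix is mandatory and the digits are lowercase, as described in the question; ADDRESS_RE and matches_format are names invented for this example.

```python
import re

# 'bbbb' followed by exactly 28 lowercase hex digits (32 characters in total)
ADDRESS_RE = re.compile(r"bbbb[0-9a-f]{28}")


def matches_format(value):
    """True only if the whole string matches the address format."""
    return ADDRESS_RE.fullmatch(value) is not None


print(matches_format("bbbb00000000000000170d0000306fb6"))  # prints True
print(matches_format("bbbb00000000000000170d0000306fx6"))  # prints False
```

fullmatch is used instead of ^...$ with match because $ also matches just before a trailing newline, so a string ending in "\n" would slip through the anchored version.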
As I mentioned in the comment Python does have contains() equivalent.
if "blah" not in somestring:
    continue
(source) (PythonDocs)
If you would prefer to use a regex instead to validate your input, you can use this:
^b{4}[0-9a-f]{28}$ - Regex101 Demo with explanation

find multiple things in a string using regex in python

My input string contains various entities like this:
conn_type://host:port/schema#login#password
I want to find out all of them using regex in python.
As of now, I am able to find them one by one, like
conn_type = re.search(r'[a-zA-Z]+', test_string)
if conn_type:
    print("conn_type:", conn_type.group())
    next_substr_len = conn_type.end()
    host = re.search(r'[^:/]+', test_string[next_substr_len:])
and so on.
Is there a way to do it without if and else?
I expect there is some way, but I have not been able to find it. Please note that the regex for every entity is different.
Please help, I don't want to write boring code.
Why don't you use re.findall?
Here is an example:
import re

s = 'conn_type://host:port/schema#login#password asldasldasldasdasdwawwda conn_type://host:port/schema#login#email'

def get_all_matches(s):
    # Raw string; '/' needs no escaping in a Python regex
    matches = re.findall(r'[a-zA-Z]+_[a-zA-Z]+:/+[a-zA-Z]+:+[a-zA-Z]+/+[a-zA-Z]+#+[a-zA-Z]+#[a-zA-Z]+', s)
    return matches

print(get_all_matches(s))
This returns a list of all matches of the regex, which in this case would be:
['conn_type://host:port/schema#login#password', 'conn_type://host:port/schema#login#email']
If you need help making regex patterns in Python I would recommend using the following website:
A pretty neat online regex tester
Also check the re module's documentation for more on re.findall
Documentation for re.findall
Hope this helps!
>>> import re
>>> uri = "conn_type://host:port/schema#login#password"
>>> res = re.findall(r'(\w+)://(.*?):([A-Za-z0-9]+)/(\w+)#(\w+)#(\w+)', uri)
>>> res
[('conn_type', 'host', 'port', 'schema', 'login', 'password')]
No need for ifs. Use findall or finditer to search through your collection of connection types. Filter the list of tuples, as need be.
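If positional tuples get hard to read, named groups are one option. The sketch below is an assumption about how the fields could be labelled (the group names simply mirror the sample string), and it uses fullmatch on a single URI rather than findall on a longer text.

```python
import re

# One named group per entity in conn_type://host:port/schema#login#password
URI_RE = re.compile(
    r"(?P<conn_type>\w+)://"
    r"(?P<host>[^:/]+):"
    r"(?P<port>\w+)/"
    r"(?P<schema>\w+)#"
    r"(?P<login>\w+)#"
    r"(?P<password>\w+)"
)

uri = "conn_type://host:port/schema#login#password"
match = URI_RE.fullmatch(uri)
parts = match.groupdict() if match else {}

print(parts["host"], parts["port"])  # prints host port
```

groupdict() returns a plain dict keyed by group name, so each entity can be looked up directly instead of by index, and a single check on match replaces the chain of ifs.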
If you like it DIY, consider creating a tokenizer. This is a very elegant, Pythonic solution.
Or use the standard library: https://docs.python.org/3/library/urllib.parse.html. Note, however, that your sample URL is not fully valid: there is no 'conn_type' scheme and you have two anchors in it, so urlparse wouldn't work as expected. For real-life URLs, though, I highly recommend this approach.
