find multiple things in a string using regex in python - python

My input string contains various entities like this:
conn_type://host:port/schema#login#password
I want to find out all of them using regex in python.
As of now, I am able to find them one by one, like
conn_type=re.search(r'[a-zA-Z]+',test_string)
if (conn_type):
print "conn_type:", conn_type.group()
next_substr_len = conn_type.end()
host=re.search(r'[^:/]+',test_string[next_substr_len:])
and so on.
Is there a way to do it without if and else?
I expect there to be some way, but not able to find it. Please note that every entity regex is different.
Please help, I don't want to write a boring code.

Why don't you use re.findall?
Here is an example:
import re;
s = 'conn_type://host:port/schema#login#password asldasldasldasdasdwawwda conn_type://host:port/schema#login#email';
def get_all_matches(s):
matches = re.findall('[a-zA-Z]+_[a-zA-Z]+:\/+[a-zA-Z]+:+[a-zA-Z]+\/+[a-zA-Z]+#+[a-zA-Z]+#[a-zA-Z]+',s);
return matches;
print get_all_matches(s);
this will return a list full of matches to your current regex as seen in this example which in this case would be:
['conn_type://host:port/schema#login#password', 'conn_type://host:port/schema#login#email']
If you need help making regex patterns in Python I would recommend using the following website:
A pretty neat online regex tester
Also check the re module's documentation for more on re.findall
Documentation for re.findall
Hope this helps!

>>>import re
>>>uri = "conn_type://host:port/schema#login#password"
>>>res = re.findall(r'(\w+)://(.*?):([A-z0-9]+)/(\w+)#(\w+)#(\w+)', uri)
>>>res
[('conn_type', 'host', 'port', 'schema', 'login', 'password')]
No need for ifs. Use findall or finditer to search through your collection of connection types. Filter the list of tuples, as need be.

If you like it DIY, consider creating a tokenizer. This is very elegant "python way" solution.
Or use a standard lib: https://docs.python.org/3/library/urllib.parse.html but note, that your sample URL is not fully valid: there is no schema 'conn_type' and you have two anchors in the query string, so urlparse wouldn't work as expected. But for real-life URLs I highly recommend this approach.

Related

Returning a specific string from a JSON into separate variables | With Python

I am currently using a roblox API which returns some very LARGE json response but I am only looking for a specific data inside, the data I need looks something like this.
gameinstanceId=f4beb4fc-82d1-4573-82f1-dd94c13a94eb
I am only for the data after the "=" and save all the ones it finds into separate variables, I just need to find ALL of them basically.
I don't know how to get around doing this, I thought of using substrings but again I have no idea on how to do it.
Any pointers would be helpful.
If I've understood right, I think the regular expressions package (re) is your friend here. The following will return all instances found in a long string.
PS building regular expressions (regexes) can be tedious and I always forget the notation, so I always go to https://pythex.org/ to build my expressions.
import re
longstring = 'gameinstanceId=f4beb4fc-82d1-4573-82f1-dd94c13a94eb\ngameinstanceId=f4beb4fc-82d1-4573-82f1-dd94c13a94eb\n'
re.findall(r'gameinstanceId=([\w-]*)', longstring)
This code returns a list with all matches:
['f4beb4fc-82d1-4573-82f1-dd94c13a94eb',
'f4beb4fc-82d1-4573-82f1-dd94c13a94eb']
With further feedback and a URL, this approach is probably what you want:
import requests
resp = requests.get('https://rankbotddtgrcm.glitch.me/gameInstances?Place=2679871702')
re.findall(r'gameInstanceId=([\w-]*)', resp.text)
I use list comprehensions for this sort of thing:
mylist = [line.split("=",1)[1] for line in resp.text.splitlines() if line.startswith("gameinstanceId=")]
I typed this in on the fly, but it should be close.

Python3 - Generate string matching multiple regexes, without modifying them

I would like to generate string matching my regexes using Python 3. For this I am using handy library called rstr.
My regexes:
^[abc]+.
[a-z]+
My task:
I must find a generic way, how to create string that would match both my regexes.
What I cannot do:
Modify both regexes or join them in any way. This I consider as ineffective solution, especially in the case if incompatible regexes:
import re
import rstr
regex1 = re.compile(r'^[abc]+.')
regex2 = re.compile(r'[a-z]+')
for index in range(0, 1000):
generated_string = rstr.xeger(regex1)
if re.fullmatch(regex2, generated_string):
break;
else:
raise Exception('Regexes are probably incompatibile.')
print('String matching both regexes is: {}'.format(generated_string))
Is there any workaround or any magical library that can handle this? Any insights appreciated.
Questions which are seemingly similar, but not helpful in any way:
Match a line with multiple regex using Python
Asker already has the string, which he just want to check against multiple regexes in the most elegant way. In my case we need to generate string in a smart way that would match regexes.
If you want really generic way, you can't really use brute force approach.
What you look for is create some kind of representation of regexp (as rstr does through call of sre_parse.py) and then calling some SMT solver to satisfy both criteria.
For Haskell there is https://github.com/audreyt/regex-genex which uses Yices SMT solver to do just that, but I doubt there is anything like this for Python. If I were you, I'd bite a bullet and call it as external program from your python program.
I don't know if there is something that can fulfill your needs much smother.
But I would do it something like (as you've done it already):
Create a Regex object with the re.compile() function.
Generate String based on 1st regex.
Pass the string you've got into the 2nd regex object using search() method.
If that passes... your done, string passed both regexs.
Maybe you can create a function and pass both regexes as parameters and test "2 by 2" using the same logic.
And then if you have 8 regexes to match...
Just do:
call (regex1, regex2)
call (regex2, regex3)
call (regex4, regex5)
...
I solved this using a little alternative approach. Notice second regex is basically insurance so only lowercase letters are generated in our new string.
I used Google's python package sre_yield which allows charset limitation. Package is also available on PyPi. My code:
import sre_yield
import string
sre_yield.AllStrings(r'^[abc]+.', charset=string.ascii_lowercase)[0]
# returns `aa`

re.search() or 'in', re.match() or startswith()?

I am learning how to use the re library in Python and a question flashed through my mind. Please forgive me if this sounds stupid. I am new to this stuff. :)
Since according to this answer,
re.search - find something anywhere in the string
re.match - find something at the beginning of the string
Now I have this code:
from re import search
str = "Yay, I am on StackOverflow. I am overjoyed!"
if search('am',str): # not considering regex
print('True') # returns True
if 'am' in str:
print('True') # returns True
And this:
from re import match
str = "Yay, I am on Stack Overflow. I am overjoyed!"
if match('Yay',str): # not considering regex
print('True') # prints True
if str.startswith('Yay'):
print('True') # prints True
So now my question is, which one should I use when I am doing similar stuffs (not considering regular expressions) such as fetching contents from a webpage and finding in its contents. Should I use built-ins like above, or the standard re library? Which one will make the code more optimised/efficient?
Any help will be much appreciated. Thank you!
Regex is mostly used for complex match, search and replace operations, while built-in keyword such as 'in' is mostly used for simple operations like replacing a single word by another. Normally 'in' keyword is preferred. In terms of performance 'in' keyword usage is faster but when you face a situation where you could use 'in' keyword but Regex offers much more elegant solution rather than typing a lot of 'if' statements use Regex.
When you are fetching contents from a webpage and finding stuff in the contents the codex above also applies.
Hope this helps.

RegEx in Python for WikiMarkup

I'm trying to create a re in python that will match this pattern in order to parse MediaWiki Markup:
<ref>*Any_Character_Could_Be_Here</ref>
But I'm totally lost when it comes to regex. Can someone help me, or point me to a tutorial or resource that might be of some help. Thanks!'
Assuming that svick is correct that MediaWiki Markup is not valid xml (or html), then you could use re in this circumstance (although I will certainly defer to better solutions):
>>> import re
>>> test_string = '''<ref>*Any_Character_Could_Be_Here</ref>
<ref>other characters could be here</ref>'''
>>> re.findall(r'<ref>.*?</ref>', test_string)
['<ref>*Any_Character_Could_Be_Here</ref>', '<ref>other characters could be here</ref>'] # a list of matching strings
In any case, you will want to familiarize yourself with the re module (whether or not you use a regex to solve this particular problem).
srhoades28, this will match your pattern.
if re.search(r"<ref>\*[^<]*</ref>", subject):
# Successful match
else:
# Match attempt failed
Note that from your post, it is assumed that the * after always occurs, and that the only variable part is the blue text, in your example "Any_Character_Could_Be_Here".
If this is not the case let me know and I will tweak the expression.

Regular expression to extract URL from an HTML link

I’m a newbie in Python. I’m learning regexes, but I need help here.
Here comes the HTML source:
http://www.ptop.se
I’m trying to code a tool that only prints out http://ptop.se. Can you help me please?
If you're only looking for one:
import re
match = re.search(r'href=[\'"]?([^\'" >]+)', s)
if match:
print(match.group(1))
If you have a long string, and want every instance of the pattern in it:
import re
urls = re.findall(r'href=[\'"]?([^\'" >]+)', s)
print(', '.join(urls))
Where s is the string that you're looking for matches in.
Quick explanation of the regexp bits:
r'...' is a "raw" string. It stops you having to worry about escaping characters quite as much as you normally would. (\ especially -- in a raw string a \ is just a \. In a regular string you'd have to do \\ every time, and that gets old in regexps.)
"href=[\'"]?" says to match "href=", possibly followed by a ' or ". "Possibly" because it's hard to say how horrible the HTML you're looking at is, and the quotes aren't strictly required.
Enclosing the next bit in "()" says to make it a "group", which means to split it out and return it separately to us. It's just a way to say "this is the part of the pattern I'm interested in."
"[^\'" >]+" says to match any characters that aren't ', ", >, or a space. Essentially this is a list of characters that are an end to the URL. It lets us avoid trying to write a regexp that reliably matches a full URL, which can be a bit complicated.
The suggestion in another answer to use BeautifulSoup isn't bad, but it does introduce a higher level of external requirements. Plus it doesn't help you in your stated goal of learning regexps, which I'd assume this specific html-parsing project is just a part of.
It's pretty easy to do:
from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup(html_to_parse)
for tag in soup.findAll('a', href=True):
print(tag['href'])
Once you've installed BeautifulSoup, anyway.
Don't use regexes, use BeautifulSoup. That, or be so crufty as to spawn it out to, say, w3m/lynx and pull back in what w3m/lynx renders. First is more elegant probably, second just worked a heck of a lot faster on some unoptimized code I wrote a while back.
this should work, although there might be more elegant ways.
import re
url='http://www.ptop.se'
r = re.compile('(?<=href=").*?(?=")')
r.findall(url)
John Gruber (who wrote Markdown, which is made of regular expressions and is used right here on Stack Overflow) had a go at producing a regular expression that recognises URLs in text:
http://daringfireball.net/2009/11/liberal_regex_for_matching_urls
If you just want to grab the URL (i.e. you’re not really trying to parse the HTML), this might be more lightweight than an HTML parser.
Regexes are fundamentally bad at parsing HTML (see Can you provide some examples of why it is hard to parse XML and HTML with a regex? for why). What you need is an HTML parser. See Can you provide an example of parsing HTML with your favorite parser? for examples using a variety of parsers.
In particular you will want to look at the Python answers: BeautifulSoup, HTMLParser, and lxml.
this regex can help you, you should get the first group by \1 or whatever method you have in your language.
href="([^"]*)
example:
amgheziName
result:
http://www.amghezi.com
There's tonnes of them on regexlib
Yes, there are tons of them on regexlib. That only proves that RE's should not be used to do that. Use SGMLParser or BeautifulSoup or write a parser - but don't use RE's. The ones that seems to work are extremely compliated and still don't cover all cases.
This works pretty well with using optional matches (prints after href=) and gets the link only. Tested on http://pythex.org/
(?:href=['"])([:/.A-z?<_&\s=>0-9;-]+)
Oputput:
Match 1. /wiki/Main_Page
Match 2. /wiki/Portal:Contents
Match 3. /wiki/Portal:Featured_content
Match 4. /wiki/Portal:Current_events
Match 5. /wiki/Special:Random
Match 6. //donate.wikimedia.org/wiki/Special:FundraiserRedirector?utm_source=donate&utm_medium=sidebar&utm_campaign=C13_en.wikipedia.org&uselang=en
You can use this.
<a[^>]+href=["'](.*?)["']

Categories