How to make string from regex and value of group - python

I have a regexp for Twitter profile URLs and someone's Twitter profile URL. I can easily extract the username from the URL.
>>> twitter_re = re.compile(r'twitter.com/(?P<username>\w+)/')
>>> twitter_url = 'twitter.com/dir01/'
>>> username = twitter_re.search(twitter_url).groups()[0]
>>> username
'dir01'
But if I have the regexp and a username, how do I get the URL?

Regexes are not a two-way street. You can use them to parse strings, but not to generate strings back from the result. You should probably look into another way of getting the URLs back, like basic string interpolation or URI templates (see http://code.google.com/p/uri-templates/).

If you are not looking for a general solution to convert any regex into a formatting string, but something that you can hardcode:
twitter_url = 'twitter.com/%(username)s/' % {'username': 'dir01'}
...should give you what you need.
If you want a more general (but not incredibly robust) solution:
import re
def format_to_re(format):
    # Replace Python string formatting syntax with named group re syntax.
    # The backslash in \\w is doubled because newer Python versions reject
    # unknown escapes such as \w in a replacement string.
    return re.compile(re.sub(r'%\((\w+)\)s', r'(?P<\1>\\w+)', format))
twitter_format = 'twitter.com/%(username)s/'
twitter_re = format_to_re(twitter_format)
m = twitter_re.search('twitter.com/dir01/')
print m.groupdict()
print twitter_format % m.groupdict()
Gives me:
{'username': 'dir01'}
twitter.com/dir01/
And finally, the slightly larger and more complete solution that I have been using myself can be found in the Pattern class here.
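As a minimal sketch of the same idea (this is not the Pattern class mentioned above), a small helper can keep one template and derive both directions from it. The class name UrlPattern and its parse/build methods are hypothetical, the sketch assumes every field matches \w+, and it is written for Python 3.

```python
import re


class UrlPattern(object):
    """Hypothetical helper: one %(name)s template drives parsing and building."""

    _FIELD = re.compile(r'%\((\w+)\)s')

    def __init__(self, template):
        self.template = template
        # split() with one capture group alternates literal text and field names.
        pieces = self._FIELD.split(template)
        regex = ''
        for i, piece in enumerate(pieces):
            if i % 2 == 0:
                regex += re.escape(piece)          # literal text, e.g. "twitter.com/"
            else:
                regex += r'(?P<%s>\w+)' % piece    # assumed field syntax: \w+
        self.regex = re.compile(regex)

    def parse(self, text):
        """Return the named fields found in text, or None if nothing matches."""
        match = self.regex.search(text)
        return match.groupdict() if match else None

    def build(self, **fields):
        """Fill the template with the given field values."""
        return self.template % fields


twitter = UrlPattern('twitter.com/%(username)s/')
print(twitter.parse('twitter.com/dir01/'))
print(twitter.build(username='dir01'))
```

Escaping the literal pieces with re.escape means the dot in "twitter.com" only matches a real dot, which the shorter version above does not guarantee.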

Why do you need the regex for that - just append the strings.
base_url = "twitter.com/"
twt_handle = "dir01"
twit_url = base_url + twt_handle

Related

How to remove text between variables

I currently have a string that has variables in it.
domain.com/?hello=randomtext&thingy=randotext2&stuff=1231kjh
I'm trying to delete
&thingy=(all text that is in here)
The order might not always be that, and the text after the = will change.
I started doing something like this, but I feel there has to be quicker alternative:
cleanlist = []
variables = url.split('&')
for t in variables:
    if not t.split('=', 1)[0] == 'thingy':
        cleanlist.append(t.split('=', 1)[0])
I don't know Python, but from experience with other programming languages, the question I think you should have asked is "How do you parse a URL in Python?" or "How do you parse a url query string in Python?"
Just Googling this I got the following info that may help:
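from urlparse import urlparse, parse_qs  # Python 3: from urllib.parse import urlparse, parse_qs
o = urlparse('domain.com/?hello=randomtext&thingy=randotext2&stuff=1231kjh')
q = parse_qs(o.query, True)  # True keeps parameters with blank values
>>> q['hello']
['randomtext']
>>> q['thingy']
['randotext2']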
Once you parse the URL and query string, just grab what you want.
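A rough sketch of how the parsed query string could be used to drop one parameter and rebuild the URL. It is written for Python 3, where the module is called urllib.parse. The function name drop_param is made up for this example.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode


def drop_param(url, name):
    """Return url without any query parameter called name (hypothetical helper)."""
    parts = urlsplit(url)
    # parse_qsl keeps the original order of the parameters.
    kept = [(key, value)
            for key, value in parse_qsl(parts.query, keep_blank_values=True)
            if key != name]
    return urlunsplit(parts._replace(query=urlencode(kept)))


url = 'domain.com/?hello=randomtext&thingy=randotext2&stuff=1231kjh'
print(drop_param(url, 'thingy'))
```

Because the parameters are parsed rather than pattern-matched, it does not matter whether thingy comes first, last, or in the middle.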
You can substitute using a regex. Matching [^&]* instead of .* keeps the match from running past the next parameter.
import re
p = re.compile(ur'&thingy=[^&]*')
test_str = u"domain.com/?hello=randomtext&thingy=randotext2&stuff=1231kjh"
subst = u""
result = re.sub(p, subst, test_str)
>>> result
u'domain.com/?hello=randomtext&stuff=1231kjh'
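Since the question says the order of the parameters may vary, here is a hedged sketch of a regex substitution that also copes with the parameter being first or last in the query string. The helper name remove_param is an assumption, and the sketch ignores URL fragments and percent-encoding. Written for Python 3.

```python
import re


def remove_param(url, name):
    """Remove name=value from a query string, wherever it appears."""
    # The lookbehind requires "?" or "&" right before the name, so a parameter
    # such as "xthingy" is left alone. The separator after the value is eaten.
    pattern = r'(?<=[?&])%s=[^&]*(?:&|$)' % re.escape(name)
    cleaned = re.sub(pattern, '', url)
    # Removing the last parameter leaves a dangling "&" or "?" behind.
    return cleaned.rstrip('?&')


print(remove_param('domain.com/?hello=randomtext&thingy=randotext2&stuff=1231kjh', 'thingy'))
print(remove_param('domain.com/?thingy=randotext2&hello=randomtext', 'thingy'))
print(remove_param('domain.com/?hello=randomtext&thingy=randotext2', 'thingy'))
```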
If I understand your question correctly, you are trying to delete the whole tail of the string, "&thingy=randotext2&stuff=1231kjh".
This can be achieved with something like this:
current_str = "domain.com/?hello=randomtext&thingy=randotext2&stuff=1231kjh"
cursor = current_str.find("&thingy=")  # find returns -1 if "&thingy=" is absent, so check for that in real code
clean_str = current_str[:cursor]
The clean_str variable now holds the result:
domain.com/?hello=randomtext
If you wish to delete only the value of a query string argument, such as thingy, and keep the &thingy= key, a regular expression like this does it:
import re
domain = "domain.com/?hello=randomtext&thingy=randotext2&stuff=1231kjh"
x = re.sub(r'(&thingy=)[^&]*(&?.*)$', r'\1\2', domain)
# x is now 'domain.com/?hello=randomtext&thingy=&stuff=1231kjh'
It works regardless of what follows the given argument.

How to extract unknown number of different parts from string with Python regex?

Does anyone know a smart way to extract an unknown number of different parts from a string with a Python regex?
I know this question is probably too general to answer clearly, so please let's have a look at the example:
S = "name.surname#sub1.sub2.sub3"
As a result I would like to get separately a local part and each subdomain. Please note that in this sample email address we have three subdomains but I would like to find a regular expression that is able to capture any number of them, so please do not use this number.
To avoid straying from the point, let's additionally assume that only alphanumeric characters (hence \w), dots and one # are allowed in email addresses.
I tried to solve it myself and found this way:
L = re.findall(r"([\w.]+)(?=#)|(\w+)", S)
for i in L:
    if i[0] == '': print i[1],
    else: print i[0],
# output: name.surname sub1 sub2 sub3
But it doesn't look nice to me. Does anyone know a way to achieve this with one regex and without any loop?
Of course, we can easily do it without regular expressions:
L = S.split('#')
localPart = L[0] # name.surname
subdomains = str(L[1]).split('.') # ['sub1', 'sub2', 'sub3']
But I am interested in how to figure it out with regexes.
[EDIT]
Uff, finally I figured this out, here is the nice solution:
S = "name.surname#sub1.sub2.sub3"
print re.split(r"#|\.(?!.*#)", S) # ['name.surname', 'sub1', 'sub2', 'sub3']
S = "name.surname.nick#sub1.sub2.sub3.sub4"
print re.split(r"#|\.(?!.*#)", S) # ['name.surname.nick', 'sub1', 'sub2', 'sub3', 'sub4']
Perfect output.
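For what it's worth, the re.split pattern above can be wrapped so that the local part and the subdomains come back separately without a loop. This is only a sketch in Python 3 syntax, and the function name split_address is invented for the example.

```python
import re


def split_address(address):
    """Split 'local#sub1.sub2...' into (local, [sub1, sub2, ...])."""
    # "#" always splits. A dot splits only when no "#" remains ahead of it,
    # so dots inside the local part are left untouched.
    local, *subdomains = re.split(r"#|\.(?!.*#)", address)
    return local, subdomains


print(split_address("name.surname#sub1.sub2.sub3"))
print(split_address("name.surname.nick#sub1.sub2.sub3.sub4"))
```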
If I understand your request correctly, you want to find each section of your sample email address, without the periods. A plain \w+ pattern does that. Compiling it with re.compile is optional, but convenient if you reuse the pattern. For example:
import re
s = "name.surname#sub1.sub2.sub3"
r = r"\w+"
r2 = re.compile(r)
re.findall(r2, s)
This searches the string s with the compiled pattern r2 and returns ['name', 'surname', 'sub1', 'sub2', 'sub3']. Note that it also splits the local part at its dot.
Basically, you can use the fact that when there is a capture group in the pattern, re.findall returns only the content of that capture group rather than the whole match:
>>> re.findall(r'(?:^[^#]*#|\.)([^.]*)', s)
['sub1', 'sub2', 'sub3']
Obviously the email format can be more complicated than your example string.

Regex issue in python

I have a string "value=4020a345-f646-4984-a848-3f7f5cb51f21"
if re.search( "value=\w*|\d*\-\w*|\d*\-\w*|\d*\-\w*|\d*\-\w*|\d*", x ):
    x = re.search( "value=\w*|\d*\-\w*|\d*\-\w*|\d*\-\w*|\d*\-\w*|\d*", x )
    m = x.group(1)
m only gives me 4020a345; I am not sure why it does not give me the entire "4020a345-f646-4984-a848-3f7f5cb51f21"
Can anyone tell me what I am doing wrong?
Try this regex; it looks like you are trying to match a GUID:
value=[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}
This should match what you want, if all the strings are of the form you've shown:
value=((\w*\d*\-?)*)
You can also use this website to validate your regular expressions:
http://regex101.com/
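The regex below captures the whole value. Note that everything inside [...] is a character class, so it is equivalent to value=([\w*|-]+):
value=([\w*|\d*\-\w*|\d*\-\w*|\d*\-\w*|\d*\-\w*|\d*]+)
You are trying to match hex digits, so the following regex is more precise than one built on [\w\d]: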
pattern = "value=([0-9a-fA-F]{8}-([0-9a-fA-F]{4}-){3}[0-9a-fA-F]{12})"
data = "value=4020a345-f646-4984-a848-3f7f5cb51f21"
res = re.search(pattern, data)
print(res.group(1))
If you don't care about validating that the value is correct hex, there is no reason not to use simple string manipulation, as shown below.
>>> data = "value=4020a345-f646-4984-a848-3f7f5cb51f21"
>>> print(data[6:])
4020a345-f646-4984-a848-3f7f5cb51f21
>>> # or maybe
...
>>> print(data[6:].replace('-',''))
4020a345f6464984a8483f7f5cb51f21
You can get the subparts of the value as a list:
txt = "value=4020a345-f646-4984-a848-3f7f5cb51f21"
parts = re.findall(r'\w+', txt)[1:]
parts is ['4020a345', 'f646', '4984', 'a848', '3f7f5cb51f21']
If you really want the entire string:
full = "-".join(parts)
A simpler way:
full = re.findall(r"[\w-]+", txt)[-1]
full is 4020a345-f646-4984-a848-3f7f5cb51f21
value=([\w\d]*\-[\w\d]*\-[\w\d]*\-[\w\d]*\-[\w\d]*)
Try this and grab the capture group. Your regex did not return the whole value because of the | operator: alternation has the lowest precedence, so once the alternative to the left of | matches, the alternatives to its right are not tried.
See demo.
http://regex101.com/r/hQ1rP0/45
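To make the alternation point concrete, here is a small sketch (Python 3) that contrasts a shortened, ungrouped pattern with a grouped one. Passing the capture to uuid.UUID is an extra step that is not part of the answers above. It is included only as one possible way to sanity-check the value.

```python
import re
import uuid

data = "value=4020a345-f646-4984-a848-3f7f5cb51f21"

# Ungrouped: the pattern reads as (value=\w*) OR (\d*-\w*). The first
# alternative already matches at position 0 and stops at the first "-".
ungrouped = re.search(r"value=\w*|\d*-\w*", data)
print(ungrouped.group(0))

# Grouped: the capture covers the whole GUID, with no alternation involved.
guid_re = re.compile(
    r"value=([0-9a-fA-F]{8}(?:-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12})")
match = guid_re.search(data)
guid = uuid.UUID(match.group(1))  # raises ValueError for malformed input
print(guid)
```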

Python re named group --doing a greedy match

For some reason, I need to extract the fields in an XML doc with Python re.
Here is an example of the string I'll be applying the regex to:
payload2 = '<CheckEventRequest><EventList count="1"><Event event="0x80" path="\\c2_emcvnx.ntaplion.prv\\CHECK$\\demoshare1\\Engineering\\Benchmarking\\Thumbs.db" flag="0x2" protocol="0" server="C2_EMCVNX" share="demoshare1" clientIP="172.26.64.233" serverIP="172.26.64.225" timeStamp="0x536C47EF0003C836" userSid="S-1-5-21-665520413-3518186362-2792099713-500" ownerSid="S-1-5-21-665520413-3518186362-2792099713-500" fileSize="0x11000" desiredAccess="0x0" createDispo="0x0" ntStatus="0x0" relativePath="\\C2_EMCVNX\\demoshare1\\Engineering\\Benchmarking\\Thumbs.db"/></EventList></CheckEventRequest>'
Some of the fields you see above like 'clientIP' may not always be present.
The regex I have come up with is:
PAT3 = re.compile(r'.+(event="(?P<event_code>\S*?)"){1}[\S\s]+?(path="(?P<path>[\s\S]+?)"){0,1}[\S\s]+(clientIP="(?P<client_ip>[\S\s]+?)"){0,1}.*', re.UNICODE)
m1 = PAT3.search(payload2)
print m1.groupdict()
output:
{'path': '\\c2_emcvnx.ntaplion.prv\\CHECK$\\demoshare1\\Engineering\\Benchmarking\\Thumbs.db', 'client_ip': None, 'event_code': '0x80'}
But when I put {1} instead of {0,1} after (clientIP="(?P<client_ip>[\S\s]+?)") it works. However, that breaks the case where clientIP is absent.
Any idea how I can make the regex work in both cases, whether a field is present or not?
My advice:
Stop trying to do a big one-line regex.
It's very simple to just break up your code so that it is not only more readable, but easier too.
My version of your code
payloads = [
'<CheckEventRequest><EventList count="1"><Event event="0x80" path="\\c2_emcvnx.ntaplion.prv\\CHECK$\\demoshare1\\Engineering\\Benchmarking\\Thumbs.db" flag="0x2" protocol="0" server="C2_EMCVNX" share="demoshare1" serverIP="172.26.64.225" timeStamp="0x536C47EF0003C836" userSid="S-1-5-21-665520413-3518186362-2792099713-500" ownerSid="S-1-5-21-665520413-3518186362-2792099713-500" fileSize="0x11000" desiredAccess="0x0" createDispo="0x0" ntStatus="0x0" relativePath="\\C2_EMCVNX\\demoshare1\\Engineering\\Benchmarking\\Thumbs.db"/></EventList></CheckEventRequest>',
'<CheckEventRequest><EventList count="1"><Event event="0x80" path="\\c2_emcvnx.ntaplion.prv\\CHECK$\\demoshare1\\Engineering\\Benchmarking\\Thumbs.db" flag="0x2" protocol="0" server="C2_EMCVNX" share="demoshare1" clientIP="172.26.64.233" serverIP="172.26.64.225" timeStamp="0x536C47EF0003C836" userSid="S-1-5-21-665520413-3518186362-2792099713-500" ownerSid="S-1-5-21-665520413-3518186362-2792099713-500" fileSize="0x11000" desiredAccess="0x0" createDispo="0x0" ntStatus="0x0" relativePath="\\C2_EMCVNX\\demoshare1\\Engineering\\Benchmarking\\Thumbs.db"/></EventList></CheckEventRequest>'
]
def scrape_xml(payload):
    import re
    ipv4 = r'clientIP="(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)"'
    pat3 = dict()
    pat3['event_code'] = r'event="(0[xX][0-9a-fA-F]+?)"'
    pat3['path'] = r'path="(.*?)"'
    pat3['client_ip'] = ipv4
    matches = {}
    for index, regex in enumerate(pat3):
        matches[index] = re.search(
            pattern=pat3[regex],
            string=payload,
            flags=re.UNICODE
        )
    for index in matches:
        if not index:
            print "\n"
        if matches[index] is None:
            pass
        else:
            print matches[index].group(0)
for p in payloads:
    scrape_xml(p)
Output:
path="\c2_emcvnx.ntaplion.prv\CHECK$\demoshare1\Engineering\Benchmarking\Thumbs.db"
event="0x80"
path="\c2_emcvnx.ntaplion.prv\CHECK$\demoshare1\Engineering\Benchmarking\Thumbs.db"
clientIP="172.26.64.233"
event="0x80"
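A more compact variant of the one-search-per-field idea might look like the sketch below, which returns a dict with None for absent fields, in the same shape as groupdict() in the question. The FIELDS mapping, the scrape_fields name and the shortened sample payloads are assumptions made for illustration. Written for Python 3.

```python
import re

# Attribute name in the XML -> key in the result (names taken from the question).
FIELDS = {'event': 'event_code', 'path': 'path', 'clientIP': 'client_ip'}


def scrape_fields(payload):
    """Search for each attribute separately; missing ones map to None."""
    result = {}
    for attr, key in FIELDS.items():
        # The leading \s stops "path" from matching inside "relativePath".
        # [^"]* cannot run past the closing quote.
        match = re.search(r'\s%s="([^"]*)"' % re.escape(attr), payload)
        result[key] = match.group(1) if match else None
    return result


# Shortened sample payloads, not the full ones from the question.
with_ip = r'<Event event="0x80" path="\\srv\share\Thumbs.db" clientIP="172.26.64.233" serverIP="172.26.64.225"/>'
without_ip = r'<Event event="0x80" path="\\srv\share\Thumbs.db" serverIP="172.26.64.225"/>'
print(scrape_fields(with_ip))
print(scrape_fields(without_ip))
```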
First, I have to give you the standard warning against parsing XML with regular expressions, but if you're dead set on that…
You probably don't want to be using [\S\s], as that'll match anything, including going past the quote. To prevent that, you made it non-greedy, but there's a better solution: just make it not match quotes: [^"]. Also note that you can replace {0,1} with ?.
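For comparison, here is a minimal sketch of the non-regex route that the warning above alludes to, using the standard library's xml.etree.ElementTree (Python 3). The sample payload is a shortened stand-in for the one in the question, and event_fields is a hypothetical name.

```python
import xml.etree.ElementTree as ET

# Shortened sample payload without a clientIP attribute.
sample = (r'<CheckEventRequest><EventList count="1">'
          r'<Event event="0x80" path="\\srv\share\Thumbs.db" serverIP="172.26.64.225"/>'
          r'</EventList></CheckEventRequest>')


def event_fields(xml_text):
    """Return one dict per <Event> element; absent attributes become None."""
    root = ET.fromstring(xml_text)
    return [
        {
            'event_code': event.get('event'),
            'path': event.get('path'),
            'client_ip': event.get('clientIP'),  # .get returns None if missing
        }
        for event in root.iter('Event')
    ]


print(event_fields(sample))
```

Optional attributes need no special handling here, which is the case that the single large regex struggled with.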

Python splitting string to find specific content

I am trying to split a string in Python to extract a particular part. I am able to get the part of the string before the < symbol, but how do I get the bit after it, e.g. the emailaddress part?
>>> s = 'texttexttextblahblah <emailaddress>'
>>> s = s[:s.find('<')]
>>> print s
The above code gives the output texttexttextblahblah
s = s[s.find('<')+1:-1]
or
s = s.split('<')[1][:-1]
cha0site's and ig0774's answers are pretty straightforward for this case, but it would probably help you to learn regular expressions for times when it's not so simple.
import re
fullString = 'texttexttextblahblah <emailaddress>'
m = re.match(r'(\S+) <(\S+)>', fullString)
part1 = m.group(1)
part2 = m.group(2)
Perhaps being a bit more explicit with a regex isn't a bad idea in this case:
import re
subject = 'texttexttextblahblah <emailaddress>'
match = re.search("""
    (?<=<)   # Make sure the match starts after a <
    [^<>]*   # Match any number of characters except angle brackets""",
    subject, re.VERBOSE)
if match:
    result = match.group()
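If both halves are wanted at once, a sketch along these lines returns the text before the bracket together with the bracketed part, and None when the string has no <...> section. The function name is made up, and the pattern assumes the bracketed part sits at the end of the string. Written for Python 3.

```python
import re


def split_name_and_address(text):
    """Return (text before "<", text inside "<...>"), or None if there is no match."""
    match = re.search(r'^(.*?)\s*<([^<>]*)>\s*$', text)
    if match is None:
        return None
    return match.group(1), match.group(2)


print(split_name_and_address('texttexttextblahblah <emailaddress>'))
print(split_name_and_address('no brackets here'))
```

Unlike a (\S+) group, the lazy (.*?) also accepts leading text that contains spaces.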
