Python splitting string to find specific content - python

I am trying to split a string in Python to extract a particular part. I am able to get the part of the string before the symbol <, but how do I get the bit after it, e.g. the emailaddress part?
>>> s = 'texttexttextblahblah <emailaddress>'
>>> s = s[:s.find('<')]
>>> print s
The above code gives the output texttexttextblahblah

s = s[s.find('<')+1:-1]
or
s = s.split('<')[1][:-1]
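Both one-liners assume that the closing > is the last character of the string. As a minimal sketch for input where that may not hold, str.partition can cut on each bracket in turn. The helper name between_brackets is made up for illustration:

```python
def between_brackets(text, open_ch="<", close_ch=">"):
    """Return the text between the first open_ch and the next close_ch, or None."""
    # partition() returns (before, separator, after); separator is '' if not found
    _, found_open, rest = text.partition(open_ch)
    if not found_open:
        return None
    inner, found_close, _ = rest.partition(close_ch)
    return inner if found_close else None


s = 'texttexttextblahblah <emailaddress>'
print(between_brackets(s))  # emailaddress
```

Unlike the slice with -1, this returns None instead of a wrong substring when a bracket is missing.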

cha0site's and ig0774's answers are pretty straightforward for this case, but it would probably help you to learn regular expressions for times when it's not so simple.
import re
fullString = 'texttexttextblahblah <emailaddress>'
m = re.match(r'(\S+) <(\S+)>', fullString)
part1 = m.group(1)
part2 = m.group(2)

Perhaps being a bit more explicit with a regex isn't a bad idea in this case:
import re
subject = 'texttexttextblahblah <emailaddress>'  # the string to search
match = re.search(r"""
    (?<=<)  # Make sure the match starts after a <
    [^<>]*  # Match any number of characters except angle brackets""",
    subject, re.VERBOSE)
if match:
    result = match.group()

Related

Elegant way of extracting substrings matching regex?

Is there a nice way in Python to do:
Check a String matches a set of regular expressions
If yes: get the matching parts back as tuples.
So essentially I want a simple way to enter simple parser/scanner grammars, and simply extract all matches in a certain structure (e.g. tuples).
So suppose we have encoded in a string a country code, a city name and an index. We want to extract this:
input = "123-NEWYORK-[2]"
grammar = "<country,[0-9]+>-<city,[A-Z]*>-[<index,[0-9]*>"
res = HOW_TO_DO_THIS(input,grammar)
if res is None:
    print("Does not match")
else:
    (countrycode,city,index) = res
With Python 3 you can do the following (note that the regex has been modified):
import re
input = "123-NEWYORK-[2]"
grammar = r"(?P<country>[0-9]+)-(?P<city>[A-Z]*)-(?P<index>\[[0-9]*\])"
res = re.findall(grammar, input)
if not res:
    print("Does not match")
else:
    (countrycode,city,index) = res[0]
    print(countrycode)
Modifications:
The correct regex would be (?P<country>[0-9]+)-(?P<city>[A-Z]*)-(?P<index>\[[0-9]*\])
The syntax for the regex module in Python is re.findall(pattern, input_string), not the opposite.
if not x is easier (and more generic) than if x is None
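If the goal is to get the pieces back by name rather than by position, one possible variant (an assumption, not something the answer above used) is re.fullmatch together with groupdict(). In this sketch the square brackets are moved outside the index group so that only the digits are captured:

```python
import re

text = "123-NEWYORK-[2]"
# Named groups; the literal [ and ] sit outside the index group
grammar = re.compile(r"(?P<country>[0-9]+)-(?P<city>[A-Z]*)-\[(?P<index>[0-9]*)\]")

match = grammar.fullmatch(text)  # the whole string must match, not just a prefix
if match is None:
    print("Does not match")
else:
    fields = match.groupdict()
    print(fields["country"], fields["city"], fields["index"])  # 123 NEWYORK 2
```

fullmatch rejects input with trailing junk, which findall would silently skip over.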
Check out this code. It is just a simple text lookup, but you can extend it to fit your scenario.
import re
with open('sample.txt', 'w') as f:  # write a sample file
    f.write("<p class = m>babygameover</p>")
with open('sample.txt', 'r') as f:  # read it back
    text = f.read()
string = "<p class = m>(.+?)</p>" # regular expression
pattern = re.compile(string) # compiling
search = re.findall(pattern, text) # searching
print(search)

Regex issue in python

I have a string "value=4020a345-f646-4984-a848-3f7f5cb51f21" that I search with a regex:
if re.search( "value=\w*|\d*\-\w*|\d*\-\w*|\d*\-\w*|\d*\-\w*|\d*", x ):
    x = re.search( "value=\w*|\d*\-\w*|\d*\-\w*|\d*\-\w*|\d*\-\w*|\d*", x )
    m = x.group(1)
m only gives me 4020a345; I am not sure why it does not give me the entire "4020a345-f646-4984-a848-3f7f5cb51f21".
Can anyone tell me what I am doing wrong?
Try this regex; it looks like you are trying to match a GUID:
value=[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}
This should match what you want, if all the strings are of the form you've shown:
value=((\w*\d*\-?)*)
You can also use this website to validate your regular expressions:
http://regex101.com/
The below regex works as you expect.
value=([\w*|\d*\-\w*|\d*\-\w*|\d*\-\w*|\d*\-\w*|\d*]+)
You are trying to match hex digits, which is why this regex is more correct than one using [\w\d]:
import re
pattern = "value=([0-9a-fA-F]{8}-([0-9a-fA-F]{4}-){3}[0-9a-fA-F]{12})"
data = "value=4020a345-f646-4984-a848-3f7f5cb51f21"
res = re.search(pattern, data)
print(res.group(1))
If you don't care about regex safety, i.e. checking that the value is valid hex, there is no reason not to use simple string manipulation as shown below. Note that "value=" is six characters long, so the slice starts at index 6.
>>> data = "value=4020a345-f646-4984-a848-3f7f5cb51f21"
>>> print(data[6:])
4020a345-f646-4984-a848-3f7f5cb51f21
>>> # or maybe
>>> print(data[6:].replace('-',''))
4020a345f6464984a8483f7f5cb51f21
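A hard-coded slice index is easy to get wrong by one. As a hedged sketch, the prefix length can be computed instead, and the standard library's uuid.UUID can check that what remains really is a well-formed GUID. The function name extract_guid and the default prefix are illustrative choices, not part of the question:

```python
import uuid


def extract_guid(data, prefix="value="):
    """Return the GUID that follows prefix, or None if missing or malformed."""
    if not data.startswith(prefix):
        return None
    candidate = data[len(prefix):]  # no magic number for the prefix length
    try:
        # uuid.UUID raises ValueError for strings that are not 32 hex digits
        return str(uuid.UUID(candidate))
    except ValueError:
        return None


print(extract_guid("value=4020a345-f646-4984-a848-3f7f5cb51f21"))
```

Note that uuid.UUID is lenient about formatting (it also accepts the digits without hyphens), and str() always returns the lowercase hyphenated form.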
You can get the subparts of the value as a list
txt = "value=4020a345-f646-4984-a848-3f7f5cb51f21"
parts = re.findall('\w+', txt)[1:]
parts is ['4020a345', 'f646', '4984', 'a848', '3f7f5cb51f21']
if you really want the entire string
full = "-".join(parts)
A simple way
full = re.findall("[\w-]+", txt)[-1]
full is 4020a345-f646-4984-a848-3f7f5cb51f21
value=([\w\d]*\-[\w\d]*\-[\w\d]*\-[\w\d]*\-[\w\d]*)
Try this and grab the capture group. Your regex was not matching the whole value because you used the | operator: once the pattern on the left side of | is satisfied, the alternatives to its right are not tried.
See demo.
http://regex101.com/r/hQ1rP0/45
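To make the effect of the top-level | visible, here is a small sketch that runs the pattern from the question next to a version without alternation. The second pattern is just one possible rewrite and does not check for hex digits:

```python
import re

data = "value=4020a345-f646-4984-a848-3f7f5cb51f21"

# Pattern from the question: each | separates a complete alternative,
# so "value=\w*" alone is the first alternative and wins at position 0.
loose = re.search(r"value=\w*|\d*\-\w*|\d*\-\w*|\d*\-\w*|\d*\-\w*|\d*", data)
print(loose.group(0))   # value=4020a345

# No alternation: one word chunk followed by any number of "-chunk" parts.
strict = re.search(r"value=(\w+(?:-\w+)*)", data)
print(strict.group(1))  # 4020a345-f646-4984-a848-3f7f5cb51f21
```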

Find specific string sections in python

I want to be able to grab sections of strings with a function. Here is an example:
def get_sec(s1,s2,first='{',last='}'):
    start = s2.index(first)
    end = -(len(s2) - s2.index(last)) + 1
    a = "".join(s2.split(first + last))
    b = s1[:start] + s1[end:]
    print(a)
    print(b)
    if a == b:
        return s1[start:end]
    else:
        print("The strings did not match up")
string = 'contentonemore'
finder = 'content{}more'
print(get_sec(string,finder))
#'one'
So that example works...my issue is I want multiple sections, not just one. So my function needs to be able to work for any amount of sections, for example:
test_str = 'contwotentonemorethree'
test_find = 'con{}tent{}more{}'
print get_sec(test_str,test_find)
#['one','two','three']
Any ideas on how I can make that function work for an arbitrary number of replacements?
You probably want to use the standard python regex library
import re
a = re.search('con(.*)tent(.*)more(.*)','contwotentonemorethree')
print(a.groups())
# ('two', 'one', 'three')
or
print(re.findall('con(.*)tent(.*)more(.*)','contwotentonemorethree'))
# [('two', 'one', 'three')]
Edit:
You can escape special characters in a string using
re.escape(str)
Example:
part1 = re.escape('con(')
part2 = re.escape(')tent')
print(re.findall(part1 + '(.*)' + part2,'con(two)tent'))
# ['two']
It is not just "use regex": you are actually trying to implement regex yourself. The easiest way to do that is, of course, to use the re library.
ummm use regex?
import re
re.findall("con(.*)tent(.*)more(.*)",my_string)
Looks like you want something with regular expressions.
Here's python's page about regular expressions: http://docs.python.org/2/library/re.html
As an example, if say you knew that the string would only be broken into segments "con", "tent", "more" you could have:
import re
regex = re.compile(r"(con).*(tent).*(more).*")
s = 'conxxxxtentxxxxxmore'
match = regex.match(s)
Then find the indices of the matches with:
index1 = s.index(match.group(1))
index2 = s.index(match.group(2))
index3 = s.index(match.group(3))
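One caveat with s.index(match.group(n)): it returns the first occurrence of that text anywhere in s, which is not necessarily the place where the group matched. The match object already records the positions, so a sketch using match.start() and match.span() could look like this:

```python
import re

regex = re.compile(r"(con).*(tent).*(more).*")
s = 'conxxxxtentxxxxxmore'
match = regex.match(s)

# start(n) is the index in s where group n begins
starts = [match.start(i) for i in (1, 2, 3)]
print(starts)         # [0, 7, 16]
print(match.span(2))  # (7, 11), i.e. s[7:11] == 'tent'
```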
Or if you wanted to find the locations of the other characters (.*):
regex = re.compile(r"con(.*)tent(.*)more(.*)")
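To come back to the original question, the finder string with {} placeholders can be turned into such a regex automatically. The following is only a sketch under a few assumptions: the function name get_sections is invented, placeholders are matched lazily, and the literal parts are escaped with re.escape so characters like ( are taken literally:

```python
import re


def get_sections(text, template, placeholder="{}"):
    """Return the substrings of text that fill the placeholders in template,
    or None if text does not fit the template."""
    # Literal pieces between placeholders, escaped so they match verbatim
    literal_parts = [re.escape(part) for part in template.split(placeholder)]
    # Every placeholder becomes a lazy capture group
    pattern = "(.*?)".join(literal_parts)
    match = re.fullmatch(pattern, text, flags=re.DOTALL)
    return list(match.groups()) if match else None


print(get_sections('contentonemore', 'content{}more'))
# ['one']
print(get_sections('contwotentonemorethree', 'con{}tent{}more{}'))
# ['two', 'one', 'three']
```

The sections come back in the order they appear in the string, so the second call gives two, one, three.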

Python matching some characters into a string

I'm trying to extract/match data from a string using regular expression but I don't seem to get it.
I want to extract from the following string the i386 part (the text between the last - and .iso):
/xubuntu/daily/current/lucid-alternate-i386.iso
This should also work in case of:
/xubuntu/daily/current/lucid-alternate-amd64.iso
And the result should be either i386 or amd64 given the case.
Thanks a lot for your help.
You could also use split in this case (instead of regex):
>>> str = "/xubuntu/daily/current/lucid-alternate-i386.iso"
>>> str.split(".iso")[0].split("-")[-1]
'i386'
split gives you a list of elements on which your string got 'split'. Then using Python's slicing syntax you can get to the appropriate parts.
If you will be matching several of these lines, using re.compile() and saving the resulting regular expression object for reuse is more efficient.
>>> import re
>>> s1 = "/xubuntu/daily/current/lucid-alternate-i386.iso"
>>> s2 = "/xubuntu/daily/current/lucid-alternate-amd64.iso"
>>> pattern = re.compile(r'^.+-(.+)\..+$')
>>> m = pattern.match(s1)
>>> m.group(1)
'i386'
>>> m = pattern.match(s2)
>>> m.group(1)
'amd64'
r"/([^-]*)\.iso/"
The bit you want will be in the first capture group.
First off, let's make our life simpler and only get the file name.
>>> os.path.split("/xubuntu/daily/current/lucid-alternate-i386.iso")
('/xubuntu/daily/current', 'lucid-alternate-i386.iso')
Now it's just a matter of catching all the letters between the last dash and the '.iso'.
The expression should be written without the leading and trailing slashes.
import re
line = '/xubuntu/daily/current/lucid-alternate-i386.iso'
rex = re.compile(r"([^-]*)\.iso")
m = rex.search(line)
print m.group(1)
Yields 'i386'
reobj = re.compile(r"(\w+)\.iso$")
match = reobj.search(subject)
if match:
    result = match.group(1)
else:
    result = ""
Subject contains the filename and path.
>>> import os
>>> path = "/xubuntu/daily/current/lucid-alternate-i386.iso"
>>> file, ext = os.path.splitext(os.path.split(path)[1])
>>> processor = file[file.rfind("-") + 1:]
>>> processor
'i386'
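The same idea can be written with pathlib, which some may find easier to read. This is a sketch, not a drop-in from any answer above: the function name arch_from_path is invented, and PurePosixPath is used so that the forward slashes are handled the same way on every platform:

```python
from pathlib import PurePosixPath


def arch_from_path(path):
    """Return the text between the last '-' and the file extension, or None."""
    stem = PurePosixPath(path).stem      # 'lucid-alternate-i386'
    _, dash, tail = stem.rpartition("-")  # split on the last dash
    return tail if dash else None


print(arch_from_path("/xubuntu/daily/current/lucid-alternate-i386.iso"))   # i386
print(arch_from_path("/xubuntu/daily/current/lucid-alternate-amd64.iso"))  # amd64
```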

How can I get part of regex match as a variable in python?

In Perl it is possible to do something like this (I hope the syntax is right...):
$string =~ m/lalala(I want this part)lalala/;
$whatIWant = $1;
I want to do the same in Python and get the text inside the parentheses as a string, like $1.
If you want to get parts by name you can also do this:
>>> m = re.match(r"(?P<first_name>\w+) (?P<last_name>\w+)", "Malcom Reynolds")
>>> m.groupdict()
{'first_name': 'Malcom', 'last_name': 'Reynolds'}
The example was taken from the re docs
See: Python regex match objects
>>> import re
>>> p = re.compile("lalala(I want this part)lalala")
>>> p.match("lalalaI want this partlalala").group(1)
'I want this part'
import re
astr = 'lalalabeeplalala'
match = re.search('lalala(.*)lalala', astr)
whatIWant = match.group(1) if match else None
print(whatIWant)
A small note: in Perl, when you write
$string =~ m/lalala(.*)lalala/;
the regexp can match anywhere in the string. The equivalent is accomplished with the re.search() function, not the re.match() function, which requires that the pattern match starting at the beginning of the string.
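A short sketch of that difference, using an input string where the interesting part is not at the start:

```python
import re

text = "other_text...lalalaI want this partlalala...other_text"
pattern = re.compile(r"lalala(.*?)lalala")

# match() only tries at position 0, so it fails here
at_start = pattern.match(text)
print(at_start)                 # None

# search() scans the string, like Perl's m// without anchors
anywhere = pattern.search(text)
print(anywhere.group(1))        # I want this part
```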
import re
data = "some input data"
m = re.search("some (input) data", data)
if m: # "if match was successful" / "if matched"
    print(m.group(1))
Check the docs for more.
there's no need for regex. think simple.
>>> "lalala(I want this part)lalala".split("lalala")
['', '(I want this part)', '']
>>> "lalala(I want this part)lalala".split("lalala")[1]
'(I want this part)'
import re
match = re.match('lalala(I want this part)lalala', 'lalalaI want this partlalala')
print match.group(1)
import re
string_to_check = "other_text...lalalaI want this partlalala...other_text"
p = re.compile("lalala(I want this part)lalala") # regex pattern
m = p.search(string_to_check) # use p.match if what you want is always at beginning of string
if m:
    print(m.group(1))
While converting a Perl program that parses function names out of modules to Python, I ran into this problem: I received an error saying "group" was undefined. I soon realized that the exception was being thrown because p.match / p.search returns None if there is no matching string.
None has no group method, so to avoid the exception, check that a match was returned before calling group on it.
import re
filename = './file_to_parse.py'
p = re.compile(r'def (\w*)') # \w* greedily matches [a-zA-Z0-9_] character set
for each_line in open(filename,'r'):
    m = p.match(each_line) # tries to match regex rule in p
    if m:
        m = m.group(1)
        print(m)
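The same guard can be packaged as a small function that works on any iterable of lines, which makes it easy to try without a file on disk. This is a sketch with two deliberate changes that are assumptions on my part: the pattern allows leading whitespace so that indented methods are found too, and it requires at least one name character. The name function_names is made up:

```python
import re

DEF_RE = re.compile(r"\s*def (\w+)")


def function_names(lines):
    """Collect the names from lines that start with 'def' (after optional indent)."""
    names = []
    for line in lines:
        m = DEF_RE.match(line)
        if m is None:  # no match: nothing to call .group() on
            continue
        names.append(m.group(1))
    return names


source = [
    "def foo():",
    "    x = 1",
    "class A:",
    "    def bar(self):",
]
print(function_names(source))  # ['foo', 'bar']
```

To run it on a file, pass an open file object, since iterating over it yields lines.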
