I'm trying to extract/match data from a string using regular expression but I don't seem to get it.
I want to extract i386 from the following string (the text between the last - and .iso):
/xubuntu/daily/current/lucid-alternate-i386.iso
This should also work in case of:
/xubuntu/daily/current/lucid-alternate-amd64.iso
And the result should be either i386 or amd64 given the case.
Thanks a lot for your help.
You could also use split in this case (instead of regex):
>>> str = "/xubuntu/daily/current/lucid-alternate-i386.iso"
>>> str.split(".iso")[0].split("-")[-1]
'i386'
split gives you a list of elements on which your string got 'split'. Then using Python's slicing syntax you can get to the appropriate parts.
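To make the two steps easier to follow, here is a small sketch that keeps each intermediate value. The variable names are only for illustration, and it assumes the string contains ".iso" exactly once, at the end.

```python
path = "/xubuntu/daily/current/lucid-alternate-i386.iso"

# Step 1: drop everything from ".iso" onward.
stem = path.split(".iso")[0]

# Step 2: split on "-" and keep the last piece.
pieces = stem.split("-")
arch = pieces[-1]

print(pieces)  # ['/xubuntu/daily/current/lucid', 'alternate', 'i386']
print(arch)    # i386
```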
If you will be matching several of these lines, using re.compile() and saving the resulting regular expression object for reuse is more efficient.
import re
s1 = "/xubuntu/daily/current/lucid-alternate-i386.iso"
s2 = "/xubuntu/daily/current/lucid-alternate-amd64.iso"
pattern = re.compile(r'^.+-(.+)\..+$')
m = pattern.match(s1)
m.group(1)
'i386'
m = pattern.match(s2)
m.group(1)
'amd64'
r"/([^-]*)\.iso/"
The bit you want will be in the first capture group.
First off, let's make our life simpler and only get the file name.
>>> os.path.split("/xubuntu/daily/current/lucid-alternate-i386.iso")
('/xubuntu/daily/current', 'lucid-alternate-i386.iso')
Now it's just a matter of catching all the letters between the last dash and the '.iso'.
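A minimal sketch of that idea, assuming the file name always ends in .iso and the wanted part contains no dash. The helper name arch_from_path is made up for this example.

```python
import os
import re


def arch_from_path(path):
    """Return the text between the last '-' and '.iso', or None if absent."""
    filename = os.path.basename(path)  # e.g. 'lucid-alternate-i386.iso'
    match = re.search(r"-([^-]+)\.iso$", filename)
    return match.group(1) if match else None


print(arch_from_path("/xubuntu/daily/current/lucid-alternate-i386.iso"))   # i386
print(arch_from_path("/xubuntu/daily/current/lucid-alternate-amd64.iso"))  # amd64
```

Returning None for a non-matching name leaves it to the caller to decide how to handle unexpected input.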
The expression should be written without the leading and trailing slashes.
import re
line = '/xubuntu/daily/current/lucid-alternate-i386.iso'
rex = re.compile(r"([^-]*)\.iso")
m = rex.search(line)
print(m.group(1))
Yields 'i386'
reobj = re.compile(r"(\w+)\.iso$")
match = reobj.search(subject)
if match:
    result = match.group(1)
else:
    result = ""
Here, subject contains the filename and path.
>>> import os
>>> path = "/xubuntu/daily/current/lucid-alternate-i386.iso"
>>> file, ext = os.path.splitext(os.path.split(path)[1])
>>> processor = file[file.rfind("-") + 1:]
>>> processor
'i386'
I've spent the last two hours figuring this out. I have this string:
C:\\Users\\Bob\\.luxshop\\jeans\\diesel-qd\\images\\Livier_11.png
I am interested in getting \\Livier_11.png but it seems impossible for me. How can I do this?
I'd strongly recommend using Python's pathlib module. It's part of the standard library and designed to handle file paths. Some examples:
>>> from pathlib import Path
>>> p = Path(r"C:\Users\Bob\.luxshop\jeans\diesel-qd\images\Livier_11.png")
>>> p
WindowsPath('C:/Users/Bob/.luxshop/jeans/diesel-qd/images/Livier_11.png')
>>> p.name
'Livier_11.png'
>>> p.parts
('C:\\', 'Users', 'Bob', '.luxshop', 'jeans', 'diesel-qd', 'images', 'Livier_11.png')
>>> # construct a path from parts
...
>>> Path(r"C:\some_folder", "subfolder", "file.txt")
WindowsPath('C:/some_folder/subfolder/file.txt')
>>> p.exists()
False
>>> p.is_file()
False
>>>
Edit:
If you want to use regex, this should work:
>>> s = "C:\\Users\\Bob\\.luxshop\\jeans\\diesel-qd\\images\\Livier_11.png"
>>> import re
>>> match = re.match(r".*(\\.*)$", s)
>>> match.group(1)
'\\Livier_11.png'
>>>
You can use this
^.*(\\\\.*)$
Explanation
^ - Anchor to start of string.
.* - Matches any character except a newline, zero or more times (greedy).
(\\\\.*) - Capturing group. Matches \\ followed by any character except a newline, zero or more times.
$ - Anchor to end of string.
P.S. - For this kind of task you should use the standard libraries available instead of regex.
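As a sketch of the standard-library route: pathlib.PureWindowsPath parses Windows-style paths on any platform without touching the file system. The variable names below are only for illustration.

```python
from pathlib import PureWindowsPath

s = "C:\\Users\\Bob\\.luxshop\\jeans\\diesel-qd\\images\\Livier_11.png"

# .name is the final path component, regardless of the OS running the code.
name = PureWindowsPath(s).name
print(name)  # Livier_11.png

# If the leading separator is wanted as well, add it back explicitly.
with_separator = "\\" + name
```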
If you can clearly say that "\\" is a delimiter (does not appear in any string except to separate the strings) then you can say:
str = "C:\\Users\\Bob\\.luxshop\\jeans\\diesel-qd\\images\\Livier_11.png"
spl = str.split("\\")  # split the string
your_wanted_string = spl[-1]
Please note this is a very simple way to do it and not always the best way! If you need to do this often or if something important depends on it use a library!
If you are just learning to code then this is easier to understand.
This is how the string splitting works for me right now:
output = string.encode('UTF8').split('}/n}')[0]
output += '}\n}'
But I am wondering if there is a more pythonic way to do it.
The goal is to get everything before this '}/n}' including '}/n}'.
This might be a good use of str.partition.
string = '012za}/n}ddfsdfk'
parts = string.partition('}/n}')
# ('012za', '}/n}', 'ddfsdfk')
''.join(parts[:-1])
# 012za}/n}
Or, you can find it explicitly with str.index.
repl = '}/n}'
string[:string.index(repl) + len(repl)]
# 012za}/n}
This is probably better than using str.find since an exception will be raised if the substring isn't found, rather than producing nonsensical results.
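To illustrate the difference, here is a short sketch with one string that contains the tag and one that doesn't. The sample strings and names are made up for the example.

```python
tag = '}/n}'
good = '012za}/n}ddfsdfk'
bad = 'no tag here'

# With the tag present, index and find give the same slice.
found = good[:good.index(tag) + len(tag)]
print(found)  # 012za}/n}

# With the tag missing, find returns -1, so the slice quietly becomes
# bad[:3] and yields an unrelated prefix.
wrong = bad[:bad.find(tag) + len(tag)]
print(repr(wrong))  # 'no '

# index raises instead, so the problem is noticed right away.
try:
    bad.index(tag)
    raised = False
except ValueError:
    raised = True
print(raised)  # True
```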
It seems like anything "more elegant" would require regular expressions.
import re
re.search('(.*?}/n})', string).group(0)
# 012za}/n}
It can be done with re.split() -- the key is putting parens around the split pattern to preserve what you split on:
import re
output = "".join(re.split(r'(}/n})', string.encode('UTF8'))[:2])
However, I doubt that this is either the most efficient or the most Pythonic way to achieve what you want, i.e. I don't think this is naturally a split sort of problem. For example:
tag = '}/n}'
encoded = string.encode('UTF8')
output = encoded[:encoded.index(tag)] + tag
or if you insist on a one-liner:
output = (lambda string, tag: string[:string.index(tag)] + tag)(string.encode('UTF8'), '}/n}')
or returning to regex:
output = re.match(r".*}/n}", string.encode('UTF8')).group(0)
>>> string_to_split = 'first item{\n{second item'
>>> sep = '{\n{'
>>> output = [item + sep for item in string_to_split.split(sep)]
NOTE: output = ['first item{\n{', 'second item{\n{']
then you can use the result:
for item_with_delimiter in output:
    ...
It might be useful to look up os.linesep if you're not sure what the line ending will be. os.linesep is whatever the line ending is under your current OS, so '\r\n' under Windows or '\n' under Linux or Mac. It depends where input data is from, and how flexible your code needs to be across environments.
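If the input may come from a different platform than the one running the code, a hedged alternative is str.splitlines(), which treats both '\n' and '\r\n' as line boundaries regardless of os.linesep. The sample text is made up.

```python
import os

# The line ending of the platform running this code.
print(repr(os.linesep))

# splitlines() copes with mixed endings in the data itself.
text = "first\r\nsecond\nthird"
lines = text.splitlines()
print(lines)  # ['first', 'second', 'third']
```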
Adapted from Slice a string after a certain phrase?, you can combine find and slice to get the first part of the string and retain }/n}.
str = "012za}/n}ddfsdfk"
str[:str.find("}/n}")+4]
Will result in 012za}/n}
With a given string:
Surname,MM,Forename,JTA19 R <first.second#domain.com>
I can match all the groups with this:
([A-Za-z]+),([A-Z]+),([A-Za-z]+),([A-Z0-9]+)\s([A-Z])\s<([A-Z0-9._%+-]+#[A-Z0-9.-]+\.[A-Z]{2,4})
However, when I apply it to Python it always fails to find it
regex=re.compile(r"(?P<lastname>[A-Za-z]+),"
r"(?P<initials>[A-Z]+)"
r",(?P<firstname>[A-Za-z]+),"
r"(?P<ouc1>[A-Z0-9]+)\s"
r"(?P<ouc2>[A-Z])\s<"
r"(?P<email>[A-Z0-9._%+-]+#[A-Z0-9.-]+\.[A-Z]{2,4})"
)
I think I've narrowed it down to this part of email:
[A-Z0-9._%+-]
What is wrong?
Replace
r"(?P<email>[A-Z0-9._%+-]+#[A-Z0-9.-]+\.[A-Z]{2,4})"
with
r"(?P<email>[A-Za-z0-9._%+-]+#[A-Za-z0-9.-]+\.[A-Za-z]{2,4})"
to allow for lowercase letters too.
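A quick check of the corrected pattern against the sample line from the question (a sketch; only the email group differs from the original):

```python
import re

regex = re.compile(
    r"(?P<lastname>[A-Za-z]+),"
    r"(?P<initials>[A-Z]+),"
    r"(?P<firstname>[A-Za-z]+),"
    r"(?P<ouc1>[A-Z0-9]+)\s"
    r"(?P<ouc2>[A-Z])\s<"
    r"(?P<email>[A-Za-z0-9._%+-]+#[A-Za-z0-9.-]+\.[A-Za-z]{2,4})"
)

line = "Surname,MM,Forename,JTA19 R <first.second#domain.com>"
m = regex.match(line)
print(m.group("lastname"))  # Surname
print(m.group("email"))     # first.second#domain.com
```

Passing re.IGNORECASE to re.compile() would also make the email part match, but it would loosen the uppercase-only groups such as initials too.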
Python joins adjacent string literals, so re.compile() does receive a single pattern here; still, the expression is easier to read and maintain as one verbose-mode string:
import re
exp = r'''
(?P<lastname>[A-Za-z]+),
(?P<initials>[A-Z]+),
(?P<firstname>[A-Za-z]+),
(?P<ouc1>[A-Z0-9]+)\s
(?P<ouc2>[A-Z])\s<
# In verbose mode a bare # starts a comment, so the literal one is escaped.
(?P<email>[A-Za-z0-9._%+-]+\#[A-Za-z0-9.-]+\.[A-Za-z]{2,4})'''
regex = re.compile(exp, re.VERBOSE)
Although I have to say, your string is just comma separated, so this might be a bit easier:
>>> s = "Surname,MM,Forename,JTA19 R <first.second#domain.com>"
>>> lastname,initials,firstname,rest = s.split(',')
>>> ouc1,ouc2,email = rest.split(' ')
>>> lastname,initials,firstname,ouc1,ouc2,email[1:-1]
('Surname', 'MM', 'Forename', 'JTA19', 'R', 'first.second#domain.com')
I am trying to split a string in Python to extract a particular part. I am able to get the part of the string before the symbol <, but how do I get the part after it, e.g. the emailaddress part?
>>> s = 'texttexttextblahblah <emailaddress>'
>>> s = s[:s.find('<')]
>>> print(s)
The code above gives the output texttexttextblahblah
s = s[s.find('<')+1:-1]
or
s = s.split('<')[1][:-1]
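Another option, sketched here with str.partition, returns both halves in one pass and does not assume that '>' is the very last character. The variable names are only illustrative.

```python
s = 'texttexttextblahblah <emailaddress>'

# partition splits at the first occurrence and returns (head, sep, tail).
before, _, rest = s.partition('<')
email, _, _ = rest.partition('>')

print(before.strip())  # texttexttextblahblah
print(email)           # emailaddress
```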
cha0site's and ig0774's answers are pretty straightforward for this case, but it would probably help you to learn regular expressions for times when it's not so simple.
import re
fullString = 'texttexttextblahblah <emailaddress>'
m = re.match(r'(\S+) <(\S+)>', fullString)
part1 = m.group(1)
part2 = m.group(2)
Perhaps being a bit more explicit with a regex isn't a bad idea in this case:
import re
match = re.search("""
    (?<=<)  # Make sure the match starts after a <
    [^<>]*  # Match any number of characters except angle brackets""",
    subject, re.VERBOSE)
if match:
    result = match.group()
I want to verify and then parse this string (in quotes):
string = "start: c12354, c3456, 34526; other stuff that I don't care about"
//Note that some codes begin with 'c'
I would like to verify that the string starts with 'start:' and ends with ';'
Afterward, I would like to have a regex parse out the strings. I tried the following python re code:
regx = r"start: (c?[0-9]+,?)+;"
reg = re.compile(regx)
matched = reg.search(string)
print(' matched.groups()', matched.groups())
I have tried different variations but I can either get the first or the last code but not a list of all three.
Or should I abandon using a regex?
EDIT: updated to reflect part of the problem space I neglected and fixed string difference.
Thanks for all the suggestions - in such a short time.
In Python, this isn’t possible with a single regular expression: each capture of a group overrides the last capture of that same group (in .NET, this would actually be possible since the engine distinguishes between captures and groups).
Your easiest solution is to first extract the part between start: and ; and then use a regular expression to return all matches, not just a single match, with re.findall('c?[0-9]+', text).
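A minimal sketch of that two-step approach, using the string from the question. The names outer and codes are only for illustration.

```python
import re

text = "start: c12354, c3456, 34526; other stuff that I don't care about"

# Step 1: verify the framing and capture what lies between "start:" and ";".
outer = re.match(r"start:([^;]*);", text)

# Step 2: collect every code from the captured part.
codes = re.findall(r"c?[0-9]+", outer.group(1)) if outer else []
print(codes)  # ['c12354', 'c3456', '34526']
```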
You could use the standard string tools, which are pretty much always more readable.
s = "start: c12354, c3456, 34526;"
s.startswith("start:") # True if s starts with this string
s.endswith(";") # True if s ends with this string
s[6:-1].strip().split(', ') # list of tokens separated by ", " (strip drops the space after "start:")
This can be done (pretty elegantly) with a tool like Pyparsing:
from pyparsing import Group, Literal, OneOrMore, Optional, ParseException, Word
import string
code = Group(Optional(Literal("c"), default='') + Word(string.digits) + Optional(Literal(","), default=''))
parser = Literal("start:") + OneOrMore(code) + Literal(";")
# Read lines from file:
with open('lines.txt', 'r') as f:
    for line in f:
        try:
            result = parser.parseString(line)
            codes = [c[1] for c in result[1:-1]]
            # Do something with teh codez...
        except ParseException:
            # Oh noes: string doesn't match!
            continue
Cleaner than a regular expression, returns a list of codes (no need to string.split), and ignores any extra characters in the line, just like your example.
import re
sstr = re.compile(r'start:([^;]*);')
slst = re.compile(r'(?:c?)(\d+)')
mystr = "start: c12354, c3456, 34526; other stuff that I don't care about"
match = re.match(sstr, mystr)
if match:
    res = re.findall(slst, match.group(0))
results in
['12354', '3456', '34526']