Stripping a particular part from a path in Python

I have a path which is something like this.
/schemas/123/templates/Template1/a/b
I want to strip off everything else and store only the number (i.e. 123) in a variable. Can someone help with this? The number always sits at the same place in the path: it always comes right after /schemas/.
Thanks much

pathlib objects are designed for easy access to the component parts of paths:
>>> from pathlib import Path
>>> path = Path('/schemas/123/templates/Template1/a/b')
>>> path.parts
('/', 'schemas', '123', 'templates', 'Template1', 'a', 'b')
>>> [int(part) for part in path.parts if part.isdigit()]
[123]
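Since the identifier always sits directly after /schemas/, it can also be picked by position instead of being filtered with isdigit(). The following is a minimal sketch, assuming POSIX-style paths; the helper name schema_id is hypothetical and not part of pathlib:

```python
from pathlib import PurePosixPath


def schema_id(path):
    """Return the path component that directly follows 'schemas', or None."""
    parts = PurePosixPath(path).parts
    # parts looks like ('/', 'schemas', '123', 'templates', ...)
    if 'schemas' in parts:
        index = parts.index('schemas') + 1
        if index < len(parts):
            return parts[index]
    return None


print(schema_id('/schemas/123/templates/Template1/a/b'))     # 123
print(schema_id('/schemas/123abc/templates/Template1/a/b'))  # 123abc
```

The result is returned as a string, so it also works when the identifier mixes digits and letters; wrap it in int() only when it is known to be purely numeric.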

EDIT: The comments state that the ID can contain both digits and letters.
Method 1 Using split
#!/usr/bin/python
testLine = "/schemas/123abc/templates/Template1/a/b"
print(testLine.split("/")[2])
Method 2 Using regex
Select everything between the second and the third (if it exists) slash:
#!/usr/bin/python
import re
testLine = "/schemas/123abc/templates/Template1/a/b"
pattern = r"^/schemas/([^/]+).*"
matchObj = re.match(pattern, testLine)
if matchObj:
    # The ID may contain letters, so keep it as a string instead of calling int().
    num = matchObj.group(1)
    print(num)
else:
    print("not found")
The pattern is as follows:
^/ the string begins with a slash
schemas/ next comes the word schemas followed by a slash
([^/]+) captures one or more characters other than a slash (the parentheses form the capture group)
.* matches whatever follows, up to the end of the string

Related

How to find what the next word separated by underscore is

I have a Python application that uploads a specific file and finds the correct location path to it.
The first path has two possibilities: path1 or path2
The second path has about 4-5 possibilities. All the uploaded files will be named similar to this:
GE_1234_path1_possib1_655_ygiu_qis
To find which path is written, I wrote this if-else statement:
path1 = re.search(r'path1', self.fileList[0])
path2 = re.search(r'path2', self.fileList[0])
if path1:
    radioButton = 2
if path2:
    radioButton = 1
I know I can apply the same if else statement to the 5 possibilities. However I prefer to read what is written exactly after path1. Is there a way with regex to skip the underscore after path and read what the possibility is?
In this example, I'm looking for something that would output possib1
I tried:
print re.findall(r'path1\w', self.fileList[0])
but that just prints path1_
You can use the following regex:
print re.findall(r'path1_([^_]+)', self.fileList[0])
([^_]+) is a negated character class within a capture group which will match anything except _ after path1_.
See demo https://regex101.com/r/wM4iI6/1
You can use the following solution:
import re
p = re.compile(r'path\d+_([^_]+)')
test_str = "GE_1234_path1_possib1_655_ygiu_qis"
match = p.search(test_str)
if match:
    print(match.group(1))
The regex - path\d+_([^_]+) - matches path, then a digit(s) and an underscore, then it matches and captures into Group 1 one or more characters other than _. And then we access that group 1 if a match is found.
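To also learn which path was matched, the same pattern can be written with named groups. This is only a sketch built on the file name from the question; the function name parse_upload_name and the group names are hypothetical:

```python
import re

# Matches 'path' plus digits, an underscore, then captures the next token.
PATH_TOKEN = re.compile(r'(?P<path>path\d+)_(?P<possibility>[^_]+)')


def parse_upload_name(filename):
    """Return (path, possibility) from the file name, or None if absent."""
    match = PATH_TOKEN.search(filename)
    if match is None:
        return None
    return match.group('path'), match.group('possibility')


print(parse_upload_name('GE_1234_path1_possib1_655_ygiu_qis'))
# ('path1', 'possib1')
```

Both values come back from a single search, so the separate checks for path1 and path2 are no longer needed.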
Use a capturing group to capture the alphanumeric characters that follow path1_ (the \w in the pattern matches the underscore).
print re.findall(r'path1\w([A-Za-z\d]+)', self.fileList[0])
or
>>> s = 'GE_1234_path1_possib1_655_ygiu_qis'
>>> spl = s.split('_')
>>> for i,j in enumerate(spl):
...     if 'path1' in j:
...         print(spl[i+1])
...
possib1

Python Regular Expression - right-to-left

I am trying to use regular expressions in python to match the frame number component of an image file in a sequence of images. I want to come up with a solution that covers a number of different naming conventions. If I put it into words I am trying to match the last instance of one or more numbers between two dots (eg .0100.). Below is an example of how my current logic falls down:
import os
import re
def sub_frame_number_for_frame_token(path, token='#'):
    folder = os.path.dirname(path)
    name = os.path.basename(path)
    pattern = r'\.(\d+)\.'
    matches = list(re.finditer(pattern, name) or [])
    if not matches:
        return path
    # Get last match.
    match = matches[-1]
    frame_token = token * len(match.group(1))
    start, end = match.span()
    apetail_name = '%s.%s.%s' % (name[:start], frame_token, name[end:])
    return os.path.join(folder, apetail_name)
# Success
eg1 = 'xx01_010_animation.0100.exr'
eg1 = sub_frame_number_for_frame_token(eg1) # result: xx01_010_animation.####.exr
# Failure
eg2 = 'xx01_010_animation.123.0100.exr'
eg2 = sub_frame_number_for_frame_token(eg2) # result: xx01_010_animation.###.0100.exr
I realise there are other ways in which I can solve this issue (I have already implemented a solution where I split the path at the dots and take the last item that is a number), but I am taking this opportunity to learn something about regular expressions. It appears the regular expression creates the groups from left to right and cannot use characters in the pattern more than once. Firstly, is there any way to search the string from right to left? Secondly, why doesn't the pattern find two matches in eg2 (123 and 0100)?
Cheers
finditer will return an iterator "over all non-overlapping matches in the string".
In your example, the last . of the first match will "consume" the first . of the second. Basically, after making the first match, the remaining string of your eg2 example is 0100.exr, which doesn't match.
To avoid this, you can use a lookahead assertion (?=...), which checks for the trailing dot without consuming it:
>>> pattern = re.compile(r'\.(\d+)(?=\.)')
>>> pattern.findall(eg1)
['0100']
>>> pattern.findall(eg2)
['123', '0100']
>>> eg3 = 'xx01_010_animation.123.0100.500.9000.1234.exr'
>>> pattern.findall(eg3)
['123', '0100', '500', '9000', '1234']
# and "right to left"
>>> pattern.findall(eg3)[::-1]
['1234', '9000', '500', '0100', '123']
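The lookahead pattern can be folded back into the function from the question. The following is a sketch rather than a drop-in replacement; the name sub_last_frame_number is hypothetical, and it assumes the frame number is the last dot-delimited run of digits in the file name:

```python
import os
import re

# The lookahead (?=\.) checks for the trailing dot without consuming it,
# so adjacent numeric components such as .123.0100. are all found.
FRAME_PATTERN = re.compile(r'\.(\d+)(?=\.)')


def sub_last_frame_number(path, token='#'):
    """Replace the last dot-delimited number in the file name with tokens."""
    folder, name = os.path.split(path)
    matches = list(FRAME_PATTERN.finditer(name))
    if not matches:
        return path
    last = matches[-1]
    start, end = last.span(1)  # span of the digits only
    new_name = name[:start] + token * (end - start) + name[end:]
    return os.path.join(folder, new_name)


print(sub_last_frame_number('xx01_010_animation.0100.exr'))
# xx01_010_animation.####.exr
print(sub_last_frame_number('xx01_010_animation.123.0100.exr'))
# xx01_010_animation.123.####.exr
```

Using span(1) replaces only the digits, so the surrounding dots never have to be re-inserted by hand.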
My solution uses a very simple, hackish way of fixing it. It reverses the file name at the beginning of your function and reverses the result at the end of it. It basically uses regular expressions to search the backwards version of your given strings. Hackish, but it works. I used the syntax shown in this question to reverse the string.
import os
import re
def sub_frame_number_for_frame_token(path, token='#'):
    folder = os.path.dirname(path)
    # Reverse only the file name, so the folder part is left untouched.
    name = os.path.basename(path)[::-1]
    pattern = r'\.(\d+)\.'
    # The first match in the reversed name is the last one in the original.
    match = re.search(pattern, name)
    if not match:
        return path
    frame_token = token * len(match.group(1))
    start, end = match.span()
    apetail_name = '%s.%s.%s' % (name[:start], frame_token, name[end:])
    # Reverse the name back before rejoining it with the folder.
    return os.path.join(folder, apetail_name[::-1])
# Success
eg1 = 'xx01_010_animation.0100.exr'
eg1 = sub_frame_number_for_frame_token(eg1) # result: xx01_010_animation.####.exr
# Success
eg2 = 'xx01_010_animation.123.0100.exr'
eg2 = sub_frame_number_for_frame_token(eg2) # result: xx01_010_animation.123.####.exr
print(eg1)
print(eg2)
I believe the problem is that finditer returns only non-overlapping matches. Because both '.' characters are part of the regular expression, the second dot has already been consumed by the first match and cannot start another one. You can use the lookahead construct (?=...) to match the second dot without consuming it, for example (?=\.).
Because of the way regular expressions work, I don't think there is an easy way to search right-to-left (though I suppose you could reverse the string and write the pattern backwards...).
If all you care about is the last \.(\d+)\., add a negative lookahead that rejects any match followed by another such component, and do a simple re.search():
\.(\d+)(?=\.)(?!.*\.\d+\.)
where (?=\.) checks for the closing dot without consuming it, and (?!.*\.\d+\.) fails the match whenever another dot-delimited number appears later in the string, so only the last one is returned.
(Caveat 1: I have not tested this. Caveat 2: That is one ugly regex, so add a comment explaining what it's doing.)
UPDATE: Actually I guess you could just do a ^.*(\.\d+\.) and let the implicitly greedy .* match as much as possible (including matches that occur earlier in the string) while still matching your group. That makes for a simpler regex, but I think it makes your intentions less clear.
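The greedy variant from the update can be sketched as follows, assuming the frame number has one or more digits (hence \d+) and capturing only the digits; the helper name last_frame_number is hypothetical:

```python
import re

# Greedy .* first grabs the whole string, then backtracks just far enough
# for \.(\d+)\. to match, which yields the rightmost numeric component.
LAST_FRAME = re.compile(r'^(.*)\.(\d+)\.')


def last_frame_number(name):
    """Return the digits of the last dot-delimited number, or None."""
    match = LAST_FRAME.match(name)
    return match.group(2) if match else None


print(last_frame_number('xx01_010_animation.123.0100.exr'))  # 0100
```

Because .* is greedy, no reversal or lookahead is needed; the cost is that the reader has to know about backtracking to see why the last number wins.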

python regular expression to match strings

I want to parse a string, such as:
package: name='jp.tjkapp.droid1lwp' versionCode='2' versionName='1.1'
uses-permission:'android.permission.WRITE_APN_SETTINGS'
uses-permission:'android.permission.RECEIVE_BOOT_COMPLETED'
uses-permission:'android.permission.ACCESS_NETWORK_STATE'
I want to get:
string1: jp.tjkapp.droid1lwp
string2: 1.1
Because there are multiple uses-permission lines, I want to get the permissions as a list containing:
WRITE_APN_SETTINGS, RECEIVE_BOOT_COMPLETED and ACCESS_NETWORK_STATE.
Could you help me write the python regular expression to get the strings I want?
Thanks.
Assuming the code block you provided is one long string, here stored in a variable called input_string:
import re
name = re.search(r"(?<=name\=\')[\w\.]+?(?=\')", input_string).group(0)
versionName = re.search(r"(?<=versionName\=\')\d+?\.\d+?(?=\')", input_string).group(0)
permissions = re.findall(r'(?<=android\.permission\.)[A-Z_]+(?=\')', input_string)
Explanation:
name
(?<=name\=\'): a lookbehind that checks the text just before the main string, so that only strings preceded by name=' are returned. The \ in front of = and ' escapes them; neither character is special in a regex, so the escapes are harmless but not required. name=' is not included in the result, we just know that the results we get are all preceded by it.
[\w\.]+?: This is the main string we're searching for. \w means any alphanumeric character or underscore. \. is an escaped period, so the regex knows we mean . and not the regex command represented by an unescaped period. Putting these in [] means we're okay with anything we've stuck in brackets, so we're saying that we'll accept any alphanumeric character, _, or .. The + afterwards means at least one of the previous thing, meaning at least one (but possibly more) of [\w\.]. Finally, the ? means don't be greedy: we're telling the regex to get the smallest possible group that meets these specifications, since + could go on for an unlimited number of repeats of anything matched by [\w\.].
(?=\'): a lookahead that checks the text just after the main string, so that only strings followed by ' are returned. The \ is also an escape, since otherwise the regex or Python's string handling might misinterpret '. This final ' is not returned with our results, we just know that in the original string, it followed any result we do end up getting.
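The same extraction can also be written with plain capture groups instead of lookarounds, which some readers find easier to follow. This is a sketch that assumes the input has exactly the shape shown in the question; the function name parse_manifest_dump is hypothetical:

```python
import re

input_string = """package: name='jp.tjkapp.droid1lwp' versionCode='2' versionName='1.1'
uses-permission:'android.permission.WRITE_APN_SETTINGS'
uses-permission:'android.permission.RECEIVE_BOOT_COMPLETED'
uses-permission:'android.permission.ACCESS_NETWORK_STATE'"""


def parse_manifest_dump(text):
    """Extract package name, version name and permission names from the text."""
    # Each pattern captures only the part between the quotes.
    name = re.search(r"name='([^']+)'", text)
    version = re.search(r"versionName='([^']+)'", text)
    permissions = re.findall(r"android\.permission\.([A-Z_]+)'", text)
    return {
        'name': name.group(1) if name else None,
        'versionName': version.group(1) if version else None,
        'permissions': permissions,
    }


print(parse_manifest_dump(input_string))
```

With a capture group, findall returns only the captured text, so the surrounding literals are dropped just as they are with the lookaround version.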
You can do this without regex by reading the file content line by line.
>>> def split_string(s):
...     if s.startswith('package'):
...         return [i.split('=')[1] for i in s.split() if "=" in i]
...     elif s.startswith('uses-permission'):
...         return s.split('.')[-1]
...
>>> split_string("package: name='jp.tjkapp.droid1lwp' versionCode='2' versionName='1.1'")
["'jp.tjkapp.droid1lwp'", "'2'", "'1.1'"]
>>> split_string("uses-permission:'android.permission.WRITE_APN_SETTINGS'")
"WRITE_APN_SETTINGS'"
>>> split_string("uses-permission:'android.permission.RECEIVE_BOOT_COMPLETED'")
"RECEIVE_BOOT_COMPLETED'"
>>> split_string("uses-permission:'android.permission.ACCESS_NETWORK_STATE'")
"ACCESS_NETWORK_STATE'"
>>>
Here is some example code:
#!/usr/bin/env python
with open("test.txt", "r") as inputFile:
    for line in inputFile:
        if line.startswith("package"):
            words = line.split()
            # words[1] is name='...' and words[3] is versionName='...'
            string1 = words[1].split("=")[1].replace("'", "")
            string2 = words[3].split("=")[1].replace("'", "")
The test.txt file contains the input data you mentioned earlier.

Get any character except digits

I'm trying to search for a string that has 6 digits, but no more; other chars may follow. This is the regex I use: \d{6}[^\d] For some reason it doesn't catch the digits which \d{6} does catch.
Update
Now I'm using the regex (\d{6}\D*)$ which does make sense. But I can't get it to work anyway.
Update 2 - solution
I should of course have grouped the \d{6} with parentheses. Doh! Otherwise the match includes the non-digit and the script tries to make a date with that.
End of update
What I'm trying to achieve (as a rather dirty hack) is to find a date string in the header of an OpenOffice document in any of the following formats: YYMMDD, YYYY-MM-DD or YYYYMMDD. If it finds one of these (and only one), it sets the mtime and atime of that file to that date. Try to create an odt file in /tmp with 100101 in the header and run this script (sample file to download: http://db.tt/9aBaIqqa). According to my tests, it shouldn't change the mtime/atime. But it will change them if you remove the \D in the script below.
This is all of my source:
import zipfile
import re
import glob
import time
import os
class OdfExtractor:
    def __init__(self, filename):
        """
        Open an ODF file.
        """
        self._odf = zipfile.ZipFile(filename)
    def getcontent(self):
        # Read file with header.
        # ZipFile.read() returns bytes; decode so the str patterns below can search it.
        return self._odf.read('styles.xml').decode('utf-8')
if __name__ == '__main__':
    filepattern = '/tmp/*.odt'
    # Possible date formats I've used
    patterns = [(r'\d{6}\D', '%y%m%d'), (r'\d{4}-\d\d-\d\d', '%Y-%m-%d'), (r'\d{8}', '%Y%m%d')]
    # go thru all those files
    for f in glob.glob(filepattern):
        # Extract data
        odf = OdfExtractor(f)
        # Create a list for all dates that will be found
        findings = []
        # Try finding date matches
        contents = odf.getcontent()
        for p in patterns:
            matches = re.findall(p[0], contents)
            for m in matches:
                try:
                    # Collect regexp matches that really are dates
                    findings.append(time.strptime(m, p[1]))
                except ValueError:
                    pass
        print(f)
        if len(findings) == 1: # Don't change if multiple dates were found in file
            print('changing to:', findings[0])
            newtime = time.mktime(findings[0])
            os.utime(f, (newtime, newtime))
        print('-' * 8)
You can use \D (capital D) to match any non-digit character.
regex:
\d{6}\D
raw string: (are you sure you are escaping the string correctly?)
ex = r"\d{6}\D"
string:
ex = '\\d{6}\\D'
Try this instead:
r'(\d{6}\D*)$'
(six digits followed by 0 or more non-digits).
Edit: added a "must match to end of string" qualifier.
Edit2: Oh, for Pete's sake:
import re
test_strings = [
("12345", False),
("123456", True),
("1234567", False),
("123456abc", True),
("123456ab9", False)
]
outp = [
" good, matched",
"FALSE POSITIVE",
"FALSE NEGATIVE",
" good, no match"
]
pattern = re.compile(r'(\d{6}\D*)$')
for s, expected in test_strings:
    res = pattern.match(s)
    print(outp[2*(res is None) + (expected is False)])
returns
good, no match
good, matched
good, no match
good, matched
good, no match
I was pretty stupid. If I add a \D to the end of the pattern, the match will of course include that non-digit as well, which I didn't want. I had to add parentheses around the part I really wanted. I feel pretty stupid for not catching this with a simple print statement after the loop. I really need to code more frequently.
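An alternative to capturing the digits and discarding the trailing \D is to use lookarounds, which reject longer digit runs without consuming any neighbouring character. This is a different technique from the parenthesised group described above, shown only as a sketch; the function name find_short_dates and the sample text are made up for illustration:

```python
import re
import time

# (?<!\d) and (?!\d) reject runs of seven or more digits, while the
# capture group keeps only the six digits themselves.
SIX_DIGITS = re.compile(r'(?<!\d)(\d{6})(?!\d)')


def find_short_dates(text):
    """Return struct_time values for every YYMMDD candidate that parses."""
    dates = []
    for candidate in SIX_DIGITS.findall(text):
        try:
            dates.append(time.strptime(candidate, '%y%m%d'))
        except ValueError:
            # Six digits that do not form a valid date.
            pass
    return dates


found = find_short_dates('Report 100101, ref 12345678, id 991399')
print([time.strftime('%Y-%m-%d', d) for d in found])  # ['2010-01-01']
```

Unlike \d{6}\D, this also matches six digits at the very end of the text, where no non-digit follows.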

python regex for repeating string

I am wanting to verify and then parse this string (in quotes):
string = "start: c12354, c3456, 34526; other stuff that I don't care about"
//Note that some codes begin with 'c'
I would like to verify that the string starts with 'start:' and ends with ';'
Afterward, I would like to have a regex parse out the strings. I tried the following python re code:
regx = r"start: (c?[0-9]+,?)+;"
reg = re.compile(regx)
matched = reg.search(string)
print ' matched.groups()', matched.groups()
I have tried different variations but I can either get the first or the last code but not a list of all three.
Or should I abandon using a regex?
EDIT: updated to reflect part of the problem space I neglected and fixed string difference.
Thanks for all the suggestions - in such a short time.
In Python, this isn’t possible with a single regular expression: each capture of a group overrides the last capture of that same group (in .NET, this would actually be possible since the engine distinguishes between captures and groups).
Your easiest solution is to first extract the part between start: and ; and then use a regular expression that returns all matches, not just a single one: re.findall('c?[0-9]+', text).
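Both points can be seen in a short sketch. The first pattern is the one from the question with an optional space added (an assumption, so that it matches the sample string at all); the second part is the two-step approach:

```python
import re

text = "start: c12354, c3456, 34526; other stuff that I don't care about"

# A repeated group only keeps its final capture.
repeated = re.search(r"start: (c?[0-9]+,? ?)+;", text)
print(repeated.groups())  # ('34526',)

# Step 1: verify the framing and isolate the part between 'start:' and ';'.
# Step 2: pull every code out of that part with findall.
framed = re.match(r"start:([^;]*);", text)
codes = re.findall(r"c?[0-9]+", framed.group(1)) if framed else []
print(codes)  # ['c12354', 'c3456', '34526']
```

If framed is None, the string did not start with start: or had no terminating semicolon, which covers the verification half of the question.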
You could use the standard string tools, which are pretty much always more readable.
s = "start: c12354, c3456, 34526;"
s.startswith("start:") # returns True if it starts with this string
s.endswith(";") # returns True if it ends with this string
s[7:-1].split(', ') # skips "start: " and the final ";", then gives you a list of tokens separated by the string ", "
This can be done (pretty elegantly) with a tool like Pyparsing:
from pyparsing import Group, Literal, OneOrMore, Optional, ParseException, Word
import string
code = Group(Optional(Literal("c"), default='') + Word(string.digits) + Optional(Literal(","), default=''))
parser = Literal("start:") + OneOrMore(code) + Literal(";")
# Read lines from file:
with open('lines.txt', 'r') as f:
    for line in f:
        try:
            result = parser.parseString(line)
            # Each group is [optional 'c', digits, optional ','], so c[1] is the digits.
            codes = [c[1] for c in result[1:-1]]
            # Do something with the codes...
        except ParseException as exc:
            # The string doesn't match.
            continue
Cleaner than a regular expression, returns a list of codes (no need to string.split), and ignores any extra characters in the line, just like your example.
import re
sstr = re.compile(r'start:([^;]*);')
slst = re.compile(r'(?:c?)(\d+)')
mystr = "start: c12354, c3456, 34526; other stuff that I don't care about"
match = re.match(sstr, mystr)
if match:
    res = re.findall(slst, match.group(0))
results in
['12354', '3456', '34526']
