How to remove everything before a certain character in Python

I'm new to Python and am struggling with a certain task:
I have a string that could have anything in it, but it always ends the same way.
It can be just a filename, a complete path, or a random string, always ending with a version number.
Example:
C:\Users\abc\Desktop\string-anotherstring-15.1R7-S8.1
string-anotherstring-15.1R7-S8.1
string-anotherstring.andanother-15.1R7-S8.1
What is always the same (looking from the end) is that if you reach the second dot and go 2 characters in front of it, you always hit the start of the part that I'm interested in.
Cutting everything after a certain string was "easy," and I solved it myself - that's why the string ends with the version now :)
Is there a way to tell Python, "look for the second dot from the end of the string, go 2 characters in front of it, and delete everything before that," so that I get the version as a string?
Happy for any pointers in the right direction.
Thanks

If you want the version number, can you use the hyphen (-) to split the string? Or do you need to depend on the dots only?
Please see below for a use of rsplit and join that can help you.
>>> a = 'string-anotherstring.andanother-15.1R7-S8.1'
>>> a.rsplit('-')
['string', 'anotherstring.andanother', '15.1R7', 'S8.1']
>>> a.rsplit('-')[-2:] #Get everything from second last to the end
['15.1R7', 'S8.1']
>>> '-'.join(a.rsplit('-')[-2:]) #Get everything from second last to the end, and join them with a hyphen
'15.1R7-S8.1'
>>>
To rely on the dots instead, use the same approach:
>>> a
'string-anotherstring.andanother-15.1R7-S8.1'
>>> data = a.rsplit('.')
>>> [data[-3][-2:]]
['15']
>>> [data[-3][-2:]] + data[-2:]
['15', '1R7-S8', '1']
>>> '.'.join([data[-3][-2:]] + data[-2:])
'15.1R7-S8.1'
>>>
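The dot-based steps above can be wrapped in a small helper. This is only a sketch, assuming the input always contains at least two dots and that the version starts exactly two characters before the second dot from the end; the name extract_version is made up for illustration.

```python
def extract_version(text):
    """Return the trailing version string (hypothetical helper)."""
    # Split on the last two dots only; everything further left stays in one piece.
    parts = text.rsplit('.', 2)
    if len(parts) != 3:
        raise ValueError('expected at least two dots in %r' % (text,))
    head, middle, tail = parts
    # Keep the two characters in front of the second dot from the end.
    return '.'.join([head[-2:], middle, tail])


print(extract_version(r'C:\Users\abc\Desktop\string-anotherstring-15.1R7-S8.1'))
print(extract_version('string-anotherstring.andanother-15.1R7-S8.1'))
```

Passing maxsplit=2 to rsplit means dots in the path or file name further left are never touched.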

You can anchor a regex to the end of the line with $.
Using your own description, use the regex:
(\d\d\.[^.]*)\.[^.]*$
If you want the last characters from the end included, just move the capturing parenthesis:
(\d\d\.[^.]*\.[^.]*)$
Explanation:
(\d\d\.[^.]*\.[^.]*)$
\d\d     # two digits
\.       # a literal '.'
[^.]*    # anything OTHER THAN a '.'
\.       # a literal '.'
[^.]*    # anything OTHER THAN a '.'
$        # end of line
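In Python, the second pattern could be applied with re.search, roughly as follows. This is a sketch; the names VERSION_RE and samples are illustrative, and it assumes the version always begins with exactly two digits, as in the question.

```python
import re

# Two digits, a dot, non-dot characters, a dot, non-dot characters, end of string.
VERSION_RE = re.compile(r'(\d\d\.[^.]*\.[^.]*)$')

samples = [
    r'C:\Users\abc\Desktop\string-anotherstring-15.1R7-S8.1',
    'string-anotherstring-15.1R7-S8.1',
    'string-anotherstring.andanother-15.1R7-S8.1',
]

for sample in samples:
    match = VERSION_RE.search(sample)
    # search() returns None when nothing matches, so guard before calling group().
    print(match.group(1) if match else None)
```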

Assuming I understand this correctly, two ways to do this come to mind.
I'm including both for completeness, since I might have misunderstood. I think the split/parts solution is cleaner, particularly when the 'certain character' is a dot.
>>> import re
>>> msg = r'C:\Users\abc\Desktop\string-anotherstring-15.1R7-S8.1'
>>> re.search(r'.*(..\..*)', msg).group(1)
'S8.1'
>>> parts = msg.split('.')
>>> ".".join((parts[-2][-2:], parts[-1]))
'S8.1'

For your example, you can split the string on the separator '-' and then join the last two elements with the same separator. Like so:
txt = "string-anotherstring-15.1R7-S8.1"
x = txt.split("-")
y = "-".join(x[-2:])
print(y) # outputs 15.1R7-S8.1

Related

How to create a regular expression that would find all pieces of text BETWEEN certain sets of characters?

I have a string that looks like 'E10 1/05/03 2/3211 3/AO Yuzhmor'.
The pieces that I need to extract are the ones following ' \d\/':
1) 05/03
2) 3211
3) AO Yuzhmor
My last idea was ' \d\/(.*?)(?=(( \d\/)|\Z))'
but it still wouldn't work properly on the last piece (the |\Z instruction doesn't seem to do anything).
I think you're close. This works for your example:
>>> import re
>>> s = 'E10 1/05/03 2/3211 3/AO Yuzhmor'
>>> re.findall(r'\s\d\/(.*?)(?=\s\d\/|$)', s)
['05/03', '3211', 'AO Yuzhmor']
Explanation:
Match on [space][digit]/, capturing everything that follows using a non-greedy quantifier, until the current position is immediately before either another [space][digit]/ (detected using a lookahead, matched but not consumed) or the end of the input. Use findall to return all matching instances in the input.
This can be tricky because we don't know all of the rules of how these strings are built. One option is to use your regex to split the string
>>> re.split(r" \d/", 'E10 1/05/03 2/3211 3/AO Yuzhmor')[1:]
['05/03', '3211', 'AO Yuzhmor']
Another is to be more specific about the fields, assuming that they are always " 1/", " 2/" and " 3/"
>>> re.match(r".*?1/(.*?) 2/(.*?) 3/(.*)", 'E10 1/05/03 2/3211 3/AO Yuzhmor').groups()
('05/03', '3211', 'AO Yuzhmor')
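Try
re.findall(r'\d/(\S+)', s)
Note that \S+ stops at the first whitespace, so for this input the last piece comes back as 'AO' rather than 'AO Yuzhmor'.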

python regular expression to match strings

I want to parse a string, such as:
package: name='jp.tjkapp.droid1lwp' versionCode='2' versionName='1.1'
uses-permission:'android.permission.WRITE_APN_SETTINGS'
uses-permission:'android.permission.RECEIVE_BOOT_COMPLETED'
uses-permission:'android.permission.ACCESS_NETWORK_STATE'
I want to get:
string1: jp.tjkapp.droid1lwp
string2: 1.1
Because there are multiple uses-permission lines, I want to get the permissions as a list containing:
WRITE_APN_SETTINGS, RECEIVE_BOOT_COMPLETED and ACCESS_NETWORK_STATE.
Could you help me write a Python regular expression to get the strings I want?
Thanks.
Assuming the code block you provided is one long string, here stored in a variable called input_string:
import re
name = re.search(r"(?<=name\=\')[\w\.]+?(?=\')", input_string).group(0)
versionName = re.search(r"(?<=versionName\=\')\d+?\.\d+?(?=\')", input_string).group(0)
permissions = re.findall(r'(?<=android\.permission\.)[A-Z_]+(?=\')', input_string)
Explanation:
name
(?<=name\=\'): a lookbehind. It checks the text just before the main string, so that only strings preceded by name=' are returned. The \ in front of = and ' escapes them so that they are treated as literal characters rather than regex syntax (neither strictly needs escaping here, but the escapes are harmless). name=' itself is not returned as part of the result; we just know that every result we get is preceded by it.
[\w\.]+?: This is the main string we're searching for. \w means any alphanumeric character or underscore. \. is an escaped period, so the regex knows we mean a literal . and not the wildcard represented by an unescaped period. Putting these in [] means we'll accept anything listed inside the brackets, so we're saying that we'll accept any alphanumeric character, an underscore, or a period. The + afterwards means at least one of the previous thing, meaning at least one (but possibly more) of [\w\.]. Finally, the ? means don't be greedy: we're telling the regex to get the smallest possible group that meets these specifications, since + could otherwise go on for an unlimited number of repeats of anything matched by [\w\.].
(?=\'): a lookahead. It checks the text just after the main string, so that only strings followed by ' are returned. The \ is also an escape, guarding against the regex engine or Python's string parsing misinterpreting the '. This final ' is not returned with our results; we just know that in the original string it followed any result we do end up getting.
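Put together, the lookbehind and lookahead patterns explained above can be run end to end like this. It is a sketch that assumes the whole output is held in one multi-line string; the sample text is copied from the question, and the variable names are illustrative.

```python
import re

input_string = """package: name='jp.tjkapp.droid1lwp' versionCode='2' versionName='1.1'
uses-permission:'android.permission.WRITE_APN_SETTINGS'
uses-permission:'android.permission.RECEIVE_BOOT_COMPLETED'
uses-permission:'android.permission.ACCESS_NETWORK_STATE'"""

# (?<=...) must match just before the result, (?=...) just after; neither is returned.
name = re.search(r"(?<=name=')[\w.]+?(?=')", input_string).group(0)
version_name = re.search(r"(?<=versionName=')\d+?\.\d+?(?=')", input_string).group(0)
permissions = re.findall(r"(?<=android\.permission\.)[A-Z_]+(?=')", input_string)

print(name)
print(version_name)
print(permissions)
```

The name pattern does not pick up versionName because the match is case sensitive and that attribute is spelled with a capital N.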
You can do this without regex by reading the file content line by line.
>>> def split_string(s):
...     if s.startswith('package'):
...         return [i.split('=')[1] for i in s.split() if "=" in i]
...     elif s.startswith('uses-permission'):
...         return s.split('.')[-1]
...
>>> split_string("package: name='jp.tjkapp.droid1lwp' versionCode='2' versionName='1.1'")
["'jp.tjkapp.droid1lwp'", "'2'", "'1.1'"]
>>> split_string("uses-permission:'android.permission.WRITE_APN_SETTINGS'")
"WRITE_APN_SETTINGS'"
>>> split_string("uses-permission:'android.permission.RECEIVE_BOOT_COMPLETED'")
"RECEIVE_BOOT_COMPLETED'"
>>> split_string("uses-permission:'android.permission.ACCESS_NETWORK_STATE'")
"ACCESS_NETWORK_STATE'"
>>>
Here is some example code:
#!/usr/bin/env python
inputFile = open("test.txt", "r").readlines()
for line in inputFile:
    if line.startswith("package"):
        words = line.split()
        string1 = words[1].split("=")[1].replace("'","")
        string2 = words[3].split("=")[1].replace("'","")
The test.txt file contains the input data you mentioned earlier.
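The two line-by-line answers above could be combined so that the quotes are stripped and the permissions are collected into a list. The following is a sketch rather than a drop-in replacement: parse_manifest_lines is a hypothetical helper, it takes any iterable of lines (a list here, but an open file would work the same way), and it assumes the package line always carries name and versionName fields without spaces inside the values.

```python
def parse_manifest_lines(lines):
    """Return (name, version_name, permissions) from the dumped lines."""
    name = version_name = None
    permissions = []
    for line in lines:
        line = line.strip()
        if line.startswith("package"):
            # Turn "key='value'" items into a dict, skipping the "package:" label.
            fields = dict(item.split("=", 1) for item in line.split() if "=" in item)
            name = fields["name"].strip("'")
            version_name = fields["versionName"].strip("'")
        elif line.startswith("uses-permission"):
            # Keep only the part after the last dot and drop the closing quote.
            permissions.append(line.rsplit(".", 1)[-1].rstrip("'"))
    return name, version_name, permissions


sample_lines = [
    "package: name='jp.tjkapp.droid1lwp' versionCode='2' versionName='1.1'",
    "uses-permission:'android.permission.WRITE_APN_SETTINGS'",
    "uses-permission:'android.permission.RECEIVE_BOOT_COMPLETED'",
    "uses-permission:'android.permission.ACCESS_NETWORK_STATE'",
]
print(parse_manifest_lines(sample_lines))
```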

Regular expression in python to capture multiple forms of badly formatted addresses

I have been tweaking a regular expression over several days to try to capture, with a single definition, several cases of inconsistent format in the address field of a database.
I am new to Python and regular expressions and have gotten great feedback here on Stack Overflow. With my new knowledge, I built a regex that is getting close to the final result, but I still can't spot the problem.
import re
r1 = r"([\w\s+]+),?\s*\(?([\w\s+\\/]+)\)?\s*\(?([\w\s+\\/]+)\)?"
match1 = re.match(r1, 'caracas, venezuela')
match2 = re.match(r1, 'caracas (venezuela)')
match3 = re.match(r1, 'caracas, (venezuela) (df)')
group1 = match1.groups()
group2 = match2.groups()
group3 = match3.groups()
print(group1)
print(group2)
print(group3)
This should return 'caracas, venezuela' for groups 1 and 2, and 'caracas, venezuela, df' for group 3. Instead, it returns:
('caracas', 'venezuel', 'a')
('caracas ', 'venezuel', 'a')
('caracas', 'venezuela', 'df')
The only perfect match is group 3. The other two split off the 'a' at the end, and the second one has an extra space at the end of 'caracas '.
Thanks in advance for any insight.
Cheers!
Regular expressions might be overkill... what exactly is your problem statement? What do you need to capture?
Some things I caught (in order of appearance in your regex; sometimes it helps to read it out, left-to-right, English-style):
([\w\s+]+)
This says, "capture one or more (word character, whitespace character, or literal + sign)"
Do you really want to capture the spaces at the end of the city name? Also, you don't need (indeed, shouldn't have) the + inside your brackets [ ]: inside a character class it is a literal plus sign, not a quantifier, and the outer + already matches one or more characters. I'd rewrite this part like this:
([\w\s]*\w)
Which will match greedily up to the last word character ("zero or more (word character or space) followed by a word character"). This does assume you have at least one character, but that is better than your pattern, which would accept a single space as well.
Next you have:
,?\s*\(?
which looks okay to me except that it doesn't guarantee that you'll see either a comma or an open paren anymore. What about:
(?:,\s*\(|,\s*|\s*\()
which says, "non-capturingly match either (a comma with maybe some spaces and then an open paren) OR (a comma with maybe some spaces) OR (maybe some spaces and then an open paren)". This enforces that you must have either a comma or a paren or both.
Next you have the capturing expression, very similar to the first:
([\w\s+\\/]+)
Again, you don't want the spaces (or slashes in this case) at the end of the captured name, and you don't want the + inside the [ ]:
([\w\s\\/]*\w)
The next expression is probably where you're getting your venezuel a problem; let's take a look:
\)?\s*\(?([\w\s+\\/]+)\)?
This is a rather long one, so let's break it down:
\)?\s*\(?
says to "maybe match a close paren, and then maybe some spaces, and then maybe an open paren". This is okay I guess, let's move on to the real problem:
([\w\s+\\/]+)
This capturing group MUST match at least one character. If the matcher sees "venezuela" at the end of your address, the previous group greedily grabs all of it, but then has to give back the final a so that this last expression has something to match. That leaves venezuel in one group and a in the next. Try instead:
\)?\s*
Followed by making your entire final expression optional, and the outer expression non-capturing:
(?:\(?([\w\s+\\/]+)\)?)?
The final expression would be:
([\w\s]*\w)(?:,\s*\(|,\s*|\s*\()([\w\s\\/]*\w)\)?\s*(?:\(?([\w\s+\\/]+)\)?)?
Edit: fixed a problem that made the final group capture twice, once with the parens, once without. Now it should only capture the text inside the parens.
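Testing it on your examples:
>>> import re
>>> r = r"([\w\s]*\w)(?:,\s*\(|,\s*|\s*\()([\w\s\\/]*\w)\)?\s*(?:\(?([\w\s+\\/]+)\)?)?"
>>> re.match(r, 'caracas, venezuela').groups()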
('caracas', 'venezuela', None)
>>> re.match(r, 'caracas (venezuela)').groups()
('caracas', 'venezuela', None)
>>> re.match(r, 'caracas, (venezuela) (df)').groups()
('caracas', 'venezuela', 'df')
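A pattern this long is easier to maintain when it is compiled with re.VERBOSE, which lets each piece carry a comment. The sketch below uses the same final expression, just laid out over several lines; the name ADDRESS_RE and the field labels in the comments are illustrative guesses at what each group holds.

```python
import re

ADDRESS_RE = re.compile(r"""
    ([\w\s]*\w)                 # first field: must end on a word character
    (?:,\s*\(|,\s*|\s*\()       # separator: comma, open paren, or both
    ([\w\s\\/]*\w)              # second field: may contain slashes
    \)?\s*                      # optional close paren, then optional spaces
    (?:\(?([\w\s+\\/]+)\)?)?    # optional third field, with or without parens
""", re.VERBOSE)

for text in ['caracas, venezuela', 'caracas (venezuela)', 'caracas, (venezuela) (df)']:
    print(ADDRESS_RE.match(text).groups())
```

In verbose mode, whitespace outside character classes is ignored, so the layout does not change what the pattern matches.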
Could you not just find all the words in the text?
E.g.:
>>> import re
>>> samples = ['caracas, venezuela','caracas (venezuela)','caracas, (venezuela) (df)']
>>>
>>> def find_words(text):
...     return re.findall(r'\w+', text)
...
>>> for sample in samples:
...     print(find_words(sample))
...
['caracas', 'venezuela']
['caracas', 'venezuela']
['caracas', 'venezuela', 'df']

removing part of a string (up to but not including) in python

I'm trying to strip off part of a string.
e.g. Strip:-
a = xyz-abc
to leave:-
a = -abc
I would usually use lstrip e.g.
a.lstrip('xyz')
but in this case I don't know what xyz is going to be, so I need a way to just strip everything to the left of '-'.
Is it possible to set that option with lstrip or do I have to go about it a different way?
Thanks.
If there's only one - character, this will work:
'xyz-abc'.split('-')[1]
If you want the '-' in there, you have to reattach it:
>>> '-' + 'xyz-abc'.split('-')[1]
'-abc'
There's also a maxsplit argument that lets you split only at the first - character.
>>> '-' + 'xyz-ab-c'.split('-', 1)[1]
'-ab-c'
partition is also potentially useful:
>>> 'xyz-abc'.partition('-')
('xyz', '-', 'abc')
It splits at the first occurrence of the separator:
>>> ''.join('xyz-ab-c'.partition('-')[1:])
'-ab-c'
>>> a = 'xyz-abc'
>>> a.find('-') # return the index of the first instance of '-'
3
>>> a[a.find('-'):] # return the string of everything past that index
'-abc'
You could use a combination of .find and slicing.
If there is no guarantee that the text to the left of - doesn't contain dashes of its own, the reversed version of find called rfind is even more useful:
>>> s = "xyv-er-hdgcfh-abc"
>>> print(s[s.rfind("-"):])
-abc
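One thing the snippets above do not cover is a string with no '-' at all: find and rfind return -1 in that case, so the slice quietly returns just the last character. A sketch of a helper that handles this with partition and rpartition follows; the name from_separator and the choice to return the input unchanged when the separator is missing are assumptions, not part of the answers above.

```python
def from_separator(text, sep='-', last=False):
    """Return text from the first (or last) occurrence of sep onwards."""
    if last:
        _, found, tail = text.rpartition(sep)
    else:
        _, found, tail = text.partition(sep)
    if not found:
        # Assumption: with no separator present, leave the string untouched.
        return text
    return found + tail


print(from_separator('xyz-abc'))
print(from_separator('xyz-ab-c'))
print(from_separator('xyv-er-hdgcfh-abc', last=True))
print(from_separator('nodash'))
```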

finding and returning a string with a specified prefix

I am close, but I am not sure what to do with the resulting match object. If I do
p = re.search('[/#.* /]', str)
I'll get any words that start with # and end with a space. This is what I want. However, this returns a match object that I don't know what to do with. What's the most computationally efficient way of finding and returning a string which is prefixed with a #?
For example,
"Hi there #guy"
After doing the proper calculations, I would be returned
guy
The following regular expression does what you need:
import re
s = "Hi there #guy"
p = re.search(r'#(\w+)', s)
print(p.group(1))
It will also work for the following string formats:
s = "Hi there #guy " # notice the trailing space
s = "Hi there #guy," # notice the trailing comma
s = "Hi there #guy and" # notice the next word
s = "Hi there #guy22" # notice the trailing numbers
s = "Hi there #22guy" # notice the leading numbers
That regex does not do what you think it does.
s = "Hi there #guy"
p = re.search(r'#([^ ]+)', s) # this is the regex you described
print(p.group(1)) # first thing matched inside of ( .. )
But as usual with regex, there are tons of examples that break this. For example, if the text is s = "Hi there #guy, what's with the comma?" the result would be guy,.
So you really need to think about every possible thing you want and don't want to match. r'#([a-zA-Z]+)' might be a good starting point, it literally only matches letters (a .. z, no unicode etc).
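p.group(0) returns the whole match, and p.group(1) returns the text captured by the first pair of parentheses, so with a pattern such as r'#(\w+)' it is p.group(1) that returns guy. If you want to find out what attributes and methods an object has, you can call the built-in dir(p). It returns a list of the attribute and method names available on that object.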
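To make the match object less mysterious, here is a short sketch of how it is typically handled, including the case where nothing matches and re.search returns None. It reuses the r'#(\w+)' pattern from an earlier answer; first_tag is a hypothetical helper name.

```python
import re


def first_tag(text):
    """Return the word after the first '#', or None if there is none."""
    match = re.search(r'#(\w+)', text)
    if match is None:
        # re.search returns None rather than raising when nothing matches.
        return None
    # group(0) would be the whole match including '#'; group(1) is the capture.
    return match.group(1)


print(first_tag("Hi there #guy"))
print(first_tag("Hi there guy"))
print(re.search(r'#(\w+)', "Hi there #guy").group(0))
```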
As is evident from the answers so far, regex is the most efficient solution for your problem. The answers differ slightly regarding what you allow to follow the #:
[^ ] anything but a space
\w in Python 2.x is equivalent to [A-Za-z0-9_] by default; in Python 3 it matches Unicode word characters for str patterns unless the re.ASCII flag is set
If you have a better idea of which characters might be included in the user name, you can adjust your regex to reflect that. For example, only lowercase ASCII letters would be:
[a-z]
NB: I skipped quantifiers for simplicity.
(?<=#)\w+
will match a word if it's preceded by a # (without adding it to the match, a so-called positive lookbehind). This will match "words" that are composed of letters, numbers, and/or underscore; if you don't want those, use (?<=#)[^\W\d_]+
In Python:
>>> import re
>>> strg = "Hi there #guy!"
>>> p = re.search(r'(?<=#)\w+', strg)
>>> p.group()
'guy'
You say: "If I do p = re.search('[/#.* /]', str) I'll get any words that start with # and end up with a space." But this is incorrect: that pattern is a character class which will match ONE character in the set #/.* and space. Note: there's a redundant second / in the pattern.
For example:
>>> re.findall('[/#.* /]', 'xxx#foo x/x.x*x xxxx')
['#', ' ', '/', '.', '*', ' ']
>>>
You say that you want "guy" returned from "Hi there #guy" but that conflicts with "and end up with a space".
Please edit your question to include what you really want/need to match.
