How to extract value from re? - python

import re
cc = 'test 5555555555555555/03/22/284 test'
cc = re.findall('[0-9]{15,16}\/[0-9]{2,4}\/[0-9]{2,4}\/[0-9]{3,4}', cc)
print(cc)
['5555555555555555/03/22/284']
This code works fine, but if I put 5555555555555555|03|22|284 in the cc variable, I get this output:
[]
I want a single pattern that handles both cases: if the string contains '|' it should output 5555555555555555|03|22|284, and if it contains '/' it should output 5555555555555555/03/22/284.

Just replace all the /s in your regex (which incidentally don't need to be backslashed) with [/|], which matches either a / or a |. Or if you want backslashes, too, as in your comment on Zain's answer, [/|\\]. (You should always use raw strings r'...' for regexes since they have their own interpretation of backslashes; in a regular string, [/|\\] would have to be written [/|\\\\].)
match = re.findall(
r'[0-9]{15,16}[/|\\][0-9]{2,4}[/|\\][0-9]{2,4}[/|\\][0-9]{3,4}',
cc)
Any other characters you want to include, like colons, can likewise be added between the square brackets.
If you want to accept repeated characters – and treat them as a single delimiter – you can add + to accept "1 or more" of any of the characters:
match = re.findall(
r'[0-9]{15,16}[:/|\\]+[0-9]{2,4}[:/|\\]+[0-9]{2,4}[:/|\\]+[0-9]{3,4}',
cc)
But that will accept, for example, 555555555555555:/|\\03::|::22\\//284 as valid. If you want to be pickier you can replace the character class with a set of alternates, which can be any length. Just separate the options via | – note that outside of the square brackets, a literal | needs a backslash – and put (?:...) around the whole thing: (?:/|\\|\||:|...) whatever, in place of the square-bracketed expressions up there.
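As a minimal sketch of the alternation approach just described (the name DELIM and the sample strings are illustrative assumptions, not part of the original code), each delimiter position accepts exactly one of /, \, | or ::

```python
import re

# Non-capturing alternation: exactly one "/", "\", "|" or ":" per delimiter.
# Outside square brackets, a literal "|" has to be written as "\|".
DELIM = r'(?:/|\\|\||:)'
pat = re.compile(
    r'[0-9]{15,16}' + DELIM + r'[0-9]{2,4}' + DELIM + r'[0-9]{2,4}' + DELIM + r'[0-9]{3,4}'
)

samples = [
    'test 5555555555555555/03/22/284 test',
    'test 5555555555555555|03|22|284 test',
    'test 5555555555555555://03/22/284 test',  # repeated delimiter: rejected
]
results = [pat.findall(s) for s in samples]
for sample, found in zip(samples, results):
    print(sample, '->', found)
```

Because there is no + after the group, a doubled delimiter such as :// no longer matches.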
I don't recommend assigning the result of the findall back to the original cc variable; for one thing, it's a list, not a string. (You can get the string with e.g. new_cc = match[0]).
Better to create a new variable so (1) you still have the original value in case you need it and (2) when you use the new value in later code, it's clear that it's different.
In fact, if you're going to the trouble of matching this pattern, you might as well go ahead and extract all the components of it at the same time. Just put (...) around the bits you want to keep, and they'll be put in a tuple as the result of that match:
import re
pat = re.compile(r'([0-9]{15,16})[:/|\\]+([0-9]{2,4})[:/|\\]+([0-9]{2,4})[:/|\\]+([0-9]{3,4})')
cc = 'test 5555555555555555/03/22/284 test'
match, = pat.findall(cc)
print(match)
Which outputs this:
('5555555555555555', '03', '22', '284')
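Note that the unpacking match, = pat.findall(cc) raises a ValueError when the text contains no match or more than one. A hedged variant, assuming you only care about the first match, could use search instead (extract_fields is a hypothetical helper name):

```python
import re

pat = re.compile(
    r'([0-9]{15,16})[:/|\\]+([0-9]{2,4})[:/|\\]+([0-9]{2,4})[:/|\\]+([0-9]{3,4})'
)

def extract_fields(text):
    """Return the four captured groups of the first match, or None if there is none."""
    m = pat.search(text)
    return m.groups() if m else None

print(extract_fields('test 5555555555555555|03|22|284 test'))
print(extract_fields('no match here'))
```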

Define both options in the regex so that it works with either delimiter. For example, the following pattern accepts both "/" and "|" between the numbers:
import re
cc = 'test 5555555555555555/03/22/284 test'
#cc = 'test 5555555555555555|03|22|284 test'
cc = re.findall(r'[0-9]{15,16}[/|][0-9]{2,4}[/|][0-9]{2,4}[/|][0-9]{3,4}', cc)
print(cc)

Related

How to split a string in python based on separator with separator as a part of one of the chunks?

Looking for an elegant way to:
Split a string based on a separator
Instead of discarding the separator, make it part of the split chunks.
For instance I do have date and time data like:
D2018-4-21T3:55+6
2018-4-4T3:15+6
D2018-11-21T12:45+6:30
Sometimes there's a D and sometimes not (however, I always want it to be part of the first chunk), there are no leading or trailing zeros in the time, and the timezone only sometimes has a ':'. The point is that it is necessary to split on the characters 'D', 'T' and '+' because the segments might not have the same length. If they did, it would be easier to just split by index. I want to split on multiple characters like T and + and keep them as part of the data, like:
['D2018-4-21', 'T3:55', 'TZ+6']
['D2018-4-4', 'T3:15', 'TZ+6']
['D2018-11-21', 'T12:45', 'TZ+6:30']
I know a nicer way would be to clean the data first and normalize all rows to follow the same pattern, but I'm just curious how to do it as it is.
For now, my ugly solution looks like:
[i+j for _, i in enumerate(['D','T','TZ']) for __, j in enumerate('D2018-4-21T3:55+6'.replace('T',' ').replace('D', ' ').replace('+', ' +').split()) if _ == __]
Use a regular expression
Reference:
https://docs.python.org/3/library/re.html
(...)
Matches whatever regular expression is inside the parentheses, and
indicates the start and end of a group; the contents of a group can be
retrieved after a match has been performed, and can be matched later
in the string with the \number special sequence, described below. To
match the literals '(' or ')', use \( or \), or enclose them inside a
character class: [(], [)].
import re
a = '''D2018-4-21T3:55+6
2018-4-4T3:15+6
D2018-11-21T12:45+6:30'''
b = a.splitlines()
for i in b:
    m = re.search(r'^D?(.*)([T].*?)([-+].*)$', i)
    if m:
        print(["D%s" % m.group(1), m.group(2), "TZ%s" % m.group(3)])
Result:
['D2018-4-21', 'T3:55', 'TZ+6']
['D2018-4-4', 'T3:15', 'TZ+6']
['D2018-11-21', 'T12:45', 'TZ+6:30']
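A different technique that keeps the separators is to split on a zero-width lookahead. This is only a sketch: it assumes Python 3.7 or later (where re.split accepts patterns that match the empty string), that every row has exactly one T and one +, and split_stamp is a hypothetical helper name.

```python
import re

def split_stamp(stamp):
    # Split *before* each "T" or "+"; the lookahead consumes nothing,
    # so the separator stays at the start of the following chunk.
    date, time, zone = re.split(r'(?=[T+])', stamp)
    if not date.startswith('D'):
        date = 'D' + date  # the leading D is optional in the input
    return [date, time, 'TZ' + zone]

for row in ['D2018-4-21T3:55+6', '2018-4-4T3:15+6', 'D2018-11-21T12:45+6:30']:
    print(split_stamp(row))
```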

re.compile() python: issue in getting particular pattern

I have been trying to extract a particular pattern, which looks like (PSSA) or (FJFD10), from a string.
In a string like the one below, I want to extract whatever is inside the parentheses, (PNDM) in this case. However, I want to print it without the parentheses.
eg_string = """DAAAAAAJFF: Hellllllllo (PNDM)
CC [MIM:606176]: Blalblablalbalbl. {CCO:0000069|Pubd:160,
CC ECO:0000269|PubMed:18162506}. Note=elllelefjfjfjf HAahndfd
"""
What I did was:
patti = re.compile(r'([A-Z]+)')
www = patti.findall(eg_string)
However, this gave me more than I needed. It did include PNDM, but it also included things like DAAAAAAJFF and ECO.
Another thing I tried was r'(^[A-Z]+)'. I knew it was going to print only DAAAAAAJFF. I want to know how to extract (PNDM), which is in the middle of the string.
Use the regex r"\([A-Z]+\)" to get the matches including the parentheses.
Demo: https://regex101.com/r/e2gyly/1
Explanation:
\( - matches an opening parenthesis (
[A-Z]+ - one or more characters in the range A to Z
\) - matches a closing parenthesis )
Your pattern ([A-Z]+) treats the parentheses as a capture group, so it matches any run of A-Z anywhere in the string. To match literal parentheses and capture only what is inside them, change it to \(([A-Z]+)\).
Your code will look like this:
import re
eg_string = """DAAAAAAJFF: Hellllllllo (PNDM)
CC [MIM:606176]: Blalblablalbalbl. {CCO:0000069|Pubd:160,
CC ECO:0000269|PubMed:18162506}. Note=elllelefjfjfjf HAahndfd
"""
patti = re.compile(r'\(([A-Z]+)\)')
www = patti.findall(eg_string)
print(www)
#Output : ['PNDM']
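The question also mentions codes such as (FJFD10), which contain digits and would not be matched by [A-Z]+. A small sketch, assuming the codes consist only of uppercase letters and digits (the sample text here is made up for illustration):

```python
import re

# Hypothetical sample containing both kinds of code from the question.
text = "DAAAAAAJFF: Hellllllllo (PNDM) and also (FJFD10)"

# Literal parentheses outside, capture group inside; digits are allowed too.
codes = re.findall(r'\(([A-Z0-9]+)\)', text)
print(codes)  # ['PNDM', 'FJFD10']
```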
Hope this will Help...

python regular expression to match strings

I want to parse a string, such as:
package: name='jp.tjkapp.droid1lwp' versionCode='2' versionName='1.1'
uses-permission:'android.permission.WRITE_APN_SETTINGS'
uses-permission:'android.permission.RECEIVE_BOOT_COMPLETED'
uses-permission:'android.permission.ACCESS_NETWORK_STATE'
I want to get:
string1: jp.tjkapp.droid1lwp
string2: 1.1
Because there are multiple uses-permission lines, I want to get the permissions as a list containing:
WRITE_APN_SETTINGS, RECEIVE_BOOT_COMPLETED and ACCESS_NETWORK_STATE.
Could you help me write the python regular expression to get the strings I want?
Thanks.
Assuming the text you provided is one long string, stored here in a variable called input_string:
import re
name = re.search(r"(?<=name\=\')[\w\.]+?(?=\')", input_string).group(0)
versionName = re.search(r"(?<=versionName\=\')\d+?\.\d+?(?=\')", input_string).group(0)
permissions = re.findall(r'(?<=android\.permission\.)[A-Z_]+(?=\')', input_string)
Explanation:
name
(?<=name\=\'): a lookbehind. It checks the text just before the main string, so only strings preceded by name=' are returned. The \ in front of = and ' escapes them; neither character is special in a regex, so these escapes are harmless but not required. name=' itself is not part of the result; we just know that every result is preceded by it.
[\w\.]+?: This is the main string we're searching for. \w means any alphanumeric character or underscore. \. is an escaped period, so the regex knows we mean a literal . and not the wildcard represented by an unescaped period. Putting these in [] means we accept anything listed in the brackets, so we're saying that we'll accept any alphanumeric character, _, or .. The + afterwards means at least one of the previous thing, meaning at least one (but possibly more) of [\w\.]. Finally, the ? means don't be greedy: we're telling the regex to get the smallest possible match that meets these specifications, since + could otherwise go on for an unlimited number of repeats of anything matched by [\w\.].
(?=\'): a lookahead. It checks the text just after the main string, so only strings followed by ' are returned. The \ is again an escape that is harmless but not required in this double-quoted raw string. This final ' is not returned with our results; we just know that in the original string, it followed any result we do end up getting.
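The same values can also be pulled out with capture groups instead of lookarounds, which some readers find easier to follow. This is a sketch only: input_string is a copy of the sample from the question, and version_name is an illustrative variable name.

```python
import re

input_string = """package: name='jp.tjkapp.droid1lwp' versionCode='2' versionName='1.1'
uses-permission:'android.permission.WRITE_APN_SETTINGS'
uses-permission:'android.permission.RECEIVE_BOOT_COMPLETED'
uses-permission:'android.permission.ACCESS_NETWORK_STATE'"""

# group(1) is the text inside the parentheses, i.e. everything between the quotes.
name = re.search(r"\bname='([^']+)'", input_string).group(1)
version_name = re.search(r"versionName='([^']+)'", input_string).group(1)

# With one capture group, findall returns just the captured part of each match.
permissions = re.findall(r"android\.permission\.([A-Z_]+)'", input_string)

print(name, version_name, permissions)
```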
You can do this without regex by reading the file content line by line.
>>> def split_string(s):
...     if s.startswith('package'):
...         return [i.split('=')[1] for i in s.split() if "=" in i]
...     elif s.startswith('uses-permission'):
...         return s.split('.')[-1]
...
>>> split_string("package: name='jp.tjkapp.droid1lwp' versionCode='2' versionName='1.1'")
["'jp.tjkapp.droid1lwp'", "'2'", "'1.1'"]
>>> split_string("uses-permission:'android.permission.WRITE_APN_SETTINGS'")
"WRITE_APN_SETTINGS'"
>>> split_string("uses-permission:'android.permission.RECEIVE_BOOT_COMPLETED'")
"RECEIVE_BOOT_COMPLETED'"
>>> split_string("uses-permission:'android.permission.ACCESS_NETWORK_STATE'")
"ACCESS_NETWORK_STATE'"
>>>
Here is some example code:
#!/usr/bin/env python
with open("test.txt", "r") as input_file:
    for line in input_file:
        if line.startswith("package"):
            words = line.split()
            string1 = words[1].split("=")[1].replace("'", "")
            string2 = words[3].split("=")[1].replace("'", "")
The test.txt file contains the input data you mentioned earlier.

python regex for repeating string

I want to verify and then parse this string (in quotes):
string = "start: c12354, c3456, 34526; other stuff that I don't care about"
//Note that some codes begin with 'c'
I would like to verify that the string starts with 'start:' and ends with ';'
Afterward, I would like to have a regex parse out the strings. I tried the following python re code:
regx = r"start: (c?[0-9]+,?)+;"
reg = re.compile(regx)
matched = reg.search(string)
print(' matched.groups()', matched.groups())
I have tried different variations but I can either get the first or the last code but not a list of all three.
Or should I abandon using a regex?
EDIT: updated to reflect part of the problem space I neglected and fixed string difference.
Thanks for all the suggestions - in such a short time.
In Python, this isn’t possible with a single regular expression: each capture of a group overrides the last capture of that same group (in .NET, this would actually be possible since the engine distinguishes between captures and groups).
Your easiest solution is to first extract the part between start: and ; and then use a regular expression that returns all matches, not just a single match, with re.findall('c?[0-9]+', text).
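A minimal sketch of that two-step approach, which also verifies the overall format first. LIST_RE and parse_codes are hypothetical names, and the exact whitespace rules are an assumption:

```python
import re

# Step 1: check the overall shape "start: code, code, ...;" and capture the list part.
LIST_RE = re.compile(r'start:\s*(c?[0-9]+(?:,\s*c?[0-9]+)*)\s*;')

def parse_codes(text):
    m = LIST_RE.match(text)
    if m is None:
        return []  # does not start with a well-formed "start: ...;" section
    # Step 2: pull every individual code out of the captured list.
    return re.findall(r'c?[0-9]+', m.group(1))

print(parse_codes("start: c12354, c3456, 34526; other stuff that I don't care about"))
```

Anything after the ; is ignored, and a malformed list yields an empty result instead of a partial one.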
You could use the standard string tools, which are pretty much always more readable.
s = "start: c12354, c3456, 34526;"
s.startswith("start:") # True if s starts with this string
s.endswith(";") # True if s ends with this string
s[6:-1].strip().split(', ') # a list of the tokens separated by ", " (strip() drops the space after "start:")
This can be done (pretty elegantly) with a tool like Pyparsing:
from pyparsing import Group, Literal, OneOrMore, Optional, ParseException, Word
import string
code = Group(Optional(Literal("c"), default='') + Word(string.digits) + Optional(Literal(","), default=''))
parser = Literal("start:") + OneOrMore(code) + Literal(";")
# Read lines from file:
with open('lines.txt', 'r') as f:
    for line in f:
        try:
            result = parser.parseString(line)
            codes = [c[1] for c in result[1:-1]]
            # Do something with teh codez...
        except ParseException:
            # Oh noes: string doesn't match!
            continue
Cleaner than a regular expression, returns a list of codes (no need to string.split), and ignores any extra characters in the line, just like your example.
import re
sstr = re.compile(r'start:([^;]*);')
slst = re.compile(r'(?:c?)(\d+)')
mystr = "start: c12354, c3456, 34526; other stuff that I don't care about"
match = re.match(sstr, mystr)
if match:
    res = re.findall(slst, match.group(0))
results in
['12354', '3456', '34526']

finding and returning a string with a specified prefix

I am close, but I am not sure what to do with the resulting match object. If I do
p = re.search('[/#.* /]', str)
I'll get any words that start with # and end up with a space. This is what I want. However, this returns a Match object that I don't know what to do with. What's the most computationally efficient way of finding and returning a string which is prefixed with a #?
For example,
"Hi there #guy"
After doing the proper calculations, I would be returned
guy
The following regular expression does what you need:
import re
s = "Hi there #guy"
p = re.search(r'#(\w+)', s)
print(p.group(1))
It will also work for the following string formats:
s = "Hi there #guy " # notice the trailing space
s = "Hi there #guy," # notice the trailing comma
s = "Hi there #guy and" # notice the next word
s = "Hi there #guy22" # notice the trailing numbers
s = "Hi there #22guy" # notice the leading numbers
That regex does not do what you think it does.
s = "Hi there #guy"
p = re.search(r'#([^ ]+)', s) # this is the regex you described
print(p.group(1)) # first thing matched inside of ( .. )
But as usual with regex, there are plenty of inputs that break this. For example, if the text is s = "Hi there #guy, what's with the comma?" the result would be guy,.
So you really need to think about everything you do and don't want to match. r'#([a-zA-Z]+)' might be a good starting point; it matches only ASCII letters (a to z and A to Z, no Unicode etc.).
p.group(0) returns the whole text that the pattern matched; if the pattern captures the name in parentheses, as in r'#(\w+)', p.group(1) returns guy. If you want to find out what an object can do, you can call the built-in dir(p). This returns a list of the attributes and methods that are available on that object.
As is evident from the answers so far, regex is the most efficient solution for your problem. The answers differ slightly in what they allow to follow the #:
[^ ] anything but space
\w in python-2.x is equivalent to [A-Za-z0-9_] (unless the LOCALE or UNICODE flag is set); in Python 3 it also matches Unicode word characters for str patterns unless the re.ASCII flag is used
If you have a better idea of which characters might appear in the user name, you can adjust your regex to reflect that, e.g., for only lowercase ASCII letters:
[a-z]
NB: I skipped quantifiers for simplicity.
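To make the difference between these character sets concrete, here is a small comparison with the + quantifier added back. The sample string and the labels are illustrative assumptions:

```python
import re

s = "Hi there #guy22, what's with the comma?"
patterns = {
    'anything but space': r'#([^ ]+)',
    'word characters': r'#(\w+)',
    'lowercase ASCII letters': r'#([a-z]+)',
}

# group(1) is the captured text after the "#".
found = {label: re.search(p, s).group(1) for label, p in patterns.items()}
for label, value in found.items():
    print(label, '->', value)
```

The first pattern keeps the trailing comma, the second stops at it but keeps the digits, and the third stops at the first character that is not a lowercase letter.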
(?<=#)\w+
will match a word if it's preceded by a # (without adding it to the match, a so-called positive lookbehind). This will match "words" that are composed of letters, numbers, and/or underscore; if you don't want those, use (?<=#)[^\W\d_]+
In Python:
>>> strg = "Hi there #guy!"
>>> p = re.search(r'(?<=#)\w+', strg)
>>> p.group()
'guy'
You say: "If I do p = re.search('[/#.* /]', str) I'll get any words that start with # and end up with a space." But this is incorrect: that pattern is a character class which will match ONE character in the set #/.* and space. Note: there's a redundant second / in the pattern.
For example:
>>> re.findall('[/#.* /]', 'xxx#foo x/x.x*x xxxx')
['#', ' ', '/', '.', '*', ' ']
>>>
You say that you want "guy" returned from "Hi there #guy" but that conflicts with "and end up with a space".
Please edit your question to include what you really want/need to match.
