I am processing a flat file, with data in line by line format, like this
... blah blah blah | sku: 01234567 | price: 150 | ... blah blah blah
I want to extract the sku field, which is a number 8 characters long. However, I am not sure whether I should use split or a regex; I am not very good at using regexes in Python.
Assuming your sku values are always 8 characters long, and are always preceded by 'sku' and possibly some ':' (with or without spaces in between), then I would use the regex: r'sku[\s:]*(\d{8})':
>>> import re
>>> string = '... | sku: 01234567 | price: 150 | ... '
>>> re.findall(r'sku[\s:]*(\d{8})', string)[0]
'01234567'
If your sku values length may be variable, just use: r'sku[\s:]*(\d*)':
>>> import re
>>> string = '... | sku: 01234 | price: 150 | sku: 99872453 | blah blah ... '
>>> re.findall(r'sku[\s:]*(\d*)', string)[0]
'01234'
>>> re.findall(r'sku[\s:]*(\d*)', string)[1]
'99872453'
edit
If your 'sku' is followed by some other characters, like sku1, sku2, sku-sp, sku-18 or sku_anything, you could try that:
>>> re.findall(r'sku\D*(\d*)', string)[0]
This is the exact equivalent of:
>>> re.findall(r'sku[^0-9]*([0-9]*)', string)[0]
It's very general. It will match anything that begins with sku, followed by any number of non-decimal characters (\D*, or [^0-9]*), and then by some decimal characters (\d*, or [0-9]*). It will return the latter (a string of decimal characters of undetermined length).
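To see what this general pattern does in practice, here is a small sketch. The sample lines and the helper name extract_sku are made up for illustration, and it uses \d+ rather than \d* so that a match always contains at least one digit.

```python
import re

# Hypothetical sample lines, not taken from the original file
samples = [
    '... | sku: 01234567 | price: 150 | ...',
    '... | sku-sp: 987 | price: 20 | ...',
    '... | sku2: 555 | price: 30 | ...',
]

def extract_sku(line):
    """Return the first run of digits found after 'sku', or None."""
    m = re.search(r'sku\D*(\d+)', line)
    return m.group(1) if m else None

results = [extract_sku(line) for line in samples]
print(results)  # ['01234567', '987', '2']
```

Note the last result: when a digit directly follows sku (as in sku2), that digit is what gets captured, so the pattern is only suitable if the label itself contains no digits.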
Now, here is what the things I used to build these expressions mean:
quantifiers
*: when following a single character or a class of characters, this symbol means that the expression will match any number of repetitions of the character or class it follows, including none (* means "0 or more", + means "at least one", ? means "0 or 1").
The {} are used in the same way as *, + and ?, i.e. they follow a character or a class of characters. They are quantifiers too. c{4} matches exactly 4 consecutive 'c's, and c{1,6} matches between 1 and 6 consecutive 'c's.
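A quick way to get a feel for the quantifiers is to test them with re.fullmatch, which succeeds only when the whole string matches. The helper name matches is just for this sketch.

```python
import re

def matches(pattern, text):
    """True if the entire text matches the pattern."""
    return re.fullmatch(pattern, text) is not None

print(matches(r'c*', ''))             # True: * allows zero repetitions
print(matches(r'c+', ''))             # False: + needs at least one
print(matches(r'c?', 'cc'))           # False: ? allows at most one
print(matches(r'c{4}', 'cccc'))       # True: exactly four
print(matches(r'c{1,6}', 'ccccccc'))  # False: seven is more than six
```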
classes
[]: defines a class of characters. [abc] means any of the characters 'a', 'b', or 'c'. [a-z] means any of the lower case letters, [A-Z] any of the upper case letters, [a-zA-Z] any of the lower and upper case letters, [0-9] any of the decimal characters. If you want to match decimals with dots or commas, with plus, minus and 'e' (for exponentials, for example), just say [0-9,.+e-] (the hyphen goes last so that it is taken literally instead of defining a range).
the ^ inside of a class - defined with [], means 'inverted class', everything but the class. Then, [^0-9] means anything but decimal characters, [^a-z] anything but lower case letters, and so on, and so forth.
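The classes above can be tried out with findall. The sample strings here are invented for the sketch. The last line shows why the position of the hyphen matters: between two characters it defines a range.

```python
import re

text = 'abc XYZ 42'

lower = re.findall(r'[a-z]+', text)       # runs of lower case letters
upper = re.findall(r'[A-Z]+', text)       # runs of upper case letters
digits = re.findall(r'[0-9]+', text)      # runs of decimal characters
not_digits = re.findall(r'[^0-9]+', text) # inverted class: anything but digits

# A class for numbers with dots, commas, signs and exponents; '-' is placed last
numbers = re.findall(r'[0-9,.+e-]+', 'x=-3.5e+2 y=1,5')

# '+-e' in the middle of a class is a range from '+' to 'e', which includes 'A'..'Z'
range_surprise = re.findall(r'[+-e]', 'AZ')

print(lower, upper, digits, not_digits, numbers, range_surprise)
```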
predefined classes
These are classes that are predefined in python, for making the regexes syntax more friendly:
\s: will match any spacing character (space, tabulation, etc.)
\d: will match any decimal character (0, 1, 2, 3, 4, 5, 6, 7, 8, 9 ... This is equivalent to [0-9], which is another way to express a characters class in regexes)
\D: will match any non-decimal character ... This is equivalent to [^0-9], which is another way to express an excluded class of characters in regexes.
\S: will match any non-spacing character ...
\w: will match any 'word character'
\W: will match any non-word character
...
groups
() defines a group. Groups have many uses. Here, in findall, the group marks the part of the match that you want returned, i.e. (\d{8}) or ([0-9]{8}) means you want the expression to return only the 8 decimal characters from the full matching string.
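The effect of the group on findall is easiest to see by running the same pattern with and without the parentheses (same sample string as above):

```python
import re

string = '... | sku: 01234567 | price: 150 | ... '

# Without a group, findall returns the whole matched text
no_group = re.findall(r'sku[\s:]*\d{8}', string)

# With a group, findall returns only what the group captured
with_group = re.findall(r'sku[\s:]*(\d{8})', string)

print(no_group)    # ['sku: 01234567']
print(with_group)  # ['01234567']
```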
Regular expressions are easy to use and very useful once you understand well what they can and cannot do (they are limited to regular languages; if you need to deal with levels of nested things, for example, or other languages defined by context-free grammars, regexes won't be enough). You will probably want to have a look at the following pages:
http://docs.python.org/library/re.html
http://www.regular-expressions.info/tutorial.html
Something like the following should achieve what you need, without being dependent on exact spacing and positioning:
>>> s = '... blah blah blah | sku: 01234567 | price: 150 | ... blah blah blah'
>>> match_obj = re.search(r'sku\s*:\s*(\d+)', s)
>>> match_obj.group(1)
'01234567'
Before you attempt to access the match object with the .group() method, you should check that a match actually occurred (re.search returns None when nothing matches), i.e.: if match_obj: # do something with match_obj.
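Put together, the check could look like the following sketch. The function name find_sku and the second sample line are assumptions made for the example.

```python
import re

def find_sku(line):
    """Return the sku digits from a line, or None if the line has no sku field."""
    match_obj = re.search(r'sku\s*:\s*(\d+)', line)
    if match_obj:
        return match_obj.group(1)
    return None

print(find_sku('... blah | sku: 01234567 | price: 150 | ...'))  # 01234567
print(find_sku('... blah | price: 150 | ...'))                  # None
```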
If all the 8-digit numbers in your string are SKU numbers, you can use
re.findall(r"\b\d{8}\b", mystring)
The \b word boundary anchors ensure that 8-digit substrings within longer numbers/words will not be matched.
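To illustrate the difference the anchors make, here is a sketch on an invented string that contains a 9-digit number and a letter-prefixed number next to two real 8-digit values:

```python
import re

# Hypothetical test string
mystring = 'id 123456789 | sku 01234567 | ref A12345678 | 87654321'

with_boundaries = re.findall(r'\b\d{8}\b', mystring)
without_boundaries = re.findall(r'\d{8}', mystring)

print(with_boundaries)     # ['01234567', '87654321']
print(without_boundaries)  # ['12345678', '01234567', '12345678', '87654321']
```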
If all of the pipe-delimited fields are also key: value pairs, then you might as well keep the rest of the data in case you need it; you're already having to parse the string anyway...
s = "sku: 01234567 | price: 150"
dict( k.split(':') for k in s.split('|') )
# {'sku': ' 01234567 ', ' price': ' 150'}
Might want to trim some excess leading space though
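One possible way to do that trimming, as a sketch: split each field on the first ':' with str.partition and strip both halves. The variable name record is just for the example.

```python
s = "sku: 01234567 | price: 150"

record = {}
for field in s.split('|'):
    # partition splits on the first ':' only, so values containing ':' stay intact
    key, _, value = field.partition(':')
    record[key.strip()] = value.strip()

print(record)  # {'sku': '01234567', 'price': '150'}
```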
In my opinion you should use split; "sku:" and "|" act as separators:
s = "blah blah blah | sku: 01234567 | price: 150 | ... blah blah blah"
s.split("sku:")[1].split("|")[0]
Here is with checking:
s = "blah blah blah | sku: 01234567 | price: 150 | ... blah blah blah"
s1 = s.split("sku:")
if len(s1) == 2:
    print(s1[1].split("|")[0])
Related
I need to extract the real issue number in my file name. There are 2 patterns:
if there is no leading number in the file name, then the number, which we read first, is the issue number. For example
asdasd 213.pdf ---> 213
abcd123efg456.pdf ---> 123
however, sometimes there is a leading number in the file name, which is just the index of file, so I have to ignore/skip it firstly. For example
123abcd 4567sdds.pdf ---> 4567, since 123 is ignored
890abcd 123efg456.pdf ---> 123, since 890 is ignored
I want to learn whether it is possible to write a single regular expression to do this. Currently, my solution involves 2 steps:
if there is a leading number, remove it
find the number in the remaining string
or in Python code
import re
reNumHeading = re.compile(r'^\d{1,}', re.IGNORECASE | re.VERBOSE) # to find leading number
reNum = re.compile(r'\d{1,}', re.IGNORECASE | re.VERBOSE) # to find number
lstTest = '''123abcd 4567sdds.pdf
asdasd 213.pdf
abcd 123efg456.pdf
890abcd 123efg456.pdf'''.split('\n')
for test in lstTest:
    if reNumHeading.match(test):
        span = reNumHeading.match(test).span()
        stripTest = test[span[1]:]
    else:
        stripTest = test
    result = reNum.findall(stripTest)
    if result:
        print(result[0])
thanks
You can use ? quantifier to define optional pattern
>>> import re
>>> s = '''asdasd 213.pdf
... abcd123efg456.pdf
... 123abcd 4567sdds.pdf
... 890abcd 123efg456.pdf'''
>>> for line in s.split('\n'):
...     print(re.search(r'(?:^\d+)?.*?(\d+)', line)[1])
...
213
123
4567
123
(?:^\d+)? here a non-capturing group and ? quantifier is used to optionally match digits at start of line
since + is greedy, all the starting digits will be matched
.*? match any number of characters minimally (because we need the first match of digits)
(\d+) the required digits to be captured
re.search returns a re.Match object from which you can get various details
[1] on the re.Match object will give you string captured by first capturing group
use .group(1) if you are on older version of Python that doesn't support [1] syntax
See also: Reference - What does this regex mean?
Just match digits \d+ that follow a non-digit \D:
import re
lstTest = '''123abcd 4567sdds.pdf
asdasd 213.pdf
abcd 123efg456.pdf
890abcd 123efg456.pdf'''.split('\n')
for test in lstTest:
    res = re.search(r'\D(\d+)', test)
    print(res.group(1))
Output:
4567
213
123
123
Here's something I'm trying to do with regular expressions, and I can't figure out how. I have a big file, and strings abc, 123 and xyz that appear multiple times throughout the file.
I want a regular expression to match a substring of the big file that begins with abc, contains 123 somewhere in the middle, ends with xyz, and there are no other instances of abc or xyz in the substring besides the start and the end.
Is this possible with regular expressions?
When your left- and right-hand delimiters are single characters, it can be easily solved with negated character classes. So, if your match is between a and c and should not contain b (literally), you may use:
a[^abc]*c
This is the same technique you use when you want to make sure there is a b in between the closest a and c:
a[^abc]*b[^ac]*c
When your left- and right-hand delimiters are multi-character strings, you need a tempered greedy token:
abc(?:(?!abc|xyz|123).)*123(?:(?!abc|xyz).)*xyz
To make sure it matches across lines, use re.DOTALL flag when compiling the regex.
Note that to achieve a better performance with such a heavy pattern, you should consider unrolling it. It can be done with negated character classes and negative lookaheads.
Pattern details:
abc - match abc
(?:(?!abc|xyz|123).)* - match any character that is not the starting point of an abc, xyz or 123 character sequence
123 - a literal string 123
(?:(?!abc|xyz).)* - any character that is not the starting point of an abc or xyz character sequence
xyz - a trailing substring xyz
Note that when re.S (an alias of re.DOTALL) is used, . matches any character, including a newline.
See the Python demo:
import re
p = re.compile(r'abc(?:(?!abc|xyz|123).)*123(?:(?!abc|xyz).)*xyz', re.DOTALL)
s = "abc 123 xyz\nabc abc 123 xyz\nabc text 123 xyz\nabc text xyz xyz"
print(p.findall(s))
# => ['abc 123 xyz', 'abc 123 xyz', 'abc text 123 xyz']
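The unrolling mentioned above could look like the following sketch. It is one possible unrolled form, not the only one: each tempered token is replaced by a negated character class for the characters that cannot start a delimiter ([^a1x] or [^ax]), plus a lookahead that is only evaluated at the few characters that can. Since negated classes match newlines, no re.DOTALL flag is needed. The variable names tempered and unrolled are just for the example.

```python
import re

# Original tempered greedy token pattern, kept for comparison
tempered = re.compile(r'abc(?:(?!abc|xyz|123).)*123(?:(?!abc|xyz).)*xyz', re.DOTALL)

# Unrolled variant: the lookahead runs only at 'a', '1' or 'x'
unrolled = re.compile(
    r'abc[^a1x]*(?:(?!abc|xyz|123)[a1x][^a1x]*)*'
    r'123[^ax]*(?:(?!abc|xyz)[ax][^ax]*)*xyz'
)

s = "abc 123 xyz\nabc abc 123 xyz\nabc text 123 xyz\nabc text xyz xyz"
print(unrolled.findall(s))
# => ['abc 123 xyz', 'abc 123 xyz', 'abc text 123 xyz']
```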
Using PCRE a solution would be:
This uses the m flag. If you want to check only from the start to the end of a line, add ^ and $ at the beginning and end respectively.
abc(?!.*(abc|xyz).*123).*123(?!.*(abc|xyz).*xyz).*xyz
The comment by hvd is quite appropriate, and this just provides an example. In SQL, for instance, I think it would be clearer to do:
where val like 'abc%123%xyz' and
val not like 'abc%abc%' and
val not like '%xyz%xyz'
I imagine something quite similar is simple to do in other environments.
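For instance, a rough Python counterpart of those SQL conditions, using plain string methods instead of a regex, might look like this. Like the SQL, it tests a whole value rather than searching inside a big file, and the function name is_wanted is made up for the sketch.

```python
def is_wanted(val):
    """Check one candidate string: abc ... 123 ... xyz, with no extra abc or xyz."""
    return (
        val.startswith('abc')
        and val.endswith('xyz')
        and '123' in val
        and val.count('abc') == 1   # no other abc besides the start
        and val.count('xyz') == 1   # no other xyz besides the end
    )

print(is_wanted('abc text 123 xyz'))  # True
print(is_wanted('abc abc 123 xyz'))   # False
print(is_wanted('abc text xyz xyz'))  # False
```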
You could use lookaround.
/^abc(?!.*abc).*123.*(?<!xyz.*)xyz$/g
(I've not tested it.)
In my string (example adapted from this tutorial) I want to get everything until the first following . after the generic (year). pattern:
str = 'purple alice#google.com, (2002).blah monkey. (1991).#abc.com blah dishwasher'
I think I'm almost there with my code but not quite yet:
test = re.findall(r'[\(\d\d\d\d\).-]+([^.]*)', str)
... which returns: ['com, (2002)', 'blah monkey', ' (1991)', '#abc', 'com blah dishwasher']
The desired output is:
['blah monkey', '#abc']
In other words, I want to find everything that is between the year pattern and the next dot.
If you want to get every thing between (year). and the first . you can use this:
\(\d{4}\)\.([^.]*)
And explanation here:
"\(\d{4}\)\.([^.]*)"g
\( matches the character ( literally
\d{4} match a digit [0-9]
Quantifier: {4} Exactly 4 times
\) matches the character ) literally
\. matches the character . literally
1st Capturing group ([^.]*)
[^.]* match a single character not present in the list below
Quantifier: * Between zero and unlimited times, as many times as possible, giving back as needed [greedy]
. the literal character .
g modifier: global. All matches (don't return on first match)
This should do the trick
print(re.findall(r'\(\d{4}\)\.([^\.]+)', str))
# ['blah monkey', '#abc']
You are using [...] in the wrong way. Try with \(\d{4}\)\.([^.]*)\.:
>>> s = 'purple alice#google.com, (2002).blah monkey. (1991).#abc.com blah dishwasher'
>>> re.findall(r'\(\d{4}\)\.([^.]*)\.', s)
['blah monkey', '#abc']
For the reference, [...] specifies a character class. By using [\(\d\d\d\d\).-] you were saying: one of 0123456789().-.
I have a list of fasta sequences, each of which look like this:
>>> sequence_list[0]
'gi|13195623|ref|NM_024197.1| Mus musculus NADH dehydrogenase (ubiquinone) 1 alp
ha subcomplex 10 (Ndufa10), mRNAGCCGGCGCAGACGGCGAAGTCATGGCCTTGAGGTTGCTGAGACTCGTC
CCGGCGTCGGCTCCCGCGCGCGGCCTCGCGGCCGGAGCCCAGCGCGTGGG (etc)
I'd like to be able to extract the gene names from each of the fasta entries in my list, but I'm having difficulty finding the right regular expression. I thought this one would work: "^/(.+/),$". Start with a parenthesis, then any number of any character, then end with a parenthesis followed by a comma. Unfortunately, this returns None:
test = re.search(r"^/(.+/),$", sequence_list[0])
print(test)
Can someone point out the error in this regex?
Without any capturing groups,
>>> import re
>>> str = """
... gi|13195623|ref|NM_024197.1| Mus musculus NADH dehydrogenase (ubiquinone) 1 alp
... ha subcomplex 10 (Ndufa10), mRNAGCCGGCGCAGACGGCGAAGTCATGGCCTTGAGGTTGCTGAGACTCGTC
... CCGGCGTCGGCTCCCGCGCGCGGCCTCGCGGCCGGAGCCCAGCGCGTGGG (etc)"""
>>> m = re.findall(r'(?<=\().*?(?=\),)', str)
>>> m
['Ndufa10']
It matches only the text inside parentheses, and only when the closing parenthesis is followed by a comma.
Explanation:
(?<=\() In regex, (?<=pattern) is called a lookbehind. It asserts that the text immediately before the current position matches the pattern inside the lookbehind, without making that text part of the match. In our case the pattern inside the lookbehind is \(, which means a literal (.
.*?(?=\),) .*? matches any character zero or more times; the ? after the * makes the match reluctant, so it does the shortest possible match. The lookahead (?=\),) requires the matched characters to be followed by ),
you need to escape parenthesis:
>>> re.findall(r'\([^)]*\),', txt)
['(Ndufa10),']
Can someone point out the error in this regex? r"^/(.+/),$"
regex escape character is \ not / (do not confuse with python escape character which is also \, but is not needed when using raw strings)
=> r"^\(.+\),$"
^ and $ match start/end of the input string, not what you want to output
=> r"\(.+\),"
you need to match "any" characters up to the 1st occurrence of ), and not to the last one, so you need the lazy operator +?
=> r"\(.+?\),"
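in case gene names cannot contain a ) character, you can use a negated class instead; it avoids backtracking and, unlike .+?, it cannot run across an earlier closing parenthesis such as the one in (ubiquinone)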
=> r"\([^)]+\),"
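The steps above can be compared side by side on a single-line version of the header. The sample string is shortened from the question and joined onto one line, which is an assumption of this sketch. On this input the lazy version still starts at the first ( and therefore returns the same text as the greedy one; only the negated class isolates the gene name.

```python
import re

# Header from the question, shortened and written on one line for illustration
header = ('gi|13195623|ref|NM_024197.1| Mus musculus NADH dehydrogenase '
          '(ubiquinone) 1 alpha subcomplex 10 (Ndufa10), mRNAGCCGGCGCAGACGG')

anchored = re.search(r"^\(.+\),$", header)          # None: string does not start with (
greedy = re.search(r"\(.+\),", header).group()      # runs from the first ( to the last ),
lazy = re.search(r"\(.+?\),", header).group()       # still starts at the first (
negated = re.search(r"\([^)]+\),", header).group()  # cannot cross a )
name = re.search(r"\(([^)]+)\),", header).group(1)  # capture only the gene name

print(anchored)  # None
print(greedy)    # (ubiquinone) 1 alpha subcomplex 10 (Ndufa10),
print(lazy)      # (ubiquinone) 1 alpha subcomplex 10 (Ndufa10),
print(negated)   # (Ndufa10),
print(name)      # Ndufa10
```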
I will explain my problem with an example. Here are two different versions of my text:
Version 1:
Blah: 1 2345 $ blah blah blah
Version 2:
Blah: 1 2345 $ (9 8546 $) blah blah blah
I try to write a regex in Python where if the text is in Version 2, then it will return the number in the parenthesis. Otherwise, it will return the number outside.
pat = re.compile(r"Blah: [0-9]+\s[0-9]+ /$ \(([0-9]+\s[0-9]+)|Blah: ([0-9]+\s[0-9]+)")
pat.findall(text)
The problem is that it returns ('1 2345', '') or ('', '9 8546') in each case.
How can I change the regex to return only the number?
If you are pretty comfortable with the RegEx you wrote, then I would suggest not to change the RegEx and get the value like this
print("".join(pat.findall(text)[0]))
This will just concatenate the matching results. Since the other group captures nothing, you will get a single string.
Note: Also, you need to escape $ in your RegEx, like \$, otherwise it will be considered as the end of line.
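Combining both points, a sketch with the $ escaped and the two groups joined could look like this. The function name extract_number is hypothetical, and it returns None when nothing matches instead of raising an IndexError.

```python
import re

# The pattern from the question with the $ escaped as \$
pat = re.compile(r"Blah: [0-9]+\s[0-9]+ \$ \(([0-9]+\s[0-9]+)|Blah: ([0-9]+\s[0-9]+)")

def extract_number(text):
    """Return the parenthesized number if present, else the first one, else None."""
    matches = pat.findall(text)
    # Exactly one of the two groups is non-empty, so joining yields that group
    return "".join(matches[0]) if matches else None

print(extract_number('Blah: 1 2345 $ blah blah blah'))             # 1 2345
print(extract_number('Blah: 1 2345 $ (9 8546 $) blah blah blah'))  # 9 8546
```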
Don't use findall. The only situation in which it is useful is when you have a simple regex and you want to get all its matches. When you start having capturing groups it easily becomes quite useless.
The finditer method returns the actual match objects created during matching instead of returning the tuples of the matched groups. You can slightly modify your regex to use capturing groups:
pat = re.compile(r'Blah: (\d+\s\d+) \$ (\((\d+\s\d+)\s*\$\))?')
Afterwards to get the matched number you can use match.group(3) or match.group(1) to select one or the other depending whether there was a parenthesized match:
text = 'Blah: 1 2345 $ (9 8546 $) blah blah blah\nBlah: 1 2345 $ blah blah blah'
[m.group(3) or m.group(1) for m in pat.finditer(text)]
Outputs:
Out[12]: ['9 8546', '1 2345']