Python regex to split both on number and on capital letter - python

I cannot get exactly what I want with regex
I have, for example a string
2000H2HfH
I need to get ['2000','H','2','Hf','H'].
So I need to split on numbers and on capital letters, where a capital letter may be followed by lowercase letters.
I use this ([A-Z][a-z]?)(\d+)? and lose the starting number. I understand why that happens, but how can I get it back so the result is readable?

You may use
re.findall(r'\d+|[A-Z][a-z]*', text)
See a regex demo. Details:
\d+ - 1+ digits
| - or
[A-Z][a-z]* - an upper case letter and then zero or more lowercase ones.
See a Python demo:
import re
text = "2000H2HfH"
print( re.findall(r'\d+|[A-Z][a-z]*', text) )
# => ['2000', 'H', '2', 'Hf', 'H']

You have two capture groups one after another, so they are matched in sequence, and the first one requires a capital letter, which is why the leading number is lost. To achieve your goal, combine them into one alternation like this:
([A-Z][a-z]?|\d+)?
Here the | symbol means that you capture either a capital letter followed by an optional lowercase letter, OR a number.
There is a very nice service for composing and testing regular expressions: https://regex101.com/
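One caution about the pattern above: the trailing ? makes the whole group optional, so re.findall can also return an empty string at a position where nothing matches. A minimal sketch, assuming the sample string from the question (the variable names are illustrative), compares it with the same alternation without the outer ?:

```python
import re

text = "2000H2HfH"

# Optional group: the pattern can also match the empty string at the end of the text.
with_optional = re.findall(r'([A-Z][a-z]?|\d+)?', text)

# Same alternation without the outer "?": only non-empty tokens are returned.
tokens = re.findall(r'[A-Z][a-z]?|\d+', text)

print(with_optional)
print(tokens)
```

Dropping the outer ? (and the group, which is no longer needed) keeps the result free of empty entries.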

Related

Extracting float or int number and substring from a string

I've just learned regex in python3 and was trying to solve a problem.
The problem is something like this:
You are given a string where the first part is a float or integer and the rest is a substring. You must split the number and the substring and return them as a list. The substring contains only the letters a-z and A-Z. The numbers can be negative.
For example:
Input: 2.5ax
Output:['2.5','ax']
Input: -5bcf
Output:['-5','bcf']
Input:-69.67Gh
Output:['-69.67','Gh']
and so on.
I did several attempts with regex to solve the problem.
1st attempt:
import re
i=input()
print(re.findall(r'^(-?\d+(\.\d+)?)|[a-zA-Z]+$',i))
For the input -2.55xy, the expected output was ['-2.55','xy']
But the output came:
[('-2.55', '.55'), ('', '')]
2nd attempt:
My second attempt was similar to my first attempt just a little different:
import re
i=input()
print(re.findall(r'^(-?(\d+\.\d+)|\d+)|[a-zA-Z]+$',i))
For the same input -2.55xy, the output came as:
[('-2.55', '2.55'), ('', '')]
3rd attempt:
My next attempt was like that:
import re
i=input()
print(re.findall(r'^-?[1-9.]+|[a-z|A-Z]+$',i))
which matched the expected output for -2.55xy and also with the sample examples. But when the input is 2..5 or something like that, it considers that also as a float.
4th attempt:
import re
i=input()
value=re.findall(r"[a-zA-Z]+",i)
print([i.replace(value[0],""),value[0]])
which also matches the expected output but has the same problem as 3rd one that goes with it. Also, it doesn't look like an effective way to do it.
Conclusion:
So I don't know why my 1st and 2nd attempts aren't working. The output is a list of tuples, which may be because of the groups, but I don't know the exact reason or how to fix it. Maybe I didn't understand how the pattern works. Also, why didn't the substring show up in the output?
In the end, I want to know what's the mistake in my code and how can I write better and more efficient code to solve the problem. Thank you and sorry for my bad English.
The alternation | matches either the left part or the right part.
If the chars a-zA-Z are after the digit, you don't need the alternation | and you can use 2 capture groups to get the matches in that order.
Then using re.findall will return a list of tuples for the capture group values.
(-?\d+(?:\.\d+)?)([a-zA-Z]+)
Explanation
( Capture group 1
-?\d+ Match an optional - and 1+ digits
(?:\.\d+)? Optionally match . and 1+ digits using a non capture group (so it is not outputted separately by re.findall)
) Close group 1
( Capture group 2
[a-zA-Z]+ Match 1+ times a char a-z or A-Z
) Close group 2
regex demo
import re
strings = [
"2.5ax",
"-5bcf",
"-69.67Gh",
]
pattern = r"(-?\d+(?:\.\d+)?)([a-zA-Z]+)"
for s in strings:
    print(re.findall(pattern, s))
Output
[('2.5', 'ax')]
[('-5', 'bcf')]
[('-69.67', 'Gh')]
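As for why the first two attempts printed tuples: when a pattern contains capture groups, re.findall returns the groups rather than the whole match, and the [a-zA-Z]+ alternative sits outside every group, so its match shows up as empty strings. A small sketch, assuming the -2.55xy input from the question (split_number_and_text is a hypothetical helper name, not part of the re module), shows the effect and one way to validate the whole input with re.fullmatch:

```python
import re

s = "-2.55xy"

# First attempt from the question: two capture groups, so findall returns tuples.
attempt_1 = re.findall(r'^(-?\d+(\.\d+)?)|[a-zA-Z]+$', s)

# Same idea with a non-capturing group: findall returns the whole matches.
no_groups = re.findall(r'^-?\d+(?:\.\d+)?|[a-zA-Z]+$', s)

def split_number_and_text(value):
    """Hypothetical helper: return [number, letters], or None if the input is malformed."""
    m = re.fullmatch(r'(-?\d+(?:\.\d+)?)([a-zA-Z]+)', value)
    return list(m.groups()) if m else None

print(attempt_1)
print(no_groups)
print(split_number_and_text("-69.67Gh"))
print(split_number_and_text("2..5ax"))
```

Because re.fullmatch must consume the entire string, an input such as 2..5ax is rejected instead of being partially matched.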
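Lookahead and lookbehind in re.split sometimes simplify things.
(?<=\d) look behind: a digit precedes this position
(?=[a-zA-Z]) look ahead: a letter follows this position
Together they match the empty position between the digit and the letter, and re.split splits there.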
import re
strings = [
"2.5ax",
"-5bcf",
"-69.67Gh",
]
for s in strings:
    print(re.split(r'(?<=\d)(?=[a-zA-Z])', s))
['2.5', 'ax']
['-5', 'bcf']
['-69.67', 'Gh']
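One caveat worth checking before relying on the split: it only looks at the boundary between a digit and a letter, so it does not validate the number itself, and splitting on an empty match requires Python 3.7 or later. A short sketch under those assumptions (the sample inputs are illustrative):

```python
import re

# Zero-width boundary between a digit and a letter (needs Python 3.7+ for re.split).
boundary = re.compile(r'(?<=\d)(?=[a-zA-Z])')

well_formed = boundary.split("-69.67Gh")
malformed = boundary.split("2..5ax")  # still splits; "2..5" is not rejected

print(well_formed)
print(malformed)
```

If malformed numbers must be rejected, a pattern that matches the whole input is a safer choice than a split.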

Python regex negative lookahead matching where it shouldn't

Example first:
import re
details = 'input1 mem001 output1 mem005 data2 mem002 output12 mem006'
input_re = re.compile(r'(?!output[0-9]*) mem([0-9a-f]+)')
print(input_re.findall(details))
# Out: ['001', '005', '002', '006']
I am using negative lookahead to extract the hex part of the mem entries that are not preceded by an output, however as you can see it fails. The desired output should be: ['001', '002'].
What am I missing?
You may use this regex in findall:
\b(?!output\d+)\w+\s+mem([a-fA-F\d]+)
RegEx Demo
RegEx Details:
\b: Word boundary
(?!output\d+): Negative lookahead to assert that we don't have output and 1+ digits ahead
\w+: Match 1+ word characters
\s+: Match 1+ whitespaces
mem([a-fA-F\d]+): Match mem followed by 1+ hex characters, captured in group 1
Code:
import re
s = 'input1 mem001 output1 mem005 data2 mem002 output12 mem006'
print( re.findall(r'\b(?!output\d+)\w+\s+mem([a-fA-F\d]+)', s) )
Output:
['001', '002']
Maybe an easier approach is to split the task into two regular expressions.
First, filter out anything that starts with output and is followed by a mem entry, like so:
output[0-9]* mem([0-9a-f]+)
Removing those matches leaves
input1 mem001 data2 mem002
Once they are filtered out, just search for mem again:
mem([0-9a-f]+)
That gives your desired output:
['001', '002']
This may not answer the original question directly, but it does solve your problem.
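The two-step idea could be sketched like this, assuming re.sub is used for the filtering step (the variable names are illustrative):

```python
import re

details = 'input1 mem001 output1 mem005 data2 mem002 output12 mem006'

# Step 1: remove every "outputN memXXX" pair.
without_outputs = re.sub(r'output[0-9]* mem[0-9a-f]+', '', details)

# Step 2: collect the hex part of the remaining mem entries.
result = re.findall(r'mem([0-9a-f]+)', without_outputs)

print(result)
```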
First of all, let's understand why your original regex doesn't work:
A regex encapsulates two pieces of information: a description of a location within a text, and a description of what to capture from that location. Your original regex tells the regex matcher: "Find a location within the text where the following characters are not 'output'+digits but they are ' mem'+alphanumerics". Think of the logic of that expression: if the matcher finds a location in the text where the following characters are ' mem'+alphanumerics, then, in particular, the following characters are not 'output'+digits. Your look-ahead does not add anything to the expression.
What you really need is to tell the matcher: "Find a location in the text where the following characters are ' mem'+alphanumerics, and the previous characters are not 'output'+digits". So what you really need is a look-behind, not a look-ahead.
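@ArtyomVancyan proposed a good regex with a look-behind, and it could easily be modified to what you need: instead of a single digit after the 'output', you want potentially more digits, so just put an asterisk (*) after the '\d'. Note, however, that Python's built-in re module only accepts fixed-width look-behind patterns, so a variable-length look-behind like this needs the third-party regex module or a different approach.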
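To make the point concrete, the sketch below first checks that the original look-ahead changes nothing, and then shows one possible workaround that avoids look-behind entirely by capturing the preceding word and filtering it in Python. The workaround is an assumption of this example, not the regex proposed in the other answer, and the variable names are illustrative.

```python
import re

details = 'input1 mem001 output1 mem005 data2 mem002 output12 mem006'

# The look-ahead is tested at the space before "mem", so it can never fail there.
with_lookahead = re.findall(r'(?!output[0-9]*) mem([0-9a-f]+)', details)
without_lookahead = re.findall(r' mem([0-9a-f]+)', details)

# Workaround: capture the word before each mem entry, then drop the output ones.
pairs = re.findall(r'(\w+) mem([0-9a-f]+)', details)
inputs_only = [hex_part for name, hex_part in pairs
               if not re.fullmatch(r'output[0-9]*', name)]

print(with_lookahead == without_lookahead)
print(inputs_only)
```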

Split according to regex condition

Here is another question of mine:
string = "Organization: S.P. Dyer Computer Consulting, Cambridge MA"
How can I capture everything after "Organization: ", whether it is a full stop, a digit, or anything else, using regex?
result_organization = re.search("(Organization: )(\w*\.*\w*\.*\w*\s*\w*\s*\w*\s*)", string)
My code above is very long and not a sensible approach.
I would recommend using the str.find method, like this:
print(string[string.find("Organization")+14:])
You don't need regex for that; this simple code should give you the desired result:
s = "Organization: S.P. Dyer Computer Consulting, Cambridge MA"
if s.startswith("Organization: "):
    s = s[14:]  # drop the 14-character "Organization: " prefix
print(s)
You could also use the pattern (?<=Organization: ).+
Explanation:
(?<=Organization: ) - positive lookbehind, asserts that what precedes the current position is Organization: followed by a space
.+ - match one or more of any character except newline characters.
Demo
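A minimal sketch of the lookbehind pattern in Python, using the string from the question. With a lookbehind there is no group to extract, so the whole match is the organization name:

```python
import re

string = "Organization: S.P. Dyer Computer Consulting, Cambridge MA"

# The lookbehind is not part of the match, so group(0) is only the text after the prefix.
match = re.search(r'(?<=Organization: ).+', string)
organization = match.group(0) if match else None

print(organization)
```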
You could use a single capturing group instead of two.
Instead of spelling out all the words with (\w*\.*\w*\.*\w*\s*\w*\s*\w*\s*), you can match any character except a newline using the dot, repeated 1 or more times, so that it matches until the end of the line.
But note that this would also match strings like ##$$ ++
^Organization: (.+)
Regex demo | Python demo
For example
import re
string = "Organization: S.P. Dyer Computer Consulting, Cambridge MA"
result_organization = re.search("Organization: (.*)", string)
print(result_organization.group(1))
If you want a somewhat more restrictive pattern you might use a character class and specify what you would allow to match. For example:
^Organization: ([\w.,]+(?: [\w.,]+)*)
Regex demo
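To illustrate the difference between the permissive and the restrictive pattern, here is a small sketch. The second input is a made-up string of symbols used only for contrast, and the variable names are illustrative:

```python
import re

permissive = re.compile(r'^Organization: (.+)')
restrictive = re.compile(r'^Organization: ([\w.,]+(?: [\w.,]+)*)')

good = "Organization: S.P. Dyer Computer Consulting, Cambridge MA"
noisy = "Organization: ##$$ ++"

good_name = restrictive.search(good).group(1)
noisy_permissive = permissive.search(noisy).group(1)
noisy_restrictive = restrictive.search(noisy)  # None: "#" is not in the character class

print(good_name)
print(noisy_permissive)
print(noisy_restrictive)
```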

python3: regex, find all substrings that starts with and end with certain string

Let's say that I have a string that looks like this:
a = '1253abcd4567efgh8910ijkl'
I want to find all substrings that start with a digit and end with a letter.
I tried,
b = re.findall('\d.*\w',a)
but this gives me,
['1253abcd4567efgh8910ijkl']
I want to have something like,
['1253abcd','4567efgh','8910ijkl']
How can I do this? I'm pretty new to regex method, and would really appreciate it if anyone can show how to do this in different method within regex, and explain what's going on.
\w will match any word character, which means letters, digits, and the underscore. You need to use [a-zA-Z] to match letters only. See this example.
import re
a = '1253abcd4567efgh8910ijkl'
b = re.findall(r'(\d+[A-Za-z]+)', a)
print(b)
Output:
['1253abcd', '4567efgh', '8910ijkl']
\d will match digits. \d+ will match one or more consecutive digits. For example:
>>> re.findall(r'(\d+)', a)
['1253', '4567', '8910']
Similarly, [a-zA-Z]+ will match one or more letters.
>>> re.findall(r'([a-zA-Z]+)', a)
['abcd', 'efgh', 'ijkl']
Now put them together to match what you exactly want.
From the Python manual on regular expressions, it tells us that \w:
matches any alphanumeric character and the underscore; this is equivalent to the set [a-zA-Z0-9_]
So you are actually over capturing what you need. Refine your regular expression a bit:
>>> re.findall(r'(\d+[a-z]+)', a, re.I)
['1253abcd', '4567efgh', '8910ijkl']
The re.I makes your expression case insensitive, so it will match upper and lower case letters as well:
>>> re.findall(r'(\d+[a-z]+)', '12124adbad13434AGDFDF434348888AAA')
['12124adbad']
>>> re.findall(r'(\d+[a-z]+)', '12124adbad13434AGDFDF434348888AAA', re.I)
['12124adbad', '13434AGDFDF', '434348888AAA']
In your pattern, .* greedily matches any characters, so \d.*\w grabs everything from the first digit up to the last word character. That is why you get the whole string back as a single match.
Solution:
>>> b = re.findall(r'\d*[A-Za-z]*', a)
>>> b
['1253abcd', '4567efgh', '8910ijkl', '']
You will get '' (an empty string) at the end of the list because the pattern can also match nothing at the end of the string. You can remove it using
b.pop(-1)
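If the trailing empty string is unwanted, one option is to require at least one digit and one letter with + instead of *, so the pattern can no longer match nothing. A brief sketch with the string from the question (the variable names are illustrative):

```python
import re

a = '1253abcd4567efgh8910ijkl'

with_star = re.findall(r'\d*[A-Za-z]*', a)  # can match the empty string
with_plus = re.findall(r'\d+[A-Za-z]+', a)  # needs digits followed by letters

print(with_star)
print(with_plus)
```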

Limiting regex length

I'm having an issue in Python creating a regex to get each occurrence that matches a pattern.
I wrote the code below and need help with it.
strToSearch= "1A851B 1C331 1A3X1 1N111 1A3 and a whole lot of random other words."
print(re.findall('\d{1}[A-Z]{1}\d{3}', strToSearch.upper())) #1C331, 1N111
print(re.findall('\d{1}[A-Z]{1}\d{1}[X]\d{1}', strToSearch.upper())) #1A3X1
print(re.findall('\d{1}[A-Z]{1}\d{3}[A-Z]{1}', strToSearch.upper())) #1A851B
print(re.findall('\d{1}[A-Z]{1}\d{1}', strToSearch.upper())) #1A3
>['1A851', '1C331', '1N111']
>['1A3X1']
>['1A851B']
>['1A8', '1C3', '1A3', '1N1', '1A3']
As you can see, the first one returns "1A851", which I don't want. How do I keep it from matching in the first regex? One thing to know is that a code may appear in the string like " words words 1A851B?", so I need to keep the punctuation from being grabbed.
Also, how can I combine these into one regex? Essentially, my end goal is to run code in Python similar to the pseudocode below.
lstResults = []
strToSearch= " Alot of 1N1X1 people like to eat 3C191 cheese and I'm a 1A831B aka 1A8."
lstResults = re.findall('<REGEX HERE>', strToSearch)
for r in lstResults:
print(r)
And the desired output would be
1N1X1
3C191
1A831B
1A8
With single regex pattern:
strToSearch= " Alot of 1N1X1 people like to eat 3C191 cheese and I'm a 1A831B aka 1A8."
lstResults = [i[0] for i in re.findall(r'(\d[A-Z]\d{1,3}(X\d|[A-Z])?)', strToSearch)]
print(lstResults)
The output:
['1N1X1', '3C191', '1A831B', '1A8']
You may use word boundaries:
\b\d{1}[A-Z]{1}\d{3}\b
See demo
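A quick check of the word-boundary version against the first test string from the question (upper-cased, as in the original code); the variable names other than strToSearch are illustrative:

```python
import re

strToSearch = "1A851B 1C331 1A3X1 1N111 1A3 and a whole lot of random other words."

# Without boundaries, the first five characters of 1A851B are matched as well.
without_boundaries = re.findall(r'\d[A-Z]\d{3}', strToSearch.upper())

# With \b on both sides, only complete words of that shape match.
with_boundaries = re.findall(r'\b\d[A-Z]\d{3}\b', strToSearch.upper())

print(without_boundaries)
print(with_boundaries)
```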
For the combination, it is unclear by what criterion you consider a word a "random word", but you can use something like this:
[A-Z\d]*\d[A-Z\d]*[A-Z][A-Z\d]*
This matches a word that contains at least one digit followed, somewhere later, by at least one letter. See demo.
Or maybe you can use:
\b\d[A-Z\d]*[A-Z][A-Z\d]*
for a word that starts with a digit and contains at least one letter. See demo.
Or, if you want to combine exactly those regexes, use:
\b\d[A-Z]\d(X\d|\d{2}[A-Z]?)?\b
See the final demo.
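When the combined pattern is used with re.findall, its capture group would make findall return only the group contents, so this sketch swaps in a non-capturing group (?:...). That substitution is an adjustment made for this example; the sample sentence is the one from the question.

```python
import re

strToSearch = " Alot of 1N1X1 people like to eat 3C191 cheese and I'm a 1A831B aka 1A8."

# Non-capturing group so findall returns the whole match, not just the optional suffix.
combined = re.compile(r'\b\d[A-Z]\d(?:X\d|\d{2}[A-Z]?)?\b')

lstResults = combined.findall(strToSearch)
for r in lstResults:
    print(r)
```

The trailing \b also keeps punctuation such as the final full stop out of the match.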
If you want to find "words" where there are both digits and letters mixed, the easiest is to use the word-boundary operator, \b; but notice that you need to use r'' strings / escape the \ in the code (which you would need to do for the \d anyway in future Python versions). To match any sequence of alphanumeric characters separated by word boundary, you could use
r'\b[0-9A-Z]+\b'
However, this wouldn't yet guarantee that there is at least one number and at least one letter. For that we will use positive zero-width lookahead assertion (?= ) which means that the whole regex matches only if the contained pattern matches at that point. We need 2 of them: one ensures that there is at least one digit and one that there is at least one letter:
>>> p = r'\b(?=[0-9A-Z]*[0-9])(?=[0-9A-Z]*[A-Z])[0-9A-Z]+\b'
>>> re.findall(p, '1A A1 32 AA 1A123B')
['1A', 'A1', '1A123B']
This will now match everything including 33333A or AAAAAAAAAA3A for as long as there is at least one digit and one letter. However if the pattern will always start with a digit and always contain a letter, it becomes slightly easier, for example:
>>> p = r'\b\d+[A-Z][0-9A-Z]*\b'
>>> re.findall(p, '1A A1 32 AA 1A123B')
['1A', '1A123B']
i.e. A1 didn't match because it doesn't start with a digit.
