There are always 2 digits in between, and I want to extract everything before the 3 (so Salvatore) and everything after the 2 (so Abdulla).
For example I have the following:
txt = "Salvatore32Abdulla"
first = re.findall("^\D+", txt)
last = re.search(,txt)
Expected result:
first = 'Salvatore'
last = 'Abdulla'
I can get the first part, but I can't get the last part after the 2.
You could also do this in a single line by slightly changing the solution suggested by @ctwheels, as follows. I would suggest using re.findall, as it gets the job done in one go.
import re
txt = "Salvatore32Abdulla"
Option-1
Single line extraction of the non-numeric parts.
first, last = re.findall(r"\D+", txt)
print((first, last))
('Salvatore', 'Abdulla')
Option-2
If you would (for some reason) also want to keep track of the number in between:
# With capture groups, findall returns a list of tuples, so take the first tuple.
first, num, last = re.findall(r"(\D+)(\d{2})(\D+)", txt)[0]
print((first, num, last))
('Salvatore', '32', 'Abdulla')
Option-3
As an extension of Option-2 and considering the text with a form 'Salvatore####...###Abdulla', where ####...### denotes a continuous block of digits separating the non-numeric parts and you may or may not have any idea of how many digits could be in-between, you could use the following:
# \d+ requires at least one digit between the two non-numeric parts.
first, num, last = re.findall(r"(\D+)(\d+)(\D+)", txt)[0]
print((first, num, last))
('Salvatore', '32', 'Abdulla')
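Another way to look at the same problem is to split on the digit block instead of matching around it. The following is only a sketch using re.split; the names parts and parts_with_num are illustrative and not part of the answer above.

```python
import re

txt = "Salvatore32Abdulla"

# Split on one or more digits; the digits themselves are discarded.
parts = re.split(r"\d+", txt)
print(parts)

# Wrapping the separator in a capture group keeps it in the result.
parts_with_num = re.split(r"(\d+)", txt)
print(parts_with_num)
```

This works for any number of digits in between, but it assumes the string contains exactly one block of digits.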
Why am I not getting the expected results?
You currently have one issue with your regex and one with your code.
Your regex contains ^, which anchors it to the start of the string. This will only allow you to match Salvatore. You're using findall (which is the appropriate choice if you change the regex to simply \D+), but right now it's only getting one result.
The second re.search call is not needed as you can capture first and last with the findall given an appropriate pattern (see below).
How do I fix it?
import re
txt = "Salvatore32Abdulla"
x = re.findall(r"\D+", txt)
print(x)
Result:
['Salvatore', 'Abdulla']
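To see the effect of the ^ anchor side by side, here is a small comparison of the two patterns on the same input. The variable names anchored and unanchored are just for illustration.

```python
import re

txt = "Salvatore32Abdulla"

# Anchored: only a run of non-digits at the very start can match.
anchored = re.findall(r"^\D+", txt)

# Unanchored: every run of non-digits in the string is returned.
unanchored = re.findall(r"\D+", txt)

print(anchored)
print(unanchored)
```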
You could use a regex like this:
txt = "Salvatore32Abdulla"
regex = r"(\D+)\d\d(\D+)"
match = re.match(regex, txt)
first = match.group(1)
last = match.group(2)
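Note that re.match returns None when the pattern does not match, in which case match.group(1) would raise an AttributeError. A minimal sketch of a guarded version follows; the helper name split_around_digits is hypothetical, and re.fullmatch is used on the assumption that the whole string should have the name-digits-name shape.

```python
import re

def split_around_digits(text):
    """Return (first, last) around exactly two digits, or None if no match."""
    match = re.fullmatch(r"(\D+)\d\d(\D+)", text)
    if match is None:
        return None
    return match.group(1), match.group(2)

print(split_around_digits("Salvatore32Abdulla"))
print(split_around_digits("NoDigitsHere"))
```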
Part after last digit:
match = re.search(r'\D+$', txt)
if match:
    print(match.group())
Results: Abdulla
EXPLANATION
\D+   non-digits (all but 0-9), 1 or more times (matching the most amount possible)
$     before an optional \n, and the end of the string
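The "before an optional \n" part of $ can be checked with a short experiment. In this sketch the character class [A-Za-z] is used instead of \D for the newline cases, because \D would match the newline itself; \Z is shown for contrast, since in Python it matches only at the very end of the string.

```python
import re

# \D+$ on the original input: the trailing run of non-digits.
print(re.search(r"\D+$", "Salvatore32Abdulla").group())

# $ also matches just before a single trailing newline...
with_dollar = re.search(r"[A-Za-z]+$", "Salvatore32Abdulla\n")
# ...whereas \Z only matches at the very end of the string.
with_z = re.search(r"[A-Za-z]+\Z", "Salvatore32Abdulla\n")

print(with_dollar.group())
print(with_z)
```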
I'm searching for exact course codes in a text. Codes look like this
MAT1051
CMP1401*
PHY1001*
MAT1041*
ENG1003*
So 3 or 4 uppercase letters followed by 4 digits.
I only want ones that do not end with "*" symbol.
I have tried
course_code = re.compile('[A-Z]{4}[0-9]{4}|[A-Z]{3}[0-9]{4}')
which is probably one of the worst ways to do it, but it kind of works, as I can get all the courses listed above. The issue is that I don't want the course codes ending with a "*" (failed courses have a * next to their codes) to be included in the list.
I tried adding \w or $ to the end of the expression. Whichever I add, the code returns an empty list.
If I read your requirements correctly, you want this pattern:
^[A-Z]{3,4}[0-9]{4}$
This assumes that you would be searching your entire text stored in a Python string using regex in multiline mode, q.v. this demo:
inp = """MAT1051
CMP1401*
PHY1001*
MAT1041*
ENG1003*"""
matches = re.findall(r'^[A-Z]{3,4}[0-9]{4}$', inp, flags=re.M)
print(matches) # ['MAT1051']
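The re.M flag is what makes this pattern work on multi-line input. As a quick check, the same pattern can be run with and without the flag; the names without_flag and with_flag are only illustrative.

```python
import re

inp = "MAT1051\nCMP1401*\nPHY1001*\nMAT1041*\nENG1003*"
pattern = r"^[A-Z]{3,4}[0-9]{4}$"

# Without re.M, ^ and $ only apply to the string as a whole.
without_flag = re.findall(pattern, inp)

# With re.M, ^ and $ also match at the start and end of every line.
with_flag = re.findall(pattern, inp, flags=re.M)

print(without_flag)
print(with_flag)
```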
import re
# Add a "$" at the end of the re.
# It requires the match to end after the 4 digits.
course_code = re.compile('[A-Z]{4}[0-9]{4}$|[A-Z]{3}[0-9]{4}$')
# No match here
m = re.match(course_code, "MAT1051*")
print(m)
# This matches
m = re.match(course_code, "MAT1051")
print(m)
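If the codes are embedded in running text rather than sitting on their own lines, $ is of no help. One possible alternative, sketched here under that assumption, is a negative lookahead that rejects a code followed by "*". The sample sentence is made up for illustration.

```python
import re

# Made-up sample text; codes followed by "*" should be skipped.
text = "Passed MAT1051, failed CMP1401* and PHY1001*, passed ENGL2002."

# \b keeps the match on whole codes; (?!\*) rejects a trailing asterisk.
course_code = re.compile(r"\b[A-Z]{3,4}[0-9]{4}\b(?!\*)")

passed = course_code.findall(text)
print(passed)
```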
I am currently investigating a problem where I want to replace something in a string.
For example. I have the following string:
'123.49, 19.30, 02\n'
I only want the first two numbers, like '123.49, 19.30'. Using the split function is not possible, because I have a lot of data, some with and some without the last number.
I tried something like this:
import re as regex
#result = regex.match(', (.*)\n', string)
result = re.search(', (.*)\\n', string)
print(result.group(1))
This is not working fine. Can someone help me?
Thanks in advance
You could do something like this:
reg = r'(\d+\.\d+), (\d+\.\d+).*'
# Search once and reuse the match object.
match = re.search(reg, your_text)
if match:
    first_num = match.group(1)
    second_num = match.group(2)
Alternatively, add the ^ anchor at the beginning to make sure you always take only the first two numbers.
import re
string = '123.49, 19.30, 02\n'
pattern = re.compile(r'^(\d*\.?\d*), (\d*\.?\d*)')
result = re.findall(pattern, string)
result
Output:
[('123.49', '19.30')]
In the code you are using import re as regex. If you do that, you have to use regex.search instead of re.search.
But in this case you can just use re.
If you use , (.*) you would capture all after the first occurrence of , and you are not taking digits into account.
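To make the point about , (.*) concrete, this is what the pattern from the question captures once the import is fixed. The name captured is only illustrative.

```python
import re

string = '123.49, 19.30, 02\n'

# The match starts at the first ", " and .* runs up to the newline.
result = re.search(r', (.*)\n', string)
captured = result.group(1)
print(captured)
```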
If you want the first 2 numbers as stated in the question ('123.49, 19.30'), separated by a comma, you can match them without using capture groups:
\b\d+\.\d+,\s*\d+\.\d+\b
Or matching 1 or more repetitions preceded by a comma:
\b\d+\.\d+(?:,\s*\d+\.\d+)+\b
As re.search can also return None, you can first check if there is a result (no need to run re.search twice)
import re
regex = r"\b\d+\.\d+(?:,\s*\d+\.\d+)+\b"
s = "123.49, 19.30, 02"
match = re.search(regex, s)
if match:
    print(match.group())
Output
123.49, 19.30
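Since the data contains rows both with and without the third number, it may help to wrap the match in a small helper. This is a sketch only: the function name first_two_numbers is hypothetical, and converting to float is an assumption about how the values will be used.

```python
import re

# Two decimal numbers separated by a comma at the start of the line.
FIRST_TWO = re.compile(r"\s*(\d+\.\d+),\s*(\d+\.\d+)")

def first_two_numbers(line):
    """Return the first two numbers as floats, or None if the line does not fit."""
    m = FIRST_TWO.match(line)
    if m is None:
        return None
    return float(m.group(1)), float(m.group(2))

print(first_two_numbers('123.49, 19.30, 02\n'))
print(first_two_numbers('123.49, 19.30\n'))
```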
I am trying to find all occurrences of either "_"+digit or "^"+digit, using the regex ((_\^)[1-9])
The groups I'd expect back eg for "X_2ZZZY^5" would be [('_2'), ('^5')] but instead I am getting [('_2', '_'), ('^5', '^')]
Is my regex incorrect? Or is my expectation of what gets returned incorrect?
Many thanks
** my original re used (_|\^) this was incorrect, and should have been (_\^) -- question has been amended accordingly
You have 2 groups in your regex, so you're getting 2 groups back per match. You also need to match at least 1 digit that follows.
try this:
([_\^][1-9]+)
Demand at least 1 digit (1-9) following the special characters _ or ^, placed inside a single capture group:
import re
text = "X_2ZZZY^5"
pattern = r"([_\^][1-9]{1,})"
regex = re.compile(pattern)
res = re.findall(regex, text)
print(res)
Returning:
['_2', '^5']
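The difference comes from how re.findall treats capture groups: with more than one group it returns one tuple per match. If the alternation form (_|\^) is preferred over a character class, a non-capturing group (?:...) gives the flat list as well. A short comparison, with illustrative variable names:

```python
import re

text = "X_2ZZZY^5"

# Two capture groups (outer and inner): one tuple per match.
nested = re.findall(r"((_|\^)[1-9])", text)

# Non-capturing group and no outer group: findall returns whole matches.
flat = re.findall(r"(?:_|\^)[1-9]", text)

print(nested)
print(flat)
```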
I have the following string:
ref:_00D30jPy._50038vQl5C:ref
And would like to produce the following output string:
5003800000vQl5C
The required regex actions are:
Remove all leading characters until the digit '5'.
Insert 5 zeros after the fifth character (counting from that '5').
Remove the closing ':ref'.
I initially made the following regex to match the whole string:
(ref:(\S+):ref)
How can I alter the Python RegEx to achieve the above?
Use re.sub:
import re
s = 'ref:_00D30jPy._50038vQl5C:ref'
result = re.sub(r'^[^5]*(5.{4})(.*?):ref$', r'\g<1>00000\g<2>', s, count=0, flags=re.MULTILINE)
print(result)
Output:
5003800000vQl5C
Explanation:
^[^5]*: match characters except 5 from the beginning
(5.{4}): capture the 5 and the four characters after it into group 1
(.*?):ref$: capture the remainder into group 2, except the :ref at the end
\g<1>00000\g<2>: replace the whole line with \g<1>00000\g<2>, where \g<1> and \g<2> are substituted by groups 1 and 2 respectively.
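The same substitution can be written with named groups, which some readers find easier to follow than numbered ones. This is only a sketch: the helper name expand_ref and the group names head and tail are made up for illustration.

```python
import re

# Same pattern as above, with named groups instead of numbered ones.
REF = re.compile(r"^[^5]*(?P<head>5.{4})(?P<tail>.*?):ref$")

def expand_ref(s):
    """Insert five zeros after the first five characters starting at '5'."""
    return REF.sub(r"\g<head>00000\g<tail>", s)

print(expand_ref('ref:_00D30jPy._50038vQl5C:ref'))
```

A string that does not match the pattern is returned unchanged, because re.sub only replaces what it matches.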
A regex is not required for this task. It can be achieved more simply using string slicing.
If the input strings maintain the same format and lengths you can simply do this:
s = 'ref:_00D30jPy._50038vQl5C:ref'
new = '{}00000{}'.format(s[15:20], s[20:-4])
If there is some variability then search for the first '5' in the string and slice from there:
start = s.index('5')
new = '{}00000{}'.format(s[start:start+5], s[start+5:-4])
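Note that s.index('5') raises a ValueError when there is no '5', and the [:-4] slice silently assumes the string ends with ':ref'. A slightly more defensive sketch of the slicing approach might look like this; the function name and the suffix parameter are hypothetical.

```python
def expand_ref_by_slicing(s, suffix=":ref"):
    """Slice from the first '5', drop the suffix, and insert five zeros."""
    start = s.find("5")
    # Give up if there is no '5' or the expected suffix is missing.
    if start == -1 or not s.endswith(suffix):
        return None
    body = s[start:len(s) - len(suffix)]
    return body[:5] + "00000" + body[5:]

print(expand_ref_by_slicing('ref:_00D30jPy._50038vQl5C:ref'))
```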
return poker_hand(list_of_five_cards) returns a string similar to this:
**4-Diamonds/2-Clubs/5-Hearts/4-Spades/King-Spades (One pair.)
I have created a string out of it, and I want the information inside the brackets. To that end I have tried:
s = str(poker_hand(one_man))
print s
the_search = re.search(r"\((\w+)\)", s)
and this returns None when you type print the_search. I have also tried
s[s.find("(")+1:s.find(')')]
print s
which returns the whole string. Does anyone know what I am doing wrong?
EDIT: sorry for the confusion, I should have been clearer.
input is 7-Spades/4-Clubs/3-Diamonds/3-Hearts/8-Spades (One pair.)
desired output is One pair
re the assigning... trying to assign it now, will post the results
The pattern you are using to find the item in brackets is not right.
You can test your regex at http://regexr.com/
import re
s = '**4-Diamonds/2-Clubs/5-Hearts/4-Spades/King-Spades (One pair.)'
pattern = r'\(.+\.\)'
for item in re.findall(pattern, s):
    print(item.strip('().'))
output:
One pair
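The reason the original pattern \((\w+)\) returns None is that \w matches neither the space nor the period inside the brackets. The sketch below shows this, together with one possible adjustment that allows spaces and an optional trailing period; the adjusted pattern is an illustration, not the only fix.

```python
import re

s = "7-Spades/4-Clubs/3-Diamonds/3-Hearts/8-Spades (One pair.)"

# \w+ stops at the space after "One", so the closing bracket is never reached.
original = re.search(r"\((\w+)\)", s)

# Allow spaces inside the group and an optional period before ")".
adjusted = re.search(r"\(([\w ]+)\.?\)", s)

print(original)
print(adjusted.group(1))
```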
IIUC at the end of your string you always have the closed brackets. Then try this:
'**4-Diamonds/2-Clubs/5-Hearts/4-Spades/King-Spades (One pair.)'.split('(')[1][:-1]
Out[1]: 'One pair.'
The idea is to split by the opening brackets, taking what's after, and deleting the closing brackets.
input is 7-Spades/4-Clubs/3-Diamonds/3-Hearts/8-Spades (One pair.)
desired output is One pair
You can use something like:
import re
string = "7-Spades/4-Clubs/3-Diamonds/3-Hearts/8-Spades (One pair.)"
result = re.findall(r"\((.*?)\.?\)", string )
print(result[0])
Regex Explanation:
\((.*?)\.?\)
Match the character “(” literally «\(»
Match the regex below and capture its match into backreference number 1 «(.*?)»
Match any single character that is NOT a line break character (line feed) «.*?»
Between zero and unlimited times, as few times as possible, expanding as needed (lazy) «*?»
Match the character “.” literally «\.?»
Between zero and one times, as many times as possible, giving back as needed (greedy) «?»
Match the character “)” literally «\)»
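The explanation above can also be kept next to the pattern itself by using the re.VERBOSE flag, which ignores whitespace in the pattern and allows # comments. A sketch of the same regex written that way:

```python
import re

pattern = re.compile(r"""
    \(      # a literal opening parenthesis
    (.*?)   # group 1: any characters, as few as possible
    \.?     # an optional literal period
    \)      # a literal closing parenthesis
""", re.VERBOSE)

string = "7-Spades/4-Clubs/3-Diamonds/3-Hearts/8-Spades (One pair.)"
result = pattern.findall(string)
print(result[0])
```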
Use the groups:
import re
s = '**4-Diamonds/2-Clubs/5-Hearts/4-Spades/King-Spades (One pair.)'
print (s)
m = re.search(r'\(([\s\S]+)\.\)', s)
print(m.group(1))