This question already has answers here:
re.sub replace with matched content
(4 answers)
Closed 8 years ago.
I want to mask the digits in the middle of a telephone number with a regex, but my attempt failed. Here is my code:
temp= re.sub(r'1([0-9]{1}[0-9])[0-9]{4}([0-9]{4})', repl=r'$1****$2', tel_phone)
print temp
In the output, it always shows:
$1****$2
But I want it to show 131****1234. How can I accomplish this? Thanks.
I think you're trying to replace the four middle digits (the four digits before the last four) with ****:
>>> import re
>>> s = "13111111234"
>>> temp = re.sub(r'^(1[0-9]{2})[0-9]{4}([0-9]{4})$', r'\1****\2', s)
>>> print(temp)
131****1234
You might have seen $1 in replacement strings in other languages. In Python, however, use \1 instead of $1. For correctness, you also need to include the leading 1 in the first capturing group so that the output includes it; otherwise, the leading 1 will be lost.
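As a minimal sketch of the same idea, the substitution can be wrapped in a small helper. The function name mask_phone is hypothetical, and the pattern assumes 11-digit numbers starting with 1, as in the question. It uses \g<1>, Python's explicit form of \1, which stays unambiguous if a digit ever follows the reference.

```python
import re

def mask_phone(number):
    """Mask the four digits that precede the last four digits.

    Assumes an 11-digit number starting with 1 (hypothetical helper).
    """
    # \g<1> and \g<2> are equivalent to \1 and \2 in the replacement string.
    return re.sub(r'^(1[0-9]{2})[0-9]{4}([0-9]{4})$', r'\g<1>****\g<2>', number)

print(mask_phone("13111111234"))  # 131****1234
print(mask_phone("12345"))        # no match, so the input comes back unchanged
```

When the pattern does not match, re.sub returns the input string as it is, so short or malformed numbers pass through unmasked.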
This question already has answers here:
re.findall behaves weird
(3 answers)
Closed 2 years ago.
I'm trying to find numbers in a string.
import re
text = "42 ttt 1,234 uuu 6,789,001"
finder = re.compile(r'\d{1,3}(,\d{3})*')
print(re.findall(finder, text))
It returns this:
['', ',234', ',001']
What's wrong with my regex?
How can I get ['42', '1,234', '6,789,001']?
Note: I'm getting correct result at https://regexr.com
You indicate with parentheses (...) what the groups are that should be captured by the regex.
In your case, you only capture the part after (and including) the first comma. Instead, you can capture the whole number by putting a group around everything, and make the parentheses you need for * non-capturing through an initial ?:, like so:
r'(\d{1,3}(?:,\d{3})*)'
This gives the correct result:
>>> print(re.findall(finder, text))
['42', '1,234', '6,789,001']
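To make the difference concrete, here is a small side-by-side sketch using the text from the question. The variable names are only illustrative.

```python
import re

text = "42 ttt 1,234 uuu 6,789,001"

# One capturing group: findall returns what the group captured last,
# or '' when the group did not take part in the match.
capturing = re.compile(r'\d{1,3}(,\d{3})*')
groups_only = capturing.findall(text)

# Non-capturing group: findall returns the whole match.
non_capturing = re.compile(r'\d{1,3}(?:,\d{3})*')
whole_matches = non_capturing.findall(text)

print(groups_only)    # ['', ',234', ',001']
print(whole_matches)  # ['42', '1,234', '6,789,001']
```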
You just need to change your finder like this (note that this pattern needs at least two digits per number and allows at most two commas):
finder = re.compile(r'\d+\,?\d+,?\d*')
This question already has answers here:
Python regex does not match line start
(3 answers)
Closed 2 years ago.
I was trying to extract some numbers from mail-data, here is my code:
import re
f = open('mbox-short.txt','r')
x = f.read()
z = re.findall('^X-DSPAM-Confidence: (0\.[0-9])+',x)
print(z)
But when I print the output, it comes out empty.
Here is the link to the txt file:
http://www.py4inf.com/code/mbox-short.txt
You need to add the re.MULTILINE flag in order for ^ to match at beginning of line anywhere in a string with multiple lines.
Also, move the + quantifier inside the parentheses. As written, the group matches only the first digit after the decimal point, and if the group could repeat, findall would return only its last repetition.
z = re.findall(r'^X-DSPAM-Confidence: (0\.[0-9]+)', x, re.MULTILINE)
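The effect of the flag can be seen on a short in-memory string. The sample text below is made up for illustration and only imitates the layout of the mail file.

```python
import re

# Hypothetical sample imitating a few header lines of the mbox file.
sample = (
    "From: someone@example.com\n"
    "X-DSPAM-Confidence: 0.8475\n"
    "Subject: hello\n"
    "X-DSPAM-Confidence: 0.6178\n"
)

pattern = r'^X-DSPAM-Confidence: (0\.[0-9]+)'

# Without re.MULTILINE, ^ matches only at the very start of the string.
without_flag = re.findall(pattern, sample)

# With re.MULTILINE, ^ also matches right after each newline.
with_flag = re.findall(pattern, sample, re.MULTILINE)
values = [float(v) for v in with_flag]

print(without_flag)  # []
print(with_flag)     # ['0.8475', '0.6178']
```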
This question already has answers here:
Regex exactly n OR m times
(6 answers)
Closed 4 years ago.
I'm working on a regex for a Python program that should find all the dates that appear in a text.
According to the assignment's description, the only valid date formats are the following:
"3/30/18", "3/30/2018", "3-30-2018", "03-30-2018", "30.3.2018",
"30. 3. 2018", "2018-03-30"
I created a string variable containing the valid formats and added a few to check if my code would work.
text_string = ('Examples for valid dates include "3/30/18", "3/30/2018", '
               '"3-30-2018", "03-30-2018", "30.3.2018", "30. 3. 2018", "2018-03-30", '
               '"3/30/1", "3/30/201", "/30/18", "3//18", "3/ /18", "3/30/", "3/301/18"')
and the following is the regex I came up with:
match_string = re.findall('(?:\d{1,2}/\s*\d{1,2}/\s*\d{2,4})|'
                          '(?:\d{1,2}-\s*\d{1,2}-\s*\d{2,4})|(?:\d{4}-\s*\d{1,2}-\s*\d{1,2})|'
                          '(?:\d{1,2}.\s*\d{1,2}.\s*\d{4})', text_string)
My code captures all 7 valid date formats stated above, but it also returns "3/30/201", which should not be a valid date.
I've tried to add '$' into my code, but it only messed things up more, so I'm wondering how to correct my code to fix this problem.
p.s. This is a Regex assignment, I'm not allowed to use 'datetime' T_T
The problematic part of your regex is this:
\d{2,4}
This matches 2 to 4 digits, which means 3 digits are also considered a valid year. If you replace the two occurrences of \d{2,4} with \d{2}(?:\d{2})? and add a \b after the closing parenthesis of each of those two groups, the regex works correctly:
(?:\d{1,2}/\s*\d{1,2}/\s*\d{2}(?:\d{2})?)\b|(?:\d{1,2}-\s*\d{1,2}-\s*\d{2}(?:\d{2})?)\b|(?:\d{4}-\s*\d{1,2}-\s*\d{1,2})|(?:\d{1,2}.\s*\d{1,2}.\s*\d{4})
(Don't forget to use a raw string literal to define the regex: r'(?:\d{1,2}/\s*\d{1,2}/\s*\d{2}(?:\d{2})?)\b|(?:\d{1,2}-\s*\d{1,2}-\s*\d{2}(?:\d{2})?)\b|(?:\d{4}-\s*\d{1,2}-\s*\d{1,2})|(?:\d{1,2}.\s*\d{1,2}.\s*\d{4})')
Output:
['3/30/18', '3/30/2018', '3-30-2018', '03-30-2018', '30.3.2018', '30. 3. 2018', '2018-03-30']
\d{2}(?:\d{2})? matches exactly 2 or 4 digits. The \b word boundary after it asserts that no further digit follows; without it, the regex would still match the "3/30/20" part of "3/30/201".
Lastly, the regex could be written more concisely as
\b\d{1,2}([-/]|\. ?)\d{1,2}\1\d{2}(?:\d{2})?\b|\b\d{4}-\d{2}-\d{2}\b
This uses capture groups to assert that no separators are mixed (like 3-2.2018) and that whitespace is consistent (so things like 1. 2.2018 don't match).
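One practical point when trying the concise version: because it contains a capturing group, re.findall would return the captured separator rather than the date. The sketch below, with illustrative variable names, uses re.finditer and group(0) to collect the whole matches from the question's test string.

```python
import re

text_string = (
    'Examples for valid dates include "3/30/18", "3/30/2018", '
    '"3-30-2018", "03-30-2018", "30.3.2018", "30. 3. 2018", "2018-03-30", '
    '"3/30/1", "3/30/201", "/30/18", "3//18", "3/ /18", "3/30/", "3/301/18"'
)

date_re = re.compile(
    r'\b\d{1,2}([-/]|\. ?)\d{1,2}\1\d{2}(?:\d{2})?\b'  # day/month forms; \1 repeats the separator
    r'|\b\d{4}-\d{2}-\d{2}\b'                           # ISO-style form, e.g. 2018-03-30
)

# group(0) is the whole match, regardless of any capturing groups.
dates = [m.group(0) for m in date_re.finditer(text_string)]
print(dates)
```

Run against the question's string, this should give the same seven dates as the longer regex above.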
This question already has answers here:
Remove duplicate chars using regex?
(3 answers)
Closed 6 years ago.
I found a code snippet on the web that uses a regex in Python to remove consecutive duplicate characters while keeping the first one:
import re
re.sub(r'(?s)(.)(?=.*\1)','','aabbcc') #'abc'
But it has a defect: if the string is 'aabbccaabb', it drops the first 'aa' and 'bb' entirely and produces 'cab'.
re.sub(r'(?s)(.)(?=.*\1)','','aabbccaabb') #'cab'
Is there a way to solve it by regex?
Without regex, check if previous character is the same as current, using a list comprehension with a condition and join the results:
s='aabbccaabb'
print("".join([c for i,c in enumerate(s) if i==0 or s[i-1]!=c]))
Just remove the .* in the positive lookahead.
import re
print(re.sub(r'(?s)(.)(?=\1)','','aabbcc'))
print(re.sub(r'(?s)(.)(?=\1)','','aabbccaabb'))
Output:
abc
abcab
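The reason this works: with .* in the lookahead, a character is removed whenever the same character appears anywhere later in the string, so only the last occurrence of each character survives. Without .*, a character is removed only when the very next character is identical. The sketch below wraps the fixed pattern in a hypothetical helper named squeeze and shows an alternative that replaces each run with its first character; both names are illustrative only.

```python
import re

def squeeze(s):
    # (.) captures one character; (?=\1) requires the next character to be the same.
    # Every character of a run except the last one is therefore deleted.
    return re.sub(r'(?s)(.)(?=\1)', '', s)

def squeeze_alt(s):
    # Alternative: match a whole run of two or more and replace it with one copy.
    return re.sub(r'(?s)(.)\1+', r'\1', s)

print(squeeze('aabbccaabb'))      # abcab
print(squeeze_alt('aabbccaabb'))  # abcab
```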
This question already has answers here:
Python - re.findall returns unwanted result
(4 answers)
Closed 6 years ago.
test = """1d48bac (TAIL, ticket: TAG-AB123-6, origin/master) Took example of 123
6f2c5f9 (ticket: TAG-CD456) Took example of 456
9aa5436 (ticket: TAG-EF567-3) Took example of 6789"""
I want to write a regex in Python that will extract just the tags, i.e. the output should be
[TAG-AB123-6, TAG-CD456, TAG-EF567-3]
I tried this regex
print re.findall("TAG-[A-Z]{0,9}\d{0,5}(-\d{0,2})?", test)
but this gives me
['-6', '', '-3']
What am I doing wrong?
Your optional capturing group needs to be made a non-capturing one:
>>> print(re.findall(r"TAG-[A-Z]{0,9}\d{0,5}(?:-\d{0,2})?", test))
['TAG-AB123-6', 'TAG-CD456', 'TAG-EF567-3']
When the pattern contains capturing groups, findall returns the contents of those groups instead of the whole matches. If there are no capturing groups, it returns the whole matches.
In addition, note that you can take advantage of this behaviour (the fact that re.findall returns a list of captures, if any, instead of the whole match). This lets you describe all the context around the target substring and easily extract the part you want:
>>> re.findall(r'ticket: ([^,)]*)', test)
['TAG-AB123-6', 'TAG-CD456', 'TAG-EF567-3']
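If both the whole match and the group are needed, one option is re.finditer, which yields match objects instead of strings. This is only a sketch using the original capturing pattern from the question; the variable names are illustrative.

```python
import re

test = """1d48bac (TAIL, ticket: TAG-AB123-6, origin/master) Took example of 123
6f2c5f9 (ticket: TAG-CD456) Took example of 456
9aa5436 (ticket: TAG-EF567-3) Took example of 6789"""

# Same pattern as in the question, capturing group left in place.
pattern = re.compile(r"TAG-[A-Z]{0,9}\d{0,5}(-\d{0,2})?")

# group(0) is the whole match; group(1) is the optional suffix,
# or None when the group did not participate in the match.
whole = [m.group(0) for m in pattern.finditer(test)]
suffix = [m.group(1) for m in pattern.finditer(test)]

print(whole)   # ['TAG-AB123-6', 'TAG-CD456', 'TAG-EF567-3']
print(suffix)  # ['-6', None, '-3']
```

Note that findall reports a non-participating group as an empty string, while group(1) on a match object gives None.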