Python Regex for a specific date format [duplicate] - python

I'm working on a regular expression for a Python program that should find all the dates that appear in a text.
According to the assignment's description, the only valid date formats are the following:
"3/30/18", "3/30/2018", "3-30-2018", "03-30-2018", "30.3.2018",
"30. 3. 2018", "2018-03-30"
I created a string variable containing the valid formats and added a few to check if my code would work.
text_string = ('Examples for valid dates include "3/30/18", "3/30/2018", '
               '"3-30-2018", "03-30-2018", "30.3.2018", "30. 3. 2018", "2018-03-30", '
               '"3/30/1", "3/30/201", "/30/18", "3//18", "3/ /18", "3/30/", "3/301/18"')
and the following is the regex I came up with:
match_string = re.findall('(?:\d{1,2}/\s*\d{1,2}/\s*\d{2,4})|'
                          '(?:\d{1,2}-\s*\d{1,2}-\s*\d{2,4})|(?:\d{4}-\s*\d{1,2}-\s*\d{1,2})|'
                          '(?:\d{1,2}.\s*\d{1,2}.\s*\d{4})', text_string)
My code captures all 7 valid date formats listed above, but it also returns "3/30/201", which is not a valid date.
I've tried adding '$' to my regex, but it only made things worse, so I'm wondering how to correct my code to fix this problem.
P.S. This is a regex assignment, so I'm not allowed to use 'datetime' T_T

The problematic part of your regex is this:
\d{2,4}
This matches 2 to 4 digits, which means a 3-digit year is also accepted. If you replace the two occurrences of \d{2,4} with \d{2}(?:\d{2})? and add a \b after the closing parenthesis of those two alternatives, the regex works correctly:
(?:\d{1,2}/\s*\d{1,2}/\s*\d{2}(?:\d{2})?)\b|(?:\d{1,2}-\s*\d{1,2}-\s*\d{2}(?:\d{2})?)\b|(?:\d{4}-\s*\d{1,2}-\s*\d{1,2})|(?:\d{1,2}.\s*\d{1,2}.\s*\d{4})
(Don't forget to use a raw string literal to define the regex: r'(?:\d{1,2}/\s*\d{1,2}/\s*\d{2}(?:\d{2})?)\b|(?:\d{1,2}-\s*\d{1,2}-\s*\d{2}(?:\d{2})?)\b|(?:\d{4}-\s*\d{1,2}-\s*\d{1,2})|(?:\d{1,2}.\s*\d{1,2}.\s*\d{4})')
Output:
['3/30/18', '3/30/2018', '3-30-2018', '03-30-2018', '30.3.2018', '30. 3. 2018', '2018-03-30']
\d{2}(?:\d{2})? matches exactly 2 or 4 digits. The \b boundary after it asserts that no further digit follows; without it, the regex would still match the leading "3/30/20" of "3/30/201".
Lastly, the regex could be written more concisely as
\b\d{1,2}([-/]|\. ?)\d{1,2}\1\d{2}(?:\d{2})?\b|\b\d{4}-\d{2}-\d{2}\b
This uses a capture group and the backreference \1 to assert that separators are not mixed (like 3-2.2018) and that whitespace is consistent (so things like 1. 2.2018 don't match).
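One caveat when trying the concise pattern in Python: it contains a capturing group, so re.findall would return only the separator captured by that group rather than the whole date. Here is a minimal sketch that sidesteps this with re.finditer, assuming sample text modelled on the question (CONCISE_DATE, find_dates and sample are illustrative names, not part of the original answer):

```python
import re

# Concise pattern from the answer above; group 1 holds the separator.
CONCISE_DATE = re.compile(
    r'\b\d{1,2}([-/]|\. ?)\d{1,2}\1\d{2}(?:\d{2})?\b'
    r'|\b\d{4}-\d{2}-\d{2}\b'
)

def find_dates(text):
    """Return the full text of every date match (hypothetical helper)."""
    # group(0) is always the whole match, regardless of capturing groups.
    return [m.group(0) for m in CONCISE_DATE.finditer(text)]

# Valid and invalid samples from the question, plus the two mixed-separator
# cases mentioned in the answer.
sample = ('"3/30/18", "3/30/2018", "3-30-2018", "03-30-2018", "30.3.2018", '
          '"30. 3. 2018", "2018-03-30", "3/30/1", "3/30/201", "/30/18", '
          '"3//18", "3/ /18", "3/30/", "3/301/18", "3-2.2018", "1. 2.2018"')

print(find_dates(sample))
```

Only the seven valid dates should be printed. Note that, as written, the concise pattern also accepts two-digit years with dot separators (for example 30.3.18), which the assignment's list does not mention.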


why python regex is not finding numbers? [duplicate]

I'm trying to find numbers in a string.
import re
text = "42 ttt 1,234 uuu 6,789,001"
finder = re.compile(r'\d{1,3}(,\d{3})*')
print(re.findall(finder, text))
It returns this:
['', ',234', ',001']
What's wrong with my regex?
How can I get ['42', '1,234', '6,789,001']?
Note: I'm getting correct result at https://regexr.com
Parentheses (...) mark the groups to be captured by the regex, and when a pattern contains a capturing group, re.findall returns the group's contents instead of the whole match.
In your case, the only group is (,\d{3}), so you only get the last comma-plus-three-digits chunk of each number (or an empty string when there is none). Instead, you can capture the whole number by putting a group around everything, and make the parentheses you need for * non-capturing through an initial ?:, like so:
r'(\d{1,3}(?:,\d{3})*)'
This gives the correct result:
>>> print(re.findall(finder, text))
['42', '1,234', '6,789,001']
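To see the difference side by side, here is a small sketch using the text from the question. It drops the outer group from the answer's pattern (a variation, not the answer's exact regex) and also shows re.finditer as an alternative that works even when the capturing group is kept:

```python
import re

text = "42 ttt 1,234 uuu 6,789,001"

# Capturing group: findall returns only what the group captured
# (its last repetition, or '' when the group did not take part).
captured = re.findall(r'\d{1,3}(,\d{3})*', text)

# Non-capturing group: findall returns the whole match.
whole = re.findall(r'\d{1,3}(?:,\d{3})*', text)

# finditer works with either pattern, since group(0) is the full match.
via_finditer = [m.group(0) for m in re.finditer(r'\d{1,3}(,\d{3})*', text)]

print(captured)
print(whole)
print(via_finditer)
```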
You just need to change your finder like this (note that this pattern requires at least two digits per number and handles at most two commas):
finder = re.compile(r'\d+\,?\d+,?\d*')

regex extraction with comma and thousand separators of various sizes [duplicate]

I am wondering what a regular expression for testing the correct format of a number for the German culture would look like.
In German, the comma is used as the decimal mark and the dot is used to separate thousands.
Therefore:
1.000 equals 1000
1,000 equals 1
1.000,89 equals 1000.89
1.000.123.456,89 equals 1000123456.89
The real trick, it seems to me, is to make sure that there can be several dots, optionally followed by a comma separator.
This is the regex I would use:
^-?\d{1,3}(?:\.\d{3})*(?:,\d+)?$
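For readers who want to try the pattern, here is a minimal Python sketch (the question does not name a language, so Python is an assumption, and GERMAN_NUMBER and is_german_number are illustrative names). It uses re.fullmatch, which anchors both ends and makes the explicit ^ and $ unnecessary:

```python
import re

# Pattern from the answer above, without ^ and $ because fullmatch()
# already requires the whole string to match.
GERMAN_NUMBER = re.compile(r'-?\d{1,3}(?:\.\d{3})*(?:,\d+)?')

def is_german_number(s):
    """Return True if s is a number with German thousands/decimal marks."""
    return GERMAN_NUMBER.fullmatch(s) is not None

for candidate in ['1.000', '1,000', '1.000,89', '1.000.123.456,89',
                  '1000', '1.00', '1.000.']:
    print(candidate, is_german_number(candidate))
```

Note that the pattern rejects '1000': digits beyond the first three must be grouped with dots. Whether ungrouped input should be accepted depends on your use case.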
And this is a code example that interprets it as a floating-point number (notice the parseFloat() around the string replacements).
Edit: as mentioned in Severin Klug's answer, the code below assumes that the numbers are known to be in German format. Attempting to "detect" whether a string contains a German-format or US-format number is not trivial and is out of scope for this question. '1.234' is valid in both formats but with different values; without context it is impossible to know for sure which format was meant.
var numbers = ['1.000', '1,000', '1.000,89', '1.000.123.456,89'];
document.getElementById('out').value=numbers.map(function(str) {
return parseFloat(str.replace(/\./g, '').replace(',', '.'));
}).join('\n');
<textarea id="out" rows="10" style="width:100%"></textarea>
I would have posted this as a comment, but I don't have enough reputation.
@funkwurm, your post https://stackoverflow.com/a/28361329/7329611 contains this JavaScript:
var numbers = ['1.000', '1,000', '1.000,89', '1.000.123.456,89', '1.2'];
numbers.map(function(str) {
return parseFloat(str.replace(/\./g, '').replace(',', '.'));
}).join('\n');
which should convert German numbers to English/international ones. It does so for every number with exactly three digits after a German thousands dot, like the numbers in your example array. BUT, and this is the critical use-case error: it also deletes the dots from any other string, even when a dot is not followed by three digits.
So if you insert a string like '1.2' it returns 12, and if you insert '1.23' it returns 123.
This is very critical behaviour if anyone just takes the above code snippet and thinks it will convert any given number correctly into an English one, because numbers that are already in English format will be corrupted! So be careful, please.
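One way to avoid that pitfall is to validate the format before converting. The following is only a sketch in Python (not the JavaScript of the original post); parse_german_number is a hypothetical helper, and it reuses the validation pattern from the accepted answer. It returns a Decimal rather than a float to avoid rounding surprises:

```python
import re
from decimal import Decimal

# Same shape as the validation regex given earlier in this thread.
GERMAN_NUMBER = re.compile(r'-?\d{1,3}(?:\.\d{3})*(?:,\d+)?')

def parse_german_number(s):
    """Convert a German-formatted number, refusing anything else."""
    if GERMAN_NUMBER.fullmatch(s) is None:
        # Strings such as '1.2' land here instead of silently becoming 12.
        raise ValueError(f'not a German-formatted number: {s!r}')
    # Drop thousands dots, then turn the decimal comma into a dot.
    return Decimal(s.replace('.', '').replace(',', '.'))

print(parse_german_number('1.000.123.456,89'))
```

This still cannot tell whether '1.234' was meant as German or English; it only rejects strings that cannot be German at all.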
This regex should work:
([0-9]{1,3}(?:\.[0-9]{3})*(?:\,[0-9]+)?)
A good regex would be something like this:
Regex regex = new Regex(@"-?\d{1,3}(?:\.\d{3})*(?:,\d+)?");
Match match = regex.Match(input);
Decimal result = Decimal.Zero;
if (match.Success)
result = Decimal.Parse(match.Value, new CultureInfo("de-DE"));
The result is the German number as a parsed value.
Try this; it will match your inputs:
^(\d+\.)*\d+(,\d+)?
This regex would work for positive numbers:
/^[0-9]{0,3}(\.[0-9]{3})*(,[0-9]{0,2})?$/
Breakdown
[0-9]{0,3} - this section allows zero to three digits. An empty value is valid; '1', '26', '789' are valid. '1589' is invalid.
(\.[0-9]{3})* - this section allows zero or more dots. If there's a dot, there must be exactly three digits after it. '2.589' is valid. '2.5896' and '2.45' are invalid.
(,[0-9]{0,2})? - this section allows zero or one comma, with zero to two digits after the comma. '25,', '25,5', '25,45' are valid. '25,456' and '25,45,8' are invalid.
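The examples in the breakdown can be checked mechanically. A small sketch in Python (an assumption, since the answer uses /.../ delimiters from another language; PATTERN, valid and invalid are illustrative names):

```python
import re

# Same pattern as above; fullmatch() replaces the ^ and $ anchors.
PATTERN = re.compile(r'[0-9]{0,3}(\.[0-9]{3})*(,[0-9]{0,2})?')

valid = ['', '1', '26', '789', '2.589', '25,', '25,5', '25,45']
invalid = ['1589', '2.5896', '2.45', '25,456', '25,45,8']

for s in valid + invalid:
    print(repr(s), bool(PATTERN.fullmatch(s)))
```

Because the leading part allows zero digits, the pattern also accepts the empty string and inputs such as '.589'; tighten it to [0-9]{1,3} if that is not wanted.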
Hope this is helpful

REGex in python does not extract right [duplicate]

test = """1d48bac (TAIL, ticket: TAG-AB123-6, origin/master) Took example of 123
6f2c5f9 (ticket: TAG-CD456) Took example of 456
9aa5436 (ticket: TAG-EF567-3) Took example of 6789"""
I want to write a regex in Python that will extract just the tags, i.e. the output should be
[TAG-AB123-6, TAG-CD456, TAG-EF567-3]
I tried this regex
print(re.findall(r"TAG-[A-Z]{0,9}\d{0,5}(-\d{0,2})?", test))
but this gives me
['-6', '', '-3']
What am I doing wrong?
Your optional capturing group needs to be made a non-capturing one:
>>> print(re.findall(r"TAG-[A-Z]{0,9}\d{0,5}(?:-\d{0,2})?", test))
['TAG-AB123-6', 'TAG-CD456', 'TAG-EF567-3']
When the pattern contains capturing groups, findall returns the contents of those groups. If there are no capturing groups, it returns the full matches.
In addition, note that you can also take advantage of this behaviour (the fact that re.findall returns a list of captures, if any, instead of the whole match). This lets you describe the context around the target substring and easily extract the part you want:
>>> re.findall(r'ticket: ([^,)]*)', test)
['TAG-AB123-6', 'TAG-CD456', 'TAG-EF567-3']
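To make the rule concrete, here is a short sketch of what re.findall returns for zero, one, and two capturing groups on the question's data. The two-group pattern is an illustrative variation, not something from the answers:

```python
import re

test = """1d48bac (TAIL, ticket: TAG-AB123-6, origin/master) Took example of 123
6f2c5f9 (ticket: TAG-CD456) Took example of 456
9aa5436 (ticket: TAG-EF567-3) Took example of 6789"""

# No capturing group: a list of whole matches.
no_groups = re.findall(r"TAG-[A-Z]{0,9}\d{0,5}(?:-\d{0,2})?", test)

# One capturing group: a list of that group's contents only.
one_group = re.findall(r"TAG-[A-Z]{0,9}\d{0,5}(-\d{0,2})?", test)

# Two capturing groups: a list of tuples, one entry per group.
two_groups = re.findall(r"(TAG-[A-Z]{0,9}\d{0,5})(-\d{0,2})?", test)

print(no_groups)
print(one_group)
print(two_groups)
```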

Python: re module to replace digits of telephone with asterisk [duplicate]

I want to replace the digits in the middle of a telephone number using a regex, but failed. Here is my code:
temp = re.sub(r'1([0-9]{1}[0-9])[0-9]{4}([0-9]{4})', r'$1****$2', tel_phone)
print(temp)
In the output, it always shows:
$1****$2
But I want it to show like this: 131****1234. How can I accomplish this? Thanks
I think you're trying to replace four digits present in the middle (four digits present before the last four digits) with ****
>>> s = "13111111234"
>>> temp= re.sub(r'^(1[0-9]{2})[0-9]{4}([0-9]{4})$', r'\1****\2', s)
>>> print(temp)
131****1234
You might have seen $1 in replacement strings in other languages. In Python, however, use \1 instead of $1. For correctness, you also need to include the starting 1 in the first capturing group so that the output also includes it; otherwise, the starting 1 will be lost.
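As a self-contained sketch of the same idea (PHONE and mask_phone are illustrative names), the version below uses the \g<1> form of the backreference, which Python's re module also accepts and which stays unambiguous if a digit ever follows the reference directly:

```python
import re

# Same pattern as in the answer: keep the first three and last four digits.
PHONE = re.compile(r'^(1[0-9]{2})[0-9]{4}([0-9]{4})$')

def mask_phone(number):
    """Mask the middle four digits; non-matching input is returned as is."""
    return PHONE.sub(r'\g<1>****\g<2>', number)

print(mask_phone("13111111234"))

# '$1' has no special meaning in Python replacement strings,
# so it is inserted literally.
print(PHONE.sub(r'$1****$2', "13111111234"))
```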

python: regular expressions, how to match a string of undefined length which has a structure and finishes with a specific group

I need to create a regexp to match strings like this: 999-123-222-...-22
The string may or may not end with &Ns=(any number). So valid strings for me are
999-123-222-...-22
999-123-222-...-22&Ns=12
999-123-222-...-22&Ns=12
And following are not valid:
999-123-222-...-22&N=1
I have been testing this for several hours already but did not manage to solve it; I really need some help.
Not sure if you want to literally match 999-123-222-...-22 or if that can be any sequence of numbers/dashes. Here are two different regexes:
/^[\d-]+(&Ns=\d+)?$/
/^999-123-222-\.\.\.-22(&Ns=\d+)?$/
The key idea is the (&Ns=\d+)?$ part, which matches an optional &Ns=<digits>, and is anchored to the end of the string with $.
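In Python the surrounding /.../ delimiters are dropped, since they are not part of the regex syntax there. A minimal sketch of both patterns (the constant and function names are illustrative):

```python
import re

# The two patterns from the answer, written as Python raw strings.
ANY_DIGITS_AND_DASHES = re.compile(r'^[\d-]+(&Ns=\d+)?$')
LITERAL_ELLIPSIS = re.compile(r'^999-123-222-\.\.\.-22(&Ns=\d+)?$')

def is_valid(pattern, s):
    """Return True if s matches the anchored pattern."""
    return pattern.search(s) is not None

for s in ['999-123-222-...-22',
          '999-123-222-...-22&Ns=12',
          '999-123-222-...-22&N=1']:
    print(s, is_valid(LITERAL_ELLIPSIS, s))
```

The first pattern only fits if the "..." in the question stands for more digit groups; a literal "..." is rejected by it because the dot is not in the character class.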
If you just want to allow the strings 999-123-222-...-22 and 999-123-222-...-22&Ns=12, you are better off using a string function.
If you want to allow any numbers between - you can use the regex:
^(\d+-){3}[.]{3}-\d+(&Ns=\d+)?$
If the numbers must be of only 3 digits and the last number of only 2 digits you can use:
^(\d{3}-){3}[.]{3}-\d{2}(&Ns=\d{2})?$
This looks like a phone number with extension information.
Why not make things simpler for yourself (and anyone who has to read this later) and split the input rather than use a complicated regex?
s = '999-123-222-...-22&Ns=12'
parts = s.split('&Ns=') # splits on Ns and removes it
If the piece before the "&" is a phone number, you could do another split and get the area code etc into separate fields, like so:
phone_parts = parts[0].split('-') # breaks up the digit string and removes the '-'
area_code = phone_parts[0]
The portion found after the optional '&Ns=' can be checked to see if it is numeric with the string method isdigit, which returns True if all characters in the string are digits and there is at least one character, and False otherwise.
if len(parts) > 1:
    extra_digits_ok = parts[1].isdigit()
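One thing to watch with split('&Ns='): a string ending in '&N=1' contains no '&Ns=' at all, so it comes back as a single part and the bad suffix goes unnoticed. A possible variation, sketched here with str.partition on '&' (check_suffix is a hypothetical helper, not part of the answer), makes that case visible:

```python
def check_suffix(s):
    """Split s at the first '&' and report whether the suffix is acceptable.

    Returns (head, ok): ok is True when there is no suffix at all, or when
    the suffix is 'Ns=' followed by at least one digit and nothing else.
    """
    head, sep, tail = s.partition('&')
    if not sep:
        # No '&' present, so there is no suffix to validate.
        return head, True
    ok = tail.startswith('Ns=') and tail[3:].isdigit()
    return head, ok

for s in ['999-123-222-...-22',
          '999-123-222-...-22&Ns=12',
          '999-123-222-...-22&N=1']:
    print(check_suffix(s))
```

This only checks the suffix; the part before the '&' would still need its own validation.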
