Simple string-to-wildcard comparison in Python - python

I've been looking for a while but failed to find a simple solution for problems like these:
pattern = '20*_*_*'
compare('2023_01_01', pattern)
>>> True
compare('1999_01_01', pattern)
>>> False
I know how to do it with regex, but would like to know if there's an easier and more readable way to do it.

Sounds like the perfect use case for fnmatch:
import fnmatch
pattern = '20*_*_*'
fnmatch.fnmatch('2023_01_01', pattern)
>>> True
fnmatch.fnmatch('1999_01_01', pattern)
>>> False
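Note that fnmatch.fnmatch() normalizes case with os.path.normcase(), so it is case-insensitive on Windows. If you need comparisons that are always case-sensitive, use fnmatch.fnmatchcase() instead of fnmatch.fnmatch().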
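As a minimal sketch, the compare() helper from the question could be built on fnmatchcase(). The names compare, names and matching are only illustrative; fnmatch.filter() is the standard-library call for applying one pattern to a whole list.

```python
import fnmatch

def compare(text, pattern):
    """Return True if text matches the shell-style wildcard pattern."""
    # fnmatchcase() never normalizes case, so it behaves the same on every OS.
    return fnmatch.fnmatchcase(text, pattern)

names = ['2023_01_01', '1999_01_01', '2001_12_31']

# fnmatch.filter() keeps only the names that match the pattern.
matching = fnmatch.filter(names, '20*_*_*')

print(compare('2023_01_01', '20*_*_*'))
print(compare('1999_01_01', '20*_*_*'))
print(matching)
```

Besides *, these patterns also support ? (any single character) and [seq] (any character in seq).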

Related

How to check if a given pathname matches a given regular expression in Python

I'm trying to figure out a way to compare each directory path against a given regular expression to find out whether it matches that pattern.
I have the following list of paths
C:\Dir
C:\Dir\data
C:\Dir\data\file1
C:\Dir\data\file2
C:\Dir\data\match1\file1
C:\Dir\data\match1\file2
I only want to print those paths that match the following pattern
where "*" can replace zero or more directory levels and match1 can be either the name of a file or directory.
C:\Dir\*\match1
I figured that re.match() would help with this, but I'm having a hard time defining the pattern, and the one I came up with (pasted below) doesn't work at all. item contains the path in quotes.
re.match("((C:\\)(Dir)\\(.*)\\(match1))",item)
Can someone please help me out with this task?
You could go for:
^C:\\Dir\\.+?match1.*
See a demo on regex101.com.
In Python, this would be:
import re
rx = re.compile(r'C:\\Dir\\.+?match1.*')
files = [r'C:\Dir', r'C:\Dir\data', r'C:\Dir\data\file1', r'C:\Dir\data\file2', r'C:\Dir\data\match1\file1', r'C:\Dir\data\match1\file2']
filtered = [match.group(0)
            for file in files
            for match in [rx.match(file)]
            if match]
print(filtered)
Or, if you like filter() and a lambda:
filtered = list(filter(lambda x: rx.match(x), files))
Your regexp is:
^C:\\Dir\\.*match1
Explanation is:
C:\\Dir\\ is start sub string of your path
.* any other symbols in path
match1 explicit name of something that goes after (file or dir)
Since I don't yet have the reputation to comment, I'll remark here.
The solution proposed by @Jan works for the particular list of paths in the question, but has a few problems if applied as a general solution. Suppose the list of paths is as follows:
>>> print(paths)
C:\Dir
C:\Dir\data
C:\Dir\match1
C:\Dir\data\file1
C:\Dir\data\match1\file1
C:\Dir\data\match1\file2
C:\Dir\data\abcmatch1def\file3
C:\Dir\data\file1\match12
C:\Dir\data\file1\match1
>>>
The pattern r'C:\\Dir\\.+?match1.*' fails to match "C:\Dir\match1" and produces false positives, namely "C:\Dir\data\abcmatch1def\file3" and "C:\Dir\data\file1\match12".
Proposed solution:
>>> import re
>>> for line in paths.splitlines():
...     if re.match(r"C:\\Dir.*\\match1(\\|$)", line):
...         print(line)
...
C:\Dir\match1
C:\Dir\data\match1\file1
C:\Dir\data\match1\file2
C:\Dir\data\file1\match1
>>>
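As an alternative sketch that avoids escaping backslashes, the same "match1 as a whole path component below C:\Dir" test can be written with pathlib. This is not what the answers above use. The function name has_component and its parameters are made up for illustration, and the comparison is case-sensitive even though Windows paths usually are not.

```python
from pathlib import PureWindowsPath

def has_component(path, base=r'C:\Dir', name='match1'):
    """True if path lies under base and has name as a whole component below it."""
    parts = PureWindowsPath(path).parts        # e.g. ('C:\\', 'Dir', 'data', 'match1')
    base_parts = PureWindowsPath(base).parts   # ('C:\\', 'Dir')
    under_base = parts[:len(base_parts)] == base_parts
    # Whole components are compared, so 'abcmatch1def' and 'match12' are rejected.
    return under_base and name in parts[len(base_parts):]

paths = [
    r'C:\Dir',
    r'C:\Dir\data',
    r'C:\Dir\match1',
    r'C:\Dir\data\file1',
    r'C:\Dir\data\match1\file1',
    r'C:\Dir\data\match1\file2',
    r'C:\Dir\data\abcmatch1def\file3',
    r'C:\Dir\data\file1\match12',
    r'C:\Dir\data\file1\match1',
]

selected = [p for p in paths if has_component(p)]
for p in selected:
    print(p)
```

PureWindowsPath only parses the string and never touches the file system, so the sketch also runs on non-Windows machines.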

Is there a simple way to switch between using and ignoring metacharacters in Python regular expressions?

Is there a way of toggling the compilation or use of metacharacters when compiling regexes? The current code looks like this:
import re
the_value = '192.168.1.1'
the_regex = re.compile(the_value)
my_collection = ['192a168b1c1', '192.168.1.1']
my_collection.find_matching(the_regex)
result = ['192a168b1c1', '192.168.1.1']
The ideal solution would look like:
import re
the_value = '192.168.1.1'
the_regex = re.compile(the_value, use_metacharacters=False)
my_collection = ['192a168b1c1', '192.168.1.1']
my_collection.find_matching(the_regex)
result = ['192.168.1.1']
The ideal solution would let the re library handle the disabling of metacharacters, to avoid having to get involved in the process as much as possible.
Nope. However:
the_regex = re.compile(re.escape(the_value))
Use the re.escape() function for this.
Return string with all non-alphanumerics backslashed; this is useful if you want to match an arbitrary literal string that may have regular expression metacharacters in it.
>>> import re
>>> re.escape('192.168.1.1')
'192\\.168\\.1\\.1'
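Applied to the collection from the question, a minimal sketch could look like this. find_matching is not a real list method, so a list comprehension stands in for it; the names literal and result are illustrative.

```python
import re

the_value = '192.168.1.1'
my_collection = ['192a168b1c1', '192.168.1.1']

# Unescaped, each '.' matches any character, so both entries match.
loose = re.compile(the_value)
loose_result = [s for s in my_collection if loose.search(s)]

# re.escape() backslashes the dots so they only match a literal '.'.
literal = re.compile(re.escape(the_value))
result = [s for s in my_collection if literal.search(s)]

print(loose_result)
print(result)
```

If the value is only ever used as a literal, a plain substring test such as the_value in s gives the same result without the re module.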

How to extract a substring from inside a string?

Let's say I have a string /Apath1/Bpath2/Cpath3/0-1-2-3-4-5-something.otherthing. I want to extract just the '0-1-2-3-4-5' part. I tried this:
str='/Apath1/Bpath2/Cpath3/0-1-2-3-4-5-something.otherhing'
print(str[str.find("-")-1:str.find("-")])
But the result is only 0. How do I extract just the '0-1-2-3-4-5' part?
Use os.path.basename and rsplit:
>>> from os.path import basename
>>> name = '/Apath1/Bpath2/Cpath3/0-1-2-3-4-5-something.otherhing'
>>> number, tail = basename(name).rsplit('-', 1)
>>> number
'0-1-2-3-4-5'
You're almost there:
str='/Apath1/Bpath2/Cpath3/0-1-2-3-4-5-something.otherhing'
print(str[str.find("-")-1:str.rfind("-")])
rfind searches from the end. The slice above assumes that no dashes appear anywhere else in the path. If they can, do this instead:
import os
str='/Apath1/Bpath2/Cpath3/0-1-2-3-4-5-something.otherhing'
str = os.path.basename(str)
print(str[str.find("-")-1:str.rfind("-")])
basename will grab the filename, excluding the rest of the path. That's probably what you want.
Edit:
As pointed out by @bradley.ayers, this breaks down when the filename isn't exactly as described in the question. Since we're using basename, we can omit the beginning index:
print(str[:str.rfind("-")])
This would parse '/Apath1/Bpath2/Cpath3/10-1-2-3-4-5-something.otherhing' as '10-1-2-3-4-5'.
This works:
>>> str='/Apath1/Bpath2/Cpath3/0-1-2-3-4-5-something.otherhing'
>>> str.split('/')[-1].rsplit('-', 1)[0]
'0-1-2-3-4-5'
This assumes that what you want is just what's between the last '/' and the last '-'. The other suggestions with os.path might make better sense (as long as there is no OS confusion over what a proper path looks like).
you could use re:
>>> import re
>>> ss = '/Apath1/Bpath2/Cpath3/0-1-2-3-4-5-something.otherhing'
>>> re.search(r'(?:\d-)+\d',ss).group(0)
'0-1-2-3-4-5'
While slightly more complicated, it seems like a solution similar to this might be slightly more robust...
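The basename-based answers above can be folded into one small helper. This is only a sketch: the name leading_part is made up, and returning the whole filename when it contains no dash is an assumption about the desired behavior.

```python
import os.path

def leading_part(path):
    """Return the filename up to, but not including, its last '-'."""
    filename = os.path.basename(path)   # drops the directories, dashes included
    head, sep, _tail = filename.rpartition('-')
    # rpartition() returns an empty separator when there is no '-' at all.
    return head if sep else filename

print(leading_part('/Apath1/Bpath2/Cpath3/0-1-2-3-4-5-something.otherhing'))
print(leading_part('/A-path/10-1-2-3-4-5-something.otherhing'))
```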

Python Regex, re.sub, replacing multiple parts of pattern?

I can't seem to find a good resource on this. I am trying to do a simple replace with re.
I want to replace the part matched by (.*?), but can't figure out the syntax. I know how to do it in PHP, so I've been experimenting based on that (which is why the code has $1, though I know that isn't correct in Python). I would appreciate it if anyone could show the proper syntax. I'm not asking about any specific string, just how to replace something like this, including when the pattern has more than one () group. Thanks.
originalstring = 'fksf var:asfkj;'
pattern = '.*?var:(.*?);'
replacement_string='$1' + 'test'
replaced = re.sub(re.compile(pattern, re.MULTILINE), replacement_string, originalstring)
>>> import re
>>> originalstring = 'fksf var:asfkj;'
>>> pattern = '.*?var:(.*?);'
>>> pattern_obj = re.compile(pattern, re.MULTILINE)
>>> replacement_string="\\1" + 'test'
>>> pattern_obj.sub(replacement_string, originalstring)
'asfkjtest'
Edit: The Python docs can be a pretty useful reference.
>>> import re
>>> regex = re.compile(r".*?var:(.*?);")
>>> regex.sub(r"\1test", "fksf var:asfkj;")
'asfkjtest'
The Python docs are online; the page for the re module is at http://docs.python.org/library/re.html
To answer your question though, Python uses \1 rather than $1 to refer to matched groups.
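For patterns with more than one group, the same backreference syntax extends naturally. The sketch below uses made-up patterns around the question's sample string to show numbered groups, named groups, and the \g<1> form, which is needed when a digit follows the reference.

```python
import re

text = 'fksf var:asfkj;'

# Two numbered groups, swapped in the replacement.
swapped = re.sub(r'(\w+) var:(.*?);', r'\2 = \1', text)

# Named groups are referenced with \g<name>.
named = re.sub(r'(?P<name>\w+) var:(?P<value>.*?);', r'\g<value>test', text)

# \g<1>0 means "group 1, then a literal 0"; \10 would be read as group 10.
suffixed = re.sub(r'var:(.*?);', r'var:\g<1>0;', text)

print(swapped)
print(named)
print(suffixed)
```

Raw strings (r'...') keep the backslashes from being interpreted by Python before re sees them.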

Python Regex Question

I have an end tag followed by a carriage return and line feed (x0Dx0A), followed by one or more tabs (x09), followed by a new start tag.
Something like this:
</tag1>x0Dx0Ax09x09x09<tag2> or </tag1>x0Dx0Ax09x09x09x09x09<tag2>
What Python regex should I use to replace it with something like this:
</tag1><tag3>content</tag3><tag2>
Thanks in advance.
Here is code for something like what you say that you need:
>>> import re
>>> sample = '</tag1>\r\n\t\t\t\t<tag2>'
>>> sample
'</tag1>\r\n\t\t\t\t<tag2>'
>>> pattern = '(</tag1>)\r\n\t+(<tag2>)'
>>> replacement = r'\1<tag3>content</tag3>\2'
>>> re.sub(pattern, replacement, sample)
'</tag1><tag3>content</tag3><tag2>'
>>>
Note that \r\n\t+ may be a bit too specific, especially if production of your input is not under your control. It may be better to adopt the much more general \s* (zero or more whitespace characters).
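A sketch of the more general \s* variant, using the same tag names as above. The sample strings are invented, and note that \s* also matches when there is no whitespace at all between the tags.

```python
import re

# \s* accepts any run of whitespace (including none) between the two tags.
pattern = re.compile(r'(</tag1>)\s*(<tag2>)')
replacement = r'\1<tag3>content</tag3>\2'

samples = [
    '</tag1>\r\n\t\t\t<tag2>',
    '</tag1>\n    <tag2>',
    '</tag1><tag2>',
]

results = [pattern.sub(replacement, s) for s in samples]
for r in results:
    print(repr(r))
```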
Using regexes to parse XML and HTML is not a good idea in general ... while it's hard to see a failure mode here (apart from elementary errors in getting the pattern correct), you might like to tell us what the underlying problem is, in case some other solution is better.
