How do I write a regex that excludes certain file suffixes? - python

I'm looking at the tutorial given here :-
https://docs.python.org/2/howto/regex.html#lookahead-assertions
I want to exclude files that end in .pqr.gz and I'm not quite sure how to do that.
e.g., the expected behaviour is :-
f1.gz => succeed
f1.abc.pqr => succeed
f1.pqr.gz => fail
f1.abc.gz => succeed
The best regex I could come up with was :-
r'.*[.](?=[^.]*[.][^.]*)(?!pqr[.]gz$)[^.]*[.][^.]*$'
This excludes files that end in .pqr.gz but doesn't, for example, allow files that are just f1.gz (i.e. the first case above).
Any ideas on how this can be improved?
EDIT :- There are better ways to do this (e.g., using string.endswith), but I'm curious about how to do this with a regex purely as an exercise.

Well, TBH, your use of regex seems overkill to me. You could simply do:
if '.pqr.gz' not in line:
    print(line)
and done.
Actually, "simple" string manipulation can do a lot in just a few simple operations, like:
for line in lines:
file, result = line.split(' => ')
if file.endswith('.pqr.gz'):
print("Skipping file {}".format(file), file=sys.stderr)
continue
print(file)
# and you could do something if result == "success" there after!
As you insist on doing it with regexps, here's a solution inspired by @rawing's suggestion:
.*(?<!\.pqr\.gz) =>
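A quick check of that pattern against the expected behaviour from the question; the lines list below is just the question's examples written in the "name => result" form the pattern assumes:
import re

lines = [
    'f1.gz => succeed',
    'f1.abc.pqr => succeed',
    'f1.pqr.gz => fail',
    'f1.abc.gz => succeed',
]
pattern = re.compile(r'.*(?<!\.pqr\.gz) =>')
for line in lines:
    # the negative lookbehind rejects names whose last 7 characters before " =>" are ".pqr.gz"
    print(line, '->', bool(pattern.search(line)))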

One thing to be aware of with Python's re module is that re.match implicitly anchors to the start of the string.
Also, you can match literal periods by escaping them (\.), which is probably easier to read (and potentially faster) than putting it in a character class.
For re.match the following regex should do the trick:
r'.*\.pqr\.gz$'
If using re.search instead, the regex can be shortened to just this:
r'\.pqr\.gz$'
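These patterns match the names you want to reject rather than the ones you want to keep, so presumably the test gets negated in the calling code. A minimal sketch under that assumption (the filename list is just the question's examples):
import re

reject = re.compile(r'\.pqr\.gz$')  # with re.search; use r'.*\.pqr\.gz$' with re.match

for name in ['f1.gz', 'f1.abc.pqr', 'f1.pqr.gz', 'f1.abc.gz']:
    if not reject.search(name):
        print(name)  # prints everything except f1.pqr.gz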

Related

Extract values in name=value lines with regex

I'm really sorry for asking because there are some questions like this around, but I can't adapt their answers to fix my problem.
These are the input lines (e.g. from a config file):
profile2.name=share2
profile8.name=share8
profile4.name=shareSSH
profile9.name=share9
I just want to extract the values after the = sign, using a Python 3.9 regex.
I tried this on regex101.
^profile[0-9]\.name=(.*?)
But this gives me the variable name including the = sign as the result, e.g. profile2.name=, whereas I want exactly the opposite.
The expected result (what Python's re.findall() should return) is
['share2', 'share8', 'shareSSH', 'share9']
Try the pattern profile\d+\.name=(.*); see the Regex 101 example.
import re

# txt holds the config-file text shown above
re.findall(r'profile\d+\.name=(.*)', txt)
# output
['share2', 'share8', 'shareSSH', 'share9']
But this problem doesn't necessarily need a regex; split should work absolutely fine (see the sketch below).
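The split example itself isn't shown here; a minimal sketch of the idea, splitting each line once on the = sign (the lines list is just the sample input from the question):
lines = [
    'profile2.name=share2',
    'profile8.name=share8',
    'profile4.name=shareSSH',
    'profile9.name=share9',
]

# take everything after the first '=' on each line
values = [line.split('=', 1)[1] for line in lines]
print(values)  # ['share2', 'share8', 'shareSSH', 'share9']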
Try removing the ? quantifier; it makes your capture group match an empty string (a lazy (.*?) at the end of the pattern matches as little as possible). See the regex101 example.

Python3 - Generate string matching multiple regexes, without modifying them

I would like to generate a string matching my regexes using Python 3. For this I am using the handy library rstr.
My regexes:
^[abc]+.
[a-z]+
My task:
I must find a generic way to create a string that matches both of my regexes.
What I cannot do:
Modify both regexes or join them in any way. This I consider an ineffective solution, especially in the case of incompatible regexes:
import re
import rstr

regex1 = re.compile(r'^[abc]+.')
regex2 = re.compile(r'[a-z]+')

for index in range(0, 1000):
    generated_string = rstr.xeger(regex1)
    if re.fullmatch(regex2, generated_string):
        break
else:
    raise Exception('Regexes are probably incompatible.')

print('String matching both regexes is: {}'.format(generated_string))
Is there any workaround or any magical library that can handle this? Any insights appreciated.
Questions which are seemingly similar, but not helpful in any way:
Match a line with multiple regex using Python
The asker already has the string, which he just wants to check against multiple regexes in the most elegant way. In my case we need to generate a string in a smart way that would match the regexes.
If you want a really generic way, you can't really use a brute-force approach.
What you are looking for is to create some kind of representation of the regexps (as rstr does through a call to sre_parse.py) and then call an SMT solver to satisfy both criteria.
For Haskell there is https://github.com/audreyt/regex-genex which uses the Yices SMT solver to do just that, but I doubt there is anything like this for Python. If I were you, I'd bite the bullet and call it as an external program from your Python program.
I don't know if there is something that can fulfill your needs more smoothly.
But I would do it something like this (as you've done already):
Create a regex object with the re.compile() function.
Generate a string based on the 1st regex.
Pass the string you've got to the 2nd regex object using the search() method.
If that passes, you're done; the string passed both regexes.
Maybe you can create a function that takes both regexes as parameters and tests them "2 by 2" using the same logic (see the sketch after the calls below).
And then if you have 8 regexes to match...
Just do:
call (regex1, regex2)
call (regex2, regex3)
call (regex4, regex5)
...
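A minimal sketch of that pairwise generate-and-test idea as a reusable function; the function name, the retry count, and the example patterns are made up for illustration:
import re
import rstr

def generate_matching(gen_regex, check_regex, attempts=1000):
    # keep generating from gen_regex until the result also fully matches check_regex
    for _ in range(attempts):
        candidate = rstr.xeger(gen_regex)
        if re.fullmatch(check_regex, candidate):
            return candidate
    raise ValueError('Regexes are probably incompatible.')

print(generate_matching(r'^[abc]+.', r'[a-z]+'))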
I solved this using a slightly alternative approach. Notice the second regex is basically insurance that only lowercase letters are generated in our new string.
I used Google's Python package sre_yield, which allows limiting the charset. The package is also available on PyPI. My code:
import sre_yield
import string
sre_yield.AllStrings(r'^[abc]+.', charset=string.ascii_lowercase)[0]
# returns `aa`

Python - Injecting html tags into strings based on regex match

I wrote a script in Python for a custom HTML page that finds a word within a string/line and highlights just that word with the following tags, where instance is the word being searched for.
<b><font color=\"red\">"+instance+"</font></b>
I need to find a word (case insensitive), let's say "port", within a string; it can appear as port, Port, SUPPORT, Support, support etc., which is easy enough:
pattern = re.compile(word, re.IGNORECASE)
find_all_instances = pattern.findall(string_to_search)
However my strings often contain 2 or more instances in a single line, and I need to wrap
<b><font color=\"red\">"+instance+"</font></b> around each of those instances, without changing case.
The problem with my approach is that I am attempting to iterate over each of the instances found with findall (exact match),
while multiple identical matches can also be found within the string.
for instance in find_all_instances:
    second_pattern = re.compile(instance)
    string_to_search = second_pattern.sub("<b><font color=\"red\">"+instance+"</font></b>", string_to_search)
This results in following:
<b><font color="red"><b><font color="red"><b><font color="red">Http</font></b></font></b></font></b></font>
when I need
<b><font color="red">Http</font></b>
I was thinking I would be able to avoid this if I could find out the exact part of the string that pattern.sub substitutes at the moment of doing it, however I was not able to find any examples of that kind of usage, which leads me to believe that I am doing something very wrong.
If anyone has a way to insert <b><font color="red">instance</font></b> without replacing instance for all matches (case insensitive), I would be grateful.
Maybe I'm misinterpreting your question, but wouldn't re.sub be the best option?
Example: https://repl.it/DExs
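The repl.it link may no longer resolve, so here is a minimal sketch of what that re.sub approach presumably looks like, using \g<0> to re-insert exactly the text that matched so the original case is preserved (the sample sentence is made up):
import re

string_to_search = 'The Port is open; we SUPPORT that port.'
word = 'port'

highlighted = re.sub(
    word,
    r'<b><font color="red">\g<0></font></b>',  # \g<0> is the exact text that matched
    string_to_search,
    flags=re.IGNORECASE,
)
print(highlighted)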
Okay, so here are two ways I did quickly! The second loop is definitely the way to go; it uses re.sub (as someone else commented too). Bear in mind that it replaces each match with the lowercase search term.
import re

FILE = open("testing.txt", "r")
word = "port"

# THIS LOOP IS CASE SENSITIVE
for line in FILE:
    newline = line.replace(word, "<b><font color=\"red\">" + word + "</font></b>")
    print newline

# THIS LOOP IS CASE INSENSITIVE
FILE.seek(0)  # rewind the file, otherwise the second loop has nothing left to read
for line in FILE:
    pattern = re.compile(word, re.IGNORECASE)
    newline = pattern.sub("<b><font color=\"red\">" + word + "</font></b>", line)
    print newline

Find a Quick Match Email Regular Expression

Solution: [a-zA-Z0-9.!#$%&'*+-/=?\^_`{|}~-]+@[a-zA-Z0-9-]+(?:\.[a-zA-Z0-9-]+)* is a good choice.
I am using a regular expression like the below to match email addresses in a file:
email = re.search('(\w+-*[.|\w]*)*@(\w+[.])*\w+', line)
When used on a file like the following, my regular expression works well:
mlk407289715@163.com huofenggib wrong in get_gsid
mmmmmmmmmm776@163.com rouni816161 wrong in get_gsid
But when I use it on a file like below, my regular expression runs unacceptably slowly:
9b871484d3af90c89f375e3f3fb47c41e9ff22 mingyouv9gueishao@163.com
e9b845f2fd3b49d4de775cb87bcf29cc40b72529e mlb331055662@163.com
And when I use the regular expression from this website, it still runs very slowly.
I need a solution and want to know what's wrong.
That's a problem with backtracking. Read this article for more information.
You might want to split the line and work with the part containing an @:
import re

pattern = r'(\w+-*[.|\w]*)*@(\w+[.])*\w+'
line = '9b871484d3af90c89f375e3f3fb47c41e9ff22 mingyouv9gueishao@163.com'
for element in line.split():
    if '@' in element:
        g = re.match(pattern, element)
        print g.groups()
Generally when regular expressions are slow, it is due to catastrophic backtracking. This can happen in your regex because of the nested repetition in the following section:
(\w+-*[.|\w]*)*
If you can rework this section of the regex to remove the repetition from within the parentheses, you should see a substantial speed increase.
However, you are probably better off just searching for an email regex and seeing how other people have approached this problem.
It's always a good idea to search StackOverflow to see if your question has already been discussed.
Using a regular expression to validate an email address
This one, from that discussion, looks like a good one to me:
[a-zA-Z0-9.!#$%&'*+-/=?\^_`{|}~-]+@[a-zA-Z0-9-]+(?:\.[a-zA-Z0-9-]+)*
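As a quick illustration (not part of the original answer), that recommended pattern can be dropped into re.findall against the slow file's sample lines; with no nested repetition there is no catastrophic backtracking:
import re

EMAIL = re.compile(r"[a-zA-Z0-9.!#$%&'*+-/=?\^_`{|}~-]+@[a-zA-Z0-9-]+(?:\.[a-zA-Z0-9-]+)*")

lines = [
    '9b871484d3af90c89f375e3f3fb47c41e9ff22 mingyouv9gueishao@163.com',
    'e9b845f2fd3b49d4de775cb87bcf29cc40b72529e mlb331055662@163.com',
]
for line in lines:
    print(EMAIL.findall(line))  # prints just the email address from each line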

Python Regex Question

I have an end tag (</tag1>) followed by a carriage return line feed (x0Dx0A), followed by one or more tabs (x09), followed by a new start tag (<tag2>).
Something like this:
</tag1>x0Dx0Ax09x09x09<tag2> or </tag1>x0Dx0Ax09x09x09x09x09<tag2>
What Python regex should I use to replace it with something like this:
</tag1><tag3>content</tag3><tag2>
Thanks in advance.
Here is code for something like what you say that you need:
>>> import re
>>> sample = '</tag1>\r\n\t\t\t\t<tag2>'
>>> sample
'</tag1>\r\n\t\t\t\t<tag2>'
>>> pattern = '(</tag1>)\r\n\t+(<tag2>)'
>>> replacement = r'\1<tag3>content</tag3>\2'
>>> re.sub(pattern, replacement, sample)
'</tag1><tag3>content</tag3><tag2>'
>>>
Note that \r\n\t+ may be a bit too specific, especially if production of your input is not under your control. It may be better to adopt the much more general \s* (zero or more whitespace characters).
Using regexes to parse XML and HTML is not a good idea in general ... while it's hard to see a failure mode here (apart from elementary errors in getting the pattern correct), you might like to tell us what the underlying problem is, in case some other solution is better.
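A minimal variant of the substitution above that adopts the more general \s* suggestion; same sample string, purely illustrative:
import re

sample = '</tag1>\r\n\t\t\t\t<tag2>'
# \s* tolerates any run of whitespace (or none) between the two tags
result = re.sub(r'(</tag1>)\s*(<tag2>)', r'\1<tag3>content</tag3>\2', sample)
print(result)  # </tag1><tag3>content</tag3><tag2>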
