I'd like to create a regex that contains comments and a variable. I thought I'd split up the string like so, but it doesn't work:
import re
regex = re.compile(r'''
^(sm\d{5}-[a-z]+-\d{2}) # study number''' +
doctype + r'''# document
v(\d+)-(\d+) # version number
\.pdf$ # pdf extension
''', re.VERBOSE)
Break your regex pattern into multiple strings, then combine them into a single string with "".join(), like so:
import re
pattern = "".join([
"^(sm\d{5}-[a-z]+-\d{2})", # study number
doctype, # document
"v(\d+)-(\d+)", # version number
"\.pdf$", # pdf extension
])
regex = re.compile(pattern, re.VERBOSE)
To avoid the need for comments, you might use descriptive variable names for each section of the regex. Doing it this way, it might also make sense to separate the anchors (^ and $) from the "business logic" of your regex to make these variables more reusable.
study_number_pattern = r"(sm\d{5}-[a-z]+-\d{2})"
version_number_pattern = r"v(\d+)-(\d+)"
pdf_extension_pattern = r"\.pdf"

pattern = "".join([
    "^",
    study_number_pattern,
    doctype,
    version_number_pattern,
    pdf_extension_pattern,
    "$",
])
regex = re.compile(pattern, re.VERBOSE)
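As a quick sanity check, here is a sketch of this approach in action. The value of doctype is made up for illustration, since the question doesn't show it:
import re

# Hypothetical document-type fragment, for illustration only.
doctype = r"-protocol-"

pattern = "".join([
    "^",
    r"(sm\d{5}-[a-z]+-\d{2})",  # study number
    doctype,                    # document
    r"v(\d+)-(\d+)",            # version number
    r"\.pdf",                   # pdf extension
    "$",
])
regex = re.compile(pattern, re.VERBOSE)

m = regex.match("sm12345-abc-01-protocol-v2-3.pdf")
if m:
    print(m.groups())  # ('sm12345-abc-01', '2', '3')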
Your concatenation operator (+) and the closing triple quotes (''') on the second line of the pattern were commented out along with your note; simply move them in front of the comment.
I put both plus signs on one line, but you could still split them across multiple lines; that's simply my preference...
regex = re.compile(r'''
^(sm\d{5}-[a-z]+-\d{2})''' # study number
+ doctype + # document
r'''v(\d+)-(\d+) # version number
\.pdf$ # pdf extension
''', re.VERBOSE)
On a related note, do you use an IDE or a syntax-highlighting text editor for writing code? If not, one would be quite helpful: errors like this would be caught and highlighted instantly.
Edit:
Stack Overflow syntax highlighting makes it appear as though the lines with comments 'version number' and 'pdf extension' are part of the pattern in the code above, but using the re.VERBOSE flag makes them actual comments.
Related
I am receiving a stream of tweets with Python and would like to extract the last word, or know where to reference it.
For example, from
NC don’t like working together www.linktowtweet.org
I'd like to get back
together
I am not familiar with tweepy, so I am presuming you have the data in a Python string; there may be a better answer at the tweepy level.
However, given a string in Python, it is simple to extract the last word.
Solution 1
Use str.rfind(' '). The idea here is to find the space preceding the last word. Here is an example.
text = "NC don’t like working together"
text = text.rstrip() # Remove any spaces at the end that would otherwise confuse the algorithm.
last_word = text[text.rfind(' ')+1:] # Output every character *after* the space.
print(last_word)
Note: If the string is empty, last_word will be an empty string; if the string contains no spaces at all, the whole string is returned.
Now this presumes that all of the words are separated by spaces. To handle newlines and tabs as well, use str.replace to turn them into spaces. Python's whitespace characters are \t\n\x0b\x0c\r plus the space itself (see string.whitespace), but I presume only newlines and tabs will be found in Twitter messages.
So a complete example (wrapped as a function) would be
def last_word(text):
    text = text.replace('\n', ' ')  # Replace newlines with spaces.
    text = text.replace('\t', ' ')  # Replace tabs with spaces.
    text = text.rstrip(' ')         # Remove trailing spaces.
    return text[text.rfind(' ')+1:]
print(last_word("NC don’t like working together")) # Outputs "together".
This may still be the best solution for basic parsing, but there is something better for larger problems.
Solution 2
Regular Expressions
Regular expressions are a much more flexible way to handle strings in Python. Regexes, as they are often called, use their own language to specify a portion of text.
For example, .*\s(\S+) specifies the last word in a string.
Here it is again with a longer explanation.
.*      # Match as many characters as possible.
\s      # Until a whitespace character ("\t\n\x0b\x0c\r ").
(       # Remember the next section for the answer.
    \S+ # Match as much non-whitespace (a "word") as possible.
)       # End of the saved section.
So then, in Python you would use this as follows.
import re  # Import the regex library.

# Compile the pattern (DOTALL makes . match \n as well).
LAST_WORD_PATTERN = re.compile(r".*\s(\S+)", re.DOTALL)

def last_word(text):
    m = LAST_WORD_PATTERN.match(text)
    if not m:  # If there was no last word in this text.
        return ''
    return m.group(1)  # Otherwise return the last word.
print(last_word("NC don’t like working together")) # Outputs "together".
Now, even though this method is a lot less obvious, it has a couple of advantages. First off, it is a lot more customizable. If you wanted to match the final word, but not links, the regex r".*\s([^.:\s]+(?!\.\S|://))\b" would match the last word, but ignore a link if that was the last thing.
Example:
import re  # Import the regex library.

# Compile the pattern (DOTALL makes . match \n as well).
LAST_WORD_PATTERN = re.compile(r".*\s([^.:\s]+(?!\.\S|://))\b", re.DOTALL)

def last_word(text):
    m = LAST_WORD_PATTERN.match(text)
    if not m:  # If there was no last word in this text.
        return ''
    return m.group(1)  # Otherwise return the last word.
print(last_word("NC don’t like working together www.linktowtweet.org")) # Outputs "together".
The second advantage to this method is speed.
As you can see in the Try it online! demo here, the regex approach is almost as fast as the string manipulation, if not faster in some cases. (I actually found that the regex ran about 0.2 µs faster on my machine than in the demo.)
Either way, the regex execution is extremely fast, even in the simple case, and there is no question that the regex is faster than any more complex string algorithm implemented in Python. So using the regex can also speed up the code.
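If you want to reproduce the timings, here is a rough benchmark sketch using timeit; the absolute numbers will of course vary by machine and Python version:
import re
import timeit

text = "NC don’t like working together"
LAST_WORD_PATTERN = re.compile(r".*\s(\S+)", re.DOTALL)

def rfind_last(t):
    return t[t.rfind(' ') + 1:]

def regex_last(t):
    m = LAST_WORD_PATTERN.match(t)
    return m.group(1) if m else ''

# Time each approach over 100,000 calls.
print(timeit.timeit(lambda: rfind_last(text), number=100000))
print(timeit.timeit(lambda: regex_last(text), number=100000))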
EDIT
Changed the url avoiding regex from
re.compile(r".*\s([^.\s]+(?!\.\S))\b", re.DOTALL)
to
re.compile(r".*\s([^.:\s]+(?!\.\S|://))\b", re.DOTALL)
So that calling last_word("NC don’t like working together http://www.linktowtweet.org") returns together and not http://.
To see how this regex works, look at https://regex101.com/r/sdwpqB/2.
Simple: if your text is
text = "NC don’t like working together www.linktowtweet.org"
text = re.sub(r'https?:\/\/.*[\r\n]*', '', text, flags=re.MULTILINE) #to remove any URL
text = text.split() #splits sentence into words with delimiter=" "
last_word = text[-1]
So there you go!! Now you'll get the last word "together".
From a C file, I only want to fetch the variable name that is used in an if condition.
Following is a code snippet of the regex:
fieldMatch = re.findall(itemFieldList[i]+"=", codeline, re.IGNORECASE);
Here I can find the variable itemFieldList[i] in the file.
But when I try to add if as shown below, nothing is extracted as output, even though the variable exists in the C code in an if condition.
fieldMatch = re.findall(("^(\w+)if+[(](\w+)("+itemFieldList[i]+")="), codeline, re.IGNORECASE|re.MULTILINE);
Can anyone suggest how to create a regex for the scenario below?
Sample Input :
IF(WORK.env_flow_ind=="R")
OR
IF( WORK.qa_flow_ind=="Q" OR WORK.env_flow_ind=="R")
here itemFieldList[i] = WORK.env_flow_ind
I don't have enough reputation to make this a comment, which it should be, and I can't say that I fully understand the question. But to point out a few things:
If it's about adding variables to your regex, then you should be using string formatting to make it more understandable for us and your future self:
"^{}".format(variable)
Doing that will allow you to create a dynamic regex that searches for what you want.
Secondly, I don't think that is your problem; I think your regex is malformed. I don't know what exactly you are trying to search for, but I recommend reading the Python regex documentation and testing your regex on a resource like regex101 to make sure that you're capturing what you intend to. From what I can see, you are a bit confused about groups: when you put parentheses around a pattern, you are identifying it as a group. You were on the right track trying to exclude the parentheses from your search by surrounding them with square brackets, but it's simpler and cleaner to escape them.
If you are trying to capture this statement:
if(someCondition == fries)
and you want to extract the keyword fries, the valid syntax for that pattern is:
(?=if\((?:[\w=\s])+(fries)\))
Since you want this to be dynamic you would replace the string fries with your string template, and you'll get code that ends up something like this:
p = re.compile("(?=if\((?:[\w=\s])+({})\))".format(search), re.IGNORECASE)
p.findall(string)
Regex101 does a better job of breaking down my regex than I ever will:
Link cuz i have no rep
You can build the regex pattern as:
pattern = r"\bif\b\s*\(.*?\b" + re.escape(variablename) + r"\b"
This will look for the word "if" in lowercase (add re.IGNORECASE if you also need the uppercase "IF" from the sample input), then optionally any spaces, then an opening parenthesis, then as few characters as necessary, and then your search term, with word boundaries at its beginning and its end; a runnable sketch follows the examples below.
So if variablename is "WORK.env_flow_ind", then re.findall(pattern, textfile) will match the following lines:
if(blabla & WORK.env_flow_ind == "a")
if (WORK.env_flow_ind == "b")
if(WORK.env_flow_ind == "b")
if( WORK.env_flow_ind == "b")
and these won't match:
if (WORK.env_bla == "c")
if (WORK.env_flow_ind2 == "d")
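For completeness, here is a minimal runnable sketch of the pattern in use with re.findall; the input is built from the example lines above, and the printed values are the matched prefixes rather than whole lines:
import re

variablename = "WORK.env_flow_ind"
pattern = r"\bif\b\s*\(.*?\b" + re.escape(variablename) + r"\b"

code = '''if(blabla & WORK.env_flow_ind == "a")
if (WORK.env_flow_ind == "b")
if (WORK.env_bla == "c")
if (WORK.env_flow_ind2 == "d")'''

for match in re.findall(pattern, code):
    print(match)
# if(blabla & WORK.env_flow_ind
# if (WORK.env_flow_ind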
Is there a cleaner way to write long regex patterns in Python? I saw this approach somewhere, but regex in Python doesn't allow lists.
patterns = [
    re.compile(r'<!--([^->]|(-+[^->])|(-?>))*-{2,}>'),
    re.compile(r'\n+|\s{2}')
]
You can use verbose mode to write more readable regular expressions. In this mode:
Whitespace within the pattern is ignored, except when in a character class or preceded by an unescaped backslash.
When a line contains a '#' that is neither in a character class nor preceded by an unescaped backslash, all characters from the leftmost such '#' through the end of the line are ignored.
The following two statements are equivalent:
a = re.compile(r"""\d + # the integral part
\. # the decimal point
\d * # some fractional digits""", re.X)
b = re.compile(r"\d+\.\d*")
(Taken from the documentation of verbose mode)
Though @Ayman's suggestion about re.VERBOSE is a better idea, if all you want is what you're showing, just do:
patterns = re.compile(
    r'<!--([^->]|(-+[^->])|(-?>))*-{2,}>'
    r'\n+|\s{2}'
)
and Python's automatic concatenation of adjacent string literals (much like C's, btw) will do the rest;-).
You can use comments in regexes, which make them much more readable. Taking an example from http://gnosis.cx/publish/programming/regular_expressions.html:
/ # identify URLs within a text file
[^="] # do not match URLs in IMG tags like:
# <img src="http://mysite.com/mypic.png">
http|ftp|gopher # make sure we find a resource type
:\/\/ # ...needs to be followed by colon-slash-slash
[^ \n\r]+ # stuff other than space, newline, tab is in URL
(?=[\s\.,]) # assert: followed by whitespace/period/comma
/
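For reference, here is a rough Python adaptation of that commented pattern (a sketch only: the original is shown with Perl-style /.../ delimiters, so the alternation is grouped and captured here so that findall returns just the URLs, and the sample text is made up):
import re

url_pattern = re.compile(r"""
    [^="]                 # do not match URLs in IMG tags like <img src="http://mysite.com/mypic.png">
    ((?:http|ftp|gopher)  # make sure we find a resource type (grouped and captured here)
    ://                   # ...needs to be followed by colon-slash-slash
    [^ \n\r]+)            # stuff other than space, newline, carriage return is in the URL
    (?=[\s.,])            # assert: followed by whitespace/period/comma
""", re.VERBOSE)

text = 'See http://mysite.com/page.html now, or ftp://files.example.org/readme please.'
print(url_pattern.findall(text))
# ['http://mysite.com/page.html', 'ftp://files.example.org/readme']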
I have a few URLs in a file, some of them are embedded between specific start and end tags whereas others are not. I only need to extract the ones which are embedded in between the start and end tags.
A line in my inputfile.txt looks like the following:
some gibberish data-start=\"https:\/\/cdn.net\/hphotos-ak-xfa1\/1.jpg\" data-end this is useless text, some gibberishhh data-start=\"https:\/\/cdn.net\/hphotos-xaf1\/2.jpg\" data-end some gibberish fake-data-start=\"https:\/\/cdn.net\/hphotos-xaf1\/2.jpg\" fake-data-end
The start and end tags of the URLs that I need are data-start and data-end as opposed to fake-data-start and fake-data-end.
Now I'm using the following regex in Python to extract the aforementioned URLs:
(?<=\ data-start=\\\")([^"]+\.[^"]+\.[^"]+)(?=\"\ data-end)
I believe the above regex works, which I tested at this link,
and my Python code is:
import re
import string
import sys
s = re.compile('(?<=\ data-start=\\\")([^"]+\.[^"]+\.[^"]+)(?=\"\ data-end)')
fin = open('inputfile.txt')
for line in fin:
    m = s.findall(line)
    if m:
        print m
However, my Python code is unable to find the URLs, on the other hand if I remove all backslashes from my file then the above code works fine. I haven't been able to explain this difference.
Backslash serves as an escape character; therefore, for every single backslash (\) in the text you need two backslashes (\\) in the pattern. You can use the following regular expression here:
(?<=data-start=\\").*?(?=\\" data-end)
Explanation:
(?<= # look behind to see if there is:
data-start= # 'data-start='
\\ # '\'
" # '"'
) # end of look-behind
.*? # any character except \n (0 or more times)
(?= # look ahead to see if there is:
\\ # '\'
" data-end # '" data-end'
) # end of look-ahead
Note: If your data spans multiple lines, use the inline (?s) modifier to force the dot to match newline characters as well.
(?s)(?<=data-start=\\").*?(?=\\" data-end)
Final solution:
import re
myfile = open('inputfile.txt', 'r')
regex = re.compile(r'(?<=data-start=\\").*?(?=\\" data-end)')
for line in myfile:
    matches = regex.findall(line)
    for m in matches:
        print m
Output
https:\/\/cdn.net\/hphotos-ak-xfa1\/1.jpg
https:\/\/cdn.net\/hphotos-xaf1\/2.jpg
You seem to have too many backslashes. It looks to me like you could simplify your regex to something like:
(?<= data-start=\\")([^".]+\.[^".]+\.[^"\\]+)
Please note that the original [^"]+\. (any characters that are not a double quote, followed by a dot) will first eat everything up to the closing quote, including all the dots, and then backtrack, which is why I added the dots to the character classes.
In Python, something like:
s = re.compile(r'(?<= data-start=\\")([^".]+\.[^".]+\.[^"\\]+)')
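As a quick check, here is a sketch of the simplified pattern run against a shortened version of the sample line from the question:
import re

s = re.compile(r'(?<= data-start=\\")([^".]+\.[^".]+\.[^"\\]+)')

line = r'some gibberish data-start=\"https:\/\/cdn.net\/hphotos-ak-xfa1\/1.jpg\" data-end more text'
for url in s.findall(line):
    print(url)
# https:\/\/cdn.net\/hphotos-ak-xfa1\/1.jpg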
If I have a string where there is a valid JSON substring like this one:
mystr = '100{"1":2, "3":4}312'
What is the best way to do extract just the JSON string? The numbers outside can be anything (except a { or }), including newlines and things like that.
Just to be clear, this is the result I want
newStr = '{"1":2, "3":4}'
The best way I can think of to do this is to use find and rfind and then take the substring. This seems too verbose to me, and it isn't Python 3.0 compliant (which I would prefer, but is not essential).
Any help is appreciated.
Note that the following code very much assumes that everything on either side of the JSON substring is non-brace material.
import re
matcher = re.compile(r"""
^[^\{]* # Starting from the beginning of the string, match anything that isn't an opening bracket
( # Open a group to record what's next
\{.+\} # The JSON substring
) # close the group
[^}]*$ # at the end of the string, anything that isn't a closing bracket
""", re.VERBOSE)
# Your example
print matcher.match('100{"1":2, "3":4}312').group(1)
# Example with embedded hashmap
print matcher.match('100{"1":{"a":"b", "c":"d"}, "3":4}312').group(1)
The short, non-precompiled, non-commented version:
import re
print re.match(r"^[^{]*(\{.+\})[^}]*$", '100{"1":2, "3":4}312').group(1)
Although for the sake of maintenance, commenting regular expressions is very much preferred.
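As a quick check that the captured group really is valid JSON, the extracted substring can be passed to json.loads (a minimal sketch based on the short version above):
import json
import re

mystr = '100{"1":2, "3":4}312'
m = re.match(r"^[^{]*(\{.+\})[^}]*$", mystr)
if m:
    print(json.loads(m.group(1)))  # {'1': 2, '3': 4}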