Regex find word including "-" - python

I have the below regex (from this link: get python dictionary from string containing key value pairs)
r"\b(\w+)\s*:\s*([^:]*)(?=\s+\w+\s*:|$)"
Here is the explanation:
\b # Start at a word boundary
(\w+) # Match and capture a single word (1+ alnum characters)
\s*:\s* # Match a colon, optionally surrounded by whitespace
([^:]*) # Match any number of non-colon characters
(?= # Make sure that we stop when the following can be matched:
\s+\w+\s*: # the next dictionary key
| # or
$ # the end of the string
) # End of lookahead
My question: when my string contains a word with a "-" in it, for example movie-night, the above regex does not work, and I think it is due to the \b(\w+). How can I change this regex to work with words that include "-"? I have tried \b(\w+-), but it does not work. Thanks in advance for your help.

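You could try something such as this:
r"\b([\w\-]+)\s*:\s*([^:]*)(?=\s+[\w\-]+\s*:|$)"
Note the [\w\-]+, which matches word characters and dashes. It is used for the captured key and again inside the lookahead; with a plain \w+ in the lookahead, a hyphenated key that follows a value would not be recognized as the next key.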
You may also want to look into re.X/re.VERBOSE, which lets you spread a pattern over several lines and comment each part.
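Here is a minimal sketch of the same idea written with re.VERBOSE. The names PAIR_RE and parse_pairs and the sample string are made up for illustration; the pattern allows hyphens both in the captured key and in the lookahead for the next key.

```python
import re

# Hypothetical name; the pattern mirrors the one discussed above.
PAIR_RE = re.compile(
    r"""
    \b([\w-]+)          # key: word characters and hyphens
    \s*:\s*             # a colon, optionally surrounded by whitespace
    ([^:]*)             # value: any run of non-colon characters
    (?=                 # stop where one of these follows:
        \s+[\w-]+\s*:   #   the next key (hyphens allowed here too)
      | $               #   or the end of the string
    )
    """,
    re.VERBOSE,
)


def parse_pairs(text):
    """Return a dict built from 'key: value' pairs found in text."""
    return dict(PAIR_RE.findall(text))


sample = "name: John movie-night: Friday at 8 city: Paris"
print(parse_pairs(sample))
# {'name': 'John', 'movie-night': 'Friday at 8', 'city': 'Paris'}
```

Keys that start with a hyphen are not matched, because \b requires a word character at the start of the key.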

Related

Regexp, find in a procedure last "end" to replace with another word

I am trying to fix some mistakes in all procedures. Now I need to find the last "end;" in a procedure and replace it with other text.
I wrote: (\s.*)(end|END)(.*(;).*)
But it does not work correctly: it also replaces some words in the middle of the text. I am using Python's re library.
You can use
result = re.sub(r'(?si)(.*)\bend\b', r'\g<1>some other word', text)
The regex matches
(?si) - an inline re.DOTALL (s) and re.IGNORECASE (i) modifier
(.*) - Group 1: any zero or more chars as many as possible
\bend\b - a whole word end.
The \g<1>some other word replacement is the Group 1 value (I used \g<1> since it will be helpful if your some other word starts with a digit) plus your word.
NOTE: if your some other word can contain literal backslashes, do not forget to double them.
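As a rough sketch of the above (the names LAST_END_RE and replace_last_end and the sample text are invented for the example), the same pattern can be combined with a function replacement. This is a variant of the \g<1> approach: a function's return value is inserted verbatim, so backslashes or leading digits in the new word need no special handling.

```python
import re

# Greedy (.*) with DOTALL runs to the end of the text and backtracks,
# so \bend\b lands on the LAST whole word "end" (any case).
LAST_END_RE = re.compile(r"(?si)(.*)\bend\b")


def replace_last_end(text, replacement):
    """Replace the last whole word 'end' in text with replacement."""
    return LAST_END_RE.sub(lambda m: m.group(1) + replacement, text, count=1)


source = "begin\n  if x then begin y; end;\nEND;\n"
print(replace_last_end(source, "finish"))
# begin
#   if x then begin y; end;
# finish;
```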

Python regex specific word with single quote at end

Searching a large syslog repo and need to get a specific word to match with a certain condition.
I'm using regex to compile a search for this word. I've read the python docs on regex characters and I understand how to specify each criteria separately but somehow missing how to concatenate all together for my specific search. This is what I have so far but not working...
p = re.compile("^'[A-Z]\w+'$")
match = re.search(p, syslogline, )
The word is a username that is alphanumeric, always begins with an uppercase character (preceded by a blank space), can contain letters or digits, is 3-12 characters long, and ends with a single quote.
An example would be: Epresley01' or J98473'
Brief
Based on your requirements (also stated below), your regex doesn't work because:
^' Asserts the position at the start of the line and ensures a ' is the first character of that line.
$ Asserts the position at the end of the line.
Having said that, you specify that the username is preceded by a space character (which isn't present in your pattern). Your pattern also checks for a leading ', which isn't the first character of the username. Since you haven't given us a sample of your file, I can't confirm whether your string starts right before the username and ends right after it, but if it doesn't, the anchors ^ and $ are also not helping you here.
Requirements
The requirements below are simply copied from the OP's question (rewritten) to outline the username format. The username:
Is preceded by a space character.
Starts with an uppercase letter.
Contains chars or nums. I'm assuming here that chars actually means letters and that all letters in the username (including the uppercase starting character) are ASCII.
Is 3-12 characters in length (excluding the preceding space and the end character stated below).
Ends with an apostrophe character '.
Code
(?<= )[A-Z][^\W_]{2,11}'
Explanation
(?<= ) Positive lookbehind ensuring what precedes is a space character
[A-Z] Match any uppercase ASCII letter
[^\W_]{2,11} Match any word character except underscore _, between 2 and 11 times (equivalent to [a-zA-Z0-9] for ASCII input; in Python 3, \w also matches non-ASCII letters and digits unless re.ASCII is set)
This appears a little confusing because it's actually a double negative. The class matches anything that's not in the set. \W matches any non-word character, so negating it is like saying don't match non-word characters, i.e. match word characters. Adding _ to the negated set excludes the underscore as well.
' Match the apostrophe character ' literally
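A quick way to try this in Python is sketched below. The log line is a made-up sample (the question did not include one), USERNAME_RE and find_usernames are illustrative names, and re.ASCII is added on the assumption that usernames are ASCII-only.

```python
import re

# Lookbehind for a space, one uppercase letter, 2-11 more letters/digits,
# then the closing apostrophe (which is part of the match).
USERNAME_RE = re.compile(r"(?<= )[A-Z][^\W_]{2,11}'", re.ASCII)


def find_usernames(line):
    """Return every username-like token (with its trailing ') in line."""
    return USERNAME_RE.findall(line)


# Hypothetical syslog line, for illustration only.
line = "Oct 11 22:14:15 host sshd: Failed login for Epresley01' from 10.0.0.5"
print(find_usernames(line))
# ["Epresley01'"]
```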
I think you can do it like this:
(Updated after the comment from #ctwheels)
[A-Z][a-zA-Z0-9]{1,10}'
Explanation
Match a whitespace
Match an uppercase character [A-Z]
Match [a-zA-Z0-9]{1,10}: between 1 and 10 ASCII letters or digits
Match an apostrophe '

Python regex: How to make a group of words/character optional?

I am trying to make regex that can match all of them:
word
word-hyphen
word-hyphen-again
That is, -\w+ can occur many times, depending on the number of words in a term. How can I make it optional?
What I have so far is here: https://regex101.com/r/Atpwze/1
Try using
\w+(-\w+)* for matching 0 or more hyphenated words after first word
\w+(-\w+){0,} same as first case
based on your exact requirement.
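To rule out extreme cases like a-+-+---, you could use \w+(-\w+)*[^\W]
\W matches any non-word character, and [^\W] negates that class, so it matches a single word character (the same as \w). Note that rejecting such input entirely also requires anchoring the pattern, for example with ^...$ or re.fullmatch.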
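To see the optional repeated group in action, here is a small sketch. TERM_RE and is_hyphenated_term are names chosen for the example, and re.fullmatch is used so that the whole string has to fit the pattern.

```python
import re

# One word, followed by zero or more "-word" groups (non-capturing).
TERM_RE = re.compile(r"\w+(?:-\w+)*")


def is_hyphenated_term(candidate):
    """True if candidate is a word or several words joined by single hyphens."""
    return TERM_RE.fullmatch(candidate) is not None


for candidate in ["word", "word-hyphen", "word-hyphen-again", "a-+-+---", "word-"]:
    print(candidate, is_hyphenated_term(candidate))
# word True
# word-hyphen True
# word-hyphen-again True
# a-+-+--- False
# word- False
```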
To catch all of your examples, I think you could use:
^\w+(?:\w+\-?|\-\w+)+$
Beginning of the string ^
Match a word character one or more times \w+
Start a non capturing group (?:
Match a word character one or more times with an optional hyphen \w+\-?
Or |
A hyphen with one or more word characters \-\w+
Close the non capturing group )
End of the string $

Strip punctuation with regular expression - python

I would like to strip all punctuation (except the dot) from the beginning and end of a string, but not from the middle of it.
For instance, for the original string:
##%%.Hol$a.A.$%
I would like to get .Hol$a.A. with the punctuation removed from the beginning and end, but not from the middle of the word.
Another example could be for the string:
##%%...&Hol$a.A....$%
In this case the returned string should be ...&Hol$a.A.... because we do not care if the allowed characters are repeated.
The idea is to remove all punctuation (except the dot) only at the beginning and end of the word. A word is defined as \w and/or a .
A practical example is the string 'Barnes&Nobles'. For text analysis it is important to recognize Barnes&Nobles as a single entity, but without the '
How to accomplish the goal using Regex?
Use this simple and easily adaptable regex:
[\w.].*[\w.]
It will match exactly your desired result, nothing more.
[\w.] matches any word character (letter, digit, or underscore) or a dot
.* matches any character (except newline normally)
[\w.] matches any word character (letter, digit, or underscore) or a dot
To change the delimiters, simply change the set of allowed characters inside the [] brackets.
Check this regex out on regex101.com
import re
data = '##%%.Hol$a.A.$%'
pattern = r'[\w.].*[\w.]'
print(re.search(pattern, data).group(0))
# Output: .Hol$a.A.
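If you want this as a reusable helper, one possible sketch is below. EDGE_RE and strip_edge_punctuation are invented names, and two choices are my own assumptions rather than part of the answer above: the tail is made optional so a string with a single allowed character still matches, and an empty string is returned when nothing matches.

```python
import re

# First allowed character, then (optionally) everything up to the last
# allowed character. Allowed means a word character or a dot.
EDGE_RE = re.compile(r"[\w.](?:.*[\w.])?")


def strip_edge_punctuation(text):
    """Drop leading/trailing punctuation except dots; keep the middle as-is."""
    match = EDGE_RE.search(text)
    return match.group(0) if match else ""


print(strip_edge_punctuation("##%%.Hol$a.A.$%"))   # .Hol$a.A.
print(strip_edge_punctuation("'Barnes&Nobles'"))   # Barnes&Nobles
```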
Depending on what you mean by stripping the punctuation, you can adapt the following code:
import re
# The dots are escaped so that they match a literal "." rather than any character
res = re.search(r"^[^.]*(\.[^.]*\.([^.]*\.)*?)[^.]*$", "##%%.Hol$a.A.$%")
mystr = res.group(1)
This keeps everything from the first dot to the last dot and strips what comes before and after.
Warning: re.search returns None if the string doesn't match, so check the result before calling group().

match until a certain pattern using regex

I have string in a text file containing some text as follows:
txt = "java.awt.GridBagLayout.layoutContainer"
I am looking to get everything before the Class Name, "GridBagLayout".
I have tried something like the following, but I can't figure out how to get rid of the trailing ".":
txt = re.findall(r'java\S?[^A-Z]*', txt)
and I get the following: "java.awt."
instead of what I want: "java.awt"
Any pointers as to how I could fix this?
Without using capture groups, you can use lookahead (the (?= ... ) business).
java\s?[^A-Z]*(?=\.[A-Z]) should capture everything you're after. Here it is broken down:
java //Literal word "java"
\s? //Match for an optional space character. (can change to \s* if there can be multiple)
[^A-Z]* //Any number of non-capital-letter characters
(?=\.[A-Z]) //Look ahead for (but don't add to selection) a literal period and a capital letter.
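In Python, the lookahead version could be tried out roughly like this (PREFIX_RE and package_prefix are names made up for the example):

```python
import re

# Match "java" plus any non-capital characters, stopping right before
# a period that is followed by a capital letter (the class name).
PREFIX_RE = re.compile(r"java\s?[^A-Z]*(?=\.[A-Z])")


def package_prefix(qualified_name):
    """Return the part before the class name, or None if there is no match."""
    match = PREFIX_RE.search(qualified_name)
    return match.group(0) if match else None


print(package_prefix("java.awt.GridBagLayout.layoutContainer"))
# java.awt
```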
Make your pattern match a period followed by a capital letter:
'(java\S?[^A-Z]*?)\.[A-Z]'
Everything in capture group one will be what you want.
This seems to do what you want with re.findall(): (java\S?[^A-Z]*)\.[A-Z]
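For comparison, a short sketch of the capture-group variant with re.findall. When the pattern has exactly one group, findall returns the text of that group rather than the whole match, so the trailing period and capital letter are dropped.

```python
import re

txt = "java.awt.GridBagLayout.layoutContainer"

# The period and the capital letter are matched but sit outside the group.
matches = re.findall(r"(java\S?[^A-Z]*?)\.[A-Z]", txt)
print(matches)
# ['java.awt']
```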
