I'm looking for a regex to match hyphenated words in Python.
The closest I've managed to get is: '\w+-\w+[-\w+]*'
text = "one-hundered-and-three- some text foo-bar some--text"
hyphenated = re.findall(r'\w+-\w+[-\w+]*',text)
which returns list ['one-hundered-and-three-', 'foo-bar'].
This is almost perfect except for the trailing hyphen after 'three'. I only want the additional hyphen if it is followed by a 'word', i.e. instead of '[-\w+]*' I need something like '(-\w+)*', which I thought would work but doesn't (it returns ['-three', '']). In other words, I need something that matches |word followed by hyphen followed by word followed by hyphen_word zero or more times|.
Try this:
re.findall(r'\w+(?:-\w+)+',text)
Here we consider a hyphenated word to be:
one or more word chars
followed by one or more repetitions of:
a single hyphen
followed by one or more word chars
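As a quick check, here is a small sketch that runs both the original attempt and the suggested pattern against the sample string from the question. The variable names are only illustrative.

```python
import re

text = "one-hundered-and-three- some text foo-bar some--text"

# Original attempt: the character class [-\w+] also accepts a lone trailing hyphen.
original = re.findall(r'\w+-\w+[-\w+]*', text)

# Suggested pattern: every hyphen must be followed by at least one word character.
# The group is non-capturing, so findall returns whole matches.
hyphenated = re.findall(r'\w+(?:-\w+)+', text)

print(original)
print(hyphenated)
```

With a capturing group such as (-\w+)*, re.findall returns the group contents instead of the whole match, which is why the question's second attempt gave ['-three', ''].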
Related
I have been fixing some mistakes across all of my procedures. Now I need to find the last "end;" in a procedure and replace it with other text.
I wrote this: (\s.*)(end|END)(.*(;).*)
But it does not work correctly: it also replaces some words in the middle of the text. I am using Python's re library.
You can use
result = re.sub(r'(?si)(.*)\bend\b', r'\g<1>some other word', text)
The regex matches
(?si) - an inline re.DOTALL (s) and re.IGNORECASE (i) modifier
(.*) - Group 1: any zero or more chars as many as possible
\bend\b - the whole word end.
The \g<1>some other word replacement is the Group 1 value plus your word. I used \g<1> rather than \1 so that the group reference stays unambiguous if your some other word starts with a digit.
NOTE: if your some other word can contain literal backslashes, do not forget to double them.
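A minimal sketch of this substitution follows. The procedure text, the replacement word finish, and the literal dir\temp are all made up for illustration. The second call shows one possible way to sidestep the backslash issue: a function replacement is inserted as-is, without template escape processing.

```python
import re

# Hypothetical procedure text, invented for this example.
text = (
    "procedure demo;\n"
    "begin\n"
    "  if x > 0 then\n"
    "  begin\n"
    "    y := 1;\n"
    "  end;\n"
    "  pending := 0;\n"
    "END;\n"
)

pattern = r'(?si)(.*)\bend\b'

# Template replacement: \g<1> re-inserts everything before the last "end".
result = re.sub(pattern, r'\g<1>finish', text)
print(result)

# If the replacement may contain backslashes, a function replacement
# avoids template escaping altogether.
literal = r'dir\temp'
result_literal = re.sub(pattern, lambda m: m.group(1) + literal, text)
print(result_literal)
```

Because (.*) is greedy and DOTALL is on, only the last whole-word end is replaced. The inner end; and the word pending are left alone.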
I would like to extract the last word of each line using regex. Most of the last words are built up like this:
sfdsa AAAAB3NzaCLkc3M
gadsgadsg AAAB3NzaCl/Ezfl
dogjasdpgpds AAAB3Nza/ClBAm+4lj
I already tried:
lastwords = re.findall(r'\s(\w+)$', content, re.MULTILINE)
Try this:
\s*([\S]+)$
Explanation:
\s* zero or more whitespace characters
[\S]+ followed by one or more non whitespace characters
$ followed by end of line.
This way, you are guaranteed to match the last run of whitespace on the line, because what follows it contains no further whitespace.
Your regex did not work because \w+ only matches letters, digits, and the underscore.
So / and + are not matched, which is why two of your examples fail.
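To see the difference, here is a short sketch that applies both patterns to the three sample lines from the question. The [\S] class is written as plain \S, which is equivalent.

```python
import re

content = (
    "sfdsa AAAAB3NzaCLkc3M\n"
    "gadsgadsg AAAB3NzaCl/Ezfl\n"
    "dogjasdpgpds AAAB3Nza/ClBAm+4lj"
)

# Original attempt: \w+ cannot cross '/' or '+', so only the first line matches.
attempt = re.findall(r'\s(\w+)$', content, re.MULTILINE)

# Suggested pattern: \S+ accepts any non-whitespace character.
lastwords = re.findall(r'\s*(\S+)$', content, re.MULTILINE)

print(attempt)
print(lastwords)
```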
I am trying to make regex that can match all of them:
word
word-hyphen
word-hyphen-again
That is, -\w+ can occur any number of times, depending on how many words are in a term. How can I make it optional?
Thing I made so far is given here:- https://regex101.com/r/Atpwze/1
Try using
\w+(-\w+)* to match a word followed by zero or more hyphenated parts
\w+(-\w+){0,} which is equivalent to the first pattern
depending on your exact requirement.
To exclude extreme cases like a-+-+---, you could use \w+(-\w+)*[^\W]
\W matches any non-word character and [^\W] negates that, so it is equivalent to \w. Note that, to match a whole term, this variant needs the last part of the term to contain at least two word characters.
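The following sketch checks the first pattern against the three terms from the question plus two malformed ones. It assumes each term should be matched in full, so it uses re.fullmatch. That is a choice made for this example, not something stated in the answer.

```python
import re

pattern = re.compile(r'\w+(-\w+)*')

candidates = ['word', 'word-hyphen', 'word-hyphen-again', 'word-', 'a-+-+---']

# fullmatch succeeds only if the entire string fits the pattern.
results = {c: bool(pattern.fullmatch(c)) for c in candidates}

for term, ok in results.items():
    print(term, ok)
```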
To catch all of your examples, I think you could use:
^\w+(?:\w+\-?|\-\w+)+$
Beginning of the string ^
Match a word character one or more times \w+
Start a non capturing group (?:
Match a word character one or more times with an optional hyphen \w+\-?
Or |
A hyphen with one or more word characters \-\w+
Close the non capturing group )
End of the string $
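A small check of this anchored pattern on the question's examples is sketched below. The extra inputs 'word-' and 'a' are added here only to show two edge cases: the optional hyphen in the first alternative lets a trailing hyphen through, and a single-character word is rejected because the group must match at least once.

```python
import re

pattern = re.compile(r'^\w+(?:\w+\-?|\-\w+)+$')

candidates = ['word', 'word-hyphen', 'word-hyphen-again', 'word-', 'a']

# match() anchors at the start; the trailing $ anchors at the end.
results = {c: bool(pattern.match(c)) for c in candidates}

for term, ok in results.items():
    print(term, ok)
```

If those edge cases matter, \w+(?:-\w+)* with re.fullmatch may be a simpler fit.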
I would like to strip all punctuation (except the dot) from the beginning and end of a string, but not from the middle of it.
For instance for an original string:
##%%.Hol$a.A.$%
I would like to get .Hol$a.A., that is, the punctuation removed from the beginning and the end but not from the middle of the word.
Another example could be for the string:
##%%...&Hol$a.A....$%
In this case the returned string should be ...&Hol$a.A.... because we do not care if the allowed characters are repeated.
The idea is to remove all punctuation (except the dot) only at the beginning and end of the word. A word is defined as \w and/or a .
A practical example is the string 'Barnes&Nobles'. For text analysis it is important to recognize Barnes&Nobles as a single entity, but without the '
How to accomplish the goal using Regex?
Use this simple and easily adaptable regex:
[\w.].*[\w.]
It will match exactly your desired result, nothing more.
[\w.] matches any word character (letter, digit, or underscore) or a dot
.* matches any run of characters (except newlines by default), as many as possible
[\w.] matches any word character (letter, digit, or underscore) or a dot
To change the delimiters, simply change the set of allowed characters inside the [] brackets.
Check this regex out on regex101.com
import re
data = '##%%.Hol$a.A.$%'
pattern = r'[\w.].*[\w.]'
print(re.search(pattern, data).group(0))
# Output: .Hol$a.A.
Depending on what you mean by stripping the punctuation, you can adapt the following code:
import re
# The dots are escaped so that they match literal dots only.
res = re.search(r"^[^.]*(\.[^.]*\.([^.]*\.)*?)[^.]*$", "##%%.Hol$a.A.$%")
# re.search returns None when the string does not match.
mystr = res.group(1) if res is not None else None
This strips everything before the first dot and after the last dot in the string.
Warning: re.search returns None if the string doesn't match, so check the result before calling group.
What regex can I use to check if there is an excessive number of capitals in a word? e.g.
AAAApples
The program should match AAAApples as having too many capital letters at the start, and using re.sub, replace them with empty strings to leave Apples
So using regex, this: r'^[A-Z]*[a-z]' finds capitals, and checks that the next is a lowercase letter. I then replace this with an empty string, to remove the capitals. But of course, this then also removes 'Ap', leaving 'ples'.
What do I need to do to my regex to fix this?
Use a capture group to get the letters after the extra capitals.
re.sub(r'^[A-Z]+([A-Z][a-z])', r'\1', string)
This matches a sequence of uppercase letters, followed by an uppercase and then lowercase letter. The parentheses cause the match for the last two letters to be put in a capture group. In the replacement \1 is replaced with the contents of the first capture group.
Or you can use lookahead:
re.sub(r'^[A-Z]+(?=[A-Z][a-z])', '', string)
A lookahead specifies that the pattern matches only if it's followed by a match for the sub-pattern, but that sub-pattern isn't included in the match. So this matches a sequence of uppercase letters that must be followed by an uppercase and then lowercase letter. But only the initial sequence of uppercase letters is included in the match, which then gets replaced by the empty string.
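Both variants can be compared side by side. This is a minimal sketch. The extra inputs 'Apples' and 'NASA' are added here to show that strings without surplus leading capitals are left untouched.

```python
import re

words = ['AAAApples', 'Apples', 'NASA']

# Capture group: match the extra capitals plus "Ap", then put "Ap" back via \1.
capture = [re.sub(r'^[A-Z]+([A-Z][a-z])', r'\1', w) for w in words]

# Lookahead: match only the extra capitals; "Ap" is checked but not consumed.
lookahead = [re.sub(r'^[A-Z]+(?=[A-Z][a-z])', '', w) for w in words]

print(capture)
print(lookahead)
```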
Go to regular-expressions.info to learn all about regexp.