Python Regex: Symbol + in every letter in the same word - python

I am using Python.
I want to write a regex that allows the following examples:
Day
Dday
Daay
Dayy
Ddaay
Ddayy
...
So each letter of the word may appear one or more times.
How can I write this easily? Is there an expression that makes it easy?
I have a lot of words.
Thanks

We can try using the following regex pattern:
^([A-Za-z])\1*([A-Za-z])\2*([A-Za-z])\3*$
This matches and captures a single letter, followed by any number of occurrences of this letter. The \1 you see in the above pattern is a backreference which represents the previous matched letter (and so on for \2 and \3).
Code:
import re
word = "DdddddAaaaYyyyy"
matchObj = re.match(r'^([A-Za-z])\1*([A-Za-z])\2*([A-Za-z])\3*$', word, re.M|re.I)
if matchObj:
    print("matchObj.group() : ", matchObj.group())
    print("matchObj.group(1) : ", matchObj.group(1))
    print("matchObj.group(2) : ", matchObj.group(2))
    print("matchObj.group(3) : ", matchObj.group(3))
else:
    print("No match!!")
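One caveat is worth seeing in action: the backreference pattern knows nothing about a particular word. It accepts any three groups of repeated letters, so it matches stretched forms of "Day" and of any other three-letter word alike. A minimal sketch (the name THREE_LETTERS and the sample inputs are chosen only for illustration):

```python
import re

# Three capture groups, each followed by repeats of the same letter.
# re.I also makes the backreferences case-insensitive, so "Dd" counts as a repeat.
THREE_LETTERS = re.compile(r'^([A-Za-z])\1*([A-Za-z])\2*([A-Za-z])\3*$', re.I)

for candidate in ["DdddddAaaaYyyyy", "Dooog", "Daily", "Da"]:
    m = THREE_LETTERS.match(candidate)
    # Print the three captured letters, or None when there is no match.
    print(candidate, "->", m.groups() if m else None)
```

If only stretched forms of one specific word should match, a pattern built from that word is probably the better fit.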

To match a character one or more times you can use the + quantifier. To build the full pattern dynamically you would need to split the word to characters and add a + after each of them:
pattern = "".join(char + "+" for char in word)
Then just match the pattern case insensitively.
Demo:
>>> import re
>>> word = "Day"
>>> pattern = "".join(char + "+" for char in word)
>>> pattern
'D+a+y+'
>>> words = ["Dday", "Daay", "Dayy", "Ddaay", "Ddayy"]
>>> all(re.match(pattern, word, re.I) for word in words)
True
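Two details may matter in practice: re.match only anchors the start of the string, and a word containing regex metacharacters would need escaping. The sketch below is one possible way to handle both, using re.escape and fullmatch; the helper name stretched_pattern is hypothetical.

```python
import re

def stretched_pattern(word):
    # Escape each character so metacharacters such as '.' are matched literally,
    # then allow one or more repetitions of it.
    return re.compile("".join(re.escape(ch) + "+" for ch in word), re.IGNORECASE)

pattern = stretched_pattern("Day")
print(pattern.pattern)                      # D+a+y+
print(bool(pattern.fullmatch("Ddaayy")))    # True
print(bool(pattern.fullmatch("Days")))      # False: the trailing "s" is not allowed
print(bool(pattern.match("Days")))          # True: match() ignores the trailing "s"
```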

Try /d+a+y+/gi:
d+ Matches d one or more times.
a+ Matches a one or more times.
y+ Matches y one or more times.

As per my original comment, the below does exactly what I explain.
Since you want to be able to use this on many words, I think this is what you're looking for.
import re
word = "day"
regex = r"^"+("+".join(list(word)))+"+$"
test_str = ("Day\n"
"Dday\n"
"Daay\n"
"Dayy\n"
"Ddaay\n"
"Ddayy")
matches = re.finditer(regex, test_str, re.IGNORECASE | re.MULTILINE)
for matchNum, match in enumerate(matches):
    matchNum = matchNum + 1
    print ("Match {matchNum} was found at {start}-{end}: {match}".format(matchNum = matchNum, start = match.start(), end = match.end(), match = match.group()))
This works by converting the string into a list, then converting it back to string, joining it on +, and appending the same. The resulting regex will be ^d+a+y+$. Since the input you presented is separated by newline characters, I've added re.MULTILINE.
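Since the question mentions having a lot of words, the same construction can be applied to a whole word list. This is only a sketch under that assumption; build_lookup and base_word are hypothetical helper names, not part of any library.

```python
import re

def build_lookup(words):
    # One compiled pattern per base word, e.g. "day" -> d+a+y+ (case-insensitive).
    return {
        word: re.compile("".join(re.escape(ch) + "+" for ch in word), re.IGNORECASE)
        for word in words
    }

def base_word(text, lookup):
    # Return the first base word whose stretched pattern matches all of text.
    for word, pattern in lookup.items():
        if pattern.fullmatch(text):
            return word
    return None

lookup = build_lookup(["day", "night"])
for text in ["Ddaayy", "Niiight", "dy"]:
    print(text, "->", base_word(text, lookup))
```

Compiling the patterns once and reusing them avoids rebuilding the regex for every input string.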

Related

How do I remove a string that starts with '#' and ends with a blank character by using regular expressions in Python?

So I have this text:
"#Natalija What a wonderful day, isn't it #Kristina123 ?"
I tried to remove these two substrings that start with the character '#' by using re.sub function but it didn't work.
How do I remove the substrings that start with this character?
Try this regex :
import re
text = "#Natalija What a wonderful day, isn't it #Kristina123 ?"
t = re.sub('#.*? ', '', text)
print(t)
OUTPUT :
What a wonderful day, isn't it ?
This should work.
# matches the character #
\w+ matches word characters as many times as possible, so it stops at the first non-word character, such as a blank
Code:
import re
regex = r"#\w+"
subst = "XXX"
test_str = "#Natalija What a wonderful day, isn't it #Kristina123 ?"
result = re.sub(regex, subst, test_str, count=0, flags=re.MULTILINE)
print (result)
output:
XXX What a wonderful day, isn't it XXX ?
It's possible to do it with re.sub(), it would be something like this:
import re
text = "#Natalija What a wonderful day, isn't it #Kristina123 ?"
output = re.sub(r'#[a-zA-Z0-9]+\s', '', text)
print(output) # Output: What a wonderful day, isn't it ?
# matches the # character
[a-zA-Z0-9] matches alphanumerical (uppercase and lowercase)
"+" means "one or more" (otherwise it would match only one of those characters)
\s matches whitespaces
Alternatively, this can also be done without using the module re. You can first split the sentence into words. Then remove the words containing the # character and finally join the words into a new sentence.
if __name__ == '__main__':
    original_text = "#Natalija What a wonderful day, isn't it #Kristina123 ?"
    individual_words = original_text.split(' ')
    words_without_tags = [word for word in individual_words if '#' not in word]
    new_sentence = ' '.join(words_without_tags)
    print(new_sentence)
I think this will work for you. The pattern #\w+?\s matches expressions that start with #, continue with one or more word characters, and end with a single whitespace character (the whitespace is required, not optional).
import re
string = "#Natalija What a wonderful day, isn't it #Kristina123 ?"
pattern = r'#\w+?\s'
replaced = re.sub(pattern, '', string)
print(replaced)
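Several of the patterns above require a whitespace character after the tag, so a tag at the very end of the text would be left in place. One possible variant, shown here as a sketch with a hypothetical helper name strip_tags, makes the trailing whitespace optional:

```python
import re

def strip_tags(text):
    # '#\w+' matches the tag itself; '\s*' also consumes any whitespace after it.
    # Zero repetitions are allowed, so a tag at the end of the text still matches.
    return re.sub(r"#\w+\s*", "", text)

print(strip_tags("#Natalija What a wonderful day, isn't it #Kristina123 ?"))
print(repr(strip_tags("See you tomorrow #Kristina123")))
```

The second call leaves the space that preceded the tag, so a final .strip() may be wanted depending on the input.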

Python how to find a substring in a string and print the whole string containing the substring

I am struggling to find a solution to print a string which contains a particular substring. So e.g. I have a string
mystr = "<tag> name = mon_this_is_monday value = 10 </tag>"
I want to search for "mon" in the string above and print "mon_this_is_monday" but not sure how to do it
I tried doing
pattern = re.compile('mon_')
try:
    match = re.search(pattern, mystr).group(0)
    print(match)
except AttributeError:
    print('No match')
but this just gives mon_ as output for match. How do I get the whole string "mon_this_is_monday" as output?
We could try using re.findall with the pattern \b\w*mon\w*\b:
import re
mystr = "<tag> name = mon_this_is_monday value = 10 </tag>"
matches = re.findall(r'\b\w*mon\w*\b', mystr)
print(matches)
This prints:
['mon_this_is_monday']
The regex pattern matches:
\b a word boundary (i.e. the start of the word)
\w* zero or more word characters (letters, numbers, or underscore)
mon the literal text 'mon'
\w* zero or more word characters, again
\b another word boundary (the end of the word)
print([string for string in mystr.split(" ") if "mon" in string])
You can also use re.search with the same kind of pattern:
import re
mystr = "<tag> name = mon_this_is_monday value = 10 </tag>"
abc = re.search(r"\b(\w*mon\w*)\b",mystr)
print(abc.group(0))
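To reuse the word-boundary pattern for an arbitrary search fragment, it can be wrapped in a small function. This is a sketch only; words_containing is a hypothetical name, and re.escape is added on the assumption that the fragment should be treated as literal text.

```python
import re

def words_containing(text, fragment):
    # \b...\b limits each match to one whole word; re.escape keeps characters
    # such as '.' or '+' in the fragment literal.
    pattern = r"\b\w*" + re.escape(fragment) + r"\w*\b"
    return re.findall(pattern, text)

mystr = "<tag> name = mon_this_is_monday value = 10 </tag>"
print(words_containing(mystr, "mon"))
print(words_containing("monday sermon money", "mon"))
```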

Python Regex - get words around match

I want to get the words before and after my match. I could use string.split(' ') - but as I already use regex, isn't there a much better way using only regex?
Using a match object, I can get the exact location. However, this location is character indexed.
import re
myString = "this. is 12my90\nExample string"
pattern = re.compile(r"(\b12(\w+)90\b)",re.IGNORECASE | re.UNICODE)
m = pattern.search(myString)
print("Hit: "+m.group())
print("Index range: "+str(m.span()))
print("Words around match: "+myString[m.start()-1:m.end()+1]) # should be +/-1 in _words_, not characters
Output:
Hit: 12my90
Index range: (9, 15)
Words around match: 12my90
For getting the matching word and the word before, I tried:
pattern = re.compile(r"(\b(w+)\b)\s(\b12(\w+)90\b)",re.IGNORECASE | re.UNICODE)
Which yields no matches.
In the second pattern you have to escape the w+ like \w+.
Apart from that, there is a newline in your example which you can match using another following \s
Your pattern with 3 capturing groups might look like
(\b\w+\b)\s(\b12\w+90\b)\s(\b\w+\b)
You could use the capturing groups to get the values
print("Words around match: " + m.group(1) + " " + m.group(3))
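Put together with the string from the question, the three-group pattern can be tried as follows. This is a minimal sketch; the variable names are illustrative.

```python
import re

my_string = "this. is 12my90\nExample string"
# Group 1: word before the hit, group 2: the hit, group 3: word after the hit.
# Each \s matches a single whitespace character, which includes the newline.
pattern = re.compile(r"(\b\w+\b)\s(\b12\w+90\b)\s(\b\w+\b)", re.IGNORECASE)

m = pattern.search(my_string)
if m:
    before, hit, after = m.groups()
    print("Hit:", hit)
    print("Words around match:", before, after)
```

Note that this form needs a word on both sides, so it would not match if the hit were the first or last word of the string.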
The newline character is missing from your pattern. Try:
regx = r"(\w+)\s12(\w+)90\n(\w+)"

How to Find Words Not Containing Specific Letters?

I'm trying to write a code using regex and my text file. My file contains these words line by line:
nana
abab
nanac
eded
My goal is to display the words that do not contain any of the letters in a given substring.
For example, if my substring is "bn", my output should be only eded, because nana and nanac contain "n" and abab contains "b".
I have written some code, but it only checks the first letter of my substring:
import re
substring = "bn"
def xstring():
    with open("deneme.txt") as f:
        for line in f:
            for word in re.findall(r'\w+', line):
                for letter in substring:
                    if len(re.findall(letter, word)) == 0:
                        print(word)
                        #yield word
xstring()
How do I solve this problem?
Here, we would just want to have a simple expression such as:
^[^bn]+$
We put b and n in a negated character class, [^bn], which matches any character other than those two. Anchoring it with ^ and $ and requiring one or more such characters means that any string containing b or n fails to match.
Test
# coding=utf8
# the above tag defines encoding for this document and is for Python 2.x compatibility
import re
regex = r"^[^bn]+$"
test_str = ("nana\n"
"abab\n"
"nanac\n"
"eded")
matches = re.finditer(regex, test_str, re.MULTILINE)
for matchNum, match in enumerate(matches, start=1):
    print ("Match {matchNum} was found at {start}-{end}: {match}".format(matchNum = matchNum, start = match.start(), end = match.end(), match = match.group()))
    for groupNum in range(0, len(match.groups())):
        groupNum = groupNum + 1
        print ("Group {groupNum} found at {start}-{end}: {group}".format(groupNum = groupNum, start = match.start(groupNum), end = match.end(groupNum), group = match.group(groupNum)))
# Note: for Python 2.7 compatibility, use ur"" to prefix the regex and u"" to prefix the test string and substitution.
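The negated character class can also be built from whatever letters are given rather than hard-coding bn. The following is a sketch under the assumption that the letter string is non-empty; words_without is a hypothetical helper name.

```python
import re

def words_without(letters, words):
    # Build a negated class such as [^bn]; re.escape guards characters like ']' or '^'.
    # Assumes letters is non-empty, since "[^]+" is not a valid pattern.
    allowed = re.compile("[^" + re.escape(letters) + "]+")
    # fullmatch succeeds only if every character of the word is outside the class.
    return [word for word in words if allowed.fullmatch(word)]

print(words_without("bn", ["nana", "abab", "nanac", "eded"]))
```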
RegEx
If this expression wasn't desired, it can be modified/changed in regex101.com.
@Xosrov has the right approach, with a few minor issues and typos. The version below uses the same logic and works:
import re
def xstring(substring, words):
    regex = re.compile('[%s]' % ''.join(sorted(set(substring))))
    # Excluding words matching regex.pattern
    for word in words:
        if not re.search(regex, word):
            print(word)
words = [
    'nana',
    'abab',
    'nanac',
    'eded',
]
xstring("bn", words)
If you want to check if a string has a set of letters, use brackets.
For example using [bn] will match words that contain one of those letters.
import re
substring = "bn"
regex = re.compile('[' + substring + ']')
def xstring():
    with open("deneme.txt") as f:
        for line in f:
            if re.search(regex, line) is None:
                print(line)
xstring()
It might not be the most efficient approach, but you could use a set intersection. The following snippet prints word only if it contains neither 'b' nor 'n':
if not (set(word) & set('bn')):
    print(word)
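The set idea can be applied to the whole word list without regex. This sketch swaps the explicit intersection for set.isdisjoint, which expresses the same test; words_avoiding is a hypothetical helper name.

```python
def words_avoiding(letters, words):
    banned = set(letters)
    # isdisjoint is True when the word shares no character with the banned set.
    return [word for word in words if banned.isdisjoint(word)]

print(words_avoiding("bn", ["nana", "abab", "nanac", "eded"]))
```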

How to python regex match the following?

1<assume tab here>Algebra I<assume tab here>START
1.1 What are the Basic Numbers? 1-1
For each of the two lines above, how do I regex match only the number up to and including the "?". In essence, I want the following groups:
["1", "Algebra I"]
["1.1", "What are the Basic Numbers?"]
Matching everything up to and including a question mark, or up to a "tab character".
How can I do this with a single regex?
Here's an easy regex:
^([\d.]+)\s*([^\t?]+\??)
Group 1 is the numbers, Group 2 contains the text.
To retrieve one single match:
match = re.search(r"^([\d.]+)\s*([^\t?]+\??)", s)
if match:
    mynumbers = match.group(1)
    myline = match.group(2)
To iterate over the matches, get groups 1 and 2 from:
reobj = re.compile(r"^([\d.]+)\s*([^\t?]+\??)", re.MULTILINE)
for match in reobj.finditer(s):
    # matched text: match.group()
    print(match.group(1), match.group(2))
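As a quick check, the pattern can be run over both sample lines at once. The sketch below assumes the first line is tab-separated, as the question describes, and collects the two groups per line; the variable names are illustrative.

```python
import re

# First line uses tabs as separators; the second is copied from the question.
s = "1\tAlgebra I\tSTART\n1.1 What are the Basic Numbers? 1-1"

reobj = re.compile(r"^([\d.]+)\s*([^\t?]+\??)", re.MULTILINE)
# One [number, text] pair per matching line.
rows = [list(m.groups()) for m in reobj.finditer(s)]
print(rows)
```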
Here you go:
(\d(?:\.\d)*)\s+(?:(.*?\?|.*?)\t)
Explanation: (\d(?:\.\d)*) matches a digit followed by zero or more .\d sequences. This is followed by one or more whitespace characters, then a capturing group that lazily matches either text ending in ? (.*?\?) or any text (.*?). The enclosing non-capturing group requires a tab after that text.
Output:
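import re
pattern = r"(\d(?:\.\d)*)\s+(?:(.*?\?|.*?)\t)"
string1 = '1.1\tWhat are the Basic Numbers?\t1-1'  # assumed tab-separated like string2; the pattern requires a tab after the text
string2 = '1\tAlgebra I\tSTART'
m = re.match(pattern, string2)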
m.group(1)
#'1'
m.group(2)
#'Algebra I'
m = re.match(pattern, string1)
m.group(1)
#'1.1'
m.group(2)
#'What are the Basic Numbers?'
EDIT: added non-capturing groups.
EDIT#2: fixed it to include question mark
EDIT#3 fixed no of groups.
