How to find a specific pattern in a paragraph in Python?

I want to find words matching a specific pattern in a paragraph. Each match must contain at least one letter (a-zA-Z) and at least one digit (0-9), and must be at least 5 characters long. How can I implement this in Python?
My code is:
str = "I love5 verye mu765ch"
print(re.findall('(?=.*[0-9])(?=.*[a-zA-Z]{5,})',str))
This returns only empty strings instead of the words I want.
Expected result:
love5
mu765ch
Valid patterns look like:
9aacbe
aver23893dk
asdf897

This is easily done with some programming logic and a simple regex:
import re
string = "I love5 verye mu765ch a123...bbb"
pattern = re.compile(r'(?=\D*\d)(?=[^a-zA-Z]*[a-zA-Z]).{5,}')
interesting = [word for word in string.split() if pattern.match(word)]
print(interesting)
This yields
['love5', 'mu765ch', 'a123...bbb']
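If tokens such as a123...bbb should be rejected, one option is a single pattern that only accepts runs of letters and digits. The following is a minimal sketch, assuming a valid token is purely alphanumeric (ASCII); the names TOKEN and find_tokens are made up for illustration and do not come from the answer above.

```python
import re

# Assumption: a valid token consists only of ASCII letters and digits.
TOKEN = re.compile(
    r'\b(?=[a-zA-Z]*\d)'   # the run contains at least one digit
    r'(?=\d*[a-zA-Z])'     # the run contains at least one letter
    r'[a-zA-Z0-9]{5,}\b'   # the run itself: 5 or more letters/digits
)

def find_tokens(text):
    """Return all alphanumeric tokens of length >= 5 that mix letters and digits."""
    return TOKEN.findall(text)

print(find_tokens("I love5 verye mu765ch a123...bbb"))
# ['love5', 'mu765ch']
```

Unlike the split-and-match approach, this scans the paragraph directly, so punctuation next to a word does not become part of the match.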

Related

Filtering a list of strings using regex

I have a list of strings that looks like this,
strlist = [
'list/category/22',
'list/category/22561',
'list/category/3361b',
'list/category/22?=1512',
'list/category/216?=591jf1!',
'list/other/1671',
'list/1y9jj9/1yj32y',
'list/category/91121/91251',
'list/category/0027',
]
I want to use a regex to find the strings in this list that consist of list/category/ followed by an integer of any length and nothing else: no letters or symbols may follow the integer.
So in my example, the output should look like this
list/category/22
list/category/22561
list/category/0027
I used the following code:
newlist = []
for i in strlist:
    if re.match('list/category/[0-9]+[0-9]',i):
        newlist.append(i)
        print(i)
but this is my output:
list/category/22
list/category/22561
list/category/3361b
list/category/22?=1512
list/category/216?=591jf1!
list/category/91121/91251
list/category/0027
How do I fix my regex? And also is there a way to do this in one line using a filter or match command instead of a for loop?
You can try the below regex:
^list\/category\/\d+$
Explanation of the above regex:
^ - Represents the start of the given test String.
\d+ - Matches digits that occur one or more times.
$ - Matches the end of the test string. This is the part your regex missed.
IMPLEMENTATION IN PYTHON
import re
pattern = re.compile(r"^list\/category\/\d+$", re.MULTILINE)
match = pattern.findall("list/category/22\n"
"list/category/22561\n"
"list/category/3361b\n"
"list/category/22?=1512\n"
"list/category/216?=591jf1!\n"
"list/other/1671\n"
"list/1y9jj9/1yj32y\n"
"list/category/91121/91251\n"
"list/category/0027")
print(match)
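The question also asks for a one-line alternative to the for loop. A minimal sketch, assuming the strings stay in the original list rather than being joined with newlines: re.fullmatch (or the fullmatch method of a compiled pattern) only succeeds when the whole string fits, so no ^ or $ anchors are needed.

```python
import re

strlist = [
    'list/category/22',
    'list/category/22561',
    'list/category/3361b',
    'list/category/22?=1512',
    'list/category/216?=591jf1!',
    'list/other/1671',
    'list/1y9jj9/1yj32y',
    'list/category/91121/91251',
    'list/category/0027',
]

pattern = re.compile(r'list/category/\d+')

# fullmatch() requires the entire string to match the pattern
newlist = [s for s in strlist if pattern.fullmatch(s)]
print(newlist)
# ['list/category/22', 'list/category/22561', 'list/category/0027']
```

The same filter can be written as list(filter(pattern.fullmatch, strlist)) if a filter call is preferred over a comprehension.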

Getting word from string

How can I get the word example from a string like this:
str = "http://test-example:123/wd/hub"
I wrote something like this:
print(str[10:str.rfind(':')])
but it doesn't work correctly if the string changes to something like
"http://tests-example:123/wd/hub"
You can use this regex to capture the value preceded by - and followed by : using lookarounds
(?<=-).+(?=:)
Python code,
import re
str = "http://test-example:123/wd/hub"
print(re.search(r'(?<=-).+(?=:)', str).group())
Outputs,
example
Non-regex way to get the same is using these two splits,
str = "http://test-example:123/wd/hub"
print(str.split(':')[1].split('-')[1])
Prints,
example
You can use the following non-regex approach if you know that example is a 7-letter word (s is the input string):
s.split('-')[1][:7]
For an arbitrary word, that changes to:
s.split('-')[1].split(':')[0]
There are many ways.
Using splitting:
example_str = str.split('-')[-1].split(':')[0]
This is fragile, and could break if there are more hyphens or colons in the string.
using regex:
import re
pattern = re.compile(r'-(.*):')
example_str = pattern.search(str).group(1)
This still expects a particular format, but is more easily adaptable (if you know how to write regexes).
I am not sure why you want to get a particular word from a string. I guess you want to check whether this word occurs in the given string.
If that is the case, the code below can be used.
import re
str1 = "http://tests-example:123/wd/hub"
matched = re.findall('example',str1)
Split on the -, and then on :
s = "http://test-example:123/wd/hub"
print(s.split('-')[1].split(':')[0])
#example
Using re:
import re
text = "http://test-example:123/wd/hub"
m = re.search('(?<=-).+(?=:)', text)
if m:
    print(m.group())
Python strings have a built-in find method:
a="http://test-example:123/wd/hub"
b="http://test-exaaaample:123/wd/hub"
print(a.find('example'))
print(b.find('example'))
prints:
12
-1
This is the index of the substring that was found; -1 means the substring does not occur in the string. You can also use the in keyword:
'example' in 'http://test-example:123/wd/hub'
True
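Several of the answers above note that splitting on - and : is fragile. One way to reduce that fragility, sketched here under the assumption that the input is always a URL and that the wanted word is the part of the host name after the first hyphen, is to let urllib.parse isolate the host first. The function name word_after_hyphen is hypothetical.

```python
from urllib.parse import urlsplit

def word_after_hyphen(url):
    """Return the part of the host name after its first hyphen ('' if none)."""
    # hostname drops the scheme, port and path (and lowercases the host)
    host = urlsplit(url).hostname or ""
    _, _, word = host.partition('-')
    return word

print(word_after_hyphen("http://test-example:123/wd/hub"))   # example
print(word_after_hyphen("http://tests-example:123/wd/hub"))  # example
```

Because the port and path are removed before splitting, extra colons or hyphens later in the URL no longer affect the result.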

How to extract function name python regex

Hello, I am trying to extract a function name in Python using a regex, but I am new to Python and nothing seems to work. For example, given the string "def myFunction(s): ....", I want to return just myFunction.
import re
def extractName(s):
    string = []
    regexp = re.compile(r"\s*(def)\s+\([^\)]*\)\s*{?\s*")
    for m in regexp.finditer(s):
        string += [m.group()]
    return string
Assumption: You want the name myFunction from "...def myFunction(s):..."
I find something missing in your regex and the way it is structured.
\s*(def)\s+\([^\)]*\)\s*{?\s*
Let's look at it step by step:
\s*: matches zero or more whitespace characters.
(def): matches the word def.
\s+: matches one or more whitespace characters.
\([^\)]*\): matches a parenthesized argument list such as (s).
\s*: matches zero or more whitespace characters.
What follows hardly matters if you only want the function name. The problem is that nothing in the regex matches the name itself: after def and the whitespace, it expects ( immediately.
If you want to do this with a regex, you can try:
\s*(def)\s([a-zA-Z]*)\([a-zA-Z]*\)
With this structure, group 0 is def myFunction(s), group 1 is def and group 2 is myFunction. So you can use the following code to get your result:
import re
def extractName(s):
    string = ""
    regexp = re.compile(r"(def)\s([a-zA-Z]*)\([a-zA-Z]*\)")
    for m in regexp.finditer(s):
        string += m.group(2)  # group 2 holds the function name
    return string
Hope it helps!
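The pattern above only accepts names and arguments made of letters. As a hedged variation, the sketch below assumes function names have the usual identifier shape (a letter or underscore followed by word characters) and ignores whatever is inside the parentheses. DEF_NAME and extract_names are illustrative names, not part of the original answer.

```python
import re

# Assumption: each definition starts a line (after optional indentation)
# and the name is a letter or underscore followed by word characters.
DEF_NAME = re.compile(r'^\s*def\s+([A-Za-z_]\w*)\s*\(', re.MULTILINE)

def extract_names(source):
    """Return the names of all functions defined in the given source text."""
    return DEF_NAME.findall(source)

code = "def myFunction(s):\n    return s\n\ndef _helper2(a, b=1):\n    pass\n"
print(extract_names(code))
# ['myFunction', '_helper2']
```

A regex like this can still be fooled by the word def inside a string or comment; for real Python source, parsing with the standard ast module is the more reliable route.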

How to count sentences taking into account the occurrence of ellipses

I've written the following script to count the number of sentences in a text file:
import re
filepath = 'sample_text_with_ellipsis.txt'
with open(filepath, 'r') as f:
    read_data = f.read()
sentences = re.split(r'[.{1}!?]+', read_data.replace('\n',''))
sentences = sentences[:-1]
sentence_count = len(sentences)
However, if I run it on a sample_text_with_ellipsis.txt with the following content:
Wait for it... awesome!
I get sentence_count = 2 instead of 1, because it does not ignore the ellipsis (i.e., the "...").
What I tried to do in the regex is to make it match only one occurrence of a period through .{1}, but this apparently doesn't work the way I intended it. How can I get the regex to ignore ellipses?
Splitting sentences with a regex like this is not robust in general. See "Python split text on sentences" for how NLTK can be leveraged for this.
To answer your question: you treat a sequence of three dots as an ellipsis, so you need to use
[!?]+|(?<!\.)\.(?!\.)
The . is moved out of the character class, because quantifiers have no special meaning inside one (there, {, 1 and } are just literal characters), and a . is matched only when it is not adjacent to another dot.
[!?]+ - 1 or more ! or ?
| - or
(?<!\.)\.(?!\.) - a dot that is neither preceded ((?<!\.)) nor followed ((?!\.)) by a dot.
Python demo:
import re
sentences = re.split(r'[!?]+|(?<!\.)\.(?!\.)', "Wait for it... awesome!".replace('\n',''))
sentences = sentences[:-1]
sentence_count = len(sentences)
print(sentence_count) # => 1
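The same regex can be wrapped in a small helper for reuse. This is a sketch with one deliberate difference from the demo above: instead of dropping the last element, it counts every non-empty piece, so trailing text without a terminator is counted as a sentence. The name count_sentences is hypothetical.

```python
import re

# A run of ! or ?, or a single dot with no dot on either side
SENTENCE_END = re.compile(r'[!?]+|(?<!\.)\.(?!\.)')

def count_sentences(text):
    """Count the non-empty pieces left after splitting on sentence terminators."""
    pieces = SENTENCE_END.split(text.replace('\n', ' '))
    return sum(1 for piece in pieces if piece.strip())

print(count_sentences("Wait for it... awesome!"))               # 1
print(count_sentences("Wait for it... awesome! Really? Yes."))  # 3
```

Abbreviations such as "e.g." are still split at their single dots, which is one reason a tokenizer like NLTK's is usually the safer choice.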
Following Wiktor's suggestion to use NLTK, I also came up with the following alternative solution:
import nltk
read_data="Wait for it... awesome!"
sentence_count = len(nltk.tokenize.sent_tokenize(read_data))
This yields a sentence count of 1 as expected.

How can I grab all terms beginning with '#'?

I have a string like so: "sometext #Syrup #nshit #thebluntislit"
and I want to get a list of all terms starting with '#'.
I used the following code:
import re
line = "blahblahblah #Syrup #nshit #thebluntislit"
ht = re.search(r'#\w*', line)
ht = ht.group(0)
print(ht)
and I get the following:
#Syrup
I was wondering if there is a way that I could instead get a list like:
[#Syrup,#nshit,#thebluntislit]
for all terms starting with '#' instead of just the first term.
A regular expression is not needed here; in Python a list comprehension is enough:
hashed = [ word for word in line.split() if word.startswith("#") ]
You can use
compiled = re.compile(r'#\w*')
compiled.findall(line)
Output:
['#Syrup', '#nshit', '#thebluntislit']
But there is a problem. If you search the string like 'blahblahblah #Syrup #nshit #thebluntislit beg#end', the output will be ['#Syrup', '#nshit', '#thebluntislit', '#end'].
This problem may be addressed by using positive lookbehind:
compiled = re.compile(r'(?<=\s)#\w*')
(It's not possible to use \b (word boundary) here, since # is not among the \w characters [0-9a-zA-Z_] that make up a word, so there is no word boundary between a space and #.)
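One caveat with (?<=\s) is that it requires a whitespace character before the #, so a term at the very start of the string is missed. A possible variation, shown here as a sketch rather than as the answer's own method, is the negative lookbehind (?<!\S), which accepts either whitespace or the start of the string. It also uses \w+ instead of \w* so that a bare # is not returned. The names HASHTAG and hashtags are illustrative.

```python
import re

# (?<!\S): the previous character, if any, must be whitespace
HASHTAG = re.compile(r'(?<!\S)#\w+')

def hashtags(line):
    """Return all #terms that start at the beginning of the line or after whitespace."""
    return HASHTAG.findall(line)

print(hashtags("#first blahblahblah #Syrup #nshit #thebluntislit beg#end"))
# ['#first', '#Syrup', '#nshit', '#thebluntislit']
```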
Looks like re.findall() will do what you want.
matches = re.findall(r'#\w*', line)
