I've written the following script to count the number of sentences in a text file:
import re
filepath = 'sample_text_with_ellipsis.txt'
with open(filepath, 'r') as f:
    read_data = f.read()
sentences = re.split(r'[.{1}!?]+', read_data.replace('\n',''))
sentences = sentences[:-1]
sentence_count = len(sentences)
However, if I run it on a sample_text_with_ellipsis.txt with the following content:
Wait for it... awesome!
I get sentence_count = 2 instead of 1, because it does not ignore the ellipsis (i.e., the "...").
What I tried to do in the regex is to make it match only one occurrence of a period through .{1}, but this apparently doesn't work the way I intended. How can I get the regex to ignore ellipses?
Splitting sentences with a regex like this is not enough. See Python split text on sentences to see how NLTK can be leveraged for this.
To answer your question: what you call an ellipsis is a three-dot sequence, so you need to use
[!?]+|(?<!\.)\.(?!\.)
See the regex demo. The . is moved out of the character class since you can't use quantifiers inside one, and only a . that is not surrounded by other dots is matched.
[!?]+ - 1 or more ! or ?
| - or
(?<!\.)\.(?!\.) - a dot that is neither preceded ((?<!\.)) nor followed ((?!\.)) by a dot.
See Python demo:
import re
sentences = re.split(r'[!?]+|(?<!\.)\.(?!\.)', "Wait for it... awesome!".replace('\n',''))
sentences = sentences[:-1]
sentence_count = len(sentences)
print(sentence_count) # => 1
Following Wiktor's suggestion to use NLTK, I also came up with the following alternative solution:
import nltk
read_data="Wait for it... awesome!"
sentence_count = len(nltk.tokenize.sent_tokenize(read_data))
This yields a sentence count of 1 as expected.
Related
I want to find a specific pattern in a paragraph. The pattern must contain letters (a-zA-Z) and digits (0-9) and be at least 5 characters long. How can I implement this in Python?
My code is:
import re

str = "I love5 verye mu765ch"
print(re.findall('(?=.*[0-9])(?=.*[a-zA-Z]{5,})',str))
This returns only empty matches.
Expected result like:
love5
mu765ch
Valid patterns look like:
9aacbe
aver23893dk
asdf897
This is easily done with some programming logic and a simple regex:
import re
string = "I love5 verye mu765ch a123...bbb"
pattern = re.compile(r'(?=\D*\d)(?=[^a-zA-Z]*[a-zA-Z]).{5,}')
interesting = [word for word in string.split() if pattern.match(word)]
print(interesting)
This yields
['love5', 'mu765ch', 'a123...bbb']
See a demo on ideone.com.
Hello, I am trying to extract the function name in Python using regex; however, I am new to Python and nothing seems to be working for me. For example, if I have the string "def myFunction(s): ...." I want to return just myFunction.
import re
def extractName(s):
    string = []
    regexp = re.compile(r"\s*(def)\s+\([^\)]*\)\s*{?\s*")
    for m in regexp.finditer(s):
        string += [m.group()]
    return string
Assumption: You want the name myFunction from "...def myFunction(s):..."
There is something missing in your regex, and in the way it is structured.
\s*(def)\s+\([^\)]*\)\s*{?\s*
Let's look at it step by step:
\s*: matches zero or more whitespace characters.
(def): matches the word def.
\s+: matches one or more whitespace characters.
\([^\)]*\): matches a pair of parentheses with anything except ) inside.
\s*: matches zero or more whitespace characters.
Whatever comes after that hardly matters if you only want the name of the function. The point is that your regex never matches or captures the name itself.
You can try this regex if you are interested in doing it by regex:
\s*(def)\s([a-zA-Z]*)\([a-zA-Z]*\)
With the regex structured this way, you will get def myFunction(s) in group(0), def in group(1), and myFunction in group(2). So you can use the following code to get your result:
import re
def extractName(s):
    string = ""
    regexp = re.compile(r"(def)\s([a-zA-Z]*)\([a-zA-Z]*\)")
    for m in regexp.finditer(s):
        string += m.group(2)
    return string
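A quick sanity check of the function above against the string from the question (the trailing "...." doesn't matter, since the regex stops at the closing parenthesis):
print(extractName("def myFunction(s): ....")) # => myFunction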
You can check your regex live by going to this site.
Hope it helps!
My code does the following:
Take a large text file (e.g., a legal document that is 300 pages as a PDF).
Find a certain keyword (e.g. "small").
Return n words to the left and n words to the right of the keyword.
NOTE: In this context, a "word" is any string of non-space characters. "$cow123" would be a word, but "health care" would be two words.
Here is my problem:
The code takes an extremely long time to run on the 300 pages, and that time tends to increase very quickly as n increases.
Here is my code:
import re

fileHandle = open('test_pdf.txt', mode='r')
document = fileHandle.read()

def search(searchText, doc, n):
    # Searches for text, and retrieves n words either side of the text, which are returned separately
    surround = r"\s*(\S*)\s*"
    groups = re.search(r'{}{}{}'.format(surround * n, searchText, surround * n), doc).groups()
    return groups[:n], groups[n:]
Here is the nasty culprit:
print search("\$27.5 million", document, 10)
Here's how you can test this code:
Copy the function definition from the code block above and run the following:
t = "The world is a small place, we $.205% try to take care of it."
print search("\$.205", t, 3)
I suspect that I have a nasty case of catastrophic backtracking, but I'm too new to regex to point my finger on the problem.
How do I speed up my code?
How about using re.search (or even string.find if you're only searching for fixed strings) to find the string, without any surrounding capturing groups? Then you use the position and length of the match (.start and .end on a re match object, or the return value of find plus the length of the search string). Get the substring before the match and do /\s*(\S*)\s*\z/ etc. on it, and get the substring after the match and do /\A\s*(\S*)\s*/ etc. on it.
Also, for help with your backtracking: you can use a pattern like \s+\S+\s+ instead of \s*\S*\s* (two chunks of whitespace have to be separated by a non-zero amount of non-whitespace, or else they wouldn't be two chunks), and you shouldn't butt up two consecutive \s*s like you do. I think r'\S+'.join([r'\s+'] * n) would give the right pattern for capturing n previous words (but my Python is rusty, so check that).
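If it helps, here is a minimal sketch of that idea, tried against the test string from the question; the name search_context is mine, and instead of the anchored patterns mentioned above it simply pulls the neighbouring words out of the two substrings with re.findall(r'\S+', ...):
import re

def search_context(search_text, doc, n):
    # Find the keyword itself first, with no surrounding groups.
    m = re.search(search_text, doc)
    if m is None:
        return None
    # Everything before and after the match, located via .start()/.end().
    before, after = doc[:m.start()], doc[m.end():]
    # Last n words before the match and first n words after it.
    return re.findall(r'\S+', before)[-n:], re.findall(r'\S+', after)[:n]

t = "The world is a small place, we $.205% try to take care of it."
print(search_context(r"\$.205", t, 3))
# => (['small', 'place,', 'we'], ['%', 'try', 'to'])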
I see several problems here. The first, and probably worst, is that everything in your "surround" regex is not just optional, but independently optional. Given this string:
"Lorem ipsum tritani impedit civibus ei pri"
...when searchText = "tritani" and n = 1, this is what it has to go through before it finds the first match:
regex: \s* \S* \s* tritani
offset 0: '' 'Lorem' ' ' FAIL
'' 'Lorem' '' FAIL
'' 'Lore' '' FAIL
'' 'Lor' '' FAIL
'' 'Lo' '' FAIL
'' 'L' '' FAIL
'' '' '' FAIL
...then it bumps ahead one position and starts over:
offset 1: '' 'orem' ' ' FAIL
'' 'orem' '' FAIL
'' 'ore' '' FAIL
'' 'or' '' FAIL
'' 'o' '' FAIL
'' '' '' FAIL
... and so on. According to RegexBuddy's debugger, it takes almost 150 steps to reach the offset where it can make the first match:
position 5: ' ' 'ipsum' ' ' 'tritani'
And that's with just one word to skip over, and with n=1. If you set n=2 you end up with this:
\s*(\S*)\s*\s*(\S*)\s*tritani\s*(\S*)\s*\s*(\S*)\s*
I'm sure you can see where this is going. Note especially that when I change it to this:
(?:\s+)(\S+)(?:\s+)(\S+)(?:\s+)tritani(?:\s+)(\S+)(?:\s+)(\S+)(?:\s+)
...it finds the first match in a little over 20 steps. This is one of the most common regex anti-patterns: using * when you should be using +. In other words, if it's not optional, don't treat it as optional.
Finally, you may have noticed the \s*\s* in the auto-generated regex: butting two \s* quantifiers up against each other like that is redundant and only gives the engine more ways to backtrack.
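For reference, this is how the format() call from the question ends up producing those doubled \s* runs (using the surround string from the question and n = 2):
surround = r"\s*(\S*)\s*"
print('{}{}{}'.format(surround * 2, 'tritani', surround * 2))
# => \s*(\S*)\s*\s*(\S*)\s*tritani\s*(\S*)\s*\s*(\S*)\s*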
You could try using mmap and appropriate regex flags, e.g. (untested):
import re
import mmap
with open('your file') as fin:
    mf = mmap.mmap(fin.fileno(), 0, access=mmap.ACCESS_READ)
    for match in re.finditer(your_re, mf, flags=re.DOTALL):
        print match.group()  # do something with your match
This'll only keep memory usage lower though...
The alternative is to have a sliding window of words (this simple example looks at just a single word before and after):
import re
import mmap
from itertools import islice, tee, izip_longest
with open('testingdata.txt') as fin:
    mf = mmap.mmap(fin.fileno(), 0, access=mmap.ACCESS_READ)
    words = (m.group() for m in re.finditer(r'\w+', mf, flags=re.DOTALL))
    grouped = [islice(el, idx, None) for idx, el in enumerate(tee(words, 3))]
    for group in izip_longest(*grouped, fillvalue=''):
        if group[1] == 'something':  # check criteria for group
            print group
I think you are going about this completely backwards (I'm a little confused as to what you are doing in the first place!)
I would recommend checking out the re_search function I developed in the textools module of my cloud toolbox.
With re_search you could solve this problem with something like:
from cloudtb import textools

data_list = textools.re_search('my match', pdf_text_str)  # search for character objects
# you now have a list of strings and RegPart objects. Parse through them:
for i, regpart in enumerate(data_list):
    if isinstance(regpart, basestring):
        words = textools.re_search('\w+', regpart)
        # do stuff with words
    else:
        # I think you are ignoring these? Not totally sure
        pass
Here is a link on how to use it and how it works:
http://cloudformdesign.com/?p=183
In addition to this, your regular expressions would also be printed out in a more readable format.
You might also want to check out my tool Search The Sky or the similar tool Kiki to help you build and understand your regular expressions.
I have a file with such data:
Sentence[0].Sentence[1].Sentence[2].'/n'
Sentence[0].Sentence[1].Sentence[2].'/n'
Sentence[0].Sentence[1].Sentence[2].'/n'
What I want to print out is all of the Sentence[0]s. This is what I have done, but it prints out a blank list.
from nltk import *
import codecs
f=codecs.open('topon.txt','r+','cp1251')
text = f.readlines()
first=[sentence for sentence in text if re.findall('\.\n^Abc',sentence)]
print first
You don't need NLTK for this (nor are you using it). Unless I misunderstand the question, this should do the trick:
with open('topon.txt') as infile:
    for line in infile:
        print line.split('.', 1)[0]
In addition to @inspectorG4dget's answer, you can do it with regexes:
from nltk import *
import codecs
f = codecs.open('a.txt', 'r+', 'cp1251')
text = f.readlines()
print [re.findall('^[^.]+', sentence) for sentence in text]
Splitting a paragraph at periods works only if every sentence ends with a period, and periods are used for nothing else. If you have a lot of real text, neither of these is even close to true. Abbreviations, questions? exclamations! etc. will trip you up a lot. So, use the tool that the nltk provides for this purpose: the function sent_tokenize(). It's not perfect, but it's a whole lot better than looking for periods. If text is your list of paragraphs, you use it like this:
first = []
for par in text:
    sentences = nltk.sent_tokenize(par)
    first.append(sentences[0])
You could fold the above into a list comprehension, but it's not going to be very readable...
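For reference, the folded version mentioned above would be the one-liner below (assuming import nltk, as in the loop above):
first = [nltk.sent_tokenize(par)[0] for par in text]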
I have a string like so: "sometext #Syrup #nshit #thebluntislit"
and I want to get a list of all terms starting with '#'.
I used the following code:
import re
line = "blahblahblah #Syrup #nshit #thebluntislit"
ht = re.search(r'#\w*', line)
ht = ht.group(0)
print ht
and I get the following:
#Syrup
I was wondering if there is a way that I could instead get a list like:
[#Syrup,#nshit,#thebluntislit]
for all terms starting with '#' instead of just the first term.
A regular expression is not needed with a good programming language like Python:
hashed = [ word for word in line.split() if word.startswith("#") ]
You can use
compiled = re.compile(r'#\w*')
compiled.findall(line)
Output:
['#Syrup', '#nshit', '#thebluntislit']
But there is a problem. If you search a string like 'blahblahblah #Syrup #nshit #thebluntislit beg#end', the output will be ['#Syrup', '#nshit', '#thebluntislit', '#end'].
This problem may be addressed by using positive lookbehind:
compiled = re.compile(r'(?<=\s)#\w*')
(it's not possible to use \b (a word boundary) here, since # is not among the \w characters [0-9a-zA-Z_] that make up the words whose boundaries are being matched).
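For illustration, applying that lookbehind pattern to the problematic string from above; #end is dropped because its # is preceded by a letter rather than whitespace:
import re

line = "blahblahblah #Syrup #nshit #thebluntislit beg#end"
print(re.findall(r'(?<=\s)#\w*', line))
# => ['#Syrup', '#nshit', '#thebluntislit']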
Looks like re.findall() will do what you want.
matches = re.findall(r'#\w*', line)