Advanced Sentence Splitting - Python

Python question!
I would like some help splitting up sentences in a text file. I do not want to create a massive chain of if statements, but I need to find a way to split the text file into sentences. I must do this without .readlines().
I would like to split the sentences on periods, question marks, and exclamation points...BUT:
Periods followed by whitespace and then a lowercase letter will not split the sentence
Periods followed immediately by a digit (no intervening whitespace) will not split the sentence
Abbreviations such as Mr., Mrs., Dr., and so on will NOT split the sentence, of course
Neither will letter sequences such as e.g., www.website.com, etc.
And lastly, periods followed by punctuation such as commas or more periods (ellipses) will not split the sentence
I would like to have these split-up sentences from the text file printed out to the user. How would I go about this process? I understand basic string formatting and indices, but handling ellipses, honorifics, etc. is going to make it a bit harder for me...
**Also, I am going to use tkinter to create an open-file button and a drop-down menu that lets the user save the program's output to a new .txt file, one sentence per line.
Thank you!
Here's what I got
import re
punctuation = ['.', '?', '!']
exceptions = ['Mr.', 'Mrs.', 'Ms.', 'Sr.', 'e.g', '...']
lines = []
with open('myData.txt') as myFile:
    # Note: re.split() expects a pattern string and a string to split, not a
    # list and a file object, so the original call did not run. One fix:
    lines = re.split('[' + re.escape(''.join(punctuation)) + ']', myFile.read())

This is my code:
import re
punctuation = re.compile(r"([^\d+])(\.|!|\?|;|\n|。|!|?|;|…| |!|؟|؛)+")
lines = []
with open('myData.txt', 'r', encoding="utf-8") as myFile:
    lines = punctuation.sub(r"\1\2<pad>", myFile.read())
lines = [line.strip() for line in lines.split("<pad>") if line.strip()]
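A minimal sketch of the rule-based split the question describes, assuming a short illustrative honorific list and made-up sample text (this is not a complete solution): split after ., ! or ? only when whitespace and an uppercase letter follow.

```python
import re

# Hedged sketch: the honorific list and sample text are illustrative only.
# Split after ., ! or ? when followed by whitespace and an uppercase letter,
# and never right after Mr., Mrs., Dr. or Ms.
SPLIT_RE = re.compile(r"(?<!Mr\.)(?<!Mrs\.)(?<!Dr\.)(?<!Ms\.)"  # skip honorifics
                      r"(?<=[.!?])\s+(?=[A-Z])")                # split point

text = "Dr. Smith arrived. He said hi! Then we ate. it was 3.14 pies."
for sentence in SPLIT_RE.split(text):
    print(sentence)
```

Note how "it was" stays attached (lowercase after the period), "3.14" survives (digit right after the period), and www.website.com-style tokens are safe because their periods are never followed by whitespace.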


String out of first words in a text file in Python

I cannot solve the following exercise:
In the given function "a_open()" open the file "mytext" and create a string out of the first words in each line of the file. Each word should be separated by a blank (" ").
I am stuck at this point:
def a_open():
    f = open("mytext", "r")
    for line in f:
        print(line.split(' ')[0])
I am aware I should use the function .join but I do not know how. Any suggestions?
Thank you in advance!
Assuming this is Python, you might use an approach like passing the filename to the function.
Create an empty list words outside of the loop to hold all the first words per line.
Per line, split on a space, use strip to remove leading and trailing whitespace and newlines, and filter out the "empty" entries.
If the resulting list is not empty, add its first item to the words list.
After processing all the lines, use join with a space on the words list to return a string of all the words.
def a_open(filename):
    words = []
    for line in open(filename, "r"):
        parts = list(filter(None, line.strip().split(' ')))
        if len(parts):
            words.append(parts[0])
    return ' '.join(words)

print(a_open("mytext"))
If the contents of the file are, for example:
This abc
is def
a
test k lm
The output will be
This is a test
Another option, using a regex, is to read the whole file and use re.findall to return a list of the captured groups.
The pattern ^\s*(\S+) matches optional whitespace chars \s* at the start of each line ^ (thanks to re.MULTILINE) and captures 1 or more non-whitespace chars in group 1 (\S+), which is what gets returned.
import re
def a_open(filename):
    return ' '.join(
        re.findall(r"^\s*(\S+)",
                   open(filename, "r").read(),
                   re.MULTILINE)
    )

print(a_open("mytext"))
Output
This is a test

Regular Expression: How to extract several matching patterns from a line?

I have a .csv document that consists of several lines. Each line contains tab-separated information such as:
name_1:ayse \t name_2:fatma \t birth_date_1:24 \t birth_date_2:august \t birth_date_3:2018 \t death_date:2100 \t location:turkey.
The order of these fields may differ from line to line, and each line contains many such fields.
What I am trying to do is get the specific part of each string that contains the "birth_date" information.
I managed to get the 3 strings related to the birth date, as follows:
['birth_date_1', 'birth_date_2', 'birth_date_3']
with the help of below code.
import re

inputfile = open('ornek_data.csv', 'r', encoding="utf-8")
for rownum, line in enumerate(inputfile):
    pattern_birth = re.compile(r"\w*birth_date\w*", re.IGNORECASE)
    if pattern_birth.search(line) is not None:
        a = re.findall(r"\w*birth_date\w*", line)
        print(a)
However, what I actually want is to extract the list below as output and write it to another document for each line.
['birth_date_1:24', 'birth_date_2:august', 'birth_date_3:2018']
I tried several other regular expression approaches, such as the one below, but I couldn't get it to work. I would be glad if anyone could help me with this problem.
for rownum, line in enumerate(inputfile):
    pattern_birth = re.compile(r"\w*birth_date\w*", re.IGNORECASE)
    if pattern_birth.search(line) is not None:
        a = re.findall(r"\w*birth_date.*?:$", line)
        print(a)
I would not use a regex here.
Split on '\t' and check whether each piece contains 'birth_date'. Simple!:
s = 'name_1:ayse \t name_2:fatma \t birth_date_1:24 \t birth_date_2:august \t birth_date_3:2018 \t death_date:2100 \t location:turkey.'
print([x.strip() for x in s.split('\t') if 'birth_date' in x])
# ['birth_date_1:24', 'birth_date_2:august', 'birth_date_3:2018']
Use r"\w*birth_date.*?\s" or r"birth_date_\d:.*?\s"
Ex:
import re
line = "name_1:ayse \t name_2:fatma \t birth_date_1:24 \t birth_date_2:august \t birth_date_3:2018 \t death_date:2100 \t location:turkey."
print(re.findall(r"\w*birth_date.*?\s", line))
Output:
['birth_date_1:24 ', 'birth_date_2:august ', 'birth_date_3:2018 ']
Your regex doesn't match what you are trying to extract, so you need to extend it.
As an aside, you should only re.compile once - the point of compilation is to avoid needing to parse the regex again.
There is also no need to check for no matches separately. Just loop over all the matches; if there are none, the loop will execute zero times.
import re

pat = re.compile(r"\bbirth_date_\d+:\w+", re.IGNORECASE)
with open('ornek_data.csv', 'r', encoding="utf-8") as inputfile:
    for rownum, line in enumerate(inputfile):
        for a in pat.findall(line):
            print(rownum, a)
The \w* wasn't doing anything useful (if you don't care whether it's there or not, as the * quantifier allows, why search for it?), whereas \b requires the match to occur at a word boundary (so adjacent to whitespace or punctuation, or at the beginning or end of the line). \d matches a digit, \w matches a letter, digit or underscore (so the value part covers both 24 and august), and : simply matches itself.
If this is a well-formed CSV file, maybe instead use a CSV reader and print the fields that start with 'birth_date_'.
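For example, a sketch of that csv.reader approach; the sample line here is made up to mirror the question, and in the real code you would pass the open file object instead of a StringIO:

```python
import csv
import io

# Sketch of the csv.reader suggestion on a made-up tab-delimited line.
sample = "name_1:ayse\tname_2:fatma\tbirth_date_1:24\tbirth_date_2:august\tbirth_date_3:2018\tdeath_date:2100\tlocation:turkey."
for row in csv.reader(io.StringIO(sample), delimiter='\t'):
    # Keep only the fields whose key starts with "birth_date_".
    birth_fields = [field for field in row if field.startswith('birth_date_')]
    print(birth_fields)
```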

How can I iterate through the words of a file in Python?

I have a .txt file and I want to iterate through its words. The problem is that I need to remove the punctuation marks before going through the words. I have tried this, but it isn't removing the punctuation marks.
file = open(file_name, "r")
for word in file.read().strip(",;.:- '").split():
    print word
file.close()
The problem with your current method is that .strip() doesn't really do what you want: it only removes leading and trailing characters, whereas you want to remove punctuation anywhere within the text.
Another problem is that there are many more potential punctuation characters (question marks, exclamations, unicode ellipses, em dashes) that wouldn't get filtered out by your list. Instead, you can use string.punctuation to get a wide range of characters (note that string.punctuation doesn't include some non-English characters, so its viability may depend on the source of your input):
import string
punctuation = set(string.punctuation)
text = ''.join(char for char in text if char not in punctuation)
An even faster method (shown in other answers on SO) uses translate() to remove the characters (Python 2 shown here):
import string
text = text.translate(string.maketrans('', ''), string.punctuation)
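For reference, the Python 3 equivalent uses the three-argument form of str.maketrans, whose third argument lists the characters to delete (the sample text below is made up):

```python
import string

# Python 3 equivalent of the Python 2 translate call above: the
# three-argument str.maketrans maps each punctuation character to None.
text = "Hello, world! It's a test."
cleaned = text.translate(str.maketrans('', '', string.punctuation))
print(cleaned)  # Hello world Its a test
```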
strip() only removes characters found at the beginning or end of a string.
So split() first to cut the line into words, then strip() to remove punctuation from each word.
import string
with open(file_name, "rt") as finput:
    for line in finput:
        for word in line.split():
            print word.strip(string.punctuation)
Or use a natural language aware library like nltk: http://www.nltk.org/
You can try using the re module:
import re
with open(file_name) as f:
    for word in re.split(r'\W+', f.read()):
        print word
See the re documentation for more details.
Edit: The previous code ignores non-ASCII characters. In that case the following code can help:
import re
with open(file_name) as f:
    for word in re.compile(r'\W+', re.UNICODE).split(f.read().decode('utf8')):
        print word
The following code preserves apostrophes and blanks, and could easily be modified to preserve double quotations marks, if desired. It works by using a translation table based on a subclass of the string object. I think the code is fairly easy to understand. It might be made more efficient if necessary.
class SpecialTable(str):
    def __getitem__(self, chr):
        # Keep spaces, apostrophes, digits and ASCII letters; drop the rest.
        if chr == 32 or chr == 39 or 48 <= chr <= 57 \
                or 65 <= chr <= 90 or 97 <= chr <= 122:
            return chr
        else:
            return None

specialTable = SpecialTable()

with open('temp2.txt') as inputText:
    for line in inputText:
        print(line)
        convertedLine = line.translate(specialTable)
        print(convertedLine)
        print(convertedLine.split(' '))
Here's typical output.
This! is _a_ single (i.e. 1) English sentence that won't cause any trouble, right?
This is a single ie 1 English sentence that won't cause any trouble right
['This', 'is', 'a', 'single', 'ie', '1', 'English', 'sentence', 'that', "won't", 'cause', 'any', 'trouble', 'right']
'nother one.
'nother one
["'nother", 'one']
I would remove the punctuation marks with the replace function after storing the words in a list, like so:
with open(file_name, "r") as f_r:
    words = []
    for row in f_r:
        words.extend(row.split())
punctuation = [',', ';', '.', ':', '-']
for p in punctuation:
    words = [x.replace(p, '') for x in words]

Python regex keep a few more tokens

I am using the following regex in Python to keep words that do not contain non-alphabetical characters:
(?<!\S)[A-Za-z]+(?!\S)|(?<!\S)[A-Za-z]+(?=:(?!\S))
The problem is that this regex does not keep words that I would like to keep such as the following:
Company,
months.
third-party
In other words I would like to keep words that are followed by a comma, a dot, or have a dash between two words.
Any ideas on how to implement this?
I tried adding something like |(?<!\S)[A-Za-z]+(?=\.(?!\S)) to handle the dots, but it does not seem to be working.
Thanks !
EDIT:
Should match these:
On-line
. These
maintenance,
other.
. Our
Google
Should NOT match these:
MFgwCgYEVQgBAQICAf8DSgAwRwJAW2sNKK9AVtBzYZmr6aGjlWyK3XmZv3dTINen
NY7xtb92dCTfvEjdmkDrUw==
$As_Of_12_31_20104206http://www.sec.gov/CIK0001393311instant2010-12-31T00:00:000001-01-01T00:00:00falsefalseArlington/S.Cooper
-Publisher
gaap_RealEstateAndAccumulatedDepreciationCostsCapitalizedSubsequentToAcquisitionCarryingCostsus
At the moment I am using the following Python code to read a text file line by line:
find_words = re.compile(r'(?<!\S)[A-Za-z]+(?!\S)|(?<!\S)[A-Za-z]+(?=:(?!\S))').findall
Then I open the text file:
contents = open("test.txt","r")
and I search for the words line by line:
for line in contents:
    if find_words(line.lower()) != []:
        lineWords = find_words(line.lower())
        print "The words in this line are: ", lineWords
using some word lists in the following way:
wanted1 = set(find_words(open('word_list_1.csv').read().lower()))
wanted2 = set(find_words(open('word_list_2.csv').read().lower()))
negators = set(find_words(open('word_list_3.csv').read().lower()))
I first want to get the valid words from the .txt file, and then check whether these words belong in the word lists. The two steps are independent.
I propose this regex:
find_words = re.compile(r'(?:(?<=[^\w./-])|(?<=^))[A-Za-z]+(?:-[A-Za-z]+)*(?=\W|$)').findall
There are 3 parts from your initial regex that I changed:
Middle part:
[A-Za-z]+(?:-[A-Za-z]+)*
This allows hyphenated words.
The last part:
(?=\W|$)
This is a bit similar to (?!\S), except that it also allows non-space characters such as punctuation. So this allows a match if, after the matched word, the line ends or there is a non-word character, i.e. no letter, digit or underscore (if you don't want word_ to match word, you will have to change \W to [a-zA-Z0-9]).
The first part (probably most complex):
(?:(?<=[^\w./-])|(?<=^))
It is itself composed of two parts, matching either (?<=[^\w./-]) or (?<=^). The second allows a match if the word to be matched sits at the beginning of the line. We cannot write (?<=[^\w./-]|^) because Python's re lookbehind must be fixed-width ([^\w./-] has length 1 while ^ has length 0).
(?<=[^\w./-]) allows a match if, before the word, there are no word characters, periods, forward slashes or hyphens.
When broken down, the small parts are rather straightforward, I think, but if there's anything you'd like elaborated, I can give more details.
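To illustrate, running the proposed pattern over a small made-up line (not the asker's actual data) keeps hyphenated and punctuated words while rejecting a token with a leading hyphen:

```python
import re

# Demo of the pattern proposed above on a made-up sample line.
find_words = re.compile(r'(?:(?<=[^\w./-])|(?<=^))[A-Za-z]+(?:-[A-Za-z]+)*(?=\W|$)').findall

# "third-party" (hyphenated), "months." and "Company," (punctuated) match;
# "-Publisher" is rejected because a hyphen precedes it.
print(find_words("third-party months. Company, -Publisher"))
```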
This is not purely a regex task, because you cannot tell real words apart with a regex alone. You need a dictionary to check your words against.
So I suggest using a regex to split your string on non-alphabetical characters, then checking that all of the items exist in your dictionary. For example:
import re
words = re.split(r'\W+', my_string)
print all(i in my_dict for i in words if i)
As an alternative you can use nltk.corpus as your dictionary:
from nltk.corpus import wordnet
words = re.split(r'\W+', my_string)
if all(wordnet.synsets(i) for i in words if i):
    # do stuff
But if you want to use your own word list, you need to change your regex, because it's incorrect; use re.split as above instead:
all_words = wanted1 | wanted2 | negators
with open("test.txt", "r") as f:
    for line in f:
        for word in line.split():
            words = re.split(r'\W+', word)
            if all(i in all_words for i in words if i):
                print word
Instead of using all sorts of complicated look-arounds, you can use \b to detect the boundary of words. This way, you can use e.g. \b[a-zA-Z]+(?:-[a-zA-Z]+)*\b
Example:
>>> p = r"\b[a-zA-Z]+(?:-[a-zA-Z]+)*\b"
>>> text = "This is some example text, with some multi-hyphen-words and invalid42 words in it."
>>> re.findall(p, text)
['This', 'is', 'some', 'example', 'text', 'with', 'some', 'multi-hyphen-words', 'and', 'words', 'in', 'it']
Update: Seems like this does not work too well, as it also detects fragments from URLs, e.g. www, sec and gov from http://www.sec.gov.
Instead, you might try this variant, using look-around explicitly stating the 'legal' characters:
r"""(?<![^\s("])[a-zA-Z]+(?:[-'][a-zA-Z]+)*(?=[\s.,:;!?")])"""
This seems to pass all your test-cases.
Let's dissect this regex:
(?<![^\s("]) - look-behind asserting that the word is preceded by a space, quote or paren, but e.g. not a number (using double negation instead of a positive look-behind so the first word in the string is matched, too)
[a-zA-Z]+ - the first part of the word
(?:[-'][a-zA-Z]+)* - optionally more word-segments after a ' or -
(?=[\s.,:;!?")]) - look-ahead asserting that the word is followed by space, punctuation, quote or parens
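As a quick sanity check, applying this pattern to a made-up line containing a few of the question's test cases:

```python
import re

# Sanity check of the pattern above on a made-up sample line.
pattern = r"""(?<![^\s("])[a-zA-Z]+(?:[-'][a-zA-Z]+)*(?=[\s.,:;!?")])"""
text = 'maintenance, other. On-line Google -Publisher invalid42 '
# "-Publisher" (leading hyphen) and "invalid42" (trailing digits) are rejected.
print(re.findall(pattern, text))
```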

Extract English words from string in Python

I have a document in which each line is a string. It might contain digits, non-English letters and words, and symbols (such as ! and *). I want to extract the English words from each line (English words are separated by spaces).
My code is the following; it is the map function of my map-reduce job. However, based on the final result, this mapper function only produces frequency counts for single letters (such as a, b, c). Can anyone help me find the bug? Thanks
import sys
import re
for line in sys.stdin:
    line = re.sub("[^A-Za-z]", "", line.strip())
    line = line.lower()
    words = ' '.join(line.split())
    for word in words:
        print '%s\t%s' % (word, 1)
You've actually got two problems.
First, this:
line = re.sub("[^A-Za-z]", "", line.strip())
This removes all non-letters from the line. Which means you no longer have any spaces to split on, and therefore no way to separate it into words.
Next, even if you didn't do that, you do this:
words = ' '.join(line.split())
This doesn't give you a list of words, this gives you a single string, with all those words concatenated back together. (Basically, the original line with all runs of whitespace converted into a single space.)
So, in the next line, when you do this:
for word in words:
You're iterating over a string, which means each word is a single character. Because that's what strings are: iterables of characters.
If you want each word (as your variable names imply), you already had those; the problem is that you joined them back into a single string. Just do this instead:
words = line.split()
for word in words:
Or, if you want to strip out things besides letters and whitespace, use a regular expression that strips out everything besides letters and whitespace, not one that strips out everything besides letters, including whitespace:
line = re.sub(r"[^A-Za-z\s]", "", line.strip())
words = line.split()
for word in words:
However, that pattern is still probably not what you want. Do you really want to turn 'abc1def' into the single string 'abcdef', or into the two strings 'abc' and 'def'? You probably want either this:
line = re.sub(r"[^A-Za-z]", " ", line.strip())
words = line.split()
for word in words:
… or just:
words = re.split(r"[^A-Za-z]", line.strip())
for word in words:
There are two issues here:
line = re.sub("[^A-Za-z]", "", line.strip()) removes all the non-letter characters, including spaces, making it hard to split into words in the subsequent stage. One alternative is words = re.findall('[A-Za-z]+', line) (note the +, which matches whole runs of letters).
As mentioned by @abarnert, in the existing code words is a string, so for word in words iterates over each letter. To get words as a list of words, follow point 1.
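Putting point 1 together, a minimal corrected extraction on a made-up sample line (the sample text is illustrative only):

```python
import re

# Corrected extraction: [A-Za-z]+ (with +) matches whole runs of letters,
# so whole words come out instead of single characters.
line = "Hello!! world3 cafe42 *foo_bar*"
words = re.findall('[A-Za-z]+', line.lower())
print(words)  # ['hello', 'world', 'cafe', 'foo', 'bar']
```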
