I have a document in which each line is a string. It may contain digits, non-English letters and words, and symbols (such as ! and *). I want to extract the English words from each line (English words are separated by spaces).
My code is below; it is the map function of my map-reduce job. However, based on the final result, this mapper only produces frequency counts of letters (such as a, b, c). Can anyone help me find the bug? Thanks
import sys
import re

for line in sys.stdin:
    line = re.sub("[^A-Za-z]", "", line.strip())
    line = line.lower()
    words = ' '.join(line.split())
    for word in words:
        print '%s\t%s' % (word, 1)
You've actually got two problems.
First, this:
line = re.sub("[^A-Za-z]", "", line.strip())
This removes all non-letters from the line. Which means you no longer have any spaces to split on, and therefore no way to separate it into words.
Next, even if you didn't do that, you do this:
words = ' '.join(line.split())
This doesn't give you a list of words, this gives you a single string, with all those words concatenated back together. (Basically, the original line with all runs of whitespace converted into a single space.)
So, in the next line, when you do this:
for word in words:
You're iterating over a string, which means each word is a single character. Because that's what strings are: iterables of characters.
If you want each word (as your variable name implies), you already had those; the problem is that you joined them back into a string. Just skip the join:
words = line.split()
for word in words:
Or, if you want to strip out things besides letters and whitespace, use a regular expression that strips out everything besides letters and whitespace, not one that strips out everything besides letters, including whitespace:
line = re.sub(r"[^A-Za-z\s]", "", line.strip())
words = line.split()
for word in words:
However, that pattern is still probably not what you want. Do you really want to turn 'abc1def' into the single string 'abcdef', or into the two strings 'abc' and 'def'? You probably want either this:
line = re.sub(r"[^A-Za-z]", " ", line.strip())
words = line.split()
for word in words:
… or just:
words = re.split(r"[^A-Za-z]", line.strip())
for word in words:
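For what it's worth, the last two options differ in one detail: re.split leaves empty strings behind wherever two non-letters are adjacent. A quick sketch (the sample line is my own):

```python
import re

# Illustrative input: digits and symbols mixed between words
line = "foo1 bar!! baz"

# Replace every non-letter with a space, then split: no empty strings
print(re.sub(r"[^A-Za-z]", " ", line.strip()).split())
# ['foo', 'bar', 'baz']

# Split directly on single non-letters: note the empty strings
print(re.split(r"[^A-Za-z]", line.strip()))
# ['foo', '', 'bar', '', '', 'baz']
```

If you take the re.split route, filter the empties out, e.g. `[w for w in words if w]`.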
There are two issues here:
line = re.sub("[^A-Za-z]", "", line.strip()) removes all the non-letter characters, including spaces, which makes it impossible to split the result into words afterwards. One alternative solution is words = re.findall(r'[A-Za-z]+', line)
As mentioned by @abarnert, in the existing code words is a string, so for word in words iterates over each letter. To get words as a list of words, you can follow point 1.
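To make point 1 concrete: the + quantifier is what turns letter matches into word matches. A short demo (the sample line is mine):

```python
import re

line = "abc1 def!! *ghi"
# With +, each run of letters is one match: whole words
print(re.findall(r'[A-Za-z]+', line))  # ['abc', 'def', 'ghi']
# Without +, every single letter is its own match
print(re.findall(r'[A-Za-z]', line))   # ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i']
```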
Related
I cannot solve the following exercise:
In the given function "a_open()" open the file "mytext" and create a string out of the first words in each line of the file. Each word should be separated by a blank (" ").
I am stuck at this point:
def a_open():
    f = open("mytext", "r")
    for line in f:
        print(line.split(' ')[0])
I am aware I should use the function .join but I do not know how. Any suggestions?
Thank you in advance!
Assuming this is Python, you might use an approach like passing the filename to the function.
Create an empty list words outside of the loop to hold all the first words per line.
Per line, split on a space, use strip to remove leading and trailing whitespace and newlines, and filter out the "empty" entries.
If the list is not empty, add the first item to the list.
After processing all the lines, use join with a space on the words list to return a string of all the words.
def a_open(filename):
    words = []
    for line in open(filename, "r"):
        parts = list(filter(None, line.strip().split(' ')))
        if len(parts):
            words.append(parts[0])
    return ' '.join(words)
print(a_open("mytext"))
If the contents of the file are, for example:
This abc
is def
a
test k lm
The output will be
This is a test
Another option using a regex could be reading the whole file, and use re.findall to return a list of groups.
The pattern ^\s*(\S+) matches optional whitespace chars \s* at the start of the string ^ and captures one or more non-whitespace chars in group 1 (\S+), which is what gets returned.
import re

def a_open(filename):
    return ' '.join(
        re.findall(r"^\s*(\S+)",
                   open(filename, "r").read(),
                   re.MULTILINE)
    )

print(a_open("mytext"))
Output
This is a test
I have tried to build up my first iterator for words in a text:
import re

def words(text):
    regex = re.compile(r"""(\w(?:[\w']*\w)?|\S)""", re.VERBOSE)
    for line in text:
        words = regex.findall(line)
        if words:
            for word in words:
                yield word
If I only use the line words = regex.findall(line), I retrieve a list with all the words, but if I use the function and call next() on it, it returns the text character by character.
Any idea what I'm doing wrong?
I believe that you are passing a string as text, because that is the only way it would result in all characters. So, given that, I updated the code to accommodate a string (all I did was remove one of the loops):
import re

def words(text):
    regex = re.compile(r"""(\w(?:[\w']*\w)?|\S)""", re.VERBOSE)
    words = regex.findall(text)
    for word in words:
        yield word
print(list(words("I like to test strings")))
Is text a list of strings? If it's one string (even if it contains newlines), that explains the result...
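To see both behaviors side by side, here is the original generator fed a list of lines versus a bare string (the sample sentence is mine):

```python
import re

def words(text):
    # Original generator: `text` is expected to be an iterable of lines
    regex = re.compile(r"""(\w(?:[\w']*\w)?|\S)""", re.VERBOSE)
    for line in text:
        words = regex.findall(line)
        if words:
            for word in words:
                yield word

print(list(words(["I like to test"])))  # list of lines: ['I', 'like', 'to', 'test']
print(list(words("I like")))            # bare string: ['I', 'l', 'i', 'k', 'e']
```

Iterating a bare string yields one character per "line", so findall only ever sees single characters.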
I have a list
forbidden_patterns=['Word1','Word2','Word3','\d{4}']
and a string :
string1="This is Word1 a list thatWord2 I'd like to 2016 be readableWord3"
What is the way to remove from string1 all the patterns and words defined in forbidden_patterns, so it ends up as:
clean_string="This is a list that I'd like to be readable"
The \d{4} is to remove the year pattern which in this case is 2016
List comprehensions are very welcome
Here you are:
import re

forbidden_patterns = ['Word1', 'Word2', 'Word3', r'\d{4}']
string = "This is Word1 a list thatWord2 I'd like to 2016 be readableWord3"
for pattern in forbidden_patterns:
    string = ''.join(re.split(pattern, string))
print(string)
Essentially, this code goes through each of the patterns in forbidden_patterns, splits string using that particular pattern as a delimiter (which removes the delimiter, in this case the pattern, from the string), and joins it back together into a string for the next pattern.
EDIT
To get rid of the extra spaces, put the following line as the first line in the for-loop:
string = ''.join(re.split(r'\b{} '.format(pattern), string))
This line checks to see if the pattern is a whole word, and if so, removes that word and one of the spaces. Make sure that this line goes above string = ''.join(re.split(pattern, string)), which is "less specific" than this line.
import re

new_string = string1
for pattern in forbidden_patterns:
    new_string = re.sub(pattern, '', new_string)

Your new_string would be the one you want. Though it's a bit long, and removing some of the words leaves you with two spaces in places, as in This is  a list that I'd like to be readable.
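If the leftover double spaces bother you, one follow-up re.sub collapsing runs of whitespace cleans them up. A sketch using the data from the question:

```python
import re

forbidden_patterns = ['Word1', 'Word2', 'Word3', r'\d{4}']
string1 = "This is Word1 a list thatWord2 I'd like to 2016 be readableWord3"

new_string = string1
for pattern in forbidden_patterns:
    new_string = re.sub(pattern, '', new_string)

# Collapse any run of whitespace to a single space and trim the ends
clean_string = re.sub(r'\s+', ' ', new_string).strip()
print(clean_string)  # This is a list that I'd like to be readable
```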
I am using the following regex in Python to keep words that do not contain non alphabetical characters:
(?<!\S)[A-Za-z]+(?!\S)|(?<!\S)[A-Za-z]+(?=:(?!\S))
The problem is that this regex does not keep words that I would like to keep such as the following:
Company,
months.
third-party
In other words I would like to keep words that are followed by a comma, a dot, or have a dash between two words.
Any ideas on how to implement this?
I tried adding something like |(?<!\S)[A-Za-z]+(?=\.(?!\S)) for the dots but it does not seem to be working.
Thanks !
EDIT:
Should match these:
On-line
. These
maintenance,
other.
. Our
Google
Should NOT match these:
MFgwCgYEVQgBAQICAf8DSgAwRwJAW2sNKK9AVtBzYZmr6aGjlWyK3XmZv3dTINen
NY7xtb92dCTfvEjdmkDrUw==
$As_Of_12_31_20104206http://www.sec.gov/CIK0001393311instant2010-12-31T00:00:000001-01-01T00:00:00falsefalseArlington/S.Cooper
-Publisher
gaap_RealEstateAndAccumulatedDepreciationCostsCapitalizedSubsequentToAcquisitionCarryingCostsus
At the moment I am using the following python code to read a text file line by line:
find_words = re.compile(r'(?<!\S)[A-Za-z]+(?!\S)|(?<!\S)[A-Za-z]+(?=:(?!\S))').findall
then i open the text file
contents = open("test.txt","r")
and I search for the words line by line:

for line in contents:
    if find_words(line.lower()) != []:
        lineWords = find_words(line.lower())
        print "The words in this line are: ", lineWords
using some word lists in the following way:
wanted1 = set(find_words(open('word_list_1.csv').read().lower()))
wanted2 = set(find_words(open('word_list_2.csv').read().lower()))
negators = set(find_words(open('word_list_3.csv').read().lower()))
I first want to get the valid words from the .txt file, and then check whether these words belong in the word lists. The two steps are independent.
I propose this regex:
find_words = re.compile(r'(?:(?<=[^\w./-])|(?<=^))[A-Za-z]+(?:-[A-Za-z]+)*(?=\W|$)').findall
There are 3 parts from your initial regex that I changed:
Middle part:
[A-Za-z]+(?:-[A-Za-z]+)*
This allows hyphenated words.
The last part:
(?=\W|$)
This is a bit similar to (?!\S), except that it also allows characters that are not spaces, like punctuation. So this will allow a match if, after the matched word, the line ends or there is a non-word character, that is, no letters, numbers or underscores (if you don't want word_ to match word, you will have to change \W to [a-zA-Z0-9]).
The first part (probably most complex):
(?:(?<=[^\w./-])|(?<=^))
It is composed of two parts itself which matches either (?<=[^\w./-]) or (?<=^). The second one allows a match if the line begins before the word to be matched. We cannot use (?<=[^\w./-]|^) because python's lookbehind from re cannot be of variable width (with [^\w./-] having a length of 1 and ^ a length of 0).
(?<=[^\w./-]) allows a match if, before the word, there are no word characters, periods, forward slashes or hyphens.
When broken down, the small parts are rather straightforward, I think, but if there's anything you'd like more elaboration on, I can give more details.
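A quick demonstration of the proposed pattern on a line assembled from the question's should-match and should-not-match examples:

```python
import re

find_words = re.compile(r'(?:(?<=[^\w./-])|(?<=^))[A-Za-z]+(?:-[A-Za-z]+)*(?=\W|$)').findall

line = "On-line maintenance, other. -Publisher"
# Hyphenated words and words followed by punctuation match;
# the token preceded by a hyphen does not
print(find_words(line))  # ['On-line', 'maintenance', 'other']
```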
This is not a pure regex task, because you cannot tell real words apart with a regex alone. You need a dictionary to check your words against.
So I suggest using a regex to split your string on non-alphabetical characters and then checking that all of the items exist in your dictionary. For example:

import re

words = re.split(r'[^A-Za-z]+', my_string)
print all(i in my_dict for i in words if i)
As an alternative you can use nltk.corpus as your dictionary:

from nltk.corpus import wordnet

words = re.split(r'[^A-Za-z]+', my_string)
if all(wordnet.synsets(i) for i in words if i):
    # do stuff
But if you want to use your own word lists, you need to change your regex, because the one you have is incorrect; instead, use re.split as above:

all_words = wanted1 | wanted2 | negators
with open("test.txt", "r") as f:
    for line in f:
        for word in line.split():
            words = re.split(r'[^A-Za-z]+', word)
            if all(i in all_words for i in words if i):
                print word
Instead of using all sorts of complicated look-arounds, you can use \b to detect the boundary of words. This way, you can use e.g. \b[a-zA-Z]+(?:-[a-zA-Z]+)*\b
Example:
>>> p = r"\b[a-zA-Z]+(?:-[a-zA-Z]+)*\b"
>>> text = "This is some example text, with some multi-hyphen-words and invalid42 words in it."
>>> re.findall(p, text)
['This', 'is', 'some', 'example', 'text', 'with', 'some', 'multi-hyphen-words', 'and', 'words', 'in', 'it']
Update: Seems like this does not work too well, as it also detects fragments from URLs, e.g. www, sec and gov from http://www.sec.gov.
Instead, you might try this variant, using look-around explicitly stating the 'legal' characters:
r"""(?<![^\s("])[a-zA-Z]+(?:[-'][a-zA-Z]+)*(?=[\s.,:;!?")])"""
This seems to pass all your test-cases.
Let's dissect this regex:
(?<![^\s("]) - look-behind asserting that the word is preceded by a space, quote or paren, but e.g. not a number (using double negation instead of a positive look-behind so the first word is matched, too)
[a-zA-Z]+ - the first part of the word
(?:[-'][a-zA-Z]+)* - optionally more word-segments after a ' or -
(?=[\s.,:;!?")]) - look-ahead asserting that the word is followed by space, punctuation, quote or parens
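Applying it to a few of the should-match and should-not-match examples (assembled into one test string of my own):

```python
import re

pattern = r"""(?<![^\s("])[a-zA-Z]+(?:[-'][a-zA-Z]+)*(?=[\s.,:;!?")])"""
text = 'On-line maintenance, other. -Publisher invalid42 ok.'
# Hyphenated and punctuation-terminated words match;
# hyphen-prefixed tokens and words running into digits do not
print(re.findall(pattern, text))  # ['On-line', 'maintenance', 'other', 'ok']
```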
Python question!
I would like some help on splitting up sentences in a text file. I do not want to create a massive if loop, but I need to find a way how to split up the text file into sentences. I must do this without .readlines()
I would like to split up the sentences using periods, quotations, and exclamation points...BUT:
Periods followed by whitespace followed by a lowercase letter will not split the sentence
Periods followed by a digit with no intervening whitespace will not split the sentence
Things such as Mr., Mrs., Dr., and so on will NOT split the sentence of course
Sequences of letters such as e.g., www.website.com, etc. will not split the sentence
And lastly, periods followed by punctuation such as commas or more periods (ellipses) will not split the sentence
I would like to have these split up sentences from the text file printed out to the user. How would I go about this process? I understand basic string formatting and indices, but adding ellipses, surnames, etc. are going to make it a bit harder for me...
Also, I'm going to be using tkinter to create an open-file button and a drop-down menu that lets the user create a new text file from the output of the program, one sentence per line, in a .txt file.
Thank you!
Here's what I got
import re

punctuation = ['.', '?', '!']
exceptions = ['Mr.', 'Mrs.', 'Ms.', 'Sr.', 'e.g', '...']
lines = []
with open('myData.txt') as myFile:
    lines = re.split(punctuation, myFile)
This is my code:

import re

punctuation = re.compile(r"([^\d+])(\.|!|\?|;|\n|。|!|?|;|…| |!|؟|؛)+")
lines = []
with open('myData.txt', 'r', encoding="utf-8") as myFile:
    lines = punctuation.sub(r"\1\2<pad>", myFile.read())
lines = [line.strip() for line in lines.split("<pad>") if line.strip()]
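To see what the <pad> substitution is doing, here is a reduced sketch of the same idea; the sample text and the simplified ASCII-only pattern are mine (the original handles more punctuation and Unicode, and a quantified run of terminators):

```python
import re

# Keep the sentence terminator, append a marker, then split on the marker.
# [^\d] before the terminator stops "1.5" from being treated as a boundary.
punctuation = re.compile(r"([^\d])([.!?])")
text = "Hello world! This is a test. Version 1.5 stays intact."

marked = punctuation.sub(r"\1\2<pad>", text)
sentences = [s.strip() for s in marked.split("<pad>") if s.strip()]
print(sentences)
# ['Hello world!', 'This is a test.', 'Version 1.5 stays intact.']
```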