Stripping punctuation from unique strings in an input file

This question (Best way to strip punctuation from a string in Python) deals with stripping punctuation from an individual string. However, I want to read text from an input file and print only ONE COPY of each string, without trailing punctuation. I have started with something like this:
f = open('#file name ...', 'a+')
for x in set(f.read().split()):
    print x
But the problem is that if the input file has, for instance, this line:
This is not is, clearly is: weird
It treats the three different cases of "is" differently, but I want to ignore any punctuation and have it print "is" only once, rather than three times. How do I remove any kind of ending punctuation and then put the resulting string in the set?
Thanks for any help. (I am really new to Python.)

import re
for x in set(re.findall(r'\b\w+\b', f.read())):
should distinguish words more reliably.
This regular expression finds compact groups of alphanumerical characters (a-z, A-Z, 0-9, _).
If you want to find letters only (no digits and no underscore), then replace the \w with [a-zA-Z].
>>> re.findall(r'\b\w+\b', "This is not is, clearly is: weird")
['This', 'is', 'not', 'is', 'clearly', 'is', 'weird']
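As a rough sketch of how this could be wrapped up for the original task (Python 3 syntax; the function name unique_words and the sample string are illustrative and not part of the question):

```python
import re


def unique_words(text):
    """Return the set of distinct runs of word characters in text."""
    return set(re.findall(r"\b\w+\b", text))


sample = "This is not is, clearly is: weird"
words = unique_words(sample)

# Sets are unordered, so sort before printing for a stable display.
for word in sorted(words):
    print(word)
```

For a real file, the text would come from something like open(path).read() in a mode that allows reading from the start of the file.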

You can use translation tables if you don't mind replacing your punctuation characters with whitespace, e.g.
>>> from string import maketrans
>>> punctuation = ",;.:"
>>> replacement = " " * len(punctuation)  # maketrans needs two arguments of equal length
>>> trans_table = maketrans(punctuation, replacement)
>>> 'This is not is, clearly is: weird'.translate(trans_table)
'This is not is  clearly is  weird'
# And for your case of creating a set of unique words.
>>> set('This is not is clearly is weird'.split())
set(['This', 'not', 'is', 'clearly', 'weird'])
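The session above is Python 2. A minimal sketch of the same idea in Python 3, where the table is built with str.maketrans and its third argument lists characters to delete (the helper name unique_words_translate is made up for illustration):

```python
import string

# Map every ASCII punctuation character to None, i.e. delete it.
DELETE_PUNCTUATION = str.maketrans("", "", string.punctuation)


def unique_words_translate(text):
    """Delete ASCII punctuation, then split on whitespace and deduplicate."""
    return set(text.translate(DELETE_PUNCTUATION).split())


result = unique_words_translate("This is not is, clearly is: weird")
print(sorted(result))
```

Note that this deletes punctuation inside words too, so "won't" would become "wont"; whether that is acceptable depends on the input.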

Related

Not getting expected output for some reason?

Question: please debug logic to reflect expected output
import re
text = "Hello there."
word_list = []
for word in text.split():
    tmp = re.split(r'(\W+)', word)
    word_list.extend(tmp)
print(word_list)
OUTPUT is :
['Hello', 'there', '.', '']
Problem: the output should not contain the empty (or whitespace) entry.
Expected: ['Hello', 'there', '.']

First of all, the actual output you shared is not right. It is ['Hello', ' ', 'there', '.', ''], because \W matches anything other than a letter, digit, or underscore (for ASCII text, equivalent to [^a-zA-Z0-9_]), so it splits your string on the space (\s) and on the literal dot (.) character.
So to get the expected output you need some further processing, like the below.
With Earlier Code:
import re
s = "Hello there."
l = list(filter(str.strip,re.split(r"(\W+)", s)))
print(l)
With Edited code:
import re
text = "Hello there."
word_list = []
for word in text.split():
    tmp = re.split(r'(\W+)', word)
    word_list.extend(tmp)
print(list(filter(None,word_list)))
Output:
['Hello', 'there', '.']
Working Code: https://rextester.com/KWJN38243

Assuming word is "Hello there.", the results make sense. See the re.split documentation: "Split string by the occurrences of pattern. If capturing parentheses are used in pattern, then the text of all groups in the pattern are also returned as part of the resulting list."
You have put capturing parentheses in the pattern, so you are splitting the string on non-word characters and also returning the characters used for splitting.
Here is the string:
Hello there.
Here is how it is split:
Hello|there|
That means you have three values: Hello, there, and an empty string '' in the last place.
And the values you split on are a space and a period.
So the output should be the three values and the two separators that we split on:
Hello - space - there - period - empty string
which is exactly what I get.
import re
s = "Hello there."
t = re.split(r"(\W+)", s)
print(t)
output:
['Hello', ' ', 'there', '.', '']
Further Explanation
From your question, it may be that you think that because the string ends with a non-word character there would be nothing "after" it, but this is not how splitting works. Think back to CSV files (which have been around forever), and consider a CSV file like this:
date,product,qty,price
20220821,P1,10,20.00
20220821,P2,10,
The above represents a CSV file with four fields, but in the second data row the last field (which definitely exists) is empty. It would be parsed as an empty string if we split on the comma.
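The behaviour described above can be checked with a short sketch. The last part, which matches tokens with re.findall instead of splitting, is one possible alternative and not something the question or answers proposed:

```python
import re

text = "Hello there."

# A capturing group makes re.split return the separators too, and a
# separator at the very end leaves an empty string after it.
parts = re.split(r"(\W+)", text)
print(parts)

# The same thing happens with a plain str.split on a trailing comma.
fields = "20220821,P2,10,".split(",")
print(fields)

# Matching tokens directly avoids empty and whitespace entries: take
# runs of word characters, or runs of characters that are neither word
# characters nor whitespace.
tokens = re.findall(r"\w+|[^\w\s]+", text)
print(tokens)
```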

How can I travel through the words of a file in PYTHON?

I have a .txt file and I want to iterate over the words in it. My problem is that I need to remove the punctuation marks before iterating over the words. I have tried this, but it isn't removing the punctuation marks.
file=open(file_name,"r")
for word in file.read().strip(",;.:- '").split():
    print word
file.close()

The problem with your current method is that .strip() doesn't really do what you want. It only removes leading and trailing characters of the whole string (and you want to remove ones within the text). Its argument is a string of characters, each of which is stripped from the two ends.
Another problem is that there are many more potential punctuation characters (question marks, exclamations, unicode ellipses, em dashes) that wouldn't get filtered out by your list. Instead, you can use string.punctuation to get a wide range of characters (note that string.punctuation doesn't include some non-English characters, so its viability may depend on the source of your input):
import string
punctuation = set(string.punctuation)
text = ''.join(char for char in text if char not in punctuation)
An even faster method (shown in other answers on SO) uses the string's translate() method to delete the characters (Python 2 syntax):
import string
text = text.translate(string.maketrans('', ''), string.punctuation)

strip() only removes characters found at the beginning or end of a string.
So split() first to cut into words, then strip() to remove punctuation.
import string
with open(file_name, "rt") as finput:
    for line in finput:
        for word in line.split():
            print word.strip(string.punctuation)
Or use a natural language aware library like nltk: http://www.nltk.org/
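To make the difference concrete, here is a small sketch (Python 3 syntax, with a made-up sample sentence) comparing strip() on the whole text with strip() applied to each word after split():

```python
import string

text = "Hello, world: it's a well-known fact."

# strip() on the whole text only touches the two ends of the string.
whole = text.strip(string.punctuation)

# Splitting first and stripping each word removes punctuation at the
# edges of every word, while leaving inner apostrophes and hyphens alone.
per_word = [word.strip(string.punctuation) for word in text.split()]

print(whole)
print(per_word)
```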

You can try using the re module:
import re
with open(file_name) as f:
    for word in re.split(r'\W+', f.read()):
        print word
See the re documentation for more details.
Edit: The previous code ignores non-ASCII characters. In that case the following code can help:
import re
with open(file_name) as f:
    for word in re.compile(r'\W+', re.UNICODE).split(f.read().decode('utf8')):
        print word
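The snippets above are Python 2. In Python 3, str patterns are Unicode-aware by default, so a sketch of the same split needs neither the flag nor the explicit decode. The sample text is invented, and the filter drops the empty string that a trailing separator leaves behind:

```python
import re

# Escapes are used so the sample does not depend on the source encoding.
text = "Caf\u00e9 au lait, na\u00efve se\u00f1or!"

# \W+ treats accented letters as word characters for str input.
words = [word for word in re.split(r"\W+", text) if word]
print(words)
```

When reading from a file in Python 3, passing encoding="utf8" to open() would take the place of the decode call.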

The following code preserves apostrophes and blanks, and could easily be modified to preserve double quotation marks, if desired. It works by using a translation table based on a subclass of str whose __getitem__ receives each character's code point. I think the code is fairly easy to understand. It might be made more efficient if necessary.
class SpecialTable(str):
    def __getitem__(self, chr):
        if chr==32 or chr==39 or 48<=chr<=57 \
            or 65<=chr<=90 or 97<=chr<=122:
            return chr
        else:
            return None
specialTable = SpecialTable()
with open('temp2.txt') as inputText:
    for line in inputText:
        print (line)
        convertedLine=line.translate(specialTable)
        print (convertedLine)
        print (convertedLine.split(' '))
Here's typical output.
This! is _a_ single (i.e. 1) English sentence that won't cause any trouble, right?
This is a single ie 1 English sentence that won't cause any trouble right
['This', 'is', 'a', 'single', 'ie', '1', 'English', 'sentence', 'that', "won't", 'cause', 'any', 'trouble', 'right']
'nother one.
'nother one
["'nother", 'one']
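A variation on the same technique, sketched under the assumption that any object with a suitable __getitem__ can serve as the table for str.translate (Python 3). The class name KeepOnly is invented here; it takes the allowed characters as a string instead of hard-coding code point ranges:

```python
import string


class KeepOnly:
    """Translation table that keeps the allowed characters and deletes the rest."""

    def __init__(self, allowed):
        self.allowed = {ord(char) for char in allowed}

    def __getitem__(self, codepoint):
        # str.translate passes the code point; returning None deletes it.
        return codepoint if codepoint in self.allowed else None


table = KeepOnly(string.ascii_letters + string.digits + " '")
line = "This! is _a_ single (i.e. 1) English sentence that won't cause any trouble, right?"
cleaned = line.translate(table)
print(cleaned)
print(cleaned.split())
```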

I would remove the punctuation marks with the replace function after storing the words in a list, like so:
with open(file_name,"r") as f_r:
    words = []
    for row in f_r:
        words.extend(row.split())  # extend, so words stays a flat list of strings
punctuation = [',', ';', '.', ':', '-']
for y in punctuation:
    # remove one punctuation character at a time from every word
    words = [x.replace(y, '') for x in words]

Regex: how to identify words in a screen (or how to exclude punctuation and numbers)

Can someone help me identify only the words in a text file? Upper or lower case, but no numbers, brackets, dashes, punctuation, etc. (whatever the definition of a "word" is).
I was thinking about:
r"\w+ \w+"
but it does not work
Thank you

You can use a character class that specifies the range of expected characters:
r'[a-zA-Z]+'
Read more here http://www.regular-expressions.info/charclass.html
And in Python you can use the function re.findall() to return all the matches in a list, or re.finditer(), which returns an iterator of match objects.
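A short sketch of both calls with the character class above (the sample sentence is only for illustration):

```python
import re

text = "hey there 222 how are you ??? fine I hope!"
pattern = r"[a-zA-Z]+"

# findall returns the matched strings as a list.
words = re.findall(pattern, text)
print(words)

# finditer yields match objects, which also carry positions.
positions = [(match.group(), match.start()) for match in re.finditer(pattern, text)]
print(positions[:2])
```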

re.findall(r"\b[a-z]+\b",test_str,re.I)
You can do it this way.

import re
text = "hey there 222 how are you ??? fine I hope!"
print(re.findall("[a-z]+", text, re.IGNORECASE))
#['hey', 'there', 'how', 'are', 'you', 'fine', 'I', 'hope']
Regex explanation
[a-z]+
Options: Case insensitive;
Match a single character in the range between “a” and “z” «[a-z]+»
Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
Python Live Demo
http://ideone.com/JT8ZjD

Python regex keep a few more tokens

I am using the following regex in Python to keep words that do not contain non alphabetical characters:
(?<!\S)[A-Za-z]+(?!\S)|(?<!\S)[A-Za-z]+(?=:(?!\S))
The problem is that this regex does not keep words that I would like to keep such as the following:
Company,
months.
third-party
In other words I would like to keep words that are followed by a comma, a dot, or have a dash between two words.
Any ideas on how to implement this?
I tried adding something like |(?<!\S)[A-Za-z]+(?=\.(?!\S)) for the dots but it does not seem to be working.
Thanks !
EDIT:
Should match these:
On-line
. These
maintenance,
other.
. Our
Google
Should NOT match these:
MFgwCgYEVQgBAQICAf8DSgAwRwJAW2sNKK9AVtBzYZmr6aGjlWyK3XmZv3dTINen
NY7xtb92dCTfvEjdmkDrUw==
$As_Of_12_31_20104206http://www.sec.gov/CIK0001393311instant2010-12-31T00:00:000001-01-01T00:00:00falsefalseArlington/S.Cooper
-Publisher
gaap_RealEstateAndAccumulatedDepreciationCostsCapitalizedSubsequentToAcquisitionCarryingCostsus
At the moment I am using the following python code to read a text file line by line:
find_words = re.compile(r'(?<!\S)[A-Za-z]+(?!\S)|(?<!\S)[A-Za-z]+(?=:(?!\S))').findall
Then I open the text file
contents = open("test.txt","r")
and I search for the words line by line:
for line in contents:
    if find_words(line.lower()) != []: lineWords=find_words(line.lower())
    print "The words in this line are: ", lineWords
using some word lists in the following way:
wanted1 = set(find_words(open('word_list_1.csv').read().lower()))
wanted2 = set(find_words(open('word_list_2.csv').read().lower()))
negators = set(find_words(open('word_list_3.csv').read().lower()))
I first want to get the valid words from the .txt file, and then check whether these words belong to the word lists. The two steps are independent.

I propose this regex:
find_words = re.compile(r'(?:(?<=[^\w./-])|(?<=^))[A-Za-z]+(?:-[A-Za-z]+)*(?=\W|$)').findall
There are 3 parts from your initial regex that I changed:
Middle part:
[A-Za-z]+(?:-[A-Za-z]+)*
This allows hyphenated words.
The last part:
(?=\W|$)
This is a bit similar to (?!\S) except that it also allows characters that are not spaces, such as punctuation. So this allows a match if, after the matched word, the line ends or there is a non-word character, in other words, anything that is not a letter, digit or underscore (if you do want the word in word_ to match, you will have to change \W to [^a-zA-Z0-9]).
The first part (probably most complex):
(?:(?<=[^\w./-])|(?<=^))
It is composed of two parts itself and matches either (?<=[^\w./-]) or (?<=^). The second one allows a match if the line begins right before the word to be matched. We cannot use (?<=[^\w./-]|^) because lookbehinds in Python's re module cannot be of variable width (with [^\w./-] having a length of 1 and ^ a length of 0).
(?<=[^\w./-]) allows a match if the character right before the word is not a word character, period, forward slash or hyphen.
When broken down, the small parts are rather straightforward I think, but if there's anything you would like me to elaborate on, I can give more details.
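A quick way to try the proposed pattern against a few of the samples from the question (Python 3 syntax; the choice of test strings is mine, and this is only a spot check, not a full validation):

```python
import re

find_words = re.compile(
    r"(?:(?<=[^\w./-])|(?<=^))[A-Za-z]+(?:-[A-Za-z]+)*(?=\W|$)"
).findall

# Words followed by a comma or period, and a hyphenated word.
wanted = find_words("Company, months. third-party")
print(wanted)

# Samples that should be rejected: a leading hyphen, and letters mixed
# with digits.
rejected_hyphen = find_words("-Publisher")
rejected_mixed = find_words("NY7xtb92dCTfvEjdmkDrUw==")
print(rejected_hyphen, rejected_mixed)
```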

This is not a regex task, because you cannot detect words with a regex. You need a dictionary to check your words against.
So I suggest using a regex to split your string on non-alphabetical characters and then checking whether all of the items exist in your dictionary. For example:
import re
words=re.split(r'[^A-Za-z]+',my_string)
print all(i in my_dict for i in words if i)
As an alternative, you can use nltk.corpus as your dictionary:
from nltk.corpus import wordnet
words=re.split(r'[^A-Za-z]+',my_string)
if all(wordnet.synsets(i) for i in words if i):
    pass  # do stuff
But if you want to use your own word list, you need to change your regex because it is incorrect. Instead, use re.split as above:
all_words = wanted1|wanted2|negators
with open("test.txt","r") as f :
    for line in f :
        for word in line.split():
            words=re.split(r'[^A-Za-z]+',word)
            if all(i in all_words for i in words if i):
                print word
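A self-contained sketch of the dictionary check described above (Python 3 syntax). The word set known_words and the helper is_all_known are hypothetical stand-ins for the asker's word lists, and lowercasing the pieces is an added assumption because those lists were built from lowercased text:

```python
import re

# Hypothetical stand-in for wanted1 | wanted2 | negators.
known_words = {"our", "company", "third", "party"}


def is_all_known(token):
    """True if every alphabetical piece of token is in known_words."""
    pieces = [p.lower() for p in re.split(r"[^A-Za-z]+", token) if p]
    # A token with no letters at all (e.g. "==") is not a word.
    return bool(pieces) and all(p in known_words for p in pieces)


line = "Our Company, third-party NY7xtb92 =="
accepted = [token for token in line.split() if is_all_known(token)]
print(accepted)
```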

Instead of using all sorts of complicated look-arounds, you can use \b to detect the boundary of words. This way, you can use e.g. \b[a-zA-Z]+(?:-[a-zA-Z]+)*\b
Example:
>>> p = r"\b[a-zA-Z]+(?:-[a-zA-Z]+)*\b"
>>> text = "This is some example text, with some multi-hyphen-words and invalid42 words in it."
>>> re.findall(p, text)
['This', 'is', 'some', 'example', 'text', 'with', 'some', 'multi-hyphen-words', 'and', 'words', 'in', 'it']
Update: Seems like this does not work too well, as it also detects fragments from URLs, e.g. www, sec and gov from http://www.sec.gov.
Instead, you might try this variant, using look-around explicitly stating the 'legal' characters:
r"""(?<![^\s("])[a-zA-Z]+(?:[-'][a-zA-Z]+)*(?=[\s.,:;!?")])"""
This seems to pass all your test-cases.
Let's dissect this regex:
(?<![^\s("]) - look-behind asserting that the word is preceded by space, quote or parens, but e.g. not a number (using double negation instead of a positive look-behind so the first word is matched, too)
[a-zA-Z]+ - the first part of the word
(?:[-'][a-zA-Z]+)* - optionally more word-segments after a ' or -
(?=[\s.,:;!?")]) - look-ahead asserting that the word is followed by space, punctuation, quote or parens
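The look-around variant can be tried out in the same way (Python 3 syntax; the sample strings are chosen for illustration). One thing the dissection implies is that the look-ahead needs a following character, so a word at the very end of the input with nothing after it would not match:

```python
import re

pattern = re.compile(r"""(?<![^\s("])[a-zA-Z]+(?:[-'][a-zA-Z]+)*(?=[\s.,:;!?")])""")

text = "This is some example text, with multi-hyphen-words and invalid42 words in it."
found = pattern.findall(text)
print(found)

# A leading hyphen fails the look-behind; letters followed by a digit
# fail the look-ahead.
print(pattern.findall("-Publisher "))
print(pattern.findall("NY7xtb92dCTfvEjdmkDrUw== "))
```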

Trying to parse a string to two separate strings based on case

I'm currently working on a python bot which retrieves information from a meta block on an HTML page. I get the content of the meta block, and now I am stuck on trying to parse it to two different strings.
An example of the content would be:
Lowercase Words WITH UPPERCASE CONTAINING 2 AND ALSO ', AND MANY MORE CHARACTERS
So far I have:
lowercase = ' '.join(w for w in content.split() if (not w.isupper()) and (not w.isdigit()))
uppercase = ' '.join(w for w in content.split() if (w.isupper() or w.isdigit()))
where the uppercase string is meant to contain everything that isn't the words "Lowercase" or "Words"
I have not been able to find much help with this sort of issue, and was wondering if anyone would know of a trick or work around? Thanks

Why not use regular expressions:
import re
s = "Lowercase Words WITH UPPERCASE CONTAINING 2 AND ALSO ', AND MANY MORE CHARACTERS"
match = re.match(r"(([^\s]*[a-z]+[^\s]*\s+)+)([^a-z]+)", s)
if match:
    lowercase = match.group(1)
    uppercase = match.group(3)
This will match a single-line string beginning with an arbitrary number of words, each of which must contain at least one lower case letter (a-z). Note that camel case is also recognized as a lower case string (e.g. "LowerCase"). The second part will then match the rest of the string, which must not contain any lower case letters. Note that group 1 keeps the whitespace that follows the last lower case word.
Let's try to understand the regexp now:
We want to match lower case words, so we write: [a-z]+
But this will only match words that are completely made up of lower-case letters. We want to allow other characters as well and treat the word as lower case if it contains at least one lower-case character. [^\s] will match any character that is not whitespace. We combine both patterns like this: [^\s]*[a-z]+[^\s]*
This matches any number of non-whitespace characters (even zero) followed by lower-case characters and then followed by any sequence of non-whitespace characters again. So this basically means we match any sequence that contains no whitespace and at least one lower-case letter.
Now we make a sequence of such words, delimited by whitespace: ([^\s]*[a-z]+[^\s]*\s+)+
Matching the upper case part is pretty straight-forward, because we only need to match everything (including whitespace) that is not a lower-case character: [^a-z]+
To make matches of both patterns available through groups, we wrap them in parentheses again:
lowercase: (([^\s]*[a-z]+[^\s]*\s+)+)
uppercase: ([^a-z]+)
Perhaps you need to adjust the pattern further, to suit your needs, but I believe this should be a good starting point...
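For comparison, the same rule (a word counts as lower case if it contains at least one lower-case letter) can be sketched without a regex by classifying each whitespace-separated token. This is an alternative to the pattern above, not a restatement of it, and the function name split_by_case is made up:

```python
def split_by_case(content):
    """Split content into (words with a lower-case letter, everything else)."""
    lower_words, other_words = [], []
    for word in content.split():
        if any(char.islower() for char in word):
            lower_words.append(word)
        else:
            other_words.append(word)
    return " ".join(lower_words), " ".join(other_words)


s = "Lowercase Words WITH UPPERCASE CONTAINING 2 AND ALSO ', AND MANY MORE CHARACTERS"
lowercase, uppercase = split_by_case(s)
print(lowercase)
print(uppercase)
```

Unlike the regex, this does not require the lower-case words to come first, which may or may not be what is wanted.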

Something like this?
>>> from string import punctuation as punc
>>> def ispunc(strs):
...     return all(x in punc for x in strs)
...
>>> strs = "Lowercase Words WITH UPPERCASE CONTAINING 2 AND ALSO ', AND MANY MORE CHARACTERS"
>>> ' '.join(w for w in strs.split() if (w.isupper() or w.isdigit() or ispunc(w)))
"WITH UPPERCASE CONTAINING 2 AND ALSO ', AND MANY MORE CHARACTERS"
>>> ' '.join(w for w in strs.split() if (not w.isupper()) and (not w.isdigit() and not ispunc(w)))
'Lowercase Words'
>>>
