Question: please debug the logic to produce the expected output
import re
text = "Hello there."
word_list = []
for word in text.split():
    tmp = re.split(r'(\W+)', word)
    word_list.extend(tmp)
print(word_list)
Output is:
['Hello', 'there', '.', '']
Problem: the output needs to match the expected list, without the extra empty entry.
Expected: ['Hello', 'there', '.']
First of all, note why you get that output: \W matches anything other than a letter, digit or underscore (it is equivalent to [^a-zA-Z0-9_]), so the pattern splits on the space (\s) and on the literal dot (.). Because "there." ends with a non-word character, re.split leaves a trailing empty string after it. (If you split the whole string directly, without the outer text.split() loop, you would get ['Hello', ' ', 'there', '.', ''].)
So if you want the expected output, you need to do some further processing, like below.
With a single split over the whole string:
import re
s = "Hello there."
l = list(filter(str.strip, re.split(r"(\W+)", s)))
print(l)
With your original loop, edited:
import re
text = "Hello there."
word_list = []
for word in text.split():
    tmp = re.split(r'(\W+)', word)
    word_list.extend(tmp)
print(list(filter(None, word_list)))
Output:
['Hello', 'there', '.']
Working Code: https://rextester.com/KWJN38243
Assuming word is "Hello there.", the results make sense. See the re.split documentation: "Split string by the occurrences of pattern. If capturing parentheses are used in pattern, then the text of all groups in the pattern are also returned as part of the resulting list."
You have put capturing parentheses in the pattern, so you are splitting the string on non-word characters and also returning the characters used for splitting.
Here is the string:
Hello there.
Here is how it is split:
Hello|there|
That means you have three values: 'Hello', 'there', and an empty string '' in the last place.
The values you split on are a space and a period.
So the output should be the three values plus the two characters we split on:
'Hello' - space - 'there' - period - empty string
which is exactly what I get:
import re
s = "Hello there."
t = re.split(r"(\W+)", s)
print(t)
Output:
['Hello', ' ', 'there', '.', '']
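For contrast, without the capturing group the separators are dropped (the trailing empty string remains either way); a quick check:
import re
s = "Hello there."
print(re.split(r"\W+", s))    # ['Hello', 'there', ''] - separators dropped
print(re.split(r"(\W+)", s))  # ['Hello', ' ', 'there', '.', ''] - separators kept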
Further Explanation
From your question it may be that you think that, because the string ends with a non-word character, there would be nothing "after" it, but this is not how splitting works. Think back to CSV files (which have been around forever), and consider a CSV file like this:
date,product,qty,price
20220821,P1,10,20.00
20220821,P2,10,
The above represents a CSV file with four fields, but in the last row the final field (which definitely exists as a position) is empty. It would be parsed as an empty string if we split on the comma.
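The same thing in Python: splitting that last row on the comma, the trailing empty field shows up as an empty string, just like the trailing '' from re.split:
print("20220821,P2,10,".split(","))   # ['20220821', 'P2', '10', '']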
Related
I have a .txt file and I want to iterate through its words. The problem is that I need to remove the punctuation marks before iterating through the words. I have tried this, but it isn't removing the punctuation marks.
file = open(file_name, "r")
for word in file.read().strip(",;.:- '").split():
    print word
file.close()
The problem with your current method is that .strip() doesn't really do what you want: it only removes leading and trailing characters, while you want to remove punctuation anywhere within the text.
Another problem is that there are many more potential punctuation characters (question marks, exclamations, unicode ellipses, em dashes) that wouldn't get filtered out by your list. Instead, you can use string.punctuation to get a wide range of characters (note that string.punctuation doesn't include some non-English characters, so its viability may depend on the source of your input):
import string
punctuation = set(string.punctuation)
text = ''.join(char for char in text if char not in punctuation)
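For reference, applied to a hypothetical sample line (the sentence here is just for illustration), this strips every punctuation character, including apostrophes:
import string
punctuation = set(string.punctuation)
sample = "Hello, world; it's a test."
cleaned = ''.join(char for char in sample if char not in punctuation)
print(cleaned)          # Hello world its a test
print(cleaned.split())  # ['Hello', 'world', 'its', 'a', 'test']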
An even faster method (shown in other answers on SO) uses translate() to strip out the characters (Python 2 shown here):
import string
text = text.translate(string.maketrans('', ''), string.punctuation)
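Note that string.maketrans and the two-argument translate() above are Python 2 only. On Python 3, an equivalent sketch builds a table that maps every punctuation character to None (i.e. deletes it):
import string
text = "Hello, world; it's a test."   # hypothetical sample
table = str.maketrans('', '', string.punctuation)
print(text.translate(table))          # Hello world its a test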
strip() only removes characters found at the beginning or end of a string.
So split() first to cut the line into words, then strip() to remove the punctuation.
import string
with open(file_name, "rt") as finput:
    for line in finput:
        for word in line.split():
            print word.strip(string.punctuation)
Or use a natural language aware library like nltk: http://www.nltk.org/
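A minimal sketch with nltk (this assumes nltk is installed and the required tokenizer data, e.g. 'punkt', has been downloaded; the sample text is just for illustration):
import string
from nltk import word_tokenize

text = "Hello there. It's a test!"
tokens = word_tokenize(text)
# drop tokens that are pure punctuation, e.g. '.' and '!'
words = [t for t in tokens if t not in string.punctuation]
print(words)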
You can try using the re module:
import re
with open(file_name) as f:
    for word in re.split(r'\W+', f.read()):
        print word
See the re documentation for more details.
Edit: in the case of non-ASCII characters, the previous code ignores them. If that matters, the following code can help:
import re
with open(file_name) as f:
    for word in re.compile(r'\W+', re.UNICODE).split(f.read().decode('utf8')):
        print word
The following code preserves apostrophes and blanks, and could easily be modified to preserve double quotation marks, if desired. It works by using a translation table based on a subclass of str. I think the code is fairly easy to understand. It might be made more efficient if necessary.
class SpecialTable(str):
    # Acts as a translation table for str.translate(): for each character's
    # code point, return the code point itself (keep it) or None (delete it).
    def __getitem__(self, chr):
        # keep spaces (32), apostrophes (39), digits (48-57)
        # and ASCII letters (65-90, 97-122); drop everything else
        if chr == 32 or chr == 39 or 48 <= chr <= 57 \
                or 65 <= chr <= 90 or 97 <= chr <= 122:
            return chr
        else:
            return None

specialTable = SpecialTable()

with open('temp2.txt') as inputText:
    for line in inputText:
        print(line)
        convertedLine = line.translate(specialTable)
        print(convertedLine)
        print(convertedLine.split(' '))
Here's typical output.
This! is _a_ single (i.e. 1) English sentence that won't cause any trouble, right?
This is a single ie 1 English sentence that won't cause any trouble right
['This', 'is', 'a', 'single', 'ie', '1', 'English', 'sentence', 'that', "won't", 'cause', 'any', 'trouble', 'right']
'nother one.
'nother one
["'nother", 'one']
I would remove the punctuation marks with the replace function after storing the words in a list like so:
punctuation = [',', ';', '.', ':', '-']
words = []
with open(file_name, "r") as f_r:
    for row in f_r:
        words.extend(row.split())   # extend (not append) keeps words a flat list

# remove each punctuation character from every word
for p in punctuation:
    words = [word.replace(p, '') for word in words]
I want to parse a file that contains some programming-language code. I want to get a list of all the symbols, etc.
I tried a few patterns and decided that this is the most successful so far:
pattern = r"\b(\w+|\W+)\b"
Using this on my text, which is something like:
string = "the quick brown(fox).jumps(over + the) = lazy[dog];"
re.findall(pattern, string)
will mostly give me the output I need, but it contains some characters I don't want and some unwanted formatting:
['the', ' ', 'quick', ' ', 'brown', '(', 'fox', ').', 'jumps', 'over',
' + ', 'the', ') = ', 'lazy', '[', 'dog']
My list contains some whitespace that I would like to get rid of and some double symbols, like (., that I would like to have as single chars. Of course I have to modify the \W+ to get this done, but I need a little help.
The other issue is that my regex doesn't match the ending ];, which I also need.
Why use \W+ (one or more) if you want single non-word characters in the output? Additionally, exclude whitespace by using a negated class. It also seems like you can drop the word boundaries.
re.findall(r"\w+|[^\w\s]", string)
This matches:
\w+ - one or more word characters
|[^\w\s] - or a single character that is neither a word character nor whitespace
See Ideone demo
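Applied to the question's sample string (a quick check):
import re
string = "the quick brown(fox).jumps(over + the) = lazy[dog];"
print(re.findall(r"\w+|[^\w\s]", string))
# ['the', 'quick', 'brown', '(', 'fox', ')', '.', 'jumps', '(', 'over',
#  '+', 'the', ')', '=', 'lazy', '[', 'dog', ']', ';']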
I have a string:
feature.append(freq_and_feature(text, freq))
I want a list containing each word of the string, like [feature, append, freq, and, feature, text, freq], where each word is a string, of course.
These strings are contained in a file called helper.txt, so I'm doing the following, as suggested by multiple SO posts, like the accepted answer for this one (Python: Split string with multiple delimiters):
import re
with open("helper.txt", "r") as helper:
    for row in helper:
        print re.split('\' .,()_', row)
However, I get the following, which is not what I want.
[' feature.append(freq_pain_feature(text, freq))\n']
re.split('\' .,()_', row)
This looks for the string ' .,()_ to split on. You probably meant
re.split('[\' .,()_]', row)
re.split takes a regular expression as its first argument. To say "this OR that" in a regular expression, you can write a|b and it will match either a or b. If you wrote ab, it would only match a followed by b. Luckily, so that we don't have to write '| |.|,|(|..., there is a shorthand: putting the characters inside [] means "match any one of these".
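For example (note that consecutive delimiters still produce empty strings unless you add + or filter them out):
import re
row = "feature.append(freq_and_feature(text, freq))"
print(re.split(r"[' .,()_]", row))
# ['feature', 'append', 'freq', 'and', 'feature', 'text', '', 'freq', '', '']
print(re.split(r"[' .,()_]+", row))
# ['feature', 'append', 'freq', 'and', 'feature', 'text', 'freq', '']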
It seems you want to split the string on non-word or underscore characters. Use:
import re
s = 'feature.append(freq_and_feature(text, freq))'
print([x for x in re.split(r'[\W_]+', s) if x])
# => ['feature', 'append', 'freq', 'and', 'feature', 'text', 'freq']
See the IDEONE demo
The [\W_]+ regex matches one or more characters that are either non-word characters (\W = [^a-zA-Z0-9_]) or underscores.
You can get rid of the if x if you remove initial and trailing non-word characters from the input string, e.g. re.sub(r'^[\W_]+|[\W_]+$', '', s).
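For example, with the same sample string:
import re
s = 'feature.append(freq_and_feature(text, freq))'
trimmed = re.sub(r'^[\W_]+|[\W_]+$', '', s)   # drop leading/trailing non-word chars
print(re.split(r'[\W_]+', trimmed))
# ['feature', 'append', 'freq', 'and', 'feature', 'text', 'freq']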
You can try this:
import re

row = 'feature.append(freq_and_feature(text, freq))'
result = re.split('[.(_,)]+', row)
result.pop()   # drop the trailing empty string
print result
This will result in:
['feature', 'append', 'freq', 'and', 'feature', 'text', ' freq']
I think you are trying to split on the basis of non-word characters. It should be
re.split(r'[^A-Za-z0-9]+', s)
[^A-Za-z0-9] is equivalent to [\W_].
Python Code
import re

s = 'feature.append(freq_and_feature(text, freq))'
print([x for x in re.split(r'[^A-Za-z0-9]+', s) if x])
This will also work:
import re

p = re.compile(r'[^\W_]+')
test_str = "feature.append(freq_and_feature(text, freq))"
print(re.findall(p, test_str))
Ideone Demo
I have a string as given below.
string = "'Sam007's Helsen007' is a 'good' boy's in 'demand6's6'"
I want to extract the string inside the quotes.
The output should look like:
['Sam007's Helsen007', 'good', 'demand6's6']
The regex I have written is:
re.findall("(?:[^a-zA-Z0-9]*')(.*?)(?:'[^a-zA-Z0-9*])", text)
But this gives the output:
["Sam007's Helsen007", 'good', "s in 'demand6's6"]
When I modify the regex to
re.findall("(?:[^a-zA-Z0-9]')(.*?)(?:'[^a-zA-Z0-9*])", text)
it gives me the output:
['good', "demand6's6"]
The second case seems more appropriate, but it can't handle the case where the string starts with a quote.
How can I handle this case?
st= "'Sam007's Helsen007' is a 'good' boy's in 'demand6's6'"
print re.findall(r"\B'.*?'\B",st)
Use \B, i.e. a non-word boundary.
Output: ["'Sam007's Helsen007'", "'good'", "'demand6's6'"]
If you look carefully at your string, you want an opening ' that has a non-word character (or the start of the string) before it, and a closing ' that has a non-word character (or the end of the string) after it, which is exactly what \B checks for here.
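If you want the values without the surrounding quotes (as in the expected output), you can slice them off each match:
import re
st = "'Sam007's Helsen007' is a 'good' boy's in 'demand6's6'"
matches = re.findall(r"\B'.*?'\B", st)
print([m[1:-1] for m in matches])
# ["Sam007's Helsen007", 'good', "demand6's6"]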
This question (Best way to strip punctuation from a string in Python) deals with stripping punctuation from an individual string. However, I'm hoping to read text from an input file, but only print out ONE COPY of all strings without ending punctuation. I have started something like this:
f = open('#file name ...', 'a+')
for x in set(f.read().split()):
    print x
But the problem is that if the input file has, for instance, this line:
This is not is, clearly is: weird
It treats the three different cases of "is" differently, but I want to ignore any punctuation and have it print "is" only once, rather than three times. How do I remove any kind of ending punctuation and then put the resulting string in the set?
Thanks for any help. (I am really new to Python.)
import re
for x in set(re.findall(r'\b\w+\b', f.read())):
    print x
This should be more able to distinguish words correctly.
This regular expression finds compact groups of alphanumerical characters (a-z, A-Z, 0-9, _).
If you want to find letters only (no digits and no underscore), then replace the \w with [a-zA-Z].
>>> re.findall(r'\b\w+\b', "This is not is, clearly is: weird")
['This', 'is', 'not', 'is', 'clearly', 'is', 'weird']
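And to print each word only once, as the question asks, wrap the result in a set (a quick check; set ordering is arbitrary):
import re
text = "This is not is, clearly is: weird"
print(set(re.findall(r'\b\w+\b', text)))
# e.g. set(['This', 'is', 'not', 'clearly', 'weird']) on Python 2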
You can use translation tables if you don't mind replacing your punctuation characters with whitespace. For example:
>>> from string import maketrans
>>> punctuation = ",;.:"
>>> replacement = "    "   # must be the same length as punctuation
>>> trans_table = maketrans(punctuation, replacement)
>>> 'This is not is, clearly is: weird'.translate(trans_table)
'This is not is  clearly is  weird'
# And for your case of creating a set of unique words.
>>> set('This is not is  clearly is  weird'.split())
set(['This', 'not', 'is', 'clearly', 'weird'])