How to ignore punctuation when counting characters in a string in Python

My homework asks me to write a function words_of_length(N, s) that picks the unique words of a given length from a string, ignoring punctuation.
What I am trying to do is:
def words_of_length(N, s):  # N is an integer, s is a string
    # this line should remove the punctuation from the string, but I don't know how to do it
    return [x for x in s if len(x) == N]  # this line should return a list of unique words of the given length
My problem is that I don't know how to remove punctuation. I did look at "best way to remove punctuation from string" and related questions, but those look too difficult for my level, and my teacher requires that the function contain no more than 2 lines of code.
Sorry that I can't format the code in my question properly; this is the first time I have asked a question here and there is much I need to learn, but please help me with this one. Thanks.

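Use string.strip(s[, chars]) (or, equivalently, the method s.strip(chars)):
https://docs.python.org/2/library/string.html
In your function, replace x with x.strip('.,:;!?'). Note that chars must be a string rather than a list, that strip only removes characters from the two ends of x, and that s must first be split into words (for x in s.split()), because iterating over s directly yields single characters.
Add more punctuation characters if needed.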

First of all, you need to create a new string without the characters you want to ignore (take a look at the string library, particularly string.punctuation), and then split() the resulting string (sentence) into substrings (words). Besides that, I suggest using type annotations instead of comments like those.
def words_of_length(n: int, s: str) -> list:
    return [x for x in ''.join(char for char in s if char not in __import__('string').punctuation).split() if len(x) == n]
>>> words_of_length(3, 'Guido? van, rossum. is the best!')
['van', 'the']
Alternatively, instead of string.punctuation you can define a variable with the characters you want to ignore yourself.

You can remove punctuation by using string.punctuation.
>>> from string import punctuation
>>> text = "text,. has ;:some punctuation."
>>> text = ''.join(ch for ch in text if ch not in punctuation)
>>> text # with no punctuation
'text has some punctuation'
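Putting the pieces above together, here is a minimal sketch of a two-line function body that also keeps the words unique, as the question asks. It assumes that "punctuation" means the ASCII characters in string.punctuation, that words are separated by whitespace, and that first-seen order should be kept. Using dict.fromkeys for de-duplication is my own choice and does not come from the answers above.

```python
import string

def words_of_length(n, s):
    # Delete every ASCII punctuation character, then split on whitespace.
    words = s.translate(str.maketrans('', '', string.punctuation)).split()
    # dict.fromkeys drops duplicates while keeping first-seen order.
    return list(dict.fromkeys(w for w in words if len(w) == n))

print(words_of_length(3, 'Guido? van, rossum. is the best! the end'))
# ['van', 'the', 'end']
```

If order does not matter, a set comprehension would work as well.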

Related

how to prevent regex matching substring of words?

I have a regex in Python and I want to prevent it from matching substrings. I want to add '#' at the beginning of words that consist of 4 to 15 alphanumeric or _ characters, but the regex also matches substrings of larger words. I have this function:
def add_atsign(sents):
    for i, sent in enumerate(sents):
        sents[i] = re.sub(r'([a-zA-Z0-9_]{4,15})', r'#\1', str(sent))
    return sents
And the example is:
mylist = list()
mylist.append("ali_s ali_t ali_u aabs:/t.co/kMMALke2l9")
add_atsign(mylist)
And the result is:
['#ali_s #ali_t #ali_u #aabs:/t.co/#kMMALke2l9']
As you can see, it puts '#' at the beginning of 'aabs' and 'kMMALke2l9', which is wrong.
I tried to edit the code as below:
def add_atsign(sents):
    for i, sent in enumerate(sents):
        sents[i] = re.sub(r'((^|\s)[a-zA-Z0-9_]{4,15}(\s|$))', r'#\1', str(sent))
    return sents
But the result becomes:
['#ali_s ali_t# ali_u aabs:/t.co/kMMALke2l9']
As you can see, it makes the wrong replacements.
The correct result I expect is:
"#ali_s #ali_t #ali_u aabs:/t.co/kMMALke2l9"
Could anyone help?
Thanks
This is a pretty interesting question. If I understand correctly, the issue is that you want to divide the string by spaces, and then do the replacement only if the entire word matches, without catching substrings.
I think the best way to do this is to first split by spaces, and then add anchors to your regex so that it matches only an entire string:
import re
def add_atsign(sents):
    new_list = []
    for string in sents:
        new_list.append(' '.join(re.sub(r'^([a-zA-Z0-9_]{4,15})$', r'#\1', w)
                                 for w in string.split()))
    return new_list
mylist = ["ali_s ali_t ali_u aabs:/t.co/kMMALke2l9"]
add_atsign(mylist)
['#ali_s #ali_t #ali_u aabs:/t.co/kMMALke2l9']
That is, we split, then replace only if the entire word matches, then rejoin.
By the way, your regex can be simplified to r'^(\w{4,15})$' (in Python 3, \w also matches non-ASCII letters and digits unless the re.ASCII flag is used):
def add_atsign(sents):
    new_list = []
    for string in sents:
        new_list.append(' '.join(re.sub(r'^(\w{4,15})$', r'#\1', w)
                                 for w in string.split()))
    return new_list
You can restrict matches to whitespace-separated words by adding (^|(?<=\s)) to the start and \s to the end of your first expression. Because of the trailing \s, a word at the very end of the string is not matched.
def add_atsign(sents):
    for i, sent in enumerate(sents):
        sents[i] = re.sub(r'((^|(?<=\s))[a-zA-Z0-9_]{4,15}\s)', r'#\1', str(sent))
    return sents
The result will be like this:
['#ali_s #ali_t #ali_u aabs:/t.co/kMMALke2l9']
I am not sure what you are trying to accomplish, but the reason it puts the # in the wrong places is that once you added \s or ^ to the regex, the whitespace became part of the match, so the # is placed before the whitespace.
You could try to split it into two checks:
check at the beginning of the string and put the # at the first position, and
check after every whitespace character and put the # at the second position.
I'm aware it's not optimal, but maybe I can help if you clarify in a bit more detail what the regex is supposed to match and what it shouldn't.
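Another way to express "whole word only", sketched here under the assumption that words are delimited by whitespace or by the ends of the string, is to use lookarounds so that the surrounding whitespace never becomes part of the match. The pattern and the name add_atsign_lookaround are illustrative and are not taken from the answers above.

```python
import re

# (?<!\S): not preceded by a non-whitespace character (start of string or whitespace)
# (?!\S):  not followed by a non-whitespace character (end of string or whitespace)
WORD = re.compile(r'(?<!\S)([a-zA-Z0-9_]{4,15})(?!\S)')

def add_atsign_lookaround(sents):
    # Build a new list instead of modifying the input in place.
    return [WORD.sub(r'#\1', str(sent)) for sent in sents]

print(add_atsign_lookaround(["ali_s ali_t ali_u aabs:/t.co/kMMALke2l9"]))
# ['#ali_s #ali_t #ali_u aabs:/t.co/kMMALke2l9']
```

Unlike a pattern that ends in \s, this version also tags a matching word at the very end of the string.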

How would I remove the Arabic prefix "ال" from an arabic string?

I have tried things like this, but there is no change between the input and output:
def remove_al(text):
    if text.startswith('ال'):
        text.replace('ال','')
    return text
text.replace returns the updated string but doesn't modify text itself; you should change the code to
text = text.replace(...)
Note that in Python strings are "immutable"; there's no way to change even a single character of a string; you can only create a new string with the value you want.
If you only want to remove the prefix ال and not every occurrence of ال in the string, I'd rather suggest using:
def remove_prefix_al(text):
    if text.startswith('ال'):
        return text[2:]
    return text
If you simply use text.replace('ال',''), this will replace all ال combinations:
Example
text = 'الاستقلال'
text.replace('ال','')
Output:
'استقل'
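The method str.lstrip can be used instead of rolling your own in this case, but use it with care: its argument is a set of characters, not a prefix, so it keeps stripping as long as the string starts with ا or ل.
Example text (alrashid) in Arabic: 'الرَشِيد'
text = 'الرَشِيد'
clean_text = text.lstrip('ال')
print(clean_text)
This works here because the third character, ر, is not in the set, whereas 'الاستقلال'.lstrip('ال') would also remove the ا that follows the prefix. Note that even though Arabic reads from right to left, lstrip strips the start of the string (which is visually on the right).
Also, as user 6502 noted, the issue in your code is that Python strings are immutable: the result of text.replace was discarded, so the function returned its input unchanged.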
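To make the difference between stripping a character set and removing a prefix concrete, here is a small sketch. The function name remove_al_prefix is illustrative; on Python 3.9 or later, text.removeprefix('ال') should do the same job as the explicit check.

```python
AL = 'ال'

def remove_al_prefix(text):
    # Remove the prefix at most once, and only at the very start.
    return text[len(AL):] if text.startswith(AL) else text

word = 'الاستقلال'
prefix_removed = remove_al_prefix(word)   # drops exactly the two prefix characters
set_stripped = word.lstrip(AL)            # drops every leading character found in the set

print(len(word), len(prefix_removed), len(set_stripped))
# 9 7 6
```

The lstrip result is one character shorter because the letter after the prefix also belongs to the stripped set.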
"ال" as prefix is quite complex in Arabic that you will need Regex to accurately separate it from its stem and other prefixes. The following code will help you isolate "ال" from most words:
import re
text = 'والشعر كالليل أسود'
words = text.split()
for word in words:
    alx = re.search(r'''^
        ([وف])?
        ([بك])?
        (لل)?
        (ال)?
        (.*)$''', word, re.X)
    groups = [alx.group(1), alx.group(2), alx.group(3), alx.group(4), alx.group(5)]
    groups = [x for x in groups if x]
    print(word, groups)
Running that (in Jupyter) prints each word followed by the list of its non-empty parts: any matched prefixes, "ال" if present, and the remainder of the word.

How can lines with other characters than letters be removed from output?

I have code that extracts bigrams from a large corpus and concatenates them to get unigrams: 'may', 'be' --> maybe. The corpus contains, of course, a lot of punctuation, but I also discovered that it contains other characters such as emojis... My plan was to put the punctuation characters in a list and print a line only if none of them occur in it. Maybe I should change my approach and print ONLY the lines that contain letters and no other characters, since I don't know what kinds of characters are in the corpus. How can this be done? I do need to keep these other characters for the first part of the code, so that bigrams that don't actually exist are printed. The last lines of my code are at the moment:
counted = collections.Counter(grams)
for gram, count in sorted(counted.items()):
    s = ''
    print (s.join(gram))
And the output I get is:
!aku
!bet
!brå
!båda
These lines won't be of any use for me... Would really appreciate some help! :)
If you want to check that each string contains only letters you can probably use the isalpha() method.
>>> '!båda'.isalpha()
False
>>> 'båda'.isalpha()
True
As you can see from the example, this method should recognize any unicode letter, not just ascii.
To filter out strings that contain a non-letter character, the code can check each string for characters outside the set of ASCII letters, after accents have been stripped:
# coding=utf-8
import string
import unicodedata
source_strings = [u'aku', u'bet', u'brå', u'båda', u'!båda']
valid_chars = set(string.ascii_letters)
valid_strings = [s for s in source_strings if
                 set(unicodedata.normalize('NFKD', s).encode('ascii', 'ignore').decode('ascii')) <= valid_chars]
# valid_strings == [u'aku', u'bet', u'brå', u'båda']
# u'!båda' was not included.
# Caveat: characters that 'ignore' drops entirely (e.g. emoji) do not make a string invalid.
You can use the unicodedata module to classify the characters:
import unicodedata
unigram = ''.join(gram)
# 'Ll' is the Unicode category for lowercase letters
if all(unicodedata.category(char) == 'Ll' for char in unigram):
    print(unigram)
If you only want to remove certain characters from your lines, you can filter them out with a simple replace before processing each line:
sourceList = ['!aku', '!bet', '!brå', '!båda']
newList = []
for word in sourceList:
    for special in ['!','&','å']:
        word = word.replace(special,'')
    newList.append(word)
Then you can do what is needed for your bigram exercise. Hope this helps.
Second query: if there are many different characters to exclude, you can always use isalpha() on each string:
sourceList = ['!aku', '!bet', 'nor mal alpha', '!brå', '!båda']
newList = [word for word in sourceList if word.isalpha()]
In this case you keep only the strings made up entirely of letters. Hope this clarifies the second query.
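Applied to the loop from the question, the isalpha() check could look like the following sketch. The grams list here is made-up sample data standing in for the real corpus bigrams, and the name letters_only is illustrative.

```python
import collections

# Hypothetical sample data; in the question, grams comes from the corpus.
grams = [('may', 'be'), ('!', 'aku'), ('bå', 'da'), ('may', 'be'), ('ok', '\U0001F600')]

counted = collections.Counter(grams)
letters_only = []
for gram, count in sorted(counted.items()):
    unigram = ''.join(gram)
    # str.isalpha() is True only if every character is a (Unicode) letter,
    # so punctuation and emoji both cause the unigram to be skipped.
    if unigram.isalpha():
        letters_only.append(unigram)

print(letters_only)
# ['båda', 'maybe']
```

The original bigrams stay untouched in counted, so the earlier part of the code can still see the other characters.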

Find multiple list items within a string

I am working on a problem in which I am trying to count the number of vowels in a string. I wrote the following code:
def vowel_count(s):
    count = 0
    for i in s:
        if i == 'a' or i == 'e' or i == 'i' or i == 'o' or i == 'u':
            count += 1
    print(count)
vowel_count(s)
While the above works, I would like to know how to do this more simply by creating a list of all vowels and then checking each character against it in my if statement, instead of using multiple boolean checks. I'm sure there's an even more elegant way to do this with imported modules, but I'm interested in this type of solution.
Relative noob... appreciate the help.
No need to create a list, you can use a string like 'aeiou' to do this:
>>> vowels = 'aeiou'
>>> s = 'fooBArSpaM'
>>> sum(c.lower() in vowels for c in s)
4
You can actually treat a string similarly to how you would a list in python (as they are both iterables), for example
vowels = 'aeiou'
sum(1 for i in s if i.lower() in vowels)
For completeness' sake, others suggest vowels = set('aeiou') so that multi-character checks such as 'eio' in vowels do not match. However, note that if you are iterating over your string one character at a time in a for loop, you won't run into this problem.
A weird way around this is the following (Python 2 only; in Python 3, str.translate takes a single table argument):
vowels = len(s) - len(s.translate(None, 'aeiou'))
What you are doing with s.translate(None, 'aeiou') is creating a copy of the string with all vowels removed, and then checking how much the length differs.
Special note: the way I'm using it is even part of the official documentation
What is a vowel?
Note, though, that the method presented here only removes exactly the characters present in the second parameter of the translate string method. In particular, this means that it will not remove uppercase versions of the characters, let alone accented ones (like áèïôǔ).
Uppercase vowels
Solving the uppercase ones is kind of easy: just do the replacement on a copy of the string that has been converted to lowercase:
vowels = len(s) - len(s.lower().translate(None, 'aeiou'))
Accented vowels
This one is a little bit more convoluted, but thanks to this other SO question we know the best way to do it. The resulting code would be:
from unicodedata import normalize
# translate special characters to unaccented versions (s must be a unicode string)
normalized_str = normalize('NFD', s).encode('ascii', 'ignore')
# compare against the normalized string, since characters with no ASCII form were dropped
vowels = len(normalized_str) - len(normalized_str.lower().translate(None, 'aeiou'))
You can filter using a list comprehension, like so:
len([letter for letter in s if letter in 'aeiou'])
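The translate-based snippets above rely on the Python 2 signature of str.translate. As a rough Python 3 sketch of the same idea, assuming that only a, e, i, o, u and their accented forms count as vowels, one can build a deletion table with str.maketrans. The function name count_vowels is illustrative.

```python
import unicodedata

def count_vowels(s):
    # NFD splits accented letters into a base letter plus combining marks.
    text = unicodedata.normalize('NFD', s).lower()
    # The third argument of str.maketrans lists the characters to delete.
    without_vowels = text.translate(str.maketrans('', '', 'aeiou'))
    # The combining marks are present in both strings, so they cancel out.
    return len(text) - len(without_vowels)

print(count_vowels('fooBArSpaM'))  # 4
```

The generator-based sum(c.lower() in vowels for c in s) shown earlier remains the simpler choice when accents are not a concern.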

Python regular expression to remove space and capitalize letters where the space was?

I want to create a list of tags from a single user-supplied input box, separated by commas, and I'm looking for some expression(s) that can help automate this.
What I want is to take the input field and:
remove all double+ whitespace, tabs, and new lines (leaving just single spaces)
remove ALL (single and double+) quotation marks and punctuation, except for commas, of which there can be only one
in between each comma, I want Something Like Title Case, but excluding the first word and not at all for single words, so that when the last spaces are removed, the tag comes out as 'somethingLikeTitleCase' or just 'something' or 'twoWords'
and finally, remove all remaining spaces
Here's what I have gathered around SO so far:
def no_whitespace(s):
    """Remove all whitespace & newlines. """
    return re.sub(r"(?m)\s+", "", s)
# remove spaces, newlines, all whitespace
# http://stackoverflow.com/a/42597/523051
tag_list = ''.join(no_whitespace(tags_input))
# split into a list at commas
tag_list = tag_list.split(',')
# remove any empty strings (since I currently don't know how to remove double commas)
# http://stackoverflow.com/questions/3845423/remove-empty-strings-from-a-list-of-strings
tag_list = filter(None, tag_list)
I'm lost, though, when it comes to modifying that regex to remove all the punctuation except commas, and I don't even know where to begin with the capitalizing.
Any thoughts to get me going in the right direction?
As suggested, here are some sample inputs = desired_outputs
form: 'tHiS iS a tAg, 'whitespace' !&#^ , secondcomment , no!punc$$, ifNOSPACESthenPRESERVEcaps' should come out as
['thisIsATag', 'secondcomment', 'noPunc', 'ifNOSPACESthenPRESERVEcaps']
Here's an approach to the problem (that doesn't use any regular expressions, although there's one place where it could). We split up the problem into two functions: one function which splits a string into comma-separated pieces and handles each piece (parseTags), and one function which takes a string and processes it into a valid tag (sanitizeTag). The annotated code is as follows:
# This function takes a string with commas separating raw user input, and
# returns a list of valid tags made by sanitizing the strings between the
# commas.
def parseTags(str):
    # First, we split the string on commas.
    rawTags = str.split(',')
    # Then, we sanitize each of the tags. If sanitizing gives us back None,
    # then the tag was invalid, so we leave those cases out of our final
    # list of tags. We can use None as the predicate because sanitizeTag
    # will never return '', which is the only falsy string.
    return filter(None, map(sanitizeTag, rawTags))
# This function takes a single proto-tag---the string in between the commas
# that will be turned into a valid tag---and sanitizes it. It either
# returns an alphanumeric string (if the argument can be made into a valid
# tag) or None (if the argument cannot be made into a valid tag; i.e., if
# the argument contains only whitespace and/or punctuation).
def sanitizeTag(str):
    # First, we turn non-alphanumeric characters into whitespace. You could
    # also use a regular expression here; see below.
    str = ''.join(c if c.isalnum() else ' ' for c in str)
    # Next, we split the string on spaces, ignoring leading and trailing
    # whitespace.
    words = str.split()
    # There are now three possibilities: there are no words, there was one
    # word, or there were multiple words.
    numWords = len(words)
    if numWords == 0:
        # If there were no words, the string contained only spaces (and/or
        # punctuation). This can't be made into a valid tag, so we return
        # None.
        return None
    elif numWords == 1:
        # If there was only one word, that word is the tag, no
        # post-processing required.
        return words[0]
    else:
        # Finally, if there were multiple words, we camel-case the string:
        # we lowercase the first word, capitalize the first letter of all
        # the other words and lowercase the rest, and finally stick all
        # these words together without spaces.
        return words[0].lower() + ''.join(w.capitalize() for w in words[1:])
And indeed, if we run this code, we get:
>>> parseTags("tHiS iS a tAg, \t\n!&#^ , secondcomment , no!punc$$, ifNOSPACESthenPRESERVEcaps")
['thisIsATag', 'secondcomment', 'noPunc', 'ifNOSPACESthenPRESERVEcaps']
There are two points in this code that are worth clarifying. First is the use of str.split() in sanitizeTag. This will turn ' a b c ' into ['a','b','c'], whereas str.split(' ') would produce ['','a','b','c','']. This is almost certainly the behavior you want, but there's one corner case. Consider the string tAG$. The $ gets turned into a space, and is stripped out by the split; thus, this gets turned into tAG instead of tag. This is probably what you want, but if it isn't, you have to be careful. What I would do is change that line to words = re.split(r'\s+', str), which will split the string on whitespace but leave in the leading and trailing empty strings; however, I would also change parseTags to use rawTags = re.split(r'\s*,\s*', str). You must make both these changes; 'a , b , c'.split(',') becomes ['a ', ' b ', ' c'], which is not the behavior you want, whereas r'\s*,\s*' deletes the space around the commas too. If you ignore leading and trailing whitespace, the difference is immaterial; but if you don't, then you need to be careful.
Finally, there's the non-use of regular expressions, and instead the use of str = ''.join(c if c.isalnum() else ' ' for c in str). You can, if you want, replace this with a regular expression. (Edit: I removed some inaccuracies about Unicode and regular expressions here.) Ignoring Unicode, you could replace this line with
str = re.sub(r'[^A-Za-z0-9]', ' ', str)
This uses [^...] to match everything but the listed characters: ASCII letters and numbers. However, it's better to support Unicode, and it's easy, too. The simplest such approach is
str = re.sub(r'\W', ' ', str, flags=re.UNICODE)
Here, \W matches non-word characters; a word character is a letter, a number, or the underscore. With flags=re.UNICODE specified (not available before Python 2.7; you can instead use r'(?u)\W' for earlier versions and 2.7), letters and numbers are both any appropriate Unicode characters; without it, they're just ASCII. If you don't want the underscore, you can add |_ to the regex to match underscores as well, replacing them with spaces too:
str = re.sub(r'\W|_', ' ', str, flags=re.UNICODE)
This last one, I believe, matches the behavior of my non-regex-using code exactly.
Also, here's how I'd write the same code without those comments; this also allows me to eliminate some temporary variables. You might prefer the code with the variables present; it's just a matter of taste.
def parseTags(str):
    return filter(None, map(sanitizeTag, str.split(',')))
def sanitizeTag(str):
    words = ''.join(c if c.isalnum() else ' ' for c in str).split()
    numWords = len(words)
    if numWords == 0:
        return None
    elif numWords == 1:
        return words[0]
    else:
        return words[0].lower() + ''.join(w.capitalize() for w in words[1:])
To handle the newly-desired behavior, there are two things we have to do. First, we need a way to fix the capitalization of the first word: lowercase the whole thing if the first letter's lowercase, and lowercase everything but the first letter if the first letter's upper case. That's easy: we can just check directly. Secondly, we want to treat punctuation as completely invisible: it shouldn't uppercase the following words. Again, that's easy—I even discuss how to handle something similar above. We just filter out all the non-alphanumeric, non-whitespace characters rather than turning them into spaces. Incorporating those changes gives us
def parseTags(str):
    return filter(None, map(sanitizeTag, str.split(',')))
def sanitizeTag(str):
    # keep only alphanumeric and whitespace characters; joining the kept
    # characters works on both Python 2 and Python 3
    words = ''.join(c for c in str if c.isalnum() or c.isspace()).split()
    numWords = len(words)
    if numWords == 0:
        return None
    elif numWords == 1:
        return words[0]
    else:
        words0 = words[0].lower() if words[0][0].islower() else words[0].capitalize()
        return words0 + ''.join(w.capitalize() for w in words[1:])
Running this code gives us the following output
>>> parseTags("tHiS iS a tAg, AnD tHIs, \t\n!&#^ , se#%condcomment$ , No!pUnc$$, ifNOSPACESthenPRESERVEcaps")
['thisIsATag', 'AndThis', 'secondcomment', 'NopUnc', 'ifNOSPACESthenPRESERVEcaps']
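The answer's parseTags relies on filter returning a list, which is Python 2 behaviour. As a sketch of how the first variant (punctuation treated as a word break) might look on Python 3, assuming the same rules as described above, here is a version using a regular expression and a list comprehension. The names parse_tags and sanitize_tag are illustrative.

```python
import re

def sanitize_tag(raw):
    # Turn every non-alphanumeric character (including "_") into a space,
    # then split on runs of whitespace.
    words = re.sub(r'[\W_]', ' ', raw).split()
    if not words:
        return None                      # nothing usable between the commas
    if len(words) == 1:
        return words[0]                  # single word: preserve its caps
    return words[0].lower() + ''.join(w.capitalize() for w in words[1:])

def parse_tags(text):
    tags = (sanitize_tag(piece) for piece in text.split(','))
    return [tag for tag in tags if tag]  # drop the None results

print(parse_tags("tHiS iS a tAg, \t\n!&#^ , secondcomment , no!punc$$, ifNOSPACESthenPRESERVEcaps"))
# ['thisIsATag', 'secondcomment', 'noPunc', 'ifNOSPACESthenPRESERVEcaps']
```

For str patterns in Python 3, \W is Unicode-aware by default, so no extra flag is needed here.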
You could use a white list of characters allowed to be in a word, everything else is ignored:
import re
def camelCase(tag_str):
    words = re.findall(r'\w+', tag_str)
    nwords = len(words)
    if nwords == 1:
        return words[0]  # leave unchanged
    elif nwords > 1:  # make it camelCaseTag
        return words[0].lower() + ''.join(map(str.title, words[1:]))
    return ''  # no word characters
This example uses \w word characters.
Example
tags_str = """ 'tHiS iS a tAg, 'whitespace' !&#^ , secondcomment , no!punc$$,
ifNOSPACESthenPRESERVEcaps' """
print("\n".join(filter(None, map(camelCase, tags_str.split(',')))))
Output
thisIsATag
whitespace
secondcomment
noPunc
ifNOSPACESthenPRESERVEcaps
I think this should work
def toCamelCase(s):
    # remove all punctuation
    # modify to include other characters you may want to keep
    s = re.sub(r"[^a-zA-Z0-9\s]", "", s)
    # remove leading spaces
    s = re.sub(r"^\s+", "", s)
    # camel case
    s = re.sub(r"\s[a-z]", lambda m: m.group(0)[1].upper(), s)
    # remove all punctuation and spaces
    s = re.sub(r"[^a-zA-Z0-9]", "", s)
    return s
tag_list = [s for s in (toCamelCase(s.lower()) for s in tag_list.split(',')) if s]
The key here is to make use of re.sub to make the replacements you want.
EDIT: Doesn't preserve caps, but does handle uppercase strings with spaces
EDIT: Moved "if s" after the toCamelCase call
