Counting punctuation in text using Python and regex

I am trying to count the number of times punctuation characters appear in a novel. For example, I want to find the occurrences of question marks and periods along with all the other non-alphanumeric characters, and then write the counts to a CSV file. I am not sure how to do the regex because I don't have much experience with Python. Can someone help me out?
import csv, re, string
from collections import Counter

texts = string.punctuation
counts = dict(Counter(w.lower() for w in re.findall(r"\w+", open(cwd + "/" + book).read())))
writer = csv.writer(open("author.csv", 'a'))
writer.writerow([counts.get(fieldname, 0) for fieldname in texts])

In [1]: from string import punctuation
In [2]: from collections import Counter
In [3]: counts = Counter(open('novel.txt').read())
In [4]: punctuation_counts = {k: v for k, v in counts.items() if k in punctuation}

from string import punctuation
from collections import Counter
with open('novel.txt') as f:  # closes the file for you, which is important!
    c = Counter(ch for line in f for ch in line if ch in punctuation)
This also avoids loading the whole novel into memory at once.
Btw this is what string.punctuation looks like:
>>> punctuation
'!"#$%&\'()*+,-./:;<=>?#[\\]^_`{|}~'
You may want to add or remove symbols from it depending on your needs.
Also, Counter defines a __missing__ that simply returns 0, so instead of converting it down into a dictionary and then calling .get(x, 0), just leave it as a Counter and access it like c[x]; if the key doesn't exist, its count is 0. There is no need to downgrade Counters into dicts just because of the scary-looking Counter([...]) you see when you print one; Counters are dictionaries too and deserve respect. With a plain dict you would write:
writer.writerow([counts.get(c, 0) for c in punctuation])
If you leave your counter you can just do this:
writer.writerow([counts[c] for c in punctuation])
which is much simpler.
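Putting the whole pipeline together for the CSV output the question asks for, here is a minimal self-contained sketch (the filenames novel.txt and author.csv follow the examples above; the newline='' argument and the with blocks are my additions):
import csv
from collections import Counter
from string import punctuation

# Count punctuation characters line by line, without reading the whole file.
with open('novel.txt') as f:
    counts = Counter(ch for line in f for ch in line if ch in punctuation)

# Append one row of counts, one column per character in string.punctuation.
with open('author.csv', 'a', newline='') as out:
    csv.writer(out).writerow([counts[ch] for ch in punctuation])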

import re
def count_puncts(x):
    # Substitute every non-word character with '' and return the number of
    # replacements made. Note that \W also matches whitespace, so spaces and
    # newlines are included in the count.
    new_str, count = re.subn(r'\W', '', x)
    return count
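A quick usage check (my own example string, not from the answer); note again that \W also matches the space, so it is included in the count:
print(count_puncts("Hello, world!"))  # 3 -> ',', ' ' and '!'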

The code you have is very close to what you'd need if you were counting words; in that case, the only modification you'd probably have to make is to change the last line to this:
writer.writerows(counts.items())
Unfortunately, you're not trying to count words here. If you're looking for counts of single characters, I'd avoid using regular expressions and go straight to count. Your code might look like this:
book_text = open(cwd + "/" + book).read()
counts = {}
for character in texts:
    counts[character] = book_text.count(character)
writer.writerows(counts.items())
As you might be able to tell, this makes a dictionary with the characters as keys and the number of times that character appears in the text as the value. Then we write it as we would have done for counting words.

Using curses:
import curses.ascii

str1 = "real, and? or, and? what."
t = (c for c in str1 if curses.ascii.ispunct(c))
d = dict()
for p in t:
    d[p] = 1 if p not in d else d[p] + 1
# d is now {',': 2, '?': 2, '.': 1}

Related

Pythonic way to count the total number of letters in a string in Python

I just started learning Python, and I can think of two ways to count the letters in a string (ignoring numbers, punctuation, and whitespace).
Using a for loop:
counter = 0
for c in s:
    if c.isalpha():
        counter += 1
print(counter)
Create a list of the letters and count the length of that list (this creates an unwanted intermediate list):
import re
s = "Nice. To. Meet. You."
letters = re.findall("([a-z]|[A-Z])", s)
counter = len(letters)
print(counter)
Can anyone tell me if there is a "pythonic" way to achieve the same result, like a one-liner or a function call that returns the answer as an int?
Thank you very much.
Your first approach is perfectly pythonic, and probably the way to go. You could simplify it slightly using filter or a list comprehension:
s = "Nice. To. Meet. You."
len(list(filter(str.isalpha, s)))
# 13
Or:
len([i for i in s if i.isalpha()])
# 13
Your second approach isn't really advisable, since you don't really need to use a regex for this. Note that you could simplify that pattern to ([a-zA-Z]), by the way.
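Another common idiom, not mentioned in the answers above, avoids building an intermediate list by summing the boolean results directly (True counts as 1):
s = "Nice. To. Meet. You."
print(sum(c.isalpha() for c in s))  # 13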
You can use regex to remove anything that is not a letter and then count the length of the string:
import re
s = "Nice. To. Meet. You."
counter = len(re.sub(r'[^a-zA-Z]','',s))
Using regex (re), you can find all letters and count their occurrences:
len(re.findall(r"[a-zA-Z]", string))

Find semordnilap (reverse anagram) of words in a string

I'm trying to take a string input, like a sentence, and find all the words that have their reverse words in the sentence. I have this so far:
s = "Although he was stressed when he saw his desserts burnt, he managed to stop the pots from getting ruined"
def semordnilap(s):
    s = s.lower()
    b = "!@#$,"
    for char in b:
        s = s.replace(char, "")
    s = s.split(' ')
    dict = {}
    index = 0
    for i in range(0, len(s)):
        originalfirst = s[index]
        sortedfirst = ''.join(sorted(str(s[index])))
        for j in range(index + 1, len(s)):
            next = ''.join(sorted(str(s[j])))
            if sortedfirst == next:
                dict.update({originalfirst: s[j]})
        index += 1
    print(dict)
semordnilap(s)
So this works for the most part, but if you run it, you can see that it's also pairing "he" with "he" as an anagram, which is not what I am looking for. Any suggestions on how to fix that, and also on how to make the run time faster if I were to input a large text file instead?
You could split the string into a list of words and then compare lowercase versions of all combinations where one of the pair is reversed. Following example uses re.findall() to split the string into a list of words and itertools.combinations() to compare them:
import itertools
import re
s = "Although he was stressed when he saw his desserts burnt, he managed to stop the pots from getting ruined"
words = re.findall(r'\w+', s)
pairs = [(a, b) for a, b in itertools.combinations(words, 2) if a.lower() == b.lower()[::-1]]
print(pairs)
# OUTPUT
# [('was', 'saw'), ('stressed', 'desserts'), ('stop', 'pots')]
EDIT: I still prefer the solution above, but per your comment regarding doing this without importing any packages, see below. However, note that str.translate() used this way may have unintended consequences depending on the nature of your text (like stripping @ from email addresses); in other words, you may need to deal with punctuation more carefully than this. Also, I would typically import string and use string.punctuation rather than the literal string of punctuation characters passed to str.translate(), but I avoided that below in keeping with your request to do this without imports.
s = "Although he was stressed when he saw his desserts burnt, he managed to stop the pots from getting ruined"
words = s.translate(str.maketrans('', '', '!"#$%&\'()*+,-./:;<=>?@[\]^_`{|}~')).split()
length = len(words)
pairs = []
for i in range(length - 1):
    for j in range(i + 1, length):
        if words[i].lower() == words[j].lower()[::-1]:
            pairs.append((words[i], words[j]))
print(pairs)
# OUTPUT
# [('was', 'saw'), ('stressed', 'desserts'), ('stop', 'pots')]
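Regarding the run-time question: both versions above compare every pair of words, which is quadratic in the number of words. For a large text, a linear-time sketch (my addition, not part of the original answer) keeps a set of the lowercased words seen so far and checks each new word's reverse against it:
import re

s = "Although he was stressed when he saw his desserts burnt, he managed to stop the pots from getting ruined"
words = re.findall(r'\w+', s)

seen = set()
pairs = []
for word in words:
    lower = word.lower()
    reverse = lower[::-1]
    if reverse in seen and reverse != lower:  # skip palindromes such as 'he'
        pairs.append((reverse, word))
    seen.add(lower)

print(pairs)
# [('was', 'saw'), ('stressed', 'desserts'), ('stop', 'pots')]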

How to count only the words that I want?

I want to count only words of a dictionary.
For example, there is a text:
Children can bye (paid) by credit card.
I want to count just paid, but my code counts (paid).
import re, sys
d = {}
m = "children can bye (paid) by credit card."
n = m.split()
for i in n:
    d[i] = 0
for j in n:
    d[j] = d[j] + 1
Is there any advice?
You can split the string with the following regex, which splits on non-word characters:
import re
n = re.split(r'\W+', m)
You can check the syntax here.
You just need to remove the punctuation from your individual tokens. Assuming you want to remove all the punctuation, take a look at the string module. Then (for example), you can go through each token and remove the punctuation. You can do this with one list comprehension:
import string

words = [''.join(ch for ch in token if ch not in string.punctuation)
         for token in m.split()]
All this code does is run through each character (ch) in each token (the results of m.split()). It keeps every character except those in string.punctuation, which it strips out. Of course, if you want a different set of characters (say, you want to allow apostrophes), you can define that set and use it instead.
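To finish the count the question asks for, the cleaned tokens can then be fed into a collections.Counter (a small sketch building on the snippet above; the Counter step is my addition):
import string
from collections import Counter

m = "children can bye (paid) by credit card."
words = [''.join(ch for ch in token if ch not in string.punctuation)
         for token in m.split()]
print(Counter(words)['paid'])  # 1 -- counted despite the parentheses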

String replacement with dictionary, complications with punctuation

I'm trying to write a function process(s, d) to replace abbreviations in a string with their full meaning by using a dictionary, where s is the input string and d is the dictionary. For example:
>>> d = {'ASAP': 'as soon as possible'}
>>> s = "I will do this ASAP. Regards, X"
>>> process(s, d)
"I will do this as soon as possible. Regards, X"
I have tried using the split function to separate the string and compare each part with the dictionary.
def process(s):
    return ''.join(d[ch] if ch in d else ch for ch in s)
However, it returns the exact same string. I suspect the code doesn't work because of the full stop after ASAP in the original string. If so, how do I ignore the punctuation and get ASAP to be replaced?
Here is a way to do it with a single regex:
In [24]: d = {'ASAP':'as soon as possible', 'AFAIK': 'as far as I know'}
In [25]: s = 'I will do this ASAP, AFAIK. Regards, X'
In [26]: re.sub(r'\b(?:' + '|'.join(d.keys()) + r')\b', lambda m: d[m.group(0)], s)
Out[26]: 'I will do this as soon as possible, as far as I know. Regards, X'
Unlike versions based on str.replace(), this observes word boundaries and therefore won't replace abbreviations that happen to appear in the middle of other words (e.g. "etc" in "fetch").
Also, unlike most (all?) other solutions presented thus far, it iterates over the input string just once, regardless of how many search terms there are in the dictionary.
You can do something like this:
def process(s, d):
    for key in d:
        s = s.replace(key, d[key])
    return s
Here is a working solution: use re.split(), and split by word boundaries (preserving the interstitial characters):
''.join(d.get(word, word) for word in re.split(r'(\W+)', s))
One significant difference that this code has from Vaughn's or Sheena's answer is that this code takes advantage of the O(1) lookup time of the dictionary, while their solutions look at every key in the dictionary. This means that when s is short and d is very large, their code will take significantly longer to run. Furthermore, parts of words will still be replaced in their solutions: if d = { "lol": "laugh out loud" } and s="lollipop" their solutions will incorrectly produce "laugh out loudlipop".
Use regular expressions:
re.sub(pattern, replacement, s)
In your application:
ret = s
for key in d:
    ret = re.sub(r'\b' + key + r'\b', d[key], ret)
return ret
\b matches the beginning or end of a word. Thanks Paul for the comment
Instead of splitting by spaces, split on non-word characters:
re.split(r"\W+", s)
It will split on anything that's not a character that would be part of a word.
Python 3.2:
[s.replace(i, v) for i, v in d.items()]
This is string replacement as well (+1 to @VaughnCato). This uses the reduce function (from functools import reduce on Python 3) to iterate through your dictionary, replacing any instances of the keys in the string with the values. s in this case is the accumulator, which is reduced (i.e. fed to the replace function) on every iteration, maintaining all past replacements (also, per @PaulMcGuire's point above, this replaces keys starting with the longest and ending with the shortest).
In [1]: d = {'ASAP':'as soon as possible', 'AFAIK': 'as far as I know'}
In [2]: s = 'I will do this ASAP, AFAIK. Regards, X'
In [3]: reduce(lambda x, y: x.replace(y, d[y]), sorted(d, key=lambda i: len(i), reverse=True), s)
Out[3]: 'I will do this as soon as possible, as far as I know. Regards, X'
As for why your function didn't return what you expected: when you iterate through s, you are actually iterating through the characters of the string, not the words. Your version could be tweaked by iterating over s.split() (which would be a list of the words), but you then run into the issue that the punctuation causes words not to match your dictionary. You can get them to match by importing string and stripping string.punctuation from each word, but that will remove the punctuation from the final string (so the regex approach would likely be the best option if plain replacement doesn't work).
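A small sketch of that tweak (my addition, not part of the original answer): it replaces ASAP correctly, but, as described, the stripped punctuation is lost from the output:
import string

def process(s, d):
    # Strip leading/trailing punctuation before looking each word up;
    # the stripped punctuation does not come back in the result.
    words = (w.strip(string.punctuation) for w in s.split())
    return ' '.join(d.get(w, w) for w in words)

d = {'ASAP': 'as soon as possible'}
print(process("I will do this ASAP. Regards, X", d))
# I will do this as soon as possible Regards X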

How can I make multiple replacements in a string using a dictionary?

Suppose we have:
d = {
    'Спорт': 'Досуг',
    'russianA': 'englishA'
}
s = 'Спорт russianA'
How can I replace each appearance within s of any of d's keys, with the corresponding value (in this case, the result would be 'Досуг englishA')?
Using re:
import re
s = 'Спорт not russianA'
d = {
    'Спорт': 'Досуг',
    'russianA': 'englishA'
}
keys = (re.escape(k) for k in d.keys())
pattern = re.compile(r'\b(' + '|'.join(keys) + r')\b')
result = pattern.sub(lambda x: d[x.group()], s)
# Output: 'Досуг not englishA'
This will match whole words only. If you don't need that, use the pattern:
pattern = re.compile('|'.join(re.escape(k) for k in d.keys()))
Note that in this case you should sort the words descending by length if some of your dictionary entries are substrings of others.
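For example, a sketch of that ordering (same d and s as above; the substring scenario itself is hypothetical):
# Longest keys first, so a key that is a substring of another key
# cannot shadow it in the alternation.
keys = sorted(d, key=len, reverse=True)
pattern = re.compile('|'.join(re.escape(k) for k in keys))
result = pattern.sub(lambda x: d[x.group()], s)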
You could use the reduce function (on Python 3, import it first with from functools import reduce):
reduce(lambda x, y: x.replace(y, d[y]), d, s)
Solution found here (I like its simplicity):
def multipleReplace(text, wordDict):
    for key in wordDict:
        text = text.replace(key, wordDict[key])
    return text
One way, without re:
d = {
    'Спорт': 'Досуг',
    'russianA': 'englishA'
}
s = 'Спорт russianA'.split()
for n, i in enumerate(s):
    if i in d:
        s[n] = d[i]
print(' '.join(s))
Almost the same as ghostdog74's answer, though independently created. One difference: using d.get() instead of d[] can handle items not in the dict.
>>> d = {'a':'b', 'c':'d'}
>>> s = "a c x"
>>> foo = s.split()
>>> ret = []
>>> for item in foo:
...     ret.append(d.get(item, item))  # try to get from dict, otherwise keep the value
...
>>> " ".join(ret)
'b d x'
With the caveat that it fails if a key contains a space, this is a compressed solution similar to ghostdog74's and extaneons' answers:
d = {
    'Спорт': 'Досуг',
    'russianA': 'englishA'
}
s = 'Спорт russianA'
' '.join(d.get(i,i) for i in s.split())
I used this in a similar situation (my string was all in uppercase):
def translate(string, wdict):
    for key in wdict:
        string = string.replace(key, wdict[key].lower())
    return string.upper()
hope that helps in some way... :)
Using regex
We can build a regular expression that matches any of the lookup dictionary's keys, by creating regexes to match each individual key and combine them with |. We use re.sub to do the substitution, by giving it a function to do the replacement (this function, of course, will do the dict lookup). Putting it together:
import re
# assuming global `d` and `s` as in the question
# a function that does the dict lookup with the global `d`.
def lookup(match):
    return d[match.group()]
# Make the regex.
joined = '|'.join(re.escape(key) for key in d.keys())
pattern = re.compile(joined)
result = pattern.sub(lookup, s)
Here, re.escape is used to escape any characters with special meaning in the keys being searched for (so that they don't interfere with building the regex and are matched literally).
This regex pattern will match the substrings anywhere they appear, even if they are part of a word or span across multiple words. To avoid this, modify the regex so that it checks for word boundaries:
# pattern = re.compile(joined)
pattern = re.compile(rf'\b({joined})\b')
Using str.replace iteratively
Simply iterate over the .items() of the lookup dictionary, and call .replace with each. Since this method returns a new string, and does not (cannot) modify the string in place, we must reassign the results inside the loop:
for to_replace, replacement in d.items():
    s = s.replace(to_replace, replacement)
This approach is simple to write and easy to understand, but it comes with multiple caveats.
First, it has the disadvantage that it works sequentially, in a specific order. That is, each replacement has the potential to interfere with other replacements. Consider:
s = 'one two'
s = s.replace('one', 'two')
s = s.replace('two', 'three')
This will produce 'three three', not 'two three', because the 'two' from the first replacement will itself be replaced in the second step. This is normally not desirable; however, in the rare case when it should work this way, this approach is the only practical one.
This approach also cannot easily be fixed to respect word boundaries, because it must match literal text, and a "word boundary" can be marked in multiple different ways - by varying kinds of whitespace, but also without text at the beginning and end of the string.
Finally, keep in mind that a dict is not an ideal data structure for this approach. If we will iterate over the dict, then its ability to do key lookup is useless; and in Python 3.5 and below, the order of dicts is not guaranteed (making the sequential replacement problem worse). Instead, it would be better to specify a list of tuples for the replacements:
d = [('Спорт', 'Досуг'), ('russianA', 'englishA')]
s = 'Спорт russianA'
for to_replace, replacement in d:  # no more `.items()` call
    s = s.replace(to_replace, replacement)
By tokenization
The problem becomes much simpler if the string is first cut into pieces (tokenized), in such a way that anything that should be replaced is now an exact match for a dict key. That would allow for using the dict's lookup directly, and processing the entire string in one go, while also not building a custom regex.
Suppose that we want to match complete words. We can use a simpler, hard-coded regex that will match whitespace, and which uses a capturing group; by passing this to re.split, we split the string into whitespace and non-whitespace sections. Thus:
import re
tokenizer = re.compile('([ \t\n]+)')
tokenized = tokenizer.split(s)
Now we look up each of the tokens in the dictionary: if present, it should be replaced with the corresponding value, and otherwise it should be left alone (equivalent to replacing it with itself). The dictionary .get method is a natural fit for this task. Finally, we join the pieces back up. Thus:
s = ''.join(d.get(token, token) for token in tokenized)
More generally, for example if the strings to replace could have spaces in them, a different tokenization rule will be needed. However, it will usually be possible to come up with a tokenization rule that is simpler than the regex from the first section (that matches all the keys by brute force).
Special case: replacing single characters
If the keys of the dict are all one character (technically, Unicode code point) each, there are more specific techniques that can be used. See Best way to replace multiple characters in a string? for details.
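For completeness, a minimal sketch of the single-character case using str.maketrans and str.translate (the mapping here is a made-up example, not the dictionary from the question):
d = {'+': ' plus ', '-': ' minus '}
table = str.maketrans(d)           # keys must be single characters
print('1+2-3'.translate(table))    # 1 plus 2 minus 3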
