Python - counting non alphanumeric characters in a Pandas dataframe


I am trying to count the occurrences of characters in a column in a Pandas DataFrame. For example, I want to know in total how many times the character A appears in the column. The problem occurs when there's a non-alphanumeric character.
Here's a minimal reproducible example:
import pandas as pd
df = pd.DataFrame(data = ['AA', 'BA', 'ABA'], columns = ['col1'])
charset = set("".join(list(df['col1'])))
print(charset)
This is the set of characters in the column:
{'B', 'A'}
for char in charset:
    print(char, ' ', sum(df['col1'].str.count(char)))
This is the number of times each character appears in the column:
B 2
A 5
Trying the same again, except with a few non-alphanumeric characters:
df2 = pd.DataFrame(data = ['AA+', 'BA', 'ABA('], columns = ['col1'])
charset = set("".join(list(df2['col1'])))
print(charset)
As expected, the set of characters:
{'(', 'B', '+', 'A'}
However trying to count the characters now fails:
for char in charset:
    print(char, ' ', sum(df2['col1'].str.count(char)))
error: missing ), unterminated subpattern at position 0
Is there some way to escape the non-alphanumeric characters, or otherwise get the counts I am looking for?

Because Series.str.count interprets its argument as a regular expression, you can escape each character with re.escape. From the documentation:
pat : str
Valid regular expression.
df2 = pd.DataFrame(data = ['AA+', 'BA', 'ABA('], columns = ['col1'])
#list is not necessary
charset = set("".join(df2['col1']))
print(charset)
{'(', 'B', 'A', '+'}
import re
for char in charset:
    # used pandas sum
    print(char, ' ', df2['col1'].str.count(re.escape(char)).sum())
( 1
B 2
A 5
+ 1
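
If no regular expression is needed at all, another option is to count the characters literally. The following is a minimal sketch, assuming the column holds only strings. A plain list (the made-up name `values`) stands in for df2['col1'] so that the snippet runs without pandas; a Series of strings can be joined the same way.

```python
from collections import Counter

# Hypothetical stand-in for df2['col1']; any iterable of strings works the same way.
values = ['AA+', 'BA', 'ABA(']

# Join the strings and count every character literally (no regex involved).
counts = Counter("".join(values))

for char, n in counts.items():
    print(char, ' ', n)
```

This reads the data once instead of once per character, and characters such as '+' or '(' need no escaping.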

Slightly extending what you have already done, you can use a conditional dictionary comprehension that keeps only the characters in charset that are ASCII letters (so characters such as '+' and '(' are skipped rather than counted).
>>> from string import ascii_letters
>>> {char: df['col1'].str.count(char).sum() for char in charset
...  if char in ascii_letters}
{'B': 2, 'A': 5}

Related

Using regular expression, list all the letters that follows a vowel according to their occurrence frequency

How can I find consonants letters that came after the vowels in words of string and count the frequency
str = 'car regular double bad '
result19 = re.findall(r'\b\w*[aeiou][^ aeiou]\w*\b' , str)
print(result19) #doesn't work
Expected output
letter r count = 2
letter b count = 1
letter d count = 1

I am not sure whether this is what you want or not, but it might help as an answer and not a comment.
I think you are on the right track, but you need a few modifications and some extra lines to get the expected result:
import re
myStr = 'car regular double bad '
result19 = re.findall(r'[aeiou][^aeiou\s]+' , myStr)
myDict = {}
for value in result19:
    if not value[1] in myDict:
        myDict[value[1]] = 0
    myDict[value[1]] += 1
myDict
This results in a dictionary containing the letters and the number of times they have appeared:
{'b': 1, 'd': 1, 'g': 1, 'l': 1, 'r': 2}
For having a better output you can use a for loop to print each key and its value:
for char, value in myDict.items():
    print(char, "->", value)
Output
r -> 2
g -> 1
l -> 1
b -> 1
d -> 1

Your pattern \b\w*[aeiou][^ aeiou]\w*\b matches zero or more repetitions of a word character using \w* and only matches a single occurrence of [aeiou][^ aeiou] in the "word"
If you want to match all consonant letters based on the alphabet a-z after a vowel, you can match a single occurrence of [aeiou] and use a capture group matching a single consonant.
Then make use of re.findall to return a list of the group values.
import re
txt = 'car regular double bad '
lst = re.findall(r'[aeiou]([b-df-hj-np-tv-z])', txt)
dct = {c: lst.count(c) for c in lst}
print(dct)
Output
{'r': 2, 'g': 1, 'l': 1, 'b': 1, 'd': 1}
If you want to match a non whitespace char other than a vowel after matching a vowel, you can use this pattern [aeiou]([^\saeiou])
Note that the l is also in the output as it comes after the u in ul
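
As a small variation on the answer above, collections.Counter can do the tallying and the sorting by frequency that the title asks for. This is only a sketch that reuses the same pattern and sample text; the variable names `followers` and `freq` are made up here.

```python
import re
from collections import Counter

txt = 'car regular double bad '

# Capture the single consonant that directly follows each vowel.
followers = re.findall(r'[aeiou]([b-df-hj-np-tv-z])', txt)

# Counter tallies in one pass; most_common() sorts by descending count.
freq = Counter(followers)
for letter, count in freq.most_common():
    print('letter', letter, 'count =', count)
```

Unlike lst.count(c) inside a comprehension, this does not rescan the list once per letter.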

How to print the strings having repeating characters?

The question is this:
Suppose I have a string S='ABC'. I want the output to be this list: ['AAA','BBB','CCC','AAB','ABB','AAC','ACC','BBC','BCC']
How do I achieve this result?
Edit: Thanks to @Breno Monteiro, I came up with a solution based on the example he showed. I first produced the list ['AAA','BBB','CCC'] by repeating each character 3 times. Then, in each of those elements, I replaced the characters at the first and second index with the next character in the string, cycling around: 'A' is replaced by 'B', 'B' by 'C', and 'C' by 'A'. So the actual output came out as ['AAA', 'BBB', 'CCC', 'BAA', 'BBA', 'CBB', 'CCB', 'ACC', 'AAC']
My code:
string='ABC'
K=3
output=['AAA','BBB','CCC','AAB','ABB','AAC','ACC','BBC','BCC']
s=""
exp_output,temp=[],[]
ind=1
#including all repeating characters in the string
for i in string:
    s+=i*K
    exp_output.append(s)
    s=""
#including all repeating characters by the first and second index
for i in exp_output:
    for j in range(K-1):
        i=i.replace(i[j],string[ind%len(string)],1)
        temp.append(i)
        #print(temp)
    ind+=1
exp_output.extend(temp)
print(exp_output)

The simplest way to repeat characters in Python is:
character = 'A'
repeat_times = 3
print(character * repeat_times)
Output: AAA
You can also use Python strings as a list of characters, like this:
characters = 'ABC'
repeat_times = 3
for character in characters:
    print(character*repeat_times)
Output: AAA, BBB, CCC
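
Repetition alone does not produce the mixed strings such as 'AAB' that the question lists. One possible reading of the desired output is "every selection of 3 characters, ignoring order, in which at least one character repeats". That reading is an assumption, not something the question states. Under it, a sketch with itertools.combinations_with_replacement could look like this (the function name `strings_with_repeats` is made up for illustration):

```python
from itertools import combinations_with_replacement

def strings_with_repeats(chars, length):
    """Return order-insensitive selections of `length` characters from
    `chars` in which at least one character occurs more than once."""
    return [
        "".join(combo)
        for combo in combinations_with_replacement(chars, length)
        if len(set(combo)) < length  # drop selections whose characters are all distinct
    ]

print(strings_with_repeats('ABC', 3))
```

For 'ABC' and length 3 this yields the same nine strings as the list in the question, though in a different order.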

Counting Instances of Consecutive Duplicate Letters in a Python String

I'm trying to figure out how I can count the number of letters in a string that occur 3 times. The string is from raw_input().
For example, if my input is:
abceeedtyooo
The output should be: 2
This is my current code:
print 'Enter String:'
x = str(raw_input (""))
print x.count(x[0]*3)

To count number of consecutive duplicate letters that occur exactly 3 times:
>>> from itertools import groupby
>>> sum(len(list(dups)) == 3 for _, dups in groupby("abceeedtyooo"))
2

To count the chars in the string, you can use collections.Counter:
>>> from collections import Counter
>>> counter = Counter("abceeedtyooo")
>>> print(counter)
Counter({'e': 3, 'o': 3, 'a': 1, 'd': 1, 'y': 1, 'c': 1, 'b': 1, 't': 1})
Then you can filter the result as follows:
>>> result = [char for char in counter if counter[char] == 3]
>>> print(result)
['e', 'o']
If you want to match consecutive characters only, you can use regex (cf. re):
>>> import re
>>> result = re.findall(r"(.)\1\1", "abceeedtyooo")
>>> print(result)
['e', 'o']
>>> result = re.findall(r"(.)\1\1", "abcaaa")
>>> print(result)
['a']
This will also match if the same character appears three consecutive times multiple times (e.g. on "aaabcaaa", it will match 'a' twice). Matches are non-overlapping, so on "aaaa" it will only match once, but on "aaaaaa" it will match twice. Should you not want multiple matches on consecutive strings, modify the regex to r"(.)\1\1(?!\1)". To avoid matching any chars that appear more than 3 consecutive times, use (.)(?<!(?=\1)..)\1{2}(?!\1). This works around a problem with Python's regex module that cannot handle (?<!\1).
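
If the lookaround regex feels hard to maintain, the "exactly three in a row" rule can also be expressed with itertools.groupby. This is a sketch of that alternative rather than a translation of the regex; the helper name `chars_in_runs_of` is made up for illustration.

```python
from itertools import groupby

def chars_in_runs_of(s, n):
    """Return the character of every maximal run in `s` whose length is exactly `n`."""
    return [ch for ch, run in groupby(s) if sum(1 for _ in run) == n]

print(chars_in_runs_of("abceeedtyooo", 3))
print(chars_in_runs_of("aaaa", 3))
print(chars_in_runs_of("aaabcaaa", 3))
```

Because groupby yields maximal runs, a run of four or more identical characters is never reported as a run of three.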

We can count the characters in the string with a for loop:
s = "abbbaaaaaaccdaaab"
for i in set(s):
    print(i + str(s.count(i)), end='')
Output (the order may vary, because sets are unordered): a10c2b4d1

How to convert a string with comma-delimited items to a list in Python?

How do you convert a string into a list?
Say the string is like text = "a,b,c". After the conversion, text == ['a', 'b', 'c'] and hopefully text[0] == 'a', text[1] == 'b'?

Like this:
>>> text = 'a,b,c'
>>> text = text.split(',')
>>> text
['a', 'b', 'c']

Just to add on to the existing answers: hopefully, you'll encounter something more like this in the future:
>>> word = 'abc'
>>> L = list(word)
>>> L
['a', 'b', 'c']
>>> ''.join(L)
'abc'
But for what you're dealing with right now, go with @Cameron's answer.
>>> word = 'a,b,c'
>>> L = word.split(',')
>>> L
['a', 'b', 'c']
>>> ','.join(L)
'a,b,c'

The following Python code will turn your string into a list of strings:
import ast
teststr = "['aaa','bbb','ccc']"
testarray = ast.literal_eval(teststr)

I don't think you need to
In python you seldom need to convert a string to a list, because strings and lists are very similar
Changing the type
If you really have a string which should be a character array, do this:
In [1]: x = "foobar"
In [2]: list(x)
Out[2]: ['f', 'o', 'o', 'b', 'a', 'r']
Not changing the type
Note that Strings are very much like lists in python
Strings have accessors, like lists
In [3]: x[0]
Out[3]: 'f'
Strings are iterable, like lists
In [4]: for ch in x:
   ...:     print(ch)
   ...:
f
o
o
b
a
r
TLDR
Strings are lists. Almost.

In case you want to split by spaces, you can just use .split():
a = 'mary had a little lamb'
z = a.split()
print(z)
Output:
['mary', 'had', 'a', 'little', 'lamb']

If you actually want arrays (note that the 'c' typecode used here exists only in Python 2):
>>> from array import array
>>> text = "a,b,c"
>>> text = text.replace(',', '')
>>> myarray = array('c', text)
>>> myarray
array('c', 'abc')
>>> myarray[0]
'a'
>>> myarray[1]
'b'
If you do not need arrays and only want to look up characters by index, remember that a string is a sequence that supports indexing just like a list, except that it is immutable:
>>> text = "a,b,c"
>>> text = text.replace(',', '')
>>> text[0]
'a'

m = '[[1,2,3],[4,5,6],[7,8,9]]'
m= eval(m.split()[0])
[[1, 2, 3], [4, 5, 6], [7, 8, 9]]

All the answers are good. Another way is a list comprehension; see the solution below.
u = "UUUDDD"
lst = [x for x in u]
For a comma-separated list, do the following:
u = "U,U,U,D,D,D"
lst = [x for x in u.split(',')]

I usually use:
l = [ word.strip() for word in text.split(',') ]
strip() removes the whitespace around each word.

To convert a string of the form a="[[1, 3], [2, -6]]", I wrote this code (not yet optimized):
matrixAr = []
mystring = "[[1, 3], [2, -4], [19, -15]]"
b = mystring.replace("[[", "").replace("]]", "")  # remove the leading [[ and the trailing ]]
for line in b.split('], ['):
    row = list(map(int, line.split(',')))  # int() tolerates the spaces around each number
    matrixAr.append(row)
print(matrixAr)
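
For comparison, ast.literal_eval (used in an earlier answer) can parse the same nested-list string directly. A minimal sketch, assuming the string is a valid Python literal; the variable name `matrix` is chosen here for illustration:

```python
import ast

mystring = "[[1, 3], [2, -4], [19, -15]]"

# literal_eval only accepts Python literals (lists, numbers, strings, ...),
# so it does not execute arbitrary code the way eval() can.
matrix = ast.literal_eval(mystring)
print(matrix)
```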

split() is your friend here. I will cover a few aspects of split() that are not covered by other answers.
If no arguments are passed to split(), it would split the string based on whitespace characters (space, tab, and newline). Leading and trailing whitespace is ignored. Also, consecutive whitespaces are treated as a single delimiter.
Example:
>>> " \t\t\none two three\t\t\tfour\nfive\n\n".split()
['one', 'two', 'three', 'four', 'five']
When a single character delimiter is passed, split() behaves quite differently from its default behavior. In this case, leading/trailing delimiters are not ignored, repeating delimiters are not "coalesced" into one either.
Example:
>>> ",,one,two,three,,\n four\tfive".split(',')
['', '', 'one', 'two', 'three', '', '\n four\tfive']
So, if stripping of whitespaces is desired while splitting a string based on a non-whitespace delimiter, use this construct:
words = [item.strip() for item in string.split(',')]
When a multi-character string is passed as the delimiter, it is taken as a single delimiter and not as a character class or a set of delimiters.
Example:
>>> "one,two,three,,four".split(',,')
['one,two,three', 'four']
To coalesce multiple delimiters into one, you would need to use re.split(regex, string) approach. See the related posts below.
Related
string.split() - Python documentation
re.split() - Python documentation
Split string based on regex
Split string based on a regular expression
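
To illustrate the re.split approach mentioned above, here is a minimal sketch that coalesces repeated commas and drops the whitespace around them. The sample string and the variable names are made up for illustration.

```python
import re

text = "one,two,,three , four,,,five"

# ',+' treats a run of commas as one delimiter; '\s*' absorbs whitespace around it.
parts = re.split(r'\s*,+\s*', text.strip())
print(parts)
```

Leading or trailing commas would still produce empty strings at the ends of the list, so filter those out if the input may contain them.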

# to strip `,` and `.` from a string (Python 3; in Python 2 this was 'a,b,c.'.translate(None, ',.')) ->
>>> 'a,b,c.'.translate(str.maketrans('', '', ',.'))
'abc'
You should use the built-in translate method for strings.
Type help('abc'.translate) at Python shell for more info.

Using functional Python:
text = list(filter(lambda x: x != ',', text))

Example 1
>>> email = "myemailid@gmail.com"
>>> email.split()
#OUTPUT
['myemailid@gmail.com']
Example 2
>>> email = "myemailid@gmail.com, someonsemailid@gmail.com"
>>> email.split(',')
#OUTPUT
['myemailid@gmail.com', ' someonsemailid@gmail.com']

Count the number of occurrences of a character in a string

How do I count the number of occurrences of a character in a string?
e.g. 'a' appears in 'Mary had a little lamb' 4 times.

str.count(sub[, start[, end]])
Return the number of non-overlapping occurrences of substring sub in the range [start, end]. Optional arguments start and end are interpreted as in slice notation.
>>> sentence = 'Mary had a little lamb'
>>> sentence.count('a')
4

You can use .count() :
>>> 'Mary had a little lamb'.count('a')
4

To get the counts of all letters, use collections.Counter:
>>> from collections import Counter
>>> counter = Counter("Mary had a little lamb")
>>> counter['a']
4

Regular expressions maybe?
import re
my_string = "Mary had a little lamb"
len(re.findall("a", my_string))

Python-3.x:
"aabc".count("a")
str.count(sub[, start[, end]])
Return the number of non-overlapping occurrences of substring sub in the range [start, end]. Optional arguments start and end are interpreted as in slice notation.

myString.count('a');

str.count('a') is the best solution for counting a single character in a string. But if you need to count several different characters, calling it repeatedly scans the whole string once for each character you want to count.
A better approach for this job would be:
from collections import defaultdict
text = 'Mary had a little lamb'
chars = defaultdict(int)
for char in text:
    chars[char] += 1
So you'll have a dict that returns the number of occurrences of every letter in the string and 0 if it isn't present.
>>>chars['a']
4
>>>chars['x']
0
For a case insensitive counter you could override the mutator and accessor methods by subclassing defaultdict (base class' ones are read-only):
class CICounter(defaultdict):
    def __getitem__(self, k):
        return super().__getitem__(k.lower())
    def __setitem__(self, k, v):
        super().__setitem__(k.lower(), v)
chars = CICounter(int)
for char in text:
    chars[char] += 1
>>>chars['a']
4
>>>chars['M']
2
>>>chars['x']
0
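
A simpler route to case-insensitive counts, if subclassing is not required, is to normalise the text before counting. This is a sketch of that alternative, not the approach of the answer above; it uses collections.Counter, which also returns 0 for missing keys.

```python
from collections import Counter

text = 'Mary had a little lamb'

# Lowercase once, then count every character in a single pass.
chars = Counter(text.lower())

print(chars['a'], chars['m'], chars['x'])
```

The trade-off is that lookups must also be lowercase: chars['M'] is 0 here, whereas the CICounter above lowercases the key for you.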

This easy and straight forward function might help:
def check_freq(x):
    freq = {}
    for c in set(x):
        freq[c] = x.count(c)
    return freq
check_freq("abbabcbdbabdbdbabababcbcbab")
{'a': 7, 'b': 14, 'c': 3, 'd': 3}
If a comprehension is desired:
def check_freq(x):
    return {c: x.count(c) for c in set(x)}

Regular expressions are very useful if you want case-insensitivity (and of course all the power of regex).
my_string = "Mary had a little lamb"
# simplest solution, using count, is case-sensitive
my_string.count("m") # yields 1
import re
# case-sensitive with regex
len(re.findall("m", my_string))
# three ways to get case insensitivity - all yield 2
len(re.findall("(?i)m", my_string))
len(re.findall("m|M", my_string))
len(re.findall(re.compile("m",re.IGNORECASE), my_string))
Be aware that the regex version takes on the order of ten times as long to run, which will likely be an issue only if my_string is tremendously long, or the code is inside a deep loop.
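
When the only goal is case-insensitivity, the regex overhead can be avoided by normalising the case first and then using the plain count. A minimal sketch with made-up variable names:

```python
my_string = "Mary had a little lamb"

# Normalise case once, then use the plain (non-regex) substring count.
n_lower = my_string.lower().count("m")

# casefold() is a more aggressive variant of lower() intended for caseless matching.
n_fold = my_string.casefold().count("m".casefold())

print(n_lower, n_fold)
```

Both calls give the same result for ASCII text; they can differ for some non-ASCII characters, such as the German 'ß'.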

I don't know about 'simplest' but simple comprehension could do:
>>> my_string = "Mary had a little lamb"
>>> sum(char == 'a' for char in my_string)
4
This takes advantage of the built-in sum, a generator expression, and the fact that bool is a subclass of int: it adds up how many times a character is equal to 'a'.

a = 'have a nice day'
symbol = 'abcdefghijklmnopqrstuvwxyz'
for key in symbol:
    print(key, a.count(key))

An alternative way to get all the character counts without using Counter(), count, or regex:
sentence = 'Mary had a little lamb'
counts_dict = {}
for c in sentence:
    if c not in counts_dict:
        counts_dict[c] = 0
    counts_dict[c] += 1
for key, value in counts_dict.items():
    print(key, value)

I am a fan of the pandas library, in particular the value_counts() method. You could use it to count the occurrence of each character in your string:
>>> import pandas as pd
>>> phrase = "I love the pandas library and its `value_counts()` method"
>>> pd.Series(list(phrase)).value_counts()
8
a 5
e 4
t 4
o 3
n 3
s 3
d 3
l 3
u 2
i 2
r 2
v 2
` 2
h 2
p 1
b 1
I 1
m 1
( 1
y 1
_ 1
) 1
c 1
dtype: int64

count is definitely the most concise and efficient way of counting the occurrence of a character in a string but I tried to come up with a solution using lambda, something like this :
sentence = 'Mary had a little lamb'
sum(map(lambda x : 1 if 'a' in x else 0, sentence))
This will result in :
4
Also, there is one more advantage to this is if the sentence is a list of sub-strings containing same characters as above, then also this gives the correct result because of the use of in. Have a look :
sentence = ['M', 'ar', 'y', 'had', 'a', 'little', 'l', 'am', 'b']
sum(map(lambda x : 1 if 'a' in x else 0, sentence))
This also results in :
4
But of course this works only when checking for a single character, such as 'a' in this particular case.

a = "I walked today,"
c=['d','e','f']
count=0
for i in a:
    if str(i) in c:
        count+=1
print(count)

I know the question is about counting one particular letter. Here is generic code that counts every letter, ignoring case, without using count().
sentence1 = " Mary had a little lamb"
count = {}
for i in sentence1:
    key = i.lower()  # normalise case before both the lookup and the update
    if key in count:
        count[key] = count[key] + 1
    else:
        count[key] = 1
print(count)
output
{' ': 5, 'm': 2, 'a': 4, 'r': 1, 'y': 1, 'h': 1, 'd': 1, 'l': 3, 'i': 1, 't': 2, 'e': 1, 'b': 1}
Now if you want any particular letter frequency, you can print like below.
print(count['m'])
2

The easiest way is a one-liner:
'Mary had a little lamb'.count("a")
But if you want, you can use this too:
sentence = 'Mary had a little lamb'
count = 0
for letter in sentence:
    if letter == "a":
        count += 1
print(count)

To find the occurrences of each character in a sentence, you can use the code below.
First, I take the unique characters from the sentence, then I count the occurrences of each one; this includes the blank space too.
ab = set("Mary had a little lamb")
test_str = "Mary had a little lamb"
for i in ab:
    counter = test_str.count(i)
    if i == ' ':
        i = 'Space'
    print(counter, ':', i, ',')
Output of the above code is below.
1 : r ,
1 : h ,
1 : e ,
1 : M ,
4 : a ,
1 : b ,
1 : d ,
2 : t ,
3 : l ,
1 : i ,
4 : Space ,
1 : y ,
1 : m ,

"Without using count to find you want character in string" method.
import re
def main():
    s = input("Enter any string you like, for example 'welcome': ")
    ch = input("Enter the character to count (a single character works best): ")
    # re.escape keeps characters such as '+' or '(' from being read as regex syntax
    print(len(re.findall(re.escape(ch), s)))
main()

Python 3
There are two ways to achieve this:
1) With the built-in method count()
sentence = 'Mary had a little lamb'
print(sentence.count('a'))
2) Without using a function
sentence = 'Mary had a little lamb'
count = 0
for i in sentence:
    if i == "a":
        count = count + 1
print(count)

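Use count:
sentence = 'A man walked up to a door'
print(sentence.count('a'))
# 3 (count is case-sensitive, so the leading 'A' is not included)
print(sentence.lower().count('a'))
# 4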

Taking up a comment of this user:
import numpy as np
sample = 'samplestring'
a = np.unique(list(sample), return_counts=True)
a
Out:
(array(['a', 'e', 'g', 'i', 'l', 'm', 'n', 'p', 'r', 's', 't'], dtype='<U1'),
array([1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1]))
Check 's'. You can filter this tuple of two arrays as follows:
a[1][a[0]=='s']
Side-note: It works like Counter() of the collections package, just in numpy, which you often import anyway. You could as well count the unique words in a list of words instead.

This is an extension of the accepted answer, should you look for the count of all the characters in the text.
# Objective: we will only count for non-empty characters
text = "count a character occurrence"
unique_letters = set(text)
result = dict((x, text.count(x)) for x in unique_letters if x.strip())
print(result)
# {'a': 3, 'c': 6, 'e': 3, 'u': 2, 'n': 2, 't': 2, 'r': 4, 'h': 1, 'o': 2}

No more than this IMHO - you can add the upper or lower methods
def count_letter_in_str(string, letter):
    return string.count(letter)

You can use loop and dictionary.
def count_letter(text):
    result = {}
    for letter in text:
        if letter not in result:
            result[letter] = 0
        result[letter] += 1
    return result

spam = 'have a nice day'
var = 'd'
def count(spam, var):
    found = 0
    for key in spam:
        if key == var:
            found += 1
    return found
count(spam, var)
print('count %s is: %s' % (var, count(spam, var)))
