numpy to speed up char and substring counts? - python

Has anyone here tried to use numpy to count characters and substrings in strings?
For example:
for string in string_list:
    c = np.array(list(string))
    counter = Counter(c)
This is fine for counting single characters, but for substrings, is there any function that speeds things up for large data?
For example, I have this:
k = 1
test_1mer = Counter()
for name, seq in parse_multi_fasta_file_compressed_or_not('test/test_random.faa'):
    test_1mer.update(get_kmers(seq, k))
Fasta file (test_random.faa):
>header
SGFAVAGBHNAAMMAAM
I am using this code to count substrings in a 65GB file.
It takes many hours to finish, so I am wondering whether numpy, or some built-in function, could speed things up a little.

Initializing numpy arrays is not the fastest operation, and it doesn't appear that you need any numpy-specific tools; I'd recommend just passing the strings themselves to Counter:
In [192]: strings = ['awerawe', 'awerawer', '23432wefeaf', 'awefpi32']
...: sum(map(Counter, strings), start=Counter())
Out[192]:
Counter({'a': 6,
'w': 6,
'e': 7,
'r': 3,
'2': 3,
'3': 3,
'4': 1,
'f': 3,
'p': 1,
'i': 1})
If you instead want to count whole words, you can pass the list of strings to Counter:
In [193]: strings = ['awerawe', 'awerawer', '23432wefeaf', 'awefpi32']
...: Counter(strings)
Out[193]: Counter({'awerawe': 1, 'awerawer': 1, '23432wefeaf': 1, 'awefpi32': 1})
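For the substring part of the question, a sliding window over each sequence can feed Counter directly. The following is only a minimal sketch, not the asker's actual code: `iter_kmers` is a hypothetical stand-in for the `get_kmers` helper in the question (whose definition is not shown), and the sequences are passed in as a plain list instead of being read from a FASTA file.

```python
from collections import Counter


def iter_kmers(seq, k):
    """Yield every overlapping substring of length k (hypothetical helper)."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def count_kmers(sequences, k):
    """Tally k-mers across an iterable of sequences with a single Counter."""
    counts = Counter()
    for seq in sequences:
        # update() consumes the generator, so no intermediate list is built
        counts.update(iter_kmers(seq, k))
    return counts


counts = count_kmers(["SGFAVAGBHNAAMMAAM"], 2)
print(counts.most_common(3))
```

Whether this beats a numpy-based approach on a 65GB file would have to be measured; the sketch only shows the counting pattern.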

Related

How to join all keys and values of dictionary and return in form of string?

Step 1. i/p = "wwwwaaadexxxxxx"
Step 2. converted = {'w': 4, 'a': 3, 'd': 1, 'e': 1, 'x': 6}
Final step. o/p = 'w4a3d1e1x6'
I'm at step 2; how do I get to the final step?
A direct conversion from step 1 to the final step would be appreciated.
Low time complexity is preferred, but I would appreciate any solution.
I want the result returned as a string stored in a variable,
without importing anything.
You can get the key and value pairs (using dict.items()), format each pair, and then use join to create a string out of them:
converted= {'w': 4, 'a': 3, 'd': 1, 'e': 1, 'x': 6}
print(''.join([f"{k}{v}" for k,v in converted.items()]))
w4a3d1e1x6
Or use Counter.
Counter is from the collections module and gives you a dict-like structure with the count of each character:
from collections import Counter
my_str = 'wwwwaaadexxxxxx'
print(''.join([f"{k}{v}" for k,v in Counter(my_str).items()]))
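Since the question asks for a solution without imports that goes straight from the input string to the final string, here is a minimal sketch using only a plain dict. The function name `compress_counts` is made up for illustration, and the output order relies on dicts preserving insertion order (Python 3.7+).

```python
def compress_counts(text):
    """Return each distinct character followed by its total count."""
    counts = {}
    for ch in text:
        # get() supplies 0 the first time a character is seen
        counts[ch] = counts.get(ch, 0) + 1
    return ''.join(f"{ch}{n}" for ch, n in counts.items())


result = compress_counts('wwwwaaadexxxxxx')
print(result)  # w4a3d1e1x6
```

This makes a single pass over the input, so the running time is linear in the length of the string. Note that it counts totals per character, so separate runs of the same character are merged: 'aabaa' becomes 'a4b1'.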

Character count in string

def charcount(stri):
    for i in stri:
        count = 0
        for j in stri:
            if stri[i] == stri[j]:
                count += 1
I am new to Python and currently learning string operations. Can anyone tell me what is wrong with this program? The function tries to print a count of each character in the given string.
For example: string = "There is shadow behind you"
I want to count how many times each character occurs in the string.
Counting characters in a string can be done with the Counter() class like:
Code:
from collections import Counter
def charcount(stri):
    return Counter(stri)
print(charcount('The function try to print count of each character '
                'in given string . Please help'))
Results:
Counter({' ': 14, 'e': 7, 'n': 7, 't': 7, 'c': 5, 'i': 5,
'r': 5, 'h': 4, 'o': 4, 'a': 4, 'f': 2, 'u': 2,
'p': 2, 'g': 2, 's': 2, 'l': 2, 'T': 1, 'y': 1,
'v': 1, '.': 1, 'P': 1})
Feedback on code:
In these lines:
for i in stri:
    count = 0
    for j in stri:
The outer loop is looping over each character in stri, and the inner loop is looping over every character in stri. This is like a Cartesian product of the elements in the list, and is not necessary here.
Secondly, in this line:
if stri[i] == stri[j]:
You are accessing stri by index, but i and j are not indices; they are the characters themselves. Treating them as indices does not work, since characters are not valid indices for a string. If you want the indices, you can loop over range(len(stri)):
for i in range(len(stri)):
    count = 0
    for j in range(len(stri)):
        if stri[i] == stri[j]:
Or if you want to access the elements and their indices, you can use enumerate().
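As a rough sketch of the enumerate() variant, the loop can yield the index and the character together. The function name `position_counts` and the returned list of tuples are illustrative choices, not part of the original program.

```python
def position_counts(stri):
    """Return (index, character, count) for every position in stri."""
    rows = []
    for i, ch in enumerate(stri):  # i is the index, ch is the character
        count = 0
        for other in stri:
            if ch == other:
                count += 1
        rows.append((i, ch, count))
    return rows


print(position_counts("aba"))
```

This still compares every character with every other character, so it only illustrates enumerate(); it does not remove the nested loops.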
Having said this, your approach is too complicated and needs to be redone. You need to group your characters and count them. Using nested loops is overkill here.
Alternative approaches:
There are lots of better ways to do this such as using collections.Counter() and dictionaries. These data structures are very good for counting.
Since it also looks like you're struggling with loops, I suggest going back to the basics and then attempting this problem with a dictionary.
This is what you need to do. Iterate through the input string and use a hash to keep track of the counts. In python, the basic hash is a dictionary.
def charCounter(string):
    d = {}  # initialize a new dictionary
    for s in string:
        if s not in d:
            d[s] = 1
        else:
            d[s] += 1
    return d
print(charCounter("apple"))
# prints {'a': 1, 'p': 2, 'l': 1, 'e': 1}
Just a small modification to your solution.
First, you are looping incorrectly.
Take a look:
def charcount(stri):
    d = {}
    for i in stri:
        if i in d:
            d[i] = d[i] + 1
        else:
            d[i] = 1
    return d
print(charcount("hello"))  # expected output: {'h': 1, 'e': 1, 'l': 2, 'o': 1}
Counting each character in a string:
>>> from collections import Counter
>>> string ="There is shadow behind you"
>>> Counter(string)
Counter({' ': 4, 'h': 3, 'e': 3, 'i': 2, 's': 2, 'd': 2, 'o': 2, 'T': 1, 'r':
1, 'a': 1, 'w': 1, 'b': 1, 'n': 1, 'y': 1, 'u': 1})
If you don't want to use any imports:
def charcount(string):
    occurenceDict = dict()
    for char in string:
        if char not in occurenceDict:
            occurenceDict[char] = 1
        else:
            occurenceDict[char] += 1
    return occurenceDict
You can use the following code.
in_l = list(input('Put a string: '))
d1 = {}
for i in set(in_l):
    d1[i] = in_l.count(i)
print(d1)
public class Z {
    public static void main(String[] args) {
        int count = 0;
        String str = "aabaaaababa";
        for (int i = 0; i < str.length(); i++) {
            if (str.charAt(i) == 'a') {
                count++;
            }
        }
        System.out.println(count);
    }
}

Detecting the sequence of letters in a string

I've made a hasher. It works and it's really simple. I made one for fun and I thought that the code is way too long. It's over 1000 lines long and it's so simple. I just want to shorten it down.
Here's how I did the code:
wordorg = raw_input("Enter a word here: ")
## Checking if what you typed is correct
if len(wordorg) <= 10 and len(wordorg) > 1 and wordorg.isalpha():
    ## Comparison (JESUS THIS IS A LONG PIECE OF CODE)
    print "Your original word was: " + wordorg
    word = wordorg.lower()
    if len(word) >= 1:
        if word[0] == "a":
            one = a
        if word[0] == "b":
            one = b
        if word[0] == "c":
Bla bla bla, you get the idea, it goes like that. When it reaches Z
        if word[0] == "z":
            one = z
    if len(word) >= 2:
        if word[1] == "a":
And it goes on. My question is, how can I shorten my code?
EDIT:
The integers a, b, c are defined like this:
a = 2
b = 3
c = 5
and so on.
You could use a dict to divide your line count by 26:
>>> import string
>>> translate = {l:i for i,l in enumerate(string.ascii_lowercase, 1)}
>>> translate
{'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5, 'f': 6, 'g': 7, 'h': 8, 'i': 9, 'j': 10, 'k': 11, 'l': 12, 'm': 13, 'n': 14, 'o': 15, 'p': 16, 'q': 17, 'r': 18, 's': 19, 't': 20, 'u': 21, 'v': 22, 'w': 23, 'x': 24, 'y': 25, 'z': 26}
Now, all you need is a dict lookup instead of 26 ifs:
>>> word='something'
>>> translate[word[0]]
19
>>> translate[word[1]]
15
You could replace the 1, 2, ... values with the ones defined as a, b, ... in your code.
If you want to do it for every letter, simply use a list comprehension:
>>> [translate[letter] for letter in word]
[19, 15, 13, 5, 20, 8, 9, 14, 7]
You now have a list of integers, ready for further processing!
I don't know exactly where you go from the code snippet you gave, but I suggest starting from:
[1 + ord(ch) - ord('a') for ch in wordorg.lower()]
ord is a function that returns the code point of a character (a=97, b=98, etc.), so 1 + ord(ch) - ord('a') returns 1 for 'a', 2 for 'b', etc. The word is lowercased first so that uppercase input maps to the same values.
It seems more interesting not to use a dictionary in your hash function, since a dictionary itself is a hash table.
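To compare the two suggestions side by side, the sketch below computes the letter values once with a lookup dict and once with ord arithmetic. The function names `values_by_dict` and `values_by_ord` are made up for illustration, and both assume the input contains only ASCII letters.

```python
import string

# Lookup table: 'a' -> 1, 'b' -> 2, ..., 'z' -> 26
translate = {letter: i for i, letter in enumerate(string.ascii_lowercase, 1)}


def values_by_dict(word):
    """Map each letter to its value through the lookup table."""
    return [translate[letter] for letter in word.lower()]


def values_by_ord(word):
    """Map each letter to its value with code point arithmetic."""
    return [1 + ord(letter) - ord('a') for letter in word.lower()]


word = 'something'
print(values_by_dict(word))
print(values_by_ord(word))
```

Both calls produce the same list here. The dict version can hold arbitrary values per letter, such as the a = 2, b = 3, c = 5 from the question, whereas the ord version only yields consecutive integers.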
I think this does what you're looking for. What I did is build a loop that runs over your word, so it can compare letter by letter. The second loop goes over the letters of the alphabet and if your word letter has a match, this is stored in the array results. This array counts the occurrences of each letter. If you want you can replace the print statement with writing to a file. There is also no need to restrict your code to run on short words anymore.
import string
alphabet = string.ascii_lowercase
results = [0] * len(alphabet)  # array to count occurrences of letters
wordorg = raw_input("Input word here: ")
print alphabet
if wordorg.isalpha():
    for i in range(len(wordorg)):
        for j in range(len(alphabet)):
            if wordorg[i] == alphabet[j]:
                results[j] += 1
# print results
for i in range(len(alphabet)):
    if results[i] > 0:
        print "There are %d occurrences of the letter %s" % (results[i], alphabet[i])

Python count/dictionary count

dct = {}
with open("grades_single.txt","r") as g:
    content = g.readlines()[1].strip('\n')
for item in content:
    dct[item] = content.count(item)
LetterA = max(dct.values())
print(dct)
I'm very new to Python, so please excuse me. This is my code so far; it works, but not as intended. I'm trying to count the frequency of certain letters on new lines so I can apply a mathematical function to each letter. The program counts all the letters and prints them, but I'd like to be able to count each letter one by one, i.e. 7 As, then a new function for 4 Bs, etc.
At the moment the program prints them all in one go, but I'd like to split them up so I can work with each letter one by one. {'A': 9, 'C': 12, 'B': 19, 'E': 4, 'D': 5, 'F': 1}
Does anyone know how to count the frequency letter by letter?
ADCBCBBBADEBCCBADBBBCDCCBEDCBACCFEABBCBBBCCEAABCBB
Example of what I'd like to count.
>>> from collections import Counter
>>> s = "ADCBCBBBADEBCCBADBBBCDCCBEDCBACCFEABBCBBBCCEAABCBB"
>>> Counter(s)
Counter({'B': 19, 'C': 14, 'A': 7, 'D': 5, 'E': 4, 'F': 1})
collections.Counter is clean, but if you were in a hurry, you could iterate over all of the elements and place them into a dictionary yourself.
s = 'ADCBCBBBADEBCCBADBBBCDCCBEDCBACCFEABBCBBBCCEAABCBB'
grades = {}
for letter in s:
    grades[letter] = grades.get(letter, 0) + 1
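To work with each letter one at a time, as the question asks, the tally can be indexed by letter or looped over. The sketch below assumes a placeholder function `describe` standing in for whatever per-letter calculation is needed; it is hypothetical and not from the original post.

```python
from collections import Counter

s = "ADCBCBBBADEBCCBADBBBCDCCBEDCBACCFEABBCBBBCCEAABCBB"
counts = Counter(s)


def describe(letter, n):
    """Hypothetical per-letter step; replace with the real calculation."""
    return f"{n}{letter}s"


# Handle the letters one by one, in alphabetical order
for letter in sorted(counts):
    print(describe(letter, counts[letter]))

# A single letter can also be looked up directly
count_a = counts['A']
```

For a single letter, `s.count('A')` gives the same number without building the full tally.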

How to return the number of characters whose frequency is above a threshold

How do I print the number of upper case characters whose frequency is above a threshold (in the tutorial)?
The homework question is:
Your task is to write a function which takes as input a single non-negative number and returns (not print) the number of characters in the tally whose count is strictly greater than the argument of the function. Your function should be called freq_threshold.
My answer is:
mobyDick = "Blah blah A B C A RE."
def freq_threshold(threshold):
    tally = {}
    for char in mobyDick:
        if char in tally:
            tally[char] += 1
        else:
            tally[char] = 1
    for key in tally.keys():
        if key.isupper():
            print tally[key],tally.keys
            if threshold>tally[key]:return threshold
            else:return tally[key]
It doesn't work, but I don't know where it is wrong.
Your task is to return the number of characters that satisfy the condition. Instead, you're trying to return the count of occurrences of some character. Try this:
result = 0
for key in tally.keys():
    if key.isupper() and tally[key] > threshold:
        result += 1
return result
You can make this code more pythonic. I wrote it this way to make it more clear.
The part where you tally up the number of each character is fine:
>>> pprint.pprint ( tally )
{' ': 5,
'.': 1,
'A': 2,
'B': 2,
'C': 1,
'E': 1,
'R': 1,
'a': 2,
'b': 1,
'h': 2,
'l': 2,
'\x80': 2,
'\xe3': 1}
The error is in how you are summarising the tally.
Your assignment asked you to return the number of characters occurring more than n times in the string.
What you are returning is either n or the number of times one particular character occurred.
You instead need to step through your tally of characters and character counts, and count how many characters have frequencies exceeding n.
Do not reinvent the wheel, but use a counter object, e.g.:
>>> from collections import Counter
>>> mobyDick = "Blah blah A B C A RE."
>>> c = Counter(mobyDick)
>>> c
Counter({' ': 6, 'a': 2, 'B': 2, 'h': 2, 'l': 2, 'A': 2, 'C': 1, 'E': 1, '.': 1, 'b': 1, 'R': 1})
from collections import Counter
def freq_threshold(s, n):
    cnt = Counter(s)
    return [i for i in cnt if cnt[i]>n and i.isupper()]
To reinvent the wheel:
def freq_threshold(s, n):
    d = {}
    for i in s:
        d[i] = d.get(i, 0)+1
    return [i for i in d if d[i]>n and i.isupper()]
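The homework text asks for the number of matching characters, not a list of them. A minimal sketch of that variant follows. Passing the text in as `s` mirrors the two functions above (the homework version takes only the threshold), and the name `freq_threshold_count` is made up here to avoid clashing with them.

```python
from collections import Counter


def freq_threshold_count(s, n):
    """Count upper-case characters that occur strictly more than n times."""
    cnt = Counter(s)
    return sum(1 for ch, freq in cnt.items() if freq > n and ch.isupper())


mobyDick = "Blah blah A B C A RE."
print(freq_threshold_count(mobyDick, 1))  # 2 ('A' and 'B' each occur twice)
```

Wrapping the list from either function above in len() would give the same number.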
