Detect and count numerical sequence in Python array - python

In a numerical sequence (e.g. one-dimensional array) I want to find different patterns of numbers and count each finding separately. However, the numbers can occur repeatedly but only the basic pattern is important.
# Example signal (1d array)
a = np.array([1,1,2,2,2,2,1,1,1,2,1,1,2,3,3,3,3,3,2,2,1,1,1])
# Search for these exact following "patterns": [1,2,1], [1,2,3], [3,2,1]
# Count the number of pattern occurrences
# [1,2,1] = 2 (occurs 2 times)
# [1,2,3] = 1
# [3,2,1] = 1
I have come up with the Knuth-Morris-Pratt string matching (http://code.activestate.com/recipes/117214/), which gives me the index of the searched pattern.
for s in KnuthMorrisPratt(list(a), [1,2,1]):
    print(s)
The problem is, I don't know how to find the case, where the pattern [1,2,1] "hides" in the sequence [1,2,2,2,1]. I need to find a way to reduce this sequence of repeated numbers in order to get to [1,2,1]. Any ideas?

I don't use NumPy and I am quite new to Python, so there might be a better and more efficient solution.
I would write a function like this:
def dac(data, pattern):
    count = 0
    for i in range(len(data)-len(pattern)+1):
        tmp = data[i:(i+len(pattern))]
        if tmp == pattern:
            count +=1
    return count
If you want to ignore repeated numbers in the middle of your pattern:
def dac(data, pattern):
    count = 0
    for i in range(len(data)-len(pattern)+1):
        tmp = [data[i], data[i+1]]
        try:
            for j in range(len(data)-i):
                # skip repeats: append only when the value changes
                if tmp[-1] != data[i+j+1]:
                    tmp.append(data[i+j+1])
                if len(tmp) == len(pattern):
                    break
        except IndexError:
            # ran past the end of data before reaching the pattern length
            pass
        if tmp == pattern:
            count +=1
    return count
Hope that might help.

Here's a one-liner that will do it
import numpy as np
a = np.array([1,1,2,2,2,2,1,1,1,2,1,1,2,3,3,3,3,3,2,2,1,1,1])
p = np.array([1,2,1])
num = sum(1 for k in
[a[j:j+len(p)] for j in range(len(a) - len(p) + 1)]
if np.array_equal(k, p))
The innermost part is a list comprehension that generates all pieces of the array that are the same length as the pattern. The outer part sums 1 for every element of this list which matches the pattern.

The only way I could think of solving your problem with the
subpatterns matching was to use regex.
The following is a demonstration of finding, for example, the sequence [1,2,1] in list1:
import re
list1 = [1,1,2,2,2,2,1,1,1,2,1,1,2,3,3,3,3,3,2,2,1,1,1]
str_list = ''.join(str(i) for i in list1)
print(re.findall(r'1+2+1', str_list))
This will give you as a result:
>>> print(re.findall(r'1+2+1', str_list))
['1122221', '1121']
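The regex approach above relies on single-digit values and on non-overlapping matches. As an alternative sketch (not taken from any of the answers), the runs of repeated numbers could first be collapsed with itertools.groupby and the pattern then counted over a sliding window. The helper names collapse_runs and count_pattern are made up for illustration, and the pattern itself is assumed to contain no consecutive repeats.

```python
from itertools import groupby

def collapse_runs(seq):
    """Keep a single value for each run of consecutive equal values."""
    return [key for key, _ in groupby(seq)]

def count_pattern(seq, pattern):
    """Count (possibly overlapping) occurrences of pattern in the collapsed sequence."""
    reduced = collapse_runs(seq)
    pattern = list(pattern)
    n = len(pattern)
    return sum(1 for i in range(len(reduced) - n + 1) if reduced[i:i + n] == pattern)

a = [1,1,2,2,2,2,1,1,1,2,1,1,2,3,3,3,3,3,2,2,1,1,1]
for p in ([1, 2, 1], [1, 2, 3], [3, 2, 1]):
    print(p, count_pattern(a, p))
```

On the example signal from the question this gives 2, 1 and 1 for the three patterns. A NumPy array can be passed as well, since groupby only needs an iterable.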

Related

Is there an easy way to get the number of repeating character in a word?

I'm trying to get how many any character repeats in a word. The repetitions must be sequential.
For example, the method with input "loooooveee" should return 6 (4 times 'o', 2 times 'e').
I'm trying to implement string level functions and I can do it this way but, is there an easy way to do this? Regex, or some other sort of things?
Original question: order of repetition does not matter
You can subtract the number of unique letters from the total number of letters. set applied to a string will return a unique collection of letters.
x = "loooooveee"
res = len(x) - len(set(x)) # 6
Or you can use collections.Counter, subtract 1 from each value, then sum:
from collections import Counter
c = Counter("loooooveee")
res = sum(i-1 for i in c.values()) # 6
New question: repetitions must be sequential
You can use itertools.groupby to group sequential identical characters:
from itertools import groupby
g = groupby("aooooaooaoo")
res = sum(sum(1 for _ in j) - 1 for i, j in g) # 5
To avoid the nested sum calls, you can use itertools.islice:
from itertools import groupby, islice
g = groupby("aooooaooaoo")
res = sum(1 for _, j in g for _ in islice(j, 1, None)) # 5
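The two readings of the question only differ when a character reappears after a gap. A small side-by-side sketch makes that visible; the function names total_repeats and sequential_repeats are illustrative and not part of the original answer.

```python
from itertools import groupby

def total_repeats(text):
    """Repeats anywhere: every occurrence beyond the first of each character."""
    return len(text) - len(set(text))

def sequential_repeats(text):
    """Repeats inside runs only: a run of length n contributes n - 1."""
    return sum(sum(1 for _ in run) - 1 for _, run in groupby(text))

for word in ("loooooveee", "abab"):
    print(word, total_repeats(word), sequential_repeats(word))
```

For "loooooveee" both functions return 6, while for "abab" the set-based count is 2 and the run-based count is 0.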
You could use a regular expression if you want:
import re
rx = re.compile(r'(\w)\1+')
repeating = sum(x[1] - x[0] - 1
for m in rx.finditer("loooooveee")
for x in [m.span()])
print(repeating)
This correctly yields 6 and makes use of the .span() function.
The expression is
(\w)\1+
which captures a word character (one of a-zA-Z0-9_) and tries to repeat it as often as possible.
See a demo on regex101.com for the repeating pattern.
If you want to match any character (that is, not only word characters), change your expression to:
(.)\1+
See another demo on regex101.com.
try this:
word = input('something:')
total = 0
chars = set(word)  # get the set of unique characters
for char in chars:  # iterate over the set and output the count for each repeated character
    if word.count(char) > 1:
        total += word.count(char)
        print('{}|{}'.format(char, word.count(char)))
print('Total:' + str(total))
EDIT:
added total count of repetitions
Since it doesn't matter where the repetition is occurring or which characters are being repeated, you can make use of the set data structure provided in Python. It will discard the duplicate occurrences of any character or an object.
Therefore, the solution would look something like this:
def measure_normalized_emphasis(text):
return len(text) - len(set(text))
This will give you the exact result.
Also, make sure to look out for some edge cases, which you should as it is a good practice.
I think your code is comparing the wrong things
You start by finding the last character:
char = text[-1]
Then you compare this to itself:
for i in range(1, len(text)):
    if text[-i] == char:  # <-- surely this is text[-1] to begin with?
Why not just run through the characters:
def measure_normalized_emphasis(text):
    char = text[0]
    emphasis_size = 0
    for i in range(1, len(text)):
        if text[i] == char:
            emphasis_size += 1
        else:
            char = text[i]
    return emphasis_size
This seems to work.

How to stop over counting of duplicate letters in a list of strings

I'm trying to count the number of times a duplicate letter shows up in the list element.
For example, given
arr = ['capps','hat','haaah']
I output a list, and I expect to get [1, 0, 1].
def myfunc(words):
    counter = 0  # counts duplicate letters in words
    len_ = len(words)-1
    for i in range(len_):
        if words[i] == words[i+1]:  # if the letter ahead is the same, add one
            counter+=1
    return counter
def minimalOperations(arr):
    return [*map(myfunc,arr)]  # map applies myfunc to each element in arr
But my code would output [1,0,2]
I'm not sure why I am over counting.
Can anyone help me resolve this, thank you in advance.
A more efficient solution using a regular expression:
import re
def myfunc(words):
    reg_str = r"(\w)\1{1,}"
    return len(re.findall(reg_str, words))
This function will find the number of substrings of length 2 or more containing the same letter. Thus 'aaa' in your example will only be counted once.
For a string like
'hhhhfafaahggaa'
the output will be 4, since there are 4 maximal substrings of the same letter occurring at least twice: 'hhhh', 'aa', 'gg', 'aa'
You aren't accounting for situations where you have greater than 2 identical characters in succession. To do this, you can look back as well as forward:
if (words[i] == words[i+1]) and (words[i] != words[i-1] if i != 0 else True):
    # as before
The conditional expression handles the first iteration of the loop, where words[i-1] would otherwise wrap around and compare the first letter of the string with the last.
Another solution is to use itertools.groupby and count the number of instances where a group has a length greater than 1:
arr = ['capps','hat','haaah']
from itertools import groupby
res = [sum(1 for _, j in groupby(el) if sum(1 for _ in j) > 1) for el in arr]
print(res)
[1, 0, 1]
The sum(1 for _ in j) part is used to count the number of items in a generator. It's also possible to use len(list(j)), though this requires list construction.
Well, your code counts the number of duplications, so what you observe is quite logical:
your input is arr = ['capps','hat','haaah']
in 'capps', the letter p is duplicated 1 time => myfunc() returns 1
in 'hat', there is no duplicated letter => myfunc() returns 0
in 'haaah', the letter a is duplicated 2 times => myfunc() returns 2
So finally you get [1,0,2].
For your purpose, I suggest using a regex to match and count the number of groups of duplicated letters in each word. I also replaced the usage of map() with a list comprehension, which I find more readable:
import re
def myfunc(words):
    return len(re.findall(r'(\w)\1+', words))
def minimalOperations(arr):
    return [myfunc(a) for a in arr]
arr = ['capps','hat','haaah']
print(minimalOperations(arr)) # [1,0,1]
arr = ['cappsuul','hatppprrrrtyyy','haaah']
print(minimalOperations(arr)) # [2,3,1]
You need to keep track of a little more state, specifically if you're looking at duplicates now.
def myfunc(words):
    counter = 0  # counts runs of duplicate letters in words
    seen = None  # the previous letter examined
    len_ = len(words)-1
    for i in range(len_):
        # count a pair only when it starts a new run, i.e. the previous letter was different
        if words[i] == words[i+1] and words[i+1] != seen:
            counter+=1
        seen = words[i]
    return counter
This gives you the following output
>>> arr = ['capps','hat','haaah']
>>> list(map(myfunc, arr))
[1, 0, 1]
As others have pointed out, you could use a regular expression and trade clarity for performance. The key is to find a regular expression that means "two or more repeated characters", and it may depend on what you consider to be characters (e.g. how do you treat duplicate punctuation?).
Note: the "regex" used for this is technically an extension on regular expressions because it requires memory.
The form will be len(re.findall(regex, words))
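To illustrate the point about what counts as a character, here is a minimal sketch comparing two candidate expressions. The helper name count_runs and the sample string "wow!!" are made up for the example.

```python
import re

def count_runs(text, regex):
    """Number of non-overlapping matches of regex in text."""
    return len(re.findall(regex, text))

# \w only matches letters, digits and underscore, so "!!" is ignored.
print(count_runs("wow!!", r'(\w)\1+'))
# . matches any character except a newline, so "!!" counts as a run.
print(count_runs("wow!!", r'(.)\1+'))
print(count_runs("haaah", r'(\w)\1+'))
```

The first call finds no repeated run, the second finds one (the two exclamation marks), and "haaah" yields one run with either expression.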
I would break this kind of problem into smaller chunks. Starting by grouping duplicates.
The documentation for itertools has groupby and recipes for this kind of things.
A slightly edited version of unique_justseen would look like this:
duplicates = (sum(1 for _ in group) for _key, group in itertools.groupby("haaah"))
and yields values: 1, 3, 1. As soon as any of these values are greater than 1 you have a duplicate. So just count them:
sum(n > 1 for n in duplicates)
Use re.findall for matches of 2 or more letters
>>> arr = ['capps','hat','haaah']
>>> [len(re.findall(r'(.)\1+', w)) for w in arr]
[1, 0, 1]

Finding regular expression with at least one repetition of each letter

From any *.fasta DNA sequence (only 'ACTG' characters) I must find all sequences which contain at least one repetition of each letter.
For example, from the sequence 'AAGTCCTAG' I should be able to find: 'AAGTC', 'AGTC', 'GTCCTA', 'TCCTAG', 'CCTAG' and 'CTAG' (iterating over each letter).
I have no clue how to do that in Python 2.7. I tried regular expressions, but they did not find every variant.
How can I achieve that?
You could find all substrings of length 4+, and then down select from those to find only the shortest possible combinations that contain one of each letter:
s = 'AAGTCCTAG'
def get_shortest(s):
    l, b = len(s), set('ATCG')
    options = [s[i:j+1] for i in range(l) for j in range(i,l) if (j+1)-i > 3]
    return [i for i in options if len(set(i) & b) == 4 and (set(i) != set(i[:-1]))]
print(get_shortest(s))
Output:
['AAGTC', 'AGTC', 'GTCCTA', 'TCCTAG', 'CCTAG', 'CTAG']
This is another way you can do it. Maybe not as fast and nice as chrisz's answer, but maybe a little simpler to read and understand for beginners.
DNA='AAGTCCTAG'
toSave=[]
for i in range(len(DNA)):
    letters=['A','G','T','C']
    j=i
    seq=[]
    while len(letters)>0 and j<(len(DNA)):
        seq.append(DNA[j])
        try:
            letters.remove(DNA[j])
        except ValueError:
            # this letter was already removed for the current start position
            pass
        j+=1
    if len(letters)==0:
        toSave.append(seq)
print(toSave)
Since the substring you are looking for may be of almost any length, a FIFO queue seems to work. Append one letter at a time and check whether there is at least one of each letter. If so, return the current substring. Then remove letters from the front and keep checking until the substring is no longer valid.
def find_agtc_seq(seq_in):
    chars = 'AGTC'
    cur_str = []
    for ch in seq_in:
        cur_str.append(ch)
        while all(map(cur_str.count,chars)):
            yield("".join(cur_str))
            cur_str.pop(0)
seq = 'AAGTCCTAG'
for substr in find_agtc_seq(seq):
    print(substr)
That seems to result in the substrings you are looking for:
AAGTC
AGTC
GTCCTA
TCCTAG
CCTAG
CTAG
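The same queue idea can be sketched with collections.deque and collections.Counter, which avoids the repeated list.count calls and the linear-time pop(0). This is an illustrative variant rather than the answer's own code; the name find_covering_windows and its alphabet parameter are assumptions.

```python
from collections import Counter, deque

def find_covering_windows(seq, alphabet="AGTC"):
    """Yield each window that contains every letter of alphabet, shrinking from the front."""
    window = deque()
    counts = Counter()
    for ch in seq:
        window.append(ch)
        counts[ch] += 1
        # While the window still covers the alphabet, report it and drop the front letter.
        while all(counts[c] > 0 for c in alphabet):
            yield "".join(window)
            counts[window.popleft()] -= 1

print(list(find_covering_windows("AAGTCCTAG")))
```

For 'AAGTCCTAG' this produces the same six substrings, in the same order, as the list-based generator.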
I really wanted to create a short answer for this, so this is what I came up with!
s = 'AAGTCCTAG'
d = 'ACGT'
c = len(d)
while c <= len(s):
    x,c = s[:c],c+1
    if all(l in x for l in d):
        print(x)
        s,c = s[1:],len(d)
It works as follows:
c is set to the length of the string of characters we are ensuring exist in the string (d = ACGT)
The while loop iterates over each possible substring of s as long as c does not exceed the length of s.
This works by increasing c by 1 upon each iteration of the while loop.
If every character in our string d (ACGT) exist in the substring, we print the result, reset c to its default value and slice the string by 1 character from the start.
The loop continues until the string s is shorter than d
Result:
AAGTC
AGTC
GTCCTA
TCCTAG
CCTAG
CTAG
To get the output in a list instead:
s = 'AAGTCCTAG'
d = 'ACGT'
c,r = len(d),[]
while c <= len(s):
    x,c = s[:c],c+1
    if all(l in x for l in d):
        r.append(x)
        s,c = s[1:],len(d)
print(r)
Result:
['AAGTC', 'AGTC', 'GTCCTA', 'TCCTAG', 'CCTAG', 'CTAG']
If you can break the sequence into a list, e.g. of 5-letter sequences, you could then use this function to find repeated sequences.
from itertools import groupby
import numpy as np
def find_repeats(input_list, n_repeats):
    flagged_items = []
    for item in input_list:
        # Create itertools.groupby object
        groups = groupby(str(item))
        # Create list of tuples: (character, number of repeats)
        result = [(label, sum(1 for _ in group)) for label, group in groups]
        # Extract just number of repeats
        char_lens = np.array([x[1] for x in result])
        # Append to flagged items
        if any(char_lens >= n_repeats):
            flagged_items.append(item)
    # Return flagged items
    return flagged_items
#--------------------------------------
test_list = ['aatcg', 'ctagg', 'catcg']
find_repeats(test_list, n_repeats=2) # Returns ['aatcg', 'ctagg']

Matching exact elements in a set where order doesn't matter

I'm new to python and I'm trying to match the exact elements between two sets, regardless of order. So if my 2 sets are:
reflist = [1],[2,3,4],[5,6]
qlist = [1,2,3,4],[6,5]
The number of matches should be 1, which is 5,6
I tried to write the following loop to match the elements in qlist against reflist, and count the number of matches:
i = 0
count = 0
for each in qlist:
while i < len(qlist):
if each.split(",").sort == reflist[i].split(",").sort:
count = count + 1
i = i + 1
print count
However, I keep getting count = 0, even if the order of 5 and 6 in qlist is 5,6. Would really appreciate any help with this!
If there are no duplicates in your "sets", convert your "sets" to a set of frozensets, and find the set intersection -
i = set(map(frozenset, reflist))
j = map(frozenset, qlist)
len(i.intersection(j))
1
This could do:
If you have no duplicates:
matches = [x for x in map(set, reflist) if x in map(set, qlist)]
If you have duplicates:
matches = [x for x in map(sorted, reflist) if x in map(sorted, qlist)]
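Why the duplicate case needs a different comparison can be seen in a short sketch; the sample lists r and q are made up for illustration.

```python
from collections import Counter

r, q = [1, 1, 2], [2, 1]

# A set forgets how often each element occurs, so these compare equal.
print(set(r) == set(q))
# Sorting keeps duplicates, so the lists differ.
print(sorted(r) == sorted(q))
# A Counter also keeps multiplicities and does not need the elements to be orderable.
print(Counter(r) == Counter(q))
```

Only the set comparison reports a match here, which is why it is the right choice only when the inner lists contain no duplicates.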
You could always use collections.Counter() for this:
from collections import Counter
reflist = [[1],[2,3,4],[5,6]]
qlist = [[1,2,3,4],[6,5]]
result = [list(x.keys()) for x in [Counter(y) for y in reflist] if x in [Counter(y) for y in qlist]]
print(result)
Which Outputs:
[[5,6]]
Here is my one-liner, using frozensets and the set intersection operator &:
len(set(map(frozenset, qlist)) & set(map(frozenset, reflist)))
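A note on operators, as a small sketch using the lists from the question: the intersection of two sets is written with &, whereas Python's and keyword is a logical operator that simply returns one of its operands. The variable names ref_sets, q_sets and common are illustrative.

```python
reflist = [[1], [2, 3, 4], [5, 6]]
qlist = [[1, 2, 3, 4], [6, 5]]

ref_sets = set(map(frozenset, reflist))
q_sets = set(map(frozenset, qlist))

# Set intersection: only frozenset({5, 6}) is present in both.
common = q_sets & ref_sets
print(len(common))

# Logical "and": q_sets is non-empty (truthy), so the result is ref_sets itself.
print(len(q_sets and ref_sets))
```

The first print shows 1 match, while the second shows 3, the size of ref_sets, which is not a match count at all.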
I understand you are new to Python, hence I will answer your question using your own method, just for the sake of recording the basic straightforward answer for future reference.
First of all, your code shouldn't run at all. It must error out, because both each and reflist[i] are lists, and you are applying a string method of split(",") on them. Therefore you are getting the initial value of count = 0. You must check in the first place whether your code is even touching all the elements of qlist and reflist. This is not Code Review, hence I will leave it to you to run this and see the answer:
i = 0
count = 0
for each in qlist:
    while i < len(qlist):
        print(i)
        print(each)
        print(reflist[i])
        i = i + 1
Keep in mind: You don't have to iterate on index! You can just loop over the elements of iterables directly! This is the answer you are looking for:
match = []  # Optional, to see all the matching elements
count = 0
for q in qlist:
    for r in reflist:
        if set(q) == set(r):
            print(q, r)
            match.append(q)
            count += 1
print(match)
print(count, len(match))

Intro to Python - Lists questions

we've started doing Lists in our class and I'm a bit confused thus coming here since previous questions/answers have helped me in the past.
The first question was to sum up all negative numbers in a list, I think I got it right but just want to double check.
import random
def sumNegative(lst):
    sum = 0
    for e in lst:
        if e < 0:
            sum = sum + e
    return sum
lst = []
for i in range(100):
    lst.append(random.randrange(-1000, 1000))
print(sumNegative(lst))
For the 2nd question, I'm a bit stuck on how to write it. The question was:
Count how many words occur in a list up to and including the first occurrence of the word “sap”. I'm assuming it's a random list but wasn't given much info so just going off that.
I know the ending would be similar but no idea how the initial part would be since it's string opposed to numbers.
I wrote a code for a in-class problem which was to count how many odd numbers are on a list(It was random list here, so assuming it's random for that question as well) and got:
import random
def countOdd(lst):
    odd = 0
    for e in lst:
        if e % 2 = 0:
            odd = odd + 1
    return odd
lst = []
for i in range(100):
    lst.append(random.randint(0, 1000))
print(countOdd(lst))
How exactly would I change this to fit the criteria for the 2nd question? I'm just confused on that part. Thanks.
The code to sum -ve numbers looks fine! I might suggest testing it on a list that you can manually check, such as:
print(sumNegative([1, -1, -2]))
The same logic would apply to your random list.
A note about your countOdd function, it appears that you are missing an = (== checks for equality, = is for assignment) and the code seems to count even numbers, not odd. The code should be:
def countOdd(lst):
    odd = 0
    for e in lst:
        if e%2 == 1: # Odd%2 == 1
            odd = odd + 1
    return odd
As for your second question, you can use a very similar function:
def countWordsBeforeSap(inputList):
    numWords = 0
    for word in inputList:
        if word.lower() != "sap":
            numWords = numWords + 1
        else:
            return numWords
inputList = ["trees", "produce", "sap"]
print(countWordsBeforeSap(inputList))
To explain the above, the countWordsBeforeSap function:
Starts iterating through the words.
If the word is anything other than "sap" it increments the counter and continues
If the word IS "sap" then it returns early from the function
The function could be more general by passing in the word that you wanted to check for:
def countWordsBefore(inputList, wordToCheckFor):
    numWords = 0
    for word in inputList:
        if word.lower() != wordToCheckFor:
            numWords = numWords + 1
        else:
            return numWords
inputList = ["trees", "produce", "sap"]
print(countWordsBefore(inputList, "sap"))
If the words that you are checking come from a single string then you would initially need to split the string into individual words like so:
inputString = "Trees produce sap"
inputList = inputString.split(" ")
Which splits the initial string into words that are separated by spaces.
Hope this helps!
Tom
def count_words(lst, end="sap"):
    """Note that I added an extra input parameter.
    This input parameter has a default value of "sap" which is the actual question.
    However you can change this input parameter to any other word if you want to by
    just doing count_words(lst, "another_word").
    """
    words = []
    # First we need to loop through each item in the list.
    for item in lst:
        # We append the item to our "words" list first thing in this loop,
        # as this will make sure we will count up to and INCLUDING.
        words.append(item)
        # Now check if we have reached the 'end' word.
        if item == end:
            # Break out of the loop prematurely, as we have reached the end.
            break
    # Our 'words' list now has all the words up to and including the 'end' variable.
    # 'len' will return how many items there are in the list.
    return len(words)
lst = ["something", "another", "woo", "sap", "this_wont_be_counted"]
print(count_words(lst))
Hope this helps you understand lists better!
You can make effective use of list/generator comprehensions. The following are fast and memory efficient.
1. Sum of negatives:
print(sum(i for i in lst if i < 0))
2. Count of words before sap: Like your sample list, this assumes there are no numbers in the list.
print(lst.index('sap'))
If it's a random list, keep only the strings first, then find the index of sap:
l = ['a','b',1,2,'sap',3,'d']
l = [x for x in l if isinstance(x, str)]
print(l.index('sap'))
3. Count of odd numbers:
print(sum(i%2 != 0 for i in lst))
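Since the exercise asks for the count up to and including "sap", an index-based version needs a +1, and list.index raises ValueError when the word is missing. A minimal sketch follows; the function name count_up_to is made up, and returning the full length when the word is absent is an assumption, not something the exercise specifies.

```python
def count_up_to(words, target="sap"):
    """Count the words up to and including the first occurrence of target."""
    try:
        # index() is zero-based, so add 1 to include the target itself.
        return words.index(target) + 1
    except ValueError:
        # Assumption: if target never occurs, every word is counted.
        return len(words)

lst = ["something", "another", "woo", "sap", "this_wont_be_counted"]
print(count_up_to(lst))
```

For the sample list this prints 4, matching the loop-based count_words answer.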
