calculate histogram peaks in python - python

In Python, how do I calculate the peaks of a histogram?
I tried this:
import numpy as np
from scipy.signal import argrelextrema
data = [0, 1, 2, 3, 4, 0, 1, 2, 3, 4, 0, 1, 2, 3, 4, 1, 2, 3, 4,
5, 6, 7, 8, 9, 5, 6, 7, 8, 9, 5, 6, 7, 8, 9,
12,
15, 16, 17, 18, 19, 15, 16, 17, 18,
19, 20, 21, 22, 23, 24,]
h = np.histogram(data, bins=[0, 5, 10, 15, 20, 25])
hData = h[0]
peaks = argrelextrema(hData, np.greater)
But the result was:
(array([3]),)
I'd expect it to find the peaks in bin 0 and bin 3.
Note that the peaks span more than one bin. I don't want a peak that spans more than one column to be counted as additional peaks.
I'm open to another way to get the peaks.
Note:
>>> h[0]
array([19, 15, 1, 10, 5])
>>>

In computational topology, the formalism of persistent homology provides a definition of "peak" that seems to address your need. In the 1-dimensional case, the peaks are illustrated by blue bars:
[Figure: peaks of a 1-dimensional signal marked by blue bars]
A description of the algorithm is given in a Stack Overflow answer to a peak detection question.
The nice thing is that this method not only identifies the peaks but also quantifies their "significance" in a natural way.
A simple and efficient implementation (as fast as sorting numbers), along with the source material for the above answer, is given in this blog article:
https://www.sthu.org/blog/13-perstopology-peakdetection/index.html
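The idea can be sketched for a 1-dimensional list of bin counts as follows. This is a minimal illustration written for this page, not the code from the blog article; the function name persistence_peaks and the tie-breaking rule (leftmost first) are assumptions. Values are visited from highest to lowest. A value with no already-visited neighbour starts a new peak, and when two peaks meet, the lower one ends with a persistence equal to its height minus the height of the meeting point.

```python
import math

def persistence_peaks(values):
    """Return (peak_index, persistence) pairs, most persistent first."""
    n = len(values)
    parent = [None] * n  # None means "not visited yet"

    def find(i):
        # Follow parent links to the peak index that represents i's component.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    persistence = {}
    # Visit indices from the highest value to the lowest (stable on ties).
    for i in sorted(range(n), key=lambda j: -values[j]):
        parent[i] = i
        roots = [find(j) for j in (i - 1, i + 1)
                 if 0 <= j < n and parent[j] is not None]
        if not roots:
            # No visited neighbour: a new peak is born here.
            persistence[i] = math.inf
            continue
        # Attach i to the neighbouring component with the highest peak.
        roots.sort(key=lambda r: values[r], reverse=True)
        winner = roots[0]
        parent[i] = winner
        for loser in roots[1:]:
            # The lower peak merges into the higher one at height values[i].
            persistence[loser] = values[loser] - values[i]
            parent[loser] = winner
    return sorted(persistence.items(), key=lambda kv: -kv[1])

print(persistence_peaks([19, 15, 1, 10, 5]))
```

On the histogram counts from the question, [19, 15, 1, 10, 5], this reports bins 0 and 3: bin 0 is the global maximum and keeps infinite persistence, while bin 3 has a persistence of 9. A plateau such as [1, 3, 3, 1] yields a single peak.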

Try the findpeaks library.
pip install findpeaks
# Your input data:
data = [0, 1, 2, 3, 4, 0, 1, 2, 3, 4, 0, 1, 2, 3, 4, 1, 2, 3, 4, 5, 6, 7, 8, 9, 5, 6, 7, 8, 9, 5, 6, 7, 8, 9, 12, 15, 16, 17, 18, 19, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24,]
# import library
from findpeaks import findpeaks
# Find some peaks using the smoothing parameter.
fp = findpeaks(lookahead=1, interpolate=10)
# fit
results = fp.fit(data)
# Make plot
fp.plot()
# Results with respect to original input data.
results['df']
# Results based on interpolated smoothed data.
results['df_interp']

I wrote an easy function:
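import numpy as np
def find_peaks(a):
    x = np.array(a)
    max_value = np.max(x)
    length = len(a)
    ret = []
    for i in range(length):
        ispeak = True
        if i - 1 >= 0:  # compare with the left neighbor, if there is one
            ispeak &= (x[i] > 1.8 * x[i-1])
        if i + 1 < length:  # compare with the right neighbor, if there is one
            ispeak &= (x[i] > 1.8 * x[i+1])
        ispeak &= (x[i] > 0.05 * max_value)
        if ispeak:
            ret.append(i)
    return ret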
I defined a peak as a value greater than 180% of each of its neighbors and greater than 5% of the maximum value. Of course, you can adapt these thresholds as you prefer to find the best setup for your problem.

Related

Clean way to generate random numbers from 0 to 50 of size 1000 in python, with no similar number of occurrences

What would be the cleanest way to generate random numbers from 0 to 50, of size 1000, with the condition that no number should have the same number of occurrence as any other number using python and numpy.
Example for size 10: [0, 0, 0, 1, 1, 3, 3, 3, 3, 2] --> no number occurs same number of times
Drawing from a rng.dirichlet distribution and rejecting invalid samples guarantees that the requirements are met, but with low entropy for the number of unique elements. You have to adjust the range of the number of unique elements yourself with np.ones(rng.integers(min,max)). If max approaches the maximum number of unique elements (here 50), rejection might take a long time or have no solution, causing an infinite loop. The code below produces an array of size 100.
import numpy as np
times = np.array([])
rng = np.random.default_rng()
#rejection sampling
while times.sum() != 100 or len(times) != len(np.unique(times)):
    times = np.around(rng.dirichlet(np.ones(rng.integers(5,10)))*100)
nr = rng.permutation(np.arange(51))[:len(times)]
np.repeat(nr, times.astype(int))
Random output
array([ 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 33, 33, 33,
33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33,
21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21,
21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 22,
22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 25, 5, 5, 5])
Here's a recursive and possibly very slow implementation that produces the output desired.
import numpy as np
def get_sequence_lengths(values, total):
    if total == 0:
        return [[]], True
    if total < 0:
        return [], False
    if len(values) == 0:
        return [], False
    sequences = []
    result = False
    for i in range(len(values)):
        ls, suc = get_sequence_lengths(values[:i] + values[i + 1:], total - values[i])
        result |= suc
        if suc:
            sequences.extend([[values[i]] + s for s in ls])
    return sequences, result
def gen_numbers(rand_min, rand_max, count):
    values = list(range(rand_min, rand_max + 1))
    sequences, success = get_sequence_lengths(list(range(1, count+1)), count)
    sequences = list(filter(lambda x: len(x) <= 1 + rand_max - rand_min, sequences))
    if not success or not len(sequences):
        raise ValueError('Cannot generate with given parameters.')
    sequence = sequences[np.random.randint(len(sequences))]
    values = np.random.choice(values, len(sequence), replace=False)
    result = []
    for v, s in zip(values, sequence):
        result.extend([v] * s)
    return result
get_sequence_lengths generates all permutations of unique positive integers that sum to the given total. The sequences are then filtered by the number of available values. Finally, pairing randomly chosen values with the counts from one sequence produces the output.
As mentioned above, get_sequence_lengths is recursive and will be quite slow for larger input values.
To avoid the variability of generating random combinations in a potentially long trial-and-error loop, you could use a function that directly produces a random partition of a number in which all parts are distinct (increasing). From that, you simply need to map shuffled numbers over the chunks provided by the partition function:
import random
def randPart(N,size=0): # O(√N)
    if not size:
        maxSize = int((N*2+0.25)**0.5-0.5) # ∑1..maxSize <= N
        size = random.randrange(1,maxSize) # select random size
    if size == 1: return (N,) # one part --> all of N
    s = size*(size-1)//2 # min sum of deltas for rest
    a = random.randrange(1,(N-s)//size) # base value
    p = randPart(N-a*size,size-1) # deltas on other parts
    return (a,*(n+a for n in p)) # combine to distinct parts
usage:
size = 30
n = 10
chunks = randPart(size)
numbers = random.sample(range(n),len(chunks))
result = [n for count,n in zip(chunks,numbers) for _ in range(count)]
print(result)
[9, 9, 9, 0, 0, 0, 0, 7, 7, 7, 7, 7, 5, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6,
6, 6, 6, 6, 6, 6, 6]
# resulting frequency counts
from collections import Counter
print(sorted(Counter(result).values()))
[3, 4, 5, 6, 12]
Note that if your range of random numbers is smaller than the maximum number of distinct parts (for example, fewer than 44 numbers for an output of 1000 values), you would need to modify the randPart function to take the limit into account in its calculation of maxSize:
def randPart(N,sizeLimit=0,size=0):
    if not size:
        maxSize = int((N*2+0.25)**0.5-0.5) # ∑1..maxSize <= N
        maxSize = min(maxSize,sizeLimit or maxSize)
        ...
You could also change it to force a minimum number of parts.
This solves your problem in the way @MYousefi suggested.
import random
seq = list(range(50))
random.shuffle(seq)
values = []
for n,v in enumerate(seq):
    values.extend( [v]*(n+1) )
    if len(values) > 1000:
        break
print(values)
Note that you can't get exactly 1,000 numbers. At first, I generated the entire sequence and then took the first 1,000, but that means whichever sequence gets truncated will be the same length as one of the earlier ones. You end up with 1,035.
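If exactly 1,000 values are required, one possible variation is to keep the counts 1, 2, 3, ... for as long as they fit and then add whatever is left over to the largest count, which keeps every count distinct. The sketch below is only an illustration under stated assumptions: the function name distinct_count_sample is hypothetical, total is assumed to be at least 1, and the range 0 to 50 is treated as inclusive.

```python
import random
from collections import Counter

def distinct_count_sample(total=1000, low=0, high=50, rng=random):
    """Return `total` numbers in [low, high] where no two numbers share a count."""
    n_values = high - low + 1
    counts = []
    # Use counts 1, 2, 3, ... while they fit and values remain available.
    while len(counts) < n_values and sum(counts) + len(counts) + 1 <= total:
        counts.append(len(counts) + 1)
    # Give the leftover to the largest count; it stays larger than all others.
    counts[-1] += total - sum(counts)
    # Pair each count with a distinct random value.
    values = rng.sample(range(low, high + 1), len(counts))
    result = [v for v, c in zip(values, counts) for _ in range(c)]
    rng.shuffle(result)
    return result

numbers = distinct_count_sample()
print(len(numbers), sorted(Counter(numbers).values()))
```

For a total of 1,000 this uses 44 distinct numbers with counts 1 through 43 plus one count of 54.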

Fitting a large number of bars into a matplotlib barh graph

I'm trying to make a horizontal bar graph with a large number of elements/bars with matplotlib's barh function. However, I'm having a couple of problems with bars being too close together and their labels being illegible (see image below):
I first tried changing the figure size, setting figsize=(10,40) and increasing the height up from 40, to no avail.
I also tried bumping up the spacing between bars from 0.2 to 0.3 (in the positions list), but it seems that going any higher than a spacing of 0.2 makes some of the bars disappear. In other words, there seem to be clusters of ~5 bars that are too close together that get spaced properly at 0.3, but all the bars between these clusters disappear.
The code is shown below (adapted from the mpl docs/examples). I'm sure there's rather an easy fix here that I'm just too much of a novice to realize. Alternatively, I could try graphing this in matlab but I prefer python for quality and simplicity. Are there improvements I could make that would make my bar graph legible?
Code:
genus = {'Parasutterella': 1, 'Anaerobaculum': 1, 'Clostridiales': 1, 'Butyrivibrio': 1, 'Anaerococcus': 1, 'Neisseria': 1, 'Campylobacter': 1, 'Intestinibacter': 1, 'Erysipelatoclostridium': 1, 'Tannerella': 1, 'Barnesiella': 1, 'Enterobacter': 1, 'Odoribacter': 1, 'Arcobacter': 1, 'Dialister': 1, 'Alistipes': 1, 'Collinsella': 2, 'Synergistes': 2, 'Burkholderiales': 2, 'Gordonibacter': 2, 'Tyzzerella': 2, 'Providencia': 2, 'Weissella': 2, 'Enterobacteriaceae': 2, 'Flavonifractor': 2, 'Prevotella': 2, 'Klebsiella': 2, 'Citrobacter': 2, 'Actinomyces': 2, 'Proteus': 2, 'Catenibacterium': 2, 'Propionibacterium': 2, 'Mitsuokella': 2, 'butyrate-producing': 2, 'Parvimonas': 2, 'Phascolarctobacterium': 2, 'Desulfovibrio': 2, 'Cedecea': 2, 'Finegoldia': 2, 'Slackia': 3, '[Bacteroides]': 3, 'Hafnia': 3, 'Acidaminococcus': 3, 'Bifidobacterium': 3, 'Sutterella': 3, 'Anaerofustis': 3, 'Paraprevotella': 3, 'Oxalobacter': 3, 'Yokenella': 3, 'Leuconostoc': 3, 'Dermabacter': 3, 'Megamonas': 4, 'Staphylococcus': 4, 'Fusobacterium': 4, 'Anaerostipes': 4, 'Bilophila': 4, 'Butyricicoccus': 4, 'Parabacteroides': 4, 'Erysipelotrichaceae': 4, 'Anaerotruncus': 4, 'Listeria': 4, 'Corynebacterium': 5, 'Pseudoflavonifractor': 5, 'Dorea': 5, 'Streptococcus': 6, 'Roseburia': 6, 'Helicobacter': 6, 'Eggerthella': 6, 'Acinetobacter': 6, '[Clostridium': 6, 'Ruminococcaceae': 6, 'Dysgonomonas': 6, '[Eubacterium]': 6, 'Enterococcus': 6, 'Subdoligranulum': 7, 'Faecalibacterium': 7, 'Blautia': 8, 'Holdemania': 8, 'Bacteroides': 8, 'Marvinbryantia': 8, 'Coprococcus': 9, 'Eubacterium': 9, 'Lactobacillus': 9, 'Paenisporosarcina': 9, 'Turicibacter': 9, 'Ruminococcus': 10, 'Coprobacillus': 11, 'Ralstonia': 11, 'Peptoclostridium': 11, 'Pseudomonas': 13, 'Desulfitobacterium': 14, 'Bacillus': 15, 'Streptomyces': 26, '[Clostridium]': 29, 'Paenibacillus': 32, 'Lachnospiraceae': 32, 'Clostridium': 35}
import os
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
# pos_genus and paths are defined elsewhere in the script (not shown)
barWidth = 0.125
labels = list(genus.keys())
cols = len(labels)
bars = []
positions = [(i+1)*0.2 for i in range(cols)]
for key in labels:
    bars.append(genus[key])
fig,ax = plt.subplots()
rects = []
for i in range(len(bars)):
    if labels[i] in pos_genus:
        rects.append(ax.barh(y=positions[i], width=bars[i], height=barWidth, color='#000000',label='Gram Positive'))
    else:
        rects.append(ax.barh(y=positions[i], width=bars[i], height=barWidth, color='#E8384F',label='Gram Negative'))
ax.set_title('Genus')
ax.set_yticks(positions)
ax.set_yticklabels(labels)
ax.set_ylabel('Genus')
ax.set_xlabel('Number of Organisms')
#ax.set_ylim(positions[0]-barWidth,positions[-1]+barWidth)
ax.set_xlim(0,40)
blk_patch = mpatches.Patch(color='#000000', label='Gram Positive')
red_patch = mpatches.Patch(color='#E8384F', label='Gram Negative')
plt.legend(handles=[blk_patch, red_patch])
#plt.figure(figsize=(10,50))
bar_path = os.path.join(paths['Figures'], "{0}_horiz_bar.png".format(str('genus')))
plt.savefig(bar_path,dpi=300,bbox_inches='tight')
plt.show()
[Image: illegible barh plot]

How to calculate median from 2 different lists in Python

I have two lists, note = [6,8,10,13,14,17] and Effective = [3,5,6,7,5,1]. The first one represents grades, the second one the number of students in the class who got that grade, so 3 kids got a 6 and 1 got a 17. I want to calculate the mean and the median. For the mean I have:
note = [6,8,10,13,14,17]
Effective = [3,5,6,7,5,1]
products = []
for num1, num2 in zip(note, Effective):
    products.append(num1 * num2)
print(sum(products)/(sum(Effective)))
My first question is, how do I turn both lists into a 3rd list:
(6,6,6,8,8,8,8,8,10,10,10,10,10,10,13,13,13,13,13,13,13,14,14,14,14,14,17)
in order to get the median.
Thanks,
Donka
Here's one approach iterating over Effective on an inner level to replicate each number as many times as specified in Effective, and taking the median using statistics.median:
from statistics import median
out = []
for i in range(len(note)):
    for _ in range(Effective[i]):
        out.append(note[i])
print(median(out))
# 10
To get your list you could do something like
total = []
for grade, freq in zip(note, Effective):
    total += freq*[grade]
You can use np.repeat to get an array with the repeated values.
note = [6,8,10,13,14,17]
Effective = [3,5,6,7,5,1]
import numpy as np
new_list = np.repeat(note,Effective)
np.median(new_list),np.mean(new_list)
To get output like the third list you expect, you can do something like this:
from statistics import median
note = [6,8,10,13,14,17]
Effective = [3,5,6,7,5,1]
newList = []
for index,value in enumerate(Effective):
    for j in range(value):
        newList.append(note[index])
print(newList)
print("Median is {}".format(median(newList)))
Output:
[6, 6, 6, 8, 8, 8, 8, 8, 10, 10, 10, 10, 10, 10, 13, 13, 13, 13, 13, 13, 13, 14, 14, 14, 14, 14, 17]
Median is 10
For computing the median I suggest you use statistics.median:
from statistics import median
note = [6, 8, 10, 13, 14, 17]
effective = [3, 5, 6, 7, 5, 1]
total = [n for n, e in zip(note, effective) for _ in range(e)]
result = median(total)
print(result)
Output
10
If you look at total (in the code above), you have:
[6, 6, 6, 8, 8, 8, 8, 8, 10, 10, 10, 10, 10, 10, 13, 13, 13, 13, 13, 13, 13, 14, 14, 14, 14, 14, 17]
A functional alternative, using repeat:
from statistics import median
from itertools import repeat
note = [6, 8, 10, 13, 14, 17]
effective = [3, 5, 6, 7, 5, 1]
total = [v for vs in map(repeat, note, effective) for v in vs]
result = median(total)
print(result)
note = [6,8,10,13,14,17]
effective = [3,5,6,7,5,1]
newlist=[]
for i in range(0,len(note)):
    for j in range(effective[i]):
        newlist.append(note[i])
print(newlist)
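All of the answers above expand the grades into a full list first. As an alternative sketch, the median can also be read from the cumulative frequencies without building that list. This is a different technique from the ones shown above, and the function names grouped_median and grouped_mean are hypothetical.

```python
from bisect import bisect_right
from itertools import accumulate

def grouped_median(values, freqs):
    """Median of values where values[i] occurs freqs[i] times."""
    pairs = sorted(zip(values, freqs))
    vals = [v for v, _ in pairs]
    # cum[i] = how many observations are <= vals[i]
    cum = list(accumulate(f for _, f in pairs))
    n = cum[-1]
    # 0-based positions of the middle observation(s); equal when n is odd
    lo, hi = (n - 1) // 2, n // 2
    # Observation k belongs to the first value whose cumulative count exceeds k.
    v_lo = vals[bisect_right(cum, lo)]
    v_hi = vals[bisect_right(cum, hi)]
    return (v_lo + v_hi) / 2

def grouped_mean(values, freqs):
    """Frequency-weighted mean."""
    return sum(v * f for v, f in zip(values, freqs)) / sum(freqs)

note = [6, 8, 10, 13, 14, 17]
effective = [3, 5, 6, 7, 5, 1]
print(grouped_median(note, effective), grouped_mean(note, effective))
```

With the lists from the question, the median comes out as 10.0, matching statistics.median on the expanded list.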

How to randomly select a specific sequence from a list?

I have a list of hours starting from 0 (midnight).
hour = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23]
I want to generate a sequence of 3 consecutive hours randomly. Example:
[3,6]
or
[15, 18]
or
[23,2]
and so on. random.sample does not achieve what I want!
import random
hourSequence = sorted(random.sample(range(1,24), 2))
Any suggestions?
I'm not exactly sure what you want, but probably:
import random
s = random.randint(0, 23)
r = [s, (s+3)%24]
r
Out[14]: [16, 19]
Note: None of the other answers take into consideration the possible sequence [23,0,1].
Consider the following, which uses itertools from the Python standard library:
from itertools import islice, cycle
from random import choice
hours = list(range(24)) # List with 24 hours
hours_cycle = cycle(hours) # Turn the list into an endless cycle
select_init = islice(hours_cycle, choice(hours), None) # Start the iterator at a random position
# Get the next 3 values from the iterator
select_range = []
for i in range(3):
    select_range.append(next(select_init))
print(select_range)
This prints sequences of three values from your hours list in a circular way, so the results can also include, for example, [23,0,1].
You can try this:
import random
hour = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23]
index = random.randint(0, len(hour) - 1)
l = [hour[index], hour[(index + 3) % len(hour)]]  # wrap around past 23
print(l)
You can pick a random index into the hour list you already created and take the element that is 3 places after it, wrapping around at the end:
import random
def random_sequence_endpoints(l, span):
    i = random.choice(range(len(l)))
    return [l[i], l[(i+span) % len(l)]]
hour = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23]
result = random_sequence_endpoints(hour, 3)
This works not only for the hours list above but for any other list, whatever elements it contains.
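The answers above differ on whether the result is the two endpoints of the span or the full run of hours. Both can be expressed with modular arithmetic, as in the following small sketch; the function names window_endpoints and random_hour_window are hypothetical and not taken from any of the answers.

```python
import random

def window_endpoints(start, span=3, hours_in_day=24):
    """Start and end hour of a span, wrapping around midnight."""
    return [start, (start + span) % hours_in_day]

def random_hour_window(length=3, hours_in_day=24, rng=random):
    """`length` consecutive hours from a random start, wrapping around midnight."""
    start = rng.randrange(hours_in_day)
    return [(start + offset) % hours_in_day for offset in range(length)]

print(window_endpoints(23))   # endpoints of the span that starts at 23
print(random_hour_window())   # e.g. three consecutive hours
```

Taking the start hour as a parameter in window_endpoints makes the wrap-around case easy to check without depending on the random draw.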

Function reads np.array - produces the mean for k nn to number p in np.array

I need to define a function which reads a numpy array and produces the mean of the k nearest points to a number p in the array.
Example:
array = np.array([1, 2, 3, 4, 5, 6, 7, 50, 24, 32, 9, 11, 12, 10])
p = 15 (note that this is not a number in the array; I need to find the numbers closest to p, or p itself)
k = 3
In this case, I would need to generate the mean of [11, 12, 10], as they are closest to p = 15.
With the above numbers, I need to find the mean of the k points closest to p, and p may or may not appear in the array.
I am new and very confused at this point and feel I have exhausted my resources. I feel this question has been asked before but the answers are much too complex for what I need.
Thanks in advance.
Given a (1d) array arr and scalar input p, here's how you could find the mean of the n nearest values:
import numpy as np
def neighbor_mean(arr, p, n=3):
    idx = np.abs(arr - p).argsort()[:n]
    return arr[idx].mean()
arr = np.array([1, 2, 3, 4, 5, 6, 7, 50, 24, 32, 9, 11, 12, 10])
neighbor_mean(arr, p=15)
# 11.0
In the above, first you take the absolute differences:
np.abs(arr - 15)
# array([14, 13, 12, 11, 10, 9, 8, 35, 9, 17, 6, 4, 3, 5])
Then argsort() returns the indices that would sort the array. We're interested in the indices of the n smallest absolute differences, which is what you're really looking for, rather than the sorted differences themselves.
np.abs(arr - p).argsort()[:3]
# array([12, 11, 13])
Lastly you want to index your input array arr and take the mean of this:
arr[[12, 11, 13]]
# array([12, 11, 10]) # mean: 11.0
