Transform a Pandas series to be monotonic - python

I'm looking for a way to remove the points that ruin the monotonicity of a series.
For example
s = pd.Series([0,1,2,3,10,4,5,6])
or
s = pd.Series([0,1,2,3,-1,4,5,6])
we would extract
s = pd.Series([0,1,2,3,4,5,6])
NB: we assume that the first element is always correct.

Monotonic can mean either increasing or decreasing; the functions below exclude all values that break monotonicity.
However, there seems to be some confusion in your question: given the series s = pd.Series([0,1,2,3,10,4,5,6]), 10 doesn't break the monotonicity condition; 4, 5 and 6 do. So the correct answer there is 0, 1, 2, 3, 10.
import pandas as pd
s = pd.Series([0,1,2,3,10,4,5,6])
def to_monotonic_inc(s):
    return s[s >= s.cummax()]
def to_monotonic_dec(s):
    return s[s <= s.cummin()]
print(to_monotonic_inc(s))
print(to_monotonic_dec(s))
Output is 0, 1, 2, 3, 10 for increasing and 0 for decreasing.
Perhaps you want to find the longest monotonic subsequence? That's a completely different search problem.
----- EDIT -----
Below is a simple way of finding a long monotonic ascending subsequence given your constraints using plain python:
def get_longest_monotonic_asc(s):
    # sort (value, index) pairs by value, dropping anything below the first element
    enumerated = sorted([(v, i) for i, v in enumerate(s) if v >= s[0]])[1:]
    output = [s[0]]
    last_index = 0
    for v, i in enumerated:
        # keep a value only if it appears after the last value we kept
        if i > last_index:
            last_index = i
            output.append(v)
    return output
s1 = [0,1,2,3,10,4,5,6]
s2 = [0,1,2,3,-1,4,5,6]
print(get_longest_monotonic_asc(s1))
print(get_longest_monotonic_asc(s2))
'''
Output:
[0, 1, 2, 3, 4, 5, 6]
[0, 1, 2, 3, 4, 5, 6]
'''
Note that this solution involves sorting, which is O(n log n), plus a second pass, which is O(n).
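One caveat worth checking: the sort-then-scan approach above is greedy, and it does not always return the longest subsequence. For [0, 3, 4, 5, 1] it picks the 1 first and returns [0, 1], although [0, 3, 4, 5] is longer. A minimal sketch of an exact alternative, assuming the first element must be kept and that ties are allowed, is the classic bisect-based longest non-decreasing subsequence. The function name is made up for illustration.

```python
from bisect import bisect_right

def longest_non_decreasing_from_first(values):
    """Longest non-decreasing subsequence that is forced to start at values[0]."""
    if not values:
        return []
    first = values[0]
    # Only later elements that are >= first can follow it.
    candidates = [v for v in values[1:] if v >= first]
    tails = []        # tails[k]: smallest tail value of a subsequence of length k + 1
    tails_pos = []    # position in candidates of that tail
    prev = [-1] * len(candidates)  # back-pointers for reconstruction
    for pos, v in enumerate(candidates):
        k = bisect_right(tails, v)  # bisect_right allows equal values (ties)
        if k == len(tails):
            tails.append(v)
            tails_pos.append(pos)
        else:
            tails[k] = v
            tails_pos[k] = pos
        prev[pos] = tails_pos[k - 1] if k > 0 else -1
    # Walk the back-pointers from the tail of the longest subsequence.
    out = []
    pos = tails_pos[-1] if tails_pos else -1
    while pos != -1:
        out.append(candidates[pos])
        pos = prev[pos]
    return [first] + out[::-1]

print(longest_non_decreasing_from_first([0, 1, 2, 3, 10, 4, 5, 6]))
print(longest_non_decreasing_from_first([0, 3, 4, 5, 1]))
```

This is still O(n log n), but the cost comes from the binary searches rather than from a sort.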

Here is a way to produce a monotonically increasing series:
import pandas as pd
# create data
s = pd.Series([1, 2, 3, 4, 5, 4, 3, 2, 3, 4, 5, 6, 7, 8])
# find max so far (i.e., running_max)
df = pd.concat([s.rename('orig'),
s.cummax().rename('running_max'),
], axis=1)
# are we at or above max so far?
df['keep?'] = (df['orig'] >= df['running_max'])
# filter out one or many points below max so far
df = df.loc[ df['keep?'], 'orig']
# verify that remaining points are monotonically increasing
assert pd.Index(df).is_monotonic_increasing
# print(df.drop_duplicates()) # eliminates ties
print(df) # keeps ties
0 1
1 2
2 3
3 4
4 5
10 5 # <-- same as previous value -- a tie
11 6
12 7
13 8
Name: orig, dtype: int64
You can see graphically with s.plot(); and df.plot();

Related

Error when trying to implement MERGE algorithm merging to sorted lists of integers in python?

I'm new to both algorithms AND programming.
As an intro to the MERGE algorithms, the chapter first introduces the MERGE algorithm by itself. It merges an array consisting of 2 sorted sub-arrays into one sorted array.
I did the pseudocode on paper according to the book:
Source: "Introduction to Algorithms, Third Edition", Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, Clifford Stein
Since I am implementing it in python3, I had to change some lines, given that indexing in python starts at 0, unlike in the pseudocode example of the book.
Keep in mind that the input is one array that contains 2 SORTED sub-arrays, which are then merged, sorted, and returned. I kept the prints in my code, so you can see my checks...
#!/anaconda3/bin/python3
import math
import argparse
# For now only MERGE slides ch 2 -- Im defining p q and r WITHIN the function
# But for MERGE_SORT p,q and r are defined as parameters!
def merge(ar):
    '''
    Takes as input an array. This array consists of 2 subarrays that ARE ALLREADY sorted
    (small to large). When splitting the array into half, the left
    part will be longer by one if not divisible by 2. These subarrays will be
    called left and right. Each of the subarrays must already be sorted. Merge() then
    merges these sorted arrays into one big sorted array. The sorted array is returned.
    '''
    print(ar)
    p=0 # for now defining always as 0
    if len(ar)%2==0:
        q=len(ar)//2-1 # because indexing starts from ZERO in py
    else:
        q=len(ar)//2 # left sub array will be 1 item longer
    r=len(ar)-1 # again -1 because indexing starts from ZERO in py
    print('p', p, 'q', q, 'r', r)
    # lets see if n1 and n2 check out
    n_1 = q-p+1 # lenght of left subarray
    n_2 = r-q # lenght of right subarray
    print('n1 is: ', n_1)
    print('n2 is: ', n_2)
    left = [0]*(n_1+1) # initiating zero list of lenght n1
    right=[0]*(n_2+1)
    print(left, len(left))
    print(right, len(right))
    # filling left and right
    for i in range(n_1):# because last value will always be infinity
        left[i] = ar[p+i]
    for j in range(n_2):
        right[j] = ar[q+j+1]
        #print(ar[q+j+1])
        #print(right[j])
    # inserting infinity at last index for each subarray
    left[n_1]=math.inf
    right[n_2]=math.inf
    print(left)
    print(right)
    # merging: initiating indexes at 0
    i=0
    j=0
    print('p', p)
    print('r', r)
    for k in range(p,r):
        if left[i] <= right[j]:
            ar[k]=left[i]
            # increase i
            i += 1
        else:
            ar[k]=right[j]
            #increase j
            j += 1
    print(ar)
#############################################################################################################################
# Adding parser
#############################################################################################################################
parser = argparse.ArgumentParser(description='MERGE algorithm from ch 2')
parser.add_argument('-a', '--array', type=str, metavar='', required=True, help='One List of integers composed of 2 sorted halves. Sorting must start from smallest to largest for each of the halves.')
args = parser.parse_args()
args_list_st=args.array.split(',') # list of strings
args_list_int=[]
for i in args_list_st:
    args_list_int.append(int(i))
if __name__ == "__main__":
    merge(args_list_int)
The problem:
When I try to sort the array as shown in the book, the merged array that is returned contains two 6s and the 7 is lost.
$ ./2.merge.py -a=2,4,5,7,1,2,3,6
[2, 4, 5, 7, 1, 2, 3, 6]
p 0 q 3 r 7
n1 is: 4
n2 is: 4
[0, 0, 0, 0, 0] 5
[0, 0, 0, 0, 0] 5
[2, 4, 5, 7, inf]
[1, 2, 3, 6, inf]
p 0
r 7
[1, 2, 2, 3, 4, 5, 6, 6]
However, this does not happen when the 6 is replaced by any higher number.
$ ./2.merge.py -a=2,4,5,7,1,2,3,8
[2, 4, 5, 7, 1, 2, 3, 8]
p 0 q 3 r 7
n1 is: 4
n2 is: 4
[0, 0, 0, 0, 0] 5
[0, 0, 0, 0, 0] 5
[2, 4, 5, 7, inf]
[1, 2, 3, 8, inf]
p 0
r 7
[1, 2, 2, 3, 4, 5, 7, 8]
I showed it to a colleague in my class without success. And I've walked through it manually with numbers on paper snippets, but without success. I hope someone can find my silly mistake because I'm completely stuck.
Thanks
As r is the index of the last value in ar, you need to add one to it to make a range that also includes that final index:
for k in range(p, r + 1):
# ^^^^^
Note that your code could be greatly reduced if you would use list slicing.
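To illustrate the slicing remark, here is a minimal sketch of the same sentinel-based merge, assuming p, q and r are passed in as inclusive 0-based indices (as the question plans to do for MERGE_SORT). It is one possible shortening, not the book's code.

```python
import math

def merge(ar, p, q, r):
    """Merge the sorted runs ar[p..q] and ar[q+1..r] (inclusive) in place."""
    # Slices copy each run; the inf sentinel means neither run is ever exhausted.
    left = ar[p:q + 1] + [math.inf]
    right = ar[q + 1:r + 1] + [math.inf]
    i = j = 0
    for k in range(p, r + 1):  # r + 1 so that index r is written as well
        if left[i] <= right[j]:
            ar[k] = left[i]
            i += 1
        else:
            ar[k] = right[j]
            j += 1
    return ar

print(merge([2, 4, 5, 7, 1, 2, 3, 6], 0, 3, 7))
```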
Brother, you made a very small mistake in this line:
for k in range(p,r):
Here your loop runs from p to r-1, so the last index, r, is never reached.
So you have to use
for k in range(p,r+1):
In the second test case, a=[2,4,5,7,1,2,3,8], you get the correct output even with the wrong code because you are overwriting the values in the array ar in place. Your current code sorts the array up to index r-1, and the number at index r stays whatever it was before your merge function ran, i.e. 8, which happens to be the right value.
Try using this test case: [2, 4, 5, 8, 1, 2, 3, 7]
Your output will be [1, 2, 2, 3, 4, 5, 7, 7]
Hope this helped

constructing arithmetic progressions from loop

I am trying to work out a program that would calculate the diagonal coefficients of pascal's triangle.
For those who are not familiar with it, the general terms of sequences are written below.
1st row = 1 1 1 1 1....
2nd row = N0(natural number) // 1 = 1 2 3 4 5 ....
3rd row = N0(N0+1) // 2 = 1 3 6 10 15 ...
4th row = N0(N0+1)(N0+2) // 6 = 1 4 10 20 35 ...
The subsequent sequences for each row follow a specific pattern, and my goal is to output those sequences in a for loop with the number of units as input.
def figurate_numbers(units):
row_1 = str(1) * units
row_1_list = list(row_1)
for i in range(1, units):
sequences are
row_2 = n // i
row_3 = (n(n+1)) // (i(i+1))
row_4 = (n(n+1)(n+2)) // (i(i+1)(i+2))
>>> def figurate_numbers(4): # coefficients for 4 rows and 4 columns
[1, 1, 1, 1]
[1, 2, 3, 4]
[1, 3, 6, 10]
[1, 4, 10, 20] # desired output
How can I iterate for both n and i in one loop such that each sequence of corresponding row would output coefficients?
You can use map or a list comprehension to hide a loop.
def f(i):
    return lambda x: ...
rows = [ [1] * k ]
for i in range(k):
    rows.append(list(map(f(i), rows[i])))
where f is a function that describes the dependency on the previous element of the row.
Another possibility is to adapt a recursive Fibonacci implementation to rows. The NumPy library allows array arithmetic, so you do not even need map. Python also has predefined library functions for the number of combinations etc., which could perhaps be used.
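As a concrete sketch of the "each row depends on the previous one" idea: in the desired output, every row is the running sum of the row above it, so itertools.accumulate can play the role of the hidden loop. This assumes the square units x units layout from the question and is only one way to fill in the dependency.

```python
from itertools import accumulate

def figurate_numbers(units):
    """Return `units` rows of `units` figurate numbers each."""
    rows = [[1] * units]  # 1st row: 1 1 1 1 ...
    for _ in range(units - 1):
        # Each new row is the cumulative sum of the previous row.
        rows.append(list(accumulate(rows[-1])))
    return rows

for row in figurate_numbers(4):
    print(row)
```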
To compute efficiently, without nested loops, use the rational-number-based solution from
https://medium.com/@duhroach/fast-fun-with-pascals-triangle-6030e15dced0 .
from fractions import Fraction
def pascalIndexInRowFast(row,index):
    lastVal=1
    halfRow = (row>>1)
    #early out, is index > half? if so, use the symmetry C(row, index) == C(row, row - index)
    if index > halfRow:
        index = row - index
    for i in range(0, index):
        #integer division is exact here and avoids float rounding in Python 3
        lastVal = lastVal * (row - i) // (i + 1)
    return lastVal
def pascDiagFast(row,length):
    #compute the fractions of this diag
    fracs=[1]*(length)
    for i in range(length-1):
        num = i+1
        denom = row+1+i
        fracs[i] = Fraction(num,denom)
    #now let's compute the values
    vals=[0]*length
    #first figure out the leftmost tail of this diag
    lowRow = row + (length-1)
    lowRowCol = row
    tail = pascalIndexInRowFast(lowRow,lowRowCol)
    vals[-1] = tail
    #walk backwards!
    for i in reversed(range(length-1)):
        vals[i] = int(fracs[i]*vals[i+1])
    return vals
Don't reinvent the triangle:
>>> from scipy.linalg import pascal
>>> pascal(4)
array([[ 1, 1, 1, 1],
[ 1, 2, 3, 4],
[ 1, 3, 6, 10],
[ 1, 4, 10, 20]], dtype=uint64)
>>> pascal(4).tolist()
[[1, 1, 1, 1], [1, 2, 3, 4], [1, 3, 6, 10], [1, 4, 10, 20]]

Generate random array of integers with a number of appearance of each integer

I need to create a random array of 6 integers between 1 and 5 in Python, but I also have other data, say a=[2 2 3 1 2], which can be considered as the capacity. It means 1 can occur no more than 2 times, or 3 can occur no more than 3 times.
I need to set up a counter for each integer from 1 to 5 to make sure each integer is not generated by the random function more than a[i].
Here is the initial array I created in python but I need to find out how I can make sure about the condition I described above. For example, I don't need a solution like [2 1 5 4 5 4] where 4 is shown twice or [2 2 2 2 1 2].
solution = np.array([np.random.randint(1,6) for i in range(6)])
Even if I can add probability, that should work. Any help is appreciated on this.
You can create a pool of data in which each value appears its maximum allowed number of times, and then pick from there:
import numpy as np
a = [2, 2, 3, 1, 2]
data = [i + 1 for i, e in enumerate(a) for _ in range(e)]
print(data)
result = np.random.choice(data, 6, replace=False)
print(result)
Output
[1, 1, 2, 2, 3, 3, 3, 4, 5, 5]
[1 3 2 2 3 1]
Note that data is a list that contains each value as many times as its specified count; by picking randomly from data without replacement, we ensure that no value appears more often than its specified count.
UPDATE
If you need that each number appears at least one time, you can start with a list of each of the numbers, sample from the rest and then shuffle:
import numpy as np
result = [1, 2, 3, 4, 5]
a = [1, 1, 2, 0, 1]
data = [i + 1 for i, e in enumerate(a) for _ in range(e)]
print(data)
result = result + np.random.choice(data, 1, replace=False).tolist()
np.random.shuffle(result)
print(result)
Output
[1, 2, 3, 3, 5]
[3, 4, 2, 5, 1, 2]
Notice that I subtracted 1 from each of the original values of a; also, the original sample size of 6 was changed to 1, because you already have 5 numbers in the variable result.
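Both variants can be folded into one helper. The following is a minimal sketch using only the standard library; the function name and the require_all flag are made up for illustration, and it assumes the values are 1..len(capacity) as in the question.

```python
import random

def sample_with_capacity(capacity, size, require_all=False):
    """Draw `size` values from 1..len(capacity); value v appears at most capacity[v-1] times."""
    values = range(1, len(capacity) + 1)
    # Optionally reserve one copy of every value up front.
    result = list(values) if require_all else []
    remaining = [c - 1 if require_all else c for c in capacity]
    if any(c < 0 for c in remaining):
        raise ValueError("every value needs capacity >= 1 when require_all is set")
    # Pool holding each value as many times as it may still appear.
    pool = [v for v, c in zip(values, remaining) for _ in range(c)]
    extra = size - len(result)
    if extra < 0 or extra > len(pool):
        raise ValueError("size is not reachable with this capacity")
    result += random.sample(pool, extra)  # sampling without replacement
    random.shuffle(result)
    return result

print(sample_with_capacity([2, 2, 3, 1, 2], 6))
print(sample_with_capacity([2, 2, 3, 1, 2], 6, require_all=True))
```

Sampling without replacement from the pool is what enforces the capacity; the final shuffle only matters when the reserved values are prepended.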
You could test your count against a dictionary:
import random
a = [2, 2, 3, 1, 2]
d = {idx: item for idx,item in enumerate(a, start = 1)}
l = []
while len(set(l) ^ set([*range(1, 6)])) > 0:
    l = []
    while len(l) != 6:
        x = random.randint(1,5)
        while l.count(x) == d[x]:
            x = random.randint(1,5)
        l.append(x)
print(l)

Largest Subset whose sum is less than equal to a given sum

A list is defined as follows: [1, 2, 3]
and the sub-lists of this are:
[1], [2], [3],
[1,2]
[1,3]
[2,3]
[1,2,3]
Given K, for example 3, the task is to find the largest length of a sublist whose sum of elements is less than or equal to K.
I am aware of itertools in python, but it will result in a segmentation fault for larger lists. Is there any other efficient algorithm to achieve this? Any help would be appreciated.
My code is as follows:
from itertools import combinations
def maxLength(a, k):
    #print a,k
    l= []
    i = len(a)
    while(i>=0):
        lst= list(combinations(sorted(a),i))
        for j in lst:
            #rint list(j)
            lst = list(j)
            #print sum(lst)
            sum1=0
            sum1 = sum(lst)
            if sum1<=k:
                return len(lst)
        i=i-1
You can use the dynamic programming solution that @Apy linked to. Here's a Python example:
def largest_subset(items, k):
    res = 0
    # We can form subset with value 0 from empty set,
    # items[0], items[0...1], items[0...2]
    arr = [[True] * (len(items) + 1)]
    for i in range(1, k + 1):
        # Subset with value i can't be formed from empty set
        cur = [False] * (len(items) + 1)
        for j, val in enumerate(items, 1):
            # cur[j] is True if we can form a set with value of i from
            # items[0...j-1]
            # There are two possibilities
            # - Set can be formed already without even considering item[j-1]
            # - There is a subset with value i - val formed from items[0...j-2]
            cur[j] = cur[j-1] or ((i >= val) and arr[i-val][j-1])
        if cur[-1]:
            # If subset with value of i can be formed store
            # it as current result
            res = i
        arr.append(cur)
    return res
ITEMS = [5, 4, 1]
for i in range(sum(ITEMS) + 1):
    print('{} -> {}'.format(i, largest_subset(ITEMS, i)))
Output:
0 -> 0
1 -> 1
2 -> 1
3 -> 1
4 -> 4
5 -> 5
6 -> 6
7 -> 6
8 -> 6
9 -> 9
10 -> 10
In the above, arr[i][j] is True if a set with value i can be chosen from items[0...j-1]. Naturally arr[0] contains only True values, since the empty set can always be chosen. Similarly, for all successive rows the first cell is False, since the empty set cannot have a non-zero value.
For the rest of the cells there are two options:
1. If there already is a subset with value i even without considering items[j-1], the value is True.
2. If there is a subset with value i - items[j-1], then we can add the item to it and have a subset with value i.
As far as I can see (since you treat a sub array as any items of the initial array), you can use a greedy algorithm with O(N*log(N)) complexity (you have to sort the array):
1. Assign the entire array to the sub array
2. If sum(sub array) <= k then stop and return the sub array
3. Remove the maximum item from the sub array
4. goto 2
Example
[1, 2, 3, 5, 10, 25]
k = 12
Solution
sub array = [1, 2, 3, 5, 10, 25], sum = 46 > 12, remove 25
sub array = [1, 2, 3, 5, 10], sum = 21 > 12, remove 10
sub array = [1, 2, 3, 5], sum = 11 <= 12, stop and return
As an alternative, you can start with an empty sub array and add items from minimum to maximum while the sum is less than or equal to k:
sub array = [], sum = 0 <= 12, add 1
sub array = [1], sum = 1 <= 12, add 2
sub array = [1, 2], sum = 3 <= 12, add 3
sub array = [1, 2, 3], sum = 6 <= 12, add 5
sub array = [1, 2, 3, 5], sum = 11 <= 12, add 10
sub array = [1, 2, 3, 5, 10], sum = 21 > 12, stop,
return prior one: [1, 2, 3, 5]
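The second variant (adding items from smallest to largest) could be sketched as follows, assuming all items are non-negative so that a smaller item is never a worse choice. The function name is made up for illustration.

```python
def largest_sublist_within(items, k):
    """Greedy: take the smallest items first while the running sum stays <= k."""
    chosen = []
    total = 0
    for value in sorted(items):  # the sort dominates: O(N*log(N))
        if total + value > k:
            break  # every remaining item is at least as large, so stop here
        chosen.append(value)
        total += value
    return chosen

print(largest_sublist_within([1, 2, 3, 5, 10, 25], 12))
print(len(largest_sublist_within([1, 2, 3], 3)))
```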
Look, for generating the power-set it takes O(2^n) time. It's pretty bad. You can instead use the dynamic programming approach.
Check in here for the algorithm.
http://www.geeksforgeeks.org/dynamic-programming-subset-sum-problem/
And yes, https://www.youtube.com/watch?v=s6FhG--P7z0 (Tushar explains everything well) :D
Assume everything is positive. (Handling negatives is a simple extension of this and is left to the reader as an exercise.) There exists an O(n) algorithm for the described problem. Using the O(n) median select, we partition the array around the median. We find the sum of the left side. If that is greater than k, then we cannot take all of its elements, so we must recurse on the left half to try to take a smaller set. Otherwise, we subtract the sum of the left half from k and recurse on the right half to see how many more elements we can take.
Partitioning the array with median select and recursing on only one of the halves yields a runtime of n + n/2 + n/4 + n/8 + ..., a geometric series that sums to O(n).
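A rough sketch of this partitioning idea is shown below. Note the swap: it uses a random pivot instead of a true linear-time median select, so the bound is expected O(n) rather than worst-case O(n). It also assumes positive integers, and the function name is made up for illustration.

```python
import random

def max_count_within(items, k):
    """Largest number of items whose sum is <= k (positive integers assumed)."""
    items = list(items)
    count = 0
    while items:
        pivot = random.choice(items)
        smaller = [x for x in items if x < pivot]
        equal = [x for x in items if x == pivot]
        larger = [x for x in items if x > pivot]
        left_sum = sum(smaller)
        if left_sum > k:
            # Not even all the small items fit: the answer lies inside them.
            items = smaller
            continue
        # All small items fit; take them and spend their sum.
        k -= left_sum
        count += len(smaller)
        # Then take as many copies of the pivot value as still fit.
        fit = min(len(equal), k // pivot)
        count += fit
        k -= fit * pivot
        if fit < len(equal):
            return count  # budget is below the pivot, so no larger item fits
        items = larger
    return count

print(max_count_within([1, 2, 3, 5, 10, 25], 12))
```

Each round discards at least the pivot, so the loop always terminates; with a good pivot it discards about half of the remaining items, which gives the geometric sum.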

How to replace only the first n elements in a numpy array that are larger than a certain value?

I have an array myA like this:
array([ 7, 4, 5, 8, 3, 10])
If I want to replace all values that are larger than a value val by 0, I can simply do:
myA[myA > val] = 0
which gives me the desired output (for val = 5):
array([0, 4, 5, 0, 3, 0])
However, my goal is to replace not all but only the first n elements of this array that are larger than a value val.
So, if n = 2 my desired outcome would look like this (10 is the third matching element and should therefore not be replaced):
array([ 0, 4, 5, 0, 3, 10])
A straightforward implementation would be:
import numpy as np
myA = np.array([7, 4, 5, 8, 3, 10])
n = 2
val = 5
# track the number of replacements
repl = 0
for ind, vali in enumerate(myA):
    if vali > val:
        myA[ind] = 0
        repl += 1
        if repl == n:
            break
That works, but maybe someone can come up with a smart way of masking!?
The following should work:
myA[(myA > val).nonzero()[0][:2]] = 0
since nonzero will return the indexes where the boolean array myA > val is non-zero, i.e. True.
For example:
In [1]: myA = array([ 7, 4, 5, 8, 3, 10])
In [2]: myA[(myA > 5).nonzero()[0][:2]] = 0
In [3]: myA
Out[3]: array([ 0, 4, 5, 0, 3, 10])
Final solution is very simple:
import numpy as np
myA = np.array([7, 4, 5, 8, 3, 10])
n = 2
val = 5
myA[np.where(myA > val)[0][:n]] = 0
print(myA)
Output:
[ 0 4 5 0 3 10]
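Wrapped as a reusable function, the same idea might look like the sketch below. np.flatnonzero is used as a shorthand for nonzero()[0] on a 1-D array; the function name and the fill parameter are made up for illustration, and the input is copied rather than modified in place, which is a design choice rather than a requirement.

```python
import numpy as np

def replace_first_n(arr, threshold, n, fill=0):
    """Return a copy of 1-D `arr` with the first n elements > threshold set to `fill`."""
    out = arr.copy()
    # Indices of all matches, in order; slicing keeps at most the first n.
    hits = np.flatnonzero(out > threshold)[:n]
    out[hits] = fill
    return out

myA = np.array([7, 4, 5, 8, 3, 10])
print(replace_first_n(myA, 5, 2))
```

If fewer than n elements exceed the threshold, the slice simply returns all of them, so no special case is needed.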
Here's another possibility (untested), probably no better than nonzero:
def truncate_mask(m, stop):
    m = m.astype(bool, copy=False) # if we allow non-bool m, the next line becomes nonsense
    return m & (np.cumsum(m) <= stop)
myA[truncate_mask(myA > val, n)] = 0
By avoiding building and using an explicit index you might end up with slightly better performance...but you'd have to test it to find out.
Edit 1: while we're on the subject of possibilities, you could also try:
def truncate_mask(m, stop):
    m = m.astype(bool, copy=True) # note we need to copy m here to safely modify it
    m[np.searchsorted(np.cumsum(m), stop):] = 0
    return m
Edit 2 (the next day): I've just tested this and it seems that cumsum is actually worse than nonzero, at least with the kinds of values I was using (so neither of the above approaches is worth using). Out of curiosity, I also tried it with numba:
import numba
@numba.jit
def set_first_n_gt_thresh(a, val, thresh, n):
    ii = 0
    while n>0 and ii < len(a):
        if a[ii] > thresh:
            a[ii] = val
            n -= 1
        ii += 1
This only iterates over the array once, or rather it only iterates over the necessary part of the array once, never even touching the latter part. This gives you vastly superior performance for small n, but even for the worst case of n>=len(a) this approach is faster.
You could use the same solution as here by converting your np.array to a pd.Series:
s = pd.Series([ 7, 4, 5, 8, 3, 10])
n = 2
m = 5
s[s[s>m].iloc[:n].index] = 0
In [416]: s
Out[416]:
0 0
1 4
2 5
3 0
4 3
5 10
dtype: int64
Step by step explanation:
In [426]: s > m
Out[426]:
0 True
1 False
2 False
3 True
4 False
5 True
dtype: bool
In [428]: s[s>m].iloc[:n]
Out[428]:
0 7
3 8
dtype: int64
In [429]: s[s>m].iloc[:n].index
Out[429]: Int64Index([0, 3], dtype='int64')
In [430]: s[s[s>m].iloc[:n].index]
Out[430]:
0 7
3 8
dtype: int64
The output of In [430] looks the same as that of In [428], but in 428 it's a copy and in 430 it's the original series.
If you need an np.array, you can use the values attribute:
In [418]: s.values
Out[418]: array([ 0, 4, 5, 0, 3, 10], dtype=int64)
