how to get first indices in dataframe match - python

given a list of indices that match a condition, where there will be many spans in the list that are sequentially adjacent, how can I easily select only the first of each span.
such that
magicallySelect([1,2,3,10,11,12,100,101,102]) == [1,10,100]
but -- importantly, this should also work for other indicies, like dates (which is the case in my data). The actual code I'm hoping to get working is:
original.reset_index(inplace=True)
predict = {}
for app in apps:
reg = linear_model.LinearRegression()
reg.fit(original.index.values.reshape(-1, 1), original[app].values)
slope = reg.coef_.tolist()[0]
delta = original[app].apply(lambda x: abs(slope - x))
forecast['test_delta'] = forecast[app].apply(lambda x: abs(slope - x))
tdm = forecast['test_delta'].mean()
tds = forecast['test_delta'].std(ddof=0)
# identify moments that are σ>2 abnormal
forecast['z'] = forecast['test_delta'].apply(lambda x: abs(x - tdm / tds))
sig = forecast.index[forecast[forecast['z'] > 2]].tolist()
predict[app] = FIRST_INDEX_IN_EACH_SPAN_OF(sig)

l = [1,2,3,10,11,12,100,101,102]
indices = [l[i] for i in range(len(l)) if l[i-1]!=l[i]-1]
Reordering this slightly to work for datetimes, this would give you all items in the list where the gap from the previous item is greater than 1 day (plus the first item by default):
indices = [l[0]] + [l[i] for i in range(len(l)) if (l[i]-l[i-1]).days>1]
For a difference in time measured in minutes, you can convert to seconds and substitute this in. E.g. for 15 minutes (900 seconds) you can do:
indices = [l[0]] + [l[i] for i in range(len(l)) if (l[i]-l[i-1]).seconds>900]

Related

Unique ordered ratio of integers

I have two ordered lists of consecutive integers m=0, 1, ... M and n=0, 1, 2, ... N. Each value of m has a probability pm, and each value of n has a probability pn. I am trying to find the ordered list of unique values r=n/m and their probabilities pr. I am aware that r is infinite if n=0 and can even be undefined if m=n=0.
In practice, I would like to run for M and N each be of the order of 2E4, meaning up to 4E8 values of r - which would mean 3 GB of floats (assuming 8 Bytes/float).
For this calculation, I have written the python code below.
The idea is to iterate over m and n, and for each new m/n, insert it in the right place with its probability if it isn't there yet, otherwise add its probability to the existing number. My assumption is that it is easier to sort things on the way instead of waiting until the end.
The cases related to 0 are added at the end of the loop.
I am using the Fraction class since we are dealing with fractions.
The code also tracks the multiplicity of each unique value of m/n.
I have tested up to M=N=100, and things are quite slow. Are there better approaches to the question, or more efficient ways to tackle the code?
Timing:
M=N=30: 1 s
M=N=50: 6 s
M=N=80: 30 s
M=N=100: 82 s
import numpy as np
from fractions import Fraction
import time # For timiing
start_time = time.time() # Timing
M, N = 6, 4
mList, nList = np.arange(1, M+1), np.arange(1, N+1) # From 1 to M inclusive, deal with 0 later
mProbList, nProbList = [1/(M+1)]*(M), [1/(N+1)]*(N) # Probabilities, here assumed equal (not general case)
# Deal with mn=0 later
pmZero, pnZero = 1/(M+1), 1/(N+1) # P(m=0) and P(n=0)
pNaN = pmZero * pnZero # P(0/0) = P(m=0)P(n=0)
pZero = pmZero * (1 - pnZero) # P(0) = P(m=0)P(n!=0)
pInf = pnZero * (1 - pmZero) # P(inf) = P(m!=0)P(n=0)
# Main list of r=m/n, P(r) and mult(r)
# Start with first line, m=1
rList = [Fraction(mList[0], n) for n in nList[::-1]] # Smallest first
rProbList = [mProbList[0] * nP for nP in nProbList[::-1]] # Start with first line
rMultList = [1] * len(rList) # Multiplicity of each element
# Main loop
for m, mP in zip(mList[1:], mProbList[1:]):
for n, nP in zip(nList[::-1], nProbList[::-1]): # Pick an n value
r, rP, rMult = Fraction(m, n), mP*nP, 1
for i in range(len(rList)-1): # See where it fits in existing list
if r < rList[i]:
rList.insert(i, r)
rProbList.insert(i, rP)
rMultList.insert(i, 1)
break
elif r == rList[i]:
rProbList[i] += rP
rMultList[i] += 1
break
elif r < rList[i+1]:
rList.insert(i+1, r)
rProbList.insert(i+1, rP)
rMultList.insert(i+1, 1)
break
elif r == rList[i+1]:
rProbList[i+1] += rP
rMultList[i+1] += 1
break
if r > rList[-1]:
rList.append(r)
rProbList.append(rP)
rMultList.append(1)
break
# Deal with 0
rList.insert(0, Fraction(0, 1))
rProbList.insert(0, pZero)
rMultList.insert(0, N)
# Deal with infty
rList.append(np.Inf)
rProbList.append(pInf)
rMultList.append(M)
# Deal with undefined case
rList.append(np.NAN)
rProbList.append(pNaN)
rMultList.append(1)
print(".... done in %s seconds." % round(time.time() - start_time, 2))
print("************** Final list\nr", 'Prob', 'Mult')
for r, rP, rM in zip(rList, rProbList, rMultList): print(r, rP, rM)
print("************** Checks")
print("mList", mList, 'nList', nList)
print("Sum of proba = ", np.sum(rProbList))
print("Sum of multi = ", np.sum(rMultList), "\t(M+1)*(N+1) = ", (M+1)*(N+1))
Based on the suggestion of #Prune, and on this thread about merging lists of tuples, I have modified the code as below. It's a lot easier to read, and runs about an order of magnitude faster for N=M=80 (I have omitted dealing with 0 - would be done same way as in original post). I assume there may be ways to tweak the merge and conversion back to lists further yet.
# Do calculations
data = [(Fraction(m, n), mProb(m) * nProb(n)) for n in range(1, N+1) for m in range(1, M+1)]
data.sort()
# Merge duplicates using a dictionary
d = {}
for r, p in data:
if not (r in d): d[r] = [0, 0]
d[r][0] += p
d[r][1] += 1
# Convert back to lists
rList, rProbList, rMultList = [], [], []
for k in d:
rList.append(k)
rProbList.append(d[k][0])
rMultList.append(d[k][1])
I expect that "things are quite slow" because you've chosen a known inefficient sort. A single list insertion is O(K) (later list elements have to be bumped over, and there is added storage allocation on a regular basis). Thus a full-list insertion sort is O(K^2). For your notation, that is O((M*N)^2).
If you want any sort of reasonable performance, research and use the best-know methods. The most straightforward way to do this is to make your non-exception results as a simple list comprehension, and use the built-in sort for your penultimate list. Simply append your n=0 cases, and you're done in O(K log K) time.
I the expression below, I've assumed functions for m and n probabilities.
This is a notational convenience; you know how to directly compute them, and can substitute those expressions if you wish.
data = [ (mProb(m) * nProb(n), Fraction(m, n))
for n in range(1, N+1)
for m in range(0, M+1) ]
data.sort()
data.extend([ # generate your "zero" cases here ])

Python: Selecting longest consecutive series of dates in list

I have a series of lists (np.arrays, actually), of which the elements are dates.
id
0a0fe3ed-d788-4427-8820-8b7b696a6033 [2019-01-30, 2019-01-31, 2019-02-01, 2019-02-0...
0a48d1e8-ead2-404a-a5a2-6b05371200b1 [2019-01-30, 2019-01-31, 2019-02-01, 2019-02-0...
0a9edba1-14e3-466a-8d0c-f8a8170cefc8 [2019-01-29, 2019-01-30, 2019-01-31, 2019-02-0...
Name: startDate, dtype: object
For each element in the series (i.e. for each list of dates), I want to retain the longest sublist in which all dates are consecutive. I'm struggling to approach this in a pythonic (simple/efficient) way. The only approach that I can think of is to use multiple loops: loop over the series values (the lists), and loop over each element in the list. I would then store the first date and the number of consecutive days, and use temporary values to overwrite the results if a longer sequence of consecutive days is encountered. This seems highly inefficient though. Is there a better way of doing this?
Since you mention you are using numpy arrays of dates it makes sense to stick to numpy types instead of converting to the built-in type. I'm assuming here that your arrays have dtype 'datetime64[D]'. In that case you could do something like
import numpy as np
date_list = np.array(['2005-02-01', '2005-02-02', '2005-02-03',
'2005-02-05', '2005-02-06', '2005-02-07', '2005-02-08', '2005-02-09',
'2005-02-11', '2005-02-12',
'2005-02-14', '2005-02-15', '2005-02-16', '2005-02-17',
'2005-02-19', '2005-02-20',
'2005-02-22', '2005-02-23', '2005-02-24',
'2005-02-25', '2005-02-26', '2005-02-27', '2005-02-28'],
dtype='datetime64[D]')
i0max, i1max = 0, 0
i0 = 0
for i1, date in enumerate(date_list):
if date - date_list[i0] != np.timedelta64(i1-i0, 'D'):
if i1 - i0 > i1max - i0max:
i0max, i1max = i0, i1
i0 = i1
print(date_list[i0max:i1max])
# output: ['2005-02-05' '2005-02-06' '2005-02-07' '2005-02-08' '2005-02-09']
Here, i0 and i1 indicate the start and stop indeces of the current sub-array of consecutive dates, and i0max and i1max the start and stop indices of the longest sub-array found so far. The solution uses the fact that the difference between the i-th and zeroth entry in a list of consecutive dates is exactly i days.
You can convert list to ordinals which are increasing for all consecutive dates. Which means next_date = previous_date + 1 read more.
Then find the longest consecutive sub-array.
This process will take O(n)->single loop time which is the most efficient way to get this.
CODE
from datetime import datetime
def get_consecutive(date_list):
# convert to ordinals
v = [datetime.strptime(d, "%Y-%m-%d").toordinal() for d in date_list]
consecutive = []
run = []
dates = []
# get consecutive ordinal sequence
for i in range(1, len(v) + 1):
run.append(v[i-1])
dates.append(date_list[i-1])
if i == len(v) or v[i-1] + 1 != v[i]:
if len(consecutive) < len(run):
consecutive = dates
dates = []
run = []
return consecutive
OUTPUT:
date_list = ['2019-01-29', '2019-01-30', '2019-01-31','2019-02-05']
get_consecutive(date_list )
# ordinales will be -> v = [737088, 737089, 737090, 737095]
OUTPUT:
['2019-01-29', '2019-01-30', '2019-01-31']
Now use get_consecutive in df.column.apply(get_consecutive)it will give you all increasing date list. Or you can all function for each list if you are using some other data structure.
I'm going to reduce this problem to finding consecutive days in a single list. There are a few tricks that make it more Pythonic as you ask. The following script should run as-is. I've documented how it works inline:
from datetime import timedelta, date
# example input
days = [
date(2020, 1, 1), date(2020, 1, 2), date(2020, 1, 4),
date(2020, 1, 5), date(2020, 1, 6), date(2020, 1, 8),
]
# store the longest interval and the current consecutive interval
# as we iterate through a list
longest_interval_index = current_interval_index = 0
longest_interval_length = current_interval_length = 1
# using zip here to reduce the number of indexing operations
# this will turn the days list into [(2020-01-1, 2020-01-02), (2020-01-02, 2020-01-03), ...]
# use enumerate to get the index of the current day
for i, (previous_day, current_day) in enumerate(zip(days, days[1:]), start=1):
if current_day - previous_day == timedelta(days=+1):
# we've found a consecutive day! increase the interval length
current_interval_length += 1
else:
# nope, not a consecutive day! start from this day and start
# counting from 1
current_interval_index = i
current_interval_length = 1
if current_interval_length > longest_interval_length:
# we broke the record! record it as the longest interval
longest_interval_index = current_interval_index
longest_interval_length = current_interval_length
print("Longest interval index:", longest_interval_index)
print("Longest interval: ", days[longest_interval_index:longest_interval_index + longest_interval_length])
It should be easy enough to turn this into a reusable function.

How to check if 2 different values are from the same list and obtaining the list name

** I modified the entire question **
I have an example list specified below and i want to find if 2 values are from the same list and i wanna know which list both the value comes from.
list1 = ['a','b','c','d','e']
list2 = ['f','g','h','i','j']
c = 'b'
d = 'e'
i used for loop to check whether the values exist in the list however not sure how to obtain which list the value actually is from.
for x,y in zip(list1,list2):
if c and d in x or y:
print(True)
Please advise if there is any work around.
First u might want to inspect the distribution of values and sizes where you can improve the result with the least effort like this:
df_inspect = df.copy()
df_inspect["size.value"] = ["size.value"].map(lambda x: ''.join(y.upper() for y in x if x.isalpha() if y != ' '))
df_inspect = df_inspect.groupby(["size.value"]).count().sort_values(ascending=False)
Then create a solution for the most occuring size category, here "Wide"
long = "adasda, 9.5 W US"
short = "9.5 Wide"
def get_intersection(s1, s2):
res = ''
l_s1 = len(s1)
for i in range(l_s1):
for j in range(i + 1, l_s1):
t = s1[i:j]
if t in s2 and len(t) > len(res):
res = t
return res
print(len(get_intersection(long, short)) / len(short) >= 0.6)
Then apply the solution to the dataframe
df["defective_attributes"] = df.apply(lambda x: len(get_intersection(x["item_name.value"], x["size.value"])) / len(x["size.value"]) >= 0.6)
Basically, get_intersection search for the longest intersection between the itemname and the size. Then takes the length of the intersection and says, its not defective if at least 60% of the size_value are also in the item_name.

Speeding a numpy correlation program using the fact that lists are sorted

I am currently using python and numpy for calculations of correlations between 2 lists: data_0 and data_1. Each list contains respecively sorted times t0 and t1.
I want to calculate all the events where 0 < t1 - t0 < t_max.
for time_0 in np.nditer(data_0):
delta_time = np.subtract(data_1, np.full(data_1.size, time_0))
delta_time = delta_time[delta_time >= 0]
delta_time = delta_time[delta_time < time_max]
Doing so, as the list are sorted, I am selecting a subarray of data_1 of the form data_1[index_min: index_max].
So I need in fact to find two indexes to get what I want.
And what's interesting is that when I go to the next time_0, as data_0 is also sorted, I just need to find the new index_min / index_max such as new_index_min >= index_min / new_index_max >= index_max.
Meaning that I don't need to scann again all the data_1.
(data list from scratch).
I have implemented such a solution not using the numpy methods (just with while loop) and it gives me the same results as before but not as fast than before (15 times longer!).
I think as normally it requires less calculation, there should be a way to make it faster using numpy methods but I don't know how to do it.
Does anyone have an idea?
I am not sure if I am super clear so if you have any questions, do not hestitate.
Thank you in advance,
Paul
Here is a vectorized approach using argsort. It uses a strategy similar to your avoid-full-scan idea:
import numpy as np
def find_gt(ref, data, incl=True):
out = np.empty(len(ref) + len(data) + 1, int)
total = (data, ref) if incl else (ref, data)
out[1:] = np.argsort(np.concatenate(total), kind='mergesort')
out[0] = -1
split = (out < len(data)) if incl else (out >= len(ref))
if incl:
out[~split] -= len(data)
split[0] = False
return np.maximum.accumulate(np.where(split, -1, out))[split] + 1
def find_intervals(ref, data, span, incl=(True, True)):
index_min = find_gt(ref, data, incl[0])
index_max = len(ref) - find_gt(-ref[::-1], -span-data[::-1], incl[1])[::-1]
return index_min, index_max
ref = np.sort(np.random.randint(0,20000,(10000,)))
data = np.sort(np.random.randint(0,20000,(10000,)))
span = 2
idmn, idmx = find_intervals(ref, data, span, (True, True))
print('checking')
for d,mn,mx in zip(data, idmn, idmx):
assert mn == len(ref) or ref[mn] >= d
assert mn == 0 or ref[mn-1] < d
assert mx == len(ref) or ref[mx] > d+span
assert mx == 0 or ref[mx-1] <= d+span
print('ok')
It works by
indirectly sorting both sets together
finding for each time in one set the preceding time in the other
this is done using maximum.reduce
the preceding steps are applied twice, the second time the times in
one set are shifted by span

How to reduce a collection of ranges to a minimal set of ranges [duplicate]

This question already has answers here:
Union of multiple ranges
(5 answers)
Closed 7 years ago.
I'm trying to remove overlapping values from a collection of ranges.
The ranges are represented by a string like this:
499-505 100-115 80-119 113-140 500-550
I want the above to be reduced to two ranges: 80-140 499-550. That covers all the values without overlap.
Currently I have the following code.
cr = "100-115 115-119 113-125 80-114 180-185 500-550 109-120 95-114 200-250".split(" ")
ar = []
br = []
for i in cr:
(left,right) = i.split("-")
ar.append(left);
br.append(right);
inc = 0
for f in br:
i = int(f)
vac = []
jnc = 0
for g in ar:
j = int(g)
if(i >= j):
vac.append(j)
del br[jnc]
jnc += jnc
print vac
inc += inc
I split the array by - and store the range limits in ar and br. I iterate over these limits pairwise and if the i is at least as great as the j, I want to delete the element. But the program doesn't work. I expect it to produce this result: 80-125 500-550 200-250 180-185
For a quick and short solution,
from operator import itemgetter
from itertools import groupby
cr = "499-505 100-115 80-119 113-140 500-550".split(" ")
fullNumbers = []
for i in cr:
a = int(i.split("-")[0])
b = int(i.split("-")[1])
fullNumbers+=range(a,b+1)
# Remove duplicates and sort it
fullNumbers = sorted(list(set(fullNumbers)))
# Taken From http://stackoverflow.com/questions/2154249
def convertToRanges(data):
result = []
for k, g in groupby(enumerate(data), lambda (i,x):i-x):
group = map(itemgetter(1), g)
result.append(str(group[0])+"-"+str(group[-1]))
return result
print convertToRanges(fullNumbers)
#Output: ['80-140', '499-550']
For the given set in your program, output is ['80-125', '180-185', '200-250', '500-550']
Main Possible drawback of this solution: This may not be scalable!
Let me offer another solution that doesn't take time linearly proportional to the sum of the range sizes. Its running time is linearly proportional to the number of ranges.
def reduce(range_text):
parts = range_text.split()
if parts == []:
return ''
ranges = [ tuple(map(int, part.split('-'))) for part in parts ]
ranges.sort()
new_ranges = []
left, right = ranges[0]
for range in ranges[1:]:
next_left, next_right = range
if right + 1 < next_left: # Is the next range to the right?
new_ranges.append((left, right)) # Close the current range.
left, right = range # Start a new range.
else:
right = max(right, next_right) # Extend the current range.
new_ranges.append((left, right)) # Close the last range.
return ' '.join([ '-'.join(map(str, range)) for range in new_ranges ]
This function works by sorting the ranges, then looking at them in order and merging consecutive ranges that intersect.
Examples:
print(reduce('499-505 100-115 80-119 113-140 500-550'))
# => 80-140 499-550
print(reduce('100-115 115-119 113-125 80-114 180-185 500-550 109-120 95-114 200-250'))
# => 80-125 180-185 200-250 500-550

Categories