pandas dataframe exponential decay summation - python

I have a pandas dataframe,
[[1, 3],
[4, 4],
[2, 8]...
]
I want to create a column that has this:
1*(a)^(3) # = x
1*(a)^(3 + 4) + 4 * (a)^4 # = y
1*(a)^(3 + 4 + 8) + 4 * (a)^(4 + 8) + 2 * (a)^8 # = z
...
Where "a" is some value.
The coefficients 1, 4, 2 come from column one; the repeated exponents 3, 4, 8 come from column two.
Is this possible using some form of transform/apply?
Essentially getting:
[[1, 3, x],
[4, 4, y],
[2, 8, z]...
]
Where x, y, z are the respective sums in the new column (I want them next to each other)
There is a "groupby" that is being done on the dataframe, and this is what I want to do for a given group

If I'm understanding your question correctly, this should work:
import pandas as pd

df = pd.DataFrame([[1, 3], [4, 4], [2, 8]], columns=['a', 'b'])
a = 42
new_lst = []
for n in range(len(df)):
    z = 0
    i = 0
    while i <= n:
        z += df['a'][i] * a**(sum(df['b'][i:n+1]))
        i += 1
    new_lst.append(z)
df['new'] = new_lst
Update:
Saw that you are using pandas and updated the answer to use dataframe methods. I'm not sure there's an easy way to do this with apply, since each row needs a mix of values from other rows. I think this for loop is still the best route.
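If overflow is not a concern (e.g. the base is small), the double loop can be avoided with a cumulative-sum identity: factoring the base raised to cumsum(b) out of each row's sum turns the whole computation into two cumsum calls. This is a hedged sketch of mine, not part of the original answer:

```python
import pandas as pd

df = pd.DataFrame([[1, 3], [4, 4], [2, 8]], columns=['a', 'b'])
x = 0.5  # the question's "a"; kept small so the negative powers stay finite

S = df['b'].cumsum()
# term i of row n is a_i * x**(S_n - S_{i-1}); factor x**S_n out of the sum,
# so the inner sum becomes a cumsum of a_i * x**(b_i - S_i)
inner = (df['a'] * x ** (df['b'] - S)).cumsum()
df['new'] = x ** S * inner
```

For the sample data this reproduces the loop's results exactly; for a large base or large exponents, the intermediate `x**(b_i - S_i)` terms can overflow, so the loop remains the safer general route.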

How to generate numeric mapping for categorical columns in pandas?

I want to manipulate categorical data using pandas data frame and then convert them to numpy array for model training.
Say I have the following data frame in pandas.
import pandas as pd
df2 = pd.DataFrame({"c1": ['a','b',None], "c2": ['d','e','f']})
>>> df2
c1 c2
0 a d
1 b e
2 None f
And now I want "compress the categories" horizontally as the following:
compressed_categories
0 c1-a, c2-d <--- this could be a string, ex. "c1-a, c2-d" or array ["c1-a", "c2-d"] or categorical data
1 c1-b, c2-e
2 c1-nan, c2-f
Next I want to generate a dictionary/vocabulary based on the unique occurrences plus "nan" columns in compressed_categories, ex:
volcab = {
"c1-a": 0,
"c1-b": 1,
"c1-c": 2,
"c1-nan": 3,
"c2-d": 4,
"c2-e": 5,
"c2-f": 6,
"c2-nan": 7,
}
So I can further encode them numerically as follows:
compressed_categories_numeric
0 [0, 4]
1 [1, 5]
2 [3, 6]
So my ultimate goal is to make it easy to convert them to numpy array for each row and thus I can further convert it to tensor.
input_data = np.asarray(df['compressed_categories_numeric'].tolist())
then I can train my model using input_data.
Can anyone please show me an example how to make this series of conversion? Thanks in advance!
To build volcab dictionary and compressed_categories_numeric, you can use:
import numpy as np

df3 = df2.fillna(np.nan).astype(str).apply(lambda x: x.name + '-' + x)
volcab = {k: v for v, k in enumerate(np.unique(df3))}
df2['compressed_categories_numeric'] = df3.replace(volcab).agg(list, axis=1)
Output:
>>> volcab
{'c1-a': 0, 'c1-b': 1, 'c1-nan': 2, 'c2-d': 3, 'c2-e': 4, 'c2-f': 5}
>>> df2
c1 c2 compressed_categories_numeric
0 a d [0, 3]
1 b e [1, 4]
2 None f [2, 5]
>>> np.array(df2['compressed_categories_numeric'].tolist())
array([[0, 3],
[1, 4],
[2, 5]])
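Note that the vocab above only contains labels actually observed in the data. If you also want a guaranteed `-nan` slot per column (as in the question's example vocab), one hedged sketch is to union it in per column before enumerating:

```python
import numpy as np
import pandas as pd

df2 = pd.DataFrame({"c1": ['a', 'b', None], "c2": ['d', 'e', 'f']})
df3 = df2.fillna(np.nan).astype(str).apply(lambda x: x.name + '-' + x)

labels = []
for col in df3.columns:
    # observed labels for this column, plus an explicit "<col>-nan" entry
    labels.extend(sorted(set(df3[col]) | {col + '-nan'}))
volcab = {k: v for v, k in enumerate(labels)}
```

With the sample data this yields seven entries (`c1-nan` is observed, `c2-nan` is added); unseen categories like the question's `c1-c` would still need to be listed explicitly if required.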

Length of the intersections between a list and a list of lists

Note : almost duplicate of Numpy vectorization: Find intersection between list and list of lists
Differences :
I am focused on efficiency when the lists are large
I'm searching for the largest intersections.
x = [500 numbers between 1 and N]
y = [[1, 2, 3], [4, 5, 6, 7], [8, 9], [10, 11, 12], etc. up to N]
Here are some assumptions:
y is a list of ~500,000 sublists of ~500 elements each
each sublist in y is a range, so y is characterized by the last element of each sublist. In the example: 3, 7, 9, 12 ...
x is not sorted
y contains each number between 1 and ~500000*500 once and only once
y is sorted in the sense that, as in the example, the sublists are sorted and the first element of one sublist follows the last element of the previous sublist
y is known long before runtime, even before compile time
My purpose is to know, among the sublists of y, which have at least 10 intersections with x.
I can obviously make a loop :
def find_best(x, y):
    result = []
    x_set = set(x)  # build the set once instead of once per sublist
    for index, sublist in enumerate(y):
        intersection = x_set.intersection(sublist)
        if len(intersection) >= 2:  # in real life: >= 10
            result.append(index)
    return result

x = [1, 2, 3, 4, 5, 6]
y = [[1, 2, 3], [4], [5, 6], [7], [8, 9, 10, 11]]
res = find_best(x, y)
print(res)  # [0, 2]
Here the result is [0, 2] because the first and third sublists of y have at least 2 elements in common with x.
Another method is to parse y only once and count the intersections:
def find_intersec2(x, y):
    n_sublists = len(y)
    res = {num: 0 for num in range(n_sublists)}
    for list_no, sublist in enumerate(y):
        for num in sublist:
            if num in x:
                x.remove(num)
                res[list_no] += 1
    return [n for n in range(n_sublists) if res[n] >= 2]
This second method uses more the hypothesis.
Questions :
what optimizations are possibles ?
Is there a completely different approach ? Indexing, kdtree ? In my use case, the large list y is known days before the actual run. So i'm not afraid to buildind an index or whatever from y. The small list x is only known at runtime.
Since y contains disjoint ranges and the union of them is also a range, a very fast solution is to first perform a binary search on y and then count the resulting indices and only return the ones that appear at least 10 times. The complexity of this algorithm is O(Nx log Ny) with Nx and Ny the number of items in respectively x and y. This algorithm is nearly optimal (since x needs to be read entirely).
Actual implementation
First of all, you need to transform your current y to a Numpy array containing the beginning value of all ranges (in an increasing order) with N as the last value (assuming N is excluded for the ranges of y, or N+1 otherwise). This part can be assumed as free since y can be computed at compile time in your case. Here is an example:
import numpy as np
y = np.array([1, 4, 8, 10, 13, ..., N])
Then, you need to perform the binary search and check that the values fit in the range of y:
indices = np.searchsorted(y, x, 'right')
# The `0 < indices < len(y)` check should not be needed regarding the input.
# If so, you can use only `indices -= 1`.
indices = indices[(0 < indices) & (indices < len(y))] - 1
Then you need to count the indices and keep only the ones that appear at least 10 times:
uniqueIndices, counts = np.unique(indices, return_counts=True)
result = uniqueIndices[counts >= 10]
Here is an example based on your input:
x = np.array([1, 2, 3, 4, 5, 6])
# [[1, 2, 3], [4], [5, 6], [7], [8, 9, 10, 11]]
y = np.array([1, 4, 5, 7, 8, 12])
# Actual simplified version of the above algorithm
indices = np.searchsorted(y, x, 'right') - 1
uniqueIndices, counts = np.unique(indices, return_counts=True)
result = uniqueIndices[counts >= 2]
# [0, 2]
print(result.tolist())
It runs in less than 0.1 ms on my machine on a random input based on your input constraints.
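As a minor variant (my assumption, not part of the original answer), `np.bincount` can replace `np.unique` for the counting step, since the indices are small non-negative integers:

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6])
y = np.array([1, 4, 5, 7, 8, 12])

indices = np.searchsorted(y, x, 'right') - 1
# counts[i] is the number of x values falling in range i
counts = np.bincount(indices, minlength=len(y))
result = np.flatnonzero(counts >= 2)
print(result.tolist())  # [0, 2]
```

This avoids the sort inside `np.unique`, at the cost of allocating one counter per range.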
Turn y into 2 dicts.
index = {  # index to count map
    0: 0,
    1: 0,
    2: 0,
    3: 0,
    4: 0,
}
y = {  # elem to index map
    1: 0,
    2: 0,
    3: 0,
    4: 1,
    5: 2,
    6: 2,
    7: 3,
    8: 4,
    9: 4,
    10: 4,
    11: 4,
}
Since you know y in advance, I don't count the above operations into the time complexity. Then, to count the intersection:
x = [1, 2, 3, 4, 5, 6]
for e in x: index[y[e]] += 1
Since you mentioned x is small, I try to make the time complexity depend only on the size of x (in this case O(n)).
Finally, the answer is the list of keys in index dict where the value is >= 2 (or 10 in real case).
answer = [i for i in index if index[i] >= 2]
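The two dicts above can be built programmatically from y; a small sketch of the whole approach:

```python
y_lists = [[1, 2, 3], [4], [5, 6], [7], [8, 9, 10, 11]]

# elem -> sublist index, and sublist index -> hit count
elem_to_index = {e: i for i, sub in enumerate(y_lists) for e in sub}
index = {i: 0 for i in range(len(y_lists))}

x = [1, 2, 3, 4, 5, 6]
for e in x:
    index[elem_to_index[e]] += 1

answer = [i for i in index if index[i] >= 2]
print(answer)  # [0, 2]
```

Building `elem_to_index` is O(total elements of y), but as noted it can be done once, long before x arrives.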
This uses y to create a linear array mapping every int to 1 plus the index of the range/subgroup the int is in, called x2range_counter.
x2range_counter uses a 32-bit array.array type to save memory, and can be cached and reused for calculations of all x on the same y.
Calculating the hits in each range for a particular x is then just indirect array incrementing of a counter in function count_ranges.
import array
from typing import List

y = [[1, 2], [3, 4, 5], [6, 7, 8], [9, 10, 11, 12]]
x = [5, 3, 1, 11, 8, 10]

range_counter_max = len(y)
extent = y[-1][-1] + 1  # min in y must be 1, not 0, remember
x2range_counter = array.array('L', [0] * extent)  # efficient 32-bit array storage

# Map any int in any x to the appropriate range counter.
for range_counter_index, rng in enumerate(y, start=1):
    for n in rng:
        x2range_counter[n] = range_counter_index
print(x2range_counter)  # array('L', [0, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 4])
# x2range_counter can be saved for this y and any x on this y.

def count_ranges(x: List[int]) -> List[int]:
    "Number of x-hits on each y subgroup in order"
    # Note: count[0] initially catches errors. count[1..] counts x's in y ranges [0..]
    count = array.array('L', [0] * (range_counter_max + 1))
    for xx in x:
        count[x2range_counter[xx]] += 1
    assert count[0] == 0, "x values must all exist in a y range and y must have all int in its range."
    return count[1:]

print(count_ranges(x))  # array('L', [1, 2, 1, 2])
I created a class for this, with extra functionality such as returning the ranges rather than the indices; all ranges hit >=M times; (range, hit-count) tuples sorted most hit first.
Range calculations for different x are proportional to x and are simple array lookups rather than any hashing of dicts.
What do you think?

constructing arithmetic progressions from loop

I am trying to work out a program that would calculate the diagonal coefficients of Pascal's triangle.
For those who are not familiar with it, the general terms of sequences are written below.
1st row = 1 1 1 1 1....
2nd row = N0(natural number) // 1 = 1 2 3 4 5 ....
3rd row = N0(N0+1) // 2 = 1 3 6 10 15 ...
4th row = N0(N0+1)(N0+2) // 6 = 1 4 10 20 35 ...
The subsequent sequences for each row follow a specific pattern, and my goal is to output those sequences in a for loop, with the number of units as input.
def figurate_numbers(units):
    row_1 = str(1) * units
    row_1_list = list(row_1)
    for i in range(1, units):
        # sequences are:
        # row_2 = n // i
        # row_3 = (n(n+1)) // (i(i+1))
        # row_4 = (n(n+1)(n+2)) // (i(i+1)(i+2))
        ...

>>> figurate_numbers(4)  # coefficients for 4 rows and 4 columns
[1, 1, 1, 1]
[1, 2, 3, 4]
[1, 3, 6, 10]
[1, 4, 10, 20]  # desired output
How can I iterate for both n and i in one loop such that each sequence of corresponding row would output coefficients?
You can use map or a list comprehension to hide a loop, e.g. (sketch):
def f(i):
    return lambda x: ...  # the dependency on the previous element of the row

row = [[1] * k]
for i in range(k - 1):
    row.append(list(map(f(i), row[i])))
where f is a function that describes the dependency on the previous element of the row.
Another possibility is to adapt a recursive Fibonacci approach to rows. The numpy library allows array arithmetic, so you do not even need map. Python also has predefined functions for the number of combinations etc., which could perhaps be used.
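To make the idea concrete, here is a minimal working version (my sketch, using running sums rather than map): each row of the table is the cumulative sum of the previous row.

```python
def figurate_numbers(units):
    rows = [[1] * units]
    for _ in range(units - 1):
        prev = rows[-1]
        # each new row is the running (cumulative) sum of the previous row
        nxt, total = [], 0
        for v in prev:
            total += v
            nxt.append(total)
        rows.append(nxt)
    return rows

for row in figurate_numbers(4):
    print(row)
# [1, 1, 1, 1]
# [1, 2, 3, 4]
# [1, 3, 6, 10]
# [1, 4, 10, 20]
```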
To compute efficiently, without nested loops, use the rational-number-based solution from https://medium.com/#duhroach/fast-fun-with-pascals-triangle-6030e15dced0 .
from fractions import Fraction

def pascalIndexInRowFast(row, index):
    lastVal = 1
    halfRow = (row >> 1)
    # early out: exploit symmetry, C(row, index) == C(row, row - index)
    if index > halfRow:
        index = row - index
    for i in range(0, index):
        lastVal = lastVal * (row - i) // (i + 1)  # stays an exact integer
    return lastVal

def pascDiagFast(row, length):
    # compute the fractions of this diag
    fracs = [1] * length
    for i in range(length - 1):
        num = i + 1
        denom = row + 1 + i
        fracs[i] = Fraction(num, denom)
    # now let's compute the values
    vals = [0] * length
    # first figure out the leftmost tail of this diag
    lowRow = row + (length - 1)
    lowRowCol = row
    tail = pascalIndexInRowFast(lowRow, lowRowCol)
    vals[-1] = tail
    # walk backwards!
    for i in reversed(range(length - 1)):
        vals[i] = int(fracs[i] * vals[i + 1])
    return vals
Don't reinvent the triangle:
>>> from scipy.linalg import pascal
>>> pascal(4)
array([[ 1, 1, 1, 1],
[ 1, 2, 3, 4],
[ 1, 3, 6, 10],
[ 1, 4, 10, 20]], dtype=uint64)
>>> pascal(4).tolist()
[[1, 1, 1, 1], [1, 2, 3, 4], [1, 3, 6, 10], [1, 4, 10, 20]]
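If only one diagonal is needed, the standard library is enough: `math.comb` (Python 3.8+) gives the k-th diagonal directly as binomial coefficients. A sketch of mine, not part of the answers above:

```python
from math import comb

def diagonal(k, length):
    # k-th diagonal of Pascal's triangle: C(k, k), C(k+1, k), C(k+2, k), ...
    return [comb(k + i, k) for i in range(length)]

print(diagonal(2, 4))  # [1, 3, 6, 10]
print(diagonal(3, 4))  # [1, 4, 10, 20]
```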

Efficient way to iterate python loop until previous position

I have a list a and I need to iterate starting from position 2, wrapping around to the position just before it (position 1).
# old index - 0 1 2 3 4
a = [1,2,3,4,5]
# new index - 2,3,4,0,1
# new value - 3,4,5,1,2
cnt = 0
while True:
    for i in range(2, len(a)):
        print(a[i])
    for i in range(len(a) - 2 - 1):
        print(a[i])
    break
I'm using 2 for loops but I believe there should be a better way to do it.
Let's assume we start with a list a = [1,2,3,4,5].
You can use collections.deque and its method deque.rotate:
from collections import deque
b = deque(a)
b.rotate(-2)
print(b)
deque([3, 4, 5, 1, 2])
Or, if you are happy to use a 3rd party library, you can use NumPy and np.roll:
import numpy as np
c = np.array(a)
c = np.roll(c, -2)
print(c)
array([3, 4, 5, 1, 2])
You can create a new list by combining the elements from a particular value onward with those before it, say the value 3 in your case:
a = [1, 2, 3, 4, 5]
piv = a.index(3)
print(a[piv:] + a[:piv])
which gives you [3, 4, 5, 1, 2]
a = [1,2,3,4,5]
position = 2
for item in a[position:] + a[:position]:
print(item)
A base-Python solution:
a[2::] + a[:2:]
Gives
[3, 4, 5, 1, 2]
A generic version of the same would be
rotate_from = 2
a[rotate_from::] + a[:rotate_from:]
Write a function to rotate a list:
In [114]: def rotate(lst, n):
     ...:     return lst[-n:] + lst[:-n]
     ...:
In [115]: rotate(a,-2)
Out[115]: [3, 4, 5, 1, 2]
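If you only need to iterate (not build a rotated list), `itertools` can chain the two slices lazily; a small sketch:

```python
from itertools import chain, islice

a = [1, 2, 3, 4, 5]
pos = 2
# iterate over a[pos:] then a[:pos] without slicing by hand
rotated = list(chain(islice(a, pos, None), islice(a, pos)))
print(rotated)  # [3, 4, 5, 1, 2]
```

`chain(...)` itself is an iterator, so you can loop over it directly and skip the `list(...)` call entirely.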

Get Maximum Value across rows and columns of a python Matrix

Consider the question:
The grid is:
[ [3, 0, 8, 4],
[2, 4, 5, 7],
[9, 2, 6, 3],
[0, 3, 1, 0] ]
The max viewed from top (i.e. max across columns) is: [9, 4, 8, 7]
The max viewed from left (i.e. max across rows) is: [8, 7, 9, 3]
I know how to define a grid in Python:
maximums = [[0 for x in range(len(grid[0]))] for x in range(len(grid))]
Getting maximum across rows looks easy:
max_top = [max(x) for x in grid]
But how to get maximum across columns?
Further, I need to find a way to do so in linear space O(M+N) where MxN is the size of the Matrix.
Use zip:
result = [max(i) for i in zip(*grid)]
In Python, * is not a pointer; rather, it is used for unpacking a structure passed as arguments, or for specifying that a function can receive a variable number of arguments. For instance:
def f(*args):
    print(args)

f(434, 424, "val", 233, "another val")
Output:
(434, 424, 'val', 233, 'another val')
Or, given an iterable, each item can be inserted at its corresponding function parameter:
def f(*args):
    print(args)

f(*["val", "val3", 23, 23])
('val', 'val3', 23, 23)
zip "transposes" a listing of data, i.e. each row becomes a column, and vice versa.
You could use numpy:
import numpy as np
x = np.array([ [3, 0, 8, 4],
[2, 4, 5, 7],
[9, 2, 6, 3],
[0, 3, 1, 0] ])
print(x.max(axis=0))
Output:
[9 4 8 7]
You said that you need to do this in O(m+n) space (not using numpy), so here's a solution that doesn't recreate the matrix:
col_max = list(grid[0])  # copy, so the grid itself is not mutated
for row in grid[1:]:
    for j, v in enumerate(row):
        if v > col_max[j]:
            col_max[j] = v
print(col_max)
Output:
[9, 4, 8, 7]
I figured a shortcut too:
transpose the matrix and then just take the maximum over rows:
grid_transposed = [[grid[j][i] for j in range(len(grid))] for i in range(len(grid[0]))]
max_left = [max(x) for x in grid_transposed]
But then again this takes O(M*N) extra space, since I have to build a second matrix.
I don't want to use numpy as external libraries are not allowed in any assignments.
Easiest way is to use numpy's array max:
array.max(0)
Something like these works both ways and is quite easy to read:
# 1.
maxLR, maxTB = [], []
maxlr, maxtb = 0, 0
# max across rows
for i, x in enumerate(grid):
maxlr = 0
for j, y in enumerate(grid[0]):
maxlr = max(maxlr, grid[i][j])
maxLR.append(maxlr)
# max across columns
for j, y in enumerate(grid[0]):
maxtb = 0
for i, x in enumerate(grid):
maxtb = max(maxtb, grid[i][j])
maxTB.append(maxtb)
# 2.
row_maxes = [max(row) for row in grid]
col_maxes = [max(col) for col in zip(*grid)]
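Both directions can also be expressed with `functools.reduce`, keeping only one row of extra state for the column maxima; a sketch of mine:

```python
from functools import reduce

grid = [[3, 0, 8, 4],
        [2, 4, 5, 7],
        [9, 2, 6, 3],
        [0, 3, 1, 0]]

# element-wise max of successive rows -> column maxima in O(N) extra space
col_maxes = reduce(lambda acc, row: [max(a, b) for a, b in zip(acc, row)], grid)
row_maxes = [max(row) for row in grid]
print(col_maxes)  # [9, 4, 8, 7]
print(row_maxes)  # [8, 7, 9, 3]
```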