Polars: how can I calculate lagged correlations between days? - python

I have a polars dataframe as below:
import polars as pl
df = pl.DataFrame(
{
"class": [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
"day": [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 4],
"id": [1, 2, 3, 2, 3, 4, 1, 2, 5, 2, 1, 3, 4],
"value": [1, 2, 2, 3, 5, 2, 1, 2, 7, 3, 5, 3, 4],
}
)
The result I want to have is:
Group by "class" (although there is just one in this example, assume there are many of them).
Calculate all pairwise correlations for all possible day pairs, e.g., between "day" - 1 and "day" - 2, "day" - 2 and "day" - 4, etc.
The two series between one particular "day" pair are taken from "value" and matched by "id" and the correlation is calculated by only considering the intersections, for example, the correlation between "day" - 1 and "day" - 4 is the correlation between [1, 2, 2] and [5, 3, 3].
I may want to structure the results as such:
class cor_day_1_2 cor_day_1_3 cor_day_1_4 cor_day_2_3 cor_day_2_4 cor_day_3_4
1 - - - - - -
.
.
.
I tried starting with df.pivot but got stuck for a few reasons:
It needs a transpose (which could be expensive)
Otherwise, I would have to compute row-wise correlations (which I don't think is supported out of the box)
Thanks a lot for your potential help.

Here's an attempt to get you started: use .join() on class and id to match the rows, then filter out the duplicate day pairs.
(df.join(df, on=["class", "id"])
.filter(pl.col("day") < pl.col("day_right"))
.groupby(["class", "day", "day_right"]).agg_list()
)
shape: (6, 6)
┌───────┬─────┬───────────┬───────────┬───────────┬─────────────┐
│ class | day | day_right | id | value | value_right │
│ --- | --- | --- | --- | --- | --- │
│ i64 | i64 | i64 | list[i64] | list[i64] | list[i64] │
╞═══════╪═════╪═══════════╪═══════════╪═══════════╪═════════════╡
│ 1 | 3 | 4 | [2, 1] | [2, 1] | [3, 5] │
│ 1 | 1 | 2 | [2, 3] | [2, 2] | [3, 5] │
│ 1 | 2 | 3 | [2] | [3] | [2] │
│ 1 | 1 | 4 | [2, 1, 3] | [2, 1, 2] | [3, 5, 3] │
│ 1 | 2 | 4 | [2, 3, 4] | [3, 5, 2] | [3, 3, 4] │
│ 1 | 1 | 3 | [1, 2] | [1, 2] | [1, 2] │
└───────┴─────┴───────────┴───────────┴───────────┴─────────────┘

I'm a Polars beginner, so this might be off, but you could try the following:
df = (
df.join(df, on=["class", "id"], how="inner", suffix="_1")
.rename({"day": "day_0", "value": "value_0"})
.filter(pl.col("day_0") < pl.col("day_1"))
.groupby(["class", "day_0", "day_1"], maintain_order=True).agg(
(pl.cov("value_0", "value_1") / (pl.std("value_0") * pl.std("value_1"))).alias("corr")
)
.sort(["day_0", "day_1"])
.with_columns(
pl.concat_str([pl.lit("corr_day"), "day_0", "day_1"], "_").alias("cols"),
)
.pivot(index="class", values="corr", columns="cols")
)
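As a sanity check on the numbers, the same pairwise correlations can be computed outside Polars. This is only a minimal sketch, assuming NumPy is available and using the sample data for the single class; by_day and pair_corr are names made up for this illustration, not part of any library:

```python
import numpy as np

# Sample data from the question (single class).
day = [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 4]
id_ = [1, 2, 3, 2, 3, 4, 1, 2, 5, 2, 1, 3, 4]
value = [1, 2, 2, 3, 5, 2, 1, 2, 7, 3, 5, 3, 4]

# Map each day to {id: value} so two days can be matched on shared ids.
by_day = {}
for d, i, v in zip(day, id_, value):
    by_day.setdefault(d, {})[i] = v

def pair_corr(d0, d1):
    """Pearson correlation over the ids present on both days (hypothetical helper)."""
    shared = sorted(by_day[d0].keys() & by_day[d1].keys())
    x = [by_day[d0][i] for i in shared]
    y = [by_day[d1][i] for i in shared]
    return float(np.corrcoef(x, y)[0, 1])

# Day 1 vs day 4 compares [1, 2, 2] with [5, 3, 3], as in the question.
corr_1_4 = pair_corr(1, 4)
```

Pairs where one side is constant over the shared ids, such as (1, 2) here, have zero variance, so their correlation is undefined and comes out as nan.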

Related

Creating a matrix that reflects properties of a given one with NumPy

Given an N x K matrix with elements from 0 to K, is there an efficient and elegant way to create the following matrix:
#0 | #2 | ... | #K
-------------------------
Row 1 x | y | ... | z
-------------------------
Row 2 a | b | ... | d
-------------------------
...
-------------------------
Row N g | h | ... | j
Where in cell 1,2 should be the amount of 2s in the first row for example.
I know that a rather inefficient way would be to do this with two for loops, but I was wondering whether it is also possible to solve this with some matrix / NumPy operations.
Cheers
EDIT:
some code might look like this:
x_mod = np.zeros((N,I))
for n in range(N):
    for i in range(I):
        x_mod[n][int(X[n][i])] += 1
where X is the original matrix and x_mod the new one.
So for
X = 2 3 4 4 0
0 1 3 3 2
1 1 4 2 2
the desired result would look like:
1 0 1 1 2
1 1 1 2 0
0 2 2 0 1
I think you are looking for np.bincount. It's fast, but unfortunately it only works on 1-D input, so one loop over the rows is required:
import numpy as np
rng = np.random.default_rng()
N,K = 5,6
in_ = rng.integers(0,K,(N,K))
in_
# array([[2, 4, 3, 5, 4, 1],
# [4, 5, 3, 5, 4, 5],
# [2, 2, 0, 3, 3, 3],
# [5, 5, 0, 4, 4, 5],
# [1, 1, 0, 2, 3, 2]])
out = np.array([np.bincount(i,None,K) for i in in_])
out
# array([[0, 1, 1, 1, 2, 1],
# [0, 0, 0, 1, 2, 3],
# [1, 0, 2, 3, 0, 0],
# [1, 0, 0, 0, 2, 3],
# [1, 2, 2, 1, 0, 0]])
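If the loop over rows is a concern, one possible workaround is to shift each row into its own range of values so that a single np.bincount call covers the whole matrix. This is just a sketch under the assumption that the number of distinct values K is known in advance; it uses the X from the question, and shifted and counts are illustrative names:

```python
import numpy as np

X = np.array([[2, 3, 4, 4, 0],
              [0, 1, 3, 3, 2],
              [1, 1, 4, 2, 2]])
N = X.shape[0]
K = 5  # number of distinct values (0..4), assumed known in advance

# Shift row n into its own block [n*K, (n+1)*K) so one bincount covers all rows.
shifted = X + K * np.arange(N)[:, None]

# Count over the flattened array, then fold the blocks back into one row each.
counts = np.bincount(shifted.ravel(), minlength=N * K).reshape(N, K)
```

For this X, counts matches the desired result shown in the question.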

Python dataframe repeat column data in each cell as a list

I am trying to repeat the whole data of a column in each cell of that column.
My code:
df3=pd.DataFrame({
'x':[1,2,3,4,5],
'y':[10,20,30,20,10],
'z':[5,4,3,2,1]
})
df3 =
x y z
0 1 10 5
1 2 20 4
2 3 30 3
3 4 20 2
4 5 10 1
df3['z'] = df['z'].agg(lambda x: list(x))
Present output:
KeyError: 'z'
Expected output:
df=
x y z
0 1 10 [5, 4, 3, 2, 1]
1 2 20 [5, 4, 3, 2, 1]
2 3 30 [5, 4, 3, 2, 1]
3 4 20 [5, 4, 3, 2, 1]
4 5 10 [5, 4, 3, 2, 1]
Another way is to build the list from df3.z.values and repeat it once per row:
df3.assign(z=[list(df3.z.values)]*len(df3))
x y z
0 1 10 [5, 4, 3, 2, 1]
1 2 20 [5, 4, 3, 2, 1]
2 3 30 [5, 4, 3, 2, 1]
3 4 20 [5, 4, 3, 2, 1]
4 5 10 [5, 4, 3, 2, 1]
You can also try:
df3['new_z']=[df3.z.tolist()]*len(df3)
A safer version, which builds a separate list for each row instead of repeating one list object:
df3['new_z']=[df3.z.tolist() for x in df3.index]
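The difference between the two variants is plain Python list behaviour: multiplying a list repeats references to the same inner list, while the comprehension creates a new list on every iteration. A small sketch with a shortened frame (shared and separate are illustrative names):

```python
import pandas as pd

df3 = pd.DataFrame({'x': [1, 2, 3], 'z': [5, 4, 3]})

# [lst] * n repeats a reference: every element is the same list object.
shared = [df3.z.tolist()] * len(df3)

# The comprehension calls tolist() once per row, so each element is its own list.
separate = [df3.z.tolist() for _ in df3.index]

shared[0].append(99)    # visible through every element of `shared`
separate[0].append(99)  # only the first element of `separate` changes
```

So if the per-row lists might be modified later, the comprehension avoids surprising changes in the other rows.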

Pandas Create Column of Numpy Arrays Given Min and Max in Other Columns

Given the following dataframe:
df = pd.DataFrame({'min':[1,2,3],'max':[4,5,6]})
df
min max
0 1 4
1 2 5
2 3 6
I need to add a third column called "arrays" that is a set of arrays generated from the "min" and "max" columns (with 1 added to the "max" value).
For example, using data from the first row, min = 1 and max = 4:
np.arange(1, 5)
array([1, 2, 3, 4])
So I would need that result stored in the first row of the new "arrays" column.
Here is the desired result:
min max arrays
0 1 4 [1, 2, 3, 4]
1 2 5 [2, 3, 4, 5]
2 3 6 [3, 4, 5, 6]
Use a list comprehension with range:
df['arrays'] = [list(range(m, mx+1)) for m, mx in zip(df['min'], df['max'])]
Out[1015]:
min max arrays
0 1 4 [1, 2, 3, 4]
1 2 5 [2, 3, 4, 5]
2 3 6 [3, 4, 5, 6]
Another solution:
df = pd.DataFrame({'min':[1,2,3],'max':[4,5,6]})
df['arrays'] = df.apply(lambda x: np.arange(x['min'], x['max']+1), axis=1)
print(df)
Prints:
min max arrays
0 1 4 [1, 2, 3, 4]
1 2 5 [2, 3, 4, 5]
2 3 6 [3, 4, 5, 6]

pandas equivalent to R series of multiple repeated numbers

I want to create a simple vector of many repeated values. This is easy in R:
> numbers <- c(rep(1,5), rep(2,4), rep(3,3))
> numbers
[1] 1 1 1 1 1 2 2 2 2 3 3 3
However, if I try to do this in Python using pandas and numpy, I don't quite get the same thing:
numbers = pd.Series([np.repeat(1,5), np.repeat(2,4), np.repeat(3,3)])
numbers
0 [1, 1, 1, 1, 1]
1 [2, 2, 2, 2]
2 [3, 3, 3]
dtype: object
What's the R equivalent in Python?
Just adjust how you use np.repeat
np.repeat([1, 2, 3], [5, 4, 3])
array([1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3])
Or with pd.Series
pd.Series(np.repeat([1, 2, 3], [5, 4, 3]))
0 1
1 1
2 1
3 1
4 1
5 2
6 2
7 2
8 2
9 3
10 3
11 3
dtype: int64
That said, the purest form to replicate what you've done in R is to use np.concatenate in conjunction with np.repeat. It just isn't what I'd recommend doing.
np.concatenate([np.repeat(1,5), np.repeat(2,4), np.repeat(3,3)])
array([1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3])
With the datar package you can now use the same syntax in Python:
>>> from datar.base import c, rep
>>>
>>> numbers = c(rep(1,5), rep(2,4), rep(3,3))
>>> print(numbers)
[1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3]
I am the author of the datar package. Feel free to submit issues if you have any questions.

Translate reshape from Matlab to Python

I'm using NumPy and I don't know how to translate this MATLAB code to Python:
C = reshape(A(B.',:).', 6, []).';
I think that the only right thing that I did is:
temp=A[B.transpose(),:]
but I don't know how to translate the rest of the expression.
example of matrix:
A =
1 2
1 3
1 4
1 5
1 6
2 3
2 4
2 5
2 6
B =
1 2 3
1 2 4
1 2 5
1 2 6
1 2 7
1 2 8
1 2 9
C =
1 2 1 3 1 4
1 2 1 3 1 5
1 2 1 3 1 6
1 2 1 3 2 3
1 2 1 3 2 4
1 2 1 3 2 5
1 2 1 3 2 6
This looks like an indexing plus reshaping operation. One thing to keep in mind is that NumPy is zero-indexed, while MATLAB is one-indexed. That means you need to index A with B - 1, and then reshape the result as desired. For example:
import numpy as np
A = np.array([[1, 2],
[1, 3],
[1, 4],
[1, 5],
[1, 6],
[2, 3],
[2, 4],
[2, 5],
[2, 6]])
B = np.array([[1, 2, 3],
[1, 2, 4],
[1, 2, 5],
[1, 2, 6],
[1, 2, 7],
[1, 2, 8],
[1, 2, 9]])
C = A[B - 1].reshape(B.shape[0], -1)
The result is:
>>> C
array([[1, 2, 1, 3, 1, 4],
[1, 2, 1, 3, 1, 5],
[1, 2, 1, 3, 1, 6],
[1, 2, 1, 3, 2, 3],
[1, 2, 1, 3, 2, 4],
[1, 2, 1, 3, 2, 5],
[1, 2, 1, 3, 2, 6]])
One potentially confusing piece: the -1 in the reshape method is a marker that indicates numpy should calculate the appropriate dimension to preserve the size of the array.
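To see the -1 placeholder in isolation, here is a tiny sketch on an unrelated 12-element array (the names arr, three_rows and six_cols are made up for the example):

```python
import numpy as np

arr = np.arange(12)               # 12 elements in total

three_rows = arr.reshape(3, -1)   # -1 is inferred as 12 / 3 = 4 columns
six_cols = arr.reshape(-1, 6)     # -1 is inferred as 12 / 6 = 2 rows
```

Only one dimension can be given as -1, and the total size has to be divisible by the product of the other dimensions; otherwise NumPy raises a ValueError.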
