Using numpy to compute the squared sum of distances between values - python

I am a newbie to Python and Stack Overflow, and I am trying to change my way of thinking about loops.
I have a series of values whose type is <class 'pandas.core.series.Series'>.
Goal: given a depth n, I would like to compute, for each value (except the first 2*n-2):
result(i) = sum[j=0 to n-1](distance(i-j) * value(i-j)) / sum[j=0 to n-1](distance(i-j))
with distance(i) = sum[k=1 to n-1]((value(i) - value(i-k))^2)
I want to avoid loops, so is there a better way to achieve this with numpy?
EDIT:
Ok, it seems that I was not clear, so here is an example with n = 4:
Index   Value
0       2
1       4
2       5
3       3
4       1
5       8
6       9
7       4
8       2
9       1
10      7
Then I compute the squared differences (value[i] - value[j])^2 for j = i-1 down to i-3:
diff²    0    1    2    3    4    5    6    7    8    9   10
0        -    -    -    -    -    -    -    -    -    -    -
1        -    -    -    -    -    -    -    -    -    -    -
2        -    -    -    -    -    -    -    -    -    -    -
3        1    1    4    -    -    -    -    -    -    -    -
4        -    9   16    4    -    -    -    -    -    -    -
5        -    -    9   25   49    -    -    -    -    -    -
6        -    -    -   36   64    1    -    -    -    -    -
7        -    -    -    -    9   16   25    -    -    -    -
8        -    -    -    -    -   36   49    4    -    -    -
9        -    -    -    -    -    -   64    9    1    -    -
10       -    -    -    -    -    -    -    9   25   36    -
I think that getting this matrix, full or banded, is the core of my problem.
I can now compute distance(i), which is the sum of a row, and distance(i) * value(i):
Index   distance   distance x Value
0       -          -
1       -          -
2       -          -
3       6          18
4       29         29
5       83         664
6       101        909
7       50         200
8       89         178
9       74         74
10      70         490
And finally I can get the result:
Index   Value   Result
0       2       -
1       4       -
2       5       -
3       3       -
4       1       -
5       8       -
6       9       7.397260274
7       4       6.851711027
8       2       6.040247678
9       1       4.334394904
10      7       3.328621908
For example:
result(10) = (distance(10)*value(10)+distance(9)*value(9)+distance(8)*value(8)+distance(7)*value(7))/(distance(10)+distance(9)+distance(8)+distance(7))
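For reference, here is a direct loop translation of the definitions above (a plain sketch, useful as a check for any vectorized version; the series and n = 4 are taken from this example):
import numpy as np

values = np.array([2, 4, 5, 3, 1, 8, 9, 4, 2, 1, 7], dtype=float)
n = 4

def distance(i):
    # sum[k=1 to n-1]((value(i) - value(i-k))^2)
    return sum((values[i] - values[i - k]) ** 2 for k in range(1, n))

def result(i):
    num = sum(distance(i - j) * values[i - j] for j in range(n))
    den = sum(distance(i - j) for j in range(n))
    return num / den

print([round(result(i), 6) for i in range(2 * n - 2, len(values))])
# [7.39726, 6.851711, 6.040248, 4.334395, 3.328622]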
I have a Java version of the algorithm if needed.
Thank you.
UPDATE:
I finally found how to get the full squared-differences matrix:
import numpy as np
import pandas as pd

n = 4
myseries = pd.Series([2, 4, 5, 3, 1, 8, 9, 4, 2, 1, 7])
l = len(myseries)
# Tile the series into an l x l matrix: row i repeats value[i].
vector = np.repeat(myseries, l)
mat = vector.to_numpy().reshape((l, l))
# mat - mat.T puts value[i] - value[j] in cell (i, j).
diff = mat - np.transpose(mat)
squared_diff = np.multiply(diff, diff)
print(squared_diff)
I still have to get the sum of the selected elements.
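One possible loop-free way to finish from here (a sketch, assuming NumPy >= 1.20 for sliding_window_view; the banded mask reproduces the matrix shown above):
import numpy as np
import pandas as pd
from numpy.lib.stride_tricks import sliding_window_view

n = 4
myseries = pd.Series([2, 4, 5, 3, 1, 8, 9, 4, 2, 1, 7])
a = myseries.to_numpy(dtype=float)
l = len(a)

# Full squared-difference matrix via broadcasting (equivalent to the repeat/reshape trick).
squared_diff = (a[:, None] - a[None, :]) ** 2

# distance(i) = sum of squared diffs against the n-1 previous values:
# keep only the band of columns i-(n-1) .. i-1 in each row i.
rows, cols = np.indices((l, l))
band = (cols < rows) & (cols >= rows - (n - 1))
distance = np.where(band, squared_diff, 0).sum(axis=1)

# result(i) = average of values i-n+1 .. i, weighted by their distances.
w = sliding_window_view(distance, n)
wv = sliding_window_view(distance * a, n)
result = wv.sum(axis=1) / w.sum(axis=1)   # result[k] belongs to index k + n - 1

# Only indices >= 2*n - 2 have fully defined distances in their window:
print(result[n - 1:])   # approx. [7.39726, 6.85171, 6.04025, 4.33439, 3.32862]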

You could do it like this:
myseries = pd.Series(np.random.rand(100), dtype='float32')
sum_of_squared_distances = np.sum(np.square(np.diff(myseries.values[:n][1::2])))
where n is the nth index (the depth), and the [1::2] part selects only the odd-index values, since you only need the values at odd indices (except 2*n-2).

Related

Ordering a dataframe by each column

I have a dataframe that looks like this:
ID Age Score
0 9 5 3
1 4 6 1
2 9 7 2
3 3 2 1
4 12 1 15
5 2 25 6
6 9 5 4
7 9 5 61
8 4 2 12
I want to sort based on the first column, then the second column, and so on.
So I want my output to be this:
ID Age Score
5 2 25 6
3 3 2 1
8 4 2 12
1 4 6 1
0 9 5 3
6 9 5 4
7 9 5 61
2 9 7 2
4 12 1 15
I know I can do the above with df.sort_values(df.columns.to_list()), however I'm worried this might be quite slow for much larger dataframes (in terms of columns and rows).
Is there a more optimal solution?
You can use numpy.lexsort to improve performance.
import numpy as np
a = df.to_numpy()
out = pd.DataFrame(a[np.lexsort(np.rot90(a))],
                   index=df.index, columns=df.columns)
Assuming as input a random square DataFrame of side n:
df = pd.DataFrame(np.random.randint(0, 100, size=(n, n)))
here is the comparison for 100 to 100M items (lower runtime is better):
[benchmark plot: runtime of the numpy.lexsort approach vs. df.sort_values; a second plot shows the same data as speed relative to pandas]
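For reference, a minimal timing sketch along these lines can reproduce such a comparison (the size n = 1000 and the repetition count are arbitrary choices):
import numpy as np
import pandas as pd
from timeit import timeit

n = 1000
df = pd.DataFrame(np.random.randint(0, 100, size=(n, n)))

def with_pandas():
    return df.sort_values(df.columns.to_list())

def with_lexsort():
    a = df.to_numpy()
    return pd.DataFrame(a[np.lexsort(np.rot90(a))],
                        index=df.index, columns=df.columns)

print('sort_values:', timeit(with_pandas, number=10))
print('lexsort:    ', timeit(with_lexsort, number=10))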
By still using df.sort_values() you can speed it up a bit by choosing the sorting algorithm. By default it is set to quicksort, but there are the alternatives 'mergesort', 'heapsort' and 'stable'.
Maybe specifying one of these would improve it?
df.sort_values(df.columns.to_list(), kind="mergesort")
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sort_values.html

Pandas - Randomly Replace 10% of rows with other rows

I want to randomly select 10% of all rows in my df and replace each with a randomly sampled existing row from the df.
To randomly select 10% of the rows, rows_to_change = df.sample(frac=0.1) works, and I can get a new random existing row with replacement_sample = df.sample(n=1), but how do I put this together to quickly iterate over the entire 10%?
The df contains millions of rows x ~100 cols.
Example df:
df = pd.DataFrame({'A':[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15],'B':[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15],'C':[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15]})
A B C
0 1 1 1
1 2 2 2
2 3 3 3
3 4 4 4
4 5 5 5
5 6 6 6
6 7 7 7
7 8 8 8
8 9 9 9
9 10 10 10
10 11 11 11
11 12 12 12
12 13 13 13
13 14 14 14
14 15 15 15
Let's say it randomly samples indexes 2,13 to replace with randomly selected indexes 6,9 the final df would look like:
A B C
0 1 1 1
1 2 2 2
2 7 7 7
3 4 4 4
4 5 5 5
5 6 6 6
6 7 7 7
7 8 8 8
8 9 9 9
9 10 10 10
10 11 11 11
11 12 12 12
12 13 13 13
13 10 10 10
14 15 15 15
You can take a random sample, then take another random sample of the same size and replace the values at those indices with the original sample.
import pandas as pd
df = pd.DataFrame({'A': range(1, 16), 'B': range(1, 16), 'C': range(1, 16)})
samp = df.sample(frac=0.1)
samp
# returns:
A B C
6 7 7 7
9 10 10 10
replace = df.loc[~df.index.isin(samp.index)].sample(samp.shape[0])
replace
# returns:
A B C
3 4 4 4
7 8 8 8
df.loc[replace.index] = samp.values
This copies the sampled rows over the rows at replace.index, without replacement:
df
# returns:
A B C
0 1 1 1
1 2 2 2
2 3 3 3
3 7 7 7
4 5 5 5
5 6 6 6
6 7 7 7
7 10 10 10
8 9 9 9
9 10 10 10
10 11 11 11
11 12 12 12
12 13 13 13
13 14 14 14
14 15 15 15
To sample with replacement, use the keyword replace=True when defining samp.
@James' answer is a smart Pandas solution. However, given that you noted your dataset length is somewhere in the millions, you could also consider NumPy, given that Pandas often comes with significant performance overhead.
import numpy as np
import pandas as pd

def repl_rows(df: pd.DataFrame, pct: float):
    # Modifies `df` in place.
    n, _ = df.shape
    rows = int(2 * np.ceil(n * pct))    # total rows in both sets
    idx = np.arange(n, dtype=np.intp)   # platform index type (np.int is deprecated)
    full = np.random.choice(idx, size=rows, replace=False)
    to_repl, repl_with = np.split(full, 2)
    df.values[to_repl] = df.values[repl_with]
Steps:
Get target rows as an integer.
Get a NumPy range-array the same length as your index. Might provide more stability than using the index itself if you have something like an uneven datetime index. (I'm not totally sure, something to toy around with.)
Sample from this index without replacement, sample size is 2 times the number of rows you want to manipulate.
Split the result in half to get targets and replacements. Should be faster than two calls to choice().
Replace at positions to_repl with values from repl_with.
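A quick usage sketch with the example frame from the question (one caveat: writing through df.values assumes a single-dtype frame where .values is a view, and the in-place write may not stick under pandas Copy-on-Write):
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': range(1, 16), 'B': range(1, 16), 'C': range(1, 16)})
np.random.seed(0)     # only to make the sketch repeatable
repl_rows(df, 0.1)    # replaces ~10% of the rows in place
print(df)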

Pandas Random Weighted Choice

I would like to randomly select a value in consideration of weightings using Pandas.
df:
0 1 2 3 4 5
0 40 5 20 10 35 25
1 24 3 12 6 21 15
2 72 9 36 18 63 45
3 8 1 4 2 7 5
4 16 2 8 4 14 10
5 48 6 24 12 42 30
I am aware of using np.random.choice, e.g:
x = np.random.choice(
    ['0-0', '0-1', etc.],
    1,
    p=[0.4, 0.24, etc.]
)
And so, I would like to get an output in a similar style to np.random.choice, but from df and using Pandas, in a more efficient way than manually inserting the values as I have done above.
With np.random.choice I am aware that all probabilities must add up to 1. I'm not sure how to go about solving this, nor how to randomly select a value based on weightings using Pandas.
As for the output: if the randomly selected weight was, for example, 40, then the output would be 0-0, since it is located at column 0, row 0, and so on.
Stack the DataFrame:
stacked = df.stack()
Normalize the weights (so that they add up to 1):
weights = stacked / stacked.sum()
# As GeoMatt22 pointed out, this part is not necessary. See the other comment.
And then use sample:
stacked.sample(1, weights=weights)
Out:
1 2 12
dtype: int64
# Or without normalization, stacked.sample(1, weights=stacked)
DataFrame.sample method allows you to either sample from rows or from columns. Consider this:
df.sample(1, weights=[0.4, 0.3, 0.1, 0.1, 0.05, 0.05])
Out:
0 1 2 3 4 5
1 24 3 12 6 21 15
It selects one row (the first row with 40% chance, the second with 30% chance etc.)
This is also possible:
df.sample(1, weights=[0.4, 0.3, 0.1, 0.1, 0.05, 0.05], axis=1)
Out:
1
0 5
1 3
2 9
3 1
4 2
5 6
Same process but 40% chance is associated with the first column and we are selecting from columns. However, your question seems to imply that you don't want to select rows or columns - you want to select the cells inside. Therefore, I changed the dimension from 2D to 1D.
df.stack()
Out:
0 0 40
1 5
2 20
3 10
4 35
5 25
1 0 24
1 3
2 12
3 6
4 21
5 15
2 0 72
1 9
2 36
3 18
4 63
5 45
3 0 8
1 1
2 4
3 2
4 7
5 5
4 0 16
1 2
2 8
3 4
4 14
5 10
5 0 48
1 6
2 24
3 12
4 42
5 30
dtype: int64
So if I now sample from this, I will both sample a row and a column. For example:
df.stack().sample()
Out:
1 0 24
dtype: int64
selects row 1 and column 0.
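Putting the pieces together for the "row-col" output the question asks for, a small sketch using the example frame from the question:
import pandas as pd

df = pd.DataFrame([[40, 5, 20, 10, 35, 25],
                   [24, 3, 12,  6, 21, 15],
                   [72, 9, 36, 18, 63, 45],
                   [ 8, 1,  4,  2,  7,  5],
                   [16, 2,  8,  4, 14, 10],
                   [48, 6, 24, 12, 42, 30]])

stacked = df.stack()
picked = stacked.sample(1, weights=stacked)   # weights do not need to sum to 1
row, col = picked.index[0]                    # the MultiIndex entry is a (row, column) tuple
print(f"{row}-{col}")                         # e.g. "0-0" if the cell holding 40 was drawn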

Pandas - Using a list of values to create a smaller frame

I have a list of values that are found in a large pandas dataframe:
value_list = [1, 4, 5, 6, 54]
Example DataFrame df is below:
column x
0 1 3
1 4 6
2 5 8
3 6 19
4 8 21
5 12 97
6 54 102
I would like to create a subset of the data frame using only these values:
df_new = df[df['column'] is in value_list] # pseudo code
Is this possible?
You might be looking for isin operation.
In [60]: df[df['column'].isin(value_list)]
Out[60]:
column x
0 1 3
1 4 6
2 5 8
3 6 19
6 54 102
Also, you can use query (note the @ prefix for local variables):
In [63]: df.query('column in @value_list')
Out[63]:
column x
0 1 3
1 4 6
2 5 8
3 6 19
6 54 102
If you want the for-loop version, build a boolean mask with a list comprehension:
df_new = df[[v in value_list for v in df['column']]]

How to get odd column headers in Python?

I have data in file as shown below:
odd_column.dat
X1 X2 X3 X4 X5 X6 X7
1 1 1 1 2 2 2
2 2 4 2 5 5 3
3 3 9 3 10 10 4
4 4 16 4 17 17 5
5 5 25 5 26 26 6
6 6 36 6 37 37 7
7 7 49 7 50 50 8
8 8 64 8 65 65 9
9 9 81 9 82 82 10
And I am trying to get the odd column headers with this code (which does not work):
Code
import numpy as np

with open('odd_column.dat', "r") as data:
    while True:
        line = data.readline()
        if not line.startswith('#'):
            break

data_header = [i for i in line.strip().split('\t') if i]
odd_column_header = data_header[n for n in (1, 3, 5, 7)]
I have given only 7 total columns as an example. I would like to generalize it for thousands of columns, so that I get the headers of only the odd columns. How can this be done in Python?
Just use Python slicing:
odd_column_header = data_header[0::2]
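If the file is whitespace-delimited as shown, a pandas-based sketch can also do it (assuming the header is the first non-comment line):
import pandas as pd

df = pd.read_csv('odd_column.dat', sep=r'\s+', comment='#')
odd_column_header = df.columns[0::2].tolist()
print(odd_column_header)   # ['X1', 'X3', 'X5', 'X7']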
