How to get median values across diagonal lines in a matrix? - python

I have the following matrix in pandas:
import numpy as np
import pandas as pd
df_matrix = pd.DataFrame(np.random.random((10, 10)))
I need to get a vector that contains 10 median values, one value for each blue line as shown in the picture below (the blue lines are the anti-diagonals of the lower-right triangle, from the main anti-diagonal down to the bottom-right corner).
The last number in the output vector is therefore just a single element rather than a true median.

X = np.random.random((10, 10))
fX = np.fliplr(X) # to get the "other" diagonal
np.array([np.median(np.diag(fX, k=-k)) for k in range(X.shape[0])])
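If you want this applied to df_matrix from the question rather than a fresh random array, a minimal adaptation (my sketch; it assumes the blue lines are the anti-diagonals running from the main anti-diagonal down to the bottom-right corner) would be:

X = df_matrix.to_numpy()
fX = np.fliplr(X)  # flip left/right so the anti-diagonals become regular diagonals
medians = np.array([np.median(np.diag(fX, k=-k)) for k in range(X.shape[0])])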

These anti-diagonals are such that row_num + col_num = constant, so you can stack the frame, add the row/column indices to enumerate the diagonals, and group by that sum:
(df_matrix.stack().reset_index(name='val')
   .assign(diag=lambda x: x.level_0 + x.level_1)  # enumerate the diagonals
   .groupby('diag')['val'].median()               # median per diagonal
   .loc[len(df_matrix)-1:]                        # keep the lower-right triangle diagonals
)
Output (for np.random.seed(42)):
diag
9 0.473090
10 0.330898
11 0.531382
12 0.440152
13 0.548075
14 0.325330
15 0.580145
16 0.427541
17 0.248817
18 0.107891
Name: val, dtype: float64

Related

Faster way to sum all combinations of rows in dataframe

I have a dataframe of 10,000 rows, and I am trying to sum all possible combinations of those rows. According to my math, that's about 50 million combinations. I'll give a small example to simplify what my data looks like:
df = Ratio Count Score
1 6 11
2 7 12
3 8 13
4 9 14
5 10 15
And here's the desired result:
results = Min Ratio Max Ratio Total Count Total Score
1 2 13 23
1 3 21 36
1 4 30 50
1 5 40 65
2 3 15 25
2 4 24 39
2 5 34 54
3 4 17 27
3 5 27 42
4 5 19 29
This is the code that I came up with to complete the calculation:
for i in range(len(df)):
    j = i + 1
    while j <= len(df):
        range_to_calc = df.iloc[i:j]
        total_count = range_to_calc['Count'].sum()
        total_score = range_to_calc['Score'].sum()
        new_row = {'Min Ratio': range_to_calc.at[range_to_calc.first_valid_index(), 'Ratio'],
                   'Max Ratio': range_to_calc.at[range_to_calc.last_valid_index(), 'Ratio'],
                   'Total Count': total_count,
                   'Total Score': total_score}
        results = results.append(new_row, ignore_index=True)
        j = j + 1
This code works, but according to my estimates after running it for a few minutes, it would take 200 hours to complete. I understand that using numpy would be a lot faster, but I can't wrap my head around how to build multiple arrays to add together. (I think it would be easy if I was doing just 1+2, 2+3, 3+4, etc., but it's a lot harder because I need 1+2, 1+2+3, 1+2+3+4, etc.) Is there a more efficient way to complete this calculation so it can run in a reasonable amount of time? Thank you!
P.S.: If you're wondering what I want to do with a 50 million-row dataframe, I don't actually need that in my final results. I'm ultimately looking to divide the Total Score of each row in the results by its Total Count to get a Total Score Per Total Count value, and then display the 1,000 highest Total Scores Per Total Count, along with each associated Min Ratio, Max Ratio, Total Count, and Total Score.
After these improvements it takes ~2 minutes to run for 10k rows.
For the sum computation, you can pre-compute the cumulative sum (cumsum) and save it: sum(i to j) is equal to sum(0 to j) - sum(0 to i-1). Now sum(0 to j) is cumsum[j] and sum(0 to i-1) is cumsum[i-1], so sum(i to j) = cumsum[j] - cumsum[i-1]. This gives a significant improvement over recomputing the sum for each combination (see the short illustration after these points).
Operations on NumPy arrays are faster than operations on pandas Series, hence convert every column to a NumPy array and then do the computation over it.
(From other answers): Instead of appending to a list, initialise an empty NumPy array of size ((n*(n+1)//2) - n, 4) and use it to store the results.
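As a quick illustration of the cumsum identity (my addition, using the Count column of the sample data purely for demonstration):

import numpy as np

vals = np.array([6, 7, 8, 9, 10])    # the Count column
cumsum = np.cumsum(vals)             # array([ 6, 13, 21, 30, 40])

# sum of rows i..j == cumsum[j] - cumsum[i-1]  (just cumsum[j] when i == 0)
i, j = 1, 3
print(vals[i:j+1].sum())             # 24
print(cumsum[j] - cumsum[i-1])       # 24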
Use:
count_cumsum = np.cumsum(df.Count.values)
score_cumsum = np.cumsum(df.Score.values)
ratios = df.Ratio.values

n = len(df)
rowInCombination = (n * (n + 1) // 2) - n
arr = np.empty(shape=(rowInCombination, 4), dtype=int)

k = 0
for i in range(len(df)):
    for j in range(i + 1, len(df)):
        arr[k, :] = [
            ratios[i],                                                          # Min_Ratio
            ratios[j],                                                          # Max_Ratio
            count_cumsum[j] - count_cumsum[i-1] if i > 0 else count_cumsum[j],  # Total_Count
            score_cumsum[j] - score_cumsum[i-1] if i > 0 else score_cumsum[j]]  # Total_Score
        k = k + 1

out = pd.DataFrame(arr, columns=['Min_Ratio', 'Max_Ratio',
                                 'Total_Count', 'Total_Score'])
Input:
df = pd.DataFrame({'Ratio': [1, 2, 3, 4, 5],
'Count': [6, 7, 8, 9, 10],
'Score': [11, 12, 13, 14, 15]})
Output:
>>>out
Min_Ratio Max_Ratio Total_Count Total_Score
0 1 2 13 23
1 1 3 21 36
2 1 4 30 50
3 1 5 40 65
4 2 3 15 25
5 2 4 24 39
6 2 5 34 54
7 3 4 17 27
8 3 5 27 42
9 4 5 19 29
First of all, you can improve the algorithm. Then, you can speed up the computation using Numpy vectorization/broadcasting.
Here are the interesting points for improving the performance of the algorithm:
append in Pandas is slow because it recreates a new dataframe. You should never use it in a costly loop. Instead, append the rows to a Python list, or even better, write the items directly into a pre-allocated NumPy array.
Computing each partial sum takes O(n) time, whereas you can pre-compute the cumulative sums and then obtain any partial sum in constant time.
CPython loops are very slow, but the inner loop can be vectorized using NumPy thanks to broadcasting.
Here is the resulting code:
import numpy as np
import pandas as pd
def fastImpl(df):
    n = len(df)
    resRowCount = (n * (n+1)) // 2
    k = 0

    cumCounts = np.concatenate(([0], df['Count'].astype(int).cumsum()))
    cumScores = np.concatenate(([0], df['Score'].astype(int).cumsum()))
    ratios = df['Ratio'].astype(int)

    minRatio = np.empty(resRowCount, dtype=int)
    maxRatio = np.empty(resRowCount, dtype=int)
    count = np.empty(resRowCount, dtype=int)
    score = np.empty(resRowCount, dtype=int)

    for i in range(n):
        kStart, kEnd = k, k+(n-i)
        jStart, jEnd = i+1, n+1
        minRatio[kStart:kEnd] = ratios[i]
        maxRatio[kStart:kEnd] = ratios[i:n]
        count[kStart:kEnd] = cumCounts[jStart:jEnd] - cumCounts[i]
        score[kStart:kEnd] = cumScores[jStart:jEnd] - cumScores[i]
        k = kEnd

    assert k == resRowCount

    return pd.DataFrame({
        'Min Ratio': minRatio,
        'Max Ratio': maxRatio,
        'Total Count': count,
        'Total Score': score
    })
Note that this code gives the same results as the code in your question, but the original code does not give the expected results stated in the question. Note also that, since the inputs are integers, I forced NumPy to use integers for the sake of performance (although the algorithm should work with floats too).
This code is hundreds of thousands of times faster than the original code on big dataframes, and it computes the result for a dataframe of 10,000 rows in about 0.7 seconds.
Others have explained why your algorithm was so slow, so I won't dive into that.
Let's take a different approach to your problem. In particular, look at how the Total Count and Total Score columns are calculated:
Calculate the cumulative sum for every row from 1 to n
Calculate the cumulative sum for every row from 2 to n
...
Calculate the cumulative sum for every row from n to n
Since cumulative sums build on one another, we only need to compute them once, for rows 1 to n:
The cumsum of (2 to n) is the cumsum of (1 to n) - (row 1)
The cumsum of (3 to n) is the cumsum of (2 to n) - (row 2)
And so on...
In other words, the current cumsum is the previous cumsum minus its first row, then dropping the first row.
As you have theorized, pandas is a lot slower than numpy, so we will convert everything into numpy for speed:
arr = df[['Ratio', 'Count', 'Score']].to_numpy()   # Convert to numpy array

tmp = np.cumsum(arr[:, 1:3], axis=0)               # calculate cumsum for row 1 to n
tmp = np.insert(tmp, 0, arr[0, 0], axis=1)         # create the Min Ratio column
tmp = np.insert(tmp, 1, arr[:, 0], axis=1)         # create the Max Ratio column

results2 = [tmp]
for i in range(1, len(arr)):
    # current cumsum is the previous cumsum without the first row
    # (.copy() so the block already stored in results2 is not modified in place)
    tmp = results2[-1][1:].copy()
    diff = results2[-1][0]   # the previous cumsum's first row
    tmp -= diff              # adjust the current cumsum (the Ratio columns are overwritten below)
    tmp[:, 0] = arr[i, 0]    # new Min Ratio
    tmp[:, 1] = arr[i:, 0]   # new Max Ratio
    results2.append(tmp)

# Assemble the result
results2 = np.concatenate(results2).reshape(-1, 4)
results2 = pd.DataFrame(results2, columns=['Min Ratio', 'Max Ratio', 'Total Count', 'Total Score'])
During my test, this produces the results for a 10k row data frame in about 2 seconds.
Sorry for replying late to this topic, but I was just looking for a solution to a similar problem. The solution for this issue is simple because the combinations are only in pairs: upload the dataframe to any database and execute the following query, whose duration is less than 10 seconds:
SELECT f1.*, f2.*, f1.score + f2.score
FROM table_with_data_source f1, table_with_data_source f2
WHERE f1.ratio <> f2.ratio;
The database will do it very fast even if there are 100,000 records or more.
However, none of the algorithms that I saw in the answers actually performs a combinatorial of values; they only do it in pairs. The problem really gets complicated when it is a true combinatorial, for example:
Given: a, b, c, d and e as records:
a
b
c
d
e
The real combination would be:
a+b
a+c
a+d
a+e
a+b+c
a+b+d
a+b+e
a+c+d
a+c+e
a+d+e
a+b+c+d
a+b+c+e
a+c+d+e
a+b+c+d+e
b+c
b+d
b+e
b+c+d
b+c+e
b+d+e
c+d
c+e
c+d+e
d+e
This is a real combinatorial, which covers all possible combinations. For this case I have not been able to find a suitable solution, since it really hurts the performance of any hardware. Does anyone have an idea of how to perform a real combinatorial using Python? At the database level it degrades the general performance of the database.
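As a rough sketch of that kind of full enumeration in Python (my addition, not from the original answers; it is only practical for small n, since the number of subsets grows as 2**n), itertools can generate every subset of two or more records:

from itertools import combinations

records = ['a', 'b', 'c', 'd', 'e']

# every subset of size 2, 3, ..., len(records)
all_subsets = [combo
               for r in range(2, len(records) + 1)
               for combo in combinations(records, r)]

for combo in all_subsets:
    print('+'.join(combo))   # a+b, a+c, ..., a+b+c+d+e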

Iterate rows and find sum of rows not exceeding a number

Below is a dataframe showing coordinate values from and to, each row having a corresponding value column.
I want to find the range of coordinates where the value column doesn't exceed 5. Below is the dataframe input.
import pandas as pd
From=[10,20,30,40,50,60,70]
to=[20,30,40,50,60,70,80]
value=[2,3,5,6,1,3,1]
df=pd.DataFrame({'from':From, 'to':to, 'value':value})
print(df)
Hence I want to convert the table above into the following outcome:
from  to  value
  10  30      5
  30  40      5
  40  50      6
  50  80      5
Further explanation:
Coordinates from 10 to 30 are joined and the value column becomes 5, since that is the sum of the values from 10 to 30 (not exceeding 5)
Coordinates 30 to 40 equal 5
Coordinates 40 to 50 equal 6 (more than 5; however, it is included as it cannot be divided further)
The remaining coordinates sum up to a value of 5
What code is required to achieve the above?
We can do a groupby on cumsum:
s = df['value'].ge(5)
(df.groupby([~s, s.cumsum()], as_index=False, sort=False)
.agg({'from':'min','to':'max', 'value':'sum'})
)
Output:
from to value
0 10 30 5
1 30 40 5
2 40 50 6
3 50 80 5
Update: It looks like you want to accumulate the values so that the new groups do not exceed 5. There are several threads on SO saying that this can only be done with a for loop, so we can do something like this:
thresh = 5
groups, partial, curr_grp = [], thresh, 0
for v in df['value']:
    if partial + v > thresh:
        curr_grp += 1
        partial = v
    else:
        partial += v
    groups.append(curr_grp)

df.groupby(groups).agg({'from': 'min', 'to': 'max', 'value': 'sum'})
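For the sample data above, this accumulating version happens to produce the same four groups, and therefore the same output, as the first approach.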

Pandas: Calculating a Z-score to avoid "look ahead" bias

I have time series data in dataframe named "df", and, my code for calculating the z-score is given below:
mean = df.mean()
standard_dev = df.std()
z_score = (df - mean) / standard_dev
I would like to calculate the z-score for each observation using the respective observation and data that was known at the point of recording the observation. i.e. I do not want to use a standard deviation and mean that incorporates data that occurs after a specific point in time. I just want to use data from time t, t-1, t-2....
How do I do this?
Use .expanding() - col being the column you want to compute your statistics for (drop [col] if you wish to compute it for the whole dataframe).
You might need to sort the values by the time column first - denoted time_col here (in case it is not sorted already):
df = df.sort_values("time_col", axis=0)
Then:
df[col].sub(df[col].expanding().mean()).div(df[col].expanding().std())
Ref:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.expanding.html
For the sample data:
import pandas as pd
df=pd.DataFrame({"a": list("xyzpqrstuv"), "b": [6,5,7,1,-9,0,3,5,2,8]})
df["c"]=df["b"].sub(df["b"].expanding().mean()).div(df["b"].expanding().std())
Outputs:
a b c
0 x 6 NaN
1 y 5 -0.707107
2 z 7 1.000000
3 p 1 -1.425880
4 q -9 -1.677484
5 r 0 -0.281450
6 s 3 0.210502
7 t 5 0.534207
8 u 2 -0.046142
9 v 8 1.062430
You could assign two new columns, containing the mean and std of all items up to and including the current one. I assume here that your time series data is in the column 'time_series_data':
import numpy as np

len_ = len(df)
df['mean_past'] = [np.mean(df['time_series_data'][0:lv+1]) for lv in range(len_)]
df['std_past'] = [np.std(df['time_series_data'][0:lv+1]) for lv in range(len_)]
df['z_score'] = (df['time_series_data'] - df['mean_past']) / df['std_past']
Edit: if you want to z-score all columns, you could define a function, that computes the z-score and apply it on all columns of your dataframe:
def z_score_column(column):
    len_ = len(column)
    mean = [np.mean(column[0:lv+1]) for lv in range(len_)]
    std = [np.std(column[0:lv+1]) for lv in range(len_)]
    return [(c - m) / s for c, m, s in zip(column, mean, std)]

df = pd.DataFrame(np.random.rand(10, 5))
df.apply(z_score_column)
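One small caveat (my addition): np.std defaults to the population standard deviation (ddof=0), while the pandas .std() used in the .expanding() answer above defaults to the sample standard deviation (ddof=1), so the two approaches give slightly different z-scores; pass ddof=1 to np.std if you want them to match.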

Comparing values within the same dataframe column

Is there anyway to compare values within the same column of a pandas DataFrame?
The task at hand is something like this:
import pandas as pd
data = pd.DataFrame({"A": [0,-5,2,3,-3,-4,-4,-2,-1,5,6,7,3,-1]});
I need to find the maximum number of consecutive indices over which values of the same sign appear (equivalently, checking consecutive values, because the sign can be encoded as True/False). The above data should yield 5 because there are 5 consecutive negative integers: [-3,-4,-4,-2,-1].
If possible, I was hoping to avoid using a loop, because the number of data points in the column may well run into the millions.
I've tried using data.A.rolling() and its variants, but can't seem to figure out any way to do this in a vectorized fashion.
Any suggestions?
Here's a NumPy approach that computes the max interval lengths for the positive and negative values -
def max_interval_lens(arr):
    # Store mask of positive values
    pos_mask = arr >= 0

    # Get indices of sign shifts
    idx = np.r_[0, np.flatnonzero(pos_mask[1:] != pos_mask[:-1]) + 1, arr.size]

    # Lengths of the constant-sign intervals
    lens = np.diff(idx)

    s = int(pos_mask[0])
    maxs = [0, 0]
    if len(lens) == 1:
        maxs[1-s] = lens[0]
    else:
        maxs = lens[1-s::2].max(), lens[s::2].max()
    return maxs   # Positive, negative max lens
Sample run -
In [227]: data
Out[227]:
A
0 0
1 -5
2 2
3 3
4 -3
5 -4
6 -4
7 -2
8 -1
9 5
10 6
11 7
12 3
13 -1
In [228]: max_interval_lens(data['A'].values)
Out[228]: (4, 5)
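If you prefer to stay in pandas, a minimal sketch of the usual run-length idiom (my addition, not part of the original answer; like the NumPy version it treats 0 as positive) looks like this:

sign = data['A'] >= 0                            # True for non-negative, False for negative
run_id = (sign != sign.shift()).cumsum()         # label each run of equal signs
run_lengths = data.groupby(run_id)['A'].size()   # length of every run

# longest negative run -> 5 for the sample data
run_lengths[~sign.groupby(run_id).first()].max()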

apply pandas qcut function to subgroups

Let us assume we created a dataframe df using the code below. I have created a bin frequency count based on the 'value' column in df. Now how do I get the frequency count of the label == 1 samples within the previously created bins? Obviously, I should not run qcut again on only the label == 1 samples to get the count, since the bin positions would then not be the same as before.
import numpy as np
import pandas as pd
mu, sigma = 0, 0.1
theta = 0.3
s = np.random.normal(mu, sigma, 100)
group = np.random.binomial(1, theta, 100)
df = pd.DataFrame(np.vstack([s,group]).transpose())
df.columns = ['value','label']
factor = pd.qcut(df['value'], 5)
factor_bin_count = pd.value_counts(factor)
Update: I took the solution from jeff
df.groupby(['label',factor]).value.count()
If I understand your question, you want to take a grouping factor (e.g. the one you created using qcut to bin the continuous values) and another grouper (e.g. 'label'), then perform an operation, count in this case.
In [36]: df.groupby(['label',factor]).value.count()
Out[36]:
label value
0 [-0.248, -0.0864] 14
(-0.0864, -0.0227] 15
(-0.0227, 0.0208] 15
(0.0208, 0.0718] 17
(0.0718, 0.24] 13
1 [-0.248, -0.0864] 6
(-0.0864, -0.0227] 5
(-0.0227, 0.0208] 5
(0.0208, 0.0718] 3
(0.0718, 0.24] 7
Name: value, dtype: int64
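If you want the label 0 and label 1 counts per bin side by side, one option (my addition, building on the groupby above) is to unstack the label level, or equivalently build a cross-tab:

df.groupby(['label', factor])['value'].count().unstack('label')

# or
pd.crosstab(factor, df['label'])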
