Get the index of the median value in an array containing NaNs - Python

How can I get the index of the median value for an array which contains NaNs?
For example, I have the array of values [NaN, 2, 5, NaN, 4, NaN, 3, 1] with a corresponding array of errors on those values [np.nan, 0.1, 0.2, np.nan, 0.1, np.nan, 0.4, 0.3]. Then the median is 3, while the error is 0.4.
Is there a simple way to do this?
EDIT: I edited the error array to reflect a more realistic situation. And yes, I am using NumPy.

It's not really clear how you intend to meaningfully extract the error from the median, but if you do happen to have an array such that the median is one of its entries, the corresponding error array is defined at that index, there aren't other entries with the same value as the median, and probably several other disclaimers, then you can do the following:
import numpy as np

a = np.array([np.nan, 2, 5, np.nan, 4, np.nan, 3, 1])
aerr = np.array([np.nan, 0.1, 0.2, np.nan, 0.1, np.nan, 0.4, 0.3])
# median, ignoring NaNs
amedian = np.median(a[np.isfinite(a)])
# find the index of the closest value to the median in a
idx = np.nanargmin(np.abs(a-amedian))
# this is the corresponding "error"
aerr[idx]
EDIT: as @DSM points out, if you have NumPy 1.9 or above, you can simplify the calculation of amedian to amedian = np.nanmedian(a).

numpy has everything you need:
import numpy as np

values = np.array([np.nan, 2, 5, np.nan, 4, np.nan, 3, 1])
errors = np.array([np.nan, 0.1, 0.2, np.nan, 0.1, np.nan, 0.4, 0.3])
# filter
filtered = values[~np.isnan(values)]
# find median
median = np.median(filtered)
# find indexes
indexes = np.where(values == median)[0]
# find errors
errors[indexes] # array([ 0.4])
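One caveat worth noting (my addition, not part of the original answer): with an even number of finite values, np.median averages the two middle entries, so the result may not appear in values at all and np.where returns an empty index array. A minimal sketch of a fallback for that case, reusing the variables above:
indexes = np.where(values == median)[0]
if indexes.size == 0:
    # no exact match (even number of finite values): fall back to the closest entry
    indexes = np.array([np.nanargmin(np.abs(values - median))])
errors[indexes]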

Say your list is named a. You can use this code to build a masked array without NaNs and then take the median with np.ma.median():
import numpy as np

a = [np.nan, 2, 5, np.nan, 4, np.nan, 3, 1]
am = np.ma.masked_array(a, [np.isnan(x) for x in a])
np.ma.median(am)
You can do the same for the errors as well.
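To also recover the corresponding error with this approach, a minimal sketch (my addition, assuming the median value actually occurs in the array):
import numpy as np

a = [np.nan, 2, 5, np.nan, 4, np.nan, 3, 1]
aerr = [np.nan, 0.1, 0.2, np.nan, 0.1, np.nan, 0.4, 0.3]

am = np.ma.masked_array(a, [np.isnan(x) for x in a])
med = np.ma.median(am)                    # 3.0, NaNs are masked out
idx = np.ma.argmin(np.ma.abs(am - med))   # index of the unmasked entry closest to the median
print(aerr[idx])                          # 0.4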

Related

Add list of numbers to an empty list using for loop in Python

I recently hit a wall with a seemingly simple thing, but no matter what I try, I am unable to solve it.
I created a small function that calculates some values and returns a list as its output:
def calc(file):
    # some calculation based on file
    return degradation  # as a list
for example, for file "data1.txt"
degradation = [1,0.9,0.8,0.5]
and for file "data2.txt"
degradation = [1,0.8,0.6,0.2]
Since I have several files on which I want to apply calc(), I want to join the results sideways into an array that has len(degradation) rows and as many columns as I have files. I was planning to do it with a for loop.
For this specific case, something like:
output = 1  , 1
         0.9, 0.8
         0.8, 0.6
         0.5, 0.2
I tried with pandas as well, but without success.
import numpy as np
arr2d = np.array([[1, 2, 3, 4]])
arr2d = np.append(arr2d, [[9, 8, 7, 6]], axis=0).T
I expect an output something like this:
array([[1, 9],
       [2, 8],
       [3, 7],
       [4, 6]])
You can use numpy.hstack() to achieve this.
Imagine you have data from the first two files from the first two iterations of the for loop.
data1.txt gives you
degradation1 = [1,0.9,0.8,0.5]
and data2.txt gives you
degradation2 = [1,0.8,0.6,0.2]
First, you have to convert both lists into lists of lists.
degradation1 = [[i] for i in degradation1]
degradation2 = [[i] for i in degradation2]
This gives the outputs,
print(degradation1)
print(degradation2)
[[1], [0.9], [0.8], [0.5]]
[[1], [0.8], [0.6], [0.2]]
Now you can stack the data using numpy.hstack(); note that it takes a single tuple of arrays.
import numpy
stacked = numpy.hstack((degradation1, degradation2))
This gives the output
array([[1. , 1. ],
       [0.9, 0.8],
       [0.8, 0.6],
       [0.5, 0.2]])
Imagine you have the file data3.txt during the 3rd iteration of the for loop and it gives
degradation3 = [1,0.3,0.6,0.4]
You can follow the same steps as above: convert it to a list of lists, then stack it with stacked.
degradation3 = [[i] for i in degradation3]
stacked = numpy.hstack((stacked, degradation3))
This gives you the output
array([[1. , 1. , 1. ],
       [0.9, 0.8, 0.3],
       [0.8, 0.6, 0.6],
       [0.5, 0.2, 0.4]])
You can continue this for the whole loop.
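Putting it together, a rough sketch of the full loop (calc() and the file names are placeholders taken from the question; adjust to your real files):
import numpy as np

files = ["data1.txt", "data2.txt", "data3.txt"]  # placeholder file names

stacked = None
for file in files:
    degradation = calc(file)              # returns a list, e.g. [1, 0.9, 0.8, 0.5]
    column = [[i] for i in degradation]   # turn the list into a column (list of lists)
    if stacked is None:
        stacked = np.array(column)
    else:
        stacked = np.hstack((stacked, column))

print(stacked)  # len(degradation) rows, one column per file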
Assume my_lists is a list of your lists.
my_lists = [
    [1, 2, 3, 4],
    [10, 20, 30, 40],
    [100, 200, 300, 400]]
result = []
for _ in my_lists[0]:
    result.append([])
for l in my_lists:
    for i in range(len(result)):
        result[i].append(l[i])
for line in result:
    print(line)
The output would be
[1, 10, 100]
[2, 20, 200]
[3, 30, 300]
[4, 40, 400]
As you seem to want to work with plain lists, you can transpose them with zip:
## degradations as list
degradation1 = [1,0.8,0.6,0.2]
degradation2 = [1,0.9,0.8,0.5]
degradation3 = [0.7,0.9,0.8,0.5]
degradations = [degradation1, degradation2, degradation3]
## CORE OF THE ANSWER ##
degradationstransposed = [list(i) for i in zip(*degradations)]
print(degradationstransposed)
[[1, 1, 0.7], [0.8, 0.9, 0.9], [0.6, 0.8, 0.8], [0.2, 0.5, 0.5]]
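If a NumPy array is ultimately what you need, the transposed list of lists converts directly (a small addition to the answer above):
import numpy as np

arr = np.array(degradationstransposed)
print(arr.shape)  # (4, 3): len(degradation) rows, one column per file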

How to find first occurrence of a significant difference in values of a pandas dataframe?

In a Pandas DataFrame, how would I find the first occurrence of a large difference between two values at two adjacent indices?
As an example, if I have a DataFrame column A with data [1, 1.1, 1.2, 1.3, 1.4, 1.5, 7, 7.1, 7.2, 15, 15.1], I would want index holding 1.5, which would be 5. In my code below, it would give me the index holding 7.2, because 15 - 7.2 > 7 - 1.5.
idx = df['A'].diff().idxmax() - 1
How should I fix this problem, so I get the index of the first 'large difference' occurrence?
The main issue is of course how you define a "large difference". Your solution is already good for getting the largest difference; it can be improved by using .diff(-1) and taking absolute values, as shown by Jezrael:
differences = df['A'].diff(-1).abs()
Using absolute values matters if your values are not sorted, in which case you can get negative differences.
Then, you should probably do some clustering on these values and get the smallest index of the cluster with the largest values. Jezrael already showed a heuristic using the largest quartile; however, with only a slight modification of your example it no longer works:
df = pd.DataFrame({'A': [1, 1.05, 1.2, 1.3, 1.4, 1.5, 7, 7.1, 7.2, 15, 15.1]})
differences = df['A'].diff(-1).abs()
idx = differences.index[differences >= differences.quantile(.75)][0]
print(idx, differences[idx])
This returns 1 0.1499999999999999
Here are 3 other heuristics that might work better for you:
If you have a value above which you consider a difference to be “large” (e.g. 1.5):
idx = differences.index[differences >= 1.5][0]
If you know how many large values there are, you can select those and get the smallest index (e.g. 2):
idx = differences.nlargest(2).index.min()
If you know all small values are grouped together (as are all the 0.1 in your example), you can filter what's larger than the mean (or the mean + 1 standard deviation if your “large” values are very close to the smaller ones).
idx = differences.index[differences >= differences.mean()][0]
This works because, contrary to the median, your few large differences pull the mean up significantly (in this example, the mean of the differences is about 1.41, which only the two large jumps exceed).
If you really want to go for proper clustering, you can use the KMeans algorithm from scikit learn:
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=2).fit(differences.values[:-1].reshape(-1, 1))
clusters = pd.Series(kmeans.labels_, index=differences.index[:-1])
idx = clusters.index[clusters.eq(np.squeeze(kmeans.cluster_centers_).argmax())][0]
This classifies the data into 2 classes and puts the classification into a pandas Series. We then filter this series' index, keeping only the cluster with the higher center, and finally take the first element of the filtered index.
One idea is to filter by Series.quantile on the Series of differences, using .diff(-1) to change the direction of the differences and taking absolute values, and then take the first index:
df = pd.DataFrame({'A':[1, 1.1, 1.2, 1.3, 1.4, 1.5, 7, 7.1, 7.2, 15, 15.1]})
x = df['A'].diff(-1).abs()
print (x)
0      0.1
1      0.1
2      0.1
3      0.1
4      0.1
5      5.5
6      0.1
7      0.1
8      7.8
9      0.1
10     NaN
Name: A, dtype: float64
idx = x.index[x >= x.quantile(.75)]
print (idx)
Int64Index([5, 7, 8], dtype='int64')
print (idx[0])
5
If you have a NumPy array (which you can get from any DataFrame column), you can use numpy.diff and numpy.argwhere.
import numpy as np

a = np.array([1, 1.1, 1.2, 1.3, 1.4, 1.5, 7, 7.1, 7.2, 15, 15.1])
diff = np.diff(a)
threshold = 2  # set your threshold
indices = np.argwhere(diff > threshold)  # [[5], [8]]
References:
https://numpy.org/doc/stable/reference/generated/numpy.diff.html
https://numpy.org/doc/stable/reference/generated/numpy.argwhere.html
More Info:
pandas Series.diff computes diff[i] = a[i] - a[i-1], with NaN in the first position, so the result has the same length as the input.
numpy.diff computes diff[i] = a[i+1] - a[i] and returns an array one element shorter than the input.
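A small illustration of the two behaviors (my addition, not part of the original answer):
import numpy as np
import pandas as pd

a = np.array([1, 1.1, 1.2, 1.3, 1.4, 1.5, 7, 7.1, 7.2, 15, 15.1])

pd_diff = pd.Series(a).diff()  # length 11, NaN at position 0, pd_diff[i] = a[i] - a[i-1]
np_diff = np.diff(a)           # length 10, np_diff[i] = a[i+1] - a[i]

print(len(pd_diff), len(np_diff))               # 11 10
print(np.argwhere(np_diff > 2).ravel())         # [5 8]
print(np.argwhere(pd_diff.values > 2).ravel())  # [6 9] -- shifted by one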
Another way is to compare each element against both of its neighbours with a small shift helper:
def shift(a):
    a_r = np.roll(a, 1)   # right shift
    a_l = np.roll(a, -1)  # left shift
    return np.stack([a_l, a_r], axis=1)

a = np.array([1, 1.1, 1.2, 1.3, 1.4, 1.5, 7, 7.1, 7.2, 15, 15.1])
diff = abs(shift(a) - a.reshape(-1, 1))
diff = diff[1:-1]
indices = diff.argmax(axis=0) - 2
a[indices]
array([7. , 1.5])

Searching for a NaN strike

I have a Dataframe in which some columns contain wrong information. This wrong information is always before a longer sequence of NaN values. Let's imagine I have the following dataset:
import pandas as pd
from numpy import nan
d = {'Obs1': [1, 2, 3, 4, 5, 6, 7, 8], 'Obs2': [0.1, 0.1, nan, nan, nan, nan, 100, 101]}
df = pd.DataFrame(data=d)
"Obs1" is without wrong information, while "Obs2" has wrong values before the 4-NaN sequence. Does anyone know how to find such a longer sequence in a timeseries (e.g. an occurence of 4 NaN values), to then fill all previous entries with NaN? To give an example, my desired Output would be:
Output = {'Obs1': [1, 2, 3, 4, 5, 6, 7, 8], 'Obs2': [nan, nan, nan, nan, nan, nan, 100, 101]}
Thanks in advance
For each column, check whether the i-th and (i+1)-th elements are both NaN, and find the largest index i for which that holds.
See the following code.
for col in df.columns:
    cond = df[col].iloc[1:].isnull() + df[col].iloc[:-1].isnull() == 2
    if sum(cond) >= 2:
        df[col].iloc[:cond.index[-1] - 1] = nan
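An alternative sketch (my addition, not from the original answer) that looks for a run of at least 4 consecutive NaNs with a rolling window and blanks out everything up to the end of that run:
import numpy as np
import pandas as pd

d = {'Obs1': [1, 2, 3, 4, 5, 6, 7, 8],
     'Obs2': [0.1, 0.1, np.nan, np.nan, np.nan, np.nan, 100, 101]}
df = pd.DataFrame(data=d)

run_length = 4
for col in df.columns:
    # the rolling sum of the NaN indicator equals run_length where a full NaN run ends
    run_end = df[col].isna().astype(int).rolling(run_length).sum().eq(run_length)
    if run_end.any():
        last_end = run_end[run_end].index[-1]  # end of the last qualifying run
        df.loc[:last_end, col] = np.nan        # blank out the run and everything before it

print(df)  # Obs2 becomes [NaN]*6 + [100, 101]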

Best way converting data in PANDAS DataFrame to matrix in Python

I found one thread about converting a matrix to a pandas DataFrame. However, I would like to do the opposite - I have a pandas DataFrame with time series data of this structure:
row time stamp, batch, value
1, 1, 0.1
2, 1, 0.2
3, 1, 0.3
4, 1, 0.3
5, 2, 0.25
6, 2, 0.32
7, 2, 0.2
8, 2, 0.1
...
What I would like to have is a matrix of values with one row belonging to one batch:
[[0.1, 0.2, 0.3, 0.3],
[0.25, 0.32, 0.2, 0.1],
...]
which I want to plot as heatmap using matplotlib or alike.
Any suggestion?
What you can try is to first group by the desired index:
g = df.groupby("batch")
And then convert this group to an array by aggregating using the list constructor.
The result can then be converted to an array using the .values property (or .as_matrix() function, but this is getting deprecated soon.)
mtr = g.aggregate(list).values
One downside of this method is that it will create arrays of lists instead of a nice array, even if the result would lead to a non-jagged array.
Alternatively, if you know that you get exactly 4 values for every unique value of batch you can just use the matrix directly.
df = df.sort_values("batch")
my_indices = [1, 2] # Or whatever indices you desire.
mtr = df.values[:, my_indices] # or df.as_matrix()
mtr = mtr.reshape(-1, 4) # Only works if you have exactly 4 values for each batch
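To get the heatmap the question asks about, the resulting matrix can be handed straight to matplotlib; a minimal sketch, assuming mtr is the numeric 2-D array built above:
import matplotlib.pyplot as plt

plt.imshow(mtr.astype(float), aspect='auto', cmap='viridis')
plt.colorbar(label='value')
plt.xlabel('position within batch')
plt.ylabel('batch')
plt.show()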
Try crosstab from pandas, pd.crosstab(); you will have to specify a suitable aggfunc.
https://pandas.pydata.org/pandas-docs/version/0.23.4/generated/pandas.crosstab.html
and then .as_matrix()
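A rough sketch of that idea (my addition; the column names are assumptions, and since .as_matrix() has been removed in recent pandas, .to_numpy() is used instead):
import pandas as pd

df = pd.DataFrame({'timestamp': [1, 2, 3, 4, 5, 6, 7, 8],
                   'batch':     [1, 1, 1, 1, 2, 2, 2, 2],
                   'value':     [0.1, 0.2, 0.3, 0.3, 0.25, 0.32, 0.2, 0.1]})

# position of each row within its batch, used as the column key
df['pos'] = df.groupby('batch').cumcount()

# each (batch, pos) cell holds exactly one value, so 'mean' just returns it
table = pd.crosstab(index=df['batch'], columns=df['pos'],
                    values=df['value'], aggfunc='mean')
mtr = table.to_numpy()
print(mtr)  # 2 rows, one per batch: [0.1, 0.2, 0.3, 0.3] and [0.25, 0.32, 0.2, 0.1]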

Multiplying by pattern matching

I have a matrix of the following format:
matrix = np.array([[1, 2, 3, np.nan],
                   [1, np.nan, 3, 4],
                   [np.nan, 2, 3, np.nan]])
and coefficients I want to selectively multiply element-wise with my matrix:
coefficients = np.array([[0.5, np.nan, 0.2, 0.3],
                         [0.3, 0.3, 0.2, np.nan],
                         [np.nan, 0.2, 0.1, np.nan]])
In this case, I would want the first row in matrix to be multiplied with the second row in coefficients, while the second row in matrix would be multiplied with the first row in coefficients. In short, I want to select the row in coefficients that matches row in matrix in terms of where np.nan values are located.
The location of np.nan values will be different for each row in coefficients, as they describe the coefficients for different cases of data availability.
Is there a quick way to do this, that doesn't require writing if-statements for all possible cases?
Approach #1
A quick way would be with NumPy broadcasting -
# Mask of NaNs
mask1 = np.isnan(matrix)
mask2 = np.isnan(coefficients)
# Perform a comparison between each row of mask1 and every row of mask2,
# leading to a 3D array. Look for all-matching rows along the last axis.
# These indicate which rows match between the two input arrays
# (matrix and coefficients). Then find the corresponding indices,
# which give us the matching row pairs between the two arrays.
r,c = np.nonzero((mask1[:,None] == mask2).all(-1))
# Index into arrays with those indices and perform elementwise multiplication
out = matrix[r] * coefficients[c]
Output for given sample data -
In [40]: out
Out[40]:
array([[ 0.3,  0.6,  0.6,  nan],
       [ 0.5,  nan,  0.6,  1.2],
       [ nan,  0.4,  0.3,  nan]])
Approach #2
For performance, reduce each row of the NaN mask to its decimal equivalent, then create a storage array in which we place the rows of matrix and multiply in the rows of coefficients indexed by those decimal equivalents -
R = 2**np.arange(matrix.shape[1])
idx1 = mask1.dot(R)
idx2 = mask2.dot(R)
A = np.empty((idx1.max()+1, matrix.shape[1]))
A[idx1] = matrix
A[idx2] *= coefficients
out = A[idx1]
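The decimal-equivalent trick maps each distinct NaN pattern to a unique integer; a quick illustration (my addition):
import numpy as np

mask = np.isnan(np.array([[1, 2, 3, np.nan],
                          [1, np.nan, 3, 4],
                          [np.nan, 2, 3, np.nan]]))
R = 2**np.arange(mask.shape[1])  # [1, 2, 4, 8]
print(mask.dot(R))               # [8 2 9] -- one id per distinct NaN pattern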
