Using two series X and Y, I check whether one is bigger than the other. With loc, I can get the index of my series where X > Y is True. For example:
X.loc[X>Y]
Using this indexing, I want to shift the indices by n periods. For instance, if X.loc[X>Y] gives us {1,5,8,9}, I am interested in shifting these to {1+2,5+2,8+2,9+2}. I would appreciate any advice on this matter!
You could use numpy.nonzero to get the indices and then shift them:
import numpy

# two random arrays as an example
X = numpy.random.random(100)
Y = numpy.random.random(100)

# indices where the condition holds
ids = numpy.nonzero(X > Y)[0]
print(ids)
print(ids + 2)  # shifted by 2 periods
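If X and Y are pandas Series, as the question suggests, the same idea works on the index itself; a minimal sketch (the random Series here are just stand-ins for the question's data):

import numpy as np
import pandas as pd

X = pd.Series(np.random.random(100))
Y = pd.Series(np.random.random(100))

# the index where X > Y, shifted by 2 periods
shifted = X.loc[X > Y].index + 2
print(shifted)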
Having issues with plotting values above a set threshold using a pandas dataframe.
I have a dataframe that has 21453 rows and 20 columns, and one of the columns is just 1 and 0 values. I'm trying to plot this column using the following code:
lst1 = []
for x in range(0, len(df)):
    if df_smooth['Active'][x] == 1:
        lst1.append(df_smooth['Time'][x])

plt.plot(df_smooth['Time'], df_smooth['CH1'])
plt.plot(df_smooth['Time'], lst1)
But I get the following error:
x and y must have same first dimension, but have shapes (21453,) and (9,)
Any suggestions on how to fix this?
The error is probably the result of the line plt.plot(df_smooth['Time'], lst1): lst1 is a subset of df_smooth['Time'], while df_smooth['Time'] is the full series, so the two have different lengths.
One solution is to also build a filtered x version, for example:
lst_X = []
lst_Y = []
for x in range(0, len(df)):
    if df_smooth['Active'][x] == 1:
        lst_X.append(df_smooth['Time'][x])
        lst_Y.append(df_smooth['Time'][x])  # replace 'Time' with the actual y-column if different
Another option is to build a sub-DataFrame:
sub_df = df_smooth[df_smooth['Active']==1]
plt.plot(sub_df['Time'], sub_df['Time'])
(assuming the correct Y column is Time; otherwise just replace it with the correct column)
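For instance, assuming CH1 is the intended y-column (an assumption based on the question's first plot call), the sub-DataFrame version would look like:

sub_df = df_smooth[df_smooth['Active'] == 1]
# plot the full CH1 trace, then only the active points on top
plt.plot(df_smooth['Time'], df_smooth['CH1'])
plt.plot(sub_df['Time'], sub_df['CH1'])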
It seems like you are trying to plot two data series of different lengths using the plt.plot() function; this causes the error because plt.plot() expects both series to have the same length.
You will need to ensure that both data series have the same length before trying to plot them. One way to do this is to create a new list with the same number of elements as the df_smooth['Time'] data series, and then fill it with the corresponding values from the lst1 data series.
# Create a new list with the same length as the 'Time' data series
lst2 = [0] * len(df_smooth['Time'])

# Copy the values of 'lst1' to the corresponding indices of 'lst2'
for x in range(0, len(lst1)):
    lst2[x] = lst1[x]

# Plot the 'Time' and 'lst2' data series using the plt.plot() function
plt.plot(df_smooth['Time'], df_smooth['CH1'])
plt.plot(df_smooth['Time'], lst2)
I think this should work.
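A hedged alternative worth mentioning: instead of padding, you can mask the inactive points with NaN, which matplotlib skips when drawing, so x and y always keep the same length (this assumes the same column names as the question):

import numpy as np

# keep the full length, but blank out rows where Active != 1
y_masked = df_smooth['Time'].where(df_smooth['Active'] == 1, np.nan)
plt.plot(df_smooth['Time'], df_smooth['CH1'])
plt.plot(df_smooth['Time'], y_masked)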
I have a pandas DataFrame indexed by ID and sorted by value. I want to create a sample of n=20000 pairs (40000 rows in total), where the 2 rows of each pair are consecutive. I then want to perform additional calculations on these 2 consecutive/paired rows.
e.g. if I set the sample size to n=2, I want to randomly pick pairs like the following and find the difference in distance within each pick.
Additional condition: the value difference within a pair can't exceed 4000.
index value distance
cg13869341 15865 1.635450
cg14008030 18827 4.161332
Then the distance of the following pair, and so on:
cg20826792 29425 0.657369
cg33045430 29407 1.708055
Sample original dataframe
index value distance
cg13869341 15865 1.635450
cg14008030 18827 4.161332
cg12045430 29407 0.708055
cg20826792 29425 0.657369
cg33045430 69407 1.708055
cg40826792 59425 0.857369
cg47454306 88407 0.708055
cg60826792 96425 2.857369
I tried using df_sample = df.sample(n=20000), then I got a bit lost trying to figure out how to get the next row for each value in df_sample.
The original shape is (480136, 14).
If it doesn't matter that every pair starts at an even position (which decreases randomness a bit), you can sample n rows from every other position (df.loc[::2]) and add the row that follows each:
N = 20000

# get the indices of N random rows among the even positions
idx = df.loc[::2].sample(n=N).index

# create a boolean mask to identify the sampled rows
m = df.index.to_series().isin(idx)

# select those rows OR the ones right after them
df_sample = df.loc[m | m.shift(fill_value=False)]
Example output on the toy DataFrame (N=3):
index value distance
2 cg12045430 29407 0.708055
3 cg20826792 29425 0.657369
4 cg33045430 69407 1.708055
5 cg40826792 59425 0.857369
6 cg47454306 88407 0.708055
7 cg60826792 96425 2.857369
Increasing randomness
The drawback of the above approach is the bias toward pairs that start at even positions. To overcome this, we can first remove a random fraction of the DataFrame: small enough to still leave enough rows to pick from, but large enough to shift many pairs from (even, odd) to (odd, even) positions. The fraction of rows to remove should be tuned to the initial size and the sample size; I used 20-30% here:
N = 20000
frac = 0.2

idx = (df
   .drop(df.sample(frac=frac).index)  # drop a random 20% of the rows
   .loc[::2].sample(n=N)              # sample N of the remaining even positions
   .index
)
m = df.index.to_series().isin(idx)
df_sample = df.loc[m | m.shift(fill_value=False)]

# check:
# len(df_sample)
# 40000
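As a follow-up, the per-pair calculations from the question could then be done by reshaping the sample two rows at a time; a hypothetical sketch (column names from the question, with the constraint checked rather than enforced):

import numpy as np

# each consecutive pair of rows in df_sample becomes one row of 2 values
pair_values = df_sample['value'].to_numpy().reshape(-1, 2)
value_diff = np.abs(pair_values[:, 1] - pair_values[:, 0])

# the question's additional condition: difference must not exceed 4000
valid_pairs = value_diff <= 4000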
Here's my first attempt. (I only just noticed your additional constraint, and I'm not sure if you need the precise number of samples; if so, you'll have to do some fudging after the filtering step below.)
import random
import numpy as np

# Temporarily reset the index so we have something we can add one to.
df = df.reset_index(level=0)

# Choose the first index of each pair.
# Use random.sample if you don't want repeats,
# or random.choices if you don't mind them.
# The code below does allow overlapping pairs such as (1, 2) and (2, 3).
first_indices = np.array(random.sample(sorted(df.index[:-1]), 4))

# Keep only the indices where the diff with the next row down is small enough.
mask = [abs(df.loc[i, "value"] - df.loc[i + 1, "value"]) <= 4000 for i in first_indices]
first_indices = first_indices[mask]

# Interleave this array with the same numbers, plus 1.
c = np.empty((first_indices.size * 2,), dtype=first_indices.dtype)
c[0::2] = first_indices
c[1::2] = first_indices + 1

# Filter.
df_sample = df[df.index.isin(c)]

# Restore the original index if required.
df = df.set_index("index")
Hope that helps. Regarding the step where I use a mask to filter first_indices, this answer might be of help if you need faster alternatives: Filtering (reducing) a NumPy Array
I have been trying to build a neural network. To do so, I have to divide the data into x and y (my dataset was converted to NumPy).
The data for x is the 1st column, which I have extracted successfully, but when I try to extract the 2nd column I get both the x and y values for y.
Here is the code I used to divide the data:
data=np.genfromtxt("/home/crpsm/Pycharm/DataSet/headbrain.csv",delimiter=',')
x=data[:,:1]
y=data[:, :2]
Here's the output of x and y:
x:
[[3738.]
[4261.]
[3777.]
[4177.]
[3585.]
[3785.]
[3559.]
[3613.]
[3982.]
[3443.]
y:
[[3738. 1297.]
[4261. 1335.]
[3777. 1282.]
[4177. 1590.]
[3585. 1300.]
[3785. 1400.]
[3559. 1255.]
[3613. 1355.]
[3982. 1375.]
[3443. 1340.]
Please tell me how to fix this error. Thanks in advance!
You may want to review the numpy indexing documentation.
To get the second column in the same shape as x, use y=data[:, 1:2].
Note: you are creating 2d arrays with this indexing (shape of (len(data), 1)). If you want 1d arrays, just use integers, not slices, for the second term:
x = data[:, 0]
y = data[:, 1]
What @w-m said in their answer is correct. You are currently assigning all rows (the first :) and all columns from zero up to (but excluding) column one to x (with :1), and all rows and all columns from zero up to (but excluding) column two to y (with :2).
x = data[:, 0]
y = data[:, 1]
is one way to do this properly, but a nicer and more succinct way is to use tuple unpacking:
x, y = data.T
This transposes (.T) the data, i.e. the two dimensions are exchanged, after which the first dimension has length two. If your actual data has more columns than that, you can use:
x, y, *rest = data.T
In this case rest will be a list of the remaining columns. This syntax was introduced in Python 3.0.
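A tiny self-contained illustration of the unpacking (the array here is made up purely for demonstration):

import numpy as np

data = np.arange(12).reshape(4, 3)  # hypothetical 3-column data
x, y, *rest = data.T
print(x)     # first column: [0 3 6 9]
print(y)     # second column: [ 1  4  7 10]
print(rest)  # list holding the remaining column(s)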
I have a numpy array a, a.shape=(17,90,144). I want to find the maximum magnitude of each column of cumsum(a, axis=0), but retaining the original sign. In other words, if for a given column a[:,j,i] the largest magnitude of cumsum corresponds to a negative value, I want to retain the minus sign.
The code np.amax(np.abs(a.cumsum(axis=0))) gets me the magnitude, but doesn't retain the sign. Using np.argmax instead will get me the indices I need, which I can then plug into the original cumsum array. But I can't find a good way to do so.
The following code works, but is dirty and really slow:
max_mag_signed = np.zeros((90, 144))
indices = np.argmax(np.abs(a.cumsum(axis=0)), axis=0)
for j in range(90):
    for i in range(144):
        max_mag_signed[j, i] = a.cumsum(axis=0)[indices[j, i], j, i]
There must be a cleaner, faster way to do this. Any ideas?
I can't find any alternative to argmax, but you can at least speed things up with a more vectorized approach:
# store the cumsum, since it's used multiple times
cum_a = a.cumsum(axis=0)
# find the indices as before
indices = np.argmax(abs(cum_a), axis=0)
# construct the indices for the second and third dimensions
y, z = np.indices(indices.shape)
# get the values with np indexing
max_mag_signed = cum_a[indices, y, z]
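On newer NumPy versions (1.15+), np.take_along_axis can express the same gather in one call; a minimal sketch under that version assumption:

# same result as the fancy indexing above, provided NumPy >= 1.15
max_mag_signed = np.take_along_axis(cum_a, indices[None, :, :], axis=0)[0]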
I have a 4-dimensional array, i.e., data.shape = (20,30,33,288). I am finding the index of the value closest to n along axis 1 using
index = abs(data - n).argmin(axis=1), so
index.shape = (20,33,288), with the indices varying.
I would like to use data[index] = "values" with values.shape = (20,33,288), but data[index] returns the error "index (8) out of range (0<=index<1) in dimension 0", or the operation takes a relatively long time to compute and returns a matrix whose shape doesn't seem to make sense.
How do I return an array of the correct values? I.e.,
data[index] = "values" with values.shape = (20,33,288)
This seems like a simple problem, is there a simple answer?
I would eventually like to find index2 = abs(data - n2).argmin(axis=1), so I can perform an operation, say summing the data at index with the data at index2, without looping through the variables. Is this possible?
I am using python2.7 and numpy version 1.5.1.
You should be able to access the values selected by index using numpy.indices():
x, z, t = numpy.indices(index.shape)
data[x, index, z, t]
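The same index triple also supports assignment, which is what the question ultimately wants; a minimal self-contained sketch (n and the random data here are placeholders, not the question's actual values):

import numpy as np

data = np.random.rand(20, 30, 33, 288)
n = 0.5
index = np.abs(data - n).argmin(axis=1)       # shape (20, 33, 288)

x, z, t = np.indices(index.shape)
closest = data[x, index, z, t]                # gather the closest values
data[x, index, z, t] = np.zeros(index.shape)  # or assign new values in place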
If I understood you correctly, this should work:
numpy.put(data, index, values)
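One caveat worth hedging here: numpy.put indexes into the flattened array, so a multi-dimensional index like the one above would first need converting, e.g. with np.ravel_multi_index (available in NumPy releases newer than the 1.5.1 mentioned in the question):

# x, z, t as computed with np.indices(index.shape) above
flat = np.ravel_multi_index((x, index, z, t), data.shape)
np.put(data, flat, values)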
I learned something new today, thanks.