I have a pandas data frame F with a sorted index I. I am interested in knowing about the last change in one of the columns, let's say A. In particular, I want to construct a series with the same index as F, namely I, whose value at i is j where j is the greatest index value less than i such that F[A][j] != F[A][i]. For example, consider the following frame:
   A
1  5
2  5
3  6
4  2
5  2
The desired series would be:
1    NaN
2    NaN
3      2
4      3
5      3
Is there a pandas/numpy idiomatic way to construct this series?
Try this:
df['B'] = np.nan
last = np.nan
for i in range(1, len(df)):
    # when the value changes, the previous row is the last differing index
    if df['A'].iloc[i] != df['A'].iloc[i - 1]:
        last = df.index[i - 1]
    df.loc[df.index[i], 'B'] = last
This creates a new column with the results. Modifying rows while you iterate over them is not a good idea; once the loop is done, you can simply rename or drop columns as you wish.
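If you want to skip the Python-level loop entirely, here is a vectorized sketch (not part of the original answer; it relies on the observation that the desired j only changes at rows where A differs from the previous row, where it equals the previous row's label):
import numpy as np
import pandas as pd

F = pd.DataFrame({'A': [5, 5, 6, 2, 2]}, index=[1, 2, 3, 4, 5])

# True wherever A differs from the row above
changed = F['A'].ne(F['A'].shift())
# the label of the previous row, aligned on F's index
prev_label = pd.Series(F.index, index=F.index).shift()
# at a change point the answer is the previous label;
# between changes, forward-fill carries it along
result = prev_label.where(changed).ffill()
print(result)
# 1    NaN
# 2    NaN
# 3    2.0
# 4    3.0
# 5    3.0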
np.argmax or pd.Series.idxmax on Boolean data can help you find the first (or in this case, last) True value; note that Series.argmax now returns a position, while idxmax returns the index label we want here. You still have to loop over the series in this solution, though.
# Initiate source data
F = pd.DataFrame({'A': [5, 5, 6, 2, 2]}, index=list('fobni'))

# Initiate resulting Series to NaN
# (object dtype so it can hold index labels of any type)
result = pd.Series(np.nan, index=F.index, dtype=object)

for i in range(1, len(F)):
    value_at_i = F['A'].iloc[i]
    values_before_i = F['A'].iloc[:i]
    # Get differences as a Boolean Series
    # (keeping the original index)
    diffs = (values_before_i != value_at_i)
    if diffs.sum() == 0:
        continue
    # Reverse the Series of differences, then take the label
    # of the first True value, i.e. the last difference
    j = diffs[::-1].idxmax()
    result.iloc[i] = j
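For the frame above (note its string index), the result should come out as:
f    NaN
o    NaN
b      o
n      b
i      b
dtype: object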
I am calculating correlations and the data frame I have needs to be filtered.
Starting with the first row and looping through the dataframe all the way to the last row, I am looking to remove the rows below the current one whose values are within X (above or below) of it.
example:
df['y'] has the values 50,51,52,53,54,55,70,71,72,73,74,75
if X = 10, it would start at 50 and see 51,52,53,54,55 as within that ±10 range and delete those rows. 70 would stay, as it is not within that range, and the same test would start again at 70, where 71,72,73,74,75 and their respective rows would be deleted
the filter with X=10 would thus leave us with the rows containing 50 and 70.
It would leave me with a clean dataframe that keeps only the first instance of what is essentially the same observed period. I tried coding a loop to do that, but I am left with the wrong result and am desperate at this point. Hopefully someone can correct the mistake or point me in the right direction.
df6['index'] = df6.index
df6.sort_values('index')
boom = len(dataframe1.index)/3
# Taking initial comparison values from first row
c = df6.iloc[0]['index']
# Including first row in result
filters = [True]
# Skipping first row in comparisons
for index, row in df6.iloc[1:].iterrows():
    if c-boom <= row['index'] <= c+boom:
        filters.append(False)
    else:
        filters.append(True)
        # Updating values to compare based on latest accepted row
        c = row['index']
df2 = df6.loc[filters].sort_values('correlation').drop('index', 1)
df2
[screenshots: output before / output after]
IIUC, your main issue is to filter consecutive values within a threshold.
You can use a custom function that acts on a Series (= column) and returns the list of valid indices:
def consecutive(s, threshold=10):
    prev = float('-inf')
    idx = []
    # s.items() replaces the removed iteritems()
    for i, val in s.items():
        # keep a row only if it is more than `threshold` above the last kept one
        if val - prev > threshold:
            idx.append(i)
            prev = val
    return idx
Example of use:
import pandas as pd
df = pd.DataFrame({'y': [50,51,52,53,54,55,70,71,72,73,74,75]})
df2 = df.loc[consecutive(df['y'])]
Output:
y
0 50
6 70
variant
If you prefer the function to return a boolean indexer, here is a variant:
def consecutive(s, threshold=10):
    prev = float('-inf')
    idx = [False] * len(s)
    # enumerate positions, since the boolean list is positional
    for pos, val in enumerate(s):
        if val - prev > threshold:
            idx[pos] = True
            prev = val
    return idx
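The call pattern is unchanged, since .loc accepts a boolean list as well; a minimal check, reusing the df from the example above:
df2 = df.loc[consecutive(df['y'])]
# same result: rows 0 (y=50) and 6 (y=70)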
I have 2 columns in a dataframe, one named "day_test" and one named "Temp Column". Some of my values in Temp Column are negative, and I want them to be 1 or 2. I've made a for loop with 2 if statements:
for (i, j) in zip(df['day_test'].astype(int), df['Temp Column'].astype(int)):
    if i == 2 and j < 0:
        j = 2
    if i == 1 and j < 0:
        j = 1
I tried printing j so I know the loops are working properly, but the values that I want to change in the dataframe are staying negative.
Thanks
Your code doesn't change the values inside the dataframe; it only rebinds the temporary variable j.
One way to do it is this:
df['day_test'] = df['day_test'].astype(int)
df['Temp Column'] = df['Temp Column'].astype(int)
df.loc[(df['day_test']==1) & (df['Temp Column']<0),'Temp Column'] = 1
df.loc[(df['day_test']==2) & (df['Temp Column']<0),'Temp Column'] = 2
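A quick check on made-up data (the values here are only for illustration):
import pandas as pd

df = pd.DataFrame({'day_test': [1, 2, 1, 2],
                   'Temp Column': [-5, -3, 4, 7]})
df['day_test'] = df['day_test'].astype(int)
df['Temp Column'] = df['Temp Column'].astype(int)
df.loc[(df['day_test']==1) & (df['Temp Column']<0), 'Temp Column'] = 1
df.loc[(df['day_test']==2) & (df['Temp Column']<0), 'Temp Column'] = 2
print(df['Temp Column'].tolist())  # [1, 2, 4, 7]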
Assuming a pandas dataframe like the one in the picture, I would like to fill the NaN values with the values of the other variables similar to it. To be more clear, my variables are
mean_1, mean_2 .... , std_1, std_2, ... min_1, min_2 ...
So I would like to fill the NaN values with the values of the other columns, but not all of them, only those that represent the same metric. In the picture I highlighted 2 NaN values. The first one I would like to fill with the mean obtained from the 'MEAN' variables at row 2, while the second one I would like to fill with the mean obtained from the 'MIN' variables at row 9. Is there a way to do it?
You can find the unique prefixes, iterate through each, and do fillna for the subsets separately:
uniq_prefixes = set([x.split('_')[0] for x in df.columns])
for prfx in uniq_prefixes:
    mask = [col for col in df if col.startswith(prfx)]
    # Transpose is needed because row-wise fillna is not implemented yet
    df.loc[:, mask] = df[mask].T.fillna(df[mask].mean(axis=1)).T
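For instance, with a small frame whose column names follow your mean_/min_ pattern (values made up for illustration):
import numpy as np
import pandas as pd

df = pd.DataFrame({'mean_1': [1.0, np.nan], 'mean_2': [3.0, 4.0],
                   'min_1': [0.0, 1.0], 'min_2': [np.nan, 2.0]})

uniq_prefixes = set([x.split('_')[0] for x in df.columns])
for prfx in uniq_prefixes:
    mask = [col for col in df if col.startswith(prfx)]
    df.loc[:, mask] = df[mask].T.fillna(df[mask].mean(axis=1)).T

print(df)
# min_2 in row 0 becomes 0.0 (row mean of the 'min' columns)
# mean_1 in row 1 becomes 4.0 (row mean of the 'mean' columns)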
Yes, it is possible using a loop. Below is the naive approach, and even fancier ones would not optimise it much (at least I don't see how).
for i, row in df.iterrows():
    sum_means = 0
    n_means = 0
    sum_stds = 0
    n_stds = 0
    fill_mean_idxs = []
    fill_std_idxs = []
    for idx, item in row.items():  # iterate over the row, not `item`
        if idx.startswith('mean') and pd.isna(item):  # NaN is not None
            fill_mean_idxs.append(idx)
        elif idx.startswith('mean'):
            sum_means += float(item)
            n_means += 1
        elif idx.startswith('std') and pd.isna(item):
            fill_std_idxs.append(idx)
        elif idx.startswith('std'):
            sum_stds += float(item)
            n_stds += 1
    # only divide when there is something to fill (avoids ZeroDivisionError)
    if fill_mean_idxs and n_means:
        ave_mean = sum_means / n_means
        for idx in fill_mean_idxs:
            df.loc[i, idx] = ave_mean
    if fill_std_idxs and n_stds:
        ave_std = sum_stds / n_stds
        for idx in fill_std_idxs:
            df.loc[i, idx] = ave_std
The title may not be very clear, but with an example I hope it will make some sense.
I would like to create an output column (called "outputTics") and put a 1 in it 0.21 seconds after a 1 appears in the "inputTics" column.
As you can see, there is no value exactly 0.21 seconds after another value, so I'll put the 1 in the outputTics column two rows after: for example, at index 3 there is a 1 at 11.4 seconds, so I put a 1 in the output column at 11.6 seconds.
If another 1 appears in the "inputTics" column within 0.21 seconds, do not put a 1 in the output column: an example would be index 1 in the input column.
Here is an example of the red column I would like to create.
Here is the code to create the dataframe :
A = pd.DataFrame({"Timestamp": [11.1,11.2,11.3,11.4,11.5,11.6,11.7,11.8,11.9,12.0,
                                12.1,12.2,12.3,12.4,12.5,12.6,12.7,12.8,12.9,13.0],
                  "inputTics":  [0,1,0,1,0,0,0,1,0,0,0,1,1,0,0,0,0,1,1,1],
                  "outputTics": [0,0,0,0,0,1,0,0,0,1,0,0,0,0,1,0,0,0,0,0]})
If your timestamps were real datetimes, you could use pd.Timedelta to avoid Python float-rounding issues.
Create the column with zeros (the dataframe is called df here).
df['outputTics'] = 0
Define a function set_output_tic in the following manner
def set_output_tic(row):
    if row['inputTics'] == 0:
        return 0
    # look at the rows within 0.21 seconds after the current one
    t = row['Timestamp'] + 0.21
    window = df[(df.Timestamp > row['Timestamp']) & (df.Timestamp <= t)].index
    # if another input tick falls inside that window, emit nothing
    for i in window:
        if df.loc[i, 'inputTics'] == 1:
            return 0
    # otherwise mark the last row of the window
    if len(window) > 0:
        df.loc[window[-1], 'outputTics'] = 1
    return 0
then call the above function using df.apply
temp = df.apply(set_output_tic, axis = 1) # temp is practically useless
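Assuming the question's frame was assigned to df beforehand (e.g. df = A.copy()), the result should match the expected column:
print(df['outputTics'].equals(A['outputTics']))  # True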
This was actually kinda tricky, but by playing with indices in numpy you can do it.
# Set timestamp as index for a moment
A = A.set_index(['Timestamp'])
# Find the timestamp indices of inputTics and add your 0.11
input_indices = A[A['inputTics'] == 1].index + 0.11
# Iterate through the indices and find the positions to update outputTics
output_indices = []
for ii in input_indices:
    # Compare indices to the full dataframe's timestamps
    # and return the position of the nearest timestamp at or after ii
    oi = np.argmax((A.index - ii) >= 0)
    output_indices.append(oi)
# Create column of output ticks with 1s in the right places
output_tics = np.zeros(len(A))
output_tics[output_indices] = 1
# Add it to the dataframe
A['outputTics'] = output_tics
# Add condition that if inputTics is 1, outputTics is 0
A['outputTics'] = A['outputTics'] - A['inputTics']
# Clean up negative values (only in the outputTics column)
A.loc[A['outputTics'] < 0, 'outputTics'] = 0
# The first row becomes 1 because of indexing; change to 0
A = A.reset_index()
A.at[0, 'outputTics'] = 0
In a pandas Series, the code should go through the series and stop if one value has increased 5 times in a row. With a simple example it works so far:
list2 = pd.Series([2,3,3,4,5,1,4,6,7,8,9,10,2,3,2,3,2,3,4])

def cut(x):
    y = iter(x)
    for i in y:
        if x[i] < x[i+1] < x[i+2] < x[i+3] < x[i+4] < x[i+5]:
            return x[i]
            break
out = cut(list2)
index = list2[list2 == out].index[0]
So I get the correct output of 1 and index of 5. But if I use a second Series of shape (23999,) instead of (19,), I get the error:
pandas._libs.hashtable.Int64HashTable.get_item
KeyError: 3489660928
You can do something like this:
# compare list2 with the previous values
s = list2.gt(list2.shift())
# looking at last 5 values
s = s.rolling(5).sum()
# select those equal to 5
list2[s.eq(5)]
Output:
10 9
11 10
dtype: int64
The first index where it happens is
s.eq(5).idxmax()
# output 10
Also, you can chain them together:
(list2.gt(list2.shift())
.rolling(5).sum()
.eq(5).idxmax()
)
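If you want the start of the run instead (index 5 and value 1, as in your example), subtract the window length from that result:
start = (list2.gt(list2.shift())
              .rolling(5).sum()
              .eq(5).idxmax()) - 5
print(start, list2[start])  # 5 1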