Finding the next value corresponding to a criterion - python

First, here's part of my data frame
Dates Order Value
0 2010-12-07T10:00:00.000000000Z In 70
1 2010-12-07T14:00:00.000000000Z Out 70
2 2010-12-08T06:00:00.000000000Z In 31
3 2010-12-09T02:00:00.000000000Z In 48
4 2010-12-09T10:00:00.000000000Z In 29
5 2010-12-09T10:00:00.000000000Z In 59
6 2010-12-09T10:00:00.000000000Z Out 31
7 2010-12-09T14:00:00.000000000Z Out 29
8 2010-12-09T14:00:00.000000000Z In 32
9 2010-12-10T06:00:00.000000000Z In 1
10 2010-12-10T10:00:00.000000000Z Out 48
In this code, I'm trying to find a few things:
The first occurrence of a 'In' in the dataframe. For that, I'm using
index_1 = df[df.Order=='In'].first_valid_index()
This will result in 0, that's correct.
Then, I'll find the corresponding Value for that index with
order_1 = df.at[index_1,'Value']
This will result in 70, also correct.
Find the NEXT time the value 70 appears in this dataframe. This is the part I'm struggling with. The values in Value repeat only once, and the second time a value appears will always be on an 'Out' row.
Can anyone help me finish this part of the code?

Using idxmax with boolean indexing:
val = df.loc[df['Order'].eq('In').idxmax(), 'Value']
df[df['Value'].eq(val) & df['Order'].eq('Out')]
Dates Order Value
1 2010-12-07T14:00:00.000000000Z Out 70
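For reference, a minimal runnable sketch of this approach on a toy frame shaped like the question's (the Dates column is omitted for brevity):

```python
import pandas as pd

# toy frame mirroring the question's structure
df = pd.DataFrame({
    "Order": ["In", "Out", "In", "In", "Out"],
    "Value": [70, 70, 31, 48, 31],
})

# idxmax on a boolean mask returns the index of the first True,
# i.e. the first 'In' row
val = df.loc[df["Order"].eq("In").idxmax(), "Value"]

# keep only the 'Out' rows carrying that same value
match = df[df["Value"].eq(val) & df["Order"].eq("Out")]
print(match)
```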

IIUC, we can use index filtering with isin
val = df[
(df["Value"].isin(df[df["Order"].eq("In")]["Value"].head(1)))
& (df["Order"].eq("Out"))
]
print(val)
Dates Order Value
1 2010-12-07T14:00:00.000000000Z Out 70

You can do the following, given that you succeeded in extracting the first index and its value:
import pandas as pd
df = pd.DataFrame({'value': [70,70,10,10,50,60,70]})
index_1 = 0
order_1 = 70
indices = df.index[df['value']==order_1].tolist()
# indices.index(index_1) is the position of index_1 inside the list,
# so the element after it is the next occurrence's dataframe index
next_index = indices[indices.index(index_1) + 1]
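A small extension of that idea, guarding against the case where the value never repeats (the question says it always does, but the guard is cheap):

```python
import pandas as pd

df = pd.DataFrame({'value': [70, 70, 10, 10, 50, 60, 70]})
index_1 = 0   # index of the first occurrence
order_1 = 70  # its value

# all dataframe indices where the value appears, e.g. [0, 1, 6]
indices = df.index[df['value'] == order_1].tolist()
pos = indices.index(index_1) + 1

# guard: the value might not appear again
next_index = indices[pos] if pos < len(indices) else None
print(next_index)  # 1
```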


Returning the location of variables in a range of a pandas dataframe column where a condition is met

I currently have a column in a dataframe, df[Stress]. I want to return the location of the rows in the column where the value stored is less than a variable, load_drop, but only within a certain range of the column, stated by first and last. I figured I could use np.where to find the locations, but so far I'm returning an empty array when I run the code. Here is what I have so far:
df = pd.DataFrame({'Stress': [1,2,3,6,7,8,10,12,14,20,19,17,15,13,12,10,8,7,6,4,1,0]})
first = 10
last = 18
drop = 11
life_array = np.where(df['Stress'].iloc[first:last] < drop)
print (life_array)
[]
Ideally, my desired output would be this:
print(life_array)
0 15
1 16
2 17
3 18
Which is the location of the rows where the condition is met. Can I use np.where and iloc in such a fashion?
IIUC need 2 steps - first filter by position and then by mask:
s = df['Stress'].iloc[first:last]
life_array = s[s < drop]
print (life_array)
15 10
16 8
17 7
Name: Stress, dtype: int64
If need indices:
first = 10
last = 18
drop = 11
s = df['Stress'].iloc[first:last + 1]
life_array = s.index[s < drop].tolist()
print (life_array)
[15, 16, 17, 18]
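To answer the np.where part of the question directly: it does work, but np.where reports positions relative to the slice, so `first` has to be added back. A sketch:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Stress': [1,2,3,6,7,8,10,12,14,20,19,17,15,13,12,10,8,7,6,4,1,0]})
first, last, drop = 10, 18, 11

# np.where returns positions *within* the slice,
# so add `first` to recover the original row positions
rel = np.where(df['Stress'].iloc[first:last + 1] < drop)[0]
life_array = (rel + first).tolist()
print(life_array)  # [15, 16, 17, 18]
```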

changing index of 1 row in pandas

I have the below df, built from a pivot of a larger df. In this table 'week' is the index (dtype = object) and I need to show week 53 as the first row instead of the last.
Can someone advise please? I tried reindex and custom sorting but can't find a way.
Thanks!
here is the table
Since you can't insert the row and push the others back directly, a clever trick is to create a new order:
# adds a new column, "new" with the original order
df['new'] = range(1, len(df) + 1)
# sets value that has index 53 with 0 on the new column
# note that this comparison requires you to match index type
# so if weeks are object, you should compare df.index == '53'
df.loc[df.index == 53, 'new'] = 0
# sorts values by the new column and drops it
df = df.sort_values("new").drop('new', axis=1)
Before:
numbers
weeks
1 181519.23
2 18507.58
3 11342.63
4 6064.06
53 4597.90
After:
numbers
weeks
53 4597.90
1 181519.23
2 18507.58
3 11342.63
4 6064.06
One way of doing this would be:
import pandas as pd
df = pd.DataFrame(range(10))
new_df = df.loc[[df.index[-1]]+list(df.index[:-1])].reset_index(drop=True)
output:
0
9 9
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
Alternate method:
new_df = pd.concat([df[df["Year week"]==53], df[~(df["Year week"]==53)]])
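If the weeks live in the index, as in the question, the same move-to-front idea can also be written with reindex. A sketch assuming string index labels (since the index dtype is object):

```python
import pandas as pd

df = pd.DataFrame(
    {'numbers': [181519.23, 18507.58, 11342.63, 6064.06, 4597.90]},
    index=pd.Index(['1', '2', '3', '4', '53'], name='weeks'))

# build a new label order with '53' first, then reindex to it
order = ['53'] + [w for w in df.index if w != '53']
df = df.reindex(order)
print(df)
```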

Pandas Dataframe Filter Multiple Conditions

I am looking to filter a dataframe to only include values that are equal to a certain value, or greater than another value.
Example dataframe:
0 1 2
0 0 1 23
1 0 2 43
2 1 3 54
3 2 3 77
From here, I want to pull all values from column 0, where column 2 is either equal to 23, or greater than 50 (so it should return 0, 1 and 2). Here is the code I have so far:
df = df[(df[2]==23) & (df[2]>50)]
This returns nothing. However, when I split these apart and run them individually (df = df[df[2]==23] and df = df[df[2]>50]), then I do get results back. Does anyone have any insights into how to get this to work?
As you said, it's or (|), not and (&):
df = df[(df[2]==23) | (df[2]>50)]
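A runnable sketch on the example frame, confirming the expected column-0 values:

```python
import pandas as pd

df = pd.DataFrame([[0, 1, 23], [0, 2, 43], [1, 3, 54], [2, 3, 77]])

# rows where column 2 equals 23 OR exceeds 50, then take column 0
out = df[(df[2] == 23) | (df[2] > 50)][0]
print(out.tolist())  # [0, 1, 2]
```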

Accessing Pandas groupby() function

I have the below data frame with me after doing the following:
train_X = icon[['property', 'room', 'date', 'month', 'amount']]
train_frame = train_X.groupby(['property', 'month', 'date', 'room']).median()
print(train_frame)
amount
property month date room
1 6 6 2 3195.000
12 3 2977.000
18 2 3195.000
24 3 3581.000
36 2 3146.000
3 3321.500
42 2 3096.000
3 3580.000
54 2 3195.000
3 3580.000
60 2 3000.000
66 3 3810.000
78 2 3000.000
84 2 3461.320
3 2872.800
90 2 3461.320
3 3580.000
96 2 3534.000
3 2872.800
102 3 3581.000
108 3 3580.000
114 2 3195.000
My objective is to track the median amount based on the (property, month, date, room)
I did this:
big_list = [[property, month, date, room], ...]
test_list = [property, month, date, room]
if test_list == big_list:
#I want to get the median amount wrt to that row which matches the test_list
How do I do this?
What I did is, tried the below...
count = 0
test_list = [2, 6, 36, 2]
for j in big_list:
if test_list == j:
break
count += 1
Now, after getting the count, how do I access the median amount by count in the dataframe? Is there a way to access a dataframe by index?
Please note:
big_list is the list of lists where each list is [property, month, date, room] from the above dataframe
test_list is an incoming list to be matched with the big_list in case it does.
Answering the last question:
Is their a way to access dataframe by index?
Of course there is - you should use df.iloc or df.loc,
depending on whether you want to access purely by integer position (I guess this is the situation) - then use iloc - or by, for example, a string index - then use loc.
Documentation:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.iloc.html
Edit:
Coming back to the question.
I assume 'amount' is the median you're searching for, then.
You can use reset_index() method on grouped dataframe, like
train_frame_reset = train_frame.reset_index()
and then you can again access your column names, so you should be able to do the following (assuming j is the index of the found row):
train_frame_reset.iloc[j]['amount'] will give you the median.
If I understand your problem correctly you don't need to count at all, you can access the values via loc directly.
Look at:
A = pd.DataFrame([[5,6,9],[5,7,10],[6,3,11],[6,5,12]], columns=['lev0','lev1','val'])
Then you did:
test=A.groupby(['lev0','lev1']).median()
Accessing, say, the median of the group lev0=6 and lev1=5 can be done via:
test.loc[6,5]
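Tying this back to the question's test_list: because the grouped frame's index is a MultiIndex, an incoming [lev0, lev1] list can be looked up directly, no counting loop needed. A sketch on the same toy frame:

```python
import pandas as pd

A = pd.DataFrame([[5, 6, 9], [5, 7, 10], [6, 3, 11], [6, 5, 12]],
                 columns=['lev0', 'lev1', 'val'])
test = A.groupby(['lev0', 'lev1']).median()

# an incoming key, e.g. [lev0, lev1]; tuple() turns it into a MultiIndex label
test_list = [6, 5]
median = test.loc[tuple(test_list), 'val']
print(median)  # 12.0
```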

Remove row whose timestamp is in previous sliding window via Pandas in Python

Here's a problem for cleansing my data. The dataframe looks as below:
timestamp
0 10
1 12
2 23
3 25
4 27
5 34
6 45
What I intend to do is iterating through timestamps from top to down, grab one if no previous timestamp is taken (for initialization, It'll take '10'), then omit every row whose timestamp is between [10, 10+10], including '12'. Likewise, we should take '23' and omit '25', '27' since they are between [23, 23+10]. Finally, '34' and '45' should be taken as well.
Eventually, the result would be
timestamp
0 10
2 23
5 34
6 45
Could anyone give me some idea to realize this in Pandas? Great thanks!
I don't believe there is a way to solve this custom problem using a groupby like construct, but here is a coding solution that gives you the index location and timestamp values.
stamps = [df.timestamp.iat[0]]
index = [df.index[0]]
for idx, ts in df.timestamp.items():  # iteritems() was removed in pandas 2.0
if ts >= stamps[-1] + 10:
index.append(idx)
stamps.append(ts)
>>> index
[0, 2, 5, 6]
>>> stamps
[10, 23, 34, 45]
>>> df.iloc[index]
timestamp
0 10
2 23
5 34
6 45
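The loop above can be wrapped into a small reusable helper; a sketch, with the window size as a parameter:

```python
import pandas as pd

def drop_within_window(df, col='timestamp', gap=10):
    """Keep a row only if its timestamp is at least `gap`
    past the last kept timestamp."""
    keep = []
    last = None
    for idx, ts in df[col].items():
        if last is None or ts >= last + gap:
            keep.append(idx)
            last = ts
    return df.loc[keep]

df = pd.DataFrame({'timestamp': [10, 12, 23, 25, 27, 34, 45]})
print(drop_within_window(df))
```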
I am not sure if I understood correctly about the initialization, but see if this helps you:
df = pd.read_csv("data.csv")
gap = 10
actual = 0
for timestamp in df['timestamp']:
if timestamp >= (actual+gap):
print(timestamp)
actual = timestamp
if you want to create a new DF:
df = pd.read_csv("data.csv")
gap = 10
actual = 0
index = []
for i, timestamp in enumerate(df['timestamp']):
if timestamp >= (actual+gap):
actual = timestamp
else:
index.append(i)
new_df = df.drop(df.index[index])
