Getting the row count until the highest date from pandas - python

I have df like below I want to create dayshigh column. This column will show the row counts until the highest date.
date high
05-06-20 1.85
08-06-20 1.88
09-06-20 2
10-06-20 2.11
11-06-20 2.21
12-06-20 2.17
15-06-20 1.99
16-06-20 2.15
17-06-20 16
18-06-20 9
19-06-20 14.67
should be like:
date high dayshigh
05-06-20 1.85 nan
08-06-20 1.88 1
09-06-20 2 2
10-06-20 2.11 3
11-06-20 2.21 4
12-06-20 2.17 0
15-06-20 1.99 0
16-06-20 2.15 1
17-06-20 16 8
18-06-20 9 0
19-06-20 14.67 1
using the below code but showing error somehow:
df["DaysHigh"] = np.repeat(0, len(df))
for i in range(0, len(df)):
for j in range(df["DaysHigh"][i].index, len(df)):
if df["high"][i] > df["high"][i-1]:
df["DaysHigh"][i] = df["DaysHigh"][i-1] + 1
else:
df["DaysHigh"][i] = 0
At which point am I doing wrong? Thank you

Is the dayshigh number for 17-06-20 supposed to be 2 instead of 8? If so, you can basically use the code you had already written here. There are three changes I'm making below:
starting i from 1 instead of 0 to avoid trying to access the -1th element
removing the loop over j (doesn't seem to be necessary)
using loc to set the values instead of df["high"][i] -- you'll see this should resolve the warnings about copies and slices.
Keeping first line same as before,
for i in range(1, len(df)):
if df["high"][i] > df["high"][i-1]:
df.loc[i,"DaysHigh"] = df["DaysHigh"][i-1] + 1
else:
df.loc[i,"DaysHigh"] = 0

procedure
Use pandas.shift() to create a column for the next row of comparison results.
calculate the cumulative sum of its created columns
delete the columns if they are not needed
df['tmp'] = np.where(df['high'] >= df['high'].shift(), 1, np.NaN)
df['dayshigh'] = df['tmp'].groupby(df['tmp'].isna().cumsum()).cumsum()
df.drop('tmp', axis=1, inplace=True)
df
date high dayshigh
0 05-06-20 1.85 NaN
1 08-06-20 1.88 1.0
2 09-06-20 2.00 2.0
3 10-06-20 2.11 3.0
4 11-06-20 2.21 4.0
5 12-06-20 2.17 NaN
6 15-06-20 1.99 NaN
7 16-06-20 2.15 1.0
8 17-06-20 16.00 2.0
9 18-06-20 9.00 NaN
10 19-06-20 14.67 1.0

Well, I think I did, here is my solution:
df["DaysHigh"] = np.repeat(0, len(df))
for i in range(0, len(df)):
#for i in range(len(df)-1000, len(df)):
for j in reversed(range(i)):
if df["high"][i] > df["high"][j]:
df["DaysHigh"][i] = df["DaysHigh"][i] + 1
else:
break
print(df)
date high dayshigh
05-06-20 1.85 nan
08-06-20 1.88 1
09-06-20 2.00 2
10-06-20 2.11 3
11-06-20 2.21 4
12-06-20 2.17 0
15-06-20 1.99 0
16-06-20 2.15 1
17-06-20 16.00 8
18-06-20 9.00 0
19-06-20 14.67 1

Related

Take average of range entities and replace it in pandas column

I have dataframe where one column looks like
Average Weight (Kg)
0.647
0.88
0
0.73
1.7 - 2.1
1.2 - 1.5
2.5
NaN
1.5 - 1.9
1.3 - 1.5
0.4
1.7 - 2.9
Reproducible data
df = pd.DataFrame([0.647,0.88,0,0.73,'1.7 - 2.1','1.2 - 1.5',2.5 ,np.NaN,'1.5 - 1.9','1.3 - 1.5',0.4,'1.7 - 2.9'],columns=['Average Weight (Kg)'])
where I would like to take average of range entries and replace it in the dataframe e.g. 1.7 - 2.1 will be replaced by 1.9 , following code doesn't work TypeError: 'float' object is not iterable
np.where(df['Average Weight (Kg)'].str.contains('-'), df['Average Weight (Kg)'].str.split('-')
.apply(lambda x: statistics.mean((list(map(float, x)) ))), df['Average Weight (Kg)'])
Another possible solution, which is based on the following ideas:
Convert column to string.
Split each cell by \s-\s.
Explode column.
Convert back to float.
Group by and mean.
df['Average Weight (Kg)'] = df['Average Weight (Kg)'].astype(
str).str.split(r'\s-\s').explode().astype(float).groupby(level=0).mean()
Output:
Average Weight (Kg)
0 0.647
1 0.880
2 0.000
3 0.730
4 1.900
5 1.350
6 2.500
7 NaN
8 1.700
9 1.400
10 0.400
11 2.300
edit: slight change to avoid creating a new column
You could go for something like this (renamed your column name to avg, cause it was long to type :-) ):
new_average =(df.avg.str.split('-').str[1].astype(float) + df.avg.str.split('-').str[0].astype(float) ) / 2
df["avg"] = new_average.fillna(df.avg)
yields for avg:
0 0.647
1 0.880
2 0.000
3 0.730
4 1.900
5 1.350
6 2.500
7 NaN
8 1.700
9 1.400
10 0.400
11 2.300
Name: avg2, dtype: float64

Pandas pivots index as columns using groupby-apply

I have come across some strange behavior in Pandas groupby-apply that I am trying to figure out.
Take the following example dataframe:
import pandas as pd
import numpy as np
index = range(1, 11)
groups = ["A", "B"]
idx = pd.MultiIndex.from_product([index, groups], names = ["index", "group"])
np.random.seed(12)
df = pd.DataFrame({"val": np.random.normal(size=len(idx))}, index=idx).reset_index()
print(df.tail().round(2))
index group val
15 8 B -0.12
16 9 A 1.01
17 9 B -0.91
18 10 A -1.03
19 10 B 1.21
And using this framework (which allows me to execute any arbitrary function within a groupby-apply):
def add_two(x):
return x + 2
def pd_groupby_apply(df, out_name, in_name, group_name, index_name, function):
def apply_func(df):
if index_name is not None:
df = df.set_index(index_name).sort_index()
df[out_name] = function(df[in_name].values)
return df[out_name]
return df.groupby(group_name).apply(apply_func)
Whenever I call pd_groupby_apply with the following inputs, I get a pivoted DataFrame:
df_out1 = pd_groupby_apply(df=df,
out_name="test",
in_name="val",
group_name="group",
index_name="index",
function=add_two)
print(df_out1.head().round(2))
index 1 2 3 4 5 6 7 8 9 10
group
A 2.47 2.24 2.75 2.01 1.19 1.40 3.10 3.34 3.01 0.97
B 1.32 0.30 0.47 1.88 4.87 2.47 0.78 1.88 1.09 3.21
However, as soon as my dataframe does not contain full group-index pairs, and I call my pd_groupby_apply function again, I do recieve my dataframe back in the way that I want (i.e. not pivoted):
df_notfull = df.iloc[:-1]
df_out2 = pd_groupby_apply(df=df_notfull,
out_name="test",
in_name="val",
group_name="group",
index_name="index",
function=add_two)
print(df_out2.head().round(2))
group index
A 1 2.47
2 2.24
3 2.75
4 2.01
5 1.19
Why is this? And more importantly, how can I prevent Pandas from pivoting my dataframe when I have full index-group pairs in my dataframe?

Pandas combine two different length of time series dataframe

I am trying to combine two different timeframes of pandas dataframe. The first dataframe has 1 hour timeseries. and the second dataframe has 1 minute timeseries.
1 hour dataframe
get_time value
0 1599739200 123.10
1 1599742800 136.24
2 1599750000 224.14
1 minute dataframe
get_time value
0 1599739200 2.11
1 1599739260 3.11
2 1599739320 3.12
3 1599742800 4.23
4 1599742860 2.22
5 1599742920 1.11
6 1599746400 7.24
7 1599746460 22.10
8 1599746520 2.13
9 1599750000 5.14
10 1599750060 12.10
11 1599750120 21.30
I want to combine those two dataframes, so the value of 1 hour dataframe will be mapped in 1 minute dataframe. if there is no 1 hour value then the mapped value will be nan.
Desired Result:
get_time value 1h mapped value
0 1599739200 2.11 123.10
1 1599739260 3.11 123.10
2 1599739320 3.12 123.10
3 1599742800 4.23 136.24
4 1599742860 2.22 136.24
5 1599742920 1.11 136.24
6 1599746400 7.24 NaN
7 1599746460 22.10 NaN
8 1599746520 2.13 NaN
9 1599750000 5.14 224.14
10 1599750060 12.10 224.14
11 1599750120 21.30 224.14
Basically i want to combine those dataframe with these logic:
if (1m_get_time >= 1h_get_time) and (1m_get_time < 1h_get_time+60minutes)
1h mapped value = 1h value
else:
1h mapped value = nan
Currently i use recursive method. But it takes long time for big size of data. here is the example of dataframe:
dfhigh_ = pd.DataFrame({
'get_time' : [1599739200, 1599742800, 1599750000],
'value' : [123.1, 136.24, 224.14],
})
dflow_ = pd.DataFrame({
'get_time' : [1599739200, 1599739260, 1599739320, 1599742800, 1599742860, 1599742920, 1599746400, 1599746460, 1599746520, 1599750000, 1599750060, 1599750120],
'value' : [2.11, 3.11, 3.12, 4.23, 2.22, 1.11, 7.24, 22.1, 2.13, 5.14, 12.1, 21.3],
})
Floor the get_time from dflow_ to nearest hour representation then use Series.map to map the values from dfhigh_ to dflow_ based on this rounded timestamp:
hr = dflow_['get_time'] // 3600 * 3600
dflow_['mapped_value'] = hr.map(dfhigh_.set_index('get_time')['value'])
get_time value mapped_value
0 1599739200 2.11 123.10
1 1599739260 3.11 123.10
2 1599739320 3.12 123.10
3 1599742800 4.23 136.24
4 1599742860 2.22 136.24
5 1599742920 1.11 136.24
6 1599746400 7.24 NaN
7 1599746460 22.10 NaN
8 1599746520 2.13 NaN
9 1599750000 5.14 224.14
10 1599750060 12.10 224.14
11 1599750120 21.30 224.14
This should work (for edge cases as well):
import pandas as pd
from datetime import datetime
import numpy as np
dfhigh_ = dfhigh_.rename(columns={'value': '1h mapped value'})
df_new = pd.merge(dflow_, dfhigh_, how='outer', on=['get_time'])
df_new.get_time = [datetime.fromtimestamp(x) for x in df_new['get_time']]
for idx,row in df_new.iterrows():
if not np.isnan(row['1h mapped value']):
current_hour, current_1h_mapped_value = row['get_time'].hour, row['1h mapped value']
for sub_idx,sub_row in df_new.loc[(df_new.get_time.dt.hour == current_hour) & np.isnan(df_new['1h mapped value'])].iterrows():
df_new.loc[sub_idx, '1h mapped value'] = current_1h_mapped_value

Loop through Pandas rolling sum (To get the sum of last 100)

I have the following data:
Date Qty
01/01/2019 4.15
02/01/2019 12.39
03/01/2019 14.15
04/01/2019 12.15
05/01/2019 3.26
06/01/2019 6.23
07/01/2019 15.89
08/01/2019 5.55
09/01/2019 12.49
10/01/2019 9.4
11/01/2019 9.11
12/01/2019 9.18
13/01/2019 13.45
14/01/2019 4.52
15/01/2019 0
16/01/2019 0
17/01/2019 8.41
18/01/2019 9.55
19/01/2019 15.43
20/01/2019 16.45
21/01/2019 9.28
22/01/2019 9.55
23/01/2019 7.87
24/01/2019 12.58
25/01/2019 6.12
26/01/2019 6.15
27/01/2019 6.07
28/01/2019 15.53
The output that I'm trying to achieve is this:
Date Window_Sum
01/01/2019
02/01/2019
03/01/2019
04/01/2019
05/01/2019
06/01/2019
07/01/2019
08/01/2019
09/01/2019
10/01/2019
11/01/2019 100.62
12/01/2019 109.8
13/01/2019 110.86
14/01/2019 101.23
15/01/2019 101.23
16/01/2019 101.23
17/01/2019 109.64
18/01/2019 103.78
19/01/2019 112.98
20/01/2019 107.99
21/01/2019 104.78
22/01/2019 104.93
23/01/2019 103.69
24/01/2019 107.09
25/01/2019 113.21
26/01/2019 101.39
27/01/2019 107.46
28/01/2019 105.03
Let me just briefly explain the logic to get the output:
So on 01/01/2019, the Qty is 4.15, and looking back there are no other values, so the cumulative sum is not greater than 100. Hence, the output value is a NULL.
Fast forward to 10/01/2019, the Qty is 9.4, and looking back the cumulative sum is 95.66. Since the cumulative sum is not greater than 100, the output will be a NULL value.
Next, we'll look at 11/01/2019. The Qty here is 9.11, and looking back the cumulative sum is 100.62. Reason why it is 100.62 and not 104.77 is because the sum of Qty from 11/01/2019 to 02/01/2019 (looking backwards), hits 100/slightly above 100 first.
Similarly, at 12/01/2019, the Qty here is 9.18, and looking back the cumulative sum is 100.8 because the sum of Qty from 12/01/2019 to 02/01/2019(looking backwards), hits 100/slightly above 100 first.
Is there a solution which allows a loop into the pandas rolling sum function to achieve this result?
What I'm trying to achieve here is to ensure that once the cumulative sum reaches 100 or slightly over 100, then I will take the value and append it into the "Window_Sum".
Update: Managed to get the code running with help. Here's the solution:
#get last row index
start=len(data)-1
#initialise cumulative sum
cumsum = 0
for i in range(start,-1,-1):
j=i
while cumsum < 100:
cumsum += data.loc[j,'Qty']
if j!=0:
j-=1
else:
cumsum=None
break
data.loc[i,'Window_Sum']=cumsum
cumsum=0
Just use the cumsum() function:
In [7]: df['Window_Sum'] = df['Qty'].cumsum()
In [8]: df.head()
Out[8]:
Date Qty Window_Sum
0 01-Jan-19 4.0 4.0
1 02-Jan-19 1.0 5.0
2 03-Jan-19 6.0 11.0
3 04-Jan-19 3.0 14.0
4 05-Jan-19 3.0 17.0
Hope this is what you were looking for!

Extract values based on timestamps diff conditions in pandas

I have a dataset with timestamps and values for each ID. The number of rows, for each ID, is different and I need a double for loop like this:
for ids in IDs:
for index in Date:
Now, I would like to find the difference between timestamps, for each ID, in these ways:
values between 2 days
values between 7 days
In particular, for each ID
if, from the first value, in the next 2 days there is an increment of at least 0.3 fromt the first value
OR
if, from the first value, in the next 7 days there is a value equals to 1.5*first value
I store that ID in a dataframe, otherwise I store that ID in another dataframe.
Now, my code is the following:
yesDf = pd.DataFrame()
noDf = pd.DataFrame()
for ids in IDs:
for index in Date:
if ((df.iloc[Date - 1]['Date'] - df.iloc[0]['Date']).days <= 2):
if (df.iloc[index]['Val'] - df.iloc[index - 1]['Val'] >= 0.3):
yesDf += IDs['ID']
noDf += IDs['ID']
if ((df.iloc[Date - 1]['Date'] - df.iloc[0]['Date']).days <= 7):
if(df.iloc[Date - 1]['Val'] >= df.iloc[index]['Val'] * 1.5):
yesDf += IDs['ID']
noDf += IDs['ID']
print(yesDf)
print(noDf)
I get these errors:
TypeError: incompatible type for a datetime/timedelta operation [sub]
and
pandas.errors.NullFrequencyError: Cannot shift with no freq
How can I solve this problem?
Thank you
Edit: my dataframe
Val ID Date
2199 0.90 0000.0 2017-12-26 11:00:01
2201 1.35 0001.0 2017-12-26 11:00:01
63540 0.72 0001.0 2018-08-10 11:53:01
68425 0.86 0001.0 2018-10-14 08:33:01
42444 0.99 0002.0 2018-02-01 09:25:53
41474 1.05 0002.0 2018-04-01 08:00:04
42148 1.19 0002.0 2018-07-01 08:50:00
24291 1.01 0004.0 2017-01-01 08:12:02
for exampled: for ID 0001.0 the first value is 1.35, and in the next 2 days I don't have an increment of at least 0.3 from the start value, and in the next 7 days I don't have an increment 1.5times the firsrt value, so it goes in the noDf Dataframe.
Also the dtypes:
Val float64
ID object
Date datetime64[ns]
Surname object
Name object
dtype: object
edit:
after the modified code the results are:
Val ID Date Date_diff_cumsum Val_diff
24719 2.08 0118.0 2017-01-15 08:16:05 1.0 0.36
24847 2.17 0118.0 2017-01-16 07:23:04 1.0 0.45
25233 2.45 0118.0 2017-01-17 08:21:03 2.0 0.73
24749 2.95 0118.0 2017-01-18 09:49:09 3.0 1.23
17042 1.78 0129.0 2018-02-05 22:48:17 0.0 0.35
And it is correct. Now I only need to add the single ID into a dataframe
This answer should work assuming you start from the first value of an ID, so the first timestamp.
First, I added the 'Date_diff_cumsum' column, which stores the difference in days between the first date for the ID and the row's date:
df['Date_diff_cumsum'] = df.groupby('ID').Date.diff().dt.days
df['Date_diff_cumsum'] = df.groupby('ID').Date_diff_cumsum.cumsum().fillna(0)
Then, I add the 'Value_diff' column, which is the difference between the first value for an ID and the row's value:
df['Val_diff'] = df.groupby('ID')['Val'].transform(lambda x:x-x.iloc[0])
Here is what I get after adding the columns for your sample DataFrame:
Val ID Date Date_diff_cumsum Val_diff
0 0.90 0.0 2017-12-26 11:00:01 0.0 0.00
1 1.35 1.0 2017-12-26 11:00:01 0.0 0.00
2 0.72 1.0 2018-08-10 11:53:01 227.0 -0.63
3 0.86 1.0 2018-10-14 08:33:01 291.0 -0.49
4 0.99 2.0 2018-02-01 09:25:53 0.0 0.00
5 1.05 2.0 2018-04-01 08:00:04 58.0 0.06
6 1.19 2.0 2018-07-01 08:50:00 149.0 0.20
7 1.01 4.0 2017-01-01 08:12:02 0.0 0.00
And finally, return the rows which satisfy the conditions in your question:
df[((df['Val_diff']>=0.3) & (df['Date_diff_cumsum']<=2)) |
((df['Val'] >= 1.5*(df['Val']-df['Val_diff'])) & (df['Date_diff_cumsum']<=7))]
In this case, it will return no rows.
yesDf = df[((df['Val_diff']>=0.3) & (df['Date_diff_cumsum']<=2)) |
((df['Val'] >= 1.5*(df['Val']-df['Val_diff'])) & (df['Date_diff_cumsum']<=7))].ID.drop_duplicates().to_frame()
noDf = df[~((df['Val_diff']>=0.3) & (df['Date_diff_cumsum']<=2)) |
((df['Val'] >= 1.5*(df['Val']-df['Val_diff'])) & (df['Date_diff_cumsum']<=7))].ID.drop_duplicates().to_frame()
yesDf contains the IDs that satisfy the condition, and noDf the ones that don't
I hope this answers your question !

Categories