Replace row value by comparing dates - python

I have a date in a list:
[datetime.date(2017, 8, 9)]
I want to replace the value in the dataframe matching that date with zero.
Dataframe:
Date Amplitude Magnitude Peaks Crests
0 2017-06-21 6.953356 1046.656154 4 3
1 2017-06-27 7.015520 1185.221306 5 4
2 2017-06-28 6.947471 908.115055 2 2
3 2017-06-29 6.921587 938.175153 3 3
4 2017-07-02 6.906078 938.273547 3 2
5 2017-07-03 6.898809 955.718452 6 5
6 2017-07-04 6.876283 846.514852 5 5
7 2017-07-26 6.862897 870.610086 6 5
8 2017-07-27 6.846426 824.403786 7 7
9 2017-07-28 6.831949 813.753420 7 7
10 2017-07-29 6.823125 841.245427 4 3
11 2017-07-30 6.816301 846.603427 5 4
12 2017-07-31 6.810133 842.287006 5 4
13 2017-08-01 6.800645 794.167590 3 3
14 2017-08-02 6.793034 801.505774 4 3
15 2017-08-03 6.790814 860.497395 7 6
16 2017-08-04 6.785664 815.055002 4 4
17 2017-08-05 6.782069 829.607640 5 4
18 2017-08-06 6.778176 819.014799 4 3
19 2017-08-07 6.774587 817.624203 5 5
20 2017-08-08 6.771193 815.101641 4 3
21 2017-08-09 6.765695 772.970000 1 1
22 2017-08-10 6.769422 945.207554 1 1
23 2017-08-11 6.773154 952.422598 4 3
24 2017-08-12 6.770926 826.700122 4 4
25 2017-08-13 6.772816 916.046905 5 5
26 2017-08-14 6.771130 834.881662 5 5
27 2017-08-15 6.769183 826.009391 5 5
28 2017-08-16 6.767313 824.650882 5 4
29 2017-08-17 6.765894 832.752100 5 5
30 2017-08-18 6.766861 894.165751 5 5
31 2017-08-19 6.768392 912.200274 4 3
I have tried this:
for x in range(len(all_details)):
    for y in selected_day:
        m = all_details['Date'] > y
        all_details.loc[m, 'Peaks'] = 0
But I am getting an error:
ValueError: Arrays were different lengths: 32 vs 1
Can anybody suggest the correct way to do it?
Any help would be appreciated.

First, your solution works fine with your sample data.
Another, faster solution is to create each mask in a loop and then reduce them with logical OR (or AND, whichever is needed). This is explained in more detail here.
import datetime
import numpy as np

L = [datetime.date(2017, 8, 9)]
m = np.logical_or.reduce([all_details['Date'] > x for x in L])
all_details.loc[m, 'Peaks'] = 0
In your solution it is better to compare only against the minimal date from the list:
all_details.loc[all_details['Date'] > min(L), 'Peaks'] = 0
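Side note: the question says "matching that date", while both snippets above compare with >. If the goal is to zero only the rows whose date exactly matches an entry in the list, isin is a closer fit; a minimal sketch, using a small stand-in frame in place of all_details:
import datetime
import pandas as pd

# small stand-in for all_details; the real frame has more columns
all_details = pd.DataFrame({
    'Date': [datetime.date(2017, 8, 8), datetime.date(2017, 8, 9)],
    'Peaks': [4, 1],
})
L = [datetime.date(2017, 8, 9)]
m = all_details['Date'].isin(L)   # True only on exact date matches
all_details.loc[m, 'Peaks'] = 0
print(all_details)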

Related

Conditionally keep only one of the duplicates in pandas groupby groups

I have a dataset in this format (it can be downloaded in CSV format from here):
ID DateAcquired DateSent
1 20210518 20220110
1 20210719 20220210
1 20210719 20220310
1 20200420 20220410
1 20210328 20220510
1 20210518 20220610
2 20210108 20220110
2 20210110 20220210
2 20210119 20220310
2 20210108 20220410
2 20200109 20220510
2 20210919 20220610
2 20211214 20220612
2 20210812 20220620
2 20210909 20220630
2 20200102 20220811
2 20200608 20220909
2 20210506 20221005
2 20210130 20221101
3 20210518 20220110
3 20210519 20220210
3 20210520 20220310
3 20210518 20220410
3 20210611 20220510
3 20210521 20220610
3 20210723 20220612
3 20211211 20220620
4 20210518 20220110
4 20210519 20220210
4 20210520 20220310
4 20210618 20220410
4 20210718 20220510
4 20210818 20220610
5 20210518 20220110
5 20210818 20220210
5 20210918 20220310
5 20211018 20220410
5 20211113 20220510
5 20211218 20220610
5 20210631 20221212
6T 20200102 20201101
6T 20200102 20201101
6T 20200102 20201101
6T 20210405 20220610
6T 20210606 20220611
I am doing groupby:
data.groupby(['ID','DateAcquired'])
For each unique combination of ID and DateAcquired, I am only interested in keeping one DateSent, and that is the newest one. In other words, if a unique combination of ID and DateAcquired has two DateSent values available, only take the one where DateSent is the largest/newest. This operation should apply only if ID is NOT 6T.
I am out of ideas on how to do this. Is there an easy way of doing it with pandas?
You can filter the rows whose ID is not equal to 6T, take the row with the maximum DateSent per group via DataFrameGroupBy.idxmax, and then append the 6T rows back to the output:
m = df['ID'].ne('6T')
df = (df.loc[df[m].groupby(['ID','DateAcquired'])['DateSent'].idxmax()]
        .append(df[~m], ignore_index=True))
Solution with sorting and removing duplicates:
m = df['ID'].ne('6T')
df = (df[m].sort_values(['ID','DateAcquired','DateSent'], ascending=[True, True, False])
           .drop_duplicates(subset=['ID','DateAcquired'])
           .append(df[~m], ignore_index=True))
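Note: DataFrame.append was removed in pandas 2.0, so on recent versions the same idea is written with pd.concat; a sketch with a tiny stand-in frame:
import pandas as pd

# tiny stand-in for the question's frame
df = pd.DataFrame({'ID': ['1', '1', '6T', '6T'],
                   'DateAcquired': [20210518, 20210518, 20200102, 20200102],
                   'DateSent': [20220110, 20220610, 20201101, 20201101]})

m = df['ID'].ne('6T')
out = pd.concat([df.loc[df[m].groupby(['ID', 'DateAcquired'])['DateSent'].idxmax()],
                 df[~m]], ignore_index=True)
print(out)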
Use pd.to_datetime with Groupby.max:
In [835]: df.DateSent = pd.to_datetime(df.DateSent, format='%Y%m%d')
In [841]: df[df.ID.ne('6T')].groupby(['ID','DateAcquired'])['DateSent'].max().reset_index().append(df[df.ID.eq('6T')])
Out[841]:
ID DateAcquired DateSent
0 1 20200420 2022-04-10
1 1 20210328 2022-05-10
2 1 20210518 2022-06-10
3 1 20210719 2022-03-10
4 2 20200102 2022-08-11
5 2 20200109 2022-05-10
6 2 20200608 2022-09-09
7 2 20210108 2022-04-10
8 2 20210110 2022-02-10
9 2 20210119 2022-03-10
10 2 20210130 2022-11-01
11 2 20210506 2022-10-05
12 2 20210812 2022-06-20
13 2 20210909 2022-06-30
14 2 20210919 2022-06-10
15 2 20211214 2022-06-12
16 3 20210518 2022-04-10
17 3 20210519 2022-02-10
18 3 20210520 2022-03-10
19 3 20210521 2022-06-10
20 3 20210611 2022-05-10
21 3 20210723 2022-06-12
22 3 20211211 2022-06-20
23 4 20210518 2022-01-10
24 4 20210519 2022-02-10
25 4 20210520 2022-03-10
26 4 20210618 2022-04-10
27 4 20210718 2022-05-10
28 4 20210818 2022-06-10
29 5 20210518 2022-01-10
30 5 20210631 2022-12-12
31 5 20210818 2022-02-10
32 5 20210918 2022-03-10
33 5 20211018 2022-04-10
34 5 20211113 2022-05-10
35 5 20211218 2022-06-10
40 6T 20200102 2020-11-01
41 6T 20200102 2020-11-01
42 6T 20200102 2020-11-01
43 6T 20210405 2022-06-10
44 6T 20210606 2022-06-11

Python pandas: set consecutive index (starting at 0) for every group in groupby

I have a dataframe sorted and grouped by serial number, then date:
df = df.sort_values(by=["serial_num" , "date"])
df = df.groupby(df['serial_num'])['date'].some_function()
Dataframe
serial_num date
0 1 2001-01-01
1 1 2001-02-01
2 1 2001-03-01
3 1 2001-04-01
4 3 2003-05-01
5 3 2003-06-01
6 3 2003-07-01
7 7 2005-07-01
8 7 2005-08-01
9 7 2005-09-01
10 7 2005-10-01
11 7 2005-11-01
12 7 2005-12-01
Each unique serial_num group will be a line on a line graph.
The way it is graphed now, each line is a time series that starts at a different point -- because the first date is different for every serial_num group.
I need the x-axis of the graph to be time instead of date. All lines on the graph will start at the same point - the origin of the x-axis of my graph.
I think the easiest way to do this would be to add a consecutive index that starts at 0 for each group, like this:
serial_num date new_index
0 1 2001-01-01 0
1 1 2001-02-01 1
2 1 2001-03-01 2
3 1 2001-04-01 3
4 3 2003-05-01 0
5 3 2003-06-01 1
6 3 2003-07-01 2
7 7 2005-07-01 0
8 7 2005-08-01 1
9 7 2005-09-01 2
10 7 2005-10-01 3
11 7 2005-11-01 4
12 7 2005-12-01 5
Then, I think I will be able to graph (in Plotly) with all lines starting at the same point (the 0 index will be the first data point for each serial_num).
NOTE: each serial_num group has a different number of data points.
I'm unsure how to index with groupby this way. Please help! Or if you know another method that will accomplish the same goal, please share. Thanks!
Use cumcount:
df["new_index"] = df.groupby("serial_num").cumcount()
print(df)
==>
serial_num date new_index
0 1 2001-01-01 0
1 1 2001-02-01 1
2 1 2001-03-01 2
3 1 2001-04-01 3
4 3 2003-05-01 0
5 3 2003-06-01 1
6 3 2003-07-01 2
7 7 2005-07-01 0
8 7 2005-08-01 1
9 7 2005-09-01 2
10 7 2005-10-01 3
11 7 2005-11-01 4
12 7 2005-12-01 5
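The new column can then be used directly as the x-axis. A minimal Plotly sketch, assuming plotly is installed; the value column is hypothetical, since the question does not show what is plotted on the y-axis:
import pandas as pd
import plotly.express as px

df = pd.DataFrame({
    "serial_num": [1, 1, 1, 3, 3],
    "date": pd.to_datetime(["2001-01-01", "2001-02-01", "2001-03-01",
                            "2003-05-01", "2003-06-01"]),
    "value": [10, 12, 11, 5, 7],   # hypothetical measurement column
})
df["new_index"] = df.groupby("serial_num").cumcount()

# one line per serial_num, all starting at x = 0
fig = px.line(df, x="new_index", y="value", color="serial_num")
fig.show()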

pandas pct_change() in reverse

Suppose we have a dataframe and we calculate the percent change between rows:
import pandas as pd

y_axis = [1,2,3,4,5,6,7,8,9]
x_axis = [100,105,115,95,90,88,110,100,0]
DF = pd.DataFrame({'Y':y_axis, 'X':x_axis})
DF = DF[['Y','X']]
DF['PCT'] = DF['X'].pct_change()
Y X PCT
0 1 100 NaN
1 2 105 0.050000
2 3 115 0.095238
3 4 95 -0.173913
4 5 90 -0.052632
5 6 88 -0.022222
6 7 110 0.250000
7 8 100 -0.090909
8 9 0 -1.000000
That way it starts from the first row.
I want to calculate pct_change() starting from the last row.
One way to do it:
DF['Reverse'] = list(reversed(x_axis))
DF['PCT_rev'] = DF['Reverse'].pct_change()
pct_rev = DF.PCT_rev.tolist()
DF['_PCT_'] = list(reversed(pct_rev))
DF2 = DF[['Y','X','PCT','_PCT_']]
Y X PCT _PCT_
0 1 100 NaN -0.047619
1 2 105 0.050000 -0.086957
2 3 115 0.095238 0.210526
3 4 95 -0.173913 0.055556
4 5 90 -0.052632 0.022727
5 6 88 -0.022222 -0.200000
6 7 110 0.250000 0.100000
7 8 100 -0.090909 inf
8 9 0 -1.000000 NaN
But that is a very ugly and inefficient solution.
I was wondering if there are more elegant solutions?
DF.assign(_PCT_=DF.X.pct_change(-1))
Y X PCT _PCT_
0 1 100 NaN -0.047619
1 2 105 0.050000 -0.086957
2 3 115 0.095238 0.210526
3 4 95 -0.173913 0.055556
4 5 90 -0.052632 0.022727
5 6 88 -0.022222 -0.200000
6 7 110 0.250000 0.100000
7 8 100 -0.090909 inf
8 9 0 -1.000000 NaN
Series.pct_change(periods=1, fill_method='pad', limit=None, freq=None, **kwargs)
periods : int, default 1 Periods to shift for forming percent change
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.pct_change.html
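In other words, pct_change(-1) compares each row with the next one. A quick sketch checking the equivalence against a shift-and-divide on the same values:
import pandas as pd

x = pd.Series([100, 105, 115, 95, 90, 88, 110, 100, 0])
manual = x / x.shift(-1) - 1             # the formula pct_change(-1) applies
print(manual.equals(x.pct_change(-1)))   # True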
I deleted my other answer because @su79eu7k's is way better.
You can cut your time in half by using the underlying arrays. But you also have to suppress a warning.
import numpy as np

a = DF.X.values
DF.assign(_PCT_=np.append((a[:-1] - a[1:]) / a[1:], np.nan))
Y X PCT _PCT_
0 1 100 NaN -0.047619
1 2 105 0.050000 -0.086957
2 3 115 0.095238 0.210526
3 4 95 -0.173913 0.055556
4 5 90 -0.052632 0.022727
5 6 88 -0.022222 -0.200000
6 7 110 0.250000 0.100000
7 8 100 -0.090909 inf
8 9 0 -1.000000 NaN
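The warning mentioned above comes from NumPy dividing by the 0 in X; it can be silenced locally with np.errstate. A sketch, rebuilding the same DF:
import numpy as np
import pandas as pd

DF = pd.DataFrame({'Y': range(1, 10),
                   'X': [100, 105, 115, 95, 90, 88, 110, 100, 0]})
a = DF.X.values
with np.errstate(divide='ignore', invalid='ignore'):
    rev = np.append((a[:-1] - a[1:]) / a[1:], np.nan)
print(DF.assign(_PCT_=rev))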

Fill the missing date values in a Pandas Dataframe column

I'm using pandas Data Frames to store stock price data. There are 2940 rows in the dataset.
The time series data does not contain values for Saturday and Sunday, so the missing values have to be filled.
Here is the code I've written, but it does not solve the problem:
import pandas as pd
import numpy as np
import os
os.chdir('C:/Users/Admin/Analytics/stock-prices')
data = pd.read_csv('stock-data.csv')
# PriceDate Column - Does not contain Saturday and Sunday stock entries
data['PriceDate'] = pd.to_datetime(data['PriceDate'], format='%m/%d/%Y')
data = data.sort_index(by=['PriceDate'], ascending=[True])
# Starting date is Aug 25 2004
idx = pd.date_range('08-25-2004',periods=2940,freq='D')
data = data.set_index(idx)
data['newdate']=data.index
newdate=data['newdate'].values # Create a time series column
data = pd.merge(newdate, data, on='PriceDate', how='outer')
How to fill the missing values for Saturday and Sunday?
I think you can use resample with ffill or bfill, but first set_index from the PriceDate column:
print (data)
ID PriceDate OpenPrice HighPrice
0 1 6/24/2016 1 2
1 2 6/23/2016 3 4
2 2 6/22/2016 5 6
3 2 6/21/2016 7 8
4 2 6/20/2016 9 10
5 2 6/17/2016 11 12
6 2 6/16/2016 13 14
data['PriceDate'] = pd.to_datetime(data['PriceDate'], format='%m/%d/%Y')
data = data.sort_values(by=['PriceDate'], ascending=[True])
data.set_index('PriceDate', inplace=True)
print (data)
ID OpenPrice HighPrice
PriceDate
2016-06-16 2 13 14
2016-06-17 2 11 12
2016-06-20 2 9 10
2016-06-21 2 7 8
2016-06-22 2 5 6
2016-06-23 2 3 4
2016-06-24 1 1 2
data = data.resample('D').ffill().reset_index()
print (data)
PriceDate ID OpenPrice HighPrice
0 2016-06-16 2 13 14
1 2016-06-17 2 11 12
2 2016-06-18 2 11 12
3 2016-06-19 2 11 12
4 2016-06-20 2 9 10
5 2016-06-21 2 7 8
6 2016-06-22 2 5 6
7 2016-06-23 2 3 4
8 2016-06-24 1 1 2
data = data.resample('D').bfill().reset_index()
print (data)
PriceDate ID OpenPrice HighPrice
0 2016-06-16 2 13 14
1 2016-06-17 2 11 12
2 2016-06-18 2 9 10
3 2016-06-19 2 9 10
4 2016-06-20 2 9 10
5 2016-06-21 2 7 8
6 2016-06-22 2 5 6
7 2016-06-23 2 3 4
8 2016-06-24 1 1 2
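For completeness, asfreq gives the same daily grid when upsampling; a sketch, starting again from the PriceDate-indexed frame shown above (before reset_index):
data = data.asfreq('D').ffill().reset_index()   # same result as resample('D').ffill()
print(data)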

How to transform a value column to quantiles in pandas (Python)?

I use pandas to analyze my data, and execute:
df = pd.DataFrame(datas, columns=['userid', 'recency', 'frequency', 'monetary'])
print(df)
userid recency frequency monetary
0 47918 9 53 788778
1 48302 85 10 232323
2 8873 3 79 2323
3 63158 23 23 2323232
4 364 14 43 232323
5 45191 1 75 224455
6 21061 9 64 23367
7 41356 22 55 2346777
8 42455 14 30 23478
9 65460 3 16 2345
I need to transform the recency, frequency, and monetary values into values in the range 1-5, so the output is:
userid recency frequency monetary
0 47918 1 2 3
1 48302 2 1 2
2 8873 3 4 5
3 63158 2 2 2
4 364 5 4 2
5 45191 1 5 4
6 21061 4 4 3
7 41356 3 5 4
8 42455 5 3 5
9 65460 3 1 2
How can I do that in Python?
Thanks!
IIUC you need qcut with codes; at the end add 1, because the minimal value should be 1 and the maximal 5:
df['recency1'] = pd.qcut(df['recency'].values, 5)
df['frequency1'] = pd.qcut(df['frequency'].values, 5)
df['monetary1'] = pd.qcut(df['monetary'].values, 5)
print(df)
userid recency frequency monetary recency1 frequency1 \
0 47918 9 53 788778 (3, 9] (37.8, 53.8]
1 48302 85 10 232323 (22.2, 85] [10, 21.6]
2 8873 3 79 2323 [1, 3] (66.2, 79]
3 63158 23 23 2323232 (22.2, 85] (21.6, 37.8]
4 364 14 43 232323 (9, 14] (37.8, 53.8]
5 45191 1 75 224455 [1, 3] (66.2, 79]
6 21061 9 64 23367 (3, 9] (53.8, 66.2]
7 41356 22 55 2346777 (14, 22.2] (53.8, 66.2]
8 42455 14 30 23478 (9, 14] (21.6, 37.8]
9 65460 3 16 2345 [1, 3] [10, 21.6]
monetary1
0 (232323, 1095668.8]
1 (144064.2, 232323]
2 [2323, 19162.6]
3 (1095668.8, 2346777]
4 (144064.2, 232323]
5 (144064.2, 232323]
6 (19162.6, 144064.2]
7 (1095668.8, 2346777]
8 (19162.6, 144064.2]
9 [2323, 19162.6]
df['recency'] = pd.qcut(df['recency'].values, 5).codes + 1
df['frequency'] = pd.qcut(df['frequency'].values, 5).codes + 1
df['monetary'] = pd.qcut(df['monetary'].values, 5).codes + 1
print(df)
userid recency frequency monetary
0 47918 2 3 4
1 48302 5 1 3
2 8873 1 5 1
3 63158 5 2 5
4 364 3 3 3
5 45191 1 5 3
6 21061 2 4 2
7 41356 4 4 5
8 42455 3 2 2
9 65460 1 1 1
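A small variant that avoids reaching for .codes: pd.qcut also accepts labels=False and then returns integer bin numbers directly. A sketch on one column, reusing df from the question:
# labels=False makes qcut return bin codes 0-4, so +1 gives the 1-5 range
df['recency'] = pd.qcut(df['recency'], 5, labels=False) + 1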
