sort pandas dataframe indices - python

I have a dataframe df whose indices are
df.index
Out[4]:
Index([u'2015-03-28_p001_2', u'2015-03-29_p001_2',
u'2015-03-30_p001_2', u'2015-03-31_p001_2',
u'2015-03-31_p002_3', u'2015-04-01_p001_2',
u'2015-04-01_p002_3', u'2015-04-02_p001_2',
u'2015-04-02_p002_3', u'2015-04-03_p001_2',
...
u'2016-03-31_p127_1', u'2016-04-01_p127_1',
u'2016-04-01_p128_3', u'2016-04-02_p127_1',
u'2016-04-02_p128_3', u'2016-04-03_p127_1',
u'2016-04-03_p128_3', u'2016-04-04_p127_1',
u'2016-04-05_p127_1', u'2016-04-06_p127_1'],
dtype='object', length=781)
The dataframe df is the results of a merge of 2 dataframes.
As you can see, the indices are not sorted. E.g. '2015-03-31_p002_3' (5th position) comes before '2015-04-01_p001_2' (6th position).
I would like to group together all the _p001_2 entries and sort them by date, then all the _p002_3 entries, and so on.
But I didn't manage to do it.

If sort_index cannot be used, it is a bit more involved: create a helper DataFrame by splitting the index, then sort_values it, and finally reindex:
idx = pd.Index([u'2015-03-28_p001_2', u'2015-03-29_p001_2',
u'2015-03-30_p001_2', u'2015-03-31_p001_2',
u'2015-03-31_p002_3', u'2015-04-01_p001_2',
u'2015-04-01_p002_3', u'2015-04-02_p001_2',
u'2015-04-02_p002_3', u'2015-04-03_p001_2',
u'2016-03-31_p127_1', u'2016-04-01_p127_1',
u'2016-04-01_p128_3', u'2016-04-02_p127_1',
u'2016-04-02_p128_3', u'2016-04-03_p127_1',
u'2016-04-03_p128_3', u'2016-04-04_p127_1',
u'2016-04-05_p127_1', u'2016-04-06_p127_1'])
df = pd.DataFrame({'a':range(len(idx))}, index=idx)
print (df)
a
2015-03-28_p001_2 0
2015-03-29_p001_2 1
2015-03-30_p001_2 2
2015-03-31_p001_2 3
2015-03-31_p002_3 4
2015-04-01_p001_2 5
2015-04-01_p002_3 6
2015-04-02_p001_2 7
2015-04-02_p002_3 8
2015-04-03_p001_2 9
2016-03-31_p127_1 10
2016-04-01_p127_1 11
2016-04-01_p128_3 12
2016-04-02_p127_1 13
2016-04-02_p128_3 14
2016-04-03_p127_1 15
2016-04-03_p128_3 16
2016-04-04_p127_1 17
2016-04-05_p127_1 18
2016-04-06_p127_1 19
sort_index alone only gives lexicographic (date-major) order, which is not the desired grouping:
df = df.sort_index()
print (df)
a
2015-03-28_p001_2 0
2015-03-29_p001_2 1
2015-03-30_p001_2 2
2015-03-31_p001_2 3
2015-03-31_p002_3 4
2015-04-01_p001_2 5
2015-04-01_p002_3 6
2015-04-02_p001_2 7
2015-04-02_p002_3 8
2015-04-03_p001_2 9
2016-03-31_p127_1 10
2016-04-01_p127_1 11
2016-04-01_p128_3 12
2016-04-02_p127_1 13
2016-04-02_p128_3 14
2016-04-03_p127_1 15
2016-04-03_p128_3 16
2016-04-04_p127_1 17
2016-04-05_p127_1 18
2016-04-06_p127_1 19
df1 = df.index.to_series().str.split('_', expand=True)
df1[0] = pd.to_datetime(df1[0])
# if necessary, change the column order used for sorting
df1 = df1.sort_values(by=[1,2,0])
print (df1)
0 1 2
2015-03-28_p001_2 2015-03-28 p001 2
2015-03-29_p001_2 2015-03-29 p001 2
2015-03-30_p001_2 2015-03-30 p001 2
2015-03-31_p001_2 2015-03-31 p001 2
2015-04-01_p001_2 2015-04-01 p001 2
2015-04-02_p001_2 2015-04-02 p001 2
2015-04-03_p001_2 2015-04-03 p001 2
2015-03-31_p002_3 2015-03-31 p002 3
2015-04-01_p002_3 2015-04-01 p002 3
2015-04-02_p002_3 2015-04-02 p002 3
2016-03-31_p127_1 2016-03-31 p127 1
2016-04-01_p127_1 2016-04-01 p127 1
2016-04-02_p127_1 2016-04-02 p127 1
2016-04-03_p127_1 2016-04-03 p127 1
2016-04-04_p127_1 2016-04-04 p127 1
2016-04-05_p127_1 2016-04-05 p127 1
2016-04-06_p127_1 2016-04-06 p127 1
2016-04-01_p128_3 2016-04-01 p128 3
2016-04-02_p128_3 2016-04-02 p128 3
2016-04-03_p128_3 2016-04-03 p128 3
df = df.reindex(df1.index)
print (df)
a
2015-03-28_p001_2 0
2015-03-29_p001_2 1
2015-03-30_p001_2 2
2015-03-31_p001_2 3
2015-04-01_p001_2 5
2015-04-02_p001_2 7
2015-04-03_p001_2 9
2015-03-31_p002_3 4
2015-04-01_p002_3 6
2015-04-02_p002_3 8
2016-03-31_p127_1 10
2016-04-01_p127_1 11
2016-04-02_p127_1 13
2016-04-03_p127_1 15
2016-04-04_p127_1 17
2016-04-05_p127_1 18
2016-04-06_p127_1 19
2016-04-01_p128_3 12
2016-04-02_p128_3 14
2016-04-03_p128_3 16
EDIT:
If there are duplicates, it is necessary to create new columns, sort by them, and finally drop them:
df[[0,1,2]] = df.index.to_series().str.split('_', expand=True)
df[0] = pd.to_datetime(df[0])
df = df.sort_values(by=[1,2,0])
df = df.drop([0,1,2], axis=1)
print (df)
a
2015-03-28_p001_2 0
2015-03-29_p001_2 1
2015-03-30_p001_2 2
2015-03-31_p001_2 3
2015-04-01_p001_2 5
2015-04-02_p001_2 7
2015-04-03_p001_2 9
2015-03-31_p002_3 4
2015-04-01_p002_3 6
2015-04-02_p002_3 8
2016-03-31_p127_1 10
2016-04-01_p127_1 11
2016-04-02_p127_1 13
2016-04-03_p127_1 15
2016-04-04_p127_1 17
2016-04-05_p127_1 18
2016-04-06_p127_1 19
2016-04-01_p128_3 12
2016-04-02_p128_3 14
2016-04-03_p128_3 16
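On newer pandas (1.1+), a similar ordering can also be produced directly with the key argument of sort_index; a minimal sketch, assuming every label has the exact 'YYYY-MM-DD_pXXX_N' form:
# rewrite each label as 'pXXX_N_YYYY-MM-DD' so plain lexicographic order
# groups by p-code first and then by (ISO) date
df = df.sort_index(
    key=lambda idx: idx.str.replace(r'(\d{4}-\d{2}-\d{2})_(p\d+_\d+)', r'\2_\1', regex=True)
)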

Conditionally keep only one of the duplicates in pandas groupby groups

I have a dataset in this format (it can be downloaded in CSV format from here):
ID DateAcquired DateSent
1 20210518 20220110
1 20210719 20220210
1 20210719 20220310
1 20200420 20220410
1 20210328 20220510
1 20210518 20220610
2 20210108 20220110
2 20210110 20220210
2 20210119 20220310
2 20210108 20220410
2 20200109 20220510
2 20210919 20220610
2 20211214 20220612
2 20210812 20220620
2 20210909 20220630
2 20200102 20220811
2 20200608 20220909
2 20210506 20221005
2 20210130 20221101
3 20210518 20220110
3 20210519 20220210
3 20210520 20220310
3 20210518 20220410
3 20210611 20220510
3 20210521 20220610
3 20210723 20220612
3 20211211 20220620
4 20210518 20220110
4 20210519 20220210
4 20210520 20220310
4 20210618 20220410
4 20210718 20220510
4 20210818 20220610
5 20210518 20220110
5 20210818 20220210
5 20210918 20220310
5 20211018 20220410
5 20211113 20220510
5 20211218 20220610
5 20210631 20221212
6T 20200102 20201101
6T 20200102 20201101
6T 20200102 20201101
6T 20210405 20220610
6T 20210606 20220611
I am doing groupby:
data.groupby(['ID','DateAcquired'])
For each unique combination of ID and DateAcquired, I am only interested in keeping one DateSent, and that is the newest one. Therefore, in other words, if a unique combination of ID and DateAcquired has two DateSent available, only take the one where DateSent is the largest/newest. This operation should apply only if ID is NOT 6T.
I am out of ideas on how to do this. Is there an easy way of doing it with pandas?
You can filter the rows not equal to 6T, get the rows with the maximum DateSent per group with DataFrameGroupBy.idxmax, and then append the 6T rows to the output:
m = df['ID'].ne('6T')
df = (df.loc[df[m].groupby(['ID','DateAcquired'])['DateSent'].idxmax()]
        .append(df[~m], ignore_index=True))
Solution with sorting and removing duplicates:
m = df['ID'].ne('6T')
df = (df[m].sort_values(['ID','DateAcquired','DateSent'], ascending=[True, True, False])
           .drop_duplicates(subset=['ID','DateAcquired'])
           .append(df[~m], ignore_index=True))
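Note that DataFrame.append was removed in pandas 2.0; on newer versions the same idea can be written with pd.concat, a sketch assuming the same df as above:
import pandas as pd

m = df['ID'].ne('6T')
kept = df.loc[df[m].groupby(['ID','DateAcquired'])['DateSent'].idxmax()]
# put the untouched 6T rows back and rebuild a clean RangeIndex
df = pd.concat([kept, df[~m]], ignore_index=True)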
Use pd.to_datetime with GroupBy.max:
In [835]: df.DateSent = pd.to_datetime(df.DateSent, format='%Y%m%d')
In [841]: df[df.ID.ne('6T')].groupby(['ID','DateAcquired'])['DateSent'].max().reset_index().append(df[df.ID.eq('6T')])
Out[841]:
ID DateAcquired DateSent
0 1 20200420 2022-04-10
1 1 20210328 2022-05-10
2 1 20210518 2022-06-10
3 1 20210719 2022-03-10
4 2 20200102 2022-08-11
5 2 20200109 2022-05-10
6 2 20200608 2022-09-09
7 2 20210108 2022-04-10
8 2 20210110 2022-02-10
9 2 20210119 2022-03-10
10 2 20210130 2022-11-01
11 2 20210506 2022-10-05
12 2 20210812 2022-06-20
13 2 20210909 2022-06-30
14 2 20210919 2022-06-10
15 2 20211214 2022-06-12
16 3 20210518 2022-04-10
17 3 20210519 2022-02-10
18 3 20210520 2022-03-10
19 3 20210521 2022-06-10
20 3 20210611 2022-05-10
21 3 20210723 2022-06-12
22 3 20211211 2022-06-20
23 4 20210518 2022-01-10
24 4 20210519 2022-02-10
25 4 20210520 2022-03-10
26 4 20210618 2022-04-10
27 4 20210718 2022-05-10
28 4 20210818 2022-06-10
29 5 20210518 2022-01-10
30 5 20210631 2022-12-12
31 5 20210818 2022-02-10
32 5 20210918 2022-03-10
33 5 20211018 2022-04-10
34 5 20211113 2022-05-10
35 5 20211218 2022-06-10
40 6T 20200102 2020-11-01
41 6T 20200102 2020-11-01
42 6T 20200102 2020-11-01
43 6T 20210405 2022-06-10
44 6T 20210606 2022-06-11

Replace row value by comparing dates

I have a date in a list:
[datetime.date(2017, 8, 9)]
I want to replace the value of a dataframe matching that date with zero.
Dataframe:
Date Amplitude Magnitude Peaks Crests
0 2017-06-21 6.953356 1046.656154 4 3
1 2017-06-27 7.015520 1185.221306 5 4
2 2017-06-28 6.947471 908.115055 2 2
3 2017-06-29 6.921587 938.175153 3 3
4 2017-07-02 6.906078 938.273547 3 2
5 2017-07-03 6.898809 955.718452 6 5
6 2017-07-04 6.876283 846.514852 5 5
7 2017-07-26 6.862897 870.610086 6 5
8 2017-07-27 6.846426 824.403786 7 7
9 2017-07-28 6.831949 813.753420 7 7
10 2017-07-29 6.823125 841.245427 4 3
11 2017-07-30 6.816301 846.603427 5 4
12 2017-07-31 6.810133 842.287006 5 4
13 2017-08-01 6.800645 794.167590 3 3
14 2017-08-02 6.793034 801.505774 4 3
15 2017-08-03 6.790814 860.497395 7 6
16 2017-08-04 6.785664 815.055002 4 4
17 2017-08-05 6.782069 829.607640 5 4
18 2017-08-06 6.778176 819.014799 4 3
19 2017-08-07 6.774587 817.624203 5 5
20 2017-08-08 6.771193 815.101641 4 3
21 2017-08-09 6.765695 772.970000 1 1
22 2017-08-10 6.769422 945.207554 1 1
23 2017-08-11 6.773154 952.422598 4 3
24 2017-08-12 6.770926 826.700122 4 4
25 2017-08-13 6.772816 916.046905 5 5
26 2017-08-14 6.771130 834.881662 5 5
27 2017-08-15 6.769183 826.009391 5 5
28 2017-08-16 6.767313 824.650882 5 4
29 2017-08-17 6.765894 832.752100 5 5
30 2017-08-18 6.766861 894.165751 5 5
31 2017-08-19 6.768392 912.200274 4 3
I have tried this:
for x in range(len(all_details)):
    for y in selected_day:
        m = all_details['Date'] > y
        all_details.loc[m, 'Peaks'] = 0
But getting an error:
ValueError: Arrays were different lengths: 32 vs 1
Can anybody suggest the correct way to do it?
Any help would be appreciated.
First, your solution works nicely with your sample data.
Another, faster solution is to create each mask in a loop and then reduce them with logical or (or and, as needed). It is explained better here.
import datetime
import numpy as np

L = [datetime.date(2017, 8, 9)]
m = np.logical_or.reduce([all_details['Date'] > x for x in L])
all_details.loc[m, 'Peaks'] = 0
In your solution, it is better to compare only against the minimal date from the list:
all_details.loc[all_details['Date'] > min(L), 'Peaks'] = 0
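If the intent is to zero out only the rows whose date exactly equals an entry in the list (rather than every later date), a mask built with isin may be simpler; a sketch, assuming all_details['Date'] is a datetime64 column:
import datetime
import pandas as pd

L = [datetime.date(2017, 8, 9)]
# convert the python dates to Timestamps so they compare cleanly against the datetime64 column
exact = all_details['Date'].isin(pd.to_datetime(L))
all_details.loc[exact, 'Peaks'] = 0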

how to use custom calculation based on two dataframes in python

I have 2 dataframes as below,
df1
index X1 X2 X3 X4
0 6 10 6 7
1 8 9 11 13
2 12 13 15 11
3 8 11 7 6
4 11 7 6 6
5 13 14 11 10
df2
index Y
0 20
1 14
2 17
3 14
4 15
5 20
I want to get a 3rd dataframe such that new X(i) = Y - (Y / X(i)):
index X1 X2 X3 X4
0 16.67 18.00 16.67 17.14
1 12.25 12.44 12.73 12.92
2 15.58 15.69 15.87 15.45
3 12.25 12.73 12.00 11.67
4 13.64 12.86 12.50 12.50
5 18.46 18.57 18.18 18.00
Please note, the calculation should align df1 and df2 by their index.
Thanks in advance!
You can use numpy for a vectorized solution, i.e.
df1[df1.columns] = df2.values - df2.values / df1.values
X1 X2 X3 X4
index
0 16.666667 18.000000 16.666667 17.142857
1 12.250000 12.444444 12.727273 12.923077
2 15.583333 15.692308 15.866667 15.454545
3 12.250000 12.727273 12.000000 11.666667
4 13.636364 12.857143 12.500000 12.500000
5 18.461538 18.571429 18.181818 18.000000
Based on pandas:
-(1/df1.div(df2.Y,0)).sub(df2.Y,0)
Out[634]:
X1 X2 X3 X4
index
0 16.666667 18.000000 16.666667 17.142857
1 12.250000 12.444444 12.727273 12.923077
2 15.583333 15.692308 15.866667 15.454545
3 12.250000 12.727273 12.000000 11.666667
4 13.636364 12.857143 12.500000 12.500000
5 18.461538 18.571429 18.181818 18.000000
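The same formula can also be written with pandas' reverse arithmetic methods, which keeps the index alignment explicit; a sketch, assuming df1 and df2 share the same index:
y = df2['Y']
df3 = df1.rdiv(y, axis=0).rsub(y, axis=0)  # first Y / X(i), then Y - (Y / X(i))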

How to Reverse Rolling Sum?

I have a rolling sum calculated on a grouped data frame, but it's adding up the wrong way: it is a sum of the future, when I need a sum of the past.
What am I doing wrong here?
I import the data and sort by Dimension and Date (I have tried removing the date sort already)
df = pd.read_csv('Input.csv', parse_dates=True)
df.sort_values(['Dimension','Date'])
print(df)
I then create a new column, which comes back as a multi-indexed series from the grouped rolling window
new_column = df.groupby('Dimension').Value1.apply(lambda x:
    x.rolling(window=3).sum())
I then reset the index to be the same as the original
df['Sum_Value1'] = new_column.reset_index(level=0, drop=True)
print(df)
I have also tried reversing the index before the calculation, but that also failed.
Input
Dimension,Date,Value1,Value2
1,4/30/2002,10,20
1,1/31/2002,10,20
1,10/31/2001,10,20
1,7/31/2001,10,20
1,4/30/2001,10,20
1,1/31/2001,10,20
1,10/31/2000,10,20
2,4/30/2002,10,20
2,1/31/2002,10,20
2,10/31/2001,10,20
2,7/31/2001,10,20
2,4/30/2001,10,20
2,1/31/2001,10,20
2,10/31/2000,10,20
3,4/30/2002,10,20
3,1/31/2002,10,20
3,10/31/2001,10,20
3,7/31/2001,10,20
3,1/31/2001,10,20
3,10/31/2000,10,20
Output:
Dimension Date Value1 Value2 Sum_Value1
0 1 4/30/2002 10 20 NaN
1 1 1/31/2002 10 20 NaN
2 1 10/31/2001 10 20 30.0
3 1 7/31/2001 10 20 30.0
4 1 4/30/2001 10 20 30.0
5 1 1/31/2001 10 20 30.0
6 1 10/31/2000 10 20 30.0
7 2 4/30/2002 10 20 NaN
8 2 1/31/2002 10 20 NaN
9 2 10/31/2001 10 20 30.0
10 2 7/31/2001 10 20 30.0
11 2 4/30/2001 10 20 30.0
12 2 1/31/2001 10 20 30.0
13 2 10/31/2000 10 20 30.0
Goal Output:
Dimension Date Value1 Value2 Sum_Value1
0 1 4/30/2002 10 20 30.0
1 1 1/31/2002 10 20 30.0
2 1 10/31/2001 10 20 30.0
3 1 7/31/2001 10 20 30.0
4 1 4/30/2001 10 20 30.0
5 1 1/31/2001 10 20 NaN
6 1 10/31/2000 10 20 NaN
7 2 4/30/2002 10 20 30.0
8 2 1/31/2002 10 20 30.0
9 2 10/31/2001 10 20 30.0
10 2 7/31/2001 10 20 30.0
11 2 4/30/2001 10 20 30.0
12 2 1/31/2001 10 20 Nan
13 2 10/31/2000 10 20 NaN
You need a backward sum, therefore reverse your series before rolling-summing it:
lambda x: x[::-1].rolling(window=3).sum()
You can shift the result by window - 1 to get left-aligned results:
df["sum_value1"] = (df.groupby('Dimension').Value1
                      .apply(lambda x: x.rolling(window=3).sum().shift(-2)))
Rolling backwards is the same as rolling forward and then shifting the result:
x.rolling(window=3).sum().shift(-2)
def reverse_rolling(series, window, func):
    # reverse, roll forward, then restore the original order and index
    index = series.index
    series = pd.DataFrame(series.iloc[::-1])
    series = series.rolling(window, 1).apply(func)
    series = series.iloc[::-1]
    series['index'] = index
    series = series.set_index('index')
    # note: assumes the input Series is unnamed, so the wrapped column is labelled 0
    return series[0]
You can use FixedForwardWindowIndexer:
import pandas as pd
from pandas.api.indexers import FixedForwardWindowIndexer
df = pd.read_csv(r'C:\Users\xxxx\python\data.txt')
indexer = FixedForwardWindowIndexer(window_size=3)
df1 = df.join(df.groupby('Dimension')['Value1']
                .rolling(indexer, min_periods=3).sum()
                .to_frame().reset_index(), rsuffix='_sum')
del df1['Dimension_sum']
del df1['level_1']
df1
Input:
Dimension Date Value1 Value2
0 1 4/30/2002 10 20
1 1 1/31/2002 10 20
2 1 10/31/2001 10 20
3 1 7/31/2001 10 20
4 1 4/30/2001 10 20
5 1 1/31/2001 10 20
6 1 10/31/2000 10 20
7 2 4/30/2002 10 20
8 2 1/31/2002 10 20
9 2 10/31/2001 10 20
10 2 7/31/2001 10 20
11 2 4/30/2001 10 20
12 2 1/31/2001 10 20
13 2 10/31/2000 10 20
14 3 4/30/2002 10 20
15 3 1/31/2002 10 20
16 3 10/31/2001 10 20
17 3 7/31/2001 10 20
18 3 1/31/2001 10 20
19 3 10/31/2000 10 20
OUTPUT:
Dimension Date Value1 Value2 Value1_sum
0 1 4/30/2002 10 20 30.0
1 1 1/31/2002 10 20 30.0
2 1 10/31/2001 10 20 30.0
3 1 7/31/2001 10 20 30.0
4 1 4/30/2001 10 20 30.0
5 1 1/31/2001 10 20 NaN
6 1 10/31/2000 10 20 NaN
7 2 4/30/2002 10 20 30.0
8 2 1/31/2002 10 20 30.0
9 2 10/31/2001 10 20 30.0
10 2 7/31/2001 10 20 30.0
11 2 4/30/2001 10 20 30.0
12 2 1/31/2001 10 20 NaN
13 2 10/31/2000 10 20 NaN
14 3 4/30/2002 10 20 30.0
15 3 1/31/2002 10 20 30.0
16 3 10/31/2001 10 20 30.0
17 3 7/31/2001 10 20 30.0
18 3 1/31/2001 10 20 NaN
19 3 10/31/2000 10 20 NaN
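On recent pandas versions the forward-window result can also be assigned back a bit more compactly, without the join/del bookkeeping; a sketch, assuming the same df:
from pandas.api.indexers import FixedForwardWindowIndexer

indexer = FixedForwardWindowIndexer(window_size=3)
df['Sum_Value1'] = (df.groupby('Dimension')['Value1']
                      .rolling(indexer, min_periods=3).sum()
                      .reset_index(level=0, drop=True))  # drop the Dimension level so it aligns with df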
Just had to do the same thing myself and came up with a simple one-liner:
df['Sum_Value1'] = df['Value1'].iloc[::-1].rolling(window = 3).sum()

Fill the missing date values in a Pandas Dataframe column

I'm using Pandas to store stock price data in DataFrames. There are 2940 rows in the dataset.
The time series data does not contain the values for Saturday and Sunday. Hence missing values have to be filled.
Here is the code I've written but it is not solving the problem:
import pandas as pd
import numpy as np
import os
os.chdir('C:/Users/Admin/Analytics/stock-prices')
data = pd.read_csv('stock-data.csv')
# PriceDate Column - Does not contain Saturday and Sunday stock entries
data['PriceDate'] = pd.to_datetime(data['PriceDate'], format='%m/%d/%Y')
data = data.sort_index(by=['PriceDate'], ascending=[True])
# Starting date is Aug 25 2004
idx = pd.date_range('08-25-2004',periods=2940,freq='D')
data = data.set_index(idx)
data['newdate']=data.index
newdate=data['newdate'].values # Create a time series column
data = pd.merge(newdate, data, on='PriceDate', how='outer')
How to fill the missing values for Saturday and Sunday?
I think you can use resample with ffill or bfill, but first set_index from the column PriceDate:
print (data)
ID PriceDate OpenPrice HighPrice
0 1 6/24/2016 1 2
1 2 6/23/2016 3 4
2 2 6/22/2016 5 6
3 2 6/21/2016 7 8
4 2 6/20/2016 9 10
5 2 6/17/2016 11 12
6 2 6/16/2016 13 14
data['PriceDate'] = pd.to_datetime(data['PriceDate'], format='%m/%d/%Y')
data = data.sort_values(by=['PriceDate'], ascending=[True])
data.set_index('PriceDate', inplace=True)
print (data)
ID OpenPrice HighPrice
PriceDate
2016-06-16 2 13 14
2016-06-17 2 11 12
2016-06-20 2 9 10
2016-06-21 2 7 8
2016-06-22 2 5 6
2016-06-23 2 3 4
2016-06-24 1 1 2
data = data.resample('D').ffill().reset_index()
print (data)
PriceDate ID OpenPrice HighPrice
0 2016-06-16 2 13 14
1 2016-06-17 2 11 12
2 2016-06-18 2 11 12
3 2016-06-19 2 11 12
4 2016-06-20 2 9 10
5 2016-06-21 2 7 8
6 2016-06-22 2 5 6
7 2016-06-23 2 3 4
8 2016-06-24 1 1 2
Or, using bfill instead:
data = data.resample('D').bfill().reset_index()
print (data)
PriceDate ID OpenPrice HighPrice
0 2016-06-16 2 13 14
1 2016-06-17 2 11 12
2 2016-06-18 2 9 10
3 2016-06-19 2 9 10
4 2016-06-20 2 9 10
5 2016-06-21 2 7 8
6 2016-06-22 2 5 6
7 2016-06-23 2 3 4
8 2016-06-24 1 1 2
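A shorter variant once PriceDate is the index is asfreq, which inserts the missing calendar days and forward-fills in one step; a sketch, assuming the same data as above:
data = data.asfreq('D', method='ffill').reset_index()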
