Create well structured pandas dataframe using dataframe - python

I have a pandas DataFrame with data from 2018 to 2020. I want to restructure the data as follows.
Month | 2018 | 2019
Jan 115 73
Feb 112 63
....
up to December.
How can I do this using pandas DataFrame syntax?
Date
2018-01-01 115.0
2018-02-01 112.0
2018-03-01 104.5
2018-04-01 91.1
2018-05-01 85.5
2018-06-01 76.5
2018-07-01 86.5
2018-08-01 77.9
2018-09-01 65.0
2018-10-01 71.0
2018-11-01 76.0
2018-12-01 72.5
2019-01-01 73.0
2019-02-01 63.0
2019-03-01 63.0
2019-04-01 61.0
2019-05-01 58.3
2019-06-01 59.0
2019-07-01 67.0
2019-08-01 64.0
2019-09-01 59.9
2019-10-01 70.4
2019-11-01 78.9
2019-12-01 75.0
2020-01-01 73.9
Name: Close, dtype: float64

This is essentially a pivot, which you can do with crosstab:
s = pd.crosstab(df.index.strftime('%b'),df.index.year,df.values,aggfunc='sum')
Out[87]:
col_0 2018 2019 2020
row_0
Apr 91.1 61.0 NaN
Aug 77.9 64.0 NaN
Dec 72.5 75.0 NaN
Feb 112.0 63.0 NaN
Jan 115.0 73.0 73.9
Jul 86.5 67.0 NaN
Jun 76.5 59.0 NaN
Mar 104.5 63.0 NaN
May 85.5 58.3 NaN
Nov 76.0 78.9 NaN
Oct 71.0 70.4 NaN
Sep 65.0 59.9 NaN

You can use groupby and unstack:
(s.groupby([s.index.month, s.index.year]).first().unstack()
.rename_axis(columns='Year',index='Month')
)
Output:
Year 2018 2019 2020
Month
1 115.0 73.0 73.9
2 112.0 63.0 NaN
3 104.5 63.0 NaN
4 91.1 61.0 NaN
5 85.5 58.3 NaN
6 76.5 59.0 NaN
7 86.5 67.0 NaN
8 77.9 64.0 NaN
9 65.0 59.9 NaN
10 71.0 70.4 NaN
11 76.0 78.9 NaN
12 72.5 75.0 NaN
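If the month abbreviations from the question are wanted in calendar order (the crosstab output sorts them alphabetically), one option is to pivot on the month number first and relabel the index afterwards. This is a minimal sketch, assuming a small made-up Series `s` that copies a few of the question's values; the name `table` and the use of the standard-library `calendar` module are my own choices and not part of the original answers.

```python
import calendar

import pandas as pd

# Hypothetical sample: a handful of values copied from the question's Series.
s = pd.Series(
    [115.0, 112.0, 73.0, 63.0, 73.9],
    index=pd.to_datetime(
        ["2018-01-01", "2018-02-01", "2019-01-01", "2019-02-01", "2020-01-01"]
    ),
    name="Close",
)

# Pivot on (month number, year) so rows come out in calendar order.
table = s.groupby([s.index.month, s.index.year]).first().unstack()
table = table.rename_axis(index="Month", columns="Year")

# Relabel month numbers 1..12 as abbreviations (Jan, Feb, ...).
table.index = pd.Index([calendar.month_abbr[m] for m in table.index], name="Month")

print(table)
```

Months with no data for a given year show up as NaN, as in the outputs above.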

Overwrite one dataframe with values from another dataframe, based on repeated datetime index

I want to update and overwrite the values of one dataframe with the values in another, based on the datetime index, where the datetime index contains repeated values. The code below illustrates my problem; I have given df1 crazy values for illustrative purposes:
#import packages
import pandas as pd
import numpy as np
#create dataframes and indices
df = pd.DataFrame(np.random.randint(0,30,size=(10, 3)), columns=(['MeanT', 'MaxT', 'MinT']))
df1 = pd.DataFrame(np.random.randint(900,1000,size=(10, 3)), columns=(['MeanT', 'MaxT', 'MinT']))
df['Location'] =[2,2,3,3,4,4,5,5,6,6]
df1['Location'] =[2,2,3,3,4,4,5,5,6,6]
df.index = ["2020-05-18 12:00:00","2020-05-19 12:00:00","2020-05-18 12:00:00","2020-05-19 12:00:00","2020-05-18 12:00:00","2020-05-19 12:00:00","2020-05-18 12:00:00","2020-05-19 12:00:00","2020-05-18 12:00:00","2020-05-19 12:00:00"]
df1.index = ["2020-05-19 12:00:00", "2020-05-20 12:00:00", "2020-05-19 12:00:00", "2020-05-20 12:00:00", "2020-05-19 12:00:00", "2020-05-20 12:00:00", "2020-05-19 12:00:00", "2020-05-20 12:00:00", "2020-05-19 12:00:00", "2020-05-20 12:00:00"]
df.index = pd.to_datetime(df.index)
df1.index = pd.to_datetime(df1.index)
Take a look at both dataframes: df has dates for the 18th and 19th, and df1 has dates for the 19th and 20th.
print(df)
MeanT MaxT MinT Location
2020-05-18 12:00:00 28 0 9 2
2020-05-19 12:00:00 22 7 11 2
2020-05-18 12:00:00 2 7 7 3
2020-05-19 12:00:00 10 24 18 3
2020-05-18 12:00:00 10 12 25 4
2020-05-19 12:00:00 25 7 20 4
2020-05-18 12:00:00 1 8 11 5
2020-05-19 12:00:00 27 19 12 5
2020-05-18 12:00:00 25 10 26 6
2020-05-19 12:00:00 29 11 27 6
print(df1)
MeanT MaxT MinT Location
2020-05-19 12:00:00 912 991 915 2
2020-05-20 12:00:00 936 917 965 2
2020-05-19 12:00:00 918 977 901 3
2020-05-20 12:00:00 974 971 927 3
2020-05-19 12:00:00 979 929 953 4
2020-05-20 12:00:00 988 955 939 4
2020-05-19 12:00:00 969 983 940 5
2020-05-20 12:00:00 902 904 916 5
2020-05-19 12:00:00 983 942 965 6
2020-05-20 12:00:00 928 994 933 6
I want to create a new dataframe which updates df with the values from df1, so the new df has values for the 18th from df, and the 19th and 20th from df1.
I have tried using combine_first like so:
df = df.set_index(df.groupby(level=0).cumcount(), append=True)
df1 = df1.set_index(df1.groupby(level=0).cumcount(), append=True)
df3 = df.combine_first(df1).sort_index(level=[1,0]).reset_index(level=1, drop=True)
which updates the dataframe, but doesn't overwrite the data for the 19th with values in df1. It produces this output:
print(df3)
MeanT MaxT MinT Location
2020-05-18 12:00:00 28.0 0.0 9.0 2.0
2020-05-19 12:00:00 22.0 7.0 11.0 2.0
2020-05-20 12:00:00 936.0 917.0 965.0 2.0
2020-05-18 12:00:00 2.0 7.0 7.0 3.0
2020-05-19 12:00:00 10.0 24.0 18.0 3.0
2020-05-20 12:00:00 974.0 971.0 927.0 3.0
2020-05-18 12:00:00 10.0 12.0 25.0 4.0
2020-05-19 12:00:00 25.0 7.0 20.0 4.0
2020-05-20 12:00:00 988.0 955.0 939.0 4.0
2020-05-18 12:00:00 1.0 8.0 11.0 5.0
2020-05-19 12:00:00 27.0 19.0 12.0 5.0
2020-05-20 12:00:00 902.0 904.0 916.0 5.0
2020-05-18 12:00:00 25.0 10.0 26.0 6.0
2020-05-19 12:00:00 29.0 11.0 27.0 6.0
2020-05-20 12:00:00 928.0 994.0 933.0 6.0
So the values for the 18th and the 20th are correct, but the values for the 19th are still from df. I want the values from df to be overwritten with those in df1. Please help!
You just need to use combine_first the other way round, calling it on df1 so that its values take priority.
You can also use 'Location' as an index level instead of groupby.cumcount:
df3 = (df1.set_index('Location', append=True)
.combine_first(df.set_index('Location', append=True))
.reset_index(level='Location')
.reindex(columns=df.columns)
.sort_values('Location'))
print(df3)
Location MeanT MaxT MinT
2020-05-18-12:00:00 2 28.0 0.0 9.0
2020-05-19-12:00:00 2 912.0 991.0 915.0
2020-05-20-12:00:00 2 936.0 917.0 965.0
2020-05-18-12:00:00 3 2.0 7.0 7.0
2020-05-19-12:00:00 3 918.0 977.0 901.0
2020-05-20-12:00:00 3 974.0 971.0 927.0
2020-05-18-12:00:00 4 10.0 12.0 25.0
2020-05-19-12:00:00 4 979.0 929.0 953.0
2020-05-20-12:00:00 4 988.0 955.0 939.0
2020-05-18-12:00:00 5 1.0 8.0 11.0
2020-05-19-12:00:00 5 969.0 983.0 940.0
2020-05-20-12:00:00 5 902.0 904.0 916.0
2020-05-18-12:00:00 6 25.0 10.0 26.0
2020-05-19-12:00:00 6 983.0 942.0 965.0
2020-05-20-12:00:00 6 928.0 994.0 933.0
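The priority rule can be checked on a much smaller case. The sketch below assumes two made-up frames, `old` and `new` (hypothetical names standing in for df and df1), with a single location and one overlapping timestamp; `combine_first` keeps the caller's non-null values and only fills the gaps from its argument.

```python
import pandas as pd

# Hypothetical miniature versions of df (old) and df1 (new).
old = pd.DataFrame(
    {"MeanT": [28, 22], "Location": [2, 2]},
    index=pd.to_datetime(["2020-05-18 12:00:00", "2020-05-19 12:00:00"]),
)
new = pd.DataFrame(
    {"MeanT": [912, 936], "Location": [2, 2]},
    index=pd.to_datetime(["2020-05-19 12:00:00", "2020-05-20 12:00:00"]),
)

# Call combine_first on the frame whose values should win (new);
# rows that exist only in old are filled in from old.
merged = (
    new.set_index("Location", append=True)
    .combine_first(old.set_index("Location", append=True))
    .sort_index()
    .reset_index(level="Location")
)

print(merged)
```

The row for the 19th carries the value from `new`, while the 18th, which exists only in `old`, is preserved.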

Add to values of a DataFrame using coordinates

I have a dataframe a:
Out[68]:
p0_4 p5_7 p8_9 p10_14 p15 p16_17 p18_19 p20_24 p25_29 \
0 1360.0 921.0 676.0 1839.0 336.0 668.0 622.0 1190.0 1399.0
1 308.0 197.0 187.0 411.0 67.0 153.0 172.0 336.0 385.0
2 76.0 59.0 40.0 72.0 16.0 36.0 20.0 56.0 82.0
3 765.0 608.0 409.0 1077.0 220.0 359.0 342.0 873.0 911.0
4 1304.0 906.0 660.0 1921.0 375.0 725.0 645.0 1362.0 1474.0
5 195.0 135.0 78.0 262.0 44.0 97.0 100.0 265.0 229.0
6 1036.0 965.0 701.0 1802.0 335.0 701.0 662.0 1321.0 1102.0
7 5072.0 3798.0 2865.0 7334.0 1399.0 2732.0 2603.0 4976.0 4575.0
8 1360.0 962.0 722.0 1758.0 357.0 710.0 713.0 1761.0 1660.0
9 743.0 508.0 369.0 1118.0 286.0 615.0 429.0 738.0 885.0
10 1459.0 1015.0 679.0 1732.0 337.0 746.0 677.0 1493.0 1546.0
11 828.0 519.0 415.0 1057.0 190.0 439.0 379.0 788.0 1024.0
12 1042.0 690.0 503.0 1204.0 219.0 451.0 465.0 1193.0 1406.0
p30_44 p45_59 p60_64 p65_74 p75_84 p85_89 p90plus
0 4776.0 8315.0 2736.0 5463.0 2819.0 738.0 451.0
1 1004.0 2456.0 988.0 2007.0 1139.0 313.0 153.0
2 291.0 529.0 187.0 332.0 108.0 31.0 10.0
3 2807.0 5505.0 2060.0 4104.0 2129.0 516.0 252.0
4 4524.0 9406.0 3034.0 6003.0 3366.0 840.0 471.0
5 806.0 1490.0 606.0 1288.0 664.0 185.0 108.0
6 4127.0 8311.0 2911.0 6111.0 3525.0 1029.0 707.0
7 16917.0 27547.0 8145.0 15950.0 9510.0 2696.0 1714.0
8 5692.0 9380.0 3288.0 6458.0 3830.0 1050.0 577.0
9 2749.0 5696.0 2014.0 4165.0 2352.0 603.0 288.0
10 4676.0 7654.0 2502.0 5077.0 3004.0 754.0 461.0
11 2799.0 4880.0 1875.0 3951.0 2294.0 551.0 361.0
12 3288.0 5661.0 1974.0 4007.0 2343.0 623.0 303.0
and a series d:
Out[70]:
2 p45_59
10 p45_59
11 p45_59
Is there a simple way to add 1 to each number in a whose index and column labels appear in d?
I have tried:
a[d] +=1
However this adds 1 to every value in the column, not just the values with indices 2, 10 and 11.
Thanking you in advance.
You might want to try this.
a.loc[list(d.index), list(d.values)] += 1
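If d could map different rows to different columns, selecting with two lists of labels addresses the whole rows-by-columns block rather than individual cells. A minimal sketch of a cell-by-cell alternative follows, assuming a small made-up frame `a` and Series `d` (the values and the second column choice are hypothetical, not taken from the question):

```python
import pandas as pd

# Hypothetical miniature data: d maps a row label to a column label.
a = pd.DataFrame({"p30_44": [1.0, 2.0, 3.0], "p45_59": [10.0, 20.0, 30.0]})
d = pd.Series({0: "p45_59", 2: "p30_44"})

# Update exactly one cell per entry of d.
for row, col in d.items():
    a.at[row, col] += 1

print(a)
```

An explicit loop is fine for a handful of coordinates; it touches only the (row, column) pairs listed in d.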

Second Line in Matplotlib plot is inaccurate/runs all over the grid

I'm trying to plot fantasy points from two players in every game since the start of the NBA season.
I've created a dataframe that has the lines of every player for every night, and I want to plot every date on which each of the two has played.
The two dataframes look like this:
kemba[['Date','FP']]
Date FP
Rk
260 10/23/2019 2.0
532 10/25/2019 28.0
754 10/26/2019 49.0
1390 10/30/2019 35.0
1628 11/1/2019 39.5
2178 11/5/2019 32.5
2463 11/7/2019 17.5
2800 11/9/2019 40.0
3103 11/11/2019 37.5
3410 11/13/2019 37.0
3699 11/15/2019 25.0
4001 11/17/2019 22.5
4186 11/18/2019 22.0
4494 11/20/2019 9.5
4750 11/22/2019 4.0
5637 11/27/2019 50.5
5904 11/29/2019 19.0
6193 12/1/2019 22.5
6677 12/4/2019 43.5
6975 12/6/2019 26.0
7454 12/9/2019 33.5
7769 12/11/2019 57.0
7861 12/12/2019 31.5
8614 12/18/2019 35.5
9071 12/20/2019 5.0
9289 12/22/2019 26.0
100 12/25/2019 23.0
ingram[['Date','FP']]
Date FP
Rk
22 10/22/2019 31.5
441 10/25/2019 37.5
646 10/26/2019 57.0
984 10/28/2019 41.5
1439 10/31/2019 30.0
1718 11/2/2019 10.5
1994 11/4/2019 59.0
2586 11/8/2019 30.0
2757 11/9/2019 31.5
4245 11/19/2019 30.5
4532 11/21/2019 38.5
4864 11/23/2019 40.5
5022 11/24/2019 32.5
5496 11/27/2019 22.0
5784 11/29/2019 43.0
6111 12/1/2019 31.0
6404 12/3/2019 40.0
6737 12/5/2019 27.0
7038 12/7/2019 18.0
7372 12/9/2019 38.5
7668 12/11/2019 29.0
7958 12/13/2019 38.0
8283 12/15/2019 32.5
8551 12/17/2019 24.0
8612 12/18/2019 48.0
8891 12/20/2019 30.5
102 12/23/2019 31.0
55 12/25/2019 46.5
This is how I plot the data:
import matplotlib.pyplot as plt
# creating x & y for Ingram
ingram_fp=ingram['FP']
ingram_date=ingram['Date']
# creating x and y for Kemba
kemba_fp=kemba['FP']
kemba_date=kemba['Date']
fig=plt.figure()
plt.plot(kemba_date,kemba_fp,color='#FF5733',linewidth=1,marker='.',label='Walker')
plt.plot(ingram_date,ingram_fp,color='#33A7FF',marker='.',label='Ingram')
fig.autofmt_xdate()
plt.show()
When I do this, the line for Ingram is all over the place. Any idea what went wrong?
This is the plot I get
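It looks like the Date column might hold strings rather than datetimes, in which case matplotlib treats the values as categories instead of points on a time axis.
Convert the column before plotting by modifying your code as follows: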
import pandas as pd
# creating x & y for Ingram
ingram_fp=ingram['FP']
ingram_date=pd.to_datetime(ingram['Date'])
# creating x and y for Kemba
kemba_fp=kemba['FP']
kemba_date=pd.to_datetime(kemba['Date'])
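The conversion step can be checked on its own before plotting. This is a minimal sketch, assuming the Date column holds month/day/year strings as in the tables in the question; the three rows are copied from the ingram table, and the explicit `format` argument and the sort are my additions.

```python
import pandas as pd

# Three rows copied from the ingram table; Date starts out as plain strings.
ingram = pd.DataFrame(
    {"Date": ["10/22/2019", "11/2/2019", "12/25/2019"], "FP": [31.5, 10.5, 46.5]}
)

# Parse the strings into real timestamps and order the rows chronologically.
ingram["Date"] = pd.to_datetime(ingram["Date"], format="%m/%d/%Y")
ingram = ingram.sort_values("Date")

print(ingram.dtypes)
```

Once the column has a datetime dtype, it can be passed to `plt.plot` as in the question and both players share one time axis.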

Using shift and rolling in pandas with groupBy

import numpy as np
import pandas as pd
df = pd.DataFrame(dict(
list(
zip(["A", "B", "C"],
[np.array(["id %02d" % i for i in range(1, 11)]).repeat(10),
pd.date_range("2018-01-01", periods=100).strftime("%Y-%m-%d"),
[i for i in range(10, 110)]])
)
))
df = df.groupby(["A", "B"]).sum()
df["D"] = df["C"].shift(1).rolling(2).mean()
df
This code generates the following:
I want the rolling logic to start over for every new ID. Right now, ID 02 is using the last two values from ID 01 to calculate the mean.
How can this be achieved?
I believe you need groupby:
df['D'] = df["C"].shift(1).groupby(df['A'], group_keys=False).rolling(2).mean()
print (df.head(20))
C D
A B
id 01 2018-01-01 10 NaN
2018-01-02 11 NaN
2018-01-03 12 10.5
2018-01-04 13 11.5
2018-01-05 14 12.5
2018-01-06 15 13.5
2018-01-07 16 14.5
2018-01-08 17 15.5
2018-01-09 18 16.5
2018-01-10 19 17.5
id 02 2018-01-11 20 NaN
2018-01-12 21 19.5
2018-01-13 22 20.5
2018-01-14 23 21.5
2018-01-15 24 22.5
2018-01-16 25 23.5
2018-01-17 26 24.5
2018-01-18 27 25.5
2018-01-19 28 26.5
2018-01-20 29 27.5
Or:
df['D'] = df["C"].groupby(df['A']).shift(1).rolling(2).mean()
print (df.head(20))
C D
A B
id 01 2018-01-01 10 NaN
2018-01-02 11 NaN
2018-01-03 12 10.5
2018-01-04 13 11.5
2018-01-05 14 12.5
2018-01-06 15 13.5
2018-01-07 16 14.5
2018-01-08 17 15.5
2018-01-09 18 16.5
2018-01-10 19 17.5
id 02 2018-01-11 20 NaN
2018-01-12 21 NaN
2018-01-13 22 20.5
2018-01-14 23 21.5
2018-01-15 24 22.5
2018-01-16 25 23.5
2018-01-17 26 24.5
2018-01-18 27 25.5
2018-01-19 28 26.5
2018-01-20 29 27.5
While the accepted answer by @jezrael works correctly for positive shifts, it gives partially incorrect results for negative shifts. Compare the following:
df['D'] = df["C"].groupby(df['A']).shift(1).rolling(2).mean()
df['E'] = df["C"].groupby(df['A']).rolling(2).mean().shift(1).values
df['F'] = df["C"].groupby(df['A']).shift(-1).rolling(2).mean()
df['G'] = df["C"].groupby(df['A']).rolling(2).mean().shift(-1).values
df.set_index(['A', 'B'], inplace=True)
print(df.head(20))
C D E F G
A B
id 01 2018-01-01 10 NaN NaN NaN 10.5
2018-01-02 11 NaN NaN 11.5 11.5
2018-01-03 12 10.5 10.5 12.5 12.5
2018-01-04 13 11.5 11.5 13.5 13.5
2018-01-05 14 12.5 12.5 14.5 14.5
2018-01-06 15 13.5 13.5 15.5 15.5
2018-01-07 16 14.5 14.5 16.5 16.5
2018-01-08 17 15.5 15.5 17.5 17.5
2018-01-09 18 16.5 16.5 18.5 18.5
2018-01-10 19 17.5 17.5 NaN NaN
id 02 2018-01-11 20 NaN 18.5 NaN 20.5
2018-01-12 21 NaN NaN 21.5 21.5
2018-01-13 22 20.5 20.5 22.5 22.5
2018-01-14 23 21.5 21.5 23.5 23.5
2018-01-15 24 22.5 22.5 24.5 24.5
2018-01-16 25 23.5 23.5 25.5 25.5
2018-01-17 26 24.5 24.5 26.5 26.5
2018-01-18 27 25.5 25.5 27.5 27.5
2018-01-19 28 26.5 26.5 28.5 28.5
2018-01-20 29 27.5 27.5 NaN NaN
Note that columns D and E are computed with .shift(1), and columns F and G with .shift(-1). Column E is incorrect because the first value of id 02 uses the last two values of id 01. Column F is incorrect because the first value is NaN for both id 01 and id 02. Columns D and G give correct results.
So the full answer is as follows. If the shift period is non-negative, use:
df['D'] = df["C"].groupby(df['A']).shift(1).rolling(2).mean()
If the shift period is negative, use:
df['G'] = df["C"].groupby(df['A']).rolling(2).mean().shift(-1).values
Hope it helps!
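Another way to make the window restart for every ID is to run both the shift and the rolling mean inside each group with `transform`. This is a minimal sketch for the positive-shift case only, assuming a small made-up frame with two IDs of four rows each (the data is hypothetical and `A` is kept as a regular column rather than an index level):

```python
import numpy as np
import pandas as pd

# Hypothetical miniature data: two IDs with four consecutive values each.
df = pd.DataFrame({
    "A": np.repeat(["id 01", "id 02"], 4),
    "C": range(10, 18),
})

# Shift and roll within each group, so no value leaks across IDs.
df["D"] = df.groupby("A")["C"].transform(lambda x: x.shift(1).rolling(2).mean())

print(df)
```

Because the lambda only ever sees one group's values, the first two rows of every ID are NaN regardless of the window size.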

df.plot.density() returns an empty plot despite values actually being there

I currently have this data set
WSPD GST WD WVHT DPD APD MWD BAR ATMP WTMP
Date
2005-06-06 03:00:00 8.2 9.8 86 NaN NaN NaN 77 1011.1 28.8 29
2005-06-06 04:00:00 9.4 11.2 96 NaN NaN NaN 79 1011.6 29 29
2005-06-06 05:00:00 9.4 10.9 103 NaN NaN NaN 78 1011.6 29 28.9
2005-06-06 06:00:00 7.5 9 114 NaN NaN NaN 84 1011.4 27.7 28.9
2005-06-06 07:00:00 7 10.4 118 NaN NaN NaN 85 1011.1 25.4 28.9
I am attempting to do a probability density plot for the column with WVHT. However, when I type:
df['WVHT'].plot.density()
I receive an empty plot despite the column actually having values. Please note this is just a sample of the data.
