I have a pandas DataFrame like this:
content
date
2013-12-18 12:30:00 1
2013-12-19 10:50:00 1
2013-12-24 11:00:00 0
2014-01-02 11:30:00 1
2014-01-03 11:50:00 0
2013-12-17 16:40:00 10
2013-12-18 10:00:00 0
2013-12-11 10:00:00 0
2013-12-18 11:45:00 0
2013-12-11 14:40:00 4
2010-05-25 13:05:00 0
2013-11-18 14:10:00 0
2013-11-27 11:50:00 3
2013-11-13 10:40:00 0
2013-11-20 10:40:00 1
2008-11-04 14:49:00 1
2013-11-18 10:05:00 0
2013-08-27 11:00:00 0
2013-09-18 16:00:00 0
2013-09-27 11:40:00 0
date being the index.
I reduce the values to months using:
dataFrame = dataFrame.groupby([lambda x: x.year, lambda x: x.month]).agg([sum])
which outputs:
content
sum
2006 3 66
4 65
5 48
6 87
7 37
8 54
9 73
10 74
11 53
12 45
2007 1 28
2 40
3 95
4 63
5 56
6 66
7 50
8 49
9 18
10 28
Now when I plot this DataFrame, I want the x-axis to show every month/year as a tick. I have tried setting xticks but it doesn't seem to work. How can this be achieved? This is my current plot using dataFrame.plot():
You can use set_xticks() and set_xticklabels():
import numpy as np
import pandas as pd

idx = pd.date_range("2013-01-01", periods=1000)
val = np.random.rand(1000)
s = pd.Series(val, idx)
g = s.groupby([s.index.year, s.index.month]).mean()
ax = g.plot()
ax.set_xticks(range(len(g)))
ax.set_xticklabels(["%s-%02d" % item for item in g.index.tolist()], rotation=90)
output:
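A variation on the same idea (a sketch on the toy data above, not part of the original answer): group by a monthly PeriodIndex instead of (year, month) tuples. The result keeps a date-like index, so pandas can draw sensible date ticks without manual set_xticks():

```python
import numpy as np
import pandas as pd

# Same toy series as above
idx = pd.date_range("2013-01-01", periods=1000)
s = pd.Series(np.random.rand(1000), idx)

# Group by calendar month; the result has a PeriodIndex,
# which pandas/matplotlib render with proper date ticks
g = s.groupby(s.index.to_period("M")).mean()

# The "%Y-%m" labels used above come for free here
print(g.index[0], g.index[-1], len(g))  # 2013-01 2015-09 33
```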
I have the following dataframe structure:
exec_start_date exec_finish_date hour_start hour_finish session_qtd
2020-03-01 2020-03-02 22 0 1
2020-03-05 2020-03-05 22 23 3
2020-03-03 2020-03-04 18 7 4
2020-03-07 2020-03-07 18 18 2
As you can see above, I have three situations of session execution:
1) Start on one day and finish on another day, with different hours
2) Start and finish on the same day, with different hours
3) Start and finish on the same day, with the same hour
I need to create a column holding the interval between hour_start and hour_finish, and another column with the execution date. Then:
If a session starts on one day and finishes on another, the hours executed on the start date should be filled with exec_start_date and the hours executed on the following day with exec_finish_date.
So, an intermediary dataset would look like this:
exec_start_date exec_finish_date hour_start hour_finish session_qtd hour_interval
2020-03-01 2020-03-02 22 0 1 [22,23,0]
2020-03-05 2020-03-05 22 23 3 [22,23]
2020-03-03 2020-03-04 20 3 4 [20,21,22,23,0,1,2,3]
2020-03-07 2020-03-07 18 18 2 [18]
And the final dataset would be like this:
exec_date session_qtd hour_interval
2020-03-01 1 22
2020-03-01 1 23
2020-03-02 1 0
2020-03-05 3 22
2020-03-05 3 23
2020-03-03 4 20
2020-03-03 4 21
2020-03-03 4 22
2020-03-03 4 23
2020-03-04 4 0
2020-03-04 4 1
2020-03-04 4 2
2020-03-04 4 3
2020-03-07 2 18
I have tried to create the interval with np.arange but it didn't work properly for all cases, especially the cases that start on one day and finish on another.
Can you help me?
The way I would do this is to build the full datetimes to get the time interval, then pull the hours from that range. np.arange will not work because hours wrap around midnight.
#Create two new temp columns that hold the full start and end datetimes
df['start'] = df['exec_start_date'].astype(str) + " " + df['hour_start'].astype(str) + ":00.000"
df['end'] = df['exec_finish_date'].astype(str) + " " + df['hour_finish'].astype(str) + ":00.000"
#Convert them both to datetimes
df['start'] = pd.to_datetime(df['start'])
df['end'] = pd.to_datetime(df['end'])
#Apply a function to get your range
df['date'] = df.apply(lambda x: pd.date_range(x['start'], x['end'], freq='H').tolist(), axis = 1)
#explode new date column to get 1 column for all the created interval dates
df = df.explode(column = 'date')
#explode leaves an object-dtype column; convert back to datetime before using .dt
df['date'] = pd.to_datetime(df['date'])
#Make two new columns based on your final requested table based on new column
df['exec_date'] = df['date'].dt.date
df['hour_interval'] = df['date'].dt.hour
#Make a copy of the columns you wanted in a new df
df2 = df[['exec_date','session_qtd','hour_interval']].copy()
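End to end, the approach above can be sketched on the question's sample rows (to_timedelta is used here instead of string concatenation to build the timestamps, and the explicit to_datetime after explode is needed because explode returns an object column):

```python
import pandas as pd

# The question's sample rows
df = pd.DataFrame({
    "exec_start_date": ["2020-03-01", "2020-03-05", "2020-03-03", "2020-03-07"],
    "exec_finish_date": ["2020-03-02", "2020-03-05", "2020-03-04", "2020-03-07"],
    "hour_start": [22, 22, 18, 18],
    "hour_finish": [0, 23, 7, 18],
    "session_qtd": [1, 3, 4, 2],
})

# Full start/end timestamps (date plus hour)
df["start"] = pd.to_datetime(df["exec_start_date"]) + pd.to_timedelta(df["hour_start"], unit="h")
df["end"] = pd.to_datetime(df["exec_finish_date"]) + pd.to_timedelta(df["hour_finish"], unit="h")

# One timestamp per hour in each interval, then one row per timestamp
df["date"] = df.apply(lambda r: pd.date_range(r["start"], r["end"], freq="h"), axis=1)
out = df.explode("date")
out["date"] = pd.to_datetime(out["date"])  # explode leaves object dtype

out["exec_date"] = out["date"].dt.date
out["hour_interval"] = out["date"].dt.hour
out = out[["exec_date", "session_qtd", "hour_interval"]].reset_index(drop=True)
print(len(out))  # 3 + 2 + 14 + 1 = 20 rows
```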
Make df dtype string
df=df.astype(str)
Concat date with hour and coerce to datetime
df['exec_start_date']=pd.to_datetime(df['exec_start_date'].str.cat(df['hour_start'], sep=' ')+ ':00:00')
df['exec_finish_date']=pd.to_datetime(df['exec_finish_date'].str.cat(df['hour_finish'], sep=' ')+ ':00:00')
Derive hourly periods between the start and end datetime
df['date'] = df.apply(lambda x: pd.period_range(start=pd.Period(x['exec_start_date'],freq='H'), end=pd.Period(x['exec_finish_date'],freq='H'), freq='H').hour.tolist(), axis = 1)
Explode date to achieve outcome
df.explode('date')
The last step can be skipped by exploding right after deriving the hourly periods:
df.assign(date=df.apply(lambda x: pd.period_range(start=pd.Period(x['exec_start_date'],freq='H'), end=pd.Period(x['exec_finish_date'],freq='H'), freq='H').hour.tolist(), axis = 1)).explode('date')
Outcome
exec_start_date exec_finish_date hour_start hour_finish session_qtd \
0 2020-03-01 22:00:00 2020-03-02 00:00:00 22 0 1
0 2020-03-01 22:00:00 2020-03-02 00:00:00 22 0 1
0 2020-03-01 22:00:00 2020-03-02 00:00:00 22 0 1
1 2020-03-05 22:00:00 2020-03-05 23:00:00 22 23 3
1 2020-03-05 22:00:00 2020-03-05 23:00:00 22 23 3
2 2020-03-03 18:00:00 2020-03-04 07:00:00 18 7 4
2 2020-03-03 18:00:00 2020-03-04 07:00:00 18 7 4
2 2020-03-03 18:00:00 2020-03-04 07:00:00 18 7 4
2 2020-03-03 18:00:00 2020-03-04 07:00:00 18 7 4
2 2020-03-03 18:00:00 2020-03-04 07:00:00 18 7 4
2 2020-03-03 18:00:00 2020-03-04 07:00:00 18 7 4
2 2020-03-03 18:00:00 2020-03-04 07:00:00 18 7 4
2 2020-03-03 18:00:00 2020-03-04 07:00:00 18 7 4
2 2020-03-03 18:00:00 2020-03-04 07:00:00 18 7 4
2 2020-03-03 18:00:00 2020-03-04 07:00:00 18 7 4
2 2020-03-03 18:00:00 2020-03-04 07:00:00 18 7 4
2 2020-03-03 18:00:00 2020-03-04 07:00:00 18 7 4
2 2020-03-03 18:00:00 2020-03-04 07:00:00 18 7 4
2 2020-03-03 18:00:00 2020-03-04 07:00:00 18 7 4
3 2020-03-07 18:00:00 2020-03-07 18:00:00 18 18 2
date
0 22
0 23
0 0
1 22
1 23
2 18
2 19
2 20
2 21
2 22
2 23
2 0
2 1
2 2
2 3
2 4
2 5
2 6
2 7
3 18
Another approach without converting the date fields:
import pandas as pd
import numpy as np
ss = '''
exec_start_date,exec_finish_date,hour_start,hour_finish,session_qtd
2020-03-01,2020-03-02,22,0,1
2020-03-05,2020-03-05,22,23,3
2020-03-03,2020-03-04,18,7,4
2020-03-07,2020-03-07,18,18,2
'''.strip()
with open('data.csv','w') as f: f.write(ss)
########## main script #############
df = pd.read_csv('data.csv')
df['hour_start'] = df['hour_start'].astype('int')
df['hour_finish'] = df['hour_finish'].astype('int')
# get hour list, group by day
df['hour_interval'] = df.apply(lambda x: list(range(x['hour_start'],x['hour_finish']+1)) if x['exec_start_date']==x['exec_finish_date'] else [(101,) + tuple(range(x['hour_start'],24))]+[(102,)+tuple(range(0,x['hour_finish']+1))], axis=1)
lst_col = 'hour_interval'
# split day group to separate rows
hrdf = pd.DataFrame({
col:np.repeat(df[col].values, df[lst_col].str.len())
for col in df.columns.drop(lst_col)}
).assign(**{lst_col:np.concatenate(df[lst_col].values)})[df.columns]
# choose start\end day
hrdf['exec_date'] = hrdf.apply(lambda x: x['exec_start_date'] if type(x['hour_interval']) is int or x['hour_interval'][0] == 101 else x['exec_finish_date'], axis=1)
# remove day indicator
hrdf['hour_interval'] = hrdf['hour_interval'].apply(lambda x: [x] if type(x) is int else list(x[1:]))
# split hours to separate rows
df = hrdf
hrdf = pd.DataFrame({
col:np.repeat(df[col].values, df[lst_col].str.len())
for col in df.columns.drop(lst_col)}
).assign(**{lst_col:np.concatenate(df[lst_col].values)})[df.columns]
# columns for final output
df_final = hrdf[['exec_date','session_qtd','hour_interval']]
print(df_final.to_string(index=False))
Output
exec_date session_qtd hour_interval
2020-03-01 1 22
2020-03-01 1 23
2020-03-02 1 0
2020-03-05 3 22
2020-03-05 3 23
2020-03-03 4 18
2020-03-03 4 19
2020-03-03 4 20
2020-03-03 4 21
2020-03-03 4 22
2020-03-03 4 23
2020-03-04 4 0
2020-03-04 4 1
2020-03-04 4 2
2020-03-04 4 3
2020-03-04 4 4
2020-03-04 4 5
2020-03-04 4 6
2020-03-04 4 7
2020-03-07 2 18
I want to extract the year from a datetime column into a new 'yyyy' column, AND I want the missing values (NaT) to be displayed as NaN. So I guess the datetime dtype of the new column has to change, but that's where I'm stuck.
Initial df:
Date ID
0 2016-01-01 12
1 2015-01-01 96
2 NaT 20
3 2018-01-01 73
4 2017-01-01 84
5 NaT 26
6 2013-01-01 87
7 2016-01-01 64
8 2019-01-01 11
9 2014-01-01 34
Desired df:
Date ID yyyy
0 2016-01-01 12 2016
1 2015-01-01 96 2015
2 NaT 20 NaN
3 2018-01-01 73 2018
4 2017-01-01 84 2017
5 NaT 26 NaN
6 2013-01-01 87 2013
7 2016-01-01 64 2016
8 2019-01-01 11 2019
9 2014-01-01 34 2014
Code:
import pandas as pd
import numpy as np
# example df
df = pd.DataFrame({"ID": [12,96,20,73,84,26,87,64,11,34],
"Date": ['2016-01-01', '2015-01-01', np.nan, '2018-01-01', '2017-01-01', np.nan, '2013-01-01', '2016-01-01', '2019-01-01', '2014-01-01']})
df.ID = pd.to_numeric(df.ID)
df.Date = pd.to_datetime(df.Date)
print(df)
#extraction of year from date
df['yyyy'] = pd.to_datetime(df.Date).dt.strftime('%Y')
#Try to set NaT to NaN or datetime to numeric; PROBLEM: empty cells keep 'NaT'
#(try1)
df.loc[(df['yyyy'].isna()), 'yyyy'] = np.nan
#(try2)
df.yyyy = df.Date.astype(float)
#(try3)
df.yyyy = pd.to_numeric(df.Date)
print(df)
Use Series.dt.year, converting to integers with the nullable Int64 dtype:
df.Date = pd.to_datetime(df.Date)
df['yyyy'] = df.Date.dt.year.astype('Int64')
print (df)
ID Date yyyy
0 12 2016-01-01 2016
1 96 2015-01-01 2015
2 20 NaT <NA>
3 73 2018-01-01 2018
4 84 2017-01-01 2017
5 26 NaT <NA>
6 87 2013-01-01 2013
7 64 2016-01-01 2016
8 11 2019-01-01 2019
9 34 2014-01-01 2014
Without converting floats to integers:
df['yyyy'] = df.Date.dt.year
print (df)
ID Date yyyy
0 12 2016-01-01 2016.0
1 96 2015-01-01 2015.0
2 20 NaT NaN
3 73 2018-01-01 2018.0
4 84 2017-01-01 2017.0
5 26 NaT NaN
6 87 2013-01-01 2013.0
7 64 2016-01-01 2016.0
8 11 2019-01-01 2019.0
9 34 2014-01-01 2014.0
Your solution converts NaT to the string 'NaT', so it is possible to use replace.
By the way, in recent versions of pandas the replace is no longer necessary; strftime already returns NaN for NaT.
df['yyyy'] = pd.to_datetime(df.Date).dt.strftime('%Y').replace('NaT', np.nan)
Isn't it:
df['yyyy'] = df.Date.dt.year
Output:
Date ID yyyy
0 2016-01-01 12 2016.0
1 2015-01-01 96 2015.0
2 NaT 20 NaN
3 2018-01-01 73 2018.0
4 2017-01-01 84 2017.0
5 NaT 26 NaN
6 2013-01-01 87 2013.0
7 2016-01-01 64 2016.0
8 2019-01-01 11 2019.0
9 2014-01-01 34 2014.0
For pandas 0.24.2+, you can use Int64 data type for nullable integers:
df['yyyy'] = df.Date.dt.year.astype('Int64')
which gives:
Date ID yyyy
0 2016-01-01 12 2016
1 2015-01-01 96 2015
2 NaT 20 <NA>
3 2018-01-01 73 2018
4 2017-01-01 84 2017
5 NaT 26 <NA>
6 2013-01-01 87 2013
7 2016-01-01 64 2016
8 2019-01-01 11 2019
9 2014-01-01 34 2014
I have a dataframe like this:
Start date end date A B
01.01.2020 30.06.2020 2 3
01.01.2020 31.12.2020 3 1
01.04.2020 30.04.2020 6 2
01.01.2021 31.12.2021 2 3
01.07.2020 31.12.2020 8 2
01.01.2020 31.12.2023 1 2
.......
I would like to split the rows where end − start > 1 year (see the last row, where end = 2023 and start = 2020), keeping the same value in column A while splitting the value in column B proportionally:
Start date end date A B
01.01.2020 30.06.2020 2 3
01.01.2020 31.12.2020 3 1
01.04.2020 30.04.2020 6 2
01.01.2021 31.12.2021 2 3
01.07.2020 31.12.2020 8 2
01.01.2020 31.12.2020 1 2/4
01.01.2021 31.12.2021 1 2/4
01.01.2022 31.12.2022 1 2/4
01.01.2023 31.12.2023 1 2/4
.......
Any idea?
Here is my solution. See the comments below:
import io
import pandas as pd
# TEST DATA:
text=""" start end A B
01.01.2020 30.06.2020 2 3
01.01.2020 31.12.2020 3 1
01.04.2020 30.04.2020 6 2
01.01.2021 31.12.2021 2 3
01.07.2020 31.12.2020 8 2
31.12.2020 20.01.2021 12 12
31.12.2020 01.01.2021 22 22
30.12.2020 01.01.2021 32 32
10.05.2020 28.09.2023 44 44
27.11.2020 31.12.2023 88 88
31.12.2020 31.12.2023 100 100
01.01.2020 31.12.2021 200 200
"""
df= pd.read_csv(io.StringIO(text), sep=r"\s+", engine="python", parse_dates=[0,1])
#print("\n----\n df:",df)
#----------------------------------------
# SOLUTION:
def split_years(r):
"""
Split row 'r' where "end"-"start" greater than 0.
The new rows have repeated values of 'A', and 'B' divided by the number of years.
Return: a DataFrame with rows per year.
"""
t1,t2 = r["start"], r["end"]
ys= t2.year - t1.year
kk= 0 if t1.is_year_end else 1
if ys>0:
l1=[t1] + [ t1+pd.offsets.YearBegin(i) for i in range(1,ys+1) ]
l2=[ t1+pd.offsets.YearEnd(i) for i in range(kk,ys+kk) ] + [t2]
return pd.DataFrame({"start":l1, "end":l2, "A":r.A,"B": r.B/len(l1)})
print("year difference <= 0!")
return None
# Create two groups, one for rows where the 'start' and 'end' is in the same year, and one for the others:
grps= df.groupby(lambda idx: (df.loc[idx,"start"].year-df.loc[idx,"end"].year)!=0 ).groups
print("\n---- grps:\n",grps)
# Extract the "one year" rows in a data frame:
df1= df.loc[grps[False]]
#print("\n---- df1:\n",df1)
# Extract the rows to be splitted:
df2= df.loc[grps[True]]
print("\n---- df2:\n",df2)
# Split the rows and put the resulting data frames into a list:
ldfs=[ split_years(df2.loc[row]) for row in df2.index ]
print("\n---- ldfs:")
for fr in ldfs:
print(fr,"\n")
# Insert the "one year" data frame to the list, and concatenate them:
ldfs.insert(0,df1)
df_rslt= pd.concat(ldfs,sort=False)
#print("\n---- df_rslt:\n",df_rslt)
# Housekeeping:
df_rslt= df_rslt.sort_values("start").reset_index(drop=True)
print("\n---- df_rslt:\n",df_rslt)
Outputs:
---- grps:
{False: Int64Index([0, 1, 2, 3, 4], dtype='int64'), True: Int64Index([5, 6, 7, 8, 9, 10, 11], dtype='int64')}
---- df2:
start end A B
5 2020-12-31 2021-01-20 12 12
6 2020-12-31 2021-01-01 22 22
7 2020-12-30 2021-01-01 32 32
8 2020-10-05 2023-09-28 44 44
9 2020-11-27 2023-12-31 88 88
10 2020-12-31 2023-12-31 100 100
11 2020-01-01 2021-12-31 200 200
---- ldfs:
start end A B
0 2020-12-31 2020-12-31 12 6.0
1 2021-01-01 2021-01-20 12 6.0
start end A B
0 2020-12-31 2020-12-31 22 11.0
1 2021-01-01 2021-01-01 22 11.0
start end A B
0 2020-12-30 2020-12-31 32 16.0
1 2021-01-01 2021-01-01 32 16.0
start end A B
0 2020-10-05 2020-12-31 44 11.0
1 2021-01-01 2021-12-31 44 11.0
2 2022-01-01 2022-12-31 44 11.0
3 2023-01-01 2023-09-28 44 11.0
start end A B
0 2020-11-27 2020-12-31 88 22.0
1 2021-01-01 2021-12-31 88 22.0
2 2022-01-01 2022-12-31 88 22.0
3 2023-01-01 2023-12-31 88 22.0
start end A B
0 2020-12-31 2020-12-31 100 25.0
1 2021-01-01 2021-12-31 100 25.0
2 2022-01-01 2022-12-31 100 25.0
3 2023-01-01 2023-12-31 100 25.0
start end A B
0 2020-01-01 2020-12-31 200 100.0
1 2021-01-01 2021-12-31 200 100.0
---- df_rslt:
start end A B
0 2020-01-01 2020-06-30 2 3.0
1 2020-01-01 2020-12-31 3 1.0
2 2020-01-01 2020-12-31 200 100.0
3 2020-01-04 2020-04-30 6 2.0
4 2020-01-07 2020-12-31 8 2.0
5 2020-10-05 2020-12-31 44 11.0
6 2020-11-27 2020-12-31 88 22.0
7 2020-12-30 2020-12-31 32 16.0
8 2020-12-31 2020-12-31 12 6.0
9 2020-12-31 2020-12-31 100 25.0
10 2020-12-31 2020-12-31 22 11.0
11 2021-01-01 2021-12-31 100 25.0
12 2021-01-01 2021-12-31 88 22.0
13 2021-01-01 2021-12-31 44 11.0
14 2021-01-01 2021-01-01 32 16.0
15 2021-01-01 2021-01-01 22 11.0
16 2021-01-01 2021-01-20 12 6.0
17 2021-01-01 2021-12-31 2 3.0
18 2021-01-01 2021-12-31 200 100.0
19 2022-01-01 2022-12-31 88 22.0
20 2022-01-01 2022-12-31 100 25.0
21 2022-01-01 2022-12-31 44 11.0
22 2023-01-01 2023-09-28 44 11.0
23 2023-01-01 2023-12-31 88 22.0
24 2023-01-01 2023-12-31 100 25.0
A bit of a different approach: adding new columns instead of new rows. But I think this accomplishes what you want to do.
import datetime as dt

df["years_apart"] = (
(df["end_date"] - df["start_date"]).dt.days / 365
).astype(int)
for years in range(1, df["years_apart"].max().astype(int)):
df[f"{years}_end_date"] = pd.NaT
df.loc[
df["years_apart"] == years, f"{years}_end_date"
] = df.loc[
df["years_apart"] == years, "start_date"
] + dt.timedelta(days=365*years)
df["B_bis"] = df["B"] / df["years_apart"]
Output
start_date end_date years_apart 1_end_date 2_end_date ...
2018-01-01 2018-01-02 0 NaT NaT
2018-01-02 2019-01-02 1 2019-01-02 NaT
2018-01-03 2020-01-03 2 NaT 2020-01-03
I solved it by creating a date difference and a counter that adds years to the repeated rows:
from datetime import timedelta
import numpy as np

#calculate difference between start and end year
table['diff'] = (table['end'] - table['start'])//timedelta(days=365)
table['diff'] = table['diff']+1
#replicate rows depending on number of years
table = table.reindex(table.index.repeat(table['diff']))
#counter that increase for diff>1, assign increasing years to the replicated rows
table['count'] = table['diff'].groupby(table['diff']).cumsum()//table['diff']
table['start'] = np.where(table['diff']>1, table['start']+table['count']-1, table['start'])
table['end'] = table['start']
#split B among years
table['B'] = table['B']//table['diff']
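The repeat-and-counter idea can also be written with groupby(...).cumcount(); here is a sketch on two assumed rows (one single-year, one spanning 2020–2023), assuming periods align to calendar years as in the question's example:

```python
import pandas as pd

# Assumed sample: one single-year row and one row spanning 2020-2023
table = pd.DataFrame({
    "start": pd.to_datetime(["2020-01-01", "2020-01-01"]),
    "end": pd.to_datetime(["2020-12-31", "2023-12-31"]),
    "A": [3, 1],
    "B": [1.0, 2.0],
})

years = table["end"].dt.year - table["start"].dt.year + 1   # rows needed per original row
out = table.loc[table.index.repeat(years)].copy()

# 0, 1, 2, ... within each repeated group -> the year offset for that copy
offset = out.groupby(level=0).cumcount()
out["start"] = out["start"] + offset.map(lambda k: pd.DateOffset(years=k))
out["end"] = out["start"] + pd.offsets.YearEnd(1)   # each copy ends at its own year end
out["B"] = out["B"] / out.index.map(years)          # split B evenly across the copies
out = out.reset_index(drop=True)
print(out)
```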
I have a dataframe with dates and values from column A to H. Also, I have some fixed variables X1=5, X2=6, Y1=7, Y2=8, Z1=9.
Date A B C D E F G H
0 2018-01-02 00:00:00 7161 7205 -44 54920 73 7 5 47073
1 2018-01-03 00:00:00 7101 7147 -46 54710 73 6 5 46570
2 2018-01-04 00:00:00 7146 7189 -43 54730 70 7 5 46933
3 2018-01-05 00:00:00 7079 7121 -43 54720 70 6 5 46404
4 2018-01-08 00:00:00 7080 7125 -45 54280 70 6 5 46355
5 2018-01-09 00:00:00 7060 7102 -43 54440 70 6 5 46319
6 2018-01-10 00:00:00 7113 7153 -40 54510 70 7 5 46837
7 2018-01-11 00:00:00 7103 7141 -38 54690 70 7 5 46728
8 2018-01-12 00:00:00 7074 7110 -36 54310 65 6 5 46357
9 2018-01-15 00:00:00 7181 7210 -29 54320 65 6 5 46792
10 2018-01-16 00:00:00 7036 7078 -42 54420 65 6 5 45709
11 2018-01-17 00:00:00 6994 7034 -40 53690 65 6 5 45416
12 2018-01-18 00:00:00 7032 7076 -44 53590 65 6 5 45705
13 2018-01-19 00:00:00 6999 7041 -42 53560 65 6 5 45331
14 2018-01-22 00:00:00 7025 7068 -43 53500 65 6 5 45455
15 2018-01-23 00:00:00 6883 6923 -41 53490 65 6 5 44470
16 2018-01-24 00:00:00 7111 7150 -39 52630 65 6 5 45866
17 2018-01-25 00:00:00 7101 7138 -37 53470 65 6 5 45663
18 2018-01-26 00:00:00 7043 7085 -43 53380 65 6 5 45087
19 2018-01-29 00:00:00 7041 7085 -44 53370 65 6 5 44958
20 2018-01-30 00:00:00 7010 7050 -41 53040 65 6 5 44790
21 2018-01-31 00:00:00 7079 7118 -39 52880 65 6 5 45248
What I want to do is add some simple column-wise calculations to this dataframe, using the values in columns A to H as well as those fixed variables.
The tricky part is that I need to apply different variables to different date ranges.
For example, during 2018-01-01 to 2018-01-10, I want to calculate a new column I whose value equals (A+B+C)*X1*Y1+Z1;
while during 2018-01-11 to 2018-01-25, the calculation needs to be (A+B+C)*X2*Y1+Z1. Similarly, Y1 and Y2 each apply to their own date ranges.
I know this calculates/creates a new column I:
df['I'] = (df['A'] + df['B'] + df['C'])*X1*Y1 + Z1
but I'm not sure how to get the flexibility to use different variables for different date ranges.
You can use np.select to define a value based on a condition:
cond = [df.Date.between('2018-01-01','2018-01-10'), df.Date.between('2018-01-11','2018-01-25')]
values = [(df['A']+df['B']+df['C'])*X1*Y1+Z1, (df['A']+df['B']+df['C'])*X2*Y2+Z1]
# select values depending on the condition
df['I'] = np.select(cond, values)
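A small self-check of this pattern on assumed toy values (using the question's fixed variables; note that, as written above, the second range uses Y2):

```python
import numpy as np
import pandas as pd

# Fixed variables from the question
X1, X2, Y1, Y2, Z1 = 5, 6, 7, 8, 9

# Toy rows: one in each range, one in neither
df = pd.DataFrame({
    "Date": pd.to_datetime(["2018-01-02", "2018-01-12", "2018-01-30"]),
    "A": [1, 1, 1], "B": [2, 2, 2], "C": [3, 3, 3],
})

cond = [df.Date.between("2018-01-01", "2018-01-10"),
        df.Date.between("2018-01-11", "2018-01-25")]
values = [(df["A"] + df["B"] + df["C"]) * X1 * Y1 + Z1,
          (df["A"] + df["B"] + df["C"]) * X2 * Y2 + Z1]

# Rows matching neither condition get the default (NaN here)
df["I"] = np.select(cond, values, default=np.nan)
print(df["I"].tolist())  # [219.0, 297.0, nan]
```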
I plotted a data frame like this:
Date Quote-Spread
0 2013-11-17 2.0
1 2013-12-10 8.0
2 2013-12-11 8.0
3 2014-06-01 5.0
4 2014-06-23 15.0
5 2014-06-24 45.0
6 2014-06-25 10.0
7 2014-06-28 20.0
8 2014-09-13 50000.0
9 2015-03-30 250000.0
10 2016-04-02 103780.0
11 2016-04-03 119991.0
12 2016-04-04 29994.0
13 2016-04-05 69993.0
14 2016-04-06 39997.0
15 2016-04-09 490321.0
16 2016-04-10 65485.0
17 2016-04-11 141470.0
18 2016-04-12 109939.0
19 2016-04-13 29983.0
20 2016-04-16 39964.0
21 2016-04-17 39964.0
22 2016-04-18 79920.0
23 2016-04-19 29997.0
24 2016-04-20 108414.0
25 2016-04-23 126849.0
26 2016-04-24 206853.0
27 2016-04-25 37559.0
28 2016-04-26 22817.0
29 2016-04-27 37506.0
30 2016-04-30 37597.0
31 2016-05-01 18799.0
32 2016-05-02 18799.0
33 2016-05-03 9400.0
34 2016-05-07 29890.0
35 2016-05-08 29193.0
36 2016-05-09 7792.0
37 2016-05-10 3199.0
38 2016-05-11 8538.0
39 2016-05-14 49937.0
I use this command to plot it in IPython:
df2.plot(x= 'Date', y='Quote-Spread')
plt.show()
But my figure is plotted like this:
As you can see, on 2016-04-23 the Quote-Spread has a value of about 126,000. But in the plot it is just zero.
My whole plot looks like this:
Here is the code for the original data:
Sachad = df.loc[df['SID']== 40065016131938148]
#Drop rows with any zero
df1 = df1[~(df1 == 0).any(axis = 1)]
df1['Quote-Spread'] = (df1['SellPrice'].mask(df1['SellPrice'].eq(0))-
df1['BuyPrice'].mask(df1['BuyPrice'].eq(0))).abs()
df2 = df1.groupby('Date' , as_index = False )['Quote-Spread'].mean()
df2.plot(x= 'Date', y='Quote-Spread')
plt.show()
Another question: how can I plot only specific dates, e.g. between 2014-04-01 and 2016-06-01, and draw vertical red lines at 2014-06-06 and 2016-01-06?
Please provide the code that produced the working plot. Any warning messages?
As for your last questions: to select the rows you want, you can simply use the > and < operators to compare datetimes in a boolean condition.
For vertical lines, you can use plt.axvline(x=date, color='r')
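Putting both suggestions together in a minimal sketch (toy stand-in data; the Agg backend is only so the snippet runs headless):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for this sketch
import matplotlib.pyplot as plt
import pandas as pd

# Toy stand-in for df2
df2 = pd.DataFrame({
    "Date": pd.to_datetime(["2014-03-01", "2014-07-01", "2015-01-01",
                            "2016-05-01", "2016-08-01"]),
    "Quote-Spread": [10.0, 15.0, 20.0, 30.0, 25.0],
})

# Keep only the window 2014-04-01 .. 2016-06-01
mask = (df2["Date"] > "2014-04-01") & (df2["Date"] < "2016-06-01")
ax = df2[mask].plot(x="Date", y="Quote-Spread")

# Vertical red lines at the two requested dates
for d in ["2014-06-06", "2016-01-06"]:
    ax.axvline(x=pd.Timestamp(d), color="r")
```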