My dataframe has an index column of dates and one column
var
date
2020-03-10 77
2020-03-11 88
2020-03-12 99
I have an array and I want to append its values to the dataframe one at a time. I have tried a few methods, but nothing is working.
My code is something like this:
for i in range(20):
    x = i*i
    df.append(x)
After each iteration, the dataframe needs to have the x value appended.
Final output:
var
date
2020-03-10 77
2020-03-11 88
2020-03-12 99
2020-03-13 1
2020-03-14 4
2020-03-15 9
.
.
. 20 times
Will be grateful for any suggestions.
Try this:
import pandas as pd
tmpdf = pd.DataFrame({"var":[77,88,99]},index=pd.date_range("2020-03-10",periods=3,freq='D'))
for i in range(1,21):
    # next date = last date in the index plus one day
    idx = tmpdf.tail(1).index[0] + pd.Timedelta(days=1)
    tmpdf.loc[idx] = i*i
Output:
2020-03-10 77
2020-03-11 88
2020-03-12 99
2020-03-13 1
2020-03-14 4
2020-03-15 9
2020-03-16 16
2020-03-17 25
2020-03-18 36
2020-03-19 49
2020-03-20 64
2020-03-21 81
2020-03-22 100
2020-03-23 121
2020-03-24 144
2020-03-25 169
2020-03-26 196
2020-03-27 225
2020-03-28 256
2020-03-29 289
2020-03-30 324
2020-03-31 361
2020-04-01 400
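As an alternative to growing the frame row by row, the new rows can be built once and concatenated. This is only a sketch: the names `new_values`, `new_index` and `new_rows` are made up for illustration, and it assumes all the values are available up front. Note that `DataFrame.append` was removed in pandas 2.0, so `df.append(x)` from the question fails on current versions regardless of the argument.

```python
import pandas as pd

# Starting frame from the question
df = pd.DataFrame(
    {"var": [77, 88, 99]},
    index=pd.date_range("2020-03-10", periods=3, freq="D", name="date"),
)

# Values to add (squares of 1..20, as in the answer above)
new_values = [i * i for i in range(1, 21)]

# One new date per value, starting the day after the last existing date
new_index = pd.date_range(
    df.index[-1] + pd.Timedelta(days=1),
    periods=len(new_values),
    freq="D",
    name="date",
)
new_rows = pd.DataFrame({"var": new_values}, index=new_index)

# A single concat instead of 20 separate row insertions
df = pd.concat([df, new_rows])
print(df.tail())
```

If the values really do arrive one at a time, the `.loc` approach above is fine for a few rows; collecting them in a list and concatenating once tends to scale better.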
Related
I have a dataframe of daily stock data, which is indexed by a datetimeindex.
There are multiple stock entries, thus there are duplicate datetimeindex values.
I am looking for a way to:
Group the dataframe by the stock symbol
Resample the prices for each symbol group into monthly price frequency data
Perform a pct_change calculation on each symbol group monthly price
Store it as a new column 'monthly_return' in the original dataframe.
I have been able to manage the first three operations. Storing the result in the original dataframe is where I'm having some trouble.
To illustrate this, I created a toy dataset which includes a 'dummy' index (idx) column which I use to assist creation of the desired output later on in the third code block.
import random
import pandas as pd
import numpy as np
PER = 62  # number of days; inferred from the sample output (idx runs 0..123 over two passes)
datelist = pd.date_range(pd.Timestamp(2018, 1, 1), periods=PER).to_pydatetime().tolist() * 2
ids = [random.choice(['A', 'B']) for i in range(len(datelist))]
prices = random.sample(range(200), len(datelist))
idx = range(len(datelist))
df1 = pd.DataFrame(data=zip(idx, ids, prices), index=datelist, columns='idx label prices'.split())
print(df1.head(10))
df1
idx label prices
2018-01-01 0 B 40
2018-01-02 1 A 190
2018-01-03 2 A 159
2018-01-04 3 A 25
2018-01-05 4 A 89
2018-01-06 5 B 164
...
2018-01-31 30 A 102
2018-02-01 31 A 117
2018-02-02 32 A 120
2018-02-03 33 B 75
2018-02-04 34 B 170
...
Desired Output
idx label prices monthly_return
2018-01-01 0 B 40 0.000000
2018-01-02 1 A 190 0.000000
2018-01-03 2 A 159 0.000000
2018-01-04 3 A 25 0.000000
2018-01-05 4 A 89 0.000000
2018-01-06 5 B 164 0.000000
...
2018-01-31 30 A 102 -0.098039
2018-02-01 31 A 117 0.000000
2018-02-02 32 A 120 0.000000
...
2018-02-26 56 B 152 0.000000
2018-02-27 57 B 2 0.000000
2018-02-28 58 B 49 -0.040816
2018-03-01 59 B 188 0.000000
...
2018-01-28 89 A 88 0.000000
2018-01-29 90 A 26 0.000000
2018-01-30 91 B 128 0.000000
2018-01-31 92 A 144 -0.098039
...
2018-02-26 118 A 92 0.000000
2018-02-27 119 B 111 0.000000
2018-02-28 120 B 34 -0.040816
...
What I have tried so far is:
dfX = df1.copy(deep=True)
dfX = df1.groupby('label').resample('M')['prices'].last().pct_change(1).shift(-1)
print(dfX)
Which outputs:
label
A 2018-01-31 -0.067961
2018-02-28 -0.364583
2018-03-31 0.081967
B 2018-01-31 1.636364
2018-02-28 -0.557471
2018-03-31 NaN
This is quite close to what I want. However, I only get pct_change values on the end-of-month dates, which makes them awkward to store back in the original dataframe (df1) as a new column.
Something like this doesn't work:
dfX = df1.copy(deep=True)
dfX['monthly_return'] = df1.groupby('label').resample('M')['prices'].last().pct_change(1).shift(-1)
As it yields the error:
TypeError: incompatible index of inserted column with frame index
I have considered 'upsampling' the monthly_return data back into a daily series, however this could likely end up causing the same error mentioned above since the original dataset could be missing dates (such as weekends). Additionally, resetting the index to clear this error would still create problems as the grouped dfX does not have the same number of rows/frequency as the original df1 which is of daily frequency.
I have a hunch that this can be done by using multi-indexing and dataframe merging however I am unsure how to go about doing so.
This generates my desired output, but it isn't as clean a solution as I was hoping for.
df1 is generated the same as before (code given in question):
idx label prices
2018-01-01 0 A 145
2018-01-02 1 B 86
2018-01-03 2 B 141
...
2018-01-25 86 B 12
2018-01-26 87 B 71
2018-01-27 88 B 186
2018-01-28 89 B 151
2018-01-29 90 A 161
2018-01-30 91 B 143
2018-01-31 92 B 88
...
Then:
def fun(x):
    dates = x.date
    x = x.set_index('date', drop=True)
    x['monthly_return'] = x.resample('M').last()['prices'].pct_change(1).shift(-1)
    x = x.reindex(dates)
    return x
dfX = df1.copy(deep=True)
dfX.reset_index(inplace=True)
dfX.columns = 'date idx label prices'.split()
dfX = dfX.groupby('label').apply(fun).droplevel(level='label')
print(dfX)
Which outputs the desired result (unsorted):
idx label prices monthly_return
date
2018-01-01 0 A 145 NaN
2018-01-06 5 A 77 NaN
2018-01-08 7 A 48 NaN
2018-01-09 8 A 31 NaN
2018-01-11 10 A 20 NaN
2018-01-12 11 A 27 NaN
2018-01-14 13 A 109 NaN
2018-01-15 14 A 166 NaN
2018-01-17 16 A 130 NaN
2018-01-18 17 A 139 NaN
2018-01-19 18 A 191 NaN
2018-01-21 20 A 164 NaN
2018-01-22 21 A 112 NaN
2018-01-23 22 A 167 NaN
2018-01-25 24 A 140 NaN
2018-01-26 25 A 42 NaN
2018-01-30 29 A 107 NaN
2018-02-04 34 A 9 NaN
2018-02-07 37 A 84 NaN
2018-02-08 38 A 23 NaN
2018-02-10 40 A 30 NaN
2018-02-12 42 A 89 NaN
2018-02-15 45 A 79 NaN
2018-02-16 46 A 115 NaN
2018-02-19 49 A 197 NaN
2018-02-21 51 A 11 NaN
2018-02-26 56 A 111 NaN
2018-02-27 57 A 126 NaN
2018-03-01 59 A 135 NaN
2018-03-03 61 A 28 NaN
2018-01-01 62 A 120 NaN
2018-01-03 64 A 170 NaN
2018-01-05 66 A 45 NaN
2018-01-07 68 A 173 NaN
2018-01-08 69 A 158 NaN
2018-01-09 70 A 63 NaN
2018-01-11 72 A 62 NaN
2018-01-12 73 A 168 NaN
2018-01-14 75 A 169 NaN
2018-01-15 76 A 142 NaN
2018-01-17 78 A 83 NaN
2018-01-18 79 A 96 NaN
2018-01-21 82 A 25 NaN
2018-01-22 83 A 90 NaN
2018-01-23 84 A 59 NaN
2018-01-29 90 A 161 NaN
2018-02-01 93 A 150 NaN
2018-02-04 96 A 85 NaN
2018-02-06 98 A 124 NaN
2018-02-14 106 A 195 NaN
2018-02-16 108 A 136 NaN
2018-02-17 109 A 134 NaN
2018-02-18 110 A 183 NaN
2018-02-19 111 A 32 NaN
2018-02-24 116 A 102 NaN
2018-02-25 117 A 72 NaN
2018-02-27 119 A 38 NaN
2018-03-02 122 A 137 NaN
2018-03-03 123 A 171 NaN
2018-01-02 1 B 86 NaN
2018-01-03 2 B 141 NaN
2018-01-04 3 B 189 NaN
2018-01-05 4 B 60 NaN
2018-01-07 6 B 1 NaN
2018-01-10 9 B 87 NaN
2018-01-13 12 B 44 NaN
2018-01-16 15 B 147 NaN
2018-01-20 19 B 92 NaN
2018-01-24 23 B 81 NaN
2018-01-27 26 B 190 NaN
2018-01-28 27 B 24 NaN
2018-01-29 28 B 116 NaN
2018-01-31 30 B 98 1.181818
2018-02-01 31 B 121 NaN
2018-02-02 32 B 110 NaN
2018-02-03 33 B 66 NaN
2018-02-05 35 B 4 NaN
2018-02-06 36 B 13 NaN
2018-02-09 39 B 114 NaN
2018-02-11 41 B 16 NaN
2018-02-13 43 B 174 NaN
2018-02-14 44 B 78 NaN
2018-02-17 47 B 144 NaN
2018-02-18 48 B 14 NaN
2018-02-20 50 B 133 NaN
2018-02-22 52 B 156 NaN
2018-02-23 53 B 159 NaN
2018-02-24 54 B 177 NaN
2018-02-25 55 B 43 NaN
2018-02-28 58 B 19 -0.338542
2018-03-02 60 B 127 NaN
2018-01-02 63 B 2 NaN
2018-01-04 65 B 97 NaN
2018-01-06 67 B 8 NaN
2018-01-10 71 B 54 NaN
2018-01-13 74 B 106 NaN
2018-01-16 77 B 74 NaN
2018-01-19 80 B 188 NaN
2018-01-20 81 B 172 NaN
2018-01-24 85 B 51 NaN
2018-01-25 86 B 12 NaN
2018-01-26 87 B 71 NaN
2018-01-27 88 B 186 NaN
2018-01-28 89 B 151 NaN
2018-01-30 91 B 143 NaN
2018-01-31 92 B 88 1.181818
2018-02-02 94 B 75 NaN
2018-02-03 95 B 103 NaN
2018-02-05 97 B 82 NaN
2018-02-07 99 B 128 NaN
2018-02-08 100 B 123 NaN
2018-02-09 101 B 52 NaN
2018-02-10 102 B 18 NaN
2018-02-11 103 B 21 NaN
2018-02-12 104 B 50 NaN
2018-02-13 105 B 64 NaN
2018-02-15 107 B 185 NaN
2018-02-20 112 B 125 NaN
2018-02-21 113 B 108 NaN
2018-02-22 114 B 132 NaN
2018-02-23 115 B 180 NaN
2018-02-26 118 B 67 NaN
2018-02-28 120 B 192 -0.338542
2018-03-01 121 B 58 NaN
Perhaps there is a more concise and pythonic way of doing this.
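One possibly more compact route, following the merging idea from the question, is to compute the monthly series per label and merge it back on (label, month). The sketch below is an assumption-laden illustration, not the accepted method: it uses a tiny hand-made frame instead of the random toy data, and the names `work`, `monthly`, `ret` and `out` are hypothetical. It uses a monthly period column rather than `resample`, so no month-end frequency alias is needed.

```python
import numpy as np
import pandas as pd

# Tiny hand-made stand-in for df1 (hypothetical values, not the random toy data)
df1 = pd.DataFrame(
    {
        "label": ["A", "B", "A", "B", "A", "B"],
        "prices": [100, 50, 110, 40, 121, 60],
    },
    index=pd.DatetimeIndex(
        ["2018-01-30", "2018-01-31", "2018-01-31",
         "2018-02-28", "2018-02-28", "2018-03-31"],
        name="date",
    ),
)

work = df1.reset_index()
work["month"] = work["date"].dt.to_period("M")

# Last price per (label, month), taken in row order
monthly = work.groupby(["label", "month"])["prices"].last()

# Forward-looking monthly return, computed within each label
ret = (
    monthly.groupby(level="label").pct_change()
    .groupby(level="label").shift(-1)
    .rename("monthly_return")
    .reset_index()
)

# A left merge keeps every daily row in its original order
out = work.merge(ret, on=["label", "month"], how="left")

# Keep the value only on calendar month-end rows, as in the desired output
out.loc[~out["date"].dt.is_month_end, "monthly_return"] = np.nan
out = out.drop(columns="month").set_index("date")
print(out)
```

Two caveats: the change is computed between consecutive months that actually have data for a label, and the mask uses calendar month ends, so a month whose last trading day falls earlier gets no value. Replace the NaN values with `fillna(0)` if zeros are preferred, as in the desired output.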
I want to format dates in pandas as year-month-day. My dates run from April to September; I have no values from January, February, etc., but pandas sometimes reads the day as the month and the month as the day. Look at index 16 or 84.
6 2019-08-26 15:10:00
7 2019-08-25 13:22:00
8 2019-08-24 16:06:00
9 2019-08-23 15:13:00
10 2019-08-22 14:24:00
11 2019-08-21 14:02:00
12 2019-08-16 12:31:00
13 2019-08-15 15:31:00
14 2019-08-14 14:46:00
15 2019-08-13 17:13:00
16 2019-11-08 15:54:00
17 2019-10-08 10:07:00
68 2019-06-06 11:22:00
69 2019-05-06 15:16:00
70 2019-01-06 17:02:00
75 2019-05-21 09:01:00
76 2019-05-19 16:52:00
77 2019-05-15 15:40:00
78 2019-10-05 13:34:00
81 2019-06-05 11:55:00
82 2019-03-05 17:28:00
83 2019-02-05 18:01:00
84 2019-01-05 17:05:00
85 2019-01-05 09:57:00
86 2019-04-30 10:16:00
87 2019-04-29 17:51:00
88 2019-04-27 17:42:00
How to fix this?
I want date-type values (year-month-day), without the time, so that I can group by day or by month.
I have tried this, but it does not work:
df['Created'] = pd.to_datetime(df['Created'], format = 'something')
And for grouping by month, I have tried this:
df['Created'] = df['Created'].dt.to_period('M')
Solution for the sample data: parse the column twice, once per possible format, using errors='coerce' so that values that do not match a format become NaT. Then fill the missing values in the second Series (YYYY-DD-MM) from the first Series (YYYY-MM-DD) with Series.combine_first or Series.fillna:
a = pd.to_datetime(df['Created'], format = '%Y-%m-%d %H:%M:%S', errors='coerce')
b = pd.to_datetime(df['Created'], format = '%Y-%d-%m %H:%M:%S', errors='coerce')
df['Created'] = b.combine_first(a).dt.to_period('M')
#alternative
#df['Created'] = b.fillna(a).dt.to_period('M')
print (df)
Created
6 2019-08
7 2019-08
8 2019-08
9 2019-08
10 2019-08
11 2019-08
12 2019-08
13 2019-08
14 2019-08
15 2019-08
16 2019-08
17 2019-08
68 2019-06
69 2019-06
70 2019-06
75 2019-05
76 2019-05
77 2019-05
78 2019-05
81 2019-05
82 2019-05
83 2019-05
84 2019-05
85 2019-05
86 2019-04
87 2019-04
88 2019-04
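To see how the two-pass parse behaves, here is a small sketch on three of the sample values, assuming the column holds strings. The variable names (`ymd`, `ydm`, `created`, `days`, `months`) are made up for illustration. It also shows one way to drop the time part for grouping by day.

```python
import pandas as pd

# Three values from the question, keyed by their original index
s = pd.Series(
    ["2019-08-26 15:10:00", "2019-11-08 15:54:00", "2019-01-05 17:05:00"],
    index=[6, 16, 84],
)

# Parse once per candidate format; non-matching values become NaT
ymd = pd.to_datetime(s, format="%Y-%m-%d %H:%M:%S", errors="coerce")
ydm = pd.to_datetime(s, format="%Y-%d-%m %H:%M:%S", errors="coerce")

# Prefer the year-day-month reading, fall back to year-month-day
created = ydm.combine_first(ymd)

days = created.dt.normalize()        # midnight timestamps, for grouping by day
months = created.dt.to_period("M")   # monthly periods, for grouping by month
print(days)
print(months)
```

Be aware that any value whose day is 12 or lower parses under both formats, so this approach always reads it as year-day-month. That suits this data, where no real dates fall outside April to September, but it is an assumption about the data rather than something pandas can detect.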
I created a dummy dataframe to parse this. Try strftime
from datetime import datetime
import time
import pandas as pd
time1 = datetime.now()
time.sleep(6)
time2 = datetime.now()
df = pd.DataFrame({'Created': [time1, time2]})
df['Created2'] = df['Created'].apply(lambda x: x.strftime('%Y-%m-%d'))
print(df.head())
df1
slot Time Location User
56 2017-10-26 22:15:00 89 1
2 2017-10-27 00:30:00 54 1
20 2017-10-28 05:00:00 64 1
24 2017-10-29 06:00:00 2 1
91 2017-11-01 22:45:00 78 1
62 2017-11-02 15:30:00 99 1
91 2017-11-02 22:45:00 34 1
47 2017-10-26 20:15:00 465 2
1 2017-10-27 00:10:00 67 2
20 2017-10-28 05:00:00 5746 2
28 2017-10-29 07:00:00 36 2
91 2017-11-01 22:45:00 786 2
58 2017-11-02 14:30:00 477 2
95 2017-11-02 23:45:00 7322 2
df2
slot
2
91
62
58
I need the output df3 as
slot Time Location User
2 2017-10-27 00:30:00 54 1
91 2017-11-01 22:45:00 78 1
91 2017-11-02 22:45:00 34 1
91 2017-11-01 22:45:00 786 2
62 2017-11-02 15:30:00 99 1
58 2017-11-02 14:30:00 477 2
If those were CSV files, we could join them with:
join File1 file2 > file3
But how can we do the same for dataframes in a Jupyter notebook?
Try isin:
df1[df1.slot.isin(df2.slot)]
Output:
slot Time Location User
1 2 2017-10-27 00:30:00 54 1
4 91 2017-11-01 22:45:00 78 1
5 62 2017-11-02 15:30:00 99 1
6 91 2017-11-02 22:45:00 34 1
11 91 2017-11-01 22:45:00 786 2
12 58 2017-11-02 14:30:00 477 2
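`isin` keeps the rows in df1's order. If the result should instead follow the order of the slots in df2, as in the requested df3, an inner merge with df2 on the left is one option. A minimal sketch, with the Time column left out for brevity and the names `filtered` and `joined` made up for illustration:

```python
import pandas as pd

# Data from the question (Time column omitted)
df1 = pd.DataFrame({
    "slot": [56, 2, 20, 24, 91, 62, 91, 47, 1, 20, 28, 91, 58, 95],
    "Location": [89, 54, 64, 2, 78, 99, 34, 465, 67, 5746, 36, 786, 477, 7322],
    "User": [1] * 7 + [2] * 7,
})
df2 = pd.DataFrame({"slot": [2, 91, 62, 58]})

# Filter: rows stay in df1's order
filtered = df1[df1["slot"].isin(df2["slot"])]

# Inner merge with df2 on the left: rows are grouped in df2's slot order
joined = df2.merge(df1, on="slot", how="inner")

print(filtered["slot"].tolist())
print(joined["slot"].tolist())
```

The merge is the closer analogue of the shell `join` command. If df2 could contain duplicate slots, the merge would repeat the matching rows while `isin` would not.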
I have a dataframe with dates and values from column A to H. Also, I have some fixed variables X1=5, X2=6, Y1=7,Y2=8, Z1=9
Date A B C D E F G H
0 2018-01-02 00:00:00 7161 7205 -44 54920 73 7 5 47073
1 2018-01-03 00:00:00 7101 7147 -46 54710 73 6 5 46570
2 2018-01-04 00:00:00 7146 7189 -43 54730 70 7 5 46933
3 2018-01-05 00:00:00 7079 7121 -43 54720 70 6 5 46404
4 2018-01-08 00:00:00 7080 7125 -45 54280 70 6 5 46355
5 2018-01-09 00:00:00 7060 7102 -43 54440 70 6 5 46319
6 2018-01-10 00:00:00 7113 7153 -40 54510 70 7 5 46837
7 2018-01-11 00:00:00 7103 7141 -38 54690 70 7 5 46728
8 2018-01-12 00:00:00 7074 7110 -36 54310 65 6 5 46357
9 2018-01-15 00:00:00 7181 7210 -29 54320 65 6 5 46792
10 2018-01-16 00:00:00 7036 7078 -42 54420 65 6 5 45709
11 2018-01-17 00:00:00 6994 7034 -40 53690 65 6 5 45416
12 2018-01-18 00:00:00 7032 7076 -44 53590 65 6 5 45705
13 2018-01-19 00:00:00 6999 7041 -42 53560 65 6 5 45331
14 2018-01-22 00:00:00 7025 7068 -43 53500 65 6 5 45455
15 2018-01-23 00:00:00 6883 6923 -41 53490 65 6 5 44470
16 2018-01-24 00:00:00 7111 7150 -39 52630 65 6 5 45866
17 2018-01-25 00:00:00 7101 7138 -37 53470 65 6 5 45663
18 2018-01-26 00:00:00 7043 7085 -43 53380 65 6 5 45087
19 2018-01-29 00:00:00 7041 7085 -44 53370 65 6 5 44958
20 2018-01-30 00:00:00 7010 7050 -41 53040 65 6 5 44790
21 2018-01-31 00:00:00 7079 7118 -39 52880 65 6 5 45248
What I want to do is add some simple column-wise calculations to this dataframe, using the values in columns A to H as well as those fixed variables.
The tricky part is that I need to apply different variables to different date ranges.
For example, from 2018-01-01 to 2018-01-10, I want to calculate a new column I whose value equals (A+B+C)*X1*Y1+Z1,
while from 2018-01-11 to 2018-01-25 the calculation needs to be (A+B+C)*X2*Y1+Z1. Similarly, Y1 and Y2 each apply to their own date ranges.
I know this can create a new column I:
df['I']=(df['A']+df['B']+df['C'])*X1*Y1+Z1
but I am not sure how to get the flexibility to use different variables for different date ranges.
You can use np.select to define a value based on a condition:
cond = [df.Date.between('2018-01-01','2018-01-10'), df.Date.between('2018-01-11','2018-01-25')]
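# the second range uses X2 with Y1, as stated in the question; swap in Y2 for the ranges where it applies
values = [(df['A']+df['B']+df['C'])*X1*Y1+Z1, (df['A']+df['B']+df['C'])*X2*Y1+Z1]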
# select values depending on the condition
df['I'] = np.select(cond, values)
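A small worked sketch of the same idea, using three rows from the question and the formulas exactly as the question states them. The `default=np.nan` argument is an addition here: without it, `np.select` fills rows that match no condition with 0, which could be mistaken for a computed value. The name `base` is made up for illustration.

```python
import numpy as np
import pandas as pd

X1, X2, Y1, Z1 = 5, 6, 7, 9

# Three rows from the question: one per date range, plus one outside both
df = pd.DataFrame({
    "Date": pd.to_datetime(["2018-01-02", "2018-01-11", "2018-01-26"]),
    "A": [7161, 7103, 7043],
    "B": [7205, 7141, 7085],
    "C": [-44, -38, -43],
})

base = df["A"] + df["B"] + df["C"]

# between() is inclusive on both ends by default
cond = [
    df["Date"].between("2018-01-01", "2018-01-10"),
    df["Date"].between("2018-01-11", "2018-01-25"),
]
values = [
    base * X1 * Y1 + Z1,
    base * X2 * Y1 + Z1,
]

# Rows outside every range get NaN instead of the default 0
df["I"] = np.select(cond, values, default=np.nan)
print(df)
```

For the first row, (7161 + 7205 - 44) * 5 * 7 + 9 = 501279; for the second, (7103 + 7141 - 38) * 6 * 7 + 9 = 596661; the third falls outside both ranges and stays NaN.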
I've been searching SO and haven't figured this out yet. Hoping someone can help this Python newbie solve my problem.
I'm trying to figure out how to write an if/then statement in Python and perform an aggregation based on that if/then statement. My end goal is: if the date is 1/7/2017, use the value in the "fake" column; for all other dates, average the two columns together.
Here is what I have so far:
import pandas as pd
import numpy as np
import datetime
np.random.seed(42)
dte=pd.date_range(start=datetime.date(2017,1,1), end= datetime.date(2017,1,15))
fake=np.random.randint(15,100, size=15)
fake2=np.random.randint(300,1000,size=15)
so_df=pd.DataFrame({'date':dte,
'fake':fake,
'fake2':fake2})
so_df['avg']= so_df[['fake','fake2']].mean(axis=1)
so_df.head()
Assuming you have already computed the average column:
so_df['fake'].where(so_df['date']=='20170107', so_df['avg'])
Out:
0 375.5
1 260.0
2 331.0
3 267.5
4 397.0
5 355.0
6 89.0
7 320.5
8 449.0
9 395.5
10 197.0
11 438.5
12 498.5
13 409.5
14 525.5
Name: fake, dtype: float64
If not, you can replace the column reference with the same calculation:
so_df['fake'].where(so_df['date']=='20170107', so_df[['fake','fake2']].mean(axis=1))
To check for multiple dates, you need to use the element-wise version of the or operator (which is pipe: |). Otherwise it will raise an error.
so_df['fake'].where((so_df['date']=='20170107') | (so_df['date']=='20170109'), so_df['avg'])
The above checks for two dates. In the case of 3 or more, you may want to use isin with a list:
so_df['fake'].where(so_df['date'].isin(['20170107', '20170109', '20170112']), so_df['avg'])
Out[42]:
0 375.5
1 260.0
2 331.0
3 267.5
4 397.0
5 355.0
6 89.0
7 320.5
8 38.0
9 395.5
10 197.0
11 67.0
12 498.5
13 409.5
14 525.5
Name: fake, dtype: float64
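The point above about `|` versus `or` can be checked with a short sketch. The Series and variable names here are made up for illustration. Python's `or` asks for the truth value of a whole Series, which pandas refuses to give, while `|` combines the two masks element by element.

```python
import pandas as pd

# Three consecutive dates starting on 2017-01-07
dates = pd.Series(pd.date_range("2017-01-07", periods=3))

a = dates == "2017-01-07"
b = dates == "2017-01-09"

# Plain `or` needs a single True/False for the whole Series and raises
try:
    a or b
    raised = False
except ValueError:
    raised = True

# Element-wise OR; the parentheses matter when comparisons are written inline
either = (dates == "2017-01-07") | (dates == "2017-01-09")

print(raised)
print(either.tolist())
```

The same applies to `and` versus `&`, and to `not` versus `~`.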
Let's use np.where:
so_df['avg'] = np.where(so_df['date'] == pd.to_datetime('2017-01-07'),
                        so_df['fake'],
                        so_df[['fake', 'fake2']].mean(1))
Output:
date fake fake2 avg
0 2017-01-01 66 685 375.5
1 2017-01-02 29 491 260.0
2 2017-01-03 86 576 331.0
3 2017-01-04 75 460 267.5
4 2017-01-05 35 759 397.0
5 2017-01-06 97 613 355.0
6 2017-01-07 89 321 89.0
7 2017-01-08 89 552 320.5
8 2017-01-09 38 860 449.0
9 2017-01-10 17 774 395.5
10 2017-01-11 36 358 197.0
11 2017-01-12 67 810 438.5
12 2017-01-13 16 981 498.5
13 2017-01-14 44 775 409.5
14 2017-01-15 52 999 525.5
One way to do if-else in pandas is to use np.where.
It takes three arguments: the condition, the value to use where the condition is true, and the value to use otherwise.
so_df['avg']= np.where(so_df['date'] == '2017-01-07',so_df['fake'],so_df[['fake','fake2']].mean(axis=1))
date fake fake2 avg
0 2017-01-01 66 685 375.5
1 2017-01-02 29 491 260.0
2 2017-01-03 86 576 331.0
3 2017-01-04 75 460 267.5
4 2017-01-05 35 759 397.0
5 2017-01-06 97 613 355.0
6 2017-01-07 89 321 89.0
7 2017-01-08 89 552 320.5
8 2017-01-09 38 860 449.0
9 2017-01-10 17 774 395.5
10 2017-01-11 36 358 197.0
11 2017-01-12 67 810 438.5
12 2017-01-13 16 981 498.5
13 2017-01-14 44 775 409.5
14 2017-01-15 52 999 525.5
We can also use the Series.where() method:
In [141]: so_df['avg'] = so_df['fake'] \
     ...:     .where(so_df['date'].isin(['2017-01-07','2017-01-09'])) \
     ...:     .fillna(so_df[['fake','fake2']].mean(1))
     ...:
In [142]: so_df
Out[142]:
date fake fake2 avg
0 2017-01-01 66 685 375.5
1 2017-01-02 29 491 260.0
2 2017-01-03 86 576 331.0
3 2017-01-04 75 460 267.5
4 2017-01-05 35 759 397.0
5 2017-01-06 97 613 355.0
6 2017-01-07 89 321 89.0
7 2017-01-08 89 552 320.5
8 2017-01-09 38 860 38.0
9 2017-01-10 17 774 395.5
10 2017-01-11 36 358 197.0
11 2017-01-12 67 810 438.5
12 2017-01-13 16 981 498.5
13 2017-01-14 44 775 409.5
14 2017-01-15 52 999 525.5