Taking away all previous values in a column in a dataframe - python

I am using some data where I need to find the time difference between the current row and every previous row, i.e. in row 2 I need the time between row 2 and row 1 and between row 2 and row 0; in row 5 I need the time between row 5 and row 4, row 5 and row 3, ..., row 5 and row 0. I then want one big dataframe with all these differences in it (as well as the other columns).
I have made a test dataframe for this
data = {'random': [1, 3, 9, 3, 4, 7, 8, 10],
        'timestamp': [2, 138, 157, 232, 245, 302, 323, 379]}
df = pd.DataFrame(data)
I then tried to do
for i in range(0, len(df-1)):
    difference = df.timestamp.diff(periods=i+1)
    print(difference)
The idea is to iterate through the rows, subtracting the previous row on the first iteration, the row two back on the second iteration, and so on.
I am stuck on how to combine this into one large dataframe after all the iterations AND how to make sure the loop uses the original dataframe at the start of each iteration (not the dataframe from the previous iteration).
This is what is being outputted
0 NaN
1 136.0
2 19.0
3 75.0
4 13.0
5 57.0
6 21.0
7 56.0
Name: timestamp, dtype: float64
0 NaN
1 NaN
2 155.0
3 94.0
4 88.0
5 70.0
6 78.0
7 77.0
Name: timestamp, dtype: float64
0 NaN
1 NaN
2 NaN
3 230.0
4 107.0
5 145.0
6 91.0
7 134.0
Name: timestamp, dtype: float64
0 NaN
1 NaN
2 NaN
3 NaN
4 243.0
5 164.0
6 166.0
7 147.0
Name: timestamp, dtype: float64
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
5 300.0
6 185.0
7 222.0
Name: timestamp, dtype: float64
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
6 321.0
7 241.0
Name: timestamp, dtype: float64
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
6 NaN
7 377.0
Name: timestamp, dtype: float64
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
6 NaN
7 NaN
Name: timestamp, dtype: float64
If anyone knows how to solve this that would be great :)

Here is one way of solving the problem with Series.expanding:
df['diff'] = [list(s.iat[-1] - s[-2::-1]) for s in df['timestamp'].expanding(1)]
random timestamp diff
0 1 2 []
1 3 138 [136]
2 9 157 [19, 155] #--> 157-138, 157-2
3 3 232 [75, 94, 230] #--> 232-157, 232-138, 232-2
4 4 245 [13, 88, 107, 243]
5 7 302 [57, 70, 145, 164, 300]
6 8 323 [21, 78, 91, 166, 185, 321]
7 10 379 [56, 77, 134, 147, 222, 241, 377]
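Note that iterating over the Expanding object like this requires a reasonably recent pandas (window objects only became iterable around pandas 1.1). A plain-Python sketch of the same idea, reusing df from the question, that does not rely on that:
ts = df['timestamp'].tolist()
# for each row: current timestamp minus every earlier timestamp, nearest first
df['diff'] = [[ts[i] - ts[j] for j in range(i - 1, -1, -1)] for i in range(len(ts))]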

I may be misunderstanding what you mean but if you're asking how to collect these differences together:
differences = [df.timestamp.diff(periods=i+1) for i in range(len(df) - 1)]
differences = pd.concat(differences)
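If you would rather have one column per lag instead of one long stacked Series, a small sketch (reusing df from the question) that concatenates along axis=1 and joins the result back on:
# diff_1 is the difference to the previous row, diff_2 to the row two back, etc.
lags = {f'diff_{k}': df.timestamp.diff(periods=k) for k in range(1, len(df))}
wide = df.join(pd.concat(lags, axis=1))
print(wide)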

I also may be misunderstanding, but this is the best representation I could think of from what you described:
>>> df2 = df.copy()
>>> for i in df2.timestamp:
...     df2[i] = df2['timestamp'] - i
>>> df2
random timestamp 2 138 157 232 245 302 323 379
0 1 2 0 -136 -155 -230 -243 -300 -321 -377
1 3 138 136 0 -19 -94 -107 -164 -185 -241
2 9 157 155 19 0 -75 -88 -145 -166 -222
3 3 232 230 94 75 0 -13 -70 -91 -147
4 4 245 243 107 88 13 0 -57 -78 -134
5 7 302 300 164 145 70 57 0 -21 -77
6 8 323 321 185 166 91 78 21 0 -56
7 10 379 377 241 222 147 134 77 56 0
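The same wide table can also be built without a Python loop by broadcasting the timestamp column against itself; a sketch, reusing df from the question (the column labels are the timestamps, as in the table above):
import numpy as np
ts = df['timestamp'].to_numpy()
diffs = pd.DataFrame(ts[:, None] - ts[None, :], columns=ts, index=df.index)
df2 = df.join(diffs)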

Related

Combine data from lines in a pandas df where client_id is duplicated

I have combined two dataframes and I want to make sure that the client id is not repeated and that all of a client's values end up on one line, not duplicated.
This is a ready-made dataframe, but I need to edit it and remove repetitions.
I attach the input data:
rzd = pd.DataFrame({
    'client_id': [111, 112, 113, 114, 115],
    'rzd_revenue': [1093, 2810, 10283, 5774, 981]})
rzd
auto = pd.DataFrame({
    'client_id': [113, 114, 115, 116, 117],
    'auto_revenue': [57483, 83, 912, 4834, 98]})
auto
air = pd.DataFrame({
    'client_id': [115, 116, 117, 118],
    'air_revenue': [81, 4, 13, 173]})
air
client_base = pd.DataFrame({
    'client_id': [111, 112, 113, 114, 115, 116, 117, 118],
    'address': ['Комсомольская 4', 'Энтузиастов 8а', 'Левобережная 1а', 'Мира 14', 'ЗЖБИиДК 1',
                'Строителей 18', 'Панфиловская 33', 'Мастеркова 4']})
client_base
frames = [rzd, auto, air]
result = pd.concat(frames)
result
full_result = result.merge(client_base)
full_result
client_id rzd_revenue auto_revenue air_revenue address
0 111 1093.0 NaN NaN Комсомольская 4
1 112 2810.0 NaN NaN Энтузиастов 8а
2 113 10283.0 NaN NaN Левобережная 1а
3 113 NaN 57483.0 NaN Левобережная 1а
4 114 5774.0 NaN NaN Мира 14
5 114 NaN 83.0 NaN Мира 14
6 115 981.0 NaN NaN ЗЖБИиДК 1
7 115 NaN 912.0 NaN ЗЖБИиДК 1
8 115 NaN NaN 81.0 ЗЖБИиДК 1
9 116 NaN 4834.0 NaN Строителей 18
10 116 NaN NaN 4.0 Строителей 18
11 117 NaN 98.0 NaN Панфиловская 33
12 117 NaN NaN 13.0 Панфиловская 33
13 118 NaN NaN 173.0 Мастеркова 4
I have looked at various documentation on merge, concat, and so on, but I did not find clear articles on this kind of editing.
As a result, it should look like this:
client_id rzd_revenue auto_revenue air_revenue address
0 111 1093.0 0.0 0.0 Комсомольская 4
1 112 2810.0 0.0 0.0 Энтузиастов 8а
2 113 10283.0 57483.0 0.0 Левобережная 1а
3 114 5774.0 83.0 0.0 Мира 14
4 115 981.0 912.0 81.0 ЗЖБИиДК 1
If possible, you should prefer joining so that there isn't a duplicate row issue at all. However, if you're just given full_result and not the initial tables, this can be accomplished using a groupby expression:
full_result.groupby("client_id").first().reset_index()
For each client_id, it takes the first non-null value of each column, then resets the index so client_id is just another column again.
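The desired output shows 0.0 rather than NaN for the missing revenues, so you may also want to fill the remaining gaps after collapsing, e.g.:
full_result.groupby("client_id").first().fillna(0).reset_index()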
You should be joining. You can use concat() to do that, just need to set the index first:
result = pd.concat([df.set_index('client_id') for df in frames], axis=1)
result
rzd_revenue auto_revenue air_revenue
client_id
111 1093.0 NaN NaN
112 2810.0 NaN NaN
113 10283.0 57483.0 NaN
114 5774.0 83.0 NaN
115 981.0 912.0 81.0
116 NaN 4834.0 4.0
117 NaN 98.0 13.0
118 NaN NaN 173.0
Then to connect client_base, you can do something similar. I'll also replace the NaNs now.
full_result = result.join(client_base.set_index('client_id')).fillna(0)
full_result
rzd_revenue auto_revenue air_revenue address
client_id
111 1093.0 0.0 0.0 Комсомольская 4
112 2810.0 0.0 0.0 Энтузиастов 8а
113 10283.0 57483.0 0.0 Левобережная 1а
114 5774.0 83.0 0.0 Мира 14
115 981.0 912.0 81.0 ЗЖБИиДК 1
116 0.0 4834.0 4.0 Строителей 18
117 0.0 98.0 13.0 Панфиловская 33
118 0.0 0.0 173.0 Мастеркова 4
At this point you could .reset_index() if you wanted.
More info:
User guide: Merge, join, concatenate and compare
Pandas Merging 101

replace() works on a str but does not work on an object dtype column

ab = '1 234'
ab = ab.replace(" ", "")
ab
'1234'
It's easy to use replace() to get rid of the whitespace in a plain string, but now I have a column of a pandas dataframe:
gbpusd['Profit'] = gbpusd['Profit'].replace(" ", "")
gbpusd['Profit'].head()
3 7 000.00
4 6 552.00
11 4 680.00
14 3 250.00
24 1 700.00
Name: Profit, dtype: object
But it didn't work; I googled many times but found no solutions. The whitespace is still there, so I cannot do further analysis, like sum():
gbpusd['Profit'].sum()
TypeError: can only concatenate str (not "int") to str
The problem is trickier than I thought: the raw data is
gbpusd.head()
Ticket Open Time Type Volume Item Price S / L T / P Close Time Price.1 Commission Taxes Swap Profit
84 50204109.0 2019.10.24 09:56:32 buy 0.5 gbpusd 1.29148 0.0 0.0 2019.10.24 09:57:48 1.29179 0 0.0 0.0 15.5
85 50205025.0 2019.10.24 10:10:13 buy 0.5 gbpusd 1.29328 0.0 0.0 2019.10.24 15:57:02 1.29181 0 0.0 0.0 -73.5
86 50207371.0 2019.10.24 10:34:10 buy 0.5 gbpusd 1.29236 0.0 0.0 2019.10.24 15:57:18 1.29197 0 0.0 0.0 -19.5
87 50207747.0 2019.10.24 10:40:32 buy 0.5 gbpusd 1.29151 0.0 0.0 2019.10.24 15:57:24 1.29223 0 0.0 0.0 36
88 50212252.0 2019.10.24 11:47:14 buy 1.5 gbpusd 1.28894 0.0 0.0 2019.10.24 15:57:12 1.29181 0 0.0 0.0 430.5
when I did
gbpusd['Profit'] = gbpusd['Profit'].str.replace(" ", "")
gbpusd['Profit']
84 NaN
85 NaN
86 NaN
87 NaN
88 NaN
89 NaN
90 NaN
91 NaN
92 NaN
93 NaN
94 NaN
95 NaN
96 NaN
97 NaN
98 NaN
99 NaN
100 NaN
101 NaN
102 NaN
103 NaN
104 NaN
105 NaN
106 NaN
107 NaN
108 NaN
109 NaN
110 NaN
111 NaN
112 NaN
113 NaN
...
117 4680.00
118 NaN
119 NaN
120 NaN
121 NaN
122 NaN
123 NaN
124 NaN
125 NaN
126 NaN
127 NaN
128 NaN
129 NaN
130 -2279.00
131 -2217.00
132 -2037.00
133 -5379.00
134 -1620.00
135 -7154.00
136 -4160.00
137 1144.00
138 NaN
139 NaN
140 NaN
141 -1920.00
142 7000.00
143 3250.00
144 NaN
145 1700.00
146 NaN
Name: Profit, Length: 63, dtype: object
The whitespace is replaced, but some values that had no space in them are now NaN... someone else may have the same problem...
You also need to use the .str accessor:
gbpusdprofit = gbpusd['Profit'].str.replace(" ", "")
Output:
0 7000.00
1 6552.00
2 4680.00
3 3250.00
4 1700.00
Name: Profit, dtype: object
and for sum:
gbpusd['Profit'].str.replace(" ", "").astype('float').sum()
Result:
23182.0
You can do the replacement, conversion and sum in a one-liner:
gbpusd['Profit'].str.replace(' ', "").astype(float).sum()
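If the column mixes real numbers with strings (which would explain the NaN rows above: the .str accessor returns NaN for non-string elements), converting everything to string first keeps those values; a sketch:
profit = (gbpusd['Profit'].astype(str)
                          .str.replace(' ', '', regex=False)
                          .astype(float))
print(profit.sum())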

Pandas groupby, resample, calculate pct_change and the store result back into original freq. dataframe

I have a dataframe of daily stock data, which is indexed by a datetimeindex.
There are multiple stock entries, thus there are duplicate datetimeindex values.
I am looking for a way to:
Group the dataframe by the stock symbol
Resample the prices for each symbol group into monthly price frequency data
Perform a pct_change calculation on each symbol group monthly price
Store it as a new column 'monthly_return' in the original dataframe.
I have been able to manage the first three operations. Storing the result in the original dataframe is where I'm having some trouble.
To illustrate this, I created a toy dataset which includes a 'dummy' index (idx) column which I use to assist creation of the desired output later on in the third code block.
import random
import pandas as pd
import numpy as np
PER = 62  # number of daily periods per symbol; PER was not defined in the original snippet, so this value is assumed
datelist = pd.date_range('2018-01-01', periods=PER).to_pydatetime().tolist() * 2
ids = [random.choice(['A', 'B']) for i in range(len(datelist))]
prices = random.sample(range(200), len(datelist))
idx = range(len(datelist))
df1 = pd.DataFrame(data=list(zip(idx, ids, prices)), index=datelist,
                   columns='idx label prices'.split())
print(df1.head(10))
df1
idx label prices
2018-01-01 0 B 40
2018-01-02 1 A 190
2018-01-03 2 A 159
2018-01-04 3 A 25
2018-01-05 4 A 89
2018-01-06 5 B 164
...
2018-01-31 30 A 102
2018-02-01 31 A 117
2018-02-02 32 A 120
2018-02-03 33 B 75
2018-02-04 34 B 170
...
Desired Output
idx label prices monthly_return
2018-01-01 0 B 40 0.000000
2018-01-02 1 A 190 0.000000
2018-01-03 2 A 159 0.000000
2018-01-04 3 A 25 0.000000
2018-01-05 4 A 89 0.000000
2018-01-06 5 B 164 0.000000
...
2018-01-31 30 A 102 -0.098039
2018-02-01 31 A 117 0.000000
2018-02-02 32 A 120 0.000000
...
2018-02-26 56 B 152 0.000000
2018-02-27 57 B 2 0.000000
2018-02-28 58 B 49 -0.040816
2018-03-01 59 B 188 0.000000
...
2018-01-28 89 A 88 0.000000
2018-01-29 90 A 26 0.000000
2018-01-30 91 B 128 0.000000
2018-01-31 92 A 144 -0.098039
...
2018-02-26 118 A 92 0.000000
2018-02-27 119 B 111 0.000000
2018-02-28 120 B 34 -0.040816
...
What I have tried so far is:
dfX = df1.copy(deep=True)
dfX = df1.groupby('label').resample('M')['prices'].last().pct_change(1).shift(-1)
print(dfX)
Which outputs:
label
A 2018-01-31 -0.067961
2018-02-28 -0.364583
2018-03-31 0.081967
B 2018-01-31 1.636364
2018-02-28 -0.557471
2018-03-31 NaN
This is quite close to what I would like, however I only get pct_change data back for the end-of-month dates, which makes it awkward to store back in the original dataframe (df1) as a new column.
Something like this doesn't work:
dfX = df1.copy(deep=True)
dfX['monthly_return'] = df1.groupby('label').resample('M')['prices'].last().pct_change(1).shift(-1)
As it yields the error:
TypeError: incompatible index of inserted column with frame index
I have considered 'upsampling' the monthly_return data back into a daily series, however this could likely end up causing the same error mentioned above since the original dataset could be missing dates (such as weekends). Additionally, resetting the index to clear this error would still create problems as the grouped dfX does not have the same number of rows/frequency as the original df1 which is of daily frequency.
I have a hunch that this can be done by using multi-indexing and dataframe merging however I am unsure how to go about doing so.
This generates my desired output, but it isn't as clean a solution as I was hoping for.
df1 is generated the same as before (code given in question):
idx label prices
2018-01-01 0 A 145
2018-01-02 1 B 86
2018-01-03 2 B 141
...
2018-01-25 86 B 12
2018-01-26 87 B 71
2018-01-27 88 B 186
2018-01-28 89 B 151
2018-01-29 90 A 161
2018-01-30 91 B 143
2018-01-31 92 B 88
...
Then:
def fun(x):
    dates = x.date
    x = x.set_index('date', drop=True)
    x['monthly_return'] = x.resample('M').last()['prices'].pct_change(1).shift(-1)
    x = x.reindex(dates)
    return x
dfX = df1.copy(deep=True)
dfX.reset_index(inplace=True)
dfX.columns = 'date idx label prices'.split()
dfX = dfX.groupby('label').apply(fun).droplevel(level='label')
print(dfX)
Which outputs the desired result (unsorted):
idx label prices monthly_return
date
2018-01-01 0 A 145 NaN
2018-01-06 5 A 77 NaN
2018-01-08 7 A 48 NaN
2018-01-09 8 A 31 NaN
2018-01-11 10 A 20 NaN
2018-01-12 11 A 27 NaN
2018-01-14 13 A 109 NaN
2018-01-15 14 A 166 NaN
2018-01-17 16 A 130 NaN
2018-01-18 17 A 139 NaN
2018-01-19 18 A 191 NaN
2018-01-21 20 A 164 NaN
2018-01-22 21 A 112 NaN
2018-01-23 22 A 167 NaN
2018-01-25 24 A 140 NaN
2018-01-26 25 A 42 NaN
2018-01-30 29 A 107 NaN
2018-02-04 34 A 9 NaN
2018-02-07 37 A 84 NaN
2018-02-08 38 A 23 NaN
2018-02-10 40 A 30 NaN
2018-02-12 42 A 89 NaN
2018-02-15 45 A 79 NaN
2018-02-16 46 A 115 NaN
2018-02-19 49 A 197 NaN
2018-02-21 51 A 11 NaN
2018-02-26 56 A 111 NaN
2018-02-27 57 A 126 NaN
2018-03-01 59 A 135 NaN
2018-03-03 61 A 28 NaN
2018-01-01 62 A 120 NaN
2018-01-03 64 A 170 NaN
2018-01-05 66 A 45 NaN
2018-01-07 68 A 173 NaN
2018-01-08 69 A 158 NaN
2018-01-09 70 A 63 NaN
2018-01-11 72 A 62 NaN
2018-01-12 73 A 168 NaN
2018-01-14 75 A 169 NaN
2018-01-15 76 A 142 NaN
2018-01-17 78 A 83 NaN
2018-01-18 79 A 96 NaN
2018-01-21 82 A 25 NaN
2018-01-22 83 A 90 NaN
2018-01-23 84 A 59 NaN
2018-01-29 90 A 161 NaN
2018-02-01 93 A 150 NaN
2018-02-04 96 A 85 NaN
2018-02-06 98 A 124 NaN
2018-02-14 106 A 195 NaN
2018-02-16 108 A 136 NaN
2018-02-17 109 A 134 NaN
2018-02-18 110 A 183 NaN
2018-02-19 111 A 32 NaN
2018-02-24 116 A 102 NaN
2018-02-25 117 A 72 NaN
2018-02-27 119 A 38 NaN
2018-03-02 122 A 137 NaN
2018-03-03 123 A 171 NaN
2018-01-02 1 B 86 NaN
2018-01-03 2 B 141 NaN
2018-01-04 3 B 189 NaN
2018-01-05 4 B 60 NaN
2018-01-07 6 B 1 NaN
2018-01-10 9 B 87 NaN
2018-01-13 12 B 44 NaN
2018-01-16 15 B 147 NaN
2018-01-20 19 B 92 NaN
2018-01-24 23 B 81 NaN
2018-01-27 26 B 190 NaN
2018-01-28 27 B 24 NaN
2018-01-29 28 B 116 NaN
2018-01-31 30 B 98 1.181818
2018-02-01 31 B 121 NaN
2018-02-02 32 B 110 NaN
2018-02-03 33 B 66 NaN
2018-02-05 35 B 4 NaN
2018-02-06 36 B 13 NaN
2018-02-09 39 B 114 NaN
2018-02-11 41 B 16 NaN
2018-02-13 43 B 174 NaN
2018-02-14 44 B 78 NaN
2018-02-17 47 B 144 NaN
2018-02-18 48 B 14 NaN
2018-02-20 50 B 133 NaN
2018-02-22 52 B 156 NaN
2018-02-23 53 B 159 NaN
2018-02-24 54 B 177 NaN
2018-02-25 55 B 43 NaN
2018-02-28 58 B 19 -0.338542
2018-03-02 60 B 127 NaN
2018-01-02 63 B 2 NaN
2018-01-04 65 B 97 NaN
2018-01-06 67 B 8 NaN
2018-01-10 71 B 54 NaN
2018-01-13 74 B 106 NaN
2018-01-16 77 B 74 NaN
2018-01-19 80 B 188 NaN
2018-01-20 81 B 172 NaN
2018-01-24 85 B 51 NaN
2018-01-25 86 B 12 NaN
2018-01-26 87 B 71 NaN
2018-01-27 88 B 186 NaN
2018-01-28 89 B 151 NaN
2018-01-30 91 B 143 NaN
2018-01-31 92 B 88 1.181818
2018-02-02 94 B 75 NaN
2018-02-03 95 B 103 NaN
2018-02-05 97 B 82 NaN
2018-02-07 99 B 128 NaN
2018-02-08 100 B 123 NaN
2018-02-09 101 B 52 NaN
2018-02-10 102 B 18 NaN
2018-02-11 103 B 21 NaN
2018-02-12 104 B 50 NaN
2018-02-13 105 B 64 NaN
2018-02-15 107 B 185 NaN
2018-02-20 112 B 125 NaN
2018-02-21 113 B 108 NaN
2018-02-22 114 B 132 NaN
2018-02-23 115 B 180 NaN
2018-02-26 118 B 67 NaN
2018-02-28 120 B 192 -0.338542
2018-03-01 121 B 58 NaN
Perhaps there is a more concise and pythonic way of doing this.
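One possibly more concise variant (a sketch only, assuming the datetime index and the 'label'/'prices' columns from the question, and not verified against the exact random data above): compute the month-end returns per label once, then align them back onto the daily rows by matching (label, date), so that only true month-end rows receive a value, as in the output above.
monthly = (df1.groupby('label')['prices']
              .resample('M')
              .last()
              .groupby(level='label')
              .transform(lambda s: s.pct_change().shift(-1))
              .rename('monthly_return'))
dfX = df1.copy()
# reindex by (label, date): only rows whose date is a month end get a value
keys = pd.MultiIndex.from_arrays([dfX['label'], dfX.index])
dfX['monthly_return'] = monthly.reindex(keys).to_numpy()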

Python: Expand a dataframe row-wise based on datetime

I have a dataframe like this:
ID Date Value
783 C 2018-02-23 0.704
580 B 2018-08-04 -1.189
221 A 2018-08-10 -0.788
228 A 2018-08-17 0.038
578 B 2018-08-02 1.188
What I want is to expand each row of the dataframe back to one month before its Date, filling ID with the same person and filling Value with nan for every date except the last observation.
The expected result is similar to this:
ID Date Value
0 C 2018/01/24 nan
1 C 2018/01/25 nan
2 C 2018/01/26 nan
3 C 2018/01/27 nan
4 C 2018/01/28 nan
5 C 2018/01/29 nan
6 C 2018/01/30 nan
7 C 2018/01/31 nan
8 C 2018/02/01 nan
9 C 2018/02/02 nan
10 C 2018/02/03 nan
11 C 2018/02/04 nan
12 C 2018/02/05 nan
13 C 2018/02/06 nan
14 C 2018/02/07 nan
15 C 2018/02/08 nan
16 C 2018/02/09 nan
17 C 2018/02/10 nan
18 C 2018/02/11 nan
19 C 2018/02/12 nan
20 C 2018/02/13 nan
21 C 2018/02/14 nan
22 C 2018/02/15 nan
23 C 2018/02/16 nan
24 C 2018/02/17 nan
25 C 2018/02/18 nan
26 C 2018/02/19 nan
27 C 2018/02/20 nan
28 C 2018/02/21 nan
29 C 2018/02/22 nan
30 C 2018/02/23 1.093
31 B 2018/07/05 nan
32 B 2018/07/06 nan
33 B 2018/07/07 nan
34 B 2018/07/08 nan
35 B 2018/07/09 nan
36 B 2018/07/10 nan
37 B 2018/07/11 nan
38 B 2018/07/12 nan
39 B 2018/07/13 nan
40 B 2018/07/14 nan
41 B 2018/07/15 nan
42 B 2018/07/16 nan
43 B 2018/07/17 nan
44 B 2018/07/18 nan
45 B 2018/07/19 nan
46 B 2018/07/20 nan
47 B 2018/07/21 nan
48 B 2018/07/22 nan
49 B 2018/07/23 nan
50 B 2018/07/24 nan
51 B 2018/07/25 nan
52 B 2018/07/26 nan
53 B 2018/07/27 nan
54 B 2018/07/28 nan
55 B 2018/07/29 nan
56 B 2018/07/30 nan
57 B 2018/07/31 nan
58 B 2018/08/01 nan
59 B 2018/08/02 nan
60 B 2018/08/03 nan
61 B 2018/08/04 0.764
62 A 2018/07/11 nan
63 A 2018/07/12 nan
64 A 2018/07/13 nan
65 A 2018/07/14 nan
66 A 2018/07/15 nan
67 A 2018/07/16 nan
68 A 2018/07/17 nan
69 A 2018/07/18 nan
70 A 2018/07/19 nan
71 A 2018/07/20 nan
72 A 2018/07/21 nan
73 A 2018/07/22 nan
74 A 2018/07/23 nan
75 A 2018/07/24 nan
76 A 2018/07/25 nan
77 A 2018/07/26 nan
78 A 2018/07/27 nan
79 A 2018/07/28 nan
80 A 2018/07/29 nan
81 A 2018/07/30 nan
82 A 2018/07/31 nan
83 A 2018/08/01 nan
84 A 2018/08/02 nan
85 A 2018/08/03 nan
86 A 2018/08/04 nan
87 A 2018/08/05 nan
88 A 2018/08/06 nan
89 A 2018/08/07 nan
90 A 2018/08/08 nan
91 A 2018/08/09 nan
92 A 2018/08/10 2.144
93 A 2018/07/18 nan
94 A 2018/07/19 nan
95 A 2018/07/20 nan
96 A 2018/07/21 nan
97 A 2018/07/22 nan
98 A 2018/07/23 nan
99 A 2018/07/24 nan
100 A 2018/07/25 nan
101 A 2018/07/26 nan
102 A 2018/07/27 nan
103 A 2018/07/28 nan
104 A 2018/07/29 nan
105 A 2018/07/30 nan
106 A 2018/07/31 nan
107 A 2018/08/01 nan
108 A 2018/08/02 nan
109 A 2018/08/03 nan
110 A 2018/08/04 nan
111 A 2018/08/05 nan
112 A 2018/08/06 nan
113 A 2018/08/07 nan
114 A 2018/08/08 nan
115 A 2018/08/09 nan
116 A 2018/08/10 nan
117 A 2018/08/11 nan
118 A 2018/08/12 nan
119 A 2018/08/13 nan
120 A 2018/08/14 nan
121 A 2018/08/15 nan
122 A 2018/08/16 nan
123 A 2018/08/17 0.644
124 B 2018/07/03 nan
125 B 2018/07/04 nan
126 B 2018/07/05 nan
127 B 2018/07/06 nan
128 B 2018/07/07 nan
129 B 2018/07/08 nan
130 B 2018/07/09 nan
131 B 2018/07/10 nan
132 B 2018/07/11 nan
133 B 2018/07/12 nan
134 B 2018/07/13 nan
135 B 2018/07/14 nan
136 B 2018/07/15 nan
137 B 2018/07/16 nan
138 B 2018/07/17 nan
139 B 2018/07/18 nan
140 B 2018/07/19 nan
141 B 2018/07/20 nan
142 B 2018/07/21 nan
143 B 2018/07/22 nan
144 B 2018/07/23 nan
145 B 2018/07/24 nan
146 B 2018/07/25 nan
147 B 2018/07/26 nan
148 B 2018/07/27 nan
149 B 2018/07/28 nan
150 B 2018/07/29 nan
151 B 2018/07/30 nan
152 B 2018/07/31 nan
153 B 2018/08/01 nan
154 B 2018/08/02 -0.767
The source data can be created as below:
import pandas as pd
from itertools import chain
import numpy as np
df_1 = pd.DataFrame({
    'ID': list(chain.from_iterable([['A'] * 365, ['B'] * 365, ['C'] * 365])),
    'Date': pd.date_range(start='2018-01-01', end='2018-12-31').tolist() * 3,
    'Value': np.random.randn(365 * 3)
})
df_1 = df_1.sample(5, random_state=123)
Thanks for the advice!
You can create another DataFrame whose dates are shifted back by (almost) one month, concatenate it with the original, create a DatetimeIndex, and then use groupby with resample('D') to add all the daily rows in between:
df_2 = df_1.assign(Date=df_1['Date'] - pd.DateOffset(months=1) + pd.DateOffset(days=1),
                   Value=np.nan)
df = (pd.concat([df_2, df_1], sort=False)
        .reset_index()
        .set_index('Date')
        .groupby('index', sort=False)
        .resample('D')
        .ffill()
        .reset_index(level=1)
        .drop('index', 1)
        .rename_axis(None))
print (df)
Date ID Value
783 2018-01-24 C NaN
783 2018-01-25 C NaN
783 2018-01-26 C NaN
783 2018-01-27 C NaN
783 2018-01-28 C NaN
.. ... .. ...
578 2018-07-29 B NaN
578 2018-07-30 B NaN
578 2018-07-31 B NaN
578 2018-08-01 B NaN
578 2018-08-02 B 0.562684
[155 rows x 3 columns]
Another solution uses a list comprehension with concat; afterwards it is necessary to back-fill the index and ID columns, which works as long as there are no missing values in the original ID column:
offset = pd.DateOffset(months=1) + pd.DateOffset(days=1)
df = pd.concat([df_1.iloc[[i]].reset_index().set_index('Date')
                     .reindex(pd.date_range(d - offset, d))
                for i, d in enumerate(df_1['Date'])], sort=False)
df = (df.assign(index=df['index'].bfill().astype(int), ID=df['ID'].bfill())
        .rename_axis('Date')
        .reset_index()
        .set_index('index')
        .rename_axis(None))
print (df)
Date ID Value
783 2018-01-24 C NaN
783 2018-01-25 C NaN
783 2018-01-26 C NaN
783 2018-01-27 C NaN
783 2018-01-28 C NaN
.. ... .. ...
578 2018-07-29 B NaN
578 2018-07-30 B NaN
578 2018-07-31 B NaN
578 2018-08-01 B NaN
578 2018-08-02 B 1.224345
[155 rows x 3 columns]
We can create a date range in the "Date" column, then explode it.
Then group the "Value" column by the index and set all values to nan except the last.
Finally reset the index.
def drange(t):
    return pd.date_range(t - pd.DateOffset(months=1) + pd.DateOffset(days=1),
                         t, freq="D", normalize=True)
df["Date"] = df["Date"].transform(drange)
ID Date Value
index
783 C DatetimeIndex(['2018-01-24', '2018-01-25', '20... 0.704
580 B DatetimeIndex(['2018-07-05', '2018-07-06', '20... -1.189
221 A DatetimeIndex(['2018-07-11', '2018-07-12', '20... -0.788
228 A DatetimeIndex(['2018-07-18', '2018-07-19', '20... 0.038
578 B DatetimeIndex(['2018-07-03', '2018-07-04', '20... 1.188
df= df.reset_index(drop=True).explode(column="Date")
ID Date Value
0 C 2018-01-24 0.704
0 C 2018-01-25 0.704
0 C 2018-01-26 0.704
0 C 2018-01-27 0.704
0 C 2018-01-28 0.704
.. .. ... ...
4 B 2018-07-29 1.188
4 B 2018-07-30 1.188
4 B 2018-07-31 1.188
4 B 2018-08-01 1.188
4 B 2018-08-02 1.188
df["Value"] = df.groupby(level=0)["Value"].transform(lambda v: [np.nan] * (len(v) - 1) + [v.iloc[0]])
df = df.reset_index(drop=True)
ID Date Value
0 C 2018-01-24 NaN
1 C 2018-01-25 NaN
2 C 2018-01-26 NaN
3 C 2018-01-27 NaN
4 C 2018-01-28 NaN
.. .. ... ...
150 B 2018-07-29 NaN
151 B 2018-07-30 NaN
152 B 2018-07-31 NaN
153 B 2018-08-01 NaN
154 B 2018-08-02 1.188
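As a small aside, the groupby/transform step above can also be written with Index.duplicated right after the explode: it blanks every exploded row except the last one belonging to each original record (a sketch under the same assumptions):
df["Value"] = df["Value"].mask(df.index.duplicated(keep="last"))
df = df.reset_index(drop=True)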

Dataframe Merge in Pandas

For some reason, I cannot get this merge to work correctly.
This Dataframe (rspars) has 2,000+ rows...
rsparid f1mult f2mult f3mult
0 1 0.318 0.636 0.810
1 2 0.348 0.703 0.893
2 3 0.384 0.777 0.000
3 4 0.296 0.590 0.911
4 5 0.231 0.458 0.690
5 6 0.275 0.546 0.839
6 7 0.248 0.486 0.731
7 8 0.430 0.873 0.000
8 9 0.221 0.438 0.655
9 11 0.204 0.399 0.593
When trying to join the above to a table based on the rsparid columns to this Dataframe...
line_track line_race rsparid
line_date
2013-03-23 TP 10 1400
2013-02-23 GP 7 634
2013-01-01 GP 7 1508
2012-11-11 AQU 5 96
2012-10-11 BEL 2 161
Using this...
df = pd.merge(datalines, rspars, how='left', on='rsparid')
I get blanks..
line_track line_race rsparid f1mult f2mult f3mult
0 TP 10 1400 NaN NaN NaN
1 TP 10 1400 NaN NaN NaN
2 TP 10 1400 NaN NaN NaN
3 GP 7 634 NaN NaN NaN
4 GP 10 634 NaN NaN NaN
Note, the datalines table can have thousands more rows than rspars, hence the left join. I must be doing something wrong?
I also tried it this way...
df = datalines.merge(rspars, how='left', on='rsparid')
EXAMPLE #2
I dropped the data down to a few rows...
rspars:
rsparid f1mult f2mult f3mult
0 1400 0.216 0.435 0.656
datalines:
rsparid
0 1400
1 634
2 1508
3 96
4 161
5 1011
6 1007
7 518
8 1955
9 678
Merging...
datalines.merge(rspars, how='left', on='rsparid')
Output...
rsparid f1mult f2mult f3mult
0 1400 NaN NaN NaN
1 634 NaN NaN NaN
2 1508 NaN NaN NaN
3 96 NaN NaN NaN
4 161 NaN NaN NaN
5 1011 NaN NaN NaN
6 1007 NaN NaN NaN
7 518 NaN NaN NaN
8 1955 NaN NaN NaN
9 678 NaN NaN NaN
The NaNs mean the two frames have no rsparid values in common. This can be tricky when merging columns that look the same in their repr but have different dtypes.
The repr of small DataFrames with strings (of integers) or integers looks the same and no dtype information is printed when frames are small. You can get this information (and more) for small frames by calling the DataFrame.info() method, like so: df.info(). This will give you a nice summary of what's in the DataFrame and what the dtypes of its columns are:
In [205]: datalines_int = DataFrame({'rsparid':[1400,634,1508,96,161,1011,1007,518,1955,678]})
In [206]: datalines_str = DataFrame({'rsparid':map(str,[1400,634,1508,96,161,1011,1007,518,1955,678])})
In [207]: datalines_int
Out[207]:
rsparid
0 1400
1 634
2 1508
3 96
4 161
5 1011
6 1007
7 518
8 1955
9 678
In [208]: datalines_str
Out[208]:
rsparid
0 1400
1 634
2 1508
3 96
4 161
5 1011
6 1007
7 518
8 1955
9 678
In [209]: datalines_int.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 10 entries, 0 to 9
Data columns (total 1 columns):
rsparid 10 non-null values
dtypes: int64(1)
In [210]: datalines_str.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 10 entries, 0 to 9
Data columns (total 1 columns):
rsparid 10 non-null values
dtypes: object(1)
NOTE: You'll notice a slight difference in the reprs here, most likely because of padding of numeric DataFrames. Point is, no one would really be able to see that using this interactively, unless they were specifically looking for the difference.
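The usual fix, as a hedged sketch (assuming the IDs are plain integers that ended up stored as strings in one of the frames), is to give both key columns the same dtype before merging; casting both to str would work just as well:
datalines['rsparid'] = datalines['rsparid'].astype(int)
rspars['rsparid'] = rspars['rsparid'].astype(int)
df = datalines.merge(rspars, how='left', on='rsparid')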
