Here is my problem. Below is a sample of my DataFrame:
import pandas as pd

df = pd.DataFrame({'Date': ['01/03/2000','01/04/2000','01/05/2000','01/06/2000','01/07/2000','01/08/2000'],
                   'Paul_Score': [3,10,22,32,20,40],
                   'John_Score': [8,42,10,57,3,70]})
df['Date'] = pd.to_datetime(df['Date'])
df = df.set_index('Date')
I started to work on a loop with an if statement, like this:
def test(selection, symbol):
    df_end = (selection * 0)
    rolling_mean = selection.rolling(2).mean().fillna(0)
    calendar = pd.Series(df_end.index)
    for date in calendar:
        module = 1 / selection.loc[date, symbol]
        if selection.loc[date, symbol] > rolling_mean.loc[date, symbol]:
            df_end.loc[date, symbol] = module
        else:
            df_end.loc[date, symbol] = 0
    return df_end
Then:
test(df,'John_Score')
However, my problem is that I don't know how to deal with many columns at the same time. My goal is to apply this function to the whole DataFrame, i.e. to all columns at once. This sample has only 2 columns, but in reality I have 30 and I don't know how to handle them.
EDIT:
This is what I get with test(df,'John_Score'):
Paul_Score John_Score
Date
2000-01-03 0 0.125000
2000-01-04 0 0.023810
2000-01-05 0 0.000000
2000-01-06 0 0.017544
2000-01-07 0 0.000000
2000-01-08 0 0.014286
And this is what I get with test(df,'Paul_Score'):
Paul_Score John_Score
Date
2000-01-03 0.333333 0
2000-01-04 0.100000 0
2000-01-05 0.045455 0
2000-01-06 0.031250 0
2000-01-07 0.000000 0
2000-01-08 0.025000 0
And I would like something like this:
Paul_Score John_Score
Date
2000-01-03 0.333333 0.125000
2000-01-04 0.100000 0.023810
2000-01-05 0.045455 0.000000
2000-01-06 0.031250 0.017544
2000-01-07 0.000000 0.000000
2000-01-08 0.025000 0.014286
My goal is to check, for every date and every column of df, whether the value is greater than its own 2-day rolling mean; if it is, the result should be 1/value, otherwise 0.
There may be a simpler way, but I'm trying to improve my coding skills with for/if statements, and I find it difficult to do computations on DataFrames with many columns. Any ideas are welcome.
Maybe this code does the job:
import pandas as pd

df = pd.DataFrame({'Date': ['01/03/2000','01/04/2000','01/05/2000','01/06/2000','01/07/2000','01/08/2000'],
                   'Paul_Score': [3,10,22,32,20,40],
                   'John_Score': [8,42,10,57,3,70]})
df['Date'] = pd.to_datetime(df['Date'])
df = df.set_index('Date')

def test(selection, symbol):
    df_end = (selection * 0)
    rolling_mean = selection.rolling(2).mean().fillna(0)
    calendar = pd.Series(df_end.index)
    for date in calendar:
        for col in symbol:
            module = 1 / selection.loc[date, col]
            if selection.loc[date, col] > rolling_mean.loc[date, col]:
                df_end.loc[date, col] = module
            else:
                df_end.loc[date, col] = 0
    return df_end
test(df,['Paul_Score', 'John_Score'])
Output:
Paul_Score John_Score
Date
2000-01-03 0.333333 0.125000
2000-01-04 0.100000 0.023810
2000-01-05 0.045455 0.000000
2000-01-06 0.031250 0.017544
2000-01-07 0.000000 0.000000
2000-01-08 0.025000 0.014286
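For reference, since the question mentions there may be a simpler way: the same rule can also be written without the explicit loops. The snippet below is just a sketch using where(); it should reproduce the table above, and to run the loop version on all 30 columns you could simply pass test(df, df.columns).
# loop-free sketch: keep 1/value wherever the value is strictly above
# its own 2-day rolling mean, otherwise 0
rolling_mean = df.rolling(2).mean().fillna(0)
result = (1 / df).where(df > rolling_mean, 0)
print(result)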
I am a little new to Python and have a problem like this. I have a DataFrame of data from multiple sensors. There are NA (missing) values in the dataset that need to be filled according to the rules below:
If the next sensor has data at the same timestamp, fill the gap with that sensor's data.
If the nearby sensor has no data either, fill the gap with the average of all available sensors at the same timestamp.
If all sensors are missing data at the same timestamp, fill the missing values by linear interpolation of the sensor's own series.
Here is some sample data I built:
import pandas as pd
import numpy as np

sensor1 = pd.DataFrame({"date": pd.date_range('1/1/2000', periods=10), "sensor": [1,1,1,1,1,1,1,1,1,1], "value": [np.nan,2,2,2,2,np.nan,np.nan,np.nan,4,6]})
sensor2 = pd.DataFrame({"date": pd.date_range('1/1/2000', periods=10), "sensor": [2,2,2,2,2,2,2,2,2,2], "value": [3,4,5,6,7,np.nan,np.nan,np.nan,7,8]})
sensor3 = pd.DataFrame({"date": pd.date_range('1/1/2000', periods=10), "sensor": [3,3,3,3,3,3,3,3,3,3], "value": [2,3,4,5,6,7,np.nan,np.nan,7,8]})
sensordata = sensor1.append([sensor2, sensor3]).reset_index(drop=True)
Any help would be appreciated.
With the answer from Christian, the solution is as follows.
import pandas as pd
import numpy as np

# create data
df1 = pd.DataFrame({"date": pd.date_range('1/1/2000', periods=10), "sensor": [1,1,1,1,1,1,1,1,1,1], "value": [np.nan,2,2,2,2,np.nan,np.nan,np.nan,4,6]})
df2 = pd.DataFrame({"date": pd.date_range('1/1/2000', periods=10), "sensor": [2,2,2,2,2,2,2,2,2,2], "value": [3,4,5,6,7,np.nan,np.nan,np.nan,7,8]})
df3 = pd.DataFrame({"date": pd.date_range('1/1/2000', periods=10), "sensor": [3,3,3,3,3,3,3,3,3,3], "value": [2,3,4,5,6,7,np.nan,np.nan,7,8]})
df = df1.append([df2, df3]).reset_index(drop=True)

# pivot dataframe
df = df.pivot(index='date', columns='sensor', values='value')

# step 1: fill missing values from a chosen sensor first, here sensor 3
selectedsensor = 3
for c in df.columns:
    df[c] = df[c].fillna(df[selectedsensor])

# step 2: fill with the average of all available sensors at the same timestamp
df = df.transpose().fillna(df.transpose().mean()).transpose()

# step 3: interpolate each sensor's own series to fill the remaining gaps
df = df.interpolate()

# melt back to the original data format
df = df.reset_index()
df = df.melt(id_vars=['date'], var_name='sensor')
#df = df.unstack('sensor').reset_index()
#df = df.rename(columns={0: 'value'})
The final output is as follows:
date sensor value
0 2000-01-01 1 2.0
1 2000-01-02 1 2.0
2 2000-01-03 1 2.0
3 2000-01-04 1 2.0
4 2000-01-05 1 2.0
5 2000-01-06 1 7.0
6 2000-01-07 1 6.0
7 2000-01-08 1 5.0
8 2000-01-09 1 4.0
9 2000-01-10 1 6.0
10 2000-01-01 2 3.0
11 2000-01-02 2 4.0
12 2000-01-03 2 5.0
13 2000-01-04 2 6.0
14 2000-01-05 2 7.0
15 2000-01-06 2 7.0
16 2000-01-07 2 7.0
17 2000-01-08 2 7.0
18 2000-01-09 2 7.0
19 2000-01-10 2 8.0
20 2000-01-01 3 2.0
21 2000-01-02 3 3.0
22 2000-01-03 3 4.0
23 2000-01-04 3 5.0
24 2000-01-05 3 6.0
25 2000-01-06 3 7.0
26 2000-01-07 3 7.0
27 2000-01-08 3 7.0
28 2000-01-09 3 7.0
29 2000-01-10 3 8.0
You can do the following:
Your dataset, pivoted:
df = pd.DataFrame({"date": pd.date_range('1/1/2000', periods=10),"sensor1":[np.nan,2,2,2,2,np.nan,np.nan,np.nan,4,6], "sensor2":[3,4,5,6,7,np.nan,np.nan,np.nan,7,8], "sensor3":[2,3,4,5,6,7,np.nan,np.nan,7,8]}).set_index('date')
1) This is fillna with backfill ('bfill') and limit=1 along axis 1:
df.fillna(method='bfill',limit=1,axis=1)
2) This is fillna with the mean along axis 1. That isn't directly implemented, but we can get the same effect by transposing:
df.transpose().fillna(df.transpose().mean()).transpose()
3) This is just interpolate:
df.interpolate()
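Putting the three steps together in the order of the rules (just a sketch, using the pivoted df from above):
step1 = df.fillna(method='bfill', limit=1, axis=1)   # rule 1: next sensor at the same timestamp
step2 = step1.T.fillna(step1.T.mean()).T             # rule 2: mean of the available sensors
step3 = step2.interpolate()                          # rule 3: each sensor's own interpolation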
Bonus:
This got a bit uglier, since I had to go column by column, but here is a version that uses sensor 3 as the fill source:
for c in df.columns:
    df[c] = df[c].fillna(df["sensor3"])
df
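The same column-by-column fill could also be written with apply instead of the loop (just a variant of the above):
df.apply(lambda col: col.fillna(df['sensor3']))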
Here's a piece of code. I don't understand why I get NaN for the first 4 items of the last column, rm-5.
I understand that for the rm column the first 4 items aren't filled because there is no data available yet, but since I shift the column the calculation should still be made, shouldn't it?
Similarly, I don't understand why there are 5, and not 4, NaN items at the end of the rm-5 column.
import pandas as pd
import numpy as np
index = pd.date_range('2000-1-1', periods=100, freq='D')
df = pd.DataFrame(data=np.random.randn(100), index=index, columns=['A'])
df['rm']=pd.rolling_mean(df['A'],5)
df['rm-5']=pd.rolling_mean(df['A'].shift(-5),5)
print df.head(n=8)
print df.tail(n=8)
A rm rm-5
2000-01-01 0.109161 NaN NaN
2000-01-02 -0.360286 NaN NaN
2000-01-03 -0.092439 NaN NaN
2000-01-04 0.169439 NaN NaN
2000-01-05 0.185829 0.002341 0.091736
2000-01-06 0.432599 0.067028 0.295949
2000-01-07 -0.374317 0.064222 0.055903
2000-01-08 1.258054 0.334321 -0.132972
A rm rm-5
2000-04-02 0.499860 -0.422931 -0.140111
2000-04-03 -0.868718 -0.458962 -0.182373
2000-04-04 0.081059 -0.443494 -0.040646
2000-04-05 0.500275 -0.093048 NaN
2000-04-06 -0.253915 -0.008288 NaN
2000-04-07 -0.159256 -0.140111 NaN
2000-04-08 -1.080027 -0.182373 NaN
2000-04-09 0.789690 -0.040646 NaN
You can change the order of operations. Right now you are shifting first and taking the mean afterwards. The shift(-5) creates NaNs at the end of the shifted series (its last 5 values), and the rolling mean of 5 then still needs 4 preceding values, which is why the first 4 rows are NaN as well. If you take the rolling mean first and shift afterwards, the leading NaNs are shifted away:
index = pd.date_range('2000-1-1', periods=100, freq='D')
df = pd.DataFrame(data=np.random.randn(100), index=index, columns=['A'])
df['rm']=pd.rolling_mean(df['A'],5)
df['shift'] = df['A'].shift(-5)
df['rm-5-shift_first']=pd.rolling_mean(df['A'].shift(-5),5)
df['rm-5-mean_first']=pd.rolling_mean(df['A'],5).shift(-5)
print( df.head(n=8))
print( df.tail(n=8))
A rm shift rm-5-shift_first rm-5-mean_first
2000-01-01 -0.120808 NaN 0.830231 NaN 0.184197
2000-01-02 0.029547 NaN 0.047451 NaN 0.187778
2000-01-03 0.002652 NaN 1.040963 NaN 0.395440
2000-01-04 -1.078656 NaN -1.118723 NaN 0.387426
2000-01-05 1.137210 -0.006011 0.469557 0.253896 0.253896
2000-01-06 0.830231 0.184197 -0.390506 0.009748 0.009748
2000-01-07 0.047451 0.187778 -1.624492 -0.324640 -0.324640
2000-01-08 1.040963 0.395440 -1.259306 -0.784694 -0.784694
A rm shift rm-5-shift_first rm-5-mean_first
2000-04-02 -1.283123 -0.270381 0.226257 0.760370 0.760370
2000-04-03 1.369342 0.288072 2.367048 0.959912 0.959912
2000-04-04 0.003363 0.299997 1.143513 1.187941 1.187941
2000-04-05 0.694026 0.400442 NaN NaN NaN
2000-04-06 1.508863 0.458494 NaN NaN NaN
2000-04-07 0.226257 0.760370 NaN NaN NaN
2000-04-08 2.367048 0.959912 NaN NaN NaN
2000-04-09 1.143513 1.187941 NaN NaN NaN
For more see:
http://pandas.pydata.org/pandas-docs/stable/computation.html#moving-rolling-statistics-moments
http://pandas.pydata.org/pandas-docs/dev/generated/pandas.DataFrame.shift.html
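Side note: the module-level pd.rolling_mean used here was deprecated and later removed from pandas. With the current API the two columns would be computed with the .rolling() method instead (a sketch with the same semantics as above):
df['rm'] = df['A'].rolling(5).mean()
df['rm-5'] = df['A'].shift(-5).rolling(5).mean()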
I have two different pandas DataFrames and I want to extract data from one DataFrame whenever the other DataFrame has a specific value at the same time. To be concrete, I have one object called "GDP" which looks as follows:
GDP
DATE
1947-01-01 243.1
1947-04-01 246.3
1947-07-01 250.1
I additionally have a DataFrame called "recession" which contains data like the following:
USRECQ
DATE
1949-07-01 1
1949-10-01 1
1950-01-01 0
I want to create two new time series. One should contain GDP data whenever USRECQ has a value of 0 at the same DATE. The other one should contain GDP data whenever USRECQ has a value of 1 at the same DATE. How can I do that?
Let's modify the example you posted so the dates overlap:
import pandas as pd
import numpy as np

GDP = pd.DataFrame({'GDP': np.arange(10)*10},
                   index=pd.date_range('2000-1-1', periods=10, freq='D'))
# GDP
# 2000-01-01 0
# 2000-01-02 10
# 2000-01-03 20
# 2000-01-04 30
# 2000-01-05 40
# 2000-01-06 50
# 2000-01-07 60
# 2000-01-08 70
# 2000-01-09 80
# 2000-01-10 90
recession = pd.DataFrame({'USRECQ': [0]*5 + [1]*5},
                         index=pd.date_range('2000-1-2', periods=10, freq='D'))
# USRECQ
# 2000-01-02 0
# 2000-01-03 0
# 2000-01-04 0
# 2000-01-05 0
# 2000-01-06 0
# 2000-01-07 1
# 2000-01-08 1
# 2000-01-09 1
# 2000-01-10 1
# 2000-01-11 1
Then you could join the two dataframes:
combined = GDP.join(recession, how='outer') # change to how='inner' to remove NaNs
# GDP USRECQ
# 2000-01-01 0 NaN
# 2000-01-02 10 0
# 2000-01-03 20 0
# 2000-01-04 30 0
# 2000-01-05 40 0
# 2000-01-06 50 0
# 2000-01-07 60 1
# 2000-01-08 70 1
# 2000-01-09 80 1
# 2000-01-10 90 1
# 2000-01-11 NaN 1
and select rows based on a condition like this:
In [112]: combined.loc[combined['USRECQ']==0]
Out[112]:
GDP USRECQ
2000-01-02 10 0
2000-01-03 20 0
2000-01-04 30 0
2000-01-05 40 0
2000-01-06 50 0
In [113]: combined.loc[combined['USRECQ']==1]
Out[113]:
GDP USRECQ
2000-01-07 60 1
2000-01-08 70 1
2000-01-09 80 1
2000-01-10 90 1
2000-01-11 NaN 1
To get just the GDP column, supply the column name as the second argument to combined.loc:
In [116]: combined.loc[combined['USRECQ']==1, 'GDP']
Out[116]:
2000-01-07 60
2000-01-08 70
2000-01-09 80
2000-01-10 90
2000-01-11 NaN
Freq: D, Name: GDP, dtype: float64
As PaulH points out, you could also use query, which has a nicer syntax:
In [118]: combined.query('USRECQ==1')
Out[118]:
GDP USRECQ
2000-01-07 60 1
2000-01-08 70 1
2000-01-09 80 1
2000-01-10 90 1
2000-01-11 NaN 1
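And if you want just the two GDP series the question asks for, without the USRECQ column or the NaN row introduced by the outer join, here is a short follow-up sketch (the variable names are only illustrative):
gdp_expansion = combined.loc[combined['USRECQ'] == 0, 'GDP'].dropna()
gdp_recession = combined.loc[combined['USRECQ'] == 1, 'GDP'].dropna()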
For each observation in my data, I'm trying to come up with the number of observations created in the previous 7 days.
obs date
A 1/1/2000
B 1/4/2000
C 1/5/2000
D 1/10/2000
E 1/20/2000
F 1/1/2000
Would become:
obs date births last week
A 1/1/2000 2
B 1/4/2000 3
C 1/5/2000 4
D 1/10/2000 3
E 1/20/2000 1
F 1/1/2000 2
Right now I'm using the following method, but it's very slow:
import datetime as dt

def past_week(x, df):
    back = x['date'] - dt.timedelta(days=7)
    return df[(df['date'] >= back) & (df['date'] < x['date'])].shape[0]

df['births_last_week'] = df.apply(lambda x: past_week(x, df), axis=1)
Edit: Having difficulty with duplicate dates. Maybe I'm doing something wrong. I've edited the example above to include a repeated date:
df['births last week'] = df.groupby('date').cumcount() + 1
pd.rolling_count(df.set_index('date'), 7 + 1, freq='D').loc[df.date] - 1
gives:
date births last week
2000-01-01 1
2000-01-04 2
2000-01-05 3
2000-01-10 3
2000-01-20 1
2000-01-01 1
I've tried rolling_sum instead, but then all I get is NA values for births last week. I imagine there's something extremely obvious that I'm getting wrong, just not sure what.
Here's one approach:
df = pd.read_csv("birth.csv", delim_whitespace=True, parse_dates=["date"])
by_day = df.groupby("date").count().resample("D").fillna(0)
csum = by_day.cumsum()
last_week = csum - csum.shift(7).fillna(0)
final = last_week.loc[df.date]
producing
>>> final
obs
date
2000-01-01 2
2000-01-04 3
2000-01-05 4
2000-01-10 3
2000-01-20 1
2000-01-01 2
Step by step, first we get the DataFrame (you probably have this already):
>>> df = pd.read_csv("birth.csv", delim_whitespace=True, parse_dates=["date"])
>>> df
obs date
0 A 2000-01-01
1 B 2000-01-04
2 C 2000-01-05
3 D 2000-01-10
4 E 2000-01-20
5 F 2000-01-01
Then we groupby on date, and count the number of observations:
>>> df.groupby("date").count()
obs
date
2000-01-01 2
2000-01-04 1
2000-01-05 1
2000-01-10 1
2000-01-20 1
We can resample this to days; it'll be a much longer timeseries, of course, but memory is cheap and I'm lazy:
>>> df.groupby("date").count().resample("D")
obs
date
2000-01-01 2
2000-01-02 NaN
2000-01-03 NaN
2000-01-04 1
2000-01-05 1
2000-01-06 NaN
2000-01-07 NaN
2000-01-08 NaN
2000-01-09 NaN
2000-01-10 1
2000-01-11 NaN
2000-01-12 NaN
2000-01-13 NaN
2000-01-14 NaN
2000-01-15 NaN
2000-01-16 NaN
2000-01-17 NaN
2000-01-18 NaN
2000-01-19 NaN
2000-01-20 1
Get rid of the NaNs:
>>> by_day = df.groupby("date").count().resample("D").fillna(0)
>>> by_day
obs
date
2000-01-01 2
2000-01-02 0
2000-01-03 0
2000-01-04 1
2000-01-05 1
2000-01-06 0
2000-01-07 0
2000-01-08 0
2000-01-09 0
2000-01-10 1
2000-01-11 0
2000-01-12 0
2000-01-13 0
2000-01-14 0
2000-01-15 0
2000-01-16 0
2000-01-17 0
2000-01-18 0
2000-01-19 0
2000-01-20 1
And take the cumulative sum, as part of a manual rolling-sum process. The default rolling sum has the wrong alignment, so I'll just subtract with a difference of one week:
>>> csum = by_day.cumsum()
>>> last_week = csum - csum.shift(7).fillna(0)
>>> last_week
obs
date
2000-01-01 2
2000-01-02 2
2000-01-03 2
2000-01-04 3
2000-01-05 4
2000-01-06 4
2000-01-07 4
2000-01-08 2
2000-01-09 2
2000-01-10 3
2000-01-11 2
2000-01-12 1
2000-01-13 1
2000-01-14 1
2000-01-15 1
2000-01-16 1
2000-01-17 0
2000-01-18 0
2000-01-19 0
2000-01-20 1
And then select the dates we care about:
>>> final = last_week.loc[df.date]
>>> final
obs
date
2000-01-01 2
2000-01-04 3
2000-01-05 4
2000-01-10 3
2000-01-20 1
2000-01-01 2
In [57]: df
Out[57]:
obs date
0 A 2000-01-01
1 B 2000-01-04
2 C 2000-01-05
3 D 2000-01-10
4 E 2000-01-20
In [58]: df['births last week'] = 1
In [59]: pd.rolling_count(df.set_index('date'), 7 + 1, freq='D').loc[df.date] - 1
Out[59]:
births last week
2000-01-01 0
2000-01-04 1
2000-01-05 2
2000-01-10 2
2000-01-20 0
I subtract 1 because rolling_count includes the current entry, and you don't.
Edit: To handle duplicate dates, as discussed in comments on your question, group by date and sum the 'births last week' column between inputs 58 and 59 above.
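For completeness, here is a sketch of that duplicate-date handling written with the newer .rolling API (the module-level pd.rolling_count has since been removed from pandas); the column and variable names are only illustrative, and it reproduces the expected table from the question, which counts same-day observations including the row itself:
import pandas as pd

df = pd.DataFrame({'obs': list('ABCDEF'),
                   'date': pd.to_datetime(['1/1/2000', '1/4/2000', '1/5/2000',
                                           '1/10/2000', '1/20/2000', '1/1/2000'])})

# collapse duplicate dates into a per-day count, then turn it into a daily series
per_day = df.assign(births=1).groupby('date')['births'].sum().resample('D').sum()

# 7-day window ending on (and including) each day
last_week = per_day.rolling(7, min_periods=1).sum()

df['births last week'] = df['date'].map(last_week).astype(int)
print(df)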