Hi, I've read a lot of questions here on Stack Overflow about this problem, but my task is a little different.
I have this DataFrame:
# DateTime Close
1 2000-01-04 1460
2 2000-01-05 1470
3 2000-01-06 1480
4 2000-01-07 1450
I want to get the difference between each row of the Close column, but store a [1-0] value depending on whether the difference is positive or negative. I want this result:
# DateTime Close label
1 2000-01-04 1460 1
2 2000-01-05 1470 1
3 2000-01-06 1480 1
4 2000-01-07 1450 0
I've done this:
df = pd.read_csv(DATASET_path)
df['Label'] = 0
df['Label'] = (df['Close'] - df['Close'].shift(1) > 1)
The problem is that the result is shifted by one row, so I get the difference starting from the second row instead of the first. (Also, I get boolean values [True, False] instead of 1 or 0.)
This is what I get:
# DateTime Close label
1 2000-01-04 1460
2 2000-01-05 1470 True
3 2000-01-06 1480 True
4 2000-01-07 1450 True
Any solution?
Thanks
You can use DataFrame.diff and check which first differences are greater than 0. Finally cast the result to int with .astype(int):
df['label'] = df.Close.diff().fillna(1).gt(0).astype(int)
Output
# DateTime Close label
0 1 2000-01-04 1460 1
1 2 2000-01-05 1470 1
2 3 2000-01-06 1480 1
3 4 2000-01-07 1450 0
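As a side note (my addition, not part of the original answer): the same mapping can be written with numpy.where, assuming numpy is available as np:
import numpy as np

# the first row's diff is NaN; filling it with 1 labels it positive,
# exactly as the fillna(1) in the one-liner above does
df['label'] = np.where(df['Close'].diff().fillna(1) > 0, 1, 0)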
I think you need diff with bfill to replace the first missing value with the next valid one; finally, convert the boolean mask to integers to map True/False to 1/0:
df['Label'] = (df['Close'].diff().bfill() > 0).astype(int)
Verify solution:
print (df)
DateTime Close
1 2000-01-04 1460
2 2000-01-05 1440 <-changed value
3 2000-01-06 1480
4 2000-01-07 1450
df['Label'] = (df['Close'].diff().bfill() > 0).astype(int)
print (df)
DateTime Close Label
1 2000-01-04 1460 0
2 2000-01-05 1440 0
3 2000-01-06 1480 1
4 2000-01-07 1450 0
Here is my problem. Below is a sample of my DataFrame:
import pandas as pd

df = pd.DataFrame({'Date': ['01/03/2000','01/04/2000','01/05/2000','01/06/2000','01/07/2000','01/08/2000'],
                   'Paul_Score': [3, 10, 22, 32, 20, 40],
                   'John_Score': [8, 42, 10, 57, 3, 70]})
df['Date'] = pd.to_datetime(df['Date'])
df = df.set_index('Date')
And I started to work on a loop with an if statement, like this:
def test(selection, symbol):
    df_end = selection * 0
    rolling_mean = selection.rolling(2).mean().fillna(0)
    calendar = pd.Series(df_end.index)
    for date in calendar:
        module = 1 / selection.loc[date, symbol]
        if selection.loc[date, symbol] > rolling_mean.loc[date, symbol]:
            df_end.loc[date, symbol] = module
        else:
            df_end.loc[date, symbol] = 0
    return df_end
Then:
test(df,'John_Score')
However, my problem is that I don't know how to deal with many columns at the same time. My goal is to apply this function to the whole DataFrame (all columns); this sample has only 2 columns, but in reality I have 30 and I don't know how to handle them all.
EDIT:
This is what I have with test(df,'John_Score') :
Paul_Score John_Score
Date
2000-01-03 0 0.125000
2000-01-04 0 0.023810
2000-01-05 0 0.000000
2000-01-06 0 0.017544
2000-01-07 0 0.000000
2000-01-08 0 0.014286
And this is what I have with test(df,'Paul_Score') :
Paul_Score John_Score
Date
2000-01-03 0.333333 0
2000-01-04 0.100000 0
2000-01-05 0.045455 0
2000-01-06 0.031250 0
2000-01-07 0.000000 0
2000-01-08 0.025000 0
And I would like something like that :
Paul_Score John_Score
Date
2000-01-03 0.333333 0.125000
2000-01-04 0.100000 0.023810
2000-01-05 0.045455 0.000000
2000-01-06 0.031250 0.017544
2000-01-07 0.000000 0.000000
2000-01-08 0.025000 0.014286
My goal is to check, every day and for each column, whether the value is greater than its 2-day rolling mean; if it is, we compute 1/value, otherwise 0.
There may be a simpler way, but I'm trying to sharpen my skills with for/if statements, and I've found I have difficulty doing computations on DataFrames with many columns.
If you have any ideas, you are welcome.
Maybe this code does the job:
import pandas as pd

df = pd.DataFrame({'Date': ['01/03/2000','01/04/2000','01/05/2000','01/06/2000','01/07/2000','01/08/2000'],
                   'Paul_Score': [3, 10, 22, 32, 20, 40],
                   'John_Score': [8, 42, 10, 57, 3, 70]})
df['Date'] = pd.to_datetime(df['Date'])
df = df.set_index('Date')

def test(selection, symbol):
    df_end = selection * 0
    rolling_mean = selection.rolling(2).mean().fillna(0)
    calendar = pd.Series(df_end.index)
    for date in calendar:
        for col in symbol:  # loop over the requested columns
            module = 1 / selection.loc[date, col]
            if selection.loc[date, col] > rolling_mean.loc[date, col]:
                df_end.loc[date, col] = module
            else:
                df_end.loc[date, col] = 0
    return df_end

test(df, ['Paul_Score', 'John_Score'])
Output:
Paul_Score John_Score
Date
2000-01-03 0.333333 0.125000
2000-01-04 0.100000 0.023810
2000-01-05 0.045455 0.000000
2000-01-06 0.031250 0.017544
2000-01-07 0.000000 0.000000
2000-01-08 0.025000 0.014286
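As a side note (my addition, not part of the answer above): the double loop can be replaced by a fully vectorized sketch using DataFrame.where, which reproduces the output above on this sample:
# keep 1/value wherever the value beats its 2-day rolling mean, else 0;
# fillna(0) mirrors the loop's treatment of the first row
df_end = (1 / df).where(df > df.rolling(2).mean().fillna(0), 0)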
I would like to get a count of how many of the previous 5 values in df['A'] are < the current value in df['A'] and are also >= df2['A']. I am trying to avoid looping over every row and column because I'd like to apply this to a larger data set.
Given this...
list1 = [[21,101],[22,110],[25,113],[24,112],[21,109],[28,108],[30,102],[26,106],[25,111],[24,110]]
df = pd.DataFrame(list1,index=pd.date_range('2000-1-1',periods=10, freq='D'), columns=list('AB'))
df2 = pd.DataFrame(df * (1-.05))
I would like to return this (solved in Excel with COUNTIFS)...
The line below achieves the first part (thanks Alexander), and Divakar and DSM have also weighed in previously (here and here).
df3 = pd.DataFrame(df.rolling(center=False,window=6).apply(lambda rollwin: sum((rollwin[:-1] < rollwin[-1]))))
But I am unable to add the comparison to df2. Please help.
FOLLOW UP on 10/27/16:
How would I write the lambda above as a standard function?
10/28/16:
See below: taking column 'A' from both df and df2, I am trying to count how many of the previous 5 values from df['A'] fall between the current df2['A'] and df['A']. Said differently, how many values from each orange box fall within the yellow low-high range?
UPDATE: different list1 data produces incorrect df3...
list1 = [[21,101],[22,110],[25,113],[24,112],[21,109],[26,108],[25,102],[26,106],[25,111],[22,110]]
df = pd.DataFrame(list1,index=pd.date_range('2000-1-1',periods=10, freq='D'), columns=list('AB'))
df2 = pd.DataFrame(df * (1-.05))
df3 = pd.DataFrame(
    df.rolling(center=False, window=6).apply(
        lambda rollwin: pd.Series(rollwin[:-1]).between(rollwin[-1]*0.95, rollwin[-1]).sum()))
df
Out[9]:
A B
2000-01-01 21 101
2000-01-02 22 110
2000-01-03 25 113
2000-01-04 24 112
2000-01-05 21 109
2000-01-06 26 108
2000-01-07 25 102
2000-01-08 26 106
2000-01-09 25 111
2000-01-10 22 110
df3
Out[8]:
A B
2000-01-01 NaN NaN
2000-01-02 NaN NaN
2000-01-03 NaN NaN
2000-01-04 NaN NaN
2000-01-05 NaN NaN
2000-01-06 1.0 0.0
2000-01-07 2.0 0.0
2000-01-08 3.0 1.0
2000-01-09 2.0 3.0
2000-01-10 1.0 3.0
EXCEL EXAMPLES (11/14): see below; I am trying to count how many numbers in the blue box fall within the range highlighted in orange.
list1 = [[21,50,101],[22,52,110],[25,49,113],[24,49,112],[21,55,109],[28,54,108],[30,57,102],[26,56,106],[25,58,111],[24,60,110]]
df = pd.DataFrame(list1,index=pd.date_range('2000-1-1',periods=10, freq='D'), columns=list('ABC'))
print(df)
I believe this matches your new screenshot "Given Data".
A B C
2000-01-01 21 50 101
2000-01-02 22 52 110
2000-01-03 25 49 113
2000-01-04 24 49 112
2000-01-05 21 55 109
2000-01-06 28 54 108
2000-01-07 30 57 102
2000-01-08 26 56 106
2000-01-09 25 58 111
2000-01-10 24 60 110
and the same function:
print(pd.DataFrame(
    df.rolling(center=False, window=6).apply(
        lambda rollwin: pd.Series(rollwin[:-1]).between(rollwin[-1]*0.95, rollwin[-1]).sum())))
gives your desired output "Desired outcome":
A B C
2000-01-01 nan nan nan
2000-01-02 nan nan nan
2000-01-03 nan nan nan
2000-01-04 nan nan nan
2000-01-05 nan nan nan
2000-01-06 0 1 0
2000-01-07 0 1 0
2000-01-08 1 2 1
2000-01-09 1 2 3
2000-01-10 0 2 3
list1 = [[21,101],[22,110],[25,113],[24,112],[21,109],[28,108],[30,102],[26,106],[25,111],[24,110]]
df = pd.DataFrame(list1,index=pd.date_range('2000-1-1',periods=10, freq='D'), columns=list('AB'))
df2 = pd.DataFrame(df * (1-.05))
window = 6
results = []
for i in range(len(df) - window + 1):
    slice_df1 = df.iloc[i:i + window]
    slice_df2 = df2.iloc[i:i + window]
    compare1 = slice_df1['A'].iloc[-1]
    compare2 = slice_df2['A'].iloc[-1]
    a = slice_df1.iloc[:-1]['A'].between(compare2, compare1)  # Series have a between method
    results.append(a.sum())

df_res = pd.DataFrame(data=results, index=df.index[window - 1:], columns=['countifs'])
df_res = df_res.reindex(df.index, fill_value=0.0)
print(df_res)
which yields:
countifs
2000-01-01 0.0000
2000-01-02 0.0000
2000-01-03 0.0000
2000-01-04 0.0000
2000-01-05 0.0000
2000-01-06 0.0000
2000-01-07 0.0000
2000-01-08 1.0000
2000-01-09 1.0000
2000-01-10 0.0000
BUT, seeing that there is a logical relationship between your upper and lower bounds (the value and the value minus 5%), this will perhaps be what you want:
import numpy as np

df3 = pd.DataFrame(
    df.rolling(center=False, window=6).apply(
        lambda rollwin: sum(np.logical_and(
            rollwin[-1] * 0.95 <= rollwin[:-1],
            rollwin[:-1] < rollwin[-1]))))
and if you prefer the pd.Series.between() approach:
df3 = pd.DataFrame(
    df.rolling(center=False, window=6).apply(
        lambda rollwin: pd.Series(rollwin[:-1]).between(rollwin[-1]*0.95, rollwin[-1]).sum()))
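A side note of my own: center=False is the default and can be dropped, and on recent pandas you can also pass raw=True so each window arrives as a plain NumPy array, which avoids constructing a Series per window:
import numpy as np

# same countif as above: how many of the previous 5 values fall in
# [current*0.95, current); raw=True hands the lambda an ndarray
df3 = pd.DataFrame(
    df.rolling(window=6).apply(
        lambda rollwin: np.sum((rollwin[-1] * 0.95 <= rollwin[:-1])
                               & (rollwin[:-1] < rollwin[-1])),
        raw=True))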
                        NI
YEAR MONTH datetime
2000 1     2000-01-01  NaN
           2000-01-02  NaN
           2000-01-03  NaN
           2000-01-04  NaN
           2000-01-05  NaN
In the dataframe above, I have a multilevel index consisting of the columns:
names=[u'YEAR', u'MONTH', u'datetime']
How do I revert to a dataframe with 'datetime' as index and 'YEAR' and 'MONTH' as normal columns?
Pass level=[0,1] to reset_index to reset just those levels:
dist_df = dist_df.reset_index(level=[0,1])
In [28]:
df.reset_index(level=[0,1])
Out[28]:
YEAR MONTH NI
datetime
2000-01-01 2000 1 NaN
2000-01-02 2000 1 NaN
2000-01-03 2000 1 NaN
2000-01-04 2000 1 NaN
2000-01-05 2000 1 NaN
Alternatively, you can pass the label names:
df.reset_index(level=['YEAR','MONTH'])
Another simple way would be to assign a flat list of column names directly to the DataFrame (here consolidated_data is the DataFrame and country_master the list of names used in the linked example):
consolidated_data.columns = country_master
Ref: https://riptutorial.com/pandas/example/18695/how-to-change-multiindex-columns-to-standard-columns
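For concreteness, a minimal sketch of that column-assignment idea applied to the YEAR/MONTH frame above (the flat names here are my assumption):
flat = df.reset_index()                             # all index levels become ordinary columns
flat.columns = ['YEAR', 'MONTH', 'datetime', 'NI']  # assign a plain, flat set of names
flat = flat.set_index('datetime')                   # keep only datetime as the index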
I have two different pandas DataFrames, and I want to extract data from one DataFrame whenever the other DataFrame has a specific value at the same time. To be concrete, I have one object called "GDP" which looks as follows:
GDP
DATE
1947-01-01 243.1
1947-04-01 246.3
1947-07-01 250.1
I additionally have a DataFrame called "recession" which contains data like the following:
USRECQ
DATE
1949-07-01 1
1949-10-01 1
1950-01-01 0
I want to create two new time series. One should contain GDP data whenever USRECQ has a value of 0 at the same DATE. The other one should contain GDP data whenever USRECQ has a value of 1 at the same DATE. How can I do that?
Let's modify the example you posted so the dates overlap:
import pandas as pd
import numpy as np
GDP = pd.DataFrame({'GDP':np.arange(10)*10},
index=pd.date_range('2000-1-1', periods=10, freq='D'))
# GDP
# 2000-01-01 0
# 2000-01-02 10
# 2000-01-03 20
# 2000-01-04 30
# 2000-01-05 40
# 2000-01-06 50
# 2000-01-07 60
# 2000-01-08 70
# 2000-01-09 80
# 2000-01-10 90
recession = pd.DataFrame({'USRECQ': [0]*5+[1]*5},
index=pd.date_range('2000-1-2', periods=10, freq='D'))
# USRECQ
# 2000-01-02 0
# 2000-01-03 0
# 2000-01-04 0
# 2000-01-05 0
# 2000-01-06 0
# 2000-01-07 1
# 2000-01-08 1
# 2000-01-09 1
# 2000-01-10 1
# 2000-01-11 1
Then you could join the two dataframes:
combined = GDP.join(recession, how='outer') # change to how='inner' to remove NaNs
# GDP USRECQ
# 2000-01-01 0 NaN
# 2000-01-02 10 0
# 2000-01-03 20 0
# 2000-01-04 30 0
# 2000-01-05 40 0
# 2000-01-06 50 0
# 2000-01-07 60 1
# 2000-01-08 70 1
# 2000-01-09 80 1
# 2000-01-10 90 1
# 2000-01-11 NaN 1
and select rows based on a condition like this:
In [112]: combined.loc[combined['USRECQ']==0]
Out[112]:
GDP USRECQ
2000-01-02 10 0
2000-01-03 20 0
2000-01-04 30 0
2000-01-05 40 0
2000-01-06 50 0
In [113]: combined.loc[combined['USRECQ']==1]
Out[113]:
GDP USRECQ
2000-01-07 60 1
2000-01-08 70 1
2000-01-09 80 1
2000-01-10 90 1
2000-01-11 NaN 1
To get just the GDP column supply the column name as the second term to combined.loc:
In [116]: combined.loc[combined['USRECQ']==1, 'GDP']
Out[116]:
2000-01-07 60
2000-01-08 70
2000-01-09 80
2000-01-10 90
2000-01-11 NaN
Freq: D, Name: GDP, dtype: float64
As PaulH points out, you could also use query, which has a nicer syntax:
In [118]: combined.query('USRECQ==1')
Out[118]:
GDP USRECQ
2000-01-07 60 1
2000-01-08 70 1
2000-01-09 80 1
2000-01-10 90 1
2000-01-11 NaN 1
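To wrap up with the two series the question asks for, a short sketch using the combined frame from above:
# GDP outside recessions (USRECQ == 0) and during recessions (USRECQ == 1)
gdp_expansion = combined.loc[combined['USRECQ'] == 0, 'GDP']
gdp_recession = combined.loc[combined['USRECQ'] == 1, 'GDP']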
For each observation in my data, I'm trying to come up with the number of observations created in the previous 7 days.
obs date
A 1/1/2000
B 1/4/2000
C 1/5/2000
D 1/10/2000
E 1/20/2000
F 1/1/2000
Would become:
obs date births last week
A 1/1/2000 2
B 1/4/2000 3
C 1/5/2000 4
D 1/10/2000 3
E 1/20/2000 1
F 1/1/2000 2
Right now I'm using the following method, but it's very slow:
import datetime as dt

def past_week(x, df):
    back = x['date'] - dt.timedelta(days=7)
    return df[(df['date'] >= back) & (df['date'] < x['date'])].count()

df['births_last_week'] = df.apply(lambda x: past_week(x, df), axis=1)
Edit: I'm having difficulty with duplicate dates. Maybe I'm doing something wrong. I've edited the example above to include a repeated date:
df['births last week'] = df.groupby('date').cumcount() + 1
pd.rolling_count(df.set_index('date'), 7 + 1, freq='D').loc[df.date] - 1
gives:
date births last week
2000-01-01 1
2000-01-04 2
2000-01-05 3
2000-01-10 3
2000-01-20 1
2000-01-01 1
I've tried rolling_sum instead, but then all I get is NaN values for births last week. I imagine there's something extremely obvious that I'm getting wrong; I'm just not sure what.
Here's one approach:
df = pd.read_csv("birth.csv", delim_whitespace=True, parse_dates=["date"])
by_day = df.groupby("date").count().resample("D").fillna(0)
csum = by_day.cumsum()
last_week = csum - csum.shift(7).fillna(0)
final = last_week.loc[df.date]
producing
>>> final
obs
date
2000-01-01 2
2000-01-04 3
2000-01-05 4
2000-01-10 3
2000-01-20 1
2000-01-01 2
Step by step, first we get the DataFrame (you probably have this already):
>>> df = pd.read_csv("birth.csv", delim_whitespace=True, parse_dates=["date"])
>>> df
obs date
0 A 2000-01-01
1 B 2000-01-04
2 C 2000-01-05
3 D 2000-01-10
4 E 2000-01-20
5 F 2000-01-01
Then we groupby on date, and count the number of observations:
>>> df.groupby("date").count()
obs
date
2000-01-01 2
2000-01-04 1
2000-01-05 1
2000-01-10 1
2000-01-20 1
We can resample this to days; it'll be a much longer timeseries, of course, but memory is cheap and I'm lazy:
>>> df.groupby("date").count().resample("D")
obs
date
2000-01-01 2
2000-01-02 NaN
2000-01-03 NaN
2000-01-04 1
2000-01-05 1
2000-01-06 NaN
2000-01-07 NaN
2000-01-08 NaN
2000-01-09 NaN
2000-01-10 1
2000-01-11 NaN
2000-01-12 NaN
2000-01-13 NaN
2000-01-14 NaN
2000-01-15 NaN
2000-01-16 NaN
2000-01-17 NaN
2000-01-18 NaN
2000-01-19 NaN
2000-01-20 1
Get rid of the nans:
>>> by_day = df.groupby("date").count().resample("D").fillna(0)
>>> by_day
obs
date
2000-01-01 2
2000-01-02 0
2000-01-03 0
2000-01-04 1
2000-01-05 1
2000-01-06 0
2000-01-07 0
2000-01-08 0
2000-01-09 0
2000-01-10 1
2000-01-11 0
2000-01-12 0
2000-01-13 0
2000-01-14 0
2000-01-15 0
2000-01-16 0
2000-01-17 0
2000-01-18 0
2000-01-19 0
2000-01-20 1
And take the cumulative sum, as part of a manual rolling-sum process. The default rolling sum has the wrong alignment, so I'll just subtract with a difference of one week:
>>> csum = by_day.cumsum()
>>> last_week = csum - csum.shift(7).fillna(0)
>>> last_week
obs
date
2000-01-01 2
2000-01-02 2
2000-01-03 2
2000-01-04 3
2000-01-05 4
2000-01-06 4
2000-01-07 4
2000-01-08 2
2000-01-09 2
2000-01-10 3
2000-01-11 2
2000-01-12 1
2000-01-13 1
2000-01-14 1
2000-01-15 1
2000-01-16 1
2000-01-17 0
2000-01-18 0
2000-01-19 0
2000-01-20 1
And then select the dates we care about:
>>> final = last_week.loc[df.date]
>>> final
obs
date
2000-01-01 2
2000-01-04 3
2000-01-05 4
2000-01-10 3
2000-01-20 1
2000-01-01 2
In [57]: df
Out[57]:
obs date
0 A 2000-01-01
1 B 2000-01-04
2 C 2000-01-05
3 D 2000-01-10
4 E 2000-01-20
In [58]: df['births last week'] = 1
In [59]: pd.rolling_count(df.set_index('date'), 7 + 1, freq='D').loc[df.date] - 1
Out[59]:
births last week
2000-01-01 0
2000-01-04 1
2000-01-05 2
2000-01-10 2
2000-01-20 0
I subtract 1 because rolling_count includes the current entry, and you don't.
Edit: To handle duplicate dates, as discussed in the comments on your question, group by date and sum the 'births last week' column between inputs 58 and 59 above.
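A closing note of my own: pd.rolling_count was removed in later pandas releases. A minimal modern sketch of the same idea, which also handles duplicate dates via the groupby suggested in the edit (assuming the df from the question, with a 'date' column):
per_day = df.groupby('date').size().resample('D').sum()     # births per calendar day, 0 on empty days
last_week = per_day.rolling(window=7, min_periods=1).sum()  # trailing 7-day total, inclusive of today
df['births last week'] = last_week.loc[df['date']].values   # look up each observation's date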