Here is my problem.
You will find a sample of my DataFrame below:
import pandas as pd

df = pd.DataFrame({'Date': ['01/03/2000','01/04/2000','01/05/2000','01/06/2000','01/07/2000','01/08/2000'],
                   'Paul_Score': [3,10,22,32,20,40],
                   'John_Score': [8,42,10,57,3,70]})
df['Date'] = pd.to_datetime(df['Date'])
df = df.set_index('Date')
I started working on a loop with an if statement, like this:
def test(selection, symbol):
    df_end = selection * 0
    rolling_mean = selection.rolling(2).mean().fillna(0)
    calendar = pd.Series(df_end.index)
    for date in calendar:
        module = 1 / selection.loc[date, symbol]
        if selection.loc[date, symbol] > rolling_mean.loc[date, symbol]:
            df_end.loc[date, symbol] = module
        else:
            df_end.loc[date, symbol] = 0
    return df_end
Then:
test(df,'John_Score')
However, my problem is that I don't know how to deal with many columns at the same time. My goal is to apply this function to the whole DataFrame (all columns). This sample has only 2 columns, but in reality I have 30 and I don't know how to handle them.
EDIT:
This is what I get with test(df,'John_Score'):
Paul_Score John_Score
Date
2000-01-03 0 0.125000
2000-01-04 0 0.023810
2000-01-05 0 0.000000
2000-01-06 0 0.017544
2000-01-07 0 0.000000
2000-01-08 0 0.014286
And this is what I get with test(df,'Paul_Score'):
Paul_Score John_Score
Date
2000-01-03 0.333333 0
2000-01-04 0.100000 0
2000-01-05 0.045455 0
2000-01-06 0.031250 0
2000-01-07 0.000000 0
2000-01-08 0.025000 0
And I would like something like this:
Paul_Score John_Score
Date
2000-01-03 0.333333 0.125000
2000-01-04 0.100000 0.023810
2000-01-05 0.045455 0.000000
2000-01-06 0.031250 0.017544
2000-01-07 0.000000 0.000000
2000-01-08 0.025000 0.014286
My goal is to check each column of df every day: if the value is greater than its 2-day rolling mean, compute 1/value; otherwise, put 0.
There may be a simpler way, but I'm trying to improve my coding skills with for/if statements, and I've found that I have difficulty doing computations on DataFrames with many columns.
If you have any ideas, you are welcome to share them.
Maybe this code does the job:
import pandas as pd

df = pd.DataFrame({'Date': ['01/03/2000','01/04/2000','01/05/2000','01/06/2000','01/07/2000','01/08/2000'],
                   'Paul_Score': [3,10,22,32,20,40],
                   'John_Score': [8,42,10,57,3,70]})
df['Date'] = pd.to_datetime(df['Date'])
df = df.set_index('Date')
def test(selection, symbol):
    df_end = selection * 0
    rolling_mean = selection.rolling(2).mean().fillna(0)
    calendar = pd.Series(df_end.index)
    for date in calendar:
        for col in symbol:
            module = 1 / selection.loc[date, col]
            if selection.loc[date, col] > rolling_mean.loc[date, col]:
                df_end.loc[date, col] = module
            else:
                df_end.loc[date, col] = 0
    return df_end
test(df,['Paul_Score', 'John_Score'])
Output:
Paul_Score John_Score
Date
2000-01-03 0.333333 0.125000
2000-01-04 0.100000 0.023810
2000-01-05 0.045455 0.000000
2000-01-06 0.031250 0.017544
2000-01-07 0.000000 0.000000
2000-01-08 0.025000 0.014286
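For comparison, the same rule can be written without explicit loops. A minimal vectorized sketch (same rolling-mean logic; it assumes the frame contains no zeros, since we divide by it):

def test_vectorized(selection):
    # 2-day rolling mean; the first row's NaN becomes 0, as in the loop version
    rolling_mean = selection.rolling(2).mean().fillna(0)
    # keep 1/value wherever the value beats its rolling mean, else 0
    return (1 / selection).where(selection > rolling_mean, 0)

test_vectorized(df)  # same output as test(df, ['Paul_Score', 'John_Score'])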
I am a little new to Python and have a problem like this. I have a DataFrame of data from multiple sensors. There are NA (missing) values in the dataset that need to be filled with the rules below.
If the next sensor has data at the same timestamp, fill using the next sensor's data.
If the nearby sensor has no data either, fill with the average of all available sensors at the same timestamp.
If all sensors are missing data at the same timestamp, use linear interpolation of the sensor's own series to fill the missing values.
Here's some sample data I built:
import pandas as pd
import numpy as np

sensor1 = pd.DataFrame({"date": pd.date_range('1/1/2000', periods=10), "sensor": [1,1,1,1,1,1,1,1,1,1], "value": [np.nan,2,2,2,2,np.nan,np.nan,np.nan,4,6]})
sensor2 = pd.DataFrame({"date": pd.date_range('1/1/2000', periods=10), "sensor": [2,2,2,2,2,2,2,2,2,2], "value": [3,4,5,6,7,np.nan,np.nan,np.nan,7,8]})
sensor3 = pd.DataFrame({"date": pd.date_range('1/1/2000', periods=10), "sensor": [3,3,3,3,3,3,3,3,3,3], "value": [2,3,4,5,6,7,np.nan,np.nan,7,8]})
sensordata = pd.concat([sensor1, sensor2, sensor3]).reset_index(drop=True)  # DataFrame.append is removed in pandas 2.x
Any help would be appreciated.
With the answer from Christian, the solution is as follows.
# create data
df1 = pd.DataFrame({"date": pd.date_range('1/1/2000', periods=10), "sensor": [1,1,1,1,1,1,1,1,1,1], "value": [np.nan,2,2,2,2,np.nan,np.nan,np.nan,4,6]})
df2 = pd.DataFrame({"date": pd.date_range('1/1/2000', periods=10), "sensor": [2,2,2,2,2,2,2,2,2,2], "value": [3,4,5,6,7,np.nan,np.nan,np.nan,7,8]})
df3 = pd.DataFrame({"date": pd.date_range('1/1/2000', periods=10), "sensor": [3,3,3,3,3,3,3,3,3,3], "value": [2,3,4,5,6,7,np.nan,np.nan,7,8]})
df = pd.concat([df1, df2, df3]).reset_index(drop=True)
# pivot the dataframe so each sensor becomes a column
df = df.pivot(index='date', columns='sensor', values='value')
# step 1: use a specified sensor (here sensor 3) to fill missing values first
selectedsensor = 3
for c in df.columns:
    df[c] = df[c].fillna(df[selectedsensor])
# step 2: use the average of all available sensors to fill
df = df.transpose().fillna(df.transpose().mean()).transpose()
# step 3: use interpolation to fill the remaining missing values
df = df.interpolate()
# melt back to the original (long) data format
df = df.reset_index()
df = df.melt(id_vars=['date'], var_name='sensor')
#df = df.unstack('sensor').reset_index()
#df = df.rename(columns={0: 'value'})
The final output is as follows:
date sensor value
0 2000-01-01 1 2.0
1 2000-01-02 1 2.0
2 2000-01-03 1 2.0
3 2000-01-04 1 2.0
4 2000-01-05 1 2.0
5 2000-01-06 1 7.0
6 2000-01-07 1 6.0
7 2000-01-08 1 5.0
8 2000-01-09 1 4.0
9 2000-01-10 1 6.0
10 2000-01-01 2 3.0
11 2000-01-02 2 4.0
12 2000-01-03 2 5.0
13 2000-01-04 2 6.0
14 2000-01-05 2 7.0
15 2000-01-06 2 7.0
16 2000-01-07 2 7.0
17 2000-01-08 2 7.0
18 2000-01-09 2 7.0
19 2000-01-10 2 8.0
20 2000-01-01 3 2.0
21 2000-01-02 3 3.0
22 2000-01-03 3 4.0
23 2000-01-04 3 5.0
24 2000-01-05 3 6.0
25 2000-01-06 3 7.0
26 2000-01-07 3 7.0
27 2000-01-08 3 7.0
28 2000-01-09 3 7.0
29 2000-01-10 3 8.0
You can do the following:
Your dataset, pivoted:
import pandas as pd
import numpy as np

df = pd.DataFrame({"date": pd.date_range('1/1/2000', periods=10),
                   "sensor1": [np.nan,2,2,2,2,np.nan,np.nan,np.nan,4,6],
                   "sensor2": [3,4,5,6,7,np.nan,np.nan,np.nan,7,8],
                   "sensor3": [2,3,4,5,6,7,np.nan,np.nan,7,8]}).set_index('date')
1) This is fillna with backward filling and limit=1 along axis 1:
df.fillna(method='bfill',limit=1,axis=1)
2) This is fillna with the mean along axis 1. That isn't directly implemented, but we can get it by transposing:
df.transpose().fillna(df.transpose().mean()).transpose()
3) This is just interpolate:
df.interpolate()
Bonus:
This got a bit uglier, since I had to apply it column by column, but here is one selecting sensor 3 to fill from:
for c in df.columns:
    df[c] = df[c].fillna(df["sensor3"])
df
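Chaining the three steps in the order the rules list them, a sketch (note that each step feeds the next, rather than the original frame):

step1 = df.fillna(method='bfill', limit=1, axis=1)                      # rule 1: next sensor's value
step2 = step1.transpose().fillna(step1.transpose().mean()).transpose()  # rule 2: mean of available sensors
step3 = step2.interpolate()                                             # rule 3: each sensor's own interpolation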
For each row, I would like a count of how many of the previous 5 values in df['A'] are less than the current value in df['A'] and are also greater than or equal to the corresponding value in df2['A']. I am trying to avoid looping over every row and column because I'd like to apply this to a larger data set.
Given this...
import pandas as pd

list1 = [[21,101],[22,110],[25,113],[24,112],[21,109],[28,108],[30,102],[26,106],[25,111],[24,110]]
df = pd.DataFrame(list1, index=pd.date_range('2000-1-1', periods=10, freq='D'), columns=list('AB'))
df2 = pd.DataFrame(df * (1 - .05))
I would like to return this (solved in Excel with COUNTIFS)...
The line below achieves the first part (thanks Alexander), and Divakar and DSM have also weighed in previously (here and here).
df3 = pd.DataFrame(df.rolling(center=False,window=6).apply(lambda rollwin: sum((rollwin[:-1] < rollwin[-1]))))
But I am unable to add the comparison to df2. Please help.
FOLLOW UP on 10/27/16:
How would I write the lambda above as a standard function?
10/28/16:
See below, taking col 'A' from both df and df2, I am trying to count how many of the previous 5 values from df['A'] fall between the current df2['A'] and df['A']. Said differently, how many from each orange box fall between the yellow low-high range?
UPDATE: different list1 data produces incorrect df3...
list1 = [[21,101],[22,110],[25,113],[24,112],[21,109],[26,108],[25,102],[26,106],[25,111],[22,110]]
df = pd.DataFrame(list1, index=pd.date_range('2000-1-1', periods=10, freq='D'), columns=list('AB'))
df2 = pd.DataFrame(df * (1 - .05))
df3 = pd.DataFrame(
    df.rolling(center=False, window=6).apply(
        lambda rollwin: pd.Series(rollwin[:-1]).between(rollwin[-1]*0.95, rollwin[-1]).sum()))
df
Out[9]:
A B
2000-01-01 21 101
2000-01-02 22 110
2000-01-03 25 113
2000-01-04 24 112
2000-01-05 21 109
2000-01-06 26 108
2000-01-07 25 102
2000-01-08 26 106
2000-01-09 25 111
2000-01-10 22 110
df3
Out[8]:
A B
2000-01-01 NaN NaN
2000-01-02 NaN NaN
2000-01-03 NaN NaN
2000-01-04 NaN NaN
2000-01-05 NaN NaN
2000-01-06 1.0 0.0
2000-01-07 2.0 0.0
2000-01-08 3.0 1.0
2000-01-09 2.0 3.0
2000-01-10 1.0 3.0
EXCEL EXAMPLES (11/14): see below, trying to count how many numbers in the blue box fall between the range highlighted in orange.
list1 = [[21,50,101],[22,52,110],[25,49,113],[24,49,112],[21,55,109],[28,54,108],[30,57,102],[26,56,106],[25,58,111],[24,60,110]]
df = pd.DataFrame(list1, index=pd.date_range('2000-1-1', periods=10, freq='D'), columns=list('ABC'))
print(df)
I believe this matches your new screen shot "Given Data".
A B C
2000-01-01 21 50 101
2000-01-02 22 52 110
2000-01-03 25 49 113
2000-01-04 24 49 112
2000-01-05 21 55 109
2000-01-06 28 54 108
2000-01-07 30 57 102
2000-01-08 26 56 106
2000-01-09 25 58 111
2000-01-10 24 60 110
and the same function:
print(pd.DataFrame(
    df.rolling(center=False, window=6)
      .apply(lambda rollwin: pd.Series(rollwin[:-1])
             .between(rollwin[-1]*0.95, rollwin[-1]).sum())))
gives your desired output "Desired outcome":
A B C
2000-01-01 nan nan nan
2000-01-02 nan nan nan
2000-01-03 nan nan nan
2000-01-04 nan nan nan
2000-01-05 nan nan nan
2000-01-06 0 1 0
2000-01-07 0 1 0
2000-01-08 1 2 1
2000-01-09 1 2 3
2000-01-10 0 2 3
list1 = [[21,101],[22,110],[25,113],[24,112],[21,109],[28,108],[30,102],[26,106],[25,111],[24,110]]
df = pd.DataFrame(list1, index=pd.date_range('2000-1-1', periods=10, freq='D'), columns=list('AB'))
df2 = pd.DataFrame(df * (1 - .05))

window = 6
results = []
for i in range(len(df) - window + 1):
    slice_df1 = df.iloc[i:i + window]
    slice_df2 = df2.iloc[i:i + window]
    compare1 = slice_df1['A'].iloc[-1]
    compare2 = slice_df2['A'].iloc[-1]
    a = slice_df1.iloc[:-1]['A'].between(compare2, compare1)  # Series have a between method
    results.append(a.sum())
df_res = pd.DataFrame(data=results, index=df.index[window-1:], columns=['countifs'])
df_res = df_res.reindex(df.index, fill_value=0.0)
print(df_res)
which yields:
countifs
2000-01-01 0.0000
2000-01-02 0.0000
2000-01-03 0.0000
2000-01-04 0.0000
2000-01-05 0.0000
2000-01-06 0.0000
2000-01-07 0.0000
2000-01-08 1.0000
2000-01-09 1.0000
2000-01-10 0.0000
BUT
seeing that there is a logical relationship between your upper and lower bounds (value and value - 5%), this is perhaps what you want:
import numpy as np

df3 = pd.DataFrame(
    df.rolling(center=False, window=6).apply(
        lambda rollwin: sum(np.logical_and(
            rollwin[-1]*0.95 <= rollwin[:-1],
            rollwin[:-1] < rollwin[-1]))))
and if you prefer the pd.Series.between() approach:
df3 = pd.DataFrame(
    df.rolling(center=False, window=6).apply(
        lambda rollwin: pd.Series(rollwin[:-1]).between(rollwin[-1]*0.95, rollwin[-1]).sum()))
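As for the follow-up question about writing the lambda as a standard function, a minimal sketch (the function name is illustrative):

def count_in_band(rollwin):
    # rollwin is the 6-value window handed over by rolling().apply();
    # count how many of the first 5 values fall between 95% of the
    # current (last) value and the current value itself
    current = rollwin[-1]
    return pd.Series(rollwin[:-1]).between(current * 0.95, current).sum()

df3 = pd.DataFrame(df.rolling(center=False, window=6).apply(count_in_band))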
                         NI
YEAR MONTH datetime
2000 1     2000-01-01   NaN
           2000-01-02   NaN
           2000-01-03   NaN
           2000-01-04   NaN
           2000-01-05   NaN
In the dataframe above, I have a multilevel index consisting of the columns:
names=[u'YEAR', u'MONTH', u'datetime']
How do I revert to a dataframe with 'datetime' as index and 'YEAR' and 'MONTH' as normal columns?
Pass level=[0,1] to reset just those levels:
dist_df = dist_df.reset_index(level=[0,1])
In [28]:
df.reset_index(level=[0,1])
Out[28]:
YEAR MONTH NI
datetime
2000-01-01 2000 1 NaN
2000-01-02 2000 1 NaN
2000-01-03 2000 1 NaN
2000-01-04 2000 1 NaN
2000-01-05 2000 1 NaN
Alternatively, you can pass the level names:
df.reset_index(level=['YEAR','MONTH'])
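Equivalently, a sketch that resets everything and then puts the datetime index back:

df = df.reset_index().set_index('datetime')  # YEAR and MONTH become normal columns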
Another simple way would be to assign a plain list of labels to the DataFrame's columns:
consolidated_data.columns = country_master
ref: https://riptutorial.com/pandas/example/18695/how-to-change-multiindex-columns-to-standard-columns
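For context, a self-contained version of that idea (the names consolidated_data and country_master come from the linked tutorial; this illustrative equivalent flattens MultiIndex columns to plain labels):

import pandas as pd

df = pd.DataFrame([[1, 2]], columns=pd.MultiIndex.from_tuples([('GDP', 'US'), ('GDP', 'UK')]))
df.columns = ['GDP_US', 'GDP_UK']  # replace the MultiIndex columns with flat labels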
I have two different pandas DataFrames, and I want to extract data from one DataFrame whenever the other DataFrame has a specific value at the same time. To be concrete, I have one object called "GDP" which looks as follows:
GDP
DATE
1947-01-01 243.1
1947-04-01 246.3
1947-07-01 250.1
I additionally have a DataFrame called "recession" which contains data like the following:
USRECQ
DATE
1949-07-01 1
1949-10-01 1
1950-01-01 0
I want to create two new time series. One should contain GDP data whenever USRECQ has a value of 0 at the same DATE. The other one should contain GDP data whenever USRECQ has a value of 1 at the same DATE. How can I do that?
Let's modify the example you posted so the dates overlap:
import pandas as pd
import numpy as np
GDP = pd.DataFrame({'GDP': np.arange(10)*10},
                   index=pd.date_range('2000-1-1', periods=10, freq='D'))
# GDP
# 2000-01-01 0
# 2000-01-02 10
# 2000-01-03 20
# 2000-01-04 30
# 2000-01-05 40
# 2000-01-06 50
# 2000-01-07 60
# 2000-01-08 70
# 2000-01-09 80
# 2000-01-10 90
recession = pd.DataFrame({'USRECQ': [0]*5 + [1]*5},
                         index=pd.date_range('2000-1-2', periods=10, freq='D'))
# USRECQ
# 2000-01-02 0
# 2000-01-03 0
# 2000-01-04 0
# 2000-01-05 0
# 2000-01-06 0
# 2000-01-07 1
# 2000-01-08 1
# 2000-01-09 1
# 2000-01-10 1
# 2000-01-11 1
Then you could join the two dataframes:
combined = GDP.join(recession, how='outer') # change to how='inner' to remove NaNs
# GDP USRECQ
# 2000-01-01 0 NaN
# 2000-01-02 10 0
# 2000-01-03 20 0
# 2000-01-04 30 0
# 2000-01-05 40 0
# 2000-01-06 50 0
# 2000-01-07 60 1
# 2000-01-08 70 1
# 2000-01-09 80 1
# 2000-01-10 90 1
# 2000-01-11 NaN 1
and select rows based on a condition like this:
In [112]: combined.loc[combined['USRECQ']==0]
Out[112]:
GDP USRECQ
2000-01-02 10 0
2000-01-03 20 0
2000-01-04 30 0
2000-01-05 40 0
2000-01-06 50 0
In [113]: combined.loc[combined['USRECQ']==1]
Out[113]:
GDP USRECQ
2000-01-07 60 1
2000-01-08 70 1
2000-01-09 80 1
2000-01-10 90 1
2000-01-11 NaN 1
To get just the GDP column, supply the column name as the second term to combined.loc:
In [116]: combined.loc[combined['USRECQ']==1, 'GDP']
Out[116]:
2000-01-07 60
2000-01-08 70
2000-01-09 80
2000-01-10 90
2000-01-11 NaN
Freq: D, Name: GDP, dtype: float64
As PaulH points out, you could also use query, which has a nicer syntax:
In [118]: combined.query('USRECQ==1')
Out[118]:
GDP USRECQ
2000-01-07 60 1
2000-01-08 70 1
2000-01-09 80 1
2000-01-10 90 1
2000-01-11 NaN 1
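Putting it together, the two series the question asks for might be built like this (a sketch based on the combined frame above; dropna removes dates where GDP itself is missing):

gdp_expansion = combined.loc[combined['USRECQ'] == 0, 'GDP'].dropna()
gdp_recession = combined.loc[combined['USRECQ'] == 1, 'GDP'].dropna()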
For each observation in my data, I'm trying to come up with the number of observations created in the previous 7 days.
obs date
A 1/1/2000
B 1/4/2000
C 1/5/2000
D 1/10/2000
E 1/20/2000
F 1/1/2000
Would become:
obs date births last week
A 1/1/2000 2
B 1/4/2000 3
C 1/5/2000 4
D 1/10/2000 3
E 1/20/2000 1
F 1/1/2000 2
Right now I'm using the following method, but it's very slow:
import datetime as dt

def past_week(x, df):
    back = x['date'] - dt.timedelta(days=7)
    return df[(df['date'] >= back) & (df['date'] < x['date'])].count()

df['births_last_week'] = df.apply(lambda x: past_week(x, df), axis=1)
Edit: Having difficulty with duplicate dates. Maybe I'm doing something wrong. I've edited the example above to include a repeated date:
df['births last week'] = df.groupby('date').cumcount() + 1
pd.rolling_count(df.set_index('date'), 7 + 1, freq='D').loc[df.date] - 1
gives:
date births last week
2000-01-01 1
2000-01-04 2
2000-01-05 3
2000-01-10 3
2000-01-20 1
2000-01-01 1
I've tried rolling_sum instead, but then all I get is NA values for births last week. I imagine there's something extremely obvious that I'm getting wrong, just not sure what.
Here's one approach:
df = pd.read_csv("birth.csv", delim_whitespace=True, parse_dates=["date"])
by_day = df.groupby("date").count().resample("D").fillna(0)
csum = by_day.cumsum()
last_week = csum - csum.shift(7).fillna(0)
final = last_week.loc[df.date]
producing
>>> final
obs
date
2000-01-01 2
2000-01-04 3
2000-01-05 4
2000-01-10 3
2000-01-20 1
2000-01-01 2
Step by step, first we get the DataFrame (you probably have this already):
>>> df = pd.read_csv("birth.csv", delim_whitespace=True, parse_dates=["date"])
>>> df
obs date
0 A 2000-01-01
1 B 2000-01-04
2 C 2000-01-05
3 D 2000-01-10
4 E 2000-01-20
5 F 2000-01-01
Then we groupby on date, and count the number of observations:
>>> df.groupby("date").count()
obs
date
2000-01-01 2
2000-01-04 1
2000-01-05 1
2000-01-10 1
2000-01-20 1
We can resample this to days; it'll be a much longer timeseries, of course, but memory is cheap and I'm lazy:
>>> df.groupby("date").count().resample("D")
obs
date
2000-01-01 2
2000-01-02 NaN
2000-01-03 NaN
2000-01-04 1
2000-01-05 1
2000-01-06 NaN
2000-01-07 NaN
2000-01-08 NaN
2000-01-09 NaN
2000-01-10 1
2000-01-11 NaN
2000-01-12 NaN
2000-01-13 NaN
2000-01-14 NaN
2000-01-15 NaN
2000-01-16 NaN
2000-01-17 NaN
2000-01-18 NaN
2000-01-19 NaN
2000-01-20 1
Get rid of the nans:
>>> by_day = df.groupby("date").count().resample("D").fillna(0)
>>> by_day
obs
date
2000-01-01 2
2000-01-02 0
2000-01-03 0
2000-01-04 1
2000-01-05 1
2000-01-06 0
2000-01-07 0
2000-01-08 0
2000-01-09 0
2000-01-10 1
2000-01-11 0
2000-01-12 0
2000-01-13 0
2000-01-14 0
2000-01-15 0
2000-01-16 0
2000-01-17 0
2000-01-18 0
2000-01-19 0
2000-01-20 1
And take the cumulative sum, as part of a manual rolling-sum process. The default rolling sum has the wrong alignment, so I'll just subtract with a difference of one week:
>>> csum = by_day.cumsum()
>>> last_week = csum - csum.shift(7).fillna(0)
>>> last_week
obs
date
2000-01-01 2
2000-01-02 2
2000-01-03 2
2000-01-04 3
2000-01-05 4
2000-01-06 4
2000-01-07 4
2000-01-08 2
2000-01-09 2
2000-01-10 3
2000-01-11 2
2000-01-12 1
2000-01-13 1
2000-01-14 1
2000-01-15 1
2000-01-16 1
2000-01-17 0
2000-01-18 0
2000-01-19 0
2000-01-20 1
And then select the dates we care about:
>>> final = last_week.loc[df.date]
>>> final
obs
date
2000-01-01 2
2000-01-04 3
2000-01-05 4
2000-01-10 3
2000-01-20 1
2000-01-01 2
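On current pandas versions (where resample requires an explicit aggregation and pd.rolling_count is gone), the same computation might look like this sketch, assuming the same df:

daily = df.set_index("date").resample("D").size()   # births per calendar day, zero-filled
last_week = daily.rolling(7, min_periods=1).sum()   # 7-day window including the current day
final = last_week.loc[df["date"]]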
In [57]: df
Out[57]:
obs date
0 A 2000-01-01
1 B 2000-01-04
2 C 2000-01-05
3 D 2000-01-10
4 E 2000-01-20
In [58]: df['births last week'] = 1
In [59]: pd.rolling_count(df.set_index('date'), 7 + 1, freq='D').loc[df.date] - 1
Out[59]:
births last week
2000-01-01 0
2000-01-04 1
2000-01-05 2
2000-01-10 2
2000-01-20 0
I subtract 1 because rolling_count includes the current entry, and you don't.
Edit: To handle duplicate dates, as discussed in comments on your question, group by date and sum the 'births last week' column between inputs 58 and 59 above.