I have some sample time series data (stock prices), shown below:
Date PX_OPEN PX_LAST
Date
2011-01-03 2011-01-03 31.18 31.26
2011-01-04 2011-01-04 31.42 31.02
2011-01-05 2011-01-05 31.10 30.54
2011-01-06 2011-01-06 30.66 30.54
2011-01-07 2011-01-07 31.50 30.66
2011-01-10 2011-01-10 30.82 30.94
I would like to add a new column GAP based on the following conditions:
If current day open is higher than previous day last, then GAP = up.
If current day open is lower than previous day last, then GAP = down.
Otherwise, GAP = unch. (Alternatively, up can be changed to +1, down to -1, and unch to 0.)
I can do this with an if statement and a for loop, but that would defeat the efficiency of vectorized operations in Pandas. Can anyone help?
Use nested np.where calls:
import numpy as np
df['GAP'] = np.where(df['PX_OPEN'] > df['PX_LAST'].shift(), 'up',
                     np.where(df['PX_OPEN'] < df['PX_LAST'].shift(), 'down', 'unch'))
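If you prefer the numeric labels mentioned in the question (+1 / -1 / 0), a small variant of the same idea using np.sign (a sketch, not part of the original answer):

import numpy as np

# +1 if today's open gapped up from yesterday's close, -1 if it gapped down,
# 0 if unchanged; the first row has no previous close, so it gets 0
df['GAP'] = np.sign(df['PX_OPEN'] - df['PX_LAST'].shift()).fillna(0).astype(int)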
To give a brief overview of what's going on: I am observing temperature fluctuations and have filtered the indoor and outdoor temperature data from an office down to only the periods where the temperature fluctuates. These fluctuations only occur in the mornings and at night, because during the day the temperature is controlled. I will be using an ANN to learn from these fluctuations and model how long it takes for the temperature to change, depending on other variables like OutdoorTemp, SolarDiffuseRate, etc.
Question 1:
How do I go through the rows, look at the time of day, and add a binary column where 0 is morning and 1 is night-time?
Question 2:
For each day there will be a different number of rows for the morning and the evening, depending on how long it takes the temperature to change between 22 degrees and 17 degrees. How do I add a column which, for each day and each morning/evening, states the time it took for the temperature to get from X to Y?
Basically adding or subtracting time to get the difference, then appending per morning or night.
OfficeTemp OutdoorTemp SolarDiffuseRate
DateTime
2006-01-01 07:15:00 19.915275 0.8125 0.0
2006-01-01 07:30:00 20.463506 0.8125 0.0
2006-01-01 07:45:00 20.885112 0.8125 0.0
2006-01-01 20:15:00 19.985398 8.3000 0.0
2006-01-01 20:30:00 19.157857 8.3000 0.0
... ... ... ...
2006-06-30 22:00:00 18.056205 22.0125 0.0
2006-06-30 22:15:00 17.993072 19.9875 0.0
2006-06-30 22:30:00 17.929643 19.9875 0.0
2006-06-30 22:45:00 17.867148 19.9875 0.0
2006-06-30 23:00:00 17.804429 19.9875 0.0
import numpy as np
import pandas as pd

df = pd.DataFrame(index=pd.date_range('2006-01-01', '2006-06-30', freq='15min'))
df['OfficeTemp'] = np.random.normal(loc=20, scale=5, size=df.shape[0])
df['OutdoorTemp'] = np.random.normal(loc=12, scale=5, size=df.shape[0])
df['SolarDiffuseRate'] = 0.0
Question 1:
df['PartofDay'] = df.index.hour.map(lambda x: 0 if x < 12 else 1)
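An equivalent vectorized form, as a small aside (the map/lambda above works fine too):

# True/False for "hour is 12 or later", cast to 1/0
df['PartofDay'] = (df.index.hour >= 12).astype(int)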
For question 2, a tolerance would need to be defined (the temperature is never going to be exactly 17 or 22 degrees).
import numpy as np

def temp_change_duration(group):
    # Time between the first reading near 17 degrees and the first near 22 degrees
    tol = 0.3
    first_time = group.index[np.isclose(group['OfficeTemp'], 17, atol=tol)][0]
    second_time = group.index[np.isclose(group['OfficeTemp'], 22, atol=tol)][0]
    return abs(second_time - first_time)
Then apply this function to our df:
df.groupby([df.index.day, 'PartofDay']).apply(temp_change_duration)
This will get you most of the way there, but it will give funny answers with the normally distributed synthetic data I've generated. See if you can adapt temp_change_duration to work with your data.
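If your real data can contain mornings or evenings where one of the thresholds is never reached within the tolerance, a guarded variant of the same function (a sketch, not part of the original answer) avoids an IndexError:

import numpy as np
import pandas as pd

def temp_change_duration_safe(group, tol=0.3):
    # Timestamps where the office temperature is approximately 17 or 22 degrees
    near_17 = group.index[np.isclose(group['OfficeTemp'], 17, atol=tol)]
    near_22 = group.index[np.isclose(group['OfficeTemp'], 22, atol=tol)]
    if len(near_17) == 0 or len(near_22) == 0:
        return pd.NaT  # threshold never reached in this morning/evening
    return abs(near_22[0] - near_17[0])

df.groupby([df.index.day, 'PartofDay']).apply(temp_change_duration_safe)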
I am just learning programming. What I am doing now is resampling data at a date-time frequency, as shown below.
loctimestamp volume Date
0 2011-01-04 08:01:03.820 2994.0 2011-01-04
1 2011-01-04 08:01:03.940 2.0 2011-01-04
2 2011-01-04 08:01:04.240 4.0 2011-01-04
3 2011-01-04 08:01:04.560 12.0 2011-01-04
4 2011-01-04 08:01:04.580 1.0 2011-01-04
5 2011-01-04 08:01:04.690 1.0 2011-01-04
6 2011-01-04 08:01:04.700 5.0 2011-01-04
7 2011-01-04 08:01:04.880 5.0 2011-01-04
8 2011-01-04 08:01:05.090 61.0 2011-01-04
9 2011-01-04 08:01:07.140 4.0 2011-01-04
Here I have listed only the relevant columns and the first 10 rows of my data, which has more than 1 million rows. What I want is to resample at a 1-second frequency and sum the volume, as shown below:
loctimestamp volume
2011-01-04 08:01:04 2996.0
2011-01-04 08:01:05 28.0
2011-01-04 08:01:06 61.0
2011-01-04 08:01:07 0.0
2011-01-04 08:01:08 4.0
This result is achieved with the pandas.DataFrame.resample() function, which inserts the missing time steps between the first and last timestamp of each day, like the 2011-01-04 08:01:07 line above.
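For reference, the pandas call is presumably something along these lines (a sketch; the exact call is not shown here, and the names are the ones from the sample data):

# Resample each day's trades into 1-second bins and sum the volume;
# empty bins inside a day show up with volume 0
resampled = (df.set_index('loctimestamp')
               .groupby('Date')['volume']
               .resample('1S')
               .sum())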
My task now is to port the whole calculation from Pandas to PySpark. PySpark resamples the data more efficiently, but I can't find a way to efficiently insert the missing time steps the way pandas does.
What I have tried in Spark is:
# First find the first and last time step for each day. The data has been
# pre-resampled to seconds here: col('pre_resampled') is the unix time in
# seconds and fre is set to 1 second.
from pyspark.sql.functions import col
from pyspark.sql.types import IntegerType

fre = 1
max_time_for_each_date = (df.orderBy('Date')
                            .groupBy('Date')
                            .agg({'pre_resampled': 'max'})).collect()
min_time_for_each_date = (df.orderBy('Date')
                            .groupBy('Date')
                            .agg({'pre_resampled': 'min'})).collect()

# Create a DataFrame with all the missing time steps inserted. I also tried the
# spark DataFrame.union() function to build this DataFrame, which seems to be
# less efficient (I put that code here as well, commented out).
# ref = spark.range(min_time_for_each_date[0][1],
#                   max_time_for_each_date[0][1] + 1,
#                   fre
#                   ).toDF('pre_resampled')
# i = 1
# while i < len(max_time_for_each_date):
#     ref = ref.union(spark.range(min_time_for_each_date[i][1],
#                                 max_time_for_each_date[i][1] + 1,
#                                 fre
#                                 ).toDF('pre_resampled'))
#     i += 1

list1 = []
i = 0
while i < len(max_time_for_each_date):
    list1.extend(list(range(min_time_for_each_date[i][1],
                            max_time_for_each_date[i][1] + 1,
                            fre)))
    i += 1
ref = (spark.createDataFrame(list1, IntegerType())
            .withColumnRenamed('value', 'pre_resampled'))

# Combine the generated DataFrame ref with the main DataFrame
df = (ref
      .join(df, 'pre_resampled', 'left')
      .orderBy('pre_resampled', 'loctimestamp')
      .withColumn('resampled', col('pre_resampled').cast('timestamp')))
This code works, and the output is indeed what I need. The problem is that if fre is set to 1 second, it sometimes ends with a java.lang.OutOfMemoryError: Java heap space or a time limit exceeded error.
I also tried:
from pyspark.sql.functions import pandas_udf, PandasUDFType

schema = df.schema

@pandas_udf(schema, PandasUDFType.GROUPED_MAP)
def udf_resample(df, fre='S'):
    result = df.resample(fre, on='loctimestamp').sum()
    return result.reset_index()

df.groupBy('Date').apply(udf_resample).show()
This gives me AttributeError: 'tuple' object has no attribute 'resample'. I think the reason is that not all pandas functions are available inside a pandas_udf; am I correct?
What I am trying now is to increase spark.executor.memory or spark.driver.memory.
Any other ideas?
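One direction that avoids collecting the per-day min/max to the driver at all (a sketch, untested, assuming Spark 2.4+ where F.sequence is available): build the reference grid with sequence/explode so everything stays distributed.

from pyspark.sql import functions as F

# Per-day first and last second, kept as a DataFrame instead of collect()
bounds = (df.groupBy('Date')
            .agg(F.min('pre_resampled').alias('lo'),
                 F.max('pre_resampled').alias('hi')))

# One row per second between lo and hi for each day
ref = (bounds
       .withColumn('pre_resampled', F.explode(F.sequence('lo', 'hi', F.lit(1))))
       .select('Date', 'pre_resampled'))

resampled = (ref.join(df, ['Date', 'pre_resampled'], 'left')
                .withColumn('resampled', F.col('pre_resampled').cast('timestamp')))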
How do I resample a dataframe with a daily time-series index to yearly, but not from 1 Jan to 31 Dec? Instead I want the yearly sum from 1 June to 31 May.
First I did this, which gives me the yearly sum from 1 Jan to 31 Dec:
df.resample(rule='A').sum()
I have tried using the base parameter, but it does not change the resampled sums.
df.resample(rule='A', base=100).sum()
Here is a part of my dataframe:
In []: df
Out[]:
Index ET P R
2010-01-01 00:00:00 -0.013 0.0 0.773
2010-01-02 00:00:00 0.0737 0.21 0.797
2010-01-03 00:00:00 -0.048 0.0 0.926
...
In []: df.resample(rule='A', base = 0, label='left').sum()
Out []:
Index
2009-12-31 00:00:00 424.131138 871.48 541.677405
2010-12-31 00:00:00 405.625780 939.06 575.163096
2011-12-31 00:00:00 461.586365 1064.82 710.507947
...
I would really appreciate it if anyone could help me figure out how to do this.
Thank you
Use 'AS-JUN' as the rule with resample:
import pandas as pd

# Example data
idx = pd.date_range('2017-01-01', '2018-12-31')
s = pd.Series(1, idx)

# Resample
s = s.resample('AS-JUN').sum()
The resulting output:
2016-06-01 151
2017-06-01 365
2018-06-01 214
Freq: AS-JUN, dtype: int64
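Applied to the DataFrame in the question, that would simply be (each row is then labelled with the 1 June that starts its 12-month window, as in the output above):

df_yearly = df.resample('AS-JUN').sum()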
Is there any way to specify the sampling rate of the x-axis in Pandas, in particular when this axis contains datetime objects? For example:
df['created_dt'][0]
datetime.date(2014, 3, 24)
Ideally I would like to specify how many days (from beginning to end) to include in the plot, either by having Pandas sub-sample from my dataframe or by averaging every N days.
I think you can simply use groupby and cut to group the data into time intervals. In this example, the original dataframe has 10 days, and I group the days into 3 intervals (80 hours each). Then you can do whatever you want with each group, for example take the average:
In [21]:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.random((10, 3)))
df.index = pd.date_range('1/1/2011', periods=10, freq='D')
print(df)
0 1 2
2011-01-01 0.125353 0.661480 0.849405
2011-01-02 0.551803 0.558052 0.905813
2011-01-03 0.221589 0.070754 0.312004
2011-01-04 0.452728 0.513566 0.535502
2011-01-05 0.730282 0.163804 0.035454
2011-01-06 0.205623 0.194948 0.180352
2011-01-07 0.586136 0.578334 0.454175
2011-01-08 0.103438 0.765212 0.570750
2011-01-09 0.203350 0.778980 0.546947
2011-01-10 0.642401 0.525348 0.500244
[10 rows x 3 columns]
In [22]:
# Bin the DatetimeIndex into 3 equal-width intervals and average each bin
dfgb = df.groupby(pd.cut(df.index.values.astype(float), 3), as_index=False)
df_resample = dfgb.mean()
# Label each bin with the timestamp of its first row, then drop the unnamed bin column
df_resample.index = dfgb.head(1).index
del df_resample[None]
print(df_resample)
0 1 2
2011-01-01 0.337868 0.450963 0.650681
2011-01-05 0.507347 0.312362 0.223327
2011-01-08 0.316396 0.689847 0.539314
[3 rows x 3 columns]
In [23]:
import matplotlib.pyplot as plt

f = plt.figure()
ax0 = f.add_subplot(121)
ax1 = f.add_subplot(122)
_ = df.T.boxplot(ax=ax0)
_ = df_resample.T.boxplot(ax=ax1)
_ = [item.set_rotation(90) for item in ax0.get_xticklabels() + ax1.get_xticklabels()]
plt.tight_layout()
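If the goal is specifically to average every N days for plotting, a simpler route (not part of the original answer) is to resample on the DatetimeIndex, for example averaging every 3 days of the synthetic df above:

df_3d = df.resample('3D').mean()  # one row per 3-day bin, column-wise average
df_3d.plot()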
I have two sets of stock data in DataFrames:
> GOOG.head()
Open High Low
Date
2011-01-03 21.01 21.05 20.78
2011-01-04 21.12 21.20 21.05
2011-01-05 21.19 21.21 20.90
2011-01-06 20.67 20.82 20.55
2011-01-07 20.71 20.77 20.27
> AAPL.head()
Open High Low
Date
2011-01-03 596.48 605.59 596.48
2011-01-04 605.62 606.18 600.12
2011-01-05 600.07 610.33 600.05
2011-01-06 610.68 618.43 610.05
2011-01-07 615.91 618.25 610.13
and I would like to stack them next to each other in a single DataFrame so I can access and compare columns (e.g. High) across stocks (GOOG vs. AAPL). What is the best way to do this in Pandas, and how do I then access the individual columns (e.g. GOOG's High column and AAPL's High column)? Thanks!
pd.concat is also an option:
In [17]: pd.concat([GOOG, AAPL], keys=['GOOG', 'AAPL'], axis=1)
Out[17]:
GOOG AAPL
Open High Low Open High Low
Date
2011-01-03 21.01 21.05 20.78 596.48 605.59 596.48
2011-01-04 21.12 21.20 21.05 605.62 606.18 600.12
2011-01-05 21.19 21.21 20.90 600.07 610.33 600.05
2011-01-06 20.67 20.82 20.55 610.68 618.43 610.05
2011-01-07 20.71 20.77 20.27 615.91 618.25 610.13
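To then pull out the columns to compare (a small addition; combined is just a name for the concatenated frame):

import pandas as pd

combined = pd.concat([GOOG, AAPL], keys=['GOOG', 'AAPL'], axis=1)

# One stock's column
goog_high = combined[('GOOG', 'High')]

# The High column for every stock (cross-section on the second column level)
all_highs = combined.xs('High', axis=1, level=1)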
Have a look at the join method of DataFrames, and use the lsuffix and rsuffix arguments to create new names for the overlapping columns. It works like this:
>>> x
A B C
0 0.838119 -1.116730 0.167998
1 -1.143761 0.051970 0.216113
2 -0.614441 0.208978 -0.630988
3 0.114902 -0.248791 -0.503172
4 0.836523 -0.802074 1.478333
>>> y
A B C
0 -0.455859 -0.488645 -1.618088
1 -2.295255 0.524681 1.021320
2 -0.484612 1.101463 -0.081476
3 -0.475076 0.915797 -0.998777
4 -0.847538 0.057044 1.053533
>>> x.join(y, lsuffix="_x", rsuffix="_y")
A_x B_x C_x A_y B_y C_y
0 0.838119 -1.116730 0.167998 -0.455859 -0.488645 -1.618088
1 -1.143761 0.051970 0.216113 -2.295255 0.524681 1.021320
2 -0.614441 0.208978 -0.630988 -0.484612 1.101463 -0.081476
3 0.114902 -0.248791 -0.503172 -0.475076 0.915797 -0.998777
4 0.836523 -0.802074 1.478333 -0.847538 0.057044 1.053533
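Applied to the stock frames from the question (a sketch), the suffixed columns can then be compared directly:

joined = GOOG.join(AAPL, lsuffix='_GOOG', rsuffix='_AAPL')

# The two High columns side by side
highs = joined[['High_GOOG', 'High_AAPL']]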