I have two sets of stock data in DataFrames:
> GOOG.head()
Open High Low
Date
2011-01-03 21.01 21.05 20.78
2011-01-04 21.12 21.20 21.05
2011-01-05 21.19 21.21 20.90
2011-01-06 20.67 20.82 20.55
2011-01-07 20.71 20.77 20.27
AAPL.head()
Open High Low
Date
2011-01-03 596.48 605.59 596.48
2011-01-04 605.62 606.18 600.12
2011-01-05 600.07 610.33 600.05
2011-01-06 610.68 618.43 610.05
2011-01-07 615.91 618.25 610.13
and I would like to stack them next to each other in a single DataFrame so I can access and compare columns (e.g. High) across stocks (GOOG vs. AAPL). What is the best way to do this in Pandas, and how do I then access the individual columns (e.g. GOOG's High column and AAPL's High column)? Thanks!
pd.concat is also an option
In [17]: pd.concat([GOOG, AAPL], keys=['GOOG', 'AAPL'], axis=1)
Out[17]:
GOOG AAPL
Open High Low Open High Low
Date
2011-01-03 21.01 21.05 20.78 596.48 605.59 596.48
2011-01-04 21.12 21.20 21.05 605.62 606.18 600.12
2011-01-05 21.19 21.21 20.90 600.07 610.33 600.05
2011-01-06 20.67 20.82 20.55 610.68 618.43 610.05
2011-01-07 20.71 20.77 20.27 615.91 618.25 610.13
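The result has a MultiIndex on the columns, so the individual High columns can be pulled out directly. A minimal sketch (the variable name combined is mine, not from the question):
import pandas as pd

combined = pd.concat([GOOG, AAPL], keys=['GOOG', 'AAPL'], axis=1)

# one stock's High column, selected via the (stock, column) tuple
goog_high = combined[('GOOG', 'High')]
aapl_high = combined[('AAPL', 'High')]

# or the High column of every stock at once, via a cross-section on the inner level
highs = combined.xs('High', axis=1, level=1)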
Have a look at the join method of DataFrames, and use the lsuffix and rsuffix arguments to create new names for the joined columns. It works like this:
>>> x
A B C
0 0.838119 -1.116730 0.167998
1 -1.143761 0.051970 0.216113
2 -0.614441 0.208978 -0.630988
3 0.114902 -0.248791 -0.503172
4 0.836523 -0.802074 1.478333
>>> y
A B C
0 -0.455859 -0.488645 -1.618088
1 -2.295255 0.524681 1.021320
2 -0.484612 1.101463 -0.081476
3 -0.475076 0.915797 -0.998777
4 -0.847538 0.057044 1.053533
>>> x.join(y, lsuffix="_x", rsuffix="_y")
A_x B_x C_x A_y B_y C_y
0 0.838119 -1.116730 0.167998 -0.455859 -0.488645 -1.618088
1 -1.143761 0.051970 0.216113 -2.295255 0.524681 1.021320
2 -0.614441 0.208978 -0.630988 -0.484612 1.101463 -0.081476
3 0.114902 -0.248791 -0.503172 -0.475076 0.915797 -0.998777
4 0.836523 -0.802074 1.478333 -0.847538 0.057044 1.053533
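Applied to the stock frames from the question, a minimal sketch (assuming both frames share the same Date index) could look like this:
# the suffixes are appended to the overlapping column names of each frame
joined = GOOG.join(AAPL, lsuffix='_GOOG', rsuffix='_AAPL')

# compare the two High columns side by side
highs = joined[['High_GOOG', 'High_AAPL']]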
I found this data file on covid vaccinations, and I'd like to see the vaccination coverage in (parts of) the population. It'll probably become clearer with the actual example, so bear with me.
If I read the csv using df = pd.read_csv('https://epistat.sciensano.be/Data/COVID19BE_VACC.csv', parse_dates=['DATE']) I get this result:
DATE REGION AGEGROUP SEX BRAND DOSE COUNT
0 2020-12-28 Brussels 18-34 F Pfizer-BioNTech A 1
1 2020-12-28 Brussels 45-54 F Pfizer-BioNTech A 2
2 2020-12-28 Brussels 55-64 F Pfizer-BioNTech A 3
3 2020-12-28 Brussels 55-64 M Pfizer-BioNTech A 1
4 2020-12-28 Brussels 65-74 F Pfizer-BioNTech A 2
I'm particularly interested in the numbers by region & date.
So I grouped it using df.groupby(['REGION','DATE']).sum():
COUNT
REGION DATE
Brussels 2020-12-28 56
2020-12-30 5
2021-01-05 725
2021-01-06 989
2021-01-07 994
... ...
Wallonia 2021-06-18 49567
2021-06-19 43577
2021-06-20 2730
2021-06-21 37193
2021-06-22 16938
In order to compare vaccination 'speeds' in different regions I have to transform the data from absolute to relative numbers, using the population from each region.
I have found some posts explaining how to calculate percentages in a multi-index dataframe like this, but the problem is that I want to divide each COUNT by a population number that is not in the original dataframe.
The population numbers are here below
REGION POP
Flanders 6629143
Wallonia 3645243
Brussels 1218255
I think the solution must be in looping through the original df and checking both REGIONs or index levels, but I have absolutely no idea how. It's a technique I'd like to master, because it might come in handy when I want some other subsets with different populations (AGEGROUP or SEX maybe).
Thank you so much for reading this far!
Disclaimer: I've only just started out using Python, and this is my very first question on Stack Overflow, so please be gentle with me... The reason why I'm posting this is because I can't find an answer anywhere else. This is probably because I haven't got the terminology down and I don't exactly know what to look for ^_^
One option would be to reformat the population_df with set_index + rename:
population_df = pd.DataFrame({
'REGION': {0: 'Flanders', 1: 'Wallonia', 2: 'Brussels'},
'POP': {0: 6629143, 1: 3645243, 2: 1218255}
})
denom = population_df.set_index('REGION').rename(columns={'POP': 'COUNT'})
denom:
COUNT
REGION
Flanders 6629143
Wallonia 3645243
Brussels 1218255
Then use div on the result of the groupby sum, aligning on level=0 (the REGION level):
new_df = df.groupby(['REGION', 'DATE']).agg({'COUNT': 'sum'}).div(denom, level=0)
new_df:
COUNT
REGION DATE
Brussels 2020-12-28 0.000046
2020-12-30 0.000004
2021-01-05 0.000595
2021-01-06 0.000812
2021-01-07 0.000816
... ...
Wallonia 2021-06-18 0.013598
2021-06-19 0.011954
2021-06-20 0.000749
2021-06-21 0.010203
2021-06-22 0.004647
Or as a new column:
new_df = df.groupby(['REGION', 'DATE']).agg({'COUNT': 'sum'})
new_df['NEW'] = new_df.div(denom, level=0)
new_df:
COUNT NEW
REGION DATE
Brussels 2020-12-28 56 0.000046
2020-12-30 5 0.000004
2021-01-05 725 0.000595
2021-01-06 989 0.000812
2021-01-07 994 0.000816
... ... ...
Wallonia 2021-06-18 49567 0.013598
2021-06-19 43577 0.011954
2021-06-20 2730 0.000749
2021-06-21 37193 0.010203
2021-06-22 16938 0.004647
You could run reset_index() on the groupby result and then use df.apply with a custom function that does the calculations:
import pandas as pd
df = pd.read_csv('https://epistat.sciensano.be/Data/COVID19BE_VACC.csv', parse_dates=['DATE'])
df = df.groupby(['REGION','DATE']).sum().reset_index()
def calculate(row):
    if row['REGION'] == 'Flanders':
        return row['COUNT'] / 6629143
    elif row['REGION'] == 'Wallonia':
        return row['COUNT'] / 3645243
    elif row['REGION'] == 'Brussels':
        return row['COUNT'] / 1218255
df['REL_COUNT'] = df.apply(calculate, axis=1) #axis=1 takes the rows as input, axis=0 would run on columns
Output df.head():
     REGION                 DATE  COUNT  REL_COUNT
0  Brussels  2020-12-28 00:00:00     56   0.000046
1  Brussels  2020-12-30 00:00:00      5   0.000004
2  Brussels  2021-01-05 00:00:00    725   0.000595
3  Brussels  2021-01-06 00:00:00    989   0.000812
4  Brussels  2021-01-07 00:00:00    994   0.000816
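A variant of the same idea, as a sketch, avoids hardcoding the populations inside a function by mapping REGION onto a dictionary (the pop_map name is mine; the numbers are the ones given in the question):
# population per region
pop_map = {'Flanders': 6629143, 'Wallonia': 3645243, 'Brussels': 1218255}

# divide each COUNT by the population of its own region
df['REL_COUNT'] = df['COUNT'] / df['REGION'].map(pop_map)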
Is there a way to calculate a rolling mean on a descending time series without sorting it into an ascending one?
Original time series, with the same timestamp order as in the CSV file:
pd.read_csv(data_dir+items+extension, parse_dates=True, index_col='timestamp').sort_index(ascending=False)
timestamp open
2021-05-06 90.000
2021-05-05 93.600
2021-05-04 90.840
2021-05-03 91.700
2021-04-30 91.355
Rolling mean
stock_dict[items]["SMA100"]=pd.Series(stock_dict[items]["close"]).rolling(window=100).mean()
ascending = False
open high low close volume SMA100
timestamp
2021-05-06 90.000 93.5200 89.64 93.03 8024053 NaN
2021-05-05 93.600 94.7700 90.00 90.08 13079308 NaN
2021-05-04 90.840 90.9700 87.44 88.69 15147509 NaN
2021-05-03 91.700 92.0200 90.79 91.15 6641764 NaN
2021-04-30 91.355 91.9868 90.89 91.19 6614347 NaN
... ... ... ... ... ... ...
1999-11-05 14.560 15.5000 14.50 15.38 1308267 14.9245
1999-11-04 14.690 14.7500 14.25 14.62 207033 14.9395
1999-11-03 14.310 14.5000 14.12 14.50 61600 14.9526
1999-11-02 14.250 15.0000 14.16 14.25 128817 14.9639
1999-11-01 14.190 14.3800 13.94 14.06 173233 14.9682
ascending = True
open high low close volume SMA100
timestamp
1999-11-01 14.190 14.3800 13.94 14.06 173233 NaN
1999-11-02 14.250 15.0000 14.16 14.25 128817 NaN
1999-11-03 14.310 14.5000 14.12 14.50 61600 NaN
1999-11-04 14.690 14.7500 14.25 14.62 207033 NaN
1999-11-05 14.560 15.5000 14.50 15.38 1308267 NaN
... ... ... ... ... ... ...
2021-04-30 91.355 91.9868 90.89 91.19 6614347 93.1148
2021-05-03 91.700 92.0200 90.79 91.15 6641764 93.2036
2021-05-04 90.840 90.9700 87.44 88.69 15147509 93.2542
2021-05-05 93.600 94.7700 90.00 90.08 13079308 93.3292
2021-05-06 90.000 93.5200 89.64 93.03 8024053 93.4284
As the time series runs from 1999 to 2021, the rolling mean is only correct in the ascending = True case.
So either I have to change the sorting of the data, which I would like to avoid, or I have to somehow tell the rolling mean function to start with the last entry and calculate backwards.
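One sketch of that second idea, assuming the descending frame shown above: reverse the series only for the computation, then reverse the result back so it lines up with the original descending index.
# rolling mean computed oldest-to-newest, then flipped back to the descending order
sma = stock_dict[items]["close"][::-1].rolling(window=100).mean()[::-1]
stock_dict[items]["SMA100"] = sma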
I am just learning programming. What I am doing now is resampling data at datetime frequencies, as shown below.
loctimestamp volume Date
0 2011-01-04 08:01:03.820 2994.0 2011-01-04
1 2011-01-04 08:01:03.940 2.0 2011-01-04
2 2011-01-04 08:01:04.240 4.0 2011-01-04
3 2011-01-04 08:01:04.560 12.0 2011-01-04
4 2011-01-04 08:01:04.580 1.0 2011-01-04
5 2011-01-04 08:01:04.690 1.0 2011-01-04
6 2011-01-04 08:01:04.700 5.0 2011-01-04
7 2011-01-04 08:01:04.880 5.0 2011-01-04
8 2011-01-04 08:01:05.090 61.0 2011-01-04
9 2011-01-04 08:01:07.140 4.0 2011-01-04
Here I listed only the first 2 columns and 10 rows of my whole data, which has more than 1 million lines. What I expect is to resample at a 1-second frequency and sum the volume, as shown below:
loctimestamp volume
2011-01-04 08:01:04 2996.0
2011-01-04 08:01:05 28.0
2011-01-04 08:01:06 61.0
2011-01-04 08:01:07 0.0
2011-01-04 08:01:08 4.0
This result is achieved using the pandas.DataFrame.resample() function, which also inserts the missing periods between the first and last timestamp of each day, like the 2011-01-04 08:01:07 line above.
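For reference, a sketch of what that pandas call might look like for the frame above (the closed/label settings are my guess at how the bins in the expected output were labelled):
resampled = df.resample('1S', on='loctimestamp', closed='right', label='right')['volume'].sum()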
My task now is to move the whole calculation from pandas to PySpark. PySpark resamples the data more efficiently, but I can't find a way to efficiently insert the missing periods the way pandas does.
What I have tried in Spark for this is:
# first find the first and last frequency for each day; the data has been
# pre-resampled here to seconds, col("pre_resampled") is unix time in seconds
# and fre is set to 1s
fre = 1
max_time_for_each_date = (df.orderBy('Date')
                            .groupBy('Date')
                            .agg({'pre_resampled': "max"})).collect()
min_time_for_each_date = (df.orderBy('Date')
                            .groupBy('Date')
                            .agg({'pre_resampled': "min"})).collect()
#create a dataframe with all the frequencies inserted, I tried the
#spark.DataFrame.union() function to create this dataframe which seems to be
#less efficient(I also put the code here).
#ref = spark.range(min_time_for_each_date[0][1],
# max_time_for_each_date[0][1]+ 1,
# fre
# ).toDF('pre_resampled')
#i=1
#while i < len(max_time_for_each_date):
# ref = ref.union(spark.range(min_time_for_each_date[i][1],
# max_time_for_each_date[i][1]+ 1,
# fre
# ).toDF('pre_resampled'))
# i += 1
list1 = []
i = 0
while i < len(max_time_for_each_date):
    list1.extend(list(range(min_time_for_each_date[i][1],
                            max_time_for_each_date[i][1] + 1,
                            fre)))
    i += 1
ref = (spark.createDataFrame(list1, IntegerType())
            .withColumnRenamed('value', 'pre_resampled'))
# combine the created DataFrame ref and the main DataFrame
df = (ref
      .join(df, 'pre_resampled', 'left')
      .orderBy('pre_resampled', 'loctimestamp')
      .withColumn('resampled', col('pre_resampled').cast('timestamp')))
This code works, and the output is in fact what I need. The problem is that if fre is set to 1 second, it sometimes ends up with a java.lang.OutOfMemoryError: Java heap space or a time limit exceeded.
I also tried:
from pyspark.sql.functions import pandas_udf, PandasUDFType

schema = df.schema

@pandas_udf(schema, PandasUDFType.GROUPED_MAP)
def udf_resample(df, fre='S'):
    result = df.resample(fre, on='loctimestamp').sum()
    return df.reset_index()

df.groupBy('Date').apply(udf_resample).show()
This shows me AttributeError: 'tuple' object has no attribute 'resample'. I think the reason is that not all pandas functions are available inside a pandas_udf; am I correct?
What I am trying now is to increase spark.executor.memory or spark.driver.memory.
Any other idea?
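One more idea, purely as a sketch (assuming Spark 2.4+, where F.sequence is available, and keeping the fre and df names from the code above): build the per-day range of seconds on the executors with sequence/explode instead of collecting the min/max values to the driver and building the range in Python.
from pyspark.sql import functions as F

# per-day first and last second (still in unix-seconds form, like pre_resampled)
bounds = df.groupBy('Date').agg(F.min('pre_resampled').alias('start'),
                                F.max('pre_resampled').alias('stop'))

# one row per second between start and stop, generated inside Spark
ref = bounds.select('Date',
                    F.explode(F.sequence(F.col('start'), F.col('stop'), F.lit(fre))).alias('pre_resampled'))

# left-join the original data onto the dense grid and cast back to a timestamp
df_dense = (ref.join(df, ['Date', 'pre_resampled'], 'left')
               .withColumn('resampled', F.col('pre_resampled').cast('timestamp')))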
I have time series data stored in a pandas DataFrame, which looks like this:
Date Open High Low Close Volume
0 2016-01-19 22.86 22.92 22.36 22.60 838024
1 2016-01-20 22.19 22.98 21.87 22.77 796745
2 2016-01-21 22.75 23.10 22.62 22.76 573068
3 2016-01-22 23.13 23.35 22.96 23.33 586967
4 2016-01-25 23.22 23.42 23.01 23.26 645551
5 2016-01-26 23.28 23.85 23.22 23.74 592658
6 2016-01-27 23.68 23.78 18.76 20.09 5351850
7 2016-01-28 20.05 20.69 19.11 19.37 2255635
8 2016-01-29 19.51 20.02 19.40 19.90 1203969
9 2016-02-01 19.77 19.80 19.13 19.14 1203375
I want to create a function to apply, which gets a slice of the original dataset that it can aggregate with any custom-defined aggregation operator.
Let's say the function is applied like this:
aggregated_df = data.apply(calculateMySpecificAggregation, axis=1)
where calculateMySpecificAggregation gets a 3-row slice of the original dataframe for each row of the original dataframe.
For each row, the function's parameter dataframe contains the previous and the next rows of the original dataframe.
# pseudocode example
def calculateMySpecificAggregation(df_slice):
    # I want to know which row this function was applied on (an index I would like to have here)
    ri = ???  # index of the row this function was applied on
    # df_slice contains 3 rows and all columns
    return float(df_slice["Close"][ri - 1] +
                 ((df_slice["High"][ri] + df_slice["Low"][ri]) / 2) +
                 df_slice["Open"][ri + 1])
    # this line will fail on the borders, but don't worry, I will handle that later...
I want the sliding window size to be parametrized, to have access to the other columns of the row, and to know the row index of the original line the function was applied on.
That means, in the case of slidingWindow = 3, I want the parameter dataframes to be:
#parameter dataframe when the function is applied on row[0]:
Date Open High Low Close Volume
0 2016-01-19 22.86 22.92 22.36 22.60 838024
1 2016-01-20 22.19 22.98 21.87 22.77 796745
#parameter dataframe when the function is applied on row[1]:
Date Open High Low Close Volume
0 2016-01-19 22.86 22.92 22.36 22.60 838024
1 2016-01-20 22.19 22.98 21.87 22.77 796745
2 2016-01-21 22.75 23.10 22.62 22.76 573068
#parameter dataframe when the function is applied on row[2]:
Date Open High Low Close Volume
1 2016-01-20 22.19 22.98 21.87 22.77 796745
2 2016-01-21 22.75 23.10 22.62 22.76 573068
3 2016-01-22 23.13 23.35 22.96 23.33 586967
#parameter dataframe when the function is applied on row[3]:
Date Open High Low Close Volume
2 2016-01-21 22.75 23.10 22.62 22.76 573068
3 2016-01-22 23.13 23.35 22.96 23.33 586967
4 2016-01-25 23.22 23.42 23.01 23.26 645551
...
#parameter dataframe when the function is applied on row[7]:
Date Open High Low Close Volume
6 2016-01-27 23.68 23.78 18.76 20.09 5351850
7 2016-01-28 20.05 20.69 19.11 19.37 2255635
8 2016-01-29 19.51 20.02 19.40 19.90 1203969
#parameter dataframe when the function is applied on row[8]:
Date Open High Low Close Volume
7 2016-01-28 20.05 20.69 19.11 19.37 2255635
8 2016-01-29 19.51 20.02 19.40 19.90 1203969
9 2016-02-01 19.77 19.80 19.13 19.14 1203375
#parameter dataframe when the function is applied on row[9]:
Date Open High Low Close Volume
8 2016-01-29 19.51 20.02 19.40 19.90 1203969
9 2016-02-01 19.77 19.80 19.13 19.14 1203375
I don't want to use a loop combined with iloc indexing if possible.
I've experimented with pandas.DataFrame.rolling and pandas.rolling_apply with no success.
Does anyone know how to solve this problem?
OK, after much suffering I've solved the problem.
I couldn't avoid iloc (which is not a big problem in this case), but at least no explicit loop is used here.
contextSizeLeft = 2
contextSizeRight = 3

def aggregateWithContext(df, row, func, contextSizeLeft, contextSizeRight):
    leftBorder = max(0, row.name - contextSizeLeft)
    rightBorder = min(len(df), row.name + contextSizeRight) + 1
    '''
    print("pos: ", row.name, \
          "\t", (row.name - contextSizeLeft, row.name + contextSizeRight), \
          "\t", (leftBorder, rightBorder), \
          "\t", len(df.loc[:][leftBorder : rightBorder]))
    '''
    return func(df.iloc[:][leftBorder : rightBorder], row.name)

def aggregate(df, center):
    print()
    print("center", center)
    print(df["Date"])
    return len(df)

df.apply(lambda x: aggregateWithContext(df, x, aggregate, contextSizeLeft, contextSizeRight), axis=1)
And the same for dates, if anyone needs it:
from datetime import timedelta

def aggregateWithContext(df, row, func, timedeltaLeft, timedeltaRight):
    dateInRecord = row["Date"]
    leftBorder = pd.to_datetime(dateInRecord - timedeltaLeft)
    rightBorder = pd.to_datetime(dateInRecord + timedeltaRight)
    dfs = df[(df['Date'] >= leftBorder) & (df['Date'] <= rightBorder)]
    # print(dateInRecord, ":\t", leftBorder, "\t", rightBorder, "\t", len(dfs))
    return func(dfs, row.name)

def aggregate(df, center):
    # print()
    # print("center", center)
    # print(df["Date"])
    return len(df)

timedeltaLeft = timedelta(days=2)
timedeltaRight = timedelta(days=2)

df.apply(lambda x: aggregateWithContext(df, x, aggregate, timedeltaLeft, timedeltaRight), axis=1)
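As an aside, for the specific aggregation sketched in the question (previous Close + the High/Low midpoint + next Open), a fully vectorized sketch with shift would also work, assuming the default integer index (the column name agg is mine):
df['agg'] = df['Close'].shift(1) + (df['High'] + df['Low']) / 2 + df['Open'].shift(-1)
# the first and last rows come out as NaN, matching the border cases mentioned above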
I have a sample time series data (stock) as below:
Date PX_OPEN PX_LAST
Date
2011-01-03 2011-01-03 31.18 31.26
2011-01-04 2011-01-04 31.42 31.02
2011-01-05 2011-01-05 31.10 30.54
2011-01-06 2011-01-06 30.66 30.54
2011-01-07 2011-01-07 31.50 30.66
2011-01-10 2011-01-10 30.82 30.94
I would like to add a new column GAP based on the following conditions:
If current day open is higher than previous day last, then GAP = up.
If current day open is lower than previous day last, then GAP = down.
Otherwise, GAP = unch. (Alternatively, up can be changed to +1, down to -1, and unch to 0.)
I can do this with if statements and a for loop, but that would defeat the efficiency of vectorized operations in Pandas. Can anyone help?
Use nested np.where calls:
import numpy as np
df['GAP'] = np.where(df['PX_OPEN'] > df['PX_LAST'].shift(), 'up',
                     np.where(df['PX_OPEN'] < df['PX_LAST'].shift(), 'down', 'unch'))
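For the numeric variant (+1 / -1 / 0) mentioned in the question, a sketch using np.sign on the gap itself:
df['GAP'] = np.sign(df['PX_OPEN'] - df['PX_LAST'].shift()).fillna(0)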