There is a Pandas DataFrame with some stock data. The SMA columns are simple moving averages calculated over the previous 45 and 15 days.
Date Price SMA_45 SMA_15
20150127 102.75 113 106
20150128 103.05 100 106
20150129 105.10 112 105
20150130 105.35 111 105
20150202 107.15 111 105
20150203 111.95 110 105
20150204 111.90 110 106
I want to find all dates on which SMA_15 and SMA_45 intersect.
Can it be done efficiently using Pandas or Numpy? How?
EDIT:
What I mean by 'intersection':
The data row where either:
the long SMA(45) value had been bigger than the short SMA(15) value for longer than the short SMA period (15 days) and then became smaller, or
the long SMA(45) value had been smaller than the short SMA(15) value for longer than the short SMA period (15 days) and then became bigger.
I'm taking a crossover to mean a point where the SMA lines, viewed as functions of time, intersect, as depicted on this Investopedia page.
Since the SMAs represent continuous functions, there is a crossing when,
for a given row, (SMA_15 is less than SMA_45) and (the previous SMA_15 is
greater than the previous SMA_45) -- or vice versa.
In code, that could be expressed as
previous_15 = df['SMA_15'].shift(1)
previous_45 = df['SMA_45'].shift(1)
crossing = (((df['SMA_15'] <= df['SMA_45']) & (previous_15 >= previous_45))
            | ((df['SMA_15'] >= df['SMA_45']) & (previous_15 <= previous_45)))
Putting this together and running it on your sample data (which already contains crossings):
import pandas as pd
df = pd.read_table('data', sep=r'\s+')
previous_15 = df['SMA_15'].shift(1)
previous_45 = df['SMA_45'].shift(1)
crossing = (((df['SMA_15'] <= df['SMA_45']) & (previous_15 >= previous_45))
            | ((df['SMA_15'] >= df['SMA_45']) & (previous_15 <= previous_45)))
crossing_dates = df.loc[crossing, 'Date']
print(crossing_dates)
yields
1 20150128
2 20150129
Name: Date, dtype: int64
The following method gives similar results, but takes less time than the previous approach:
import numpy as np

df['position'] = df['SMA_15'] > df['SMA_45']  # True where the short SMA is above the long SMA
df['pre_position'] = df['position'].shift(1)  # the same flag for the previous row
df.dropna(inplace=True)                       # drop the first row, where the shifted value is NaN
df['crossover'] = np.where(df['position'] == df['pre_position'], False, True)  # True where the relative order flipped
Time taken for this approach: 2.7 ms ± 310 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Time taken for previous approach: 3.46 ms ± 307 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
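For completeness, the crossover dates can then be pulled out of this result the same way as before; a minimal follow-up sketch assuming the df built above:
crossing_dates = df.loc[df['crossover'], 'Date']
print(crossing_dates)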
As an alternative to unutbu's answer, something like the following can also be done to find the indices where SMA_15 crosses SMA_45.
diff = (df['SMA_15'] < df['SMA_45']).astype(int)  # 1 where the short SMA is below the long SMA, else 0
diff_forward = diff.shift(1)                      # the same flag for the previous row
crossing = np.where(abs(diff - diff_forward) == 1)[0]
print(crossing)
>>> [1 2]
print(df.iloc[crossing])
>>>
Date Price SMA_45 SMA_15
1 20150128 103.05 100 106
2 20150129 105.10 112 105
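Since crossing holds positional indices rather than a boolean mask, the corresponding dates can be recovered with iloc; a small sketch assuming Date is still an ordinary column:
crossing_dates = df['Date'].iloc[crossing]
print(crossing_dates)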
I have the following data set:
          CustomerID        Date  Amount  Department DaysSincePurchase
0             395134  2019-01-01     199        Home          986 days
1             395134  2019-01-01     279        Home          986 days
2            1356012  2019-01-07     279        Home          980 days
3            1921374  2019-01-08     269        Home          979 days
4             395134  2019-01-01     279        Home          986 days
...              ...         ...     ...         ...               ...
18926474     1667426  2021-06-30     349  Womenswear           75 days
18926475     1667426  2021-06-30     299  Womenswear           75 days
18926476      583105  2021-06-30     349  Womenswear           75 days
18926477      538137  2021-06-30     279  Womenswear           75 days
18926478      825382  2021-06-30    2499        Home           75 days
I want to do some feature engineering and add a few columns after aggregating (using groupby) by CustomerID. The Date column is unimportant and can easily be dropped. I want a data set where every row is one unique CustomerID (these are just integers 1, 2, ...) as the first column, and where the other columns are:
Total amount of purchasing
Days since the last purchase
Number of total departments
This is what I've done, and it works. However, when I time it, it takes about 1.5 hours. Is there another, more efficient way of doing this?
customer_group = joinedData.groupby(['CustomerID'])
n = originalData['CustomerID'].nunique()
# First arrange the data in a matrix.
matrix = np.zeros((n,5)) # Pre-allocate matrix
for i in range(0, n):
    matrix[i,0] = i+1
    matrix[i,1] = sum(customer_group.get_group(i+1)['Amount'])
    matrix[i,2] = min(customer_group.get_group(i+1)['DaysSincePurchase']).days
    matrix[i,3] = customer_group.get_group(i+1)['Department'].nunique()
# The above loop takes 6300 sec approx
# convert matrix to dataframe and name columns
newData = pd.DataFrame(matrix)
newData = newData.rename(columns = {0:"CustomerID"})
newData = newData.rename(columns = {1:"TotalDemand"})
newData = newData.rename(columns = {2:"DaysSinceLastPurchase"})
newData = newData.rename(columns = {3:"nrDepartments"})
Use agg:
>>> df.groupby('CustomerID').agg(TotalDemand=('Amount', sum),
                                 DaysSinceLastPurchase=('DaysSincePurchase', min),
                                 nrDepartments=('Department', 'nunique'))
I ran this function over a dataframe of 20,000,000 records. It took a few seconds to execute:
>>> %timeit df.groupby('CustomerID').agg(...)
14.7 s ± 225 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Generated data:
import numpy as np
import pandas as pd

N = 20000000
df = pd.DataFrame(
    {'CustomerID': np.random.randint(1000, 10000, N),
     'Date': np.random.choice(pd.date_range('2020-01-01', '2020-12-31'), N),
     'Amount': np.random.randint(100, 1000, N),
     'Department': np.random.choice(['Home', 'Sport', 'Food', 'Womenswear',
                                     'Menswear', 'Furniture'], N)})
df['DaysSincePurchase'] = pd.Timestamp.today().normalize() - df['Date']
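If CustomerID should end up as an ordinary column, as in the loop-based version, a reset_index on the aggregated result gives that shape; a minimal sketch built on the agg call above:
newData = (df.groupby('CustomerID')
             .agg(TotalDemand=('Amount', 'sum'),
                  DaysSinceLastPurchase=('DaysSincePurchase', 'min'),
                  nrDepartments=('Department', 'nunique'))
             .reset_index())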
I have a dataframe like the one shown below (run the full code below):
df1 = pd.DataFrame({'person_id': [11, 21, 31, 41, 51],
                    'date_birth': ['05/29/1967', '01/21/1957', '7/27/1959', '01/01/1961', '12/31/1961']})
df1['birth_dates'] = pd.to_datetime(df1['date_birth'])
df_ranges = df1.assign(until_prev_year_days=(df1['birth_dates'].dt.dayofyear - 1),
                       until_next_year_days=((df1['birth_dates'] + pd.offsets.YearEnd(0)) - df1['birth_dates']).dt.days)
f = {'until_prev_year_days': 'min', 'until_next_year_days': 'min'}
min_days = df_ranges.groupby('person_id', as_index=False).agg(f)
min_days.columns = ['person_id', 'no_days_to_prev_year', 'no_days_to_next_year']
df_offset = pd.merge(df_ranges[['person_id', 'birth_dates']], min_days, on='person_id', how='inner')
Below is what I tried to get the range:
df_offset['range_to_shift'] = "[" + (-1 * df_offset['no_days_to_prev_year']).map(str) + "," + df_offset['no_days_to_next_year'].map(str) + "]"
Though my approach works, I would like to know whether there is a better, more elegant way to do the same.
Please note that for the values from no_days_to_prev_year, we have to prefix a minus sign.
I expect my output to look like the example shown below.
Use DataFrame.mul along with DataFrame.to_numpy:
cols = ['no_days_to_prev_year', 'no_days_to_next_year']
df_offset['range_to_shift'] = df_offset[cols].mul([-1, 1]).to_numpy().tolist()
Result:
# print(df_offset)
person_id birth_dates no_days_to_prev_year no_days_to_next_year range_to_shift
0 11 1967-05-29 148 216 [-148, 216]
1 21 1957-01-21 20 344 [-20, 344]
2 31 1959-07-27 207 157 [-207, 157]
3 41 1961-01-01 0 364 [0, 364]
4 51 1961-12-31 364 0 [-364, 0]
timeit performance results:
df_offset.shape
(50000, 5)
%%timeit -n100
cols = ['no_days_to_prev_year', 'no_days_to_next_year']
df_offset['range_to_shift'] = df_offset[cols].mul([-1, 1]).to_numpy().tolist()
15.5 ms ± 464 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
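If the bracketed string form from the question is still needed rather than a real list, the list column can be mapped to strings afterwards; a small sketch (the range_to_shift_str name is just illustrative):
df_offset['range_to_shift_str'] = df_offset['range_to_shift'].map(lambda r: f"[{r[0]},{r[1]}]")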
IIUC, you can use zip to create your list of ranges:
df = pd.DataFrame({'person_id': [11, 21, 31, 41, 51],
                   'date_birth': ['05/29/1967', '01/21/1957', '7/27/1959', '01/01/1961', '12/31/1961']})
df['date_birth'] = pd.to_datetime(df['date_birth'], format="%m/%d/%Y")
df["day_to_prev"] = df['date_birth'].dt.dayofyear - 1
df["day_to_next"] = (pd.offsets.YearEnd(0) + df['date_birth'] - df["date_birth"]).dt.days
df["range_to_shift"] = [[-x, y] for x, y in zip(df["day_to_prev"], df["day_to_next"])]
print (df)
person_id date_birth day_to_prev day_to_next range_to_shift
0 11 1967-05-29 148 216 [-148, 216]
1 21 1957-01-21 20 344 [-20, 344]
2 31 1959-07-27 207 157 [-207, 157]
3 41 1961-01-01 0 364 [0, 364]
4 51 1961-12-31 364 0 [-364, 0]
I have the following (sample) dataframe:
Age height weight haircolor
joe 35 5.5 145 brown
mary 26 5.25 110 blonde
pete 44 6.02 185 red
....
There are no duplicate values in the index.
I am in the unenviable position of having to append to this dataframe using elements from a number of other dataframes. So I'm appending as follows:
names_df = names_df.append({'Age': someage,
                            'height': someheight,
                            'weight': someweight,
                            'haircolor': somehaircolor},
                           ignore_index=True)
My question is using this method how do I set the new index value in names_df equal to the person's name?
The only thing I can think of is to reset the df index before I append and then re-set it afterward. Ugly. Has to be a better way.
Thanks in advance.
I am not sure in what format you are getting the data that you are appending to the original df, but one way is as follows:
df.loc['new_name', :] = ['someage', 'someheight', 'someweight', 'somehaircolor']
Age height weight haircolor
joe 35 5.5 145 brown
mary 26 5.25 110 blonde
pete 44 6.02 185 red
new_name someage someheight someweight somehaircolor
Time Testing:
%timeit df.loc['new_name', :] = ['someage', 'someheight', 'someweight', 'somehaircolor']
1000 loops, best of 3: 408 µs per loop
%timeit df.append(pd.DataFrame({'Age': 'someage', 'height': 'someheight','weight':'someweight','haircolor': 'somehaircolor'}, index=['some_person']))
100 loops, best of 3: 2.59 ms per loop
Here's another way using append. Instead of passing a dictionary, pass a dataframe (created from a dictionary) while specifying the index:
names_df = names_df.append(pd.DataFrame({'Age': 'someage',
                                         'height': 'someheight',
                                         'weight': 'someweight',
                                         'haircolor': 'somehaircolor'}, index=['some_person']))
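Note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so on current versions the same idea can be written with pd.concat; a sketch reusing the placeholder values above:
new_row = pd.DataFrame({'Age': 'someage',
                        'height': 'someheight',
                        'weight': 'someweight',
                        'haircolor': 'somehaircolor'}, index=['some_person'])
names_df = pd.concat([names_df, new_row])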
I have data like this
location sales store
0 68 583 17
1 28 857 2
2 55 190 59
3 98 517 64
4 94 892 79
...
For each unique (location, store) pair there are one or more sales. I want to add a column, pcnt_sales, that shows what percent of the total sales for that (location, store) pair was made up by the sale in the given row.
location sales store pcnt_sales
0 68 583 17 0.254363
1 28 857 2 0.346543
2 55 190 59 1.000000
3 98 517 64 0.272105
4 94 892 79 1.000000
...
This works, but is slow
import pandas as pd
import numpy as np
df = pd.DataFrame({'location':np.random.randint(0, 100, 10000), 'store':np.random.randint(0, 100, 10000), 'sales': np.random.randint(0, 1000, 10000)})
import timeit
start_time = timeit.default_timer()
df['pcnt_sales'] = df.groupby(['location', 'store'])['sales'].apply(lambda x: x/x.sum())
print(timeit.default_timer() - start_time) # 1.46 seconds
By comparison, R's data.table does this super fast
library(data.table)
dt <- data.table(location=sample(100, size=10000, replace=TRUE), store=sample(100, size=10000, replace=TRUE), sales=sample(1000, size=10000, replace=TRUE))
ptm <- proc.time()
dt[, pcnt_sales:=sales/sum(sales), by=c("location", "store")]
proc.time() - ptm # 0.007 seconds
How do I do this efficiently in Pandas (especially considering my real dataset has millions of rows)?
For performance you want to avoid apply. You could use transform to get the result of the groupby expanded to the original index instead, at which point a division would work at vectorized speed:
>>> %timeit df['pcnt_sales'] = df.groupby(['location', 'store'])['sales'].apply(lambda x: x/x.sum())
1 loop, best of 3: 2.27 s per loop
>>> %timeit df['pcnt_sales2'] = (df["sales"] /
                                 df.groupby(['location', 'store'])['sales'].transform(sum))
100 loops, best of 3: 6.25 ms per loop
>>> df["pcnt_sales"].equals(df["pcnt_sales2"])
True
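On newer pandas versions, passing the string alias 'sum' rather than the Python builtin keeps transform on the cythonized fast path (and sidesteps the deprecation warning recent releases emit for builtin callables); a minimal variant of the same idea:
df['pcnt_sales2'] = df['sales'] / df.groupby(['location', 'store'])['sales'].transform('sum')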
I am trying to perform a task which is conceptually simple, but my code seems to be way too expensive. I am looking for a faster way, potentially utilizing pandas' built-in functions for GroupBy objects.
The starting point is a DataFrame called prices, with columns=['item', 'store', 'day', 'price'], in which each observation is the most recent price update for an item-store combination. The problem is that some price updates are the same as the previous price update for the same item-store combination. For example, let us look at a particular piece:
day item_id store_id price
35083 34 85376 211 5.95
56157 41 85376 211 6.00
63628 50 85376 211 5.95
64955 51 85376 211 6.00
66386 56 85376 211 6.00
69477 69 85376 211 5.95
In this example I would like the observation where day equals 56 to be dropped (because price is the same as the last observation in this group). My code is:
def removeSameLast(df):
    shp = df.shape[0]
    lead = df['price'][1:shp]
    lag = df['price'][:shp-1]
    diff = np.array(lead != lag)
    boo = np.array(1)
    boo = np.append(boo, diff)
    boo = boo.astype(bool)
    df = df.loc[boo]
    return df
gCell = prices.groupby(['item_id', 'store_id'])
prices = gCell.apply(removeSameLast)
This does the job, but is ugly and slow. Sorry for being a noob, but I assume that this can be done much faster. Could someone please propose a solution? Many thanks in advance.
I would suggest going for a simple solution using the shift function from Pandas. This removes the need for the groupby and your function call.
The idea is to see where the Series [5.95, 6, 5.95, 6, 6, 5.95] is equal to the shifted one, [nan, 5.95, 6, 5.95, 6, 6], and to delete (or simply not select) the rows where this happens.
>>> mask = ~np.isclose(prices['price'], prices['price'].shift())
>>> prices[mask]
day item_id store_id price
35083 34 85376 211 5.95
56157 41 85376 211 6.00
63628 50 85376 211 5.95
64955 51 85376 211 6.00
69477 69 85376 211 5.95
Simple benchmark:
%timeit prices = gCell.apply(removeSameLast)
100 loops, best of 3: 4.46 ms per loop
%timeit mask = prices.price != prices.price.shift()
1000 loops, best of 3: 183 µs per loop
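One caveat with the plain shift: it compares each row with the immediately preceding row even across item-store boundaries. If that matters for your data, the shift can be done per group so a price is only compared with the previous price of the same pair; a sketch assuming the item_id and store_id columns shown above:
prev_price = prices.groupby(['item_id', 'store_id'])['price'].shift()
mask = ~np.isclose(prices['price'], prev_price)  # NaN from shift is never close, so each group's first row is kept
prices = prices[mask]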