I am trying to perform a task which is conceptually simple, but my code seems to be way too expensive. I am looking for a faster way, potentially utilizing pandas' built-in functions for GroupBy objects.
The starting point is a DataFrame called prices, with columns=['item_id', 'store_id', 'day', 'price'], in which each observation is the most recent price update for an item-store combination. The problem is that some price updates are the same as the previous price update for the same item-store combination. For example, let us look at a particular slice:
day item_id store_id price
35083 34 85376 211 5.95
56157 41 85376 211 6.00
63628 50 85376 211 5.95
64955 51 85376 211 6.00
66386 56 85376 211 6.00
69477 69 85376 211 5.95
In this example I would like the observation where day equals 56 to be dropped (because price is the same as the last observation in this group). My code is:
def removeSameLast(df):
    shp = df.shape[0]
    lead = df['price'][1:shp]
    lag = df['price'][:shp - 1]
    diff = np.array(lead != lag)
    boo = np.array(1)
    boo = np.append(boo, diff)
    boo = boo.astype(bool)
    df = df.loc[boo]
    return df
gCell = prices.groupby(['item_id', 'store_id'])
prices = gCell.apply(removeSameLast)
This does the job, but is ugly and slow. Sorry for being a noob, but I assume that this can be done much faster. Could someone please propose a solution? Many thanks in advance.
I would suggest going for a simple solution using the shift function from pandas. This would remove the groupby and the function call.
The idea is to see where the Series [5.95, 6, 5.95, 6, 6, 5.95] is equal to the shifted one, [nan, 5.95, 6, 5.95, 6, 6], and drop (or simply not select) the rows where this condition holds.
>>> mask = ~np.isclose(prices['price'], prices['price'].shift())
>>> prices[mask]
day item_id store_id price
35083 34 85376 211 5.95
56157 41 85376 211 6.00
63628 50 85376 211 5.95
64955 51 85376 211 6.00
69477 69 85376 211 5.95
Simple benchmark:
%timeit prices = gCell.apply(removeSameLast)
100 loops, best of 3: 4.46 ms per loop
%timeit mask = prices.price != prices.price.shift()
1000 loops, best of 3: 183 µs per loop
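Note that a plain shift compares each row with whichever row happens to precede it in the frame, so it assumes each item-store group is contiguous. If updates for different item-store pairs are interleaved, a group-aware variant keeps the comparison inside each group. A minimal sketch, assuming the column names from the question:
import numpy as np

# Previous price within the same item-store group (NaN for each group's first row).
prev_price = prices.groupby(['item_id', 'store_id'])['price'].shift()

# Keep rows whose price differs from the previous price in their group;
# np.isclose returns False against NaN, so the first row of every group is kept.
prices = prices[~np.isclose(prices['price'], prev_price)]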
I'm a Python/coding newbie trying to make data logger downloads and calculations a smoother process as a side project.
Anyway, I have two data frames.
The first is "data" which contains the following (number of rows shortened for simplicity):
Logger Name Date and Time Battery Voltage(v) Internal Temp(C) Sensor Reading(dg) Sensor Temp(C) Array #
0 TDX 10/1/2021 13:35 2.93 15.59 8772.737 14.5 833
1 TDX 10/1/2021 13:36 2.93 15.59 8773.426 14.5 834
2 TDX 10/1/2021 13:36 2.93 15.59 8773.570 14.5 835
3 TDX 10/1/2021 13:37 2.93 15.59 8773.793 14.5 836
The second is "param" which has parameters which contains values that I use to make calculations:
Transducer_ID elevation_tom elevation_toc elevation_ground elevation_tos calculation gage_factor xd_zero_reading thermal_factor xd_temp_at_zero_reading piezo_elev piezo_downhole_depth
0 TDX NaN NaN 1000 NaN linear -0.04135 9138 0.003119 24.8 1600 400
1 Test NaN NaN 1000 NaN linear -0.18320 8997 -0.170100 22.6 800 200
Now what I hope the code will be able to do is make a new column in "data" called "Linear P", populated from this calculation, which uses variables from both dataframes:
[xd_zero_reading - Sensor Reading(dg)] * abs(gage_factor). This would not be a problem if "param" only had one Transducer_ID and the same number of rows as "data", but in reality it has many rows with different IDs.
So my question is this: what is the best way to accomplish my goal? Is it to loop over the column, or is there something more efficient in the pandas library?
Thanks in advance!
Edit: the output I am looking for is this:
Logger Name Date and Time Battery Voltage(v) Internal Temp(C) Sensor Reading(dg) Sensor Temp(C) Array # Linear P
0 TDX 10/1/2021 13:35 2.93 15.59 8772.737 14.5 833 15.103625
1 TDX 10/1/2021 13:36 2.93 15.59 8773.426 14.5 834 15.075135
2 TDX 10/1/2021 13:36 2.93 15.59 8773.570 14.5 835 15.069181
3 TDX 10/1/2021 13:37 2.93 15.59 8773.793 14.5 836 15.059959
Just figured out a way to do it that seems pretty efficient.
I simply remove the data in "param" that I do not need:
z = data.iloc[0,0]
param = param[param.Transducer_ID == z]
With the data filtered I pull out only the needed values from param:
x = param.iloc[0, 7]  # xd_zero_reading
y = param.iloc[0, 6]  # gage_factor
And perform the calculation:
data['Linear P'] = (x - data['Sensor Reading(dg)']) * abs(y)
Let me know if this seems like the best way to get the job done!
In my experience, the more efficient way would be to:
join the two data frames (https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.join.html), then
make the calculation on the resulting dataframe (df["Linear P"] = df["Sensor Reading(dg)"] * ...).
Here is an example of my process:
import pandas as pd

df1 = pd.DataFrame({'Names': ['a', 'a'],
                    'var1': [35, 15],
                    'var2': [15, 40]})
df2 = pd.DataFrame({'Names1': ['a', 'E'],
                    'var3': [35, 15],
                    'var4': [15, 40]})

final_df = df1.merge(df2, left_on='Names', right_on='Names1', how='left')
final_df["Linear P"] = final_df["var3"] * final_df["var2"] - abs(final_df["var2"])
print(final_df)
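Applied to the frames from the question, a sketch along the same lines (assuming the 'Logger Name' column in data matches 'Transducer_ID' in param, with the column names shown above) could look like:
# Bring the per-transducer parameters onto every reading, then compute
# "Linear P" in one vectorized expression.
merged = data.merge(param[['Transducer_ID', 'gage_factor', 'xd_zero_reading']],
                    left_on='Logger Name', right_on='Transducer_ID', how='left')
merged['Linear P'] = (merged['xd_zero_reading'] - merged['Sensor Reading(dg)']) * merged['gage_factor'].abs()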
I have the following data set:
          CustomerID        Date  Amount  Department DaysSincePurchase
0             395134  2019-01-01     199        Home          986 days
1             395134  2019-01-01     279        Home          986 days
2            1356012  2019-01-07     279        Home          980 days
3            1921374  2019-01-08     269        Home          979 days
4             395134  2019-01-01     279        Home          986 days
...              ...         ...     ...         ...               ...
18926474     1667426  2021-06-30     349  Womenswear           75 days
18926475     1667426  2021-06-30     299  Womenswear           75 days
18926476      583105  2021-06-30     349  Womenswear           75 days
18926477      538137  2021-06-30     279  Womenswear           75 days
18926478      825382  2021-06-30    2499        Home           75 days
I want to do some feature engineering and add a few columns after aggregating (using groupby) by CustomerID. The Date column is unimportant and can easily be dropped. I want a data set where every row is one unique CustomerID (these are just integers 1, 2, ...) in the first column, and the other columns are:
Total purchase amount
Days since the last purchase
Number of distinct departments
This is what I've done, and it works. However, when I time it, it takes about 1.5 hours. Is there another, more efficient way of doing this?
customer_group = joinedData.groupby(['CustomerID'])
n = originalData['CustomerID'].nunique()

# First arrange the data in a matrix.
matrix = np.zeros((n, 5))  # Pre-allocate matrix

for i in range(0, n):
    matrix[i, 0] = i + 1
    matrix[i, 1] = sum(customer_group.get_group(i + 1)['Amount'])
    matrix[i, 2] = min(customer_group.get_group(i + 1)['DaysSincePurchase']).days
    matrix[i, 3] = customer_group.get_group(i + 1)['Department'].nunique()
# The above loop takes approx. 6300 seconds.

# Convert the matrix to a dataframe and name the columns.
newData = pd.DataFrame(matrix)
newData = newData.rename(columns={0: "CustomerID"})
newData = newData.rename(columns={1: "TotalDemand"})
newData = newData.rename(columns={2: "DaysSinceLastPurchase"})
newData = newData.rename(columns={3: "nrDepartments"})
Use agg:
>>> df.groupby('CustomerID').agg(TotalDemand=('Amount', sum),
                                 DaysSinceLastPurchase=('DaysSincePurchase', min),
                                 nrDepartments=('Department', 'nunique'))
I ran this function over a dataframe of 20,000,000 records. It took a few seconds to execute:
>>> %timeit df.groupby('CustomerID').agg(...)
14.7 s ± 225 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Generated data:
N = 20000000
df = pd.DataFrame(
    {'CustomerID': np.random.randint(1000, 10000, N),
     'Date': np.random.choice(pd.date_range('2020-01-01', '2020-12-31'), N),
     'Amount': np.random.randint(100, 1000, N),
     'Department': np.random.choice(['Home', 'Sport', 'Food', 'Womenswear',
                                     'Menswear', 'Furniture'], N)})
df['DaysSincePurchase'] = pd.Timestamp.today().normalize() - df['Date']
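To match the layout the asker is building by hand (CustomerID as an ordinary column and the days as plain integers), a reset_index plus a .dt.days conversion should do it. A sketch using the frame name from the question:
newData = (joinedData.groupby('CustomerID')
           .agg(TotalDemand=('Amount', 'sum'),
                DaysSinceLastPurchase=('DaysSincePurchase', 'min'),
                nrDepartments=('Department', 'nunique'))
           .reset_index())

# The min of a timedelta column is still a timedelta; convert it to whole days.
newData['DaysSinceLastPurchase'] = newData['DaysSinceLastPurchase'].dt.days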
I have the following (sample) dataframe:
Age height weight haircolor
joe 35 5.5 145 brown
mary 26 5.25 110 blonde
pete 44 6.02 185 red
....
There are no duplicate values in the index.
I am in the unenviable position of having to append to this dataframe using elements from a number of other dataframes. So I'm appending as follows:
names_df = names_df.append({'Age': someage,
                            'height': someheight,
                            'weight': someweight,
                            'haircolor': somehaircolor},
                           ignore_index=True)
My question is: using this method, how do I set the new index value in names_df equal to the person's name?
The only thing I can think of is to reset the df index before I append and then re-set it afterward. Ugly. Has to be a better way.
Thanks in advance.
I am not sure in what format you are getting the data that you are appending to the original df, but one way is as follows:
df.loc['new_name', :] = ['someage', 'someheight', 'someweight', 'somehaircolor']
Age height weight haircolor
joe 35 5.5 145 brown
mary 26 5.25 110 blonde
pete 44 6.02 185 red
new_name someage someheight someweight somehaircolor
Time Testing:
%timeit df.loc['new_name', :] = ['someage', 'someheight', 'someweight', 'somehaircolor']
1000 loops, best of 3: 408 µs per loop
%timeit df.append(pd.DataFrame({'Age': 'someage', 'height': 'someheight','weight':'someweight','haircolor': 'somehaircolor'}, index=['some_person']))
100 loops, best of 3: 2.59 ms per loop
Here's another way using append. Instead of passing a dictionary, pass a dataframe (created with dictionary) while specifying index:
names_df = names_df.append(pd.DataFrame({'Age': 'someage',
                                          'height': 'someheight',
                                          'weight': 'someweight',
                                          'haircolor': 'somehaircolor'}, index=['some_person']))
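Note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so on recent versions the same idea is usually written with pd.concat. A minimal sketch, reusing the placeholder variables from the question:
import pandas as pd

# Build a one-row frame whose index is the person's name, then concatenate;
# someage, someheight, someweight and somehaircolor are the question's placeholders.
new_row = pd.DataFrame({'Age': [someage], 'height': [someheight],
                        'weight': [someweight], 'haircolor': [somehaircolor]},
                       index=['some_person'])
names_df = pd.concat([names_df, new_row])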
I have data like this
location sales store
0 68 583 17
1 28 857 2
2 55 190 59
3 98 517 64
4 94 892 79
...
For each unique pair (location, store), there are one or more sales. I want to add a column, pcnt_sales, that shows what percent of the total sales for that (location, store) pair is made up by the sale in the given row.
location sales store pcnt_sales
0 68 583 17 0.254363
1 28 857 2 0.346543
2 55 190 59 1.000000
3 98 517 64 0.272105
4 94 892 79 1.000000
...
This works, but is slow
import pandas as pd
import numpy as np
df = pd.DataFrame({'location':np.random.randint(0, 100, 10000), 'store':np.random.randint(0, 100, 10000), 'sales': np.random.randint(0, 1000, 10000)})
import timeit
start_time = timeit.default_timer()
df['pcnt_sales'] = df.groupby(['location', 'store'])['sales'].apply(lambda x: x/x.sum())
print(timeit.default_timer() - start_time) # 1.46 seconds
By comparison, R's data.table does this super fast
library(data.table)
dt <- data.table(location=sample(100, size=10000, replace=TRUE), store=sample(100, size=10000, replace=TRUE), sales=sample(1000, size=10000, replace=TRUE))
ptm <- proc.time()
dt[, pcnt_sales:=sales/sum(sales), by=c("location", "store")]
proc.time() - ptm # 0.007 seconds
How do I do this efficiently in Pandas (especially considering my real dataset has millions of rows)?
For performance you want to avoid apply. You could use transform to get the result of the groupby expanded to the original index instead, at which point a division would work at vectorized speed:
>>> %timeit df['pcnt_sales'] = df.groupby(['location', 'store'])['sales'].apply(lambda x: x/x.sum())
1 loop, best of 3: 2.27 s per loop
>>> %timeit df['pcnt_sales2'] = (df["sales"] /
                                 df.groupby(['location', 'store'])['sales'].transform(sum))
100 loops, best of 3: 6.25 ms per loop
>>> df["pcnt_sales"].equals(df["pcnt_sales2"])
True
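As a side note, the same transform is usually spelled with the 'sum' string alias, which lets pandas dispatch to its built-in aggregation and should give an identical result:
# Same computation, using the string alias for the aggregation.
df['pcnt_sales2'] = df['sales'] / df.groupby(['location', 'store'])['sales'].transform('sum')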
There is a Pandas DataFrame object with some stock data. The SMAs are moving averages calculated from the previous 45/15 days.
Date Price SMA_45 SMA_15
20150127 102.75 113 106
20150128 103.05 100 106
20150129 105.10 112 105
20150130 105.35 111 105
20150202 107.15 111 105
20150203 111.95 110 105
20150204 111.90 110 106
I want to find all dates when SMA_15 and SMA_45 intersect.
Can it be done efficiently using Pandas or Numpy? How?
EDIT:
What I mean by 'intersection':
The data row where:
the long SMA (45) value was bigger than the short SMA (15) value for longer than the short SMA period (15), and then it became smaller;
the long SMA (45) value was smaller than the short SMA (15) value for longer than the short SMA period (15), and then it became bigger.
I'm taking a crossover to mean when the SMA lines -- as functions of time -- intersect, as depicted on this investopedia page.
Since the SMAs represent continuous functions, there is a crossing when, for a given row, (SMA_15 is less than SMA_45) and (the previous SMA_15 is greater than the previous SMA_45) -- or vice versa.
In code, that could be expressed as
previous_15 = df['SMA_15'].shift(1)
previous_45 = df['SMA_45'].shift(1)
crossing = (((df['SMA_15'] <= df['SMA_45']) & (previous_15 >= previous_45))
            | ((df['SMA_15'] >= df['SMA_45']) & (previous_15 <= previous_45)))
Applying this to your data, which already contains crossings:
import pandas as pd
df = pd.read_table('data', sep='\s+')
previous_15 = df['SMA_15'].shift(1)
previous_45 = df['SMA_45'].shift(1)
crossing = (((df['SMA_15'] <= df['SMA_45']) & (previous_15 >= previous_45))
            | ((df['SMA_15'] >= df['SMA_45']) & (previous_15 <= previous_45)))
crossing_dates = df.loc[crossing, 'Date']
print(crossing_dates)
yields
1 20150128
2 20150129
Name: Date, dtype: int64
The following method gives similar results, but takes less time than the previous one:
import numpy as np

df['position'] = df['SMA_15'] > df['SMA_45']
df['pre_position'] = df['position'].shift(1)
df.dropna(inplace=True)  # dropping the NaN values
df['crossover'] = np.where(df['position'] == df['pre_position'], False, True)
Time taken for this approach: 2.7 ms ± 310 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Time taken for previous approach: 3.46 ms ± 307 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
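To recover the actual crossover dates from that boolean column, a final selection along these lines should work (assuming the same df as above):
# Rows flagged as crossovers; pull out their dates.
crossing_dates = df.loc[df['crossover'], 'Date']
print(crossing_dates)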
As an alternative to unutbu's answer, something like the following can also be done to find the indices where SMA_15 crosses SMA_45.
diff = df['SMA_15'] < df['SMA_45']
diff_forward = diff.shift(1)
crossing = np.where(abs(diff - diff_forward) == 1)[0]
print(crossing)
>>> [1,2]
print(df.iloc[crossing])
>>>
Date Price SMA_45 SMA_15
1 20150128 103.05 100 106
2 20150129 105.10 112 105