I have a pandas dataframe named 'stock_data' with a MultiIndex index of ('Date', 'StockID') and a column 'Price'. The rows are ordered by date, so for the same stock a later date will have a higher row index. I want to add a new column that for each stock (i.e. group by stock) contains a number with the maximum positive difference between the prices of the stock through time, as in max_price - min_price.
To explain this further, one could calculate this in O(stocks*rows^2) by:
for each stock:
    max_diff = 0.0
    for i in range(len(rows) - 1):
        for j in range(i + 1, len(rows)):
            if price[j] - price[i] > max_diff:
                max_diff = price[j] - price[i]
How do I do this in pandas without calculating every value and assigning it to the right spot of a new column one at a time like the above algorithm (which could probably be improved by sorting, but that is beside the point)?
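For reference, the quadratic scan above collapses to one linear pass per stock by tracking the running minimum; a plain-Python sketch (not pandas yet):

```python
def max_gain(prices):
    # track the running minimum and the best (later - earlier) difference
    best, running_min = 0.0, float('inf')
    for p in prices:
        running_min = min(running_min, p)
        best = max(best, p - running_min)
    return best

print(max_gain([37, 72, 75, 79, 16, 76, 6]))  # 60
```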
So far, I have only figured out that I can group by 'StockID' with:
stock_data.groupby(level='StockID') and pick the column stock_data.groupby(level='StockID')['Price']. But something like:
stock_data.groupby(level='StockID')['Price'].max() - stock_data.groupby(level='StockID')['Price'].min()
is not what I described above, because there is no restriction that the max() must come after the min().
Edit: The accepted solution works. Now I am also wondering if there is a way to penalize the difference by how far the max is from the min, so that short-term gains score higher (and are therefore preferred) over long-term ones with a somewhat bigger difference.
For example, maybe we could do cumsum() only up to a certain length after the min, and not till the end? Somehow?
Let's use [::-1] to reverse the order so that we can get the maximum "in the future", then apply cummax (on the reversed data) and cummin after the groupby.
# sample data
import numpy as np
import pandas as pd

np.random.seed(1)
stock_data = pd.DataFrame({'Price': np.random.randint(0, 100, size=14)},
                          index=pd.MultiIndex.from_product(
                              [pd.date_range('2020-12-01', '2020-12-07', freq='D'),
                               list('ab')],
                              names=['date', 'stock']))
and assuming the dates are ordered in time, you can do:
stock_data['diff'] = (stock_data.loc[::-1, 'Price'].groupby(level='stock').cummax()
                      - stock_data.groupby(level='stock')['Price'].cummin())
print(stock_data)
                  Price  diff
date       stock
2020-12-01 a         37    42
           b         12    59
2020-12-02 a         72    42
           b          9    62
2020-12-03 a         75    42
           b          5    66
2020-12-04 a         79    42
           b         64    66
2020-12-05 a         16    60
           b          1    70
2020-12-06 a         76    60
           b         71    70
2020-12-07 a          6     0
           b         25    24
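For intuition, here is the reversed-cummax trick on a single price series (stock a from the sample above); the per-row values reproduce the diff column for a, and their max is the answer for that stock:

```python
import pandas as pd

prices = pd.Series([37, 72, 75, 79, 16, 76, 6])
future_max = prices[::-1].cummax()[::-1]  # max price at or after each row
running_min = prices.cummin()             # min price at or before each row
diff = future_max - running_min
print(diff.tolist())  # [42, 42, 42, 42, 60, 60, 0]
print(diff.max())     # 60 (buy at 16, sell at 76)
```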
I'm sure that this question title is not really helpful and could mean a lot of things, so I'll try to explain the question with an example.
My goal is to delete rows in a DataFrame like the following one if the row cannot be part of a run of consecutive days at least as long as a given time period t. If t, for example, is 3, then the last row needs to be deleted, because there is a gap between it and the row before. If t were 4, then the first three rows would also have to be deleted, since 07.04.2012 (or 03.04.2012) is missing. Hopefully you can understand what I am trying to explain here.
Date        Value
04.04.2012  24
05.04.2012  21
06.04.2012  20
08.04.2012  21
09.04.2012  23
10.04.2012  21
11.04.2012  26
13.04.2012  24
My attempt was to iterate over the values in the 'Date' column and check, for every element x, whether the date at position x minus the date at position x + t equals -t days; if not, the whole row of that element should be deleted. But while searching for how to iterate over a DataFrame, I read several times that this is not recommended, because it needs a lot of computing time for big DataFrames. Unfortunately, I couldn't find any other method or function that could do this. Therefore, I would be really glad if someone could help me out here. Thank you! :)
With the dates as the index, you can expand the index of the dataframe to include the missing days. The new dates will create NaN values. Create a group for every run between NaN values with .isna().cumsum() and count the size of each group. Finally, select the rows with a count larger than or equal to the desired time period.
period = 3
df.set_index('Date', inplace=True)
df[df.groupby(df.reindex(pd.date_range(df.index.min(), df.index.max()))
              .Value.isna().cumsum())
   .transform('count').ge(period).Value].reset_index()
Output
Date Value
0 2012-04-04 24
1 2012-04-05 21
2 2012-04-06 20
3 2012-04-08 21
4 2012-04-09 23
5 2012-04-10 21
6 2012-04-11 26
To create the dataframe used in this solution:
t = '''
Date Value
04.04.2012 24
05.04.2012 21
06.04.2012 20
08.04.2012 21
09.04.2012 23
10.04.2012 21
11.04.2012 26
13.04.2012 24
'''
import io
import pandas as pd
from datetime import datetime

df = pd.read_csv(io.StringIO(t), sep=r'\s+', parse_dates=['Date'],
                 date_parser=lambda x: datetime.strptime(x, '%d.%m.%Y'))
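A more explicit spelling of the same idea, step by step (a sketch; it rebuilds the sample frame and assumes period = 3 as above):

```python
import io
import pandas as pd

t = '''Date Value
04.04.2012 24
05.04.2012 21
06.04.2012 20
08.04.2012 21
09.04.2012 23
10.04.2012 21
11.04.2012 26
13.04.2012 24'''
df = pd.read_csv(io.StringIO(t), sep=r'\s+', dayfirst=True, parse_dates=['Date'])
df = df.set_index('Date')

period = 3
# expand to a full daily range; missing days become NaN
full = df.reindex(pd.date_range(df.index.min(), df.index.max()))
# every NaN day starts a new group of consecutive dates
group_id = full['Value'].isna().cumsum()
# count the real (non-NaN) rows in each group
run_len = full.groupby(group_id)['Value'].transform('count')
# keep rows that belong to a run of at least `period` consecutive days
result = full[run_len.ge(period) & full['Value'].notna()]
print(result['Value'].tolist())  # [24.0, 21.0, 20.0, 21.0, 23.0, 21.0, 26.0]
```

With period = 3, only the lone 13.04.2012 row is dropped, matching the output above.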
I'm new to Python and hope you guys can help me with the following:
I have a data frame that contains the daily demand for a certain product. However, the demand is shown cumulatively over time. I want to create a column that shows the actual daily demand (see tables below).
Current Data frame:
Day#  Cumulative Demand
1     10
2     15
3     38
4     44
5     53
What I want to achieve:
Day#  Cumulative Demand  Daily Demand
1     10                 10
2     15                 5
3     38                 23
4     44                 6
5     53                 9
Thank you!
Firstly, we need the data of the old column:
# My Dataframe is called df
demand = df["Cumulative Demand"].tolist()
Then recalculate the data:
daily_demand = [demand[0]]
for i, d in enumerate(demand[1:]):
    daily_demand.append(d - demand[i])
Lastly, append the data to a new column:
df["Daily Demand"] = daily_demand
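The same recalculation can also be done without a Python loop, using pandas' built-in diff(); a sketch with the question's numbers:

```python
import pandas as pd

df = pd.DataFrame({'Day#': [1, 2, 3, 4, 5],
                   'Cumulative Demand': [10, 15, 38, 44, 53]})
# diff() subtracts the previous row; the first row has no predecessor,
# so fill it with the first cumulative value
df['Daily Demand'] = (df['Cumulative Demand'].diff()
                      .fillna(df['Cumulative Demand'].iloc[0])
                      .astype(int))
print(df['Daily Demand'].tolist())  # [10, 5, 23, 6, 9]
```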
Assuming what you shared above is representative of your actual data, meaning you have one row per day and the Day column is sorted in ascending order:
You can use shift() (please read what it does) and subtract the shifted version of the cumulative demand from the cumulative demand itself. This gives you back the actual daily demand.
To make sure that it works, check whether the cumulative sum of the new Daily Demand column, via cumsum(), reproduces the Cumulative Demand column.
import pandas as pd

# Calculate your Daily Demand column
df['Daily Demand'] = (df['Cumulative Demand']
                      - df['Cumulative Demand'].shift()).fillna(df['Cumulative Demand'][0])

# Check whether the cumulative sum of daily demands adds back up to Cumulative Demand
>>> all(df['Daily Demand'].cumsum() == df['Cumulative Demand'])
True
Will print back:
Day Cumulative Demand Daily Demand
0 1 10 10.0
1 2 15 5.0
2 3 38 23.0
3 4 44 6.0
4 5 53 9.0
I'm working on a large dataset with more than 60K rows.
I have a continuous measurement of current in a column. Each code is measured for about a second, during which the equipment samples it 14/15/16/17 times depending on the equipment speed; then the measurement moves on to the next code and again samples 14/15/16/17 times, and so forth.
Every time the measurement moves from one code to the next, there is a jump of more than 0.15 in the current measurement.
The top 48 rows of the data are as follows:
Index  Curr(mA)
0      1.362476
1      1.341721
2      1.362477
3      1.362477
4      1.355560
5      1.348642
6      1.327886
7      1.341721
8      1.334804
9      1.334804
10     1.348641
11     1.362474
12     1.348644
13     1.355558
14     1.334805
15     1.362477
16     1.556172
17     1.542336
18     1.549252
19     1.528503
20     1.549254
21     1.528501
22     1.556173
23     1.556172
24     1.542334
25     1.556172
26     1.542336
27     1.542334
28     1.556170
29     1.535415
30     1.542334
31     1.729109
32     1.749863
33     1.749861
34     1.749861
35     1.736024
36     1.770619
37     1.742946
38     1.763699
39     1.749861
40     1.749861
41     1.763703
42     1.756781
43     1.742946
44     1.736026
45     1.756781
46     1.964308
47     1.957395
I want to write a script where the 14/15/16/17 readings of each code are averaged into a separate column, one value per code measurement. I have been thinking of doing this with pandas.
I want the data to look like:
Index  Curr(mA)
0      1.34907
1      1.54556
2      1.74986
Need some help to get this done. Please help
First get the indexes of every row where there's a jump. Use Pandas' DataFrame.diff() to get the difference between the value in each row and the previous row, then check to see if it's greater than 0.15 with >. Use that to filter the dataframe index, and save the resulting indices (in the case of your sample data, three) in a variable.
indices = df.index[df['Curr(mA)'].diff() > 0.15]
The next steps depend on whether there are more columns in the source dataframe that you want in the output, or if it's really just Curr(mA) and the index. In the latter case, you can use np.split() to cut the dataframe into a list of dataframes based on the indexes you just pulled. Then you can go ahead and average them in a list comprehension.
[df['Curr(mA)'].mean() for df in np.split(df, indices)]
> [1.3490729374999997, 1.5455638666666667, 1.7498627333333332, 1.9608515]
To get it to match your desired output above (the same values as a one-column pandas object rather than a list), convert the list to a pd.Series; reset_index(drop=True) is a no-op here, since a fresh Series already has a clean RangeIndex.
pd.Series(
    [df['Curr(mA)'].mean() for df in np.split(df, indices)]
).reset_index(drop=True)

0    1.349073
1    1.545564
2    1.749863
3    1.960851
dtype: float64
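If you'd rather avoid np.split (splitting a DataFrame with it is deprecated in recent NumPy versions), the same grouping can be done with a cumulative sum over the jump mask; a sketch with made-up values:

```python
import pandas as pd

s = pd.Series([1.0, 1.1, 1.5, 1.5, 1.9, 2.0])
# each jump of more than 0.15 starts a new group
group_id = s.diff().gt(0.15).cumsum()
means = s.groupby(group_id).mean().reset_index(drop=True)
print(means.tolist())  # [1.05, 1.5, 1.95]
```

This also generalizes to a multi-column frame: use the same group_id with df.groupby(group_id).mean().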
I have a large df with many entries per month. I would like to see the average number of entries per month, for example to see whether some months normally have more entries. (Ideally I'd like to plot this with a line for the overall mean to compare against, but that is maybe a later question.)
My df is something like this:
ufo=pd.read_csv('https://raw.githubusercontent.com/justmarkham/pandas-videos/master/data/ufo.csv')
ufo['Time']=pd.to_datetime(ufo.Time)
So if, for example, I'd like to see whether there are more UFO sightings in the summer, how would I go about it?
I have tried:
ufo.groupby(ufo.Time.dt.month).mean()
But it only works if I am calculating a numerical value. If I use count() instead, I get the total number of entries for each month across all years.
EDIT: To clarify, I would like to have the mean of entries - ufo-sightings - per month.
You could do something like this:
# span of years covered by the records for each month
def total_month(x):
    return x.max().year - x.min().year + 1

new_df = ufo.groupby(ufo.Time.dt.month).Time.agg(['size', total_month])
new_df['mean_count'] = new_df['size'] / new_df['total_month']
Output:
      size  total_month  mean_count
Time
1      862           57   15.122807
2      817           70   11.671429
3     1096           55   19.927273
4     1045           68   15.367647
5     1168           53   22.037736
6     3059           71   43.084507
7     2345           65   36.076923
8     1948           64   30.437500
9     1635           67   24.402985
10    1723           65   26.507692
11    1509           50   30.180000
12    1034           56   18.464286
I think this is what you are looking for; still, please ask for clarification if I didn't reach what you are looking for.
# Add a new column 'instance'; this gives each ufo sighting a count of 1
ufo['instance'] = 1
# Set the index to Time; this makes df a time-series df, so pandas time-series functions apply
ufo.set_index(ufo['Time'], drop=True, inplace=True)
# Create another df by resampling the original by month ('M') and counting the instance column
ufo2 = pd.DataFrame(ufo['instance'].resample('M').count())
# Just to find the month of each resampled observation
ufo2['Time'] = pd.to_datetime(ufo2.index.values)
ufo2['month'] = ufo2['Time'].apply(lambda x: x.month)
and finally you can groupby month :)
ufo2.groupby(by='month').mean()
and this is what the output looks like:
       mean_instance
month
1          12.314286
2          11.671429
3          15.657143
4          14.928571
5          16.685714
6          43.084507
7          33.028169
8          27.436620
9          23.028169
10         24.267606
11         21.253521
12         14.563380
Do you mean you want to group your data by month? I think we can do this:
ufo['month'] = ufo['Time'].apply(lambda t: t.month)
ufo['year'] = ufo['Time'].apply(lambda t: t.year)
In this way, you will have 'year' and 'month' to group your data.
ufo_2 = ufo.groupby(['year', 'month'])['place_holder'].mean()
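Building on the year/month grouping above, the counting and averaging can be combined without a placeholder column; a sketch with a tiny made-up frame (note it averages only over the years in which a month actually has entries):

```python
import pandas as pd

ufo = pd.DataFrame({'Time': pd.to_datetime(
    ['2000-06-01', '2000-06-15', '2001-06-03', '2000-07-04'])})
# sightings per (year, month), then the mean of those counts per month
monthly = ufo.groupby([ufo['Time'].dt.year, ufo['Time'].dt.month]).size()
avg_per_month = monthly.groupby(level=1).mean()
print(avg_per_month.to_dict())  # {6: 1.5, 7: 1.0}
```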
I am new to Pandas. I have a set of Excel data read into a dataframe as follows:
TimeReceived A B
08:00:01.010 70 40
08:00:01.050 80 50
08:01:01.100 50 20
08:01:01.150 40 30
I want to compute the average for columns A & B based on time intervals of 100 ms.
The output in this case would be :
TimeReceived A B
08:00:01.000 75 45
08:00:01.100 45 25
I have set the 'TimeReceived' as a Date-Time index:
df = df.set_index (['TimeReceived'])
I can select rows based on predefined time ranges but I cannot do computations on time intervals as shown above.
If you have a DatetimeIndex you can then use resample to upsample or downsample your data to a new frequency. This will introduce NaN rows where there are gaps, but you can drop these using dropna:
df.resample('100ms').mean().dropna()
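With the question's sample numbers, the whole thing looks like this (a sketch; an arbitrary date is attached to the times so they form a proper DatetimeIndex):

```python
import pandas as pd

df = pd.DataFrame({'A': [70, 80, 50, 40], 'B': [40, 50, 20, 30]},
                  index=pd.to_datetime(['2021-01-01 08:00:01.010',
                                        '2021-01-01 08:00:01.050',
                                        '2021-01-01 08:00:01.100',
                                        '2021-01-01 08:00:01.150']))
# each 100 ms bin averages the rows whose timestamps fall inside it
out = df.resample('100ms').mean().dropna()
print(out['A'].tolist(), out['B'].tolist())  # [75.0, 45.0] [45.0, 25.0]
```

The first two rows fall into the 08:00:01.000 bin and the last two into the 08:00:01.100 bin, giving the averages from the question.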