I am new to Pandas. I have a set of Excel data read into a dataframe as follows:
TimeReceived A B
08:00:01.010 70 40
08:00:01.050 80 50
08:00:01.100 50 20
08:00:01.150 40 30
I want to compute the average for columns A & B based on time intervals of 100 ms.
The output in this case would be:
TimeReceived A B
08:00:01.000 75 45
08:00:01.100 45 25
I have set 'TimeReceived' as the datetime index:
df = df.set_index('TimeReceived')
I can select rows based on predefined time ranges but I cannot do computations on time intervals as shown above.
If you have a DatetimeIndex you can use resample to upsample or downsample your data to a new frequency. This will introduce NaN rows where there are gaps, but you can drop these using dropna:
df.resample('100ms').mean().dropna()
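For reference, here is a minimal end-to-end sketch using the sample data above (the date part of the timestamps is an arbitrary assumption, since only times of day are given):

import pandas as pd

# Rebuild the sample data; the date is arbitrary, only the time of day matters here
df = pd.DataFrame({
    'TimeReceived': pd.to_datetime(['2023-01-01 08:00:01.010',
                                    '2023-01-01 08:00:01.050',
                                    '2023-01-01 08:00:01.100',
                                    '2023-01-01 08:00:01.150']),
    'A': [70, 80, 50, 40],
    'B': [40, 50, 20, 30],
})
df = df.set_index('TimeReceived')

# Average A and B over 100 ms bins, dropping bins that contain no data
print(df.resample('100ms').mean().dropna())
#                             A     B
# TimeReceived
# 2023-01-01 08:00:01.000  75.0  45.0
# 2023-01-01 08:00:01.100  45.0  25.0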
I'm sure that this question on its own is not really helpful and could mean a lot of things, so I'll try to explain it with an example.
My goal is to delete rows in a DataFrame like the following one if the row can't be part of a run of consecutive days that is at least as long as a given time period t. If t is 3, for example, then the last row needs to be deleted, because there is a gap between the last row and the one before it. If t were 4, then the first three rows would also have to be deleted, since 07.04.2012 (and 03.04.2012) is missing. Hopefully you can understand what I am trying to explain here.
Date        Value
04.04.2012  24
05.04.2012  21
06.04.2012  20
08.04.2012  21
09.04.2012  23
10.04.2012  21
11.04.2012  26
13.04.2012  24
My attempt was to iterate over the values in the 'Date' column and, for every element x, check whether the date at position x minus the date at position x + t equals -t days. If this is not the case, the whole row of that element should be deleted. But while searching for how to iterate over a DataFrame, I read several times that this is not recommended because it needs a lot of computing time for big DataFrames. Unfortunately, I couldn't find any other method or function that could do this, so I would be really glad if someone could help me out here. Thank you! :)
With the dates as index you can expand the index of the dataframe to include the missing days. The new dates will create NaN values. Create a group for every run between NaN values with .isna().cumsum(), and count the size of each group. Finally, select the rows whose group count is greater than or equal to the desired time period.
period = 3
df.set_index('Date', inplace=True)
df[df.groupby(df.reindex(pd.date_range(df.index.min(), df.index.max()))
              .Value.isna().cumsum())
   .transform('count').ge(period).Value].reset_index()
Output
Date Value
0 2012-04-04 24
1 2012-04-05 21
2 2012-04-06 20
3 2012-04-08 21
4 2012-04-09 23
5 2012-04-10 21
6 2012-04-11 26
To create the dataframe used in this solution:
t = '''
Date Value
04.04.2012 24
05.04.2012 21
06.04.2012 20
08.04.2012 21
09.04.2012 23
10.04.2012 21
11.04.2012 26
13.04.2012 24
'''
import io
import pandas as pd
from datetime import datetime

df = pd.read_csv(io.StringIO(t), sep='\s+', parse_dates=['Date'],
                 date_parser=lambda x: datetime.strptime(x, '%d.%m.%Y'))
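For readability, the chained expression above can be broken into intermediate steps; the variable names below are illustrative only, the logic is unchanged:

# Reindex onto the full daily range; missing days become NaN
full = df.reindex(pd.date_range(df.index.min(), df.index.max()))

# Each NaN starts a new group id, so rows in the same unbroken run share an id
group_id = full.Value.isna().cumsum()

# Count how many rows of df fall into each run (the grouper aligns on the index)
run_size = df.groupby(group_id).Value.transform('count')

# Keep only rows that belong to a run of at least `period` consecutive days
result = df[run_size.ge(period)].reset_index()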
I'm new to Python and hope you guys can help me with the following:
I have a data frame that contains the daily demand of a certain product. However, the demand is shown cumulatively over time. I want to create a column that shows the actual daily demand (see table below).
Current data frame:
Day#  Cumulative Demand
1     10
2     15
3     38
4     44
5     53
What I want to achieve:
Day#  Cumulative Demand  Daily Demand
1     10                 10
2     15                 5
3     38                 23
4     44                 6
5     53                 9
Thank you!
First, we need the data from the existing column:
# My Dataframe is called df
demand = df["Cumulative Demand"].tolist()
Then recalculate the data:
daily_demand = [demand[0]]
for i, d in enumerate(demand[1:]):
    daily_demand.append(d - demand[i])
Lastly, assign the data to a new column:
df["Daily Demand"] = daily_demand
Assuming what you shared above is representative of your actual data, meaning you have one row per day and the Day column is sorted in ascending order.
You can use shift() (please read what it does) and subtract the shifted version of the cumulative demand from the cumulative demand itself. This gives you back the actual daily demand.
To make sure it works, check whether the cumulative sum of the daily demand (the new column), computed with cumsum(), reproduces the Cumulative Demand column.
import pandas as pd
# Calculate your Daily Demand column
df['Daily Demand'] = (df['Cumulative Demand'] - df['Cumulative Demand'].shift()).fillna(df['Cumulative Demand'][0])
# Check whether the cumulative sum of daily demands sum up to the Cumulative Demand
>>> all(df['Daily Demand'].cumsum() == df['Cumulative Demand'])
True
Will print back:
Day Cumulative Demand Daily Demand
0 1 10 10.0
1 2 15 5.0
2 3 38 23.0
3 4 44 6.0
4 5 53 9.0
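As a side note, Series.diff() expresses the same shift-and-subtract in a single call; a small sketch assuming the same column names as above:

# diff() subtracts the previous row, so it is equivalent to the shift-based version
df['Daily Demand'] = df['Cumulative Demand'].diff().fillna(df['Cumulative Demand'].iloc[0])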
I'm working on a large data set with more than 60K rows.
I have a continuous measurement of current in a column. Each code is measured for about a second, during which the equipment records 14/15/16/17 readings depending on the equipment speed; the measurement then moves to the next code and again records 14/15/16/17 readings, and so forth.
Every time the measurement moves from one code to another, there is a jump of more than 0.15 in the current measurement.
The top 48 rows of the data are as follows:
Index  Curr(mA)
0      1.362476
1      1.341721
2      1.362477
3      1.362477
4      1.355560
5      1.348642
6      1.327886
7      1.341721
8      1.334804
9      1.334804
10     1.348641
11     1.362474
12     1.348644
13     1.355558
14     1.334805
15     1.362477
16     1.556172
17     1.542336
18     1.549252
19     1.528503
20     1.549254
21     1.528501
22     1.556173
23     1.556172
24     1.542334
25     1.556172
26     1.542336
27     1.542334
28     1.556170
29     1.535415
30     1.542334
31     1.729109
32     1.749863
33     1.749861
34     1.749861
35     1.736024
36     1.770619
37     1.742946
38     1.763699
39     1.749861
40     1.749861
41     1.763703
42     1.756781
43     1.742946
44     1.736026
45     1.756781
46     1.964308
47     1.957395
I want to write a script that averages each block of 14/15/16/17 similar readings into a separate column, one value per code measurement. I have been thinking of doing this with pandas.
I want the data to look like this:
Index  Curr(mA)
0      1.34907
1      1.54556
2      1.74986
I need some help to get this done. Please help.
First get the indexes of every row where there's a jump. Use Pandas' DataFrame.diff() to get the difference between the value in each row and the previous row, then check to see if it's greater than 0.15 with >. Use that to filter the dataframe index, and save the resulting indices (in the case of your sample data, three) in a variable.
indices = df.index[df['Curr(mA)'].diff() > 0.15]
The next steps depend on whether there are more columns in the source dataframe that you want in the output, or if it's really just Curr(mA) and the index. In the latter case, you can use np.split() to cut the dataframe into a list of dataframes based on the indices you just pulled. Then you can go ahead and average them in a list comprehension.
[df['Curr(mA)'].mean() for df in np.split(df, indices)]
> [1.3490729374999997, 1.5455638666666667, 1.7498627333333332, 1.9608515]
To get it to match your desired output above (the same values as a pandas Series indexed from 0, rather than a list), convert the list to a pd.Series and reset_index(drop=True).
pd.Series(
    [df['Curr(mA)'].mean() for df in np.split(df, indices)]
).reset_index(drop=True)
0    1.349073
1    1.545564
2    1.749863
3    1.960851
dtype: float64
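As an aside, the split-and-average can also be written as a single groupby on the cumulative count of jumps; a minimal sketch, assuming the same column name and threshold as above:

# Each jump larger than 0.15 starts a new group id; rows between jumps share an id
group_id = (df['Curr(mA)'].diff() > 0.15).cumsum()

# Average the current within each group and renumber the result from 0
averages = df.groupby(group_id)['Curr(mA)'].mean().reset_index(drop=True)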
I have a pandas dataframe named 'stock_data' with a MultiIndex ('Date', 'StockID') and a column 'Price'. The rows are ordered by date, so for the same stock a later date has a higher row index. I want to add a new column that, for each stock (i.e. grouping by stock), contains the maximum positive difference between the prices of that stock through time, as in max_price - min_price.
To explain this further, one could calculate this in O(stocks*rows^2) by:
for each stock:
    max = 0.0
    for i in range(len(rows) - 1):
        for j in range(i + 1, len(rows)):
            if price[j] - price[i] > max:
                max = price[j] - price[i]
How do I do this in pandas without actually calculating every value and assigning it to the right spot of a new column one at a time, like the above algorithm (which could probably be improved by sorting, but that is beside the point)?
So far, I have only figured out that I can group by 'StockID' with:
stock_data.groupby(level='Stock') and pick the column stock_data.groupby(level='Stock')['Price']. But something like:
stock_data.groupby(level='Stock')['Price'].max() - stock_data.groupby(level='Stock')['Price'].min()
is not what I described above, because there is no restriction that the max() must come after the min().
Edit: The accepted solution works. Now I am also wondering if there is a way to penalize that difference by how far the max is from the min, so that short-term gains rank higher (and are therefore preferred) over long-term ones with a somewhat bigger difference.
For example, maybe we could do cumsum() only up to a certain length after the min rather than to the end? Somehow?
Let's try [::-1] to reverse the order so we can get the maximum "in the future", then use cummax and cummin after the groupby.
# sample data
import numpy as np
import pandas as pd

np.random.seed(1)
stock_data = pd.DataFrame({'Price': np.random.randint(0, 100, size=14)},
                          index=pd.MultiIndex.from_product(
                              [pd.date_range('2020-12-01', '2020-12-07', freq='D'),
                               list('ab')],
                              names=['date', 'stock'])
                          )
and assuming the dates are ordered in time, you can do:
stock_data['diff'] = (stock_data.loc[::-1, 'Price'].groupby(level='stock').cummax()
                      - stock_data.groupby(level='stock')['Price'].cummin())
print(stock_data)
                  Price  diff
date       stock
2020-12-01 a         37    42
           b         12    59
2020-12-02 a         72    42
           b          9    62
2020-12-03 a         75    42
           b          5    66
2020-12-04 a         79    42
           b         64    66
2020-12-05 a         16    60
           b          1    70
2020-12-06 a         76    60
           b         71    70
2020-12-07 a          6     0
           b         25    24
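To sanity-check the cummax/cummin result against the question's brute-force pseudocode, one can verify that the per-stock maximum of the diff column equals the best gain found by the O(rows^2) double loop (a verification sketch only):

# Compare the brute-force best gain per stock with the max of the 'diff' column
for stock, grp in stock_data.groupby(level='stock'):
    prices = grp['Price'].tolist()
    best = 0.0
    for i in range(len(prices) - 1):
        for j in range(i + 1, len(prices)):
            if prices[j] - prices[i] > best:
                best = prices[j] - prices[i]
    assert best == grp['diff'].max()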
I would like to create a column that contains the sum of all amounts that occurred in a given hour. For example, if the row I am looking at has 0 in the Hours column, I would like the total column for that row to hold the total volume of all amounts that occurred within hour 0.
So:
dat.groupby('Hours')['Amount'].sum()
By performing a groupby on hours and summing the amount, I get the total amount of the transactions that were made in each hour.
Hours
0 257101.87
1 146105.69
2 108819.17
....
45 532181.83
46 448887.69
47 336343.60
Name: Amount, dtype: float64
The problem is that my database contains thousands of rows, and I can't simply create a new column with the values from the groupby. I would need a condition stipulating that if the value in the Hours column is 0, then return the sum of all the amounts where the hour is 0.
So the desired result would be something like this
Hours Amount Total
0 20 100
0 20 100
0 60 100
1 10 20
1 10 20
2 50 50
In this scenario I would want to create the total column and return the sum of all amounts that occurred in a given hour
Groupby + transform should do it
df["Total"] = df.groupby("Hours")["Amount"].transform(sum)
Why this works...
A transform in pandas is like a split-apply-combine-merge in one go. You keep the same axis length after the groupby reduction.
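A quick sketch with the sample data from the question (column names as shown above) illustrates that the transformed column keeps one value per row:

import pandas as pd

# Sample data from the question
df = pd.DataFrame({'Hours': [0, 0, 0, 1, 1, 2],
                   'Amount': [20, 20, 60, 10, 10, 50]})

# Every row receives the total amount for its hour
df['Total'] = df.groupby('Hours')['Amount'].transform('sum')
print(df)
#    Hours  Amount  Total
# 0      0      20    100
# 1      0      20    100
# 2      0      60    100
# 3      1      10     20
# 4      1      10     20
# 5      2      50     50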
I would use the output of dat.groupby('Hours')['Amount'].sum(), rename it to Total, and merge it with the original frame on the Hours column:
# Renaming avoids an Amount_x/Amount_y clash after the merge
totals = dat.groupby('Hours')['Amount'].sum().rename('Total')
dat_with_totals = dat.merge(totals, on='Hours')