I have a dataset with timestamps and values for each ID. The number of rows for each ID is different, and I need a double for loop like this:
for ids in IDs:
    for index in Date:
Now, I would like to find the difference between timestamps, for each ID, in these ways:
values between 2 days
values between 7 days
In particular, for each ID
if, from the first value, in the next 2 days there is an increment of at least 0.3 from the first value
OR
if, from the first value, in the next 7 days there is a value equal to 1.5*the first value
I store that ID in a dataframe, otherwise I store that ID in another dataframe.
Now, my code is the following:
yesDf = pd.DataFrame()
noDf = pd.DataFrame()

for ids in IDs:
    for index in Date:
        if ((df.iloc[Date - 1]['Date'] - df.iloc[0]['Date']).days <= 2):
            if (df.iloc[index]['Val'] - df.iloc[index - 1]['Val'] >= 0.3):
                yesDf += IDs['ID']
            noDf += IDs['ID']
        if ((df.iloc[Date - 1]['Date'] - df.iloc[0]['Date']).days <= 7):
            if(df.iloc[Date - 1]['Val'] >= df.iloc[index]['Val'] * 1.5):
                yesDf += IDs['ID']
            noDf += IDs['ID']

print(yesDf)
print(noDf)
I get these errors:
TypeError: incompatible type for a datetime/timedelta operation [sub]
and
pandas.errors.NullFrequencyError: Cannot shift with no freq
How can I solve this problem?
Thank you
Edit: my dataframe
Val ID Date
2199 0.90 0000.0 2017-12-26 11:00:01
2201 1.35 0001.0 2017-12-26 11:00:01
63540 0.72 0001.0 2018-08-10 11:53:01
68425 0.86 0001.0 2018-10-14 08:33:01
42444 0.99 0002.0 2018-02-01 09:25:53
41474 1.05 0002.0 2018-04-01 08:00:04
42148 1.19 0002.0 2018-07-01 08:50:00
24291 1.01 0004.0 2017-01-01 08:12:02
For example: for ID 0001.0 the first value is 1.35, and in the next 2 days there is no increment of at least 0.3 from the start value, and in the next 7 days there is no value 1.5 times the first value, so it goes in the noDf DataFrame.
Also the dtypes:
Val float64
ID object
Date datetime64[ns]
Surname object
Name object
dtype: object
Edit: after the modified code, the results are:
Val ID Date Date_diff_cumsum Val_diff
24719 2.08 0118.0 2017-01-15 08:16:05 1.0 0.36
24847 2.17 0118.0 2017-01-16 07:23:04 1.0 0.45
25233 2.45 0118.0 2017-01-17 08:21:03 2.0 0.73
24749 2.95 0118.0 2017-01-18 09:49:09 3.0 1.23
17042 1.78 0129.0 2018-02-05 22:48:17 0.0 0.35
And it is correct. Now I only need to add each matching ID to a dataframe.
This answer should work assuming you start from the first value of an ID, so the first timestamp.
First, I added the 'Date_diff_cumsum' column, which stores the difference in days between the first date for the ID and the row's date:
df['Date_diff_cumsum'] = df.groupby('ID').Date.diff().dt.days
df['Date_diff_cumsum'] = df.groupby('ID').Date_diff_cumsum.cumsum().fillna(0)
Then, I add the 'Val_diff' column, which is the difference between the first value for an ID and the row's value:
df['Val_diff'] = df.groupby('ID')['Val'].transform(lambda x:x-x.iloc[0])
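As a side note, the same column could arguably be built in one step, measuring whole days from each ID's first timestamp directly instead of summing per-row day gaps (a sketch, assuming 'Date' is already datetime64[ns] as in your dtypes):
# Days elapsed since the first Date of each ID
df['Date_diff_cumsum'] = (df['Date'] - df.groupby('ID')['Date'].transform('first')).dt.days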
Here is what I get after adding the columns for your sample DataFrame:
Val ID Date Date_diff_cumsum Val_diff
0 0.90 0.0 2017-12-26 11:00:01 0.0 0.00
1 1.35 1.0 2017-12-26 11:00:01 0.0 0.00
2 0.72 1.0 2018-08-10 11:53:01 227.0 -0.63
3 0.86 1.0 2018-10-14 08:33:01 291.0 -0.49
4 0.99 2.0 2018-02-01 09:25:53 0.0 0.00
5 1.05 2.0 2018-04-01 08:00:04 58.0 0.06
6 1.19 2.0 2018-07-01 08:50:00 149.0 0.20
7 1.01 4.0 2017-01-01 08:12:02 0.0 0.00
And finally, return the rows which satisfy the conditions in your question:
df[((df['Val_diff']>=0.3) & (df['Date_diff_cumsum']<=2)) |
((df['Val'] >= 1.5*(df['Val']-df['Val_diff'])) & (df['Date_diff_cumsum']<=7))]
In this case, it will return no rows.
yesDf = df[((df['Val_diff']>=0.3) & (df['Date_diff_cumsum']<=2)) |
((df['Val'] >= 1.5*(df['Val']-df['Val_diff'])) & (df['Date_diff_cumsum']<=7))].ID.drop_duplicates().to_frame()
noDf = df[~(((df['Val_diff']>=0.3) & (df['Date_diff_cumsum']<=2)) |
            ((df['Val'] >= 1.5*(df['Val']-df['Val_diff'])) & (df['Date_diff_cumsum']<=7)))].ID.drop_duplicates().to_frame()
yesDf contains the IDs that satisfy the condition, and noDf the ones that don't.
I hope this answers your question!
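Note that with the row-level filters above, an ID whose rows are split between matching and non-matching would show up in both frames. If noDf should simply hold every ID that is not in yesDf, a sketch of that variant (reusing the yesDf built above) would be:
# IDs that never satisfy the condition, i.e. the complement of yesDf
noDf = df.loc[~df['ID'].isin(yesDf['ID']), 'ID'].drop_duplicates().to_frame()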
I have a dataframe where one column looks like
Average Weight (Kg)
0.647
0.88
0
0.73
1.7 - 2.1
1.2 - 1.5
2.5
NaN
1.5 - 1.9
1.3 - 1.5
0.4
1.7 - 2.9
Reproducible data
import numpy as np
import pandas as pd

df = pd.DataFrame([0.647, 0.88, 0, 0.73, '1.7 - 2.1', '1.2 - 1.5', 2.5, np.NaN,
                   '1.5 - 1.9', '1.3 - 1.5', 0.4, '1.7 - 2.9'],
                  columns=['Average Weight (Kg)'])
where I would like to take the average of range entries and replace them in the dataframe, e.g. 1.7 - 2.1 will be replaced by 1.9. The following code doesn't work (TypeError: 'float' object is not iterable):
np.where(df['Average Weight (Kg)'].str.contains('-'),
         df['Average Weight (Kg)'].str.split('-')
             .apply(lambda x: statistics.mean((list(map(float, x))))),
         df['Average Weight (Kg)'])
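For what it's worth, the TypeError comes from the rows that hold plain floats: .str.split('-') returns NaN for them, and map(float, NaN) then fails. One way to keep the np.where approach is to cast the column to string first (a sketch, reusing the statistics import the snippet above already assumes):
import statistics
import numpy as np

col = df['Average Weight (Kg)'].astype(str)  # every entry becomes a string, NaN -> 'nan'
df['Average Weight (Kg)'] = np.where(
    col.str.contains('-'),                                                       # range entries like '1.7 - 2.1'
    col.str.split('-').apply(lambda parts: statistics.mean(map(float, parts))),  # mean of the endpoints
    df['Average Weight (Kg)'],                                                   # otherwise keep the original value
)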
Another possible solution, which is based on the following ideas:
Convert column to string.
Split each cell by \s-\s.
Explode column.
Convert back to float.
Group by and mean.
df['Average Weight (Kg)'] = df['Average Weight (Kg)'].astype(
    str).str.split(r'\s-\s').explode().astype(float).groupby(level=0).mean()
Output:
Average Weight (Kg)
0 0.647
1 0.880
2 0.000
3 0.730
4 1.900
5 1.350
6 2.500
7 NaN
8 1.700
9 1.400
10 0.400
11 2.300
edit: slight change to avoid creating a new column
You could go for something like this (I renamed your column to avg, because the original name was long to type :-) ):
new_average =(df.avg.str.split('-').str[1].astype(float) + df.avg.str.split('-').str[0].astype(float) ) / 2
df["avg"] = new_average.fillna(df.avg)
yields for avg:
0 0.647
1 0.880
2 0.000
3 0.730
4 1.900
5 1.350
6 2.500
7 NaN
8 1.700
9 1.400
10 0.400
11 2.300
Name: avg, dtype: float64
I have a df like the one below, and I want to create a dayshigh column. This column should show, for each row, how many consecutive previous rows have a lower high value.
date high
05-06-20 1.85
08-06-20 1.88
09-06-20 2
10-06-20 2.11
11-06-20 2.21
12-06-20 2.17
15-06-20 1.99
16-06-20 2.15
17-06-20 16
18-06-20 9
19-06-20 14.67
should be like:
date high dayshigh
05-06-20 1.85 nan
08-06-20 1.88 1
09-06-20 2 2
10-06-20 2.11 3
11-06-20 2.21 4
12-06-20 2.17 0
15-06-20 1.99 0
16-06-20 2.15 1
17-06-20 16 8
18-06-20 9 0
19-06-20 14.67 1
I am using the code below, but it shows an error somehow:
df["DaysHigh"] = np.repeat(0, len(df))
for i in range(0, len(df)):
for j in range(df["DaysHigh"][i].index, len(df)):
if df["high"][i] > df["high"][i-1]:
df["DaysHigh"][i] = df["DaysHigh"][i-1] + 1
else:
df["DaysHigh"][i] = 0
Where am I going wrong? Thank you
Is the dayshigh number for 17-06-20 supposed to be 2 instead of 8? If so, you can basically use the code you had already written here. There are three changes I'm making below:
starting i from 1 instead of 0 to avoid trying to access the -1th element
removing the loop over j (doesn't seem to be necessary)
using loc to set the values instead of df["high"][i] -- you'll see this should resolve the warnings about copies and slices.
Keeping the first line the same as before:
for i in range(1, len(df)):
    if df["high"][i] > df["high"][i-1]:
        df.loc[i, "DaysHigh"] = df["DaysHigh"][i-1] + 1
    else:
        df.loc[i, "DaysHigh"] = 0
Procedure:
Use pandas.shift() to compare each row with the previous one and store the result in a temporary column.
Calculate the cumulative sum of that column, restarting the count whenever the comparison fails.
Delete the temporary column if it is not needed.
df['tmp'] = np.where(df['high'] >= df['high'].shift(), 1, np.NaN)
df['dayshigh'] = df['tmp'].groupby(df['tmp'].isna().cumsum()).cumsum()
df.drop('tmp', axis=1, inplace=True)
df
date high dayshigh
0 05-06-20 1.85 NaN
1 08-06-20 1.88 1.0
2 09-06-20 2.00 2.0
3 10-06-20 2.11 3.0
4 11-06-20 2.21 4.0
5 12-06-20 2.17 NaN
6 15-06-20 1.99 NaN
7 16-06-20 2.15 1.0
8 17-06-20 16.00 2.0
9 18-06-20 9.00 NaN
10 19-06-20 14.67 1.0
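If you prefer zeros instead of NaN for the reset rows (to match the expected output apart from the very first row), one option is to fill them afterwards, e.g.:
# Keep the very first row as NaN, replace the other NaN reset markers with 0
df.loc[df.index[1:], 'dayshigh'] = df['dayshigh'].iloc[1:].fillna(0)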
Well, I think I solved it myself. Here is my solution:
df["DaysHigh"] = np.repeat(0, len(df))
for i in range(0, len(df)):
#for i in range(len(df)-1000, len(df)):
for j in reversed(range(i)):
if df["high"][i] > df["high"][j]:
df["DaysHigh"][i] = df["DaysHigh"][i] + 1
else:
break
print(df)
date high dayshigh
05-06-20 1.85 nan
08-06-20 1.88 1
09-06-20 2.00 2
10-06-20 2.11 3
11-06-20 2.21 4
12-06-20 2.17 0
15-06-20 1.99 0
16-06-20 2.15 1
17-06-20 16.00 8
18-06-20 9.00 0
19-06-20 14.67 1
How to calculate MAX and MIN for 3 or more dataframes
I can calculate the difference between prices simply by adding this line of code:
diff['list2-list1'] = diff['price2'] - diff['price1']
but it does not work to calculate MIN with
diff['min'] = (df1,df2.df3).min()
or
diff['min'] = (diff['price2'],diff['price1'],diff['price3']).min()
or
diff['min'] = (diff['price2'],diff['price1'],diff['price3']).idxmin()
and it does not print the result of the if in a new column when the latest list (list3) has the minimum value:
if diff['min'] == diff['price3']
    diff['Lowest now?'] = "yes"
The Python code I have:
import pandas
import numpy as np
import csv
from csv_diff import load_csv, compare
df1 = pandas.read_csv('list1.csv')
df1['version'] = 'list1'
df2 = pandas.read_csv('list2.csv')
df2['version'] = 'list2'
df3 = pandas.read_csv('list3.csv')
df3['version'] = 'list3'
# keep only columns 'version', 'ean', 'price'
diff = df1.append([df2,df3])[['version', 'ean','price']]
# keep only duplicated eans, which will only occur
# for eans in both original lists
diff = diff[diff['ean'].duplicated(keep=False)]
# perform a pivot https://pandas.pydata.org/pandas-docs/stable/user_guide/reshaping.html
diff = diff.pivot_table(index='ean', columns='version', values='price', aggfunc='first')
# back to a normal dataframe
diff = diff.reset_index()
diff.columns.name = None
# rename columns and keep only what we want
diff = diff.rename(columns={'list1': 'price1', 'list2': 'price2', 'list3': 'price3'})[['ean', 'price1', 'price2','price3']]
diff['list2-list1'] = diff['price2'] - diff['price1']
diff['list3-list2'] = diff['price3'] - diff['price2']
diff['min'] = (df1,df2).min()
if diff['min'] == diff['price3']
    diff['Lowest now?'] = "yes"
diff.to_csv('diff.csv')
more information
headers of list1, list2, list3 are the same
price,ean,unit
example of list1
price,ean,unit
143.80,2724316972629,0
125.00,2724456127521,0
158.00,2724280705919,0
19.99,2724342954019,0
20.00,2724321942662,0
212.00,2724559841560,0
1322.98,2724829673686
example of list2
price,ean,unit
55.80,2724316972629,0
15.00,2724456127521,0
66.00,2724559841560,0
1622.98,2724829673686,0
example of list3
price,ean,unit
139.99,2724342954019,0
240.00,2724321942662,0
252.00,2724559841560,0
1422.98,2724829673686,0
There you go:
data = pd.concat([df1, df2, df3], axis=1).fillna(0).astype('float')
data['minimum_price'] = data['price'].min(1)
data['maximum_price'] = data['price'].max(1)
Out:
price ean units price ean units price ean units minimum_price maximum_price
0 143.80 2.724317e+12 0.0 55.80 2.724317e+12 0.0 139.99 2.724343e+12 0.0 55.80 143.80
1 125.00 2.724456e+12 0.0 15.00 2.724456e+12 0.0 240.00 2.724322e+12 0.0 15.00 240.00
2 158.00 2.724281e+12 0.0 66.00 2.724560e+12 0.0 252.00 2.724560e+12 0.0 66.00 252.00
3 19.99 2.724343e+12 0.0 1622.98 2.724830e+12 0.0 1422.98 2.724830e+12 0.0 19.99 1622.98
4 20.00 2.724322e+12 0.0 0.00 0.000000e+00 0.0 0.00 0.000000e+00 0.0 0.00 20.00
5 212.00 2.724560e+12 0.0 0.00 0.000000e+00 0.0 0.00 0.000000e+00 0.0 0.00 212.00
6 1322.98 2.724830e+12 0.0 0.00 0.000000e+00 0.0 0.00 0.000000e+00 0.0 0.00 1322.98
Assuming the dataframes have the same columns, you can use pd.concat:
min_values = pd.concat([df1, df2, df3]).min()
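For the pivoted diff frame built in the question, a row-wise minimum across the three price columns, plus the conditional column, could be sketched like this (price1/price2/price3 being the columns created by the rename above):
import numpy as np

price_cols = ['price1', 'price2', 'price3']
diff['min'] = diff[price_cols].min(axis=1)                     # row-wise minimum across the three lists
diff['Lowest now?'] = np.where(diff['min'] == diff['price3'],  # flag rows where list3 holds the minimum
                               'yes', '')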
I have a df as below. Consider that df is indexed by timestamps with dtype='datetime64[ns]', i.e. 1970-01-01 00:00:27.603046999. I am putting dummy timestamps here.
Timestamp Address Type Arrival_Time Time_Delta
0.1 2 A 0.25 0.15
0.4 3 B 0.43 0.03
0.9 1 B 1.20 0.20
1.3 1 A 1.39 0.09
1.5 3 A 1.64 0.14
1.7 3 B 1.87 0.17
2.0 3 A 2.09 0.09
2.1 1 B 2.44 0.34
I have three unique "addresses" (1, 2,3).
I have two unique "types" (A, B)
Now I am trying to do two things in a simple way (possibly using pd.Grouper and groupby in pandas).
I want to group rows into fixed bins of 1 second duration (using the timestamp values). Then, in each 1-second bin, for each "address", find the mean and sum of "Time_Delta" only if "Type" = A.
I want to group rows into fixed bins of 1 second duration (using the timestamp values). Then, in each bin, for each "address", find the mean and sum of the Inter-Arrival Time*.
IAT = Arrival Time (i) - Arrival Time (i-1)
Note: If the timestamps span 100 seconds, we should have exactly 100 rows in the output dataframe and six columns, i.e. two (mean, sum) for each address.
For Problem 1:
I tried the following code:
df = pd.DataFrame({'Timestamp': Timestamp, 'Address': Address,
                   'Type': Type, 'Arrival_Time': Arrival_time, 'Time_Delta': Time_delta})

# Set index to Datetime
index = pd.DatetimeIndex(df[df.columns[3]]*10**9)  # Convert timestamp into format
df = df.set_index(index)  # Set timestamp as index

df_1 = df[df.columns[2]].groupby([pd.TimeGrouper('1S'), df['Address']]).mean().unstack(fill_value=0)
which gives results:
Timestamp 1 2 3
1970-01-01 00:00:00 0.20 0.15 0.030
1970-01-01 00:00:01 0.09 0.00 0.155
1970-01-01 00:00:02 0.34 0.00 0.090
As you can see, it gives the mean Time_Delta for each address in the 1S bin, but I want to add the second condition, i.e. find the mean for each address only if Type = A. I hope problem 1 is now clear.
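For what it's worth, a minimal sketch for problem 1 could look like this (assuming the DatetimeIndex set above, and using pd.Grouper, the newer name for pd.TimeGrouper):
# Keep only Type A rows, then aggregate Time_Delta per 1-second bin and per address
type_a = df[df['Type'] == 'A']
df_1a = (type_a.groupby([pd.Grouper(freq='1S'), 'Address'])['Time_Delta']
               .agg(['mean', 'sum'])
               .unstack(fill_value=0))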
For Problem 2:
It's a bit complicated. I want to get the mean IAT for each address in the same format (see below).
One possible way is to add an extra column to the original df as df['IAT'], for example:
df['IAT'] = 0
for i in range(1, len(df)):
    df.loc[df.index[i], 'IAT'] = df['Arrival_Time'].iloc[i] - df['Arrival_Time'].iloc[i-1]
Then apply the same code as above to find the mean IAT for each address if Type = A.
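A sketch along the same lines for problem 2 (assuming Arrival_Time is numeric or datetime-like, so that diff() gives the inter-arrival time directly):
# Inter-arrival time between consecutive rows, then the same 1-second binning per address
df['IAT'] = df['Arrival_Time'].diff()
df_2 = (df.groupby([pd.Grouper(freq='1S'), 'Address'])['IAT']
          .agg(['mean', 'sum'])
          .unstack(fill_value=0))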
Actual Data
Timestamp Address Type Time Delta Arrival Time
1970-01-01 00:00:00.000000000 28:5a:ec:16:00:22 Control frame 0.000000 Nov 10, 2017 22:39:20.538561000
1970-01-01 00:00:00.000287000 28:5a:ec:16:00:23 Data frame 0.000287 Nov 10, 2017 22:39:20.548121000
1970-01-01 00:00:00.000896000 28:5a:ec:16:00:22 Control frame 0.000609 Nov 10, 2017 22:39:20.611256000
1970-01-01 00:00:00.001388000 28:5a:ec:16:00:21 Data frame 0.000492 Nov 10, 2017 22:39:20.321745000
... ...
I will try and explain the problem I am currently having concerning cumulative sums on DataFrames in Python, and hopefully you'll grasp it!
Given a pandas DataFrame df with a column returns as such:
returns
Date
2014-12-10 0.0000
2014-12-11 0.0200
2014-12-12 0.0500
2014-12-15 -0.0200
2014-12-16 0.0000
Applying a cumulative sum on this DataFrame is easy, just using e.g. df.cumsum(). But is it possible to apply a cumulative sum every X days (or data points), say, yielding only the cumulative sum of the last Y days (data points)?
Clarification: Given daily data as above, how do I get the accumulated sum of the last Y days, re-evaluated (from zero) every X days?
Hope it's clear enough,
Thanks,
N
"Every X days" and "every X data points" are very different; the following assumes you really mean the first, since you mention it more frequently.
If the index is a DatetimeIndex, you can resample to a daily frequency, take a rolling_sum, and then select only the original dates:
>>> pd.rolling_sum(df.resample("1d"), 2, min_periods=1).loc[df.index]
returns
Date
2014-12-10 0.00
2014-12-11 0.02
2014-12-12 0.07
2014-12-15 -0.02
2014-12-16 -0.02
or, step by step:
>>> df.resample("1d")
returns
Date
2014-12-10 0.00
2014-12-11 0.02
2014-12-12 0.05
2014-12-13 NaN
2014-12-14 NaN
2014-12-15 -0.02
2014-12-16 0.00
>>> pd.rolling_sum(df.resample("1d"), 2, min_periods=1)
returns
Date
2014-12-10 0.00
2014-12-11 0.02
2014-12-12 0.07
2014-12-13 0.05
2014-12-14 NaN
2014-12-15 -0.02
2014-12-16 -0.02
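As a side note, pd.rolling_sum and the bare resample("1d") call have since been removed from pandas; an equivalent sketch for recent versions might be:
# Resample to daily frequency, take a 2-row (2-day) rolling sum, then keep only the original dates
daily = df.resample("1D").mean()
result = daily.rolling(2, min_periods=1).sum().loc[df.index]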
The way I would do it is with helper columns. It's a little kludgy but it should work:
numgroups = int(len(df)/(x-1))                               # rough number of x-row groups needed to cover df
df['groupby'] = sorted(list(range(numgroups))*x)[:len(df)]   # group id for every block of x consecutive rows
df['mask'] = (([0]*(x-y)+[1]*(y))*numgroups)[:len(df)]       # keep only the last y rows of each group
df['masked'] = df.returns*df['mask']                         # zero out the rows excluded by the mask
df.groupby('groupby').masked.cumsum()                        # cumulative sum restarts with every group
I am not sure if there is a built-in method, but it does not seem very difficult to write one.
For example, here is one for a pandas Series:
def cum(df, interval):
    all = []
    quotient = len(df)//interval
    intervals = range(quotient)
    for i in intervals:
        all.append(df[0:(i+1)*interval].sum())
    return pd.Series(all)
>>>s1 = pd.Series(range(20))
>>>print(cum(s1, 4))
0 6
1 28
2 66
3 120
4 190
dtype: int64
Thanks to @DSM I managed to come up with a variation of his solution that actually does pretty much what I was looking for:
import numpy as np
import pandas as pd
df.resample("1w"), how={'A': np.sum})
Yields what I want for the example below:
rng = range(1,29)
dates = pd.date_range('1/1/2000', periods=len(rng))
r = pd.DataFrame(rng, index=dates, columns=['A'])
r2 = r.resample("1w", how={'A': np.sum})
Outputs:
>> print r
A
2000-01-01 1
2000-01-02 2
2000-01-03 3
2000-01-04 4
2000-01-05 5
2000-01-06 6
2000-01-07 7
2000-01-08 8
2000-01-09 9
2000-01-10 10
2000-01-11 11
...
2000-01-25 25
2000-01-26 26
2000-01-27 27
2000-01-28 28
>> print r2
A
2000-01-02 3
2000-01-09 42
2000-01-16 91
2000-01-23 140
2000-01-30 130
Even though it doesn't start "one week in" in this case (resulting in a sum of 3 for the very first entry), it always gets the correct rolling sum, starting on the previous date with an initial value of zero.
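For what it's worth, the how= argument of resample was later removed from pandas; on recent versions the same weekly sum might be written as:
# Weekly bins, summing column 'A' within each bin
r2 = r.resample("1W").agg({'A': 'sum'})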