Calculate time difference depending on another column (Power BI) - python

I want to calculate a time difference. Each row has a timestamp and is marked as either incoming or outgoing,
e.g.
Msg  Timestamp            StateAfter
A    01.01.2019 11:02:02  1
B    01.01.2019 11:02:03  1
A    01.01.2019 11:02:05  1
A    01.01.2019 11:02:06  0
B    01.01.2019 11:02:08  0
A    01.01.2019 11:02:09  1
B    01.01.2019 11:02:10  1
A    01.01.2019 11:02:11  0
B    01.01.2019 11:02:12  0
I tried to solve the problem with an index, but the messages don't always occur directly on top of each other (an incoming row immediately followed by its outgoing row).
(StateAfter column: 0 = outgoing, 1 = incoming.) The calculated time per pair should look like this:
Msg  Timestamp            StateAfter  Calculated Time
A    01.01.2019 11:02:02  1
B    01.01.2019 11:02:03  1
A    01.01.2019 11:02:05  1
A    01.01.2019 11:02:06  0           00:00:04
B    01.01.2019 11:02:08  0           00:00:05
A    01.01.2019 11:02:09  1
B    01.01.2019 11:02:10  1
A    01.01.2019 11:02:11  0           00:00:02
B    01.01.2019 11:02:12  0           00:00:02
The result should be:
Msg  total-time
A    00:00:06
B    00:00:07
Thanks a lot!
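No answer is included above, but one way to get these totals in pandas is to pair each outgoing row (StateAfter 0) with the first still-open incoming row (StateAfter 1) for the same Msg. A minimal sketch, assuming the rows are sorted by Timestamp and that an outgoing row closes the earliest unmatched incoming row (which reproduces the 00:00:04 for A above):

import pandas as pd

df = pd.DataFrame({
    "Msg": ["A", "B", "A", "A", "B", "A", "B", "A", "B"],
    "Timestamp": pd.to_datetime([
        "01.01.2019 11:02:02", "01.01.2019 11:02:03", "01.01.2019 11:02:05",
        "01.01.2019 11:02:06", "01.01.2019 11:02:08", "01.01.2019 11:02:09",
        "01.01.2019 11:02:10", "01.01.2019 11:02:11", "01.01.2019 11:02:12",
    ], format="%d.%m.%Y %H:%M:%S"),
    "StateAfter": [1, 1, 1, 0, 0, 1, 1, 0, 0],
})

totals = {}      # Msg -> accumulated Timedelta
open_since = {}  # Msg -> timestamp of the first incoming row not yet closed
for row in df.sort_values("Timestamp").itertuples():
    if row.StateAfter == 1:
        # keep only the first incoming timestamp since the last outgoing row
        open_since.setdefault(row.Msg, row.Timestamp)
    elif row.Msg in open_since:
        start = open_since.pop(row.Msg)
        totals[row.Msg] = totals.get(row.Msg, pd.Timedelta(0)) + (row.Timestamp - start)

print(pd.Series(totals, name="total-time"))
# A   0 days 00:00:06
# B   0 days 00:00:07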

Related

Python delete rows for each group after first occurrence in a column

I have a dataframe as follows:
df = pd.DataFrame({'Key': [1, 1, 1, 1, 2, 2, 2, 4, 4, 4, 5, 5],
                   'Activity': ['A', 'A', 'H', 'B', 'B', 'H', 'H', 'A', 'C', 'H', 'H', 'B'],
                   'Date': ['2022-12-03', '2022-12-04', '2022-12-06', '2022-12-08',
                            '2022-12-03', '2022-12-06', '2022-12-10', '2022-12-03',
                            '2022-12-04', '2022-12-07', '2022-12-03', '2022-12-13']})
I need to count the activities for each 'Key' that occur before 'Activity' == 'H':
Required Output
My approach:
Sort df by Key & Date (the sample input is already sorted).
Drop the rows that occur after the 'H' activity in each group.
Group by: df.groupby(['Key', 'Activity']).count()
Is there a better approach? If not, please help me with the code for dropping the rows that occur after the 'H' activity in each group.
Thanks in advance!
You can bring the H dates "back" into each previous row to use in a comparison.
First mark each H date in a new column (the Date column is assumed to have been parsed with pd.to_datetime, which is why the empty entries below show as NaT):
df.loc[df["Activity"] == "H", "End"] = df["Date"]
Key Activity Date End
0 1 A 2022-12-03 NaT
1 1 A 2022-12-04 NaT
2 1 H 2022-12-06 2022-12-06
3 1 B 2022-12-08 NaT
4 2 B 2022-12-03 NaT
5 2 H 2022-12-06 2022-12-06
6 2 H 2022-12-10 2022-12-10
7 4 A 2022-12-03 NaT
8 4 C 2022-12-04 NaT
9 4 H 2022-12-07 2022-12-07
10 5 H 2022-12-03 2022-12-03
11 5 B 2022-12-13 NaT
Backward fill the new column for each group:
df["End"] = df.groupby("Key")["End"].bfill()
Key Activity Date End
0 1 A 2022-12-03 2022-12-06
1 1 A 2022-12-04 2022-12-06
2 1 H 2022-12-06 2022-12-06
3 1 B 2022-12-08 NaT
4 2 B 2022-12-03 2022-12-06
5 2 H 2022-12-06 2022-12-06
6 2 H 2022-12-10 2022-12-10
7 4 A 2022-12-03 2022-12-07
8 4 C 2022-12-04 2022-12-07
9 4 H 2022-12-07 2022-12-07
10 5 H 2022-12-03 2022-12-03
11 5 B 2022-12-13 NaT
You can then select the rows whose Date falls strictly before End:
df.loc[df["Date"] < df["End"]]
Key Activity Date End
0 1 A 2022-12-03 2022-12-06
1 1 A 2022-12-04 2022-12-06
4 2 B 2022-12-03 2022-12-06
7 4 A 2022-12-03 2022-12-07
8 4 C 2022-12-04 2022-12-07
To generate the final form, you can use .pivot_table():
(df.loc[df["Date"] < df["End"]]
 .pivot_table(index="Key", columns="Activity", values="Date", aggfunc="count")
 .reindex(df["Key"].unique())  # add in keys with no match, e.g. 5
 .fillna(0)
 .astype(int))
Activity A B C
Key
1 2 0 0
2 0 1 0
4 1 0 1
5 0 0 0
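Putting those steps together into one runnable snippet (a sketch; it assumes Date should be parsed with pd.to_datetime first):

import pandas as pd

df = pd.DataFrame({'Key': [1, 1, 1, 1, 2, 2, 2, 4, 4, 4, 5, 5],
                   'Activity': ['A', 'A', 'H', 'B', 'B', 'H', 'H', 'A', 'C', 'H', 'H', 'B'],
                   'Date': ['2022-12-03', '2022-12-04', '2022-12-06', '2022-12-08',
                            '2022-12-03', '2022-12-06', '2022-12-10', '2022-12-03',
                            '2022-12-04', '2022-12-07', '2022-12-03', '2022-12-13']})
df['Date'] = pd.to_datetime(df['Date'])

# mark each H date, bring it back onto the earlier rows, keep rows before it
df.loc[df['Activity'] == 'H', 'End'] = df['Date']
df['End'] = df.groupby('Key')['End'].bfill()

result = (df.loc[df['Date'] < df['End']]
          .pivot_table(index='Key', columns='Activity', values='Date', aggfunc='count')
          .reindex(df['Key'].unique())  # add in keys with no match, e.g. 5
          .fillna(0)
          .astype(int))
print(result)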
Try this:
(df.loc[df['Activity'].eq('H').groupby(df['Key']).cumsum().eq(0)]
 .set_index('Key')['Activity']
 .str.get_dummies()
 .groupby(level=0).sum()
 .reindex(df['Key'].unique(), fill_value=0)
 .reset_index())
Output:
Key A B C
0 1 2 0 0
1 2 0 1 0
2 4 1 0 1
3 5 0 0 0
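The filter works because the running count of 'H' rows within each Key is still 0 on exactly the rows before the first 'H'. Illustrated on the sample frame:

mask = df['Activity'].eq('H').groupby(df['Key']).cumsum()
print(mask.tolist())
# [0, 0, 1, 1, 0, 1, 2, 0, 0, 1, 1, 1]  -- .eq(0) keeps the rows with a 0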
You can try:
# sort by Key and Date
df.sort_values(['Key', 'Date'], inplace=True)
# this is to keep Key in the result when no values are kept after the filter
df.Key = df.Key.astype('category')
# filter all rows after the 1st H for each Key and then pivot
df[~df.Activity.eq('H').groupby(df.Key).cummax()].pivot_table(
    index='Key', columns='Activity', aggfunc='size'
).reset_index()
#Activity Key A B C
#0 1 2 0 0
#1 2 0 1 0
#2 4 1 0 1
#3 5 0 0 0

Group periodic data in pandas dataframe

I have a pandas dataframe that looks like this:
idx                A  B
01/01/01 00:00:01  5  2
01/01/01 00:00:02  4  5
01/01/01 00:00:03  5  4
02/01/01 00:00:01  3  8
02/01/01 00:00:02  7  4
02/01/01 00:00:03  1  3
I would like to group data based on its periodicity such that the final dataframe is:
new_idx   01/01/01  02/01/01  old_column
00:00:01  5         3         A
00:00:02  4         7         A
00:00:03  5         1         A
00:00:01  2         8         B
00:00:02  5         4         B
00:00:03  4         3         B
Is there a way to do this that still holds up when the first dataframe gets big (more columns, more periods, and more samples)?
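For reference, the sample frame used in the answer below can be built like this (a sketch; idx is assumed to be a plain string column rather than the index):

import pandas as pd

df = pd.DataFrame({'idx': ['01/01/01 00:00:01', '01/01/01 00:00:02', '01/01/01 00:00:03',
                           '02/01/01 00:00:01', '02/01/01 00:00:02', '02/01/01 00:00:03'],
                   'A': [5, 4, 5, 3, 7, 1],
                   'B': [2, 5, 4, 8, 4, 3]})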
One way is to melt the DataFrame, then split the datetime to dates and times; finally pivot the resulting DataFrame for the final output:
df = df.melt('idx', var_name='old_column')
df[['date', 'new_idx']] = df['idx'].str.split(expand=True)
out = (df.pivot(index=['new_idx', 'old_column'], columns='date', values='value')
         .reset_index()
         .rename_axis(columns=[None])
         .sort_values(by='old_column'))
Output
new_idx old_column 01/01/01 02/01/01
0 00:00:01 A 5 3
2 00:00:02 A 4 7
4 00:00:03 A 5 1
1 00:00:01 B 2 8
3 00:00:02 B 5 4
5 00:00:03 B 4 3

Sum a column based on groupby and condition

I have a dataframe with several columns, and I want to sum the "gap" column within certain time slots.
region. date. time. gap
0 1 2016-01-01 00:00:08 1
1 1 2016-01-01 00:00:48 0
2 1 2016-01-01 00:02:50 1
3 1 2016-01-01 00:00:52 0
4 1 2016-01-01 00:10:01 0
5 1 2016-01-01 00:10:03 1
6 1 2016-01-01 00:10:05 0
7 1 2016-01-01 00:10:08 0
I want to sum the gap column. I have time slots in a dict like this:
{'slot1': '00:00:00', 'slot2': '00:10:00', 'slot3': '00:20:00'}
After summation, the above dataframe should look like this:
region. date. time. gap
0 1 2016-01-01 00:10:00/slot1 2
1 1 2016-01-01 00:20:00/slot2 1
I have many regions and 144 time slots from 00:00:00 to 23:59:49. I have tried this:
regres=reg.groupby(['start_region_hash','Date','Time'])['Time'].apply(lambda x: (x >= hoursdict['slot1']) & (x <= hoursdict['slot2'])).sum()
But it doesn't work.
The idea is to convert the time column to datetimes, floor them to 10 minutes, then convert back to HH:MM:SS strings:
d = {'slot1': '00:00:00', 'slot2': '00:10:00', 'slot3': '00:20:00'}
d1 = {v:k for k, v in d.items()}
df['time'] = pd.to_datetime(df['time']).dt.floor('10Min').dt.strftime('%H:%M:%S')
print (df)
region date time gap
0 1 2016-01-01 00:00:00 1
1 1 2016-01-01 00:00:00 0
2 1 2016-01-01 00:00:00 1
3 1 2016-01-01 00:00:00 0
4 1 2016-01-01 00:10:00 0
5 1 2016-01-01 00:10:00 1
6 1 2016-01-01 00:10:00 0
7 1 2016-01-01 00:10:00 0
Then aggregate the sum and finally map the values with the swapped dictionary:
regres = df.groupby(['region','date','time'], as_index=False)['gap'].sum()
regres['time'] = regres['time'] + '/' + regres['time'].map(d1)
print (regres)
region date time gap
0 1 2016-01-01 00:00:00/slot1 2
1 1 2016-01-01 00:10:00/slot2 1
If you want to display the next 10-minute slot instead:
d = {'slot1': '00:00:00', 'slot2': '00:10:00', 'slot3': '00:20:00'}
d1 = {v:k for k, v in d.items()}
times = pd.to_datetime(df['time']).dt.floor('10Min')
df['time'] = times.dt.strftime('%H:%M:%S')
df['time1'] = times.add(pd.Timedelta('10Min')).dt.strftime('%H:%M:%S')
print (df)
region date time gap time1
0 1 2016-01-01 00:00:00 1 00:10:00
1 1 2016-01-01 00:00:00 0 00:10:00
2 1 2016-01-01 00:00:00 1 00:10:00
3 1 2016-01-01 00:00:00 0 00:10:00
4 1 2016-01-01 00:10:00 0 00:20:00
5 1 2016-01-01 00:10:00 1 00:20:00
6 1 2016-01-01 00:10:00 0 00:20:00
7 1 2016-01-01 00:10:00 0 00:20:00
regres = df.groupby(['region','date','time','time1'], as_index=False)['gap'].sum()
regres['time'] = regres.pop('time1') + '/' + regres['time'].map(d1)
print (regres)
region date time gap
0 1 2016-01-01 00:10:00/slot1 2
1 1 2016-01-01 00:20:00/slot2 1
EDIT:
An improvement over flooring and converting to strings is binning with cut or searchsorted:
import numpy as np

df['time'] = pd.to_timedelta(df['time'])
bins = pd.timedelta_range('00:00:00', '24:00:00', freq='10Min')
labels = np.array([str(x)[-8:] for x in bins])
labels = labels[:-1]
df['time1'] = pd.cut(df['time'], bins=bins, labels=labels)
df['time11'] = labels[np.searchsorted(bins, df['time'].values) - 1]
Just to avoid the complication of the datetime comparison (unless that is your whole point, in which case ignore my answer) and to show the essence of this group-by-slot-window problem, I assume here that the times are integers.
df = pd.DataFrame({'time': [8, 48, 250, 52, 1001, 1003, 1005, 1008, 2001, 2003, 2056],
                   'gap': [1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 1]})
slots = np.array([0, 1000, 1500])
df['slot'] = df.apply(lambda x: slots[np.argmax(slots[x['time'] > slots])], axis=1)
df.groupby('slot')[['gap']].sum()
Output:
      gap
slot
0       2
1000    1
1500    3
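As an aside, the same slot assignment can be done without apply. A sketch using np.searchsorted, assuming slots is sorted ascending and no time is smaller than slots[0]:

# index of the right-most slot edge that is <= each time
df['slot'] = slots[np.searchsorted(slots, df['time'], side='right') - 1]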
The way to think about this problem is to first convert your time column to the values you want, then do a groupby sum on it.
The code below shows the approach I used. np.select lets you include as many conditions and choices as you want; after converting time to the values I wanted, I did a simple groupby sum.
None of the fuss of formatting times or converting strings is really needed here; the comparisons work on the strings as-is because zero-padded HH:MM:SS strings sort chronologically.
import pandas as pd
import numpy as np  # this is the library you need for the np.select function

# just creating the DataFrame from a dictionary here
regdict = {
    'time': ['00:00:08', '00:00:48', '00:02:50', '00:00:52',
             '00:10:01', '00:10:03', '00:10:05', '00:10:08'],
    'gap': [1, 0, 1, 0, 0, 1, 0, 0],
}
df = pd.DataFrame(regdict)
#Add in all your conditions and options here
condlist = [df['time']<'00:10:00',df['time']<'00:20:00']
choicelist = ['00:10:00/slot1','00:20:00/slot2']
#Use np.select after you have defined all your conditions and options
answerlist = np.select(condlist, choicelist)
print (answerlist)
['00:10:00/slot1' '00:10:00/slot1' '00:10:00/slot1' '00:10:00/slot1'
'00:20:00/slot2' '00:20:00/slot2' '00:20:00/slot2' '00:20:00/slot2']
#Assign answerlist to df['time']
df['time'] = answerlist
print (df)
             time  gap
0  00:10:00/slot1    1
1  00:10:00/slot1    0
2  00:10:00/slot1    1
3  00:10:00/slot1    0
4  00:20:00/slot2    0
5  00:20:00/slot2    1
6  00:20:00/slot2    0
7  00:20:00/slot2    0
df = df.groupby('time', as_index=False)['gap'].sum()
print (df)
             time  gap
0  00:10:00/slot1    2
1  00:20:00/slot2    1
If you wish to keep the original time you can instead do df['timeNew'] = answerlist and then filter from there.
df['timeNew'] = answerlist
print (df)
time gap timeNew
0 00:00:08 1 00:10:00/slot1
1 00:00:48 0 00:10:00/slot1
2 00:02:50 1 00:10:00/slot1
3 00:00:52 0 00:10:00/slot1
4 00:10:01 0 00:20:00/slot2
5 00:10:03 1 00:20:00/slot2
6 00:10:05 0 00:20:00/slot2
7 00:10:08 0 00:20:00/slot2
#Use transform function here to retain all prior values
df['aggregate sum of gap'] = df.groupby('timeNew')['gap'].transform(sum)
print (df)
time gap timeNew aggregate sum of gap
0 00:00:08 1 00:10:00/slot1 2
1 00:00:48 0 00:10:00/slot1 2
2 00:02:50 1 00:10:00/slot1 2
3 00:00:52 0 00:10:00/slot1 2
4 00:10:01 0 00:20:00/slot2 1
5 00:10:03 1 00:20:00/slot2 1
6 00:10:05 0 00:20:00/slot2 1
7 00:10:08 0 00:20:00/slot2 1
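One caveat (my addition, not from the original answer): rows matching none of the conditions get np.select's default value, which is 0, so they would silently end up with a non-string entry. You can pass an explicit default instead; 'other' here is just a made-up label:

answerlist = np.select(condlist, choicelist, default='other')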

Dataframe has a timestamp in every other column, how to get them into one column?

I have a dataframe I import from Excel that is n x n in size and looks like the following (sorry, I do not know how to easily reproduce it with code).
How do I get the timestamps into one column, like the following? (I've tried pivot.)
You can extract the data in column pairs, one pair per group, then rename the columns, add an "A"/"B"/"C" flag column, and concatenate the pieces together. See the test below:
abc_list = [["2017-10-01", 0, "2017-10-02", 1, "2017-10-03", 8],
            ["2017-11-01", 3, "2017-11-01", 5, "2017-11-05", 10],
            ["2017-12-01", 0, "2017-12-07", 7, "2017-12-07", 12]]
df = pd.DataFrame(abc_list, columns=["Time1", "A", "Time2", "B", "Time3", "C"])
The output:
Time1 A Time2 B Time3 C
0 2017-10-01 0 2017-10-02 1 2017-10-03 8
1 2017-11-01 3 2017-11-01 5 2017-11-05 10
2 2017-12-01 0 2017-12-07 7 2017-12-07 12
Then:
df_a=df.iloc[:,0:2].rename(columns={'Time1':'time','A':'value'})
df_a['flag']="A"
df_b=df.iloc[:,2:4].rename(columns={'Time2':'time','B':'value'})
df_b['flag']="B"
df_c=df.iloc[:,4:].rename(columns={'Time3':'time','C':'value'})
df_c['flag']="C"
df_final=pd.concat([df_a,df_b,df_c])
df_final.reset_index(drop=True)
output:
time value flag
0 2017-10-01 0 A
1 2017-11-01 3 A
2 2017-12-01 0 A
3 2017-10-02 1 B
4 2017-11-01 5 B
5 2017-12-07 7 B
6 2017-10-03 8 C
7 2017-11-05 10 C
8 2017-12-07 12 C
This is a somewhat unpythonic way to do it.
Here is another way:
columns = pd.MultiIndex.from_tuples(
    [('A', 'Time'), ('A', 'Value'), ('B', 'Time'),
     ('B', 'Value'), ('C', 'Time'), ('C', 'Value')],
    names=['Group', 'Sub_value'])
df.columns = columns
Output:
Group A B C
Sub_value Time Value Time Value Time Value
0 2017-10-01 0 2017-10-02 1 2017-10-03 8
1 2017-11-01 3 2017-11-01 5 2017-11-05 10
2 2017-12-01 0 2017-12-07 7 2017-12-07 12
Run:
df.stack(level='Group')
Output:
Sub_value Time Value
Group
0 A 2017-10-01 0
B 2017-10-02 1
C 2017-10-03 8
1 A 2017-11-01 3
B 2017-11-01 5
C 2017-11-05 10
2 A 2017-12-01 0
B 2017-12-07 7
C 2017-12-07 12
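If you would rather end up with ordinary columns than a MultiIndex, a reset_index on top gets you there (a sketch; 'level_0' is pandas' name for the old unnamed row level):

tidy = df.stack(level='Group').reset_index()
print(tidy.columns.tolist())  # ['level_0', 'Group', 'Time', 'Value']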
This is one method. It is fairly easy to extend to any number of columns.
import pandas as pd
# read in pairs of columns and assign a 'Category' column
dfs = {i: pd.read_excel('file.xlsx', usecols=[2*i, 2*i+1], skiprows=[0],
                        header=None, names=['Date', 'Value']).assign(Category=j)
       for i, j in enumerate(['A', 'B', 'C'])}
# concatenate the dataframes
df = pd.concat(list(dfs.values()), ignore_index=True)

pandas shifts column names and fills last column with NaN

I have a csv file that is tab delimited.
Example:
Rec# Cyc# Step Test (Sec) Step (Sec) Amp-hr Watt-hr Amps Volts State ES DPt Time
1 0 1 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 3.41214609 R 0 09:44:13
2 0 1 30.00000000 30.00000000 0.00000000 0.00000000 0.00000000 3.41077280 R 1 09:44:43
3 0 1 60.00000000 60.00000000 0.00000000 0.00000000 0.00000000 3.41077280 R 1 09:45:13
I read the csv in using:
import pandas as pd
df = pd.read_csv('foo.csv', sep='\t')
This gives the output:
Rec# Cyc# Step Test (Sec) Step (Sec) Amp-hr Watt-hr Amps Volts State ES DPt Time
1 0 1 0.00 0.00 0.000000 0.000000 0.000000 3.412146 R 0 09:44:13 NaN
2 0 1 30.00 30.00 0.000000 0.000000 0.000000 3.410773 R 1 09:44:43 NaN
3 0 1 60.00 60.00 0.000000 0.000000 0.000000 3.410773 R 1 09:45:13 NaN
This seems to have shifted my column names over by one and caused my last column to be filled with NaNs instead of dates.
If I do the following:
import pandas as pd
df = pd.read_csv("foo.csv", sep="\t")
df = pd.read_csv("foo.csv", sep="\t", usecols=df[:len(df.columns)])
I get the following output:
Rec# Cyc# Step Test (Sec) Step (Sec) Amp-hr Watt-hr Amps Volts State ES DPt Time
1 1 0 1 0.00 0.00 0.000000 0.000000 0.000000 3.412146 R 0 09:44:13
2 2 0 1 30.00 30.00 0.000000 0.000000 0.000000 3.410773 R 1 09:44:43
3 3 0 1 60.00 60.00 0.000000 0.000000 0.000000 3.410773 R 1 09:45:13
Also, if I try to grab just two specific columns, it works: df = pd.read_csv("foo.csv", sep="\t", usecols=[3, 8]) correctly grabs the Test (Sec) column and the Volts column.
I was hoping there was a way to correctly frame the data that wouldn't require me reading it twice.
Thanks in advance!
Oniwa
It looks like there are some trailing tabs:
>>> with open("oniwa.dat") as fp:
... for line in fp:
... print(repr(line))
...
'Rec#\tCyc#\tStep\tTest (Sec)\tStep (Sec)\tAmp-hr\tWatt-hr\tAmps\tVolts\tState\tES\tDPt Time\n'
'1\t0\t1\t0.00000000\t0.00000000\t0.00000000\t0.00000000\t0.00000000\t3.41214609\tR\t0\t09:44:13\t\n'
'2\t0\t1\t30.00000000\t30.00000000\t0.00000000\t0.00000000\t0.00000000\t3.41077280\tR\t1\t09:44:43\t\n'
'3\t0\t1\t60.00000000\t60.00000000\t0.00000000\t0.00000000\t0.00000000\t3.41077280\tR\t1\t09:45:13\n'
As a result, pandas concludes there's an index column. We can tell it otherwise using index_col. To be specific, instead of
>>> pd.read_csv("oniwa.dat", sep="\t") # no good
Rec# Cyc# Step Test (Sec) Step (Sec) Amp-hr Watt-hr Amps Volts \
1 0 1 0 0 0 0 0 3.412146 R
2 0 1 30 30 0 0 0 3.410773 R
3 0 1 60 60 0 0 0 3.410773 R
State ES DPt Time
1 0 09:44:13 NaN
2 1 09:44:43 NaN
3 1 09:45:13 NaN
we can use
>>> pd.read_csv("oniwa.dat", sep="\t", index_col=False) # hooray!
Rec# Cyc# Step Test (Sec) Step (Sec) Amp-hr Watt-hr Amps Volts \
0 1 0 1 0 0 0 0 0 3.412146
1 2 0 1 30 30 0 0 0 3.410773
2 3 0 1 60 60 0 0 0 3.410773
State ES DPt Time
0 R 0 09:44:13
1 R 1 09:44:43
2 R 1 09:45:13
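The effect is easy to reproduce in miniature; index_col=False is documented precisely for malformed files with a delimiter at the end of each line. A small sketch using io.StringIO in place of the file:

import io
import pandas as pd

data = "a\tb\n1\t2\t\n3\t4\t\n"  # note the trailing tab on each data row

print(pd.read_csv(io.StringIO(data), sep="\t"))                   # the 'a' values become the index
print(pd.read_csv(io.StringIO(data), sep="\t", index_col=False))  # columns line up again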
