Here is what I have:
import pandas as pd
df = pd.DataFrame()
df['date'] = ['2020-01-01', '2020-01-01','2020-01-01', '2020-01-02', '2020-01-02', '2020-01-03', '2020-01-03']
df['value'] = ['A', 'A', 'A', 'A', 'B', 'A', 'C']
df
date value
0 2020-01-01 A
1 2020-01-01 A
2 2020-01-01 A
3 2020-01-02 A
4 2020-01-02 B
5 2020-01-03 A
6 2020-01-03 C
I want to aggregate unique values over time like this:
date value
0 2020-01-01 1
3 2020-01-02 2
5 2020-01-03 3
I am NOT looking for this as an answer:
date value
0 2020-01-01 1
3 2020-01-02 2
5 2020-01-03 2
I need the 2020-01-03 to be 3 because there are three unique values (A,B,C).
We can aggregate to lists per date, take a cumulative sum of those lists, then map to set and len:
s=df.groupby('date').value.agg(list).cumsum().map(set).map(len)
date
2020-01-01 1
2020-01-02 2
2020-01-03 3
Name: value, dtype: int64
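If you want the result back as a two-column DataFrame rather than a Series, a small follow-up on the s from above (re-attaching the original positional index 0/3/5 would need an extra merge, which I leave out):
out = s.reset_index()   # columns: 'date' and 'value', one row per date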
Let's use pd.crosstab instead:
(pd.crosstab(df['date'], df['value']) !=0).cummax().sum(axis=1)
Output:
date
2020-01-01 1
2020-01-02 2
2020-01-03 3
dtype: int64
Details:
First, reshape the dataframe with pd.crosstab so that 'date' becomes the rows and each distinct value becomes a column. Then check for non-zero cells, apply cummax down each column to remember every value seen so far, and finally sum across the rows to count how many distinct values have been seen up to each date.
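To see the intermediate steps (a quick sketch using the df from the question):
ct = pd.crosstab(df['date'], df['value'])   # counts per date/value pair
seen = (ct != 0).cummax()                   # True once a value has appeared on or before a date
result = seen.sum(axis=1)                   # number of distinct values seen so far per date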
Another idea: mark the first occurrence of each unique value, take np.cumsum of that boolean mask, then groupby the date (which I have set as the index here) and take either the maximum or the last value.
import numpy as np
(np.cumsum((~(df.set_index('date')).duplicated('value')))).groupby(level=0).max()
date
2020-01-01 1
2020-01-02 2
2020-01-03 3
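Breaking that one-liner apart (a quick sketch on the same df):
first_seen = ~df.set_index('date').duplicated('value')  # True on the first occurrence of each value
running = np.cumsum(first_seen)                          # running count of unique values: 1,1,1,1,2,2,3
running.groupby(level=0).max()                           # collapse to one row per date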
Related
I have two dataframes df1, df2. I need to construct an output that, for each row of df1, finds the nearest date in df2 among the rows whose ID matches. df (Output desired) shown below illustrates what I am trying to explain.
df1:
ID Date
1 2020-01-01
2 2020-01-03
df2:
ID Date
11 2020-01-11
4 2020-02-03
5 2020-04-02
6 2020-01-05
1 2021-01-13
1 2021-03-03
1 2020-01-30
2 2020-03-31
2 2021-04-01
2 2021-02-02
df (Output desired)
ID Date Closest Date
1 2020-01-01 2020-01-30
2 2020-01-03 2020-03-31
Here's one way to achieve it – assuming that the Date columns' dtype is datetime: First,
df3 = df1[df1.ID.isin(df2.ID)].copy()  # .copy() so the new column below can be added without a SettingWithCopyWarning
will give you
ID Date
0 1 2020-01-01
1 2 2020-01-03
Then
df3['Closest_date'] = df3.apply(lambda row:min(df2[df2.ID.eq(row.ID)].Date,
key=lambda x:abs(x-row.Date)),
axis=1)
takes the min of df2.Date, where
df2[df2.ID.eq(row.ID)].Date selects the rows of df2 with the matching ID,
key=lambda x: abs(x - row.Date) tells min to compare by distance from row.Date,
and the whole thing is applied row by row, hence axis=1.
Output:
ID Date Closest_date
0 1 2020-01-01 2020-01-30
1 2 2020-01-03 2020-03-31
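An alternative worth knowing about is pd.merge_asof with direction='nearest', which does the per-ID nearest-date match in one vectorized call. A sketch, assuming both Date columns are datetime and both frames get sorted by Date (the Closest_Date helper column is my own addition to keep df2's date after the merge):
df2_sorted = df2.assign(Closest_Date=df2['Date']).sort_values('Date')
out = pd.merge_asof(df1.sort_values('Date'), df2_sorted,
                    on='Date', by='ID', direction='nearest')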
I want to aggregate the data of two pandas DataFrames into one, where the total column needs to be filled with the previously existing values. Here is my code:
import pandas as pd
df1 = pd.DataFrame({
'date': ['2020-01-01', '2020-01-02', '2020-01-03', '2020-01-05'],
'day_count': [1, 1, 1, 1],
'total': [1, 2, 3, 4]})
df2 = pd.DataFrame({
'date': ['2020-01-02', '2020-01-03', '2020-01-04'],
'day_count': [2, 2, 2],
'total': [2, 4, 6]})
# set "date" as index and convert to datetime for later resampling
df1.index = df1['date']
df1.index = pd.to_datetime(df1.index)
df2.index = df2['date']
df2.index = pd.to_datetime(df2.index)
Now I need to resample both my dataframes to some frequency, let's say daily so I would do:
df1 = df1.resample('D').agg({'day_count': 'sum', 'total': 'last'})
df2 = df2.resample('D').agg({'day_count': 'sum', 'total': 'last'})
The Dataframes now looks like:
In [20]: df1
Out[20]:
day_count total
date
2020-01-01 1 1.0
2020-01-02 1 2.0
2020-01-03 1 3.0
2020-01-04 0 NaN
2020-01-05 1 4.0
In [22]: df2
Out[22]:
day_count total
date
2020-01-02 2 2
2020-01-03 2 4
2020-01-04 2 6
Now I need to merge both, but notice that total has some NaN values where I need to fill in the previously existing value, so I do:
df1['total'] = df1['total'].fillna(method='ffill').astype(int)
df2['total'] = df2['total'].fillna(method='ffill').astype(int)
Now df1 looks like:
In [25]: df1
Out[25]:
day_count total
date
2020-01-01 1 1
2020-01-02 1 2
2020-01-03 1 3
2020-01-04 0 3
2020-01-05 1 4
So now I have the two dataframes ready to be merged, I think, so I concat them:
final_df = pd.concat([df1, df2]).fillna(method='ffill').groupby(["date"], as_index=True).sum()
In [31]: final_df
Out[31]:
day_count total
date
2020-01-01 1 1
2020-01-02 3 4
2020-01-03 3 7
2020-01-04 2 9
2020-01-05 1 4
The aggregation for day_count is correct, simply summing what is on the same date in both DFs, but for total I do not get what I expected, which is:
In [31]: final_df
Out[31]:
day_count total
date
2020-01-01 1 1
2020-01-02 3 4
2020-01-03 3 7
2020-01-04 2 9
2020-01-05 1 10 --> this value I miss
Certainly I am doing something wrong; I feel like there may even be a simpler way to do this. Thanks!
Concatenate them horizontally and groupby along columns:
pd.concat([df1,df2], axis=1).ffill().groupby(level=0, axis=1).sum()
That said, you can also bypass the individual fillna calls and the groupby:
# these are not needed
# df1['total'] = df1['total'].fillna(method='ffill').astype(int)
# df2['total'] = df2['total'].fillna(method='ffill').astype(int)
pd.concat([df1,df2],axis=1).ffill().sum(level=0, axis=1)
Output:
day_count total
date
2020-01-01 1.0 1.0
2020-01-02 3.0 4.0
2020-01-03 3.0 7.0
2020-01-04 2.0 9.0
2020-01-05 3.0 10.0
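Note that newer pandas versions deprecate both sum(level=..., axis=1) and groupby(..., axis=1); a rough equivalent of the same trick (a sketch that groups the duplicated column names after transposing) is:
pd.concat([df1, df2], axis=1).ffill().T.groupby(level=0).sum().T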
I have a dataframe df as below:
Datetime Value
2020-03-01 08:00:00 10
2020-03-01 10:00:00 12
2020-03-01 12:00:00 15
2020-03-02 09:00:00 1
2020-03-02 10:00:00 3
2020-03-02 13:00:00 8
2020-03-03 10:00:00 20
2020-03-03 12:00:00 25
2020-03-03 14:00:00 15
I would like to calculate the difference between the value at the first time of each date and the value at the last time of each date (ignoring the values at other times within a date), so the result will be:
Datetime Value_Difference
2020-03-01 5
2020-03-02 7
2020-03-03 -5
I have been doing this using a for loop, but it is slow (as expected) when I have larger data. Any help will be appreciated.
One solution would be to make sure the data is sorted by time, group by the date, and then take the first and last value of each day. This works since pandas preserves the order within groups during groupby, see e.g. here.
df = df.sort_values(by='Datetime').groupby(df['Datetime'].dt.date).agg({'Value': ['first', 'last']})
df['Value_Difference'] = df['Value']['last'] - df['Value']['first']
df = df.drop('Value', axis=1).reset_index()
Result:
Datetime Value_Difference
2020-03-01 5
2020-03-02 7
2020-03-03 -5
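The same idea can also be written as a single aggregation per group (a minimal sketch, assuming Datetime is a proper datetime column):
(df.sort_values('Datetime')
   .groupby(df['Datetime'].dt.date)['Value']
   .agg(lambda s: s.iloc[-1] - s.iloc[0]))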
Shaido's method works, but might be slow due to the groupby on very large sets.
Another possible way is to diff the dates converted to integers and grab only the needed positions, without a loop.
import numpy as np
idx = df.index  # assumes 'Datetime' is the (sorted) DatetimeIndex
loc = np.diff(idx.strftime('%Y%m%d').astype(int).values).nonzero()[0]  # last row of each day except the final one
loc1 = np.append(0, loc + 1)          # positions of the first row of each day
loc2 = np.append(loc, len(idx) - 1)   # positions of the last row of each day
res = df.values[loc2] - df.values[loc1]
df = pd.DataFrame(data=res, index=idx.date[loc1], columns=['values'])
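Since the question's frame has Datetime as a regular column rather than the index, an assumed setup step before running the above:
df = df.sort_values('Datetime').set_index('Datetime')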
I have a dataframe with a datetime64[ns] column in the format below, so I have data on an hourly basis:
Datum Values
2020-01-01 00:00:00 1
2020-01-01 01:00:00 10
....
2020-02-28 00:00:00 5
2020-03-01 00:00:00 4
and another table with closing days, also a datetime64[ns] column, but containing only the day part:
Dates
2020-02-28
2020-02-29
....
How can I delete all rows in the first dataframe df whose day occurs in the second dataframe Dates, so that df becomes:
2020-01-01 00:00:00 1
2020-01-01 01:00:00 10
....
2020-03-01 00:00:00 4
Use Series.dt.floor to set the times to 0, which makes it possible to filter with Series.isin and an inverted mask in boolean indexing:
df['Datum'] = pd.to_datetime(df['Datum'])
df1['Dates'] = pd.to_datetime(df1['Dates'])
df = df[~df['Datum'].dt.floor('d').isin(df1['Dates'])]
print (df)
Datum Values
0 2020-01-01 00:00:00 1
1 2020-01-01 01:00:00 10
3 2020-03-01 00:00:00 4
EDIT: For a flag column, convert the mask to integers with Series.view or Series.astype:
df['flag'] = df['Datum'].dt.floor('d').isin(df1['Dates']).view('i1')
#alternative
#df['flag'] = df['Datum'].dt.floor('d').isin(df1['Dates']).astype('int')
print (df)
Datum Values flag
0 2020-01-01 00:00:00 1 0
1 2020-01-01 01:00:00 10 0
2 2020-02-28 00:00:00 5 1
3 2020-03-01 00:00:00 4 0
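Series.dt.normalize() is an equivalent way to strip the time component here (it resets times to midnight), so the filter can also be spelled like this (a sketch):
df = df[~df['Datum'].dt.normalize().isin(df1['Dates'])]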
Taking your added comment into consideration:
Build a regex alternation string of the Dates in df1
c="|".join(df1.Dates.values)
c
Coerce Datum to datetime
df['Datum']=pd.to_datetime(df['Datum'])
df.dtypes
Extract the date part of Datum as a string column Dates
df.set_index(df['Datum'],inplace=True)
df['Dates']=df.index.date.astype(str)
Boolean mask of the dates that appear in both
m=df.Dates.str.contains(c)
m
Mark inclusive dates as 0 and exclusive as 1
df['drop']=np.where(m,0,1)
df
Drop the unwanted rows (keep only those flagged 1) and remove the helper columns
df[df['drop'].eq(1)].reset_index(drop=True).drop(columns=['Dates','drop'])
Outcome
I want to select and change the value of a dataframe cell. The dataframe uses a two-level MultiIndex: 'datetime' and 'idx'. Both levels contain labels that are unique and sequential; the 'datetime' level holds datetime labels and the 'idx' level holds integer labels.
import numpy as np
import pandas as pd
dt = pd.date_range("2010-10-01 00:00:00", periods=5, freq='H')
d = {'datetime': dt, 'a': np.arange(len(dt))-1,'b':np.arange(len(dt))+1}
df = pd.DataFrame(data=d)
df.set_index(keys='datetime',inplace=True,drop=True)
df.sort_index(axis=0,level='datetime',ascending=False,inplace=True)
df.loc[:,'idx'] = np.arange(0, len(df),1)+5
df.set_index('idx',drop=True,inplace=True,append=True)
print(df)
Here is the dataframe:
a b
datetime idx
2010-10-01 04:00:00 5 3 5
2010-10-01 03:00:00 6 2 4
2010-10-01 02:00:00 7 1 3
2010-10-01 01:00:00 8 0 2
2010-10-01 00:00:00 9 -1 1
Say I want to get the row where idx=5. How do I do that? I could use this:
print(df.iloc[0])
Then I will get result below:
a 3
b 5
Name: (2010-10-01 04:00:00, 5), dtype: int32
But I want to access and set the value in the cell where idx=5 and column='a', by specifying the idx value and the column name. How do I do that?
Please advise.
You can use the DataFrame.query() method for querying MultiIndex DataFrames:
In [54]: df
Out[54]:
a b
datetime idx
2010-10-01 04:00:00 5 3 5
2010-10-01 03:00:00 6 2 4
2010-10-01 02:00:00 7 1 3
2010-10-01 01:00:00 8 0 2
2010-10-01 00:00:00 9 -1 1
In [55]: df.query('idx==5')
Out[55]:
a b
datetime idx
2010-10-01 04:00:00 5 3 5
In [56]: df.query('idx==5')['a']
Out[56]:
datetime idx
2010-10-01 04:00:00 5 3
Name: a, dtype: int32
Or you can use the DataFrame.eval() method if you need to set/update some cells:
In [61]: df.loc[df.eval('idx==5'), 'a'] = 100
In [62]: df
Out[62]:
a b
datetime idx
2010-10-01 04:00:00 5 100 5
2010-10-01 03:00:00 6 2 4
2010-10-01 02:00:00 7 1 3
2010-10-01 01:00:00 8 0 2
2010-10-01 00:00:00 9 -1 1
Explanation:
In [59]: df.eval('idx==5')
Out[59]:
datetime idx
2010-10-01 04:00:00 5 True
2010-10-01 03:00:00 6 False
2010-10-01 02:00:00 7 False
2010-10-01 01:00:00 8 False
2010-10-01 00:00:00 9 False
dtype: bool
In [60]: df.loc[df.eval('idx==5')]
Out[60]:
a b
datetime idx
2010-10-01 04:00:00 5 3 5
PS if your original MultiIndex doesn't have names, you can easily set them using rename_axis() method:
df.rename_axis(('datetime','idx')).query(...)
Alternative (bit more expensive) solution - using sort_index() + pd.IndexSlice[]:
In [106]: df.loc[pd.IndexSlice[:,5], ['a']]
...
skipped
...
KeyError: 'MultiIndex Slicing requires the index to be fully lexsorted tuple len (2), lexsort depth (0)'
so we would need to sort index first:
In [107]: df.sort_index().loc[pd.IndexSlice[:,5], ['a']]
Out[107]:
a
datetime idx
2010-10-01 04:00:00 5 3
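Since the question also asks about setting the value, the same IndexSlice selection works for assignment once the index is sorted (a sketch):
df = df.sort_index()
df.loc[pd.IndexSlice[:, 5], 'a'] = 100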
One more way to do it.
Select value:
df.xs(5, level=-1)
Set value:
df.set_value(df.xs(5, level=-1).index, 'a', 100)
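Note that set_value was deprecated and later removed from pandas; a rough modern equivalent of the same xs-based update (a sketch) is:
df.loc[df.xs(5, level=-1, drop_level=False).index, 'a'] = 100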
In case this is to be used in a loop over a large data set, I found it about 20 times faster to first extract the columns of the dataframe as pandas Series and then do the selecting and assigning on those.
Or, even faster (almost 10000 times), extract them to a numpy array, if the index labels happen to be consecutive integers.
MYGz's solution was good, but in my use case inside a for-loop it was too slow to be feasible, as these operations took most of the time.
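A minimal sketch of the speed-up described above (the loop body and threshold are purely illustrative):
a = df['a'].to_numpy().copy()   # extract the column once, outside the loop
for i in range(len(a)):         # cheap scalar reads/writes on the plain array
    if a[i] < 0:
        a[i] = 0
df['a'] = a                     # write the modified values back in one assignment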