I have a problem. I want to answer some questions (see below). Unfortunately I got the error ValueError: Wrong number of items passed 0, placement implies 1. How can I answer these questions?
How many days ago (from today) was the last interactivity?
How many orders has the customer placed?
What was the longest gap between two interactivities?
What was the shortest gap between two interactivities?
What was the average gap between interactivities? (For that I calculated the days:)
From 2021-02-10 to 2021-02-22 = 12
From 2021-02-22 to 2021-03-18 = 24
From 2021-03-18 to 2021-03-22 = 4
From 2021-03-22 to 2021-09-07 = 169
From 2021-09-07 to 2022-01-18 = 133
-------
68.4 (average days)
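(The day counts can be double-checked with a quick pandas sketch; the gaps in the comments are recomputed from the dates above.)
import pandas as pd
dates = pd.to_datetime(['2021-02-10', '2021-02-22', '2021-03-18',
                        '2021-03-22', '2021-09-07', '2022-01-18'])
gaps = pd.Series(dates).diff().dt.days
print(gaps.tolist())  # [nan, 12.0, 24.0, 4.0, 169.0, 133.0]
print(gaps.mean())    # 68.4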
Dataframe
customerId fromDate
0 1 2021-02-22
1 1 2021-03-18
2 1 2021-03-22
3 1 2021-02-10
4 1 2021-09-07
5 1 None
6 1 2022-01-18
7 2 2021-05-17
Code
import pandas as pd
d = {'customerId': [1, 1, 1, 1, 1, 1, 1, 2],
     'fromDate': ['2021-02-22', '2021-03-18', '2021-03-22',
                  '2021-02-10', '2021-09-07', None, '2022-01-18', '2021-05-17']
     }
df = pd.DataFrame(data=d)
display(df)
df['fromDate'] = pd.to_datetime(df['fromDate'], errors='coerce').dt.date
#df = df['fromDate'].groupby('customerId', group_keys=False)
df = df.sort_values(['customerId','fromDate'],ascending=False)#.groupby('customerId')
df_new = pd.DataFrame()
df_new['average'] = df.groupby('customerId').mean()
[OUT] AttributeError: 'DataFrameGroupBy' object has no attribute 'groupby'
df_new = pd.DataFrame()
df_new['lastInteractivity'] = pd.to_datetime('today').normalize() - df['fromDate'].max()
[OUT] TypeError: '>=' not supported between instances of 'datetime.date' and 'float'
What I want
customerId lastInteractivity howMany shortest Longest Average
2 371 1 None None None
1 125 5 4 169 68.4
# shortest, longest, average are None because the customer with Id 2 has only 1 date
Use:
#converting to datetimes
df['fromDate'] = pd.to_datetime(df['fromDate'], errors='coerce')
#to correctly add the missing dates later, sort ascending by both columns
df = df.sort_values(['customerId','fromDate'])
#new column: elapsed time from each fromDate until today
df['lastInteractivity'] = pd.to_datetime('today').normalize() - df['fromDate']
#add the missing daily dates per customerId; rows with NaN in fromDate are dropped first
df = (df.dropna(subset=['fromDate'])
.set_index('fromDate')
.groupby('customerId')['lastInteractivity']
.apply(lambda x: x.asfreq('d'))
.reset_index())
#count the consecutive missing dates; each run of NaNs + 1 is the gap length in days
m = df['lastInteractivity'].notna()
df1 = (df[~m].groupby(['customerId', m.cumsum()])['customerId']
.size()
.add(1)
.reset_index(name='Count'))
print (df1)
customerId lastInteractivity Count
0 1 1 12
1 1 2 24
2 1 3 4
3 1 4 169
4 1 5 133
df1 = df1.groupby('customerId').agg(howMany=('Count','size'),
shortest=('Count','min'),
Longest=('Count','max'),
Average=('Count','mean'))
#get the last lastInteractivity per customer and join df1
df = (df.groupby('customerId')['lastInteractivity']
.last()
.dt.days
.sort_index(ascending=False)
.to_frame()
.join(df1)
.reset_index())
print (df)
customerId lastInteractivity howMany shortest Longest Average
0 2 371 NaN NaN NaN NaN
1 1 125 5.0 4.0 169.0 68.4
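For comparison, the same per-customer statistics can also be computed directly from the day gaps with groupby.diff, without inserting missing calendar dates. A minimal sketch starting again from the question's d (here howMany counts the gaps, i.e. 5 for customer 1 as above, and comes out as 0 rather than NaN for customer 2):
raw = pd.DataFrame(data=d)
raw['fromDate'] = pd.to_datetime(raw['fromDate'], errors='coerce')
raw = raw.dropna(subset=['fromDate']).sort_values(['customerId', 'fromDate'])
#gap in days between consecutive dates per customer
raw['gap'] = raw.groupby('customerId')['fromDate'].diff().dt.days
out = (raw.groupby('customerId')
          .agg(lastInteractivity=('fromDate',
                                  lambda s: (pd.Timestamp('today').normalize() - s.max()).days),
               howMany=('gap', 'count'),
               shortest=('gap', 'min'),
               Longest=('gap', 'max'),
               Average=('gap', 'mean'))
          .sort_index(ascending=False)
          .reset_index())
print(out)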
Related
From the picture below, we see that serial C failed on 3rd January and A failed on 5th January, within a 6-day period. I am interested in taking samples from the 3 days before the failure of each serial number.
My code:
import pandas as pd
import numpy as np
import datetime
from datetime import date, timedelta
df = pd.read_csv('https://gist.githubusercontent.com/JishanAhmed2019/e464ca4da5c871428ca9ed9264467aa0/raw/da3921c1953fefbc66dddc3ce238dac53142dba8/failure.csv',sep='\t')
df['date'] = pd.to_datetime(df['date'])
#df.drop(columns=df.columns[0], axis=1, inplace=True)
df = df.sort_values(by="date")
d = datetime.timedelta(days = 3)
df_fail_date = df[df['failure']==1].groupby(['serial_number'])['date'].min()
df_fail_date = df_fail_date - d
df_fail_date
I was not able to move further to sample my data. I am interested in getting the following data, that is, the 3 days before the failure. Serial C had only 1 day available before the failure, so I want to keep that one as well. It would be nice to add a duration column to count the days before the failure occurred. I appreciate your suggestions. Thanks!
Expected output dataframe:
You can use a groupby.rolling to get the dates/serials to keep, then merge to select:
df['date'] = pd.to_datetime(df['date'])
N = 3
m = (df.sort_values(by='date')
.loc[::-1]
.groupby('serial_number', group_keys=False)
.rolling(f'{N+1}d', on='date')
['failure'].max().eq(1)
.iloc[::-1]
)
out = df.merge(m[m], left_on=['serial_number', 'date'],
right_index=True, how='right')
Output:
date serial_number failure_x smart_5_raw smart_187_raw failure_y
2 2014-01-01 C 0 0 80 True
8 2014-01-02 C 0 0 200 True
4 2014-01-03 C 1 0 120 True
7 2014-01-02 A 0 0 180 True
5 2014-01-03 A 0 0 140 True
9 2014-01-04 A 0 0 280 True
14 2014-01-05 A 1 0 400 True
Another possible solution:
N = 4
df['date'] = pd.to_datetime(df['date'])
(df[df.groupby('serial_number')['failure'].transform(sum) == 1]
.sort_values(by=['serial_number', 'date'])
.groupby('serial_number')
.apply(lambda g:
g.assign(duration=1+np.arange(min(0, min(N, len(g))-len(g)), min(N, len(g)))))
.loc[lambda x: x['duration'] > 0]
.reset_index(drop=True))
Output:
date serial_number failure smart_5_raw smart_187_raw duration
0 2014-01-02 A 0 0 180 1
1 2014-01-03 A 0 0 140 2
2 2014-01-04 A 0 0 280 3
3 2014-01-05 A 1 0 400 4
4 2014-01-01 C 0 0 80 1
5 2014-01-02 C 0 0 200 2
6 2014-01-03 C 1 0 120 3
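For completeness, the question's own df_fail_date idea can also be carried through with a plain merge and a boolean window. A minimal sketch, assuming the same columns date, serial_number and failure; note that duration here counts days before the failure (0 on the failure day itself), so it is numbered differently from the answer above:
N = 3  # days to keep before the failure
df['date'] = pd.to_datetime(df['date'])
# first failure date per serial number
fail = (df[df['failure'] == 1]
        .groupby('serial_number', as_index=False)['date'].min()
        .rename(columns={'date': 'fail_date'}))
# inner merge keeps only serials that have a failure
merged = df.merge(fail, on='serial_number', how='inner')
window = pd.Timedelta(days=N)
mask = (merged['date'] >= merged['fail_date'] - window) & (merged['date'] <= merged['fail_date'])
out = merged[mask].copy()
out['duration'] = (out['fail_date'] - out['date']).dt.days
print(out.sort_values(['serial_number', 'date']))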
I have a problem. I have missing values in the column materialNumber. If another row has the same price, it should take exactly that materialNumber. If more than one materialNumber occurs with the same price, it should take the first. If no materialNumber is found with the same price, it should take the nearest materialNumber based on the price.
Dataframe
customerId materialNumber price
0 1 1234.0 100
1 1 4562.0 20
2 2 NaN 100
3 2 4562.0 30
4 3 1547.0 40
5 3 NaN 37
Code
import pandas as pd
d = {
    "customerId": [1, 1, 2, 2, 3, 3],
    "materialNumber": [1234, 4562, None, 4562, 1547, None],
    "price": [100, 20, 100, 30, 40, 37],
}
df = pd.DataFrame(data=d)
print(df)
import numpy as np
def find_next(x):
    if(x['materialNumber'] == None):
        #if the price occurs only once it should find the next nearest price
        if(x['price'].value_counts().shape[0] == 1):
            return x.drop_duplicates(subset=['price'], keep="first")
        else:
            return x.iloc[(x['price']-input).abs().argsort()[:2]]
df['materialNumber'] = df.apply(lambda x: find_next(x), axis=1)
What I want
customerId materialNumber price
0 1 1234.0 100
1 1 4562.0 20
2 2 1234.0 100 # from index 0: 1234.0, 100 (same price)
3 2 4562.0 30
4 3 1547.0 40
5 3 1547.0 37 # from index 4: 1547.0, 40 (nearest price)
Use merge_asof to match the rows with missing materialNumber against the rows without missing values, then assign the matched values back with DataFrame.loc:
m = df['materialNumber'].isna()
new = pd.merge_asof(df[m].reset_index().sort_values('price'),
df[~m].sort_values('price'), on='price', direction='nearest')
df.loc[m, 'materialNumber'] = new.set_index('index')['materialNumber_y']
print(df)
customerId materialNumber price
0 1 1234.0 100
1 1 4562.0 20
2 2 1234.0 100
3 2 4562.0 30
4 3 1547.0 40
5 3 1547.0 37
IIUC, you can use a merge_asof to find the equal or closest price value, then update your dataframe:
# mask to split the DataFrame in NaN/non-NaN for materialNumber
m = df['materialNumber'].isna()
# sort by price (required for merge_asof)
df2 = df.sort_values(by='price')
# fill missing values
missing = pd.merge_asof(df2.loc[m].reset_index()[['index', 'price']],
df2.loc[~m, ['price', 'materialNumber']],
on='price',
direction='nearest') # direction='forward' for next only
# update in place
df.update(missing.set_index('index')['materialNumber'])
output:
customerId materialNumber price
0 1 1234.0 100
1 1 4562.0 20
2 2 1234.0 100
3 2 4562.0 30
4 3 1547.0 40
5 3 1547.0 37
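As a small illustration of the direction argument mentioned in the comment above (a toy sketch, not data from the question): with a missing row at price 37 and candidates at 30 and 40, 'nearest' and 'forward' pick the 40 row, while 'backward' picks the 30 row.
import pandas as pd
left = pd.DataFrame({'price': [37]})
right = pd.DataFrame({'price': [30, 40], 'materialNumber': [4562.0, 1547.0]})
for direction in ('nearest', 'backward', 'forward'):
    matched = pd.merge_asof(left, right, on='price', direction=direction)
    print(direction, matched['materialNumber'].iloc[0])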
Given that I have a pandas dataframe:
waterflow_id created_at
0 5ff86588-594e-458f-9d29-385ee2e128e4 2022-03-20 13:19:21.430816+00:00
1 5ff86588-594e-458f-9d29-385ee2e128e4 2022-03-21 13:19:21.430819+00:00
2 5ff86588-594e-458f-9d29-385ee2e128e4 2022-03-21 13:19:21.430819+00:00
3 5ff86588-594e-458f-9d29-385ee2e128e4 2022-03-22 13:19:21.430821+00:00
4 5ff86588-594e-458f-9d29-385ee2e128e4 2022-03-22 13:19:21.430821+00:00
How do I get the median of the days between created_at values, so that I can have a dataframe of the days in between per waterflow_id, something like:
waterflow days_median
1 0
2 4
3 6
4 7
5 10
Basically, waterflow here represents the unique occurrences of waterflow_id.
With the latest answer I tried
meddata = waterflow_df.groupby("waterflow_id")['created_at'].apply(lambda s: s.diff().median())
print(meddata)
And I received:
waterflow_id
0788a658-06d9-4b61-9ac4-2728ace02a86 0 days
1f8752f8-f667-44ec-84b9-acad02d384c0 0 days
2655b525-8b2c-4a53-abdc-5208cb95d96e 0 days
8d3cd7e3-900c-4996-b202-f66eb41ac37b 0 days
9d02b939-f295-4d36-8f72-e9984a52dbd9 0 days
d8d8fb70-d755-48c3-8c19-8032864719da 0 days
dc1da5e1-6974-4145-a0d8-615e08506ebf 0 days
f39366f5-c9e2-415a-baec-530bb8bd2f07 0 days
What's weird is that I have dates spanning up to 6 months.
The output is unclear, but IIUC, you could use a GroupBy.agg:
from itertools import count
c = count(1)
df['created_at'] = pd.to_datetime(df['created_at'])
out = (df
.groupby('waterflow_id')
.agg(**{'waterflow': ('waterflow_id', lambda s: next(c)),
'days_median': ('created_at', lambda s: s.diff().median()
.total_seconds()//(3600*24))
})
)
or using factorize to number the groups:
df['created_at'] = pd.to_datetime(df['created_at'])
(df.assign(waterflow_id=df['waterflow_id'].factorize()[0]+1)
.groupby('waterflow_id')
.agg(**{'waterflow': ('waterflow_id', 'first'),
'days_median': ('created_at', lambda s: s.diff().median()
.total_seconds()//(3600*24))
})
)
output:
waterflow days_median
waterflow_id
5ff86588-594e-458f-9d29-385ee2e128e4 1 0.0
Simple version with just the median:
df['created_at'] = pd.to_datetime(df['created_at'])
out = (df.groupby('waterflow_id')['created_at']
.apply(lambda s: s.diff().median()
.total_seconds()//(3600*24))
)
output:
waterflow_id
5ff86588-594e-458f-9d29-385ee2e128e4 0.0
Name: created_at, dtype: float64
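On the '0 days' puzzle: if a waterflow_id logs several rows on the same day, the median of the raw timestamp differences can legitimately fall below one day even when the data spans months; in the sample above the four diffs are roughly 1 day, 0, 1 day, 0, so the median is about half a day and floors to 0. A hedged sketch to check this is to collapse to one row per id and calendar day first (not part of the answer above):
df['created_at'] = pd.to_datetime(df['created_at'])
daily = (df.assign(day=df['created_at'].dt.normalize())
           .drop_duplicates(['waterflow_id', 'day'])
           .sort_values(['waterflow_id', 'day']))
out = (daily.groupby('waterflow_id')['day']
            .apply(lambda s: s.diff().median() / pd.Timedelta(days=1)))
print(out)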
I'd like to merge this dataframe:
import pandas as pd
import numpy as np
df1 = pd.DataFrame([[1,10,100],[2,20,np.nan],[3,30,300]], columns=["A","B","C"])
df1
A B C
0 1 10 100
1 2 20 NaN
2 3 30 300
with this one:
df2 = pd.DataFrame([[1,422],[10,72],[2,278],[300,198]], columns=["ID","Value"])
df2
ID Value
0 1 422
1 10 72
2 2 278
3 300 198
to get an output:
df_output = pd.DataFrame([[1,10,100,422],[1,10,100,72],[2,20,np.nan,278],[3,30,300,198]], columns=["A","B","C","Value"])
df_output
A B C Value
0 1 10 100 422
1 1 10 100 72
2 2 20 NaN 278
3 3 30 300 198
The idea is that for df2 the key column is "ID", while for df1 we have 3 possible key columns ["A","B","C"].
Please note that the numbers in df2 are chosen like this for simplicity; in practice they can be arbitrary.
How do I perform such a merge? Thanks!
IIUC, you need a double merge/join.
First, melt df1 to get a single column, while keeping the index. Then merge to get the matches. Finally join to the original DataFrame.
s = (df1
.reset_index().melt(id_vars='index')
.merge(df2, left_on='value', right_on='ID')
.set_index('index')['Value']
)
# index
# 0 422
# 1 278
# 0 72
# 2 198
# Name: Value, dtype: int64
df_output = df1.join(s)
output:
A B C Value
0 1 10 100.0 422
0 1 10 100.0 72
1 2 20 NaN 278
2 3 30 300.0 198
Alternative with stack + map:
s = df1.stack().droplevel(1).map(df2.set_index('ID')['Value']).dropna()
df_output = df1.join(s.rename('Value'))
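A small usage note: because index 0 matches twice, the joined result keeps a duplicated index; if you want a clean 0..n index like the expected df_output, adding reset_index(drop=True) at the end should do it:
df_output = df1.join(s.rename('Value')).reset_index(drop=True)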
I am doing the following:
# Load date
data = pd.read_csv('C:/Users/user/Desktop/STOCKS.txt', keep_default_na=True, sep='\t', nrows=5)
# Convert dates from object columns to datetime columns
data['DATE'] = pd.to_datetime(data['DATE'])
print(data.columns)
# Index(['COUNTRY_ID', 'STOCK_ID', 'DATE', 'STOCK_VALUE'], dtype='object')
# Count of stock per country per day
data_agg= data.groupby(['COUNTRY_ID'], as_index=False).agg({'DATE': 'count'})
print(data_agg.columns)
# Index(['COUNTRY_ID', 'DATE'], dtype='object')
# Rename count column
data_agg.rename({'DATE': 'Count'}, inplace=True)
print(data_agg.columns)
# Index(['COUNTRY_ID', 'DATE'], dtype='object')
As you see above at the last lines, I try to rename the aggregated column after the groupby but for some reason this does not work (I still get the name DATE for this column instead of Count).
How can I fix this?
You need the columns keyword; if you omit it, rename tries to change the values of the index:
data_agg.rename(columns={'DATE': 'Count'}, inplace=True)
rng = pd.date_range('2017-04-03', periods=10)
data = pd.DataFrame({'DATE': rng, 'COUNTRY_ID': [3]*3+ [4]*5 + [1]*2})
print (data)
DATE COUNTRY_ID
0 2017-04-03 3
1 2017-04-04 3
2 2017-04-05 3
3 2017-04-06 4
4 2017-04-07 4
5 2017-04-08 4
6 2017-04-09 4
7 2017-04-10 4
8 2017-04-11 1
9 2017-04-12 1
data_agg= data.groupby(['COUNTRY_ID'], as_index=False).agg({'DATE': 'count'})
data_agg.rename({'DATE': 'Count', 1:'aaa'}, inplace=True)
print (data_agg)
COUNTRY_ID DATE
0 1 2
aaa 3 3
2 4 5
#recreate the aggregation, then rename with the columns keyword
data_agg = data.groupby(['COUNTRY_ID'], as_index=False).agg({'DATE': 'count'})
data_agg.rename(columns={'DATE': 'Count', 1:'aaa'}, inplace=True)
print (data_agg)
COUNTRY_ID Count
0 1 2
1 3 3
2 4 5
Another solution is to remove as_index=False and use DataFrameGroupBy.count with Series.reset_index and its name parameter:
data_agg= data.groupby('COUNTRY_ID')['DATE'].count().reset_index(name='Count')
print (data_agg)
COUNTRY_ID Count
0 1 2
1 3 3
2 4 5
I think this solves your problem:
data_agg = data_agg.rename(columns={'DATE': 'Count'})
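For completeness, named aggregation sidesteps the rename entirely, since the output column can be named at aggregation time (a sketch with the same demo data as above):
data_agg = (data.groupby('COUNTRY_ID')
                .agg(Count=('DATE', 'count'))
                .reset_index())
print (data_agg)
COUNTRY_ID Count
0 1 2
1 3 3
2 4 5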