Transpose a dataframe and melt - python

I have a dataframe with one boolean flag column per firm. Data for every date and firm exists somewhere in the table, but a given row only carries a firm's values when that firm's flag is True.
date        IBM    AAPL   IBM_total_amount  IBM_count_avg  AAPL_total_amount  AAPL_count_avg
2013-01-31  True   False  29                9              NaN                NaN
2013-01-31  True   True   29                9              27                 5
2013-02-31  False  True   NaN               NaN            27                 5
2013-02-08  True   True   2                 3              5                  6
...
How could I transpose the above dataframe to long format?
Expected output:
date        Firm  total_amount  count_avg
2013-01-31  IBM   29            9
2013-01-31  AAPL  27            5
...
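For reference, here is a minimal construction of the sample frame that the snippets below assume (values taken from the table above, with NaN where a firm's flag is False; the 2013-02-31 date is kept verbatim from the question):

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'date': ['2013-01-31', '2013-01-31', '2013-02-31', '2013-02-08'],
    'IBM':  [True, True, False, True],
    'AAPL': [False, True, True, True],
    'IBM_total_amount':  [29, 29, np.nan, 2],
    'IBM_count_avg':     [9, 9, np.nan, 3],
    'AAPL_total_amount': [np.nan, 27, 27, 5],
    'AAPL_count_avg':    [np.nan, 5, 5, 6],
})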

Might have to add some logic to drop all the boolean masks, but once you have that it's just a stack.
u = df.set_index('date').drop(['IBM', 'AAPL'], axis=1)  # drop the boolean flag columns
u.columns = u.columns.str.split('_', expand=True)       # the firm becomes the first column level
u.stack(0)                                              # move the firm level into the index
                count  total
date
2013-01-31 IBM    9.0   29.0
           AAPL   5.0   27.0
           IBM    9.0   29.0
2013-02-31 AAPL   5.0   27.0
2013-02-08 AAPL   6.0    5.0
           IBM    3.0    2.0
To drop all the masks when you don't have a list of keys, you can use select_dtypes:
df.select_dtypes(exclude=[bool])
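Putting the pieces together, a minimal end-to-end sketch that reproduces the expected long format (it assumes a pandas version where stack() still drops the all-NaN rows created by the False flags, which is the legacy default):

u = df.set_index('date').select_dtypes(exclude=[bool])
u.columns = u.columns.str.split('_', n=1, expand=True)  # level 0: firm, level 1: measure name
long_df = (u.stack(0)
             .rename_axis(['date', 'Firm'])
             .reset_index())
print(long_df)  # columns: date, Firm, count_avg, total_amount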

Use wide_to_long, with pre-processing of the column names and post-processing with slicing and dropna:

df.columns = ['_'.join(col[1::-1]) for col in df.columns.str.split('_')]  # 'IBM_total_amount' -> 'total_IBM'
df_final = (pd.wide_to_long(df.reset_index(), stubnames=['total', 'count'],
                            i=['index', 'date'],
                            j='firm', sep='_', suffix=r'\w+')[['total', 'count']]
              .reset_index(level=[1, 2]).dropna())
Out[59]:
         date  firm  total  count
index
0  2013-01-31   IBM   29.0    9.0
1  2013-01-31   IBM   29.0    9.0
1  2013-01-31  AAPL   27.0    5.0
2  2013-02-31  AAPL   27.0    5.0
3  2013-02-08   IBM    2.0    3.0
3  2013-02-08  AAPL    5.0    6.0

That's an unusual table design. Let's assume the table is called df.
So you first want to find the list of tickers:
Either you have them elsewhere:
tickers = ['AAPL','IBM']
or you can extract them from your table:
tickers = [c for c in df.columns
           if not c.endswith('_total_amount')
           and not c.endswith('_count_avg')
           and c != 'date']
Now you have to loop over the tickers:
res = []
for tic in tickers:
    # keep only the rows where this firm's flag is True
    sub = df.loc[df[tic], ['date', f'{tic}_total_amount', f'{tic}_count_avg']].copy()
    sub.columns = ['date', 'Total', 'Count']
    sub['Firm'] = tic
    res.append(sub)
res = pd.concat(res, axis=0)
Finally, you might want to reorder the columns:
res = res[['date', 'Firm', 'Total', 'Count']]
You might want to handle duplicates. From what I read in your example, you want to drop them:
res = res.drop_duplicates()

Related

How to create a new column with the last value of the previous year

I have this data frame
import pandas as pd
df = pd.DataFrame({'COTA': ['A','A','A','A','A','B','B','B','B'],
                   'Date': ['14/10/2021','19/10/2020','29/10/2019','30/09/2021','20/09/2020',
                            '20/10/2021','29/10/2020','15/10/2019','10/09/2020'],
                   'Mark': [1,2,3,4,5,1,2,3,3]
                  })
print(df)
Based on this data frame I want the Mark from the previous year. I managed to get the maximum per COTA using .max(), but I want the last value; I thought I could get it with .last(), but it didn't work. Here is an example of my code:
df['Date'] = pd.to_datetime(df['Date'])
df['LastYear'] = df['Date'] - pd.offsets.YearEnd(0)
s1 = df.groupby(['COTA', 'LastYear'])['Mark'].max()
s2 = s1.rename(index=lambda x: x + pd.offsets.DateOffset(years=1), level=1)
df = df.join(s2.rename('Max_MarkLastYear'), on=['COTA', 'LastYear'])
print(df)
  COTA       Date  Mark   LastYear  Max_MarkLastYear
0    A 2021-10-14     1 2021-12-31               5.0
1    A 2020-10-19     2 2020-12-31               3.0
2    A 2019-10-29     3 2019-12-31               NaN
3    A 2021-09-30     4 2021-12-31               5.0
4    A 2020-09-20     5 2020-12-31               3.0
5    B 2021-10-20     1 2021-12-31               3.0
6    B 2020-10-29     2 2020-12-31               3.0
7    B 2019-10-15     3 2019-12-31               NaN
8    B 2020-10-09     3 2020-12-31               3.0
How do I create a new column with the last value of the previous year?
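A minimal sketch of one way to get the last (rather than the maximum) Mark of the previous year, reusing the pattern above; it assumes that sorting by Date first makes .last() pick the chronologically latest row per group:

# sort by Date so .last() returns the chronologically latest Mark per group
s1 = (df.sort_values('Date')
        .groupby(['COTA', 'LastYear'])['Mark']
        .last())
# shift the year-end index forward one year so each row joins to the prior year's value
s2 = s1.rename(index=lambda x: x + pd.offsets.DateOffset(years=1), level=1)
df = df.join(s2.rename('Last_MarkLastYear'), on=['COTA', 'LastYear'])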

How do I create a dummy variable by comparing columns in different data frames?

I would like to compare one column of a df with a column in a different df. The columns are a timestamp and a holiday date. I'd like to create a dummy variable: if a timestamp in df1 matches a date in df2, the value should be 1, else 0.
For example, df1:
timestamp weight(kg)
0 2016-03-04 4.0
1 2015-02-15 5.0
2 2019-05-04 5.0
3 2018-12-25 29.0
4 2020-01-01 58.0
For example, df2:
holiday
0 2016-12-25
1 2017-01-01
2 2019-05-01
3 2018-12-26
4 2020-05-26
Ideal output:
timestamp weight(kg) holiday
0 2016-03-04 4.0 0
1 2015-02-15 5.0 0
2 2019-05-04 5.0 0
3 2018-12-25 29.0 1
4 2020-01-01 58.0 1
I have tried writing a function but it is taking very long to calculate:
def add_holiday(x):
    hols_df = hols.apply(lambda y: y['holiday_dt'] if
                         x['timestamp'] == y['holiday_dt']
                         else None, axis=1)
    hols_df = hols_df.dropna(axis=0, how='all')
    if hols_df.empty:
        hols_df = np.nan
    else:
        hols_df = hols_df.to_string(index=False)
    return hols_df

#df_hols['holidays'] = df_hols.apply(add_holiday, axis=1)
Perhaps there is a simpler way to do this, or the function is not well written. Any help will be appreciated.
Use Series.isin and convert the boolean mask to 1/0 with Series.astype:
df1['holiday'] = df1['timestamp'].isin(df2['holiday']).astype(int)
Or with numpy.where:
df1['holiday'] = np.where(df1['timestamp'].isin(df2['holiday']), 1, 0)
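A runnable sketch with the sample data; note that both columns must share the same dtype (here datetime64) for isin to match. With df2 exactly as listed above, rows 3 and 4 actually come out 0, since 2018-12-25 and 2020-01-01 are not in that holiday list:

import pandas as pd

df1 = pd.DataFrame({
    'timestamp': pd.to_datetime(['2016-03-04', '2015-02-15', '2019-05-04',
                                 '2018-12-25', '2020-01-01']),
    'weight(kg)': [4.0, 5.0, 5.0, 29.0, 58.0],
})
df2 = pd.DataFrame({
    'holiday': pd.to_datetime(['2016-12-25', '2017-01-01', '2019-05-01',
                               '2018-12-26', '2020-05-26']),
})

# vectorized membership test, no row-by-row apply needed
df1['holiday'] = df1['timestamp'].isin(df2['holiday']).astype(int)
print(df1)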

How to display grouped by column during ffill() and not agg using pandas?

This isn't a duplicate; I already referred to post_1 and post_2. My question is different and not about the agg function. It is about also displaying the grouped-by column during the ffill operation. Though the code works fine, I'm sharing the full code so you can get an idea; the problem is in the commented line, so look out for that line below.
I have a dataframe like as given below
df = pd.DataFrame({
    'subject_id': [1,1,1,1,1,1,1,2,2,2,2,2],
    'time_1': ['2173-04-03 12:35:00','2173-04-03 12:50:00','2173-04-05 12:59:00',
               '2173-05-04 13:14:00','2173-05-05 13:37:00','2173-07-06 13:39:00',
               '2173-07-08 11:30:00','2173-04-08 16:00:00','2173-04-09 22:00:00',
               '2173-04-11 04:00:00','2173-04-13 04:30:00','2173-04-14 08:00:00'],
    'val': [5,5,5,5,1,6,5,5,8,3,4,6]})
df['time_1'] = pd.to_datetime(df['time_1'])
df['day'] = df['time_1'].dt.day
df['month'] = df['time_1'].dt.month
What this code (written with the help of jezrael from the forum) does is add missing dates based on a threshold value. The only issue is that I don't see the grouped-by column in the output.
df['time_1'] = pd.to_datetime(df['time_1'])
df['day'] = df['time_1'].dt.day
df['date'] = df['time_1'].dt.floor('d')
df1 = (df.set_index('date')
         .groupby('subject_id')
         .resample('d')
         .last()
         .index
         .to_frame(index=False))
df2 = df1.merge(df, how='left')
thresh = 5
mask = df2['day'].notna()
s = mask.cumsum().mask(mask)
df2['count'] = s.map(s.value_counts())
df2 = df2[(df2['count'] < thresh) | (df2['count'].isna())]
df2 = df2.groupby(df2['subject_id']).ffill()  # problem is here
dates = df2['time_1'].dt.normalize()
df2['time_1'] += np.where(dates == df2['date'], 0, df2['date'] - dates)
df2['day'] = df2['time_1'].dt.day
df2['val'] = df2['val'].astype(int)
As shown in the code above, I tried the following approaches:
df2 = df2.groupby(df2['subject_id']).ffill() # doesn't help
df2 = df2.groupby(df2['subject_id']).ffill().reset_index() # doesn't help
df2 = df2.groupby('subject_id',as_index=False).ffill() # doesn't help
The output is incorrect because it lacks subject_id. I expect my output to have the subject_id column as well.
Here are two possible solutions. First, specify all columns in a list after groupby and assign back:
cols = df2.columns.difference(['subject_id'])
df2[cols] = df2.groupby('subject_id')[cols].ffill()  # the column subset keeps subject_id intact
Or create index by subject_id column and grouping by index:
# newer pandas versions
df2 = df2.set_index('subject_id').groupby('subject_id').ffill().reset_index()
# older pandas versions
df2 = df2.set_index('subject_id').groupby(level=0).ffill().reset_index()
dates = df2['time_1'].dt.normalize()
df2['time_1'] += np.where(dates == df2['date'], 0, df2['date'] - dates)
df2['day'] = df2['time_1'].dt.day
df2['val'] = df2['val'].astype(int)
print(df2)
     subject_id       date              time_1  val  day  month  count
0             1 2173-04-03 2173-04-03 12:35:00    5    3    4.0    NaN
1             1 2173-04-03 2173-04-03 12:50:00    5    3    4.0    NaN
2             1 2173-04-04 2173-04-04 12:50:00    5    4    4.0    1.0
3             1 2173-04-05 2173-04-05 12:59:00    5    5    4.0    1.0
32            1 2173-05-04 2173-05-04 13:14:00    5    4    5.0    1.0
33            1 2173-05-05 2173-05-05 13:37:00    1    5    5.0    1.0
95            1 2173-07-06 2173-07-06 13:39:00    6    6    7.0    1.0
96            1 2173-07-07 2173-07-07 13:39:00    6    7    7.0    1.0
97            1 2173-07-08 2173-07-08 11:30:00    5    8    7.0    1.0
98            2 2173-04-08 2173-04-08 16:00:00    5    8    4.0    NaN
99            2 2173-04-09 2173-04-09 22:00:00    8    9    4.0    NaN
100           2 2173-04-10 2173-04-10 22:00:00    8   10    4.0    1.0
101           2 2173-04-11 2173-04-11 04:00:00    3   11    4.0    1.0
102           2 2173-04-12 2173-04-12 04:00:00    3   12    4.0    1.0
103           2 2173-04-13 2173-04-13 04:30:00    4   13    4.0    1.0
104           2 2173-04-14 2173-04-14 08:00:00    6   14    4.0    1.0
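For intuition, a tiny hypothetical sketch of why the grouping column disappears: groupby(...).ffill() treats the key as the grouper and excludes it from the result, while moving it into the index preserves it:

import pandas as pd

demo = pd.DataFrame({'g': [1, 1, 2], 'v': [1.0, None, 3.0]})
print(demo.groupby('g').ffill())                                   # only 'v' remains; 'g' was consumed as the key
print(demo.set_index('g').groupby(level=0).ffill().reset_index())  # 'g' survives because it lives in the index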

Display variable data as month and year

I have the code below:
import pandas as pd
import datetime
df=pd.read_csv("https://www.dropbox.com/s/08kuxi50d0xqnfc/demo.csv?dl=1")
df["date"]=pd.to_datetime(df["date"])
df['date'] = df.date.apply(lambda x: datetime.datetime.strftime(x,'%b')) # SHOWS date as MONTH
pvt_enroll=df.pivot_table(index='site', columns="date", values = 'baseline', aggfunc = {'baseline' : 'count'}, fill_value=0, margins=True) # Pivot_Table with enrollment by SITE by MONTH
pvt_enroll.to_csv("pivot_test.csv")
table_enroll_site_month = pd.read_csv('pivot_test.csv', encoding='latin-1')
table_enroll_site_month.rename(columns={'site':'Study Site'}, inplace=True)
table_enroll_site_month
Study Site Apr Jul Jun May All
0 A 5.0 0.0 8.0 4.0 17.0
1 B 9.0 0.0 11.0 5.0 25.0
2 C 6.0 1.0 3.0 20.0 30.0
3 D 5.0 0.0 3.0 2.0 10.0
4 E 5.0 0.0 5.0 0.0 10.0
5 All 30.0 1.0 30.0 31.0 92.0
And I wonder how to:
1. Display the months with the year, as Apr16 Jul16 Jun16 May16.
2. Get the same table without running the step pvt_enroll.to_csv("pivot_test.csv"). I mean, can I get the same result without needing to save to a .csv file first?
I think that by using %b%y you can get the 'Apr16' format. I tried it with the following code, without saving to a .csv file.
import pandas as pd
from datetime import datetime
df=pd.read_csv("demo.csv")
df["date"]=pd.to_datetime(df["date"])
df['date'] = df['date'].apply(lambda x: datetime.strftime(x,'%b%y'))
pvt_enroll=df.pivot_table(index='site', columns="date", values = 'baseline', aggfunc = {'baseline' : 'count'}, fill_value=0, margins=True) # Pivot_Table with enrollment by SITE by MONTH
pvt_enroll.reset_index(inplace=True)
pvt_enroll.rename(columns={'site':'Study Site'}, inplace=True)
print(pvt_enroll)
And I got the output as follows
date Study Site Apr16 Jul16 Jun16 May16 All
0 A 5 0 8 4 17
1 B 9 0 11 5 25
2 C 6 1 3 20 30
3 D 5 0 3 2 10
4 E 5 0 5 0 10
5 All 30 1 30 31 92
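As a side note, a minimal sketch: the apply call with strftime can be replaced by the vectorized .dt.strftime accessor (same result, assuming the date column has been converted with pd.to_datetime):

# vectorized formatting, no apply needed
df['date'] = pd.to_datetime(df['date'])
df['date'] = df['date'].dt.strftime('%b%y')  # e.g. 'Apr16'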

How to sum python pandas dataframe in certain time range

I have a dataframe like this
df
order_date amount
0 2015-10-02 1
1 2015-12-21 15
2 2015-12-24 3
3 2015-12-26 4
4 2015-12-27 5
5 2015-12-28 10
I would like to sum df["amount"] over the range from df["order_date"] to df["order_date"] + 6 days:
order_date amount sum
0 2015-10-02 1 1
1 2015-12-21 15 27 //comes from 15 + 3 + 4 + 5
2 2015-12-24 3 22 //comes from 3 + 4 + 5 + 10
3 2015-12-26 4 19
4 2015-12-27 5 15
5 2015-12-28 10 10
the data type of order_date is datetime
I have tried to use iloc but it did not work well...
If anyone has any idea or example of how to approach this,
please kindly let me know.
If pandas rolling allowed a left-aligned window (the default is right-aligned), the answer would be a simple one-liner: df.set_index('order_date').amount.rolling('7d', min_periods=1, align='left').sum(). However, forward-looking windows have not been implemented yet (i.e. rolling does not accept an align parameter), so the trick I came up with is to "reverse" the dates temporarily. Solution:
df.index = pd.Timestamp.now() - df.order_date  # reversed: recent dates get small timedeltas
df['sum'] = df.sort_index().amount.rolling('7d', min_periods=1).sum()
df.reset_index(drop=True)
Output:
order_date amount sum
0 2015-10-02 1 1.0
1 2015-12-21 15 27.0
2 2015-12-24 3 22.0
3 2015-12-26 4 19.0
4 2015-12-27 5 15.0
5 2015-12-28 10 10.0
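If order_date is sorted ascending, a vectorized alternative (a sketch using numpy.searchsorted and a cumulative sum, not part of the answer above) avoids both the reversal trick and explicit loops:

import numpy as np

# position of the first row whose date is past each row's date + 6 days
dates = df['order_date'].to_numpy()
end = np.searchsorted(dates, dates + np.timedelta64(6, 'D'), side='right')

# prefix sums give each window's total in O(1)
csum = np.concatenate(([0], df['amount'].cumsum().to_numpy()))
df['sum'] = csum[end] - csum[np.arange(len(df))]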
Expanding on my comment:
from datetime import timedelta

df['sum'] = 0
for i in range(len(df)):
    dt1 = df['order_date'][i]
    dt2 = dt1 + timedelta(days=6)
    # sum the amounts whose order_date falls within [dt1, dt2]
    df.loc[i, 'sum'] = df.loc[(df['order_date'] >= dt1) & (df['order_date'] <= dt2), 'amount'].sum()
There's probably a much better way to do this but it works...
Here is my way of solving this problem. It works... (I believe there should be a much better way to do it.)
import pandas as pd

df['order_date'] = pd.to_datetime(df['order_date'])
Temp = pd.DataFrame(pd.date_range(start='2015-10-02', end='2017-01-01'), columns=['STDate'])
Temp = Temp.merge(df, left_on='STDate', right_on='order_date', how='left')
Temp['amount'] = Temp['amount'].fillna(0)
Temp.sort_values(['STDate'], ascending=False, inplace=True)
Temp['rolls'] = Temp['amount'].rolling(window=7, min_periods=0).sum()
Temp.loc[Temp.STDate.isin(df.order_date), :].sort_values(['STDate'], ascending=True)
        STDate  Unnamed: 0 order_date  amount  rolls
0   2015-10-02         0.0 2015-10-02     1.0    1.0
80  2015-12-21         1.0 2015-12-21    15.0   27.0
83  2015-12-24         2.0 2015-12-24     3.0   22.0
85  2015-12-26         3.0 2015-12-26     4.0   19.0
86  2015-12-27         4.0 2015-12-27     5.0   15.0
87  2015-12-28         5.0 2015-12-28    10.0   10.0
Set order_date to be a DatetimeIndex, so that you can use df.loc[time1:time2] to get the rows in the time range; then take the amount column and sum it.
You can try with:

from datetime import timedelta

df = pd.read_fwf('test2.csv')
df.order_date = pd.to_datetime(df.order_date)
df = df.set_index(pd.DatetimeIndex(df['order_date']))
sum_list = list()
for i in range(len(df)):
    start = df.iloc[i]['order_date']
    # label-based slicing on the DatetimeIndex covers [start, start + 6 days]
    sum_list.append(df.loc[start:start + timedelta(days=6)]['amount'].sum())
df['sum'] = sum_list
df
Output:
order_date amount sum
2015-10-02 2015-10-02 1 1
2015-12-21 2015-12-21 15 27
2015-12-24 2015-12-24 3 22
2015-12-26 2015-12-26 4 19
2015-12-27 2015-12-27 5 15
2015-12-28 2015-12-28 10 10
import datetime

df['order_date'] = pd.to_datetime(df['order_date'], format='%Y-%m-%d')
df.set_index(['order_date'], inplace=True)

# Sum the rows within the range of six days into the future
d = {t: df[(df.index >= t) & (df.index <= t + datetime.timedelta(days=6))]['amount'].sum()
     for t in df.index}

# Assign the summed values back to the dataframe
df['amount_sum'] = [d[t] for t in df.index]
df is now:
amount amount_sum
order_date
2015-10-02 1.0 1.0
2015-12-21 15.0 27.0
2015-12-24 3.0 22.0
2015-12-26 4.0 19.0
2015-12-27 5.0 15.0
2015-12-28 10.0 10.0
