I have a matrix df with 70 columns.
id  day_1  day_2  day_3  day_4  ...  day_69  day_70
1   1      2      4      1      ...  1       1
2   0      0      0      0      ...  0       0
3   0      3      0      0      ...  0       0
4   3      2      1      0      ...  0       3
I would like to aggregate the columns dynamically by a given number of days [2, 7, 10, etc.], i.e. [bi-daily, weekly, ten-daily, etc.].
E.g. one of the results for aggregation (sum) by 2 days would be a dataframe with 35 columns, see below:
id  bi_daily_1  bi_daily_2  ...  bi_daily_35
1   3           5           ...  2
2   0           0           ...  0
3   3           0           ...  0
4   5           1           ...  3
where:
bi_daily_1 = aggregation(day_1, day_2)
bi_daily_2 = aggregation(day_3, day_4) and so on...
Note: the real matrix shape is approx. (2000, 1500).
Use floor division of the column positions by the number of days to build the groups (df.shape[1] is the number of columns in the dataframe), then use groupby on these groups, specifying the axis as 1 (columns). Then just rename the columns.
days = 2
# positions 0,1 -> group 0, positions 2,3 -> group 1, ... then sum within each group
result = df.groupby([x // days for x in range(df.shape[1])], axis=1).sum()
result.columns = [f'bi_daily_{n + 1}' for n in result.columns]
>>> result
bi_daily_1 bi_daily_2
id
1 3 5
2 0 0
3 3 0
4 5 1
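Since the aggregation window should be dynamic, the same idea generalizes to any period length. A minimal sketch, assuming id is the index so every remaining column is a day column (the helper name aggregate_days is just for illustration; note that the axis=1 argument to groupby is deprecated in recent pandas versions):
def aggregate_days(df, days, prefix):
    # positions 0..days-1 -> group 0, days..2*days-1 -> group 1, ...
    out = df.groupby([i // days for i in range(df.shape[1])], axis=1).sum()
    out.columns = [f'{prefix}_{n + 1}' for n in out.columns]
    return out

weekly = aggregate_days(df, 7, 'weekly')         # 70 day columns -> 10
ten_daily = aggregate_days(df, 10, 'ten_daily')  # 70 day columns -> 7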
This could also work, using a list comprehension: split the dataframe into pairs of consecutive columns with iloc, sum each pair, then concat the results into a new dataframe.
day_1 day_2 day_3 day_4
0 1 2 4 1
1 0 0 0 0
2 0 3 0 0
3 3 2 1 0
(pd.concat([df.iloc[:, [i, i + 1]].sum(axis=1)
            for i in range(0, df.shape[1], 2)],
           axis=1)
   .add_prefix('bi_daily_'))
bi_daily_0 bi_daily_1
0 3 5
1 0 0
2 3 0
3 5 1
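If you want the 1-based column names the question asks for, a small variant of the same approach works (a sketch; slicing with i:i + 2 also tolerates a final group that is shorter than the rest):
out = pd.concat([df.iloc[:, i:i + 2].sum(axis=1)
                 for i in range(0, df.shape[1], 2)],
                axis=1)
out.columns = [f'bi_daily_{n + 1}' for n in range(out.shape[1])]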
I have the following dataframe:
name code 1 2 3 4 5 6 7 ... 155 days
0 Lari EH214 0 5 2 1 0 0 0 0 3
1 Suzi FK362 0 0 0 0 2 3 0 0 108
2 Jil LM121 0 0 4 2 1 0 0 0 5
...
I want to sum the columns from column 1 up to the column whose number appears in "days". For example,
for the first row, I will sum 3 days -> 0+5+2,
for the second row, 108 days,
for the third row, 5 days -> 0+4+2+1+0.
How can I do something like this?
I'm looking for a method.
For a vectorized solution, filter the day columns by position first, build a mask by comparing the column numbers against days with numpy broadcasting, replace the non-matching values with 0 using DataFrame.where, and finally sum:
# day columns only: skip name/code and drop the trailing days column
df1 = df.iloc[:, 2:-1]
# broadcast: True where the column number is <= that row's days
m = df1.columns.astype(int).to_numpy() <= df['days'].to_numpy()[:, None]
df['sum'] = df1.where(m, 0).sum(axis=1)
print(df)
name code 1 2 3 4 5 6 7 155 days sum
0 Lari EH214 0 5 2 1 0 0 0 0 3 7
1 Suzi FK362 0 0 0 0 2 3 0 0 108 5
2 Jil LM121 0 0 4 2 1 0 0 0 5 7
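For intuition, the broadcasted mask m for the sample data (columns 1..7 and 155 compared against days values 3, 108 and 5) looks like this:
print(m)
[[ True  True  True False False False False False]
 [ True  True  True  True  True  True  True False]
 [ True  True  True  True  True False False False]]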
IIUC, use:
df['sum'] = df.apply(lambda r: r.loc[1: r['days']].sum(), axis=1)
or, if the column names are strings:
df['sum'] = df.apply(lambda r: r.loc['1': str(r['days'])].sum(), axis=1)
output:
name code 1 2 3 4 5 6 7 155 days sum
0 Lari EH214 0 5 2 1 0 0 0 0 3 7
1 Suzi FK362 0 0 0 0 2 3 0 0 108 5
2 Jil LM121 0 0 4 2 1 0 0 0 5 7
I have a dataframe:
df
ID 0 1 2 3 4 ....
1 10 20 5 1 2 ....
2 3 4 NaN 10 1 ....
And I need to turn the cell values of columns 0, 1, 2, 3, 4, ... into column headers, filled with 1 for each ID where that value is present.
Desired Output:
ID 1 2 3 4 5 ... 10 20 ..
1 1 1 0 0 1 ... 1 1 ..
2 1 0 1 1 0 ... 1 0 ..
Note that some entries can be NaN.
How can I get the desired output?
Use DataFrame.set_index with DataFrame.stack to remove missing values, then create indicator columns with get_dummies, reduce to 1/0 by taking max over the first index level, and finally convert the column labels to integers:
df1 = (pd.get_dummies(df.set_index('ID').stack())
         .max(level=0)
         .rename(columns=int)
         .reset_index())
print(df1)
ID 1 2 3 4 5 10 20
0 1 1 1 0 0 1 1 1
1 2 1 0 1 1 0 1 0
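Note: DataFrame.max(level=0) was deprecated in pandas 1.3 and removed in 2.0; an equivalent sketch for newer versions uses an explicit groupby (dtype=int keeps the dummies numeric):
df1 = (pd.get_dummies(df.set_index('ID').stack(), dtype=int)
         .groupby(level=0).max()
         .rename(columns=int)
         .reset_index())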
EDIT:
print (df)
ID 0 1 2 3 4 5
0 1 10 20 5.0 1 2 5
1 2 3 4 NaN 10 1 2
If you use max, the output always contains 0/1 values (check the 5 column):
df1 = (pd.get_dummies(df.set_index('ID').stack())
         .max(level=0)
         .rename(columns=int)
         .reset_index())
print (df1)
ID 1 2 3 4 5 10 20
0 1 1 1 0 0 1 1 1
1 2 1 1 1 1 0 1 0
But if you use sum, it counts occurrences instead (check the 5 column):
df2 = (pd.get_dummies(df.set_index('ID').stack())
         .sum(level=0)
         .rename(columns=int)
         .reset_index())
print (df2)
ID 1 2 3 4 5 10 20
0 1 1 1 0 0 2 1 1
1 2 1 1 1 1 0 1 0
Another way, using melt and pd.crosstab:
df1 = df.melt('ID')
df_final = pd.crosstab(index=df1.ID, columns=df1.value).reset_index()
Out[673]:
value ID 1.0 2.0 3.0 4.0 5.0 10.0 20.0
0 1 1 1 0 0 1 1 1
1 2 1 0 1 1 0 1 0
Note: pd.crosstab counts frequencies by default, so duplicated values show up as counts greater than 1. If you want only a 1/0 indicator, just chain ge(1) and astype as follows:
pd.crosstab(index=df1.ID, columns=df1.value).ge(1).astype(int).reset_index()
For each unique user in the DataFrame, I would like to delete the row where val equals 1 for the first time, together with all of that user's earlier rows.
For instance, given the following DataFrame, I would like to get another dataframe that, for each user, drops the row where 1 first occurs in the "val" column and all rows before it.
user val
0 1 0
1 1 1
2 1 0
3 1 1
4 2 0
5 2 0
6 2 1
7 2 0
8 3 1
9 3 0
10 3 0
11 3 0
12 3 1
Expected output:
user val
0 1 0
1 1 1
2 2 0
3 3 0
4 3 0
5 3 0
6 3 1
Sample Data
import pandas as pd
s = [1,1,1,1,2,2,2,2,3,3,3,3,3]
t = [0,1,0,1,0,0,1,0,1,0,0,0,1]
df = pd.DataFrame(zip(s,t), columns=['user', 'val'])
Use groupby, checking cummax and shift, to build a mask that removes all rows before, and including, the first 1 in the 'val' column per user.
Assuming your values are either 1 or 0, it is also possible to create the mask with a double cumsum.
# eq(1).cummax() is True from the first 1 onward; shift() moves the mask down
# one row so the first 1 itself is excluded as well
m = df.groupby('user').val.apply(lambda x: x.eq(1).cummax().shift().fillna(False))
# m = df.groupby('user').val.apply(lambda x: x.cumsum().cumsum().gt(1))
df.loc[m]
Output:
user val
2 1 0
3 1 1
7 2 0
9 3 0
10 3 0
11 3 0
12 3 1
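On newer pandas versions, groupby().apply can return a MultiIndexed Series, which breaks the df.loc[m] lookup; a transform-based variant (a sketch) keeps the mask aligned with the original index:
m = (df.groupby('user')['val']
       .transform(lambda x: x.eq(1).cummax().shift().fillna(False))
       .astype(bool))
df.loc[m]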
I have original dataframe:
ID T value
1 0 1
1 4 3
2 0 0
2 4 1
2 7 3
For the missing T rows, value should be the same as in the previous row.
The output should be like:
ID T value
1 0 1
1 1 1
1 2 1
1 3 1
1 4 3
2 0 0
2 1 0
2 2 0
2 3 0
2 4 1
2 5 1
2 6 1
2 7 3
... ... ...
I tried a loop, but it takes a long time to process.
Any idea how to solve this for a large dataframe?
Thanks!
This solution requires unique integer values of T within each group.
Use groupby with a custom function: for each group, reindex over the full T range, then replace the NaNs in the value column by forward filling with ffill:
import numpy as np

df1 = (df.groupby('ID')[['T', 'value']]
         .apply(lambda x: x.set_index('T')
                           .reindex(np.arange(x['T'].min(), x['T'].max() + 1)))
         .ffill()
         .astype(int)
         .reset_index())
print(df1)
ID T value
0 1 0 1
1 1 1 1
2 1 2 1
3 1 3 1
4 1 4 3
5 2 0 0
6 2 1 0
7 2 2 0
8 2 3 0
9 2 4 1
10 2 5 1
11 2 6 1
12 2 7 3
If you get the error:
ValueError: cannot reindex from a duplicate axis
it means there are duplicated T values within a group, like:
print (df)
ID T value
0 1 0 1
1 1 4 3
2 2 0 0
3 2 4 1 <- T=4 is duplicated within group 2
4 2 4 3 <- T=4 is duplicated within group 2
5 2 7 3
The solution is to aggregate the values first so T is unique per group, e.g. by sum:
df = df.groupby(['ID', 'T'], as_index=False)['value'].sum()
print (df)
ID T value
0 1 0 1
1 1 4 3
2 2 0 0
3 2 4 4
4 2 7 3
I have a df like this:
ID Number
1 0
1 0
1 1
2 0
2 0
3 1
3 1
3 0
I want to put a 5 in the Total column for every row of any ID that has a 1 anywhere in its Number column, and a 0 for those that don't. For example, if the number "1" appears anywhere in the Number column for ID 1, I want to place a 5 in the Total column for every instance of that ID.
My desired output would look like this:
ID Number Total
1 0 5
1 0 5
1 1 5
2 0 0
2 0 0
3 1 5
3 1 5
3 0 5
I'm trying to think of a way to leverage applymap for this, but I'm not sure how to implement it.
Use transform to add a column to your df as a result of a groupby on 'ID':
In [6]:
df['Total'] = df.groupby('ID').transform(lambda x: 5 if (x == 1).any() else 0)
df
Out[6]:
ID Number Total
0 1 0 5
1 1 0 5
2 1 1 5
3 2 0 0
4 2 0 0
5 3 1 5
6 3 1 5
7 3 0 5
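Selecting the Number column explicitly avoids relying on assigning a single-column DataFrame back to df; a minimal variant sketch:
df['Total'] = df.groupby('ID')['Number'].transform(lambda x: 5 if x.eq(1).any() else 0)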
You can use DataFrame.groupby() on the ID column, take max() of the Number column, turn that into a dictionary, and then use it to create the 'Total' column. Example -
grouped = df.groupby('ID')['Number'].max().to_dict()
df['Total'] = df.apply((lambda row:5 if grouped[row['ID']] else 0), axis=1)
Demo -
In [44]: df
Out[44]:
ID Number
0 1 0
1 1 0
2 1 1
3 2 0
4 2 0
5 3 1
6 3 1
7 3 0
In [56]: grouped = df.groupby('ID')['Number'].max().to_dict()
In [58]: df['Total'] = df.apply((lambda row:5 if grouped[row['ID']] else 0), axis=1)
In [59]: df
Out[59]:
ID Number Total
0 1 0 5
1 1 0 5
2 1 1 5
3 2 0 0
4 2 0 0
5 3 1 5
6 3 1 5
7 3 0 5
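An alternative to the dictionary + apply step is to map the per-ID max straight back onto the ID column and scale it (a sketch; it relies on Number holding only 0/1 values):
df['Total'] = df['ID'].map(df.groupby('ID')['Number'].max()).mul(5)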