Python - Aggregate on week of the month basis and compare

I am working on a small CSV data set where the values are indexed by week-of-the-month occurrence. What I want is to aggregate all of the weeks in sequence, excluding the current week (the last column), to compute a weekly average of the remaining data (e.g. average ...10/1 + 11/1 + 12/1... to get week 1 data).
The data is available in this format:
char 2019/11/1 2019/11/2 2019/11/3 2019/11/4 2019/11/5 2019/12/1 2019/12/2 2019/12/3 2019/12/4 2019/12/5 2020/1/1
A 1477 1577 1401 773 310 1401 1464 1417 909 712 289
B 1684 1485 1220 894 297 1618 1453 1335 920 772 275
C 37 10 1 3 6 17 6 6 3 2 1
D 2041 1883 1302 1136 376 2175 1729 1167 960 745 278
E 6142 5991 5499 3883 1036 4949 6187 5760 3974 2339 826
F 842 846 684 462 140 789 802 134 386 251 94
The last column (2020/1/1) will later be compared with the mean of all aggregated values from week one. The desired output is something like this:
char W1 W2 W3 W4 W5 2020/1/1
A 1439 1520.5 1409 841 511 289
B 1651 1469 1277.5 907 534.5 275
C 27 8 3.5 3 4 1
D 2108 1806 1234.5 1048 560.5 278
E 5545.5 6089 5629.5 3928.5 1687.5 826
F 815.5 824 409 424 195.5 94
Is it possible to use rolling or resample in such a case? Any ideas on how to do it?

I believe you need DataFrame.resample by weeks:
df = df.set_index(['char', '2020/1/1'])  # keep the identifier columns out of the aggregation
df.columns = pd.to_datetime(df.columns, format='%Y/%m/%d')  # parse column labels as dates
df = df.resample('W', axis=1).mean()  # weekly mean across columns
print (df)
2019-11-03 2019-11-10 2019-11-17 2019-11-24 2019-12-01 \
char 2020/1/1
A 289 1485.000000 541.5 NaN NaN 1401.0
B 275 1463.000000 595.5 NaN NaN 1618.0
C 1 16.000000 4.5 NaN NaN 17.0
D 278 1742.000000 756.0 NaN NaN 2175.0
E 826 5877.333333 2459.5 NaN NaN 4949.0
F 94 790.666667 301.0 NaN NaN 789.0
2019-12-08
char 2020/1/1
A 289 1125.50
B 275 1120.00
C 1 4.25
D 278 1150.25
E 826 4565.00
F 94 393.25
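Note: in pandas 2.x resample no longer accepts axis=1, so the same idea needs a transpose. A minimal sketch, assuming the same df as above:
# transpose so the dates sit on the index, resample there, transpose back
df = df.set_index(['char', '2020/1/1'])
df.columns = pd.to_datetime(df.columns, format='%Y/%m/%d')
df = df.T.resample('W').mean().T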
EDIT: If you want to group the first 7 days of each month into a separate group, use:
df = df.set_index(['char', '2020/1/1'])
c = pd.to_datetime(df.columns, format='%Y/%m/%d')
# week-of-month: days 1-7 -> W1, days 8-14 -> W2, ...
df.columns = [f'{y}/{m}/W{w}' for w, m, y in zip((c.day - 1) // 7 + 1, c.month, c.year)]
df = df.groupby(df.columns, axis=1).mean()
print (df)
2019/11/W1 2019/12/W1
char 2020/1/1
A 289 1107.6 1180.6
B 275 1116.0 1219.6
C 1 11.4 6.8
D 278 1347.6 1355.2
E 826 4510.2 4641.8
F 94 594.8 472.4
EDIT1: For grouping by day and year use DatetimeIndex.strftime:
df = df.set_index(['char', '2020/1/1'])
df.columns = pd.to_datetime(df.columns, format='%Y/%m/%d').strftime('%d-%Y')
df = df.groupby(df.columns, axis=1).mean()
print (df)
01-2019 02-2019 03-2019 04-2019 05-2019
char 2020/1/1
A 289 1439.0 1520.5 1409.0 841.0 511.0
B 275 1651.0 1469.0 1277.5 907.0 534.5
C 1 27.0 8.0 3.5 3.0 4.0
D 278 2108.0 1806.0 1234.5 1048.0 560.5
E 826 5545.5 6089.0 5629.5 3928.5 1687.5
F 94 815.5 824.0 409.0 424.0 195.5

Here is a way using groupby:
m = df.set_index(['char', '2020/1/1']).rename(columns=lambda x: pd.to_datetime(x))
m.groupby(m.columns.week, axis=1).mean().add_prefix('W_').reset_index()
char 2020/1/1 W_44 W_45 W_48 W_49
0 A 289 1485.000000 541.5 1401.0 1125.50
1 B 275 1463.000000 595.5 1618.0 1120.00
2 C 1 16.000000 4.5 17.0 4.25
3 D 278 1742.000000 756.0 2175.0 1150.25
4 E 826 5877.333333 2459.5 4949.0 4565.00
5 F 94 790.666667 301.0 789.0 393.25
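Note: DatetimeIndex.week and groupby(..., axis=1) are deprecated or removed in recent pandas. A sketch of the same grouping using isocalendar() and a transpose, assuming the df from the question:
m = df.set_index(['char', '2020/1/1']).rename(columns=lambda x: pd.to_datetime(x))
weeks = m.columns.isocalendar().week  # replaces the deprecated .week attribute
m = m.T.groupby(weeks.values).mean().T.add_prefix('W_').reset_index()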

Related

python pandas how to read csv file by block

I'm trying to read a CSV file, block by block.
CSV looks like:
No.,time,00:00:00,00:00:01,00:00:02,00:00:03,00:00:04,00:00:05,00:00:06,00:00:07,00:00:08,00:00:09,00:00:0A,...
1,2021/09/12 02:16,235,610,345,997,446,130,129,94,555,274,4,
2,2021/09/12 02:17,364,210,371,341,294,87,179,106,425,262,3,
1434,2021/09/12 02:28,269,135,372,262,307,73,86,93,512,283,4,
1435,2021/09/12 02:29,281,207,688,322,233,75,69,85,663,276,2,
No.,time,00:00:10,00:00:11,00:00:12,00:00:13,00:00:14,00:00:15,00:00:16,00:00:17,00:00:18,00:00:19,00:00:1A,...
1,2021/09/12 02:16,255,619,200,100,453,456,4,19,56,23,4,
2,2021/09/12 02:17,368,21,37,31,24,8,19,1006,4205,2062,30,
1434,2021/09/12 02:28,2689,1835,3782,2682,307,743,256,741,52,23,6,
1435,2021/09/12 02:29,2281,2047,6848,3522,2353,755,659,885,6863,26,36,
Blocks start with No., and data rows follow.
def run(sock, delay, zipobj):
    zf = zipfile.ZipFile(zipobj)
    for f in zf.namelist():
        print(zf.filename)
        print("csv name: ", f)
        df = pd.read_csv(zf.open(f), skiprows=[0,1,2,3,4,5])  # nrows=1435? (but what about the next blocks?)
        print(df, '\n')
        date_pattern = '%Y/%m/%d %H:%M'
        df['epoch'] = df.apply(lambda row: int(time.mktime(time.strptime(row.time, date_pattern))), axis=1)  # create epoch as a column
        tuples = []  # data will be saved in a list
        formated_str = 'perf.type.serial.object.00.00.00.TOTAL_IOPS'
        for each_column in list(df.columns)[2:-1]:
            for e in zip(list(df['epoch']), list(df[each_column])):
                each_column = each_column.replace("X", '')
                #print(f"perf.type.serial.LDEV.{each_column}.TOTAL_IOPS",e)
                tuples.append((f"perf.type.serial.LDEV.{each_column}.TOTAL_IOPS", e))
        package = pickle.dumps(tuples, 1)
        size = struct.pack('!L', len(package))
        sock.sendall(size)
        sock.sendall(package)
        time.sleep(delay)
Many thanks for help,
Load your file with pd.read_csv and start a new block each time the value in the first column is 'No.'. Use groupby to iterate over each block and create a new dataframe.
data = pd.read_csv('data.csv', header=None)
dfs = []
for _, df in data.groupby(data[0].eq('No.').cumsum()):
    df = pd.DataFrame(df.iloc[1:].values, columns=df.iloc[0])
    dfs.append(df.rename_axis(columns=None))
Output:
# First block
>>> dfs[0]
No. time 00:00:00 00:00:01 00:00:02 00:00:03 00:00:04 00:00:05 00:00:06 00:00:07 00:00:08 00:00:09 00:00:0A ...
0 1 2021/09/12 02:16 235 610 345 997 446 130 129 94 555 274 4 NaN
1 2 2021/09/12 02:17 364 210 371 341 294 87 179 106 425 262 3 NaN
2 1434 2021/09/12 02:28 269 135 372 262 307 73 86 93 512 283 4 NaN
3 1435 2021/09/12 02:29 281 207 688 322 233 75 69 85 663 276 2 NaN
# Second block
>>> dfs[1]
No. time 00:00:10 00:00:11 00:00:12 00:00:13 00:00:14 00:00:15 00:00:16 00:00:17 00:00:18 00:00:19 00:00:1A ...
0 1 2021/09/12 02:16 255 619 200 100 453 456 4 19 56 23 4 NaN
1 2 2021/09/12 02:17 368 21 37 31 24 8 19 1006 4205 2062 30 NaN
2 1434 2021/09/12 02:28 2689 1835 3782 2682 307 743 256 741 52 23 6 NaN
3 1435 2021/09/12 02:29 2281 2047 6848 3522 2353 755 659 885 6863 26 36 NaN
and so on.
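One caveat worth checking: because each block is rebuilt from raw values, all columns come back as object dtype. A small sketch to restore numeric types, assuming the 'No.' and 'time' column names from the sample:
for df in dfs:
    num_cols = df.columns.drop(['No.', 'time'])  # everything except the label columns
    df[num_cols] = df[num_cols].apply(pd.to_numeric, errors='coerce')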
Sorry, I can't find a correct way to fit this into my code:
def run(sock, delay, zipobj):
    zf = zipfile.ZipFile(zipobj)
    for f in zf.namelist():
        print("using zip :", zf.filename)
        str = f  # note: this shadows the built-in str
        myobject = re.search(r'(^[a-zA-Z]{4})_.*', str)
        Objects = myobject.group(1)
        if Objects == 'LDEV':
            metric = re.search('.*LDEV_(.*)/.*', str)
            metric = metric.group(1)
        elif Objects == 'Port':
            metric = re.search('.*/(Port_.*).csv', str)
            metric = metric.group(1)
        else:
            print("None")
        print("using csv : ", f)
        #df = pd.read_csv(zf.open(f), skiprows=[0,1,2,3,4,5])
        data = pd.read_csv(zf.open(f), header=None, skiprows=[0,1,2,3,4,5])
        dfs = []
        for _, df in data.groupby(data[0].eq('No.').cumsum()):
            df = pd.DataFrame(df.iloc[1:].values, columns=df.iloc[0])
            dfs.append(df.rename_axis(columns=None))
        print("here")
        date_pattern = '%Y/%m/%d %H:%M'
        df['epoch'] = df.apply(lambda row: int(time.mktime(time.strptime(row.time, date_pattern))), axis=1)  # create epoch as a column
        tuples = []  # data will be saved in a list
        #formated_str='perf.type.serial.object.00.00.00.TOTAL_IOPS'
        for each_column in list(df.columns)[2:-1]:
            for e in zip(list(df['epoch']), list(df[each_column])):
                each_column = each_column.replace("X", '')
                tuples.append((f"perf.type.serial.{Objects}.{each_column}.{metric}", e))
        package = pickle.dumps(tuples, 1)
        size = struct.pack('!L', len(package))
        sock.sendall(size)
        sock.sendall(package)
        time.sleep(delay)
thanks for your help,
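One observation on the code above (an untested sketch, not a confirmed fix): the epoch/tuples processing still runs on the loop variable df, so only the last block gets handled. Moving it into a loop over dfs should cover every block, assuming Objects, metric and date_pattern are set as above:
tuples = []
for df in dfs:  # process every reconstructed block, not just the last one
    df['epoch'] = df.apply(lambda row: int(time.mktime(time.strptime(row.time, date_pattern))), axis=1)
    for each_column in list(df.columns)[2:-1]:
        for e in zip(list(df['epoch']), list(df[each_column])):
            tuples.append((f"perf.type.serial.{Objects}.{each_column}.{metric}", e))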

pandas transform data into multi step time series based on a condition

I have a dataframe like the one below, and I'm creating a multistep sequence of data using the for loop below, but I want to apply the logic at a customer level.
Dataframe :
Date Customer Price
1/1/2019 A 142
1/2/2019 A 123
1/3/2019 A 342
1/4/2019 A 232
1/5/2019 A 657
1/6/2019 B 875
1/7/2019 B 999
1/8/2019 B 434
1/9/2019 B 564
1/10/2019 B 345
1/10/2019 B 798
The for loop below creates sequences of data with a rolling window of 1.
data = np.array(data)
X_data, y_data = [], []
for i in range(2, len(data) - 2):
    X_data.append(data[i-2:i])
    y_data.append(data[i:i+2])
The X_data and y_data arrays should look like this:
X_data(independent variables) y_data(target)
customer 0 1 0 1
A 142 123 342 232
A 123 342 232 657
B 875 999 434 564
B 999 434 564 345
B 434 564 345 798
Please suggest how to do this. Thanks in advance.
Use DataFrame.shift() to shift the index by the desired number of rows for the rolling data:
def get_rolling_data(row):
    n = 2
    for i in range(n):
        row[f'x_data.{i}'] = row.shift(0 - i).Price
        row[f'y_data.{i}'] = row.shift(0 - n - i).Price
    return row

df_res = df.groupby(['Customer']).apply(get_rolling_data)
print(df_res.dropna())
Date Customer Price x_data.0 x_data.1 y_data.0 y_data.1
0 1/1/2019 A 142 142 123.0 342.0 232.0
1 1/2/2019 A 123 123 342.0 232.0 657.0
5 1/6/2019 B 875 875 999.0 434.0 564.0
6 1/7/2019 B 999 999 434.0 564.0 345.0
7 1/8/2019 B 434 434 564.0 345.0 798.0
The result above is a DataFrame. You can further extract the required data if you need np.array instead.
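If you do need plain numpy arrays, a sketch that reuses the original windowing loop per customer (note the upper bound is len(prices) - 1 here so the last window is included, matching the desired output):
import numpy as np

X_data, y_data = [], []
for _, g in df.groupby('Customer', sort=False):
    prices = g['Price'].to_numpy()
    for i in range(2, len(prices) - 1):
        X_data.append(prices[i-2:i])  # two past values as features
        y_data.append(prices[i:i+2])  # next two values as targets
X_data, y_data = np.array(X_data), np.array(y_data)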

Error while dropping row from dataframe based on value comparison

I have the following unique values in a dataframe column:
['1473' '1093' '1346' '1324' 'NA' '1129' '58' '847' '54' '831' '816']
I want to drop rows which have 'NA' in this column.
testData = testData[testData.BsmtUnfSF != "NA"]
and got this error:
TypeError: invalid type comparison
Then I tried
testData = testData[testData.BsmtUnfSF != np.NAN]
It doesn't give any error but it doesn't drop rows.
How to solve this issue?
Here is how you can do it. Just replace column with the column name you want.
import pandas as pd
import numpy as np
df = pd.DataFrame({"column": [1,2,3,np.nan,6]})
df = df[np.isfinite(df['column'])]
You could use dropna:
testData = testData.dropna(subset=['BsmtUnfSF'])
Assuming your DataFrame:
>>> df
col1
0 1473
1 1093
2 1346
3 1324
4 NaN
5 1129
6 58
7 847
8 54
9 831
10 816
You have multiple solutions:
>>> df[pd.notnull(df['col1'])]
col1
0 1473
1 1093
2 1346
3 1324
5 1129
6 58
7 847
8 54
9 831
10 816
>>> df[df.col1.notnull()]
# df[df['col1'].notnull()]
col1
0 1473
1 1093
2 1346
3 1324
5 1129
6 58
7 847
8 54
9 831
10 816
>>> df.dropna(subset=['col1'])
col1
0 1473
1 1093
2 1346
3 1324
5 1129
6 58
7 847
8 54
9 831
10 816
>>> df.dropna()
col1
0 1473
1 1093
2 1346
3 1324
5 1129
6 58
7 847
8 54
9 831
10 816
>>> df[~df.col1.isnull()]
col1
0 1473
1 1093
2 1346
3 1324
5 1129
6 58
7 847
8 54
9 831
10 816
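Note that in the question the column apparently holds the literal string 'NA' in an object-dtype column, which is why the != comparison raised TypeError, and why comparing to np.nan dropped nothing (NaN != anything is always True). A sketch of converting first, so the NaN-based answers above apply:
# turn the 'NA' strings into real NaN, then drop them
testData['BsmtUnfSF'] = pd.to_numeric(testData['BsmtUnfSF'], errors='coerce')
testData = testData.dropna(subset=['BsmtUnfSF'])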

exclude row for rolling mean calculation in pandas

I am looking for a Pandas way to solve this. I have a DataFrame:
df
A RM
0 384 NaN
1 376 380.0
2 399 387.5
3 333 366.0
4 393 363.0
5 323 358.0
6 510 416.5
7 426 468.0
8 352 389.0
I want to check if the value in df['A'] is greater than the previous RM value; if so, the new Status column should be 0, else 1:
A RM Status
0 384 NaN 0
1 376 380.0 1
2 399 387.5 0
3 333 366.0 1
4 393 363.0 0
5 323 358.0 1
6 510 416.5 0
7 426 468.0 0
8 352 389.0 1
I suppose I need to use shift with numpy.where, but I am not getting the desired result.
import pandas as pd
import numpy as np
df=pd.DataFrame([384,376,399,333,393,323,510,426,352], columns=['A'])
df['RM']=df['A'].rolling(window=2,center=False).mean()
df['Status'] = np.where(df.A > df.RM.shift(1).rolling(window=2, center=False).mean(), 0, 1)
Finally, I apply the rolling mean:
df['AverageMean'] = df[df['Status'] == 1]['A'].rolling(window=2, center=False).mean()
Just a simple shift:
df['Status'] = (df.A <= df.RM.fillna(9999).shift()).astype(int)
df
Out[347]:
A RM Status
0 384 NaN 0
1 376 380.0 1
2 399 387.5 0
3 333 366.0 1
4 393 363.0 0
5 323 358.0 1
6 510 416.5 0
7 426 468.0 0
8 352 389.0 1
I assume that when you compare with NaN, the result should always be 1:
df['Status'] = (df.A < df.RM.fillna(df.A.max()+1).shift(1)).astype(int)
A RM Status
0 384 NaN 0
1 376 380.0 1
2 399 387.5 0
3 333 366.0 1
4 393 363.0 0
5 323 358.0 1
6 510 416.5 0
7 426 468.0 0
8 352 389.0 1
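Why the fillna before the shift matters in both answers (a quick check against the desired output): shift moves the NaN in RM[0] down to row 1, and 376 <= NaN evaluates to False, so without the fill row 1 would get Status 0 instead of the desired 1:
df['Status'] = (df.A <= df.RM.shift()).astype(int)  # row 1 wrongly becomes 0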

Splitting the header into multiple headers in DataFrame

I have a DataFrame where I need to split the header into multiple rows as headers for the same DataFrame.
My DataFrame looks as follows:
gene ALL_ID_1 AML_ID_1 AML_ID_2 AML_ID_3 AML_ID_4 AML_ID_5 Stroma_ID_1 Stroma_ID_2 Stroma_ID_3 Stroma_ID_4 Stroma_ID_5 Stroma_CR_Pat_4 Stroma_CR_Pat_5 Stroma_CR_Pat_6 Stroma_CR_Pat_7 Stroma_CR_Pat_8
ENSG 8 1 11 5 10 0 628 542 767 578 462 680 513 968 415 623
ENSG 0 0 1 0 0 0 0 28 1 3 0 1 4 0 0 0
ENSG 661 1418 2580 6817 14727 5968 9 3 5 9 2 9 3 3 5 1
ENSG 20 315 212 8 790 471 1283 2042 1175 2839 1110 857 1880 1526 2262 2624
ENSG 11 26 24 9 11 2 649 532 953 463 468 878 587 245 722 484
And I want the above header to be split as follows:
network ID ID REL
node B_ALL AML Stroma
hemi 1 1 2 3 4 5 1 2 3 4 5 6 7 8 9 10
ENSG 8 1 11 5 10 0 628 542 767 578 462 680 513 968 415 623
ENSG 0 0 1 0 0 0 0 28 1 3 0 1 4 0 0 0
ENSG 661 1418 2580 6817 14727 5968 9 3 5 9 2 9 3 3 5 1
ENSG 20 315 212 8 790 471 1283 2042 1175 2839 1110 857 1880 1526 2262 2624
ENSG 11 26 24 9 11 2 649 532 953 463 468 878 587 245 722 484
Any help is greatly appreciated.
This is probably not the best minimal example, since very few people have the subject knowledge to understand what network, node and hemi mean in your context.
You just need to create your MultiIndex and replace your column index with the one you created:
There are 3 rules in your example:
1. Whenever 'Stroma' is found, the column belongs to REL; otherwise it belongs to ID.
2. node is the first field of the initial column names.
3. hemi is the last field of the initial column names.
Then, just code away:
df.columns = pd.MultiIndex.from_tuples(
    zip(np.where(df.columns.str.find('Stroma') != -1, 'REL', 'ID'),
        df.columns.map(lambda x: x.split('_')[0]),
        df.columns.map(lambda x: x.split('_')[-1])),
    names=['network', 'node', 'hemi'])
print(df)
network ID REL \
node ALL AML Stroma
hemi 1 1 2 3 4 5 1 2 3 4 5
gene
ENSG 8 1 11 5 10 0 628 542 767 578 462
ENSG 0 0 1 0 0 0 0 28 1 3 0
ENSG 661 1418 2580 6817 14727 5968 9 3 5 9 2
ENSG 20 315 212 8 790 471 1283 2042 1175 2839 1110
ENSG 11 26 24 9 11 2 649 532 953 463 468
network
node
hemi 4 5 6 7 8
gene
ENSG 680 513 968 415 623
ENSG 1 4 0 0 0
ENSG 9 3 3 5 1
ENSG 857 1880 1526 2262 2624
ENSG 878 587 245 722 484
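Once the MultiIndex is in place, level-based selection works as usual. A short usage sketch, assuming the df above:
rel = df['REL']                           # all columns under network == 'REL'
aml = df.xs('AML', axis=1, level='node')  # all columns for node == 'AML'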
