I'm trying to read a CSV file, block by block.
CSV looks like:
No.,time,00:00:00,00:00:01,00:00:02,00:00:03,00:00:04,00:00:05,00:00:06,00:00:07,00:00:08,00:00:09,00:00:0A,...
1,2021/09/12 02:16,235,610,345,997,446,130,129,94,555,274,4,
2,2021/09/12 02:17,364,210,371,341,294,87,179,106,425,262,3,
1434,2021/09/12 02:28,269,135,372,262,307,73,86,93,512,283,4,
1435,2021/09/12 02:29,281,207,688,322,233,75,69,85,663,276,2,
No.,time,00:00:10,00:00:11,00:00:12,00:00:13,00:00:14,00:00:15,00:00:16,00:00:17,00:00:18,00:00:19,00:00:1A,...
1,2021/09/12 02:16,255,619,200,100,453,456,4,19,56,23,4,
2,2021/09/12 02:17,368,21,37,31,24,8,19,1006,4205,2062,30,
1434,2021/09/12 02:28,2689,1835,3782,2682,307,743,256,741,52,23,6,
1435,2021/09/12 02:29,2281,2047,6848,3522,2353,755,659,885,6863,26,36,
Blocks start with No., and data rows follow.
def run(sock, delay, zipobj):
    zf = zipfile.ZipFile(zipobj)
    for f in zf.namelist():
        print(zf.filename)
        print("csv name: ", f)
        df = pd.read_csv(zf.open(f), skiprows=[0, 1, 2, 3, 4, 5])  # nrows=1435? (but for the next blocks?)
        print(df, '\n')
        date_pattern = '%Y/%m/%d %H:%M'
        df['epoch'] = df.apply(lambda row: int(time.mktime(time.strptime(row.time, date_pattern))), axis=1)  # create epoch as a column
        tuples = []  # data will be saved in a list
        formated_str = 'perf.type.serial.object.00.00.00.TOTAL_IOPS'
        for each_column in list(df.columns)[2:-1]:
            for e in zip(list(df['epoch']), list(df[each_column])):
                each_column = each_column.replace("X", '')
                #print(f"perf.type.serial.LDEV.{each_column}.TOTAL_IOPS", e)
                tuples.append((f"perf.type.serial.LDEV.{each_column}.TOTAL_IOPS", e))
        package = pickle.dumps(tuples, 1)
        size = struct.pack('!L', len(package))
        sock.sendall(size)
        sock.sendall(package)
        time.sleep(delay)
Many thanks for any help,
Load your file with pd.read_csv and start a new block each time the value in the first column is No.. Use groupby to iterate over the blocks and build a new dataframe for each one.
import pandas as pd

data = pd.read_csv('data.csv', header=None)
dfs = []
for _, df in data.groupby(data[0].eq('No.').cumsum()):
    df = pd.DataFrame(df.iloc[1:].values, columns=df.iloc[0])
    dfs.append(df.rename_axis(columns=None))
Output:
# First block
>>> dfs[0]
No. time 00:00:00 00:00:01 00:00:02 00:00:03 00:00:04 00:00:05 00:00:06 00:00:07 00:00:08 00:00:09 00:00:0A ...
0 1 2021/09/12 02:16 235 610 345 997 446 130 129 94 555 274 4 NaN
1 2 2021/09/12 02:17 364 210 371 341 294 87 179 106 425 262 3 NaN
2 1434 2021/09/12 02:28 269 135 372 262 307 73 86 93 512 283 4 NaN
3 1435 2021/09/12 02:29 281 207 688 322 233 75 69 85 663 276 2 NaN
# Second block
>>> dfs[1]
No. time 00:00:10 00:00:11 00:00:12 00:00:13 00:00:14 00:00:15 00:00:16 00:00:17 00:00:18 00:00:19 00:00:1A ...
0 1 2021/09/12 02:16 255 619 200 100 453 456 4 19 56 23 4 NaN
1 2 2021/09/12 02:17 368 21 37 31 24 8 19 1006 4205 2062 30 NaN
2 1434 2021/09/12 02:28 2689 1835 3782 2682 307 743 256 741 52 23 6 NaN
3 1435 2021/09/12 02:29 2281 2047 6848 3522 2353 755 659 885 6863 26 36 NaN
and so on.
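A minimal, self-contained version of this approach, using an in-memory sample of the CSV (the sample rows are stand-ins for the real file):

```python
import io

import pandas as pd

csv_text = """No.,time,00:00:00,00:00:01
1,2021/09/12 02:16,235,610
2,2021/09/12 02:17,364,210
No.,time,00:00:10,00:00:11
1,2021/09/12 02:16,255,619
2,2021/09/12 02:17,368,21
"""

# Read everything headerless; a new block starts whenever column 0
# holds the literal marker 'No.', so the cumulative sum of that mask
# gives each block its own group key.
data = pd.read_csv(io.StringIO(csv_text), header=None)
dfs = []
for _, block in data.groupby(data[0].eq('No.').cumsum()):
    # Promote the block's first row to column labels.
    block = pd.DataFrame(block.iloc[1:].values, columns=block.iloc[0])
    dfs.append(block.rename_axis(columns=None))
```

Note that because the header rows are mixed in with the data, every column comes through as strings; convert the value columns with pd.to_numeric afterwards if you need numbers.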
Sorry, I can't find a correct way to use this with my code:
def run(sock, delay, zipobj):
    zf = zipfile.ZipFile(zipobj)
    for f in zf.namelist():
        print("using zip :", zf.filename)
        name = f  # renamed from 'str' to avoid shadowing the builtin
        myobject = re.search(r'(^[a-zA-Z]{4})_.*', name)
        Objects = myobject.group(1)
        if Objects == 'LDEV':
            metric = re.search('.*LDEV_(.*)/.*', name)
            metric = metric.group(1)
        elif Objects == 'Port':
            metric = re.search('.*/(Port_.*).csv', name)
            metric = metric.group(1)
        else:
            print("None")
        print("using csv : ", f)
        #df = pd.read_csv(zf.open(f), skiprows=[0,1,2,3,4,5])
        data = pd.read_csv(zf.open(f), header=None, skiprows=[0, 1, 2, 3, 4, 5])
        dfs = []
        for _, df in data.groupby(data[0].eq('No.').cumsum()):
            df = pd.DataFrame(df.iloc[1:].values, columns=df.iloc[0])
            dfs.append(df.rename_axis(columns=None))
        print("here")
        date_pattern = '%Y/%m/%d %H:%M'
        df['epoch'] = df.apply(lambda row: int(time.mktime(time.strptime(row.time, date_pattern))), axis=1)  # create epoch as a column
        tuples = []  # data will be saved in a list
        #formated_str = 'perf.type.serial.object.00.00.00.TOTAL_IOPS'
        for each_column in list(df.columns)[2:-1]:
            for e in zip(list(df['epoch']), list(df[each_column])):
                each_column = each_column.replace("X", '')
                tuples.append((f"perf.type.serial.{Objects}.{each_column}.{metric}", e))
        package = pickle.dumps(tuples, 1)
        size = struct.pack('!L', len(package))
        sock.sendall(size)
        sock.sendall(package)
        time.sleep(delay)
thanks for your help,
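As an aside, the row-wise strptime/mktime apply in the snippet above can be vectorized with pd.to_datetime. A sketch (note: this treats the naive timestamps as UTC, whereas time.mktime uses the local timezone, so the absolute values may differ from the original code's):

```python
import pandas as pd

df = pd.DataFrame({'time': ['2021/09/12 02:16', '2021/09/12 02:17']})

# Parse the whole column in one call, then convert nanoseconds since
# the epoch to whole seconds.
ts = pd.to_datetime(df['time'], format='%Y/%m/%d %H:%M')
df['epoch'] = ts.astype('int64') // 10**9
```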
I have the following unique values in a dataframe column:
['1473' '1093' '1346' '1324' 'NA' '1129' '58' '847' '54' '831' '816']
I want to drop the rows that have 'NA' in this column. I tried
testData = testData[testData.BsmtUnfSF != "NA"]
and got the error
TypeError: invalid type comparison
Then I tried
testData = testData[testData.BsmtUnfSF != np.NAN]
It doesn't give any error, but it doesn't drop any rows.
How to solve this issue?
Here is how you can do it. Just replace column with the name of the column you want.
import pandas as pd
import numpy as np
df = pd.DataFrame({"column": [1,2,3,np.nan,6]})
df = df[np.isfinite(df['column'])]
You could use dropna:
testData = testData.dropna(subset=['BsmtUnfSF'])
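Whether the string comparison or the NaN-aware tools apply depends on how the file was read: by default read_csv turns the token NA into a real NaN, while keep_default_na=False keeps it as the literal string 'NA'. A sketch of both cases (the column name matches the question; the data is a stand-in):

```python
import io

import pandas as pd

csv_text = "BsmtUnfSF\n1473\n1093\nNA\n58\n"

# Default: the 'NA' token becomes NaN, so use the NaN-aware dropna.
df = pd.read_csv(io.StringIO(csv_text))
cleaned = df.dropna(subset=['BsmtUnfSF'])

# keep_default_na=False: the column keeps the literal string 'NA',
# and the plain string comparison from the question works.
df_str = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)
cleaned_str = df_str[df_str['BsmtUnfSF'] != 'NA']
```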
Assuming your DataFrame is:
>>> df
col1
0 1473
1 1093
2 1346
3 1324
4 NaN
5 1129
6 58
7 847
8 54
9 831
10 816
You have multiple solutions:
>>> df[pd.notnull(df['col1'])]
col1
0 1473
1 1093
2 1346
3 1324
5 1129
6 58
7 847
8 54
9 831
10 816
>>> df[df.col1.notnull()]
# df[df['col1'].notnull()]
col1
0 1473
1 1093
2 1346
3 1324
5 1129
6 58
7 847
8 54
9 831
10 816
>>> df.dropna(subset=['col1'])
col1
0 1473
1 1093
2 1346
3 1324
5 1129
6 58
7 847
8 54
9 831
10 816
>>> df.dropna()
col1
0 1473
1 1093
2 1346
3 1324
5 1129
6 58
7 847
8 54
9 831
10 816
>>> df[~df.col1.isnull()]
col1
0 1473
1 1093
2 1346
3 1324
5 1129
6 58
7 847
8 54
9 831
10 816
I am looking for a Pandas way to solve this. I have a DataFrame:
df
A RM
0 384 NaN
1 376 380.0
2 399 387.5
3 333 366.0
4 393 363.0
5 323 358.0
6 510 416.5
7 426 468.0
8 352 389.0
I want a new column Status that is 0 if the value in df['A'] is greater than the previous row's RM value, else 1:
A RM Status
0 384 NaN 0
1 376 380.0 1
2 399 387.5 0
3 333 366.0 1
4 393 363.0 0
5 323 358.0 1
6 510 416.5 0
7 426 468.0 0
8 352 389.0 1
I suppose I need to use shift with numpy.where, but I am not getting the desired result.
import pandas as pd
import numpy as np
df=pd.DataFrame([384,376,399,333,393,323,510,426,352], columns=['A'])
df['RM']=df['A'].rolling(window=2,center=False).mean()
df['Status'] = np.where((df.A > df.RM.shift(1).rolling(window=2,center=False).mean()) , 0, 1)
Finally, applying the rolling mean:
df.AverageMean=df[df['Status'] == 1]['A'].rolling(window=2,center=False).mean()
Just a simple shift:
df['Status']=(df.A<=df.RM.fillna(9999).shift()).astype(int)
df
Out[347]:
A RM Status
0 384 NaN 0
1 376 380.0 1
2 399 387.5 0
3 333 366.0 1
4 393 363.0 0
5 323 358.0 1
6 510 416.5 0
7 426 468.0 0
8 352 389.0 1
I assume that when you compare with NaN the result should always be 1:
df['Status'] = (df.A < df.RM.fillna(df.A.max()+1).shift(1)).astype(int)
A RM Status
0 384 NaN 0
1 376 380.0 1
2 399 387.5 0
3 333 366.0 1
4 393 363.0 0
5 323 358.0 1
6 510 416.5 0
7 426 468.0 0
8 352 389.0 1
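For completeness, the np.where variant the question asked about can be made to match the desired output with the same fillna trick: fill RM's leading NaN with +inf before the shift (so row 1 gets Status 1) and fill the NaN the shift itself introduces with -inf (so row 0 gets Status 0). A sketch:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([384, 376, 399, 333, 393, 323, 510, 426, 352], columns=['A'])
df['RM'] = df['A'].rolling(window=2).mean()

# Previous row's RM, with both NaN sources filled so the comparison
# yields the desired 0/1 pattern at the edges.
prev_rm = df['RM'].fillna(np.inf).shift().fillna(-np.inf)
df['Status'] = np.where(df['A'] > prev_rm, 0, 1)
```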
I have a DataFrame where I need to split the header into multiple rows as headers for the same DataFrame.
The DataFrame looks like this:
gene ALL_ID_1 AML_ID_1 AML_ID_2 AML_ID_3 AML_ID_4 AML_ID_5 Stroma_ID_1 Stroma_ID_2 Stroma_ID_3 Stroma_ID_4 Stroma_ID_5 Stroma_CR_Pat_4 Stroma_CR_Pat_5 Stroma_CR_Pat_6 Stroma_CR_Pat_7 Stroma_CR_Pat_8
ENSG 8 1 11 5 10 0 628 542 767 578 462 680 513 968 415 623
ENSG 0 0 1 0 0 0 0 28 1 3 0 1 4 0 0 0
ENSG 661 1418 2580 6817 14727 5968 9 3 5 9 2 9 3 3 5 1
ENSG 20 315 212 8 790 471 1283 2042 1175 2839 1110 857 1880 1526 2262 2624
ENSG 11 26 24 9 11 2 649 532 953 463 468 878 587 245 722 484
And I want the above header to be split as follows:
network ID ID REL
node B_ALL AML Stroma
hemi 1 1 2 3 4 5 1 2 3 4 5 6 7 8 9 10
ENSG 8 1 11 5 10 0 628 542 767 578 462 680 513 968 415 623
ENSG 0 0 1 0 0 0 0 28 1 3 0 1 4 0 0 0
ENSG 661 1418 2580 6817 14727 5968 9 3 5 9 2 9 3 3 5 1
ENSG 20 315 212 8 790 471 1283 2042 1175 2839 1110 857 1880 1526 2262 2624
ENSG 11 26 24 9 11 2 649 532 953 463 468 878 587 245 722 484
Any help is greatly appreciated ..
This is probably not the best minimal example: very few people have the subject knowledge to understand what network, node and hemi mean in your context.
You just need to create your MultiIndex and replace your column index with the one you created:
There are 3 rules in your example:
1. Whenever 'Stroma' is found, the column belongs to REL; otherwise it belongs to ID.
2. node is the first field of the initial column names.
3. hemi is the last field of the initial column names.
Then, just code away:
In [110]:
df.columns = pd.MultiIndex.from_tuples(list(zip(np.where(df.columns.str.find('Stroma') != -1, 'REL', 'ID'),
                                                df.columns.map(lambda x: x.split('_')[0]),
                                                df.columns.map(lambda x: x.split('_')[-1]))),
                                       names=['network', 'node', 'hemi'])
print(df)
network ID REL \
node ALL AML Stroma
hemi 1 1 2 3 4 5 1 2 3 4 5
gene
ENSG 8 1 11 5 10 0 628 542 767 578 462
ENSG 0 0 1 0 0 0 0 28 1 3 0
ENSG 661 1418 2580 6817 14727 5968 9 3 5 9 2
ENSG 20 315 212 8 790 471 1283 2042 1175 2839 1110
ENSG 11 26 24 9 11 2 649 532 953 463 468
network
node
hemi 4 5 6 7 8
gene
ENSG 680 513 968 415 623
ENSG 1 4 0 0 0
ENSG 9 3 3 5 1
ENSG 857 1880 1526 2262 2624
ENSG 878 587 245 722 484
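The same construction in a runnable form, on a trimmed set of columns (the values are stand-ins):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[8, 1, 628, 680]],
    columns=['ALL_ID_1', 'AML_ID_1', 'Stroma_ID_1', 'Stroma_CR_Pat_4'],
)

# Apply the three rules: network is REL for Stroma columns and ID
# otherwise; node is the first '_' field; hemi is the last one.
df.columns = pd.MultiIndex.from_tuples(
    list(zip(
        np.where(df.columns.str.contains('Stroma'), 'REL', 'ID'),
        [c.split('_')[0] for c in df.columns],
        [c.split('_')[-1] for c in df.columns],
    )),
    names=['network', 'node', 'hemi'],
)
```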