Error while dropping rows from a dataframe based on value comparison - python

I have the following unique values in a dataframe column:
['1473' '1093' '1346' '1324' 'NA' '1129' '58' '847' '54' '831' '816']
I want to drop rows which have 'NA' in this column.
testData = testData[testData.BsmtUnfSF != "NA"]
and got the error:
TypeError: invalid type comparison
Then I tried
testData = testData[testData.BsmtUnfSF != np.NAN]
It doesn't give any error, but it doesn't drop the rows either.
How to solve this issue?

Here is how you can do it. Just replace "column" with the name of the column you want.
import pandas as pd
import numpy as np
df = pd.DataFrame({"column": [1,2,3,np.nan,6]})
df = df[np.isfinite(df['column'])]
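Note that np.isfinite only works on numeric dtypes. Since the column in the question holds strings such as '1473' and 'NA', a numeric conversion would likely be needed first; a minimal sketch using pd.to_numeric, where errors='coerce' turns anything unparseable (including 'NA') into NaN:
import pandas as pd
import numpy as np
df = pd.DataFrame({"column": ['1473', '1093', 'NA', '58']})
# coerce non-numeric strings such as 'NA' into NaN, then keep only finite rows
numeric = pd.to_numeric(df['column'], errors='coerce')
df = df[np.isfinite(numeric)]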

You could use dropna:
testData = testData.dropna(subset=['BsmtUnfSF'])
Note that this only drops rows where the value is an actual missing value (NaN), not the literal string 'NA'.
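If 'NA' is stored as a literal string, one option (a sketch, assuming the column holds strings, as the unique values in the question suggest) is to turn it into a real missing value first:
import numpy as np
# replace the literal string 'NA' with NaN, then drop those rows
testData['BsmtUnfSF'] = testData['BsmtUnfSF'].replace('NA', np.nan)
testData = testData.dropna(subset=['BsmtUnfSF'])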

Assuming your DataFrame is:
>>> df
col1
0 1473
1 1093
2 1346
3 1324
4 NaN
5 1129
6 58
7 847
8 54
9 831
10 816
You have multiple solutions:
>>> df[pd.notnull(df['col1'])]
col1
0 1473
1 1093
2 1346
3 1324
5 1129
6 58
7 847
8 54
9 831
10 816
>>> df[df.col1.notnull()]
# df[df['col1'].notnull()]
col1
0 1473
1 1093
2 1346
3 1324
5 1129
6 58
7 847
8 54
9 831
10 816
>>> df.dropna(subset=['col1'])
col1
0 1473
1 1093
2 1346
3 1324
5 1129
6 58
7 847
8 54
9 831
10 816
>>> df.dropna()
col1
0 1473
1 1093
2 1346
3 1324
5 1129
6 58
7 847
8 54
9 831
10 816
>>> df[~df.col1.isnull()]
col1
0 1473
1 1093
2 1346
3 1324
5 1129
6 58
7 847
8 54
9 831
10 816

Related

Split columns conditionally on string

I have a data frame with the following shape:
0 1
0 OTT:81 DVBC:398
1 OTT:81 DVBC:474
2 OTT:81 DVBC:474
3 OTT:81 DVBC:454
4 OTT:81 DVBC:443
5 OTT:1 DVBC:254
6 DVBC:151 None
7 OTT:1 DVBC:243
8 OTT:1 DVBC:254
9 DVBC:227 None
I want column 1 to take the value of column 0 wherever column 0 contains "DVBC".
Then split the values on ":" and fill the empty ones with 0.
The end data frame should look like this
OTT DVBC
0 81 398
1 81 474
2 81 474
3 81 454
4 81 443
5 1 254
6 0 151
7 1 243
8 1 254
9 0 227
I tried to do this starting with:
if df[0].str.contains("DVBC") is True:
    df[1] = df[0]
But after this the data frame looks the same; I'm not sure why.
My idea after that is to move the values to their respective columns, then split on ":" and rename the columns.
How can I implement this?
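A note on why the attempt in the question has no effect: df[0].str.contains("DVBC") returns a boolean Series, and "is True" on a Series is always False, so the assignment never runs. A minimal sketch of the boolean-mask version (assuming the same df):
mask = df[0].str.contains("DVBC")
df.loc[mask, 1] = df.loc[mask, 0]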
A universal solution for splitting values on ':' and pivoting - first create a Series with DataFrame.stack, split it with Series.str.split, and finally reshape with DataFrame.pivot:
df = df.stack().str.split(':', expand=True).reset_index()
df = df.pivot(index='level_0', columns=0, values=1).fillna(0).rename_axis(index=None, columns=None)
print (df)
DVBC OTT
0 398 81
1 474 81
2 474 81
3 454 81
4 443 81
5 254 1
6 151 0
7 243 1
8 254 1
9 227 0
Here is one way that should work with any number of columns:
(df
.apply(lambda c: c.str.extract(r':(\d+)', expand=False))
.ffill(axis=1)
.mask(df.replace('None', pd.NA).isnull().shift(-1, axis=1, fill_value=False), 0)
)
output:
OTT DVBC
0 81 398
1 81 474
2 81 474
3 81 454
4 81 443
5 1 254
6 0 151
7 1 243
8 1 254
9 0 227
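A more literal rendering of the plan in the question (move the DVBC values into column 1, then split on ':' and fill the missing OTT values with 0) could look like this sketch, assuming 'None' is stored as a literal string:
mask = df[0].str.contains('DVBC')
df.loc[mask, 1] = df.loc[mask, 0]   # the DVBC value was sitting in column 0
df.loc[mask, 0] = 'OTT:0'           # those rows have no OTT value, so fill with 0
out = df.apply(lambda c: c.str.split(':').str[1])
out.columns = ['OTT', 'DVBC']
# values remain strings; cast with out.astype(int) if needed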

Python - Aggregate on week of the month basis and compare

I am working on a small CSV data set where the values are indexed by week-of-the-month occurrence. What I want is to aggregate all of the weeks in sequence, barring the current week (the last column), to compute a weekly average of the remaining data (e.g. averaging ...10/1 + 11/1 + 12/1... to get week 1 data).
The data is available in this format:
char 2019/11/1 2019/11/2 2019/11/3 2019/11/4 2019/11/5 2019/12/1 2019/12/2 2019/12/3 2019/12/4 2019/12/5 2020/1/1
A 1477 1577 1401 773 310 1401 1464 1417 909 712 289
B 1684 1485 1220 894 297 1618 1453 1335 920 772 275
C 37 10 1 3 6 17 6 6 3 2 1
D 2041 1883 1302 1136 376 2175 1729 1167 960 745 278
E 6142 5991 5499 3883 1036 4949 6187 5760 3974 2339 826
F 842 846 684 462 140 789 802 134 386 251 94
This column (2020/1/1) shall later be used to compare with the mean of all aggregate values from week one. The desired output is something like this:
char W1 W2 W3 W4 W5 2020/1/1
A 1439 1520.5 1409 841 511 289
B 1651 1469 1277.5 907 534.5 275
C 27 8 3.5 3 4 1
D 2108 1806 1234.5 1048 560.5 278
E 5545.5 6089 5629.5 3928.5 1687.5 826
F 815.5 824 409 424 195.5 94
Is it possible to use rolling or resample in such a case? Any ideas on how to do it?
I believe you need DataFrame.resample by weeks:
df = df.set_index(['char', '2020/1/1'])
df.columns = pd.to_datetime(df.columns, format='%Y/%m/%d')
df = df.resample('W', axis=1).mean()
print (df)
2019-11-03 2019-11-10 2019-11-17 2019-11-24 2019-12-01 \
char 2020/1/1
A 289 1485.000000 541.5 NaN NaN 1401.0
B 275 1463.000000 595.5 NaN NaN 1618.0
C 1 16.000000 4.5 NaN NaN 17.0
D 278 1742.000000 756.0 NaN NaN 2175.0
E 826 5877.333333 2459.5 NaN NaN 4949.0
F 94 790.666667 301.0 NaN NaN 789.0
2019-12-08
char 2020/1/1
A 289 1125.50
B 275 1120.00
C 1 4.25
D 278 1150.25
E 826 4565.00
F 94 393.25
EDIT: If you want to group the first 7 days of each month into separate groups, use:
df = df.set_index(['char', '2020/1/1'])
c = pd.to_datetime(df.columns, format='%Y/%m/%d')
df.columns = [f'{y}/{m}/W{w}' for w,m,y in zip((c.day - 1) // 7 + 1,c.month, c.year)]
df = df.groupby(df.columns, axis=1).mean()
print (df)
2019/11/W1 2019/12/W1
char 2020/1/1
A 289 1107.6 1180.6
B 275 1116.0 1219.6
C 1 11.4 6.8
D 278 1347.6 1355.2
E 826 4510.2 4641.8
F 94 594.8 472.4
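Note that axis=1 for groupby is deprecated in recent pandas versions; an equivalent formulation (a sketch) transposes, groups on the index, and transposes back:
df = df.T.groupby(level=0).mean().T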
EDIT1: For grouping by day and year use DatetimeIndex.strftime:
df = df.set_index(['char', '2020/1/1'])
df.columns = pd.to_datetime(df.columns, format='%Y/%m/%d').strftime('%d-%Y')
df = df.groupby(df.columns, axis=1).mean()
print (df)
01-2019 02-2019 03-2019 04-2019 05-2019
char 2020/1/1
A 289 1439.0 1520.5 1409.0 841.0 511.0
B 275 1651.0 1469.0 1277.5 907.0 534.5
C 1 27.0 8.0 3.5 3.0 4.0
D 278 2108.0 1806.0 1234.5 1048.0 560.5
E 826 5545.5 6089.0 5629.5 3928.5 1687.5
F 94 815.5 824.0 409.0 424.0 195.5
Here is a way using groupby:
m = df.set_index(['char', '2020/1/1']).rename(columns=lambda x: pd.to_datetime(x))
m.groupby(m.columns.week,axis=1).mean().add_prefix('W_').reset_index()
char 2020/1/1 W_44 W_45 W_48 W_49
0 A 289 1485.000000 541.5 1401.0 1125.50
1 B 275 1463.000000 595.5 1618.0 1120.00
2 C 1 16.000000 4.5 17.0 4.25
3 D 278 1742.000000 756.0 2175.0 1150.25
4 E 826 5877.333333 2459.5 4949.0 4565.00
5 F 94 790.666667 301.0 789.0 393.25
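Neither answer covers the comparison step mentioned in the question (checking the 2020/1/1 column against the week-one mean). A minimal sketch, assuming the '.../W1'-style columns produced by the EDIT above, with w1_mean and diff_vs_w1 as hypothetical column names:
out = df.reset_index()
# average all week-1 columns (e.g. '2019/11/W1', '2019/12/W1') per row
w1_cols = [c for c in out.columns if str(c).endswith('W1')]
out['w1_mean'] = out[w1_cols].mean(axis=1)
# compare the current week against that mean
out['diff_vs_w1'] = out['2020/1/1'] - out['w1_mean']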

Performing operations on group by based on column value Pandas

I have a grouped pandas dataframe
x y id date qty
6 3 932 2017-05-14 212
6 3 932 2017-05-15 212
6 3 932 2017-05-18 212
6 3 933 2016-10-03 518
6 3 933 2016-10-09 16
6 3 933 2016-10-15 28
I want to know how to get the number of days between consecutive orders for a particular id. The first date should be day 0, and each following row should show the number of days since the previous order. Something like this:
x y id date qty
6 3 932 0 212
6 3 932 1 212
6 3 932 3 212
6 3 933 0 518
6 3 933 6 16
6 3 933 6 28
You can groupby id and take diff, replace NaT with fillna, and finally get the days:
print (df)
x y id date qty
0 6 3 932 2017-05-14 212
1 6 3 932 2017-05-15 212
2 6 3 932 2017-05-18 212
3 6 3 933 2016-10-03 518
4 6 3 933 2016-10-09 16
5 6 3 933 2016-10-15 28
#if necessary convert to datetime
df['date'] = pd.to_datetime(df['date'])
df['date'] = df.groupby(['id'])['date'].diff().fillna(0).dt.days
print (df)
x y id date qty
0 6 3 932 0 212
1 6 3 932 1 212
2 6 3 932 3 212
3 6 3 933 0 518
4 6 3 933 6 16
5 6 3 933 6 28
And Zero's solution below is very similar; only the output is float and not int, because of the ordering of the functions.
Use diff() on date within id groups, then use the dt accessor to get dt.days, and fill NaNs with 0:
In [772]: df.groupby('id')['date'].diff().dt.days.fillna(0)
Out[772]:
0 0.0
1 1.0
2 3.0
3 0.0
4 6.0
5 6.0
Name: date, dtype: float64
In [773]: df['date'] = df.groupby('id')['date'].diff().dt.days.fillna(0)
In [774]: df
Out[774]:
x y id date qty
0 6 3 932 0.0 212
1 6 3 932 1.0 212
2 6 3 932 3.0 212
3 6 3 933 0.0 518
4 6 3 933 6.0 16
5 6 3 933 6.0 28
Details
Original df
In [776]: df
Out[776]:
x y id date qty
0 6 3 932 2017-05-14 212
1 6 3 932 2017-05-15 212
2 6 3 932 2017-05-18 212
3 6 3 933 2016-10-03 518
4 6 3 933 2016-10-09 16
5 6 3 933 2016-10-15 28
In [778]: df.dtypes
Out[778]:
x int64
y int64
id int64
date datetime64[ns]
qty int64
dtype: object
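If an integer column is preferred, the order of operations is the difference: filling before extracting days keeps the timedelta dtype (integer days), while extracting days first introduces NaN and forces float. A sketch, assuming df['date'] is already datetime:
# fill the NaT with a zero Timedelta first, then extract integer days
df['date'] = df.groupby('id')['date'].diff().fillna(pd.Timedelta(0)).dt.days
# or extract days first (float because of NaN) and cast back
# df['date'] = df.groupby('id')['date'].diff().dt.days.fillna(0).astype(int)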

Rosalind: REVP failing the given case

I wrote a solution to this challenge. It successfully handles the example case given, but not the actual case.
Challenge:
A DNA string is a reverse palindrome if it is equal to its reverse complement. For instance, GCATGC is a reverse palindrome because its reverse complement is GCATGC. For example:
5'...GCATGC...3'
3'...CGTACG...5'
Given:
A DNA string of length at most 1 kbp in FASTA format.
Return:
The position and length of every reverse palindrome in the string
having length between 4 and 12. You may return these pairs in any
order.
Sample Dataset
>Rosalind_24
TCAATGCATGCGGGTCTATATGCAT
Sample Output
4 6
5 4
6 6
7 4
17 4
18 4
20 6
21 4
For the sample, it works. However, it fails on the actual dataset.
Actual Dataset:
>Rosalind_7901
ATATAGTCGGCTGTCCAGGCAATCGCGAGATGGGGAACGACATCTTGGTACTTTACGGAT
GCCAAGACTTAATATCTGGCCCGGATATGACCGCGAGCACCCCCTACTCGTCTGTCGGTT
TCGGCCGGCATGACCTGTCCTCTTGATAATAGATATAAGTTGCCAACCGCACTATTTCAA
GATCAGATGCCCCAAGGCACAAGGCACAGAAGAATCAGGTACTGAGCAAACAGCGCCCAT
TTGTCAGCGCAACTCCGAGCGACAGGCACAAGTGGTAGTAACATCTGTAGTCTACGAGCG
CGGGACCGATGTAAAAAGCAACGAGAGACGGGGCCGTCGATAGAAAAGCAATGGAGTCCA
TATGGGCACGCTGAGCGTGCCTGTACTAATTTCTATGGGCTACTGGCACTAGGGGCTTAA
GCCCTCGGTTACCGCGCTTTATGAATATAGTTTTCGTGCCAGGAGTGTCTTGTTTCGAGG
AAGCGTGAGCTACACTTAGCACGTCCGGGCTTATTGGAAATTTGTTCAGTCTGTATGCTC
CGCAATATCATGTCGGCGCTCATTCAATGTTGCGTGTAATTTAGACCTCTACTACAGCTG
GGGTTGGAGCGGTCGGTAGTAAGACGTATGATTACGGTTTACATCCCGCCGGCGGACACG
GAACGTGATTTTCAGCATTGTCCCATCGTAGGGATTGGGGCCCTAGTAGGTGTGGGTAGC
ACGTTACATGAAGCTATCCAATGGCGTATATACTCCATCCCATCGGACTAGAAGATTTGA
GGGACCCAGTCATAACTGGTGCAAAATTACGTTACAAAAGCCGAGGATACAGTATA
Actual Output:
1 4
2 4
23 6
24 4
48 4
70 4
73 4
79 4
82 4
86 4
93 4
124 6
125 4
126 6
127 4
131 4
155 4
156 4
184 4
222 4
236 4
251 4
337 4
342 4
389 4
394 4
415 4
423 4
440 4
441 4
452 4
453 4
482 4
496 4
509 4
513 4
526 6
527 4
554 4
558 4
565 4
587 4
604 6
605 4
634 4
656 10
657 8
658 6
659 4
674 4
709 6
710 4
714 4
733 4
739 4
744 4
758 8
759 4
759 6
760 4
761 4
780 4
813 4
818 4
822 4
846 4
Code:
from string import maketrans
# complement table for DNA bases (Python 2)
table = maketrans('ATCG', 'TAGC')
protein = open('rosalind_revp.txt', 'r').read()[14::].strip()
for i in range(len(protein)):
    # ii is half the palindrome length, so 2*ii covers even lengths 4..12
    for ii in range(2, 7):
        # compare the first half with the reverse complement of the second half
        if protein[i:i+ii] == protein[i+2*ii-1:i+ii-1:-1].translate(table):
            print str(i+1), str(2*ii)
(When testing the sample, the line reading the file is
protein = open('rosalind_revp.txt', 'r').read()[12::].strip()
instead, to skip the shorter header.)
I even manually checked a bunch of the position-length pairs and was sad to find that they all worked perfectly. I still don't know why the result wasn't accepted.
Could anyone let me know where I was wrong?
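One likely culprit: the downloaded FASTA file wraps the sequence over several lines (as shown in the dataset above), and read()[14::].strip() only strips leading and trailing whitespace, so the internal newline characters stay inside protein and shift every reported position after the first line. A sketch of a more robust read that drops the header and joins the wrapped lines:
with open('rosalind_revp.txt') as f:
    lines = f.read().splitlines()
# keep only sequence lines, gluing the wrapped FASTA back together
protein = ''.join(line for line in lines if not line.startswith('>'))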
This is my GitHub link and it has the solution; hope this works:
def reverse(l):
    # build the complement of the sequence
    t = ""
    for i in range(len(l)):
        if l[i] == 'A':
            t = t + 'T'
        elif l[i] == 'T':
            t = t + 'A'
        elif l[i] == 'C':
            t = t + 'G'
        elif l[i] == 'G':
            t = t + 'C'
    return t

def rev(d):
    # reverse the string
    return d[len(d)::-1]

k = input()  # FASTA header line
p = input()  # sequence (assumed to be on a single line)
for i in range(len(p)):
    for j in range(4, 14):
        # a reverse palindrome equals its own reverse complement
        if p[i:i+j] == rev(reverse(p[i:i+j])) and i+j <= len(p):
            print(i+1, end=" ")
            print(j)
https://github.com/jssssv007/stackexcahnge

Splitting the header into multiple headers in DataFrame

I have a DataFrame where I need to split the header into multiple header rows for the same DataFrame.
My DataFrame looks as follows:
gene ALL_ID_1 AML_ID_1 AML_ID_2 AML_ID_3 AML_ID_4 AML_ID_5 Stroma_ID_1 Stroma_ID_2 Stroma_ID_3 Stroma_ID_4 Stroma_ID_5 Stroma_CR_Pat_4 Stroma_CR_Pat_5 Stroma_CR_Pat_6 Stroma_CR_Pat_7 Stroma_CR_Pat_8
ENSG 8 1 11 5 10 0 628 542 767 578 462 680 513 968 415 623
ENSG 0 0 1 0 0 0 0 28 1 3 0 1 4 0 0 0
ENSG 661 1418 2580 6817 14727 5968 9 3 5 9 2 9 3 3 5 1
ENSG 20 315 212 8 790 471 1283 2042 1175 2839 1110 857 1880 1526 2262 2624
ENSG 11 26 24 9 11 2 649 532 953 463 468 878 587 245 722 484
And I want the above header to be split as follows:
network ID ID REL
node B_ALL AML Stroma
hemi 1 1 2 3 4 5 1 2 3 4 5 6 7 8 9 10
ENSG 8 1 11 5 10 0 628 542 767 578 462 680 513 968 415 623
ENSG 0 0 1 0 0 0 0 28 1 3 0 1 4 0 0 0
ENSG 661 1418 2580 6817 14727 5968 9 3 5 9 2 9 3 3 5 1
ENSG 20 315 212 8 790 471 1283 2042 1175 2839 1110 857 1880 1526 2262 2624
ENSG 11 26 24 9 11 2 649 532 953 463 468 878 587 245 722 484
Any help is greatly appreciated.
This is probably not the best minimal example, as very few people have the subject knowledge to understand what network, node and hemi mean in your context.
You just need to create your MultiIndex and replace your column index with the one you created:
There are 3 rules in your example:
1. Whenever 'Stroma' is found, the column belongs to REL; otherwise it belongs to ID.
2. node is the first field of the initial column names.
3. hemi is the last field of the initial column names.
Then, just code away:
In [110]:
df.columns = pd.MultiIndex.from_tuples(zip(np.where(df.columns.str.find('Stroma') != -1, 'REL', 'ID'),
                                           df.columns.map(lambda x: x.split('_')[0]),
                                           df.columns.map(lambda x: x.split('_')[-1])),
                                       names=['network', 'node', 'hemi'])
print df
network ID REL \
node ALL AML Stroma
hemi 1 1 2 3 4 5 1 2 3 4 5
gene
ENSG 8 1 11 5 10 0 628 542 767 578 462
ENSG 0 0 1 0 0 0 0 28 1 3 0
ENSG 661 1418 2580 6817 14727 5968 9 3 5 9 2
ENSG 20 315 212 8 790 471 1283 2042 1175 2839 1110
ENSG 11 26 24 9 11 2 649 532 953 463 468
network
node
hemi 4 5 6 7 8
gene
ENSG 680 513 968 415 623
ENSG 1 4 0 0 0
ENSG 9 3 3 5 1
ENSG 857 1880 1526 2262 2624
ENSG 878 587 245 722 484
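The print df syntax above is Python 2. Under Python 3 the same idea works with an explicit list of tuples; a sketch, assuming the same column naming rules:
import numpy as np
import pandas as pd
tuples = list(zip(np.where(df.columns.str.contains('Stroma'), 'REL', 'ID'),
                  [c.split('_')[0] for c in df.columns],
                  [c.split('_')[-1] for c in df.columns]))
df.columns = pd.MultiIndex.from_tuples(tuples, names=['network', 'node', 'hemi'])
print(df)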
