I have a DataFrame such as:
myFrame = pd.DataFrame(np.random.randint(1000, size=[7, 4]),
                       index=[['GER', 'GER', 'GER', 'GER', 'FRA', 'FRA', 'FRA'],
                              ['Phone', 'Email', 'Chat', 'Other', 'Phone', 'Chat', 'Email']])
              0    1    2    3
GER Phone   765  876  588  933
    Email   819  364   42   73
    Chat    954  665  317  918
    Other   692  531  312  206
FRA Phone   272  261  426  270
    Chat    158  172  347  902
    Email   453  721   67    6
How could I easily add the missing index labels at the inner level? For example, GER has an "Other" label; I'd like to add "Other" to every country and fill its values with e.g. 0. There might also be a third outer index (e.g. ITA), which could introduce yet another inner label (e.g. SMS).
In the end, all countries should have exactly the same set of inner index labels.
Thanks!
Use reindex with a MultiIndex.from_product built from the unique values of each level, obtained via MultiIndex.get_level_values:
mux = pd.MultiIndex.from_product([myFrame.index.get_level_values(0).unique(),
                                  myFrame.index.get_level_values(1).unique()])
print(myFrame.reindex(mux, fill_value=0))
              0    1    2    3
GER Phone   250  614  226  777
    Email   917  156  148  902
    Chat    537  665   87   75
    Other   431  203  921  572
FRA Phone   159  790  646  810
    Email   294  205  949  726
    Chat    209  895  128  282
    Other     0    0    0    0
Another solution uses unstack and stack; note that the resulting MultiIndex is sorted:
print(myFrame.unstack(fill_value=0).stack(dropna=False))

              0    1    2    3
FRA Chat    209  895  128  282
    Email   294  205  949  726
    Other     0    0    0    0
    Phone   159  790  646  810
GER Chat    537  665   87   75
    Email   917  156  148  902
    Other   431  203  921  572
    Phone   250  614  226  777
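The reindex approach also generalizes to a third country with its own inner label (the ITA/SMS case mentioned in the question). A minimal self-contained sketch, with made-up values:

import numpy as np
import pandas as pd

np.random.seed(0)
idx = pd.MultiIndex.from_arrays([
    ['GER', 'GER', 'GER', 'GER', 'FRA', 'FRA', 'FRA', 'ITA'],
    ['Phone', 'Email', 'Chat', 'Other', 'Phone', 'Chat', 'Email', 'SMS'],
])
myFrame = pd.DataFrame(np.random.randint(1000, size=[8, 4]), index=idx)

# Build the full outer product of the unique labels of each level and reindex;
# every country ends up with Phone/Email/Chat/Other/SMS rows, and missing
# combinations are filled with 0.
mux = pd.MultiIndex.from_product([idx.get_level_values(0).unique(),
                                  idx.get_level_values(1).unique()])
print(myFrame.reindex(mux, fill_value=0))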
Related
I have a column in a data frame (df1) that contains the number of columns I should subset in another dataframe (df2) before performing a calculation.
df1:
   count
0      5
1      6
2      8
3      1
4      9
df2:
     A    B    C    D    E    F    G    H    I
0  337  687  972  530  366  187  964  952  820
1  144  971  233  819  340  600  694  155  913
2  904  951  732  987  661  907  786  126  674
3  675  474  925  663  570  591  805  404  184
4  775  907  616  973  800  117  512  222  300
However, the number of columns used for the subset has an upper limit, so I tried to write it like this:
df2['mean_6cols'] = np.where(df1['count'] >= 6,
                             df2.iloc[:, :6].mean(axis=1),
                             df2.iloc[:, :df1['count']].mean(axis=1))
So if df1['count'] is at least 6, I want to use the first 6 columns of df2; but if df1['count'] is less than 6, I want to use the number of columns specified in that row.
Unfortunately, this raises the error below, presumably because df1['count'] is being passed to iloc:
TypeError: cannot do positional indexing with these indexers
I did think of writing a for-loop and using the index variable to get the current value of df1['count'] for each row, but it's not a practical solution since I have a lot of different combinations of calculations/dataframes to do this for.
You can use numpy broadcasting to mask df2 by df1['count']:
mask = df1[['count']].to_numpy() > np.arange(df2.shape[1])

df2['mean_6cols'] = np.where(df1['count'] >= 6,
                             df2.iloc[:, :6].mean(axis=1),
                             df2.where(mask).mean(axis=1))
Output:

     A    B    C    D    E    F    G    H    I  mean_6cols
0  337  687  972  530  366  187  964  952  820  578.400000
1  144  971  233  819  340  600  694  155  913  517.833333
2  904  951  732  987  661  907  786  126  674  857.000000
3  675  474  925  663  570  591  805  404  184  675.000000
4  775  907  616  973  800  117  512  222  300  698.000000
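For reference, here is a minimal self-contained sketch of the masking idea, using the sample data above; the intermediate print of mask is only there to show how the broadcast comparison keeps the first count columns of each row.

import numpy as np
import pandas as pd

df1 = pd.DataFrame({'count': [5, 6, 8, 1, 9]})
df2 = pd.DataFrame(
    [[337, 687, 972, 530, 366, 187, 964, 952, 820],
     [144, 971, 233, 819, 340, 600, 694, 155, 913],
     [904, 951, 732, 987, 661, 907, 786, 126, 674],
     [675, 474, 925, 663, 570, 591, 805, 404, 184],
     [775, 907, 616, 973, 800, 117, 512, 222, 300]],
    columns=list('ABCDEFGHI'))

# A (5, 1) column vector compared against a (9,) row vector broadcasts to a
# (5, 9) boolean mask: row i keeps its first df1['count'][i] columns.
mask = df1[['count']].to_numpy() > np.arange(df2.shape[1])
print(mask.astype(int))

df2['mean_6cols'] = np.where(df1['count'] >= 6,
                             df2.iloc[:, :6].mean(axis=1),
                             df2.where(mask).mean(axis=1))
print(df2['mean_6cols'])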
The following code generates a pandas.DataFrame from a 3D array, keeping the first axis as rows. I manually create the column names (the cols definition below): is there a more built-in way to do this (to avoid potential errors, e.g. regarding C-order)?
In other words, I am looking for a way to guarantee that the column labels respect the order of the indices after the reshape operation (here this relies on iterating over range(nrow) and range(ncol) in the correct order).
import numpy as np
import pandas as pd
nt = 6 ; nrow = 4 ; ncol = 3 ; shp = (nt, nrow, ncol)
np.random.seed(0)
a = np.array(np.random.randint(0, 1000, nt*nrow*ncol)).reshape(shp)
# This is the line I think should be improved --> any numpy function or so?
cols = [str(i) + '-' + str(j) for i in range(nrow) for j in range(ncol)]
adf = pd.DataFrame(a.reshape(nt, -1), columns = cols)
print(adf)
0-0 0-1 0-2 1-0 1-1 1-2 2-0 2-1 2-2 3-0 3-1 3-2
0 684 559 629 192 835 763 707 359 9 723 277 754
1 804 599 70 472 600 396 314 705 486 551 87 174
2 600 849 677 537 845 72 777 916 115 976 755 709
3 847 431 448 850 99 984 177 755 797 659 147 910
4 423 288 961 265 697 639 544 543 714 244 151 675
5 510 459 882 183 28 802 128 128 932 53 901 550
EDIT
Illustrating why I don't like my solution: it is just too easy to write code that technically works but produces a wrong result (by swapping i and j, or nrow and ncol):
wrongcols1 = [str(i) + '-' + str(j) for i in range(ncol) for j in range(nrow)]
adf2 = pd.DataFrame(a.reshape(nt, -1), columns=wrongcols1)
print(adf2)
0-0 0-1 0-2 0-3 1-0 1-1 1-2 1-3 2-0 2-1 2-2 2-3
0 684 559 629 192 835 763 707 359 9 723 277 754
1 804 599 70 472 600 396 314 705 486 551 87 174
2 600 849 677 537 845 72 777 916 115 976 755 709
3 847 431 448 850 99 984 177 755 797 659 147 910
4 423 288 961 265 697 639 544 543 714 244 151 675
5 510 459 882 183 28 802 128 128 932 53 901 550
wrongcols2 = [str(j) + '-' + str(i) for i in range(nrow) for j in range(ncol)]
adf3 = pd.DataFrame(a.reshape(nt, -1), columns=wrongcols2)
print(adf3)
0-0 1-0 2-0 0-1 1-1 2-1 0-2 1-2 2-2 0-3 1-3 2-3
0 684 559 629 192 835 763 707 359 9 723 277 754
1 804 599 70 472 600 396 314 705 486 551 87 174
2 600 849 677 537 845 72 777 916 115 976 755 709
3 847 431 448 850 99 984 177 755 797 659 147 910
4 423 288 961 265 697 639 544 543 714 244 151 675
5 510 459 882 183 28 802 128 128 932 53 901 550
Try this and see if it fits your use case:
Generate the column index pairs via a combination of np.indices, np.dstack and np.vstack:
columns = np.vstack(np.dstack(np.indices((nrow, ncol))))
array([[0, 0],
[0, 1],
[0, 2],
[1, 0],
[1, 1],
[1, 2],
[2, 0],
[2, 1],
[2, 2],
[3, 0],
[3, 1],
[3, 2]])
Now convert to string via a combination of map, join and list comprehension:
columns = ["-".join(map(str, entry)) for entry in columns]
['0-0',
'0-1',
'0-2',
'1-0',
'1-1',
'1-2',
'2-0',
'2-1',
'2-2',
'3-0',
'3-1',
'3-2']
Let us know how it goes.
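If the worry is specifically about C-order, a quick sanity check is to compare a column of the reshaped frame against the corresponding slice of the original array. This is a sketch of my own, assuming the variables a, adf, nrow and ncol from the seeded snippet in the question:

import numpy as np

# Column "i-j" of adf should equal a[:, i, j], because a C-order reshape of
# (nt, nrow, ncol) to (nt, nrow * ncol) maps (i, j) to flat position i * ncol + j.
assert np.array_equal(adf['1-2'].to_numpy(), a[:, 1, 2])
assert all(np.array_equal(adf[f'{i}-{j}'].to_numpy(), a[:, i, j])
           for i in range(nrow) for j in range(ncol))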
You could try to use pd.MultiIndex to construct your hierarchy.
First redefine your cols to a list of tuples:
cols = [(i, j) for i in range(nrow) for j in range(ncol)]
Then construct the multi index with cols:
multi_cols = pd.MultiIndex.from_tuples(cols)
And build the dataframe:
adf = pd.DataFrame(a.reshape(nt, -1), columns=multi_cols)
Result:
0 1 2 3
0 1 2 0 1 2 0 1 2 0 1 2
0 684 559 629 192 835 763 707 359 9 723 277 754
1 804 599 70 472 600 396 314 705 486 551 87 174
2 600 849 677 537 845 72 777 916 115 976 755 709
3 847 431 448 850 99 984 177 755 797 659 147 910
4 423 288 961 265 697 639 544 543 714 244 151 675
5 510 459 882 183 28 802 128 128 932 53 901 550
Accessing an element:
print(adf[1][2][0])
763
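As a side note (my own suggestion, not part of the answer above), pd.MultiIndex.from_product builds the same tuples in the same C-order as the list comprehension, so it can replace the manual cols entirely:

import numpy as np
import pandas as pd

nt, nrow, ncol = 6, 4, 3
np.random.seed(0)
a = np.random.randint(0, 1000, (nt, nrow, ncol))

# from_product iterates the last level fastest, matching the C-order reshape.
multi_cols = pd.MultiIndex.from_product([range(nrow), range(ncol)])
adf = pd.DataFrame(a.reshape(nt, -1), columns=multi_cols)

# Optionally flatten to "i-j" string labels, as in the question.
adf.columns = [f'{i}-{j}' for i, j in multi_cols]
print(adf.head())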
I'm trying to transform my dataset to a normal distribution.
0 8.298511e-03
1 3.055319e-01
2 6.938647e-02
3 2.904091e-02
4 7.422441e-02
5 6.074046e-02
6 9.265747e-04
7 7.521846e-02
8 5.960521e-02
9 7.405019e-04
10 3.086551e-02
11 5.444835e-02
12 2.259236e-02
13 4.691038e-02
14 6.463911e-02
15 2.172805e-02
16 8.210005e-02
17 2.301189e-02
18 4.073898e-07
19 4.639910e-02
20 1.662777e-02
21 8.662539e-02
22 4.436425e-02
23 4.557591e-02
24 3.499897e-02
25 2.788340e-02
26 1.707958e-02
27 1.506404e-02
28 3.207647e-02
29 2.147011e-03
30 2.972746e-02
31 1.028140e-01
32 2.183737e-02
33 9.063370e-03
34 3.070437e-02
35 1.477440e-02
36 1.036309e-02
37 2.000609e-01
38 3.366233e-02
39 1.479767e-03
40 1.137169e-02
41 1.957088e-02
42 4.921303e-03
43 4.279257e-02
44 4.363429e-02
45 1.040123e-01
46 2.930958e-02
47 1.935434e-03
48 1.954418e-02
49 2.980253e-02
50 3.643772e-02
51 3.411437e-02
52 4.976063e-02
53 3.704608e-02
54 7.044161e-02
55 8.101365e-03
56 9.310477e-03
57 7.626637e-02
58 8.149728e-03
59 4.157399e-01
60 8.200258e-02
61 2.844295e-02
62 1.046601e-01
63 6.565680e-02
64 9.825436e-04
65 9.353639e-02
66 6.535298e-02
67 6.979044e-04
68 2.772859e-02
69 4.378422e-02
70 2.020185e-02
71 4.774493e-02
72 6.346146e-02
73 2.466264e-02
74 6.636585e-02
75 2.548934e-02
76 1.113937e-06
77 5.723409e-02
78 1.533288e-02
79 1.027341e-01
80 4.294570e-02
81 4.844853e-02
82 5.579620e-02
83 2.531824e-02
84 1.661426e-02
85 1.430836e-02
86 3.157232e-02
87 2.241722e-03
88 2.946256e-02
89 1.038383e-01
90 1.868837e-02
91 8.854596e-03
92 2.391759e-02
93 1.612714e-02
94 1.007823e-02
95 1.975513e-01
96 3.581289e-02
97 1.199747e-03
98 1.263381e-02
99 1.966746e-02
100 4.040786e-03
101 4.497264e-02
102 4.030524e-02
103 8.627087e-02
104 3.248317e-02
105 5.727582e-03
106 1.781355e-02
107 2.377991e-02
108 4.299568e-02
109 3.664353e-02
110 5.167902e-02
111 4.006848e-02
112 7.072990e-02
113 6.744938e-03
114 1.064900e-02
115 9.823497e-02
116 8.992714e-03
117 1.792453e-01
118 6.817763e-02
119 2.588843e-02
120 1.048027e-01
121 6.468491e-02
122 1.035536e-03
123 8.800684e-02
124 5.975065e-02
125 7.365861e-04
126 4.209485e-02
127 4.232421e-02
128 2.371866e-02
129 5.894714e-02
130 7.177195e-02
131 2.116566e-02
132 7.579219e-02
133 3.174744e-02
134 0.000000e+00
135 5.786439e-02
136 1.458493e-02
137 9.820156e-02
138 4.373873e-02
139 4.271649e-02
140 5.532575e-02
141 2.311324e-02
142 1.644508e-02
143 1.328273e-02
144 3.908473e-02
145 2.355468e-03
146 2.519321e-02
147 1.131868e-01
148 1.708967e-02
149 1.027661e-02
150 2.439899e-02
151 1.604058e-02
152 1.134323e-02
153 2.247722e-01
154 3.408590e-02
155 2.222239e-03
156 1.659830e-02
157 2.284733e-02
158 4.618550e-03
159 3.674162e-02
160 4.131283e-02
161 8.846273e-02
162 2.504404e-02
163 6.004396e-03
164 1.986309e-02
165 2.347111e-02
166 3.865636e-02
167 3.672307e-02
168 6.658419e-02
169 3.726879e-02
170 7.600138e-02
171 7.184871e-03
172 1.142840e-02
173 9.741311e-02
174 8.165448e-03
175 1.529210e-01
176 6.648081e-02
177 2.617601e-02
178 9.547816e-02
179 6.857775e-02
180 8.129399e-04
181 7.107914e-02
182 5.884794e-02
183 8.398721e-04
184 6.972981e-02
185 4.461767e-02
186 2.264404e-02
187 5.566633e-02
188 6.595136e-02
189 2.301914e-02
190 7.488919e-02
191 3.108619e-02
192 4.989364e-07
193 4.834949e-02
194 1.422578e-02
195 9.398186e-02
196 4.870391e-02
197 3.841369e-02
198 6.406801e-02
199 2.603315e-02
200 1.692629e-02
201 1.409982e-02
202 4.099215e-02
203 2.093724e-03
204 2.640732e-02
205 1.032129e-01
206 1.581881e-02
207 8.977325e-03
208 1.941141e-02
209 1.502126e-02
210 9.923589e-03
211 2.757357e-01
212 3.096234e-02
213 4.388900e-03
214 1.784778e-02
215 2.179550e-02
216 3.944159e-03
217 3.703552e-02
218 4.033897e-02
219 1.157076e-01
220 2.400446e-02
221 5.761179e-03
222 1.899621e-02
223 2.401468e-02
224 4.458745e-02
225 3.357898e-02
226 5.331003e-02
227 3.488753e-02
228 7.466599e-02
229 6.075236e-03
230 9.815318e-03
231 9.598735e-02
232 7.103607e-03
233 1.100602e-01
234 5.677641e-02
235 2.420500e-02
236 9.213369e-02
237 4.024043e-02
238 6.987694e-04
239 8.612055e-02
240 5.663353e-02
241 4.871693e-04
242 4.533811e-02
243 3.593244e-02
244 1.982537e-02
245 5.490786e-02
246 5.603109e-02
247 1.671653e-02
248 6.522711e-02
249 3.341356e-02
250 2.378629e-06
251 4.299939e-02
252 1.223163e-02
253 8.392798e-02
254 4.272826e-02
255 3.183946e-02
256 4.431299e-02
257 2.661024e-02
258 1.686707e-02
259 4.070924e-03
260 3.325947e-02
261 2.023611e-03
262 2.402284e-02
263 8.369778e-02
264 1.375093e-02
265 8.899898e-03
266 2.148740e-02
267 1.301483e-02
268 8.355791e-03
269 2.549934e-01
270 2.792516e-02
271 4.652563e-03
272 1.556313e-02
273 1.936942e-02
274 3.547794e-03
275 3.412516e-02
276 3.932606e-02
277 5.305868e-02
278 2.354438e-02
279 5.379380e-03
280 1.904203e-02
281 2.045495e-02
282 3.275855e-02
283 3.007389e-02
284 8.227664e-02
285 2.479949e-02
286 6.573835e-02
287 5.165842e-03
288 7.599650e-03
289 9.613557e-02
290 6.690175e-03
291 1.779880e-01
292 5.076263e-02
293 3.117607e-02
294 7.495692e-02
295 3.707768e-02
296 7.086975e-04
297 8.935981e-02
298 5.624249e-02
299 7.105331e-04
300 3.339868e-02
301 3.354603e-02
302 2.041988e-02
303 3.862522e-02
304 5.977081e-02
305 1.730081e-02
306 6.909621e-02
307 3.729478e-02
308 3.940647e-07
309 4.385336e-02
310 1.391891e-02
311 8.898305e-02
312 3.840141e-02
313 3.214408e-02
314 4.284080e-02
315 1.841022e-02
316 1.528207e-02
317 3.106559e-03
318 3.945481e-02
319 2.085094e-03
320 2.464190e-02
321 7.844914e-02
322 1.526590e-02
323 9.922147e-03
324 1.649218e-02
325 1.341602e-02
326 8.124446e-03
327 2.867380e-01
328 2.663867e-02
329 5.342012e-03
330 1.752612e-02
331 2.010863e-02
332 3.581845e-03
333 3.652284e-02
334 4.484362e-02
335 4.600939e-02
336 2.213280e-02
337 5.494917e-03
338 2.016594e-02
339 2.118010e-02
340 2.964000e-02
341 3.405549e-02
342 1.014185e-01
343 2.451624e-02
344 7.966998e-02
345 5.301538e-03
346 8.198895e-03
347 8.789368e-02
348 7.222417e-03
349 1.448276e-01
350 5.676056e-02
351 2.987054e-02
352 6.851434e-02
353 4.193034e-02
354 7.025054e-03
355 8.557358e-02
356 5.812736e-02
357 2.263676e-02
358 2.922588e-02
359 3.363161e-02
360 1.495056e-02
361 5.871619e-02
362 6.235094e-02
363 1.691340e-02
364 5.361939e-02
365 3.722318e-02
366 9.828477e-03
367 4.155345e-02
368 1.327760e-02
369 7.205372e-02
370 4.151130e-02
371 3.265365e-02
372 2.879418e-02
373 2.314340e-02
374 1.653692e-02
375 1.077611e-02
376 3.481427e-02
377 1.815487e-03
378 2.232305e-02
379 1.005192e-01
380 1.491262e-02
381 3.752658e-02
382 1.271613e-02
383 1.223707e-02
384 8.088923e-03
385 2.572550e-01
386 2.300194e-02
387 2.847960e-02
388 1.782098e-02
389 1.900759e-02
390 3.647629e-03
391 3.723368e-02
392 4.079514e-02
393 5.510332e-02
394 3.072313e-02
395 4.183566e-03
396 1.891549e-02
397 1.870293e-02
398 3.182769e-02
399 4.167840e-02
400 1.343152e-01
401 2.451973e-02
402 7.567017e-02
403 4.837843e-03
404 6.477297e-03
405 7.664675e-02
Name: value, dtype: float64
This is the code I used to transform the dataset:
from scipy import stats
x,_ = stats.boxcox(df)
I get this error:
    if any(x <= 0):
        raise ValueError("Data must be positive.")
ValueError: Data must be positive.
Is it because my values are too small that it's producing an error? I'm not sure what I'm doing wrong. I'm new to using boxcox and could be using it incorrectly here. Open to suggestions and alternatives. Thanks!
Your data contains the value 0 (at index 134). When boxcox says the data must be positive, it means strictly positive.
What is the meaning of your data? Does 0 make sense? Is that 0 actually a very small number that was rounded down to 0?
You could simply discard that 0. Alternatively, you could do something like the following. (This amounts to temporarily discarding the 0, and then using -1/λ for the transformed value of 0, where λ is the Box-Cox transformation parameter.)
First, create some data that contains one 0 (all other values are positive):
In [13]: np.random.seed(8675309)
In [14]: data = np.random.gamma(1, 1, size=405)
In [15]: data[100] = 0
(In your code, you would replace that with, say, data = df.values.)
Copy the strictly positive data to posdata:
In [16]: posdata = data[data > 0]
Find the optimal Box-Cox transformation, and verify that λ is positive. This work-around doesn't work if λ ≤ 0.
In [17]: bcdata, lam = boxcox(posdata)
In [18]: lam
Out[18]: 0.244049919975582
Make a new array to hold that result, along with the limiting value of the transform of 0 (which is -1/λ):
In [19]: x = np.empty_like(data)
In [20]: x[data > 0] = bcdata
In [21]: x[data == 0] = -1/lam
The following plot shows the histograms of data and x.
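The plot itself is not reproduced here; as a sketch of my own (not part of the original answer), the two histograms could be drawn side by side with matplotlib, assuming data and x from the session above:

import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 2, figsize=(8, 3), sharey=True)
axes[0].hist(data, bins=30)
axes[0].set_title('data (original)')
axes[1].hist(x, bins=30)
axes[1].set_title('x (Box-Cox transformed)')
plt.tight_layout()
plt.show()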
Rather than the regular boxcox, you can use boxcox1p. It computes the Box-Cox transform of 1 + x, so zero values are no longer a problem:
from scipy.special import boxcox1p
boxcox1p(x, lmbda)
For more info check out the docs at https://docs.scipy.org/doc/scipy/reference/generated/scipy.special.boxcox1p.html
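A short usage sketch of my own (the Series s and the zero at position 134 are made up to mimic the question's data); λ is fitted on 1 + s via scipy.stats.boxcox, which is equivalent to fitting boxcox1p, and inv_boxcox1p undoes the transform:

import numpy as np
import pandas as pd
from scipy import stats
from scipy.special import boxcox1p, inv_boxcox1p

np.random.seed(0)
s = pd.Series(np.random.gamma(1, 1, size=405))
s.iloc[134] = 0.0                       # mimic the zero in the question's data

# Fit lambda on 1 + s (equivalent to boxcox1p), then transform and invert.
_, lam = stats.boxcox((s + 1).to_numpy())
transformed = boxcox1p(s, lam)
recovered = inv_boxcox1p(transformed, lam)
assert np.allclose(recovered, s)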
Is the data you are sending to boxcox a 1-dimensional ndarray?
A second way could be to add a shift parameter: sum a constant shift (see details in the link below) to all of the ndarray elements before sending them to boxcox, and subtract the shift from the resulting array elements (if I have understood the boxcox algorithm correctly, that could be a solution in your case, too).
https://docs.scipy.org/doc/scipy-0.16.1/reference/generated/scipy.stats.boxcox.html
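A minimal sketch of the shift idea (my own illustration; the shift value is arbitrary, and note that because Box-Cox is nonlinear, shifting the input is not exactly undone by shifting the output):

import numpy as np
from scipy import stats

data = np.array([0.0, 8.3e-3, 3.1e-1, 6.9e-2, 2.9e-2])   # contains a zero
shift = 1e-6                                              # small arbitrary shift

transformed, lam = stats.boxcox(data + shift)             # all inputs now > 0
print(lam, transformed)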
These were my instructions:
Write a program, using a while loop, which prints the sum of every third number from 1 to 1001 (both 1 and 1001 included):
(1 + 4 + 7 + 10 + ...)
Here is my code:
num = 0
x = 1
while x != 1001:
    num += x
    x += 3
print(num)
Can someone point out where I've gone wrong?
Your while loop's condition x != 1001 never becomes False, so the loop never terminates.
I checked the last few values of x around 1001, and they are:
994
997
1000
1003
As you can see, the value of x never becomes 1001. To terminate the loop once x surpasses 1001, modify the condition as follows:
while x <= 1001:
    num += x
    x += 3
print(num)
You miscalculated the terminating value: x can never be 1001. The values around 1001 are 1000 and 1003, so the while loop runs forever.
I think you may use:
while x != 1000:
or:
while x < 1001:
Note that, as @idjaw pointed out, using != here is not a very good choice.
x will never take the value 1001: it becomes 1000 and then 1003 on the next iteration, so the loop goes on forever.
while x <= 1001:
can be used to resolve this.
x would never be 1001, so the loop would run forever.
If you want to verify this, move the print statement inside the loop and print the value of x:
num = 0
x = 1
while x != 1001:
    num += x
    x += 3
    print(x)
It will print the value of x; press Ctrl+C once it has crossed 1000.
4
7
10
13
...
994
997
1000
1003
1006
As you can clearly see, x never becomes 1001. That's why the loop runs forever.
As others have said, changing the condition to x <= 1001 will end your loop.
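For completeness, a small sketch of my own that runs the corrected loop and cross-checks the result against the arithmetic-series formula; the sequence 1, 4, ..., 1000 has 334 terms, so the expected sum is 334 * (1 + 1000) / 2 = 167167:

num = 0
x = 1
while x <= 1001:
    num += x
    x += 3
print(num)  # 167167

# Cross-check with the closed-form sum of an arithmetic series.
n_terms = (1000 - 1) // 3 + 1          # 334 terms: 1, 4, ..., 1000
assert num == n_terms * (1 + 1000) // 2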
I have a dataframe which looks like this:
Geneid PRKCZ.exon1 PRKCZ.exon2 PRKCZ.exon3 PRKCZ.exon4 PRKCZ.exon5 PRKCZ.exon6 PRKCZ.exon7 PRKCZ.exon8 PRKCZ.exon9 PRKCZ.exon10 ... FLNA.exon31 FLNA.exon32 FLNA.exon33 FLNA.exon34 FLNA.exon35 FLNA.exon36 FLNA.exon37 FLNA.exon38 MTCP1.exon1 MTCP1.exon2
S28 22 127 135 77 120 159 49 38 409 67 ... 112 104 37 83 47 18 110 70 167 19
22 3 630 178 259 142 640 77 121 521 452 ... 636 288 281 538 276 109 242 314 790 484
S04 16 658 320 337 315 881 188 162 769 577 ... 1291 420 369 859 507 208 554 408 1172 706
56 26 663 343 390 314 1090 263 200 844 592 ... 675 243 250 472 280 133 300 275 750 473
S27 13 1525 571 1081 560 1867 427 370 1348 1530 ... 1817 926 551 1554 808 224 971 1313 1293 701
5 rows × 8297 columns
In the above dataframe I need to add an extra column with information about the index. So I made a list, healthy, containing all the index values that should be labelled 'h'; everything else should be 'd'.
So I tried the following lines:
healthy=['39','41','49','50','51','52','53','54','56']
H_type =pd.Series( ['h' for x in df.loc[healthy]
else 'd' for x in df]).to_frame()
But it is throwing the following error:
SyntaxError: invalid syntax
Any help would be really appreciated.
In the end I am aiming for something like this:
Geneid sampletype SSX4.exon4 SSX2.exon11 DUX4.exon5 SSX2.exon3 SSX4.exon5 SSX2.exon10 SSX4.exon7 SSX2.exon9 SSX4.exon8 ... SETD2.exon21 FAT2.exon15 CASC5.exon8 FAT1.exon21 FAT3.exon9 MLL.exon31 NACA.exon7 RANBP2.exon20 APC.exon16 APOB.exon4
S28 h 0 0 0 0 0 0 0 0 0 ... 2480 2003 2749 1760 2425 3330 4758 2508 4367 4094
22 h 0 0 0 0 0 0 0 0 0 ... 8986 7200 10123 12422 14528 18393 9612 15325 8788 11584
S04 h 0 0 0 0 0 0 0 0 0 ... 14518 16657 17500 15996 17367 17948 18037 19446 24179 28924
56 h 0 0 0 0 0 0 0 0 0 ... 17784 17846 20811 17337 18135 19264 19336 22512 28318 32405
S27 h 0 0 0 0 0 0 0 0 0 ... 10375 20403 11559 18895 18410 12754 21527 11603 16619 37679
Thank you
I think you can use numpy.where with isin, if Geneid is a column.
EDIT (based on a comment):
There can be integers in the column Geneid, so cast it to string with astype.
healthy=['39','41','49','50','51','52','53','54','56']
df['type'] = np.where(df['Geneid'].astype(str).isin(healthy), 'h', 'd')
#get last column to list
print df.columns[-1].split()
['type']
#create new list from last column and all columns without last
cols = df.columns[-1].split() + df.columns[:-1].tolist()
print cols
['type', 'Geneid', 'PRKCZ.exon1', 'PRKCZ.exon2', 'PRKCZ.exon3', 'PRKCZ.exon4',
'PRKCZ.exon5', 'PRKCZ.exon6', 'PRKCZ.exon7', 'PRKCZ.exon8', 'PRKCZ.exon9',
'PRKCZ.exon10', 'FLNA.exon31', 'FLNA.exon32', 'FLNA.exon33', 'FLNA.exon34',
'FLNA.exon35', 'FLNA.exon36', 'FLNA.exon37', 'FLNA.exon38', 'MTCP1.exon1', 'MTCP1.exon2']
#reorder columns
print df[cols]
type Geneid PRKCZ.exon1 PRKCZ.exon2 PRKCZ.exon3 PRKCZ.exon4 \
0 d S28 22 127 135 77
1 d 22 3 630 178 259
2 d S04 16 658 320 337
3 h 56 26 663 343 390
4 d S27 13 1525 571 1081
PRKCZ.exon5 PRKCZ.exon6 PRKCZ.exon7 PRKCZ.exon8 ... \
0 120 159 49 38 ...
1 142 640 77 121 ...
2 315 881 188 162 ...
3 314 1090 263 200 ...
4 560 1867 427 370 ...
FLNA.exon31 FLNA.exon32 FLNA.exon33 FLNA.exon34 FLNA.exon35 \
0 112 104 37 83 47
1 636 288 281 538 276
2 1291 420 369 859 507
3 675 243 250 472 280
4 1817 926 551 1554 808
FLNA.exon36 FLNA.exon37 FLNA.exon38 MTCP1.exon1 MTCP1.exon2
0 18 110 70 167 19
1 109 242 314 790 484
2 208 554 408 1172 706
3 133 300 275 750 473
4 224 971 1313 1293 701
[5 rows x 22 columns]
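As a side note (my own suggestion, not part of the answer above), DataFrame.insert places the new column directly at the desired position, which avoids rebuilding the column list by hand; this assumes the same df and healthy as above:

import numpy as np

healthy = ['39', '41', '49', '50', '51', '52', '53', '54', '56']

# insert() puts the new column at position 0, so no manual reordering
# of df.columns is needed afterwards.
df.insert(0, 'type', np.where(df['Geneid'].astype(str).isin(healthy), 'h', 'd'))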
You could use pandas isin().
First add an extra column called 'sampletype' and fill it with 'd'. Then find all samples that have a Geneid in healthy and set them to 'h'. Supposing your main dataframe is called df, you would use something like:
healthy = ['39','41','49','50','51','52','53','54','56']
df['sampletype'] = 'd'
df['sampletype'][df['Geneid'].isin(healthy)]='h'
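A note from my side (not part of the original answer): the chained assignment on the last line can trigger pandas' SettingWithCopyWarning; a single .loc call does the same thing in one step:

healthy = ['39', '41', '49', '50', '51', '52', '53', '54', '56']

df['sampletype'] = 'd'
# Boolean row selection and the column label in one .loc call avoids
# chained assignment.
df.loc[df['Geneid'].astype(str).isin(healthy), 'sampletype'] = 'h'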