I am trying to compile a few years of traffic data from the CDOT website to analyze. I have used the following code to get the URLs of the data for each month and year; I need to concatenate this information into one large dataframe.
### LIBRARIES ###
from bs4 import BeautifulSoup, SoupStrainer
import requests
from requests_html import HTMLSession
import pandas as pd
from time import sleep
import os
from os.path import basename, join
from random import randint
### SET UP ###
year = ['15','16', '17', '18', '19']
month = ['01','02','03','04','05','06','07','08', '09','10','11','12']
header = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.106 Safari/537.36'}
print('setup complete')
years_urls = []
for y in year:
    url = 'https://dtdapps.coloradodot.info/otis/TrafficData/GetDailyTrafficVolumeForStationByMonth/000106/true/20{}/'.format(y)
    years_urls.append(url)
    sleep(randint(4, 6))
#print(years_urls)
print('got years')
urls_list = []
for m in month:
    for y in years_urls:
        full_urls = y + m
        urls_list.append(full_urls)
print('Combining data')
#print(urls_list)
#df = pd.concat([pd.concat(pd.read_html(u, header= 0),axis = 0)for u in urls_list],axis = 0) ## GOT FORBIDDEN ERROR
#df.to_csv('Combined_Ski_Data.csv')
appended_data = []
print('MAKING MEGA DATA FILE')
for u, value in enumerate(urls_list):
    print(value)
    r = requests.get(value, headers=header)  # Should be reading each df individually
    dataframe = pd.read_html(r.text, header=0)
    appended_data.append(dataframe)
#print(appended_data)
df_list =[]
df_full_list = []
for a, value in enumerate(appended_data):
    dataframes = 'df_' + str(a)  # Creating unique names for DFs
    df_list.append(dataframes)
    for d in df_list:
        d = pd.DataFrame(value)
        df_full_list.append(d)
combined_df = pd.concat(df_full_list, axis=1)
#print(combined_df)
combined_df.to_csv('Colorado_Ski_Traffic.csv')
#print(appended_data)
print('code_complete')
The output I get is:
[62 rows x 26 columns]"," Count Date Dir 0h 1h 2h 3h 4h 5h 6h ... 15h 16h 17h 18h 19h 20h 21h 22h 23h
0 01/01/2016 P 65 71 64 69 98 168 328 ... 2173 1764 2014 1132 1070 624 391 240 152
1 01/01/2016 S 115 118 99 90 84 108 436 ... 1575 1316 998 753 607 432 326 266 201
2 01/02/2016 P 102 74 78 108 183 372 694 ... 1831 1619 2200 2196 2151 1186 714 360 269
3 01/02/2016 S 142 106 77 77 139 233 854 ... 1298 1221 912 687 549 476 332 291 205
4 01/03/2016 P 156 113 120 112 161 228 438 ... 1752 1615 1364 1328 1201 815 548 260 148
.. ... .. ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
57 01/29/2016 S 202 180 86 101 111 187 462 ... 2204 2390 2226 2272 2214 2182 1944 803 508
58 01/30/2016 P 88 98 163 106 101 125 233 ... 1489 774 925 924 999 1557 796 342 184
59 01/30/2016 S 259 211 113 83 101 221 1338 ... 1124 337 1229 1167 1135 495 415 308 221
60 01/31/2016 P 88 94 76 82 91 154 345 ... 1695 1322 1397 1263 967 1213 1605 1252 1066
61 01/31/2016 S 165 83 53 59 53 160 1190 ... 950 848 635 500 385 323 331 185 190
[62 rows x 26 columns]"," Count Date Dir 0h 1h 2h 3h 4h 5h 6h ... 15h 16h 17h 18h 19h 20h 21h 22h 23h
0 01/01/2016 P 65 71 64 69 98 168 328 ... 2173 1764 2014 1132 1070 624 391 240 152
1 01/01/2016 S 115 118 99 90 84 108 436 ... 1575 1316 998 753 607 432 326 266 201
2 01/02/2016 P 102 74 78 108 183 372 694 ... 1831 1619 2200 2196 2151 1186 714 360 269
3 01/02/2016 S 142 106 77 77 139 233 854 ... 1298 1221 912 687 549 476 332 291 205
4 01/03/2016 P 156 113 120 112 161 228 438 ... 1752 1615 1364 1328 1201 815 548 260 148
.. ... .. ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
57 01/29/2016 S 202 180 86 101 111 187 462 ... 2204 2390 2226 2272 2214 2182 1944 803 508
58 01/30/2016 P 88 98 163 106 101 125 233 ... 1489 774 925 924 999 1557 796 342 184
59 01/30/2016 S 259 211 113 83 101 221 1338 ... 1124 337 1229 1167 1135 495 415 308 221
60 01/31/2016 P 88 94 76 82 91 154 345 ... 1695 1322 1397 1263 967 1213 1605 1252 1066
61 01/31/2016 S 165 83 53 59 53 160 1190 ... 950 848 635 500 385 323 331 185 190
I didn't copy the entire output as it is much too large, but it repeats only certain months and years. Any ideas on how I can get one large dataframe, with no duplicates and only one set of headers?
You will have to grab the column names first
col_names = appended_data[0].columns
Then merge the raw data using numpy (this assumes import numpy as np)
all_data = np.concatenate([df.values for df in appended_data])
And finally restore the column names.
combined_df = pd.DataFrame(data=all_data, columns=col_names)
This seems to be a known pitfall of pandas concat.
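A simpler route, if you would rather stay inside pandas — a sketch, assuming each element of appended_data is the list of DataFrames that pd.read_html returns:

import pandas as pd

# flatten: pd.read_html returns a list of DataFrames per page
flat = [tbl for page in appended_data for tbl in page]
combined_df = pd.concat(flat, axis=0, ignore_index=True)  # one header row, fresh index
combined_df = combined_df.drop_duplicates()               # drop any repeated pulls
combined_df.to_csv('Colorado_Ski_Traffic.csv', index=False)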
I have a list that is a column of numbers in a df called "doylist", for day-of-year list. I need to figure out how to print a user-defined range of rows, in ascending order, from the doylist df. For example, say I need to print the last daysback=60 days in the list from today's day of year, through daysforward=19 days from today's day of year. So, if today's day of year is 47, my new list would range from day of year 352 to day of year 67.
day_of_year = (today - datetime.datetime(today.year, 1, 1)).days + 1
doylist =
doylist
Out[106]:
     dyofyr
0         1
1         2
2         3
3         4
4         5
..      ...
360     361
361     362
362     363
363     364
364     365

[365 rows x 1 columns]
daysback = doylist.iloc[day_of_year-61] # 60 days back from today
daysforward = doylist.iloc[day_of_year+19] # 20 days forward from today
I need my final df or list to look like this:
final_list =
[352, 353, 354, 355, 356, 357, 358, 359, 360, 361, 362, 363, 364, 365,
 1, 2, 3, 4, 5, ..., 63, 64, 65, 66, 67]
I have tried variations of this with the df called "doylist", but I get the following error. Thank you!
finallist = list(range(doylist.iloc[day_of_year-61],doylist.iloc[day_of_year+19]))
Traceback (most recent call last):
Cell In[113], line 1
finallist = list(range(doylist.iloc[day_of_year-61],doylist.iloc[day_of_year+19]))
TypeError: 'Series' object cannot be interpreted as an integer
I can't understand why you are using a dataframe to do this. This could be done with a simple list and modulus.
def days_between_forward_back(day_of_year, days_since, days_forward):
    doylist = [x + 1 for x in range(365)]
    lower_index = (day_of_year - days_since - 1) % 365
    upper_index = day_of_year + days_forward
    assert upper_index < 365
    if lower_index > upper_index:
        result = doylist[lower_index:]
        result.extend(doylist[:upper_index])
        return result
    else:
        return doylist[lower_index:upper_index]
days = days_between_forward_back(47, 60, 20)
print(f"For day of year 47, 60 days before, 20 days ahead, days are {days}")
days = days_between_forward_back(300, 61, 10)
print(f"For day of year 300, 61 days before, 10 days ahead, days are {days}")
Handling the case where both days_since and days_forward will move us to another year is left as an exercise for the asker.
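If you prefer to skip the branching entirely, the same wrap-around can be written as a single modular list comprehension — a sketch assuming the same inputs as the first example (day 47, 60 back, 20 forward):

day_of_year, days_back, days_forward = 47, 60, 20
# step through the window and map each offset back into 1..365
days = [(day_of_year - days_back + i - 1) % 365 + 1
        for i in range(days_back + days_forward + 1)]
print(days)  # [352, 353, ..., 365, 1, 2, ..., 67]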
I think this will help you:
import datetime

this_date = datetime.datetime.now()
how_many_dayes_do_you_want_to_go_back = 80
how_many_dayes_in_each_munth = {1: 31, 2: 28, 3: 31, 4: 30, 5: 31, 6: 30,
                                7: 31, 8: 31, 9: 30, 10: 31, 11: 30, 12: 31}
dayes_in_this_year = 0
for i in range(1, this_date.month + 1):
    dayes_in_this_year += how_many_dayes_in_each_munth.get(i)
if (how_many_dayes_do_you_want_to_go_back % dayes_in_this_year == how_many_dayes_do_you_want_to_go_back
        and how_many_dayes_do_you_want_to_go_back < dayes_in_this_year):
    for i in range(dayes_in_this_year - how_many_dayes_do_you_want_to_go_back, dayes_in_this_year + 1):
        print(i)
else:
    the_rest_to_the_last_year = how_many_dayes_do_you_want_to_go_back - dayes_in_this_year
    for i in range(365 - the_rest_to_the_last_year, 366):
        print(i)
    for i in range(dayes_in_this_year + 1):
        print(i)
And yes, you can improve the code to use it anywhere.
It seems like you're getting hung up while converting back and forth between data formats (int, datetime, etc.). This type of error is much easier to track down and fix if you use Python's newish type hinting to make sure you're being careful with data types. To that end, it is also useful to keep using datetime as much as possible, to take better advantage of the library (so you don't have to keep track of things like leap years on your own). I wrote a few functions to help you convert:
from datetime import datetime, timedelta

def dt_from_doy(year: int, doy: int) -> datetime:
    # useful if you need to use doy from your dataframe to get datetime.
    # if you can convert the input to be a datetime in the first place,
    # that might be even better (fewer conversions of data type)
    return datetime.strptime("{:04d}-{:03d}".format(year, doy), "%Y-%j")

def doy_from_dt(dt: datetime) -> int:
    # used in the example below
    return int(dt.strftime("%j"))

# example
today = datetime(2023, 2, 16)
list_of_dt = [today + timedelta(days=x) for x in range(-20, 20)]
list_of_doy = [doy_from_dt(dt) for dt in list_of_dt]
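Building on those helpers, the window from the question (60 days back through 20 days ahead of day 47) could look like this — a sketch with "today" hard-coded so the output is reproducible:

today = datetime(2023, 2, 16)  # day of year 47
window = [today + timedelta(days=x) for x in range(-60, 21)]
doy_window = [doy_from_dt(dt) for dt in window]
# timedelta crosses the year boundary for you, so this yields 352..365 then 1..67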
I am implementing a probabilistic neural network (PNN) on my dataset. The code below ran on the iris dataset with no error, but when I applied it to my dataset I got the following error:
KeyError Traceback (most recent call last)
<ipython-input-30-230e6aa7ae95> in <module>()
13 for i, (train, test) in enumerate(skfold, start=1):
14 pnn_network = PNN(std=std, step=0.2, verbose=False, batch_size=2)
---> 15 pnn_network.train(input_dataset_data[train], input_dataset_target[train])
16 predictions = pnn_network.predict(input_dataset_data[test])
17 print("Positive in predictions:", 1 in predictions)
~\Anaconda3\lib\site-packages\pandas\core\frame.py in __getitem__(self, key)
2677 if isinstance(key, (Series, np.ndarray, Index, list)):
2678 # either boolean or fancy integer index
-> 2679 return self._getitem_array(key)
2680 elif isinstance(key, DataFrame):
2681 return self._getitem_frame(key)
~\Anaconda3\lib\site-packages\pandas\core\frame.py in _getitem_array(self, key)
2721 return self._take(indexer, axis=0)
2722 else:
-> 2723 indexer = self.loc._convert_to_indexer(key, axis=1)
2724 return self._take(indexer, axis=1)
2725
~\Anaconda3\lib\site-packages\pandas\core\indexing.py in _convert_to_indexer(self, obj, axis, is_setter)
1325 if mask.any():
1326 raise KeyError('{mask} not in index'
-> 1327 .format(mask=objarr[mask]))
1328
1329 return com._values_from_object(indexer)
KeyError: '[ 0 1 2 4 5 6 7 8 9 10 11 12 15 16 17 18 19 20\n 21 22 23 25 26 27 28 29 30 31 32 33 34 35 36 38 39 40\n 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58\n 59 60 61 62 63 64 65 66 67 68 69 71 72 73 74 75 76 77\n 78 79 80 82 83 84 85 86 87 88 90 92 93 94 95 96 97 98\n 99 100 101 102 104 105 106 108 109 110 112 114 115 116 117 118 119 120\n 121 122 123 125 126 127 128 131 132 133 134 136 137 138 139 140 141 142\n 143 144 145 146 147 148 149 151 153 154 155 156 157 159 160 161 162 163\n 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 181 182 183\n 185 186 187 188 189 190 192 193 194 195 196 197 198 199 200 201 202 204\n 205 206 207 208 209 211 212 213 214 215 216 217 218 219 220 221 222 223\n 224 225 226 227 228 229 230 231 232 233 234 236 237 238 239 240 241 242\n 243 244 245 246 247 248 249 250 251 252 253 255 257 258 259 260 261 262\n 263 264 265 267 269 270 271 272 273 274 275 276 277 278 279 280 281 282\n 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 300 301\n 302 303 304 305 306 307 308 309 310 311 312 313 314 315 317 318 320 321\n 322 323 324 325 326 327] not in index'
The code for the iris example is below:
import numpy as np
from sklearn import datasets
from sklearn.cross_validation import StratifiedKFold  # older sklearn API for this call signature
from neupy.algorithms import PNN  # assuming PNN comes from neupy

iris = datasets.load_iris()
input_dataset_data = iris.data
input_dataset_target = iris.target
print(input_dataset_data.shape)
print(input_dataset_target.shape)
kfold_number = 10
skfold = StratifiedKFold(input_dataset_target, kfold_number, shuffle=True)
#print("> Start classify input_dataset dataset")
for std in [0.2, 0.4, 0.6, 0.8, 1]:
    average_results = []
    for i, (train, test) in enumerate(skfold, start=1):
        pnn_network = PNN(std=std, step=0.2, verbose=False, batch_size=2)
        pnn_network.train(input_dataset_data[train], input_dataset_target[train])
        predictions = pnn_network.predict(input_dataset_data[test])
        print("Positive in predictions:", 1 in predictions)
        average_results.append(np.sum(predictions == input_dataset_target[test]) / float(len(predictions)))
    print(std, np.average(average_results))
Below are the shapes of my dataset:
X.shape
(328, 13)
Y.shape
(328,)
You need to access the dataframe by position, with .iloc:
pnn_network.train(input_dataset_data.iloc[train], input_dataset_target.iloc[train])
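The reason is that StratifiedKFold yields positional indices, and plain df[...] with an array of integers looks those integers up as column labels, which raises exactly this KeyError; .iloc selects rows by position instead. A minimal illustration (toy frame, made-up values):

import pandas as pd

df = pd.DataFrame({'a': [10, 20, 30]}, index=['x', 'y', 'z'])
idx = [0, 2]
print(df.iloc[idx])  # rows at positions 0 and 2 -- works
# print(df[idx])     # KeyError: '[0 2] not in index'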
I'm trying to transform my dataset to a normal distribution.
0      8.298511e-03
1      3.055319e-01
2      6.938647e-02
3      2.904091e-02
4      7.422441e-02
           ...
134    0.000000e+00
           ...
401    2.451973e-02
402    7.567017e-02
403    4.837843e-03
404    6.477297e-03
405    7.664675e-02
Name: value, dtype: float64
This is the code I used for transforming dataset:
from scipy import stats
x,_ = stats.boxcox(df)
I get this error:
if any(x <= 0):
-> 1031 raise ValueError("Data must be positive.")
1032
1033 if lmbda is not None: # single transformation
ValueError: Data must be positive
Is it because my values are too small that it's producing an error? I'm not sure what I'm doing wrong. I'm new to using boxcox and could be using it incorrectly in this example. Open to suggestions and alternatives. Thanks!
Your data contains the value 0 (at index 134). When boxcox says the data must be positive, it means strictly positive.
What is the meaning of your data? Does 0 make sense? Is that 0 actually a very small number that was rounded down to 0?
You could simply discard that 0. Alternatively, you could do something like the following. (This amounts to temporarily discarding the 0, and then using -1/λ for the transformed value of 0, where λ is the Box-Cox transformation parameter.)
First, create some data that contains one 0 (all other values are positive):
In [13]: np.random.seed(8675309)
In [14]: data = np.random.gamma(1, 1, size=405)
In [15]: data[100] = 0
(In your code, you would replace that with, say, data = df.values.)
Copy the strictly positive data to posdata:
In [16]: posdata = data[data > 0]
Find the optimal Box-Cox transformation, and verify that λ is positive. This work-around doesn't work if λ ≤ 0.
In [17]: bcdata, lam = boxcox(posdata)
In [18]: lam
Out[18]: 0.244049919975582
Make a new array to hold that result, along with the limiting value of the transform of 0 (which is -1/λ):
In [19]: x = np.empty_like(data)
In [20]: x[data > 0] = bcdata
In [21]: x[data == 0] = -1/lam
The following plot shows the histograms of data and x.
Rather than plain boxcox, you can use boxcox1p. It adds 1 to x, so there won't be any "0" records:
from scipy.special import boxcox1p
y = boxcox1p(x, lmbda)
For more info check out the docs at https://docs.scipy.org/doc/scipy/reference/generated/scipy.special.boxcox1p.html
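Note that, unlike scipy.stats.boxcox, boxcox1p does not estimate the lambda for you, so you have to supply one yourself. A sketch of one way to do that (boxcox_normmax estimates lambda from strictly positive data, so it is fed the 1-shifted values; the toy array is an assumption standing in for the asker's series):

import numpy as np
from scipy.special import boxcox1p
from scipy.stats import boxcox_normmax

x = np.array([0.0, 8.3e-3, 3.1e-1, 6.9e-2, 2.9e-2])  # toy values like the asker's
lmbda = boxcox_normmax(x + 1)  # estimate lambda on the 1-shifted data
y = boxcox1p(x, lmbda)         # equivalent to boxcox(x + 1, lmbda)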
Is the data you are sending to boxcox a 1-dimensional ndarray?
A second option is to add a shift parameter: add the shift (see details at the link) to all of the ndarray elements before sending them to boxcox, then subtract the shift from the resulting array elements (if I have understood the boxcox algorithm correctly, that could be a solution in your case too).
https://docs.scipy.org/doc/scipy-0.16.1/reference/generated/scipy.stats.boxcox.html
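A sketch of that shift idea, with a toy array standing in for the data (the shift constant is an assumption; pick one suited to your data's scale):

import numpy as np
from scipy import stats

data = np.array([8.3e-3, 3.1e-1, 6.9e-2, 0.0, 7.4e-2])  # includes one zero
shift = 1e-6
x, lam = stats.boxcox(data + shift)
# Box-Cox is nonlinear, so the shift cannot simply be subtracted from the
# transformed values; keep it around for any inverse transform instead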
To substitute the numbers with their corresponding "ranks":
import pandas as pd
import numpy as np
numbers = np.random.randint(low=0, high=10001, size=1000)
df = pd.DataFrame({'a': numbers})
df['a_rank'] = df['a'].rank()
I am getting float values as the default output type of the rank method:
987 82.0
988 36.5
989 526.0
990 219.0
991 957.0
992 819.5
993 787.5
994 513.0
Instead of floats I would rather have integers. Rounding the resulting float values with astype(int) would be risky, since converting to int could introduce duplicates from float values that are close to each other, such as 3.5 and 4.0, which would both round to the integer 4.
Is there any way to make the rank method output integers?
The above solution did not work for me. The following did work, though. The critical line, with edits, is:
df['a_rank'] = df['a'].rank(method='dense').astype(int)
This could be a version issue.
Pass the param method='dense'; this will increase the ranks by 1 between groups. See the docs:
In [2]:
numbers = np.random.randint(low=0, high=10001, size=1000)
df = pd.DataFrame({'a': numbers})
df['a_rank'] = df['a'].rank(method='dense')
df
Out[2]:
        a  a_rank
0    1095     114
1    2514     248
2     500      53
3    6112     592
4    5582     533
..    ...     ...
995  7355     705
996  8956     857
997  4831     473
998   222      21
999  9531     917

[1000 rows x 2 columns]
No need to use method='dense', just convert to an integer.
df['a_rank'] = df['a'].rank().astype(int)
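One caveat with that: plain rank() averages the ranks of tied values, so .astype(int) can still yield duplicates; method='first' (or 'dense') avoids that. A quick sketch:

import pandas as pd

s = pd.Series([7, 7, 3])
print(s.rank().astype(int).tolist())                # [2, 2, 1] -- ties collide
print(s.rank(method='first').astype(int).tolist())  # [2, 3, 1] -- unique integers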
I have a dataframe which looks like this
Geneid PRKCZ.exon1 PRKCZ.exon2 PRKCZ.exon3 PRKCZ.exon4 PRKCZ.exon5 PRKCZ.exon6 PRKCZ.exon7 PRKCZ.exon8 PRKCZ.exon9 PRKCZ.exon10 ... FLNA.exon31 FLNA.exon32 FLNA.exon33 FLNA.exon34 FLNA.exon35 FLNA.exon36 FLNA.exon37 FLNA.exon38 MTCP1.exon1 MTCP1.exon2
S28 22 127 135 77 120 159 49 38 409 67 ... 112 104 37 83 47 18 110 70 167 19
22 3 630 178 259 142 640 77 121 521 452 ... 636 288 281 538 276 109 242 314 790 484
S04 16 658 320 337 315 881 188 162 769 577 ... 1291 420 369 859 507 208 554 408 1172 706
56 26 663 343 390 314 1090 263 200 844 592 ... 675 243 250 472 280 133 300 275 750 473
S27 13 1525 571 1081 560 1867 427 370 1348 1530 ... 1817 926 551 1554 808 224 971 1313 1293 701
5 rows × 8297 columns
In the above dataframe I need to add an extra column with information about the index. So I made a list, healthy, with all the indexes to be labelled 'h'; everything else should be 'd'.
I tried the following lines:
healthy=['39','41','49','50','51','52','53','54','56']
H_type =pd.Series( ['h' for x in df.loc[healthy]
else 'd' for x in df]).to_frame()
But it is throwing me following error:
SyntaxError: invalid syntax
Any help would be really appreciated
In the end I am aiming for something like this:
Geneid sampletype SSX4.exon4 SSX2.exon11 DUX4.exon5 SSX2.exon3 SSX4.exon5 SSX2.exon10 SSX4.exon7 SSX2.exon9 SSX4.exon8 ... SETD2.exon21 FAT2.exon15 CASC5.exon8 FAT1.exon21 FAT3.exon9 MLL.exon31 NACA.exon7 RANBP2.exon20 APC.exon16 APOB.exon4
S28 h 0 0 0 0 0 0 0 0 0 ... 2480 2003 2749 1760 2425 3330 4758 2508 4367 4094
22 h 0 0 0 0 0 0 0 0 0 ... 8986 7200 10123 12422 14528 18393 9612 15325 8788 11584
S04 h 0 0 0 0 0 0 0 0 0 ... 14518 16657 17500 15996 17367 17948 18037 19446 24179 28924
56 h 0 0 0 0 0 0 0 0 0 ... 17784 17846 20811 17337 18135 19264 19336 22512 28318 32405
S27 h 0 0 0 0 0 0 0 0 0 ... 10375 20403 11559 18895 18410 12754 21527 11603 16619 37679
Thank you
I think you can use numpy.where with isin, if Geneid is a column.
EDIT by comment:
There can be integers in the column Geneid, so you can cast it to string with astype.
import numpy as np

healthy = ['39','41','49','50','51','52','53','54','56']
df['type'] = np.where(df['Geneid'].astype(str).isin(healthy), 'h', 'd')
#get last column as a list
print(df.columns[-1].split())
['type']
#create new list from last column and all columns without the last
cols = df.columns[-1].split() + df.columns[:-1].tolist()
print(cols)
['type', 'Geneid', 'PRKCZ.exon1', 'PRKCZ.exon2', 'PRKCZ.exon3', 'PRKCZ.exon4',
'PRKCZ.exon5', 'PRKCZ.exon6', 'PRKCZ.exon7', 'PRKCZ.exon8', 'PRKCZ.exon9',
'PRKCZ.exon10', 'FLNA.exon31', 'FLNA.exon32', 'FLNA.exon33', 'FLNA.exon34',
'FLNA.exon35', 'FLNA.exon36', 'FLNA.exon37', 'FLNA.exon38', 'MTCP1.exon1', 'MTCP1.exon2']
#reorder columns
print(df[cols])
type Geneid PRKCZ.exon1 PRKCZ.exon2 PRKCZ.exon3 PRKCZ.exon4 \
0 d S28 22 127 135 77
1 d 22 3 630 178 259
2 d S04 16 658 320 337
3 h 56 26 663 343 390
4 d S27 13 1525 571 1081
PRKCZ.exon5 PRKCZ.exon6 PRKCZ.exon7 PRKCZ.exon8 ... \
0 120 159 49 38 ...
1 142 640 77 121 ...
2 315 881 188 162 ...
3 314 1090 263 200 ...
4 560 1867 427 370 ...
FLNA.exon31 FLNA.exon32 FLNA.exon33 FLNA.exon34 FLNA.exon35 \
0 112 104 37 83 47
1 636 288 281 538 276
2 1291 420 369 859 507
3 675 243 250 472 280
4 1817 926 551 1554 808
FLNA.exon36 FLNA.exon37 FLNA.exon38 MTCP1.exon1 MTCP1.exon2
0 18 110 70 167 19
1 109 242 314 790 484
2 208 554 408 1172 706
3 133 300 275 750 473
4 224 971 1313 1293 701
[5 rows x 22 columns]
You could use pandas isin().
First add an extra column called 'sampletype' and fill it with 'd'. Then find all samples that have a Geneid in healthy and set them to 'h'. Supposing your main dataframe is called df, you would use something like:
healthy = ['39','41','49','50','51','52','53','54','56']
df['sampletype'] = 'd'
df.loc[df['Geneid'].isin(healthy), 'sampletype'] = 'h'
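For reference, a minimal end-to-end sketch of this approach on a cut-down frame built from the question's data:

import pandas as pd

df = pd.DataFrame({'Geneid': ['S28', '22', 'S04', '56', 'S27'],
                   'PRKCZ.exon1': [22, 3, 16, 26, 13]})
healthy = ['39','41','49','50','51','52','53','54','56']
df['sampletype'] = 'd'
# .loc avoids chained assignment, which can fail silently on a copy
df.loc[df['Geneid'].astype(str).isin(healthy), 'sampletype'] = 'h'
print(df)  # only Geneid '56' is labelled 'h'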