Speed of using append in python repeatedly - python

Is it substantially faster to start with a preallocated list and set items at each index, as opposed to starting with an empty list and appending items? I need this list to hold 10k-100k items.
I ask because I am trying to implement an algorithm that requires O(n) time at each level of recursion, but I am getting results that indicate O(n^2) time. I thought perhaps Python needing to keep resizing the list might cause this slowdown.
I found similar questions but none that explicitly answered my question. One answer indicated that garbage collection might be very slow with so many items, so I tried turning gc on and off and saw no improvement in results.
PROBLEM SOLVED:
If anyone is curious, the slowdown was caused by unioning sets together too often. Now I use a different method (involves sorting), to check if the same key is seen twice.
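The sorting-based check itself is not shown in the question. As a rough sketch of what such a check could look like (the helper name `has_duplicate_key` is hypothetical, not the asker's code), sorting a copy of the keys puts equal keys next to each other, so one pass over neighbouring pairs is enough:

```python
def has_duplicate_key(keys):
    """Return True if any key occurs more than once.

    Hypothetical helper: sorts a copy of the keys (O(n log n)) and compares
    neighbours, instead of repeatedly building unions of sets.
    """
    ordered = sorted(keys)
    # After sorting, equal keys are adjacent, so comparing each element
    # with its successor finds every repeat.
    return any(a == b for a, b in zip(ordered, ordered[1:]))


print(has_duplicate_key([3, 1, 2, 1]))
print(has_duplicate_key([3, 1, 2]))
```

This assumes the keys are mutually comparable, which sorting requires.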

CPython over-allocates a list's storage by an amount proportional to the current size of the list. This gives amortized O(1) time for appending to lists.
Here is a simple test to see when a list grows. Note that many of these reallocations can be done in place, so copying the contents isn't always necessary.
>>> import sys
>>> A = []
>>> sz = sys.getsizeof(A)
>>> for i in range(100000):
...     if sz != sys.getsizeof(A):
...         sz = sys.getsizeof(A)
...         print(i, sz)
...     A.append(i)
...
1 48
5 64
9 96
17 132
26 172
36 216
47 264
59 320
73 384
89 456
107 536
127 624
149 724
174 836
202 964
234 1108
270 1268
310 1448
355 1652
406 1880
463 2136
527 2424
599 2748
680 3116
772 3528
875 3992
991 4512
1121 5100
1268 5760
1433 6504
1619 7340
1828 8280
2063 9336
2327 10524
2624 11864
2959 13368
3335 15060
3758 16964
4234 19108
4770 21520
5373 24232
6051 27284
6814 30716
7672 34580
8638 38924
9724 43812
10946 49312
12321 55500
13868 62460
15608 70292
17566 79100
19768 89012
22246 100160
25033 112704
28169 126816
31697 142692
35666 160552
40131 180644
45154 203248
50805 228676
57162 257284
64314 289468
72360 325676
81412 366408
91595 412232
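To check the original question directly, the two approaches can be timed against each other. The following is a minimal sketch using the standard `timeit` module; the function names and the choice of 100,000 items are assumptions for illustration, and the numbers will vary by machine and Python version.

```python
import timeit

N = 100_000  # upper end of the list size mentioned in the question


def with_append(n=N):
    """Start empty and append every item."""
    out = []
    for i in range(n):
        out.append(i)
    return out


def with_prealloc(n=N):
    """Preallocate n slots, then assign by index."""
    out = [None] * n
    for i in range(n):
        out[i] = i
    return out


if __name__ == "__main__":
    for fn in (with_append, with_prealloc):
        # Take the best of several repeats to reduce noise from other processes.
        best = min(timeit.repeat(fn, number=10, repeat=5))
        print(f"{fn.__name__}: {best / 10 * 1e3:.2f} ms per call")
```

Both functions do a linear amount of work, so neither should turn an O(n) step into O(n^2); any difference between them is a constant factor.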

Related

Which ML algorithm would be appropriate for clustering a combination of categorical and numerical dataframe?

I wish to cluster a DataFrame with a dimension of (120000 x 4).
It consists of two string-based "label" columns (str1 and str2), and two numerical columns which looks like the following:
    Str1  Str2    Energy    intensity
0    713   599  7678.159  5367.276014
1    715   598  7678.182  6576.100453
2    714   597  7678.183  5675.788001
3    684   587  7678.493  3040.650157
4    693   588  7678.585  5585.908164
5    695   586  7678.615  3184.001905
6    684   584  7678.674  4896.774505
7    799   509  7693.645  4907.484401
8    798   508  7693.754  4075.800912
9    797   507  7693.781  4407.800702
10   796   506  7694.043  3138.073328
11   794   505  7694.049  3653.699936
12   795   504  7694.077  3875.120022
13   675   277  7694.948  3081.797654
14   709   221  7698.216  3587.704908
15   708   220  7698.252  4070.050144
...........
What would be the best ML algorithm to cluster/categorize this data?
I have tried plotting individual energy&intensity components belonging to one specific category Str1== "713" etc, which didn't give me much information. I am in need of somewhat more compact clustering, if possible.
You can try ordinal (label) encoding or one-hot encoding on Str1 and Str2. Ordinal encoding suits categories that have a natural order, while one-hot encoding is more widely used. Either one converts the strings into numerical data, after which you can use any standard model that expects numerical input, such as a clustering algorithm.
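To make the two encodings concrete, here is a small pure-Python sketch. The helper names `one_hot` and `ordinal` are made up for this illustration; in practice a library routine such as `pandas.get_dummies` does the one-hot step for you. The sample values are the first seven Str1 entries from the question.

```python
def one_hot(values):
    """Return (categories, rows); each row has a single 1 marking its category."""
    categories = sorted(set(values))
    position = {cat: k for k, cat in enumerate(categories)}
    rows = []
    for value in values:
        row = [0] * len(categories)
        row[position[value]] = 1
        rows.append(row)
    return categories, rows


def ordinal(values):
    """Map each category to its rank in sorted order (implies an ordering)."""
    position = {cat: k for k, cat in enumerate(sorted(set(values)))}
    return [position[value] for value in values]


str1 = ["713", "715", "714", "684", "693", "695", "684"]
categories, encoded = one_hot(str1)
print(categories)
print(encoded[0])
print(ordinal(str1))
```

Note that with many distinct labels, one-hot encoding produces one column per label, which can make the encoded table very wide.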

Cannot open eps file after saving figure

Normally, opening an eps file is no problem but with this current code in Python that I am working on, the exported eps file is loading when opened but never appearing. I have tried exporting the same figure as a png and that works fine. Also I have tried exporting a really simple figure as eps and that opens without any flaws. I have included some of the relevant code concerning the plot/figure. Any help would be much appreciated.
#%% plot section
plt.close('all')
plt.figure()
plt.errorbar(r,omega,yerr=omega_err,fmt='mo')
plt.xlabel('xlabel')
plt.ylabel('ylabel')
plt.title('profile averaged from {} ms to {} ms \n shot {}'.format(tidsinterval[0],tidsinterval[1],skud_numre[0]),y=1.05)
plt.grid()
plt.axhline(y=2.45,color='Red')
plt.text(39,2.43,'txt block for horizontal line',backgroundcolor='white')
plt.axvline(x=37.5,color='Purple')
plt.text(37.5,1.2,'txt block for vertical line',ha='center',va="center",rotation='vertical',backgroundcolor='white')
plt.savefig('directory/plot.eps', format='eps')
plt.show()
The variables r, omega, omega_err are vectors of float of small sizes (6 perhaps).
Update: The program I use for opening eps-files is Evince, furthermore, one can download the eps file here https://filedropper.com/d/s/z7lxUCtANeox7tDMQ6dI6HZUpcTfHn. As far as I can see, it is fine sharing files over filedropper via community guidelines, but if I'm wrong please say so.
Found out that it is possible to open the file as long as there is no text contained in the plot (for example x-label,y-label, title and so on), so the problem has to be related to the text.
The short answer is it's your font. The /e glyph is throwing an error on setcachedevice (your PostScript interpreter should have told you this).
The actual problem is that the font program is careless (at least) about its use of function names. The program contains this:
/mpldict 11 dict def
mpldict begin
/d { bind def } bind def
That creates a new dictionary called mpldict, begins that dictionary (makes it the topmost entry in the dictionary stack) and defines a function called 'd' in that dictionary.
We then move on to the font definition. There is a lot of boilerplate in here, but each character shape is defined by an entry in the font's CharStrings dictionary. We'll pick that up with the definition of the function called 'd' in the font's CharStrings dictionary:
/d{1300 0 113 -29 1114 1556 sc
930 950 m
930 1556 l
ce} d
(2.60) == flush
Notice that what this does is create a new definition of a function named 'd' in the current dictionary. That's not a problem in itself. We now have two functions named 'd'; one in the current dictionary (the font's CharStrings dictionary) and one in 'mpldict'.
Then we define the next character:
/e{1260 0 113 -29 1151 1147 sc
1151 606 m
1151 516 l
305 516 l
313 389 351 293 419 226 c
488 160 583 127 705 127 c
776 127 844 136 910 153 c
977 170 1043 196 1108 231 c
1108 57 l
1042 29 974 8 905 -7 c
836 -22 765 -29 694 -29 c
515 -29 374 23 269 127 c
165 231 113 372 113 549 c
113 732 162 878 261 985 c
360 1093 494 1147 662 1147 c
813 1147 932 1098 1019 1001 c
1107 904 1151 773 1151 606 c
967 660 m
966 761 937 841 882 901 c
827 961 755 991 664 991 c
561 991 479 962 417 904 c
356 846 320 764 311 659 c
967 660 l
ce} d
Now, the last thing we do at the end of defining that character shape (for the character named 'e') is call a function named 'd'. But there are two, so which one do we call? The answer is that we work backwards down the dictionary stack, looking in each dictionary to see if it has a function called 'd', and we use the first one we find. The current dictionary is the font's CharStrings dictionary, and it has a function called 'd' (which defines the 'd' character), so we call that.
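As a loose analogy only (this is Python, not PostScript, and the dictionary contents are placeholders), the name lookup described above behaves like `collections.ChainMap`, which searches its mappings in order and returns the first match:

```python
from collections import ChainMap

# Placeholder stand-ins for the two PostScript dictionaries.
mpldict = {"d": "helper: { bind def }"}
charstrings = {"d": "glyph procedure for the letter d"}

# The most recently begun dictionary sits on top of the stack
# and is therefore searched first.
dict_stack = ChainMap(charstrings, mpldict)

found = dict_stack["d"]
print(found)
```

The lookup returns the glyph procedure, not the helper, which mirrors why the wrong 'd' is executed in the font program.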
And that function then tries to use setcachedevice. That operator is not legal except when executing a character description, which we are not doing, so it throws an undefined error.
Now your PostScript interpreter should tell you there is an error (Ghostscript, for example, does so). Because there is an error the interpreter stops and doesn't draw anything further, which is why you get a blank page.
What can you do about this? Well, you could raise a bug report with the creating application (apparently Matplotlib created the font too). This is not a good way to define a font!
Other than that, well frankly the only thing you can do is search and replace through the file. If you look for occurrences of ce} d and replace them with ce}bind def it'll probably work. This time.
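Since the question is about a Python workflow, the search and replace could be scripted. This is a minimal sketch, assuming the glyph definitions end exactly as shown above; the function names and file handling are illustrative, and it carries the same "probably, this time" caveat as doing the replacement by hand.

```python
import re

# Matches the glyph terminator "ce} d" where "d" is a standalone name,
# i.e. followed by whitespace or the end of the file.
_GLYPH_END = re.compile(r"ce\}\s*d(?=\s|\Z)")


def patch_glyph_definitions(ps_text):
    """Replace each 'ce} d' with 'ce}bind def' in PostScript source text."""
    return _GLYPH_END.sub("ce}bind def", ps_text)


def patch_file(src_path, dst_path):
    """Read an EPS file, patch it, and write the result to a new file."""
    # latin-1 maps every byte to one character, so the file round-trips unchanged
    # apart from the replaced text.
    with open(src_path, encoding="latin-1") as fh:
        text = fh.read()
    with open(dst_path, "w", encoding="latin-1", newline="") as fh:
        fh.write(patch_glyph_definitions(text))
```

For example, `patch_file('directory/plot.eps', 'directory/plot_fixed.eps')` would write a patched copy and leave the original untouched.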

pandas not sorting as expected

I have a pandas DataFrame I am trying to sort, which contains an int column (encoded target) which I sort like so:
some_set.encoded_target = train_set.encoded_target.astype(int) # last but one column
some_set.sort_values(by='encoded_target', ascending=True)
print(some_set)
and this gives me:
1953 61c4930b42ca426eb8dfaf7314899d08__11_115_3... 61c4930b42ca426eb8dfaf7314899d08__115 134 61c4930b42ca426eb8dfaf7314899d08
1623 3659cfea02b44543812e13f0d7fb7147__105_105_4... 3659cfea02b44543812e13f0d7fb7147__105 63 3659cfea02b44543812e13f0d7fb7147
241 bd67717fe59e4fa8bb5307a663016eb3__13_13_3_p... bd67717fe59e4fa8bb5307a663016eb3__13 290 bd67717fe59e4fa8bb5307a663016eb3
1573 9fdfabfad9974d6cac5b588ff2d9e47a__194__194_2... 9fdfabfad9974d6cac5b588ff2d9e47a__194 238 9fdfabfad9974d6cac5b588ff2d9e47a
602 0a64aee93755481cb9f5162373c776f8__182__182_1... 0a64aee93755481cb9f5162373c776f8__182 13 0a64aee93755481cb9f5162373c776f8
... ... ... ... ...
1779 7b19321376b842a2aece02cd458fb043__186__186_3... 7b19321376b842a2aece02cd458fb043__186 187 7b19321376b842a2aece02cd458fb043
2910 64bff78431914373a78c8f547d985b7d__141__141_2... 64bff78431914373a78c8f547d985b7d__141 142 64bff78431914373a78c8f547d985b7d
1377 2410de3f2fee45cdab25b61428f282bd__93__93_3_p... 2410de3f2fee45cdab25b61428f282bd__93 39 2410de3f2fee45cdab25b61428f282bd
2533 a567db4f10c34228b5452f79b5ff08d7__43__43_1_p... a567db4f10c34228b5452f79b5ff08d7__43 247 a567db4f10c34228b5452f79b5ff08d7
2790 9430d8f375bc4888a0a61b47bc7228fd__102__102_3... 9430d8f375bc4888a0a61b47bc7228fd__102 217 9430d8f375bc4888a0a61b47bc7228fd
Clearly, this is wrong: 13 must come before 134.
I have spent two hours trying to figure out WTF could be wrong, but I am having no luck whatsoever.
:((
Any clues would be great.
One thing to remember is to assign the result back: sort_values returns a new, sorted DataFrame and does not modify the original by default.
some_set = some_set.sort_values(by='encoded_target', ascending=True)

How can I read a file having different column for each rows?

my data looks like this.
0 199 1028 251 1449 847 1483 1314 23 1066 604 398 225 552 1512 1598
1 1214 910 631 422 503 183 887 342 794 590 392 874 1223 314 276 1411
2 1199 700 1717 450 1043 540 552 101 359 219 64 781 953
10 1707 1019 463 827 675 874 470 943 667 237 1440 892 677 631 425
How can I read this file structure in Python? I want to extract a specific column from the rows. For example, if I want the value in the second row, second column, how can I do that? I tried 'loadtxt' with a string data type, but that requires slicing the strings by index, and I could not proceed because the values have different numbers of digits. Moreover, each row has a different number of columns. Can you guys help me?
Thanks in advance.
Use something like this to split it
# txt is assumed to hold the whole file content as one string
split2=[]
split1=txt.split("\n")
for item in split1:
    split2.append(item.split(" "))
I have stored given data in "data.txt". Try below code once.
res=[]
arr=[]
lines = open('data.txt').read().split("\n")
# split every line into its space-separated fields
for i in lines:
    arr=i.split(" ")
    res.append(arr)
# print the fields back, one row per line
for i in range(len(res)):
    for j in range(len(res[i])):
        print(res[i][j],sep=' ', end=' ', flush=True)
    print()
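Neither snippet above shows the actual lookup the question asks for. As a small sketch (the helper name `read_ragged` is made up, and the sample is the first three rows from the question), splitting on any whitespace and converting to int gives a list of rows of different lengths that can be indexed directly:

```python
def read_ragged(lines):
    """Split each non-empty line on whitespace and convert the fields to int."""
    return [[int(tok) for tok in line.split()] for line in lines if line.strip()]


sample = """\
0 199 1028 251 1449 847 1483 1314 23 1066 604 398 225 552 1512 1598
1 1214 910 631 422 503 183 887 342 794 590 392 874 1223 314 276 1411
2 1199 700 1717 450 1043 540 552 101 359 219 64 781 953
"""

rows = read_ragged(sample.splitlines())

# Indices are zero-based: rows[1][1] is the second row, second column.
value = rows[1][1]
print(value)
```

To read from a file instead, pass the open file object, for example `with open('data.txt') as fh: rows = read_ragged(fh)`. Because rows differ in length, check `len(rows[i])` before indexing a column that may not exist.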

closest points based on coordinates, python

I have a list of stations with x and y coordinates. I am trying to find at least the 4 closest points for each station. I had a look at this link but was not able to figure out how to do that.
for example, my data looks like:
station Y X
601 28.47 83.43
604 28.45 83.42
605 28.16 83.36
606 28.29 83.39
607 28.38 83.36
608 28.49 83.53
609 28.21 83.34
610 29.03 83.53
612 29.11 83.58
613 28.11 83.45
614 28.13 83.42
615 282.4 83.06
616 28.36 83.13
619 28.24 83.44
620 28.02 83.39
621 28.23 83.24
622 28.09 83.34
623 29.06 84
624 28.58 83.47
625 28.54 83.41
626 28.28 83.36
627 28.23 83.29
628 28.3 83.18
629 28.34 83.23
630 28.08 83.37
633 29.11 83.59
Any help will be highly appreciated.
For large data, you might try to be clever with regard to data structures. As already tagged by yourself, there are specialized data structures for these kinds of lookups. Scipy supports some, sklearn is even more complete (and imho better and more actively developed for these tasks; personal opinion)!
The code example uses scipy's API to avoid (Python) loops. The disadvantage is that the 0-distance from each element to itself has to be discarded.
Code
import numpy as np
from scipy.spatial import KDTree
""" Data """
data_i = np.array([601, 604, 605, 606])
data = np.array([[28.47, 83.43],[28.45, 83.42],[28.16, 83.36],[82.29, 83.39]])
print(data_i)
print(data)
""" KDTree """
N_NEIGHBORS = 2
kdtree = KDTree(data)
kdtree_q = kdtree.query(data, N_NEIGHBORS+1) # 0-dist to self -> +1
print(data_i[kdtree_q[1][:, 1:]]) # discard 0-dist
# uses guarantee of sorted-by-dist
Output
[601 604 605 606]
[[ 28.47 83.43]
[ 28.45 83.42]
[ 28.16 83.36]
[ 82.29 83.39]]
[[604 605]
[601 605]
[604 601]
[601 604]]
