I am trying to build a DataFrame in Python and update its rows and columns through a loop, based on various calculations. The calculations are all correct, but when I display the DataFrame once the loop is complete, only some of the calculated numbers appear; the rest are mostly zeros. Below is an example:
map = pd.DataFrame(np.zeros([16, 6]), columns=['A', 'B', 'C', 'D', 'E', 'F'])
for i in range(0, len(map)):
    map.A[i] = 1 + 1  # some calculations
    map.B[i] = map.A[i] + 2
print map
Result (just an example):
A B C D E F
1 2 4 0.000 0.000 0.000 0.000
2 0.000 0.000 0.000 0.000 0.000 0.000
3 0.000 0.000 0.000 0.000 0.000 0.000
4 0.000 0.000 0.000 0.000 0.000 0.000
5 0.000 0.000 0.000 0.000 0.000 0.000
(continues for 16 rows)
However, if I print a specific column, I get the real calculated numbers. Also, the calculation for B uses the correct numbers from A, so it has to be just a display problem. I am guessing it is something to do with initializing the array and the memory, but I am not sure. I originally used np.empty([16, 6]), but the same result occurred. How do I get the DataFrame to print the actual numbers, not the zeros?
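For what it's worth, a common cause of this symptom is chained assignment: map.A[i] = ... writes through an intermediate Series that may be a copy rather than a view into the frame. A minimal sketch of the same loop using a single .loc indexer instead (also renaming map, which shadows the built-in):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.zeros([16, 6]), columns=['A', 'B', 'C', 'D', 'E', 'F'])
for i in range(len(df)):
    df.loc[i, 'A'] = 1 + 1               # some calculations
    df.loc[i, 'B'] = df.loc[i, 'A'] + 2  # reads the value just written
print(df)  # every row now shows A=2.0 and B=4.0
```

Writing through df.loc[row, col] guarantees the assignment lands in the frame itself, so the printed DataFrame matches the per-column values.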
I'm trying to profile my Python script using cProfile and displaying the results with pstats. In particular, I'm trying to use the pstats call p.sort_stats('time').print_callers(20) to print only the top 20 functions by time, as described in the documentation.
I expect to get only the top 20 results (functions profiled and their calling functions, ordered by time). Instead, I get a seemingly unfiltered list that completely saturates my terminal (hence my estimate of well over 1,000 functions).
Why is my restriction argument (i.e. 20) being ignored by print_callers() and how can I fix this?
I've tried looking up an answer and couldn't find one. And I tried to create a minimal reproducible example, but when I do, I can't reproduce the problem (i.e. it works fine).
My profiling code is:
import cProfile
import pstats

if __name__ == '__main__':
    cProfile.run('main()', 'mystats')
    p = pstats.Stats('mystats')
    p.sort_stats('time').print_callers(20)
I'm trying to avoid having to post my full code, so if someone else has encountered this issue before, and can answer without seeing my full code, that would be great.
Thank you very much in advance.
Edit 1:
Partial output:
Ordered by: internal time
List reduced from 1430 to 1 due to restriction <1>
Function was called by...
ncalls tottime cumtime
{built-in method builtins.isinstance} <- 2237 0.000 0.000 <frozen importlib._bootstrap>:997(_handle_fromlist)
9 0.000 0.000 <frozen importlib._bootstrap_external>:485(_compile_bytecode)
44 0.000 0.000 <frozen importlib._bootstrap_external>:1117(_get_spec)
4872 0.001 0.001 C:\Users\rafael.natan\AppData\Local\Continuum\anaconda3\lib\_strptime.py:321(_strptime)
5 0.000 0.000 C:\Users\rafael.natan\AppData\Local\Continuum\anaconda3\lib\abc.py:196(__subclasscheck__)
26 0.000 0.000 C:\Users\rafael.natan\AppData\Local\Continuum\anaconda3\lib\calendar.py:58(__getitem__)
14 0.000 0.000 C:\Users\rafael.natan\AppData\Local\Continuum\anaconda3\lib\calendar.py:77(__getitem__)
2 0.000 0.000 C:\Users\rafael.natan\AppData\Local\Continuum\anaconda3\lib\distutils\version.py:331(_cmp)
20 0.000 0.000 C:\Users\rafael.natan\AppData\Local\Continuum\anaconda3\lib\enum.py:797(__or__)
362 0.000 0.000 C:\Users\rafael.natan\AppData\Local\Continuum\anaconda3\lib\enum.py:803(__and__)
1 0.000 0.000 C:\Users\rafael.natan\AppData\Local\Continuum\anaconda3\lib\inspect.py:73(isclass)
30 0.000 0.000 C:\Users\rafael.natan\AppData\Local\Continuum\anaconda3\lib\json\encoder.py:182(encode)
2 0.000 0.000 C:\Users\rafael.natan\AppData\Local\Continuum\anaconda3\lib\ntpath.py:34(_get_bothseps)
1 0.000 0.000 C:\Users\rafael.natan\AppData\Local\Continuum\anaconda3\lib\ntpath.py:75(join)
4 0.000 0.000 C:\Users\rafael.natan\AppData\Local\Continuum\anaconda3\lib\ntpath.py:122(splitdrive)
3 0.000 0.000 C:\Users\rafael.natan\AppData\Local\Continuum\anaconda3\lib\ntpath.py:309(expanduser)
4 0.000 0.000 C:\Users\rafael.natan\AppData\Local\Continuum\anaconda3\lib\os.py:728(check_str)
44 0.000 0.000 C:\Users\rafael.natan\AppData\Local\Continuum\anaconda3\lib\re.py:249(escape)
4 0.000 0.000 C:\Users\rafael.natan\AppData\Local\Continuum\anaconda3\lib\re.py:286(_compile)
609 0.000 0.000 C:\Users\rafael.natan\AppData\Local\Continuum\anaconda3\lib\site-packages\dateutil\parser\_parser.py:62(__init__)
1222 0.000 0.000 C:\Users\rafael.natan\AppData\Local\Continuum\anaconda3\lib\site-packages\numpy\core\_methods.py:48(_count_reduce_items)
1222 0.000 0.000 C:\Users\rafael.natan\AppData\Local\Continuum\anaconda3\lib\site-packages\numpy\core\_methods.py:58(_mean)
1 0.000 0.000 C:\Users\rafael.natan\AppData\Local\Continuum\anaconda3\lib\site-packages\numpy\core\arrayprint.py:834(__init__)
1393 0.000 0.000 C:\Users\rafael.natan\AppData\Local\Continuum\anaconda3\lib\site-packages\numpy\core\fromnumeric.py:1583(ravel)
1239 0.000 0.000 C:\Users\rafael.natan\AppData\Local\Continuum\anaconda3\lib\site-packages\numpy\core\fromnumeric.py:1966(sum)
...
I figured out the issue.
As usual, the Python standard library does not have a bug; rather, I misunderstood the output of the function call.
I'm elaborating here as an answer in case it helps anyone clear up this misunderstanding in the future.
When I asked the question, I didn't understand why p.print_callers(20) prints out to terminal over a thousand lines, even though I am restricting it to the top 20 function calls (by time).
What is actually happening is that the restriction does limit the list to the top 20 "most time consuming functions", but print_callers then prints every function that called each of those top 20.
Since each of the top 20 functions was called by roughly 100 different functions on average, each one had about 100 lines associated with it. So 20 × 100 = 2000, which is why p.print_callers(20) printed well over a thousand lines and saturated my terminal.
I hope this saves someone some time and debugging headache :)
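To make the distinction concrete, here is a small sketch on a toy function (exact output will vary by run): print_stats(n) caps the list of printed functions themselves, while print_callers(n) caps the functions but then prints all of their callers.

```python
import cProfile
import io
import pstats

def work():
    return sum(i * i for i in range(1000))

pr = cProfile.Profile()
pr.enable()
work()
pr.disable()

buf = io.StringIO()
stats = pstats.Stats(pr, stream=buf)
stats.sort_stats('time').print_stats(2)  # at most 2 functions printed
stats.print_callers(2)                   # 2 functions, but ALL of their callers
print(buf.getvalue())
```

With a large program, that second call is what produces the wall of output: the restriction applies before the caller expansion, not after.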
Here is the rate-limiting function in my code:
def timepropagate(wv1, ham11, ham12, ham22, scalararray, nt):
    wv2 = np.zeros((nx, ny), 'c16')
    fw1 = np.zeros((nx, ny), 'c16')
    fw2 = np.zeros((nx, ny), 'c16')
    for t in range(0, nt, 1):
        wv1, wv2 = scalararray*wv1, scalararray*wv2
        fw1, fw2 = (np.fft.fft2(wv1), np.fft.fft2(wv2))
        fw1 = ham11*fw1 + ham12*fw2
        fw2 = ham12*fw1 + ham22*fw2
        wv1, wv2 = (np.fft.ifft2(fw1), np.fft.ifft2(fw2))
        wv1, wv2 = scalararray*wv1, scalararray*wv2
    del fw1
    del fw2
    return np.array([wv1, wv2])
What I would need is a reasonably fast implementation: at least twice the current speed, and ideally as fast as possible.
The more general question I'm interested in is how to speed up this piece with as few round trips back to Python as possible. I assume that even if I speed up specific segments, say the scalar-array multiplications, I would still drop back into Python at the Fourier transforms, which takes time. Is there any way, with numba or Cython for example, to avoid this "coming back" to Python in the middle of the loop?
On a personal note, I'd prefer something fast on a single thread considering that I'd be using my other threads already.
Edit: here are the profiling results, for 4096×4096 arrays over 10 time steps; I need to scale up to nt = 8000.
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.099 0.099 432.556 432.556 <string>:1(<module>)
40 0.031 0.001 28.792 0.720 fftpack.py:100(fft)
40 45.867 1.147 68.055 1.701 fftpack.py:195(ifft)
80 0.236 0.003 47.647 0.596 fftpack.py:46(_raw_fft)
40 0.102 0.003 1.260 0.032 fftpack.py:598(_cook_nd_args)
40 1.615 0.040 99.774 2.494 fftpack.py:617(_raw_fftnd)
20 0.225 0.011 29.739 1.487 fftpack.py:819(fft2)
20 2.252 0.113 72.512 3.626 fftpack.py:908(ifft2)
80 0.000 0.000 0.000 0.000 fftpack.py:93(_unitary)
40 0.631 0.016 0.820 0.021 fromnumeric.py:43(_wrapit)
80 0.009 0.000 0.009 0.000 fromnumeric.py:457(swapaxes)
40 0.338 0.008 1.158 0.029 fromnumeric.py:56(take)
200 0.064 0.000 0.219 0.001 numeric.py:414(asarray)
1 329.728 329.728 432.458 432.458 profiling.py:86(timepropagate)
1 0.036 0.036 432.592 432.592 {built-in method builtins.exec}
40 0.001 0.000 0.001 0.000 {built-in method builtins.getattr}
120 0.000 0.000 0.000 0.000 {built-in method builtins.len}
241 3.930 0.016 3.930 0.016 {built-in method numpy.core.multiarray.array}
3 0.000 0.000 0.000 0.000 {built-in method numpy.core.multiarray.zeros}
40 18.861 0.472 18.861 0.472 {built-in method numpy.fft.fftpack_lite.cfftb}
40 28.539 0.713 28.539 0.713 {built-in method numpy.fft.fftpack_lite.cfftf}
1 0.000 0.000 0.000 0.000 {built-in method numpy.fft.fftpack_lite.cffti}
80 0.000 0.000 0.000 0.000 {method 'append' of 'list' objects}
40 0.006 0.000 0.006 0.000 {method 'astype' of 'numpy.ndarray' objects}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
80 0.000 0.000 0.000 0.000 {method 'pop' of 'list' objects}
40 0.000 0.000 0.000 0.000 {method 'reverse' of 'list' objects}
80 0.000 0.000 0.000 0.000 {method 'setdefault' of 'dict' objects}
80 0.001 0.000 0.001 0.000 {method 'swapaxes' of 'numpy.ndarray' objects}
40 0.022 0.001 0.022 0.001 {method 'take' of 'numpy.ndarray' objects}
I think I did this wrong the first time: I used time.time() to measure time differences on small arrays and extrapolated the conclusions to larger ones.
If most of the time is spent in the Hamiltonian multiplication, you may want to apply numba to that part. The biggest benefit comes from removing all the temporary arrays that NumPy would otherwise allocate when evaluating the expressions.
Bear in mind also that (4096, 4096) c16 arrays are too big to fit comfortably in the processor caches: a single matrix takes 256 MiB. So performance is unlikely to be bound by the arithmetic itself, but rather by memory bandwidth. Implement those operations so that you make only one pass over the input operands; this is trivial to express in numba. Note: you only need to implement the Hamiltonian expressions in numba.
I also want to point out that the "preallocations" using np.zeros suggest your code is not doing what you intend, since:
fw1 = ham11*fw1+ham12*fw2
fw2 = ham12*fw1+ham22*fw2
actually creates new arrays and rebinds fw1 and fw2. If your intent was to reuse the buffers, write "fw1[:, :] = ...". Otherwise the np.zeros calls do nothing but waste time and memory.
You may also want to consider joining (wv1, wv2) into a single (2, 4096, 4096) c16 array, and likewise (fw1, fw2). The code becomes simpler because broadcasting handles the scalararray product, and fft2/ifft2 will do the right thing on the trailing two axes (AFAIK).
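The rebinding-versus-buffer-reuse point can be sketched on toy-sized arrays (names mirror the question; the assertion only shows that both spellings compute the same values, while one of them reuses existing storage):

```python
import numpy as np

nx = ny = 8  # toy size; the question uses 4096
rng = np.random.default_rng(0)
ham11, ham12, ham22 = (rng.standard_normal((nx, ny)) for _ in range(3))
fw1 = rng.standard_normal((nx, ny)).astype('c16')
fw2 = rng.standard_normal((nx, ny)).astype('c16')

# Rebinding: allocates a brand-new result array every iteration.
new_fw1 = ham11 * fw1 + ham12 * fw2

# Slice assignment: writes into a preallocated buffer instead.
# (The right-hand side still builds temporaries; a numba kernel
# fusing the expression would remove those as well.)
out = np.empty_like(fw1)
out[:, :] = ham11 * fw1 + ham12 * fw2

assert np.allclose(new_fw1, out)
```

With bandwidth-bound arrays of this size, avoiding the extra allocations and passes is where the time goes, not the multiplications themselves.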
Sorting a list of tuples (dictionary key/value pairs, where the key is a random string) is faster when I do not explicitly specify that the key should be used (edit: added operator.itemgetter(0) from the comment by @chepner, and the key version is now the fastest!):
import timeit

setup = """
import operator
import random
import string
random.seed('slartibartfast')
d = {}
for i in range(1000):
    d[''.join(random.choice(string.ascii_uppercase) for _ in range(16))] = 0
"""

print min(timeit.Timer('for k,v in sorted(d.iteritems()): pass',
                       setup=setup).repeat(7, 1000))
print min(timeit.Timer('for k,v in sorted(d.iteritems(), key=lambda x: x[0]): pass',
                       setup=setup).repeat(7, 1000))
print min(timeit.Timer('for k,v in sorted(d.iteritems(), key=operator.itemgetter(0)): pass',
                       setup=setup).repeat(7, 1000))
Gives:
0.575334150664
0.579534521128
0.523808984422 (the itemgetter version!)
If however I fill the dict with instances of a custom class, passing key=lambda x: x[0] explicitly to sorted makes it faster:
setup = """
import operator
import random
import string
random.seed('slartibartfast')
d = {}

class A(object):
    def __init__(self):
        self.s = ''.join(random.choice(string.ascii_uppercase)
                         for _ in range(16))
    def __hash__(self): return hash(self.s)
    def __eq__(self, other):
        return self.s == other.s
    def __ne__(self, other): return self.s != other.s
    # def __cmp__(self, other): return cmp(self.s, other.s)

for i in range(1000):
    d[A()] = 0
"""

print min(timeit.Timer('for k,v in sorted(d.iteritems()): pass',
                       setup=setup).repeat(3, 1000))
print min(timeit.Timer('for k,v in sorted(d.iteritems(), key=lambda x: x[0]): pass',
                       setup=setup).repeat(3, 1000))
print min(timeit.Timer('for k,v in sorted(d.iteritems(), key=operator.itemgetter(0)): pass',
                       setup=setup).repeat(3, 1000))
Gives:
4.65625458083
1.87191002252
1.78853626684
Is this expected? It seems the second element of the tuple is used in the second case, but shouldn't the keys compare unequal?
Note: uncommenting the comparison method gives worse results, but the times are still roughly halved:
8.11941771831
5.29207000173
5.25420037046
As expected, the built-in (address) comparison is faster.
EDIT: here are the profiling results from my original code that triggered the question - without the key method:
12739 function calls in 0.007 seconds
Ordered by: cumulative time
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 0.007 0.007 <string>:1(<module>)
1 0.000 0.000 0.007 0.007 __init__.py:6527(_refreshOrder)
1 0.002 0.002 0.006 0.006 {sorted}
4050 0.003 0.000 0.004 0.000 bolt.py:1040(__cmp__) # here is the custom object
4050 0.001 0.000 0.001 0.000 {cmp}
4050 0.000 0.000 0.000 0.000 {isinstance}
1 0.000 0.000 0.000 0.000 {method 'sort' of 'list' objects}
291 0.000 0.000 0.000 0.000 __init__.py:6537(<lambda>)
291 0.000 0.000 0.000 0.000 {method 'append' of 'list' objects}
1 0.000 0.000 0.000 0.000 bolt.py:1240(iteritems)
1 0.000 0.000 0.000 0.000 {method 'iteritems' of 'dict' objects}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
and here are the results when I specify the key:
7027 function calls in 0.004 seconds
Ordered by: cumulative time
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 0.004 0.004 <string>:1(<module>)
1 0.000 0.000 0.004 0.004 __init__.py:6527(_refreshOrder)
1 0.001 0.001 0.003 0.003 {sorted}
2049 0.001 0.000 0.002 0.000 bolt.py:1040(__cmp__)
2049 0.000 0.000 0.000 0.000 {cmp}
2049 0.000 0.000 0.000 0.000 {isinstance}
1 0.000 0.000 0.000 0.000 {method 'sort' of 'list' objects}
291 0.000 0.000 0.000 0.000 __init__.py:6538(<lambda>)
291 0.000 0.000 0.000 0.000 __init__.py:6533(<lambda>)
291 0.000 0.000 0.000 0.000 {method 'append' of 'list' objects}
1 0.000 0.000 0.000 0.000 bolt.py:1240(iteritems)
1 0.000 0.000 0.000 0.000 {method 'iteritems' of 'dict' objects}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
Apparently it is the __cmp__ and not the __eq__ method that is called (edit: because that class defines __cmp__ but not __eq__; see here for the resolution order of equality and comparison).
In the code here, the __eq__ method is indeed called (8605 times), as seen by adding debug prints (see the comments).
So the difference is as stated in the answer by @chepner. The one thing I am still not clear on is why those tuple equality calls are needed (in other words, why __eq__ is called rather than __cmp__ directly).
FINAL EDIT: I asked this last point here: Why in comparing python tuples of objects is __eq__ and then __cmp__ called? It turns out to be an optimization: tuple comparison calls __eq__ on the tuple elements, and only calls __cmp__ on elements that are not equal. So this is now perfectly clear. I had thought it called __cmp__ directly, so initially specifying the key seemed simply unneeded, and even after chepner's answer I still wasn't seeing where the equality calls came in.
Gist: https://gist.github.com/Utumno/f3d25e0fe4bd0f43ceb9178a60181a53
There are two issues at play.
Comparing two values of builtin types (such as int) happens in C. Comparing two values of a class with an __eq__ method happens in Python; repeatedly calling __eq__ imposes a significant performance penalty.
The function passed as key is called once per element, rather than once per comparison. This means that lambda x: x[0] is called once per element to build a list of A instances to be used as sort keys. Without key, the sort needs to make O(n lg n) tuple comparisons, each of which requires a call to A.__eq__ to compare the first element of each tuple.
The first point explains why your first pair of results is under a second while the second takes several seconds. The second point explains why using key is faster regardless of the values being compared.
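The once-per-element behaviour is easy to confirm with an instrumented key function (a Python 3 sketch):

```python
calls = []

def key(item):
    calls.append(item)  # record every invocation of the key function
    return item[0]

data = [(3, 'c'), (1, 'a'), (2, 'b'), (0, 'd')]
result = sorted(data, key=key)

assert len(calls) == len(data)  # key ran once per element, not per comparison
assert result[0] == (0, 'd')
```

The sort performs many more than four comparisons internally, but those compare the precomputed keys, so the (potentially expensive) key function never runs again.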
I have a program that is reading some data from an Excel spreadsheet (a small one: ~10 sheets with ~100 cells per sheet), doing some calculations, and then writing output to cells in a spreadsheet.
The program ran quickly until I modified it to write its output into the same Excel file the input is read from. Previously I was generating a new spreadsheet and then copying the output into the original file manually.
After the modifications the script's runtime jumped from a few seconds to about 7 minutes. I ran cProfile to investigate and got this output, sorted by cumulative runtime:
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.001 0.001 440.918 440.918 xlsx_transport_calc.py:1(<module>)
1 0.000 0.000 437.926 437.926 excel.py:76(load_workbook)
1 0.000 0.000 437.924 437.924 excel.py:161(_load_workbook)
9 0.000 0.000 437.911 48.657 worksheet.py:302(read_worksheet)
9 0.000 0.000 437.907 48.656 worksheet.py:296(fast_parse)
9 0.065 0.007 437.906 48.656 worksheet.py:61(parse)
9225 45.736 0.005 437.718 0.047 worksheet.py:150(parse_column_dimensions)
9292454 80.960 0.000 391.640 0.000 functools.py:105(wrapper)
9292437 62.181 0.000 116.213 0.000 cell.py:94(get_column_letter)
18585439 20.881 0.000 98.832 0.000 threading.py:214(__exit__)
18585443 58.912 0.000 86.641 0.000 threading.py:146(acquire)
18585443 56.600 0.000 77.951 0.000 threading.py:186(release)
9293461/9293452 22.317 0.000 22.319 0.000 {method 'join' of 'str' objects}
37170887 15.795 0.000 15.795 0.000 threading.py:63(_note)
21406059 13.460 0.000 13.460 0.000 {divmod}
37170888 12.853 0.000 12.853 0.000 {thread.get_ident}
18585447 12.589 0.000 12.589 0.000 {method 'acquire' of 'thread.lock' objects}
21408493 9.948 0.000 9.948 0.000 {chr}
21441151 8.323 0.000 8.323 0.000 {method 'append' of 'list' objects}
18585446 7.843 0.000 7.843 0.000 {method 'release' of 'thread.lock' objects}
...
...
...
Relevant code in the script:
...
from openpyxl import load_workbook
import pandas as pd
...
xlsx = 'path/to/spreadsheet.xlsx'
...
def loadxlsx(fname, sname, usecols=None):
    with pd.ExcelFile(fname) as ef:
        df = ef.parse(sheetname=sname)
        if usecols:
            return [df.values[:, col] for col in usecols]
        else:
            return [df.values[:, col] for col in range(df.shape[1])]
...
data = loadxlsx('path/to/spreadsheet.xlsx')
...
<do computations>
...
book = load_workbook(xlsx)
<write data back to spreadsheet>
...
So according to the cProfile output, the culprit appears to be something within the call to load_workbook. Beyond that observation I'm a bit confused. Why are there 9000 calls to parse_column_dimensions and 18 million calls to various threading functions? And 9 million calls to get_column_letter?
This is the first time I have profiled any python scripts so I'm not sure if this output is normal or not... It seems to have some odd portions though.
Can anyone shed some light on what might be happening here?
I don't know what's happening in the pandas code, but the number of calls is clearly excessive. If you simply open the file with openpyxl and modify the cells in place, it should be a lot faster, so it looks like there is some unnecessary looping going on.
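A minimal sketch of the suggested in-place round trip (hypothetical file name; assumes openpyxl is installed — the first save only exists to create a stand-in for the real spreadsheet):

```python
from openpyxl import Workbook, load_workbook

# Create a small workbook standing in for the real input file.
wb = Workbook()
ws = wb.active
ws['A1'] = 41
wb.save('example.xlsx')

# Open the same file, modify a cell in place, and save it back.
wb = load_workbook('example.xlsx')
ws = wb.active
ws['A1'] = ws['A1'].value + 1
wb.save('example.xlsx')
```

This avoids re-parsing the workbook through a second library, so reading and writing stay within one representation of the file.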
Hey guys so I have a 2D array that looks like this:
12381.000 63242.000 0.000 0.000 0.000 8.000 9.200 0.000 0.000
12401.000 8884.000 0.000 0.000 96.000 128.000 114.400 61.600 0.000
12606.000 74204.000 56.000 64.000 72.000 21.600 18.000 0.000 0.000
12606.000 105492.000 0.000 0.000 0.000 0.000 0.000 0.000 45.600
12606.000 112151.000 2.400 4.000 0.000 0.000 0.000 0.000 0.000
12606.000 121896.000 0.000 0.000 0.000 0.000 0.000 60.800 0.000
(Cut off couple of columns due to formatting)
So it indicates the employee ID and department ID, followed by the hours each employee worked in each of the 12 months. My 2D array is essentially a list of lists, where each row is a list in its own right. I am trying to convert each nonzero value to a one while keeping all the zeros. There are 857 rows and 14 columns. My code is as follows:
def convBin(A):
    """Nonzero values are converted into 1s and zero values are kept constant.
    """
    for i in range(len(A)):
        for j in range(len(A[i])):
            if A[i][j] > 0:
                A[i][j] == 1
            else:
                A[i][j] == 0
    return A
Can someone tell me what I am doing wrong?
You are doing an equality comparison, not an assignment, inside your loop:
A[i][j] == 1
should be
A[i][j] = 1
# ^ note only one equals sign
Also, there is no need to return A; it is being modified in place, so it is conventional to implicitly return None by removing the explicit return ... line.
You should bear in mind that:
You don't actually want to do anything in the else case; and
Iterating over range(len(...)) is not Pythonic - use e.g. enumerate.
Your function could therefore be simplified to:
def convBin(A):
    """Convert non-zero values in 2-D array A into 1s."""
    for row in A:
        for j, val in enumerate(row):
            if val:
                row[j] = 1
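Since the rows are numeric throughout, the same conversion can also be done without explicit loops; a NumPy sketch, with a few sample values borrowed from the question:

```python
import numpy as np

A = [[12381.0, 63242.0, 0.0, 8.0],
     [12401.0, 8884.0, 0.0, 0.0]]

# Elementwise comparison gives booleans; casting turns them into 0/1.
binary = (np.asarray(A) > 0).astype(int)
print(binary.tolist())  # → [[1, 1, 0, 1], [1, 1, 0, 0]]
```

For an 857×14 array this is both shorter and faster than the nested Python loops, though it produces a new array rather than modifying the list of lists in place.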