Intel Vtune cannot find python source file - python

This is an old problem as is demonstrated as in https://community.intel.com/t5/Analyzers/Unable-to-view-source-code-when-analyzing-results/td-p/1153210. I have tried all the listed methods, none of them works, and I cannot find any more solutions on the internet. Basically vtune cannot find the custom python source file no matter what is tried. I am using the most recently version as of speaking. Please let me whether there is a solution.
For example, if you run the following program.
def myfunc(*args):
# Do a lot of things.
if __name__ = '__main__':
# Do something and call myfunc
Call this script main.py. Now use the newest vtune version (I have using Ubuntu 18.04), run the vtune-gui and basic hotspot analysis. You will not found any information on this file. However, a huge pile of information on Python and its other codes are found (related to your python environment). In theory, you should be able to find the source of main.py as well as cost on each line in that script. However, that is simply not happening.
Desired behavior: I would really like to find the source file and function in the top-down manual (or any really). Any advice is welcome.

VTune offer full support for profiling python code and the tool should be able to display the source code in your python file as you expected. Could you please check if the function you are expecting to see in the VTune results, ran long enough?
Just to confirm that everything is working fine, I wrote a matrix multiplication code as shown below (don't worry about the accuracy of the code itself):
def matrix_mul(X, Y):
result_matrix = [ [ 1 for i in range(len(X)) ] for j in range(len(Y[0])) ]
# iterate through rows of X
for i in range(len(X)):
# iterate through columns of Y
for j in range(len(Y[0])):
# iterate through rows of Y
for k in range(len(Y)):
result_matrix[i][j] += X[i][k] * Y[k][j]
return result_matrix
Then I called this function (matrix_mul) on my Ubuntu machine with large enough matrices so that the overall execution time was in the order of few seconds.
I used the below command to start profiling (you can also see the VTune version I used):
/opt/intel/oneapi/vtune/2021.1.1/bin64/vtune -collect hotspots -knob enable-stack-collection=true -data-limit=500 -ring-buffer=10 -app-working-dir /usr/bin -- python3 /home/johnypau/MyIntel/temp/Python_matrix_mul/mat_mul_method.py
Now open the VTune results in the GUI and under the bottom-up tab, order by "Module / Function / Call-stack" (or whatever preferred grouping is).
You should be able to see the the module (mat_mul_method.py in my case) and the function "matrix_mul". If you double click, VTune should be able to load the sources too.

Related

Why does a call to torch.tensor inside of apply_async fail to complete (seems to block execution)?

I'm trying to understand why the following simple example doesn't successfully complete execution and seems to get stuck on the first line of really_simple_func (on Ubuntu machines, but not Windows). The code is:
import torch as t
import numpy as np
import multiprocessing as mp # I've tried both multiprocessing
# import torch.multiprocessing as mp # and torch.multiprocessing
def really_simple_func():
temp_val_2 = t.tensor(np.zeros(425447)[0:400000]) # this is the line that blocks.
return 4.3
if __name__ == "__main__":
print("Run brief starting")
some_zeros = np.zeros(425447)
temp_val = t.tensor(some_zeros[0:400000]) # DELETE THIS LINE TO MAKE IT WORK
pool = mp.Pool(processes=1)
job = pool.apply_async(really_simple_func)
print("just before job.get()")
result = job.get()
print("Run brief completed. Reward = {}".format(result))
I have torch 1.11.0 installed, numpy 1.22.3 and have tried both CPU and GPU versions of Torch. When I run this code on two different Ubuntu machines, I get the following output:
Run brief starting
just before job.get()
However, the code never successfully completes (doesn't print the "Run brief completed" line). (It does complete on a third Windows box).
On the Ubuntu machines, if I delete the line with the comment "#DELETE THIS LINE TO MAKE IT WORK" the execution DOES complete, printing the final line as expected. Similarly, if I leave the line defining temp_val in but delete the line with the comment "This is the line that blocks" it will also complete. Moreover, if I reduce the size of the temp_val tensor (say from 400000 to 4000) it will also complete successfully. Finally, it is worth noting that while I can reproduce this behaviour on two different Ubuntu machines, this code does actually complete on my Windows machine - though, as far as I can tell, the versions of key packages, such as torch, are the same.
I don't understand this behaviour. I suspect it is something to do with the way torch allocates memory or stores information. I've tried calling del temp_val to free up memory, but that doesn't seem to fix things. It seems to me that the async call to t.tensor within really_simple_func is stopped from completing if there has already been a call to t.tensor in the main code block, creating a sufficiently large tensor.
I don't understand why this is happening, or even if that is the correct explanation. In any case, what would be best practice if I do need to do some tensor processing within apply_async as well as in the main thread? More generally, what is Torch waiting on when I make a call to t.tensor?
(Obviously, this is just the simplest version of the real code I'm trying to get to work that reproduced this issue. I realise that calling mp.Pool with only one process doesn't really make sense...nor, indeed, does using apply_async to call a function that returns a constant!)
Unfortunately, I cannot provide any answers to your questions.
I can, however, share experiences with seemingly the same issue. I use a Linux machine with torch 1.8.1 and numpy 1.19.2.
When I run the following code on my machine:
with Pool(max_pool) as p:
pool_outputs = list(
tqdm(
p.imap(lambda f: get_model_results_per_query_file(get_preds, tokenizer, f), query_files),
total=len(query_files)
)
)
For which the function get_model_results_per_query_file contains operations similar to the following:
feats = features.unsqueeze(0).repeat(batch_size, 1, 1).to(device)
(features is a torch tensor)
The first round of jobs automatically fail, and new ones are immediately started (that do not fail for some reason). The whole process never completes though, since the main process still seems to be waiting for the first failed jobs.
If I remove the lines in my code involving the repeat function, no jobs fail.
I managed to solve my issue and preserve the same results by adapting a similar solution to yours:
feats = torch.as_tensor(np.tile(features, (batch_size, 1, 1))).to(device)
I believe as_tensor works in a similar fashion to from_numpy in this case.
I only managed to find this solution thanks to your post and your proposed workaround, so thank you!
After some further exploration, here is a brief answer to my own question.
While I still don't fully understand the blocking behaviour (and would welcome any further explanation), I have just seen that the way I'm generating torch tensors from a numpy array is not correct.
In particular, instead of using torch.tensor(temp_val) where temp_val is a numpy array, I should be using torch.from_numpy(temp_val). Doing this fixes the problem.
Alternatively, I can convert temp_val into a list and then create the tensor via torch.tensor(temp_val_as_list) - which also avoids the issue.

How to execute python function as whole in VSCode (it splits and sends just the first line to an interpreter)

I'm getting used to VSCode in my daily Data Science remote workflow due to LiveShare feature.
So, upon executing functions it just executes the first line of code; if I mark the whole region then it does work, but it's cumbersome way of dealing with the issue.
I tried number of extensions, but none of them seem to solve the problem.
def gini_normalized(test, pred):
"""Simple normalized Gini based on Scikit-Learn's roc_auc_score"""
gini = lambda a, p: 2 * roc_auc_score(a, p) - 1
return gini(test, pred)
Executing the beginning of the function results in error:
def gini_normalized(test, pred):...
File "", line 1
def gini_normalized(test, pred):
^
SyntaxError: unexpected EOF while parsing
There's a solution for PyCharm: Python Smart Execute - https://plugins.jetbrains.com/plugin/11945-python-smart-execute. Also Atom's Hydrogen doesn't have such issue either.
Any ideas regarding VSCode?
Thanks!
I'm a developer on the VSCode DataScience features. Just to make sure that I'm understanding correctly. You would like the shift-enter command to send the entire function to the Interactive Window if you run it on the definition of the function?
If so, then yes, we don't currently support that. Shift-enter can run line by line or run a section of code that you manually highlight. If you want, you can use #%% lines in your code to put functions into code cells. Then when you are in a cell shift-enter will run that entire cell, might be the best current approach for you.
That smart execute does look interesting, if you would like to file that as a suggestion you can use our GitHub here to get it on our backlog to look at.
https://github.com/Microsoft/vscode-python
Hi you could click the symbol before each line and turn it into > (the indented codes of the function was hidden now). Then if you select the whole line and the next line, shift+enter could run them together.
enter image description here

object was probably modified after being freed with Python on Mac

I have the following code in a function:
phi = fourier_matrix(y, fs)
N = np.size(phi,axis=1)
x = np.ones(N)
for i in range(136):
W_pk = np.diag(x)
temp = pinv(np.dot(phi, W_pk))
q_k = np.dot(temp, y)
x = np.dot(W_pk, q_k)
where phi is (96,750), W_pk is (750,750) and q_k is (750,).
This throws the following error:
Python(12001,0x7fff99d8b380) malloc: * error for object 0x7fc71aa37200: incorrect checksum for freed object - object was probably modified after being freed.
* set a breakpoint in malloc_error_break to debug
If I comment the last dot product the error does not appear.
I think I need to free memory in some way or maybe do the dot product in a different way?
Also, this only happens when I run it from a mac. On windows or linux it does not throw the error.
Python is 3.6 (tried with 3.7), and numpy is 1.14.5, also tried with 1.15
Any help would be greatly appreciated, since I really need to make this work!
Thanks in advance.
EDIT I:
I tried this portion of the code on a jupyter notebook, and it didn't fail. This confused me even more! It fails when I run it in Visual Studio Code on a mac. The rest of my code, an algorithm to remove artifacts from a signal, works as it should until I add that last piece of code x = np.dot(W_pk, q_k). Maybe it works on jupyter because I don't run the rest of the algorithm there? but as I said, it only crashes on that last dot product.
EDIT II: I added the piece of code above the for loop to this question, because I found that the problem is somehow related to how x is being used. You see it's declared above as a float64 ndarray. When it reaches the last line of the for loop, the dot product returns a complex128 (should be complex64, don't know what's happening there) and overwrites the x array. The first time works, but the second time it crashes when trying to overwrite. If I use a new variable for the dot product, say z, then it does not crash! not sure why... but I need to overwrite x in each iteration.
Furthermore, if I do something like this:
z = np.dot(W_pk, q_k)
x = abs(z) #I don't need complex numbers at this point
Then it crashes with the same error on the first dot product (presumably):
temp = pinv(np.dot(phi, W_pk))
Also, the memory consumption is not that bad, around 110M according some measurements, and the same algorithm does not crash on iPython with twice the memory usage. This is what I find the most obscure, why doesn't it crash on iPython??

How can I change baselines code output/replay (PPO) on github?

I am trying to run my own version of baselines code source of reinforcement learning on github: (https://github.com/openai/baselines/tree/master/baselines/ppo2).
Whatever I do, I keep having the same display which looks like this :
Where can I edit it ? I know I should edit the "learn" method but I don't know how
Those prints are the result of the following block of code, which can be found at this link (for the latest revision at the time of writing this at least):
if update % log_interval == 0 or update == 1:
ev = explained_variance(values, returns)
logger.logkv("serial_timesteps", update*nsteps)
logger.logkv("nupdates", update)
logger.logkv("total_timesteps", update*nbatch)
logger.logkv("fps", fps)
logger.logkv("explained_variance", float(ev))
logger.logkv('eprewmean', safemean([epinfo['r'] for epinfo in epinfobuf]))
logger.logkv('eplenmean', safemean([epinfo['l'] for epinfo in epinfobuf]))
logger.logkv('time_elapsed', tnow - tfirststart)
for (lossval, lossname) in zip(lossvals, model.loss_names):
logger.logkv(lossname, lossval)
logger.dumpkvs()
If your goal is to still print some things here, but different things (or the same things in a different format) your only option really is to modify this source file (or copy the code you need into a new file and apply your changes there, if allowed by the code's license).
If your goal is just to suppress these messages, the easiest way to do so would probably be by running the following code before running this learn() function:
from baselines import logger
logger.set_level(logger.DISABLED)
That's using this function to disable the baselines logger. It might also disable other baselines-related output though.

How to debug Python memory fault?

Edit: Really appreciate help in finding bug - but since it might prove hard to find/reproduce, any general debug help would be greatly appreciated too! Help me help myself! =)
Edit 2: Narrowing it down, commenting out code.
Edit 3: Seems lxml might not be the culprit, thanks! The full script is here. I need to go over it looking for references. What do they look like?
Edit 4: Actually, the scripts stops (goes 100%) in this, the parse_og part of it. So edit 3 is false - it must be lxml somehow.
Edit 5 MAJOR EDIT: As suggested by David Robinson and TankorSmash below, I've found a type of data content that will send lxml.etree.HTML( data ) in a wild loop. (I carelessly disregarded it, but find my sins redeemed as I've paid a price to the tune of an extra two days of debug! ;) A working crashing script is here. (Also opened a new question.)
Edit 6: Turns out this is a bug with lxml version 2.7.8 and below (at
least). Updated to lxml 2.9.0, and bug is gone. Thanks also to the fine folks over at this follow-up question.
I don't know how to debug this weird problem I'm having.
The below code runs fine for about five minutes, when the RAM is suddenly completely filled up (from 200MB to 1700MB during the 100% period - then when memory is full, it goes into blue wait state).
It's due to the code below, specifically the first two lines. That's for sure. But what is going on? What could possibly explain this behaviour?
def parse_og(self, data):
""" lxml parsing to the bone! """
try:
tree = etree.HTML( data ) # << break occurs on this line >>
m = tree.xpath("//meta[#property]")
#for i in m:
# y = i.attrib['property']
# x = i.attrib['content']
# # self.rj[y] = x # commented out in this example because code fails anyway
tree = ''
m = ''
x = ''
y = ''
i = ''
del tree
del m
del x
del y
del i
except Exception:
print 'lxml error: ', sys.exc_info()[1:3]
print len(data)
pass
You can try Low-level Python debugging with GDB. Probably there is a bug in Python interpreter or in lxml library and it is hard to find it without extra tools.
You can interrupt your script running under gdb when CPU usage goes to 100% and look at stack trace. It will probably help to understand what's going on inside script.
it must be due to some references which keep the documents alive. one must always be careful with string results from xpath evaluation. I see you have assigned None to tree and m but not to y,x and i .
Can you also assign None to y,x and i .
Tools are also helpful when trying to track down memory problems. I've found guppy to be a very useful Python memory profiling and exploration tool.
It is not the easiest to get started with due to a lack of good tutorials / documentation, but once you get to grips with it you will find it very useful. Features I make use of:
Remote memory profiling (via sockets)
Basic GUI for charting usage, optionally showing live data
Powerful, and consistent, interfaces for exploring data usage in a Python shell

Categories