Clear all variables which were defined in a Jupyter cell after execution finished - python

I need to import and manipulate memory-heavy data in Jupyter.
Because I tend to have rather long notebooks where several data sets will be imported, I need to clear them continuously by hand.
This is tedious.
If possible, I would like to have a tool which clears all variables introduced in a cell, and only those, without the need of addressing them by hand after they have fulfilled their purpose.
I could of course overwrite variables, but as they all serve rather different purposes this would drastically reduce the readability of the code.
To summarize:
Cell 1:
variable overhead  # this will be used in the entire notebook
Cell 2:
import or generate data & manipulate data
clear all variables introduced in the cell without the need of addressing
every single one of them by hand <-- this is what I am looking for.
Thank you very much!

You can reset variables in the Jupyter Notebook by putting the following magic command in the code:
%reset_selective -f [var1, var2, var3]
If you add such lines in your code it should remain readable.
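Note that the argument is interpreted as a regular expression, not a literal list of names. A minimal sketch (the variable names here are just placeholders):
a_data = [1] * 10
b_data = [2] * 10
keep_me = 42

# the pattern is a regex: this removes every name ending in "_data"
# without asking for confirmation, and leaves keep_me alone
%reset_selective -f _data$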
To answer your question completely - At the moment I don't think there exists a command that would automatically find all variables created in a specific cell and reset only them. (Someone please correct me if I am wrong.)
But you can use the following code that deletes exactly those namespace objects which were newly created in a cell. It is probably what you wanted:
from IPython import get_ipython

my_variables = set(dir())  # write this line at the beginning of the cell

# ... the actual content of the cell goes here ...

# Write these lines at the end of the cell. The pattern is anchored
# (^name$) so only exact matches are reset, not every variable whose
# name merely contains the same characters.
my_variables = set(dir()) - my_variables
get_ipython().magic('reset_selective -f ^({})$'.format('|'.join(my_variables)))
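If you use this pattern a lot, the bookkeeping can be wrapped in a pair of helpers. This is only a sketch (snapshot and reset_new are names I made up); it operates directly on IPython's interactive namespace dict, user_ns:
from IPython import get_ipython

def snapshot():
    """Record the names currently defined in the interactive namespace."""
    return set(get_ipython().user_ns)

def reset_new(before):
    """Delete every name created since snapshot() was taken."""
    ip = get_ipython()
    for name in set(ip.user_ns) - before:
        if not name.startswith('_'):  # leave IPython's bookkeeping names (_, _i1, ...) alone
            ip.user_ns.pop(name, None)
A cell then becomes: before = snapshot() at the top, the real work in the middle, and reset_new(before) at the end.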

It's not a clean solution, but since the cells whose data I need to keep are not computationally expensive, I found it most convenient to simply do:
%reset -f
exec('\n'.join(In[n:m]))  # re-run input cells n..m-1; In holds the input history

Jupyter notebook: how to leave one cell out while 'run all'

I'm writing Python using Jupyter notebook and I have two cells that can influence each other.
I'm wondering whether it is possible to leave certain cells out after I click Restart & Run All, so that I can test the two cells independently.
One option, based on Davide Fiocco's answer to this post (which I just tested), is to include the %%script magic command in each cell you don't want to execute.
For example
%%script false --no-raise-error
for i in range(100000000000000):
    print(i)
If you put those two cells at the end of the page, you can run all cells above a certain cell with a single click.
That or you can put a triple-quote at the beginning and end of the two cells, then un-quote the cells to test them.
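The triple-quote trick works because the whole cell becomes a single string expression, so nothing in it executes; for example (expensive_computation is a placeholder):
"""
expensive_computation()  # skipped: the cell is now just a string literal
"""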
One option is to create a parameter and run the cells conditionally:
x = 1

# cell 1
if x == 1:
    ...  # run this cell's code

# cell 2
if x != 1:
    ...  # run the other cell's code

In this example, you will skip cell 2.
I recently discovered an easy way to do this.
You may have noticed that cells can be set as type Code or Markdown - this lets you prepare the notebook with headers and explanatory text (in Markdown), but also sections of executable code (the default). This can be set from a drop-down already on the screen if you are using JupyterLab. In Jupyter Notebook I think it's under the Cell menu.
You can also use keyboard shortcuts (first hit Escape if needed to get out of text-entry mode): Y for Code, M for Markdown, or R for Raw.
Wait, what's that about Raw? It appears to just take away the code highlighting and make the cell non-executable. So Esc+R to make it Raw, execute like you wanted to, then Esc+Y if you want to re-enable that block.
Alternative: if you want a quicker way to comment out all the lines but leave it as a Code cell, make sure you are in edit mode (click on the cell contents), press Ctrl+A (select all), then Ctrl+/ ("comment this line"). I tested with Python and it inserts # at the beginning of each selected line.

I can't delete cases from .sav files using SPSS with Python

I have some .sav files that I want to check for bad data. What I mean by bad data is irrelevant to the problem. I have written a script in Python using the spss module to check the cases and then delete them if they are bad. I do that within a data step by defining a dataset object and then getting its case list. I then use
del datasetObj.cases[k]
to delete the problematic cases within the datastep.
Here is my problem:
Say I have a data set foo.sav and it is the active data set in SPSS; then I can run something like:
BEGIN PROGRAM PYTHON.
import spss
spss.StartDataStep()
datasetObj = spss.Dataset()
caselist = datasetObj.cases
del caselist[k]
spss.EndDataStep()
END PROGRAM.
from within the SPSS client, and it will delete case k from the data set foo.sav. But if I run something like the following, using the directory of foo.sav as the working directory:
import os, spss
pathname = os.getcwd()  # note: os.curdir is just the string '.', not a function
foopathname = os.path.join(pathname, 'foo.sav')
spss.Submit("""
GET FILE='%(foopathname)s'.
DATASET NAME file1.
DATASET ACTIVATE file1.
""" %locals())
spss.StartDataStep()
datasetObj = spss.Dataset()
caselist = datasetObj.cases
del caselist[k]
spss.EndDataStep()
from the command line, then it doesn't delete case k. Similar code which gets values will work fine. E.g.,
print caselist[k]
will print case k (when it is in the data step). I can even change the values for the various entries of a case. But it will not delete cases. Any ideas?
I am new to Python and SPSS, so there may be something that I am not seeing which is obvious to others; hence I am asking the question.
Your first piece of code did not work for me. I adjusted it as follows to get it working:
BEGIN PROGRAM PYTHON.
import spss
spss.StartDataStep()
datasetObj = spss.Dataset()
del datasetObj.cases[k]
spss.EndDataStep()
END PROGRAM.
Notice that, in your code, caselist is just a list, containing values taken from the datasetObj in SPSS. The attribute .cases belongs to datasetObj.
With spss.Submit, you can also delete cases (or actually, not select them) using the SPSS command SELECT IF. For example, if your file has a variable (column) named age, with values ranging from 0 to 100, you can delete all cases with an age lower than (in SPSS: lt or <) 25 using:
BEGIN PROGRAM PYTHON.
import spss
spss.Submit("""
SELECT IF age lt 25.
""")
END PROGRAM.
Don't forget to add some code to save the edited file.
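For example, something like this (the output path is a placeholder; saving under a new name keeps the original file intact):
import spss
spss.Submit("""
SELECT IF age lt 25.
SAVE OUTFILE='/path/to/foo_edited.sav'.
""")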
caselist is not actually a regular list containing the dataset values. Although its interface is the list interface, it actually works directly with the dataset, so it does not contain a list of values. It just accesses operations on the SPSS side to retrieve, change, or delete values. The most important difference is that since Statistics is not keeping the data in memory, the size of the caselist is not limited by memory.
However, if you are trying to iterate over the cases with a loop using
range(spss.GetCaseCount())
and deleting some, the loop will eventually fail, because the actual case count reflects the deletions, but the loop limit doesn't reflect that. And datasetObj.cases[k] might not be the case you expect if an earlier case has been deleted. So you need to keep track of the deletions and adjust the limit or the k value appropriately.
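One common way around the shifting indices is to iterate backwards, so a deletion never affects the positions still to be visited. A sketch, assuming len() works on the case list as its list interface suggests (is_bad is a hypothetical stand-in for your bad-data test):
import spss

spss.StartDataStep()
datasetObj = spss.Dataset()
caselist = datasetObj.cases
# walk from the last case to the first, so deleting case k never
# shifts the index of a case not yet visited
for k in range(len(caselist) - 1, -1, -1):
    if is_bad(caselist[k]):  # is_bad is your own check
        del caselist[k]
spss.EndDataStep()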
HTH

How do I assign the result of IPython profiler %%prun -r to a variable?

In the docs of the IPython magic functions it says:
Usage, in cell mode:
%%prun [options] [statement] code... code...
In cell mode, the additional code lines are appended to the (possibly
empty) statement in the first line. Cell mode allows you to easily
profile multiline blocks without having to put them in a separate
function.
Options:
-r return the pstats.Stats object generated by the profiling. This object has all the information about the profile in it, and you can
later use it for further analysis or in other functions.
But it doesn't give any examples of how to use the -r option. How do I assign the pstats.Stats object to a variable when using the cell profiler?
edit:
This is not a duplicate because I specifically ask about cell mode; the other questions are about line magic functions. Thomas K answers my question by saying it is not possible. That should be allowed as an answer to my question here, which is not an answer to the other questions.
Unfortunately there is not a way to capture a returned value from a cell magic. With a line magic you can do:
a = %prun -r ...
But cell magics have to start at the beginning of the cell, with nothing before them.
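As a workaround for cell mode, you can wrap the multi-line body in a function and use the line magic form instead. A sketch (work is just a placeholder):
def work():
    total = 0
    for i in range(10 ** 6):
        total += i
    return total

# -r returns the pstats.Stats object; -q suppresses the pager output
stats = %prun -r -q work()
stats.sort_stats('cumulative').print_stats(5)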

IPython and OS X terminal output is line wrapping before column limit

I'm using IPython with the pandas module, which provides the DataFrame object. When I run some code, the DataFrame output wraps before the width of my terminal, even though the terminal width should accommodate the length. This issue seems to be isolated to the pandas Series and DataFrame objects and not, say, a long list.
Running pip uninstall readline and then reinstalling readline through easy_install and restarting IPython did not solve the problem.
It would be helpful to see my data not broken up like that, but I honestly don't know where to begin to fix this. Any insight?
I found a workaround that makes the console output more readable. Calling to_string() on the DataFrame object returns a string representation of the object, skirting around whatever inherent formatting the DataFrame contains, which is fine since the goal is readability.
from pandas import DataFrame

data = DataFrame(some_long_list)  # some_long_list is your data
print data.to_string()  # prints at the console's full width
EDIT:
From pandas docs: "New since 0.10.0, wide DataFrames will now be printed across multiple rows by default". I'm seeing that this helps as a default so that instead of cramming rows onto the next line, you'll see separation by column. There are two additional methods to configure output width:
import pandas as pd
pd.set_option('line_width', 40) # default is 80
or to turn off the wrap feature completely:
pd.set_option('expand_frame_repr', False)
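Note that in more recent pandas versions these option names moved under the display namespace; if the calls above raise an error, the equivalents are:
import pandas as pd

pd.set_option('display.width', 120)                # characters per line
pd.set_option('display.expand_frame_repr', False)  # disable wrapping entirely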

Python sas7bdat module usage

I have to dump data from SAS datasets. I found a Python module called sas7bdat.py that says it can read SAS .sas7bdat datasets, and I think it would be simpler and more straightforward to do the project in Python rather than SAS due to the other functionality required. However, the help(sas7bdat) in interactive Python is not very useful and the only example I was able to find to dump a dataset is as follows:
import sas7bdat
from sas7bdat import *
# following line is sas dataset to convert
foo = SAS7BDAT('/support/sas/locked_data.sas7bdat')
#following line is txt file to create
foo.convertFile('/support/textfiles/locked_data.txt','\t')
This doesn't do what I want because a) it uses the SAS variable names as column headers and I need it to use the variable labels, and b) it uses "nan" to denote missing numeric values where I'd rather just leave the value blank.
Can anyone point me to some useful documentation on the methods included in sas7bdat.py? I've Googled every permutation of key words that I could think of, with no luck. If not, can someone give me an example or two of using readColumnAttributes(), readColumnLabels(), and/or readColumnNames()?
Thanks, all.
As time passes, solutions become easier. I think this one is easiest if you want to work with pandas:
import pandas as pd
df = pd.read_sas('/support/sas/locked_data.sas7bdat')
Note that it is easy to get a NumPy array by using df.values.
This is only a partial answer as I've found no [easy to read] concrete documentation.
You can view the source code here
This shows some basic info regarding what arguments the methods require, such as:
readColumnAttributes(self, colattr)
readColumnLabels(self, collabs, coltext, colcount)
readColumnNames(self, colname, coltext)
I think most of what you are after is stored in the "header" class returned when creating an object with SAS7BDAT. If you just print that class you'll get a lot of info, but you can also access its attributes individually. I think most of what you may be looking for would be under foo.header.cols. I suspect you use various header attributes as parameters for the methods you mention.
Maybe something like this will get you closer?
from sas7bdat import SAS7BDAT

foo = SAS7BDAT(inFile)  # your file here...
for i in foo.header.cols:
    print '"Attributes"', i.attr
    print '"Labels"', i.label
    print '"Name"', i.name
edit: Unrelated to this specific question, but the type() and dir() commands come in handy when trying to figure out what is going on in an unfamiliar class/library.
I know I'm late with the answer, but in case someone searches for a similar question, the best option is:
from sas7bdat import SAS7BDAT

foo = SAS7BDAT('/support/sas/locked_data.sas7bdat')
# this converts the file to a pandas DataFrame:
ds = foo.to_data_frame()
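From there you can also address the two complaints in the question; a sketch, assuming the labels are reachable through foo.header.cols as shown in the other answer:
# use the variable labels instead of the variable names as headers
ds.columns = [col.label for col in foo.header.cols]
# write blanks instead of "nan" for missing values
ds.to_csv('/support/textfiles/locked_data.txt', sep='\t', index=False, na_rep='')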
Personally, I think the better approach would be to export the data using SAS and then process the external file as needed using Python.
In SAS, you can do this...
libname datalib "/support/sas";
filename sasdump "/support/textfiles/locked_data.txt";
proc export
data = datalib.locked_data
outfile = sasdump
dbms = tab
label
replace;
run;
The downside to this is that while the column labels are used rather than the variable names, the labels are enclosed in double quotes. When processing in Python, you may need to programmatically remove them if they cause a problem. I hope that helps even though it doesn't use Python like you wanted.
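For what it's worth, stripping those quotes in Python is only a couple of lines; a sketch (the path matches the filename statement above):
# read the tab-delimited export and strip the quotes around the labels
with open('/support/textfiles/locked_data.txt') as f:
    header = [label.strip('"') for label in f.readline().rstrip('\n').split('\t')]
    rows = [line.rstrip('\n').split('\t') for line in f]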
