I can't delete cases from .sav files using SPSS with Python

I have some .sav files that I want to check for bad data. What I mean by bad data is irrelevant to the problem. I have written a script in Python using the spss module to check the cases and then delete them if they are bad. I do that within a data step by defining a dataset object and then getting its case list. I then use
del datasetObj.cases[k]
to delete the problematic cases within the data step.
Here is my problem:
Say I have a data set foo.sav that is the active data set in SPSS. Then I can run something like:
BEGIN PROGRAM PYTHON.
import spss
spss.StartDataStep()
datasetObj = spss.Dataset()
caselist = datasetObj.cases
del caselist[k]
spss.EndDataStep()
END PROGRAM.
from within the SPSS client, and it will delete case k from the data set foo.sav. But if I run something like the following, using the directory of foo.sav as the working directory:
import os, spss
pathname = os.getcwd()
foopathname = os.path.join(pathname, 'foo.sav')
spss.Submit("""
GET FILE='%(foopathname)s'.
DATASET NAME file1.
DATASET ACTIVATE file1.
""" %locals())
spss.StartDataStep()
datasetObj = spss.Dataset()
caselist = datasetObj.cases
del caselist[3]
spss.EndDataStep()
from the command line, it does not delete case k. Similar code that gets values works fine, e.g.,
print caselist[3]
will print case k (when run inside the data step). I can even change the values of the various entries of a case, but it will not delete cases. Any ideas?
I am new to Python and SPSS, so there may be something obvious to others that I am not seeing; hence the question.

Your first piece of code did not work for me. I adjusted it as follows to get it working:
BEGIN PROGRAM PYTHON.
import spss
spss.StartDataStep()
datasetObj = spss.Dataset()
del datasetObj.cases[k]
spss.EndDataStep()
END PROGRAM.
Notice that, in your code, caselist is just a list, containing values taken from the datasetObj in SPSS. The attribute .cases belongs to datasetObj.
With spss.Submit, you can also delete cases (or actually, not select them) using the SPSS command SELECT IF. For example, if your file has a variable (column) named age, with values ranging from 0 to 100, you can delete all cases with an age lower than (in SPSS: lt or <) 25 using:
BEGIN PROGRAM PYTHON.
import spss
spss.Submit("""
SELECT IF age lt 25.
""")
END PROGRAM.
Don't forget to add some code to save the edited file.
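For example, a minimal sketch (the output file name here is only an illustration):
BEGIN PROGRAM PYTHON.
import spss
spss.Submit("""
SAVE OUTFILE='foo_edited.sav'.
""")
END PROGRAM.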

caselist is not actually a regular list containing the dataset values. Although its interface is the list interface, it actually works directly with the dataset, so it does not contain a list of values. It just accesses operations on the SPSS side to retrieve, change, or delete values. The most important difference is that since Statistics is not keeping the data in memory, the size of the caselist is not limited by memory.
However, if you are trying to iterate over the cases with a loop using
range(spss.GetCaseCount())
and deleting some, the loop will eventually fail, because the actual case count shrinks with each deletion while the loop limit does not. Also, datasetObj.cases[k] might not be the case you expect if an earlier case has been deleted. So you need to keep track of the deletions and adjust the limit or the k value accordingly; one way, sketched below, is to loop backwards so a deletion never shifts the indices of the cases you have not yet checked.
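A minimal sketch of that approach, assuming a placeholder is_bad() predicate for whatever marks a case as bad:
import spss

spss.StartDataStep()
datasetObj = spss.Dataset()
caselist = datasetObj.cases
# Walk from the last case to the first so deleting a case never shifts
# the indices of the cases that remain to be checked.
for k in range(len(caselist) - 1, -1, -1):
    if is_bad(caselist[k]):  # is_bad() is a hypothetical check, not part of the spss module
        del caselist[k]
spss.EndDataStep()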
HTH

Abaqus Python command to insert a keyword into the .inp file

The whole objective behind this question stems from trying to multi-process the creation of linear constraint equations (http://abaqus.software.polimi.it/v6.14/books/usb/default.htm?startat=pt08ch35s02aus129.html#usb-cni-pequation) in Abaqus/CAE for applying periodic boundary conditions to a meshed model. Since my model has over a million elements and I need to perform a Monte Carlo simulation of 1000 such models, I would like to parallelize the procedure, but I haven't found a solution due to the licensing and multi-threading restrictions associated with Abaqus/CAE. Some discussion of this is here: Python multiprocessing from Abaqus/CAE
I am currently trying to define the equations outside of Abaqus/CAE using the node sets already created, since I know the syntax of the *Equation keyword in the input file.
** Constraint: <name>
*Equation
<number of terms>
<set1>, <dof>, <coefficient1>.
<set2>, <dof>, <coefficient2>.
<set3>, <dof>, <coefficient3>.
e.g.
** Constraint: Corner_c1_Constraint-1-pair1
*Equation
3
All-1.c1_Node-1, 1, 1.
All-1.c5_Node-1, 1, -1.
RefPoint-3.SetRefPoint3, 1, -1.
Instead of directly writing these lines into the .inp file, I can also write these commands as a separate file and link it to the .inp file of the model using
*EQUATION, INPUT=file_name
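For instance, a minimal sketch of writing such a file from Python (assuming the external file simply holds the data lines that would otherwise follow the keyword; the sets are taken from the example above, the file name is just a placeholder):
# Hypothetical helper: dump the data lines for one *Equation constraint into a
# text file that *EQUATION, INPUT=<file_name> can point to.
terms = [("All-1.c1_Node-1", 1, 1.0),
         ("All-1.c5_Node-1", 1, -1.0),
         ("RefPoint-3.SetRefPoint3", 1, -1.0)]

with open("corner_c1_pair1_equations.inp", "w") as f:
    f.write("{0}\n".format(len(terms)))           # number of terms
    for set_name, dof, coeff in terms:
        f.write("{0}, {1}, {2}\n".format(set_name, dof, coeff))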
I am looking for the Abaqus Python command to write a keyword such as above to the .inp file instead of specifying the Equation constraints themselves.
The User's Guide linked above describes specifying this via the GUI, but I haven't been able to do that in my version, Abaqus/CAE 2018:
Abaqus/CAE Usage:
Interaction module: Create Constraint: Equation: click mouse button 3 while holding the cursor over the data table, and select Read from File.
So I am looking for a command from the Scripting Reference Manual to do this instead. There are commands to parse input files (http://abaqus.software.polimi.it/v6.14/books/ker/pt01ch24.html), but I haven't found one that writes keywords directly into the input file. I know I can hard-code this into the input file, but the sheer number of simulations I would like to perform calls for every bit of automation possible. I have already tried optimising the code using appropriate algorithms and numpy arrays, yet the pre-processing alone takes hours for a single model.
p.s. This is my first post on SO - so I'm not sure if this question is phrased in the appropriate format. Would appreciate any answers to the actual question or any other solutions to the intended outcome of parallelizing the pre-processing steps in Abaqus/CAE.
You are looking for the KeywordBlock object. This allows you to write anything to the input file (keywords, comments, etc) as a formatted string. I believe this approach is much less error-prone than alternatives that require you to programmatically write an input file (or update an existing one) on your own.
The KeywordBlock object is created when a Part Instance is created at the Assembly level. It stores everything you do in the CAE with the corresponding keywords.
Note that any changes you make to the KeywordBlock object will be synced to the Job input file when it is written, but will not update the CAE Model database. So for example, if you use the KeywordBlock to store keywords for constraint equations, the constraint definitions will not show up in the CAE model tree. The rest of the Mdb does not know they exist.
As you know, keywords must be written to the appropriate section of an input file. For example, the *equation keyword may be defined at the Part, Part Instance, or Assembly level (see the Keywords Reference Manual). This must also be considered when you store keywords in the KeywordBlock object (it will not auto-magically put your keywords in the right spot, unfortunately!). A side effect is that it is not always safe to write to the KeywordBlock - that is, whenever it may conflict with subsequent changes made via the CAE GUI. I believe that the Abaqus docs recommend that you add your keywords as a last step.
So basically, read the KeywordBlock object, find the right place, and insert your new keyword. Here's an example snippet to get you started:
partname = "Part-1"
model = mdb.models["Model-1"]
modelkwb = model.keywordBlock
assembly = model.rootAssembly
if assembly.isOutOfDate:
assembly.regenerate()
# Synch edits to modelkwb with those made in the model. We don't need
# access to *nodes and *elements as they would appear in the inp file,
# so set the storeNodesAndElements arg to False.
modelkwb.synchVersions(storeNodesAndElements=False)
# Search the modelkwb for the desired insertion point. In this example,
# we are looking for a line that indicates we are beginning the Part-Level
# block for the specific Part we are interested in. If it is found, we
# break the loop, storing the line number, and then write our keywords
# using the insert method (which actually inserts just below the specified
# line number, fyi).
line_num = 0
for n, line in enumerate(modelkwb.sieBlocks):
if line.replace(" ","").lower() == "*part,name={0}".format(partname.lower()):
line_num = n
break
if line_num:
kwds = "your keyword as a string here (may be multiple lines)..."
modelkwb.insert(position=line_num, text=kwds)
else:
e = ("Error: Part '{}' was not found".format(partname),
"in the Model KeywordBlock.")
raise Exception(" ".join(e))
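As a brief follow-up sketch (the job and model names below are placeholders): the inserted keywords only reach an .inp file once a job input file is written, for example:
job = mdb.Job(name="Job-1", model="Model-1")
job.writeInput()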

How can I efficiently replace all instances of multiple regex patterns in a dataframe of strings in PySpark?

I have a table in Hadoop which contains 7 billion strings which can themselves contain anything. I need to remove every name from the column containing the strings. An example string would be 'John went to the park' and I'd need to remove 'John' from that, ideally just replacing with '[name]'.
In the case of 'John and Mary went to market', the output would be '[NAME] and [NAME] went to market'.
To support this I have an ordered list of the most frequently occurring 20k names.
I have access to Hue (Hive, Impala) and Zeppelin (Spark, Python & libraries) to execute this.
I've tried this in the DB, but being unable to update columns or iterate over a variable made it a non-starter, so using Python and PySpark seems to be the best option, especially considering the number of calculations (20k names * 7 billion input strings).
#nameList contains ['John','Emma',etc]
import re

def removeNames(line, nameList):
    str_line = line[0]
    for name in nameList:
        # Note: Python's re module does not support POSIX classes such as
        # [[:^alpha:]]; inside [...] they are read as literal characters.
        rx = f"(^| |[[:^alpha:]])({name})( |$|[[:^alpha:]])"
        str_line = re.sub(rx, '[NAME]', str_line)
    str_line = [str_line]
    return tuple(str_line)
df = session.sql("select free_text from table")
rdd = df.rdd.map(lambda line: removeNames(line, nameList))
rdd.toDF().show()
The code is executing, but it's taking an hour and a half even if I limit the input text to 1000 lines (which is nothing for Spark), and the lines aren't actually being replaced in the final output.
What I'm wondering is: Why isn't map actually updating the lines of the RDD, and how could I make this more efficient so it executes in a reasonable amount of time?
This is my first time posting so if there's essential info missing, I'll fill in as much as I can.
Thank you!
In case you're still curious about this: by mapping a plain Python function (your removeNames) over the RDD, Spark has to serialize every row out of the JVM to Python worker processes and back, which gives up most of the benefit of doing the operation natively in Spark. As suggested in the comments, if you go with the built-in regexp_replace() function instead, Spark can keep all of the data inside the JVM on the distributed nodes, keeping everything distributed and improving performance.
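A minimal sketch of the regexp_replace() approach (the column name free_text comes from the question; the word-boundary pattern is a simplified stand-in for the question's regex, and with 20k names you may want to split the alternation across several calls):
from pyspark.sql import functions as F

# Build one alternation of all names, matched on word boundaries.
pattern = r"\b(" + "|".join(nameList) + r")\b"  # nameList as in the question

df = session.sql("select free_text from table")
df = df.withColumn("free_text", F.regexp_replace("free_text", pattern, "[NAME]"))
df.show()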

Clear all variables which were defined in a Jupyter cell after execution has finished

I need to import and manipulate memory-heavy data in Jupyter.
Because I tend to have rather long notebooks where several data sets get imported, I need to clear them continuously by hand.
This is tedious.
If possible, I would like to have a tool which clears all variables introduced in a cell, and only those, without having to address them by hand after they have fulfilled their purpose.
I could of course overwrite variables, but as they all serve rather different purposes this would drastically reduce the readability of the code.
To summarize:
Cell 1:
variable overhead #this will be used in the entire notebook
Cell 2:
import or generate data & manipulate data
clear all variables introduced in this cell without the need to address
every single one of them by hand <-- this is what I am looking for.
Thank you very much!
You can reset variables in the Jupyter Notebook by putting the following magic command in the code:
%reset_selective -f [var1, var2, var3]
If you add such lines to your code, it should remain readable.
To answer your question completely: at the moment I don't think there is a command that would automatically find all variables created in a specific cell and reset only them. (Someone please correct me if I am wrong.)
But you can use the following code, which deletes exactly those namespace objects that were newly created in a cell. It is probably what you want:
from IPython import get_ipython
my_variables = set(dir()) # Write this line at the beginning of cell
# Here is the content of the cell
my_variables = list(set(dir()) - my_variables) # Write these 2 lines at the end of cell
get_ipython().magic('%reset_selective -f [{}]'.format(','.join(my_variables)))
It's not a clean solution, but the cells whose data I need to keep are not computationally expensive.
Therefore I found it most convenient to simply do:
%reset -f
exec('\n'.join(In[n:m]))

Trying to fill an array with data opened from files

The following is code I have written that tries to open individual files, which are long strips of data, and read them into an array. Essentially I have files that run over 15 times (24 hours to 360 hours), and each file has an iteration of 50, hence the two loops. I then try to read the files into an array. When I try to print a specific element in the array, I get the error "'file' object has no attribute '__getitem__'". Any ideas what the problem is? Thanks.
#!/usr/bin/python
############################################
#
import csv
import sys
import numpy as np
import scipy as sp
#
#############################################
level = input("Enter a level: ");
LEVEL = str(level);
MODEL = raw_input("Enter a model: ");
NX = 360;
NY = 181;
date = 201409060000;
DATE = str(date);
#############################################
FileList = [];
data = [];
for j in range(1,51,1):
    J = str(j);
    for i in range(24,384,24):
        I = str(i);
        fileName = '/Users/alexg/ECMWF_DATA/DAT_FILES/'+MODEL+'_'+LEVEL+'_v_'+J+'_FT0'+I+'_'+DATE+'.dat';
        FileList.append(fileName);
        fo = open(fileName,"rb");
        data.append(fo);
        fo.close();
print data[1][1];
print FileList;
EDITED TO ADD:
Below, find the CORRECT array that the python script should be producing (sorry, it won't let me post this inline yet):
http://i.stack.imgur.com/ItSxd.png
The problem I now run into, is that the first three values in the first row of the output matrix are:
-7.090874
-7.004936
-6.920952
These values are actually the first three values of the 11th row in the array below, which is how it should look (as produced in MATLAB). The next three values the python script outputs (as what it believes to be the second row) are:
-5.255577
-5.159874
-5.064171
These values should be found in the 22nd row. In other words, python is placing the 11th row of values in the first position, the 22nd in the second, and so on. I don't have a clue as to why, or where in the code I'm telling it to do this.
You're appending the file objects themselves to data, not their contents:
fo = open(fileName,"rb");
data.append(fo);
So, when you try to print data[1][1], data[1] is a file object (a closed file object, to boot, but it would be just as broken if still open), so data[1][1] tries to treat that file object as if it were a sequence, and file objects aren't sequences.
It's not clear what format your data are in, or how you want to split it up.
If "long strips of data" just means "a bunch of lines", then you probably wanted this:
data.append(list(fo))
A file object is an iterable of lines; it's just not a sequence. You can copy any iterable into a sequence with the list function. So now, data[1][1] will be the second line in the second file.
(The difference between "iterable" and "sequence" probably isn't obvious to a newcomer to Python. The tutorial section on Iterators explains it briefly, the Glossary gives some more information, and the ABCs in the collections module define exactly what you can do with each kind of thing. But briefly: An iterable is anything you can loop over. Some iterables are sequences, like list, which means they're indexable collections that you can access like spam[0]. Others are not, like file, which just reads one line at a time into memory as you loop over it.)
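For instance, a minimal sketch of the question's loop adjusted this way (same path pattern and Python 2 style as the question; MODEL, LEVEL and DATE come from the earlier part of the script):
data = []
FileList = []
for j in range(1, 51):
    for i in range(24, 384, 24):
        fileName = ('/Users/alexg/ECMWF_DATA/DAT_FILES/' + MODEL + '_' + LEVEL +
                    '_v_' + str(j) + '_FT0' + str(i) + '_' + DATE + '.dat')
        FileList.append(fileName)
        with open(fileName, "rb") as fo:
            data.append(list(fo))  # store the file's lines, not the file object
print(data[1][1])  # second line of the second file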
If, on the other hand, you actually imported csv for a reason, you more likely wanted something like this:
reader = csv.reader(fo)
data.append(list(reader))
Now, data[1][1] will be a list of the columns from the second row of the second file.
Or maybe you just wanted to treat it as a sequence of characters:
data.append(fo.read())
Now, data[1][1] will be the second character of the second file.
There are plenty of other things you could just as easily mean, and easy ways to write each one of them… but until you know which one you want, you can't write it.

Summing entries in a variable based on another variable (making it unique) in Python lists

I have a question as to how I can perform this task in Python.
I have an array of entries like:
[IPAddress, connections, policystatus, activity flag, longitude, latitude] (all as strings)
ex.
['172.1.21.26','54','1','2','31.15424','12.54464']
['172.1.21.27','12','2','4','31.15424','12.54464']
['172.1.27.34','40','1','1','-40.15474','-54.21454']
['172.1.2.45','32','1','1','-40.15474','-54.21454']
...
until about 110,000 entries, with about 4,000 different combinations of longitude-latitude.
I want to compute the average connections, average policy status and average activity flag for each location,
something like this:
[longitude,latitude,avgConn,avgPoli,avgActi]
['31.15424','12.54464','33','2','3']
['-40.15474','-54.21454','31','1','1']
...
so on
I have about 195 files with ~110,000 entries each (sort of a big data problem).
My files are in .csv, but I'm reading them as plain .txt to work with them easily in Python (not sure if this is the best idea).
I'm still new to Python, so I'm not really sure what the best approach is, but I sincerely appreciate any help or guidance with this problem.
Thanks in advance!
No, if you have the files as .csv, treating them as plain text does not make sense, since Python ships with the excellent csv module.
You could read the csv rows into a dict to group them, but I'd suggest writing the data into a proper database and using SQL's AVG() and GROUP BY. Python ships with bindings for most databases. If you have none installed, consider using the sqlite3 module.
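A minimal sketch of that approach (the database, table and column names are just illustrations; the CSV layout is the one described in the question):
import csv
import sqlite3

conn = sqlite3.connect("entries.db")
conn.execute("""CREATE TABLE IF NOT EXISTS entries
                (ip TEXT, connections INTEGER, policy INTEGER, activity INTEGER,
                 lon TEXT, lat TEXT)""")

for path in file_paths:  # file_paths: list of your ~195 csv files
    with open(path) as f:
        rows = [(r[0], int(r[1]), int(r[2]), int(r[3]), r[4], r[5])
                for r in csv.reader(f)]
    conn.executemany("INSERT INTO entries VALUES (?, ?, ?, ?, ?, ?)", rows)
conn.commit()

for row in conn.execute("""SELECT lon, lat, AVG(connections), AVG(policy), AVG(activity)
                           FROM entries GROUP BY lon, lat"""):
    print(row)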
I'll only give you the algorithm; you will learn more by writing the actual code yourself.
1. Use a dictionary with a key of the form (longitude, latitude) and a value of the form [ConnectionSum, PolicyStatusSum, ActivityFlagSum, Count].
2. Loop over the entries once:
   a. For each entry whose location already exists in the dictionary, add the connection, policy status and activity values to the existing sums and increment Count.
   b. If the location does not exist yet, insert it with the entry's values and a Count of 1.
3. Do steps 1 and 2 for all files.
4. After all the entries have been scanned, loop over the dictionary and divide each of [ConnectionSum, PolicyStatusSum, ActivityFlagSum] by that location's Count to get the averages.
As long as duplicate locations are restricted to the same file (or even close to each other within a file), all you need is the stream-processing paradigm. For example, if you know that duplicate locations only appear within one file, read each file, calculate the averages, then close the file. As long as you let the old data fall out of scope, the garbage collector will get rid of it for you. Basically do this:
def processFile(pathToFile):
    ...

totalResults = ...
for path in filePaths:
    partialResults = processFile(path)
    totalResults = combine...partialResults...with...totalResults
An even more elegant solution would be to use the O(1) method of calculating averages "on-line". If, for example, you are averaging 5, 6, 7, you would do 5/1 = 5.0, (5.0*1+6)/2 = 5.5, (5.5*2+7)/3 = 6. At each step, you only keep track of the current average and the number of elements. This solution yields the minimal amount of memory used (no more than the size of your final result!) and doesn't care about the order in which you visit elements. It would go something like this. See http://docs.python.org/library/csv.html for the functions you'll need from the csv module.
import csv

def allTheRecords():
    for path in filePaths:
        for row in csv.somehow_get_rows(path):
            yield SomeStructure(row)

averages = {}  # dict: keys are tuples (lat, long), values are an arbitrary
               # datastructure, e.g. a dict, representing {avgConn, avgPoli, avgActi, num}

for record in allTheRecords():
    position = (record.lat, record.long)
    currentAverage = averages.get(position, {'avgConn': 0, 'avgPoli': 0, 'avgActi': 0, 'num': 0})
    newAverage = ...  # apply the math I mentioned above
    averages[position] = newAverage
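A runnable version of that sketch might look like the following (the column order is taken from the question, file_paths is a placeholder for your list of csv files, and the update() helper is the running-average arithmetic described above):
import csv

def update(avg, n, x):
    # Fold one new value x into a running average previously computed over n items.
    return (avg * n + x) / (n + 1)

averages = {}  # (lat, lon) -> {'conn': avg, 'poli': avg, 'acti': avg, 'num': count}

for path in file_paths:  # file_paths: your list of csv files (placeholder name)
    with open(path) as f:
        for ip, conn, poli, acti, lon, lat in csv.reader(f):
            cur = averages.setdefault((lat, lon),
                                      {'conn': 0.0, 'poli': 0.0, 'acti': 0.0, 'num': 0})
            n = cur['num']
            cur['conn'] = update(cur['conn'], n, float(conn))
            cur['poli'] = update(cur['poli'], n, float(poli))
            cur['acti'] = update(cur['acti'], n, float(acti))
            cur['num'] = n + 1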
(Do note that the notion of an "average at a location" is not well-defined. Well, it is well-defined, but not very useful: if you knew the exact location of every IP event to infinite precision, every "average" would just be a single event. The only reason you can compress your dataset is that your latitude and longitude have finite precision. If you run into this issue because you acquire more precise data, you can choose to round to an appropriate precision. It may be reasonable to round to within 10 meters or so; see latitude and longitude. This requires just a little bit of math/geometry.)
