SPSS Python - fast(er) way of accessing Value Labels

I am trying to pull the variables' names, labels, and value labels. I noticed that all the assignments are quite fast, except the one referencing ValueLabels. On my test dataset, if I comment out that line, everything else takes about 1 second. But that line alone delays the whole code by about 15 seconds, and the test dataset is not a large one (by my standards, at least :))
Is this something inherent in accessing the variable dictionary? Or is there another, faster way of pulling the whole dictionary, without going variable by variable...?
begin program.
import spss
import spssaux
vardict = spssaux.VariableDict()
var_list=[]
var_values={}
var_type={}
var_labels={}
for i in range(spss.GetVariableCount()):
    var = spss.GetVariableName(i)
    var_list.append(var)
    # this is the line causing the massive delay
    var_values[var] = vardict[i].ValueLabels
    var_type[var] = str(spss.GetVariableFormat(i)[0])
    var_labels[var] = vardict[i].VariableLabel
end program.
In fact, I only need it to check whether a variable has value labels defined or not. But I have no idea how to check that any other way.

It turns out that using the spssaux module was the culprit here. I have no idea why, because pretty much all the Internet knowledge points to that way of getting the value labels.
However, almost by accident I stumbled upon the help of the `spss` module, which states:
| valueLabels
| Get, set or delete value labels. The set of value labels for a particular variable is represented
| as a Python dictionary whose keys are the values for which labels are being set and whose
| values are the associated labels. Labels must be specified as quoted strings.
|
| --examples
| # Get all value labels for a specified variable
| import spss
| spss.StartDataStep()
| datasetObj = spss.Dataset()
| varObj = datasetObj.varlist['minority']
| vallabels = varObj.valueLabels
| spss.EndDataStep()
As I was only interested in seeing whether variables have (or do not have) value labels, I created a dictionary storing the length of each variable's valueLabels dictionary:
begin program.
# Get all value labels for a specified variable
import spss
spss.StartDataStep()
datasetObj = spss.Dataset()
var_labels={}
for var in datasetObj.varlist:
    var_labels[var.name] = len(var.valueLabels)
spss.EndDataStep()
print var_labels
end program.
It is instantaneous, even on large files. (I admit, what "large" means may be different from user to user; I stopped the code in the OP after 30 minutes on a "large" file, as it was obviously not being time-effective).
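Since the Python session persists across program blocks in SPSS, a small usage sketch building on the dictionary above (Python 2, matching the answer's code) lists the variables that have no value labels at all:
begin program.
# using the var_labels dictionary built in the previous block:
# keep only the variables with zero value labels defined
unlabeled = [name for name, n in var_labels.items() if n == 0]
print unlabeled
end program.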

Is there another way to convert ee.Number to float except getInfo()?

Hello friends!
Summary:
I have an ee.FeatureCollection containing around 8500 ee.Point objects. I would like to calculate the distance of these points to a given coordinate, let's say (0.0, 0.0).
For this I use the function geopy.distance.distance() (ref: https://geopy.readthedocs.io/en/latest/#module-geopy.distance). As input, the function takes two coordinates in the form of two tuples, each containing two floats.
Problem: When I am trying to convert the coordinates from an ee.List to floats, I always use the getInfo() function. I know this is a callback and it is very time-intensive, but I don't know another way to extract them. Long story short: extracting the data as ee.Number takes less than a second; if I want it as floats, it takes more than an hour. Is there any trick to fix this?
Code:
fc_containing_points = ee.FeatureCollection('projects/ee-philadamhiwi/assets/Flensburg_100') # ee.FeatureCollection
list_containing_points = fc_containing_points.toList(fc_containing_points.size()) # ee.List
fc_containing_points_length = fc_containing_points.size() # ee.Number
for index in range(fc_containing_points_length.getInfo()): # I need to convert ee.Number to int
    point_tmp = list_containing_points.get(index) # ee.ComputedObject
    point = ee.Feature(point_tmp) # transform ee.ComputedObject to ee.Feature
    coords = point.geometry().coordinates() # ee.List containing 2 ee.Numbers
    # when I run the loop without the next line,
    # I get all the data I want as ee.Number in under 1 sec
    coords_as_tuple_of_ints = (coords.getInfo()[1], coords.getInfo()[0]) # tuple containing 2 floats
    # when I add this line to the loop, it takes hours
PS: This is my first question, please be patient with me.
I would use .map instead of your loop. This stays server-side until you export the table (or possibly do a .getInfo on the whole thing):
fc_containing_points = ee.FeatureCollection('projects/ee-philadamhiwi/assets/Flensburg_100')
fc_with_distance = fc_containing_points.map(lambda feature: feature.set("distance_to_point", feature.distance(ee.Feature(ee.Geometry.Point([0.0, 0.0])))))
# Then export using ee.batch.Export.Table.toXXX or call getInfo
(An alternative might be to use ee.Image.paint to convert the target point to an image, then use ee.Image.distance to calculate the distance to the point (as an image), then use reduceRegions over the feature collection with all points. But 1) you can only calculate distance up to a certain distance, and 2) I don't think it would be any faster.)
To comment on your code: you are probably aware that loops (especially client-side loops) are frowned upon in GEE (primarily for the performance reasons you've run into), but also note that any time you call .getInfo on a server-side object it incurs a performance cost. So this line
coords_as_tuple_of_ints = (coords.getInfo()[1],coords.getInfo()[0])
would take roughly twice as long as this:
coords_client = coords.getInfo()
coords_as_tuple_of_ints = (coords_client[1],coords_client[0])
Finally, you could always just export your entire feature collection to a shapefile (using ee.batch.Export.Table.... as above) and do all the operations using geopy locally.
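For completeness, a minimal sketch of that single-round-trip idea (one .getInfo() over the whole list of coordinates, then geopy locally; note that GEE returns coordinates as [lon, lat] while geopy expects (lat, lon) tuples):
import ee
import geopy.distance

ee.Initialize()
fc = ee.FeatureCollection('projects/ee-philadamhiwi/assets/Flensburg_100')
# map over the server-side list, then fetch everything in one callback
coords = fc.toList(fc.size()).map(
    lambda f: ee.Feature(f).geometry().coordinates()).getInfo()
target = (0.0, 0.0)
distances_km = [geopy.distance.distance((lat, lon), target).km
                for lon, lat in coords]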

global variables and .format()

I wrote a python script that generates PDF reports. I had to do some data manipulation to change column names in each of the data sets I used.
My question is: is there a way to set a global variable and then use .format() inside Target_Hours_All.rename()?
I have hardcoded each column name.
For example, Target_Hours_All.rename(columns = {'VP_x':'VP', '2018 Q1 Target Hours':'hourTarget18Q1'}, inplace = True)
However, I want to be able to run this each quarter without having to update every df.rename call. Instead, I would like to have global variables at the top of the script and change only those.
Any help would be greatly appreciated!!!
Easiest way? Move the strings you want to change out of the function call into variables and then use the variables within Target_Hours_All.rename()
This makes the code a lot easier to read.
I can only guess what Target_Hours_All.rename does. But my guess would be that it takes the dict in columns and replaces each key with its value. Correct?
So you could write your columns line as:
columns = {}
vpl = 'VP_x'
vpr = 'VP'
columns[vpl] = vpr
target_hours_l = '20{year} {quarter} Target Hours'.format(year='18', quarter='Q1')
target_hours_r = 'hourTarget{year}{quarter}'.format(year='18',quarter='Q1')
columns[target_hours_l] = target_hours_r
Target_Hours_All.rename(columns = columns, ... )
Yes, this is more code, and I should have named my dict something other than columns. So there is room for improvement. But it shows how you can use .format() for your call.
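Tying it back to the question, a small sketch with the year and quarter as variables at the top of the script (the names here are illustrative), so only those two lines change each quarter:
# set these once per quarter, at the top of the script
YEAR = '18'
QUARTER = 'Q1'

columns = {
    'VP_x': 'VP',
    '20{y} {q} Target Hours'.format(y=YEAR, q=QUARTER):
        'hourTarget{y}{q}'.format(y=YEAR, q=QUARTER),
}
Target_Hours_All.rename(columns=columns, inplace=True)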

I can't delete cases from .sav files using spss with python

I have some .sav files that I want to check for bad data. What I mean by bad data is irrelevant to the problem. I have written a script in python using the spss module to check the cases and then delete them if they are bad. I do that within a datastep by defining a dataset object and then getting its case list. I then use
del datasetObj.cases[k]
to delete the problematic cases within the datastep.
Here is my problem:
Say I have a data set foo.sav and it is the active data set in SPSS; then I can run something like:
BEGIN PROGRAM PYTHON.
import spss
spss.StartDataStep()
datasetObj = spss.Dataset()
caselist = datasetObj.cases
del caselist[k]
spss.EndDataStep()
END PROGRAM.
from within the SPSS client, and it will delete case k from the data set foo.sav. But if I run something like the following, using the directory of foo.sav as the working directory:
import os, spss
pathname = os.getcwd()  # note: os.curdir is a plain string ('.'), not a function
foopathname = os.path.join(pathname, 'foo.sav')
spss.Submit("""
GET FILE='%(foopathname)s'.
DATASET NAME file1.
DATASET ACTIVATE file1.
""" %locals())
spss.StartDataStep()
datasetObj = spss.Dataset()
caselist = datasetObj.cases
del caselist[3]
spss.EndDataStep()
from the command line, then it doesn't delete case k (k = 3 here). Similar code which gets values will work fine. E.g.,
print caselist[3]
will print case k (when it is in the data step). I can even change the values for the various entries of a case. But it will not delete cases. Any ideas?
I am new to Python and SPSS, so there may be something that I am not seeing which is obvious to others; hence the question.
Your first piece of code did not work for me. I adjusted it as follows to get it working:
BEGIN PROGRAM PYTHON.
import spss
spss.StartDataStep()
datasetObj = spss.Dataset()
del datasetObj.cases[k]
spss.EndDataStep()
END PROGRAM.
Notice that, in your code, caselist is just a list, containing values taken from the datasetObj in SPSS. The attribute .cases belongs to datasetObj.
With spss.Submit, you can also delete cases using the SPSS command SELECT IF, which keeps exactly the cases for which the condition is true. For example, if your file has a variable (column) named age, with values ranging from 0 to 100, you can delete all cases with an age lower than (in SPSS: lt or <) 25 by selecting the rest:
BEGIN PROGRAM PYTHON.
import spss
spss.Submit("""
SELECT IF age ge 25.
""")
END PROGRAM.
Don't forget to add some code to save the edited file.
caselist is not actually a regular list containing the dataset values. Although its interface is the list interface, it actually works directly with the dataset, so it does not contain a list of values. It just accesses operations on the SPSS side to retrieve, change, or delete values. The most important difference is that since Statistics is not keeping the data in memory, the size of the caselist is not limited by memory.
However, if you are trying to iterate over the cases with a loop using
range(spss.GetCaseCount())
and deleting some, the loop will eventually fail, because the actual case count reflects the deletions, but the loop limit doesn't reflect that. And datasetObj.cases[k] might not be the case you expect if an earlier case has been deleted. So you need to keep track of the deletions and adjust the limit or the k value appropriately.
HTH
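One way around that bookkeeping (a sketch, not from the original answers) is to iterate backwards, so a deletion never shifts the index of a case that has not been visited yet; is_bad() below is a hypothetical stand-in for whatever check identifies a bad case:
BEGIN PROGRAM PYTHON.
import spss
spss.StartDataStep()
datasetObj = spss.Dataset()
# walk from the last case to the first so each deletion
# only shifts indices we have already passed
for k in range(len(datasetObj.cases) - 1, -1, -1):
    if is_bad(datasetObj.cases[k]):  # is_bad is a hypothetical predicate
        del datasetObj.cases[k]
spss.EndDataStep()
END PROGRAM.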

influxdb: Write multiple points vs single point multiple times

I'm using influxdb in my project, and I'm facing an issue with queries when multiple points are written at once.
I'm using influxdb-python to write 1000 unique points to influxdb.
In the influxdb-python there is a function called influxclient.write_points()
I have two options now:
Write each point once every time (1000 times) or
Consolidate 1000 points and write all the points once.
The first option's code looks like this (pseudo code only), and it works:
thousand_points = [0...999]  # pseudo code: 1000 unique values
i = 0
while i < 1000:
    ...
    ...
    point = [{thousand_points[i]}]  # a point must be converted to a dictionary object first
    influxclient.write_points(point, time_precision="ms")
    i += 1
After writing all the points, when I write a query like this:
SELECT * FROM "mydb"
I get all the 1000 points.
To avoid the overhead added by a separate write in every iteration, I felt like exploring writing multiple points at once, which is supported by the write_points function:
write_points(points, time_precision=None, database=None,
retention_policy=None, tags=None, batch_size=None)
Write to multiple time series names.
Parameters: points (list of dictionaries, each dictionary represents
a point) – the list of points to be written in the database
So, what I did was:
thousand_points = [0...999]
points = []
i = 0
while i < 1000:
    ...
    ...
    points.append({thousand_points[i]})  # a point must be converted to a dictionary object first
    i += 1
influxclient.write_points(points, time_precision="ms")
With this change, when I query:
SELECT * FROM "mydb"
I only get 1 point as the result. I don't understand why.
Any help will be much appreciated.
You might have a good case for a SeriesHelper.
In essence, you set up a SeriesHelper class in advance, and every time you discover a data point to add, you make a call. The SeriesHelper will batch up the writes for you, up to bulk_size points per write.
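A minimal sketch of what that looks like, adapted from the influxdb-python documentation (the connection details and the measurement/field names are placeholders):
from influxdb import InfluxDBClient, SeriesHelper

class MySeriesHelper(SeriesHelper):
    class Meta:
        # placeholder connection and naming details
        client = InfluxDBClient('localhost', 8086, 'root', 'root', 'mydb')
        series_name = 'my_measurement'
        fields = ['value']
        tags = []
        bulk_size = 1000   # flush to InfluxDB every 1000 points
        autocommit = True

for value in thousand_points:
    MySeriesHelper(value=value)
MySeriesHelper.commit()    # write out anything left in the buffer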
I know this was asked well over a year ago; however, in order to publish multiple data points in bulk to influxdb, each datapoint needs to have a unique timestamp, it seems, otherwise it will just be continuously overwritten.
I'd import datetime and add the following to each datapoint within the for loop:
'time': datetime.datetime.now().strftime("%Y-%m-%dT%H:%M:%SZ")
So each datapoint should look something like...
{'fields': data, 'measurement': measurement, 'time': datetime....}
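Sketched out against the bulk write from the question (the measurement and field names are placeholders; offsetting each point by a millisecond is one way to keep timestamps unique even within the same second):
import datetime

points = []
for i, value in enumerate(thousand_points):
    # give every point its own timestamp so none of them overwrite each other
    ts = datetime.datetime.utcnow() + datetime.timedelta(milliseconds=i)
    points.append({
        'measurement': 'my_measurement',   # placeholder name
        'time': ts.isoformat() + 'Z',
        'fields': {'value': value},
    })
influxclient.write_points(points, time_precision="ms")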
Hope this is helpful for anybody else who runs into this!
Edit: Reading the docs shows that another unique identifier is a tag, so you could instead include {'tag': i} (supposedly each iteration value is unique) if you don't wish to specify the time. (I haven't tried this, however.)

Pattern for associating pyparsing results with a linked-list of nodes

I have defined a pyparsing rule to parse this text into a syntax-tree...
TEXT COMMANDS:
add Iteration name = "Cisco 10M/half"
append Observation name = "packet loss 1"
assign Observation results_text = 0.0
assign Observation results_bool = True
append DataPoint
assign DataPoint metric = txpackets
assign DataPoint units = packets
append DataPoint
assign DataPoint metric = txpackets
assign DataPoint units = packets
append Observation name = "packet loss 2"
append DataPoint
assign DataPoint metric = txpackets
assign DataPoint units = packets
append DataPoint
assign DataPoint metric = txpackets
assign DataPoint units = packets
SYNTAX TREE:
['add', 'Iteration', ['name', 'Cisco 10M/half']]
['append', 'Observation', ['name', 'packet loss 1']]
['assign', 'Observation', ['results_text', '0.0']]
['assign', 'Observation', ['results_bool', 'True']]
['append', 'DataPoint']
['assign', 'DataPoint', ['metric', 'txpackets']]
['assign', 'DataPoint', ['units', 'packets']]
...
I'm trying to associate all the nested key-value pairs in the syntax-tree above into a linked list of objects... the hierarchy looks something like this (each word is a namedtuple; children in the hierarchy are on the parent's list of children):
Log: [
Iteration: [
Observation:
[DataPoint, DataPoint],
Observation:
[DataPoint, DataPoint]
]
]
The goal of all this is to build a generic test data-acquisition platform to drive the flow of tests against network gear, and record the results. After the data is in this format, the same data structure will be used to build a test report. To answer the question in the comments below, I chose a linked list because it seemed like the easiest way to sequentially dequeue the information when writing the report. However, I would rather not assign Iteration or Observation sequence numbers before finishing the tests... in case we find problems and insert more Observations in the course of conducting the test. My theory is that the position of each element in the list is sufficient, but I'm willing to change that if it's part of the problem.
The problem is that I'm getting lost trying to assign Key-Values to objects in the linked list after it's built. For instance, after I insert an Observation namedtuple into the first Iteration, I have trouble reliably handling the update of assign Observation results_bool = True in the example above.
Is there a generalized design pattern to handle this situation? I have googled this for a while, but I can't seem to make the link between parsing the text (which I can do) and managing the data hierarchy (the main problem). Hyperlinks or small demo code are fine... I just need pointers to get on the right track.
I am not aware of an actual design pattern for what you're looking for, but I have a great passion for the issue at hand. I work heavily with network devices and parsing and organizing the data is a large ongoing challenge for me.
It's clear that the problem is not parsing the data, but what you do with it afterwards. This is where you need to think about the meaning you are attaching to the data you have parsed. The nested-list method might work well for you if the objects containing the lists are also meaningful.
Namedtuples are great for quick-and-dirty class-ish behavior, but they fall flat when you need them to do anything outside of basic attribute access, especially considering that as tuples they are immutable. It seems to me that you'll want to replace certain namedtuple objects with full-blown classes. This way you can highly customize the behavior and methods available.
For example, you know that an Iteration will always contain 1 or more Observation objects which will then contain 1 or more DataPoint objects. If you can accurately describe the relationships, this sets you on the path to handling them.
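To make that concrete, here is a small sketch (not from the original answer; the class and helper names are illustrative) of full-blown classes plus a builder that tracks the most recently added node of each kind, so a later assign Observation ... reliably lands on the newest Observation:
class Node(object):
    """A mutable tree node; replaces the immutable namedtuples."""
    def __init__(self, kind, **attrs):
        self.kind = kind
        self.attrs = attrs
        self.children = []

def build(rows):
    # rows are the parsed lists, e.g. ['assign', 'Observation', ['results_bool', 'True']]
    log = Node('Log')
    current = {'Log': log}                   # newest node of each kind
    parent_of = {'Iteration': 'Log',
                 'Observation': 'Iteration',
                 'DataPoint': 'Observation'}
    for row in rows:
        verb, kind = row[0], row[1]
        kv = dict([tuple(row[2])]) if len(row) > 2 else {}
        if verb in ('add', 'append'):
            node = Node(kind, **kv)
            current[parent_of[kind]].children.append(node)
            current[kind] = node             # later assigns target this node
        elif verb == 'assign':
            current[kind].attrs.update(kv)
    return log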
I wound up using textfsm, which allows me to keep state between different lines while parsing the configuration file.
