I'm making custom tables in SPSS, but when the cell values (percentages) are rounded to 1 decimal, they sometimes add up to 99.9 or 100.1 instead of 100.0. My boss asked me to have everything neatly add up to 100. This means slightly changing some values in the output tables.
I wrote some code to retrieve cell values from tables, which works fine, but I cannot find any method or class that allows me to change cells in already generated output. I've tried things like:
Table[(rij,6)] = CellText.Number(11)
and
SpssDataCells[(rij,6)] = CellText.Number(11)
but it keeps giving me "AttributeError: 'SpssClient.SpssTextItem' object has no attribute 'DataCellArray'"
How do I successfully change cell values of output tables in SPSS?
My code so far:
import SpssClient, spss

# Connect Python to SPSS.
SpssClient.StartClient()
OutputDoc = SpssClient.GetDesignatedOutputDoc()
OutputItemList = OutputDoc.GetOutputItems()

# Grab the last table.
lastTab = OutputItemList.Size()-2
OutputItem = OutputItemList.GetItemAt(lastTab)
Table = OutputItem.GetSpecificType()
SpssDataCells = Table.DataCellArray()

# Loop: for each row, test whether the rounded values add up to 100.
# Grab the specific numbers.
rij = 0
try:
    while (rij < 20):
        b14 = float(SpssDataCells.GetUnformattedValueAt(rij, 0))
        z14 = float(SpssDataCells.GetUnformattedValueAt(rij, 1))
        zz14 = float(SpssDataCells.GetUnformattedValueAt(rij, 2))
        b15 = float(SpssDataCells.GetUnformattedValueAt(rij, 4))
        z15 = float(SpssDataCells.GetUnformattedValueAt(rij, 5))
        zz15 = float(SpssDataCells.GetUnformattedValueAt(rij, 6))
        print [b14, z14, zz14, b15, z15, zz15]
        rij = rij + 1
except:
    print 'End of table'
The SetValueAt method is what you require to change the value of a cell in a table.
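For instance, building on the variables from your loop, something along these lines could push the rounding discrepancy into the last cell of each row (a rough, untested sketch; whether SetValueAt expects a plain string or another value type should be checked against the SpssClient documentation for your version):
# Inside the while loop, after reading b15, z15 and zz15 for the current row:
rounded = [round(v, 1) for v in (b15, z15, zz15)]
diff = round(100.0 - sum(rounded), 1)
if diff != 0.0:
    # Absorb the rounding discrepancy in the last column of the row.
    SpssDataCells.SetValueAt(rij, 6, str(round(rounded[-1] + diff, 1)))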
PS. I think your boss should focus on more important things than spending billable time on having percentages add up neatly to 100% (due to rounding). Also make sure you use as much decimal precision as possible in your calculations so as to minimize this "discrepancy".
Update:
Just to give an example of what you can do with manipulation like this (beyond fixing rounding errors):
The table above shows the Share of Voice (SoV) of a respiratory drug brand (R3) and its rank among all brands (first two data columns), and the SoV and rank within its own class of brands only (third and fourth columns). This is compared against the previous month (July 15): if the rank has increased, the cell is highlighted in green and an upward-facing arrow is added; if the rank has declined, it is highlighted in red and a downward-facing red arrow is added. It just adds a little colour and visualization to what can otherwise be dull tables.
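As a rough illustration of how such highlighting might be scripted: the SetBackgroundColorAt method and the colour encoding below are assumptions to verify against the SpssClient documentation, and rank_increased/rank_declined are hypothetical flags computed from the current and previous month.
GREEN = 0x00FF00  # placeholder RGB integer; confirm the expected colour encoding
RED = 0xFF0000    # placeholder RGB integer
if rank_increased:       # hypothetical flag: rank went up vs. the previous month
    SpssDataCells.SetBackgroundColorAt(rij, 6, GREEN)
elif rank_declined:      # hypothetical flag: rank went down
    SpssDataCells.SetBackgroundColorAt(rij, 6, RED)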
Related
I'm working on a Task Scheduling problem given in Table 3 of the paper Holistic energy awareness for intelligent drones.
Table 3
In the 6th equation: N_d = E_d/B_d
I want to convert the floating-point value of (E_d/B_d) to an integer value of N_d.
I'm using an AbstractModel in Pyomo (6.4.0) with Python 3.7 and the GLPK 4.65 solver.
The original code is:
model.Drones = Set() # List of drones
model.Battery_capacity = Param(model.Drones, within=NonNegativeReals) # =170
model.Energy_total = Var(model.Drones, within=NonNegativeReals, initialize=1)
model.Charging_sessions = Var(model.Drones, within=NonNegativeReals, initialize=1)
def battery_charging_sessions_rule(model, d):
    return model.Charging_sessions[d] == (model.Energy_total[d]/model.Battery_capacity[d])
model.battery_charging_sessions = Constraint(model.Drones, rule=battery_charging_sessions_rule)
In this case, model.Charging_sessions takes a floating-point value, which can also be less than 1. I've tried various options, like
model.Charging_sessions = Var(model.Drones, within=Integers, initialize=1, bounds=(0,None))
and also using the following return statement instead of the previous one:
return model.Charging_sessions[d] == floor(value((model.Energy_total[d]/model.Battery_capacity[d])))
However, this causes model.Charging_sessions to be forced to 0, and it won't even be generated in the results file. Using the logs I found out that, with no change to the original code,
Charging_sessions[d] - (0.0058823530*Energy_total[d])
is lower and upper bounded by 0, where 0.0058823530 = 1/170.
While with the changes, the lower and upper bounds of
Charging_sessions[d]
are both 0. It seems that by using floor(value()) or int(value()), the term (0.0058823530*Energy_total[d]) is reduced to 0.
What are the ways I can get the integer value?
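One possible reformulation (a sketch, not the paper's own formulation): floor(value(...)) is evaluated once while the model instance is built, when Energy_total[d] still holds its initial value, which is why the expression collapses to a constant. Declaring the variable as integer and bounding it with two linear inequalities lets GLPK enforce N_d = floor(E_d/B_d) instead (flip the inequalities for the ceiling):
from pyomo.environ import Var, Constraint, NonNegativeIntegers

eps = 1e-6  # small tolerance so N_d cannot land on the next integer up

# Replace the original NonNegativeReals declaration with an integer variable;
# integrality is then handled by the solver, not by Python.
model.Charging_sessions = Var(model.Drones, within=NonNegativeIntegers, initialize=1)

# N_d <= E_d/B_d and N_d >= E_d/B_d - 1 + eps together pin N_d to floor(E_d/B_d).
def sessions_upper_rule(model, d):
    return model.Charging_sessions[d] <= model.Energy_total[d]/model.Battery_capacity[d]
model.sessions_upper = Constraint(model.Drones, rule=sessions_upper_rule)

def sessions_lower_rule(model, d):
    return model.Charging_sessions[d] >= model.Energy_total[d]/model.Battery_capacity[d] - 1 + eps
model.sessions_lower = Constraint(model.Drones, rule=sessions_lower_rule)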
I am running into a few issues using the GRASS GIS module r.accumulate from Python. I use the module to calculate sub-watersheds for over 7000 measurement points. Unfortunately, the output of the algorithm is nested, so all sub-watersheds overlap each other. Running the r.accumulate sub-watershed module takes roughly 2 minutes for either one or multiple points; I assume the bottleneck is loading the direction raster.
I was wondering whether an unnested variant is available in GRASS GIS and, if not, how to overcome the bottleneck of loading the direction raster every time the module is called. Below is a code snippet of what I have tried so far (resulting in a nested variant):
locations = VectorTopo('locations', mapset='PERMANENT')
locations.open('r')
points = []
for i in range(len(locations)):
    points.append(locations.read(i+1).coords())

for j in range(0, len(points), 255):
    output = "watershed_batch_{}#Watersheds".format(j)
    gs.run_command("r.accumulate", direction='direction#PERMANENT', subwatershed=output, overwrite=True, flags="r", coordinates=points[j:j+255])
    gs.run_command('r.stats', flags="ac", input=output, output="stat_batch_{}.csv".format(j), overwrite=True)
Any thoughts or ideas are very welcome.
I already replied to your email, but now that I see your Python code I better understand your "overlapping" issue. In this case, you don't want to feed individual outlet points one at a time. You can just run
r.accumulate direction=direction#PERMANENT subwatershed=output outlet=locations
r.accumulate's outlet option can handle multiple outlets and will generate non-overlapping subwatersheds.
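In the Python snippet from the question, that single call could look roughly like this (keeping the gs alias for grass.script; the output map name is illustrative):
# One call with all outlets at once; the outlet option produces
# non-overlapping subwatersheds.
gs.run_command("r.accumulate",
               direction="direction#PERMANENT",
               subwatershed="subwatersheds_all",
               outlet="locations",
               overwrite=True)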
The answer provided via email was very useful. To share it, I have provided the code below to do an unnested subwatershed calculation. A small remark: I had to feed the coordinates in batches, as the list of coordinates exceeded the maximum number of characters Windows could handle.
Thanks to @Huidae Cho, the call to r.accumulate to calculate subwatersheds and the longest flow path can now be done in one call instead of two separate calls.
The output is unnested basins, where the larger subwatersheds are separated from the smaller sub-basins instead of being clipped up into them. This has to do with the fact that the output is in raster format, where each cell can only represent one basin.
gs.run_command('g.mapset', mapset='Watersheds')
gs.run_command('g.region', rast='direction#PERMANENT')
StationIds = list(gs.vector.vector_db_select('locations_snapped_new', columns='StationId')["values"].values())
XY = list(gs.vector.vector_db_select('locations_snapped_new', columns='x_real,y_real')["values"].values())

for j in range(0, len(XY), 255):
    output_ws = "watershed_batch_{}#Watersheds".format(j)
    output_lfp = "lfp_batch_{}#Watersheds".format(j)
    output_lfp_unique = "lfp_unique_batch_{}#Watersheds".format(j)
    gs.run_command("r.accumulate", direction='direction#PERMANENT', subwatershed=output_ws, flags="ar", coordinates=XY[j:j+255], lfp=output_lfp, id=StationIds[j:j+255], id_column="id", overwrite=True)
    gs.run_command("r.to.vect", input=output_ws, output=output_ws, type="area", overwrite=True)
    gs.run_command("v.extract", input=output_lfp, where="1 order by id", output=output_lfp_unique, overwrite=True)
To export the unique watersheds I used the following code. I had to convert the longest_flow_path to points, as some of the longest flow paths intersected with the corner boundary of the watershed next to it. Some longest flow paths were thus not fully within their subwatershed. See the image below, where the red line (longest flow path) touches the subwatershed boundary:
[image: longest flow path (red line) touching the boundary of the adjacent subwatershed]
gs.run_command('g.mapset', mapset='Watersheds')
lfps = gs.list_grouped('vect', pattern='lfp_unique_*')['Watersheds']
ws = gs.list_grouped('vect', pattern='watershed_batch*')['Watersheds']
files = np.stack((lfps, ws)).T
#print(files)
for file in files:
    print(file)
    ids = list(gs.vector.vector_db_select(file[0], columns="id")["values"].values())
    for idx in ids:
        idx = int(idx[0])
        expr = f'id="{idx}"'
        gs.run_command('v.extract', input=file[0], where=expr, output="tmp_lfp", overwrite=True)
        gs.run_command("v.to.points", input="tmp_lfp", output="tmp_lfp_points", use="vertex", overwrite=True)
        gs.run_command('v.select', ainput=file[1], binput="tmp_lfp_points", output="tmp_subwatersheds", overwrite=True)
        gs.run_command('v.db.update', map="tmp_subwatersheds", col="value", value=idx)
        gs.run_command('g.mapset', mapset='vector_out')
        gs.run_command('v.dissolve', input="tmp_subwatersheds#Watersheds", output="subwatersheds_{}".format(idx), col="value", overwrite=True)
        gs.run_command('g.mapset', mapset='Watersheds')
        gs.run_command("g.remove", flags="f", type="vector", name="tmp_lfp,tmp_subwatersheds")
I ended up with a vector for each subwatershed.
Based on a given place name, I use OSMNX to get street network graphs. I laid a grid over the street network, dividing it into multiple cells, and each cell's polygon is stored in a GeoSeries object. With the following function I would like to store the basic_stats/extended_stats in a list and later convert it to a dataframe:
def stats_per_cell(geoseries):
    network_stats = []
    for i in range(0, len(geoseries)):
        try:
            # Get graph for grid cell -> row
            row = ox.graph_from_polygon(geoseries[i], truncate_by_edge=True)
            # Save basic stats for graph in list -> grid_stats
            grid_stats = ox.basic_stats(row)
            # Add grid cell id
            grid_stats['index'] = i
            print(grid_stats['index'])
            # Append in list -> network_stats
            network_stats.append(grid_stats)
            print(grid_stats)
        except:
            print('Error' + str(i))
            traceback.print_exc()
            continue
    # Save entire list in df -> street features
    street_features = pd.DataFrame.from_dict(network_stats, orient='columns')
    return street_features
The code works fine; however, after, let's say, 1000 iterations it becomes very slow compared to the earlier iterations. I tried splitting the number of iterations per run, reduced the dictionary from basic_stats to only a selection of keys, and assigned more memory to the process.
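The key filtering could look roughly like this (the key names are assumptions based on typical basic_stats output; check the dict your OSMNX version actually returns):
# Minimal sketch: keep only a few entries from basic_stats for the current cell.
keep = ('n', 'm', 'k_avg', 'street_length_total')
grid_stats = {k: v for k, v in ox.basic_stats(row).items() if k in keep}
grid_stats['index'] = i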
Still, it takes quite a long time. I used Line_Profiler to get the time for each line.
Does anyone have a suggestion as to what might cause the slow run time, how to increase the speed, or what else I could do to improve the code?
Cheers
I am using pygal to make an interactive map showing world country populations from 2010. I am trying to find a way to have the populations display with commas inserted, i.e. as 10,000 rather than simply 10000.
I have already tried using "{:,}".format(x) when reading the numbers into my lists for the different population levels, but it causes an error. I believe this is because it changes the value to a string.
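For illustration, formatting at read time turns the number into a string before pygal ever sees it:
value = "{:,}".format(10000)
print(value)        # '10,000'
print(type(value))  # <class 'str'>, so pygal no longer receives a number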
I also tried inserting a piece of code I found online:
wm.value_formatter = lambda x: "{:,}".format(x)
This doesn't cause any errors, but it doesn't fix how the numbers are formatted either. I am hoping someone might know of a built-in function such as:
wm_style = RotateStyle('#336699')
which lets me set a color scheme.
Below is the part of my code that plots the map.
wm = World()
wm.force_uri_protocol = "http"
wm_style = RotateStyle('#996699')
wm.value_formatter = lambda x: "{:,}".format(x)
wm.value_formatter = lambda y: "{:,}".format(y)
wm = World(style=wm_style)
wm.title = "Country populations year 2010"
wm.add('0-10 million', cc_pop_low)
wm.add("10m to 1 billion", cc_pop_mid)
wm.add('Over 1 billion', cc_pop_high)
wm.render_to_file('world_population.svg')
Setting the value_formatter property will change the label format, but in your code you recreate the World object after setting the property, and this newly created object has the default value formatter. You can also remove one of the two lines setting the value_formatter property, as they both achieve the same thing.
Re-ordering the code will fix your problem:
wm_style = RotateStyle('#996699')
wm = World(style=wm_style)
wm.value_formatter = lambda x: "{:,}".format(x)
wm.force_uri_protocol = "http"
wm.title = "Country populations year 2010"
wm.add('0-10 million', cc_pop_low)
wm.add("10m to 1 billion", cc_pop_mid)
wm.add('Over 1 billion', cc_pop_high)
wm.render_to_file('world_population.svg')
I give a lot of information on the methods that I used to write my code. If you just want to read my question, skip to the quotes at the end.
I'm working on a project that has a goal of detecting sub populations in a group of patients. I thought this sounded like the perfect opportunity to use association rule mining as I'm currently taking a class on the subject.
There are 42 variables in total. Of those, 20 are continuous and had to be discretized. For each variable, I used the Freedman-Diaconis rule to determine how many categories to divide it into.
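For reference, the Freedman-Diaconis rule sets the bin width to h = 2 * IQR * n^(-1/3), and the number of bins then follows as (max - min) / h, rounded up, which is what the function below computes from the sorted column values.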
def Freedman_Diaconis(column_values):
    # sort the list first
    column_values[1].sort()
    first_quartile = int(len(column_values[1]) * .25)
    third_quartile = int(len(column_values[1]) * .75)
    fq_value = column_values[1][first_quartile]
    tq_value = column_values[1][third_quartile]
    iqr = tq_value - fq_value
    n_to_pow = len(column_values[1])**(-1/3)
    h = 2 * iqr * n_to_pow
    retval = (column_values[1][-1] - column_values[1][1])/h
    test = int(retval+1)
    return test
From there I used min-max normalization
def min_max_transform(column_of_data, num_bins):
    min_max_normalizer = preprocessing.MinMaxScaler(feature_range=(1, num_bins))
    data_min_max = min_max_normalizer.fit_transform(column_of_data[1])
    data_min_max_ints = take_int(data_min_max)
    return data_min_max_ints
to transform my data, and then I simply took the integer portion to get the final categorization.
def take_int(list_of_float):
    ints = []
    for flt in list_of_float:
        asint = int(flt)
        ints.append(asint)
    return ints
I then also wrote a function that I used to combine this value with the variable name.
def string_transform(prefix, column, index):
    transformed_list = []
    transformed = ""
    if index < 4:
        for entry in column[1]:
            transformed = prefix+str(entry)
            transformed_list.append(transformed)
    else:
        prefix_num = prefix.split('x')
        for entry in column[1]:
            transformed = str(prefix_num[1])+'x'+str(entry)
            transformed_list.append(transformed)
    return transformed_list
This was done to differentiate variables that have the same value, but appear in different columns. For example, having a value of 1 for variable x14 means something different from getting a value of 1 in variable x20. The string transform function would create 14x1 and 20x1 for the previously mentioned examples.
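For illustration, a couple of hypothetical calls (assuming, as in the code above, that the data list lives at column[1]):
# Hypothetical usage of string_transform; the tuples mimic the (name, values)
# column structure used elsewhere in this code.
early = string_transform("trt", ("trt", [0, 1, 1]), 0)    # ['trt0', 'trt1', 'trt1']
later = string_transform("x14", ("x14", [1, 3, 2]), 14)   # ['14x1', '14x3', '14x2']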
After this, I wrote everything to a file in basket format
def create_basket(list_of_lists, headers):
    #for filename in os.listdir("."):
    #    if filename.e
    if not os.path.exists('baskets'):
        os.makedirs('baskets')

    down_length = len(list_of_lists[0])

    with open('baskets/dataset.basket', 'w') as basketfile:
        basket_writer = csv.DictWriter(basketfile, fieldnames=headers)
        for i in range(0, down_length):
            basket_writer.writerow({"trt": list_of_lists[0][i], "y": list_of_lists[1][i], "x1": list_of_lists[2][i],
                                    "x2": list_of_lists[3][i], "x3": list_of_lists[4][i], "x4": list_of_lists[5][i],
                                    "x5": list_of_lists[6][i], "x6": list_of_lists[7][i], "x7": list_of_lists[8][i],
                                    "x8": list_of_lists[9][i], "x9": list_of_lists[10][i], "x10": list_of_lists[11][i],
                                    "x11": list_of_lists[12][i], "x12": list_of_lists[13][i], "x13": list_of_lists[14][i],
                                    "x14": list_of_lists[15][i], "x15": list_of_lists[16][i], "x16": list_of_lists[17][i],
                                    "x17": list_of_lists[18][i], "x18": list_of_lists[19][i], "x19": list_of_lists[20][i],
                                    "x20": list_of_lists[21][i], "x21": list_of_lists[22][i], "x22": list_of_lists[23][i],
                                    "x23": list_of_lists[24][i], "x24": list_of_lists[25][i], "x25": list_of_lists[26][i],
                                    "x26": list_of_lists[27][i], "x27": list_of_lists[28][i], "x28": list_of_lists[29][i],
                                    "x29": list_of_lists[30][i], "x30": list_of_lists[31][i], "x31": list_of_lists[32][i],
                                    "x32": list_of_lists[33][i], "x33": list_of_lists[34][i], "x34": list_of_lists[35][i],
                                    "x35": list_of_lists[36][i], "x36": list_of_lists[37][i], "x37": list_of_lists[38][i],
                                    "x38": list_of_lists[39][i], "x39": list_of_lists[40][i], "x40": list_of_lists[41][i]})
and I used the apriori package in Orange to see if there were any association rules.
rules = Orange.associate.AssociationRulesSparseInducer(patient_basket, support=0.3, confidence=0.3)
print "%4s %4s %s" % ("Supp", "Conf", "Rule")
for r in rules:
    my_rule = str(r)
    split_rule = my_rule.split("->")
    if 'trt' in split_rule[1]:
        print 'treatment rule'
        print "%4.1f %4.1f %s" % (r.support, r.confidence, r)
Using this technique, I found quite a few association rules with my testing data.
THIS IS WHERE I HAVE A PROBLEM
When I read the notes for the training data, there is this note:
...That is, the only reason for the differences among observed responses to the same treatment across patients is random noise. Hence, there is NO meaningful subgroup for this dataset...
My question is: why do I get multiple association rules that would imply that there are subgroups, when according to the notes I shouldn't see anything?
I'm getting lift values above 2, as opposed to the 1 you should expect if everything were random, as the notes state.
Supp Conf Rule
0.3 0.7 6x0 -> trt1
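(Recall that lift(A -> B) = confidence(A -> B) / support(B) = support(A, B) / (support(A) * support(B)); under independence it should be about 1, which is why values above 2 look like genuine structure.)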
Even though my code runs, I'm not getting results anywhere close to what should be expected. This leads me to believe that I messed something up, but I'm not sure what it is.
After some research, I realized that my sample size is too small for the number of variables that I have. I would need a much larger sample size in order to really use the method I was using. In fact, the method I tried to use was developed with the assumption that it would be run on databases with hundreds of thousands or millions of rows.