I am trying to create a new unique id field in an access table. I already have one field called SITE_ID_FD, but it is historical. The format of the unique value in that field isn't what our current format is, so I am creating a new field with the new format.
Old Format = M001, M002, K003, K004, S005, M006, etc
New format = 12001, 12002, 12003, 12004, 12005, 12006, etc
I wrote the following script:
fc = r"Z:\test.gdb\testfc"
x = 12001
cursor = arcpy.UpdateCursor(fc)
for row in cursor:
row.setValue("SITE_ID", x)
cursor.updateRow(row)
x+= 1
This works fine, but it populates the new id field based on the default sorting by ObjectID. I need to sort on 2 fields first and then populate the new id field based on that sorting (I want to sort by a field called SITE and then by the old id field SITE_ID_FD).
I tried manually sorting the 2 fields in hopes that Python would honor the sort, but it doesn't. I'm not sure how to do this in Python. Can anyone suggest a method?
A possible solution is to specify the sort fields when you create your update cursor (sorry for my English); this is explained in the documentation: http://help.arcgis.com/en/arcgisdesktop/10.0/help/index.html#//000v0000003m000000
so it goes like this:
UpdateCursor(dataset, {where_clause}, {spatial_reference}, {fields}, {sort_fields})
You are interested only in the sort_fields argument, so assuming that your code works well on a sorted table and that you want the table ordered ascending, the second part of your code should look like this:
fc = r"Z:\test.gdb\testfc"
x = 12001
cursor = arcpy.UpdateCursor(fc,"","","","SITE A, SITE_ID_FD A")
#if you want to sort it descending you need to write it with a D
#>> cursor = arcpy.UpdateCursor(fc,"","","","SITE D, SITE_ID_FD D")
for row in cursor:
row.setValue("SITE_ID", x)
cursor.updateRow(row)
x+= 1
I hope this helps.
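If you are on ArcGIS 10.1 or later, the newer arcpy.da.UpdateCursor offers a similar option through its sql_clause parameter, whose postfix part can carry an ORDER BY (supported for geodatabase sources). A minimal, untested sketch:

import arcpy

fc = r"Z:\test.gdb\testfc"
x = 12001
fields = ["SITE", "SITE_ID_FD", "SITE_ID"]
# the postfix ORDER BY makes the cursor walk the rows in the desired order
with arcpy.da.UpdateCursor(fc, fields,
                           sql_clause=(None, "ORDER BY SITE, SITE_ID_FD")) as cursor:
    for row in cursor:
        row[2] = x          # SITE_ID
        cursor.updateRow(row)
        x += 1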
Added a link to the arcpy docs in a comment, but from what I can tell, this will create a new, sorted dataset:
import arcpy
from arcpy import env

env.workspace = r"z:\test.gdb"
arcpy.Sort_management("testfc", "testfc_sort",
                      [["SITE", "ASCENDING"], ["SITE_ID_FD", "ASCENDING"]])
And this will, on the sorted dataset, do what you want:
fc = r"Z:\test.gdb\testfc_sort"
x = 12001
cursor = arcpy.UpdateCursor(fc)
for row in cursor:
row.setValue("SITE_ID", x)
cursor.updateRow(row)
x+= 1
I'm assuming there's some way to just copy the sorted/modified dataset back over the original, so it's all good?
I'll admit, I don't use arcpy, and the docs could be a lot more explicit.
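If overwriting the original is the sticking point, another option is to push the ids assigned on the sorted copy back onto the original table. A hedged sketch using arcpy.da cursors, assuming the (SITE, SITE_ID_FD) pair uniquely identifies a row:

import arcpy

# build a lookup of new ids from the sorted copy
id_map = {}
with arcpy.da.SearchCursor(r"Z:\test.gdb\testfc_sort",
                           ["SITE", "SITE_ID_FD", "SITE_ID"]) as sc:
    for site, old_id, new_id in sc:
        id_map[(site, old_id)] = new_id

# write the new ids onto the original feature class
with arcpy.da.UpdateCursor(r"Z:\test.gdb\testfc",
                           ["SITE", "SITE_ID_FD", "SITE_ID"]) as uc:
    for row in uc:
        row[2] = id_map[(row[0], row[1])]
        uc.updateRow(row)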
Related
I'm using Python unittest and SQLAlchemy to test data models that store a WTForms form in MariaDB.
The test should create a dataset, write it to the db, read it back, and compare whether the original dataset is the same as the stored data.
So the partial test looks like that:
#set data
myForm = NiceForm()
myForm.name = "Ben"
#write data
db.session.add(myForm)
db.session.commit()
#read data
loadedForms = NiceForm.query.all()
#check that only one entry is in db
self.assertEqual(len(loadedForms), 1)
#compare stored data with dataset
self.assertIn(myForm, loadedForms)
The test seems to work fine. Now I tried to find out whether the test fails if dataset != stored data. So I changed the dataset before comparing it, like this:
#set data
myForm = NiceForm()
myForm.name = "Ben"
#write data
db.session.add(myForm)
db.session.commit()
#read data
loadedForms = NiceForm.query.all()
#modify dataset
myForm.name = "Foo"
#show content of both
print(myForm.name)
print(loadedForms[0].name)
#check that only one entry is in db
self.assertEqual(len(loadedForms), 1)
#compare stored data with dataset
self.assertIn(myForm, loadedForms)
This test still passed. Why? I printed the content of myForm.name and loadedForms[0].name; both were set to Foo. That is the reason why self.assertIn(myForm, loadedForms) passed the test, but I don't understand:
Why is the content of loadedForms changed when Foo was only applied to myForm?
The row identity of myForm does not change by changing one of its values.
Row numbers have no meaning in a table, but to make the issue clear I will still use them.
Row 153 has 2 fields. Field name = "Ben" and field homeruns = 3.
Now we change the home runs (Ben has hit a home run);
Row 153 has 2 fields. Field name = "Ben" and field homeruns = 4.
It is still row 153, so your assertIn will still return True, even though one of the values in the row has changed. You only test identity.
If it wouldn't, changing a field in a table row would have to be saved by an insert into the table instead of an update to the row. That is not correct, of course; how many Bens do we have? One. And he has 4 home runs, not 3 or 4 depending on which record you look at.
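In SQLAlchemy terms, the session's identity map hands back the very object you added, so a membership test can never catch the change; compare values instead. A minimal sketch, reusing the names from the question:

# loadedForms[0] and myForm are the same Python object (identity map),
# so assertIn passes no matter what the attributes contain
self.assertIs(loadedForms[0], myForm)

# compare values against the expected literal instead; this fails
# once myForm.name has been changed to "Foo"
self.assertEqual(loadedForms[0].name, "Ben")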
I am trying to use the petl library to build an ETL process that copies data between two tables. The destination table contains a unique slug field. For that, I wrote my script so it would identify duplicate slugs and convert them by appending the ID to the slug value.
table = etl.fromdb(source_con, 'SELECT * FROM user')

# get whatever remains as duplicates
duplicates = etl.duplicates(table, 'slug')
for dup in [i for i in duplicates.values('id')]:
    table = etl.convert(
        table,
        'slug',
        lambda v, row: '{}-{}'.format(slugify_unicode(v), str(row.id).encode('hex')),
        where=lambda row: row.id == dup,
        pass_row=True
    )
The above did not work as expected; it seems like the table object still contains duplicate slug values after the loop.
Can anyone advise?
Thanks
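A likely cause: petl tables are lazy, so none of the convert steps actually run until the table is iterated, and by then every where lambda refers to the final value of dup because Python closures bind late. Binding dup as a default argument when defining the lambda is one possible fix; a minimal sketch:

for dup in duplicates.values('id'):
    table = etl.convert(
        table,
        'slug',
        lambda v, row: '{}-{}'.format(slugify_unicode(v), str(row.id).encode('hex')),
        # dup=dup freezes the current id for this convert step
        where=lambda row, dup=dup: row.id == dup,
        pass_row=True
    )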
In a postgresql database:
class Persons(models.Model):
    person_name = models.CharField(max_length=10, unique=True)
The persons.csv file contains 1 million names.
$cat persons.csv
Name-1
Name-2
...
Name-1000000
I want to:
Create the names that do not already exist
Query the database and fetch the id for each name contained in the csv file.
My approach:
Use the COPY command or the django-postgres-copy application that implements it.
Also take advantage of the new Postgresql-9.5+ upsert feature.
Now, all the names in the csv file, are also in the database.
I need to get their ids from the database, either in memory or in another csv file, in an efficient way:
Use Q objects
list_of_million_q = <iterate csv and append Qs>
million_names = Names.objects.filter(list_of_million_q)
or
Use __in to filter based on a list of names:
list_of_million_names = <iterate csv and append strings>
million_names = Names.objects.filter(
    person_name__in=list_of_million_names
)
or
?
I do not feel that any of the above approaches for fetching the ids is efficient.
Update
There is a third option, along the lines of this post, that should be a great solution combining all of the above.
Something like:
SELECT * FROM persons;
make a name: id dictionary out of the names received from the database:
db_dict = {'Harry': 1, 'Bob': 2, ...}
Query the dictionary:
ids = []
for name in list_of_million_names:
    if name in db_dict:
        ids.append(db_dict[name])
This way you're using the quick dictionary indexing as opposed to the slower if x in list approach.
But the only way to really know for sure is to benchmark these 3 approaches.
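For the dictionary approach, Django can fetch the name-to-id mapping in a single query; a minimal sketch, assuming the Persons model above:

# one query: fetch (name, id) pairs and build the lookup dictionary
db_dict = dict(Persons.objects.values_list('person_name', 'id'))

# dictionary lookups instead of per-name queries
ids = [db_dict[name] for name in list_of_million_names if name in db_dict]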
This post describes how to use RETURNING with ON CONFLICT so that, while the contents of the csv file are being inserted into the database, the ids are saved in another table, both when an insertion succeeds and when the insertion is skipped because of the unique constraint.
I have tested it in sqlfiddle, where I used a setup that resembles the one used for the COPY command, which inserts into the database straight from a csv file while respecting the unique constraints.
The schema:
CREATE TABLE IF NOT EXISTS label (
    id serial PRIMARY KEY,
    label_name varchar(200) NOT NULL UNIQUE
);

INSERT INTO label (label_name) VALUES
    ('Name-1'),
    ('Name-2');

CREATE TABLE IF NOT EXISTS ids (
    id serial PRIMARY KEY,
    label_ids varchar(12) NOT NULL
);
The script:
CREATE TEMP TABLE tmp_table
    (LIKE label INCLUDING DEFAULTS)
    ON COMMIT DROP;

INSERT INTO tmp_table (label_name) VALUES
    ('Name-2'),
    ('Name-3');

WITH ins AS (
    INSERT INTO label
    SELECT *
    FROM tmp_table
    ON CONFLICT (label_name) DO NOTHING
    RETURNING id
)
INSERT INTO ids (label_ids)
SELECT id FROM ins
UNION ALL
SELECT l.id
FROM tmp_table
JOIN label l USING (label_name);
The output:
SELECT * FROM ids;
SELECT * FROM label;
I've got an ESRI point shapefile with (amongst others) an nMSLINK field and a DIAMETER field. The MSLINK is not unique because of a spatial join. What I want to achieve is to keep only the features in the shapefile that have a unique MSLINK and the smallest DIAMETER value, together with the corresponding values in the other fields. I can use a search cursor to achieve this (looping through all features and removing each feature that does not comply), but this takes ages (> 75000 features). I was wondering if e.g. numpy could do the trick faster in ArcMap/arcpy.
I think that kind of processing would definitely be a lot faster if you work in memory instead of interacting with ArcGIS, for example by putting all the rows into a Python object first (a namedtuple is probably a good option here). Then you can find out which rows you want to delete or insert.
The fastest approach depends on the data: a) if you have a lot of repeated (MSLINK) rows, then the fastest is to insert just the ones you need into a new layer; b) if the rows to be deleted are just a few compared to the total number of rows, then deleting is faster.
For a) you'll need to fetch all fields into the tuple, including the point coordinates, so that you can just create a new feature class and insert the new rows.
# Example of variant a:
from collections import namedtuple
import arcpy

# assuming the following:
source_fc   # contains the name of the feature class
the_path    # contains the path to the shape
cleaned_fc  # the name of the cleaned feature class

# use all fields of source_fc plus the shape token to get a tuple with xy
# coordinates (using 'mslink' and 'diam' here to simplify the example)
fields = ['mslink', 'diam', 'field3', ... ]
all_fields = fields + ['SHAPE@XY']

# define a namedtuple to hold and work with the rows, use the name 'point' to
# hold the coordinates tuple
Row = namedtuple('Row', fields + ['point'])

data = []
with arcpy.da.SearchCursor(source_fc, all_fields) as sc:
    for r in sc:
        # unzip the values from each row into a new Row (namedtuple) and
        # append it to data
        data.append(Row(*r))

# now just delete the rows we don't want; for this, the easiest way is
# probably to sort the list first by MSLINK and then by diameter ...
data = sorted(data, key=lambda x: (x.mslink, x.diam))

# ... and then keep only the first entry for each mslink
to_keep = []
last_mslink = None
for d in data:
    if last_mslink != d.mslink:
        last_mslink = d.mslink
        to_keep.append(d)

# create a new feature class with the same fields as the source_fc
arcpy.CreateFeatureclass_management(
    out_path=the_path, out_name=cleaned_fc, template=source_fc)

with arcpy.da.InsertCursor(cleaned_fc, all_fields) as ic:
    for r in to_keep:
        ic.insertRow(r)
And for alternative b), I would fetch just 3 fields: a unique ID, MSLINK and the diameter. Then build a delete list (here you only need the unique ids). Then loop through the feature class again and delete the rows whose id is on your delete list. Just to be sure, I would duplicate the feature class first and work on a copy.
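A rough, untested sketch of variant b, assuming the nMSLINK and DIAMETER field names from the question and reusing the source_fc placeholder (pointed at the copy, as suggested):

import arcpy

# 1) find, per MSLINK value, the OID of the row with the smallest diameter
best = {}
with arcpy.da.SearchCursor(source_fc, ['OID@', 'nMSLINK', 'DIAMETER']) as sc:
    for oid, mslink, diam in sc:
        if mslink not in best or diam < best[mslink][1]:
            best[mslink] = (oid, diam)
keep = {oid for oid, _ in best.values()}

# 2) delete every row whose OID was not kept
with arcpy.da.UpdateCursor(source_fc, ['OID@']) as uc:
    for oid, in uc:
        if oid not in keep:
            uc.deleteRow()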
There are a few steps you can take to accomplish this task more efficiently. First and foremost, using the data access (arcpy.da) cursors instead of the older cursors will increase the speed of your process. This assumes you are working in 10.1 or beyond. Then you can employ Summary Statistics, namely its ability to find a minimum value based on a case field. For you, the case field would be nMSLINK.
The code below first creates a statistics table with all unique 'nMSLINK' values and their corresponding minimum 'DIAMETER' values. I then use Table Select to select only the rows in the table whose 'FREQUENCY' field is not 1. From there I iterate through the new table and build a list of strings that will make up the final SQL statement. After this iteration, I use Python's join function to create an SQL string that looks something like this:
("nMSLINK" = 'value1' AND "DIAMETER" <> 624.0) OR ("nMSLINK" = 'value2' AND "DIAMETER" <> 1302.0) OR ("nMSLINK" = 'value3' AND "DIAMETER" <> 1036.0) ...
The sql selects rows where nMSLINK values are not unique and where DIAMETER values are not the minimum. Using this SQL, I select by attribute and delete selected rows.
This SQL statement is written assuming your feature class is in a file geodatabase and that 'nMSLINK' is a string field and 'DIAMETER' is a numeric field.
The code has the following inputs:
Feature: The feature to be analyzed
Workspace: A folder that will store a couple intermediate tables temporarily
TempTableName1: A name for one temporary table.
TempTableName2: A name for a second temporary table
Field1 = The nonunique field
Field2 = The field with the numeric values that you wish to find the lowest of
Code:
# Import modules
from arcpy import *
import os
# Local variables
#Feature to analyze
Feature = r"C:\E1B8\ScriptTesting\Workspace\Workspace.gdb\testfeatureclass"
#Workspace to export table of identicals
Workspace = r"C:\E1B8\ScriptTesting\Workspace"
#Name of temp DBF table file
TempTableName1 = "Table1"
TempTableName2 = "Table2"
#Field names
Field1 = "nMSLINK" #nonunique
Field2 = "DIAMETER" #field with numeric values
#Make layer to allow selection
MakeFeatureLayer_management (Feature, "lyr")
#Path for first temp table
Table = os.path.join (Workspace, TempTableName1)
#Create statistics table with min value
Statistics_analysis (Feature, Table, [[Field2, "MIN"]], [Field1])
#SQL Select rows with frequency not equal to one
sql = '"FREQUENCY" <> 1'
# Path for second temp table
Table2 = os.path.join (Workspace, TempTableName2)
# Select rows with Frequency not equal to one
TableSelect_analysis (Table, Table2, sql)
#Empty list for sql bits
li = []
# Iterate through second table
cursor = da.SearchCursor (Table2, [Field1, "MIN_" + Field2])
for row in cursor:
    # Add SQL bit to list
    sqlbit = '("' + Field1 + '" = \'' + row[0] + '\' AND "' + Field2 + '" <> ' + str(row[1]) + ")"
    li.append (sqlbit)
del row
del cursor
#Create SQL for selection of unwanted features
sql = " OR ".join (li)
print (sql)
#Select based on SQL
SelectLayerByAttribute_management ("lyr", "", sql)
#Delete selected features
DeleteFeatures_management ("lyr")
#delete temp files
Delete_management ("lyr")
Delete_management (Table)
Delete_management (Table2)
This should be quicker than a straight-up cursor. Let me know if this makes sense. Good luck!
I have some dbf files that I want to add new fields to. To do so, I'm using dbfpy to open the original dbf, copy all fields (or the ones I want to keep) and records and then create a new file with those fields plus the new ones that I want. All is working great, except for one minor detail: I can't manage to keep the original fields' types, since I don't know how to obtain them. What I'm doing is to create all the fields in the new file as "C" (character), which so far works for what I need right now but might be an issue eventually.
The real problem is that there is no documentation available. I searched through the package files looking for examples, but couldn't find an answer to this question (it might just be how green I still am with Python... I'm definitely not an expert).
An example of the code:
from dbfpy import dbf
import sys

org_db_file = str(sys.argv[1])
org_db = dbf.Dbf(org_db_file, new=False)
new_db_file = str(sys.argv[2])
new_db = dbf.Dbf(new_db_file, new=True)

#Obtain original field names:
fldnames = []
fldsize = {}
for name in org_db.fieldNames:
    fldnames.append(name)
    fldsize[name] = 0

#Cycle thru table entries:
for rec in org_db:
    #Cycle thru columns to obtain fields' name and value:
    for name in fldnames:
        value = str(rec[name])
        if len(value) > fldsize[name]:
            fldsize[name] = len(value)

#Copy original fields to new table:
for name in fldnames:
    new_db.addField((name, "C", fldsize[name]))

#Add new fields:
new_fieldname = "some_name"
new_db.addField((new_fieldname, "C", 2))

#Copy original entries and store new values:
for rec in org_db:
    #Create new record instance for new table:
    new_rec = new_db.newRecord()
    #Populate fields:
    for field in fldnames:
        new_rec[field] = rec[field]
    #Store value of new field for record i:
    new_rec[new_fieldname] = "some_value"
    new_rec.store()
new_db.close()
Thanks in advance for your time.
Cheers.
I don't have any experience with dbfpy other than that, when I first went looking several years ago, it (and several others) did not meet my needs. So I wrote my own.
Here is how you would accomplish your task using it:
import dbf
import sys

org_db_file = sys.argv[1]
org_db = dbf.Table(org_db_file)
new_db_file = sys.argv[2]
# postpone until we have the field names...
# new_db = dbf.Dbf(new_db_file, new=True)

# Obtain original field list:
fields = org_db.field_names
for field in fields[:]:  # cycle through a separate list
    if field == "something we don't like":
        fields.remove(field)

# now get definitions for the fields we keep
field_defs = org_db.structure(fields)

# Add new fields:
new_fieldname = "some_name"
field_defs.append(new_fieldname + " C(2)")

# now create the new table
new_db = org_db.new(new_db_file, field_specs=field_defs)

# open both tables
with dbf.Tables(org_db, new_db):
    # Copy original entries and store new values:
    for rec in org_db:
        # Create new record instance for new table:
        new_db.append()
        # Populate fields:
        with new_db.last_record as new_rec:
            # copy only the fields kept from the original table
            for field in fields:
                new_rec[field] = rec[field]
            # Store value of new field for record i:
            new_rec[new_fieldname] = "some_value"