How to overwrite older existing ID's when merging into new table?

How to overwrite older existing ID's when merging into new table? - python

I currently am cacheing data from an API by storing all data to a temporary table and merging into a non-temp table where ID/UPDATED_AT is unique.
ID/UPDATED_AT example:
MERGE
INTO vet_data_patients_stg
USING vet_data_patients_temp_stg
ON vet_data_patients_stg.updated_at=vet_data_patients_temp_stg.updated_at
AND vet_data_patients_stg.id=vet_data_patients_temp_stg.id
WHEN NOT matched THEN
INSERT
(
id,
updated_at,
<<<my_other_fields>>>
)
VALUES
(
vet_data_patients_temp_stg.id,
vet_data_patients_temp_stg.updated_at,
<<<my_other_fields>>>
)
My issue is that this method will leave older ID's/UPDATED_AT's also in the table, but I only want the ID with the most recent UPDATED_AT, to remove the older UPDATED_AT's, and only have unique ID's in the table.
Can I accomplish this by modifying my merge statement?
My python way of auto-generating the string is:
merge_string = f'MERGE INTO {str.upper(tablex)}_{str.upper(envx)}
USING {str.upper(tablex)}_TEMP_{str.upper(envx)}
ON '+' AND '.join(f'{str.upper(tablex)}_{str.upper(envx)}.{x}={str.upper(tablex)}_TEMP_{str.upper(envx)}.{x}' for x in keysx) + f'
WHEN NOT MATCHED THEN INSERT ({field_columnsx})
VALUES ' + '(' + ','.join(f'{str.upper(tablex)}_TEMP_{str.upper(envx)}.{x}' for x in fieldsx) + ')'
EDIT - Examples to more clearly illustrate goal -
So if my TABLE_STG has:
ID|UPDATED_AT|FIELD
0|2018-01-01|X
1|2020-01-01|A
2|2020-02-01|B
And my API gets the following in TABLE_TEMP_STG:
ID|UPDATED_AT|FIELD
1|2020-02-01|A
2|2020-02-01|B
I currently end up with:
ID|UPDATED_AT|FIELD
0|2018-01-01|X
1|2020-01-01|A
1|2020-02-01|A
2|2020-02-01|B
But I really want tp remove the older updated_at's and end up with:
ID|UPDATED_AT|FIELD
0|2018-01-01|X
1|2020-02-01|A
2|2020-02-01|B

We can do deletes in the MATCHED branch of a MERGE statement. Your code needs to look like this:
MERGE
INTO vet_data_patients_stg
USING vet_data_patients_temp_stg
ON vet_data_patients_stg.updated_at=vet_data_patients_temp_stg.updated_at
AND vet_data_patients_stg.id=vet_data_patients_temp_stg.id
WHEN NOT matched THEN
INSERT
(
id,
updated_at,
<<<my_other_fields>>>
)
VALUES
(
vet_data_patients_temp_stg.id,
vet_data_patients_temp_stg.updated_at,
<<<my_other_fields>>>
)
WHEN matched THEN
UPDATE
SET some_other_field = vet_data_patients_temp_stg.some_other_field
DELETE WHERE 1 = 1
This will delete all the rows which are updated, that is all the updated rows.
Note that you need to include the UPDATE clause even though you want to delete all of them. The DELETE logic is applied only to records which are updated, but the syntax doesn't allow us to leave it out.
There is a proof of concept on db<>fiddle.
Re-writing the python code to generate this statement is left as an exercise for the reader :)
The Seeker hasn't posted a representative test case providing sample sets of input data and a desired outcome derived from those samples. So it may be that this doesn't do what they are expecting.

Related

Querying a varchar entry for a partial string

I am wondering if there is a way to query a table for a partial match to a string.
For instance, I have a table that lists drug treatments, but sometimes there are multiple drugs in
a single treatment. These come into the database as a varchar entry with both drugs separated by a semicolon (this formatting could be changed if it helped).
I know I can query on a full string, i.e.
Pharmacology() & 'treatment = "control"'
But if my entry is 'gabazine;TPMPA', is there a way to query on just 'TPMPA' and find all strings containing 'TPMPA'?
Alternatively, I could make a part table that populates just for cases where there are multiple drugs, which I could then use for querying these cases, but I am not sure how to set up the entries for better querying when the number of drugs is variable (i.e. is there a way to query inside a blob with a python list?)
Here's my table with some entries in case it helps:
my table
and my table definition (part table only there as a dummy):
#schema
class Pharmacology(dj.Computed):
definition = """
# information about pharmacological treatments
-> Presentation
-> PharmInfo
---
treatment :varchar(255) # string of the treatment name from hdf5 file
control_flag :int # 1 if control 0 if in drug
concentration :varchar(255) # drug concentration(s), "0" for controls
multiple_drug_flag :int # 1 if multiple drugs, else 0
"""
class MultipleDrugInfo(dj.Part):
definition = """
-> Pharmacology
---
list_of_drugs :blob # list of drugs in multi-drug cases

You can do that with the LIKE keyword
Pharmacology() & 'treatment LIKE "%TPMPA%"'
The % is wildcard in mysql.

Neo4j import csv / DHCP data controlling duplication

i am confused on how to import data
I have a csv from DHCP with _time, hostname, IP_addr
I would like to add any changed IPs as new relationships, but keep the old ip relationships with a status attribute inactive, also think I want to limt to the last 10.
I am not sure the easiest way to do this in cypher, or should I be in python for this complexity
maybe an always add (remove duplicates)/csv import
and a second query to deactivate any old ips (how do I query non current if i have time as an attribute of relationship)
and a third query to remove relationships that if more that 10 previous ips are hanging off it.
any help or thoughts would be greatly appreciated

Sounds like fun. Not sure if every host-ip combination appears only once in a csv or also at later times like an "still-here" update
Import Statement
LOAD CSV FROM "url" AS row
MERGE (h:Host {name:row.hostname})
MERGE (ip:IP {name:row.IP_addr})
MERGE (h)-[:IP]->(ip) ON CREATE SET rel.created = row._time, rel.status = 1
// optional for pre-existing/previous rels
ON MATCH SET rel.status = 0
SET rel.updated = row._time;
Cleanup statement
MATCH (h:Host) WHERE size( (h)-[:IP]->() ) > 1
MATCH (h)-[rel:IP]->(:IP)
WITH h,rel ORDER BY rel.updated DESC
WITH h, collect(rel) as rels
// not necessary when the status is set above
FOREACH (r in rels[1..9] | SET r.status=0)
FOREACH (r IN rels[10..-1] | DELETE r)
When the status is set correctly in the load statement
MATCH (h:Host)-[rel:IP {status:0}]->(:IP)
WITH h,rel ORDER BY rel.updated DESC
WITH h, collect(rel) as rels
FOREACH (r IN rels[9..-1] | DELETE r)

grouped box plots are too narrow to read

Actually everything works just fine: I try to compare different groups in my survey in one chart. thus I wrote the following code in Python (Jupyter-Notebook)
for value in values:
catpool=getcat()
py.offline.init_notebook_mode()
data = []
for cata in catpool:
for con in constraints:
data.append( go.Box( y=getdf(value,cata,con[0])['Y'+value],x=con[1], name=cata, showlegend=False, boxmean='sd',
#boxpoints='all',
jitter=0.3,
pointpos=0 ) )
layout=go.Layout(title="categorie: "+getclearname(value)+" - local space syntax measurements on placements<br>",yaxis=dict( title='percentage of range between flat extremes'),boxmode='group',showlegend=True)
fig = go.Figure(data=data, layout=layout)
py.offline.iplot(fig, filename='pandas-box-plot')
the function 'getdf' queries a column from a database
unfortunately it results in an unreadable diagram like this
a diagram with to narrow box plots
is it possible to give the groups less spacing and accordingly the boxplots in the group more? Or anything else that would make it more readable?
Thank you

I solved that problem at my own:
The strange behaviour occured, because of the for cata in catpool loop - data for boxplots in groups has to contain the group value in the data frame: so I just didn't loop at this place, but looped in the concatenation of the SQL-statement. So the same queries were carried out, but were joined together by "UNION" like this:
def formstmtelse (val, category, ext, constraint, nine=True):
stmt=""
if nine:
matrix=['11','12','13','21','22','23','31','32','33']
else:
matrix=['11']
j=0
for cin in category:
j=j+1
if j>1:
stmt=stmt+" UNION "
m=0
for cell in matrix:
m=m+1
if m>1:
stmt=stmt+"""UNION
"""
stmt=stmt+"""SELECT '"""+cin+"""' AS cata,
"""
if ext:
stmt=stmt+"""((("""+val+"""-MIN"""+val+""")/(MAX"""+val+"""-MIN"""+val+"""))*100) AS Y"""+val
else:
stmt=stmt+"""((("""+val+""")/(MAX"""+val+"""))*100) AS Y"""+val
stmt=stmt+"""
FROM `stests`
JOIN ssubject
ON ssubject.ssubjectID=stests.ssubjectID
JOIN scountries
ON CountryGrownUpIn=scountriesID
JOIN scontinents
ON Continent=scontinentsID
JOIN gridpoint
ON stests.FlatID=gridpoint.GFlatID AND G"""+cin+cell+"""=GIntID
JOIN
(
SELECT GFlatID AS GSUBFlatID"""
stmt=stmt+""",
MAX("""+val+""") AS MAX"""+val+""",MIN("""+val+""") AS MIN"""+val
stmt=stmt+"""
FROM gridpoint
GROUP BY GSUBFlatID
)
AS virtualtable
ON FlatID=virtualtable.GSUBFlatID
"""+constraint+" "
return stmt
that solves the problem. The 'x' attribute of the box call must be a list or dataframe, not just a single string, even though the string is always the same. I guess this was causing 'invisible' x-values making the automated scaling impossible, or in other words a new point in the legend for each query with a new constraint and for each category, only one dataframe is needed for every constraint. when changing this - fine adjustments could be made by adding boxgroupgap or groupgab attribute to layout...
please excuse my english

Filtering a set of data based on indices in line

I have a python script that pulls data from an external servers SQL database and sum's the values based on transaction numbers. I've gotten some assistance in cleaning up the result sets - which have been a huge help, but now I've hit another problem.
My original query:
SELECT th.trans_ref_no, th.doc_no, th.folio_yr, th.folio_mo, th.transaction_date, tc.prod_id, tc.gr_gals FROM TransHeader th, TransComponents tc WHERE th.term_id="%s" and th.source="L" and th.folio_yr="%s" and th.folio_mo="%s" and (tc.prod_id="TEXLED" or tc.prod_id="103349" or tc.prod_id="103360" or tc.prod_id="103370" or tc.prod_id="113107" or tc.prod_id="113093")and th.trans_ref_no=tc.trans_ref_no;
Returns a set of data that I've copied a snippet here:
"0520227370","0001063257","2014","01","140101","113107","000002000"
"0520227370","0001063257","2014","01","140101","TEXLED","000002550"
"0520227378","0001063265","2014","01","140101","113107","000001980"
"0520227378","0001063265","2014","01","140101","TEXLED","000002521"
"0520227380","0001063267","2014","01","140101","113107","000001500"
"0520227380","0001063267","2014","01","140101","TEXLED","000001911"
"0520227384","0001063271","2014","01","140101","113107","000003501"
"0520227384","0001063271","2014","01","140101","TEXLED","000004463"
"0520227384","0001063271","2014","01","140101","113107","000004000"
"0520227384","0001063271","2014","01","140101","TEXLED","000005103"
"0520227385","0001063272","2014","01","140101","113107","000007500"
"0520227385","0001063272","2014","01","140101","TEXLED","000009565"
"0520227388","0001063275","2014","01","140101","113107","000002000"
"0520227388","0001063275","2014","01","140101","TEXLED","000002553"
The updated query runs this twice and JOINS the trans_ref_no, which is the first position in the result set, so the first 6 lines get condensed into three, and the last four lines get condensed into two. The problem I'm having is getting transaction number 0520227384 to get condensed to two lines.
SELECT t1.trans_ref_no, t1.doc_no, t1.folio_yr, t1.folio_mo, t1.transaction_date, t1.prod_id, t1.gr_gals, t2.prod_id, t2.gr_gals FROM (SELECT th.trans_ref_no, th.doc_no, th.folio_yr, th.folio_mo, th.transaction_date, tc.prod_id, tc.gr_gals FROM Tms6Data.TransHeader th, Tms6Data.TransComponents tc WHERE th.term_id="00000MA" and th.source="L" and th.folio_yr="2014" and th.folio_mo="01" and (tc.prod_id="103349" or tc.prod_id="103360" or tc.prod_id="103370" or tc.prod_id="113107" or tc.prod_id="113093") and th.trans_ref_no=tc.trans_ref_no) t1 JOIN (SELECT th.trans_ref_no, th.doc_no, th.folio_yr, th.folio_mo, th.transaction_date, tc.prod_id, tc.gr_gals FROM Tms6Data.TransHeader th, Tms6Data.TransComponents tc WHERE th.term_id="00000MA" and th.source="L" and th.folio_yr="2014" and th.folio_mo="01" and tc.prod_id="TEXLED" and th.trans_ref_no=tc.trans_ref_no) t2 ON t1.trans_ref_no = t2.trans_ref_no;
Here is what the new query returns for transaction number 0520227384:
"0520227384","0001063271","2014","01","140101","113107","000003501","TEXLED","000004463"
"0520227384","0001063271","2014","01","140101","113107","000003501","TEXLED","000005103"
"0520227384","0001063271","2014","01","140101","113107","000004000","TEXLED","000004463"
"0520227384","0001063271","2014","01","140101","113107","000004000","TEXLED","000005103"
What I need to get out of this is a set of condensed lines where, in this group, the seconds and third need to be removed:
"0520227384","0001063271","2014","01","140101","113107","000003501","TEXLED","000004463"
"0520227384","0001063271","2014","01","140101","113107","000004000","TEXLED","000005103"
How can I go about filtering these lines from the updated query result set?

i think, the answer is:
(... your heavy sql ..) group by 7
or
(... your heavy sql ..) group by t1.gr_gals

Loop on Select_Analysis tool (Python and ArcGIS 9.3)

First, I'm new in Python and I work on Arc GIS 9.3.
I'd like to realize a loop on the "Select_Analysis" tool. Indeed I have a layer "stations" composed of all the bus stations of a city.
The layer has a field "rte_id" that explains on what line a station is located.
And I'd like to save in distinct layers all the stations with "rte_id" = 1, the stations with "rte_id" = 2 and so on. Hence the use of the tool select_analysis.
So, I decided to make a loop (I have 70 different "rte_id" .... so 70 different layers to create!). But it does not work and I'm totally lost!
Here is my code:
import arcgisscripting, os, sys, string
gp = arcgisscripting.create(9.3)
gp.AddToolbox("C:/Program Files (x86)/ArcGIS/ArcToolbox/Toolboxes/Data Management Tools.tbx")
stations = "d:/Travaux/NantesMetropole/Traitements/SIG/stations.shp"
field = "rte_id"
for i in field:
gp.Select_Analysis (stations, "d:/Travaux/NantesMetropole/Traitements/SIG/stations_" + i + ".shp", field + "=" + i)
i = i+1
print "ok"
And here is the error message:
gp.Select_Analysis (stations, "d:/Travaux/NantesMetropole/Traitements/SIG/stations_" + i + ".shp", field + "=" + i)
TypeError: can only concatenate list (not "str") to list
Have you got any ideas to solve my problem?
Thanks in advance!
Julien

The main problem here is in the string
for i in field:
You are trying to iterate a string - field name ("rte_id").
This is not correct.
You need to iterate all possible values of field "rte_id".
Easiest solution:
if you know that field "rte_id" have values 1 - 70 (for example) then you can try
for i in range(1, 71):
shp_name = "d:/Travaux/NantesMetropole/Traitements/SIG/stations_" + str(i) + ".shp"
expression = '{0} = {1}'.format(field, i)
gp.Select_Analysis (stations, shp_name , expression)
print "ok"
More sophisticated solution:
You need to get a list of all unique values of field "rte_id" in terms of SQL - to perform GROUP BY.
I think it is not actually possible to perform GROUP BY operation on SHP files with one tool.
You can use SearchCursor, iterate through all features and generate a list of unique values of you field. But this is more complex task.
Another way is to use the Summarize option on the shapefile table in ArcMap (open table, right click on the column header). You will get dbf table with unique values which you can read in your script.
I hope it will help you to start!
Don't have ArcGIS right now and can't write and check any script.

You will need to make substantial changes to this code in order to get it to do what you want. You may just want to download the Split Layer By Attribute Code from ArcGIS online which does the exact same thing.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

How to overwrite older existing ID's when merging into new table? - python

Related

Querying a varchar entry for a partial string

Neo4j import csv / DHCP data controlling duplication

grouped box plots are too narrow to read

Filtering a set of data based on indices in line

Loop on Select_Analysis tool (Python and ArcGIS 9.3)

Categories

Resources