Querying a varchar entry for a partial string - python

I am wondering if there is a way to query a table for a partial match to a string.
For instance, I have a table that lists drug treatments, but sometimes there are multiple drugs in
a single treatment. These come into the database as a varchar entry with both drugs separated by a semicolon (this formatting could be changed if it helped).
I know I can query on a full string, i.e.
Pharmacology() & 'treatment = "control"'
But if my entry is 'gabazine;TPMPA', is there a way to query on just 'TPMPA' and find all strings containing 'TPMPA'?
Alternatively, I could make a part table that populates only for cases with multiple drugs and use it to query those cases, but I am not sure how to set up its entries for better querying when the number of drugs is variable (i.e. is there a way to query inside a blob containing a Python list?).
Here's my table with some entries in case it helps:
(screenshot of the table entries omitted)
and my table definition (part table only there as a dummy):
@schema
class Pharmacology(dj.Computed):
    definition = """
    # information about pharmacological treatments
    -> Presentation
    -> PharmInfo
    ---
    treatment          :varchar(255)  # string of the treatment name from hdf5 file
    control_flag       :int           # 1 if control, 0 if in drug
    concentration      :varchar(255)  # drug concentration(s), "0" for controls
    multiple_drug_flag :int           # 1 if multiple drugs, else 0
    """

    class MultipleDrugInfo(dj.Part):
        definition = """
        -> Pharmacology
        ---
        list_of_drugs :blob  # list of drugs in multi-drug cases
        """

You can do that with the LIKE keyword:
Pharmacology() & 'treatment LIKE "%TPMPA%"'
The % is a wildcard in MySQL.
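As for your alternative idea: if you do add a part table, giving it one row per drug (instead of a blob) makes this kind of query natural, because you can restrict the master table by the part. A rough sketch, using a hypothetical DrugEntry part rather than your MultipleDrugInfo blob:
@schema
class Pharmacology(dj.Computed):
    definition = """
    # ... same attributes as above ...
    """

    class DrugEntry(dj.Part):
        definition = """
        -> master
        drug :varchar(255)  # a single drug from the treatment string
        """

# all Pharmacology entries whose treatment includes TPMPA
Pharmacology & (Pharmacology.DrugEntry & 'drug = "TPMPA"')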

Related

How to write in python a parser for a character-based protocol

I'm implementing a client for an already existing (old) standard for exchanging information between shops and providers in a specific sector, let's say vegetables.
It must be in Python, and I want my package to read a plaintext file and build objects accessible to a third-party application. I want to write a client implementation of this standard in Python, offer it open source as a library/package, and use it for my project.
It looks roughly like this (without the # comments)
I1234X9876DELIVERY # id line. 1234 is sender id and 9876 target id.
# Doctype "delivery"
H27082022RKG # header line. Specific to the "delivery" doctype.
# It will happen on 27 Aug '22, at the Regular time schedule. Units: kg.
PAPPL0010 # Product Apple. 10 kg
PANAN0015 # Product Ananas. 15 kg
PORAN0015 # Product Orange. 15 kg
The standard has 3 types of lines: identifier, header, and details (or body). The header format depends on the document type from the identifier line. Body lines also depend on the doc type.
Formats are defined by character length. One character from {I, H, P, ...} at the start of the line identifies the type of line, e.g. P. Then, if it's a product line of a delivery, 4 chars identify the type of product (APPL) and a 4-digit number specifies the amount of product (10).
I thought about using a hierarchy of classes, maybe enums, to identify which kind of document I obtained, so that an application can process differently a delivery document from a catalogue document. And then, for a delivery, as the structure is known, read the date attribute, and the products array.
However, I'm not sure of:
how to parse the lines efficiently.
what to build from the parsed message.
What does this sound like to you? I didn't study computer science theory, and although I've been coding for years, this is outside of what I usually do. I've read an article about parsing tools for Python, but I'm unsure of the concepts and of which tool to use, if any.
Do I need some grammar parser for this?
What would be a pythonic way to represent the data?
Thank you very much!
PS: the documents use 8-bit character encodings, usually Latin-1, so I can read byte by byte.
Looking at the start of each line would allow that line to be sent to a function that processes that type of information.
This would give a function for each format type, which allows for easier testing and maintenance.
The data could be stored in a Python dataclass. Using enums is also possible, as it looks like that is what the document is specifying.
Using enums to give more meaningful names to the abbreviations used in the format is probably a good idea.
Here is an example of doing this:
from dataclasses import dataclass, field
from datetime import datetime
from enum import Enum
import re
from typing import List, Union

data = """
I1234X9876DELIVERY
H27082022RKG
PAPPL0010
PANAN0015
PORAN0015
"""

class Product(Enum):
    APPLE = "APPL"
    PINEAPPLE = "ANAN"
    ORANGE = "ORAN"

class DocType(Enum):
    UNDEFINED = "NONE"
    DELIVERY = "DELIVERY"

class DeliveryType(Enum):
    UNDEFINED = "NONE"
    REGULAR = "R"

class Units(Enum):
    UNDEFINED = "NONE"
    KILOGRAMS = "KG"

@dataclass
class LineItem:
    product: Product
    quantity: int

@dataclass
class Header:
    sender: int = 0
    target: int = 0
    doc_type: DocType = DocType.UNDEFINED

@dataclass
class DeliveryNote(Header):
    delivery_freq: DeliveryType = DeliveryType.UNDEFINED
    date: Union[datetime, None] = None
    units: Units = Units.UNDEFINED
    line_items: List[LineItem] = field(default_factory=list)

    def show(self):
        print(f"Sender: {self.sender}")
        print(f"Target: {self.target}")
        print(f"Type: {self.doc_type.name}")
        print(f"Delivery Date: {self.date.strftime('%d-%b-%Y')}")
        print(f"Deliver Type: {self.delivery_freq.name}")
        print(f"Units: {self.units.name}")
        print()
        print(f"\t|{'Item':^12}|{'Qty':^6}|")
        print(f"\t|{'-' * 12}|{'-' * 6}|")
        for entry in self.line_items:
            print(f"\t|{entry.product.name:<12}|{entry.quantity:>6}|")

def process_identifier(entry):
    # Identifier line: sender id, 'X', target id, then the document type.
    match = re.match(r'(\d+)X(\d+)(\w+)', entry)
    sender, target, doc_type = match.groups()
    doc_type = DocType(doc_type)
    sender = int(sender)
    target = int(target)
    doc = None
    if doc_type == DocType.DELIVERY:
        doc = DeliveryNote(sender, target, doc_type)
    return doc

def process_header(entry, doc):
    # Header line for the delivery doctype: ddmmyyyy date, frequency letter, units.
    match = re.match(r'(\d{8})(\w)(\w+)', entry)
    if match:
        date_str, freq, units = match.groups()
        doc.date = datetime.strptime(date_str, '%d%m%Y')
        doc.delivery_freq = DeliveryType(freq)
        doc.units = Units(units)

def process_details(entry, doc):
    # Body line: product code followed by the quantity.
    match = re.match(r'(\D+)(\d+)', entry)
    if match:
        prod, qty = match.groups()
        doc.line_items.append(LineItem(Product(prod), int(qty)))

def parse_data(file_content):
    doc = None
    for line in file_content.splitlines():
        if line.startswith('I'):
            doc = process_identifier(line[1:])
        elif line.startswith('H'):
            process_header(line[1:], doc)
        elif line.startswith('P'):
            process_details(line[1:], doc)
    return doc

if __name__ == '__main__':
    this_doc = parse_data(data)
    this_doc.show()
When I ran this test it gave the following output:
$ python3 read_protocol.py
Sender: 1234
Target: 9876
Type: DELIVERY
Delivery Date: 27-Aug-2022
Deliver Type: REGULAR
Units: KILOGRAMS
    |    Item    | Qty  |
    |------------|------|
    |APPLE       |    10|
    |PINEAPPLE   |    15|
    |ORANGE      |    15|
Hopefully that gives you some ideas as I'm sure there are lots of assumptions about your data I've got wrong.
For ease of display here I haven't shown reading from a file. Using Python's pathlib.Path.read_text() should make getting the data from a file relatively straightforward.
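For example (a minimal sketch, assuming a hypothetical file name delivery.txt and the Latin-1 encoding mentioned in the question):
from pathlib import Path

# read the whole document as text, then hand it to the parser above
file_content = Path('delivery.txt').read_text(encoding='latin-1')
this_doc = parse_data(file_content)
this_doc.show()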

How to overwrite older existing ID's when merging into new table?

I am currently caching data from an API by storing all of it in a temporary table and merging it into a non-temp table in which ID/UPDATED_AT is unique.
ID/UPDATED_AT example:
MERGE
INTO vet_data_patients_stg
USING vet_data_patients_temp_stg
ON vet_data_patients_stg.updated_at=vet_data_patients_temp_stg.updated_at
AND vet_data_patients_stg.id=vet_data_patients_temp_stg.id
WHEN NOT matched THEN
INSERT
(
id,
updated_at,
<<<my_other_fields>>>
)
VALUES
(
vet_data_patients_temp_stg.id,
vet_data_patients_temp_stg.updated_at,
<<<my_other_fields>>>
)
My issue is that this method also leaves the older ID/UPDATED_AT rows in the table. I only want the row with the most recent UPDATED_AT for each ID, removing the older UPDATED_AT rows so that IDs are unique in the table.
Can I accomplish this by modifying my merge statement?
My Python way of auto-generating the statement is:
merge_string = (
    f'MERGE INTO {tablex.upper()}_{envx.upper()} '
    f'USING {tablex.upper()}_TEMP_{envx.upper()} '
    'ON ' + ' AND '.join(
        f'{tablex.upper()}_{envx.upper()}.{x}={tablex.upper()}_TEMP_{envx.upper()}.{x}'
        for x in keysx
    )
    + f' WHEN NOT MATCHED THEN INSERT ({field_columnsx}) VALUES ('
    + ','.join(f'{tablex.upper()}_TEMP_{envx.upper()}.{x}' for x in fieldsx)
    + ')'
)
EDIT - Examples to more clearly illustrate the goal:
So if my TABLE_STG has:
ID|UPDATED_AT|FIELD
0|2018-01-01|X
1|2020-01-01|A
2|2020-02-01|B
And my API gets the following in TABLE_TEMP_STG:
ID|UPDATED_AT|FIELD
1|2020-02-01|A
2|2020-02-01|B
I currently end up with:
ID|UPDATED_AT|FIELD
0|2018-01-01|X
1|2020-01-01|A
1|2020-02-01|A
2|2020-02-01|B
But I really want to remove the older UPDATED_ATs and end up with:
ID|UPDATED_AT|FIELD
0|2018-01-01|X
1|2020-02-01|A
2|2020-02-01|B
We can do deletes in the MATCHED branch of a MERGE statement. Your code needs to look like this:
MERGE
INTO vet_data_patients_stg
USING vet_data_patients_temp_stg
ON vet_data_patients_stg.updated_at=vet_data_patients_temp_stg.updated_at
AND vet_data_patients_stg.id=vet_data_patients_temp_stg.id
WHEN NOT matched THEN
INSERT
(
id,
updated_at,
<<<my_other_fields>>>
)
VALUES
(
vet_data_patients_temp_stg.id,
vet_data_patients_temp_stg.updated_at,
<<<my_other_fields>>>
)
WHEN matched THEN
UPDATE
SET some_other_field = vet_data_patients_temp_stg.some_other_field
DELETE WHERE 1 = 1
This deletes every row that is updated, which here means every matched row.
Note that you need to include the UPDATE clause even though you want to delete all the matched rows: the DELETE logic is applied only to records that are updated, and the syntax doesn't allow us to leave it out.
There is a proof of concept on db<>fiddle.
Re-writing the python code to generate this statement is left as an exercise for the reader :)
The Seeker hasn't posted a representative test case providing sample sets of input data and a desired outcome derived from those samples. So it may be that this doesn't do what they are expecting.
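For what it's worth, here is one hedged sketch of how the question's generator could be extended to emit the extra clauses. It reuses the variables from the question (tablex, envx, keysx, fieldsx, field_columnsx); the column chosen for the UPDATE is arbitrary, but it must not be one of the join keys:
tbl = f'{tablex.upper()}_{envx.upper()}'
tmp = f'{tablex.upper()}_TEMP_{envx.upper()}'
on_clause = ' AND '.join(f'{tbl}.{k}={tmp}.{k}' for k in keysx)
insert_values = ','.join(f'{tmp}.{c}' for c in fieldsx)
update_col = next(c for c in fieldsx if c not in keysx)  # any non-key column will do

merge_string = (
    f'MERGE INTO {tbl} USING {tmp} ON ({on_clause}) '
    f'WHEN NOT MATCHED THEN INSERT ({field_columnsx}) VALUES ({insert_values}) '
    f'WHEN MATCHED THEN UPDATE SET {update_col} = {tmp}.{update_col} '
    f'DELETE WHERE 1 = 1'
)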

Comparing spelling of strings from db and csv file

I'm mapping a table from a CSV file, and comparing some values to keys in a DB in order to fetch another value.
There's a possibility of spelling mistakes when people write the CSV files, so sometimes some values are not found in the db.
E.g. person writes: 'Contributions Other', db has a key of 'ContributionsOther'
What I did was remove all spaces and dashes and lowercase both the value from the CSV and the values loaded from the db. Here are the methods:
def get_trade_type_mappings(self):
    sql = """
        SELECT Code, TradeTypeID
        FROM dbo.TradeType"""
    with self.job.connect(database='rap') as conn:
        trade_types = etl.fromdb(conn, sql)
        # petl's convert() returns a new (lazy) table, so re-assign it to keep the normalisation
        trade_types = trade_types.convert('Code', lambda x: x.replace(' ', '').replace('-', '').lower())
        return dict(trade_types)

def fetch_trade_type_id(self, trade_type):
    # Prevents case and space differences causing issues
    trade_type = trade_type.replace(' ', '').replace('-', '').lower()
    if trade_type == 'cover':
        trade_type = 'covershort'
    elif trade_type == 'short':
        trade_type = 'sellshort'
    return self.get_trade_type_mappings().get(trade_type)
I'm trying to think of any other possible occurrences that might be prone to error.
What I wrote will work for stuff like:
'Contribution Other' vs. 'ContributionOther'
but not for:
'ContributionOthers' vs. 'ContributionOther'
Anything else you think would be useful? I've seen a Levenshtein Distance method for spelling comparison between two words... maybe I could integrate that.
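If you do go the fuzzy-matching route, the standard library's difflib is a low-effort place to start before reaching for a dedicated Levenshtein package. A sketch (fuzzy_lookup is a hypothetical helper, not part of your class):
from difflib import get_close_matches

def fuzzy_lookup(trade_type, mappings):
    # mappings is the dict produced by get_trade_type_mappings()
    key = trade_type.replace(' ', '').replace('-', '').lower()
    if key in mappings:
        return mappings[key]
    # Fall back to the closest known key, so 'contributionothers' can still
    # resolve to 'contributionother'; tune the cutoff to taste.
    close = get_close_matches(key, list(mappings), n=1, cutoff=0.85)
    return mappings[close[0]] if close else None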

Update Query if data of 2 columns is equal to a particular string

My table contains user query data. I generate a hashed string by doing the following:
queries = Query.objects.values('id', 'name')
# creating a bytes string using the ID, NAME columns and a string "yes" (this string could be anything, I've considered yes as an example)
data = (str(query['id']) + str(query['name']) + "yes").encode()
link_hash = hashlib.pbkdf2_hmac('sha256', data, b"satisfaction", 100000)
link_hash_string = binascii.hexlify(link_hash).decode()
I've sent this hash string via email, embedded in a link which is checked when the user visits it. My current method of checking whether the hash (taken from the GET parameter in the link) matches some data in the table is this:
queries = Query.objects.values('id', 'name')
# replyHash is set to a literal string here as an example; it is generated by the code above,
# but in practice the hash comes from the GET parameter in the link
replyHash = "269e1b3de97b10cd28126209860391938a829ef23b2f674f79c1436fd1ea38e4"
# Currently iterating through all the queries and checking each one
for query in queries:
    data = (str(query['id']) + str(query['name']) + "yes").encode()
    link_hash = hashlib.pbkdf2_hmac('sha256', data, b"satisfaction", 100000)
    link_hash_string = binascii.hexlify(link_hash).decode()
    if replyHash == link_hash_string:
        print("It exists, valid hash")
        query['name'] = "BooBoo"
        query.save()
        break
The problem with this approach is that if I have a large table with thousands of rows, this method will take a lot of time. Is there an approach using annotation or aggregation or something else which will perform the same action in less time?
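One direction to consider (an assumption on my part, not something from the post): Django's ORM has no built-in database function for PBKDF2, so annotate/aggregate cannot recompute the hash inside the database. The usual alternative is to store the hash on the row at the moment the email link is generated, for example in a hypothetical indexed link_hash field, and then filter on it:
# Hypothetical extra model field, populated when the link is generated:
#   link_hash = models.CharField(max_length=64, db_index=True)

match = Query.objects.filter(link_hash=replyHash).first()  # single indexed lookup
if match:
    print("It exists, valid hash")
    match.name = "BooBoo"
    match.save()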

Pandas dataframe to doc2vec.LabeledSentence

I have this dataframe:
order_id  product_id  user_id
2         33120       u202279
2         28985       u202279
2         9327        u202279
4         39758       u178520
4         21351       u178520
5         6348        u156122
5         40878       u156122
Type of user_id: string
Type of product_id: integer
I would like to use this dataframe to create a Doc2vec corpus. So I need to use the LabeledSentence function to create entries of the form:
{tags: user_id, words: all product ids ordered by each user_id}
But the dataframe shape is (32434489, 3), so I should avoid using a loop to create my LabeledSentence objects.
I tried running the function below with multiprocessing, but it takes too long.
Do you have any idea how to transform my dataframe into the right format for a Doc2vec corpus, where the tag is the user_id and the words are the list of products for that user_id?
def append_to_sequences(i):
    user_id = liste_user_id.pop(0)
    liste_produit_userID = data.ix[data["user_id"] == user_id, "product_id"].astype(str).tolist()
    return doc2vec.LabeledSentence(words=liste_produit_userID, tags=user_id)

pool = multiprocessing.Pool(processes=3)
result = pool.map_async(append_to_sequences, np.arange(len_liste_unique_user))
pool.close()
pool.join()
sentences = result.get()
Using multiprocessing is likely overkill. The forking of processes can wind up duplicating all existing memory, and involve excess communication marshalling results back into the master process.
Using a loop should be OK. 34 million rows (and far fewer unique user_ids) isn't that much, depending on your RAM.
Note that in recent versions of gensim TaggedDocument is the preferred class for Doc2Vec examples.
If we were to assume you have a list of all unique user_ids in liste_user_id, and a (new, not shown) function that gets the list-of-words for a user_id called words_for_user(), creating the documents for Doc2Vec in memory could be as simple as:
documents = [TaggedDocument(words=words_for_user(uid), tags=[uid])
for uid in liste_user_id]
Note that tags should be a list of tags, not a single tag – even though in many common cases each document only has a single tag. (If you provide a single string tag, it will see tags as a list-of-characters, which is not what you want.)
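For what it's worth, one way to implement the words_for_user() helper assumed above is a single groupby over the question's data frame rather than a per-user filter (a sketch, untested against your data):
# Build the per-user product lists once, up front.
products_by_user = (
    data.groupby('user_id')['product_id']
        .apply(lambda s: s.astype(str).tolist())
)

def words_for_user(uid):
    return products_by_user[uid]

# The unique user_ids, if you don't already have liste_user_id
liste_user_id = products_by_user.index.tolist()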
