Convert data between databases with different schemes and value codings

As part of my job I am meant to write a script to convert data from an available database here on-site to an external database for publication which is similar but far less detailed. I want to achieve this in Python.
Some columns can be converted simply by copying their content and changing the coding, e.g. 2212 in the original database becomes 2 in column X of the external database. To achieve this I wrote the codings in JSON, e.g.
{
"2212": 2,
"2213": 1,
"2214": 2,
...
}
This leads to some repetition, as you can see, but since lists cannot be keys in JSON I don't see a better way to do it that is simple and clean. (Sure, I could use the right-hand side as the key, but then instead of jsonParsedDict["2212"] I would have to go through all the keys 1, 2, ... and look for my original key among the values.)
Where it gets ugly (in my opinion) is when information from multiple columns in the original database needs to be combined to get the new column. Right now I just wrote a Python function doing a lot of if checks. It works and I will finish the job this way, but it just seems aesthetically wrong and I want to learn more about what Python offers for this task.
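One way around the "lists cannot be keys" limitation is to store the coding grouped by target value and invert it once after loading, so lookups by original code stay a plain dictionary access. The following is only a sketch: the grouped layout and the helper name invert_grouped_coding are assumptions for illustration, not something from the original script.

```python
import json

# Hypothetical grouped layout: external code -> list of original codes.
GROUPED_JSON = """
{
    "1": [2213],
    "2": [2212, 2214]
}
"""


def invert_grouped_coding(grouped):
    """Turn {new_code: [old_code, ...]} into {old_code: new_code}."""
    lookup = {}
    for new_code, old_codes in grouped.items():
        for old_code in old_codes:
            key = str(old_code)
            if key in lookup:
                # The same original code listed under two targets is a config error.
                raise ValueError(f"original code {key} is mapped twice")
            lookup[key] = int(new_code)
    return lookup


coding = invert_grouped_coding(json.loads(GROUPED_JSON))
print(coding["2212"])  # 2
```

Each target value is written once in the file, and the inversion step also catches an original code that was accidentally assigned to two targets.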
Imagine for example I have two columns in the original database X and Y and then based on values in X I either do nothing (for values that are not coded in the external database), return a value directly or I return a result based on what value Y has in the same row. Right now this leads to quite some if statements. My other idea was to have stacked entries in the JSON file, e.g.
{
    "X": {
        "2211": 1,
        "2212": null,
        "2213": {
            "Y": {
                "3112": 1,
                "3212": 2
            }
        },
        "2214": {
            "Y": {
                "3112": 2,
                "3212": 1
            }
        },
        "2215": {
            "Y": {
                "3112": 1,
                "3212": 2
            }
        }
    }
}
But this approach really blows up the JSON file, and the repetitions get even more painful. Alas, I cannot think of any other way to encode this kind of condition, apart from if statements in the code.
Is this a feasible way to do it, or is there a better solution? It would be great if I could specify the variables and associated variables that are part of the decision process only in the JSON file. I want to abstract the conversion process so that it is mostly steered by these JSON files and the Python code stays quite general. If there is a better format for this than JSON, suggestions are very welcome.
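For what it is worth, the nested layout above can be driven by a very small, general piece of Python. The sketch below assumes that every nested dictionary names exactly one column to inspect and that leaf values are either a code or null; the function name resolve and the sample rows are made up for illustration.

```python
import json

RULES_JSON = """
{
    "X": {
        "2211": 1,
        "2212": null,
        "2213": {"Y": {"3112": 1, "3212": 2}},
        "2214": {"Y": {"3112": 2, "3212": 1}}
    }
}
"""


def resolve(rule, row):
    """Walk a nested rule until it yields a plain value (or None)."""
    while isinstance(rule, dict):
        # Each dict level names exactly one column to inspect (assumption).
        (column, branches), = rule.items()
        # Values that are not listed fall through to None, i.e. "not coded".
        rule = branches.get(str(row.get(column)))
    return rule


rules = json.loads(RULES_JSON)
print(resolve(rules, {"X": 2211, "Y": 3112}))  # 1
print(resolve(rules, {"X": 2214, "Y": 3112}))  # 2
print(resolve(rules, {"X": 2212, "Y": 3112}))  # None
```

With this shape the Python side never mentions X or Y; adding a third column to a decision only means nesting one level deeper in the JSON file.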

Related

Write a custom JSON interpreter in Python for a file that looks like JSON but isn't

I need to write a module that can read and write files that use the PDX script language. This language looks a lot like JSON but has enough differences that a custom encoder/decoder is needed to do anything with those files (without a mess of regex substitutions, which would make maintenance hell). I originally read them as text files and used regex find-and-replace to convert them to valid JSON. This led me to my current point, where any addition requires far more code than I would like, just to support some small new thing. With a custom JSON-like parser I could write code that declares what the valid key:value pairs are, then use that to handle the files. To me that would be a lot less code and a lot easier to maintain.
So what does this code look like? In general it looks like this (tried to put all possible syntax, this is not an example of a working file):
#key = value # this is the definition for the scripted variable
key = {
# This is a comment. No multiline comments
function # This is a single key, usually optimize_memory
# These are the accepted key:value pairs. The quoted version is being phased out
key = "value"
key = value
key = #key # This key is using a scripted variable, defined either in the file its in or in the `scripted_variables` folder. (see above for example on how these are initially defined)
# type is what the key type is. Like trigger:planet_stability where planet_stability is a trigger
key = type:key
# Variables like this allow for custom names to be set. Mostly used for flags and such things
[[VARIABLE_NAME]
math_key = $VARIABLE_NAME$
]
# this is inline math, I dont actually understand how this works in the script language yet as its new. The "<" can be replaced with any math symbol.
# Valid example: planet_stability < #[ stabilitylevel2 + 10 ]
key < #[ key + 10 ]
# This is used alot to handle code blocks. Valid example:
# potential = {
# exists = owner
# owner = {
# has_country_flag = flag_name
# }
# }
key = {
key = value
}
# This is just a list. Inline brackets are used alot which annoys me...
key = { value value }
}
The major differences between JSON and PDX script are the nearly complete lack of quotation marks, an equals sign instead of a colon as the separator, and no commas at the ends of lines. Before you ask me to change the PDX code: I can't. It's not mine. This is what I have to work with, and I can't make any changes to the syntax. And no, I don't want to convert back and forth; as I have already mentioned, that would require a lot of work. I have looked for examples of this, but all I can find are references on converting already valid JSON to a Python object, which is not what I want. So I can't give any examples of what I have already done, as I can't find anywhere to even start.
Some additional info:
Order of key:value pairs does not technically matter, however it is expected to be in a certain order, and when not in that order causes issues with mods and conflict solvers
bool properties always use yes or no rather than true or false
Lowercase is expected and in some cases required
Math operators are used as separators as well, e.g. >=, <=, etc.
The list of syntax is not exhaustive, but should contain most of the syntax used in the language
Past work:
My last attempts at this all revolved around converting the text file to a JSON file. This was a lot of work just to get a small piece of it to work.
Example:
potential = {
exists = owner
owner = {
is_regular_empire = yes
is_fallen_empire = no
}
NOR = {
has_modifier = resort_colony
has_modifier = slave_colony
uses_habitat_capitals = yes
}
}
And here is what I did to get most of the way to JSON (I couldn't find a way to add quotes):
test_string = test_string.replace("\n", ",")
test_string = test_string.replace("{,", "{")
test_string = test_string.replace("{", "{\n")
test_string = test_string.replace(",", ",\n")
test_string = test_string.replace("}, ", "},\n")
test_string = "{\n" + test_string + "\n}"
# Replace the equals sign with a colon
test_string = test_string.replace(" =", ":")
This resulted in this:
{
potential: {
exists: owner,
owner: {
is_regular_empire: yes,
is_fallen_empire: no,
},
NOR: {
has_modifier: resort_colony,
has_modifier: slave_colony,
uses_habitat_capitals: yes,
},
}
}
Very close, yes, but I could not find any way to add quotation marks around each word (I think I tried a regex sub, but wasn't able to get it to work, since the whole thing is one unbroken string). That left this attempt stuck, and it also shows how much work is needed just to get a very simple potential block to mostly work. This is not the method I want anymore: first because it's a lot of work, and second because I couldn't find anything to finish it. So a custom JSON-like interpreter is what I want.

The classical approach (potentially more code, but also more "correctness"/elegance) is probably to build a "recursive descent parser": a set of conditionals, loops and handler functions, some of them recursive, each dealing with one kind of element encountered on the input stream. The implicit parse/call tree may be sufficient if you directly print the JSON equivalent; otherwise you can build an in-memory representation/model for later output or conversion.
A related book recommendation is "Language Implementation Patterns" by Terence Parr; I am deliberately not promoting my own interpreters and introductory materials :-) If you need further help, feel free to write to me.
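To make the recursive descent idea concrete, here is a minimal sketch in Python. It only covers a subset of what the question lists: comments, key/operator/value entries, nested blocks, and inline lists. It does not handle scripted variables, inline math, or [[...]] blocks, and it treats every # as the start of a comment. The function names and the (key, operator, value) tuple layout are assumptions chosen for illustration; tuples in a list are used instead of a dict so that order and repeated keys such as has_modifier survive.

```python
import re

TOKEN_RE = re.compile(r'''
    \s+                          # whitespace (skipped)
  | \#[^\n]*                     # comment to end of line (skipped)
  | (?P<op><=|>=|=|<|>)          # separators
  | (?P<brace>[{}])
  | "(?P<quoted>[^"]*)"          # quoted value, quotes dropped
  | (?P<word>[^\s={}<>"\#]+)     # bare key or value
''', re.VERBOSE)


def tokenize(text):
    tokens, pos = [], 0
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise SyntaxError(f"unexpected character {text[pos]!r} at offset {pos}")
        pos = m.end()
        if m.lastgroup is not None:  # None means whitespace or a comment
            tokens.append((m.lastgroup, m.group(m.lastgroup)))
    return tokens


def parse_value(tokens, i):
    """Parse what follows an operator: a scalar or a nested block."""
    if i >= len(tokens):
        raise SyntaxError("value expected at end of input")
    kind, text = tokens[i]
    if kind == "brace" and text == "{":
        return parse_block(tokens, i + 1, "}")
    if kind in ("word", "quoted"):
        return text, i + 1
    raise SyntaxError(f"unexpected token {text!r}")


def parse_block(tokens, i, closing):
    """Parse entries until `closing` ('}' or None for end of input)."""
    entries = []
    while i < len(tokens):
        kind, text = tokens[i]
        if kind == "op":
            raise SyntaxError(f"operator {text!r} without a key")
        if kind == "brace" and text == "}":
            if closing is None:
                raise SyntaxError("unmatched '}'")
            return entries, i + 1
        if kind == "brace":  # a bare '{' opens an anonymous nested block
            value, i = parse_block(tokens, i + 1, "}")
            entries.append(value)
        elif i + 1 < len(tokens) and tokens[i + 1][0] == "op":
            # A word followed by an operator is a key.
            value, next_i = parse_value(tokens, i + 2)
            entries.append((text, tokens[i + 1][1], value))
            i = next_i
        else:
            entries.append(text)  # bare list item or single flag
            i += 1
    if closing is not None:
        raise SyntaxError("missing '}'")
    return entries, i


def parse(text):
    entries, _ = parse_block(tokenize(text), 0, None)
    return entries


sample = """
potential = {
    exists = owner
    owner = { is_regular_empire = yes is_fallen_empire = no }
    NOR = {
        has_modifier = resort_colony
        has_modifier = slave_colony
    }
}
"""
tree = parse(sample)
print(tree[0][0], len(tree[0][2]))  # potential 3
```

Writing a file back out would be the mirror image: a recursive function that walks the same list-of-tuples structure and emits key, operator and value with the indentation the game expects.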

Converting Python Dict to JSON for MySQL field of JSON type

I am currently getting this error:
Invalid JSON text: "not a JSON text, may need CAST" at position 0 in value for column
This is the value that is trying to be inserted:
{
"ath": 69045,
"ath_date": "2021-11-10T14:24:11.849Z",
"atl": 67.81,
"atl_date": "2013-07-06T00:00:00.000Z"
}
The error appears when I try to insert into my database. I believe it is due to malformed JSON; however, I am using json.dumps() to convert my dictionary. I have tried several things I found over the last few hours to format it correctly, but I keep hitting a wall between two errors.
I tried adding another level as well as wrapping it all in an array as that was recommended in another question, however, that produced the same error.
My Dict:
ticker_market_data[ticker] = {
"all_time": {
"ath": market_data["ath"]["usd"],
"ath_date": market_data["ath_date"]["usd"],
"atl": market_data["atl"]["usd"],
"atl_date": market_data["atl_date"]["usd"],
},
"price_change_percent": {
"1h": market_data["price_change_percentage_1h_in_currency"]["usd"],
"24h": market_data["price_change_percentage_24h"],
"7d": market_data["price_change_percentage_7d"],
"30d": market_data["price_change_percentage_30d"],
"1y": market_data["price_change_percentage_1y"],
},
}
The problem items being all_time and price_change_percent.
This is how I am creating the variables to store in the database:
all_time = json.dumps(ticker_market_data[ticker].get("all_time"))
price_change_percent = json.dumps(ticker_market_data[ticker].get("price_change_percent"))

I wanted to answer my own question though I don't know how much it will benefit the community as it was completely on me. My code posted above was entirely correct, the problem lay in the order of variables being inserted in the SQL. I had the wrong type of variable one position up from where it needed to be. So rather than JSON being inserted into the column it was a float.
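For anyone hitting the same error: one way to make this kind of ordering mistake harder is to bind parameters by name instead of by position. The sketch below uses the standard library's sqlite3 module purely as a runnable stand-in for the MySQL connection, and the table and column names are made up; MySQL drivers use a different placeholder syntax, so treat this as an illustration of the idea rather than drop-in code.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the real MySQL connection
conn.execute("CREATE TABLE market_data (ticker TEXT, price REAL, all_time TEXT)")

# Hypothetical row; the JSON column receives the output of json.dumps().
row = {
    "all_time": json.dumps({"ath": 69045, "atl": 67.81}),
    "price": 64000.5,
    "ticker": "btc",
}

# Named placeholders bind by key, so the order of the values cannot drift
# out of step with the order of the columns.
conn.execute(
    "INSERT INTO market_data (ticker, price, all_time) "
    "VALUES (:ticker, :price, :all_time)",
    row,
)

stored = conn.execute("SELECT all_time FROM market_data").fetchone()[0]
print(json.loads(stored)["ath"])  # 69045
```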

looping through json python is very slow

Can someone help me understand what I'm doing wrong in the following code:
def matchTrigTohost(gtriggerids,gettriggers):
    mylist = []
    for eachid in gettriggers:
        gtriggerids['params']['triggerids'] = str(eachid)
        hgetjsonObject = updateitem(gtriggerids,processor)
        hgetjsonObject = json.dumps(hgetjsonObject)
        hgetjsonObject = json.loads(hgetjsonObject)
        hgetjsonObject = eval(hgetjsonObject)
        hostid = hgetjsonObject["result"][0]["hostid"]
        hname = hgetjsonObject["result"][0]["name"]
        endval = hostid + "--" + hname
        mylist.append(endval)
    return(hgetjsonObject)
The variable gettriggers contains a lot of ids (~3500):
[ "26821", "26822", "26810", ..... ]
I'm looping through the ids in the variable and assigning them to a json object.
gtriggerids = {
"jsonrpc": "2.0",
"method": "host.get",
"params": {
"output": ["hostid", "name"],
"triggerids": "26821"
},
"auth": mytoken,
"id": 2
}
When I run the code against the above json variable, it is very slow. It is taking several minutes to check each ID. I'm sure I'm doing many things wrong here or at least not in the pythonic way. Can anyone help me speed this up? I'm very new to python.
NOTE:
The dump() , load(), eval() were used to convert the str produced to json.

You asked for help knowing what you're doing wrong. Happy to oblige :-)
At the lowest level—why your function is running slowly—you're running many unnecessary operations. Specifically, you're moving data between formats (python dictionaries and JSON strings) and back again which accomplishes nothing but wasting CPU cycles.
You mentioned this is the only way you could get the data into the format you needed. That brings me to the second thing you're doing wrong.
You're throwing code at the wall instead of understanding what's happening.
I'm quite sure (and several of your commenters appear to agree) that your code is not the only way to arrange your data into a usable structure. What you should do instead is:
Understand as much as you can about the data you're being given. I suspect the output of updateitem() should be your first target of learning.
Understand the right/typical way to interact with that data. Your data doesn't have to be a dictionary before you can use it. Maybe it's not the best approach.
Understand what regularities and irregularities the data may have. Part of your problem may not be with types or dictionaries, but with an unpredictable/dirty data source.
Armed with all this new knowledge, manipulate your data as simply as you can.
I can pretty much guarantee the result will run faster.
More detail! Some things you wrote suggest misconceptions:
I'm looping through the ids in the variable and assigning them to a json object.
No, you can't assign to a JSON object. In python, JSON data is always a string. You probably mean that you're assigning to a python dictionary, which (sometimes!) can be converted to a JSON object, represented as a string. Make sure you have all those concepts clear before you move forward.
The dump() , load(), eval() were used to convert the str produced to json.
Again, you don't call dumps() on a string. You use that to convert a python object to a string. Run this code in a REPL, go step by step, and inspect or play with each output to understand what it is.
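Here is a small session you could step through to see what each call does. The response dictionary is a made-up stand-in for whatever updateitem() really returns, so adjust it to your actual data.

```python
import json

# Hypothetical stand-in for the value returned by updateitem().
response = {"result": [{"hostid": "10084", "name": "web01"}]}

as_text = json.dumps(response)  # Python dict -> JSON string
back = json.loads(as_text)      # JSON string -> Python dict again

print(type(as_text).__name__)   # str
print(back == response)         # True

# Calling dumps() on something that is already a string just wraps it in
# another layer of quoting; one loads() only undoes one dumps().
double = json.dumps(as_text)
print(json.loads(double) == as_text)  # True

# If the data is already a dict, no conversion is needed to read from it.
host = response["result"][0]
print(host["hostid"] + "--" + host["name"])  # 10084--web01
```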

Python - Compare 2 different output formats - Logic

I have a programming case which has stumped me. I'm not necessarily looking for code; I'm looking for advice on the logic, where I'm at a loss.
I've tried several different things but nothing seems to really formulate concretely.
This is for a regression test. I have two files with the same data in two very dissimilar formats. I need to compare the data and automate the process. I'll worry about the 'diff' at a later stage. It should not be too hard if I can arrive at data from both files which can be compared.
File 1 has essentially JSON data. There is other garbage in the file but that can be removed. This is what data looks like:
{
"Chan-1" : [ {
"key1" : "val1",
"key2" : val2,
"key3" : val3,
}, {
"key1" : "val1",
"key2" : val2,
"key3" : val3,
} ]
}
File 2 has what I can best decipher as a Python list of items. Each item has data in a key=value format, separated by commas, in parentheses.
[
spacecraft.telemetry.channel(key1=val1,key2="val2",key3=val3),
spacecraft.telemetry.channel(key1=val1,key2="val2",key3=val3)
]
Each block in one file corresponds to a block in the other, and those pairs are essentially what will be diff'd. In other words:
{
"key1" : "val1",
"key2" : val2,
"key3" : val3,
}
from File 1 will (or should) have the same key value pairs as File 2:
(key1=val1,key2="val2",key3=val3)
The order is also similar.
Both files contain a plethora of key value pairs for the "Chan-1" object, and I truncated the amount of data for example sake. There are about 16 key value pairs in each block, and there are about 400 blocks.
I've tried working on File 2 to make it look like JSON data.
I've tried working on File 1 to make it look more like File 2.
I also tried parsing both files as a 3rd format altogether.
But I've not gotten far with either concept - and something tells me I'm missing something, that this should not be this hard, considering we already have one file in JSON.
I would really appreciate if someone can give me some advice on the Logic to follow here - what appears to be the best route, and what kind of logic should be put in making this happen.
Thanks.

For each file:
extract the 'list' of objects
convert each object in the list into a dictionary
for file 1, this step is basically 'convert JSON into dict'
for file 2, this would involve extracting just the key=value strings, splitting on =, and turning the result into a dict via dictionary comprehension.
At this point, you have two lists of dictionaries. Your question seems to indicate that you can assume the lists are ordered in the same manner, so now you can check that each dict in one list matches the dict in the same position in the other list. Check out zip(list_1, list_2); it should make this step easier.
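The steps above could be sketched roughly as follows. The sample texts are invented, and the sketch assumes that File 1 is valid JSON once the garbage is removed, and that every value in File 2 is a valid JSON literal containing no commas or parentheses; real telemetry values may need a more careful split.

```python
import json
import re

# Invented sample data in the two formats described in the question.
file1_text = '{"Chan-1": [{"key1": "a", "key2": 2}, {"key1": "b", "key2": 3}]}'
file2_text = """[
spacecraft.telemetry.channel(key1="a",key2=2),
spacecraft.telemetry.channel(key1="b",key2=4)
]"""


def parse_file1(text):
    """File 1: plain JSON, one list of dicts under "Chan-1"."""
    return json.loads(text)["Chan-1"]


def parse_file2(text):
    """File 2: pull out each (...) body and split it into key=value pairs."""
    records = []
    for body in re.findall(r"channel\((.*?)\)", text):
        pairs = (item.split("=", 1) for item in body.split(","))
        # Decode values as JSON literals so they compare equal to File 1's.
        records.append({key.strip(): json.loads(value) for key, value in pairs})
    return records


list_1 = parse_file1(file1_text)
list_2 = parse_file2(file2_text)

# Compare the dicts position by position and keep the ones that differ.
mismatches = [
    (index, a, b)
    for index, (a, b) in enumerate(zip(list_1, list_2))
    if a != b
]
print(mismatches)
```

Note that zip() stops at the shorter list, so it is worth comparing len(list_1) and len(list_2) as well.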

Mongodb: How to change an element of a nested array?

From what I have read, it is impossible to update an element in a nested array using the positional operator $ in mongo. The $ only works one level deep. I see it is a requested feature in mongo 2.7.
Updating the whole document one level up is not an option because of write conflicts. I need to be able to change just the 'username' for a particular reward program, for instance.
One idea would be to pull, modify, and push the entire 'reward_programs' element, but then I would lose the order. Order is important.
Consider this document:
{
"_id": 0,
"firstname":"Tom",
"profiles" : [
{
"profile_name": "tom",
"reward_programs": [
{
'program_name':'American',
'username':'tomdoe',
},
{
'program_name':'Delta',
'username':'tomdoe',
}
]
}
]
}
How would you go about specifically changing the 'username' of 'program_name'=Delta?

After doing more reading it looks like this is unsupported in mongodb at the moment. Positional updates are only supported for one level deep. The feature might be added for mongodb 2.7.
There are a couple of workarounds.
1) Flatten out your database structure. In this case, make 'reward_programs' its own collection and do your operation on that.
2) Instead of arrays of dicts, use dicts of dicts. That way you have an absolute path down to the object you need to modify. This can have drawbacks for query flexibility.
3) It seems hacky to me, but you can also walk the nested array, find the element's position index, and do something like this:
users.update({'_id': request._id, 'profiles.profile_name': profile_name}, {'$set': {'profiles.$.reward_programs.{}.username'.format(index): new_username}})
4) Read in the whole document, modify it, and write it back. However, this can cause write conflicts.
Setting up your database structure initially is extremely important. It really depends on how you are going to use it.
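Workaround 3 could be sketched as below. The helper name build_username_update is made up for illustration, and it only builds the filter and update documents from a document you have already read; it does not talk to MongoDB itself. Keep in mind that the index can become stale if another writer reorders the array between your read and your update.

```python
def build_username_update(doc, profile_name, program_name, new_username):
    """Return (filter, update) documents that target one nested username."""
    for profile in doc["profiles"]:
        if profile["profile_name"] != profile_name:
            continue
        # Walk the nested array to find the index of the wanted program.
        for index, program in enumerate(profile["reward_programs"]):
            if program["program_name"] == program_name:
                query = {"_id": doc["_id"], "profiles.profile_name": profile_name}
                path = "profiles.$.reward_programs.{}.username".format(index)
                return query, {"$set": {path: new_username}}
    raise LookupError("no matching reward program")


doc = {
    "_id": 0,
    "firstname": "Tom",
    "profiles": [
        {
            "profile_name": "tom",
            "reward_programs": [
                {"program_name": "American", "username": "tomdoe"},
                {"program_name": "Delta", "username": "tomdoe"},
            ],
        }
    ],
}

query, update = build_username_update(doc, "tom", "Delta", "tom_delta")
print(query)
print(update)
```

The resulting pair has the same shape as the arguments in the users.update(...) call shown in workaround 3.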

A simple way to do this:
doc = collection.find_one({'_id': 0})
doc['profiles'][0]["reward_programs"][1]['username'] = 'new user name'
# Replace the whole document (collection.save() was removed in PyMongo 4)
collection.replace_one({'_id': doc['_id']}, doc)
