I'm trying to build a Python program that will parse the conversion codes from a Universe Database. These conversion codes are highly dense strings that encode a ton of information, and I'm struggling to get started. The different permutations of the conversion codes likely number in the hundreds of thousands (though I haven't done the math).
I'm familiar with argparse; however, I can't come up with a way of handling this kind of parsing with it, and my Google-fu hasn't turned up any other solution.
Initially, my lazy workaround was to do a dictionary lookup for the most common conversion codes, but now that we're using this Python program for more data, maintaining each individual conversion code has become a huge chore.
For example, date conversion codes may take forms like:
Date: D[n][*m][s][fmt[[f1, f2, f3, f4, f5]]][E][L], e.g. D2/ or D4-2/RM
Datetime: DT[4|D|4D|T|TS|Z][;timezone], e.g. DTZ or DT4;America/Denver
Datetime ISO: DTI[B][R|W][S][Z][2|1|0][;[timezone|offset]], e.g. DTIBZ2 or DTIR;America/Denver
And there are a bunch of other conversion codes, with equally complicated parameters.
My end goal is to be able to convert Universe's string data into the appropriate Python object and back again, and in order to do that, I need to understand these conversion codes.
If it helps, I do not need to validate these conversion codes. Once they are set in the database, they are validated there.
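To make the goal concrete, here is the direction I am considering for just the date codes. This is a rough sketch only: the group names and grammar details are nothing more than my reading of the D format above, and it is certainly incomplete.
import re

# Rough sketch: my guess at the D[n][*m][s][fmt] structure described above
DATE_CODE = re.compile(r"""
    ^D                            # date conversion
    (?P<year_digits>\d)?          # [n]   number of year digits
    (?:\*(?P<skip>\d))?           # [*m]  optional skip count
    (?P<separator>[^A-Z0-9])?     # [s]   separator character such as / or -
    (?P<options>[A-Z]*)           # [fmt] trailing options such as RM
    $""", re.VERBOSE)

print(DATE_CODE.match("D2/").groupdict())
# a richer code such as D4-2/RM will not match this sketch yet
print(DATE_CODE.match("D4-2/RM"))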
I recommend that you use the readnamedfields/writenamedfields methods: readnamedfields returns OCONV data, and when you write back, writenamedfields handles the ICONV.
import u2py
help(u2py.File.readnamedfields)
Help on function readnamedfields in module u2py:
readnamedfields(self, *args)
F.readnamedfields(recordid, fieldnames, [lockflag]) -> new DynArray object -- read the specified fields by name of a record in the file
fieldnames is a u2py.DynArray object with each of its fields being a name defined in the dictionary file
lockflag is either 0 (default), or [LOCK_EXCLUSIVE or LOCK_SHARED] [ + LOCK_WAIT]
note: if fieldnames contains names that are not defined in the dictionary, these names are replaced by #ID and no exception is raised
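For example, a minimal sketch based on the help text above. The file name, record id and dictionary field names are placeholders, and I am assuming writenamedfields mirrors readnamedfields with the values to write as an extra argument:
import u2py

# placeholders: a file called CUSTOMERS with dictionary items INVOICE_DATE and BALANCE
f = u2py.File("CUSTOMERS")
fieldnames = u2py.DynArray(["INVOICE_DATE", "BALANCE"])

record = f.readnamedfields("12345", fieldnames)    # values come back already OCONVed
# ... work with the converted values in Python ...
f.writenamedfields("12345", fieldnames, record)    # ICONV is applied on the way back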
I'm importing data from Excel files that come from another office.
In one of the columns, each cell holds a list of numbers used as tags. These were entered manually, by different people and (my guess) on computers with different thousands-separator settings, so the result is very heterogeneous.
As an example I have:
tags= ['205', '306.3', '3,206,302','7.205.206']
If this was a CSV file (I tried converting one single file to check), using
pd.read_csv(my_file,sep=';')
would give me exactly the above mentioned list.
Unfortunately, as said, we're dealing with Excel files (plural), and using
pd.read_excel(my_file, sheetname=my_sheet, encoding='utf-16', converters={'my_column': str})
what I get instead is:
tags= ['205', '306.3', '3,206,302','7205206']
As you can see, whenever the number can plausibly be read as thousands (so, not the second number in my list), the dot is recognised as a thousands separator and I get a single number instead of three.
I tried reading documentation, and searching on stackoverflow and google, but the keywords to describe this problem are too vague and I didn't find a viable solution, yet.
How can I get the right list using excel files?
Thanks.
This problem is likely happening because pandas runs its number parsing (including thousands and decimal handling) on the cells before your converters ever see the raw values.
One possible fix is to specify the thousands separator explicitly. For example, if ',' really is your thousands separator, you could pass thousands=',' to the Excel reader:
pd.read_excel(my_file, sheetname=my_sheet, encoding='utf-16', thousands=',', converters={'my_column': str})
You could also pick an arbitrary thousands separator that doesn't occur in your data, so the output stays the same, in case thousands=None (which should be the default according to the documentation) doesn't already deal with your problem. You should also make sure that you are converting the fields to str (in which case thousands is somewhat redundant, as it is not applied to strings anyway).
EDIT:
I tried using the following dummy data ('test.xlsx'):
a b c d
205 306.3 3,206,302 7.205.206
and with
dataf = pandas.read_excel('test.xlsx', header=0, converters={'a':str, 'b':str,'c':str,'d':str})
print(dataf.to_string())
I got the following output:
Columns: [205, 306.3, 3,206,302, 7.205.206]
Which is exactly what you were looking for. Are you sure you have the latest version of pandas and that you are in fact not using converters = {'col':int} or float in your converters keyword?
As it stands, it sounds like you are either converting your fields to numeric (int or float), or there is a problem elsewhere in your code. pandas.read_excel seems to work as described, and I can get the results you specified with the code above. In other words: your code should work; if it doesn't, it may be due to an outdated pandas version, other parts of your code, or even problems with the source data. As it stands, it's not possible to answer your question further with the information you have provided.
I'm doing a project in Genetic Programming and I need to be able to convert a genetic program (of class deap.creator.Individual) to string, change some things (while keeping the problem 100% syntactically aligned with DEAP), and then put it back into a population of individuals for further evolution.
However, I've only been able to convert my string back to class gp.PrimitiveTree using the from_string method.
The only constructors for creator.Individual I can find either generate entire populations blindly or construct an Individual from existing Individuals. There seems to be no method to create a single individual from an existing gp.PrimitiveTree.
So, does anybody have any idea how I go about that?
Note: Individual is self-defined, but it is standard across all DEAP examples and is created using
creator.create("FitnessMax", base.Fitness, weights=(1.0,))
creator.create("Individual", gp.PrimitiveTree, fitness=creator.FitnessMax)
After many many hours I believe I've figured this out.
So, I'd become confused between two of the DEAP modules: 'creator' and 'toolbox'.
In order for me to create an individual with a given PrimitiveTree I simply needed to do:
creator.Individual(myPrimitiveTree)
What you do not do is:
toolbox.individual(myPrimitiveTree)
as that is usually set up as the population initialiser itself, and thus doesn't take arguments.
I hope that this can save somebody a decent chunk of time at some point in the future.
Individual to string: str(individual)
In order to create an Individual from a string: PrimitiveTree has the class method from_string:
https://deap.readthedocs.io/en/master/api/gp.html#deap.gp.PrimitiveTree.from_string
In your DEAP evolution, to create an individual from a string, you can try something like (note the use of creator vs toolbox):
creator.Individual.from_string("add(IN1, IN2)", pset)
But the individual expression, as a string, needs to match what you would get from str(individual), i.e. stick to your pset when creating your string. So for my example string above, I believe you would need a pset similar to:
pset = gp.PrimitiveSetTyped("MAIN", [float]*2, float, "IN")
pset.addPrimitive(operator.add, [float,float], float)
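Putting the pieces together, a minimal end-to-end sketch. Note that with the "IN" prefix DEAP names the arguments IN0 and IN1 by default (rename them with pset.renameArguments if you want other names), so the string below uses those names:
import operator
from deap import base, creator, gp

pset = gp.PrimitiveSetTyped("MAIN", [float] * 2, float, "IN")
pset.addPrimitive(operator.add, [float, float], float)

creator.create("FitnessMax", base.Fitness, weights=(1.0,))
creator.create("Individual", gp.PrimitiveTree, fitness=creator.FitnessMax)

# string -> Individual, then back to a string
individual = creator.Individual.from_string("add(IN0, IN1)", pset)
print(type(individual))   # <class 'deap.creator.Individual'>
print(str(individual))    # add(IN0, IN1)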
I have a simple Excel file with two columns, one categorical column and one numerical column, that I read into pandas with the read_excel function as below
df= pd.read_excel('pandas_crasher.xlsx')
The first column is of dtype object and holds a mixture of types. Since the Excel file was badly formatted, the column contains a combination of timestamps, floats and text, but it is normally supposed to be a simple text column.
from datetime import datetime
from collections import Counter
df['random_names'].dtype
dtype('O')
print Counter([type(i) for i in df['random_names']])
Counter({<type 'unicode'>: 15427, <type 'datetime.datetime'>: 18,
         <type 'float'>: 2})
When I do a simple groupby on it, it crashes the Python kernel without any error messages or notifications. I tried doing it from both Jupyter and a small custom Flask app without any luck.
df.groupby('random_names')['random_values'].sum() << crashes
It's a relatively small file of 700 KB (15k rows and 2 columns), so it's definitely not a memory issue.
I tried debugging with pdb to trace the point at which it crashes, but I couldn't get past the Cython call inside this function in the pandas/core/groupby.py module:
def _cython_operation(self, kind, values, how, axis)
This looks like a possible bug in pandas: instead of crashing outright, shouldn't it raise an exception and fail gracefully?
I then convert the various data types to text with the following function:
def custom_converter(x):
    if isinstance(x, datetime) or isinstance(x, (int, long, float)):
        return str(x)
    else:
        return x.encode('utf-8')
df['new_random_names'] = df['random_names'].apply(custom_converter)
df.groupby('new_random_names')['random_values'].sum() << does not crash
Applying a custom function is probably the slowest way of doing this. Is there a better/faster way?
Excel file here: https://drive.google.com/file/d/0B1ZLijGO6gbLelBXMjJWRFV3a2c/view?usp=sharing
For me, the crash seems to happen when pandas tries to sort the group keys. If I pass the sort=False argument to .groupby() then the operation succeeds. This may work for you as well. The sort appears to be a numpy operation that doesn't actually involve pandas objects, so it may ultimately be a numpy issue. (For instance, df.random_names.values.argsort() also crashes for me.)
After some more playing around, I'm guessing the problem has to do with some sort of obscure condition that arises due to the particular comparisons that are made during numpy's sort operation. For me, this crashes:
df.random_names.values[14005:15447]
but leaving one item off either end of the slice doesn't crash anymore. Making a copy of this data and then tweaking it by taking out individual elements, the crash will occur or not depending on whether certain seemingly random elements are removed from the data. Also, under certain circumstances it will fail with an exception of "TypeError: can't compare datetime.datetime to unicode" (or "datetime to float").
This section of the data contains one datetime and one float value, which happens to be a nan. It looks like there is some weird edge case in the numpy code that causes failed comparisons to crash under certain circumstances rather than raise the right exception.
To answer the question at the end of your post, you may have an easier time using the various arguments to read_excel (such as the converters argument) to read all the data in as textual values right from the start.
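For example, a minimal sketch (column names are taken from your question, and I have not run this against the linked file):
import pandas as pd

# force the problem column to text at read time, so mixed unicode/datetime/float
# values never reach the groupby sort; sort=False sidesteps the crash as well
df = pd.read_excel('pandas_crasher.xlsx', converters={'random_names': str})
result = df.groupby('random_names', sort=False)['random_values'].sum()
print(result.head())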
I'm a beginner, and the answers I've found online so far for this have been too complicated to be useful, so I'm looking for an answer in vocabulary and complexity similar to this writing.
I'm using Python 2.7 in the IPython notebook environment, along with related modules as distributed by Anaconda, and I need to learn about library-specific objects in the course of my daily work. The case I'm using here is a pandas DataFrame object, but the answer should work for any Python object or any object from an imported module.
I want to be able to print a list of methods for a given object, directly from my program, in a concise and readable format. Even if it's just the method names in alphabetical order, that would be great. A bit more detail would be even better; an ordering based on what each method does is fine, but I'd like the output to look like a table, one row per method, and not big blocks of text. What I've tried is below, and it fails for me because it's unreadable: it puts copies of my data between each line, and it has no formatting.
(I love stackoverflow. I aspire to have enough points someday to upvote all your wonderful answers.)
import pandas
import inspect
data_json = """{"0":{"comment":"I won\'t go to school"}, "1":{"note":"Then you must stay in bed"}}"""
data_df = pandas.io.json.read_json(data_json, typ='frame',
dtype=True, convert_axes=True,
convert_dates=True, keep_default_dates=True,
numpy=False, precise_float=False,
date_unit=None)
inspect.getmembers(data_df, inspect.ismethod)
Thanks,
- Sharon
Create an object of type str:
name = "Fido"
List all its attributes (in Python, methods are just callable attributes) in alphabetical order:
for attr in sorted(dir(name)):
print attr
Get more information about the lower (function) attribute:
print(name.lower.__doc__)
In an interactive session, you can also use the more convenient
help(name.lower)
function.
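If you want something closer to a one-row-per-method table, here is a small sketch along the same lines. It works for any object, e.g. the data_df from your question:
def list_methods(obj):
    # one line per callable attribute: name, then the first line of its docstring
    for attr in sorted(dir(obj)):
        member = getattr(obj, attr, None)
        if callable(member):
            doc = (member.__doc__ or '').strip().split('\n')[0]
            print('%-30s %s' % (attr, doc))

list_methods(name)      # the str example above
list_methods(data_df)   # the DataFrame from your question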
I do not understand why numpy.genfromtxt allows you to specify either all of the dtypes or none of them. I am trying to let it do the obvious (default) thing for most fields, but to override a few that I think it will have difficulty with. In general, I do not know the complete list of fields in my data file.
Is there a better way to do this than loading the file twice, as follows?
import numpy as np

dtypeoverrides = {'textfield': 'a20', 'anotherTrickyfield': 'a10', 'nonexisting field': 'a1'}
tsv = 'inputfile.tsv'
indata = np.genfromtxt(tsv, delimiter='\t', names=True, dtype=None)
if dtypeoverrides:
    dd = indata.dtype
    print dd
    dd = [(name, dtypeoverrides.get(name, dd[name])) for name in dd.names]
    print dd
    indata = np.genfromtxt(tsv, delimiter='\t', names=True, dtype=dd)
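One variant I have been considering is to infer the dtypes from only a sample of the rows instead of the whole file. A rough sketch: max_rows needs numpy >= 1.10, and 100 rows may not be enough to infer every column correctly:
import numpy as np

dtypeoverrides = {'textfield': 'a20', 'anotherTrickyfield': 'a10'}
tsv = 'inputfile.tsv'

# first pass over a small sample only, just to discover names and default dtypes
sample = np.genfromtxt(tsv, delimiter='\t', names=True, dtype=None, max_rows=100)
dd = [(name, dtypeoverrides.get(name, sample.dtype[name])) for name in sample.dtype.names]

# single full pass with the merged dtype
indata = np.genfromtxt(tsv, delimiter='\t', names=True, dtype=dd)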