Parse Python String for Units

Say I have a set of strings like the following:
"5 m^2"
"17 sq feet"
"3 inches"
"89 meters"
Is there a Python package which will read such strings, convert them to SI, and return the result in an easily-usable form? For instance:
>>> a=dream_parser.parse("17 sq feet")
>>> a.quantity
1.5793517
>>> a.type
'area'
>>> a.unit
'm^2'

Quantulum will do exactly what you described.
An excerpt from its description:
from quantulum import parser
quants = parser.parse('I want 2 liters of wine')
# quants [Quantity(2, 'litre')]

More recently, pint is a good place to start for most of these.
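For example, here is a minimal sketch of parsing the question's strings with pint (assuming a default unit registry; pint does not recognize "sq feet", so the area unit is spelled "ft^2" here):
import pint

ureg = pint.UnitRegistry()
a = ureg.Quantity("17 ft^2").to_base_units()  # convert to SI
print(a.magnitude)       # ≈ 1.57935168
print(a.units)           # meter ** 2
print(a.dimensionality)  # [length] ** 2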

There is an extension for IPython that can do at least part of what you want; it's called ipython-physics.
It stores both the value and its units and allows (at least) some basic math. I have never used it myself, so I don't know how easy it would be to use from a Python script.

If you have 'nice' strings then use pint.
(best for unit conversions)
import pint
u = pint.UnitRegistry()
value = u.Quantity("89 meters")
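Since pint's strength is conversion, a short usage sketch continuing from the registry above:
print(value.to("feet"))       # ≈ 292.0 foot
print(value.to_base_units())  # 89 meter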
If you have text/sentences then use quantulum
from quantulum import parser
value = parser.parse('Pass me a 300 ml beer.')
If you have 'ugly' strings then try unit_parse.
Examples of 'ugly' strings (see the unit_parse GitHub page for more examples):
2.3 mlgcm --> 2.3 cm * g * ml
5E1 g/mol --> 50.0 g / mol
5 e1 g/mol --> 50.0 g / mol
()4.0 (°C) --> 4.0 °C
37.34 kJ/mole (at 25 °C) --> [[<Quantity(37.34, 'kilojoule / mole')>, <Quantity(25, 'degree_Celsius')>]]
Detection in water: 0.73 ppm; Chemically pure --> 0.73 ppm
(uses pint under the hood)
from unit_parse import parser
result = parser("1.23 g/cm3 (at 25 °C)")
print(result) # [[<Quantity(1.23, 'g / cm ** 3')>, <Quantity(25, 'degC')>]]

Related

Astropy cannot parse inches

I am trying to convert inches to mm with astropy.
The input units come as strings ("inch", "mm"). I created this example function:
def astro_conv(self, amount: float, fromm: str, to: str) -> float:
    u_from = u.Unit(fromm)
    u_to = u.Unit(to)
    return u_from.to(u_to, amount)
I get this error: ValueError: 'inch' did not parse as unit: At col 0, inch is not a valid unit. I have checked the documentation, and inch should be available: http://docs.astropy.org/en/stable/units/
What am I doing wrong?
Per the docs, Astropy does not enable imperial units by default. I think this is partly to reduce the size of the default units namespace and the overhead of creating and searching it, and imperial units get sacrificed since they are, for the most part, less used in astronomy:
This package defines colloquially used Imperial units. They are available in the astropy.units.imperial namespace, but not in the top-level astropy.units namespace, e.g.:
>>> import astropy.units as u
>>> mph = u.imperial.mile / u.hour
>>> mph
Unit("mi / h")
To include them in compose and the results of find_equivalent_units, do:
>>> import astropy.units as u
>>> u.imperial.enable()
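With imperial units enabled, the unit-string round trip from the question works; a minimal sketch (a standalone version of the question's helper, without the stray self parameter):
import astropy.units as u
u.imperial.enable()  # make imperial units such as "inch" parseable by name

def astro_conv(amount: float, fromm: str, to: str) -> float:
    return u.Unit(fromm).to(u.Unit(to), amount)

print(astro_conv(1.0, "inch", "mm"))  # 25.4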

Python datatype that includes uncertainty/error bars?

Is there a python datatype that includes numerical error bars?
For example,
: a = 3.00 ± 0.100
: b = 4.00 ± 0.100
: b + a
>> 7.00 ± 0.141
Where √(0.1^2 + 0.1^2) = 0.141
I figured that since complex numbers already exist in a form something like a = 3 + 4j, maybe there is a module that handles error analysis for you as well. (I suppose it's complicated by the fact that the + and − uncertainties need not be equal.)
Yes. There is a package called uncertainties.
Install it: sudo pip install uncertainties
Example:
from uncertainties import ufloat_fromstr
x = ufloat_fromstr("0.20+/-0.01")
square = x**2
print(square)  # about 0.040+/-0.004
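To reproduce the question's example (a sketch; uncertainties assumes independent errors, so standard deviations add in quadrature):
from uncertainties import ufloat

a = ufloat(3.00, 0.10)
b = ufloat(4.00, 0.10)
print(a + b)  # 7.00+/-0.14, i.e. sqrt(0.1**2 + 0.1**2) ≈ 0.141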
For more info: https://pythonhosted.org/uncertainties/user_guide.html

Understanding pandas.read_csv() float parsing

I am having problems reading probabilities from a CSV using pandas.read_csv; some of the values are parsed as floats strictly greater than 1.0.
Specifically, I am confused about the following behavior:
>>> pandas.read_csv(io.StringIO("column\n0.99999999999999998"))["column"][0]
1.0
>>> pandas.read_csv(io.StringIO("column\n0.99999999999999999"))["column"][0]
1.0000000000000002
>>> pandas.read_csv(io.StringIO("column\n1.00000000000000000"))["column"][0]
1.0
>>> pandas.read_csv(io.StringIO("column\n1.00000000000000001"))["column"][0]
1.0
>>> pandas.read_csv(io.StringIO("column\n1.00000000000000008"))["column"][0]
1.0
>>> pandas.read_csv(io.StringIO("column\n1.00000000000000009"))["column"][0]
1.0000000000000002
Default float-parsing behavior seems to be non-monotonic, and in particular some values starting with 0.9... are converted to floats strictly greater than 1.0, which causes problems, e.g. when feeding them into sklearn.metrics.
The documentation states that read_csv has a parameter float_precision that can be used to select “which converter the C engine should use for floating-point values”, and setting this to 'high' indeed solves my problem.
However, I would like to understand the default behavior:
Where can I find the source code of the default float converter?
Where can I find documentation on the intended behavior of the default float converter and the other possible choices?
Why does a single-figure change in the least significant position skip a value?
Why does this behave non-monotonically at all?
Edit regarding “duplicate question”: This is not a duplicate. I am aware of the limitations of floating-point math. I was specifically asking about the default parsing mechanism in Pandas, since the builtin float does not show this behavior:
>>> float("0.99999999999999999")
1.0
...and I could not find documentation.
@MaxU already showed the source code for the parser and the relevant tokenizer xstrtod, so I'll focus on the "why" part:
The code for xstrtod is roughly like this (translated to pure Python):
def xstrtod(p):
    # accumulate all digits into a single float, then divide once at the end
    number = 0.
    idx = 0
    ndecimals = 0
    while p[idx].isdigit():                    # integer part
        number = number * 10. + int(p[idx])
        idx += 1
    idx += 1                                   # skip the decimal point
    while idx < len(p) and p[idx].isdigit():   # fractional part
        number = number * 10. + int(p[idx])
        idx += 1
        ndecimals += 1
    return number / 10**ndecimals
Which reproduces the "problem" you saw:
print(xstrtod('0.99999999999999997')) # 1.0
print(xstrtod('0.99999999999999998')) # 1.0
print(xstrtod('0.99999999999999999')) # 1.0000000000000002
print(xstrtod('1.00000000000000000')) # 1.0
print(xstrtod('1.00000000000000001')) # 1.0
print(xstrtod('1.00000000000000002')) # 1.0
print(xstrtod('1.00000000000000003')) # 1.0
print(xstrtod('1.00000000000000004')) # 1.0
print(xstrtod('1.00000000000000005')) # 1.0
print(xstrtod('1.00000000000000006')) # 1.0
print(xstrtod('1.00000000000000007')) # 1.0
print(xstrtod('1.00000000000000008')) # 1.0
print(xstrtod('1.00000000000000009')) # 1.0000000000000002
print(xstrtod('1.00000000000000019')) # 1.0000000000000002
The problem seems to be the 9 in the last place, which alters the result, and it comes down to floating-point accuracy:
>>> float('100000000000000008')
1e+17
>>> float('100000000000000009')
1.0000000000000002e+17
It's the 9 in the last place that is responsible for the skewed results: because xstrtod accumulates all digits into one large float before dividing by 10**ndecimals, the rounding error in that intermediate value carries over into the final quotient.
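You can watch that error accumulate by mimicking the loop above on the 17 nines of "0.99999999999999999":
number = 0.
for _ in range(17):
    number = number * 10. + 9  # same accumulation as xstrtod
print(number)           # 1.0000000000000002e+17
print(number / 10**17)  # 1.0000000000000002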
If you want high precision you can define your own converters or use Python-provided ones, e.g. decimal.Decimal if you want arbitrary precision:
>>> import pandas as pd
>>> import decimal
>>> converter = {0: decimal.Decimal} # parse column 0 as decimals
>>> import io
>>> def parse(string):
...     return '{:.30f}'.format(pd.read_csv(io.StringIO(string), converters=converter)["column"][0])
>>> print(parse("column\n0.99999999999999998"))
>>> print(parse("column\n0.99999999999999999"))
>>> print(parse("column\n1.00000000000000000"))
>>> print(parse("column\n1.00000000000000001"))
>>> print(parse("column\n1.00000000000000008"))
>>> print(parse("column\n1.00000000000000009"))
which prints:
0.999999999999999980000000000000
0.999999999999999990000000000000
1.000000000000000000000000000000
1.000000000000000010000000000000
1.000000000000000080000000000000
1.000000000000000090000000000000
Exactly representing the input!
If you want to understand how it works, look at the source code: file "_libs/parsers.pyx", lines 492-499, for pandas 0.20.1:
self.parser.double_converter_nogil = xstrtod  # <------- default converter
self.parser.double_converter_withgil = NULL
if float_precision == 'high':
    self.parser.double_converter_nogil = precise_xstrtod  # <------- 'high' converter
    self.parser.double_converter_withgil = NULL
elif float_precision == 'round_trip':  # avoid gh-15140
    self.parser.double_converter_nogil = NULL
    self.parser.double_converter_withgil = round_trip
Source code for xstrtod
Source code for precise_xstrtod
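For completeness, a short sketch showing how float_precision selects these converters from read_csv (the first output assumes a pandas version whose default C-engine converter is xstrtod):
import io
import pandas as pd

raw = "column\n0.99999999999999999"
print(pd.read_csv(io.StringIO(raw))["column"][0])                                # 1.0000000000000002
print(pd.read_csv(io.StringIO(raw), float_precision="high")["column"][0])        # 1.0
print(pd.read_csv(io.StringIO(raw), float_precision="round_trip")["column"][0])  # 1.0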

How to convert unit prefixes in Sympy

I would like to know how to convert a sympy value with a physical unit into the same unit with another prefix. For example:
>>> import sympy.physics.units as u
>>> x = 0.001 * u.kilogram
>>> x
0.001*kg
should be converted to grams. The approach I have taken so far is very bloated and delivers a wrong result.
>>> x / u.kilogram * u.gram
1.0e-6*kg
It should be 1g instead.
If you can accept printing 1 instead of 1g, you could just use division:
>>> x / u.g
1.0
Otherwise, you'd better switch to sympy.physics.unitsystems.
>>> from sympy.physics.unitsystems import Quantity
>>> from sympy.physics.unitsystems.systems import mks
>>> Quantity(0.001, mks['kg'])
0.001kg
>>> _.convert_to(mks['g'])
1g
In recent SymPy versions, sympy.physics.units.convert_to does this directly:
>>> u.convert_to(x, u.gram)
1.0*gram
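A self-contained version of that last approach (a sketch using the current sympy.physics.units API):
from sympy.physics.units import kilogram, gram, convert_to

x = 0.001 * kilogram
print(convert_to(x, gram))  # 1.0*gram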

ggplot2 hell with rpy2-2.0.7 + Python 2.6 + R 2.11 (Windows 7)

I am using rpy2-2.0.7 (I need this to work with Windows 7, and compiling the binaries for the newer rpy2 versions is a mess) to push a two-column data frame into R, create a few layers in ggplot2, and output the image into a .png file.
I have wasted countless hours fidgeting around with the syntax; I did manage to output the files I needed at one point, but (stupidly) did not notice and continued fidgeting around with my code...
I would sincerely appreciate any help; below is a (trivial) example for demonstration. Thank you very much for your help! ~ Eric Butter
import rpy2.robjects as rob
from rpy2.robjects import r
import rpy2.rlike.container as rlc
from array import array
r.library("grDevices") # import r graphics package with rpy2
r.library("lattice")
r.library("ggplot2")
r.library("reshape")
picpath = 'foo.png'
d1 = ["cat","dog","mouse"]
d2 = array('f',[1.0,2.0,3.0])
nums = rob.RVector(d2)
name = rob.StrVector(d1)
tl = rlc.TaggedList([nums, name], tags = ('nums', 'name'))
dataf = rob.RDataFrame(tl)
## r['png'](file=picpath, width=300, height=300)
## r['ggplot'](data=dataf)+r['aes_string'](x='nums')+r['geom_bar'](fill='name')+r['stat_bin'](binwidth=0.1)
r['ggplot'](data=dataf)
r['aes_string'](x='nums')
r['geom_bar'](fill='name')
r['stat_bin'](binwidth=0.1)
r['ggsave']()
## r['dev.off']()
*The output is just a blank image (181 bytes).
Here are a couple of common errors R itself throws as I fiddle around in ggplot2:
r['png'](file=picpath, width=300, height=300)
r['ggplot']()
r['layer'](dataf, x=nums, fill=name, geom="bar")
r['geom_histogram']()
r['stat_bin'](binwidth=0.1)
r['ggsave'](file=picpath)
r['dev.off']()
*RRuntimeError: Error: No layers in plot
r['png'](file=picpath, width=300, height=300)
r['ggplot'](data=dataf)
r['aes'](geom="bar")
r['geom_bar'](x=nums, fill=name)
r['stat_bin'](binwidth=0.1)
r['ggsave'](file=picpath)
r['dev.off']()
*RRuntimeError: Error: When setting aesthetics, they may only take one value. Problems: fill,x
I use rpy2 solely through Nathaniel Smith's brilliant little module called rnumpy (see the "API" link at the rnumpy home page). With this you can do:
from rnumpy import *
r.library("ggplot2")
picpath = 'foo.png'
name = ["cat","dog","mouse"]
nums = [1.0,2.0,3.0]
r["dataf"] = r.data_frame(name=name, nums=nums)
r("p <- ggplot(dataf, aes(name, nums, fill=name)) + geom_bar(stat='identity')")
r.ggsave(picpath)
(I'm guessing a little about how you want the plot to look, but you get the idea.)
Another great convenience is entering "R mode" from Python with the ipy_rnumpy module. (See the "IPython integration" link at the rnumpy home page).
For complicated stuff, I usually prototype in R until I have the plotting commands worked out. Error reporting in rpy2 or rnumpy can get quite messy.
For instance, the result of an assignment (or other computation) is sometimes printed even when it should be invisible. This is annoying e.g. when assigning to large data frames. A quick workaround is to end the offending line with a trailing statement that evaluates to something short. For instance:
In [59] R> long <- 1:20
Out[59] R>
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
[19] 19 20
In [60] R> long <- 1:100; 0
Out[60] R> [1] 0
(To silence some recurrent warnings in rnumpy, I've edited rnumpy.py to add 'from warnings import warn' and replace 'print "error in process_revents: ignored"' with 'warn("error in process_revents: ignored")'. That way, I only see the warning once per session.)
You have to draw to the device before you shut it off, which means that you have to print() the plot (like JD guesses above) prior to calling dev.off().
from rpy2 import robjects
r = robjects.r
r.library("ggplot2")
robjects.r('p = ggplot(diamonds, aes(clarity, fill=cut)) + geom_bar()')
r.ggsave('/stackBar.jpeg')
robjects.r('print(p)')
r['dev.off']()
To make it slightly easier when you have to draw more complex plots:
from rpy2 import robjects
from rpy2.robjects.packages import importr
import rpy2.robjects.lib.ggplot2 as ggplot2
r = robjects.r
grdevices = importr('grDevices')
p = r('''
library(ggplot2)
p <- ggplot(diamonds, aes(clarity, fill=cut)) + geom_bar()
p <- p + opts(title = "{0}")
# add more R code if necessary e.g. p <- p + layer(..)
p'''.format("stackbar"))
# you can use format to transfer variables into R
# use var.r_repr() in case it involves a robject like a vector or data.frame
p.plot()
# grdevices.dev_off()
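For newer rpy2 versions there is also a dedicated ggplot2 wrapper; a minimal sketch (assuming rpy2.robjects.lib.ggplot2 is available and that the diamonds dataset ships with the R ggplot2 package):
import rpy2.robjects.lib.ggplot2 as ggplot2
from rpy2.robjects.packages import importr, data

ggplot2_pkg = importr('ggplot2')
diamonds = data(ggplot2_pkg).fetch('diamonds')['diamonds']  # load the example data

p = (ggplot2.ggplot(diamonds)
     + ggplot2.aes_string(x='clarity', fill='cut')
     + ggplot2.geom_bar())
p.plot()  # draws on the current R graphics device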
