I am working through Learn Python the Hard Way and am browsing through some code on GitHub before moving on. I am just curious what the .N does on the line tm.N = 1000 and how it relates to the end of the code.
import matplotlib.pyplot as plt
import random
import pandas.util.testing as tm
tm.N = 1000
df = tm.makeTimeDataFrame()
import string
foo = list(string.letters[:5]) * 200
df['indic'] = list(string.letters[:5]) * 200
random.shuffle(foo)
df['indic2'] = foo
df.boxplot(by=['indic', 'indic2'], fontsize=8, rot=90)
plt.show()
N is a global in the testing.py module that is used throughout the module to build test arrays and other objects. Its default value is 30. E.g.
np.arange(N * K).reshape((N, K))
Series(randn(N), index=index)
The code you're posting uses it poorly, because makeTimeDataFrame can be fed an nper parameter, and N is only substituted when nper is not provided. This is the correct usage, which would not have confused you:
df = tm.makeTimeDataFrame(nper=1000)
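For illustration, both spellings produce a frame of the same length (a minimal sketch, assuming an older pandas where pandas.util.testing still ships these helpers):

import pandas.util.testing as tm

df_a = tm.makeTimeDataFrame(nper=1000)  # explicit length, the preferred form
tm.N = 1000
df_b = tm.makeTimeDataFrame()           # falls back on the module global N
assert len(df_a) == len(df_b) == 1000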
The previous line, import pandas.util.testing as tm, imports the module pandas.util.testing and, for convenience, gives it the name tm. Thus, tm afterwards refers to this module, and so tm.N refers to the object named "N" (whatever that is) in the module.
Source: https://github.com/pydata/pandas/blob/master/pandas/util/testing.py
N is a variable in the pandas.util.testing library (imported as tm). It's used in a few of the functions defined in that library, including the makeTimeSeries function called by getTimeSeriesData, which is in turn called by the makeTimeDataFrame function that you invoke with df = tm.makeTimeDataFrame().
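For illustration, this is roughly the pattern such a module-level default follows (a simplified sketch, not the actual pandas source):

import numpy as np

N = 30  # module global; tm.N = 1000 rebinds this name

def makeTimeSeries(nper=None):
    # helpers fall back on the global N when no explicit length is given
    nper = N if nper is None else nper
    return np.random.randn(nper)

len(makeTimeSeries())   # 30
N = 1000
len(makeTimeSeries())   # 1000, like after tm.N = 1000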
You can get information about pandas.util.testing.N from the docstring and the type() function:
>>> tm.N.__doc__
'int(x[, base]) -> integer\n\nConvert a string or number to an integer, if possible. A floating point\nargument will be truncated towards zero (this does not include a string\nrepresentation of a floating point number!) When converting a string, use\nthe optional base. It is an error to supply a base when converting a\nnon-string. If base is zero, the proper base is guessed based on the\nstring content. If the argument is outside the integer range a\nlong object will be returned instead.'
>>> print(tm.N.__doc__)
int(x[, base]) -> integer
Convert a string or number to an integer, if possible. A floating point
argument will be truncated towards zero (this does not include a string
representation of a floating point number!) When converting a string, use
the optional base. It is an error to supply a base when converting a
non-string. If base is zero, the proper base is guessed based on the
string content. If the argument is outside the integer range a
long object will be returned instead.
>>> type(tm.N)
<type 'int'>
In pandas, in the module pandas.util.testing, the N property sets the length of the generated TimeSeries.
See this reference in the section:
We could alternatively have used the unit testing function to create a TimeSeries of length 20:
>>> pandas.util.testing.N = 20
>>> ts = pandas.util.testing.makeTimeSeries()
It makes a timeseries of length 1000.
>>> df.head()
                   A         B         C         D
2000-01-03 -0.734093 -0.843961 -0.879394 0.415565
2000-01-04 0.028562 -1.098165 1.292156 0.512677
2000-01-05 1.135995 -0.864060 1.297646 -0.166932
2000-01-06 -0.738651 0.426662 0.505882 -0.124671
2000-01-07 -1.242401 0.225207 0.053541 -0.234740
>>> len(df)
1000
.N gives the number of entries in an array-like object. For example, if you use a colormap,
plt.get_cmap('Pastel1').N will return 9 because it consists of 9 colors, whereas
plt.get_cmap('nipy_spectral').N will return 256.
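A quick check (the color counts are taken from the answer above; they may vary across matplotlib versions):

import matplotlib.pyplot as plt

cmap = plt.get_cmap('Pastel1')
print(cmap.N)    # 9, the number of colors in the colormap's lookup table
print(cmap(0))   # RGBA tuple of the first of those 9 colors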
I am having problems reading probabilities from a CSV using pandas.read_csv; some of the values are read as floats greater than 1.0.
Specifically, I am confused about the following behavior:
>>> pandas.read_csv(io.StringIO("column\n0.99999999999999998"))["column"][0]
1.0
>>> pandas.read_csv(io.StringIO("column\n0.99999999999999999"))["column"][0]
1.0000000000000002
>>> pandas.read_csv(io.StringIO("column\n1.00000000000000000"))["column"][0]
1.0
>>> pandas.read_csv(io.StringIO("column\n1.00000000000000001"))["column"][0]
1.0
>>> pandas.read_csv(io.StringIO("column\n1.00000000000000008"))["column"][0]
1.0
>>> pandas.read_csv(io.StringIO("column\n1.00000000000000009"))["column"][0]
1.0000000000000002
The default float-parsing behavior seems to be non-monotonic; in particular, some values starting with 0.9... are converted to floats that are strictly greater than 1.0, causing problems, e.g., when feeding them into sklearn.metrics.
The documentation states that read_csv has a parameter float_precision that can be used to select “which converter the C engine should use for floating-point values”, and setting this to 'high' indeed solves my problem.
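A minimal check of that (assuming a pandas version that accepts the float_precision parameter):

import io
import pandas

s = "column\n0.99999999999999999"
print(pandas.read_csv(io.StringIO(s))["column"][0])                          # 1.0000000000000002
print(pandas.read_csv(io.StringIO(s), float_precision='high')["column"][0])  # 1.0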
However, I would like to understand the default behavior:
Where can I find the source code of the default float converter?
Where can I find documentation on the intended behavior of the default float converter and the other possible choices?
Why does a single-figure change in the least significant position skip a value?
Why does this behave non-monotonically at all?
Edit regarding “duplicate question”: This is not a duplicate. I am aware of the limitations of floating-point math. I was specifically asking about the default parsing mechanism in Pandas, since the builtin float does not show this behavior:
>>> float("0.99999999999999999")
1.0
...and I could not find documentation.
@MaxU already showed the source code for the parser and the relevant tokenizer xstrtod, so I'll focus on the "why" part:
The code for xstrtod is roughly like this (translated to pure Python):
def xstrtod(p):
    number = 0.
    idx = 0
    ndecimals = 0
    # accumulate the integer part
    while p[idx].isdigit():
        number = number * 10. + int(p[idx])
        idx += 1
    idx += 1  # skip the decimal point
    # accumulate the fractional digits into the same float
    while idx < len(p) and p[idx].isdigit():
        number = number * 10. + int(p[idx])
        idx += 1
        ndecimals += 1
    return number / 10**ndecimals
This reproduces the "problem" you saw:
print(xstrtod('0.99999999999999997')) # 1.0
print(xstrtod('0.99999999999999998')) # 1.0
print(xstrtod('0.99999999999999999')) # 1.0000000000000002
print(xstrtod('1.00000000000000000')) # 1.0
print(xstrtod('1.00000000000000001')) # 1.0
print(xstrtod('1.00000000000000002')) # 1.0
print(xstrtod('1.00000000000000003')) # 1.0
print(xstrtod('1.00000000000000004')) # 1.0
print(xstrtod('1.00000000000000005')) # 1.0
print(xstrtod('1.00000000000000006')) # 1.0
print(xstrtod('1.00000000000000007')) # 1.0
print(xstrtod('1.00000000000000008')) # 1.0
print(xstrtod('1.00000000000000009')) # 1.0000000000000002
print(xstrtod('1.00000000000000019')) # 1.0000000000000002
The problem seems to be the 9 in the last place, which alters the result. So it's floating-point accuracy:
>>> float('100000000000000008')
1e+17
>>> float('100000000000000009')
1.0000000000000002e+17
It's the 9 in the last place that is responsible for the skewed results.
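You can replay the tail end of xstrtod's accumulation for '0.99999999999999999' by hand:

>>> n = float(9999999999999999)   # 16 nines: the tie already rounds up to 1e16
>>> n
1e+16
>>> n * 10. + 9                   # appending the 17th nine: 1e17 + 9 rounds up
1.0000000000000002e+17
>>> (n * 10. + 9) / 10**17        # the final division keeps the excess
1.0000000000000002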
If you want high precision you can define your own converters or use Python-provided ones, i.e. decimal.Decimal if you want arbitrary precision:
>>> import io
>>> import decimal
>>> import pandas as pd
>>> converter = {0: decimal.Decimal}  # parse column 0 as Decimal
>>> def parse(string):
...     return '{:.30f}'.format(pd.read_csv(io.StringIO(string), converters=converter)["column"][0])
...
>>> print(parse("column\n0.99999999999999998"))
>>> print(parse("column\n0.99999999999999999"))
>>> print(parse("column\n1.00000000000000000"))
>>> print(parse("column\n1.00000000000000001"))
>>> print(parse("column\n1.00000000000000008"))
>>> print(parse("column\n1.00000000000000009"))
which prints:
0.999999999999999980000000000000
0.999999999999999990000000000000
1.000000000000000000000000000000
1.000000000000000010000000000000
1.000000000000000080000000000000
1.000000000000000090000000000000
Exactly representing the input!
If you want to understand how it works, look at the source code: file "_libs/parsers.pyx", lines 492-499 for pandas 0.20.1:
self.parser.double_converter_nogil = xstrtod  # <------- default converter
self.parser.double_converter_withgil = NULL
if float_precision == 'high':
    self.parser.double_converter_nogil = precise_xstrtod  # <------- 'high' converter
    self.parser.double_converter_withgil = NULL
elif float_precision == 'round_trip':  # avoid gh-15140
    self.parser.double_converter_nogil = NULL
    self.parser.double_converter_withgil = round_trip
Source code for xstrtod
Source code for precise_xstrtod
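For completeness, a quick check of the 'round_trip' option, which hands parsing over to Python's own correctly-rounded converter (the same behavior as the builtin float shown in the question):

import io
import pandas

s = "column\n0.99999999999999999"
print(pandas.read_csv(io.StringIO(s), float_precision='round_trip')["column"][0])  # 1.0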
I am trying to load a DLL in Python 2.7 using ctypes. The DLL was written in Fortran and has multiple subroutines in it. I was able to successfully set up a couple of the exported functions that take long and double pointers as arguments.
import ctypes as C
import numpy as np

dll = C.windll.LoadLibrary('C:\\Temp\\program.dll')

_cp_from_t = getattr(dll, "CP_FROM_T")
_cp_from_t.restype = C.c_double
_cp_from_t.argtypes = [C.POINTER(C.c_longdouble),
                       np.ctypeslib.ndpointer(C.c_longdouble)]

# Mixture Rgas function
_mix_r = getattr(dll, "MIX_R")
_mix_r.restype = C.c_double
_mix_r.argtypes = [np.ctypeslib.ndpointer(dtype=C.c_longdouble)]
def cp_from_t(composition, temp):
    """ Calculates Cp in BTU/lb/R given a fuel composition and temperature.
    :param composition: numpy array containing fuel composition
    :param temp: temperature of fuel
    :return: Cp
    :rtype : float
    """
    return _cp_from_t(C.byref(C.c_double(temp)), composition)

def mix_r(composition):
    """Return the gas constant for a given composition.
    :rtype : float
    :param composition: numpy array containing fuel composition
    """
    return _mix_r(composition)
# At this point, I can just pass a numpy array as the composition and I can get the
# calculated values without a problem
comps = np.array([0, 0, 12.0, 23.0, 33.0, 10, 5.0])
temp = 900.0
cp = cp_from_t(comps, temp)
rgas = mix_r(comps)
So far, so good.
The problem arises when I try another subroutine, called Function2, which needs some strings as input. The strings are all fixed-length (255), and the subroutine also asks for the length of each string parameter.
The function is implemented in Fortran as follows:
Subroutine FUNCTION2(localBasePath,localTempPath,InputFileName,Model,DataArray,ErrCode)
    !DEC$ ATTRIBUTES STDCALL,REFERENCE, ALIAS:'FUNCTION2',DLLEXPORT :: FUNCTION2
    Implicit None
    Character *255 localBasePath,localTempPath,InputFileName
    Integer *4 Model(20), ErrCode(20)
    Real *8 DataArray(900)
The function prototype in Python is set up as follows:
function2 = getattr(dll, 'FUNCTION2')
function2.argtypes = [C.POINTER(C.c_char_p), C.c_long,
                      C.POINTER(C.c_char_p), C.c_long,
                      C.POINTER(C.c_char_p), C.c_long,
                      np.ctypeslib.ndpointer(C.c_long, flags='F_CONTIGUOUS'),
                      np.ctypeslib.ndpointer(C.c_double, flags='F_CONTIGUOUS'),
                      np.ctypeslib.ndpointer(C.c_long, flags='F_CONTIGUOUS')]
And I call it using:
base_path = "D:\\Users\\xxxxxxx\\Documents\\xxxxx\\".ljust(255)
temp_path = "D:\\Users\\xxxxxxx\\Documents\\xxxxx\\temp".ljust(255)
inp_file = "inp.txt".ljust(255)

function2(C.byref(C.c_char_p(base_path)),
          C.c_long(len(base_path)),
          C.byref(C.c_char_p(temp_path)),
          C.c_long(len(temp_path)),
          C.byref(C.c_char_p(inp_file)),
          C.c_long(len(inp_file)),
          model_array,
          data_array,
          error_array)
The strings are essentially paths. The function Function2 does not recognize the paths and produces an error message with some non-readable characters at the end, such as:
forrtl: severe (43): file name specification error, unit 16, D:\Users\xxxxxxx\Documents\xxxxx\ωa.
What I wanted the function to receive was D:\Users\xxxxxxx\Documents\xxxxx\. Obviously, the strings are not passed correctly.
I have read that Python uses NULL terminated strings. Can that be a problem while passing strings to a Fortran dll? If so, how do I get around it?
Any recommendations?
Following a comment from @eryksun, I made the following changes to make it work.
Changed the argtypes to:
function2 = getattr(dll, 'FUNCTION2')
function2.argtypes = [C.c_char_p, C.c_long,
                      C.c_char_p, C.c_long,
                      C.c_char_p, C.c_long,
                      np.ctypeslib.ndpointer(C.c_long, flags='F_CONTIGUOUS'),
                      np.ctypeslib.ndpointer(C.c_double, flags='F_CONTIGUOUS'),
                      np.ctypeslib.ndpointer(C.c_long, flags='F_CONTIGUOUS')]
And instead of passing the string as byref, I changed it to the following.
base_path = "D:\\Users\\xxxxxxx\\Documents\\xxxxx\\".ljust(255)
temp_path = "D:\\Users\\xxxxxxx\\Documents\\xxxxx\\temp".ljust(255)
inp_file = "inp.txt".ljust(255)

function2(base_path, len(base_path), temp_path, len(temp_path), inp_file, len(inp_file),
          model_array, data_array, error_array)
It was sufficient to pass the values directly.
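Note that on Python 3 the same call would need bytes rather than str, since c_char_p wraps a NUL-terminated byte string there. A hypothetical sketch (placeholder paths, same argtypes as above):

# Python 3 variant: c_char_p expects bytes, so build the fixed-width
# 255-character arguments from byte strings (paths are placeholders).
base_path = b"D:\\base\\".ljust(255)
temp_path = b"D:\\base\\temp".ljust(255)
inp_file = b"inp.txt".ljust(255)

function2(base_path, len(base_path), temp_path, len(temp_path),
          inp_file, len(inp_file), model_array, data_array, error_array)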
I would like to change the Python function below to cover all situations in which my business_code will need padding. The str.zfill method handles this, padding on the left with zeros until a given width is reached, but I have never used it before.
# function for formatting business codes
def formatBusinessCodes(code):
    """ Function that formats business codes. Pass in a business code which will convert to a string with 6 digits """
    busCode = str(code)
    if len(busCode) == 1:
        busCode = '00000' + busCode
    elif len(busCode) == 2:
        busCode = '0000' + busCode
    elif len(busCode) == 3:
        busCode = '000' + busCode
    return busCode
#pad extra zeros
df2['business_code']=df2['business_code'].apply(lambda x: formatBusinessCodes(x))
businessframe['business_code']=businessframe['business_code'].apply(lambda x: formatBusinessCodes(x))
financialframe['business_code']=financialframe['business_code'].apply(lambda x: formatBusinessCodes(x))
The code above handles a business_code of length up to 6, but I'm finding that the business_codes vary in length, both shorter and longer than 6. I'm validating data state by state, and each state varies in its business_code length (IL is 6, OH is 8). All codes must be padded evenly, so a code for IL that is 10 should produce 000010, etc. I need to handle all cases, using a command-line parsing parameter (argparse) and str.zfill.
You could use str.format:
def formatBusinessCodes(code):
    """ Function that formats business codes. Pass in a business code which will convert to a string with 6 digits """
    return '{:06d}'.format(code)
In [23]: formatBusinessCodes(1)
Out[23]: '000001'
In [26]: formatBusinessCodes(10)
Out[26]: '000010'
In [27]: formatBusinessCodes(123)
Out[27]: '000123'
The format {:06d} can be understood as follows:
{...} means replace the field with an argument passed to format (e.g. code).
: begins the format specification.
0 enables zero-padding.
6 is the minimum width of the string; note that numbers with more than 6 digits will NOT be truncated, however.
d means the argument (e.g. code) should be of integer type.
Note in Python 2.6 the format string needs an extra 0:
def formatBusinessCodes(code):
    """ Function that formats business codes. Pass in a business code which will convert to a string with 6 digits """
    return '{0:06d}'.format(code)
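Since the question asks about str.zfill: it does the same left-padding with zeros, and the width can come straight from an argparse parameter (a sketch; the 6- and 8-digit widths are the IL/OH examples from the question):

def formatBusinessCodes(code, width=6):
    """Pad a business code on the left with zeros to the given width."""
    return str(code).zfill(width)

print(formatBusinessCodes(10))     # 000010 (IL: 6 digits)
print(formatBusinessCodes(10, 8))  # 00000010 (OH: 8 digits)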
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('-b', help='Specify length of the district code')
args = parser.parse_args()
businessformat = args.b.strip()

df2['business_code'] = df2['business_code'].apply(lambda x: str(x))

def formatBusinessCodes(code):
    return str(code).zfill(4)

formatBusinessCodes(businessformat)
I want to build a small formatter in Python giving me back the numeric values embedded in lines of hex strings. It is a central part of my formatter and should be reasonably fast: it has to format more than 100 lines/sec (each line about ~100 chars).
The code below should give an example of where I'm currently blocked. 'data_string_in_orig' shows the given input format. It has to be byte-swapped for each word; the swap from 'data_string_in_orig' to 'data_string_in_swapped' is what is needed. In the end I need the structure access as shown. The expected result is within the comment.
Thanks in advance,
Wolfgang R
#!/usr/bin/python
import binascii
import struct
## 'uint32 double'
data_string_in_orig = 'b62e000052e366667a66408d'
data_string_in_swapped = '2eb60000e3526666667a8d40'
print data_string_in_orig
packed_data = binascii.unhexlify(data_string_in_swapped)
s = struct.Struct('<Id')
unpacked_data = s.unpack_from(packed_data, 0)
print 'Unpacked Values:', unpacked_data
## Unpacked Values: (46638, 943.29999999943209)
exit(0)
array.arrays have a byteswap method:
import binascii
import struct
import array
x = binascii.unhexlify('b62e000052e366667a66408d')
y = array.array('h', x)
y.byteswap()
s = struct.Struct('<Id')
print(s.unpack_from(y))
# (46638, 943.2999999994321)
The h in array.array('h', x) was chosen because it tells array.array to regard the data in x as an array of 2-byte shorts. The important thing is that each item be regarded as being 2-bytes long. H, which signifies 2-byte unsigned short, works just as well.
This should do exactly what unutbu's version does, but might be slightly easier to follow for some...
from binascii import unhexlify
from struct import pack, unpack
orig = unhexlify('b62e000052e366667a66408d')
swapped = pack('<6h', *unpack('>6h', orig))
print unpack('<Id', swapped)
# (46638, 943.2999999994321)
Basically, unpack 6 shorts big-endian, repack as 6 shorts little-endian.
Again, same thing that unutbu's code does, and you should use his.
edit Just realized I get to use my favorite Python idiom for this... Don't do this either:
orig = 'b62e000052e366667a66408d'
swap = ''.join(sum([(c,d,a,b) for a,b,c,d in zip(*[iter(orig)]*4)], ()))
# '2eb60000e3526666667a8d40'
The swap from 'data_string_in_orig' to 'data_string_in_swapped' may also be done with comprehensions without using any imports:
>>> d = 'b62e000052e366667a66408d'
>>> "".join([m[2:4]+m[0:2] for m in [d[i:i+4] for i in range(0,len(d),4)]])
'2eb60000e3526666667a8d40'
The comprehension works for swapping byte order in hex strings representing 16-bit words. Modifying it for a different word-length is trivial. We can make a general hex digit order swap function also:
def swap_order(d, wsz=4, gsz=2):
    return "".join(["".join([m[i:i+gsz] for i in range(wsz-gsz, -gsz, -gsz)])
                    for m in [d[i:i+wsz] for i in range(0, len(d), wsz)]])
The input params are:
d : the input hex string
wsz: the word-size in nibbles (e.g for 16-bit words wsz=4, for 32-bit words wsz=8)
gsz: the number of nibbles which stay together (e.g for reordering bytes gsz=2, for reordering 16-bit words gsz = 4)
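For example, with the function above:

>>> swap_order('b62e000052e366667a66408d')           # swap bytes within 16-bit words
'2eb60000e3526666667a8d40'
>>> swap_order('b62e000052e366667a66408d', wsz=8)    # swap bytes within 32-bit words
'00002eb66666e3528d40667a'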
import array
import binascii
from tkinter import filedialog

infile = filedialog.askopenfilename()
with open(infile, 'rb') as infile_:
    infile_read = infile_.read()

y = array.array('l', infile_read)
y.byteswap()
swapped = binascii.hexlify(y)
This is a 32-bit swap (type code 'l' is a signed long, 4 bytes on Windows) achieved with code very much the same as unutbu's answer, just a little bit easier to understand. And technically binascii is not needed for the swap; only array.byteswap is needed.
For a project in one of my classes we have to output numbers to five decimal places. It is possible that the output will be a complex number, and I am unable to figure out how to output a complex number with five decimal places. For floats I know it is just:
print "%0.5f"%variable_name
Is there something similar for complex numbers?
You could do it as shown below using the str.format() method:
>>> n = 3.4+2.3j
>>> n
(3.4+2.3j)
>>> '({0.real:.2f} + {0.imag:.2f}i)'.format(n)
'(3.40 + 2.30i)'
>>> '({c.real:.2f} + {c.imag:.2f}i)'.format(c=n)
'(3.40 + 2.30i)'
To make it handle both positive and negative imaginary portions properly, you would need an (even more) complicated formatting operation:
>>> n = 3.4-2.3j
>>> n
(3.4-2.3j)
>>> '({0:.2f} {1} {2:.2f}i)'.format(n.real, '+-'[n.imag < 0], abs(n.imag))
'(3.40 - 2.30i)'
Update - Easier Way
Although you cannot use f as a presentation type for complex numbers using the string formatting operator %:
n1 = 3.4+2.3j
n2 = 3.4-2.3j

try:
    print('test: %.2f' % n1)
except Exception as exc:
    print('{}: {}'.format(type(exc).__name__, exc))
Output:
TypeError: float argument required, not complex
You can however use it with complex numbers via the str.format() method. This isn't explicitly documented, but is implied by the Format Specification Mini-Language documentation which just says:
'f' Fixed point. Displays the number as a fixed-point number. The default precision is 6.
. . .so it's easy to overlook.
In concrete terms, the following works in both Python 2.7.14 and 3.4.6:
print('n1: {:.2f}'.format(n1))
print('n2: {:.2f}'.format(n2))
Output:
n1: 3.40+2.30j
n2: 3.40-2.30j
This doesn't give you quite the control the code in my original answer does, but it's certainly much more concise (and handles both positive and negative imaginary parts automatically).
Update 2 - f-strings
Formatted string literals (aka f-strings) were added in Python 3.6, which means it could also be done like this in that version or later:
print(f'n1: {n1:.2f}') # -> n1: 3.40+2.30j
print(f'n2: {n2:.3f}') # -> n2: 3.400-2.300j
In Python 3.8.0, support for an = specifier was added to f-strings, allowing you to write:
print(f'{n1=:.2f}') # -> n1=3.40+2.30j
print(f'{n2=:.3f}') # -> n2=3.400-2.300j
Neither String Formatting Operations (i.e. the modulo (%) operator) nor the newer str.format() Format String Syntax supports complex types.
However it is possible to call the __format__ method of all built in numeric types directly.
Here is an example:
>>> i = -3 # int
>>> l = -33L # long (only Python 2.X)
>>> f = -10./3 # float
>>> c = -1./9 - 2.j/9  # complex
>>> [x.__format__('.3f') for x in (i, l, f, c)]
['-3.000', '-33.000', '-3.333', '-0.111-0.222j']
Note, that this works well with negative imaginary parts too.
For questions like this, the Python documentation should be your first stop. Specifically, have a look at the section on string formatting. It lists all the string format codes; there isn't one for complex numbers.
What you can do is format the real and imaginary parts of the number separately, using x.real and x.imag, and print it out in a + bi form.
>>> n = 3.4 + 2.3j
>>> print '%.5f %.5fi' % (n.real, n.imag)
3.40000 2.30000i
As of Python 2.6 you can define how objects of your own classes respond to format strings. So, you can define a subclass of complex that can be formatted. Here's an example:
>>> class Complex_formatted(complex):
...     def __format__(self, fmt):
...         cfmt = "({:" + fmt + "}{:+" + fmt + "}j)"
...         return cfmt.format(self.real, self.imag)
...
>>> z1 = Complex_formatted(.123456789 + 123.456789j)
>>> z2 = Complex_formatted(.123456789 - 123.456789j)
>>> "My complex numbers are {:0.5f} and {:0.5f}.".format(z1, z2)
'My complex numbers are (0.12346+123.45679j) and (0.12346-123.45679j).'
>>> "My complex numbers are {:0.6f} and {:0.6f}.".format(z1, z2)
'My complex numbers are (0.123457+123.456789j) and (0.123457-123.456789j).'
Objects of this class behave exactly like complex numbers except they take more space and operate more slowly; reader beware.
Check this out:
import numpy as np
np.set_printoptions(precision=2)  # show floats in numpy arrays rounded to 2 decimals
I've successfully printed my complex-float expressions:
# Show poles and zeros
print( "zeros = ", zeros_H , "\n")
print( "poles = ", poles_H )
out before:
zeros = [-0.8 +0.6j -0.8 -0.6j -0.66666667+0.j ]
poles = [-0.81542318+0.60991027j -0.81542318-0.60991027j -0.8358203 +0.j ]
out after:
zeros = [-0.8 +0.6j -0.8 -0.6j -0.67+0.j ]
poles = [-0.82+0.61j -0.82-0.61j -0.84+0.j ]
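A self-contained version of the snippet (zeros_H and poles_H are filled in with the values from the output above, just so it runs on its own):

import numpy as np

np.set_printoptions(precision=2)
zeros_H = np.array([-0.8+0.6j, -0.8-0.6j, -0.66666667+0j])
poles_H = np.array([-0.81542318+0.60991027j, -0.81542318-0.60991027j, -0.8358203+0j])

print("zeros = ", zeros_H, "\n")
print("poles = ", poles_H)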