I am trying to read in the contents of a CSV file containing what I believe are IEEE 754 single precision floats, in decimal format.
By default, they are read in as int64. If I specify the data type with something like dtype = {'col1' : np.float32}, the dtype shows up correctly as float32, but the values are simply the same integers cast to float, i.e. 1079762502 becomes 1.079763e+09 instead of 3.435441493988037.
I have managed to do the conversion on single values with either of the following:
from struct import unpack
v = 1079762502
print(unpack('>f', v.to_bytes(4, byteorder="big")))
print(unpack('>f', bytes.fromhex(str(hex(v)).split('0x')[1])))
Which produces
(3.435441493988037,)
(3.435441493988037,)
However, I can't seem to implement this in a vectorised way with pandas:
import pandas as pd
from struct import unpack
df = pd.read_csv('experiments/test.csv')
print(df.dtypes)
print(df)
df['col1'] = unpack('>f', df['col1'].to_bytes(4, byteorder="big"))
#df['col1'] = unpack('>f', bytes.fromhex(str(hex(df['col1'])).split('0x')[1]))
print(df)
Throws the following error
col1 int64
dtype: object
col1
0 1079762502
1 1079345162
2 1078565306
3 1078738012
4 1078635652
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-8-c06d0986cc96> in <module>
7 print(df)
8
----> 9 df['col1'] = unpack('>f', df['col1'].to_bytes(4, byteorder="big"))
10 #df['col1'] = unpack('>f', bytes.fromhex(str(hex(df['col1'])).split('0x')[1]))
11
~/anaconda3/envs/test/lib/python3.7/site-packages/pandas/core/generic.py in __getattr__(self, name)
5177 if self._info_axis._can_hold_identifiers_and_holds_name(name):
5178 return self[name]
-> 5179 return object.__getattribute__(self, name)
5180
5181 def __setattr__(self, name, value):
AttributeError: 'Series' object has no attribute 'to_bytes'
Or if I try the second way, TypeError: 'Series' object cannot be interpreted as an integer
I am at the limits of my Python knowledge here. I suppose I could iterate through every single row, cast to hex, then to string, then strip the 0x, unpack and store. But that seems very convoluted, and already takes several seconds on smaller sample datasets, let alone for hundreds of thousands of entries. Am I missing something simple here? Is there a better way of doing this?
CSV is a text format, while IEEE 754 single-precision floats are a binary numeric format. If you have a CSV, you have text, not binary floats. If I understand you correctly, you have text representing integers (in decimal format) that are the 32-bit integer interpretation of the bit patterns of your 32-bit floats.
So, for starters, when you read the data from a CSV, pandas uses 64-bit integers by default. Convert to 32-bit integers first, then reinterpret the bytes using .view:
In [8]: df
Out[8]:
col1
0 1079762502
1 1079345162
2 1078565306
3 1078738012
4 1078635652
In [9]: df.col1.astype(np.int32).view('f')
Out[9]:
0 3.435441
1 3.335940
2 3.150008
3 3.191184
4 3.166780
Name: col1, dtype: float32
Decomposed into steps to help understand:
In [10]: import numpy as np
In [11]: arr = df.col1.values
In [12]: arr
Out[12]: array([1079762502, 1079345162, 1078565306, 1078738012, 1078635652])
In [13]: arr.dtype
Out[13]: dtype('int64')
In [14]: arr_32 = arr.astype(np.int32)
In [15]: arr_32
Out[15]:
array([1079762502, 1079345162, 1078565306, 1078738012, 1078635652],
dtype=int32)
In [16]: arr_32.view('f')
Out[16]:
array([3.4354415, 3.33594 , 3.1500077, 3.191184 , 3.1667795],
dtype=float32)
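On recent pandas versions, where I believe Series.view is deprecated, the same reinterpretation can be done on the underlying NumPy array instead. This is only a sketch: the sample values are the ones from the question, the column name col1_float is made up for illustration, and using uint32 is my assumption so that bit patterns with the sign bit set would also fit.

```python
import struct

import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [1079762502, 1079345162, 1078565306,
                            1078738012, 1078635652]})

# Narrow int64 -> 32-bit integers (assumes every value fits in 32 bits).
bits = df['col1'].to_numpy(dtype=np.uint32)

# Reinterpret the same 4 bytes per element as float32; no numeric conversion.
df['col1_float'] = bits.view(np.float32)

# Cross-check the first element against the scalar struct approach.
expected_first = struct.unpack('<f', struct.pack('<I', 1079762502))[0]
print(df)
```

The view itself copies nothing; the only copy is the narrowing from 64-bit to 32-bit integers.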
I am trying to write a simple program where a new column is added to an existing dataframe. The new column is created by multiplying values of two existing columns.
This is the code I have written :
import pandas as pd
import numpy as np
data={'Booking Code':['B001','B002','B003','B004','B005'],'Customer Name':['Veer','Umesh','Lavanya','Shobhna','Piyush'],
'No. of Tickets':[4,2,6,5,3], 'Ticket Rate':[100,200,150,250,100],'Booking Clerk':['Manish','Kishor','Manish','John','Kishor']}
df=pd.DataFrame(data)
df=df.to_string(index=False)
print(df)
totalamount=[int(df['No. of Tickets'])* int(df['Ticket Rate']) ]
df['Total Amounts']=totalamount
print(df)
Even though I've used the int() method to convert the values back to integer, it still gives the type error, the exact error being:
Traceback (most recent call last):
File "File Path", line 11, in <module>
totalamount=[int(df['No. of Tickets'])* int(df['Ticket Rate']) ]
TypeError: string indices must be integers
Earlier, when I did not have the line df=df.to_string(index=False) and also did not use the int() function, there wasn't any error. The list was multiplied, although it printed in this manner:
[0 400
1 400
2 900
3 1250
4 300
dtype: int64]
But further in the code where I try to add the list to the Dataframe it gives the error ValueError: Length of values (1) does not match length of index (5)
I tried to look for other ways to do this, but can't seem to find any. Thank you in advance!
you can try this short answer:
df['Total Amounts'] = df.apply(lambda x: x['No. of Tickets'] * x['Ticket Rate'], axis=1)
output:
# print(df['Total Amounts'])
0 400
1 400
2 900
3 1250
4 300
Name: Total Amounts, dtype: int64
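You converted your df to a string and assigned the result back to df, so df['No. of Tickets'] then indexes a str instead of a DataFrame. Assign the string to a different variable instead: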
import pandas as pd
import numpy as np
data={'Booking Code':['B001','B002','B003','B004','B005'],'Customer Name':['Veer','Umesh','Lavanya','Shobhna','Piyush'],
'No. of Tickets':[4,2,6,5,3], 'Ticket Rate':[100,200,150,250,100],'Booking Clerk':['Manish','Kishor','Manish','John','Kishor']}
df = pd.DataFrame(data)
string_df = df.to_string(index=False) #Assign it to another variable!
print(string_df)
totalamount = df['No. of Tickets'] * df['Ticket Rate']  # keep the Series; wrapping it in [...] gives a length-1 list
df['Total Amounts'] = totalamount
print(df)
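To see why the original attempt raised ValueError: Length of values (1) does not match length of index (5), compare the bare product with the same product wrapped in a list. A minimal sketch with only the two numeric columns from the question:

```python
import pandas as pd

df = pd.DataFrame({'No. of Tickets': [4, 2, 6, 5, 3],
                   'Ticket Rate': [100, 200, 150, 250, 100]})

# Element-wise product: a Series with one value per row.
product = df['No. of Tickets'] * df['Ticket Rate']

# Wrapping it in [...] makes a list with a single element (the whole Series),
# which is why pandas reports a length of 1 against an index of length 5.
wrapped = [product]

print(len(product), len(wrapped))

df['Total Amounts'] = product  # aligns on the index, one value per row
print(df)
```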
I have a CSV with ~10 columns. One of the columns holds bytes values, e.g. b'gAAAA234'. But when I read the file with pandas via .read_csv("file.csv"), this particular column comes back as strings rather than bytes, i.e. the string "b'gAAAA234'".
How do I simply read it as bytes without having to read it as string and then reconverting?
Currently, I'm working with this:
b = df['column_with_data_in_bytes'][i]
bb = bytes(b[2:len(b)-1],'utf-8')
#further processing of bytes
This works but I was hoping to find a more elegant/pythonic or more reliable way to do this?
You might consider parsing with ast.literal_eval:
import ast
df['column_with_data_in_bytes'] = df['column_with_data_in_bytes'].apply(ast.literal_eval)
Demo:
In [322]: df = pd.DataFrame({'Col' : ["b'asdfghj'", "b'ssdgdfgfv'", "b'asdsfg'"]})
In [325]: df
Out[325]:
Col
0 b'asdfghj'
1 b'ssdgdfgfv'
2 b'asdsfg'
In [326]: df.Col.apply(ast.literal_eval)
Out[326]:
0 asdfghj
1 ssdgdfgfv
2 asdsfg
Name: Col, dtype: object
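If you would rather not post-process the column after loading, read_csv accepts a converters mapping that is applied to each cell of the named column while parsing. A small sketch, where the in-memory CSV and the column names id and payload are made up for illustration:

```python
import ast
import io

import pandas as pd

# Hypothetical CSV text; the payload column holds Python bytes literals.
csv_text = "id,payload\n1,b'gAAAA234'\n2,b'asdfghj'\n"

# literal_eval only evaluates Python literals, so it does not execute code.
df = pd.read_csv(io.StringIO(csv_text),
                 converters={'payload': ast.literal_eval})

first = df['payload'].iloc[0]
print(type(first), first)
```

Note that literal_eval raises an exception on a cell that is not a valid literal, so empty or malformed cells would need handling of their own.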
The csv file I am feeding into read_csv is a couple columns with percentage changes but it has some hidden characters. From repr(data2):
I tried the following:
data2 = pd.read_csv('C:/Users/nnayyar/Documents/MonteCarlo2.csv', "\n", delimiter = ",", dtype = float)
And got the following error:
ValueError: invalid literal for float(): 7.05%
I tried a few things:
float(data2.replace('/n',''))
map(float, data2.strip().split('\r\n'))
But received various errors respectively
TypeError: float() argument must be a string or a number
AttributeError: 'DataFrame' object has no attribute 'strip'
Any help to get the CSV object type into float type would be helpful! Thanks!
If your entire csv has percentage signs then the following will work:
In [203]:
import pandas as pd
import io
t="""0 1 2 3
1.5% 2.5% 6.5% 0.5%"""
# load some dummy data
df = pd.read_csv(io.StringIO(t), sep=r'\s+')
df
Out[203]:
0 1 2 3
0 1.5% 2.5% 6.5% 0.5%
In [205]:
# apply a lambda that replaces the % signs and cast to float
df.apply(lambda x: x.str.replace('%','')).astype(float)
Out[205]:
0 1 2 3
0 1.5 2.5 6.5 0.5
So this applies a lambda to each column that calls the vectorised str.replace to remove the % sign; we can then convert the type to float using astype.
So in your case the following should work:
data2 = pd.read_csv('C:/Users/nnayyar/Documents/MonteCarlo2.csv')
data2 = data2.apply(lambda x: x.str.replace('%', '').astype(float))
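If only some columns contain percent signs, a variation is to leave the numeric columns alone and convert just the text ones. This is a sketch under that assumption; strip_percent and the sample data are hypothetical, not part of the original file:

```python
import io

import pandas as pd
from pandas.api.types import is_numeric_dtype


def strip_percent(frame):
    """Return a copy where text columns like '7.05%' become floats."""
    out = frame.copy()
    for name in out.columns:
        if not is_numeric_dtype(out[name]):
            # Drop surrounding whitespace (e.g. a stray '\r') and the trailing '%'.
            out[name] = out[name].str.strip().str.rstrip('%').astype(float)
    return out


csv_text = "trial,change\n1,7.05%\n2,-2.10%\n"
data = strip_percent(pd.read_csv(io.StringIO(csv_text)))
print(data.dtypes)
```

The values stay in percent units (7.05, not 0.0705); divide by 100 afterwards if fractions are needed.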
I have this type of DataFrame that I wish to use. But because the data I imported uses the letter i for the imaginary part of the complex number, Python doesn't allow me to convert it to a number.
5.0 0.01511+0.0035769i
5.0298 0.015291+0.0075383i
5.0594 0.015655+0.0094534i
5.0874 0.012456+0.011908i
5.1156 0.015332+0.011174i
5.1458 0.015758+0.0095832i
How can I proceed to change the i to j in each row of the DataFrame?
Thank you.
If you have a string like this: complexStr = "0.015291+0.0075383i", you could do:
complexFloat = complex(complexStr[:-1] + 'j')
If your data is a string like this: line = "5.0 0.01511+0.0035769i", you have to separate the first part, like this:
number, complexStr = line.split()
complexFloat = complex(complexStr[:-1] + 'j')
>>> complexFloat
(0.01511+0.0035769j)
>>> type(complexFloat)
<class 'complex'>
I'm not sure how you obtain your dataframe, but if you're reading it from a text file with a suitable header, then you can use a converter function to sort out the 'i' -> 'j' so that your dtype is created properly:
For file test.df:
a b
5.0 0.01511+0.0035769i
5.0298 0.015291+0.0075383i
5.0594 0.015655+0.0094534i
5.0874 0.012456+0.011908i
5.1156 0.015332+0.011174i
5.1458 0.015758+0.0095832i
the code
import pandas as pd
df = pd.read_table('test.df', delimiter=r'\s+',
                   converters={'b': lambda v: complex(v.replace('i', 'j'))}
                   )
gives df as:
a b
0 5.0000 (0.01511+0.0035769j)
1 5.0298 (0.015291+0.0075383j)
2 5.0594 (0.015655+0.0094534j)
3 5.0874 (0.012456+0.011908j)
4 5.1156 (0.015332+0.011174j)
5 5.1458 (0.015758+0.0095832j)
with column dtypes:
a float64
b complex128
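If the column is already in a DataFrame as strings, the same replacement can be applied after the fact. A minimal sketch using two rows from the question:

```python
import pandas as pd

df = pd.DataFrame({'a': [5.0, 5.0298],
                   'b': ['0.01511+0.0035769i', '0.015291+0.0075383i']})

# Swap 'i' for Python's 'j', then parse each string with complex().
df['b'] = df['b'].str.replace('i', 'j', regex=False).map(complex)
print(df)
```

This assumes the only 'i' in each string is the imaginary unit; values such as 'inf' would need separate handling.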
I want to convert a pandas DatetimeIndex to Excel dates (the number of days since 12/30/1899). I tried to use numpy.vectorize on a function that takes datetime64s and returns an Excel date. I was surprised by how numpy.vectorize behaves: on the first call, a test call to determine the return type, vectorize passes in a datetime64 as provided. On subsequent calls, it passes in the internal storage type of the datetime64, in my case a long. Internally, _get_ufunc_and_otypes calls:
inputs = [asarray(_a).flat[0] for _a in args]
outputs = func(*inputs)
While _vectorize_call does the following:
inputs = [array(_a, copy=False, subok=True, dtype=object)
for _a in args]
outputs = ufunc(*inputs)
As it turns out, I could just as easily use the internal numpy array math to do it (x - day0)/1day. But this behavior seems strange (type changing when a function is vectorized)
Here's my sample code:
import numpy
DATETIME64_ONE_DAY = numpy.timedelta64(1,'D')
DATETIME64_DATE_ZERO = numpy.datetime64('1899-12-30T00:00:00.000000000')
def excelDateToDatetime64(x):
    return DATETIME64_DATE_ZERO + numpy.timedelta64(int(x),'D')
def datetime64ToExcelDate(x):
    print(type(x))  # shows the type that is actually passed in
    return (x - DATETIME64_DATE_ZERO) / DATETIME64_ONE_DAY
excelDateToDatetime64_Array = numpy.vectorize(excelDateToDatetime64)
datetime64ToExcelDate_Array = numpy.vectorize(datetime64ToExcelDate)
excelDates = numpy.array([ 41407.0, 41408.0, 41409.0, 41410.0, 41411.0, 41414.0 ])
datetimes = excelDateToDatetime64_Array(excelDates)
excelDates2 = datetime64ToExcelDate(datetimes)
print(excelDates2) # Works fine
# The vectorized call below fails with:
# TypeError: ufunc subtract cannot use operands with types dtype('int64') and dtype('<M8[ns]')
# You can see from the print that the type coming in is inconsistent
excelDates2 = datetime64ToExcelDate_Array(datetimes)
Datetimes and timedeltas need to be handled through their underlying data, which are np.int64 values that you can get with arr.view('i8').
Define your constants in terms of their underlying values:
In [94]: DATETIME_DATE_ZERO_VIEW = DATETIME64_DATE_ZERO.view('i8')
In [95]: DATETIME_DATE_ZERO_VIEW
Out[95]: -2209161600000000000
In [96]: DATETIME64_ONE_DAY_VALUE = DATETIME64_ONE_DAY.astype('m8[ns]').item()
In [97]: DATETIME64_ONE_DAY_VALUE
Out[97]: 86400000000000L
In [106]: def vect(x):
   .....:     return (x-DATETIME_DATE_ZERO_VIEW)/DATETIME64_ONE_DAY_VALUE
   .....:
In [107]: f = np.vectorize(vect)
Pass in a view of the underlying np.int64
In [109]: f(datetimes.view('i8'))
Out[109]: array([41407, 41408, 41409, 41410, 41411, 41414])
Pandas way
In [98]: Series(datetimes).apply(lambda x: (x.value-DATETIME_DATE_ZERO_VIEW)/DATETIME64_ONE_DAY_VALUE)
Out[98]:
0 41407
1 41408
2 41409
3 41410
4 41411
5 41414
dtype: int64
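As the question itself notes, plain array arithmetic does this without np.vectorize, because NumPy defines subtraction and division for datetime64 and timedelta64 arrays. A sketch along those lines; the function and constant names are mine, and whole-day inputs are assumed for the forward conversion:

```python
import numpy as np

DATE_ZERO = np.datetime64('1899-12-30', 'ns')
ONE_DAY = np.timedelta64(1, 'D')


def excel_to_datetime64(excel_dates):
    # Whole days only: any fractional part is truncated by the int64 cast.
    days = np.asarray(excel_dates).astype('int64').astype('timedelta64[D]')
    return DATE_ZERO + days


def datetime64_to_excel(datetimes):
    # timedelta64 / timedelta64 yields plain floats (days, possibly fractional).
    return (np.asarray(datetimes, dtype='datetime64[ns]') - DATE_ZERO) / ONE_DAY


excel_dates = np.array([41407.0, 41408.0, 41409.0, 41410.0, 41411.0, 41414.0])
datetimes = excel_to_datetime64(excel_dates)
round_trip = datetime64_to_excel(datetimes)
print(round_trip)
```

For a pandas DatetimeIndex idx, (idx - pd.Timestamp('1899-12-30')) / pd.Timedelta(days=1) should give the same numbers.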