Python: read_csv into dataframe - can't convert object into float - python

The csv file I am feeding into read_csv is a couple columns with percentage changes but it has some hidden characters. From repr(data2):
I tried the following:
data2 = pd.read_csv('C:/Users/nnayyar/Documents/MonteCarlo2.csv', "\n", delimiter = ",", dtype = float)
And got the following error:
ValueError: invalid literal for float(): 7.05%
I tried a few things:
float(data2.replace('/n',''))
map(float, data2.strip().split('\r\n'))
But received various errors respectively
TypeError: float() argument must be a string or a number
AttributeError: 'DataFrame' object has no attribute 'strip'
Any help converting the CSV object type into float type would be appreciated. Thanks!

If your entire csv has percentage signs then the following will work:
In [203]:
import pandas as pd
import io
t="""0 1 2 3
1.5% 2.5% 6.5% 0.5%"""
# load some dummy data
df = pd.read_csv(io.StringIO(t), delim_whitespace=True)
df
Out[203]:
0 1 2 3
0 1.5% 2.5% 6.5% 0.5%
In [205]:
# apply a lambda that replaces the % signs and cast to float
df.apply(lambda x: x.str.replace('%','')).astype(float)
Out[205]:
0 1 2 3
0 1.5 2.5 6.5 0.5
So this applies a lambda to each column that calls the vectorised str.replace to remove the % sign; we can then convert the type to float using astype.
So in your case the following should work:
# the default comma separator is enough here; passing "\n" as the separator is not needed
data2 = pd.read_csv('C:/Users/nnayyar/Documents/MonteCarlo2.csv')
data2 = data2.apply(lambda x: x.str.replace('%', '').astype(float))
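If some cells carry stray whitespace around the value, a slightly more defensive variant may help. This is only a sketch on made-up data: the column names a and b and the sample values are assumptions, not taken from the original file. It reads everything as text, strips whitespace and a trailing %, and lets pd.to_numeric do the conversion.

```python
import io

import pandas as pd

# Hypothetical sample standing in for the real CSV file.
csv_text = "a,b\n7.05%,1.20%\n-0.50%, 3.00% \n"

# Read every column as text so nothing is converted prematurely.
df = pd.read_csv(io.StringIO(csv_text), dtype=str)

# Strip surrounding whitespace and the trailing '%', then convert to numbers.
cleaned = df.apply(lambda col: pd.to_numeric(col.str.strip().str.rstrip("%")))

# Optional: express the percentages as fractions (7.05% becomes 0.0705).
fractions = cleaned / 100

print(cleaned.dtypes)
print(fractions)
```

pd.to_numeric raises on any cell it cannot parse, which makes leftover hidden characters easier to spot than a silent NaN.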

Related

Regex Validation Not working for large Numbers in column Pandas

I am trying to validate columns against a particular regex in a dataframe. The limit for a number is (20,3), i.e. a maximum length of 20 for an int or 23 for a float. But pandas is converting the original number to a different number, and my regex validation fails. I have checked that my regex is correct.
Dataframe :
FirstColumn,SecondColumn,ThirdColumn
111900987654123.123,111900987654123.123,111900987654123.123
111900987654123.12,111900987654123.12,111900987654123.12
111900987654123.1,111900987654123.1,111900987654123.1
111900987654123,111900987654123,111900987654123
111900987654123,-111900987654123,-111900987654123
-111900987654123.123,-111900987654123.123,-111900987654123.1
-111900987654123.12,-111900987654123.12,-111900987654123.12
-111900987654123.1,-111900987654123.1,-111900987654123.1
11119009876541231111,1111900987654123,1111900987654123
Code:
NumberValidationRegexnegative = r"^-?[0-9]{1,20}(?:\.[0-9]{1,3})?$"
df_CPCodeDF=pd.read_csv("D:\\FTP\\LocalUser\\NCCLCOLL\\COLLATERALUPLOAD\\upld\\SplitFiles\\AACCR6675H_22102021_07_1 - Copy.csv")
pd.set_option('display.float_format', '{:.3f}'.format)
rslt_df2=df_CPCodeDF[df_CPCodeDF.iloc[:, 0].notna()]
rslt_df1=rslt_df2[~rslt_df2.iloc[:,0].apply(str).str.contains(NumberValidationRegexnegative, regex=True)].index
print("rslt_df1",rslt_df1)
Output Result:
rslt_df1 Int64Index([8], dtype='int64')
Expected Result:
rslt_df1 Int64Index([], dtype='int64')
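Use dtype=str as a parameter of pd.read_csv. Without it, pandas parses the first column as float64, so the 20-digit value in row 8 loses precision and str() renders it in exponent notation (something like 1.1119009876541231e+19), which the regex rejects: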
NumberValidationRegexnegative = r"^-?[0-9]{1,20}(?:\.[0-9]{1,3})?$"
df_CPCodeDF = pd.read_csv("data.csv", dtype=str)
rslt_df2 = df_CPCodeDF[df_CPCodeDF.iloc[:, 0].notna()]
rslt_df1 = rslt_df2[~rslt_df2.iloc[:,0] \
.str.contains(NumberValidationRegexnegative, regex=True)].index
Output:
>>> print("rslt_df1", rslt_df1)
rslt_df1 Int64Index([], dtype='int64')
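To see the difference between the two ways of reading the file, here is a small sketch on a two-row sample taken from the data above. The in-memory csv_text and the variable names are my own; the regex is the one from the question.

```python
import io

import pandas as pd

# Two rows from the question's first column, held in memory for the demo.
csv_text = "FirstColumn\n111900987654123.123\n11119009876541231111\n"
pattern = r"^-?[0-9]{1,20}(?:\.[0-9]{1,3})?$"

# Default parsing: the column becomes float64.
as_float = pd.read_csv(io.StringIO(csv_text))
float_strings = as_float["FirstColumn"].apply(str)
bad_float = as_float[~float_strings.str.contains(pattern)].index

# dtype=str: the text is kept exactly as written in the file.
as_text = pd.read_csv(io.StringIO(csv_text), dtype=str)
bad_text = as_text[~as_text["FirstColumn"].str.contains(pattern)].index

print(float_strings.tolist())  # the 20-digit value shows up in exponent notation
print(list(bad_float), list(bad_text))
```

With the float column the large value fails validation, while the text column passes, because the regex then sees the original digits.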

Type error when trying to multiply two Lists in Python

I am trying to write a simple program where a new column is added to an existing dataframe. The new column is created by multiplying values of two existing columns.
This is the code I have written :
import pandas as pd
import numpy as np
data={'Booking Code':['B001','B002','B003','B004','B005'],'Customer Name':['Veer','Umesh','Lavanya','Shobhna','Piyush'],
'No. of Tickets':[4,2,6,5,3], 'Ticket Rate':[100,200,150,250,100],'Booking Clerk':['Manish','Kishor','Manish','John','Kishor']}
df=pd.DataFrame(data)
df=df.to_string(index=False)
print(df)
totalamount=[int(df['No. of Tickets'])* int(df['Ticket Rate']) ]
df['Total Amounts']=totalamount
print(df)
Even though I've used the int() method to convert the values back to integer, it still gives the type error, the exact error being:
Traceback (most recent call last):
File "File Path", line 11, in <module>
totalamount=[int(df['No. of Tickets'])* int(df['Ticket Rate']) ]
TypeError: string indices must be integers
Earlier, when I did not have the df=df.to_string(index=False) line and also did not use the int() function, there wasn't any error. The list was multiplied, although it printed in this manner:
[0 400
1 400
2 900
3 1250
4 300
dtype: int64]
But further in the code, where I try to add the list to the DataFrame, it gives the error ValueError: Length of values (1) does not match length of index (5)
I tried to look for other ways to do this, but can't seem to find any. Thank you in advance!
You can try this short answer:
df['Total Amounts'] = df.apply(lambda x: x['No. of Tickets'] * x['Ticket Rate'], axis=1)
output:
# print(df['Total Amounts'])
0 400
1 400
2 900
3 1250
4 300
Name: Total Amounts, dtype: int64
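The row-wise apply works, but pandas can also multiply the two columns directly, which is usually shorter and faster. A minimal sketch using only the two numeric columns from the question:

```python
import pandas as pd

# Only the two numeric columns from the question are needed for the demo.
df = pd.DataFrame({
    "No. of Tickets": [4, 2, 6, 5, 3],
    "Ticket Rate": [100, 200, 150, 250, 100],
})

# Element-wise multiplication of two Series gives a Series aligned on the index.
df["Total Amounts"] = df["No. of Tickets"] * df["Ticket Rate"]

print(df)
```

Note that the product is assigned as a Series. Wrapping it in a list, as in the question, produces a one-element list, which is what triggers the "Length of values (1)" error.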
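You converted your df to a string and reassigned it to df, so df is a plain str from then on, and indexing it with a column name raises TypeError: string indices must be integers. Keep the DataFrame and store the string in a different variable: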
import pandas as pd
import numpy as np
data={'Booking Code':['B001','B002','B003','B004','B005'],'Customer Name':['Veer','Umesh','Lavanya','Shobhna','Piyush'],
'No. of Tickets':[4,2,6,5,3], 'Ticket Rate':[100,200,150,250,100],'Booking Clerk':['Manish','Kishor','Manish','John','Kishor']}
df = pd.DataFrame(data)
string_df = df.to_string(index=False) #Assign it to another variable!
print(string_df)
totalamount = df['No. of Tickets'] * df['Ticket Rate'] # a Series with one value per row, not a one-element list
df['Total Amounts'] = totalamount
print(df)

Reading decimal representation floats from a CSV with pandas

I am trying to read in the contents of a CSV file containing what I believe are IEEE 754 single precision floats, in decimal format.
By default, they are read in as int64. If I specify the data type with something like dtype = {'col1' : np.float32}, the dtype shows up correctly as float32, but they are just the same values as a float instead of an int, ie. 1079762502 becomes 1.079763e+09 instead of 3.435441493988037.
I have managed to do the conversion on single values with either of the following:
from struct import unpack
v = 1079762502
print(unpack('>f', v.to_bytes(4, byteorder="big")))
print(unpack('>f', bytes.fromhex(str(hex(v)).split('0x')[1])))
Which produces
(3.435441493988037,)
(3.435441493988037,)
However, I can't seem to implement this in a vectorised way with pandas:
import pandas as pd
from struct import unpack
df = pd.read_csv('experiments/test.csv')
print(df.dtypes)
print(df)
df['col1'] = unpack('>f', df['col1'].to_bytes(4, byteorder="big"))
#df['col1'] = unpack('>f', bytes.fromhex(str(hex(df['col1'])).split('0x')[1]))
print(df)
Throws the following error
col1 int64
dtype: object
col1
0 1079762502
1 1079345162
2 1078565306
3 1078738012
4 1078635652
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-8-c06d0986cc96> in <module>
7 print(df)
8
----> 9 df['col1'] = unpack('>f', df['col1'].to_bytes(4, byteorder="big"))
10 #df['col1'] = unpack('>f', bytes.fromhex(str(hex(df['col1'])).split('0x')[1]))
11
~/anaconda3/envs/test/lib/python3.7/site-packages/pandas/core/generic.py in __getattr__(self, name)
5177 if self._info_axis._can_hold_identifiers_and_holds_name(name):
5178 return self[name]
-> 5179 return object.__getattribute__(self, name)
5180
5181 def __setattr__(self, name, value):
AttributeError: 'Series' object has no attribute 'to_bytes'
Or if I try the second way, TypeError: 'Series' object cannot be interpreted as an integer
I am at the limits of my Python knowledge here. I suppose I could iterate through every single row, cast to hex, then to string, then strip the 0x, unpack and store. But that seems very convoluted, and it already takes several seconds on smaller sample datasets, let alone for hundreds of thousands of entries. Am I missing something simple here? Is there a better way of doing this?
CSV is a text format; IEEE 754 single precision floats are a binary numeric format. If you have a CSV, you have text, not that binary format. If I understand you correctly, you have text representing integers (in decimal format) that correspond to a 32-bit integer interpretation of your 32-bit floats.
So, for starters, when you read the data from a CSV, pandas uses 64-bit integers by default. Convert to 32-bit integers, then re-interpret the bytes using .view:
In [8]: df
Out[8]:
col1
0 1079762502
1 1079345162
2 1078565306
3 1078738012
4 1078635652
In [9]: df.col1.astype(np.int32).view('f')
Out[9]:
0 3.435441
1 3.335940
2 3.150008
3 3.191184
4 3.166780
Name: col1, dtype: float32
Decomposed into steps to help understand:
In [10]: import numpy as np
In [11]: arr = df.col1.values
In [12]: arr
Out[12]: array([1079762502, 1079345162, 1078565306, 1078738012, 1078635652])
In [13]: arr.dtype
Out[13]: dtype('int64')
In [14]: arr_32 = arr.astype(np.int32)
In [15]: arr_32
Out[15]:
array([1079762502, 1079345162, 1078565306, 1078738012, 1078635652],
dtype=int32)
In [16]: arr_32.view('f')
Out[16]:
array([3.4354415, 3.33594 , 3.1500077, 3.191184 , 3.1667795],
dtype=float32)
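The same idea can be written as a short script that goes through NumPy rather than Series.view (which, as far as I know, recent pandas releases deprecate). This is a sketch under two assumptions of mine: the values are built in memory instead of read from test.csv, and uint32 is used instead of int32 so that bit patterns with the sign bit set (negative floats) also fit.

```python
import numpy as np
import pandas as pd

# Sample values from the question, built in memory instead of read from a file.
df = pd.DataFrame(
    {"col1": [1079762502, 1079345162, 1078565306, 1078738012, 1078635652]}
)

# Narrow the int64 column to 32 bits; the numeric values are unchanged.
bits = df["col1"].to_numpy(dtype=np.uint32)

# Reinterpret the same 4 bytes per value as float32 (no numeric conversion).
df["col1_float"] = bits.view(np.float32)

print(df)
```

The narrowing step matters: calling .view on the int64 data directly would split each 8-byte value into two float32 values instead of one.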

Pandas error: Can only use .str accessor with string values, which use np.object_ dtype in pandas

I have data in my .txt file as below:
029070 ***** 190101010600 270 36 OVC ** 0.0 ** **
I want to extract 190101 from column 3, but I am getting AttributeError: Can only use .str accessor with string values, which use np.object_ dtype in pandas. Below is my code:
import pandas as pd
import numpy as np
import re
data = pd.read_csv('dummy.txt', sep=" ", low_memory=False, header=None)
data.columns = ["a", "b", "c","d","e","f","g","h","i","j"]
print(data.c.str[0:6])
The problem here is that when you read your txt file, pandas casts "c" as an integer, and the .str accessor does not work with non-string dtypes. You can fix this in a couple of ways:
Option 1
Cast the integer as a string in the print statement.
print(data.c.astype(str).str[0:6])
0 190101
Name: c, dtype: object
Option 2
Cast as a string while reading into the dataframe, using the dtype parameter of read_csv.
data = pd.read_csv('dummy.txt', sep=' ', header=None, dtype={2:'str'})
data.columns = list('abcdefghij')
print(data.c.str[0:6])
0 190101
Name: c, dtype: object
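If several columns are really identifiers rather than numbers, it may be simpler to read the whole file as text. A sketch assuming the single sample line from the question, held in memory rather than in dummy.txt:

```python
import io

import pandas as pd

# The sample line from the question, held in memory instead of dummy.txt.
raw = "029070 ***** 190101010600 270 36 OVC ** 0.0 ** **\n"

# dtype=str keeps every field as text, so the .str accessor works on any column.
data = pd.read_csv(io.StringIO(raw), sep=" ", header=None, dtype=str)
data.columns = list("abcdefghij")

date_part = data["c"].str[:6]
print(date_part)
print(data["a"])
```

A side effect is that column a keeps its leading zero (029070), which would be lost if pandas parsed it as an integer.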

Convert DataFrame string complex i to j python

I have this type of DataFrame that I wish to use. But because the data I imported uses the letter i for the imaginary part of the complex number, Python doesn't allow me to convert it to a float.
5.0 0.01511+0.0035769i
5.0298 0.015291+0.0075383i
5.0594 0.015655+0.0094534i
5.0874 0.012456+0.011908i
5.1156 0.015332+0.011174i
5.1458 0.015758+0.0095832i
How can I proceed to change the i to j in each row of the DataFrame?
Thank you.
If you have a string like this: complexStr = "0.015291+0.0075383i", you could do:
complexFloat = complex(complexStr[:-1] + 'j')
If your data is a string like this: line = "5.0 0.01511+0.0035769i", you have to separate the first part, like this:
number, complexStr = line.split()
complexFloat = complex(complexStr[:-1] + 'j')
>>> complexFloat
(0.01511+0.0035769j)
>>> type(complexFloat)
<class 'complex'>
I'm not sure how you obtain your dataframe, but if you're reading it from a text file with a suitable header, then you can use a converter function to sort out the 'i' -> 'j' so that your dtype is created properly:
For file test.df:
a b
5.0 0.01511+0.0035769i
5.0298 0.015291+0.0075383i
5.0594 0.015655+0.0094534i
5.0874 0.012456+0.011908i
5.1156 0.015332+0.011174i
5.1458 0.015758+0.0095832i
the code
import pandas as pd
df = pd.read_table('test.df', delimiter=r'\s+',
converters={'b': lambda v: complex(v.replace('i','j'))}
)
gives df as:
a b
0 5.0000 (0.01511+0.0035769j)
1 5.0298 (0.015291+0.0075383j)
2 5.0594 (0.015655+0.0094534j)
3 5.0874 (0.012456+0.011908j)
4 5.1156 (0.015332+0.011174j)
5 5.1458 (0.015758+0.0095832j)
with column dtypes:
a float64
b complex128
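If the data is already sitting in a DataFrame as strings, the same replacement can be applied to the column directly. A minimal sketch, assuming the column names a and b from the example above and a three-row sample built in memory:

```python
import pandas as pd

# A few rows from the question, with the complex values still stored as text.
df = pd.DataFrame({
    "a": [5.0, 5.0298, 5.0594],
    "b": ["0.01511+0.0035769i", "0.015291+0.0075383i", "0.015655+0.0094534i"],
})

# Swap the trailing 'i' for 'j' (literal replace), then let Python parse each value.
df["b"] = df["b"].str.replace("i", "j", regex=False).apply(complex)

print(df)
print(df.dtypes)
```

This relies on 'i' appearing only as the imaginary unit in the column; values containing other letters would need a stricter replacement.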
