How to fix "ValueError: could not convert string to float" - python

I want to train an SVM on an X dataframe and a y series that I have.
The X dataframe is shown below:
x:
Timestamp Location of sensors Pressure or Flow values
0.00000 138.22, 1549.64 28.92
0.08333 138.22, 1549.64 28.94
0.16667 138.22, 1549.64 28.96
In the X dataframe, the sensor locations are given as node coordinates.
The y series is shown below:
y:
0
0
0
But when I fit the SVM to the training set, it raised ValueError: could not convert string to float: '361.51,1100.77', where (361.51, 1100.77) are the coordinates of a node.
Could you please give me some ideas on how to fix this problem?
Any advice would be much appreciated.

'361.51,1100.77' is actually two numbers, right? The code below names them latitude (361.51) and longitude (1100.77), although in your data they are plain node coordinates. You first need to split the string into two strings. Here is one way to do it:
import pandas as pd
data = pd.DataFrame(data=[[0, "138.22,1549.64", 28.92]], columns=["Timestamp", "coordinate", "flow"])
data["latitude"] = data["coordinate"].apply(lambda x: float(x.split(",")[0]))
data["longitude"] = data["coordinate"].apply(lambda x: float(x.split(",")[1]))
This gives you two new columns in your dataframe, each holding one of the values from the string as a float.
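The same split can also be done for the whole column at once. The following is a minimal sketch, assuming the coordinate strings always contain exactly one comma; the column names node_x and node_y and the sample rows are made up for illustration.

```python
import pandas as pd

# Hypothetical frame mirroring the question's layout; column names are assumed.
data = pd.DataFrame(
    {
        "Timestamp": [0.0, 0.08333],
        "coordinate": ["138.22, 1549.64", "361.51,1100.77"],
        "flow": [28.92, 28.94],
    }
)

# Split every string on the comma; expand=True returns one column per part.
parts = data["coordinate"].str.split(",", expand=True)

# Strip stray spaces (as in "138.22, 1549.64") before converting to float.
data["node_x"] = parts[0].str.strip().astype(float)
data["node_y"] = parts[1].str.strip().astype(float)

# Drop the original string column so only numeric features are passed on.
features = data.drop(columns=["coordinate"])
print(features.dtypes)
```

With the string column gone, every remaining column is numeric, which is what the SVM's fit step needs.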

I'm assuming that you are trying to convert the entire string "361.51,1100.77" into a float. You can see why that's a problem: Python sees two decimal points with a comma in between, so it has no idea what to do.
Assuming you want the numbers to be separate, you could do something like this:
myStr = "361.51,1100.77"
x = float(myStr[0:myStr.index(",")])
y = float(myStr[myStr.index(",")+1:])
print(x)
print(y)
Which would get you an output of
361.51
1100.77
Assigning x to be myStr[0:myStr.index(",")] takes a substring of the original string from index 0 up to (but not including) the first occurrence of a comma, getting you the first number.
Assigning y to be myStr[myStr.index(",")+1:] takes a substring of the original string starting just after the first comma and running to the end of the string, getting you the second number.
Both substrings can then be converted with the built-in float() function, getting you two separate floats.
Here is a helpful link to understand string slicing: https://www.geeksforgeeks.org/string-slicing-in-python/
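As a variation on the slicing above, str.split can do the same job in one step. This is only a sketch; parse_pair is a hypothetical helper name, and it assumes each string holds exactly two comma-separated numbers.

```python
def parse_pair(text):
    """Split 'x,y' into two floats; raises ValueError on malformed input."""
    # Unpacking into two names fails unless there is exactly one comma.
    first, second = text.split(",")
    return float(first), float(second)


x, y = parse_pair("361.51,1100.77")
print(x)
print(y)
```

Unlike the slicing version, this raises an error when a string contains more than one comma instead of silently passing the remainder to float().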

Related

Remove unwanted characters from Dataframe values in Pandas

I have the following Dataframe full of locus/gene names from a multiple genome alignment.
However, I am trying to get only a full list of the locus/name without the coordinates.
Tuberculosis_locus Smagmatis_locus H37RA_locus Bovis_locus
0 0:Rv0001:1-1524 1:MSMEG_RS33460:6986600-6988114 2:MRA_RS00005:1-1524 3:BQ2027_RS00005:1-1524
1 0:Rv0002:2052-3260 1:MSMEG_RS00005:499-1692 2:MRA_RS00010:2052-3260 3:BQ2027_RS00010:2052-3260
2 0:Rv0003:3280-4437 1:MSMEG_RS00015:2624-3778 2:MRA_RS00015:3280-4437 3:BQ2027_RS00015:3280-4437
To avoid issues with empty cells, I am filling cells with 'N/A' and then stripping the unwanted characters. But I get exactly the same result; nothing seems to happen.
for value in orthologs['Tuberculosis_locus']:
    orthologs['Tuberculosis_locus'] = orthologs['Tuberculosis_locus'].fillna("N/A")
    orthologs['Tuberculosis_locus'] = orthologs['Tuberculosis_locus'].map(lambda x: x.lstrip('\d:').rstrip(':\d+'))
Any idea on what I am doing wrong? I'd like the following output:
Tuberculosis_locus Smagmatis_locus H37RA_locus Bovis_locus
0 Rv0001 MSMEG_RS33460 MRA_RS00005 BQ2027_RS00005
1 Rv0002 MSMEG_RS00005 MRA_RS00010 BQ2027_RS00010
2 Rv0003 MSMEG_RS00015 MRA_RS00015 BQ2027_RS00015
Split on : with a maximum of two splits and then take the second element, e.g.:
df.applymap(lambda v: v.split(':', 2)[1])
def clean(x):
    # Keep only the middle field of "index:locus:coordinates"
    x = x.split(':')[1].strip()
    return x
orthologs = orthologs.applymap(clean)
should work.
Explanation:
applymap applies a function to every element of the dataframe, while apply works on a whole column (or row) at a time.
clean is the function you want to apply to every entry of the dataframe. Note that you pass the function itself, without the call parentheses and argument (x), when you hand it to applymap or apply.
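The original attempt had no effect because str.lstrip and str.rstrip take a set of characters to remove, not a regular expression, so '\d:' means the characters backslash, d and colon, none of which begins a value such as '0:Rv0001:1-1524'. A sketch of a variant that also tolerates empty cells, assuming every non-empty value has the form index:locus:coordinates (the sample rows below are shortened from the question):

```python
import pandas as pd

orthologs = pd.DataFrame(
    {
        "Tuberculosis_locus": ["0:Rv0001:1-1524", None],
        "Smagmatis_locus": [
            "1:MSMEG_RS33460:6986600-6988114",
            "1:MSMEG_RS00005:499-1692",
        ],
    }
)

# Series.str methods pass missing values through untouched,
# so no fillna("N/A") step is needed before splitting.
cleaned = orthologs.apply(lambda col: col.str.split(":").str[1])
print(cleaned)
```

This avoids applymap entirely, which may be convenient since pandas 2.1 deprecates DataFrame.applymap in favor of DataFrame.map.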

Remove scientific notation floats in a dataframe

I am receiving different series from a source. Some of those series have the values in big numbers (X billions). I then combine all the series to a dataframe with individual columns for each series.
Now, when I print the dataframe, the big numbers in the series are shown in scientific notation. Even printing the series individually shows the numbers in scientific notation.
Dataframe df (multiindex) output is:
Values
Item Sub
A 1 1.396567e+12
B 1 2.868929e+12
I have tried this:
pd.set_option('display.float_format', lambda x: '%,.2f' % x)
This doesn't work as:
it converts everywhere. I only need the conversion in that specific dataframe.
it tries to convert all kinds of floats, and not just those in scientific. So, even if the float is 89.142, it will try to convert the format and as there's no digit to put ',' it shows an error.
Then I tried these:
df.round(2)
This only converted numeric floats to 2 decimals from existing 3 decimals. Didn't do anything to scientific values.
Then I tried:
df.astypes(floats)
Doesn't do anything visible. Output stayed the same.
How else can we change the scientific notation to normal float digits inside the dataframe. I do not want to create a new list with the converted values. The dataframe itself should show the values in normal terms.
Can you guys please help me find a solution for this?
Thank you.
Try df['column'] = df['column'].astype(str). If that does not work, convert the numbers to strings before creating the pandas dataframe from your data.
I would suggest keeping everything as a float type and adjusting the display setting.
For example, I have generated a df with some random numbers.
df = pd.DataFrame({"Item": ["A", "B"], "Sub": [1,1],
"Value": [float(31132314122123.1), float(324231235232315.1)]})
# Item Sub Value
#0 A 1 3.113231e+13
#1 B 1 3.242312e+14
If we print df.dtypes, we can see that the Sub values are ints and the Value values are floats.
Item object
Sub int64
Value float64
dtype: object
You can then call pd.options.display.float_format = '{:.1f}'.format to suppress the scientific notation of the floats, while retaining the float format.
# Item Sub Value
#0 A 1 31132314122123.1
#1 B 1 324231235232315.1
Item object
Sub int64
Value float64
dtype: object
If you want the scientific notation back, you can call pd.reset_option('display.float_format')
Okay. I found something called option_context in pandas that lets you change the display options just for a particular case / action using a with statement.
with pd.option_context('display.float_format', '{:.2f}'.format):
    print(df)
So, we do not have to reset the options again as well as the options stay default for all other data in the file.
Sadly though, I could find no way to store different columns in different float format (for example one column with currency - comma separated and 2 decimals, while next column in percentage - non-comma and 2 decimals.)
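One possible way to get a different format per column at print time is the formatters argument of DataFrame.to_string, which takes a dict of column name to formatting function. This is a sketch only; the column names price and share and the sample values are invented for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "price": [1396567000000.0, 89.142],  # shown comma-separated, 2 decimals
        "share": [0.1234, 0.5],              # shown as a percentage, 2 decimals
    }
)

# Each formatter is applied only to its own column and only when rendering;
# the stored values stay float64.
text = df.to_string(
    formatters={
        "price": "{:,.2f}".format,
        "share": "{:.2%}".format,
    }
)
print(text)
```

The formats are not stored with the dataframe, so the dict has to be passed each time the frame is rendered.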

Need to convert an object column to int/float in a DataFrame so that I can do some operations on it later

The data set contains a Performance_UG column with values such as '90.00/100.00', '4.0/5.0' or '3.50/4.00', stored as object.
I need to extract the two numbers (e.g. 90 and 100), divide them, and save the result (e.g. 0.9) to a new column of the dataframe as a float.
How do I do that?
I'm assuming we are talking about pandas here. Each column (a Series) has a map function. You can call it on a column and assign the result to a new one. Perhaps something like:
def parseStuff(x):
    x = x.strip()
    # A bare '0' has no denominator, so return it as is
    if x == '0':
        return 0
    # Split "numerator/denominator" and divide
    a, b = [float(i) for i in x.split('/')]
    return a/b
df['Performance_UG_parsed'] = df['Performance_UG'].map(parseStuff)
If you want to replace the old column instead of creating a new one, just assign the result back to the same column (apply would work here as well, I think).
Note: You may want to cast it to something other than float, maybe a numpy type if you are working with that stuff.
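A vectorized alternative, sketched under the assumption that every value has the form numerator/denominator (the column name Performance_UG_ratio is made up here):

```python
import pandas as pd

df = pd.DataFrame({"Performance_UG": ["90.00/100.00", "4.0/5.0", "3.50/4.00"]})

# Split on "/" into two columns, then convert both to float in one go.
parts = df["Performance_UG"].str.split("/", expand=True).astype(float)

# Element-wise division gives the score as a fraction of the maximum.
df["Performance_UG_ratio"] = parts[0] / parts[1]
print(df)
```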

conditional statement producing weird results?

I want to create a new feature that converts any currency to EUR. The process is simple. I have two columns: one with the type of currency (e.g. USD) and one with the amount. I am creating a third column called 'price_in_eur' that looks at the type of currency and, if it is not 'EUR', multiplies the amount column by 1.1; otherwise the amount should be left alone. But when I run the code I get the following error:
ValueError: either both or neither of x and y should be given
This is my code:
x = data[['type_of_currency','amount']]
x.type_of_currency= x.amount.str.extract('(\d+)', expand=False)
x['type_of_currency'] = x['amount'].astype(float)
x['price_in_euro'] = np.where(x['type_of_currency']=='USD',x['amount']*1.1)
Can someone please help? I think it has something to do with the fact that the np.where statement is looking at the type_of_currency column which is a string but not sure.
You need to provide both arguments in np.where after condition. --> numpy.where(condition[, x, y])
Ex:
x['price_in_euro'] = np.where(x['type_of_currency']=='USD',x['amount']*1.1, x['amount'])
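For a self-contained illustration, here is a sketch on made-up data. It assumes the amount column arrives as strings, keeps type_of_currency as text rather than overwriting it, and reuses the 1.1 factor from the question.

```python
import numpy as np
import pandas as pd

# Hypothetical sample rows; only the column names come from the question.
data = pd.DataFrame(
    {
        "type_of_currency": ["USD", "EUR", "USD"],
        "amount": ["10", "20.5", "3"],
    }
)

# Convert the amounts to numbers, but leave the currency codes as strings
# so that the comparison with 'USD' can still match.
data["amount"] = data["amount"].astype(float)

# np.where(condition, value_if_true, value_if_false) needs both branches.
data["price_in_euro"] = np.where(
    data["type_of_currency"] == "USD",
    data["amount"] * 1.1,
    data["amount"],
)
print(data)
```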

Pandas adding decimal points when using read_csv

I'm working with some csv files and using pandas to turn them into a dataframe. After that, I use an input to find values to delete.
I'm hung up on one small issue: for some columns it's adding ".0" to the values in the column. It only does this in columns with numbers, so I'm guessing it's reading the column as a float. How do I prevent this from happening?
The part that really confuses me is that it only happens in a few columns, so I can't quite figure out a pattern. I need to chop off the ".0" so I can re-import it, and I feel like it would be easiest to prevent it from happening in the first place.
Thanks!
Here's a sample of my code:
clientid = int(input('What client ID needs to be deleted?'))
df1 = pd.read_csv('Client.csv')
clientclean = df1.loc[df1['PersonalID'] != clientid]
clientclean.to_csv('Client.csv', index=None)
Ideally, I'd like all of the values to be the same as the original csv file, but without the rows with the clientid from the user input.
If PersonalID is the header of the problematic column, try this:
df1 = pd.read_csv('Client.csv', dtype={'PersonalID':np.int32})
Edit:
A plain integer column cannot hold NaN, so a column with missing values is read as float.
You can fill the missing values first and then convert. Try this on each problematic column:
df1[col] = df1[col].fillna(-9999) # or 0 or any value you want here
df1[col] = df1[col].astype(int)
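If a sentinel such as -9999 is not wanted, pandas' nullable integer dtype "Int64" (capital I) may be an option, since it can hold missing values. A minimal sketch, assuming every non-missing value in the column is a whole number:

```python
import numpy as np
import pandas as pd

# A float column of the kind read_csv produces when integers and blanks are mixed.
col = pd.Series([101.0, np.nan, 103.0])

# "Int64" stores integers and represents the gap as <NA> instead of NaN.
as_int = col.astype("Int64")
print(as_int)
```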
You could go through each value, and if it is a number x, subtract int(x) from it, and if this difference is 0.0, convert the number x to int(x). Or, if you're not dealing with any non-integers, you could just convert all values that are numbers to ints.
For an example of the latter (when your original data does not contain any non-integer numbers):
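# Assigning ints cell by cell cannot change a float64 column's dtype,
# so convert each float column as a whole instead.
for col in df1.columns:
    if pd.api.types.is_float_dtype(df1[col]):
        # Nullable Int64 also copes with missing values
        df1[col] = df1[col].astype("Int64")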
For an example of the former (if you want to keep non-integer numbers as non-integer numbers, but want to guarantee that integer numbers stay as integers):
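import pandas as pd
for col in df1.columns:
    # Only float columns can show the unwanted ".0"
    if not pd.api.types.is_float_dtype(df1[col]):
        continue
    foundNonInt = False
    # Missing values are skipped; they do not count as non-integers
    for x in df1[col].dropna():
        if not float(x).is_integer():
            foundNonInt = True
            break
    if not foundNonInt:
        # Every value is whole, so store the column as nullable integers
        df1[col] = df1[col].astype("Int64")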
Note, the above method is not fool-proof: if by chance, a non-integer number column from the original data set contains non-integers that are all x.0000000, all the way to the last decimal place, this will fail.
It was a datatype issue.
ALollz's comment led me in the right direction. Pandas was assuming a data type of float, which added the decimal points.
I specified the datatype as object (from Akarius's comment) when using read_csv, which resolved the issue.
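A sketch of that fix on an in-memory CSV (the HouseholdID column and the sample rows are invented; io.StringIO stands in for Client.csv). Note one consequence: with dtype=object every value is text, so the ID to delete has to be compared as a string rather than as an int.

```python
import io

import pandas as pd

# Hypothetical file contents; the blank in row 2 is what triggers float inference.
raw = "PersonalID,HouseholdID\n1,10\n2,\n3,30\n"

# Default inference: the column with a blank becomes float64 (10.0, NaN, 30.0).
inferred = pd.read_csv(io.StringIO(raw))

# dtype=object keeps every value exactly as written in the file.
as_text = pd.read_csv(io.StringIO(raw), dtype=object)

# Compare as a string, because PersonalID now holds text such as "2".
client_id = "2"
kept = as_text.loc[as_text["PersonalID"] != client_id]

# Written back out, the remaining rows match the original file, with no ".0".
out = kept.to_csv(index=False)
print(out)
```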
