ValueError: could not convert string to float - without positional indication - python

For a current project, I am planning to run a scikit-learn Stochastic Gradient Boosting algorithm over a CSV dataset that includes numerical data.
When the script reaches sgbr.fit(X_train, y_train), however, I receive ValueError: could not convert string to float: with no further detail on the particular value that cannot be converted.
I assume that this error is related not to the Python code itself but rather to the CSV input. I have, however, already checked the CSV file to confirm that all fields exclusively contain floats.
Does anyone have an idea why the ValueError is appearing without further positional indication?

I think there is no direct function to get a positional indication, but you can use to_numeric to find the value that fails to convert:
print(df)
  column
0     01
1     02
2     03
3     04
4     05
5     LS

print(pd.to_numeric(df.column, errors='coerce'))
0    1.0
1    2.0
2    3.0
3    4.0
4    5.0
5    NaN
Name: column, dtype: float64
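To locate the offending cell before calling fit, you can coerce the whole frame and compare against the original. A minimal sketch, assuming every column is meant to be numeric (the file name is a placeholder):

import pandas as pd

df = pd.read_csv('data.csv')  # placeholder for your CSV

# Coerce every cell to numeric; anything that fails becomes NaN.
numeric = df.apply(pd.to_numeric, errors='coerce')

# Cells that are NaN after coercion but were not NaN before hold the
# strings that cannot be converted, so show only those positions.
mask = numeric.isna() & df.notna()
print(df[mask].dropna(how='all').dropna(axis=1, how='all'))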

Related

Python calculations using groups to sum

Good afternoon all,
Bit stuck with the last stage of a calculation.
I have a dataframe which outputs as such:
   LaCode  Group  Frequency
0     718    NaN          2
1     718      3          1
2     719      1          4
3     719      2         10
I'm struggling with the percentage calculation: for each LaCode, ignore rows where Group is NaN (and just output NaN, or blank, for them), and express each remaining frequency as a percentage of the total frequency for that LaCode where Group is known.
Should output as such:
Percentage
NaN
100
28.571
71.428
Can anyone help with this? My code doesn't take the change in LaCode into account, and I can't work out the correct syntax to incorporate that.
Thanks.
Edit: for completeness, I have converted the NaN to an integer that stands out so I can see it (in this instance 0, as that isn't a valid group in the survey).
The code I'm using for the calculation was provided to me, and I tweaked it a little. It works OK when there is just one LaCode:
df['Percentage'] = df[df['Value'] != 0]['Count'].apply(lambda x: x/sum(df[df['Value'] != 0]['Count']))
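One way to incorporate the change in LaCode is a per-group transform. A minimal sketch, assuming the column names from the frame above (Group and Frequency, rather than the Value and Count names in the snippet):

import numpy as np
import pandas as pd

df = pd.DataFrame({'LaCode': [718, 718, 719, 719],
                   'Group': [np.nan, 3, 1, 2],
                   'Frequency': [2, 1, 4, 10]})

# Total frequency per LaCode, counting only rows with a known Group.
known = df['Group'].notna()
totals = df['Frequency'].where(known).groupby(df['LaCode']).transform('sum')

# Known rows get their share of the group total; NaN groups stay NaN.
df['Percentage'] = np.where(known, df['Frequency'] / totals * 100, np.nan)
print(df)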

How to add a number at the beginning of a row (pandas)

I'm trying to convert my column type from object to date, but I got an error that says:
OutOfBoundsDatetime: Out of bounds nanosecond timestamp: 20-04-09 00:00:00
After looking at the data and trying to find the error, I saw that all the dates I have use the following format:
print(df['date'])
0 020-04-02
1 020-04-02
2 020-04-05
3 NaN
4 020-04-05
...
60 NaN
61 020-04-07
62 NaN
63 020-04-09
64 020-04-09
As you can see, the dates begin with a zero, so adding a 2 at the beginning will fix the problem for me.
So, the question is: how can I add a 2 while ignoring the NaNs?
If you add '2' to the missing values they remain missing, so you can use:
pd.to_datetime('2' + df['date'])
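A minimal sketch of the round trip, with values shaped like the question's (string concatenation propagates the NaN, and to_datetime turns it into NaT):

import numpy as np
import pandas as pd

df = pd.DataFrame({'date': ['020-04-02', np.nan, '020-04-09']})

# '2' + NaN stays NaN, so only the real dates are prefixed and parsed.
df['date'] = pd.to_datetime('2' + df['date'])
print(df)
#         date
# 0 2020-04-02
# 1        NaT
# 2 2020-04-09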

Using pandas to organize my CSV file data

I have a large Excel file that I need to organize in a certain way (years of climate data). For the sake of explaining my problem, I made this simple Excel file for the question. The data looks similar to this:
(basically 4x4 blocks of data with an empty row between them), and I want to transform this data to look like:
(take each row of data, transpose it, and then append the next row to it, keeping the NaN values) using pandas.
The problems I faced when reading the file using file = pd.read_csv("excel data.csv"):
my first row is detected as a header;
the row that separates the blocks is converted to NaN and gets confused with the actual NaN in my data.
I tried different approaches, including reading/saving the file with no index (index=False); I also tried functions like file.iloc[0].values and file.shift(1), but I wasn't able to figure it out.
To summarize: I want to read the file using pandas and then save it as one column that includes all the data, with no extra information or headers (sorry, I am new to pandas).
EDIT: This is how it looks in a Jupyter notebook.
For the first problem, header=None worked.
I tried file.stack(dropna=False).reset_index()[0], but the results stayed the same as in the picture.
If you pass header=None to the read_csv function, it will not treat the first row as a header, i.e. file = pd.read_csv("excel_data.csv", header=None)
For the second part, once you have the data in the dataframe, you could try this:
file.stack(dropna=False).reset_index()[0]
Trying to replicate the required results:

import numpy as np
import pandas as pd

df = pd.DataFrame({0: [5.0, 54.0, 3.0, 9.0], 1: [6.0, 12.0, 6.0, 12.0],
                   2: [9.0, 76.0, np.nan, 41.0], 3: [8.0, 2.0, 12.0, 100.0]})
df.loc[4] = ['', '', '', '']
    0   1    2    3
0   5   6    9    8
1  54  12   76    2
2   3   6  NaN   12
3   9  12   41  100
4
df = df.replace('',np.nan).dropna(how='all') #to remove blank rows
df.stack(dropna=False).reset_index()[0]
0 5.0
1 6.0
2 9.0
3 8.0
4 54.0
5 12.0
6 76.0
7 2.0
8 3.0
9 6.0
10 NaN
11 12.0
12 9.0
13 12.0
14 41.0
15 100.0
I wonder if pd.read_csv("excel data.csv", skip_blank_lines=True, header=None) will work?
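For what it's worth, skip_blank_lines defaults to True in read_csv, so the blank separator rows should never enter the frame at all. A minimal sketch, assuming those rows are truly empty in the file:

import pandas as pd

# Blank lines are skipped during parsing, so only real data rows remain
# and the genuine NaN values inside the blocks are preserved.
file = pd.read_csv('excel data.csv', header=None, skip_blank_lines=True)
print(file.stack(dropna=False).reset_index(drop=True))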

Pandas converting column of strings and NaN (floats) to integers, keeping the NaN [duplicate]

This question already has answers here:
Convert Pandas column containing NaNs to dtype `int`
(27 answers)
Closed 3 years ago.
I have a problem converting a column which contains both two-digit numbers in string format (type: str) and NaN values (type: float64). I want to obtain a new column made this way: NaN where there was NaN, and an integer number where there was a two-digit number in string format.
As an example, I want to obtain column YearBirth2 from column YearBirth1 like this:
YearBirth1 #numbers here are formatted as strings: type(YearBirth1[0])=str
34 # and NaN are floats: type(YearBirth1[2])=float64.
76
Nan
09
Nan
91
YearBirth2 #numbers here are formatted as integers: type(YearBirth2[0])=int
34 #NaN can remain floats as they were.
76
Nan
9
Nan
91
I have tried this:
csv['YearBirth2'] = (csv['YearBirth1']).astype(int)
And, as I expected, I got this error:
ValueError: cannot convert float NaN to integer
So I tried this:
csv['YearBirth2'] = (csv['YearBirth1']!=NaN).astype(int)
And got this error:
NameError: name 'NaN' is not defined
Finally I have tried this:
csv['YearBirth2'] = (csv['YearBirth1']!='NaN').astype(int)
No error, but when I checked the column YearBirth2, this was the result:
YearBirth2:
1
1
1
1
1
1
Very bad... I think the idea is right, but there is a problem making Python understand what I mean by NaN. Or maybe the method I tried is wrong.
I also tried the pd.to_numeric() method, but that way I obtain floats, not integers.
Any help?!
Thanks to everyone!
P.S.: csv is the name of my DataFrame.
Sorry if I am not so clear; I am still improving my English!
You can use to_numeric, but it is impossible to get int together with NaN values; they are always converted to float: see NA type promotions.
df['YearBirth2'] = pd.to_numeric(df.YearBirth1, errors='coerce')
print (df)
YearBirth1 YearBirth2
0 34 34.0
1 76 76.0
2 Nan NaN
3 09 9.0
4 Nan NaN
5 91 91.0
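As a footnote: pandas 0.24 added a nullable integer dtype, so if floats are not acceptable you can cast the coerced result. A minimal sketch:

import pandas as pd

df = pd.DataFrame({'YearBirth1': ['34', '76', 'Nan', '09', 'Nan', '91']})

# 'Int64' (capital I) is the nullable integer dtype: missing values are
# kept as <NA> instead of forcing the whole column to float.
df['YearBirth2'] = pd.to_numeric(df.YearBirth1, errors='coerce').astype('Int64')
print(df)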

Precision lost while using read_csv in pandas

I have data in the below format in a text file which I am trying to read into a pandas DataFrame.
895|2015-4-23|19|10000|LA|0.4677978806|0.4773469340|0.4089938425|0.8224291972|0.8652525793|0.6829942860|0.5139162227|
As you can see, there are 10 digits after the decimal point in the input file.
df = pd.read_csv('mockup.txt',header=None,delimiter='|')
When I try to read it into a DataFrame, I am not getting the last 4 digits:
df[5].head()
0 0.467798
1 0.258165
2 0.860384
3 0.803388
4 0.249820
Name: 5, dtype: float64
How can I get the complete precision as present in the input file? I have some matrix operations that need to be performed, so I cannot cast it as a string.
I figured out that I have to do something with dtype, but I am not sure where I should use it.
It is only a display problem; see the docs:

# temporarily set the display precision
with pd.option_context('display.precision', 10):
    print(df)
     0          1   2      3   4             5            6             7  \
0  895  2015-4-23  19  10000  LA  0.4677978806  0.477346934  0.4089938425

              8             9            10            11  12
0  0.8224291972  0.8652525793  0.682994286  0.5139162227 NaN
EDIT (thank you, Mark Dickinson):
Pandas uses a dedicated decimal-to-binary converter that sacrifices perfect accuracy for the sake of speed. Passing float_precision='round_trip' to read_csv fixes this. See the documentation for more.
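A minimal sketch of that fix, with the file and separator from the question:

import pandas as pd

# 'round_trip' selects the slower but exact decimal parser, so the
# stored doubles round-trip back to the text in the file.
df = pd.read_csv('mockup.txt', header=None, delimiter='|',
                 float_precision='round_trip')

with pd.option_context('display.precision', 10):
    print(df[5].head())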
