could not convert string to float: '7,751.30' [duplicate] - python

I downloaded the TWSE index prices from the Taiwan Stock Exchange.
df = pd.read_csv(r'C:\Stock\TWSE.csv',encoding='Big5')
df.head()
日期 開盤指數 最高指數 最低指數 收盤指數
0 96/02/01 7,751.30 7,757.63 7,679.78 7,701.54
1 96/02/02 7,754.16 7,801.63 7,751.53 7,777.03
2 96/02/05 7,786.77 7,823.94 7,772.05 7,783.12
3 96/02/06 7,816.30 7,875.75 7,802.94 7,875.75
4 96/02/07 7,894.77 7,894.77 7,850.06 7,850.06
df.loc[0][2]
'7,757.63'
type(df.loc[0][2])
str
I want to convert these strings to floats so that I can plot them, but the conversion fails. For example:
float(df.loc[0][2])
ValueError: could not convert string to float: '7,757.63'

pd.read_csv, like pd.read_table and several other pd.read_* functions, has a thousands parameter. Set it to ',' so that those values are imported as floats.
The following illustrates this:
import io
import pandas as pd
txt = '日期 開盤指數 最高指數 最低指數 收盤指數\n0 96/02/01 7,751.30 7,757.63 7,679.78 7,701.54\n1 96/02/02 7,754.16 7,801.63 7,751.53 7,777.03\n2 96/02/05 7,786.77 7,823.94 7,772.05 7,783.12\n3 96/02/06 7,816.30 7,875.75 7,802.94 7,875.75\n4 96/02/07 7,894.77 7,894.77 7,850.06 7,850.06'
with io.StringIO(txt) as f:
    # raw string so that \s reaches the regex engine unchanged
    df = pd.read_table(f, encoding='utf8', header=0, thousands=',', sep=r'\s+')
print(df)
Yields:
日期 開盤指數 最高指數 最低指數 收盤指數
0 96/02/01 7751.30 7757.63 7679.78 7701.54
1 96/02/02 7754.16 7801.63 7751.53 7777.03
2 96/02/05 7786.77 7823.94 7772.05 7783.12
3 96/02/06 7816.30 7875.75 7802.94 7875.75
4 96/02/07 7894.77 7894.77 7850.06 7850.06
I hope this proves helpful.

float(df.loc[0][2].replace(',',''))
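The single-value replace above can also be applied to whole columns at once. A minimal sketch, assuming the numbers were already loaded as strings; the column names `open` and `close` and the two sample rows are placeholders, not the columns of the file in the question:

```python
import pandas as pd

# Placeholder data: numbers stored as strings with thousands separators.
df = pd.DataFrame({
    "date": ["96/02/01", "96/02/02"],
    "open": ["7,751.30", "7,754.16"],
    "close": ["7,701.54", "7,777.03"],
})

price_cols = ["open", "close"]  # hypothetical column names
for col in price_cols:
    # Drop the commas, then cast the cleaned strings to float.
    df[col] = df[col].str.replace(",", "", regex=False).astype(float)

print(df.dtypes)
```

If the file can simply be re-read, the thousands parameter is likely the simpler route; this version is for data that is already in memory.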

Related

Calling the last 20 years from csv file on python

I'm trying to select the last 20 years from a CSV file that has 'date' and 'price' columns in Python.
df = df[(df['Date']>datetime.Date(1999,1,1)) & (df['Date']<datetime.Date(2019,1,1))]
I was expecting to see only the data for the last 20 years, from 1999 to 2019.
I think you want the last 20 years of data selected dynamically, relative to whatever the maximum date is:
import pandas as pd
df = pd.read_csv('./your_data.csv')
df['date_col'] = pd.to_datetime(df['date_col'])
df['year'] = df['date_col'].dt.year
last_20_years = df[df['year'] + 20 >= df['year'].max()]
last_20_years
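The year-based filter above works on whole calendar years. If an exact 20-year window counted back from the latest date is wanted instead, one option is pd.DateOffset. This is only a sketch with made-up rows; `date_col` and `price` are placeholder column names:

```python
import pandas as pd

# Made-up sample rows for illustration.
df = pd.DataFrame({
    "date_col": ["1995-06-30", "1999-01-01", "2010-03-15", "2019-01-01"],
    "price": [10.0, 12.5, 20.0, 31.0],
})
df["date_col"] = pd.to_datetime(df["date_col"])

# The cutoff is exactly 20 years before the most recent date in the data.
cutoff = df["date_col"].max() - pd.DateOffset(years=20)
last_20_years = df[df["date_col"] >= cutoff]
print(last_20_years)
```

Here the cutoff is inclusive, so a row dated exactly 20 years before the maximum is kept; use `>` if that boundary row should be dropped.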

How to extract only year(YYYY) from a CSV column with data like YYYY-YY

I am new to Python/Bokeh/Pandas.
I am able to plot a line graph in pandas/Bokeh using the parse_dates option.
However, I have come across a dataset (.csv) whose 'Year/Ports' column is in YYYY-YY form, e.g. 1952-53, 1953-54, 1954-55.
My code below gives a blank graph for this column.
Do I have to extract only the YYYY part and plot that? It works, but I am sure that is not how the data is meant to be visualized.
If I extract only the YYYY part using CSV or Notepad++ tools, there is no issue: the dates are read correctly and I get a meaningful line graph.
#Total Cargo Handled at Mormugao Port from 1950-51 to 2019-20
import pandas as pd
from bokeh.plotting import figure,show
from bokeh.io import output_file
#read the CSV file shared by GOI
df = pd.read_csv("Cargo_Data_full.csv",parse_dates=["Year/Ports"])
# selecting rows based on condition
output_file("Cargo tracker.html")
f = figure(height=200,sizing_mode = 'scale_width',x_axis_type = 'datetime')
f.title.text = "Cargo Tracker"
f.xaxis.axis_label="Year/Ports"
f.yaxis.axis_label="Cargo handled"
f.line(df['Year/Ports'],df['OTHERS'])
show(f)
You can't use parse_dates in this case, since the format is not a valid datetime. You can use pandas string slicing to only keep the YYYY part.
df = pd.DataFrame({'Year/Ports':['1952-53', '1953-54', '1954-55'], 'val':[1,2,3]})
df['Year/Ports'] = df['Year/Ports'].str[:4]
print(df)
Year/Ports val
0 1952 1
1 1953 2
2 1954 3
From there you can turn it into a datetime if that makes sense for you.
df['Year/Ports'] = pd.to_datetime(df['Year/Ports'])
print(df)
Year/Ports val
0 1952-01-01 1
1 1953-01-01 2
2 1954-01-01 3
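The slicing and the conversion can also be combined, with the format passed explicitly so that the parsing does not rely on inference. A small sketch reusing the sample frame from this answer:

```python
import pandas as pd

df = pd.DataFrame({"Year/Ports": ["1952-53", "1953-54", "1954-55"], "val": [1, 2, 3]})

# Keep the first four characters (the starting year) and parse them as a year.
df["Year/Ports"] = pd.to_datetime(df["Year/Ports"].str[:4], format="%Y")
print(df)
```

Each value becomes 1 January of the starting year, which is a convention rather than something present in the data.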

ValueError: Unable to parse string "15,181.80" at position 0

I am trying to convert a df to all numeric values but getting the following error.
ValueError: Unable to parse string "15,181.80" at position 0
Here is my current code:
data = pd.read_csv('pub?gid=1704010735&single=true&output=csv',
usecols=[0,1,2],
header=0,
encoding="utf-8-sig",
index_col='Date')
data.apply(pd.to_numeric)
print("we have a total of:", len(data), " samples")
data.head()
And here is the df before the conversion:
Clicks Impressions
Date
01/03/2020 15,181.80 1.22%
02/03/2020 12,270.76 0.56%
03/03/2020 39,420.79 0.80%
04/03/2020 22,223.97 0.79%
05/03/2020 17,084.45 0.88%
I think the issue is that it cannot handle special characters, e.g. ",". Is this correct? What is the best way to convert the DataFrame to all numeric values?
Thanks!
Deleting all the commas in the numbers of your dataframe will fix your problem.
This is the code I used:
import pandas as pd
df = pd.DataFrame({'value':['10,000.23','20,000.30','10,000.10']})
df['value'] = df['value'].str.replace(',', '').astype(float)
df.apply(pd.to_numeric)
OUTPUT:
value
0 10000.23
1 20000.30
2 10000.10
EDIT:
You can also use attribute access for the column:
df['value'] = df.value.str.replace(',', '').astype(float)
Here, value is the name of the column that you want to convert.
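The frame in the question also has a percent column, which needs its own cleanup step. A hedged sketch using two rows copied from the question; leaving the result in percent units rather than dividing by 100 is an assumption about what is wanted:

```python
import pandas as pd

data = pd.DataFrame(
    {"Clicks": ["15,181.80", "12,270.76"], "Impressions": ["1.22%", "0.56%"]},
    index=pd.Index(["01/03/2020", "02/03/2020"], name="Date"),
)

# Remove the thousands separators, then convert.
data["Clicks"] = pd.to_numeric(data["Clicks"].str.replace(",", "", regex=False))
# Strip the trailing percent sign; values stay in percent units (1.22, not 0.0122).
data["Impressions"] = pd.to_numeric(data["Impressions"].str.rstrip("%"))
print(data.dtypes)
```

Note that data.apply(pd.to_numeric) returns a new frame, so its result has to be assigned back to take effect.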

Import CSV file where last column has many separators [duplicate]

The dataset looks like this:
region,state,latitude,longitude,status
florida,FL,27.8333,-81.717,open,for,activity
georgia,GA,32.9866,-83.6487,open
hawaii,HI,21.1098,-157.5311,illegal,stuff
iowa,IA,42.0046,-93.214,medical,limited
As you can see, the last column sometimes has separators in it. This makes it hard to import the CSV file in pandas using read_csv(). The only way I can import the file is by adding the parameter error_bad_lines=False to the function. But this way I'm losing some of the data.
How can I import the CSV file without losing data?
I would read the file as one single column and parse manually:
df = pd.read_csv(filename, sep='\t')
pat = ','.join([f'(?P<{x}>[^,]*)' for x in ['region','state','latitude','longitude']])
pat = '^'+ pat + ',(?P<status>.*)$'
df = df.iloc[:,0].str.extract(pat)
Output:
region state latitude longitude status
0 florida FL 27.8333 -81.717 open,for,activity
1 georgia GA 32.9866 -83.6487 open
2 hawaii HI 21.1098 -157.5311 illegal,stuff
3 iowa IA 42.0046 -93.214 medical,limited
Have you tried the old-school technique with the split function? A major downside is that you'd end up losing data or bumping into errors if your data has a , in any of the first 4 fields/columns, but if not, you could use it.
data = open(file,'r').read().split('\n')
for line in data:
    items = line.split(',',4)  # Assuming there are 4 standard columns, and the 5th column has commas
Each row's items would look, for example, like this:
['hawaii', 'HI', '21.1098', '-157.5311', 'illegal,stuff']
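The split approach can be carried through to a DataFrame. A minimal sketch, assuming (as noted above) that only the last column contains extra commas; io.StringIO stands in for the real file here:

```python
import io
import pandas as pd

raw = """region,state,latitude,longitude,status
florida,FL,27.8333,-81.717,open,for,activity
georgia,GA,32.9866,-83.6487,open
hawaii,HI,21.1098,-157.5311,illegal,stuff
iowa,IA,42.0046,-93.214,medical,limited
"""

with io.StringIO(raw) as fh:  # stands in for open(file)
    header = fh.readline().rstrip("\n").split(",")
    # maxsplit=4 keeps everything after the fourth comma in one field.
    rows = [line.rstrip("\n").split(",", 4) for line in fh if line.strip()]

df = pd.DataFrame(rows, columns=header)
# Everything was read as text, so convert the coordinate columns explicitly.
df[["latitude", "longitude"]] = df[["latitude", "longitude"]].astype(float)
print(df)
```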

pandas to_csv: suppress scientific notation in csv file when writing pandas to csv

I am writing a pandas df to a csv. When I write it to a csv file, some of the elements in one of the columns are being incorrectly converted to scientific notation/numbers. For example, col_1 has strings such as '104D59' in it. The strings are mostly represented as strings in the csv file, as they should be. However, occasional strings, such as '104E59', are being converted into scientific notation (e.g., 1.04 E 61) and represented as integers in the ensuing csv file.
I am trying to export the csv file into a software package (i.e., pandas -> csv -> software_new) and this change in data type is causing problems with that export.
Is there a way to write the df to a csv, ensuring that all elements in df['problem_col'] are represented as string in the resulting csv or not converted to scientific notation?
Here is the code I have used to write the pandas df to a csv:
df.to_csv('df.csv', encoding='utf-8')
I also checked the dtype of the problem column:
According to df.dtypes, df['problem_column'] is object.
For Python 3.x (tested with Python 3.7.2 and pandas 0.23.4):
Options and settings
Use pandas.set_option to control how the dataframe is displayed:
import numpy as np # needed for np.random below
import pandas as pd
# display options for inspecting the float data once it is loaded:
pd.set_option('display.html.table_schema', True) # publish a Table Schema representation for front ends that support it
pd.set_option('display.precision', 5) # show 5 digits after the decimal point
df = pd.DataFrame(np.random.randn(20,4)* 10 ** -12) # create a random dataframe
Output of the data:
df.dtypes # check datatype for columns
[output]:
0 float64
1 float64
2 float64
3 float64
dtype: object
Dataframe:
df # output of the dataframe
[output]:
0 1 2 3
0 -2.01082e-12 1.25911e-12 1.05556e-12 -5.68623e-13
1 -6.87126e-13 1.91950e-12 5.25925e-13 3.72696e-13
2 -1.48068e-12 6.34885e-14 -1.72694e-12 1.72906e-12
3 -5.78192e-14 2.08755e-13 6.80525e-13 1.49018e-12
4 -9.52408e-13 1.61118e-13 2.09459e-13 2.10940e-13
5 -2.30242e-13 -1.41352e-13 2.32575e-12 -5.08936e-13
6 1.16233e-12 6.17744e-13 1.63237e-12 1.59142e-12
7 1.76679e-13 -1.65943e-12 2.18727e-12 -8.45242e-13
8 7.66469e-13 1.29017e-13 -1.61229e-13 -3.00188e-13
9 9.61518e-13 9.71320e-13 8.36845e-14 -6.46556e-13
10 -6.28390e-13 -1.17645e-12 -3.59564e-13 8.68497e-13
11 3.12497e-13 2.00065e-13 -1.10691e-12 -2.94455e-12
12 -1.08365e-14 5.36770e-13 1.60003e-12 9.19737e-13
13 -1.85586e-13 1.27034e-12 -1.04802e-12 -3.08296e-12
14 1.67438e-12 7.40403e-14 3.28035e-13 5.64615e-14
15 -5.31804e-13 -6.68421e-13 2.68096e-13 8.37085e-13
16 -6.25984e-13 1.81094e-13 -2.68336e-13 1.15757e-12
17 7.38247e-13 -1.76528e-12 -4.72171e-13 -3.04658e-13
18 -1.06099e-12 -1.31789e-12 -2.93676e-13 -2.40465e-13
19 1.38537e-12 9.18101e-13 5.96147e-13 -2.41401e-12
And now write to_csv using the float_format='%.15f' parameter
df.to_csv('estc.csv',sep=',', float_format='%.15f') # write with precision .15
file output:
,0,1,2,3
0,-0.000000000002011,0.000000000001259,0.000000000001056,-0.000000000000569
1,-0.000000000000687,0.000000000001919,0.000000000000526,0.000000000000373
2,-0.000000000001481,0.000000000000063,-0.000000000001727,0.000000000001729
3,-0.000000000000058,0.000000000000209,0.000000000000681,0.000000000001490
4,-0.000000000000952,0.000000000000161,0.000000000000209,0.000000000000211
5,-0.000000000000230,-0.000000000000141,0.000000000002326,-0.000000000000509
6,0.000000000001162,0.000000000000618,0.000000000001632,0.000000000001591
7,0.000000000000177,-0.000000000001659,0.000000000002187,-0.000000000000845
8,0.000000000000766,0.000000000000129,-0.000000000000161,-0.000000000000300
9,0.000000000000962,0.000000000000971,0.000000000000084,-0.000000000000647
10,-0.000000000000628,-0.000000000001176,-0.000000000000360,0.000000000000868
11,0.000000000000312,0.000000000000200,-0.000000000001107,-0.000000000002945
12,-0.000000000000011,0.000000000000537,0.000000000001600,0.000000000000920
13,-0.000000000000186,0.000000000001270,-0.000000000001048,-0.000000000003083
14,0.000000000001674,0.000000000000074,0.000000000000328,0.000000000000056
15,-0.000000000000532,-0.000000000000668,0.000000000000268,0.000000000000837
16,-0.000000000000626,0.000000000000181,-0.000000000000268,0.000000000001158
17,0.000000000000738,-0.000000000001765,-0.000000000000472,-0.000000000000305
18,-0.000000000001061,-0.000000000001318,-0.000000000000294,-0.000000000000240
19,0.000000000001385,0.000000000000918,0.000000000000596,-0.000000000002414
And now write to_csv using the float_format='%f' parameter
df.to_csv('estc.csv',sep=',', float_format='%f') # '%f' writes six decimal places, so values this small are rounded to zero
For more details check pandas.DataFrame.to_csv
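To see the effect of each format string without writing a file, to_csv can be called with no path, in which case it returns the CSV text. A small sketch with a single made-up value:

```python
import pandas as pd

df = pd.DataFrame({"x": [1.5e-12]})  # one made-up small value

# With no path argument, to_csv returns the CSV text as a string.
default_csv = df.to_csv(index=False)
six_places = df.to_csv(index=False, float_format="%f")
fifteen_places = df.to_csv(index=False, float_format="%.15f")

print(default_csv)     # the value is written in scientific notation
print(six_places)      # the value rounds to 0.000000
print(fifteen_places)  # the value is written as 0.000000000001500
```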
Use the float_format argument:
In [11]: df = pd.DataFrame(np.random.randn(3, 3) * 10 ** 12)
In [12]: df
Out[12]:
0 1 2
0 1.757189e+12 -1.083016e+12 5.812695e+11
1 7.889034e+11 5.984651e+11 2.138096e+11
2 -8.291878e+11 1.034696e+12 8.640301e+08
In [13]: print(df.to_string(float_format='{:f}'.format))
0 1 2
0 1757188536437.788086 -1083016404775.687134 581269533538.170288
1 788903446803.216797 598465111695.240601 213809584103.112457
2 -829187757358.493286 1034695767987.889160 864030095.691202
Which works similarly for to_csv:
df.to_csv('df.csv', float_format='{:f}'.format, encoding='utf-8')
If you would like to use the values as formatted strings in a list, say as a row written with csv.writer, the numbers can be formatted before creating the list:
import csv
with open('results_actout_file','w',newline='') as csvfile:
    resultwriter = csv.writer(csvfile, delimiter=',')
    resultwriter.writerow(header_row_list)
    resultwriter.writerow(df['label'].apply(lambda x: '%.17f' % x).values.tolist())
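For the string column described in the question, float_format does not apply, since it only affects float values. As far as I know, to_csv writes object-dtype strings unchanged, so a value like '104E59' should reach the file as text. If the program that reads the file still interprets it as a number, quoting the non-numeric fields may help, depending on that program. A sketch under those assumptions, with `col_1` as in the question and io.StringIO standing in for the output file:

```python
import csv
import io
import pandas as pd

df = pd.DataFrame({"col_1": ["104D59", "104E59"]})

buf = io.StringIO()  # stands in for a file path such as 'df.csv'
# QUOTE_NONNUMERIC wraps every non-numeric field in double quotes.
df.to_csv(buf, index=False, quoting=csv.QUOTE_NONNUMERIC)
text = buf.getvalue()
print(text)

# Reading back with dtype=str keeps the values as strings.
round_trip = pd.read_csv(io.StringIO(text), dtype=str)
print(round_trip["col_1"].tolist())
```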
