Pandas to_sql: float binary issue - python

I have a Pandas DataFrame that I'm sending to MySQL via to_sql with SQLAlchemy. My floats in SQL sometimes show decimal places that are slightly off compared to the DataFrame, and the insert raises the error: Warning: (1265, "Data truncated for column 'Dividend' at row 1"). How do I round the floats so that they match the values in the DataFrame?
The values are pulled from a CSV and converted from strings to floats. They appear fine when written to Excel, but when sent to SQL, the numbers are slightly off.
I've looked into the issues with floats when it comes to binary, but I can't figure out how to override that during the transfer from DataFrame to SQL.
from sqlalchemy import create_engine
import pandas as pd

def str2float(val):
    return float(val)

data = pd.read_csv(
    filepath_or_buffer=filename,
    converters={'col1': str2float}
)

db = create_engine('mysql://user:pass@host/database')
data.to_sql(con=db, name='tablename', if_exists='append', index=False)
db.dispose()
Most floats come over looking like 0.0222000000, but every once in a while one appears as 0.0221999995. Ideally I would like it to automatically truncate the trailing zeros, but I would settle for the first example; what I need is for the stored value to round so that it matches the float in the DataFrame.
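As an aside, here is a small illustration (not from the original post) of one plausible source of that 0.0221999995: if the MySQL column is a single-precision FLOAT, the question's 0.0222 already loses digits once squeezed into 32 bits. Only the value 0.0222 comes from the question; the rest is assumed.
import numpy as np
x = 0.0222                      # value from the question, a 64-bit Python float
print(f'{x:.10f}')              # 0.0222000000
print(f'{np.float32(x):.10f}')  # 0.0221999995, the same value in single precision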

I had a similar problem. The numbers I imported into the DataFrame had 3 decimal places, but when inserted into the SQL table they had 12 digits.
I just used the .round() method and it worked for me:
df["colname"] = df["colname"].round(3)

Related

Pandas read_parquet partially parses binary column

I'm trying to read a parquet file that contains a binary column with multiple hex values, which is causing issues when reading it with Pandas. Pandas is automatically converting some of the hex values to characters, but some are left untouched, so the data is not really usable anymore. When reading it with PySpark, it converts all hex values to decimal base, but as the output is consistent, it's usable.
Any ideas why pandas parses this column differently, and how I can get the same output as Spark returns, or at least a consistent one (no partial parsing applied)?
The snippets of code and returned outputs:
Pandas :
df = pd.read_parquet('data.parquet')
pd.read_parquet output:
Spark :
spark_df = spark.read.parquet("data.parquet")
df = spark_df.toPandas()
Spark.read.parquet output:
Pandas is returning a byte string; some bytes happen to be displayed as ASCII characters, but nothing is wrong with the data. For example:
x = bytes([1,10,100]) # x is shown as b'\x01\nd' where last 'd' is ASCII letter
list(x) # get as a list of numbers
To make the pandas column look like the Spark one, use:
df['BASE_PERIOD_VECTOR'].apply(list)
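A self-contained sketch of that conversion, using a made-up bytes column (the BASE_PERIOD_VECTOR name comes from the answer; the sample values are assumptions):
import pandas as pd
# hypothetical bytes column standing in for the parquet binary data
df = pd.DataFrame({'BASE_PERIOD_VECTOR': [bytes([1, 10, 100]), bytes([0, 255, 16])]})
# turn each byte string into a list of integers, matching what Spark shows
df['BASE_PERIOD_VECTOR'] = df['BASE_PERIOD_VECTOR'].apply(list)
print(df)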

Is there a way to not show scientific notation in pandas for numeric values?

I queried my postgres database to retrieve information from a table:
SQLResults = cursor.execute('SELECT x.some_12_long_integer from test as x;')
Now when I run this query in the database I get 1272140958198, but when I dump it into a dataframe the value comes out in scientific notation.
The x.some_12_long_integer is int8.
I am using xlsxwriter:
from pandas import DataFrame, ExcelWriter

excel_writer = ExcelWriter(
    self.fullFilePath, engine="xlsxwriter",
    engine_kwargs={'options': {'strings_to_numbers': True}}
)
frame = DataFrame(SQLResults[0])
frame.to_excel(excel_writer, sheet_name="Sheet1", index=False)
When it is converted to Excel it produces 1.27214E+12 but when I format the cell in the Excel file I get 1272140958198.
How can I make it just stay as 1272140958198 instead of 1.27214E+12?
It is not possible to disable scientific notation in pandas. However, there are a few ways to work around this:
Convert the numeric values to strings using .to_string(). This removes the scientific notation and gives a more user-friendly representation of the data. Alternatively, perform mathematical operations on the numeric values to round them off to a specific number of decimal places; this removes most of the scientific notation, but some may still remain.
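A minimal sketch of the string workaround, assuming a hypothetical column name big_id and a sample value taken from the question (astype(str) is used here as the per-column form of the string conversion described above):
import pandas as pd
frame = pd.DataFrame({'big_id': [1272140958198]})  # hypothetical data standing in for the query result
frame['big_id'] = frame['big_id'].astype(str)      # write the values as text so Excel shows every digit
frame.to_excel('output.xlsx', sheet_name='Sheet1', index=False)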

How to normalize decimal values while iterating over dataframe rows using toLocalIterator

I have a PySpark dataframe which contains a decimal column whose schema is Decimal(20,8). When I do a df.show() it shows 3.1E-7 as the value of that decimal column for a particular row.
Now I am trying to write this dataframe to an Avro file in a streaming fashion using fastavro, and for that I am iterating over all the rows using toLocalIterator. When I get to the row with the above value, it contains Decimal('3.10E-7'), which breaks my avro writer with the error below, because that value has a scale of 9 while my Avro schema expects a scale of 8:
ValueError: Scale provided in schema does not match the decimal
I was able to iterate over each field of every row and, wherever the value is a Decimal, call the normalize method on it before passing it to the avro writer (ref: How to reduce scale in python decimal value). I believe this makes the code slower and inefficient. Is there a better way?
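For reference, a minimal sketch of that per-field normalize() approach (the field names and the second value are made up; Decimal('3.10E-7') comes from the question):
from decimal import Decimal
def normalize_decimals(row):
    # strip trailing zeros from every Decimal field so the scale fits the Avro schema
    return {k: (v.normalize() if isinstance(v, Decimal) else v) for k, v in row.items()}
row = {'rate': Decimal('3.10E-7'), 'name': 'x'}  # hypothetical row from toLocalIterator
print(normalize_decimals(row))                   # rate becomes Decimal('3.1E-7'), i.e. scale 8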

Pandas' `read_sql` creates integer columns when reading from an Oracle table which has number columns with decimal points

I have an Oracle table with columns of type VARCHAR2 (i.e. string) and of type NUMBER (i.e. a numeric value with a fractional part). The numeric columns indeed contain values with decimal points, not integer values.
However when I read this table into a Pandas dataframe via pandas.read_sql I receive the numeric columns in the data frame as int64. How can I avoid this and receive instead float columns with the full decimal values?
I'm using the following versions
python : 3.7.4.final.0
pandas : 1.0.3
Oracle : 18c Enterprise Edition / Version 18.9.0.0.0
I have encountered the same thing. I am not sure if this is the reason, but I assume that a NUMBER type without any size restriction is too big for pandas and is automatically truncated to int64, or that the type is improperly chosen by pandas (an unconstrained NUMBER might be treated as an integer). You can limit the type of the column to e.g. NUMBER(5,4) and pandas should then recognise it correctly as a float.
I also found out that using pd.read_sql gives me the proper types, in contrast to pd.read_sql_table.
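A minimal sketch of that comparison, assuming a hypothetical table MY_TABLE with a NUMBER column AMOUNT and a cx_Oracle connection string:
import pandas as pd
from sqlalchemy import create_engine
engine = create_engine('oracle+cx_oracle://user:pass@host:1521/?service_name=svc')
# per the answer above, the plain query path is reported to keep the fractional values...
df_query = pd.read_sql('SELECT amount FROM my_table', engine)
# ...while reflecting the table may come back as int64 for unconstrained NUMBER columns
df_table = pd.read_sql_table('my_table', engine, columns=['amount'])
print(df_query.dtypes)
print(df_table.dtypes)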

How to override/prevent sqlalchemy from ever using floating point type?

I've been using pandas to transform raw file data and import it into a database. Often we use large integers as primary keys. When using the pandas to_sql function without explicitly specifying column types, it will sometimes store large integers as float (rather than bigint).
As you can imagine, much hair was lost when we realized our selects and joins weren't working.
Of course, we can go through and manually assign problem columns as bigint by trial and error, but we'd rather disable float outright and force bigint instead, since we work with an extremely large number of tables, and sometimes an extremely large number of columns, that we can't spend time individually fact-checking. We basically never want a float type in any table definition, ever.
Any way to override floating point type (either in pandas, sqlalchemy, or numpy) as bigint?
i.e.:
import pandas as pd
from sqlalchemy import create_engine
e = create_engine('mysql+pymysql://user:pass@host')
columns = ['foo', 'bar']
data = [
    [123456789, 'one'],
    [234567890, 'two'],
    [345678901, 'three']
]
df = pd.DataFrame(data=data, columns=columns)
df.to_sql('table', e, flavor='mysql', schema='schema', if_exists='replace')
Unfortunately, this code does not reproduce the effect; it commits as bigint. The problem happens when loading data from certain CSV or XLS files, and it happens when transferring from one MySQL database to another (latin1), which one would assume to be an exact copy.
There's nothing to the code at all, it's just:
import pandas as pd
from sqlalchemy import create_engine
e = create_engine('mysql+pymysql://user:pass@host')
df = pd.read_sql('SELECT * FROM source_schema.source_table;', e)
df.to_sql('target_table', e, flavor='mysql', schema='target_schema')
Creating a testfile.csv:
thing1,thing2
123456789,foo
234567890,bar
345678901,baz
didn't reproduce the effect either. I know for a fact that it happens with data from the NPPES Dissemination files; perhaps it has to do with the encoding? I have to convert the files from WIN-1252 to UTF-8 for MySQL to even accept them.
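In the meantime, a minimal sketch of the per-column workaround the question mentions (explicitly assigning problem columns as bigint via to_sql's dtype argument); the column names come from testfile.csv above, everything else is an assumption:
import pandas as pd
from sqlalchemy import create_engine
from sqlalchemy.types import BigInteger
e = create_engine('mysql+pymysql://user:pass@host/target_schema')
df = pd.read_csv('testfile.csv', encoding='utf-8')
# force the large-integer column to BIGINT so pandas/SQLAlchemy never picks a float type
df.to_sql('target_table', e, if_exists='replace', index=False,
          dtype={'thing1': BigInteger()})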
