I have the following line
df = pandas.read_sql_query(sql=sql_script, con=conn, coerce_float=False)
that pulls data from Postgres using a SQL script. Pandas keeps setting some of the columns to type float64. They should be just int. These columns contain some null values. Is there a way to pull the data without having pandas set them to float64?
Thanks!
As per the documentation, NumPy has no NA representation for integers, so integer NA values can't be stored and pandas promotes int columns to float.
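If you want to keep the NULLs and still end up with integer columns, one option is pandas' nullable integer dtype. A minimal sketch, assuming a reasonably recent pandas and reusing sql_script and conn from the question ("some_int_column" is a placeholder for one of the affected columns):
import pandas
df = pandas.read_sql_query(sql=sql_script, con=conn, coerce_float=False)
# Cast the affected column to the nullable integer dtype;
# NULLs become pd.NA instead of forcing the column to float64.
df["some_int_column"] = df["some_int_column"].astype("Int64")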
I thought df.fillna(value=0) fills null values for all numeric columns to 0.
In an SQL Server table, I have multiple columns with data type float and columns with data type int. The following Apache Spark code sample imports a corresponding data file into that SQL table, so I set the data types of the float columns to DoubleType() and of the int columns to IntegerType(). The columns for both data types in the SQL table are nullable. But the code throws two errors, as follows:
First error on float columns:
the dataframe df column has nullable property of column at index n set to false but the corresponding SQL table column has nullable property set to true.
So, in the SQL table, I set the float and int columns to `NOT NULL`.
The second time, the code throws the following error about the int columns:
the dataframe df column at index m has nullable property set to true but the corresponding SQL table column has nullable property set to false.
So, in the SQL table, I set the int columns back to `NULL` (nullable).
Then the code runs successfully and imports the data into the SQL table.
Question: Why is the following code asking me to make the corresponding SQL float columns NOT NULL but the corresponding SQL int columns NULLABLE?
Quote from this site:
PySpark fill(value:Long) signatures that are available in DataFrameNaFunctions is used to replace NULL/None values with numeric values either zero(0) or any constant value for all integer and long datatype columns of PySpark DataFrame or Dataset.
Sample Code:
from pyspark.sql import functions as F
from pyspark.sql.types import IntegerType, DateType, TimestampType, DoubleType
......
df1 = df.withColumn("Price", df.Price.cast(DoubleType())) \
.withColumn("NumOfItems", df.NumOfItems.cast(IntegerType())) \
....................
....................
.withColumn("Cost", df.Cost.cast(DoubleType()))
df2 = df1.fillna(value=0)
#load df2 into SQL table
df2.write(.....)
UPDATE: The data file has UTF-16 LE BOM encoding. I'm not sure if the above issue is related to this.
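Not a confirmed fix, but one way to make the DataFrame's nullable flags line up with a NOT NULL table definition is to rebuild df2 with an explicitly non-nullable schema before the write. A sketch that assumes the SparkSession is named spark and that the columns really contain no nulls after fillna:
from pyspark.sql.types import StructField, StructType
# Rebuild the DataFrame with every field marked non-nullable.
non_nullable_schema = StructType(
    [StructField(f.name, f.dataType, nullable=False) for f in df2.schema.fields]
)
df2 = spark.createDataFrame(df2.rdd, non_nullable_schema)
df2.printSchema()  # verify the nullable flags before writing to the SQL table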
To delete rows in a CSV file that have empty cells, I use the following code:
import pandas as pd
data = pd.read_csv("./test_1.csv", sep=";")
data.dropna()
data.dropna().to_csv("./test_2.csv", index=False, sep=";")
Everything works, but the new CSV file contains incorrect data: in the columns that should hold integers, the values get an extra dot and zero appended (.0).
Could you please tell me how to get correct data without the .0?
Thank you very much!
Pandas represents numeric NAs as NaN and therefore casts all of your ints to floats (Python int doesn't have a NaN value, but float does).
If you are sure that you removed all NAs, just cast your columns/dfs to int:
data = data.astype(int)
If you want to have integers and NAs, use pandas nullable integer types such as pd.Int64Dtype().
More on nullable integer types:
https://pandas.pydata.org/pandas-docs/stable/user_guide/integer_na.html
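Applied to the CSV example above, a short sketch ("some_int_column" is a placeholder for the column that picks up the stray .0):
import pandas as pd
data = pd.read_csv("./test_1.csv", sep=";")
data = data.dropna()
# The nullable integer dtype keeps the column integer even if NAs remain,
# so the written CSV has no trailing ".0".
data["some_int_column"] = data["some_int_column"].astype("Int64")
data.to_csv("./test_2.csv", index=False, sep=";")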
My data is coming from a MySQL table.
id, revenue, cost, state are varchar columns.
I need to do get_dummies (one-hot encoding) for my categorical variable, which is state only.
If I read directly from CSV (pd.read_csv), I get the dtypes of id, revenue, cost as int/float and state as object.
My question is how to convert object columns to int64/float when they are numeric, while keeping object for the category variable.
There is a chance that strange characters like ? or - might appear in revenue; I still want this column to be numeric.
What I have done
To fix this right now, I changed the varchar to int directly in the database and the issue was fixed.
But I need to do it in pandas.
I tried df.apply(pd.to_numeric, errors='coerce').fillna(df), but my int/float columns such as id, revenue, cost still do not change dtype.
I think it is first necessary to test the dtypes after pd.read_csv:
print (df.dtypes)
Then convert the columns to numeric; the missing values cannot be replaced with the original strings, because that would create mixed columns of numbers and strings:
cols = ['id','revenue','cost']
df[cols] = df[cols].apply(pd.to_numeric, errors='coerce')
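Then, continuing the snippet above (assuming pandas is imported as pd), the remaining object column state can be one-hot encoded:
# One-hot encode the categorical column; id/revenue/cost stay numeric.
df = pd.get_dummies(df, columns=['state'])
print (df.dtypes)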
I'm importing a file into a pandas DataFrame, but the DataFrame is not retaining the original values as-is; instead it adds extra zeros to some float columns.
For example, the original value in the file is 23.84, but when I import it into the DataFrame it has a value of 23.8400.
How do I fix this? Or is there a way to import the original values into the DataFrame exactly as they appear in the text file?
For anyone who encounters the same problem, I'm adding the solution I found. Pandas read_csv has a dtype argument with which we can tell pandas to read columns as strings, so it reads the data as-is and does not interpret it based on its own logic.
df1 = pd.read_csv('file_location', sep=',', dtype={'col1': 'str', 'col2': 'str'})
I had too many columns so I first created a dictionary with all the columns as keys and 'str' as their values and passed this dictionary to the dtype argument.
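A sketch of that dictionary approach ('file_location' is a placeholder path, as above): read only the header to get the column names, build a {column: 'str'} mapping, and pass it to dtype:
import pandas as pd
cols = pd.read_csv('file_location', sep=',', nrows=0).columns
dtype_map = {col: 'str' for col in cols}  # every column read as string
df1 = pd.read_csv('file_location', sep=',', dtype=dtype_map)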
I have an Excel file which contains all the data I need to read into memory. Each row is a data sample and each column is a feature. I am using the pandas.read_excel() function to read it.
The problem is that this function automatically converts some boolean columns into float64 type. I manually checked some columns. Only the columns with missing values are converted. The columns without missing values are still bool.
My question is: how can I prevent the read_excel() function from automatically converting boolean columns into float64?
Here is my code snippet:
>>> fp = open('myfile.xlsx', 'rb')
>>> df = pd.read_excel(fp, header=0)
>>> df['BooleanFeature'].dtype
dtype('float64')
Here BooleanFeature is a boolean feature, but with some missing values.
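One thing worth trying, though not verified against this file: read_excel accepts a dtype mapping, so you can ask for pandas' nullable boolean dtype and let the missing cells become pd.NA instead of promoting the column to float64 (requires pandas >= 1.0):
import pandas as pd
# 'BooleanFeature' as in the snippet above; 'boolean' is the nullable dtype.
df = pd.read_excel('myfile.xlsx', header=0, dtype={'BooleanFeature': 'boolean'})
print(df['BooleanFeature'].dtype)  # expected: boolean, not float64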