I'm importing a file into a pandas DataFrame, but the DataFrame is not retaining the original values as-is; instead it's adding extra zeros to some float columns.
For example, the original value in the file is 23.84, but when I import it into the DataFrame it has a value of 23.8400.
How do I fix this? Or is there a way to import the original values into the DataFrame exactly as they appear in the text file?
For anyone who encounters the same problem, I'm adding the solution I found. Pandas' read_csv has a dtype parameter where we can tell pandas to read columns as strings, so it reads the data as-is instead of interpreting it based on its own logic.
df1 = pd.read_csv('file_location', sep=',', dtype={'col1': 'str', 'col2': 'str'})
I had too many columns to list by hand, so I first created a dictionary with all the column names as keys and 'str' as their values, and passed this dictionary to the dtype argument, as sketched below.
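In case it helps, here is a minimal sketch of that dict-building approach; 'file_location' is a placeholder path.

import pandas as pd

# Read only the header row to get the column names
cols = pd.read_csv('file_location', sep=',', nrows=0).columns

# Map every column to 'str' so nothing gets reinterpreted as float
dtype_map = {col: 'str' for col in cols}

df1 = pd.read_csv('file_location', sep=',', dtype=dtype_map)

If every column should be a string, the shorthand dtype=str does the same thing without building the dictionary.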
I am using value_counts() to get the frequency of sec_id. The output of value_counts() should be integers.
But when I build a DataFrame from those integers, I find that the columns have object dtype. Does anyone know the reason?
They have object dtype because your sec_id column contains string values (e.g. "94114G"). When you call .values on the DataFrame created by .reset_index(), the string and integer columns are upcast to a single object-dtype array.
More importantly, I think you are doing some unnecessary work. Try this:
>>> sec_count_df = df['sec_id'].value_counts().rename_axis("sec_id").rename("count").reset_index()
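As a quick illustration with made-up sec_id values, the count column comes back as a plain integer dtype:

import pandas as pd

df = pd.DataFrame({'sec_id': ['94114G', '94114G', '02079K']})
sec_count_df = df['sec_id'].value_counts().rename_axis('sec_id').rename('count').reset_index()
print(sec_count_df.dtypes)  # sec_id: object, count: int64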
I have a column that has the values 1, 2, 3, ...
I need to change these values to Cluster_1, Cluster_2, Cluster_3, ... dynamically. In my original table, cluster_predicted is a column containing integer values, and I need to convert these numbers to cluster_0, cluster_1, ...
I have tried the code below:
clustersDf['clusterDfCategorical'] = "Cluster_" + str(clustersDf['clusterDfCategorical'])
But this gives me a very weird output: the printed representation of the entire column ends up concatenated into every row.
import pandas as pd

df = pd.DataFrame()
df['cols'] = [1, 2, 3, 4, 5]
df['vals'] = ['one', 'two', 'three', 'four', 'five']

# astype(str) converts each element, unlike str(df['cols'])
df['cols'] = df['cols'].astype(str)
df['cols'] = 'confuse_' + df['cols']
print(df)
Try this; the string conversion is what is causing the issue for you. Calling str() on a Series returns the printed representation of the whole Series, not an element-wise conversion.
One way to convert a column to string element-wise is to use astype, as in the snippet above.
I have the following line
df = pandas.read_sql_query(sql=sql_script, con=conn, coerce_float=False)
that pulls data from Postgres using a SQL script. Pandas keeps setting some of the columns to type float64. They should just be int. These columns contain some null values. Is there a way to pull the data without Pandas setting them to float64?
Thanks!
As per the documentation, the lack of an NA representation in NumPy means integer NA values can't be represented, so pandas promotes int columns to float.
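If you are on pandas 0.24 or newer, one workaround is to cast the affected columns to the nullable Int64 extension dtype after loading, which keeps integers alongside missing values. A minimal sketch of the promotion and the cast:

import numpy as np
import pandas as pd

s = pd.Series([1, 2, np.nan])
print(s.dtype)  # float64 -- the NaN forced the promotion

s_int = s.astype('Int64')  # capital "I": pandas' nullable integer dtype
print(s_int.dtype)  # Int64
print(s_int.tolist())  # [1, 2, <NA>]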
I have an Excel file that contains all the data I need to read into memory. Each row is a data sample and each column is a feature. I am using the pandas.read_excel() function to read it.
The problem is that this function automatically converts some boolean columns to the float64 type. I manually checked some columns: only the columns with missing values are converted; the columns without missing values are still bool.
My question is: how can I prevent the read_excel() function from automatically converting boolean columns to float64?
Here is my code snippet:
>>> fp = open('myfile.xlsx', 'rb')
>>> df = pd.read_excel(fp, header=0)
>>> df['BooleanFeature'].dtype
dtype('float64')
Here BooleanFeature is a boolean feature, but with some missing values.
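One possible workaround, assuming pandas 1.0+ and that you know the affected column names up front, is to request the nullable boolean extension dtype explicitly via read_excel's dtype parameter ('myfile.xlsx' and 'BooleanFeature' are taken from the snippet above):

import pandas as pd

df = pd.read_excel('myfile.xlsx', header=0, dtype={'BooleanFeature': 'boolean'})
print(df['BooleanFeature'].dtype)  # boolean, with missing cells as <NA>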
I need to turn a pandas DataFrame column into a float. This float is taken from a larger CSV file. To get just the number I need into a DataFrame, I did:
m_df = pd.read_csv(input_file, nrows=1, header=None, skiprows=4)
m1 = m_df.loc[:, 1:1]
This gets me a DataFrame with just the number I want in its only column. How do I turn that number into a float?
float(m_df.loc[:, 1:1].values)
For pandas DataFrames, type casting works when done on the underlying values rather than on the DataFrame itself.
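If you only need the single cell, an alternative is positional access with .iloc, which hands back the scalar directly. A small sketch, using a made-up one-row frame standing in for m_df:

import pandas as pd

m_df = pd.DataFrame([[0, 23.84]])  # hypothetical stand-in for the real CSV row
value = float(m_df.iloc[0, 1])  # row 0, column 1 as a scalar
print(value)  # 23.84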