pandas vs SAS dataset: values are not exactly the same - python

Before reading into pandas, the data is stored in a SAS dataset. My data looks like:
SNYDJCM -- integer
740.19999981
After reading into pandas the data changes as below:
SNYDJCM -- converted to float
740.200000
How can I get the same value after reading into a pandas DataFrame?
Steps followed:
1) import pandas as pd
2) pd.read_sas(path, format='sas7bdat', encoding='iso-8859-1')
Need your help

Try importing SAS7BDAT and opening your file with it before reading:
from sas7bdat import SAS7BDAT
SAS7BDAT('FILENAME.sas7bdat')
df = pd.read_sas('FILENAME.sas7bdat', format='sas7bdat')
or use it to read the file directly:
from sas7bdat import SAS7BDAT
sas_file = SAS7BDAT('FILENAME.sas7bdat')
df = sas_file.to_data_frame()
or use pyreadstat to read the file:
import pyreadstat
df, meta = pyreadstat.read_sas7bdat('FILENAME.sas7bdat')

First, 740.19999981 is not an integer; 740 would be the nearest integer. Also, when you round 740.19999981 to 6 decimal places you get 740.200000, which is what pandas displays by default. I would suggest printing with higher precision to see whether the value really changed:
print("%.12f"%(x,))

Related

Whole numbers are coming out with decimals using pandas

I'm using pandas to apply some format-level changes to a csv and storing the result in a target file. The source file has some integers, but after the pandas operation the integers are converted to decimals. For example, 3 in the source file is converted to 3.0. I would like the integers to remain integers.
Any pointers on how to get this working would be really helpful, thank you!
import pandas as pd
# reading the csv file
df = pd.read_csv(source)
# updating the column value/data
df['Regular'] = df['Regular'].replace({',': '_,'})
# writing into the file
df.to_csv(target, index=False)
You can specify data types for pandas read_csv(), e.g.:
df = pd.read_csv(source, dtype={'column_name_a': 'Int32', 'column_name_b': 'Int32'})
see the docs here: https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html
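A common reason integers turn into floats is a column with missing values being promoted to float64; pandas' nullable integer dtype avoids that. A minimal sketch, assuming the 'Regular' column from the question and placeholder file names:
import pandas as pd

# 'Int64' (capital I) is the nullable integer dtype, so missing values
# don't force the whole column to float
df = pd.read_csv('source.csv', dtype={'Regular': 'Int64'})
df.to_csv('target.csv', index=False)  # 3 is written back as 3, not 3.0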

How to load an sframe format file in pandas?

Is there any way to directly open a .sframe extension file in pandas? Something easy like:
df = pd.read_csv('people.sframe')
Thank you.
No, you can't import SFrame files directly with pandas. Instead, you can use the free Python library sframe:
import sframe
import pandas as pd
sf = sframe.SFrame('people.sframe')
Then you can convert it to a pandas DataFrame using:
df = sf.to_dataframe()

pandas string values are not getting the proper format

Before reading into pandas, my data looks like this in the SAS dataset:
Name
Alfred
Alice
After reading into pandas, the data comes out as:
Name
b'Alfred'
b'Alice'
Why am I getting different data? Steps followed:
import pandas as pd
df=pd.read_sas(r'C:/ProgramData/Anaconda3/Python_local/class.sas7bdat',format='sas7bdat')
Need your help.
SAS files store their strings as bytes, so they need to be read with an explicit encoding:
df=pd.read_sas(r'C:/ProgramData/Anaconda3/Python_local/class.sas7bdat',format='sas7bdat', encoding='iso-8859-1')
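If the file was already read without an encoding, the byte strings can also be decoded afterwards. A minimal sketch, assuming the Name column from the question:
# Decode the byte strings to regular Python strings in place
df['Name'] = df['Name'].str.decode('iso-8859-1')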

Python - read parquet data from a variable

I am reading a parquet file and transforming it into a dataframe.
from fastparquet import ParquetFile
pf = ParquetFile('file.parquet')
df = pf.to_pandas()
Is there a way to read a parquet file from a variable (that was previously read and now holds parquet data)?
Thanks.
In pandas there is a method to deal with parquet; here is a reference to the docs. Something like this:
import pandas as pd
pd.read_parquet('file.parquet')
should work. Also please read this post for engine selection.
You can also read a file from a variable using pandas.read_parquet, with the following code. I tested this with the pyarrow backend, but it should also work for the fastparquet backend.
import pandas as pd
import io

with open("file.parquet", "rb") as f:
    data = f.read()
buf = io.BytesIO(data)
df = pd.read_parquet(buf)
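The same pattern works in reverse. A hedged round-trip sketch, assuming a pyarrow or fastparquet backend is installed:
import io
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3]})
buf = io.BytesIO()
df.to_parquet(buf)           # serialize the DataFrame into the in-memory buffer
buf.seek(0)                  # rewind before reading
df2 = pd.read_parquet(buf)   # read the parquet bytes back from the variable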

Extracting data from a large csv file causes dtype warnings

I work for a company and I recently switched from a spreadsheet package to Python. Since I am very new to Python, there are a lot of things I have difficulty grasping. Using Python, I am trying to extract data from a large csv file (37791 rows and 316 columns). Here is a piece of code I wrote:
Solution 1
import numpy as np
import pandas as pd
df = pd.read_csv('C:\\Users\\Maxwell\\Desktop\\Test.data.csv', skiprows=1)
data=df.loc[:,['Steps','Parameter']]
This command generates a warning, i.e., it gives a DtypeWarning: Columns (0,1,2,3,...,81) have mixed types. Specify dtype option on import or set low_memory=False.
So, I found a workaround.
Solution 2
import pandas as pd
import numpy as np
df = pd.read_csv('C:\\Users\\Maxwell\\Desktop\\Test.data.csv', skiprows=1, error_bad_lines=False, index_col=False, dtype='unicode')
data=df.loc[:,['Steps','Parameter']]
Two questions:
i) I was able to get around the warning, but now the columns that I want (Steps & Parameter) have been converted to objects (probably due to the dtype='unicode' option). How can I convert the Steps column to an integer type and Parameter to a float?
ii) Some people say the dtype warning isn't really an error. But I found that when I use Solution 1 to read the csv file, the Steps column contains some floats. The original csv file doesn't have any floats in the Steps column. It looks as if some floats have been introduced by Python itself! Why does this happen?
(I am not able to upload the original csv file, because my company doesn't allow it!)
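For question i), a minimal sketch of converting the object columns back to numbers, assuming the Steps and Parameter column names from the question (errors='coerce', which turns unparseable cells into NaN, is an assumption about how bad rows should be handled):
import pandas as pd

# Convert the object (string) columns produced by dtype='unicode'
data['Steps'] = pd.to_numeric(data['Steps'], errors='coerce').astype('Int64')  # nullable integer
data['Parameter'] = pd.to_numeric(data['Parameter'], errors='coerce')          # float64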
