syntax when saving a file using dropna().to_csv - python

To delete rows in a CSV file that have empty cells, I use the following code:
import pandas as pd

# read the semicolon-separated file
data = pd.read_csv("./test_1.csv", sep=";")

# drop rows that contain empty cells and write the result out
data = data.dropna()
data.to_csv("./test_2.csv", index=False, sep=";")
Everything works fine, but the new CSV file contains incorrect data: integer values are written with an extra dot and zero appended (.0).
Could you please tell me how I can get correct data without the .0?
Thank you very much!

Pandas represents numeric NAs as NaN and therefore casts all of your ints to floats (Python's int has no NaN value, but float does).
If you are sure that you have removed all NAs, just cast your columns/DataFrame to int:
data = data.astype(int)
If you want both integers and NAs, use pandas' nullable integer types, such as pd.Int64Dtype() (string alias "Int64").
More on nullable integer types:
https://pandas.pydata.org/pandas-docs/stable/user_guide/integer_na.html
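
A minimal sketch of the nullable-integer option (an illustration assuming a recent pandas version with nullable dtypes, not code from the original post):
import pandas as pd

data = pd.read_csv("./test_1.csv", sep=";")

# assumes all columns hold whole numbers; "Int64" (capital I) is the
# nullable integer dtype, so missing cells stay as <NA> instead of
# forcing the column to float
data = data.astype("Int64")

# integers are now written without the trailing .0
data.to_csv("./test_2.csv", index=False, sep=";")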

Related

.tolist() converting pandas dataframe values to bytes

I read a csv file in as a pandas dataframe but now need to use the data to do some calculations.
import pandas as pd
### LOAD DAQ FILE
columns = ["date","time","ch104","alarm104","ch114","alarm114","ch115","alarm115","ch116","alarm116","ch117","alarm117","ch118","alarm118"]
df = pd.read_csv(cal_file,sep='[, ]',encoding='UTF-16 LE',names=columns,header=15,on_bad_lines='skip',engine='python')
### DEFINE SIGNAL CHANNELS
ch104 = df.ch104.dropna() #gets rid of None value at the end of the column
print(ch104)
When I print ch104 I get the following.
But I cannot do math on it yet: it is a pandas.Series whose values are strings, so the datatype is not right.
The error when I do calculations is this:
can't multiply sequence by non-int of type 'float'
So I tried using .tolist() or even list() on the data, but then ch104 looks like this.
I believe the values are being read as bytes and then stored as a list of strings.
Does anyone know how I can get around this or fix this issue? It may be because the original file is UTF-16 LE encoded, but I cannot change this and need to use the files as is.
I need the values for simple calculations later on but they would need to be a float or a double for that. What can I do?
You probably get this error because you're trying to do calculations on columns that pandas considers non-numeric. The values (numbers in your sense) are for some reason interpreted as strings (in the pandas sense).
To fix that, you can change the type of those columns using pandas.to_numeric:
import re

# strip extra whitespace from every string (object) column
df_obj = df.select_dtypes(['object'])
df[df_obj.columns] = df_obj.apply(lambda x: x.str.strip())

# select the channel/alarm columns and convert them to numeric
cols = list(filter(re.compile('^(ch|alarm)').match, df.columns.to_list()))
df[cols] = df[cols].apply(pd.to_numeric, errors='coerce')

ch104 = df.ch104.dropna()
# do your calculations on the ch104 series here
Note that errors='coerce' will put NaN in place of every bad value in your columns. From the pandas.to_numeric documentation:
errors : {'ignore', 'raise', 'coerce'}, default 'raise'
    If 'coerce', then invalid parsing will be set as NaN.
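
To illustrate the coercion behaviour, a small standalone sketch (not from the original post):
import pandas as pd

# strings that parse become floats; 'bad' and None become NaN
s = pd.Series(['1.5', ' 2.0', 'bad', None])
print(pd.to_numeric(s.str.strip(), errors='coerce'))
# 0    1.5
# 1    2.0
# 2    NaN
# 3    NaN
# dtype: float64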

Pandas: Convert Column to timedelta64

I am trying to read a CSV file as a pandas DataFrame. Besides the column names, I also have the expected dtype for each column. My approach is:
- reading the CSV with inferred column types (as I want to be able to catch issues)
- reading the expected column types
- iterating over the columns and trying to convert them with astype
Now, I have timedeltas in nanoseconds. They are read in as float64 and can contain missing values. astype fails with the following message:
ValueError: Cannot convert non-finite values (NA or inf) to integer
This little script reproduces my issue. The method to_timedelta works perfectly on my data, while the astype conversion gives the error.
import pandas as pd
import numpy as np
timedeltas = [200800700, 30020010030, np.nan]
data = {'timedelta': timedeltas}
pd.to_timedelta(timedeltas)
df = pd.DataFrame(data)
df.dtypes
df['timedelta'].astype('timedelta64[ns]')
Can anybody help me fix this issue? Is there any other safe representation than nanoseconds which would work with astype?
Thanks to MrFuppes: it's not possible to use astype() here, but to_timedelta works. Thank you!
df['timedelta'] = pd.to_timedelta(df['timedelta'])
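
For completeness, a minimal sketch of the working conversion (missing values come out as NaT; numeric input is interpreted as nanoseconds by default):
import pandas as pd
import numpy as np

df = pd.DataFrame({'timedelta': [200800700, 30020010030, np.nan]})

# NaN becomes NaT; floats are treated as nanosecond counts
df['timedelta'] = pd.to_timedelta(df['timedelta'])
print(df['timedelta'])
# 0   0 days 00:00:00.200800700
# 1   0 days 00:00:30.020010030
# 2                         NaT
# Name: timedelta, dtype: timedelta64[ns]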

Pandas read_sql_query converting integer column to float

I have the following line
df = pandas.read_sql_query(sql = sql_script, con=conn, coerce_float = False)
that pulls data from Postgres using a SQL script. Pandas keeps setting some of the columns to type float64. They should just be int. These columns contain some null values. Is there a way to pull the data without Pandas setting them to float64?
Thanks!
As per the documentation, the lack of an NA representation in NumPy means integer NA values can't be represented, so pandas promotes int columns to float.
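
If you need integer columns that can hold NULLs, one option is pandas' nullable integer dtype. A sketch assuming a recent pandas version with nullable dtypes; 'my_int_col' is a placeholder name, and sql_script/conn are the variables from the question:
import pandas as pd

df = pd.read_sql_query(sql=sql_script, con=conn, coerce_float=False)

# 'my_int_col' is hypothetical; "Int64" (capital I) is the nullable
# integer dtype, which keeps NULLs as <NA> instead of forcing float
df['my_int_col'] = df['my_int_col'].astype('Int64')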

prevent pandas read_excel function from automatic converting boolean columns into float64

I have an Excel file which contains all the data I need to read into memory. Each row is a data sample and each column is a feature. I am using the pandas.read_excel() function to read it.
The problem is that this function automatically converts some boolean columns into the float64 type. I checked some columns manually: only the columns with missing values are converted; the columns without missing values are still bool.
My question is: how can I prevent read_excel() from automatically converting boolean columns into float64?
Here is my code snippet:
>>> fp = open('myfile.xlsx', 'rb')
>>> df = pd.read_excel(fp, header=0)
>>> df['BooleanFeature'].dtype
dtype('float64')
Here BooleanFeature is a boolean feature, but with some missing values.
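
One possible workaround (a sketch, assuming pandas 1.0+ where the nullable 'boolean' dtype exists, using the column name from the snippet above) is to request the nullable boolean dtype explicitly via the dtype argument:
import pandas as pd

# 'boolean' (lowercase) is the nullable boolean extension dtype;
# missing cells become <NA> instead of pulling the column to float64
df = pd.read_excel('myfile.xlsx', header=0,
                   dtype={'BooleanFeature': 'boolean'})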

Inconsistent pandas read_csv dtype inference on mostly-integer string column in huge TSV file

I have a tab-separated file with a column that should be interpreted as a string, but many of the entries are integers. With small files, read_csv correctly interprets the column as a string after seeing some non-integer values, but with larger files this doesn't work:
import pandas as pd
df = pd.DataFrame({'a':['1']*100000 + ['X']*100000 + ['1']*100000, 'b':['b']*300000})
df.to_csv('test', sep='\t', index=False, na_rep='NA')
df2 = pd.read_csv('test', sep='\t')
print(df2['a'].unique())
for a in df2['a'][262140:262150]:
    print(repr(a))
output:
['1' 'X' 1]
'1'
'1'
'1'
'1'
1
1
1
1
1
1
Interestingly, 262144 is a power of 2, so I think inference and conversion is happening in chunks but skipping some chunks.
I am fairly certain this is a bug, but I would like a workaround, perhaps using quoting, though adding
quoting=csv.QUOTE_NONNUMERIC
for reading and writing does not fix the problem. Ideally I could work around this by quoting my string data and somehow forcing pandas not to do any inference on quoted data.
Using pandas 0.12.0
To avoid having Pandas infer your data type, provide a converters argument to read_csv:
converters : dict, optional
    Dict of functions for converting values in certain columns. Keys can either be integers or column labels.
For your file this would look like:
df2 = pd.read_csv('test', sep='\t', converters={'a':str})
My reading of the docs is that you do not need to specify converters for every column. Pandas should continue to infer the datatype of unspecified columns.
You've tricked the read_csv parser here (and to be fair, I don't think it can always be expected to parse correctly no matter what you throw at it)... but yes, it could be a bug!
As @Steven points out, you can use the converters argument of read_csv:
df2 = pd.read_csv('test', sep='\t', converters={'a': str})
A lazy solution is just to patch this up after you've read in the file:
In [11]: df2['a'] = df2['a'].astype('str')
# now they are equal
In [12]: pd.util.testing.assert_frame_equal(df, df2)
Note: If you are looking for a solution to store DataFrames, e.g. between sessions, both pickle and HDFStore are excellent solutions which won't be affected by these types of parsing bugs (and will be considerably faster). See: How to store data frame using PANDAS, Python
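
As a side note for newer pandas versions (an addition, not part of the original answers): read_csv also accepts a dtype argument, which skips inference entirely for the listed columns:
import pandas as pd

# parse column 'a' as strings outright; no type inference is done on it
df2 = pd.read_csv('test', sep='\t', dtype={'a': str})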
