Pandas: Convert Column to timedelta64 - python

I'm trying to read a CSV file as a pandas data frame. Besides the column names, I also know the expected dtype of each column. My approach is:
1. read the CSV while letting pandas infer the column types (so I can catch issues)
2. read the expected column types
3. iterate over the columns and try to convert each one with 'astype'
Now, I have timedeltas in nanoseconds. They are read in as float64 and can contain missing values. 'astype' fails with the following message:
ValueError: Cannot convert non-finite values (NA or inf) to integer
This little script reproduces my issue. The method 'to_timedelta' works perfectly on my data, while the 'astype' conversion gives the error.
import pandas as pd
import numpy as np
timedeltas = [200800700, 30020010030, np.nan]
data = {'timedelta': timedeltas}
pd.to_timedelta(timedeltas)
df = pd.DataFrame(data)
df.dtypes
df['timedelta'].astype('timedelta64[ns]')
Can anybody help me fix this issue? Is there any other safe representation than nanoseconds that would work with 'astype'?

Thanks to MrFuppes.
It's not possible to use astype() here, but to_timedelta works. Thank you!
df['timedelta'] = pd.to_timedelta(df['timedelta'])
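For reference, a minimal sketch of the working approach, using the same values as the question's script: pd.to_timedelta maps the float NaN to NaT (pandas' missing-timedelta marker), which is exactly what astype refuses to do.

```python
import numpy as np
import pandas as pd

# Float nanosecond values with a missing entry, as in the question.
df = pd.DataFrame({'timedelta': [200800700, 30020010030, np.nan]})

# astype('timedelta64[ns]') raises ValueError on the NaN, but
# to_timedelta converts it to NaT (pandas' missing-timedelta value).
df['timedelta'] = pd.to_timedelta(df['timedelta'], unit='ns')
print(df['timedelta'])
```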

Related

.tolist() converting pandas dataframe values to bytes

I read a csv file in as a pandas dataframe but now need to use the data to do some calculations.
import pandas as pd
### LOAD DAQ FILE
columns = ["date","time","ch104","alarm104","ch114","alarm114","ch115","alarm115","ch116","alarm116","ch117","alarm117","ch118","alarm118"]
df = pd.read_csv(cal_file,sep='[, ]',encoding='UTF-16 LE',names=columns,header=15,on_bad_lines='skip',engine='python')
### DEFINE SIGNAL CHANNELS
ch104 = df.ch104.dropna() #gets rid of None value at the end of the column
print(ch104)
When I print ch104 I get the following.
But I cannot do math on it yet: it is a pandas.Series whose values are strings, so the datatype is not correct.
The error if I do calculations is this:
can't multiply sequence by non-int of type 'float'
So what I tried to do is use .tolist() or even list() on the data, but then ch104 looks like this.
I believe the values are being written as bytes and then stored as a list of strings.
Does anyone know how I can get around this or fix this issue? It may be because the original file is UTF-16 LE encoded, but I cannot change this and need to use the files as is.
I need the values for simple calculations later on but they would need to be a float or a double for that. What can I do?
You probably get this error because you're trying to do calculations on columns that pandas considers non-numeric. The values (numbers in your sense) are for some reason interpreted as strings (in pandas' sense).
To fix that, you can change the type of those columns using pandas.to_numeric:
df_obj = df.select_dtypes(['object'])
df[df_obj.columns] = df_obj.apply(lambda x: x.str.strip()) # to get rid of the extra whitespace
import re
cols = list(filter(re.compile('^(ch|alarm)').match, df.columns.to_list()))
df[cols] = df[cols].apply(pd.to_numeric, errors='coerce')
ch104 = df.ch104.dropna()
# do your calculations on the ch104 series here
Note that the 'coerce' argument will put NaN instead of every bad value in your columns.
From the docs: errors : {'ignore', 'raise', 'coerce'}, default 'raise'
If 'coerce', then invalid parsing will be set as NaN.
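As a minimal sketch of that coerce behavior (with made-up channel values, not the asker's actual file):

```python
import pandas as pd

# Hypothetical channel readings that arrived as strings, one of them bad.
ch = pd.Series([' 1.5', '2.0', 'bad', '3.25'])

# errors='coerce' turns unparseable entries into NaN instead of raising.
nums = pd.to_numeric(ch.str.strip(), errors='coerce')

# After dropping the NaN, ordinary float arithmetic works again.
ch104 = nums.dropna()
doubled = ch104 * 2.0
```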

syntax when saving a file using dropna().to_csv

To delete lines in a CSV file that have empty cells - I use the following code:
import pandas as pd
data = pd.read_csv("./test_1.csv", sep=";")
data.dropna()
data.dropna().to_csv("./test_2.csv", index=False, sep=";")
everything works fine, but I get a new CSV file with incorrect data (the values highlighted in red squares):
I get additional characters in the form of a dot and a zero, .0.
Could you please tell me how to get the correct data without the .0?
Thank you very much!
Pandas represents numeric NAs as NaNs and therefore casts all of your ints to floats (Python's int has no NaN value, but float does).
If you are sure that you removed all NAs, just cast your columns/dfs to int:
data = data.astype(int)
If you want to have integers and NAs, use pandas nullable integer types such as pd.Int64Dtype().
more on nullable integer types:
https://pandas.pydata.org/pandas-docs/stable/user_guide/integer_na.html
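A minimal sketch of the nullable-integer route, assuming a small series with one missing value:

```python
import pandas as pd

# A column with a missing value is stored as float64, so 1 becomes 1.0.
s = pd.Series([1, 2, None])

# The nullable 'Int64' extension dtype keeps integer values and shows
# <NA> for the gap, so no '.0' appears when writing back to CSV.
s_int = s.astype('Int64')
```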

Data conversion in Pandas

I am working with a dataset that contains numerical values stored as type object. Below you can see what the data looks like in a pandas.core.frame.DataFrame.
I want to convert these columns from object type into int64. In order to do this, I tried this line of code:
X['duration']=X['duration'].astype('float64')
But this line doesn't work and gives this error message:
could not convert string to float: '18`'
Can anybody help me solve this and convert the values into numbers?
Try str.extract followed by pd.to_numeric -
X['duration'] = X['duration'].str.extract(r'(\d+)', expand=False)
X['duration'] = pd.to_numeric(X['duration'], errors='coerce')
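A minimal sketch with made-up values that include the stray backtick from the error message:

```python
import pandas as pd

# Object-dtype values, one with a trailing backtick as in the error.
X = pd.DataFrame({'duration': ['18`', '25', '7']})

# Pull out the digits, then convert; errors='coerce' would map any
# completely non-numeric leftover to NaN instead of raising.
X['duration'] = X['duration'].str.extract(r'(\d+)', expand=False)
X['duration'] = pd.to_numeric(X['duration'], errors='coerce')
```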

Pandas adding extra zeros to decimal value

I'm importing a file into a pandas dataframe, but the dataframe is not retaining the original values as-is; instead it's adding extra zeros to some float columns.
For example, the original value in the file is 23.84, but when I import it into the dataframe it has a value of 23.8400.
How do I fix this? Or is there a way to import the original values into the dataframe exactly as they are in the text file?
For anyone who encounters the same problem, I'm adding the solution I found. pandas read_csv has a dtype argument with which we can tell pandas to read columns as strings; this reads the data as-is instead of interpreting it based on pandas' own logic.
df1 = pd.read_csv('file_location', sep=',', dtype={'col1': 'str', 'col2': 'str'})
I had too many columns, so I first created a dictionary with all the columns as keys and 'str' as their values, and passed this dictionary to the dtype argument.
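The dictionary trick described above might look like this (hypothetical file contents and column names, sketched with an in-memory CSV):

```python
import pandas as pd
from io import StringIO

# Stand-in for the real file: two columns of decimal text.
csv_text = "col1,col2\n23.84,1.50\n10.10,2.00\n"
columns = ['col1', 'col2']

# Map every column name to 'str' so read_csv keeps the text verbatim.
dtype_map = {c: 'str' for c in columns}
df1 = pd.read_csv(StringIO(csv_text), sep=',', dtype=dtype_map)
```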

Error when converting pandas data frame columns from string with comma to float with dot

I am trying to convert a pandas data frame (read in from a .csv file) from string to float. Columns 1 to 20 are recognized as "strings" by the system. However, they are float values in the format "10,93847722". Therefore, I tried the following code:
new_df = df[df.columns[1:20]].transform(lambda col: col.str.replace(',','.').astype(float))
The last line causes the Error:
AttributeError: 'DataFrame' object has no attribute 'transform'
Maybe important to know: I can only use pandas version 0.16.2.
Thank you very much for your help!
@all: a short extract from one of the columns:
23,13854599
23,24945831
23,16853714
23,0876255
23,05908775
Use DataFrame.apply:
df[df.columns[1:20]].apply(lambda col: col.str.replace(',','.').astype(float))
EDIT: If non-numeric values are possible, use to_numeric with errors='coerce' to replace those values with NaNs:
df[df.columns[1:20]].apply(lambda col: pd.to_numeric(col.str.replace(',','.'),errors='coerce'))
You should load them directly as numbers:
pd.read_csv(..., decimal=',')
This will recognize , as decimal point for every column.
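A minimal sketch of the decimal=',' route, using an in-memory stand-in for the real .csv file (separator and column names assumed here):

```python
import pandas as pd
from io import StringIO

# Stand-in for the real file: semicolon-separated, comma as decimal point.
csv_text = "id;value\n1;23,13854599\n2;23,24945831\n"

# decimal=',' makes read_csv parse the comma-decimal numbers as floats
# directly, so no string replacement is needed afterwards.
df = pd.read_csv(StringIO(csv_text), sep=';', decimal=',')
```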
