Pandas - Parallelizing astype function - python

I'm dealing with a huge dataset with lots of features. Those features are actually int type, but since they have np.nan values, pandas assigns them the float64 type.
I'm casting those features to float32 by iterating every single column. It takes about 10 minutes to complete. Is there any way to speed up this operation?
The data is read from a csv file. There are object and int64 columns in the data.
for col in float_cols:
    df[col] = df[col].astype(np.float32)

Use the dtype parameter with a dictionary in read_csv:
df = pd.read_csv(file, dtype=dict.fromkeys(float_cols, np.float32))
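A minimal sketch of that approach, with a hypothetical file name and column list; note that if the frame is already in memory, a single astype call over the whole column subset also avoids the per-column loop:

import numpy as np
import pandas as pd

float_cols = ['feat1', 'feat2', 'feat3']  # hypothetical column names

# Request float32 at read time, so float64 columns are never kept around
df = pd.read_csv('data.csv', dtype=dict.fromkeys(float_cols, np.float32))

# Alternatively, if df is already loaded, cast the subset in one call
df[float_cols] = df[float_cols].astype(np.float32)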

Pandas group by without losing type information while ensuring all data is retained

I am using this function to group a pandas dataframe. I have a frame with float64, int64, and object columns. This function groupbyFlatCount is adapted from dask. I was encountering issues with missing data when grouping over an int64 column. I isolated this column and was able to get it to work with an object dtype.
def groupbyFlatCount(frame, by):
    return frame.groupby(by=by).size().reset_index().rename(columns={0: 'count'})
How can I get this function to work without losing type information?
The robust solution is to convert the grouping columns to object (string) dtype and then cast them back afterwards.
def groupbyFlatCount(frame, by):
    # Remember the original dtypes, group on string copies, then cast back.
    dtypes = frame[by].dtypes.to_dict()
    for col in by:
        frame[col] = frame[col].map(str)
    return frame.groupby(by=by).size().reset_index().rename(columns={0: 'count'}).astype(dtypes)
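A quick sanity check on a made-up frame (the column names are hypothetical):

import pandas as pd

frame = pd.DataFrame({'a': [1, 1, 2], 'b': [1.5, 1.5, 2.5]})
out = groupbyFlatCount(frame, by=['a', 'b'])
print(out.dtypes)  # a: int64, b: float64, count: int64

Note that, as written, the function mutates the by columns of the input frame in place (they become strings).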

Groupby year dropping some variables

This is the original data and I need the mean of each year of all the variables.
But when I am using groupby('year') command, it is dropping all variables except 'lnmcap' and 'epu'.
Why is this happening, and what needs to be done?
The other columns probably hold object (string) data instead of numeric data, which is why only 'lnmcap' and 'epu' got an averaged column. Use ds.dtypes or simply ds.info() to check the data types of the columns.
If they come out as object/string type, then use:
ds = ds.drop('company', axis=1)
column_names = ds.columns
for i in column_names:
    ds[i] = ds[i].astype(str).astype(float)
This could work
You might want to convert all numerical columns to float before taking their mean, for example:
cols = list(ds.columns)
# remove irrelevant columns
cols.pop(cols.index('company'))
cols.pop(cols.index('year'))
# convert remaining relevant columns to float
for col in cols:
    ds[col] = pd.to_numeric(ds[col], errors='coerce')
# after that you can apply the aggregation
ds.groupby('year').mean()
You will need to convert the numeric columns to float types. Use ds.info() to check the various data types.
for col in ds.select_dtypes(['object']).columns:
    try:
        ds[col] = ds[col].astype('float')
    except (ValueError, TypeError):
        continue
After this, use ds.info() to check again. Columns holding strings like '1.604809' will be converted to the float 1.604809.
Sometimes a column may contain "dirty" data that cannot be converted to float. In that case, you can use the code below; errors='coerce' means non-numeric data becomes NaN:
column_names = list(ds.columns)
column_names.remove('company')
column_names.remove('year')
for col in column_names:
    ds[col] = pd.to_numeric(ds[col], errors='coerce')  # non-numeric values become NaN
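As a minimal illustration of the coerce behaviour, on made-up data shaped like the question's:

import pandas as pd

ds = pd.DataFrame({
    'company': ['A', 'B', 'A', 'B'],
    'year': [2019, 2019, 2020, 2020],
    'lnmcap': ['1.60', '2.10', 'n/a', '1.95'],  # one dirty value
})
ds['lnmcap'] = pd.to_numeric(ds['lnmcap'], errors='coerce')  # 'n/a' -> NaN
print(ds.groupby('year').mean(numeric_only=True))  # NaN is skipped by mean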

How do I use df.astype() inside apply function

I have a data frame in which all the columns are of type object. I want to convert them all to numeric types using the astype() function, but I don't want to spell out every column like this:
df.astype({'col1': 'int32', 'col2': 'int32', ...})
If I try to do the same thing inside apply(), I get an error, because apply() passes each column to the function as a Series.
PS: The other way of doing the same thing is:
df.apply(pd.to_numeric)
But I want to do this using .astype()
Is there any way to avoid df.apply() and still convert all object-type data into numeric using df.astype()?
Use df = df.astype(int) to convert all columns to the int dtype, or pass a specific NumPy type:
import numpy as np
df = df.astype(np.int32)
In my opinion, the safest option is to use pd.to_numeric in your apply function, which also gives you error handling: coerce, raise, or ignore. Once the columns are numeric, you can safely perform your astype() operation, but I wouldn't start with it:
df.apply(pd.to_numeric, errors='ignore')
If a column can't be converted to numeric, it will remain unchanged.
df.apply(pd.to_numeric, errors='coerce')
The columns will be converted to numeric, the values that can't be converted to numeric in the column will be replaced with NaN.
df.apply(pd.to_numeric, errors='raise')
A ValueError will be raised if a column can't be converted to numeric.
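A small demonstration of the coerce and raise modes (note that errors='ignore' has been deprecated in recent pandas releases):

import pandas as pd

s = pd.Series(['1', '2', 'three'])
print(pd.to_numeric(s, errors='coerce'))  # 1.0, 2.0, NaN
try:
    pd.to_numeric(s, errors='raise')
except ValueError as err:
    print(err)  # unable to parse the string 'three'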
If these are object columns and you're certain they can be "soft-casted" to int, you have two options:
df
worker day tasks
0 A 2 read
1 A 9 write
2 B 1 read
3 B 2 write
4 B 4 execute
df.dtypes
worker object
day object
tasks object
dtype: object
pandas <= 0.25
infer_objects (0.21+ only) casts your data to numpy types if possible.
df.infer_objects().dtypes
worker object
day int64
tasks object
dtype: object
pandas >= 1.0
convert_dtypes casts your data to the most specific pandas extension dtype if possible.
df.convert_dtypes().dtypes
worker string
day Int64
tasks string
dtype: object

prevent pandas read_excel function from automatic converting boolean columns into float64

I have an Excel file which contains all the data I need to read into memory. Each row is a data sample and each column is a feature. I am using the pandas.read_excel() function to read it.
The problem is that this function automatically converts some boolean columns into float64. I checked some columns manually: only the columns with missing values are converted; the columns without missing values are still bool.
My question is: how can I prevent read_excel() from automatically converting boolean columns into float64?
Here is my code snippet:
>>> fp = open('myfile.xlsx', 'rb')
>>> df = pd.read_excel(fp, header=0)
>>> df['BooleanFeature'].dtype
dtype('float64')
Here BooleanFeature is a boolean feature, but with some missing values.
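The thread offers no answer here, but one possible workaround (my assumption, not from the original post) is to cast the affected columns to pandas' nullable boolean extension dtype after reading; it can represent missing values without falling back to float64:

import pandas as pd

df = pd.read_excel('myfile.xlsx', header=0)
# 1.0/0.0/NaN -> True/False/<NA>; assumes the float values are exactly 0 or 1
df['BooleanFeature'] = df['BooleanFeature'].astype('boolean')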

dataframe values.tolist() datatype

I have a dataframe like this:
This dataframe has several columns. Two are of type float: price and change, while volume and amount are of type int.
I use df.values.tolist() to convert df to a list and get the data:
datatmp = df.values.tolist()
print(datatmp[0])
[20160108150023.0, 11.12, -0.01, 4268.0, 4746460.0, 2.0]
The int types in df all change to float. My question is: why do the int types change to float, and how can I get the int data I want?
You can convert column-by-column:
by_column = [df[x].values.tolist() for x in df.columns]
This will preserve the data type of each column.
Then convert to the structure you want:
list(list(x) for x in zip(*by_column))
You can do it in one line:
list(list(x) for x in zip(*(df[x].values.tolist() for x in df.columns)))
You can check what datatypes your columns have with:
df.info()
Very likely your column amount is of type float. Do you have any NaN in this column? These are always of type float and would make the whole column float.
You can cast the whole array to int with:
df.values.astype(int).tolist()
Note that this truncates the genuine float columns (e.g. price) as well.
I think the pandas documentation helps:
DataFrame.values
Numpy representation of NDFrame
The dtype will be a lower-common-denominator dtype (implicit upcasting); that is to say if the dtypes (even of numeric types) are mixed, the one that accommodates all will be chosen. Use this with care if you are not dealing with the blocks.
So here, apparently, float is chosen to accommodate all component types. A simple method would be (though quite possibly there are more elegant solutions around; I'm not too familiar with pandas):
datatmp = list(map(lambda row: list(row[1:]), df.itertuples()))
Here itertuples() gives an iterator with elements of the form (rownumber, column1_entry, column2_entry, ...). The map takes each such tuple and applies the lambda function, which removes the first component (the row number) and returns a list containing the components of a single row. You can also drop the inner list() call if you are happy to work with a list of tuples.
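For example, on a tiny made-up frame (using an equivalent list comprehension):

import pandas as pd

df = pd.DataFrame({'price': [11.12], 'volume': [4268]})
datatmp = [list(row[1:]) for row in df.itertuples()]
print(datatmp)  # [[11.12, 4268]] -- volume keeps an integer type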
See the pandas documentation for the DataFrame.values property: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.values.html
