How to prevent Pandas from converting datetimes to datetime64 - python

Need
I am trying to export a dataframe to a Parquet file, which will be consumed later in the pipeline by something that is not Python or Pandas. (Azure Data Factory)
When I ingest the Parquet file later in the flow, it cannot recognize datetime64[ns]. I would rather just use "vanilla" Python datetime.datetime.
Problem
But I cannot manage to do this. The problem is that Pandas forces any "datetime-like" object into datetime64[ns] once it is back in a dataframe or series.
Small Example
For instance, assume the iris dataset with a "timestamp" column:
>>> df.head()
sepal length (cm) sepal width (cm) petal length (cm) petal width (cm) class timestamp
0 5.1 3.5 1.4 0.2 setosa 2021-02-19 15:07:24.719272
1 4.9 3.0 1.4 0.2 setosa 2021-02-19 15:07:24.719272
2 4.7 3.2 1.3 0.2 setosa 2021-02-19 15:07:24.719272
3 4.6 3.1 1.5 0.2 setosa 2021-02-19 15:07:24.719272
4 5.0 3.6 1.4 0.2 setosa 2021-02-19 15:07:24.719272
>>> df.dtypes
sepal length (cm) float64
sepal width (cm) float64
petal length (cm) float64
petal width (cm) float64
class category
timestamp datetime64[ns]
dtype: object
I can convert a value to a "normal Python datetime":
>>> df.timestamp[1]
Timestamp('2021-02-19 15:07:24.719272')
>>> type(df.timestamp[1])
<class 'pandas._libs.tslibs.timestamps.Timestamp'>
>>> df.timestamp[1].to_pydatetime()
datetime.datetime(2021, 2, 19, 15, 7, 24, 719272)
>>> type(df.timestamp[1].to_pydatetime())
<class 'datetime.datetime'>
But I cannot "keep" it in that type, when I convert the entire column / series:
>>> df['ts2'] = df.timestamp.apply(lambda x: x.to_pydatetime())
>>> df.dtypes
sepal length (cm) float64
sepal width (cm) float64
petal length (cm) float64
petal width (cm) float64
class category
timestamp datetime64[ns]
ts2 datetime64[ns]
Possible Solutions
I looked to see if there was anything I could do to "dumb down" the dataframe column and make its datetimes less precise, but I cannot see anything. Nor can I see an option to specify column data types on export via the df.to_parquet() method.
Is there a way to create a plain Python datetime.datetime column (not a Numpy/Pandas datetime64[ns] column) in a Pandas dataframe?

Try forcing dtype='object' when you use to_pydatetime:
df['ts'] = pd.Series(df.timestamp.dt.to_pydatetime(),dtype='object')
df.loc[0,'ts']
Output:
datetime.datetime(2021, 2, 19, 15, 7, 24, 719272)
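A quick sanity check (just a sketch, reusing the df and ts column from above): because the column's dtype is object, pandas will not coerce the values back to datetime64[ns].
import datetime
assert df['ts'].dtype == object                        # stays 'object', not datetime64[ns]
assert isinstance(df.loc[0, 'ts'], datetime.datetime)  # plain Python datetime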

Related

Having problems with splitting a dataframe into train and test

I am splitting a data frame into train and test dataframes but have a problem with the output. I want to get dataframes for train and test. The code is as follows.
train_prices=int(len(prices)*0.65)
test_prices=len(prices)-train_prices
train_data,test_data=prices.iloc[0:train_prices,:].values,prices.iloc[train_prices:len(prices),:1].values
However, I only get a single value rather than dataframes. The output is something like this:
code
training_size,test_size
Output
11156746, 813724
I expect train and test dataframes that will enable me to go on with ML models.
Please assist
Since you didn't provide a reproducible example, I'll demonstrate on the iris dataset.
from sklearn.datasets import load_iris
data = load_iris(as_frame=True)
data = data["data"]
data.head()
sepal length (cm) sepal width (cm) petal length (cm) petal width (cm)
0 5.1 3.5 1.4 0.2
1 4.9 3.0 1.4 0.2
2 4.7 3.2 1.3 0.2
3 4.6 3.1 1.5 0.2
4 5.0 3.6 1.4 0.2
First, remove the .values from your code. values gives you the underlying NumPy array, but you want the output to be a dataframe.
Second, in test_data you took only the first column, since you used prices.iloc[train_prices:len(prices),:1] instead of
prices.iloc[train_prices:len(prices),:] as you did in train_data.
So, in order to get two dataframe outputs for train and test:
train_prices=int(len(data)*0.65)
test_prices=len(data)-train_prices
train_data,test_data=data.iloc[0:train_prices,:],data.iloc[train_prices:len(data),:]
Btw, if you want to do some ML, check out sklearn's train_test_split method.
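For completeness, a minimal sketch of that method on the same iris frame loaded above (test_size=0.35 mirrors your 65/35 split, and shuffle=False keeps the original row order like the iloc slicing does):
from sklearn.model_selection import train_test_split
# returns two DataFrames, not single values
train_data, test_data = train_test_split(data, test_size=0.35, shuffle=False)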

Is there a simple way to output the number of rows, including missing values for each group, without aggregating them?

I just want to know the number of rows per group, whenever I want, for whatever variables and groups I choose. What I want is to produce the following 'n_groupby' column with code that is as short and simple as possible. Since it is a row count, it should include rows with missing values. Counting without missing values is easy with 'count'.
sl sw pl pw species n_groupby
0 5.1 3.5 1.4 0.2 setosa 50
1 NaN NaN NaN NaN setosa 50
.. ... ... ... ... ... ...
149 5.9 3.0 5.1 1.8 virginica 50
from sklearn.datasets import load_iris
iris = load_iris()
df = pd.DataFrame(iris.data, columns=['sl','sw','pl','pw']).assign(species=iris.target_names[iris.target])
df.iloc[1,0:4] = None
sl sw pl pw species
0 5.1 3.5 1.4 0.2 setosa
1 NaN NaN NaN NaN setosa
.. ... ... ... ... ...
149 5.9 3.0 5.1 1.8 virginica
#This does not work.
df.assign(
    n_groupby = df.groupby('species').transform('size')
)
#This is too long.
df.merge(df.groupby('species',as_index=False).size(), how='left').rename(columns={'size':'n_groupby'})
I believe you want a "cleaner" version of what you're already doing up there. df.isna().sum() basically lists the number of NaN values per column in a Series format that's easy to read.
df.isna().sum()
Thanks :)
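For reference, a shorter variant along the lines of the transform attempt in the question (a sketch; selecting a single column first makes transform('size') return a Series, and size counts rows with missing values too):
df.assign(n_groupby=df.groupby('species')['species'].transform('size'))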

What is the pandas equivalent of the R function %in%?

What is the pandas equivalent of the R function %in% ?
When we have a dataframe in R, we can check for which rows a column contains strings from a list using the operator %in% which gives a Boolean output.
Concrete example: If we want to check which rows the strings "setosa" and "virginica" are in the column species of the iris dataset, we can simply use the following code:
iris[, c('species')] %in% c('setosa', 'virginica')
How can we do the same thing in python for a pandas DataFrame?
The reason I want to do this is I want to filter the dataset and only keep rows with the species "setosa" or "virginica".
%in% in R is actually is.element:
r$> 1 %in% 1:2
[1] TRUE
r$> is.element(1, 1:2)
[1] TRUE
datar has ported some R functions to Python:
>>> from datar.all import c, f, is_element, filter
>>> from datar.datasets import iris
>>>
>>> iris >> filter(is_element(f.Species, c('setosa', 'virginica')))
Sepal_Length Sepal_Width Petal_Length Petal_Width Species
<float64> <float64> <float64> <float64> <object>
0 5.1 3.5 1.4 0.2 setosa
1 4.9 3.0 1.4 0.2 setosa
2 4.7 3.2 1.3 0.2 setosa
3 4.6 3.1 1.5 0.2 setosa
4 5.0 3.6 1.4 0.2 setosa
.. ... ... ... ... ...
95 6.7 3.0 5.2 2.3 virginica
96 6.3 2.5 5.0 1.9 virginica
97 6.5 3.0 5.2 2.0 virginica
98 6.2 3.4 5.4 2.3 virginica
99 5.9 3.0 5.1 1.8 virginica
[100 rows x 5 columns]
I am the author of the datar package. Feel free to submit issues if you have any questions.
The pandas package has the .isin() method, which is equivalent to the %in% operator in R. As pointed out by @rhug123, .isin can be applied directly to a Series. I have made the corresponding change to the code below.
Your R code above can be implemented in python using pandas as follows - assuming that iris is a pandas DataFrame:
iris.species.isin(['setosa', 'virginica'])
You can then filter your DataFrame and only keep the rows with species 'setosa' or 'virginica' as follows:
iris[iris.species.isin(['setosa', 'virginica'])]
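As a small follow-up, the complement (R's !(x %in% y)) is just the ~ operator in pandas:
iris[~iris.species.isin(['setosa', 'virginica'])]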

How do I generate (train, test) folds where each test fold contains features generated by the corresponding train fold?

For example, consider the iris dataset and suppose my goal is to predict sepal length.
import pandas as pd
import numpy as np
from sklearn import datasets
iris = datasets.load_iris()
train = pd.DataFrame(iris.data, columns=iris.feature_names)
train['Species'] = iris.target
train
sepal length (cm) sepal width (cm) petal length (cm) petal width (cm) Species
0 5.1 3.5 1.4 0.2 0
1 4.9 3.0 1.4 0.2 0
.. ... ... ... ... ...
148 6.2 3.4 5.4 2.3 2
149 5.9 3.0 5.1 1.8 2
I'd like to split the data into 5 folds (train1, test1), (train2, test2), ... Now consider the fold (train1, test1). Using train1, I want to measure the average sepal width, petal length, and petal width per Species, and then map those averages to test1 (based on Species). And then I want to do this for the remaining folds.
Ultimately I'd like to do this in a way that allows me to use GridSearchCV to train a RandomForestRegressor using those same folds. Is there a convenient method of doing this with scikit-learn?
I realize this seems silly for this example but I think it makes sense for my real dataset.
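In case it helps, here is a minimal sketch (my own, not a confirmed scikit-learn recipe) of the per-fold feature generation described above, using KFold; column names follow the iris example and the _mean suffix is just illustrative:
from sklearn.model_selection import KFold

kf = KFold(n_splits=5, shuffle=True, random_state=0)
feature_cols = ['sepal width (cm)', 'petal length (cm)', 'petal width (cm)']

for train_idx, test_idx in kf.split(train):
    tr, te = train.iloc[train_idx], train.iloc[test_idx]
    # per-Species averages computed on the training fold only
    means = tr.groupby('Species')[feature_cols].mean().add_suffix('_mean')
    # map those averages onto the test fold via the Species key
    te = te.join(means, on='Species')
Wiring this into GridSearchCV would still need a custom transformer so the mapping happens inside each fold, which is the part the question is really asking about.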

What is the difference between pandas agg and apply function?

I can't figure out the difference between Pandas .aggregate and .apply functions.
Take the following as an example: I load a dataset, do a groupby, define a simple function,
and either use .agg or .apply.
As you can see, the print statement within my function results in the same output
after using .agg and .apply. The result, on the other hand, is different. Why is that?
import pandas as pd
iris = pd.read_csv('iris.csv')
by_species = iris.groupby('Species')
def f(x):
    print(type(x))
    print(x.head(3))
    return 1
Using apply:
by_species.apply(f)
#<class 'pandas.core.frame.DataFrame'>
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#0 5.1 3.5 1.4 0.2 setosa
#1 4.9 3.0 1.4 0.2 setosa
#2 4.7 3.2 1.3 0.2 setosa
#<class 'pandas.core.frame.DataFrame'>
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#0 5.1 3.5 1.4 0.2 setosa
#1 4.9 3.0 1.4 0.2 setosa
#2 4.7 3.2 1.3 0.2 setosa
#<class 'pandas.core.frame.DataFrame'>
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#50 7.0 3.2 4.7 1.4 versicolor
#51 6.4 3.2 4.5 1.5 versicolor
#52 6.9 3.1 4.9 1.5 versicolor
#<class 'pandas.core.frame.DataFrame'>
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#100 6.3 3.3 6.0 2.5 virginica
#101 5.8 2.7 5.1 1.9 virginica
#102 7.1 3.0 5.9 2.1 virginica
#Out[33]:
#Species
#setosa 1
#versicolor 1
#virginica 1
#dtype: int64
Using agg:
by_species.agg(f)
#<class 'pandas.core.frame.DataFrame'>
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#0 5.1 3.5 1.4 0.2 setosa
#1 4.9 3.0 1.4 0.2 setosa
#2 4.7 3.2 1.3 0.2 setosa
#<class 'pandas.core.frame.DataFrame'>
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#50 7.0 3.2 4.7 1.4 versicolor
#51 6.4 3.2 4.5 1.5 versicolor
#52 6.9 3.1 4.9 1.5 versicolor
#<class 'pandas.core.frame.DataFrame'>
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#100 6.3 3.3 6.0 2.5 virginica
#101 5.8 2.7 5.1 1.9 virginica
#102 7.1 3.0 5.9 2.1 virginica
#Out[34]:
# Sepal.Length Sepal.Width Petal.Length Petal.Width
#Species
#setosa 1 1 1 1
#versicolor 1 1 1 1
#virginica 1 1 1 1
apply applies the function to each group (your Species). Your function returns 1, so you end up with 1 value for each of 3 groups.
agg aggregates each column (feature) for each group, so you end up with one value per column per group.
Do read the groupby docs, they're quite helpful. There are also a bunch of tutorials floating around the web.
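A tiny sketch of that shape difference, reusing by_species from the question:
by_species.apply(lambda g: 1)    # g is the whole group DataFrame -> one value per group
by_species.agg(lambda col: 1)    # col is one column of a group   -> one value per column per group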
(Note: these comparisons are relevant for DataFrameGroupBy objects)
Some plausible advantages of using .agg() compared to .apply() for DataFrameGroupBy objects would be:
.agg() gives the flexibility of applying multiple functions at once, or passing a list of functions to each column.
It also allows applying different functions at once to different columns of the dataframe.
That means you have a lot of control over each column with each operation.
Here is the link for more details: http://pandas.pydata.org/pandas-docs/version/0.13.1/groupby.html
The apply function, however, is limited to applying one function to each column of the dataframe at a time, so you might have to call apply repeatedly to run different operations on the same column.
Here are some example comparisons of .apply() vs .agg() for DataFrameGroupBy objects:
Given the following dataframe:
In [261]: df = pd.DataFrame({"name":["Foo", "Baar", "Foo", "Baar"], "score_1":[5,10,15,10], "score_2" :[10,15,10,25], "score_3" : [10,20,30,40]})
In [262]: df
Out[262]:
name score_1 score_2 score_3
0 Foo 5 10 10
1 Baar 10 15 20
2 Foo 15 10 30
3 Baar 10 25 40
Let's first see the operations using .apply():
In [263]: df.groupby(["name", "score_1"])["score_2"].apply(lambda x : x.sum())
Out[263]:
name score_1
Baar 10 40
Foo 5 10
15 10
Name: score_2, dtype: int64
In [264]: df.groupby(["name", "score_1"])["score_2"].apply(lambda x : x.min())
Out[264]:
name score_1
Baar 10 15
Foo 5 10
15 10
Name: score_2, dtype: int64
In [265]: df.groupby(["name", "score_1"])["score_2"].apply(lambda x : x.mean())
Out[265]:
name score_1
Baar 10 20.0
Foo 5 10.0
15 10.0
Name: score_2, dtype: float64
Now look at similar operations done with a single .agg() call:
In [276]: df.groupby(["name", "score_1"]).agg({"score_3" :[np.sum, np.min, np.mean, np.max], "score_2":lambda x : x.mean()})
Out[276]:
score_2 score_3
<lambda> sum amin mean amax
name score_1
Baar 10 20 60 20 30 40
Foo 5 10 10 10 10 10
15 10 30 30 30 30
So, .agg() can be really handy for handling DataFrameGroupBy objects, as compared to .apply(). But if you are handling only plain dataframe objects and not DataFrameGroupBy objects, then apply() can be very useful, since apply() can apply a function along any axis of the dataframe
(for example, axis=0 implies a column-wise operation with .apply(), which is the default, and axis=1 implies a row-wise operation when dealing with plain dataframe objects).
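A small sketch of that axis behaviour on the plain dataframe df defined earlier in this answer:
df[["score_1", "score_2", "score_3"]].apply(sum)          # axis=0 (default): one result per column
df[["score_1", "score_2", "score_3"]].apply(sum, axis=1)  # axis=1: one result per row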
The main difference between apply and aggregate is:
apply() -
cannot be applied to multiple groups together with a dict of per-column functions
For apply() we have to use get_group() first
ERROR: iris.groupby('Species').apply({'Sepal.Length':['min','max'],'Sepal.Width':['mean','min']}) # this throws an error
Works fine: iris.groupby('Species').get_group('setosa').apply({'Sepal.Length':['min','max'],'Sepal.Width':['mean','min']})
#because the functions are applied to one dataframe at a time
agg() -
can be applied to multiple groups together
For agg() we do not have to use get_group()
iris.groupby('Species').agg({'Sepal.Length':['min','max'],'Sepal.Width':['mean','min']})
iris.groupby('Species').get_group('versicolor').agg({'Sepal.Length':['min','max'],'Sepal.Width':['mean','min']})
Refer here. Let me requote the same statement here:
Some operations on the grouped data might not fit into either the aggregate or transform categories. Or, you may simply want GroupBy to infer how to combine the results. For these, use the apply function, which can be substituted for both aggregate and transform in many standard use cases. However, apply can handle some exceptional use cases, for example:
More details with example are presented in the pandas documentation (link provided above)
Kindly refer to this great write-up from @Ted Petrou and @Eric O. Lebigot. The logic they used to investigate the difference between apply and transform can be reapplied to apply and agg.
Then, to understand how axis works, refer to this link.
These three links should help in getting better clarity on how they are different.
When using apply on a groupby, I have found that .apply will return the grouped columns. There is a note in the documentation (pandas.pydata.org/pandas-docs/stable/groupby.html):
"...Thus the grouped column(s) may be included in the output as well as set the indices."
.aggregate will not return the grouped columns.
Besides everything mentioned above, another difference that I think no one has highlighted yet is that apply can be used to apply a function to a group of columns together, while agg only applies a function to one column at a time. An example:
Let's use the same dataframe as in the earlier example:
d = pd.DataFrame({"name":["Foo", "Baar", "Foo", "Baar"], "score_1":[5,10,15,10], "score_2" :[10,15,10,25], "score_3" : [10,20,30,40]})
Here, apply uses a function that sums up the values of a group's columns together (selecting the numeric score columns first, so the sum is well defined):
d.groupby(["name", "score_1"])[["score_2", "score_3"]].apply(lambda x: x.values.sum())
