What is the pandas equivalent of the R function %in% ?
When we have a dataframe in R, we can check for which rows a column contains strings from a list using the operator %in% which gives a Boolean output.
Concrete example: to check which rows of the iris dataset have "setosa" or "virginica" in the column species, we can simply use the following code:
iris[, c('species')] %in% c('setosa', 'virginica')
How can we do the same thing in python for a pandas DataFrame?
The reason I want to do this is I want to filter the dataset and only keep rows with the species "setosa" or "virginica".
%in% in R is actually is.element:
r$> 1 %in% 1:2
[1] TRUE
r$> is.element(1, 1:2)
[1] TRUE
datar has ported some functions in R to python:
>>> from datar.all import c, f, is_element, filter
>>> from datar.datasets import iris
>>>
>>> iris >> filter(is_element(f.Species, c('setosa', 'virginica')))
Sepal_Length Sepal_Width Petal_Length Petal_Width Species
<float64> <float64> <float64> <float64> <object>
0 5.1 3.5 1.4 0.2 setosa
1 4.9 3.0 1.4 0.2 setosa
2 4.7 3.2 1.3 0.2 setosa
3 4.6 3.1 1.5 0.2 setosa
4 5.0 3.6 1.4 0.2 setosa
.. ... ... ... ... ...
95 6.7 3.0 5.2 2.3 virginica
96 6.3 2.5 5.0 1.9 virginica
97 6.5 3.0 5.2 2.0 virginica
98 6.2 3.4 5.4 2.3 virginica
99 5.9 3.0 5.1 1.8 virginica
[100 rows x 5 columns]
I am the author of the datar package. Feel free to submit issues if you have any questions.
A pandas Series has the .isin() method, which is equivalent to the %in% operator in R. As pointed out by @rhug123, .isin() can be applied directly to a Series; it is not part of the .str accessor. I have changed the code below accordingly.
Your R code above can be implemented in python using pandas as follows - assuming that iris is a pandas DataFrame:
iris.species.isin(['setosa', 'virginica'])
You can then filter your DataFrame and only keep the rows with species 'setosa' or 'virginica' as follows:
iris[iris.species.isin(['setosa', 'virginica'])]
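As a self-contained illustration of the filter above, here is a minimal sketch on a small hand-made frame (the frame and its values are made up for illustration and are not the real iris data). It also uses `~` to negate the mask, which plays the role of `!(x %in% y)` in R.

```python
import pandas as pd

# Tiny stand-in for the iris data (illustrative values only)
iris = pd.DataFrame({
    "sepal_length": [5.1, 7.0, 6.3, 4.9],
    "species": ["setosa", "versicolor", "virginica", "setosa"],
})

# Boolean mask: True where species is one of the listed values
mask = iris["species"].isin(["setosa", "virginica"])

kept = iris[mask]       # rows whose species is in the list
dropped = iris[~mask]   # rows whose species is not in the list

print(kept)
print(dropped)
```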
Need
I am trying to export a dataframe to a Parquet file, which will be consumed later in the pipeline by something that is not Python or Pandas. (Azure Data Factory)
When I ingest the Parquet file later in the flow, it cannot recognize datetime64[ns]. I would rather just use "vanilla" Python datetime.datetime.
Problem
But I cannot manage to do this. The problem is that Pandas forces any "datetime-like" object into datetime64[ns] once it is back in a dataframe or series.
Small Example
For instance, assume the iris dataset with a "timestamp" column:
>>> df.head()
sepal length (cm) sepal width (cm) petal length (cm) petal width (cm) class timestamp
0 5.1 3.5 1.4 0.2 setosa 2021-02-19 15:07:24.719272
1 4.9 3.0 1.4 0.2 setosa 2021-02-19 15:07:24.719272
2 4.7 3.2 1.3 0.2 setosa 2021-02-19 15:07:24.719272
3 4.6 3.1 1.5 0.2 setosa 2021-02-19 15:07:24.719272
4 5.0 3.6 1.4 0.2 setosa 2021-02-19 15:07:24.719272
>>> df.dtypes
sepal length (cm) float64
sepal width (cm) float64
petal length (cm) float64
petal width (cm) float64
class category
timestamp datetime64[ns]
dtype: object
I can convert a value to a "normal Python datetime":
>>> df.timestamp[1]
Timestamp('2021-02-19 15:07:24.719272')
>>> type(df.timestamp[1])
<class 'pandas._libs.tslibs.timestamps.Timestamp'>
>>> df.timestamp[1].to_pydatetime()
datetime.datetime(2021, 2, 19, 15, 7, 24, 719272)
>>> type(df.timestamp[1].to_pydatetime())
<class 'datetime.datetime'>
But I cannot "keep" it in that type, when I convert the entire column / series:
>>> df['ts2'] = df.timestamp.apply(lambda x: x.to_pydatetime())
>>> df.dtypes
sepal length (cm) float64
sepal width (cm) float64
petal length (cm) float64
petal width (cm) float64
class category
timestamp datetime64[ns]
ts2 datetime64[ns]
Possible Solutions
I looked to see if there were anything I could do to "dumb down" the dataframe column and make its datetimes less precise. But I cannot see anything. Nor can I see an option to specify column data types upon export via the df.to_parquet() method.
Is there a way to create a plain Python datetime.datetime column (not the Numpy/Pandas datetime64[ns] column) in a Pandas dataframe?
Try to force the dtype='object' when you use to_pydatetime:
df['ts'] = pd.Series(df.timestamp.dt.to_pydatetime(),dtype='object')
df.loc[0,'ts']
Output:
datetime.datetime(2021, 2, 19, 15, 7, 24, 719272)
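A self-contained variant of the same idea, as a sketch: it builds the list of plain datetime.datetime objects with a comprehension and passes the frame's index explicitly, so it does not rely on the default integer index. The one-column frame here is made up for illustration. Whether the Parquet writer then emits a type the downstream consumer accepts is a separate question that this sketch does not address.

```python
import datetime
import pandas as pd

# Illustrative frame with a datetime64 column
df = pd.DataFrame(
    {"timestamp": pd.to_datetime(["2021-02-19 15:07:24.719272"] * 3)}
)

# Convert each pandas Timestamp to a plain datetime.datetime, then wrap the
# list in an object-dtype Series so pandas does not infer datetime64 again.
py_datetimes = [ts.to_pydatetime() for ts in df["timestamp"]]
df["ts"] = pd.Series(py_datetimes, index=df.index, dtype="object")

print(df.dtypes)
print(type(df.loc[0, "ts"]))
```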
I would like to know whether there is a way to run a one-to-one linear regression (one independent variable vs. the target variable) and get its p-value, its R2 value, and a plot that shows how linear the relationship is. I want to run this separately for every independent variable. As far as I know, the Python statsmodels library can run an OLS regression, but it runs on the whole dataset and gives a single result, with no plots to understand it visually.
To very quickly visualize the regression you can try the below using sns:
import numpy as np
from sklearn.datasets import load_iris
import pandas as pd
import seaborn as sns
data = load_iris()
df = pd.DataFrame(data.data, columns=['sepal.length','sepal.width','petal.length','petal.width'])
df = pd.melt(df,id_vars='sepal.length')
df[:5]
sepal.length variable value
0 5.1 sepal.width 3.5
1 4.9 sepal.width 3.0
2 4.7 sepal.width 3.2
3 4.6 sepal.width 3.1
4 5.0 sepal.width 3.6
sns.lmplot(x='sepal.length', y='value', data=df, col='variable',
           col_wrap=2, aspect=0.6, height=4, palette='coolwarm')
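The plot alone does not report the p-value or R2 that the question asks for. One possible way to get them per variable is to loop over the columns and call scipy.stats.linregress on each one. This is a sketch under assumptions: the data are synthetic, and the column names x1, x2, y are placeholders, not columns from the iris frame.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic data for illustration only: y depends on x1 but not on x2
rng = np.random.default_rng(0)
df = pd.DataFrame({"x1": rng.normal(size=50), "x2": rng.normal(size=50)})
df["y"] = 2.0 * df["x1"] + rng.normal(scale=0.5, size=50)

rows = []
for col in ["x1", "x2"]:
    # Simple one-variable least-squares fit of y on this column
    fit = stats.linregress(df[col], df["y"])
    rows.append({
        "feature": col,
        "slope": fit.slope,
        "intercept": fit.intercept,
        "r2": fit.rvalue ** 2,   # squared correlation = R2 for a simple fit
        "p_value": fit.pvalue,   # two-sided test of slope == 0
    })

summary = pd.DataFrame(rows).set_index("feature")
print(summary)
```

The resulting table has one row per independent variable, which can be read alongside the per-variable panels of the lmplot.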
For example, consider the iris dataset and suppose my goal is to predict sepal length.
import pandas as pd
import numpy as np
from sklearn import datasets
iris = datasets.load_iris()
train = pd.DataFrame(iris.data, columns=iris.feature_names)
train['Species'] = iris.target
train
sepal length (cm) sepal width (cm) petal length (cm) petal width (cm) Species
0 5.1 3.5 1.4 0.2 0
1 4.9 3.0 1.4 0.2 0
.. ... ... ... ... ...
148 6.2 3.4 5.4 2.3 2
149 5.9 3.0 5.1 1.8 2
I'd like to split the data into 5 folds (train1, test1), (train2, test2), ... Now consider the fold (train1, test1). Using train1, I want to measure the average sepal width, petal length, and petal width per Species, and then map those averages to test1 (based on Species). And then I want to do this for the remaining folds.
Ultimately I'd like to do this in a way that allows me to use GridSearchCV to train a RandomForestRegressor using those same folds. Is there a convenient method of doing this with scikit-learn?
I realize this seems silly for this example but I think it makes sense for my real dataset.
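To make the per-fold step concrete, here is a minimal sketch of the manual version: split with KFold, compute the per-Species means on the training part only, and join them onto the held-out part. The frame is synthetic and the column names are simplified placeholders. It does not show the GridSearchCV integration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

# Synthetic stand-in for the iris frame (illustrative values only)
rng = np.random.default_rng(0)
train = pd.DataFrame({
    "sepal_width": rng.normal(3.0, 0.4, size=30),
    "petal_length": rng.normal(3.8, 1.5, size=30),
    "Species": np.repeat([0, 1, 2], 10),
})
feature_cols = ["sepal_width", "petal_length"]

folds = []
kf = KFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, test_idx in kf.split(train):
    tr, te = train.iloc[train_idx], train.iloc[test_idx]
    # Per-species means computed on the training part of this fold only
    means = tr.groupby("Species")[feature_cols].mean().add_suffix("_mean")
    # Map those means onto the held-out rows by Species
    te_encoded = te.join(means, on="Species")
    folds.append((tr, te_encoded))
```

GridSearchCV's cv argument accepts an iterable of (train, test) index arrays, so the same splits could be reused there. The per-fold encoding would then probably need to live in a custom transformer inside a Pipeline.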
I can't figure out the difference between Pandas .aggregate and .apply functions.
Take the following as an example: I load a dataset, do a groupby, define a simple function,
and use either .agg or .apply.
As you may see, the printing statement within my function results in the same output
after using .agg and .apply. The result, on the other hand is different. Why is that?
import pandas as pd
iris = pd.read_csv('iris.csv')
by_species = iris.groupby('Species')
def f(x):
    # Show what each call receives
    print(type(x))
    print(x.head(3))
    return 1
Using apply:
by_species.apply(f)
#<class 'pandas.core.frame.DataFrame'>
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#0 5.1 3.5 1.4 0.2 setosa
#1 4.9 3.0 1.4 0.2 setosa
#2 4.7 3.2 1.3 0.2 setosa
#<class 'pandas.core.frame.DataFrame'>
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#0 5.1 3.5 1.4 0.2 setosa
#1 4.9 3.0 1.4 0.2 setosa
#2 4.7 3.2 1.3 0.2 setosa
#<class 'pandas.core.frame.DataFrame'>
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#50 7.0 3.2 4.7 1.4 versicolor
#51 6.4 3.2 4.5 1.5 versicolor
#52 6.9 3.1 4.9 1.5 versicolor
#<class 'pandas.core.frame.DataFrame'>
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#100 6.3 3.3 6.0 2.5 virginica
#101 5.8 2.7 5.1 1.9 virginica
#102 7.1 3.0 5.9 2.1 virginica
#Out[33]:
#Species
#setosa 1
#versicolor 1
#virginica 1
#dtype: int64
Using agg
by_species.agg(f)
#<class 'pandas.core.frame.DataFrame'>
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#0 5.1 3.5 1.4 0.2 setosa
#1 4.9 3.0 1.4 0.2 setosa
#2 4.7 3.2 1.3 0.2 setosa
#<class 'pandas.core.frame.DataFrame'>
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#50 7.0 3.2 4.7 1.4 versicolor
#51 6.4 3.2 4.5 1.5 versicolor
#52 6.9 3.1 4.9 1.5 versicolor
#<class 'pandas.core.frame.DataFrame'>
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#100 6.3 3.3 6.0 2.5 virginica
#101 5.8 2.7 5.1 1.9 virginica
#102 7.1 3.0 5.9 2.1 virginica
#Out[34]:
# Sepal.Length Sepal.Width Petal.Length Petal.Width
#Species
#setosa 1 1 1 1
#versicolor 1 1 1 1
#virginica 1 1 1 1
apply applies the function to each group (your Species). Your function returns 1, so you end up with 1 value for each of 3 groups.
agg aggregates each column (feature) for each group, so you end up with one value per column per group.
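The difference in output shape can be reproduced on a small frame. This is a sketch with made-up values, and the helper name `one` is only for illustration. The value columns are selected explicitly so the result does not depend on how a given pandas version treats the grouping column.

```python
import pandas as pd

# Small illustrative frame (not the real iris data)
iris = pd.DataFrame({
    "Sepal.Length": [5.1, 4.9, 7.0, 6.4],
    "Sepal.Width": [3.5, 3.0, 3.2, 3.2],
    "Species": ["setosa", "setosa", "versicolor", "versicolor"],
})

def one(x):
    # Ignores its input and always returns the same scalar
    return 1

by_species = iris.groupby("Species")[["Sepal.Length", "Sepal.Width"]]

applied = by_species.apply(one)   # one value per group -> Series
aggregated = by_species.agg(one)  # one value per column per group -> DataFrame

print(applied)
print(aggregated)
```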
Do read the groupby docs, they're quite helpful. There are also a bunch of tutorials floating around the web.
(Note: these comparisons are relevant for DataFrameGroupBy objects.)
Some plausible advantages of using .agg() over .apply() for DataFrameGroupBy objects:
.agg() lets you apply several functions at once, or pass a list of functions for each column.
It also lets you apply different functions to different columns of the dataframe in one call.
That gives you fine-grained control over which operation runs on which column.
Here is the link for more details: http://pandas.pydata.org/pandas-docs/version/0.13.1/groupby.html
By contrast, apply takes a single function per call, so you may have to call it repeatedly to run different operations on the same column.
Here are some example comparisons for .apply() vs .agg() for DataframeGroupBy objects :
Given the following dataframe:
In [261]: df = pd.DataFrame({"name":["Foo", "Baar", "Foo", "Baar"], "score_1":[5,10,15,10], "score_2" :[10,15,10,25], "score_3" : [10,20,30,40]})
In [262]: df
Out[262]:
name score_1 score_2 score_3
0 Foo 5 10 10
1 Baar 10 15 20
2 Foo 15 10 30
3 Baar 10 25 40
Let's first see the operations using .apply():
In [263]: df.groupby(["name", "score_1"])["score_2"].apply(lambda x : x.sum())
Out[263]:
name score_1
Baar 10 40
Foo 5 10
15 10
Name: score_2, dtype: int64
In [264]: df.groupby(["name", "score_1"])["score_2"].apply(lambda x : x.min())
Out[264]:
name score_1
Baar 10 15
Foo 5 10
15 10
Name: score_2, dtype: int64
In [265]: df.groupby(["name", "score_1"])["score_2"].apply(lambda x : x.mean())
Out[265]:
name score_1
Baar 10 20.0
Foo 5 10.0
15 10.0
Name: score_2, dtype: float64
Now look at the same kind of operations done in a single call with .agg():
In [276]: df.groupby(["name", "score_1"]).agg({"score_3" :[np.sum, np.min, np.mean, np.max], "score_2":lambda x : x.mean()})
Out[276]:
score_2 score_3
<lambda> sum amin mean amax
name score_1
Baar 10 20 60 20 30 40
Foo 5 10 10 10 10 10
15 10 30 30 30 30
So .agg() can be really handy for DataFrameGroupBy objects compared to .apply(). But if you are working with plain DataFrame objects rather than DataFrameGroupBy objects, apply() can be very useful, because it can apply a function along either axis of the dataframe.
(For example, axis=0, the default, passes each column to the function, and axis=1 passes each row, when working with plain DataFrame objects.)
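For reference, the same kind of multi-function aggregation can also be written with string function names instead of NumPy functions, so it does not need numpy to be imported. This is a minimal sketch on the frame defined above. The variable name `summary` is only for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Foo", "Baar", "Foo", "Baar"],
    "score_1": [5, 10, 15, 10],
    "score_2": [10, 15, 10, 25],
    "score_3": [10, 20, 30, 40],
})

# Several aggregations for score_3 and a single one for score_2, in one call
summary = df.groupby(["name", "score_1"]).agg({
    "score_3": ["sum", "min", "mean", "max"],
    "score_2": "mean",
})
print(summary)
```

The result has two-level column labels, such as ("score_3", "sum"), because at least one column was given a list of functions.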
The main difference between apply and aggregate is:
apply()-
cannot be applied to multiple groups together
For apply() - we have to use get_group()
ERROR: iris.groupby('Species').apply({'Sepal.Length':['min','max'],'Sepal.Width':['mean','min']})  # throws an error
Works fine: iris.groupby('Species').get_group('setosa').apply({'Sepal.Length':['min','max'],'Sepal.Width':['mean','min']})
# because the functions are applied to a single data frame
agg()-
can be applied to multiple groups together
For agg() - we do not have to use get_group()
iris.groupby('Species').agg({'Sepal.Length':['min','max'],'Sepal.Width':['mean','min']})
iris.groupby('Species').get_group('versicolor').agg({'Sepal.Length':['min','max'],'Sepal.Width':['mean','min']})
Refer to the pandas groupby documentation. Let me quote the relevant statement here:
Some operations on the grouped data might not fit into either the aggregate or transform categories. Or, you may simply want GroupBy to infer how to combine the results. For these, use the apply function, which can be substituted for both aggregate and transform in many standard use cases. However, apply can handle some exceptional use cases, for example:
More details, with examples, are presented in the pandas documentation (link provided above).
Kindly refer to the great write-up by @Ted Petrou and @Eric O Lebigot. The logic they use to investigate the difference between apply and transform can be reused to compare apply and agg.
Then, to understand how axis works, refer to this link.
These three links should help in getting better clarity on how they differ.
When using apply to a groupby I have encountered that .apply will return the grouped columns. There is a note in the documentation (pandas.pydata.org/pandas-docs/stable/groupby.html):
"...Thus the grouped columns(s) may be included in the output as well as set the indices."
.aggregate will not return the grouped columns.
Besides everything others have mentioned, another difference that I think no one has highlighted yet is that apply can be used to apply a function to a group of columns together, whereas agg applies a function to each column separately. An example:
Let's use the same dataframe as the other answer:
d = pd.DataFrame({"name":["Foo", "Baar", "Foo", "Baar"], "score_1":[5,10,15,10], "score_2" :[10,15,10,25], "score_3" : [10,20,30,40]})
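Here, apply uses a function that sums up the values of all the numeric columns of a group together. The score columns are selected explicitly so that the string column name does not end up in the sum:
d.groupby(["name", "score_1"])[["score_2", "score_3"]].apply(lambda x: x.values.sum())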
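Another case where the function needs several columns of the same group at once is a weighted average. The sketch below weights score_2 by score_3. The helper name `weighted_score` is hypothetical and only for illustration. A per-column agg function would see only one of the two columns at a time.

```python
import numpy as np
import pandas as pd

d = pd.DataFrame({
    "name": ["Foo", "Baar", "Foo", "Baar"],
    "score_1": [5, 10, 15, 10],
    "score_2": [10, 15, 10, 25],
    "score_3": [10, 20, 30, 40],
})

def weighted_score(group):
    # Needs two columns of the same group at once
    return np.average(group["score_2"], weights=group["score_3"])

result = (
    d.groupby(["name", "score_1"])[["score_2", "score_3"]]
     .apply(weighted_score)
)
print(result)
```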