What is the pandas equivalent of the R function %in%? - python

What is the pandas equivalent of the R function %in% ?
When we have a dataframe in R, we can check for which rows a column contains strings from a list using the operator %in% which gives a Boolean output.
Concrete example: to check which rows of the iris dataset have "setosa" or "virginica" in the column species, we can simply use the following code:
iris[, c('species')] %in% c('setosa', 'virginica')
How can we do the same thing in python for a pandas DataFrame?
The reason I want to do this is I want to filter the dataset and only keep rows with the species "setosa" or "virginica".

%in% in R is actually is.element:
r$> 1 %in% 1:2
[1] TRUE
r$> is.element(1, 1:2)
[1] TRUE
datar has ported some functions in R to python:
>>> from datar.all import c, f, is_element, filter
>>> from datar.datasets import iris
>>>
>>> iris >> filter(is_element(f.Species, c('setosa', 'virginica')))
Sepal_Length Sepal_Width Petal_Length Petal_Width Species
<float64> <float64> <float64> <float64> <object>
0 5.1 3.5 1.4 0.2 setosa
1 4.9 3.0 1.4 0.2 setosa
2 4.7 3.2 1.3 0.2 setosa
3 4.6 3.1 1.5 0.2 setosa
.. ... ... ... ... ...
4 5.0 3.6 1.4 0.2 setosa
95 6.7 3.0 5.2 2.3 virginica
96 6.3 2.5 5.0 1.9 virginica
97 6.5 3.0 5.2 2.0 virginica
98 6.2 3.4 5.4 2.3 virginica
99 5.9 3.0 5.1 1.8 virginica
[100 rows x 5 columns]
I am the author of the datar package. Feel free to submit issues if you have any questions.

A pandas Series has an .isin() method, which is equivalent to the %in% operator in R. As pointed out by @rhug123, .isin is applied directly to the Series (it is not part of the .str accessor). I have made the corresponding change to the code below.
Your R code above can be implemented in python using pandas as follows - assuming that iris is a pandas DataFrame:
iris.species.isin(['setosa', 'virginica'])
You can then filter your DataFrame and only keep the rows with species 'setosa' or 'virginica' as follows:
iris[iris.species.isin(['setosa', 'virginica'])]
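The two steps above can be tried on a small stand-in frame. This is only a sketch: the four-row `iris` frame and the column name `sepal_length` are made up for illustration and are not the real dataset.

```python
import pandas as pd

# Hypothetical miniature stand-in for the iris data.
iris = pd.DataFrame({
    "sepal_length": [5.1, 7.0, 6.3, 4.9],
    "species": ["setosa", "versicolor", "virginica", "setosa"],
})

# Boolean mask, analogous to `iris$species %in% c("setosa", "virginica")` in R.
mask = iris["species"].isin(["setosa", "virginica"])

# Keep only the matching rows; `iris[~mask]` would keep the complement instead.
kept = iris[mask]
print(kept)
```

Bracket access (`iris["species"]`) is used here instead of attribute access because it also works for column names that are not valid Python identifiers.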

Related

Is there a simple way to output the number of rows, including missing values for each group, without aggregating them?

I just want to know the number of rows, whenever I want, with whatever variables and groups I want. What I want to do is to write the following 'n_groupby' column in as short and simple code as possible. Of course, it is the number of rows, so it counts rows even if they contain missing values. Counting without missing values is really easy with 'count'.
sl sw pl pw species n_groupby
0 5.1 3.5 1.4 0.2 setosa 50
1 NaN NaN NaN NaN setosa 50
.. ... ... ... ... ... ...
149 5.9 3.0 5.1 1.8 virginica 50
import pandas as pd
from sklearn.datasets import load_iris
iris = load_iris()
df = pd.DataFrame(iris.data, columns=['sl','sw','pl','pw']).assign(species=iris.target_names[iris.target])
df.iloc[1,0:4] = None
sl sw pl pw species
0 5.1 3.5 1.4 0.2 setosa
1 NaN NaN NaN NaN setosa
.. ... ... ... ... ...
149 5.9 3.0 5.1 1.8 virginica
#This does not work.
df.assign(
n_groupby = df.groupby('species').transform('size')
)
#This is too long.
df.merge(df.groupby('species',as_index=False).size(), how='left').rename(columns={'size':'n_groupby'})
I believe that you want a "cleaner" version of what you're already doing up there. df.isna().sum() lists the number of NaN values in each column as a Series, which is easy to read.
df.isna().sum()
Thanks :)
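For the per-group row count itself, one option that may be short enough is to call transform('size') on a single selected column instead of on the whole grouped frame. The sketch below uses a made-up four-row frame rather than the real iris data; `size` counts rows regardless of missing values, unlike `count`.

```python
import numpy as np
import pandas as pd

# Hypothetical small frame with one missing measurement.
df = pd.DataFrame({
    "sl": [5.1, np.nan, 7.0, 6.3],
    "species": ["setosa", "setosa", "versicolor", "virginica"],
})

# Selecting one column gives a SeriesGroupBy, so transform returns a Series
# aligned with the original rows and can be assigned directly.
with_counts = df.assign(
    n_groupby=df.groupby("species")["species"].transform("size")
)
print(with_counts)
```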

Importing file containing text and numerical data using Python

I have a .txt file which has text data and numerical data. The first two rows of the file have essential information in text form, and the first column (I am referring to the zeroth column as the first column) also has essential data in text form. At all other locations in the file, the data is numerical. I wish to analyze the numerical data using Python libraries, preferably numpy or pandas or a combination of both (analysis such as regression and correlation, for example with scikit-learn). I reiterate that all of the data in the file is essential for my analysis. The following snapshot (taken from Excel) shows a truncated version of the format of my data:
The data shown in this snapshot can be found here.
In particular, what I want is to be able to import all the numerical data from this file using python (numpy or pandas), and be able to refer to specific rows in this data using the text data in the first two rows (Type, Tag) and the first column (object number). In my actual data file, I have hundreds of thousands of rows (object types) and scores of columns.
I have already tried using numpy.loadtxt(...) and pandas.read_csv(...) to open this file, but I have either run into errors, or have loaded data in clumsy formats. I will be really thankful to have some direction as to how I can import the file in python in a way so that I have the functionality that I desire.
If I were you, I would use pandas, and import it using something like this:
df = pd.read_csv('dum.txt',sep='\t', header=[0,1], index_col=0)
This gives you the dataframe:
>>> df
Type T1 T2 T3 T4 T5
Tag Good Good Good Good Good
object1 1.1 2.1 3.1 4.1 5.1
object2 1.2 2.2 3.2 4.2 5.2
object3 1.3 2.3 3.3 4.3 5.3
object4 1.4 2.4 3.4 4.4 5.4
object5 1.5 2.5 3.5 4.5 5.5
object6 1.6 2.6 3.6 4.6 5.6
object7 1.7 2.7 3.7 4.7 5.7
object8 1.8 2.8 3.8 4.8 5.8
And all of your columns are floats:
>>> df.dtypes
Type Tag
T1 Good float64
T2 Good float64
T3 Good float64
T4 Good float64
T5 Good float64
dtype: object
It contains a multi-indexed column header:
>>> df.columns
MultiIndex(levels=[['T1', 'T2', 'T3', 'T4', 'T5'], ['Good']],
labels=[[0, 1, 2, 3, 4], [0, 0, 0, 0, 0]],
names=['Type', 'Tag'])
And a regular index containing the object names from the first column:
>>> df.index
Index(['object1', 'object2', 'object3', 'object4', 'object5', 'object6',
'object7', 'object8'],
dtype='object')
Furthermore, you can convert your values to a numpy array of floats simply by using:
>>> df.values
array([[1.1, 2.1, 3.1, 4.1, 5.1],
[1.2, 2.2, 3.2, 4.2, 5.2],
[1.3, 2.3, 3.3, 4.3, 5.3],
[1.4, 2.4, 3.4, 4.4, 5.4],
[1.5, 2.5, 3.5, 4.5, 5.5],
[1.6, 2.6, 3.6, 4.6, 5.6],
[1.7, 2.7, 3.7, 4.7, 5.7],
[1.8, 2.8, 3.8, 4.8, 5.8]])
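Once the file is loaded this way, rows and columns can be addressed by their text labels. The sketch below assumes a tab-separated layout like the one in the question; the inline `raw` string and the 'Bad' tag are invented so that the example is self-contained.

```python
import io
import pandas as pd

# Hypothetical miniature of the file: two header rows, then labelled data rows.
raw = (
    "Type\tT1\tT2\tT3\n"
    "Tag\tGood\tGood\tBad\n"
    "object1\t1.1\t2.1\t3.1\n"
    "object2\t1.2\t2.2\t3.2\n"
)
df = pd.read_csv(io.StringIO(raw), sep="\t", header=[0, 1], index_col=0)

# One row, looked up by its object name.
row = df.loc["object2"]

# One cell, looked up by object name and (Type, Tag) pair.
cell = df.loc["object2", ("T2", "Good")]

# All columns whose second header level (the Tag row) is 'Good'.
good = df.xs("Good", axis=1, level=1)
print(good)
```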
Use sep=r'\s+' to split on any whitespace, not only tabs, and engine='python' to avoid the parser warning:
df = pd.read_csv('dum.txt', engine='python', sep=r'\s+')
print(df)
Output:
Type T1 T2 T3 T4 T5
0 Tag Good Good Good Good Good
1 object1 1.1 2.1 3.1 4.1 5.1
2 object2 1.2 2.2 3.2 4.2 5.2
3 object3 1.3 2.3 3.3 4.3 5.3
4 object4 1.4 2.4 3.4 4.4 5.4
5 object5 1.5 2.5 3.5 4.5 5.5
6 object6 1.6 2.6 3.6 4.6 5.6
7 object7 1.7 2.7 3.7 4.7 5.7
8 object8 1.8 2.8 3.8 4.8 5.8
Or, if you want two header rows (I would not recommend it, because the result is harder to use):
df = pd.read_csv('dum.txt', engine='python', sep=r'\s+', header=[0, 1])
print(df)
Output:
Type T1 T2 T3 T4 T5
Tag Good Good Good Good Good
0 object1 1.1 2.1 3.1 4.1 5.1
1 object2 1.2 2.2 3.2 4.2 5.2
2 object3 1.3 2.3 3.3 4.3 5.3
3 object4 1.4 2.4 3.4 4.4 5.4
4 object5 1.5 2.5 3.5 4.5 5.5
5 object6 1.6 2.6 3.6 4.6 5.6
6 object7 1.7 2.7 3.7 4.7 5.7
Otherwise, a plain read_csv call with default arguments (like pd.read_csv('dum.txt')) will return:
Type\tT1\tT2\tT3\tT4\tT5
0 Tag\tGood\tGood\tGood\tGood\tGood
1 object1\t1.1\t2.1\t3.1\t4.1\t5.1
2 object2\t1.2\t2.2\t3.2\t4.2\t5.2
3 object3\t1.3\t2.3\t3.3\t4.3\t5.3
4 object4\t1.4\t2.4\t3.4\t4.4\t5.4
5 object5\t1.5\t2.5\t3.5\t4.5\t5.5
6 object6\t1.6\t2.6\t3.6\t4.6\t5.6
7 object7\t1.7\t2.7\t3.7\t4.7\t5.7
8 object8\t1.8\t2.8\t3.8\t4.8\t5.8

python blaze postgresql can't print "distinct" iris species

Going through this tutorial about blaze, but using the iris dataset in a local postgresql db.
I don't seem to get the same output as shown when using db.iris.Species.distinct() (see In 16 of the IPython notebook).
My connection string is postgresql://postgres:postgres@localhost:5432/blaze_test
and my simple Python code is:
import blaze as bz
db = bz.Data('postgresql://postgres:postgres@localhost:5432/blaze_test')
mySpecies = db.iris_data.species.distinct()
print(mySpecies)
All I get in the console (using the Spyder IDE) is distinct(_55.iris_data.species)
How can I actually print the distinct species in the table?
NB: I know I am using a lowercase "s" for the "species" part in the code; otherwise I just get an error saying: 'Field' object has no attribute 'Species'
The printing mechanism is tripping you up a bit here.
The __str__ implementation (which is what Python's print function calls) returns a string version of the expression.
The __repr__ implementation (called when you execute a line in the interpreter) triggers computation and thus allows you to see the results of a computation.
In [2]: iris = Data(odo(os.path.abspath('./blaze/examples/data/iris.csv'), 'postgresql://localhost::iris'))
In [3]: iris
Out[3]:
sepal_length sepal_width petal_length petal_width species
0 5.1 3.5 1.4 0.2 Iris-setosa
1 4.9 3.0 1.4 0.2 Iris-setosa
2 4.7 3.2 1.3 0.2 Iris-setosa
3 4.6 3.1 1.5 0.2 Iris-setosa
4 5.0 3.6 1.4 0.2 Iris-setosa
5 5.4 3.9 1.7 0.4 Iris-setosa
6 4.6 3.4 1.4 0.3 Iris-setosa
7 5.0 3.4 1.5 0.2 Iris-setosa
8 4.4 2.9 1.4 0.2 Iris-setosa
9 4.9 3.1 1.5 0.1 Iris-setosa
...
In [4]: iris.species.distinct()
Out[4]:
species
0 Iris-versicolor
1 Iris-virginica
2 Iris-setosa
In [8]: print(str(iris.species.distinct()))
distinct(_1.species)
In [9]: print(repr(iris.species.distinct()))
species
0 Iris-versicolor
1 Iris-virginica
2 Iris-setosa
If you want to shove the result into a concrete data structure like a pandas.Series, do this:
In [5]: odo(iris.species.distinct(), pd.Series)
Out[5]:
0 Iris-versicolor
1 Iris-virginica
2 Iris-setosa
Name: species, dtype: object
Ok, I think I know now. The rest of the YouTube video made it a bit more clear.
I should do something like output = odo(mySpecies, pd.DataFrame) or output = odo(mySpecies, list) to do the transformation, and then print output.
Other solutions/points welcome.

how do I transform a DataFrame in pandas with a function applied to many slices in each row?

I want to apply a function f to many slices within each row of a pandas DataFrame.
For example, DataFrame df would look as such:
df = pandas.DataFrame(np.round(np.random.normal(size=(2,49)), 2))
So, I have a dataframe of 2 rows by 49 columns, and my function needs to be applied to every consecutive slice of 7 data points in both rows, so that the resulting dataframe has the same layout as the input dataframe.
I was doing it as such:
df1=df.copy()
df1.T[:7], df1.T[7:14], df1.T[14:21],..., df1.T[42:49] = f(df.T.iloc[:7,:]), f(df.T.iloc[7:14,:]),..., f(df.T.iloc[42:49,:])
As you can see that's a whole lot of redundant code.. so I would like to create a loop or something so that it applies the function to every 7 subsequent data point...
I have no idea how to approach this. Is there a more elegant way to do this?
I thought I could maybe use a transform function for this, but in the pandas documentation I can only see that applied to a dataframe that has been grouped and not on slices of the data....
Hopefully this is clear.. let me know.
Thank you.
To avoid redundant code you can just do a loop like this:
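STEP = 7
df1 = df.copy()
for i in range(0, df.shape[1], STEP):  # step over the 49 columns, not the 2 rows
    # f receives a block of 7 columns; transpose inside f if it expects 7 rows
    df1.iloc[:, i:i+STEP] = f(df.iloc[:, i:i+STEP])  # could also do an apply here somehow, depending on what you want to do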
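A runnable sketch of such a loop is shown below. The function `f` here is a made-up placeholder (it centres each row of a block on that block's mean), since the question does not say what the real function does. The only assumption is that f returns something with the same shape as the block it receives.

```python
import numpy as np
import pandas as pd

def f(block):
    # Hypothetical stand-in for the real function:
    # subtract each row's mean over the 7-column block.
    return block.sub(block.mean(axis=1), axis=0)

rng = np.random.default_rng(0)
df = pd.DataFrame(np.round(rng.normal(size=(2, 49)), 2))

STEP = 7
out = df.copy()
for start in range(0, df.shape[1], STEP):
    block = df.iloc[:, start:start + STEP]
    # Write the values back positionally so no label alignment is involved.
    out.iloc[:, start:start + STEP] = f(block).to_numpy()

print(out.shape)
```

Writing through `out.iloc[...]` modifies `out` itself, whereas assigning into a slice of `df.T` may only change a temporary transposed copy.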
Don't Repeat Yourself
You don't provide any examples of your desired output, so here's my best guess at what you want...
If your data are lumped into groups of seven, then you need to come up with a way to label them as such.
In other words, if you want to work with arbitrary arrays, use numpy. If you want to work with labeled, meaningful data and its associated metadata, then use pandas.
Also, pandas works more efficiently when operating on (and displaying!) row-wise data. So that means storing data long (49x2), not wide (2x49).
Here's an example of what I mean. I have the same 49x2 random array, but assigned grouping labels to the rows ahead of time.
Let's say you're reading in some wide-ish data as follows:
import pandas
import numpy
from io import StringIO # python 3
# from StringIO import StringIO # python 2
datafile = StringIO("""\
A,B,C,D,E,F,G,H,I,J
0.0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9
1.0,1.1,1.2,1.3,1.4,1.5,1.6,1.7,1.8,1.9
2.0,2.1,2.2,2.3,2.4,2.5,2.6,2.7,2.8,2.9
""")
df = pandas.read_csv(datafile)
print(df)
A B C D E F G H I J
0 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9
1 1 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9
2 2 2.1 2.2 2.3 2.4 2.5 2.6 2.7 2.8 2.9
You could add a cluster value to the columns, like so:
cluster_size = 3
col_vals = []
for n, col in enumerate(df.columns):
    cluster = int(n/cluster_size)
    col_vals.append((cluster, col))
df.columns = pandas.Index(col_vals)
print(df)
0 1 2 3
A B C D E F G H I J
0 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9
1 1 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9
2 2 2.1 2.2 2.3 2.4 2.5 2.6 2.7 2.8 2.9
By default, the groupby method tries to group rows, but you can group columns (I just figured this out) by passing axis=1 when you create the object. So the sum of each cluster of columns for each row is as follows:
df.groupby(axis=1, level=0).sum()
0 1 2 3
0 0.3 1.2 2.1 0.9
1 3.3 4.2 5.1 1.9
2 6.3 7.2 8.1 2.9
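Recent pandas releases deprecate the axis=1 argument of groupby, so on a newer installation the same column-cluster sum may need to be written by transposing, grouping the rows, and transposing back. A minimal sketch on a cut-down, made-up version of the frame above:

```python
import numpy as np
import pandas as pd

# Hypothetical four-column version of the example data.
df = pd.DataFrame(
    [[0.0, 0.1, 0.2, 0.3],
     [1.0, 1.1, 1.2, 1.3]],
    columns=list("ABCD"),
)

# Label every column with its cluster number, as in the loop above.
cluster_size = 3
df.columns = pd.MultiIndex.from_tuples(
    [(n // cluster_size, col) for n, col in enumerate(df.columns)]
)

# Transpose so the clusters become row labels, group on the first level,
# sum, and transpose back to the original orientation.
sums = df.T.groupby(level=0).sum().T
print(sums)
```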
But again, if all you're doing is more "global" operations, there's no need to do any of this.
In-place column cluster operation
df[0] *= 5
print(df)
0 1 2 3
A B C D E F G H I J
0 0 2.5 5 0.3 0.4 0.5 0.6 0.7 0.8 0.9
1 25 27.5 30 1.3 1.4 1.5 1.6 1.7 1.8 1.9
2 50 52.5 55 2.3 2.4 2.5 2.6 2.7 2.8 2.9
In-place row operation
df.loc[0] += 20
0 1 2 3
A B C D E F G H I J
0 20 22.5 25 20.3 20.4 20.5 20.6 20.7 20.8 20.9
1 25 27.5 30 1.3 1.4 1.5 1.6 1.7 1.8 1.9
2 50 52.5 55 2.3 2.4 2.5 2.6 2.7 2.8 2.9
Operate on the entire dataframe at once
def myFunc(x):
    return 5 + x**2
myFunc(df)
0 1 2 3
A B C D E F G H I J
0 405 511.25 630 417.09 421.16 425.25 429.36 433.49 437.64 441.81
1 630 761.25 905 6.69 6.96 7.25 7.56 7.89 8.24 8.61
2 2505 2761.25 3030 10.29 10.76 11.25 11.76 12.29 12.84 13.41

What is the difference between pandas agg and apply function?

I can't figure out the difference between Pandas .aggregate and .apply functions.
Take the following as an example: I load a dataset, do a groupby, define a simple function,
and either use .agg or .apply.
As you may see, the printing statement within my function results in the same output
after using .agg and .apply. The result, on the other hand is different. Why is that?
import pandas as pd
iris = pd.read_csv('iris.csv')
by_species = iris.groupby('Species')
def f(x):
    print(type(x))
    print(x.head(3))
    return 1
Using apply:
by_species.apply(f)
#<class 'pandas.core.frame.DataFrame'>
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#0 5.1 3.5 1.4 0.2 setosa
#1 4.9 3.0 1.4 0.2 setosa
#2 4.7 3.2 1.3 0.2 setosa
#<class 'pandas.core.frame.DataFrame'>
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#0 5.1 3.5 1.4 0.2 setosa
#1 4.9 3.0 1.4 0.2 setosa
#2 4.7 3.2 1.3 0.2 setosa
#<class 'pandas.core.frame.DataFrame'>
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#50 7.0 3.2 4.7 1.4 versicolor
#51 6.4 3.2 4.5 1.5 versicolor
#52 6.9 3.1 4.9 1.5 versicolor
#<class 'pandas.core.frame.DataFrame'>
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#100 6.3 3.3 6.0 2.5 virginica
#101 5.8 2.7 5.1 1.9 virginica
#102 7.1 3.0 5.9 2.1 virginica
#Out[33]:
#Species
#setosa 1
#versicolor 1
#virginica 1
#dtype: int64
Using agg
by_species.agg(f)
#<class 'pandas.core.frame.DataFrame'>
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#0 5.1 3.5 1.4 0.2 setosa
#1 4.9 3.0 1.4 0.2 setosa
#2 4.7 3.2 1.3 0.2 setosa
#<class 'pandas.core.frame.DataFrame'>
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#50 7.0 3.2 4.7 1.4 versicolor
#51 6.4 3.2 4.5 1.5 versicolor
#52 6.9 3.1 4.9 1.5 versicolor
#<class 'pandas.core.frame.DataFrame'>
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#100 6.3 3.3 6.0 2.5 virginica
#101 5.8 2.7 5.1 1.9 virginica
#102 7.1 3.0 5.9 2.1 virginica
#Out[34]:
# Sepal.Length Sepal.Width Petal.Length Petal.Width
#Species
#setosa 1 1 1 1
#versicolor 1 1 1 1
#virginica 1 1 1 1
apply applies the function to each group (your Species). Your function returns 1, so you end up with 1 value for each of 3 groups.
agg aggregates each column (feature) for each group, so you end up with one value per column per group.
Do read the groupby docs, they're quite helpful. There are also a bunch of tutorials floating around the web.
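The difference in result shape can be reproduced on a tiny made-up frame. This is a sketch only: the three-row data is invented, and the measurement columns are selected explicitly so that the grouping column is not passed to the function.

```python
import pandas as pd

# Hypothetical three-row stand-in for the iris data.
df = pd.DataFrame({
    "species": ["setosa", "setosa", "virginica"],
    "sepal_length": [5.1, 4.9, 6.3],
    "sepal_width": [3.5, 3.0, 3.3],
})
grouped = df.groupby("species")[["sepal_length", "sepal_width"]]

def f(x):
    return 1

# apply: f is called once per group, giving one value per group (a Series).
per_group = grouped.apply(f)

# agg: f is called per column within each group, giving one value
# per column per group (a DataFrame).
per_column = grouped.agg(f)

print(per_group)
print(per_column)
```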
(Note: These comparisons are relevant for DataframeGroupby objects)
Some plausible advantages of using .agg() compared to .apply() for DataFrameGroupBy objects would be:
.agg() gives the flexibility of applying multiple functions at once, or passing a list of functions to each column.
It can also apply different functions to different columns of the dataframe in one call.
That means you have fine-grained control over which operation is applied to which column.
Here is the link for more details: http://pandas.pydata.org/pandas-docs/version/0.13.1/groupby.html
By contrast, .apply() takes a single function per call, so you may have to call it repeatedly to run different operations on the same column.
Here are some example comparisons for .apply() vs .agg() for DataframeGroupBy objects :
Given the following dataframe:
In [261]: df = pd.DataFrame({"name":["Foo", "Baar", "Foo", "Baar"], "score_1":[5,10,15,10], "score_2" :[10,15,10,25], "score_3" : [10,20,30,40]})
In [262]: df
Out[262]:
name score_1 score_2 score_3
0 Foo 5 10 10
1 Baar 10 15 20
2 Foo 15 10 30
3 Baar 10 25 40
Let's first see the operations using .apply():
In [263]: df.groupby(["name", "score_1"])["score_2"].apply(lambda x : x.sum())
Out[263]:
name score_1
Baar 10 40
Foo 5 10
15 10
Name: score_2, dtype: int64
In [264]: df.groupby(["name", "score_1"])["score_2"].apply(lambda x : x.min())
Out[264]:
name score_1
Baar 10 15
Foo 5 10
15 10
Name: score_2, dtype: int64
In [265]: df.groupby(["name", "score_1"])["score_2"].apply(lambda x : x.mean())
Out[265]:
name score_1
Baar 10 20.0
Foo 5 10.0
15 10.0
Name: score_2, dtype: float64
Now, look at the same operations done in a single call using .agg():
In [276]: df.groupby(["name", "score_1"]).agg({"score_3" :[np.sum, np.min, np.mean, np.max], "score_2":lambda x : x.mean()})
Out[276]:
score_2 score_3
<lambda> sum amin mean amax
name score_1
Baar 10 20 60 20 30 40
Foo 5 10 10 10 10 10
15 10 30 30 30 30
So, .agg() can be really handy for DataFrameGroupBy objects compared to .apply(). But if you are handling plain DataFrame objects rather than DataFrameGroupBy objects, then apply() can be very useful, as it can apply a function along either axis of the dataframe.
(For example, axis=0, the default, applies the function column-wise, and axis=1 applies it row-wise when dealing with plain DataFrame objects.)
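The same kind of multi-function aggregation can also be written with string function names instead of NumPy functions, or with named aggregation when flat output column names are preferred. The sketch below reuses the small frame from this answer; the output names `total_3` and `mean_2` are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Foo", "Baar", "Foo", "Baar"],
    "score_1": [5, 10, 15, 10],
    "score_2": [10, 15, 10, 25],
    "score_3": [10, 20, 30, 40],
})

# Several functions for one column and a single function for another.
summary = df.groupby(["name", "score_1"]).agg(
    {"score_3": ["sum", "min", "mean", "max"], "score_2": "mean"}
)

# Named aggregation: each keyword becomes an output column name.
named = df.groupby("name").agg(
    total_3=("score_3", "sum"),
    mean_2=("score_2", "mean"),
)

print(summary)
print(named)
```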
The main difference between apply and aggregate is:
apply()
cannot be applied to multiple groups together
For apply(), we have to use get_group() first
Error: iris.groupby('Species').apply({'Sepal.Length':['min','max'],'Sepal.Width':['mean','min']}) # it will throw an error
Works fine: iris.groupby('Species').get_group('setosa').apply({'Sepal.Length':['min','max'],'Sepal.Width':['mean','min']}) # because the functions are applied to one data frame
agg()
can be applied to multiple groups together
For agg(), we do not have to use get_group()
iris.groupby('Species').agg({'Sepal.Length':['min','max'],'Sepal.Width':['mean','min']})
iris.groupby('Species').get_group('versicolor').agg({'Sepal.Length':['min','max'],'Sepal.Width':['mean','min']})
Refer here. Let me quote the relevant statement:
Some operations on the grouped data might not fit into either the aggregate or transform categories. Or, you may simply want GroupBy to infer how to combine the results. For these, use the apply function, which can be substituted for both aggregate and transform in many standard use cases. However, apply can handle some exceptional use cases, for example:
More details with example are presented in the pandas documentation (link provided above)
Kindly refer to this great write-up from @Ted Petrou and @Eric O Lebigot. We can reuse the logic they applied to the difference between apply and transform to investigate the difference between apply and agg.
Then, to understand how axis works, refer to this link.
These three links should help in getting better clarity on how they differ.
When using apply to a groupby I have encountered that .apply will return the grouped columns. There is a note in the documentation (pandas.pydata.org/pandas-docs/stable/groupby.html):
"...Thus the grouped columns(s) may be included in the output as well as set the indices."
.aggregate will not return the grouped columns.
Besides everything the others mentioned, another difference that I think no one has highlighted yet is that apply can be used to apply a function to a group of columns together, whereas agg applies a function to each column separately.
Let's use the same dataframe as the other answer:
d = pd.DataFrame({"name":["Foo", "Baar", "Foo", "Baar"], "score_1":[5,10,15,10], "score_2" :[10,15,10,25], "score_3" : [10,20,30,40]})
Here, the apply is using a function that sums up the values of all the columns of a group together.
d.groupby(["name", "score_1"]).apply(lambda x: x.values.sum())
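To make the contrast concrete, here is a sketch with a function that needs two columns at once. The product-sum below is an invented example; the point is only that the function passed to apply sees the whole group as a DataFrame, while agg works on one column at a time.

```python
import pandas as pd

d = pd.DataFrame({
    "name": ["Foo", "Baar", "Foo", "Baar"],
    "score_1": [5, 10, 15, 10],
    "score_2": [10, 15, 10, 25],
    "score_3": [10, 20, 30, 40],
})
scores = d.groupby("name")[["score_1", "score_2"]]

# apply: each group arrives as a DataFrame, so columns can be combined.
cross = scores.apply(lambda g: (g["score_1"] * g["score_2"]).sum())

# agg: the function is evaluated for each column on its own.
per_column = scores.agg("sum")

print(cross)
print(per_column)
```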
