String "contains"-slicing on Pandas MultiIndex - python

How can I slice a MultiIndex by its string content? I.e. whether that particular index contains a certain string?
In [12]: df = pd.DataFrame({'a': ['a', 'ab', 'b'],
'c': ['d', 'd', 'd'],
'val': [1, 2 , 3]}).set_index(['a', 'c'])
In [13]: df
Out[13]:
val
a c
a d 1
ab d 2
b d 3
In [15]: df.xs('a', level='a', drop_level=False)
Out[15]:
val
a c
a d 1
In[16]: df.xs(contains('a'), level='a', drop_level=False)
Expected output:
Out[16]:
a c
a d 1
ab d 2
Obviously that last bit is not possible.
How can this be done elegantly?
Can you do it case-insensitive in some way?

Use boolean indexing with get_level_values and str.contains:
print (df.index.get_level_values('a'))
Index(['a', 'ab', 'b'], dtype='object', name='a')
print (df.index.get_level_values('a').str.contains('a'))
[ True True False]
df1 = df[df.index.get_level_values('a').str.contains('a', case=False)]
print (df1)
val
a c
a d 1
ab d 2
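The steps above can be wrapped into a small reusable function. This is only a sketch: the helper name filter_level_contains and its parameters are made up for illustration and are not part of pandas. It assumes the index level holds strings without missing values and matches the pattern as a literal substring.

```python
import pandas as pd

df = pd.DataFrame({'a': ['a', 'ab', 'b'],
                   'c': ['d', 'd', 'd'],
                   'val': [1, 2, 3]}).set_index(['a', 'c'])


def filter_level_contains(frame, level, pattern, case=False):
    """Hypothetical helper: keep rows whose `level` label contains `pattern`."""
    labels = frame.index.get_level_values(level)
    # regex=False treats the pattern as a plain substring, not a regex
    mask = labels.str.contains(pattern, case=case, regex=False)
    return frame[mask]


# Case-insensitive match, so 'A' selects both 'a' and 'ab'
result = filter_level_contains(df, 'a', 'A')
print(result)
```

Passing case=True would make the same call return no rows, since no label contains an uppercase 'A'.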

Another method is to use query:
The DataFrame.index and DataFrame.columns attributes of the DataFrame
instance are placed in the query namespace by default, which allows
you to treat both the index and columns of the frame as a column in
the frame.
>>> df.query('a.str.contains("a")')
val
a c
a d 1
ab d 2
which IMO is a little more readable and succinct than
>>> df[df.index.get_level_values('a').str.contains('a')]
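For the case-insensitive part of the question, the query form can probably take the same case=False keyword. The sketch below assumes a pandas version whose query parser accepts keyword arguments in method calls, and passes engine='python' explicitly as a precaution, because string methods are not something numexpr evaluates.

```python
import pandas as pd

df = pd.DataFrame({'a': ['a', 'ab', 'b'],
                   'c': ['d', 'd', 'd'],
                   'val': [1, 2, 3]}).set_index(['a', 'c'])

# Index level 'a' is visible inside the query string by its name
matches = df.query('a.str.contains("A", case=False)', engine='python')
print(matches)
```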

When the index level is not critical, an alternative is to use df.filter(...) with regular expressions, which is very helpful when exploring data by either column or row. For example, this gives the same answer with a bit less code (it needs import re):
df.filter(regex=re.compile('A',re.I),axis=0)
However, this filters on all index levels: df.filter(regex=re.compile('D',re.I),axis=0) also matches against index level "c".

Related

pandas: Extracting the index of the maximum value in an expanding window

In a pandas DataFrame, I can create a Series B with the maximum value of another Series A, from the first row to the current one, by using an expanding window:
df['B'] = df['A'].expanding().max()
I can also extract the value of the index of the maximum overall value of Series A:
idx_max_A = df['A'].idxmax().value
What I want is an efficient way to combine both; that is, to create a Series B that holds the value of the index of the maximum value of Series A from the first row up to the current one. Ideally, something like this...
df['B'] = df['A'].expanding().idxmax().value
...but, of course, the above fails because the Expanding object does not have idxmax. Is there a straightforward way to do this?
EDIT: For illustration purposes, for the following DataFrame...
df = pd.DataFrame([1, 2, 1, 3, 0], index=['a', 'b', 'c', 'd', 'e'], columns=['A'])
...I'd like to create an additional column B so that the DataFrame contains the following:
A B
a 1 a
b 2 b
c 1 b
d 3 d
e 0 d
I believe you can use expanding + max + groupby:
v = df.expanding().max().A
df['B'] = v.groupby(v).transform('idxmax')
df
A B
a 1 a
b 2 b
c 1 b
d 3 d
e 0 d
It seems idxmax is only available in the latest version of pandas, which I don't have yet. Here's a solution that involves neither groupby nor idxmax:
import pandas as pd
import numpy as np
df = pd.DataFrame([1, 2, 1, 3, 0], index=['a', 'b', 'c', 'd', 'e'], columns=['A'])
temp = df.A.expanding().max()
df['B'] = temp.apply(lambda x: temp[temp == x].index[0])
df
A B
a 1 a
b 2 b
c 1 b
d 3 d
e 0 d
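Another way to get the same column, sketched here as an alternative rather than as what either answer above does, is to mark the rows where a new running maximum is set and forward-fill their labels. It assumes column A has no NaN values, and on ties it keeps the first occurrence, like idxmax.

```python
import pandas as pd

df = pd.DataFrame([1, 2, 1, 3, 0], index=['a', 'b', 'c', 'd', 'e'], columns=['A'])

# Running maximum of all rows strictly before the current one
prev_max = df['A'].cummax().shift()

# A row sets a new maximum if it is the first row or exceeds every earlier value
is_new_max = prev_max.isna() | (df['A'] > prev_max)

# Keep the label only where a new maximum occurs, then carry it forward
labels = df.index.to_series()
df['B'] = labels.where(is_new_max).ffill()
print(df)
```

This avoids the per-row lookup of the apply-based version, which may matter on long frames.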

Simplest way to return the value only of a location from a dataframe

This is a very simple question, but I always find myself doing too many operations to get a single location value from a dataframe. Let me explain:
import pandas as pd
df = pd.DataFrame(list(zip('abcde', 'rithf')), columns=['a', 'b'])
df
Out[23]:
a b
0 a r
1 b i
2 c t
3 d h
4 e f
I am trying to extract the column b value where column a == 'a'. Using .loc, which is very straightforward, returns this:
df.loc[df.a == 'a', 'b']
Out[24]:
0 r
Name: b, dtype: object
getting the value gets very dirty:
df.loc[df.a == 'a', 'b'].values[0]
Out[26]:
'r'
when you know the exact index, it actually returns the value only:
df.loc[0, 'b']
Out[27]:
'r'
but obviously I need an indexer first.
So the question is: is there any sexier way than df.loc[df.a == 'a', 'b'].values[0] to return the actual value only and not a series?
You can use Series.item:
print(df.loc[df.a == 'a', 'b'].item())
r
You can use argmax to get the match as a scalar and then use .loc -
df.loc[(df.a=='a').argmax(),'b']
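Of course, it assumes that we have a match in that 'a' column. It also assumes the default integer index, since argmax returns a position in recent pandas versions while .loc expects a label.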
Sample run -
In [346]: df
Out[346]:
a b
0 a r
1 b i
2 c t
3 d h
4 e f
In [347]: (df.a=='a').argmax() # row indexer
Out[347]: 0
In [348]: df.loc[(df.a=='a').argmax(),'b']
Out[348]: 'r'
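Both answers depend on the match existing: Series.item() raises a ValueError unless exactly one row matches, and argmax returns 0 when nothing matches. If that matters, a tiny wrapper can make the behaviour explicit. The function name first_value and its default parameter are hypothetical, chosen only for this sketch.

```python
import pandas as pd

df = pd.DataFrame(list(zip('abcde', 'rithf')), columns=['a', 'b'])


def first_value(frame, mask, column, default=None):
    """Hypothetical helper: first matching value as a scalar, or `default`."""
    matches = frame.loc[mask, column]
    if matches.empty:
        return default
    # iloc[0] is positional, so this works for any index type
    return matches.iloc[0]


value = first_value(df, df['a'] == 'a', 'b')    # a match exists
missing = first_value(df, df['a'] == 'z', 'b')  # no match, falls back to default
print(value, missing)
```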

Check if Pandas column contains value from another column

if df['col']='a','b','c' and df2['col']='a123','b456','d789' how do I create df2['is_contained']='a','b','no_match' where if values from df['col'] are found within values from df2['col'] the df['col'] value is returned and if no match is found, 'no_match' is returned? Also I don't expect there to be multiple matches, but in the unlikely case there are, I'd want to return a string like 'Multiple Matches'.
With this toy data set, we want to add a new column to df2 which will contain no_match for the first three rows, and the last row will contain the value 'd' due to the fact that that row's col value (the letter 'a') appears in df1.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
df1 = pd.DataFrame({'col': ['a', 'b', 'c', 'd']})
df2 = pd.DataFrame({'col': ['a123','b456','d789', 'a']})
In other words, values from df1 should be used to populate this new column in df2 only when a row's df2['col'] value appears somewhere in df1['col'].
In [2]: df1
Out[2]:
col
0 a
1 b
2 c
3 d
In [3]: df2
Out[3]:
col
0 a123
1 b456
2 d789
3 a
If this is the right way to understand your question, then you can do this with pandas isin:
In [4]: df2.col.isin(df1.col)
Out[4]:
0 False
1 False
2 False
3 True
Name: col, dtype: bool
This evaluates to True only when a value in df2.col is also in df1.col.
Then you can use np.where which is more or less the same as ifelse in R if you are familiar with R at all.
In [5]: np.where(df2.col.isin(df1.col), df1.col, 'NO_MATCH')
Out[5]:
0 NO_MATCH
1 NO_MATCH
2 NO_MATCH
3 d
Name: col, dtype: object
For rows where a df2.col value appears in df1.col, the value from df1.col will be returned for the given row index. In cases where the df2.col value is not a member of df1.col, the default 'NO_MATCH' value will be used.
You must first guarantee that the indexes match. To simplify, I'll show it as if the columns were in the same dataframe. The trick is to use the apply method along the columns axis:
df = pd.DataFrame({'col1': ['a', 'b', 'c', 'd'],
'col2': ['a123','b456','d789', 'a']})
df['contained'] = df.apply(lambda x: x.col1 in x.col2, axis=1)
df
col1 col2 contained
0 a a123 True
1 b b456 True
2 c d789 False
3 d a False
In 0.13, you can use str.extract:
In [11]: df1 = pd.DataFrame({'col': ['a', 'b', 'c']})
In [12]: df2 = pd.DataFrame({'col': ['d23','b456','a789']})
In [13]: df2.col.str.extract('(%s)' % '|'.join(df1.col))
Out[13]:
0 NaN
1 b
2 a
Name: col, dtype: object
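None of the snippets above produce the 'no_match' and 'Multiple Matches' labels the question asks for. One way to get them, sketched here with plain Python substring tests rather than a regex, is shown below. The function name find_contained and the extra row 'ab00' are assumptions added purely to exercise the multiple-match branch.

```python
import pandas as pd

df1 = pd.DataFrame({'col': ['a', 'b', 'c']})
df2 = pd.DataFrame({'col': ['a123', 'b456', 'd789', 'ab00']})


def find_contained(text, candidates):
    """Hypothetical helper: which candidate occurs inside `text`?"""
    hits = [c for c in candidates if c in text]
    if not hits:
        return 'no_match'
    if len(hits) > 1:
        return 'Multiple Matches'
    return hits[0]


candidates = df1['col'].tolist()
df2['is_contained'] = df2['col'].apply(lambda text: find_contained(text, candidates))
print(df2)
```

This checks every candidate against every row, so it is fine for small lookup lists but may be slow for large ones.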

How to assign a name to the size() column?

I am using .size() on a groupby result in order to count how many items are in each group.
I would like the result to be saved to a new column name without manually editing the column names array, how can it be done?
This is what I have tried:
grpd = df.groupby(['A','B'])
grpd['size'] = grpd.size()
grpd
and the error I got:
TypeError: 'DataFrameGroupBy' object does not support item assignment
(on the second line)
The .size() built-in method of DataFrameGroupBy objects actually returns a Series object with the group sizes and not a DataFrame. If you want a DataFrame whose column is the group sizes, indexed by the groups, with a custom name, you can use the .to_frame() method and use the desired column name as its argument.
grpd = df.groupby(['A','B']).size().to_frame('size')
If you wanted the groups to be columns again you could add a .reset_index() at the end.
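As a quick illustration of the two variants just described, here is a sketch on a small made-up frame (the sample data is an assumption, not the asker's):

```python
import pandas as pd

df = pd.DataFrame({'A': ['x', 'x', 'x', 'y', 'y'],
                   'B': ['a', 'c', 'c', 'b', 'b']})

# Group sizes as a one-column DataFrame indexed by (A, B)
sizes = df.groupby(['A', 'B']).size().to_frame('size')

# Same data with the group keys turned back into ordinary columns
flat = sizes.reset_index()
print(flat)
```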
You need transform with size; the length of df stays the same as before:
Notice:
Here it is necessary to select one column after groupby, otherwise you get an error. Because GroupBy.size counts NaNs too, it does not matter which column is used; all columns work the same.
import pandas as pd
df = pd.DataFrame({'A': ['x', 'x', 'x','y','y']
, 'B': ['a', 'c', 'c','b','b']})
print (df)
A B
0 x a
1 x c
2 x c
3 y b
4 y b
df['size'] = df.groupby(['A', 'B'])['A'].transform('size')
print (df)
A B size
0 x a 1
1 x c 2
2 x c 2
3 y b 2
4 y b 2
If you need to set the column name while aggregating df, the length of df is obviously NOT the same as before:
import pandas as pd
df = pd.DataFrame({'A': ['x', 'x', 'x','y','y']
, 'B': ['a', 'c', 'c','b','b']})
print (df)
A B
0 x a
1 x c
2 x c
3 y b
4 y b
df = df.groupby(['A', 'B']).size().reset_index(name='Size')
print (df)
A B Size
0 x a 1
1 x c 2
2 y b 2
The result of df.groupby(...) is not a DataFrame. To get a DataFrame back, you have to apply a function to each group, transform each element of a group, or filter the groups.
It seems like you want a DataFrame that contains (1) all your original data in df and (2) the count of how much data is in each group. These things have different lengths, so if they need to go into the same DataFrame, you'll need to list the size redundantly, i.e., for each row in each group.
df['size'] = df.groupby(['A','B'])['A'].transform('size')
(Aside: It's helpful if you can show succinct sample input and expected results.)
You can set the as_index parameter in groupby to False to get a DataFrame instead of a Series:
df = pd.DataFrame({'A': ['a', 'a', 'b', 'b'], 'B': [1, 2, 2, 2]})
df.groupby(['A', 'B'], as_index=False).size()
Output:
A B size
0 a 1 1
1 a 2 1
2 b 2 2
Let's say n is the name of the dataframe and cst is the column whose repeated items are being counted.
The code below puts the count in a new column:
from collections import Counter
cstn=Counter(n.cst)
cstlist = pd.DataFrame.from_dict(cstn, orient='index').reset_index()
cstlist.columns=['name','cnt']
n['cnt']=n['cst'].map(cstlist.loc[:, ['name','cnt']].set_index('name').iloc[:,0].to_dict())
Hope this works.

Filtering rows from pandas dataframe using concatenated strings

I have a pandas dataframe plus a pandas series of identifiers, and would like to filter the rows from the dataframe that correspond to the identifiers in the series. To get the identifiers from the dataframe, I need to concatenate its first two columns. I have tried various things to filter, but none seem to work so far. Here is what I have tried:
1) I tried adding a column of booleans to the data frame, being true if that row corresponds to one of the identifiers, and false otherwise (hoping to be able to do filtering afterwards using the new column):
df["isInAcids"] = (df["AcNo"] + df["Sortcode"]) in acids
where
acids
is the series containing the identifiers.
However, this gives me a
TypeError: unhashable type
2) I tried filtering using the apply function:
df[df.apply(lambda x: x["AcNo"] + x["Sortcode"] in acids, axis = 1)]
This doesn't give me an error, but the length of the data frame remains unchanged, so it doesn't appear to filter anything.
3) I have added a new column, containing the concatenated strings/identifiers, and then try to filter afterwards (see Filter dataframe rows if value in column is in a set list of values):
df["ACIDS"] = df["AcNo"] + df["Sortcode"]
df[df["ACIDS"].isin(acids)]
But again, the dataframe doesn't change.
I hope this makes sense...
Any suggestions where I might be going wrong?
Thanks,
Anne
I think you're asking for something like the following:
In [1]: other_ids = pd.Series(['a', 'b', 'c', 'c'])
In [2]: df = pd.DataFrame({'vals': [1, 2, 3, 4], 'ids': ['a', 'b', 'c', 'f']})
In [3]: df
Out[3]:
ids vals
0 a 1
1 b 2
2 c 3
3 f 4
In [4]: other_ids
Out[4]:
0 a
1 b
2 c
3 c
dtype: object
In this case, the series other_ids would be like your series acids. We want to select just those rows of df whose id is in the series other_ids. To do that we'll use the column's .isin() method.
In [5]: df.ids.isin(other_ids)
Out[5]:
0 True
1 True
2 True
3 False
Name: ids, dtype: bool
This gives a column of bools that we can index into:
In [6]: df[df.ids.isin(other_ids)]
Out[6]:
ids vals
0 a 1
1 b 2
2 c 3
This is close to what you're doing with your 3rd attempt. Once you post a sample of your dataframe I can edit this answer, if it doesn't work already.
Reading a bit more, you may be having trouble because you have two columns in df that are your ids? We can handle that by calling isin on each column and combining the results, with something like:
In [26]: df = pd.DataFrame({'vals': [1, 2, 3, 4], 'ids': ['a', 'b', 'f', 'f'],
'ids2': ['e', 'f', 'c', 'f']})
In [27]: df
Out[27]:
ids ids2 vals
0 a e 1
1 b f 2
2 f c 3
3 f f 4
In [28]: df.ids.isin(other_ids) + df.ids2.isin(other_ids)
Out[28]:
0 True
1 True
2 True
3 False
dtype: bool
True is like 1 and False is like zero, so adding the two boolean series from the two isin() calls gives something like an OR operation (the | operator expresses the same thing more explicitly). Then, like before, we can index with this boolean series:
In [29]: new = df.loc[df.ids.isin(other_ids) + df.ids2.isin(other_ids)]
In [30]: new
Out[30]:
ids ids2 vals
0 a e 1
1 b f 2
2 f c 3
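For the concatenated-identifier case in the question itself, a minimal sketch might look like the following. The column names AcNo and Sortcode and the series name acids come from the question; the sample values are invented, and the astype(str) calls assume the identifiers should be compared as strings. It also shows that the in operator on a Series tests index labels rather than values, which is one possible reason the apply-based attempt did not behave as expected.

```python
import pandas as pd

# Invented sample data; only the column names come from the question
df = pd.DataFrame({'AcNo': ['111', '222', '333'],
                   'Sortcode': ['AA', 'BB', 'CC'],
                   'vals': [1, 2, 3]})
acids = pd.Series(['111AA', '333CC'])

# Build the identifier by concatenating the two columns as strings
key = df['AcNo'].astype(str) + df['Sortcode'].astype(str)

# isin compares against the VALUES of acids
filtered = df[key.isin(acids)]
print(filtered)

# `in` on a Series checks its index labels (0 and 1 here), not its values
label_check = '111AA' in acids
value_check = '111AA' in acids.tolist()
print(label_check, value_check)
```

If isin still leaves the frame unchanged on real data, comparing the dtypes and stray whitespace of key and acids would be a reasonable next check.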
