I am querying a dataframe like below:
>>> df
A,B,C
1,1,200
1,1,433
1,1,67
1,1,23
1,2,330
1,2,356
1,2,56
1,3,30
If I do part_df = df[(df['A'] == 1) & (df['B'] == 2)], I get a sub-dataframe like this:
>>> part_df
A, B, C
1, 2, 330
1, 2, 356
1, 2, 56
Now I want to make some changes to part_df, like:
part_df['C'] = 0
But the changes are not reflected in the original df at all. I guess this is because a new copy of the dataframe is produced each time, due to numpy's array mechanics. How do I query a dataframe with some conditions, make changes to the selected part as in the example above, and have the values reflected back in the original dataframe in place?
You should do this instead:
In [28]:
df.loc[(df['A'] == 1) & (df['B'] == 2),'C']=0
df
Out[28]:
A B C
0 1 1 200
1 1 1 433
2 1 1 67
3 1 1 23
4 1 2 0
5 1 2 0
6 1 2 0
7 1 3 30
[8 rows x 3 columns]
You should use loc and select the column of interest, 'C', inside the same square brackets, so the assignment is applied to the original dataframe in a single operation.
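For context on why the original attempt fails: part_df = df[mask] followed by part_df['C'] = 0 is chained indexing, so the assignment lands on a temporary copy rather than on df (depending on your pandas version you may also see a SettingWithCopyWarning). A minimal sketch of the contrast, using a made-up three-row frame:

import pandas as pd

df = pd.DataFrame({'A': [1, 1, 1], 'B': [1, 2, 2], 'C': [200, 330, 356]})

# chained indexing: the boolean selection returns a copy, so df is unchanged
part_df = df[(df['A'] == 1) & (df['B'] == 2)]
part_df['C'] = 0  # modifies only the copy

# single .loc call: row selector and column label in one indexer, df itself is modified
df.loc[(df['A'] == 1) & (df['B'] == 2), 'C'] = 0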
I have a dataset given below:
a,b,c
1,1,1
1,1,1
1,1,2
2,1,2
2,1,1
2,2,1
I created crosstab with pandas:
cross_tab = pd.crosstab(index=a, columns=[b, c], rownames=['a'], colnames=['b', 'c'])
My crosstab output is given below:
b 1 2
c 1 2 1
a
1 2 1 0
2 1 1 1
I want to iterate over this crosstab for each given combination of a, b and c values. How can I get values such as cross_tab[a=1][b=1, c=1]? Thank you.
You can use slicers:
a,b,c = 1,1,1
idx = pd.IndexSlice
print (cross_tab.loc[a, idx[b,c]])
2
You can also reshape the crosstab with DataFrame.unstack and reorder_levels, and then use loc:
a = cross_tab.unstack().reorder_levels(('a','b','c'))
print (a)
a b c
1 1 1 2
2 1 1 1
1 1 2 1
2 1 2 1
1 2 1 0
2 2 1 1
dtype: int64
print (a.loc[1,1,1])
2
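Since the question also asks about iterating over every (a, b, c) combination, the reshaped Series lends itself to that directly; a small sketch using the a Series produced above (Series.items() yields the (a, b, c) index tuple together with the count):

for (ai, bi, ci), count in a.items():
    print(ai, bi, ci, count)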
You are looking for Index.get_level_values, applied to both the index and the columns:
In [777]: cross_tab.loc[cross_tab.index.get_level_values('a') == 1,\
(cross_tab.columns.get_level_values('b') == 1)\
& (cross_tab.columns.get_level_values('c') == 1)]
Out[777]:
b 1
c 1
a
1 2
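Note that this kind of boolean selection returns a 1x1 DataFrame rather than a bare number; if you want the scalar itself, one option (my addition, not part of the answer above) is to squeeze it out:

value = cross_tab.loc[cross_tab.index.get_level_values('a') == 1,
                      (cross_tab.columns.get_level_values('b') == 1)
                      & (cross_tab.columns.get_level_values('c') == 1)].squeeze()
print(value)  # 2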
Another way to consider, albeit at the cost of a little readability, is simply to use .loc to navigate the hierarchical index generated by pandas.crosstab. The following example illustrates it:
import pandas as pd
import numpy as np
np.random.seed(1234)
df = pd.DataFrame(
{
"a": np.random.choice([1, 2], 5, replace=True),
"b": np.random.choice([11, 12, 13], 5, replace=True),
"c": np.random.choice([21, 22, 23], 5, replace=True),
}
)
df
Output
a b c
0 2 11 23
1 2 11 23
2 1 12 23
3 2 12 21
4 1 12 21
crosstab output is:
cross_tab = pd.crosstab(
index=df.a, columns=[df.b, df.c], rownames=["a"], colnames=["b", "c"]
)
cross_tab
b 11 12
c 23 21 23
a
1 0 1 1
2 2 1 0
Now let's say you want to access the value where a==2, b==11 and c==23; then simply do
cross_tab.loc[2].loc[11].loc[23]
2
Why does this work? .loc allows one to select by index labels. In the dataframe output by crosstab, our erstwhile column values now become index labels. Thus, with every .loc selection we do, it gives the slice of the dataframe corresponding to that index label. Let's navigate cross_tab.loc[2].loc[11].loc[23] step by step:
cross_tab.loc[2]
yields:
b   c
11  23    2
12  21    1
    23    0
Name: 2, dtype: int64
Next one:
cross_tab.loc[2].loc[11]
Yields:
c
23 2
Name: 2, dtype: int64
And finally we have
cross_tab.loc[2].loc[11].loc[23]
which yields:
2
Why do I say that this reduces the readability a bit? Because to understand this selection you have to be aware of how the crosstab was created, i.e. that the rows are a and the columns are in the order [b, c]. You have to know that to be able to interpret what cross_tab.loc[2].loc[11].loc[23] does. But I have often found that to be a good tradeoff.
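If you would rather not chain three .loc calls, the same cell can usually be reached in a single lookup by passing the row label and a tuple for the MultiIndex columns; this is standard MultiIndex indexing rather than anything specific to this answer:

# row label a=2, column (b=11, c=23)
cross_tab.loc[2, (11, 23)]
2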
I have a dataframe like the following.
df = pd.DataFrame({'A' : ['Bob','Jean','Sally','Sue'], 'B' : [1,2,3, 2],'C' : [7,8,9,8] })
Assume that column A will always be in the dataframe, but there could sometimes also be column B, columns B and C, or any number of additional columns.
I have written code that saves the column names (other than A) in a list, as well as the unique permutations of the values in those other columns. For instance, in this example, columns B and C are saved as:
col = ['B','C']
The permutations in the simple df are 1,7; 2,8; 3,9. For simplicity assume one permutation is saved as follows:
permutation = [2,8]
How do I select the entire rows (and only those) that equal that permutation?
Right now, I am using:
df[df[col].isin(permutation)]
Unfortunately, I don't get the values in column A.
(I know how to drop the NaN values later.) But how should I do this so it stays dynamic? Sometimes there will be multiple columns. (Ultimately, I'll run through a loop and save the different iterations, based upon multiple permutations in the columns other than A.)
Use the intersection of boolean series (where both conditions are true) - first setup code:
import pandas as pd
df = pd.DataFrame({'A' : ['Bob','Jean','Sally','Sue'], 'B' : [1,2,3, 2],'C' : [7,8,9,8] })
col = ['B','C']
permutation = [2,8]
And here's the solution for this limited example:
>>> df[(df[col[0]] == permutation[0]) & (df[col[1]] == permutation[1])]
A B C
1 Jean 2 8
3 Sue 2 8
To break that down:
>>> b, c = col
>>> per_b, per_c = permutation
>>> column_b_matches = df[b] == per_b
>>> column_c_matches = df[c] == per_c
>>> intersection = column_b_matches & column_c_matches
>>> df[intersection]
A B C
1 Jean 2 8
3 Sue 2 8
Additional columns and values
To take any number of columns and values, I would create a function:
def select_rows(df, columns, values):
    if not columns or not values:
        raise Exception('must pass columns and values')
    if len(columns) != len(values):
        raise Exception('columns and values must be same length')
    intersection = True
    for c, v in zip(columns, values):
        intersection &= df[c] == v
    return df[intersection]
and to use it:
>>> select_rows(df, col, permutation)
A B C
1 Jean 2 8
3 Sue 2 8
Or you can coerce the permutation to an array and accomplish this with a single comparison, assuming numeric values:
import numpy as np

def select_rows(df, columns, values):
    return df[(df[columns] == np.array(values)).all(axis=1)]

But this does not work with your code sample as given.
I figured out a solution. Aaron's answer above works well if I only have two columns. I need a solution that works regardless of the size of the df (as the size will be 3-7 columns).
df = pd.DataFrame({'A' : ['Bob','Jean','Sally','Sue'], 'B' : [1,2,3, 2],'C' : [7,8,9,8] })
permutation = [2,8]
col = ['B','C']
interim = df[col].isin(permutation)
df[df.index.isin(interim[(interim != 0).all(1)].index)]
You can do it this way:
In [77]: permutation = np.array([0,2,2])
In [78]: col
Out[78]: ['a', 'b', 'c']
In [79]: df.loc[(df[col] == permutation).all(axis=1)]
Out[79]:
a b c
10 0 2 2
15 0 2 2
16 0 2 2
Your solution will not always work properly:
sample DF:
In [71]: df
Out[71]:
a b c
0 0 2 1
1 1 1 1
2 0 1 2
3 2 0 1
4 0 1 0
5 2 0 0
6 2 0 0
7 0 1 0
8 2 1 0
9 0 0 0
10 0 2 2
11 1 0 1
12 2 1 1
13 1 0 0
14 2 1 0
15 0 2 2
16 0 2 2
17 1 0 2
18 0 1 1
19 1 2 0
In [67]: col = ['a','b','c']
In [68]: permutation = [0,2,2]
In [69]: interim = df[col].isin(permutation)
Pay attention to the result:
In [70]: df[df.index.isin(interim[(interim != 0).all(1)].index)]
Out[70]:
a b c
5 2 0 0
6 2 0 0
9 0 0 0
10 0 2 2
15 0 2 2
16 0 2 2
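The underlying reason is that isin tests each cell independently for membership in permutation, ignoring which column a value should sit in, so any row built entirely from the values 0 and 2 passes. You can see it on row 9 (all zeros), for example:

# every cell of row 9 is 0, and 0 appears in permutation,
# so isin marks the whole row True even though (0, 0, 0) != (0, 2, 2)
print(df.loc[[9]].isin(permutation))
      a     b     c
9  True  True  True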
I want to eliminate all rows that are equal to a certain values (or in a certain range) within a dataframe with a large number of columns. For example, if I had the following dataframe:
a b
0 1 0
1 2 1
2 3 2
3 0 3
and wanted to remove all rows containing 0, I could use:
a_df[(a_df['a'] != 0) & (a_df['b'] !=0)]
but this becomes a pain when you're dealing with a large number of columns. It could be done as:
for i in a_df.columns.values:
    a_df = a_df[a_df[i] != 0]
but this seems inefficient. Is there a better way to do this?
Just do it for the whole df and call dropna:
In [45]:
df[df != 0].dropna()
Out[45]:
a b
1 2 1
2 3 2
The condition df != 0 produces a boolean mask:
In [47]:
df != 0
Out[47]:
a b
0 True False
1 True True
2 True True
3 False True
When this is combined with the df it produces NaN values where the condition is not met:
In [48]:
df[df != 0]
Out[48]:
a b
0 1 NaN
1 2 1
2 3 2
3 NaN 3
Calling dropna then drops any row containing a NaN value.
Here's a variant of EdChum's approach. You could do df != 0 and then use all to make your selector:
>>> (df != 0).all(axis=1)
0 False
1 True
2 True
3 False
dtype: bool
and then use this to select:
>>> df.loc[(df != 0).all(axis=1)]
a b
1 2 1
2 3 2
The advantage of this is that it keeps NaNs if you want, e.g.
>>> df
a b
0 1 0
1 2 NaN
2 3 2
3 0 3
>>> df.loc[(df != 0).all(axis=1)]
a b
1 2 NaN
2 3 2
>>> df[(df != 0)].dropna()
a b
2 3 2
As you mentioned in your question, you may also need to drop rows whose values fall in a certain range or set. You can do that as follows; suppose the values to drop are 0, 10 and 20:
frame = DataFrame({'b': [4, 7, -3, 2], 'a': [0, 1, 0, 1]})
mask = frame.applymap(lambda x: x not in [0, 10, 20])
frame[mask.all(axis=1)]
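A variant of the same idea that avoids the elementwise lambda (my own rewrite, assuming the goal is simply to drop any row containing one of the listed values) uses isin directly:

# mark cells equal to one of the unwanted values, flag rows containing any such cell, and invert
frame[~frame.isin([0, 10, 20]).any(axis=1)]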
I want to index a Pandas dataframe using a boolean mask, then set a value in a subset of the filtered dataframe based on an integer index, and have this value reflected in the dataframe. That is, I would be happy if this worked on a view of the dataframe.
Example:
In [293]:
df = pd.DataFrame({'a': [0, 1, 2, 3, 4, 5, 6, 7],
'b': [5, 5, 2, 2, 5, 5, 2, 2],
'c': [0, 0, 0, 0, 0, 0, 0, 0]})
mask = (df['a'] < 7) & (df['b'] == 2)
df.loc[mask, 'c']
Out[293]:
2 0
3 0
6 0
Name: c, dtype: int64
Now I would like to set the values of the first two elements returned in the filtered dataframe. Chaining an iloc onto the loc call above works to index:
In [294]:
df.loc[mask, 'c'].iloc[0: 2]
Out[294]:
2 0
3 0
Name: c, dtype: int64
But not to assign:
In [295]:
df.loc[mask, 'c'].iloc[0: 2] = 1
print(df)
a b c
0 0 5 0
1 1 5 0
2 2 2 0
3 3 2 0
4 4 5 0
5 5 5 0
6 6 2 0
7 7 2 0
Making the assigned value the same length as the slice (i.e. = [1, 1]) also doesn't work. Is there a way to assign these values?
This does work, but it's a little ugly: basically we use the index generated from the mask and make an additional call to loc:
In [57]:
df.loc[df.loc[mask,'c'].iloc[0:2].index, 'c'] = 1
df
Out[57]:
a b c
0 0 5 0
1 1 5 0
2 2 2 1
3 3 2 1
4 4 5 0
5 5 5 0
6 6 2 0
7 7 2 0
So breaking the above down:
In [60]:
# take the index from the mask and iloc
df.loc[mask, 'c'].iloc[0: 2]
Out[60]:
2 0
3 0
Name: c, dtype: int64
In [61]:
# call loc using this index, we can now use this to select column 'c' and set the value
df.loc[df.loc[mask,'c'].iloc[0:2].index]
Out[61]:
a b c
2 2 2 0
3 3 2 0
How about:
ix = df.index[mask][:2]
df.loc[ix, 'c'] = 1
Same idea as EdChum but more elegant as suggested in the comment.
EDIT: You have to be a little bit careful with this one, as it may give unwanted results with a non-unique index, since there could be multiple rows indexed by any of the labels in ix above. If the index is non-unique and you only want the first 2 (or n) rows that satisfy the boolean key, it would be safer to use .iloc with integer indexing, with something like
ix = np.where(mask)[0][:2]
df.iloc[ix, df.columns.get_loc('c')] = 1
I don't know if this is any more elegant, but it's a little different:
mask = mask & (mask.cumsum() < 3)
df.loc[mask, 'c'] = 1
a b c
0 0 5 0
1 1 5 0
2 2 2 1
3 3 2 1
4 4 5 0
5 5 5 0
6 6 2 0
7 7 2 0
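To see why the cumsum trick works on this data: the original mask is True at positions 2, 3 and 6, and mask.cumsum() gives the running count of True values, so (mask.cumsum() < 3) turns False from the third match onward and the combined mask keeps only the first two matching rows. A quick check (evaluated on the original mask, before the reassignment above):

# index:   0  1  2  3  4  5  6  7
# mask:    F  F  T  T  F  F  T  F
# cumsum:  0  0  1  2  2  2  3  3
print(mask.cumsum())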
I'm trying to perform a specific operation on a dataframe.
Given the following dataframe:
df1 = pd.DataFrame({
'id': [0, 1, 2, 1, 3, 0],
'letter': ['a','b','c','b','b','a'],
'status':[0,1,0,0,0,1]})
id letter status
0 a 0
1 b 1
2 c 0
1 b 0
3 b 0
0 a 1
I'd like to create another dataframe which contains rows from df1 based on the following restriction.
If 2 or more rows have the same id and letter, then return whichever row has a status of 1. All other rows must be copied over.
The resulting dataframe should look like this:
id letter status
0 a 1
1 b 1
2 c 0
3 b 0
Any help is greatly appreciated. Thank you
This should work:
>>> fn = lambda obj: obj[obj.status == 1] if any(obj.status == 1) else obj
>>> df1.groupby(['id', 'letter'], as_index=False).apply(fn)
id letter status
5 0 a 1
1 1 b 1
2 2 c 0
4 3 b 0
[4 rows x 3 columns]
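One practical note on the output above: apply keeps the original row labels (5, 1, 2, 4). If you want a consecutively indexed result, one option (my addition, not part of the answer) is to sort and reset the index afterwards:

result = df1.groupby(['id', 'letter'], as_index=False).apply(fn)
# sort by id for readability and drop the leftover row labels
result = result.sort_values('id').reset_index(drop=True)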
Sort by status first and then use groupby on both key columns:
In [1932]: df1.sort_values(by='status').groupby(['id', 'letter'], as_index=False).last()
Out[1932]:
id letter status
0 0 a 1
1 1 b 1
2 2 c 0
3 3 b 0