Select pandas Series elements based on condition - python

Given a dataframe, I know I can select rows by condition using the following syntax:
df[df['colname'] == 'Target Value']
But what about a Series? Series does not have a column (axis 1) name, right?
My scenario is that I have created a Series through the nunique() function:
sr = df.nunique()
And I want to list out the index names of those rows with value 1.
Having failed to find a clear answer on the net, I resorted to the solution below:
for (colname, coldata) in sr.items():  # iteritems() is deprecated; use items()
    if coldata == 1:
        print(colname)
Question: what is a better way to get my answer (i.e. to list out the index names of the Series, or equivalently the column names of the original DataFrame, that hold just a single value)?
The ultimate objective was to find which columns in a DataFrame have one and only one unique value. Since I did not know how to do that directly from a DataFrame, I first used nunique(), which gave me a Series. Thus I needed to filter the Series with "== 1" (i.e. one and only one).
I hope my question isn't silly.

It is unclear what you want: do you want to work on the DataFrame or on the Series?
Case 1: Working on DataFrame
If you want to work on the DataFrame and list out the index names of those rows that contain the value 1, you can try:
df.index[(df == 1).any(axis=1)].tolist()
Demo
data = {'Col1': [0, 1, 2, 2, 0], 'Col2': [0, 2, 2, 1, 2], 'Col3': [0, 0, 0, 0, 1]}
df = pd.DataFrame(data)
Col1 Col2 Col3
0 0 0 0
1 1 2 0
2 2 2 0
3 2 1 0
4 0 2 1
Then, run the code, it gives:
[1, 3, 4]
Case 2: Working on Series
If you want to extract the index entries of a Series where the value is 1, you can collect them into a list as follows:
sr.loc[sr == 1].index.tolist()
or use:
sr.index[sr == 1].tolist()
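Tying this back to the original goal (finding which DataFrame columns hold exactly one unique value), here is a minimal sketch, assuming a small made-up frame in which column "b" is constant:

```python
import pandas as pd

# Hypothetical sample: "b" holds a single unique value, "a" and "c" do not
df = pd.DataFrame({"a": [1, 2, 3], "b": [7, 7, 7], "c": ["x", "x", "y"]})

# Build the nunique() Series, boolean-index it, and grab the index labels
single_valued = df.nunique()[df.nunique() == 1].index.tolist()
print(single_valued)  # ['b']

# Or skip the intermediate step and mask the columns directly
single_valued2 = df.columns[df.nunique() == 1].tolist()
print(single_valued2)  # ['b']
```

Both forms give the same list; the second reads more directly as "the columns where nunique() is 1".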

It works the same way as for a DataFrame, because pandas overloads the == operator:
selected_series = series[series == my_value]

Related

dataframe group by for all columns in new dataframe

I want to create a new dataframe with the values grouped by each column header. This is the dataset I'm working with:
I essentially want a new dataframe which sums the occurrences of 1 and 0 for each feature (chocolate, fruity, etc.).
I tried this code with the groupby and size functions:
chocolate = data.groupby(["chocolate"]).size()
bar = data.groupby(["bar"]).size()
hard = data.groupby(["hard"]).size()
display(chocolate, bar, hard)
but this only gives me the counts per feature, one Series at a time.
This is the end result I want:
You could try the following:
res = (
data
.drop(columns="competitorname")
.melt().value_counts()
.unstack()
.fillna(0).astype("int").T
)
Eliminate the columns that aren't relevant (I've only seen competitorname, but there could be more).
.melt the dataframe. The result has 2 columns: one with the column names and another with the respective 0/1 values.
Now .value_counts gives you a series that essentially contains what you are looking for.
Then you just have to .unstack the first index level (column names) and transpose the dataframe.
Example:
data = pd.DataFrame({
"competitorname": ["A", "B", "C"],
"chocolate": [1, 0, 0], "bar": [1, 0, 1], "hard": [1, 1, 1]
})
competitorname chocolate bar hard
0 A 1 1 1
1 B 0 0 1
2 C 0 1 1
Result:
variable bar chocolate hard
value
0 1 2 0
1 2 1 3
Alternative with .pivot_table:
res = (
data
.drop(columns="competitorname")
.melt().value_counts().to_frame()
.pivot_table(index="value", columns="variable", fill_value=0)
.droplevel(0, axis=1)
)
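As a side note, a simpler (if less flexible) way to get the same counts table is to run value_counts column-wise; a sketch, assuming the same hypothetical data as above:

```python
import pandas as pd

data = pd.DataFrame({
    "competitorname": ["A", "B", "C"],
    "chocolate": [1, 0, 0], "bar": [1, 0, 1], "hard": [1, 1, 1],
})

# Count the values of each feature column; fillna covers values that
# never occur in a given column (e.g. no 0 in "hard")
res = (
    data.drop(columns="competitorname")
        .apply(pd.Series.value_counts)
        .fillna(0)
        .astype(int)
)
print(res)
```

This yields the same rows (one per value 0/1) and columns (one per feature) as the melt-based versions, just without the intermediate long-format frame.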
PS: Please don't post images; provide a little example (like here) that encapsulates your problem.

Selecting rows based on Boolean values in a non-dangerous way

This is an easy question since it is so fundamental. See - in R, when you want to slice rows from a dataframe based on some condition, you just write the condition and it selects the corresponding rows. For example, if you have a condition such that only the third row in the dataframe meets the condition it returns the third row. Easy Peasy.
In python, you have to use loc. IF the index matches the row numbers then everything is great. IF you have been removing rows or re-ordering them for any reason, you have to remember that - since loc is based on INDEX NOT ROW POSITION. So if in your current dataframe the third row matches your boolean conditional in the loc statement - then it will retrieve the index with a number 3 - which could be the 50th row, rather than your current third row. This seems to be an incredibly dangerous way to select rows, so I know I am doing something wrong.
So what is the best practice method of ensuring you select the nth row based on a boolean conditional? Is it just to use loc and "always remember to use reset_index - otherwise if you miss it, even once your entire dataframe is wrecked"? This can't be it.
Use iloc instead of loc for integer-based (positional) indexing:
data = {'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]}
df = pd.DataFrame(data, index=[1, 2, 3])
df
Dataset:
A B C
1 1 4 7
2 2 5 8
3 3 6 9
Label-based:
df.loc[1]
Results:
A 1
B 4
C 7
Integer-based:
df.iloc[1]
Results:
A 2
B 5
C 8
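It is also worth noting that a boolean mask built from the frame itself lines up row-for-row, so plain boolean indexing stays safe even after rows have been removed or reordered; and when you really need positions, np.flatnonzero turns such a mask into integers for iloc. A sketch, assuming the same small frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]}, index=[1, 2, 3])

# A mask derived from df aligns by position, so no reset_index is needed
matches = df[df["A"] > 1]

# Convert the mask to row positions to pick, say, the first matching row
positions = np.flatnonzero((df["A"] > 1).to_numpy())
first_match = df.iloc[positions[0]]
print(positions)  # [1 2]
```

The danger described in the question arises mainly when integer labels are fed to loc directly; boolean masks sidestep it.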

Lookup a single value using a multi-column key from Pandas DataFrame

This thread doesn't seem to cover a situation I am routinely in.
Return single cell value from Pandas DataFrame
How does one return a single value, not a series or dataframe using a set of column conditions as keys? This seems to be a common need. Say you have a database of info and you need to pluck answers to questions from it, but you need one answer, not a series of possible answers. My method seems "hokey" -- not Pythonic? And maybe not good for technical reasons.
import pandas as pd
d = {'A': [1, 1, 1, 2, 2, 2, 3, 3, 3],
     'B': [1, 2, 3, 1, 2, 3, 1, 2, 3],
     'C': [1, 3, 5, 2, 9, 7, 4, 3, 2]}
df = pd.DataFrame(data=d)
df looks like:
A B C
0 1 1 1
1 1 2 3
2 1 3 5
3 2 1 2
4 2 2 9
5 2 3 7
6 3 1 4
7 3 2 3
8 3 3 2
How to get the value in the C column where A == 1 and B == 3? In my case it's always unique, but I can see how that cannot be assumed so this method returns a series:
df[(df['A'] == 1) & (df['B'] == 3)]['C']
I don't want a series. So how to get a single value, not a series or list of one row or one element?
My method:
df[(df['A'] == 1) & (df['B'] == 3)]['C'].tolist()[0]
In the Pandas library it seems DataFrame.at is the way to go, but this method doesn't look better, though I wonder if it is technically better:
df.at[df.loc[(df['A'] == 1) & (df['B'] == 3)].index[0], 'C']
So, in your opinion, what is the best way to using multiple column conditions to find a value in a dataframe and return a single value (not a list or series)?
I have sat with the same question a few times in the past. I have come to accept that it's not actually that common for me to do this anyway, so I usually just do this:
df.loc[(df['A'] == 1) & (df['B'] == 3), "C"].iat[0]
# frequently I also like to make it more readable like this
is1and3 = (df['A'] == 1) & (df['B'] == 3)
df.loc[is1and3, "C"].iat[0]
This is almost the same as
df.at[df.loc[(df['A'] == 1) & (df['B'] == 3)].index[0], 'C']
which essentially just grabs the first index matching the condition and passes it to .at, rather than subsetting and then grabbing the first returned value with .iat[0], but I don't really like seeing .loc and .index in the call to .at.
Obviously the problem that pandas needs to handle is that there is no guarantee that a condition will only be satisfied by exactly one value in the df, so it's left to the user to handle that.
If the combination of columns A and B is unique, then we can set the index in advance to efficiently retrieve a single value:
df.set_index(['A', 'B']).loc[(1, 3), 'C']
Alternative approach with item
df.loc[df['A'].eq(1) & df['B'].eq(3), 'C'].item()
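Building on the point that pandas leaves the uniqueness check to the user, one way to make that assumption explicit is a small helper that raises when the match is not unique; a sketch, where lookup_one is a hypothetical name:

```python
import pandas as pd

df = pd.DataFrame({
    "A": [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "B": [1, 2, 3, 1, 2, 3, 1, 2, 3],
    "C": [1, 3, 5, 2, 9, 7, 4, 3, 2],
})

def lookup_one(frame, cond, col):
    """Return the single value of `col` on rows matching `cond`,
    raising ValueError if the match is not exactly one row."""
    result = frame.loc[cond, col]
    if len(result) != 1:
        raise ValueError(f"expected exactly one match, got {len(result)}")
    return result.iat[0]

value = lookup_one(df, (df["A"] == 1) & (df["B"] == 3), "C")
print(value)  # 5
```

A silent .iat[0] hides accidental duplicates; the explicit length check turns them into an immediate, readable error.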

Find index of cell in dataframe

I would like to modify the cell value based on its size.
If the dataframe is as below:
A B C
25802523 X1 2
M25JK0010 Y1 1
K25JK0010 Y2 1
I would like to modify column 'A' and insert the result into another column.
For example, if the size of the first cell value in column A is 8, I would like to break it and take the last 5 characters; similarly for the others, depending on the size of each cell.
Is there any way I can do this?
You can do this:
t = pd.DataFrame({'A': ['25802523', 'M25JK00010', 'KRJOJR4445'],
                  'size': [2, 1, 8]})
Define a dictionary mapping each size to your desired final length. Here, if the size is 8, I will take the last 5 characters:
size_dict = {8: 5, 2: 3, 1: 4}
Then use a simple pandas apply (a negative slice takes the last n characters):
t['A_bis'] = t.apply(lambda x: x['A'][-size_dict[x['size']]:], axis=1)
The result is
0 523 >> 3 last characters (key 2)
1 0010 >> 4 last characters (key 1)
2 R4445 >> 5 last characters (key 8)
Another approach to do this:
Sample df:
t = pd.DataFrame({'A': ['25802523', 'M25JK00010', 'KRJOJR4445']})
Get the length of each element of A:
t['Count'] = t['A'].apply(len)
Then write a condition to replace:
t.loc[t.Count == 8, 'Number'] = t['A'].str[-5:]
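For completeness, the same per-row slicing can be done without .apply by pairing each string with its mapped length in a list comprehension; a sketch reusing the first answer's sample and size_dict:

```python
import pandas as pd

t = pd.DataFrame({'A': ['25802523', 'M25JK00010', 'KRJOJR4445'],
                  'size': [2, 1, 8]})
size_dict = {8: 5, 2: 3, 1: 4}

# zip walks the two columns in parallel; each string keeps only its
# last size_dict[size] characters
t['A_bis'] = [a[-size_dict[s]:] for a, s in zip(t['A'], t['size'])]
print(t['A_bis'].tolist())  # ['523', '0010', 'R4445']
```

For row-wise string work like this, a plain comprehension is often faster than DataFrame.apply with axis=1, since it avoids building a Series per row.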

Python pandas - select by row

I am trying to select rows in a pandas data frame based on its values matching those of another data frame. Crucially, I only want to match values within rows, not throughout the whole series. For example:
df1 = pd.DataFrame({'a':[1, 2, 3], 'b':[4, 5, 6]})
df2 = pd.DataFrame({'a':[3, 2, 1], 'b':[4, 5, 6]})
I want to select rows where both 'a' and 'b' values from df1 match any row in df2. I have tried:
df1[(df1['a'].isin(df2['a'])) & (df1['b'].isin(df2['b']))]
This of course returns all rows, as all the values are present in df2 at some point, but not necessarily in the same row. How can I limit this so the values tested for 'b' are only those rows where the value of 'a' was found? So with the example above, I expect only row index 1 ([2, 5]) to be returned.
Note that data frames may be of different shapes, and contain multiple matching rows.
Similar to this post, here's one approach using broadcasting:
df1[(df1.values == df2.values[:,None]).all(-1).any(0)]
The idea is:
1) Use .all(-1) for the "both" part in "both 'a' and 'b' values".
2) Use .any(0) for the "any" part in "from df1 match any row in df2".
3) Use broadcasting to do all of this in a vectorized fashion, extending dimensions with None/np.newaxis.
Sample run -
In [41]: df1
Out[41]:
a b
0 1 4
1 2 5
2 3 6
In [42]: df2 # Modified to add another row : [1,4] for variety
Out[42]:
a b
0 3 4
1 2 5
2 1 6
3 1 4
In [43]: df1[(df1.values == df2.values[:,None]).all(-1).any(0)]
Out[43]:
a b
0 1 4
1 2 5
Use NumPy broadcasting to build a df1 x df2 boolean match table (reduce it with .any(axis=1) to filter df1):
pd.DataFrame((df1.values[:, None] == df2.values).all(2),
pd.Index(df1.index, name='df1'),
pd.Index(df2.index, name='df2'))
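Another common approach, not shown in the answers above, is an inner merge on the shared columns, which keeps exactly the rows of df1 that also appear as rows of df2; a sketch (drop_duplicates guards against duplicate df2 rows multiplying matches):

```python
import pandas as pd

df1 = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
df2 = pd.DataFrame({'a': [3, 2, 1], 'b': [4, 5, 6]})

# Inner merge on both columns: a row survives only if the (a, b) pair
# occurs in both frames
common = df1.merge(df2.drop_duplicates(), on=['a', 'b'], how='inner')
print(common)
```

With the sample data this keeps only the row [2, 5], matching the broadcasting result, and it stays readable as the number of columns grows.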
