I've been looking for a solution to my problem for an entire day and cannot find the answer. I'm trying to follow the example from this topic: Get column name where value is something in pandas dataframe
to make a version with multiple conditions.
I want to extract the column names (as a list) where:
value == 4 and/or value == 3
+
only if there is no 4 or 3, then extract the column names where value == 2
Example:
data = {'Name': ['Tom', 'Joseph', 'Krish', 'John'], 'acne': [1, 4, 1, 2], 'wrinkles': [1, 3, 4, 4],'darkspot': [2, 2, 3, 4] }
df1 = pd.DataFrame(data)
df1
'''
Name acne wrinkles darkspot
0 Tom 1 1 2
1 Joseph 4 3 2
2 Krish 1 4 3
3 John 2 4 4
'''
The result I'm looking for:
df2
'''
Name acne wrinkles darkspot problem
0 Tom 1 1 2 [darkspot]
1 Joseph 4 3 2 [acne, wrinkles]
2 Krish 1 4 3 [wrinkles, darkspot]
3 John 2 4 4 [wrinkles, darkspot]
'''
I tried the apply function with a lambda, as detailed in the topic I mentioned above, but it can only take one argument.
Many thanks if somebody can help me :)
You can use a boolean mask:
problems = ['acne', 'wrinkles', 'darkspot']
m1 = df1[problems].isin([3, 4]) # main condition
m2 = df1[problems].eq(2) # fallback condition
# rows with no 3/4 at all fall back to the == 2 condition
mask = m1 | (m1.loc[~m1.any(axis=1)] | m2)
# turn each True into its column name, then collect the non-empty names per row
df1['problem'] = mask.mul(problems).apply(lambda x: [i for i in x if i], axis=1)
Output:
>>> df1
Name acne wrinkles darkspot problem
0 Tom 1 1 2 [darkspot]
1 Joseph 4 3 2 [acne, wrinkles]
2 Krish 1 4 3 [wrinkles, darkspot]
3 John 2 4 4 [wrinkles, darkspot]
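The mask.mul(problems) step works because multiplying a boolean by a string keeps the string where the mask is True and gives an empty string where it is False. A minimal illustration (the frame here is made up just for the example):
m = pd.DataFrame({'acne': [True, False], 'wrinkles': [False, True]})
m.mul(['acne', 'wrinkles'])   # row 0 -> 'acne', ''    row 1 -> '', 'wrinkles'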
You can use a Boolean mask to figure out which columns you need.
First check if any of the values are 3 or 4, and then if not, check if any of the values are 2. Form the composite mask (variable m below) with an | (or) between those two conditions.
Finally, you can replace the False values with NaN; that way, when you stack and groupby(...).agg(list), you're left with just the column labels for the True values.
cols = ['acne', 'wrinkles', 'darkspot']
m1 = df1[cols].isin([3, 4])
# If no `3` or `4` on the rows, check if there is a `2`
m2 = pd.DataFrame((~m1.any(1)).to_numpy()[:, None] & df1[cols].eq(2).to_numpy(),
index=m1.index, columns=m1.columns)
m = (m1 | m2)
# acne wrinkles darkspot
#0 False False True
#1 True True False
#2 False True True
#3 False True True
# Assignment aligns on original DataFrame index, i.e. `'level_0'`
df1['problem'] = m.where(m).stack().reset_index().groupby('level_0')['level_1'].agg(list)
print(df1)
Name acne wrinkles darkspot problem
0 Tom 1 1 2 [darkspot]
1 Joseph 4 3 2 [acne, wrinkles]
2 Krish 1 4 3 [wrinkles, darkspot]
3 John 2 4 4 [wrinkles, darkspot]
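If the chained expression is hard to parse, here is the same pipeline split into named steps (the intermediate variable names are only for illustration):
kept = m.where(m)             # keep True, turn False into NaN
long = kept.stack()           # drops NaN; index becomes (row, column) pairs
pairs = long.reset_index()    # columns: level_0 (row), level_1 (column name)
df1['problem'] = pairs.groupby('level_0')['level_1'].agg(list)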
Related
I have a dataframe with employees and their level(s).
import pandas as pd
d = {'employees': ["John", "Jamie", "Ann", "Jane", "Kim", "Steve"], 'Level': ["A/Ba", "C/A", "A", "C", "Ba/C", "D"]}
df = pd.DataFrame(data=d)
How do I add a new column that counts the number of other employees sharing any of the same levels? For example, John would have 3, as there are two other A's (Jamie and Ann) and one other Ba (Kim). Note that the employee's own levels (John's, in this case) do not contribute to that count.
My goal is for the end dataframe to be this.
Try this:
df['Number of levels'] = df['Level'].str.split('/').explode().map(df['Level'].str.split('/').explode().value_counts()).sub(1).groupby(level=0).sum()
Output:
>>> df
employees Level Number of levels
0 John A/Ba 3
1 Jamie C/A 4
2 Ann A 2
3 Jane C 2
4 Kim Ba/C 3
5 Steve D 0
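The same one-liner, split into intermediate steps for readability (identical logic, just with named variables):
exploded = df['Level'].str.split('/').explode()   # one row per individual level
counts = exploded.value_counts()                  # total occurrences of each level
df['Number of levels'] = (
    exploded.map(counts)   # per entry: how many employees have that level (incl. this one)
            .sub(1)        # exclude the employee's own occurrence
            .groupby(level=0)
            .sum()         # add the counts back up per employee
)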
exploded = df.Level.str.split("/").explode()
counts = exploded.groupby(exploded).transform("count").sub(1)
df["Num Levels"] = counts.groupby(level=0).sum()
We first explode the "Level" column by splitting over "/" so we can reach each level:
>>> exploded = df.Level.str.split("/").explode()
>>> exploded
0 A
0 Ba
1 C
1 A
2 A
3 C
4 Ba
4 C
5 D
Name: Level, dtype: object
We now need the count of each element in this series, so we group it by itself and transform with "count":
>>> exploded.groupby(exploded).transform("count")
0 3
0 2
1 3
1 3
2 3
3 3
4 2
4 3
5 1
Name: Level, dtype: int64
Since this count includes the element itself, but we only want the occurrences elsewhere, we subtract 1:
>>> counts = exploded.groupby(exploded).transform("count").sub(1)
>>> counts
0 2
0 1
1 2
1 2
2 2
3 2
4 1
4 2
5 0
Name: Level, dtype: int64
Now, we need to "come back", and the index is our helper for that; we group by it (level=0 means that) and sum the counts thereof:
>>> counts.groupby(level=0).sum()
0 3
1 4
2 2
3 2
4 3
5 0
Name: Level, dtype: int64
This is the end result, which is assigned to df["Num Levels"] to get:
employees Level Num Levels
0 John A/Ba 3
1 Jamie C/A 4
2 Ann A 2
3 Jane C 2
4 Kim Ba/C 3
5 Steve D 0
This is all writable in "one line", but that may hinder readability and further debugging!
df["Num Levels"] = (df.Level
.str.split("/")
.explode()
.pipe(lambda ex: ex.groupby(ex))
.transform("count")
.sub(1)
.groupby(level=0)
.sum())
Let's say we have the following df with a names column.
df = pd.DataFrame({
'names':['Alan', 'Alan', 'John', 'John', 'Alan', 'Alan','Alan', np.nan, np.nan, np.nan, np.nan, np.nan, 'Christy', 'Christy','John']})
>>> df
names
0 Alan
1 Alan
2 John
3 John
4 Alan
5 Alan
6 Alan
7 NaN
8 NaN
9 NaN
10 NaN
11 NaN
12 Christy
13 Christy
14 John
I would like to run an apply function on the column which returns the maximum number of consecutive times a particular value occurs. At first I would like to do this for NaN, but by extension I would like to be able to switch to any other value in the column.
Explanation:
If we run the apply for NaN, the result would be 5, as 5 is the highest number of times NaN occurs consecutively. If, further down the column after other values, NaN occurred consecutively more than 5 times, then that would be the result instead.
If we run the apply for Alan, the result would be 3, as 3 supersedes the 2 from the first run of consecutive Alans.
df_counts = df.copy()  # work on a copy so the original df is left untouched
df_counts['names'] = df_counts['names'].fillna("NaN")  # replace np.nan with the string "NaN"
df_counts['counts'] = df_counts.names.groupby((df_counts.names != df_counts.names.shift()).cumsum()).transform('size')  # count consecutive names
df_counts = df_counts.sort_values('counts').drop_duplicates("names", keep='last')  # keep only the row with the highest count per name
def get_counts(name):
    return df_counts.loc[df_counts['names'] == name, 'counts'].item()
Then get_counts("Alan") will return 3, and get_counts("NaN") will return 5.
Here's a solution you can use with groupby:
# convert nans to str
df["names"] = df["names"].fillna("NaN")
# assign a subgroup to each set of consecutive rows
df["subgroup"] = df["names"].ne(df["names"].shift()).cumsum()
# take the max length of any subgroup that belongs to "name"
def get_max_consecutive(name):
    return df.groupby(["names", "subgroup"]).apply(len)[name].max()

for name in df.names.unique():
    print(f"{name}: {get_max_consecutive(name)}")
Output:
Alan: 3
John: 2
NaN: 5
Christy: 2
Explanation:
pandas.Series.ne compares two series and returns a new series that is True where the elements in each row are not equal and False where they are equal.
We can compare df["names"] to itself shifted down by 1 (df["names"].shift()). This will return True whenever the name changes from the previous value.
So this gives us a boolean series where each True marks a change in name:
df["names"].ne(df["names"].shift())
0 True
1 False
2 True
3 False
4 True
5 False
6 False
7 True
8 False
9 False
10 False
11 False
12 True
13 False
14 True
Name: names, dtype: bool
Then, .cumsum is just a cumulative sum of this series. In this case, True is equal to 1 and False is 0. This effectively gives us a new number each time the name changes from the previous value. We can assign this to its own column, subgroup, so we can use groupby with it later.
df.names.ne(df.names.shift()).cumsum()
0 1
1 1
2 2
3 2
4 3
5 3
6 3
7 4
8 4
9 4
10 4
11 4
12 5
13 5
14 6
Name: names, dtype: int64
Lastly, we can use .groupby to group the dataframe using a multi-index on the "names" and "subgroup" columns. Now we can apply the len function to get the length of each subgroup.
df.groupby(["names", "subgroup"]).apply(len)
names subgroup
Alan 1 2
3 3
Christy 5 2
John 2 2
6 1
NaN 4 5
dtype: int64
Bonus: You can turn the series returned by .apply into a dataframe using .reset_index if you'd like to see the len of each name and subgroup:
df_count = df.groupby(["names", "subgroup"]).apply(len).reset_index(name="len")
df_count
Output:
names subgroup len
0 Alan 1 2
1 Alan 3 3
2 Christy 5 2
3 John 2 2
4 John 6 1
5 NaN 4 5
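From that table, the maximum run length per name also falls out of a simple groupby (not part of the original answer, just a convenience):
df_count.groupby("names")["len"].max()
# names
# Alan       3
# Christy    2
# John       2
# NaN        5
# Name: len, dtype: int64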
Since np.nan == np.nan is False, you have to check if the provided value is NaN before counting. For getting consecutive elements you can use itertools' groupby.
from itertools import groupby

def max_consecutives(value):
    if pd.isna(value):
        value_equals = lambda x: pd.isna(x)
    else:
        value_equals = lambda x: x == value

    def max_consecutive_values(col):
        elements_per_group_counter = (
            sum(1 for elem in group if value_equals(elem))
            for _, group in groupby(col)
        )
        return max(elements_per_group_counter)

    return max_consecutive_values
df.apply(max_consecutives(np.nan)) # returns 5
df.apply(max_consecutives("Alan")) # returns 3
I have a DataFrame with columns A, B, and C. For each value of A, I would like to select the row with the minimum value in column B.
That is, from this:
df = pd.DataFrame({'A': [1, 1, 1, 2, 2, 2],
'B': [4, 5, 2, 7, 4, 6],
'C': [3, 4, 10, 2, 4, 6]})
A B C
0 1 4 3
1 1 5 4
2 1 2 10
3 2 7 2
4 2 4 4
5 2 6 6
I would like to get:
A B C
0 1 2 10
1 2 4 4
For the moment I am grouping by column A, then creating a value that indicates which rows I will keep:
a = data.groupby('A').min()
a['A'] = a.index
to_keep = [str(x[0]) + str(x[1]) for x in a[['A', 'B']].values]
data['id'] = data['A'].astype(str) + data['B'].astype('str')
data[data['id'].isin(to_keep)]
I am sure that there is a much more straightforward way to do this.
I have seen many answers here that use MultiIndex, which I would prefer to avoid.
Thank you for your help.
I feel like you're overthinking this. Just use groupby and idxmin:
df.loc[df.groupby('A').B.idxmin()]
A B C
2 1 2 10
4 2 4 4
df.loc[df.groupby('A').B.idxmin()].reset_index(drop=True)
A B C
0 1 2 10
1 2 4 4
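To see why this works: groupby('A').B.idxmin() returns, for each group, the index label of the row holding the minimum B, and df.loc then selects exactly those rows.
df.groupby('A').B.idxmin()
# A
# 1    2
# 2    4
# Name: B, dtype: int64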
Had a similar situation but with a more complex column heading (e.g. "B val") in which case this is needed:
df.loc[df.groupby('A')['B val'].idxmin()]
The accepted answer (suggesting idxmin) cannot be used with the pipe pattern. A pipe-friendly alternative is to first sort values and then use groupby with DataFrame.head:
data.sort_values('B').groupby('A').apply(pd.DataFrame.head, n=1)
This is possible because by default groupby preserves the order of rows within each group, which is stable and documented behaviour (see pandas.DataFrame.groupby).
This approach has additional benefits:
it can be easily expanded to select n rows with smallest values in specific column
it can break ties by providing another column (as a list) to .sort_values(), e.g.:
data.sort_values(['final_score', 'midterm_score']).groupby('year').apply(pd.DataFrame.head, n=1)
As with other answers, to exactly match the result desired in the question .reset_index(drop=True) is needed, making the final snippet:
df.sort_values('B').groupby('A').apply(pd.DataFrame.head, n=1).reset_index(drop=True)
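A closely related spelling (not from the original answer) that is also pipe-friendly uses GroupBy.head directly; it keeps the original index, so append reset_index(drop=True) if you want a 0..n index:
df.sort_values('B').groupby('A').head(1).reset_index(drop=True)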
I found an answer that is a little bit more wordy, but a lot more efficient:
This is the example dataset:
data = pd.DataFrame({'A': [1,1,1,2,2,2], 'B':[4,5,2,7,4,6], 'C':[3,4,10,2,4,6]})
data
Out:
A B C
0 1 4 3
1 1 5 4
2 1 2 10
3 2 7 2
4 2 4 4
5 2 6 6
First we will get the min values on a Series from a groupby operation:
min_value = data.groupby('A').B.min()
min_value
Out:
A
1 2
2 4
Name: B, dtype: int64
Then, we merge this series result onto the original data frame:
data = data.merge(min_value, on='A',suffixes=('', '_min'))
data
Out:
A B C B_min
0 1 4 3 2
1 1 5 4 2
2 1 2 10 2
3 2 7 2 4
4 2 4 4 4
5 2 6 6 4
Finally, we get only the lines where B is equal to B_min and drop B_min since we don't need it anymore.
data = data[data.B==data.B_min].drop('B_min', axis=1)
data
Out:
A B C
2 1 2 10
4 2 4 4
I have tested it on very large datasets and this was the only way I could make it work in a reasonable time.
You can use sort_values and drop_duplicates:
df.sort_values('B').drop_duplicates('A')
Output:
A B C
2 1 2 10
4 2 4 4
The solution, as written before, is:
df.loc[df.groupby('A')['B'].idxmin()]
But if you then get an error like:
"Passing list-likes to .loc or [] with any missing labels is no longer supported.
The following labels were missing: Float64Index([nan], dtype='float64').
See https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#deprecate-loc-reindex-listlike"
In my case, there were NaN values in column B, so I used dropna() and then it worked.
df.loc[df.groupby('A')['B'].idxmin().dropna()]
You can also use boolean indexing to select the rows where column B equals its group minimum:
out = df[df['B'] == df.groupby('A')['B'].transform('min')]
print(out)
A B C
2 1 2 10
4 2 4 4
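This works because transform('min') returns a Series aligned with the original rows (the group minimum repeated for every row of its group), so it can be compared elementwise with df['B']:
df.groupby('A')['B'].transform('min')
# 0    2
# 1    2
# 2    2
# 3    4
# 4    4
# 5    4
# Name: B, dtype: int64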
I want a new column in this df based on the following condition. The education column is a categorical value that goes from 1 to 5 (1 is the lowest level of education and 5 the highest). I want to create a function with the following logic (so as to create a new column in the df).
First, for any id, check whether there is at least one graduated education level; if so, the new column must hold the highest graduated education level.
Second, if there is no graduated education level for a particular id (i.e. all of its education levels are "In course"), take the maximum education level and subtract one.
df
id education stage
1 2 Graduated
1 3 Graduated
1 4 In course
2 3 In course
3 2 Graduated
3 3 In course
4 2 In course
expected output:
id education stage new_column
1 2 Graduated 3
1 3 Graduated 3
1 4 In course 3
2 3 In course 2
3 2 Graduated 2
3 3 In course 2
4 2 In course 1
You can do it like this:
import pandas as pd
df = pd.DataFrame({'id': [1, 1, 1, 2, 3, 3, 4], 'education': [2, 3, 4, 3, 2, 3, 2],
'stage': ['Graduated', 'Graduated', 'In course', 'In course', 'Graduated', 'In course', 'In course']})
max_gr = df[df.stage == 'Graduated'].groupby('id').education.max()
max_ic = df[df.stage == 'In course'].groupby('id').education.max()
# set all cells to the value from max_gr
df['new_col'] = df.id.map(max_gr)
# set cells that have not been filled to the value from max_ic - 1
df.loc[df.new_col.isna(), ['new_col']] = df.id.map(max_ic - 1)
series.map(other_series) returns a new series where each value from series has been replaced by the value in other_series whose index label matches it.
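A tiny illustration of that behaviour (the series here are made up for the example):
s = pd.Series(['a', 'b', 'a'])
lookup = pd.Series({'a': 1, 'b': 2})
s.map(lookup)   # -> 1, 2, 1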
This is one way.
df['new'] = df.loc[df['stage'] == 'Graduated']\
.groupby('id')['education']\
.transform(max).astype(int)
df['new'] = df['new'].fillna(df.loc[df['stage'] == 'InCourse']\
.groupby('id')['education']\
.transform(max).sub(1)).astype(int)
Result
id education stage new
0 1 2 Graduated 3
1 1 3 Graduated 3
2 1 4 InCourse 3
3 2 3 InCourse 2
4 3 2 Graduated 2
5 3 3 InCourse 2
6 4 2 InCourse 1
Explanation
First, map to "Graduated" dataset grouped by id on max education.
Second, map to "InCourse" dataset grouped by id on max education minus 1.
Alternative solution, based on Markus Löffler's answer:
max_ic = df[df.stage.eq('In course')].groupby('id').education.max() - 1
max_gr = df[df.stage.eq('Graduated')].groupby('id').education.max()
# Update with max_gr
max_ic.update(max_gr)
df['new_col'] = df.id.map(max_ic)
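After the update, max_ic holds the target value for every id; the map call then looks it up for each row. For the example data it looks like this:
print(max_ic)
# id
# 1    3
# 2    2
# 3    2
# 4    1
# Name: education, dtype: int64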
I have a Pandas dataset that I want to clean up prior to applying my ML algorithm. I am wondering if it was possible to remove a row if an element of its columns does not match a set of values. For example, if I have the dataframe:
a b
0 1 6
1 4 7
2 2 4
3 3 7
...
And I desire the values of a to be one of [1,3] and of b to be one of [6,7], such that my final dataset is:
a b
0 1 6
1 3 7
...
Currently, my implementation is not working, as some of my data rows have erroneous strings attached to the value. For example, instead of a value of 1 I'll have something like 1abc. Hence I would like to remove anything that is not a clean integer with one of those values.
My workaround is also a bit archaic, as I am removing entries for column a that do not have 1 or 3 via:
dataset = dataset[(dataset.commute != 1)]
dataset = dataset[(dataset.commute != 3)]
You can use boolean indexing with double isin and &:
df1 = df[(df['a'].isin([1,3])) & (df['b'].isin([6,7]))]
print (df1)
a b
0 1 6
3 3 7
Or use numpy.in1d:
df1 = df[(np.in1d(df['a'], [1,3])) & (np.in1d(df['b'], [6,7])) ]
print (df1)
a b
0 1 6
3 3 7
But if you need to remove all rows with non-numeric values, then use to_numeric with errors='coerce', which returns NaN for values that cannot be parsed, and then filter by notnull:
df = pd.DataFrame({'a':['1abc','2','3'],
'b':['4','5','dsws7']})
print (df)
a b
0 1abc 4
1 2 5
2 3 dsws7
mask = (pd.to_numeric(df['a'], errors='coerce').notnull() &
        pd.to_numeric(df['b'], errors='coerce').notnull())
df1 = df[mask].astype(int)
print (df1)
a b
1 2 5
If you need to check whether some value is NaN or None:
df = pd.DataFrame({'a':['1abc',None,'3'],
'b':['4','5',np.nan]})
print (df)
a b
0 1abc 4
1 None 5
2 3 NaN
print (df[df.isnull().any(axis=1)])
a b
1 None 5
2 3 NaN
You can use pandas isin()
df = df[df.a.isin([1,3]) & df.b.isin([6,7])]
a b
0 1 6
3 3 7
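If the columns still contain strings like 1abc (the situation described in the question), you can combine this with the to_numeric idea from the earlier answer, for example (a sketch, not tested against your real data):
a_num = pd.to_numeric(df['a'], errors='coerce')
b_num = pd.to_numeric(df['b'], errors='coerce')
df1 = df[a_num.isin([1, 3]) & b_num.isin([6, 7])]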