How is the grouping done in groupby [duplicate] - python

This question already has an answer here:
How is pandas groupby method actually working?
(1 answer)
Closed 1 year ago.
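import numpy as np
import pandas as pd
rng = np.random.RandomState()  # rng was undefined in the post; unseeded here, so data2 varies per run
df = pd.DataFrame({'key': ['A','B','C','A','B','C'],
                   'data1': range(6), 'data2': rng.randint(0, 10, 6)}, columns=['key','data1','data2'])
l = [0, 1, 0, 1, 2, 0]
df.groupby(l).sum()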

The output of your code is shown here:
https://ibb.co/kgXSwvL
The list l contains three distinct values, 0, 1 and 2, so there are three groups.
Each row of the result is the sum of the rows of df whose position matches a position where that value occurs in l.
For example, 0 appears in l at positions 0, 2 and 5. At those positions data1 has the values 0, 2 and 5, which sum to 7, and data2 has the values 3, 5 and 2, which sum to 10.
Similarly, 1 appears in l at positions 1 and 3, where data1 has the values 1 and 3, which sum to 4.
The result for the value 2 can be verified the same way.
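The positional grouping described above can be checked with a small sketch. It uses only the deterministic data1 column, because the data2 values in the question depend on a random generator whose seed is not given.

```python
import pandas as pd

# Same key/data1 columns as in the question; data2 is left out because
# its values come from an unseeded random generator.
df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data1': range(6)})

# One label per row: row i is assigned to group l[i].
l = [0, 1, 0, 1, 2, 0]

result = df.groupby(l)['data1'].sum()
print(result)
# group 0: rows 0, 2, 5 -> 0 + 2 + 5 = 7
# group 1: rows 1, 3    -> 1 + 3     = 4
# group 2: row 4        -> 4
```

Passing a list of the same length as the dataframe makes groupby treat it as one group label per row, which is why the key column plays no role in the result.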

Related

How to find last occurrence of value meeting condition in column in python

I have the following dataframe:
df = pd.DataFrame({"A":['a','b','c','d','e','f','g','h','i','j','k'],
"B":[1,3,4,5,6,7,6,5,8,5,5]})
df
displayed as:
    A  B
0   a  1
1   b  3
2   c  4
3   d  5
4   e  6
5   f  7
6   g  6
7   h  5
8   i  8
9   j  5
10  k  5
I first want to find the letter in column "A" that corresponds to the first occurrence of a value in column "B" that is >= 6. Looking at this, we see that this would be row index 4, corresponding to a value of 6 and "e" in column "A".
I can identify the column "A" value we just got with this code:
import numpy as np

# Find first occurrence >= threshold
threshold = 6
array = df.values
array[np.where(array[:, 1] >= threshold)][0, 0]
This code returns 'e', which is what I want.
This code is referenced from this Stack Overflow source: Python find first occurrence in Pandas dataframe column 2 below threshold and return column 1 value same row using NumPy
What I am having trouble figuring out is how to modify this code to find the last occurrence of a value >= the threshold of 6. I want to produce 'i', because the row containing "i" in column "A" corresponds to a value of 8 in column "B", which is the last occurrence of a value >= 6. I want to preserve the row order, which is alphabetical by column "A". I am guessing this involves modifying the indexing in my code, specifically the array[:,1] component or the [0,0] component, but I am not sure how to ask for the last occurrence. How can I modify my code to find the value in column "A" corresponding to the last occurrence of a value >= 6 in column "B"?
To get the first occurrence, you can use idxmax:
df.loc[df['B'].ge(6).idxmax()]
output:
A e
B 6
Name: 4, dtype: object
For just the value in 'A':
df.loc[df['B'].ge(6).idxmax(), 'A']
output: 'e'
For the last, do the same on the reversed Series:
df.loc[df.loc[::-1,'B'].ge(6).idxmax()]
output:
A    i
B    8
Name: 8, dtype: object
df.loc[df.loc[::-1, 'B'].ge(6).idxmax(), 'A']
output: 'i'
Here is one way to do it:
select the rows meeting your criteria, then take the last row of the result.
df.loc[df['B'] >= 6][-1:]
With the dataframe defined in the question, this returns:
   A  B
8  i  8
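The two approaches can be put side by side in a short sketch built on the dataframe from the question. The mask.any() guard is an addition not present in the answers: idxmax on an all-False mask still returns the first label, so without the guard a threshold that nothing meets would silently give a wrong row.

```python
import pandas as pd

df = pd.DataFrame({"A": ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k'],
                   "B": [1, 3, 4, 5, 6, 7, 6, 5, 8, 5, 5]})

threshold = 6
mask = df["B"].ge(threshold)  # True where B >= threshold

if mask.any():
    # idxmax returns the label of the first True value
    first = df.loc[mask.idxmax(), "A"]
    # reversing the mask makes the last True value come first
    last = df.loc[mask[::-1].idxmax(), "A"]
    # equivalent: filter the rows, then take the final one
    last_alt = df.loc[mask, "A"].iloc[-1]
else:
    first = last = last_alt = None

print(first, last, last_alt)
```

Both variants for the last occurrence leave the row order of df untouched, since the reversal only applies to the temporary mask.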

How to find elements that are in the first pandas DataFrame and not in the second, and vice versa. python [duplicate]

This question already has answers here:
set difference for pandas
(12 answers)
Closed 11 months ago.
I have two data frames.
first_dataframe
id
9
8
6
5
7
4
second_dataframe
id
6
4
1
5
2
3
Note: My dataframe has many columns, but I need to compare based on id only.
I need to find:
ids that are in the first dataframe and not in the second [7,8,9]
ids that are in the second dataframe and not in the first [1,2,3]
I have searched for an answer, but none of the solutions I've found seem to work for me, because they look for changes based on the index.
Use set subtraction:
inDF1_notinDF2 = set(df1['id']) - set(df2['id']) # Removes all items that are in df2 from df1
inDF2_notinDF1 = set(df2['id']) - set(df1['id']) # Removes all items that are in df1 from df2
Output:
>>> inDF1_notinDF2
{7, 8, 9}
>>> inDF2_notinDF1
{1, 2, 3}
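Since the question mentions that the dataframes have many columns, a variant that keeps the full rows may also be useful. The following is a sketch using isin instead of sets; the names df1 and df2 follow the answer above, and the single id column stands in for the real, wider dataframes.

```python
import pandas as pd

df1 = pd.DataFrame({'id': [9, 8, 6, 5, 7, 4]})
df2 = pd.DataFrame({'id': [6, 4, 1, 5, 2, 3]})

# Keep the rows whose id does not appear in the other dataframe.
# The comparison uses the id values only and ignores the index.
only_in_df1 = df1[~df1['id'].isin(df2['id'])]
only_in_df2 = df2[~df2['id'].isin(df1['id'])]

print(only_in_df1['id'].tolist())
print(only_in_df2['id'].tolist())
```

Unlike the set version, this keeps the original row order and every other column of the matching rows.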

Getting highest value out of a dataframe with value_counts()

I want to print the highest value in my dataframe that is not unique, i.e. that appears more than once.
With df['Value'].value_counts() I can count them, but how do I select values by how often they appear?
Value
1
2
1
2
3
2
As I understand it, you want the highest value that has a frequency greater than 1. In this case you can write:
for val, cnt in df['Value'].value_counts().sort_index(ascending=False).items():
    if cnt > 1:
        print(val)
        break
The sort_index call sorts the counts by 'Value' rather than by frequency. For example, if your 'Value' column has the values [1, 2, 3, 3, 2, 2, 2, 1, 3, 2], then df['Value'].value_counts().sort_index(ascending=False) gives:
3 3
2 5
1 2
Name: Value, dtype: int64
The answer in this example would then be 3 since it is the first highest value with frequency greater than 1.
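The same result can also be obtained without an explicit loop. This is a sketch of an alternative, not the answerer's method, and the helper name highest_repeated_value is hypothetical.

```python
import pandas as pd

def highest_repeated_value(values: pd.Series):
    """Return the largest value occurring more than once, or None if there is none."""
    counts = values.value_counts()
    repeated = counts[counts > 1]  # keep only values seen at least twice
    return repeated.index.max() if not repeated.empty else None

# Data from the question: 1 and 2 repeat, 3 occurs once.
df = pd.DataFrame({'Value': [1, 2, 1, 2, 3, 2]})
result = highest_repeated_value(df['Value'])
print(result)
```

With the question's data this selects 2, because 3 is higher but appears only once.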

Add new column based on odd/even condition [duplicate]

This question already has answers here:
How to conditionally update DataFrame column in Pandas based on list
(2 answers)
Check if a number is odd or even in Python [duplicate]
(6 answers)
Closed 3 years ago.
Let's say I have the dataframe below:
my_df = pd.DataFrame({'A': [1, 2, 3]})
my_df
A
0 1
1 2
2 3
I want to add a column B with values X if the corresponding number in A is odd, otherwise Y. I would like to do it in this way if possible:
my_df['B'] = np.where(my_df['A'] IS ODD, 'X', 'Y')
I don't know how to check if the value is odd.
You were so close!
my_df['B'] = np.where(my_df['A'] % 2 != 0, 'X', 'Y')
value % 2 != 0 checks whether a number is odd, whereas value % 2 == 0 checks whether it is even.
Output:
   A  B
0  1  X
1  2  Y
2  3  X
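A self-contained version of the check, as a sketch. The extra values -5 and 0 are not in the question; they are added here only to show that the test also behaves sensibly for negative numbers and zero, because the remainder in NumPy takes the sign of the divisor.

```python
import numpy as np
import pandas as pd

# 1, 2, 3 come from the question; -5 and 0 are illustrative additions.
my_df = pd.DataFrame({'A': [1, 2, 3, -5, 0]})

# A % 2 is 1 for odd numbers (including negative ones) and 0 for even numbers.
my_df['B'] = np.where(my_df['A'] % 2 != 0, 'X', 'Y')
print(my_df)
```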

how to subset by fixed column and row by boolean in pandas? [duplicate]

This question already has answers here:
How to deal with SettingWithCopyWarning in Pandas
(20 answers)
Closed 3 years ago.
I am coming from an R background and need some elementary help with pandas.
If I have a dataframe like this
df = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6]]))
I want to subset dataframe to select a fixed column and select a row by a boolean.
For example
df.iloc[df.2 > 4][2]
Then I want to set the value of the selected cell.
Something like
df.iloc[df.2 > 4][2] = 7
This seems valid to me; however, pandas seems to handle booleans more strictly than R does.
Use .loc here:
df.loc[df[2] > 4, 2]
1    6
Name: 2, dtype: int64
df.loc[df[2] > 4, 2] = 7
df
   0  1  2
0  1  2  3
1  4  5  7
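Put together as a runnable sketch: the row mask and the column label go into a single .loc call, so the assignment acts on df itself rather than on a temporary copy. The variable name mask is introduced here for readability.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6]]))

# Column labels are the integers 0, 1, 2, so the column is accessed
# as df[2]; attribute access such as df.2 is not valid Python.
mask = df[2] > 4

# Select rows by boolean mask and the column by label in one step.
df.loc[mask, 2] = 7
print(df)
```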
