I want to print the highest non-unique value from my dataframe.
With df['Value'].value_counts() I can count the occurrences, but how do I select values by how often they appear?
Value
1
2
1
2
3
2
As I understand it, you want the first (highest) value that has a frequency greater than 1. In this case you can write:
for val, cnt in df['Value'].value_counts().sort_index(ascending=False).items():
    if cnt > 1:
        print(val)
        break
(Series.iteritems was removed in pandas 2.0; items is the equivalent.) The sort_index sorts the items by the 'Value' rather than by the frequencies. For example, if your 'Value' column has values [1, 2, 3, 3, 2, 2, 2, 1, 3, 2], then the result of df['Value'].value_counts().sort_index(ascending=False).items() will be as follows:
3 3
2 5
1 2
Name: Value, dtype: int64
The answer in this example would then be 3 since it is the first highest value with frequency greater than 1.
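Putting it together, here is a self-contained sketch using the sample column from the question (the helper name highest_non_unique is mine):

```python
import pandas as pd

# sample data from the question
df = pd.DataFrame({'Value': [1, 2, 1, 2, 3, 2]})

def highest_non_unique(s):
    """Return the largest value whose count is greater than 1, or None."""
    counts = s.value_counts().sort_index(ascending=False)
    for val, cnt in counts.items():
        if cnt > 1:
            return val
    return None

print(highest_non_unique(df['Value']))  # -> 2 (both 1 and 2 repeat; 2 is higher)
```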
Related
I enter this:
test2 = [1, 2, 3, 4]
test3 = pd.Series(test2)
print(test3)
print(test3[3])
print(test3[test3[2]])
And get this:
0 1
1 2
2 3
3 4
dtype: int64
4
4
Basically I'm trying to understand what's happening to the index numbering. Why does a row effectively get cut, or the index-search result get moved to the next lower row, when you write the series name twice? (As evidenced by selecting different index numbers on what appears to be the same series of unique values, but getting the same answer.)
There is no row that is "cut", and no jump to the next value. Indexing is always reproducible; you might be misunderstanding how it works.
test3[x] gives you the value y that is stored at the index x.
In the case of test3[test3[2]], first test3[2] is evaluated: 2 is in the index and the matching value is 3. Coincidentally, 3 as a value is also in the index, so test3[test3[2]] is equivalent to test3[3] and returns 4.
If your Series were test3 = pd.Series([6, 7, 8, 9]), you would just get an error running test3[test3[2]], as test3[8] does not exist.
The indexes do not change when you do test3[x]; it returns the value at index x:
test3[3] = the value at index 3 = 4
test3[test3[2]]:
1. test3[2] = the value at index 2 = 3
2. test3[test3[2]] = test3[3] = the value at index 3 = 4
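To see this concretely, here is a minimal sketch reproducing the evaluation step by step:

```python
import pandas as pd

test3 = pd.Series([1, 2, 3, 4])  # default integer index 0..3

inner = test3[2]       # value at index 2 -> 3
outer = test3[inner]   # value at index 3 -> 4
print(inner, outer)    # 3 4

# With different values, the inner result may not be a valid index:
other = pd.Series([6, 7, 8, 9])
try:
    other[other[2]]    # other[2] is 8, and 8 is not in the index
except KeyError:
    print("KeyError: 8 is not in the index")
```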
I am developing a clinical bioinformatics application, and the input this application receives is a data frame that looks like this:
df = pd.DataFrame({
    'store':    ['Blank_A09', 'Control_4p', '13_MEG3', '04_GRB10', '02_PLAGL1', 'Control_21q', '01_PLAGL1', '11_KCNQ10T1', '16_SNRPN', '09_H19', 'Control_6p', '06_MEST'],
    'quarter':  [1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 2, 2],
    'employee': ['Blank_A09', 'Control_4p', '13_MEG3', '04_GRB10', '02_PLAGL1', 'Control_21q', '01_PLAGL1', '11_KCNQ10T1', '16_SNRPN', '09_H19', 'Control_6p', '06_MEST'],
    'foo':      [1, 1, 2, 2, 1, 1, 9, 2, 2, 4, 2, 2],
    'columnX':  ['Blank_A09', 'Control_4p', '13_MEG3', '04_GRB10', '02_PLAGL1', 'Control_21q', '01_PLAGL1', '11_KCNQ10T1', '16_SNRPN', '09_H19', 'Control_6p', '06_MEST'],
})
print(df)
store quarter employee foo columnX
0 Blank_A09 1 Blank_A09 1 Blank_A09
1 Control_4p 1 Control_4p 1 Control_4p
2 13_MEG3 2 13_MEG3 2 13_MEG3
3 04_GRB10 2 04_GRB10 2 04_GRB10
4 02_PLAGL1 1 02_PLAGL1 1 02_PLAGL1
5 Control_21q 1 Control_21q 1 Control_21q
6 01_PLAGL1 2 01_PLAGL1 9 01_PLAGL1
7 11_KCNQ10T1 2 11_KCNQ10T1 2 11_KCNQ10T1
8 16_SNRPN 2 16_SNRPN 2 16_SNRPN
9 09_H19 2 09_H19 4 09_H19
10 Control_6p 2 Control_6p 2 Control_6p
11 06_MEST 2 06_MEST 2 06_MEST
This is a minimal reproducible example, but the real one has an uncertain number of columns in which the first, the third, the fifth, the seventh, etc. "should" be exactly the same.
And this is what I want to check: I want to ensure that these columns have their values in the same order.
I know how to check whether 2 columns are exactly the same, but I don't know how to extend that check across the whole data frame.
EDIT:
The column names change; the ones in my example are just examples.
Refer here How to check if 3 columns are same and add a new column with the value if the values are same?
Here is code that checks whether several columns are identical and returns the indices of the rows where they all match:
import numpy as np

arr = df[['store', 'employee', 'columnX']].values  # add as many columns as you wish
np.where((arr == arr[:, [0]]).all(axis=1))
You will need to tweak it for your usage.
Edit
columns_to_check = list(range(0, len(df.columns), 2))  # 1st, 3rd, 5th, ... columns (0-based positions 0, 2, 4, ...)
arr = df.iloc[:, columns_to_check].values
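A self-contained sketch of this approach on a small frame (assuming, as in the question, that the 1st, 3rd, 5th, ... columns should match; the name all_equal_mask is mine):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'store':    ['a', 'b', 'c'],
    'quarter':  [1, 1, 2],
    'employee': ['a', 'b', 'c'],
    'foo':      [1, 9, 2],
    'columnX':  ['a', 'b', 'x'],
})

# 1st, 3rd, 5th, ... columns -> 0-based positions 0, 2, 4, ...
columns_to_check = list(range(0, len(df.columns), 2))
arr = df.iloc[:, columns_to_check].values

# True where every checked column equals the first checked column
all_equal_mask = (arr == arr[:, [0]]).all(axis=1)
print(np.where(all_equal_mask)[0])  # rows 0 and 1 match; row 2 differs in columnX
```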
If you want an efficient method you can hash the Series using pandas.util.hash_pandas_object, making the operation O(n):
pd.util.hash_pandas_object(df.T, index=False)
We clearly see that store/employee/columnX have the same hash:
store 18266754969677227875
quarter 11367719614658692759
employee 18266754969677227875
foo 92544834319824418
columnX 18266754969677227875
dtype: uint64
You can further use groupby to identify the identical values:
df.columns.groupby(pd.util.hash_pandas_object(df.T, index=False))
output:
{ 92544834319824418: ['foo'],
11367719614658692759: ['quarter'],
18266754969677227875: ['store', 'employee', 'columnX']}
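The exact hash values vary across pandas versions, but identical columns always hash identically, which can be checked programmatically. A minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({
    'store':    ['x', 'y', 'z'],
    'quarter':  [1, 1, 2],
    'employee': ['x', 'y', 'z'],
})

# one hash per column of df (i.e. per row of df.T);
# columns with identical contents get identical hashes
h = pd.util.hash_pandas_object(df.T, index=False)

print(h['store'] == h['employee'])  # True: same contents
print(h['store'] == h['quarter'])   # False: different contents
```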
This question already has an answer here:
How is pandas groupby method actually working?
import numpy as np
import pandas as pd

rng = np.random.RandomState(42)  # rng must be defined; any seeded generator works
df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data1': range(6),
                   'data2': rng.randint(0, 10, 6)},
                  columns=['key', 'data1', 'data2'])
l = [0, 1, 0, 1, 2, 0]
df.groupby(l).sum()
So the output of your code is
https://ibb.co/kgXSwvL
We have three distinct values 0, 1, 2 in the list l.
Look carefully: for each label, the result is the sum of the column values at the positions where that label appears in l.
For example, in the given list l, 0 appears at positions 0, 2 and 5, where data1 has values 0, 2 and 5 (summing to 7) and data2 has values 3, 5 and 2 (summing to 10).
Similarly, for the value 1: it appears in l at positions 1 and 3, so the corresponding data1 values are 1 and 3, which sum to 4.
The result for the value 2 can be verified in the same way.
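The arithmetic above can be checked directly. In this sketch data2 is fixed (instead of random) so the sums are verifiable; positions 0, 2 and 5 hold 3, 5 and 2 as in the explanation:

```python
import pandas as pd

df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data1': range(6),
                   'data2': [3, 0, 5, 3, 7, 2]})
l = [0, 1, 0, 1, 2, 0]

# group the numeric columns by the external list l
result = df[['data1', 'data2']].groupby(l).sum()
print(result)
# label 0: data1 -> 0+2+5 = 7, data2 -> 3+5+2 = 10
# label 1: data1 -> 1+3   = 4, data2 -> 0+3   = 3
# label 2: data1 -> 4,         data2 -> 7
```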
I have a dataframe like this:
seq score
0 TAAGAATTGTTCTCTGTGTATTT -23.19
1 AAGAATTGTTCTCTGTGTATTTC -3.67
2 AGAATTGTTCTCTGTGTATTTCA -16.49
3 GAATTGTTCTCTGTGTATTTCAG -11.83
4 AATTGTTCTCTGTGTATTTCAGG -10.86
5 ATTGTTCTCTGTGTATTTCAGGC -7.24
I want to take the rows in groups of 3 and then get the maximum value of the score in each group.
The result I am looking for is like this:
seq score
1 AAGAATTGTTCTCTGTGTATTTC -3.67
5 ATTGTTCTCTGTGTATTTCAGGC -7.24
I tried applying the groupby function and sorting, but it does not seem to work, as the seq column has unique values.
What other method can I use to get such a result?
Use DataFrameGroupBy.idxmax to get the index of the max value per group, where groups are created by integer division of the index by 3, and then select rows by DataFrame.loc:
df = df.loc[df.groupby(df.index // 3)['score'].idxmax()]
print (df)
seq score
1 AAGAATTGTTCTCTGTGTATTTC -3.67
5 ATTGTTCTCTGTGTATTTCAGGC -7.24
Details:
print (df.index // 3)
Int64Index([0, 0, 0, 1, 1, 1], dtype='int64')
print (df.groupby(df.index // 3)['score'].idxmax())
0 1
1 5
Name: score, dtype: int64
import pandas as pd
df = pd.DataFrame({'seq':['TAAGAATTGTTCTCTGTGTATTT','AAGAATTGTTCTCTGTGTATTTC','AGAATTGTTCTCTGTGTATTTCA','GAATTGTTCTCTGTGTATTTCAG','AATTGTTCTCTGTGTATTTCAGG','ATTGTTCTCTGTGTATTTCAGGC'],
'score': [-23.19,-3.67,-16.49,-11.83,-10.86,-7.24]})
df = df.loc[df.groupby(df.index // 3)['score'].idxmax()]
print(df)
Let's say I have this dataframe
> df = pd.DataFrame({'A': [1,5], 'B':[3,4]})
A B
0 1 3
1 5 4
I can get the minimum value of each row with:
> df.min(1)
0 1
1 4
dtype: int64
Or its indexes with:
> df.idxmin(1)
0 A
1 B
dtype: object
Nevertheless, this implies searching the minimum values twice. Is there a way to use the idxmin results to access the respective columns and get the minimum value (without calling min)?
Edit: I am looking for something that is faster than calling min again. In theory, this should be possible as columns are indexed.
To get the values in a list, you could do the following:
> indices = df.idxmin(1)
> [df.iloc[k][indices[k]] for k in range(len(indices))]
[1, 4]
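A vectorized alternative to the list comprehension, using the idxmin result to index the underlying array directly (no second pass over the values to find the minima):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 5], 'B': [3, 4]})

idx = df.idxmin(axis=1)             # column label of the min in each row
cols = df.columns.get_indexer(idx)  # labels -> positional column indices
mins = df.to_numpy()[np.arange(len(df)), cols]
print(mins)  # [1 4]
```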