Order a dataset using a pandas DataFrame by date, then by 'pass' value - python

I have a dataset that I would like to order by date and then, as a secondary key, by the 'pass' value, with the lowest inside of the highest. The reason I don't have any code is that I simply have no idea where to begin.
dataframe input:
index date pass
0 11/14/2014 1
1 3/13/2015 1
2 3/20/2015 1
3 5/1/2015 2
4 5/1/2015 1
5 5/22/2015 3
6 5/22/2015 1
7 5/22/2015 2
8 9/25/2015 1
9 9/25/2015 2
10 9/25/2015 3
11 12/4/2015 2
12 12/4/2015 1
13 2/12/2016 2
14 2/12/2016 1
15 5/27/2016 1
16 6/10/2016 1
17 9/23/2016 1
18 12/23/2016 1
19 11/24/2017 1
20 12/29/2017 1
21 1/26/2018 2
22 1/26/2018 1
23 2/9/2018 1
24 3/16/2018 1
25 4/6/2018 2
26 4/6/2018 1
27 6/15/2018 1
28 6/15/2018 2
29 10/26/2018 1
30 11/30/2018 1
31 12/21/2018 1
**Expected Output**
index date pass
0 11/14/2014 1
1 3/13/2015 1
2 3/20/2015 1
3 5/1/2015 2
4 5/1/2015 1
5 5/22/2015 3
6 5/22/2015 2
7 5/22/2015 1
8 9/25/2015 3
9 9/25/2015 2
10 9/25/2015 1
11 12/4/2015 2
12 12/4/2015 1
13 2/12/2016 2
14 2/12/2016 1
15 5/27/2016 1
16 6/10/2016 1
17 9/23/2016 1
18 12/23/2016 1
19 11/24/2017 1
20 12/29/2017 1
21 1/26/2018 1
22 1/26/2018 2
23 2/9/2018 1
24 3/16/2018 1
25 4/6/2018 1
26 4/6/2018 2
27 6/15/2018 1
28 6/15/2018 2
29 10/26/2018 1
30 11/30/2018 1
31 12/21/2018 1
I have spaced out the results that would change: indexes 5, 6, 7; 21, 22; and 25, 26. All the bigger pass numbers should be inside the lower pass numbers when the dates are the same.
So if you look at indexes 5, 6, 7, the pass for them is changed to 3, 2, 1, and if you look at indexes 25, 26, the pass is changed to 1, 2. Hope you understand.

Order first by pass and then by date; pandas can sort on both keys at once, with a different direction per key. This way you will be sure to have your df the way you want it.
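A minimal sketch of that idea, assuming the goal is date ascending with pass descending inside each date (a small subset of the question's rows is used as sample data):

```python
import pandas as pd

df = pd.DataFrame({
    "date": ["5/1/2015", "5/1/2015", "5/22/2015", "5/22/2015", "5/22/2015"],
    "pass": [2, 1, 3, 1, 2],
})

# Parse the strings so dates sort chronologically, not lexically.
df["date"] = pd.to_datetime(df["date"], format="%m/%d/%Y")

# Sort by date ascending and, within each date, by pass descending.
out = df.sort_values(["date", "pass"], ascending=[True, False]).reset_index(drop=True)
print(out)
```

Flip the second `ascending` flag to `True` if a group should instead run lowest-to-highest.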


How to perform a row-wise sum based on a column condition and add a class-wise specific value as a column?

Cluster Class Value
0 0 10 1
1 0 11 1
2 0 14 3
3 0 18 1
4 0 26 1
5 0 29 1
6 0 30 1
7 1 0 2
8 1 19 1
9 1 20 1
10 1 21 2
11 1 36 1
12 1 26 1
13 1 27 1
14 1 37 2
15 1 33 1
This table shows which class falls under which cluster. Classes 10, 11, 14, and so on fall into Cluster 0, and the Value column indicates how many members of each class are there. For example, 3 members of Class 14 fell into Cluster 0.
Now my desired output is like this:
Cluster Class Value Cluster_Sum
0 0 10 1 9
1 0 11 1 9
2 0 14 3 9
3 0 18 1 9
4 0 26 1 9
5 0 29 1 9
6 0 30 1 9
Same for the other clusters too. My final aim is to make a column 'Precision', which is
df['Precision'] = df['Value'] / df['Cluster_Sum'] for each row.
How can I do that using Python?
EDIT: It works perfectly fine. Thanks for your help.
Ultimately, this is my goal. Each class has a fixed size, like Class 1: 10, Class 2: 12, and so on. I need to add a column 'Class_Sum' that contains the total for each class. Then I am able to find the recall with
`df['Recall'] = df['Value'] / df['Class_Sum']`
But my question is: how can I append this information
Class 1 10
Class 2 12
Class 3 23
Class 4 11
Class 5 17
Class 6 13
Class 7 16
Class 8 15
Class 9 14
Class 10 18
Class 11 09
Class 12 07
Class 13 16
Class 14 21
Class 15 17
Class 16 23
Class 17 10
Class 18 21
Class 19 12
Class 20 45
Class 21 12
Class 22 12
Class 23 15
Class 24 11
Class 25 09
Class 26 11
Class 27 08
Class 28 10
Class 29 11
Class 30 19
Class 31 17
Class 32 15
Class 33 12
Class 34 07
Class 35 06
Class 36 14
Class 37 13
Class 38 16
to my DataFrame, like this:
Cluster Class Class_Sum Value Cluster_Sum Precision Recall
10 18
11 09
14 21
18 21
26 11
29 11
30 19
How can it be done?
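One way is to turn the class-size list into a dict and map it onto the Class column. A minimal sketch, assuming the Cluster 0 rows from the question (the `class_sizes` dict below is hand-built from the first few entries of the list above):

```python
import pandas as pd

# Fixed per-class totals, taken from the list in the question.
class_sizes = {10: 18, 11: 9, 14: 21, 18: 21, 26: 11, 29: 11, 30: 19}

df = pd.DataFrame({
    "Cluster": [0, 0, 0, 0, 0, 0, 0],
    "Class":   [10, 11, 14, 18, 26, 29, 30],
    "Value":   [1, 1, 3, 1, 1, 1, 1],
})

# Map each row's Class to its fixed total, then compute Recall.
df["Class_Sum"] = df["Class"].map(class_sizes)
df["Recall"] = df["Value"] / df["Class_Sum"]
print(df)
```

`Series.map` does a per-element dict lookup, so classes missing from the dict would come back as NaN rather than raising.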
Try with groupby:
df["Cluster_Sum"] = df.groupby("Cluster")["Value"].transform("sum")
>>> df
Cluster Class Value Cluster_Sum
0 0 10 1 9
1 0 11 1 9
2 0 14 3 9
3 0 18 1 9
4 0 26 1 9
5 0 29 1 9
6 0 30 1 9
7 1 0 2 12
8 1 19 1 12
9 1 20 1 12
10 1 21 2 12
11 1 36 1 12
12 1 26 1 12
13 1 27 1 12
14 1 37 2 12
15 1 33 1 12
groupby + transform("sum") is your friend here:
df['Precision'] = df["Value"] / df.groupby("Cluster")["Value"].transform("sum")
Output:
>>> df
Cluster Class Value Precision
0 0 10 1 0.111111
1 0 11 1 0.111111
2 0 14 3 0.333333
3 0 18 1 0.111111
4 0 26 1 0.111111
5 0 29 1 0.111111
6 0 30 1 0.111111
7 1 0 2 0.166667
8 1 19 1 0.083333
9 1 20 1 0.083333
10 1 21 2 0.166667
11 1 36 1 0.083333
12 1 26 1 0.083333
13 1 27 1 0.083333
14 1 37 2 0.166667
15 1 33 1 0.083333

How to extract the value of column 1 when column 2 changes? (Python)

I have a pandas.DataFrame of the following form (a numpy-based solution is also fine).
I want to output the value of 'moID' whenever the value of the 'time' column changes.
I'll show a simple example below.
I mark each row that should be output with '<<<'.
index 'moID' 'time'
0 1 0 <<<
1 25 0
2 3 1 <<<
3 45 1
4 12 1
5 2 2 <<<
6 34 1 <<<
7 4 1
8 12 1
9 2 3 <<<
10 5 3
11 37 3
12 85 0 <<<
13 2 0
14 45 1 <<<
15 55 1
16 2 3 <<<
17 23 3
18 42 0 <<<
19 1 0
20 42 1 <<<
21 2 2 <<<
22 41 2
23 3 1 <<<
24 52 1
25 2 1
26 24 3 <<<
27 3 3
28 5 3
result is :
index 'moID'
1
3
2
34
2
85
45
2
42
42
2
3
24
Help me, please.
You can use shift + ne to see whether consecutive rows match, which creates a boolean Series (False where the time is the same as the previous row, True where it differs). Then use it as a mask to filter the desired items:
out = df.loc[df['time'].ne(df['time'].shift()), 'moID']
Output:
0 1
2 3
5 2
6 34
9 2
12 85
14 45
16 2
18 42
20 42
21 2
23 3
26 24
Name: moID, dtype: int64
You can use boolean indexing the following way:
result = df.moID[df.time.diff() != 0]
df.time.diff() != 0 generates a Series of boolean and it is used
to index moID column.
The result, for your source data, is:
0 1
2 3
5 2
6 34
9 2
12 85
14 45
16 2
18 42
20 42
21 2
23 3
26 24
Name: moID, dtype: int64
The left column is the index and the right one - actual values.
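Both approaches can be checked with a minimal runnable sketch (using the first few rows of the example data):

```python
import pandas as pd

df = pd.DataFrame({
    "moID": [1, 25, 3, 45, 12, 2, 34],
    "time": [0, 0, 1, 1, 1, 2, 1],
})

# Keep rows where 'time' differs from the previous row; the first row
# is always kept because shift() yields NaN there, and NaN != 0.
out = df.loc[df["time"].ne(df["time"].shift()), "moID"]
print(out.tolist())  # [1, 3, 2, 34]
```

Note that `ne(shift())` and `diff() != 0` agree here; `diff()` only works when 'time' is numeric, while `ne(shift())` also handles strings or dates.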

Hash table mapping in Pandas

I have a large dataset with millions of rows of data. One of the data columns is ID.
I also have another (hash) table that maps ranges of indices to a specific group that meets a certain criterion.
What is an efficient way to map these index ranges onto an additional column of my dataset in pandas?
As an example, lets say that the dataset looks like this:
In [18]:
print(df_test)
Out [19]:
ID
0 13
1 14
2 15
3 16
4 17
5 18
6 19
7 20
8 21
9 22
10 23
11 24
12 25
13 26
14 27
15 28
16 29
17 30
18 31
19 32
Now the hash table with the range of indices looks like this:
In [20]:
print(df_hash)
Out [21]:
ID_first
0 0
1 2
2 10
where the index specifies the group number that I need.
I tried doing something like this:
for index in range(df_hash.size):
    try:
        df_test.loc[df_hash.ID_first[index]:df_hash.ID_first[index + 1], 'Group'] = index
    except:
        df_test.loc[df_hash.ID_first[index]:, 'Group'] = index
Which works well, except that it is really slow as it loops over the length of the hash table dataframe (hundreds of thousands of rows). It produces the following answer (which I want):
In [23]:
print(df_test)
Out [24]:
ID Group
0 13 0
1 14 0
2 15 1
3 16 1
4 17 1
5 18 1
6 19 1
7 20 1
8 21 1
9 22 1
10 23 2
11 24 2
12 25 2
13 26 2
14 27 2
15 28 2
16 29 2
17 30 2
18 31 2
19 32 2
Is there a way to do this more efficiently?
You can map the index of df_test to the index of df_hash using ID_first, and then ffill. You need to construct a Series, as the pd.Index class doesn't have an ffill method.
df_test['group'] = (pd.Series(df_test.index.map(dict(zip(df_hash.ID_first, df_hash.index))),
                              index=df_test.index)
                    .ffill(downcast='infer'))
# ID group
#0 13 0
#1 14 0
#2 15 1
#...
#9 22 1
#10 23 2
#...
#17 30 2
#18 31 2
#19 32 2
You can use Series.isin with Series.cumsum (this matches the values of ID against ID_first; uncomment .sub(1) if you want the group numbers to start at 0):
df_test['group'] = df_test['ID'].isin(df_hash['ID_first']).cumsum()  # .sub(1)
print(df_test)
ID group
0 0 1
1 1 1
2 2 2
3 3 2
4 4 2
5 5 2
6 6 2
7 7 2
8 8 2
9 9 2
10 10 3
11 11 3
12 12 3
13 13 3
14 14 3
15 15 3
16 16 3
17 17 3
18 18 3
19 19 3
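An alternative sketch using numpy.searchsorted, assuming (as in the question) that ID_first holds sorted positional start indices: each row's position is bucketed into the half-open range it falls in, which avoids both the Python loop and the ffill.

```python
import numpy as np
import pandas as pd

df_test = pd.DataFrame({"ID": range(13, 33)})
df_hash = pd.DataFrame({"ID_first": [0, 2, 10]})

# For each positional index, find which [start, next_start) bucket it
# falls into; side="right" puts a position equal to a start into the
# bucket that begins there.
starts = df_hash["ID_first"].to_numpy()
df_test["Group"] = np.searchsorted(starts, df_test.index.to_numpy(), side="right") - 1
print(df_test.head())
```

This is vectorized and O(n log k) for n rows and k group boundaries, so it should scale to hundreds of thousands of boundary rows.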

How to look for the same columns from one dataframe in another dataframe with pandas in Python?

I have one dataframe like this,
tabla_aciertos = {'Numeros_acertados': [5, 5, 5, 4, 4, 3, 4, 2, 3, 3, 1, 2, 2],
                  'Estrellas_acertadas': [2, 1, 0, 2, 1, 2, 0, 2, 1, 0, 2, 1, 0]}
categorias = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13]
categoria_de_premios = pd.DataFrame(tabla_aciertos, index=categorias)
categoria_de_premios
Numeros_acertados Estrellas_acertadas
1 5 2
2 5 1
3 5 0
4 4 2
5 4 1
6 3 2
7 4 0
8 2 2
9 3 1
10 3 0
11 1 2
12 2 1
13 2 0
and another df :
sorteos_anteriores.iloc[:,:]
uno dos tres cuatro cinco Estrella1 Estrella2 bolas_Acertadas estrellas_Acertadas
Fecha
2020-10-13 5 14 38 41 46 1 10 0 1
2020-09-10 11 15 35 41 50 5 8 1 0
2020-06-10 4 21 36 41 47 9 11 0 0
2020-02-10 6 12 15 40 45 3 9 0 0
2020-09-29 4 14 16 41 44 11 12 0 1
... ... ... ... ... ... ... ... ... ...
2004-12-03 15 24 28 44 47 4 5 0 0
2004-05-03 4 7 33 37 39 1 5 0 1
2004-02-27 14 18 19 31 37 4 5 0 0
2004-02-20 7 13 39 47 50 2 5 1 0
2004-02-13 16 29 32 36 41 7 9 0 0
1363 rows × 9 columns
Now, for each and every row of the df sorteos_anteriores, I need to see whether it is in one of the rows of the first df, tabla_aciertos.
Let me give you an example.
Imagine that in sorteos_anteriores you have, on 2019-11-2, 'bolas_Acertadas' = 5 and 'estrellas_Acertadas' = 1. You go to the first table, tabla_aciertos, and find that at index 2 you have 'Numeros_acertados' = 5 and 'Estrellas_acertadas' = 1: you have won a second-class prize (index = 2). I should create a new column 'Prize' in sorteos_anteriores and, in each row, write a number from 1 to 13 if there is some kind of prize, or 0 or NaN if not.
I have tried:
sorteos_anteriores['categorias'] = sorteos_anteriores(sorteos_anteriores.loc[:, 'bolas_Acertadas':'estrellas_Acertadas'] == tabla_premios.iloc[:, 0:2])
I also tried with where and merge, but nothing works.
Thanks for your help.
Thanks to Cuina Max I could do it. The answer is here:
# supposing that the indexes, starting from one, correspond to the premiums
categoria_de_premios['Categoria'] = categoria_de_premios.index
# Merge using pd.merge and the appropriate arguments
sorteos_anteriores = (sorteos_anteriores.merge(
    categoria_de_premios,
    how='outer',
    left_on=['bolas_Acertadas', 'estrellas_Acertadas'],
    right_on=['Numeros_acertados', 'Estrellas_acertadas']
)).drop(columns=['Numeros_acertados', 'Estrellas_acertadas'])
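A self-contained sketch of that merge on toy data (column names follow the question; the three-category lookup table is a made-up subset of the real thirteen). A left merge keeps every draw and leaves NaN in Categoria for draws that won nothing:

```python
import pandas as pd

# Prize table: each (numbers, stars) combination maps to a category.
tabla = pd.DataFrame({
    "Numeros_acertados": [5, 5, 4],
    "Estrellas_acertadas": [2, 1, 1],
}, index=[1, 2, 5])
tabla["Categoria"] = tabla.index

# Past draws with their hit counts.
sorteos = pd.DataFrame({
    "bolas_Acertadas": [5, 4, 0],
    "estrellas_Acertadas": [1, 1, 0],
})

# Left-merge so every draw keeps a row; unmatched draws get NaN.
out = sorteos.merge(
    tabla,
    how="left",
    left_on=["bolas_Acertadas", "estrellas_Acertadas"],
    right_on=["Numeros_acertados", "Estrellas_acertadas"],
).drop(columns=["Numeros_acertados", "Estrellas_acertadas"])
print(out)
```

Because of the NaNs, Categoria comes back as a float column; `how='outer'` as in the accepted answer would additionally keep prize categories no draw ever matched.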

How to find maximum value in Pandas data frame and assign a new Value to it?

This is my pandas data frame:
ID Position Time(in Hours) Date
01 18 2 01/01/2016
01 21 4 01/10/2016
01 19 2 01/10/2016
05 19 5 01/10/2016
05 21 1 01/10/2016
05 19 8 01/10/2016
02 19 18 02/10/2016
02 35 11 02/10/2016
I need to assign '1' to the maximum time for each ID and date, and '0' otherwise.
My code is
def find_max(db7):
    max_row = db7['Time'].max()
    labels = np.where((db7['Time_in_Second'] == max_row), '1', '0')
    return max_row

db7['Max'] = db7['Time'].map(find_max)
But I'm getting the error below. How do I do this, please?
TypeError: 'float' object is not subscriptable
My Expected out put should be:
ID Position Time(in Hours) Date Max
01 18 2 01/01/2016 0
01 21 4 01/10/2016 1
01 19 2 01/10/2016 0
05 19 5 01/10/2016 0
05 21 1 01/10/2016 0
05 19 8 01/10/2016 1
02 19 18 02/10/2016 1
02 35 11 02/10/2016 0
Use groupby with transform('max') and numpy.where to assign the new values:
max1 = db7.groupby(['ID','Date'])['Time(in Hours)'].transform('max')
db7['Max'] = np.where(db7['Time(in Hours)'].eq(max1), '1', '0')
print (db7)
ID Position Time(in Hours) Date Max
0 1 18 2 01/01/2016 1
1 1 21 4 01/10/2016 1
2 1 19 2 01/10/2016 0
3 5 19 5 01/10/2016 0
4 5 21 1 01/10/2016 0
5 5 19 8 01/10/2016 1
6 2 19 18 02/10/2016 1
7 2 35 11 02/10/2016 0
Or convert the Trues and Falses to '1' and '0' with a double astype:
max1 = db7.groupby(['ID','Date'])['Time(in Hours)'].transform('max')
db7['Max'] = db7['Time(in Hours)'].eq(max1).astype(int).astype(str)
print (db7)
ID Position Time(in Hours) Date Max
0 1 18 2 01/01/2016 1
1 1 21 4 01/10/2016 1
2 1 19 2 01/10/2016 0
3 5 19 5 01/10/2016 0
4 5 21 1 01/10/2016 0
5 5 19 8 01/10/2016 1
6 2 19 18 02/10/2016 1
7 2 35 11 02/10/2016 0
Detail:
print (max1)
0 2
1 4
2 4
3 8
4 8
5 8
6 18
7 18
Name: Time(in Hours), dtype: int64
#eq is same as ==
print (db7['Time(in Hours)'].eq(max1))
0 True
1 True
2 False
3 False
4 False
5 True
6 True
7 False
Name: Time(in Hours), dtype: bool
EDIT:
If you need to group only by column ID:
max1 = db7.groupby('ID')['Time(in Hours)'].transform('max')
db7['Max'] = np.where(db7['Time(in Hours)'].eq(max1), '1', '0')
print (db7)
ID Position Time(in Hours) Date Max
0 1 18 2 01/01/2016 0
1 1 21 4 01/10/2016 1
2 1 19 2 01/10/2016 0
3 5 19 5 01/10/2016 0
4 5 21 1 01/10/2016 0
5 5 19 8 01/10/2016 1
6 2 19 18 02/10/2016 1
7 2 35 11 02/10/2016 0
print (max1)
0 4
1 4
2 4
3 8
4 8
5 8
6 18
7 18
Name: Time(in Hours), dtype: int64
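If ties should flag only one row per group, a variant (a sketch, not from the answer above) uses idxmax, which returns the index label of the first maximum in each group:

```python
import pandas as pd

db7 = pd.DataFrame({
    "ID": ["01", "01", "05", "05", "02", "02"],
    "Time(in Hours)": [2, 4, 5, 8, 18, 11],
})

# Index label of the first max per ID, then flag exactly those rows.
idx = db7.groupby("ID")["Time(in Hours)"].idxmax()
db7["Max"] = "0"
db7.loc[idx, "Max"] = "1"
print(db7)
```

The transform('max') approach marks every tied row as '1', while idxmax marks only the first; which is right depends on how ties should be counted.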
