Sorting Pandas dataframe data within Groupby groups - python

I have a large pandas dataframe that can be represented structurally as:
id date status
0 12 2015-05-01 0
1 12 2015-05-22 1
2 12 2015-05-14 1
3 12 2015-05-06 0
4 45 2015-05-03 1
5 45 2015-05-12 1
6 45 2015-05-02 0
7 51 2015-05-05 1
8 51 2015-05-01 0
9 51 2015-05-23 1
10 51 2015-05-17 1
11 51 2015-05-03 0
12 51 2015-05-05 0
13 76 2015-05-04 1
14 76 2015-05-22 1
15 76 2015-05-08 0
And can be created in Python 3.4 using:
import pandas as pd

tempDF = pd.DataFrame({'id': [12, 12, 12, 12, 45, 45, 45, 51, 51, 51, 51, 51, 51, 76, 76, 76],
                       'date': ['2015-05-01', '2015-05-22', '2015-05-14', '2015-05-06', '2015-05-03',
                                '2015-05-12', '2015-05-02', '2015-05-05', '2015-05-01', '2015-05-23',
                                '2015-05-17', '2015-05-03', '2015-05-05', '2015-05-04', '2015-05-22',
                                '2015-05-08'],
                       'status': [0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0]})
tempDF['date'] = pd.to_datetime(tempDF['date'])
I would like to divide the dataframe into groups based on variable 'id', sort within groups based on 'date' and then get the last 'status' value within each group.
So far, I have:
tempGrouped = tempDF.groupby('id')
tempGrouped['status'].last()
which produces:
id
12 0
45 0
51 0
76 0
However, the status should be 1 in each case (the value associated with the latest date). I can't work out how to sort the groups by date before selecting the last value. It's likely I'm a little snow-blind after trying to work this out for a while, so I apologise in advance if the solution is obvious.

You can sort and then group like this (note that DataFrame.sort from older pandas has since been replaced by sort_values):
tempDF.sort_values(['id', 'date']).groupby('id')['status'].last()
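If you'd rather not sort, a sketch using idxmax (assuming 'date' has been converted with pd.to_datetime, as in the question) picks the row with the latest date per group directly:
# select the row holding the max date within each id, then keep its status
latest = tempDF.loc[tempDF.groupby('id')['date'].idxmax()]
latest.set_index('id')['status']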

Related

Grouping of a dataframe monthly after calculating the highest daily values

I've got a dataframe with two columns: one is a datetime column of dates, and the other holds a quantity. It looks something like this:
Date Quantity
0 2019-01-05 10
1 2019-01-10 15
2 2019-01-22 14
3 2019-02-03 12
4 2019-05-11 25
5 2019-05-21 4
6 2019-07-08 1
7 2019-07-30 15
8 2019-09-05 31
9 2019-09-10 44
10 2019-09-25 8
11 2019-12-09 10
12 2020-04-11 111
13 2020-04-17 5
14 2020-06-05 17
15 2020-06-16 12
16 2020-06-22 14
I want to make another dataframe with two columns: one is Month/Year and the other is Till Highest. I basically want the highest quantity value seen up to and including each month, grouped by month/year. Precisely, what I want is:
Month/Year Till Highest
0 Jan/2019 15
1 Feb/2019 15
2 May/2019 25
3 Jul/2019 25
4 Sep/2019 44
5 Dec/2019 44
6 Apr/2020 111
7 Jun/2020 111
In my case the dataset is vast; I have readings for almost every day of each month and each year in the specified timeline. The dummy dataset above just shows an example of what I want.
Please help me with this. Thanks in advance :)
See the annotated code:
(df
 # convert date to a monthly period (2019-01)
 .assign(Date=pd.to_datetime(df['Date']).dt.to_period('M'))
 # period and max quantity per month
 .groupby('Date')
 .agg(**{'Month/Year': ('Date', 'first'),
         'Till highest': ('Quantity', 'max')})
 # format periods as Jan/2019 and get the cumulated max quantity
 .assign(**{'Month/Year': lambda d: d['Month/Year'].dt.strftime('%b/%Y'),
            'Till highest': lambda d: d['Till highest'].cummax()})
 # drop the groupby index
 .reset_index(drop=True)
)
output:
Month/Year Till highest
0 Jan/2019 15
1 Feb/2019 15
2 May/2019 25
3 Jul/2019 25
4 Sep/2019 44
5 Dec/2019 44
6 Apr/2020 111
7 Jun/2020 111
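For comparison, a more compact sketch of the same idea (my own variation, not part of the original answer): group on a monthly period, take each month's max, then a running max. Monthly periods sort chronologically, so the group order is already right.
# monthly period for each reading (e.g. 2019-01)
per = pd.to_datetime(df['Date']).dt.to_period('M')
# max per month, then the running ("till") max across months
running = df['Quantity'].groupby(per).max().cummax()
out = pd.DataFrame({'Month/Year': running.index.strftime('%b/%Y'),
                    'Till Highest': running.values})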
In R you can use cummax:
df <- data.frame(
  Date = c("2019-01-05", "2019-01-10", "2019-01-22", "2019-02-03", "2019-05-11", "2019-05-21",
           "2019-07-08", "2019-07-30", "2019-09-05", "2019-09-10", "2019-09-25", "2019-12-09",
           "2020-04-11", "2020-04-17", "2020-06-05", "2020-06-16", "2020-06-22"),
  Quantity = c(10, 15, 14, 12, 25, 4, 1, 15, 31, 44, 8, 10, 111, 5, 17, 12, 14)
)
data.frame(`Month/Year` = unique(format(as.Date(df$Date), "%b/%Y")),
           `Till Highest` = cummax(tapply(df$Quantity, sub("-..$", "", df$Date), max)),
           check.names = FALSE, row.names = NULL)
Month/Year Till Highest
1 Jan/2019 15
2 Feb/2019 15
3 May/2019 25
4 Jul/2019 25
5 Sep/2019 44
6 Dec/2019 44
7 Apr/2020 111
8 Jun/2020 111

Python: How to repeat each combination of rows in Dataframe ranging 1 to n?

Have got a dataframe df like below:
Store Aisle Table
11 59 2
11 61 3
Need to expand each row 3 times, generating a new column 'Bit' with the range values, as below:
Store Aisle Table Bit
11 59 2 1
11 59 2 2
11 59 2 3
11 61 3 1
11 61 3 2
11 61 3 3
Have tried the code below but it didn't work out.
df.loc[df.index.repeat(range(3))]
Help me out! Thanks in Advance.
You should provide a number, not a range, to repeat. Also, you need a bit of processing:
(df.loc[df.index.repeat(3)]
   .assign(Bit=lambda d: d.groupby(level=0).cumcount().add(1))  # the repeated index labels number the copies of each row
   .reset_index(drop=True)
)
output:
Store Aisle Table Bit
0 11 59 2 1
1 11 59 2 2
2 11 59 2 3
3 11 61 3 1
4 11 61 3 2
5 11 61 3 3
Alternatively, using MultiIndex.from_product:
idx = pd.MultiIndex.from_product([df.index, range(1, 3 + 1)], names=(None, 'Bit'))
(df.reindex(idx.get_level_values(0))
   .assign(Bit=idx.get_level_values(1))
)
import numpy as np

df = df.iloc[np.repeat(np.arange(len(df)), 3)]
df['Bit'] = list(range(1, 4)) * (len(df) // 3)  # [1, 2, 3] for each original row
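For reference, a self-contained sketch of this NumPy approach with the repeat count factored out (the sample frame is rebuilt here purely for illustration):
import numpy as np
import pandas as pd

df = pd.DataFrame({'Store': [11, 11], 'Aisle': [59, 61], 'Table': [2, 3]})

n = 3
out = df.iloc[np.repeat(np.arange(len(df)), n)].reset_index(drop=True)
out['Bit'] = np.tile(np.arange(1, n + 1), len(df))  # 1..n for every original row
print(out)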

Python: How to replicate rows in Dataframe with column value but changing the column value to its range

Have got a dataframe df
Store Aisle Table
11 59 2
11 61 3
Need to replicate each row 'Table' times, changing the 'Table' column value to its range, as below:
Store Aisle Table
11 59 1
11 59 2
11 61 1
11 61 2
11 61 3
Tried the code below, but it doesn't change the value; it just replicates the same row n times.
df.loc[df.index.repeat(df['Table'])]
Thanks!
You can do a groupby().cumcount() after that:
out = df.loc[df.index.repeat(df['Table'])]
out['Table'] = out.groupby(level=0).cumcount() + 1
Output:
Store Aisle Table
0 11 59 1
0 11 59 2
1 11 61 1
1 11 61 2
1 11 61 3
We can try explode:
out = df.assign(Table=df['Table'].map(range)).explode('Table')
Output:
Store Aisle Table
0 11 59 0
0 11 59 1
1 11 61 0
1 11 61 1
1 11 61 2
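Note that range starts at 0, which is why this output differs from the 1-based numbering asked for in the question; if you need 1-based values, a small tweak should do it:
out = df.assign(Table=df['Table'].map(lambda n: range(1, n + 1))).explode('Table')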

How to look for the same columns from one dataframe in another dataframe in pandas (Python)?

I have one dataframe like this,
tabla_aciertos = {'Numeros_acertados': [5, 5, 5, 4, 4, 3, 4, 2, 3, 3, 1, 2, 2],
                  'Estrellas_acertadas': [2, 1, 0, 2, 1, 2, 0, 2, 1, 0, 2, 1, 0]}
categorias = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13]
categoria_de_premios = pd.DataFrame(tabla_aciertos, index=categorias)
categoria_de_premios
Numeros_acertados Estrellas_acertadas
1 5 2
2 5 1
3 5 0
4 4 2
5 4 1
6 3 2
7 4 0
8 2 2
9 3 1
10 3 0
11 1 2
12 2 1
13 2 0
and another df :
sorteos_anteriores.iloc[:,:]
uno dos tres cuatro cinco Estrella1 Estrella2 bolas_Acertadas estrellas_Acertadas
Fecha
2020-10-13 5 14 38 41 46 1 10 0 1
2020-09-10 11 15 35 41 50 5 8 1 0
2020-06-10 4 21 36 41 47 9 11 0 0
2020-02-10 6 12 15 40 45 3 9 0 0
2020-09-29 4 14 16 41 44 11 12 0 1
... ... ... ... ... ... ... ... ... ...
2004-12-03 15 24 28 44 47 4 5 0 0
2004-05-03 4 7 33 37 39 1 5 0 1
2004-02-27 14 18 19 31 37 4 5 0 0
2004-02-20 7 13 39 47 50 2 5 1 0
2004-02-13 16 29 32 36 41 7 9 0 0
1363 rows × 9 columns
Now I need to check, for each row of the df sorteos_anteriores, whether it matches one of the rows of the first df, tabla_aciertos.
Let me give you an example.
Imagine that in sorteos_anteriores, on 2019-11-2, the column bolas_Acertadas = 5 and estrellas_Acertadas = 1. Now you go to the first table, tabla_aciertos, and you find that at index 2 (Numeros_acertados = 5 and Estrellas_acertadas = 1) you have won a second-class prize (index = 2). You should create a new column Prize in sorteos_anteriores and, in each row, write a number from 1 to 13 if you have won some kind of prize, or 0 or NaN if you have not.
I have tried:
sorteos_anteriores ['categorias'] = sorteos_anteriores(sorteos_anteriores.loc[:,'bolas_Acertadas':'estrellas_Acertadas'] == tabla_premios.iloc[ : ,0:2])
Also with where and merge, but nothing works.
Thanks for your help.
Thanks to Cuina Max I could do it. The answer is here:
# supposing that the indexes, starting from one, correspond to the prize categories
categoria_de_premios['Categoria'] = categoria_de_premios.index
# merge using pd.merge and the appropriate arguments
sorteos_anteriores = (sorteos_anteriores.merge(
    categoria_de_premios,
    how='outer',
    left_on=['bolas_Acertadas', 'estrellas_Acertadas'],
    right_on=['Numeros_acertados', 'Estrellas_acertadas']
)).drop(columns=['Numeros_acertados', 'Estrellas_acertadas'])
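One caveat worth flagging (my own note, not part of the accepted answer): merge discards the Fecha index, and how='outer' can introduce rows for prize categories that never occur in the draws. A left join that keeps the dates might look like this:
sorteos_anteriores = (sorteos_anteriores
    .reset_index()  # keep 'Fecha' as a column through the merge
    .merge(categoria_de_premios,
           how='left',
           left_on=['bolas_Acertadas', 'estrellas_Acertadas'],
           right_on=['Numeros_acertados', 'Estrellas_acertadas'])
    .drop(columns=['Numeros_acertados', 'Estrellas_acertadas'])
    .set_index('Fecha'))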

Get only the previous three values from the dataframe

I am new to Python and pandas. What I have is a dataframe like this:
Id Offset Feature
0 0 2
0 5 2
0 11 0
0 21 22
0 28 22
1 32 0
1 38 21
1 42 21
1 52 21
1 55 0
1 58 0
1 62 1
1 66 1
1 70 1
2 73 0
2 78 1
2 79 1
From this I am trying to get the previous three values before each 0 in the Feature column, together with their offsets.
So the output would be like:
Offset Feature
11 2
21 22
28 22
(these three values come before the 0 which is at offset 32)
and likewise for the next place where there is a 0:
38 21
42 21
52 21
58 0
62 1
66 1
Is there any way I can get this?
Thanks
This should be done on the basis of the document Id.
I am quite new to pandas myself, but I have attempted to answer your question.
I populated your data as comma-separated values in data.csv and then used slicing to get the previous 3 rows.
import pandas as pd

df = pd.read_csv('./data.csv')
for index in (df.loc[df['Feature'] == 0]).index:
    print(df.loc[index - 3:index - 1])
The output looks like this. The leftmost column is the index, which you can discard if you don't want it. Is this what you were looking for?
Offset Feature
2 11 2
3 21 22
4 28 22
Offset Feature
6 38 21
7 42 21
8 52 21
Offset Feature
7 42 21
8 52 21
9 55 0
Offset Feature
11 62 1
12 66 1
13 70 1
Note: there might be a more Pythonic way to do this.
You can take the 3 previous rows of each 0 value in the column using loc. Follow the code:
import pandas as pd

df = pd.read_csv("<path_of_the_file>")
zero_indexes = list(df[df['Feature'] == 0].index)
for each_zero_index in zero_indexes:
    df1 = df.loc[each_zero_index - 3: each_zero_index]
    print(df1)  # this dataframe has 4 records: your previous three plus the zero record
Output:
Offset Feature
2 11 2
3 21 22
4 28 22
5 32 0
Offset Feature
6 38 21
7 42 21
8 52 21
9 55 0
Offset Feature
7 42 21
8 52 21
9 55 0
10 58 0
Offset Feature
11 62 1
12 66 1
13 70 1
14 73 0
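Both answers slice purely by row position, so a window can spill across document boundaries. Since the question says the lookup should respect the document Id, here is a hedged sketch (assuming the default RangeIndex and an 'Id' column as in the sample data) that drops rows belonging to a different Id:
zero_positions = df.index[df['Feature'] == 0]
for i in zero_positions:
    window = df.loc[max(i - 3, 0):i - 1]              # up to three preceding rows
    window = window[window['Id'] == df.loc[i, 'Id']]  # stay within the same document
    if not window.empty:
        print(window[['Offset', 'Feature']])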
