Pandas Groupby Newbie Conundrum - python

Using Pandas 1.4.2, Python 3.9.12
I have a data frame as follows:
Neighbourhood No-show
0 JARDIM DA PENHA No
1 JARDIM DA PENHA Yes
2 MATA DA PRAIA No
3 PONTAL DE CAMBURI No
4 JARDIM DA PENHA No
5 MARIA ORTIZ Yes
6 MARIA ORTIZ Yes
7 MATA DA PRAIA Yes
8 PONTAL DE CAMBURI No
9 MARIA ORTIZ No
How would I use groupby to get the total(count) of 'Yes' and total(count) of 'No' grouped by each 'Neighbourhood'? I keep getting 'NoNoYesNo' if I use .sum() and if I can get these grouped correctly by Neighbourhood I think I can graph much easier.
This data frame is truncated as there are numerous other columns but these are the only 2 I care about for this exercise.

Use df.groupby() as follows:
totals = df.groupby(['Neighbourhood','No-show'])['No-show'].count()
print(totals)
Neighbourhood      No-show
JARDIM DA PENHA    No         2
                   Yes        1
MARIA ORTIZ        No         1
                   Yes        2
MATA DA PRAIA      No         1
                   Yes        1
PONTAL DE CAMBURI  No         2
Name: No-show, dtype: int64
Good point raised by @JonClements: you might want to add .unstack(fill_value=0) to that, so that combinations with no rows show up as 0 instead of being left out. So:
totals_unstacked = df.groupby(['Neighbourhood','No-show'])['No-show'].count().unstack(fill_value=0)
print(totals_unstacked)
No-show            No  Yes
Neighbourhood
JARDIM DA PENHA     2    1
MARIA ORTIZ         1    2
MATA DA PRAIA       1    1
PONTAL DE CAMBURI   2    0

You can use:
df[['Neighbourhood', 'No-show']].value_counts()
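As a rough sketch of how the two answers relate, the snippet below rebuilds the sample frame from the question (the variable names by_groupby and by_value_counts are my own) and applies both approaches. Calling value_counts() on the two columns returns a Series indexed by (Neighbourhood, No-show), so the same unstack(fill_value=0) step should turn it into a per-neighbourhood table with the same counts.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "Neighbourhood": [
        "JARDIM DA PENHA", "JARDIM DA PENHA", "MATA DA PRAIA",
        "PONTAL DE CAMBURI", "JARDIM DA PENHA", "MARIA ORTIZ",
        "MARIA ORTIZ", "MATA DA PRAIA", "PONTAL DE CAMBURI", "MARIA ORTIZ",
    ],
    "No-show": ["No", "Yes", "No", "No", "No", "Yes", "Yes", "Yes", "No", "No"],
})

# Approach 1: group on both columns, count, then pivot No/Yes into columns.
by_groupby = (
    df.groupby(["Neighbourhood", "No-show"])["No-show"]
    .count()
    .unstack(fill_value=0)
)

# Approach 2: count unique (Neighbourhood, No-show) pairs, then pivot.
by_value_counts = (
    df[["Neighbourhood", "No-show"]]
    .value_counts()
    .unstack(fill_value=0)
)

print(by_value_counts)
```

Either table can then be passed to .plot(kind="bar") if the goal is a chart per neighbourhood.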

Related

How to compare two dataframes by a key and create a new one, but just keeping the keys that are not in the first?

In python 3 and pandas I have two dataframes with the same structure:
data_1 = {
'numero_cnj' : ['0700488-61.2018.8.07.0017', '0003557-92.2008.4.01.3801', '1009486-37.2017.8.26.0053', '5005742-49.2017.4.04.9999', '0700488-61.2018.8.07.0017'],
'nome_normalizado' : ['MARIA DOS REIS DE OLIVEIRA SILVA', 'MARIA SELMA OLIVEIRA DE SOUZA E ANDRADE FERREIRA', 'SAO PAULO PREVIDENCIA - SPPREV', 'INSTITUTO NACIONAL DO SEGURO SOCIAL', 'GERALDO CAVALCANTE DA SILVEIRA']
}
df_1 = pd.DataFrame(data_1)
data_2 = {
'numero_cnj' : ['0700488-61.2018.8.07.0017', '5005742-49.2017.4.04.9999', '1009486-37.2017.8.26.0053', '0700488-61.2018.8.07.0017'],
'nome_normalizado' : ['MARIA DOS REIS DE OLIVEIRA SILVA', 'INSTITUTO NACIONAL DO SEGURO SOCIAL', 'SAO PAULO PREVIDENCIA - SPPREV', 'GERALDO CAVALCANTE DA SILVEIRA']
}
df_2 = pd.DataFrame(data_2)
The "numero_cnj" column is an identifying key for the same item, but it can be repeated because more than one person/name can refer to that item.
I want to compare the two dataframes by the key "numero_cnj" and create a new dataframe from df_1 that keeps only the rows whose keys were not found in df_2.
For example
df_1
numero_cnj nome_normalizado
0 0700488-61.2018.8.07.0017 MARIA DOS REIS DE OLIVEIRA SILVA
1 0003557-92.2008.4.01.3801 MARIA SELMA OLIVEIRA DE SOUZA E ANDRADE FERREIRA
2 1009486-37.2017.8.26.0053 SAO PAULO PREVIDENCIA - SPPREV
3 5005742-49.2017.4.04.9999 INSTITUTO NACIONAL DO SEGURO SOCIAL
4 0700488-61.2018.8.07.0017 GERALDO CAVALCANTE DA SILVEIRA
df_2
numero_cnj nome_normalizado
0 0700488-61.2018.8.07.0017 MARIA DOS REIS DE OLIVEIRA SILVA
1 5005742-49.2017.4.04.9999 INSTITUTO NACIONAL DO SEGURO SOCIAL
2 1009486-37.2017.8.26.0053 SAO PAULO PREVIDENCIA - SPPREV
3 0700488-61.2018.8.07.0017 GERALDO CAVALCANTE DA SILVEIRA
In this case, the new dataframe would have only the line:
0003557-92.2008.4.01.3801 MARIA SELMA OLIVEIRA DE SOUZA E ANDRADE FERREIRA
Please, does anyone know the best strategy to do this?
If I'm reading your question correctly, you should use join (merge) with how=outer:
merge = pd.merge(df_1, df_2, on = "numero_cnj", suffixes = ["", "_y"], how = "outer", indicator=True)
merge[merge._merge == "left_only"][["numero_cnj", "nome_normalizado"]]
The output is:
numero_cnj nome_normalizado
4 0003557-92.2008.4.01.3801 MARIA SELMA OLIVEIRA DE SOUZA E ANDRADE FERREIRA
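If only the key matters, a simpler alternative to the merge (not what the answer above uses) is to filter df_1 with Series.isin. The sketch below uses the data from the question; the name only_in_df_1 is my own. It also avoids the duplicated rows that a many-to-many merge on a repeated key can produce.

```python
import pandas as pd

df_1 = pd.DataFrame({
    "numero_cnj": [
        "0700488-61.2018.8.07.0017", "0003557-92.2008.4.01.3801",
        "1009486-37.2017.8.26.0053", "5005742-49.2017.4.04.9999",
        "0700488-61.2018.8.07.0017",
    ],
    "nome_normalizado": [
        "MARIA DOS REIS DE OLIVEIRA SILVA",
        "MARIA SELMA OLIVEIRA DE SOUZA E ANDRADE FERREIRA",
        "SAO PAULO PREVIDENCIA - SPPREV",
        "INSTITUTO NACIONAL DO SEGURO SOCIAL",
        "GERALDO CAVALCANTE DA SILVEIRA",
    ],
})
df_2 = pd.DataFrame({
    "numero_cnj": [
        "0700488-61.2018.8.07.0017", "5005742-49.2017.4.04.9999",
        "1009486-37.2017.8.26.0053", "0700488-61.2018.8.07.0017",
    ],
    "nome_normalizado": [
        "MARIA DOS REIS DE OLIVEIRA SILVA",
        "INSTITUTO NACIONAL DO SEGURO SOCIAL",
        "SAO PAULO PREVIDENCIA - SPPREV",
        "GERALDO CAVALCANTE DA SILVEIRA",
    ],
})

# Keep the rows of df_1 whose key does not appear anywhere in df_2.
only_in_df_1 = df_1[~df_1["numero_cnj"].isin(df_2["numero_cnj"])]
print(only_in_df_1)
```

Because this is a plain boolean filter, the original index of df_1 is preserved.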

How to create a duplicate flag (column) that counts duplicate rows based on two columns?

I have the following dataframe and would like to create a column at the end called "dup" showing the number of times the row shows up based on the "Seasons" and "Actor" columns. Ideally the dup column would look like this:
Name Seasons Actor dup
0 Stranger Things 3 Millie 1
1 Game of Thrones 8 Emilia 1
2 La Casa De Papel 4 Sergio 1
3 Westworld 3 Evan Rachel 1
4 Stranger Things 3 Millie 2
5 La Casa De Papel 4 Sergio 1
This should do what you need:
df['dup'] = df.groupby(['Seasons', 'Actor']).cumcount() + 1
Output:
Name Seasons Actor dup
0 Stranger Things 3 Millie 1
1 Game of Thrones 8 Emilia 1
2 La Casa De Papel 4 Sergio 1
3 Westworld 3 Evan Rachel 1
4 Stranger Things 3 Millie 2
5 La Casa De Papel 4 Sergio 2
As Scott Boston mentioned, according to your criteria the last row should also be 2 in the dup column.
Here is a similar post that can provide you more information. SQL-like window functions in PANDAS
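For reference, here is a self-contained sketch of the cumcount approach using the rows from the question. The second column, dup_total, is my own addition and not part of the answer: it shows how transform("size") could be used if you want the total number of occurrences on every row rather than a running count.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Stranger Things", "Game of Thrones", "La Casa De Papel",
             "Westworld", "Stranger Things", "La Casa De Papel"],
    "Seasons": [3, 8, 4, 3, 3, 4],
    "Actor": ["Millie", "Emilia", "Sergio", "Evan Rachel", "Millie", "Sergio"],
})

# Running count within each (Seasons, Actor) group; cumcount starts at 0.
df["dup"] = df.groupby(["Seasons", "Actor"]).cumcount() + 1

# Hypothetical variant: size of the whole group, repeated on each of its rows.
df["dup_total"] = df.groupby(["Seasons", "Actor"])["Name"].transform("size")

print(df)
```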

How do I use df.str.replace() only for complete matches?

I want to replace values in my df, but only if the values are a complete match, not partial. Here's an example:
import pandas as pd
df = pd.DataFrame({'Name':['Mark', 'Laura', 'Adam', 'Roger', 'Anna'],
'City':['Los Santos', 'Montreal', 'Los', 'Berlin', 'Glasgow']})
print(df)
Name City
0 Mark Los Santos
1 Laura Montreal
2 Adam Los
3 Roger Berlin
4 Anna Glasgow
I want to replace Los by Los Santos but if I do it the intuitive way, it results in this:
df['City'].str.replace('Los', 'Los Santos')
Out[121]:
0 Los Santos Santos
1 Montreal
2 Los Santos
3 Berlin
4 Glasgow
Name: City, dtype: object
Obviously, I don't want Los Santos Santos.
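Use Series.replace, because Series.str.replace replaces substrings by default, whereas Series.replace only replaces values that match in full:
df['City'] = df['City'].replace('Los', 'Los Santos')
You can also use a regular expression anchored at both ends, so that only the exact value 'Los' matches (regex=True must be passed explicitly on newer pandas versions):
df['City'].str.replace(r'^Los$', 'Los Santos', regex=True)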
0 Los Santos
1 Montreal
2 Los Santos
3 Berlin
4 Glasgow
Name: City, dtype: object
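To make the difference concrete, the sketch below runs the three variants side by side on the frame from the question. The variable names are my own, and regex is passed explicitly in each str.replace call because its default has changed between pandas versions.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Mark", "Laura", "Adam", "Roger", "Anna"],
    "City": ["Los Santos", "Montreal", "Los", "Berlin", "Glasgow"],
})

# Substring replacement: also touches the "Los" inside "Los Santos".
substring = df["City"].str.replace("Los", "Los Santos", regex=False)

# Whole-value replacement: only cells equal to "Los" are changed.
whole_value = df["City"].replace("Los", "Los Santos")

# Regex anchored with ^ and $: matches only when the entire string is "Los".
anchored = df["City"].str.replace(r"^Los$", "Los Santos", regex=True)

print(substring.tolist())
print(whole_value.tolist())
print(anchored.tolist())
```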

Iterating through a dataframe with info from another dataframe

I have a question that I think is more about logic than about coding. My goal is to calculate how many kilometers a truck travels while loaded and billable.
I have two dataframes.
Let's call the first one trips:
Date        Licence   City            State  KM
01/05/2019  AAA-1111  Sao Paulo       SP     10
02/05/2019  AAA-1111  Santos          SP     10
03/05/2019  AAA-1111  Rio de Janeiro  RJ     20
04/05/2019  AAA-1111  Sao Paulo       SP     15
01/05/2019  AAA-2222  Curitiba        PR     20
02/05/2019  AAA-2222  Sao Paulo       SP     25
Let's call the second one invoice:
Code  Date        License   Origin     State  Destiny         UF  Value
A1    01/05/2019  AAA-1111  Sao Paulo  SP     Rio de Janeiro  RJ  10.000,00
A2    01/05/2019  AAA-2222  Curitiba   PR     Sao Paulo       SP  15.000,00
What I need to get is:
Date        Licence   City            State  KM  Code
01/05/2019  AAA-1111  Sao Paulo       SP     10  A1
02/05/2019  AAA-1111  Santos          SP     10  A1
03/05/2019  AAA-1111  Rio de Janeiro  RJ     20  A1
04/05/2019  AAA-1111  Sao Paulo       SP     15  NaN
01/05/2019  AAA-2222  Curitiba        PR     20  A2
02/05/2019  AAA-2222  Sao Paulo       SP     25  A2
As I said, it is more a question of logic. The truck picked up its cargo at the starting point, São Paulo. How can I iterate through the rows knowing that it passed through Santos loaded and then went on to Rio de Janeiro, if I don't have the date when the cargo was delivered?
Thanks.
Assuming the rows in the first dataframe (df1) are sorted, here is what I would do.
Note: below I am using df1 for trips and df2 for invoice.
Left join df1 (left) with df2 (right) on as many fields as are valid for matching the two dataframes, so that we can find the rows in df1 that are the Origin of a trip. In my test I am using the fields ['Date', 'License', 'City', 'State'] and saving the result in a new dataframe, df3:
df3 = df1.merge(df2[df2.columns[:6]].rename(columns={'Origin':'City'})
, on = ['Date', 'License', 'City', 'State']
, how = 'left'
)
Fill the NULL values in df3.Destiny with ffill():
df3['Destiny'] = df3.Destiny.ffill()
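Set up the group label with the following flag, which starts a new group on every row that has an invoice Code and on the row right after the truck has reached its Destiny: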
g = (~df3.Code.isnull() | (df3.shift().City == df3.Destiny)).cumsum()
Note: I added df3['g'] in the above picture for reference
Update df3.Code using ffill() within each of the above group labels:
df3['Code'] = df3.groupby(g).Code.ffill()
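The steps above can be put together into one runnable sketch using the sample rows from the question. Two things here are assumptions on my part: the trips column is named License (the question's table spells it Licence, but the merge keys in the answer use License), and the invoice frame only includes the six columns the answer actually selects. The intermediate names origins, starts and after_delivery are my own.

```python
import pandas as pd

trips = pd.DataFrame({
    "Date": ["01/05/2019", "02/05/2019", "03/05/2019",
             "04/05/2019", "01/05/2019", "02/05/2019"],
    "License": ["AAA-1111"] * 4 + ["AAA-2222"] * 2,  # assumed spelling
    "City": ["Sao Paulo", "Santos", "Rio de Janeiro",
             "Sao Paulo", "Curitiba", "Sao Paulo"],
    "State": ["SP", "SP", "RJ", "SP", "PR", "SP"],
    "KM": [10, 10, 20, 15, 20, 25],
})
invoice = pd.DataFrame({
    "Code": ["A1", "A2"],
    "Date": ["01/05/2019", "01/05/2019"],
    "License": ["AAA-1111", "AAA-2222"],
    "Origin": ["Sao Paulo", "Curitiba"],
    "State": ["SP", "PR"],
    "Destiny": ["Rio de Janeiro", "Sao Paulo"],
})

# Step 1: mark the trip rows where an invoice starts (Origin matches City).
origins = invoice.rename(columns={"Origin": "City"})
df3 = trips.merge(origins, on=["Date", "License", "City", "State"], how="left")

# Step 2: carry the destination forward to the following rows.
df3["Destiny"] = df3["Destiny"].ffill()

# Step 3: a new group begins at each invoice start, and on the row
# right after the previous row's City equals the destination.
starts = df3["Code"].notna()
after_delivery = df3["City"].shift() == df3["Destiny"]
g = (starts | after_delivery).cumsum()

# Step 4: forward-fill the invoice code inside each group only.
df3["Code"] = df3.groupby(g)["Code"].ffill()

print(df3[["Date", "License", "City", "State", "KM", "Code"]])
```

Note that this relies on the rows being sorted by truck and date, as the answer states; the shift() comparison does not itself check that consecutive rows belong to the same License.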

How to merge two data frames in pandas?

I have two pandas dataframes
Unnamed: 0 sentiment numberagreed tweetid tweet
0 0 2 6 219584 Apple processa a Samsung no Japão - Notícias -...
1 1 1 3 399249 É O JACKI CHAN !!! RT #user ESSE É DOS MEUS!!!...
2 2 3 3 387155 Eras o samsung galaxy tab e muito lerdo para t...
3 3 3 3 205458 Dizem que a coisa mais triste que o homem enfr...
4 4 3 3 2054404 RAIVA vou ter que ir com meu nike dinovo pra e...
tweetid sent
219584 0.494428
399249 0.789241
387155 0.351972
205458 0.396907
2054404 0.000000
They are not the same length, and there are some missing values in the second data frame.
I want to merge the two data frames based on the tweetid and drop the missing values.
Use pd.merge:
pd.merge(left=df1, right=df2, on='tweetid', how='inner')
Because this is an inner join, rows whose tweetid does not appear in both data frames are thrown away. on='tweetid' makes tweetid the join key.
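A minimal sketch of that merge on made-up data (the values below, including the tweetid 999999 and the missing sent score, are invented for illustration and are not from the question). The extra dropna step is my own addition: the inner join only removes unmatched keys, so rows whose sent value is itself missing still need to be dropped separately.

```python
import pandas as pd

# Hypothetical stand-ins for the two frames in the question.
df1 = pd.DataFrame({
    "tweetid": [219584, 399249, 387155, 205458, 2054404],
    "sentiment": [2, 1, 3, 3, 3],
})
df2 = pd.DataFrame({
    "tweetid": [219584, 399249, 387155, 999999],
    "sent": [0.494428, None, 0.351972, 0.1],
})

# The inner join keeps only tweetids present in both frames.
merged = pd.merge(left=df1, right=df2, on="tweetid", how="inner")

# Drop rows where the matched sent value is missing.
merged = merged.dropna(subset=["sent"])

print(merged)
```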
There is probably an extra character somewhere at the beginning of your file. Are you reading the data from a csv file? Post the source code of how you are reading the data.
Alternatively, set the column names explicitly on both dataframes:
df_tweets.columns = ("tweetid", "sent")
