group by a string column and datetime64[ns] column - python

I have the following data and I would like to know: who was the first and last customer that each Driver picked up on each day?
This is how far I've gotten:
#Import libraries
import pandas as pd
import numpy as np

#Open and clean the data
df = pd.read_csv('Data.csv')
df = df.drop(['Cod'], axis=1)
df['Start'] = pd.to_datetime(df['Start'])
df['End'] = pd.to_datetime(df['End'])

#The following code is meant to answer:
#Who was the first and last customer that each Driver picked up on each day?
#Link to access the data: https://drive.google.com/file/d/194byxNkgr2e9r-IOEmSuu9gpZyw27G7j/view?usp=sharing
unique_drivers = df['Driver'].unique()
for driver in unique_drivers:
    d = df.groupby('Driver').get_group(driver)
    time = d['Start'].iloc[0]
    first_customer = d['Customer'].iloc[0]
    end = d['End'].iloc[0]
    last_customer = d['Customer'].iloc[-1]

You can first sort by the Start column, which includes the hour and minutes, so that multiple same-day events are ordered correctly for the next step. Then group the frame by Driver to get each driver's pick-ups per day.
Using drop_duplicates, drop repeated dates with keep="first" to preserve only the first row of each day, and again with keep="last" to preserve only the last. This produces, for each driver, one row per unique date: the first pick-up and the last pick-up of that day. The indices of those rows can then be used on the Customer column to get the customer names.
import pandas as pd
df = pd.read_csv("data.csv")
print(df)
# sort including HH:MM
df = df.sort_values("Start")
drivers_df = []
for gname, group in df.groupby("Driver"):
    dn = pd.DataFrame()
    # split to get date and time in two columns
    ts = group["Start"].str.split(expand=True)
    # remove duplicate days, keeping the first occurrence
    t_first = ts.drop_duplicates(subset=[0], keep="first")
    # remove duplicate days, keeping the last occurrence
    t_last = ts.drop_duplicates(subset=[0], keep="last")
    dn["Date"] = t_first[0]
    dn["Driver"] = gname
    dn["Num_Customers"] = ts[0].groupby(ts[0]).count().values
    # use the previously obtained indices on the "Customer" column
    dn["First_Customer"] = df.loc[t_first.index, "Customer"].values
    dn["Last_Customer"] = df.loc[t_last.index, "Customer"].values
    drivers_df.append(dn)
dn = pd.concat(drivers_df)
# sorting by "Date"; change to "Driver" to sort by driver's name instead
dn = dn.sort_values("Date")
dn = dn.reset_index(drop=True)
print(dn)
Output from dn
Date Driver Num_Customers First_Customer Last_Customer
0 5/10/2020 Javier Pulgar 1 100998 - MARA MIRIAN BEATRIZ 100998 - MARA MIRIAN BEATRIZ
1 5/10/2020 Santiago Muruaga 1 103055 - ZANOTTO VALERIA 103055 - ZANOTTO VALERIA
2 5/10/2020 Martín Gutierrez 1 105645 - PAWLIW MARTA SOFI... 105645 - PAWLIW MARTA SOFI...
3 5/10/2020 Pablo Aguilar 2 102737 - GONZALVE DE ACEVEDO 102737 - GONZALVE DE ACEVEDO
4 5/10/2020 Carlos Medina 1 102750 - COOP.DE TRABAJO 102750 - COOP.DE TRABAJO
5 5/11/2020 Facundo Papaleo 6 101209 - FARMACIA NAZCA 2602 105093 - BIO HERLPER
6 5/11/2020 Franco Chiarrappa 15 100288 - SAVINI LUCIANA MARIA 102690 - GIOIA ELIZABETH
7 5/11/2020 Hernán Navarro 14 106367 - FARMACIA BERAPHAR... 102631 - SPALVIERI MARINA
8 5/11/2020 Pablo Aguilar 9 102510 - CAZADORFARM SCS 101482 - JOAQUIN MARCIAL
9 5/11/2020 Daniel Godino 7 103572 - GIRALDEZ ALICIA OLGA 103363 - CADELLI ROBERTO JOSE
10 5/11/2020 Hernán Urquiza 1 105323 - GARCIA GERMAN REI... 105323 - GARCIA GERMAN REI...
11 5/11/2020 Héctor Naselli 19 103545 - FARMACIA DESANTI 102257 - FARMA NUOVA S.C.S.
12 5/11/2020 Santiago Muruaga 12 101735 - ALEGRE LEONARDO 500014 - Drogueria DIMEC
13 5/11/2020 Javier Pulgar 2 101009 - MIGUEL ANGEL MARA 103462 - DRAGONE CARLOS AL...
14 5/11/2020 Atilano Aguilera 1 104003 - FARMACIA SANTA 104003 - FARMACIA SANTA
15 5/11/2020 Muletto 3 101359 - FARMACIA COSENTINO 105886 - NEGRI GREGORIO
16 5/11/2020 Martín Venturino 8 102587 - JANISZEWSKI MATIL... 102672 - BORSOTTI GUSTAVO
17 5/11/2020 Martín Gutierrez 1 105645 - PAWLIW MARTA SOFI... 105645 - PAWLIW MARTA SOFI...
18 5/11/2020 José Vallejos 13 102229 - LANDRIEL MARIA LO... 105721 - SOSA NANCY EDITH ...
19 5/11/2020 Edgardo Andrade 9 101524 - FARMACIA M Y A 101217 - MARISA TESORO
20 5/11/2020 Carlos Medina 14 105126 - QUISPE MURILLO RODY 100538 - MAXIMILIANO CAMPO...
21 5/11/2020 Javier Torales 1 200666 - CLINICA BOEDO SRL 200666 - CLINICA BOEDO SRL
22 5/12/2020 Hernán Urquiza 8 105293 - BENSAK MARIA EUGENIA 103005 - BONVISSUTO SANDRA
23 5/12/2020 Miguel Quilici 17 102918 - BRITO NICOLAS 102533 - SAMPEDRO PURA
...
...
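For comparison, here is a more compact sketch of the same idea (assuming the same column names as above, and that Start parses as a datetime): group by driver and calendar day, then take the first and last Customer per group with named aggregation.

import pandas as pd

df = pd.read_csv("data.csv")
df["Start"] = pd.to_datetime(df["Start"])
df = df.sort_values("Start")

# one row per (date, driver): first/last customer plus a pick-up count
out = (
    df.assign(Date=df["Start"].dt.date)
      .groupby(["Date", "Driver"])["Customer"]
      .agg(First_Customer="first", Last_Customer="last", Num_Customers="count")
      .reset_index()
)
print(out)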

Python pandas 'reverse' split in a dataframe

I have a dataframe with a column called details that holds data like this:
130 m² - 3 Pièces - 2 Chambres - 2 Salles de bains - Bon état - 20-30 ans -
To get the first value, 130, I did this:
df['superficie'] = df['details'].str.split('m²').str[0]
It gives me 130 in a new column called "superficie".
For the second value I did this:
df['nbPieces'] = (df['details'].str.split('-').str[1].str.split('Pièces').str[0])
It gives me 3 in a new column called "nbPieces".
But my problem is getting the 2 of the Chambres, the 2 of the Salles de bains, and the 20-30 near "ans". How can I do that? I need to add them to new columns (nbChambre, nbSalleDeBain, NbAnnee).
Thanks in advance.
I suggest using regular expressions in pandas for this kind of operation:
import pandas as pd
df = pd.DataFrame()
df['details'] = ["130 m² - 3 Pièces - 2 Chambres - 2 Salles de bains - Bon état - 20-30 ans -"]
df['nb_chbr'] = df['details'].str.split(" - ").str[2].str.findall(r'\d+').str[0].astype('int64')
df['nb_sdb'] = df['details'].str.split(" - ").str[3].str.findall(r'\d+').str[0].astype('int64')
df['nb_annee'] = df['details'].str.split(" - ").str[5].str.findall(r'\d+').str[0].astype('int64')
print(df)
Output:
details nb_chbr nb_sdb nb_annee
0 130 m² - 3 Pièces - 2 Chambres - 2 Salles de b... 2 2 20
Moreover, I used " - " as the split string, which returns a cleaner list in your case. For the "nombre d'années" I simply took the first integer that appears in that element; I don't know if that suits you.
Finally, there may be a problem in your dataframe: 2 chambres and 2 salles de bains should make at least a 4 pièces flat ^^
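As a follow-up, the repeated split/findall chains can be collapsed into one pass with str.extract and named capture groups. A sketch, assuming the field order shown above (the column names nbChambre, nbSalleDeBain, and NbAnnee are taken from the question):

import pandas as pd

df = pd.DataFrame({"details": ["130 m² - 3 Pièces - 2 Chambres - 2 Salles de bains - Bon état - 20-30 ans -"]})

# each named group becomes a column of the extracted frame
pattern = (r"(?P<superficie>\d+)\s*m² - (?P<nbPieces>\d+) Pièces - "
           r"(?P<nbChambre>\d+) Chambres - (?P<nbSalleDeBain>\d+) Salles de bains"
           r".*?(?P<NbAnnee>\d+)-\d+ ans")
df = df.join(df["details"].str.extract(pattern).astype("int64"))
print(df)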

fetching substring with a condition from another df

I have two datasets, one with only addresses, like this:
import pandas as pd
import numpy as np
df = pd.DataFrame({"Address": ["36 omar st, pal, galambo","33 pom kopd malan", "15 kop st,dogg, ghog", "23 malo st, pal, kola"]})
Address
0 36 omar st, pal, galambo
1 33 pom kopd malan
2 15 kop st,dogg, ghog
3 23 malo st, pal, kola
and the other is a dataset with every state and the cities inside it:
df2 = pd.DataFrame({"State": ["galambo", "ghog", "ghog", "kola", "malan", "malan"], "City": ["pal", "dogg", "kopd", "kop", "pal", "kold"]})
State City
0 galambo pal
1 ghog dogg
2 ghog kopd
3 kola kop
4 malan pal
5 malan kold
I'm trying to fetch the state name and city name out of each address, so I tried this:
df["State"] = df['Address'].apply(lambda x: next((a for a in df2["State"].to_list() if a in x), np.nan))
df["City"] = df['Address'].apply(lambda x: next((a for a in df2["City"].to_list() if a in x), np.nan))
Address State City
0 36 omar st, pal, galambo galambo pal
1 33 pom kopd malan malan kopd
2 15 kop st,dogg, ghog ghog dogg
3 23 malo st, pal, kola kola pal
but as you can see, rows 1 and 3 are incorrect, because according to df2 the state malan has no city called kopd, and the state kola has no city called pal.
So how can I make the output show only the cities that belong to the states as given in df2?
Update:
Expected output
Address State City
0 36 omar st, pal, galambo galambo pal
1 33 pom kopd malan malan NaN
2 15 kop st,dogg, ghog ghog dogg
3 23 malo st, pal, kola kola NaN
You can extract the last matching state/city name, then perform a merge to replace the invalid cities with NaN:
# craft regexes
regex_state = f"({'|'.join(df2['State'].unique())})"
regex_city = f"({'|'.join(df2['City'].unique())})"
# extract state/city (last match)
df['State'] = df['Address'].str.findall(regex_state).str[-1]
df['City'] = df['Address'].str.findall(regex_city).str[-1]
# fix city
df['City'] = df.merge(df2.assign(c=df2['City']), on=['City', 'State'], how='left')['c']
Output:
Address State City
0 36 omar st, pal, galambo galambo pal
1 33 pom kopd malan malan NaN
2 15 kop st,dogg, ghog ghog dogg
3 23 malo st, pal, kola kola NaN
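One caveat: with plain alternation a short name can match inside a longer one (kop inside kopd); here the ordering of the alternatives happens to work out, but wrapping the alternation in word boundaries makes the match stricter. A sketch reusing the same variable names:

import re

# match whole tokens only, so e.g. "kop" does not fire inside "kopd"
regex_state = rf"\b({'|'.join(map(re.escape, df2['State'].unique()))})\b"
regex_city = rf"\b({'|'.join(map(re.escape, df2['City'].unique()))})\b"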

Web scraping with python - table with multiple tbody elements

I'm trying to scrape the data from the top table on this page ("2021-2022 Regular Season Player Stats") using Python and BeautifulSoup. The page shows stats for 100 NHL players, 1 player per row. The code below works, but the problem is that it only pulls the first ten rows into the dataframe. This is because every ten rows are wrapped in a separate <tbody>, so it only iterates through the rows of the first <tbody>. How can I get it to continue through the rest of the <tbody> elements on the page?
Another question: this table has about 1000 rows total, and only shows up to 100 per page. Is there a way to rewrite the code below to iterate through the entire table at once instead of just the 100 rows that show on the page?
from bs4 import BeautifulSoup
import requests
import pandas as pd
url = 'https://www.eliteprospects.com/league/nhl/stats/2021-2022'
source = requests.get(url).text
soup = BeautifulSoup(source,'html.parser')
table = soup.find('table', class_='table table-striped table-sortable player-stats highlight-stats season')
df = pd.DataFrame(columns=['Player', 'Team', 'GamesPlayed', 'Goals', 'Assists', 'TotalPoints', 'PointsPerGame', 'PIM', 'PM'])
for row in table.tbody.find_all('tr'):
    columns = row.find_all('td')
    Player = columns[1].text.strip()
    Team = columns[2].text.strip()
    GamesPlayed = columns[3].text.strip()
    Goals = columns[4].text.strip()
    Assists = columns[5].text.strip()
    TotalPoints = columns[6].text.strip()
    PointsPerGame = columns[7].text.strip()
    PIM = columns[8].text.strip()
    PM = columns[9].text.strip()
    df = df.append({"Player": Player, "Team": Team, "GamesPlayed": GamesPlayed, "Goals": Goals, "Assists": Assists, "TotalPoints": TotalPoints, "PointsPerGame": PointsPerGame, "PIM": PIM, "PM": PM}, ignore_index=True)
To load all player stats into a dataframe and save them to CSV, you can use the following example:
import requests
import pandas as pd
from bs4 import BeautifulSoup
dfs = []
for page in range(1, 11):
    url = f"https://www.eliteprospects.com/league/nhl/stats/2021-2022?sort=tp&page={page}"
    print(f"Loading {url=}")
    soup = BeautifulSoup(requests.get(url).content, "html.parser")
    df = (
        pd.read_html(str(soup.select_one(".player-stats")))[0]
        .dropna(how="all")
        .reset_index(drop=True)
    )
    dfs.append(df)
df_final = pd.concat(dfs).reset_index(drop=True)
print(df_final)
df_final.to_csv("data.csv", index=False)
Prints:
...
1132 973.0 Austin Poganski (RW) Winnipeg Jets 16 0 0 0 0.00 7 -3.0
1133 974.0 Mikhail Maltsev (LW) Colorado Avalanche 18 0 0 0 0.00 2 -5.0
1134 975.0 Mason Geertsen (D/LW) New Jersey Devils 23 0 0 0 0.00 62 -4.0
1135 976.0 Jack McBain (C) Arizona Coyotes - - - - - - NaN
1136 977.0 Jordan Harris (D) Montréal Canadiens - - - - - - NaN
1137 978.0 Nikolai Knyzhov (D) San Jose Sharks - - - - - - NaN
1138 979.0 Marc McLaughlin (C) Boston Bruins - - - - - - NaN
1139 980.0 Carson Meyer (RW) Columbus Blue Jackets - - - - - - NaN
1140 981.0 Leon Gawanke (D) Winnipeg Jets - - - - - - NaN
1141 982.0 Brady Keeper (D) Vancouver Canucks - - - - - - NaN
1142 983.0 Miles Wood (LW) New Jersey Devils - - - - - - NaN
1143 984.0 Samuel Morin (D/LW) Philadelphia Flyers - - - - - - NaN
1144 985.0 Connor Carrick (D) Seattle Kraken - - - - - - NaN
1145 986.0 Micheal Ferland (LW/RW) Vancouver Canucks - - - - - - NaN
1146 987.0 Jake Gardiner (D) Carolina Hurricanes - - - - - - NaN
1147 988.0 Oscar Klefbom (D) Edmonton Oilers - - - - - - NaN
1148 989.0 Shea Weber (D) Montréal Canadiens - - - - - - NaN
1149 990.0 Brandon Sutter (C/RW) Vancouver Canucks - - - - - - NaN
1150 991.0 Brent Seabrook (D) Tampa Bay Lightning - - - - - - NaN
and saves data.csv.
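As for the first part of the original question (the multiple <tbody> elements): table.tbody only gives the first body, so a direct fix for the row loop is to collect rows from every <tbody>. A minimal sketch against the original table object:

# gather rows from every <tbody>, not just the first
rows = []
for tbody in table.find_all("tbody"):
    rows.extend(tbody.find_all("tr"))
# then iterate over `rows` exactly as in the original loop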

Merge 2 columns in python

I need to do the same thing as this line: df_g['Bidfloor'] = df_g[['Sitio', 'Country']].merge(df_seg, how='left').Precio, but matching on only the first 2 characters of Country instead of the exact same value, because I can't change the language of the data. In other words, I want the merge to read only the first 2 characters (the country code) of the Country column instead of the whole string.
df_g:
Sitio,Country
Los Andes Online,HN - Honduras
Guarda14,US - Estados Unidos
Guarda14,PE - Peru
df_seg:
Sitio,Country,Precio
Los Andes Online,HN - Honduras,0.5
Guarda14,US - United States,2.1
What I need:
Sitio,Country,Bidfloor
Los Andes Online,HN - Honduras,0.5
Guarda14,US - United States,2.1
Guarda14,PE - Peru,NULL
You need an additional key to help the merge; here cumcount is used to distinguish repeated values:
# here df1 is df_g and df2 is df_seg
df1.assign(key=df1.groupby('Sitio').cumcount()).\
    merge(df2.assign(key=df2.groupby('Sitio').cumcount())
             .drop(columns='Country'),
          how='left',
          on=['Sitio', 'key'])
Out[1491]:
Sitio Country key Precio
0 Los Andes Online HN - Honduras 0 0.5
1 Guarda14 US - Estados Unidos 0 2.1
2 Guarda14 PE - Peru 1 NaN
Just add and drop a merge column and you are done:
df_seg['merge_col'] = df_seg.Country.apply(lambda x: x.split('-')[0])
df_g['merge_col'] = df_g.Country.apply(lambda x: x.split('-')[0])
then do:
df = pd.merge(df_g, df_seg[['merge_col', 'Precio']], on='merge_col', how='left').drop(columns='merge_col')
returns
Sitio Country Precio
0 Los Andes Online HN - Honduras 0.5
1 Guarda14 US - Estados Unidos 2.1
2 Guarda14 PE - Peru NaN
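A self-contained variant of the same idea, as a sketch (the frames are reconstructed from the question): build the key with the vectorized .str accessor, strip the whitespace around the country code, and merge on Sitio plus that key.

import pandas as pd

df_g = pd.DataFrame({"Sitio": ["Los Andes Online", "Guarda14", "Guarda14"],
                     "Country": ["HN - Honduras", "US - Estados Unidos", "PE - Peru"]})
df_seg = pd.DataFrame({"Sitio": ["Los Andes Online", "Guarda14"],
                       "Country": ["HN - Honduras", "US - United States"],
                       "Precio": [0.5, 2.1]})

# merge key: the two-letter country code, surrounding whitespace stripped
df_g["key"] = df_g["Country"].str.split("-").str[0].str.strip()
df_seg["key"] = df_seg["Country"].str.split("-").str[0].str.strip()

df_g["Bidfloor"] = df_g.merge(df_seg[["Sitio", "key", "Precio"]],
                              on=["Sitio", "key"], how="left")["Precio"]
print(df_g.drop(columns="key"))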

Selecting rows by last 3 characters in a column with strings

I have this dataframe
name year ...
0 Carlos - xyz 2019
1 Marcos - yws 2031
3 Fran - xxz 2431
4 Matt - yre 1985
...
I want to create a new column, called type.
If a person's name ends with "xyz" or "xxz", I want type to be "big".
So, it should look like this:
name year type
0 Carlos - xyz 2019 big
1 Marcos - yws 2031
3 Fran - xxz 2431 big
4 Matt - yre 1985
...
Any suggestions?
Option 1
Use str.contains to generate a mask:
m = df.name.str.contains(r'x[yx]z$')
Or,
sub_str = ['xyz', 'xxz']
m = df.name.str.contains(r'{}$'.format('|'.join(sub_str)))
Now, you may either create your column with np.where,
df['type'] = np.where(m, 'big', '')
Or, use loc in place of np.where:
df['type'] = ''
df.loc[m, 'type'] = 'big'
df
name year type
0 Carlos - xyz 2019 big
1 Marcos - yws 2031
3 Fran - xxz 2431 big
4 Matt - yre 1985
Option 2
As an alternative, consider str.endswith + np.logical_or.reduce:
sub_str = ['xyz', 'xxz']
m = np.logical_or.reduce([df.name.str.endswith(s) for s in sub_str])
df['type'] = ''
df.loc[m, 'type'] = 'big'
df
name year type
0 Carlos - xyz 2019 big
1 Marcos - yws 2031
3 Fran - xxz 2431 big
4 Matt - yre 1985
Here is one way using pandas.Series.str.
df = pd.DataFrame([['Carlos - xyz', 2019], ['Marcos - yws', 2031],
                   ['Fran - xxz', 2431], ['Matt - yre', 1985]],
                  columns=['name', 'year'])
df['type'] = np.where(df['name'].str[-3:].isin({'xyz', 'xxz'}), 'big', '')
Alternatively, you can use .loc accessor instead of numpy.where:
df['type'] = ''
df.loc[df['name'].str[-3:].isin({'xyz', 'xxz'}), 'type'] = 'big'
Result
name year type
0 Carlos - xyz 2019 big
1 Marcos - yws 2031
2 Fran - xxz 2431 big
3 Matt - yre 1985
Explanation
Extract last 3 letters using pd.Series.str.
Compare to a specified set of values for O(1) complexity lookup.
Use numpy.where to perform conditional assignment for the new series.
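For completeness: recent pandas versions (1.4+, if I recall correctly) let Series.str.endswith accept a tuple of suffixes directly, which shortens Option 2. A sketch:

import numpy as np
import pandas as pd

df = pd.DataFrame({'name': ['Carlos - xyz', 'Marcos - yws', 'Fran - xxz', 'Matt - yre'],
                   'year': [2019, 2031, 2431, 1985]})

# endswith with a tuple of suffixes; no logical_or.reduce needed
df['type'] = np.where(df['name'].str.endswith(('xyz', 'xxz')), 'big', '')
print(df)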
