I'm trying to scrape the data from the top table on this page ("2021-2022 Regular Season Player Stats") using Python and BeautifulSoup. The page shows stats for 100 NHL players, 1 player per row. The code below works, but the problem is that it only pulls the first ten rows into the dataframe. This is because every ten rows are wrapped in a separate <tbody>, so the loop only iterates through the rows of the first <tbody>. How can I get it to continue through the rest of the <tbody> elements on the page?
Another question: this table has about 1,000 rows in total, but only shows up to 100 per page. Is there a way to rewrite the code below to iterate through the entire table at once instead of just the 100 rows shown on the page?
from bs4 import BeautifulSoup
import requests
import pandas as pd
url = 'https://www.eliteprospects.com/league/nhl/stats/2021-2022'
source = requests.get(url).text
soup = BeautifulSoup(source,'html.parser')
table = soup.find('table', class_='table table-striped table-sortable player-stats highlight-stats season')
df = pd.DataFrame(columns=['Player', 'Team', 'GamesPlayed', 'Goals', 'Assists', 'TotalPoints', 'PointsPerGame', 'PIM', 'PM'])
for row in table.tbody.find_all('tr'):
    columns = row.find_all('td')
    Player = columns[1].text.strip()
    Team = columns[2].text.strip()
    GamesPlayed = columns[3].text.strip()
    Goals = columns[4].text.strip()
    Assists = columns[5].text.strip()
    TotalPoints = columns[6].text.strip()
    PointsPerGame = columns[7].text.strip()
    PIM = columns[8].text.strip()
    PM = columns[9].text.strip()
    df = df.append({"Player": Player, "Team": Team, "GamesPlayed": GamesPlayed, "Goals": Goals, "Assists": Assists, "TotalPoints": TotalPoints, "PointsPerGame": PointsPerGame, "PIM": PIM, "PM": PM}, ignore_index=True)
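For the first part of the question, a minimal sketch (reusing the table variable and the pandas import from the code above, and assuming every data row has the same ten cells): loop over table.find_all('tbody') instead of table.tbody, so every group of ten rows is visited.

rows = []
for tbody in table.find_all('tbody'):
    for row in tbody.find_all('tr'):
        columns = row.find_all('td')
        if len(columns) < 10:
            continue  # skip any spacer or header rows inside a tbody
        rows.append({
            "Player": columns[1].text.strip(),
            "Team": columns[2].text.strip(),
            "GamesPlayed": columns[3].text.strip(),
            "Goals": columns[4].text.strip(),
            "Assists": columns[5].text.strip(),
            "TotalPoints": columns[6].text.strip(),
            "PointsPerGame": columns[7].text.strip(),
            "PIM": columns[8].text.strip(),
            "PM": columns[9].text.strip(),
        })
df = pd.DataFrame(rows)

Building a list of dicts and constructing the DataFrame once is also faster than calling df.append inside the loop. The answer below addresses the second question (paginating through all of the roughly 1,000 rows).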
To load all player stats into a dataframe and save them to CSV, you can use the following example:
import requests
import pandas as pd
from bs4 import BeautifulSoup
dfs = []
for page in range(1, 11):
    url = f"https://www.eliteprospects.com/league/nhl/stats/2021-2022?sort=tp&page={page}"
    print(f"Loading {url=}")
    soup = BeautifulSoup(requests.get(url).content, "html.parser")
    df = (
        pd.read_html(str(soup.select_one(".player-stats")))[0]
        .dropna(how="all")
        .reset_index(drop=True)
    )
    dfs.append(df)

df_final = pd.concat(dfs).reset_index(drop=True)
print(df_final)
df_final.to_csv("data.csv", index=False)
Prints:
...
1132 973.0 Austin Poganski (RW) Winnipeg Jets 16 0 0 0 0.00 7 -3.0
1133 974.0 Mikhail Maltsev (LW) Colorado Avalanche 18 0 0 0 0.00 2 -5.0
1134 975.0 Mason Geertsen (D/LW) New Jersey Devils 23 0 0 0 0.00 62 -4.0
1135 976.0 Jack McBain (C) Arizona Coyotes - - - - - - NaN
1136 977.0 Jordan Harris (D) Montréal Canadiens - - - - - - NaN
1137 978.0 Nikolai Knyzhov (D) San Jose Sharks - - - - - - NaN
1138 979.0 Marc McLaughlin (C) Boston Bruins - - - - - - NaN
1139 980.0 Carson Meyer (RW) Columbus Blue Jackets - - - - - - NaN
1140 981.0 Leon Gawanke (D) Winnipeg Jets - - - - - - NaN
1141 982.0 Brady Keeper (D) Vancouver Canucks - - - - - - NaN
1142 983.0 Miles Wood (LW) New Jersey Devils - - - - - - NaN
1143 984.0 Samuel Morin (D/LW) Philadelphia Flyers - - - - - - NaN
1144 985.0 Connor Carrick (D) Seattle Kraken - - - - - - NaN
1145 986.0 Micheal Ferland (LW/RW) Vancouver Canucks - - - - - - NaN
1146 987.0 Jake Gardiner (D) Carolina Hurricanes - - - - - - NaN
1147 988.0 Oscar Klefbom (D) Edmonton Oilers - - - - - - NaN
1148 989.0 Shea Weber (D) Montréal Canadiens - - - - - - NaN
1149 990.0 Brandon Sutter (C/RW) Vancouver Canucks - - - - - - NaN
1150 991.0 Brent Seabrook (D) Tampa Bay Lightning - - - - - - NaN
and saves the data to data.csv.
Related
I'm using BeautifulSoup and trying to scrape some cars24.com data. However, the resulting list only contains details for 20 cars. That's odd, since the page contains many more car details (I checked by saving the page). What am I doing wrong, and how can I get it to scrape the whole page?
This is my code:
from bs4 import BeautifulSoup as bs
import requests
link = 'https://www.cars24.com/buy-used-car?sort=P&storeCityId=2&pinId=110001'
page=requests.get(link)
soup = bs(page.content,'html.parser')
car_name = soup.find_all('h2',class_='_3FpCg')
cust_name = []
for i in range(0, len(car_name)):
    cust_name.append(car_name[i].get_text())
cust_name
Is there a workaround for this? Appreciate the help.
Use the API endpoint.
For example:
import requests
url = "https://api-sell24.cars24.team/buy-used-car?sort=P&serveWarrantyCount=true&gaId=&page=1&storeCityId=2&pinId=110001"
cars = requests.get(url).json()['data']['content']
base = "https://www.cars24.com/buy-used-"
for car in cars:
    car_name = "-".join(car['carName'].lower().split())
    car_city = "-".join(car['city'].lower().split())
    offer = f"{base}{car_name}-{car['year']}-cars-{car_city}-{car['carId']}"
    print(f"{car['carName']} - {car['year']} - {car['price']}")
    print(offer)
Output:
Maruti Swift Dzire - 2010 - 256299
https://www.cars24.com/buy-used-maruti-swift-dzire-2010-cars-new-delhi-10084891724
Hyundai Grand i10 - 2018 - 526599
https://www.cars24.com/buy-used-hyundai-grand-i10-2018-cars-noida-10572294761
Datsun Redi Go - 2018 - 234499
https://www.cars24.com/buy-used-datsun-redi-go-2018-cars-gurgaon-11073694705
Maruti Swift - 2020 - 566499
https://www.cars24.com/buy-used-maruti-swift-2020-cars-faridabad-11041770770
Hyundai i10 - 2009 - 170699
https://www.cars24.com/buy-used-hyundai-i10-2009-cars-rohtak-1007315463
Maruti Swift - 2020 - 577399
https://www.cars24.com/buy-used-maruti-swift-2020-cars-new-delhi-10065678773
Hyundai Grand i10 - 2018 - 508799
https://www.cars24.com/buy-used-hyundai-grand-i10-2018-cars-ghaziabad-11261195767
Maruti Swift - 2020 - 587599
https://www.cars24.com/buy-used-maruti-swift-2020-cars-new-delhi-10016194709
Maruti Swift - 2020 - 524099
https://www.cars24.com/buy-used-maruti-swift-2020-cars-new-delhi-10010390743
Hyundai AURA - 2021 - 675099
https://www.cars24.com/buy-used-hyundai-aura-2021-cars-faridabad-11095494760
Maruti Swift - 2019 - 541899
https://www.cars24.com/buy-used-maruti-swift-2019-cars-new-delhi-10016570794
Hyundai Grand i10 - 2019 - 490449
https://www.cars24.com/buy-used-hyundai-grand-i10-2019-cars-noida-10532691707
Hyundai Santro Xing - 2013 - 281999
https://www.cars24.com/buy-used-hyundai-santro-xing-2013-cars-gurgaon-10168291760
Hyundai Santro Xing - 2014 - 272099
https://www.cars24.com/buy-used-hyundai-santro-xing-2014-cars-gurgaon-10121974770
Mercedes Benz C Class - 2014 - 1854499
https://www.cars24.com/buy-used-mercedes-benz-c-class-2014-cars-new-delhi-1050064264
KIA CARENS - 2022 - 1608099
https://www.cars24.com/buy-used-kia-carens-2022-cars-gurgaon-10160777793
Tata ALTROZ - 2021 - 711599
https://www.cars24.com/buy-used-tata-altroz-2021-cars-new-delhi-10083196703
Maruti New Wagon-R - 2020 - 508899
https://www.cars24.com/buy-used-maruti-new-wagon-r-2020-cars-new-delhi-10084875775
Hyundai Grand i10 - 2018 - 509099
https://www.cars24.com/buy-used-hyundai-grand-i10-2018-cars-new-delhi-10011277773
Maruti Wagon R 1.0 - 2011 - 282499
https://www.cars24.com/buy-used-maruti-wagon-r-1.0-2011-cars-new-delhi-10080499706
Note: You can paginate the API by incrementing the value of page in the URL.
For example:
import requests
import pandas as pd
base = "https://www.cars24.com/buy-used-"
table = []
with requests.Session() as s:
    for page in range(1, 11):
        url = f"https://api-sell24.cars24.team/buy-used-car?sort=P&serveWarrantyCount=true&gaId=&page={page}&storeCityId=2&pinId=110001"
        cars = s.get(url).json()['data']['content']
        print(f"Getting page {page}...")
        for car in cars:
            car_name = "-".join(car['carName'].lower().split())
            car_city = "-".join(car['city'].lower().split())
            offer_url = f"{base}{car_name}-{car['year']}-cars-{car_city}-{car['carId']}"
            table.append([car['carName'], car['year'], car['price'], offer_url])

df = pd.DataFrame(table, columns=['Car Name', 'Year', 'Price', 'Offer URL'])
df.to_csv('cars.csv', index=False)
Output: a cars.csv file containing the collected data.
Good evening.
I would like help with getting the information between two strings in pandas (Python).
Imagine that I have a database of car prices for each car dealer, in which each cell has text similar to this (note: the car dealer column can be the index of each row):
"1 - Ford (1) - new:R$60000 / used:R$30000 sedan car 2 - Mercedes -
Benz (1) - new:R$130000 / used:R$95000 silver sedan car 3 - Chery
(caoa) (1) - new:R$80000 / used:R$60000 SUV car 5 - Others (1) -
new:R$90000 / used:R$75500 hatch car"
Thanks for the help!!
Three-step process:
split the whole string into line items using re.split()
parse out the constituent parts of each line using pandas extract
finally shape the dataframe as wide...
import re
import pandas as pd
import numpy as np

s = "1 - Ford (1) - new:R$60000 / used:R$30000 sedan car 2 - Mercedes - Benz (1) - new:R$130000 / used:R$95000 silver sedan car 3 - Chery (caoa) (1) - new:R$80000 / used:R$60000 SUV car 5 - Others (1) - new:R$90000 / used:R$75500 hatch car"

# there is a car after each "<n> - ": split the string into one line per car
df = pd.DataFrame(re.split("[ ]?[0-9] - ", s)).replace("", np.nan).dropna()

# parse out each of the strings
df = df[0].str.extract("(?P<car>.*) \([0-9]\) - new:R\$(?P<new>[0-9]*) \/ used:R\$(?P<used>[0-9]*).*")

# finally format as wide format...
df = (df.melt()
        .assign(car=lambda dfa: dfa.groupby("variable").cumcount(),
                col=lambda dfa: dfa.variable + (dfa.car+1).astype(str))
        .drop(columns=["variable", "car"])
        .set_index("col")
        .T
     )
col     car1             car2          car3    car4   new1    new2   new3   new4  used1  used2  used3  used4
value   Ford  Mercedes - Benz  Chery (caoa)  Others  60000  130000  80000  90000  30000  95000  60000  75500
You could use extractall to get a MultiIndex dataframe that contains the dealer, the match number, and the values captured by the regex named groups. After extractall, use stack to reshape the dataframe by moving the innermost column level into the index; this lets you build a new index of the form [(dealer, carN), ...] and then groupby the first index level to keep the capturing order. Append each dealer's data to a list and create the dataframe from it.
import pandas as pd
import re

df = pd.DataFrame(
    ["1 - Ford (1) - new:R$60000 / used:R$30000 sedan car 2 - Mercedes - Benz (1) - new:R$130000 / used:R$95000 silver sedan car 3 - Chery (caoa) (1) - new:R$80000 / used:R$60000 SUV car 5 - Others (1) - new:R$90000 / used:R$75500 hatch car",
     "2 - Toyota (1) - new:R$10543 / used:R$9020 silver sedan car",
     "3 - Honda (1) - new:R$123600 / used:R$34400 sedan car 2 - Fiat (1) - new:R$1955 / used:R$877 silver sedan car 3 - Cadillac (1) - new:R$174500 / used:R$12999 SUV car"])

regex = re.compile(
    r"\d\s-\s(?P<car>.*?)(?:\s\(\d+\)?\s)-\s"
    r"new:R\$(?P<new>[\d\.\,]+)\s/\s"
    r"used:R\$(?P<used>[\d\.\,]+).*?car"
)

df_out = df[0].str.extractall(regex).stack()
df_out.index = [df_out.index.get_level_values(0),
                df_out.index.map(lambda x: f'{x[2]+str(x[1]+1)}')]

dealers = []
for n, g in df_out.groupby(level=0):
    dealers.append(g.droplevel(0))

df1 = pd.DataFrame(dealers).rename_axis('Dealer')
print(df1)
Output from df1
car1 new1 used1 car2 new2 used2 car3 new3 used3 car4 new4 used4
Dealer
0 Ford 60000 30000 Mercedes - Benz 130000 95000 Chery (caoa) 80000 60000 Others 90000 75500
1 Toyota 10543 9020 NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 Honda 123600 34400 Fiat 1955 877 Cadillac 174500 12999 NaN NaN NaN
I have the following data and I would like to know: who was the first and last customer that each Driver picked up on each day?
Data
This is how far I just got:
#Import libraries
import pandas as pd
import numpy as np
#Open and clean the data
df = pd.read_csv('Data.csv')
df = df.drop(['Cod'], axis=1)
df['Start'] = pd.to_datetime(df['Start'])
df['End'] = pd.to_datetime(df['End'])
#the following code is to respond the following question:
#Who was the first and last customer that each Driver picked-up for each day?
#link to access the data: https://drive.google.com/file/d/194byxNkgr2e9r-IOEmSuu9gpZyw27G7j/view?usp=sharing
unique_drivers = df['Driver'].value_counts()
for driver in unique_drivers:
    d = df.groupby('Driver').get_group(driver)
    time = d['Start'][0]
    first_customer = d['Customer'][0]
    end = d['End'][0]
    last_customer = d['Customer'][-1]
You can first sort by the column Start, which includes the hour and minutes, ensuring that multiple same-day events are ordered correctly for the next step. Then group the frame by Driver to find each driver's pickups for each day.
Using drop_duplicates, drop repeated dates with keep="first" to preserve only the first row of each day, and similarly with keep="last" to preserve only the last one. This produces, for each driver, the unique dates together with the first and the last pickup of each day; then use the index of those rows on the Customer column to get the customer's name.
import pandas as pd

df = pd.read_csv("data.csv")
print(df)

# sort including HH:MM
df = df.sort_values("Start")

drivers_df = []
for gname, group in df.groupby("Driver"):
    dn = pd.DataFrame()
    # split to get date and time in two columns
    ts = group["Start"].str.split(expand=True)
    # remove duplicate days keeping the first occurrence
    t_first = ts.drop_duplicates(subset=[0], keep="first")
    # remove duplicate days keeping the last occurrence
    t_last = ts.drop_duplicates(subset=[0], keep="last")
    dn["Date"] = t_first[0]
    dn["Driver"] = gname
    dn["Num_Customers"] = ts[0].groupby(ts[0]).count().values
    # use the previously obtained indices over the "Customer" column
    dn["First_Customer"] = df.loc[t_first.index, "Customer"].values
    dn["Last_Customer"] = df.loc[t_last.index, "Customer"].values
    drivers_df.append(dn)

dn = pd.concat(drivers_df)
# sort by date (remove this line to keep the rows grouped by driver's name)
dn = dn.sort_values("Date")
dn = dn.reset_index(drop=True)
print(dn)
Output from dn
Date Driver Num_Customers First_Customer Last_Customer
0 5/10/2020 Javier Pulgar 1 100998 - MARA MIRIAN BEATRIZ 100998 - MARA MIRIAN BEATRIZ
1 5/10/2020 Santiago Muruaga 1 103055 - ZANOTTO VALERIA 103055 - ZANOTTO VALERIA
2 5/10/2020 Martín Gutierrez 1 105645 - PAWLIW MARTA SOFI... 105645 - PAWLIW MARTA SOFI...
3 5/10/2020 Pablo Aguilar 2 102737 - GONZALVE DE ACEVEDO 102737 - GONZALVE DE ACEVEDO
4 5/10/2020 Carlos Medina 1 102750 - COOP.DE TRABAJO 102750 - COOP.DE TRABAJO
5 5/11/2020 Facundo Papaleo 6 101209 - FARMACIA NAZCA 2602 105093 - BIO HERLPER
6 5/11/2020 Franco Chiarrappa 15 100288 - SAVINI LUCIANA MARIA 102690 - GIOIA ELIZABETH
7 5/11/2020 Hernán Navarro 14 106367 - FARMACIA BERAPHAR... 102631 - SPALVIERI MARINA
8 5/11/2020 Pablo Aguilar 9 102510 - CAZADORFARM SCS 101482 - JOAQUIN MARCIAL
9 5/11/2020 Daniel Godino 7 103572 - GIRALDEZ ALICIA OLGA 103363 - CADELLI ROBERTO JOSE
10 5/11/2020 Hernán Urquiza 1 105323 - GARCIA GERMAN REI... 105323 - GARCIA GERMAN REI...
11 5/11/2020 Héctor Naselli 19 103545 - FARMACIA DESANTI 102257 - FARMA NUOVA S.C.S.
12 5/11/2020 Santiago Muruaga 12 101735 - ALEGRE LEONARDO 500014 - Drogueria DIMEC
13 5/11/2020 Javier Pulgar 2 101009 - MIGUEL ANGEL MARA 103462 - DRAGONE CARLOS AL...
14 5/11/2020 Atilano Aguilera 1 104003 - FARMACIA SANTA 104003 - FARMACIA SANTA
15 5/11/2020 Muletto 3 101359 - FARMACIA COSENTINO 105886 - NEGRI GREGORIO
16 5/11/2020 Martín Venturino 8 102587 - JANISZEWSKI MATIL... 102672 - BORSOTTI GUSTAVO
17 5/11/2020 Martín Gutierrez 1 105645 - PAWLIW MARTA SOFI... 105645 - PAWLIW MARTA SOFI...
18 5/11/2020 José Vallejos 13 102229 - LANDRIEL MARIA LO... 105721 - SOSA NANCY EDITH ...
19 5/11/2020 Edgardo Andrade 9 101524 - FARMACIA M Y A 101217 - MARISA TESORO
20 5/11/2020 Carlos Medina 14 105126 - QUISPE MURILLO RODY 100538 - MAXIMILIANO CAMPO...
21 5/11/2020 Javier Torales 1 200666 - CLINICA BOEDO SRL 200666 - CLINICA BOEDO SRL
22 5/12/2020 Hernán Urquiza 8 105293 - BENSAK MARIA EUGENIA 103005 - BONVISSUTO SANDRA
23 5/12/2020 Miguel Quilici 17 102918 - BRITO NICOLAS 102533 - SAMPEDRO PURA
...
...
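A more compact variant of the same idea, sketched with groupby and named aggregation (this assumes the same Start/Driver/Customer columns as in the question; because the frame is sorted by Start before grouping, "first" and "last" pick the chronologically first and last customers of each day):

import pandas as pd

df = pd.read_csv("data.csv")
df["Start"] = pd.to_datetime(df["Start"])
df["Date"] = df["Start"].dt.date

summary = (
    df.sort_values("Start")                       # chronological order within each day
      .groupby(["Date", "Driver"], as_index=False)
      .agg(Num_Customers=("Customer", "size"),
           First_Customer=("Customer", "first"),
           Last_Customer=("Customer", "last"))
)
print(summary)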
I am trying to scrape a JS-rendered site using Selenium and BeautifulSoup. The code works fine, but I need to run it on a server that doesn't have any GUI for Chrome. What should I change in the code so that it works without a GUI?
Below is the current code:
from bs4 import BeautifulSoup
import selenium.webdriver as webdriver
import json
from selenium.webdriver.common.keys import Keys
url = 'https://www.bigbasket.com/pc/fruits-vegetables/fresh-vegetables/?nc=nb'
chromepath = "/Users/Nitin/Desktop/Milkbasket/Scraping/chromedriver"
driver = webdriver.Chrome(chromepath)
driver.get(url)
#rest of code for fetching prices
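For the literal question (running Chrome without a display), a minimal sketch, assuming chromedriver is still installed on the server; chromepath and url are the variables from the question, and depending on the Selenium version the driver path may need to be wrapped in a Service object instead of being passed positionally:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless")             # run Chrome without a window
options.add_argument("--no-sandbox")           # commonly needed on Linux servers
options.add_argument("--disable-dev-shm-usage")

driver = webdriver.Chrome(chromepath, options=options)
driver.get(url)
# ...rest of the scraping code unchanged

That said, the answer below avoids the browser entirely.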
I would recommend you ditch the Selenium approach and work on getting the information you need using the built-in urllib library or, if possible, the requests library. The information for all of the products can be obtained from the returned JSON data. For example:
import requests

params = {
    "type": "pc",
    "slug": "fresh-vegetables",
    "tab_type": '["all"]',
    "sorted_on": "popularity",
    "listtype": "pc",
}

session = requests.Session()
for page in range(1, 10):
    params['page'] = page
    req_vegetables = session.get("https://www.bigbasket.com/product/get-products", params=params)
    json_vegetables = req_vegetables.json()
    print(f'Page {page}')
    for product in json_vegetables['tab_info']['product_map']['all']['prods']:
        print(f" {product['p_desc']} - {product['sp']} - {product['mrp']}")
This would give you the following output:
Page 1
Onion - 21.00 - 26.25
Potato - 27.00 - 33.75
Tomato - Hybrid - 40.00 - 50.00
Ladies Finger - 10.00 - 12.50
Cauliflower - 35.00 - 43.75
Palak - 30.00 - 37.50
Potato Onion Tomato 1 kg Each - 88.00 - 110.00
Carrot - Local - 59.00 - 73.75
Capsicum - Green - 89.00 - 111.25
Tomato - Local - 47.00 - 58.75
Mushrooms - Button - 49.00 - 61.25
Cucumber - 25.00 - 31.25
Broccoli - 18.40 - 23.00
Bottle Gourd - 17.00 - 21.25
Cabbage - 32.00 - 40.00
Cucumber - English - 23.00 - 28.75
Tomato - Local, Organically Grown - 29.00 - 36.25
Brinjal - Bottle Shape - 72.00 - 90.00
Onion - Organically Grown - 23.00 - 28.75
Methi - 19.00 - 23.75
Page 2
Bitter Gourd / Karela - 59.20 - 74.00
Beetroot - 40.00 - 50.00
Fresho Palak - Without Root 250 Gm + Amul Malai Paneer 200 Gm - 94.20 - 102.00
Capsicum - Red - 299.00 - 373.75
... etc
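If the goal is a CSV like in the earlier examples, a sketch that collects the same fields into pandas (same endpoint, parameters and JSON keys as above; the vegetables.csv file name is just an example):

import requests
import pandas as pd

params = {
    "type": "pc",
    "slug": "fresh-vegetables",
    "tab_type": '["all"]',
    "sorted_on": "popularity",
    "listtype": "pc",
}

rows = []
with requests.Session() as session:
    for page in range(1, 10):
        params["page"] = page
        data = session.get("https://www.bigbasket.com/product/get-products", params=params).json()
        for product in data["tab_info"]["product_map"]["all"]["prods"]:
            rows.append([product["p_desc"], product["sp"], product["mrp"]])

df = pd.DataFrame(rows, columns=["Product", "Selling Price", "MRP"])
df.to_csv("vegetables.csv", index=False)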
I need to do the same as what I can do with my function: df_g['Bidfloor'] = df_g[['Sitio', 'Country']].merge(df_seg, how='left').Precio, but matching the Country column on only its first 2 characters (the country code) instead of the exact value, because I can't change the language of the data. So I want to compare only the first 2 characters of the Country column instead of the whole string.
df_g:
Sitio,Country
Los Andes Online,HN - Honduras
Guarda14,US - Estados Unidos
Guarda14,PE - Peru
df_seg:
Sitio,Country,Precio
Los Andes Online,HN - Honduras,0.5
Guarda14,US - United States,2.1
What I need:
Sitio,Country,Bidfloor
Los Andes Online,HN - Honduras,0.5
Guarda14,US - United States,2.1
Guarda14,PE - Peru,NULL
You need an additional key to help the merge. I am using cumcount to distinguish the repeated values:
df1.assign(key=df1.groupby('Sitio').cumcount()).\
    merge(df2.assign(key=df2.groupby('Sitio').cumcount()).drop(columns='Country'),
          how='left',
          on=['Sitio', 'key'])
Out[1491]:
Sitio Country key Precio
0 Los Andes Online HN - Honduras 0 0.5
1 Guarda14 US - Estados Unidos 0 2.1
2 Guarda14 PE - Peru 1 NaN
Just add and drop a merge column and you are done:
df_seg['merge_col'] = df_seg.Country.apply(lambda x: x.split('-')[0])
df_g['merge_col'] = df_g.Country.apply(lambda x: x.split('-')[0])
then do:
df = pd.merge(df_g, df_seg[['merge_col', 'Precio']], on='merge_col', how='left').drop(columns='merge_col')
returns
Sitio Country Precio
0 Los Andes Online HN - Honduras 0.5
1 Guarda14 US - Estados Unidos 2.1
2 Guarda14 PE - Peru NaN
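A variant of the same idea, sketched with .str slicing to build the key from the two-letter country code (the frames are reproduced inline from the question; this assumes the code is always the first two characters of Country):

import pandas as pd
from io import StringIO

df_g = pd.read_csv(StringIO("""Sitio,Country
Los Andes Online,HN - Honduras
Guarda14,US - Estados Unidos
Guarda14,PE - Peru"""))
df_seg = pd.read_csv(StringIO("""Sitio,Country,Precio
Los Andes Online,HN - Honduras,0.5
Guarda14,US - United States,2.1"""))

# join key = first two characters of Country (the country code)
df_g['key'] = df_g['Country'].str[:2]
df_seg['key'] = df_seg['Country'].str[:2]

df_g['Bidfloor'] = df_g.merge(df_seg[['Sitio', 'key', 'Precio']],
                              on=['Sitio', 'key'], how='left')['Precio']
print(df_g.drop(columns='key'))

This keeps df_g's own Country text and leaves NaN (NULL) for the PE row.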