Web scraping with XPath - Python

So I want to obtain the name of each player in every football club in the Premier League from Transfermarkt.
The page I am testing against is: https://www.transfermarkt.co.uk/ederson/profil/spieler/238223
I have found the XPath to be:
//*[@id="main"]/div[10]/div[1]/div[2]/div[2]/div[2]/div[2]/table/tbody/tr[1]/td
Keep in mind that I have to use XPath due to the structure of the HTML, and that I have to loop over all the players in a club, for all the clubs in the Premier League. I have already obtained the links through this code:
# Create empty lists for the player links
playerLink1 = []
playerLink2 = []
playerLink3 = []

# For each team link page...
for i in range(len(Full_Links)):
    # ...download the team page and parse the HTML...
    squadPage = requests.get(Full_Links[i], headers=headers)
    squadTree = squadPage.text
    SquadSoup = BeautifulSoup(squadTree, 'html.parser')
    # ...extract the player links...
    playerLocation = SquadSoup.find("div", {"class": "responsive-table"}).find_all("a", {"class": "spielprofil_tooltip"})
    for a in playerLocation:
        playerLink1.append(a['href'])
    # ...and keep only the unique ones
    [playerLink2.append(x) for x in playerLink1 if x not in playerLink2]

# For each player link within the team pages...
for j in range(len(playerLink2)):
    # ...save the link, complete with domain...
    temp2 = "https://www.transfermarkt.co.uk" + playerLink2[j]
    # ...and add the finished link to the playerLink3 list
    playerLink3.append(temp2)
The links end up in a list variable called "playerLink3_u".
How can I extract the name from each of these player pages?

I am not sure how to get the name with XPath, but you already have BS4 imported, so I have written some code that gets the player name from the URL you posted.
import requests
from bs4 import BeautifulSoup

# Fetch the player profile page (Transfermarkt needs a browser-like User-Agent header)
request_page = requests.get("http://www.transfermarkt.co.uk/ederson/profil/spieler/238223", headers={'User-Agent': "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.37"})
page_soup = BeautifulSoup(request_page.text, 'html.parser')

# The profile details are the <td> cells of the table with class 'auflistung'
player_table = page_soup.find('table', {'class': 'auflistung'})
table_data = player_table.findAll('td')
print('Name: ', table_data[0].text)
print('Date Of Birth: ', table_data[1].text)
print('Place Of Birth: ', table_data[2].text)
This will print the name, date of birth, and place of birth.
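Since every profile page shares that layout, a minimal sketch of the loop over your link list could look like this (assuming the playerLink3 list and the headers dict from your code above):
player_names = []
for link in playerLink3:
    page = requests.get(link, headers=headers)
    soup = BeautifulSoup(page.text, 'html.parser')
    player_table = soup.find('table', {'class': 'auflistung'})
    if player_table is None:
        # skip pages that don't follow the usual profile layout
        continue
    table_data = player_table.findAll('td')
    player_names.append(table_data[0].text.strip())
print(player_names)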

Related

How do I access a table that's split into different pages but the same URL on Transfermarkt?

I'm having an issue where I can access the first page of the table but not the rest. When I click on, say, tab 2, it shows players 26-50, but I can't scrape it because it is the same URL rather than a different one. Is there a way to edit my code so that I can get all the pages of the table?
import requests
from bs4 import BeautifulSoup
import pandas as pd

headers = {'User-Agent':
           'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36'}

page = "https://www.transfermarkt.us/premier-league/transferrekorde/wettbewerb/GB1/plus/1/galerie/0?saison_id=2021&land_id=alle&ausrichtung=alle&spielerposition_id=alle&altersklasse=alle&leihe=&w_s=s&zuab=0"
pageTree = requests.get(page, headers=headers)
pageSoup = BeautifulSoup(pageTree.content, 'html.parser')

# Transfer fees
TransferPrice = pageSoup.find_all("td", {"class": "rechts hauptlink"})
transfer_prices = []
cleaned_transfer_prices = []
for i in TransferPrice:
    transfer_prices.append(i.text)
for i in transfer_prices:
    i = i[1:-1]          # strip the currency symbol and the trailing 'm'
    i = float(i)
    cleaned_transfer_prices.append(i)

# Player, old club and new club all share the 'hauptlink' class
some_list = []
for td_tag in pageSoup.find_all("td", {"class": "hauptlink"}):
    a_tag = td_tag.find('a')
    if a_tag is not None:
        some_list.append(a_tag.text)

players = []
team_left = []
team_gone_to = []
for i in range(0, len(some_list), 3):
    players.append(some_list[i])
for i in range(1, len(some_list), 3):
    team_left.append(some_list[i])
for i in range(2, len(some_list), 3):
    team_gone_to.append(some_list[i])

df_2 = pd.DataFrame()
df_2['Player Name'] = players
df_2['Team Left'] = team_left
df_2['New Team'] = team_gone_to
df_2['Transfer Price'] = cleaned_transfer_prices
df_2.index += 1
df_2
You are using the wrong link here. You should use this one instead:
https://www.transfermarkt.us/premier-league/transferrekorde/wettbewerb/GB1/ajax/yw1/saison_id/2021/land_id/alle/ausrichtung/alle/spielerposition_id/alle/altersklasse/alle/leihe//w_s/s/zuab/0/plus/1/galerie/0/page/2?ajax=yw1
Pay attention to the ajax part of it: the dynamic content you need is loaded via AJAX.
The last part of the link is page/2?ajax=yw1. That is how you solve the problem you describe: change the page number (here it is 2, the second page) to any page you need, for example with an f-string.
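For example, a minimal sketch of that loop (the upper page number is an assumption; adjust it to the number of pages in the table):
all_rows = []
for page_number in range(1, 5):  # assumed number of pages - adjust as needed
    url = (f"https://www.transfermarkt.us/premier-league/transferrekorde/wettbewerb/GB1"
           f"/ajax/yw1/saison_id/2021/land_id/alle/ausrichtung/alle/spielerposition_id/alle"
           f"/altersklasse/alle/leihe//w_s/s/zuab/0/plus/1/galerie/0/page/{page_number}?ajax=yw1")
    pageTree = requests.get(url, headers=headers)
    pageSoup = BeautifulSoup(pageTree.content, 'html.parser')
    # reuse the same parsing as in your code on each page's soup
    for td_tag in pageSoup.find_all("td", {"class": "hauptlink"}):
        a_tag = td_tag.find('a')
        if a_tag is not None:
            all_rows.append(a_tag.text)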

How to get all page results - Web Scraping - Pagination

I am a beginner when it comes to coding. Right now I am trying to get a grip on simple web scrapers using Python.
I want to scrape a real estate website and get the title, price, square meters, and so on into a CSV file.
My questions:
1. It seems to work for the first page of results, but then it repeats instead of running through all 40 pages; it just fills the file with the same results.
2. The listings have info about "square meters" and "number of rooms". When I inspect the page, it seems to use the same class for both elements. How would I extract the number of rooms, for example?
Here is the code that I have gathered so far:
import requests
from bs4 import BeautifulSoup
import pandas as pd

def extract(page):
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.5005.63 Safari/537.36'}
    url = f'https://www.immonet.de/immobiliensuche/sel.do?suchart=2&city=109447&marketingtype=1&pageoffset=1&radius=0&parentcat=2&sortby=0&listsize=26&objecttype=1&page={1}'
    r = requests.get(url, headers)
    soup = BeautifulSoup(r.content, 'html.parser')
    return soup

def transform(soup):
    divs = soup.find_all('div', class_ = 'col-xs-12 place-over-understitial sel-bg-gray-lighter')
    for item in divs:
        title = item.find('div', {'class': 'text-225'}).text.strip().replace('\n', '')
        title2 = title.replace('\t', '')
        hausart = item.find('span', class_ = 'text-100').text.strip().replace('\n', '')
        hausart2 = hausart.replace('\t', '')
        try:
            price = item.find('span', class_ = 'text-250 text-strong text-nowrap').text.strip()
        except:
            price = 'Auf Anfrage'
        wohnflaeche = item.find('p', class_ = 'text-250 text-strong text-nowrap').text.strip().replace('m²', '')
        angebot = {
            'title': title2,
            'hausart': hausart2,
            'price': price
        }
        hauslist.append(angebot)
    return

hauslist = []
for i in range(0, 40):
    print(f'Getting page {i}...')
    c = extract(i)
    transform(c)

df = pd.DataFrame(hauslist)
print(df.head())
df.to_csv('immonetHamburg.csv')
This is my first post on Stack Overflow, so please be kind if I should have posted my problem differently.
Thanks
Pat
You have a simple mistake: in the URL you have to use {page} instead of {1}. That's all.
url = f'https://www.immonet.de/immobiliensuche/sel.do?suchart=2&city=109447&marketingtype=1&pageoffset=1&radius=0&parentcat=2&sortby=0&listsize=26&objecttype=1&page={page}'
I see another problem: you start scraping at page 0, but servers often give the same result for page 0 and page 1.
You should use range(1, ...) instead of range(0, ...).
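Putting both fixes together, a minimal sketch of the corrected extract() and loop (the rest of your code stays the same):
def extract(page):
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.5005.63 Safari/537.36'}
    # use the page argument in the f-string instead of the literal {1}
    url = f'https://www.immonet.de/immobiliensuche/sel.do?suchart=2&city=109447&marketingtype=1&pageoffset=1&radius=0&parentcat=2&sortby=0&listsize=26&objecttype=1&page={page}'
    r = requests.get(url, headers=headers)
    return BeautifulSoup(r.content, 'html.parser')

# start at page 1, since page 0 and page 1 often return the same listings
for i in range(1, 41):
    print(f'Getting page {i}...')
    transform(extract(i))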
As for searching for elements:
BeautifulSoup can search not only by class but also by id and any other attribute of a tag, e.g. name, style, data attributes, etc. It can also search by text such as "number of rooms", and it can use a regex for this. You can even pass your own function, which checks an element and returns True/False to decide whether to keep it in the results.
You can also chain .find() with another .find() or .find_all().
price = item.find('div', {"id": lambda value: value and value.startswith('selPrice')}).find('span')
if price:
    print("price:", price.text)
And if you know that "square meters" comes before "number of rooms", then you can use find_all() to get both of them and later use [0] for the first and [1] for the second, as in the sketch below.
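A small sketch along those lines (assuming both values sit in <p> tags with the 'text-250' class, as in the answer below):
values = item.find_all('p', {'class': 'text-250'})  # assumed shared class for both fields
if len(values) >= 2:
    square_meters = values[0].text.strip()
    number_of_rooms = values[1].text.strip()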
You should read the whole documentation, because it can be very useful.
I advise you to use Selenium instead, because then you can physically click the 'next page' button until you have covered all the pages, and the whole code only takes a few lines; a rough sketch is below.
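A rough sketch of that approach (the CSS selector for the next-page button is a placeholder; look it up in the page's markup):
import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup

driver = webdriver.Chrome()
driver.get('https://www.immonet.de/immobiliensuche/sel.do?suchart=2&city=109447&marketingtype=1&pageoffset=1&radius=0&parentcat=2&sortby=0&listsize=26&objecttype=1&page=1')
while True:
    transform(BeautifulSoup(driver.page_source, 'html.parser'))  # reuse the transform() from the question
    next_buttons = driver.find_elements(By.CSS_SELECTOR, 'a.pagination-next')  # placeholder selector
    if not next_buttons:
        break
    next_buttons[0].click()
    time.sleep(2)  # give the next page time to load
driver.quit()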
As @furas mentioned, you have a mistake with the page number.
To get the rooms you need find_all() and then take the last element with [-1], because sometimes there are 3 items and sometimes 2.
# remove all \n and \t characters
translator = str.maketrans({chr(10): '', chr(9): ''})
rooms = item.find_all('p', {'class': 'text-250'})
if rooms:
    rooms = rooms[-1].text.translate(translator).strip()

How to change the code to asynchronously iterate over links and IDs when scraping web pages?

I have a list of links, and each link has an ID that is in the Id list.
How do I change the code so that, when iterating over the links, the corresponding ID is substituted into the string?
All the code is below:
import pandas as pd
from bs4 import BeautifulSoup
import requests

HEADERS = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) '
                         'Chrome/81.0.4044.138 Safari/537.36 OPR/68.0.3618.125', 'accept': '*/*'}

links = ['https://www..ie', 'https://www..ch', 'https://www..com']
Id = ['164240372761e5178f0488d', '164240372661e5178e1b377', '164240365661e517481a1e6']

def get_html(url, params=None):
    r = requests.get(url, headers=HEADERS, params=params)
    return r

def get_data_no_products(html):
    data = []
    soup = BeautifulSoup(html, 'html.parser')
    items = soup.find_all('div', id='')  # How do I substitute the corresponding id here?
    for item in items:
        data.append({'pn': item.find('a').get('href')})
    return print(data)

def parse():
    for i in links:
        html = get_html(i)
        get_data_no_products(html.text)

parse()
Parametrise your code:
def get_data_no_products(html, id_):
    data = []
    soup = BeautifulSoup(html, 'html.parser')
    items = soup.find_all('div', id=id_)
And then use zip():
for link, id_ in zip(links, Id):
    html = get_html(link)
    get_data_no_products(html.text, id_)
Note that there is a likely bug in your code: you return print(data), which will always be None. You probably just want to return data.
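Putting it together, a minimal corrected sketch (keeping your get_html() and returning data instead of print(data)):
def get_data_no_products(html, id_):
    data = []
    soup = BeautifulSoup(html, 'html.parser')
    items = soup.find_all('div', id=id_)
    for item in items:
        data.append({'pn': item.find('a').get('href')})
    return data

def parse():
    all_data = []
    for link, id_ in zip(links, Id):
        html = get_html(link)
        all_data.extend(get_data_no_products(html.text, id_))
    return all_data

print(parse())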
PS
There is another solution to this, which you will frequently see from people starting out in Python:
for i in range(len(links)):
    link = links[i]
    id_ = Id[i]
    ...
This... works. It might even feel easier or more natural if you are coming from, e.g., C (then again, I'd probably be using pointers there). Style is very much personal, but if you are going to write in a high-level language like Python, you might as well avoid thinking about things like 'the index of the current item' as much as possible. Just my £0.02.

Scrape table fields from HTML with a specific class

So I want to build a simple scraper for Google Shopping and I have encountered some problems.
This is the HTML from my request (to https://www.google.es/shopping/product/7541391777504770249/online), where I'm trying to query the element with class sh-osd__total-price inside the row with class sh-osd__offer-row:
My code is currently:
from bs4 import BeautifulSoup
from requests import get
url = 'https://www.google.es/shopping/product/7541391777504770249/online'
response = get(url)
html_soup = BeautifulSoup(response.text, 'html.parser')
r = html_soup.findAll('tr', {'class': 'sh-osd__offer-row'}) #Returns empty
print(r)
r = html_soup.findAll('tr', {'class': 'sh-osd__total-price'}) #Returns empty
print(r)
Both r results are empty; Beautiful Soup doesn't find anything.
Is there any way to find these two classes with Beautiful Soup?
You need to add a user agent to the headers:
from bs4 import BeautifulSoup
from requests import get

url = 'https://www.google.es/shopping/product/7541391777504770249/online'
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36'}  # <-- added line
response = get(url, headers=headers)  # <-- include here
html_soup = BeautifulSoup(response.text, 'html.parser')

r = html_soup.find_all('tr', {'class': 'sh-osd__offer-row'})
print(r)
r = html_soup.find_all('tr', {'class': 'sh-osd__total-price'})
print(r)
But since it's a <table> tag, you can use pandas (which uses BeautifulSoup under the hood) and let it do the hard work for you. It will return a list of all the <table> elements as DataFrames:
import pandas as pd
url = 'https://www.google.es/shopping/product/7541391777504770249/online'
dfs = pd.read_html(url)
print(dfs[-1])
Output:
print(dfs[-1])
Sellers Seller Rating ... Base Price Total Price
0 One Fragance No rating ... £30.95 +£8.76 delivery £39.71
1 eBay No rating ... £46.81 £46.81
2 Carethy.co.uk No rating ... £34.46 +£3.99 delivery £38.45
3 fruugo.co.uk No rating ... £36.95 +£9.30 delivery £46.25
4 cosmeticsmegastore.com/gb No rating ... £36.95 +£9.30 delivery £46.25
5 Perfumes Club UK No rating ... £30.39 +£5.99 delivery £36.38
[6 rows x 5 columns]
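If pd.read_html(url) gets blocked for the same missing user-agent reason, a small variant (a sketch, reusing the headers defined above) is to fetch the page with requests first and parse the downloaded HTML:
response = get(url, headers=headers)
dfs = pd.read_html(response.text)
print(dfs[-1])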

Web-scraping data from a graph

I am working with lobbying data from opensecrets.org, in particular industry data. I want a time series of lobbying expenditures for each industry going back to the 90s.
I want to web-scrape the data automatically. The URLs where the data lives have the following format:
https://www.opensecrets.org/lobby/indusclient.php?id=H04&year=2019
which is pretty easy to embed in a loop. The problem is that the data I need is not in an easy format on the web page: it is inside a bar graph, and when I inspect the graph I do not know how to get the data, since it is not in the HTML code. I am familiar with web scraping in Python when the data is in the HTML, but in this case I am not sure how to proceed.
If there is an API, that's your best bet, as mentioned above. But the data can be parsed anyway, provided you use the right URL/query parameters.
I've managed to iterate through the links for you to grab each table. I stored the results in a dictionary with the key being the firm name and the value being the table/data. You can change that to whatever you like: maybe just store it as JSON, or save each table as a CSV.
Code:
import requests
from bs4 import BeautifulSoup
import pandas as pd

url = 'https://www.opensecrets.org/lobby/indusclient.php?id=H04&year=2019'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36'}

data = requests.get(url, headers=headers)
soup = BeautifulSoup(data.text, 'html.parser')
links = soup.find_all('a', href=True)

# The graph data is served by this endpoint
root_url = 'https://www.opensecrets.org/lobby/include/IMG_client_year_comp.php?'

links_dict = {}
for each in links:
    if 'clientsum.php?' in each['href']:
        firms = each.text
        link = root_url + each['href'].split('?')[-1].split('&')[0].strip() + '&type=c'
        links_dict[firms] = link

all_tables = {}
n = 1
tot = len(links_dict)
for firms, link in links_dict.items():
    print('%s of %s ---- %s' % (n, tot, firms))
    data = requests.get(link)
    soup = BeautifulSoup(data.text, 'html.parser')

    results = pd.DataFrame()
    graph = soup.find_all('set')  # each <set> element holds one bar of the graph
    for each in graph:
        year = each['label']
        total = each['value']
        temp_df = pd.DataFrame([[year, total]], columns=['year', '$mil'])
        results = results.append(temp_df, sort=True).reset_index(drop=True)

    all_tables[firms] = results
    n += 1
Output:
Not printing it here, as there are 347 tables, but each value in all_tables is a DataFrame with 'year' and '$mil' columns.
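If you would rather save each table as a CSV, as mentioned above, a small sketch (deriving file names from the firm names is an assumption; adjust as you like):
for firms, table in all_tables.items():
    safe_name = "".join(c if c.isalnum() else "_" for c in firms)  # make a filesystem-safe file name
    table.to_csv(f'{safe_name}.csv', index=False)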
