Unable to scrape website pages with unchanged URL - Python

I'm trying to get the names of all games on this website: https://slotcatalog.com/en/The-Best-Slots#anchorFltrList. To do so, I'm using the following code:
import requests
from bs4 import BeautifulSoup
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'}
url = "https://slotcatalog.com/en/The-Best-Slots#anchorFltrList"
page = requests.get(url, headers=headers)
soup = BeautifulSoup(page.content, 'html.parser')
data = []
table = soup.find_all('div', attrs={'class':'providerCard'})
for game in range(0,len(table)-1):
    print(table[game].find('a')['title'])
and I get what I want.
I would like to replicate the same across all pages available on the website, but given that the URL does not change, I looked at the network (XHR) events fired when clicking on a different page, and I tried to send the same request with the following code:
for page_no in range(1, 100):
    data = {
        "blck":"fltrGamesBlk",
        "ajax":"1",
        "lang":"end",
        "p":str(page_no),
        "translit":"The-Best-Slots",
        "tag":"TOP",
        "dt1":"",
        "dt2":"",
        "sorting":"SRANK",
        "cISO":"GB",
        "dt_period":"",
        "rtp_1":"50.00",
        "rtp_2":"100.00",
        "max_exp_1":"2.00",
        "max_exp_2":"250000.00",
        "min_bet_1":"0.01",
        "min_bet_2":"5.00",
        "max_bet_1":"3.00",
        "max_bet_2":"10000.00"
    }
    page = requests.post('https://slotcatalog.com/index.php',
                         data=data,
                         headers={'Host': 'slotcatalog.com',
                                  'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:82.0) Gecko/20100101 Firefox/82.0'
                                  })
    soup = BeautifulSoup(page.content, 'html.parser')
    for row in soup.find_all('div', attrs={'class':'providerCard'}):
        name = row.find('a')['title']
        print(name)
result : ("KeyError: 'title'") - meaning that its not finding the class "providerCard".
Has the request to the website been done in the wrong way? If so, where should i change the code?
thanks in advance

Alright, so, you had a typo. XD It was "lang": "end" in the payload; it should have been "lang": "en", among other things.
Anyhow, I've cleaned your code up a bit and it works as expected. You can keep looping for all the games if you want (see the sketch after the output below).
import requests
from bs4 import BeautifulSoup

headers = {
    "referer": "https://slotcatalog.com/en/The-Best-Slots",
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/50.0.2661.102 Safari/537.36",
    "x-requested-with": "XMLHttpRequest",
}
payload = {
    "blck": "fltrGamesBlk",
    "ajax": "1",
    "lang": "en",
    "p": 1,
    "translit": "The-Best-Slots",
    "tag": "TOP",
    "dt1": "",
    "dt2": "",
    "sorting": "SRANK",
    "cISO": "EN",
    "dt_period": "",
    "rtp_1": "50.00",
    "rtp_2": "100.00",
    "max_exp_1": "2.00",
    "max_exp_2": "250000.00",
    "min_bet_1": "0.01",
    "min_bet_2": "5.00",
    "max_bet_1": "3.00",
    "max_bet_2": "10000.00"
}
page = requests.post(
    "https://slotcatalog.com/index.php",
    data=payload,
    headers=headers,
)
soup = BeautifulSoup(page.content, "html.parser")
print([i.get("title") for i in soup.find_all("a", {"class": "providerName"})])
Output (for page 1 only):
['Starburst', 'Bonanza', 'Rainbow Riches', 'Book of Dead', "Fishin' Frenzy", 'Wolf Gold', 'Twin Spin', 'Slingo Rainbow Riches', "Gonzo's Quest", "Gonzo's Quest Megaways", 'Eye of Horus (Reel Time Gaming)', 'Age of the Gods God of Storms', 'Lightning Roulette', 'Buffalo Blitz', "Fishin' Frenzy Megaways", 'Fluffy Favourites', 'Blue Wizard', 'Legacy of Dead', '9 Pots of Gold', 'Buffalo Blitz II', 'Cleopatra (IGT)', 'Quantum Roulette', 'Reel King Mega', 'Mega Moolah', '7s Deluxe', "Rainbow Riches Pick'n'Mix", "Shaman's Dream"]
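To keep looping over all pages, a minimal sketch built on the payload and headers above (the upper bound of 100 is arbitrary; the loop simply stops once a page returns no more cards):
all_games = []
for page_no in range(1, 100):
    payload["p"] = page_no
    page = requests.post("https://slotcatalog.com/index.php", data=payload, headers=headers)
    soup = BeautifulSoup(page.content, "html.parser")
    names = [a.get("title") for a in soup.find_all("a", {"class": "providerName"})]
    if not names:  # an empty page means we are past the last one
        break
    all_games.extend(names)
print(len(all_games))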

Related

Created list from BeautifulSoup contains duplicate entries -- need a unique link list

I am trying to create a list containing all unique year links from a website (see below).
When I execute the append function it gives me a huge list containing duplicate entries.
I need to get a list containing only the unique year links.
The website : https://www.epant.gr/apofaseis-gnomodotiseis/itemlist/category/78-2021.html
Code written so far :
from bs4 import BeautifulSoup
import requests
import csv
import pandas as pd
import re
URL = 'https://www.epant.gr/apofaseis-gnomodotiseis/itemlist/category/78-2021.html'
headers1 = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36",
"X-Amzn-Trace-Id": "Root=1-61acac03-6279b8a6274777eb44d81aae",
"X-Client-Data": "CJW2yQEIpLbJAQjEtskBCKmdygEIuevKAQjr8ssBCOaEzAEItoXMAQjLicwBCKyOzAEI3I7MARiOnssB" }
page = requests.get(URL, headers = headers1)
soup = BeautifulSoup(page.content,"html.parser")
year = []
for link in soup.find_all('a', href=lambda href: href and "category" in href):
    print(link.get('href'))
    #year.append(link.get('href'))
    #print(year)
The desired result would look like this (but I need this in list format):
https://www.epant.gr/apofaseis-gnomodotiseis/itemlist/category/78-2021.html
/apofaseis-gnomodotiseis/itemlist/category/83-2022.html
/apofaseis-gnomodotiseis/itemlist/category/78-2021.html
/apofaseis-gnomodotiseis/itemlist/category/71-2020.html
/apofaseis-gnomodotiseis/itemlist/category/4-2019.html
/apofaseis-gnomodotiseis/itemlist/category/5-2018.html
/apofaseis-gnomodotiseis/itemlist/category/6-2017.html
/apofaseis-gnomodotiseis/itemlist/category/7-2016.html
/apofaseis-gnomodotiseis/itemlist/category/8-2015.html
/apofaseis-gnomodotiseis/itemlist/category/9-2014.html
/apofaseis-gnomodotiseis/itemlist/category/10-2013.html
/apofaseis-gnomodotiseis/itemlist/category/11-2012.html
/apofaseis-gnomodotiseis/itemlist/category/12-2011.html
/apofaseis-gnomodotiseis/itemlist/category/13-2010.html
/apofaseis-gnomodotiseis/itemlist/category/18-2009.html
/apofaseis-gnomodotiseis/itemlist/category/19-2008.html
/apofaseis-gnomodotiseis/itemlist/category/20-2007.html
/apofaseis-gnomodotiseis/itemlist/category/21-2006.html
/apofaseis-gnomodotiseis/itemlist/category/22-2005.html
/apofaseis-gnomodotiseis/itemlist/category/23-2004.html
/apofaseis-gnomodotiseis/itemlist/category/24-2003.html
/apofaseis-gnomodotiseis/itemlist/category/25-2002.html
/apofaseis-gnomodotiseis/itemlist/category/26-2001.html
/apofaseis-gnomodotiseis/itemlist/category/27-2000.html
/apofaseis-gnomodotiseis/itemlist/category/44-1999.html
/apofaseis-gnomodotiseis/itemlist/category/45-1998.html
/apofaseis-gnomodotiseis/itemlist/category/48-1997.html
/apofaseis-gnomodotiseis/itemlist/category/47-1996.html
/apofaseis-gnomodotiseis/itemlist/category/46-1995.html
/apofaseis-gnomodotiseis/itemlist/category/49-1994.html
Edit: I am trying to create a case list for every year in the year list.
Code:
# 1) Created an year list (year = [])
from bs4 import BeautifulSoup
import requests
import csv
import pandas as pd
import re
total_cases = []
#Url to scrape
URL = 'https://www.epant.gr/apofaseis-gnomodotiseis/itemlist/category/78-2021.html'
headers1 = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36",
"X-Amzn-Trace-Id": "Root=1-61acac03-6279b8a6274777eb44d81aae",
"X-Client-Data": "CJW2yQEIpLbJAQjEtskBCKmdygEIuevKAQjr8ssBCOaEzAEItoXMAQjLicwBCKyOzAEI3I7MARiOnssB" }
page = requests.get(URL, headers = headers1)
soup = BeautifulSoup(page.content,"html.parser")
year = []
for link in soup.find_all('a', href=lambda href: href and "category" in href):
    if link.get('href') not in year:
        year.append(link.get('href'))
print(year)
# 2) Created a case list
case = []
for link in soup.find_all('a', href=lambda href: href and "apofasi" in href):
    if link.get('href') not in case:
        case.append(link.get('href'))
print(case)
#Trying to create a case list for every year in year list
# A)Get every year link in year list
for year_link in year:
    headers1 = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36",
                "X-Amzn-Trace-Id": "Root=1-61acac03-6279b8a6274777eb44d81aae",
                "X-Client-Data": "CJW2yQEIpLbJAQjEtskBCKmdygEIuevKAQjr8ssBCOaEzAEItoXMAQjLicwBCKyOzAEI3I7MARiOnssB" }
    page = requests.get(year_link, headers = headers1)
    soup2 = BeautifulSoup(page.content,"html.parser")
    print(year)
    # B)Get every case link for every case in a fixed year
    for case_link in case:
        total_cases.append(case_link)
#Get case link for every case for every year_link (element of year[])
???
EDIT 2:
When I try to run the code you (HedgeHog) so kindly posted, it gives me this error:
--------------------------------------------------------------------------
FeatureNotFound Traceback (most recent call last)
C:\Users\ARISTE~1\AppData\Local\Temp/ipykernel_13944/1621925083.py in <module>
8 "X-Client-Data": "CJW2yQEIpLbJAQjEtskBCKmdygEIuevKAQjr8ssBCOaEzAEItoXMAQjLicwBCKyOzAEI3I7MARiOnssB" }
9 page = requests.get(URL, headers = headers)
---> 10 soup = BeautifulSoup(page.content,'lxml')
11
12 baseUrl = 'https://www.epant.gr'
~\Documents\conda\envs\conda\lib\site-packages\bs4\__init__.py in __init__(self, markup, features, builder, parse_only, from_encoding, exclude_encodings, element_classes, **kwargs)
243 builder_class = builder_registry.lookup(*features)
244 if builder_class is None:
--> 245 raise FeatureNotFound(
246 "Couldn't find a tree builder with the features you "
247 "requested: %s. Do you need to install a parser library?"
FeatureNotFound: Couldn't find a tree builder with the features you requested: lxml. Do you need to install a parser library?
Any ideas? Thanks!
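A note on the FeatureNotFound traceback above: it simply means the lxml parser is not installed in that environment. Installing it (pip install lxml) or falling back to Python's built-in parser resolves it; a minimal sketch of the fallback:
# fall back to the built-in parser if lxml is not installed
soup = BeautifulSoup(page.content, 'html.parser')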
EDIT
Based on your question edits I would recommend using a dict instead of all these lists - the following example will create a data dictionary with the years as keys, each holding its own URL and a list of case URLs.
Example
from bs4 import BeautifulSoup
import requests
URL = 'https://www.epant.gr/apofaseis-gnomodotiseis/itemlist/category/78-2021.html'
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36",
"X-Amzn-Trace-Id": "Root=1-61acac03-6279b8a6274777eb44d81aae",
"X-Client-Data": "CJW2yQEIpLbJAQjEtskBCKmdygEIuevKAQjr8ssBCOaEzAEItoXMAQjLicwBCKyOzAEI3I7MARiOnssB" }
page = requests.get(URL, headers = headers)
soup = BeautifulSoup(page.content,'html.parser')
baseUrl = 'https://www.epant.gr'
data = {}
for href in [x['href'] for x in soup.select('a[href*=category]:has(span)')]:
    page = requests.get(f'{baseUrl}{href}', headers = headers)
    soup = BeautifulSoup(page.content,'html.parser')
    data[href.split('-')[-1].split('.')[0]] = {
        'url': f'{baseUrl}{href}'
    }
    data[href.split('-')[-1].split('.')[0]]['cases'] = [f'{baseUrl}{x["href"]}' for x in soup.select('h3 a')]
data
Output
{'2022': {'url': 'https://www.epant.gr/apofaseis-gnomodotiseis/itemlist/category/83-2022.html',
'cases': []},
'2021': {'url': 'https://www.epant.gr/apofaseis-gnomodotiseis/itemlist/category/78-2021.html',
'cases': ['https://www.epant.gr/apofaseis-gnomodotiseis/item/1578-apofasi-749-2021.html',
'https://www.epant.gr/apofaseis-gnomodotiseis/item/1633-apofasi-743-2021.html',
'https://www.epant.gr/apofaseis-gnomodotiseis/item/1575-apofasi-738-2021.html',
'https://www.epant.gr/apofaseis-gnomodotiseis/item/1624-apofasi-737-2021.html',
'https://www.epant.gr/apofaseis-gnomodotiseis/item/1510-apofasi-735-2021.html',
'https://www.epant.gr/apofaseis-gnomodotiseis/item/1595-apofasi-733-2021.html',
'https://www.epant.gr/apofaseis-gnomodotiseis/item/1600-apofasi-732-2021.html',
'https://www.epant.gr/apofaseis-gnomodotiseis/item/1451-apofasi-730-2021.html',
'https://www.epant.gr/apofaseis-gnomodotiseis/item/1508-apofasi-728-2021.html',
'https://www.epant.gr/apofaseis-gnomodotiseis/item/1584-apofasi-727-2021.html',
'https://www.epant.gr/apofaseis-gnomodotiseis/item/1586-apofasi-726-2021.html',
'https://www.epant.gr/apofaseis-gnomodotiseis/item/1583-apofasi-725-2021.html']},...}
How to fix?
Just check whether the link is already in your list of links - if it is not, append it:
if link.get('href') not in year:
    year.append(link.get('href'))
Note
"The desired result would look like this (but I need this in list format)"
This is not a list in the sense of a data structure; it is a printed version of each single element of a list.
Alternative
Example
from bs4 import BeautifulSoup
import requests
URL = 'https://www.epant.gr/apofaseis-gnomodotiseis/itemlist/category/78-2021.html'
headers1 = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36",
"X-Amzn-Trace-Id": "Root=1-61acac03-6279b8a6274777eb44d81aae",
"X-Client-Data": "CJW2yQEIpLbJAQjEtskBCKmdygEIuevKAQjr8ssBCOaEzAEItoXMAQjLicwBCKyOzAEI3I7MARiOnssB" }
page = requests.get(URL, headers = headers1)
soup = BeautifulSoup(page.content,"html.parser")
year = []
for link in soup.find_all('a', href=lambda href: href and "category" in href):
    if link.get('href') not in year:
        year.append(link.get('href'))
print(year)
Output
['https://www.epant.gr/apofaseis-gnomodotiseis/itemlist/category/78-2021.html', '/apofaseis-gnomodotiseis/itemlist/category/83-2022.html', '/apofaseis-gnomodotiseis/itemlist/category/78-2021.html', '/apofaseis-gnomodotiseis/itemlist/category/71-2020.html', '/apofaseis-gnomodotiseis/itemlist/category/4-2019.html', '/apofaseis-gnomodotiseis/itemlist/category/5-2018.html', '/apofaseis-gnomodotiseis/itemlist/category/6-2017.html', '/apofaseis-gnomodotiseis/itemlist/category/7-2016.html', '/apofaseis-gnomodotiseis/itemlist/category/8-2015.html', '/apofaseis-gnomodotiseis/itemlist/category/9-2014.html', '/apofaseis-gnomodotiseis/itemlist/category/10-2013.html', '/apofaseis-gnomodotiseis/itemlist/category/11-2012.html', '/apofaseis-gnomodotiseis/itemlist/category/12-2011.html', '/apofaseis-gnomodotiseis/itemlist/category/13-2010.html', '/apofaseis-gnomodotiseis/itemlist/category/18-2009.html', '/apofaseis-gnomodotiseis/itemlist/category/19-2008.html', '/apofaseis-gnomodotiseis/itemlist/category/20-2007.html', '/apofaseis-gnomodotiseis/itemlist/category/21-2006.html', '/apofaseis-gnomodotiseis/itemlist/category/22-2005.html', '/apofaseis-gnomodotiseis/itemlist/category/23-2004.html', '/apofaseis-gnomodotiseis/itemlist/category/24-2003.html', '/apofaseis-gnomodotiseis/itemlist/category/25-2002.html', '/apofaseis-gnomodotiseis/itemlist/category/26-2001.html', '/apofaseis-gnomodotiseis/itemlist/category/27-2000.html', '/apofaseis-gnomodotiseis/itemlist/category/44-1999.html', '/apofaseis-gnomodotiseis/itemlist/category/45-1998.html', '/apofaseis-gnomodotiseis/itemlist/category/48-1997.html', '/apofaseis-gnomodotiseis/itemlist/category/47-1996.html', '/apofaseis-gnomodotiseis/itemlist/category/46-1995.html', '/apofaseis-gnomodotiseis/itemlist/category/49-1994.html']
Use a set as the intermediate storage for the HREFs then convert to a list later.
from bs4 import BeautifulSoup
import requests
URL = 'https://www.epant.gr/apofaseis-gnomodotiseis/itemlist/category/78-2021.html'
headers1 = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36",
"X-Amzn-Trace-Id": "Root=1-61acac03-6279b8a6274777eb44d81aae",
"X-Client-Data": "CJW2yQEIpLbJAQjEtskBCKmdygEIuevKAQjr8ssBCOaEzAEItoXMAQjLicwBCKyOzAEI3I7MARiOnssB"}
page = requests.get(URL, headers=headers1)
soup = BeautifulSoup(page.content, "lxml")
year = set()
for link in soup.find_all('a', href=lambda href: href and "category" in href):
    year.add(link.get('href'))
print(list(year))
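A set does not preserve the order in which the links appear on the page. If that order matters, a plain dict can serve as an ordered set (a minimal sketch, relying on dicts keeping insertion order in Python 3.7+):
# dict keys are unique and keep insertion order, so this dedupes while preserving page order
year = list(dict.fromkeys(
    link['href']
    for link in soup.find_all('a', href=lambda href: href and "category" in href)
))
print(year)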

How to extract this dictionary using Beautiful Soup

I want to get the last element of the dictionary (pasted below) into a variable; it's inside another dictionary, "offers", and I have no clue how to extract it.
html = s.get(url=url, headers=headers, verify=False, timeout=15)
soup = BeautifulSoup(html.text, 'html.parser')
products = soup.find_all('script', {'type': "application/ld+json"})
{"#context":"http://schema.org","#type":"Product","aggregateRating":{"#type":"AggregateRating","bestRating":5,"ratingValue":"4.8","ratingCount":11,"worstRating":3,"reviewCount":5},"brand":{"#type":"Brand","name":"New Balance"},"color":"white/red/biały","image":["https://img01.ztat.net/3"],"itemCondition":"http://schema.org/NewCondition","manufacturer":"New Balance","name":"550 UNISEX - Sneakersy niskie - white/red","offers":[{"#type":"Offer","availability":"http://schema.org/OutOfStock","price":"489","priceCurrency":"PLN","sku":"NE215O06U-A110001000","url":"/new-balance-550-unisex-sneakersy-niskie-whitered-ne215o06u-a11.html"},{"#type":"Offer","availability":"http://schema.org/OutOfStock","price":"489","priceCurrency":"PLN","sku":"NE215O06U-A110002000","url":"/new-balance-550-unisex-sneakersy-niskie-whitered-ne215o06u-a11.html"} (...)
As mentioned, extract the contents via BeautifulSoup and decode the string with json.loads():
import json
products = '{"#context":"http://schema.org","#type":"Product","aggregateRating":{"#type":"AggregateRating","bestRating":5,"ratingValue":"4.8","ratingCount":11,"worstRating":3,"reviewCount":5},"brand":{"#type":"Brand","name":"New Balance"},"color":"white/red/biały","image":["https://img01.ztat.net/3"],"itemCondition":"http://schema.org/NewCondition","manufacturer":"New Balance","name":"550 UNISEX - Sneakersy niskie - white/red","offers":[{"#type":"Offer","availability":"http://schema.org/OutOfStock","price":"489","priceCurrency":"PLN","sku":"NE215O06U-A110001000","url":"/new-balance-550-unisex-sneakersy-niskie-whitered-ne215o06u-a11.html"},{"#type":"Offer","availability":"http://schema.org/OutOfStock","price":"489","priceCurrency":"PLN","sku":"NE215O06U-A110002000","url":"/new-balance-550-unisex-sneakersy-niskie-whitered-ne215o06u-a11.html"}]}'
products = json.loads(products)
To get the last element (dict) in offers:
products['offers'][-1]
Output:
{'#type': 'Offer',
'availability': 'http://schema.org/OutOfStock',
'price': '489',
'priceCurrency': 'PLN',
'sku': 'NE215O06U-A110002000',
'url': '/new-balance-550-unisex-sneakersy-niskie-whitered-ne215o06u-a11.html'}
Example
In your special case you also have to replace('&quot;', '"') first:
from bs4 import BeautifulSoup
import requests, json
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36",
"X-Amzn-Trace-Id": "Root=1-61acac03-6279b8a6274777eb44d81aae",
"X-Client-Data": "CJW2yQEIpLbJAQjEtskBCKmdygEIuevKAQjr8ssBCOaEzAEItoXMAQjLicwBCKyOzAEI3I7MARiOnssB" }
html = requests.get('https://www.zalando.de/new-balance-550-unisex-sneaker-low-whitered-ne215o06u-a11.html', headers=headers)
soup = BeautifulSoup(html.content, 'lxml')
jsonData = json.loads(soup.select_one('script[type="application/ld+json"]').text.replace('&quot;', '"'))
jsonData['offers'][-1]
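If a page carries several ld+json blocks (as the question's find_all call suggests), the same decode step can be applied per tag; a small sketch under that assumption, reusing the entity replacement from above:
import json

last_offers = []
for tag in soup.find_all('script', {'type': 'application/ld+json'}):
    block = json.loads(tag.text.replace('&quot;', '"'))  # same entity fix as above
    if isinstance(block, dict) and 'offers' in block:
        last_offers.append(block['offers'][-1])  # last offer of each product block
print(last_offers)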

I was trying to scrape data from Monster Job using Beautiful Soup but I did not manage to get any job cards

This is my code as below.
I inspected the page and found the "div" tag and "class" for all the job cards.
However, there is no output.
When I preview my code on Postman, none of the job cards are shown.
def get_url(position, location):
    position = position.replace(" ", "%20")
    location = location.replace(" ", "%20")
    template = "https://www.monster.com.my/srp/results?query={}&locations={}"
    url = template.format(position, location)
    return url
url = get_url("Python Developers", "Kuala Lumpur")
print(url)
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36"}
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.text,"html.parser")
# Get job info of all cards
cards = soup.find("div", "card-panel apply-panel job-apply-card")
print(cards)
output --> None
May I know how to resolve this?
If Selenium is involved in this web scraping, how do I get the job details?
import requests

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:90.0) Gecko/20100101 Firefox/90.0",
    "Referer": "https://www.monster.com.my/"
}


def main(url):
    with requests.Session() as req:
        req.headers.update(headers)
        # warm up the session (cookies) before calling the JSON endpoint
        req.head('https://www.monster.com.my/')
        # page through the results 25 at a time
        for item in range(0, 100, 25):
            params = {
                "start": item,
                "sort": "1",
                "limit": "25",
                "query": "Python Developers",
                "locations": "Kuala Lumpur",
                "searchId": "4a4b18ed-d3f1-4673-b371-0231d4bcc00a"
            }
            r = req.get(url, params=params)
            for item in r.json()['jobSearchResponse']['data']:
                # or print item.keys() and pick up what you need.
                print(item.get('title', 'N/A'))


# the job cards are loaded from this JSON endpoint (an XHR), so no Selenium is needed
main('https://www.monster.com.my/middleware/jobsearch')

Scrape Verify link Href From Sites

I want to get the Verify href from the Gmailnator inbox; the site contains the Discord verify href (the "Discord Verify HREF" link).
I want to get this href using bs4 and pass it into a Selenium driver call like driver.get(url), the url being the href, of course.
Can someone write some code to scrape the href from the Gmailnator inbox, please? I did try the page source, however the page source does not contain the href.
This is the code I have written to get the href, but the href that I require (the Discord one) is in a frame source, so I think that's why it doesn't come up.
UPDATE! EVERYTHING IS DONE AND FIXED
driver.get('https://www.gmailnator.com/inbox/#for.ev.e.r.my.girlt.m.p#gmail.com')
time.sleep(6)
driver.find_element_by_xpath('//*[#id="mailList"]/tbody/tr[2]/td/a/table/tbody/tr/td[1]').click()
time.sleep(4)
url = driver.current_url
email_for_data = driver.current_url.split('/')[-3]
print(url)
time.sleep(2)
print('Getting Your Discord Verify link')
print('Time To Get Your Discord Link')
soup = BeautifulSoup(requests.get(url).text, "lxml")
data_email = soup.find("")
token = soup.find("meta", {"name": "csrf-token"})["content"]
cf_email = soup.find("a", class_="__cf_email__")["data-cfemail"]
endpoint = "https://www.gmailnator.com/mailbox/get_single_message/"
data = {
    "csrf_gmailnator_token": token,
    "action": "get_message",
    "message_id": url.split("#")[-1],
    "email": f"{email_for_data}",
}
headers = {
    "referer": f"https://www.gmailnator.com/{email_for_data}/messageid/",
    "cookie": f"csrf_gmailnator_cookie={token}; ci_session={cf_email}",
    "user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.86 "
                  "YaBrowser/21.3.0.740 Yowser/2.5 Safari/537.36",
    "x-requested-with": "XMLHttpRequest",
}
r = requests.post(endpoint, data=data, headers=headers)
the_real_slim_shady = (
    BeautifulSoup(r.json()["content"], "lxml")
    .find_all("a", {"target": "_blank"})[1]["href"]
)
print(the_real_slim_shady)
You can fake it all with pure requests to get the Verify link. First, you need to get the token and the cf_email values. Then, things are pretty straightforward.
Here's how to get the link:
import requests
from bs4 import BeautifulSoup
url = "https://www.gmailnator.com/geralddoreyestmp/messageid/#179b454b4c482c4d"
soup = BeautifulSoup(requests.get(url).text, "lxml")
token = soup.find("meta", {"name": "csrf-token"})["content"]
cf_email = soup.find("a", class_="__cf_email__")["data-cfemail"]
endpoint = "https://www.gmailnator.com/mailbox/get_single_message/"
data = {
    "csrf_gmailnator_token": token,
    "action": "get_message",
    "message_id": url.split("#")[-1],
    "email": "geralddoreyestmp",
}
headers = {
    "referer": "https://www.gmailnator.com/geralddoreyestmp/messageid/",
    "cookie": f"csrf_gmailnator_cookie={token}; ci_session={cf_email}",
    "user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.86 "
                  "YaBrowser/21.3.0.740 Yowser/2.5 Safari/537.36",
    "x-requested-with": "XMLHttpRequest",
}
r = requests.post(endpoint, data=data, headers=headers)
the_real_slim_shady = (
    BeautifulSoup(r.json()["content"], "lxml")
    .find_all("a", {"target": "_blank"})[1]["href"]
)
print(the_real_slim_shady)
Output (your link will be different!):
https://click.discord.com/ls/click?upn=qDOo8cnwIoKzt0aLL1cBeARJoBrGSa2vu41A5vK-2B4us-3D77CR_3Tswyie9C2vHlXKXm6tJrQwhGg-2FvQ76GD2o0Zl2plCYHULNsKdCuB6s-2BHk1oNirSuR8goxCccVgwsQHdq1YYeGQki4wtPdDA3zi661IJL7H0cOYMH0IJ0t3sgrvr2oMX-2BJBA-2BWZzY42AwgjdQ-2BMAN9Y5ctocPNK-2FUQLxf6HQusMayIeATMiTO-2BlpDytu-2FnIW4axB32RYQpxPGO-2BeHtcSj7a7QeZmqK-2B-2FYkKA4dl5q8I-3D
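To hand the scraped link to Selenium as described in the question, something along these lines would follow (a sketch, assuming a Chrome driver is installed and on PATH):
from selenium import webdriver

driver = webdriver.Chrome()      # assumes chromedriver is available on PATH
driver.get(the_real_slim_shady)  # open the scraped Verify link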

What am I doing wrong in this for loop for Web Scraping with bs4?

I'm trying to loop through a list of players on Transfermarkt, enter each profile, get their profile picture, and then scrape the original list of information. The latter I've achieved (which you will see in my code), but the former I can't seem to get working. I'm not an expert at this, and have received help with my code.
I want to save the source link for each player's picture, not the image itself, and then store that link into "PlayerImgURL" in my dataframe (row 73).
This is my error message:
(.venv) PS C:\Users\cljkn\Desktop\Python scraper github> & "c:/Users/cljkn/Desktop/Python scraper github/.venv/Scripts/python.exe" "c:/Users/cljkn/Desktop/Python scraper github/.vscode/test.py"
File "c:/Users/cljkn/Desktop/Python scraper github/.vscode/test.py", line 45
for page in range(1, 21):
^
SyntaxError: invalid syntax
Thanks.
from bs4 import BeautifulSoup
import requests
import pandas as pd
playerID = []
playerImage = []
playerName = []
result = []
for page in range(1, 21):
    r = requests.get("https://www.transfermarkt.com/spieler-statistik/wertvollstespieler/marktwertetop?land_id=0&ausrichtung=alle&spielerposition_id=alle&altersklasse=alle&jahrgang=0&kontinent_id=0&plus=1",
                     params= {"page": page},
                     headers= {"User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:74.0) Gecko/20100101 Firefox/74.0"}
                     )
    soup = BeautifulSoup(r.content, "html.parser")
    links = soup.select('a.spielprofil_tooltip')
    for i in range(len(links)):
        playerID.append(links[i].get('id'))

for i in range(len(playerID)):
    playerID[i] = 'https://www.transfermarkt.com/kylian-mbappe/profil/spieler/'+playerID[i]

playerID = list(set(playerID))

for i in range(len(playerID)):
    r = requests.get(playerID[i],
                     params= {"page": page},
                     headers= {"User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:74.0) Gecko/20100101 Firefox/74.0"}
                     )
    soup = BeautifulSoup(r.content, "html.parser")
    name = soup.find_all('h1')
    for image in soup.find_all('img'):
        playerName.append('title')
        playerImage.append[image.get('src')
for page in range(1, 21):
    r = requests.get("https://www.transfermarkt.com/spieler-statistik/wertvollstespieler/marktwertetop?land_id=0&ausrichtung=alle&spielerposition_id=alle&altersklasse=alle&jahrgang=0&kontinent_id=0&plus=1",
                     params= {"page": page},
                     headers= {"User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:74.0) Gecko/20100101 Firefox/74.0"}
                     )
    soup = BeautifulSoup(r.content, "html.parser")
    tr = soup.find_all("tbody")[1].find_all("tr", recursive=False)
    result.extend([
        {
            "Club": t[4].find("img")["alt"],
            "Age": t[2].text.strip(),
            "GamesPlayed": t[6].text.strip(),
            "GoalsDone": t[7].text.strip(),
            "OwnGoals": t[8].text.strip(),
            "Assists": t[9].text.strip(),
            "YellowCards": t[10].text.strip(),
            "SecondYellow": t[11].text.strip(),
            "StraightRed": t[12].text.strip(),
            "SubsOn": t[13].text.strip(),
            "SubsOff": t[14].text.strip(),
            "Nationality": t[3].find("img")["alt"], # for all nationality : [ i["alt"] for i in t[3].find_all("img")],
            "Position": t[1].find_all("td")[2].text,
            "Value": t[5].text.strip(),
            #"PlayerImgURL":
            "ClubImgURL": t[4].find("img")["src"],
            "CountryImgURL": t[3].find("img")["src"] # for all country url: [ i["src"] for i in t[3].find_all("img")]
        }
        for t in (t.find_all(recursive=False) for t in tr)
    ])
df = pd.DataFrame(result,{'Name':playerImage, 'Source':playerImage})
#df.to_csv (r'S:\_ALL\Internal Projects\Introduction_2020\Transfermarkt\PlayerDetails.csv', index = False, header=True)
print(df)
The issue is in this line:
playerImage.append[image.get('src')
Try replacing it with this line:
playerImage.append(image.get('src'))
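With that fix, the image loop from the question compiles and runs; a sketch of it in context (the original also appends the literal string 'title', which is presumably meant to be each tag's title attribute, so that assumption is shown here):
for image in soup.find_all('img'):
    playerName.append(image.get('title'))  # assumption: the title attribute was intended, not the literal string 'title'
    playerImage.append(image.get('src'))   # store the image source URL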
