How To Scrape Similar Classes With One Different Attribute - python

Searched around on SO, but couldn't find anything for this.
I'm scraping using BeautifulSoup. This is the code I'm using, which I found on SO:
for section in soup.findAll('div',attrs={'id':'dmusic_tracklist_track_title_B00KHQOKGW'}):
    nextNode = section
    while True:
        nextNode = nextNode.nextSibling
        try:
            tag_name = nextNode.name
        except AttributeError:
            tag_name = ""
        if tag_name == "a":
            print nextNode.text()
        else:
            print "*****"
            break
If I went to this 50 Cent album (Animal Ambition: An Untamed Desire To Win) and wanted to scrape each song, how would I do so? The problem is each song has a different ID associated with it based on its product code. For example, here is the XPath of the first two songs' titles: //*[@id="dmusic_tracklist_track_title_B00KHQOKGW"]/div/a/text() and //*[@id="dmusic_tracklist_track_title_B00KHQOLWK"]/div/a/text().
You'll notice the end of the first id is B00KHQOKGW, while the second is B00KHQOLWK. Is there a way I can add a "wild card" to the end of the id to grab each of the songs no matter what product ID is at the end? For example, something like id="dmusic_tracklist_track_title_*", where I replaced the product ID with a *.
Or can I use a div to target the title I want, like this? (I feel like this would be the best approach. It uses the div's class right above the title, and there isn't any product ID in it.)
for section in soup.findAll('div',attrs={'class':'a-section a-spacing-none overflow_ellipsis'}):
    nextNode = section
    while True:
        nextNode = nextNode.nextSibling
        try:
            tag_name = nextNode.name
        except AttributeError:
            tag_name = ""
        if tag_name == "a":
            print nextNode.text()
        else:
            print "*****"
            break

You can pass a function as an id attribute value and check if it starts with dmusic_tracklist_track_title_:
from bs4 import BeautifulSoup
import requests
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.122 Safari/537.36'}
response = requests.get('http://www.amazon.com/dp/B00KHQOI8C/?tag=stackoverfl08-20', headers=headers)
soup = BeautifulSoup(response.content)
for song in soup.find_all(id=lambda x: x and x.startswith('dmusic_tracklist_track_title_')):
    print song.text.strip()
Prints:
Hold On [Explicit]
Don't Worry 'Bout It [feat. Yo Gotti] [Explicit]
Animal Ambition [Explicit]
Pilot [Explicit]
Smoke [feat. Trey Songz] [Explicit]
Everytime I Come Around [feat. Kidd Kidd] [Explicit]
Irregular Heartbeat [feat. Jadakiss] [Explicit]
Hustler [Explicit]
Twisted [feat. Mr. Probz] [Explicit]
Winners Circle [feat. Guordan Banks] [Explicit]
Chase The Paper [feat. Kidd Kidd] [Explicit]
Alternatively, you can pass a regular expression pattern as an attribute value:
import re
from bs4 import BeautifulSoup
import requests
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.122 Safari/537.36'}
response = requests.get('http://www.amazon.com/dp/B00KHQOI8C/?tag=stackoverfl08-20', headers=headers)
soup = BeautifulSoup(response.content)
for song in soup.find_all(id=re.compile('^dmusic_tracklist_track_title_\w+$')):
    print song.text.strip()
^dmusic_tracklist_track_title_\w+$ would match dmusic_tracklist_track_title_ followed by one or more "word" characters (0-9, a-z, A-Z and _).
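If you prefer CSS selectors, the same prefix match can be written with select() and an attribute-starts-with selector; this is a minimal sketch reusing the soup object from above (assuming your BeautifulSoup version's select() supports [attr^=value] selectors):
# every element whose id begins with the tracklist title prefix
for song in soup.select('[id^="dmusic_tracklist_track_title_"]'):
    print(song.text.strip())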

Related

How do I fix the code to scrape the Zomato website?

I wrote this code but got the error "IndexError: list index out of range" after running the last line. How do I fix this?
import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'}
response = requests.get("https://www.zomato.com/bangalore/top-restaurants",headers=headers)
content = response.content
soup = BeautifulSoup(content,"html.parser")

top_rest = soup.find_all("div",attrs={"class": "sc-bblaLu dOXFUL"})
list_tr = top_rest[0].find_all("div",attrs={"class": "sc-gTAwTn cKXlHE"})
list_rest =[]

for tr in list_tr:
    dataframe ={}
    dataframe["rest_name"] = (tr.find("div",attrs={"class": "res_title zblack bold nowrap"})).text.replace('\n', ' ')
    dataframe["rest_address"] = (tr.find("div",attrs={"class": "nowrap grey-text fontsize5 ttupper"})).text.replace('\n', ' ')
    dataframe["cuisine_type"] = (tr.find("div",attrs={"class":"nowrap grey-text"})).text.replace('\n', ' ')
    list_rest.append(dataframe)

list_rest
You are receiving this error because top_rest is empty when you attempt to take its first element with top_rest[0]. The reason is that the first class you are referencing is dynamically named: if you refresh the page, the same div will no longer carry that class name, so your scrape returns empty results.
An alternative is to scrape all divs and then narrow in on the elements you want. Be mindful of the dynamic naming scheme, though, because from one request to another you will get different class names:
import requests
from bs4 import BeautifulSoup
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'}
response = requests.get("https://www.zomato.com/bangalore/top-restaurants",headers=headers)
content = response.content
soup = BeautifulSoup(content,"html.parser")
top_rest = soup.find_all("div")
list_tr = top_rest[0].find_all("div",attrs={"class": "bke1zw-1 eMsYsc"})
list_tr
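If the dynamic part of the class name is only the hashed suffix, another option is to match on the stable prefix with a regular expression. This is a minimal sketch, assuming the "bke1zw-" prefix stays constant between page loads (it may not, so treat it as a placeholder):
import re
import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'}
response = requests.get("https://www.zomato.com/bangalore/top-restaurants", headers=headers)
soup = BeautifulSoup(response.content, "html.parser")

# match any div that carries a class starting with the assumed stable prefix
cards = soup.find_all("div", class_=re.compile(r"^bke1zw-"))
print(len(cards))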
I recently did a project that had me researching how to scrape Zomato for Manila, Philippines. I used the geopy library to get the longitude and latitude values of Manila City, then scraped the restaurants' details using that information.
ADD: You can get your own API key on the Zomato website to make up to 1000 calls a day.
# Use the geopy library to get the latitude and longitude values of Manila City.
from geopy.geocoders import Nominatim
import requests
import numpy as np
import pandas as pd

address = 'Manila City, Philippines'
geolocator = Nominatim(user_agent = 'Makati_explorer')
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinate of Makati City are {}, {}.'.format(latitude, longitude))

# Use Zomato's API to make calls
# (foursquare_venues is a DataFrame of venue names and coordinates built earlier in the project)
headers = {'user-key': '617e6e315c6ec2ad5234e884957bfa4d'}
venues_information = []

for index, row in foursquare_venues.iterrows():
    print("Fetching data for venue: {}".format(index + 1))
    venue = []
    url = ('https://developers.zomato.com/api/v2.1/search?q={}' +
           '&start=0&count=1&lat={}&lon={}&sort=real_distance').format(row['name'], row['lat'], row['lng'])
    try:
        result = requests.get(url, headers = headers).json()
    except:
        print("There was an error...")
    try:
        if (len(result['restaurants']) > 0):
            venue.append(result['restaurants'][0]['restaurant']['name'])
            venue.append(result['restaurants'][0]['restaurant']['location']['latitude'])
            venue.append(result['restaurants'][0]['restaurant']['location']['longitude'])
            venue.append(result['restaurants'][0]['restaurant']['average_cost_for_two'])
            venue.append(result['restaurants'][0]['restaurant']['price_range'])
            venue.append(result['restaurants'][0]['restaurant']['user_rating']['aggregate_rating'])
            venue.append(result['restaurants'][0]['restaurant']['location']['address'])
            venues_information.append(venue)
        else:
            venues_information.append(np.zeros(7))  # one placeholder per output column
    except:
        pass

ZomatoVenues = pd.DataFrame(venues_information,
                            columns = ['venue', 'latitude',
                                       'longitude', 'price_for_two',
                                       'price_range', 'rating', 'address'])
Using Web Scraping Language I was able to write this:
GOTO https://www.zomato.com/bangalore/top-restaurants
EXTRACT {'rest_name': '//div[@class="res_title zblack bold nowrap"]',
         'rest_address': '//div[@class="nowrap grey-text fontsize5 ttupper"]',
         'cuisine_type': '//div[@class="nowrap grey-text"]'} IN //div[@class="bke1zw-1 eMsYsc"]
This will iterate over each record element with class bke1zw-1 eMsYsc and pull each restaurant's information.
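For reference, roughly the same extraction written with BeautifulSoup would look like the sketch below; the class names are taken from the snippets in this thread and may be dynamically generated, so treat them as placeholders:
import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0'}
response = requests.get("https://www.zomato.com/bangalore/top-restaurants", headers=headers)
soup = BeautifulSoup(response.content, "html.parser")

restaurants = []
for card in soup.find_all("div", attrs={"class": "bke1zw-1 eMsYsc"}):
    name = card.find("div", attrs={"class": "res_title zblack bold nowrap"})
    address = card.find("div", attrs={"class": "nowrap grey-text fontsize5 ttupper"})
    cuisine = card.find("div", attrs={"class": "nowrap grey-text"})
    restaurants.append({
        "rest_name": name.text.strip() if name else None,
        "rest_address": address.text.strip() if address else None,
        "cuisine_type": cuisine.text.strip() if cuisine else None,
    })
print(restaurants)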

Web scraping with XPath

So I want to obtain the name of each player in every football club in the Premier League from Transfermarkt.
The page I am using as a test is: https://www.transfermarkt.co.uk/ederson/profil/spieler/238223
I have found the XPath to be:
//*[@id="main"]/div[10]/div[1]/div[2]/div[2]/div[2]/div[2]/table/tbody/tr[1]/td
Keep in mind that I have to use the XPath due to the structure of the HTML code, and that I have to loop over all the players in a club, for all the clubs in the Premier League. I have already obtained the links through this code:
# Create empty list for player link
playerLink1 = []
playerLink2 = []
playerLink3 = []

#For each team link page...
for i in range(len(Full_Links)):
    #...Download the team page and process the html code...
    squadPage = requests.get(Full_Links[i], headers=headers)
    squadTree = squadPage.text
    SquadSoup = BeautifulSoup(squadTree,'html.parser')

    #...Extract the player links...
    playerLocation = SquadSoup.find("div", {"class":"responsive-table"}).find_all("a",{"class":"spielprofil_tooltip"})
    for a in playerLocation:
        playerLink1.append(a['href'])
    [playerLink2.append(x) for x in playerLink1 if x not in playerLink2]

    #...For each player link within the team page...
    for j in range(len(playerLink2)):
        #...Save the link, complete with domain...
        temp2 = "https://www.transfermarkt.co.uk" + playerLink2[j]
        #...Add the finished link to our teamLinks list...
        playerLink3.append(temp2)
The links are in a list variable called "playerLink3_u"
How can I do this?
I am not sure how to get the name with the XPath. You already have BS4 imported, so I have written some code to get the player name from the URL you posted.
import requests
from bs4 import BeautifulSoup
request_page = requests.get("http://www.transfermarkt.co.uk/ederson/profil/spieler/238223", headers={'User-Agent': "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.37"})
page_soup = BeautifulSoup(request_page.text, 'html.parser')
player_table = page_soup.find('table', {'class': 'auflistung'})
table_data = player_table.findAll('td')
print('Name: ', table_data[0].text)
print('Date Of Birth: ', table_data[1].text)
print('Place Of Birth: ', table_data[2].text)
This will return the name, date_of_birth, and place_of_birth.
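To extend this to every player, you can wrap the same table lookup in a loop over the playerLink3 list from the question; a minimal sketch, reusing the imports above and assuming every profile page exposes the same auflistung table:
players = []
ua = {'User-Agent': 'Mozilla/5.0'}

for link in playerLink3:
    profile_page = requests.get(link, headers=ua)
    profile_soup = BeautifulSoup(profile_page.text, 'html.parser')
    table = profile_soup.find('table', {'class': 'auflistung'})
    if table is None:
        # some pages may not carry the expected info table; skip them
        continue
    cells = table.findAll('td')
    players.append(cells[0].text.strip())

print(players)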

How to web-scrape Instagram profile links with BeautifulSoup?

I'm just starting to learn how to web scrape using BeautifulSoup and want to write a simple program that will get the profile links (Instagram URLs) of my idols from their full names.
Example: I have a full-name list stored in a file fullname.txt as follows:
#cat fullname.txt
Cristiano Ronaldo
David Beckham
Michael Jackson
My desired result is:
https://www.instagram.com/cristiano/
https://www.instagram.com/davidbeckham/
https://www.instagram.com/michaeljackson/
Can you give me some suggestions?
This worked for all 3 names, and for a few others I added to fullname.txt.
It uses the Requests library and a Bing search to find the correct link, then uses regular expressions to parse the link out of the returned page.
import requests, re

def bingsearch(searchfor):
    link = 'https://www.bing.com/search?q={}&ia=web'.format(searchfor)
    ua = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.116 Safari/537.36'}
    payload = {'q': searchfor}
    response = requests.get(link, headers=ua, params=payload)
    try:
        found = re.search('Search Results(.+?)</a>', response.text).group(1)
        iglink = re.search('a href="(.+?)"', found).group(1)
    except AttributeError:
        iglink = "link not found"
    return iglink

with open("fullname.txt", "r") as f:
    names = f.readlines()

for name in names:
    name = name.strip().replace(" ", "+")
    searchterm = name + "+instagram"
    IGLink = bingsearch(searchterm)
    print(IGLink)

Recursively parse all category links and get all products

I've been playing around with web scraping (for this practice exercise using Python 3.6.2) and I feel like I'm losing it a bit. Given this example link, here's what I want to do:
First, as you can see, there are multiple categories on the page. Clicking each of those categories gives me other categories, then others, and so on, until I reach the products page. So I have to go x levels deep. I thought recursion would help me achieve this, but somewhere I did something wrong.
Code:
Here, I'll explain the way I approached the problem. First, I created a session and a simple generic function which returns an lxml.html.HtmlElement object:
from lxml import html
from requests import Session

HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/62.0.3202.94 Safari/537.36"
}

TEST_LINK = 'https://www.richelieu.com/us/en/category/custom-made-cabinet-doors-and-drawers/1000128'

session_ = Session()

def get_page(url):
    page = session_.get(url, headers=HEADERS).text
    return html.fromstring(page)
Then, I thought I'd need two other functions:
one to get the category links
and another one to get the product links
To distinguish between the two, I figured out that only category pages have a title which always contains CATEGORIES, so I used that:
def read_categories(page):
    categs = []
    try:
        if 'CATEGORIES' in page.xpath('//div[@class="boxData"][2]/h2')[0].text.strip():
            for a in page.xpath('//*[@id="carouselSegment2b"]//li//a'):
                categs.append(a.attrib["href"])
            return categs
        else:
            return None
    except Exception:
        return None

def read_products(page):
    return [
        a_tag.attrib["href"]
        for a_tag in page.xpath("//ul[@id='prodResult']/li//div[@class='imgWrapper']/a")
    ]
Now, the only thing left is the recursion part, where I'm sure I did something wrong:
def read_all_categories(page):
    cat = read_categories(page)
    if not cat:
        yield read_products(page)
    else:
        yield from read_all_categories(page)

def main():
    main_page = get_page(TEST_LINK)
    for links in read_all_categories(main_page):
        print(links)
Here's all the code put together:
from lxml import html
from requests import Session

HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/62.0.3202.94 Safari/537.36"
}

TEST_LINK = 'https://www.richelieu.com/us/en/category/custom-made-cabinet-doors-and-drawers/1000128'

session_ = Session()

def get_page(url):
    page = session_.get(url, headers=HEADERS).text
    return html.fromstring(page)

def read_categories(page):
    categs = []
    try:
        if 'CATEGORIES' in page.xpath('//div[@class="boxData"][2]/h2')[0].text.strip():
            for a in page.xpath('//*[@id="carouselSegment2b"]//li//a'):
                categs.append(a.attrib["href"])
            return categs
        else:
            return None
    except Exception:
        return None

def read_products(page):
    return [
        a_tag.attrib["href"]
        for a_tag in page.xpath("//ul[@id='prodResult']/li//div[@class='imgWrapper']/a")
    ]

def read_all_categories(page):
    cat = read_categories(page)
    if not cat:
        yield read_products(page)
    else:
        yield from read_all_categories(page)

def main():
    main_page = get_page(TEST_LINK)
    for links in read_all_categories(main_page):
        print(links)

if __name__ == '__main__':
    main()
Could someone please point me in the right direction regarding the recursion function?
Here is how I would solve this:
from lxml import html as html_parser
from requests import Session

HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36"
}

def dig_up_products(url, session=Session()):
    html = session.get(url, headers=HEADERS).text
    page = html_parser.fromstring(html)

    # if it appears to be a categories page, recurse
    for link in page.xpath('//h2[contains(., "CATEGORIES")]/'
                           'following-sibling::div[@id="carouselSegment1b"]//li//a'):
        yield from dig_up_products(link.attrib["href"], session)

    # if it appears to be a products page, return the links
    for link in page.xpath('//ul[@id="prodResult"]/li//div[@class="imgWrapper"]/a'):
        yield link.attrib["href"]

def main():
    start = 'https://www.richelieu.com/us/en/category/custom-made-cabinet-doors-and-drawers/1000128'
    for link in dig_up_products(start):
        print(link)

if __name__ == '__main__':
    main()
There is nothing wrong with iterating over an empty XPath expression result, so you can simply put both cases (categories page/products page) into the same function, as long as the XPath expressions are specific enough to identify each case.
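As a usage note, dig_up_products is a generator, so you can also consume it lazily; for example, a small sketch that streams the links into a text file instead of printing them (same assumptions as the code above):
def save_links(start_url, path="product_links.txt"):
    # walk the category tree and write each product link on its own line
    with open(path, "w") as out:
        for link in dig_up_products(start_url):
            out.write(link + "\n")

save_links('https://www.richelieu.com/us/en/category/custom-made-cabinet-doors-and-drawers/1000128')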
You can do it like this as well to make your script slightly more concise. I used the lxml library along with CSS selectors to do the job. The script parses all the links under a category and looks for a dead end; when it reaches one, it parses the title there and repeats the whole process until all the links are exhausted.
from lxml.html import fromstring
import requests

def products_links(link):
    res = requests.get(link, headers={"User-Agent": "Mozilla/5.0"})
    page = fromstring(res.text)
    try:
        for item in page.cssselect(".contentHeading h1"):  # check for the match available in target page
            print(item.text)
    except:
        pass
    for link in page.cssselect("h2:contains('CATEGORIES')+[id^='carouselSegment'] .touchcarousel-item a"):
        products_links(link.attrib["href"])

if __name__ == '__main__':
    main_page = 'https://www.richelieu.com/us/en/category/custom-made-cabinet-doors-and-drawers/1000128'
    products_links(main_page)
Partial result:
BRILLANTÉ DOORS
BRILLANTÉ DRAWER FRONTS
BRILLANTÉ CUT TO SIZE PANELS
BRILLANTÉ EDGEBANDING
LACQUERED ZENIT DOORS
ZENIT CUT-TO-SIZE PANELS
EDGEBANDING
ZENIT CUT-TO-SIZE PANELS

Unsure how to web-scrape a specific value that could be in several different places

So I've been working on a web-scraping program and have been having some difficulties with one of the last bits.
There is this website that shows records of in-game fights like so:
Example 1: https://zkillboard.com/kill/44998359/
Example 2: https://zkillboard.com/kill/44917133/
I am trying to always scrape the full information of the player who scored the killing blow. That means their name, their corporation name, and their alliance name.
The information for the above examples are:
Example 1: Name = Happosait, Corp. = Arctic Light Inc., Alliance = Arctic Light
Example 2: Name = Lord Veninal, Corp. = Sniggerdly, Alliance = Pandemic Legion
While the "Final Blow" is always listed in the top right with the name, the name does not have the corporation and alliance with it as well. The full information is always listed below in the right-hand column, "## Involved", but their location in that column depends on how much damage they did in the fight, so it is not always on top, or anywhere specific for that matter.
So while I can get their names with:
kbPilotName = soup.find_all('td', style="text-align: center;")[0].find_all('a', href=re.compile('/character/'))[0].img.get('alt')
How can I get the rest of their information?
There is a textarea element containing all the data you are looking for. It's all in one text, but it's structured. You can choose a different way to parse it, but here is an example using regex:
import re
from bs4 import BeautifulSoup
import requests

url = 'https://zkillboard.com/kill/44998359/'
pattern = re.compile(r"(?s)Name: (.*?)Security: (.*?)Corp: (.*?)Alliance: (.*?)")

with requests.Session() as session:
    session.headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/40.0.2214.115 Safari/537.36'}
    response = session.get(url)
    soup = BeautifulSoup(response.content)

    data = soup.select('form.form textarea#eft')[0].text
    for name, security, corp, alliance in pattern.findall(data):
        print name.strip()
Prints:
Happosait (laid the final blow)
Baneken
Perkel
Tibor Vherok
Kheo Dons
Kayakka
Lina Ectelion
Jay Burner
Zalamus
Draacan Ferox
Luwanii
Jousen Momaki
Varcuntis Morannear
Grimm K-Man
Wob'Niar
Godfrey Silvarna
Quintus Corvus
Shadow Altair
Sieren
Isha Vir
Argyrosdraco
Jack None
Strixi
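Since the question only needs the attacker who scored the killing blow, you can filter the parsed entries for the "(laid the final blow)" marker that shows up in the textarea text. A sketch building on the code above, assuming each Alliance value ends at a line break (the original pattern leaves the last group lazy, so it would capture an empty alliance):
final_blow = re.compile(r"(?s)Name: (.*?)Security: (.*?)Corp: (.*?)Alliance: (.*?)\n")
for name, security, corp, alliance in final_blow.findall(data):
    if '(laid the final blow)' in name:
        print(name.replace('(laid the final blow)', '').strip())
        print(corp.strip())
        print(alliance.strip())
        break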
Alternative solution (parsing "involved" page):
from bs4 import BeautifulSoup
import requests

url = 'https://zkillboard.com/kill/44998359/'
involved_url = 'https://zkillboard.com/kill/44998359/involved/'

with requests.Session() as session:
    session.headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/40.0.2214.115 Safari/537.36'}
    session.get(url)

    response = session.get(involved_url)
    soup = BeautifulSoup(response.content)
    for row in soup.select('table.table tr.attacker'):
        name, corp, alliance = row.select('td.pilot > a')
        print name.text, corp.text, alliance.text
