Can't parse bs4 src attribute using the getattr() function - python

I've created a script to parse two fields from every movie container on a webpage. The script works fine.
I'm trying to use the getattr() function to scrape text and src from two fields, movie_name and image_link. In the case of movie_name, it works. However, it fails when I try to parse image_link.
There is a function, currently commented out, which works when I uncomment it. However, my goal here is to make use of getattr() to parse src.
import requests
from bs4 import BeautifulSoup

url = "https://yts.am/browse-movies"

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36'
}

# def get_information(url):
#     res = requests.get(url,headers=headers)
#     soup = BeautifulSoup(res.text,'lxml')
#     for row in soup.select(".browse-movie-wrap"):
#         movie_name = row.select_one("a.browse-movie-title").text
#         image_link = row.select_one("img.img-responsive").get("src")
#         yield movie_name,image_link

def get_information(url):
    res = requests.get(url,headers=headers)
    soup = BeautifulSoup(res.text,'lxml')
    for row in soup.select(".browse-movie-wrap"):
        movie_name = getattr(row.select_one("a.browse-movie-title"),"text",None)
        image_link = getattr(row.select_one("img.img-responsive"),"src",None)
        yield movie_name,image_link

if __name__ == '__main__':
    for items in get_information(url):
        print(items)
How can I scrape src using the getattr() function?

The reason this works:
movie_name = getattr(row.select_one("a.browse-movie-title"),"text",None)
But this doesn't:
image_link = getattr(row.select_one("img.img-responsive"),"src",None)
is because methods and properties of a class are also attributes, so in the first example getattr() finds the Tag object's text property and returns its value. There is, however, no Python attribute called src on a Tag; src is an HTML attribute, which bs4 keeps in the tag's .attrs dictionary.
If you look at the attributes of:
row.select_one("a.browse-movie-title").attrs
You'll get:
{'href': 'https://yts.mx/movies/imperial-blue-2019', 'class': ['browse-movie-title']}
Likewise, for
row.select_one(".img-responsive").attrs
The output is:
{'class': ['img-responsive'], 'src': 'https://img.yts.mx/assets/images/movies/imperial_blue_2019/medium-cover.jpg', 'alt': 'Imperial Blue (2019) download', 'width': '170', 'height': '255'}
So, if we experiment and do this:
getattr(row.select_one(".img-responsive"), "attrs", None).src
We'll end up with:
AttributeError: 'dict' object has no attribute 'src'
Therefore, as mentioned in the comments, getattr() in the pure Python sense is not the way to read HTML attributes from bs4 objects. To get an HTML attribute, use either the .get() method or the [key] syntax.
For example:
import requests
from bs4 import BeautifulSoup

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36'
}

def get_information(url):
    soup = BeautifulSoup(requests.get(url, headers=headers).text, 'lxml')
    for row in soup.select(".browse-movie-wrap"):
        movie_name = row.select_one("a.browse-movie-title").getText()
        image_link = row.select_one("img.img-responsive").get("src")
        yield movie_name, image_link

if __name__ == '__main__':
    for items in get_information("https://yts.am/browse-movies"):
        print(items)
This produces:
('Imperial Blue', 'https://img.yts.mx/assets/images/movies/imperial_blue_2019/medium-cover.jpg')
('Ablaze', 'https://img.yts.mx/assets/images/movies/ablaze_2001/medium-cover.jpg')
('[CN] Long feng zhi duo xing', 'https://img.yts.mx/assets/images/movies/long_feng_zhi_duo_xing_1984/medium-cover.jpg')
('Bobbie Jo and the Outlaw', 'https://img.yts.mx/assets/images/movies/bobbie_jo_and_the_outlaw_1976/medium-cover.jpg')
('Adam Resurrected', 'https://img.yts.mx/assets/images/movies/adam_resurrected_2008/medium-cover.jpg')
('[ZH] The Wasted Times', 'https://img.yts.mx/assets/images/movies/the_wasted_times_2016/medium-cover.jpg')
('Promise', 'https://img.yts.mx/assets/images/movies/promise_2021/medium-cover.jpg')
and so on ...
Finally, if you really want to parse this with getattr() you can try this:
movie_name = getattr(row.select_one("a.browse-movie-title"), "getText", None)()
image_link = getattr(row.select_one("img.img-responsive"), "attrs", None)["src"]
And you'll still get the same results, but, IMHO, this is overly complicated and less readable than the plain .getText() and .get("src") syntax.
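To see the difference in isolation, here is a minimal, self-contained sketch on an in-memory snippet of HTML (the markup and values are invented for illustration; the outputs in the comments are what a recent bs4 version should print):

from bs4 import BeautifulSoup

snippet = '<a class="browse-movie-title" href="/movie">Imperial Blue</a><img class="img-responsive" src="/cover.jpg">'
soup = BeautifulSoup(snippet, "html.parser")
a = soup.select_one("a.browse-movie-title")
img = soup.select_one("img.img-responsive")

print(getattr(a, "text", None))    # Imperial Blue -- .text is a Python property defined on Tag
print(getattr(img, "src", None))   # None -- src is an HTML attribute, not a Python attribute of Tag
print(img.attrs)                   # {'class': ['img-responsive'], 'src': '/cover.jpg'}
print(img.get("src"), img["src"])  # /cover.jpg /cover.jpg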

Related

How to scrape Airbnb properly

I'm trying to scrape Airbnb, in fact only three simple pieces of information: the description, city and price of each apartment in a country. However, it is not working. Every time I get the AttributeError: "ResultSet object has no attribute 'get_text'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?"
How can I scrape this data properly?
Here is my code:
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36',
    'Accept-Language': 'en-GB, en; q=0.5',
    'Referer': 'https://google.com',
    'DNT': '1'}

url = 'https://www.airbnb.com.br/s/Italia/homes?tab_id=home_tab&refinement_paths%5B%5D=%2Fhomes&flexible_trip_lengths%5B%5D=one_week&price_filter_input_type=0&price_filter_num_nights=5&query=Italia&place_id=ChIJA9KNRIL-1BIRb15jJFz1LOI&date_picker_type=calendar&checkin=2023-03-09&checkout=2023-04-09&adults=1&source=structured_search_input_header&search_type=autocomplete_click'
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.content, 'html.parser')

features_dict = {}

descp = soup.find_all("div", {"class": "t1jojoys dir dir-ltr"}).get_text()
city = soup.find_all("div", {"class": "nquyp1l s1cjsi4j dir dir-ltr"}).get_text()
price = soup.find_all("div", {"class": "phbjkf1 dir dir-ltr"}).get_text()

features_dict['descrição'] = descp
features_dict['cidade'] = city
features_dict['preço'] = price
Well, I removed the .get_text() calls and the above error did not appear anymore; however, all my lists of HTML elements are empty. When I do a brief check to see whether all the HTML classes are there, I discover that the classes I'm interested in don't appear.
class_list = set()
tags = {tag.name for tag in soup.find_all()}
for tag in tags:
    for i in soup.find_all(tag):
        if i.has_attr("class"):
            if len(i['class']) != 0:
                class_list.add(" ".join(i['class']))
classes = list(class_list)
What am I doing wrong?

python web scraping none value issue

I am trying to get the salary from this web page, but each time I get the same value, None, even though I tried different tags!
import requests
from bs4 import BeautifulSoup

link_content = requests.get("https://wuzzuf.net/jobs/p/KxrcG1SmaBZB-Facility-Administrator-Majorel-Egypt-Alexandria-Egypt?o=1&l=sp&t=sj&a=search-v3")
soup = BeautifulSoup(link_content.text, 'html.parser')
salary = soup.find("span", {"class": "css-47jx3m"})
print(salary)
output:
None
The page is being generated dynamically with JavaScript, so requests cannot see it as you see it. Try disabling JavaScript in your browser and hard-reloading the page, and you will see a lot of information missing. However, the data does exist in the page, inside a script tag.
One way of getting that information is by slicing that script tag, to get to the information you need [EDITED to account for different encoded keys - now it should work for any job]:
import requests
from bs4 import BeautifulSoup as bs
import json
import pandas as pd
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36'
}
url = 'https://wuzzuf.net/jobs/p/KxrcG1SmaBZB-Facility-Administrator-Majorel-Egypt-Alexandria-Egypt?o=1&l=sp&t=sj&a=search-v3'
soup = bs(requests.get(url, headers=headers).text, 'html.parser')
salary = soup.select_one('script').text.split('Wuzzuf.initialStoreState = ')[1].split('Wuzzuf.serverRenderedURL = ')[0].rsplit(';', 1)[0]
data = json.loads(salary)['entities']['job']['collection']
enc_key = [x for x in data.keys()][0]
df = pd.json_normalize(data[enc_key]['attributes']['salary'])
print(df)
Result in terminal:
min max currency period additionalDetails isPaid
0 None None None None None True
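If you don't need a DataFrame, the same values can be read straight from the parsed JSON; a small sketch reusing the data and enc_key variables from the code above:

salary_info = data[enc_key]['attributes']['salary']
print(salary_info)
# e.g. {'min': None, 'max': None, 'currency': None, 'period': None, 'additionalDetails': None, 'isPaid': True}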

How to change the code to asynchronously iterate links and IDs for scraping a web page?

I have a list of links; each link has an id that is in the Id list.
How do I change the code so that, when iterating over the links, the corresponding id is substituted into the string?
All the code is below:
import pandas as pd
from bs4 import BeautifulSoup
import requests

HEADERS = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) '
                         'Chrome/81.0.4044.138 Safari/537.36 OPR/68.0.3618.125', 'accept': '*/*'}

links = ['https://www..ie', 'https://www..ch', 'https://www..com']
Id = ['164240372761e5178f0488d', '164240372661e5178e1b377', '164240365661e517481a1e6']

def get_html(url, params=None):
    r = requests.get(url, headers=HEADERS, params=params)

def get_data_no_products(html):
    data = []
    soup = BeautifulSoup(html, 'html.parser')
    items = soup.find_all('div', id= '') # How to iteration paste id???????
    for item in items:
        data.append({'pn': item.find('a').get('href')})
    return print(data)

def parse():
    for i in links:
        html = get_html(i)
        get_data_no_products(html.text)

parse()
Parametrise your code:
def get_data_no_products(html, id_):
    data = []
    soup = BeautifulSoup(html, 'html.parser')
    items = soup.find_all('div', id=id_)
And then use zip() to pair each link with its id:
for link, id_ in zip(links, Id):
    html = get_html(link)
    get_data_no_products(html.text, id_)
Note that there are two likely bugs in your code: get_html() never returns the response (add return r at the end), and get_data_no_products() ends with return print(data), which will always return None; you most likely just want return data.
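Putting it together, a minimal sketch of the corrected flow might look like this (it reuses HEADERS, links and Id exactly as defined in the question, and also fixes the missing return in get_html):

import requests
from bs4 import BeautifulSoup

# HEADERS, links and Id are assumed to be defined as in the question above

def get_html(url, params=None):
    r = requests.get(url, headers=HEADERS, params=params)
    return r  # without this return, callers would receive None

def get_data_no_products(html, id_):
    soup = BeautifulSoup(html, 'html.parser')
    items = soup.find_all('div', id=id_)
    return [{'pn': item.find('a').get('href')} for item in items]

def parse():
    for link, id_ in zip(links, Id):
        html = get_html(link)
        print(get_data_no_products(html.text, id_))

parse()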
PS
There is another solution to this which you will frequently encounter from people beginning in Python:
for i in range(len(links)):
    link = links[i]
    id_ = ids[i]
    ...
This... works. It might even be easier or more natural if you are coming from, e.g., C. (Then again, I'd likely use pointers...). Style is very much personal, but if you're going to write in a high-level language like Python, you might as well avoid thinking about things like 'the index of the current item' as much as possible. Just my £0.02.

Recursively parse all category links and get all products

I've been playing around with web scraping (using Python 3.6.2 for this practice exercise) and I feel like I'm losing it a bit. Given this example link, here's what I want to do:
First, as you can see, there are multiple categories on the page. Clicking each of the categories above will give me other categories, then others, and so on, until I reach the products page. So I have to go x levels deep. I thought recursion would help me achieve this, but somewhere I did something wrong.
Code:
Here, I'll explain how I approached the problem. First, I created a session and a simple generic function which returns an lxml.html.HtmlElement object:
from lxml import html
from requests import Session

HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/62.0.3202.94 Safari/537.36"
}

TEST_LINK = 'https://www.richelieu.com/us/en/category/custom-made-cabinet-doors-and-drawers/1000128'

session_ = Session()

def get_page(url):
    page = session_.get(url, headers=HEADERS).text
    return html.fromstring(page)
Then, I thought I'd need two other functions:
one to get the category links
and another one to get the product links
To distinguish between the two, I figured out that only on category pages is there a title which always contains CATEGORIES, so I used that:
def read_categories(page):
    categs = []
    try:
        if 'CATEGORIES' in page.xpath('//div[@class="boxData"][2]/h2')[0].text.strip():
            for a in page.xpath('//*[@id="carouselSegment2b"]//li//a'):
                categs.append(a.attrib["href"])
            return categs
        else:
            return None
    except Exception:
        return None

def read_products(page):
    return [
        a_tag.attrib["href"]
        for a_tag in page.xpath("//ul[@id='prodResult']/li//div[@class='imgWrapper']/a")
    ]
Now, the only thing left is the recursion part, where I'm sure I did something wrong:
def read_all_categories(page):
    cat = read_categories(page)
    if not cat:
        yield read_products(page)
    else:
        yield from read_all_categories(page)

def main():
    main_page = get_page(TEST_LINK)
    for links in read_all_categories(main_page):
        print(links)
Here's all the code put together:
from lxml import html
from requests import Session

HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/62.0.3202.94 Safari/537.36"
}

TEST_LINK = 'https://www.richelieu.com/us/en/category/custom-made-cabinet-doors-and-drawers/1000128'

session_ = Session()

def get_page(url):
    page = session_.get(url, headers=HEADERS).text
    return html.fromstring(page)

def read_categories(page):
    categs = []
    try:
        if 'CATEGORIES' in page.xpath('//div[@class="boxData"][2]/h2')[0].text.strip():
            for a in page.xpath('//*[@id="carouselSegment2b"]//li//a'):
                categs.append(a.attrib["href"])
            return categs
        else:
            return None
    except Exception:
        return None

def read_products(page):
    return [
        a_tag.attrib["href"]
        for a_tag in page.xpath("//ul[@id='prodResult']/li//div[@class='imgWrapper']/a")
    ]

def read_all_categories(page):
    cat = read_categories(page)
    if not cat:
        yield read_products(page)
    else:
        yield from read_all_categories(page)

def main():
    main_page = get_page(TEST_LINK)
    for links in read_all_categories(main_page):
        print(links)

if __name__ == '__main__':
    main()
Could someone please point me in the right direction regarding the recursion function?
Here is how I would solve this:
from lxml import html as html_parser
from requests import Session

HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36"
}

def dig_up_products(url, session=Session()):
    html = session.get(url, headers=HEADERS).text
    page = html_parser.fromstring(html)

    # if it appears to be a categories page, recurse
    for link in page.xpath('//h2[contains(., "CATEGORIES")]/'
                           'following-sibling::div[@id="carouselSegment1b"]//li//a'):
        yield from dig_up_products(link.attrib["href"], session)

    # if it appears to be a products page, return the links
    for link in page.xpath('//ul[@id="prodResult"]/li//div[@class="imgWrapper"]/a'):
        yield link.attrib["href"]

def main():
    start = 'https://www.richelieu.com/us/en/category/custom-made-cabinet-doors-and-drawers/1000128'
    for link in dig_up_products(start):
        print(link)

if __name__ == '__main__':
    main()
There is nothing wrong with iterating over an empty XPath expression result, so you can simply put both cases (categories page/products page) into the same function, as long as the XPath expressions are specific enough to identify each case.
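As a tiny, standalone illustration of that point (the page content here is invented; the XPath expression is the one used above for product links):

from lxml import html as html_parser

page = html_parser.fromstring("<html><body><p>no products here</p></body></html>")
# an XPath expression that matches nothing returns an empty list,
# so the loop body simply never runs and no error is raised
for link in page.xpath('//ul[@id="prodResult"]/li//div[@class="imgWrapper"]/a'):
    print(link.attrib["href"])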
You can do it like this as well to make your script slightly more concise. I used the lxml library along with CSS selectors to do the job. The script parses all the links under a category and looks for a dead end; when it reaches one, it parses the title from there and then repeats the whole process until all the links are exhausted.
from lxml.html import fromstring
import requests

def products_links(link):
    res = requests.get(link, headers={"User-Agent": "Mozilla/5.0"})
    page = fromstring(res.text)
    try:
        for item in page.cssselect(".contentHeading h1"):  # check for the match available in target page
            print(item.text)
    except:
        pass
    for link in page.cssselect("h2:contains('CATEGORIES')+[id^='carouselSegment'] .touchcarousel-item a"):
        products_links(link.attrib["href"])

if __name__ == '__main__':
    main_page = 'https://www.richelieu.com/us/en/category/custom-made-cabinet-doors-and-drawers/1000128'
    products_links(main_page)
Partial result:
BRILLANTÉ DOORS
BRILLANTÉ DRAWER FRONTS
BRILLANTÉ CUT TO SIZE PANELS
BRILLANTÉ EDGEBANDING
LACQUERED ZENIT DOORS
ZENIT CUT-TO-SIZE PANELS
EDGEBANDING
ZENIT CUT-TO-SIZE PANELS

Getting a blank string when I try to print a dl reference from a URL within a loop

I have already created a loop that runs through the following results page https://beta.companieshouse.gov.uk/search/companies?q=SW181Db&page=1
I would now like to open the URLs from the results page in sequence and scrape the data from them. An example of one of those pages: https://beta.companieshouse.gov.uk/company/08569390
I was hoping that, by defining properties_col and classifying the columns as per the code below, it would give me the contents of the tags, but it's simply giving me what I believe to be a blank string, []. The output in Python is [] x 25.
My full code is below. Any ideas? Thanks and regards.
import requests
from bs4 import BeautifulSoup
import csv

base_url = 'https://beta.companieshouse.gov.uk/'

header = {'Accept':'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
          'Accept-Encoding':'gzip, deflate, sdch, br',
          'Accept-Language':'en-US,en;q=0.8,fr;q=0.6',
          'Connection':'keep-alive',
          'Cookie':'mdtp=y4Ts2Vvql5V9MMZNjqB9T+7S/vkQKPqjHHMIq5jk0J1l5l131dU0YXsq7Rr15GDyghKHrS/qcD2vdsMCVtzKByJEDZFI+roS6tN9FN5IS70q8PkCCBjgFPDZjlR1A3H9FJ/zCWXMNJbaXqF8MgqE+nhR3/lji+eK4mm/GP9b8oxlVdupo9KN9SKanxu/JFEyNXutjyN+BsxRztNem1Z+ExSQCojyxflI/tc70+bXAu3/ppdP7fIXixfEOAWezmOh3ywchn9DV7Af8wH45t8u4+Y=; mdtpdi=mdtpdi#f523cd04-e09e-48bc-9977-73f974d50cea#1484041095424_zXDAuNhEkKdpRUsfXt+/1g==; seen_cookie_message=yes; _ga=GA1.4.666959744.1484041122; _gat=1',
          'Host':'https://beta.companieshouse.gov.uk/',
          #'Referer':'https://beta.companieshouse.gov.uk/',
          'Upgrade-Insecure-Requests':'1',
          'User-Agent':'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.51 Safari/537.36'
          }

session = requests.session()

url = 'https://beta.companieshouse.gov.uk/search/companies?q=SW181Db&page=1'
response = session.get(url, headers=header)
soup = BeautifulSoup(response.content, "lxml")

rslt_table = soup.find_all('a', {'title': 'View company'})

for elem in rslt_table:
    det_url = base_url + elem['href']
    print det_url
    response = session.get(det_url, headers=header)
    soup = BeautifulSoup(response.content, "lxml")
    properties_col = soup.find_all('dl', {'class': 'column-two-thirds'})
    print properties_col
Change the base URL by removing the slash at the end:
base_url = 'https://beta.companieshouse.gov.uk'
First output:
https://beta.companieshouse.gov.uk/company/08569390
[<dl class="column-two-thirds">\n <dt>Company status</dt>\n <dd class="text data" id="company-status">\n Dissolved\n </dd>\n </dl>, <dl class="column-two-thirds">\n <dt>Company type</dt>\n <dd class="text data" id="company-type">\n Private limited Company\n </dd>\n </dl>]
from urllib.parse import urljoin
base_url = 'https://beta.companieshouse.gov.uk/'
href = '/company/08569390'
urljoin(base_url, href)
out:
'https://beta.companieshouse.gov.uk/company/08569390'
You have an extra / in the base_url; use urljoin to avoid this problem.
If you use + to join the URL parts, the output is:
'https://beta.companieshouse.gov.uk//company/08569390'
