How to scrape a CSS icon with Python and Beautiful Soup

I am scraping the payment methods of a website, but the payment methods are all added with the help of CSS, and I don't know how to scrape that. I tried searching on Stack Overflow but was unable to find any helpful material.
The payment methods are shown at the end of the page, on the bottom left side.
payment_method = soup.find("div", class_="footer-second")
payment_method = payment_method.find("div", class_="drz-footer-width-25 payment-column")
payment_methods = payment_method.find_all("span")
This is the code I used, but I have no idea how to scrape the images or image links from the classes, so I am unable to code further. There is no href or src attribute in the tags; only a CSS class is used to show each icon on the page.

The icons come from a single image URL. You would need to regex out the image URL from the relevant source CSS file and then try some form of optical recognition software. The following gets you the URL.
import requests, re

# download the footer stylesheet and pull the sprite URL out of the
# background-image rule for .drz-footer-sprit
r = requests.get('https://laz-g-cdn.alicdn.com/lzdmod/desktop-footer-daraz/5.2.38/??pc/index.css')
p = re.compile(r'icon-yatra-v-pk{.*\.drz-footer-sprit{background-image:url\((.*?)\);')
image_url = 'https:' + p.findall(r.text)[0]
print(image_url)
The CSS instructions place parts of this image "in frame" via the class attributes and CSS styling rules. For example, inspect an element with the class .icon-yatra-payment-8 in your browser's developer tools and examine the CSS for that node: you will see background-position, width and height specified, as well as inline-block display with the sprite as the background image. You will also see links to the source CSS files for these rules.
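As a minimal sketch of that idea (assuming the stylesheet keeps its minified .class{...} layout; sprite_frame is a hypothetical helper written for this answer, not a library call), you could pull those rules out of the same stylesheet to learn which region of the sprite a given icon class displays:
import requests, re

css = requests.get('https://laz-g-cdn.alicdn.com/lzdmod/desktop-footer-daraz/5.2.38/??pc/index.css').text

def sprite_frame(css_text, class_name):
    # find the minified rule block for the class, e.g. ".icon-yatra-payment-8{...}"
    block = re.search(r'\.' + re.escape(class_name) + r'\{([^}]*)\}', css_text)
    if block is None:
        return None
    # keep only the properties that define the visible region of the sprite
    frame = {}
    for prop in ('background-position', 'width', 'height'):
        m = re.search(r'(?:^|;)' + prop + r':([^;]+)', block.group(1))
        if m:
            frame[prop] = m.group(1).strip()
    return frame

print(sprite_frame(css, 'icon-yatra-payment-8'))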

Related

I can't extract a link from HTML with Python and BeautifulSoup (beginner)

I'm a complete beginner at web scraping and programming with Python. The answer might be somewhere on the forum, but I'm so new that I don't really know what to look for. So I hope you can help me:
Last week I completed a three-day course in web scraping with Python, and at the moment I'm trying to brush up on what I've learned so far.
I'm trying to scrape a specific link from a website, so that later on I can create a loop that extracts all the other links. But I can't seem to extract any link, even though they are visible in the HTML.
Here is the website (Danish).
Here is the link I'm trying to extract.
The link I'm trying to extract is located in this HTML:
<a class="nav-action-arrow-underlined" href="/afgorelser-og-domme/afgorelser-fra-styrelsen-for-patientklager/22sfp108/" aria-label="Læs mere om Regionen tilbød ikke">Læs mere</a>
Here is the Python code I've tried so far:
import requests
from bs4 import BeautifulSoup

url = "https://stpk.dk/afgorelser-og-domme/afgorelser-fra-styrelsen-for-patientklager/"
r = requests.get(url)
html = r.text
soup = BeautifulSoup(html, "html.parser")
a_tags = soup.find_all("a")
len(a_tags)
# there are 34
I've then tried going through all the a-tags from 0 to 33 without finding the link.
If I print a_tags[26], I get this tag:
<a aria-current="page" class="nav-action is-current" href="/afgorelser-og-domme/afgorelser-fra-styrelsen-for-patientklager/">Afgørelser fra Styrelsen for Patientklager</a>
which is somewhere at the top of the website. But the next tag, a_tags[27], is an element at the bottom of the site:
<a class="footer-linkedin" href="``https://www.linkedin.com/company/styrelsen-for-patientklager/``" rel="noopener" target="_blank" title="``https://www.linkedin.com/company/styrelsen-for-patientklager/``"><span class="sr-only">Linkedin profil</span></a>
Can anyone help me by telling me how to access the specific part of the HTML that contains the link?
When I find out how to pull out the link, my plan is to do the following:
path = "/afgorelser-og-domme/afgorelser-fra-styrelsen-for-patientklager/22sfp108/"
full_url = f"https://stpk.dk{path}"
print(full_url)
You will not find what you are looking for, because requests does not render websites the way a browser does - but no worries, there is an alternative.
The content is dynamically loaded via an API, so you should call it directly, and you will get JSON that contains the displayed information.
To find such endpoints, take a closer look at your browser's developer tools and check the tab for XHR requests. It may take a minute to read up on and follow the topic:
https://developer.mozilla.org/en-US/docs/Glossary/XHR_(XMLHttpRequest)
Simply iterate over the items, extract the url value, and prepend the base_url.
Check and adjust the following parameters to your needs:
containerKey: a76f4a50-6106-4128-bc09-a1da7695902b
query:
year:
category:
legalTheme:
specialty:
profession:
treatmentPlace:
critiqueType:
take: 200
skip: 0
Example
import requests
url = 'https://stpk.dk/api/verdicts/settlements/?containerKey=a76f4a50-6106-4128-bc09-a1da7695902b&query=&year=&category=&legalTheme=&specialty=&profession=&treatmentPlace=&critiqueType=&take=200&skip=0'
base_url = 'https://stpk.dk'
for e in requests.get(url).json()['items']:
    print(base_url + e['url'])
Output
https://stpk.dk/afgorelser-og-domme/afgorelser-fra-styrelsen-for-patientklager/22sfp108/
https://stpk.dk/afgorelser-og-domme/afgorelser-fra-styrelsen-for-patientklager/22sfp107/
https://stpk.dk/afgorelser-og-domme/afgorelser-fra-styrelsen-for-patientklager/22sfp106/
https://stpk.dk/afgorelser-og-domme/afgorelser-fra-styrelsen-for-patientklager/22sfp105/
...

Extract Instagram Post description using Python selenium

Good morning,
I'm currently trying to download a certain field of an Instagram post using Python and Selenium. Specifically, I'm trying to download the caption (description) of the picture, which, for example, in the picture below would be the section that starts with the text "Thanks #lolap ..." all the way down to the hashtags.
I tried the following code, but it appears that it isn't working (it throws an exception right away):
caption = driver.findElement(By.xpath("/html/body/div[3]/div[2]/div/article/div[2]/div[1]/ul/div/li/div/div/div[2]/span/text()")) #get all the caption text in a String
Thanks for your help.
Are you just trying to collect all the hashtags?
Try this:
hashtags = driver.find_elements_by_xpath("//a[@class='xil3i']")
for tag in hashtags:
    print(tag.text)
Or, if you are looking for the picture description:
desc_text = driver.find_element_by_xpath("//span[@title='Edited']").text
print(desc_text)
This worked for me.
from bs4 import BeautifulSoup

soup = BeautifulSoup(driver.page_source, 'html.parser')
hashtags = soup.find_all('a', class_='xil3i')
for tag in hashtags:
    print(tag.text)
My IG posts' class is xil3i, but I got an empty value when using .text. This code solved my problem.
Get the full description using the below:
comments = driver.find_elements(
    By.CSS_SELECTOR,
    "span._aacl._aaco._aacu._aacx._aad7._aade",
)
description = comments[0].text
print(f"Description: {description}")

Scraping Linkedin Job Requirements

I am new to Python and I hope someone on here can help me. I am building a program, as part of my learning, to scrape information from LinkedIn job adverts. So far it has gone well, but I seem to have hit a brick wall with this particular issue. I am attempting to scrape the full job description, including the qualifications. I have identified the XPath for the description and am able to reference it via the following:
desc_xpath = '/html/body/main/section/div[2]/section[2]/div'
This gives me nearly all of the job description information, but it does not include the qualifications section of a LinkedIn job profile. I get the high-level, wordy part of each job profile, but the further drill-downs, such as responsibilities, qualifications and extra qualifications, do not seem to get pulled by this reference.
Is anybody able to help?
Kind regards
D
Example Code
driver.get('https://www.linkedin.com/jobs/view/etl-developer-at-barclays-2376164866/?utm_campaign=google_jobs_apply&utm_source=google_jobs_apply&utm_medium=organic&originalSubdomain=uk')
time.sleep(3)
#job description
jobdesc_xpath = '/html/body/main/section[1]/section[3]/div/section/div'
job_descs = driver.find_element_by_xpath(jobdesc_xpath).text
print(job_descs)
Selenium struggles to get text that is spread across different sub-tags. You could use an HTML parser such as BeautifulSoup instead. Try this:
from bs4 import BeautifulSoup

url = 'https://www.linkedin.com/jobs/view/etl-developer-at-barclays-2376164866/?utm_campaign=google_jobs_apply&utm_source=google_jobs_apply&utm_medium=organic&originalSubdomain=uk'
driver.get(url)

# Find the job description
job_desc = driver.find_element_by_xpath('//div[@class="show-more-less-html__markup show-more-less-html__markup--clamp-after-5"]')

# Get the html of the element and pass it into the BeautifulSoup parser
soup = BeautifulSoup(job_desc.get_attribute('outerHTML'), 'html.parser')

# The parser joins the paragraphs onto one line by default. Use separator='\n\n' to print each paragraph on a new line with an empty line between paragraphs
print(soup.get_text(separator='\n\n'))

How do I dynamically scrape websites for CSS files based on user input?

I'm working on a group project where we're trying to rank website designs based on the number of colours they use.
I used a regex to parse through a 'style.css' file that I had already downloaded and got the colour counting working, but I'm struggling with the URL-scraping part. I want to be able to access the CSS code straight from whatever URL the user inputs.
I'm pretty new to programming, so I'd appreciate any help offered, because I've been looking at multiple solutions but I don't really understand them or how to adapt them for my needs.
Here is a simple example program that will find all the in-page style data for a page, as well as find all linked stylesheets, and print everything out. This should get you started, but you'll have to hook it up to your colour-counting system.
import urllib.request as req
from bs4 import BeautifulSoup

url = input('enter a full website address: ')
html = req.urlopen(url)  # request the initial page
soup = BeautifulSoup(html, 'html.parser')

for styles in soup.select('style'):  # get in-page style tags
    print('in page style:')
    print(styles.string)

for link in soup.find_all('link', type='text/css'):  # get links to external style sheets
    address = link['href']  # the address of the stylesheet
    if address.startswith('/'):  # relative link
        address = url + address
    css = req.urlopen(address).read()  # make a request to download the stylesheet from the address
    print('linked stylesheet')
    print(css)
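To hook this up to the counting step, here is a minimal sketch (assuming, as the question describes, that colours are counted by matching hex literals with a regex; count_colours is a hypothetical helper, and #fff vs #ffffff count as different strings here):
import re

def count_colours(css_text):
    # collect distinct hex colour literals such as #fff or #1a2b3c
    colours = {c.lower() for c in re.findall(r'#(?:[0-9a-fA-F]{3}){1,2}\b', css_text)}
    return len(colours)

# usage with the loop above: css is bytes, so decode it first
# print(count_colours(css.decode('utf-8')))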

Extracting comments from news articles

My question is similar to the one asked here:
https://stackoverflow.com/questions/14599485/news-website-comment-analysis
I am trying to extract the comments from any news article, e.g. I have a news URL here:
http://www.cnn.com/2013/09/24/politics/un-obama-foreign-policy/
I am trying to use BeautifulSoup in Python to extract the comments. However, it seems the comment section is either embedded within an iframe or loaded through JavaScript. Viewing the source through Firebug does not reveal the source of the comments section, but explicitly viewing the source of the comments through the browser's view-source feature does. How do I go about extracting the comments, especially when the comments come from a different URL embedded within the news web page?
This is what I have done till now, although it is not much:
import urllib2
from bs4 import BeautifulSoup

opener = urllib2.build_opener()
url = 'http://www.cnn.com/2013/08/28/health/stem-cell-brain/index.html'
urlContent = opener.open(url).read()
soup = BeautifulSoup(urlContent)
title = soup.title.text
print title
body = soup.findAll('body')
outfile = open("brain.txt", "w+")
for i in body:
    i = i.text.encode('ascii', 'ignore')
    outfile.write(i + '\n')
Any help on what I need to do or how to go about it will be much appreciated.
It's inside an iframe. Check for a frame with id="dsq2".
The iframe has a src attribute, which is a link to the actual page that holds the comments.
So in BeautifulSoup: use css_soup.select("#dsq2") and get the URL from the src attribute. It will lead you to a page that has only the comments.
To get the actual comments, after you fetch the page from src, you can use this CSS selector: .post-message p
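A minimal sketch of those two steps (assuming the #dsq2 iframe and the .post-message p selector above still match the page, and using requests rather than the question's urllib2):
import requests
from bs4 import BeautifulSoup

article_url = 'http://www.cnn.com/2013/09/24/politics/un-obama-foreign-policy/'
css_soup = BeautifulSoup(requests.get(article_url).text, 'html.parser')

# step 1: find the comments iframe and follow its src to the comments-only page
iframe = css_soup.select('#dsq2')[0]
comments_soup = BeautifulSoup(requests.get(iframe['src']).text, 'html.parser')

# step 2: each comment body lives in a .post-message block
for p in comments_soup.select('.post-message p'):
    print(p.get_text(strip=True))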
And if you want to load more comments: when you click the "more comments" button, it seems to send this:
http://disqus.com/api/3.0/threads/listPostsThreaded?limit=50&thread=1660715220&forum=cnn&order=popular&cursor=2%3A0%3A0&api_key=E8Uh5l5fHZ6gD8U3KycjAIAk46f68Zw7C6eW8WSjZvCLXebZ7p0r1yrYDrLilk2F
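As a hedged sketch of calling that endpoint directly (the parameter values are the ones captured above; the 'response' and 'raw_message' field names are assumptions about the Disqus JSON layout, so check an actual response first):
import requests

api_url = 'http://disqus.com/api/3.0/threads/listPostsThreaded'
params = {
    'limit': 50,
    'thread': '1660715220',
    'forum': 'cnn',
    'order': 'popular',
    'cursor': '2:0:0',  # advances with each "more comments" page
    'api_key': 'E8Uh5l5fHZ6gD8U3KycjAIAk46f68Zw7C6eW8WSjZvCLXebZ7p0r1yrYDrLilk2F',
}

data = requests.get(api_url, params=params).json()
# assumed layout: a list of posts under 'response', plain text under 'raw_message'
for post in data['response']:
    print(post['raw_message'])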
