I am new to Python and I hope someone here can help me. As part of my learning I am building a program to scrape information from LinkedIn job adverts. So far it has gone well, but I seem to have hit a brick wall with this particular issue. I am attempting to scrape the full job description, including the qualifications. I have identified the XPath for the description and am able to reference it as follows:
desc_xpath = '/html/body/main/section/div[2]/section[2]/div'
This gives me nearly all of the job description, but it does not include the qualifications section of a LinkedIn job profile. I can extract the high-level, wordy part of each job profile, but the nested sections such as responsibilities, qualifications and extra qualifications do not get pulled in by this reference.
Is anybody able to help?
Kind regards
D
Example Code
import time

driver.get('https://www.linkedin.com/jobs/view/etl-developer-at-barclays-2376164866/?utm_campaign=google_jobs_apply&utm_source=google_jobs_apply&utm_medium=organic&originalSubdomain=uk')
time.sleep(3)

# job description
jobdesc_xpath = '/html/body/main/section[1]/section[3]/div/section/div'
job_descs = driver.find_element_by_xpath(jobdesc_xpath).text
print(job_descs)
Selenium can struggle to collect text that is spread across nested sub-tags. You could instead grab the element's HTML and pass it to an HTML parser such as BeautifulSoup. Try this:
from bs4 import BeautifulSoup
url = 'https://www.linkedin.com/jobs/view/etl-developer-at-barclays-2376164866/?utm_campaign=google_jobs_apply&utm_source=google_jobs_apply&utm_medium=organic&originalSubdomain=uk'
driver.get(url)
#Find the job description
job_desc = driver.find_element_by_xpath('//div[@class="show-more-less-html__markup show-more-less-html__markup--clamp-after-5"]')
#Get the html of the element and pass into BeautifulSoup parser
soup = BeautifulSoup(job_desc.get_attribute('outerHTML'), 'html.parser')
# By default get_text() joins the paragraphs together. Use separator='\n' to put each paragraph on a new line, or '\n\n' to leave an empty line between paragraphs
print(soup.get_text(separator='\n\n'))
Related
I'm a complete beginner with web scraping and programming in Python. The answer might already be somewhere on the forum, but I'm so new that I don't really know what to look for, so I hope you can help me:
Last week I completed a three-day course in web scraping with Python, and at the moment I'm trying to brush up on what I've learned so far.
I'm trying to scrape a specific link from a website, so that later on I can create a loop that extracts all the other links. But I can't seem to extract any link, even though they are visible in the HTML.
Here is the website (Danish).
Here is the link I'm trying to extract.
The link I'm trying to extract is located in this HTML:
<a class="nav-action-arrow-underlined" href="/afgorelser-og-domme/afgorelser-fra-styrelsen-for-patientklager/22sfp108/" aria-label="Læs mere om Regionen tilbød ikke">Læs mere</a>
Here is the Python code I've tried so far:
import requests
from bs4 import BeautifulSoup

url = "https://stpk.dk/afgorelser-og-domme/afgorelser-fra-styrelsen-for-patientklager/"
r = requests.get(url)
html = r.text
soup = BeautifulSoup(html, "html.parser")
a_tags = soup.find_all("a")
len(a_tags)
# there are 34
I've then tried going through all the a-tags from 0 to 33 without finding the link.
If I print a_tags[26], I get this tag:
<a aria-current="page" class="nav-action is-current" href="/afgorelser-og-domme/afgorelser-fra-styrelsen-for-patientklager/">Afgørelser fra Styrelsen for Patientklager</a>
which is somewhere near the top of the page. But the next tag, a_tags[27], is from the bottom of the page:
<a class="footer-linkedin" href="https://www.linkedin.com/company/styrelsen-for-patientklager/" rel="noopener" target="_blank" title="https://www.linkedin.com/company/styrelsen-for-patientklager/"><span class="sr-only">Linkedin profil</span></a>
Can anyone tell me how to access the specific part of the HTML that contains the link?
Once I find out how to pull out the link, my plan is to do the following:
path = "/afgorelser-og-domme/afgorelser-fra-styrelsen-for-patientklager/22sfp108/"
full_url = f"https://stpk.dk{path}"
print(full_url)
You will not find what you are looking for, because requests does not render websites the way a browser does - but no worries, there is an alternative.
The content is loaded dynamically via an API, so you should call that directly, and you will get JSON back that contains the displayed information.
To find such endpoints, take a closer look at your browser's developer tools and check the XHR requests tab. It may take a minute to read up on the topic:
https://developer.mozilla.org/en-US/docs/Glossary/XHR_(XMLHttpRequest)
Simply iterate over the items, extract the url value and prepend the base_url.
Check and adjust the following query parameters to your needs:
containerKey: a76f4a50-6106-4128-bc09-a1da7695902b
query:
year:
category:
legalTheme:
specialty:
profession:
treatmentPlace:
critiqueType:
take: 200
skip: 0
Example
import requests
url = 'https://stpk.dk/api/verdicts/settlements/?containerKey=a76f4a50-6106-4128-bc09-a1da7695902b&query=&year=&category=&legalTheme=&specialty=&profession=&treatmentPlace=&critiqueType=&take=200&skip=0'
base_url = 'https://stpk.dk'
for e in requests.get(url).json()['items']:
    print(base_url + e['url'])
Output
https://stpk.dk/afgorelser-og-domme/afgorelser-fra-styrelsen-for-patientklager/22sfp108/
https://stpk.dk/afgorelser-og-domme/afgorelser-fra-styrelsen-for-patientklager/22sfp107/
https://stpk.dk/afgorelser-og-domme/afgorelser-fra-styrelsen-for-patientklager/22sfp106/
https://stpk.dk/afgorelser-og-domme/afgorelser-fra-styrelsen-for-patientklager/22sfp105/
...
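If you ever need more results than one take of 200 can hold, you can page through with the skip parameter. A minimal sketch, assuming the endpoint treats omitted filters like the empty ones in the URL above and returns an empty items list once skip passes the end:
import requests

base_url = 'https://stpk.dk'
api = 'https://stpk.dk/api/verdicts/settlements/'
params = {'containerKey': 'a76f4a50-6106-4128-bc09-a1da7695902b', 'take': 200, 'skip': 0}

while True:
    items = requests.get(api, params=params).json()['items']
    if not items:  # no more results
        break
    for e in items:
        print(base_url + e['url'])
    params['skip'] += params['take']  # move the window forward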
For a little personal project, I would like to scrape the episode summaries on Wikipedia for TV series:
for example, I started with this page: Andor.
I wrote this script, and it seems to do what I would like:
from urllib.request import urlopen
from bs4 import BeautifulSoup
# Specify url of the web page
source = urlopen('https://en.wikipedia.org/wiki/Obi-Wan_Kenobi_(TV_series)').read()
# Make a soup
soup = BeautifulSoup(source,'lxml')
print(set([text.parent.name for text in soup.find_all(text=True)]))
tab = soup.find("table",{"class":"wikitable plainrowheaders wikiepisodetable"})
spans = tab.find_all('td')
# tds with actual text
x = [i for i in range(4,len(spans),5)]
tds = [i for i in spans if spans.index(i) in x]
text = ''
for paragraph in tds:
    text += paragraph.text
#cleaning a bit
import re
text = re.sub(r'\[.*?\]+', '', text)
text = text.replace('\n', '')
text
The problem is that this does not work in other cases:
Big Bang Theory page
Here, you have to go to a main page for the episode list, and then there is a separate page for each season.
Another, different example is:
Loki
Here, the link to the episode summaries is on the same page as the main article, but you still have to pass through another page to reach them.
I would like to know if there is a way to write a script that handles all these cases in a simple way, or whether there is a simpler route (maybe, instead of scraping, a Wikipedia database that can be accessed for the same information).
You don't need to scrape Wikipedia because they already have a client library:
pip install wikipedia
detailed documentation:
https://wikipedia.readthedocs.io/en/latest/code.html#api
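For example, getting at an article's plain-text content is a couple of lines. A minimal sketch; note that wikitable-formatted episode lists may not survive the plain-text extract, so check the output for your pages:
import wikipedia

# fetch the article and print its plain-text content
page = wikipedia.page("Obi-Wan Kenobi (TV series)")
print(page.content)

# section() returns the plain text of a named section, or None if it can't find it
print(page.section("Episodes"))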
Good morning,
I'm currently trying to download a certain field of an Instagram post using Python Selenium. Specifically, I'm trying to download the caption (description) of the picture, which in my example is the section that starts with the text "Thanks #lolap ..." and runs all the way down to the hashtags.
I tried the following code, but it isn't working (it throws an exception right away):
caption = driver.findElement(By.xpath("/html/body/div[3]/div[2]/div/article/div[2]/div[1]/ul/div/li/div/div/div[2]/span/text()")) #get all the caption text in a String
Thanks for your help.
Are you just trying to collect all the hashtags?
Try this:
hashtags = driver.find_elements_by_xpath("//a[@class='xil3i']")
for tag in hashtags:
    print(tag.text)
Or, if you are looking for the picture description:
desc_text = driver.find_element_by_xpath("//span[#title='Edited']").text
print(desc_text)
This worked for me.
soup = BeautifulSoup(driver.page_source, 'html.parser')
hashtags = soup.find_all('a', class_='xil3i')
for tag in hashtags:
    print(tag.text)
My IG posts' hashtag class is xil3i, but I got an empty value when using .text through Selenium. This code solves my problem.
Get the full description using the below:
from selenium.webdriver.common.by import By

def get_description(driver):
    # the caption is the first span carrying Instagram's (obfuscated) caption classes
    comments = driver.find_elements(
        by=By.CSS_SELECTOR,
        value="span._aacl._aaco._aacu._aacx._aad7._aade",
    )
    description = comments[0].text
    print(f"Description: {description}")
    return description
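Usage, assuming the driver already has the post open (get_description is just the hypothetical name given to the wrapped snippet above):
print(get_description(driver))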
OK, so I'm working on a self-directed term project for my college programming course. My plan is to scrape different parts of the Overwatch League website for stats etc., save them in a database, and then pull from that database with a Discord bot. However, I'm running into issues with the website itself. Here's a screenshot of the HTML for the standings page.
As you can see, it's quite convoluted and hard to navigate, with the repeated div and body tags, and I'm pretty sure it's dynamically created. My prof recommended I find a way to isolate the rank title at the top of the table, access its parent, and then iterate through the siblings to pull data such as the team name, position, etc. into a dictionary for now (my attempt to spell that out is sketched below). I haven't been able to find anything online that helps; most websites don't provide enough information or are out of date.
Here's what I have so far:
from bs4 import BeautifulSoup
import requests
import re
import pprint

url = 'https://overwatchleague.com/en-us/standings'
response = requests.get(url).text
page = BeautifulSoup(response, features='html.parser')

# for stat in page.find(string=re.compile("rank")):
#     statObject = {
#         'standing' : stat.find(string=re.compile, attrs={'class' : 'standings-table-v2styles__TableCellContent-sc-3q1or9-6 jxEkss'}).text.encode('utf-8')
#     }

# print(page.find_all('span', re.compile("rank")))

# for tag in page.find_all(re.compile("rank")):
#     print(tag.name)

print(page.find(string=re.compile('rank')))
"""
# locate branch with the rank header,
# move up to the parent branch
# iterate through all the siblings and
# save the data to objects
"""
The commented-out blocks are all failed attempts, and all return nothing. The only line not commented out returns a massive JSON blob with a lot of unnecessary information, which does include what I want to parse out and use for my project. I've linked it as a Google Doc and highlighted what I'm looking to grab.
I'm not really sure how else to approach this at this point. I've considered using Selenium, but I lack knowledge of JavaScript, so I'm trying to avoid it if possible. Even some advice on how else to approach this would be greatly appreciated.
Thank you
As you have noticed, your data is in JSON format. It is embedded in a script tag directly in the page, so it's easy to get using BeautifulSoup. Then you need to parse the JSON to extract all the tables (corresponding to the 3 tabs):
import requests
from bs4 import BeautifulSoup
import json
url = 'https://overwatchleague.com/en-us/standings'
r = requests.get(url)
soup = BeautifulSoup(r.text, "html.parser")
script = soup.find("script",{"id":"__NEXT_DATA__"})
data = json.loads(script.text)
tabs = [
    i.get("standings")["tabs"]
    for i in data["props"]["pageProps"]["blocks"]
    if i.get("standings") is not None
]
result = [
    { i["title"] : i["tables"][0]["teams"] }
    for i in tabs[0]
]
print(json.dumps(result, indent=4, sort_keys=True))
The above code gives you a list of dictionaries: the keys are the titles of the 3 tabs and the values are the table data.
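From there it's ordinary dict access. A small sketch, assuming each team entry carries a "name" key - check the printed JSON for the real field names:
# print the first tab's title and the names of its teams
tab_title, teams = next(iter(result[0].items()))
print(tab_title)
for team in teams:
    print(team.get("name"))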
I'm working on a group project where we're trying to rank website designs based on the number of colours.
I used a regex to parse a style.css file that I had already downloaded, and I've got the colour counting working, but I'm struggling with the URL-scraping part. I want to be able to access the CSS straight from whatever URLs the user inputs.
I'm pretty new to programming, so I'd appreciate any help, because I've been looking at multiple solutions but I don't really understand them or how to adapt them for my needs.
Here is a simple example program that finds all the in-page style data for a page, as well as all linked stylesheets, and prints everything out. This should get you started, but you'll have to hook it up to your colour-counting system.
import urllib.request as req
from urllib.parse import urljoin
from bs4 import BeautifulSoup

url = input('enter a full website address: ')
html = req.urlopen(url)  # request the initial page
soup = BeautifulSoup(html, 'html.parser')

for styles in soup.select('style'):  # get in-page style tags
    print('in page style:')
    print(styles.string)

for link in soup.find_all('link', type='text/css'):  # get links to external style sheets
    address = urljoin(url, link['href'])  # resolve relative links against the page url
    css = req.urlopen(address).read()  # download the stylesheet
    print('linked stylesheet')
    print(css)
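One way to hook this up to the colour counter is a hex-colour regex over each downloaded sheet. A sketch; the pattern only catches #rgb/#rrggbb values, not rgb()/hsl() or named colours, so extend it to match whatever your existing regex handles:
import re

def count_hex_colours(css_bytes):
    # count distinct 3- or 6-digit hex colours in the stylesheet text
    text = css_bytes.decode('utf-8', errors='ignore')
    colours = re.findall(r'#(?:[0-9a-fA-F]{6}|[0-9a-fA-F]{3})\b', text)
    return len(set(c.lower() for c in colours))

# e.g. inside the loop above: print(count_hex_colours(css))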