Issue with web scraping Job search sites - python

I'm web scraping the Monster job site with the search aimed at "Software Developer", and my aim is to print only the jobs that have "python" listed in their description in the Python terminal, while discarding all the other jobs for Java, HTML, CSS, etc. However, when I run this code I end up printing all the jobs on the page.
To solve this I created a variable (called 'search') that searches for all jobs with 'python' and converts the text to lowercase. I also created a variable (called 'python_jobs') that includes all the job listings on the page.
Then I made a "for" loop that looks for every instance where 'search' is found in 'python_jobs'. However, this gives the same result as before and prints out every job listing on the page anyway. Any suggestions?
import requests
from bs4 import BeautifulSoup

URL = "https://www.monster.com/jobs/search/?q=Software-Developer"
page = requests.get(URL)
print(page)
soup = BeautifulSoup(page.content, "html.parser")
results = soup.find(id="ResultsContainer")

search = results.find_all("h2", string=lambda text: "python" in text.lower())
python_jobs = results.find_all("section", class_="card-content")
print(len(search))

for search in python_jobs:
    title = search.find("h2", class_="title")
    company = search.find("div", class_="company")
    if None in (title, company):
        continue
    print(title.text.strip())
    print(company.text.strip())
    print()

Your problem is that you have two separate lists, search and python_jobs, which are not related, and later you don't even use the search list. You should instead take every item from python_jobs and search for 'python' inside that item.
import requests
from bs4 import BeautifulSoup

URL = "https://www.monster.com/jobs/search/?q=Software-Developer"
page = requests.get(URL)
print(page)
soup = BeautifulSoup(page.content, "html.parser")
results = soup.find(id="ResultsContainer")
all_jobs = results.find_all("section", class_="card-content")

for job in all_jobs:
    python = job.find("h2", string=lambda text: "python" in text.lower())
    if python:
        title = job.find("h2", class_="title")
        company = job.find("div", class_="company")
        print(title.text.strip())
        print(company.text.strip())
        print()
or
import requests
from bs4 import BeautifulSoup

URL = "https://www.monster.com/jobs/search/?q=Software-Developer"
page = requests.get(URL)
print(page)
soup = BeautifulSoup(page.content, "html.parser")
results = soup.find(id="ResultsContainer")
all_jobs = results.find_all("section", class_="card-content")

for job in all_jobs:
    title = job.find("h2")
    if title:
        title = title.text.strip()
        if 'python' in title.lower():
            company = job.find("div", class_="company").text.strip()
            print(title)
            print(company)
            print()
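One caveat with the string=lambda filter used in the first variant: BeautifulSoup passes each tag's .string to that function, and the value is None when the tag wraps nested children, so text.lower() can raise an AttributeError on some pages. A minimal None-safe sketch, reusing the selectors above (which are assumptions about Monster's markup):

import requests
from bs4 import BeautifulSoup

URL = "https://www.monster.com/jobs/search/?q=Software-Developer"
page = requests.get(URL)
soup = BeautifulSoup(page.content, "html.parser")
results = soup.find(id="ResultsContainer")

for job in results.find_all("section", class_="card-content"):
    # text is None when the <h2> wraps nested tags, so guard before calling .lower()
    heading = job.find("h2", string=lambda text: text and "python" in text.lower())
    if heading:
        print(heading.text.strip())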

Related

Retrieving multiple values with Beautiful Soup

I have a file which I am using to parse articles in the references section of a Wikipedia page. It is currently set up so that it returns the URLs of every item in the references section.
I'm trying to get it to output both the link (which it does currently) and the text of the link, either on a single line:
https://this.is.the.url "And this is the article header"
or over consecutive lines:
https://this.is.the.url
"And this is the article header"
Link Sample
<a
rel="nofollow"
class="external text"
href="https://www.mmajunkie.usatoday.com/2020/08/gerald-meerschaert-tests-positive-covid-19-ed-herman-fight-off-ufc-on-espn-plus-31/amp">
"Gerald Meerschaert tests positive for COVID-19; Ed Herman fight off UFC on ESPN+ 31"
</a>
Scraper
import requests
import sys
from bs4 import BeautifulSoup

session = requests.Session()
selectWikiPage = "https://en.wikipedia.org/wiki/UFC_Fight_Night:_Waterson_vs._Hill"

if "wikipedia" in selectWikiPage:
    html = session.post(selectWikiPage)
    bsObj = BeautifulSoup(html.text, "html.parser")
    references = bsObj.find('ol', {'class': 'references'})
    href = BeautifulSoup(str(references), "html.parser")
    links = [a["href"] for a in href.find_all("a", class_="external text", href=True)]
    title = [a["href"] for a in href.find_all("a", class_="external text", href=True)]
    for link in links:
        print(link)
else:
    print("Error: Please enter a valid Wikipedia URL")
Fixed it:
import requests
import sys
from bs4 import BeautifulSoup

session = requests.Session()
selectWikiPage = "https://en.wikipedia.org/wiki/UFC_Fight_Night:_Waterson_vs._Hill"

if "wikipedia" in selectWikiPage:
    html = session.post(selectWikiPage)
    bsObj = BeautifulSoup(html.text, "html.parser")
    references = bsObj.find('ol', {'class': 'references'})
    href = BeautifulSoup(str(references), "html.parser")
    for a in href.find_all("a", class_="external text", href=True):
        listitem = [a["href"], a.getText()]
        print(listitem)
else:
    print("Error: Please enter a valid Wikipedia URL")
Instead of only getting the href attribute of the anchor tag, you can also get the text of the link.
This can be done simply by

links = [(a["href"], a.text)
         for a in href.find_all("a", class_="external text", href=True)]

for link, title in links:
    print(link, title)

Now each element of links will be a tuple with the link and the title.
You can now display it however you want.
Also, a.text can be written as a.getText() or a.get_text(), so choose whatever suits your code style.
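If you want the exact output formats from the question (the URL followed by the quoted article header, either on one line or over consecutive lines), a small formatting sketch built on that same links list could look like the following; the quoting style is an assumption:

# 'links' is the list of (href, text) tuples built above
for link, title in links:
    # single line: URL followed by the quoted article header
    print(f'{link} "{title}"')

    # or over consecutive lines
    print(link)
    print(f'"{title}"')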

Follow link in forum to scrape thread (comments) using BS4

I have a forum with 3 threads. I am trying to scrape the data in all three posts, so I need to follow the href link to each post and scrape the data. This is giving me an error and I'm not sure what I am doing wrong...
import csv
import time
from bs4 import BeautifulSoup
import requests

source = requests.get('https://mainforum.com').text
soup = BeautifulSoup(source, 'lxml')

# get the thread href (thread_link)
for threads in soup.find_all('p', class_='small'):
    thread_name = threads.text
    thread_link = threads.a.get('href')  # there are three threads and this gets all 3 links
    print(thread_link)
The rest of the code is where I am having an issue:
# request the individual thread links
for follow_link in thread_link:
    response = requests.get(follow_link)
    # parse thread link
    soup = BeautifulSoup(response, 'lxml')
    # print data
    for p in soup.find_all('p'):
        print(p)
As to your schema error...
You're getting the schema error because you are overwriting one link over and over. Then you attempt to iterate over that link as if it were a list of links. At that point it is a string, so you just iterate through its characters (starting with 'h'), hence the error.
See here: requests.exceptions.MissingSchema: Invalid URL 'h': No schema supplied
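A minimal sketch of that fix: collect every href into a list first, then request each one. The p.small selector comes from the question, and the hrefs are assumed to be absolute URLs:

import requests
from bs4 import BeautifulSoup

source = requests.get('https://mainforum.com')
soup = BeautifulSoup(source.content, 'lxml')

# collect every thread href into a list instead of overwriting one variable
thread_links = [threads.a.get('href') for threads in soup.find_all('p', class_='small')]

for follow_link in thread_links:
    response = requests.get(follow_link)  # each item is a full URL, not a single character
    thread_soup = BeautifulSoup(response.content, 'lxml')
    for p in thread_soup.find_all('p'):
        print(p.text)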
As to the general query and how to solve something like this...
If I were to do this, the flow would go as follows:
1. Get the three hrefs (similar to what you've already done)
2. Use a function that scrapes the thread hrefs individually and returns whatever you want it to return
3. Save/append that returned information wherever you want
4. Repeat
Something like this perhaps
import csv
import time
from bs4 import BeautifulSoup
import requests

source = requests.get('https://mainforum.com')
soup = BeautifulSoup(source.content, 'lxml')

all_thread_info = []

def scrape_thread_link(href):
    response = requests.get(href)
    # parse thread link
    soup = BeautifulSoup(response.content, 'lxml')
    # return data
    return [p.text for p in soup.find_all('p')]

# get the thread href (thread_link)
for threads in soup.find_all('p', class_='small'):
    this_thread_info = {}
    this_thread_info["thread_name"] = threads.text
    this_thread_info["thread_link"] = threads.a.get('href')
    this_thread_info["thread_data"] = scrape_thread_link(this_thread_info["thread_link"])
    all_thread_info.append(this_thread_info)

print(all_thread_info)
There's quite a lot left unspecified in the original question, so I made some assumptions; hopefully you can see the gist.
Also note I prefer to use the .content of the response instead of .text.
@Darien Schettler I made some changes/adjustments to the code; would love to hear if I messed up somewhere?
all_thread_info = []

def scrape_thread_link(href):
    response = requests.get(href)
    soup = BeautifulSoup(response.content, 'lxml')
    for Thread in soup.find_all(id='discussionReplies'):
        Thread_Name = Thread.find_all('div', class_='xg_user_generated')
        for Posts in Thread_Name:
            print(Posts.text)

for threads in soup.find_all('p', class_='small'):
    thread_name = threads.text
    thread_link = threads.a.get('href')
    thread_data = scrape_thread_link(thread_link)
    all_thread_info.append(thread_data)

Collecting information by scraping

I am trying to collect names of politicians by scraping Wikipedia.
What I would need is to scrape all the parties from this page: https://it.wikipedia.org/wiki/Categoria:Politici_italiani_per_partito, and then, for each party listed there, scrape the names of all the politicians within that party.
I wrote the following code:
from bs4 import BeautifulSoup as bs
import requests

res = requests.get("https://it.wikipedia.org/wiki/Categoria:Politici_italiani_per_partito")
soup = bs(res.text, "html.parser")

array1 = {}
possible_links = soup.find_all('a')

for link in possible_links:
    url = link.get("href", "")
    if "/wiki/Provenienza" in url:  # It is incomplete, as I should scrape also links including word "Politici di/dei"
        res1 = requests.get("https://it.wikipedia.org" + url)
        print("https://it.wikipedia.org" + url)
        soup = bs(res1, "html.parser")
        possible_links1 = soup.find_all('a')
        for link in possible_links1:
            url_1 = link.get("href", "")
            array1[link.text.strip()] = url_1
but it does not work, as it does not collect the names for each party. It collects all the parties (from the Wikipedia page that I mentioned above); however, when I try to scrape the parties' pages, it does not collect the names of the politicians within each party.
I hope you can help me.
You could collect the URLs and party names from the first page and then loop over those URLs, adding the list of associated politician names to a dict whose key is the party name. You would gain efficiency by using a session object and thereby re-using the underlying TCP connection.
from bs4 import BeautifulSoup as bs
import requests

results = {}

with requests.Session() as s:  # use session object for efficiency of tcp re-use
    s.headers = {'User-Agent': 'Mozilla/5.0'}
    r = s.get('https://it.wikipedia.org/wiki/Categoria:Politici_italiani_per_partito')
    soup = bs(r.content, 'lxml')
    party_info = {i.text: 'https://it.wikipedia.org/' + i['href'] for i in soup.select('.CategoryTreeItem a')}  # dict of party names and party links

    for party, link in party_info.items():
        r = s.get(link)
        soup = bs(r.content, 'lxml')
        results[party] = [i.text for i in soup.select('.mw-content-ltr .mw-content-ltr a')]  # get politicians' names
EDIT : Please refer to QHarr's answer above.
I have only scraped all the parties so far, nothing more; I'm sharing this code and I'll edit my answer when I get all the politicians.
from bs4 import BeautifulSoup as bs
import requests

res = requests.get("https://it.wikipedia.org/wiki/Categoria:Politici_italiani_per_partito")
soup = bs(res.text, "html.parser")

url_list = []
politicians_dict = {}
possible_links = soup.find_all('a')

for link in possible_links:
    url = link.get("href", "")
    if ("/wiki/Provenienza" in url) or ("/wiki/Categoria:Politici_d" in url):
        full_url = "https://it.wikipedia.org" + url
        url_list.append(full_url)

for url in url_list:
    print(url)

Python Href scraping

I'm trying to loop over the hrefs and get the URL. I've managed to extract the href, but I need the full URL to get into the link. This is my code at the minute:
import requests
from bs4 import BeautifulSoup

webpage_response = requests.get('http://www.harness.org.au/racing/results/?activeTab=tab')
webpage = webpage_response.content
soup = BeautifulSoup(webpage, "html.parser")

# only finding one track
# soup.table to find all links for days racing
harness_table = soup.table

# scrapes a href that is an incomplete URL that I'm trying to get to
for link in soup.select(".meetingText > a"):
    link.insert(0, "http://www.harness.org.au")
    webpage = requests.get(link)
    new_soup = BeautifulSoup(webpage.content, "html.parser")
    # work through table to get links to tracks
    print(new_soup)
You can store the base URL of the website in a variable, and then once you get the href from the link you can join them to create the next URL.
import requests
from bs4 import BeautifulSoup

base_url = "http://www.harness.org.au"
webpage_response = requests.get('http://www.harness.org.au/racing/results/?activeTab=tab')
soup = BeautifulSoup(webpage_response.content, "html.parser")

# only finding one track
# soup.table to find all links for days racing
harness_table = soup.table

# scrapes a href that is an incomplete URL that I'm trying to get to
for link in soup.select(".meetingText > a"):
    webpage = requests.get(base_url + link["href"])
    new_soup = BeautifulSoup(webpage.content, "html.parser")
    # work through table to get links to tracks
    print(new_soup)
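If the hrefs are ever absolute or start with a slash, urllib.parse.urljoin is a slightly more robust way to build the full URL than plain string concatenation. A small sketch, using one of the result URLs above as the example path:

from urllib.parse import urljoin

base_url = "http://www.harness.org.au"

# urljoin copes with leading slashes, relative paths and already-absolute hrefs
full_url = urljoin(base_url, "/racing/fields/race-fields/?mc=BA040320")
print(full_url)  # http://www.harness.org.au/racing/fields/race-fields/?mc=BA040320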
Try this solution. Maybe you'll like this library.
from simplified_scrapy import SimplifiedDoc,req
url = 'http://www.harness.org.au/racing/results/?activeTab=tab'
html = req.get(url)
doc = SimplifiedDoc(html)
links = [doc.absoluteUrl(url,ele.a['href']) for ele in doc.selects('td.meetingText')]
print(links)
Result:
['http://www.harness.org.au/racing/fields/race-fields/?mc=BA040320', 'http://www.harness.org.au/racing/fields/race-fields/?mc=BH040320', 'http://www.harness.org.au/racing/fields/race-fields/?mc=RE040320']

How to scrape next page data as I do on the first page?

I have the following code:
from bs4 import BeautifulSoup
import requests
import csv

url = "https://coingecko.com/en"
base_url = "https://coingecko.com"
page = requests.get(url)
soup = BeautifulSoup(page.content, "html.parser")

names = [div.a.span.text for div in soup.find_all("div", attrs={"class": "coin-content center"})]
Link = [base_url + div.a["href"] for div in soup.find_all("div", attrs={"class": "coin-content center"})]

for link in Link:
    inner_page = requests.get(link)
    inner_soup = BeautifulSoup(inner_page.content, "html.parser")
    indent = inner_soup.find("div", attrs={"class": "py-2"})
    content = indent.div.next_siblings
    Allcontent = [sibling for sibling in content if sibling.string is not None]
    print(Allcontent)
I have successfully entered the inner pages and grabbed all the coins' information listed on the first page. But there are next pages, 1, 2, 3, 4, 5, 6, 7, 8, 9, etc. How can I go to all the next pages and do the same as before?
Further, the output of my code contains a lot of \n and spaces. How can I fix that?
You need to generate all the page URLs, request them one by one, and parse each with bs4.
from bs4 import BeautifulSoup
import requests

req = requests.get('https://www.coingecko.com/en')
soup = BeautifulSoup(req.content, 'html.parser')

last_page = soup.select('ul.pagination li:nth-of-type(8) > a:nth-of-type(1)')[0]['href']
lp = last_page.split('=')[-1]

count = 0
for i in range(int(lp)):
    count += 1
    url = 'https://www.coingecko.com/en?page=' + str(count)
    print(url)
    requests.get(url)  # requests each page one by one till last page
    ## parse your fields here using bs4
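For the "parse your fields here" step, a rough sketch that reuses the coin-content selector from the question inside the page loop; the class names are assumptions about coingecko's markup at the time:

from bs4 import BeautifulSoup
import requests

base_url = 'https://www.coingecko.com'

for page_number in range(1, 3):  # first two pages, just as an example
    page = requests.get(f'{base_url}/en?page={page_number}')
    soup = BeautifulSoup(page.content, 'html.parser')

    # same "coin-content center" selector the question uses; an assumption about the markup
    for div in soup.find_all('div', attrs={'class': 'coin-content center'}):
        name = div.a.span.text.strip()
        link = base_url + div.a['href']
        print(name, link)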
The way you have written your script makes it look messy. Try .select() to make it concise and less prone to breakage. Although I could not find any further usage of names in your script, I kept it as it is. Here is how you can get all the available links while traversing multiple pages.
from bs4 import BeautifulSoup
from urllib.parse import urljoin
import requests

url = "https://coingecko.com/en"

while True:
    page = requests.get(url)
    soup = BeautifulSoup(page.text, "lxml")
    names = [item.text for item in soup.select("span.d-lg-block")]
    for link in [urljoin(url, item["href"]) for item in soup.select(".coin-content a")]:
        inner_page = requests.get(link)
        inner_soup = BeautifulSoup(inner_page.text, "lxml")
        desc = [item.get_text(strip=True) for item in inner_soup.select(".py-2 p") if item.text]
        print(desc)
    try:
        url = urljoin(url, soup.select_one(".pagination a[rel='next']")['href'])
    except TypeError:
        break
Btw, whitespace has also been taken care of by using .get_text(strip=True).
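As a quick illustration of why that helps, a self-contained toy example (not taken from the site):

from bs4 import BeautifulSoup

snippet = BeautifulSoup("<p>\n   Bitcoin is a cryptocurrency.\n</p>", "html.parser")

print(repr(snippet.p.text))                  # '\n   Bitcoin is a cryptocurrency.\n'
print(repr(snippet.p.get_text(strip=True)))  # 'Bitcoin is a cryptocurrency.'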
