I want to scrape the job title, location, and job description from Indeed (first page only) using regex and store the results in a data frame. Here is the link: https://www.indeed.com/jobs?q=data+scientist&l=California
I have already completed the task using BeautifulSoup, and it worked fine:
from urllib.request import urlopen
from bs4 import BeautifulSoup as BS
import pandas as pd
url = 'https://www.indeed.com/jobs?q=data+scientist&l=California'
htmlfile = urlopen(url)
soup = BS(htmlfile,'html.parser')
companies = []
locations = []
summaries = []
company = soup.findAll('span', attrs={'class':'company'})
for c in company:
    companies.append(c.text.replace("\n", ""))
location = soup.findAll(class_='location accessible-contrast-color-location')
for l in location:
    locations.append(l.text)
summary = soup.findAll('div', attrs={'class': 'summary'})
for s in summary:
    summaries.append(s.text.replace("\n", ""))
jobs_df = pd.DataFrame({'Company':companies, 'Location':locations, 'Summary':summaries})
jobs_df
Result from BS:
Company Location Summary
0 Cisco Careers San Jose, CA Work on massive structured, unstru...
1 AllyO Palo Alto, CA Extensive knowledge of scientific ...
2 Driven Brands Benicia, CA 94510 Develop scalable statistical, mach...
3 eBay Inc. San Jose, CA These problems require deep analys...
4 Disney Streaming Services San Francisco, CA Deep knowledge of machine learning...
5 Trimark Associates, Inc. Sacramento, CA The primary focus is in applying d...
But when I tried to use the same tags with regex, it failed.
import urllib.request, urllib.parse, urllib.error
import re
import pandas as pd
url = 'https://www.indeed.com/jobs?q=data+scientist&l=California'
text = urllib.request.urlopen(url).read().decode()
companies = []
locations = []
summaries = []
company = re.findall('<span class="company">(.*?)</span>', text)
for c in company:
    companies.append(str(c))
location = re.findall('<div class="location accessible-contrast-color-location">(.*?)</div>', text)
for l in location:
    locations.append(str(l))
summary = re.findall('<div class="summary">(.*?)</div>', text)
for s in summary:
    summaries.append(str(s))
print(companies)
print(locations)
print(summaries)
There was an error saying the lengths of the lists don't match, so I checked the individual lists. It turned out the contents could not be fetched. This is what I got from the code above:
[]
['Palo Alto, CA', 'Sunnyvale, CA', 'San Francisco, CA', 'South San Francisco, CA 94080', 'Pleasanton, CA 94566', 'Aliso Viejo, CA', 'Sacramento, CA', 'Benicia, CA 94510', 'San Bruno, CA']
[]
What did I do wrong?
. matches any character except newline, and the HTML source contains newlines. So you need to pass re.DOTALL as the flags option to re.findall, like below:
company = re.findall('<span class="company">(.*?)</span>', text, flags=re.DOTALL)
From the above code you will not get the names alone; instead you will get all the descendants of the span element you are selecting. So you need a second regex that captures only the part you want:
for c in company:
    # selecting only the company name, discarding everything in the anchor tag.
    name = re.findall('<a.*>(.*)</a>', c, flags=re.DOTALL)
    for n in name:
        # doing a little cleanup by removing the newlines and spaces.
        companies.append(str(n.strip()))
print(companies)
Output:
['Driven Brands', 'Southern California Edison', 'Paypal', "Children's Hospital Los Angeles", 'Cisco Careers', 'University of California, Santa Cruz', 'Beyond Limits', 'Shutterfly', 'Walmart', 'Trimark Associates, Inc.']
For location and summary there are no further HTML tags; only the text is present. So re.DOTALL plus stripping the text will do the job; there is no need for a second for loop or a second findall.
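Putting that together, here is a minimal sketch for location and summary, reusing the text variable from the question and assuming the three lists come out the same length (Indeed's class names may well have changed since this was written):
# Assumes `re`, `pd`, and `text` from the question's code above.
locations = [l.strip() for l in re.findall(
    '<div class="location accessible-contrast-color-location">(.*?)</div>',
    text, flags=re.DOTALL)]
summaries = [s.strip() for s in re.findall(
    '<div class="summary">(.*?)</div>', text, flags=re.DOTALL)]
# Build the data frame the question asked for (assumes equal-length lists).
jobs_df = pd.DataFrame({'Company': companies, 'Location': locations, 'Summary': summaries})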
. matches any character except line terminators. The content you are trying to capture spans new lines (\n), so you need to match anything, including line terminators. You'll want to do:
company = re.findall('<span class="company">(.*?)</span>', text, re.DOTALL)
But this will also require a little cleanup after.
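One way to do that cleanup (a sketch, not tested against the live page, since the exact inner markup may differ) is to strip any nested tags and surrounding whitespace with re.sub:
# Remove any inner HTML tags, then trim whitespace from each match.
companies = [re.sub('<[^>]+>', '', c).strip() for c in company]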
I am learning Python and BeautifulSoup, and I am trying to do some web scraping.
Let me first describe what I am trying to do.
Here is the wiki page: https://en.m.wikipedia.org/wiki/List_of_largest_banks
I am trying to print out the
<span class="mw-headline" id="By_market_capitalization" tabindex="0" role="button" aria-controls="content-collapsible-block-1" aria-expanded="true">By market capitalization</span>
I want to print out the text: By market capitalization
Then the text of the table of the banks. Example:
By market capitalization
Rank  Bank           Cap Rate
1     JP Morgan      466.1
2     Bank of China  300
...all the way to 50
My code starts out like this:
from bs4 import BeautifulSoup
import requests
html_text = requests.get('https://en.wikipedia.org/wiki/List_of_largest_banks').text
soup = BeautifulSoup(html_text, 'lxml')
# text = soup.find('span', class_='mw-headline', id='By_market_capitalization').text
Ak_soup = soup.find_all('section', class_='mf-section-2 collapsible-block open-block', id='content-collapsible-block-1')
print(Ak_soup)
I believe my problem is more on the HTML side of things, but I am completely lost.
I inspected the element, and I believe the tags to look for are:
section class_='mf-section-2 collapsible-block open-block'
Close to your goal - find the heading and then its next table, and transform it via pandas.read_html() into a dataframe.
header = soup.select_one('h2:has(>#By_market_capitalization)')
pd.read_html(str(header.find_next('table')))[0]
or
header = soup.select_one('h2:has(>#By_market_capitalization)')
pd.read_html(html_text, match='Market cap')[0]
Example
from bs4 import BeautifulSoup
import requests
import pandas as pd
html_text = requests.get('https://en.wikipedia.org/wiki/List_of_largest_banks').text
soup = BeautifulSoup(html_text, 'lxml')
header = soup.select_one('h2:has(>#By_market_capitalization)')
print(header.span.text)
print(pd.read_html(str(header.find_next('table')))[0].to_markdown(index=False))
Output
By market capitalization
| Rank | Bank name                               | Market cap(US$ billion) |
|-----:|:----------------------------------------|------------------------:|
|    1 | JPMorgan Chase                          | 466.21[5]               |
|    2 | Industrial and Commercial Bank of China | 295.65                  |
|    3 | Bank of America                         | 279.73                  |
|    4 | Wells Fargo                             | 214.34                  |
|    5 | China Construction Bank                 | 207.98                  |
|    6 | Agricultural Bank of China              | 181.49                  |
|    7 | HSBC Holdings PLC                       | 169.47                  |
|    8 | Citigroup Inc.                          | 163.58                  |
|    9 | Bank of China                           | 151.15                  |
|   10 | China Merchants Bank                    | 133.37                  |
|   11 | Royal Bank of Canada                    | 113.80                  |
|   12 | Toronto-Dominion Bank                   | 106.61                  |
...
Since you already know the desired header, you can just print it directly. Then, with pandas, you can use a unique search term from the target table as a more direct select method:
import pandas as pd
df = pd.read_html('https://en.m.wikipedia.org/wiki/List_of_largest_banks', match = 'Market cap')[0].reset_index(level = 0, drop = True)
print('By market capitalization')
print()
print(df.to_markdown(index = False))
import requests
from bs4 import BeautifulSoup

URL = "https://www.monster.com/jobs/search/?q=Software-Developer&where=Australia"
page = requests.get(URL)
soup = BeautifulSoup(page.content, "html.parser")
weak = soup.find(id="SearchResults")
jobname = weak.find_all(class_="summary")
jobnamelists = []
companyname = weak.find_all(class_="company")
companynamelists = []
locations = weak.find_all(class_="location")
locationlist = []
for job in jobname:
    jobnamelists.append(job.find(class_="title").get_text())
for company in companyname:
    companynamelists.append(company.find(class_="name").get_text())
for location in locations:
    locationlist.append(location.find(class_="name").get_text())
This is the code. In the end it gives me three separate lists scraped from the web. Now I want them printed together, so that the first job is printed with the first company and the first location, one by one. Can anyone help me with that?
As stated in the comments, use the zip() function to iterate over the three lists together. For example:
import requests
from bs4 import BeautifulSoup
URL = "https://www.monster.com/jobs/search/?q=Software-Developer&where=Australia"
page = requests.get(URL)
soup = BeautifulSoup(page.content, "html.parser")
for j, c, l in zip(soup.select('#SearchResults .summary .title'),
                   soup.select('#SearchResults .company .name'),
                   soup.select('#SearchResults .location .name')):
    print(j.get_text(strip=True))
    print(c.get_text(strip=True))
    print(l.get_text(strip=True))
    print('-' * 80)
Prints:
Resident Engineer (Software) Cyber Security - Sydney
Varmour
Sydney, NSW
--------------------------------------------------------------------------------
Senior/Lead Software Engineer, Browser
Magic Leap, Inc.
Sunnyvale, CA; Plantation, FL (HQ); Austin, TX; Culver New York City, CA; Seattle, WA; Toronto, NY
--------------------------------------------------------------------------------
Service Consultant REST
TAL
Sydney, NSW
--------------------------------------------------------------------------------
...and so on.
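If you also want the triples numbered (the "enumerated way" mentioned in the question), one option is to wrap the zip() in enumerate(); a sketch reusing the same selectors:
for i, (j, c, l) in enumerate(zip(soup.select('#SearchResults .summary .title'),
                                  soup.select('#SearchResults .company .name'),
                                  soup.select('#SearchResults .location .name')),
                              start=1):
    # Print a numbered "job | company | location" line per result.
    print(i, j.get_text(strip=True), c.get_text(strip=True), l.get_text(strip=True))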
I am trying to print all the titles on nytimes.com. I used the requests and BeautifulSoup modules, but I got empty brackets in the end; the returned result is []. How can I fix this problem?
import requests
from bs4 import BeautifulSoup
url = "https://www.nytimes.com/"
r = requests.get(url)
text = r.text
soup = BeautifulSoup(text, "html.parser")
title = soup.find_all("span", "balanceHeadline")
print(title)
I am assuming that you are trying to retrieve the headlines of nytimes. Doing title = soup.find_all("span", {'class':'balanceHeadline'}) will not get you your results. The <span> tag you find in the browser inspector is often misleading. What you have to do is look into the source code of the page and find the tags actually wrapped around the title.
For nytimes it's a little tricky, because the headlines are wrapped in a <script> tag with a lot of junk inside. Hence what you can do is "clean" it first and deserialize the string by converting it into a Python dictionary object.
import requests
from bs4 import BeautifulSoup
import json
url = "https://www.nytimes.com/"
r = requests.get(url)
r_html = r.text
soup = BeautifulSoup(r_html, "html.parser")
scripts = soup.find_all('script')
for script in scripts:
    if 'preloadedData' in script.text:
        jsonStr = script.text
        jsonStr = jsonStr.split('=', 1)[1].strip()  # remove "window.__preloadedData = "
        jsonStr = jsonStr.rsplit(';', 1)[0]  # remove trailing ;
        jsonStr = json.loads(jsonStr)
        for key, value in jsonStr['initialState'].items():
            try:
                if value['promotionalHeadline'] != "":
                    print(value['promotionalHeadline'])
            except:
                continue
outputs
Jeffrey Epstein Autopsy Results Conclude He Hanged Himself
Trump and Netanyahu Put Bipartisan Support for Israel at Risk
Congresswoman Rejects Israel’s Offer of a West Bank Visit
In Tlaib’s Ancestral Village, a Grandmother Weathers a Global Political Storm
Cathay Chief’s Resignation Shows China’s Power Over Hong Kong Unrest
Trump Administration Approves Fighter Jet Sales to Taiwan
Peace Road Map for Afghanistan Will Let Taliban Negotiate Women’s Rights
Debate Flares Over Afghanistan as Trump Considers Troop Withdrawal
In El Paso, Hundreds Show Up to Mourn a Woman They Didn’t Know
Is Slavery’s Legacy in the Power Dynamics of Sports?
Listen: ‘Modern Love’ Podcast
‘The Interpreter’
If You Think Trump Is Helping Israel, You’re a Fool
First They Came for the Black Feminists
How Women Can Escape the Likability Trap
With Trump as President, the World Is Spiraling Into Chaos
To Understand Hong Kong, Don’t Think About Tiananmen
The Abrupt End of My Big-Girl Summer
From Trump Boom to Trump Gloom
What Are Trump and Netanyahu Afraid Of?
King Bibi Bows Before a Tweet
Ebola Could Be Eradicated — But Only if the World Works Together
The Online Mob Came for Me. What Happened to the Reckoning?
A German TV Star Takes On Bullies
Why Is Hollywood So Scared of Climate Change?
Solving Medical Mysteries With Your Help: Now on Netflix
Replace
title = soup.find_all("span", "balanceHeadline")
with
title = soup.find_all("span", {'class':'balanceHeadline'})
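Equivalently, if you prefer CSS selectors, the following should return the same tags (assuming that class is actually present in the HTML that requests receives, which for nytimes it often is not, as the other answer explains):
# CSS-selector equivalent of the class-based find_all.
title = soup.select('span.balanceHeadline')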
I am using Beautiful Soup in Python to scrape some data from a property listings site.
I have had success in scraping the individual elements that I require but wish to use a more efficient script to pull back all the data in one command if possible.
The difficulty is that the various elements I require reside in different classes.
I have tried the following, so far.
for listing in content.findAll('h2', attrs={"class": "listing-results-attr"}):
    print(listing.text)
which successfully gives the following list
15 room mansion for sale
3 bed barn conversion for sale
2 room duplex for sale
1 bed garden shed for sale
Separately, to retrieve the address details for each listing I have used the following successfully:
for address in content.findAll('a', attrs={"class": "listing-results-address"}):
    print(address.text)
which gives this
22 Acacia Avenue, CityName Postcode
100 Sleepy Hollow, CityName Postcode
742 Evergreen Terrace, CityName Postcode
31 Spooner Street, CityName Postcode
And for property price I have used this...
for prop_price in content.findAll('a', attrs={"class": "listing-results-price"}):
    print(prop_price.text)
which gives...
$350,000
$1,250,000
$750,000
$100,000
This is great; however, I need to pull back all of this information in a more efficient and performant way, so that all the data comes back in one pass.
At present I can do this using something like the code below:
all = content.select("h2.listing-results-attr, a.listing-results-address, a.listing-results-price")
This works somewhat, but it brings back too many additional HTML tags and is just not as elegant or sophisticated as I require. Results as follows.
</a>, <h2 class="listing-results-attr">
15 room mansion for sale
</h2>, <a class="listing-results-address" href="redacted">22 Acacia Avenue, CityName Postcode</a>, <a class="listing-results-price" href="redacted">
$350,000
Expected results should look something like this:
15 room mansion for sale
22 Acacia Avenue, CityName Postcode
$350,000
3 bed barn conversion for sale
100 Sleepy Hollow, CityName Postcode
$1,250,000
etc
etc
I then need to be able to store the results as JSON objects for later analysis.
Thanks in advance.
Change your selectors as shown below:
import requests
from bs4 import BeautifulSoup as bs
url = 'https://www.zoopla.co.uk/for-sale/property/caerphilly/?q=Caerphilly&results_sort=newest_listings&search_source=home'
r = requests.get(url)
soup = bs(r.content, 'lxml')
details = ([item.text.strip() for item in soup.select(".listing-results-attr a, .listing-results-address , .text-price")])
You can view separately with, for example,
prices = details[0::3]
descriptions = details[1::3]
addresses = details[2::3]
print(prices, descriptions, addresses)
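Since you mentioned storing the results as JSON objects for later analysis, here is a minimal sketch building on the slices above (the record keys and filename are just examples, not anything the site dictates):
import json

# Pair up the three slices into one dict per listing.
records = [{'price': p, 'description': d, 'address': a}
           for p, d, a in zip(prices, descriptions, addresses)]
with open('listings.json', 'w') as f:
    json.dump(records, f, indent=2)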
The find_all() function always returns a list; strip() removes whitespace from the beginning and end of a string.
import requests
from bs4 import BeautifulSoup as bs
url = 'https://www.zoopla.co.uk/for-sale/property/caerphilly/?q=Caerphilly&results_sort=newest_listings&search_source=home'
r = requests.get(url)
soup = bs(r.content, 'lxml')
results = soup.find("ul",{'class':"listing-results clearfix js-gtm-list"})
for li in results.find_all("li", {'class': "srp clearfix"}):
    price = li.find("a", {"class": "listing-results-price text-price"}).text.strip()
    address = li.find("a", {'class': "listing-results-address"}).text.strip()
    description = li.find("h2", {'class': "listing-results-attr"}).find('a').text.strip()
    print(description)
    print(address)
    print(price)
O/P:
2 bed detached bungalow for sale
Bronrhiw Fach, Caerphilly CF83
£159,950
2 bed semi-detached house for sale
Cwrt Nant Y Felin, Caerphilly CF83
£159,950
3 bed semi-detached house for sale
Pen-Y-Bryn, Caerphilly CF83
£102,950
.....
I am having trouble appending data into a list as I iterate through the following:
import urllib
import urllib.request
from bs4 import BeautifulSoup
import pandas
def make_soup(url):
    thepage = urllib.request.urlopen(url)
    thepage.addheaders = [('User-Agent', 'Mozilla/5.0')]
    soupdata = BeautifulSoup(thepage, 'html.parser')
    return soupdata

soup = make_soup('https://www.wellstar.org/locations/pages/default.aspx')
locationdata = []
for table in soup.findAll('table', class_='s4-wpTopTable'):
    for name in table.findAll('div', 'PurpleBackgroundHeading'):
        name = name.get_text(strip=True)
    for loc_type in table.findAll('h3', class_='WebFont SpotBodyGreen'):
        loc_type = loc_type.get_text()
    for address in table.findAll('div', class_=['WS_Location_Address', 'WS_Location_Adddress']):
        address = address.get_text(strip=True, separator=' ')
        locationdata.append([name, loc_type, address])

df = pandas.DataFrame(columns=['name', 'loc_type', 'address'], data=locationdata)
print(df)
The resulting dataframe includes all the unique addresses, but only the last possible text for the name.
For example, even though 'WellStar Windy Hill Hospital' is the last hospital within the hospital category/type, it appears as the name for all hospitals. If possible, I'd prefer a list.append solution, as I have several more, similar steps to go through to finalize this project.
This is occurring because you're looping through all the names and loc_types before you're appending to locationdata.
You can instead do:
import itertools as it
from pprint import pprint as pp

for table in soup.findAll('table', class_='s4-wpTopTable'):
    names = [name.get_text(strip=True) for
             name in table.findAll('div', 'PurpleBackgroundHeading')]
    loc_types = [loc_type.get_text() for
                 loc_type in table.findAll('h3', class_='WebFont SpotBodyGreen')]
    addresses = [address.get_text(strip=True, separator=' ') for
                 address in table.findAll('div', class_=['WS_Location_Address',
                                                         'WS_Location_Adddress'])]
    # zip_longest pads the shorter lists with None (izip_longest is Python 2 only).
    for name, loc_type, address in it.zip_longest(names, loc_types, addresses):
        locationdata.append([name, loc_type, address])
Result:
>>> pp(locationdata)
[['WellStar Urgent Care in Acworth',
  'WellStar Urgent Care Centers',
  '4550 Cobb Parkway NW Suite 101 Acworth, GA 30101 770-917-8140'],
 ['WellStar Urgent Care in Kennesaw',
  None,
  '3805 Cherokee Street Kennesaw, GA 30144 770-426-5665'],
 ['WellStar Urgent Care in Marietta - Delk Road',
  None,
  '2890 Delk Road Marietta, GA 30067 770-955-8620'],
 ['WellStar Urgent Care in Marietta - East Cobb',
  None,
  '3747 Roswell Road Ne Suite 107 Marietta, GA 30062 470-956-0150'],
 ['WellStar Urgent Care in Marietta - Kennestone',
  None,
  '818 Church Street Suite 100 Marietta, GA 30060 770-590-4190'],
 ['WellStar Urgent Care in Marietta - Sandy Plains Road',
  None,
  '3600 Sandy Plains Road Marietta, GA 30066 770-977-4547'],
 ['WellStar Urgent Care in Smyrna',
  None,
  '4480 North Cooper Lake Road SE Suite 100 Smryna, GA 30082 770-333-1300'],
 ['WellStar Urgent Care in Woodstock',
  None,
  '120 Stonebridge Parkway Suite 310 Woodstock, GA 30189 678-494-2500']]
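Note that zip_longest() pads the shorter lists with None, which is why loc_type is None for most rows above. If you then build your dataframe and want the category carried down to every row in its table, pandas can forward-fill it; a sketch assuming locationdata as built above:
# Forward-fill the category so every row inherits the last seen loc_type.
df = pandas.DataFrame(locationdata, columns=['name', 'loc_type', 'address'])
df['loc_type'] = df['loc_type'].ffill()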