Python web scraping None value issue

I am trying to get the salary from this web page, but each time I get the same value, "None", even though I have tried different tags!
import requests
from bs4 import BeautifulSoup

link_content = requests.get("https://wuzzuf.net/jobs/p/KxrcG1SmaBZB-Facility-Administrator-Majorel-Egypt-Alexandria-Egypt?o=1&l=sp&t=sj&a=search-v3")
soup = BeautifulSoup(link_content.text, 'html.parser')
salary = soup.find("span", {"class":"css-47jx3m"})
print(salary)
output:
None

The page is generated dynamically with JavaScript, so Requests cannot see it the way you see it in the browser. Try disabling JavaScript in your browser and hard-reloading the page, and you will see a lot of information missing. However, the data does exist in the page, inside a script tag.
One way of getting that information is by slicing that script tag, to get to the information you need [EDITED to account for different encoded keys - now it should work for any job]:
import requests
from bs4 import BeautifulSoup as bs
import json
import pandas as pd
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36'
}
url = 'https://wuzzuf.net/jobs/p/KxrcG1SmaBZB-Facility-Administrator-Majorel-Egypt-Alexandria-Egypt?o=1&l=sp&t=sj&a=search-v3'
soup = bs(requests.get(url, headers=headers).text, 'html.parser')
# slice the first script tag down to the JSON assigned to Wuzzuf.initialStoreState
salary = soup.select_one('script').text.split('Wuzzuf.initialStoreState = ')[1].split('Wuzzuf.serverRenderedURL = ')[0].rsplit(';', 1)[0]
data = json.loads(salary)['entities']['job']['collection']
enc_key = [x for x in data.keys()][0]  # the encoded key differs per job, so take whatever key is present
df = pd.json_normalize(data[enc_key]['attributes']['salary'])
print(df)
Result in terminal:
min max currency period additionalDetails isPaid
0 None None None None None True
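If you only need the raw salary fields rather than a DataFrame, you can read the same dictionary directly; a small sketch continuing from the code above (for this particular job most of the fields are None, as the table shows):
salary_info = data[enc_key]['attributes']['salary']  # dict with min, max, currency, period, additionalDetails, isPaid
print(salary_info.get('min'), salary_info.get('max'), salary_info.get('currency'))
print(salary_info.get('isPaid'))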

Related

Fix for missing 'tr' class in webscraping

I'm trying to webscrape different stocks by rows, with the data scraped from https://www.slickcharts.com/sp500. I am following a tutorial that uses a similar website; however, that website uses classes for each of its rows, while mine doesn't (attached below).
This is the code I'm trying to use; however, I don't get any output whatsoever. I'm still pretty new at coding, so any feedback is welcome.
import requests
import pandas as pd
from bs4 import BeautifulSoup
company = []
symbol = []
url = 'https://www.slickcharts.com/sp500' #Data from SlickCharts
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')
rows = soup.find_all('tr')
for i in rows:
    row = i.find_all('td')
    print(row[0])
First of all, you need to add some headers to your request, because most likely you are getting the same thing I did: status code 403 Forbidden. That is because the website is blocking your request. Adding a User-Agent does the trick:
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'
}
page = requests.get(url, headers=headers)
Then you can iterate over the tr tags as you do. But you should be careful because, for example, the first tr doesn't have any td tags, so you will get an exception on the line:
print(row[0])
Here is an example of code that prints the names of all the companies:
import requests
from bs4 import BeautifulSoup
company = []
symbol = []
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
url = 'https://www.slickcharts.com/sp500' #Data from SlickCharts
page = requests.get(url, headers=headers)
soup = BeautifulSoup(page.text, 'html.parser')
rows = soup.find_all('tr')
for row in rows:
    all_td_tags = row.find_all('td')
    if len(all_td_tags) > 0:
        print(all_td_tags[1].text)
But this code also outputs some other data besides company names. That's because you are iterating over all tr tags on the page, whereas you need to iterate over one specific table only (the first table on the page in this case).
import requests
from bs4 import BeautifulSoup
company = []
symbol = []
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
url = 'https://www.slickcharts.com/sp500' #Data from SlickCharts
page = requests.get(url, headers=headers)
soup = BeautifulSoup(page.text, 'html.parser')
first_table_on_the_page = soup.find('table')
rows = first_table_on_the_page.find_all('tr')
for row in rows:
    all_td_tags = row.find_all('td')
    if len(all_td_tags) > 0:
        print(all_td_tags[1].text)
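If you also want to fill the company and symbol lists declared in the question, you can collect both in the same loop and build a DataFrame; a sketch, assuming the ticker symbol sits in the column right after the company name (verify against the live table):
import requests
import pandas as pd
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
url = 'https://www.slickcharts.com/sp500'
soup = BeautifulSoup(requests.get(url, headers=headers).text, 'html.parser')

company = []
symbol = []
for row in soup.find('table').find_all('tr'):
    tds = row.find_all('td')
    if len(tds) > 2:  # skip the header row, which has no td tags
        company.append(tds[1].text.strip())
        symbol.append(tds[2].text.strip())  # assumption: the symbol is the column after the company name

df = pd.DataFrame({'company': company, 'symbol': symbol})
print(df.head())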

BeautifulSoup organize data into dataframe table

I have been working with BeautifulSoup to try to organize some data that I am pulling from a website (HTML). I have been able to boil the data down but am getting stuck on how to:
eliminate info that isn't needed
organize the remaining data so it can be put into a pandas DataFrame
Here is the code I am working with:
import urllib.request
from bs4 import BeautifulSoup as bs
import re
import pandas as pd
import requests
headers = requests.utils.default_headers()
headers.update({
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36'
})
url = 'https://www.apartments.com/lehi-ut/1-bedrooms/'
page = requests.get(url,headers = headers)
soup = bs(page.text)
names = soup.body.findAll('tr')
function_names = re.findall('th class="\w+', str(names))
function_names = [item[10:] for item in function_names]
description = soup.body.findAll('td')
#description = re.findall('td class="\w+', str(description))
data = pd.DataFrame({'Title':function_names,'Info':description})
The error I keep getting is that the array lengths don't match up, which I know to be true, but when I uncomment the second description line it removes the numbers I want, and even then the table isn't organized properly.
What I would like the output to look like is:
(headers) title: location | studio | 1 BR | 2 BR | 3 BR
(new line) data : Lehi, UT| $1,335 |$1,309|$1,454|$1,580
That is really all that I need but I can't get BS or Pandas to do it properly.
Any help would be greatly appreciated!
Try the following approach. It first extracts all of the data in the table and then transposes it (columns swapped with rows):
import urllib.request
from bs4 import BeautifulSoup as bs
import re
import pandas as pd
import requests
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36'
}
url = 'https://www.apartments.com/lehi-ut/1-bedrooms/'
page = requests.get(url, headers=headers)
soup = bs(page.text, 'lxml')
table = soup.find("table", class_="rentTrendGrid")
rows = []
for tr in table.find_all('tr'):
    rows.append([td.text for td in tr.find_all(['th', 'td'])])
#header_row = rows[0]
rows = list(zip(*rows[1:]))  # transpose the table
df = pd.DataFrame(rows[1:], columns=rows[0])
print(df)
Giving you the following kind of output:
Studio 1 BR 2 BR 3 BR
0 0 729 1,041 1,333
1 $1,335 $1,247 $1,464 $1,738
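As an aside, the transpose step works because zip(*rows) groups the i-th element of every row together; a tiny illustration with made-up values (not the scraped data):
rows = [['label', 'a', 'b'],
        ['x', 1, 2],
        ['y', 3, 4]]
print(list(zip(*rows)))
# [('label', 'x', 'y'), ('a', 1, 3), ('b', 2, 4)]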

scrape data from a table that has a "show all" button

I am trying to scrape the "ALL EQUITIES" table at the following link, which has a "show all" button:
https://www.trading212.com/en/Trade-Equities
I should be able to get the expanded table, not just the rows shown before the table is expanded.
Here is my code:
import requests
import pandas as pd
from bs4 import BeautifulSoup as bs
header = {"User-Agent" : "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like
Gecko) Chrome/88.0.4324.150 Safari/537.36"}
url = 'https://www.trading212.com/en/Trade-Equities'
r = requests.get(url, headers = header)
soup = bs(r.content, 'html.parser')
all_equities = soup.find('table' , class_ = 'I cant find the name of the class')
print(all_equities)
The contents are actually in a div, not a table. You can grab all of the content by using the class that is on each of the divs.
all_equities = soup.find_all('div', class_='js-search-row')
will give you a list of all of the divs with the equities in them.
Give this code a try:
all_equities = soup.find_all('div', class_='d-row js-search-row js-acc-wrapper')
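If you then need the text out of those divs, something along these lines could work; a minimal sketch that assumes the rows are present in the plain HTML response and uses only the class name given above, so adjust the parsing once you inspect each row's inner markup:
import requests
from bs4 import BeautifulSoup as bs

header = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36"}
url = 'https://www.trading212.com/en/Trade-Equities'
soup = bs(requests.get(url, headers=header).content, 'html.parser')

all_equities = soup.find_all('div', class_='js-search-row')
rows = [equity.get_text(' ', strip=True) for equity in all_equities]  # one text string per equity row
print(len(rows))   # how many equities were found
print(rows[:5])    # first few rows as a sanity check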

Multiple Page BeautifulSoup Script only Pulling first value

New to screen scraping here and this is my first time posting on Stack Overflow. Apologies in advance for any formatting errors in this post. I am attempting to extract data from multiple pages with the URL:
https://www.landwatch.com/Michigan_land_for_sale/West_Central_Region/Page-' + str(page)
For instance, page 1 is:
https://www.landwatch.com/Michigan_land_for_sale/West_Central_Region/Page-1
Page 2:
https://www.landwatch.com/Michigan_land_for_sale/West_Central_Region/Page-2
and so on...
My script runs without errors. However, the CSV exported from pandas contains only one row, with the first extracted value. At the time of this posting, the first value is:
14.01 Acres   Vestaburg, Montcalm County, MI$275,000
My intent is to create a spreadsheet with hundreds of rows that pull the property description from the URLs.
Here is my code:
import requests
from requests import get
from bs4 import BeautifulSoup
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'
}
n_pages = 0
desc = []
for page in range(1,900):
    n_pages += 1
    sapo_url = 'https://www.landwatch.com/Michigan_land_for_sale/West_Central_Region/Page-' + str(page)
    r = get(sapo_url, headers=headers)
    page_html = BeautifulSoup(r.text, 'html.parser')
    house_containers = page_html.find_all('div', class_="propName")
    if house_containers != []:
        for container in house_containers:
            desc = container.getText(strip=True)
    else:
        break
print('you scraped {} pages containing {} Properties'.format(n_pages, len(desc)))
import pandas as pd
df = pd.DataFrame({'description': [desc]})
df.to_csv('test4.csv', encoding = 'utf-8')
I suspect the problem is with the line reading desc = container.getText(strip=True). I have tried changing the line but keep getting errors when running.
Any help is appreciated.
I believe the mistake is in the line:
desc = container.getText(strip=True)
Every time it loops, the value in desc is replaced, not added on. To add items into the list, do:
desc.append(container.getText(strip=True))
Also, since it is already a list, you can remove the brackets from the DataFrame creation like so:
df = pd.DataFrame({'description': desc})
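Putting both changes together with the rest of the question's code, the loop would look roughly like this (a sketch, not a tested end-to-end rewrite):
import pandas as pd
import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'}
desc = []
for page in range(1, 900):
    sapo_url = 'https://www.landwatch.com/Michigan_land_for_sale/West_Central_Region/Page-' + str(page)
    r = requests.get(sapo_url, headers=headers)
    page_html = BeautifulSoup(r.text, 'html.parser')
    house_containers = page_html.find_all('div', class_="propName")
    if not house_containers:  # stop at the first page with no listings
        break
    for container in house_containers:
        desc.append(container.getText(strip=True))  # append instead of overwriting

df = pd.DataFrame({'description': desc})
df.to_csv('test4.csv', encoding='utf-8')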
The cause is that data is not accumulated inside the loop, so only the last value is saved. For testing purposes, this version only goes up to page 2, so change the range back when you use it:
import requests
from requests import get
from bs4 import BeautifulSoup
import pandas as pd
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'
}
n_pages = 0
desc = []
all_data = pd.DataFrame(index=[], columns=['description'])
for page in range(1,3):
    n_pages += 1
    sapo_url = 'https://www.landwatch.com/Michigan_land_for_sale/West_Central_Region/Page-' + str(page)
    r = get(sapo_url, headers=headers)
    page_html = BeautifulSoup(r.text, 'html.parser')
    house_containers = page_html.find_all('div', class_="propName")
    if house_containers != []:
        for container in house_containers:
            desc = container.getText(strip=True)
            df = pd.DataFrame({'description': [desc]})
            all_data = pd.concat([all_data, df], ignore_index=True)
    else:
        break
all_data.to_csv('test4.csv', encoding = 'utf-8')
print('you scraped {} pages containing {} Properties'.format(n_pages, len(desc)))

Only scrape a portion of the page

I am using Python/requests to gather data from a website. Ideally I only want the latest 'banking' information, which is always at the top of the page.
The code I have currently does that, but then it attempts to keep going and hits an index out of range error. I am not very good with aspx pages, but is it possible to only gather the data under the 'banking' heading?
Here's what I have so far:
import requests
from bs4 import BeautifulSoup
headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36'}
print('Scraping South Dakota Banking Activity Actions...')
url2 = 'https://dlr.sd.gov/banking/monthly_activity_reports/monthly_activity_reports.aspx'
r2 = requests.get(url2, headers=headers)
soup = BeautifulSoup(r2.text, 'html.parser')
mylist5 = []
for tr in soup.find_all('tr')[2:]:
    tds = tr.find_all('td')
    print(tds[0].text, tds[1].text)
Ideally I'd be able to slice the information as well so I can only show the activity or approval status, etc.
With bs4 4.7.1+ you can use :contains to isolate the latest month by filtering out the rows of the older months that appear later in the page. I explain the principle of filtering out later general siblings using :not in this SO answer. In short, find the row containing "August 2019" (this month is determined dynamically) and grab it and all its siblings, then find the row containing "July 2019" and all its general siblings, and remove the latter set from the former.
import requests, re
from bs4 import BeautifulSoup as bs
import pandas as pd
r = requests.get('https://dlr.sd.gov/banking/monthly_activity_reports/monthly_activity_reports.aspx')
soup = bs(r.content, 'lxml')
months = [i.text for i in soup.select('[colspan="2"]:has(a)')][0::2]
latest_month = months[0]
next_month = months[1]
rows_of_interest = soup.select(f'tr:contains("{latest_month}"), tr:contains("{latest_month}") ~ tr:not(:contains("{next_month}"), :contains("{next_month}") ~ tr)')
results = []
for row in rows_of_interest:
data = [re.sub('\xa0|\s{2,}',' ',td.text) for td in row.select('td')]
if len(data) == 1:
data.extend([''])
results.append(data)
df = pd.DataFrame(results)
print(df)
Start with the same request as before:
import requests
from bs4 import BeautifulSoup, Tag
headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36'}
url = 'https://dlr.sd.gov/banking/monthly_activity_reports/monthly_activity_reports.aspx'
print('Scraping South Dakota Banking Activity Actions...')
r = requests.get(url, headers=headers)
soup = BeautifulSoup(r.text, 'html.parser')
Inspecting the page source, we can find the id of the element you need (the table of values).
banking = soup.find(id='secondarycontent')
After this, we filter out the elements of the soup that aren't tags (like NavigableString and others). You can see how to get the texts too (for other options, check the Tag docs).
blocks = [b for b in banking.table.contents if type(b) is Tag] # filter out NavigableString
texts = [b.text for b in blocks]
Now, if that is what you mean by "latest", we must determine which month is the latest and which is the month before it.
current_month_idx, last_month_idx = None, None
current_month, last_month = 'August 2019', 'July 2019' # can parse with datetime too
for i, b in enumerate(blocks):
    if current_month in b.text:
        current_month_idx = i
    elif last_month in b.text:
        last_month_idx = i
    if all(idx is not None for idx in (current_month_idx, last_month_idx)):
        break  # break when both indices are set
assert current_month_idx < last_month_idx
curr_month_blocks = [b for i, b in enumerate(blocks) if current_month_idx < i < last_month_idx]
curr_month_texts = [b.text for b in curr_month_blocks]
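If you would rather not hard-code 'August 2019' and 'July 2019', one option (as the comment in the code hints) is to parse the month headings with datetime; a sketch that continues from blocks above and assumes the heading rows contain just a 'Month YYYY' string:
from datetime import datetime

month_rows = []
for i, b in enumerate(blocks):
    try:
        month_rows.append((datetime.strptime(b.text.strip(), '%B %Y'), i))  # e.g. 'August 2019'
    except ValueError:
        pass  # not a month heading row

month_rows.sort(reverse=True)  # newest month first
(_, current_month_idx), (_, last_month_idx) = month_rows[:2]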
