BeautifulSoup organize data into dataframe table - python

I have been working with BeautifulSoup to try to organize some data that I am pulling from a website (HTML). I have been able to boil the data down, but am getting stuck on how to:
eliminate the info I don't need
organize the remaining data to be put into a pandas DataFrame
Here is the code I am working with:
import urllib.request
from bs4 import BeautifulSoup as bs
import re
import pandas as pd
import requests

headers = requests.utils.default_headers()
headers.update({
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36'
})

url = 'https://www.apartments.com/lehi-ut/1-bedrooms/'
page = requests.get(url, headers=headers)
soup = bs(page.text)
names = soup.body.findAll('tr')
function_names = re.findall(r'th class="\w+', str(names))
function_names = [item[10:] for item in function_names]
description = soup.body.findAll('td')
#description = re.findall(r'td class="\w+', str(description))
data = pd.DataFrame({'Title': function_names, 'Info': description})
The error I have been getting is that the array lengths don't match up, which I know to be true, but when I uncomment the second description line it removes the numbers I want, and even then the table isn't organized properly.
What I would like the output to look like is:
(header row) location | studio | 1 BR   | 2 BR   | 3 BR
(data row)   Lehi, UT | $1,335 | $1,309 | $1,454 | $1,580
That is really all that I need, but I can't get BeautifulSoup or pandas to do it properly.
Any help would be greatly appreciated!

Try the following approach. It first extracts all of the data in the table and then transposes it (columns swapped with rows):
from bs4 import BeautifulSoup as bs
import pandas as pd
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36'
}
url = 'https://www.apartments.com/lehi-ut/1-bedrooms/'
page = requests.get(url, headers=headers)
soup = bs(page.text, 'lxml')

# Locate the rent-trend table and pull the text out of every cell
table = soup.find("table", class_="rentTrendGrid")
rows = []
for tr in table.find_all('tr'):
    rows.append([td.text for td in tr.find_all(['th', 'td'])])

rows = list(zip(*rows[1:]))  # transpose the table (skipping the original header row)
df = pd.DataFrame(rows[1:], columns=rows[0])
print(df)
Giving you the following kind of output:
   Studio    1 BR    2 BR    3 BR
0       0     729   1,041   1,333
1  $1,335  $1,247  $1,464  $1,738
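As an aside, pandas can often parse simple static HTML tables by itself via pd.read_html. A minimal sketch, assuming the table keeps its rentTrendGrid class and that lxml or html5lib is installed:
import pandas as pd
import requests

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36'}
url = 'https://www.apartments.com/lehi-ut/1-bedrooms/'
page = requests.get(url, headers=headers)

# read_html returns a list of DataFrames, one per <table> that matches attrs
tables = pd.read_html(page.text, attrs={'class': 'rentTrendGrid'})
print(tables[0].T)  # transpose so the bedroom types become columns
The transposed frame may still need light reshaping (e.g. promoting the first row to the header), but it avoids the manual cell-by-cell extraction.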

Related

Can't get all results in TripAdvisor using Python and BeautifulSoup due to pagination

I am trying to get links to restaurants, but I can only get the first 30 and not all the others.
There are hundreds of restaurants in the Madrid area, but the pagination only shows 30 per page and the following code only gets those 30:
import re
import requests
from openpyxl import Workbook
from bs4 import BeautifulSoup as b

city_name = 'Madrid'
geo_code = '187514'
headers = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
}
data = requests.get(
    "https://www.tripadvisor.com//Restaurants-g{}-{}.html".format(geo_code, city_name), headers=headers
).text
for link in re.findall(r'"detailPageUrl":"(.*?)"', data):
    print("https://www.tripadvisor.com.sg/" + link)
    next_link = "https://www.tripadvisor.com.sg/" + link
    f.write('%s\n' % next_link)  # f: an output file opened elsewhere in the script
Found the solution: I had to add "oa" plus the result offset to the URL, like:
"https://www.tripadvisor.com//Restaurants-g{}-{}-{}.html".format(geo_code, city_name, n_review), headers=headers

Python web scraping None value issue

I am trying to get the salary from this web page, but each time I get the same value, None, even though I have tried different tags.
import requests
from bs4 import BeautifulSoup

link_content = requests.get("https://wuzzuf.net/jobs/p/KxrcG1SmaBZB-Facility-Administrator-Majorel-Egypt-Alexandria-Egypt?o=1&l=sp&t=sj&a=search-v3")
soup = BeautifulSoup(link_content.text, 'html.parser')
salary = soup.find("span", {"class": "css-47jx3m"})
print(salary)
output:
None
The page is generated dynamically with JavaScript, so requests cannot see it as you see it. Try disabling JavaScript in your browser and hard-reloading the page, and you will see a lot of the information missing. However, the data does exist in the page, inside a script tag.
One way of getting that information is by slicing that script tag to get to the information you need [EDITED to account for different encoded keys - now it should work for any job]:
import requests
from bs4 import BeautifulSoup as bs
import json
import pandas as pd

pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36'
}
url = 'https://wuzzuf.net/jobs/p/KxrcG1SmaBZB-Facility-Administrator-Majorel-Egypt-Alexandria-Egypt?o=1&l=sp&t=sj&a=search-v3'
soup = bs(requests.get(url, headers=headers).text, 'html.parser')

# Slice the inline state object out of the first script tag
salary = soup.select_one('script').text.split('Wuzzuf.initialStoreState = ')[1].split('Wuzzuf.serverRenderedURL = ')[0].rsplit(';', 1)[0]
data = json.loads(salary)['entities']['job']['collection']
enc_key = list(data.keys())[0]  # the job key is encoded, so grab it dynamically
df = pd.json_normalize(data[enc_key]['attributes']['salary'])
print(df)
Result in terminal:
    min   max currency period additionalDetails isPaid
0  None  None     None   None              None   True
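As a variation, the same blob can be pulled out with a regular expression instead of chained splits. A sketch, assuming the Wuzzuf.initialStoreState and Wuzzuf.serverRenderedURL markers stay exactly as shown above:
import re
import json
import requests

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36'}
url = 'https://wuzzuf.net/jobs/p/KxrcG1SmaBZB-Facility-Administrator-Majorel-Egypt-Alexandria-Egypt?o=1&l=sp&t=sj&a=search-v3'
html = requests.get(url, headers=headers).text

# Capture everything between the two markers, trimming the trailing ';'
m = re.search(r'Wuzzuf\.initialStoreState\s*=\s*(.*?);\s*Wuzzuf\.serverRenderedURL', html, re.S)
state = json.loads(m.group(1))
job = state['entities']['job']['collection']
enc_key = next(iter(job))  # the job key is encoded, so take whatever is there
print(job[enc_key]['attributes']['salary'])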

Fix for missing 'tr' class in web scraping

I'm trying to scrape different stocks by rows, with the data coming from https://www.slickcharts.com/sp500. I am following a tutorial that uses a similar website; however, that website uses classes for each of its rows, while mine doesn't.
This is the code I'm trying to use, but I don't get any output whatsoever. I'm still pretty new at coding, so any feedback is welcome.
import requests
import pandas as pd
from bs4 import BeautifulSoup

company = []
symbol = []
url = 'https://www.slickcharts.com/sp500'  # Data from SlickCharts
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')
rows = soup.find_all('tr')
for i in rows:
    row = i.find_all('td')
    print(row[0])
First of all, you need to add some headers to your request, because most likely you'll get the same thing I did: status code 403 Forbidden. That's the website blocking your request. Adding a User-Agent does the trick:
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'
}
page = requests.get(url, headers=headers)
Then you can iterate over tr tags as you do. But you should be careful: the first tr, for example, doesn't contain any td tags, so you will get an exception on the line:
print(row[0])
Here is an example of code that prints the names of all companies:
import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
url = 'https://www.slickcharts.com/sp500'  # Data from SlickCharts
page = requests.get(url, headers=headers)
soup = BeautifulSoup(page.text, 'html.parser')
rows = soup.find_all('tr')
for row in rows:
    all_td_tags = row.find_all('td')
    if len(all_td_tags) > 0:  # skip rows (e.g. header rows) that contain no td tags
        print(all_td_tags[1].text)
But this code also outputs some other data besides company names. That's because you are iterating over all tr tags on the page, when you need to iterate over a specific table only (the first table on the page, in this case).
import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
url = 'https://www.slickcharts.com/sp500'  # Data from SlickCharts
page = requests.get(url, headers=headers)
soup = BeautifulSoup(page.text, 'html.parser')
first_table_on_the_page = soup.find('table')
rows = first_table_on_the_page.find_all('tr')
for row in rows:
    all_td_tags = row.find_all('td')
    if len(all_td_tags) > 0:
        print(all_td_tags[1].text)
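Since the question also sets up company and symbol lists, here is a sketch that fills both and builds a DataFrame; that the ticker symbol sits in the column right after the company name is an assumption about the current SlickCharts layout and may need adjusting:
import requests
import pandas as pd
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
page = requests.get('https://www.slickcharts.com/sp500', headers=headers)
soup = BeautifulSoup(page.text, 'html.parser')

company = []
symbol = []
for row in soup.find('table').find_all('tr'):
    all_td_tags = row.find_all('td')
    if len(all_td_tags) > 2:
        company.append(all_td_tags[1].text.strip())
        symbol.append(all_td_tags[2].text.strip())  # assumed: symbol follows the name

df = pd.DataFrame({'Company': company, 'Symbol': symbol})
print(df.head())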

Printing Text Scraped Using BeautifulSoup to Pandas Dataframe without Tags

I have been working on the code below and getting myself tied up in knots. What I am trying to do is build a simple dataframe using text scraped with BeautifulSoup.
I have scraped the applicable text from the <h5> and <p> tags, but using find_all means that when I build the dataframe and write to csv, the tags are included. To deal with this I added the print(p.text, end=" ") statements, but now nothing is being written to the csv.
Can anyone see what I am doing wrong?
import pandas as pd
import requests
from bs4 import BeautifulSoup

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.84 Safari/537.36',
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
}
course = []
runner = []

page = requests.get('https://www.attheraces.com/tips/atr-tipsters/hugh-taylor', headers=headers)
soup = BeautifulSoup(page.content, 'html.parser')
tips = soup.find('div', class_='sticky')

for h5 in tips.find_all("h5"):
    course_name = print(h5.text, end=" ")
    course.append(course_name)
for p in tips.find_all("p"):
    runner_name = print(p.text, end=" ")
    runner.append(runner_name)

todays_tips = pd.DataFrame(
    {'Course': course,
     'Selection': runner,
     })
print(todays_tips)
todays_tips.to_csv(r'C:\Users\*****\Today.csv')
Don't use the assignment for print: print() writes to the console and returns None, so course and runner end up as lists of None values. Use a list comprehension instead; applying this should get you the dataframe you want.
For example:
import pandas as pd
import requests
from bs4 import BeautifulSoup

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.84 Safari/537.36',
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
}
page = requests.get('https://www.attheraces.com/tips/atr-tipsters/hugh-taylor', headers=headers)
tips = BeautifulSoup(page.content, 'html.parser').find('div', class_='sticky')

course = [h5.getText() for h5 in tips.find_all("h5")]
runner = [p.getText() for p in tips.find_all("p")]

todays_tips = pd.DataFrame({'Course': course, 'Selection': runner})
print(todays_tips)
todays_tips.to_csv("your_data.csv", index=False)
Output:
          Course                                  Selection
0   1.00 HAYDOCK  1pt win RAINBOW JET (12-1 & 11-1 general)
1  2.50 GOODWOOD            1pt win MARSABIT (11-2 general)
The same data is also written out to the .csv file.

scrape data from a table that has a "show all" button

I am trying to scrape the "ALL EQUITIES" table at the following link, which has a "show all" button:
https://www.trading212.com/en/Trade-Equities
I should be able to get the expanded table, not just the rows shown before the table is expanded.
Here is my code:
import requests
import pandas as pd
from bs4 import BeautifulSoup as bs

header = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36"}
url = 'https://www.trading212.com/en/Trade-Equities'
r = requests.get(url, headers=header)
soup = bs(r.content, 'html.parser')
all_equities = soup.find('table', class_='I cant find the name of the class')
print(all_equities)
The contents are actually in divs, not a table. You can grab all of the content by using the class that is on each of those divs:
all_equities = soup.find_all('div', class_='js-search-row')
will give you a list of all of the divs with the equities in them.
Give this code a try:
all_equities = soup.find_all('div', class_='d-row js-search-row js-acc-wrapper')
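To turn those divs into something tabular, a rough sketch; the class name comes from the answer above, while the inner field layout of each row is an assumption:
import requests
import pandas as pd
from bs4 import BeautifulSoup as bs

header = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36"}
r = requests.get('https://www.trading212.com/en/Trade-Equities', headers=header)
soup = bs(r.content, 'html.parser')

rows = []
for div in soup.find_all('div', class_='js-search-row'):
    # get_text with a separator keeps the child fields distinguishable
    rows.append(div.get_text('|', strip=True).split('|'))

df = pd.DataFrame(rows)
print(df.head())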
