Can't get a table from a web page - python

I'm using BeautifulSoup to try to get the whole table of all 2000 companies from this URL:
https://www.forbes.com/global2000/list/#tab:overall.
This is the code I have written:
from bs4 import BeautifulSoup
import urllib.request
import pandas as pd

html_content = urllib.request.urlopen('https://www.forbes.com/global2000/list/#header:position')
soup = BeautifulSoup(html_content, 'lxml')
table = soup.find_all('table')[0]
new_table = pd.DataFrame(columns=range(0, 7), index=[0])

row_marker = 0
for row in table.find_all('tr'):
    column_marker = 0
    columns = row.find_all('td')
    for column in columns:
        new_table.iat[row_marker, column_marker] = column.get_text()
        column_marker += 1

new_table
In the result, I get only the names of the columns, but not the table itself.
How can I get the whole table?

The content is generated via JavaScript, so you must either use Selenium to mimic a browser (including the scroll movements) and then parse the page source with BeautifulSoup, or, in some cases like this one, you can access those values directly by querying the site's AJAX API:
import requests
import json

headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:50.0) Gecko/20100101 Firefox/50.0'}
target = 'https://www.forbes.com/ajax/list/data?year=2017&uri=global2000&type=organization'

with requests.Session() as s:
    s.headers = headers
    data = json.loads(s.get(target).text)
    print([x['name'] for x in data[:5]])
Output (first 5 items):
['3M', '3i Group', '77 Bank', 'AAC Technologies Holdings', 'ABB']
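For completeness, here is a minimal sketch of the Selenium route mentioned above (my own illustration, not part of the original answer). It assumes chromedriver is installed and on PATH, and the number of scrolls and the sleep time are arbitrary guesses rather than values tuned for this page:
import time
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()  # assumes chromedriver is available on PATH
driver.get('https://www.forbes.com/global2000/list/#tab:overall')
for _ in range(20):  # scroll repeatedly so the JavaScript renders more rows
    driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')
    time.sleep(1)
html = driver.page_source
driver.quit()

soup = BeautifulSoup(html, 'lxml')
rows = soup.find_all('tr')
print(len(rows))  # rows rendered so far; parse the cells from here as in the question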

Related

Multiple Page BeautifulSoup Script only Pulling first value

New to screen scraping here and this is my first time posting on Stack Overflow. Apologies in advance for any formatting errors in this post. I'm attempting to extract data from multiple pages with the URL:
https://www.landwatch.com/Michigan_land_for_sale/West_Central_Region/Page-' + str(page)
For instance, page 1 is:
https://www.landwatch.com/Michigan_land_for_sale/West_Central_Region/Page-1
Page 2:
https://www.landwatch.com/Michigan_land_for_sale/West_Central_Region/Page-2
and so on...
My script runs without errors. However, the CSV exported with Pandas contains only one row, holding the first extracted value. At the time of this posting, the first value is:
14.01 Acres   Vestaburg, Montcalm County, MI$275,000
My intent is to create a spreadsheet with hundreds of rows that pull the property description from the URLs.
Here is my code:
import requests
from requests import get
from bs4 import BeautifulSoup

headers = ({'User-Agent':
            'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'
            })

n_pages = 0
desc = []
for page in range(1, 900):
    n_pages += 1
    sapo_url = 'https://www.landwatch.com/Michigan_land_for_sale/West_Central_Region/Page-' + str(page)
    r = get(sapo_url, headers=headers)
    page_html = BeautifulSoup(r.text, 'html.parser')
    house_containers = page_html.find_all('div', class_="propName")
    if house_containers != []:
        for container in house_containers:
            desc = container.getText(strip=True)
    else:
        break

print('you scraped {} pages containing {} Properties'.format(n_pages, len(desc)))

import pandas as pd
df = pd.DataFrame({'description': [desc]})
df.to_csv('test4.csv', encoding='utf-8')
I suspect the problem is with the line reading desc = container.getText(strip=True) and have tried changing the line but keep getting errors when running.
Any help is appreciated.
I believe the mistake is in the line:
desc = container.getText(strip=True)
Each time through the loop, the value of desc is replaced rather than appended. To add items to the list, use:
desc.append(container.getText(strip=True))
Also, since it is already a list, you can remove the brackets from the DataFrame creation like so:
df = pd.DataFrame({'description': desc})
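Putting both fixes together, a minimal sketch of the corrected loop (same logic as the question's code, just accumulating into the list):
import requests
from bs4 import BeautifulSoup
import pandas as pd

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'}
desc = []
for page in range(1, 900):
    url = 'https://www.landwatch.com/Michigan_land_for_sale/West_Central_Region/Page-' + str(page)
    r = requests.get(url, headers=headers)
    page_html = BeautifulSoup(r.text, 'html.parser')
    house_containers = page_html.find_all('div', class_="propName")
    if not house_containers:
        break
    for container in house_containers:
        desc.append(container.getText(strip=True))  # accumulate instead of overwrite

df = pd.DataFrame({'description': desc})            # desc is already a list
df.to_csv('test4.csv', encoding='utf-8')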
The cause is that no data is added inside the loop, so only the last value ends up being saved. In the version below, each scraped description is appended to a DataFrame; for testing purposes the loop only goes up to page 2, so widen the range as needed.
import requests
from requests import get
from bs4 import BeautifulSoup
import pandas as pd

headers = ({'User-Agent':
            'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'
            })

n_pages = 0
all_data = pd.DataFrame(index=[], columns=['description'])

for page in range(1, 3):  # limited to 2 pages for testing; widen the range as needed
    n_pages += 1
    sapo_url = 'https://www.landwatch.com/Michigan_land_for_sale/West_Central_Region/Page-' + str(page)
    r = get(sapo_url, headers=headers)
    page_html = BeautifulSoup(r.text, 'html.parser')
    house_containers = page_html.find_all('div', class_="propName")
    if house_containers != []:
        for container in house_containers:
            desc = container.getText(strip=True)
            df = pd.DataFrame({'description': [desc]})
            all_data = pd.concat([all_data, df], ignore_index=True)
    else:
        break

all_data.to_csv('test4.csv', encoding='utf-8')
print('you scraped {} pages containing {} Properties'.format(n_pages, len(all_data)))  # len(all_data) = number of rows collected

Pandas return empty dataframe when trying to scrape table

I'm trying to get the transfer history of the top 500 most valuable players on Transfermarkt. I've managed (with some help) to loop through each player's profile and scrape the image and name. Now I want the transfer history, which can be found in a table on each player's profile: Player Profile
I want to save the table in a DataFrame using Pandas and then write it to a CSV, with Season, Date, etc. as headers. For Monaco and PSG, for example, I just want the names of the clubs, not the pictures or nationality. But right now, all I get is this:
Empty DataFrame
Columns: []
Index: []
Expected output:
  Season         Date    Left  Joined       MV      Fee
0  18/19  Jul 1, 2018  Monaco     PSG  120.00m  145.00m
I've viewed the source and inspected the page, but can't find anything that helps me apart from the tbody and tr tags. The way I'm doing it, I need to target that specific table precisely, since there are several others on the page.
This is my code:
import requests
from bs4 import BeautifulSoup
import csv
import pandas as pd

site = "https://www.transfermarkt.com/spieler-statistik/wertvollstespieler/marktwertetop?ajax=yw1&page={}"
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:74.0) Gecko/20100101 Firefox/74.0'
}

result = []

def main(url):
    with requests.Session() as req:
        result = []
        for item in range(1, 21):
            print(f"Collecting Links From Page# {item}")
            r = req.get(url.format(item), headers=headers)
            soup = BeautifulSoup(r.content, 'html.parser')
            tr = soup.find_all("tbody")[1].find_all("tr", recursive=False)
            result.extend([
                {
                    "Season": t[1].text.strip()
                }
                for t in (t.find_all(recursive=False) for t in tr)
            ])

df = pd.DataFrame(result)
print(df)
The answer below takes a different approach: it first collects every player's profile link and name from the list pages, then uses pd.read_html to grab the transfer table (the second table on each profile page) and inserts the player's name as the first row:
import requests
from bs4 import BeautifulSoup
import pandas as pd

site = "https://www.transfermarkt.com/spieler-statistik/wertvollstespieler/marktwertetop?ajax=yw1&page={}"
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:74.0) Gecko/20100101 Firefox/74.0'
}

def main(url):
    with requests.Session() as req:
        links = []
        names = []
        for item in range(1, 21):
            print(f"Collecting Links From Page# {item}")
            r = req.get(url.format(item), headers=headers)
            soup = BeautifulSoup(r.content, 'html.parser')
            urls = [f"{url[:29]}{item.get('href')}" for item in soup.findAll(
                "a", class_="spielprofil_tooltip")]
            ns = [item.text for item in soup.findAll(
                "a", class_="spielprofil_tooltip")][:-5]
            links.extend(urls)
            names.extend(ns)
        return links, names

def parser():
    links, names = main(site)
    for link, name in zip(links, names):
        with requests.Session() as req:
            r = req.get(link, headers=headers)
            df = pd.read_html(r.content)[1]
            df.loc[-1] = name
            df.index = df.index + 1
            df.sort_index(inplace=True)
            print(df)

parser()
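If the goal is a single CSV rather than printed frames, one possible extension (my own sketch, not part of the answer above; it reuses main, site, and headers as defined there) is to collect every player's table and concatenate once at the end:
import requests
import pandas as pd

def parser_to_csv(out_path='transfers.csv'):
    links, names = main(site)                  # main() as defined in the answer above
    frames = []
    with requests.Session() as req:
        for link, name in zip(links, names):
            r = req.get(link, headers=headers)
            df = pd.read_html(r.text)[1]       # transfer-history table, per the answer
            df['Player'] = name                # tag each row with the player's name
            frames.append(df)
    all_df = pd.concat(frames, ignore_index=True)
    all_df.to_csv(out_path, index=False, encoding='utf-8')
    return all_df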

How do I convert a web-scraped table into a csv?

A year ago I learned some Python in one of my classes, but I haven't had to use it much since then, so this may or may not be a simple question.
I'm trying to web-scrape the top grossing films of all time table from Box Office Mojo, and I want to grab the rank, title, and gross for the top 10 films of the 2010s. I've been playing around in Python and can get the entire table into Python, but I don't know how to manipulate it from there, let alone write out a CSV file. Any guidance/tips?
Here is what will print the entire table for me (the first few lines are copied from an old web-scraping assignment to get me started):
import bs4
import requests
from bs4 import BeautifulSoup as soup
url = "https://www.boxofficemojo.com/chart/top_lifetime_gross/"
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'}
page_html = requests.get(url, headers=headers)
page_soup = soup(page_html.text, "html.parser")
boxofficemojo_table = page_soup.find("div", {"class": "a-section imdb-scroll-table-inner"})
complete_table = boxofficemojo_table.get_text()
print(complete_table)
1. You can use pd.read_html for this:
import pandas as pd
Data = pd.read_html(r'https://www.boxofficemojo.com/chart/top_lifetime_gross/')
for data in Data:
    data.to_csv('Data.csv', sep=',')  # this page yields a single table; with multiple tables each would overwrite the previous file
2. Using BS4:
import pandas as pd
from bs4 import BeautifulSoup
import requests

URL = r'https://www.boxofficemojo.com/chart/top_lifetime_gross/'
print('\n>> Extracting Data using Beautiful Soup for: ' + URL)
try:
    res = requests.get(URL)
except Exception as e:
    print(repr(e))

print('\n<> URL present status Code = ', (res.status_code))
soup = BeautifulSoup(res.text, "lxml")
table = soup.find('table')

list_of_rows = []
for row in table.findAll('tr'):
    list_of_cells = []
    for cell in row.findAll(["td"]):
        text = cell.text
        list_of_cells.append(text)
    list_of_rows.append(list_of_cells)

Data = pd.DataFrame(list_of_rows)
Data.dropna(axis=0, how='all', inplace=True)
print(Data.head(10))
Data.to_csv('Table.csv')
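To narrow that down to the rank, title, and gross of the top 10 films, a short follow-up sketch (my own assumption: that the first table pd.read_html returns has rank, title, and lifetime gross in its first three columns, which is worth verifying against the live page):
import pandas as pd

tables = pd.read_html('https://www.boxofficemojo.com/chart/top_lifetime_gross/')
chart = tables[0]                    # assumed to be the lifetime-gross chart
top10 = chart.iloc[:10, :3]          # first 10 rows; assumed column order: rank, title, lifetime gross
top10.to_csv('top10.csv', index=False)
print(top10)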

How to scrape additional pages of a webpage

With some help from the community, I was able to scrape some information off a webpage. However, I am facing some trouble scraping information off the additional pages of the website.
The code shown below is able to obtain the following information: ('date', 'type', 'registration', 'operator', 'fat.', 'location', 'cat') from each year of the webpage (from 1919 - 2019). An example of the URL by year is
https://aviation-safety.net/database/dblist.php?Year=1946
However, I realised that there are additional pages for each year's URL, such as
https://aviation-safety.net/database/dblist.php?Year=1946&lang=&page=2
https://aviation-safety.net/database/dblist.php?Year=1946&lang=&page=3
https://aviation-safety.net/database/dblist.php?Year=1946&lang=&page=4
How can I scrape these additional pages for each year?
import pandas as pd
from bs4 import BeautifulSoup
import requests

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36'}

# start of code
mainurl = "https://aviation-safety.net/database/"

def getAndParseURL(mainurl):
    result = requests.get(mainurl)
    soup = BeautifulSoup(result.content, 'html.parser')
    datatable = soup.find_all('a', href=True)
    return datatable

datatable = getAndParseURL(mainurl)

# go through the content and grab the URLs
links = []
for link in datatable:
    if 'Year' in link['href']:
        url = link['href']
        links.append(mainurl + url)

# check if links are in dataframe
df = pd.DataFrame(links, columns=['url'])
df.head(10)

# create empty dataframe and empty list to store urls that didn't pull a table
results_df = pd.DataFrame()
no_table = []

# loop through the URLs retrieved previously and append to results_df
for x in df['url']:
    try:
        html = requests.get(x, headers=headers).text  # <----- added headers
        table = pd.read_html(html)[0]  # <---- pandas reads the html and parses the table tags; it returns a list of dataframes and we want the one in position 0
        results_df = results_df.append(table, sort=True).reset_index(drop=True)
        print('Processed: %s' % x)
    except:
        print('No table found: %s' % x)
        no_table.append(x)

results_df = results_df[['date', 'type', 'registration', 'operator', 'fat.', 'location', 'cat']]
You can use BeautifulSoup to check for the <div> tag that contains the page numbers, and then iterate through those pages. There might be a better way to do it, but I just added another try/except to handle the case where additional pages are found:
import pandas as pd
from bs4 import BeautifulSoup
import requests

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36'}

# start of code
mainurl = "https://aviation-safety.net/database/"

def getAndParseURL(mainurl):
    result = requests.get(mainurl)
    soup = BeautifulSoup(result.content, 'html.parser')
    datatable = soup.find_all('a', href=True)
    return datatable

datatable = getAndParseURL(mainurl)

# go through the content and grab the URLs
links = []
for link in datatable:
    if 'Year' in link['href']:
        url = link['href']
        links.append(mainurl + url)

# check if links are in dataframe
df = pd.DataFrame(links, columns=['url'])
df.head(10)

# create empty dataframe and empty list to store urls that didn't pull a table
results_df = pd.DataFrame()
no_table = []

# loop through the URLs retrieved previously and append to results_df
for x in df['url']:
    # check for additional pages
    try:
        html = requests.get(x, headers=headers)
        soup = BeautifulSoup(html.text, 'html.parser')
        pages = soup.find('div', {'class': 'pagenumbers'}).text.strip().split(' ')[-1]
        for page in range(1, int(pages) + 1):
            page_x = x + '&lang=&page=%s' % page
            try:
                html = requests.get(page_x, headers=headers).text  # <----- added headers
                table = pd.read_html(html)[0]  # <---- pandas reads the html and parses the table tags; it returns a list of dataframes and we want the one in position 0
                results_df = results_df.append(table, sort=True).reset_index(drop=True)
                print('Processed: %s' % page_x)
            except:
                print('No table found: %s' % page_x)
                no_table.append(page_x)
    except:
        try:
            html = requests.get(x, headers=headers).text  # <----- added headers
            table = pd.read_html(html)[0]  # <---- same as above: take the first dataframe returned
            results_df = results_df.append(table, sort=True).reset_index(drop=True)
            print('Processed: %s' % x)
        except:
            print('No table found: %s' % x)
            no_table.append(x)

results_df = results_df[['date', 'type', 'registration', 'operator', 'fat.', 'location', 'cat']]
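One maintenance note (my own addition, not part of the original answer): DataFrame.append was deprecated and removed in pandas 2.0, so on newer versions the same accumulation pattern can be written with a list and a single pd.concat, along these lines:
import pandas as pd

tables = []  # inside the loops above, replace results_df.append(table, ...) with tables.append(table)
# ... scraping loops populate `tables` ...
results_df = pd.concat(tables, sort=True, ignore_index=True) if tables else pd.DataFrame()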

Only scrape a portion of the page

I am using Python/requests to gather data from a website. Ideally I only want the latest 'banking' information, which is always at the top of the page.
The code I have currently does that, but then it attempts to keep going and hits an index out of range error. I am not very good with aspx pages, but is it possible to only gather the data under the 'banking' heading?
Here's what I have so far:
import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36'}

print('Scraping South Dakota Banking Activity Actions...')
url2 = 'https://dlr.sd.gov/banking/monthly_activity_reports/monthly_activity_reports.aspx'
r2 = requests.get(url2, headers=headers)
soup = BeautifulSoup(r2.text, 'html.parser')

mylist5 = []
for tr in soup.find_all('tr')[2:]:
    tds = tr.find_all('td')
    print(tds[0].text, tds[1].text)
Ideally I'd be able to slice the information as well so I can only show the activity or approval status, etc.
With bs4 4.7.1+ you can use :contains to isolate the latest month by filtering out the rows that come after it (the earlier months further down the page). I explain the principle of filtering out later general siblings using :not in this SO answer. In short: find the row containing "August 2019" (the latest month is determined dynamically), grab it and all its following siblings, then find the row containing "July 2019" and all its following siblings, and remove the latter set from the former.
import requests, re
from bs4 import BeautifulSoup as bs
import pandas as pd

r = requests.get('https://dlr.sd.gov/banking/monthly_activity_reports/monthly_activity_reports.aspx')
soup = bs(r.content, 'lxml')
months = [i.text for i in soup.select('[colspan="2"]:has(a)')][0::2]
latest_month = months[0]
next_month = months[1]
rows_of_interest = soup.select(f'tr:contains("{latest_month}"), tr:contains("{latest_month}") ~ tr:not(:contains("{next_month}"), :contains("{next_month}") ~ tr)')
results = []
for row in rows_of_interest:
    data = [re.sub('\xa0|\s{2,}', ' ', td.text) for td in row.select('td')]
    if len(data) == 1:
        data.extend([''])
    results.append(data)
df = pd.DataFrame(results)
print(df)
The request setup is the same as before:
import requests
from bs4 import BeautifulSoup, Tag
headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36'}
url = 'https://dlr.sd.gov/banking/monthly_activity_reports/monthly_activity_reports.aspx'
print('Scraping South Dakota Banking Activity Actions...')
r = requests.get(url, headers=headers)
soup = BeautifulSoup(r.text, 'html.parser')
Inspecting the page source, we can find the id of the element you need (the table of values).
banking = soup.find(id='secondarycontent')
After this, we filter out the elements of the soup that aren't tags (such as NavigableString). You can see how to get the texts too (for other options, check the Tag docs).
blocks = [b for b in banking.table.contents if type(b) is Tag] # filter out NavigableString
texts = [b.text for b in blocks]
Now, if "latest" is what you're after, we must determine which month is the latest and which is the month before.
current_month_idx, last_month_idx = None, None
current_month, last_month = 'August 2019', 'July 2019'  # could parse these with datetime too

for i, b in enumerate(blocks):
    if current_month in b.text:
        current_month_idx = i
    elif last_month in b.text:
        last_month_idx = i
    if all(idx is not None for idx in (current_month_idx, last_month_idx)):
        break  # stop once both indices have been found

assert current_month_idx < last_month_idx

curr_month_blocks = [b for i, b in enumerate(blocks) if current_month_idx < i < last_month_idx]
curr_month_texts = [b.text for b in curr_month_blocks]
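As a small addition (my own sketch, not part of the answer): the hard-coded month strings could be derived with datetime, assuming the latest month listed on the page matches the current calendar month:
from datetime import date, timedelta

today = date.today()
first_of_month = today.replace(day=1)
current_month = today.strftime('%B %Y')                              # e.g. 'August 2019'
last_month = (first_of_month - timedelta(days=1)).strftime('%B %Y')  # e.g. 'July 2019'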
