With some help from the community, I was able to scrape some information off a webpage. However, I am facing some trouble scraping information off the additional pages of the website.
The code shown below is able to obtain the following information: ('date', 'type', 'registration', 'operator', 'fat.', 'location', 'cat') from each year of the webpage (from 1919 - 2019). An example of the URL by year is
https://aviation-safety.net/database/dblist.php?Year=1946
However, I realised that each yearly URL has additional pages, such as
https://aviation-safety.net/database/dblist.php?Year=1946&lang=&page=2
https://aviation-safety.net/database/dblist.php?Year=1946&lang=&page=3
https://aviation-safety.net/database/dblist.php?Year=1946&lang=&page=4
How can I scrape these additional pages for each year?
import pandas as pd
from bs4 import BeautifulSoup
import requests

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36'}

#start of code
mainurl = "https://aviation-safety.net/database/"

def getAndParseURL(mainurl):
    result = requests.get(mainurl)
    soup = BeautifulSoup(result.content, 'html.parser')
    datatable = soup.find_all('a', href=True)
    return datatable

datatable = getAndParseURL(mainurl)

#go through the content and grab the URLs
links = []
for link in datatable:
    if 'Year' in link['href']:
        url = link['href']
        links.append(mainurl + url)

#check if links are in dataframe
df = pd.DataFrame(links, columns=['url'])
df.head(10)

#create empty dataframe and empty list to store urls that didn't pull a table
results_df = pd.DataFrame()
no_table = []

#Loop through the URLs retrieved previously and append to results_df
for x in df['url']:
    try:
        html = requests.get(x, headers=headers).text  # <----- added headers
        table = pd.read_html(html)[0]  # <---- pd.read_html parses the table tags and returns a list of dataframes; we want the one in position 0
        results_df = results_df.append(table, sort=True).reset_index(drop=True)
        print('Processed: %s' % x)
    except:
        print('No table found: %s' % x)
        no_table.append(x)

results_df = results_df[['date', 'type', 'registration', 'operator', 'fat.', 'location', 'cat']]
You can use BeautifulSoup to check for the <div> tag that contains the number of pages, and then iterate through those pages. There might be a better way to do it, but I just added another try/except to handle the case where additional pages are found:
import pandas as pd
from bs4 import BeautifulSoup
import requests

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36'}

#start of code
mainurl = "https://aviation-safety.net/database/"

def getAndParseURL(mainurl):
    result = requests.get(mainurl)
    soup = BeautifulSoup(result.content, 'html.parser')
    datatable = soup.find_all('a', href=True)
    return datatable

datatable = getAndParseURL(mainurl)

#go through the content and grab the URLs
links = []
for link in datatable:
    if 'Year' in link['href']:
        url = link['href']
        links.append(mainurl + url)

#check if links are in dataframe
df = pd.DataFrame(links, columns=['url'])
df.head(10)

#create empty dataframe and empty list to store urls that didn't pull a table
results_df = pd.DataFrame()
no_table = []

#Loop through the URLs retrieved previously and append to results_df
for x in df['url']:

    #Check for additional pages
    try:
        html = requests.get(x, headers=headers)
        soup = BeautifulSoup(html.text, 'html.parser')
        pages = soup.find('div', {'class': 'pagenumbers'}).text.strip().split(' ')[-1]

        for page in range(1, int(pages) + 1):
            page_x = x + '&lang=&page=%s' % page
            try:
                html = requests.get(page_x, headers=headers).text  # <----- added headers
                table = pd.read_html(html)[0]  # <---- pd.read_html returns a list of dataframes; we want the one in position 0
                results_df = results_df.append(table, sort=True).reset_index(drop=True)
                print('Processed: %s' % page_x)
            except:
                print('No table found: %s' % page_x)
                no_table.append(page_x)

    #No pagenumbers div found, so only a single page for that year
    except:
        try:
            html = requests.get(x, headers=headers).text  # <----- added headers
            table = pd.read_html(html)[0]  # <---- pd.read_html returns a list of dataframes; we want the one in position 0
            results_df = results_df.append(table, sort=True).reset_index(drop=True)
            print('Processed: %s' % x)
        except:
            print('No table found: %s' % x)
            no_table.append(x)

results_df = results_df[['date', 'type', 'registration', 'operator', 'fat.', 'location', 'cat']]
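A side note that is not part of the answer above: DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so on a recent pandas the loops above will raise an AttributeError. A minimal sketch of the same idea with the per-page tables collected in a list and concatenated once at the end (the two hard-coded page URLs are just placeholders standing in for the pagination loop):

import pandas as pd
import requests

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36'}

tables = []     # collect one DataFrame per page instead of appending repeatedly
no_table = []

for page_x in ['https://aviation-safety.net/database/dblist.php?Year=1946&lang=&page=1',
               'https://aviation-safety.net/database/dblist.php?Year=1946&lang=&page=2']:
    try:
        html = requests.get(page_x, headers=headers).text
        tables.append(pd.read_html(html)[0])   # first table on the page
    except Exception:
        no_table.append(page_x)

# single concat at the end replaces the repeated DataFrame.append calls
results_df = pd.concat(tables, sort=True).reset_index(drop=True)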
Related
I'm having an issue where I can access the first page of the table but not the rest. When I click on, say, tab 2, it shows me players 26-50, but I can't scrape it because the URL stays the same rather than changing. Is there a way to edit my code so that I can get all the pages of the table?
import requests
from bs4 import BeautifulSoup
import pandas as pd

headers = {'User-Agent':
           'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36'}

page = "https://www.transfermarkt.us/premier-league/transferrekorde/wettbewerb/GB1/plus/1/galerie/0?saison_id=2021&land_id=alle&ausrichtung=alle&spielerposition_id=alle&altersklasse=alle&leihe=&w_s=s&zuab=0"
pageTree = requests.get(page, headers=headers)
pageSoup = BeautifulSoup(pageTree.content, 'html.parser')

TransferPrice = pageSoup.find_all("td", {"class", "rechts hauptlink"})

transfer_prices = []
cleaned_transfer_prices = []
for i in TransferPrice:
    transfer_prices.append(i.text)

for i in transfer_prices:
    i = i[1:-1]
    i = float(i)
    cleaned_transfer_prices.append(i)
cleaned_transfer_prices

some_list = []
#Players = pageSoup.find_all("td",{"class", "hauptlink"})
for td_tag in pageSoup.find_all("td", {"class", "hauptlink"}):
    a_tag = td_tag.find('a')
    if a_tag == None:
        pass
    else:
        some_list.append(a_tag.text)

players = []
team_left = []
team_gone_to = []
for i in range(0, len(some_list), 3):
    players.append(some_list[i])
for i in range(1, len(some_list), 3):
    team_left.append(some_list[i])
for i in range(2, len(some_list), 3):
    team_gone_to.append(some_list[i])

df_2 = pd.DataFrame()
df_2['Player Name'] = players
df_2['Team Left'] = team_left
df_2['New Team'] = team_gone_to
df_2['Transfer Price'] = cleaned_transfer_prices
df_2.index += 1
df_2
You are using the wrong link here. You should use this one instead:
https://www.transfermarkt.us/premier-league/transferrekorde/wettbewerb/GB1/ajax/yw1/saison_id/2021/land_id/alle/ausrichtung/alle/spielerposition_id/alle/altersklasse/alle/leihe//w_s/s/zuab/0/plus/1/galerie/0/page/2?ajax=yw1
Pay attention to the ajax part of it: the dynamic content you need is loaded via AJAX.
The last part of the link is page/2?ajax=yw1, and that is where you solve your problem: rotate the number (here it is 2, the second page) to whichever page you need, for example by building the URL with an f-string.
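For example, looping over the page number with an f-string. This is only a sketch: the page count of 3 is an assumption (adjust it to however many tabs the table shows), and the parsing is reduced to a single find_all just to show where your existing code would plug in:

import requests
from bs4 import BeautifulSoup

headers = {'User-Agent':
           'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36'}

all_cells = []
for page_number in range(1, 4):   # 3 pages is an assumption
    url = ('https://www.transfermarkt.us/premier-league/transferrekorde/wettbewerb/GB1/ajax/yw1/'
           'saison_id/2021/land_id/alle/ausrichtung/alle/spielerposition_id/alle/altersklasse/alle/'
           f'leihe//w_s/s/zuab/0/plus/1/galerie/0/page/{page_number}?ajax=yw1')
    pageTree = requests.get(url, headers=headers)
    pageSoup = BeautifulSoup(pageTree.content, 'html.parser')
    # each AJAX response carries the same table markup as the first page,
    # so the parsing code from the question can be reused on pageSoup here
    all_cells.extend(td.text for td in pageSoup.find_all('td', {'class': 'hauptlink'}))

print(len(all_cells))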
I'm new to screen scraping and this is my first time posting on Stack Overflow; apologies in advance for any formatting errors in this post. I'm attempting to extract data from multiple pages with the URL:
https://www.landwatch.com/Michigan_land_for_sale/West_Central_Region/Page-' + str(page)
For instance, page 1 is:
https://www.landwatch.com/Michigan_land_for_sale/West_Central_Region/Page-1
Page 2:
https://www.landwatch.com/Michigan_land_for_sale/West_Central_Region/Page-2
and so on...
My script runs without errors. However, the CSV exported by pandas contains only one row, holding the first extracted value. At the time of this posting, the first value is:
14.01 Acres  Vestaburg, Montcalm County, MI$275,000
My intent is to create a spreadsheet with hundreds of rows that pull the property description from the URLs.
Here is my code:
import requests
from requests import get
from bs4 import BeautifulSoup

headers = ({'User-Agent':
            'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'
            })

n_pages = 0
desc = []
for page in range(1, 900):
    n_pages += 1
    sapo_url = 'https://www.landwatch.com/Michigan_land_for_sale/West_Central_Region/Page-' + str(page)
    r = get(sapo_url, headers=headers)
    page_html = BeautifulSoup(r.text, 'html.parser')
    house_containers = page_html.find_all('div', class_="propName")
    if house_containers != []:
        for container in house_containers:
            desc = container.getText(strip=True)
    else:
        break

print('you scraped {} pages containing {} Properties'.format(n_pages, len(desc)))

import pandas as pd
df = pd.DataFrame({'description': [desc]})
df.to_csv('test4.csv', encoding='utf-8')
I suspect the problem is with the line desc = container.getText(strip=True); I have tried changing that line, but I keep getting errors when running.
Any help is appreciated.
I believe the mistake is in the line:
desc = container.getText(strip=True)
Every time it loops, the value in desc is replaced, not added on. To add items into the list, do:
desc.append(container.getText(strip=True))
Also, since it is already a list, you can remove the brackets from the DataFrame creation like so:
df = pd.DataFrame({'description': desc})
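Putting both changes back into the question's loop, a rough sketch of how it fits together (same URL pattern and headers as in the question):

import requests
from bs4 import BeautifulSoup
import pandas as pd

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'}

desc = []
for page in range(1, 900):
    sapo_url = 'https://www.landwatch.com/Michigan_land_for_sale/West_Central_Region/Page-' + str(page)
    r = requests.get(sapo_url, headers=headers)
    page_html = BeautifulSoup(r.text, 'html.parser')
    house_containers = page_html.find_all('div', class_="propName")
    if not house_containers:
        break
    for container in house_containers:
        desc.append(container.getText(strip=True))   # append instead of overwriting

df = pd.DataFrame({'description': desc})             # desc is already a list, so no extra brackets
df.to_csv('test4.csv', encoding='utf-8')
print('scraped {} properties'.format(len(desc)))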
The cause is that no data is being accumulated inside the loop, so only the final value is saved. For testing purposes this code only goes up to page 2, so change the range back when you run it for real.
import requests
from requests import get
from bs4 import BeautifulSoup
import pandas as pd

headers = ({'User-Agent':
            'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'
            })

n_pages = 0
desc = []
all_data = pd.DataFrame(index=[], columns=['description'])

for page in range(1, 3):
    n_pages += 1
    sapo_url = 'https://www.landwatch.com/Michigan_land_for_sale/West_Central_Region/Page-' + str(page)
    r = get(sapo_url, headers=headers)
    page_html = BeautifulSoup(r.text, 'html.parser')
    house_containers = page_html.find_all('div', class_="propName")
    if house_containers != []:
        for container in house_containers:
            desc = container.getText(strip=True)
            df = pd.DataFrame({'description': [desc]})
            all_data = pd.concat([all_data, df], ignore_index=True)
    else:
        break

all_data.to_csv('test4.csv', encoding='utf-8')
print('you scraped {} pages containing {} Properties'.format(n_pages, len(desc)))
A year ago I learned some python in one of my classes but haven't had to use much since then so this may or may not be a simple question.
I'm trying to web-scrape the top-grossing films of all time table from Box Office Mojo, and I want to grab the rank, title, and gross for the top 10 films in the 2010s. I've been playing around in Python and can get the entire table loaded, but I don't know how to manipulate it from there, let alone write out a CSV file. Any guidance/tips?
Here is what will print the entire table for me (the first few lines are copied from an old web-scraping assignment to get me started):
import bs4
import requests
from bs4 import BeautifulSoup as soup

url = "https://www.boxofficemojo.com/chart/top_lifetime_gross/"
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'}

page_html = requests.get(url, headers=headers)
page_soup = soup(page_html.text, "html.parser")

boxofficemojo_table = page_soup.find("div", {"class": "a-section imdb-scroll-table-inner"})
complete_table = boxofficemojo_table.get_text()
print(complete_table)
1. You can use pd.read_html for this:
import pandas as pd

Data = pd.read_html(r'https://www.boxofficemojo.com/chart/top_lifetime_gross/')
for data in Data:
    data.to_csv('Data.csv', ',')
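If you only want the rank, title, and gross for the top 10 films, you can slice that first DataFrame before writing it out. The column positions below are an assumption about how the Mojo chart comes back from read_html, so inspect chart.columns first:

import pandas as pd

tables = pd.read_html('https://www.boxofficemojo.com/chart/top_lifetime_gross/')
chart = tables[0]                 # read_html returns a list; the lifetime-gross chart should be first

# Assumption: the first three columns are rank, title, and lifetime gross --
# check chart.columns to confirm before relying on this.
top10 = chart.iloc[:10, :3]
top10.to_csv('top10.csv', index=False)
print(top10)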
2. Using bs4:
import pandas as pd
from bs4 import BeautifulSoup
import requests

URL = r'https://www.boxofficemojo.com/chart/top_lifetime_gross/'
print('\n>> Extracting Data using Beautiful Soup for: ' + URL)
try:
    res = requests.get(URL)
except Exception as e:
    print(repr(e))
print('\n<> URL present status Code = ', (res.status_code))

soup = BeautifulSoup(res.text, "lxml")
table = soup.find('table')

list_of_rows = []
for row in table.findAll('tr'):
    list_of_cells = []
    for cell in row.findAll(["td"]):
        text = cell.text
        list_of_cells.append(text)
    list_of_rows.append(list_of_cells)

for item in list_of_rows:
    ' '.join(item)

Data = pd.DataFrame(list_of_rows)
Data.dropna(axis=0, how='all', inplace=True)
print(Data.head(10))
Data.to_csv('Table.csv')
I want to replace the duplicated title, price, and link values with empty column values.
import requests
import csv
from bs4 import BeautifulSoup
requests.packages.urllib3.disable_warnings()
import pandas as pd

url = 'http://shop.kvgems-preciousstones.com/'

while True:
    session = requests.Session()
    session.headers = {"User-Agent": "Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36"}
    content = session.get(url, verify=False).content
    soup = BeautifulSoup(content, "html.parser")

    posts = soup.find_all('li', {'class': 'item'})
    data = []
    for url in posts:
        title = url.find('h2', {'product-name'}).text
        price = url.find('span', {'price'}).text
        link = url.find('a').get('href')

        url_response = requests.get(link)
        url_data = url_response.text
        url_soup = BeautifulSoup(url_data, 'html.parser')
        desciption = url_soup.find('tr')
        for tr in url_soup.find_all('tr'):
            planet_data = dict()
            values = [td.text for td in tr.find_all('td')]
            planet_data['name'] = tr.find('td').text.strip()
            planet_data['info'] = tr.find_all('td')[1].text.strip()
            data.append((title, price, planet_data, link))

    #data_new = data +","+ data_desciption
    #urls = soup.find('a',{'class': 'next i-next'}).get('href')
    #url = urls
    #print(url)

    with open('ineryrge5szdqzrt.csv', 'a') as csv_file:
        writer = csv.writer(csv_file)
        writer.writerow(['title', 'price', 'name', 'info', 'link'])
        #The for loop
        for title, price, planet_data, link in data:
            writer.writerow([title, price, planet_data['name'], planet_data['info'], link])
When I write the CSV I get duplicated title, price, and link values, but I want only the first row of each item to carry the title, price, info, and link, with those columns left empty in the remaining rows.
The first for loop extracts the common values (title, price and link). The second for loop then extracts all the data attributes for each item.
However, you are then writing title, price and link fields to the CSV file for every row of data. You only need to do it for the first row of data.
To detect whether your second for loop is on the first row or not, you can change it to use the enumerate function, which gives you an extra index variable. You can then use this value to write the title, price, and link only when the index is 0:
for index, tr in enumerate(url_soup.find_all('tr')):
    planet_data = dict()
    values = [td.text for td in tr.find_all('td')]
    planet_data['name'] = tr.find('td').text.strip()
    planet_data['info'] = tr.find_all('td')[1].text.strip()
    if index == 0:
        data.append((title, price, planet_data, link))
    else:
        data.append((None, None, planet_data, None))
(Also I don't think you need the initial while True: part.)
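One small point worth knowing here: csv.writer writes None as an empty cell, so the existing writerow loop will produce the blank title/price/link columns you asked for without further changes. A toy illustration with made-up (hypothetical) data in the same tuple shape:

import csv

# Hypothetical data in the shape the loop above builds: only the first row
# of each item carries title/price/link, the rest carry None.
data = [
    ('Ruby Ring', '$100', {'name': 'Weight', 'info': '2 ct'}, 'http://example.com/item'),
    (None, None, {'name': 'Origin', 'info': 'Burma'}, None),
]

with open('out.csv', 'w', newline='') as csv_file:
    writer = csv.writer(csv_file)
    writer.writerow(['title', 'price', 'name', 'info', 'link'])
    for title, price, planet_data, link in data:
        # None values are written as empty strings, giving the blank columns
        writer.writerow([title, price, planet_data['name'], planet_data['info'], link])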
I'm using BeautifulSoup to try to get the whole table of all 2000 companies from this URL:
https://www.forbes.com/global2000/list/#tab:overall.
This is the code I have written:
from bs4 import BeautifulSoup
import urllib.request
import pandas as pd   # needed for pd.DataFrame below

html_content = urllib.request.urlopen('https://www.forbes.com/global2000/list/#header:position')
soup = BeautifulSoup(html_content, 'lxml')
table = soup.find_all('table')[0]
new_table = pd.DataFrame(columns=range(0, 7), index=[0])

row_marker = 0
for row in table.find_all('tr'):
    column_marker = 0
    columns = row.find_all('td')
    for column in columns:
        new_table.iat[row_marker, column_marker] = column.get_text()
        column_marker += 1

new_table
In the result, I get only the names of the columns, but not the table itself.
How can I get the whole table?
The content is generated via JavaScript, so you would normally need Selenium to mimic a browser (including scroll movements) and then parse the page source with Beautiful Soup; or, in some cases like this one, you can access those values by querying the site's AJAX API:
import requests
import json

headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:50.0) Gecko/20100101 Firefox/50.0'}
target = 'https://www.forbes.com/ajax/list/data?year=2017&uri=global2000&type=organization'

with requests.Session() as s:
    s.headers = headers
    data = json.loads(s.get(target).text)

print([x['name'] for x in data[:5]])
Output (first 5 items):
['3M', '3i Group', '77 Bank', 'AAC Technologies Holdings', 'ABB']
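To get the whole table rather than just the names, you can hand that same JSON list straight to pandas. Which other fields come back besides 'name' is an assumption on my part, so inspect the columns before relying on them:

import requests
import json
import pandas as pd

headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:50.0) Gecko/20100101 Firefox/50.0'}
target = 'https://www.forbes.com/ajax/list/data?year=2017&uri=global2000&type=organization'

with requests.Session() as s:
    s.headers = headers
    data = json.loads(s.get(target).text)

df = pd.DataFrame(data)            # one row per company, one column per JSON field
print(df.columns.tolist())         # check which fields the API actually returns
df.to_csv('forbes_global2000.csv', index=False)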