How to get data from a link inside a webpage in Python?

I need to collect data from the website - https://webgate.ec.europa.eu/rasff-window/portal/?event=notificationsList&StartRow= and store it in a dataframe using pandas. For this I use the following code and get the data quite easily -
import pandas as pd
import requests
url = "https://webgate.ec.europa.eu/rasff-window/portal/?event=notificationsList&StartRow="
link = requests.get(url).text
df = pd.read_html(link)
df = df[-1]
But if you notice, there is another hyperlink named "Details" at the extreme right-hand side of every row of the table. I would also like to add the data from behind that hyperlink to every row in our dataframe. How do we do that?

As suggested by Shi XiuFeng, BeautifulSoup is better suited for your problem, but if you still want to proceed with your current code, you would have to use regex to extract the URLs and add them as a column, like this:
import pandas as pd
import re
import requests

url = "https://webgate.ec.europa.eu/rasff-window/portal/?event=notificationsList&StartRow="
link = requests.get(url)
link_content = str(link.content)
# Narrow the search to the table body, then pull the URL out of every "Details" anchor
res = re.findall(r'(<tbody.*?>.*?</tbody>)', link_content)[0]
res = re.findall(r'(<a href=\"(.*?)\">Details\<\/a\>)', res)
res = [i[1] for i in res]
link_text = link.text
df = pd.read_html(link_text)
df = df[-1]
df['links'] = res
print(df)
Hope that solves your problem.
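For completeness, here is a minimal BeautifulSoup sketch of the same extraction (the "Details" anchor text comes from the question; the rest of the markup is an assumption):
from bs4 import BeautifulSoup
import pandas as pd
import requests

url = "https://webgate.ec.europa.eu/rasff-window/portal/?event=notificationsList&StartRow="
html = requests.get(url).text
soup = BeautifulSoup(html, "html.parser")
# Collect the href of every "Details" anchor
links = [a["href"] for a in soup.find_all("a", string="Details")]
df = pd.read_html(html)[-1]
df["links"] = links  # assumes exactly one "Details" link per table row
print(df)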

Related

Splitting up text into separate rows - BeautifulSoup

I'm trying to loop over a few hundred pages of a site, grab Buddhist quotes, and then save them into a dataframe. I've mostly got the code working, but am struggling to parse some of the text appropriately. On each page I'm scraping there are 5 quotes and, from what I can tell in the HTML output, no obvious identifier for each. So I've attempted to loop over what I scrape from each page, but it's either overwriting all previous quotes (i.e. quotes 1-4) or grouping them all together into a single cell.
See set-up and code below:
# For data handling:
import pandas as pd
# Set Pandas output options
pd.set_option('display.max_colwidth', None)
# For the scrape:
from bs4 import BeautifulSoup as BShtml
import urllib.request as ur
# Make empty dataframe
emptydata = pd.DataFrame({"quote":[], "quote_date":[], "page_no":[]})
# Populate dataframe with quotes for first three pages
for i in range(1, 4):
    url = "https://www.sgi-usa.org/tag/to-my-friends/page/" + str(i)
    r = ur.urlopen(url).read()
    soup = BShtml(r, "html.parser")
    new_result = pd.DataFrame({
        "quote": [soup.find_all("div", class_="post-content")],
        "quote_date": [soup.find_all("div", class_="post-date")],
        "page_no": [str(i)]
    })
    emptydata = emptydata.append(new_result)
emptydata
As you can see from the attached screenshot, this bundles all 5 quotes into a single cell and creates one row of data per page. Any thoughts on how I can split these up so I have one row per quote and date? I tried looping over soup.find_all("div", class_="post-content") but figure I must have been constructing the dataframe incorrectly, as that overwrote all but the last quote on each page.
[screenshot: what my dataframe currently looks like]
Thanks in advance! Chris
How to fix?
You need an additional for loop, iterating over each quote container, to reach your goal:
for post in soup.find_all("div", class_="quote-inner"):
    new_result = pd.DataFrame({
        "quote": [post.find("div", class_="post-content").get_text(strip=True)],
        "quote_date": [post.find_all("div", class_="post-date")[1].get_text()],
        "page_no": [str(i)]
    })
Example
# For data handling:
import pandas as pd
# Set Pandas output options
pd.set_option('display.max_colwidth', None)
# For the scrape:
from bs4 import BeautifulSoup as BShtml
import urllib.request as ur
# Make empty dataframe
emptydata = pd.DataFrame({"quote":[], "quote_date":[], "page_no":[]})
# Populate dataframe with quotes for first three pages
for i in range(1, 4):
    url = "https://www.sgi-usa.org/tag/to-my-friends/page/" + str(i)
    r = ur.urlopen(url).read()
    soup = BShtml(r, "html.parser")
    for post in soup.find_all("div", class_="quote-inner"):
        new_result = pd.DataFrame({
            "quote": [post.find("div", class_="post-content").get_text(strip=True)],
            "quote_date": [post.find_all("div", class_="post-date")[1].get_text()],
            "page_no": [str(i)]
        })
        emptydata = emptydata.append(new_result)
emptydata
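Side note: DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so on current pandas the accumulation line would be written as:
emptydata = pd.concat([emptydata, new_result], ignore_index=True)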

Web scraping golf data from ESPN. I am receiving 3 outputs of the same table and only want 1. How can I limit this?

I am new to Python and am stuck. I can't figure out how to output only one of the tables given. The output includes the desired table, but three versions of it. The first two are awfully formatted, and the last one is the table desired.
I have tried running a for loop and counting to only print the third table.
import pandas as pd
from bs4 import BeautifulSoup
import requests
url = 'https://www.espn.com/golf/leaderboard'
dfs = pd.read_html(url, header = 0)
for df in dfs:
    print(df[0:])
Just use the index to print the desired table.
import pandas as pd
url = 'https://www.espn.com/golf/leaderboard'
dfs = pd.read_html(url, header = 0)
print(dfs[2])
OR
print(dfs[-1])
Or, if you want to use a loop, try this:
import pandas as pd
url = 'https://www.espn.com/golf/leaderboard'
dfs = pd.read_html(url, header = 0)
for df in range(len(dfs)):
    if df == 2:
        print(dfs[df])
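Alternatively, read_html can pre-filter tables by their text content via the match argument, so only the table you want comes back; a sketch (the 'PLAYER' string is an assumption about the leaderboard's column headers):
import pandas as pd

url = 'https://www.espn.com/golf/leaderboard'
# Keep only tables whose text matches the given string/regex
dfs = pd.read_html(url, header=0, match='PLAYER')
print(dfs[0])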

Web scraping a table from multiple pages of a search and creating a pandas dataframe

I got this code working for the first page, and needed the user agent as it didn't work otherwise.
The problem is that the search only brings up the first page; from the second page onward the URL has "page=2" and so on, so I need to scrape all (or as many as needed) of the pages from the search:
"https://www.vesselfinder.com/vessels?page=2&minDW=20000&maxDW=300000&type=4"
from bs4 import BeautifulSoup
from urllib.request import Request, urlopen
import pandas as pd
import numpy as np
import re

site = "https://www.vesselfinder.com/vessels?type=4&minDW=20000&maxDW=300000"
hdr = {'User-Agent': 'Chrome/70.0.3538.110'}
req = Request(site, headers=hdr)
page = urlopen(req)
soup = BeautifulSoup(page, 'lxml')
type(soup)

rows = soup.find_all('tr')
print(rows[:10])
for row in rows:
    row_td = row.find_all('td')
print(row_td)
type(row_td)

str_cells = str(row_td)
cleantext = BeautifulSoup(str_cells, "lxml").get_text()
print(cleantext)

list_rows = []
for row in rows:
    cells = row.find_all('td')
    str_cells = str(cells)
    clean = re.compile('<.*?>')
    clean2 = re.sub(clean, '', str_cells)
    list_rows.append(clean2)
print(clean2)
type(clean2)

df = pd.DataFrame(list_rows)
df.head(10)
df1 = df[0].str.split(',', expand=True)
df1.head(10)
The output is a pandas DataFrame. I need to scrape all pages to output one large dataframe.
Okay, so this problem ended up getting stuck in my head, so I worked it out.
import pandas as pd
import requests
hdr={'User-Agent':'Chrome/70.0.3538.110'}
table_dfs={}
for page_number in range(951):
    http = "https://www.vesselfinder.com/vessels?page={}&minDW=20000&maxDW=300000&type=4".format(page_number + 1)
    url = requests.get(http, headers=hdr)
    table_dfs[page_number] = pd.read_html(url.text)
It will return the first column (vessel) as a NaN value. That's the column for the image; ignore it if you don't need it.
The next column will be called 'built'; it has the ship's name and the type of ship in it. You'll need to .split() to separate them, and then you can replace the (vessel) column with the ship's name.
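A hypothetical sketch of that clean-up (the column names 'Built', 'Vessel', and 'Type', and the split rule, are assumptions about the scraped table):
df = table_dfs[0][0]  # first table of the first page
# Split the combined column on its last space; adjust to the real format
name_and_type = df['Built'].str.rsplit(' ', n=1, expand=True)
df['Vessel'] = name_and_type[0]  # replace the NaN image column with the ship's name
df['Type'] = name_and_type[1]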
If it works for you I'd love to boost my reputation with a nice green check mark.
rows = soup.find_all('tr')
print(rows[:10])
for row in rows:
row_td = row.find_all('td')
print(row_td)
type(row_td)
The code above is the same thing as:
urls=['some list of urls you want to scrape']
table_dfs= [pd.read_html(url) for url in urls]
you can crawl through the urls you're looking for and apply that, and then if you want to do something with/to the tables you can just go:
for table in table_dfs:
    table + 'the thing you want to do'
Note that the in-line for loop builds table_dfs as a list. That means you might not be able to discern which url each table came from if the scrape is big enough. Pieca seemed to have a solution that could be used to iterate over the website's urls and create a dictionary key. Note that this solution may not apply to every website.
url_list = {page_number: "https://www.vesselfinder.com/vessels?page={}&minDW=20000&maxDW=300000&type=4".format(page_number)
            for page_number in range(1, 953)}
table_dfs = {}
for page_number, url in url_list.items():
    # pd.read_html has no headers argument, so fetch with requests first
    html = requests.get(url, headers=hdr).text
    table_dfs[page_number] = pd.read_html(html)
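If you then want one large dataframe rather than a dict of per-page tables, a minimal sketch (assuming each page yields exactly one table):
# Stack the first table from every page; the page number becomes an index level
big_df = pd.concat({page: dfs[0] for page, dfs in table_dfs.items()})
big_df = big_df.reset_index(level=0).rename(columns={'level_0': 'page'})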

python pd.read_html gives error

I've been using pandas and requests to pull some tables of NFL statistics. It's been going pretty well; I've been able to pull tables from other sites, until I tried to get the NFL combine table from this one particular site.
It gives me an error message after df_list = pd.read_html(html).
The error I get is:
TypeError: ufunc 'add' did not contain a loop with signature matching types dtype('<U1') dtype('<U1') dtype('<U1')
Here's the code I've been using at other sites that worked really well.
import requests
import pandas as pd
df = pd.DataFrame()
url = 'http://nflcombineresults.com/nflcombinedata_expanded.php?year=1987&pos=&college='
html = requests.get(url).content
df_list = pd.read_html(html)
df = df_list[-1]
I've read and seen a little bit about BeautifulSoup, but the simplicity of the pd.read_html() is just so nice and compact. So I don't know if there's a quick fix that I am not aware of, or if I need to indeed dive into BeautifulSoup to get these tables from 1987 - 2017.
This isn't shorter, but may be more robust:
import requests
import pandas as pd
from bs4 import BeautifulSoup
A convenience function:
def souptable(table):
    for row in table.find_all('tr'):
        yield [col.text for col in row.find_all('td')]
Return a DataFrame with data loaded for a given year:
def getyear(year):
    url = 'http://nflcombineresults.com/nflcombinedata_expanded.php?year=%d&pos=&college=' % year
    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'lxml')
    data = list(souptable(soup.table))
    df = pd.DataFrame(data[1:], columns=data[0])
    df = df[pd.notnull(df['Name'])]
    return df.apply(pd.to_numeric, errors="ignore")
This function slices out the heading row when the DataFrame is created, uses the first row for column names, and filters out any rows with an empty Name value.
Finally, concatenate up as many years as you need into a single DataFrame:
dfs = pd.concat([getyear(year) for year in range(1987, 1990)])
OK, after doing some more research, it looks like the issue is that the last row is one merged cell, and that's possibly where the problem is coming from. So I did go into using BeautifulSoup to pull the data. Here is my solution:
import requests
import pandas as pd
from bs4 import BeautifulSoup
I wanted to pull the data for each year from 1987 to 2017:
seasons = list(range(1987, 2018))
df = pd.DataFrame()
temp_df = pd.DataFrame()
It runs through each year, appending each cell as a new row. Then, knowing the last cell is a "blank", I eliminate that last row by slicing the dataframe as df[:-1] before the loop appends the next year's data.
for i in seasons:
    df = df[:-1]
    url = 'http://nflcombineresults.com/nflcombinedata_expanded.php?year=%s&pos=&college=' % (i)
    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'lxml')
    for tr in soup.table.find_all('tr'):
        row = [td.text for td in tr.find_all('td')]
        temp_df = row
        df = df.append(temp_df, ignore_index=True)
Finally, since there is no new year to append, I need to eliminate the last row. Then I reshape the dataframe into the 16 columns, rename the columns from the first row, and then eliminate the row headers within the dataframe.
df = df[:-1]
df = (pd.DataFrame(df.values.reshape(-1, 16)))
df.columns = df.iloc[0]
df = df[df.Name != 'Name']
I'm still learning python, so any input, advice, any respectful constructive criticism is always welcome. Maybe there is a better, more appropriate solution?

Extract a table without class or id

I am trying to scrape a table from http://marine-transportation.capitallink.com/indices/baltic_exchange_history.html?ticker=BDI
Although it seemed fairly easy, it is not possible for me to identify the table in such a way that I could scrape it; I'm not able to extract the data. Can anyone help with the right identification?
import urllib3
import urllib.request
from bs4 import BeautifulSoup
import pandas as pd
import requests
import csv
import re

url = 'http://marine-transportation.capitallink.com/indices/baltic_exchange_history.html?ticker=BDI'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'lxml')

date = []
closing_rate = []
# Here I need a reference to the correct table
table = soup.find()
for row in table.find_all('tr')[1:]:
    col = row.find_all('td')
    column_1 = col[0].string.strip()
    date.append(column_1)
    column_2 = col[1].string.strip()
    closing_rate.append(column_2)
columns = {'date': date, 'closing_rate': closing_rate}
df = pd.DataFrame(columns)
df.to_csv('Baltic_Dry.csv')
You could use unique style attributes to identify the table you need.
For example, on this page here, it looks like the table containing index data is 550px wide. You can use:
soup.findAll('table', width="550")
Please note: I had to use another page on the same website because the one you posted requires a login. Hopefully the page structure will be similar.
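Putting that together, a minimal sketch of the full scrape (assuming the table on your page carries the same width attribute; remember the posted URL may require a login):
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = 'http://marine-transportation.capitallink.com/indices/baltic_exchange_history.html?ticker=BDI'
soup = BeautifulSoup(requests.get(url).text, 'lxml')
# Identify the data table by its unique width attribute, as suggested above
table = soup.find('table', width='550')
date, closing_rate = [], []
for row in table.find_all('tr')[1:]:
    col = row.find_all('td')
    date.append(col[0].get_text(strip=True))
    closing_rate.append(col[1].get_text(strip=True))
df = pd.DataFrame({'date': date, 'closing_rate': closing_rate})
df.to_csv('Baltic_Dry.csv', index=False)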
