Using BeautifulSoup to extract table - python

I would like to extract the table from the following URL, "https://www.nordpoolgroup.com/en/Market-data1/#/nordic/table", and eventually store it in a pandas DataFrame.
The code below:
import requests
from bs4 import BeautifulSoup

URL = "https://www.nordpoolgroup.com/en/Market-data1/#/nordic/table"
page = requests.get(URL)
soup = BeautifulSoup(page.text, 'html.parser')
print(soup.find_all('table')[0])
returns only an empty table element:
<table day-headers="true" enable-filter="false" nps-data-table="" table-data="ctrl.data[ctrl.selectedTab].table.data"></table>
I am not sure how to continue. I have been able to extract the content under the table tag on other sites using this code. Could someone please give me some advice and explain what is happening in this case?

The table is rendered client-side by JavaScript (note the Angular-style table-data="ctrl.data[...]" binding in your output), so the raw HTML contains only an empty placeholder. The data itself comes from an API endpoint that you can call directly:
from pprint import pp
import requests

def main(url):
    r = requests.get(url)
    pp(r.json())

main('https://www.nordpoolgroup.com/api/marketdata/page/11')
From here you can parse it as JSON and create your DataFrame!
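A minimal sketch of turning that JSON into a DataFrame. The key path ('data' -> 'Rows' -> 'Columns', with 'Name'/'Value' fields) is an assumption; check it against the output of pp(r.json()) and adjust:
import requests
import pandas as pd

r = requests.get('https://www.nordpoolgroup.com/api/marketdata/page/11')
payload = r.json()

# Flatten each row's column values into a plain list; the key names are assumptions.
rows = []
for row in payload['data']['Rows']:
    rows.append([row.get('Name')] + [col.get('Value') for col in row.get('Columns', [])])

df = pd.DataFrame(rows)
print(df.head())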

Related

Extracting link from soup python

I'm trying to make an app that gets the source links on Bandcamp, but I'm kind of stuck. Is there a way to get the source links with BeautifulSoup?
The link I'm trying to get is a Bandcamp album page.
The data is within the <script> tags in JSON format, so use BeautifulSoup to get the 'script'. The data you are after is in the data-tralbum attribute.
Once you get that, have json read it in, then just iterate through the JSON structure:
from bs4 import BeautifulSoup
import requests
import json

url = 'https://vine.bandcamp.com/album/another-light'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

script = str(soup.find_all('script')[4]['data-tralbum'])
jsonData = json.loads(script)

trackinfo = jsonData['trackinfo']
links = []
for each in trackinfo:
    links.append(each['file']['mp3-128'])
Output:
print(links)
['https://t4.bcbits.com/stream/efbba461835eff472bd04a2f9e9910a9/mp3-128/1761020287?p=0&ts=1638288735&t=8ae6343808036ab513cd5436ea009e5d0de784e4&token=1638288735_9139d56ec86f2d44b83a11f3eed8caf7075d6039', 'https://t4.bcbits.com/stream/3e5ef92e6d83e853958ed01955c95f5f/mp3-128/1256475880?p=0&ts=1638288735&t=745a6c701cf1c5772489da5467f6cae5d3622818&token=1638288735_7e86a32c635ba92e0b8320ef56a457d988286cff', 'https://t4.bcbits.com/stream/bbb49d4a72cb80feaf759ec7890abbb6/mp3-128/3439518541?p=0&ts=1638288735&t=dcc7ef7d1d7823e227339fb3243385089478ebe7&token=1638288735_5db36a29c58ea038828d7b34b67e13bd80597dd8', 'https://t4.bcbits.com/stream/8c8a69959337f6f4809f6491c2822b45/mp3-128/1330130896?p=0&ts=1638288735&t=d108dac84dfaac901a546c5fcf5064240cca376b&token=1638288735_8d9151aa82e7a00042025f924660dd3a093c2f74', 'https://t4.bcbits.com/stream/4d4253633405f204d7b1c101379a73be/mp3-128/2478242466?p=0&ts=1638288735&t=a8cd539d0ce8ff417f9b69740070870ed9a182a5&token=1638288735_ad8b5e93c8ffef6623615ce82a6754678fa67b67', 'https://t4.bcbits.com/stream/6c4feee38e289aea76080e9ddc997fa5/mp3-128/2243532902?p=0&ts=1638288735&t=83417c3aba0cef0f969f93bac5165e582f24a588&token=1638288735_c1d9d43b4e10cc6d02c822de90eda3a52c382df2', 'https://t4.bcbits.com/stream/a24dc5dad7b619d47b006e26084ff38f/mp3-128/3054008347?p=0&ts=1638288735&t=4563c326a272c9f5b8462fef1d082e46fac7f605&token=1638288735_55978e7edbe0410ff745913224b8740becad59d5', 'https://t4.bcbits.com/stream/6221790d7f55d3b1f006bd5fac5458fe/mp3-128/1500140939?p=0&ts=1638288735&t=9ecc210c53af05f4034ee00cd1a96a043312a4a7&token=1638288735_0f2faba41da8952f841669513d04bdaaae35a629', 'https://t4.bcbits.com/stream/030506909569626a0d2d7d182b61c691/mp3-128/1707615013?p=0&ts=1638288735&t=c8dcbb2c491789928f5cb6ef8b755df999cb58b8&token=1638288735_b278ba825129ae1b5588b47d5cda345ef2db4e58', 'https://t4.bcbits.com/stream/d1ae0cbc281fc81ddd91f3a3e3d80973/mp3-128/2808772965?p=0&ts=1638288735&t=1080ff51fc40bb5b7afb3a2460f3209cbda549e3&token=1638288735_c93249c847acba5cf23521fa745e05b426a5ba05', 'https://t4.bcbits.com/stream/1b9d50f8210bdc3cf4d2e33986f319ae/mp3-128/2751220220?p=0&ts=1638288735&t=9f24f06dfc5c8a06f24f28664438a6f1a75a038c&token=1638288735_f3a98a20b3c344dc5a37a602a41572d5fe8539c1', 'https://t4.bcbits.com/stream/203cd15629ba03e3249f850d5e1ac42e/mp3-128/4188265472?p=0&ts=1638288735&t=4b4bc2f2194c63a1d3b957e3dd6046bd764c272a&token=1638288735_53a70e7d83ce8c2800baeaf92a5c19db4e146e3f', 'https://t4.bcbits.com/stream/c63b5c9ca090b233e675974c7e7ee4b2/mp3-128/258670123?p=0&ts=1638288735&t=a81ae9dc33dea2b2660d13dbbec93dbcb06e6b63&token=1638288735_446d0ae442cbbadbceb342fe4f7b69d0fbab2928', 'https://t4.bcbits.com/stream/2e824d3c643658c8e9e24b548bc8cb0b/mp3-128/2332945345?p=0&ts=1638288735&t=5bdf0264b9ffe4616d920c55f5081744bf0822d4&token=1638288735_872191bb67a3438ef0fd1ce7e8a9e5ca09e6c37e']
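One note on the answer above: indexing the script tags by position ([4]) is brittle, since any change in the page layout shifts the index. A slightly more robust variant of the same idea is to match on the attribute itself:
# Select the script tag by the presence of its data-tralbum attribute
# rather than by its position in the document.
script_tag = soup.find('script', attrs={'data-tralbum': True})
jsonData = json.loads(script_tag['data-tralbum'])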

How to get the results of multiple iterations in one dataframe when crawling with beautifulsoup?

Wondering if you could help please. Python newbie here.
I am crawling multiple URLs in one script, but it returns 3 dataframes, one for each iteration.
What I am looking to achieve is one single dataframe that contains the URLs from each iteration.
What I have so far is the following:
import requests
from bs4 import BeautifulSoup
import pandas as pd
def get_links(url):
    links = []
    website = requests.get(url)
    website_text = website.text
    soup = BeautifulSoup(website_text, 'html.parser')
    for link in soup.find_all('a'):
        links.append(link.get('href'))
    df = pd.DataFrame(links)
    display(df)

get_links('https://www.example.com/')
get_links('https://www.example2.com/')
get_links('https://www.example3.com/')
Thank you for your help
You should let the function return the DataFrame instead of only displaying it, and then combine the outputs with pandas.concat:
https://pandas.pydata.org/docs/reference/api/pandas.concat.html
df = pd.concat([df1, df2, df3])
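A minimal sketch of that change, reusing the function from the question:
import requests
import pandas as pd
from bs4 import BeautifulSoup

def get_links(url):
    soup = BeautifulSoup(requests.get(url).text, 'html.parser')
    links = [a.get('href') for a in soup.find_all('a')]
    return pd.DataFrame(links, columns=['href'])  # return instead of display

# Build one DataFrame per URL, then combine them into a single frame.
urls = ['https://www.example.com/', 'https://www.example2.com/', 'https://www.example3.com/']
df = pd.concat([get_links(u) for u in urls], ignore_index=True)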
from bs4 import BeautifulSoup, SoupStrainer
import requests

# Reason behind using SoupStrainer is to increase performance by parsing only 'a' tags.
# https://www.crummy.com/software/BeautifulSoup/bs4/doc/#soupstrainer
urls = ['Link1', 'Link2', 'Link3']

def get_links():
    for url in urls:
        r = requests.get(url)
        links = BeautifulSoup(r.text, 'lxml', parse_only=SoupStrainer('a'))
        for link in links:
            print(link.get('href', 'N/A'))

get_links()
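To tie this back to the goal of a single DataFrame, the same loop can collect rows instead of printing them; a sketch using the placeholder URLs above:
import pandas as pd

def get_links_df():
    rows = []
    for url in urls:
        r = requests.get(url)
        links = BeautifulSoup(r.text, 'lxml', parse_only=SoupStrainer('a'))
        for link in links:
            rows.append({'url': url, 'href': link.get('href', 'N/A')})
    return pd.DataFrame(rows)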

BeautifulSoup to CSV

I have set up BeautifulSoup to find a specific class on two webpages.
I would like to know how to write each URL's result to a unique cell in one CSV.
Also, is there a limit to the number of URLs I can read? I would like to expand this to about 200 URLs once I get it working.
The class is always the same, and I don't need any formatting, just the raw HTML in one cell per URL.
Thanks for any ideas.
from bs4 import BeautifulSoup
import requests
urls = ['https://www.ozbargain.com.au/','https://www.ozbargain.com.au/forum']
for u in urls:
    response = requests.get(u)
    data = response.text
    soup = BeautifulSoup(data, 'lxml')
    soup.find('div', class_="block")
Use pandas to work with tabular data: pd.DataFrame to create the table, and DataFrame.to_csv to save it as CSV (you might also check the documentation for append mode, for example).
That's basically it:
import requests
import pandas as pd
from bs4 import BeautifulSoup
def func(urls):
    for url in urls:
        data = requests.get(url).text
        soup = BeautifulSoup(data, 'lxml')
        yield {
            "url": url, "raw_html": soup.find('div', class_="block")
        }

urls = ['https://www.ozbargain.com.au/','https://www.ozbargain.com.au/forum']

data = func(urls)
table = pd.DataFrame(data)
table.to_csv("output.csv", index=False)
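If you later scale this to the ~200 URLs mentioned in the question, the append mode referenced above lets you write in batches; pass header=False on subsequent writes so the header row is not repeated. The extra URL here is only a hypothetical example:
# Append a later batch to the same CSV, skipping the duplicate header.
more = pd.DataFrame(func(['https://www.ozbargain.com.au/deals']))  # hypothetical URL
more.to_csv("output.csv", mode="a", header=False, index=False)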

How to extract data from table using Beautifulsoup, with no text

I'm trying to extract the leaderboard on https://fortnitetracker.com/events/epicgames_S11_PlatformCup_PC_NAE.
However, the data inside the table is not extractable as 'text' elements, and I do not know how to extract the player names.
from bs4 import BeautifulSoup
import requests

url = "https://fortnitetracker.com/events/epicgames_S11_PlatformCup_PC_NAE"
response = requests.get(url)
print(response)
soup = BeautifulSoup(response.text, "html.parser")

table_needed = soup.find_all('table', attrs={'class': 'trn-table'})
table_needed = table_needed[0]
for i in table_needed:
    print(i.find('div'))
This gives me the output below. What can I do from here?
'{{ getPlayerNameList(entry.teamAccountIds, 4) }}'
I see there's a getPlayerNameList...
The player names are dynamically injected into the table with JavaScript (Vue), so I don't think you can scrape them with BeautifulSoup alone; the '{{ ... }}' you printed is the unrendered template. I would suggest you try a tool like Selenium, which runs a real browser and sees the rendered page.
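A minimal Selenium sketch of that suggestion. The trn-table class comes from the question; the wait condition on populated rows is an assumption to adapt after inspecting the rendered page:
# Render the page in a real browser so the Vue templates are filled in,
# then hand the final HTML back to BeautifulSoup.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup

driver = webdriver.Chrome()
driver.get("https://fortnitetracker.com/events/epicgames_S11_PlatformCup_PC_NAE")

# Wait until JavaScript has injected actual rows into the leaderboard table.
WebDriverWait(driver, 15).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, "table.trn-table tbody tr"))
)

soup = BeautifulSoup(driver.page_source, "html.parser")
driver.quit()

for row in soup.select("table.trn-table tr"):
    print(row.get_text(" ", strip=True))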

Download table from wunderground with beautiful soup

I would like to download the Weather History & Observations table from the following link:
https://www.wunderground.com/history/airport/HDY/2011/1/1/CustomHistory.html?dayend=31&monthend=12&yearend=2011&req_city=&req_state=&req_statename=&reqdb.zip=&reqdb.magic=&reqdb.wmo=
This is the code I have so far:
import pandas as pd
from bs4 import BeautifulSoup
import requests
link = 'https://www.wunderground.com/history/airport/HDY/2011/1/1/CustomHistory.html?dayend=31&monthend=12&yearend=2011&req_city=&req_state=&req_statename=&reqdb.zip=&reqdb.magic=&reqdb.wmo='
resp = requests.get(link)
c = resp.text
soup = BeautifulSoup(c, 'html.parser')
I would like to know the next step to access the table info at the bottom of the page (assuming this website's format allows it).
Thank you
You can use find to locate the table and find_all to collect its rows:
table = soup.find('table', class_="responsive obs-table daily")
rows = table.find_all('tr')
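From there, a minimal sketch of pulling the cell text into a pandas DataFrame; treating the first row as the header is an assumption to check against the real page:
import pandas as pd

# Collect the text of every cell, row by row; 'th' and 'td' cover header and data cells.
parsed = []
for tr in rows:
    cells = [cell.get_text(strip=True) for cell in tr.find_all(['th', 'td'])]
    if cells:
        parsed.append(cells)

df = pd.DataFrame(parsed[1:], columns=parsed[0])
print(df.head())
As an alternative, pandas.read_html(link) parses every table on a page straight into a list of DataFrames, which can save the manual row walking entirely.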
