Having problems while scraping a Klaytn Scope table - python

I am having problems scraping this website, 'https://scope.klaytn.com/account/0xb5471a00bcc02ea297df2c4a4fd1d073465c662b?tabId=tokenBalance', using Python with bs4 and requests.
from bs4 import BeautifulSoup
import requests
import json
import urllib3
import pandas as pd
urllib3.disable_warnings()
I want to scrape the Token Balance table, but my request gets nothing useful back.
How can I scrape the 'Token Balance' table values?
When I use the 'find' method to look for the table values, it prints 'None'.
url = 'https://scope.klaytn.com/account/0xb5471a00bcc02ea297df2c4a4fd1d073465c662b?tabId=tokenBalance'
html = requests.get(url, verify=False).text
soup = BeautifulSoup(html, 'html.parser')
title = soup.find('span', {'class': 'ValueWithUnit__value'})
print(title)

This page is generated using the API. For example, a table is obtained from the following address: https://api-cypress-v3.scope.klaytn.com/v2/accounts/0xb5471a00bcc02ea297df2c4a4fd1d073465c662b/ftBalances?page=1
Account info: https://api-cypress-v3.scope.klaytn.com/v2/accounts/0xb5471a00bcc02ea297df2c4a4fd1d073465c662b
So you can get the table like this:
url = 'https://api-cypress-v3.scope.klaytn.com/v2/accounts/0xb5471a00bcc02ea297df2c4a4fd1d073465c662b/ftBalances?page=1'
response = requests.get(url)
data = response.json()  # parse the JSON once, then reuse it
for token in data['tokens']:
    tokenName = data['tokens'][token]['tokenName']
    tokenAmount = next(row['amount'] for row in data['result'] if row['tokenAddress'] == token)
    print(tokenName, tokenAmount)
OUTPUT:
Ironscale 128
Bloater 158
Lantern-Eye 144
Gaia's Tears 101
Redgill 11
...
Blue Egg 1
Green Egg 1
Health Vial 1
Mana Vial 1
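The payload apparently carries token metadata under 'tokens' (keyed by token address) and the balance rows under 'result'. A small, testable helper for joining the two; the key names are assumed from the answer above, and the sample payload is made up:

```python
# Sketch: join the two parts of the (undocumented) scope API payload --
# 'tokens' (metadata keyed by address) and 'result' (balance rows) --
# into a {tokenName: amount} dict. Key names assumed from the answer above.
def join_balances(payload):
    tokens = payload.get('tokens', {})
    return {
        tokens[row['tokenAddress']]['tokenName']: row['amount']
        for row in payload.get('result', [])
        if row['tokenAddress'] in tokens  # skip rows without metadata
    }

# A made-up single-row payload in the same shape:
sample = {
    'tokens': {'0xabc': {'tokenName': 'Bloater'}},
    'result': [{'tokenAddress': '0xabc', 'amount': '158'}],
}
print(join_balances(sample))  # {'Bloater': '158'}
```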

Can't scrape .aspx sites

I am trying to scrape numerous companies' sites in Python for their news releases.
I figured out I need to use chickennoodle = soup(html_text, 'lxml') instead of chickennoodle = soup(html_text, 'html.parser') for .aspx sites. But I am still only getting the basic URLs back, like their contact and careers links, instead of the actual news article links. When I inspect the website it looks something like:
<a class="module_headline-link" href="/news-and-events/news/news-details/2022/Compugen-to-Release-Second-Quarter-Results-on-Thursday-August-4-2022/default.aspx">Compugen to Release Second Quarter Results on Thursday, August 4, 2022</a>.
On the basic HTML sites it works: I can print all of my_links and filter which links I want with the commented-out lines. I have added a few examples of troubled scrapes and one working one. I assume the non-working ones all have the same problem, probably because I don't understand the intricacies of lxml; I figured the parser can't see the articles for some reason (unlike on the HTML sites) because their hrefs start with /. Thanks for any help.
COMPANY 1-
from bs4 import BeautifulSoup as soup
import requests
import pandas as pd
URL = 'https://ir.cgen.com/news-and-events/news/default.aspx'
full = ''
html_text = requests.get(URL).text
chickennoodle = soup(html_text, 'lxml')
for link in chickennoodle.find_all('a'):
    my_links = link.get('href')
    print(my_links)
    #if str(my_links).startswith("/news-and-events/news/news-details/"):
    #    print(str(full) + my_links)
    #else:
    #    None
COMPANY 2-
from bs4 import BeautifulSoup as soup
import requests
import pandas as pd
URL = 'https://www.meipharma.com/media/press-releases'
full = ''
html_text = requests.get(URL).text
chickennoodle = soup(html_text, 'html.parser')
for link in chickennoodle.find_all('a'):
    my_links = link.get('href')
    print(my_links)
    # if str(my_links).startswith(""):
    #     print(str(full) + my_links)
    # else:
    #     None
COMPANY 3-
from bs4 import BeautifulSoup as soup
import requests
import pandas as pd
URL = 'https://investor.sierraoncology.com/news-releases/default.aspx'
full = ''
html_text = requests.get(URL).text
chickennoodle = soup(html_text, 'lxml')
for link in chickennoodle.find_all('a'):
    my_links = link.get('href')
    print(my_links)
vs. an HTML site that works for my purposes:
from bs4 import BeautifulSoup as soup
import requests
import pandas as pd
URL = "https://investors.aileronrx.com/index.php/news-releases"
full = "https://investors.aileronrx.com"
ALRNlinks = []
html_text = requests.get(URL).text
chickennoodle = soup(html_text, 'html.parser')
for link in chickennoodle.find_all('a'):
    my_links = link.get('href')
    if str(my_links).startswith("/news-rele"):
        ALRN = str(full) + my_links
        ALRNlinks.append(ALRN)
print(ALRNlinks)
The website from your first example loads its information dynamically, so requests won't see the data pulled in by JavaScript after the page has loaded. You can, however, look in Dev Tools under the Network tab, see which URLs the JavaScript accesses, and try to scrape those instead. For example:
import requests
import pandas as pd
url = 'https://ir.cgen.com/feed/PressRelease.svc/GetPressReleaseList?LanguageId=1&bodyType=0&pressReleaseDateFilter=3&categoryId=1cb807d2-208f-4bc3-9133-6a9ad45ac3b0&pageSize=-1&pageNumber=0&tagList=&includeTags=true&year=2022&excludeSelection=1'
r = requests.get(url)
df = pd.DataFrame(r.json()['GetPressReleaseListResult'])
print(df)
This will print out:
Attachments Body Category DocumentFileSize DocumentFileType DocumentPath ExcludeFromLatest Headline LanguageId LinkToDetailPage ... RevisionNumber SeoName ShortBody ShortDescription Subheadline SubheadlineHtml TagsList ThumbnailPath WorkflowId PressReleaseDate
0 [] None PDF https://s26.q4cdn.com/977440944/files/doc_news... False Compugen to Release Second Quarter Results on ... 1 /news-and-events/news/news-details/2022/Compug... ... 33221 Compugen-to-Release-Second-Quarter-Results-on-... None None None [] https://s26.q4cdn.com/977440944/files/doc_news... e7b13fbb-ddc7-4955-a9c6-b44e6ab223ec 07/21/2022 07:00:00
1 [] None PDF https://s26.q4cdn.com/977440944/files/doc_news... False Compugen to Present at Upcoming Industry Confe... 1 /news-and-events/news/news-details/2022/Compug... ... 33213 Compugen-to-Present-at-Upcoming-Industry-Confe... None None None [] https://s26.q4cdn.com/977440944/files/doc_news... 1e5cb121-a9f7-4e1b-86c1-1571065d40b5 06/27/2022 07:00:00
2 [] None PDF https://s26.q4cdn.com/977440944/files/doc_news... False Compugen to Present at Upcoming Investor Confe... 1 /news-and-events/news/news-details/2022/Compug... ... 33202 Compugen-to-Present-at-Upcoming-Investor-Confe... None None None [] https://s26.q4cdn.com/977440944/files/doc_news... 8c004950-09c8-4831-bdfa-25f660afe250 06/01/2022 07:00:00
[...]
You can apply this for your other examples as well.
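If you also want the full article URLs, note that the LinkToDetailPage column in that JSON holds site-relative paths. A minimal sketch resolving them with urljoin; the sample row here is made up, in the shape shown in the output above:

```python
from urllib.parse import urljoin

# Sketch: 'LinkToDetailPage' values are site-relative paths; urljoin
# resolves them against the IR site's base URL.
def absolute_links(rows, base='https://ir.cgen.com'):
    return [urljoin(base, row['LinkToDetailPage']) for row in rows]

# Made-up row in the shape of the JSON above:
rows = [{'LinkToDetailPage': '/news-and-events/news/news-details/2022/example/default.aspx'}]
print(absolute_links(rows)[0])
# https://ir.cgen.com/news-and-events/news/news-details/2022/example/default.aspx
```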

bs4 findAll not collecting all of the data from the other pages on the website

I'm trying to scrape a real estate website using BeautifulSoup.
I'm trying to get a list of rental prices for London. This works but only for the first page on the website. There are over 150 of them so I'm missing out on a lot of data. I would like to be able to collect all the prices from all the pages. Here is the code I'm using:
import requests
from bs4 import BeautifulSoup as soup
url = 'https://www.zoopla.co.uk/to-rent/property/central-london/?beds_max=5&price_frequency=per_month&q=Central%20London&results_sort=newest_listings&search_source=home'
response = requests.get(url)
response.status_code
data = soup(response.content, 'lxml')
prices = []
for line in data.findAll('div', {'class': 'css-1e28vvi-PriceContainer e2uk8e7'}):
    price = str(line).split('>')[2].split(' ')[0].replace('£', '').replace(',', '')
    price = int(price)
    prices.append(price)
Any idea as to why I can't collect the prices from all the pages using this script?
Extra question: is there a way to access the price using soup, i.e. without doing any list/string manipulation? When I call data.find('div', {'class': 'css-1e28vvi-PriceContainer e2uk8e7'}) I get a string of the following form: <div class="css-1e28vvi-PriceContainer e2uk8e7" data-testid="listing-price"><p class="css-1o565rw-Text eczcs4p0" size="6">£3,012 pcm</p></div>
Any help would be much appreciated!
You can append &pn=<page number> parameter to the URL to get next pages:
import re
import requests
from bs4 import BeautifulSoup as soup
url = "https://www.zoopla.co.uk/to-rent/property/central-london/?beds_max=5&price_frequency=per_month&q=Central%20London&results_sort=newest_listings&search_source=home&pn="
prices = []
for page in range(1, 3):  # <-- increase number of pages here
    data = soup(requests.get(url + str(page)).content, "lxml")
    for line in data.findAll("div", {"class": "css-1e28vvi-PriceContainer e2uk8e7"}):
        price = line.get_text(strip=True)
        price = int(re.sub(r"[^\d]", "", price))
        prices.append(price)
        print(price)
print("-" * 80)
print(len(prices))
print(len(prices))
Prints:
...
1993
1993
--------------------------------------------------------------------------------
50
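As for the extra question: yes. get_text(), which the loop above already uses, reads the price straight out of the tag with no list/string slicing. A self-contained check on the sample markup quoted in the question:

```python
import re
from bs4 import BeautifulSoup

# The sample markup quoted in the question:
html = ('<div class="css-1e28vvi-PriceContainer e2uk8e7" data-testid="listing-price">'
        '<p class="css-1o565rw-Text eczcs4p0" size="6">£3,012 pcm</p></div>')

# Match on the stable data-testid attribute rather than the generated class:
tag = BeautifulSoup(html, 'html.parser').find('div', {'data-testid': 'listing-price'})
price = int(re.sub(r'[^\d]', '', tag.get_text(strip=True)))  # strips '£', ',' and ' pcm'
print(price)  # 3012
```

Matching on data-testid is also more robust than the css-… class names, which are generated and likely to change between deployments.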

BeautifulSoup: Reading Span Class Elements

I am having some issues web scraping information from a particular page's span class element, using BeautifulSoup and requests in Python. It keeps returning blank information: " ". Here's my code:
headers = {'User-Agent':'Mozilla/5.0'}
res = requests.get('https://www.theweathernetwork.com/ca/weather/ontario/toronto')
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text, 'html.parser')
weather_elem = soup.find('span', {'class':'wxcondition'})
weather = weather_elem
print(weather)
return weather
The data is loaded through JavaScript so BeautifulSoup doesn't see anything. But you can simulate the Ajax with the requests module:
import json
import requests
from bs4 import BeautifulSoup
url = 'https://www.theweathernetwork.com/ca/weather/ontario/toronto'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
place_code = soup.select_one('link[rel="alternate"]')['href'].split('=')[-1].lower()
ajax_url = 'https://weatherapi.pelmorex.com/api/v1/observation/placecode/' + place_code
data = requests.get(ajax_url).json()
# uncomment to print all data:
# print(json.dumps(data, indent=4))
print(data['observation']['weatherCode']['text'])
Prints:
Partly cloudy
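For reference, the place-code line above works because the alternate link's href ends in a query parameter holding the code. A sketch of just that step; the href below is a made-up example in the assumed shape:

```python
# Assumption: the <link rel="alternate"> href ends in '...code=<PLACECODE>',
# so splitting on '=' and lowercasing yields the API place code.
href = 'android-app://com.pelmorex.WeatherEyeAndroid/twnapp/location?code=CAON0696'
place_code = href.split('=')[-1].lower()
print(place_code)  # caon0696
ajax_url = 'https://weatherapi.pelmorex.com/api/v1/observation/placecode/' + place_code
print(ajax_url)
```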

Get a <span> value using python web scrape

I am trying to get a product price using BeautifulSoup in Python.
But I keep getting errors, no matter what I try.
[picture of the site I am trying to scrape]
I want to get the 19,90 value.
I have already done a code to get all the product names, and now need their prices.
import requests
from bs4 import BeautifulSoup
url = 'https://www.zattini.com.br/busca?nsCat=Natural&q=amaro&searchTermCapitalized=Amaro&page=1'
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')
price = soup.find('span', itemprop_='price')
print(price)
A less ideal approach is parsing out the JSON containing the prices:
import requests
import json
import pandas as pd
from bs4 import BeautifulSoup
url = 'https://www.zattini.com.br/busca?nsCat=Natural&q=amaro&searchTermCapitalized=Amaro&page=1'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'lxml')
scripts = [script.text for script in soup.select('script') if 'var freedom = freedom ||' in script.text]
pricesJson = scripts[0].split('"items":')[1].split(']')[0] + ']'
prices = [item['price'] for item in json.loads(pricesJson)]
names = [name.text for name in soup.select('#item-list [itemprop=name]')]
results = list(zip(names,prices))
df = pd.DataFrame(results)
print(df)
Sample output:
span[itemprop='price'] is generated by JavaScript. The original value is stored in div[data-final-price] with a value like 1990, and you can format it to 19,90 with a regex.
import re
...
soup = BeautifulSoup(page.text, 'html.parser')
prices = soup.select('div[data-final-price]')
for price in prices:
    price = re.sub(r'(\d\d$)', r',\1', price['data-final-price'])
    print(price)
Results:
19,90
134,89
29,90
119,90
104,90
59,90
....
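The regex step can be pulled into a small helper (same pattern as above): it inserts a comma before the final two digits of the raw data-final-price value.

```python
import re

# Insert a comma before the last two digits: '1990' -> '19,90'.
def format_price(raw):
    return re.sub(r'(\d\d$)', r',\1', raw)

print(format_price('1990'))   # 19,90
print(format_price('13489'))  # 134,89
```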

python crawling beautifulsoup how to crawl several pages?

Please help.
I want to get all the company names from each page, and there are 12 pages.
http://www.saramin.co.kr/zf_user/jobs/company-labs/list/page/1
http://www.saramin.co.kr/zf_user/jobs/company-labs/list/page/2
-- this website only changes the number.
So Here is my code so far.
Can I get just the titles (company names) from all 12 pages?
Thank you in advance.
from bs4 import BeautifulSoup
import requests
maximum = 0
page = 1
URL = 'http://www.saramin.co.kr/zf_user/jobs/company-labs/list/page/1'
response = requests.get(URL)
source = response.text
soup = BeautifulSoup(source, 'html.parser')
whole_source = ""
for page_number in range(1, maximum+1):
    URL = 'http://www.saramin.co.kr/zf_user/jobs/company-labs/list/page/' + str(page_number)
    response = requests.get(URL)
    whole_source = whole_source + response.text
soup = BeautifulSoup(whole_source, 'html.parser')
find_company = soup.select("#content > div.wrap_analysis_data > div.public_con_box.public_list_wrap > ul > li:nth-child(13) > div > strong")
for company in find_company:
    print(company.text)
So, you want to remove all the headers and get only the string of the company name?
Basically, you can use soup.findAll to find the list of companies, which come in a format like this:
<strong class="company"><span>중소기업진흥공단</span></strong>
Then you use the .find function to extract information from the <span> tag:
<span>중소기업진흥공단</span>
After that, you use .contents function to get the string from the <span> tag:
'중소기업진흥공단'
So you write a loop to do the same for each page, and make a list called company_list to store the results from each page and append them together.
Here's the code:
from bs4 import BeautifulSoup
import requests
maximum = 12
company_list = [] # List for result storing
for page_number in range(1, maximum+1):
    URL = 'http://www.saramin.co.kr/zf_user/jobs/company-labs/list/page/{}'.format(page_number)
    response = requests.get(URL)
    print(page_number)
    whole_source = response.text
    soup = BeautifulSoup(whole_source, 'html.parser')
    for entry in soup.findAll('strong', attrs={'class': 'company'}):  # Find all company names on the page
        company_list.append(entry.find('span').contents[0])  # Extract the name from the result
The company_list will give you all the company names you want
I figured it out eventually. Thank you for your answer though!
Here is my final code.
from urllib.request import urlopen
from bs4 import BeautifulSoup
company_list=[]
for n in range(12):
    url = 'http://www.saramin.co.kr/zf_user/jobs/company-labs/list/page/{}'.format(n+1)
    webpage = urlopen(url)
    source = BeautifulSoup(webpage, 'html.parser', from_encoding='utf-8')
    companys = source.findAll('strong', {'class': 'company'})
    for company in companys:
        company_list.append(company.get_text().strip().replace('\n', '').replace('\t', '').replace('\r', ''))
file = open('company_name1.txt', 'w', encoding='utf-8')
for company in company_list:
    file.write(company + '\n')
file.close()
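One small variation on the file-writing step at the end: a with block closes the file automatically, even if a write raises an exception. The list here is a stand-in for the scraped results:

```python
# Sketch: the same write using a context manager instead of an explicit
# open()/close() pair. company_list is a stand-in for the scraped names.
company_list = ['중소기업진흥공단', 'Example Co']
with open('company_name1.txt', 'w', encoding='utf-8') as f:
    for company in company_list:
        f.write(company + '\n')
```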
