I'm pretty new to Python, so I would like some guidance. I want to pull the "Name, Protocol, APY, TVL" data from https://coindix.com/?sort=-tvl by scraping (as I believe there is no API), but I'm having some issues. When I execute the code below:
import requests
from bs4 import BeautifulSoup
url = "https://coindix.com/?sort=-tvl"
r = requests.get(url)
soup = BeautifulSoup(r.content, "html.parser")
the returned data does not include the information I'm after. Could someone please help?
There's an API. Find the URL in Dev Tools -> Network -> XHR -> Headers.
import requests
import pandas as pd
url = 'https://apiv2.coindix.com/search'
payload = {
'sort': '-tvl',
'first': 'true',
'screen': '1114'}
data = requests.get(url, params=payload).json()['data']
df = pd.DataFrame(data)
Output:
print(df.head(5).to_string())
id name icon chain protocol base reward rewards apy apy_7_day tvl risk link is_new
0 17419 UST https://apiv2.coindix.com/icons/UST.png Terra Anchor 0.193600 0.0000 {} 0.193600 0.19570 5977961341 2 https://apiv2.coindix.com/vault/17419/redirect False
1 17206 DAI-USDC-USDT https://apiv2.coindix.com/icons/DAI-USDC-USDT.png Ethereum Curve 0.002800 0.0087 {'CRV': 0.0087} 0.011500 0.01210 5952854016 1 https://apiv2.coindix.com/vault/17206/redirect False
2 17174 LUNA https://apiv2.coindix.com/icons/LUNA.png Terra Lido 0.079000 0.0000 {} 0.079000 0.07900 5534798290 1 https://apiv2.coindix.com/vault/17174/redirect False
3 15940 ETH https://apiv2.coindix.com/icons/ETH.png Ethereum Lido 0.047000 0.0000 {} 0.047000 0.04700 5347746431 1 https://apiv2.coindix.com/vault/15940/redirect False
4 13517 cUSD-cEUR https://apiv2.coindix.com/icons/cUSD-cEUR.png Celo Sushi 0.002466 0.0000 {} 0.002466 0.01058 4609514119 2 https://apiv2.coindix.com/vault/13517/redirect False
[100 rows x 14 columns]
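Since you only asked for name, protocol, APY and TVL, you can slice just those columns out of the DataFrame (a minimal sketch; the column names are taken from the output above):
# Keep only the columns from the question.
subset = df[['name', 'protocol', 'apy', 'tvl']]
print(subset.head())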
I don't usually work with BeautifulSoup in Python, so I am struggling to find the value 8.133,00 that matches the Ibex 35 on this web page: https://es.investing.com/indices/indices-futures
So far I am getting all the info on the page, but I can't filter it to get that value:
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup

site = 'https://es.investing.com/indices/indices-futures'
hardware = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:106.0) Gecko/20100101 Firefox/106.0'}
request = Request(site, headers=hardware)
page = urlopen(request)
soup = BeautifulSoup(page, 'html.parser')
print(soup)
I'd appreciate a hand getting that value.
Regards
Here is a way of getting that bit of information: a dataframe with all the info in the table containing IBEX 35, DAX, and so on. You can then slice that dataframe as you wish.
import pandas as pd
from bs4 import BeautifulSoup as bs
import cloudscraper
scraper = cloudscraper.create_scraper(disableCloudflareV1=True)
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)
url = 'https://es.investing.com/indices/indices-futures'
r = scraper.get(url)
soup = bs(r.text, 'html.parser')
table = soup.select_one('table[class="datatable_table__D_jso quotes-box_table__nndS2 datatable_table--mobile-basic__W2ilt"]')
df = pd.read_html(str(table))[0]
print(df)
Result in terminal:
0 1 2 3 4
0 IBEX 35derived 8.098,10 -3510 -0,43% NaN
1 US 500derived 3.991,90 355 +0,90% NaN
2 US Tech 100derived 11.802,20 1962 +1,69% NaN
3 Dow Jones 33.747,86 3249 +0,10% NaN
4 DAXderived 14.224,86 7877 +0,56% NaN
5 Índice dólarderived 106255 -1837 -1,70% NaN
6 Índice euroderived 11404 89 +0,79% NaN
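To pull out just the IBEX 35 value from that dataframe, a small sketch based on the output above (the table has no header row, so pandas numbers the columns, and the site uses European number formatting):
ibex = df[df[0].str.startswith('IBEX 35')]
value = ibex.iloc[0, 1]  # e.g. '8.098,10'
# Convert the European-formatted string to a float if needed.
print(float(value.replace('.', '').replace(',', '.')))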
See https://pypi.org/project/cloudscraper/
I am writing a small program to fetch stock exchange data using Python. The sample code below makes a request to a URL and it should return the appropriate data. Here is the resource that I am using:
https://python.plainenglish.io/4-python-libraries-to-help-you-make-money-from-webscraping-57ba6d8ce56d
from xml.dom.minidom import Element
from selenium import webdriver
from bs4 import BeautifulSoup
import logging
from selenium.webdriver.common.by import By
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
url = "http://eoddata.com/stocklist/NASDAQ/A.htm"
driver = webdriver.Chrome(executable_path="C:\Program Files\Chrome\chromedriver")
page = driver.get(url)
# TODO: find element by CSS selector
stock_symbol = driver.find_elements(by=By.CSS_SELECTOR, value='#ctl00_cph1_divSymbols')
soup = BeautifulSoup(driver.page_source, features="html.parser")
elements = []
table = soup.find('div', {'id','ct100_cph1_divSymbols'})
logging.info(f"{table}")
I've added a TODO for getting the element that I am trying to retrieve.
Expected:
The proper data should be returned.
Actual:
Nothing is returned.
The most common practice for scraping tables is pandas.read_html(), which extracts the text for you, so I would recommend that as well.
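For instance, a minimal sketch (fetching the page with requests first and then letting pandas parse it; you would inspect the returned list to pick the right table):
import pandas as pd
import requests

res = requests.get('https://eoddata.com/stocklist/NASDAQ/A.htm')
# read_html parses every <table> in the markup into a list of DataFrames.
tables = pd.read_html(res.text)
df = max(tables, key=len)  # assumption: the symbols table is the largest on the page
print(df.head())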
But to answer your question and follow your approach, select the <div> and its <table> more specifically:
soup.select('#ctl00_cph1_divSymbols table')
To get and store the data, you could iterate over the rows and append the results to a list:
data = []
for row in soup.select('#ctl00_cph1_divSymbols table tr:has(td)'):
    d = dict(zip(soup.select_one('#ctl00_cph1_divSymbols table tr:has(th)').stripped_strings, row.stripped_strings))
    d.update({'url': 'https://eoddata.com' + row.a.get('href')})
    data.append(d)
Example
from bs4 import BeautifulSoup
import requests
import pandas as pd
url = "https://eoddata.com/stocklist/NASDAQ/A.htm"
res = requests.get(url)
soup = BeautifulSoup(res.text, 'html.parser')
data = []
for row in soup.select('#ctl00_cph1_divSymbols table tr:has(td)'):
    d = dict(zip(soup.select_one('#ctl00_cph1_divSymbols table tr:has(th)').stripped_strings, row.stripped_strings))
    d.update({'url': 'https://eoddata.com' + row.a.get('href')})
    data.append(d)
pd.DataFrame(data)
Output:
   Code                             Name    High     Low   Close      Volume   Change                                              url
0  AACG         Ata Creativity Global ADR   1.390   1.360   1.380       8,900        0   https://eoddata.com/stockquote/NASDAQ/AACG.htm
1  AACI         Armada Acquisition Corp I   9.895   9.880   9.880       5,400   -0.001   https://eoddata.com/stockquote/NASDAQ/AACI.htm
2  AACIU        Armada Acquisition Corp I   9.960   9.960   9.960         300    -0.01   https://eoddata.com/stockquote/NASDAQ/AACIU.htm
3  AACIW     Armada Acquisition Corp I WT  0.1900  0.1699  0.1700      36,400  -0.0193   https://eoddata.com/stockquote/NASDAQ/AACIW.htm
4  AADI              Aadi Biosciences Inc   13.40   12.66   12.90      98,500    -0.05   https://eoddata.com/stockquote/NASDAQ/AADI.htm
5  AADR   Advisorshares Dorsey Wright ETF   47.49   46.82   47.49       1,100      0.3   https://eoddata.com/stockquote/NASDAQ/AADR.htm
6  AAL               American Airlines Gp   14.44   13.70   14.31  45,193,100    -0.46   https://eoddata.com/stockquote/NASDAQ/AAL.htm
...
I am trying to scrape this table https://momentranks.com/topshot/account/mariodustice?limit=250
I have tried this:
import requests
from bs4 import BeautifulSoup
url = 'https://momentranks.com/topshot/account/mariodustice?limit=250'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'lxml')
table = soup.find_all('table', attrs={'class':'Table_tr__1JI4P'})
But it returns an empty list. Can someone give advice on how to approach this?
Selenium is a bit overkill when there is an available API. Just get the data directly:
import requests
import pandas as pd
url = 'https://momentranks.com/api/account/details'
rows = []
page = 0
while True:
    payload = {
        'filters': {'page': '%s' % page, 'limit': "250", 'type': "moments"},
        'flowAddress': "f64f1763e61e4087"}
    jsonData = requests.post(url, json=payload).json()
    data = jsonData['data']
    rows += data
    print('%s of %s' % (len(rows), jsonData['totalCount']))
    if len(rows) == jsonData['totalCount']:
        break
    page += 1
df = pd.DataFrame(rows)
Output:
print(df)
_id flowId ... challenges priceFloor
0 619d2f82fda908ecbe74b607 24001245 ... NaN NaN
1 61ba30837c1f070eadc0f8e4 25651781 ... NaN NaN
2 618d87b290209c5a51128516 21958292 ... NaN NaN
3 61aea763fda908ecbe9e8fbf 25201655 ... NaN NaN
4 60c38188e245f89daf7c4383 15153366 ... NaN NaN
... ... ... ... ...
1787 61d0a2c37c1f070ead6b10a8 27014524 ... NaN NaN
1788 61d0a2c37c1f070ead6b10a8 27025557 ... NaN NaN
1789 61e9fafcd8acfcf57792dc5d 28711771 ... NaN NaN
1790 61ef40fcd8acfcf577273709 28723650 ... NaN NaN
1791 616a6dcb14bfee6c9aba30f9 18394076 ... NaN NaN
[1792 rows x 40 columns]
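From there you can persist the frame or inspect specific columns as usual (a sketch; flowId and priceFloor are columns visible in the output above):
# Save the full result, or look at just a couple of columns.
df.to_csv('moments.csv', index=False)
print(df[['flowId', 'priceFloor']].head())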
The data is injected into the page by JavaScript, so you can't use requests alone; however, you can use Selenium.
Keep in mind that Selenium's driver.get doesn't wait for the page to completely load, which means you need to wait yourself.
Here is something to get you started with Selenium:
import time
from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Chrome()
url = 'https://momentranks.com/topshot/account/mariodustice?limit=250'
driver.get(url)
time.sleep(5)  # adjust this delay depending on your case (in seconds)
soup = BeautifulSoup(driver.page_source, 'lxml')
table = soup.find_all('table', attrs={'class': 'Table_tr__1JI4P'})
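A more robust alternative to a fixed sleep is an explicit wait that blocks until the content is actually in the DOM; a sketch using Selenium's built-in WebDriverWait (the generic 'table' selector is an assumption):
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get('https://momentranks.com/topshot/account/mariodustice?limit=250')
# Wait up to 15 seconds for at least one table element to appear.
WebDriverWait(driver, 15).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, 'table')))
html = driver.page_source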
The source HTML you see in your browser is rendered using JavaScript. When you use requests, that rendering never happens, which is why your script is not working: if you print the HTML that is returned, it will not contain the information you wanted.
All of the information is, however, available via the API that your browser calls to build the page. You will need to take a detailed look at the JSON data structure returned to decide which information you wish to extract.
The following example shows how to get a list of the names and MRvalue of each player:
import requests
from bs4 import BeautifulSoup
import json
s = requests.Session()
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'}
url = 'https://momentranks.com/topshot/account/mariodustice?limit=250'
req_main = s.get(url, headers=headers)
soup = BeautifulSoup(req_main.content, 'lxml')
data = soup.find('script', id='__NEXT_DATA__')
json_data = json.loads(data.string)
account = json_data['props']['pageProps']['account']['flowAddress']
post_data = {"flowAddress" : account,"filters" : {"page" : 0, "limit":"250", "type":"moments"}}
req_json = s.post('https://momentranks.com/api/account/details', headers=headers, json=post_data)
player_data = req_json.json()
for player in player_data['data']:
name = player['moment']['playerName']
mrvalue = player['MRvalue']
print(f"{name:30} ${mrvalue:.02f}")
Giving you output starting:
Scottie Barnes $672.38
Cade Cunningham $549.00
Josh Giddey $527.11
Franz Wagner $439.26
Trae Young $429.51
A'ja Wilson $387.07
Ja Morant $386.00
The flowAddress is needed from the first page request to allow the API to be used correctly. This happens to be embedded in a <script> section at the bottom of the HTML.
All of this was worked out by using the browser's network tools to watch how the actual webpage made requests to the server to build its page.
import requests
from bs4 import BeautifulSoup
import pandas as pd
headers= {'User-Agent': 'Mozilla/5.0'}
# put all items in this array
response = requests.get('http://smartcatalog.emo-milano.com/it/espositore/a-mannesmann-maschinenfabrik-gmbh')
soup = BeautifulSoup(response.content, 'html.parser')
table=soup.find_all('table', class_='expo-table general-color')
for row in table:
    for up in row.find_all('td'):
        text_list = [text for text in up.stripped_strings]
        print(text_list)
This code works and gets me the correct data, but not in the format I want. I'd like the output in the format shown below; can you help me?
Indirizzo Bliedinghauserstrasse 27
Città Remscheid
Nazionalità Germania
Sito web www.amannesmann.de
Stand Pad. 3 E14 F11
Telefono +492191989-0
Fax +492191989-201
E-mail sales@mannesmann.de
Membro di Cecimo
Social
pandas has a built-in HTML table scraper, so you can run:
df = pd.read_html('http://smartcatalog.emo-milano.com/it/espositore/a-mannesmann-maschinenfabrik-gmbh')
This returns a list of all the tables on the page as dataframes; you can access your data with df[0]:
             0                         1
0    Indirizzo  Bliedinghauserstrasse 27
1        Città                 Remscheid
2  Nazionalità                  Germania
3     Sito web        www.amannesmann.de
4        Stand            Pad. 3 E14 F11
5     Telefono              +492191989-0
6          Fax            +492191989-201
7       E-mail       sales@mannesmann.de
8    Membro di                       nan
9       Social                       nan
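If you would rather have the table as a label-to-value mapping, a small sketch built on that same result (columns 0 and 1 as in the output above):
table = df[0]
# Map the label column to the value column, e.g. info['Telefono'] -> '+492191989-0'
info = dict(zip(table[0], table[1]))
print(info['Telefono'])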
You can use the .get_text() method to extract the text, passing strip=True to avoid extra whitespace and separator to put a space between the pieces:
table = soup.find('table', class_='expo-table general-color')  # the single <table> element
data = table.find_all('tr')
for i in data:
    print(i.get_text(strip=True, separator=' '))
Output:
Indirizzo Bliedinghauserstrasse 27
Città Remscheid
...
Instead of selecting <td>, select <tr> and use .stripped_strings on it to get the row-wise data, then append that to a DataFrame.
Here is the code:
import requests
from bs4 import BeautifulSoup
import pandas as pd
headers= {'User-Agent': 'Mozilla/5.0'}
# put all items in this list
temp = []
response = requests.get('http://smartcatalog.emo-milano.com/it/espositore/a-mannesmann-maschinenfabrik-gmbh')
soup = BeautifulSoup(response.content, 'html.parser')
table=soup.find_all('table', class_='expo-table general-color')
for row in table:
    for up in row.find_all('tr'):
        temp.append([text for text in up.stripped_strings])
df = pd.DataFrame(temp)
print(df)
0 1
0 Indirizzo Bliedinghauserstrasse 27
1 Città Remscheid
2 Nazionalità Germania
3 Sito web www.amannesmann.de
4 Stand Pad. 3 E14 F11
5 Telefono +492191989-0
6 Fax +492191989-201
7 E-mail sales@mannesmann.de
8 Membro di None
9 Social None
I want to get all the products on this page:
nike.com.br/snkrs#estoque
My python code is this:
import requests
from bs4 import BeautifulSoup as bs4

produtos = []

def aviso():
    print("Started!")
    request = requests.get("https://www.nike.com.br/snkrs#estoque")
    soup = bs4(request.text, "html.parser")
    links = soup.find_all("a", class_="btn", text="Comprar")
    links_filtred = list(set(links))
    for link in links_filtred:
        if link["href"] not in produtos:
            request = requests.get(link["href"])
            soup = bs4(request.text, "html.parser")
            produto = soup.find("div", class_="nome-preco-produto").get_text()
            if produto == "":
                produto = "\u200b"
            print(f"Nome: {produto} Link: {link['href']}\n")
            produtos.append(link["href"])

aviso()
This code gets products from the page, but not all of them. I suspect the content is dynamic. How can I get them all with requests and BeautifulSoup? I don't want to use Selenium or an automation library, and I'd rather not change my code much because it's almost done.
Do NOT keep calling requests.get when you are dealing with the same host; use a requests.Session instead, which reuses the underlying TCP connection (see the requests documentation on Session objects).
import requests
from bs4 import BeautifulSoup
import pandas as pd
def main(url):
    allin = []
    with requests.Session() as req:
        for page in range(1, 6):
            params = {
                'p': page,
                'demanda': 'true'
            }
            r = req.get(url, params=params)
            soup = BeautifulSoup(r.text, 'lxml')
            goal = [(x.find_next('h2').get_text(strip=True, separator=" "), x['href'])
                    for x in soup.select('.aspect-radio-box')]
            allin.extend(goal)
        df = pd.DataFrame(allin, columns=['Title', 'Url'])
        print(df)

main('https://www.nike.com.br/Snkrs/Feed')
main('https://www.nike.com.br/Snkrs/Feed')
Output:
Title Url
0 Dunk High x Fragment design Black https://www.nike.com.br/dunk-high-x-fragment-d...
1 Dunk Low Infantil (16-26) City Market https://www.nike.com.br/dunk-low-infantil-16-2...
2 ISPA Flow 2020 Desert Sand https://www.nike.com.br/ispa-flow-2020-153-169...
3 ISPA Flow 2020 Pure Platinum https://www.nike.com.br/ispa-flow-2020-153-169...
4 Nike iSPA Men's Lightweight Packable Jacket https://www.nike.com.br/nike-ispa-153-169-211-...
.. ... ...
115 Air Jordan 1 Mid Hyper Royal https://www.nike.com.br/air-jordan-1-mid-153-1...
116 Dunk High Orange Blaze https://www.nike.com.br/dunk-high-153-169-211-...
117 Air Jordan 5 Stealth https://www.nike.com.br/air-jordan-5-153-169-2...
118 Air Jordan 3 Midnight Navy https://www.nike.com.br/air-jordan-3-153-169-2...
119 Air Max 90 Bacon https://www.nike.com.br/air-max-90-153-169-211...
[120 rows x 2 columns]
To get the data, you can send a request to:
https://www.nike.com.br/Snkrs/Estoque?p=<PAGE>&demanda=true
providing a page number between 1 and 5 as the p= parameter in the URL.
For example, to print the links, you can try:
import requests
from bs4 import BeautifulSoup
url = "https://www.nike.com.br/Snkrs/Estoque?p={page}&demanda=true"
for page in range(1, 6):
    response = requests.get(url.format(page=page))
    soup = BeautifulSoup(response.content, "html.parser")
    print(soup.find_all("a", class_="btn", text="Comprar"))
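If you want the links themselves rather than the printed tags, the same loop can collect the href attributes instead (a sketch using the selector from the answer above):
import requests
from bs4 import BeautifulSoup

url = "https://www.nike.com.br/Snkrs/Estoque?p={page}&demanda=true"
links = []
for page in range(1, 6):
    response = requests.get(url.format(page=page))
    soup = BeautifulSoup(response.content, "html.parser")
    # Collect the href of every "Comprar" button on this page.
    links += [a.get("href") for a in soup.find_all("a", class_="btn", text="Comprar")]
print(len(links), "links collected")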