I can't figure out how to get the text and numbers from this tag: <td>THERE IS TEXT I WANT TO GET</td>. There is also a "Quantity" value in a <td>QUANTITY</td> tag.
Link: https://bscscan.com/tokenholdings?a=0x00a2c3d755c21bc837a3ca9a32279275eae9e3d6
Here is an image of what I want to get.
Thanks in advance.
The table on the website is loaded dynamically, so you can't scrape it with requests; you have to use Selenium. Here is the full code:
from bs4 import BeautifulSoup
from selenium import webdriver
import time
import pandas as pd

url = 'https://bscscan.com/tokenholdings?a=0x00a2c3d755c21bc837a3ca9a32279275eae9e3d6'

driver = webdriver.Chrome()
driver.get(url)
time.sleep(5)  # give the dynamically loaded table time to render
html = driver.page_source
driver.close()

soup = BeautifulSoup(html, 'html5lib')
tbody = soup.find('tbody', id="tb1")
tr_tags = tbody.find_all('tr')

symbols = []
quantities = []
for tr in tr_tags:
    td_tags = tr.find_all('td')
    symbols.append(td_tags[2].text)     # third cell holds the token symbol
    quantities.append(td_tags[3].text)  # fourth cell holds the quantity

df = pd.DataFrame((symbols, quantities))
df = df.T
df.columns = ['Symbol', 'Quantity']
print(df)
Output:
Symbol Quantity
0 BNB 17.98420742
1 Cake 19.76899295
2 ANY 1
3 FREE 1,502
4 LFI 326.87340092
5 LFI 326.87340092
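As a side note, the fixed time.sleep(5) can be flaky on slow connections. A sketch of a more robust variant using an explicit wait (assuming the same tb1 id for the table body):
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get('https://bscscan.com/tokenholdings?a=0x00a2c3d755c21bc837a3ca9a32279275eae9e3d6')
# Block until the dynamically loaded tbody appears, up to 15 seconds
WebDriverWait(driver, 15).until(EC.presence_of_element_located((By.ID, 'tb1')))
html = driver.page_source
driver.quit()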
I recommend the built-in re module; you can extract the string between two substrings, e.g.
import re
s = '<td>THERE IS TEXT I WANT TO GET</td>'
result = re.search('<td>(.*)</td>', s)
print(result.group(1))
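If there are several <td> tags in the same string, a non-greedy pattern with re.findall collects them all; a quick sketch:
import re

html = '<td>FIRST</td><td>SECOND</td>'
# Non-greedy (.*?) so each match stops at the nearest closing tag
matches = re.findall('<td>(.*?)</td>', html)
print(matches)  # ['FIRST', 'SECOND']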
>>> html = "<td>THERE IS TEXT I WANT TO GET</td>\n<td>THERE IS TEXT I WANT TO GET</td>\n<td>THERE IS TEXT I WANT TO GET</td>\n<td>THERE IS TEXT I WANT TO GET</td>"
>>> soup = BeautifulSoup(html, 'html.parser')
>>> for td in soup.find_all('td'): print(td.text)
THERE IS TEXT I WANT TO GET
THERE IS TEXT I WANT TO GET
THERE IS TEXT I WANT TO GET
THERE IS TEXT I WANT TO GET
I am using BeautifulSoup to scrape a website, but I need help with this as I am new to Python and BeautifulSoup.
How do I get VET out of the following?
"[[VET]]"
This is my code so far:
import bs4 as bs
import urllib.request
import pandas as pd
#This is the Home page of the website
source = urllib.request.urlopen('file:///C:/Users/Aiden/Downloads/stocks/Stock%20Premarket%20Trading%20Activity%20_%20Biggest%20Movers%20Before%20the%20Market%20Opens.html').read().decode('utf-8')
soup = bs.BeautifulSoup(source,'lxml')
#find the Div and put all info into varTable
table = soup.find('table',{"id":"decliners_tbl"}).tbody
#find all Rows in table and puts into varTableRows
tableRows = table.find_all('tr')
print ("There is ",len(tableRows),"Rows in the Table")
print(tableRows)
columns = [tableRows[1].find_all('td')]
print(columns)
a = [tableRows[1].find_all("a")]
print(a)
So my output from print(a) is "[[<a class="mplink popup_link" href="https://marketchameleon.com/Overview/VET/">VET</a>]]", and I want to extract VET out of it.
You can use a.text or a.get_text().
If you have multiple elements, you'd need a list comprehension to apply this to each one.
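A minimal sketch of both cases (the HTML snippet here is made up for illustration):
from bs4 import BeautifulSoup

html = '<a href="/Overview/VET/">VET</a><a href="/Overview/ABC/">ABC</a>'
soup = BeautifulSoup(html, 'html.parser')

# A single element: .text (or .get_text()) returns just the inner text
print(soup.find('a').text)  # VET

# Multiple elements: a list comprehension applies it to each one
print([a.get_text() for a in soup.find_all('a')])  # ['VET', 'ABC']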
Thank you for all the replies; I was able to work it out using the following code:
source = urllib.request.urlopen('file:///C:/Users/Aiden/Downloads/stocks/Stock%20Premarket%20Trading%20Activity%20_%20Biggest%20Movers%20Before%20the%20Market%20Opens.html').read().decode('utf-8')
soup = bs.BeautifulSoup(source, 'html.parser')
table = soup.find("table", id="decliners_tbl")
for decliners in table.find_all("tbody"):
    rows = decliners.find_all("tr")
    for row in rows:
        ticker = row.find("a").text
        volume = row.findAll("td", class_="rightcell")[3].text
        print(ticker, volume)
I am currently running the following Python script:
import requests
from bs4 import BeautifulSoup
origin = ["USD","GBP","EUR"]
i = 0
while i < len(origin):
    page = requests.get("https://www.x-rates.com/table/?from="+origin[i]+"&amount=1")
    soup = BeautifulSoup(page.content, "html.parser")
    tables = soup.findChildren('table')
    my_table = tables[0]
    rows = my_table.findChildren(['td'])
    i = i + 1
    for rows in rows:
        cells = rows.findChildren('a')
        for cell in cells:
            value = cell.string
            print(value)
To scrape data from the HTML shown in this screenshot:
https://i.stack.imgur.com/DkX83.png
The problem I have is that I'm struggling to scrape only the first column without also scraping the second one, because both are under <td> tags in the same table row as each other. The href is the only thing that differentiates the two tags, and I have tried filtering on it, but it doesn't seem to work and returns a blank value. Also, when I try to sort the data manually, the output is appended vertically and not horizontally. I am new to coding, so any help would be appreciated :)
There is another way you might want to try as well to achieve the same result:
import requests
from bs4 import BeautifulSoup
keywords = ["USD","GBP","EUR"]
for keyword in keywords:
    page = requests.get("https://www.x-rates.com/table/?from={}&amount=1".format(keyword))
    soup = BeautifulSoup(page.content, "html.parser")
    for items in soup.select_one(".ratesTable tbody").find_all("tr"):
        data = [item.text for item in items.find_all("td")[1:2]]
        print(data)
It is easier to follow what happens when you print every item you get, starting from the top, e.g. in this case from the table element. The idea is to go one step at a time so you can follow along.
import requests
from bs4 import BeautifulSoup
origin = ["USD","GBP","EUR"]
i = 0
while i < len(origin):
    page = requests.get("https://www.x-rates.com/table/?from="+origin[i]+"&amount=1")
    soup = BeautifulSoup(page.content, "html.parser")
    tables = soup.findChildren('table')
    my_table = tables[0]
    i = i + 1
    rows = my_table.findChildren('tr')
    for row in rows:
        cells = row.findAll('td', class_='rtRates')
        if len(cells) > 0:
            first_item = cells[0].find('a')
            value = first_item.string
            print(value)
I cannot get the span text within the "table". Thanks!
from bs4 import BeautifulSoup
import urllib2
url1 = "url"
content1 = urllib2.urlopen(url1).read()
soup = BeautifulSoup(content1,"lxml")
table = soup.findAll("div", {"class" : "iw_component","id":"c1417094965154"})
rows = table.find_all('span', recursive=False)
for row in rows:
    print(row.text)
table = soup.findAll("div", {"class" : "iw_component","id":"c1417094965154"})
In the above line, findAll() returns a list (a ResultSet).
So, in the next line you are getting the error because find_all() has to be called on a single element, not on a list.
If you expect only one table, try using the following code. Just replace
rows = table.find_all('span',recursive=False)
with
rows = table[0].find_all('span')
If you expect multiple tables in the page, run a for loop on the table and then run the rest of the statements inside the for loop.
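A short sketch of that loop (the two-component HTML snippet here is made up for illustration):
from bs4 import BeautifulSoup

html = ('<div class="iw_component"><span>one</span></div>'
        '<div class="iw_component"><span>two</span></div>')
soup = BeautifulSoup(html, 'lxml')

# Run the remaining statements inside the loop over all matching divs
for component in soup.findAll('div', {'class': 'iw_component'}):
    for span in component.find_all('span'):
        print(span.get_text().replace('\t', ''))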
Also, for pretty output, you can strip out the tabs, as in the following code:
row = row.get_text()
row = row.replace('\t', '')
print(row)
The final working code for you is:
from bs4 import BeautifulSoup
import urllib2

url1 = "url"
content1 = urllib2.urlopen(url1).read()
soup = BeautifulSoup(content1, "lxml")
table = soup.findAll("div", {"class": "iw_component", "id": "c1417094965154"})
rows = table[0].find_all('span')
for row in rows:
    row_str = row.get_text()
    row_str = row_str.replace('\t', '')
    print(row_str)
Regarding the recursive=False parameter: if it is set to False, find_all() will only search the direct children, which in your case gives no result because the spans are nested deeper.
Recursive Argument in find()
If you only want Beautiful Soup to consider direct children, you can pass in recursive=False
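A small sketch of the difference (the HTML here is made up for illustration):
from bs4 import BeautifulSoup

html = '<div><p><span>nested</span></p><span>direct</span></div>'
div = BeautifulSoup(html, 'html.parser').div

# Default: searches the entire subtree
print([s.text for s in div.find_all('span')])  # ['nested', 'direct']
# recursive=False: only direct children of the div
print([s.text for s in div.find_all('span', recursive=False)])  # ['direct']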
Here's another approach using lxml instead of beautifulsoup:
import requests
from lxml import html
req = requests.get("<URL>")
raw_html = html.fromstring(req.text)
spans = raw_html.xpath('//div[@id="c1417094965154"]//span/text()')
print("".join([x.replace("\t", "").replace("\r\n","").strip() for x in spans]))
Output: Kranji Mile Day simulcast races, Kranji Racecourse, SINClass 3 Handicap - 1200M TURFSaturday, 26 May 2018Race 1, 5:15 PM
As you can see, the output needs a little formatting; spans is a list of all the span texts, so you can do any processing you need.
You seem to be using Python 2.x. Here is a Python 3.x solution, since I do not have a Python 2.x environment at the moment:
from bs4 import BeautifulSoup
import urllib.request as urllib
url1 = "<URL>"
# Read the HTML page
content1 = urllib.urlopen(url1).read()
soup = BeautifulSoup(content1, "lxml")
# Find the div (there is only one, so you do not need findAll) -> this is your problem
div = soup.find("div", class_="iw_component", id="c1417094965154")
# Now you retrieve all the span within this div
rows = div.find_all("span")
# You can do what you want with it !
line = ""
for row in rows:
row_str = row.get_text()
row_str = row_str.replace('\t', '')
line += row_str + ", "
print(line)
I am trying to download the data on this website
https://coinmunity.co/
...in order to manipulate it later in Python or Pandas.
I have tried to read it directly into Pandas via Requests, using this code, but it did not work:
import requests
import pandas as pd
from bs4 import BeautifulSoup

res = requests.get("https://coinmunity.co/")
soup = BeautifulSoup(res.content, 'lxml')
table = soup.find_all('table')[0]
dfm = pd.read_html(str(table), header = 0)
dfm = dfm[0].dropna(axis=0, thresh=4)
dfm.head()
In most of the things I tried, I could only get to the info in the headers, which seems to be the only table the code can see on this page.
Seeing that this did not work, I tried to do the same scraping with Requests and BeautifulSoup, but it did not work either. This is my code:
import requests
from bs4 import BeautifulSoup
res = requests.get("https://coinmunity.co/")
soup = BeautifulSoup(res.content, 'lxml')
#table = soup.find_all('table')[0]
#table = soup.find_all('div', {'class':'inner-container'})
#table = soup.find_all('tbody', {'class':'_ngcontent-c0'})
#table = soup.find_all('table')[0].findAll('tr')
#table = soup.find_all('table')[0].find('tbody')#.find_all('tbody _ngcontent-c3=""')
table = soup.find_all('p', {'class':'stats change positiveSubscribers'})
You can see in the lines commented, all the things I have tried, but nothing worked.
Is there any way to easily download that table for use in Pandas/Python, in the tidiest, easiest, and quickest possible way?
Thank you
Since the content is loaded dynamically after the initial request is made, you won't be able to scrape this data with requests. Here's what I would do instead:
from selenium import webdriver
import pandas as pd
import time
from bs4 import BeautifulSoup
driver = webdriver.Firefox()
driver.implicitly_wait(10)
driver.get("https://coinmunity.co/")
html = driver.page_source.encode('utf-8')
soup = BeautifulSoup(html, 'lxml')
results = []
for row in soup.find_all('tr')[2:]:
    data = row.find_all('td')
    name = data[1].find('a').text
    value = data[2].find('p').text
    # get the rest of the data you need about each coin here,
    # then add it to the dictionary that you append to results
    results.append({'name': name, 'value': value})
df = pd.DataFrame(results)
df.head()
name value
0 NULS 14,005
1 VEN 84,486
2 EDO 20,052
3 CLUB 1,996
4 HSR 8,433
You will need to make sure that geckodriver is installed and that it is in your PATH. I just scraped the name of each coin and the value but getting the rest of the information should be easy.
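If geckodriver is not on your PATH, you can also point Selenium at the binary explicitly (a sketch; the path below is hypothetical, and executable_path is the Selenium 3 style argument):
from selenium import webdriver

# The path here is hypothetical; adjust it to where geckodriver actually lives
driver = webdriver.Firefox(executable_path='/usr/local/bin/geckodriver')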
I am writing a Python script using BeautifulSoup to scrape values from this webpage: https://uk-air.defra.gov.uk/latest/currentlevels
I want to use soup.find() to get values for "Hourly mean Nitrogen dioxide" and "Last updated" from the table row where the "Monitoring site" is "Edinburgh St Leonards".
As I am new to web scraping I am having a bit of trouble so would be grateful for any help on this.
Scrape all the HTML tables into a list of tables.
The table index may change, so you should not rely on a fixed row/column index.
Part of the following script looks up the column index of the data you are searching for. Moreover, it prints the header name, so you know which data you are getting.
from bs4 import BeautifulSoup
import urllib.request
import re

with urllib.request.urlopen('https://uk-air.defra.gov.uk/latest/currentlevels?view=region') as response:
    htmlData = response.read()

soup = BeautifulSoup(htmlData, 'html5lib')
tables = soup.find_all('table', attrs={'class': 'current_levels_table'})

# What you want to check:
Iwant = ['nitrogen', 'update']
about = 'Edinburgh'
index = {}

for table in tables:
    # Get the header, to find the column number of the data we're looking for
    # and the table's real column names
    table_head = table.find('thead')
    headrows = table_head.find_all('tr')
    measures = headrows[1].find_all('th')
    for colnum, measure in enumerate(measures):
        index.update({colnum: measure.text.strip() for wanted in Iwant
                      if re.search(wanted, measure.text, re.IGNORECASE)})
    # Get the table content and look for Edinburgh
    table_body = table.find('tbody')
    rows = table_body.find_all('tr')
    for row in rows:
        cels = row.find_all('td')
        rowContent = [cel.text.strip().replace(u'\xa0', u' ').replace(u'\n Timeseries Graph', u'') for cel in cels if cel]
        if re.search(about, rowContent[0], re.IGNORECASE):
            for indexwanted, measurewanted in index.items():
                print(measurewanted, ':', rowContent[indexwanted])
Making use of the suggestion from d2718nis, you can do it in this way. Of course, many other ways would work too.
First, find the link that has the 'Edinburgh St Leonards' text in it. Then find the grandparent of that link element, which is a tr element. Now identify the td elements in that tr. When you examine the table, you see that the columns you want are the 4th and 7th, so get those from the list of td elements as the (0-relative) 3rd and 6th. Finally, display the crude texts of these elements.
You will need to do something clever to extract properly readable strings from these results; see the cleanup sketch after the session below.
>>> import requests
>>> import bs4
>>> page = requests.get('https://uk-air.defra.gov.uk/latest/currentlevels', headers={'User-Agent': 'Not blank'}).content
>>> soup = bs4.BeautifulSoup(page, 'lxml')
>>> Edinburgh_link = soup.find_all('a',string='Edinburgh St Leonards')[0]
>>> Edinburgh_link
Edinburgh St Leonards
>>> Edinburgh_row = Edinburgh_link.findParent('td').findParent('tr')
>>> Edinburgh_columns = Edinburgh_row.findAll('td')
>>> Edinburgh_columns[3]
<td class="center"><span class="bg_low1 bold">20 (1 Low)</span></td>
>>> Edinburgh_columns[6]
<td>05/08/2017<br/>14:00:00</td>
>>> Edinburgh_columns[3].text
'20\xa0(1\xa0Low)'
>>> Edinburgh_columns[6].text
'05/08/201714:00:00'
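A minimal sketch of that cleanup, based on the two raw strings shown above (splitting the date/time at position 10 assumes the dd/mm/yyyy format seen in the output):
# Raw strings taken from the session above
level = '20\xa0(1\xa0Low)'
updated = '05/08/201714:00:00'

# Replace the non-breaking spaces with ordinary spaces
print(level.replace('\xa0', ' '))  # 20 (1 Low)

# The <br/> between date and time was lost, so re-split after the
# 10-character dd/mm/yyyy date (an assumption about the format)
print(updated[:10], updated[10:])  # 05/08/2017 14:00:00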
You can start with this:
import requests
from bs4 import BeautifulSoup
# Request the page, set headers to prevent 403 Forbidden
page = requests.get(
url='https://uk-air.defra.gov.uk/latest/currentlevels',
headers={'User-Agent': 'Not blank'})
# Get html from page
html = page.text
# BeautifulSoup object
soup = BeautifulSoup(html, 'html5lib')
for table in soup.find_all('table'):
    # Print all tables on the page
    print(table)