Sorry for the noobish question.
I'm learning to use BeautifulSoup, and I'm trying to extract a specific string of data within a table.
The website is https://airtmrates.com/ and the exact string I'm trying to get is:
VES Bolivar Soberano Bank Value Value Value
The table doesn't have any class, so I have no idea how to find and parse that string.
I've been improvising blindly and failing miserably. Here's the last code I tried:
import re
import requests
from bs4 import BeautifulSoup

def airtm():
    # fetch the page and set up BeautifulSoup
    response = requests.get("https://airtmrates.com/")
    html = response.content
    soup_ = BeautifulSoup(html, 'html.parser')
    columns = soup_.find_all('td', text=re.compile('VES'))
    return columns
The page is dynamic, meaning you'll need the page to render before parsing. You can do that with either Selenium or Requests-HTML.
I'm not too familiar with Requests-HTML, but I have used Selenium in the past, so this should get you going. Also, whenever I'm looking to pull a <table> tag, I like to use pandas to parse it. BeautifulSoup can still be used; it just takes a little more work to iterate through the table, tr, and td tags. Pandas can do that work for you with .read_html():
from selenium import webdriver
import pandas as pd
def airtm(url):
    # URLs and BeautifulSoup execution
    driver = webdriver.Chrome("C:/chromedriver_win32/chromedriver.exe")
    driver.get(url)
    tables = pd.read_html(driver.page_source)
    df = tables[0]
    df = df[df['Code'] == 'VES']
    driver.close()
    return df
results = airtm('https://airtmrates.com/')
Output:
print(results)
Code Name Method Rate Buy Sell
120 VES Bolivar Soberano Bank 2526.7 2687.98 2383.68
143 VES Bolivar Soberano Mercado Pago 2526.7 2631.98 2429.52
264 VES Bolivar Soberano MoneyGram 2526.7 2776.59 2339.54
455 VES Bolivar Soberano Western Union 2526.7 2746.41 2383.68
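As a side note, pd.read_html works on any HTML string or file-like object, not just a live page, so the parse-and-filter step can be tried offline. The small table below is invented for illustration:

```python
import io
import pandas as pd

# A tiny hand-written table mimicking the structure of the rates page
html = """
<table>
  <tr><th>Code</th><th>Name</th><th>Method</th><th>Rate</th></tr>
  <tr><td>VES</td><td>Bolivar Soberano</td><td>Bank</td><td>2526.7</td></tr>
  <tr><td>USD</td><td>US Dollar</td><td>Bank</td><td>1.0</td></tr>
</table>
"""

tables = pd.read_html(io.StringIO(html))  # one DataFrame per <table> found
df = tables[0]
ves = df[df['Code'] == 'VES']             # same filter as in the Selenium version
print(ves)
```

read_html picks up the <th> row as the header automatically, which is why the 'Code' column is available for filtering.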
Related
I'm having difficulty scraping specific tables from Wikipedia. Here is my code.
import pandas as pd
import requests
from bs4 import BeautifulSoup
wikiurl = 'https://en.wikipedia.org/wiki/List_of_towns_in_India_by_population'
table_class = "wikitable sortable jquery-tablesorter"
response = requests.get(wikiurl)
print(response.status_code)
soup = BeautifulSoup(response.text, 'html.parser')
cities = soup.find('table', {"class":"wikitable sortable jquery-tablesorter"})
df = pd.read_html(str(cities))
df = pd.DataFrame(df[0])
print(df.to_string())
The class is taken from the info inside the table tag when you inspect the page, I'm using Edge as a browser. Changing the index (df[0]) causes it to say the index is out of range.
Is there a unique identifier in the wikipedia source code for each table? I would like a solution, but I'd really like to know where I'm going wrong too, as I feel I'm close and understand this.
I think your main difficulty was in extracting the HTML that corresponds to your class. "wikitable sortable jquery-tablesorter" is actually three separate classes, and the last one, jquery-tablesorter, is only added by JavaScript after the page loads, so it never appears in the HTML that requests receives. Matching on the classes that are present in the raw HTML works.
Hopefully this should help:
import pandas as pd
import requests
from bs4 import BeautifulSoup
wikiurl = 'https://en.wikipedia.org/wiki/List_of_towns_in_India_by_population'
response = requests.get(wikiurl)
print(response.status_code)
# 200
soup = BeautifulSoup(response.text, 'html.parser')
cities = soup.find_all('table', class_='wikitable sortable')
print(cities[0])
print(cities[0])
# <table class="wikitable sortable">
# <tbody><tr>
# <th>Name of Town
# </th>
# <th>State
# ....
tables = pd.read_html(str(cities[0]))
print(tables[0])
# Name of Town State ... Population (2011) Ref
# 0 Achhnera Uttar Pradesh ... 22781 NaN
# 1 Adalaj Gujarat ... 11957 NaN
# 2 Adoor Kerala ... 29171 NaN
# ....
For a simpler solution, you only need pandas. No need for requests or BeautifulSoup:
import pandas as pd
wikiurl = 'https://en.wikipedia.org/wiki/List_of_towns_in_India_by_population'
tables = pd.read_html(wikiurl)
Here, tables will be a list of DataFrames; you can select one from it with tables[0], etc.
Don't parse the HTML directly. Use the API provided by MediaWiki, as shown here: https://www.mediawiki.org/wiki/API:Get_the_contents_of_a_page
In your case, use Method 2, the Parse API, with the following URL: https://en.wikipedia.org/w/api.php?action=parse&page=List_of_towns_in_India_by_population&prop=text&formatversion=2&format=json
Process the result accordingly. You might still need to use BeautifulSoup to extract the HTML table and its content.
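For illustration, that URL can be assembled with the standard library's urlencode so the parameters stay readable (a sketch; only the query construction is shown, not the request itself):

```python
from urllib.parse import urlencode

params = {
    "action": "parse",
    "page": "List_of_towns_in_India_by_population",
    "prop": "text",
    "formatversion": 2,
    "format": "json",
}
url = "https://en.wikipedia.org/w/api.php?" + urlencode(params)
print(url)
```

With formatversion=2, the rendered HTML ends up in the JSON response under parse → text, which can then be handed to BeautifulSoup or pd.read_html.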
I am downloading some data to help improve my Spanish. On this webpage I am able to download the table of conjugations, however I can't seem to get the English translation & the box beneath it.
At the top of the page there is a Spanish flag, and to the right of it a Union Jack flag; I'm trying to get the text next to it, which is "laugh; smile; giggle;..."
Beneath that, there's a box with the following values I'm also trying to get:
Infinitivo reír Gerundo riendo Participio Pasado reído
The code I have used to get the other tables is below. How do I find the other elements mentioned above?
import requests
from bs4 import BeautifulSoup

URL = 'https://conjugator.reverso.net/conjugation-spanish-verb-reír.html'
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')
results = soup.find(id='ch_divSimple')
verb_tbls = results.find_all('ul', class_='wrap-verbs-listing')
You might want to try this:
import requests
from bs4 import BeautifulSoup
page = requests.get('https://conjugator.reverso.net/conjugation-spanish-verb-reír.html')
soup = BeautifulSoup(page.content, 'html.parser')
conjugations = soup.find_all('div', class_='blue-box-wrap')
for form in conjugations:
    print(form.find("p").getText().upper() if form.find("p") else "N/A")
    for row in form.find_all("li"):
        print(row.getText())
    print("-" * 80)
Output:
PRESENTE
yo río
tú ríes
él/ella/Ud. ríe
nosotros reímos
vosotros reís
ellos/ellas/Uds. ríen
--------------------------------------------------------------------------------
FUTURO
yo reiré
tú reirás
él/ella/Ud. reirá
nosotros reiremos
vosotros reiréis
ellos/ellas/Uds. reirán
--------------------------------------------------------------------------------
and so on...
As for the English words, these are generated dynamically and BeautifulSoup won't see them.
I am building an MS Excel document to get stock and option prices from Yahoo finance, ariva etc. I am using xlwings and BeautifulSoup to get the data.
Everything works fine, I get stock prices from Yahoo and I also get stock/German option prices from ariva directly to Excel. Unfortunately, the option prices (not stock prices) are more difficult to get.
I am using this code (e.g. ticker is 'NVDA', date is '44211' (for 15/1/2021) and option_name is 'NVDA210115C00210000'):
import xlwings as xw
import bs4 as bs
import requests
@xw.func
def get_stock(ticker, date, option_name):
    url_base = 'http://finance.yahoo.com/quote/'
    new_date = str(86400*(date-25569))
    src_base = requests.get(url_base+ticker+'/options?date='+new_date).text
    soup = bs.BeautifulSoup(src_base, 'lxml')
This results in loading https://finance.yahoo.com/quote/NVDA/options?date=1610668800 (that works fine).
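(Side note on the 86400*(date-25569) expression: 25569 is the Excel serial number for 1970-01-01, so the subtraction yields days since the Unix epoch and the multiplication by 86400 converts days to seconds. A quick standard-library check:)

```python
from datetime import datetime, timezone

excel_serial = 44211                       # 15/1/2021 in Excel's day count
unix_ts = 86400 * (excel_serial - 25569)   # days since epoch -> seconds
print(unix_ts)                             # 1610668800
print(datetime.fromtimestamp(unix_ts, tz=timezone.utc).date())  # 2021-01-15
```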
How do I get the option price for this option: NVDA210115C00210000? I tried:
    price = soup.find_all('div', attrs={'id': 'Col1-1-OptionContracts-Proxy'})[0].find(attrs={'class': 'data-col2'}).get_text()
    return [price]
But it only returns the price of the first option on this page.
See the picture (Yahoo Finance code and option price); I want the 324,37.
Somehow, I have to find the place of 'NVDA210115C00210000' and THEN get the text of data-col2. I just started using Python two days ago and I am not a programmer, but I think it shouldn't be that difficult.
How can I use the 'find' to find that place and THEN get the price?
There are a couple of errors. This line:
soup = bs.BeautifulSoup(src_base,'lxml')
should be:
soup = bs.BeautifulSoup(src_base.content,'lxml')
It's the .content you are missing (keep src_base as the response object rather than calling .text on it).
What I did instead was find the table row first, table_data = soup.find('tr', {'data-reactid': '200'}), and then look up the cell within that row: option_price = table_data.find('td', {'data-reactid': '206'}).text
def get_stock(ticker, date, option_name):
    url_base = 'http://finance.yahoo.com/quote/'
    new_date = str(86400*(date-25569))
    src_base = requests.get(url_base+ticker+'/options?date='+new_date)
    soup = bs.BeautifulSoup(src_base.content, 'lxml')
    table_data = soup.find('tr', {'data-reactid': '200'})
    option_price = table_data.find('td', {'data-reactid': '206'}).text
    print(option_price)

get_stock('NVDA', 44211, 'NVDA210115C00210000')
>>> 407.35
HTML Code
<tr class="data-row7 Bgc($hoverBgColor):h BdT Bdc($seperatorColor) H(33px) in-the-money Bgc($hoverBgColor)" data-reactid="200">
<td class="data-col3 Ta(end) Pstart(7px)" data-reactid="206">407.35</td>
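That lookup can be reproduced offline against the snippet above (wrapped in a minimal table so the parser accepts it):

```python
from bs4 import BeautifulSoup

# Minimal reproduction of the Yahoo markup shown above
html = """
<table><tbody>
<tr class="data-row7" data-reactid="200">
<td class="data-col3 Ta(end) Pstart(7px)" data-reactid="206">407.35</td>
</tr>
</tbody></table>
"""
soup = BeautifulSoup(html, 'html.parser')
table_data = soup.find('tr', {'data-reactid': '200'})
option_price = table_data.find('td', {'data-reactid': '206'}).text
print(option_price)  # 407.35
```

Bear in mind the data-reactid values are generated by React and can shift between page loads, so hard-coding them is fragile.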
I'm kind of new to Python, and I decided to automatically collect the menu of my canteen. I don't want to use a Chrome driver, so I planned to use requests, but I can't manage to collect only the menu of the current day; I end up with the menu for the whole week.
So I have this text with all the menus and want to extract the part between the current day and the next one. I have variables for the current day and the next one, but I don't know how to proceed from there.
Here is what I did:
(The thing I can't fix to get the text is at the end when I use re.search)
from datetime import *
from bs4 import BeautifulSoup
import requests
import re
url = "https://www.crous-lille.fr/restaurant/r-u-mont-houy-2/"
r = requests.get(url)
soup = BeautifulSoup(r.text, "html.parser")
def getMenu():
    # This gives us the day number without the leading zero,
    # i.e. 06 October becomes 6 October
    day = datetime.now()
    day = day.strftime("%d")
    listeD = [int(d) for d in str(day)]
    if listeD[0] == 0:
        listeD.pop(0)
    strings = [str(integer) for integer in listeD]
    a_string = "".join(strings)
    pfDay = int(a_string)
    # I'll have to change the +1: it doesn't go back to 1 if the next day
    # starts a new month (I'm too lazy for now though)
    dayP1 = pfDay + 1

    # collect menu
    data = soup.find_all('div', {"id": "menu-repas"})
    data = data[0].text
    result = re.search(r'str(pfDay) \.(.*?) str(dayP1)', data)  # The thing I don't know how to fix:
    # the variables are inside the quotes, so they aren't recognized as variables
    print(result)

getMenu()
How should I fix it?
Thank you in advance ^^
You can use the .get_text() method. The current day's menu is the first class="content" under id="menu-repas":
import requests
from bs4 import BeautifulSoup
url = "https://www.crous-lille.fr/restaurant/r-u-mont-houy-2/"
soup = BeautifulSoup(requests.get(url).content, "html.parser")
print(soup.select_one('#menu-repas .content').get_text(strip=True, separator='\n'))
Prints:
Petit déjeuner
Pas de service
Déjeuner
Les Plats Du Jour
pizza kébab
gnocchis fromage
chili con carné
dahl coco et riz blanc végé
poisson blanc amande
pâtes sauce carbonara
Les accompagnements
riz safrane
pâtes
chou vert
poêlée festive
frites
Dîner
Pas de service
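As an aside, the re.search in the question fails because pfDay and dayP1 sit inside the string literal, so the pattern searches for the literal text str(pfDay). An f-string (rf'...') interpolates them; here is a toy example with made-up day numbers and menu text:

```python
import re

pfDay, dayP1 = 6, 7
data = "... 5 . old menu 6 . pizza kebab, gnocchis fromage 7 . next menu ..."

# rf-string: raw (so \. stays a literal dot) and formatted (so the variables expand)
result = re.search(rf'{pfDay} \.(.*?) {dayP1}', data)
print(result.group(1).strip())  # pizza kebab, gnocchis fromage
```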
I am trying to scrape all the text from a webpage that is embedded within the td tags that have class="calendar__cell calendar__currency currency". As of now, my code only returns the first occurrence of this tag and class. How can I keep it iterating through the source code so that it returns all occurrences one by one? The webpage is forexfactory.com.
from bs4 import BeautifulSoup
import requests
source = requests.get("https://www.forexfactory.com/#detail=108867").text
soup = BeautifulSoup(source, 'lxml')
body = soup.find("body")
article = body.find("table", class_="calendar__table")
actual = article.find("td", class_="calendar__cell calendar__actual actual")
forecast = article.find("td", class_="calendar__cell calendar__forecast forecast").text
currency = article.find("td", class_="calendar__cell calendar__currency currency")
Tcurrency = currency.text
Tactual = actual.text
print(Tcurrency)
You have to use find_all() to get all the elements, and then you can use a for loop to iterate over them.
import requests
from bs4 import BeautifulSoup
r = requests.get("https://www.forexfactory.com/#detail=108867")
soup = BeautifulSoup(r.text, 'lxml')
table = soup.find("table", class_="calendar__table")
for row in table.find_all('tr', class_='calendar__row--grey'):
    currency = row.find("td", class_="currency")
    #print(currency.prettify())  # inspect the HTML before getting the text
    currency = currency.get_text(strip=True)
    actual = row.find("td", class_="actual")
    actual = actual.get_text(strip=True)
    forecast = row.find("td", class_="forecast")
    forecast = forecast.get_text(strip=True)
    print(currency, actual, forecast)
Result
CHF 96.4 94.6
EUR 0.8% 0.9%
GBP 43.7K 41.3K
EUR 1.35|1.3
USD -63.2B -69.2B
USD 0.0% 0.2%
USD 48.9 48.2
USD 1.2% 1.5%
BTW: I found that this page uses JavaScript to redirect, and in a browser I see a table with different values. If I turn off JavaScript in the browser, it shows the data I get with the Python code. BeautifulSoup and requests can't run JavaScript, so if you need the data as it appears in the browser, you may need Selenium to control a web browser that can run JavaScript.
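A tiny offline illustration of that last point: a parser only sees the static markup and never executes scripts. The HTML below is made up:

```python
from bs4 import BeautifulSoup

# In a real browser, the script would fill in the div
html = """
<div id="prices"></div>
<script>document.getElementById('prices').textContent = '42';</script>
"""
soup = BeautifulSoup(html, 'html.parser')
print(repr(soup.find('div', id='prices').get_text()))  # '' - the script never ran
```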