Python - How to extract text between two variables in a big text

I'm fairly new to Python and decided to automatically collect my canteen's menu. I don't want to use ChromeDriver, so I planned to use requests, but I can't collect only the current day's menu and have to take the menu for the whole week.
So I have this text with all the menus and want to extract the part between the current day and the next one. I have variables holding the current day and the next one, but I don't know how to proceed from there.
Here is what I did:
(The part I can't fix is at the end, where I use re.search.)
from datetime import *
from bs4 import BeautifulSoup
import requests
import re
url = "https://www.crous-lille.fr/restaurant/r-u-mont-houy-2/"
r = requests.get(url)
soup = BeautifulSoup(r.text, "html.parser")
def getMenu():
    #This gives us the date number without the 0 if it starts with it
    #i.e 06 october becomes 6 october
    day = datetime.now()
    day = day.strftime("%d")
    listeD = [int(d) for d in str(day)]
    if listeD[0] == 0:
        listeD.pop(0)
    strings = [str(integer) for integer in listeD]
    a_string = "".join(strings)
    pfDay = int(a_string)
    dayP1 = pfDay+1 #I have to change the +1 because it doesn't go back to 1 if the next day is a new month
    #I'm too lazy for now though
    #collect menu
    data = soup.find_all('div', {"id":"menu-repas"})
    data = data[0].text
    result = re.search(r'str(pfDay) \.(.*?) str(dayP1)', data) #The thing I don't know how to fix
    #The variables are between the quotes and then not recognized as so
    print(result)
getMenu()
How should I fix it?
Thank you in advance ^^

You can use the .get_text() method. The current day's menu is the first element with class="content" under id="menu-repas":
import requests
from bs4 import BeautifulSoup
url = "https://www.crous-lille.fr/restaurant/r-u-mont-houy-2/"
soup = BeautifulSoup(requests.get(url).content, "html.parser")
print(soup.select_one('#menu-repas .content').get_text(strip=True, separator='\n'))
Prints:
Petit déjeuner
Pas de service
Déjeuner
Les Plats Du Jour
pizza kébab
gnocchis fromage
chili con carné
dahl coco et riz blanc végé
poisson blanc amande
pâtes sauce carbonara
Les accompagnements
riz safrane
pâtes
chou vert
poêlée festive
frites
Dîner
Pas de service
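For the regex question as originally asked: inside the quotes, `str(pfDay)` is literal text, so the variables have to be interpolated into the pattern instead. The following is a minimal sketch, assuming the day numbers appear as plain text in the scraped string; `extract_between` and the sample text are hypothetical and not taken from the site. Using `timedelta` also covers the month rollover that `pfDay+1` does not handle.

```python
import re
from datetime import date, timedelta

def extract_between(text, start, end):
    """Return the text between the first `start` and the next `end`, or None."""
    # re.escape keeps any regex metacharacters in the markers literal;
    # re.DOTALL lets "." match newlines, since a menu spans several lines.
    pattern = re.escape(str(start)) + r"(.*?)" + re.escape(str(end))
    match = re.search(pattern, text, flags=re.DOTALL)
    return match.group(1).strip() if match else None

today = date(2020, 10, 31)  # fixed example date; use date.today() in practice
tomorrow = today + timedelta(days=1)  # rolls over to day 1 of the next month

# date.day is an int, so there is no leading zero to strip.
# Hypothetical sample text standing in for the scraped weekly menu:
sample = "Menu du 31 octobre\npizza\nfrites\nMenu du 1 novembre\npâtes"
menu = extract_between(sample, f"Menu du {today.day}", f"Menu du {tomorrow.day}")
print(menu)
```

A bare number such as "6" can also match inside "16", so it is safer to include some surrounding text (a weekday or month name) in the markers.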

Related

Web scraping with bs4 python: How to display football matchups

I'm a beginner to Python and am trying to create a program that will scrape the football/soccer schedule from skysports.com and will send it through SMS to my phone through Twilio. I've excluded the SMS code because I have that figured out, so here's the web scraping code I am getting stuck with so far:
import requests
from bs4 import BeautifulSoup
URL = "https://www.skysports.com/football-fixtures"
page = requests.get(URL)
results = BeautifulSoup(page.content, "html.parser")
d = defaultdict(list)
comp = results.find('h5', {"class": "fixres__header3"})
team1 = results.find('span', {"class": "matches__item-col matches__participant matches__participant--side1"})
date = results.find('span', {"class": "matches__date"})
team2 = results.find('span', {"class": "matches__item-col matches__participant matches__participant--side2"})
for ind in range(len(d)):
    d['comp'].append(comp[ind].text)
    d['team1'].append(team1[ind].text)
    d['date'].append(date[ind].text)
    d['team2'].append(team2[ind].text)

The code below should do the trick:
from bs4 import BeautifulSoup
import requests
a = requests.get('https://www.skysports.com/football-fixtures')
soup = BeautifulSoup(a.text,features="html.parser")
teams = []
for date in soup.find_all(class_="fixres__header2"): # searching in that date
    for i in soup.find_all(class_="swap-text--bp30")[1:]: #skips the first one because that's a heading
        teams.append(i.text)
date = soup.find(class_="fixres__header2").text
print(date)
teams = [i.strip('\n') for i in teams]
for x in range(0,len(teams),2):
    print (teams[x]+" vs "+ teams[x+1])
Let me further explain what I have done:
All the football team names have this class name: swap-text--bp30
So we can use find_all to extract all the elements with that class.
Once we have our results, we can collect them in a list ("teams = []") by appending them in a for loop ("teams.append(i.text)"). ".text" strips the HTML.
Then we can get rid of the "\n" in each entry by stripping it, and print the strings in the list two by two.
This should be your final output:
EDIT: To scrape the title of the leagues we will do pretty much the same:
league = []
for date in soup.find_all(class_="fixres__header2"): # searching in that date
    for i in soup.find_all(class_="fixres__header3"): #collects every league heading
        league.append(i.text)
Strip the list and create another one:
league = [i.strip('\n') for i in league]
final = []
Then add this final bit of code, which essentially just prints the leagues and then the matchups:
for x in range(0,len(teams),5):
    final.append(teams[x]+" vs "+ teams[x+1])
for i in league:
    print(i)
for i in final:
    print(i)
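The two-by-two pairing step can also be written without index arithmetic. This is a minimal sketch, assuming `teams` is a flat list that alternates home and away names as in the answer above; `pair_matchups` and the sample names are hypothetical.

```python
def pair_matchups(teams):
    """Pair a flat [home, away, home, away, ...] list into 'A vs B' strings."""
    cleaned = [name.strip() for name in teams]
    # zip stops at the shorter slice, so a trailing unpaired name is dropped
    # instead of raising IndexError the way teams[x+1] would.
    return [f"{home} vs {away}" for home, away in zip(cleaned[::2], cleaned[1::2])]

# Made-up names with the surrounding newlines that .text typically leaves behind
sample = ["\nArsenal\n", "\nChelsea\n", "\nLeeds\n", "\nEverton\n", "\nFulham\n"]
for line in pair_matchups(sample):
    print(line)
```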

Avoid to copy some content while scraping through pages

I have some difficulties in saving the results that I am scraping.
Please refer to this code (this code was slightly changed for my specific case):
import bs4, requests
import pandas as pd
import re
import time
headline=[]
corpus=[]
dates=[]
tag=[]
start=1
url="https://www.imolaoggi.it/category/cron/"
while True:
    r = requests.get(url)
    soup = bs4.BeautifulSoup(r.text, 'html')
    headlines=soup.find_all('h3')
    corpora=soup.find_all('p')
    dates=soup.find_all('time', attrs={'class':'entry-date published updated'})
    tags=soup.find_all('span', attrs={'class':'cat-links'})
    for t in headlines:
        headline.append(t.text)
    for s in corpora:
        corpus.append(s.text)
    for d in date:
        dates.append(d.text)
    for c in tags:
        tag.append(c.text)
    if soup.find_all('a', attrs={'class':'page-numbers'}):
        url = f"https://www.imolaoggi.it/category/cron/page/{page}"
        page +=1
    else:
        break
# Create dataframe
df = pd.DataFrame(list(zip(date, headline, tag, corpus)),
                  columns =['Date', 'Headlines', 'Tags', 'Corpus'])
I would like to save all the pages from this link. The code works, but on every page it seems to write two identical sentences for the corpus:
I think this is happening because of the tag I chose:
corpora=soup.find_all('p')
This causes a misalignment of rows in my dataframe, because the data are saved in lists and the corpus only starts being scraped correctly later than the other fields.
I hope you can help me understand how to fix it.

You were close, but your selectors were off, and you misnamed some of your variables.
I would use CSS selectors like this:
headline=[]
corpus=[]
date_list=[]
tag_list=[]
headlines=soup.select('h3.entry-title')
corpora=soup.select('div.entry-meta + p')
dates=soup.select('div.entry-meta span.posted-on')
tags=soup.select('span.cat-links')
for t in headlines:
    headline.append(t.text)
for s in corpora:
    corpus.append(s.text.strip())
for d in dates:
    date_list.append(d.text)
for c in tags:
    tag_list.append(c.text)
df = pd.DataFrame(list(zip(date_list, headline, tag_list, corpus)),
                  columns =['Date', 'Headlines', 'Tags', 'Corpus'])
df
Output:
   Date                              Headlines                                          Tags           Corpus
0  30 Ottobre 2020                   Roma: con spranga di ferro danneggia 50 auto i...  CRONACA, NEWS  Notte di vandalismi a Colli Albani dove un uom...
1  30 Ottobre 2020\n30 Ottobre 2020  Aggressione con machete: grave un 28enne, arre...  CRONACA, NEWS  Roma - Ha impugnato il suo machete e lo ha agi...
2  30 Ottobre 2020\n30 Ottobre 2020  Deep State e globalismo, Mons. Viganò scrive a...  CRONACA, NEWS  LETTERA APERTA\r\nAL PRESIDENTE DEGLI STATI UN...
3  30 Ottobre 2020                   Meluzzi e Scandurra: “Sacrificare libertà per ...  CRONACA, NEWS  "Sacrificare la libertà per la sicurezza è un ...
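Because `zip` stops silently at the shortest list, a selector that matches too many or too few elements shifts or drops rows without raising an error. Below is a small sketch of a sanity check that could run before building the DataFrame; `check_aligned` is a hypothetical helper and the sample lists are made up.

```python
def check_aligned(**columns):
    """Return zipped rows, or raise ValueError if the lists differ in length."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"misaligned columns: {lengths}")
    return list(zip(*columns.values()))

# Made-up data: an over-broad selector such as find_all('p') yields an extra entry.
headline = ["Title A", "Title B"]
corpus = ["Intro text", "Body A", "Body B"]
try:
    rows = check_aligned(headline=headline, corpus=corpus)
except ValueError as err:
    print(err)
```

Failing early this way points at the selector that needs tightening, rather than leaving a misaligned table to be noticed later.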

import requests
from bs4 import BeautifulSoup
from concurrent.futures import ThreadPoolExecutor
import pandas as pd
def main(req, num):
    r = req.get("https://www.imolaoggi.it/category/cron/page/{}/".format(num))
    soup = BeautifulSoup(r.content, 'html.parser')
    goal = [(x.time.text, x.h3.a.text, x.select_one("span.cat-links").get_text(strip=True), x.p.get_text(strip=True))
            for x in soup.select("div.entry-content")]
    return goal
with ThreadPoolExecutor(max_workers=30) as executor:
    with requests.Session() as req:
        fs = [executor.submit(main, req, num) for num in range(1, 2937)]
        allin = []
        for f in fs:
            allin.extend(f.result())
        df = pd.DataFrame.from_records(
            allin, columns=["Date", "Title", "Tags", "Content"])
        print(df)
        df.to_csv("result.csv", index=False)

Python: How can I get text from a tag like this in BeautifulSoup

I need to get the date and hour from this link: 'https://www.pagina12.com.ar/225378-murio-cacho-castana-simbolo-del-macho-porteno' or from any other article on the site 'https://www.pagina12.com.ar/'.
The structure is this:
<div class="article-info"><div class="breadcrumb"><div class="suplement">Cultura y Espectáculos</div><div class="topic"></div></div><div class="time"><span datetime="2019-10-15" pubdate="pubdate">15 de octubre de 2019</span><span> · </span><span>Actualizado hace <span class="article-time" data-time="1571156914">3 hs</span></span></div></div>
and I did this:
cosa = requests.get('https://www.pagina12.com.ar/225378-murio-cacho-castana-simbolo-del-macho-porteno').text
parse = BeautifulSoup(cosa, 'html5lib')
info = parse.findAll('div', {'class':'article-info'})
Then I try to get the text that says '3 hs', but I can't access it and don't know how to do it. Does anyone have an idea?
Thanks!

You could calculate it from the data-time attribute:
from bs4 import BeautifulSoup as bs
import requests, datetime
import dateutil.relativedelta
r = requests.get('https://www.pagina12.com.ar/225378-murio-cacho-castana-simbolo-del-macho-porteno')
soup = bs(r.content, 'lxml')
# data-time holds a Unix timestamp (seconds since the epoch)
dt1 = datetime.datetime.fromtimestamp(float(soup.select_one('[data-time]')['data-time']))
dt2 = datetime.datetime.now()
diff = dateutil.relativedelta.relativedelta(dt2, dt1)
# .hours is only the hours component; days, months and years are separate fields
print(diff.hours)
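If only the standard library is available, the same elapsed time can be computed from the `data-time` value with `datetime` alone. This is a minimal sketch, assuming `data-time` is a Unix timestamp in seconds (as the value 1571156914 in the question suggests); `hours_since` is a hypothetical helper.

```python
from datetime import datetime, timezone

def hours_since(epoch_seconds, now=None):
    """Whole hours elapsed between a Unix timestamp and `now` (default: current time)."""
    then = datetime.fromtimestamp(float(epoch_seconds), tz=timezone.utc)
    now = now or datetime.now(timezone.utc)
    # total_seconds() covers the full span, so spans longer than a day
    # are counted in hours too.
    return int((now - then).total_seconds() // 3600)

published = "1571156914"  # data-time value from the question's HTML
# Fixed reference time, 3 hours and 2 minutes later, to keep the example repeatable
reference = datetime.fromtimestamp(1571156914 + 3 * 3600 + 120, tz=timezone.utc)
print(hours_since(published, now=reference))
```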

How can I parse a table from a specific string using BeautifulSoup?

Sorry for the noobish question.
I'm learning to use BeautifulSoup, and I'm trying to extract a specific string of data within a table.
The website is https://airtmrates.com/ and the exact string I'm trying to get is:
VES Bolivar Soberano Bank Value Value Value
The table doesn't have any class so I have no idea how to find and parse that string.
I've been pulling something out of my buttcheeks but I've failed miserably. Here's the last code I tried so you can have a laugh:
def airtm():
    #URLs y ejecución de BS
    url = requests.get("https://airtmrates.com/")
    response = requests.get(url)
    html = response.content
    soup_ = soup(url, 'html.parser')
    columns = soup_.findAll('td', text = re.compile('VES'), attrs = {'::before'})
    return columns

The page is dynamic, meaning you'll need the page to render before parsing. You can do that with either Selenium or Requests-HTML.
I'm not too familiar with Requests-HTML, but I have used Selenium in the past. This should get you going. Also, whenever I'm looking to pull a <table> tag, I like to use pandas to parse it. BeautifulSoup can still be used; it just takes a little more work to iterate through the table, tr, and td tags. Pandas can do that work for you with .read_html():
from selenium import webdriver
import pandas as pd
def airtm(url):
    #URLs y ejecución de BS
    driver = webdriver.Chrome("C:/chromedriver_win32/chromedriver.exe")
    driver.get(url)
    tables = pd.read_html(driver.page_source)
    df = tables[0]
    df = df[df['Code'] == 'VES']
    driver.close()
    return df
results = airtm('https://airtmrates.com/')
Output:
print (results)
     Code              Name         Method    Rate      Buy     Sell
120   VES  Bolivar Soberano           Bank  2526.7  2687.98  2383.68
143   VES  Bolivar Soberano   Mercado Pago  2526.7  2631.98  2429.52
264   VES  Bolivar Soberano      MoneyGram  2526.7  2776.59  2339.54
455   VES  Bolivar Soberano  Western Union  2526.7  2746.41  2383.68

Extracting Web Data Using Beautiful Soup (Python 2.7)

In the code sample below, 3 of the 5 elements I am attempting to scrape return values as expected. Two (goals_scored and assists) return no values. I have verified that the data does exist on the web page and that I am using the correct attribute, but I'm not sure why results are not returning. Is there something obvious I am overlooking?
import sys
from bs4 import BeautifulSoup as bs
import urllib2
import datetime as dt
import time
import pandas as pd
proxy_support = urllib2.ProxyHandler({})
opener = urllib2.build_opener(proxy_support)
player_name=[]
club =[]
position = []
goals_scored = []
assists = []
for p in range(25):
    player_url = 'http://www.mlssoccer.com/stats/season?page={p}&franchise=select&year=2017&season_type=REG&group=goals'.format(
        p=p)
    page = opener.open(player_url).read()
    player_soup = bs(page,"lxml")
    print >>sys.stderr, '[{time}] Running page {n}...'.format(
        time=dt.datetime.now(), n=p)
    length = len(player_soup.find('tbody').findAll('tr'))
    for row in range(0, length):
        try:
            name = player_soup.find('tbody').findAll('td', attrs={'data-title': 'Player'})[row].find('a').contents[0]
            player_name.append(name)
            team = player_soup.find('tbody').findAll('td', attrs={'data-title': 'Club'})[row].contents[0]
            club.append(team)
            pos = player_soup.find('tbody').findAll('td', attrs={'data-title': 'POS'})[row].contents[0]
            position.append(pos)
            goals = player_soup.find('tbody').findAll('td', attrs={'data-title': 'G' ,'class': 'responsive'})[row].contents[0]
            goals_scored.apppend(goals)
            a = player_soup.find('tbody').findAll('td', attrs={'data-title': 'A'})[row].contents[0]
            assists.append(a)
        except:
            pass
player_data = {'player_name':player_name,
               'club':club,
               'position' : position,
               'goals_scored' : goals_scored,
               'assists' : assists,
               }
df = pd.DataFrame.from_dict(player_data,orient='index')
df
The only thing I can figure out is that there is a slight difference in the HTML for the variables not returning data. Do I need to include class="responsive" in my code? If so, are there any examples of how that might look?
Position HTML : F
Goals HTML: 11
Any insight is appreciated

You can try it like this to get your desired data. I've only parsed the portion you needed; the rest (building the dataframe) you can do yourself. FYI, there are two classes attached to different tr tags: odd and even. Don't forget to consider that as well.
from bs4 import BeautifulSoup
import requests
page_url = "https://www.mlssoccer.com/stats/season?page={0}&franchise=select&year=2017&season_type=REG&group=goals"
for url in [page_url.format(p) for p in range(5)]:
    soup = BeautifulSoup(requests.get(url).text, "lxml")
    table = soup.select("table")[0]
    for items in table.select(".odd,.even"):
        player = items.select("td[data-title='Player']")[0].text
        club = items.select("td[data-title='Club']")[0].text
        position = items.select("td[data-title='POS']")[0].text
        goals = items.select("td[data-title='G']")[0].text
        assist = items.select("td[data-title='A']")[0].text
        print(player,club,position,goals,assist)
Partial result looks like:
Nemanja Nikolic CHI F 24 4
Diego Valeri POR M 21 11
Ola Kamara CLB F 18 3
Since I've included both classes in my script, you will get all the data from that site.
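One more thing worth checking in the question's code: `goals_scored.apppend(goals)` is misspelled, and lists have no `apppend` method, so that line raises `AttributeError`. The bare `except: pass` swallows the error and skips the rest of the `try` body, including the assists line, which would explain two empty columns. The following is a minimal sketch of that effect with made-up rows; `collect` is a hypothetical stand-in for the scraping loop, not the original code.

```python
def collect(rows, append_name):
    """Mimic the question's loop; `append_name` is the method called on the goals list."""
    goals_scored, assists, errors = [], [], []
    for goals, a in rows:
        try:
            getattr(goals_scored, append_name)(goals)
            assists.append(a)  # never reached when the line above raises
        except AttributeError as err:
            # Recording the error instead of `pass` makes the typo visible.
            errors.append(str(err))
    return goals_scored, assists, errors

rows = [("24", "4"), ("21", "11")]  # made-up (goals, assists) pairs
print(collect(rows, "apppend"))  # both lists stay empty; the errors are recorded
print(collect(rows, "append"))   # both lists are filled
```

Catching only the exceptions you expect (for example `IndexError` for a missing cell) keeps typos like this from being hidden.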
