Avoid copying duplicate content while scraping through pages - python

I have some difficulties in saving the results that I am scraping.
Please refer to this code (this code was slightly changed for my specific case):
import bs4, requests
import pandas as pd
import re
import time

headline=[]
corpus=[]
dates=[]
tag=[]

start=1
url="https://www.imolaoggi.it/category/cron/"
while True:
    r = requests.get(url)
    soup = bs4.BeautifulSoup(r.text, 'html')
    headlines=soup.find_all('h3')
    corpora=soup.find_all('p')
    dates=soup.find_all('time', attrs={'class':'entry-date published updated'})
    tags=soup.find_all('span', attrs={'class':'cat-links'})

    for t in headlines:
        headline.append(t.text)
    for s in corpora:
        corpus.append(s.text)
    for d in date:
        dates.append(d.text)
    for c in tags:
        tag.append(c.text)

    if soup.find_all('a', attrs={'class':'page-numbers'}):
        url = f"https://www.imolaoggi.it/category/cron/page/{page}"
        page += 1
    else:
        break

# Create dataframe
df = pd.DataFrame(list(zip(date, headline, tag, corpus)),
                  columns=['Date', 'Headlines', 'Tags', 'Corpus'])
I would like to save all the pages from this link. The code works, but it seems to write two identical sentences for the corpus every time (i.e. on every page).
I think this is happening because of the tag I chose:
corpora=soup.find_all('p')
This causes a row misalignment in my dataframe: the data are saved in lists, and the corpus only starts being scraped correctly later than the other fields.
I hope you can help me understand how to fix it.

You were close, but your selectors were off, and you mis-named some of your variables.
I would use CSS selectors like this:
import bs4, requests
import pandas as pd

# fetch the page first (this setup was implied by the original answer)
r = requests.get("https://www.imolaoggi.it/category/cron/")
soup = bs4.BeautifulSoup(r.text, 'html.parser')

headline=[]
corpus=[]
date_list=[]
tag_list=[]

headlines=soup.select('h3.entry-title')
corpora=soup.select('div.entry-meta + p')
dates=soup.select('div.entry-meta span.posted-on')
tags=soup.select('span.cat-links')

for t in headlines:
    headline.append(t.text)
for s in corpora:
    corpus.append(s.text.strip())
for d in dates:
    date_list.append(d.text)
for c in tags:
    tag_list.append(c.text)

df = pd.DataFrame(list(zip(date_list, headline, tag_list, corpus)),
                  columns=['Date', 'Headlines', 'Tags', 'Corpus'])
df
Output:
Date Headlines Tags Corpus
0 30 Ottobre 2020 Roma: con spranga di ferro danneggia 50 auto i... CRONACA, NEWS Notte di vandalismi a Colli Albani dove un uom...
1 30 Ottobre 2020\n30 Ottobre 2020 Aggressione con machete: grave un 28enne, arre... CRONACA, NEWS Roma - Ha impugnato il suo machete e lo ha agi...
2 30 Ottobre 2020\n30 Ottobre 2020 Deep State e globalismo, Mons. Viganò scrive a... CRONACA, NEWS LETTERA APERTA\r\nAL PRESIDENTE DEGLI STATI UN...
3 30 Ottobre 2020 Meluzzi e Scandurra: “Sacrificare libertà per ... CRONACA, NEWS "Sacrificare la libertà per la sicurezza è un ...
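To collect every page rather than just the first, the same selectors can be wrapped in a loop. A minimal sketch, assuming the site keeps its /category/cron/page/{n}/ URL pattern and that a page with no matching headlines (or a non-200 response) marks the end:

import bs4, requests
import pandas as pd

headline, corpus, date_list, tag_list = [], [], [], []
page = 1
while True:
    r = requests.get(f"https://www.imolaoggi.it/category/cron/page/{page}/")
    soup = bs4.BeautifulSoup(r.text, 'html.parser')
    headlines = soup.select('h3.entry-title')
    if r.status_code != 200 or not headlines:
        break  # no more pages
    for t, s, d, c in zip(headlines,
                          soup.select('div.entry-meta + p'),
                          soup.select('div.entry-meta span.posted-on'),
                          soup.select('span.cat-links')):
        headline.append(t.text)
        corpus.append(s.text.strip())
        date_list.append(d.text)
        tag_list.append(c.text)
    page += 1

df = pd.DataFrame(list(zip(date_list, headline, tag_list, corpus)),
                  columns=['Date', 'Headlines', 'Tags', 'Corpus'])

Fetching thousands of pages one by one is slow, though; the answer below fetches them concurrently with a thread pool instead: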

import requests
from bs4 import BeautifulSoup
from concurrent.futures import ThreadPoolExecutor
import pandas as pd

def main(req, num):
    r = req.get("https://www.imolaoggi.it/category/cron/page/{}/".format(num))
    soup = BeautifulSoup(r.content, 'html.parser')
    goal = [(x.time.text, x.h3.a.text, x.select_one("span.cat-links").get_text(strip=True), x.p.get_text(strip=True))
            for x in soup.select("div.entry-content")]
    return goal

with ThreadPoolExecutor(max_workers=30) as executor:
    with requests.Session() as req:
        fs = [executor.submit(main, req, num) for num in range(1, 2937)]
        allin = []
        for f in fs:
            allin.extend(f.result())
        df = pd.DataFrame.from_records(
            allin, columns=["Date", "Title", "Tags", "Content"])
        print(df)
        df.to_csv("result.csv", index=False)
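One caveat with the threaded version: if any single request fails, f.result() re-raises the exception and aborts the whole collection. A small guard avoids that (a sketch only, reusing the fs list from above):

allin = []
for f in fs:
    try:
        allin.extend(f.result())
    except Exception as e:
        # skip pages that failed to download or parse
        print("page skipped:", e)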

Related

Trying to scrape .md file into a csv

I'm trying to scrape a git .md file. I have made a Python scraper, but I'm kind of stuck on how to actually get the data I want. The page has a long list of job listings, all in separate li elements, and I want to get the a elements. After the a elements there is plain text separated by |, which I want to scrape as well. I really want this to end up as a CSV file with the a tag as a column, the location text before the | as a column, and the remaining description text as a column.
Here's my code:
from bs4 import BeautifulSoup
import requests
import json

def getLinkData(link):
    return requests.get(link).content

content = getLinkData('https://github.com/poteto/hiring-without-whiteboards/blob/master/README.md')
soup = BeautifulSoup(content, 'html.parser')

ul = soup.find_all('ul')
li = soup.find_all("li")
data = []
for uls in ul:
    rows = uls.find_all('a')
    data.append(rows)
print(data)
When I run this I get the a tags, but obviously not the rest yet. There seem to be a few other ul elements included. I just want the one with all the job lis, but neither the lis nor the ul have any ids or classes. Any suggestions on how to accomplish this? Maybe add pandas into it (not sure how).
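One approach is to skip the HTML entirely and parse the raw markdown, which GitHub serves at raw.githubusercontent.com: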
import requests
import pandas as pd
import numpy as np

url = 'https://raw.githubusercontent.com/poteto/hiring-without-whiteboards/master/README.md'
res = requests.get(url).text

jobs = res.split('## A - C\n\n')[1].split('\n\n## Also see')[0]
jobs = [j[3:] for j in jobs.split('\n') if j.startswith('- [')]

df = pd.DataFrame(columns=['Company', 'URL', 'Location', 'Info'])
for i, job in enumerate(jobs):
    company, rest = job.split(']', 1)
    url, rest = rest[1:].split(')', 1)
    rest = rest.split(' | ')
    if len(rest) == 3:
        _, location, info = rest
    else:
        _, location = rest
        info = np.NaN
    df.loc[i, :] = (company, url, location, info)

df.to_csv('file.csv')
print(df.head())
prints

| index | Company | URL | Location | Info |
| --- | --- | --- | --- | --- |
| 0 | Able | https://able.co/careers | Lima, PE / Remote | Coding interview, Technical interview (Backlog Refinement + System Design), Leadership interview (Behavioural) |
| 1 | Abstract | https://angel.co/abstract/jobs | San Francisco, CA | NaN |
| 2 | Accenture | https://www.accenture.com/us-en/careers | San Francisco, CA / Los Angeles, CA / New York, NY / Kuala Lumpur, Malaysia | Technical phone discussion with architecture manager, followed by behavioral interview focusing on soft skills |
| 3 | Accredible | https://www.accredible.com/careers | Cambridge, UK / San Francisco, CA / Remote | Take home project, then a pair-programming and discussion onsite / Skype round. |
| 4 | Acko | https://acko.com | Mumbai, India | Phone interview, followed by a small take home problem. Finally a F2F or skype pair programming session |
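Alternatively, the same fields can be scraped from the rendered GitHub page with BeautifulSoup: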
import requests
from bs4 import BeautifulSoup
import pandas as pd
from itertools import chain

def main(url):
    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'lxml')
    goal = list(chain.from_iterable([[(
        i['href'],
        i.get_text(strip=True),
        *(i.next_sibling[3:].split(' | ', 1) if i.next_sibling else ['']*2))
        for i in x.select('a')] for x in soup.select(
            'h2[dir=auto] + ul', limit=9)]))
    df = pd.DataFrame(goal)
    df.to_csv('data.csv', index=False)

main('https://github.com/poteto/hiring-without-whiteboards/blob/master/README.md')

Webscraping in Python; pagination problem

I am still a beginner, so I'm sorry if this is a stupid question. I am trying to scrape some news articles for my master's analysis through a Jupyter notebook, but I am struggling with pagination. How can I fix that?
Here is the code:
from bs4 import BeautifulSoup
import requests
import pandas as pd

danas = []
base_url = 'https://www.danas.rs/tag/izbori-2020/page/'
r = requests.get(base_url)
c = r.content
soup = BeautifulSoup(c,"html.parser")

paging = soup.find("div",{"column is-8"}).find("div",{"nav-links"}).find_all("a")
start_page = paging[1].int
last_page = paging[len(paging)-1].int

web_content_list = []
for page_number in range(int(float(start_page)),int(float(last_page)) + 1):
    url = base_url+str(page_number)+"/.html"
    r = requests.get(base_url+str(page_number))
    c = r.content
    soup = BeautifulSoup(c,"html.parser")

    if r.status_code == 200:
        soup = BeautifulSoup(r.content, 'html.parser')
        try:
            headline = soup.find('h1', {'class': 'post-title'}).text.strip()
        except:
            headline = None
        try:
            time = soup.find('time', {'class': 'entry-date published'}).text.strip()[:17]
        except:
            time = None
        try:
            descr = soup.find('div', {'class': 'post-intro-content content'}).text.strip()
        except:
            descr = None
        try:
            txt = soup.find('div', {'class': 'post-content content'}).text.strip()
        except:
            txt = None

        # create a list with all scraped info
        danas = [headline,
                 date,
                 time,
                 descr,
                 txt]

        web_content_list.append(danas)
    else:
        print('Oh No! ' + l)

dh = pd.DataFrame(danas)
dh.head()
And here is the error that pops out:

AttributeError                            Traceback (most recent call last)
<ipython-input-10-1c9e3a7e6f48> in <module>
     11 soup = BeautifulSoup(c,"html.parser")
     12
---> 13 paging = soup.find("div",{"column is-8"}).find("div",{"nav-links"}).find_all("a")
     14 start_page = paging[1].int
     15 last_page = paging[len(paging)-1].int

AttributeError: 'NoneType' object has no attribute 'find'
Well, one issue is that 'https://www.danas.rs/tag/izbori-2020/page/' returns Greška 404: Tražena stranica nije pronađena ("Error 404: the requested page was not found") on the initial request, so you will need to address that.
The second issue is pulling in the start page and end page. Just curious: why would you search for a start page? All pages start at 1.
Another question: why convert to float, then to int? Just get the page as an int.
Third, you never declare your variable date.
Fourth, you are only grabbing the first article on each page. Is that what you want, or do you want all the articles on the page? I left your code as is, since your question refers to iterating through the pages.
Fifth, if you want the full text of the articles, you'll need to follow each of the article links.
There are a few more issues with the code too. I tried to comment so you could see them. Compare this code to yours, and if you have questions, let me know:
Code:
from bs4 import BeautifulSoup
import requests
import pandas as pd

base_url = 'https://www.danas.rs/tag/izbori-2020/'
r = requests.get(base_url)
c = r.text
soup = BeautifulSoup(c,"html.parser")

paging = soup.find("div",{"column is-8"}).find("div",{"nav-links"}).find_all("a")
start_page = 1
last_page = int(paging[1].text)

web_content_list = []
for page_number in range(int(start_page),int(last_page) + 1):
    url = base_url + 'page/' + str(page_number) #<-- fixed this
    r = requests.get(url)
    c = r.text
    soup = BeautifulSoup(c,"html.parser")

    if r.status_code == 200:
        soup = BeautifulSoup(r.content, 'html.parser')
        articles = soup.find_all('article')
        for article in articles:
            try:
                headline = soup.find('h2', {'class': 'article-post-title'}).text.strip()
            except:
                headline = None
            try:
                time = soup.find('time')['datetime']
            except:
                time = None
            try:
                descr = soup.find('div', {'class': 'article-post-excerpt'}).text.strip()
            except:
                descr = None

            # create a dict with all scraped info <--- changed to dictionary so that you have column:value when you create the dataframe
            danas = {'headline':headline,
                     'time':time,
                     'descr':descr}

            web_content_list.append(danas)

        print('Collected: %s of %s' %(page_number, last_page))
    else:
        #print('Oh No! ' + l) #<--- what is l?
        print('Oh No!')

dh = pd.DataFrame(web_content_list) #<-- need the full appended list, not danas, as that's overwritten on each iteration
dh.head()
Output:
print(dh.head().to_string())
headline time descr
0 Vučić saopštava ime mandatara vlade u 20 časova 2020-10-05T09:00:05+02:00 Predsednik Aleksandar Vučić će u danas, nakon sastanka Predsedništva Srpske napredne stranke (SNS) saopštiti ime mandatara za sastav nove Vlade Srbije. Vučić će odluku saopštiti u 20 sati u Palati Srbija, rečeno je FoNetu u kabinetu predsednika Srbije.
1 Nova skupština i nova vlada 2020-08-01T14:00:13+02:00 Saša Radulović biće poslanik još nekoliko dana i prvi je objavio da se vrši dezinfekcija skupštinskih prostorija, što govori u prilog tome da će se novi saziv, izabran 21. juna, ipak zakleti u Domu Narodne skupštine.
2 Brnabić o novom mandatu: To ne zavisi od mene, SNS ima dobre kandidate 2020-07-15T18:59:43+02:00 Premijerka Ana Brnabić izjavila je danas da ne zavisi od nje da li će i u novom mandatu biti na čelu Vlade Srbije.
3 Državna izborna komisija objavila prve rezultate, HDZ ubedljivo vodi 2020-07-05T21:46:56+02:00 Državna izborna komisija (DIP) objavila je večeras prve nepotpune rezultate po kojima vladajuća Hrvatska demokratska zajednica (HDZ) osvaja čak 69 mandata, ali je reč o rezultatima na malom broju prebrojanih glasova.
4 Analiza Pravnog tima liste „Šabac je naš“: Ozbiljni dokazi za krađu izbora 2020-07-02T10:53:57+02:00 Na osnovu izjave 123 birača, od kojih je 121 potpisana i sa matičnim brojem, prikupljenim u roku od 96 sati nakon zatvaranja biračkih mesta u nedelju 21. 6. 2020. godine u 20 časova, uočena su 263 kršenja propisa na 55 biračkih mesta, navodi se na početku Analize koju je o kršenju izbornih pravila 21. juna i uoči izbora sačinio pravni tim liste „Nebojša Zelenović – Šabac je naš“.
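For the fifth point above, here is a hedged sketch of following each article link to collect the full text. It assumes it runs inside the page loop where articles is defined, and the 'post-content content' class is taken from the asker's original code, so it may need checking against the live site:

for article in articles:
    link = article.find('a')['href']  # first link in the article card
    art_soup = BeautifulSoup(requests.get(link).content, 'html.parser')
    try:
        txt = art_soup.find('div', {'class': 'post-content content'}).text.strip()
    except AttributeError:
        txt = None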

Python - How to extract text between two variables in a big text

I'm kinda new to Python and I decided to automatically collect the menu of my canteen. I don't want to use a Chrome driver, so I planned to use requests, but I can't manage to collect only the menu of the current day; I end up taking the menu for the whole week.
So I have this text with all the menus and want to extract the part between the current day and the next one. I have the variables with the current day and the next one, but I don't know how to proceed from there.
Here is what I did:
(The thing I can't fix to get the text is at the end, where I use re.search)
from datetime import *
from bs4 import BeautifulSoup
import requests
import re

url = "https://www.crous-lille.fr/restaurant/r-u-mont-houy-2/"
r = requests.get(url)
soup = BeautifulSoup(r.text, "html.parser")

def getMenu():
    # This gives us the date number without the 0 if it starts with it
    # i.e. 06 october becomes 6 october
    day = datetime.now()
    day = day.strftime("%d")
    listeD = [int(d) for d in str(day)]
    if listeD[0] == 0:
        listeD.pop(0)
    strings = [str(integer) for integer in listeD]
    a_string = "".join(strings)
    pfDay = int(a_string)
    dayP1 = pfDay+1 # I have to change the +1 because it doesn't go back to 1 if the next day is a new month
                    # I'm too lazy for now though

    # collect menu
    data = soup.find_all('div', {"id":"menu-repas"})
    data = data[0].text
    result = re.search(r'str(pfDay) \.(.*?) str(dayP1)', data) # The thing I don't know how to fix
                                                               # The variables are between the quotes and then not recognized as so
    print(result)

getMenu()
How should I fix it?
Thank you in advance ^^
You can use the .get_text() method. The current day is the first class="content" under id="menu-repas":
import requests
from bs4 import BeautifulSoup
url = "https://www.crous-lille.fr/restaurant/r-u-mont-houy-2/"
soup = BeautifulSoup(requests.get(url).content, "html.parser")
print(soup.select_one('#menu-repas .content').get_text(strip=True, separator='\n'))
Prints:
Petit déjeuner
Pas de service
Déjeuner
Les Plats Du Jour
pizza kébab
gnocchis fromage
chili con carné
dahl coco et riz blanc végé
poisson blanc amande
pâtes sauce carbonara
Les accompagnements
riz safrane
pâtes
chou vert
poêlée festive
frites
Dîner
Pas de service
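If you do want to keep the regex approach instead, the immediate bug in the original code is that str(pfDay) inside the quotes is matched as literal text; the variables never make it into the pattern. An f-string interpolates them (a sketch, assuming the day numbers actually appear as plain numbers in the scraped text):

import re

# the values of pfDay and dayP1 are inserted into the pattern;
# re.DOTALL lets .*? match across the newlines in the menu text
result = re.search(rf'{pfDay}(.*?){dayP1}', data, re.DOTALL)
if result:
    print(result.group(1))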

I can't get all the data while web scraping

I'm trying to web scrape this URL: https://www.ventanillaunicaenfermeria.es/BuscarColegiados.php.
I need to gather the values of the "Nº cole." column and the "Nombre Colegiado" column.
I'm using BeautifulSoup, but I only get the values of the "Nº cole." column. How can I fix that?
Thanks!
This is my code:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np

page = requests.get('https://www.ventanillaunicaenfermeria.es/BuscarColegiados.php')
soup = BeautifulSoup(page.text, 'html.parser')

data = soup.find_all("span",{'class':'colColegiado'})
numero_col = []
for i in data:
    data_num = i.text.strip()
    numero_col.append(data_num)
numero_col
['Nº cole.',
'6478',
'13107',
'7341',
'12110',
'5625',
'4877',
'4700',
'9126',
'8444',
'13120',
'5023',
'12235',
'7747',
'17701',
'17391',
'17944',
'17772',
'7230',
'11729',
'17275']
You're currently fetching the values from the wrong html elements - it should be from all <p>s with the resalto class.
import requests
from bs4 import BeautifulSoup
#import pandas as pd
#import numpy as np

page = requests.get('https://www.ventanillaunicaenfermeria.es/BuscarColegiados.php')
soup = BeautifulSoup(page.text, 'html.parser')

data = soup.find_all("p",{'class':'resalto'})
schools = []
for result in data:
    data_num = result.contents[0].text.strip()
    #numero_col.append(data_num)
    data_name = str(result.contents[1])
    schools.append((data_num,data_name))
print(schools)
Instead of selecting all p elements at once, you can loop through the paragraphs in the table only. The following code takes a page number and saves the table to a csv file.
import requests
from bs4 import BeautifulSoup
import pandas as pd

pageno = 1
res = requests.get(f'https://www.ventanillaunicaenfermeria.es/BuscarColegiados.php?nombre=&ap=&colegio=&col=&nif=&pagina={pageno}')
soup = BeautifulSoup(res.text,"html.parser")

header = soup.find("div", {"id":"contactaForm"}).find("h4")
cols = [header.find("span").get_text(), header.get_text().replace(header.find("span").get_text(),"")]

data = []
for p in soup.find("div", {"id":"contactaForm"}).find_all("p"):
    if len(p['class']) == 0 or p['class'][0] == "resalto":
        child = list(p.children)
        data.append([child[0].get_text(strip=True), child[1]])

df = pd.DataFrame(data,columns=cols)
df.to_csv("data.csv", index=False)
print(df)
Output:
Nº cole. Nombre colegiado
0 6478 GUADALUPE LAZARO LAZARO
1 13107 JOSE MARIA PIÑA MANZANO
2 7341 HEIKE ELFRIEDE BIRKHOLZ
3 12110 ESTHER TIZON ROLDAN
4 5625 MARIA DOLORES TOMAS GARCIA-VAQUERO
5 4877 MARIA CARMEN CASADO LLAVONA
6 4700 MANUEL GUILABERT ORTEGA-VILLAIZAN
7 9126 MARIA ESPERANZA ASENSIO ALMAZAN
8 8444 CONCEPCION VIALARD RODRIGUEZ
9 13120 NURIA VILLAESCUSA SANCHEZ
10 5023 ARTURO BONET BLANCO
11 12235 ALFONSO JIMENEZ LOPEZ
12 7747 JACOBUS PETRUS SINNIGE
13 17701 ANIA BRAVO FIGUEREDO
14 17391 LUSINE DAMIRCHYAN
15 17944 ISALKOU DJIL MERHBA
16 17772 CARLA DENISSE FIGUEROA PIEDRA
17 7230 MARIA ISABEL VISO CABAÑERO
18 11729 PILAR GARCIA SALAZAR
19 17275 MARIA LOURDES MALLEN LLUIS
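Since the pagina query parameter selects the page, the same parsing can be repeated over several pages. A hedged sketch (the range is illustrative; the real page count would have to be read from the site's pager):

import requests
from bs4 import BeautifulSoup
import pandas as pd

data = []
for pageno in range(1, 6):  # illustrative bound, not the real page count
    res = requests.get(f'https://www.ventanillaunicaenfermeria.es/BuscarColegiados.php?nombre=&ap=&colegio=&col=&nif=&pagina={pageno}')
    soup = BeautifulSoup(res.text, "html.parser")
    # same row filter as the answer above
    for p in soup.find("div", {"id":"contactaForm"}).find_all("p"):
        if len(p['class']) == 0 or p['class'][0] == "resalto":
            child = list(p.children)
            data.append([child[0].get_text(strip=True), child[1]])

df = pd.DataFrame(data, columns=["Nº cole.", "Nombre colegiado"])
df.to_csv("data.csv", index=False)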

Scraping web site

I have this:
from bs4 import BeautifulSoup
import requests

page = requests.get("https://www.marca.com/futbol/primera/equipos.html")
soup = BeautifulSoup(page.content, 'html.parser')
equipos = soup.findAll('li', attrs={'id':'nombreEquipo'})
aux = []
for equipo in equipos:
    aux.append(equipo)
If I do print(aux[0]) I get this:
,
Villarreal
Entrenador:
Javier Calleja
Jugadores:
1 Sergio Asenjo
13 Andrés Fernández
25 Mariano Barbosa
...
And my problem is that I want to take this tag:
<h2 class="cintillo">Villarreal</h2>
And this tag:
1 Sergio Asenjo
And put them into a database.
How can I do that?
Thanks
You can extract the first <h2 class="cintillo"> element from equipo like this:
h2 = str(equipo.find('h2', {'class':'cintillo'}))
If you only want the inner HTML (without any tags), use:
h2 = equipo.find('h2', {'class':'cintillo'}).text
And you can extract all the <span class="dorsal-jugador"> elements from equipo like this:
jugadores = equipo.find_all('span', {'class':'dorsal-jugador'})
Then append h2 and jugadores to a multi-dimensional list.
Full code:
from bs4 import BeautifulSoup
import requests

page = requests.get("https://www.marca.com/futbol/primera/equipos.html")
soup = BeautifulSoup(page.content, 'html.parser')
equipos = soup.findAll('li', attrs={'id':'nombreEquipo'})
aux = []
for equipo in equipos:
    h2 = equipo.find('h2', {'class':'cintillo'}).text
    jugadores = equipo.find_all('span', {'class':'dorsal-jugador'})
    aux.append([h2,[j.text for j in jugadores]])

# format list for printing
print('\n\n'.join(['--'+i[0]+'--\n' + '\n'.join(i[1]) for i in aux]))
Output sample:
--Alavés--
Fernando Pacheco
Antonio Sivera
Álex Domínguez
Carlos Vigaray
...
Demo: https://repl.it/#glhr/55550385
You could create a dictionary with team names as keys and {entrenador: players} as values:
import requests
from bs4 import BeautifulSoup as bs

r = requests.get('https://www.marca.com/futbol/primera/equipos.html')
soup = bs(r.content, 'lxml')

teams = {}
for team in soup.select('[id=nombreEquipo]'):
    team_name = team.select_one('.cintillo').text
    entrenador = team.select_one('dd').text
    players = [item.text for item in team.select('.dorsal-jugador')]
    teams[team_name] = {entrenador : players}
print(teams)
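The original question asked how to put the scraped data into a database, which neither answer shows. A minimal sketch using the standard library's sqlite3, storing (team, player) rows from the teams dict built above (the database filename and table name are illustrative):

import sqlite3

con = sqlite3.connect('equipos.db')  # illustrative filename
con.execute('CREATE TABLE IF NOT EXISTS jugadores (equipo TEXT, jugador TEXT)')
for team_name, info in teams.items():
    for entrenador, players in info.items():
        # one row per player for this team
        con.executemany('INSERT INTO jugadores VALUES (?, ?)',
                        [(team_name, p) for p in players])
con.commit()
con.close()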
