Web scraping returning empty dictionary - python
I'm trying to scrape all the data from the website https://ricetta.it/ricette-secondi using Python and Selenium.
I'd like to put the data into dictionaries, as seen in the code below.
However, this just returns an empty list.
import pprint

detail_recipes = []
for recipe in list_recipes:
    title = ""
    description = ""
    ingredient = ""
    if len(recipe.find_elements_by_css_selector(".post-title")) > 0:
        title = recipe.find_elements_by_css_selector(".post-title")[0].text
    if len(recipe.find_elements_by_css_selector(".post-excerpt")) > 0:
        description = recipe.find_elements_by_css_selector(".post-excerpt")[0].text
    if len(recipe.find_elements_by_css_selector(".nm-ingr")) > 0:
        ingredient = recipe.find_elements_by_css_selector(".nm-ingr")[0].text
    detail_recipes.append({'title': title,
                           'description': description,
                           'ingredient': ingredient})

len(detail_recipes)
pprint.pprint(detail_recipes[0:10])
You can try this:
import requests
import numpy as np
from bs4 import BeautifulSoup as bs
import pandas as pd

url = "https://ricetta.it/ricette-secondi"
page = requests.get(url)
soup = bs(page.content, 'lxml')

df = {'title': [], 'description': [], 'ingredient': []}
for div in soup.find_all("div", class_="post-bordered"):
    df["title"].append(div.find(class_="post-title").text)
    try:
        df["description"].append(div.find(class_="post-excerpt").text)
    except AttributeError:  # find() returns None when a recipe has no excerpt
        df["description"].append(np.nan)
    i = div.find_all(class_="nm-ingr")
    if len(i) > 0:
        df["ingredient"].append([j.text for j in i])
    else:
        df["ingredient"].append(np.nan)

df = pd.DataFrame(df)
df.dropna(axis=0, inplace=True)
print(df)
Output:
title ... ingredient
0 Polpette di pane e formaggio ... [uovo, pane, pangrattato, parmigiano, latte, s...
1 Torta 7 vasetti alle melanzane ... [uovo, olio, latte, yogurt, farina 00, fecola ...
2 Torta a sole con zucchine e speck ... [pasta sfoglia, zucchina, ricotta, uovo, speck...
3 Pesto di limoni ... [limone, pinoli, parmigiano, basilico, prezzem...
4 Bombe di patate ... [patata, farina 00, uovo, parmigiano, sale e p...
5 Polpettone di zucchine ... [zucchina, uovo, parmigiano, pangrattato, pros...
6 Insalata di pollo ... [petto di pollo, zucchina, pomodorino, insalat...
7 Club sandwich ... [pane, petto di pollo, pomodoro, lattuga, maio...
8 Crostata di verdure ... [farina 00, burro, acqua, sale, zucchina, pomo...
9 Pesto di barbabietola ... [barbabietola, parmigiano, pinoli, olio, sale,...
[10 rows x 3 columns]
I don't know whether you already use these libraries, but that website doesn't use JavaScript to load its data, so we can scrape it with requests and bs4. Most people prefer these libraries when a site doesn't load its data with JavaScript: they are simpler and faster than Selenium. For displaying the data I am using pandas, which is also the preferred library for working with table-like data. It prints the data in a table-like structure, and you can also save the scraped data to a CSV or Excel file.
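As a small illustration of working with the scraped table (a hedged sketch; the substring 'uovo' is just an example taken from the output above), you can filter the DataFrame by ingredient once it is built:

# keep only recipes whose ingredient list mentions 'uovo' (egg);
# df and the 'ingredient' column are the ones built in the code above
con_uovo = df[df["ingredient"].apply(lambda ingr: "uovo" in ingr)]
print(con_uovo["title"])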
If you want to scrape the data from all of the following pages as well, try this:

df = {'title': [], 'description': [], 'ingredient': []}
for page_no in range(0, 108):
    url = f"https://ricetta.it/ricette-secondi?page={page_no}"
    page = requests.get(url)
    soup = bs(page.content, 'lxml')
    for div in soup.find_all("div", class_="post-bordered"):
        df["title"].append(div.find(class_="post-title").text)
        try:
            df["description"].append(div.find(class_="post-excerpt").text)
        except AttributeError:
            df["description"].append(np.nan)
        ingr = div.find_all(class_="nm-ingr")
        if len(ingr) > 0:
            df["ingredient"].append([j.text for j in ingr])
        else:
            df["ingredient"].append(np.nan)
df = pd.DataFrame(df)
It will scrape all 107 pages of data from the website.
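If you'd rather not hardcode the page count, here is a minimal sketch of an alternative (this assumes that pages past the last one simply return no .post-bordered divs, which I haven't verified):

from itertools import count

for page_no in count(0):
    page = requests.get(f"https://ricetta.it/ricette-secondi?page={page_no}")
    soup = bs(page.content, 'lxml')
    divs = soup.find_all("div", class_="post-bordered")
    if not divs:
        break  # no recipes on this page: we ran past the end
    # ...same per-recipe extraction as in the loop above...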
You can save this df to a CSV or Excel file using:
df.to_csv("<filename.csv>")
# or for excel:
df.to_excel("<filename.xlsx>")
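One caveat, since the ingredient column holds Python lists rather than strings: after dropping rows with missing values as above, joining each list into a single string makes the CSV round-trip more cleanly, and to_excel additionally needs an Excel engine such as openpyxl installed (the filenames here are placeholders):

df["ingredient"] = df["ingredient"].apply(", ".join)
df.to_csv("recipes.csv", index=False)
df.to_excel("recipes.xlsx", index=False)  # requires an engine like openpyxl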
Edit:
Since you asked how to scrape the link of each recipe as well, I have figured out two approaches. The first is to build the link yourself by replacing the spaces in the title with -; the other is to scrape the link from the page. For the latter you can use this piece of code:
div.find(class_="post-title")["href"]
It will return the link of that recipe. For the other approach (building the link from the title) you can do this:
df["links"]=df["title"].apply(lambda x: "https://ricetta.it/"+x.replace(" ","-").lower())
#.lower() is just to not make like a random text but it you remove it also it works.
But I personally suggest just scraping the link from the website, because when building links on our own we may make mistakes.
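Putting that together, a minimal sketch of collecting the links inside the same scraping loop (this assumes, as the snippet above does, that the .post-title element carries the href, and that it may be relative, hence the prefix):

links = []
for div in soup.find_all("div", class_="post-bordered"):
    a = div.find(class_="post-title")
    href = a.get("href", "") if a else ""
    # prefix relative links with the site root
    if href and not href.startswith("http"):
        href = "https://ricetta.it/" + href.lstrip("/")
    links.append(href)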
Related
Python web scraping: Can I somehow convert a "string"[2] to "string[2]" in python?
It might be difficult to understand the question, so here is a kind of visualization:

a = "Bailando"[43]

How do I convert a into b or c?

b = "Bailando[43]"
c = "Bailando"

I am trying to scrape data from a Wikipedia table; in that table some elements were links themselves while the other ones were just plain text. So when I tried to access those elements' names within the a tag I was getting their names, but the plain-text ones were returning [n] (those numbers in square brackets, links in themselves, beside any definition). Here's a part of the code:

for tr in body.find_all('tr')[1:]:
    for td in tr.find_all('td')[1:2]:
        video_name = td.a.text.strip()
    for td in tr.find_all('td')[2:3]:
        try:
            u = td.a['title']
        except TypeError:
            for td in tr.find_all('td')[2:3]:
                uploader = td.text
        else:
            uploader = u
    uploaders.append(uploader)
    video_names.append(video_name)

[image of the table] After changing one line of code I am getting something like this: [image of the output]
import pandas as pd

df = pd.read_html(
    'https://en.wikipedia.org/wiki/List_of_most-viewed_YouTube_videos',
    attrs={'class': 'wikitable'})[1].iloc[:-1, 0]
df = df.str.split('"').str[1]
print(df.values)

Output:

['Baby Shark Dance' 'Despacito' 'See You Again' 'Gangnam Style' 'Baby'
 'Bad Romance' 'Charlie Bit My Finger' 'Evolution of Dance' 'Girlfriend'
 'Evolution of Dance' 'Music Is My Hot Hot Sex' 'Evolution of Dance'
 'Pokemon Theme Music Video' 'Myspace – The Movie' 'Phony Photo Booth'
 'The Chronic of Narnia Rap' 'Ronaldinho: Touch of Gold' 'I/O Brush'
 'Me at the zoo']
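For context on why that works: each cell holds the title wrapped in double quotes followed by the bracketed reference, so splitting on the quote character leaves the title at index 1. Using the question's own example:

parts = '"Bailando"[43]'.split('"')
print(parts)     # ['', 'Bailando', '[43]']
print(parts[1])  # 'Bailando'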
How do I separate multiple items within a div class into separate lists?
I am trying to scrape a website using Selenium. The element on the site is formatted in a way where it has 3 categories' worth of information I want to split up. The following is what I see when I inspect the element in my browser:

<div class="break-text ng-binding ng-scope" ng-if="category.dataType == 'breakText'">
    Pinson, AL
    <br>
    Pinson Valley
</div>

The format is "City, State" followed by "High School", i.e. "Pinson, AL" and "Pinson Valley" respectively. How do I separate these into different lists when scraping the data?

city = driver.find_elements_by_class_name('break-text')
state = driver.find_elements_by_class_name('break-text')
highschool = driver.find_elements_by_class_name('break-text')
Try something like this:

data = driver.find_elements_by_xpath('//div[@class="break-text ng-binding ng-scope"]')
for d in data:
    city = d.text.split('\n')[0].split(',')[0]
    state = d.text.split('\n')[0].split(',')[1]
    highschool = d.text.split('\n')[1]
    print(city)
    print(state.strip())
    print(highschool)

Output:

Pinson
AL
Pinson Valley
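A slightly tidier sketch of the same split logic, under the same assumption that each element's text is exactly two lines with city and state comma-separated on the first:

for d in data:
    location, highschool = d.text.split('\n')
    city, state = [part.strip() for part in location.split(',')]
    print(city, state, highschool)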
Webscraping with BeautifulSoup in Python tags
I am currently trying to scrape some information from the following link: http://www2.congreso.gob.pe/Sicr/TraDocEstProc/CLProLey2001.nsf/ee3e4953228bd84705256dcd008385e7/4ec9c3be3fc593e2052571c40071de75?OpenDocument

I would like to scrape some of the information in the table using BeautifulSoup in Python. Ideally I would like to scrape the "Grupo Parlamentario," "Título," "Sumilla," and "Autores" from the table as separate items. So far I've developed the following code using BeautifulSoup:

from bs4 import BeautifulSoup
import requests
import pandas as pd

url = 'http://www2.congreso.gob.pe/Sicr/TraDocEstProc/CLProLey2001.nsf/ee3e4953228bd84705256dcd008385e7/4ec9c3be3fc593e2052571c40071de75?OpenDocument'
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')
table = soup.find('table', {'bordercolor': '#6583A0'})

contents = []
summary = []
authors = []
contents.append(table.findAll('font'))
authors.append(table.findAll('a'))

What I'm struggling with is that the code to scrape the authors only scrapes the first author in the list, while ideally I need all of them. This seems odd to me because, looking at the HTML for the page, every author in the list is wrapped in an <a href=...> tag, so I would expect table.findAll('a') to grab all of them. Finally, I'm just dumping the rest of the very messy HTML (title, summary, parliamentary group) into one long string under contents. I'm fairly new to HTML and web scraping, but is there a way to pull these items out and store them individually (i.e. just the title in one object, just the summary in another, etc.)? I'm having a tough time identifying unique tags to do this in the page's HTML. Or is this something I should just clean and parse after scraping?
To get the authors you can use:

soup.find('input', {'name': 'NomCongre'})['value']

Output:

'Santa María Calderón Luis,Alva Castro Luis,Armas Vela Carlos,Cabanillas Bustamante Mercedes,Carrasco Távara José,De la Mata Fernández Judith,De La Puente Haya Elvira,Del Castillo Gálvez Jorge,Delgado Nuñez Del Arco José,Gasco Bravo Luis,Gonzales Posada Eyzaguirre Luis,León Flores Rosa Marina,Noriega Toledo Víctor,Pastor Valdivieso Aurelio,Peralta Cruz Jonhy,Zumaeta Flores César'

To scrape Grupo Parlamentario:

table.find_all('td', {'width': 446})[1].text

Output:

'Célula Parlamentaria Aprista'

To scrape Título:

table.find_all('td', {'width': 446})[2].text

Output:

'IGV/SELECTIVO:D.L.821/LEY INTERPRETATIVA '

To scrape Sumilla:

table.find_all('td', {'width': 446})[3].text

Output:

' Propone la aprobación de una Ley Interpretativa del Texto Original del Numeral 1 del Apéndice II del Decreto Legislativo N°821,modificatorio del Texto Vigente del Numeral 1 del Apéndice II del Texto Único Ordenado de la Ley del Impuesto General a las Ventas y Selectivo al Consumo,aprobado por Decreto Supremo N°054-99-EF. '
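Since the authors come back as a single comma-separated string, a small follow-up: split it into a list of individual names (same soup object as above):

authors = soup.find('input', {'name': 'NomCongre'})['value'].split(',')
print(len(authors))  # 16 individual author names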
Difficulty using beautifulsoup in Python to scrape web data from multiple HTML classes
I am using Beautiful Soup in Python to scrape some data from a property listings site. I have had success scraping the individual elements that I require, but I wish to use a more efficient script to pull back all the data in one command if possible. The difficulty is that the various elements I require reside in different classes. I have tried the following so far:

for listing in content.findAll('h2', attrs={"class": "listing-results-attr"}):
    print(listing.text)

which successfully gives the following list:

15 room mansion for sale
3 bed barn conversion for sale
2 room duplex for sale
1 bed garden shed for sale

Separately, to retrieve the address details for each listing, I have successfully used:

for address in content.findAll('a', attrs={"class": "listing-results-address"}):
    print(address.text)

which gives this:

22 Acacia Avenue, CityName Postcode
100 Sleepy Hollow, CityName Postcode
742 Evergreen Terrace, CityName Postcode
31 Spooner Street, CityName Postcode

And for the property price I have used this:

for prop_price in content.findAll('a', attrs={"class": "listing-results-price"}):
    print(prop_price.text)

which gives:

$350,000
$1,250,000
$750,000
$100,000

This is great, however I need to be able to pull back all of this information in one pass. At present I can do this using something like the code below:

all = content.select("h2.listing-results-attr, a.listing-results-address, a.listing-results-price")

This works somewhat, but it brings back too many additional HTML tags and is just not as elegant as I require. Results as follows:

</a>, <h2 class="listing-results-attr">
15 room mansion for sale
</h2>, <a class="listing-results-address" href="redacted">22 Acacia Avenue, CityName Postcode</a>, <a class="listing-results-price" href="redacted">
$350,000

Expected results should look something like this:

15 room mansion for sale
22 Acacia Avenue, CityName Postcode
$350,000
3 bed barn conversion for sale
100 Sleepy Hollow, CityName Postcode
$1,250,000

etc. I then need to be able to store the results as JSON objects for later analysis. Thanks in advance.
Change your selectors as shown below:

import requests
from bs4 import BeautifulSoup as bs

url = 'https://www.zoopla.co.uk/for-sale/property/caerphilly/?q=Caerphilly&results_sort=newest_listings&search_source=home'
r = requests.get(url)
soup = bs(r.content, 'lxml')
details = [item.text.strip() for item in
           soup.select(".listing-results-attr a, .listing-results-address , .text-price")]

You can view the fields separately with, for example:

prices = details[0::3]
descriptions = details[1::3]
addresses = details[2::3]
print(prices, descriptions, addresses)
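Since the question also asks for JSON objects, a hedged follow-up sketch: zip the three slices into records and dump them ('listings.json' is a placeholder filename, and this assumes the price/description/address ordering shown above holds for every listing):

import json

records = [{'price': p, 'description': d, 'address': a}
           for p, d, a in zip(details[0::3], details[1::3], details[2::3])]
with open('listings.json', 'w') as f:
    json.dump(records, f, indent=2)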
find_all() always returns a list, and strip() removes whitespace from the beginning and end of a string.

import requests
from bs4 import BeautifulSoup as bs

url = 'https://www.zoopla.co.uk/for-sale/property/caerphilly/?q=Caerphilly&results_sort=newest_listings&search_source=home'
r = requests.get(url)
soup = bs(r.content, 'lxml')
results = soup.find("ul", {'class': "listing-results clearfix js-gtm-list"})
for li in results.find_all("li", {'class': "srp clearfix"}):
    price = li.find("a", {"class": "listing-results-price text-price"}).text.strip()
    address = li.find("a", {'class': "listing-results-address"}).text.strip()
    description = li.find("h2", {'class': "listing-results-attr"}).find('a').text.strip()
    print(description)
    print(address)
    print(price)

O/P:

2 bed detached bungalow for sale
Bronrhiw Fach, Caerphilly CF83
£159,950
2 bed semi-detached house for sale
Cwrt Nant Y Felin, Caerphilly CF83
£159,950
3 bed semi-detached house for sale
Pen-Y-Bryn, Caerphilly CF83
£102,950
.....
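A note on the design choice: iterating per <li> keeps the three fields aligned for each listing even if one listing is missing a field, whereas the flat select in the first answer assumes every listing yields exactly three items in a fixed order.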
issues in extracting data from a csv
class QuotesSpider(scrapy.Spider):
    name = "googlemailverif"

    with open('input.csv', "r") as csvfile:
        datareader = csv.reader(csvfile)
        start_urls = ['https://www.google.fr/search?q=email' + str(row[2]) for row in datareader]

    # starting parsing
    def parse(self, response):
        yield {
            'url': response.url,
            'nom': "nom",
            'emails': re.findall(r"[a-zA-Z0-9_\.+-]+@[a-zA-Z0-9_\.+-]+\.[a-zA-Z]{2,6}",
                                 ''.join(response.xpath("//body//text()").extract()).strip()),
            'SIRET': "SIRET",
        }

This code tries to take a company name from the third column of a CSV file and check Google for email addresses. The first column of the CSV contains a value I want to output as "SIRET". How can I do that? If I extract it in start_urls while reading the CSV, my URLs will be wrong. If I read the file again in parse, the values will not line up with the data being parsed, and I may get an error from accessing the file twice. How can I pass the information from the first read through to SIRET in the parse function? I have been struggling with this for hours :( Best,
We can use zip for this:

sirets, start_urls = zip(*[(row[0], 'https://www.google.fr/search?q=email' + str(row[2]))
                           for row in datareader])

Now you have one sequence containing the SIRET values and another containing the URLs.
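To then get the matching SIRET back inside parse, a minimal sketch (not the only way to do it; it assumes Google does not redirect the request, since response.url must match the start URL exactly):

class QuotesSpider(scrapy.Spider):
    name = "googlemailverif"

    with open('input.csv', "r") as csvfile:
        datareader = csv.reader(csvfile)
        sirets, start_urls = zip(*[(row[0], 'https://www.google.fr/search?q=email' + str(row[2]))
                                   for row in datareader])
    # map each start URL back to its SIRET so parse() can look it up
    url_to_siret = dict(zip(start_urls, sirets))

    def parse(self, response):
        yield {
            'url': response.url,
            'SIRET': self.url_to_siret[response.url],
        }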
"SIRET","NIC","L1_NORMALISEE","L2_NORMALISEE","L3_NORMALISEE","L4_NORMALISEE","L5_NORMALISEE","L6_NORMALISEE","L7_NORMALISEE","L1_DECLAREE","L2_DECLAREE","L3_DECLAREE","L4_DECLAREE","L5_DECLAREE","L6_DECLAREE","L7_DECLAREE","NUMVOIE","INDREP","TYPVOIE","LIBVOIE","CODPOS","CEDEX","RPET","LIBREG","DEPET","ARRONET","CTONET","COMET","LIBCOM","DU","TU","UU","EPCI","TCD","ZEMET","SIEGE","ENSEIGNE","IND_PUBLIPO","DIFFCOM","AMINTRET","NATETAB","LIBNATETAB","APET700","LIBAPET","DAPET","TEFET","LIBTEFET","EFETCENT","DEFET","ORIGINE","DCRET","DATE_DEB_ETAT_ADM_ET","ACTIVNAT","LIEUACT","ACTISURF","SAISONAT","MODET","PRODET","PRODPART","AUXILT","NOMEN_LONG","SIGLE","NOM","PRENOM","CIVILITE","RNA","NICSIEGE","RPEN","DEPCOMEN","ADR_MAIL","NJ","LIBNJ","APEN700","LIBAPEN","DAPEN","APRM","ESSEN","DATEESS","TEFEN","LIBTEFEN","EFENCENT","DEFEN","CATEGORIE","DCREN","AMINTREN","MONOACT","MODEN","PRODEN","ESAANN","TCA","ESAAPEN","ESASEC1N","ESASEC2N","ESASEC3N","ESASEC4N","VMAJ","VMAJ1","VMAJ2","VMAJ3","DATEMAJ" "005720164","00028","SA SAINTE ISABELLE","","","236 ROUTE D AMIENS","","80100 ABBEVILLE","FRANCE","SA SAINTE-ISABELLE","","","236 RTE D AMIENS","","80100 ABBEVILLE","","236","","RTE","D AMIENS","80100","","32","Nord-Pas-de-Calais-Picardie","80","1","98","001","ABBEVILLE","80","4","01","248000556","41","2209","1","","1","O","201209","","","8610Z","Activités hospitalières","2008","22","100 à 199 salariés","100","2015","1","19830928","19830928","NR","99","","P","S","O","","0","SA SAINTE-ISABELLE","","","","","","00028","32","80001","","5599","SA à conseil d'administration (s.a.i.)","8610Z","Activités hospitalières","2008","","","","22","100 à 199 salariés","100","2015","ETI","19570101","201209","1","S","O","","","","","","","","","","","","2014-07-30T00:00:00" "005720784","00031","ETABLISSEMENTS DECAYEUX","","","ZONE INDUSTRIELLE","","80210 FEUQUIERES EN VIMEU","FRANCE","ETABLISSEMENTS DECAYEUX","","","ZONE INDUSTRIELLE","","80210 FEUQUIERES EN VIMEU","","","","","ZONE INDUSTRIELLE","80210","","32","Nord-Pas-de-Calais-Picardie","80","1","17","308","FEUQUIERES EN VIMEU","80","1","18","248000630","15","0055","0","","1","O","201209","","","2572Z","Fabrication de serrures et de ferrures","2008","22","100 à 199 salariés","100","2015","4","19930401","19930401","NR","99","","P","S","O","","0","ETABLISSEMENTS DECAYEUX","","","","","","00015","32","80308","","5710","SAS/// société par actions simplifiée","2599A","Fabrication d'articles métalliques ménagers","2008","","N","20160915","32","250 à 499 salariés","200","2015","ETI","19570101","201209","3","S","O","2012","6","2599A","2599A","2599B","2572Z","4649Z","","","","","2001-12-13T00:00:00" This is an extract from the csv Each time I have an "SIRET" as a sirets value, but the other var increments and changes everytime Thank you so much ++