I am a novice with Python and I am doing web scraping as an exercise. I would like to scrape the content and the title of each article on a web page.
I have a problem with my code: I do not think it is very efficient and I would like to optimize it.
The page I am trying to scrape is https://www.ansa.it/sito/notizie/politica/politica.shtml
This is what I have done so far:
#libraries
from bs4 import BeautifulSoup
from bs4 import BeautifulSoup as soup
import requests
import pandas as pd
import urllib.request,sys,time
import csv
from csv import writer
import time
from datetime import datetime
r= requests.get('https://www.ansa.it/sito/notizie/politica/politica.shtml')
b= soup(r.content, 'lxml')
title=[]
links=[]
content=[]
for c in b.findAll('h3',{'class':'news-title'}):
    title.append(c.text.strip())
for c in b.findAll("h3", {"class": "news-title"}):
    links.append(c.a["href"])
for link in links:
    page=requests.get('https://www.ansa.it'+link)
    bsjop=soup(page.content)
    for n in bsjop.findAll('div',{'itemprop': 'articleBody'}):
        content.append(n.text.strip())
The problem is that my output is made of multiple links, titles and contents that do not match each other (for example, one article ends up with a title and a content that have nothing to do with each other).
If you know of ways I can improve my code, it would be nice.
Thanks
To get all article titles, URLs and texts into a Pandas DataFrame you can use the next example (I used the tqdm module to get a nice progress bar):
import requests
import pandas as pd
from tqdm import tqdm
from bs4 import BeautifulSoup
url = "https://www.ansa.it/sito/notizie/politica/politica.shtml"
soup = BeautifulSoup(requests.get(url).content, "html.parser")
all_data = []
for title in tqdm(soup.select("h3.news-title")):
    t = title.get_text(strip=True)
    u = title.a["href"]
    s = BeautifulSoup(
        requests.get("https://www.ansa.it" + u).content, "html.parser"
    )
    text = s.select_one('[itemprop="articleBody"]')
    text = text.get_text(strip=True, separator="\n") if text else ""
    all_data.append([t, u, text])
df = pd.DataFrame(all_data, columns=["Title", "URL", "Text"])
df.to_csv("data.csv", index=False)
Creates data.csv with one row per article (the screenshot from LibreOffice is omitted here).
import requests
from bs4 import BeautifulSoup
import pandas as pd
news_list = []
r = requests.get('https://www.ansa.it/sito/notizie/politica/politica.shtml')
soup = BeautifulSoup(r.text, 'html.parser')
articles = soup.select('article.news')
for art in articles:
    try:
        title = art.select_one('h3').text.strip()
        if 'javascript:void(0);' in art.select('a')[0].get('href'):
            url = 'https://www.ansa.it' + art.select('a')[1].get('href')
        else:
            url = 'https://www.ansa.it' + art.select('a')[0].get('href')
        r = requests.get(url)
        soup = BeautifulSoup(r.text, 'html.parser')
        content = soup.select_one('div.news-txt').text.strip()
        print(f'retrieving {url}')
        news_list.append((title, content, url))
    except Exception as e:
        print(art.text.strip(), e)
df = pd.DataFrame(news_list, columns = ['Title', 'Content', 'Url'])
print(df)
This will raise errors for some links on the page, which you will have to investigate and debug (and ask for help if you need it; it's an important part of the learning process), and it returns a dataframe with the articles that were successfully retrieved, which looks like this:
Title Content Url
0 Letta: 'Ora difficile ricomporre con M5s'. Mel... Partiti scossi dallo scioglimento anticipato d... https://www.ansa.it/sito/notizie/politica/2022...
1 L'emozione di Draghi: 'Ancheil cuore dei banch... "Certe volte anche il cuore dei banchieri cent... https://www.ansa.it/sito/notizie/politica/2022...
2 La giornata di Draghi in foto https://www.ansa.it/sito/photogallery/primopia...
3 Il timing del voto, liste entro un mese. A Fer... Le liste dei candidati entro un mese a partire... https://www.ansa.it/sito/notizie/politica/2022...
4 Si lavora sulla concorrenza, ipotesi stralcio ... Il DDL Concorrenza andrà in Aula alla Camera l... https://www.ansa.it/sito/notizie/economia/2022...
5 Le cifre del governo Draghi: 55 voti fiducia e... Una media di 7,4 leggi approvate ogni mese su ... https://www.ansa.it/sito/notizie/politica/2022...
6 I 522 giorni del governo Draghi LE FOTO L'arrivo di SuperMario, gli incontri, le allea... https://www.ansa.it/sito/photogallery/primopia...
7 Presidi, disappunto per le urne in autunno, ci... "C'è disappunto non preoccupazione per le urne... https://www.ansa.it/sito/notizie/politica/2022...
8 Ucraina: Di Maio,sostegno ricerca mercati alte... (ANSA) - ROMA, 22 LUG - Lo scoppio del conflit... https://www.ansa.it/sito/photogallery/primopia...
9 Il giorno più lungo: dal Senato fiducia a Drag... Passa la fiducia al premier Draghi in Senato, ... https://www.ansa.it/sito/notizie/politica/2022...
10 Oltre mille sindaci a sostegno di Draghi Nei giorni che attendono il mercoledì che deci... https://www.ansa.it/sito/notizie/politica/2022...
11 Mattarella scioglie le Camere, si vota il 25 s... E' stata una scelta "inevitabile", il voto del... https://www.ansa.it/sito/notizie/politica/2022...
12 Camere sciolte ma il vitalizio è salvo Nonostante lo scioglimento delle Camere antici... https://www.ansa.it/sito/notizie/politica/2022...
13 Ultimatum di Conte, risposte o fuori. Di Maio:... Senza "risposte chiare" il Movimento 5 Stelle ... https://www.ansa.it/sito/notizie/politica/2022...
14 Di Maio, Conte sta compiendo una vendetta poli... Se le cose restano come sono oggi "Mario Dragh... https://www.ansa.it/sito/notizie/politica/2022...
15 Governo: mercoledì la fiducia fiducia prima al... Le comunicazioni del presidente del Consiglio ... https://www.ansa.it/sito/notizie/politica/2022...
16 Il giorno più lungo: dal Senato fiducia a Drag... Passa la fiducia al premier Draghi in Senato, ... https://www.ansa.it/sito/notizie/politica/2022...
17 Mattarella scioglie le Camere, si vota il 25 s... E' stata una scelta "inevitabile", il voto del... https://www.ansa.it/sito/notizie/politica/2022...
18 Il discorso di Draghi al Senato 'Partiti, pron... "Siamo qui perché lo hanno chiesto gli italian... https://www.ansa.it/sito/notizie/politica/2022...
19 Governo: mercoledì la fiducia fiducia prima al... Le comunicazioni del presidente del Consiglio ... https://www.ansa.it/sito/notizie/politica/2022...
20 Draghi al Senato per una fiducia al buio. Prem... Draghi al bivio tra governo e crisi. Alle 9.30... https://www.ansa.it/sito/notizie/politica/2022...
21 Ultimatum di Conte, risposte o fuori. Di Maio:... Senza "risposte chiare" il Movimento 5 Stelle ... https://www.ansa.it/sito/notizie/politica/2022...
22 Camere sciolte ma il vitalizio è salvo Nonostante lo scioglimento delle Camere antici... https://www.ansa.it/sito/notizie/politica/2022...
You don't need two different loops if you are referring to the same element. Try the code below to save the titles and links.
for c in b.findAll('h3',{'class':'news-title'}):
    title.append(c.text.strip())
    links.append(c.a["href"])
By combining them you can be sure that the title and link are scraped from the same element.
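The same idea extends to the article content: fetch each article page inside that loop, so title, link and body always come from the same card. A minimal sketch, reusing the selectors from the question:

import requests
from bs4 import BeautifulSoup

base = 'https://www.ansa.it'
r = requests.get(base + '/sito/notizie/politica/politica.shtml')
b = BeautifulSoup(r.content, 'lxml')

rows = []
for c in b.findAll('h3', {'class': 'news-title'}):
    title = c.text.strip()
    link = c.a["href"]
    # fetch the article page right away so title, link and body stay aligned
    article = BeautifulSoup(requests.get(base + link).content, 'lxml')
    body = article.find('div', {'itemprop': 'articleBody'})
    rows.append((title, base + link, body.text.strip() if body else ''))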
I'm using the script below to retrieve property data for a college project. It runs without errors, but the dataframe has repeated values: if I ask it to fetch data from 5 pages, it repeats the same data from page 1 five times. Please help!
import requests, re, time, os, csv
from bs4 import BeautifulSoup
from selenium import webdriver
import pandas as pd
# Initialize the lists that will store the information
link_imovel=[]   # property URL
address=[]       # address
neighbor=[]      # neighborhood
anunciante=[]    # advertiser
area=[]          # area
tipo=[]          # property type
room=[]          # number of bedrooms
bath=[]          # number of bathrooms
park=[]          # number of parking spaces
price=[]         # property price
# Ask how many pages you want to collect
pages_number=int(input('How many pages? '))
# start the execution timer
tic = time.time()
# Configure chromedriver
# to run this, download chromedriver and keep it in the same folder as the script, or change the path
chromedriver = "./chromedriver"
os.environ["webdriver.chrome.driver"] = chromedriver
driver = webdriver.Chrome(chromedriver)
time.sleep(15)
# Loop over the pages of the site
for page in range(1,pages_number+1):
    link = 'https://www.vivareal.com.br/venda/minas-gerais/pocos-de-caldas/casa_residencial/?pagina='+str(page)+''
    driver.get(link)
    # A sleep time is used so we do not overload the site
    # collect the whole page source and turn it into a readable format
    data = driver.execute_script("return document.getElementsByTagName('html')[0].innerHTML")
    soup_complete_source = BeautifulSoup(data.encode('utf-8'), "lxml")
    # locate the container with all the property cards
    soup = soup_complete_source.find(class_='results-list js-results-list')

    # Web scraping
    # for each card in the results, collect:
    for line in soup.findAll(class_="js-card-selector"):
        try:
            # collect the full address and the neighborhood
            full_address=line.find(class_="property-card__address").text.strip()
            address.append(full_address.replace('\n', '')) # Get all address
            if full_address[:3]=='Rua' or full_address[:7]=='Avenida' or full_address[:8]=='Travessa' or full_address[:7]=='Alameda':
                neighbor_first=full_address.strip().find('-')
                neighbor_second=full_address.strip().find(',', neighbor_first)
                if neighbor_second!=-1:
                    neighbor_text=full_address.strip()[neighbor_first+2:neighbor_second]
                    neighbor.append(neighbor_text) # store every neighborhood
                else: # neighborhood not found
                    neighbor_text='-'
                    neighbor.append(neighbor_text) # in case the neighborhood is not found
            else:
                get_comma=full_address.find(',')
                if get_comma!=-1:
                    neighbor_text=full_address[:get_comma]
                    neighbor.append(neighbor_text) # neighborhoods with formatting issues coming from the website itself
                else:
                    get_hif=full_address.find('-')
                    neighbor_text=full_address[:get_hif]
                    neighbor.append(neighbor_text)
            # collect the link
            full_link=line.find(class_='property-card__main-info').a.get('href')
            link_imovel.append(full_link)
            # collect the advertiser
            full_anunciante=line.find(class_='property-card__account-link js-property-card-account-link').img.get('alt').title()
            anunciante.append(full_anunciante)
            # collect the area
            full_area=line.find(class_="property-card__detail-value js-property-card-value property-card__detail-area js-property-card-detail-area").text.strip()
            area.append(full_area)
            # collect the property type
            full_tipo = line.find(class_='property-card__title js-cardLink js-card-title').text.split()[0]
            full_tipo=full_tipo.replace(' ','')
            full_tipo=full_tipo.replace('\n','')
            tipo.append(full_tipo)
            # collect the number of bedrooms
            full_room=line.find(class_="property-card__detail-item property-card__detail-room js-property-detail-rooms").text.strip()
            full_room=full_room.replace(' ','')
            full_room=full_room.replace('\n','')
            full_room=full_room.replace('Quartos','')
            full_room=full_room.replace('Quarto','')
            room.append(full_room) # Get the rooms
            # collect the number of bathrooms
            full_bath=line.find(class_="property-card__detail-item property-card__detail-bathroom js-property-detail-bathroom").text.strip()
            full_bath=full_bath.replace(' ','')
            full_bath=full_bath.replace('\n','')
            full_bath=full_bath.replace('Banheiros','')
            full_bath=full_bath.replace('Banheiro','')
            bath.append(full_bath) # Get the bathrooms
            # collect the number of parking spaces
            full_park=line.find(class_="property-card__detail-item property-card__detail-garage js-property-detail-garages").text.strip()
            full_park=full_park.replace(' ','')
            full_park=full_park.replace('\n','')
            full_park=full_park.replace('Vagas','')
            full_park=full_park.replace('Vaga','')
            park.append(full_park) # Get the parking spaces
            # collect the price
            full_price=re.sub('[^0-9]','',line.find(class_="property-card__price js-property-card-prices js-property-card__price-small").text.strip())
            price.append(full_price) # Get the price
        except:
            continue
# close the chromedriver
driver.quit()
# build a pandas dataframe and save it as a CSV file
for i in range(0,len(neighbor)):
    combinacao=[link_imovel[i],address[i],neighbor[i],anunciante[i],area[i],tipo[i],room[i],bath[i],park[i],price[i]]
    df=pd.DataFrame(combinacao)
    with open('VivaRealData.csv', 'a', encoding='utf-16', newline='') as f:
        df.transpose().to_csv(f, encoding='iso-8859-1', header=False)
# Execution time
toc = time.time()
get_time=round(toc-tic,3)
print('Finished in ' + str(get_time) + ' seconds')
print(str(len(price))+' results!')
It seems to me that the "for line in soup.findAll" loop never moves on to the next page; I've tried everything but I always get the data from the first page.
Indeed, that URL returns the same results regardless of the page number requested. It also returns the same information if requests is used, avoiding the huge overhead of Selenium.
A better (and much faster) approach is to access all of the data directly from the site's JSON API.
The following shows you a possible starting point. All of the data is inside data; you just need to find the information you want inside it and access it. I suggest you print(data) and use a tool to format it so it is easier to read.
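For instance, a minimal sketch of that inspection step (it assumes you already have the parsed response, i.e. the data = req.json() call in the script below):

import json

# pretty-print the parsed JSON so the nesting is easy to read;
# slicing keeps the output short while you explore the structure
print(json.dumps(data, indent=2, ensure_ascii=False)[:2000])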
import requests, re, time, os, csv
# Ask how many pages you want to collect
#pages_number = int(input('How many pages? '))
pages_number = 5
# start the execution timer
tic = time.time()
sess = requests.Session()
params = {
    'addressCity' : 'Poços de Caldas',
    'addressLocationId' : 'BR>Minas Gerais>NULL>Pocos de Caldas',
    'addressNeighborhood' : '',
    'addressState' : 'Minas Gerais',
    'addressCountry' : 'Brasil',
    'addressStreet' : '',
    'addressZone' : '',
    'addressPointLat' : '-21.7854',
    'addressPointLon' : '-46.561934',
    'business' : 'SALE',
    'facets' : 'amenities',
    'unitTypes' : 'HOME',
    'unitSubTypes' : 'UnitSubType_NONE,SINGLE_STOREY_HOUSE,VILLAGE_HOUSE,KITNET',
    'unitTypesV3' : 'HOME',
    'usageTypes' : 'RESIDENTIAL',
    'listingType' : 'USED',
    'parentId' : 'null',
    'categoryPage' : 'RESULT',
    'includeFields' : 'search(result(listings(listing(displayAddressType,amenities,usableAreas,constructionStatus,listingType,description,title,unitTypes,nonActivationReason,propertyType,unitSubTypes,id,portal,parkingSpaces,address,suites,publicationType,externalId,bathrooms,usageTypes,totalAreas,advertiserId,bedrooms,pricingInfos,showPrice,status,advertiserContact,videoTourLink,whatsappNumber,stamps),account(id,name,logoUrl,licenseNumber,showAddress,legacyVivarealId,phones),medias,accountLink,link)),totalCount),page,seasonalCampaigns,fullUriFragments,nearby(search(result(listings(listing(displayAddressType,amenities,usableAreas,constructionStatus,listingType,description,title,unitTypes,nonActivationReason,propertyType,unitSubTypes,id,portal,parkingSpaces,address,suites,publicationType,externalId,bathrooms,usageTypes,totalAreas,advertiserId,bedrooms,pricingInfos,showPrice,status,advertiserContact,videoTourLink,whatsappNumber,stamps),account(id,name,logoUrl,licenseNumber,showAddress,legacyVivarealId,phones),medias,accountLink,link)),totalCount)),expansion(search(result(listings(listing(displayAddressType,amenities,usableAreas,constructionStatus,listingType,description,title,unitTypes,nonActivationReason,propertyType,unitSubTypes,id,portal,parkingSpaces,address,suites,publicationType,externalId,bathrooms,usageTypes,totalAreas,advertiserId,bedrooms,pricingInfos,showPrice,status,advertiserContact,videoTourLink,whatsappNumber,stamps),account(id,name,logoUrl,licenseNumber,showAddress,legacyVivarealId,phones),medias,accountLink,link)),totalCount)),account(id,name,logoUrl,licenseNumber,showAddress,legacyVivarealId,phones,phones),developments(search(result(listings(listing(displayAddressType,amenities,usableAreas,constructionStatus,listingType,description,title,unitTypes,nonActivationReason,propertyType,unitSubTypes,id,portal,parkingSpaces,address,suites,publicationType,externalId,bathrooms,usageTypes,totalAreas,advertiserId,bedrooms,pricingInfos,showPrice,status,advertiserContact,videoTourLink,whatsappNumber,stamps),account(id,name,logoUrl,licenseNumber,showAddress,legacyVivarealId,phones),medias,accountLink,link)),totalCount)),owners(search(result(listings(listing(displayAddressType,amenities,usableAreas,constructionStatus,listingType,description,title,unitTypes,nonActivationReason,propertyType,unitSubTypes,id,portal,parkingSpaces,address,suites,publicationType,externalId,bathrooms,usageTypes,totalAreas,advertiserId,bedrooms,pricingInfos,showPrice,status,advertiserContact,videoTourLink,whatsappNumber,stamps),account(id,name,logoUrl,licenseNumber,showAddress,legacyVivarealId,phones),medias,accountLink,link)),totalCount))',
    'size' : '100',
    'from' : '144',
    'q' : '',
    'developmentsSize' : '5',
    '__vt' : '',
    'levels' : 'CITY,UNIT_TYPE',
    'ref' : '/venda/minas-gerais/pocos-de-caldas/casa_residencial/',
    'pointRadius' : '',
    'isPOIQuery' : '',
}
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36',
    'x-domain': 'www.vivareal.com.br',
}
results = 0
with open('VivaRealData.csv', 'w', newline='', encoding='utf-16') as f_output:
    csv_output = csv.writer(f_output)

    # Loop over the pages of the site
    for page in range(pages_number+1):
        print(f"Page {page+1}")
        link = 'https://glue-api.vivareal.com/v2/listings'
        params['from'] = f"{page * 100}"
        req = sess.get(link, headers=headers, params=params)
        data = req.json()

        for listing in data['search']['result']['listings']:
            href = listing['link']['href']
            street = listing['listing']['address'].get('street', '').strip()
            bedrooms = listing['listing']['bedrooms'][0]
            bathrooms = listing['listing']['bathrooms'][0]
            price = listing['listing']['pricingInfos'][0]['price']
            row = [href, street, bedrooms, bathrooms, price]
            csv_output.writerow(row)
            results += 1
# Execution time
toc = time.time()
get_time = round(toc-tic,3)
print(f'Finished in {get_time} seconds')
print(f'{results} results!')
For this example, it is hard coded to 5 pages and returns 593 results in about 6 seconds.
Using Pandas might be a bit overkill here as the data can be written a row at a time directly to your output CSV file.
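If you later want a DataFrame for your analysis anyway, you can load the finished CSV back in. A minimal sketch, assuming the five columns written per row above (the column names here are just placeholders):

import pandas as pd

# the CSV has no header row, so supply the column names yourself
df = pd.read_csv('VivaRealData.csv', encoding='utf-16', header=None,
                 names=['url', 'street', 'bedrooms', 'bathrooms', 'price'])
print(df.head())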
How was this solved?
Your best friend here is your browser's network dev tools. With these you can watch the requests the page makes to obtain its information. The normal flow is: the initial HTML page is downloaded, its javascript runs, and that javascript requests more data to fill in the rest of the page.
The trick is to first locate where the data you want comes from (it is often returned as JSON), then work out the parameters you need to recreate the request for it.
Approaches using Selenium allow the javascript to do this work for you, but most of the time that is not needed, as the javascript is just making requests and formatting the data for display.
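As a generic sketch of that workflow (every value below is a placeholder you would copy from the request shown in the Network tab, not something specific to any one site):

import requests

# placeholders: copy the real URL, headers and query string from the
# Network tab entry that returns the JSON you are interested in
api_url = 'https://example.com/api/endpoint'
headers = {'User-Agent': 'Mozilla/5.0', 'x-domain': 'example.com'}
params = {'page': 1, 'size': 100}

resp = requests.get(api_url, headers=headers, params=params)
resp.raise_for_status()
data = resp.json()  # the same structure the page's own javascript consumes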
This is a subquestion of this one: Python associate urls's ids and url's titles in lists
I have this HTML snippet:
<a href="http://pluzz.francetv.fr/videos/monte_le_son_live_,101973832.html"
class="ss-titre">Monte le son</a>
<div class="rs-cell-details">
<a href="http://pluzz.francetv.fr/videos/monte_le_son_live_,101973832.html"
class="ss-titre">"Rubin_Steiner"</a>
<a href="http://pluzz.francetv.fr/videos/fare_maohi_,102103928.html"
class="ss-titre">Fare maohi</a>
How can I get this result with BeautifulSoup:
list_titre = [['Monte le son', 'Rubin_Steiner'], ['Fare maohi']] # one sublist per id
I tried this:
f = urllib.urlopen(url)
page = f.read()
f.close()
soup = BeautifulSoup(page)
show=[]
list_titre=[]
list_url=[]
for link in soup.findAll('a'):
    lien = link.get('href')
    if lien == None:
        lien = ""
    if "http://pluzz.francetv.fr/videos/" in lien:
        titre = (link.text.strip())
        if "Voir cette vidéo" in titre:
            titre = ""
        if "Lire la vidéo" in titre:
            titre = ""
        list_titre.append(titre)
        list_url.append(lien)
My result is:
list_titre = ['Monte le son', 'Rubin_Steiner', 'Fare maohi']
list_url = [http://pluzz.francetv.fr/videos/monte_le_son_live_,101973832.html, http://pluzz.francetv.fr/videos/fare_maohi_,102103928.html]
But "titre" is not sorted by id.
Search for your links with a CSS selector to limit hits to just qualifying URLs.
Collect the links in a dictionary by URL; that way you can then process the information by sorting the dictionary keys:
from bs4 import BeautifulSoup
links = {}
soup = BeautifulSoup(page)
for link in soup.select('a[href^="http://pluzz.francetv.fr/videos/"]'):
    title = link.get_text().strip()
    if title and title not in (u'Voir cette vidéo', u'Lire la vidéo'):
        url = link['href']
        links.setdefault(url, []).append(title)
The dict.setdefault() call sets an empty list for urls not yet encountered; this produces a dictionary with the URLs as keys, and the titles as a list of values per URL.
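If you then want the nested-list output from the question, you can sort the dictionary keys; a small sketch using the links dict built above:

# one sublist of titles per distinct URL/id, ordered by URL
list_url = sorted(links)
list_titre = [links[url] for url in list_url]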
Demo:
>>> page = '''\
... <a href="http://pluzz.francetv.fr/videos/monte_le_son_live_,101973832.html"
... class="ss-titre">Monte le son</a>
... <div class="rs-cell-details">
... <a href="http://pluzz.francetv.fr/videos/monte_le_son_live_,101973832.html"
... class="ss-titre">"Rubin_Steiner"</a>
... <a href="http://pluzz.francetv.fr/videos/fare_maohi_,102103928.html"
... class="ss-titre">Fare maohi</a>
... '''
>>> links = {}
>>> soup = BeautifulSoup(page)
>>> for link in soup.select('a[href^="http://pluzz.francetv.fr/videos/"]'):
... title = link.get_text().strip()
... if title and title not in (u'Voir cette vidéo', u'Lire la vidéo'):
... url = link['href']
... links.setdefault(url, []).append(title)
...
>>> from pprint import pprint
>>> pprint(links)
{'http://pluzz.francetv.fr/videos/ce_soir_ou_jamais_,101506826.html': [u'Ce soir (ou jamais !)',
u'"Qui est propri\xe9taire de quoi ? La propri\xe9t\xe9 mise \xe0 mal dans tous les domaines"'],
'http://pluzz.francetv.fr/videos/clip_locaux_,102890631.html': [u'Clips'],
'http://pluzz.francetv.fr/videos/fare_maohi_,102152859.html': [u'Fare maohi'],
'http://pluzz.francetv.fr/videos/fare_maohi_,102292937.html': [u'Fare maohi'],
'http://pluzz.francetv.fr/videos/fare_maohi_,102365651.html': [u'Fare maohi'],
'http://pluzz.francetv.fr/videos/inspecteur_barnaby_,101972045.html': [u'Inspecteur Barnaby',
u'"La musique en h\xe9ritage"'],
'http://pluzz.francetv.fr/videos/le_lab_o_saison3_,101215383.html': [u'Le Lab.\xd4',
u'"Episode 22"',
u'Saison 3'],
'http://pluzz.francetv.fr/videos/monsieur_madame_saison1_,101970319.html': [u'Les Monsieur Madame',
u'"Musique"',
u'Saison 1'],
'http://pluzz.francetv.fr/videos/monte_le_son_live_,101973832.html': [u'Monte le son !',
u'"Rubin Steiner"'],
'http://pluzz.francetv.fr/videos/music_explorer_saison1_,101215382.html': [u'Music Explorer : les chasseurs de sons',
u'"Episode 3/6"',
u'Saison 1'],
'http://pluzz.francetv.fr/videos/retour_a_goree_,101641108.html': [u'Retour \xe0 Gor\xe9e'],
'http://pluzz.francetv.fr/videos/singe_mi_singe_moi_,101507102.html': [u'Singe mi singe moi',
u'"Le chat"'],
'http://pluzz.francetv.fr/videos/singe_mi_singe_moi_,101777072.html': [u'Singe mi singe moi',
u'"L\'autruche"'],
'http://pluzz.francetv.fr/videos/toute_nouvelle_tendance_,102472310.html': [u'T.N.T'],
'http://pluzz.francetv.fr/videos/toute_nouvelle_tendance_,102472336.html': [u'T.N.T'],
'http://pluzz.francetv.fr/videos/toute_nouvelle_tendance_,102721018.html': [u'T.N.T'],
'http://pluzz.francetv.fr/videos/toute_nouvelle_tendance_,103216774.html': [u'T.N.T.'],
'http://pluzz.francetv.fr/videos/toute_nouvelle_tendance_,103216788.html': [u'T.N.T'],
'http://pluzz.francetv.fr/videos/via_cultura_,101959892.html': [u'Via cultura',
u'"L\'Ochju, le Mauvais oeil"']}