Combine multiple lines (line breaks) into one line in Python

So I was crawling articles from a site, but the summary had multiple paragraphs and I want them on one line, e.g.:
Line 1: Title 1
Line 2: Summary para1 Summary para2
This is my current code, for this page:
https://theaizawlpost.org/health-minister-in-fimkhur-turin-mipui-ngen-nawn/
import csv
import pandas as pd
import requests
from bs4 import BeautifulSoup
from datetime import date
import urllib
from urllib.request import urlopen
csv_file = open('cms_scrape.csv', 'w')
csv_writer = csv.writer(csv_file)
csv_writer.writerow(['title', 'summary'])
source = requests.get('https://theaizawlpost.org/health-minister-in-fimkhur-turin-mipui-ngen-nawn/').text
soup = BeautifulSoup(source, 'lxml')
article = soup.find('article')
title = article.find('span', class_='current').text
print(title)
summary = article.find('div', class_='entry-content entry clearfix').text
print(summary)
csv_writer.writerow([title, summary.strip()])
csv_file.close()

Pass strip=True to get_text() to strip whitespace, including newlines (\n), from each piece of text:
summary = article.find('div', class_='entry-content entry clearfix').get_text(strip=True)
Since the whitespace has already been stripped from summary, don't call .strip() again when writing to the CSV file; instead, use:
csv_writer.writerow([title, summary])
Output:
Health minister-in fimkhur turin mipui ngen nawn
Sawrkarin May ni 31 thleng total lockdown a pawhsei leh hnuah, health minister Dr. R. Lalthangliana chuan nimin khan mipui hnena ngenna leh thuchah tichhuakin, total lockdown chu kan damkhawchhuahna tur a nih tih hriaa inkhuahkhirhna dan te tha taka zawm chunga fimkhurna ngai pawimawh zel turin mipui a chah.Health minister Dr. Thangtea chuan, kan zavaia kan tanrual a, kan tawrh leh rih hram hram a tul dawn a, chutih rualin, tumah riltam leh chhuanchhama kan awm hi sawrkarin a phal lova, kohhran leh khawtlang hruaitute, Local task Force te nen tangrualin theihtawp kan chhuah zel dawn a ni, a ti a. Hetih rual hian sum lakluhna te a lo kiam tak avangin chhungtinin mahni zawnah theuh inrenchem tum ila, fimkhur takin, chi-ai si lovin awm ila, inlenpawh lo turin leh a tul tawpkhawkah lo chuan pawn chhuak rih lo turin kan inchah nawn leh a, kan duh reng vang pawh ni lovin, nunna chhan nan kan tawrh tlan rih hram hram a ngai a ni, a ti bawk.Total lockdown kar hnih kalpui hnu pawha hri kaiin kian lam a la pan theih loh chungchangah health minister chuan, inkharkhip laiin mipui lam hi kan fimkhur tawk lo deuh em tih zawhna a awm hial a ni, a ti a. “Nikum lama khauh taka bazar-na hmuna social distancing kan zawm ang khan kan zawm ta lo em ni aw? ka ti a, mipuite pawh bazar-ah leh puipunna hmunah duty te hmuh phak loha kan awm hian kan fimkhur tawk lo palh ang tih ka hlau a. Mahni theuh kan pawimawh ber a ni tih hriain kan inkhuahkhirhna dan hi khauh deuh mah se, kan zavaia kan himna tura ruahman a ni tih i hre nawn fo ang u,” tiin mipui a chah a.Hri vanga thi awm thin chu lungchhiatthlak a tih thu sawiin health minister chuan, “Kan state-a thi zat hi a tam tawh viaua a lan laiin hmarchhak state dang te leh India ram ngaihtuah chuan kan dinhmun a la ziaawm hle a. Kan positivity rate a sang kan tih pawh hi test kan neih that vang a ni ve pakhat a, kan test percentage hi 31.49 niin hmarchhak state-ah Arunachal Pradesh tih lohah chuan test tam ber kan ni,” a ti.Vaccine chungchangah, sawrkarin vaccine a chah mek thu leh, Central Ministry lamin inkaihhruaina a siam ang zelin chak taka vaccine pek hna kalpui zel tum a nih thu a sawi a. Mizoramin khawvel ram hrang hrang United Kingdom, Egypt, Ireland, Switzerland, Turkey, China, Taiwan atangtein kan mamawh hmanrua leh khawl chi hrang hrang kan dawng tawh tih sawiin, USA, Spain leh Kuwait atang pawhin tanpuina dawn tur a la awm thu te, World Health Organisation atanga oxygen concentrator 150 dawn a nih thu te pawh a sawi bawk.Minister chuan, ram hruaitu Minister-te leh MLA zawng zawngte an thawkrim hle tih sawiin, “Kan thawhhona a thatna leh mipuite thlawpna avangin he hripui hi kan hneh ngei dawn a ni,” a ti.“Total lockdown puang tura kohhran, Local Council Association, NGO leh VC Association te bakah mi thahnemngaite ngenna a lawmawmin a zahawm ka ti hle a, nitina eizawngte lakah inthlahrunawm viau mahse state dangte pawhin lockdown lo chu kawng dang zawh tur an hre bik meuh lo tih i hre tlang ila. Lockdown chhung hian kan frontline worker te leh healthcare worker ten nasa takin hma an la a, theihtawp an chhuah a ni tih i hriatsak ang u. 
Kan rorelah sawisel tur leh khamkhawp loh tam tak in hmu thin tih pawh ka hria a, khawvel pum luhchhuahtu hri a ni a, mi zawng zawng min nghawng vek a, rorel thiam a har a, vawiina tha kan tih kha naktuk lawkah a lo tha leh lova, engmah experiment han tih hman a ni si lo, kan rorelah leh kan tawngkam chhuakah te in rilru kan tih nat a awm chuan khawngaihtakin min ngaidam ula, inhriatthiamna nen dawhthei takin indawm tlang ila, thurawn tha leh fing engtiklai pawhin kan dawng thei reng a ni,” health minister chuan a ti a. Pathian venhimna leh a chhanchhuahna bang lova dil turin leh, malsawm tlak ni tura mahni lamin kan tih ve theihte ti ve turin Zoram mipuite a ngen a ni.
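One caveat: with strip=True, get_text() joins the stripped fragments with no separator, so text from adjacent tags can run together (as in "a chah.Health minister" above). If that matters, get_text() also accepts a separator argument:
summary = article.find('div', class_='entry-content entry clearfix').get_text(separator=' ', strip=True)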

What you want to do is replace all the newlines in the string. You can do this with:
summary.replace("\n", " ")
The first argument is the substring we want to replace; the second is what we want in its place.
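Note that str.replace() returns a new string rather than modifying summary in place, so assign the result back. A minimal sketch, with a split()/join() variant that also collapses runs of whitespace:
# replace each newline with a space; replace() does not modify summary in place
summary = summary.replace("\n", " ")
# or: collapse every run of whitespace (newlines, tabs, repeated spaces) to one space
summary = " ".join(summary.split())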

Related

scraped data using BeautifulSoup does not match source code

I'm new to web scraping. I have seen a few tutorials on how to scrape websites using BeautifulSoup.
As an exercise I would like to extract data from a real estate website.
The specific page I want to scrape is this one: https://www.immoweb.be/fr/recherche/maison-et-appartement/a-vendre?countries=BE&page=1
My goal is to extract a list of all the links to each real estate sale.
Afterwards, I want to loop through that list of links to extract all the data for each sale (price, location, nb bedrooms etc.)
The first issue I'm encountering is that the data scraped with the classic BeautifulSoup code does not match the source code of the web page.
This is my code:
import requests
from bs4 import BeautifulSoup

URL = "https://www.immoweb.be/fr/recherche/maison-et-appartement/a-vendre?countries=BE&page=1"
page = requests.get(URL)
html = page.content
soup = BeautifulSoup(html, 'html.parser')
print(soup)
Hence, when I look for the links to each real estate sale, which are located under
soup.find_all("a", class_="card__title-link")
it outputs an empty list: these tags were not extracted by the code above.
Why is that? What should I do to ensure that the extracted HTML corresponds to what is visible in the source code of the website?
Thank you :-)
The data you see is embedded within the page in JSON format. This example shows how to load it:
import json
import requests
from bs4 import BeautifulSoup

url = "https://www.immoweb.be/fr/recherche/maison-et-appartement/a-vendre?countries=BE&page=1"
soup = BeautifulSoup(requests.get(url).content, "html.parser")
data = json.loads(soup.find("iw-search")[":results"])

# uncomment this to print all data:
# print(json.dumps(data, indent=4))

# print some data:
for ad in data:
    print(
        "{:<63} {:<8} {}".format(
            ad["property"]["title"],
            ad["transaction"]["sale"]["price"] or "-",
            "https://www.immoweb.be/fr/annonce/{}".format(ad["id"]),
        )
    )
Prints:
Triplex appartement met 3 slaapkamers en garage. 239000 https://www.immoweb.be/fr/annonce/9309298
Appartement 285000 https://www.immoweb.be/fr/annonce/9309895
Heel ruime, moderne, lichtrijke Duplex te koop, bij centrum 269000 https://www.immoweb.be/fr/annonce/9303797
À VENDRE PAR LANDBERGH : appartement de deux chambres à Gand 359000 https://www.immoweb.be/fr/annonce/9310300
Prachtige nieuwbouw appartementen - https://www.immoweb.be/fr/annonce/9309278
Prachtige nieuwbouw appartementen - https://www.immoweb.be/fr/annonce/9309251
Prachtige nieuwbouw appartementen - https://www.immoweb.be/fr/annonce/9309264
Appartement intéressant avec agréable vue panoramique verdoy 219000 https://www.immoweb.be/fr/annonce/9309366
Projet Utopia by Godin - https://www.immoweb.be/fr/annonce/9309458
Appartement 2-ch avec vue unique! 270000 https://www.immoweb.be/fr/annonce/9309183
Residentieel wonen in Hélécine, dichtbij de natuur en de sne - https://www.immoweb.be/fr/annonce/9309241
Appartement 375000 https://www.immoweb.be/fr/annonce/9309187
DUPLEX LUMIEUX ET SPACIEUX 380000 https://www.immoweb.be/fr/annonce/9298271
SINT-PIETERS-LEEUW / Magnifique maison de ±130m² avec jardin 430000 https://www.immoweb.be/fr/annonce/9310259
PARC PARMENTIER // APP MODERNE 3CH 490000 https://www.immoweb.be/fr/annonce/9262193
BOIS DE LA CAMBRE – AV DE FRE – CLINIQUES DE L’EUROPE 575000 https://www.immoweb.be/fr/annonce/9309664
Entre Stockel et le Stade Fallon 675000 https://www.immoweb.be/fr/annonce/9310094
Maisons neuves dans un cadre verdoyant - https://www.immoweb.be/fr/annonce/6792221
Nieuwbouwproject Dockside Gardens - Gent - https://www.immoweb.be/fr/annonce/9008956
Appartement 139000 https://www.immoweb.be/fr/annonce/9187904
A VENDRE CHEZ LANDBERGH: appartements à Merelbeke Flora - https://www.immoweb.be/fr/annonce/9306877
Très beau studio avec une belle vue sur la plage et la mer! 319000 https://www.immoweb.be/fr/annonce/9306787
BEL APPARTEMENT LUMINEUX DIAMANT / PLASKY 320000 https://www.immoweb.be/fr/annonce/9264748
Un projet d'appartements neufs à proximité de Woluwé-St-Lamb - https://www.immoweb.be/fr/annonce/9308037
PLACE JOURDAN - 2 CHAMBRES 345000 https://www.immoweb.be/fr/annonce/9306953
Magnifiek appartement in de Brugse Rand - Assebroek 399000 https://www.immoweb.be/fr/annonce/9306613
Bien d'exception 415000 https://www.immoweb.be/fr/annonce/9308022
Appartement 435000 https://www.immoweb.be/fr/annonce/9307802
Magnifique maison 5CH - 3SDB - bureau - dressing - garage 465000 https://www.immoweb.be/fr/annonce/9307178
Magnifique maison 5CH - 3SDB - bureau - dressing - garage 465000 https://www.immoweb.be/fr/annonce/9307177
EDIT: Added URL column.
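If the goal is the list of links itself (to loop over afterwards, as the question describes), the same data can be collected instead of printed, e.g.:
# build the list of detail-page links from the same parsed JSON
links = ["https://www.immoweb.be/fr/annonce/{}".format(ad["id"]) for ad in data]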

BS4: Iterating through pages returns the same result in Python

Why does this code return the same film titles (the titles from the first page)?
url_base = "https://www.senscritique.com/liste/Le_meilleur_du_meilleur_des_meilleurs/772407#page-"
for page in range(1, 3):  # nb_pages+1
    url_n = url_base + str(page)
    print(url_n)
    html_n = urllib2.urlopen(url_n).read().decode('utf-8')
    soup_n = BeautifulSoup(html_n, 'html.parser')
    for film in soup_n.find_all('li', attrs={"class": u"elli-item"}):
        print(film.find('a', attrs={"class": u"elco-anchor"}).text)
The page loads the titles from a different URL via Ajax:
import requests
from bs4 import BeautifulSoup

# https://www.senscritique.com/liste/Le_meilleur_du_meilleur_des_meilleurs/772407#page-
url_base = 'https://www.senscritique.com/sc2/liste/772407/page-{}.ajax'

for page in range(1, 3):  # nb_pages+1
    url_n = url_base.format(page)
    soup_n = BeautifulSoup(requests.get(url_n).content, 'html.parser')
    for film in soup_n.find_all('li', attrs={"class": u"elli-item"}):
        print(film.find('a', attrs={"class": u"elco-anchor"}).text)
Prints:
Old Boy
Lucy
Le Loup de Wall Street
Tomboy
Dersou Ouzala
Les Lumières de la ville
12 hommes en colère
Gravity
Hunger Games : La Révolte, partie 1
Le Parrain
La Nuit du chasseur
La Belle et la Bête
The Big Lebowski
Interstellar
Le Ruban blanc
Vive la France
La vie est belle
Le Hobbit : La Bataille des cinq armées
Pulp Fiction
Melancholia
Her
The Grand Budapest Hotel
The Tree of Life
Le Prestige
Conan le Barbare
Ninja Turtles
Jurassic Park III
Rebelle
Mud, sur les rives du Mississippi
Détour mortel
Only God Forgives
A Serious Man
Bienvenue à Gattaca
Colombiana
Rome, ville ouverte
Man of Steel
Black Book
La Rafle
Aliens : Le Retour
Les Petits Mouchoirs
Mysterious Skin
Rashômon
Lolita
Le Mystère de la matière noire
Godzilla
9 mois ferme
Pour une poignée de dollars
Les Enfants du paradis
Drive
Fight Club
Evil Dead
Le Labyrinthe
Sous les jupes des filles
Le Seigneur des Anneaux : La Communauté de l'anneau
La Chasse
Le Locataire
Gone Girl
La Planète des singes : L'Affrontement
L'Homme sans âge
Cinquante nuances de Grey
Change it to this. The problematic part is urllib2 under Python 3.
The urllib2 module has been split across several modules in Python 3
named urllib.request and urllib.error. The 2to3 tool will
automatically adapt imports when converting your sources to Python 3.
from bs4 import BeautifulSoup
from urllib.request import urlopen

url_base = "https://www.senscritique.com/liste/Le_meilleur_du_meilleur_des_meilleurs/772407#page-"
for page in range(1, 3):  # nb_pages+1
    url_n = url_base + str(page)
    print(url_n)
    html_n = urlopen(url_n).read().decode('utf-8')
    soup_n = BeautifulSoup(html_n, 'html.parser')
    for film in soup_n.find_all('li', attrs={"class": u"elli-item"}):
        print(film.find('a', attrs={"class": u"elco-anchor"}).text)
Output
Old Boy
Lucy
Le Loup de Wall Street
Tomboy
Dersou Ouzala
Les Lumières de la ville
12 hommes en colère
Gravity
Hunger Games : La Révolte, partie 1
Le Parrain
La Nuit du chasseur
La Belle et la Bête
The Big Lebowski
Interstellar
Le Ruban blanc
Vive la France
La vie est belle
Le Hobbit : La Bataille des cinq armées
Pulp Fiction
Melancholia
Her
The Grand Budapest Hotel
The Tree of Life
Le Prestige
Conan le Barbare
Ninja Turtles
Jurassic Park III
Rebelle
Mud, sur les rives du Mississippi
Détour mortel
https://www.senscritique.com/liste/Le_meilleur_du_meilleur_des_meilleurs/772407#page-2
Old Boy
Lucy
Le Loup de Wall Street
Tomboy
Dersou Ouzala
Les Lumières de la ville
12 hommes en colère
Gravity
Hunger Games : La Révolte, partie 1
Le Parrain
La Nuit du chasseur
La Belle et la Bête
The Big Lebowski
Interstellar
Le Ruban blanc
Vive la France
La vie est belle
Le Hobbit : La Bataille des cinq armées
Pulp Fiction
Melancholia
Her
The Grand Budapest Hotel
The Tree of Life
Le Prestige
Conan le Barbare
Ninja Turtles
Jurassic Park III
Rebelle
Mud, sur les rives du Mississippi
Détour mortel
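Note that this second snippet only fixes the import error; as its own output shows, page 2 still prints the same titles. The #page-N part of the URL is a fragment, which the browser handles locally and never sends to the server, so every request fetches the same HTML; that is why the Ajax endpoint from the first answer is needed. A quick illustration using only the standard library:
from urllib.parse import urldefrag

# the fragment (everything after '#') is split off client-side and never reaches the server
url, fragment = urldefrag("https://www.senscritique.com/liste/Le_meilleur_du_meilleur_des_meilleurs/772407#page-2")
print(url)       # https://www.senscritique.com/liste/Le_meilleur_du_meilleur_des_meilleurs/772407
print(fragment)  # page-2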

How to scrape only Hindi content that is wrapped in DIV elements without any class or id, using Python and BeautifulSoup?

I have to scrape all the web pages of a Hindi song lyrics website. Each page contains song lyrics in Hindi and English.
Target Site: Lyrics Hindi Song
For scraping I am using Python and BeautifulSoup.
I am able to scrape data from all the pages successfully, except for one task: fetching the Hindi lyrics from each page.
Following is my code.
import pymysql
import requests
from bs4 import BeautifulSoup

r = requests.get("https://www.lyricshindisong.in/2020/02/khuda-bhi-asamaan-se-jab-jmin-par.html")
soup = BeautifulSoup(r.content, 'html5lib')
pagettitle = soup.find('h1').text
songcontentwrapper = soup.find('div', {'class': 'post-outer'})
targetcontents = songcontentwrapper.find_all('div')
for targetcontent in targetcontents:
    print(targetcontent.text)
The above code gives me the following result.
Khuda Bhi Asamaan Se Jab Jmin Par Dekhata Hoga Lyrics from the Hindi
Bollywood cinema Dharti(1970). Singer(s) of this song: Mohammed Rafi.
Music Composed By Shankar Jaikishan. SongKhuda bhi asamaan se jab
jmin par dekhata hogaSinger(s)Mohammed RafiMusic(s)Shankar
JaikishanLyrics(s)Rajendra Krishna(राजेंद्र कृष्ण)
(adsbygoogle = window.adsbygoogle || []).push({}); खुदा भी आसमाँ से जब ज़मीं पर देखता होगामेरे मेहबूब को किसने बनाया सोचता होगा
मुसव्विर खुद परेशां है के ये तस्वीर किसकी हैबनोगी जिसकी तुम ऐसी हसीं
तक़दीर किसकी हैकभी वो जल रहा होगा, कभी खुश हो रहा होगा ज़माने भर की
मस्ती को निगाहों में समेटा हैकली से जिस्म को कितनी बहारों में लपेटा
हैहुआ तुमसा कोई पहले न कोई दूसरा होगा फ़रिश्ते भी यहाँ रातों को आकर
घूमते होंगेजहाँ रखती हो तुम पाँव, जगह वो चूमते होंगेकिसी के दिल पे
क्या गुज़री, ये वो ही जानता होगा अब आप आपका मनचाहा गाने के बोल अपनी
ऊँगली की नोक पर पाएं। डाउनलोड करें हमारा एंड्राइड ऐप्प। Link: Lyrics
Hindi Song Android App Khuda bhi asamaan se jab jmin par dekhata
hogaMere mehabub ko kisane banaaya sochata hoga Musawwir khud
pareshaan hai ke ye taswir kisaki haiBanogi jisaki tum aisi hasin
takdir kisaki haiKabhi wo jal raha hoga, kabhi khush ho raha hoga
Jmaane bhar ki masti ko nigaahon men sameta haiKali se jism ko kitani
bahaaron men lapeta haiHua tumasa koi pahale n koi dusara hoga Frishte
bhi yahaan raaton ko akar ghumate hongeJahaan rakhati ho tum paanw,
jagah wo chumate hongeKisi ke dil pe kya gujri, ye wo hi jaanata hoga
You can access your favourite song lyrics at your finger tips.
Download our android app available at Google Play Store. Link: Lyrics
Hindi Song Android App Play
You might be interested in the following all Hindi Song Lyrics from
the Bollywood Feature Film "Dharti" . Ishq ki main bimar ki walla Teer
e nazar hai paar Ki walla tujhse hua hai pyar huyi alla Dil hai
bekarar ki walla Aankho mein khumar Ki walla tujhpe dil nishar huyi
alla....(Read More) Ye albeli pyar ki rahe Ye jane pahchane rashte Kal
bhi the ye nikhre nikhre Aaj bhi hai ye haste haste Ye albeli pyar ki
rahe Ye jane pahchane rashte Kal bhi the ye nikhre nikhre Aaj bhi hai
ye haste haste Ye albeli pyar ki rahe....(Read More) Khuda bhi aasma
se jab Zameen par dekhata hoga Khuda bhi aasma se jab Zameen par
dekhata hoga Mere mehboob ko kisane Banaya sochata hoga Khuda bhi
aasma se jab Zameen par dekhata hoga Mere mehboob ko kisane Banaya
sochata hoga Khuda bhi aasma se....(Read More) Jab se aankhe ho gayi
Tumse char is dharti par Jab se aankhe ho gayi Tumse char is dharti
par Kadam kadam par machal Raha hai pyaar is dharti par Jab se aankhe
ho gayi Tumse char is dharti par....(Read More) Shu shu shu shu shu
shu Dhire dheere bolo ji shu shu Bhed mat kholo ji shu shu Dhire
dheere bolo ji shu shu Bhed mat kholo ji shu shu Deewaro ke bhi kan
hai Khatre mein apni jaaan hai Deewaro ke bhi kan hai Khatre mein apni
jaaan hai Shu shu dheere dheere bolo ji shu shu....(Read More) Meri
gali mein aaya chor Tan ka chor mann ka chor Dil ko chura ker bhaga
Mere dil ko chura ker bhaga Meri gali mein aaya chor Tan ka chor mann
ka chor Dil ko chura ker bhaga Mere dil ko chura ker bhaga Meri gali
mein aaya chor Tan ka chor mann ka chor....(Read More) Ye mausam
bheega bheega hai Hawa bhi jyada jyada hai Kyun na machlega dil mera
Tumko pane ka irada hai Ye mausam bheega bheega hai Hawa bhi jyada
jyada hai Kyun na machlega dil mera Tumko pane ka irada hai Ye mausam
bheega bheega hai....(Read More) You are viewing android version of
our website: Lyrics Hindi SongTo visit our website: Click Here
The required task is to get only the Hindi lyrics.
Problem: the song lyrics are wrapped in div tags that have no class or id.
So please guide me: how can I get only the Hindi lyrics using Python and BeautifulSoup?
The simplest approach would probably be to just split your text on each space, remove every word that is in English, and join the remaining words back together with spaces in between.
See this post
Detect strings with non English characters in Python
Using the isEnglish() function from the accepted answer, you could do something like
hindi_only = " ".join([i for i in scraped_text.split(" ") if not isEnglish(i)])
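For reference, the check in the linked post is essentially an ASCII test; Devanagari characters fall outside the ASCII range, so Hindi words fail it. A minimal paraphrase (not the linked answer verbatim):
def isEnglish(s):
    # treat a word as "English" if it encodes as pure ASCII;
    # Hindi (Devanagari) characters are non-ASCII, so they fail this check
    try:
        s.encode("ascii")
    except UnicodeEncodeError:
        return False
    return True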

How to parse XML in Python and store info in a dictionary

I have an XML file from TED talks with the following structure:
<xml language="it"><file id="1">
<head>
<url>http://www.ted.com/talks/travis_kalanick_uber_s_plan_to_get_more_people_into_fewer_cars</url>
<pagesize>79324</pagesize>
<dtime>Fri Apr 01 01:00:04 CEST 2016</dtime>
<encoding>UTF-8</encoding>
<content-type>text/html; charset=utf-8</content-type>
<keywords>talks, Brand, Internet, business, cars, china, cities, economics, entrepreneur, environment, future, green, india, innovation, invention, investment, mobility, pollution, potential, society, software, sustainability, technology, transportation, web</keywords>
<speaker>Travis Kalanick</speaker>
<talkid>2443</talkid>
<videourl>http://download.ted.com/talks/TravisKalanick_2016.mp4</videourl>
<videopath>talks/TravisKalanick_2016.mp4</videopath>
<date>2016/02/15</date>
<title>Travis Kalanick: Il progetto di Uber di mettere più persone in meno auto</title>
<description>TED Talk Subtitles and Transcript: Uber non ha cominciato con grandi ambizioni di ridurre il traffico e l'inquinamento. Ma quando l'azienda ha decollato, il suo cofondatore Travis Kalanick si è chiesto se ci fosse un modo per far usare Uber alle persone sullo stesso percorso e far loro condividere i tragitti, riducendo allo stesso tempo i costi e l'impronta ecologica. Il risultato: uberPOOL, il servizio di condivisione dell'auto dell'azienda, che nei suoi primi otto mesi ha tolto dalle strade 7.9 milioni di miglia e 1.400 tonnellate di diossido di carbonio dall'aria a Los Angeles. Ora Kalanick afferma che la condivisione dell'auto potrebbe funzionare anche per i pendolari nelle periferie. "Con la tecnologia a disposizione e un po' di leggi intelligenti," afferma, "possiamo trasformare ogni auto in un'auto condivisa e riappropriarci immediatamente delle nostre città."</description>
<transcription>
<seekvideo id="680">Stamattina, vorrei parlarvi del futuro</seekvideo>
<seekvideo id="6064">dei trasporti a guida umana,</seekvideo>
<seekvideo id="9560">di come possiamo ridurre il traffico, l'inquinamento e i parcheggi</seekvideo>
<seekvideo id="15640">mettendo più persone in meno macchine,</seekvideo>
<seekvideo id="19840">e di come possiamo farlo con la tecnologia che sta nelle nostre tasche.</seekvideo>
<seekvideo id="25840">Sì, sto parlando degli smartphone...</seekvideo>
<seekvideo id="28720">non delle auto senza autista.</seekvideo>
<seekvideo id="31120">Ma per cominciare, dobbiamo tornare indietro di oltre 100 anni.</seekvideo>
<seekvideo id="1140160">Grazie mille." TK: "Grazie mille a te."</seekvideo>
<seekvideo id="1142240">(Applauso)</seekvideo>
</transcription>
<translators>
<translator href="http://www.ted.com/profiles/2859034">Maria Carmina Distratto</translator>
</translators>
<reviewers>
<reviewer href="http://www.ted.com/profiles/5006808">Enrica Pillon</reviewer>
</reviewers>
<wordnum>2478</wordnum>
<charnum>14914</charnum>
</head>
<content>Stamattina, vorrei parlarvi del futuro dei trasporti a guida umana, di come possiamo ridurre il traffico, l'inquinamento e i parcheggi mettendo più persone in meno macchine, e di come possiamo farlo con la tecnologia che sta nelle nostre tasche. Sì, sto parlando degli smartphone... non delle auto senza autista.
Ma per cominciare, dobbiamo tornare indietro di oltre 100 anni. Perché si è scoperto che Uber esisteva molto prima di Uber. E se fosse sopravvissuta, il futuro dei trasporti sarebbe probabilmente già qui.
Permettetemi di presentarvi il Jitney. Fu creato o inventato nel 1914 da un tizio di nome LP Draper. Era un venditore di auto di Los Angeles, che ebbe un'idea: stava girando per il centro di Los Angeles, la mia città natale, e vide questi tram con lunghe file di persone che cercavano di andare dove volevano. E si disse: "Beh, perché non mettere un cartello sulla mia macchina per portare le persone ovunque vogliano per un jitney?" Un Jitney è un nichelino, in slang.
Le persone saltarono a bordo. E non solo a Los Angeles, ma in tutto il paese. E nel giro di un anno, il 1915, a Seattle si prendevano 50.000 passaggi al giorno, 45.000 passaggi al giorno in Kansas e 150.000 passaggi al giorno a Los Angeles. Per darvi un'idea, Uber a Los Angeles effettua 157.000 passaggi al giorno oggi, 100 anni dopo.
Ed ecco i tranvieri, color
"Travis, quello che stai creando è davvero incredibile e ti sono grato per averlo condiviso apertamente sul palco di TED.
Grazie mille." TK: "Grazie mille a te."
(Applauso)</content>
</file>
<file id="2">
<head>
<url>http://www.ted.com/talks/magda_sayeg_how_yarn_bombing_grew_into_a_worldwide_movement</url>
<pagesize>77162</pagesize>
<dtime>Fri Apr 01 01:00:32 CEST 2016</dtime>
<encoding>UTF-8</encoding>
<content-type>text/html; charset=utf-8</content-type>
<keywords>talks, TEDYouth, arts, beauty, creativity, entertainment, materials, street art</keywords>
<speaker>Magda Sayeg</speaker>
<talkid>2437</talkid>
<videourl>http://download.ted.com/talks/MagdaSayeg_2015Y.mp4</videourl>
<videopath>talks/MagdaSayeg_2015Y.mp4</videopath>
<date>2015/11/14</date>
<title>Magda Sayeg: Come lo yarn bombing è diventato un movimento internazionale</title>
<description>TED Talk Subtitles and Transcript: L'artista di tessuti Magda Sayeg trasforma i paesaggi urbani nel suo parco giochi personale decorando gli oggetti quotidiani con colorati lavori a maglia e a uncinetto. Queste calde e pelose "bombe di filo" sono partite dalle piccole cose, con i pali degli stop e gli idranti nella città natale di Sayeg, ma presto le persone hanno trovato una connessione con quest'arte e l'hanno diffusa nel mondo. "Viviamo in un questo frenetico mondo digitale, ma desideriamo ancora qualcosa con cui possiamo relazionarci", dice Sayeg. "Si può trovare del potenziale nascosto nei luoghi più impensabili e tutti possediamo doti che aspettano solo di essere scoperte."</description>
<transcription>
<seekvideo id="0">Sono un'artista di tessuti</seekvideo>
<seekvideo id="3080">meglio nota per aver dato inizio allo yarn bombing.</seekvideo>
<seekvideo id="6080">Yarn bombing significa inserire del materiale fatto a maglia</seekvideo>
<seekvideo id="9040">nell'ambiente urbano, tipo graffiti --</seekvideo>
<seekvideo id="11400">o, per la precisione,</seekvideo>
I would like to store all the info in a way such that I am able to match the IDs of the talks and the sentence IDs with the same sentences in other languages (same format).
So for example something like a dictionary as:
mydict["Ted_Talk_ID"]["Sentence_ID"]
Any idea on how to do it?
I tried with xml.etree.ElementTree but it gives me no output (but maybe I made some errors). My code is the following:
import xml.etree.ElementTree as ET
root = ET.parse('ted_it-20160408.xml').getroot()
for type_tag in root.findall('url'):
    value = type_tag.get('title')
    print(value)
Thanks!
(if it could help: the data are from https://wit3.fbk.eu/mono.php?release=XML_releases&tinfo=cleanedhtml_ted)
You can do this using xml.etree.ElementTree; check this link: https://www.geeksforgeeks.org/xml-parsing-python/
This function can be adapted to your use:
def parseXML(xmlfile):
    # create element tree object
    tree = ET.parse(xmlfile)
    # get root element
    root = tree.getroot()
    # create empty list for news items
    newsitems = []
    # iterate news items
    for item in root.findall('./channel/item'):
        # empty news dictionary
        news = {}
        # iterate child elements of item
        for child in item:
            # special checking for namespace object content:media
            if child.tag == '{http://search.yahoo.com/mrss/}content':
                news['media'] = child.attrib['url']
            else:
                news[child.tag] = child.text.encode('utf8')
        # append news dictionary to news items list
        newsitems.append(news)
    # return news items list
    return newsitems

Extracting lat/long from a BeautifulSoup object in python

I just started working with Python's BeautifulSoup package for scraping. In my code I get the following soup object:
>>> soupObj
[u'$(function(){SEAT.PG.initDialog(".generate-popup-3030",[{msez:"3030",c:"#333",o:0.5,f:false,html:SEAT.PG.getImgPopHtml}]);SEAT.PG.mappaInterattiva({longitude:13.37489,latitude:42.36009,sito:"pgol",zoomLevel:"1",lng:1,mirino:"http://img.pgol.it/pgol/img/mk_pallino.png",allowFoto:true,mapType:null,streetView:false,dr:false,addMobile:false,ums:"sorellenurzia"});var a=SEAT.commenti({__2011_commento_click_stella:"Clicca su una stella per dare il tuo voto",__2011_commento_da_evitare:"Da evitare",__2011_commento_di_meglio:"C\'\xe8 di meglio",__2011_commento_non_male:"Non male",__2011_commento_mi_piace:"Mi piace",__2011_commento_consigliato:"Consigliato",__2011_commento_scrivi:"Scrivi una recensione per condividere la tua esperienza con gli altri utenti",__2011_commento_breve:"Scrivi almeno 120 caratteri per dare informazioni utili e complete a chi legger\xe0 la recensione.",__2011_commento_manca_voto:"Ti sei dimenticato di dare un voto",__2011_commento_servizio_non_disponibile:"Il servizio al momento non \xe8 disponibile",__2011_commento_segnala:"Segnala la recensione",__2011_commento_segnala_msg1:"Ritieni che questa recensione sia offensiva o inappropriata a PagineGialle.it?",__2011_commento_segnala_msg2:"La nostra redazione si occuper\xe0 di controllarne il contenuto e, se necessario, di rimuoverlo dal sito",__2011_conferma:"conferma",__2011_annulla:"annulla",__2011_ha_scritto_il:"ha scritto il",__2011_pubblica:"pubblica",__2011_commento_modifica_recensione:"Modifica recensione",__2011_conferma_notifica:"Aggiornamento via email attivo",__2012_commenti_relatedopec:"Conosci anche queste attivit\xe0?",__2011_elimina_notifica:"Aggiornamento via email non attivo"});a.enableAbuse();a.enableRating()});']
There is a lat/long portion that I want to extract from this object, namely "longitude:13.37489,latitude:42.36009". I am not sure how to do this.
Here's sample Python code using a regex to extract the coordinates:
import re
soupObj = [u'$(function(){SEAT.PG.initDialog(".generate-popup-3030",[{msez:"3030",c:"#333",o:0.5,f:false,html:SEAT.PG.getImgPopHtml}]);SEAT.PG.mappaInterattiva({longitude:13.37489,latitude:42.36009,sito:"pgol",zoomLevel:"1",lng:1,mirino:"http://img.pgol.it/pgol/img/mk_pallino.png",allowFoto:true,mapType:null,streetView:false,dr:false,addMobile:false,ums:"sorellenurzia"});var a=SEAT.commenti({__2011_commento_click_stella:"Clicca su una stella per dare il tuo voto",__2011_commento_da_evitare:"Da evitare",__2011_commento_di_meglio:"C\'\xe8 di meglio",__2011_commento_non_male:"Non male",__2011_commento_mi_piace:"Mi piace",__2011_commento_consigliato:"Consigliato",__2011_commento_scrivi:"Scrivi una recensione per condividere la tua esperienza con gli altri utenti",__2011_commento_breve:"Scrivi almeno 120 caratteri per dare informazioni utili e complete a chi legger\xe0 la recensione.",__2011_commento_manca_voto:"Ti sei dimenticato di dare un voto",__2011_commento_servizio_non_disponibile:"Il servizio al momento non \xe8 disponibile",__2011_commento_segnala:"Segnala la recensione",__2011_commento_segnala_msg1:"Ritieni che questa recensione sia offensiva o inappropriata a PagineGialle.it?",__2011_commento_segnala_msg2:"La nostra redazione si occuper\xe0 di controllarne il contenuto e, se necessario, di rimuoverlo dal sito",__2011_conferma:"conferma",__2011_annulla:"annulla",__2011_ha_scritto_il:"ha scritto il",__2011_pubblica:"pubblica",__2011_commento_modifica_recensione:"Modifica recensione",__2011_conferma_notifica:"Aggiornamento via email attivo",__2012_commenti_relatedopec:"Conosci anche queste attivit\xe0?",__2011_elimina_notifica:"Aggiornamento via email non attivo"});a.enableAbuse();a.enableRating()});']
m = re.search(r'longitude:([-+]?\d+\.\d+),latitude:([-+]?\d+\.\d+)', soupObj[0])
if m:
    longitude = m.group(1)
    latitude = m.group(2)
    print("longitude=%s, latitude=%s" % (longitude, latitude))
else:
    print("Failed to match longitude, latitude.")
Running this code yields:
longitude=13.37489, latitude=42.36009
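If the values are needed as numbers rather than strings, convert the captured groups (this assumes the match succeeded):
# m.group(1) and m.group(2) are strings; float() gives usable coordinates
longitude, latitude = float(m.group(1)), float(m.group(2))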
