I am a beginner at data scraping. I want to extract the names of all the marathon events from a Wikipedia page.
For this I have written a small piece of code:
from lxml import html
import requests
page = requests.get('https://en.wikipedia.org/wiki/List_of_marathon_races')
tree = html.fromstring(page.text)
events = tree.xpath('//td/a[#class="new"]/text()')
print events
The problem is that when I execute this code, a blank array comes out as the output. What is the problem with this code? I would be grateful if anyone could help me correct it or find the mistakes in my code.
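For what it's worth, two things stand out in that snippet. In XPath, attributes are selected with @ rather than # (as written, lxml would normally raise an XPathEvalError rather than return an empty list), and on Wikipedia class="new" marks only red links to articles that don't exist yet, so even the corrected expression can come back nearly empty. A minimal sketch of a broader query, assuming the event names sit in table-cell links (the exact XPath depends on the page's current markup):

from lxml import html
import requests

page = requests.get('https://en.wikipedia.org/wiki/List_of_marathon_races')
tree = html.fromstring(page.text)

# XPath uses @ (not #) to test attributes; match every link inside a
# table cell instead of only the 'new' (red link) ones.
events = tree.xpath('//td/a/text()')
print(events)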
I'm trying to code a little "GPS", and I couldn't use the Google API because of its daily request limit.
I decided to use the site "viamichelin", which provides the distance between two addresses. I wrote a little piece of code to build all the URLs I needed, like this:
import pandas
import numpy as np

# Raw strings so the backslashes in the Windows paths are not read as escapes
df = pandas.read_excel(r'C:\Users\Bibi\Downloads\memoire\memoire.xlsx', sheet_name='Clients')
df2 = pandas.read_excel(r'C:\Users\Bibi\Downloads\memoire\memoire.xlsx', sheet_name='Agences')
matrix = df.as_matrix(columns=None)
clients = np.squeeze(np.asarray(matrix))
matrix2 = df2.as_matrix(columns=None)
agences = np.squeeze(np.asarray(matrix2))

compteagences = 0
comptetotal = 0
for j in agences:
    compteclients = 0
    for i in clients:
        print agences[compteagences]
        print clients[compteclients]
        url = 'https://fr.viamichelin.be/web/Itineraires?departure=' + agences[compteagences] + '&arrival=' + clients[compteclients] + '&arrivalId=34MTE1MnJ5ZmQwMDMzb3YxMDU1ZDFvbGNOVEF1TlRVNU5UUT1jTlM0M01qa3lOZz09Y05UQXVOVFl4TlE9PWNOUzQzTXpFNU5nPT1jTlRBdU5UVTVOVFE9Y05TNDNNamt5Tmc9PTBqUnVlIEZvbmQgZGVzIEhhbGxlcw==&index=0&vehicle=0&type=0&distance=km&currency=EUR&highway=false&toll=false&vignette=false&orc=false&crossing=true&caravan=false&shouldUseTraffic=false&withBreaks=false&break_frequency=7200&coffee_duration=1200&lunch_duration=3600&diner_duration=3600&night_duration=32400&car=hatchback&fuel=petrol&fuelCost=1.393&allowance=0&corridor=&departureDate=&arrivalDate=&fuelConsumption='
        print url
        compteclients += 1
        comptetotal += 1
    compteagences += 1
All my data is in Excel, which is why I used the pandas library. I now have all the URLs needed for my project.
However, when I try to extract the number of kilometers, there's a little problem: the information I need is not in the page source, so I can't extract it with Python... The site is presented like this:
[screenshot: Michelin results page]
When I click "Inspect" I can find the information I need (on the left), but it is not in the page source (on the right)... Can someone help me?
[screenshot: Itinerary in the browser inspector]
I have already tried this, without success:
import os
import csv
import requests
from bs4 import BeautifulSoup
requete = requests.get("https://fr.viamichelin.be/web/Itineraires?departure=Rue%20Lebeau%2C%20Liege%2C%20Belgique&departureId=34MTE1Mmc2NzQwMDM0NHoxMDU1ZW44d2NOVEF1TmpNek5ERT1jTlM0MU5qazJPQT09Y05UQXVOak16TkRFPWNOUzQxTnpBM01nPT1jTlRBdU5qTXpOREU9Y05TNDFOekEzTWc9PTBhUnVlIExlYmVhdQ==&arrival=Rue%20Rys%20De%20Mosbeux%2C%20Trooz%2C%20Belgique&arrivalId=34MTE1MnJ5ZmQwMDMzb3YxMDU1ZDFvbGNOVEF1TlRVNU5UUT1jTlM0M01qa3lOZz09Y05UQXVOVFl4TlE9PWNOUzQzTXpFNU5nPT1jTlRBdU5UVTVOVFE9Y05TNDNNamt5Tmc9PTBqUnVlIEZvbmQgZGVzIEhhbGxlcw==&index=0&vehicle=0&type=0&distance=km&currency=EUR&highway=false&toll=false&vignette=false&orc=false&crossing=true&caravan=false&shouldUseTraffic=false&withBreaks=false&break_frequency=7200&coffee_duration=1200&lunch_duration=3600&diner_duration=3600&night_duration=32400&car=hatchback&fuel=petrol&fuelCost=1.393&allowance=0&corridor=&departureDate=&arrivalDate=&fuelConsumption=")
page = requete.content
soup = BeautifulSoup(page, "html.parser")
print soup
Looking at the inspector for the page, the actual routing is done via a JavaScript invocation to this rather long URL.
The data you need seems to be in that response, starting from _scriptLoaded(. (Since it's a JavaScript object literal, you can use Python's built-in JSON library to load the data into a dict.)
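A minimal sketch of that idea, assuming the response body has the shape _scriptLoaded({...}) (the placeholder routing_url stands in for the long URL copied from the inspector's network tab):

import json
import requests

# Placeholder: paste the long routing URL from the inspector here.
routing_url = 'https://fr.viamichelin.be/...'

raw = requests.get(routing_url).text

# Strip the JavaScript wrapper _scriptLoaded( ... ) to leave bare JSON,
# then load it into a dict with the built-in json module.
start = raw.find('_scriptLoaded(') + len('_scriptLoaded(')
end = raw.rfind(')')
data = json.loads(raw[start:end])
print(data.keys())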
I'm not a programmer, but I'm trying to teach myself Python so that I can pull data off various sites for projects that I'm working on. I'm using "Automate the Boring Stuff" and I'm having trouble getting the examples to work with one of the pages I'm trying to pull data from.
I'm using the Anaconda prompt with Python 3.6.5. Here's what I've done:
Step 1: Create the BeautifulSoup object
import requests, bs4
res = requests.get('https://www.almanac.com/weather/history/zipcode/02111/2017-05-15')
res.raise_for_status()
weatherTest = bs4.BeautifulSoup(res.text)
type(weatherTest)
This works, and returns the result
<class 'bs4.BeautifulSoup'>
I've assumed that the "noStarchSoup" in the original text (in place of weatherTest here) is just a name the author gave the object, and that I can rename it to something more relevant to me. If that's not accurate, please let me know.
Step 2: Pull an element out of the HTML
Here's where I get stuck. The author had just shown how to pull a page down into a file (which I would prefer not to do; I want to use the bs4 object), but he then uses that file as his source for the HTML data. exampleFile was his downloaded file.
import bs4
exampleFile = open('https://www.almanac.com/weather/history/zipcode/02111/2017-05-15')
I've tried using weatherTest in place of exampleFile; I've tried running the whole thing with the original object name (noStarchSoup); I've even tried it with exampleFile, even though I haven't downloaded the file.
What I get is
"OSError: [Errno 22] Invalid argument:
'https://www.almanac.com/weather/history/zipcode/02111/2017-05-15'
The next step is to tell it which element to pull, but I'm trying to fix this error first and am kind of spinning my wheels here.
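For what it's worth, open() expects a path to a local file, which is why handing it a URL raises that OSError. Since the page is already fetched and parsed into weatherTest, you can query that object directly; a minimal sketch (the 'h1' selector is only an illustration, not taken from the book):

import requests, bs4

res = requests.get('https://www.almanac.com/weather/history/zipcode/02111/2017-05-15')
res.raise_for_status()
weatherTest = bs4.BeautifulSoup(res.text, 'html.parser')

# select() takes a CSS selector; swap 'h1' for the element you actually want.
elems = weatherTest.select('h1')
print(len(elems))
if elems:
    print(elems[0].get_text())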
Couldn't resist here!
I found this page during my search but this answer didn't quite help... try this code :)
Step 1: download Anaconda 3.0+
Step 2: (function)
# Import Libraries
import bs4
import logging
import pandas as pd
import requests

logger = logging.getLogger(__name__)

def import_high_short_tickers(market_type):
    if market_type == 'NASDAQ':
        page = requests.get('https://www.highshortinterest.com/nasdaq/')
    elif market_type == 'NYSE':
        page = requests.get('https://www.highshortinterest.com/nyse/')
    else:
        logger.error("Invalid market_type: " + market_type)
        return None

    # Parse the HTML Page
    soup = bs4.BeautifulSoup(page.content, 'html.parser')

    # Grab only table elements
    all_soup = soup.find_all('table')

    # Get what you want from table elements!
    for element in all_soup:
        listing = str(element)
        if 'https://finance.yahoo.com/' in listing:
            # Stuff the results in a pandas data frame (if you're not using these you should)
            data = pd.read_html(listing)
            return data
Yes, yes, it's very crude, but don't hate!
Cheers!
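A quick usage sketch, assuming the site's table markup hasn't changed (pd.read_html returns a list of DataFrames):

tables = import_high_short_tickers('NASDAQ')
if tables:
    print(tables[0].head())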
I've created a script in Python 3 to scrape data from 4 different pages of a site. It works fine, but when I try to write the results to a CSV file, something goes wrong and only the info from the last page is saved. Could anybody help me out with this? I've attached the script for your consideration. Dying to know what I'm doing wrong.
import csv
import requests
from bs4 import BeautifulSoup

def web_crawler(mpage):
    page = 1
    while page <= mpage:
        url = requests.get("http://www.yellowpages.com/search?search_terms=Coffee%20Shops&geo_location_terms=San%20Francisco%2C%20CA&page=" + str(page))
        soup = BeautifulSoup(url.text, 'html.parser')
        x = soup.findAll(class_='info')
        gist = []
        for z in x:
            Item = z.findAll(class_="business-name")
            for Title in Item:
                Name = Title.text
            Patta = z.findAll(class_="adr")
            for Thikana in Patta:
                Address = Thikana.text
            Number = z.findAll(class_="phones")
            for Token in Number:
                Phone = Token.text
            metco = (Name, Address, Phone)
            print(metco)
            gist.append(metco)
        outfile = open('data.csv', 'w', newline='')
        writer = csv.writer(outfile)
        writer.writerow(["Name", "Address", "Phone"])
        writer.writerows(gist)
        page += 1

web_crawler(4)
You are overwriting your file inside the loop:
outfile=open('data.csv','w',newline='')
Try moving it out of the main loop.
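A minimal sketch of that fix, with write_rows as a hypothetical helper: collect the (Name, Address, Phone) tuples from every page into one list, then open the file once after the while loop finishes:

import csv

def write_rows(all_rows):
    # Opening the file once, after all pages are scraped, means later
    # pages add to the same list instead of overwriting the file.
    with open('data.csv', 'w', newline='') as outfile:
        writer = csv.writer(outfile)
        writer.writerow(["Name", "Address", "Phone"])
        writer.writerows(all_rows)

In the script above, that means moving gist = [] to before the while loop and calling write_rows(gist) once, after it.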
I am trying to do the following: read through a website, choose the 18th link on the page, open that link, and repeat that 7 times. But I am not very advanced in programming, so I am stuck at trying to open the 18th link. How can I do that? My code is this:
import urllib.request
import io
u = urllib.request.urlopen("http://xxxxxxxx.com/tsugi/mod/python-data/data/known_by_Yong.html", data = None)
f = io.TextIOWrapper(u, encoding='utf-8')
text = f.read()
from bs4 import BeautifulSoup
soup = BeautifulSoup(text, 'html.parser')
print(soup.find_all("a"))
My result looks like this, e.g.:
[<a href="http://xxxxxxxx.com/tsugi/mod/python-data/data/known_by_Keiva.html">Keiva</a>, <a href="http://xxxxx.com/tsugi/mod/python-data/data/known_by_Rowyn.html">Rowyn</a>]
An HTML document with names/links.
While I don't expect anybody to guide me through the whole code, where can I look up what I need?
Here are my main questions:
How can I make the program count the names/links?
How can I open the 18th link in the list?
How can I repeat that 7 times?
Thanks for your support in advance!!
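For reference, a minimal sketch of those three steps, assuming each page keeps its links in plain <a> tags (Python lists are 0-indexed, so the 18th link is index 17):

import urllib.request
from bs4 import BeautifulSoup

url = "http://xxxxxxxx.com/tsugi/mod/python-data/data/known_by_Yong.html"
for step in range(7):  # repeat the whole process 7 times
    html = urllib.request.urlopen(url).read()
    soup = BeautifulSoup(html, 'html.parser')
    links = soup.find_all("a")
    print(len(links), "links on this page")  # count the names/links
    url = links[17].get("href")  # follow the 18th link on the next pass
    print("following:", url)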
I am planning to scrape exchange rates with Python. After I get the raw data from the HTML pages, what kind of processing will I need to prepare it for output/visualization? Will I need text processing, NLP algorithms, graph processing, or cleaning of my data?
I don't know exactly what you need, but according to your comment, you can use the following code to extract all the data from that page:
import urllib
import bs4
url = urllib.urlopen('http://www.tcmb.gov.tr/kurlar/201501/02012015.xml').read().decode('Windows-1252')
soup = bs4.BeautifulSoup(url, 'html.parser')
data = soup.get_text(' ')
print(data)
This script was written for Python 2.7, and you need to have beautifulsoup4 installed.
Or you can use the code below; in it I extract the rates for the US dollar:
import urllib.request
import xml.etree.ElementTree as ET

url = urllib.request.urlopen('http://www.tcmb.gov.tr/kurlar/201501/02012015.xml').read()
f = open('data.xml', 'w+b')
f.write(url)
f.close()

tree = ET.parse('data.xml')
root = tree.getroot()
for i in range(len(root[0])):
    print(root[0][i].text)
Or you can extract all the ForexBuying rates:
for i in root.iter('ForexBuying'):
    print(i.text)
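As a design note, the intermediate data.xml file is not strictly necessary; ElementTree can parse the downloaded bytes directly. A minimal sketch against the same URL:

import urllib.request
import xml.etree.ElementTree as ET

raw = urllib.request.urlopen('http://www.tcmb.gov.tr/kurlar/201501/02012015.xml').read()
root = ET.fromstring(raw)  # parse straight from the response bytes
for node in root.iter('ForexBuying'):
    print(node.text)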