I'm new to programming and python and am trying to access the number of available bikes at a given station in the DC bikeshare program. I believe that the best way to do that is with BeautifulSoup. The good news is that the data is available in what appears to be a clean format here: https://www.capitalbikeshare.com/data/stations/bikeStations.xml
Here's an example of a station:
<station>
<id>1</id>
<name>15th & S Eads St</name>
<terminalName>31000</terminalName>
<lastCommWithServer>1460217337648</lastCommWithServer>
<lat>38.858662</lat>
<long>-77.053199</long>
<installed>true</installed>
<locked>false</locked>
<installDate>0</installDate>
<removalDate/>
<temporary>false</temporary>
<public>true</public>
<nbBikes>7</nbBikes>
<nbEmptyDocks>8</nbEmptyDocks>
<latestUpdateTime>1460192501598</latestUpdateTime>
</station>
I'm looking for the <nbBikes> value. I had what I thought would be the start of a python script that would show me the value for the first 5 stations (I'll tackle picking the station I want once I get this under control) but it doesn't return any values. Here's the script:
# bikeShareParse.py - parses the capital bikeshare info page
import bs4, requests
url = "https://www.capitalbikeshare.com/data/stations/bikeStations.xml"
res = requests.get(url)
res.raise_for_status()
#create the soup element from the file
soup = bs4.BeautifulSoup("res.text", "lxml")
# defines the part of the page we are looking for
nbikes = soup.select('#text')
#limits number of results for testing
numOpen = 5
for i in range(numOpen):
print nbikes
I believe that my problem (besides not understanding how to format code correctly in a stack overflow question) is that the value for nbikes = soup.select('#text') is incorrect. However, I can't seem to substitute anything for '#text' to get any values, let alone the ones I want.
Am I approaching this the right way? If so, what am I missing?
thanks
This script creates a dictionary with the structure [station_ID, bikes_remaining]. It is modified from the beginning of this: http://www.plotsofdots.com/archives/68
# from http://www.plotsofdots.com/archives/68
import xml.etree.ElementTree as ET
import urllib2
#we parse the data using urlib2 and xml
site='https://www.capitalbikeshare.com/data/stations/bikeStations.xml'
htm=urllib2.urlopen(site)
doc = ET.parse(htm)
#we get the root tag
root=doc.getroot()
root.tag
#we define empty lists for the empty bikes
sID=[]
embikes=[]
#we now use a for loop to extract the information we are interested in
for country in root.findall('station'):
sID.append(country.find('id').text)
embikes.append(int(country.find('nbBikes').text))
#this just tests that the process above works, can be commented out
#print embikes
#print sID
#use zip to create touples and then parse them into a dataframe
prov=zip(sID,embikes)
print prov[0]
Related
I have two lists of baseball players that I would like to scrape data for from the website fangraphs. I am trying to figure out how to have selenium search the first player in the list which would redirect to that players profile, scrape the data I am interested in, and then search the next player until each for loop is completed for the two lists. I have written other scrapers with selenium, but I haven't come across this situation where I need to perform a search, collect the data, then perform the next search, etc ...
Here is a smaller version of one of the lists:
batters = ['Freddie Freeman','Bryce Harper','Jesse Winker']
driver.get('https://www.fangraphs.com/')
search_box = driver.find_element_by_xpath('/html/body/form/div[3]/div[1]/div[2]/header/div[3]/nav/div[1]/div[2]/div/div/input')
search_box.click()
for batter in batters:
search_box.send_keys(batter)
search_box.send_keys(Keys.RETURN)
This will search all the names at once obviously, so I guess I'm trying to figure out how to code the logic of searching one by one but not performing the next search until I have collected the data for the previous search - any help is appreciated cheers
With selenium, you would just have to iterate through the names, "type" it into the search bar, click/go to the link, scrape the stats, then repeat. You have set up to do that, you just need to add the scrape part. So something like:
batters = ['Freddie Freeman','Bryce Harper','Jesse Winker']
driver.get('https://www.fangraphs.com/')
search_box = driver.find_element_by_xpath('/html/body/form/div[3]/div[1]/div[2]/header/div[3]/nav/div[1]/div[2]/div/div/input')
search_box.click()
for batter in batters:
search_box.send_keys(batter)
search_box.send_keys(Keys.RETURN)
## CODE THAT SCRAPES THE DATA ##
## CODE THAT STORES IT SOMEWAY TO APPEND AFTER EACH ITERATION ##
However, they have an api which is a far better solution than Selenium. Why?
APIs are consistent. Parsing HTML with selnium and/or beautifulsoup is reliant on the html structure. If they ever change the layout of the website, it may crash as certain tags that used to be there may not be there anymore, or they may add certain tags and attributes to the html. But the underlying data that is rendered in the html will come from the api in a nice json format and that will rarely change unless they do a complete overhaul of the data structure
It's far more efficient and quicker. No need to have Selenium open a browser, search and load/render the content, that scrape, then repeat. You get the response in 1 request
You'll get far more data than you intended and (imo) is a good thing. I'd rather have more data and "trim" off what I don't need. Lots of the time you'll see very interesting and useful data that you otherwise wouldn't had known was there.
So I'm not sure what you are after specifically, but this will get you going. You'll have to sift through the statsData to figure out what you want, but if you tell me what you are after, I can help get that into a nice table for you. Or if you want to figure it out yourself, look up pandas and the .json_normalize() function with that. Parsing nested json can be tricky (but it's also fun ;-) )
Code:
import requests
# Get teamIds
def get_teamIds():
team_id_dict = {}
url = 'https://cdn.fangraphs.com/api/menu/menu-standings'
jsonData = requests.get(url).json()
for team in jsonData:
team_id_dict[team['shortName']] = str(team['teamid'])
return team_id_dict
# Get Player IDs
def get_playerIds(team_id_dict):
player_id_dict = {}
for team, teamId in team_id_dict.items():
url = 'https://cdn.fangraphs.com/api/depth-charts/roster?teamid={teamId}'.format(teamId=teamId)
jsonData = requests.get(url).json()
print(team)
for player in jsonData:
if 'oPlayerId' in player.keys():
player_id_dict[player['player']] = [str(player['oPlayerId']), player['position']]
else:
player_id_dict[player['player']] = ['N/A', player['position']]
return player_id_dict
team_id_dict = get_teamIds()
player_id_dict = get_playerIds(team_id_dict)
batters = ['Freddie Freeman','Bryce Harper','Jesse Winker']
for player in batters:
playerId = player_id_dict[player][0]
pos = player_id_dict[player][1]
url = 'https://cdn.fangraphs.com/api/players/stats?playerid={playerId}&position={pos}'.format(playerId=playerId, pos=pos)
statsData = requests.get(url).json()
Ouput: Here's just a look at what you get
I am trying to scrape column names (player, cost, sel., form, pts) from the page below:
https://fantasy.premierleague.com/a/statistics/total_points
However, I am failing to do so.
Before I go further, let me show you what I have done.
from lxml import html
import requests
page = 'https://fantasy.premierleague.com/a/statistics/total_points'
#Take site and structure html
page = requests.get(page)
tree = html.fromstring(page.content)
#Using the page's CSS classes, extract all links pointing to a team
Location = tree.cssselect('.ism-thead-bold tr .ism-table--el-stats__name')
When I do this, Location should be a list that contains a string "Player".
However, it returns an empty list which means cssselect did not capture anything.
Though each column name has a different 'th class', I used one of them (ism-table--el-stats__name) for this specific trial just to make it simple.
When this problem is fixed, I want to use regex since every class has different suffix after two underscores.
If anyone can help me on these two tasks, I would really appreciate!
thank you guys.
I'm a newbie in Python Programming and I am facing following issue:
Objective: I need to scrap Freelancers website and store the list of theusers along with their attributes (score, ratings,reviews,details, rate,etc)
into a file. I have following codes but I am not able to get all the users.
Also, sometimes I run the program, the output changes.
import requests
from bs4 import BeautifulSoup
pages = 1
fileWriter =open('freelancers.txt','w')
url = 'https://www.freelancer.com/freelancers/skills/all/'+str(pages)+'/'
r = requests.get(url)
#gets the html contents and stores them into soup object
soup = BeautifulSoup(r.content)
links = soup.findAll("a")
#Finds the freelancer-details nodes and stores the html content into c_data
c_data = soup.findAll("div", {"class":"freelancer-details"})
for item in c_data:
print item.text
fileWriter.write('Freelancers Details:'+item.text+'\t')
#Writes the result into text file
I need to get the user details under specific users. But so far, the output looks dispersed.
Sample Output:
Freelancers Details:
thetechie13
507 Reviews
$20 USD/hr
Top Skills:
Website Design,
HTML,
PHP,
eCommerce,
Volusion
Dear Customer - We are a team of 75 Most Creative People and proud to be
Preferred Freelancer on Freelancer.com. We offer wide range of web
solutions and IT services that are bespoke in nature, can best fit our
clients' business needs and provide them cost benefits.
If you want each individual text component on its own (each assigned a different name), I would advise you to parse the text from from the HTML separately. However if you want it all grouped together you could join the strings:
print ' '.join(item.text.split())
This will place a single space between each word.
I'd like to scrape all the ~62000 names from this petition, using python. I'm trying to use the beautifulsoup4 library.
However, it's just not working.
Here's my code so far:
import urllib2, re
from bs4 import BeautifulSoup
soup = BeautifulSoup(urllib2.urlopen('http://www.thepetitionsite.com/104/781/496/ban-pesticides-used-to-kill-tigers/index.html').read())
divs = soup.findAll('div', attrs={'class' : 'name_location'})
print divs
[]
What am I doing wrong? Also, I want to somehow access the next page to add the next set of names to the list, but I have no idea how to do that right now. Any help is appreciated, thanks.
You could try something like this:
import urllib2
from bs4 import BeautifulSoup
html = urllib2.urlopen('http://www.thepetitionsite.com/xml/petitions/104/781/496/signatures/latest.xml?1374861495')
# uncomment to try with a smaller subset of the signatures
#html = urllib2.urlopen('http://www.thepetitionsite.com/xml/petitions/104/781/496/signatures/00/00/00/05.xml')
results = []
while True:
# Read the web page in XML mode
soup = BeautifulSoup(html.read(), "xml")
try:
for s in soup.find_all("signature"):
# Scrape the names from the XML
firstname = s.find('firstname').contents[0]
lastname = s.find('lastname').contents[0]
results.append(str(firstname) + " " + str(lastname))
except:
pass
# Find the next page to scrape
prev = soup.find("prev_signature")
# Check if another page of result exists - if not break from loop
if prev == None:
break
# Get the previous URL
url = prev.contents[0]
# Open the next page of results
html = urllib2.urlopen(url)
print("Extracting data from {}".format(url))
# Print the results
print("\n")
print("====================")
print("= Printing Results =")
print("====================\n")
print(results)
Be warned though there is a lot of data there to go through and I have no idea if this is against the terms of service of the website so you would need to check it out.
In most cases it is extremely inconsiderate to simply scrape a site. You put a fairly large load on a site in a short amount of time slowing down legitimate users requests. Not to mention stealing all of their data.
Consider an alternate approach such as asking (politely) for a dump of the data (as mentioned above).
Or if you do absolutely need to scrape:
Space your requests using a timer
Scrape smartly
I took a quick glance at that page and it appears to me they use AJAX to request the signatures. Why not simply copy their AJAX request, it'll most likely be using some sort of REST call. By doing this you lessen the load on their server by only requesting the data you need. It will also be easier for you to actually process the data because it will be in a nice format.
Reedit, I looked at their robots.txt file. It dissallows /xml/ Please respect this.
what do you mean by not working? empty list or error?
if you are receiving an empty list, it is because the class "name_location" does not exist in the document. also checkout bs4's documentation on findAll
I'd like to grab all the index words and its definitions from here. Is it possible to scrape web content with Python?
Firebug exploration shows the following URL returns my desirable contents including both index and its definition as to 'a'.
http://pali.hum.ku.dk/cgi-bin/cpd/pali?acti=xart&arid=14179&sphra=undefined
what are the modules used? Is there any tutorial available?
I do not know how many words indexed in the dictionary. I`m absolute beginner in the programming.
You should use urllib2 for gettting the URL contents and BeautifulSoup for parsing the HTML/XML.
Example - retrieving all questions from the StackOverflow.com main page:
import urllib2
from BeautifulSoup import BeautifulSoup
page = urllib2.urlopen("http://stackoverflow.com")
soup = BeautifulSoup(page)
for incident in soup('h3'):
print [i.decode('utf8') for i in incident.contents]
print
This code sample was adapted from the BeautifulSoup documentation.
You can get data from the web using the built-in urllib or urllib2, but the parsing itself is the most important part. May I suggest the wonderful BeautifulSoup? It can deal with just about anything.
http://www.crummy.com/software/BeautifulSoup/
The documentation is built like a tutorial. Sorta:
http://www.crummy.com/software/BeautifulSoup/documentation.html
In your case, you probably need to use wildcards to see all entries in the dictionary. You can do something like this:
import urllib2
def getArticles(query, start_index, count):
xml = urllib2.urlopen('http://pali.hum.ku.dk/cgi-bin/cpd/pali?' +
'acti=xsea&tsearch=%s&rfield=entr&recf=%d&recc=%d' %
(query, start_index, count))
# TODO:
# parse xml code here (using BeautifulSoup or an xml parser like Python's
# own xml.etree. We should at least have the name and ID for each article.
# article = (article_name, article_id)
return (article_names # a list of parsed names from XML
def getArticleContent(article):
xml = urllib2.urlopen('http://pali.hum.ku.dk/cgi-bin/cpd/pali?' +
'acti=xart&arid=%d&sphra=undefined' % article_id)
# TODO: parse xml
return parsed_article
Now you can loop over things. For instance, to get all articles starting in 'ana', use the wildcard 'ana*', and loop until you get no results:
query = 'ana*'
article_dict = {}
i = 0
while (true):
new_articles = getArticles(query, i, 100)
if len(new_articles) == 0:
break
i += 100
for article_name, article_id in new_articles:
article_dict[article_name] = getArticleContent(article_id)
Once done, you'll have a dictionary of the content of all articles, referenced by names. I omitted the parsing itself, but it's quite simple in this case, since everything is XML. You might not even need to use BeautifulSoup (even though it's still handy and easy to use for XML).
A word of warning though:
You should check the site's usage policy (and maybe robots.txt) before trying to heavily scrap articles. If you're just getting a few articles for yourself they may not care (the dictionary copyright owner, if it's not public domain, may care though), but if you're going to scrape the entire dictionary, this is going to be some heavy usage.