I would like to scrape all of the MLB batters' stats for 2018. Here is my code so far:
#import modules
from urllib.request import urlopen
from lxml import html

#fetch url/html
response = urlopen("https://www.baseball-reference.com/leagues/MLB/2018-standard-batting.shtml")
content = response.read()
tree = html.fromstring(content)

#parse data
comment_html = tree.xpath('//comment()[contains(., "players_standard_batting")]')[0]
comment_html = str(comment_html).replace("-->", "")
comment_html = comment_html.replace("<!--", "")
tree = html.fromstring(comment_html)

for batter_row in tree.xpath('//table[@id="players_standard_batting"]/tbody/tr[contains(@class, "full_table")]'):
    csk = batter_row.xpath('./td[@data-stat="player"]/@csk')[0]
When I scraped all of the batters, there was a '0.01' attached to each name. I tried to remove the attached numbers using the following code:
bat_data = [csk]
string = '0.01'
result = []
for x in bat_data:
    if string in x:
        substring = x.replace(string, '')
        if substring != "":
            result.append(substring)
    else:
        result.append(x)
print(result)
This code removed the number; however, only the last name was printed:
Output:
['Zunino, Mike']
Also, there are brackets and quotation marks around the name, and the name is in reverse order.
1) How can I print all of the batters names?
2) How can I remove the quotation marks and brackets?
3) Can I reverse the order of the names so the first name gets printed and then the last name?
The final output I am hoping for would be all of the batters names like so: Mike Zunino.
I am new to this site... I am also new to scraping/coding and will greatly appreciate any help I can get! =)
You can do this in several ways. Here is one approach that doesn't require any post-processing; you get the names in the form you wanted:
from urllib.request import urlopen
from lxml.html import fromstring

url = "https://www.baseball-reference.com/leagues/MLB/2018-standard-batting.shtml"
content = urlopen(url).read().decode('utf-8')
comment = content.replace("-->", "").replace("<!--", "")
tree = fromstring(comment)

for batter_row in tree.xpath('//table[contains(@class,"stats_table")]//tr[contains(@class,"full_table")]'):
    csk = batter_row.xpath('.//td[@data-stat="player"]/a')[0].text
    print(csk)
The output should look like:
Jose Abreu
Ronald Acuna
Jason Adam
Willy Adames
Austin L. Adams
You get only the last batter because you are overwriting the value of csk each time in your first loop. Initialize the empty list bat_data first and then add each batter to it.
bat_data = []
for batter_row in tree.xpath('//table[@id="players_standard_batting"]/tbody/tr[contains(@class, "full_table")]'):
    csk = batter_row.xpath('./td[@data-stat="player"]/@csk')[0]
    bat_data.append(csk)
This will give you a list of all batters, ['Abreu,Jose0.01', 'Acuna,Ronald0.01', 'Adam,Jason0.01', ...]
Then loop through this list; you don't have to check whether the string is in the name first. Just do x.replace('0.01', '') and then check whether the result is empty.
To reverse the order of the names:
substring = substring.split(',')
substring.reverse()
nn = " ".join(substring)
Then append nn to the result.
You are getting the quotes and the brackets because you are printing the list. Instead iterate through the list and print each item.
Your code, edited, assuming you built bat_data correctly:
for x in bat_data:
    substring = x.replace(string, '')
    if substring != "":
        substring = substring.split(',')
        substring.reverse()
        substring = ' '.join(substring)
        result.append(substring)

for x in result:
    print(x)
1) Print all batter names
print(result)
This will print everything in the result object. If it’s not printing what you expect then there’s something else wrong going on.
2) Remove quotations
The brackets appear because result is a list. Try this...
print(result[0])
This will tell the interpreter to print result at the 0 index.
3) Reverse order of names
Try
name = " ".join(result[0].split(",")[::-1])
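Putting the three fixes together, here is a minimal sketch; the bat_data values below are just examples of the 'Last,First0.01' strings the scrape produces:
bat_data = ['Abreu,Jose0.01', 'Acuna,Ronald0.01', 'Zunino,Mike0.01']  # example scraped values

result = []
for raw in bat_data:
    cleaned = raw.replace('0.01', '')  # drop the trailing number
    if cleaned:                        # skip anything that ends up empty
        last, first = cleaned.split(',')              # 'Zunino,Mike' -> 'Zunino', 'Mike'
        result.append('{0} {1}'.format(first, last))  # -> 'Mike Zunino'

for name in result:  # print the items, not the list, to avoid brackets/quotes
    print(name)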
'zipcodes.txt' is a text file with just zipcodes. The script works correctly if I just enter a zipcode directly, e.g. "90210". zip_list[0] is a string, and when printed it returns a single zipcode. However, with the code as is, I keep getting 'None':
from uszipcode import SearchEngine
search = SearchEngine(simple_zipcode=False)
zip_list = list(open("zipcodes.txt","r"))
search_by_zip = search.by_zipcode(zip_list[0])
print(search_by_zip.major_city)
I changed the variable names around a bit to make sense to me, but I had to strip() each item in the list to get rid of the '\n':
first_list = list(open("zipcodes.txt","r"))
zip_list = []
for item in first_list:
    zip_list.append(item.strip())
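As an aside, a slightly more idiomatic version of the same cleanup (a sketch, assuming zipcodes.txt holds one zipcode per line):
# Use a context manager so the file is closed automatically, and strip
# the trailing '\n' from each line in a single list comprehension.
with open("zipcodes.txt") as f:
    zip_list = [line.strip() for line in f]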
I have a string like the one below:
12345,abcd,03/03/2013,23,,32,EURRIE-373HFJ-DJDMKD|838383,ldof,09/02/2017,23,,32,DJJFJF-DJFH83-JDUEJD|939393,uejs,08/07/2016,23,,32,JDJFJF-UEJDKD-LEPEKD|
My code:
content = "12345,abcd,03/03/2013,23,,32,EURRIE-373HFJ-DJDMKD|838383,ldof,09/02/2017,23,,32,DJJFJF-DJFH83-JDUEJD|939393,uejs,08/07/2016,23,,32,JDJFJF-UEJDKD-LEPEKD|"
result = [content.split(',')[2] for content in content.split('|')]
for v in result[:-1]:
print v
I want to print every second-index element, which is:
03/03/2013
09/02/2017
08/07/2016
But I'm getting an index-out-of-range error. What am I doing wrong here? Can someone help fix this issue?
When I tested your code, I found that content.split('|') was generating an empty last element that was responsible for the index error after the split. So I changed it to:
[content.split(',')[2] for content in content.split('|')[:-1]]
and got:
['03/03/2013', '09/02/2017', '08/07/2016']
Does that solve the issue for you?
You can use re:
import re
content = "12345,abcd,03/03/2013,23,,32,EURRIE-373HFJ-DJDMKD|838383,ldof,09/02/2017,23,,32,DJJFJF-DJFH83-JDUEJD|939393,uejs,08/07/2016,23,,32,JDJFJF-UEJDKD-LEPEKD|"
result = re.findall(r'\d{2}/\d{2}/\d{4}', content)
Result:
for date in result:
    print date
# 03/03/2013
# 09/02/2017
# 08/07/2016
Also, you can fix your code by filtering out empty elements after the first split:
result = [content.split(',')[2] for content in content.split('|') if len(content)]
Result:
for v in result:
    print v
# 03/03/2013
# 09/02/2017
# 08/07/2016
I'm trying to write code that will pull numbers from a URL using Beautiful Soup, then sum these numbers, but I keep getting an error that looks like this:
Expected string or buffer
I think it's related to the regular expressions, but I can't pinpoint the problem.
import re
import urllib
from BeautifulSoup import *

htm1 = urllib.urlopen('https://pr4e.dr-chuck.com/tsugi/mod/python-data/data/comments_42.html').read()
soup = BeautifulSoup(htm1)
tags = soup('span')
for tag in tags:
    y = re.findall('([0-9]+)', tag.txt)
    print sum(y)
I recommend bs4 instead of BeautifulSoup (which is the old version). You also need to change this line:
y = re.findall('([0-9]+)', tag.txt)
to something like this:
y = re.findall('([0-9]+)', tag.text)
See if this gets you further:
total = 0  # initialize the running total (avoid shadowing the built-in sum)
for tag in tags:
    y = re.findall('([0-9]+)', tag.text)  # get the text from the tag
    print(y[0])  # y is a list; print the first element of the list
    total += int(y[0])  # convert it to an integer and add it to the total
print('the sum is: {}'.format(total))
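For reference, a minimal self-contained sketch of the same task using bs4 under Python 3 (an assumption; the question's code is Python 2, but the same approach works there with urllib.urlopen):
import re
import urllib.request
from bs4 import BeautifulSoup

url = 'https://pr4e.dr-chuck.com/tsugi/mod/python-data/data/comments_42.html'
html = urllib.request.urlopen(url).read()
soup = BeautifulSoup(html, 'html.parser')

# Find every run of digits inside each <span>, convert to int, and sum.
total = sum(int(num)
            for tag in soup('span')
            for num in re.findall(r'[0-9]+', tag.text))
print('the sum is: {}'.format(total))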
I am trying to count the number of contractions used by politicians in certain speeches. I have lots of speeches, but here are some of the URLs as a sample:
every_link_test = ['http://www.millercenter.org/president/obama/speeches/speech-4427',
'http://www.millercenter.org/president/obama/speeches/speech-4424',
'http://www.millercenter.org/president/obama/speeches/speech-4453',
'http://www.millercenter.org/president/obama/speeches/speech-4612',
'http://www.millercenter.org/president/obama/speeches/speech-5502']
I have a pretty rough counter right now - it only counts the total number of contractions used in all of those links. For example, the following code returns 79,101,101,182,224 for the five links above. However, I want to link up filename, a variable I create below, so I would have something like (speech_1, 79),(speech_2, 22),(speech_3,0),(speech_4,81),(speech_5,42). That way, I can track the number of contractions used in each individual speech. I'm getting the following error with my code: AttributeError: 'tuple' object has no attribute 'split'
Here's my code:
import urllib2, sys, os
from bs4 import BeautifulSoup, NavigableString
from string import punctuation as p
from multiprocessing import Pool
import re, nltk
import requests

reload(sys)

url = 'http://www.millercenter.org/president/speeches'
url2 = 'http://www.millercenter.org'
conn = urllib2.urlopen(url)
html = conn.read()
miller_center_soup = BeautifulSoup(html)
links = miller_center_soup.find_all('a')
linklist = [tag.get('href') for tag in links if tag.get('href') is not None]

# remove all items in list that don't contain 'speeches'
linkslist = [_ for _ in linklist if re.search('speeches', _)]
del linkslist[0:2]

# concatenate 'http://www.millercenter.org' with each speech's URL ending
every_link_dups = [url2 + end_link for end_link in linkslist]

# remove duplicates
seen = set()
every_link = []  # no duplicates array
for l in every_link_dups:
    if l not in seen:
        every_link.append(l)
        seen.add(l)

def processURL_short_2(l):
    open_url = urllib2.urlopen(l).read()
    item_soup = BeautifulSoup(open_url)
    item_div = item_soup.find('div', {'id': 'transcript'}, {'class': 'displaytext'})
    item_str = item_div.text.lower()
    splitlink = l.split("/")
    president = splitlink[4]
    speech_num = splitlink[-1]
    filename = "{0}_{1}".format(president, speech_num)
    return item_str, filename

every_link_test = every_link[0:5]
print every_link_test

count = 0
for l in every_link_test:
    content_1 = processURL_short_2(l)
    for word in content_1.split():
        word = word.strip(p)
        if word in contractions:
            count = count + 1
print count, filename
As the error message explains, you cannot use split the way you are using it. split is for strings.
So you will need to change this:
for word in content_1.split():
to this:
for word in content_1[0].split():
I chose [0] by running your code; I think that gives you the chunk of text you are looking to search through.
@TigerhawkT3 has a good suggestion you should follow in their answer too:
https://stackoverflow.com/a/32981533/1832539
Instead of print count, filename, you should save these data to a data structure, like a dictionary. Since processURL_short_2 has been modified to return a tuple, you'll need to unpack it.
data = {}  # initialize a dictionary
for l in every_link_test:
    content_1, filename = processURL_short_2(l)  # unpack the content and filename
    count = 0  # reset the count for each speech
    for word in content_1.split():
        word = word.strip(p)
        if word in contractions:
            count = count + 1
    data[filename] = count  # add this to the dictionary as filename:count
This would give you a dictionary like {'obama_4424':79, 'obama_4453':101,...}, allowing you to easily store and access your parsed data.
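A small follow-up sketch (assuming the data dictionary built above): since dictionaries are unordered here, you can sort the filenames when reporting:
# Print the per-speech contraction counts in a stable, sorted order.
for filename in sorted(data):
    print '{0}: {1}'.format(filename, data[filename])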
Hopefully there isn't a duplicate question that I've overlooked, because I've been scouring this forum for someone who has posted a problem similar to the one below...
Basically, I've created a Python script that scrapes the callsigns of each ship from the URL shown below and appends them to a list. In short, it works; however, whenever I iterate through the list and display each element, there are '[' and ']' characters around each of the callsigns. I've shown the output of my script below:
Output
*********************** Contents of 'listOfCallSigns' List ***********************
0 ['311062900']
1 ['235056239']
2 ['305500000']
3 ['311063300']
4 ['236111791']
5 ['245639000']
6 ['235077805']
7 ['235011590']
As you can see, it shows the square brackets for each callsign. I have a feeling that this might be down to an encoding problem within the BeautifulSoup library.
Ideally, I want the output to be without any of the square brackets and just the callsign as a string.
*********************** Contents of 'listOfCallSigns' List ***********************
0 311062900
1 235056239
2 305500000
3 311063300
4 236111791
5 245639000
6 235077805
7 235011590
This script I'm using currently is shown below:
My script
# Importing the modules needed to run the script
from bs4 import BeautifulSoup
import urllib2
import re
import requests
import pprint

# Declaring the url for the port of hull
url = "http://www.fleetmon.com/en/ports/Port_of_Hull_5898"

# Opening and reading the contents of the URL using the module 'urllib2'
# Scanning the entire webpage, finding a <table> tag with the id 'vessels_in_port_table' and finding all <tr> tags
portOfHull = urllib2.urlopen(url).read()
soup = BeautifulSoup(portOfHull)
table = soup.find("table", {'id': 'vessels_in_port_table'}).find_all("tr")

# Declaring a list to hold the call signs of each ship in the table
listOfCallSigns = []

# For each row in the table, using a regular expression to extract the first 9 numbers from each ship call-sign
# Adding each extracted call-sign to the 'listOfCallSigns' list
for i, row in enumerate(table):
    if i:
        listOfCallSigns.append(re.findall(r"\d{9}", str(row.find_all('td')[4])))

print "\n\n*********************** Contents of 'listOfCallSigns' List ***********************\n"

# Printing each element of the 'listOfCallSigns' list
for i, row in enumerate(listOfCallSigns):
    print i, row
Does anyone know how to remove the square brackets surrounding each callsign and just display the string?
Thanks in advance! :)
Change the last lines to:
# Printing each element of the 'listOfCallSigns' list
for i, row in enumerate(listOfCallSigns):
    print i, row[0]  # <-- added a [0] here
Alternatively, you can also add the [0] here:
for i, row in enumerate(table):
    if i:
        listOfCallSigns.append(re.findall(r"\d{9}", str(row.find_all('td')[4]))[0])  # <-- added a [0] here
The explanation here is that re.findall(...) returns a list (in your case, with a single element in it). So, listOfCallSigns ends up being a "list of sublists each containing a single string":
>>> listOfCallSigns
[['311062900'], ['235056239'], ['311063300'], ['236111791'],
 ['245639000'], ['305500000'], ['235077805'], ['235011590']]
When you enumerate your listOfCallSigns, the row variable is basically the re.findall(...) that you appended earlier in the code (that's why you can add the [0] after either of them).
So row and re.findall(...) are both of type "list of string(s)" and look like this:
>>> row
['311062900']
And to get the string inside the list, you need to access its first element, i.e.:
>>> row[0]
'311062900'
Hope this helps!
This can also be done by stripping the unwanted characters from the string like so:
a = "string with bad characters []'] in here"
a = a.translate(None, "[]'")
print a
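Note that this two-argument form of str.translate only works in Python 2; a rough Python 3 equivalent uses str.maketrans:
a = "string with bad characters []'] in here"
# str.maketrans('', '', chars) builds a mapping that deletes every
# character in chars when passed to str.translate.
a = a.translate(str.maketrans('', '', "[]'"))
print(a)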