How to strip string down to "point guard" - python

I'm trying to scrape the position off of this webpage using BeautifulSoup. Here is my relevant code.
info_panel = soup.find("div", {"id": "meta"})
info_panel_rows = info_panel.find_all("p")
if info_panel_rows[2].find("strong") is not None:
    position = info_panel_rows[2].find("strong").next_sibling
    position = str(position).strip()
else:  # executing this path in my current problem
    position = info_panel_rows[3].find("strong").next_sibling
    position = str(position).strip()
print(position)
When I scrape it, though, it prints like this:
Small Forward
▪
How would I go about stripping this down to just "Small Forward"? I've looked all over Stack Overflow and couldn't find a clear answer.
Thanks for any help you can provide!

Are you having issues with a newline and tab in position? If so, do
position = str(position).strip('\n\t ')
and if that dot is also an issue, copy it from the printed output and paste it into strip. When you don't pass anything to strip, it only removes whitespace from both sides; you need to specify what you want removed. The example above removes newlines, tabs, and spaces.
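For instance, a minimal illustration of the difference on a made-up string containing the bullet:
raw = '\n\tSmall Forward\n▪\n'
print(repr(raw.strip()))          # 'Small Forward\n▪' - bare strip() trims only whitespace
print(repr(raw.strip('\n\t ▪')))  # 'Small Forward' - trims every character you list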
If this does not solve your problem, you can try a regex:
import re
string_patterns = re.compile(r'\b[0-9a-zA-Z]*\b')
position = info_panel_rows[3].find("strong").next_sibling
results = string_patterns.findall(str(position))
results = ' '.join([item for item in results if len(item)])
print(results)
Hope this helps

If you encode it to ASCII ignoring errors, decode it back, and then call strip(), you get the desired output.
import requests
from bs4 import BeautifulSoup

html = requests.get('https://www.basketball-reference.com/players/y/youngtr01.html').text
soup = BeautifulSoup(html, 'html.parser')
info_panel = soup.find("div", {"id": "meta"})
info_panel_rows = info_panel.find_all("p")
if info_panel_rows[2].find("strong") is not None:
    position = info_panel_rows[2].find("strong").next_sibling
else:
    position = info_panel_rows[3].find("strong").next_sibling
print(position.encode('ascii', 'ignore').decode().strip())
Outputs:
Point Guard
Encoding to ascii gets rid of the bullet point.
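As a minimal standalone check (the bullet is the only non-ASCII character, so 'ignore' drops it):
raw = 'Point Guard\n▪'
print(raw.encode('ascii', 'ignore').decode().strip())  # Point Guard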
Or if you just want to print the second line:
print(position.splitlines()[1].strip())
Also outputs:
Point Guard


Repeat a python function on its own output

I made a function that scrapes the last 64 characters of text from a website and adds it to url1, resulting in new_url. I want to repeat the process by scraping the last 64 characters from the resulting URL (new_url) and adding it to url1 again. The goal is to repeat this until I hit a website where the last 3 characters are "END".
Here is my code so far:
# function
from urllib import request

def getlink(url):
    url1 = 'https://www.random.computer/api.php?file='
    req = request.urlopen(url)
    link = req.read().splitlines()
    for i, line in enumerate(link):
        text = line.decode('utf-8')
        last64 = text[-64:]
        new_url = url1 + last64
    return new_url
getlink('https://www.random/api.php?file=abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz012345678910')
#output
'https://www.random/api.php?file=zyxwvutsrqponmlkjihgfedcba012345678910abcdefghijklmnopqrstuvwxyz'
My trouble is figuring out a way to be able to repeat the function on its output. Any help would be appreciated!
A simple loop should work. I've removed the first token, as it may be sensitive information; just replace the WRITE_YOUR_FIRST_TOKEN_HERE string with the token from the first link.
from urllib import request

def get_chunk(chunk, url='https://www.uchicago.computer/api.php?file='):
    with request.urlopen(url + chunk) as f:
        return f.read().decode('UTF-8').strip()

if __name__ == '__main__':
    chunk = 'WRITE_YOUR_FIRST_TOKEN_HERE'
    while chunk[-3:] != "END":
        chunk = get_chunk(chunk[-64:])
        print(chunk)
    # chunk is a string, do whatever you want with it,
    # like chunk.splitlines() to get a list of the lines
read() gets the byte stream, decode() turns it into a string, and strip() removes leading and trailing whitespace (like \n) so that it doesn't mess with the last 64 characters: if the last 64 characters include a \n, you would only get 63 characters of the actual token.
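A quick illustration of that off-by-one, using a made-up token:
token = 'x' * 64                      # hypothetical 64-character token
body = 'some header ' + token + '\n'  # response body ending in a newline
print(body[-64:] == token)            # False: the '\n' pushes one character out
print(body.strip()[-64:] == token)    # True: strip first, then slice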
Try the code below. It should do what you describe above:
import requests
from bs4 import BeautifulSoup

def getlink(url):
    url1 = 'https://www.uchicago.computer/api.php?file='
    response = requests.post(url)
    doc = BeautifulSoup(response.text, 'html.parser')
    text = doc.get_text()
    last64 = text[-65:-1]  # skip the final character (likely a trailing newline)
    new_url = url1 + last64
    return new_url

def caller(url):
    url = getlink(url)
    if not url[-3:] == 'END':
        print(url)
        caller(url)
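One caveat with the recursive version: CPython limits recursion depth (1000 frames by default), so a very long chain of links would eventually raise a RecursionError; the while loop from the first answer has no such limit. You can inspect the limit yourself:
import sys
print(sys.getrecursionlimit())  # 1000 by default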

How to remove multiple empty lines when scraping with Beautifulsoup

My code outputs multiple empty line breaks.
How do I remove all the empty space?
from bs4 import BeautifulSoup
import urllib.request
import re

url = input('enter url moish')
page = urllib.request.urlopen(url)
soup = BeautifulSoup(page, 'lxml')
all = soup.find_all('a', {'class': re.compile('itemIncludes')})
for i in all:
    print(i.text)
code output:
Canon EOS 77D DSLR Camera (Body Only)


LP-E17 Lithium-Ion Battery Pack


LC-E17 Charger for LP-E17 Battery Pack
desired output:
Canon EOS 77D DSLR Camera (Body Only)
LP-E17 Lithium-Ion Battery Pack
LC-E17 Charger for LP-E17 Battery Pack
Thanks!
You could remove the empty lines before printing:
items = [item.text for item in all if item.text.strip() != '']
for text in items:
    print(' '.join(text.split()))
The comprehension drops the whitespace-only entries, and ' '.join(text.split()) collapses the remaining runs of whitespace into single spaces.
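To see the split/join trick on its own with a toy string:
line = '  LP-E17   Lithium-Ion Battery Pack \n'
print(' '.join(line.split()))  # LP-E17 Lithium-Ion Battery Pack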
You can use a regex to filter the output, something like:
import re
text = i.text.strip()
if not re.search(r"^\s+$", text):  # if not a blank line
    print(text)
Note:
This is just a fix for the output, since the problem may reside in the find_all arguments, which I cannot test.
I'm sure you've solved this by now, but I'm brand new to Python and had the same issue. I also didn't want to just remove the lines when printing; I wanted to change them in the element. This was my solution:
import re

soup = BeautifulSoup(getPage())  # getPage() is the poster's own helper returning the HTML
elements = soup.findAll()
for element in elements:
    text = element.text.strip()
    element.string = re.sub(r"[\n][\W]+[^\w]", "\n", text)
print(soup)
This loops through the elements, gets the text, replaces any instance of "\n followed by whitespace but nothing else" (one way to find empty lines, but feel free to use a better one!), and sets the replaced value back into the element.
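Here is that substitution run on a toy string with an "empty" line in the middle (same regex, made-up input):
import re
text = 'Canon EOS 77D DSLR Camera (Body Only)\n   \n\nLP-E17 Lithium-Ion Battery Pack'
print(re.sub(r"[\n][\W]+[^\w]", "\n", text))
# Canon EOS 77D DSLR Camera (Body Only)
# LP-E17 Lithium-Ion Battery Pack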

How to web scrape all of the batters names?

I would like to scrape all of the MLB batters stats for 2018. Here is my code so far:
# import modules
from urllib.request import urlopen
from lxml import html

# fetch url/html
response = urlopen("https://www.baseball-reference.com/leagues/MLB/2018-standard-batting.shtml")
content = response.read()
tree = html.fromstring(content)

# parse data
comment_html = tree.xpath('//comment()[contains(., "players_standard_batting")]')[0]
comment_html = str(comment_html).replace("-->", "")
comment_html = comment_html.replace("<!--", "")
tree = html.fromstring(comment_html)

for batter_row in tree.xpath('//table[@id="players_standard_batting"]/tbody/tr[contains(@class, "full_table")]'):
    csk = batter_row.xpath('./td[@data-stat="player"]/@csk')[0]
When I scraped all of the batters, there was 0.01 attached to each name. I tried to remove the attached numbers using the following code:
bat_data = [csk]
string = '0.01'
result = []
for x in bat_data:
    if string in x:
        substring = x.replace(string, '')
        if substring != "":
            result.append(substring)
    else:
        result.append(x)
print(result)
This code removed the number; however, only the last batter's name was printed:
Output:
['Zunino, Mike']
Also, there is a bracket and quotations around the name. The name is also in reverse order.
1) How can I print all of the batters names?
2) How can I remove the quotation marks and brackets?
3) Can I reverse the order of the names so the first name gets printed and then the last name?
The final output I am hoping for would be all of the batters names like so: Mike Zunino.
I am new to this site... I am also new to scraping/coding and will greatly appreciate any help I can get! =)
This can be done in different ways. Here is one approach that doesn't require any post-processing; you get the names in exactly the form you wanted:
from urllib.request import urlopen
from lxml.html import fromstring

url = "https://www.baseball-reference.com/leagues/MLB/2018-standard-batting.shtml"
content = str(urlopen(url).read())
comment = content.replace("-->", "").replace("<!--", "")
tree = fromstring(comment)
for batter_row in tree.xpath('//table[contains(@class,"stats_table")]//tr[contains(@class,"full_table")]'):
    csk = batter_row.xpath('.//td[@data-stat="player"]/a')[0].text
    print(csk)
The output looks like:
Jose Abreu
Ronald Acuna
Jason Adam
Willy Adames
Austin L. Adams
You get only the last batter because you are overwriting the value of csk each time in your first loop. Initialize the empty list bat_data first and then add each batter to it.
bat_data = []
for batter_row in blah:
    csk = blah
    bat_data.append(csk)
This will give you a list of all batters, ['Abreu,Jose0.01', 'Acuna,Ronald0.01', 'Adam,Jason0.01', ...]
Then loop through this list but you don't have to check if string is in the name. Just do x.replace('0.01', '') and then check if the string is empty.
To reverse the order of the names
substring = substring.split(',')
substring.reverse()
nn = " ".join(substring)
Then append nn to the result.
You are getting the quotes and the brackets because you are printing the list. Instead iterate through the list and print each item.
Your code edited assuming you got bat_data correctly:
for x in bat_data:
    substring = x.replace(string, '')
    if substring != "":
        substring = substring.split(',')
        substring.reverse()
        substring = ' '.join(substring)
        result.append(substring)
for x in result:
    print(x)
1) Print all batter names
print(result)
This will print everything in the result object. If it’s not printing what you expect then there’s something else wrong going on.
2) Remove quotations
The brackets are due to it being a list. Try this...
print(result[0])
This will tell the interpreter to print result at the 0 index.
3) Reverse order of names
Try
name = " ".join(result[0].split(", ")[::-1])
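As a quick check of that one-liner against the sample value from earlier:
result = ['Zunino, Mike']
print(" ".join(result[0].split(", ")[::-1]))  # Mike Zunino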

Web Scraping a wikipedia page

In some Wikipedia pages, after the title of the article (appearing in bold), there is some text inside parentheses used to explain the pronunciation and phonetics of the words in the title. For example, on this page, after the bold title diglossia in the <p>, there is an open parenthesis. In order to find the corresponding close parenthesis, you have to iterate through the text nodes one by one, which is simple. What I'm trying to do is find the very next href link after that and store it.
The issue here is that (AFAIK) there isn't a way to uniquely identify the text node containing the close parenthesis and then get the following href. Is there any straightforward (not convoluted) way to get the first link outside of the initial parentheses?
EDIT
In the case of the link provided here, the href to be stored should be https://en.wikipedia.org/wiki/Dialects, since that is the first link outside of the parentheses.
Is this what you want?
import requests
from BeautifulSoup import BeautifulSoup

rs = requests.get('https://en.wikipedia.org/wiki/Diglossia', verify=False)
parsed_html = BeautifulSoup(rs.text)
print parsed_html.body.findAll('p')[0].findAll('a')[0]
This gives:
linguistics
If you want to extract the href, you can use this:
parsed_html.body.findAll('p')[0].findAll('a')[0].attrs[0][1]
UPDATE
It seems you want the href after the parentheses, not the one before.
I have written a script for it. Try this:
import requests
from BeautifulSoup import BeautifulSoup

rs = requests.get('https://en.wikipedia.org/wiki/Diglossia', verify=False)
parsed_html = BeautifulSoup(rs.text)
temp = parsed_html.body.findAll('p')[0]
start_count = 0
started = False
found = False
while temp.next and found is False:
    temp = temp.next
    if '(' in temp:
        start_count += 1
        if started is False:
            started = True
    if ')' in temp and started and start_count > 1:
        start_count -= 1
    elif ')' in temp and started and start_count == 1:
        found = True
print temp.findNext('a').attrs[0][1]
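For anyone on current versions of Python, here is a rough BeautifulSoup 4 sketch of the same parenthesis-counting idea (a sketch, not the original answer: it assumes the lead paragraph is the first <p> and that counting parentheses in the text nodes is enough):
import requests
from bs4 import BeautifulSoup, NavigableString, Tag

rs = requests.get('https://en.wikipedia.org/wiki/Diglossia')
soup = BeautifulSoup(rs.text, 'html.parser')

depth = 0
closed = False
for node in soup.find('p').descendants:
    if isinstance(node, NavigableString):
        depth += node.count('(') - node.count(')')  # parentheses still open
        if not closed and depth == 0 and ')' in node:
            closed = True  # the initial parenthetical just ended
    elif closed and isinstance(node, Tag) and node.name == 'a':
        print(node.get('href'))  # first link after the parentheses, e.g. /wiki/Dialects
        break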

Python - to output contents in a HTML file to spreadsheet

Part of the code below is sourced from another example. It's modified a bit and used to read an HTML file and output the contents into a spreadsheet.
As it's just a local file, using Selenium may be overkill, but I want to learn through this example.
from selenium import webdriver
import lxml.html as LH
import lxml.html.clean as clean
import xlwt

book = xlwt.Workbook(encoding='utf-8', style_compression=0)
sheet = book.add_sheet('SeaWeb', cell_overwrite_ok=True)
driver = webdriver.PhantomJS()
ignore_tags = ('script', 'noscript', 'style')
results = []
driver.get("source_file.html")
content = driver.page_source
cleaner = clean.Cleaner()
content = cleaner.clean_html(content)
doc = LH.fromstring(content)
for elt in doc.iterdescendants():
    if elt.tag in ignore_tags:
        continue
    text = elt.text or ''  # question 1
    tail = elt.tail or ''  # question 1
    words = ''.join((text, tail)).strip()
    if words:  # extra question
        words = words.encode('utf-8')  # question 2
        results.append(words)  # question 3
        results.append('; ')  # question 3
sheet.write(0, 0, results)
book.save("C:\\source_output.xls")
1. The lines text = elt.text or '' and tail = elt.tail or '' – why do both .text and .tail hold text? And why is the or '' part important here?
2. The text in the HTML file contains special characters like ° (temperature degrees) – the .encode('utf-8') doesn't give a clean output, either in IDLE or in an Excel spreadsheet. What's the alternative?
3. Is it possible to join the output into a single string instead of a list? Right now, to get the text plus the ; separator into the list, I have to call .append twice.
elt is an HTML node. It carries attributes and text, and lxml lets you extract the text with .text or .tail, depending on where the text sits relative to the node's children:
<a attribute1='abc'>
    some text ----> .text gets this
    <p attributeP='def'> </p>
    some tail ----> .tail gets this
</a>
The idea behind the or '' is that if there is no text/tail in the current HTML node, lxml returns None, and concatenating or appending a None later would raise an error. So, to avoid that, if the text/tail is None, an empty string '' is used instead.
The degree character is a one-character Unicode string, but when you .encode('utf-8') it, it becomes the 2-byte UTF-8 byte string \xc2\xb0. So basically you do not have to do any encoding for the ° character yourself, as long as the Python interpreter knows your source encoding. If it doesn't, provide the correct coding declaration at the top of your Python script; check PEP 0263:
# -*- coding: UTF-8 -*-
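You can check the byte values directly (Python 3 syntax shown):
print('°'.encode('utf-8'))          # b'\xc2\xb0' - the two-byte UTF-8 form
print(b'\xc2\xb0'.decode('utf-8'))  # ° - round-trips cleanly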
Yes, you can also collect the output into a string; just use + for concatenation, since string types have no append, e.g.:
results = ''
results = results + 'whatever you want to join'
You can keep the list and combine your 2 lines:
results.append(words + '; ')
Note: I just checked the xlwt documentation, and sheet.write() accepts scalar values such as a string, not a list, so you cannot pass results as-is.
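Following that note, one way is to join the list into a single cell value right before writing (a sketch reusing the sheet and results names from the question; since each appended entry already ends with '; ', a plain ''.join suffices):
cell_value = ''.join(results)  # one string like 'first; second; '
sheet.write(0, 0, cell_value)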
A simple example for Q1
from lxml import etree
test = etree.XML("<main>placeholder</main>")
print test.text #prints placeholder
print test.tail #prints None
print test.tail or '' #prints empty string
test.text = "texter"
print etree.tostring(test) #prints <main>texter</main>
test.tail = "tailer"
print etree.tostring(test) #prints <main>texter</main>tailer
