Scraping information from hoverbox

Scraping information from hoverbox - python

As background, I am scraping a webpage in Python and using BeautifulSoup.
Some of the information that I need to access is a little box about user profiles that pops up when the mouse hovers over the user's profile picture. The problem, is that this information is not available in the html, instead, I get the following:
""div class="username mo"
span class="expand_inline scrname mbrName_1586A02614A388AEE215B4A3139A2C18" onclick="ta.trackEventOnPage('Reviews', 'show_reviewer_info_window', 'user_name_name_click')">Sapphire-Ed
""
(I have deleted some of the >s so that the html will show up in the question, sorry!)
Can anyone tell me how to do this? Thank you for the help!!
Here is the webpage if that is helpful:
view-source:http://www.tripadvisor.com/Attraction_Review-g143010-d108269-Reviews-Cadillac_Mountain-Acadia_National_Park_Mount_Desert_Island_Maine.html
The information I am trying to access is the review distribution.

Below is the complete working code that outputs a dictionary where the keys are usernames and the values are review distributions. To understand how the code works, here are the key things to take into an account:
the information in the overlay appearing on the mouse over is loaded dynamically with a HTTP GET request with a number of user-specific parameters - the most important are uid and src
the uid and src values can be extracted with a regular expression from the id attribute for every user profile element
the response to this GET request is HTML which you need to parse with BeautifulSoup also
you should maintain the web-scraping session with requests.Session
The code:
import re
from pprint import pprint
import requests
from bs4 import BeautifulSoup
data = {}
# this pattern would help us to extract uid and src needed to make a GET request
pattern = re.compile(r"UID_(\w+)-SRC_(\w+)")
# making a web-scraping session
with requests.Session() as session:
response = requests.get("http://www.tripadvisor.com/Attraction_Review-g143010-d108269-Reviews-Cadillac_Mountain-Acadia_National_Park_Mount_Desert_Island_Maine.html")
soup = BeautifulSoup(response.content, "lxml")
# iterating over usernames on the page
for member in soup.select("div.member_info div.memberOverlayLink"):
# extracting uid and src from the `id` attribute
match = pattern.search(member['id'])
if match:
username = member.find("div", class_="username").text.strip()
uid, src = match.groups()
# making a GET request for the overlay information
response = session.get("http://www.tripadvisor.com/MemberOverlay", params={
"uid": uid,
"src": src,
"c": "",
"fus": "false",
"partner": "false",
"LsoId": ""
})
# getting the grades dictionary
soup_overlay = BeautifulSoup(response.content, "lxml")
data[username] = {grade_type: soup_overlay.find("span", text=grade_type).find_next_sibling("span", class_="numbersText").text.strip(" ()")
for grade_type in ["Excellent", "Very good", "Average", "Poor", "Terrible"]}
pprint(data)
Prints:
{'Anna T': {'Average': '2',
'Excellent': '0',
'Poor': '0',
'Terrible': '0',
'Very good': '2'},
'Arlyss T': {'Average': '0',
'Excellent': '6',
'Poor': '0',
'Terrible': '0',
'Very good': '1'},
'Bf B': {'Average': '1',
'Excellent': '22',
'Poor': '0',
'Terrible': '0',
'Very good': '17'},
'Charmingnl': {'Average': '15',
'Excellent': '109',
'Poor': '4',
'Terrible': '4',
'Very good': '45'},
'Jackie M': {'Average': '2',
'Excellent': '10',
'Poor': '0',
'Terrible': '0',
'Very good': '4'},
'Jonathan K': {'Average': '69',
'Excellent': '90',
'Poor': '6',
'Terrible': '0',
'Very good': '154'},
'Sapphire-Ed': {'Average': '8',
'Excellent': '47',
'Poor': '2',
'Terrible': '0',
'Very good': '49'},
'TundraJayco': {'Average': '14',
'Excellent': '59',
'Poor': '0',
'Terrible': '1',
'Very good': '49'},
'Versrii': {'Average': '2',
'Excellent': '8',
'Poor': '0',
'Terrible': '0',
'Very good': '10'},
'tripavisor83': {'Average': '12',
'Excellent': '9',
'Poor': '1',
'Terrible': '0',
'Very good': '20'}}

Related

Trouble getting numbers off of a webpage using re and beautifulsoup

i'm doing an assignement where i have to look throught a webpage, and pull out the numbers and compute the sum, however i'm having trouble getting the numbers and i believe my re isn't doing the job, here's the code.
from urllib.request import urlopen
from bs4 import BeautifulSoup
import ssl
import re
# Ignore SSL certificate errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE
url = 'http://py4e-data.dr-chuck.net/comments_687617.html'
html = urlopen(url, context=ctx).read()
soup = BeautifulSoup(html, "html.parser")
tags = (soup.find_all('tr'))
numbers = re.findall('[0-9]+', tags)
print (numbers)
e: #changed 'tags'to tags but the problem persists.

Use the variable tags, not the string 'tags':
Your line
numbers = re.findall('[0-9]+', 'tags')
should be
numbers = re.findall('[0-9]+', tags)

re.findall() expects a string as second argument. You are passing 'tags' which will be passed as a string not a variable because of the quotes. And, this string doesn't has any numbers in it. So, the output is an empty list.
To get the right output, you can concatenate all the tags in one string and pass it to the function. Here's one approach:
...
tags = (soup.find_all('tr'))
# Concatenate all tags to one string
string = ""
for tag in tags:
string += str(tag)
numbers = re.findall('[0-9]+', string)
print(numbers)
Output:
['97', '96', '94', '91', '90', '86', '84', '81', '81', '77', '76', '75', '75', '74', '72', '70', '70', '70', '66', '64', '64', '63', '56', '52', '52', '47', '47', '44', '43', '40', '40', '40', '40', '37', '36', '35', '33', '31', '30', '28', '22', '21', '21', '11', '11', '10', '7', '6', '2', '1']
Edit
A simpler way without using regular expression:
tags = soup.find_all('span', class_="comments")
numbers = [tag.get_text() for tag in tags]
print(numbers)

Turn a text file into a dictionary with Python

I have a text file with a pattern:
[Badges_373382]
Deleted=0
Button2=0 1497592154
Button1=0 1497592154
ProxReader=0
StartProc=100 1509194246 ""
NextStart=0
LastSeen=1509194246
Enabled=1
Driver=Access Control
Program=AccessProxBadge
LocChg=1509120279
Name=asd
Neuron=7F0027BF2D
Owner=373381
LostSince=1509120774
Index1=218
Photo=unknown.jpg
LastProxReader=0
Temp=0
LastTemp=0
LastMotionless=0
LastMotion=1497592154
BatteryLow=0
PrevReader=10703
Reader=357862
SuspendTill=0
SuspendSince=0
Status=1001
ConvertUponDownload=0
AXSFlags=0
Params=10106
Motion=1
USER_DATA_CreationDate=6/15/2017 4:48:15 PM
OwnerOldName=asd
[Badges_373384]
Deleted=0
Button2=0 1497538610
Button1=0 1497538610
ProxReader=0
StartProc=100 1509194246 ""
NextStart=0
LastSeen=1513872678
Enabled=1
Driver=Access Control
Program=AccessProxBadge
LocChg=1513872684
Name=dsa
Neuron=7F0027CC1C
Owner=373383
LostSince=1513872723
Index1=219
Photo=unknown.jpg
LastProxReader=0
Temp=0
LastTemp=0
LastMotionless=0
LastMotion=1497538610
BatteryLow=0
PrevReader=357874
Reader=357873
SuspendTill=0
SuspendSince=0
Status=1001
ConvertUponDownload=0
AXSFlags=0
Params=10106
Motion=1
USER_DATA_CreationDate=6/15/2017 4:48:51 PM
OwnerOldName=dsa
[Badges_373386]
Deleted=0
Button2=0 1497780768
Button1=0 1497780768
ProxReader=0
StartProc=100 1509194246 ""
NextStart=0
LastSeen=1514124910
Enabled=1
Driver=Access Control
Program=AccessProxBadge
LocChg=1514124915
Name=ss
Neuron=7F0027B5FD
Owner=373385
LostSince=1514124950
Index1=220
Photo=unknown.jpg
LastProxReader=0
Temp=0
LastTemp=0
LastMotionless=0
LastMotion=1497780768
BatteryLow=0
PrevReader=357872
Reader=357871
SuspendTill=0
SuspendSince=0
Status=1001
ConvertUponDownload=0
AXSFlags=0
Params=10106
Motion=1
USER_DATA_CreationDate=6/15/2017 4:49:24 PM
OwnerOldName=ss
Every new "Badge" info starts with [Badges_number] and end with blank line.
Using Python 3.6, I would like to turn this file into a dictionary so that I could easily access that information.
It should look like this:
content = {"Badges_373382:{"Deleted:0,.."},"Badges_371231":{"Deleted":0,..}"}
I'm pretty confused on how to do that, I'd love to get some help.
Thanks!

This is basically an INI file, and Python provides the configparser module to parse such files.
import configparser
config = configparser.ConfigParser()
config.readfp(open('badges.ini'))
r = {section: dict(config[section]) for section in config.sections()}

You can loop through each line and keep track if you have seen a header in the format [Badges_373382]:
import re
import itertools
with open('filename.txt') as f:
f = filter(lambda x:x, [i.strip('\n') for i in f])
new_data = [(a, list(b)) for a, b in itertools.groupby(f, key=lambda x:bool(re.findall('\[[a-zA-Z]+_+\d+\]', x)))]
final_data = {new_data[i][-1][-1]:dict(c.split('=') for c in new_data[i+1][-1]) for i in range(0, len(new_data), 2)}
Output:
{'[Badges_373384]': {'OwnerOldName': 'dsa', 'LastMotionless': '0', 'NextStart': '0', 'Driver': 'Access Control', 'LastTemp': '0', 'USER_DATA_CreationDate': '6/15/2017 4:48:51 PM', 'Program': 'AccessProxBadge', 'LocChg': '1513872684', 'Reader': '357873', 'LostSince': '1513872723', 'LastMotion': '1497538610', 'Status': '1001', 'Deleted': '0', 'SuspendTill': '0', 'ProxReader': '0', 'LastSeen': '1513872678', 'BatteryLow': '0', 'Index1': '219', 'Name': 'dsa', 'Temp': '0', 'Enabled': '1', 'StartProc': '100 1509194246 ""', 'Motion': '1', 'Button2': '0 1497538610', 'Button1': '0 1497538610', 'SuspendSince': '0', 'ConvertUponDownload': '0', 'PrevReader': '357874', 'AXSFlags': '0', 'LastProxReader': '0', 'Photo': 'unknown.jpg', 'Neuron': '7F0027CC1C', 'Owner': '373383', 'Params': '10106'}, '[Badges_373382]': {'OwnerOldName': 'asd', 'LastMotionless': '0', 'NextStart': '0', 'Driver': 'Access Control', 'LastTemp': '0', 'USER_DATA_CreationDate': '6/15/2017 4:48:15 PM', 'Program': 'AccessProxBadge', 'LocChg': '1509120279', 'Reader': '357862', 'LostSince': '1509120774', 'LastMotion': '1497592154', 'Status': '1001', 'Deleted': '0', 'SuspendTill': '0', 'ProxReader': '0', 'LastSeen': '1509194246', 'BatteryLow': '0', 'Index1': '218', 'Name': 'asd', 'Temp': '0', 'Enabled': '1', 'StartProc': '100 1509194246 ""', 'Motion': '1', 'Button2': '0 1497592154', 'Button1': '0 1497592154', 'SuspendSince': '0', 'ConvertUponDownload': '0', 'PrevReader': '10703', 'AXSFlags': '0', 'LastProxReader': '0', 'Photo': 'unknown.jpg', 'Neuron': '7F0027BF2D', 'Owner': '373381', 'Params': '10106'}, '[Badges_373386]': {'OwnerOldName': 'ss', 'LastMotionless': '0', 'NextStart': '0', 'Driver': 'Access Control', 'LastTemp': '0', 'USER_DATA_CreationDate': '6/15/2017 4:49:24 PM', 'Program': 'AccessProxBadge', 'LocChg': '1514124915', 'Reader': '357871', 'LostSince': '1514124950', 'LastMotion': '1497780768', 'Status': '1001', 'Deleted': '0', 'SuspendTill': '0', 'ProxReader': '0', 'LastSeen': '1514124910', 'BatteryLow': '0', 'Index1': '220', 'Name': 'ss', 'Temp': '0', 'Enabled': '1', 'StartProc': '100 1509194246 ""', 'Motion': '1', 'Button2': '0 1497780768', 'Button1': '0 1497780768', 'SuspendSince': '0', 'ConvertUponDownload': '0', 'PrevReader': '357872', 'AXSFlags': '0', 'LastProxReader': '0', 'Photo': 'unknown.jpg', 'Neuron': '7F0027B5FD', 'Owner': '373385', 'Params': '10106'}}

You can just go through each line of the file and add what you need. Their are three cases of lines you can come across:
1. The is a header, it will be a key final dictionary. You can just check if a line starts with "[Badges" here, and store the current header with a temporary variable while reading the file.
2. The line is a blank line, marking the end of the current badge data being read. All you need to do here is add the information collected from the current badge and add it to the dictionary, with the correct corresponding key. Depending on your implementation, you can delete these beforehand, or keep them when reading the lines.
3. Otherwise, the line has some info that needs to be stored. You first need to split this info on "=", and store it in your dictionary.
With these suggestions, you can write something like this to accomplish this task:
from collections import defaultdict
# dictionary of dictionary values
data = defaultdict(dict)
with open('pattern.txt') as file:
lines = [line.strip('\n') for line in file]
# keeps track of current header
header = None
# case 2, deletes empty lines before hand
valid_lines = [line for line in lines if line]
for line in valid_lines:
# case 1, for headers
if line.startswith('[Badges'):
# updates current header, and deletes square brackets
header = line.replace('[', '').replace(']', '')
# case 3, data has been found
else:
# split and add the data
info = line.split('=')
key, value = info[0], info[1]
data[header][key] = value
print(dict(data))
Which outputs:
{'Badges_373382': {'Deleted': '0', 'Button2': '0 1497592154', 'Button1': '0 1497592154', 'ProxReader': '0', 'StartProc': '100 1509194246 ""', 'NextStart': '0', 'LastSeen': '1509194246', 'Enabled': '1', 'Driver': 'Access Control', 'Program': 'AccessProxBadge', 'LocChg': '1509120279', 'Name': 'asd', 'Neuron': '7F0027BF2D', 'Owner': '373381', 'LostSince': '1509120774', 'Index1': '218', 'Photo': 'unknown.jpg', 'LastProxReader': '0', 'Temp': '0', 'LastTemp': '0', 'LastMotionless': '0', 'LastMotion': '1497592154', 'BatteryLow': '0', 'PrevReader': '10703', 'Reader': '357862', 'SuspendTill': '0', 'SuspendSince': '0', 'Status': '1001', 'ConvertUponDownload': '0', 'AXSFlags': '0', 'Params': '10106', 'Motion': '1', 'USER_DATA_CreationDate': '6/15/2017 4:48:15 PM', 'OwnerOldName': 'asd'}, 'Badges_373384': {'Deleted': '0', 'Button2': '0 1497538610', 'Button1': '0 1497538610', 'ProxReader': '0', 'StartProc': '100 1509194246 ""', 'NextStart': '0', 'LastSeen': '1513872678', 'Enabled': '1', 'Driver': 'Access Control', 'Program': 'AccessProxBadge', 'LocChg': '1513872684', 'Name': 'dsa', 'Neuron': '7F0027CC1C', 'Owner': '373383', 'LostSince': '1513872723', 'Index1': '219', 'Photo': 'unknown.jpg', 'LastProxReader': '0', 'Temp': '0', 'LastTemp': '0', 'LastMotionless': '0', 'LastMotion': '1497538610', 'BatteryLow': '0', 'PrevReader': '357874', 'Reader': '357873', 'SuspendTill': '0', 'SuspendSince': '0', 'Status': '1001', 'ConvertUponDownload': '0', 'AXSFlags': '0', 'Params': '10106', 'Motion': '1', 'USER_DATA_CreationDate': '6/15/2017 4:48:51 PM', 'OwnerOldName': 'dsa'}, 'Badges_373386': {'Deleted': '0', 'Button2': '0 1497780768', 'Button1': '0 1497780768', 'ProxReader': '0', 'StartProc': '100 1509194246 ""', 'NextStart': '0', 'LastSeen': '1514124910', 'Enabled': '1', 'Driver': 'Access Control', 'Program': 'AccessProxBadge', 'LocChg': '1514124915', 'Name': 'ss', 'Neuron': '7F0027B5FD', 'Owner': '373385', 'LostSince': '1514124950', 'Index1': '220', 'Photo': 'unknown.jpg', 'LastProxReader': '0', 'Temp': '0', 'LastTemp': '0', 'LastMotionless': '0', 'LastMotion': '1497780768', 'BatteryLow': '0', 'PrevReader': '357872', 'Reader': '357871', 'SuspendTill': '0', 'SuspendSince': '0', 'Status': '1001', 'ConvertUponDownload': '0', 'AXSFlags': '0', 'Params': '10106', 'Motion': '1', 'USER_DATA_CreationDate': '6/15/2017 4:49:24 PM', 'OwnerOldName': 'ss'}}
Note: The above code is just a possibility, feel free to adapt it to your needs, or even improve it.
I also used collections.defaultdict to add the data, since its easier to use. You can also wrap dict() at the end to convert it to a normal dictionary, which is optional.

You can try regex and split the result of output:
pattern='^\[Badges.+?OwnerOldName=\w+'
import re
with open('file.txt','r') as f:
match=re.finditer(pattern,f.read(),re.DOTALL | re.MULTILINE)
new=[]
for kk in match:
if kk.group()!='\n':
new.append(kk.group())
print({i.split()[0]:i.split()[1:] for i in new})
output:
{'[Badges_373384]': ['Deleted=0', 'Button2=0', '1497538610', 'Button1=0', '1497538610', 'ProxReader=0', 'StartProc=100', '1509194246', '""', 'NextStart=0', 'LastSeen=1513872678', 'Enabled=1', 'Driver=Access', 'Control', 'Program=AccessProxBadge', 'LocChg=1513872684', 'Name=dsa', 'Neuron=7F0027CC1C', 'Owner=373383', 'LostSince=1513872723', 'Index1=219', 'Photo=unknown.jpg', 'LastProxReader=0', 'Temp=0', 'LastTemp=0', 'LastMotionless=0', 'LastMotion=1497538610', 'BatteryLow=0', 'PrevReader=357874', 'Reader=357873', 'SuspendTill=0', 'SuspendSince=0', 'Status=1001', 'ConvertUponDownload=0', 'AXSFlags=0', 'Params=10106', 'Motion=1', 'USER_DATA_CreationDate=6/15/2017', '4:48:51', 'PM', 'OwnerOldName=dsa'], '[Badges_373382]': ['Deleted=0', 'Button2=0', '1497592154', 'Button1=0', '1497592154', 'ProxReader=0', 'StartProc=100', '1509194246', '""', 'NextStart=0', 'LastSeen=1509194246', 'Enabled=1', 'Driver=Access', 'Control', 'Program=AccessProxBadge', 'LocChg=1509120279', 'Name=asd', 'Neuron=7F0027BF2D', 'Owner=373381', 'LostSince=1509120774', 'Index1=218', 'Photo=unknown.jpg', 'LastProxReader=0', 'Temp=0', 'LastTemp=0', 'LastMotionless=0', 'LastMotion=1497592154', 'BatteryLow=0', 'PrevReader=10703', 'Reader=357862', 'SuspendTill=0', 'SuspendSince=0', 'Status=1001', 'ConvertUponDownload=0', 'AXSFlags=0', 'Params=10106', 'Motion=1', 'USER_DATA_CreationDate=6/15/2017', '4:48:15', 'PM', 'OwnerOldName=asd'], '[Badges_373386]': ['Deleted=0', 'Button2=0', '1497780768', 'Button1=0', '1497780768', 'ProxReader=0', 'StartProc=100', '1509194246', '""', 'NextStart=0', 'LastSeen=1514124910', 'Enabled=1', 'Driver=Access', 'Control', 'Program=AccessProxBadge', 'LocChg=1514124915', 'Name=ss', 'Neuron=7F0027B5FD', 'Owner=373385', 'LostSince=1514124950', 'Index1=220', 'Photo=unknown.jpg', 'LastProxReader=0', 'Temp=0', 'LastTemp=0', 'LastMotionless=0', 'LastMotion=1497780768', 'BatteryLow=0', 'PrevReader=357872', 'Reader=357871', 'SuspendTill=0', 'SuspendSince=0', 'Status=1001', 'ConvertUponDownload=0', 'AXSFlags=0', 'Params=10106', 'Motion=1', 'USER_DATA_CreationDate=6/15/2017', '4:49:24', 'PM', 'OwnerOldName=ss']}

How to use findall or search to extract data in python?

Here is my string,
str = 'A:[{type:"mb",id:9,name:"John",url:"/mb9/",cur:0,num:83498},
{type:"mb",id:92,name:"Mary",url:"/mb92/",cur:0,num:404},
{type:"mb",id:97,name:"Dan",url:"/mb97/",cur:0,num:139},
{type:"mb",id:268,name:"Jennifer",url:"/mb268/",cur:0,num:0},
{type:"mb",id:289,name:"Mike",url:"/mb289/",cur:0,num:0}],B:
[{type:"mb",id:157,name:"Sue",url:"/mb157/",cur:0,num:35200},
{type:"mb",id:3,name:"Rob",url:"/mb3/",cur:0,num:103047},
{type:"mb",id:2,name:"Tracy",url:"/mb2/",cur:0,num:87946},
{type:"mb",id:26,name:"Jenny",url:"/mb26/",cur:0,num:74870},
{type:"mb",id:5,name:"Florence",url:"/mb5/",cur:0,num:37261},
{type:"mb",id:127,name:"Peter",url:"/mb127/",cur:0,num:63711},
{type:"mb",id:15,name:"Grace",url:"/mb15/",cur:0,num:63243},
{type:"mb",id:82,name:"Tony",url:"/mb82/",cur:0,num:6471},
{type:"mb",id:236,name:"Lisa",url:"/mb236/",cur:0,num:4883}]'
I want to use findall or search to extract all the data under "name" and "url" from str. Here is what I did,
pattern = re.comile(r'type:(.*),id:(.*),name:(.*),url:(.*),cur:(.*),num:
(.*)')
for (v1, v2, v3, v4, v5, v6) in re.findall(pattern, str):
print v3
print v4
But unfortunately, this doesn't do what I want. Is there anything wrong? Thanks for your inputs.

You shouldn't call you string "str," because that's a built-in function. But here's an option for you:
# Find all of the entries
x = re.findall('(?<![AB]:)(?<=:).*?(?=[,}])', s)
['"mb"', '9', '"John"', '"/mb9/"', '0', '83498', '"mb"', '92', '"Mary"',
'"/mb92/"', '0', '404', '"mb"', '97', '"Dan"', '"/mb97/"', '0', '139',
'"mb"', '268', '"Jennifer"', '"/mb268/"', '0', '0', '"mb"', '289', '"Mike"',
'"/mb289/"', '0', '0', '"mb"', '157', '"Sue"', '"/mb157/"', '0', '35200',
'"mb"', '3', '"Rob"', '"/mb3/"', '0', '103047', '"mb"', '2', '"Tracy"',
'"/mb2/"', '0', '87946', '"mb"', '26', '"Jenny"', '"/mb26/"', '0', '74870',
'"mb"', '5', '"Florence"', '"/mb5/"', '0', '37261', '"mb"', '127', '"Peter"',
'"/mb127/"', '0', '63711', '"mb"', '15', '"Grace"', '"/mb15/"', '0', '63243',
'"mb"', '82', '"Tony"', '"/mb82/"', '0', '6471', '"mb"', '236', '"Lisa"',
'"/mb236/"', '0', '4883']
# Break up into each section
y = []
for i in range(0, len(x), 6):
y.append(x[i:i+6])
[['"mb"', '9', '"John"', '"/mb9/"', '0', '83498']
['"mb"', '92', '"Mary"', '"/mb92/"', '0', '404']
['"mb"', '97', '"Dan"', '"/mb97/"', '0', '139']
['"mb"', '268', '"Jennifer"', '"/mb268/"', '0', '0']
['"mb"', '289', '"Mike"', '"/mb289/"', '0', '0']
['"mb"', '157', '"Sue"', '"/mb157/"', '0', '35200']
['"mb"', '3', '"Rob"', '"/mb3/"', '0', '103047']
['"mb"', '2', '"Tracy"', '"/mb2/"', '0', '87946']
['"mb"', '26', '"Jenny"', '"/mb26/"', '0', '74870']
['"mb"', '5', '"Florence"', '"/mb5/"', '0', '37261']
['"mb"', '127', '"Peter"', '"/mb127/"', '0', '63711']
['"mb"', '15', '"Grace"', '"/mb15/"', '0', '63243']
['"mb"', '82', '"Tony"', '"/mb82/"', '0', '6471']
['"mb"', '236', '"Lisa"', '"/mb236/"', '0', '4883']]
# Name is 3rd value in each list and url is 4th
for i in y:
name = i[2]
url = i[3]

You can try this:
import re
data = """
A:[{type:"mb",id:9,name:"John",url:"/mb9/",cur:0,num:83498},
{type:"mb",id:92,name:"Mary",url:"/mb92/",cur:0,num:404},
{type:"mb",id:97,name:"Dan",url:"/mb97/",cur:0,num:139},
{type:"mb",id:268,name:"Jennifer",url:"/mb268/",cur:0,num:0},
{type:"mb",id:289,name:"Mike",url:"/mb289/",cur:0,num:0}],B:
[{type:"mb",id:157,name:"Sue",url:"/mb157/",cur:0,num:35200},
{type:"mb",id:3,name:"Rob",url:"/mb3/",cur:0,num:103047},
{type:"mb",id:2,name:"Tracy",url:"/mb2/",cur:0,num:87946},
{type:"mb",id:26,name:"Jenny",url:"/mb26/",cur:0,num:74870},
{type:"mb",id:5,name:"Florence",url:"/mb5/",cur:0,num:37261},
{type:"mb",id:127,name:"Peter",url:"/mb127/",cur:0,num:63711},
{type:"mb",id:15,name:"Grace",url:"/mb15/",cur:0,num:63243},
{type:"mb",id:82,name:"Tony",url:"/mb82/",cur:0,num:6471},
{type:"mb",id:236,name:"Lisa",url:"/mb236/",cur:0,num:4883}]
"""
full_data = [i[1:-1] for i in re.findall('(?<=name:)".*?"(?=,)|(?<=url:)".*?"(?=,)', data)]
final_data = [full_data[i]+":"+full_data[i+1] for i in range(0, len(full_data)-1, 2)]
print(full_data)
Output
['John:/mb9/', 'Mary:/mb92/', 'Dan:/mb97/', 'Jennifer:/mb268/', 'Mike:/mb289/', 'Sue:/mb157/', 'Rob:/mb3/', 'Tracy:/mb2/', 'Jenny:/mb26/', 'Florence:/mb5/', 'Peter:/mb127/', 'Grace:/mb15/', 'Tony:/mb82/', 'Lisa:/mb236/']

Why isn't the HTML I get from BeautifulSoup the same as the one I see when I inspect element?

I am making a username scraper and I really can't understand why the HTML is 'disappearing' when I parse it. Let's take this site for example:
http://www.lolking.net/leaderboards#/eune/1
See how there is a tbody and a bunch of tables in it?
Well when I parse it and output it to the shell the tbody is empty
<div style="background: #333; box-shadow: 0 0 2px #000; padding: 10px;">
<table class="lktable" id="leaderboard_table" width="100%">
<thead>
<tr>
<th style="width: 80px;">
Rank
</th>
<th style="width: 80px;">
Change
</th>
<th style="width: 100px;">
Tier
</th>
<th>
Summoner
</th>
<th style="width: 150px;">
Top Champions
</th>
</tr>
</thead>
<tbody>
</tbody>
</table>
</div>
</div>
Why is this happening and how can I fix it?

This site needs JavaScript to work. JavaScript is used to populate the table by forming a web request, which probably points to a back-end API. This means that the "raw" HTML, without the effects of any JavaScript, has an empty table.
We can actually see this empty table in the background if we visit the site with JavaScript disabled:
BeautifulSoup doesn't cause this JavaScript to execute. Instead, have a look at some alternative libraries which do, such as the more advanced Selenium.

You can get all the data in json format, ll you need to do is parse a value from script tag inside the original page source and pass it to "http://www.lolking.net/leaderboards/some_value_here/eune/1.json":
from bs4 import BeautifulSoup
import requests
import re
patt = re.compile("\$\.get\('/leaderboards/(\w+)/")
js = "http://www.lolking.net/leaderboards/{}/eune/1.json"
soup = BeautifulSoup(requests.get("http://www.lolking.net/leaderboards#/eune/1").content)
script = soup.find("script", text=re.compile("\$\.get\('/leaderboards/"))
val = patt.search(script.text).group(1)
data = requests.get(js.format(val)).json()
data gives you json that contains all the player info like:
{'data': [{'division': '1',
'global_ranking': '12',
'league_points': '1217',
'lks': '2961',
'losses': '31',
'most_played_champions': [{'assists': '238',
'champion_id': '236',
'creep_score': '7227',
'deaths': '131',
'kills': '288',
'losses': '5',
'played': '39',
'wins': '34'},
{'assists': '209',
'champion_id': '429',
'creep_score': '5454',
'deaths': '111',
'kills': '204',
'losses': '3',
'played': '27',
'wins': '24'},
{'assists': '155',
'champion_id': '81',
'creep_score': '4800',
'deaths': '103',
'kills': '168',
'losses': '8',
'played': '26',
'wins': '18'}],
'name': 'Sadastyczny',
'previous_ranking': '2',
'profile_icon_id': 7,
'ranking': '1',
'region': 'eune',
'summoner_id': '42893043',
'tier': '6',
'tier_name': 'CHALLENGER',
'wins': '128'},
{'division': '1',
'global_ranking': '30',
'league_points': '1128',
'lks': '2956',
'losses': '180',
'most_played_champions': [{'assists': '928',
'champion_id': '24',
'creep_score': '37601',
'deaths': '1426',
'kills': '1874',
'losses': '64',
'played': '210',
'wins': '146'},
{'assists': '501',
'champion_id': '67',
'creep_score': '16836',
'deaths': '584',
'kills': '662',
'losses': '37',
'played': '90',
'wins': '53'},
{'assists': '124',
'champion_id': '157',
'creep_score': '5058',
'deaths': '205',
'kills': '141',
'losses': '14',
'played': '28',
'wins': '14'}],
'name': 'Richor',
'previous_ranking': '1',
'profile_icon_id': 577,
'ranking': '2',
'region': 'eune',
'summoner_id': '40385818',
'tier': '6',
'tier_name': 'CHALLENGER',
'wins': '254'},
{'division': '1',
'global_ranking': '49',
'league_points': '1051',
'lks': '2953',
'losses': '47',
'most_played_champions': [{'assists': '638',
'champion_id': '117',
'creep_score': '11927',
'deaths': '99',
'kills': '199',
'losses': '7',
'played': '66',
'wins': '59'},
{'assists': '345',
'champion_id': '48',
'creep_score': '8061',
'deaths': '99',
'kills': '192',
'losses': '11',
'played': '43',
'wins': '32'},
{'assists': '161',
'champion_id': '114',
'creep_score': '5584',
'deaths': '64',
'kills': '165',
'losses': '11',
'played': '31',
'wins': '20'}],

As you can see in Chrome Dev Tools, the site sends 2 XHR requests to get the data, and displays it by using JavaScript.
Since BeautifulSoup is an HTML parser. It will not execute JavaScript. You should use a tool like selenium, which emulates a real browser.
But in this case you might be better of using the API, they use to get the data. You can easily see from which urls they get the data by looking in the 'Network' tab. Reload the page, select XHR and you can use the info to create your own requests using something like Python Requests.

HTML file parsing in Python

I have a very long html file that looks exactly like this - html file . I want to be able to parse the file such that I get the information in the form on a tuple .
Example:
<tr>
<td>Cech</td>
<td>Chelsea</td>
<td>30</td>
<td>£6.4</td>
</tr>
The above information will look like ("Cech", "Chelsea", 30, 6.4). However if you look closely at the link i posted, the html example i posted comes under a <h2>Goalkeepers</h2> tag. i need this tag too. So basically the result tuple will look like ("Cech", "Chelsea", 30, 6.4, Goalkeepers) . Further down the file a bunch of players come under <h2> tags of Midfielders , Defenders and Forwards.
I tried using beautifulsoup and ntlk libraries and got lost. So now I have the following code:
import nltk
from urllib import urlopen
url = "http://fantasy.premierleague.com/player-list/"
html = urlopen(url).read()
raw = nltk.clean_html(html)
print raw
which just strips of the html file of all the tags and gives something like this:
Cech
Chelsea
30
£6.4
Although I can write a bad piece of code that reads every line and can assign it to a tuple. i cannot come up with any solution which can also incorporate the player position ( the string present in the <h2> tags). Any solution / suggestions will be greatly appreciated.
The reason I am inclined towards using tuples i so that I can use unpacking and plan on populating a MySQl table with the unpacked values.

from bs4 import BeautifulSoup
from pprint import pprint
soup = BeautifulSoup(html)
h2s = soup.select("h2") #get all h2 elements
tables = soup.select("table") #get all tables
first = True
title =""
players = []
for i,table in enumerate(tables):
if first:
#every h2 element has 2 tables. table size = 8, h2 size = 4
#so for every 2 tables 1 h2
title = h2s[int(i/2)].text
for tr in table.select("tr"):
player = (title,) #create a player
for td in tr.select("td"):
player = player + (td.text,) #add td info in the player
if len(player) > 1:
#If the tr contains a player and its not only ("Goalkeaper") add it
players.append(player)
first = not first
pprint(players)
output:
[('Goalkeepers', 'Cech', 'Chelsea', '30', '£6.4'),
('Goalkeepers', 'Hart', 'Man City', '28', '£6.4'),
('Goalkeepers', 'Krul', 'Newcastle', '21', '£5.0'),
('Goalkeepers', 'Ruddy', 'Norwich', '25', '£5.0'),
('Goalkeepers', 'Vorm', 'Swansea', '19', '£5.0'),
('Goalkeepers', 'Stekelenburg', 'Fulham', '6', '£4.9'),
('Goalkeepers', 'Pantilimon', 'Man City', '0', '£4.9'),
('Goalkeepers', 'Lindegaard', 'Man Utd', '0', '£4.9'),
('Goalkeepers', 'Butland', 'Stoke City', '0', '£4.9'),
('Goalkeepers', 'Foster', 'West Brom', '13', '£4.9'),
('Goalkeepers', 'Viviano', 'Arsenal', '0', '£4.8'),
('Goalkeepers', 'Schwarzer', 'Chelsea', '0', '£4.7'),
('Goalkeepers', 'Boruc', 'Southampton', '42', '£4.7'),
('Goalkeepers', 'Myhill', 'West Brom', '15', '£4.5'),
('Goalkeepers', 'Fabianski', 'Arsenal', '0', '£4.4'),
('Goalkeepers', 'Gomes', 'Tottenham', '0', '£4.4'),
('Goalkeepers', 'Friedel', 'Tottenham', '0', '£4.4'),
('Goalkeepers', 'Henderson', 'West Ham', '0', '£4.0'),
('Defenders', 'Baines', 'Everton', '43', '£7.7'),
('Defenders', 'Vertonghen', 'Tottenham', '34', '£7.0'),
('Defenders', 'Taylor', 'Cardiff City', '14', '£4.5'),
('Defenders', 'Zverotic', 'Fulham', '0', '£4.5'),
('Defenders', 'Davies', 'Hull City', '28', '£4.5'),
('Defenders', 'Flanagan', 'Liverpool', '0', '£4.5'),
('Defenders', 'Dawson', 'West Brom', '0', '£3.9'),
('Defenders', 'Potts', 'West Ham', '0', '£3.9'),
('Defenders', 'Spence', 'West Ham', '0', '£3.9'),
('Midfielders', 'Özil', 'Arsenal', '24', '£10.6'),
('Midfielders', 'Redmond', 'Norwich', '20', '£5.0'),
('Midfielders', 'Mavrias', 'Sunderland', '5', '£5.0'),
('Midfielders', 'Gera', 'West Brom', '0', '£5.0'),
('Midfielders', 'Essien', 'Chelsea', '0', '£4.9'),
('Midfielders', 'Brown', 'West Brom', '0', '£4.3'),
('Forwards', 'van Persie', 'Man Utd', '24', '£13.9'),
('Forwards', 'Cornelius', 'Cardiff City', '1', '£5.4'),
('Forwards', 'Elmander', 'Norwich', '7', '£5.4'),
('Forwards', 'Murray', 'Crystal Palace', '0', '£5.3'),
('Forwards', 'Vydra', 'West Brom', '2', '£5.3'),
('Forwards', 'Proschwitz', 'Hull City', '0', '£4.3')]

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Scraping information from hoverbox - python

Related

Trouble getting numbers off of a webpage using re and beautifulsoup

Turn a text file into a dictionary with Python

How to use findall or search to extract data in python?

Why isn't the HTML I get from BeautifulSoup the same as the one I see when I inspect element?

HTML file parsing in Python

Categories

Resources