Trouble getting numbers off of a webpage using re and beautifulsoup

Trouble getting numbers off of a webpage using re and beautifulsoup - python

i'm doing an assignement where i have to look throught a webpage, and pull out the numbers and compute the sum, however i'm having trouble getting the numbers and i believe my re isn't doing the job, here's the code.
from urllib.request import urlopen
from bs4 import BeautifulSoup
import ssl
import re
# Ignore SSL certificate errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE
url = 'http://py4e-data.dr-chuck.net/comments_687617.html'
html = urlopen(url, context=ctx).read()
soup = BeautifulSoup(html, "html.parser")
tags = (soup.find_all('tr'))
numbers = re.findall('[0-9]+', tags)
print (numbers)
e: #changed 'tags'to tags but the problem persists.

Use the variable tags, not the string 'tags':
Your line
numbers = re.findall('[0-9]+', 'tags')
should be
numbers = re.findall('[0-9]+', tags)

re.findall() expects a string as second argument. You are passing 'tags' which will be passed as a string not a variable because of the quotes. And, this string doesn't has any numbers in it. So, the output is an empty list.
To get the right output, you can concatenate all the tags in one string and pass it to the function. Here's one approach:
...
tags = (soup.find_all('tr'))
# Concatenate all tags to one string
string = ""
for tag in tags:
string += str(tag)
numbers = re.findall('[0-9]+', string)
print(numbers)
Output:
['97', '96', '94', '91', '90', '86', '84', '81', '81', '77', '76', '75', '75', '74', '72', '70', '70', '70', '66', '64', '64', '63', '56', '52', '52', '47', '47', '44', '43', '40', '40', '40', '40', '37', '36', '35', '33', '31', '30', '28', '22', '21', '21', '11', '11', '10', '7', '6', '2', '1']
Edit
A simpler way without using regular expression:
tags = soup.find_all('span', class_="comments")
numbers = [tag.get_text() for tag in tags]
print(numbers)

Related

Python Tkinter dependent drop down

i trying to do the following:
i have two lists, matching each other. i want to create a dropdown list with tkinter with one list, and return the match from the other list. the values in the lists match injectivly one to each other.
i want the second list match to be a label or entry without choosing it like with two dependent drop down list. i've tried the following, but it only return OD.
thanks from advance
Avraham
import tkinter
from tkinter import ttk
root = tkinter.Tk()
options = tkinter.StringVar(root)
pipe_size = [
'1/8', '1/4', '3/8', '1/2', '3/4', '1', '1 1/4', '1 1/2', '2', '2 1/2', '3', '3',
'1/2', '4', '5', '6', '8', '10', '12', '14', '16', '18', '20', '22', '24', '26', '28', '30',
'32', '34', '36', '38', '40', '42', '44', '46', '48', '52', '56', '60'
]
OD_mm = [
'10.3', '13.7', '17.1', '21.3', '26.7', '33.4', '42.2', '48.3', '60.3', '73', '88.9',
'101.6', '114.3', '141.3', '168.3', '219.1', '273.1', '323.9', '355.6', '406.4', '457.2',
'508',
'559', '610', '660', '711', '762', '813', '884', '914', '965', '1016', '1067', '1118', '1168',
'1219', '1321'
]
str_out=tkinter.StringVar(root)
str_out.set("OD")
def callback(eventObject):
abc = eventObject.widget.get()
car = pipe_size_list.get()
index=pipe_size.index(car)
str_out.config(values=OD_mm[index])
str_out.set(OD_mm_str.get())
pipe_size_list = ttk.Combobox(root, width=37, value=(pipe_size))
pipe_size_list.grid(row=3, column=1, columnspan=2, padx=10, pady=2, sticky='w')
OD_mm_str = tkinter.Label(root, textvariable = str_out)
OD_mm_str.grid(row = 5, column = 1)
OD_mm_str.bind('<<ComboboxSelected>>', callback)
root.mainloop()

Get an empty List despite no error and assuming the logic is correct

im getting an empty list when i print the below, i'm pretty sure the code is correct
I am trying to retrieve the values for both class level-0 and level-1
website:https://stamprally.org/
prefectureValues = []
prefectureValueStorage = driver.find_elements_by_class_name(
'div.header_search_inputs>select#header_search_cat1 > option.level-0.level-1')
for prefectureCode in prefectureValueStorage:
prefectureValues.append(prefectureCode.get_attribute('value'))
print(prefectureValues)

//select[#name='search_cat1']/option[#class='level-1' or #class='level-0']
Should be a simple xpath to use to get all the options with those classes.
#header_search_cat1>:is(option.level-1 ,option.level-0)
Would be the css selector for the above.
So to pull it all together
prefectureValues =[x.get_attribute('value') for x in driver.find_elements_by_xpath("//select[#name='search_cat1']/option[#class='level-1' or #class='level-0']")]
print(prefectureValues)
Outputs
['145', '66', '67', '68', '69', '70', '71', '72', '73', '74', '75', '76', '77', '78', '79', '80', '81', '82', '83', '84', '85', '86', '87', '88', '89', '90', '91', '92', '94', '93', '95', '96', '97', '98', '99', '100', '101', '102', '103', '104', '105', '106', '107', '108', '109', '110', '111', '112', '113', '114', '115', '116', '117', '118', '119', '120']
You could also use waits and etc
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
wait=WebDriverWait(driver, 10)
driver.get("https://stamprally.org/")
searchCat1Options=wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR,"#header_search_cat1>:is(option.level-1 ,option.level-0)")))
prefectureValues=[x.get_attribute('value') for x in searchCat1Options]
print(prefectureValues)

How to customize a table for label printing with reportlab

I want to make the label print to PDF from reportlab by python 3.6, and I checked the reportlab for tables' usage. All of the methods are regular form.
I want to merge the cell to realize the final effect as follows.
<> contains records from database.
My label requirements
By "Span" method, I got the tables here:
When I met the last rows, I cannot split it. Because I use 0.5cm x16, 0.5cmx11 to format the table. Now, should I change it to 0.25 cmx32, 0.25cm x 22? It must be a mass work.
My result
Does anyone give me a suggestion to solve this problem? I need a direction. Thanks.
* If simply draw line and output text, I cannot realize the align,valign, wrap etc.
My codes are here:
# -*- coding:utf-8 -*-
from reportlab.lib import colors
from reportlab.lib.pagesizes import A4,cm
from reportlab.platypus import SimpleDocTemplate, Table, TableStyle
doc = SimpleDocTemplate("LabelTest.pdf", pagesize=A4)
# container for the 'Flowable' objects
elements = []
data= [['1', '', '','','','','','','', '', '', '', '13', '', '', ''],
['', '', '','','','','','','', '', '', '', '', '', '', ''],
['', '', '','','','','','','', '', '', '', '', '', '', ''],
['4','', '','','','','','','', '', '', '', '13', '14', '15', '16'],
['5', '', '','','','','','','7','8','9', '10', '11', '12', '13', '14', '15', '16'],
['6', '', '','','','','','','7','8','9', '10', '11', '12', '13', '14', '15', '16'],
['7', '', '','','','','','','7','8','9', '10', '11', '12', '13', '14', '15', '16'],
['8', '', '','','','','','','7','8','9', '10', '11', '12', '13', '14', '15', '16'],
['9', '', '','','','','7','8','9', '10', '11', '12', '13', '14', '15', '16'],
['', '', '','','','','','8','9', '10', '11', '12', '13', '14', '15', '16'],
['', '', '','','','','','8','9', '10', '11', '12', '13', '14', '15', '16']]
t=Table(data,16*[0.5*cm], 11*[0.5*cm],)
t.setStyle(TableStyle([
('GRID',(0,0),(-1,-1),1,colors.black),
('SPAN',(-4,0),(-1,3)), # Right corner for logo image
('SPAN',(0,0),(-5,2)), # First two rows for product des and surface
('SPAN',(0,3),(-5,3)), # Third row for product requirements
('SPAN',(0,4),(5,7)), # For product picture
('SPAN',(6,3),(-1,6)), # Description and size
('SPAN',(6,4),(-1,7)), # For mat'l no.
('SPAN',(0,8),(5,-1)), # EAN-13
]))
elements.append(t)
# write the document to disk
doc.build(elements)
Currently, I find a solution to make it by myself, maybe it is not the best one, but it really helps a lot.
# -*- coding:utf-8 -*-
from reportlab.lib import colors
from reportlab.lib.pagesizes import A4,cm
from reportlab.platypus import SimpleDocTemplate, Table, TableStyle
doc = SimpleDocTemplate("ex09_1-ReyherTable.pdf", pagesize=A4)
# container for the 'Flowable' objects
elements = []
data0= [['1','9']]
t0=Table(data0,colWidths=[6*cm,2*cm],rowHeights=[2*cm])
t0.setStyle(TableStyle([
('GRID',(0,0),(-1,-1),1,colors.black),
]))
data1= [['2','9']]
t1=Table(data1,colWidths=[3*cm,5*cm],rowHeights=[2*cm])
t1.setStyle(TableStyle([
('GRID',(0,0),(-1,-1),1,colors.black),
]))
data2= [['3','4','5'],
['4','5','6'],]
t2=Table(data2,colWidths=[3*cm,2.5*cm,2.5*cm],rowHeights=2*[0.75*cm])
t2.setStyle(TableStyle([
('GRID',(0,0),(-1,-1),1,colors.black),
('SPAN',(0,0),(0,-1)),
('SPAN',(-1,0),(-1,-1)),
]))
elements.append(t0)
elements.append(t1)
elements.append(t2)
# write the document to disk
doc.build(elements)

Iteration & Casting index values to integer and float in nested lists

I'm having difficulty with iterating through the nested list table below. I understand how to iterate through the table once, but to go a level deeper and iterate through each nested list, I am stuck on the correct syntax to use. In iterating through the sublists, I am trying to cast each 'age' and 'years experience' to an integer, perform the operation 'age' - 'years experience', and append the value (as a string) to each sublist.
table = [
['first_name', 'last_name', 'age', 'years experience', 'salary'],
['James', 'Butt', '29', '8', '887174.4'],
['Josephine', 'Darakjy', '59', '39', '1051267.9'],
['Art', 'Venere', '22', '2', '47104.2'],
['Lenna', 'Paprocki', '33', '7', '343240.2'],
['Donette', 'Foller', '26', '2', '273541.4'],
['Simona', 'Morasca', '35', '15', '960967.0'],
['Mitsue', 'Tollner', '51', '31', '162776.7'],
['Leota', 'Dilliard', '64', '39', '464595.5'],
['Sage', 'Wieser', '27', '9', '819519.7'],
['Kris', 'Marrier', '59', '33', '327505.55000000005'],
['Minna', 'Amigon', '45', '23', '571227.05'],
['Abel', 'Maclead', '46', '23', '247927.25'],
['Kiley', 'Caldarera', '33', '7', '179182.8'],
['Graciela', 'Ruta', '48', '21', '136978.95'],
['Cammy', 'Albares', '29', '9', '1016378.95'],
['Mattie', 'Poquette', '39', '15', '86458.75'],
['Meaghan', 'Garufi', '21', '3', '260256.5'],
['Gladys', 'Rim', '52', '26', '827390.5'],
['Yuki', 'Whobrey', '32', '10', '652737.0'],
['Fletcher', 'Flosi', '59', '37', '954975.15']]
##Exercise 3 (rows as lists): Iterate over each row and append the following values:
#If it is the first row then extend it with the following ['Started Working', 'Salary / Experience']
#Start work age (age - years experience)
#Salary / Experience ratio = (salary / divided by experience)
for i, v in enumerate(table):
extension = ['Started Working', 'Salary/Experience']
if i == 0:
v.extend(extension)
print(i,v) #test to print out the index and nested list values
#for index, value in enumerate(v):
# age =
#exp =
#start_work = age - exp
#print(index, value) test to print out the index and each value in the nested list

Pass the argument start to enumerate, enumerate(table, 1) in your case,
table = [['first_name', 'last_name', 'age', 'years experience', 'salary'],
['James', 'Butt', '29', '8', '887174.4'],
['Josephine', 'Darakjy', '59', '39', '1051267.9'],
['Art', 'Venere', '22', '2', '47104.2']]
table[0].extend(['Started Working', 'Salary/Experience'])
for idx, row in enumerate(table[1:], 1):
start_work_age = int(row[2]) - int(row[3])
ratio = float(row[4]) / int(row[3])
table[idx].extend([str(start_work_age), str(ratio)])
print(table)
# Output
[['first_name', 'last_name', 'age', 'years experience', 'salary', 'Started Working', 'Salary/Experience'],
['James', 'Butt', '29', '8', '887174.4', '21', '110896.8'],
['Josephine', 'Darakjy', '59', '39', '1051267.9', '20', '26955.5871795'],
['Art', 'Venere', '22', '2', '47104.2', '20', '23552.1']]

If you can convert the space to an underscore in years experience you can use collections.namedtuple to make your life simpler:
from collections import namedtuple
table = [
['first_name', 'last_name', 'age', 'years_experience', 'salary'],
['James', 'Butt', '29', '8', '887174.4'],
['Josephine', 'Darakjy', '59', '39', '1051267.9'],
['Art', 'Venere', '22', '2', '47104.2'],
# ...
]
workerv1 = namedtuple('workerv1', ','.join(table[0]))
for i,v in enumerate(table):
worker = workerv1(*v)
if i == 0:
swage = 'Started Working'
sex_ratio = 'S/Ex ratio'
else:
swage = int(worker.age) - int(worker.years_experience)
sex_ratio = float(worker.salary) / float(worker.years_experience)
print("{w.first_name},{w.last_name},{w.age},{w.years_experience},{w.salary},{0},{1}".format(
swage, sex_ratio, w=worker))

HTML file parsing in Python

I have a very long html file that looks exactly like this - html file . I want to be able to parse the file such that I get the information in the form on a tuple .
Example:
<tr>
<td>Cech</td>
<td>Chelsea</td>
<td>30</td>
<td>£6.4</td>
</tr>
The above information will look like ("Cech", "Chelsea", 30, 6.4). However if you look closely at the link i posted, the html example i posted comes under a <h2>Goalkeepers</h2> tag. i need this tag too. So basically the result tuple will look like ("Cech", "Chelsea", 30, 6.4, Goalkeepers) . Further down the file a bunch of players come under <h2> tags of Midfielders , Defenders and Forwards.
I tried using beautifulsoup and ntlk libraries and got lost. So now I have the following code:
import nltk
from urllib import urlopen
url = "http://fantasy.premierleague.com/player-list/"
html = urlopen(url).read()
raw = nltk.clean_html(html)
print raw
which just strips of the html file of all the tags and gives something like this:
Cech
Chelsea
30
£6.4
Although I can write a bad piece of code that reads every line and can assign it to a tuple. i cannot come up with any solution which can also incorporate the player position ( the string present in the <h2> tags). Any solution / suggestions will be greatly appreciated.
The reason I am inclined towards using tuples i so that I can use unpacking and plan on populating a MySQl table with the unpacked values.

from bs4 import BeautifulSoup
from pprint import pprint
soup = BeautifulSoup(html)
h2s = soup.select("h2") #get all h2 elements
tables = soup.select("table") #get all tables
first = True
title =""
players = []
for i,table in enumerate(tables):
if first:
#every h2 element has 2 tables. table size = 8, h2 size = 4
#so for every 2 tables 1 h2
title = h2s[int(i/2)].text
for tr in table.select("tr"):
player = (title,) #create a player
for td in tr.select("td"):
player = player + (td.text,) #add td info in the player
if len(player) > 1:
#If the tr contains a player and its not only ("Goalkeaper") add it
players.append(player)
first = not first
pprint(players)
output:
[('Goalkeepers', 'Cech', 'Chelsea', '30', '£6.4'),
('Goalkeepers', 'Hart', 'Man City', '28', '£6.4'),
('Goalkeepers', 'Krul', 'Newcastle', '21', '£5.0'),
('Goalkeepers', 'Ruddy', 'Norwich', '25', '£5.0'),
('Goalkeepers', 'Vorm', 'Swansea', '19', '£5.0'),
('Goalkeepers', 'Stekelenburg', 'Fulham', '6', '£4.9'),
('Goalkeepers', 'Pantilimon', 'Man City', '0', '£4.9'),
('Goalkeepers', 'Lindegaard', 'Man Utd', '0', '£4.9'),
('Goalkeepers', 'Butland', 'Stoke City', '0', '£4.9'),
('Goalkeepers', 'Foster', 'West Brom', '13', '£4.9'),
('Goalkeepers', 'Viviano', 'Arsenal', '0', '£4.8'),
('Goalkeepers', 'Schwarzer', 'Chelsea', '0', '£4.7'),
('Goalkeepers', 'Boruc', 'Southampton', '42', '£4.7'),
('Goalkeepers', 'Myhill', 'West Brom', '15', '£4.5'),
('Goalkeepers', 'Fabianski', 'Arsenal', '0', '£4.4'),
('Goalkeepers', 'Gomes', 'Tottenham', '0', '£4.4'),
('Goalkeepers', 'Friedel', 'Tottenham', '0', '£4.4'),
('Goalkeepers', 'Henderson', 'West Ham', '0', '£4.0'),
('Defenders', 'Baines', 'Everton', '43', '£7.7'),
('Defenders', 'Vertonghen', 'Tottenham', '34', '£7.0'),
('Defenders', 'Taylor', 'Cardiff City', '14', '£4.5'),
('Defenders', 'Zverotic', 'Fulham', '0', '£4.5'),
('Defenders', 'Davies', 'Hull City', '28', '£4.5'),
('Defenders', 'Flanagan', 'Liverpool', '0', '£4.5'),
('Defenders', 'Dawson', 'West Brom', '0', '£3.9'),
('Defenders', 'Potts', 'West Ham', '0', '£3.9'),
('Defenders', 'Spence', 'West Ham', '0', '£3.9'),
('Midfielders', 'Özil', 'Arsenal', '24', '£10.6'),
('Midfielders', 'Redmond', 'Norwich', '20', '£5.0'),
('Midfielders', 'Mavrias', 'Sunderland', '5', '£5.0'),
('Midfielders', 'Gera', 'West Brom', '0', '£5.0'),
('Midfielders', 'Essien', 'Chelsea', '0', '£4.9'),
('Midfielders', 'Brown', 'West Brom', '0', '£4.3'),
('Forwards', 'van Persie', 'Man Utd', '24', '£13.9'),
('Forwards', 'Cornelius', 'Cardiff City', '1', '£5.4'),
('Forwards', 'Elmander', 'Norwich', '7', '£5.4'),
('Forwards', 'Murray', 'Crystal Palace', '0', '£5.3'),
('Forwards', 'Vydra', 'West Brom', '2', '£5.3'),
('Forwards', 'Proschwitz', 'Hull City', '0', '£4.3')]

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Trouble getting numbers off of a webpage using re and beautifulsoup - python

Use the variable tags, not the string 'tags': Your line numbers = re.findall('[0-9]+', 'tags') should be numbers = re.findall('[0-9]+', tags)

Related

Python Tkinter dependent drop down

Get an empty List despite no error and assuming the logic is correct

How to customize a table for label printing with reportlab

Iteration & Casting index values to integer and float in nested lists

HTML file parsing in Python

Categories

Resources