HTML file parsing in Python

I have a very long HTML file that looks exactly like this: html file. I want to be able to parse the file such that I get the information in the form of a tuple.
Example:
<tr>
<td>Cech</td>
<td>Chelsea</td>
<td>30</td>
<td>£6.4</td>
</tr>
The above information will look like ("Cech", "Chelsea", 30, 6.4). However, if you look closely at the link I posted, the HTML example above comes under a <h2>Goalkeepers</h2> tag. I need this tag too, so the result tuple will look like ("Cech", "Chelsea", 30, 6.4, "Goalkeepers"). Further down the file, more players come under <h2> tags of Midfielders, Defenders and Forwards.
I tried using the beautifulsoup and nltk libraries and got lost. So now I have the following code:
import nltk
from urllib import urlopen
url = "http://fantasy.premierleague.com/player-list/"
html = urlopen(url).read()
raw = nltk.clean_html(html)
print raw
which just strips the HTML file of all the tags and gives something like this:
Cech
Chelsea
30
£6.4
Although I can write a bad piece of code that reads every line and assigns it to a tuple, I cannot come up with any solution that also incorporates the player position (the string present in the <h2> tags). Any solution / suggestions will be greatly appreciated.
The reason I am inclined towards using tuples is so that I can use unpacking, and I plan on populating a MySQL table with the unpacked values.
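For reference, a minimal sketch of that MySQL step using executemany (the players table schema, the placeholder credentials, and the mysql.connector driver are assumptions, not part of the original question):
import mysql.connector  # assumed driver; pymysql/MySQLdb would look much the same

conn = mysql.connector.connect(user='user', password='pass', database='fantasy')  # placeholder credentials
cur = conn.cursor()
cur.execute("""CREATE TABLE IF NOT EXISTS players
               (position VARCHAR(20), name VARCHAR(50), club VARCHAR(50),
                points INT, price DECIMAL(4,1))""")
# each scraped 5-tuple unpacks into one row; assumes the '£' sign and the
# numeric strings have been cleaned up first
cur.executemany("INSERT INTO players VALUES (%s, %s, %s, %s, %s)", players)
conn.commit()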

from bs4 import BeautifulSoup
from pprint import pprint

soup = BeautifulSoup(html)
h2s = soup.select("h2")        # get all h2 elements
tables = soup.select("table")  # get all tables
first = True
title = ""
players = []
for i, table in enumerate(tables):
    if first:
        # every h2 element has 2 tables (8 tables, 4 h2s),
        # so for every 2 tables there is 1 h2
        title = h2s[int(i / 2)].text
    for tr in table.select("tr"):
        player = (title,)  # start a player tuple with the position
        for td in tr.select("td"):
            player = player + (td.text,)  # add each td's text to the player
        if len(player) > 1:
            # only append if the tr actually contained a player,
            # not just the ("Goalkeepers",) title
            players.append(player)
    first = not first
pprint(players)
output:
[('Goalkeepers', 'Cech', 'Chelsea', '30', '£6.4'),
('Goalkeepers', 'Hart', 'Man City', '28', '£6.4'),
('Goalkeepers', 'Krul', 'Newcastle', '21', '£5.0'),
('Goalkeepers', 'Ruddy', 'Norwich', '25', '£5.0'),
('Goalkeepers', 'Vorm', 'Swansea', '19', '£5.0'),
('Goalkeepers', 'Stekelenburg', 'Fulham', '6', '£4.9'),
('Goalkeepers', 'Pantilimon', 'Man City', '0', '£4.9'),
('Goalkeepers', 'Lindegaard', 'Man Utd', '0', '£4.9'),
('Goalkeepers', 'Butland', 'Stoke City', '0', '£4.9'),
('Goalkeepers', 'Foster', 'West Brom', '13', '£4.9'),
('Goalkeepers', 'Viviano', 'Arsenal', '0', '£4.8'),
('Goalkeepers', 'Schwarzer', 'Chelsea', '0', '£4.7'),
('Goalkeepers', 'Boruc', 'Southampton', '42', '£4.7'),
('Goalkeepers', 'Myhill', 'West Brom', '15', '£4.5'),
('Goalkeepers', 'Fabianski', 'Arsenal', '0', '£4.4'),
('Goalkeepers', 'Gomes', 'Tottenham', '0', '£4.4'),
('Goalkeepers', 'Friedel', 'Tottenham', '0', '£4.4'),
('Goalkeepers', 'Henderson', 'West Ham', '0', '£4.0'),
('Defenders', 'Baines', 'Everton', '43', '£7.7'),
('Defenders', 'Vertonghen', 'Tottenham', '34', '£7.0'),
('Defenders', 'Taylor', 'Cardiff City', '14', '£4.5'),
('Defenders', 'Zverotic', 'Fulham', '0', '£4.5'),
('Defenders', 'Davies', 'Hull City', '28', '£4.5'),
('Defenders', 'Flanagan', 'Liverpool', '0', '£4.5'),
('Defenders', 'Dawson', 'West Brom', '0', '£3.9'),
('Defenders', 'Potts', 'West Ham', '0', '£3.9'),
('Defenders', 'Spence', 'West Ham', '0', '£3.9'),
('Midfielders', 'Özil', 'Arsenal', '24', '£10.6'),
('Midfielders', 'Redmond', 'Norwich', '20', '£5.0'),
('Midfielders', 'Mavrias', 'Sunderland', '5', '£5.0'),
('Midfielders', 'Gera', 'West Brom', '0', '£5.0'),
('Midfielders', 'Essien', 'Chelsea', '0', '£4.9'),
('Midfielders', 'Brown', 'West Brom', '0', '£4.3'),
('Forwards', 'van Persie', 'Man Utd', '24', '£13.9'),
('Forwards', 'Cornelius', 'Cardiff City', '1', '£5.4'),
('Forwards', 'Elmander', 'Norwich', '7', '£5.4'),
('Forwards', 'Murray', 'Crystal Palace', '0', '£5.3'),
('Forwards', 'Vydra', 'West Brom', '2', '£5.3'),
('Forwards', 'Proschwitz', 'Hull City', '0', '£4.3')]
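A more direct way to pair each heading with its tables, as an alternative sketch (find_all_next is standard BeautifulSoup; the two-tables-per-h2 layout is taken from the comment in the code above):
from bs4 import BeautifulSoup

soup = BeautifulSoup(html)
players = []
for h2 in soup.select("h2"):
    # take the two tables that follow this position heading in document order
    for table in h2.find_all_next("table", limit=2):
        for tr in table.select("tr"):
            tds = [td.text for td in tr.select("td")]
            if tds:  # rows without td cells (e.g. header rows) yield an empty list
                players.append((h2.text,) + tuple(tds))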

Related

How do I append JSON entries to CSV file under columns with the same names as the keys?

This is my code :
import pandas as pd
import requests
response = requests.get(f'https://www.magicbricks.com/mbsrp/propertySearch.html?editSearch=Y&category=S&propertyType=10002,10003,10021,10022,10001,10017,10000&bedrooms=11700,11701,11702,11703,11704,11705,11706,11707,11708,11709,11710&city=4320&page=2&groupstart=30&offset=0&maxOffset=248&sortBy=premiumRecent&postedSince=-1&pType=10002,10003,10021,10022,10001,10017,10000&isNRI=N&multiLang=en')
df = pd.json_normalize(response.json()['resultList'], max_level=0)
df.to_csv('property_data.csv', mode='a')
for i in range(3, 102):
    response = requests.get(f'https://www.magicbricks.com/mbsrp/propertySearch.html?editSearch=Y&category=S&propertyType=10002,10003,10021,10022,10001,10017,10000&bedrooms=11700,11701,11702,11703,11704,11705,11706,11707,11708,11709,11710&city=4320&page={i}&groupstart={30 * (i - 1)}&offset=0&maxOffset=248&sortBy=premiumRecent&postedSince=-1&pType=10002,10003,10021,10022,10001,10017,10000&isNRI=N&multiLang=en')
    df = pd.json_normalize(response.json()['resultList'])
    df.to_csv('property_data.csv', mode='a', header=False)
An issue I am encountering while appending to the property_data.csv file is that, due to the non-existence of certain keys in the JSON, data from different columns is getting mixed up. For example,
df = pd.read_csv("property_data.csv", on_bad_lines='skip')
df['floorNo'].unique()
produces this truncated output:
array([nan, '9', '38', '12', '10', '17', '16', '48', '28', '3', '15',
'35', '5', '30', '21', '27', '45', '11', 'Lift', 'Air Conditioned',
'Private jaccuzi', 'Jogging and Strolling Track', 'North - East',
'South -West', 'East', 'West', 'South', 'North', 'South - East',
'ABAMNC', 'MCGM', 'Water Availability 24 Hours Available',
'Water Availability 6 Hours Available',
'Water Availability 12 Hours Available',
'Water Availability 1 Hour Available', '800', '355', '550', '507',
'250', '750', '1310', '1620', '1295', '650', '1012', '600', '2000',
'555', '775', '920', '1095', '882', '1500', '630', '1100', '699',
'900', '540', 'Intercom Facility', '71353', '68345', '77877',
'56356', '54016', '68703', '64595',
'ff260c41-dd5c-44a1-866f-c9cedca191e8graj6227',
'96de2716-6a3a-488f-a34c-f01563544696ghar1657',
'5d6b1f8e-73a4-4439-a956-c4c8367e8bd7raje3800',
'9f75a3cd-f749-42aa-bc77-39ef9d2ad8c7yoge1410',
'0955f6a3-f09e-417b-a2e1-8332b6d467f0vaid3337',
'b2ce8fad-2392-4b26-95d9-edaf4bf46cddkhuz5152',
'6e729b73-e20b-4b43-832d-30cef3094fc4chan8207',
'b09c339d-5667-4095-a756-b54356a44ddckash6262',
'50abb073-dfdd-4e24-b544-712b6d8bcc43bina7067',
'58d5ac1c-cdb3-4542-96c3-581b3a9fb8d1meha4330',
'35d87e1a-5dfd-4c8f-9192-0acbb8d36d60hari1100',
'53d875a6-1074-443e-b7d7-a1cb55f63d6anila1380',
'3cad0738-bdf0-4f1b-81f3-4bb890a16e77scsh5048', '+91-81XXXXXXX88',
...
'prathamesh apartments, Matunga East, Mumbai', 'Bhandup',
'40 X 100 sqft sqft', 'Freehold', 'Co-operative Society',
'Power Back Up', 'Near Sitladevi Temple, Kotak Mahindra Bank',
'Near Lodha World School', 'Newly Constructed Property',
'Near Mohan Baug Banquet hall', 'Smart Home'], dtype=object)
How do I make it so that the data is appended to the correct column in the CSV file, and if the file does not have a certain column, we create a new one and set the values of the other entries to null?
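One way to keep the columns aligned, as a sketch (not from the original post): build one DataFrame per page, combine them with pd.concat, which unions the columns and fills missing keys with NaN, and write the CSV once at the end:
import pandas as pd
import requests

frames = []
for i in range(2, 102):
    # same magicbricks URL as above with page and groupstart substituted
    url = (f'https://www.magicbricks.com/mbsrp/propertySearch.html'
           f'?editSearch=Y&category=S&page={i}&groupstart={30 * (i - 1)}')  # remaining params as above
    response = requests.get(url)
    frames.append(pd.json_normalize(response.json()['resultList']))

# concat aligns rows by column name; keys absent from a page become NaN
df = pd.concat(frames, ignore_index=True)
df.to_csv('property_data.csv', index=False)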

NoneType Error when trying to parse Table using BeautifulSoup

Here's my code:
source = urllib.request.urlopen('http://nflcombineresults.com/nflcombinedata_expanded.php?year=2015&pos=&college=').read()
soup = bs.BeautifulSoup(source, 'lxml')
table = soup.table
table = soup.find(id='datatable')
table_rows = table.find_all('tr')
#print(table_rows)
year = []
name = []
college = []
pos = []
height = []
weight = []
hand_size = []
arm_length = []
wonderlic = []
fortyyrd = []
for row in table_rows[1:]:
    col = row.find_all('td')
    #row = [i.text for i in td]
    #print(col[4])
    # Create a variable of the string inside each <td> tag pair,
    column_1 = col[0].string.strip()
    # and append it to each variable
    year.append(column_1)
    column_2 = col[1].string.strip()
    name.append(column_2)
    column_3 = col[2].string.strip()
    college.append(column_3)
    column_4 = col[3].string.strip()
    pos.append(column_4)
    #print(col[4])
    column_5 = col[4].string.strip()
    height.append(column_5)
There are several more columns in the table I want to add, but whenever I try to run these last two lines, I get an error saying:
"AttributeError: 'NoneType' object has no attribute 'strip'"
When I print col[4] right above this line, I get:
<td><div align="center">69</div></td>
I originally thought this was due to missing data, but the first instance of missing data in the original table on the website is in the 9th column (Wonderlic) of the first row, not the 4th column.
There are several other columns not included in this snippet of code that I want to add to my dataframe, and I'm getting the NoneType error with them as well despite there being an entry in that cell.
I'm fairly new to parsing tables from a site using BeautifulSoup, so this could be a stupid question, but why is this object NoneType, and how can I fix it so I can put this table into a pandas dataframe?
Alternatively, if you want to try it with pandas, you can do it like so:
import pandas as pd
df = pd.read_html("http://nflcombineresults.com/nflcombinedata_expanded.php?year=2015&pos=&college=")[0]
df.head()
Output: (the DataFrame's first rows, omitted here)

The actual error, AttributeError: 'NoneType' object has no attribute 'strip', is happening on the last row of the table, which has a single cell; here is its HTML:
<tr style="background-color:#333333;"><td colspan="15"> </td></tr>
Just slice it:
for row in table_rows[1:-1]:
As far as improving the overall quality of the code goes, you can/should follow @宏杰李's answer:
import requests
from bs4 import BeautifulSoup

r = requests.get('http://nflcombineresults.com/nflcombinedata_expanded.php?year=2015&pos=&college=')
soup = BeautifulSoup(r.text, 'lxml')
for tr in soup.table.find_all('tr'):
    row = [td.text for td in tr.find_all('td')]
    print(row)
out:
['Year', 'Name', 'College', 'POS', 'Height (in)', 'Weight (lbs)', 'Hand Size (in)', 'Arm Length (in)', 'Wonderlic', '40 Yard', 'Bench Press', 'Vert Leap (in)', 'Broad Jump (in)', 'Shuttle', '3Cone', '60Yd Shuttle']
['2015', 'Ameer Abdullah', 'Nebraska', 'RB', '69', '205', '8.63', '30.00', '', '4.60', '24', '42.5', '130', '3.95', '6.79', '11.18']
['2015', 'Nelson Agholor', 'Southern California', 'WR', '73', '198', '9.25', '32.25', '', '4.42', '12', '', '', '', '', '']
['2015', 'Malcolm Agnew', 'Southern Illinois', 'RB', '70', '202', '', '', '', '*4.59', '', '', '', '', '', '']
['2015', 'Jay Ajayi', 'Boise State', 'RB', '73', '221', '10.00', '32.00', '24', '4.57', '19', '39.0', '121', '4.10', '7.10', '11.10']
['2015', 'Brandon Alexander', 'Central Florida', 'FS', '74', '195', '', '', '', '*4.59', '', '', '', '', '', '']
['2015', 'Kwon Alexander', 'Louisiana State', 'OLB', '73', '227', '9.25', '30.25', '', '4.55', '24', '36.0', '121', '4.20', '7.14', '']
['2015', 'Mario Alford', 'West Virginia', 'WR', '68', '180', '9.38', '31.25', '', '4.43', '13', '34.0', '121', '4.07', '6.64', '11.22']
['2015', 'Detric Allen', 'East Carolina', 'CB', '73', '200', '', '', '', '*4.59', '', '', '', '', '', '']
['2015', 'Javorius Allen', 'Southern California', 'RB', '73', '221', '9.38', '31.75', '12', '4.53', '11', '35.5', '121', '4.28', '6.96', '']
As you can see, there are a lot of empty fields in the table. A better way is to put all the fields in a list and then unpack them, or use a namedtuple; this will make your code more robust.
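A minimal sketch of that namedtuple idea, built on the rows printed above (the f_ prefix and the field-name sanitising are my assumptions, needed because the headers contain spaces, digits, and parentheses):
import re
from collections import namedtuple

# header row as printed above; names are sanitised into valid identifiers
header = ['Year', 'Name', 'College', 'POS', 'Height (in)', 'Weight (lbs)',
          'Hand Size (in)', 'Arm Length (in)', 'Wonderlic', '40 Yard',
          'Bench Press', 'Vert Leap (in)', 'Broad Jump (in)', 'Shuttle',
          '3Cone', '60Yd Shuttle']
fields = ['f_' + '_'.join(re.findall(r'\w+', h)).lower() for h in header]
Player = namedtuple('Player', fields)

row = ['2015', 'Ameer Abdullah', 'Nebraska', 'RB', '69', '205', '8.63',
       '30.00', '', '4.60', '24', '42.5', '130', '3.95', '6.79', '11.18']
p = Player(*row)
print(p.f_name, p.f_40_yard)  # empty strings stand in for missing cells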

Why isn't the HTML I get from BeautifulSoup the same as the one I see when I inspect element?

I am making a username scraper and I really can't understand why the HTML is 'disappearing' when I parse it. Let's take this site for example:
http://www.lolking.net/leaderboards#/eune/1
See how there is a tbody with a bunch of rows in it? Well, when I parse it and output it to the shell, the tbody is empty:
<div style="background: #333; box-shadow: 0 0 2px #000; padding: 10px;">
<table class="lktable" id="leaderboard_table" width="100%">
<thead>
<tr>
<th style="width: 80px;">
Rank
</th>
<th style="width: 80px;">
Change
</th>
<th style="width: 100px;">
Tier
</th>
<th>
Summoner
</th>
<th style="width: 150px;">
Top Champions
</th>
</tr>
</thead>
<tbody>
</tbody>
</table>
</div>
</div>
Why is this happening and how can I fix it?
This site needs JavaScript to work. JavaScript is used to populate the table by forming a web request, which probably points to a back-end API. This means that the "raw" HTML, without the effects of any JavaScript, has an empty table.
We can actually see this empty table in the background if we visit the site with JavaScript disabled.
BeautifulSoup doesn't cause this JavaScript to execute. Instead, have a look at some alternative libraries which do, such as the more advanced Selenium.
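A minimal Selenium sketch of that approach (the Chrome driver setup and the fixed sleep are assumptions; the table id comes from the HTML shown above):
import time
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()  # assumes chromedriver is on PATH
driver.get("http://www.lolking.net/leaderboards#/eune/1")
time.sleep(5)  # crude wait for the XHR requests to populate the tbody

soup = BeautifulSoup(driver.page_source, "lxml")
for tr in soup.select("#leaderboard_table tbody tr"):
    print([td.get_text(strip=True) for td in tr.select("td")])
driver.quit()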
You can get all the data in JSON format; all you need to do is parse a value from a script tag inside the original page source and pass it to "http://www.lolking.net/leaderboards/some_value_here/eune/1.json":
from bs4 import BeautifulSoup
import requests
import re

patt = re.compile(r"\$\.get\('/leaderboards/(\w+)/")
js = "http://www.lolking.net/leaderboards/{}/eune/1.json"
soup = BeautifulSoup(requests.get("http://www.lolking.net/leaderboards#/eune/1").content)
script = soup.find("script", text=re.compile(r"\$\.get\('/leaderboards/"))
val = patt.search(script.text).group(1)
data = requests.get(js.format(val)).json()
data gives you JSON that contains all the player info, like:
{'data': [{'division': '1',
'global_ranking': '12',
'league_points': '1217',
'lks': '2961',
'losses': '31',
'most_played_champions': [{'assists': '238',
'champion_id': '236',
'creep_score': '7227',
'deaths': '131',
'kills': '288',
'losses': '5',
'played': '39',
'wins': '34'},
{'assists': '209',
'champion_id': '429',
'creep_score': '5454',
'deaths': '111',
'kills': '204',
'losses': '3',
'played': '27',
'wins': '24'},
{'assists': '155',
'champion_id': '81',
'creep_score': '4800',
'deaths': '103',
'kills': '168',
'losses': '8',
'played': '26',
'wins': '18'}],
'name': 'Sadastyczny',
'previous_ranking': '2',
'profile_icon_id': 7,
'ranking': '1',
'region': 'eune',
'summoner_id': '42893043',
'tier': '6',
'tier_name': 'CHALLENGER',
'wins': '128'},
{'division': '1',
'global_ranking': '30',
'league_points': '1128',
'lks': '2956',
'losses': '180',
'most_played_champions': [{'assists': '928',
'champion_id': '24',
'creep_score': '37601',
'deaths': '1426',
'kills': '1874',
'losses': '64',
'played': '210',
'wins': '146'},
{'assists': '501',
'champion_id': '67',
'creep_score': '16836',
'deaths': '584',
'kills': '662',
'losses': '37',
'played': '90',
'wins': '53'},
{'assists': '124',
'champion_id': '157',
'creep_score': '5058',
'deaths': '205',
'kills': '141',
'losses': '14',
'played': '28',
'wins': '14'}],
'name': 'Richor',
'previous_ranking': '1',
'profile_icon_id': 577,
'ranking': '2',
'region': 'eune',
'summoner_id': '40385818',
'tier': '6',
'tier_name': 'CHALLENGER',
'wins': '254'},
{'division': '1',
'global_ranking': '49',
'league_points': '1051',
'lks': '2953',
'losses': '47',
'most_played_champions': [{'assists': '638',
'champion_id': '117',
'creep_score': '11927',
'deaths': '99',
'kills': '199',
'losses': '7',
'played': '66',
'wins': '59'},
{'assists': '345',
'champion_id': '48',
'creep_score': '8061',
'deaths': '99',
'kills': '192',
'losses': '11',
'played': '43',
'wins': '32'},
{'assists': '161',
'champion_id': '114',
'creep_score': '5584',
'deaths': '64',
'kills': '165',
'losses': '11',
'played': '31',
'wins': '20'}],
... (output truncated)
As you can see in Chrome Dev Tools, the site sends 2 XHR requests to get the data and displays it using JavaScript.
Since BeautifulSoup is an HTML parser, it will not execute JavaScript. You should use a tool like Selenium, which emulates a real browser.
But in this case you might be better off using the API they use to get the data. You can easily see which URLs the data comes from by looking in the 'Network' tab: reload the page, select XHR, and you can use that info to create your own requests using something like Python Requests.

Having trouble with beautifulsoup in python

I am very new to Python and am having trouble with the code below. I am trying to get either the temperature or the date from the website, but can't seem to get an output. I have tried many variations, but still can't seem to get it right.
Thank you for your help!
#Code below:
import requests,bs4
r = requests.get('http://www.hko.gov.hk/contente.htm')
print r.raise_for_status()
hkweather = bs4.BeautifulSoup(r.text)
print hkweather.select('div left_content fnd_day fnd_date')
Your CSS selector is incorrect; you should use . between the tag and its CSS classes. The tags you want are the divs with the fnd_day class inside the div with the id fnd_content:
divs = soup.select("#fnd_content div.fnd_day")
But that still won't get the data, as it is dynamically generated through an ajax request. You can get all the data in JSON format using the code below:
u = "http://www.hko.gov.hk/wxinfo/json/one_json.xml?_=1468955579991"
data = requests.get(u).json()
from pprint import pprint as pp
pp(data)
That returns pretty much all the dynamic content including the dates and temps etc..
If you access the key 'F9D', you can see the general weather description and all the temps and dates:
from pprint import pprint as pp
pp(data['F9D'])
Output:
{'BulletinDate': '20160720',
'BulletinTime': '0315',
'GeneralSituation': 'A southwesterly airstream will bring showers to the '
'coast of Guangdong today. Under the dominance of an '
'upper-air anticyclone, it will be generally fine and '
'very hot over southern China in the latter part of this '
'week and early next week.',
'NPTemp': '25',
'WeatherForecast': [{'ForecastDate': '20160720',
'ForecastIcon': 'pic53.png',
'ForecastMaxrh': '95',
'ForecastMaxtemp': '32',
'ForecastMinrh': '70',
'ForecastMintemp': '26',
'ForecastWeather': 'Sunny periods and a few showers. '
'Isolated squally thunderstorms at '
'first.',
'ForecastWind': 'South to southwest force 4.',
'IconDesc': 'Sunny Periods with A Few Showers',
'WeekDay': '3'},
{'ForecastDate': '20160721',
'ForecastIcon': 'pic90.png',
'ForecastMaxrh': '90',
'ForecastMaxtemp': '33',
'ForecastMinrh': '65',
'ForecastMintemp': '28',
'ForecastWeather': 'Mainly fine and very hot apart from '
'isolated showers in the morning.',
'ForecastWind': 'South to southwest force 3 to 4.',
'IconDesc': 'Hot',
'WeekDay': '4'},
{'ForecastDate': '20160722',
'ForecastIcon': 'pic90.png',
'ForecastMaxrh': '90',
'ForecastMaxtemp': '33',
'ForecastMinrh': '65',
'ForecastMintemp': '28',
'ForecastWeather': 'Mainly fine and very hot apart from '
'isolated showers in the morning.',
'ForecastWind': 'Southwest force 3.',
'IconDesc': 'Hot',
'WeekDay': '5'},
{'ForecastDate': '20160723',
'ForecastIcon': 'pic90.png',
'ForecastMaxrh': '90',
'ForecastMaxtemp': '34',
'ForecastMinrh': '65',
'ForecastMintemp': '29',
'ForecastWeather': 'Fine and very hot.',
'ForecastWind': 'Southwest force 3.',
'IconDesc': 'Hot',
'WeekDay': '6'},
{'ForecastDate': '20160724',
'ForecastIcon': 'pic90.png',
'ForecastMaxrh': '90',
'ForecastMaxtemp': '34',
'ForecastMinrh': '65',
'ForecastMintemp': '29',
'ForecastWeather': 'Fine and very hot.',
'ForecastWind': 'Southwest force 3.',
'IconDesc': 'Hot',
'WeekDay': '0'},
{'ForecastDate': '20160725',
'ForecastIcon': 'pic90.png',
'ForecastMaxrh': '90',
'ForecastMaxtemp': '33',
'ForecastMinrh': '65',
'ForecastMintemp': '29',
'ForecastWeather': 'Mainly fine and very hot apart from '
'isolated showers in the morning.',
'ForecastWind': 'South to southwest force 3.',
'IconDesc': 'Hot',
'WeekDay': '1'},
{'ForecastDate': '20160726',
'ForecastIcon': 'pic90.png',
'ForecastMaxrh': '90',
'ForecastMaxtemp': '33',
'ForecastMinrh': '65',
'ForecastMintemp': '29',
'ForecastWeather': 'Mainly fine and very hot apart from '
'isolated showers in the morning.',
'ForecastWind': 'South to southwest force 3.',
'IconDesc': 'Hot',
'WeekDay': '2'},
{'ForecastDate': '20160727',
'ForecastIcon': 'pic90.png',
'ForecastMaxrh': '90',
'ForecastMaxtemp': '33',
'ForecastMinrh': '65',
'ForecastMintemp': '28',
'ForecastWeather': 'Mainly fine and very hot apart from '
'isolated showers in the morning.',
'ForecastWind': 'Southwest force 3 to 4.',
'IconDesc': 'Hot',
'WeekDay': '3'},
{'ForecastDate': '20160728',
'ForecastIcon': 'pic90.png',
'ForecastMaxrh': '90',
'ForecastMaxtemp': '33',
'ForecastMinrh': '65',
'ForecastMintemp': '28',
'ForecastWeather': 'Mainly fine and very hot apart from '
'isolated showers in the morning.',
'ForecastWind': 'Southwest force 3 to 4.',
'IconDesc': 'Hot',
'WeekDay': '4'}]}
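To pull out just the dates and temperatures the question asked about, a short sketch over the keys shown above:
# iterate the forecast entries and print each date with its temperature range
for day in data['F9D']['WeatherForecast']:
    print(day['ForecastDate'],
          day['ForecastMintemp'] + '-' + day['ForecastMaxtemp'], 'C')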
The only query string parameter is the epoch timestamp, which you can generate using the time lib:
from time import time
u = "http://www.hko.gov.hk/wxinfo/json/one_json.xml?_={}".format(int(time()))
data = requests.get(u).json()
Not passing the timestamp also returns the same data, so I will leave you to investigate its significance.
I was able to get the dates:
>>> import requests,bs4
>>> r = requests.get('http://www.hko.gov.hk/contente.htm')
>>> hkweather = bs4.BeautifulSoup(r.text)
>>> print hkweather.select('div[class="fnd_date"]')
# [<div class="fnd_date"></div>, <div class="fnd_date"></div>, <div class="fnd_date"></div>, <div class="fnd_date"></div>, <div class="fnd_date"></div>, <div class="fnd_date"></div>, <div class="fnd_date"></div>, <div class="fnd_date"></div>, <div class="fnd_date"></div>]
But the text is missing. This doesn't seem like a problem with BeautifulSoup because I looked through r.text myself and all I see is <div class="fnd_date"></div> instead of anything like <div class="fnd_date">July 20</div>.
You can check that the text isn't there using a regex (although using regexes on HTML is frowned upon):
>>> import re
>>> re.findall(r'<[^<>]*fnd_date[^<>]*>[^>]*>', r.text)
# [u'<div id="fnd_date" class="date"></div>', ... repeated 10 times]

Iteration & Casting index values to integer and float in nested lists

I'm having difficulty iterating through the nested list (table) below. I understand how to iterate through the table once, but to go a level deeper and iterate through each nested list, I am stuck on the correct syntax to use. While iterating through the sublists, I am trying to cast each 'age' and 'years experience' to an integer, perform the operation 'age' - 'years experience', and append the value (as a string) to each sublist.
table = [
['first_name', 'last_name', 'age', 'years experience', 'salary'],
['James', 'Butt', '29', '8', '887174.4'],
['Josephine', 'Darakjy', '59', '39', '1051267.9'],
['Art', 'Venere', '22', '2', '47104.2'],
['Lenna', 'Paprocki', '33', '7', '343240.2'],
['Donette', 'Foller', '26', '2', '273541.4'],
['Simona', 'Morasca', '35', '15', '960967.0'],
['Mitsue', 'Tollner', '51', '31', '162776.7'],
['Leota', 'Dilliard', '64', '39', '464595.5'],
['Sage', 'Wieser', '27', '9', '819519.7'],
['Kris', 'Marrier', '59', '33', '327505.55000000005'],
['Minna', 'Amigon', '45', '23', '571227.05'],
['Abel', 'Maclead', '46', '23', '247927.25'],
['Kiley', 'Caldarera', '33', '7', '179182.8'],
['Graciela', 'Ruta', '48', '21', '136978.95'],
['Cammy', 'Albares', '29', '9', '1016378.95'],
['Mattie', 'Poquette', '39', '15', '86458.75'],
['Meaghan', 'Garufi', '21', '3', '260256.5'],
['Gladys', 'Rim', '52', '26', '827390.5'],
['Yuki', 'Whobrey', '32', '10', '652737.0'],
['Fletcher', 'Flosi', '59', '37', '954975.15']]
##Exercise 3 (rows as lists): Iterate over each row and append the following values:
#If it is the first row then extend it with the following ['Started Working', 'Salary / Experience']
#Start work age (age - years experience)
#Salary / Experience ratio = (salary / divided by experience)
for i, v in enumerate(table):
    extension = ['Started Working', 'Salary/Experience']
    if i == 0:
        v.extend(extension)
    print(i, v)  # test to print out the index and nested list values
    #for index, value in enumerate(v):
    #    age =
    #    exp =
    #    start_work = age - exp
    #    print(index, value)  # test to print out the index and each value in the nested list
Pass the start argument to enumerate, i.e. enumerate(table, 1) in your case:
table = [['first_name', 'last_name', 'age', 'years experience', 'salary'],
         ['James', 'Butt', '29', '8', '887174.4'],
         ['Josephine', 'Darakjy', '59', '39', '1051267.9'],
         ['Art', 'Venere', '22', '2', '47104.2']]

table[0].extend(['Started Working', 'Salary/Experience'])
for idx, row in enumerate(table[1:], 1):
    start_work_age = int(row[2]) - int(row[3])
    ratio = float(row[4]) / int(row[3])
    table[idx].extend([str(start_work_age), str(ratio)])
print(table)
# Output
[['first_name', 'last_name', 'age', 'years experience', 'salary', 'Started Working', 'Salary/Experience'],
['James', 'Butt', '29', '8', '887174.4', '21', '110896.8'],
['Josephine', 'Darakjy', '59', '39', '1051267.9', '20', '26955.5871795'],
['Art', 'Venere', '22', '2', '47104.2', '20', '23552.1']]
If you can convert the space in 'years experience' to an underscore, you can use collections.namedtuple to make your life simpler:
from collections import namedtuple

table = [
    ['first_name', 'last_name', 'age', 'years_experience', 'salary'],
    ['James', 'Butt', '29', '8', '887174.4'],
    ['Josephine', 'Darakjy', '59', '39', '1051267.9'],
    ['Art', 'Venere', '22', '2', '47104.2'],
    # ...
]
workerv1 = namedtuple('workerv1', ','.join(table[0]))
for i, v in enumerate(table):
    worker = workerv1(*v)
    if i == 0:
        swage = 'Started Working'
        sex_ratio = 'S/Ex ratio'
    else:
        swage = int(worker.age) - int(worker.years_experience)
        sex_ratio = float(worker.salary) / float(worker.years_experience)
    print("{w.first_name},{w.last_name},{w.age},{w.years_experience},{w.salary},{0},{1}".format(
        swage, sex_ratio, w=worker))
