Having trouble with BeautifulSoup in Python

I am very new to Python and am having trouble with the code below. I am trying to get either the temperature or the date from the website, but can't seem to get any output. I have tried many variations, but still can't seem to get it right.
Thank you for your help!
#Code below:
import requests,bs4
r = requests.get('http://www.hko.gov.hk/contente.htm')
print r.raise_for_status()
hkweather = bs4.BeautifulSoup(r.text)
print hkweather.select('div left_content fnd_day fnd_date')

Your CSS selector is incorrect: you need a . between the tag name and its CSS classes. Also, the tags you want are the divs with the fnd_day class, inside the div with the id fnd_content:
divs = soup.select("#fnd_content div.fnd_day")
But that still won't get the data, as it is dynamically generated through an ajax request. You can get all the data in json format using the code below:
u = "http://www.hko.gov.hk/wxinfo/json/one_json.xml?_=1468955579991"
data = requests.get(u).json()
from pprint import pprint as pp
pp(data)
That returns pretty much all the dynamic content, including the dates and temps. If you access the key 'F9D', you can see the general weather description along with all the temps and dates:
from pprint import pprint as pp
pp(data['F9D'])
Output:
{'BulletinDate': '20160720',
 'BulletinTime': '0315',
 'GeneralSituation': 'A southwesterly airstream will bring showers to the '
                     'coast of Guangdong today. Under the dominance of an '
                     'upper-air anticyclone, it will be generally fine and '
                     'very hot over southern China in the latter part of this '
                     'week and early next week.',
 'NPTemp': '25',
 'WeatherForecast': [{'ForecastDate': '20160720',
                      'ForecastIcon': 'pic53.png',
                      'ForecastMaxrh': '95',
                      'ForecastMaxtemp': '32',
                      'ForecastMinrh': '70',
                      'ForecastMintemp': '26',
                      'ForecastWeather': 'Sunny periods and a few showers. '
                                         'Isolated squally thunderstorms at '
                                         'first.',
                      'ForecastWind': 'South to southwest force 4.',
                      'IconDesc': 'Sunny Periods with A Few Showers',
                      'WeekDay': '3'},
                     {'ForecastDate': '20160721',
                      'ForecastIcon': 'pic90.png',
                      'ForecastMaxrh': '90',
                      'ForecastMaxtemp': '33',
                      'ForecastMinrh': '65',
                      'ForecastMintemp': '28',
                      'ForecastWeather': 'Mainly fine and very hot apart from '
                                         'isolated showers in the morning.',
                      'ForecastWind': 'South to southwest force 3 to 4.',
                      'IconDesc': 'Hot',
                      'WeekDay': '4'},
                     {'ForecastDate': '20160722',
                      'ForecastIcon': 'pic90.png',
                      'ForecastMaxrh': '90',
                      'ForecastMaxtemp': '33',
                      'ForecastMinrh': '65',
                      'ForecastMintemp': '28',
                      'ForecastWeather': 'Mainly fine and very hot apart from '
                                         'isolated showers in the morning.',
                      'ForecastWind': 'Southwest force 3.',
                      'IconDesc': 'Hot',
                      'WeekDay': '5'},
                     {'ForecastDate': '20160723',
                      'ForecastIcon': 'pic90.png',
                      'ForecastMaxrh': '90',
                      'ForecastMaxtemp': '34',
                      'ForecastMinrh': '65',
                      'ForecastMintemp': '29',
                      'ForecastWeather': 'Fine and very hot.',
                      'ForecastWind': 'Southwest force 3.',
                      'IconDesc': 'Hot',
                      'WeekDay': '6'},
                     {'ForecastDate': '20160724',
                      'ForecastIcon': 'pic90.png',
                      'ForecastMaxrh': '90',
                      'ForecastMaxtemp': '34',
                      'ForecastMinrh': '65',
                      'ForecastMintemp': '29',
                      'ForecastWeather': 'Fine and very hot.',
                      'ForecastWind': 'Southwest force 3.',
                      'IconDesc': 'Hot',
                      'WeekDay': '0'},
                     {'ForecastDate': '20160725',
                      'ForecastIcon': 'pic90.png',
                      'ForecastMaxrh': '90',
                      'ForecastMaxtemp': '33',
                      'ForecastMinrh': '65',
                      'ForecastMintemp': '29',
                      'ForecastWeather': 'Mainly fine and very hot apart from '
                                         'isolated showers in the morning.',
                      'ForecastWind': 'South to southwest force 3.',
                      'IconDesc': 'Hot',
                      'WeekDay': '1'},
                     {'ForecastDate': '20160726',
                      'ForecastIcon': 'pic90.png',
                      'ForecastMaxrh': '90',
                      'ForecastMaxtemp': '33',
                      'ForecastMinrh': '65',
                      'ForecastMintemp': '29',
                      'ForecastWeather': 'Mainly fine and very hot apart from '
                                         'isolated showers in the morning.',
                      'ForecastWind': 'South to southwest force 3.',
                      'IconDesc': 'Hot',
                      'WeekDay': '2'},
                     {'ForecastDate': '20160727',
                      'ForecastIcon': 'pic90.png',
                      'ForecastMaxrh': '90',
                      'ForecastMaxtemp': '33',
                      'ForecastMinrh': '65',
                      'ForecastMintemp': '28',
                      'ForecastWeather': 'Mainly fine and very hot apart from '
                                         'isolated showers in the morning.',
                      'ForecastWind': 'Southwest force 3 to 4.',
                      'IconDesc': 'Hot',
                      'WeekDay': '3'},
                     {'ForecastDate': '20160728',
                      'ForecastIcon': 'pic90.png',
                      'ForecastMaxrh': '90',
                      'ForecastMaxtemp': '33',
                      'ForecastMinrh': '65',
                      'ForecastMintemp': '28',
                      'ForecastWeather': 'Mainly fine and very hot apart from '
                                         'isolated showers in the morning.',
                      'ForecastWind': 'Southwest force 3 to 4.',
                      'IconDesc': 'Hot',
                      'WeekDay': '4'}]}
The only query string parameter is the epoch timestamp, which you can generate using the time lib:
from time import time
u = "http://www.hko.gov.hk/wxinfo/json/one_json.xml?_={}".format(int(time()))
data = requests.get(u).json()
Not passing the timestamp also returns the same data, so I will leave you to investigate its significance.
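For example, to pull out just the dates and the temperature range the question asked about, here is a minimal sketch based on the 'F9D' structure shown above:
import requests

data = requests.get("http://www.hko.gov.hk/wxinfo/json/one_json.xml").json()

# each forecast entry carries its date and min/max temperatures as strings
for day in data['F9D']['WeatherForecast']:
    print(day['ForecastDate'], day['ForecastMintemp'], day['ForecastMaxtemp'])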

I was able to get the dates:
>>> import requests,bs4
>>> r = requests.get('http://www.hko.gov.hk/contente.htm')
>>> hkweather = bs4.BeautifulSoup(r.text)
>>> print hkweather.select('div[class="fnd_date"]')
# [<div class="fnd_date"></div>, <div class="fnd_date"></div>, <div class="fnd_date"></div>, <div class="fnd_date"></div>, <div class="fnd_date"></div>, <div class="fnd_date"></div>, <div class="fnd_date"></div>, <div class="fnd_date"></div>, <div class="fnd_date"></div>]
But the text is missing. This doesn't seem like a problem with BeautifulSoup because I looked through r.text myself and all I see is <div class="fnd_date"></div> instead of anything like <div class="fnd_date">July 20</div>.
You can check that the text isn't there using regex (although using regex with HTML is frowned upon):
>>> import re
>>> re.findall(r'<[^<>]*fnd_date[^<>]*>[^>]*>', r.text)
# [u'<div id="fnd_date" class="date"></div>', ... repeated 10 times]

Related

How do I append JSON entries to CSV file under columns with the same names as the keys?

This is my code:
import pandas as pd
import requests

response = requests.get('https://www.magicbricks.com/mbsrp/propertySearch.html?editSearch=Y&category=S&propertyType=10002,10003,10021,10022,10001,10017,10000&bedrooms=11700,11701,11702,11703,11704,11705,11706,11707,11708,11709,11710&city=4320&page=2&groupstart=30&offset=0&maxOffset=248&sortBy=premiumRecent&postedSince=-1&pType=10002,10003,10021,10022,10001,10017,10000&isNRI=N&multiLang=en')
df = pd.json_normalize(response.json()['resultList'], max_level=0)
df.to_csv('property_data.csv', mode='a')

for i in range(3, 102):
    response = requests.get(f'https://www.magicbricks.com/mbsrp/propertySearch.html?editSearch=Y&category=S&propertyType=10002,10003,10021,10022,10001,10017,10000&bedrooms=11700,11701,11702,11703,11704,11705,11706,11707,11708,11709,11710&city=4320&page={i}&groupstart={30 * (i - 1)}&offset=0&maxOffset=248&sortBy=premiumRecent&postedSince=-1&pType=10002,10003,10021,10022,10001,10017,10000&isNRI=N&multiLang=en')
    df = pd.json_normalize(response.json()['resultList'])
    df.to_csv('property_data.csv', mode='a', header=False)
An issue I am encountering while appending to the property_data.csv file is that, due to the non-existence of certain keys in the JSON, data from different columns is getting mixed up.
E.g.:
df = pd.read_csv("property_data.csv", on_bad_lines='skip')
df['floorNo'].unique()
produces the result
array([nan, '9', '38', '12', '10', '17', '16', '48', '28', '3', '15',
'35', '5', '30', '21', '27', '45', '11', 'Lift', 'Air Conditioned',
'Private jaccuzi', 'Jogging and Strolling Track', 'North - East',
'South -West', 'East', 'West', 'South', 'North', 'South - East',
'ABAMNC', 'MCGM', 'Water Availability 24 Hours Available',
'Water Availability 6 Hours Available',
'Water Availability 12 Hours Available',
'Water Availability 1 Hour Available', '800', '355', '550', '507',
'250', '750', '1310', '1620', '1295', '650', '1012', '600', '2000',
'555', '775', '920', '1095', '882', '1500', '630', '1100', '699',
'900', '540', 'Intercom Facility', '71353', '68345', '77877',
'56356', '54016', '68703', '64595',
'ff260c41-dd5c-44a1-866f-c9cedca191e8graj6227',
'96de2716-6a3a-488f-a34c-f01563544696ghar1657',
'5d6b1f8e-73a4-4439-a956-c4c8367e8bd7raje3800',
'9f75a3cd-f749-42aa-bc77-39ef9d2ad8c7yoge1410',
'0955f6a3-f09e-417b-a2e1-8332b6d467f0vaid3337',
'b2ce8fad-2392-4b26-95d9-edaf4bf46cddkhuz5152',
'6e729b73-e20b-4b43-832d-30cef3094fc4chan8207',
'b09c339d-5667-4095-a756-b54356a44ddckash6262',
'50abb073-dfdd-4e24-b544-712b6d8bcc43bina7067',
'58d5ac1c-cdb3-4542-96c3-581b3a9fb8d1meha4330',
'35d87e1a-5dfd-4c8f-9192-0acbb8d36d60hari1100',
'53d875a6-1074-443e-b7d7-a1cb55f63d6anila1380',
'3cad0738-bdf0-4f1b-81f3-4bb890a16e77scsh5048', '+91-81XXXXXXX88',
...
'prathamesh apartments, Matunga East, Mumbai', 'Bhandup',
'40 X 100 sqft sqft', 'Freehold', 'Co-operative Society',
'Power Back Up', 'Near Sitladevi Temple, Kotak Mahindra Bank',
'Near Lodha World School', 'Newly Constructed Property',
'Near Mohan Baug Banquet hall', 'Smart Home'], dtype=object)
How do I make it so that the data is appended to the correct column in the CSV file, and, if the file does not yet have a certain column, a new one is created with the values for the other entries set to null?
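One way to avoid the column mixup (a sketch of the usual pandas approach, not from the original thread): collect every page into a list of DataFrames and write the CSV once at the end, since pd.concat aligns columns by name and fills keys that are missing from a page with NaN:
import pandas as pd
import requests

url_tmpl = ('https://www.magicbricks.com/mbsrp/propertySearch.html?editSearch=Y'
            '&category=S&propertyType=10002,10003,10021,10022,10001,10017,10000'
            '&bedrooms=11700,11701,11702,11703,11704,11705,11706,11707,11708,11709,11710'
            '&city=4320&page={page}&groupstart={start}&offset=0&maxOffset=248'
            '&sortBy=premiumRecent&postedSince=-1'
            '&pType=10002,10003,10021,10022,10001,10017,10000&isNRI=N&multiLang=en')

frames = []
for i in range(2, 102):
    response = requests.get(url_tmpl.format(page=i, start=30 * (i - 1)))
    frames.append(pd.json_normalize(response.json()['resultList']))

# concat aligns columns by name; keys absent from a page become NaN
pd.concat(frames, ignore_index=True).to_csv('property_data.csv', index=False)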

Unable to locate element under dynamic-comp

There are dozens of questions here on SO with a title very similar to this one - but most of them seem related to some iFrame, which prevents Selenium from accessing the intended tag, node, or whatever.
In my case, I'm trying to access this site. All I want is to read the data in a table - it is easy to identify, given that it sits inside a div with a very particular ID. The table is also simple to read. Despite that, there is this dynamic-comp tag, which seems to be my stumbling block - I can access all elements outside of it, and no element inside it at all - be it by ID, class, tag name, whatever.
How do I handle this? Is this some kind of special IFrame? I'd have tried the .switchTo approach, but the dynamic-comp elements have no ID or class, just the tag alone.
EDIT: I also tried adding wait = WebDriverWait(driver,20), just in case.
Didn't work. My goal is to iterate through the dates using the date selector, so I intend to read the table multiple times.
The table you need is the last one inside #oReportCell. To get it you can use the (//td[@id='oReportCell']//table)[last()] xpath, or the #oReportCell table css selector and take the last match.
To get the table with requests and BeautifulSoup (@PedroLobito's solution proposal), hit the JSON endpoint behind the report and parse the rows; a pandas variant for collecting and saving the data is sketched after the output below:
import requests
from bs4 import BeautifulSoup

params = (
    ('path', 'conteudo/txcred/Reports/TaxasCredito-Consolidadas-porTaxasAnuais-Historico.rdl'),
    ('parametros', ''),
    ('exibeparametros', 'true'),
)
response = requests.get('https://www.bcb.gov.br/api/relatorio/pt-br/contaspub', params=params)
page = BeautifulSoup(response.json()['conteudo'], 'lxml')
table = page.select('#oReportCell table')[-1]
for tr in table.find_all('tr'):
    row_values = [td.text.strip() for td in tr.find_all('td')]
    print(row_values)
Output:
['', '', '', '']
['', '', 'Taxas de juros']
['Posição', 'Instituição', '% a.m.', '% a.a.']
['1', 'SINOSSERRA S/A - SCFI', '0,47', '5,76']
['2', 'GRAZZIOTIN FINANCIADORA SA CFI', '0,81', '10,13']
['3', 'BCO CATERPILLAR S.A.', '0,91', '11,44']
['4', 'BCO DE LAGE LANDEN BRASIL S.A.', '0,91', '11,54']
['5', 'BCO VOLKSWAGEN S.A', '0,93', '11,76']
['6', 'BCO KOMATSU S.A.', '1,02', '12,92']
['7', 'BCO SANTANDER (BRASIL) S.A.', '1,13', '14,43']
['8', 'BCO VOLVO BRASIL S.A.', '1,16', '14,80']
['9', 'BCO DO ESTADO DO RS S.A.', '1,32', '17,07']
['10', 'BV FINANCEIRA S.A. CFI', '1,39', '18,05']
['11', 'FINANC ALFA S.A. CFI', '1,42', '18,43']
['12', 'AYMORÉ CFI S.A.', '1,44', '18,75']
['13', 'BCO RIBEIRAO PRETO S.A.', '1,46', '19,05']
['14', 'BCO BRADESCO S.A.', '1,47', '19,15']
['15', 'TODESCREDI S/A - CFI', '1,72', '22,75']
['16', 'CAIXA ECONOMICA FEDERAL', '2,46', '33,84']
['17', 'SIMPALA S.A. CFI', '2,50', '34,48']
['18', 'LEBES FINANCEIRA CFI SA', '3,12', '44,60']
['19', 'BCO RENDIMENTO S.A.', '3,15', '45,06']
['20', 'BECKER FINANCEIRA SA - CFI', '3,52', '51,47']
['21', 'BCO DO BRASIL S.A.', '3,61', '53,08']
['22', 'BCO CETELEM S.A.', '3,70', '54,65']
['23', 'LECCA CFI S.A.', '3,87', '57,65']
['24', 'HS FINANCEIRA', '3,98', '59,65']
['25', 'CREDIARE CFI S.A.', '4,17', '63,32']
['26', 'KREDILIG S.A. - CFI', '4,42', '68,06']
['27', 'CENTROCRED S.A. CFI', '4,60', '71,61']
['28', 'SENFF S.A. - CFI', '4,79', '75,31']
['29', 'ZEMA CFI S/A', '4,81', '75,68']
['30', 'VIA CERTA FINANCIADORA S.A. - CFI', '5,32', '86,31']
['31', 'OMNI BANCO S.A.', '5,35', '86,93']
['32', 'OMNI SA CFI', '5,42', '88,47']
['33', 'LUIZACRED S.A. SCFI', '5,55', '91,16']
['34', 'BCO HONDA S.A.', '5,67', '93,89']
['35', 'BCO LOSANGO S.A.', '5,71', '94,70']
['36', 'BANCO SEMEAR', '6,00', '101,13']
['37', 'NEGRESCO S.A. - CFI', '6,24', '106,69']
['38', 'GAZINCRED S.A. SCFI', '6,60', '115,24']
['39', 'PORTOCRED S.A. - CFI', '7,03', '125,93']
['40', 'AGORACRED S/A SCFI', '7,27', '132,10']
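Since pandas was mentioned, here is the sketch promised above (an addition, not from the original answer) that lets pandas.read_html pull the same table out of the JSON payload; the output file name is arbitrary:
import io

import pandas as pd
import requests

params = {
    'path': 'conteudo/txcred/Reports/TaxasCredito-Consolidadas-porTaxasAnuais-Historico.rdl',
    'parametros': '',
    'exibeparametros': 'true',
}
response = requests.get('https://www.bcb.gov.br/api/relatorio/pt-br/contaspub', params=params)

# read_html parses every <table> in the markup; the ranking is the last one
df = pd.read_html(io.StringIO(response.json()['conteudo']))[-1]
df.to_csv('taxas_historico.csv', index=False)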
To locate and print the items from the Posição, Instituição, % a.m. and % a.a. columns, you need to induce WebDriverWait for visibility_of_all_elements_located(), and you can use the following Locator Strategy:
Using XPATH:
driver.get('https://www.bcb.gov.br/estatisticas/reporttxjuros?path=conteudo%2Ftxcred%2FReports%2FTaxasCredito-Consolidadas-porTaxasAnuais-Historico.rdl&nome=Hist%C3%B3rico%20Posterior%20a%2001%2F01%2F2012&exibeparametros=true')
print([my_elem.text for my_elem in WebDriverWait(driver, 60).until(EC.visibility_of_all_elements_located((By.XPATH, "//div[text()='Instituição']//following::tr[@valign='top']//td/div")))])
Note: You have to add the following imports:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
Console Output:
['1', 'SINOSSERRA S/A - SCFI', ' 0,47', ' 5,76', '2', 'GRAZZIOTIN FINANCIADORA SA CFI', ' 0,81', ' 10,13', '3', 'BCO CATERPILLAR S.A.', ' 0,91', ' 11,44', '4', 'BCO DE LAGE LANDEN BRASIL S.A.', ' 0,91', ' 11,54', '5', 'BCO VOLKSWAGEN S.A', ' 0,93', ' 11,76', '6', 'BCO KOMATSU S.A.', ' 1,02', ' 12,92', '7', 'BCO SANTANDER (BRASIL) S.A.', ' 1,13', ' 14,43', '8', 'BCO VOLVO BRASIL S.A.', ' 1,16', ' 14,80', '9', 'BCO DO ESTADO DO RS S.A.', ' 1,32', ' 17,07', '10', 'BV FINANCEIRA S.A. CFI', ' 1,39', ' 18,05', '11', 'FINANC ALFA S.A. CFI', ' 1,42', ' 18,43', '12', 'AYMORÉ CFI S.A.', ' 1,44', ' 18,75', '13', 'BCO RIBEIRAO PRETO S.A.', ' 1,46', ' 19,05', '14', 'BCO BRADESCO S.A.', ' 1,47', ' 19,15', '15', 'TODESCREDI S/A - CFI', ' 1,72', ' 22,75', '16', 'CAIXA ECONOMICA FEDERAL', ' 2,46', ' 33,84', '17', 'SIMPALA S.A. CFI', ' 2,50', ' 34,48', '18', 'LEBES FINANCEIRA CFI SA', ' 3,12', ' 44,60', '19', 'BCO RENDIMENTO S.A.', ' 3,15', ' 45,06', '20', 'BECKER FINANCEIRA SA - CFI', ' 3,52', ' 51,47', '21', 'BCO DO BRASIL S.A.', ' 3,61', ' 53,08', '22', 'BCO CETELEM S.A.', ' 3,70', ' 54,65', '23', 'LECCA CFI S.A.', ' 3,87', ' 57,65', '24', 'HS FINANCEIRA', ' 3,98', ' 59,65', '25', 'CREDIARE CFI S.A.', ' 4,17', ' 63,32', '26', 'KREDILIG S.A. - CFI', ' 4,42', ' 68,06', '27', 'CENTROCRED S.A. CFI', ' 4,60', ' 71,61', '28', 'SENFF S.A. - CFI', ' 4,79', ' 75,31', '29', 'ZEMA CFI S/A', ' 4,81', ' 75,68', '30', 'VIA CERTA FINANCIADORA S.A. - CFI', ' 5,32', ' 86,31', '31', 'OMNI BANCO S.A.', ' 5,35', ' 86,93', '32', 'OMNI SA CFI', ' 5,42', ' 88,47', '33', 'LUIZACRED S.A. SCFI', ' 5,55', ' 91,16', '34', 'BCO HONDA S.A.', ' 5,67', ' 93,89', '35', 'BCO LOSANGO S.A.', ' 5,71', ' 94,70', '36', 'BANCO SEMEAR', ' 6,00', ' 101,13', '37', 'NEGRESCO S.A. - CFI', ' 6,24', ' 106,69', '38', 'GAZINCRED S.A. SCFI', ' 6,60', ' 115,24', '39', 'PORTOCRED S.A. - CFI', ' 7,03', ' 125,93', '40', 'AGORACRED S/A SCFI', ' 7,27', ' 132,10']
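Whichever route you take, note that the rates use Brazilian decimal commas, so convert them before doing any arithmetic (a small helper, my addition):
def to_float(value: str) -> float:
    """Convert a Brazilian-formatted rate like ' 1,39' to 1.39."""
    return float(value.strip().replace(',', '.'))

assert to_float(' 0,47') == 0.47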

Why isn't the HTML I get from BeautifulSoup the same as the one I see when I inspect element?

I am making a username scraper and I really can't understand why the HTML is 'disappearing' when I parse it. Let's take this site for example:
http://www.lolking.net/leaderboards#/eune/1
See how there is a tbody and a bunch of tables in it?
Well, when I parse it and output it to the shell, the tbody is empty:
<div style="background: #333; box-shadow: 0 0 2px #000; padding: 10px;">
<table class="lktable" id="leaderboard_table" width="100%">
<thead>
<tr>
<th style="width: 80px;">
Rank
</th>
<th style="width: 80px;">
Change
</th>
<th style="width: 100px;">
Tier
</th>
<th>
Summoner
</th>
<th style="width: 150px;">
Top Champions
</th>
</tr>
</thead>
<tbody>
</tbody>
</table>
</div>
</div>
Why is this happening and how can I fix it?
This site needs JavaScript to work. JavaScript is used to populate the table by forming a web request, which probably points to a back-end API. This means that the "raw" HTML, without the effects of any JavaScript, has an empty table.
We can actually see this empty table in the background if we visit the site with JavaScript disabled.
BeautifulSoup doesn't cause this JavaScript to execute. Instead, have a look at some alternative libraries which do, such as the more advanced Selenium.
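A minimal sketch of the Selenium route (my addition; it assumes chromedriver is on your PATH and that the table keeps the leaderboard_table id shown above):
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

driver = webdriver.Chrome()
driver.get("http://www.lolking.net/leaderboards#/eune/1")

# wait until the JavaScript has actually populated the tbody before reading it
rows = WebDriverWait(driver, 20).until(
    EC.presence_of_all_elements_located(
        (By.CSS_SELECTOR, "#leaderboard_table tbody tr")))
for row in rows:
    print(row.text)
driver.quit()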
You can get all the data in json format. All you need to do is parse a value from a script tag inside the original page source and pass it to "http://www.lolking.net/leaderboards/some_value_here/eune/1.json":
from bs4 import BeautifulSoup
import requests
import re

patt = re.compile(r"\$\.get\('/leaderboards/(\w+)/")
js = "http://www.lolking.net/leaderboards/{}/eune/1.json"

soup = BeautifulSoup(requests.get("http://www.lolking.net/leaderboards#/eune/1").content)
script = soup.find("script", text=re.compile(r"\$\.get\('/leaderboards/"))
val = patt.search(script.text).group(1)
data = requests.get(js.format(val)).json()
data gives you json that contains all the player info like:
{'data': [{'division': '1',
           'global_ranking': '12',
           'league_points': '1217',
           'lks': '2961',
           'losses': '31',
           'most_played_champions': [{'assists': '238',
                                      'champion_id': '236',
                                      'creep_score': '7227',
                                      'deaths': '131',
                                      'kills': '288',
                                      'losses': '5',
                                      'played': '39',
                                      'wins': '34'},
                                     {'assists': '209',
                                      'champion_id': '429',
                                      'creep_score': '5454',
                                      'deaths': '111',
                                      'kills': '204',
                                      'losses': '3',
                                      'played': '27',
                                      'wins': '24'},
                                     {'assists': '155',
                                      'champion_id': '81',
                                      'creep_score': '4800',
                                      'deaths': '103',
                                      'kills': '168',
                                      'losses': '8',
                                      'played': '26',
                                      'wins': '18'}],
           'name': 'Sadastyczny',
           'previous_ranking': '2',
           'profile_icon_id': 7,
           'ranking': '1',
           'region': 'eune',
           'summoner_id': '42893043',
           'tier': '6',
           'tier_name': 'CHALLENGER',
           'wins': '128'},
          {'division': '1',
           'global_ranking': '30',
           'league_points': '1128',
           'lks': '2956',
           'losses': '180',
           'most_played_champions': [{'assists': '928',
                                      'champion_id': '24',
                                      'creep_score': '37601',
                                      'deaths': '1426',
                                      'kills': '1874',
                                      'losses': '64',
                                      'played': '210',
                                      'wins': '146'},
                                     {'assists': '501',
                                      'champion_id': '67',
                                      'creep_score': '16836',
                                      'deaths': '584',
                                      'kills': '662',
                                      'losses': '37',
                                      'played': '90',
                                      'wins': '53'},
                                     {'assists': '124',
                                      'champion_id': '157',
                                      'creep_score': '5058',
                                      'deaths': '205',
                                      'kills': '141',
                                      'losses': '14',
                                      'played': '28',
                                      'wins': '14'}],
           'name': 'Richor',
           'previous_ranking': '1',
           'profile_icon_id': 577,
           'ranking': '2',
           'region': 'eune',
           'summoner_id': '40385818',
           'tier': '6',
           'tier_name': 'CHALLENGER',
           'wins': '254'},
          {'division': '1',
           'global_ranking': '49',
           'league_points': '1051',
           'lks': '2953',
           'losses': '47',
           'most_played_champions': [{'assists': '638',
                                      'champion_id': '117',
                                      'creep_score': '11927',
                                      'deaths': '99',
                                      'kills': '199',
                                      'losses': '7',
                                      'played': '66',
                                      'wins': '59'},
                                     {'assists': '345',
                                      'champion_id': '48',
                                      'creep_score': '8061',
                                      'deaths': '99',
                                      'kills': '192',
                                      'losses': '11',
                                      'played': '43',
                                      'wins': '32'},
                                     {'assists': '161',
                                      'champion_id': '114',
                                      'creep_score': '5584',
                                      'deaths': '64',
                                      'kills': '165',
                                      'losses': '11',
                                      'played': '31',
                                      'wins': '20'}],
As you can see in Chrome Dev Tools, the site sends 2 XHR requests to get the data and displays it using JavaScript.
Since BeautifulSoup is just an HTML parser, it will not execute that JavaScript. You should use a tool like Selenium, which emulates a real browser.
But in this case you might be better off using the API they use to get the data. You can easily see which URLs they fetch from by looking in the Network tab: reload the page, select XHR, and use that info to create your own requests using something like Python Requests.

Creating list of defined number pattern

I want to create a list that produces the output of:
[001,002,003,004,005]
and keeps going until 300. Having the 0's in front of the digits is essential. I tried a method such as:
a = []
for i in range(0,3):
    for j in range(0,10):
        for k in range(0,10):
            a.append(i j k)
However, for obvious reasons, the append function does not behave in the manner I would like.
Do people have any suggestions on how else I could do this?
You cannot produce a list with integers that are presented with padding, no. You can produce strings with leading zeros:
a = [format(i + 1, '03d') for i in range(300)]
The format() function formats the integers to a field width of 3 characters, with leading zeros to pad out the length; that is what the '03d' format spec encodes.
Demo:
>>> [format(i + 1, '03d') for i in range(300)]
['001', '002', '003', '004', '005', '006', '007', '008', '009', '010', '011', '012', '013', '014', '015', '016', '017', '018', '019', '020', '021', '022', '023', '024', '025', '026', '027', '028', '029', '030', '031', '032', '033', '034', '035', '036', '037', '038', '039', '040', '041', '042', '043', '044', '045', '046', '047', '048', '049', '050', '051', '052', '053', '054', '055', '056', '057', '058', '059', '060', '061', '062', '063', '064', '065', '066', '067', '068', '069', '070', '071', '072', '073', '074', '075', '076', '077', '078', '079', '080', '081', '082', '083', '084', '085', '086', '087', '088', '089', '090', '091', '092', '093', '094', '095', '096', '097', '098', '099', '100', '101', '102', '103', '104', '105', '106', '107', '108', '109', '110', '111', '112', '113', '114', '115', '116', '117', '118', '119', '120', '121', '122', '123', '124', '125', '126', '127', '128', '129', '130', '131', '132', '133', '134', '135', '136', '137', '138', '139', '140', '141', '142', '143', '144', '145', '146', '147', '148', '149', '150', '151', '152', '153', '154', '155', '156', '157', '158', '159', '160', '161', '162', '163', '164', '165', '166', '167', '168', '169', '170', '171', '172', '173', '174', '175', '176', '177', '178', '179', '180', '181', '182', '183', '184', '185', '186', '187', '188', '189', '190', '191', '192', '193', '194', '195', '196', '197', '198', '199', '200', '201', '202', '203', '204', '205', '206', '207', '208', '209', '210', '211', '212', '213', '214', '215', '216', '217', '218', '219', '220', '221', '222', '223', '224', '225', '226', '227', '228', '229', '230', '231', '232', '233', '234', '235', '236', '237', '238', '239', '240', '241', '242', '243', '244', '245', '246', '247', '248', '249', '250', '251', '252', '253', '254', '255', '256', '257', '258', '259', '260', '261', '262', '263', '264', '265', '266', '267', '268', '269', '270', '271', '272', '273', '274', '275', '276', '277', '278', '279', '280', '281', '282', '283', '284', '285', '286', '287', '288', '289', '290', '291', '292', '293', '294', '295', '296', '297', '298', '299', '300']
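For reference, two equivalent spellings (same strings; the f-string form needs Python 3.6+):
a = [str(i).zfill(3) for i in range(1, 301)]  # pad with str.zfill
a = [f"{i:03d}" for i in range(1, 301)]       # or an f-string format spec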
You could subclass list and overload its __repr__ method to call str.zfill on each number:
class NumList(list):
    def __repr__(self):
        return '[' + ', '.join([str(x).zfill(3) for x in self]) + ']'
Demo:
>>> class NumList(list):
...     def __repr__(self):
...         return '[' + ', '.join([str(x).zfill(3) for x in self]) + ']'
...
>>> NumList([1, 2, 3, 4, 5])
[001, 002, 003, 004, 005]
>>>
To make the exact list you want, do NumList(range(1, 301)).
Note however that this does not make integers with leading zeros (as @MartijnPieters said, that is impossible). The list still holds plain integers; all this is doing is telling Python how to display them when they are output to the console.

HTML file parsing in Python

I have a very long HTML file that looks exactly like this - html file. I want to be able to parse the file such that I get the information in the form of a tuple.
Example:
<tr>
    <td>Cech</td>
    <td>Chelsea</td>
    <td>30</td>
    <td>£6.4</td>
</tr>
The above information will look like ("Cech", "Chelsea", 30, 6.4). However, if you look closely at the link I posted, the HTML example comes under a <h2>Goalkeepers</h2> tag. I need this tag too. So basically the result tuple will look like ("Cech", "Chelsea", 30, 6.4, "Goalkeepers"). Further down the file, a bunch of players come under <h2> tags of Midfielders, Defenders and Forwards.
I tried using the beautifulsoup and nltk libraries and got lost. So now I have the following code:
import nltk
from urllib import urlopen
url = "http://fantasy.premierleague.com/player-list/"
html = urlopen(url).read()
raw = nltk.clean_html(html)
print raw
which just strips the html file of all the tags and gives something like this:
Cech
Chelsea
30
£6.4
Although I can write a bad piece of code that reads every line and assigns it to a tuple, I cannot come up with any solution which can also incorporate the player position (the string present in the <h2> tags). Any solutions / suggestions will be greatly appreciated.
The reason I am inclined towards using tuples is so that I can use unpacking, and I plan on populating a MySQL table with the unpacked values.
from bs4 import BeautifulSoup
from pprint import pprint

soup = BeautifulSoup(html)
h2s = soup.select("h2")        # get all h2 elements
tables = soup.select("table")  # get all tables

first = True
title = ""
players = []
for i, table in enumerate(tables):
    if first:
        # every h2 element has 2 tables (8 tables, 4 h2s),
        # so for every 2 tables there is 1 h2
        title = h2s[int(i / 2)].text
    for tr in table.select("tr"):
        player = (title,)  # create a player
        for td in tr.select("td"):
            player = player + (td.text,)  # add the td info to the player
        if len(player) > 1:
            # only keep rows that actually contain a player,
            # not the bare (title,) tuple from header rows
            players.append(player)
    first = not first
pprint(players)
output:
[('Goalkeepers', 'Cech', 'Chelsea', '30', '£6.4'),
('Goalkeepers', 'Hart', 'Man City', '28', '£6.4'),
('Goalkeepers', 'Krul', 'Newcastle', '21', '£5.0'),
('Goalkeepers', 'Ruddy', 'Norwich', '25', '£5.0'),
('Goalkeepers', 'Vorm', 'Swansea', '19', '£5.0'),
('Goalkeepers', 'Stekelenburg', 'Fulham', '6', '£4.9'),
('Goalkeepers', 'Pantilimon', 'Man City', '0', '£4.9'),
('Goalkeepers', 'Lindegaard', 'Man Utd', '0', '£4.9'),
('Goalkeepers', 'Butland', 'Stoke City', '0', '£4.9'),
('Goalkeepers', 'Foster', 'West Brom', '13', '£4.9'),
('Goalkeepers', 'Viviano', 'Arsenal', '0', '£4.8'),
('Goalkeepers', 'Schwarzer', 'Chelsea', '0', '£4.7'),
('Goalkeepers', 'Boruc', 'Southampton', '42', '£4.7'),
('Goalkeepers', 'Myhill', 'West Brom', '15', '£4.5'),
('Goalkeepers', 'Fabianski', 'Arsenal', '0', '£4.4'),
('Goalkeepers', 'Gomes', 'Tottenham', '0', '£4.4'),
('Goalkeepers', 'Friedel', 'Tottenham', '0', '£4.4'),
('Goalkeepers', 'Henderson', 'West Ham', '0', '£4.0'),
('Defenders', 'Baines', 'Everton', '43', '£7.7'),
('Defenders', 'Vertonghen', 'Tottenham', '34', '£7.0'),
('Defenders', 'Taylor', 'Cardiff City', '14', '£4.5'),
('Defenders', 'Zverotic', 'Fulham', '0', '£4.5'),
('Defenders', 'Davies', 'Hull City', '28', '£4.5'),
('Defenders', 'Flanagan', 'Liverpool', '0', '£4.5'),
('Defenders', 'Dawson', 'West Brom', '0', '£3.9'),
('Defenders', 'Potts', 'West Ham', '0', '£3.9'),
('Defenders', 'Spence', 'West Ham', '0', '£3.9'),
('Midfielders', 'Özil', 'Arsenal', '24', '£10.6'),
('Midfielders', 'Redmond', 'Norwich', '20', '£5.0'),
('Midfielders', 'Mavrias', 'Sunderland', '5', '£5.0'),
('Midfielders', 'Gera', 'West Brom', '0', '£5.0'),
('Midfielders', 'Essien', 'Chelsea', '0', '£4.9'),
('Midfielders', 'Brown', 'West Brom', '0', '£4.3'),
('Forwards', 'van Persie', 'Man Utd', '24', '£13.9'),
('Forwards', 'Cornelius', 'Cardiff City', '1', '£5.4'),
('Forwards', 'Elmander', 'Norwich', '7', '£5.4'),
('Forwards', 'Murray', 'Crystal Palace', '0', '£5.3'),
('Forwards', 'Vydra', 'West Brom', '2', '£5.3'),
('Forwards', 'Proschwitz', 'Hull City', '0', '£4.3')]
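Since the goal was to unpack these tuples into a MySQL table, here is a sketch (my addition; the connection details and the players table schema are assumptions, not from the question):
import MySQLdb  # pip install mysqlclient

conn = MySQLdb.connect(host="localhost", user="user", passwd="secret", db="fantasy")
cur = conn.cursor()
# each 5-tuple from the scraper maps onto one row;
# note: points/price are still strings here, e.g. '30' and '£6.4'
cur.executemany(
    "INSERT INTO players (position, name, team, points, price) "
    "VALUES (%s, %s, %s, %s, %s)",
    players,
)
conn.commit()
conn.close()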
