Unable to locate element under dynamic-comp - python

There are dozens of questions here on SO with a title very similar to this one - but most of them seem related to some iframe, which prevents Selenium from accessing the intended tag, node, or whatever.
In my case, I'm trying to access this site. All I want is to read the data in a table - it is easy to identify, given that it sits inside a div with a very particular ID. The table is also simple to read. Despite that, there is this dynamic-comp tag, which seems to be my stumbling block - I can access every element outside of it and no element inside of it at all, be it by ID, class, tag name, whatever.
How do I handle this? Is this some kind of special iframe? I would have tried the .switchTo approach, but the dynamic-comp elements have no ID or class, just the tag alone.
EDIT: I also tried adding wait = WebDriverWait(driver,20), just in case.
Didn't work. My goal is to iterate through the dates using the date selector, so I intend to read the table multiple times.

The table you need is the last one inside #oReportCell. To get it you can use the (//td[@id='oReportCell']//table)[last()] XPath, or the #oReportCell table CSS selector and take the last match.
Here is how to get the table with requests and BeautifulSoup, following @PedroLobito's suggestion. You can also use pandas to collect and save the data (see the sketch after the output below):
import requests
from bs4 import BeautifulSoup

params = (
    ('path', 'conteudo/txcred/Reports/TaxasCredito-Consolidadas-porTaxasAnuais-Historico.rdl'),
    ('parametros', ''),
    ('exibeparametros', 'true'),
)
response = requests.get('https://www.bcb.gov.br/api/relatorio/pt-br/contaspub', params=params)
page = BeautifulSoup(response.json()['conteudo'], 'lxml')
table = page.select('#oReportCell table')[-1]
for tr in table.find_all('tr'):
    row_values = [td.text.strip() for td in tr.find_all('td')]
    print(row_values)
Output:
['', '', '', '']
['', '', 'Taxas de juros']
['Posição', 'Instituição', '% a.m.', '% a.a.']
['1', 'SINOSSERRA S/A - SCFI', '0,47', '5,76']
['2', 'GRAZZIOTIN FINANCIADORA SA CFI', '0,81', '10,13']
['3', 'BCO CATERPILLAR S.A.', '0,91', '11,44']
['4', 'BCO DE LAGE LANDEN BRASIL S.A.', '0,91', '11,54']
['5', 'BCO VOLKSWAGEN S.A', '0,93', '11,76']
['6', 'BCO KOMATSU S.A.', '1,02', '12,92']
['7', 'BCO SANTANDER (BRASIL) S.A.', '1,13', '14,43']
['8', 'BCO VOLVO BRASIL S.A.', '1,16', '14,80']
['9', 'BCO DO ESTADO DO RS S.A.', '1,32', '17,07']
['10', 'BV FINANCEIRA S.A. CFI', '1,39', '18,05']
['11', 'FINANC ALFA S.A. CFI', '1,42', '18,43']
['12', 'AYMORÉ CFI S.A.', '1,44', '18,75']
['13', 'BCO RIBEIRAO PRETO S.A.', '1,46', '19,05']
['14', 'BCO BRADESCO S.A.', '1,47', '19,15']
['15', 'TODESCREDI S/A - CFI', '1,72', '22,75']
['16', 'CAIXA ECONOMICA FEDERAL', '2,46', '33,84']
['17', 'SIMPALA S.A. CFI', '2,50', '34,48']
['18', 'LEBES FINANCEIRA CFI SA', '3,12', '44,60']
['19', 'BCO RENDIMENTO S.A.', '3,15', '45,06']
['20', 'BECKER FINANCEIRA SA - CFI', '3,52', '51,47']
['21', 'BCO DO BRASIL S.A.', '3,61', '53,08']
['22', 'BCO CETELEM S.A.', '3,70', '54,65']
['23', 'LECCA CFI S.A.', '3,87', '57,65']
['24', 'HS FINANCEIRA', '3,98', '59,65']
['25', 'CREDIARE CFI S.A.', '4,17', '63,32']
['26', 'KREDILIG S.A. - CFI', '4,42', '68,06']
['27', 'CENTROCRED S.A. CFI', '4,60', '71,61']
['28', 'SENFF S.A. - CFI', '4,79', '75,31']
['29', 'ZEMA CFI S/A', '4,81', '75,68']
['30', 'VIA CERTA FINANCIADORA S.A. - CFI', '5,32', '86,31']
['31', 'OMNI BANCO S.A.', '5,35', '86,93']
['32', 'OMNI SA CFI', '5,42', '88,47']
['33', 'LUIZACRED S.A. SCFI', '5,55', '91,16']
['34', 'BCO HONDA S.A.', '5,67', '93,89']
['35', 'BCO LOSANGO S.A.', '5,71', '94,70']
['36', 'BANCO SEMEAR', '6,00', '101,13']
['37', 'NEGRESCO S.A. - CFI', '6,24', '106,69']
['38', 'GAZINCRED S.A. SCFI', '6,60', '115,24']
['39', 'PORTOCRED S.A. - CFI', '7,03', '125,93']
['40', 'AGORACRED S/A SCFI', '7,27', '132,10']
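As mentioned above, pandas can be used to collect and save this data. A minimal sketch, assuming the table element selected earlier; the output filename is just an example:
import pandas as pd

# Collect the parsed rows and save them with pandas.
# rows[2] is the header row ('Posição', 'Instituição', '% a.m.', '% a.a.')
# and the data rows start at rows[3], as in the output above.
rows = [[td.text.strip() for td in tr.find_all('td')] for tr in table.find_all('tr')]
df = pd.DataFrame(rows[3:], columns=rows[2])
df.to_csv('taxas_txcred.csv', index=False)  # hypothetical output filename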

To locate and print the items from the Posição, Instituição, % a.m. and % a.a. columns you need to induce WebDriverWait for visibility_of_all_elements_located() and you can use the following locator strategy:
Using XPATH:
driver.get('https://www.bcb.gov.br/estatisticas/reporttxjuros?path=conteudo%2Ftxcred%2FReports%2FTaxasCredito-Consolidadas-porTaxasAnuais-Historico.rdl&nome=Hist%C3%B3rico%20Posterior%20a%2001%2F01%2F2012&exibeparametros=true')
print([my_elem.text for my_elem in WebDriverWait(driver, 60).until(EC.visibility_of_all_elements_located((By.XPATH, "//div[text()='Instituição']//following::tr[@valign='top']//td/div")))])
Note: you have to add the following imports:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
Console Output:
['1', 'SINOSSERRA S/A - SCFI', ' 0,47', ' 5,76', '2', 'GRAZZIOTIN FINANCIADORA SA CFI', ' 0,81', ' 10,13', '3', 'BCO CATERPILLAR S.A.', ' 0,91', ' 11,44', '4', 'BCO DE LAGE LANDEN BRASIL S.A.', ' 0,91', ' 11,54', '5', 'BCO VOLKSWAGEN S.A', ' 0,93', ' 11,76', '6', 'BCO KOMATSU S.A.', ' 1,02', ' 12,92', '7', 'BCO SANTANDER (BRASIL) S.A.', ' 1,13', ' 14,43', '8', 'BCO VOLVO BRASIL S.A.', ' 1,16', ' 14,80', '9', 'BCO DO ESTADO DO RS S.A.', ' 1,32', ' 17,07', '10', 'BV FINANCEIRA S.A. CFI', ' 1,39', ' 18,05', '11', 'FINANC ALFA S.A. CFI', ' 1,42', ' 18,43', '12', 'AYMORÉ CFI S.A.', ' 1,44', ' 18,75', '13', 'BCO RIBEIRAO PRETO S.A.', ' 1,46', ' 19,05', '14', 'BCO BRADESCO S.A.', ' 1,47', ' 19,15', '15', 'TODESCREDI S/A - CFI', ' 1,72', ' 22,75', '16', 'CAIXA ECONOMICA FEDERAL', ' 2,46', ' 33,84', '17', 'SIMPALA S.A. CFI', ' 2,50', ' 34,48', '18', 'LEBES FINANCEIRA CFI SA', ' 3,12', ' 44,60', '19', 'BCO RENDIMENTO S.A.', ' 3,15', ' 45,06', '20', 'BECKER FINANCEIRA SA - CFI', ' 3,52', ' 51,47', '21', 'BCO DO BRASIL S.A.', ' 3,61', ' 53,08', '22', 'BCO CETELEM S.A.', ' 3,70', ' 54,65', '23', 'LECCA CFI S.A.', ' 3,87', ' 57,65', '24', 'HS FINANCEIRA', ' 3,98', ' 59,65', '25', 'CREDIARE CFI S.A.', ' 4,17', ' 63,32', '26', 'KREDILIG S.A. - CFI', ' 4,42', ' 68,06', '27', 'CENTROCRED S.A. CFI', ' 4,60', ' 71,61', '28', 'SENFF S.A. - CFI', ' 4,79', ' 75,31', '29', 'ZEMA CFI S/A', ' 4,81', ' 75,68', '30', 'VIA CERTA FINANCIADORA S.A. - CFI', ' 5,32', ' 86,31', '31', 'OMNI BANCO S.A.', ' 5,35', ' 86,93', '32', 'OMNI SA CFI', ' 5,42', ' 88,47', '33', 'LUIZACRED S.A. SCFI', ' 5,55', ' 91,16', '34', 'BCO HONDA S.A.', ' 5,67', ' 93,89', '35', 'BCO LOSANGO S.A.', ' 5,71', ' 94,70', '36', 'BANCO SEMEAR', ' 6,00', ' 101,13', '37', 'NEGRESCO S.A. - CFI', ' 6,24', ' 106,69', '38', 'GAZINCRED S.A. SCFI', ' 6,60', ' 115,24', '39', 'PORTOCRED S.A. - CFI', ' 7,03', ' 125,93', '40', 'AGORACRED S/A SCFI', ' 7,27', ' 132,10']
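Since the locator returns one flat list of cell texts, you may want to regroup it into rows of four columns (Posição, Instituição, % a.m., % a.a.). A minimal sketch of that, assuming the same driver and locator as above:
cells = [my_elem.text for my_elem in WebDriverWait(driver, 60).until(
    EC.visibility_of_all_elements_located((By.XPATH, "//div[text()='Instituição']//following::tr[@valign='top']//td/div")))]
# regroup the flat list into rows of four columns
rows = [cells[i:i + 4] for i in range(0, len(cells), 4)]
for row in rows:
    print(row)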

Related

Add numbers of a column within an array based of value from another column

I asked this question once before but was very inconsistent in my wording. Here is my full code. I have a dataArray and wish to add up the numbers in the 5th column, but only for rows where the 7th column has a 0.
#!/usr/bin/python
#Date: 4.24.18
#importing necessary modules
import csv
import collections
import sys
import urllib2  # needed for urlopen below
from array import array
#variables for ease of use in script
fileName = 'medicaldata.tsv'
filePath = '/home/pjvaglic/Desktop/scripts/pythonScripts/final/data/'
dataURL = 'http://pages.mtu.edu/~toarney/sat3310/final/'
dataArray = []
sumBeds = 0
count = 0
countFac = 0
sumNSal = 0
sumNSalR = 0
#download file from MTU
downloadFile = urllib2.urlopen(dataURL + fileName)
#write the downloaded file to disk
with open(filePath + fileName, 'w') as output:
    output.write(downloadFile.read())
    output.close()  # redundant: the with block already closes the file
#count number of lines in the data file, take off the header, print results to screen
count = open(filePath + fileName).readlines()
print "There are", len(count)-1, "facilities accounted for in", filePath + fileName
#keep track of number of facilities
countFac = len(count)-1
#open data file, put everything in an array, cut everything at the tab delimiter
with open(filePath + fileName, 'rt') as inputfile:
    next(inputfile)
    dataArray = csv.reader(inputfile, delimiter='\t')
    #sum the number of beds in the first column
    for row in dataArray:
        sumBeds += int(row[0])
print "There are ", sumBeds, "in the medical file."
print "There are about", sumBeds/countFac, "beds per facility."
#this line does not work for my purposes.
#list = [[row[4] for row in dataArray if row[6] == '1']]
#print list
Here is the dataArray. The last column has 0's and 1's (I believe they are strings). For example, the first row has a 0, so I want to take 5230 and add it to 6304 and then 6590, and so forth. Just rows that have a 0 in the last column.
['244', '128', '385', '23521', '5230', '5334', '0']
['59', '155', '203', '9160', '2459', '493', '1']
['120', '281', '392', '21900', '6304', '6115', '0']
['120', '291', '419', '22354', '6590', '6346', '0']
['120', '238', '363', '17421', '5362', '6225', '0']
['65', '180', '234', '10531', '3622', '449', '1']
['120', '306', '372', '22147', '4406', '4998', '1']
['90', '214', '305', '14025', '4173', '966', '1']
['96', '155', '169', '8812', '1955', '1260', '0']
['120', '133', '188', '11729', '3224', '6442', '1']
['62', '148', '192', '8896', '2409', '1236', '0']
['120', '274', '426', '20987', '2066', '3360', '1']
['116', '154', '321', '17655', '5946', '4231', '0']
['59', '120', '164', '7085', '1925', '1280', '1']
['80', '261', '284', '13089', '4166', '1123', '1']
['120', '338', '375', '21453', '5257', '5206', '1']
['80', '77', '133', '7790', '1988', '4443', '1']
['100', '204', '318', '18309', '4156', '4585', '1']
['60', '97', '213', '8872', '1914', '1675', '1']
['110', '178', '280', '17881', '5173', '5686', '1']
['120', '232', '336', '17004', '4630', '907', '0']
['135', '316', '442', '23829', '7489', '3351', '0']
['59', '163', '191', '9424', '2051', '1756', '1']
['60', '96', '202', '12474', '3803', '2123', '0']
['25', '74', '83', '4078', '2008', '4531', '1']
['221', '514', '776', '36029', '1288', '2543', '1']
['64', '91', '214', '8782', '4729', '4446', '1']
['62', '146', '204', '8951', '2367', '1064', '0']
['108', '255', '366', '17446', '5933', '2987', '1']
['62', '144', '220', '6164', '2782', '411', '1']
['90', '151', '286', '2853', '4651', '4197', '0']
['146', '100', '375', '21334', '6857', '1198', '0']
['62', '174', '189', '8082', '2143', '1209', '1']
['30', '54', '88', '3948', '3025', '137', '1']
['79', '213', '278', '11649', '2905', '1279', '0']
['44', '127', '158', '7850', '1498', '1273', '1']
['120', '208', '423', '29035', '6236', '3524', '0']
['100', '255', '300', '17532', '3547', '2561', '1']
['49', '110', '177', '8197', '2810', '3874', '1']
['123', '208', '336', '22555', '6059', '6402', '1']
['82', '114', '136', '8459', '1995', '1911', '1']
['58', '166', '205', '10412', '2245', '1122', '1']
['110', '228', '323', '16661', '4029', '3893', '1']
['62', '183', '222', '12406', '2784', '2212', '1']
['86', '62', '200', '11312', '3720', '2959', '1']
['102', '326', '355', '14499', '3866', '3006', '1']
['135', '157', '471', '24274', '7485', '1344', '0']
['78', '154', '203', '9327', '3672', '1242', '1']
['83', '224', '390', '12362', '3995', '1484', '1']
['60', '48', '213', '10644', '2820', '1154', '0']
['54', '119', '144', '7556', '2088', '245', '1']
['120', '217', '327', '20182', '4432', '6274', '0']
I know there is a shorthand way of placing all those numbers in a list and using a sum function to add them up. I'm just not sure how to go about it.
There are 2 ways. Below I use only an extract of your data.
Setup
We assume you begin with a list of lists of strings.
lst = [['244', '128', '385', '23521', '5230', '5334', '0'],
['59', '155', '203', '9160', '2459', '493', '1'],
['120', '281', '392', '21900', '6304', '6115', '0'],
['120', '291', '419', '22354', '6590', '6346', '0'],
['120', '238', '363', '17421', '5362', '6225', '0'],
['65', '180', '234', '10531', '3622', '449', '1'],
['120', '306', '372', '22147', '4406', '4998', '1'],
['90', '214', '305', '14025', '4173', '966', '1'],
['96', '155', '169', '8812', '1955', '1260', '0']]
Pure Python
A = [[int(i) for i in row] for row in lst]
res = sum(row[4] for row in A if row[6] == 0)
# 25441
Vectorised solution
You can use a 3rd party library such as numpy:
import numpy as np
A = np.array(lst, dtype=int)
res = A[np.where(A[:, 6] == 0), 4].sum()
# 25441
Turn your data file into an array of arrays.
['244', '128', '385', '23521', '5230', '5334', '0']
['59', '155', '203', '9160', '2459', '493', '1']
['120', '281', '392', '21900', '6304', '6115', '0']
Instead:
[['244', '128', '385', '23521', '5230', '5334', '0'],
['59', '155', '203', '9160', '2459', '493', '1'],
['120', '281', '392', '21900', '6304', '6115', '0']]
Then iterate over the elements in the array of arrays, looking for the string '0' in the last column and adding element [i][4] to your running total. You'll need to convert the strings to numbers before adding them, though, otherwise you'll end up with one long concatenated string instead of a sum.
var sum = 0;
for (var i = 0; i < dataArray.length; i++) {
    if (dataArray[i][6] === '0') {
        sum += Number(dataArray[i][4]);
    }
}
At the end of the loop you'll have your total in sum and can do with it as you please.
Just realized you're working in Python and my answer is in JavaScript - whoops. It might not be the best answer, but the Python version of the same loop (see the sketch below) should get you on the right track. Cheers.
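A rough Python translation of the loop above (my own sketch, assuming dataArray is a list of rows like the ones shown in the question, with the value in column index 4 and the 0/1 flag in the last column):
total = 0
for row in dataArray:
    # only rows whose last column is the string '0'
    if row[6] == '0':
        total += int(row[4])
print(total)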

Having trouble with beautifulsoup in python

I am very new to Python and am having trouble with the code below. I am trying to get either the temperature or the date from the website, but can't seem to get any output. I have tried many variations but still can't get it right.
Thank you for your help!
#Code below:
import requests,bs4
r = requests.get('http://www.hko.gov.hk/contente.htm')
print r.raise_for_status()
hkweather = bs4.BeautifulSoup(r.text)
print hkweather.select('div left_content fnd_day fnd_date')
Your CSS selector is incorrect: you should put a . between the tag and the CSS class. The divs you want have the fnd_day class and sit inside the div with the id fnd_content:
divs = soup.select("#fnd_content div.fnd_day")
But that still won't get the data, as it is dynamically generated through an ajax request. You can get all the data in JSON format using the code below:
u = "http://www.hko.gov.hk/wxinfo/json/one_json.xml?_=1468955579991"
data = requests.get(u).json()
from pprint import pprint as pp
pp(data)
That returns pretty much all the dynamic content including the dates and temps etc..
If you access the key F9D, you can see the general weather description, all the temps and the dates:
from pprint import pprint as pp
pp(data['F9D'])
Output:
{'BulletinDate': '20160720',
'BulletinTime': '0315',
'GeneralSituation': 'A southwesterly airstream will bring showers to the '
'coast of Guangdong today. Under the dominance of an '
'upper-air anticyclone, it will be generally fine and '
'very hot over southern China in the latter part of this '
'week and early next week.',
'NPTemp': '25',
'WeatherForecast': [{'ForecastDate': '20160720',
'ForecastIcon': 'pic53.png',
'ForecastMaxrh': '95',
'ForecastMaxtemp': '32',
'ForecastMinrh': '70',
'ForecastMintemp': '26',
'ForecastWeather': 'Sunny periods and a few showers. '
'Isolated squally thunderstorms at '
'first.',
'ForecastWind': 'South to southwest force 4.',
'IconDesc': 'Sunny Periods with A Few Showers',
'WeekDay': '3'},
{'ForecastDate': '20160721',
'ForecastIcon': 'pic90.png',
'ForecastMaxrh': '90',
'ForecastMaxtemp': '33',
'ForecastMinrh': '65',
'ForecastMintemp': '28',
'ForecastWeather': 'Mainly fine and very hot apart from '
'isolated showers in the morning.',
'ForecastWind': 'South to southwest force 3 to 4.',
'IconDesc': 'Hot',
'WeekDay': '4'},
{'ForecastDate': '20160722',
'ForecastIcon': 'pic90.png',
'ForecastMaxrh': '90',
'ForecastMaxtemp': '33',
'ForecastMinrh': '65',
'ForecastMintemp': '28',
'ForecastWeather': 'Mainly fine and very hot apart from '
'isolated showers in the morning.',
'ForecastWind': 'Southwest force 3.',
'IconDesc': 'Hot',
'WeekDay': '5'},
{'ForecastDate': '20160723',
'ForecastIcon': 'pic90.png',
'ForecastMaxrh': '90',
'ForecastMaxtemp': '34',
'ForecastMinrh': '65',
'ForecastMintemp': '29',
'ForecastWeather': 'Fine and very hot.',
'ForecastWind': 'Southwest force 3.',
'IconDesc': 'Hot',
'WeekDay': '6'},
{'ForecastDate': '20160724',
'ForecastIcon': 'pic90.png',
'ForecastMaxrh': '90',
'ForecastMaxtemp': '34',
'ForecastMinrh': '65',
'ForecastMintemp': '29',
'ForecastWeather': 'Fine and very hot.',
'ForecastWind': 'Southwest force 3.',
'IconDesc': 'Hot',
'WeekDay': '0'},
{'ForecastDate': '20160725',
'ForecastIcon': 'pic90.png',
'ForecastMaxrh': '90',
'ForecastMaxtemp': '33',
'ForecastMinrh': '65',
'ForecastMintemp': '29',
'ForecastWeather': 'Mainly fine and very hot apart from '
'isolated showers in the morning.',
'ForecastWind': 'South to southwest force 3.',
'IconDesc': 'Hot',
'WeekDay': '1'},
{'ForecastDate': '20160726',
'ForecastIcon': 'pic90.png',
'ForecastMaxrh': '90',
'ForecastMaxtemp': '33',
'ForecastMinrh': '65',
'ForecastMintemp': '29',
'ForecastWeather': 'Mainly fine and very hot apart from '
'isolated showers in the morning.',
'ForecastWind': 'South to southwest force 3.',
'IconDesc': 'Hot',
'WeekDay': '2'},
{'ForecastDate': '20160727',
'ForecastIcon': 'pic90.png',
'ForecastMaxrh': '90',
'ForecastMaxtemp': '33',
'ForecastMinrh': '65',
'ForecastMintemp': '28',
'ForecastWeather': 'Mainly fine and very hot apart from '
'isolated showers in the morning.',
'ForecastWind': 'Southwest force 3 to 4.',
'IconDesc': 'Hot',
'WeekDay': '3'},
{'ForecastDate': '20160728',
'ForecastIcon': 'pic90.png',
'ForecastMaxrh': '90',
'ForecastMaxtemp': '33',
'ForecastMinrh': '65',
'ForecastMintemp': '28',
'ForecastWeather': 'Mainly fine and very hot apart from '
'isolated showers in the morning.',
'ForecastWind': 'Southwest force 3 to 4.',
'IconDesc': 'Hot',
'WeekDay': '4'}]}
The only query string parameter is the epoch timestamp which you can generate using the time lib:
from time import time
u = "http://www.hko.gov.hk/wxinfo/json/one_json.xml?_={}".format(int(time()))
data = requests.get(u).json()
Not passing the timestamp also returns the same data so I will leave you to investigate the significance.
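To pull just the dates and temperatures the question asks about, you can walk the WeatherForecast list shown above; a minimal sketch, assuming the JSON structure printed earlier:
for day in data['F9D']['WeatherForecast']:
    # e.g. 20160720 26 - 32
    print("{} {} - {}".format(day['ForecastDate'], day['ForecastMintemp'], day['ForecastMaxtemp']))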
I was able to get the dates:
>>> import requests,bs4
>>> r = requests.get('http://www.hko.gov.hk/contente.htm')
>>> hkweather = bs4.BeautifulSoup(r.text)
>>> print hkweather.select('div[class="fnd_date"]')
# [<div class="fnd_date"></div>, <div class="fnd_date"></div>, <div class="fnd_date"></div>, <div class="fnd_date"></div>, <div class="fnd_date"></div>, <div class="fnd_date"></div>, <div class="fnd_date"></div>, <div class="fnd_date"></div>, <div class="fnd_date"></div>]
But the text is missing. This doesn't seem like a problem with BeautifulSoup because I looked through r.text myself and all I see is <div class="fnd_date"></div> instead of anything like <div class="fnd_date">July 20</div>.
You can check that the text isn't there using regex (although using regex with HTML is frowned upon):
>>> import re
>>> re.findall(r'<[^<>]*fnd_date[^<>]*>[^>]*>', r.text)
# [u'<div id="fnd_date" class="date"></div>', ... repeated 10 times]

Iteration & Casting index values to integer and float in nested lists

I'm having difficulty iterating through the nested list table below. I understand how to iterate through the table once, but I'm stuck on the correct syntax for going a level deeper and iterating through each nested list. In iterating through the sublists, I am trying to cast each 'age' and 'years experience' to an integer, perform the operation 'age' - 'years experience', and append the value (as a string) to each sublist.
table = [
['first_name', 'last_name', 'age', 'years experience', 'salary'],
['James', 'Butt', '29', '8', '887174.4'],
['Josephine', 'Darakjy', '59', '39', '1051267.9'],
['Art', 'Venere', '22', '2', '47104.2'],
['Lenna', 'Paprocki', '33', '7', '343240.2'],
['Donette', 'Foller', '26', '2', '273541.4'],
['Simona', 'Morasca', '35', '15', '960967.0'],
['Mitsue', 'Tollner', '51', '31', '162776.7'],
['Leota', 'Dilliard', '64', '39', '464595.5'],
['Sage', 'Wieser', '27', '9', '819519.7'],
['Kris', 'Marrier', '59', '33', '327505.55000000005'],
['Minna', 'Amigon', '45', '23', '571227.05'],
['Abel', 'Maclead', '46', '23', '247927.25'],
['Kiley', 'Caldarera', '33', '7', '179182.8'],
['Graciela', 'Ruta', '48', '21', '136978.95'],
['Cammy', 'Albares', '29', '9', '1016378.95'],
['Mattie', 'Poquette', '39', '15', '86458.75'],
['Meaghan', 'Garufi', '21', '3', '260256.5'],
['Gladys', 'Rim', '52', '26', '827390.5'],
['Yuki', 'Whobrey', '32', '10', '652737.0'],
['Fletcher', 'Flosi', '59', '37', '954975.15']]
##Exercise 3 (rows as lists): Iterate over each row and append the following values:
#If it is the first row then extend it with the following ['Started Working', 'Salary / Experience']
#Start work age (age - years experience)
#Salary / Experience ratio = (salary / divided by experience)
for i, v in enumerate(table):
    extension = ['Started Working', 'Salary/Experience']
    if i == 0:
        v.extend(extension)
    print(i, v) #test to print out the index and nested list values
    #for index, value in enumerate(v):
    #    age =
    #    exp =
    #    start_work = age - exp
    #    print(index, value) #test to print out the index and each value in the nested list
Pass the start argument to enumerate - enumerate(table, 1) in your case:
table = [['first_name', 'last_name', 'age', 'years experience', 'salary'],
         ['James', 'Butt', '29', '8', '887174.4'],
         ['Josephine', 'Darakjy', '59', '39', '1051267.9'],
         ['Art', 'Venere', '22', '2', '47104.2']]

table[0].extend(['Started Working', 'Salary/Experience'])
for idx, row in enumerate(table[1:], 1):
    start_work_age = int(row[2]) - int(row[3])
    ratio = float(row[4]) / int(row[3])
    table[idx].extend([str(start_work_age), str(ratio)])
print(table)
# Output
[['first_name', 'last_name', 'age', 'years experience', 'salary', 'Started Working', 'Salary/Experience'],
['James', 'Butt', '29', '8', '887174.4', '21', '110896.8'],
['Josephine', 'Darakjy', '59', '39', '1051267.9', '20', '26955.5871795'],
['Art', 'Venere', '22', '2', '47104.2', '20', '23552.1']]
If you can convert the space to an underscore in years experience, you can use collections.namedtuple to make your life simpler:
from collections import namedtuple
table = [
['first_name', 'last_name', 'age', 'years_experience', 'salary'],
['James', 'Butt', '29', '8', '887174.4'],
['Josephine', 'Darakjy', '59', '39', '1051267.9'],
['Art', 'Venere', '22', '2', '47104.2'],
# ...
]
workerv1 = namedtuple('workerv1', ','.join(table[0]))
for i, v in enumerate(table):
    worker = workerv1(*v)
    if i == 0:
        swage = 'Started Working'
        sex_ratio = 'S/Ex ratio'
    else:
        swage = int(worker.age) - int(worker.years_experience)
        sex_ratio = float(worker.salary) / float(worker.years_experience)
    print("{w.first_name},{w.last_name},{w.age},{w.years_experience},{w.salary},{0},{1}".format(
        swage, sex_ratio, w=worker))

HTML file parsing in Python

I have a very long HTML file that looks exactly like this - html file. I want to be able to parse the file so that I get the information in the form of a tuple.
Example:
<tr>
<td>Cech</td>
<td>Chelsea</td>
<td>30</td>
<td>£6.4</td>
</tr>
The above information should become ("Cech", "Chelsea", 30, 6.4). However, if you look closely at the link I posted, the HTML example above comes under a <h2>Goalkeepers</h2> tag. I need this tag too, so the resulting tuple should look like ("Cech", "Chelsea", 30, 6.4, Goalkeepers). Further down the file, other players come under <h2> tags for Midfielders, Defenders and Forwards.
I tried using the beautifulsoup and nltk libraries and got lost. So now I have the following code:
import nltk
from urllib import urlopen
url = "http://fantasy.premierleague.com/player-list/"
html = urlopen(url).read()
raw = nltk.clean_html(html)
print raw
which just strips the HTML file of all its tags and gives something like this:
Cech
Chelsea
30
£6.4
Although I can write a bad piece of code that reads every line and assigns it to a tuple, I cannot come up with any solution that also incorporates the player position (the string in the <h2> tags). Any solutions / suggestions will be greatly appreciated.
The reason I am inclined towards using tuples is so that I can use unpacking, and I plan on populating a MySQL table with the unpacked values.
from bs4 import BeautifulSoup
from pprint import pprint

soup = BeautifulSoup(html)
h2s = soup.select("h2")        #get all h2 elements
tables = soup.select("table")  #get all tables
first = True
title = ""
players = []
for i, table in enumerate(tables):
    if first:
        #every h2 element has 2 tables. table size = 8, h2 size = 4,
        #so for every 2 tables there is 1 h2
        title = h2s[int(i/2)].text
    for tr in table.select("tr"):
        player = (title,)  #create a player
        for td in tr.select("td"):
            player = player + (td.text,)  #add td info to the player
        if len(player) > 1:
            #if the tr contains a player (and is not only the title), add it
            players.append(player)
    first = not first
pprint(players)
Output:
[('Goalkeepers', 'Cech', 'Chelsea', '30', '£6.4'),
('Goalkeepers', 'Hart', 'Man City', '28', '£6.4'),
('Goalkeepers', 'Krul', 'Newcastle', '21', '£5.0'),
('Goalkeepers', 'Ruddy', 'Norwich', '25', '£5.0'),
('Goalkeepers', 'Vorm', 'Swansea', '19', '£5.0'),
('Goalkeepers', 'Stekelenburg', 'Fulham', '6', '£4.9'),
('Goalkeepers', 'Pantilimon', 'Man City', '0', '£4.9'),
('Goalkeepers', 'Lindegaard', 'Man Utd', '0', '£4.9'),
('Goalkeepers', 'Butland', 'Stoke City', '0', '£4.9'),
('Goalkeepers', 'Foster', 'West Brom', '13', '£4.9'),
('Goalkeepers', 'Viviano', 'Arsenal', '0', '£4.8'),
('Goalkeepers', 'Schwarzer', 'Chelsea', '0', '£4.7'),
('Goalkeepers', 'Boruc', 'Southampton', '42', '£4.7'),
('Goalkeepers', 'Myhill', 'West Brom', '15', '£4.5'),
('Goalkeepers', 'Fabianski', 'Arsenal', '0', '£4.4'),
('Goalkeepers', 'Gomes', 'Tottenham', '0', '£4.4'),
('Goalkeepers', 'Friedel', 'Tottenham', '0', '£4.4'),
('Goalkeepers', 'Henderson', 'West Ham', '0', '£4.0'),
('Defenders', 'Baines', 'Everton', '43', '£7.7'),
('Defenders', 'Vertonghen', 'Tottenham', '34', '£7.0'),
('Defenders', 'Taylor', 'Cardiff City', '14', '£4.5'),
('Defenders', 'Zverotic', 'Fulham', '0', '£4.5'),
('Defenders', 'Davies', 'Hull City', '28', '£4.5'),
('Defenders', 'Flanagan', 'Liverpool', '0', '£4.5'),
('Defenders', 'Dawson', 'West Brom', '0', '£3.9'),
('Defenders', 'Potts', 'West Ham', '0', '£3.9'),
('Defenders', 'Spence', 'West Ham', '0', '£3.9'),
('Midfielders', 'Özil', 'Arsenal', '24', '£10.6'),
('Midfielders', 'Redmond', 'Norwich', '20', '£5.0'),
('Midfielders', 'Mavrias', 'Sunderland', '5', '£5.0'),
('Midfielders', 'Gera', 'West Brom', '0', '£5.0'),
('Midfielders', 'Essien', 'Chelsea', '0', '£4.9'),
('Midfielders', 'Brown', 'West Brom', '0', '£4.3'),
('Forwards', 'van Persie', 'Man Utd', '24', '£13.9'),
('Forwards', 'Cornelius', 'Cardiff City', '1', '£5.4'),
('Forwards', 'Elmander', 'Norwich', '7', '£5.4'),
('Forwards', 'Murray', 'Crystal Palace', '0', '£5.3'),
('Forwards', 'Vydra', 'West Brom', '2', '£5.3'),
('Forwards', 'Proschwitz', 'Hull City', '0', '£4.3')]
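Since the tuples are built with a MySQL table in mind, here is a minimal sketch of loading them, assuming the mysql.connector driver, hypothetical connection parameters, and a hypothetical players table with matching columns:
import mysql.connector

# hypothetical connection parameters and table name
conn = mysql.connector.connect(user='user', password='secret', database='fantasy')
cur = conn.cursor()
cur.executemany(
    "INSERT INTO players (position, name, team, points, price) VALUES (%s, %s, %s, %s, %s)",
    players)  # each tuple is (position, name, team, points, price)
conn.commit()
conn.close()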

How to find double entries and mutate them to a key

My script is this:
import csv

with open('lees.csv', 'rU') as naver:
    reader = csv.DictReader(naver)
    for alist in reader:
        name = alist["naam"]
        polisnumber = alist["polisnr"]
        riskadr = alist["risico adr"]
        insurencecode = alist["branchecode"]
        relationnumber = alist["rel"]
        header = alist["aanhef"]
        tav = alist["tav"]
        thelist = [name, riskadr, polisnumber,
                   relationnumber, insurencecode, header, tav]
The output of the script is:
['Cautus B.V.', 'plein 92', '1129008', '10', 'AVB', 'Geachte mevrouw Daa', 'Mevrouw C.P. Daa']
['Cautus B.V.', 'Wei 9-11', '1019123', '10', 'AVB', 'Geachte mevrouw Daa', 'Mevrouw C.P. Daa']
['Cautus B.V.', 'plein 92', '1129008', '10', 'BEDR', 'Geachte mevrouw Daa', 'Mevrouw C.P. Daa']
['Cautus B.V.', 'Wei 9-11', '1019123', '10', 'BEDR', 'Geachte mevrouw Daa', 'Mevrouw C.P. Daa']
['De company', 'tiellaan 42', 'KD0022232', '13', 'AVB', 'Geachte heer Tigch', 'De heer I. Tigch']
['De company', 'tiellaan 42', 'KD0022232', '13', 'DAS', 'Geachte heer Tigch', 'De heer I. Tigch']
['Slever ', 'klopt 42', 'KD2220115', '17', 'AVB', 'Geachte heer Slever', 'De heer T.Slever']
As you can see, I build a list for each row of the .csv file.
My problem is that I need a script that filters the duplicates in riskadr (Wei 9-11 / plein 92 / tiellaan 42) and adds the insurencecode (AVB/BEDR/DAS, etc.) of the second duplicate riskadr to the first one, in a new list together with the other entries.
So now we have two entries with the same riskadr, like this:
['De company', 'tiellaan 42', 'KD0022232', '13', 'AVB', 'Geachte heer Tigch', 'De heer I. Tigch']
['De company', 'tiellaan 42', 'KD0022232', '13', 'DAS', 'Geachte heer Tigch', 'De heer I. Tigch']
But I want a script that makes one entry from those two entries, with the insurance code of the second added to the first, like this (AVB/DAS):
['De company', 'tiellaan 42', 'KD0022232', '13', 'AVB','DAS', 'Geachte heer Tigch', 'De heer I. Tigch']
You should be able to achieve your goal using itertools.groupby:
from itertools import groupby
# define input
l = [['Cautus B.V.', 'plein 92', '1129008', '10', 'AVB', 'Geachte mevrouw Daa', 'Mevrouw C.P. Daa'],
['Cautus B.V.', 'Wei 9-11', '1019123', '10', 'AVB', 'Geachte mevrouw Daa', 'Mevrouw C.P. Daa'],
['Cautus B.V.', 'plein 92', '1129008', '10', 'BEDR', 'Geachte mevrouw Daa', 'Mevrouw C.P. Daa'],
['Cautus B.V.', 'Wei 9-11', '1019123', '10', 'BEDR', 'Geachte mevrouw Daa', 'Mevrouw C.P. Daa'],
['De company', 'tiellaan 42', 'KD0022232', '13', 'AVB', 'Geachte heer Tigch', 'De heer I. Tigch'],
['De company', 'tiellaan 42', 'KD0022232', '13', 'DAS', 'Geachte heer Tigch', 'De heer I. Tigch'],
['Slever ', 'klopt 42', 'KD2220115', '17', 'AVB', 'Geachte heer Slever', 'De heer T.Slever']]
# remove clutter
l_clean = [(x[1], x[4]) for x in l]
# sort (groupby requires input to be sorted)
l_sorted = sorted(l_clean)
# group by first column
l_final = [(k, zip(*v)[1]) for k,v in groupby(l_sorted, key=lambda x:x[0])]
# print output
for k, v in l_final:
    print k, list(v)
The output is:
Wei 9-11 ['AVB', 'BEDR']
klopt 42 ['AVB']
plein 92 ['AVB', 'BEDR']
tiellaan 42 ['AVB', 'DAS']
Note that you will need to adapt the key functions used for sorting and grouping to work as intended with input different from l_clean.
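If you need the full rows merged as in the desired output (rather than just the riskadr/code pairs), you can adapt the same groupby idea; a hedged sketch of one way to do it, reusing the input list l from above:
from itertools import groupby

# sort and group the full rows on riskadr (column 1), then splice all
# insurencecodes (column 4) into a copy of the first row of each group
l_sorted = sorted(l, key=lambda row: row[1])
merged = []
for adr, rows in groupby(l_sorted, key=lambda row: row[1]):
    rows = list(rows)
    codes = [row[4] for row in rows]
    merged.append(rows[0][:4] + codes + rows[0][5:])
for row in merged:
    print(row)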
>>> a = [
... ('De company', 'tiellaan 42', 'KD0022232', '13', 'DAS', 'Geachte heer Tigch', 'De heer I. Tigch'),
... ('De company', 'tiellaan 42', 'KD0022232', '13', 'DAS', 'Geachte heer Tigch', 'De heer I. Tigch'),
... ]
>>>
>>> set(a)
set([('De company', 'tiellaan 42', 'KD0022232', '13', 'DAS', 'Geachte heer Tigch', 'De heer I. Tigch')])
>>>
Save them as tuples instead of lists and add them to a set... if that is what you need.
You probably need something along these lines: keep an in-memory list (ultimatelist) in which you check for the presence of a similar thelist. If found, append the insurencecode:
def search(item, array):
    for i in range(len(array)):
        # if the first four elements and last two elements are identical
        if array[i][:4] == item[:4] and array[i][-2:] == item[-2:]:
            return i
    return -1

index = search(thelist, ultimatelist)
if index >= 0:
    # splice the extra insurencecode in after the existing one
    ultimatelist[index] = ultimatelist[index][:5] + [thelist[4]] + ultimatelist[index][5:]
else:
    ultimatelist.append(thelist)
