Python script to extract specific data with Xpath

Python script to extract specific data with Xpath - python

I would like to extract all data of the row named "Nb B" at this url page : https://www.coteur.com/cotes-foot.php
Here is my python script :
#!/usr/bin/python3
# -*- coding: utf-8 -*-
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
options = Options()
options.headless = True
driver = webdriver.Chrome(options=options)
driver.get('https://www.coteur.com/cotes-foot.php')
#Store url associated with the soccer games
url_links = []
for i in driver.find_elements_by_xpath('//a[contains(#href, "match/cotes-")]'):
url_links.append(i.get_attribute('href'))
print(len(url_links), '\n')
nb_bookies = []
for i in driver.find_elements_by_xpath('//td[contains(#class, " odds")][contains(#style, "")]'):
nb_bookies.append(i.text)
print(nb_bookies)
And here is the output :
25
['1.80', '3.55', '4.70', '95%', '', '1.40', '4.60', '8.00', '94.33%', '', '2.35', '3.42', '2.63', '90.18%', '', '3.20', '3.60', '2.05', '92.19%', '', '7.00', '4.80', '1.35', '90.81%', '', '5.30', '4.30', '1.70', '99.05%', '', '2.15', '3.55', '3.65', '97.92%', '', '2.90', '3.20', '2.20', '88.81%', '', '3.95', '3.40', '2.10', '97.65%', '', '2.00', '3.80', '3.90', '98.04%', '', '2.40', '3.05', '3.50', '96.98%', '', '3.70', '3.20', '2.00', '91.72%', '', '2.75', '2.52', '3.05', '91.17%', '', '4.20', '3.05', '1.69', '84.23%', '', '1.22', '5.10', '10.00', '88.42%', '', '1.54', '4.60', '5.10', '93.72%', '', '3.00', '3.10', '2.45', '93.59%', '', '2.40', '3.50', '2.55', '90.55%', '', '1.76', '3.50', '4.20', '90.8%', '', '11.50', '5.30', '1.36', '98.91%', '', '3.00', '3.50', '2.20', '92.64%', '', '1.72', '3.42', '5.00', '92.62%', '', '1.08', '9.25', '19.00', '91.33%', '', '9.75', '5.75', '1.36', '98.82%', '', '5.70', '4.50', '1.63', '98.88%', '']
All the data of the table is extracted and you can see '' for the last row whereas I just want the last row.

To get the data from the last column only, fix your XPath accordingly :
nb_bookies = []
for i in driver.find_elements_by_xpath('//tr[#id and #role="row" ]/td[last()]'):
nb_bookies.append(i.text)
Output :
['12', '12', '1', '9', '11', '12', '12', '12', '12', '12', '11', '2', '11', '11', '9', '12', '11', '12', '12', '12', '12', '12', '10', '5', '12']

Your code is perfectly fine, the problem is to do with the window size that is spawned by the Automator in a headless mode. The default window size and display size in headless mode is 800x600 on all platforms.
The developers of the site have set the header to only appear if the width of the window is >1030px and only then the display: none; is removed from DOM. You can test this for yourself by shrinking & expanding the window size.
You need to understand that if an element's attribute contains style="display: none;" which means the element is hidden then Selenium won't be able to interact with the element, i.e. if a user can't see it then the same behavior applies to selenium.
Simply adding this line to enlarge your window in a headless mode will solve your problem.
options.add_argument("window-size=1400,800")

Related

How to use Python xlwings to copy a large list of lists to Excel

I have a lists of lists of length 42, and each list has about 16 items in it. I have noticed that copying the list to excel using xlwings only works for up to 25 lists and anything after that doesn't work, or works sometimes and sometimes doesn't. I have the complete list and code below if anyone would like to reproduce the issue.
import xlwings as xw
data = [['1st', '(6)', '29.9', '407m', '22/05/2017', 'GRAC', 'M', '23.76', '23.76', '23.13', '8.62', '0.50', 'Supreme Flash', '1111', '', '$6.60'], ['8th', '(5)', '29.8', '407m', '29/05/2017', 'GRAC', '5', '24.64', '23.52', '23.15', '9.02', '16.00', 'Vision Time', '1788', '', '$17.80'], ['5th', '(3)', '30.3', '305m', '12/06/2017', 'GRAC', '5', '18.25', '17.84', '17.81', '3.30', '5.75', 'Red Red Wine', '7835', '', '$21.60'], ['2nd', '(2)', '30.1', '407m', '07/07/2017', 'GRAC', 'MX', '23.62', '23.57', '22.89', '8.60', '0.75', 'Tictac Cloud', '3222', '', '$24.10'], ['4th', '(4)', '29.9', '407m', '14/07/2017', 'GRAC', '5', '23.58', '23.44', '22.98', '8.67', '2.00', 'Kooringa Theo', '2434', '', '$7.00'], ['8th', '(4)', '29.9', '407m', '24/07/2017', 'GRAC', '5', '24.44', '23.75', '23.03', '8.88', '9.75', 'Myraki', '3458', '', '$10.20'], ['1st', '(1)', '30.4', '407m', '07/08/2017', 'GRAC', '5', '23.41', '23.41', '23.12', '8.52', '3.00', 'Myraki', '11', '', '$8.10'], ['1st', '(7)', '30.4', '407m', '14/08/2017', 'GRAF', '5', '23.53', '23.53', '23.18', '8.62', '0.75', 'Gee Tee Bee', '11', '', '$26.40'], ['4th', '(6)', '30.6', '420m', '22/08/2017', 'LISM', '5', '24.58', '23.97', '23.88', '', '8.75', 'Bazaar Mckenzie', '5444', '', '$12.20'], ['5th', '(8)', '31.7', '407m', '23/10/2017', 'GRAC', '5', '23.86', '23.55', '23.27', '8.71', '4.25', 'Hidden Sniper', '1755', '', '$8.50'], ['3rd', '(8)', '31.3', '407m', '30/10/2017', 'GRAC', '5', '23.68', '23.40', '23.13', '8.63', '4.00', 'Hidden Sniper', '1763', '', '$10.20'], ['1st', '(8)', '30.4', '420m', '14/11/2017', 'LISC', '5', '24.19', '24.19', '23.93', '9.82', '1.50', 'Pavlova Cloud', '2211', '', '$3.60'], ['3rd', '(1)', '30.3', '420m', '21/11/2017', 'LISM', '5', '24.34', '24.12', '24.10', '9.78', '3.00', 'Senor Izmir', '3333', '', '$5.50'], ['6th', '(6)', '30.2', '420m', '28/11/2017', 'LISM', '5', '24.98', '24.16', '24.01', '10.17', '11.75', 'Ace Gambler', '7666', '', '$3.80'], ['5th', '(8)', '30.2', '407m', '04/12/2017', 'GRAF', '5', '23.68', '23.11', '23.11', '8.80', '8.25', 'Slippery Valley', '1665', '', '$12.80'], ['1st', '(8)', '30.1', '411m', '08/12/2017', 'CASC', '4/5', '23.55', '23.55', '23.34', '', '2.25', 'Plane Spotter', '1111', '', '$3.40'], ['1st', '(2)', '30.3', '411m', '15/12/2017', 'CASO', '4/5', '23.29', '23.29', '23.29', '', '2.25', 'Benne Fortuna', '1111', '', '$5.10'], ['3rd', '(5)', '30.4', '407m', '01/01/2018', 'GRAF', '5', '23.68', '23.52', '22.94', '8.66', '2.25', 'Bella Lyndan', '1433', '', '$3.80'], ['5th', '(3)', '30.1', '420m', '09/01/2018', 'LISM', '5', '24.37', '24.00', '23.90', '9.82', '5.25', 'Brightest Star', '4555', '', '$4.30'], ['4th', '(2)', '30.4', '420m', '16/01/2018', 'LISM', '5', '24.60', '24.11', '24.04', '10.28', '7.00', 'Lucky Call', '7644', '', '$6.30'], ['1st', '(1)', '30.2', '407m', '22/01/2018', 'GRAC', '4/5', '23.21', '23.21', '23.20', '8.68', '6.75', 'Soltador', '7211', '', '$3.30'], ['2nd', '(2)', '29.9', '407m', '29/01/2018', 'GRAC', '4/5', '23.36', '23.25', '23.24', '8.59', '1.50', 'Slippery Valley', '7322', '', '$3.60'], ['4th', '(6)', '29.8', '407m', '05/02/2018', 'GRAF', '5', '23.69', '23.18', '23.18', '8.61', '7.25', 'Karaoke Cloud', '1444', '', '$3.10'], ['3rd', '(6)', '30.0', '420m', '13/02/2018', 'LISM', '5', '24.18', '24.01', '24.01', '9.80', '2.25', 'Tranquil Invader', '4333', '', '$5.90'], ['3rd', '(1)', '30.0', '420m', '20/02/2018', 'LISM', '5', '24.23', '24.10', '23.95', '9.86', '1.75', 'Benne Fortuna', '3333', '', '$3.30'], ['2nd', '(4)', '30.0', '420m', '27/02/2018', 'LISM', '5', '24.18', '23.91', '23.91', '9.75', '3.75', 'Oh So Fabio', '3322', '\n$4.70'], ['6th', '(4)', '30.0', '407m', '05/03/2018', 'GRAF', '5', '24.57', '23.63', '23.36', '8.63', '13.25', 'Star Billing', '2676', '', '$5.90'], ['1st', '(4)', '29.8', '407m', '12/03/2018', 'GRAC', '4/5', '23.27', '23.27', '23.08', '8.57', '0.50', 'Senor Izmir', '3321', '', '$8.50'], ['3rd', '(8)', '30.4', '407m', '19/03/2018', 'GRAC', '4/5', '23.24', '23.02', '23.02', '8.58', '3.00', "Freddy's Back", '1633', '', '$17.40'], ['6th', '(5)', '30.6', '420m', '27/03/2018', 'LISM', '5', '24.88', '24.25', '23.97', '10.31', '9.00', 'Kingsbrae Steve', '7666', '', '$4.00'], ['1st', '(3)', '30.4', '407m', '02/04/2018', 'GRAF', '5', '23.17', '23.17', '23.15', '8.54', '1.25', 'Whistler Valley', '2221', '', '$5.60'], ['3rd', '(1)', '30.3', '407m', '09/04/2018', 'GRAC', 'NG', '23.41', '23.13', '23.13', '8.53', '4.00', 'Orara Sal', '4323', '', '$3.60'], ['5th', '(3)', '30.0', '520m', '17/04/2018', 'LISM', '4/5', '30.67', '30.30', '30.06', '4.53', '5.25', 'Kulu Turkey', '2455', '', '$4.70'], ['5th', '(5)', '30.2', '411m', '27/04/2018', 'CASO', '5', '24.26', '23.86', '23.18', '', '5.75', 'Our Cavalier', '5555', '', '$4.30'], ['6th', '(3)', '31.4', '305m', '13/08/2018', 'GRAC', '4/5', '18.29', '17.79', '17.31', '3.31', '7.00', "Here's Molly", '8856', '', '$7.60'], ['1st', '(6)', '31.6', '305m', '20/08/2018', 'GRAC', '5', '17.66', '17.66', '17.66', '3.19', '1.25', 'Sandler', '1111', '', '$3.30'], ['1st', '(3)', '31.6', '420m', '28/08/2018', 'LISM', '4/5', '24.46', '24.46', '24.05', '9.95', '2.00', "Don't Seamus", '1111', '', '$2.00'], ['7th', '(7)', '31.6', '407m', '03/09/2018', 'GRAF', '4/5', '24.05', '23.48', '23.39', '8.72', '8.25', 'Kooringa Molly', '4667', '', '$6.50'], ['6th', '(4)', '31.4', '411m', '07/09/2018', 'CASC', '5', '23.90', '23.49', '23.15', '', '5.75', 'Nitro Beach', '6566', '', '$5.70'], ['4th', '(3)', '31.1', '420m', '11/09/2018', 'LISM', '4/5', '24.33', '23.91', '23.80', '9.78', '6.00', 'Blue Max', '4444', '', '$10.10'], ['5th', '(3)', '31.3', '411m', '14/09/2018', 'CASO', '5', '24.01', '23.25', '22.97', '', '10.75', 'Kingsbrae Steve', '7755', '\n$3.60']]
wb = xw.Book('example.xlsm')
sht = wb.sheets["Sheet1"]
sht.clear()
sht.range('A1').value = data[1:26]
The above code works and copies each list to successive row. However it doesnt work when I change the 26 to any number above. Also the code doesn't work if my starting index is 0, for example sht.range('A1').value = data[0:5]How can I get this working properly?

Ok I've realised xlwings certainly struggles and is unpredictable with lists. For anyone having this issue, simply convert the list to a dataframe and it works as expected. Sample code below:
import xlwings as xw
import pandas as pd
data = [['1st', '(6)',...]] #View complete list above
wb = xw.Book('example.xlsm')
sht = wb.sheets["Sheet1"]
sht.clear()
df = pd.DataFrame(data)
sht.range("A1").value = df

All lists/tuples that represent rows must be of the same length. It's a known limitation and there should be an appropriate error message with one of the next releases, see the issue.
Your answer works as numpy arrays or pandas dataframes are always regular arrays.

Turn a text file into a dictionary with Python

I have a text file with a pattern:
[Badges_373382]
Deleted=0
Button2=0 1497592154
Button1=0 1497592154
ProxReader=0
StartProc=100 1509194246 ""
NextStart=0
LastSeen=1509194246
Enabled=1
Driver=Access Control
Program=AccessProxBadge
LocChg=1509120279
Name=asd
Neuron=7F0027BF2D
Owner=373381
LostSince=1509120774
Index1=218
Photo=unknown.jpg
LastProxReader=0
Temp=0
LastTemp=0
LastMotionless=0
LastMotion=1497592154
BatteryLow=0
PrevReader=10703
Reader=357862
SuspendTill=0
SuspendSince=0
Status=1001
ConvertUponDownload=0
AXSFlags=0
Params=10106
Motion=1
USER_DATA_CreationDate=6/15/2017 4:48:15 PM
OwnerOldName=asd
[Badges_373384]
Deleted=0
Button2=0 1497538610
Button1=0 1497538610
ProxReader=0
StartProc=100 1509194246 ""
NextStart=0
LastSeen=1513872678
Enabled=1
Driver=Access Control
Program=AccessProxBadge
LocChg=1513872684
Name=dsa
Neuron=7F0027CC1C
Owner=373383
LostSince=1513872723
Index1=219
Photo=unknown.jpg
LastProxReader=0
Temp=0
LastTemp=0
LastMotionless=0
LastMotion=1497538610
BatteryLow=0
PrevReader=357874
Reader=357873
SuspendTill=0
SuspendSince=0
Status=1001
ConvertUponDownload=0
AXSFlags=0
Params=10106
Motion=1
USER_DATA_CreationDate=6/15/2017 4:48:51 PM
OwnerOldName=dsa
[Badges_373386]
Deleted=0
Button2=0 1497780768
Button1=0 1497780768
ProxReader=0
StartProc=100 1509194246 ""
NextStart=0
LastSeen=1514124910
Enabled=1
Driver=Access Control
Program=AccessProxBadge
LocChg=1514124915
Name=ss
Neuron=7F0027B5FD
Owner=373385
LostSince=1514124950
Index1=220
Photo=unknown.jpg
LastProxReader=0
Temp=0
LastTemp=0
LastMotionless=0
LastMotion=1497780768
BatteryLow=0
PrevReader=357872
Reader=357871
SuspendTill=0
SuspendSince=0
Status=1001
ConvertUponDownload=0
AXSFlags=0
Params=10106
Motion=1
USER_DATA_CreationDate=6/15/2017 4:49:24 PM
OwnerOldName=ss
Every new "Badge" info starts with [Badges_number] and end with blank line.
Using Python 3.6, I would like to turn this file into a dictionary so that I could easily access that information.
It should look like this:
content = {"Badges_373382:{"Deleted:0,.."},"Badges_371231":{"Deleted":0,..}"}
I'm pretty confused on how to do that, I'd love to get some help.
Thanks!

This is basically an INI file, and Python provides the configparser module to parse such files.
import configparser
config = configparser.ConfigParser()
config.readfp(open('badges.ini'))
r = {section: dict(config[section]) for section in config.sections()}

You can loop through each line and keep track if you have seen a header in the format [Badges_373382]:
import re
import itertools
with open('filename.txt') as f:
f = filter(lambda x:x, [i.strip('\n') for i in f])
new_data = [(a, list(b)) for a, b in itertools.groupby(f, key=lambda x:bool(re.findall('\[[a-zA-Z]+_+\d+\]', x)))]
final_data = {new_data[i][-1][-1]:dict(c.split('=') for c in new_data[i+1][-1]) for i in range(0, len(new_data), 2)}
Output:
{'[Badges_373384]': {'OwnerOldName': 'dsa', 'LastMotionless': '0', 'NextStart': '0', 'Driver': 'Access Control', 'LastTemp': '0', 'USER_DATA_CreationDate': '6/15/2017 4:48:51 PM', 'Program': 'AccessProxBadge', 'LocChg': '1513872684', 'Reader': '357873', 'LostSince': '1513872723', 'LastMotion': '1497538610', 'Status': '1001', 'Deleted': '0', 'SuspendTill': '0', 'ProxReader': '0', 'LastSeen': '1513872678', 'BatteryLow': '0', 'Index1': '219', 'Name': 'dsa', 'Temp': '0', 'Enabled': '1', 'StartProc': '100 1509194246 ""', 'Motion': '1', 'Button2': '0 1497538610', 'Button1': '0 1497538610', 'SuspendSince': '0', 'ConvertUponDownload': '0', 'PrevReader': '357874', 'AXSFlags': '0', 'LastProxReader': '0', 'Photo': 'unknown.jpg', 'Neuron': '7F0027CC1C', 'Owner': '373383', 'Params': '10106'}, '[Badges_373382]': {'OwnerOldName': 'asd', 'LastMotionless': '0', 'NextStart': '0', 'Driver': 'Access Control', 'LastTemp': '0', 'USER_DATA_CreationDate': '6/15/2017 4:48:15 PM', 'Program': 'AccessProxBadge', 'LocChg': '1509120279', 'Reader': '357862', 'LostSince': '1509120774', 'LastMotion': '1497592154', 'Status': '1001', 'Deleted': '0', 'SuspendTill': '0', 'ProxReader': '0', 'LastSeen': '1509194246', 'BatteryLow': '0', 'Index1': '218', 'Name': 'asd', 'Temp': '0', 'Enabled': '1', 'StartProc': '100 1509194246 ""', 'Motion': '1', 'Button2': '0 1497592154', 'Button1': '0 1497592154', 'SuspendSince': '0', 'ConvertUponDownload': '0', 'PrevReader': '10703', 'AXSFlags': '0', 'LastProxReader': '0', 'Photo': 'unknown.jpg', 'Neuron': '7F0027BF2D', 'Owner': '373381', 'Params': '10106'}, '[Badges_373386]': {'OwnerOldName': 'ss', 'LastMotionless': '0', 'NextStart': '0', 'Driver': 'Access Control', 'LastTemp': '0', 'USER_DATA_CreationDate': '6/15/2017 4:49:24 PM', 'Program': 'AccessProxBadge', 'LocChg': '1514124915', 'Reader': '357871', 'LostSince': '1514124950', 'LastMotion': '1497780768', 'Status': '1001', 'Deleted': '0', 'SuspendTill': '0', 'ProxReader': '0', 'LastSeen': '1514124910', 'BatteryLow': '0', 'Index1': '220', 'Name': 'ss', 'Temp': '0', 'Enabled': '1', 'StartProc': '100 1509194246 ""', 'Motion': '1', 'Button2': '0 1497780768', 'Button1': '0 1497780768', 'SuspendSince': '0', 'ConvertUponDownload': '0', 'PrevReader': '357872', 'AXSFlags': '0', 'LastProxReader': '0', 'Photo': 'unknown.jpg', 'Neuron': '7F0027B5FD', 'Owner': '373385', 'Params': '10106'}}

You can just go through each line of the file and add what you need. Their are three cases of lines you can come across:
1. The is a header, it will be a key final dictionary. You can just check if a line starts with "[Badges" here, and store the current header with a temporary variable while reading the file.
2. The line is a blank line, marking the end of the current badge data being read. All you need to do here is add the information collected from the current badge and add it to the dictionary, with the correct corresponding key. Depending on your implementation, you can delete these beforehand, or keep them when reading the lines.
3. Otherwise, the line has some info that needs to be stored. You first need to split this info on "=", and store it in your dictionary.
With these suggestions, you can write something like this to accomplish this task:
from collections import defaultdict
# dictionary of dictionary values
data = defaultdict(dict)
with open('pattern.txt') as file:
lines = [line.strip('\n') for line in file]
# keeps track of current header
header = None
# case 2, deletes empty lines before hand
valid_lines = [line for line in lines if line]
for line in valid_lines:
# case 1, for headers
if line.startswith('[Badges'):
# updates current header, and deletes square brackets
header = line.replace('[', '').replace(']', '')
# case 3, data has been found
else:
# split and add the data
info = line.split('=')
key, value = info[0], info[1]
data[header][key] = value
print(dict(data))
Which outputs:
{'Badges_373382': {'Deleted': '0', 'Button2': '0 1497592154', 'Button1': '0 1497592154', 'ProxReader': '0', 'StartProc': '100 1509194246 ""', 'NextStart': '0', 'LastSeen': '1509194246', 'Enabled': '1', 'Driver': 'Access Control', 'Program': 'AccessProxBadge', 'LocChg': '1509120279', 'Name': 'asd', 'Neuron': '7F0027BF2D', 'Owner': '373381', 'LostSince': '1509120774', 'Index1': '218', 'Photo': 'unknown.jpg', 'LastProxReader': '0', 'Temp': '0', 'LastTemp': '0', 'LastMotionless': '0', 'LastMotion': '1497592154', 'BatteryLow': '0', 'PrevReader': '10703', 'Reader': '357862', 'SuspendTill': '0', 'SuspendSince': '0', 'Status': '1001', 'ConvertUponDownload': '0', 'AXSFlags': '0', 'Params': '10106', 'Motion': '1', 'USER_DATA_CreationDate': '6/15/2017 4:48:15 PM', 'OwnerOldName': 'asd'}, 'Badges_373384': {'Deleted': '0', 'Button2': '0 1497538610', 'Button1': '0 1497538610', 'ProxReader': '0', 'StartProc': '100 1509194246 ""', 'NextStart': '0', 'LastSeen': '1513872678', 'Enabled': '1', 'Driver': 'Access Control', 'Program': 'AccessProxBadge', 'LocChg': '1513872684', 'Name': 'dsa', 'Neuron': '7F0027CC1C', 'Owner': '373383', 'LostSince': '1513872723', 'Index1': '219', 'Photo': 'unknown.jpg', 'LastProxReader': '0', 'Temp': '0', 'LastTemp': '0', 'LastMotionless': '0', 'LastMotion': '1497538610', 'BatteryLow': '0', 'PrevReader': '357874', 'Reader': '357873', 'SuspendTill': '0', 'SuspendSince': '0', 'Status': '1001', 'ConvertUponDownload': '0', 'AXSFlags': '0', 'Params': '10106', 'Motion': '1', 'USER_DATA_CreationDate': '6/15/2017 4:48:51 PM', 'OwnerOldName': 'dsa'}, 'Badges_373386': {'Deleted': '0', 'Button2': '0 1497780768', 'Button1': '0 1497780768', 'ProxReader': '0', 'StartProc': '100 1509194246 ""', 'NextStart': '0', 'LastSeen': '1514124910', 'Enabled': '1', 'Driver': 'Access Control', 'Program': 'AccessProxBadge', 'LocChg': '1514124915', 'Name': 'ss', 'Neuron': '7F0027B5FD', 'Owner': '373385', 'LostSince': '1514124950', 'Index1': '220', 'Photo': 'unknown.jpg', 'LastProxReader': '0', 'Temp': '0', 'LastTemp': '0', 'LastMotionless': '0', 'LastMotion': '1497780768', 'BatteryLow': '0', 'PrevReader': '357872', 'Reader': '357871', 'SuspendTill': '0', 'SuspendSince': '0', 'Status': '1001', 'ConvertUponDownload': '0', 'AXSFlags': '0', 'Params': '10106', 'Motion': '1', 'USER_DATA_CreationDate': '6/15/2017 4:49:24 PM', 'OwnerOldName': 'ss'}}
Note: The above code is just a possibility, feel free to adapt it to your needs, or even improve it.
I also used collections.defaultdict to add the data, since its easier to use. You can also wrap dict() at the end to convert it to a normal dictionary, which is optional.

You can try regex and split the result of output:
pattern='^\[Badges.+?OwnerOldName=\w+'
import re
with open('file.txt','r') as f:
match=re.finditer(pattern,f.read(),re.DOTALL | re.MULTILINE)
new=[]
for kk in match:
if kk.group()!='\n':
new.append(kk.group())
print({i.split()[0]:i.split()[1:] for i in new})
output:
{'[Badges_373384]': ['Deleted=0', 'Button2=0', '1497538610', 'Button1=0', '1497538610', 'ProxReader=0', 'StartProc=100', '1509194246', '""', 'NextStart=0', 'LastSeen=1513872678', 'Enabled=1', 'Driver=Access', 'Control', 'Program=AccessProxBadge', 'LocChg=1513872684', 'Name=dsa', 'Neuron=7F0027CC1C', 'Owner=373383', 'LostSince=1513872723', 'Index1=219', 'Photo=unknown.jpg', 'LastProxReader=0', 'Temp=0', 'LastTemp=0', 'LastMotionless=0', 'LastMotion=1497538610', 'BatteryLow=0', 'PrevReader=357874', 'Reader=357873', 'SuspendTill=0', 'SuspendSince=0', 'Status=1001', 'ConvertUponDownload=0', 'AXSFlags=0', 'Params=10106', 'Motion=1', 'USER_DATA_CreationDate=6/15/2017', '4:48:51', 'PM', 'OwnerOldName=dsa'], '[Badges_373382]': ['Deleted=0', 'Button2=0', '1497592154', 'Button1=0', '1497592154', 'ProxReader=0', 'StartProc=100', '1509194246', '""', 'NextStart=0', 'LastSeen=1509194246', 'Enabled=1', 'Driver=Access', 'Control', 'Program=AccessProxBadge', 'LocChg=1509120279', 'Name=asd', 'Neuron=7F0027BF2D', 'Owner=373381', 'LostSince=1509120774', 'Index1=218', 'Photo=unknown.jpg', 'LastProxReader=0', 'Temp=0', 'LastTemp=0', 'LastMotionless=0', 'LastMotion=1497592154', 'BatteryLow=0', 'PrevReader=10703', 'Reader=357862', 'SuspendTill=0', 'SuspendSince=0', 'Status=1001', 'ConvertUponDownload=0', 'AXSFlags=0', 'Params=10106', 'Motion=1', 'USER_DATA_CreationDate=6/15/2017', '4:48:15', 'PM', 'OwnerOldName=asd'], '[Badges_373386]': ['Deleted=0', 'Button2=0', '1497780768', 'Button1=0', '1497780768', 'ProxReader=0', 'StartProc=100', '1509194246', '""', 'NextStart=0', 'LastSeen=1514124910', 'Enabled=1', 'Driver=Access', 'Control', 'Program=AccessProxBadge', 'LocChg=1514124915', 'Name=ss', 'Neuron=7F0027B5FD', 'Owner=373385', 'LostSince=1514124950', 'Index1=220', 'Photo=unknown.jpg', 'LastProxReader=0', 'Temp=0', 'LastTemp=0', 'LastMotionless=0', 'LastMotion=1497780768', 'BatteryLow=0', 'PrevReader=357872', 'Reader=357871', 'SuspendTill=0', 'SuspendSince=0', 'Status=1001', 'ConvertUponDownload=0', 'AXSFlags=0', 'Params=10106', 'Motion=1', 'USER_DATA_CreationDate=6/15/2017', '4:49:24', 'PM', 'OwnerOldName=ss']}

How to customize a table for label printing with reportlab

I want to make the label print to PDF from reportlab by python 3.6, and I checked the reportlab for tables' usage. All of the methods are regular form.
I want to merge the cell to realize the final effect as follows.
<> contains records from database.
My label requirements
By "Span" method, I got the tables here:
When I met the last rows, I cannot split it. Because I use 0.5cm x16, 0.5cmx11 to format the table. Now, should I change it to 0.25 cmx32, 0.25cm x 22? It must be a mass work.
My result
Does anyone give me a suggestion to solve this problem? I need a direction. Thanks.
* If simply draw line and output text, I cannot realize the align,valign, wrap etc.
My codes are here:
# -*- coding:utf-8 -*-
from reportlab.lib import colors
from reportlab.lib.pagesizes import A4,cm
from reportlab.platypus import SimpleDocTemplate, Table, TableStyle
doc = SimpleDocTemplate("LabelTest.pdf", pagesize=A4)
# container for the 'Flowable' objects
elements = []
data= [['1', '', '','','','','','','', '', '', '', '13', '', '', ''],
['', '', '','','','','','','', '', '', '', '', '', '', ''],
['', '', '','','','','','','', '', '', '', '', '', '', ''],
['4','', '','','','','','','', '', '', '', '13', '14', '15', '16'],
['5', '', '','','','','','','7','8','9', '10', '11', '12', '13', '14', '15', '16'],
['6', '', '','','','','','','7','8','9', '10', '11', '12', '13', '14', '15', '16'],
['7', '', '','','','','','','7','8','9', '10', '11', '12', '13', '14', '15', '16'],
['8', '', '','','','','','','7','8','9', '10', '11', '12', '13', '14', '15', '16'],
['9', '', '','','','','7','8','9', '10', '11', '12', '13', '14', '15', '16'],
['', '', '','','','','','8','9', '10', '11', '12', '13', '14', '15', '16'],
['', '', '','','','','','8','9', '10', '11', '12', '13', '14', '15', '16']]
t=Table(data,16*[0.5*cm], 11*[0.5*cm],)
t.setStyle(TableStyle([
('GRID',(0,0),(-1,-1),1,colors.black),
('SPAN',(-4,0),(-1,3)), # Right corner for logo image
('SPAN',(0,0),(-5,2)), # First two rows for product des and surface
('SPAN',(0,3),(-5,3)), # Third row for product requirements
('SPAN',(0,4),(5,7)), # For product picture
('SPAN',(6,3),(-1,6)), # Description and size
('SPAN',(6,4),(-1,7)), # For mat'l no.
('SPAN',(0,8),(5,-1)), # EAN-13
]))
elements.append(t)
# write the document to disk
doc.build(elements)
Currently, I find a solution to make it by myself, maybe it is not the best one, but it really helps a lot.
# -*- coding:utf-8 -*-
from reportlab.lib import colors
from reportlab.lib.pagesizes import A4,cm
from reportlab.platypus import SimpleDocTemplate, Table, TableStyle
doc = SimpleDocTemplate("ex09_1-ReyherTable.pdf", pagesize=A4)
# container for the 'Flowable' objects
elements = []
data0= [['1','9']]
t0=Table(data0,colWidths=[6*cm,2*cm],rowHeights=[2*cm])
t0.setStyle(TableStyle([
('GRID',(0,0),(-1,-1),1,colors.black),
]))
data1= [['2','9']]
t1=Table(data1,colWidths=[3*cm,5*cm],rowHeights=[2*cm])
t1.setStyle(TableStyle([
('GRID',(0,0),(-1,-1),1,colors.black),
]))
data2= [['3','4','5'],
['4','5','6'],]
t2=Table(data2,colWidths=[3*cm,2.5*cm,2.5*cm],rowHeights=2*[0.75*cm])
t2.setStyle(TableStyle([
('GRID',(0,0),(-1,-1),1,colors.black),
('SPAN',(0,0),(0,-1)),
('SPAN',(-1,0),(-1,-1)),
]))
elements.append(t0)
elements.append(t1)
elements.append(t2)
# write the document to disk
doc.build(elements)

NoneType Error when trying to parse Table using BeautifulSoup

Here's my code:
source = urllib.request.urlopen('http://nflcombineresults.com/nflcombinedata_expanded.php ?year=2015&pos=&college=').read()
soup = bs.BeautifulSoup(source, 'lxml')
table = soup.table
table = soup.find(id='datatable')
table_rows = table.find_all('tr')
#print(table_rows)
year = []
name = []
college = []
pos = []
height = []
weight = []
hand_size = []
arm_length = []
wonderlic = []
fortyyrd = []
for row in table_rows[1:]:
col = row.find_all('td')
#row = [i.text for i in td]
#print(col[4])
# Create a variable of the string inside each <td> tag pair,
column_1 = col[0].string.strip()
# and append it to each variable
year.append(column_1)
column_2 = col[1].string.strip()
name.append(column_2)
column_3 = col[2].string.strip()
college.append(column_3)
column_4 = col[3].string.strip()
pos.append(column_4)
#print(col[4])
column_5 = col[4].string.strip()
height.append(column_5)
There are several more columns in the table I want to add, but whenever I try and run these last two lines, I get an error saying:
"AttributeError: 'NoneType' object has no attribute 'strip'"
when I print col[4] right above this line, I get:
<td><div align="center">69</div></td>
I originally thought this is due to missing data, but the first instance of missing data in the original table on the website is in the 9th column (Wonderlic) of the first row, not the 4th column.
There are several other columns not included in this snippet of code that I want to add to my dataframe and I'm getting the NoneType error with them as well despite there being an entry in that cell.
I'm fairly new to parsing tables from a site using BeautifulSoup and so this could be a stupid question, but why is this object NoneType how can I fix this so I can put this table into a pandas dataframe?

Alternately if you want to try it with pandas, you can do it like so:
import pandas as pd
df = pd.read_html("http://nflcombineresults.com/nflcombinedata_expanded.php?year=2015&pos=&college=")[0]
df.head()
Output:

AttributeError: 'NoneType' object has no attribute 'strip'
The actual error is happening on the last row of the table which has a single cell, here is it's HTML:
<tr style="background-color:#333333;"><td colspan="15"> </td></tr>
Just slice it:
for row in table_rows[1:-1]:
As far as improving the overall quality of the code, you can/should follow #宏杰李's answer.

import requests
from bs4 import BeautifulSoup
r = requests.get('http://nflcombineresults.com/nflcombinedata_expanded.php?year=2015&pos=&college=')
soup = BeautifulSoup(r.text, 'lxml')
for tr in soup.table.find_all('tr'):
row = [td.text for td in tr.find_all('td')]
print (row)
out:
['Year', 'Name', 'College', 'POS', 'Height (in)', 'Weight (lbs)', 'Hand Size (in)', 'Arm Length (in)', 'Wonderlic', '40 Yard', 'Bench Press', 'Vert Leap (in)', 'Broad Jump (in)', 'Shuttle', '3Cone', '60Yd Shuttle']
['2015', 'Ameer Abdullah', 'Nebraska', 'RB', '69', '205', '8.63', '30.00', '', '4.60', '24', '42.5', '130', '3.95', '6.79', '11.18']
['2015', 'Nelson Agholor', 'Southern California', 'WR', '73', '198', '9.25', '32.25', '', '4.42', '12', '', '', '', '', '']
['2015', 'Malcolm Agnew', 'Southern Illinois', 'RB', '70', '202', '', '', '', '*4.59', '', '', '', '', '', '']
['2015', 'Jay Ajayi', 'Boise State', 'RB', '73', '221', '10.00', '32.00', '24', '4.57', '19', '39.0', '121', '4.10', '7.10', '11.10']
['2015', 'Brandon Alexander', 'Central Florida', 'FS', '74', '195', '', '', '', '*4.59', '', '', '', '', '', '']
['2015', 'Kwon Alexander', 'Louisiana State', 'OLB', '73', '227', '9.25', '30.25', '', '4.55', '24', '36.0', '121', '4.20', '7.14', '']
['2015', 'Mario Alford', 'West Virginia', 'WR', '68', '180', '9.38', '31.25', '', '4.43', '13', '34.0', '121', '4.07', '6.64', '11.22']
['2015', 'Detric Allen', 'East Carolina', 'CB', '73', '200', '', '', '', '*4.59', '', '', '', '', '', '']
['2015', 'Javorius Allen', 'Southern California', 'RB', '73', '221', '9.38', '31.75', '12', '4.53', '11', '35.5', '121', '4.28', '6.96', '']
As you can see, there are a lot of empty fields in the table, the better way is to put all the field in a list, then unpack them or use namedtuple.
This will improve your code stability.

HTML file parsing in Python

I have a very long html file that looks exactly like this - html file . I want to be able to parse the file such that I get the information in the form on a tuple .
Example:
<tr>
<td>Cech</td>
<td>Chelsea</td>
<td>30</td>
<td>£6.4</td>
</tr>
The above information will look like ("Cech", "Chelsea", 30, 6.4). However if you look closely at the link i posted, the html example i posted comes under a <h2>Goalkeepers</h2> tag. i need this tag too. So basically the result tuple will look like ("Cech", "Chelsea", 30, 6.4, Goalkeepers) . Further down the file a bunch of players come under <h2> tags of Midfielders , Defenders and Forwards.
I tried using beautifulsoup and ntlk libraries and got lost. So now I have the following code:
import nltk
from urllib import urlopen
url = "http://fantasy.premierleague.com/player-list/"
html = urlopen(url).read()
raw = nltk.clean_html(html)
print raw
which just strips of the html file of all the tags and gives something like this:
Cech
Chelsea
30
£6.4
Although I can write a bad piece of code that reads every line and can assign it to a tuple. i cannot come up with any solution which can also incorporate the player position ( the string present in the <h2> tags). Any solution / suggestions will be greatly appreciated.
The reason I am inclined towards using tuples i so that I can use unpacking and plan on populating a MySQl table with the unpacked values.

from bs4 import BeautifulSoup
from pprint import pprint
soup = BeautifulSoup(html)
h2s = soup.select("h2") #get all h2 elements
tables = soup.select("table") #get all tables
first = True
title =""
players = []
for i,table in enumerate(tables):
if first:
#every h2 element has 2 tables. table size = 8, h2 size = 4
#so for every 2 tables 1 h2
title = h2s[int(i/2)].text
for tr in table.select("tr"):
player = (title,) #create a player
for td in tr.select("td"):
player = player + (td.text,) #add td info in the player
if len(player) > 1:
#If the tr contains a player and its not only ("Goalkeaper") add it
players.append(player)
first = not first
pprint(players)
output:
[('Goalkeepers', 'Cech', 'Chelsea', '30', '£6.4'),
('Goalkeepers', 'Hart', 'Man City', '28', '£6.4'),
('Goalkeepers', 'Krul', 'Newcastle', '21', '£5.0'),
('Goalkeepers', 'Ruddy', 'Norwich', '25', '£5.0'),
('Goalkeepers', 'Vorm', 'Swansea', '19', '£5.0'),
('Goalkeepers', 'Stekelenburg', 'Fulham', '6', '£4.9'),
('Goalkeepers', 'Pantilimon', 'Man City', '0', '£4.9'),
('Goalkeepers', 'Lindegaard', 'Man Utd', '0', '£4.9'),
('Goalkeepers', 'Butland', 'Stoke City', '0', '£4.9'),
('Goalkeepers', 'Foster', 'West Brom', '13', '£4.9'),
('Goalkeepers', 'Viviano', 'Arsenal', '0', '£4.8'),
('Goalkeepers', 'Schwarzer', 'Chelsea', '0', '£4.7'),
('Goalkeepers', 'Boruc', 'Southampton', '42', '£4.7'),
('Goalkeepers', 'Myhill', 'West Brom', '15', '£4.5'),
('Goalkeepers', 'Fabianski', 'Arsenal', '0', '£4.4'),
('Goalkeepers', 'Gomes', 'Tottenham', '0', '£4.4'),
('Goalkeepers', 'Friedel', 'Tottenham', '0', '£4.4'),
('Goalkeepers', 'Henderson', 'West Ham', '0', '£4.0'),
('Defenders', 'Baines', 'Everton', '43', '£7.7'),
('Defenders', 'Vertonghen', 'Tottenham', '34', '£7.0'),
('Defenders', 'Taylor', 'Cardiff City', '14', '£4.5'),
('Defenders', 'Zverotic', 'Fulham', '0', '£4.5'),
('Defenders', 'Davies', 'Hull City', '28', '£4.5'),
('Defenders', 'Flanagan', 'Liverpool', '0', '£4.5'),
('Defenders', 'Dawson', 'West Brom', '0', '£3.9'),
('Defenders', 'Potts', 'West Ham', '0', '£3.9'),
('Defenders', 'Spence', 'West Ham', '0', '£3.9'),
('Midfielders', 'Özil', 'Arsenal', '24', '£10.6'),
('Midfielders', 'Redmond', 'Norwich', '20', '£5.0'),
('Midfielders', 'Mavrias', 'Sunderland', '5', '£5.0'),
('Midfielders', 'Gera', 'West Brom', '0', '£5.0'),
('Midfielders', 'Essien', 'Chelsea', '0', '£4.9'),
('Midfielders', 'Brown', 'West Brom', '0', '£4.3'),
('Forwards', 'van Persie', 'Man Utd', '24', '£13.9'),
('Forwards', 'Cornelius', 'Cardiff City', '1', '£5.4'),
('Forwards', 'Elmander', 'Norwich', '7', '£5.4'),
('Forwards', 'Murray', 'Crystal Palace', '0', '£5.3'),
('Forwards', 'Vydra', 'West Brom', '2', '£5.3'),
('Forwards', 'Proschwitz', 'Hull City', '0', '£4.3')]

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Python script to extract specific data with Xpath - python

Related

How to use Python xlwings to copy a large list of lists to Excel

Turn a text file into a dictionary with Python

How to customize a table for label printing with reportlab

NoneType Error when trying to parse Table using BeautifulSoup

HTML file parsing in Python

Categories

Resources