How do I extract text from this PDF file, where some of the data is in table form and some is key-value data?
eg:
https://drive.internxt.com/s/file/78f2d73478b832b2ab55/3edb275967deeca6ad33e7d53f2337c50d5dfb50e0aa525bb7f10d49dff1e2b4
This is what I have tried :
import PyPDF2
import openpyxl
from openpyxl import Workbook
pdfFileObj = open('sample.pdf', 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
pdfReader.numPages
pageObj = pdfReader.getPage(0)
mytext = pageObj.extractText()
wb = Workbook()
sheet = wb.active
sheet.title = 'MyPDF'
sheet['A1'] = mytext
wb.save('sample.xlsx')
print('Save')
However I'd like the data to be stored in the following format.
This PDF does not have well-defined tables, so a table-extraction tool cannot pull the entire data out in one table. What we can do is read the entire PDF as text and process the data fields line by line, using regex to extract the values.
Before you move ahead, please install the pdfplumber package for python
pip install pdfplumber
Assumptions
Here are some assumptions that I made for your pdf and accordingly I have written the code.
First line will always contain the title Account History Report.
Second line will contain the names IMAGE All Notes
Third line will contain only the data Date Created in the form of key:value.
Fourth line will contain only the data Number of Pages in the form of key:value.
Fifth line will only contain the data Client Code, Client Name
Starting at line 6, a PDF can contain multiple data entities. This PDF has 2, but there can be any number.
Each data entity will contain the following fields:
First line in data entity will contain only the data Our Ref, Name, Ref 1, Ref 2
Second line will contain only the data, in the form present in the pdf: Amount, Total Paid, Balance, Date of A/C, Date Received
Third line in data entity will contain the data Last Paid, Amt Last Paid, Status, Collector.
Fourth line will contain the column name Date Notes
The subsequent lines will contain data in the form of table until the next data entity is started.
I also assume that each data entity will contain the first data with key Our Ref :.
I assume that the data entity will be separated on the first line of each entity in the pattern of key values as Our Ref :Value Name: Value Ref 1 :Value Ref 2:value
pattern = r'Our Ref.*?Name.*?Ref 1.*?Ref 2.*?'
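To see how this pattern separates the entities, here is a minimal sketch on a made-up string (the sample text is invented for illustration and is not taken from the PDF). Wrapping the pattern in a lookahead `(?=...)` lets `re.split` cut just before each `Our Ref` line without consuming it. Splitting on a zero-width match like this needs Python 3.7 or later.

```python
import re

# Invented sample text standing in for the text extracted from the PDF
sample_text = (
    "Account History Report\n"
    "Date Created: 18/04/2022\n"
    "Our Ref :111 Name: Sky Blue Ref 1 :A-1 Ref 2:F1\n"
    "Amount: $1.00\n"
    "Our Ref :222 Name: Green Field Ref 1 :B-2 Ref 2:F2\n"
    "Amount: $2.00\n"
)

# Zero-width lookahead: split before each entity's first line, keeping that line
entity_sep = r'(?=Our Ref.*?Name.*?Ref 1.*?Ref 2)'
chunks = re.split(entity_sep, sample_text)

# The first chunk is the header; every following chunk is one data entity
header, entities = chunks[0], chunks[1:]
print(len(entities))
print(entities[0].splitlines()[0])
```

Each chunk after the first starts with its own `Our Ref` line, so the per-line regexes can be applied to it directly.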
Please note that the rectangle that I have created(thick black) in above image, I am calling those as data entity.
The final data will be stored in a dictionary (json) where the data entities have the keys data_entity1, data_entity2, data_entity3 and so on, based on the number of entities you have in your pdf.
The header details are stored in the json as key:value and I assume that each key will be present in header only once.
CODE
Here is simple code that gives you the information from the pdf in the form of json. In the output, the first few fields contain information from the header part; the data entities follow as data_entity1 and data_entity2.
In the below code all you need to change is pdf_path.
import pdfplumber
import re
# regex pattern for keys in line1 of data entity
my_regex_dict_line1 = {
'Our Ref' : r'Our Ref :(.*?)Name',
'Name' : r'Name:(.*?)Ref 1',
'Ref 1' : r'Ref 1 :(.*?)Ref 2',
'Ref 2' : r'Ref 2:(.*?)$'
}
# regex pattern for keys in line2 of data entity
my_regex_dict_line2 = {
'Amount' : r'Amount:(.*?)Total Paid',
'Total Paid' : r'Total Paid:(.*?)Balance',
'Balance' : r'Balance:(.*?)Date of A/C',
'Date of A/C' : r'Date of A/C:(.*?)Date Received',
'Date Received' : r'Date Received:(.*?)$'
}
# regex pattern for keys in line3 of data entity
my_regex_dict_line3 ={
'Last Paid' : r'Last Paid:(.*?)Amt Last Paid',
'Amt Last Paid' : r'Amt Last Paid:(.*?)A/C\s+Status',
'A/C Status': r'A/C\s+Status:(.*?)Collector',
'Collector' : r'Collector :(.*?)$'
}
def preprocess_data(data):
    # split into lines, dropping blank lines and surrounding whitespace
    return [el.strip() for el in data.splitlines() if el.strip()]

def get_header_data(text, json_data):
    header_data_list = preprocess_data(text)
    # third line in text of header contains Date Created field
    json_data['Date Created'] = re.search(r'Date Created:(.*?)$', header_data_list[2]).group(1).strip()
    # fourth line in text contains Number of Pages
    json_data['Number of Pages'] = re.search(r'Number of Pages:(.*?)$', header_data_list[3]).group(1).strip()
    # fifth line in text contains Client Code and ClientName
    json_data['Client Code'] = re.search(r'Client Code - (.*?)Client Name', header_data_list[4]).group(1).strip()
    json_data['ClientName'] = re.search(r'Client Name - (.*?)$', header_data_list[4]).group(1).strip()

def iterate_through_regex_and_populate_dictionaries(data_dict, regex_dict, text):
    ''' For each pattern in regex_dict, searches text and adds the matched key and value to data_dict '''
    for key, regex in regex_dict.items():
        matched_value = re.search(regex, text)
        if matched_value is not None:
            data_dict[key] = matched_value.group(1).strip()

def populate_date_notes(data_dict, text):
    ''' Populates the Date and Notes lists in data_dict from the lines of one data entity '''
    data_dict['Date'] = []
    data_dict['Notes'] = []
    # lines 0-3 hold the key:value fields and the "Date Notes" column names
    for line in text[4:]:
        match = re.search(r'(\d{2}/\d{2}/\d{4})\s*(.*?)$', line)
        if match is None:
            # skip lines without a date instead of failing on them
            continue
        data_dict['Date'].append(match.group(1).strip())
        data_dict['Notes'].append(match.group(2).strip())
data_index = 1
json_data = {}
pdf_path = r'C:\Users\hpoddar\Desktop\Temp\sample3.pdf' # ENTER YOUR PDF PATH HERE
pdf_text = ''
data_entity_sep_pattern = r'(?=Our Ref.*?Name.*?Ref 1.*?Ref 2)'
if __name__ == '__main__':
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            # guard against extract_text() returning None for a page without text
            pdf_text += '\n' + (page.extract_text() or '')
    split_on_data_entity = re.split(data_entity_sep_pattern, pdf_text.strip())
    # first data in the split_on_data_entity list will contain the header information
    get_header_data(split_on_data_entity[0], json_data)
    while data_index < len(split_on_data_entity):
        data_entity = {}
        data_processed = preprocess_data(split_on_data_entity[data_index])
        iterate_through_regex_and_populate_dictionaries(data_entity, my_regex_dict_line1, data_processed[0])
        iterate_through_regex_and_populate_dictionaries(data_entity, my_regex_dict_line2, data_processed[1])
        iterate_through_regex_and_populate_dictionaries(data_entity, my_regex_dict_line3, data_processed[2])
        # the fourth line holds the column names "Date Notes" when the entity has a notes table
        if len(data_processed) > 3 and 'Date' in data_processed[3] and 'Notes' in data_processed[3]:
            populate_date_notes(data_entity, data_processed)
        json_data['data_entity' + str(data_index)] = data_entity
        data_index += 1
    print(json_data)
Output :
Result string :
{'Date Created': '18/04/2022', 'Number of Pages': '4', 'Client Code': '110203', 'ClientName': 'AWS PTE. LTD.', 'data_entity1': {'Our Ref': '2118881115', 'Name': 'Sky Blue', 'Ref 1': '12-34-56789-2021/2', 'Ref 2': 'F2021004444', 'Amount': '$100.11', 'Total Paid': '$0.00', 'Balance': '$100.11', 'Date of A/C': '01/08/2021', 'Date Received': '10/12/2021', 'Last Paid': '', 'Amt Last Paid': '', 'A/C Status': 'CLOSED', 'Collector': 'Sunny Jane', 'Date': ['04/03/2022'], 'Notes': ['Letter Dated 04 Mar 2022.']}, 'data_entity2': {'Our Ref': '2112221119', 'Name': 'Green Field', 'Ref 1': '98-76-54321-2021/1', 'Ref 2': 'F2021001111', 'Amount': '$233.88', 'Total Paid': '$0.00', 'Balance': '$233.88', 'Date of A/C': '01/08/2021', 'Date Received': '10/12/2021', 'Last Paid': '', 'Amt Last Paid': '', 'A/C Status': 'CURRENT', 'Collector': 'Sam Jason', 'Date': ['11/03/2022', '11/03/2022', '08/03/2022', '08/03/2022', '21/02/2022', '18/02/2022', '18/02/2022'], 'Notes': ['Email for payment', 'Case Status', 'to send a Letter', '845***Ringing, No reply', 'Letter printed - LET: LETTER 2', 'Letter sent - LET: LETTER 2', '845***Line busy']}}
Once you have the data in json format, you can load it into a csv file, a data frame, or whatever format you need.
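As one possible way to get from the dictionary to csv, here is a minimal sketch using only the standard library. It writes one row per note and repeats the header and entity fields on each row. The helper name `flatten` is hypothetical, and the `json_data` below is a shortened stand-in for the real output, so treat this as an illustration rather than the required layout.

```python
import csv
import io

# Shortened stand-in for the json_data dictionary built by the code above
json_data = {
    'Date Created': '18/04/2022',
    'Client Code': '110203',
    'data_entity1': {'Our Ref': '2118881115', 'Name': 'Sky Blue',
                     'Date': ['04/03/2022'], 'Notes': ['Letter Dated 04 Mar 2022.']},
    'data_entity2': {'Our Ref': '2112221119', 'Name': 'Green Field',
                     'Date': ['11/03/2022', '08/03/2022'],
                     'Notes': ['Email for payment', 'to send a Letter']},
}

def flatten(json_data):
    """Yield one flat dict per (entity, note) pair, repeating the header fields."""
    header = {k: v for k, v in json_data.items() if not k.startswith('data_entity')}
    for key, entity in json_data.items():
        if not key.startswith('data_entity'):
            continue
        scalars = {k: v for k, v in entity.items() if not isinstance(v, list)}
        dates = entity.get('Date', [])
        notes = entity.get('Notes', [])
        # An entity without any notes still produces one row
        pairs = list(zip(dates, notes)) or [('', '')]
        for date, note in pairs:
            yield {**header, **scalars, 'Date': date, 'Notes': note}

rows = list(flatten(json_data))
buffer = io.StringIO()  # swap for open('out.csv', 'w', newline='') to write a file
writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())
```

The same list of flat dictionaries can also be handed to a data frame constructor if you prefer that over csv.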
Save as xlsx
To save the data in an xlsx file in the format shown in the image in the question, we can use XlsxWriter.
Please install the package using pip
pip install xlsxwriter
From the previous code, we have our entire data in the variable json_data, we will be iterating through all the data entities and write the data to appropriate cell specified by row, col in the code.
import xlsxwriter
workbook = xlsxwriter.Workbook('Sample.xlsx')
worksheet = workbook.add_worksheet("Sheet 1")
row = 0
col = 0
# write columns
columns = ['Account History Report', 'All Notes'] + [ key for key in json_data.keys() if 'data_entity' not in key ] + list(json_data['data_entity1'].keys())
worksheet.write_row(row, col, tuple(columns))
row += 1
# map each column name to its column index
column_index_map = {}
for index, column_name in enumerate(columns):
    column_index_map[column_name] = index
# write the header
worksheet.write(row, column_index_map['Date Created'], json_data['Date Created'])
worksheet.write(row, column_index_map['Number of Pages'], json_data['Number of Pages'])
worksheet.write(row, column_index_map['Client Code'], json_data['Client Code'])
worksheet.write(row, column_index_map['ClientName'], json_data['ClientName'])
data_entity_index = 1
# iterate through each data entity and for each key insert the values in the sheet
while True:
    data_entity_key = 'data_entity' + str(data_entity_index)
    row_size = 1
    if json_data.get(data_entity_key) is not None:
        for key, value in json_data.get(data_entity_key).items():
            if isinstance(value, list):
                worksheet.write_column(row, column_index_map[key], tuple(value))
                # an entity takes as many rows as its longest list, and at least one
                row_size = max(row_size, len(value))
            else:
                worksheet.write(row, column_index_map[key], value)
    else:
        break
    data_entity_index += 1
    row += row_size
workbook.close()
Result :
The above code creates a file Sample.xlsx in the working directory.
How to extract/split a multi-line string to make new lists
clientInfo="""James,Jose,664 New Avenue,New Orleans,Orleans,LA,8/27/200,123,jjose#gmail.com,;
Shenna,Laureles, 288 Livinghood Heights,Brighton,Livingston,MI,2/19/75,laureles9219#yahoo.com,;
"""
into this kind of list
f_name = ["james","sheena"]
l_name = ["jose","Laureles"]
strt = ["664 New Avenue","288 Livinghood Heights"]
cty = ["New Orleans","Brighton"]
state = ["New Orleans","Livingston"]
If the order is always the same, you could do something like this:
f_name = []
l_name = []
strt = []
cty = []
state = []
for client in clientInfo.split(";\n"):
    if not client.strip():
        # skip the empty chunk left after the final ";"
        continue
    client_ = client.split(",")
    f_name.append(client_[0])
    l_name.append(client_[1])
    strt.append(client_[2])
    cty.append(client_[3])
    state.append(client_[4])
The check above skips the empty chunk left by the ; at the end of your string. Stripping the whitespace around each field is left to you.
You can use split and zip.
def extract(string):
    lines = string.split(";")
    split_lines = tuple(map(lambda line: line.split(","), lines))
    no_space1 = tuple(map(lambda item: item.strip(), split_lines[0]))
    no_space2 = tuple(map(lambda item: item.strip(), split_lines[1]))
    return list(zip(no_space1, no_space2))
This will produce
[('James', 'Shenna'), ('Jose', 'Laureles'), ('664 New Avenue', '288 Livinghood Heights'), ('New Orleans', 'Brighton'), ('Orleans', 'Livingston'), ('LA', 'MI'), ('8/27/200', '2/19/75'), ('123', 'laureles9219#yahoo.com'), ('jjose#gmail.com', '')]
It has some tuples at the end that you didn't ask for, but it's relatively good. The no_space1 and no_space2 lines are a bit repetitive, but cramming them into one line is worse in my opinion.
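If there may be more than two records, the same split-and-zip idea can be generalized with `zip(*rows)`, which transposes any number of rows into columns. The following is only a sketch: the function name `extract_columns` is made up for this example, and it assumes the first five fields are always in the same order.

```python
def extract_columns(text):
    """Split `text` into records on ';', then transpose the fields into columns."""
    rows = []
    for record in text.split(";"):
        record = record.strip()
        if record:  # skip the empty chunk after the last ';'
            rows.append([field.strip() for field in record.split(",")])
    # zip(*rows) turns rows into columns; it stops at the shortest row
    return [list(column) for column in zip(*rows)]

client_info = """James,Jose,664 New Avenue,New Orleans,Orleans,LA,8/27/200,123,jjose#gmail.com,;
Shenna,Laureles, 288 Livinghood Heights,Brighton,Livingston,MI,2/19/75,laureles9219#yahoo.com,;
"""

# Keep only the first five columns, which are the ones asked for
f_name, l_name, strt, cty, state = extract_columns(client_info)[:5]
print(f_name)
print(state)
```

Because the two sample records have a different number of fields, the columns after the shared prefix do not line up, which is why only the first five are unpacked here.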
You can try:
clientData = """James,Jose,664 New Avenue,New Orleans,Orleans,LA,8/27/200,123,jjose#gmail.com,;
Shenna,Laureles, 288 Livinghood Heights,Brighton,Livingston,MI,2/19/75,laureles9219#yahoo.com,;
"""
data = clientData.split(";\n")
f_name = []
l_name = []
strt = []
cty = []
state = []
for data_line in data:
    data_line = data_line.strip()
    if len(data_line) >= 5:
        line_info = data_line.split(",")
        f_name.append(line_info[0].strip())
        l_name.append(line_info[1].strip())
        strt.append(line_info[2].strip())
        cty.append(line_info[3].strip())
        state.append(line_info[4].strip())
print(f_name)
print(l_name)
print(strt)
print(cty)
print(state)
Output:
['James', 'Shenna']
['Jose', 'Laureles']
['664 New Avenue', '288 Livinghood Heights']
['New Orleans', 'Brighton']
['Orleans', 'Livingston']
I am getting "AttributeError: 'NoneType' object has no attribute 'count'" with the following code. Please help me resolve it.
def search_albums(self, query, _dir = None):
    from pprint import pprint
    url = self.urls['search_albums_new']
    url = url.format(query = query)
    response = self._get_url_contents(url)
    albums = response.json()['album']
    if albums:
        albums_list = map(lambda x:[x['album_id'],x['title'], x['language'], x['seokey'], x['release_date'],','.join(map(lambda y:y['name'], x.get('artists',[])[:2])) ,x['trackcount']], albums)
        tabledata = [['S No.', 'Album Title', 'Album Language', 'Release Date', 'Artists', 'Track Count']]
        for idx, value in enumerate(albums_list):
            tabledata.append([str(idx), value[1], value[2], value[4], value[5], value[6]])
        table = AsciiTable(tabledata)
        print table.table
        idx = int(raw_input('Which album do you wish to download? Enter S No. :'))
        album_details_url = self.urls['album_details']
        album_details_url = album_details_url.format(album_id = albums_list[idx][0])
        response = requests.get(album_details_url , headers = {'deviceType':'AndroidApp', 'appVersion':'V5'})
        tracks = response.json()['tracks']
        tracks_list = map(lambda x:[x['track_title'].strip(),x['track_id'],x['album_id'],x['album_title'], ','.join(map(lambda y:y['name'], x['artist'])), x['duration']], tracks)
        print 'List of tracks for ', albums_list[idx][1]
        tabledata = [['S No.', 'Track Title', 'Track Artist']]
        for idy, value in enumerate(tracks_list):
            tabledata.append([str(idy), value[0], value[4]])
        tabledata.append([str(idy+1), 'Enter this to download them all.',''])
        table = AsciiTable(tabledata)
        print table.table
        print 'Downloading tracks to %s folder'%albums_list[idx][3]
        ids = raw_input('Please enter csv of S no. to download:')
        while not self._check_input(ids, len(tracks_list)) or not ids:
            print 'Oops!! You made some error in entering input'
            ids = raw_input('Please enter csv of S no. to download:')
        if not _dir:
            _dir = albums_list[idx][3]
        self._check_path(_dir)
        ids = map(int,map(lambda x:x.strip(),ids.split(',')))
        if len(ids) == 1 and ids[0] == idy + 1:
            for item in tracks_list:
                song_url = self._get_song_url(item[1], item[2])
                self._download_track(song_url, item[0].replace(' ','-').strip(), _dir)
        else:
            for i in ids:
                item = tracks_list[i]
                song_url = self._get_song_url(item[1], item[2])
                self._download_track(song_url, item[0].replace(' ','-').strip(), _dir)
    else:
        print 'Ooopsss!!! Sorry no such album found.'
        print 'Why not try another Album? :)'
Error :
Traceback (most recent call last):
  File "a-dl.py", line 163, in <module>
    d.search_albums(args.album)
  File "a-dl.py", line 116, in search_albums
    print table.table
  File "C:\Python27\lib\site-packages\terminaltables.py", line 337, in table
    padded_table_data = self.padded_table_data
  File "C:\Python27\lib\site-packages\terminaltables.py", line 326, in padded_table_data
    height = max([c.count('\n') for c in row] or [0]) + 1
AttributeError: 'NoneType' object has no attribute 'count'
Well, after some trial and error I found the answer.
I replaced the following
albums_list = map(lambda x:[x['album_id'],x['title'], x['language'], x['seokey'], x['release_date'],','.join(map(lambda y:y['name'], x.get('artists',[])[:2])) ,x['trackcount']], albums)
tabledata = [['S No.', 'Album Title', 'Album Language', 'Release Date', 'Artists', 'Track Count']]
for idx, value in enumerate(albums_list):
    tabledata.append([str(idx), value[1], value[2], value[4], value[5], value[6]])
with
albums_list = map(lambda x:[x['album_id'],x['title'], x['language'], x['seokey'], x['release_date']], albums)
tabledata = [['S No.', 'Album Title', 'Album Language', 'Release Date']]
for idx, value in enumerate(albums_list):
    tabledata.append([str(idx), value[1], value[2], value[4]])
and it works fine now :)
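The traceback shows the table code calling `c.count('\n')` on every cell, so a single `None` cell is enough to raise this error. Since the error went away once the `Artists` and `Track Count` columns were dropped, one of those cells was presumably `None` in the response. If you would rather keep the columns, one option is to convert every cell to a string before building the table. This is only a sketch: `clean_row` is a hypothetical helper and the sample rows are invented.

```python
def clean_row(row):
    """Return a copy of `row` with None replaced by '' and other cells as str."""
    return ['' if cell is None else str(cell) for cell in row]

# Invented sample data: the last row has a missing artist and track count
tabledata = [
    ['S No.', 'Album Title', 'Artists', 'Track Count'],
    [0, 'Some Album', 'Artist A,Artist B', 12],
    [1, 'Other Album', None, None],
]

tabledata = [clean_row(row) for row in tabledata]

# The expression from the traceback now works for every row
heights = [max([c.count('\n') for c in row] or [0]) + 1 for row in tabledata]
print(tabledata[2])
print(heights)
```

Applying the helper to each row just before it is appended to `tabledata` would keep all six columns in the album table.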
I have two files I wish to compare and then produce a specific output:
1) Below are the contents of the username text file (this stores the latest films viewed by the user)
Sci-Fi,Out of the Silent Planet
Sci-Fi,Solaris
Romance, When Harry met Sally
2) Below are the contents of the films.txt file which stores all the films in the program that are available to the user
0,Genre, Title, Rating, Likes
1,Sci-Fi,Out of the Silent Planet, PG,3
2,Sci-Fi,Solaris, PG,0
3,Sci-Fi,Star Trek, PG,0
4,Sci-Fi,Cosmos, PG,0
5,Drama, The English Patient, 15,0
6,Drama, Benhur, PG,0
7,Drama, The Pursuit of Happiness, 12, 0
8,Drama, The Thin Red Line, 18,0
9,Romance, When Harry met Sally, 12, 0
10,Romance, You've got mail, 12, 0
11,Romance, Last Tango in Paris, 18, 0
12,Romance, Casablanca, 12, 0
An example of the output I require: The user has currently viewed two sci-fi and one Romance film. The output therefore should SEARCH the Films text file by Genre (identifying SCI-FI and ROMANCE), and should list the films in the films.txt file which have NOT been viewed by the user yet. In this case
3,Sci-Fi,Star Trek, PG,0
4,Sci-Fi,Cosmos, PG,0
10,Romance, You've got mail, 12, 0
11,Romance, Last Tango in Paris, 18, 0
12,Romance, Casablanca, 12, 0
I have the following code which attempts to do the above, but the output it produces is incorrect:
def viewrecs(username):
    #set the username variable to the text file -to use it in the next bit
    username = (username + ".txt")
    #open the username file that stores latest viewings
    with open(username,"r") as f:
        #open the csv file reader for the username file
        fReader=csv.reader(f)
        #for each row in the fReader
        for row in fReader:
            #set the genre variable to the row[0], in which row[0] is all the genres (column 1 in username file)
            genre=row[0]
            #next, open the films file
            with open("films.txt","r") as films:
                #open the csv reader for this file (filmsReader as opposed to fReader)
                filmsReader=csv.reader(films)
                #for each row in the films file
                for row in filmsReader:
                    #and for each field in the row
                    for field in row:
                        #print(field)
                        #print(genre)
                        #print(field[0])
                        if genre in field and row[2] not in fReader:
                            print(row)
Output (undesired):
['1', 'Sci-Fi', 'Out of the Silent Planet', ' PG', '3']
['2', 'Sci-Fi', 'Solaris', ' PG', '0']
['3', 'Sci-Fi', 'Star Trek', ' PG', '0']
['4', 'Sci-Fi', 'Cosmos', ' PG', '0']
I don't want a re-write or new solution, but, preferably, a fix to the above solution with its logical progression ...
@gipsy - your solution appears to have nearly worked. I used:
def viewrecs(username):
    #set the username variable to the text file -to use it in the next bit
    username = (username + ".txt")
    #open the username file that stores latest viewings
    lookup_set = set()
    with open(username,"r") as f:
        #open the csv file reader for the username file
        fReader=csv.reader(f)
        #for each row in the fReader
        for row in fReader:
            genre = row[1]
            name = row[2]
            lookup_set.add('%s-%s' % (genre, name))
    with open("films.txt","r") as films:
        filmsReader=csv.reader(films)
        #for each row in the films file
        for row in filmsReader:
            genre = row[1]
            name = row[2]
            lookup_key = '%s-%s' % (genre, name)
            if lookup_key not in lookup_set:
                print(row)
The output is below. It prints ALL the lines in films.txt that are not in the first set, rather than only the ones whose GENRE appears in the first set:
['0', 'Genre', ' Title', ' Rating', ' Likes']
['3', 'Sci-Fi', 'Star Trek', ' PG', ' 0']
['4', 'Sci-Fi', 'Cosmos', ' PG', ' 0']
['5', 'Drama', ' The English Patient', ' 15', ' 0']
['6', 'Drama', ' Benhur', ' PG', ' 0']
['7', 'Drama', ' The Pursuit of Happiness', ' 12', ' 0']
['8', 'Drama', ' The Thin Red Line', ' 18', ' 0']
['10', 'Romance', " You've got mail", ' 12', ' 0']
['11', 'Romance', ' Last Tango in Paris', ' 18', ' 0']
['12', 'Romance', ' Casablanca', ' 12', ' 0']
NOTE: For simplicity, I changed the format of the first file to match the entries in the films file:
1,Sci-Fi,Out of the Silent Planet, PG
2,Sci-Fi,Solaris, PG
How about using sets to filter the movies in the viewed genres that have not been seen yet? A dictionary keyed by genre would keep only one film per genre, so two sets are used here:
def parse_file(path):
    # split each non-empty line on commas and strip whitespace around the fields
    with open(path) as f:
        return [[w.strip() for w in line.split(',')] for line in f if line.strip()]

def movies_to_see():
    seen = parse_file('seen.txt')
    seen_genres = {film[0] for film in seen}
    seen_titles = {film[1] for film in seen}
    to_see = []
    for film in parse_file('films.txt'):
        # keep films from a viewed genre whose title has not been viewed
        if film[1] in seen_genres and film[2] not in seen_titles:
            to_see.append(film)
    return to_see
The solution using str.split() and str.join() functions:
# change file paths with your actual ones
with open('./text_files/user.txt', 'r') as userfile:
    # splitlines() avoids an empty last entry when the file ends with a newline
    viewed = userfile.read().splitlines()
    viewed_genres = set(g.split(',')[0] for g in viewed)
with open('./text_files/films.txt', 'r') as filmsfile:
    films = filmsfile.read().splitlines()
    not_viewed = [f for f in films
                  if f.split(',')[1] in viewed_genres and ','.join(f.split(',')[1:3]) not in viewed]
print('\n'.join(not_viewed))
The output:
3,Sci-Fi,Star Trek, PG,0
4,Sci-Fi,Cosmos, PG,0
10,Romance, You've got mail, 12, 0
11,Romance, Last Tango in Paris, 18, 0
12,Romance, Casablanca, 12, 0
Okay, build a set by going through the first file, with genre + name as the entry.
Now iterate over the second file and look up genre + name in the set you made above; if there is no such entry, print that row.
Once I am home I can type some code.
As promised my code for this is below:
import csv

def viewrecs(username):
    #set the username variable to the text file -to use it in the next bit
    username = (username + ".txt")
    # In this set we will collect the unique combinations of genre and name
    genre_name_lookup_set = set()
    # In this set we will collect the unique genres
    genre_lookup_set = set()
    with open(username,"r") as f:
        #open the csv file reader for the username file
        fReader=csv.reader(f)
        #for each row in the fReader
        for row in fReader:
            genre = row[0]
            name = row[1]
            # Add the genre name combination to this set, duplicates will be taken care automatically as set won't allow dupes
            genre_name_lookup_set.add('%s-%s' % (genre, name))
            # Add genre to this set
            genre_lookup_set.add(genre)
    with open("films.txt","r") as films:
        filmsReader=csv.reader(films)
        #for each row in the films file
        for row in filmsReader:
            genre = row[1]
            name = row[2]
            # Build a lookup key using genre and name, example:Sci-Fi-Solaris
            lookup_key = '%s-%s' % (genre, name)
            if lookup_key not in genre_name_lookup_set and genre in genre_lookup_set:
                print(row)
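The same two-set idea can be tried without touching any files by feeding `csv.reader` in-memory text. The sketch below is a variation, not the code above: the function name `unseen_in_viewed_genres` is made up, it uses `(genre, title)` tuples as keys instead of a joined string, it strips whitespace around fields, and the sample data is a shortened copy of the two files from the question.

```python
import csv
import io

def unseen_in_viewed_genres(viewed_file, films_file):
    """Return film rows whose genre was viewed but whose title was not."""
    viewed_pairs = set()   # (genre, title) pairs already seen
    viewed_genres = set()  # genres the user has watched
    for row in csv.reader(viewed_file):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        genre, title = row[0].strip(), row[1].strip()
        viewed_pairs.add((genre, title))
        viewed_genres.add(genre)

    result = []
    for row in csv.reader(films_file):
        if len(row) < 3:
            continue
        genre, title = row[1].strip(), row[2].strip()
        if genre in viewed_genres and (genre, title) not in viewed_pairs:
            result.append(row)
    return result

# Shortened copies of the two files from the question
viewed = io.StringIO(
    "Sci-Fi,Out of the Silent Planet\n"
    "Sci-Fi,Solaris\n"
    "Romance, When Harry met Sally\n"
)
films = io.StringIO(
    "0,Genre, Title, Rating, Likes\n"
    "1,Sci-Fi,Out of the Silent Planet, PG,3\n"
    "2,Sci-Fi,Solaris, PG,0\n"
    "3,Sci-Fi,Star Trek, PG,0\n"
    "5,Drama, The English Patient, 15,0\n"
    "9,Romance, When Harry met Sally, 12, 0\n"
    "10,Romance, You've got mail, 12, 0\n"
)

for row in unseen_in_viewed_genres(viewed, films):
    print(row)
```

With real files, the two `io.StringIO` objects would be replaced by the handles returned from `open(...)`.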
I've scraped a website containing a table and I want to format the headers for my desired final output.
headers = []
for row in table.findAll('tr'):
    for item in row.findAll('th'):
        for link in item.findAll('a', text=True):
            headers.append(link.contents[0])
print headers
Which returns:
[u'Rank ', u'University Name ', u'Entry Standards', u'Click here to read more', u'Student Satisfaction', u'Click here to read more', u'Research Quality', u'Click here to read more', u'Graduate Prospects', u'Click here to read more', u'Overall Score', u'Click here to read more', u'\r\n 2016\r\n ']
I don't want the 'Click here to read more' or '2016' headers, so I've done the following:
for idx, i in enumerate(headers):
    if 'Click' in i:
        del headers[idx]
for idx, i in enumerate(headers):
    if '2016' in i:
        del headers[idx]
Which returns:
[u'Rank ', u'University Name ', u'Entry Standards', u'Student Satisfaction', u'Research Quality', u'Graduate Prospects', u'Overall Score']
Perfect. But is there a better/neater way of removing the unwanted items? Thanks!
headers = filter(lambda h: 'Click' not in h and '2016' not in h, headers)
If you want to be more generic:
banned = ['Click', '2016']
headers = filter(lambda h: not any(b in h for b in banned), headers)
You can consider using list comprehension to get a new, filtered list, something like:
new_headers = [header for header in headers if '2016' not in header]
If you can be sure that '2016' will always be last:
>>> [x for x in headers[:-1] if 'Click here' not in x]
['Rank ', 'University Name ', 'Entry Standards', 'Student Satisfaction', 'Research Quality', 'Graduate Prospects', 'Overall Score']
You can also filter with a regular expression:
import re
pattern = '^Click|^2016'
new = [x for x in headers if not re.match(pattern, str(x).strip())]
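One reason to prefer building a new list is that deleting from a list while enumerating it skips the item that slides into the freed index. It happens to work for the headers in the question because the unwanted entries never sit next to each other. The sketch below uses an invented list with two adjacent 'Click' entries to show the difference, and also strips the stray whitespace from the headers that are kept.

```python
# Invented sample with two adjacent unwanted entries
headers = ['Rank ', 'Click here to read more', 'Click here to read more',
           'Overall Score', '\r\n 2016\r\n ']

# Deleting while enumerating: after a delete, the next item slides into the
# current index and is never examined
in_place = list(headers)
for idx, item in enumerate(in_place):
    if 'Click' in item:
        del in_place[idx]
print(in_place)

# Building a new list examines every item exactly once
banned = ('Click', '2016')
cleaned = [h.strip() for h in headers if not any(b in h for b in banned)]
print(cleaned)
```

The first print still contains one 'Click here to read more' entry, while the second list holds only the wanted headers.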