New Pandas Series longer than original dataset? - python

So I have a data set with user, date, and post columns. I'm trying to generate a column of the calories that foods in the post column contain for each user. This dataset has a length of 21, and the code below finds the food words, gets their calorie values, appends them to that user's calorie list, and appends that list to the new column. The newly generated column, however, somehow has a length of 25:
Current data: 21
New column: 25
Does anybody know why this occurs? Here is the code below and samples of what the original dataset and the new column look like:
while len(col) < len(data['post']):
    for post, api_id, api_key in zip(data['post'], ids_keys.keys(), ids_keys.values()): # cycles through text data & api keys
        headers = {
            'Content-Type': 'application/x-www-form-urlencoded',
            'x-app-id': api_id,
            'x-app-key': api_key,
            'x-remote-user-id': '0'
        }
        calories = []
        print('Current data:', len(data['post']), '\n New column: ', len(col)) # prints length of post vs new cal column
        for word in eval(post):
            if word not in food:
                continue
            else:
                print('Detected Word: ', word)
                query = {'query': '{}'.format(word)}
                try:
                    response = requests.request("POST", url, headers=headers, data=query)
                except KeyError as ke:
                    print(ke, 'Out of calls, next key...')
                    ids_keys.pop(api_id) # drop current api id & key from dict if out of calls
                    print('API keys left:', len(ids_keys))
                finally:
                    stats = response.json()
                    print('Food Stats: \n', stats)
                    print('Calories in food: ', stats['foods'][0]['nf_calories'])
                    calories.append(stats['foods'][0]['nf_calories'])
                    print('Current Key', api_id, ':', api_key)
        col.append(calories)
        if len(col) == len(data['post']):
            break
I attempted to use the while loop to only append up to the length of the dataset, but to no avail.
Original Data Set:
pd.DataFrame({'user': ['avskk', 'janejellyn', 'firlena227', '...'],
              'date': ['October 22', 'October 22', 'October 22', '...'],
              'post': [['autumn', 'fully', 'arrived', 'cooking', 'breakfast', 'toaster', '...'],
                       ['breakfast', 'chinese', 'sticky', 'rice', 'tempeh', 'sausage', 'cucumber', 'salad', 'lunch', 'going', 'lunch', 'coworkers', 'probably', 'black', 'bean', 'burger'],
                       ['potato', 'bean', 'inspiring', 'food', 'day', 'today', '...']]
              })
New Column:
pd.DataFrame({'Calories': [[22, 33, 45, 32, 2, 5, 7, 9, 76],
                           [43, 78, 54, 97, 32, 56, 97],
                           [23, 55, 32, 22, 7, 99, 66, 98, 54, 35, 33]]
              })
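For reference, here is a minimal sketch of one way to keep the new column the same length as the data: build exactly one calorie list per post and drop the outer while loop. The variable names follow the question; get_calories is a hypothetical helper wrapping the API call, the key-rotation logic is omitted, and headers is assumed to be built from a single working API key:
def get_calories(word, headers):
    # hypothetical helper around the request used in the question
    query = {'query': word}
    response = requests.request("POST", url, headers=headers, data=query)
    return response.json()['foods'][0]['nf_calories']

col = []
for post in data['post']:
    # one calorie list per post, so len(col) == len(data['post']) by construction
    calories = [get_calories(word, headers) for word in eval(post) if word in food]
    col.append(calories)

data['Calories'] = col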

Related

How to extract text and save as excel file using python or JavaScript

How do I extract text from PDF files where some data is in the form of a table while other data is key:value based?
e.g.:
https://drive.internxt.com/s/file/78f2d73478b832b2ab55/3edb275967deeca6ad33e7d53f2337c50d5dfb50e0aa525bb7f10d49dff1e2b4
This is what I have tried:
import PyPDF2
import openpyxl
from openpyxl import Workbook
pdfFileObj = open('sample.pdf', 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
pdfReader.numPages
pageObj = pdfReader.getPage(0)
mytext = pageObj.extractText()
wb = Workbook()
sheet = wb.active
sheet.title = 'MyPDF'
sheet['A1'] = mytext
wb.save('sample.xlsx')
print('Save')
However I'd like the data to be stored in the following format.
This PDF does not have well-defined tables, so we cannot use a tool that extracts all the data in one table format. What we can do is read the entire PDF as text and process each data field line by line, using regex to extract the data.
Before you move ahead, please install the pdfplumber package for python
pip install pdfplumber
Assumptions
Here are some assumptions I made about your PDF; the code is written accordingly.
The first line will always contain the title Account History Report.
The second line will contain the names IMAGE All Notes.
The third line will contain only the data Date Created, in the form of key:value.
The fourth line will contain only the data Number of Pages, in the form of key:value.
The fifth line will contain only the data Client Code, Client Name.
Starting at line 6, a PDF can have multiple data entities; in this PDF there are 2, but there can be any number of entities.
Each data entity will contain the following fields:
The first line in a data entity will contain only the data Our Ref, Name, Ref 1, Ref 2.
The second line will contain only the data, in the form present in the PDF: Amount, Total Paid, Balance, Date of A/C, Date Received.
The third line in a data entity will contain the data Last Paid, Amt Last Paid, Status, Collector.
The fourth line will contain the column names Date and Notes.
The subsequent lines will contain data in the form of a table until the next data entity starts.
I also assume that each data entity will contain the first data with key Our Ref :.
I assume that data entities are separated at the first line of each entity, which follows the key-value pattern Our Ref :Value Name: Value Ref 1 :Value Ref 2:value
pattern = r'Our Ref.*?Name.*?Ref 1.*?Ref 2.*?'
Please note: the rectangles I drew (thick black) in the image above are what I call data entities.
The final data will be stored in a dictionary (JSON) where each data entity has a key such as data_entity1, data_entity2, data_entity3, depending on the number of entities in your PDF.
The header details are stored in the JSON as key:value pairs, and I assume that each key appears in the header only once.
CODE
Here is simple, elegant code that gives you the information from the PDF in the form of JSON. In the output, the first few fields contain information from the header part; the subsequent data entities can be found as data_entity1 and data_entity2.
In the code below, all you need to change is pdf_path.
import pdfplumber
import re

# regex pattern for keys in line1 of data entity
my_regex_dict_line1 = {
    'Our Ref' : r'Our Ref :(.*?)Name',
    'Name' : r'Name:(.*?)Ref 1',
    'Ref 1' : r'Ref 1 :(.*?)Ref 2',
    'Ref 2' : r'Ref 2:(.*?)$'
}

# regex pattern for keys in line2 of data entity
my_regex_dict_line2 = {
    'Amount' : r'Amount:(.*?)Total Paid',
    'Total Paid' : r'Total Paid:(.*?)Balance',
    'Balance' : r'Balance:(.*?)Date of A/C',
    'Date of A/C' : r'Date of A/C:(.*?)Date Received',
    'Date Received' : r'Date Received:(.*?)$'
}

# regex pattern for keys in line3 of data entity
my_regex_dict_line3 = {
    'Last Paid' : r'Last Paid:(.*?)Amt Last Paid',
    'Amt Last Paid' : r'Amt Last Paid:(.*?)A/C\s+Status',
    'A/C Status': r'A/C\s+Status:(.*?)Collector',
    'Collector' : r'Collector :(.*?)$'
}

def preprocess_data(data):
    return [el.strip() for el in data.splitlines() if el.strip()]

def get_header_data(text, json_data = {}):
    header_data_list = preprocess_data(text)
    # third line in text of header contains Date Created field
    json_data['Date Created'] = re.search(r'Date Created:(.*?)$', header_data_list[2]).group(1).strip()
    # fourth line in text contains Number of Pages
    json_data['Number of Pages'] = re.search(r'Number of Pages:(.*?)$', header_data_list[3]).group(1).strip()
    # fifth line in text contains Client Code and ClientName
    json_data['Client Code'] = re.search(r'Client Code - (.*?)Client Name', header_data_list[4]).group(1).strip()
    json_data['ClientName'] = re.search(r'Client Name - (.*?)$', header_data_list[4]).group(1).strip()

def iterate_through_regex_and_populate_dictionaries(data_dict, regex_dict, text):
    ''' For the given regex_dict, iterates through each regex pattern and adds the matched key value to the data_dict dictionary '''
    for key, regex in regex_dict.items():
        matched_value = re.search(regex, text)
        if matched_value is not None:
            data_dict[key] = matched_value.group(1).strip()

def populate_date_notes(data_dict, text):
    ''' Populates Date and Notes from the data chunk, as lists, into the data_dict dictionary '''
    data_dict['Date'] = []
    data_dict['Notes'] = []
    iter = 4
    while iter < len(text):
        date_match = re.search(r'(\d{2}/\d{2}/\d{4})', text[iter])
        data_dict['Date'].append(date_match.group(1).strip())
        notes_match = re.search(r'\d{2}/\d{2}/\d{4}\s*(.*?)$', text[iter])
        data_dict['Notes'].append(notes_match.group(1).strip())
        iter += 1

data_index = 1
json_data = {}
pdf_path = r'C:\Users\hpoddar\Desktop\Temp\sample3.pdf' # ENTER YOUR PDF PATH HERE
pdf_text = ''
data_entity_sep_pattern = r'(?=Our Ref.*?Name.*?Ref 1.*?Ref 2)'

if __name__ == '__main__':
    with pdfplumber.open(pdf_path) as pdf:
        index = 0
        while index < len(pdf.pages):
            page = pdf.pages[index]
            pdf_text += '\n' + page.extract_text()
            index += 1

    split_on_data_entity = re.split(data_entity_sep_pattern, pdf_text.strip())
    # first item in the split_on_data_entity list will contain the header information
    get_header_data(split_on_data_entity[0], json_data)
    while data_index < len(split_on_data_entity):
        data_entity = {}
        data_processed = preprocess_data(split_on_data_entity[data_index])
        iterate_through_regex_and_populate_dictionaries(data_entity, my_regex_dict_line1, data_processed[0])
        iterate_through_regex_and_populate_dictionaries(data_entity, my_regex_dict_line2, data_processed[1])
        iterate_through_regex_and_populate_dictionaries(data_entity, my_regex_dict_line3, data_processed[2])
        if len(data_processed) > 3 and data_processed[3] is not None and 'Date' in data_processed[3] and 'Notes' in data_processed[3]:
            populate_date_notes(data_entity, data_processed)
        json_data['data_entity' + str(data_index)] = data_entity
        data_index += 1

    print(json_data)
Output :
Result string :
{'Date Created': '18/04/2022', 'Number of Pages': '4', 'Client Code': '110203', 'ClientName': 'AWS PTE. LTD.', 'data_entity1': {'Our Ref': '2118881115', 'Name': 'Sky Blue', 'Ref 1': '12-34-56789-2021/2', 'Ref 2': 'F2021004444', 'Amount': '$100.11', 'Total Paid': '$0.00', 'Balance': '$100.11', 'Date of A/C': '01/08/2021', 'Date Received': '10/12/2021', 'Last Paid': '', 'Amt Last Paid': '', 'A/C Status': 'CLOSED', 'Collector': 'Sunny Jane', 'Date': ['04/03/2022'], 'Notes': ['Letter Dated 04 Mar 2022.']}, 'data_entity2': {'Our Ref': '2112221119', 'Name': 'Green Field', 'Ref 1': '98-76-54321-2021/1', 'Ref 2': 'F2021001111', 'Amount': '$233.88', 'Total Paid': '$0.00', 'Balance': '$233.88', 'Date of A/C': '01/08/2021', 'Date Received': '10/12/2021', 'Last Paid': '', 'Amt Last Paid': '', 'A/C Status': 'CURRENT', 'Collector': 'Sam Jason', 'Date': ['11/03/2022', '11/03/2022', '08/03/2022', '08/03/2022', '21/02/2022', '18/02/2022', '18/02/2022'], 'Notes': ['Email for payment', 'Case Status', 'to send a Letter', '845***Ringing, No reply', 'Letter printed - LET: LETTER 2', 'Letter sent - LET: LETTER 2', '845***Line busy']}}
Now, once you have the data in JSON format, you can load it into a CSV file, a data frame, or whatever format you need.
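For example, here is a minimal sketch that flattens json_data into a pandas DataFrame and writes it to CSV. It assumes the json_data structure built above; the output file name is arbitrary, and the Date/Notes lists are left as-is in their cells:
import pandas as pd

# header fields that get repeated on every row
header_keys = [k for k in json_data if not k.startswith('data_entity')]

rows = []
for key, entity in json_data.items():
    if key.startswith('data_entity'):
        row = {k: json_data[k] for k in header_keys}  # Date Created, Number of Pages, ...
        row.update(entity)                            # entity fields; Date/Notes stay as lists
        rows.append(row)

df = pd.DataFrame(rows)
df.to_csv('account_history.csv', index=False)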
Save as xlsx
To save the same data in an xlsx file, in the format shown in the image in the question above, we can use xlsxwriter.
Please install the package using pip
pip install xlsxwriter
From the previous code we have all the data in the variable json_data; we iterate through the data entities and write the data to the appropriate cells, specified by row and col in the code.
import xlsxwriter

workbook = xlsxwriter.Workbook('Sample.xlsx')
worksheet = workbook.add_worksheet("Sheet 1")
row = 0
col = 0

# write columns
columns = ['Account History Report', 'All Notes'] + [key for key in json_data.keys() if 'data_entity' not in key] + list(json_data['data_entity1'].keys())
worksheet.write_row(row, col, tuple(columns))
row += 1

column_index_map = {}
for index, col in enumerate(columns):
    column_index_map[col] = index

# write the header
worksheet.write(row, column_index_map['Date Created'], json_data['Date Created'])
worksheet.write(row, column_index_map['Number of Pages'], json_data['Number of Pages'])
worksheet.write(row, column_index_map['Client Code'], json_data['Client Code'])
worksheet.write(row, column_index_map['ClientName'], json_data['ClientName'])

data_entity_index = 1
# iterate through each data entity and for each key insert the values in the sheet
while True:
    data_entity_key = 'data_entity' + str(data_entity_index)
    row_size = 1
    if json_data.get(data_entity_key) is not None:
        for key, value in json_data.get(data_entity_key).items():
            if type(value) == list:
                worksheet.write_column(row, column_index_map[key], tuple(value))
                row_size = len(value)
            else:
                worksheet.write(row, column_index_map[key], value)
    else:
        break
    data_entity_index += 1
    row += row_size

workbook.close()
Result :
The above code creates a file Sample.xlsx in the working directory.

Is it possible for a python script to check whether row exists in google sheets before writing that row?

I have a python script that searches for vehicles on a vehicle listing site and writes the results to a spreadsheet. What I want is to automate this script to run every night to get new listings, but what I don't want is to create numerous duplicates if the listing exists each day that the script is run.
So is it possible to get the script to check whether that row (potential duplicate) already exists before writing a new row?
To clarify: the code I have works perfectly to print the results exactly how I want them into the Google Sheets document. What I am trying to do is run a check, before it prints new lines into the sheet, to see if that result already exists. Is that clearer? With thanks in advance.
Here is a screenshot of an example where I might have a row already existing with the specific title, but one of the column cells may have a different value in it and I only want to update the original row with the latest/highest price value.
UPDATE:
I am trying something like this, but it just seems to print everything rather than only the rows that don't already exist, which is what I am trying to do.
listing = [title, img['src'], video, vin, loc, exterior_colour, interior_colour, 'N/A', mileage, gearbox, 'N/A', 'Live', auction_date, '', '£' + bid.attrs['amount'][:-3], 'The Market', '', '', '', '', year, make, model, variant]

list_of_dicts = sheet2.get_all_records()

# Convert listing into dictionary output to be read by following statement to see if listing exists in sheet before printing
i = iter(listing)
d_listing = dict(zip(i, i))

if not d_listing in list_of_dicts:
    print(listing)
    #print(title, img['src'], video, vin, loc, exterior_colour, interior_colour, 'N/A', mileage, gearbox, 'N/A', 'Live', auction_date, '', '£' + bid.attrs['amount'][:-3], 'The Market', '', '', '', '', year, make, model, variant)
    index = 2
    row = [title, img['src'], video, vin, loc, exterior_colour, interior_colour, 'N/A', mileage, gearbox, 'N/A', 'Live', auction_date, '', '£' + bid.attrs['amount'][:-3], 'The Market', '', '', '', '', year, make, model, variant]
    sheet2.insert_row(row, index)
My code is:
import requests
import re
from bs4 import BeautifulSoup
import pandas
import gspread
from oauth2client.service_account import ServiceAccountCredentials

# use creds to create a client to interact with the Google Drive API
scope = ['https://spreadsheets.google.com/feeds', 'https://www.googleapis.com/auth/drive']
creds = ServiceAccountCredentials.from_json_keyfile_name('creds.json', scope)
client = gspread.authorize(creds)

sheet = client.open("CAR AGGREGATOR")
sheet2 = sheet.worksheet("Auctions - Live")

url = "https://themarket.co.uk/live.xml"
get_url = requests.get(url)
get_text = get_url.text
soup = BeautifulSoup(requests.get(url).text, 'lxml')

for loc in soup.select('url > loc'):
    loc = loc.text
    r = requests.get(loc)
    c = r.content
    hoop = BeautifulSoup(c, 'html.parser')
    soup = BeautifulSoup(c, 'lxml')
    current_bid = soup.find('div', 'bid-step__header')
    bid = soup.find('bid-display')
    title = soup.find('h2').text.split()
    year = title[0]
    if not year:
        year = ''
    if any(make in 'ASTON ALFA HEALEY ROVER Arnolt Bristol Amilcar Amphicar LOREAN De Cadenet Cosworth'.split() for make in title):
        make = title[1] + ' ' + title[2]
        model = title[3]
        try:
            variant = title[4]
        except:
            variant = ''
    else:
        make = title[1]
        model = title[2]
        try:
            variant = title[3]
            if 'REIMAGINED' in variant:
                variant = 'REIMAGINED BY SINGER'
            if 'SINGER' in variant:
                variant = 'REIMAGINED BY SINGER'
        except:
            variant = ''
    title = year + ' ' + make + ' ' + model
    img = soup.find('img')
    vehicle_details = soup.find('ul', 'vehicle__overview')
    try:
        mileage = vehicle_details.find_all('li')[1].text.split()[2]
    except:
        mileage = ''
    try:
        vin = vehicle_details.find_all('li')[2].text.split()[2]
    except:
        vin = ''
    try:
        gearbox = vehicle_details.find_all('li')[4].text.split()[2]
    except:
        gearbox = 'N/A'
    try:
        exterior_colour = vehicle_details.find_all('li')[5].text.split()[1:]
        exterior_colour = "-".join(exterior_colour)
    except:
        exterior_colour = 'N/A'
    try:
        interior_colour = vehicle_details.find_all('li')[6].text.split()[1:]
        interior_colour = "-".join(interior_colour)
    except:
        interior_colour = 'N/A'
    try:
        video = soup.find('iframe')['src']
    except:
        video = ''
    tag = soup.countdown
    try:
        auction_date = tag.attrs['formatted_date'].split()
        auction_day = auction_date[0][:2]
        auction_month = auction_date[1]
        auction_year = auction_date[2]
        auction_time = auction_date[3]
        auction_date = auction_day + ' ' + auction_month + ' ' + auction_year + ' ' + auction_time
    except:
        continue
    print(title, img['src'], video, vin, loc, exterior_colour, interior_colour, 'N/A', mileage, gearbox, 'N/A', 'Live', auction_date, '', '£' + bid.attrs['amount'][:-3], 'The Market', '', '', '', '', year, make, model, variant)
    index = 2
    row = [title, img['src'], video, vin, loc, exterior_colour, interior_colour, 'N/A', mileage, gearbox, 'N/A', 'Live', auction_date, '', '£' + bid.attrs['amount'][:-3], 'The Market', '', '', '', '', year, make, model, variant]
    sheet2.insert_row(row, index)
I would load all the data into two dictionaries, one representing the freshly scraped information, the other the full information from the Google Sheet. (To load the information from the Google Sheet, use its API, as described in Google's documentation.)
Both dictionaries, let's call them scraped and sheets, could have the titles as keys and all the other columns as the value (represented as a dictionary), so they would look like this:
{
    "1928 Aston Martin V8": {
        "Link": "...",
        "Price": "12 $",
    },
    ...
}
Then update the Sheets-dictionary with dict.update():
sheets.update(scraped)
and rewrite the Google Sheet with the data in sheets.
Without knowing your exact update logic, I cannot give more specific advice than this.
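A rough sketch of that idea, assuming the worksheet has a header row and a 'Title' column (the column names and values below are illustrative, not taken from your sheet):
# existing rows keyed by title; assumes a 'Title' column in the sheet's header row
sheets = {rec['Title']: rec for rec in sheet2.get_all_records()}

# freshly scraped rows in the same column layout (values here are placeholders)
scraped = {
    '1928 Aston Martin V8': {'Title': '1928 Aston Martin V8',
                             'Link': 'https://...',
                             'Price': '£12,000'},
}

sheets.update(scraped)  # scraped rows overwrite duplicate titles, new titles are added

# rewrite everything below the header row in one call
sheet2.update('A2', [list(rec.values()) for rec in sheets.values()])
This uses gspread's get_all_records() and update(); the same approach works with the raw Sheets API.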

How to get rid of None Type in my output - I only want to select valid IP addresses

Any idea how to not include anything with None? I am trying to just pull in IP addresses at this point, but I don’t want to include empty elements.
My API Response
[{'name': '', 'serial': 'Q2KN-xxxx-438Z', 'mac': '0c:8d:db:c3:ad:c8', 'networkId': 'L_6575255xxx96096977', 'model': 'MX64', 'address': '', 'lat': 38.4180951010362, 'lng': -92.098531723022, 'notes': '', 'tags': '', 'wan1Ip': '47.134.13.195', 'wan2Ip': None}, {'name': '', 'serial': 'Q2PD-xxx-QQ9Y', 'mac': '0c:8d:db:dc:ed:f6', 'networkId': 'L_657525545596096977', 'model': 'MR33', 'address': '', 'lat': 38.4180951010362, 'lng': -92.098531723022, 'notes': '', 'tags': '', 'lanIp': '10.0.0.214'}]
Iterating through elements and selecting certain fields
response = requests.request("GET", url + id + '/devices', headers=headers)
data = response.json()

for item in data:
    keys = [item.get(x) for x in ['wan1Ip', 'model', 'lanIp', 'wan2Ip']]
    print(*keys, sep="\n", file=sys.stdout)
My output is:
47.134.13.195
MX64
None
None
None
MR33
10.0.0.214
None
My desired output is:
47.134.13.195
10.0.0.214
I've tried adding a re.findall for IP addresses, but I'm not sure that's going to work for me. I've also tried adding operators for not in None and several other things.
re.findall("(?:[\d]{1,3}).(?:[\d]{1,3}).(?:[\d]{1,3}).(?:[\d]{1,3})?", string2)
Update
I've changed my line to
keys = [item.get(x) for x in ['wan1Ip', 'model', 'lanIp', 'wan2Ip', '{}'] if x in item]
Obviously, I still have non-IP addresses in my output, but I can at least select only the elements that exist. My main issue was None. I will also try some of the other suggestions.
Search each key for an IPv4 address with a regex (https://stackoverflow.com/a/5284410/6250402), and guard against None values with item.get(x) or ''.
import re

myRegex = re.compile(r'\b((25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)(\.|$)){4}\b')

# data here
for item in data:
    keys = [item.get(x) for x in ['wan1Ip', 'model', 'lanIp', 'wan2Ip'] if myRegex.search(item.get(x) or '')]
    print(*keys, sep="\n", file=sys.stdout)
Add a filter to your list comprehension to check to see if key x is in item.
for item in data:
    keys = [item.get(x) for x in ['wan1Ip', 'model', 'lanIp', 'wan2Ip'] if x in item]
    print(*keys, sep="\n", file=sys.stdout)
You could add this condition to the end of your list comprehension:
keys = [item.get(x) for x in ['wan1Ip', 'model', 'lanIp', 'wan2Ip'] if item.get(x) is not None]
The output will be what you want.
Could you try this:
>>> data = response.json()
>>> keys = [x.get(key) for key in ['wan1Ip', 'model', 'lanIp', 'wan2Ip'] for x in data if x.get(key)]
>>> print(keys)
['47.134.13.195', 'MX64', 'MR33', '10.0.0.214']

Python search loop slow

I am running a search on a list of ads (adscrape). Each ad is a dict within adscrape (e.g. ad below). It searches through a list of IDs (database_ids) which could be between 200,000 - 1,000,000 items long. I want to find any ads in adscrape that don't have an ID already in database_ids.
My current code is below. It takes a loooong time, and multiple seconds for each ad to scan through database_ids. Is there a more efficient/faster way of running this (finding which items in a big list, are in another big list)?
database_ids = ['id1','id2','id3'...]
ad = {'body': u'\xa0SUV', 'loc': u'SA', 'last scan': '06/02/16', 'eng': u'\xa06cyl 2.7L ', 'make': u'Hyundai', 'year': u'2006', 'id': u'OAG-AD-12371713', 'first scan': '06/02/16', 'odo': u'168911', 'active': 'Y', 'adtype': u'Dealer: Used Car', 'model': u'Tucson Auto 4x4 ', 'trans': u'\xa0Automatic', 'price': u'9990'}
for ad in adscrape:
    ad['last scan'] = date
    ad['active'] = 'Y'
    adscrape_ids.append(ad['id'])
    if ad['id'] not in database_ids:
        ad['first scan'] = date
        print 'new ad:', ad
        newads.append(ad)
You can use a list comprehension for this, as in the code below. Use the existing database_ids list and adscrape list as given above.
Code:
new_adds_ids = [ad for ad in adscrape if ad['id'] not in database_ids]
You can build ids_map as a dict and check whether an id is in the list by looking up that key in ids_map, as in the code snippet below:
database_ids = ['id1','id2','id3']
ad = {'id': u'OAG-AD-12371713', 'body': u'\xa0SUV', 'loc': u'SA', 'last scan': '06/02/16', 'eng': u'\xa06cyl 2.7L ', 'make': u'Hyundai', 'year': u'2006', 'first scan': '06/02/16', 'odo': u'168911', 'active': 'Y', 'adtype': u'Dealer: Used Car', 'model': u'Tucson Auto 4x4 ', 'trans': u'\xa0Automatic', 'price': u'9990'}

# build ids map
ids_map = dict((k, v) for v, k in enumerate(database_ids))

for ad in adscrape:
    # some logic before checking whether id in database_ids
    try:
        ids_map[ad['id']]
    except KeyError:
        pass
    else:
        # error not thrown, perform logic for existing ids
        print 'id %s in list' % ad['id']
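For what it's worth, a plain set gives the same constant-time membership test without the dummy values; a minimal sketch reusing the names from the question (date, adscrape_ids and newads are assumed to exist as in the original code):
known_ids = set(database_ids)      # build once: O(n), then each lookup is O(1)

for ad in adscrape:
    ad['last scan'] = date
    ad['active'] = 'Y'
    adscrape_ids.append(ad['id'])
    if ad['id'] not in known_ids:  # constant-time check instead of scanning a list
        ad['first scan'] = date
        newads.append(ad)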

Delete index in list if multiple strings are matched

I've scraped a website containing a table and I want to format the headers for my desired final output.
headers = []
for row in table.findAll('tr'):
    for item in row.findAll('th'):
        for link in item.findAll('a', text=True):
            headers.append(link.contents[0])
print headers
Which returns:
[u'Rank ', u'University Name ', u'Entry Standards', u'Click here to read more', u'Student Satisfaction', u'Click here to read more', u'Research Quality', u'Click here to read more', u'Graduate Prospects', u'Click here to read more', u'Overall Score', u'Click here to read more', u'\r\n 2016\r\n ']
I don't want the 'Click here to read more' or '2016' headers, so I've done the following:
for idx, i in enumerate(headers):
    if 'Click' in i:
        del headers[idx]

for idx, i in enumerate(headers):
    if '2016' in i:
        del headers[idx]
Which returns:
[u'Rank ', u'University Name ', u'Entry Standards', u'Student Satisfaction', u'Research Quality', u'Graduate Prospects', u'Overall Score']
Perfect. But is there a better/neater way of removing the unwanted items? Thanks!
headers = filter(lambda h: not 'Click' in h and not '2016' in h, headers)
If you want to be more generic:
banned = ['Click', '2016']
headers = filter(lambda h: not any(b in h for b in banned), headers)
You can consider using a list comprehension to get a new, filtered list, something like:
new_headers = [header for header in headers if '2016' not in header]
If you can be sure that '2016' will always be last:
>>> [x for x in headers[:-1] if 'Click here' not in x]
['Rank ', 'University Name ', 'Entry Standards', 'Student Satisfaction', 'Research Quality', 'Graduate Prospects', 'Overall Score']
import re

pattern = '^Click|^2016'
new = [x for x in headers if not re.match(pattern, str(x).strip())]
