In an Excel file I have two large tables. Table A ("Dissection", 409 rows x 25 columns) contains unique entries, each identified by a unique ID. Table B ("Damage", 234 rows x 39 columns) uses the ID of Table A in the first cell and extends it. To analyze the data in Minitab, all data for an ID must be in a single long row, meaning the values of "Damage" have to follow those of "Dissection". The whole thing looks like this:
Table A - i.e. Dissection
- ID1 [valueTabA] [valueTabA]
- ID2 [valueTabA] [valueTabA]
- ID3 [valueTabA] [valueTabA]
- ID4 [valueTabA] [valueTabA]
Table B - i.e. Damage
- ID1 [valueTabB1] [valueTabB1]
- ID1 [valueTabB2] [valueTabB2]
- ID4 [valueTabB] [valueTabB]
They are supposed to combine something like this:
Table A
- ID1 [valueTabA] [valueTabA] [valueTabB1] [valueTabB1] [valueTabB2] [valueTabB2]
- ID2 [valueTabA] [valueTabA]
- ID3 [valueTabA] [valueTabA]
- ID4 [valueTabA] [valueTabA] [valueTabB] [valueTabB]
What is the best way to do that?
The following describes my two approaches. Both use the same data in the same tables, but in two different files so that I can test both scenarios.
In the first file both tables are in the same worksheet; in the second file they are in different worksheets.
Scenario: both tables are in the same worksheet, where I'm trying to move the row as a range
current_row = 415 # start without headers of table A
current_line = 2 # start without headers of table B
for row in ws.iter_rows(min_row=415, max_row=647):
    # loop through damage
    id_A = ws.cell(row=current_row, column=1).value
    max_col = 25
    for line in ws.iter_rows(min_row=2, max_row=409):
        # loop through dissection
        id_B = ws.cell(row=current_line, column=1).value
        if id_A == id_B:
            copy_range = ((ws.cell(row=current_line, column=2)).column_letter + str(current_line) + ":" +
                          (ws.cell(row=current_line, column=39)).column_letter + str(current_line))
            ws.move_range(copy_range, rows=current_row, cols=max_col+1)
            print("copied range: " + copy_range +" to: " + str(current_row) + ":"+str(max_col+1))
            count += 1
            break
        if current_line > 409:
            current_line = 2
        else:
            current_line += 1
    current_row += 1
-> Here I'm struggling to append the range to the right row of Table A, without overwriting the previous row (see example ID1 above)
Scenario: both tables are located in separated sheets
dissection = wb["Dissection"]
damage = wb["Damage"]
recovery = wb["Recovery"]
current_row, current_line = 2, 2
for row in damage.iter_rows():
    # loop through first table
    id_A = damage.cell(row=current_row, column=1).value
    for line in dissection.iter_rows():
        # loop through second table
        id_B = dissection.cell(row=current_line, column=1).value
        copyData = []
        if id_A == id_B:
            for col in range(2, 39):
                # add data to the list, skipping the ID
                copyData.append(damage.cell(row=current_line, column=col).value)
            # print(copyData) for debugging purposes
            for item in copyData:
                column_count = dissection.max_column
                dissection.cell(row=current_row, column=column_count).value = item
                column_count += 1
            current_row += 1
            break
        if not current_line > 409:
            # prevent looping out of range
            current_line += 1
        else:
            current_line = 2
-> Same problem as in 1., at some point it's not adding the damage values to copyData anymore but None instead, and finally it's just not pasting the items (cells stay blank)
I've tried everything excel related that I could find, but unfortunately nothing worked. Would pandas be more useful here or am I just not seeing something?
Thanks for taking the time to read this :)
I highly recommend using pandas for situations like this. It is still a bit unclear how your data is formatted in the excel file, but given your second option I assume that the tables are both on different sheets in the excel file. I also assume that the first row contains the table title (e.g. Table A - i.e. Dissection). If this is not the case, just remove skiprows=1:
import pandas as pd
df = pd.concat(pd.read_excel("filename.xlsx", sheet_name=None, skiprows=1, header=None), axis=1, ignore_index=True)
df.to_excel('combined_data.xlsx') # save to excel
read_excel will load the Excel file into a pandas dataframe. sheet_name=None indicates that all sheets should be loaded into a dict of dataframes keyed by sheet name (an OrderedDict in older pandas versions). pd.concat will concatenate these dataframes into one single dataframe (axis=1 places them side by side). Note that with axis=1 the rows are aligned by their position in each sheet, not by the ID column. You can explore the data with df.head(), or save the dataframe to Excel with df.to_excel.
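If the rows have to be matched by ID instead of by position, one possible approach is to number the repeated Table B rows per ID, spread them into columns, and left-merge the result onto Table A. This is only a sketch under assumptions: the function name append_by_id, the column name "ID" and the sample values are made up for illustration and are not taken from the real workbook.

```python
import pandas as pd


def append_by_id(table_a, table_b, id_col="ID"):
    """Append all Table B rows that share an ID to the matching Table A row."""
    b = table_b.copy()
    # Number repeated IDs: first occurrence 0, second occurrence 1, ...
    b["_n"] = b.groupby(id_col).cumcount()
    parts = []
    for n, group in b.groupby("_n"):
        # Within one occurrence number every ID appears at most once
        part = group.drop(columns="_n").set_index(id_col)
        parts.append(part.add_suffix(f"_{n + 1}"))
    # Side by side, aligned on the ID index
    wide = pd.concat(parts, axis=1).rename_axis(id_col).reset_index()
    # Left merge keeps every Table A row, with NaN where Table B has no data
    return table_a.merge(wide, on=id_col, how="left")


# Made-up sample data shaped like the example in the question
table_a = pd.DataFrame({"ID": ["ID1", "ID2", "ID3", "ID4"], "a1": [1, 2, 3, 4]})
table_b = pd.DataFrame({"ID": ["ID1", "ID1", "ID4"],
                        "b1": [10, 11, 40],
                        "b2": [100, 110, 400]})
result = append_by_id(table_a, table_b)
print(result)
```

IDs without any Table B rows keep their Table A values and get empty (NaN) cells in the appended columns.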
I ended up using the second scenario (one file, two worksheets), but this code should be adaptable to the first scenario (one file, one worksheet) as well.
I copied the rows of Table B using code taken from here.
And handled the offset with code from here.
Also, I added a few extras to my solution to make it more generic:
import openpyxl, os
from openpyxl.utils import range_boundaries
# Introduction
print("Welcome!\n[!] Advice: Always have a backup of the file you want to sort.\n[+] Please put the file to be sorted in the same directory as this program.")
print("[+] This program assumes that the value to be sorted by is located in the first column of the outgoing table.")
# File listing
while True:
    files = [f for f in os.listdir('.') if os.path.isfile(f)]
    # note: openpyxl itself only opens .xlsx/.xlsm/.xltx/.xltm files
    valid_types = ["xlsx", "xltx", "xlt", "xls"]
    print("\n[+] Current directory: " + os.getcwd())
    print("[+] Excel files in the current directory: ")
    for f in files:
        # look at the last extension only; this also avoids an IndexError for files without a dot
        if f.rsplit(".", 1)[-1] in valid_types:
            print(f)
    file = input("\nWhich file would you like to sort: ")
    try:
        ending = file.split(".")[1]
    except IndexError:
        print("Please only enter excel files.")
        continue
    if ending in valid_types:
        break
    else:
        print("Please only enter excel files")
wb = openpyxl.load_workbook(file)
# Handling Worksheets
print("\nAvailable Worksheets: " + str(wb.sheetnames))
print("Which sheet would you like to sort? (please copy the name without the brackets and quotes)")
outgoing_sheet = wb[input("Outgoing sheet: ")]
print("\nAvailable Worksheets: " + str(wb.sheetnames))
print("Which is the receiving sheet? (please copy the name without the brackets and quotes)")
receiving_sheet = wb[input("Receiving sheet: ")]
# Declaring functions
def copy_row(source_range, target_start, source_sheet, target_sheet):
    # Define start Range(target_start) in the new Worksheet
    min_col, min_row, max_col, max_row = range_boundaries(target_start)
    # Iterate Range you want to copy
    for row, row_cells in enumerate(source_sheet[source_range], min_row):
        for column, cell in enumerate(row_cells, min_col):
            # Copy Value from Copy.Cell to given Worksheet.Cell
            target_sheet.cell(row=row, column=column).value = cell.value
def ask_yes_no(prompt):
    """
    :param prompt: The question to be asked
    :return: Value to check
    """
    while True:
        answer = input(prompt + " (y/n): ")
        if answer == "y":
            return True
        elif answer == "n":
            return False
        print("Please only enter y or n.")
def ask_integer(prompt):
    while True:
        try:
            answer = int(input(prompt + ": "))
            break
        except ValueError:
            print("Please only enter integers (e.g. 1, 2 or 3).")
    return answer
def scan_empty(index):
    print("Scanning for empty cells...")
    scan, fill = False, False
    min_col = outgoing_sheet.min_column
    max_col = outgoing_sheet.max_column
    cols = range(min_col, max_col+1)
    break_loop = False
    count = 0
    if not scan:
        search_index = index
        for row in outgoing_sheet.iter_rows():
            for n in cols:
                cell = outgoing_sheet.cell(row=search_index, column=n).value
                if cell:
                    pass
                else:
                    choice = ask_yes_no("\n[!] Empty cells found, would you like to fill them? (recommended)")
                    if choice:
                        fill = input("Fill with: ")
                        scan = True
                        break_loop = True
                        break
                    else:
                        print("[!] Attention: This can lead to mismatches in the sorting algorithm.")
                        confirm = ask_yes_no("[>] Are you sure you don't want to fill them?\n[+] Hint: You can also enter spaces.\n(n)o I really don't want to\noka(y) I'll enter something, just let me sort already.\n")
                        if confirm:
                            fill = input("Fill with: ")
                            scan = True
                            break_loop = True
                            break
                        else:
                            print("You have chosen not to fill the empty cells.")
                            scan = True
                            break_loop = True
                            break
            if break_loop:
                break
            search_index += 1
    if fill:
        search_index = index
        for row in outgoing_sheet.iter_rows(max_row=outgoing_sheet.max_row-1):
            for n in cols:
                cell = outgoing_sheet.cell(row=search_index, column=n).value
                if cell:
                    pass
                elif cell != int(0):
                    count += 1
                    outgoing_sheet.cell(row=search_index, column=n).value = fill
            search_index += 1
        print("Filled " + str(count) + " cells with: " + fill)
    return fill, count
# Declaring basic variables
first_value = ask_yes_no("Is the first row containing values the 2nd in both tables?")
if first_value:
    current_row, current_line = 2, 2
else:
    current_row = ask_integer("Sorting table first row")
    current_line = ask_integer("Receiving table first row")
verbose = ask_yes_no("Verbose output?")
reset = current_line
rec_max = receiving_sheet.max_row
scan_empty(current_row)
count = 0
print("\nSorting: " + str(outgoing_sheet.max_row - 1) + " rows...")
for row in outgoing_sheet.iter_rows():
    # loop through first table - Table you want to sort
    id_A = outgoing_sheet.cell(row=current_row, column=1).value
    if verbose:
        print("\nCurrently at: " + str(current_row - 1) + "/" + str(outgoing_sheet.max_row - 1) + "")
        try:
            print("Sorting now: " + id_A)
        except TypeError:
            # Handling None type exceptions
            pass
    for line in receiving_sheet.iter_rows():
        # loop through second table - The receiving table
        id_B = receiving_sheet.cell(row=current_line, column=1).value
        if id_A == id_B:
            # calculate the offset: the first column after the last non-empty cell
            # default=0 covers a completely empty row, where max() would otherwise raise ValueError
            offset = max((cell.column for cell in receiving_sheet[current_line] if cell.value is not None), default=0) + 1
            start_paste_from = receiving_sheet.cell(row=current_line, column=offset).column_letter + str(current_line)
            copy_Range = ((outgoing_sheet.cell(row=current_row, column=2)).column_letter + str(current_row) + ":" +
                          (outgoing_sheet.cell(row=current_row, column=outgoing_sheet.max_column)).column_letter + str(current_row))
            # Don't copy the ID, alternatively set damage.min_column for the first and damage.max_column for the second
            copy_row(copy_Range, start_paste_from, outgoing_sheet, receiving_sheet)
            count += 1
            current_row += 1
            if verbose:
                print("Copied " + copy_Range + " to: " + str(start_paste_from))
            break
        if not current_line > rec_max:
            # prevent looping out of range
            current_line += 1
        else:
            current_line = reset
wb.save(file)
print("\nSorted: " + str(count) + " rows.")
print("Saving the file to: " + os.getcwd())
print("Done.")
Note: The values of table B ("Damage") are sorted according to the ID, although that is not required. However, if you choose to do so, this can be done using pandas:
import pandas as pd
# open the correct worksheet
df = pd.read_excel("excel/separated.xlsx", sheet_name="Damage")
# sort_values returns a new dataframe, so keep the result
df = df.sort_values(by="Identification")
df.to_excel("sorted.xlsx")
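The matching step in the script above rescans the receiving sheet for every outgoing row. As a rough sketch of the same idea with a dictionary lookup, the rows of Table B can be grouped by ID once and then appended to the matching Table A rows. Plain lists stand in for worksheet rows here, and the sample values are made up.

```python
from collections import defaultdict

# Stand-ins for the two worksheets: the first item of each row is the ID
table_a = [
    ["ID1", "a1", "a2"],
    ["ID2", "a1", "a2"],
    ["ID3", "a1", "a2"],
    ["ID4", "a1", "a2"],
]
table_b = [
    ["ID1", "b1", "b1"],
    ["ID1", "b2", "b2"],
    ["ID4", "b", "b"],
]

# Collect all Table B values per ID, in the order they appear
extras = defaultdict(list)
for row_id, *values in table_b:
    extras[row_id].extend(values)

# Append the collected values to the matching Table A row (nothing for unmatched IDs)
combined = [row + extras.get(row[0], []) for row in table_a]

for row in combined:
    print(row)
```

Each table is read only once, so a second Table B row for the same ID lands behind the first one instead of overwriting it.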
I'm trying to write a script that will pull one row of excel at a time and print it. I would like to use a method to change the row. I am able to get the value of the row to change (variable rrowx) but when I print the currentRow string, I get the original row and not the newly adjusted row.
import xlrd
class Loader(object): ## engine to load and unload spread sheets
    ## then sets them to a variable
    # set the variables
    workbook = " " # name of the file
    sheetCount = 0 # amount of sheets in the spreadsheet
    sheetNumber = 0 # current sheet (index)
    rowCount = 0 # amount of rows in the spreadsheet
    currentSheet = " " # name of current sheet
    topRow = " " # row 0 string
    currentRow = " " # row x string
    global rrowx
    rrowx = 0
    # begin the load
    workbook = xlrd.open_workbook('test.xlsx')
    sheetCount = workbook.nsheets
    sheetNames = workbook.sheet_names()
    currentSheet = workbook.sheet_by_index(sheetNumber)
    #topRow = currentSheet.row_values(rowx=rrowx, start_colx=scolx, end_colx=ecolx)
    currentRow = currentSheet.row_values(rowx=rrowx)
    # methods to navigate the sheet
    def nextrow(self):
        global rrowx
        print(rrowx)
        rrowx += 1
        print(rrowx)
        return rrowx
spreadsheet = Loader()
## Debugging prints
print(spreadsheet.sheetNames)
print(spreadsheet.sheetCount)
print("What Sheet would you like to use? (Use numbers)")
spreadsheetadjust = int(input()) # takes input as an integer
spreadsheet.currentSheet = spreadsheetadjust - 1 # takes input and -1 for index value
print ('Current sheet name: %s' % spreadsheet.currentSheet)# prints current sheet name
print('top row:')
#print(spreadsheet.topRow)
print('row 1 ')
print(spreadsheet.currentRow)
print("NextRow")
spreadsheet.nextrow()
print(spreadsheet.currentRow)
I thought that changing the rrowx variable and printing currentRow again would change the row that is printed. Instead I am getting the same row printed twice, even though I can see the value of rrowx changing in the prints I added in the method.
Disclosure: I've only been programming for a month, so sorry if this is an easy answer I'm just missing.
I highly recommend reading up on the basics of object-oriented programming in Python. Your code has multiple issues; I will mention some:
- Your class variables are class variables, meaning all instances of the class will share the same variable. So if you create multiple instances of your class, you will get unexpected/undesired behavior.
- The use of a global variable is not recommended, especially when you can do without it.
- You don't need to declare variables with placeholder values before assigning them in Python.
In your implementation you have spreadsheet.currentSheet = spreadsheetadjust - 1, which replaces the sheet object with an integer and will cause you to fail even if you fix your problem. You want to change the sheet index and then reload the sheet and the current row instead (select_sheet in the code below).
Here is code that works with proper usage of Python classes:
import xlrd
class Loader:
    def __init__(self, path_to_xlsx='test.xlsx'):
        self.rrowx = 0
        self.workbook = xlrd.open_workbook(path_to_xlsx)
        self.sheetCount = self.workbook.nsheets
        self.sheetNames = self.workbook.sheet_names()
        self.select_sheet(0)
    def select_sheet(self, sheet_number):
        # switch to another sheet and reload the current row from it
        self.sheetNumber = sheet_number
        self.currentSheet = self.workbook.sheet_by_index(self.sheetNumber)
        self.currentRow = self.currentSheet.row_values(rowx=self.rrowx)
    def nextrow(self):
        # move the index, then re-read the row so currentRow stays in sync
        self.rrowx += 1
        self.currentRow = self.currentSheet.row_values(rowx=self.rrowx)
spreadsheet = Loader()
print("What Sheet would you like to use? (Use numbers)")
spreadsheetadjust = int(input()) # takes input as an integer
spreadsheet.select_sheet(spreadsheetadjust - 1) # takes input and -1 for index value
print('Current sheet name: %s' % spreadsheet.currentSheet.name) # prints current sheet name
print(spreadsheet.currentRow)
spreadsheet.nextrow()
print(spreadsheet.currentRow)
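The reason the original code printed the same row twice can be seen in isolation. Statements in a class body run once, when the class is defined, so a row read there is never refreshed. The sketch below uses a plain list as a stand-in for the worksheet; ROWS, ClassLevel and InstanceLevel are made-up names for illustration.

```python
# Stand-in for a worksheet: a list of rows
ROWS = [["id", "name"], [1, "a"], [2, "b"]]


class ClassLevel:
    rowx = 0
    current_row = ROWS[rowx]  # evaluated once, when the class is defined

    def nextrow(self):
        self.rowx += 1  # the index changes, but nothing re-reads the row


class InstanceLevel:
    def __init__(self):
        self.rowx = 0
        self.current_row = ROWS[self.rowx]

    def nextrow(self):
        self.rowx += 1
        self.current_row = ROWS[self.rowx]  # re-read the row after moving


stale = ClassLevel()
stale.nextrow()
fresh = InstanceLevel()
fresh.nextrow()

print(stale.rowx, stale.current_row)  # index moved, row did not
print(fresh.rowx, fresh.current_row)  # index and row moved together
```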
I have a project where I'm trying to create a program that will take a csv data set from www.transtats.gov which is a data set for airline flights in the US. My goal is to find the flight from one airport to another that had the worst delays overall, meaning it is the "worst flight". So far I have this:
import csv
with open('826766072_T_ONTIME.csv') as csv_infile: #import and open CSV
    reader = csv.DictReader(csv_infile)
    total_delay = 0
    flight_count = 0
    flight_numbers = []
    delay_totals = []
    dest_list = [] #create empty list of destinations
    for row in reader:
        if row['ORIGIN'] == 'BOS': #only take flights leaving BOS
            if row['FL_NUM'] not in flight_numbers:
                flight_numbers.append(row['FL_NUM'])
            if row['DEST'] not in dest_list: #if the dest is not already in the list
                dest_list.append(row['DEST']) #append the dest to dest_list
    for number in flight_numbers:
        for row in reader:
            if row['ORIGIN'] == 'BOS': #for flights leaving BOS
                if row['FL_NUM'] == number:
                    if float(row['CANCELLED']) < 1: #if the flight is not cancelled
                        if float(row['DEP_DELAY']) >= 0: #and the delay is greater or equal to 0 (some flights had negative delay?)
                            total_delay += float(row['DEP_DELAY']) #add time of delay to total delay
                            flight_count += 1 #add the flight to total flight count
    for row in reader:
        for number in flight_numbers:
            delay_totals.append(sum(row['DEP_DELAY']))
I was thinking that I could create a list of flight numbers and a list of the total delays from those flight numbers and compare the two and see which flight had the highest delay total. What is the best way to go about comparing the two lists?
I'm not sure if I understand you correctly, but I think you should use dict for this purpose, where key is a 'FL_NUM' and value is total delay.
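A minimal sketch of that idea, assuming the column names from the question. The CSV content below is a made-up in-memory sample, not real flight data, and early departures are counted as no delay.

```python
import csv
import io
from collections import defaultdict

# Tiny in-memory stand-in for the real CSV file
SAMPLE = """ORIGIN,DEST,FL_NUM,CANCELLED,DEP_DELAY
BOS,JFK,100,0.00,15.00
BOS,JFK,100,0.00,45.00
BOS,ORD,200,0.00,-5.00
BOS,ORD,200,1.00,
LAX,BOS,300,0.00,90.00
BOS,ORD,200,0.00,20.00
"""

total_delay = defaultdict(float)  # FL_NUM -> summed departure delay

for row in csv.DictReader(io.StringIO(SAMPLE)):
    if row["ORIGIN"] != "BOS":
        continue  # only flights leaving BOS
    if float(row["CANCELLED"]) >= 1:
        continue  # cancelled flights have no delay to count
    delay = float(row["DEP_DELAY"])
    if delay > 0:
        total_delay[row["FL_NUM"]] += delay

# The key with the largest summed delay is the "worst" flight number
worst_flight = max(total_delay, key=total_delay.get)
print(worst_flight, total_delay[worst_flight])
```

Because the file is read only once, there is no need to keep separate lists of flight numbers and totals in step with each other.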
In general I want to eliminate loops in Python code. For files that aren't massive I'll typically read through a data file once and build up some dicts that I can analyze at the end. The below code isn't tested because I don't have the original data but follows the general pattern I would use.
Since a flight is identified by the origin, destination, and flight number I would capture them as a tuple and use that as the key in my dict.
import csv
from collections import defaultdict
flight_delays = defaultdict(list) # look this up if you aren't familiar
with open('826766072_T_ONTIME.csv') as csv_infile:
    reader = csv.DictReader(csv_infile)
    for row in reader:
        if row['ORIGIN'] == 'BOS': #only take flights leaving BOS
            # CSV values are strings, so convert before comparing; keep only flights that were not cancelled
            if float(row['CANCELLED']) < 1:
                flight = (row['ORIGIN'], row['DEST'], row['FL_NUM'])
                flight_delays[flight].append(float(row['DEP_DELAY']))
# Finished reading through data, now I want to calculate average delays
worst_flight = ""
worst_delay = 0
for flight, delays in flight_delays.items():
    average_delay = sum(delays) / len(delays)
    if average_delay > worst_delay:
        worst_flight = flight[0] + " to " + flight[1] + " on FL#" + flight[2]
        worst_delay = average_delay
A very simple solution would be to add two new variables:
max_delay = 0
delay_flight = 0
# Change: if float(row['DEP_DELAY']) >= 0: FOR:
if float(row['DEP_DELAY']) > max_delay:
    max_delay = float(row['DEP_DELAY'])
    delay_flight = row['FL_NUM'] # save the flight number (or the row number) for reference
I am trying to define a function that takes two arguments: df (a dataframe) and an integer (the employee ID). The function should return the full name of the employee.
If the given ID does not belong to any employee, I want to return the string "UNKNOWN". If no middle name is given, only return "LAST, FIRST". If only the middle initial is given, then return the full name in the format "LAST, FIRST M.", with the middle initial followed by a '.'.
def getFullName(df, int1):
    df = pd.read_excel('/home/data/AdventureWorks/Employees.xls')
    newdf = df[(df['EmployeeID'] == int1)]
    print("'" + newdf['LastName'].item() + "," + " " + newdf['FirstName'].item() + " " + newdf['MiddleName'].item() + "." + "'")
getFullName('df', 110)
I wrote this code but came up with two problems :
1) If I don't put quotation marks around df, it gives me an error message, but I just want to take a data frame as an argument, not a string.
2) This code can't deal with someone without a middle name.
I am sorry that I used pd.read_excel to read an Excel file which you cannot access. I know it will be hard for you to test the code without the Excel file. If someone lets me know how to create a random data frame with the column names, I will go ahead and change it. Thank you.
I created some fake data for this:
EmployeeID FirstName LastName MiddleName
0 0 a a a
1 1 b b b
2 2 c c c
3 3 d d d
4 4 e e e
5 5 f f f
6 6 g g g
7 7 h h h
8 8 i i i
9 9 j j None
EmployeeID 9 has no middle name, but everyone else does. The way I would do it is to break up the logic into two parts. The first, for when you cannot find the EmployeeID. The second manages the printing of the employee's name. That second part should also have two sets of logic, one to control if the employee has a middle name, and the other for if they don't. You could likely combine a lot of this into single line statements, but you will likely sacrifice clarity.
I also removed the pd.read_excel call from the function. If you want to pass the dataframe in to the function, then the dataframe should be created outside of it.
def getFullName(df, int1):
    newdf = df[(df['EmployeeID'] == int1)]
    # if the dataframe is empty, then we can't find the given ID
    # otherwise, go ahead and print out the employee's info
    if newdf.empty:
        print("UNKNOWN")
        return "UNKNOWN"
    else:
        # all strings will start with the LastName and FirstName
        # we will then add the MiddleName if it's present
        # and then we can end the string with the final '
        s = "'" + newdf['LastName'].item() + ", " + newdf['FirstName'].item()
        middle = newdf['MiddleName'].item()
        # None or NaN (what read_excel gives for a blank cell) means there is no middle name
        if pd.notna(middle) and middle:
            s = s + " " + middle + "."
        s = s + "'"
        print(s)
        return s
I have the function returning values in case you want to manipulate the string further. But that was just me.
If you run getFullName(df, 1) you should get 'b, b b.'. And for getFullName(df, 9) you should get 'j, j'.
So in full, it would be:
df = pd.read_excel('/home/data/AdventureWorks/Employees.xls')
getFullName(df, 1) #outputs 'b, b b.'
getFullName(df, 9) #outputs 'j, j'
getFullName(df, 10) #outputs UNKNOWN
Fake data:
d = {'EmployeeID' : [0,1,2,3,4,5,6,7,8,9],
'FirstName' : ['a','b','c','d','e','f','g','h','i','j'],
'LastName' : ['a','b','c','d','e','f','g','h','i','j'],
'MiddleName' : ['a','b','c','d','e','f','g','h','i',None]}
df = pd.DataFrame(d)
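The formatting rules from the question can also be kept in a small function that knows nothing about the dataframe, which makes them easy to check on their own. This is only a sketch: format_full_name is a made-up helper, and writing a full middle name without a trailing period is my reading of the rules, not something the question states outright.

```python
def format_full_name(last, first, middle=None):
    """Return 'LAST, FIRST', plus the middle name or initial when one is given."""
    name = f"{last}, {first}"
    # None, NaN (a float) and blank strings all count as "no middle name"
    if not isinstance(middle, str) or not middle.strip():
        return name
    middle = middle.strip()
    # A single letter is treated as an initial and gets a trailing period
    if len(middle) == 1:
        middle += "."
    return f"{name} {middle}"


print(format_full_name("Smith", "Ann"))
print(format_full_name("Smith", "Ann", "J"))
print(format_full_name("Smith", "Ann", "Jane"))
```

Inside getFullName, the three values from the matching row could be passed to such a helper after the empty-dataframe check.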
I am learning python and trying to build a scraper to glean parts data from a suppliers site. My issue now is that I am getting different column counts from my parsed table rows where I know that each row has the same column count. The issue has to be something I am overlooking and after two days of trying different things I am asking for a few more sets of eyes on my code to locate my error. Not having much python coding experience is no doubt my biggest hurdle.
First, the data. Rather than paste the html I have stored in my database, I'll give you a link to the live site I have crawled and stored in my db. The first link is this one.
The issue is that I get mostly correct results. However, every so often I get the values skewed in the column count. I can't seem to locate the cause.
Here is an example of the flawed result:
----------------------------------------------------------------------------------
Record: 1 Section:Passenger / Light Truck Make: ACURA SubMake:
Model: CL SubModel: Year: 1997 Engine: L4 1.6L 1590cc
----------------------------------------------------------------------------------
Rec:1 Row 6 Col 1 part Air Filter
Rec:1 Row 6 Col 2 2
Rec:1 Row 6 Col 3 part_no 46395
Rec:1 Row 6 Col 4 filter_loc
Rec:1 Row 6 Col 5 engine
Rec:1 Row 6 Col 6 vin_code V6 3.0L 2997cc
Rec:1 Row 6 Col 7 comment Engine Code J30A1
** Note that the engine value has been shifted to the vin_code field.
And proof it works some of the time:
Record: 2 Section:Passenger / Light Truck Make: ACURA SubMake:
Model: CL SubModel: Year: 1998 Engine: L4 1.6L 1590cc
----------------------------------------------------------------------------------
Rec:3 Row 4 Col 1 part Oil Filter
Rec:3 Row 4 Col 2 2
Rec:3 Row 4 Col 3 part_no 51334
Rec:3 Row 4 Col 4 filter_loc
Rec:3 Row 4 Col 5 engine L4 2.3L 2254cc
Rec:3 Row 4 Col 6 vin_code
Rec:3 Row 4 Col 7 comment Engine Code F23A1
** Note the fields line up in this record...
I suspect either there is something in the table cells my parser is not looking for or I have missed something trivial.
Here is the important portion of my code:
# Per Query
while records:
# Per Query Loop
#print str(records)
for record in records:
print 'Record Count:'+str(rec_cnt)
items = ()
item = {}
source = record['doc']
page = html.fromstring(source)
for rows in page.xpath('//div/table'):
#records = []
item = {}
cntx = 0
for row in list(rows):
cnty = 0 # Column Counter
found_oil = 0 # Found oil filter record flag
data = {} # Data
# Data fields
field_data = {'part':'', 'part_no':'', 'filter_loc':'', 'engine':'', 'vin_code':'', 'comment':'', 'year':''}
print
print '----------------------------------------------------------------------------------'
print 'Record: '+str(record['id']), 'Section:'+str(record['section']), 'Make: '+str(record['make']), 'SubMake: '+str(record['submake'])
print 'Model: '+str(record['model']), 'SubModel: '+str(record['submodel']), 'Year: '+str(record['year']), 'Engine: '+str(record['engine'])
print '----------------------------------------------------------------------------------'
#
# Rules for extracting data columns
# 1. First column always has a link to the bullet image
# 2. Second column is part name
# 3. Third column always empty
# 4. Fourth column is part number
# 5. Fifth column is empty
# 6. Sixth column is part location
# 7. Seventh column is always empty
# 8. Eighth column is engine size
# 9. Ninth column is vin code
# 10. Tenth column is Comment
# 11. Eleventh column does not exist.
#
for column in row.xpath('./td[@class="blackmedium"][text()="0xa0"] | ./td[@class="blackmedium"][text()="\n"]/text() | ./td[@class="blackmeduim"]/img[@src]/text() | ./td[@class="blackmedium"][text()=""]/text() | ./td[@class="blackmedium"]/b/text() | ./td[@class="blackmedium"]/a/text() |./td[@class="blackmedium"]/text() | ./td[@class="blackmedium"][text()=" "]/text() | ./td[@class="blackmedium"][text()=" "]/text() | ./td[@class="blackmedium"][text()=None]/text()'):
#' | ./td[position()>1]/a/text() | ./td[position()>1]/text() | self::node()[position()=1]/td/text()'):
cnty+=1
if ('Oil Filter' == column.strip() or 'Air Filter' == column.strip()) and found_oil == 0:
found_oil = 1
if found_oil == 1:
print 'Rec:'+str(rec_cnt), 'Row '+str(cntx), 'Col '+str(cnty), _fields[cnty], column.strip()
#cnty+= 1
#print
else:
print 'Rec: '+str(rec_cnt), 'Col: '+str(cnty)
field_data[ str(_fields[cnty]) ] = str(column.strip())
#cnty = cnty+1
# Save data to db dest table
if found_oil == 1:
data['source_id'] = record['id']
data['section_id'] = record['section_id']
data['section'] = record['section']
data['make_id'] = record['make_id']
data['make'] = record['make']
data['submake_id'] = record['submake_id']
data['submake'] = record['submake']
data['model_id'] = record['model_id']
data['model'] = record['model']
data['submodel_id'] = record['submodel_id']
data['submodel'] = record['submodel']
data['year_id'] = record['year_id']
data['year'] = record['year']
data['engine_id'] = record['engine_id']
data['engine'] = record['engine']
data['part'] = field_data['part']
data['part_no'] = field_data['part_no']
data['filter_loc'] = field_data['filter_loc']
data['vin_code'] = field_data['vin_code']
data['comment'] = conn.escape_string(field_data['comment'])
data['url'] = record['url']
save_data(data)
print 'Filed Data:'
print field_data
cntx+=1
rec_cnt+=1
#End main per query loop
delay() # delay if wait was passed on cmd line
records = get_data()
has_offset = 1
#End Queries
Thank you all for your help and your eyes...
Usually when I run into a problem like this, I do two things:
- Break the problem down into smaller chunks. Use Python functions or classes to perform subsets of functionality so that you can test the functions individually for correctness.
- Use the Python debugger to inspect the code as it runs to understand where it's failing. For example, in this case, I would add import pdb; pdb.set_trace() before the line that says cnty+=1. Then, when the code runs, you'll get an interactive interpreter at that point and you can inspect the various variables and discover why you're not getting what you expect.
A couple of tips for using pdb:
- Use c to allow the program to continue (until the next breakpoint or set_trace).
- Use n to step to the next line in the program.
- Use q to raise an exception (and usually abort).
Can you share the details of your scraping process? The intermittent failures could be caused by the parsing of the HTML data.
The problem seems to be that your xpath expression searches for text nodes. No matches are found for empty cells, causing your code to "skip" columns. Try iterating over the td elements themselves, and then "look down" from the element to its contents. To get you started:
# just iterate over child elements of the row, which are always td
# use enumerate to easily get a counter for the columns
for col_no, td in enumerate(row, start=1):
    # use the xpath function string() to get the string value for the element
    # this will yield an empty string for empty elements
    print col_no, td.xpath('string()')
Note that the string() xpath function may in some cases be too simple for what you want. In your example, you may find something like <td><a>51334</a><sup>53</sup></td> (see oil filter). My example would give you "5133453", where you would seem to need "51334". I'm not sure if that was intentional or if you hadn't noticed the "missing" part; if you want only the text in the hyperlink, use td.findtext('a').
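The column shift can be reproduced on a small, made-up row. The sketch below uses Python 3 and the standard library's xml.etree.ElementTree as a stand-in for lxml, so it runs without extra packages; the row content is invented and only modelled on the table in the question.

```python
import xml.etree.ElementTree as ET

# Hypothetical row: the second and fourth cells are empty
ROW = (
    "<tr>"
    "<td>Oil Filter</td><td></td>"
    "<td><a>51334</a><sup>53</sup></td><td></td>"
    "<td>L4 2.3L 2254cc</td>"
    "</tr>"
)
row = ET.fromstring(ROW)

# Collecting only the text that exists drops the empty cells, so positions shift
texts_only = [text for text in row.itertext() if text.strip()]

# Iterating over the td elements keeps one entry per cell, empty or not
per_cell = ["".join(td.itertext()).strip() for td in row]

# For the part number, take just the link text and ignore the <sup> note
part_no = row[2].findtext("a")

print(len(texts_only), texts_only)
print(len(per_cell), per_cell)
print(part_no)
```

The first list has no placeholder for the empty cells, while the second list always has one entry per td, so a column number keeps pointing at the same field.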
I want to thank everyone who has given aid to me these past few days. All your input has resulted in a working application that I am now using. I wanted to post the resulting changes to my code so those who look here may find an answer or at least information on how they may also tackle their issue. Below is the rewritten portion of my code that solved the issues I was having:
#
# get_column_index()
# returns a dict of column names/column number pairs
#
def get_column_index(row):
    index = {}
    col_no = 0
    td = None
    name = ''
    for col_no, td in enumerate(row, start=0):
        mystr = str(td.xpath('string()').encode('ascii', 'replace'))
        name = str.lower(mystr).replace(' ', '_')
        idx = name.replace('.', '')
        index[idx] = col_no
    if int(options.verbose) > 2:
        print 'Field Index:', str(index)
    return index
def run():
global has_offset
records = get_data()
#print 'Records', records
rec_cnt = 0
# Per Query
while records:
# Per Query Loop
#print str(records)
for record in records:
if int(options.verbose) > 0:
print 'Record Count:'+str(rec_cnt)
items = ()
item = {}
source = record['doc']
page = html.fromstring(source)
col_index = {}
for rows in page.xpath('//div/table'):
#records = []
item = {}
cntx = 0
for row in list(rows):
data = {} # Data
found_oil = 0 #found proper part flag
# Data fields
field_data = {'part':'', 'part_no':'', 'part_note':'', 'filter_loc':'', 'engine':'', 'vin_code':'', 'comment':'', 'year':''}
if int(options.verbose) > 0:
print
print '----------------------------------------------------------------------------------'
print 'Row'+str(cntx), 'Record: '+str(record['id']), 'Section:'+str(record['section']), 'Make: '+str(record['make']), 'SubMake: '+str(record['submake'])
print 'Model: '+str(record['model']), 'SubModel: '+str(record['submodel']), 'Year: '+str(record['year']), 'Engine: '+str(record['engine'])
print '----------------------------------------------------------------------------------'
# get column indexes
if cntx == 1:
col_index = get_column_index(row)
if col_index != None and cntx > 1:
found_oil = 0
for col_no, td in enumerate(row):
if ('part' in col_index) and (col_no == col_index['part']):
part = td.xpath('string()').strip()
if 'Oil Filter' == part or 'Air Filter' == part or 'Fuel Filter' == part or 'Transmission Filter' == part:
found_oil = 1
field_data['part'] = td.xpath('string()').strip()
# Part Number
if ('part_no' in col_index) and (col_no == col_index['part_no']):
field_data['part_no'] = str(td.xpath('./a/text()')).strip().replace('[', '').replace(']', '').replace("'", '')
field_data['part_note'] = str(td.xpath('./sup/text()')).strip().replace('[', '').replace(']', '').replace("'", '')
# Filter Location
if ('filterloc' in col_index) and (col_no == col_index['filterloc']):
field_data['filter_loc'] = td.xpath('string()').strip()
# Engine
if ('engine' in col_index) and (col_no == col_index['engine']):
field_data['engine'] = td.xpath('string()').strip()
if ('vin_code' in col_index) and (col_no == col_index['vin_code']):
field_data['vin_code'] = td.xpath('string()').strip()
if ('comment' in col_index) and (col_no == col_index['comment']):
field_data['comment'] = td.xpath('string()').strip()
if int(options.verbose) == 0:
print ','
if int(options.verbose) > 0:
print 'Field Data: ', str(field_data)
elif int(options.verbose) == 0:
print '.'
# Save data to db dest table
if found_oil == 1:
data['source_id'] = record['id']
data['section_id'] = record['section_id']
data['section'] = record['section']
data['make_id'] = record['make_id']
data['make'] = record['make']
data['submake_id'] = record['submake_id']
data['submake'] = record['submake']
data['model_id'] = record['model_id']
data['model'] = record['model']
data['submodel_id'] = record['submodel_id']
data['submodel'] = record['submodel']
data['year_id'] = record['year_id']
data['year'] = record['year']
data['engine_id'] = record['engine_id']
data['engine'] = field_data['engine'] #record['engine']
data['part'] = field_data['part']
data['part_no'] = field_data['part_no']
data['part_note'] = field_data['part_note']
data['filter_loc'] = field_data['filter_loc']
data['vin_code'] = field_data['vin_code']
data['comment'] = conn.escape_string(field_data['comment'])
data['url'] = record['url']
save_data(data)
found_oil = 0
if int(options.verbose) > 2:
print 'Data:', str(data)
cntx+=1
rec_cnt+=1
#End main per query loop
delay() # delay if wait was passed on cmd line
records = get_data()
has_offset = 1
#End Queries