In an Excel file I have two large tables. Table A ("Dissection", 409 rows x 25 columns) contains unique entries, each identified by a unique ID. Table B ("Damage", 234 rows x 39 columns) uses the ID of Table A in the first cell and extends it. To analyze the data in Minitab, all data for an ID must be in a single long row, meaning the values of "Damage" have to follow those of "Dissection". The whole thing looks like this:
Table A - i.e. Dissection
- ID1 [valueTabA] [valueTabA]
- ID2 [valueTabA] [valueTabA]
- ID3 [valueTabA] [valueTabA]
- ID4 [valueTabA] [valueTabA]
Table B - i.e. Damage
- ID1 [valueTabB1] [valueTabB1]
- ID1 [valueTabB2] [valueTabB2]
- ID4 [valueTabB] [valueTabB]
They are supposed to combine something like this:
Table A
- ID1 [valueTabA] [valueTabA] [valueTabB1] [valueTabB1] [valueTabB2] [valueTabB2]
- ID2 [valueTabA] [valueTabA]
- ID3 [valueTabA] [valueTabA]
- ID4 [valueTabA] [valueTabA] [valueTabB] [valueTabB]
What is the best way to do that?
The following describes my two approaches. Both use the same data in the same tables but in two different files, to be able to test both scenarios.
The first approach uses a file where both tables are in the same worksheet; the second uses a file where the tables are in different worksheets.
Scenario: both tables are in the same worksheet, where I'm trying to move the row as a range
current_row = 415  # start without headers of table A
current_line = 2  # start without headers of table B
for row in ws.iter_rows(min_row=415, max_row=647):
    # loop through damage
    id_A = ws.cell(row=current_row, column=1).value
    max_col = 25
    for line in ws.iter_rows(min_row=2, max_row=409):
        # loop through dissection
        id_B = ws.cell(row=current_line, column=1).value
        if id_A == id_B:
            copy_range = ((ws.cell(row=current_line, column=2)).column_letter + str(current_line) + ":" +
                          (ws.cell(row=current_line, column=39)).column_letter + str(current_line))
            ws.move_range(copy_range, rows=current_row, cols=max_col+1)
            print("copied range: " + copy_range + " to: " + str(current_row) + ":" + str(max_col+1))
            count += 1
            break
        if current_line > 409:
            current_line = 2
        else:
            current_line += 1
    current_row += 1
-> Here I'm struggling to append the range to the right row of Table A, without overwriting the previous row (see example ID1 above)
Scenario: both tables are located in separated sheets
dissection = wb["Dissection"]
damage = wb["Damage"]
recovery = wb["Recovery"]
current_row, current_line = 2, 2
for row in damage.iter_rows():
    # loop through first table
    id_A = damage.cell(row=current_row, column=1).value
    for line in dissection.iter_rows():
        # loop through second table
        id_B = dissection.cell(row=current_line, column=1).value
        copyData = []
        if id_A == id_B:
            for col in range(2, 39):
                # add data to the list, skipping the ID
                copyData.append(damage.cell(row=current_line, column=col).value)
            # print(copyData) for debugging purposes
            for item in copyData:
                column_count = dissection.max_column
                dissection.cell(row=current_row, column=column_count).value = item
                column_count += 1
            current_row += 1
            break
        if not current_line > 409:
            # prevent looping out of range
            current_line += 1
        else:
            current_line = 2
-> Same problem as in 1., at some point it's not adding the damage values to copyData anymore but None instead, and finally it's just not pasting the items (cells stay blank)
I've tried everything excel related that I could find, but unfortunately nothing worked. Would pandas be more useful here or am I just not seeing something?
Thanks for taking the time to read this :)
I highly recommend using pandas for situations like this. It is still a bit unclear how your data is formatted in the excel file, but given your second option I assume that the tables are both on different sheets in the excel file. I also assume that the first row contains the table title (e.g. Table A - i.e. Dissection). If this is not the case, just remove skiprows=1:
import pandas as pd
df = pd.concat(pd.read_excel("filename.xlsx", sheet_name=None, skiprows=1, header=None), axis=1, ignore_index=True)
df.to_excel('combined_data.xlsx')  # save to excel
read_excel loads the Excel file into pandas. sheet_name=None indicates that all sheets should be loaded, giving a dict of DataFrames keyed by sheet name. pd.concat then concatenates these DataFrames into a single DataFrame (axis=1 places them side by side). Note that this aligns rows by their index (here the row position), not by the ID column. You can explore the data with df.head(), or save the DataFrame to Excel with df.to_excel.
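If the rows need to be matched by ID rather than by position, one possible approach is to number the repeated IDs in Table B, spread each repetition into its own set of columns, and left-join the result onto Table A. The following is only a minimal sketch on small in-memory frames: the column names `ID`, `a` and `b` and the helper `widen_and_join` are made up for the example, and real data would first be loaded with `pd.read_excel`.

```python
import pandas as pd


def widen_and_join(table_a, table_b, id_col="ID"):
    """Left-join table_b onto table_a, one column set per repeated ID."""
    b = table_b.copy()
    # Number the occurrences of each ID: 1 for its first row, 2 for its second, ...
    b["_n"] = b.groupby(id_col).cumcount() + 1

    pieces = []
    for n, chunk in b.groupby("_n"):
        # Within one occurrence number every ID is unique, so it can be the index.
        chunk = chunk.drop(columns="_n").set_index(id_col).add_suffix(f"_{n}")
        pieces.append(chunk)

    # Side-by-side concat aligns the pieces on the ID index.
    wide = pd.concat(pieces, axis=1).rename_axis(id_col).reset_index()
    return table_a.merge(wide, how="left", on=id_col)


# Made-up sample data mirroring the ID1..ID4 layout from the question.
table_a = pd.DataFrame({"ID": ["ID1", "ID2", "ID3", "ID4"], "a": [1, 2, 3, 4]})
table_b = pd.DataFrame({"ID": ["ID1", "ID1", "ID4"], "b": [10, 11, 40]})
combined = widen_and_join(table_a, table_b)
print(combined)
```

IDs without any Table B rows keep their Table A values and get empty (NaN) cells in the added columns. The sketch assumes Table B has at least one row.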
I ended up using the second scenario (one file, two worksheets), but this code should be adaptable to the first scenario (one file, one worksheet) as well.
I copied the rows of Table B using code taken from here.
And handled the offset with code from here.
Also, I added a few extras to my solution to make it more generic:
import openpyxl, os
from openpyxl.utils import range_boundaries
# Introduction
print("Welcome!\n[!] Advice: Always have a backup of the file you want to sort.\n[+] Please put the file to be sorted in the same directory as this program.")
print("[+] This program assumes that the value to be sorted by is located in the first column of the outgoing table.")
# File listing
while True:
    files = [f for f in os.listdir('.') if os.path.isfile(f)]
    valid_types = ["xlsx", "xltx", "xlt", "xls"]
    print("\n[+] Current directory: " + os.getcwd())
    print("[+] Excel files in the current directory: ")
    for f in files:
        # skip files without an extension instead of raising IndexError
        if "." in f and f.rsplit(".", 1)[1] in valid_types:
            print(f)
    file = input("\nWhich file would you like to sort: ")
    try:
        ending = file.split(".")[1]
    except IndexError:
        print("Please only enter excel files.")
        continue
    if ending in valid_types:
        break
    else:
        print("Please only enter excel files")
wb = openpyxl.load_workbook(file)
# Handling Worksheets
print("\nAvailable Worksheets: " + str(wb.sheetnames))
print("Which sheet would you like to sort? (please copy the name without the quotes)")
outgoing_sheet = wb[input("Outgoing sheet: ")]
print("\nAvailable Worksheets: " + str(wb.sheetnames))
print("Which is the receiving sheet? (please copy the name without the quotes)")
receiving_sheet = wb[input("Receiving sheet: ")]
# Declaring functions
def copy_row(source_range, target_start, source_sheet, target_sheet):
    # Define start Range(target_start) in the new Worksheet
    min_col, min_row, max_col, max_row = range_boundaries(target_start)
    # Iterate Range you want to copy
    for row, row_cells in enumerate(source_sheet[source_range], min_row):
        for column, cell in enumerate(row_cells, min_col):
            # Copy Value from Copy.Cell to given Worksheet.Cell
            target_sheet.cell(row=row, column=column).value = cell.value
def ask_yes_no(prompt):
    """
    :param prompt: The question to be asked
    :return: Value to check
    """
    while True:
        answer = input(prompt + " (y/n): ")
        if answer == "y":
            return True
        elif answer == "n":
            return False
        print("Please only enter y or n.")
def ask_integer(prompt):
    while True:
        try:
            answer = int(input(prompt + ": "))
            break
        except ValueError:
            print("Please only enter integers (e.g. 1, 2 or 3).")
    return answer
def scan_empty(index):
    print("Scanning for empty cells...")
    scan, fill = False, False
    min_col = outgoing_sheet.min_column
    max_col = outgoing_sheet.max_column
    cols = range(min_col, max_col+1)
    break_loop = False
    count = 0
    if not scan:
        search_index = index
        for row in outgoing_sheet.iter_rows():
            for n in cols:
                cell = outgoing_sheet.cell(row=search_index, column=n).value
                if cell:
                    pass
                else:
                    choice = ask_yes_no("\n[!] Empty cells found, would you like to fill them? (recommended)")
                    if choice:
                        fill = input("Fill with: ")
                        scan = True
                        break_loop = True
                        break
                    else:
                        print("[!] Attention: This can lead to mismatches in the sorting algorithm.")
                        confirm = ask_yes_no("[>] Are you sure you don't want to fill them?\n[+] Hint: You can also enter spaces.\n(n)o I really don't want to\noka(y) I'll enter something, just let me sort already.\n")
                        if confirm:
                            fill = input("Fill with: ")
                            scan = True
                            break_loop = True
                            break
                        else:
                            print("You have chosen not to fill the empty cells.")
                            scan = True
                            break_loop = True
                            break
            if break_loop:
                break
            search_index += 1
    if fill:
        search_index = index
        for row in outgoing_sheet.iter_rows(max_row=outgoing_sheet.max_row-1):
            for n in cols:
                cell = outgoing_sheet.cell(row=search_index, column=n).value
                if cell:
                    pass
                elif cell != int(0):
                    # fill empty cells, but leave cells containing 0 untouched
                    count += 1
                    outgoing_sheet.cell(row=search_index, column=n).value = fill
            search_index += 1
        print("Filled " + str(count) + " cells with: " + fill)
    return fill, count
# Declaring basic variables
first_value = ask_yes_no("Is the first row containing values the 2nd in both tables?")
if first_value:
    current_row, current_line = 2, 2
else:
    current_row = ask_integer("Sorting table first row")
    current_line = ask_integer("Receiving table first row")
verbose = ask_yes_no("Verbose output?")
reset = current_line
rec_max = receiving_sheet.max_row
scan_empty(current_row)
count = 0
print("\nSorting: " + str(outgoing_sheet.max_row - 1) + " rows...")
for row in outgoing_sheet.iter_rows():
    # loop through first table - Table you want to sort
    id_A = outgoing_sheet.cell(row=current_row, column=1).value
    if verbose:
        print("\nCurrently at: " + str(current_row - 1) + "/" + str(outgoing_sheet.max_row - 1) + "")
        try:
            print("Sorting now: " + id_A)
        except TypeError:
            # Handling None type exceptions
            pass
    for line in receiving_sheet.iter_rows():
        # loop through second table - The receiving table
        id_B = receiving_sheet.cell(row=current_line, column=1).value
        if id_A == id_B:
            try:
                # calculate the offset
                offset = max((row.column for row in receiving_sheet[current_line] if row.value is not None)) + 1
            except ValueError:
                # typical "No idea why, but it doesn't work without it" - code
                pass
            start_paste_from = receiving_sheet.cell(row=current_line, column=offset).column_letter + str(current_line)
            copy_Range = ((outgoing_sheet.cell(row=current_row, column=2)).column_letter + str(current_row) + ":" +
                          (outgoing_sheet.cell(row=current_row, column=outgoing_sheet.max_column)).column_letter + str(current_row))
            # Don't copy the ID, alternatively set damage.min_column for the first and damage.max_column for the second
            copy_row(copy_Range, start_paste_from, outgoing_sheet, receiving_sheet)
            count += 1
            current_row += 1
            if verbose:
                print("Copied " + copy_Range + " to: " + str(start_paste_from))
            break
        if not current_line > rec_max:
            # prevent looping out of range
            current_line += 1
        else:
            current_line = reset
wb.save(file)
print("\nSorted: " + str(count) + " rows.")
print("Saving the file to: " + os.getcwd())
print("Done.")
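The core idea of the script above (find the row with the matching ID, then paste behind its last filled cell) can also be sketched without openpyxl. This is just an illustration on plain lists, not the code I actually ran: `append_by_id` is a made-up helper, and with openpyxl the rows could be read via `sheet.iter_rows(values_only=True)` before being turned into lists. A dictionary lookup replaces the nested loop with its manually reset counters.

```python
def append_by_id(receiving_rows, outgoing_rows):
    """Append the values of each outgoing row to the receiving row with the same ID.

    Rows are lists whose first element is the ID. Receiving rows are
    extended in place; IDs without a matching receiving row are returned.
    """
    # Map each ID to its receiving row (the first occurrence wins).
    row_for_id = {}
    for row in receiving_rows:
        row_for_id.setdefault(row[0], row)

    unmatched = []
    for row in outgoing_rows:
        target = row_for_id.get(row[0])
        if target is None:
            unmatched.append(row[0])
        else:
            # Extending the list always lands behind the previously
            # appended values, so repeated IDs never overwrite each other.
            target.extend(row[1:])
    return unmatched


# Made-up sample data in the shape used in the question.
table_a = [["ID1", "a", "a"], ["ID2", "a", "a"], ["ID4", "a", "a"]]
table_b = [["ID1", "b1", "b1"], ["ID1", "b2", "b2"], ["ID4", "b", "b"], ["ID9", "x", "x"]]
missing = append_by_id(table_a, table_b)
print(table_a[0])
print(missing)
```

Collecting the unmatched IDs makes it visible when a Table B row has no partner, instead of silently looping over the same row.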
Note: The values of table B ("Damage") are sorted according to the ID, although that is not required. However, if you choose to do so, this can be done using pandas.
import pandas as pd
# open the correct worksheet
df = pd.read_excel("excel/separated.xlsx", sheet_name="Damage")
# sort_values returns a new DataFrame, so keep the result
df = df.sort_values(by="Identification")
df.to_excel("sorted.xlsx")
Related
I have 13 multiple sheet excel files in one folder to find degree eligibility.
I read all files using openpyxl.
Check the missing values.
Then convert Grade to Grade point value.
Sort by descending order and remove lower grade duplicates while keeping highest grade.
Extract Course code last digit as subject credit.
Then check whether the course code is applicable for GPA or not using credit multiplier.
Then calculate GPA = (Grade point value * Subject credit * Credit multiplier) / Total subject credit.
The results write in a new sheet in same file.
My program shows last file last sheet output not all 13 files. What is missing here?
Program as follows:
dir=os.path.join('file_path')
dir
file_found = False
for files in os.listdir(dir):
print(f"processing file: '{files}'")
if files[-4::] == 'xlsx':
file_found = True
else:
print(f"current file does not end with xlsx. Its last 4 chars are: '{files[-4:]}'")
if not file_found:
print("ERROR: There were no files with ending xlsx")
file_1 = pd.ExcelFile(os.path.join(dir,files),engine='openpyxl')
print('Path of File: ', os.path.join(dir,files))
print('Student: ', pd.read_excel(file_1, sheet_name=0).iloc[0,1])
sheets_names = ['Yr1', 'Yr2', 'Yr3','Subjects']
for names in sheets_names:
sheet = file_1.sheet_names.index(names)
print('Sheet: ', file_1.sheet_names[sheet])
file_original = pd.read_excel(file_1, sheet_name=sheet,engine='openpyxl')
file_copy = file_original.copy()
print(file_copy.columns)
for i in range(len(file_copy)):
file_copy.loc[i,'Grades'] = grades[file_copy.loc[i,'Grade']]
#sortedby=file_copy.sort_values(file_copy.loc[:,'Course Code'])
# Rows Repeated
dupli = file_copy.loc[file_copy.duplicated(['Course Code'], keep='first')].reset_index()
cont = 0
rows_to_delete = []
for i in range(int(len(dupli))):
if dupli.loc[cont,'Grades'] >= dupli.loc[cont+1,'Grades']:
rows_to_delete.append(dupli.loc[cont+1,'index'])
else:
rows_to_delete.append(dupli.loc[cont,'index'])
cont += 2
file_copy.drop(index=rows_to_delete, inplace=True)
file_copy.reset_index(drop=True, inplace=True)
# subject credit (all last digits of Coruse Code)
file_copy.loc[:,'subject_credits'] = [int(i[-1]) for i in file_copy.loc[:,'Course Code']]
print('')
file_copy.loc[:,'Credit_Multiplier'] = [str(i[0:4]) for i in file_copy.loc[:,'Course Code']]
credits=[("AMAT",1),("BFIN",1),("DELT",0),("ELEC",1),("MGMT",1),("PHYS",1),("PMAT",1),("COST",1),("MAPS",1),("COSC",1),("STAT",1),("BOTA",1)]
file_copy.loc[:,'course_unit']=file_copy.loc[:,'Credit_Multiplier'].map(dict(credits))
file_copy.loc[:,'subject_credits'].sum()#need to give specific cell
for index in range(len(file_copy)):
#total_credits[index] = last digit of Course Code
gpv= file_copy.loc[:,'Grades']*file_copy.loc[:,'subject_credits'] *file_copy.loc[:,'course_unit']
GPA=gpv/file_copy.loc[:,'subject_credits'].sum()
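One likely cause of the symptom, though the posted code alone does not confirm it, is that the per-file and per-sheet work runs after the loops have finished, when `files` and `names` only hold their last values. A minimal sketch of the calculation as a function that is called once per table and whose results are collected in a dict is shown below. The grade scale `GRADE_POINTS`, the function names and the sample frames are assumptions made for illustration; in the real program the frames would come from `pd.read_excel` inside the loop over the files.

```python
import pandas as pd

# Hypothetical grade scale; the original `grades` mapping is not shown.
GRADE_POINTS = {"A": 4.0, "B": 3.0, "C": 2.0, "D": 1.0, "E": 0.0}
# Course-code prefix -> credit multiplier, copied from the `credits` list above.
CREDIT_MULTIPLIER = {
    "AMAT": 1, "BFIN": 1, "DELT": 0, "ELEC": 1, "MGMT": 1, "PHYS": 1,
    "PMAT": 1, "COST": 1, "MAPS": 1, "COSC": 1, "STAT": 1, "BOTA": 1,
}


def compute_gpa(sheet):
    """GPA = sum(grade point * credit * multiplier) / sum(credit)."""
    df = sheet.copy()
    df["Grades"] = df["Grade"].map(GRADE_POINTS)
    # Keep only the best attempt per course.
    df = df.sort_values("Grades", ascending=False).drop_duplicates("Course Code", keep="first")
    # The last digit of the course code is the subject credit.
    df["subject_credits"] = df["Course Code"].str[-1].astype(int)
    df["course_unit"] = df["Course Code"].str[:4].map(CREDIT_MULTIPLIER)
    weighted = (df["Grades"] * df["subject_credits"] * df["course_unit"]).sum()
    return weighted / df["subject_credits"].sum()


def gpa_for_all(frames):
    """Run the calculation inside the loop and keep one result per key."""
    results = {}
    for name, frame in frames.items():
        results[name] = compute_gpa(frame)
    return results


# Made-up sample data standing in for two workbooks.
student_1 = pd.DataFrame({
    "Course Code": ["AMAT1013", "AMAT1013", "DELT1012", "PHYS1022"],
    "Grade": ["C", "A", "B", "B"],
})
student_2 = pd.DataFrame({"Course Code": ["COSC1012"], "Grade": ["A"]})
results = gpa_for_all({"student_1.xlsx": student_1, "student_2.xlsx": student_2})
print(results)
```

As in the posted code, the divisor is the sum of all subject credits, including courses whose multiplier is 0; whether that matches the intended rule is not stated in the question.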
I want to count the rows & columns present in an Excel table using a loop, and the loop should stop running on encountering 2 (or more) consecutive empty cells.
row_count=0
col_count=0
for a in range(1,temp.max_row+1):
    if temp.cell(row=a,column=2).value!=None:
        row_count= row_count+1
    else:
        break
for b in range(2,temp.max_column+1):
    if temp.cell(row=8,column=b).value!=None:
        col_count=col_count+1
    else:
        break
print(row_count)
print(col_count)
However, I cant get a correct result with the method used.
You need to check two adjacent cells in each iteration:
for a in range(1,temp.max_row+1):
    # Check if the cell is not empty
    if temp.cell(row=a, column=2).value is not None:
        row_count = row_count + 1
    # Else the cell is empty, check if the next one is also empty
    elif temp.cell(row=a + 1, column=2).value is None:
        break
for b in range(2,temp.max_column+1):
    if temp.cell(row=8, column=b).value is not None:
        col_count = col_count + 1
    elif temp.cell(row=8, column=b + 1).value is None:
        break
Thanks for answering my question. I made a few changes, and the code below seems to work best for my data. (The purpose of the code is to measure the dimensions of a table in an Excel sheet that contains multiple such tables, so it has to stop counting when it encounters, say, 2 consecutive empty cells, before it reaches another table.)
# A) For 2 consecutive empty cells
# for rows:
for a in range(4, temp.max_row+1):
    # stop counting and exit the loop once the current cell and the following cell are both empty
    if temp.cell(row=a, column=2).value is None and temp.cell(row=a+1, column=2).value is None:
        break
    else:
        row_count = row_count + 1
# for columns:
for b in range(2, temp.max_column+1):
    if temp.cell(row=20, column=b).value is None and temp.cell(row=20, column=b+1).value is None:
        break
    else:
        col_count = col_count + 1
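The same stopping rule can be written once and reused for rows and columns. The sketch below works on a plain list of cell values so it can be tried without a workbook; `span_before_gap` is a hypothetical helper, and with openpyxl the list could be built from something like `[temp.cell(row=r, column=2).value for r in range(4, temp.max_row + 1)]`. Unlike the loops above, it reports the position of the last filled cell, so a single trailing empty cell is not counted.

```python
def span_before_gap(values, gap=2):
    """Return how many cells the table spans before `gap` consecutive empty cells.

    Isolated empty cells inside the table are counted as part of it;
    trailing empty cells are not.
    """
    empty_run = 0
    last_filled = 0
    for position, value in enumerate(values, start=1):
        if value is None:
            empty_run += 1
            if empty_run >= gap:
                break  # reached the gap that separates two tables
        else:
            empty_run = 0
            last_filled = position
    return last_filled


# Made-up column: a table with one blank cell, a two-cell gap, then another table.
column = ["h", 1, None, 3, None, None, "next table"]
print(span_before_gap(column))
```

Raising `gap` makes the function tolerate longer blank runs inside a table.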
I'm trying to write a script that will pull one row of excel at a time and print it. I would like to use a method to change the row. I am able to get the value of the row to change (variable rrowx) but when I print the currentRow string, I get the original row and not the newly adjusted row.
import xlrd
class Loader(object):  ## engine to load and unload spread sheets
    ## then sets them to a variable
    # set the variables
    workbook = " "  # name of the file
    sheetCount = 0  # amount of sheets in the spreadsheet
    sheetNumber = 0  # current sheet (index)
    rowCount = 0  # amount of rows in the spreadsheet
    currentSheet = " "  # name of current sheet
    topRow = " "  # row 0 string
    currentRow = " "  # row x string
    global rrowx
    rrowx = 0
    # begin the load
    workbook = xlrd.open_workbook('test.xlsx')
    sheetCount = workbook.nsheets
    sheetNames = workbook.sheet_names()
    currentSheet = workbook.sheet_by_index(sheetNumber)
    #topRow = currentSheet.row_values(rowx=rrowx, start_colx=scolx, end_colx=ecolx)
    currentRow = currentSheet.row_values(rowx=rrowx)
    # methods to navigate the sheet
    def nextrow(self):
        global rrowx
        print(rrowx)
        rrowx += 1
        print(rrowx)
        return rrowx
spreadsheet = Loader()
## Debugging prints
print(spreadsheet.sheetNames)
print(spreadsheet.sheetCount)
print("What Sheet would you like to use? (Use numbers)")
spreadsheetadjust = int(input())  # takes input as an integer
spreadsheet.currentSheet = spreadsheetadjust - 1  # takes input and -1 for index value
print ('Current sheet name: %s' % spreadsheet.currentSheet)  # prints current sheet name
print('top row:')
#print(spreadsheet.topRow)
print('row 1 ')
print(spreadsheet.currentRow)
print("NextRow")
spreadsheet.nextrow()
print(spreadsheet.currentRow)
I thought after changing the rrowx variable and calling print again on the currentRow would change the row that is printed. But instead I am getting the same row printed twice, even though I can see the value of rrowx is changing with the prints I added in the method.
disclosure: I've only been programming for a month so sorry if this is a easy answer i'm just missing.
I highly recommend reading up on the basics of object-oriented Python. Your code has several issues; I will mention some:
- Your variables are class variables, meaning all instances of the class share them. If you create multiple instances of your class, you will get unexpected or undesired behavior.
- currentRow is computed once, when the class body runs. Changing rrowx afterwards does not recompute it, which is why the same row is printed twice.
- Using a global variable is not recommended, especially when you can do without it.
- You don't need to pre-declare variables with placeholder values in Python.
- Your implementation has spreadsheet.currentSheet = spreadsheetadjust - 1, which will cause it to fail even once the original problem is fixed, because it replaces the sheet object with an integer. You want spreadsheet.sheetNumber = spreadsheetadjust - 1.
Here is code that works with proper usage of python classes:
import xlrd
class Loader:
    def __init__(self, path_to_xlsx='test.xlsx'):
        self.sheetNumber = 0
        self.rrowx = 0
        self.workbook = xlrd.open_workbook(path_to_xlsx)
        self.sheetCount = self.workbook.nsheets
        self.sheetNames = self.workbook.sheet_names()
        self.currentSheet = self.workbook.sheet_by_index(self.sheetNumber)
        self.currentRow = self.currentSheet.row_values(rowx=self.rrowx)
    def nextrow(self):
        self.rrowx += 1
        self.currentRow = self.currentSheet.row_values(rowx=self.rrowx)
spreadsheet = Loader()
print("What Sheet would you like to use? (Use numbers)")
spreadsheetadjust = int(input())  # takes input as an integer
spreadsheet.sheetNumber = spreadsheetadjust - 1  # takes input and -1 for index value
print('Current sheet name: %s' % spreadsheet.currentSheet)  # prints current sheet name
print(spreadsheet.currentRow)
spreadsheet.nextrow()
print(spreadsheet.currentRow)
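Note that in the code above, changing sheetNumber after construction still leaves currentSheet and currentRow pointing at the first sheet, because those are only assigned in __init__ and nextrow. One way around this, sketched here with nested lists instead of xlrd so that it runs without a spreadsheet, is to compute the current row on demand with a property. `RowCursor` and its method names are made up for the illustration and are not part of xlrd.

```python
class RowCursor:
    """Stand-in for the Loader class, using nested lists instead of xlrd sheets."""

    def __init__(self, sheets):
        # Instance attributes: each RowCursor object gets its own copies.
        self.sheets = sheets
        self.sheet_number = 0
        self.row_index = 0

    @property
    def current_row(self):
        # Looked up on every access, so it always reflects the current indices.
        return self.sheets[self.sheet_number][self.row_index]

    def next_row(self):
        self.row_index += 1
        return self.current_row

    def select_sheet(self, sheet_number):
        # Switching sheets restarts at that sheet's first row.
        self.sheet_number = sheet_number
        self.row_index = 0


# Two made-up "sheets", each a list of rows.
cursor = RowCursor([[["a", 1], ["b", 2]], [["x", 9], ["y", 8]]])
first = cursor.current_row
second = cursor.next_row()
cursor.select_sheet(1)
other = cursor.current_row
print(first, second, other)
```

Because nothing is cached, there is no stale value to forget to update when the row or the sheet changes.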
I am just starting to learn Python and am looking for some direction on a script I am working on to text out daily pickups for my drivers. The vendor name is entered into a spreadsheet along with a purchase order # and notes. What I would like to do is cycle through column "A", find all instances of a vendor name, grab the corresponding B and C cell values, and save all the info to a text file. I can get it to work if I name the search string explicitly, but not if it's a variable. Here is what I have so far:
TestList=[]
TestDict= {}
LineNumber = 0
for i in range(1, maxrow + 1):
    VendorName = sheet.cell(row = i, column = 1)
    if VendorName.value == "CERTIFIED LETTERING":#here is where im lost
        #print (VendorName.coordinate)
        VendLoc = str(VendorName.coordinate)
        TestList.append(VendLoc)
        TestDict[VendorName.value]=[TestList]
test = (TestDict["CERTIFIED LETTERING"][0])
ListLength = (len(test))
ListPo = []
List_Notes = []
number = 0
for i in range (0, ListLength):
    PO = (str('B'+ test[number][1]))
    Note = (str('C'+ test[number][1]))
    ListPo.append(PO)
    List_Notes.append(Note)
    number = number + 1
number = 0
TestVend =(str(VendorName.value))
sonnetFile = open('testsaveforpickups.txt', 'w')
sonnetFile.write("Pick up at:" + '\n')
sonnetFile.write(str(VendorName.value)+'\n')
for i in range (0, ListLength):
    sonnetFile.write ("PO# "+ str(sheet[ListPo[number]].value)+'\n'
                      +"NOTES: " + str(sheet[List_Notes[number]].value)+'\n')
    number = number + 1
sonnetFile.close()
the results are as follows:
Pick up at:
CERTIFIED LETTERING
PO# 1111111-00
NOTES: aaa
PO# 333333-00
NOTES: ccc
PO# 555555-00
NOTES: eee
I've tried everything I could think of to change the hard-coded string "CERTIFIED LETTERING" to a variable, including creating a list of all vendors in column A and using that as a dictionary to go off of. Any help or ideas to point me in the right direction would be appreciated. And I apologise for any formatting errors; I'm new to posting here.
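One way to avoid the hard-coded name, offered only as a sketch, is to group every row by whatever vendor appears in column A and then format each group. The example uses plain tuples so it runs without a workbook; with openpyxl the rows could come from `sheet.iter_rows(min_col=1, max_col=3, values_only=True)`. The function names and the vendor "ACME" are made up for the illustration.

```python
from collections import defaultdict


def group_by_vendor(rows):
    """Collect (PO, notes) pairs per vendor from (vendor, PO, notes) rows."""
    pickups = defaultdict(list)
    for vendor, po, notes in rows:
        if vendor:  # skip rows with an empty vendor cell
            pickups[vendor].append((po, notes))
    return dict(pickups)


def format_pickup(vendor, entries):
    """Build the text block for one vendor in the layout shown above."""
    lines = ["Pick up at:", str(vendor)]
    for po, notes in entries:
        lines.append(f"PO# {po}")
        lines.append(f"NOTES: {notes}")
    return "\n".join(lines) + "\n"


# Made-up rows standing in for columns A to C of the sheet.
rows = [
    ("CERTIFIED LETTERING", "1111111-00", "aaa"),
    ("ACME", "222222-00", "bbb"),
    ("CERTIFIED LETTERING", "333333-00", "ccc"),
]
pickups = group_by_vendor(rows)
vendor_name = "CERTIFIED LETTERING"  # could just as well come from input()
text = format_pickup(vendor_name, pickups[vendor_name])
print(text)
```

Looping over `pickups.items()` would produce one block per vendor, which can then be written to the text file.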
I am learning Python and trying to build a scraper to glean parts data from a supplier's site. My issue now is that I am getting different column counts from my parsed table rows, even though I know that each row has the same column count. The issue has to be something I am overlooking, and after two days of trying different things I am asking for a few more sets of eyes on my code to locate my error. Not having much Python coding experience is no doubt my biggest hurdle.
First, the data. Rather than paste the html I have stored in my database, I'll give you a link to the live site I have crawled and stored in my db. The first link is this one.
The issue is that I get mostly correct results. However, every so often I get the values skewed in the column count. I can't seem to locate the cause.
Here is an example of the flawed result:
----------------------------------------------------------------------------------
Record: 1 Section:Passenger / Light Truck Make: ACURA SubMake:
Model: CL SubModel: Year: 1997 Engine: L4 1.6L 1590cc
----------------------------------------------------------------------------------
Rec:1 Row 6 Col 1 part Air Filter
Rec:1 Row 6 Col 2 2
Rec:1 Row 6 Col 3 part_no 46395
Rec:1 Row 6 Col 4 filter_loc
Rec:1 Row 6 Col 5 engine
Rec:1 Row 6 Col 6 vin_code V6 3.0L 2997cc
Rec:1 Row 6 Col 7 comment Engine Code J30A1
** Note that the engine value has been shifted to the vin_code field.
And proof it works some of the time:
Record: 2 Section:Passenger / Light Truck Make: ACURA SubMake:
Model: CL SubModel: Year: 1998 Engine: L4 1.6L 1590cc
----------------------------------------------------------------------------------
Rec:3 Row 4 Col 1 part Oil Filter
Rec:3 Row 4 Col 2 2
Rec:3 Row 4 Col 3 part_no 51334
Rec:3 Row 4 Col 4 filter_loc
Rec:3 Row 4 Col 5 engine L4 2.3L 2254cc
Rec:3 Row 4 Col 6 vin_code
Rec:3 Row 4 Col 7 comment Engine Code F23A1
** Note the fields line up in this record...
I suspect either there is something in the table cells my parser is not looking for or I have missed something trivial.
Here is the important portion of my code:
# Per Query
while records:
    # Per Query Loop
    #print str(records)
    for record in records:
        print 'Record Count:'+str(rec_cnt)
        items = ()
        item = {}
        source = record['doc']
        page = html.fromstring(source)
        for rows in page.xpath('//div/table'):
            #records = []
            item = {}
            cntx = 0
            for row in list(rows):
                cnty = 0 # Column Counter
                found_oil = 0 # Found oil filter record flag
                data = {} # Data
                # Data fields
                field_data = {'part':'', 'part_no':'', 'filter_loc':'', 'engine':'', 'vin_code':'', 'comment':'', 'year':''}
                print
                print '----------------------------------------------------------------------------------'
                print 'Record: '+str(record['id']), 'Section:'+str(record['section']), 'Make: '+str(record['make']), 'SubMake: '+str(record['submake'])
                print 'Model: '+str(record['model']), 'SubModel: '+str(record['submodel']), 'Year: '+str(record['year']), 'Engine: '+str(record['engine'])
                print '----------------------------------------------------------------------------------'
                #
                # Rules for extracting data columns
                # 1. First column always has a link to the bullet image
                # 2. Second column is part name
                # 3. Third column always empty
                # 4. Fourth column is part number
                # 5. Fifth column is empty
                # 6. Sixth column is part location
                # 7. Seventh column is always empty
                # 8. Eighth column is engine size
                # 9. Ninth column is vin code
                # 10. Tenth column is comment
                # 11. Eleventh column does not exist.
                #
                for column in row.xpath('./td[@class="blackmedium"][text()="0xa0"] | ./td[@class="blackmedium"][text()="\n"]/text() | ./td[@class="blackmeduim"]/img[@src]/text() | ./td[@class="blackmedium"][text()=""]/text() | ./td[@class="blackmedium"]/b/text() | ./td[@class="blackmedium"]/a/text() |./td[@class="blackmedium"]/text() | ./td[@class="blackmedium"][text()=" "]/text() | ./td[@class="blackmedium"][text()=" "]/text() | ./td[@class="blackmedium"][text()=None]/text()'):
                    #' | ./td[position()>1]/a/text() | ./td[position()>1]/text() | self::node()[position()=1]/td/text()'):
                    cnty+=1
                    if ('Oil Filter' == column.strip() or 'Air Filter' == column.strip()) and found_oil == 0:
                        found_oil = 1
                    if found_oil == 1:
                        print 'Rec:'+str(rec_cnt), 'Row '+str(cntx), 'Col '+str(cnty), _fields[cnty], column.strip()
                        #cnty+= 1
                        #print
                    else:
                        print 'Rec: '+str(rec_cnt), 'Col: '+str(cnty)
                    field_data[ str(_fields[cnty]) ] = str(column.strip())
                    #cnty = cnty+1
                # Save data to db dest table
                if found_oil == 1:
                    data['source_id'] = record['id']
                    data['section_id'] = record['section_id']
                    data['section'] = record['section']
                    data['make_id'] = record['make_id']
                    data['make'] = record['make']
                    data['submake_id'] = record['submake_id']
                    data['submake'] = record['submake']
                    data['model_id'] = record['model_id']
                    data['model'] = record['model']
                    data['submodel_id'] = record['submodel_id']
                    data['submodel'] = record['submodel']
                    data['year_id'] = record['year_id']
                    data['year'] = record['year']
                    data['engine_id'] = record['engine_id']
                    data['engine'] = record['engine']
                    data['part'] = field_data['part']
                    data['part_no'] = field_data['part_no']
                    data['filter_loc'] = field_data['filter_loc']
                    data['vin_code'] = field_data['vin_code']
                    data['comment'] = conn.escape_string(field_data['comment'])
                    data['url'] = record['url']
                    save_data(data)
                    print 'Field Data:'
                    print field_data
                cntx+=1
        rec_cnt+=1
    #End main per query loop
    delay() # delay if wait was passed on cmd line
    records = get_data()
    has_offset = 1
#End Queries
Thank you all for your help and your eyes...
Usually when I run into a problem like this, I do two things:
Break the problem down into smaller chunks. Use python functions or classes to perform subsets of functionality so that you can test the functions individually for correctness.
Use the Python Debugger to inspect the code as it runs to understand where it's failing. For example, in this case, I would add import pdb; pdb.set_trace() before the line that says cnty+=1.
Then, when the code runs, you'll get an interactive interpreter at that point and you can inspect the various variables and discover why you're not getting what you expect.
A couple of tips for using pdb:
- Use c to allow the program to continue (until the next breakpoint or set_trace).
- Use n to step to the next line in the program.
- Use q to raise an exception (and usually abort).
Can you share the details of your scraping process? The intermittent failures could be caused by the way the HTML data is parsed.
The problem seems to be that your xpath expression searches for text nodes. No matches are found for empty cells, causing your code to "skip" columns. Try iterating over the td elements themselves, and then "look down" from the element to its contents. To get you started:
# just iterate over child elements of the row, which are always td
# use enumerate to easily get a counter for the columns
for col_no, td in enumerate(row, start=1):
    # use the xpath function string() to get the string value for the element
    # this will yield an empty string for empty elements
    print col_no, td.xpath('string()')
Note that the string() XPath function may in some cases be too simple for what you want. In your example, you may find something like <td><a>51334</a><sup>53</sup></td> (see oil filter). My example would give you "5133453", where you seem to need "51334". I'm not sure whether that was intentional or whether you hadn't noticed the "missing" part; if you want only the text in the hyperlink, use td.findtext('a').
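The difference between collecting text nodes and iterating over the cells can be reproduced on a small fragment. The sketch below uses xml.etree.ElementTree from the standard library rather than lxml, and Python 3 rather than the Python 2 used above, purely so that it runs without extra packages; the row markup is a made-up example modelled on the snippet just mentioned, and `"".join(td.itertext())` plays the role of `td.xpath('string()')`.

```python
import xml.etree.ElementTree as ET

# Made-up row: two of the five cells are empty.
ROW = (
    "<tr><td>Oil Filter</td><td></td>"
    "<td><a>51334</a><sup>53</sup></td><td></td>"
    "<td>L4 2.3L 2254cc</td></tr>"
)


def cell_texts(row):
    """Return one string per td, using an empty string for empty cells."""
    return ["".join(td.itertext()).strip() for td in row.findall("td")]


row = ET.fromstring(ROW)

# Collecting text nodes only: empty cells contribute nothing, so columns shift.
texts_only = [text.strip() for text in row.itertext() if text.strip()]

# Iterating over the cells: the column count stays stable.
cells = cell_texts(row)

# Only the hyperlink text of the part-number cell, without the footnote.
part_no = row.findall("td")[2].findtext("a")

print(texts_only)
print(cells)
print(part_no)
```

With the text-node approach the engine value ends up in the fourth position instead of the fifth, which is the kind of shift described in the question.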
I want to thank everyone who has given aid to me these past few days. All your input has resulted in a working application that I am now using. I wanted to post the resulting changes to my code so those who look here may find an answer or at least information on how they may also tackle their issue. Below is the rewritten portion of my code that solved the issues I was having:
#
# get_column_index()
# returns a dict of column names/column number pairs
#
def get_column_index(row):
    index = {}
    col_no = 0
    td = None
    name = ''
    for col_no, td in enumerate(row, start=0):
        mystr = str(td.xpath('string()').encode('ascii', 'replace'))
        name = str.lower(mystr).replace(' ', '_')
        idx = name.replace('.', '')
        index[idx] = col_no
    if int(options.verbose) > 2:
        print 'Field Index:', str(index)
    return index
def run():
    global has_offset
    records = get_data()
    #print 'Records', records
    rec_cnt = 0
    # Per Query
    while records:
        # Per Query Loop
        #print str(records)
        for record in records:
            if int(options.verbose) > 0:
                print 'Record Count:'+str(rec_cnt)
            items = ()
            item = {}
            source = record['doc']
            page = html.fromstring(source)
            col_index = {}
            for rows in page.xpath('//div/table'):
                #records = []
                item = {}
                cntx = 0
                for row in list(rows):
                    data = {} # Data
                    found_oil = 0 #found proper part flag
                    # Data fields
                    field_data = {'part':'', 'part_no':'', 'part_note':'', 'filter_loc':'', 'engine':'', 'vin_code':'', 'comment':'', 'year':''}
                    if int(options.verbose) > 0:
                        print
                        print '----------------------------------------------------------------------------------'
                        print 'Row'+str(cntx), 'Record: '+str(record['id']), 'Section:'+str(record['section']), 'Make: '+str(record['make']), 'SubMake: '+str(record['submake'])
                        print 'Model: '+str(record['model']), 'SubModel: '+str(record['submodel']), 'Year: '+str(record['year']), 'Engine: '+str(record['engine'])
                        print '----------------------------------------------------------------------------------'
                    # get column indexes
                    if cntx == 1:
                        col_index = get_column_index(row)
                    if col_index != None and cntx > 1:
                        found_oil = 0
                        for col_no, td in enumerate(row):
                            if ('part' in col_index) and (col_no == col_index['part']):
                                part = td.xpath('string()').strip()
                                if 'Oil Filter' == part or 'Air Filter' == part or 'Fuel Filter' == part or 'Transmission Filter' == part:
                                    found_oil = 1
                                field_data['part'] = td.xpath('string()').strip()
                            # Part Number
                            if ('part_no' in col_index) and (col_no == col_index['part_no']):
                                field_data['part_no'] = str(td.xpath('./a/text()')).strip().replace('[', '').replace(']', '').replace("'", '')
                                field_data['part_note'] = str(td.xpath('./sup/text()')).strip().replace('[', '').replace(']', '').replace("'", '')
                            # Filter Location
                            if ('filterloc' in col_index) and (col_no == col_index['filterloc']):
                                field_data['filter_loc'] = td.xpath('string()').strip()
                            # Engine
                            if ('engine' in col_index) and (col_no == col_index['engine']):
                                field_data['engine'] = td.xpath('string()').strip()
                            if ('vin_code' in col_index) and (col_no == col_index['vin_code']):
                                field_data['vin_code'] = td.xpath('string()').strip()
                            if ('comment' in col_index) and (col_no == col_index['comment']):
                                field_data['comment'] = td.xpath('string()').strip()
                        if int(options.verbose) == 0:
                            print ','
                        if int(options.verbose) > 0:
                            print 'Field Data: ', str(field_data)
                        elif int(options.verbose) == 0:
                            print '.'
                        # Save data to db dest table
                        if found_oil == 1:
                            data['source_id'] = record['id']
                            data['section_id'] = record['section_id']
                            data['section'] = record['section']
                            data['make_id'] = record['make_id']
                            data['make'] = record['make']
                            data['submake_id'] = record['submake_id']
                            data['submake'] = record['submake']
                            data['model_id'] = record['model_id']
                            data['model'] = record['model']
                            data['submodel_id'] = record['submodel_id']
                            data['submodel'] = record['submodel']
                            data['year_id'] = record['year_id']
                            data['year'] = record['year']
                            data['engine_id'] = record['engine_id']
                            data['engine'] = field_data['engine'] #record['engine']
                            data['part'] = field_data['part']
                            data['part_no'] = field_data['part_no']
                            data['part_note'] = field_data['part_note']
                            data['filter_loc'] = field_data['filter_loc']
                            data['vin_code'] = field_data['vin_code']
                            data['comment'] = conn.escape_string(field_data['comment'])
                            data['url'] = record['url']
                            save_data(data)
                            found_oil = 0
                        if int(options.verbose) > 2:
                            print 'Data:', str(data)
                    cntx+=1
            rec_cnt+=1
        #End main per query loop
        delay() # delay if wait was passed on cmd line
        records = get_data()
        has_offset = 1
    #End Queries