Break a Python loop when it finds 2 consecutive empty cells

I want to count the rows & columns present in an Excel table using a loop, and the loop should stop running on encountering 2 (or more) consecutive empty cells.
row_count = 0
col_count = 0
for a in range(1, temp.max_row + 1):
    if temp.cell(row=a, column=2).value != None:
        row_count = row_count + 1
    else:
        break
for b in range(2, temp.max_column + 1):
    if temp.cell(row=8, column=b).value != None:
        col_count = col_count + 1
    else:
        break
print(row_count)
print(col_count)
However, I can't get a correct result with this method.

You need to check two adjacent cells in each iteration:
for a in range(1, temp.max_row + 1):
    # Check if the cell is not empty
    if temp.cell(row=a, column=2).value is not None:
        row_count = row_count + 1
    # Else the cell is empty, so check whether the next one is also empty
    elif temp.cell(row=a + 1, column=2).value is None:
        break
for b in range(2, temp.max_column + 1):
    if temp.cell(row=8, column=b).value is not None:
        col_count = col_count + 1
    elif temp.cell(row=8, column=b + 1).value is None:
        break

Thanks for answering my question. I made a few changes, and the code below seems to work best for my data. (The purpose of the code is to measure the dimensions of a table in an Excel sheet; the sheet contains multiple such tables, so it has to stop counting if it encounters, say, 2 consecutive empty cells before it reaches another table.)
# A) For 2 consecutive empty cells
# For rows:
for a in range(4, temp.max_row + 1):
    # If the current cell and the following cell are both empty,
    # stop counting and exit the loop.
    if temp.cell(row=a, column=2).value == None and temp.cell(row=a + 1, column=2).value == None:
        break
    else:
        row_count = row_count + 1
# For columns:
for b in range(2, temp.max_column + 1):
    if temp.cell(row=20, column=b).value == None and temp.cell(row=20, column=b + 1).value == None:
        break
    else:
        col_count = col_count + 1


Openpyxl - combine matching rows of two tables into one long row

In an Excel file I have two large tables. Table A ("Dissection", 409 rows x 25 cols) contains unique entries, each identified by a unique ID. Table B ("Damage", 234 rows x 39 cols) uses the ID of Table A in the first cell and extends it. To analyze the data in Minitab, all data must be in a single long row, meaning the values of "Damage" have to follow "Dissection". The whole thing looks like this:
Table A - i.e. Dissection
- ID1 [valueTabA] [valueTabA]
- ID2 [valueTabA] [valueTabA]
- ID3 [valueTabA] [valueTabA]
- ID4 [valueTabA] [valueTabA]
Table B - i.e. Damage
- ID1 [valueTabB1] [valueTabB1]
- ID1 [valueTabB2] [valueTabB2]
- ID4 [valueTabB] [valueTabB]
They are supposed to combine something like this:
Table A
- ID1 [valueTabA] [valueTabA] [valueTabB1] [valueTabB1] [valueTabB2] [valueTabB2]
- ID2 [valueTabA] [valueTabA]
- ID3 [valueTabA] [valueTabA]
- ID4 [valueTabA] [valueTabA] [valueTabB] [valueTabB]
What is the best way to do that?
The following describes my two approaches. Both use the same data in the same tables but in two different files, to be able to test both scenarios.
The first approach uses a file, where both tables are in the same worksheet, the second uses a file where both tables are in different worksheets.
Scenario: both tables are in the same worksheet, where I'm trying to move the row as a range
current_row = 415  # first data row of table B (Damage), skipping the headers
current_line = 2  # first data row of table A (Dissection), skipping the headers
for row in ws.iter_rows(min_row=415, max_row=647):
    # loop through damage
    id_A = ws.cell(row=current_row, column=1).value
    max_col = 25
    for line in ws.iter_rows(min_row=2, max_row=409):
        # loop through dissection
        id_B = ws.cell(row=current_line, column=1).value
        if id_A == id_B:
            copy_range = ((ws.cell(row=current_line, column=2)).column_letter + str(current_line) + ":" +
                          (ws.cell(row=current_line, column=39)).column_letter + str(current_line))
            ws.move_range(copy_range, rows=current_row, cols=max_col + 1)
            print("copied range: " + copy_range + " to: " + str(current_row) + ":" + str(max_col + 1))
            count += 1
            break
        if current_line > 409:
            current_line = 2
        else:
            current_line += 1
    current_row += 1
-> Here I'm struggling to append the range to the right row of Table A, without overwriting the previous row (see example ID1 above)
Scenario: both tables are located in separate sheets
dissection = wb["Dissection"]
damage = wb["Damage"]
recovery = wb["Recovery"]
current_row, current_line = 2, 2
for row in damage.iter_rows():
    # loop through the first table
    id_A = damage.cell(row=current_row, column=1).value
    for line in dissection.iter_rows():
        # loop through the second table
        id_B = dissection.cell(row=current_line, column=1).value
        copyData = []
        if id_A == id_B:
            for col in range(2, 39):
                # add data to the list, skipping the ID
                copyData.append(damage.cell(row=current_line, column=col).value)
            # print(copyData)  # for debugging purposes
            for item in copyData:
                column_count = dissection.max_column
                dissection.cell(row=current_row, column=column_count).value = item
                column_count += 1
            current_row += 1
            break
        if not current_line > 409:
            # prevent looping out of range
            current_line += 1
        else:
            current_line = 2
-> Same problem as in 1.: at some point it stops adding the damage values to copyData and adds None instead, and finally it just doesn't paste the items (the cells stay blank)
I've tried everything Excel-related that I could find, but unfortunately nothing worked. Would pandas be more useful here, or am I just not seeing something?
Thanks for taking the time to read this :)
I highly recommend using pandas for situations like this. It is still a bit unclear how your data is formatted in the Excel file, but given your second option I assume that the tables are on different sheets of the Excel file. I also assume that the first row contains the table title (e.g. Table A - i.e. Dissection). If this is not the case, just remove skiprows=1:
import pandas as pd
df = pd.concat(pd.read_excel("filename.xlsx", sheet_name=None, skiprows=1, header=None), axis=1, ignore_index=True)
df.to_excel('combined_data.xlsx')  # save to Excel
read_excel will load the excel file into a pandas dataframe. sheet_name=None indicates that all sheets should be loaded into an OrderedDict of dataframes. pd.concat will concatenate these dataframes into one single dataframe (axis=1 indicates the axis). You can explore the data with df.head(), or save the dataframe to excel with df.to_excel.
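One caveat worth flagging: concat with axis=1 simply places the sheets side by side and does not align rows by ID. If the IDs need to match up (as in the example above, where ID1 has two Damage rows that should become one long row), a groupby-based sketch like the following might be closer; the sheet names and the assumption that the ID sits in the first column of both tables are mine, not from the question.
import pandas as pd

sheets = pd.read_excel("filename.xlsx", sheet_name=["Dissection", "Damage"])
df_a, df_b = sheets["Dissection"], sheets["Damage"]
id_col = df_a.columns[0]  # assumption: the ID is the first column in both tables

# flatten all Damage rows sharing an ID into one wide row (NaN-padded)
b_wide = (df_b.set_index(id_col)
              .groupby(level=0)
              .apply(lambda g: pd.Series(g.to_numpy().ravel())))

combined = df_a.join(b_wide, on=id_col)  # IDs without Damage rows stay as-is
combined.to_excel("combined_data.xlsx", index=False)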
I ended up using the second scenario (one file, two worksheets), but this code should be adaptable to the first scenario (one file, one worksheet) as well.
I copied the rows of Table B using code taken from here.
And handled the offset with code from here.
Also, I added a few extras to my solution to make it more generic:
import openpyxl, os
from openpyxl.utils import range_boundaries

# Introduction
print("Welcome!\n[!] Advice: Always have a backup of the file you want to sort.\n[+] Please put the file to be sorted in the same directory as this program.")
print("[+] This program assumes that the value to be sorted by is located in the first column of the outgoing table.")

# File listing
while True:
    files = [f for f in os.listdir('.') if os.path.isfile(f)]
    valid_types = ["xlsx", "xltx", "xlt", "xls"]
    print("\n[+] Current directory: " + os.getcwd())
    print("[+] Excel files in the current directory: ")
    for f in files:
        if str(f).split(".")[1] in valid_types:
            print(f)
    file = input("\nWhich file would you like to sort: ")
    try:
        ending = file.split(".")[1]
    except IndexError:
        print("Please only enter Excel files.")
        continue
    if ending in valid_types:
        break
    else:
        print("Please only enter Excel files.")
wb = openpyxl.load_workbook(file)

# Handling worksheets
print("\nAvailable worksheets: " + str(wb.sheetnames))
print("Which sheet would you like to sort? (please copy the name without the parenthesis)")
outgoing_sheet = wb[input("Outgoing sheet: ")]
print("\nAvailable worksheets: " + str(wb.sheetnames))
print("Which is the receiving sheet? (please copy the name without the parenthesis)")
receiving_sheet = wb[input("Receiving sheet: ")]

# Declaring functions
def copy_row(source_range, target_start, source_sheet, target_sheet):
    # Define the start range (target_start) in the new worksheet
    min_col, min_row, max_col, max_row = range_boundaries(target_start)
    # Iterate over the range you want to copy
    for row, row_cells in enumerate(source_sheet[source_range], min_row):
        for column, cell in enumerate(row_cells, min_col):
            # Copy the value from the source cell to the given worksheet cell
            target_sheet.cell(row=row, column=column).value = cell.value

def ask_yes_no(prompt):
    """
    :param prompt: The question to be asked
    :return: Value to check
    """
    while True:
        answer = input(prompt + " (y/n): ")
        if answer == "y":
            return True
        elif answer == "n":
            return False
        print("Please only enter y or n.")

def ask_integer(prompt):
    while True:
        try:
            answer = int(input(prompt + ": "))
            break
        except ValueError:
            print("Please only enter integers (e.g. 1, 2 or 3).")
    return answer

def scan_empty(index):
    print("Scanning for empty cells...")
    scan, fill = False, False
    min_col = outgoing_sheet.min_column
    max_col = outgoing_sheet.max_column
    cols = range(min_col, max_col + 1)
    break_loop = False
    count = 0
    if not scan:
        search_index = index
        for row in outgoing_sheet.iter_rows():
            for n in cols:
                cell = outgoing_sheet.cell(row=search_index, column=n).value
                if cell:
                    pass
                else:
                    choice = ask_yes_no("\n[!] Empty cells found, would you like to fill them? (recommended)")
                    if choice:
                        fill = input("Fill with: ")
                        scan = True
                        break_loop = True
                        break
                    else:
                        print("[!] Attention: This can lead to mismatches in the sorting algorithm.")
                        confirm = ask_yes_no("[>] Are you sure you don't want to fill them?\n[+] Hint: You can also enter spaces.\n(n)o I really don't want to\noka(y) I'll enter something, just let me sort already.\n")
                        if confirm:
                            fill = input("Fill with: ")
                            scan = True
                            break_loop = True
                            break
                        else:
                            print("You have chosen not to fill the empty cells.")
                            scan = True
                            break_loop = True
                            break
            if break_loop:
                break
            search_index += 1
    if fill:
        search_index = index
        for row in outgoing_sheet.iter_rows(max_row=outgoing_sheet.max_row - 1):
            for n in cols:
                cell = outgoing_sheet.cell(row=search_index, column=n).value
                if cell:
                    pass
                elif cell != int(0):
                    count += 1
                    outgoing_sheet.cell(row=search_index, column=n).value = fill
            search_index += 1
        print("Filled " + str(count) + " cells with: " + fill)
    return fill, count

# Declaring basic variables
first_value = ask_yes_no("Is the first row containing values the 2nd in both tables?")
if first_value:
    current_row, current_line = 2, 2
else:
    current_row = ask_integer("Sorting table first row")
    current_line = ask_integer("Receiving table first row")
verbose = ask_yes_no("Verbose output?")
reset = current_line
rec_max = receiving_sheet.max_row
scan_empty(current_row)
count = 0
print("\nSorting: " + str(outgoing_sheet.max_row - 1) + " rows...")
for row in outgoing_sheet.iter_rows():
    # loop through the first table - the table you want to sort
    id_A = outgoing_sheet.cell(row=current_row, column=1).value
    if verbose:
        print("\nCurrently at: " + str(current_row - 1) + "/" + str(outgoing_sheet.max_row - 1))
        try:
            print("Sorting now: " + id_A)
        except TypeError:
            # Handling None type exceptions
            pass
    for line in receiving_sheet.iter_rows():
        # loop through the second table - the receiving table
        id_B = receiving_sheet.cell(row=current_line, column=1).value
        if id_A == id_B:
            try:
                # calculate the offset
                offset = max(cell.column for cell in receiving_sheet[current_line] if cell.value is not None) + 1
            except ValueError:
                # typical "No idea why, but it doesn't work without it" code
                pass
            start_paste_from = receiving_sheet.cell(row=current_line, column=offset).column_letter + str(current_line)
            copy_Range = ((outgoing_sheet.cell(row=current_row, column=2)).column_letter + str(current_row) + ":" +
                          (outgoing_sheet.cell(row=current_row, column=outgoing_sheet.max_column)).column_letter + str(current_row))
            # Don't copy the ID; alternatively set damage.min_column for the first and damage.max_column for the second
            copy_row(copy_Range, start_paste_from, outgoing_sheet, receiving_sheet)
            count += 1
            current_row += 1
            if verbose:
                print("Copied " + copy_Range + " to: " + str(start_paste_from))
            break
        if not current_line > rec_max:
            # prevent looping out of range
            current_line += 1
        else:
            current_line = reset
wb.save(file)
print("\nSorted: " + str(count) + " rows.")
print("Saving the file to: " + os.getcwd())
print("Done.")
Note: The values of table B ("Damage") are sorted according to the ID, although that is not required. However, if you choose to do so, this can be done using pandas.
import pandas as pd
df = pd.read_excel("excel/separated.xlsx", "Damage")
# open the correct worksheet
df = df.sort_values(by="Identification")
df.to_excel("sorted.xlsx")

Counting number of occurrences of string matched to another column

import pandas as pd

data = {'msg': ['i am so happy thank you',
                'sticker omitted',
                'sticker omitted',
                'thank you for your time!',
                'sticker omitted',
                'hello there'],
        'number_of_stickers': ['2', '0', '0', '1', '0', '0']}  # 'number_of_stickers' is the column I am aiming to create; currently, I don't have it
df = pd.DataFrame(data=data)
Above is what I am trying to achieve. I currently do not have the column 'number_of_stickers'; this column is my end goal.
I am trying to count the number of rows with "sticker omitted" and annotate the row above each chain of "sticker omitted" with the number of occurrences, in the new column 'number_of_stickers'.
To give you some context: I am analysing WhatsApp text data, and I thought it would be useful to see how many stickers were sent right after a chat message. This also shows the tonality and sentiment of the conversation.
Update:
I have posted a solution below (credits to @JacoSolari) which works for the problem I'm solving. I added 1-2 lines (an if statement) on top of his code so that we do not face a problem at the end of the dataframe (range issues).
It's a common technique to check for the other values and take the cumsum to identify the blocks:
import numpy as np

omitted = df.msg.ne('sticker omitted').cumsum()
df['number_of_stickers'] = np.where(omitted.duplicated(), 0,
                                    omitted.groupby(omitted).transform('size') - 1)
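To see why this works, here is the same snippet on the sample data with the intermediate values written out as comments (annotations mine):
import numpy as np
import pandas as pd

df = pd.DataFrame({'msg': ['i am so happy thank you', 'sticker omitted',
                           'sticker omitted', 'thank you for your time!',
                           'sticker omitted', 'hello there']})
omitted = df.msg.ne('sticker omitted').cumsum()
# omitted:              1, 1, 1, 2, 2, 3  -> one block id per non-sticker message
# omitted.duplicated(): F, T, T, F, T, F  -> True marks the sticker rows
# block size - 1:       2, 2, 2, 1, 1, 0  -> stickers trailing each message
df['number_of_stickers'] = np.where(omitted.duplicated(), 0,
                                    omitted.groupby(omitted).transform('size') - 1)
print(df['number_of_stickers'].tolist())  # [2, 0, 0, 1, 0, 0]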
This code should do the job. I could not find a solution that only uses pandas functions (it might be possible to do it). Anyway, I left some comments in the code to describe my approach.
# create data
df_dict = {'msg': ['i am so happy thank you',
                   'sticker omitted',
                   'sticker omitted',
                   'thank you for your time!',
                   'sticker omitted']}
df = pd.DataFrame(data=df_dict)
# build a column for sticker counts after each message
sticker_counts = []
for index, row in df.iterrows():  # iterating over df rows
    flag = True
    count = 0
    # when a sticker row is encountered, just put 0 in the count column
    # when a non-sticker row is encountered, do the following
    if row['msg'] != 'sticker omitted':
        k = 1  # to check rows after the non-sticker row
        while flag:
            # if the index + k row is a sticker, increase the count for index and k
            if df.loc[index + k].msg == 'sticker omitted':
                count += 1
                k += 1
                # when we reach the end of the dataframe, break the loop
                if index + k + 1 > len(df):
                    flag = False
            else:
                flag = False
    k = 1
    sticker_counts.append(count)
df['sticker_counts'] = sticker_counts
print(df)
You've actually got it all right so far, and your data is suitable for an easy yet functional algorithm!
Here is a little piece of code I wrote for this problem:
#ss
df = {'msg': ['i am so happy thank you',
              'sticker omitted',
              'sticker omitted',
              'thank you for your time!',
              'sticker omitted'],
      'number_of_stickers': ['2', '0', '0', '1', '0']}
j = 0
newarr = []  # new array for use
for i in df["number_of_stickers"]:
    if not int(i) == 0:
        newarr.append([df["msg"][j], int(i)])  # stores each (msg, count) pair in an array
        # access the count of an entry via newarr[n][1] and the msg via newarr[n][0]
    j += 1
#se
# feel free to do whatever you want between ss and se
pd.DataFrame(data=df)
"se" means snippet end, and "ss" snippet start.
Hope this helps! Just comment below if it doesn't!
Also, you have to feed the new array back into the dict.
I have edited @JacoSolari's code (with the help of a kind soul) to match the needs of the problem I'm trying to solve. The working code is below.
sticker_counts = []
msg_index = 0
for index, row in df.iterrows():  # iterating over df rows
    flag = True
    count = 0
    # when a sticker row is encountered, just put 0 in the count column
    # when a non-sticker row is encountered, do the following
    if row['msg'] != 'sticker omitted':
        k = 1  # to check rows after the non-sticker row
        while flag:
            print(f'i{msg_index} flag{flag} len{len(df)}')
            # if the index + k row is a sticker, increase the count for index and k
            msg_index = index + k
            if msg_index >= len(df):
                break
            if df.loc[msg_index].msg == 'sticker omitted':
                count += 1
                k += 1
                # when we reach the end of the dataframe, break the loop
                if msg_index + 1 > len(df):
                    flag = False
                    print(f'i{msg_index} flag{flag}')
            else:
                flag = False
    k = 1
    sticker_counts.append(count)
df['sticker_counts'] = sticker_counts
print(df)

Python - DataFrame: count rows with condition

I have Python code that collects information from a dataframe (df1), like this:
for ind, data in enumerate(df1.Link):
    print(data)
    result = getInformation(driver, links)
    for i in result['information']:
        df1.loc[ind, "numOfWorkers"] = i["numOfWorkers"]
The output is saved to a dataframe like the one shown in the photo.
Is there any way to update my code so that, before it returns the dataframe, it applies this condition: once we have 2 links with numOfWorkers >= 30, the code breaks and returns the result.
Can someone help?
It would be better to put the logic in the code you already have. I would keep a count of the number of records matching the condition, then exit the loop using break (rather than a while loop):
...
workers_threshold = 30
records_matching_threshold = 0
max_records_for_matching_records = 2
for i in result['information']:
    df1.loc[ind, "numOfWorkers"] = i["numOfWorkers"]
    if i["numOfWorkers"] >= workers_threshold:
        records_matching_threshold += 1
        if records_matching_threshold >= max_records_for_matching_records:
            break
Note that the variable names above are purposefully long to make their purposes clear in my example.
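If the cutoff is meant to stop the outer loop over df1.Link as well (the question reads that way), a sketch along these lines might fit; getInformation, driver, and links are taken from the question as-is, and counting each link at most once is my interpretation of the requirement:
workers_threshold = 30
max_matching_links = 2
matching_links = 0
for ind, data in enumerate(df1.Link):
    result = getInformation(driver, links)
    link_matches = False
    for i in result['information']:
        df1.loc[ind, "numOfWorkers"] = i["numOfWorkers"]
        if i["numOfWorkers"] >= workers_threshold:
            link_matches = True  # this link meets the condition at least once
    if link_matches:
        matching_links += 1
        if matching_links >= max_matching_links:
            break  # two links met the condition: stop and return df1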
i = 2
while i != 0:
    if numOfWorkers >= 30:
        i -= 1

How to identify string repetition throughout rows of a column in a Pandas DataFrame?

I'm trying to think of a way to best handle this. If I have a data frame like this:
Module   | Line Item   | Formula                                         | repetition? | What repeated                                 | Where repeated
Module 1 | Line Item 1 | hello[SUM: hello2]                              | yes         | hello[SUM: hello2]                            | Module 1 Line Item 2
Module 1 | Line Item 2 | goodbye[LOOKUP: blue123] + hello[SUM: hello2]   | yes         | hello[SUM: hello2], goodbye[LOOKUP: blue123]  | Module 1 Line Item 1, Module 2 Line Item 1
Module 2 | Line Item 1 | goodbye[LOOKUP: blue123] + some other line item | yes         | goodbye[LOOKUP: blue123]                      | Module 1 Line Item 2
How would I go about setting up a search and find to locate and identify repetition in the middle or on edges or complete strings?
Basically I have the module, line item, and formula columns filled in, but I need to figure out some sort of search function that I can apply to each of the last 3 columns. I'm not sure where to start with this.
I want to match any repetition of 3 or more words. For example, if a formula was 1 + 2 + 3 + 4 and it occurred 4 times in the Formula column, I'd want to give a yes in the boolean "repetition?" column, return 1 + 2 + 3 + 4 in the "What repeated" column, and list every module/line item combination where it occurred in the last column. I'm sure I can tweak it more to fit my needs once I get started.
This one was a bit messy, and there is surely a more straightforward way to do some of the steps, but it worked for your data.
Step 1: I just reset_index() (assuming index uses row numbers) to get row numbers into a column.
df.reset_index(inplace=True)
I then wrote a for loop whose aim is to check, for each given value, whether that value appears anywhere else in the given column (using the .str.contains() function) and, if so, where, and then store that information in a dictionary. Note that I used + to split the various values you search by, as that looked to be a valid separator in your dataset, but you can adjust this accordingly:
# the dictionary key contains the row number and the value we searched for
# the dictionary value contains the matching Module and Line Item values
result = {}
# create a rownumber variable so we know where in the dataset we are
rownumber = -1
# now we just iterate over every row of the Formula series
for row in df['Formula']:
    rownumber += 1
    # and also every relevant value within that cell
    for value in row.split('+'):
        # we clean the value of trailing/preceding whitespace
        value = value.strip()
        # and then we build our key and value and update our dictionary
        key = 'row:|:' + str(rownumber) + ':|:' + value
        value = (df.loc[(df.Formula.str.contains(value, regex=False)) & (df.index != rownumber), ['Module', 'Line Item']])
        result.update({key: value})
We can now unpack the dictionary into lists, where we had a match:
where_raw = []
what_raw = []
rows_raw = []
for key, value in zip(result.keys(), result.values()):
    if 'Empty' in str(value):
        continue
    else:
        where_raw.append(list(value['Module'] + ' ' + value['Line Item']))
        what_raw.append(key.split(':|:')[2])
        rows_raw.append(int(key.split(':|:')[1]))
tempdf = pd.DataFrame({'row': rows_raw, 'where': where_raw, 'what': what_raw})
tempdf now contains one row per match; however, we want one row per original row of df, so we combine all matches for each main row into one:
where = []
what = []
rows = []
for row in tempdf.row.unique():
    where.append(list(tempdf.loc[tempdf.row == row, 'where']))
    what.append(list(tempdf.loc[tempdf.row == row, 'what']))
    rows.append(row)
Lastly, we can get the result by merging this back into our original dataframe:
result = df.merge(pd.DataFrame({'index': rows, 'where': where, 'what': what}), how='left', on='index').drop('index', axis=1)
and we can add the repeated column like this (with how='left', rows without any match get NaN in 'what', hence notna()):
result['repeated'] = result['what'].notna()
print(result)
Module Line Item Formula what where
Module 1 Line Item 1 hello[SUM: hello2] ['hello[SUM: hello2]'] [['Module 1 Line Item 2']]
Module 1 Line Item 2 goodbye[LOOKUP: blue123] + hello[SUM: hello2] ['goodbye[LOOKUP: blue123]', 'hello[SUM: hello2]'] [['Module 2 Line Item 1'], ['Module 1 Line Item 1']]
Module 2 Line Item 1 goodbye[LOOKUP: blue123] + some other line item ['goodbye[LOOKUP: blue123]'] [['Module 1 Line Item 2']]
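For comparison, a more compact sketch of the same idea (my own, not part of the answer above): if '+' reliably separates the terms, you can index each term to the rows it occurs in and derive all three columns from that mapping. Note that, unlike str.contains, this matches whole terms only, not arbitrary substrings.
import pandas as pd

df = pd.DataFrame({
    'Module': ['Module 1', 'Module 1', 'Module 2'],
    'Line Item': ['Line Item 1', 'Line Item 2', 'Line Item 1'],
    'Formula': ['hello[SUM: hello2]',
                'goodbye[LOOKUP: blue123] + hello[SUM: hello2]',
                'goodbye[LOOKUP: blue123] + some other line item'],
})

# map each '+'-separated term to the row numbers it appears in
locations = {}
for idx, formula in df['Formula'].items():
    for term in (t.strip() for t in formula.split('+')):
        locations.setdefault(term, []).append(idx)

# per row: which of its terms appear elsewhere, and where
what, where = [], []
for idx, formula in df['Formula'].items():
    terms = [t.strip() for t in formula.split('+')]
    repeated = [t for t in terms if len(locations[t]) > 1]
    what.append(repeated)
    where.append([df.loc[i, 'Module'] + ' ' + df.loc[i, 'Line Item']
                  for t in repeated for i in locations[t] if i != idx])

df['repetition?'] = [bool(w) for w in what]
df['What repeated'] = what
df['Where repeated'] = where
print(df)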

Odd Column returns when scraping with lxml

I am learning Python and trying to build a scraper to glean parts data from a supplier's site. My issue is that I am getting different column counts from my parsed table rows when I know that each row has the same column count. The issue has to be something I am overlooking, and after two days of trying different things I am asking for a few more sets of eyes on my code to locate my error. Not having much Python coding experience is no doubt my biggest hurdle.
First, the data. Rather than paste the html I have stored in my database, I'll give you a link to the live site I have crawled and stored in my db. The first link is this one.
The issue is that I get mostly correct results. However, every so often I get the values skewed in the column count. I can't seem to locate the cause.
Here is an example of the flawed result:
----------------------------------------------------------------------------------
Record: 1 Section:Passenger / Light Truck Make: ACURA SubMake:
Model: CL SubModel: Year: 1997 Engine: L4 1.6L 1590cc
----------------------------------------------------------------------------------
Rec:1 Row 6 Col 1 part Air Filter
Rec:1 Row 6 Col 2 2
Rec:1 Row 6 Col 3 part_no 46395
Rec:1 Row 6 Col 4 filter_loc
Rec:1 Row 6 Col 5 engine
Rec:1 Row 6 Col 6 vin_code V6 3.0L 2997cc
Rec:1 Row 6 Col 7 comment Engine Code J30A1
** Note that the engine value has been shifted to the vin_code field.
And proof it works some of the time:
Record: 2 Section:Passenger / Light Truck Make: ACURA SubMake:
Model: CL SubModel: Year: 1998 Engine: L4 1.6L 1590cc
----------------------------------------------------------------------------------
Rec:3 Row 4 Col 1 part Oil Filter
Rec:3 Row 4 Col 2 2
Rec:3 Row 4 Col 3 part_no 51334
Rec:3 Row 4 Col 4 filter_loc
Rec:3 Row 4 Col 5 engine L4 2.3L 2254cc
Rec:3 Row 4 Col 6 vin_code
Rec:3 Row 4 Col 7 comment Engine Code F23A1
** Note the fields line up in this record...
I suspect either there is something in the table cells my parser is not looking for or I have missed something trivial.
Here is the important portion of my code:
# Per Query
while records:
# Per Query Loop
#print str(records)
for record in records:
print 'Record Count:'+str(rec_cnt)
items = ()
item = {}
source = record['doc']
page = html.fromstring(source)
for rows in page.xpath('//div/table'):
#records = []
item = {}
cntx = 0
for row in list(rows):
cnty = 0 # Column Counter
found_oil = 0 # Found oil filter record flag
data = {} # Data
# Data fields
field_data = {'part':'', 'part_no':'', 'filter_loc':'', 'engine':'', 'vin_code':'', 'comment':'', 'year':''}
print
print '----------------------------------------------------------------------------------'
print 'Record: '+str(record['id']), 'Section:'+str(record['section']), 'Make: '+str(record['make']), 'SubMake: '+str(record['submake'])
print 'Model: '+str(record['model']), 'SubModel: '+str(record['submodel']), 'Year: '+str(record['year']), 'Engine: '+str(record['engine'])
print '----------------------------------------------------------------------------------'
#
# Rules for extracting data columns
# 1. First column always has a link to the bullet image
# 2. Second column is part name
# 3. Third column always empty
# 4. Fourth column is part number
# 5. Fith column is empty
# 6. Sixth column is part location
# 7. Seventh column is always empty
# 8. Eigth column is engine size
# 9. Ninth column is vin code
# 10. Tenth column is COmment
# 11. Eleventh column does not exist.
#
for column in row.xpath('./td[#class="blackmedium"][text()="0xa0"] | ./td[#class="blackmedium"][text()="\n"]/text() | ./td[#class="blackmeduim"]/img[#src]/text() | ./td[#class="blackmedium"][text()=""]/text() | ./td[#class="blackmedium"]/b/text() | ./td[#class="blackmedium"]/a/text() |./td[#class="blackmedium"]/text() | ./td[#class="blackmedium"][text()=" "]/text() | ./td[#class="blackmedium"][text()="&#160"]/text() | ./td[#class="blackmedium"][text()=None]/text()'):
#' | ./td[position()>1]/a/text() | ./td[position()>1]/text() | self::node()[position()=1]/td/text()'):
cnty+=1
if ('Oil Filter' == column.strip() or 'Air Filter' == column.strip()) and found_oil == 0:
found_oil = 1
if found_oil == 1:
print 'Rec:'+str(rec_cnt), 'Row '+str(cntx), 'Col '+str(cnty), _fields[cnty], column.strip()
#cnty+= 1
#print
else:
print 'Rec: '+str(rec_cnt), 'Col: '+str(cnty)
field_data[ str(_fields[cnty]) ] = str(column.strip())
#cnty = cnty+1
# Save data to db dest table
if found_oil == 1:
data['source_id'] = record['id']
data['section_id'] = record['section_id']
data['section'] = record['section']
data['make_id'] = record['make_id']
data['make'] = record['make']
data['submake_id'] = record['submake_id']
data['submake'] = record['submake']
data['model_id'] = record['model_id']
data['model'] = record['model']
data['submodel_id'] = record['submodel_id']
data['submodel'] = record['submodel']
data['year_id'] = record['year_id']
data['year'] = record['year']
data['engine_id'] = record['engine_id']
data['engine'] = record['engine']
data['part'] = field_data['part']
data['part_no'] = field_data['part_no']
data['filter_loc'] = field_data['filter_loc']
data['vin_code'] = field_data['vin_code']
data['comment'] = conn.escape_string(field_data['comment'])
data['url'] = record['url']
save_data(data)
print 'Filed Data:'
print field_data
cntx+=1
rec_cnt+=1
#End main per query loop
delay() # delay if wait was passed on cmd line
records = get_data()
has_offset = 1
#End Queries
Thank you all for your help and your eyes...
Usually when I run into a problem like this, I do two things:
Break the problem down into smaller chunks. Use python functions or classes to perform subsets of functionality so that you can test the functions individually for correctness.
Use the Python Debugger to inspect the code as it runs to understand where it's failing. For example, in this case, I would add import pdb; pdb.set_trace() before the line that says cnty+=1.
Then, when the code runs, you'll get an interactive interpreter at that point and you can inspect the various variables and discover why you're not getting what you expect.
A couple of tips for using pdb:
Use c to allow the program to continue (until the next breakpoint or set_trace); Use n to step to the next line in the program. Use q to raise an Exception (and usually abort).
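A minimal, self-contained illustration of that workflow (the function and data here are made up for the demo):
import pdb

def buggy_sum(values):
    total = 0
    for v in values:
        pdb.set_trace()  # pauses here each iteration: inspect v and total, then press c
        total += v
    return total

buggy_sum([1, 2, 3])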
Can you share the details of your scraping process? The intermittent failures could be based on the parsing of the HTML data.
The problem seems to be that your xpath expression searches for text nodes. No matches are found for empty cells, causing your code to "skip" columns. Try iterating over the td elements themselves, and then "look down" from the element to its contents. To get you started:
# just iterate over child elements of the row, which are always td
# use enumerate to easily get a counter for the columns
for col_no, td in enumerate(row, start=1):
    # use the xpath function string() to get the string value for the element
    # this will yield an empty string for empty elements
    print col_no, td.xpath('string()')
Note that the use of the string() xpath function may in some cases be not enough/too simple for what you want. In your example, you may find something like <td><a>51334</a><sup>53</sup></td> (see oil filter). My example would give you "5133453", where you would seem to need "51334" (not sure if that was intentional or if you hadn't noticed the "missing" part; if you want only the text in the hyperlink, use td.findtext('a')).
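To illustrate that difference on the exact markup above, a quick standalone check (parsed as XML here just for the demo):
from lxml import etree

td = etree.fromstring('<td><a>51334</a><sup>53</sup></td>')
print(td.xpath('string()'))  # '5133453' -- concatenates all descendant text
print(td.findtext('a'))      # '51334'   -- only the hyperlink's own text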
I want to thank everyone who has given aid to me these past few days. All your input has resulted in a working application that I am now using. I wanted to post the resulting changes to my code so those who look here may find an answer or at least information on how they may also tackle their issue. Below is the rewritten portion of my code that solved the issues I was having:
#
# get_column_index()
# returns a dict of column name/column number pairs
#
def get_column_index(row):
    index = {}
    col_no = 0
    td = None
    name = ''
    for col_no, td in enumerate(row, start=0):
        mystr = str(td.xpath('string()').encode('ascii', 'replace'))
        name = str.lower(mystr).replace(' ', '_')
        idx = name.replace('.', '')
        index[idx] = col_no
    if int(options.verbose) > 2:
        print 'Field Index:', str(index)
    return index

def run():
    global has_offset
    records = get_data()
    #print 'Records', records
    rec_cnt = 0
    # Per Query
    while records:
        # Per Query Loop
        #print str(records)
        for record in records:
            if int(options.verbose) > 0:
                print 'Record Count:' + str(rec_cnt)
            items = ()
            item = {}
            source = record['doc']
            page = html.fromstring(source)
            col_index = {}
            for rows in page.xpath('//div/table'):
                #records = []
                item = {}
                cntx = 0
                for row in list(rows):
                    data = {}  # Data
                    found_oil = 0  # found proper part flag
                    # Data fields
                    field_data = {'part': '', 'part_no': '', 'part_note': '', 'filter_loc': '', 'engine': '', 'vin_code': '', 'comment': '', 'year': ''}
                    if int(options.verbose) > 0:
                        print
                        print '----------------------------------------------------------------------------------'
                        print 'Row' + str(cntx), 'Record: ' + str(record['id']), 'Section:' + str(record['section']), 'Make: ' + str(record['make']), 'SubMake: ' + str(record['submake'])
                        print 'Model: ' + str(record['model']), 'SubModel: ' + str(record['submodel']), 'Year: ' + str(record['year']), 'Engine: ' + str(record['engine'])
                        print '----------------------------------------------------------------------------------'
                    # get column indexes
                    if cntx == 1:
                        col_index = get_column_index(row)
                    if col_index != None and cntx > 1:
                        found_oil = 0
                        for col_no, td in enumerate(row):
                            if ('part' in col_index) and (col_no == col_index['part']):
                                part = td.xpath('string()').strip()
                                if 'Oil Filter' == part or 'Air Filter' == part or 'Fuel Filter' == part or 'Transmission Filter' == part:
                                    found_oil = 1
                                field_data['part'] = td.xpath('string()').strip()
                            # Part Number
                            if ('part_no' in col_index) and (col_no == col_index['part_no']):
                                field_data['part_no'] = str(td.xpath('./a/text()')).strip().replace('[', '').replace(']', '').replace("'", '')
                                field_data['part_note'] = str(td.xpath('./sup/text()')).strip().replace('[', '').replace(']', '').replace("'", '')
                            # Filter Location
                            if ('filterloc' in col_index) and (col_no == col_index['filterloc']):
                                field_data['filter_loc'] = td.xpath('string()').strip()
                            # Engine
                            if ('engine' in col_index) and (col_no == col_index['engine']):
                                field_data['engine'] = td.xpath('string()').strip()
                            if ('vin_code' in col_index) and (col_no == col_index['vin_code']):
                                field_data['vin_code'] = td.xpath('string()').strip()
                            if ('comment' in col_index) and (col_no == col_index['comment']):
                                field_data['comment'] = td.xpath('string()').strip()
                            if int(options.verbose) == 0:
                                print ','
                        if int(options.verbose) > 0:
                            print 'Field Data: ', str(field_data)
                        elif int(options.verbose) == 0:
                            print '.'
                        # Save data to db dest table
                        if found_oil == 1:
                            data['source_id'] = record['id']
                            data['section_id'] = record['section_id']
                            data['section'] = record['section']
                            data['make_id'] = record['make_id']
                            data['make'] = record['make']
                            data['submake_id'] = record['submake_id']
                            data['submake'] = record['submake']
                            data['model_id'] = record['model_id']
                            data['model'] = record['model']
                            data['submodel_id'] = record['submodel_id']
                            data['submodel'] = record['submodel']
                            data['year_id'] = record['year_id']
                            data['year'] = record['year']
                            data['engine_id'] = record['engine_id']
                            data['engine'] = field_data['engine']  #record['engine']
                            data['part'] = field_data['part']
                            data['part_no'] = field_data['part_no']
                            data['part_note'] = field_data['part_note']
                            data['filter_loc'] = field_data['filter_loc']
                            data['vin_code'] = field_data['vin_code']
                            data['comment'] = conn.escape_string(field_data['comment'])
                            data['url'] = record['url']
                            save_data(data)
                            found_oil = 0
                            if int(options.verbose) > 2:
                                print 'Data:', str(data)
                    cntx += 1
            rec_cnt += 1
        #End main per query loop
        delay()  # delay if wait was passed on cmd line
        records = get_data()
        has_offset = 1
    #End Queries
