Include a header from Excel in a for loop with openpyxl - python

I am trying to include a header when printing data in a column.
Issue
But when I try it an error comes up:
TypeError: '<' not supported between instances of 'int' and 'str'
Code
def pm1():
    for cell in all_columns[1]:
        power = cell.value
        if x < power < y:
            print(f"{power}")
        else:
            print("Not steady")
pm1()
I know you cannot compare a string with a number using comparison operators.
How can I include the header while looping through the entire column?

Based on what I understand from your comments, this may work for you.
def pm1():
    for cell in all_columns[1]:
        for thing in cell:
            # in openpyxl you can use .row or .column to get the location of a cell
            # you said you wanted to print the header (row 1), a string
            if thing.row == 1:
                print(thing.value)
            else:
                # you said that the values under the header will be numbers
                # so now it should be safe to set your variable and make the comparison
                power = thing.value
                if x < power < y:
                    print(f"{power}")
                else:
                    print("Not steady")

So you are looping through all cells of a column, here the column all_columns[1].
Assume the first cell of each column might contain a header whose value is of type string (type(cell.value) == str).
Then you have two possibilities:
Given that the first cell of each column (in row 1) is a header, take advantage of that position.
If all other cells contain numerical values, you can treat only the str values differently, as presumed headers.
The following code takes the first approach:
def power_of(value):
    # either define the boundaries x, y here or as globals
    power = float(value)  # defensive conversion, some values might erroneously be stored as text in Excel
    if x < power < y:
        return f"{power}"
    return "Not steady"  # default return instead of else
def pm1():
    for cell in all_columns[1]:
        if cell.row == 1:  # assume the header is always in the first row
            print(cell.value)  # print header
        else:
            print(power_of(cell.value))
pm1()
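The second possibility, treating any string value as a presumed header, could be sketched roughly as follows. This is a minimal illustration on plain Python values rather than openpyxl cells; the function name describe_column and the boundaries low and high (standing in for x and y) are assumptions, not part of the original code. With openpyxl you would pass in something like [cell.value for cell in all_columns[1]].

```python
def describe_column(values, low, high):
    """Return one output line per value; strings are treated as headers."""
    lines = []
    for value in values:
        if isinstance(value, str):
            # Assumption: every text value in the column is a header.
            lines.append(value)
        elif value is not None and low < value < high:
            lines.append(f"{value}")
        else:
            # Out-of-range numbers and empty cells (None) end up here.
            lines.append("Not steady")
    return lines


for line in describe_column(["Power", 5, 12, 7.5], low=0, high=10):
    print(line)
```

Unlike the row-based check, this version does not care where the header sits, but it would also treat a number stored as text as a header.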

Related

Dataframe Is No Longer Accessible

I am trying to make my code look better by creating functions that do all the work from a single line, but it is not working as intended. I am currently pulling data from a table in a PDF into a pandas dataframe. From there I have 4 functions, each calling the next and finally returning the updated dataframe. I can see that it is fully updated when I print it in the last method. However, I am unable to access and use that updated dataframe, even after I return it.
My code is as follows
def data_cleaner(dataFrame):
    #removing random rows
    removed = dataFrame.drop(columns=['Unnamed: 1','Unnamed: 2','Unnamed: 4','Unnamed: 5','Unnamed: 7','Unnamed: 9','Unnamed: 11','Unnamed: 13','Unnamed: 15','Unnamed: 17','Unnamed: 19'])
    #call next method
    col_combiner(removed)
def col_combiner(dataFrame):
    #Grabbing first and second row of table to combine
    first_row = dataFrame.iloc[0]
    second_row = dataFrame.iloc[1]
    #List to combine columns
    newColNames = []
    #Run through each row and combine them into one name
    for i,j in zip(first_row,second_row):
        #Check to see if they are not strings, if they are not convert it
        if not isinstance(i,str):
            i = str(i)
        if not isinstance(j,str):
            j = str(j)
        newString = ''
        #Check for double NAN case and change it to Expenses
        if i == 'nan' and j == 'nan':
            i = 'Expenses'
            newString = newString + i
        #Check for leading NAN and remove it
        elif i == 'nan':
            newString = newString + j
        else:
            newString = newString + i + ' ' + j
        newColNames.append(newString)
    #Now update the dataframes column names
    dataFrame.columns = newColNames
    #Remove the name rows since they are now the column names
    dataFrame = dataFrame.iloc[2:,:]
    #Going to clean the values in the DF
    clean_numbers(dataFrame)
def clean_numbers(dataFrame):
    #Fill NAN values with 0
    noNan = dataFrame.fillna(0)
    #Pull each column, clean the values, then put it back
    for i in range(noNan.shape[1]):
        colList = noNan.iloc[:,i].tolist()
        #calling to clean the column so that it is all ints
        col_checker(colList)
        noNan.iloc[:,i] = colList
    return noNan
def col_checker(col):
    #Going through, checking and cleaning
    for i in range(len(col)):
        #print(type(colList[i]))
        if isinstance(col[i],str):
            col[i] = col[i].replace(',','')
            if col[i].isdigit():
                #print('not here')
                col[i] = int(col[i])
            #If it is not a number then make it 0
            else:
                col[i] = 0
Then when I run this:
doesThisWork = data_cleaner(cleaner)
type(doesThisWork)
I get NoneType. I might be doing this the long way as I am new to this, so any advice is much appreciated!
You are getting NoneType because data_cleaner has no return statement, so when it finishes executing it automatically returns None. The same goes for col_combiner: both functions call the next one but discard its result, so the dataframe returned by clean_numbers never reaches the caller. It is the return value of a function that is assigned to a variable var in a statement like this:
var = fun(x)
A different question entirely is whether your dataframe cleaner is changed by the function data_cleaner, which can happen because dataframes are mutable objects in Python.
In other words, your function can read your dataframe and change it, so that after the function call cleaner is different than before. At the same time, your function can return a value (which it currently doesn't), and this value will be assigned to doesThisWork.
Usually you should prefer that a function does only one thing, so having a function both change its argument and return a value is usually bad practice.
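To make the point about return values concrete, here is a minimal sketch that uses plain lists in place of a dataframe. The function names are hypothetical; the only idea it shows is that every function in a chain has to pass the next function's result back up with return.

```python
def clean_column(col):
    """Return a new list: digit strings become ints, other strings become 0."""
    cleaned = []
    for item in col:
        if isinstance(item, str):
            item = item.replace(",", "")
            item = int(item) if item.isdigit() else 0
        cleaned.append(item)
    return cleaned


def pipeline_without_return(col):
    # The result of clean_column is computed but then thrown away,
    # so this function implicitly returns None.
    clean_column(col)


def pipeline_with_return(col):
    # Passing the inner result back up makes it available to the caller.
    return clean_column(col)


data = ["1,200", "abc", 7]
print(pipeline_without_return(data))  # None
print(pipeline_with_return(data))     # [1200, 0, 7]
```

Applied to the code in the question, this would mean writing return col_combiner(removed) in data_cleaner and return clean_numbers(dataFrame) in col_combiner.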

Variable table width with .format

I'm trying to display data from a csv in a text table. I've got to the point where it displays everything that I need, however the table width still has to be set, meaning if the data is longer than the number set then issues begin.
I currently print the table using .format to sort out the formatting. Is there a way to set the width of the data to a variable that is dependent on the length of the longest piece of data?
for i in range(len(list_l)):
    if i == 0:
        print(h_dashes)
        print('{:^1s}{:^26s}{:^1s}{:^26s}{:^1s}{:^26s}{:^1s}{:^26s}{:^1s}'.format('|', (list_l[i][0].upper()),'|', (list_l[i][1].upper()),'|',(list_l[i][2].upper()),'|', (list_l[i][3].upper()),'|'))
        print(h_dashes)
    else:
        print('{:^1s}{:^26s}{:^1s}{:^26s}{:^1s}{:^26s}{:^1s}{:^26s}{:^1s}'.format('|', list_l[i][0], '|', list_l[i][1], '|', list_l[i][2],'|', list_l[i][3],'|'))
I realise that the code is far from perfect, but I'm still a newbie, so it's pieced together from various tutorials.
You can use a two-pass approach: the first pass finds the maximum length of each field, and the second does what you're currently doing, but with the calculated rather than fixed lengths. The following shows the basic idea for your example with four fields per line:
# Can set MINIMUM lengths here if desired, eg: lengths = [10, 0, 41, 7]
lengths = [0] * 4
fmtstr = None
for pass_num in range(2):  # "pass" is a reserved word, so it cannot be used as a variable name
    for i in range(len(list_l)):
        if pass_num == 0:
            # First pass sets lengths as per data.
            for field in range(4):
                lengths[field] = max(lengths[field], len(list_l[i][field]))
        else:
            # Second pass prints the data.
            # First, set format string if not yet set.
            if fmtstr is None:
                fmtstr = '|'
                for item in lengths:
                    fmtstr += '{:^%ds}|' % (item)
            # Now print item (and header stuff if first item).
            if i == 0: print(h_dashes)
            print(fmtstr.format(list_l[i][0].upper(), list_l[i][1].upper(), list_l[i][2].upper(), list_l[i][3].upper()))
            if i == 0: print(h_dashes)
The construction of the format string is done the first time you process an item in pass two.
It does so by taking a collection like [31,41,59] and giving you the string:
|{:^31s}|{:^41s}|{:^59s}|
There's little point using all those {:^1s} format specifiers when the | is not actually a varying item - you may as well code it directly into the format string.
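As an alternative to building the format string by hand, str.format also accepts nested replacement fields, so the width itself can be passed as an argument. A minimal sketch, assuming every row is a list of strings with the same number of fields; the helper name format_table and the sample data are made up for illustration.

```python
def format_table(rows):
    """Return table lines with each column centred to its longest entry."""
    # Longest entry per column, computed across all rows.
    widths = [max(len(row[col]) for row in rows) for col in range(len(rows[0]))]
    lines = []
    for row in rows:
        # '{:^{}}' takes the value first and the width second.
        cells = ["{:^{}}".format(value, width) for value, width in zip(row, widths)]
        lines.append("|" + "|".join(cells) + "|")
    return lines


sample = [["name", "city"], ["Alexander", "Rome"], ["Bo", "Copenhagen"]]
for line in format_table(sample):
    print(line)
```

Because every line is padded to the same column widths, a dashed separator line can be sized from len(lines[0]) instead of being fixed in advance.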

How to parse text to dataframe?

Let's imagine that I have a text like this:
if last_n_input_time <= 7463.0:
else: if last_n_input_time > 7463.0
else: if passwd_input_time > 27560.0
else: if secret_answ_input_time > 7673.5
if first_n_input_time <= 4054.5:
if passwd_input_time <= 5041.0:
else: if passwd_input_time > 5041.0
else: if first_n_input_time > 4054.5
return [[ 1.01167029]]
....
And I have a dataframe with such columns as passwd_input_time, first_n_input_time and others - named in the same way as variables in the text.
The question is how I can search, for example, for first_n_input_time and, if I find it in the text, move to the next token, see whether it is > or <=, then cut out the value that follows the operator and add it to a cell of my dataframe.
df['first_n_input_time'] = 4079 should be the result of my function.
I understand how to find the word, but I don't know how to cut the lines in such a way that I get, for example, "secret_answ_input_time <= 7673.5" and can then operate on it.
Example:
I want to cut out "if passwd_input_time <= 47635.5" first, then check whether "password_input_time" belongs to the list of column names of my dataframe (yes, it does). Then I need to move on and check whether the symbol is "<=" or ">". If it is "<=", I take the value 47635.5 and write it to the cell of df['password_input_time_1']. If it is ">", I write the value to another column, df['password_input_time_2'].
Here are some pieces of code I wrote trying to implement this, but I'm a bit stuck because I don't know how to move to the next word in the text:
def to_dataframe(i, str_):
    for word in str_.split():
        if any(word in s for s in cols_list):
            #move to the next word somehow
            #I will call it next_word later on for simplicity
            col_name = word #save the value to refer to it later
            if next_word == "<=":
                col_name += '_1'
                #move to the value somehow
                #I will call it 'value' later on
                df[col_name][i] = value
            if next_word == ">":
                col_name += '_2'
                #move to the value somehow
                #I will call it 'value' later on
                df[col_name][i] = value
Where cols_list is list of columns names of my dataframe.
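One way to avoid stepping through the text word by word is to match the whole condition at once with a regular expression. The sketch below is only an illustration under assumptions: the function name extract_thresholds, the choice to keep the last value seen for each key, and returning a plain dict instead of writing into the dataframe are not from the original post. The _1 and _2 suffixes follow the convention described in the example above.

```python
import re

# A name, then "<=" or ">", then a number such as 7463.0
CONDITION = re.compile(r"(\w+)\s*(<=|>)\s*(-?\d+(?:\.\d+)?)")


def extract_thresholds(text, cols_list):
    """Map 'column_1' (for <=) or 'column_2' (for >) to the threshold found."""
    found = {}
    for name, op, number in CONDITION.findall(text):
        if name not in cols_list:
            continue  # ignore variables that are not dataframe columns
        suffix = "_1" if op == "<=" else "_2"
        # If a key occurs several times, the last occurrence wins.
        found[name + suffix] = float(number)
    return found


text = """if last_n_input_time <= 7463.0:
else: if last_n_input_time > 7463.0
else: if passwd_input_time > 27560.0
if first_n_input_time <= 4054.5:"""
cols = ["passwd_input_time", "first_n_input_time"]
print(extract_thresholds(text, cols))
```

Each entry of the resulting dict could then be written into the dataframe, for example with df.loc[i, key] = value.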

Parsing a column using openpyxl

I have the following algorithm to parse a column for integer values:
def getddr(ws):
    address = []
    col_name = 'C'
    start_row = 4
    end_row = ws.get_highest_row()+1
    range_expr = "{col}{start_row}:{col}{end_row}".format(col=col_name, start_row=start_row, end_row=end_row)
    for row in ws.iter_rows(range_string=range_expr):
        print row
        raw_input("enter to continue")
        cell = row[0]
        if str(cell.value).isdigit:
            address.append(cell.value)
        else:
            continue
    return address
This crashes at cell = row[0] with "IndexError: tuple index out of range", and I don't know what this means. I tried printing row to see what it contained, but all it gives me is an empty set of parentheses. Does anyone know what I'm missing?
It is not easy to say exactly what the problem is, because you have not shown the input data you are trying to process.
But I can explain the reason for the error you get and which direction to go in. The tuple row contains 0 elements (row = (), which is why printing it shows empty parentheses), so you cannot write row[0]: there is no row[0]. The first thing to change is to check how long row is, and only do the rest if it is long enough:
for row in ws.iter_rows(range_string=range_expr):
    print row
    raw_input("enter to continue")
    if len(row) > 0:
        cell = row[0]
        if str(cell.value).isdigit():  # note the parentheses: isdigit has to be called
            address.append(cell.value)
        else:
            continue
That is the first step that you must do anyway.
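The length check can be seen in isolation in the following sketch. It uses SimpleNamespace objects as stand-ins for openpyxl cells (an assumption made so the example runs without a workbook), and the function name collect_digit_cells is hypothetical. Note that it calls isdigit() with parentheses; without them, as in the question's code, the expression is a method object, which is always truthy, so every value would be appended.

```python
from types import SimpleNamespace


def collect_digit_cells(rows):
    """Collect the first cell's value from each row if it consists of digits."""
    address = []
    for row in rows:
        if len(row) == 0:
            continue  # an empty row tuple has no row[0]
        cell = row[0]
        if str(cell.value).isdigit():
            address.append(cell.value)
    return address


rows = [
    (SimpleNamespace(value=42),),
    (),  # empty tuple, like the one printed in the question
    (SimpleNamespace(value="n/a"),),
    (SimpleNamespace(value="7"),),
]
print(collect_digit_cells(rows))  # [42, '7']
```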

detect EOF of excel file in Python

I have written a code for detecting the EOF of an excel file using python:
row_no = 1
while True:
    x = xlws.Cells(row_no,1).value
    if type(x) is None:
        break
    else:
        print(len(x))
        print(x)
        row_no = row_no + 1
I expect the while loop to stop when x becomes a "blank cell", which I suppose to be None, but it doesn't work: it goes on to len(x) and gives me an error saying that NoneType has no len. Why?
Thanks!
This here is your problem:
if type(x) is None:
If x is None, its type is NoneType. Therefore, this is never true, so you never see the blank cell and you end up trying to get the length of None.
Instead, write:
if x is None:
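The difference is easy to check in an interpreter. A minimal sketch, with x standing in for the value read from a blank cell:

```python
x = None  # stand-in for the value read from a blank cell

# type(x) is the class NoneType, which is not the same object as None.
print(type(x) is None)        # False
print(x is None)              # True
print(type(x) is type(None))  # True
```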
It looks like you are using pywin32com ... you don't need to loop around finding "EOF" (you mean end of Sheet, not end of File).
If xlws refers to a Worksheet object, you can use this:
used = xlws.UsedRange
nrows = used.Row + used.Rows.Count - 1
to get the effective number of rows in the worksheet. used.Row is the 1-based row number of the first used row, and the meaning of used.Rows.Count should be rather obvious.
Alternative: use xlrd ... [dis]claimer: I'm the author.
As mentioned in other comments you can use 'xlrd' as well to know the limits of the excel file as:
import xlrd
# excel_loc is the path to the workbook
workbook = xlrd.open_workbook(excel_loc)
excel_sheet = workbook.sheet_by_index(0)
print("no of rows: %d" % excel_sheet.nrows)
print("no of cols: %d" % excel_sheet.ncols)
