Dataframe Is No Longer Accessible - python

I am trying to make my code look better by creating functions that do all the work from a single line, but it is not working as intended. I am currently pulling data from a table in a PDF into a pandas dataframe. From there I have 4 functions, all calling each other and finally returning the updated dataframe. I can see that it is fully updated when I print it in the last method. However, I am unable to access and use that updated dataframe, even after I return it.
My code is as follows:
def data_cleaner(dataFrame):
    #removing random rows
    removed = dataFrame.drop(columns=['Unnamed: 1','Unnamed: 2','Unnamed: 4','Unnamed: 5','Unnamed: 7','Unnamed: 9','Unnamed: 11','Unnamed: 13','Unnamed: 15','Unnamed: 17','Unnamed: 19'])
    #call next method
    col_combiner(removed)
def col_combiner(dataFrame):
    #Grabbing first and second row of table to combine
    first_row = dataFrame.iloc[0]
    second_row = dataFrame.iloc[1]
    #List to combine columns
    newColNames = []
    #Run through each row and combine them into one name
    for i,j in zip(first_row,second_row):
        #Check to see if they are not strings, if they are not convert it
        if not isinstance(i,str):
            i = str(i)
        if not isinstance(j,str):
            j = str(j)
        newString = ''
        #Check for double NAN case and change it to Expenses
        if i == 'nan' and j == 'nan':
            i = 'Expenses'
            newString = newString + i
        #Check for leading NAN and remove it
        elif i == 'nan':
            newString = newString + j
        else:
            newString = newString + i + ' ' + j
        newColNames.append(newString)
    #Now update the dataframes column names
    dataFrame.columns = newColNames
    #Remove the name rows since they are now the column names
    dataFrame = dataFrame.iloc[2:,:]
    #Going to clean the values in the DF
    clean_numbers(dataFrame)
def clean_numbers(dataFrame):
    #Fill NAN values with 0
    noNan = dataFrame.fillna(0)
    #Pull each column, clean the values, then put it back
    for i in range(noNan.shape[1]):
        colList = noNan.iloc[:,i].tolist()
        #calling to clean the column so that it is all ints
        col_checker(colList)
        noNan.iloc[:,i] = colList
    return noNan
def col_checker(col):
    #Going through, checking and cleaning
    for i in range(len(col)):
        #print(type(colList[i]))
        if isinstance(col[i],str):
            col[i] = col[i].replace(',','')
            if col[i].isdigit():
                #print('not here')
                col[i] = int(col[i])
            #If it is not a number then make it 0
            else:
                col[i] = 0
Then when I run this:
doesThisWork = data_cleaner(cleaner)
type(doesThisWork)
I get NoneType. I might be doing this the long way as I am new to this, so any advice is much appreciated!

You are getting NoneType because data_cleaner does not have a return statement, so when it finishes executing it automatically returns None. It is the return value of a function that is assigned to a variable var in a statement like this:
var = fun(x)
Now, a different thing entirely is whether or not your dataframe cleaner will be changed by the function data_cleaner, which can happen because dataframes are mutable objects in Python.
In other words, your function can read your dataframe and change it, so after the function call cleaner is different than before. At the same time, your function can return a value (which it doesn't) and this value will be assigned to doesThisWork.
Usually you should prefer that a function does only one thing, so expecting it to both change its argument and return a value is generally bad practice.
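To make the point about return values concrete, here is a minimal sketch of a chain of helper functions. It uses plain lists instead of a dataframe, and the function bodies are simplified, hypothetical stand-ins for the ones in the question. The only thing it is meant to show is that each function in the chain has to return what the next one gives back.

```python
# Hypothetical, simplified stand-ins for the question's functions;
# plain lists replace the dataframe so the sketch has no dependencies.
def clean_numbers(rows):
    # Last step in the chain: build and return the cleaned result.
    return [0 if value is None else value for value in rows]

def col_combiner(rows):
    # Drop the two header rows, then pass the callee's result back up.
    return clean_numbers(rows[2:])

def data_cleaner_broken(rows):
    # The result is computed but discarded, so this returns None.
    col_combiner(rows)

def data_cleaner(rows):
    # Returning the callee's result makes it available to the caller.
    return col_combiner(rows)

raw = ["header 1", "header 2", 5, None, 7]
broken_result = data_cleaner_broken(raw)
fixed_result = data_cleaner(raw)
print(broken_result)  # None
print(fixed_result)   # [5, 0, 7]
```

In the question's code the same change would mean writing return col_combiner(removed) and return clean_numbers(dataFrame) instead of the bare calls.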

Related

How to assign a new value to the list in running loop

My data cutting loop seems to run ok in the loop, but when it prints the result outside the loop, the contents are unchanged. Presuming it's buggy because I'm trying to assign to what the for loop is running through, but I don't know.
For reference, it's a small web review scraper project I'm working on. To get it formatted to CSV with pandas I think all the data needs to end at the same point (length), so I'm cutting any lists that are longer than the shortest. The values "cust_stars_result, rev_result, cust_res" are all lists with basic strings stored inside, in this case with lengths 16, 12, and 15. I try to slice everything down to 12 in the end but the results are overwritten. What is the right/best way to go about this?
star_len = len(cust_stars_result)
rev_len = len(rev_result)
custname_len = len(cust_res)
print('customer name length: ' + str(custname_len) + ' -- review length: ' + str(rev_len) + ' -- star length: ' + str(star_len))
datalen = [star_len, rev_len, custname_len]
print(min(datalen))
datapack = [cust_stars_result, rev_result, cust_res]
# LOOPER FOR CULLING
for data in datapack:
    if len(data) != min(datalen):
        print("operating culler to make data even length")
        print(len(data))
        data = data[: min(datalen)]
        print(len(data)) #this comes out OK
    else:
        print("equal length, skipping culler")
        pass
print(datapack) # prints the original values
Inside your loop you update the data variable but that's just reassigning the value of that variable. You want to do something like
for i, data in enumerate(datapack):
    ...
    datapack[i] = data[: min(datalen)]
This will update the datapack element.
While "trying to assign to what the for loop is running through" is a real issue, in this case the problem is rather that your code is not assigning anything to datapack when you change data. Instead, the loop assigns each item in datapack to data, so when you rebind data, datapack remains unchanged.
Instead, try either adding each item to a new list, and then assigning the new list to datapack:
temp = []
for data in datapack:
    ...
    temp.append(data[:min(datalen)])
datapack = temp
Or try using a range or enumerate loop:
for i, data in enumerate(datapack):
    ...
    datapack[i] = data[:min(datalen)]
There are fancier (but less readable and less debuggable) ways to accomplish what you're doing here (slicing off the end of each list), such as the below, which uses a list comprehension and map:
mindatalen = min(map(len, datapack))
datapack = [data[:mindatalen] for data in datapack]
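The difference between rebinding the loop variable and assigning by index can be checked with a small sketch. The sample lists below are made-up stand-ins for the scraped data. The last lines also show something the loop alone does not do: the original names still refer to the untrimmed lists until they are reassigned.

```python
# Hypothetical sample data standing in for the scraped lists.
cust_stars_result = ["5", "4", "3", "5"]
rev_result = ["great", "ok"]
cust_res = ["Ann", "Bob", "Cy"]

datapack = [cust_stars_result, rev_result, cust_res]
shortest = min(len(data) for data in datapack)

# Rebinding the loop variable only changes what the name "data" points to.
for data in datapack:
    data = data[:shortest]
lengths_after_rebinding = [len(data) for data in datapack]

# Assigning by index replaces the element stored inside datapack.
for i, data in enumerate(datapack):
    datapack[i] = data[:shortest]
lengths_after_index_assignment = [len(data) for data in datapack]

# Slicing created new lists, so the original names are still untrimmed.
original_length = len(cust_stars_result)

# Unpack the trimmed lists back into the original names if they are needed.
cust_stars_result, rev_result, cust_res = datapack

print(lengths_after_rebinding)         # [4, 2, 3]
print(lengths_after_index_assignment)  # [2, 2, 2]
```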

IronPython - how do I refer to a calculated variable in additional lines of code

I am working with IronPython inside Spotfire.
I need to extract the maximum date value from a range filter, then use that value to filter a table of exchange rates.
I have working code right up to the datatable.Select statement in which I need to do the match. If I do it based on "Date(2020,3,1)" - which is the row commented out - then the match works and the correct result is returned, however I cannot get the syntax correct for using a calculated variable "newdate" in place of the Date(xxx) statement. I am still learning python and have not come across this before.
Code as below - any help would be greatly appreciated.
from Spotfire.Dxp.Application.Filters import RangeFilter, ValueRange
from Spotfire.Dxp.Data.DataType import Date
from System.Globalization import CultureInfo
parser = Date.CreateCultureSpecificFormatter(CultureInfo("en-AU"))
#get a reference to a filter as checkbox from the myDataTable script parameter
filt=Document.FilteringSchemes[Document.ActiveFilteringSelectionReference].Item[dt].Item[dt.Columns.Item["Calendar Date"]].As[RangeFilter]()
print filt.ValueRange.High
if str(filt.ValueRange.High) == "High":
    maxdate = Document.Properties["loaddate"]
else:
    maxdate = filt.ValueRange.High
maxdate = Date.Formatter.Parse(maxdate)
print maxdate
new = str(maxdate.Year) + "," + str(maxdate.Month) + "," + str("1")
print new
Document.Properties["maxdate"] = new
from Spotfire.Dxp.Data import *
from System.Collections.Generic import List
table=Document.ActiveDataTableReference
# Expression to limit the data in a table
rowSelection=table.Select("CALENDAR_DATE = Date('new')")
#rowSelection=table.Select("CALENDAR_DATE = Date(2020,3,1)")
# Create a cursor to the Column we wish to get the values from
cursor = DataValueCursor.CreateFormatted(table.Columns["FY_AVERAGE_EXCHANGE"])
# Create List object that will hold values
listofValues=List[str]()
# Loop through all rows, retrieve value for specific column,
# and add value into list
for row in table.GetRows(rowSelection.AsIndexSet(),cursor):
    rowIndex = row.Index
    value1 = cursor.CurrentValue
    listofValues.Add(value1)
for val in listofValues:
    print val
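I think your variable new would print out as something like 2020,3,1
In this line, new sits inside the quoted expression, so Spotfire receives the literal text 'new' rather than the value of your variable, and Date() cannot extract a date from it.
rowSelection=table.Select("CALENDAR_DATE = Date('new')")
You should concatenate new into the expression as a variable:
rowSelection=table.Select("CALENDAR_DATE = Date(" +new +")")
I am not sure that will work, though, as Date takes integers rather than strings, so you might have to rewrite it as follows (the str() calls are needed because Python cannot concatenate an int to a string):
y = maxdate.Year
m= maxdate.Month
d = 1
rowSelection=table.Select("CALENDAR_DATE = Date("+ str(y) + ',' + str(m) +',' + str(d) + ")")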
Or build your string beforehand, which is the method I would use:
y = maxdate.Year
m= maxdate.Month
d = 1
mystring = "CALENDAR_DATE = Date("+ str(y) + ',' + str(m) +',' + str(d) + ")"
rowSelection=table.Select(mystring)
One of the above ways should work. I'd start with the last one, building the string first, as it makes the most sense not to deal with too many conversions between integers and strings.
If you post this question with an example DXP on Tibco Answers, people there could possibly help more, since they would have an example DXP to work with. Hopefully this helps you out.
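As a rough illustration of building the expression string first, the sketch below wraps the step in a small helper. The helper name build_date_expression is hypothetical, and a standard-library date stands in for the Spotfire maxdate value (which exposes Year and Month rather than year and month). Only the string building is shown, not the table.Select call.

```python
from datetime import date

def build_date_expression(column, year, month, day=1):
    # str.format converts the integers to text, so no manual str() calls are needed.
    return "{0} = Date({1},{2},{3})".format(column, year, month, day)

# Stand-in for the value parsed from the range filter (an assumption for this sketch).
maxdate = date(2020, 3, 17)

expression = build_date_expression("CALENDAR_DATE", maxdate.year, maxdate.month)
print(expression)  # CALENDAR_DATE = Date(2020,3,1)
```

The resulting string has the same form as the hard-coded expression that already works in the question, so it could be passed to table.Select in its place.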

Python pandas UIPath - List assignment index out of range

I am currently facing an error
list assignment index out of range
within the Invoke Python scope. I am just trying to check whether each of the variables contains any of the strings listed in 'a'. If it does, I add it as a row to the Excel sheet.
import pandas as pd
import xlsxwriter
def excel_data(mz01arg,p028arg,p006arg,s007arg,mz01desc,p028desc,p006desc,s007desc):
    listb=[]
    a=['MZ01','P028','P006','S007']
    if any (x in mz01arg for x in a) is True:
        listb[0] = [mz01arg]
    else:
        listb[0] = []
    if any (x in p028arg for x in a ) is True:
        listb[1] = [p028arg]
    else:
        listb[1] =[]
    if any (x in p006arg for x in a) is True:
        listb[2]=[p006arg]
    else:
        listb[2] = []
    if any (x in s007arg for x in a) is True:
        listb[3]=[s007arg]
    else:
        listb[3]=[]
    df1 = pd.DataFrame({'SODA COUNT': listb})
    df2 = pd.DataFrame({'SODA RISK DESCRIPTION': [mz01desc,p028desc,p006desc,s007desc]})
    writer = pd.ExcelWriter(r"D:\Single_process_python\try_python.xlsx", engine='xlsxwriter')
    df3 = pd.concat([df1,df2],axis=1)
    df3.to_excel(writer,sheet_name='Sheet1', index=False)
    writer.save()
You can't write to an element that does not yet exist. listb=[] creates an empty list, so there is no element with index 0. You may append items like this: listb.append(foo).
However, since you mentioned UiPath - I would recommend checking variables and their values in the workflow instead of your Python script. This way your script does one thing, and one thing exactly - and the workflow itself makes sure that all prerequisites are met. If not, you can throw and catch error messages, for example in another workflow, and ask users for input. If that logic is part of your script, this will be much harder.
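For the append approach on the Python side, a minimal sketch could look like the following. The helper name collect_matches and the sample values are made up for illustration, and the Excel-writing part of the question is left out.

```python
def collect_matches(values, codes):
    # Start with an empty list and append; assigning to matches[0]
    # on an empty list is what raises "list assignment index out of range".
    matches = []
    for value in values:
        if any(code in value for code in codes):
            matches.append([value])
        else:
            matches.append([])
    return matches

codes = ['MZ01', 'P028', 'P006', 'S007']
result = collect_matches(['MZ01 high', 'none', 'P006 low', ''], codes)
print(result)  # [['MZ01 high'], [], ['P006 low'], []]
```

Looping over the values also replaces the four near-identical if/else blocks with one.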

Add to dictionary in if loop

I have an if loop in which I am trying to:
(1) Create a dataframe from a filepath.
(2) Format this dataframe
(3) Add that dataframe to a dictionary that is a property of an instance of a class.
Here is my code defining the class and the method:
class myClass:
    def __init__(self, name, filepathlist):
        self.name = name
        self.filepathlist = filepathlist
    def formatData(self):
        i = 0
        self.dataframeDict = {}
        if i < (len(self.filepathlist) - 1):
            DFRAW = pd.read_csv(self.filepathlist[i], header = 9) #Row 9 is the row that is not blank (all blank auto-skipped)
            DFRAW['DateTime'], DFRAW['dummycol1'] = DFRAW[' ;W;W;W;W'].str.split(';', 1).str
            DFRAW['Col1'], DFRAW['dummycol2'] = DFRAW['dummycol1'].str.split(';', 1).str
            DFRAW['Col2'], DFRAW['dummycol3'] = DFRAW['dummycol2'].str.split(';', 1).str
            DFRAW['Col3'], DFRAW['Col4'] = DFRAW['dummycol3'].str.split(';', 1).str
            DFRAW = DFRAW.drop([' ;W;W;W;W', 'dummycol1', 'dummycol2', 'dummycol3'], axis = 1)
            dictIndex = self.filepathlist[i][39:44]
            self.dataframeDict.update({dictIndex: DFRAW})
            i = i + 1
Then I create an instance of the class and run the method:
filepathlist = ['filepath1','filepath2']
myINST = myClass('Mydataname', filepathlist)
myINST.formatData()
I then expect myINST.dataframeDict to have two dataframes as per the 2 input filepaths and thus 2 iterations of the if loop. However only 1 is present.
What is the error in my code or my approach?
It is hard to tell whether this will completely solve your problem, because no dummy data is provided. You will, however, get one step closer to your solution if you replace if i < (len(self.filepathlist) - 1): with while i < (len(self.filepathlist) - 1):.
You are currently just checking whether i=0 is smaller than len(self.filepathlist)-1. If so, the if-block is executed once. What you are actually looking for is a loop that keeps iterating as long as the condition holds, which is what a while loop does. Note that with the - 1 kept in the condition, the loop stops one file short, so drop it if every file should be processed.
You need to change your condition to for i in range(len(self.filepathlist)):
(Also, remove the assignment of i as the for loop does it automatically. For the same reason, you should also remove the line which increments i).
If you want to use a while loop, change the if line to while i < len(self.filepathlist):.
Notice that there's no -1. This is because you're using < instead of <=. If you want to use -1, then you also need the <= as this will ensure the loop runs the correct number of times.
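The three variants discussed above can be compared side by side. This is only a sketch with hypothetical function names; it records which paths each variant visits instead of reading any files.

```python
def run_with_if(paths):
    # An if statement tests its condition once, so the body runs at most one time.
    processed = []
    i = 0
    if i < len(paths) - 1:
        processed.append(paths[i])
        i = i + 1
    return processed

def run_with_while_minus_one(paths):
    # A while loop repeats, but "< len(paths) - 1" stops one item early.
    processed = []
    i = 0
    while i < len(paths) - 1:
        processed.append(paths[i])
        i = i + 1
    return processed

def run_with_for(paths):
    # Iterating directly visits every path and needs no counter.
    processed = []
    for path in paths:
        processed.append(path)
    return processed

paths = ['filepath1', 'filepath2']
print(run_with_if(paths))               # ['filepath1']
print(run_with_while_minus_one(paths))  # ['filepath1']
print(run_with_for(paths))              # ['filepath1', 'filepath2']
```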

Python, I need the following code to finish quicker

I need the following code to finish quicker without threads or multiprocessing. If anyone knows of any tricks, that would be greatly appreciated. Maybe for i in enumerate(), or changing the list to a string before calculating; I'm not sure.
For the example below, I have attempted to recreate the variables using a random sequence, however this has rendered some of the conditions inside the loop useless ... which is ok for this example, it just means the 'true' application for the code will take slightly longer.
Currently on my i7, the example below (which will mostly bypass some of its conditions) completes in 1 second, I would like to get this down as much as possible.
import random
import time
import collections
import cProfile
def random_string(length=7):
    """Return a random string of given length"""
    return "".join([chr(random.randint(65, 90)) for i in range(length)])
LIST_LEN = 18400
original = [[random_string() for i in range(LIST_LEN)] for j in range(6)]
LIST_LEN = 5
SufxList = [random_string() for i in range(LIST_LEN)]
LIST_LEN = 28
TerminateHook = [random_string() for i in range(LIST_LEN)]
#^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Exclude above from benchmark
ListVar = original[:]
for b in range(len(ListVar)):
    for c in range(len(ListVar[b])):
        #If its an int ... remove
        try:
            int(ListVar[b][c].replace(' ', ''))
            ListVar[b][c] = ''
        except: pass
        #if any second sufxList delete
        for d in range(len(SufxList)):
            if ListVar[b][c].find(SufxList[d]) != -1: ListVar[b][c] = ''
        for d in range(len(TerminateHook)):
            if ListVar[b][c].find(TerminateHook[d]) != -1: ListVar[b][c] = ''
    #remove all '' from list
    while '' in ListVar[b]: ListVar[b].remove('')
print(ListVar[b])
ListVar = original[:]
That makes a shallow copy of original, so your changes to the second-level lists are going to affect original as well. Are you sure that is what you want? It would be much better to build the new modified list from scratch.
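A small sketch with made-up data shows what the shallow copy means in practice. The copy.deepcopy call is shown only for contrast, not as a recommendation for the benchmark.

```python
import copy

original = [["ABC", "123"], ["DEF"]]

# A slice copy creates a new outer list that still holds the same inner lists.
shallow = original[:]
shallow[0][1] = ""
shared_inner = shallow[0] is original[0]
original_after_shallow = original[0][1]   # changed through the shared inner list

# A deep copy duplicates the inner lists too, so the source stays intact.
original = [["ABC", "123"], ["DEF"]]
deep = copy.deepcopy(original)
deep[0][1] = ""
original_after_deep = original[0][1]

print(shared_inner)                  # True
print(repr(original_after_shallow))  # ''
print(repr(original_after_deep))     # '123'
```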
for b in range(len(ListVar)):
    for c in range(len(ListVar[b])):
Yuck: whenever possible iterate directly over lists.
#If its an int ... remove
try:
    int(ListVar[b][c].replace(' ', ''))
    ListVar[b][c] = ''
except: pass
You want to ignore spaces in the middle of numbers? That doesn't sound right. If the numbers can be negative you may want to use the try..except but if they are only positive just use .isdigit().
#if any second sufxList delete
for d in range(len(SufxList)):
    if ListVar[b][c].find(SufxList[d]) != -1: ListVar[b][c] = ''
Is that just bad naming? SufxList implies you are looking for suffixes; if so, just use .endswith() (and note that you can pass a tuple in to avoid the loop). If you really do want to find out whether the suffix is anywhere in the string, use the in operator.
for d in range(len(TerminateHook)):
    if ListVar[b][c].find(TerminateHook[d]) != -1: ListVar[b][c] = ''
Again use the in operator. Also any() is useful here.
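As a quick illustration of those two suggestions, here is a sketch with made-up suffixes and hook strings. str.endswith accepts a tuple of candidates, and any() with the in operator stops at the first substring match.

```python
# Hypothetical values standing in for SufxList and TerminateHook.
suffixes = ("ING", "ED")
hooks = ["STOP", "END"]

def should_drop(value):
    # True if the value ends with any suffix or contains any hook string.
    return value.endswith(suffixes) or any(hook in value for hook in hooks)

running = should_drop("RUNNING")  # ends with "ING"
weekend = should_drop("WEEKEND")  # contains "END"
plain = should_drop("ABCXYZ")     # matches neither
print(running, weekend, plain)    # True True False
```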
#remove all '' from list
while '' in ListVar[b]: ListVar[b].remove('')
and that while is O(n^2) i.e. it will be slow. You could use a list comprehension instead to strip out the blanks, but better just to build clean lists to begin with.
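The difference between the two ways of stripping blanks can be sketched on a tiny made-up row. Both give the same result, but the comprehension walks the list once, while each remove() call has to search the list and shift the remaining items.

```python
row = ["ABC", "", "DEF", "", ""]

# Repeated remove(): every call rescans from the start and shifts elements.
slow = row[:]
while "" in slow:
    slow.remove("")

# A single pass that keeps only the non-empty strings.
fast = [value for value in row if value != ""]

print(slow)  # ['ABC', 'DEF']
print(fast)  # ['ABC', 'DEF']
```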
print(ListVar[b])
I think maybe your indentation was wrong on that print.
Putting these suggestions together gives something like:
suffixes = tuple(SufxList)
newListVar = []
for row in original:
    newRow = []
    newListVar.append(newRow)
    for value in row:
        if (not value.isdigit() and
            not value.endswith(suffixes) and
            not any(th in value for th in TerminateHook)):
            newRow.append(value)
    print(newRow)
