Pandas iterrows not working within function - python

I'm creating a small-scale scheduling script and having some issues with iterrows. These are very small DataFrames, so time is minimal (6 rows and maybe 7/8 columns), although I'm guessing these loops are not the most efficient. I am pretty new to this!
Here's what I have already:
data = {'Staff 1': ['9-5', '9-5', '9-5', '9-5', '9-5'],
        'Staff 2': ['9-5', '9-5', '9-5', '9-5', '9-5'],
        'Staff 3': ['9-5', '9-5', '9-5', '9-5', '9-5']}
dataframe_1 = pd.DataFrame.from_dict(data, orient='index',
                                     columns=['9/2/19', '9/3/19', '9/4/19', '9/5/19', '9/6/19'])
data2 = {'Name': ['Staff 1', 'Staff 2', 'Staff 3'], 'Site': ['2', '2', '2'], 'OT': ['yes', 'yes', 'no'],
         'Days off': ['', '9/4/19', '9/4/19'], '': ['', '', '9/5/19']}
dataframe_2 = pd.DataFrame.from_dict(data2)

def annual_leave(staff, df):
    df = df.reset_index(drop=True)
    for index, row in df.iterrows():
        days_off = []
        if df.loc[index, 'Name'] == '{}'.format(staff):
            for cell in row:
                days_off.append(cell)
            del days_off[0:3]
        else:
            pass
        return days_off

for index, row in dataframe_1.iterrows():
    print(annual_leave(index, dataframe_2))
I added a few 'print(index)' calls in places to see if I could work out where it was going wrong.
I found that the bottom iterrows loop is running through each row. However, the iterrows loop in the function is only looking at the first row, and I don't understand why.
I am trying to go through each staff name (the index) in dataframe_1 and check that staff name against a column name in dataframe_2. I then want to get rid of the first 3 columns of that particular row in dataframe_2 (hence the list and del days_off[0:3]).
However, in this example it is running the bottom iterrows loop (outside the function) for 'Staff 1', 'Staff 2', and 'Staff 3'. But the iterrows loop inside the function is only checking against the 'Staff 1' name.
This means it only works for 'Staff 1'; when the function is called for 'Staff 2', it only checks for 'Staff 2' in the first row of dataframe_2, and doesn't find it because it's in the second row.
Does this make any sense?
Any help is greatly appreciated.

Have you tried calling the function? Executing the code you have given will just define the function; you need to call it with the proper arguments to see any output. I couldn't find any other mistakes in the code. Please correct me if I am wrong.

It is because your return is inside the iterrows for loop. The return needs to be unindented so that it sits outside the for loop; otherwise the function returns during the first iteration, before the later rows are ever checked.
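A minimal corrected sketch, using the question's own dataframe_2, might look like this (the return is moved after the loop, and days_off is initialised once rather than per row):

```python
import pandas as pd

data2 = {'Name': ['Staff 1', 'Staff 2', 'Staff 3'],
         'Site': ['2', '2', '2'],
         'OT': ['yes', 'yes', 'no'],
         'Days off': ['', '9/4/19', '9/4/19'],
         '': ['', '', '9/5/19']}
dataframe_2 = pd.DataFrame.from_dict(data2)

def annual_leave(staff, df):
    df = df.reset_index(drop=True)
    days_off = []                      # initialise once, not on every row
    for index, row in df.iterrows():
        if df.loc[index, 'Name'] == staff:
            days_off = list(row)
            del days_off[0:3]          # drop the Name, Site and OT columns
    return days_off                    # return only after the loop finishes

print(annual_leave('Staff 2', dataframe_2))  # ['9/4/19', '']
```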

Related

Is there an 'inplace' alternative to pandas.concat ? Adding rows to a Dataframe with a function on a button

So I am trying to add employee information to a pandas DataFrame upon pressing a tkinter button.
I defined a function for the button command containing the variable new_employee, which calls the class Employee with all the entries as its attributes, then puts all the attributes into a single-row DataFrame, and finally concatenates that row with the existing DataFrame.
Everything works fine until I enter more than one new employee. I guess the function starts over again: even though the employee data is saved into lists, the DataFrame is recreated every time and only shows the latest inserted row.
I am looking for any advice on how I could make the DataFrame update in place.
def create_new_employee_button():
    new_employee = Employee(
        surname_entry.get(),
        lastname_entry.get(),
        gender_entry.get(),
        competences_entry.get(),
        pay_rate_entry.get(),
        address_entry.get(),
        age_entry.get(),
        languages_entry.get(),
        availability_entry.get(),
        hours_week_entry.get(),
        email_entry.get()
    )
    new_employee_id = str('E_' + str((int(len(employee_list)) + 1)))
    Employee.add_employee_to_list(new_employee)
    row_to_add_df = pd.DataFrame([[new_employee.surname,
                                   new_employee.lastname,
                                   new_employee.gender,
                                   new_employee.competences,
                                   new_employee.paye_rate,
                                   new_employee.address,
                                   new_employee.age,
                                   new_employee.languages,
                                   new_employee.availability,
                                   new_employee.hours_week,
                                   new_employee.email,
                                   new_employee_id]],
                                 columns={
                                     'Name',
                                     'Surname',
                                     'Gender',
                                     'Competences',
                                     'Pay Rate',
                                     'Address',
                                     'Age',
                                     'Languages',
                                     'Availability (Days)',
                                     'Hours per Week',
                                     'Email address',
                                     'Employee ID'
                                 }
    )
    main_df = pd.concat([employees_df, row_to_add_df], ignore_index=True)
    print(main_df)
    pop_up_window.destroy()
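One common fix, sketched below under the assumption that employees_df is the dataframe meant to accumulate rows: the callback binds the concat result to a local name (main_df), so the accumulated frame is thrown away when the function returns. Reassigning the module-level dataframe (which requires global inside a function) keeps earlier rows across button presses. The add_employee function and column names here are illustrative, not the asker's actual code:

```python
import pandas as pd

# Module-level dataframe that accumulates a row per button press
employees_df = pd.DataFrame(columns=['Name', 'Employee ID'])

def add_employee(name, employee_id):
    # Rebind the module-level name, not a local one, so rows persist
    global employees_df
    row = pd.DataFrame([[name, employee_id]], columns=['Name', 'Employee ID'])
    employees_df = pd.concat([employees_df, row], ignore_index=True)

add_employee('Smith', 'E_1')
add_employee('Jones', 'E_2')
print(len(employees_df))  # 2
```

Note also that the original passes columns as a set ({...}); sets have no guaranteed order, so a list ([...]) is needed to keep the columns aligned with the row values.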

iterating through csv dataframe to create/assign variables

I have a csv file that will contain a frequently updated (overwritten) dataframe with a few rows of purchase orders, something like this:
uniqueId  item     action  quantity  price
123       widget1  buy     10        99.44
234       widget2  sell    15        19.99
345       widget3  buy     2         999.99
This csv file will be passed to my python code by another program; my code will check for its presence every few minutes. Once it appears, the code will read it. I'm not including the code for that, since that's not the issue.
The idea is to turn this purchase order dataframe into something that I can pass to my (already written) place-the-order code. I want to iterate through each row in order (enumerate?), and assign the values from that row to variables that I use in the order code, then reassign the new values to the same variable for the next row after the order from that row has been placed.
As I understand it, itertuples() is probably the way to go for iterating through it, but I'm new enough to Python that I can't figure out the actual mechanism/syntax of using it to do what I want. All my trial-and-error tests of assigning the values to reusable variables result in syntax errors.
I'm having a mental block on what is probably very basic python! I know how to iterate through the rows and print 'em out--plenty of examples out there show me how to do that--but not how to turn the data into something I can use elsewhere. Can someone walk me through an example or two that actually applies to what I'm trying to do?
Like you said, you can quite easily iterate over a dataframe with .itertuples()
Here's how I would go about it (df is your dataframe; for it I used the data from your example):
Code:
for row in df.itertuples():
    print(row)
Output:
Pandas(Index=0, uniqueId=123, item='widget1', action='buy', quantity=10, price=99.44)
Pandas(Index=1, uniqueId=234, item='widget2', action='sell', quantity=15, price=19.99)
Pandas(Index=2, uniqueId=345, item='widget3', action='buy', quantity=2, price=999.99)
If you want to get specific entries of the tuples you need to use the position in the tuple as index:
Code:
for row in df.itertuples():
    uniqueID = row[1]
    print(uniqueID)
Output:
123
234
345
I'm not sure what the rest of your code looks like. If you have the place-the-order code inside a function, you could just call that function in the for loop after assigning the variables to your liking (a Python function name can't contain hyphens, so it is written place_the_order here):
for row in df.itertuples():
    uniqueID = row[1]
    item = row[2]
    action = row[3]
    quantity = row[4]
    price = row[5]
    place_the_order(uniqueID, item, action, quantity, price)
(You could even skip assigning the variables and just call place_the_order(row[1], row[2], ...). In my opinion it is more readable to assign the variables.)
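As a side note, itertuples() yields namedtuples, so instead of positional indices you can read the columns by name as attributes, which is less error-prone if the column order ever changes. A small sketch using the question's data:

```python
import pandas as pd

df = pd.DataFrame({'uniqueId': [123, 234, 345],
                   'item': ['widget1', 'widget2', 'widget3'],
                   'action': ['buy', 'sell', 'buy'],
                   'quantity': [10, 15, 2],
                   'price': [99.44, 19.99, 999.99]})

for row in df.itertuples():
    # Attribute access by column name instead of row[1], row[2], ...
    print(row.uniqueId, row.item, row.action, row.quantity, row.price)
```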
If your place-the-order code is not in a function I would recommend using a nested dictionary with the row index as key and a dictionary of the content of the row as value. The row index is easily accessible as it is the first item in the tuple.
content_of_rows = {}
for row in df.itertuples():
    index = row[0]
    uniqueID = row[1]
    item = row[2]
    action = row[3]
    quantity = row[4]
    price = row[5]
    content_of_rows.update({index: {"uniqueID": uniqueID, "item": item, "action": action, "quantity": quantity, "price": price}})
print(content_of_rows)
Output:
{0: {'uniqueID': 123, 'item': 'widget1', 'action': 'buy', 'quantity': 10, 'price': 99.44},
1: {'uniqueID': 234, 'item': 'widget2', 'action': 'sell', 'quantity': 15, 'price': 19.99},
2: {'uniqueID': 345, 'item': 'widget3', 'action': 'buy', 'quantity': 2, 'price': 999.99}}
This way you can't use the same variable for every row, since it's generally speaking just a different way of writing a dataframe. You can iterate over dictionaries pretty much the same as over tuples but instead of numerical indices you have to use the key.
for row in content_of_rows:
    # row is the key: 0 in the first iteration, 1 in the second, and so on
    print(content_of_rows[row])
Output:
{'uniqueID': 123, 'item': 'widget1', 'action': 'buy', 'quantity': 10, 'price': 99.44}
{'uniqueID': 234, 'item': 'widget2', 'action': 'sell', 'quantity': 15, 'price': 19.99}
{'uniqueID': 345, 'item': 'widget3', 'action': 'buy', 'quantity': 2, 'price': 999.99}
If you want to get the uniqueID of the rows, you would do something like this:
Code:
for row in content_of_rows:
    print(content_of_rows[row]["uniqueID"])  # just put the second key you're looking for right after the first, also in []
Output:
123
234
345
It's usually best to put different parts of your code into functions, so if you haven't done that already I'd recommend you do so. That way you can use the same variables for each row.
I hope this (kinda long, sorry about that) answer could help you. Greetings from Bavaria!

Finding and replacing values in specific columns in a CSV file using dictionaries

My goal here is to clean up address data from individual CSV files using dictionaries for each individual column, sort of like automating the find-and-replace feature from Excel. The addresses are divided into columns: house numbers, street names, directions and street type each in their own column. I used the following code to do the whole document.
missad = {
    'Typo goes here': 'Corrected typo goes here'}

def replace_all(text, dic):
    for i, j in dic.items():  # iterate over the dictionary passed in, not missad
        text = text.replace(i, j)
    return text

with open('original.csv', 'r') as csvfile:
    text = csvfile.read()
    text = replace_all(text, missad)
with open('cleanfile.csv', 'w') as cleancsv:
    cleancsv.write(text)
While the code works, I need to have separate dictionaries, as some columns need specific typo fixes. For example, housenum for the house-numbers column, stdir for the street direction, and so on, each with their column-specific typos:
housenum = {
    'One': '1',
    'Two': '2'
}
stdir = {
    'NULL': ''}
I have no idea how to proceed; I feel it's something simple, or that I would need pandas, but am unsure how to continue. Would appreciate any help! Also, is there any way to group several typos together with one corrected typo? I tried the following but got an unhashable type error.
missad = {
    ['Typo goes here', 'Typo 2 goes here', 'Typo 3 goes here']: 'Corrected typo goes here'}
Is something like this what you are looking for?
import pandas as pd

df = pd.read_csv(filename, index_col=False)  # using pandas to read in the CSV file

# let's say you want to do corrections on the 'columnforcorrection' column
correctiondict = {
    'one': 1,
    'two': 2
}

df['columnforcorrection'] = df['columnforcorrection'].replace(correctiondict)
and use this idea for other columns of interest.
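On the follow-up question: a list can't be a dictionary key because lists are mutable and therefore unhashable. One way to expand several typos to the same correction is a dict comprehension over a tuple of typos (a sketch; the typo strings here are made up):

```python
# Several typos all mapping to one correction, built with a dict comprehension
typos = ('Strret', 'Steet', 'Stret')
missad = {typo: 'Street' for typo in typos}

print(missad['Steet'])  # Street
```

The resulting missad is an ordinary flat dictionary, so it drops straight into replace_all or Series.replace unchanged.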

Access dynamically created data frames

Hello Python community,
I have a problem with my code.
I wrote code that dynamically creates dataframes in a for loop. The problem is that I don't know how to access them.
Here is part of the code:
list = ['Group 1', 'Group 2', 'Group 3']
for i in list:
    exec('df{} = pd.DataFrame()'.format(i))
for i in list:
    print(df+i)
The dataframes are created, but I cannot access them.
Could someone help me, please?
Thank you in advance.
I'm not sure exactly how your data is stored/accessed, but you could create a dictionary to pair your list items with each dataframe as follows:
import numpy as np
import pandas as pd

list_ = ['Group 1', 'Group 2', 'Group 3']
dataframe_dict = {}
for i in list_:
    data = np.random.rand(3, 3)  # create data for dataframe here
    dataframe_dict[i] = pd.DataFrame(data, columns=["your_column_one", "two", "etc"])
You can then retrieve each dataframe by using its associated group name as the key of the dictionary:
for key in dataframe_dict.keys():
    print(key)
    print(dataframe_dict[key])

Pandas: Selecting Multiple Column Values by Adjacent Column Value

I use the following code snippet to get the adjacent 'Policy Name' column in a data frame when I have the 'Client Name':
policy = df.loc[df['Client Name'] == machine.lower(), 'Policy Name']
If there are multiple rows for the 'Client Name' and they have different policies, how can I grab them all? As it stands, the current code gets me the last entry in the data frame.
As it stands, the current code gets me the last entry in the data frame.
This isn't true. See below for a minimal counter-example.
df = pd.DataFrame({'Client Name': ['philip', 'ursula', 'frank', 'ursula'],
                   'Policy Name': ['policy1', 'policy2', 'policy3', 'policy4']})
machine = 'Ursula'
policy = df.loc[df['Client Name'] == machine.lower(), 'Policy Name']
print(policy)
1 policy2
3 policy4
Name: Policy Name, dtype: object
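Since .loc with a boolean mask already returns a Series containing every matching row, you can collect all of the policies at once with .tolist() (a sketch using the same counter-example data):

```python
import pandas as pd

df = pd.DataFrame({'Client Name': ['philip', 'ursula', 'frank', 'ursula'],
                   'Policy Name': ['policy1', 'policy2', 'policy3', 'policy4']})
machine = 'Ursula'

# The mask keeps every matching row; .tolist() turns the Series into a plain list
policies = df.loc[df['Client Name'] == machine.lower(), 'Policy Name'].tolist()
print(policies)  # ['policy2', 'policy4']
```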
