Pandas: Selecting Multiple Column Values by Adjacent Column Value

I use the following code snippet to get the adjacent 'Policy Name' column in a data frame when I have the 'Client Name':
policy = df.loc[df['Client Name'] == machine.lower(), 'Policy Name']
If there are multiple rows for the 'Client Name' and they have different policies, how can I grab them all? As it stands, the current code gets me the last entry in the data frame.

"As it stands, the current code gets me the last entry in the data frame."
This isn't true. See below for a minimal counter-example.
import pandas as pd

df = pd.DataFrame({'Client Name': ['philip', 'ursula', 'frank', 'ursula'],
                   'Policy Name': ['policy1', 'policy2', 'policy3', 'policy4']})
machine = 'Ursula'
policy = df.loc[df['Client Name'] == machine.lower(), 'Policy Name']
print(policy)
1    policy2
3    policy4
Name: Policy Name, dtype: object
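So .loc already returns every matching row as a Series, not just the last one. If you want the policies as a plain Python list, one small follow-up (using the df above) is:
# Series.tolist() converts the matching values to a plain list.
policies = policy.tolist()
print(policies)  # ['policy2', 'policy4']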

Is there an 'inplace' alternative to pandas.concat? Adding rows to a DataFrame with a function on a button

So I am trying to add employee information to a pandas DataFrame when a tkinter button is pressed.
I defined a function for the button command containing the variable new_employee, which instantiates the class Employee with all the entries as its attributes. Then I put all the attributes into a single-row DataFrame, and finally concatenate that row with the existing DataFrame.
Everything works fine until I enter more than one new employee. I guess the function starts over again: even though the employee data are saved into lists, the data frame is rebuilt every time and only shows the latest inserted row.
I am looking for any advice on how I could make the DataFrame update in place.
def create_new_employee_button():
    new_employee = Employee(
        surname_entry.get(),
        lastname_entry.get(),
        gender_entry.get(),
        competences_entry.get(),
        pay_rate_entry.get(),
        address_entry.get(),
        age_entry.get(),
        languages_entry.get(),
        availability_entry.get(),
        hours_week_entry.get(),
        email_entry.get()
    )
    new_employee_id = str('E_' + str((int(len(employee_list)) + 1)))
    Employee.add_employee_to_list(new_employee)
    row_to_add_df = pd.DataFrame([[new_employee.surname,
                                   new_employee.lastname,
                                   new_employee.gender,
                                   new_employee.competences,
                                   new_employee.paye_rate,
                                   new_employee.address,
                                   new_employee.age,
                                   new_employee.languages,
                                   new_employee.availability,
                                   new_employee.hours_week,
                                   new_employee.email,
                                   new_employee_id]],
                                 columns={
                                     'Name',
                                     'Surname',
                                     'Gender',
                                     'Competences',
                                     'Pay Rate',
                                     'Address',
                                     'Age',
                                     'Languages',
                                     'Availability (Days)',
                                     'Hours per Week',
                                     'Email address',
                                     'Employee ID'
                                 }
                                 )
    main_df = pd.concat([employees_df, row_to_add_df], ignore_index=True)
    print(main_df)
    pop_up_window.destroy()
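pd.concat has no inplace option; the usual fix is to rebind the accumulating DataFrame to a name that outlives the button callback. A minimal, simplified sketch (assuming a module-level employees_df with made-up columns, not the asker's real schema):
import pandas as pd

employees_df = pd.DataFrame(columns=['Name', 'Surname'])

def add_employee(name, surname):
    # Rebind the module-level DataFrame: pd.concat returns a new object,
    # so the result must be stored somewhere that survives this call.
    global employees_df
    row = pd.DataFrame([[name, surname]], columns=['Name', 'Surname'])
    employees_df = pd.concat([employees_df, row], ignore_index=True)

add_employee('Jane', 'Doe')
add_employee('John', 'Smith')
print(employees_df)  # both rows are still there after the second call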

Generating separate groupby tables for each month in dataframe using for loops or other methods

I have a pandas dataframe (sample):
import pandas as pd

data = [['ABC', 'John', '123', 'Yes', '2022_Jan'],
        ['BCD', 'Amy', '456', 'Yes', '2022_Jan'],
        ['ABC', 'Michelle', '123', 'No', '2022_Feb'],
        ['CDE', 'John', '789', 'No', '2022_Feb'],
        ['ABC', 'Michelle', '012', 'Yes', '2022_Mar'],
        ['BCD', 'Amy', '123', 'No', '2022_Mar'],
        ['CDE', 'Jill', '789', 'No', '2022_Mar'],
        ['CDE', 'Jack', '789', 'No', '2022_Mar']]
tmp2 = pd.DataFrame(data, columns=['Responsibility', 'Name', 'ID', 'Has Error', 'Year_Month'])
tmp3 = tmp2[['Responsibility', 'Name', 'ID', 'Has Error']]
The actual dataframe is a lot larger with more columns, but the above are the only fields I need right now. I already have the following code that generates a year-to-date table that groups by 'Responsibility' and 'Name' and gives me the number & % of unique 'ID's that have errors and don't have errors, and exports the table to a single Excel sheet:
result = pd.pivot_table(tmp3, index =['Responsibility', 'Name'], columns = ['Has Error'], aggfunc=len)
#cleanup
result.fillna(0, inplace=True)
result.columns = [s1 + "_" + str(s2) for (s1,s2) in result.columns.tolist()]
result = result.rename(columns={'ID_No': 'Does NOT Have Error (count)', 'ID_Yes': 'Has Error (count)'})
result = result.astype(int)
#create fields for %s and totals
result['Has Error (%)'] = round(result['Has Error (count)'] / (result['Has Error (count)'] + result['Does NOT Have Error (count)']) *100, 2).astype(str)+'%'
result['Does NOT Have Error (%)'] = round(result['Does NOT Have Error (count)'] / (result['Has Error (count)'] + result['Does NOT Have Error (count)']) *100, 2).astype(str) + '%'
result['Total Count'] = result['Has Error (count)'] + result['Does NOT Have Error (count)']
result = result.reindex(columns=['Has Error (%)', 'Does NOT Have Error (%)', 'Has Error (count)', 'Does NOT Have Error (count)', 'Total Count'])
#save to excel
Excelwriter = pd.ExcelWriter('./output/final.xlsx',engine='xlsxwriter')
workbook=Excelwriter.book
result.to_excel(Excelwriter,sheet_name='YTD Summary',startrow=0 , startcol=0)
Now, I want to keep this YTD summary sheet, and also generate the 'result' table for each month (from the 'Year_Month' field in the original dataset tmp2), exporting the table for each month into a separate Excel sheet within the same output file. I will be generating this entire output file on a recurring basis, so I want to write the code so that when I read in a new dataframe, it automatically identifies each month available in the data, generates a separate table for each month using the code I've already written above, and exports each table into a separate tab of the Excel file. I'm a beginner at Python, and I'm finding this harder to do than I originally thought; what I've tried so far is not working. I know one way to do this would be a for loop or matrix functions, but I can't figure out how to make the code work. Any help would be greatly appreciated!
Assuming you don't care about the year, you can split the month from the last column and then iterate over groupby.
Split the month:
df['month'] = df['Year_Month'].str.split('_').str[1]
Iterate with groupby:
for month, df_month in df.groupby('month'):
    # your processing stuff here
    # 'df_month' is the sub-dataframe for one month
    df_month_processed.to_excel(ExcelWriter, sheet_name=month, ...)
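A fuller sketch of how this could plug into the code from the question, assuming the pivot-and-cleanup steps above are wrapped in a hypothetical helper make_summary(frame) that returns the finished table:
# make_summary is a stand-in for the pivot/cleanup/reindex code above.
tmp2['month'] = tmp2['Year_Month'].str.split('_').str[1]
with pd.ExcelWriter('./output/final.xlsx', engine='xlsxwriter') as writer:
    # YTD sheet first, then one sheet per month found in the data.
    make_summary(tmp2).to_excel(writer, sheet_name='YTD Summary')
    for month, df_month in tmp2.groupby('month'):
        make_summary(df_month).to_excel(writer, sheet_name=month)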

Python pandas function to concat into one row different values into one column based on repeating values in another

Apologies, I didn't even know how to title/describe the issue I am having, so bear with me. I have the following code:
import pandas as pd
data = {'Invoice Number': [1279581, 1279581, 1229422, 1229422, 1229422],
        'Project Key': [263736, 263736, 259661, 259661, 259661],
        'Project Type': ['Visibility', 'Culture', 'Spend', 'Visibility', 'Culture']}
df = pd.DataFrame(data)
How do I get the output to group the Invoice Numbers so that there is only one row per Invoice Number, combining the multiple Project Types (for that one invoice) into a single row?
Code for the desired output is below.
Thanks, much appreciated.
import pandas as pd

data = {'Invoice Number': [1279581, 1229422],
        'Project Key': [263736, 259661],
        'Project Type': ['Visibility_Culture', 'Spend_Visibility_Culture']}
output = pd.DataFrame(data)
output
>>> (df
     .groupby(['Invoice Number', 'Project Key'])['Project Type']
     .apply(lambda x: '_'.join(x))
     .reset_index()
)
   Invoice Number  Project Key              Project Type
0         1229422       259661  Spend_Visibility_Culture
1         1279581       263736        Visibility_Culture
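An equivalent, slightly shorter spelling (same output, assuming every 'Project Type' value is a string) passes the join callable straight to agg:
out = (df
       .groupby(['Invoice Number', 'Project Key'])['Project Type']
       .agg('_'.join)  # applied once per group to the Series of types
       .reset_index())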

Pandas iterrows not working within function

I'm creating a small-scale scheduling script and having some issues with iterrows. These are very small dfs, so time is minimal (6 rows and maybe 7/8 columns), although I'm guessing these loops are not the most efficient - I am pretty new to this!
Here's what I have already
import pandas as pd

data = {'Staff 1': ['9-5', '9-5', '9-5', '9-5', '9-5'],
        'Staff 2': ['9-5', '9-5', '9-5', '9-5', '9-5'],
        'Staff 3': ['9-5', '9-5', '9-5', '9-5', '9-5']}
dataframe_1 = pd.DataFrame.from_dict(data, orient='index',
                                     columns=['9/2/19', '9/3/19', '9/4/19', '9/5/19', '9/6/19'])
data2 = {'Name': ['Staff 1', 'Staff 2', 'Staff 3'], 'Site': ['2', '2', '2'], 'OT': ['yes', 'yes', 'no'],
         'Days off': ['', '9/4/19', '9/4/19'], '': ['', '', '9/5/19']}
dataframe_2 = pd.DataFrame.from_dict(data2)
def annual_leave(staff, df):
    df = df.reset_index(drop=True)
    for index, row in df.iterrows():
        days_off = []
        if df.loc[index, 'Name'] == '{}'.format(staff):
            for cell in row:
                days_off.append(cell)
            del days_off[0:3]
        else:
            pass
        return days_off

for index, row in dataframe_1.iterrows():
    print(annual_leave(index, dataframe_2))
I added a few 'print(index)' calls in places to see if I could work out where it was going wrong.
I found that the bottom iterrows loop is running through each row. However, the iterrows loop in the function is only looking at the first row, and I don't understand why.
I am trying to go through each staff name (the index) in dataframe_1 and check that staff name against a column in dataframe_2. I then want to get rid of the first 3 columns of that particular row in dataframe_2 (hence the list and del days_off[0:3]).
However, in this example it is running the bottom iterrows loop (outside of the function) for 'Staff 1', 'Staff 2', and 'Staff 3'. But the iterrows loop inside the function is only checking against the 'Staff 1' name.
This means it only works for 'Staff 1'; when the function is called for 'Staff 2', it is only checking for 'Staff 2' in the first row of dataframe_2 - and not finding it, because it's in the second row.
Does this make any sense?
Any help is greatly appreciated.
Have you tried calling the function? Executing the above code as given will just define the function; you need to call it with proper arguments to see the output. I couldn't find any other mistakes in the code. Please correct me if I am wrong.
It is because your return is inside the iterrows for loop. return needs to be unindented so that it is outside the for loop.
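A minimal sketch of the corrected function: the list is created before the loop, and the return happens only after every row has been checked, with the first three columns (Name, Site, OT) dropped on a match:
def annual_leave(staff, df):
    df = df.reset_index(drop=True)
    days_off = []
    for index, row in df.iterrows():
        if row['Name'] == staff:
            # Keep everything after the first three columns.
            days_off = list(row)[3:]
    return days_off  # outside the loop, so all rows get checked

for index, row in dataframe_1.iterrows():
    print(annual_leave(index, dataframe_2))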

Add same value to multiple sets of rows. The value changes based on condition

I have a dataframe that is dynamically created.
I create my first set of rows as:
df['tourist_spots'] = pd.Series(<A list of tourist spots in a city>)
To this df I add:
df['city'] = <City Name>
So far so good. A bunch of rows are created with the same city name for multiple tourist spots.
I want to add a new city. So I do:
df['tourist_spots'].append(pd.Series(<new data>))
Now, when I append a new city with:
df['city'].append('new city')
the previously updated city data is gone. It is as if the rows are replaced each time rather than appended.
Here's an example of what I want:
Step 1:
df['tourist_spot'] = pd.Series('Golden Gate Bridge' + a bunch of other spots)
For all the rows created by the above data I want:
df['city'] = 'San Francisco'
Step 2:
df['tourist_spot'].append(pd.Series('Times Square' + a bunch of other spots))
For all the rows created by the above data, I want:
df['city'] = 'New York'
How can I achieve this?
Use a dictionary to add rows to your data frame; it is a faster method.
Here is an example.
STEP 1
Create dictionary:
dict_df = [{'tourist_spots': 'Jones LLC', 'City': 'Boston'},
           {'tourist_spots': 'Alpha Co', 'City': 'Boston'},
           {'tourist_spots': 'Blue Inc', 'City': 'Singapore'}]
STEP 2
Convert dictionary to dataframe:
df = pd.DataFrame(dict_df)
STEP 3
Add new entries to dataframe in dictionary format:
df = df.append({'tourist_spots': 'New_Blue', 'City': 'Singapore'}, ignore_index=True)
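Note that DataFrame.append was deprecated and then removed in pandas 2.0; on current versions the equivalent of the step above (a sketch reusing df from STEP 2) is:
# pd.concat replaces the removed DataFrame.append for adding rows.
new_row = pd.DataFrame([{'tourist_spots': 'New_Blue', 'City': 'Singapore'}])
df = pd.concat([df, new_row], ignore_index=True)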
References:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_dict.html