Why am I getting a "TypeError: unhashable type: 'numpy.ndarray'" error? Also, I don't recall importing numpy into my code, so what is numpy.ndarray doing there? The error is in the last line of the code:
import pandas as pd
import matplotlib.pyplot as plt
entries_csv = "C:\\Users\\Asus\\Desktop\\Entries.csv"
listofaccounts_csv = "C:\\Users\\Asus\\Desktop\\List of Accounts.csv"
data_entries = pd.read_csv(entries_csv)
data_listofaccounts = pd.read_csv(listofaccounts_csv)
i = 0
summary_name = [0]*len(data_listofaccounts)
summary = [0]*1*len(data_listofaccounts)
for account_name in data_listofaccounts['Account Name']:
    summary_name[i] = account_name
    for debit_account in data_entries['DEBIT ACCOUNT']:
        if account_name == debit_account:
            summary[i] += data_entries['DEBIT AMOUNT']
    i += 1
plt.bar(list(summary_name), list(summary))
These are the data:
1.) Entries:
2.) List of Accounts:
Basically, for each item in the list of accounts, I want to make a summary in which all the debit amounts are summed for each account.
In this case you want to use pd.merge to join your two dataframes. See here: https://pandas.pydata.org/pandas-docs/version/0.23/generated/pandas.DataFrame.merge.html . Once the two tables are joined, group by Account Name and perform your aggregations. For example:
list_of_accounts_df = pd.DataFrame({
    'Account Name': ['ACCOUNT PAYABLE', 'OUTSIDE SERVICE'],
    'Type': ['CURRENT LIABILITY', 'EXPENSE']
})
entries_df = pd.DataFrame({
    'DEBIT ACCOUNT': ['OUTSIDE SERVICE', 'OUTSIDE SERVICE'],
    'DEBIT AMOUNT': [46375.8, 42091.42],
    'CREDIT ACCOUNT': ['CASH IN BANK', 'CASH ON HAND'],
    'CREDIT AMOUNT': [46375.8, 42091.42]
})
# Left join keeps accounts that have no entries; then sum the debits per account.
x = (
    pd.merge(list_of_accounts_df, entries_df,
             left_on='Account Name', right_on='DEBIT ACCOUNT', how='left')
    .fillna(0)
    .groupby('Account Name')['DEBIT AMOUNT']
    .sum()
)
print(x)
The output is a Series whose index is the Account Name and whose values are the sums of all the debit amounts for each account. So in this case:
Account Name
ACCOUNT PAYABLE 0.00
OUTSIDE SERVICE 88467.22
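As for the TypeError itself: pandas is built on top of numpy, which is why numpy.ndarray shows up even though you never imported it. In your loop, summary[i] += data_entries['DEBIT AMOUNT'] adds the entire column rather than a single amount, so the list ends up holding Series objects, and plt.bar most likely fails when it tries to treat them as plain numbers. Here is a minimal sketch of the difference, using made-up data shaped like your entries table (the names entries, mask and total are only for illustration):

```python
import pandas as pd

# Made-up stand-in for the Entries.csv data.
entries = pd.DataFrame({
    "DEBIT ACCOUNT": ["OUTSIDE SERVICE", "OUTSIDE SERVICE"],
    "DEBIT AMOUNT": [46375.8, 42091.42],
})

summary = [0]
# Adding the whole column turns the list element into a Series,
# not a number.
summary[0] += entries["DEBIT AMOUNT"]
print(type(summary[0]))

# Selecting only the matching rows and summing them gives one number.
mask = entries["DEBIT ACCOUNT"] == "OUTSIDE SERVICE"
total = entries.loc[mask, "DEBIT AMOUNT"].sum()
print(total)
```

The merge and groupby approach does this selection and summing for every account at once.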
As for how to plot it: depending on your matplotlib version, a bar plot may not accept string values directly for the x-axis, so plot against integer positions and set the tick labels separately.
Using this example: https://pythonspot.com/matplotlib-bar-chart/, in our case you can just do:
# x is the Series returned by the merge/groupby above
objects = x.index.values
y_pos = range(len(objects))
vals = x.values
plt.bar(y_pos, vals, align='center')
plt.xticks(y_pos, objects)
plt.ylabel('Sum of Debits')
plt.title('Total Debits Per Account')
plt.show()
Which gives this in our simple example:
The class is composed of a set of attributes and functions including:
Attributes:
df : a pandas dataframe.
numerical_feature_names: df columns with a numeric value.
label_column_names: df string columns to be grouped.
Functions:
mean(nums): takes a list of numbers as input and returns the mean
fill_na(df, numerical_feature_names, label_columns): takes class attributes as inputs and returns a transformed df.
And here's the class:
class PLUMBER():
    def __init__(self):
        ################# attributes ################
        self.df=df
        # specify label and numerical features names:
        self.numerical_feature_names=numerical_feature_names
        self.label_column_names=label_column_names
    ##################### mean ##############################
    def mean(self, nums):
        total=0.0
        for num in nums:
            total=total+num
        return total/len(nums)
    ############ fill the numerical features ##################
    def fill_na(self, df, numerical_feature_names, label_column_names):
        # declaring parameters:
        df=self.df
        numerical_feature_names=self.numerical_feature_names
        label_column_names=self.label_column_names
        # now replacing NaN with group mean
        for numerical_feature_name in numerical_feature_names:
            df[numerical_feature_name]=df.groupby([label_column_names]).transform(lambda x: x.fillna(self.mean(x)))
        return df
When trying to apply it to a pandas df:
if __name__=="__main__":
    # initialize class
    plumber=PLUMBER()
    # replace NaN with group mean
    df=plumber.fill_na(df=df, numerical_feature_names=numerical_feature_names, label_column_names=label_column_names)
The following error is raised:
ValueError: Grouper and axis must be same length
data and class parameters
import numpy as np
import pandas as pd
d={'month': ['01/01/2020', '01/02/2020', '01/03/2020', '01/01/2020', '01/02/2020', '01/03/2020'],
   'country': ['Japan', 'Japan', 'Japan', 'Poland', 'Poland', 'Poland'],
   'level':['A01', 'A01', 'A01', 'A00','A00', 'A00'],
   'job title':['Insights Manager', 'Insights Manager', 'Insights Manager', 'Sales Director', 'Sales Director', 'Sales Director'],
   'number':[np.nan, 450, 299, np.nan, 19, 29],
   'age':[np.nan, 30, 28, np.nan, 29, 18]}
df=pd.DataFrame(d)
# headers
column_names=df.columns.values.tolist()
column_names= [column_name.strip() for column_name in column_names]
# label_column_names (to be grouped)
label_column_names=['country', 'level', 'job title']
# numerical_features:
numerical_feature_names = [x for x in column_names if x not in label_column_names]
numerical_feature_names.remove('month')
How could I change the class in order to get the transformed df (i.e. the one that replaces np.nan with its group mean)?
First, the error occurs because label_column_names is already a list, so you don't need the [] around it in the groupby. It should be df.groupby(label_column_names)... instead of df.groupby([label_column_names])...
Now, to actually solve your problem, replace the for loop in the fill_na function of your class (you don't actually need it) with:
df[numerical_feature_names] = (
    df[numerical_feature_names]
    .fillna(
        df.groupby(label_column_names)
        [numerical_feature_names].transform('mean')
    )
)
Here you fillna the numerical_feature_names columns with the result of groupby.transform, which holds the group mean of each of those columns.
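To show how this could fit into the class, here is a minimal sketch under some assumptions: the class name Plumber, the constructor arguments, and the decision to return a copy are my own choices for illustration, not something the question requires. It uses a trimmed version of the sample data.

```python
import numpy as np
import pandas as pd


class Plumber:
    """Hypothetical rework of the PLUMBER class from the question."""

    def __init__(self, df, numerical_feature_names, label_column_names):
        # Pass the data in explicitly instead of relying on globals.
        self.df = df
        self.numerical_feature_names = numerical_feature_names
        self.label_column_names = label_column_names

    def fill_na(self):
        # Work on a copy so the caller's DataFrame is left untouched.
        df = self.df.copy()
        # One row per original row, holding that row's group mean.
        group_means = (
            df.groupby(self.label_column_names)[self.numerical_feature_names]
            .transform("mean")
        )
        # fillna aligns on index and columns, so only the NaN cells change.
        df[self.numerical_feature_names] = (
            df[self.numerical_feature_names].fillna(group_means)
        )
        return df


df = pd.DataFrame({
    "country": ["Japan", "Japan", "Japan", "Poland", "Poland", "Poland"],
    "level": ["A01", "A01", "A01", "A00", "A00", "A00"],
    "number": [np.nan, 450, 299, np.nan, 19, 29],
    "age": [np.nan, 30, 28, np.nan, 29, 18],
})

plumber = Plumber(df, ["number", "age"], ["country", "level"])
filled = plumber.fill_na()
print(filled)
```

The custom mean method is no longer needed, because transform('mean') already skips NaN values when computing each group's mean.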
Apologies, I didn't even know how to title/describe the issue I am having, so bear with me. I have the following code:
import pandas as pd
data = {'Invoice Number':[1279581, 1279581,1229422, 1229422, 1229422],
'Project Key':[263736, 263736, 259661, 259661, 259661],
'Project Type': ['Visibility', 'Culture', 'Spend', 'Visibility', 'Culture']}
df= pd.DataFrame(data)
How do I get the output to basically group the Invoice Numbers so that there is only 1 row per Invoice Number and combine the multiple Project Types (per that 1 Invoice) into 1 row?
The code for the desired output is below.
Thanks, much appreciated.
import pandas as pd
data = {'Invoice Number':[1279581,1229422],
'Project Key':[263736, 259661],
'Project Type': ['Visibility_Culture', 'Spend_Visibility_Culture']
}
output = pd.DataFrame(data)
output
>>> (df
     .groupby(['Invoice Number', 'Project Key'])['Project Type']
     .apply(lambda x: '_'.join(x))
     .reset_index()
    )
   Invoice Number  Project Key              Project Type
0         1229422       259661  Spend_Visibility_Culture
1         1279581       263736        Visibility_Culture
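Note that groupby sorts the group keys by default, which is why the invoices come back in a different order than in the desired output. If the original order matters, a small variation (a sketch, assuming first-appearance order is what you want; the name combined is only for illustration) is to pass sort=False:

```python
import pandas as pd

df = pd.DataFrame({
    "Invoice Number": [1279581, 1279581, 1229422, 1229422, 1229422],
    "Project Key": [263736, 263736, 259661, 259661, 259661],
    "Project Type": ["Visibility", "Culture", "Spend", "Visibility", "Culture"],
})

combined = (
    df.groupby(["Invoice Number", "Project Key"], sort=False)["Project Type"]
    .agg("_".join)   # join each group's strings with an underscore
    .reset_index()
)
print(combined)
```

Here agg("_".join) does the same job as the lambda above; it is only a shorter spelling.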
I have a dataframe that looks something like this
import pandas as pd
sectors = [['Industrials', 'Health Care', 'Information Technology', 'Industrials'], ['Health Care', 'Health Care', 'Information Technology'], ['Industrials', 'Information Technology', 'Health Care', 'Information Technology', 'Information Technology'], ['Information Technology', 'Health Care']]
some_date = ['2015-12-01', '2016-01-05', '2016-02-01', '2016-03-01']
somelist = []
for i in range(len(some_date)):
    somelist.append((some_date[i], sectors[i]))
df = pd.DataFrame(somelist, columns = ['date', 'sectors'])
I would like to create a plt.stackplot where the X-axis is the date and the Y-axis is number of times any sector is mentioned.
The problem is that the values are strings, not integers. One approach would be to iterate through each row of the DataFrame and count how many times each sector is mentioned for each date, but I don't always know the sector names in advance, so I'm wondering if there's a more efficient way to solve this.
I tried to plot a plt.pie by using df['sectors'].sum() to check how many times throughout the complete date-range each sector is mentioned, but for this I would also somehow need to convert the strings.
Not sure how efficient this is, but I fixed the data as shown here:
plot_sectors = list(set(df['sectors'].sum()))
plot_sectors = {key: [0]*df.shape[0] for key in plot_sectors}
for i in range(df.shape[0]):
    for sector in df.iloc[i]['sectors']:
        plot_sectors[sector][i] += 1
For the stacked plot, I used:
import numpy as np
import matplotlib.pyplot as plt
y = list(plot_sectors.values())  # one row of counts per sector
x = np.arange(df.shape[0])
plt.stackplot(x, y, labels=plot_sectors.keys())
And for the pie plot I used:
plt.pie([sum(values) for key, values in plot_sectors.items()], autopct='%1.1f%%',
        labels=plot_sectors.keys())
plt.axis('equal')
plt.show()
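If you would rather let pandas do the counting, one alternative (a sketch, assuming a pandas version that has DataFrame.explode, i.e. 0.25 or later; the names exploded and counts are only for illustration) is to put each sector on its own row and cross-tabulate by date, so the sector names never need to be known in advance:

```python
import pandas as pd

sectors = [
    ["Industrials", "Health Care", "Information Technology", "Industrials"],
    ["Health Care", "Health Care", "Information Technology"],
    ["Industrials", "Information Technology", "Health Care",
     "Information Technology", "Information Technology"],
    ["Information Technology", "Health Care"],
]
some_date = ["2015-12-01", "2016-01-05", "2016-02-01", "2016-03-01"]
df = pd.DataFrame({"date": some_date, "sectors": sectors})

# One row per (date, sector) mention.
exploded = df.explode("sectors")

# Rows are dates, columns are sectors, cells are mention counts
# (0 where a sector does not appear on a date).
counts = pd.crosstab(exploded["date"], exploded["sectors"])
print(counts)
```

Each column of counts then plays the role of one list in plot_sectors, and counts.sum() gives the per-sector totals for the pie chart.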
I use the following code snippet to get the adjacent 'Policy Name' column in a data frame when I have the 'Client Name':
policy = df.loc[df['Client Name'] == machine.lower(), 'Policy Name']
If there are multiple rows for the 'Client Name' and they have different policies, how can I grab them all? As it stands, the current code gets me the last entry in the data frame.
"As it stands, the current code gets me the last entry in the data frame."
This isn't true. See below for a minimal counter-example.
import pandas as pd
df = pd.DataFrame({'Client Name': ['philip', 'ursula', 'frank', 'ursula'],
                   'Policy Name': ['policy1', 'policy2', 'policy3', 'policy4']})
machine = 'Ursula'
policy = df.loc[df['Client Name'] == machine.lower(), 'Policy Name']
print(policy)
1 policy2
3 policy4
Name: Policy Name, dtype: object
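Since the selection is a Series containing every match, you can turn it into a plain list if that is easier to work with downstream. A small sketch using the same made-up data (the name policies is only for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "Client Name": ["philip", "ursula", "frank", "ursula"],
    "Policy Name": ["policy1", "policy2", "policy3", "policy4"],
})
machine = "Ursula"

# Boolean mask picks every row for this client; tolist() gives a plain list.
policies = df.loc[df["Client Name"] == machine.lower(), "Policy Name"].tolist()
print(policies)  # ['policy2', 'policy4']
```

If you only ever see one value, it may be worth checking whether a later step in your code reduces the Series to a single element.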
Say that I'm given a dataframe that summarizes different companies:
summary=pandas.DataFrame(columns=['Company Name', 'Formation Date', 'Revenue', 'Profit', 'Loss'])
And then say each company in that dataframe has its own corresponding dataframe, named after the company, giving a more in-depth picture of the company's history and stats. Something like:
exampleco=pandas.DataFrame(columns=['Date', 'Daily Profit', 'Daily Loss', 'Daily Revenue'])
I have a script that processes each row of the summary dataframe, but I would like to grab the name from row['Company Name'] and use it to access the company's own dataframe.
In other words I'd love it if there was something that worked like this:
.
.
>>> company=row['Company Name']
>>> pandas.get_dataframe_from_variable(company)
Empty DataFrame
Columns: ['Date', 'Daily Profit', 'Daily Loss', 'Daily Revenue']
Index: []
[0 rows x 2 columns]
.
.
Any ideas of how I might get this to work would be much appreciated.
Thanks in advance!
You can use a dictionary to contain your DataFrames and use strings as the keys.
companies = {'company1': pandas.DataFrame(columns=['Date', 'Daily Profit',
                                                   'Daily Loss', 'Daily Revenue']),
             'company2': pandas.DataFrame(columns=['Date', 'Daily Profit',
                                                   'Daily Loss', 'Daily Revenue'])}
company=row['Company Name'] # Get your company name as a string from your summary.
company_details = companies[company] # Returns a DataFrame.
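Put together with the loop over the summary rows, it could look something like the sketch below. The sample rows, the "unknownco" entry and the found dictionary are made up for illustration; dict.get is used so that a company without a registered DataFrame is skipped rather than raising a KeyError.

```python
import pandas as pd

columns = ["Date", "Daily Profit", "Daily Loss", "Daily Revenue"]

# Made-up per-company data, keyed by the company name as a string.
companies = {
    "exampleco": pd.DataFrame([["2020-01-01", 10, 2, 12]], columns=columns),
}

# Made-up summary; "unknownco" has no DataFrame of its own.
summary = pd.DataFrame({"Company Name": ["exampleco", "unknownco"]})

found = {}
for _, row in summary.iterrows():
    name = row["Company Name"]
    details = companies.get(name)  # None when the name is not a key
    if details is None:
        continue
    found[name] = details["Daily Profit"].sum()

print(found)
```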