I have the following sample DF:

       Car Model     Sales
0  Mercedes Benz       NaN
1           Audi  100000.0
2        Renault   50000.0
I have 2 calculations.

Calculate the number of blank rows in the Sales column:

missingSalesInfo = DF['Sales'].isnull().sum()
missingSalesInfo = ("Number of missing values: ", missingSalesInfo)

Calculate the total car sales:

totalSales = DF['Sales'].sum()
totalSales = ("Total car sales: ", totalSales)
What I want to do is create a new DF, let's call it DF2, to store the above results. See the example below:

DF2

                Description  Results
0  Number of missing values        1
1           Total car sales   150000
Use Series.agg with the aggregate functions in a dictionary, convert the values to integers, convert the Series to a DataFrame with Series.reset_index, and set the new column names with DataFrame.set_axis:
df2 = (df['Sales'].agg({'Number of missing values': lambda x: x.isna().sum(),
                        'Total car sales': 'sum'})
                  .astype(int)
                  .reset_index()
                  .set_axis(['Description', 'Results'], axis=1)
      )
print (df2)
                Description  Results
0  Number of missing values        1
1           Total car sales   150000
Alternative:
df2 = (df['Sales'].agg({'Number of missing values': lambda x: x.isna().sum(),
                        'Total car sales': 'sum'})
                  .astype(int)
                  .reset_index())
df2.columns = ['Description','Results']
That's just a question of how to create a DataFrame. You can do that in a few ways; here it's done with a dictionary:
df2 = pd.DataFrame({
    'Description': ['Number of missing values', 'Total car sales'],
    'Results': [DF['Sales'].isnull().sum(), DF['Sales'].sum()]
})
Several ways to do that. Here's a simple one. Note that the frame needs an index before the scalars are assigned (assigning scalars to a completely empty DataFrame produces empty columns):

df_2 = pd.DataFrame(index=[0])
df_2['Number of missing values'] = DF['Sales'].isnull().sum()
df_2['Total car sales'] = DF['Sales'].sum()
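If you need the two-row Description/Results layout from the question, a transpose gets there; a small follow-up sketch building on df_2 above:

df2 = (df_2.T
           .reset_index()
           .set_axis(['Description', 'Results'], axis=1))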
I have this dataframe:

d = {'parameters': [{'Year': '2018',
                     'Median Age': 'nan',
                     'Total Non-Family Household Income': 289.0,
                     'Total Household Income': 719.0,
                     'Gini Index of Income Inequality': 0.4121}]}
df_sample = pd.DataFrame(data=d)
df_sample.head()
I want to convert that JSON into pandas columns. How do I do this? Assume I only have the DataFrame, not the dictionary d.
I saw this example:

# which columns have json
json_cols = ['device', 'geoNetwork', 'totals', 'trafficSource']
for column in json_cols:
    c_load = test[column].apply(json.loads)
    c_list = list(c_load)
    c_dat = json.dumps(c_list)
    test = test.join(pd.read_json(c_dat))
    test = test.drop(column, axis=1)
But this does not seem too pythonic...
Use json_normalize:
df_sample = pd.json_normalize(data=d, record_path=['parameters'])
Resulting dataframe:

   Year Median Age  Total Non-Family Household Income  Total Household Income  Gini Index of Income Inequality
0  2018        nan                              289.0                   719.0                           0.4121
Update:

If you already have the dataframe loaded, applying pd.Series should work:
df_sample = df_sample['parameters'].apply(pd.Series) # or df_sample['parameters'].map(json.loads).apply(pd.Series) if data is not already dict
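If the frame has other columns worth keeping, a sketch that expands the dicts and joins the new columns back onto the original frame (assuming 'parameters' already holds dicts):

expanded = df_sample.join(df_sample.pop('parameters').apply(pd.Series))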
My question essentially is:
I have 3 major columns and 4 rows:

The Delta column ALWAYS needs to be formatted as a whole-number percent, e.g. 35%.
Toyota & Honda Sales need to be formatted differently depending on the row:
Spend and Revenue need to be $XXX,XXX, e.g. $100,000.
Sale count needs to be a whole number XXX,XXX, e.g. 500.
Present-Value/Sale always needs to be a percent, e.g. 35%.
Put another way, I have one column with a single formatting regimen, but two others whose formatting varies by row. Any ideas for this?
#This is what I have to start
data = {'Toyota Sales Performance': [500000.0000, 150000.0000, 100.0000, .2500],
        'Honda Sales Performance': [750000.0000, 100000.0000, 200.0000, .3500],
        'Delta': [.25, .35, .50, .75]}
df = pd.DataFrame(data, index=['Total Spend',
                               'Total Revenue',
                               'Total Sale Count',
                               'Present-Value/Sale'])
df
What I would like to see:

data2 = {'Toyota Sales Performance': ['$500,000', '$150,000', 100, '25%'],
         'Honda Sales Performance': ['$750,000', '$100,000', 200, '35%'],
         'Delta': ['25%', '35%', '50%', '75%']}
df2 = pd.DataFrame(data2, index=['Total Spend',
                                 'Total Revenue',
                                 'Total Sale Count',
                                 'Present-Value/Sale'])
df2
You can use apply() to run your own function on every column.
import pandas as pd
data = {
    'Toyota Sales Performance': [500000.0000, 150000.0000, 100.0000, .2500],
    'Honda Sales Performance': [750000.0000, 100000.0000, 200.0000, .3500],
    'Delta': [.25, .35, .50, .75]
}
df = pd.DataFrame(data, index=['Total Spend',
                               'Total Revenue',
                               'Total Sale Count',
                               'Present-Value/Sale'])
# format Delta the same way in every row
#df['Delta'] = df['Delta'].apply(lambda val: f'{val:.0%}')
df['Delta'] = df['Delta'].apply('{:.0%}'.format)

# format the two sales columns row by row, keyed on the index labels
def convert_column(col):
    return pd.Series({
        'Total Spend': "${:,}".format(int(col['Total Spend'])),
        'Total Revenue': "${:,}".format(int(col['Total Revenue'])),
        'Total Sale Count': int(col['Total Sale Count']),
        'Present-Value/Sale': "{:.0%}".format(col['Present-Value/Sale']),
    })

cols = ['Toyota Sales Performance', 'Honda Sales Performance']
df[cols] = df[cols].apply(convert_column, axis=0)
print(df)
Result:

                   Toyota Sales Performance Honda Sales Performance Delta
Total Spend                        $500,000                $750,000   25%
Total Revenue                      $150,000                $100,000   35%
Total Sale Count                        100                     200   50%
Present-Value/Sale                      25%                     35%   75%
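If the numbers need to stay numeric and only the display matters, pandas' Styler can apply the same row-dependent formats without converting anything to strings. A sketch, assuming the raw numeric df from the question (rendered in a notebook; df itself stays unchanged):

cols = ['Toyota Sales Performance', 'Honda Sales Performance']
styled = (df.style
            .format('{:.0%}', subset=['Delta'])
            .format('${:,.0f}', subset=pd.IndexSlice[['Total Spend', 'Total Revenue'], cols])
            .format('{:,.0f}', subset=pd.IndexSlice[['Total Sale Count'], cols])
            .format('{:.0%}', subset=pd.IndexSlice[['Present-Value/Sale'], cols]))
styled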
Hi, I have a dataframe that lists items that I own, along with their selling price.
I also have a variable that defines my current debt. Example:
import pandas as pd
current_debt = 16000
d = {
    'Person': ['John', 'John', 'John', 'John', 'John'],
    'Item': ['Car', 'Bike', 'Computer', 'Phone', 'TV'],
    'Price': [10500, 3300, 2100, 1100, 800],
}
df = pd.DataFrame(data=d)
df
I would like to "pay back" the current_debt starting with the most expensive item and continuing until the debt is paid, and to list the leftover money aligned to the last item sold. I'm hoping the function can include a groupby clause for Person, as sometimes there is more than one name in the list.
My expected output for the debt in the example above would be:
If anyone could help with a function to calculate this, that would be fantastic. I wasn't sure whether I needed to convert the dataframe to a list or whether it could be kept as a dataframe. Thanks very much!
Using a cumsum transformation and np.where to cover your logic for the final price column:
import numpy as np
df = df.sort_values(["Person", "Price"], ascending=False)
df['CumPrice'] = df.groupby("Person")['Price'].transform('cumsum')
df['Diff'] = df['CumPrice'] - current_debt
df['PriceLeft'] = np.where(
    df['Diff'] <= 0,
    0,
    np.where(
        df['Diff'] < df['Price'],
        df['Diff'],
        df['Price']
    )
)
Result:
  Person      Item  Price  CumPrice  Diff  PriceLeft
0   John       Car  10500     10500 -5500          0
1   John      Bike   3300     13800 -2200          0
2   John  Computer   2100     15900  -100          0
3   John     Phone   1100     17000  1000       1000
4   John        TV    800     17800  1800        800
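As a cross-check, the leftover amount itself (the question's "leftover money") can be computed directly per person; a sketch, assuming each person's items are enough to cover the debt:

def leftover(group, debt=current_debt):
    # running total of prices, most expensive item first
    cum = group.sort_values('Price', ascending=False)['Price'].cumsum()
    covered = cum[cum >= debt]
    # surplus at the first item whose running total covers the debt
    return covered.iloc[0] - debt if not covered.empty else None

print(df.groupby('Person').apply(leftover))  # John -> 1000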
I have been trying, to no avail, to create a program which converts CSVs with 20+ categories, where the years and categories are both in the columns, to one where they are split into rows and columns.
How can I do this without having to do it manually for each CSV?
I never studied IT, so my knowledge is quite patchy, and all my attempts so far have ended in large, inefficient code.
Btw, I'm doing this for my bachelor thesis, not for investing or anything of that sort.
Example of what the data currently looks like:
df = pd.DataFrame({
    'Total Revenue 2006': ['786'],
    'Total Revenue 2007': ['643'],
    'Total Revenue 2008': ['1200'],
    'Total Revenue 2009': ['1456'],
    'Total Revenue 2010': ['1675'],
    'Total Employees 2006': ['42'],
    'Total Employees 2007': ['55'],
    'Total Employees 2008': ['65'],
    'Total Employees 2009': ['45'],
    'Total Employees 2010': ['60'],
})
I want to split the categories and years so that the columns are just years and the rows are just categories.
Here you go:
df = df.transpose()
df["temp"] = df.index
df["name"] = df["temp"].map(lambda x: x.rsplit(" ", 1)[0])
df["year"] = df["temp"].map(lambda x: x.rsplit(" ", 1)[1])
df.drop(columns="temp", inplace=True)
result = df.pivot(index='name', columns='year', values=0)
See https://pandas.pydata.org/pandas-docs/stable/user_guide/reshaping.html for more.
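For instance, the MultiIndex tools covered there give a compact variant of the same reshape; a sketch, assuming every column name ends in a year:

tidy = df.copy()
tidy.columns = pd.MultiIndex.from_tuples(
    [tuple(c.rsplit(" ", 1)) for c in tidy.columns]
)  # ('Total Revenue', '2006'), ('Total Revenue', '2007'), ...
result = tidy.stack(level=0).droplevel(0)  # rows: metrics, columns: years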
You can try this one as well, although it's a little longer:
transposed_df = df.transpose()
transposed_df.index.name = "Type"
transposed_df.columns = ["Value"]
transposed_df = transposed_df.reset_index()
transposed_df["Year"] = transposed_df.Type.apply(lambda x: x.rsplit(" ", 1)[-1])
transposed_df["Metric"] = transposed_df.Type.apply(lambda x: x.rsplit(" ", 1)[-0])
revenue_df = transposed_df[transposed_df.Metric=="Total Revenue"].set_index("Year")
employee_df = transposed_df[transposed_df.Metric=="Total Employees"].set_index("Year")
revenue_df.drop(["Type", "Metric"], inplace=True, axis=1)
revenue_df.columns = ["Revenue"]
employee_df.drop(["Type", "Metric"], inplace=True, axis=1)
employee_df.columns = ["TotalEmployees"]
combined_df = pd.concat([employee_df, revenue_df], axis=1)
combined_df.head()
      TotalEmployees Revenue
Year
2006              42     786
2007              55     643
2008              65    1200
2009              45    1456
2010              60    1675
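With many categories, pd.wide_to_long can also do the whole reshape in one call; a sketch, assuming every column follows the '<metric> <year>' pattern (the stub names are derived from the columns, so it scales to 20+ categories):

stubs = sorted({c.rsplit(' ', 1)[0] for c in df.columns})
tidy = (pd.wide_to_long(df.reset_index(), stubnames=stubs,
                        i='index', j='Year', sep=' ')
          .droplevel('index'))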
My data frame (DF) looks like this:

   Customer_number  Store_number  year  month last_buying_date1  amount
0                1            20  2014     10        2015-10-07     100
1                1            20  2014     10        2015-10-09     200
2                2            20  2014     10        2015-10-20     100
3                2            10  2014     10        2015-10-13     500
and I want to get an output like this:

   year  month  sum_purchase  count_purchases  distinct_customers
0  2014     10           900                4                   3
How do I get an output like this using agg and groupby? I am currently using a two-step groupby but am struggling to get the distinct customers. Here's my approach:
#### Step 1 - Aggregating everything at the Customer_number, Store_number level
aggregations = {
    'amount': 'sum',
    'last_buying_date1': 'count',
}
grouped_at_Cust = DF.groupby(['Customer_number', 'Store_number', 'month', 'year']).agg(aggregations).reset_index()
grouped_at_Cust.columns = ['customer_number', 'store_number', 'month', 'year', 'total_purchase', 'num_purchase']
#### Step 2 - Aggregating at the year, month level
aggregations = {
    'total_purchase': 'sum',
    'num_purchase': 'sum',
    size   # <- this is the part I can't work out
}
Monthly_customers = grouped_at_Cust.groupby(['year', 'month']).agg(aggregations).reset_index()
Monthly_customers.columns = ['year', 'month', 'sum_purchase', 'count_purchase', 'distinct_customers']

My struggle is in the 2nd step. How do I include size in the 2nd aggregation step?
You could use groupby.agg, supplying the function nunique to return the number of unique Customer_number values in each group.
df_grp = df.groupby(['year', 'month'], as_index=False) \
           .agg({'amount': ['sum', 'count'], 'Customer_number': ['nunique']})
# flatten the MultiIndex columns; strip the trailing '_' left on the group keys
df_grp.columns = ['_'.join(col).strip('_') for col in df_grp.columns.values]
df_grp
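On pandas 0.25+, named aggregation produces the flat column names directly, with no flattening step; a sketch using the question's column names:

df_grp = (df.groupby(['year', 'month'], as_index=False)
            .agg(sum_purchase=('amount', 'sum'),
                 count_purchases=('amount', 'count'),
                 distinct_customers=('Customer_number', 'nunique')))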
In case you are trying to group them differently (omitting certain columns) when performing the groupby operation:
df_grp_1 = df.groupby(['year', 'month']).agg({'amount': ['sum', 'count']})
df_grp_2 = df.groupby(['Store_number', 'month', 'year'])['Customer_number'].agg('nunique')
Take the second level of the MultiIndex columns, which contains the aggregation performed:
df_grp_1.columns = df_grp_1.columns.get_level_values(1)
Merge them back on the intersection of the columns used to group them:
df_grp = df_grp_1.reset_index().merge(
    df_grp_2.reset_index().drop(['Store_number'], axis=1),
    on=['year', 'month'], how='outer')
Rename the columns to new ones:
d = {'sum': 'sum_purchase', 'count': 'count_purchase', 'nunique': 'distinct_customers'}
df_grp.columns = [d.get(x, x) for x in df_grp.columns]
df_grp