I am having trouble figuring out how to properly transpose data in a DataFrame in order to calculate differences between actuals and targets. Doing something like df['difference'] = df['Revenue'] - df['Target'] is straightforward, so this is more a question of desired output formatting.
Assume you have a DataFrame with the following columns and values:
Desired outputs would be a roll-up from both sources and a comparison at the Source level. Assume there are 30+ additional data points similar to revenue, users, and new users:
and
Any and all suggestions are very much appreciated.
Setup
import pandas as pd

df = pd.DataFrame([
['2016-06-01', 15000, 10000, 1000, 900, 100, 50, 'US'],
['2016-06-01', 16000, 12000, 1500, 1200, 150, 100, 'UK']
], columns=['Date', 'Revenue', 'Target', 'Users', 'Target', 'New Users', 'Target', 'Source'])
df
Your columns are not unique. I'll start by moving Source and Date into the index and renaming the columns.
df1 = df.copy()
df1.Date = pd.to_datetime(df1.Date)
df1 = df1.set_index(['Date', 'Source'])
idx = pd.MultiIndex.from_product([['Revenue', 'Users', 'New Users'], ['Actual', 'Target']])
df1.columns = idx
df1
Then move the first level of the columns into the index:
df1 = df1.stack(0)
df1
From here, I'm going to sum over Date and Source for each of 'Revenue', 'Users', and 'New Users', and assign the result to df2.
df2 = df1.groupby(level=-1).sum()
df2
Finally, compute the difference for both frames:
df2['Difference'] = df2.Actual - df2.Target
df1['Difference'] = df1.Actual - df1.Target
df2
And for a side-by-side comparison at the Source level:
df1.stack().unstack([0, 1, -1])
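Since there are 30+ metric pairs, the same steps can be wrapped in a helper; a minimal sketch (it assumes the raw frame lists each metric's actual column immediately followed by its 'Target' column, in the order given in metrics):
import pandas as pd

def roll_up(df, metrics):
    # reshape: Date/Source into the index, (metric, Actual/Target) columns
    out = df.copy()
    out['Date'] = pd.to_datetime(out['Date'])
    out = out.set_index(['Date', 'Source'])
    out.columns = pd.MultiIndex.from_product([metrics, ['Actual', 'Target']])
    out = out.stack(0)
    out['Difference'] = out.Actual - out.Target
    # per-source detail, plus the roll-up summed over Date and Source
    return out, out.groupby(level=-1).sum()

df1, df2 = roll_up(df, ['Revenue', 'Users', 'New Users'])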
Related
Hi, I am trying to add new columns to a multi-indexed pandas pivot table to do a COUNTIF-style calculation (similar to Excel), depending on whether a level of the index contains a specific string. This is the sample data:
import pandas as pd

df = pd.DataFrame({'City': ['Houston', 'Austin', 'Hoover', 'Adak', 'Denver', 'Houston', 'Adak', 'Denver'],
                   'State': ['Texas', 'Texas', 'Alabama', 'Alaska', 'Colorado', 'Texas', 'Alaska', 'Colorado'],
                   'Name': ['Aria', 'Penelope', 'Niko', 'Susan', 'Aria', 'Niko', 'Aria', 'Niko'],
                   'Unit': ['Sales', 'Marketing', 'Operations', 'Sales', 'Operations', 'Operations', 'Sales', 'Operations'],
                   'Assigned': ['Yes', 'No', 'Maybe', 'No', 'Yes', 'Yes', 'Yes', 'Yes']},
                  columns=['City', 'State', 'Name', 'Unit', 'Assigned'])
pivot = df.pivot_table(index=['City', 'State'], columns=['Name', 'Unit'], values=['Assigned'], aggfunc=lambda x: ', '.join(set(x)), fill_value='')
and this is the desired output (in the screenshot). Thanks in advance!
try:
temp = pivot[('Assigned', 'Aria', 'Sales')].str.len() > 0
pivot['new col'] = temp.astype(int)
the result:
Based on your edit:
import numpy as np
temp = pivot.xs('Sales', level=2, drop_level=False, axis=1).apply(lambda x: np.sum([1 if y != '' else 0 for y in x]), axis=1)
pivot[('', 'total sales', 'count how many...')] = temp
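For what it's worth, a slightly more compact route to the same count (a sketch; it assumes, as above, that empty cells hold '' from fill_value, and the tuple column name is just illustrative):
# count non-empty 'Sales' cells per row
pivot[('', 'total sales', 'count')] = (pivot.xs('Sales', level=2, axis=1) != '').sum(axis=1)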
So basically I am trying to drop two columns from my dataframe. One of them works and one does not.
Here is my Python code:
import pandas as pd
df = pd.read_csv("testfile2.csv")
df.columns = df.columns.str.strip()
df = df.loc[:, ~df.columns.str.contains('^Unnamed')]
dummies = pd.get_dummies(df.town)
merged = pd.concat([df, dummies], axis='columns')
final = merged.drop(['town', 'robbinsville'], axis='columns')
print(final)
When I try to print at the end, I run into this error:
KeyError: "['robbinsville'] not found in axis"
And this is true for all of the keys in the CSV.
Here is the CSV:
town, area, price,
'monroe township', 2600, 550000,
'monroe township', 3000, 560000,
'monroe township', 3200, 610000,
'monroe township', 3600, 680000,
'monroe township', 4000, 725000,
'west_windsor', 2600, 585000,
'west_windsor', 2800, 615000,
'west_windsor', 3300, 650000,
'west_windsor', 3600, 710000,
'robbinsville', 2600, 575000,
'robbinsville', 2000, 600000,
'robbinsville', 3100, 620000,
'robbinsville', 3600, 695000,
I don't think it's a typo; what am I doing wrong?
merged will have the extra columns from the concat, and I would like to remove both. I am able to remove town, but not any of the newly created columns.
Your column name actually has single-quotes in it. Just add them (escaped, of course) to robbinsville:
final = merged.drop(['town', '\'robbinsville\''], axis=1)
However, a better solution would be to remove the single-quotes from the data right after loading it, like this:
df = pd.read_csv("testfile2.csv")
df['town'] = df['town'].str.strip("'")
...
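Alternatively, some of the cleanup can happen at read time; a sketch (skipinitialspace is a standard read_csv parameter that drops the spaces after each comma, but the single-quotes around town still need stripping):
import pandas as pd

df = pd.read_csv("testfile2.csv", skipinitialspace=True)
df['town'] = df['town'].str.strip("'")                 # drop the literal quotes
df = df.loc[:, ~df.columns.str.contains('^Unnamed')]   # trailing commas create an Unnamed column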
I want to use iterrows() to fill predetermined substrings ('Name' and 'Age') in the column 'Unique Code' with the values from the other two columns, 'Name' and 'Age'. However, while the loop prints the correct output, the 'Unique Code' values do not update.
import pandas as pd

lst = [['tom', 25, 'EVT-PS-Name-Age'], ['krish', 30, 'EVT-PS-Name-Age'],
       ['nick', 26, 'EVT-PS-Name-Age'], ['juli', 22, 'EVT-PS-Name-Age']]
df = pd.DataFrame(lst, columns=['Name', 'Age', 'Unique Code'])
for index, row in df.iterrows():
    row['Unique Code'] = str(row['Unique Code'])
    row['Age'] = str(row['Age'])
    row['Unique Code'] = row['Unique Code'].replace('Name', row['Name'])
    row['Unique Code'] = row['Unique Code'].replace('Age', row['Age'])
    print(row['Unique Code'])

df.head()
This is my intended outcome - thanks!
lst = [['tom', 25, 'EVT-PS-tom-25' ], ['krish', 30, 'EVT-PS-krish-30'],
['nick', 26, 'EVT-PS-nick-26'], ['juli', 22, 'EVT-PS-juli-22']]
df = pd.DataFrame(lst, columns =['Name', 'Age', 'Unique Code'])
If you want to keep the loop/iterrows approach, you can write the value back to the DataFrame at the end of each loop iteration:
df.loc[index, 'Unique Code'] = row['Unique Code']
As for why your version does not work: the row variable yielded by iterrows() is a temporary copy, so modifying it does not affect the DataFrame's rows.
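For what it's worth, the iterrows() loop can be replaced by building the column in one pass; a minimal sketch (it assumes every code follows the 'EVT-PS-Name-Age' template above):
# substitute each row's Name and Age into its template string
df['Unique Code'] = [
    code.replace('Name', name).replace('Age', str(age))
    for name, age, code in zip(df['Name'], df['Age'], df['Unique Code'])
]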
The idea is that I have a DataFrame that looks something like this:
In [1]: pd.DataFrame([['1/1/2020', '0:00', 807, 1600], ['3/1/2020', '1:00', 4000, 8000],], columns=['Date', 'Hour', 'X', 'Y'])
I have simplified the code because I am only interested in the values in between.
Date is in (dd/mm/yyyy) format.
Is there some simple way to create the missing values (that is, 2/1/2020, hour by hour) and fill X and Y with NaN, so it would end up looking like this:
I have to do this with a much bigger data frame, but for simplicity I used a small portion of it here. The only method I have thought of is creating the rows and adding NaN to them by hand, but I want to believe there is a much easier way. This is the link to the data: https://drive.google.com/file/d/1NrDBkqfMO2rA631aA4FSmM2vJYDZCFbF/view?usp=sharing
Update using #ALollz suggestion:
df = pd.DataFrame([['1/1/2020', '0:00', 807, 1600], ['3/1/2020', '1:00', 4000, 8000],['3/1/2020', '2:00', 5000, 9000], ['3/1/2020', '5:00', 5000, 9000]], columns=['Date', 'Hour', 'X', 'Y'])
# add a datetime column built from Date and Hour (Date is dd/mm/yyyy, hence dayfirst=True)
df['dateHour'] = pd.to_datetime(df['Date'] + ' ' + df['Hour'], dayfirst=True)
df = df.set_index('dateHour').asfreq('H')
# split date from time and convert back to string (dateHour is now the index)
df['Date'] = [d.date().strftime('%d/%m/%Y') for d in df.index]
df['Hour'] = [d.time().strftime('%H:%M') for d in df.index]
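If you then want the flat layout back, the helper index can be dropped afterwards (a sketch):
df = df.reset_index(drop=True)  # discard the dateHour index, keep the Date/Hour columns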
Don't know if there is a more elegant way, but you can do it like this:
df = pd.DataFrame([['1/1/2020', '0:00', 807, 1600], ['3/1/2020', '1:00', 4000, 8000],['3/1/2020', '2:00', 5000, 9000], ['3/1/2020', '5:00', 5000, 9000]], columns=['Date', 'Hour', 'X', 'Y'])
# add datetime column
df['dateHour'] = pd.to_datetime(df['Date'] + ' ' + df['Hour'], dayfirst=True)  # Date is dd/mm/yyyy
# create new dataframe with all the possible rows
df2 = pd.DataFrame({'dateHour':pd.date_range(df.dateHour.min(), df.dateHour.max(), freq='H')})
# combine the dataframes
df3 = df2.merge(df[['dateHour', 'X', 'Y']], how='left', on='dateHour')
# split date from time and convert back to string
df3['Date'] = [d.date().strftime('%d/%m/%Y') for d in df3['dateHour']]
df3['Hour'] = [d.time().strftime('%H:%M') for d in df3['dateHour']]
# select and sort columns
df4 = df3[['Date', 'Hour', 'X', 'Y']]
df4
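A more compact route to the same filled frame (a sketch; it reindexes against the full hourly range instead of merging, assuming the dateHour values are unique):
full = pd.date_range(df.dateHour.min(), df.dateHour.max(), freq='H')
df3 = df.set_index('dateHour').reindex(full).rename_axis('dateHour').reset_index()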
First, preprocess your Excel file:
import pandas as pd
df = pd.read_excel('Trial.xlsx', engine='openpyxl')
df = df.drop(columns=['Unnamed: 0'])
df['Date'] = pd.to_datetime(df['Date'], yearfirst=True, format='%y-%d-%m').dt.strftime('%y-%d-%m')
df['Date'] = pd.to_datetime(df['Date'], format='%y-%m-%d')
Now create a date range using the date_range() method:
data = pd.date_range('2020-01-02', '2020-01-03', freq='H').astype(str)
Then create a new dataframe from that date range:
datadf = data.str.split(' ', expand=True).to_frame().set_index([0, 1]).reset_index().rename(columns={0: 'Date', 1: 'Hour'})
Now combine both dataframes with the concat() method:
result = pd.concat((df, datadf), ignore_index=True)
Now convert the 'Date' column to datetime using the to_datetime() method:
result['Date'] = pd.to_datetime(result['Date'])
Finally, sort by the 'Date' column using the sort_values() method:
result = result.sort_values('Date', ignore_index=True)
Now if you print result, you will get your desired output.
Please help me with how to tackle this many-to-many matching with a condition.
import pandas as pd
company1 = {'Product': ['Pro_1','Pro_3','Pro_3','Pro_5'],
'product_date': ['2013-05-09','2012-12-02','2013-10-25','2016-08-25']}
df = pd.DataFrame(company1, columns = ['Product', 'product_date'])
print (df)
company2 = {'Product': ['Pro_1','Pro_2','Pro_2','Pro_3','Pro_3','Pro_3','Pro_3','Pro_5','Pro_5'],
'Start': ['2013-01-01','2012-01-02','2013-01-02','2014-01-01','2011-01-02','2012-01-02','2013-01-02','2014-01-25', '2017-01-26'],
'end': ['2014-01-01','2013-01-01','2013-12-31','2014-12-01','2012-01-01','2013-01-01','2013-12-31','2017-01-25', '2018-01-20'],
'inventory': [20,30,50,30,40,10,20,30,20]}
df2 = pd.DataFrame(company2, columns = ['Product', 'Start','end','inventory'])
print (df2)
result = {'Product': ['Pro_1','Pro_3','Pro_3','Pro_5'],
'inventory': [20,10,20,30]}
df3 = pd.DataFrame(result, columns = ['Product', 'inventory'])
print(df3)
I want to match df and df2 on 'Product', with the condition that 'product_date' falls between the 'Start' and 'end' dates, and then return the matching 'inventory' from df2 (the desired result is df3 above).
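One way to approach this is a many-to-many merge followed by an interval filter; a minimal sketch (it assumes the date strings parse with pd.to_datetime and that each product_date falls into at most one Start/end interval per product):
# parse the date columns so they compare as dates, not strings
df['product_date'] = pd.to_datetime(df['product_date'])
df2['Start'] = pd.to_datetime(df2['Start'])
df2['end'] = pd.to_datetime(df2['end'])

# many-to-many merge on Product, then keep rows where product_date
# lies inside the [Start, end] interval (inclusive on both ends)
merged = df.merge(df2, on='Product')
mask = merged['product_date'].between(merged['Start'], merged['end'])
df3 = merged.loc[mask, ['Product', 'inventory']].reset_index(drop=True)
print(df3)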