Python pandas transpose data issue

I am having trouble figuring out how to properly transpose data in a DataFrame in order to calculate differences between actuals and targets. Doing something like df['difference'] = df['Revenue'] - df['Target'] is straightforward, so this is more a question of desired output formatting.
Assume you have a DataFrame with the following columns and values (reproduced in the setup below).
The desired outputs would be a roll-up from both sources and a comparison at the Source level. Assume there are 30+ additional data points similar to Revenue, Users, and New Users.
Any and all suggestions are very much appreciated.

Setup
import pandas as pd

df = pd.DataFrame([
    ['2016-06-01', 15000, 10000, 1000, 900, 100, 50, 'US'],
    ['2016-06-01', 16000, 12000, 1500, 1200, 150, 100, 'UK']
], columns=['Date', 'Revenue', 'Target', 'Users', 'Target', 'New Users', 'Target', 'Source'])
df
Your columns are not unique. I'll start with moving Source and Date into the index and renaming the columns.
df1 = df.copy()
df1.Date = pd.to_datetime(df1.Date)
df1 = df1.set_index(['Date', 'Source'])
# rebuild the duplicated column names as a two-level MultiIndex: metric x (Actual, Target)
idx = pd.MultiIndex.from_product([['Revenue', 'Users', 'New Users'], ['Actual', 'Target']])
df1.columns = idx
df1
Then move the first level of the columns into the index:
df1 = df1.stack(0)
df1
From here, I'm going to sum across sources for each of ['Revenue', 'Users', 'New Users'] and assign the result to df2.
df2 = df1.groupby(level=-1).sum()
df2
Finally:
df2['Difference'] = df2.Actual - df2.Target
df1['Difference'] = df1.Actual - df1.Target
df2
To view everything side by side again, pivot Date, Source, and the Actual/Target/Difference level back into columns:
df1.stack().unstack([0, 1, -1])
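As a quick sanity check, the numbers below are just hand sums of the two setup rows, so any deviation means a step above went wrong:

# Revenue: 15000 + 16000 actual vs 10000 + 12000 target
assert df2.loc['Revenue', 'Actual'] == 31000
assert df2.loc['Revenue', 'Target'] == 22000
# New Users: (100 + 150) - (50 + 100)
assert df2.loc['New Users', 'Difference'] == 100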

Related

Add a calculated column to a pivot table in pandas

Hi, I am trying to add new columns to a multi-indexed pandas pivot table to do a COUNTIF-style calculation (similar to Excel), depending on whether a level of the index contains a specific string. This is the sample data:
df = pd.DataFrame({'City': ['Houston', 'Austin', 'Hoover', 'Adak', 'Denver', 'Houston', 'Adak', 'Denver'],
                   'State': ['Texas', 'Texas', 'Alabama', 'Alaska', 'Colorado', 'Texas', 'Alaska', 'Colorado'],
                   'Name': ['Aria', 'Penelope', 'Niko', 'Susan', 'Aria', 'Niko', 'Aria', 'Niko'],
                   'Unit': ['Sales', 'Marketing', 'Operations', 'Sales', 'Operations', 'Operations', 'Sales', 'Operations'],
                   'Assigned': ['Yes', 'No', 'Maybe', 'No', 'Yes', 'Yes', 'Yes', 'Yes']},
                  columns=['City', 'State', 'Name', 'Unit', 'Assigned'])

pivot = df.pivot_table(index=['City', 'State'], columns=['Name', 'Unit'], values=['Assigned'],
                       aggfunc=lambda x: ', '.join(set(x)), fill_value='')
And this is the desired output (shown in a screenshot). Thanks in advance!
Try:
temp = pivot[('Assigned', 'Aria', 'Sales')].str.len() > 0
pivot['new col'] = temp.astype(int)
The result:
Based on your edit:
import numpy as np

# select every 'Sales' column (level 2 of the column MultiIndex) and count
# the non-empty cells in each row
temp = pivot.xs('Sales', level=2, drop_level=False, axis=1).apply(lambda x: np.sum([1 if y != '' else 0 for y in x]), axis=1)
pivot[('', 'total sales', 'count how many...')] = temp
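A possibly simpler equivalent (same assumption about the three-level ('Assigned', Name, Unit) column layout) is to compare against the empty string and sum the resulting booleans:

# count non-empty 'Sales' cells per row
temp = pivot.xs('Sales', level=2, drop_level=False, axis=1).ne('').sum(axis=1)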

Not found in axis: I am failing to drop columns from my pandas DataFrame

So basically I am trying to drop two columns from my DataFrame. One of them works and one does not.
Here is my Python code:
import pandas as pd
df = pd.read_csv("testfile2.csv")
df.columns = df.columns.str.strip()
df = df.loc[:, ~df.columns.str.contains('^Unnamed')]
dummies = pd.get_dummies(df.town)
merged = pd.concat([df, dummies], axis='columns')
final = merged.drop(['town', 'robbinsville'], axis='columns')
print(final)
When I try to print at the end, I run into this error:
KeyError: "['robbinsville'] not found in axis"
And this is true for all of the keys in the CSV.
Here is the csv:
town, area, price,
'monroe township', 2600, 550000,
'monroe township', 3000, 560000,
'monroe township', 3200, 610000,
'monroe township', 3600, 680000,
'monroe township', 4000, 725000,
'west_windsor', 2600, 585000,
'west_windsor', 2800, 615000,
'west_windsor', 3300, 650000,
'west_windsor', 3600, 710000,
'robbinsville', 2600, 575000,
'robbinsville', 2000, 600000,
'robbinsville', 3100, 620000,
'robbinsville', 3600, 695000,
I don't think it's a typo, so what am I doing wrong?
merged will have the extra columns from the concat, and I would like to remove both.
I am able to remove town, but not any of the newly created columns.
Your column name actually has single-quotes in it. Just add them (escaped, of course) to robbinsville:
final = merged.drop(['town', '\'robbinsville\''], axis=1)
However, a better solution would be to remove the single-quotes from the data right after loading it, like this:
df = pd.read_csv("testfile2.csv")
df['town'] = df['town'].str.strip().str.strip("'")  # drop surrounding whitespace, then the quotes
...
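Alternatively, assuming the file really is padded with spaces and quoted with single quotes as shown, read_csv can handle both at load time:

df = pd.read_csv("testfile2.csv", skipinitialspace=True, quotechar="'")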

Using iterrows() to fill text into a column

I want to use iterrows() to replace predetermined substrings (Name and Age) in the column 'Unique Code' with the values coming from the other two columns, 'Name' and 'Age'. However, while the loop prints the correct output, the 'Unique Code' values do not update. Why?
import pandas as pd

lst = [['tom', 25, 'EVT-PS-Name-Age'], ['krish', 30, 'EVT-PS-Name-Age'],
       ['nick', 26, 'EVT-PS-Name-Age'], ['juli', 22, 'EVT-PS-Name-Age']]
df = pd.DataFrame(lst, columns=['Name', 'Age', 'Unique Code'])

for index, row in df.iterrows():
    row['Unique Code'] = str(row['Unique Code'])
    row['Age'] = str(row['Age'])
    row['Unique Code'] = row['Unique Code'].replace('Name', row['Name'])
    row['Unique Code'] = row['Unique Code'].replace('Age', row['Age'])
    print(row['Unique Code'])

df.head()
This is my intended outcome - thanks!
lst = [['tom', 25, 'EVT-PS-tom-25'], ['krish', 30, 'EVT-PS-krish-30'],
       ['nick', 26, 'EVT-PS-nick-26'], ['juli', 22, 'EVT-PS-juli-22']]
df = pd.DataFrame(lst, columns=['Name', 'Age', 'Unique Code'])
If you want to keep the loop/iterrows() approach, you can write the value back to the DataFrame at the end of your for loop:
df.loc[index, 'Unique Code'] = row['Unique Code']
As for why your version does not work: the row object yielded by iterrows() is a copy, so modifying it never writes back to the DataFrame.
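For completeness, here is the same result without any loop, via plain string replacement over the three columns (a sketch of the loop-free alternative):

df['Unique Code'] = [
    code.replace('Name', name).replace('Age', str(age))
    for name, age, code in zip(df['Name'], df['Age'], df['Unique Code'])
]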

How would I create the missing row values in Python for two in-between dates?

The idea is that I have a DataFrame that looks something like this:
pd.DataFrame([['1/1/2020', '0:00', 807, 1600], ['3/1/2020', '1:00', 4000, 8000]], columns=['Date', 'Hour', 'X', 'Y'])
I have simplified the code because I am only interested in the rows in between.
Date is in dd/mm/yyyy format.
Is there some simple way to create the missing rows, that is, 2/1/2020, with NaN in X and Y, so it would end up looking like this?
I have to do this with a much bigger data frame, but for simplicity I used a small portion of it. The only method I have thought of is creating the rows and adding NaN to them myself, but I want to believe there is an easier way. This is the link to the data: https://drive.google.com/file/d/1NrDBkqfMO2rA631aA4FSmM2vJYDZCFbF/view?usp=sharing
Update using @ALollz's suggestion:
df = pd.DataFrame([['1/1/2020', '0:00', 807, 1600], ['3/1/2020', '1:00', 4000, 8000], ['3/1/2020', '2:00', 5000, 9000], ['3/1/2020', '5:00', 5000, 9000]], columns=['Date', 'Hour', 'X', 'Y'])
# add column with datetime from Date and Hour (dayfirst, since Date is dd/mm/yyyy)
df['dateHour'] = pd.to_datetime(df['Date'] + ' ' + df['Hour'], dayfirst=True)
df = df.set_index('dateHour').asfreq('H')
# split date from time and convert back to string (dateHour is the index now)
df['Date'] = [d.date().strftime('%d/%m/%Y') for d in df.index]
df['Hour'] = [d.time().strftime('%H:%M') for d in df.index]
Don't know if there is a more elegant way, but you can do it like this:
df = pd.DataFrame([['1/1/2020', '0:00', 807, 1600], ['3/1/2020', '1:00', 4000, 8000], ['3/1/2020', '2:00', 5000, 9000], ['3/1/2020', '5:00', 5000, 9000]], columns=['Date', 'Hour', 'X', 'Y'])
# add datetime column (dayfirst, since Date is dd/mm/yyyy)
df['dateHour'] = pd.to_datetime(df['Date'] + ' ' + df['Hour'], dayfirst=True)
# create a new dataframe with all the possible hourly rows
df2 = pd.DataFrame({'dateHour': pd.date_range(df.dateHour.min(), df.dateHour.max(), freq='H')})
# combine the dataframes
df3 = df2.merge(df[['dateHour', 'X', 'Y']], how='left', on='dateHour')
# split date from time and convert back to strings in the original dd/mm/yyyy format
df3['Date'] = [d.date().strftime('%d/%m/%Y') for d in df3['dateHour']]
df3['Hour'] = [d.time().strftime('%H:%M') for d in df3['dateHour']]
# select and sort columns
df4 = df3[['Date', 'Hour', 'X', 'Y']]
df4
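As a quick check with the four sample rows above (and day-first parsing):

# df4 should run hourly from 01/01/2020 00:00 through 03/01/2020 05:00,
# i.e. 54 rows, with NaN in X and Y for every filled-in hour
assert len(df4) == 54
assert df4['X'].isna().sum() == 54 - 4  # only the 4 original rows carry data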
Firstly, preprocess your Excel file:
import pandas as pd

df = pd.read_excel('Trial.xlsx', engine='openpyxl')
df = df.drop(columns=['Unnamed: 0'])
df['Date'] = pd.to_datetime(df['Date'], yearfirst=True, format='%y-%d-%m').dt.strftime('%y-%d-%m')
df['Date'] = pd.to_datetime(df['Date'], format='%y-%m-%d')
Now create a date range with the date_range() method:
data = pd.date_range('2020-01-02', '2020-01-03', freq='H').astype(str)
Then create a new dataframe from that date range:
datadf = data.str.split(' ', expand=True).to_frame().set_index([0, 1]).reset_index().rename(columns={0: 'Date', 1: 'Hour'})
Now use the concat() method to combine both dataframes:
result = pd.concat((df, datadf), ignore_index=True)
Now convert your 'Date' column to datetime with the to_datetime() method:
result['Date'] = pd.to_datetime(result['Date'])
Finally, sort by the 'Date' column with the sort_values() method:
result = result.sort_values('Date', ignore_index=True)
Now if you print result you will get your desired output.

How to do many-to-many matching with a condition in Python?

Please help me tackle many-to-many matching with a condition.
import pandas as pd

company1 = {'Product': ['Pro_1', 'Pro_3', 'Pro_3', 'Pro_5'],
            'product_date': ['2013-05-09', '2012-12-02', '2013-10-25', '2016-08-25']}
df = pd.DataFrame(company1, columns=['Product', 'product_date'])
print(df)

company2 = {'Product': ['Pro_1', 'Pro_2', 'Pro_2', 'Pro_3', 'Pro_3', 'Pro_3', 'Pro_3', 'Pro_5', 'Pro_5'],
            'Start': ['2013-01-01', '2012-01-02', '2013-01-02', '2014-01-01', '2011-01-02', '2012-01-02', '2013-01-02', '2014-01-25', '2017-01-26'],
            'end': ['2014-01-01', '2013-01-01', '2013-12-31', '2014-12-01', '2012-01-01', '2013-01-01', '2013-12-31', '2017-01-25', '2018-01-20'],
            'inventory': [20, 30, 50, 30, 40, 10, 20, 30, 20]}
df2 = pd.DataFrame(company2, columns=['Product', 'Start', 'end', 'inventory'])
print(df2)

# desired result
result = {'Product': ['Pro_1', 'Pro_3', 'Pro_3', 'Pro_5'],
          'inventory': [20, 10, 20, 30]}
df3 = pd.DataFrame(result, columns=['Product', 'inventory'])
print(df3)
I want to match df and df2 on 'Product', with the condition that 'product_date' falls between the 'Start' and 'end' dates, and then return the 'inventory' from df2 (df3 above shows the desired result).
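One straightforward approach (a sketch; note that a plain merge materializes every matching Product pair before filtering, which can get large on big frames):

# parse the date columns so they can be compared
df['product_date'] = pd.to_datetime(df['product_date'])
df2['Start'] = pd.to_datetime(df2['Start'])
df2['end'] = pd.to_datetime(df2['end'])

# many-to-many join on Product, then keep rows whose
# product_date falls inside the [Start, end] window
merged = df.merge(df2, on='Product')
mask = merged['product_date'].between(merged['Start'], merged['end'])
out = merged.loc[mask, ['Product', 'inventory']].reset_index(drop=True)
print(out)  # matches df3: inventories 20, 10, 20, 30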
