I am trying to create a pivot table from a DataFrame using pandas. Below is a view of my DataFrame:
category,date,type1,type2,total
PROD_A,2018-10-01,2,2,4
PROD_A,2018-10-02,2,0,2
PROD_B,2018-10-01,0,0,0
PROD_A,2018-10-03,0,0,0
I am trying to create a pivot and save the output to an Excel file:
Summary = pd.pivot_table(df, values=['total'], index=['category'], columns='date')
Summary.to_excel(writer, sheet_name='Summary')
I get the following error:
KeyError: 'total'
Could anyone guide me on where I am going wrong with this? Thanks.
Update with the column dtypes:
category object
date object
type1 int64
type2 int64
total float64
dtype: object
Output of df.head():
category,date,type1,type2,total
PROD_A,2018-10-01,2,2,4
PROD_A,2018-10-02,2,0,2
PROD_B,2018-10-01,0,0,0
PROD_A,2018-10-03,0,0,0
PROD_B,2018-10-03,2,3,5
The problem is ['total']: passing a list creates a MultiIndex in the columns:
Summary = pd.pivot_table(df, values=['total'], index=['category'], columns='date')
print (Summary)
total
date 2018-10-01 2018-10-02 2018-10-03
category
PROD_A 4.0 2.0 0.0
PROD_B 0.0 NaN 5.0
The solution is to remove the brackets:
Summary = pd.pivot_table(df, values='total', index='category', columns='date')
print (Summary)
date 2018-10-01 2018-10-02 2018-10-03
category
PROD_A 4.0 2.0 0.0
PROD_B 0.0 NaN 5.0
Finally, to discard the index entirely, chain reset_index(drop=True):
Summary = (pd.pivot_table(df, values='total', index='category', columns='date')
.reset_index(drop=True))
print (Summary)
date 2018-10-01 2018-10-02 2018-10-03
0 4.0 2.0 0.0
1 0.0 NaN 5.0
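For the Excel part of the question, writer has to exist before to_excel is called. A minimal sketch, assuming an output file name of summary.xlsx:
import pandas as pd

Summary = pd.pivot_table(df, values='total', index='category', columns='date')
# ExcelWriter manages the workbook; the context manager saves it on exit
with pd.ExcelWriter('summary.xlsx') as writer:
    Summary.to_excel(writer, sheet_name='Summary')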
I would like to compare one column of a df with a column in a different df. The columns are a timestamp and a holiday date. I'd like to create a dummy variable: if a timestamp in df1 matches a date in df2, the value is 1, else 0.
For example, df1:
timestamp weight(kg)
0 2016-03-04 4.0
1 2015-02-15 5.0
2 2019-05-04 5.0
3 2018-12-25 29.0
4 2020-01-01 58.0
For example, df2:
holiday
0 2016-12-25
1 2017-01-01
2 2019-05-01
3 2018-12-26
4 2020-05-26
Ideal output:
timestamp weight(kg) holiday
0 2016-03-04 4.0 0
1 2015-02-15 5.0 0
2 2019-05-04 5.0 0
3 2018-12-25 29.0 1
4 2020-01-01 58.0 1
I have tried writing a function, but it takes very long to run:
def add_holiday(x):
    hols_df = hols.apply(lambda y: y['holiday_dt'] if
                         x['timestamp'] == y['holiday_dt']
                         else None, axis=1)
    hols_df = hols_df.dropna(axis=0, how='all')
    if hols_df.empty:
        hols_df = np.nan
    else:
        hols_df = hols_df.to_string(index=False)
    return hols_df

#df_hols['holidays'] = df_hols.apply(add_holiday, axis=1)
Perhaps, there is a simpler way to do so or the function is not exactly well-written. Any help will be appreciated.
Use Series.isin and convert the mask to 1/0 with Series.astype:
df1['holiday'] = df1['timestamp'].isin(df2['holiday']).astype(int)
Or with numpy.where:
df1['holiday'] = np.where(df1['timestamp'].isin(df2['holiday']), 1, 0)
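One caveat, since the dtypes aren't shown in the question (so this is an assumption about your data): isin compares values exactly, so both columns must hold the same dtype, e.g. datetime64 on both sides. A minimal sketch:
import pandas as pd

# ensure both sides are real datetimes, not strings, before comparing
df1['timestamp'] = pd.to_datetime(df1['timestamp'])
df2['holiday'] = pd.to_datetime(df2['holiday'])
df1['holiday'] = df1['timestamp'].isin(df2['holiday']).astype(int)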
I have two data frames. One, dataframeA, has 3 rows and 4 columns, with the date as its index:
TYPE UNIT PRICE PERCENT
2010-01-05 REDUCE CAR 2300.00 3.0
2010-06-03 INCREASE BOAT 1000.00 2.0
2010-07-01 INCREASE CAR 3500.00 3.0
and another one, dataframeB, with hundreds of dates as its index and two columns; it starts empty, and after the update it should look like this:
CAR BOAT
2010-01-01 NaN 0.0
2010-01-02 NaN 0.0
2010-01-03 NaN 0.0
2010-01-04 NaN 0.0
2010-01-05 -69.00 0.0
.....
2010-06-03 NaN 20.00
...
2010-07-01 105.00 0.0
I need to read each row from the first data frame, find the corresponding date, and, based on the unit type, assign the corresponding percentage increase or reduction in the second data frame.
I have read that you should avoid iterating when dealing with dataframes, but I am not sure how else to do this. How can I evaluate each row and then set the value on dataframeB?
I tried doing the following :
for index, row in dataframeA.iterrows():
    type = row['TYPE']
    unit = row['UNIT']
    price = row['PRICE']
    percent = row['PERCENT']
    # then here, with basic math, come up with the reduction or
    # increase and assign it to dataframeB; do the same for the others
My question is: is this the right approach, and how do I assign the value I come up with to the other dataframe, dataframeB?
If your first dataframe is limited to just the variables stated, you can do this. It's not terribly elegant, but it works. If you have many more combinations in the dataframe, it would have to be rethought. See the comments inline.
import io
import pandas as pd

df = pd.read_csv(io.StringIO(''' date TYPE UNIT PRICE PERCENT
2010-01-05 REDUCE CAR 2300.00 3.0
2010-06-03 INCREASE BOAT 1000.00 2.0
2010-07-01 INCREASE CAR 3500.00 3.0'''), sep=r'\s+', engine='python').set_index('date')
df1 = pd.read_csv(io.StringIO('''date
2010-01-01
2010-01-02
2010-01-03
2010-01-04
2010-01-05
2010-06-03
2010-07-01'''), engine='python').set_index('date')
# calculate your changes in first dataframe
df.loc[df.TYPE == 'REDUCE', 'Change'] = - df['PRICE'] * df['PERCENT'] / 100
df.loc[df.TYPE == 'INCREASE', 'Change'] = df['PRICE'] * df['PERCENT'] / 100
#merge the Changes into car and boat dataframes; rename columns
df_car = df[['Change']].loc[df.UNIT == 'CAR'].merge(df1, right_index=True, left_index=True, how='right')
df_car.rename(columns={'Change':'Car'}, inplace=True)
df_boat = df[['Change']].loc[df.UNIT == 'BOAT'].merge(df1, right_index=True, left_index=True, how='right')
df_boat.rename(columns={'Change':'Boat'}, inplace=True)
# merge car and boat
dfnew = df_car.merge(df_boat, right_index=True, left_index=True, how='right')
dfnew
Car Boat
date
2010-01-01 NaN NaN
2010-01-02 NaN NaN
2010-01-03 NaN NaN
2010-01-04 NaN NaN
2010-01-05 -69.000 NaN
2010-06-03 NaN 20.000
2010-07-01 105.000 NaN
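A more general alternative (just a sketch, under the same column names as above): compute the signed change once, then pivot UNIT into columns instead of merging per unit:
# signed change: negative for REDUCE, positive for INCREASE
df['Change'] = df['PRICE'] * df['PERCENT'] / 100
df.loc[df.TYPE == 'REDUCE', 'Change'] *= -1
# one column per unit, aligned to dataframeB's full date index
wide = df.reset_index().pivot_table(index='date', columns='UNIT', values='Change')
dfnew = wide.reindex(df1.index)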
I have a dataframe. There is always data available for each date and firm, but a given row isn't guaranteed to have it; a row only holds a firm's data if that firm's boolean column is True.
date IBM AAPL AAPL_total_amount IBM_total_amount AAPL_count_avg IBM_count_avg
2013-01-31 True False 29 9
2013-01-31 True True 29 9 27 5
2013-02-31 False True 27 5
2013-02-08 True True 2 3 5 6
...
How could I reshape the above dataframe to long format?
Expected output:
date Firm total_amount count_avg
2013-01-31 IBM 29 9
2013-01-31 AAPL 27 5
...
You might have to add some logic to drop all the boolean masks, but once you have that, it's just a stack.
u = df.set_index('date').drop(columns=['IBM', 'AAPL'])
u.columns = u.columns.str.split('_', expand=True)
u.stack(0)
count total
date
2013-01-31 IBM 9.0 29.0
AAPL 5.0 27.0
IBM 9.0 29.0
2013-02-31 AAPL 5.0 27.0
2013-02-08 AAPL 6.0 5.0
IBM 3.0 2.0
To drop all the masks if you don't have a list of keys, you could use select_dtypes:
df.select_dtypes(exclude=[bool])
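Putting the two pieces together, a minimal sketch (assuming the frame matches the sample above):
# drop the boolean mask columns without naming them, then stack the ticker level
u = df.set_index('date').select_dtypes(exclude=[bool])
u.columns = u.columns.str.split('_', expand=True)
long_df = u.stack(0)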
Use wide_to_long with pre-processing of the columns and post-processing with slicing and dropna:
df.columns = ['_'.join(col[::-1]) for col in df.columns.str.split('_')]
df_final = (pd.wide_to_long(df.reset_index(), stubnames=['total', 'count'],
                            i=['index', 'date'],
                            j='firm', sep='_', suffix=r'\w+')[['total', 'count']]
              .reset_index(level=[1, 2]).dropna())
Out[59]:
date firm total count
index
0 2013-01-31 IBM 29.0 9.0
1 2013-01-31 IBM 29.0 9.0
1 2013-01-31 AAPL 27.0 5.0
2 2013-02-31 AAPL 27.0 5.0
3 2013-02-08 IBM 2.0 3.0
3 2013-02-08 AAPL 5.0 6.0
That's an unusual table design. Let's assume the table is called df.
So you first want to find the list of tickers:
Either you have them elsewhere:
tickers = ['AAPL','IBM']
or you can extract them from your table:
tickers = [c for c in df.columns
           if not c.endswith('_count') and
              not c.endswith('_total') and
              c != 'date']
Now you have to loop over the tickers:
res = []
for tic in tickers:
    # keep only the rows where this ticker's boolean flag is True
    sub = df[df[tic]][['date', f'{tic}_total', f'{tic}_count']].copy()
    sub.columns = ['date', 'Total', 'Count']
    sub['Firm'] = tic
    res.append(sub)
res = pd.concat(res, axis=0)
Finally, you might want to reorder the columns:
res = res[['date', 'Firm', 'Total', 'Count']]
You might want to handle duplicates. From what I read in your example, you want to drop them:
res = res.drop_duplicates()
I have a time-series corresponding to the end of the month for some dates of interest:
Date
31-01-2005 0.0
28-02-2006 0.0
30-06-2020 0.0
Name: Whatever, dtype: float64
I'd like to expand this dataframe's index with two monthly samples before each data point, resulting in the following dataframe:
Date
30-11-2004 NaN
31-12-2004 NaN
31-01-2005 0.0
31-12-2005 NaN
31-01-2006 NaN
28-02-2006 0.0
30-04-2020 NaN
31-05-2020 NaN
30-06-2020 0.0
Name: Whatever, dtype: float64
How can I do that? Note that I am only interested in the resulting index.
My naive attempt was to do:
df.index.apply(lambda x: [x - pd.DateOffset(months=2), x - pd.DateOffset(months=1), x])
but index doesn't have an apply function.
I think you need DataFrame.reindex with date_range:
idx = [y for x in df.index for y in pd.date_range(x - pd.DateOffset(months=2), x, freq='M')]
df = df.reindex(pd.to_datetime(idx))
print (df)
Whatever
2004-11-30 NaN
2004-12-31 NaN
2005-01-31 0.0
2005-12-31 NaN
2006-01-31 NaN
2006-02-28 0.0
2020-04-30 NaN
2020-05-31 NaN
2020-06-30 0.0
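If only the index itself is needed, the same comprehension can stand alone. Note that recent pandas versions (2.2+) spell the month-end frequency alias 'ME' instead of 'M'; adjust for your version:
import pandas as pd

# build just the expanded DatetimeIndex, without touching the data
idx = pd.DatetimeIndex(
    [y for x in df.index
       for y in pd.date_range(x - pd.DateOffset(months=2), x, freq='M')]
)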
What I'm doing is this: I have generated a DataFrame with pandas:
df_output = pd.DataFrame(columns=["id", "Payout date", "Amount"])
Column 'Payout date' holds a datetime and 'Amount' a float. I'm taking the values for each row from a CSV:
df=pd.read_csv("file.csv", encoding = "ISO-8859-1", low_memory=False)
but when I assign the values:
df_output.loc[df_output['id'] == index, 'Payout date'].iloc[0]=(parsed_date)
pay=payments.get()
ref=refunds.get()
df_output.loc[df_output['id'] == index, 'Amount'].iloc[0]=(pay+ref-for_next_day)
and then print the columns 'Payout date' and 'Amount', only the id prints correctly; I get NaT for the payouts and NaN for the amounts, even when casting them to floats, or when using
df_output['Amount']=pd.to_numeric(df_output['Amount'])
df_output['Payout date'] = pd.to_datetime(df_output['Payout date'])
I've also tried casting the values before passing them to the DataFrame, with no luck, so what I'm getting is this:
id Payout date Amount
1 NaT NaN
2 NaT NaN
3 NaT NaN
4 NaT NaN
5 NaT NaN
Instead, I'm looking for something like this:
id Payout date Amount
1 2019-03-11 3.2
2 2019-03-11 3.2
3 2019-03-11 3.2
4 2019-03-11 3.2
5 2019-03-11 3.2
EDIT
print(df_output.head(5))
print(df.head(5))
id Payout date Amount
1 NaT NaN
2 NaT NaN
3 NaT NaN
4 NaT NaN
5 NaT NaN
id Created (UTC) Type Currency Amount Fee Net
1 2016-07-27 13:28:00 charge mxn 672.0 31.54 640.46
2 2016-07-27 15:21:00 charge mxn 146.0 9.58 136.42
3 2016-07-27 16:18:00 charge mxn 200.0 11.83 188.17
4 2016-07-27 17:18:00 charge mxn 146.0 9.58 136.42
5 2016-07-27 18:11:00 charge mxn 286.0 15.43 270.57
Probably the easiest thing to do would be just to rename the columns of the dataframe you're loading:
df = pd.read_csv("file.csv", encoding = "ISO-8859-1", low_memory=False, index_col='id')
df.rename(columns={'Created (UTC)': 'Payout Date'}, inplace=True)
df_output = df[['Payout Date', 'Amount']]
EDIT:
If you're trying to assign a column in one dataframe to the column of another, just do this:
output_df['Amount'] = df['Amount']
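As a side note (my reading of the symptom, not something stated in the question): the NaT/NaN in the original attempt likely come from chained assignment; df_output.loc[...] followed by .iloc[0] = ... writes into a temporary copy, not into df_output itself. A single .loc assignment avoids that:
# one .loc call per assignment writes into df_output directly
mask = df_output['id'] == index
df_output.loc[mask, 'Payout date'] = parsed_date
df_output.loc[mask, 'Amount'] = pay + ref - for_next_day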