I have a csv with a table of data. I would like to read the csv and write a new xlsx file based on the initial csv.
However, I would also like to add a new column whose value is derived from the header name (e.g. whether the header contains the word "online") and build a sort of pivot table based on this logic. So instead of having three columns for leads, I would have a single leads column spread over three rows per date (one for each source).
date    online_won  retail_won  outbound_won  online_leads  retail_leads  outbound_leads
1/1/11           9          10            11            12            14
2/1/11           1           2            13            15
3/1/11          10           8            14            17
This is the desired output:
date    source    won  leads
1/1/11  online      9     12
1/1/11  retail     10     14
1/1/11  outbound   11
.....
I assume I can solve this using pd.pivot_table, but I can't figure out how to get won and leads back as columns and how to extract only the online/retail/outbound part from the existing column names.
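For reference, the surrounding read/write step is roughly this sketch (the file names leads.csv and output.xlsx are placeholders, and writing xlsx assumes an Excel engine such as openpyxl is installed):
import pandas as pd

# read the original wide csv (placeholder file name)
df = pd.read_csv('leads.csv')

# ... reshape df into the long date/source/won/leads layout (see the answers below) ...

# write the reshaped result to a new xlsx file
df.to_excel('output.xlsx', index=False)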
You could use pd.wide_to_long, with a little extra work on the columns, since wide format variables are assumed to start with the stub names:
# move the stub (won/leads) to the front: online_won -> won_online
df.columns = ['_'.join(i[::-1]) for i in df.columns.str.split('_')]
(pd.wide_to_long(df, stubnames=['won','leads'], i='date', j='source', suffix=r'_\w+')
   .reset_index())
date source won leads
0 1/1/11 _online 9 12.0
1 2/1/11 _online 1 15.0
2 3/1/11 _online 10 17.0
3 1/1/11 _retail 10 14.0
4 2/1/11 _retail 2 NaN
5 3/1/11 _retail 8 NaN
6 1/1/11 _outbound 11 NaN
7 2/1/11 _outbound 13 NaN
8 3/1/11 _outbound 14 NaN
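Because the default sep='' is used above, the extracted source values keep a leading underscore. If that is unwanted, one small follow-up sketch (assuming the reshaped frame is stored in a variable, here called out) strips it afterwards:
out['source'] = out['source'].str.lstrip('_')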
With a column rename, you can use wide_to_long:
# reverse the parts so the columns read won_online, leads_retail, etc.
df.columns = ['_'.join(x.split('_')[::-1]) for x in df.columns]
pd.wide_to_long(df, ['won','leads'], 'date', 'source', sep='_', suffix=r'\w+')
Output:
won leads
date source
1/1/11 online 9 12.0
2/1/11 online 1 15.0
3/1/11 online 10 17.0
1/1/11 retail 10 14.0
2/1/11 retail 2 NaN
3/1/11 retail 8 NaN
1/1/11 outbound 11 NaN
2/1/11 outbound 13 NaN
3/1/11 outbound 14 NaN
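The result above keeps date and source in the index; calling reset_index() on it (stored here in a hypothetical variable out) flattens it back to the four columns from the question:
out = pd.wide_to_long(df, ['won','leads'], 'date', 'source', sep='_', suffix=r'\w+')
out = out.reset_index()   # columns: date, source, won, leads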
Another way, using melt and Series.str.split() with unstack():
m = df.melt('date').sort_values('date')
m[['Source','Status']] = m.pop('variable').str.split('_', expand=True)
final = (m.set_index(['date','Source','Status']).unstack()
          .droplevel(0, axis=1).reset_index().rename_axis(None, axis=1))
date Source leads won
0 1/1/11 online 12.0 9.0
1 1/1/11 outbound NaN 11.0
2 1/1/11 retail 14.0 10.0
3 2/1/11 online 15.0 1.0
4 2/1/11 outbound NaN 13.0
5 2/1/11 retail NaN 2.0
6 3/1/11 online 17.0 10.0
7 3/1/11 outbound NaN 14.0
8 3/1/11 retail NaN 8.0
Use DataFrame.set_index together with str.split to build a MultiIndex in the columns, which makes it possible to DataFrame.stack the first level (0):
df = df.set_index('date')
df.columns = df.columns.str.split('_', expand=True)
df = df.stack(0).rename_axis(('date','Source')).reset_index()
print(df)
date Source leads won
0 1/1/11 online 12.0 9
1 1/1/11 outbound NaN 11
2 1/1/11 retail 14.0 10
3 2/1/11 online 15.0 1
4 2/1/11 outbound NaN 13
5 2/1/11 retail NaN 2
6 3/1/11 online 17.0 10
7 3/1/11 outbound NaN 14
8 3/1/11 retail NaN 8
Related
I have been tasked with reorganizing a fairly large data set for analysis. I want to make a dataframe where each employee has a list of Stats associated with their Employee Number, ordered by how many periods they have been with the company. The data does not go all the way back to the start of the company, so some employees will not appear in the first period. My guess is there's some combination of pivot and merge that I can't wrap my head around.
df1 looks like this:
   Periods since Start  Period  Employee Number  Wage  Sick Days
0                    3  202001              101    20         14
1                    2  202001              102    15         12
2                    1  202001              103    10         17
3                    4  202002              101    20         14
4                    3  202002              102    20         10
5                    2  202002              103    10         13
6                    5  202003              101    25         13
7                    4  202003              102    20          9
8                    3  202003              103    10         13
And I want df2 (Column# for reference only):
Column1  Column2    Column3  Column4  Column5
                        101      102      103
      1  Wage           NaN      NaN       10
      1  Sick Days      NaN      NaN       17
      2  Wage           NaN       15       10
      2  Sick Days      NaN       12       13
      3  Wage            20       20       10
      3  Sick Days       14       10       13
      4  Wage            20       20      NaN
      4  Sick Days       14        9      NaN
Column1 = 'Periods since Start'
Column2 = "Stat" e.g. 'Wage', 'Sick Days'
Column3 - Column 5 Headers = 'Employee Number'
First thoughts were to try pivot/merge/stack but I have had no good results.
The second option I thought of was to create a dataframe with the index and headers that I wanted and then populate it from df1:
import pandas as pd
import numpy as np
stat_list = ['Wage', 'Sick Days']
largest_period = df1['Periods since Start'].max()
df2 = np.tile(stat_list, largest_period)
df2 = pd.DataFrame(data=df2, columns=['Stat'])
df2['Period_Number'] = df2.groupby('Stat').cumcount() + 1
df2 = pd.DataFrame(index=df2[['Period_Number', 'Stat']],
                   columns=df1['Employee Number'])
Which yields:
Employee Number 101 102 103
(1, 'Wage') NaN NaN NaN
(1, 'Sick Days') NaN NaN NaN
(2, 'Wage') NaN NaN NaN
(2, 'Sick Days') NaN NaN NaN
(3, 'Wage') NaN NaN NaN
(3, 'Sick Days') NaN NaN NaN
(4, 'Wage') NaN NaN NaN
(4, 'Sick Days') NaN NaN NaN
But I am at a loss on how to populate it.
You can .melt and then .unstack the dataframe.
Finish up with some MultiIndex column cleanup, using .droplevel and passing axis=1 to drop the unnecessary level from the columns rather than the default axis=0, which would drop index levels. You can also use reset_index() to bring the index columns back into your dataframe:
df = (df.melt(id_vars=['Periods since Start', 'Employee Number'],
              value_vars=['Wage', 'Sick Days'])
        .set_index(['Periods since Start', 'Employee Number', 'variable'])
        .unstack(1)
        .droplevel(0, axis=1)
        .reset_index())
df
Out[1]:
Employee Number Periods since Start variable 101 102 103
0 1 Sick Days NaN NaN 17.0
1 1 Wage NaN NaN 10.0
2 2 Sick Days NaN 12.0 13.0
3 2 Wage NaN 15.0 10.0
4 3 Sick Days 14.0 10.0 13.0
5 3 Wage 20.0 20.0 10.0
6 4 Sick Days 14.0 9.0 NaN
7 4 Wage 20.0 20.0 NaN
8 5 Sick Days 13.0 NaN NaN
9 5 Wage 25.0 NaN NaN
When melting the dataframe, you can pass var_name= since the default column name is "variable". If you do that, make sure to change the column name in set_index() as well, as in the sketch below.
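A minimal sketch of that variant, starting again from the original dataframe and using the hypothetical column name 'Stat':
df = (df.melt(id_vars=['Periods since Start', 'Employee Number'],
              value_vars=['Wage', 'Sick Days'],
              var_name='Stat')                 # 'Stat' instead of the default 'variable'
        .set_index(['Periods since Start', 'Employee Number', 'Stat'])
        .unstack(1)
        .droplevel(0, axis=1)
        .reset_index())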
Try this: first melt the dataframe, keeping Periods since Start, Employee Number, and Period as id variables. Next, pivot the dataframe, with the melted 'value' column supplying the values of the pivoted dataframe. Lastly, clean up the index with reset_index and remove the column index name using rename_axis:
(df.melt(['Periods since Start', 'Employee Number', 'Period'])
   .pivot(index=['Periods since Start', 'variable'],
          columns='Employee Number',
          values='value')
   .reset_index()
   .rename_axis(None, axis=1))
Output:
Periods since Start variable 101 102 103
0 1 Sick Days NaN NaN 17.0
1 1 Wage NaN NaN 10.0
2 2 Sick Days NaN 12.0 13.0
3 2 Wage NaN 15.0 10.0
4 3 Sick Days 14.0 10.0 13.0
5 3 Wage 20.0 20.0 10.0
6 4 Sick Days 14.0 9.0 NaN
7 4 Wage 20.0 20.0 NaN
8 5 Sick Days 13.0 NaN NaN
9 5 Wage 25.0 NaN NaN
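Both reshaped outputs list Sick Days before Wage within each period, while the desired df2 has Wage first. A small follow-up sketch (assuming the reshaped result is stored in a hypothetical variable out) imposes that order with an ordered categorical:
out['variable'] = pd.Categorical(out['variable'],
                                 categories=['Wage', 'Sick Days'], ordered=True)
out = out.sort_values(['Periods since Start', 'variable']).reset_index(drop=True)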
This is my dataframe:
df = pd.DataFrame.from_records(data=data, coerce_float=False, index=['date'])
# the date field holds datetime.datetime values
account_id amount
date
2018-01-01 1 100.0
2018-01-01 1 50.0
2018-06-01 1 200.0
2018-07-01 2 100.0
2018-10-01 2 200.0
Problem description
How can I "pad" my dataframe with leading and trailing "empty" dates? I have tried to reindex on a date_range and a period_range, and I have tried to merge in another index. I have tried all sorts of things all day, and I have read a lot of the docs.
I have a simple dataframe with columns transaction_date, transaction_amount, and transaction_account. I want to group this dataframe so that it is grouped by account at the first level, and then by year, and then by month. Then I want a column for each month, with the sum of that month's transaction amount value.
This seems like it should be something that is easy to do.
Expected Output
This is the closest I have gotten:
df = pd.DataFrame.from_records(data=data, coerce_float=False, index=['date'])
df = df.groupby(['account_id', df.index.year, df.index.month])
df = df.resample('M').sum().fillna(0)
print(df)
account_id amount
account_id date date date
1 2018 1 2018-01-31 2 150.0
6 2018-06-30 1 200.0
2 2018 7 2018-07-31 2 100.0
10 2018-10-31 2 200.0
And this is what I want to achieve (basically reindex the data by date_range(start='2018-01-01', periods=12, freq='M')).
(Ideally I would want the months transposed across the top as columns, grouped by year.)
amount
account_id Year Month
1 2018 1 150.0
2 NaN
3 NaN
4 NaN
5 NaN
6 200.0
....
12 200.0
2 2018 1 NaN
....
7 100.0
....
10 200.0
....
12 NaN
One way is to reindex:
s = df.groupby([df['account_id'], df.index.year, df.index.month]).sum()
idx = pd.MultiIndex.from_product([s.index.levels[0], s.index.levels[1], list(range(1, 13))])
s = s.reindex(idx)
s
Out[287]:
amount
1 2018 1 150.0
2 NaN
3 NaN
4 NaN
5 NaN
6 200.0
7 NaN
8 NaN
9 NaN
10 NaN
11 NaN
12 NaN
2 2018 1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
6 NaN
7 100.0
8 NaN
9 NaN
10 200.0
11 NaN
12 NaN
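If the months are wanted across the top as columns (the "ideally" part of the question), a short follow-up sketch unstacks the last index level:
wide = s['amount'].unstack(level=-1)   # one column per month, one row per (account_id, year)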
I have a dataframe with information about sales of some products (unit):
unit year month price
0 1 2018 6 100
1 1 2013 4 70
2 2 2015 10 80
3 2 2015 2 110
4 3 2017 4 120
5 3 2002 6 90
6 4 2016 1 55
and I would like to add, for each sale, columns with information about the previous sale, or NaN if there is no previous sale.
unit year month price prev_price prev_year prev_month
0 1 2018 6 100 70.0 2013.0 4.0
1 1 2013 4 70 NaN NaN NaN
2 2 2015 10 80 110.0 2015.0 2.0
3 2 2015 2 110 NaN NaN NaN
4 3 2017 4 120 90.0 2002.0 6.0
5 3 2002 6 90 NaN NaN NaN
6 4 2016 1 55 NaN NaN NaN
At the moment I am grouping on the unit, keeping the units that have several rows, then extracting the information for those units at the minimal date, and finally joining this table back to my original table, keeping only the rows whose dates differ between the two merged tables. I feel like there is a much simpler way to do this, but I am not sure how.
Use DataFrameGroupBy.shift with add_prefix, and join the shifted DataFrame back to the original:
# if the real data are not sorted:
# df = df.sort_values(['unit','year','month'], ascending=[True, False, False])
df = df.join(df.groupby('unit', sort=False).shift(-1).add_prefix('prev_'))
print(df)
unit year month price prev_year prev_month prev_price
0 1 2018 6 100 2013.0 4.0 70.0
1 1 2013 4 70 NaN NaN NaN
2 2 2015 10 80 2015.0 2.0 110.0
3 2 2015 2 110 NaN NaN NaN
4 3 2017 4 120 2002.0 6.0 90.0
5 3 2002 6 90 NaN NaN NaN
6 4 2016 1 55 NaN NaN NaN
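If the exact column order from the question (prev_price before prev_year and prev_month) matters, a small reordering sketch:
df = df[['unit', 'year', 'month', 'price', 'prev_price', 'prev_year', 'prev_month']]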
My .csv file looks like:
Area When Year Month Tickets
City Day 2015 1 14
City Night 2015 1 5
Rural Day 2015 1 18
Rural Night 2015 1 21
Suburbs Day 2015 1 15
Suburbs Night 2015 1 21
City Day 2015 2 13
containing 75 rows. I want both a row MultiIndex and a column MultiIndex, so that it looks like:
Area City Rural Suburbs
When Day Night Day Night Day Night
Year Month
2015 1 5.0 3.0 22.0 11.0 13.0 2.0
2 22.0 8.0 4.0 16.0 6.0 18.0
3 26.0 25.0 22.0 23.0 22.0 2.0
2016 1 20.0 25.0 39.0 14.0 3.0 10.0
2 4.0 14.0 16.0 26.0 1.0 24.0
3 22.0 17.0 7.0 24.0 12.0 20.0
I've read the .read_csv doc at https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html
I can get the row multiindex with:
df2 = pd.read_csv(r'c:\Data\Tickets.csv', index_col=[2, 3])
I've tried:
df2 = pd.read_csv(r'c:\Data\Tickets.csv', index_col=[2, 3], header=[1, 3, 5])
thinking [1, 3, 5] would fetch 'City', 'Rural', and 'Suburbs'. How do I get the desired column MultiIndex shown above?
It seems like you need pivot_table with multiple index and column levels.
Start by just reading your csv plainly:
df = pd.read_csv('Tickets.csv')
Then
df.pivot_table(index=['Year', 'Month'], columns=['Area', 'When'], values='Tickets')
With the input data you provided, you'd get
Area City Rural Suburbs
When Day Night Day Night Day Night
Year Month
2015 1 14.0 5.0 18.0 21.0 15.0 21.0
2 13.0 NaN NaN NaN NaN NaN
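If the goal is still to produce this shape straight from read_csv, one hedged option is to write the pivoted frame out once and read it back with multi-row headers (the file name Tickets_pivoted.csv is a placeholder):
out = df.pivot_table(index=['Year', 'Month'], columns=['Area', 'When'], values='Tickets')
out.to_csv('Tickets_pivoted.csv')

# header=[0, 1] rebuilds the two-level (Area, When) column MultiIndex,
# index_col=[0, 1] rebuilds the (Year, Month) row MultiIndex
df3 = pd.read_csv('Tickets_pivoted.csv', header=[0, 1], index_col=[0, 1])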
Here is a table; I need to partially unpivot it by Class.
 ID  Class   Type  2017  2018
12A      A    Net     1     7
12B      A  Gross           8
12A      B    Net     3     9
12B      B  Gross     4    10
13A      A    Net     5    11
13C      B    Net     6     5
The expected result:
ID Class Type 2017A 2018A 2017B 2018B
12A A Net 1 7 3 9
12B A Gross NaN 8 4 10
13A A Net 5 11 NaN NaN
13C B Net NaN NaN 6 5
Use:
df1 = df.set_index(['ID','Type','Class']).unstack().sort_index(level=1, axis=1)
df1.columns = ['{}{}'.format(a,b) for a, b in df1.columns]
df1 = df1.reset_index()
s = df.drop_duplicates('ID').set_index('ID')['Class']
df1.insert(1, 'Class', df1['ID'].map(s))
print (df1)
ID Class Type 2017A 2018A 2017B 2018B
0 12A A Net 1.0 7.0 3.0 9.0
1 12B A Gross NaN 8.0 4.0 10.0
2 13A A Net 5.0 11.0 NaN NaN
3 13C B Net NaN NaN 6.0 5.0
Explanation:
Reshape with set_index and unstack, then sort the columns by the second level with sort_index
Flatten the MultiIndex in the columns with a list comprehension
For the Class column, use insert to add a new second column created with map