I have the below data frame of items with expiry dates:
Item       Expiry Date  Stock
Voucher 1  1-Mar-2022   3
Voucher 2  31-Apr-2022  2
Voucher 3  1-Feb-2022   1
And I want to create an aging dashboard and map out my number of stock there:
           Jan  Feb  Mar  Apr
Voucher 1            3
Voucher 2                 2
Voucher 3       1
Any ideas or guides on how to do something like the above, please? I searched a lot of resources but cannot find any. I'm very new to building dashboards. Thanks.
You can extract the month name (NB: your dates are invalid, 31 Apr is impossible) and pivot the table. If needed, reindex with a list of month names:
from calendar import month_abbr

cols = month_abbr[1:]  # first item is an empty string

# extract the month abbreviation between the dashes, pivot, then reindex to all months
(df.assign(month=df['Expiry Date'].str.extract(r'-(\D+)-', expand=False))
   .pivot(index='Item', columns='month', values='Stock')
   .reindex(columns=cols)
)
If you expect to have duplicated Items, use pivot_table with sum as the aggregation function instead (see the sketch after the output below).
Output:
month Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
Item
Voucher 1 NaN NaN 3.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN
Voucher 2 NaN NaN NaN 2.0 NaN NaN NaN NaN NaN NaN NaN NaN
Voucher 3 NaN 1.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
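For reference, a minimal sketch of the pivot_table variant for duplicated Items, assuming the same df and cols as above:

from calendar import month_abbr

cols = month_abbr[1:]
# pivot_table aggregates duplicated (Item, month) pairs instead of raising
(df.assign(month=df['Expiry Date'].str.extract(r'-(\D+)-', expand=False))
   .pivot_table(index='Item', columns='month', values='Stock', aggfunc='sum')
   .reindex(columns=cols)
)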
You may try it like this:
import pandas as pd

#   Item        Expiry Date   Stock
#   Voucher 1   1-Mar-2022    3
#   Voucher 2   31-Apr-2022   2
#   Voucher 3   1-Feb-2022    1
data = {'Item': ['Voucher 1', 'Voucher 2', 'Voucher 3'],
        'Expiry Date': ['1-Mar-2022', '31-Apr-2022', '1-Feb-2022'],
        'Stock': [3, 2, 1]}
df = pd.DataFrame(data)

# Using the pandas apply method, get the month from each row (axis=1) and store it in a new column 'Month'
# https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.apply.html
df['Month'] = df.apply(lambda x: x['Expiry Date'].split('-')[1], axis=1)

# Using the pandas pivot method, set the 'Item' column as the index,
# https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.pivot.html
# set the unique values in the 'Month' column as separate columns,
# set the values in the 'Stock' column as the values for the respective month columns,
# and use the 'rename_axis' method to remove the column-axis name 'Month'
# https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.rename_axis.html
new_df = df.pivot(index='Item', columns='Month', values='Stock').rename_axis(None, axis=1)

# Sort the month column names by first converting each to a pandas timestamp,
# then using that as the key in a sorted() call on all columns
new_df = new_df[sorted(new_df.columns, key=lambda x: pd.to_datetime(x, format='%b'))]
print(new_df)
And this is the output I am getting:
Feb Mar Apr
Item
Voucher 1 NaN 3.0 NaN
Voucher 2 NaN NaN 2.0
Voucher 3 1.0 NaN NaN
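If you also want months with no stock (such as Jan) to appear as columns, you could reindex with month abbreviations, as in the previous answer. A minimal sketch, assuming you only want Jan through Apr:

from calendar import month_abbr
new_df = new_df.reindex(columns=month_abbr[1:5])  # ['Jan', 'Feb', 'Mar', 'Apr']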
This is not my actual data, just a representation of a larger set.
I have a dataframe (df) that looks like this:
id text_field text_value
1 date 2021-07-01
1 hour 07:04
2 available yes
2 sold no
Due to project demands, I need to manipulate this data to a certain point. The main part is this one:
df.set_index(['id','text_field'], append=True).unstack().droplevel(0,1).droplevel(0)
Leaving me with something like this:
text_field date hour available sold
id
1 2021-07-01 NaN NaN NaN
1 NaN 07:04 NaN NaN
2 NaN NaN yes NaN
2 NaN NaN NaN no
That is very close to what I need, but I'm failing to achieve the next step. I need to group this data by id, leaving only one id on each line.
Something like this:
text_field date hour available sold
id
1 2021-07-01 07:04 NaN NaN
2 NaN NaN yes no
Can somebody help me?
As mentioned by @Nk03 in the comments, you could use the pivot feature of pandas:
import pandas as pd
# Creating example dataframe
data = {
    'id': [1, 1, 2, 2],
    'text_field': ['date', 'hour', 'available', 'sold'],
    'text_value': ['2021-07-01', '07:04', 'yes', 'no']
}
df = pd.DataFrame(data)
# Pivoting on dataframe
df_pivot = df.pivot(index='id', columns='text_field')
print(df_pivot)
Console output:
text_value
text_field available date hour sold
id
1 NaN 2021-07-01 07:04 NaN
2 yes NaN NaN no
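Note that this result carries a MultiIndex on the columns (the text_value level on top). If you want flat columns matching your desired output, a small sketch passing values explicitly should do it:

# selecting the value column directly avoids the extra column level
df_pivot = df.pivot(index='id', columns='text_field', values='text_value')
print(df_pivot)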
I am struggling with how to pivot a DataFrame with multi-indexed columns. First I import the data from an .xlsx file, and from there I've tried to generate a certain DataFrame.
Note: I'm not allowed to embed images, so that's the reason for the links.
import pandas as pd
import numpy as np
# Read Excel file
df = pd.read_excel("myFile.xlsx", header=[0])
Output: Click
If you want, here you can see the File: Link to File
# Get Month and Year of the Dataframe
month_year = df.iloc[:, 5:-1].columns
month_list = []
year_list = []
for x in range(len(month_year)-1):
    if "Unnamed" not in month_year[x]:
        month_list.append(month_year[x].split()[0])
        year_list.append(month_year[x].split()[1])
# Read Excel file with headers 1 & 2
df = pd.read_excel(path, header=[0,1])
Output: Click
# Join both indexes excluding the ones with Unnamed
df.columns = [str(x[0] + " " + x[1]) if("Unnamed" not in x[1]) else str(x[0]) for x in df.columns ]
Output: Click
# Adding month and list columns to the DataFrame
df['Month'] = month_list
df['Year'] = year_list
Output: Click
I want the output DataFrame to be like the following:
Desired Output
You may need to clean this up a bit, because I do not know how the Total column should be handled (see the filtering sketch after the output below).
The code below reads the Excel file with a MultiIndex header, does a bit of column-name modification, then stacks and extracts the Year and Month columns.
df = pd.read_excel("Downloads/myFile.xlsx", header=[0,1], index_col=[0, 1, 2])
df.index.names = ['Project', 'KPI', 'Metric']
df.columns = df.columns.delete(-1).union([('Total', 'Total')])
df.columns.names = ['Month_Year', 'Values']
(df
 .stack(level=0)
 .rename_axis(columns=None)
 .reset_index()
 .assign(Year=lambda df: df.Month_Year.str.split(" ").str[-1],
         Month=lambda df: df.Month_Year.str.split(" ").str[0])
 .drop(columns='Month_Year')
)
Project KPI Metric Real Target Total Year Month
0 Project 1 Numeric Project 1 Metric 10.0 30.0 NaN 2019 April
1 Project 1 Numeric Project 1 Metric 651.0 51651.0 NaN 2019 February
2 Project 1 Numeric Project 1 Metric 200.0 215.0 NaN 2019 January
3 Project 1 Numeric Project 1 Metric 2.0 5.0 NaN 2019 March
4 Project 1 Numeric Project 1 Metric NaN NaN 9.0 Total Total
5 Project 2 General Project 2 Metric 20.0 10.0 NaN 2019 April
6 Project 2 General Project 2 Metric 500.0 100.0 NaN 2019 February
7 Project 2 General Project 2 Metric 749.0 12.0 NaN 2019 January
8 Project 2 General Project 2 Metric 1.0 7.0 NaN 2019 March
9 Project 2 General Project 2 Metric NaN NaN 7.0 Total Total
10 Project 3 Numeric Project 3 Metric 30.0 20.0 NaN 2019 April
11 Project 3 Numeric Project 3 Metric 200.0 55.0 NaN 2019 February
12 Project 3 Numeric Project 3 Metric 5583.0 36.0 NaN 2019 January
13 Project 3 Numeric Project 3 Metric 3.0 7.0 NaN 2019 March
14 Project 3 Numeric Project 3 Metric NaN NaN 4.0 Total Total
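If you would rather drop the Total column entirely than carry it through as Total/Total rows, a minimal sketch filtering the result above (assuming the chain was assigned to a hypothetical variable out) could be:

out = out[out['Year'] != 'Total']  # keep only the real month rows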
In my code I've generated a range of dates using pd.date_range in an effort to compare it to a column of dates read in from Excel using pandas. The generated range of dates is referred to as all_dates.
all_dates = pd.date_range(start='1998-12-31', end='2020-06-23')

for i, date in enumerate(period):  # where 'period' is the column of excel dates
    if date == all_dates[i]:  # loop until a date from excel doesn't match the generated dates
        continue
    else:
        missing_dates_stock.append(i)  # keep a list of locations where dates are missing
        stock_data.insert(i, "NaN")  # insert 'NaN' where a missing date is found
This results in TypeError: argument of type 'Timestamp' is not iterable. How can I make the data types match such that I can iterate and compare them? Apologies as I am not very fluent in Python.
I think you are trying to create a NaN row if the date does not exist in the excel file.
Here's a way to do it. You can use the df.merge option.
I am creating df1 to simulate the Excel file. It has two columns, sale_dt and sale_amt. If a sale_dt does not exist, we want a separate row with NaN in the columns. To simulate that, I am creating a date range from 1998-12-31 through 2020-06-23, skipping 4 days between rows. So we have a dataframe with 4 missing dates between every two rows. The solution should create 4 dummy rows with the correct dates in ascending order.
import pandas as pd
import random
# Create the sales dataframe with missing dates
df1 = pd.DataFrame({'sale_dt': pd.date_range(start='1998-12-31', end='2020-06-23', freq='5D'),
                    'sale_amt': random.sample(range(1, 2000), 1570)})
print(df1)

# Now create a dataframe with all the dates between '1998-12-31' and '2020-06-23'
df2 = pd.DataFrame({'date': pd.date_range(start='1998-12-31', end='2020-06-23', freq='D')})
print(df2)
# Now merge both dataframes with an outer join so you get all the rows.
# I am also sorting the data in ascending order so you can see the dates,
# dropping the original sale_dt column, renaming the date column to sale_dt,
# and then resetting the index.
df1 = (df1.merge(df2, left_on='sale_dt', right_on='date', how='outer')
          .drop(columns=['sale_dt'])
          .rename(columns={'date': 'sale_dt'})
          .sort_values(by='sale_dt')
          .reset_index(drop=True))
print(df1.head(20))
The original dataframe was:
sale_dt sale_amt
0 1998-12-31 1988
1 1999-01-05 1746
2 1999-01-10 1395
3 1999-01-15 538
4 1999-01-20 1186
... ... ...
1565 2020-06-03 560
1566 2020-06-08 615
1567 2020-06-13 858
1568 2020-06-18 298
1569 2020-06-23 1427
The output of this will be (first 20 rows):
sale_amt sale_dt
0 1988.0 1998-12-31
1 NaN 1999-01-01
2 NaN 1999-01-02
3 NaN 1999-01-03
4 NaN 1999-01-04
5 1746.0 1999-01-05
6 NaN 1999-01-06
7 NaN 1999-01-07
8 NaN 1999-01-08
9 NaN 1999-01-09
10 1395.0 1999-01-10
11 NaN 1999-01-11
12 NaN 1999-01-12
13 NaN 1999-01-13
14 NaN 1999-01-14
15 538.0 1999-01-15
16 NaN 1999-01-16
17 NaN 1999-01-17
18 NaN 1999-01-18
19 NaN 1999-01-19
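Since the dates in df1 are unique here, an alternative sketch using set_index/reindex on the original df1 (before the merge) would achieve the same result:

all_dates = pd.date_range(start='1998-12-31', end='2020-06-23', freq='D')
df1 = (df1.set_index('sale_dt')
          .reindex(all_dates)      # missing dates become NaN rows
          .rename_axis('sale_dt')
          .reset_index())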
I have an Excel file and the first two rows are:
Weekly Report
December 1-7, 2014
And after that comes the relevant table.
When I use
filename = r'excel.xlsx'
df = pd.read_excel(filename)
print(df)
I get
        Weekly Report Unnamed: 1 Unnamed: 2 Unnamed: 3 Unnamed: 4 Unnamed: 5
0  December 1-7, 2014        NaN        NaN        NaN        NaN        NaN
1                 NaN        NaN        NaN        NaN        NaN        NaN
2                Date        App   Campaign    Country       Cost   Installs
What I mean is that the column names are "Unnamed" because the real header row sits below the irrelevant first rows.
If pandas read only the table, my columns would be Installs, Cost, etc., which is what I want.
How can I tell it to start reading from line 3?
Use skiprows to your advantage -
df = pd.read_excel(filename, skiprows=[0,1])
This should do it. pandas ignores the first two rows in this case -
skiprows : list-like
Rows to skip at the beginning (0-indexed)
More details here
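Equivalently, you could point header at the row that holds the real column names; a small sketch, assuming the table header sits on the third row of the sheet (0-indexed row 2):

df = pd.read_excel(filename, header=2)  # row 2 becomes the header; rows 0 and 1 are skipped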
I have data that contains prices, volumes and other data about various financial securities. My input data looks like the following:
import numpy as np
import pandas
prices = np.random.rand(15) * 100
volumes = np.random.randint(15, size=15) * 10
idx = pandas.Series([2007, 2007, 2007, 2007, 2007, 2008,
                     2008, 2008, 2008, 2008, 2009, 2009,
                     2009, 2009, 2009], name='year')
df = pandas.DataFrame({'price': prices, 'volume': volumes})  # DataFrame.from_items was removed in newer pandas
df.index = idx
# BELOW IS AN EXAMPLE OF WHAT THE INPUT MIGHT LOOK LIKE
# IT WON'T BE EXACT BECAUSE OF THE USE OF RANDOM
# price volume
# year
# 2007 0.121002 30
# 2007 15.256424 70
# 2007 44.479590 50
# 2007 29.096013 0
# 2007 21.424690 0
# 2008 23.019548 40
# 2008 90.011295 0
# 2008 88.487664 30
# 2008 51.609119 70
# 2008 4.265726 80
# 2009 34.402065 140
# 2009 10.259064 100
# 2009 47.024574 110
# 2009 57.614977 140
# 2009 54.718016 50
I want to produce a data frame that looks like:
year 2007 2008 2009
0 0.121002 23.019548 34.402065
1 15.256424 90.011295 10.259064
2 44.479590 88.487664 47.024574
3 29.096013 51.609119 57.614977
4 21.424690 4.265726 54.718016
I know of one way to produce the output above using groupby:
df = df.reset_index()
grouper = df.groupby('year')
df2 = None
for group, data in grouper:
    series = data['price'].copy()
    series.index = range(len(series))
    series.name = group
    df2 = pandas.DataFrame(series) if df2 is None else pandas.concat([df2, series], axis=1)
And I also know that you can do pivot to get a DataFrame that has NaNs for the missing indices on the pivot:
# df = df.reset_index()
df.pivot(columns='year', values='price')
# Output
# year 2007 2008 2009
# 0 0.121002 NaN NaN
# 1 15.256424 NaN NaN
# 2 44.479590 NaN NaN
# 3 29.096013 NaN NaN
# 4 21.424690 NaN NaN
# 5 NaN 23.019548 NaN
# 6 NaN 90.011295 NaN
# 7 NaN 88.487664 NaN
# 8 NaN 51.609119 NaN
# 9 NaN 4.265726 NaN
# 10 NaN NaN 34.402065
# 11 NaN NaN 10.259064
# 12 NaN NaN 47.024574
# 13 NaN NaN 57.614977
# 14 NaN NaN 54.718016
My question is the following:
Is there a way that I can create my output DataFrame in the groupby without creating the series, or is there a way I can re-index my input DataFrame so that I get the desired output using pivot?
You need to label the rows within each year 0-4. To do this, use cumcount after grouping. Then you can pivot correctly, using that new column as the index.
df['year_count'] = df.groupby(level='year').cumcount()
df.reset_index().pivot(index='year_count', columns='year', values='price')
year 2007 2008 2009
year_count
0 61.682275 32.729113 54.859700
1 44.231296 4.453897 45.325802
2 65.850231 82.023960 28.325119
3 29.098607 86.046499 71.329594
4 67.864723 43.499762 19.255214
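For reference, cumcount simply numbers the rows within each group starting at zero, so each year block here is labeled 0 through 4; that per-group counter is what makes the pivot index unique. A quick check:

print(df.groupby(level='year').cumcount().tolist())
# [0, 1, 2, 3, 4, 0, 1, 2, 3, 4, 0, 1, 2, 3, 4]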
You can use groupby with apply to create a new Series from each group's underlying numpy array (values), then reshape with unstack:
print (df.groupby(level='year')['price'].apply(lambda x: pd.Series(x.values)).unstack(0))
year 2007 2008 2009
0 55.360804 68.671626 78.809139
1 50.246485 55.639250 84.483814
2 17.646684 14.386347 87.185550
3 54.824732 91.846018 60.793002
4 24.303751 50.908714 22.084445
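Since every year has the same number of rows here (5), a reshape-based sketch avoids the apply entirely; note this assumes the groups are equally sized and the index is sorted by year:

import numpy as np
import pandas as pd

# one column per year: reshape the flat price array to (n_years, n_per_year) and transpose
years = df.index.unique()
out = pd.DataFrame(df['price'].to_numpy().reshape(len(years), -1).T, columns=years)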