I have a dataframe like the one below, which contains 4 columns. I want to convert each unique inventory item number (Z15, Z17, and so on) under the "inv" column into a new column, with the "info" value corresponding to each store and period. The transpose function does not work in this situation, and if I use pivot_table or groupby, I won't be able to get the values for "High", "Medium", and so on.
Note that in the real dataset the "info" column holds many different combinations of categorical and numerical values. The real dataset also has over 100 stores, over 400 inventory items, and more than 30 periods; this is a simplified version of the data to demonstrate the idea. Any suggestion or advice is greatly appreciated.
import pandas as pd
import numpy as np

inv = ['Z15','Z15','Z15','Z15','Z15','Z15','Z15','Z15','Z15','Z17','Z17','Z17','Z17','Z17','Z17','Z17']
store = ['store1','store1','store1','store2','store2','store2','store2','store2','store2','store3','store4','store5','store6','store7','store1','store2']
period = [2018,2019,2020,2015,2016,2017,2018,2019,2020,2022,2022,2022,2022,2022,2018,2019]
info = ['0.84773','0.8487','0.82254','0.75','0.65','0.432','0.546','0.777','0.1','High','High','Medium','Very Low','Low','High','Low']

df = pd.DataFrame({'inv': inv,
                   'store': store,
                   'period': period,
                   'info': info})
Data looks like this:
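    inv   store  period      info
0   Z15  store1    2018   0.84773
1   Z15  store1    2019    0.8487
2   Z15  store1    2020   0.82254
3   Z15  store2    2015      0.75
4   Z15  store2    2016      0.65
5   Z15  store2    2017     0.432
6   Z15  store2    2018     0.546
7   Z15  store2    2019     0.777
8   Z15  store2    2020       0.1
9   Z17  store3    2022      High
10  Z17  store4    2022      High
11  Z17  store5    2022    Medium
12  Z17  store6    2022  Very Low
13  Z17  store7    2022       Low
14  Z17  store1    2018      High
15  Z17  store2    2019       Low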
The desired output would have one column per unique inventory item (Z15, Z17), holding the info value for each store and period.
You're looking for pivot:
df.pivot(index=['store', 'period'], columns='inv', values='info').reset_index()
Output:
inv store period Z15 Z17
0 store1 2018 0.84773 High
1 store1 2019 0.8487 NaN
2 store1 2020 0.82254 NaN
3 store2 2015 0.75 NaN
4 store2 2016 0.65 NaN
5 store2 2017 0.432 NaN
6 store2 2018 0.546 NaN
7 store2 2019 0.777 Low
8 store2 2020 0.1 NaN
9 store3 2022 NaN High
10 store4 2022 NaN High
11 store5 2022 NaN Medium
12 store6 2022 NaN Very Low
13 store7 2022 NaN Low
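Note that passing a list of columns to index in DataFrame.pivot requires pandas 1.1 or newer. And if the real dataset ever contains duplicate (store, period, inv) rows, pivot raises a ValueError; a sketch of a fallback using pivot_table with aggfunc='first' (which, unlike the default numeric mean, also keeps string values such as 'High'):
# 'first' keeps one value per duplicate group, strings included
out = (df.pivot_table(index=['store', 'period'], columns='inv',
                      values='info', aggfunc='first')
         .reset_index())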
The CSV files used in this code are air quality sensor data files. They record particle concentrations each hour, in some cases over multiple years. There are about 100 CSV files in use. I have already figured out how to loop through each file and average a variable regardless of the year, but I am having trouble finding the averages for only the year 2020.
The goal of the code is to find the average number of hours each sensor is running in the year 2020.
# import modules
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import csv
# Read in table summarizing key variables about each Purple Air station around Pittsburgh
summary_table = pd.read_csv('Pittsburgh Hourly Averaged PM Data.csv')
# Subset the table to include only stations to be used in analysis
summary_table = summary_table[summary_table['Y/N'] == 'Y']
# Number of stations
print('Initial number of stations: ', len(summary_table))
num_hr = []
# Loop through all rows in the summary data table. For each row, find filename
# of the station corresponding to the row and read in that station data.
hours_utc = ['00','01','02','03','04','05','06','07','08','09','10','11','12','13','14','15','16','17','18','19','20','21','22','23']
for i in summary_table.index:
    station_data = pd.read_csv('Hourly_Averages/Excel_Data/' + summary_table.at[i,'Filename'] + '.csv')
    if station_data['year'] == 2020:
        # num_hr.append(station_data['PM2.5_CF1_ug/m3'].mean())
        station_data = station_data[station_data['hr'] == h]
print(num_hr)

with open('average_hr.csv', 'w') as f:
    writer = csv.writer(f)
    writer.writerow(num_hr)
An example of the CSVs used by the code (the full CSVs are thousands of rows long, and I don't know a way to put a full file in the question):
, Unnamed: 0, Unnamed: 0.1, Unnamed: 0.1.1, Unnamed: 0.1.1.1, created_at, PM1.0_CF1_ug/m3, PM2.5_CF1_ug/m3, PM10.0_CF1_ug/m3, UptimeMinutes, RSSI_dbm, Temperature_F, Humidity_%, PM2.5_ATM_ug/m3, hr, year, month, date, season
0 0 0 0 0 2020-12-23 17:00:00 UTC 0 0.04 0.12 7.5 -39.45 71 14.85 0.04 17 2020 12 12/23/20 Winter
1 1 1 1 1 2020-12-23 18:00:00 UTC 172.9 393.94 489.19 47.41 -36.93 76.34 14.72 261.9 18 2020 12 12/23/20 Winter
2 2 2 2 2 2020-12-23 19:00:00 UTC 77.59 144.78 161.67 101 -37.7 76.17 15.61 95.94 19 2020 12 12/23/20 Winter
3 3 3 3 3 2021-01-07 19:00:00 UTC 103.61 236.47 298.67 28.04 -60.39 76 14.61 157.63 19 2021 1 1/7/21 Winter
4 4 4 4 4 2021-01-07 20:00:00 UTC 11.18 21.12 23.04 64 -59.55 78.91 13.36 19.77 20 2021 1 1/7/21 Winter
5 5 5 5 5 2021-01-13 18:00:00 UTC 59.77 96.07 102.51 13.26 -49.52 73.78 29.48 65.32 18 2021 1 1/13/21 Winter
FYI, I am fairly new to coding and using CSV files; there may be a simple answer to my question, but after looking over many sites I am still stuck. I appreciate any help.
Imagine this is your table. I tried to give you the idea of how to operate on one column conditional on another column:
import pandas as pd

# Tell pandas to fetch only these columns
fields = ['Sensor_1', 'Sensor_2', 'Sensor_3', 'Year']
df = pd.read_excel('myData.xlsx', usecols=fields)

sensor1 = df.Sensor_1.mean()  # plain mean over the whole column, for comparison

for x in df:
    if x != 'Year':
        # Mask out the rows from other years, then divide by the
        # total number of rows (14 in this sample table)
        sensor = df[x].where(df['Year'] == 2020).sum() / 14
        print(sensor)
the result is :
10.785714285714286 # sensor_1 avg
4.357142857142857 # sensor_2 avg
2.892857142857143 # sensor_3 avg
For more detail: after reading the code, you may wonder whether there is a function that gives you the average directly. There is, and it is called mean(). But mean() skips the rows that were masked out by the condition (where(df['Year'] == 2020)), so it divides by the number of matching rows rather than the full row count; in my sample it would return sum()/10, because 4 of the 14 rows are in 2021, which is not the result wanted here.
This is all you need; just replace your attribute names in the code I gave you. I think it will help.
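Applied back to the original loop in the question, a minimal sketch might look like the following. It assumes each CSV row is one hourly average, so counting the 2020 rows gives the hours of operation, and it reuses the filenames and column names from the question:
import pandas as pd

summary_table = pd.read_csv('Pittsburgh Hourly Averaged PM Data.csv')
summary_table = summary_table[summary_table['Y/N'] == 'Y']

num_hr = []
for i in summary_table.index:
    station_data = pd.read_csv('Hourly_Averages/Excel_Data/'
                               + summary_table.at[i, 'Filename'] + '.csv')
    # Comparing a whole column with == yields a boolean mask, not a single
    # True/False, so filter with boolean indexing instead of an if statement
    station_2020 = station_data[station_data['year'] == 2020]
    # One row per hourly average, so the row count is the number of hours
    # the sensor reported data in 2020
    num_hr.append(len(station_2020))

# Average hours per station in 2020
print(sum(num_hr) / len(num_hr))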
I have the below data frame of items with expiry dates:
Item Expiry Date Stock
Voucher 1 1-Mar-2022 3
Voucher 2 31-Apr-2022 2
Voucher 3 1-Feb-2022 1
And I want to create an aging dashboard and map out my number of stock there:
Jan Feb Mar Apr
Voucher 1 3
Voucher 2 2
Voucher 3 1
Any ideas or guidance on how to do something like the above? I have searched a lot of resources but cannot find any. I'm very new to building dashboards. Thanks.
You can extract the month name (NB: your dates are invalid, 31 Apr is impossible) and pivot the table. If needed, reindex with a list of month names:
from calendar import month_abbr
cols = month_abbr[1:] # first item is empty string
(df.assign(month=df['Expiry Date'].str.extract(r'-(\D+)-'))
.pivot(index='Item', columns='month', values='Stock')
.reindex(columns=cols)
)
If you expect to have duplicated Items, use pivot_table with sum as the aggregation function instead (a sketch follows the output below).
Output:
month Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
Item
Voucher 1 NaN NaN 3.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN
Voucher 2 NaN NaN NaN 2.0 NaN NaN NaN NaN NaN NaN NaN NaN
Voucher 3 NaN 1.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
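For the duplicated-Items case mentioned above, a sketch of the pivot_table variant (same df and cols as before):
(df.assign(month=df['Expiry Date'].str.extract(r'-(\D+)-'))
   .pivot_table(index='Item', columns='month', values='Stock', aggfunc='sum')
   .reindex(columns=cols)
)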
You may try it like this:
import pandas as pd
# Item Expiry Date Stock
# Voucher 1 1-Mar-2022 3
# Voucher 2 31-Apr-2022 2
# Voucher 3 1-Feb-2022 1
data = {'Item': ['Voucher 1', 'Voucher 2', 'Voucher 3'],
'Expiry Date': ['1-Mar-2022', '31-Apr-2022', '1-Feb-2022'],
'Stock': [3, 2, 1]}
df = pd.DataFrame(data)
# Using the pandas apply method, get the month from each row (axis=1) and store it in a new column 'Month'
# https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.apply.html
df['Month'] = df.apply(lambda x: x['Expiry Date'].split('-')[1], axis=1)
# Using pandas pivot method, set 'Item' column as index,
# https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.pivot.html
# set unique values in 'Month' column as separate columns
# set values in 'Stock' column as values for respective month columns
# and using 'rename_axis' method, remove the row name 'Month'
# https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.rename_axis.html
new_df = df.pivot(index='Item', columns='Month', values='Stock').rename_axis(None, axis=1)
# Sort the month column names by first converting each to a pandas timestamp,
# then using that as the key in a sorted() call over all columns
new_df = new_df[sorted(new_df.columns, key=lambda x: pd.to_datetime(x, format='%b'))]
print(new_df)
And this is the output I am getting:
Feb Mar Apr
Item
Voucher 1 NaN 3.0 NaN
Voucher 2 NaN NaN 2.0
Voucher 3 1.0 NaN NaN
I am struggling with how to pivot a dataframe with multi-indexed columns. First I import the data from an .xlsx file, and from there I've tried to generate the desired DataFrame.
Note: I'm not allowed to embed images, which is why I'm using links.
import pandas as pd
import numpy as np
# Read Excel file
df = pd.read_excel("myFile.xlsx", header=[0])
Output: (linked screenshot)
If you want, you can see the file here: (linked file)
# Get the Month and Year of the DataFrame's period columns
month_year = df.iloc[:, 5:-1].columns
month_list = []
year_list = []
for x in range(len(month_year) - 1):
    if "Unnamed" not in month_year[x]:
        month_list.append(month_year[x].split()[0])
        year_list.append(month_year[x].split()[1])
# Read Excel file with headers 1 & 2
df = pd.read_excel("myFile.xlsx", header=[0, 1])
Output: (linked screenshot)
# Join both indexes excluding the ones with Unnamed
df.columns = [str(x[0] + " " + x[1]) if "Unnamed" not in x[1] else str(x[0]) for x in df.columns]
Output: (linked screenshot)
# Adding the Month and Year columns to the DataFrame
df['Month'] = month_list
df['Year'] = year_list
Output: (linked screenshot)
I want the output DataFrame to be like the following:
Desired output: (linked screenshot)
You should clean it up a bit, as I do not know how the Total column should be handled.
The code below reads the Excel file as a MultiIndex, with a bit of column-name modification, before stacking and extracting the Year and Month columns.
df = pd.read_excel("Downloads/myFile.xlsx", header=[0,1], index_col=[0, 1, 2])
df.index.names = ['Project', 'KPI', 'Metric']
df.columns = df.columns.delete(-1).union([('Total', 'Total')])
df.columns.names = ['Month_Year', 'Values']
(df
.stack(level = 0)
.rename_axis(columns=None)
.reset_index()
.assign(Year = lambda df: df.Month_Year.str.split(" ").str[-1],
Month = lambda df: df.Month_Year.str.split(" ").str[0])
.drop(columns='Month_Year')
)
Project KPI Metric Real Target Total Year Month
0 Project 1 Numeric Project 1 Metric 10.0 30.0 NaN 2019 April
1 Project 1 Numeric Project 1 Metric 651.0 51651.0 NaN 2019 February
2 Project 1 Numeric Project 1 Metric 200.0 215.0 NaN 2019 January
3 Project 1 Numeric Project 1 Metric 2.0 5.0 NaN 2019 March
4 Project 1 Numeric Project 1 Metric NaN NaN 9.0 Total Total
5 Project 2 General Project 2 Metric 20.0 10.0 NaN 2019 April
6 Project 2 General Project 2 Metric 500.0 100.0 NaN 2019 February
7 Project 2 General Project 2 Metric 749.0 12.0 NaN 2019 January
8 Project 2 General Project 2 Metric 1.0 7.0 NaN 2019 March
9 Project 2 General Project 2 Metric NaN NaN 7.0 Total Total
10 Project 3 Numeric Project 3 Metric 30.0 20.0 NaN 2019 April
11 Project 3 Numeric Project 3 Metric 200.0 55.0 NaN 2019 February
12 Project 3 Numeric Project 3 Metric 5583.0 36.0 NaN 2019 January
13 Project 3 Numeric Project 3 Metric 3.0 7.0 NaN 2019 March
14 Project 3 Numeric Project 3 Metric NaN NaN 4.0 Total Total
I have a year-wise dataframe in which each year has three parameters: year, type, and value. I'm trying to calculate the percentage of taken vs empty. For example, year 2014 has a total of 100, with 50 empty and 50 taken, so 50% empty and 50% taken, as shown in final_df.
df
year type value
0 2014 Total 100
1 2014 Empty 50
2 2014 Taken 50
3 2013 Total 2000
4 2013 Empty 100
5 2013 Taken 1900
6 2012 Total 50
7 2012 Empty 45
8 2012 Taken 5
Final df
year Empty Taken
0 2014 50 50
0 2013 ... ...
0 2012 ... ...
Should I shift cells up and do the percentage calculation, or is there another method?
You can use pivot_table:
new = df[df['type'] != 'Total']
res = (new.pivot_table(index='year', columns='type', values='value')
          .sort_values(by='year', ascending=False)
          .reset_index())
which gets you:
res
year Empty Taken
0 2014 50 50
1 2013 100 1900
2 2012 45 5
And then you can get the percentages for each column:
total = res['Empty'] + res['Taken']
for col in ['Empty', 'Taken']:
    res[col + '_perc'] = res[col] / total
year Empty Taken Empty_perc Taken_perc
2014 50 50 0.50 0.50
2013 100 1900 0.05 0.95
2012 45 5 0.90 0.10
As #sophods pointed out, you can use pivot_table to rearrange your dataframe. To add to that answer: I think you're after the percentage, so I suggest you keep the 'Total' record and then apply your calculation:
# Pivot your data
res = df.pivot_table(index='year', columns='type', values='value').reset_index()
# Calculate the Empty and Taken shares
res['Empty'] = res['Empty'] / res['Total']
res['Taken'] = res['Taken'] / res['Total']
# Final dataframe
res = res[['year', 'Empty', 'Taken']]
You can filter the records having Empty or Taken in type, then group by year and apply func. Inside func, set type as the index, get the required values, and calculate the percentages. x in func is a dataframe having the type and value columns, holding one group's data at a time.
def func(x):
    x = x.set_index('type')
    total = x['value'].sum()
    return [(x.loc['Empty', 'value'] / total) * 100,
            (x.loc['Taken', 'value'] / total) * 100]
temp = (df[df['type'].isin({'Empty', 'Taken'})]
.groupby('year')[['type', 'value']]
.apply(lambda x: func(x)))
temp
year
2012 [90.0, 10.0]
2013 [5.0, 95.0]
2014 [50.0, 50.0]
dtype: object
Convert the result into the required dataframe
pd.DataFrame(temp.values.tolist(), index=temp.index, columns=['Empty', 'Taken'])
Empty Taken
year
2012 90.0 10.0
2013 5.0 95.0
2014 50.0 50.0
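A compact sketch of the same normalization, without a custom function (assuming the df from the question):
piv = df[df['type'] != 'Total'].pivot_table(index='year', columns='type', values='value')
# Each row's total is Empty + Taken, so dividing row-wise and
# multiplying by 100 reproduces the percentages above
piv.div(piv.sum(axis=1), axis=0).mul(100)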
I have a column named volume in a pandas dataframe, and I want to look back at the previous 5 volumes from the current row and find the 40th percentile.
The volume data is as follows:
1200
3400
5000
2300
4502
3420
5670
5400
4320
7890
8790
For the first 5 values we don't have enough data to look back, but from the 6th value (3420) we should find the 40th percentile of the previous 5 volumes (1200, 3400, 5000, 2300, 4502), and keep doing this for the rest of the data, taking the 5 values before the current one.
I'm not sure I understand correctly, since there is no MCVE. However, it sounds like you want a rolling quantile (s below is your volume column):
>>> s.rolling(5).quantile(0.4)
0 NaN
1 NaN
2 NaN
3 NaN
4 2960.0
5 3412.0
6 4069.2
7 4069.2
8 4429.2
9 4968.0
10 5562.0
dtype: float64
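Note that the window above includes the current row. If the percentile should use only the 5 previous values, so that the first result lands on the 6th value (3420), shift the series by one before rolling:
>>> s.shift(1).rolling(5).quantile(0.4)
Index 5 then holds 2960.0, the 40th percentile of the first five volumes (1200, 3400, 5000, 2300, 4502).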