Repeat columns as rows in python? - python

Fruit January Shipments January Sales February Shipments February Sales
------------ ------------------- --------------- -------------------- ----------------
Apple 30 11 18 31
Banana 12 49 39 14
Pear 25 50 44 21
Kiwi 41 25 10 25
Strawberry 11 33 35 50
I'm trying to achieve the following result:
Fruit Month Shipments Sales
------------ ---------- ----------- -------
Apple January 30 11
Banana January 12 49
Pear January 25 50
Kiwi January 41 25
Strawberry January 11 33
Apple February 18 31
Banana February 39 14
Pear February 44 21
Kiwi February 10 25
Strawberry February 35 50
I've tried pandas.pivot and pandas.pivot_table and had no luck. I'm in the process of creating two dataframes (Fruit/Month/Shipments) and (Fruit/Month/Sales), and concatenating the two into one with a loop, but I was hoping for an easier way to do this.

One way is to turn the columns into a multi-level index and then use stack. Let's suppose your dataframe is called df. First set the column Fruit as the index, then define the multi-level columns:
df = df.set_index('Fruit')
# manual way to create the multiindex columns
#df.columns = pd.MultiIndex.from_product([['January','February'],
# ['Shipments','Sales']], names=['Month',None])
# more general way to create the multiindex columns, thanks to @Scott Boston
df.columns = df.columns.str.split(expand=True)
df.columns.names = ['Month',None]
your data looks like:
Month January February
Shipments Sales Shipments Sales
Fruit
Apple 30 11 18 31
Banana 12 49 39 14
Pear 25 50 44 21
Kiwi 41 25 10 25
Strawberry 11 33 35 50
Now you can use stack on level 0 and reset_index
df_output = df.stack(0).reset_index()
which gives
Fruit Month Sales Shipments
0 Apple February 31 18
1 Apple January 11 30
2 Banana February 14 39
3 Banana January 49 12
4 Pear February 21 44
5 Pear January 50 25
6 Kiwi February 25 10
7 Kiwi January 25 41
8 Strawberry February 50 35
9 Strawberry January 33 11
Finally, if you want a specific order for values in the column Month you can use pd.Categorical:
df_output['Month'] = pd.Categorical(df_output['Month'].tolist(), ordered=True,
categories=['January','February'])
setting that January is before February when sorting. Now, doing
df_output = df_output.sort_values(['Month'])
gives the result:
Fruit Month Sales Shipments
1 Apple January 11 30
3 Banana January 49 12
5 Pear January 50 25
7 Kiwi January 25 41
9 Strawberry January 33 11
0 Apple February 31 18
2 Banana February 14 39
4 Pear February 21 44
6 Kiwi February 25 10
8 Strawberry February 50 35
I see it's not exactly the expected output (the order in the Fruit column and the order of the columns differ), but both can easily be changed if needed.
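Putting the steps above together, here is a minimal self-contained sketch. The names wide, stacked and long_df are mine, and the final column selection and stable sort are my additions to match the layout asked for in the question; they are not part of the answer above.

```python
import pandas as pd

wide = pd.DataFrame({
    "Fruit": ["Apple", "Banana", "Pear", "Kiwi", "Strawberry"],
    "January Shipments": [30, 12, 25, 41, 11],
    "January Sales": [11, 49, 50, 25, 33],
    "February Shipments": [18, 39, 44, 10, 35],
    "February Sales": [31, 14, 21, 25, 50],
})

# Split "January Shipments" into the two column levels ("January", "Shipments").
stacked = wide.set_index("Fruit")
stacked.columns = stacked.columns.str.split(expand=True)
stacked.columns.names = ["Month", None]

# Move the Month level from the columns into the rows.
long_df = stacked.stack(0).reset_index()

# Make January sort before February. mergesort is a stable sort, so the
# fruits keep their original relative order within each month.
long_df["Month"] = pd.Categorical(
    long_df["Month"], categories=["January", "February"], ordered=True
)
long_df = long_df.sort_values("Month", kind="mergesort").reset_index(drop=True)

# Pick the column order explicitly instead of relying on what stack returns.
long_df = long_df[["Fruit", "Month", "Shipments", "Sales"]]
print(long_df)
```

Selecting the columns by name at the end avoids depending on whether stack returns them sorted alphabetically or in order of appearance.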

Here is how to use pd.wide_to_long, as @user3483203 suggests.
df1 = df.set_index('Fruit')
# First rename the columns: split the headers into MultiIndex levels, then swap
# the levels and rejoin them, so 'January Shipments' becomes 'Shipments_January'.
df1.columns = df1.columns.str.split(expand=True)
df1.columns = df1.columns.map('{0[1]}_{0[0]}'.format)
#Reset index and use pd.wide_to_long:
df1 = df1.reset_index()
df_out = pd.wide_to_long(df1, ['Shipments','Sales'], 'Fruit', 'Month', '_', r'\w+')\
.reset_index()
print(df_out)
Output:
Fruit Month Shipments Sales
0 Apple January 30.0 11.0
1 Banana January 12.0 49.0
2 Pear January 25.0 50.0
3 Kiwi January 41.0 25.0
4 Strawberry January 11.0 33.0
5 Apple February 18.0 31.0
6 Banana February 39.0 14.0
7 Pear February 44.0 21.0
8 Kiwi February 10.0 25.0
9 Strawberry February 35.0 50.0
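For a version that runs without an existing df, here is a small sketch of the same pd.wide_to_long idea. The sample frame and the helper month_last are my own illustration, and I pass the arguments by keyword for readability; treat it as one possible arrangement rather than the answer's exact code.

```python
import pandas as pd

wide = pd.DataFrame({
    "Fruit": ["Apple", "Banana", "Pear", "Kiwi", "Strawberry"],
    "January Shipments": [30, 12, 25, 41, 11],
    "January Sales": [11, 49, 50, 25, 33],
    "February Shipments": [18, 39, 44, 10, 35],
    "February Sales": [31, 14, 21, 25, 50],
})


def month_last(name):
    """Turn 'January Shipments' into 'Shipments_January'; leave 'Fruit' alone."""
    parts = name.split()
    return "_".join(reversed(parts)) if len(parts) == 2 else name


renamed = wide.rename(columns=month_last)

# stubnames are the prefixes to keep as columns; whatever follows the
# separator (matched by the suffix regex) becomes the value of column j.
long_df = pd.wide_to_long(
    renamed,
    stubnames=["Shipments", "Sales"],
    i="Fruit",
    j="Month",
    sep="_",
    suffix=r"\w+",
).reset_index()
print(long_df)
```

The suffix pattern matters here: the default only matches digits, so month names would otherwise not be picked up.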

Related

Pandas Python - How to create new columns with MultiIndex from pivot table

I have created a pivot table with 2 different types of values i) Number of apples from 2017-2020, ii) Number of people from 2017-2020. I want to create additional columns to calculate iii) Apples per person from 2017-2020. How can I do so?
Current code for pivot table:
tdf = df.pivot_table(index="States",
columns="Year",
values=["Number of Apples","Number of People"],
aggfunc= lambda x: len(x.unique()),
margins=True)
tdf
Here is my current pivot table:
Number of Apples Number of People
2017 2018 2019 2020 2017 2018 2019 2020
California 10 18 20 25 2 3 4 5
West Virginia 8 35 25 12 2 5 5 4
...
I want my pivot table to look like this, where I add additional columns to divide Number of Apples by Number of People.
Number of Apples Number of People Number of Apples per Person
2017 2018 2019 2020 2017 2018 2019 2020 2017 2018 2019 2020
California 10 18 20 25 2 3 4 5 5 6 5 5
West Virginia 8 35 25 12 2 5 5 4 4 7 5 3
I've tried a few things, such as:
Creating a new column via assigning new column names, but does not work with multiple column index tdf["Number of Apples per Person"][2017] = tdf["Number of Apples"][2017] / tdf["Number of People"][2017]
Tried the other assignment method tdf.assign(tdf["Number of Apples per Person"][2017] = tdf["Enrollment ID"][2017] / tdf["Student ID"][2017]); got this error SyntaxError: expression cannot contain assignment, perhaps you meant "=="?
Appreciate any help! Thanks
What you can do here is stack(), do your thing, and then unstack():
s = df.stack()
s['Number of Apples per Person'] = s['Number of Apples'] / s['Number of People']
df = s.unstack()
Output:
>>> df
Number of Apples Number of People Number of Apples per Person
2017 2018 2019 2020 2017 2018 2019 2020 2017 2018 2019 2020
California 10 18 20 25 2 3 4 5 5.0 6.0 5.0 5.0
West Virginia 8 35 25 12 2 5 5 4 4.0 7.0 5.0 3.0
One-liner:
df = df.stack().pipe(lambda x: x.assign(**{'Number of Apples per Person': x['Number of Apples'] / x['Number of People']})).unstack()
Given
df
Number of Apples Number of People
2017 2018 2019 2020 2017 2018 2019 2020
California 10 18 20 25 2 3 4 5
West Virginia 8 35 25 12 2 5 5 4
You can index on the first level to get sub-frames and then divide. The division will be auto-aligned on the columns.
df['Number of Apples'] / df['Number of People']
2017 2018 2019 2020
California 5.0 6.0 5.0 5.0
West Virginia 4.0 7.0 5.0 3.0
Append this back to your DataFrame:
pd.concat([df, pd.concat([df['Number of Apples'] / df['Number of People']], keys=['Result'], axis=1)], axis=1)
Number of Apples Number of People Result
2017 2018 2019 2020 2017 2018 2019 2020 2017 2018 2019 2020
California 10 18 20 25 2 3 4 5 5.0 6.0 5.0 5.0
West Virginia 8 35 25 12 2 5 5 4 4.0 7.0 5.0 3.0
This is fast since it is completely vectorized.
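As a self-contained sketch of the concat approach, the pivot table from the question can be rebuilt by hand and extended like this. The variable names table, ratio and result are mine, and the numbers are simply the ones shown in the question.

```python
import pandas as pd

# Rebuild the two-level column layout shown in the question.
columns = pd.MultiIndex.from_product(
    [["Number of Apples", "Number of People"], [2017, 2018, 2019, 2020]]
)
table = pd.DataFrame(
    [[10, 18, 20, 25, 2, 3, 4, 5],
     [8, 35, 25, 12, 2, 5, 5, 4]],
    index=["California", "West Virginia"],
    columns=columns,
)

# Selecting a first-level key returns a sub-frame whose columns are the years,
# so the division lines up year by year.
ratio = table["Number of Apples"] / table["Number of People"]

# Wrapping the ratio in a one-element concat with keys=[...] adds the
# missing top column level before joining it to the original table.
result = pd.concat(
    [table, pd.concat([ratio], keys=["Number of Apples per Person"], axis=1)],
    axis=1,
)
print(result)
```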

User defined expanding window in Pandas

Is there a way to modify the expanding windows in Pandas? For example, consider a random DF:
df = pd.DataFrame(np.random.randint(0,60,size=(10,3)),columns=["a","b","c"])
df["d1"]=["Apple","Mango","Apple","Apple","Mango","Mango","Apple","Mango","Apple","Apple"]
df["d2"]=["Orange","lemon","lemon","Orange","lemon","Orange","lemon","Orange","lemon","Orange"]
df["date"] = ["2002-01-01","2002-01-01","2002-01-01","2002-01-01","2002-01-01","2002-01-01","2002-01-01","2002-02-01","2002-02-01","2002-02-01"]
df["date"] = pd.to_datetime(df["date"])
df
a b c d1 d2 date
0 16 25 37 Apple Orange 2002-01-01
1 24 41 32 Mango lemon 2002-01-01
2 41 20 53 Apple lemon 2002-01-01
3 4 28 47 Apple Orange 2002-01-01
4 7 29 10 Mango lemon 2002-01-01
5 6 54 15 Mango Orange 2002-01-01
6 26 54 35 Apple lemon 2002-01-01
7 31 4 12 Mango Orange 2002-02-01
8 33 36 54 Apple lemon 2002-02-01
9 50 22 48 Apple Orange 2002-02-01
When I do df.expanding(1).apply(), it applies the function to a window that grows by one row at a time. Is it possible to pass the date column to the expanding function so that, instead of adding one row per window, it accumulates groups of rows based on date?
Existing expanding window:
window 1: 0 16 25 37 Apple Orange 2002-01-01
window 2: 0 16 25 37 Apple Orange 2002-01-01
1 24 41 32 Mango lemon 2002-01-01
window 3: 0 16 25 37 Apple Orange 2002-01-01
1 24 41 32 Mango lemon 2002-01-01
2 41 20 53 Apple lemon 2002-01-01
Expected expanding window:
window 1 (all rows for date "2002-01-01"):
0 16 25 37 Apple Orange 2002-01-01
1 24 41 32 Mango lemon 2002-01-01
2 41 20 53 Apple lemon 2002-01-01
3 4 28 47 Apple Orange 2002-01-01
4 7 29 10 Mango lemon 2002-01-01
5 6 54 15 Mango Orange 2002-01-01
6 26 54 35 Apple lemon 2002-01-01
window 2 (all rows for date "2002-01-01" and "2002-02-01" ):
0 16 25 37 Apple Orange 2002-01-01
1 24 41 32 Mango lemon 2002-01-01
2 41 20 53 Apple lemon 2002-01-01
3 4 28 47 Apple Orange 2002-01-01
4 7 29 10 Mango lemon 2002-01-01
5 6 54 15 Mango Orange 2002-01-01
6 26 54 35 Apple lemon 2002-01-01
7 31 4 12 Mango Orange 2002-02-01
8 33 36 54 Apple lemon 2002-02-01
9 50 22 48 Apple Orange 2002-02-01
This assumes df.date is sorted.
I suppose there is a more efficient way to calculate end. If you find a better solution, please let me know.
See also: another usage, Pandas custom window rolling, possible problems.
import pandas as pd
import numpy as np
from pandas.api.indexers import BaseIndexer
from typing import Optional, Tuple

class CustomIndexer(BaseIndexer):
    def get_window_bounds(self,
                          num_values: int = 0,
                          min_periods: Optional[int] = None,
                          center: Optional[bool] = None,
                          closed: Optional[str] = None
                          ) -> Tuple[np.ndarray, np.ndarray]:
        # Record the position where each new date starts; each such
        # position is the (exclusive) end of one window.
        lst = []
        prev_date = self.custom_name[0]
        for i, date in enumerate(self.custom_name[1:], 1):
            if prev_date != date:
                lst.append(i)
                prev_date = date
        # The last window runs to the end of the data.
        lst.append(num_values)
        end = np.array(lst)
        # Every window starts at row 0, which is what makes it expanding.
        start = np.zeros_like(end)
        return start, end

indexer = CustomIndexer(custom_name=df.date)
for window in df.rolling(indexer):
    print(window)
Outputs:
a b c d1 d2 date
0 17 27 35 Apple Orange 2002-01-01
1 39 10 57 Mango lemon 2002-01-01
2 8 31 12 Apple lemon 2002-01-01
3 20 17 23 Apple Orange 2002-01-01
4 11 26 41 Mango lemon 2002-01-01
5 52 57 9 Mango Orange 2002-01-01
6 40 15 33 Apple lemon 2002-01-01
a b c d1 d2 date
0 17 27 35 Apple Orange 2002-01-01
1 39 10 57 Mango lemon 2002-01-01
2 8 31 12 Apple lemon 2002-01-01
3 20 17 23 Apple Orange 2002-01-01
4 11 26 41 Mango lemon 2002-01-01
5 52 57 9 Mango Orange 2002-01-01
6 40 15 33 Apple lemon 2002-01-01
7 17 17 37 Mango Orange 2002-02-01
8 5 53 0 Apple lemon 2002-02-01
9 16 10 24 Apple Orange 2002-02-01
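If only the windows themselves are needed, a plain loop over the unique dates is a simpler sketch that does not depend on the custom indexer interface, which may behave differently across pandas releases (an assumption worth checking against your installed version). The helper expanding_by_date and the small sample frame are hypothetical names of mine, not part of pandas or of the answer above.

```python
import pandas as pd

df = pd.DataFrame({
    "a": range(10),
    "date": pd.to_datetime(["2002-01-01"] * 7 + ["2002-02-01"] * 3),
})


def expanding_by_date(frame, date_col="date"):
    """Yield (date, rows up to and including that date) for each unique date."""
    for current in frame[date_col].drop_duplicates().sort_values():
        # The mask compares values, so the frame does not need to be sorted.
        yield current, frame[frame[date_col] <= current]


windows = list(expanding_by_date(df))
for current, window in windows:
    print(current.date(), len(window), window["a"].sum())
```

Each window is an ordinary DataFrame, so any function can be applied to it. The cost is that every window is rebuilt from scratch, which may matter for data with many distinct dates.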

How to apply values in column based on condition in row values in python

I have a dataframe that looks like this:
df
Name year week date
0 Adam 2016 16 2016-04-24
1 Mary 2016 17 2016-05-01
2 Jane 2016 20 2016-05-22
3 Joe 2016 17 2016-05-01
4 Arthur 2017 44 2017-11-05
5 Liz 2017 41 2017-10-15
6 Janice 2016 47 2016-11-27
I want to create a column df['season'] that assigns a season, MAM or OND, depending on the value in week.
The result should look like this:
df_final
Name year week date season
0 Adam 2016 16 2016-04-24 MAM
1 Mary 2016 17 2016-05-01 MAM
2 Jane 2016 20 2016-05-22 MAM
3 Joe 2016 17 2016-05-01 MAM
4 Arthur 2017 44 2017-11-05 OND
5 Liz 2017 41 2017-10-15 OND
6 Janice 2016 47 2016-11-27 OND
In essence, values of week that are below 40 should be paired with MAM and values above 40 should be OND.
So far I have this:
condition =df.week < 40
df['season'] = df[condition][[i for i in df.columns.values if i not in ['a']]].apply(lambda x: 'OND')
But it is clunky and does not produce the final response.
Thank you.
Use numpy.where:
import numpy as np
condition = df.week < 40
df['season'] = np.where(condition, 'MAM', 'OND')
print(df)
Name year week date season
0 Adam 2016 16 2016-04-24 MAM
1 Mary 2016 17 2016-05-01 MAM
2 Jane 2016 20 2016-05-22 MAM
3 Joe 2016 17 2016-05-01 MAM
4 Arthur 2017 44 2017-11-05 OND
5 Liz 2017 41 2017-10-15 OND
6 Janice 2016 47 2016-11-27 OND
EDIT:
To convert strings to integers, use astype:
condition = df.week.astype(int) < 40
Or convert column:
df.week = df.week.astype(int)
condition = df.week < 40
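A compact, runnable sketch of the whole answer, assuming the week values arrive as strings (the sample frame below is my own cut-down copy of the question's data):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Name": ["Adam", "Mary", "Jane", "Joe", "Arthur", "Liz", "Janice"],
    "week": ["16", "17", "20", "17", "44", "41", "47"],  # strings on purpose
})

# Convert to integers first; comparing strings with a number would fail.
weeks = df["week"].astype(int)

# np.where picks 'MAM' where the condition holds and 'OND' everywhere else.
df["season"] = np.where(weeks < 40, "MAM", "OND")
print(df)
```

Note that with this condition a week value of exactly 40 ends up as OND; the question does not say which season it should belong to.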

Appending values from one column to another in pandas

Hi all I'm doing data cleanup, and I'm facing a bit of an obstacle. I have multiple dataframes that look like this:
df1
WL WM WH WP
0 NaN NaN Sea NaN
1 low medium high premium
2 26 26 15 14
3 32 32 18 29
4 41 41 19 42
5 apple dog fur napkins
6 orange cat tesla earphone
7 mango rat tobias controller
I am trying to combine the WL and WM column such that the outcome looks like this:
df1
WM WH WP
0 NaN NaN NaN
1 medium high premium
2 26 15 14
3 32 18 29
4 41 19 42
5 dog fur napkins
6 cat tesla earphone
7 rat tobias controller
8 apple
9 orange
10 mango
My initial attempt was to slice the WL column and append that to the WM column, however that has not yielded a correct output.
for num in range(len(df)):
    low = df.loc[:, df.isin(['WarrantyLow']).any()]
    low = low[5:]
    medium = df.loc[:, df.isin(['WarrantyMedium']).any()]
    medium.append(low)
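1. Use append to combine WM with the tail of WL, then call reset_index to reset the index for the next concatenation.
2. pd.concat(..., ignore_index=True, ...) combines the result of (1) with the rest of the dataframe. Because it is called with axis=1, ignore_index discards the column labels, which is why the columns are renamed at the end.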
In [400]: pd.concat([df1['WM'].append(df1['WL'].iloc[5:]).reset_index(drop=True), \
df1.iloc[:, 2:]], ignore_index=True, axis=1).fillna('')\
.rename(columns=dict(enumerate(['WM', 'WH', 'WP'])))
Out[400]:
WM WH WP
0 Sea
1 medium high premium
2 26 15 14
3 32 18 29
4 41 19 42
5 dog fur napkins
6 cat tesla earphone
7 rat tobias controller
8 apple
9 orange
10 mango
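Series.append was removed in pandas 2.0, so on recent versions the same idea has to be written with pd.concat throughout. The following is a sketch under that assumption; the frame is rebuilt by hand from the question, and wm and result are names of my choosing.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({
    "WL": [np.nan, "low", 26, 32, 41, "apple", "orange", "mango"],
    "WM": [np.nan, "medium", 26, 32, 41, "dog", "cat", "rat"],
    "WH": ["Sea", "high", 15, 18, 19, "fur", "tesla", "tobias"],
    "WP": [np.nan, "premium", 14, 29, 42, "napkins", "earphone", "controller"],
})

# Put the tail of WL (rows 5 onward) underneath WM and renumber the rows 0..10.
wm = pd.concat([df1["WM"], df1["WL"].iloc[5:]], ignore_index=True).rename("WM")

# Joining on the longer index leaves WH and WP empty in the three new rows;
# fillna('') turns those gaps (and the original NaNs) into blank strings.
result = pd.concat([wm, df1[["WH", "WP"]]], axis=1).fillna("")
print(result)
```

Naming the combined series before the second concat keeps the column labels, so no rename step is needed afterwards.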

Adding columns of different length into pandas dataframe

I have a dataframe detailing money awarded to people over several years:
Name -- Money -- Year
Paul 57.00 2012
Susan 67.00 2012
Gary 54.00 2011
Paul 77.00 2011
Andrea 20.00 2011
Albert 23.00 2011
Hal 26.00 2010
Paul 23.00 2010
From this dataframe, I want to construct a dataframe that details all the money awarded in a single year, for making a boxplot:
2012 -- 2011 -- 2010
57.00 54.00 26.00
67.00 77.00 23.00
20.00
23.00
So you see this results in columns of different length. When I try to do this using pandas, I get the error 'ValueError: Length of values does not match length of index'. I assume this is because I can't add varying length columns to a dataframe.
Can anyone offer some advice on how to proceed? Perhaps I'm approaching this incorrectly? Thanks for any help!
I'd do this in a two-step process: first add a column corresponding to the index in each year using cumcount, and then pivot so that the new column is the index, the years become the columns, and the money column becomes the values:
df["yindex"] = df.groupby("Year").cumcount()
new_df = df.pivot(index="yindex", columns="Year", values="Money")
For example:
>>> df = pd.read_csv("money.txt", sep=r"\s+")
>>> df
Name Money Year
0 Paul 57 2012
1 Susan 67 2012
2 Gary 54 2011
3 Paul 77 2011
4 Andrea 20 2011
5 Albert 23 2011
6 Hal 26 2010
7 Paul 23 2010
>>> df["yindex"] = df.groupby("Year").cumcount()
>>> df
Name Money Year yindex
0 Paul 57 2012 0
1 Susan 67 2012 1
2 Gary 54 2011 0
3 Paul 77 2011 1
4 Andrea 20 2011 2
5 Albert 23 2011 3
6 Hal 26 2010 0
7 Paul 23 2010 1
>>> df.pivot(index="yindex", columns="Year", values="Money")
Year 2010 2011 2012
yindex
0 26 54 57
1 23 77 67
2 NaN 20 NaN
3 NaN 23 NaN
After which you could get rid of the NaNs if you like, but it depends on whether you want to distinguish between cases like "knowing the value is 0" and "not knowing what the value is":
>>> df.pivot(index="yindex", columns="Year", values="Money").fillna(0)
Year 2010 2011 2012
yindex
0 26 54 57
1 23 77 67
2 0 20 0
3 0 23 0
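The same two steps can be tried without money.txt by building the frame inline. This is only a sketch: by_year is a name I picked, and the final reordering of the year columns is my addition to match the layout drawn in the question.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Paul", "Susan", "Gary", "Paul", "Andrea", "Albert", "Hal", "Paul"],
    "Money": [57.0, 67.0, 54.0, 77.0, 20.0, 23.0, 26.0, 23.0],
    "Year": [2012, 2012, 2011, 2011, 2011, 2011, 2010, 2010],
})

# cumcount numbers the rows within each year (0, 1, 2, ...); that number
# becomes the row label, so years with fewer awards are padded with NaN.
by_year = (
    df.assign(yindex=df.groupby("Year").cumcount())
      .pivot(index="yindex", columns="Year", values="Money")
)

# Newest year first, as in the table sketched in the question.
by_year = by_year[sorted(by_year.columns, reverse=True)]
print(by_year)
```

Leaving the NaNs in place, rather than filling them with 0, keeps the padded cells from being mistaken for real awards of zero.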
