import pandas as pd

exdf = pd.DataFrame({'Employee name': ['Alex', 'Mike'],
                     '2014.1': [5, 2], '2014.2': [3, 4], '2014.3': [3, 6],
                     '2014.4': [4, 3], '2015.1': [7, 5], '2015.2': [5, 4]})
exdf
Employee name 2014.1 2014.2 2014.3 2014.4 2015.1 2015.2
0 Alex 5 3 3 4 7 5
1 Mike 2 4 6 3 5 4
Suppose the above dataframe has many such rows, and columns holding the output of each employee for each quarter.
I want to create a new dataframe with these columns:
newdf = pd.DataFrame(columns=['Employee name', 'Year', 'Quarter', 'Output'])
The new dataframe will have n×m rows, where n is the number of rows and m the number of quarter columns in the original dataframe.
What I have tried is filling every row and column entry using a nested for loop.
But I'm sure there is a more efficient method.
for i in range(exdf.shape[0]):
    for j in range(exdf.shape[1]):
        newdf.iloc[?] = exdf.iloc[?]
Use DataFrame.melt with Series.str.split, then reorder the columns:
df = exdf.melt('Employee name', var_name='Year', value_name='Output')
df[['Year', 'Quarter']] = df['Year'].str.split('.', expand=True)
df = df[['Employee name','Year','Quarter','Output']]
print (df)
Employee name Year Quarter Output
0 Alex 2014 1 5
1 Mike 2014 1 2
2 Alex 2014 2 3
3 Mike 2014 2 4
4 Alex 2014 3 3
5 Mike 2014 3 6
6 Alex 2014 4 4
7 Mike 2014 4 3
8 Alex 2015 1 7
9 Mike 2015 1 5
10 Alex 2015 2 5
11 Mike 2015 2 4
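Note that after str.split, Year and Quarter are strings. If numeric dtypes are needed downstream, a cast fixes that; a minimal, optional follow-up:
# optional: convert the split string columns to integers
df[['Year', 'Quarter']] = df[['Year', 'Quarter']].astype(int)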
Convert the columns to a MultiIndex using str.split, then stack the columns to get the final output:
# set 'Employee name' as the index
exdf = exdf.set_index('Employee name')

# convert the columns to a MultiIndex
exdf.columns = exdf.columns.str.split('.', expand=True)
exdf.columns = exdf.columns.set_names(['year', 'quarter'])

# stack the data and name the value column
(exdf
 .stack(['year', 'quarter'])
 .reset_index(name='output')
)
Employee name year quarter output
0 Alex 2014 1 5.0
1 Alex 2014 2 3.0
2 Alex 2014 3 3.0
3 Alex 2014 4 4.0
4 Alex 2015 1 7.0
5 Alex 2015 2 5.0
6 Mike 2014 1 2.0
7 Mike 2014 2 4.0
8 Mike 2014 3 6.0
9 Mike 2014 4 3.0
10 Mike 2015 1 5.0
11 Mike 2015 2 4.0
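The output column came back as float because stack can upcast the values. If integers are wanted, the same pipeline can end with a cast; a sketch:
result = (exdf
          .stack(['year', 'quarter'])
          .reset_index(name='output')
          .astype({'output': int}))  # restore the integer dtype lost in the stack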
With pivot_longer the reshaping can be abstracted to a simpler form:
# pip install pyjanitor
import janitor
import pandas as pd
exdf.pivot_longer(
    index="Employee name",
    names_to=("Year", "Quarter"),
    names_sep=".",
    values_to="Output"
)
Employee name Year Quarter Output
0 Alex 2014 1 5
1 Mike 2014 1 2
2 Alex 2014 2 3
3 Mike 2014 2 4
4 Alex 2014 3 3
5 Mike 2014 3 6
6 Alex 2014 4 4
7 Mike 2014 4 3
8 Alex 2015 1 7
9 Mike 2015 1 5
10 Alex 2015 2 5
11 Mike 2015 2 4
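pivot_longer can also take a regex instead of a separator; a sketch of an equivalent call using names_pattern (one capture group per name in names_to), useful if the column labels are less uniform:
exdf.pivot_longer(
    index="Employee name",
    names_to=("Year", "Quarter"),
    names_pattern=r"(\d+)\.(\d+)",  # one group per target name
    values_to="Output"
)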
I have the following pd.DataFrame:
Index ID Name Date Days
1 1 Josh 5-1-20 10
2 1 Josh 9-1-20 10
3 1 Josh 19-1-20 6
4 2 Mike 1-1-20 10
5 3 George 1-4-20 10
6 4 Rose 1-2-20 10
7 4 Rose 11-5-20 5
8 5 Mark 1-9-20 10
9 6 Joe 1-4-21 10
10 7 Jill 1-1-21 10
I need to make a DataFrame where the ID is not repeated. For that, I want to create new Date and Days columns, sized for the ID with the most repetitions (3 in this case).
The desired output is the following DataFrame:
Index ID Name Date 1 Date 2 Date 3 Days 1 Days 2 Days 3
1 1 Josh 5-1-20 9-1-20 19-1-20 10 10 6
2 2 Mike 1-1-20 10
3 3 George 1-4-20 10
4 4 Rose 1-2-20 11-5-20 10 5
5 5 Mark 1-9-20 10
6 6 Joe 1-4-21 10
7 7 Jill 1-1-21 10
Try:
# number the repeats within each ID, then spread them into columns
df_out = df.set_index(['ID', 'Name', df.groupby('ID').cumcount() + 1]).unstack()
# flatten the resulting MultiIndex columns into single strings
df_out.columns = [f'{i} {j}' for i, j in df_out.columns]
df_out.fillna('').reset_index()
Output:
ID Name Index 1 Index 2 Index 3 Date 1 Date 2 Date 3 Days 1 Days 2 Days 3
0 1 Josh 1.0 2.0 3.0 5-1-20 9-1-20 19-1-20 10.0 10.0 6.0
1 2 Mike 4.0 1-1-20 10.0
2 3 George 5.0 1-4-20 10.0
3 4 Rose 6.0 7.0 1-2-20 11-5-20 10.0 5.0
4 5 Mark 8.0 1-9-20 10.0
5 6 Joe 9.0 1-4-21 10.0
6 7 Jill 10.0 1-1-21 10.0
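The Index 1 to Index 3 columns appear because the original Index column is reshaped along with Date and Days. If they are unwanted, drop that column first; the same approach, as a sketch:
df_out = (df.drop(columns='Index')
            .set_index(['ID', 'Name', df.groupby('ID').cumcount() + 1])
            .unstack())
df_out.columns = [f'{i} {j}' for i, j in df_out.columns]
df_out.fillna('').reset_index()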
Here is a solution using pivot with a helper column:
df2 = (df
       .assign(col=df.groupby('ID').cumcount().add(1).astype(str))  # helper: repeat number per ID
       .pivot(index=['ID', 'Name'], columns='col', values=['Date', 'Days'])
       .fillna('')
)
df2.columns = df2.columns.map('_'.join)
df2.reset_index()
Output:
ID Name Date_1 Date_2 Date_3 Days_1 Days_2 Days_3
0 1 Josh 5-1-20 9-1-20 19-1-20 10 10 6
1 2 Mike 1-1-20 10
2 3 George 1-4-20 10
3 4 Rose 1-2-20 11-5-20 10 5
4 5 Mark 1-9-20 10
5 6 Joe 1-4-21 10
6 7 Jill 1-1-21 10
I have a problem with groupby in pandas. To begin, I have this DataFrame:
import pandas as pd
data = {'Code_Name':[1,2,3,4,1,2,3,4] ,'Name':['Tom', 'Nicko', 'Krish','Jack kr','Tom', 'Nick', 'Krishx', 'Jacks'],'Cat':['A', 'B','C','D','A', 'B','C','D'], 'T':[9, 7, 14, 12,4, 3, 12, 11]}
# Create DataFrame
df = pd.DataFrame(data)
df
I have this:
Code_Name Name Cat T
0 1 Tom A 9
1 2 Nicko B 7
2 3 Krish C 14
3 4 Jack kr D 12
4 1 Tom A 4
5 2 Nick B 3
6 3 Krishx C 12
7 4 Jacks D 11
Now, when I group by:
df.groupby(['Code_Name','Name','Cat'],as_index=False)['T'].sum()
I get this:
Code_Name Name Cat T
0 1 Tom A 13
1 2 Nick B 3
2 2 Nicko B 7
3 3 Krish C 14
4 3 Krishx C 12
5 4 Jack kr D 12
6 4 Jacks D 11
But I need this result:
Code_Name Name Cat T
0 1 Tom A 13
1 2 Nick B 10
2 3 Krish C 26
3 4 Jack D 23
I don't care about Name; Code_Name is the only important thing for me, along with the sum of T.
Thanks!
There are two ways. To avoid losing the other columns, add an aggregation function for each of them: first, last, or ', '.join for string columns, and functions like sum or mean for numeric columns:
df = df.groupby('Code_Name',as_index=False).agg({'Name':'first', 'Cat':'first', 'T':'sum'})
print (df)
Code_Name Name Cat T
0 1 Tom A 13
1 2 Nicko B 10
2 3 Krish C 26
3 4 Jack kr D 23
Or, if some values are duplicated per group, like the Cat values here, add those columns to the groupby; only the column order changes in the output:
df = df.groupby(['Code_Name','Cat'],as_index=False).agg({'Name':'first', 'T':'sum'})
print (df)
Code_Name Cat Name T
0 1 A Tom 13
1 2 B Nicko 10
2 3 C Krish 26
3 4 D Jack kr 23
If you don't care about the other variables, then just group by the column of interest:
gb = df.groupby(['Code_Name'],as_index=False)['T'].sum()
print(gb)
Code_Name T
0 1 13
1 2 10
2 3 26
3 4 23
Now to get your output, you can take the last value of Name for each group:
gb = df.groupby(['Code_Name'],as_index=False).agg({'Name': 'last', 'Cat': 'first', 'T': 'sum'})
print(gb)
Code_Name Name Cat T
0 1 Tom A 13
1 2 Nick B 10
2 3 Krishx C 26
3 4 Jacks D 23
Perhaps you can try:
(df.groupby("Code_Name", as_index=False)
.agg({"Name":"first", "Cat":"first", "T":"sum"}))
See https://datascience.stackexchange.com/questions/53405/pandas-dataframe-groupby-and-then-sum-multi-columns-sperately for the original answer.
I have a dataframe:
df = pd.DataFrame([[2, 4, 7, 8, 1, 3, 2013], [9, 2, 4, 5, 5, 6, 2014]], columns=['Amy', 'Bob', 'Carl', 'Chris', 'Ben', 'Other', 'Year'])
Amy Bob Carl Chris Ben Other Year
0 2 4 7 8 1 3 2013
1 9 2 4 5 5 6 2014
And a dictionary:
d = {'A': ['Amy'], 'B': ['Bob', 'Ben'], 'C': ['Carl', 'Chris']}
I would like to reshape my dataframe to look like this:
Group Name Year Value
0 A Amy 2013 2
1 A Amy 2014 9
2 B Bob 2013 4
3 B Bob 2014 2
4 B Ben 2013 1
5 B Ben 2014 5
6 C Carl 2013 7
7 C Carl 2014 4
8 C Chris 2013 8
9 C Chris 2014 5
10 Other 2013 3
11 Other 2014 6
Note that Other doesn't have any values in the Name column and the order of the rows does not matter. I think I should be using the melt function but the examples that I've come across aren't too clear.
melt gets you part way there.
In [29]: m = pd.melt(df, id_vars=['Year'], var_name='Name')
This has everything except Group. To get that, we need to reshape d a bit as well.
In [30]: d2 = {}

In [31]: for k, v in d.items():
   ....:     for item in v:
   ....:         d2[item] = k
   ....:
In [32]: d2
Out[32]: {'Amy': 'A', 'Ben': 'B', 'Bob': 'B', 'Carl': 'C', 'Chris': 'C'}
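As an aside, the same inverted mapping can be built in one line with a dict comprehension, equivalent to the loop above:
d2 = {name: grp for grp, names in d.items() for name in names}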
In [34]: m['Group'] = m['Name'].map(d2)
In [35]: m
Out[35]:
Year Name value Group
0 2013 Amy 2 A
1 2014 Amy 9 A
2 2013 Bob 4 B
3 2014 Bob 2 B
4 2013 Carl 7 C
.. ... ... ... ...
7 2014 Chris 5 C
8 2013 Ben 1 B
9 2014 Ben 5 B
10 2013 Other 3 NaN
11 2014 Other 6 NaN
[12 rows x 4 columns]
And moving 'Other' from Name to Group:
In [8]: mask = m['Name'] == 'Other'
In [9]: m.loc[mask, 'Name'] = ''
In [10]: m.loc[mask, 'Group'] = 'Other'
In [11]: m
Out[11]:
    Year   Name value  Group
0   2013    Amy     2      A
1   2014    Amy     9      A
2   2013    Bob     4      B
3   2014    Bob     2      B
4   2013   Carl     7      C
..   ...    ...   ...    ...
7   2014  Chris     5      C
8   2013    Ben     1      B
9   2014    Ben     5      B
10  2013            3  Other
11  2014            6  Other

[12 rows x 4 columns]
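Putting the pieces together, the whole reshape condenses to three lines; a compact restatement of the steps above, assuming the d2 mapping built earlier:
m = pd.melt(df, id_vars=['Year'], var_name='Name')
m['Group'] = m['Name'].map(d2).fillna('Other')  # unmapped names (here only 'Other') become the group
m.loc[m['Name'] == 'Other', 'Name'] = ''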
Pandas melt function:
This function is useful to massage a DataFrame into a format where one or more columns are identifier variables (id_vars), while all other columns, considered measured variables (value_vars), are “unpivoted” to the row axis, leaving just two non-identifier columns, ‘variable’ and ‘value’.
For example:
melted = pd.melt(df, id_vars=["weekday"],
                 var_name="Person", value_name="Score")
We use melt to transform wide data to long data.
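A minimal, self-contained version of that example (the weekday, Person, and Score names are illustrative, not taken from the question):
import pandas as pd

# hypothetical wide data: one row per weekday, one column per person
df = pd.DataFrame({'weekday': ['Mon', 'Tue'],
                   'Alice': [10, 12],
                   'Bob': [8, 15]})

melted = pd.melt(df, id_vars=['weekday'],
                 var_name='Person', value_name='Score')
print(melted)
#   weekday Person  Score
# 0     Mon  Alice     10
# 1     Tue  Alice     12
# 2     Mon    Bob      8
# 3     Tue    Bob     15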
I have a dataset like the one below. I want to populate the missing text with what is normal for the group. I have tried using ffill, but this doesn't help the entries that are blank at the start, and bfill similarly for those at the end. How can I do this?
Group Name
1 Annie
2 NaN
3 NaN
4 David
1 NaN
2 Bertha
3 Chris
4 NaN
Desired Output:
Group Name
1 Annie
2 Bertha
3 Chris
4 David
1 Annie
2 Bertha
3 Chris
4 David
Using collections.Counter to create a modal mapping by group:
from collections import Counter

s = df.dropna(subset=['Name'])\
      .groupby('Group')['Name']\
      .apply(lambda x: Counter(x).most_common()[0][0])
df['Name'] = df['Name'].fillna(df['Group'].map(s))
print(df)
Group Name
0 1 Annie
1 2 Bertha
2 3 Chris
3 4 David
4 1 Annie
5 2 Bertha
6 3 Chris
7 4 David
You can use value_counts and head:
s = df.groupby('Group')['Name'].apply(lambda x: x.value_counts().head(1)).reset_index(-1)['level_1']
df['Name'] = df['Name'].fillna(df['Group'].map(s))
print(df)
Output:
Group Name
0 1 Annie
1 2 Bertha
2 3 Chris
3 4 David
4 1 Annie
5 2 Bertha
6 3 Chris
7 4 David
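A more direct variant reaches for Series.mode through transform; a sketch (mode ignores NaN by default, and [0] picks the first value on ties):
df['Name'] = df['Name'].fillna(
    df.groupby('Group')['Name'].transform(lambda x: x.mode()[0]))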
I have something like the following DataFrame where I have data points at 2 locations in 4 seasons in 2 years.
>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame(index=pd.MultiIndex.from_product([[1, 2, 3, 4], [2011, 2012], ['A', 'B']], names=['Season', 'Year', 'Location']))
>>> df['Value'] = np.random.randint(1, 100, len(df))
>>> df
Value
Season Year Location
1 2011 A 40
B 7
2012 A 81
B 84
2 2011 A 37
B 59
2012 A 30
B 6
3 2011 A 71
B 43
2012 A 3
B 65
4 2011 A 45
B 13
2012 A 38
B 70
>>>
I would like to create a new series that numbers the seasons consecutively across years. For example, the seasons in the first year would be 1, 2, 3, 4, and the seasons in the second year would be 5, 6, 7, 8. The series would look like this:
Season Year Location
1 2011 A 1
B 1
2012 A 5
B 5
2 2011 A 2
B 2
2012 A 6
B 6
3 2011 A 3
B 3
2012 A 7
B 7
4 2011 A 4
B 4
2012 A 8
B 8
Name: SeasonNum, dtype: int64
>>>
Any suggestions on the best way to do this?
You could do:
def seasons(row):
    return row['Year'] % 2011 * 4 + row['Season']

df.reset_index(inplace=True)
df['seasons'] = df.apply(seasons, axis=1)
df.set_index(['Season', 'Year', 'Location'], inplace=True)
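The row-wise apply works, but the numbering can also be computed vectorized and without hard-coding 2011; a sketch that anchors on the earliest year in the data:
df = df.reset_index()
df['SeasonNum'] = (df['Year'] - df['Year'].min()) * 4 + df['Season']
df = df.set_index(['Season', 'Year', 'Location'])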