I have a dataframe:
df = pd.DataFrame([[2, 4, 7, 8, 1, 3, 2013], [9, 2, 4, 5, 5, 6, 2014]], columns=['Amy', 'Bob', 'Carl', 'Chris', 'Ben', 'Other', 'Year'])
Amy Bob Carl Chris Ben Other Year
0 2 4 7 8 1 3 2013
1 9 2 4 5 5 6 2014
And a dictionary:
d = {'A': ['Amy'], 'B': ['Bob', 'Ben'], 'C': ['Carl', 'Chris']}
I would like to reshape my dataframe to look like this:
Group Name Year Value
0 A Amy 2013 2
1 A Amy 2014 9
2 B Bob 2013 4
3 B Bob 2014 2
4 B Ben 2013 1
5 B Ben 2014 5
6 C Carl 2013 7
7 C Carl 2014 4
8 C Chris 2013 8
9 C Chris 2014 5
10 Other 2013 3
11 Other 2014 6
Note that Other doesn't have any values in the Name column and the order of the rows does not matter. I think I should be using the melt function but the examples that I've come across aren't too clear.
melt gets you part way there.
In [29]: m = pd.melt(df, id_vars=['Year'], var_name='Name')
This has everything except Group. To get that, we need to reshape d a bit as well.
In [30]: d2 = {}
In [31]: for k, v in d.items():
   ....:     for item in v:
   ....:         d2[item] = k
   ....:
In [32]: d2
Out[32]: {'Amy': 'A', 'Ben': 'B', 'Bob': 'B', 'Carl': 'C', 'Chris': 'C'}
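As an aside, the same inverted mapping can be built in one line with a dict comprehension:

d2 = {name: group for group, names in d.items() for name in names}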
In [34]: m['Group'] = m['Name'].map(d2)
In [35]: m
Out[35]:
Year Name value Group
0 2013 Amy 2 A
1 2014 Amy 9 A
2 2013 Bob 4 B
3 2014 Bob 2 B
4 2013 Carl 7 C
.. ... ... ... ...
7 2014 Chris 5 C
8 2013 Ben 1 B
9 2014 Ben 5 B
10 2013 Other 3 NaN
11 2014 Other 6 NaN
[12 rows x 4 columns]
Finally, move 'Other' out of Name and into Group:
In [8]: mask = m['Name'] == 'Other'
In [9]: m.loc[mask, 'Name'] = ''
In [10]: m.loc[mask, 'Group'] = 'Other'
In [11]: m
Out[11]:
Year Name value Group
0 2013 Amy 2 A
1 2014 Amy 9 A
2 2013 Bob 4 B
3 2014 Bob 2 B
4 2013 Carl 7 C
.. ... ... ... ...
7 2014 Chris 5 C
8 2013 Ben 1 B
9 2014 Ben 5 B
10 2013 3 Other
11 2014 6 Other
[12 rows x 4 columns]
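For reference, here is the whole pipeline in one sketch (using the df and d from the question, and passing value_name='Value' so the value column matches the requested name):

import pandas as pd

m = pd.melt(df, id_vars=['Year'], var_name='Name', value_name='Value')
d2 = {name: group for group, names in d.items() for name in names}
m['Group'] = m['Name'].map(d2)             # 'Other' maps to NaN here
mask = m['Name'] == 'Other'
m.loc[mask, 'Name'] = ''                   # blank out Name for the 'Other' rows
m.loc[mask, 'Group'] = 'Other'
m = m[['Group', 'Name', 'Year', 'Value']]  # requested column order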
The pandas melt function:
This function is useful to massage a DataFrame into a format where one or more columns are identifier variables (id_vars), while all other columns, considered measured variables (value_vars), are “unpivoted” to the row axis, leaving just two non-identifier columns, ‘variable’ and ‘value’.
For example, given a wide df with a weekday column plus one score column per person:
melted = pd.melt(df, id_vars=["weekday"],
                 var_name="Person", value_name="Score")
We use melt to transform wide data to long data.
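As a self-contained illustration (with made-up weekday/score data, since the df above is only sketched):

import pandas as pd

scores = pd.DataFrame({'weekday': ['Mon', 'Tue'],
                       'Alice': [10, 12],
                       'Bob': [8, 9]})
melted = pd.melt(scores, id_vars=["weekday"],
                 var_name="Person", value_name="Score")
# melted has one row per (weekday, person) pair:
#   weekday Person  Score
# 0     Mon  Alice     10
# 1     Tue  Alice     12
# 2     Mon    Bob      8
# 3     Tue    Bob      9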
Related
exdf = pd.DataFrame({'Employee name': ['Alex', 'Mike'],
                     '2014.1': [5, 2], '2014.2': [3, 4], '2014.3': [3, 6],
                     '2014.4': [4, 3], '2015.1': [7, 5], '2015.2': [5, 4]})
exdf
Employee name 2014.1 2014.2 2014.3 2014.4 2015.1 2015.2
0 Alex 5 3 3 4 7 5
1 Mike 2 4 6 3 5 4
Suppose the above dataframe has several such rows and columns with output from each employee for each quarter.
I want to create a new dataframe with columns:
newdf = pd.DataFrame(columns=['Employee name', 'Year', 'Quarter', 'Output'])
So the new dataframe will have n*m rows, where n is the number of rows and m the number of value columns in the original dataframe.
What I have tried is filling every row and column entry using a nested for loop.
But I'm sure there is a more efficient method.
for i in range(exdf.shape[0]):
    for j in range(exdf.shape[1]):
        newdf.iloc[?] = exdf.iloc[?]
Use DataFrame.melt with Series.str.split, then reorder the columns:
df = exdf.melt('Employee name', var_name='Year', value_name='Output')
df[['Year', 'Quarter']] = df['Year'].str.split('.', expand=True)
df = df[['Employee name','Year','Quarter','Output']]
print (df)
Employee name Year Quarter Output
0 Alex 2014 1 5
1 Mike 2014 1 2
2 Alex 2014 2 3
3 Mike 2014 2 4
4 Alex 2014 3 3
5 Mike 2014 3 6
6 Alex 2014 4 4
7 Mike 2014 4 3
8 Alex 2015 1 7
9 Mike 2015 1 5
10 Alex 2015 2 5
11 Mike 2015 2 4
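One caveat: str.split leaves Year and Quarter as strings. If numeric dtypes are needed downstream, a small follow-up cast would be:

df[['Year', 'Quarter']] = df[['Year', 'Quarter']].astype(int)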
Convert the columns to a MultiIndex using str.split, then stack the columns to get your final output:
#set Employee name as index
exdf = exdf.set_index('Employee name')
#convert columns to multiIndex
exdf.columns = exdf.columns.str.split('.', expand=True)
exdf.columns = exdf.columns.set_names(['year','quarter'])
#stack data and give column a name
(exdf
.stack(["year","quarter"])
.reset_index(name='output')
)
Employee name year quarter output
0 Alex 2014 1 5.0
1 Alex 2014 2 3.0
2 Alex 2014 3 3.0
3 Alex 2014 4 4.0
4 Alex 2015 1 7.0
5 Alex 2015 2 5.0
6 Mike 2014 1 2.0
7 Mike 2014 2 4.0
8 Mike 2014 3 6.0
9 Mike 2014 4 3.0
10 Mike 2015 1 5.0
11 Mike 2015 2 4.0
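Note the float values in output: 2015 has only two quarters, so stacking first materializes the missing (2015, 3) and (2015, 4) combinations as NaN, which upcasts the column to float, and then drops them. If integers are wanted, a cast like the following (a sketch, assuming the chained expression above is assigned to a variable, say result) restores the dtype:

result = result.astype({'output': int})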
With pivot_longer the reshaping can be abstracted to a simpler form:
# pip install pyjanitor
import janitor
import pandas as pd
exdf.pivot_longer(
index="Employee name",
names_to=("Year", "Quarter"),
names_sep=".",
values_to="Output"
)
Employee name Year Quarter Output
0 Alex 2014 1 5
1 Mike 2014 1 2
2 Alex 2014 2 3
3 Mike 2014 2 4
4 Alex 2014 3 3
5 Mike 2014 3 6
6 Alex 2014 4 4
7 Mike 2014 4 3
8 Alex 2015 1 7
9 Mike 2015 1 5
10 Alex 2015 2 5
11 Mike 2015 2 4
I have a dataset of the following form.
id year
0 A 2000
1 A 2001
2 B 2005
3 B 2006
4 B 2007
5 C 2003
6 C 2004
7 D 2002
8 D 2003
Now, two or more ids are assumed to be part of an aggregated id if their years can be arranged in consecutive order. In the end I would like to have this grouping, in which A & D form one group and B & C another:
id year match
0 A 2000 1
1 A 2001 1
7 D 2002 1
8 D 2003 1
5 C 2003 2
6 C 2004 2
2 B 2005 2
3 B 2006 2
4 B 2007 2
EDIT: Addressing @Dimitris_ps's comment: assuming an additional row
id year
9 A 2002
would change the desired result to
id year match
0 A 2000 1
1 A 2001 1
9 A 2002 1
5 C 2003 1
6 C 2004 1
2 B 2005 1
3 B 2006 1
4 B 2007 1
7 D 2002 2
8 D 2003 2
because now there is no longer a consecutive order for A & D but instead for A, C, and B with D having no match.
Recode your id to numeric values based on each id's minimum year; then you can sort by year and the recoded id, and map back:
import pandas as pd
df = pd.DataFrame({'id': ['A', 'A', 'B', 'B', 'B', 'C', 'C', 'D', 'D'],
                   'year': [2000, 2001, 2005, 2006, 2007, 2003, 2004, 2002, 2003]})  # example dataframe
# Create a dict mapping id to values based on the minimum year
custom_dict = {el:i for i, el in enumerate(df.groupby('id')['year'].min().sort_values().index)}
# and the reverse to map back the values to the id
custom_dict_rev = {v:k for k, v in custom_dict.items()}
df['id'] = df['id'].map(custom_dict)
df = df.sort_values(['year', 'id'])
df['id'] = df['id'].map(custom_dict_rev)
df
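Note that this only orders the rows; it does not derive the match column itself. Below is a sketch of one way to compute it, under my reading of the rule: ids chain into one group when an id's first year is exactly one past the last year of the chain built so far.

# Span of years covered by each id
spans = df.groupby('id')['year'].agg(['min', 'max'])

match_of = {}
unassigned = set(spans.index)
group = 0
while unassigned:
    group += 1
    # seed a new chain with the earliest-starting unassigned id
    seed = min(unassigned, key=lambda i: spans.loc[i, 'min'])
    unassigned.remove(seed)
    match_of[seed] = group
    chain_end = spans.loc[seed, 'max']
    extended = True
    while extended:
        extended = False
        for cand in sorted(unassigned, key=lambda i: spans.loc[i, 'min']):
            if spans.loc[cand, 'min'] == chain_end + 1:
                # cand continues the chain with no gap and no overlap
                unassigned.remove(cand)
                match_of[cand] = group
                chain_end = spans.loc[cand, 'max']
                extended = True
                break

df['match'] = df['id'].map(match_of)

On the example data this yields match 1 for A & D and match 2 for B & C, as requested.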
I have a problem with groupby in pandas. At the beginning I have this DataFrame:
import pandas as pd
data = {'Code_Name': [1, 2, 3, 4, 1, 2, 3, 4],
        'Name': ['Tom', 'Nicko', 'Krish', 'Jack kr', 'Tom', 'Nick', 'Krishx', 'Jacks'],
        'Cat': ['A', 'B', 'C', 'D', 'A', 'B', 'C', 'D'],
        'T': [9, 7, 14, 12, 4, 3, 12, 11]}
# Create DataFrame
df = pd.DataFrame(data)
df
I have this:
   Code_Name     Name Cat   T
0          1      Tom   A   9
1          2    Nicko   B   7
2          3    Krish   C  14
3          4  Jack kr   D  12
4          1      Tom   A   4
5          2     Nick   B   3
6          3   Krishx   C  12
7          4    Jacks   D  11
Now, with groupby:
df.groupby(['Code_Name','Name','Cat'],as_index=False)['T'].sum()
I get this:
   Code_Name     Name Cat   T
0          1      Tom   A  13
1          2     Nick   B   3
2          2    Nicko   B   7
3          3    Krish   C  14
4          3   Krishx   C  12
5          4  Jack kr   D  12
6          4    Jacks   D  11
But I need this result:
Code_Name Name Cat T
0 1 Tom A 13
1 2 Nick B 10
2 3 Krish C 26
3 4 Jack D 23
I don't care about Name; Code_Name is the only thing important for me, together with the sum of T. Thanks!
There are two ways. To avoid losing the other columns, add an aggregation function for each of them - first, last, or ', '.join (obviously for string columns) and functions like sum or mean for numeric columns:
df = df.groupby('Code_Name',as_index=False).agg({'Name':'first', 'Cat':'first', 'T':'sum'})
print (df)
Code_Name Name Cat T
0 1 Tom A 13
1 2 Nicko B 10
2 3 Krish C 26
3 4 Jack kr D 23
Or, if some values are duplicated per group, like the Cat values here, add those columns to the groupby - only the column order changes in the output:
df = df.groupby(['Code_Name','Cat'],as_index=False).agg({'Name':'first', 'T':'sum'})
print (df)
Code_Name Cat Name T
0 1 A Tom 13
1 2 B Nicko 10
2 3 C Krish 26
3 4 D Jack kr 23
If you don't care about the other columns, then just group by the column of interest:
gb = df.groupby(['Code_Name'],as_index=False)['T'].sum()
print(gb)
Code_Name T
0 1 13
1 2 10
2 3 26
3 4 23
Now, to get your output, you can take the last value of Name and the first value of Cat for each group:
gb = df.groupby(['Code_Name'],as_index=False).agg({'Name': 'last', 'Cat': 'first', 'T': 'sum'})
print(gb)
   Code_Name    Name Cat   T
0          1     Tom   A  13
1          2    Nick   B  10
2          3  Krishx   C  26
3          4   Jacks   D  23
Perhaps you can try:
(df.groupby("Code_Name", as_index=False)
.agg({"Name":"first", "Cat":"first", "T":"sum"}))
see link: https://datascience.stackexchange.com/questions/53405/pandas-dataframe-groupby-and-then-sum-multi-columns-sperately for the original answer
When trying to use pd.pivot_table on a given dataset, I noticed that it creates groups only for the level combinations that actually occur in a parent group, not for all possible levels. For example, on a dataset like this:
YEAR CLASS
0 2013 A
1 2013 A
2 2013 B
3 2013 B
4 2013 B
5 2013 C
6 2013 C
7 2013 D
8 2014 A
9 2014 A
10 2014 A
11 2014 B
12 2014 B
13 2014 B
14 2014 C
15 2014 C
there is no level D for year 2014, so the pivot table will look like this:
pd.pivot_table(d,index=["YEAR","CLASS"],values=["YEAR"],aggfunc=[len],fill_value=0)
len
YEAR CLASS
2013 A 2
B 3
C 2
D 1
2014 A 3
B 3
C 2
What I want is to get a separate group for D in 2014 with length 0 in my pivot table. How can I include all possible levels in the child variable for the parent variable?
I think you can use crosstab and stack:
print(pd.pivot_table(df,
                     index=["YEAR", "CLASS"],
                     values=["YEAR"],
                     aggfunc=[len],
                     fill_value=0))
len
YEAR CLASS
2013 A 2
B 3
C 2
D 1
2014 A 3
B 3
C 2
print(pd.crosstab(df['YEAR'], df['CLASS']))
CLASS A B C D
YEAR
2013 2 3 2 1
2014 3 3 2 0
df = pd.crosstab(df['YEAR'], df['CLASS']).stack()
df.name = 'len'
print(df)
YEAR CLASS
2013 A 2
B 3
C 2
D 1
2014 A 3
B 3
C 2
D 0
Name: len, dtype: int64
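An alternative sketch using groupby instead of crosstab (applied to the original df, before it was reassigned above): size counts each existing pair, unstack(fill_value=0) materializes the missing (2014, D) cell as 0, and stack restores the long shape.

counts = df.groupby(['YEAR', 'CLASS']).size().unstack(fill_value=0).stack()
counts.name = 'len'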
I have something like the following DataFrame, where I have data points for 2 locations across 4 seasons in 2 years.
>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame(index=pd.MultiIndex.from_product([[1, 2, 3, 4], [2011, 2012], ['A', 'B']], names=['Season', 'Year', 'Location']))
>>> df['Value'] = np.random.randint(1, 100, len(df))
>>> df
Value
Season Year Location
1 2011 A 40
B 7
2012 A 81
B 84
2 2011 A 37
B 59
2012 A 30
B 6
3 2011 A 71
B 43
2012 A 3
B 65
4 2011 A 45
B 13
2012 A 38
B 70
>>>
I would like to create a new series that numbers the seasons consecutively across years. For example, the seasons in the first year would be 1, 2, 3, 4 and the seasons in the second year would be 5, 6, 7, 8. The series would look like this:
Season Year Location
1 2011 A 1
B 1
2012 A 5
B 5
2 2011 A 2
B 2
2012 A 6
B 6
3 2011 A 3
B 3
2012 A 7
B 7
4 2011 A 4
B 4
2012 A 8
B 8
Name: SeasonNum, dtype: int64
>>>
Any suggestions on the best way to do this?
You could do:
def seasons(row):
    # 2011 % 2011 == 0, 2012 % 2011 == 1, ... so each year past 2011
    # shifts the season number by a block of 4 (assumes the data starts in 2011)
    return row['Year'] % 2011 * 4 + row['Season']
df.reset_index(inplace=True)
df['seasons'] = df.apply(seasons, axis=1)
df.set_index(['Season', 'Year', 'Location'], inplace=True)
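A vectorized alternative (a sketch; it assumes seasons are numbered 1-4 and derives the year offset from the smallest year present instead of hardcoding 2011):

years = df.index.get_level_values('Year')
seasons = df.index.get_level_values('Season')
# each year past the first contributes a block of 4 seasons
df['SeasonNum'] = (years - years.min()) * 4 + seasons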