Python Pandas: Sort an alphanumeric dataframe - python

I currently have the following dataframe.
No  ID       Sub_No  Weight
1   a23mcsk  2       30
2   smcd302  3       60
3   a23mcsk  1       24
4   smcd302  2       45
5   a23mcsk  3       18
6   smcd302  1       12
I want to sort this dataframe first by 'ID' and then by 'Sub_No'. Is there a way to do this in Python using Pandas?
Expected Result:
No  ID       Sub_No  Weight
3   a23mcsk  1       24
1   a23mcsk  2       30
5   a23mcsk  3       18
6   smcd302  1       12
4   smcd302  2       45
2   smcd302  3       60

Use a helper column to sort by the numeric part of ID and then by Sub_No (ascending order, to match the expected result):
df['new'] = df['ID'].str.extract(r'(\d+)', expand=False).astype(int)
df = df.sort_values(by=['new', 'Sub_No']).drop(columns='new')
Another idea, using natural sorting with the third-party natsort package:
import pandas as pd
from natsort import natsorted
df['new'] = pd.Categorical(df['ID'], ordered=True, categories=natsorted(df['ID'].unique()))
df = df.sort_values(by=['new', 'Sub_No']).drop(columns='new')
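The helper column can also be avoided with the key argument of sort_values (available in pandas 1.1 and later). The following is only a sketch under that assumption: the function name numeric_id_key is made up for illustration, and it assumes every ID contains at least one run of digits.

```python
import pandas as pd

df = pd.DataFrame({
    "No": [1, 2, 3, 4, 5, 6],
    "ID": ["a23mcsk", "smcd302", "a23mcsk", "smcd302", "a23mcsk", "smcd302"],
    "Sub_No": [2, 3, 1, 2, 3, 1],
    "Weight": [30, 60, 24, 45, 18, 12],
})

def numeric_id_key(col):
    # Hypothetical helper: sort_values calls it once per column in `by`.
    # Only the ID column is mapped to its first run of digits;
    # every other column is returned unchanged.
    if col.name == "ID":
        return col.str.extract(r"(\d+)", expand=False).astype(int)
    return col

result = df.sort_values(by=["ID", "Sub_No"], key=numeric_id_key)
print(result)
```

Here a23mcsk sorts before smcd302 because 23 is smaller than 302, not because of alphabetical order.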

Use:
df.sort_values(by=['ID', 'Sub_No'])

You can use
df.sort_values(by=['ID', 'Sub_No'], ascending=True)
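For the sample data in the question, a plain two-column sort already gives the expected result, because the two IDs happen to order the same way alphabetically. A minimal sketch, rebuilding the question's table by hand:

```python
import pandas as pd

df = pd.DataFrame({
    "No": [1, 2, 3, 4, 5, 6],
    "ID": ["a23mcsk", "smcd302", "a23mcsk", "smcd302", "a23mcsk", "smcd302"],
    "Sub_No": [2, 3, 1, 2, 3, 1],
    "Weight": [30, 60, 24, 45, 18, 12],
})

# sort_values returns a new DataFrame; assign it to keep the result.
sorted_df = df.sort_values(by=["ID", "Sub_No"])
print(sorted_df)
```

Note that string columns are compared character by character, so this only matches a numeric ordering of the IDs by coincidence.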

Related

How to stack two columns of a pandas dataframe in python

I want to stack two columns on top of each other.
So I have Left and Right values in one column each, and want to combine them into a single one. How do I do this in Python?
I'm working with Pandas Dataframes.
Basically from this
Left Right
0 20 25
1 15 18
2 10 35
3 0 5
To this:
New Name
0 20
1 15
2 10
3 0
4 25
5 18
6 35
7 5
It doesn't matter how they are combined as I will plot it anyway, and the new column name also doesn't matter because I can rename it.
You can build a list of the columns and pass it to concat. Each df[col] is already a Series, so the squeeze call is not strictly needed. Passing ignore_index=True creates a fresh 0 to n-1 index; without it, the original index values (0 to 3) are repeated for each column:
cols = [df[col].squeeze() for col in df]
pd.concat(cols, ignore_index=True)
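As a self-contained sketch of the concat approach, using the Left/Right data from the question (the name stacked is just illustrative):

```python
import pandas as pd

df = pd.DataFrame({"Left": [20, 15, 10, 0], "Right": [25, 18, 35, 5]})

# Collect each column as a Series, then stack them end to end.
cols = [df[col] for col in df]
stacked = pd.concat(cols, ignore_index=True).to_frame("New Name")
print(stacked)
```

All Left values come first, followed by all Right values, with a new index running from 0 to 7.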
Many options, stack, melt, concat, ...
Here's one:
>>> df.melt(value_name='New Name').drop(columns='variable')
New Name
0 20
1 15
2 10
3 0
4 25
5 18
6 35
7 5
You can also use np.ravel:
import numpy as np
out = pd.DataFrame(np.ravel(df.values.T), columns=['New name'])
print(out)
# Output
New name
0 20
1 15
2 10
3 0
4 25
5 18
6 35
7 5
Update
If you have only 2 cols:
out = pd.concat([df['Left'], df['Right']], ignore_index=True).to_frame('New name')
print(out)
# Output
New name
0 20
1 15
2 10
3 0
4 25
5 18
6 35
7 5
Solution with unstack
df2 = df.unstack()
# recreate index
df2.index = np.arange(len(df2))
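A solution that selects the two columns and flattens them with np.ravel. Note that this interleaves Left and Right row by row instead of placing one column after the other.
# Your data
import numpy as np
import pandas as pd
df = pd.DataFrame({"Left":[20,15,10,0], "Right":[25,18,35,5]})
# Select the columns and flatten them row by row
df2 = pd.DataFrame({"New Name":np.ravel(df[["Left","Right"]])})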
df2
New Name
0 20
1 25
2 15
3 18
4 10
5 35
6 0
7 5
I ended up using this solution, which seems to work fine:
df1 = dfTest[['Left']].copy()
df2 = dfTest[['Right']].copy()
df2.columns=['Left']
df3 = pd.concat([df1, df2],ignore_index=True)

Pandas Transform

Let's say I have the following dataframe:
metric value last_1_day last_7_day
points 10 3 9
assists 15 2 12
rebounds 12 1 5
I want to transpose this and combine each value in metric with the names of the other columns. The result would have 1 row and 9 columns. For example:
points_value points_last_1_day points_last_7_day assists_value assists_last_1_day assists_last_7_day rebounds_value rebounds_last_1_day rebounds_last_7_day
10 3 9 15 2 12 12 1 5
What is the best way to do this in pandas?
Try with:
df = df.set_index('metric').stack().to_frame().T
df.columns = ['_'.join(a) for a in df.columns]
Output:
points_value points_last_1_day points_last_7_day assists_value assists_last_1_day assists_last_7_day rebounds_value rebounds_last_1_day rebounds_last_7_day
0 10 3 9 15 2 12 12 1 5
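A runnable version of this answer, with the question's table rebuilt by hand, might look like the following sketch (the names wide and pair are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "metric": ["points", "assists", "rebounds"],
    "value": [10, 15, 12],
    "last_1_day": [3, 2, 1],
    "last_7_day": [9, 12, 5],
})

# stack() produces a Series indexed by (metric, column name) pairs;
# to_frame().T turns it into a single row with those pairs as columns.
wide = df.set_index("metric").stack().to_frame().T

# Flatten each (metric, column name) pair into one string label.
wide.columns = ["_".join(pair) for pair in wide.columns]
print(wide)
```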

Get original dataframe from pivoted dataframe with multiindex

Let's say I have a dataframe in a long form like this
data = {'name': ["A","A","A","B","B","B","C","C","C"],
'time':["1pm","1pm","1pm","2pm","2pm","2pm","3pm","3pm","3pm"],
'idx': [1,2,3,1,2,3,1,2,3], 'var1':[34,234,645,3,23,65,34,24,25],
'var2':[1,35,2,65,2,1,7,3,8]}
df = pd.DataFrame(data)
I pivot it:
df_piv = df.pivot_table(values=['var1','var2'], index=['name','time'], columns='idx', aggfunc='sum')
After performing some operations on the data in the pivoted dataframe, I would like to get it back into the long form.
How would I do that?
I tried several pandas functions, including melt, to_records, reset_index and swaplevel, in multiple combinations. None of them gave the desired outcome.
Use DataFrame.stack with DataFrame.reset_index:
df = df_piv.stack().reset_index()
print (df)
name time idx var1 var2
0 A 1pm 1 34 1
1 A 1pm 2 234 35
2 A 1pm 3 645 2
3 B 2pm 1 3 65
4 B 2pm 2 23 2
5 B 2pm 3 65 1
6 C 3pm 1 34 7
7 C 3pm 2 24 3
8 C 3pm 3 25 8
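The round trip can be checked end to end with a short sketch. It assumes aggfunc='sum' (the string form of np.sum) and uses the illustrative name long_again for the restored frame:

```python
import pandas as pd

data = {
    "name": ["A", "A", "A", "B", "B", "B", "C", "C", "C"],
    "time": ["1pm", "1pm", "1pm", "2pm", "2pm", "2pm", "3pm", "3pm", "3pm"],
    "idx": [1, 2, 3, 1, 2, 3, 1, 2, 3],
    "var1": [34, 234, 645, 3, 23, 65, 34, 24, 25],
    "var2": [1, 35, 2, 65, 2, 1, 7, 3, 8],
}
df = pd.DataFrame(data)

# Wide form: one column per (variable, idx) pair.
df_piv = df.pivot_table(values=["var1", "var2"], index=["name", "time"],
                        columns="idx", aggfunc="sum")

# stack() moves the innermost column level (idx) back into the index,
# and reset_index() turns all index levels into ordinary columns.
long_again = df_piv.stack().reset_index()
print(long_again)
```

Because each (name, time, idx) combination occurs only once here, the sum leaves the values unchanged and the long form matches the original data.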

How to get dataframe of unique ids

I'm trying to group the following dataframe by unique binId and then, within each group, pick the row with the highest value of 'z'. Here is my dataframe.
import pandas as pd
df = pd.DataFrame({'ID':['1','2','3','4','5','6'], 'binId': ['1','2','2','1','1','3'], 'x':[1,4,5,6,3,4], 'y':[11,24,35,16,23,34],'z':[1,4,5,2,3,4]})
I tried the following code, which gives the required answer:
def f(x):
    tp = df[df['binId'] == x][['binId','ID','x','y','z']].sort_values(by='z', ascending=False).iloc[0]
    return tp
and then:
binids = pd.Series(df.binId.unique())
print(binids.apply(f))
The output is,
binId ID x y z
0 1 5 3 23 3
1 2 3 5 35 5
2 3 6 4 34 4
But the execution is too slow. What is a faster way of doing this?
Use idxmax to get the index label of the largest z in each group, then select those rows with loc:
df1 = df.loc[df.groupby('binId')['z'].idxmax()]
Or, often faster, use sort_values with drop_duplicates:
df1 = df.sort_values(['binId', 'z']).drop_duplicates('binId', keep='last')
print (df1)
ID binId x y z
4 5 1 3 23 3
2 3 2 5 35 5
5 6 3 4 34 4
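Both approaches can be compared side by side on the question's data. This is a small sketch; the names by_idxmax and by_sort are illustrative only:

```python
import pandas as pd

df = pd.DataFrame({
    "ID": ["1", "2", "3", "4", "5", "6"],
    "binId": ["1", "2", "2", "1", "1", "3"],
    "x": [1, 4, 5, 6, 3, 4],
    "y": [11, 24, 35, 16, 23, 34],
    "z": [1, 4, 5, 2, 3, 4],
})

# Approach 1: index label of the largest z per binId, then row lookup.
by_idxmax = df.loc[df.groupby("binId")["z"].idxmax()]

# Approach 2: sort so the largest z is last within each binId,
# then keep only that last row per binId.
by_sort = df.sort_values(["binId", "z"]).drop_duplicates("binId", keep="last")

print(by_idxmax)
print(by_sort)
```

On this data the two results contain the same rows. If several rows share the maximum z within a bin, the approaches may pick different rows among the ties.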

Python pandas constructing dataframe by looping over columns

I am trying to build a new pandas dataframe from data in an existing dataframe, where each new row also depends on the previously calculated row of the new dataframe.
As an example, here are two dataframes with the same size.
df1 = pd.DataFrame(np.random.randint(0,10, size = (5, 4)), columns=['1', '2', '3', '4'])
df2 = pd.DataFrame(np.zeros(df1.shape), index=df1.index, columns=df1.columns)
Then I created a list that serves as the first row of my second dataframe df2:
L = [2,5,6,7]
df2.loc[0] = L
Then, for each remaining row of df2, I want to take the previous row of df2 and add the corresponding row of df1. This is my attempt:
for i in df2.loc[1:]:
    df2.ix[i] = df2.ix[i-1] + df1
As an example my dataframes should look like this:
>>> df1
1 2 3 4
0 4 6 0 6
1 7 0 7 9
2 9 1 9 9
3 5 2 3 6
4 0 3 2 9
>>> df2
1 2 3 4
0 2 5 6 7
1 9 5 13 16
2 18 6 22 25
3 23 8 25 31
4 23 11 27 40
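I know there is something wrong with how the indexes are used in the for loop, but I cannot figure out how the expression must be formulated. I would be very thankful for any help on this.
This is a simple cumsum: replace the first row of a copy of df1 with the starting values, then take the cumulative sum down the rows.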
df2 = df1.copy()
df2.loc[0] = [2,5,6,7]
desired_df = df2.cumsum()
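To check this against the example in the question, the following sketch hard-codes the df1 values shown there instead of generating random numbers:

```python
import pandas as pd

# df1 with the values printed in the question (instead of random data).
df1 = pd.DataFrame(
    [[4, 6, 0, 6],
     [7, 0, 7, 9],
     [9, 1, 9, 9],
     [5, 2, 3, 6],
     [0, 3, 2, 9]],
    columns=["1", "2", "3", "4"],
)

# Replace the first row with the starting values, then accumulate:
# row i of the result is the start row plus rows 1..i of df1.
df2 = df1.copy()
df2.loc[0] = [2, 5, 6, 7]
desired_df = df2.cumsum()
print(desired_df)
```

This reproduces the df2 table from the question without an explicit loop.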
