Get last observations from Pandas - python

Assuming the following dataframe:
  variable  value
0        A     12
1        A     11
2        B      4
3        A      2
4        B      1
5        B      4
I want to extract the last observation for each variable. In this case, it would give me:
  variable  value
3        A      2
5        B      4
How would you do this in the most pandas/pythonic way?
I'm not worried about performance. Clarity and conciseness are important.
The best way I came up with:
df = pd.DataFrame({'variable': ['A', 'A', 'B', 'A', 'B', 'B'],
                   'value': [12, 11, 4, 2, 1, 4]})
variables = df['variable'].unique()
new_df = df.drop(index=df.index)  # empty frame with the same columns
for v in variables:
    new_df = new_df.append(df[df['variable'] == v].tail(1))

Use drop_duplicates
new_df = df.drop_duplicates('variable', keep='last')
Out[357]:
  variable  value
3        A      2
5        B      4
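A groupby-based alternative (a sketch on the sample frame above, not from the original answer) returns the same rows while making the per-variable grouping explicit:

```python
import pandas as pd

df = pd.DataFrame({'variable': ['A', 'A', 'B', 'A', 'B', 'B'],
                   'value': [12, 11, 4, 2, 1, 4]})

# tail(1) keeps the last row of each group and preserves the original index
last = df.groupby('variable').tail(1)
print(last)
```

Unlike drop_duplicates, this generalizes directly to keeping the last k rows per group via tail(k).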

Related

Counting every two rows with the same number in Python

I have a dataframe where every two rows are related. I am trying to give every two rows a unique ID. I thought it would be much easier but I cannot figure it out. Let's say I have this dataframe:
df = pd.DataFrame({'Var1': ['A', 2, 'C', 7], 'Var2': ['B', 5, 'D', 9]})
print(df)
  Var1 Var2
0    A    B
1    2    5
2    C    D
3    7    9
I would like to add an ID that would result in a dataframe that looks like this:
df = pd.DataFrame({'ID' : [1,1,2,2],'Var1': ['A', 2, 'C', 7], 'Var2': ['B', 5, 'D', 9]})
print(df)
   ID Var1 Var2
0   1    A    B
1   1    2    5
2   2    C    D
3   2    7    9
This is just a sample, but every two rows are related so just trying to count by 1, 1, 2, 2, 3, 3 etc in the ID column.
Thanks for any help!
You can create a sequence first and then divide it by 2 (integer division):
import numpy as np
df['ID'] = np.arange(len(df)) // 2 + 1
df
#   Var1 Var2  ID
# 0    A    B   1
# 1    2    5   1
# 2    C    D   2
# 3    7    9   2
I don't think there is a native Pandas way to do it, but this works...
import pandas as pd
df = pd.DataFrame({'Var1': ['A', 2, 'C', 7], 'Var2': ['B', 5, 'D', 9]})
df['ID'] = 1 + df.index // 2
df[['ID', 'Var1', 'Var2']]
Output:
   ID Var1 Var2
0   1    A    B
1   1    2    5
2   2    C    D
3   2    7    9
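The same integer-division idea generalizes to any block size, not just pairs. A sketch with a hypothetical block size k = 3, assuming the default RangeIndex:

```python
import pandas as pd

df = pd.DataFrame({'Var1': ['A', 2, 'C', 7, 'E', 11],
                   'Var2': ['B', 5, 'D', 9, 'F', 13]})

k = 3  # hypothetical block size; use k = 2 for the paired case above
# integer-divide the positional index so every k consecutive rows share an ID
df['ID'] = df.index // k + 1
print(df)
```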

How to reorder rows of a dataframe based on values in a column

I have a dataframe like this:
   A  B  C  D
0  b  3  3  4
1  a  1  2  1
2  a  1  2  1
3  d  4  4  1
4  d  1  2  1
5  c  4  5  6
Now I hope to reorder the rows based on values in column A.
I don't want to sort the values but reorder them with a specific order like ['b', 'd', 'c', 'a']
what I expect is:
   A  B  C  D
0  b  3  3  4
3  d  4  4  1
4  d  1  2  1
5  c  4  5  6
1  a  1  2  1
2  a  1  2  1
This is a good use case for pd.Categorical, since you have ordered categories. Just make that column a categorical and mark ordered=True. Then, sort_values should do the rest.
df['A'] = pd.Categorical(df.A, categories=['b', 'd', 'c', 'a'], ordered=True)
df.sort_values('A')
If you want to keep your column as is, you can just use loc and the indexes.
df.loc[pd.Series(pd.Categorical(df.A,
                                categories=['b', 'd', 'c', 'a'],
                                ordered=True))
         .sort_values()
         .index]
Use a dictionary-like mapping for the order of the strings, then sort the values and reindex:
order = ['b', 'd', 'c', 'a']
df = df.reindex(df['A'].map(dict(zip(order, range(len(order))))).sort_values().index)
print(df)
   A  B  C  D
0  b  3  3  4
3  d  4  4  1
4  d  1  2  1
5  c  4  5  6
1  a  1  2  1
2  a  1  2  1
Without changing the datatype of A, you can set 'A' as the index and select rows in the desired order defined by sk:
sk = ['b', 'd', 'c', 'a']
df.set_index('A').loc[sk].reset_index()
Or use a temp column for sorting:
sk = ['b', 'd', 'c', 'a']
(
    df.assign(S=df.A.map({v: k for k, v in enumerate(sk)}))
      .sort_values(by='S')
      .drop('S', axis=1)
)
I'm taking the solution provided by rafaelc a step further. If you want to do it in a chained process, here is how you'd do it:
df = (
    df
    .assign(A=lambda x: pd.Categorical(x['A'], categories=['b', 'd', 'c', 'a'], ordered=True))
    .sort_values('A')
)

How to encode when you have multiple categories in a column

My data frame looks like this:
[image: pandas data frame with multiple categorical variables for a user]
I made sure there are no duplicates in it. I want to encode it and I want my final output like this
I tried using pandas dummies directly but I am not getting the desired result.
Can anyone help me through this??
IIUC, your user is empty and everything is on name. If that's the case, you can
pd.pivot_table(df, index=df.name.str[0], columns=df.name.str[1:].values, aggfunc='count').fillna(0)
You can split each row in name using r'(\d+)' to separate digits from letters, and use pd.crosstab:
d = pd.DataFrame(df.name.str.split(r'(\d+)').values.tolist())
pd.crosstab(columns=d[2], index=d[1], values=d[1], aggfunc='count')
You could try the str accessor get_dummies, grouping by the user column:
df.name.str.get_dummies().groupby(df.user).sum()
Example
Given your sample DataFrame
df = pd.DataFrame({'user': [1]*4 + [2]*4 + [3]*3,
                   'name': ['a', 'b', 'c', 'd']*2 + ['d', 'e', 'f']})
df_dummies = df.name.str.get_dummies().groupby(df.user).sum()
print(df_dummies)
[out]
      a  b  c  d  e  f
user
1     1  1  1  1  0  0
2     1  1  1  1  0  0
3     0  0  0  1  1  1
Assuming the following dataframe:
   user name
0     1    a
1     1    b
2     1    c
3     1    d
4     2    a
5     2    b
6     2    c
7     3    d
8     3    e
9     3    f
You could groupby user and then use get_dummies:
import pandas as pd
# create data-frame
data = [[1, 'a'], [1, 'b'], [1, 'c'], [1, 'd'], [2, 'a'],
        [2, 'b'], [2, 'c'], [3, 'd'], [3, 'e'], [3, 'f']]
df = pd.DataFrame(data=data, columns=['user', 'name'])
# group and get_dummies
grouped = df.groupby('user')['name'].apply(lambda x: '|'.join(x))
print(grouped.str.get_dummies())
Output
      a  b  c  d  e  f
user
1     1  1  1  1  0  0
2     1  1  1  0  0  0
3     0  0  0  1  1  1
As a side-note, you can do it all in one line:
result = df.groupby('user')['name'].apply(lambda x: '|'.join(x)).str.get_dummies()
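Since each (user, name) pair occurs at most once, `pd.crosstab` on the two columns is another route to the same indicator table (a sketch on the long-format sample above):

```python
import pandas as pd

data = [[1, 'a'], [1, 'b'], [1, 'c'], [1, 'd'], [2, 'a'],
        [2, 'b'], [2, 'c'], [3, 'd'], [3, 'e'], [3, 'f']]
df = pd.DataFrame(data=data, columns=['user', 'name'])

# crosstab counts co-occurrences; with unique (user, name) pairs
# every cell is 0 or 1, i.e. a dummy/indicator table
table = pd.crosstab(df['user'], df['name'])
print(table)
```

This avoids the join-then-split step entirely, at the cost of producing counts rather than strict dummies if duplicate pairs do exist.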

Unusual reshaping of Pandas DataFrame

I have a DF like this:
df = pd.DataFrame({'x': ['a', 'a', 'b', 'b', 'b', 'c'],
                   'y': [1, 2, 3, 4, 5, 6]})
which looks like:
   x  y
0  a  1
1  a  2
2  b  3
3  b  4
4  b  5
5  c  6
I need to reshape it in the way to keep 'x' column unique:
   x  y_1  y_2  y_3
0  a    1    2  NaN
1  b    3    4    5
2  c    6  NaN  NaN
So the maximum N of the 'y_N' columns has to equal
max(df.groupby('x').count().values)
and the x column has to contain unique values.
For now I don't get how to build the y_N columns.
Thanks.
You can use pandas.crosstab with a within-group counter (cumcount) as the columns parameter:
(pd.crosstab(df.x, df.groupby('x').cumcount() + 1, df.y,
             aggfunc=lambda x: x.iloc[0])
   .rename(columns='y_{}'.format)
   .reset_index())
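An equivalent sketch, assuming each x value has no duplicate within-group positions, numbers the rows with cumcount and then pivots to wide form:

```python
import pandas as pd

df = pd.DataFrame({'x': ['a', 'a', 'b', 'b', 'b', 'c'],
                   'y': [1, 2, 3, 4, 5, 6]})

# cumcount numbers the rows within each 'x' group (0, 1, 2, ...);
# pivot then spreads those positions into columns y_1, y_2, y_3
wide = (df.assign(n=df.groupby('x').cumcount() + 1)
          .pivot(index='x', columns='n', values='y')
          .rename(columns='y_{}'.format)
          .reset_index())
print(wide)
```

pivot raises if an (x, n) pair repeats, which here doubles as a check that the numbering is unique.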

How to do group by according to sorted value in Pandas, Python?

I have a data set like
Type1  Value
    A      1
    B      6
    C      4
    A      3
    C      1
    B      2
For each element in Type1, I want it to sum over Value, and then display it in sorted order.
I want my result like,
Type1  Value
    A      4
    C      5
    B      8
Use DataFrame.groupby:
df = pd.DataFrame({'Type1': ['A', 'B', 'C', 'A', 'C', 'B'],
                   'Value': [1, 6, 4, 3, 1, 2]})
df2 = df.groupby('Type1').sum()
This gives:
       Value
Type1
A          4
B          8
C          5
This comes back sorted alphabetically by the index. The question asks for ascending order by the summed Value, so use df2.sort_values('Value') for that; df2.sort_index() only sorts by Type1.
If you want to turn Type1 back into a column, you can do df3 = df2.reset_index().
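Putting it together, a sketch that reproduces the exact order asked for in the question (ascending by summed Value):

```python
import pandas as pd

df = pd.DataFrame({'Type1': ['A', 'B', 'C', 'A', 'C', 'B'],
                   'Value': [1, 6, 4, 3, 1, 2]})

# sum per Type1, then order by the summed Value rather than by the index
result = df.groupby('Type1').sum().sort_values('Value').reset_index()
print(result)
```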
