I have this data:
df = pd.DataFrame({'X1': [0, 5, 4, 8, 9, 0, 7, 6],
                   'X2': [4, 1, 3, 5, 6, 2, 3, 3],
                   'X3': ['A', 'A', 'B', 'B', 'B', 'C', 'C', 'C']})
I want to set X3 as the index, so I would do:
df = df.set_index('X3')
And the result is:
    X1  X2
X3
A    0   4
A    5   1
B    4   3
B    8   5
B    9   6
C    0   2
C    7   3
C    6   3
However, I'm looking to set the same index but displayed in a multi-index style (even if it's not an actual MultiIndex). This is the expected result:
X1 X2
A 0 4
5 1
B 4 3
8 5
9 6
C 0 2
7 3
6 3
Is that possible?
EDIT
Answering the comments: the reason I want to achieve this is that I want to order df by X1 values without losing the grouping by X3, so that I can see the ordering within each X3 group. I can't do it with sort_values(['X3', 'X1'], ascending=[False, False]) because the primary sort key must be the maximum of each group, while keeping all rows of the same group together. That way the first group shown is the one containing the overall maximum of X1, immediately followed by the rest of that group's values; the second group contains the second-largest maximum of X1, again followed by the rest of its values, and so on.
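For the sorting goal described in this EDIT, here is a minimal sketch using a temporary helper column (grp_max is just a hypothetical name):

import pandas as pd

df = pd.DataFrame({'X1': [0, 5, 4, 8, 9, 0, 7, 6],
                   'X2': [4, 1, 3, 5, 6, 2, 3, 3],
                   'X3': ['A', 'A', 'B', 'B', 'B', 'C', 'C', 'C']})

# grp_max holds each group's maximum X1, broadcast to every row,
# so whole groups can be ranked and kept together while sorting.
df['grp_max'] = df.groupby('X3')['X1'].transform('max')
df = (df.sort_values(['grp_max', 'X3', 'X1'], ascending=[False, True, False])
        .drop(columns='grp_max'))
print(df)

This prints group B first (its maximum, 9, is the overall maximum of X1), then C, then A, with X1 descending inside each group.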
Not sure if this helps, but what if you first blank out the repeated values in the 'X3' column and then set that as the index...
import pandas as pd

df = pd.DataFrame({'X1': [0, 5, 4, 8, 9, 0, 7, 6],
                   'X2': [4, 1, 3, 5, 6, 2, 3, 3],
                   'X3': ['A', 'A', 'B', 'B', 'B', 'C', 'C', 'C']})

# Keep each label only on its first occurrence; blank out the repeats
lst_index = []
for x in df["X3"]:
    if x in lst_index:
        lst_index.append("")
    else:
        lst_index.append(x)

del df["X3"]  # delete this column if it's not required anymore
df.index = lst_index
# Output
X1 X2
A 0 4
5 1
B 4 3
8 5
9 6
C 0 2
7 3
6 3
Try:
df
X1 X2 X3
0 0 4 A
1 5 1 A
2 4 3 B
3 8 5 B
4 9 6 B
5 0 2 C
6 7 3 C
7 6 3 C
df = df.set_index('X3')
# Keep each index label only on its first occurrence; blank out the duplicates
new_index = pd.Series([i if not j else '' for i, j in zip(df.index, df.index.duplicated())])
df = df.set_index(new_index)
X1 X2
A 0 4
5 1
B 4 3
8 5
9 6
C 0 2
7 3
6 3
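If an actual MultiIndex is acceptable after all, here is a short sketch pairing X3 with a per-group counter; pandas then blanks the repeated outer labels in the display on its own (the inner counter level is an extra artifact of this approach):

import pandas as pd

df = pd.DataFrame({'X1': [0, 5, 4, 8, 9, 0, 7, 6],
                   'X2': [4, 1, 3, 5, 6, 2, 3, 3],
                   'X3': ['A', 'A', 'B', 'B', 'B', 'C', 'C', 'C']})

# X3 becomes the outer level; groupby().cumcount() numbers the rows
# inside each group and serves as the inner level.
df = df.set_index(['X3', df.groupby('X3').cumcount()])
print(df)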
I need to take the lowest value over n rows and add it to these n rows in a new column of the dataframe. For example:
n=3
Column 1 Column 2
5 3
3 3
4 3
7 2
8 2
2 2
5 4
4 4
9 4
8 2
2 2
3 2
5 2
Please take note that if the number of rows is not divisible by n, the leftover values are incorporated into the last group, so in this example the final group has 4 rows.
Thanking you in advance!
I do not know any straightforward way to do this, but here is a working example (not elegant, but working...).
If you do not worry about the number of rows being divisible by n, you could use .groupby():
import pandas as pd

d = {'col1': [1, 2, 1, 5, 3, 2, 5, 6, 4, 1, 2]}
df = pd.DataFrame(data=d)
n = 3
# Block the rows in groups of n via integer division of the index,
# and broadcast each block's minimum back to its rows
df['new_col'] = df.groupby(df.index // n)['col1'].transform('min')
which yields:
col1 new_col
0 1 1
1 2 1
2 1 1
3 5 2
4 3 2
5 2 2
6 5 4
7 6 4
8 4 4
9 1 1
10 2 1
However, we can see that the last 2 rows are grouped together instead of being grouped with the 3 previous values, as desired in this case.
A way around this is to look at the .count() of elements in each group generated by groupby and to check the last one:
import pandas as pd

d = {'col1': [1, 2, 1, 5, 3, 2, 5, 6, 4, 1, 2]}
df = pd.DataFrame(data=d)
n = 3

# Temporary series: each block's min broadcast to its rows
A = df.groupby(df.index // n)['col1'].transform('min')
# The min value of each group in a second dataframe
min_df = df.groupby(df.index // n).min()
# The size of the last group
last_batch = df.groupby(df.index // n).count()[-1:]
# If the last group is smaller than n, merge it with the previous one:
# overwrite the trailing n + remainder rows with the min over the last two groups
if last_batch.values[0][0] != n:
    last_group = last_batch + n
    A[-last_group.values[0][0]:] = min_df['col1'][-2:].min()
# Assign the temporary modified series to df
df['new_col'] = A
which yields the expected result:
col1 new_col
0 1 1
1 2 1
2 1 1
3 5 2
4 3 2
5 2 2
6 5 1
7 6 1
8 4 1
9 1 1
10 2 1
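For what it's worth, here is a more compact sketch of the same idea: build the block labels first and fold a short trailing block into the previous one before grouping (assuming a default RangeIndex and at least one full block, as above):

import pandas as pd

d = {'col1': [1, 2, 1, 5, 3, 2, 5, 6, 4, 1, 2]}
df = pd.DataFrame(data=d)
n = 3

# Label the rows in blocks of n, then relabel a short trailing block
# so it merges into the block before it.
groups = df.index // n
if len(df) % n:
    groups = groups.where(groups < groups.max(), groups.max() - 1)
df['new_col'] = df.groupby(groups)['col1'].transform('min')
print(df)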
I have the following dataframe:
import pandas as pd

data = {'a1': ['X1',2,3,4,5], 'Unnamed: 02': ['Y1',5,6,7,8], 'b1': ['X2',5,3,7,9], 'Unnamed: 05': ['Y2',5,8,9,3], 'c1': ['X3',4,5,7,5], 'Unnamed: 07': ['Y3',5,8,9,3], 'd1': ['P',2,4,5,7], 'Unnamed: 09': ['M',8,4,6,7]}
df = pd.DataFrame(data)
# Replace the 'Unnamed: ...' placeholders with the preceding real column name
df.columns = df.columns.to_series().mask(lambda x: x.str.startswith('Unnamed')).ffill()
df
There are a few things I would like to do:
1. Change the rows containing X1, X2 & X3 into just 'X' (and likewise Y1, Y2, Y3 into 'Y').
2. Combine the existing column header with the row containing X, Y, P, M. Also note that 'P' and 'M' completely replace 'd1'.
The outcome should look like:
   a1_X  a1_Y  b1_X  b1_Y  c1_X  c1_Y  P  M
0     2     5     5     5     4     5  2  8
1     3     6     3     8     5     8  4  4
2     4     7     7     9     7     9  5  6
3     5     8     9     3     5     3  7  7
Try this.
import numpy as np
# extract only the letters from first row
first_row = df.iloc[0].str.extract('([A-Z]+)')[0]
# update column names by first_row
# the columns with P and M in it have their names completely replaced
df.columns = np.where(first_row.isin(['P', 'M']), first_row, df.columns + '_' + first_row.values)
# remove first row
df = df.iloc[1:].reset_index(drop=True)
df
Alternatively, you can do something like this:
# Transpose the data frame and turn the old column names into a regular column
df = df.T.reset_index()
# Build the new names: "<old name>_<first letter>" for multi-character entries,
# the bare letter for single-character ones
df["column"] = np.where(df[0].str.len() > 1, df["index"] + "_" + df[0].str[0], df[0].str[0])
df.drop(columns=["index", 0]).set_index("column").T.rename_axis(None, axis=1)
----------------------------------------------------------
a1_X a1_Y b1_X b1_Y c1_X c1_Y P M
1 2 5 5 5 4 5 2 8
2 3 6 3 8 5 8 4 4
3 4 7 7 9 7 9 5 6
4 5 8 9 3 5 3 7 7
----------------------------------------------------------
It's a more general solution, as it uses the length of each row-zero entry as the condition rather than the actual values 'P' and 'M'. Thus it works for any single-character string.
The same logic also fits in a single list comprehension over the first row:
df.columns = [x + '_' + y[0] if len(y) > 1 else y for x, y in df.iloc[0].reset_index().values]
df = df[1:].reset_index(drop=True)
print(df)
Output:
a1_X a1_Y b1_X b1_Y c1_X c1_Y P M
0 2 5 5 5 4 5 2 8
1 3 6 3 8 5 8 4 4
2 4 7 7 9 7 9 5 6
3 5 8 9 3 5 3 7 7
I have two dataframes, df_diff and df_three. For each column of df_three, it contains the index values of three largest values from each column of df_diff. For example, let's say df_diff looks like this:
A B C
0 4 7 8
1 5 5 7
2 8 2 1
3 10 3 4
4 1 12 3
Using
df_three = df_diff.apply(lambda s: pd.Series(s.nlargest(3).index))
df_three would look like this:
A B C
0 3 4 0
1 2 0 1
2 1 1 3
How could I match the index values in df_three to the column values of df_diff? In other words, how could I get df_three to look like this:
A B C
0 10 12 8
1 8 7 7
2 5 5 4
Am I making this problem too complicated? Would there be an easier way?
Any help is appreciated!
def top_3(s, top_values):
    # Sort the column in descending order, keep the top entries,
    # and renumber them 0..top_values-1 so the columns align
    res = s.sort_values(ascending=False)[:top_values]
    res.index = range(top_values)
    return res

res = df_diff.apply(lambda x: top_3(x, 3))
print(res)
Use numpy.sort on the dataframe's values:

import numpy as np

n = 3
arr = df_diff.to_numpy()
# Sort each column ascending, reverse the rows to get descending order,
# then keep the first n rows
df_three = pd.DataFrame(np.sort(arr, 0)[::-1][:n], columns=df_diff.columns)
print(df_three)
A B C
0 10 12 8
1 8 7 7
2 5 5 4
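If you'd rather skip the index lookup entirely, the question's own apply can be reused almost as-is; a minimal sketch:

import pandas as pd

df_diff = pd.DataFrame({'A': [4, 5, 8, 10, 1],
                        'B': [7, 5, 2, 3, 12],
                        'C': [8, 7, 1, 4, 3]})

# Same nlargest call as in the question, but keep the values, not the index
df_three = df_diff.apply(lambda s: pd.Series(s.nlargest(3).values))
print(df_three)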
Suppose I have a pandas DataFrame like this. Those red values in columns C and E are the highest 10 numbers in each column.
How can I get a data frame which only contains the rows whose values are in the highest 10 of both columns? If a value is in the highest 10 of one column but not of both, the row should be ignored.
At the moment I do this with loops: I loop through each column separately, and if a value is in the highest 10 I save its row index; then I loop a third time to exclude the indexes that are not present in both lists. This is very inefficient, since I work with a table of over 100,000 rows. Is there a better way to do it?
Consider the example dataframe df
import numpy as np
import pandas as pd

np.random.seed([3, 1415])
rng = np.arange(10)
df = pd.DataFrame(
    dict(
        A=rng,
        B=list('abcdefghij'),
        C=np.random.permutation(rng),
        D=np.random.permutation(rng)
    )
)
print(df)
A B C D
0 0 a 9 1
1 1 b 4 3
2 2 c 5 5
3 3 d 1 9
4 4 e 7 4
5 5 f 6 6
6 6 g 8 0
7 7 h 3 2
8 8 i 2 7
9 9 j 0 8
Use nlargest to identify the top values in each column, then use query to filter the dataframe (local variables are referenced with @ inside query):
n = 5
c_lrgst = df.C.nlargest(n)
d_lrgst = df.D.nlargest(n)
df.query('C in @c_lrgst & D in @d_lrgst')
A B C D
2 2 c 5 5
5 5 f 6 6
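The same filter can also be written with plain boolean masks instead of query's string syntax; a small sketch on the df above:

# A row survives only if its C value is among the n largest of C
# and its D value is among the n largest of D
n = 5
mask = df['C'].isin(df['C'].nlargest(n)) & df['D'].isin(df['D'].nlargest(n))
print(df[mask])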
I want to treat non consecutive ids as different variables during groupby, so that I can take return the first value of stamp, and the sum of increment as a new dataframe. Here is sample input and output.
import pandas as pd
import numpy as np
df = pd.DataFrame([np.array(['a', 'a', 'a', 'b', 'c', 'b', 'b', 'a', 'a', 'a']),
                   np.arange(1, 11), np.ones(10)]).T
df.columns = ['id', 'stamp', 'increment']
df_result = pd.DataFrame([np.array(['a', 'b', 'c', 'b', 'a']),
                          np.array([1, 4, 5, 6, 8]),
                          np.array([3, 1, 1, 2, 3])]).T
df_result.columns = ['id', 'stamp', 'increment_sum']
In [2]: df
Out[2]:
id stamp increment
0 a 1 1
1 a 2 1
2 a 3 1
3 b 4 1
4 c 5 1
5 b 6 1
6 b 7 1
7 a 8 1
8 a 9 1
9 a 10 1
In [3]: df_result
Out[3]:
id stamp increment_sum
0 a 1 3
1 b 4 1
2 c 5 1
3 b 6 2
4 a 8 3
I can accomplish this via
def get_result(d):
    increment_sum = d.increment.sum()
    stamp = d.stamp.min()
    name = d.id.max()
    return name, stamp, increment_sum

# idea from http://stackoverflow.com/questions/25147091/combine-consecutive-rows-with-the-same-column-values
# start a new key whenever the id changes, so each run of equal ids gets its own group
df['key'] = (df['id'] != df['id'].shift(1)).astype(int).cumsum()
result = list(zip(*df.groupby([df.key]).apply(get_result)))
df = pd.DataFrame(np.array(result).T)
df.columns = ['id', 'stamp', 'increment_sum']
But I'm sure there must be a more elegant solution
Not the most optimal code, but it solves the problem.
> df_group = df.groupby('id')
We can't use id alone for the groupby, so we add a second grouping column, derived from the stamp gaps, that separates non-consecutive runs of the same id:
> df['group_diff'] = df_group['stamp'].diff().apply(lambda v: float('nan') if v == 1 else v).ffill().fillna(0)
> df
id stamp increment group_diff
0 a 1 1 0
1 a 2 1 0
2 a 3 1 0
3 b 4 1 0
4 c 5 1 0
5 b 6 1 2
6 b 7 1 2
7 a 8 1 5
8 a 9 1 5
9 a 10 1 5
Now we can use the new column group_diff for secondary grouping. A sort is added at the end, as suggested in the comments, to get the exact expected order:
> df.groupby(['id','group_diff']).agg({'increment': 'sum', 'stamp': 'first'}).reset_index()[['id', 'stamp','increment']].sort_values('stamp')
id stamp increment
0 a 1 3
2 b 4 1
4 c 5 1
3 b 6 2
1 a 8 3
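For a more elegant take on the question's shift/cumsum idea, the run key can feed a single named aggregation; a sketch assuming pandas 0.25+ for the agg(...) keyword syntax:

import pandas as pd
import numpy as np

df = pd.DataFrame({'id': ['a', 'a', 'a', 'b', 'c', 'b', 'b', 'a', 'a', 'a'],
                   'stamp': np.arange(1, 11),
                   'increment': np.ones(10, dtype=int)})

# Every run of consecutive identical ids gets its own key,
# then one agg call produces all three result columns
key = (df['id'] != df['id'].shift()).cumsum()
df_result = (df.groupby(key, sort=False)
               .agg(id=('id', 'first'),
                    stamp=('stamp', 'first'),
                    increment_sum=('increment', 'sum'))
               .reset_index(drop=True))
print(df_result)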