Given a dataframe of the format
A B C D
.......
........
I would like to select the rows whose value in column B is greater than 0.6 times the last value in column B. For example,
Input:
A B C
1 0 5
2 3 4
3 6 6
4 8 1
5 9 3
Output:
A B C
3 6 6
4 8 1
5 9 3
I am currently doing the following:
x = df.loc[df.tail(1).index, 'B']
which returns a Series object holding the index and value of column B for the last row of the dataframe, and then:
new_df = df[df.B > x]
But I am getting the error,
ValueError: Series lengths must match to compare
How should I perform the query?
You first need to take the last value of column B using tail and multiply it by 0.6:
df[df['B'] > df['B'].tail(1).values[0] * 0.6]
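The original attempt fails because df.loc[df.tail(1).index, 'B'] returns a length-1 Series, and comparing a Series of length 1 against a longer one raises the ValueError; extracting a scalar instead avoids it. A minimal, self-contained sketch of the fix on the sample data:

import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4, 5],
                   'B': [0, 3, 6, 8, 9],
                   'C': [5, 4, 6, 1, 3]})

last_b = df['B'].iloc[-1]            # scalar last value of B, not a Series
new_df = df[df['B'] > 0.6 * last_b]  # 0.6 * 9 = 5.4, so B must exceed 5.4
print(new_df)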
I want to use the head/tail function, but for each group I will take a different number of rows according to an input dictionary.
The function should have two inputs. The first input is a pandas DataFrame:
df = pd.DataFrame({"group":["A","A","A","B","B","B","B"],"value":[0,1,2,3,4,5,6,7]})
print(df)
group value
0 A 0
1 A 1
2 A 2
3 B 3
4 B 4
5 B 5
6 B 6
The second input is a dict:
slice_per_group = {"A":1,"B":3}
df.groupby('group').head(slice_per_group)  # obviously this doesn't work
Expected output:
group value
0 A 0
3 B 3
4 B 4
5 B 5
Use head on each group separately:
df.groupby('group', group_keys=False).apply(lambda g: g.head(slice_per_group.get(g.name)))
group value
0 A 0
3 B 3
4 B 4
5 B 5
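The lambda works because pandas sets the .name attribute of each group's sub-DataFrame to the group key, so slice_per_group.get(g.name) looks up how many rows to keep for that particular group. A self-contained sketch:

import pandas as pd

df = pd.DataFrame({"group": ["A", "A", "A", "B", "B", "B", "B"],
                   "value": [0, 1, 2, 3, 4, 5, 6]})
slice_per_group = {"A": 1, "B": 3}

# g.name holds the group key ("A" or "B"); use it to pick the row count
out = df.groupby('group', group_keys=False).apply(
    lambda g: g.head(slice_per_group.get(g.name)))
print(out)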
I have a pandas dataframe in which I want to add a column (col_new), whose values depend on a comparison of values in an existing column (col_exist).
The existing column (dtype object) contains As and Bs.
The new column should count, starting with 1.
If an A follows an A, the count should rise by one.
If an A follows a B, the count should rise by one.
If a B follows an A, the count should not rise.
If a B follows a B, the count should not rise.
col_exist col_new
A 1
A 2
A 3
B 3
A 4
B 4
B 4
A 5
B 5
I am completely new to programming, so thank you in advance for your adequate answer.
Use eq and cumsum:
df['col_new'] = df['col_exist'].eq('A').cumsum()
Output:
col_exist col_new
0 A 1
1 A 2
2 A 3
3 B 3
4 A 4
5 B 4
6 B 4
7 A 5
8 B 5
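The trick is that eq('A') produces a boolean Series (True on every A row, False on every B row), and cumsum treats True as 1 and False as 0, so the running count rises by one exactly on the A rows. A minimal sketch with the sample column:

import pandas as pd

df = pd.DataFrame({'col_exist': list('AAABABBAB')})

# True (1) for each 'A', False (0) for each 'B'; the cumulative sum
# therefore increments on 'A' rows and stays flat on 'B' rows
df['col_new'] = df['col_exist'].eq('A').cumsum()
print(df)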
I'm currently trying to do an analysis of rolling correlations of a dataset with four compared values, but I only need the output rows containing 'a'.
I got my data frame by using the command newdf = df.rolling(3).corr()
Sample input (random numbers)
a b c d
1 a
1 b
1 c
1 d
2 a
2 b
2 c
2 d
3 a
3 b 5 6 3
3 c 4 3 1
3 d 3 4 2
4 a 1 3 5 6
4 b 6 2 4 1
4 c 8 6 6 7
4 d 2 5 4 6
5 a 2 5 4 1
5 b 1 4 6 3
5 c 2 6 3 7
5 d 3 6 3 7
and need the output
a b c d
1 a 1 3 5 6
2 a 2 5 4 1
I've tried filtering it with adf = newdf.filter(['a'], axis=0), but that gets rid of everything, and doing it on the other axis filters by column. Unfortunately the column containing the row labels a, b, c, d is unnamed, so I can't filter that column individually. This wouldn't be an issue, however, if it's possible to flip the rows and columns, with the values being listed by index, to get the desired output.
Try using loc: set the column of repeating a, b, c, d values as the index and then just use loc:
df.loc['a']
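A minimal sketch of that idea, assuming the a/b/c/d labels sit in a plain (single-level) index; the numbers are made up. With repeated index labels, loc pulls every matching row at once:

import pandas as pd

newdf = pd.DataFrame({'a': [1, 6, 2], 'b': [3, 2, 5],
                      'c': [5, 4, 4], 'd': [6, 1, 1]},
                     index=['a', 'b', 'a'])

# a plain index with duplicate labels: loc returns all rows labelled 'a'
print(newdf.loc['a'])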
The actual source of the problem in your case is that your DataFrame has a MultiIndex. So when you attempt to execute newdf.filter(['a'], axis=0), you want to keep rows whose index contains only the "a" string. But since your DataFrame has a MultiIndex, each row with "a" at level 1 also contains some number at level 0.
To get your intended result, run:
newdf.filter(like='a', axis=0)
maybe followed by .dropna().
An alternative solution is:
newdf.xs('a', level=1, drop_level=False)
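A sketch of both approaches on a toy MultiIndex standing in for the one that rolling(3).corr() produces (the values are arbitrary):

import numpy as np
import pandas as pd

idx = pd.MultiIndex.from_product([[1, 2], list('abcd')])
newdf = pd.DataFrame(np.arange(32).reshape(8, 4),
                     index=idx, columns=list('abcd'))

# filter(like=...) matches 'a' anywhere in the stringified index tuple
print(newdf.filter(like='a', axis=0))

# xs selects exactly the 'a' labels at level 1 and keeps both index levels
print(newdf.xs('a', level=1, drop_level=False))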
Suppose I have a pandas dataframe that is like this:
df =
A  B  C  D
A  B  6  2
A  C  4  2
D  F  9  3
K  L  8  9
A  B  4  3
D  F  8  2
How can I say: if columns A and B have duplicate rows, remove the one that has the largest value in column C?
So, for instance, we can see that lines 1 and 5 have the same values in columns A and B.
A B 6 2 (Line 1)
A B 4 3 (Line 5)
I want to remove line 1 as 6 is greater than 4.
So my output should be
A  B  C  D
A  C  4  2
K  L  8  9
A  B  4  3
D  F  8  2
Sort the frame on the column whose largest value should be dropped, in ascending order, using DataFrame.sort_values; the smaller C then comes first. Then drop the duplicates using DataFrame.drop_duplicates, which by default keeps the first (here: smallest-C) occurrence:
df.sort_values(by=['C'], ascending=[True], inplace=True)
df.drop_duplicates(subset=['A', 'B'], inplace=True)
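A runnable sketch on the sample data (the column names A, B, C, D are assumed, since the frame in the question is shown without headers); a final sort_index() restores the original row order, which sort_values changes:

import pandas as pd

df = pd.DataFrame([['A', 'B', 6, 2], ['A', 'C', 4, 2], ['D', 'F', 9, 3],
                   ['K', 'L', 8, 9], ['A', 'B', 4, 3], ['D', 'F', 8, 2]],
                  columns=['A', 'B', 'C', 'D'])

# smallest C first, so drop_duplicates (keep='first') keeps the minimum
out = (df.sort_values(by='C')
         .drop_duplicates(subset=['A', 'B'])
         .sort_index())
print(out)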
I have the following dataframe df:
A B C D E
J 4 2 3 2 3
K 5 2 6 2 1
L 2 6 5 4 7
I would like to create an additional column that sums each row of the df except column A (whose values are also numbers), so what I have tried is:
df['summation'] = df.iloc[:, 1:4].sum(axis=0)
However, the summation column is added but contains only NaN values.
Desired output is:
A B C D E summation
J 4 2 3 2 3 10
K 5 2 6 2 1 11
L 2 6 5 4 7 22
The sum should run along each row, from column B to the end.
As pointed out in the comments, you applied sum along the wrong axis: axis=0 sums down each column, whereas you want axis=1 to sum across each row. (Note also that iloc[:, 1:4] selects only columns B through D, so column E would still be missed.) If you want to exclude columns from the sum, you can use drop, which also accepts a list of column names; that is handy if you want to exclude, say, the columns at index 0 and 3, where iloc might not be ideal:
df.drop('A', axis=1).sum(axis=1)
which yields
J 10
K 11
L 22
@ayhan's solution in the comments also works fine.
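Put together, a minimal sketch that assigns the result back as the new column:

import pandas as pd

df = pd.DataFrame({'A': [4, 5, 2], 'B': [2, 2, 6], 'C': [3, 6, 5],
                   'D': [2, 2, 4], 'E': [3, 1, 7]}, index=list('JKL'))

# drop column A, then sum each remaining row (axis=1 sums across columns)
df['summation'] = df.drop('A', axis=1).sum(axis=1)
print(df)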