I have a table with records that look like this:
buyer=[name,value]
It is possible for any name to have several records in the table:
Name Value
E 10
A 2
D 4
E 10
A 5
B 3
B 10
D 10
C 4
I am trying to filter this table based on the following logic: select all records for those names whose maximum value is not larger than 5. Based on the above example, I would select all records for names A and C, because their maxima are 5 and 4 respectively:
Name Value
A 2
A 5
C 4
B, D and E will be excluded, because the maximum for each of them is 10.
How can I do this ?
Simply with the pandas module:
Let's say your data is in a test.csv file.
import pandas as pd
df = pd.read_csv("test.csv", delim_whitespace=True)
# Broadcast each name's maximum back onto its rows, then keep the rows
# whose group maximum is at most 5.
idx = df.groupby('Name')['Value'].transform('max') <= 5
print(df[idx])
The output:
Name Value
1 A 2
4 A 5
8 C 4
pandas.DataFrame.groupby
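For reference, the same selection can be written with groupby.filter, which keeps or drops whole groups at once; a minimal sketch, assuming the same df as above:
# A group survives only if no 'Value' in it exceeds 5.
print(df.groupby('Name').filter(lambda g: g['Value'].max() <= 5))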
I have a pandas dataframe to which I want to add a column (col_new), whose values depend on a comparison of values in an existing column (col_exist).
The existing column (dtype object) contains As and Bs.
The new column should count, starting at 1.
If an A follows an A, the count should rise by one.
If an A follows a B, the count should rise by one.
If a B follows an A, the count should not rise.
If a B follows a B, the count should not rise.
col_exist col_new
A 1
A 2
A 3
B 3
A 4
B 4
B 4
A 5
B 5
I am completely new to programming, so thank you in advance for your adequate answer.
Use eq and cumsum:
df['col_new'] = df['col_exist'].eq('A').cumsum()
The output:
col_exist col_new
0 A 1
1 A 2
2 A 3
3 B 3
4 A 4
5 B 4
6 B 4
7 A 5
8 B 5
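To see why this works: eq('A') produces a boolean Series, and cumsum() treats True as 1 and False as 0, so the counter increments exactly on the A rows. A minimal self-contained sketch, assuming the column values from the question:
import pandas as pd

df = pd.DataFrame({'col_exist': list('AAABABBAB')})
# True on every 'A'; cumsum turns the booleans into a running count.
df['col_new'] = df['col_exist'].eq('A').cumsum()
print(df)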
I have a dataframe extracted from an Excel file, which I have manipulated into the following form (there are multiple rows, but this is reduced to make my question as clear as possible):
|A|B|C|A|B|C|
index 0: 1 2 3 4 5 6
As you can see there are repetitions of the column names. I would like to merge this dataframe to look like the following:
|A|B|C|
index 0: 1 2 3
index 1: 4 5 6
I have tried to use the melt function but have not had any success thus far.
import pandas as pd
df = pd.DataFrame([[1,2,3,4,5,6]], columns = ['A', 'B','C','A', 'B','C'])
df
A B C A B C
0 1 2 3 4 5 6
# Split the columns into first occurrences and duplicates, then stack the two blocks.
pd.concat(x for _, x in df.groupby(df.columns.duplicated(), axis=1))
A B C
0 1 2 3
0 4 5 6
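Note that both stacked blocks keep their original row label, so the result has two rows named 0. If you want the 0/1 index from the question, here is a sketch of the same idea that renumbers the rows (it also sidesteps axis=1 groupby, which recent pandas versions deprecate):
# Select first occurrences and duplicates separately, then stack them
# with a fresh 0..n-1 index.
dup = df.columns.duplicated()
print(pd.concat([df.loc[:, ~dup], df.loc[:, dup]], ignore_index=True))
#    A  B  C
# 0  1  2  3
# 1  4  5  6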
Cumsum until value exceeds a certain number:
Say that we have two data frames A and B that look like this:
A = pd.DataFrame({"type":['a','b','c'], "value":[100, 50, 30]})
B = pd.DataFrame({"type": ['a','a','a','a','b','b','b','c','c','c','c','c'], "value": [10,50,45,10,45,10,5,6,6,8,12,10]})
The two data frames would look like this.
>>> A
type value
0 a 100
1 b 50
2 c 30
>>> B
type value
0 a 10
1 a 50
2 a 45
3 a 10
4 b 45
5 b 10
6 b 5
7 c 6
8 c 6
9 c 8
10 c 12
11 c 10
For each group in "type" in data frame A, I would like to add up the value column in B until it reaches the number specified in the value column of A. I would also like to count the number of rows in B that were added. I've been trying to use cumsum(), but I don't know exactly how to stop the sum when the value is reached.
The output should be:
type value
0 a 3
1 b 2
2 c 4
Thank you,
Merging the two data frames beforehand should help:
import pandas as pd

# value_x holds B's values and value_y holds A's limits after the merge.
df = pd.merge(B, A, on = 'type')
df['cumsum'] = df.groupby('type')['value_x'].cumsum()
# Keep a row while the running total *before* it is still under the limit.
B[(df.groupby('type')['cumsum'].shift().fillna(0) < df['value_y'])].groupby('type').count()
# type value
# a 3
# b 2
# c 4
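For reference, a self-contained variant of the same idea that filters the merged frame directly instead of indexing B with a mask built from df (the suffixes keyword and the value_b/value_limit names are illustrative choices, not part of the original answer):
import pandas as pd

A = pd.DataFrame({"type": ['a', 'b', 'c'], "value": [100, 50, 30]})
B = pd.DataFrame({"type": ['a','a','a','a','b','b','b','c','c','c','c','c'],
                  "value": [10,50,45,10,45,10,5,6,6,8,12,10]})

df = pd.merge(B, A, on='type', suffixes=('_b', '_limit'))
# Running total minus the current value is the total *before* this row.
before = df.groupby('type')['value_b'].cumsum() - df['value_b']
out = df[before < df['value_limit']].groupby('type').size().reset_index(name='value')
print(out)
#   type  value
# 0    a      3
# 1    b      2
# 2    c      4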
Assuming B['type'] to be sorted as in the sample case, here's a NumPy-based solution:
import numpy as np

# Map each row of B to its group's position in A (both sorted by type).
IDs = np.searchsorted(A['type'], B['type'])
# Per-group totals of B['value'], accumulated across groups.
count_cumsum = np.bincount(IDs, B['value']).cumsum()
# Running-total threshold at which each group's limit is reached.
upper_bound = A['value'] + np.append(0, count_cumsum[:-1])
Bv_cumsum = np.cumsum(B['value'])
# Position of each group's first row in B.
grp_start = np.unique(IDs, return_index=True)[1]
A['output'] = np.searchsorted(Bv_cumsum, upper_bound) - grp_start + 1
My pandas dataframe:
In [11]: dframe = pd.DataFrame({"A":list("abcde"), "B":list("fghij"), "C":[1,2,3,4,5]}, index=[10,11,12,13,14])
Out[11]:
A B C
10 a f 1
11 b g 2
12 c h 3
13 d i 4
14 e j 5
Question: How can I compare the values in column 'C' of the first and second rows and get the index of the minimum?
Answer: 10
Try this:
dframe[0:2]['C'].idxmin()
I hope this helps!
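The slice dframe[0:2] selects by position here, not by the 10-14 labels; an equivalent sketch that makes the positional selection explicit with .iloc:
# Take the first two values of 'C' by position and return the
# index label of the smaller one.
dframe['C'].iloc[0:2].idxmin()   # -> 10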
Given a dataframe of the format
A B C D
.......
........
I would like to select the rows whose value in column B is greater than 0.6 times the last value in column B. For example,
Input:
A B C
1 0 5
2 3 4
3 6 6
4 8 1
5 9 3
Output:
A B C
3 6 6
4 8 1
5 9 3
I am currently doing the following,
x = df.loc[df.tail(1).index,'B']
which returns a Series object corresponding to the index and value of column B of the last row of the dataframe, and then,
new_df = df[df.B > x]
But I am getting the error,
ValueError: Series lengths must match to compare
How should I perform the query?
You first need to take the last value of column B using tail and multiply it by 0.6:
df[df['B'] > df['B'].tail(1).values[0] * 0.6]
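The original error occurs because df.loc[df.tail(1).index, 'B'] is a one-element Series, and pandas tries to align it with the full column when comparing. A sketch of the same fix that pulls the last value out as a plain scalar with .iloc (the threshold name is illustrative):
# .iloc[-1] returns a scalar, so the comparison broadcasts over the column.
threshold = 0.6 * df['B'].iloc[-1]
new_df = df[df['B'] > threshold]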