Find maximum value in python dataframe combining several rows - python

I have a dataframe that looks like the following (I have already sorted it by the item column). Items 1-10, 11-20, ... (every 10 items) are in the same category. I want to find the item in each category that has the highest score and return it.
What is the most efficient way to do that?
item score
1 1 10
3 4 1
4 6 6
39 11 2
8 12 1
9 13 1
10 15 24
11 17 9
12 18 12
13 20 7
14 22 1
59 25 3
18 28 3
19 29 2
22 34 2
23 37 1
24 38 3
25 39 2
26 40 2
27 42 3
29 45 1
31 48 1
32 53 4
33 58 4

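Assuming your dataframe is stored in df:
import numpy as np
import pandas as pd
g = df.groupby(pd.cut(df.item, np.arange(1, df.item.max(), 10), right=False))
Note that with this data the bin edges stop at 51, so items 53 and 58 fall outside every bin; extend the upper bound passed to np.arange if they should be included.
Get the row with the maximum score in each category:
max_score_ids = g.score.agg('idxmax')
This gives you the ids (index labels) of the rows that contain the max score in each category: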
item
[1, 11) 1
[11, 21) 10
[21, 31) 59
[31, 41) 24
[41, 51) 27
Then get the items associated with these ids:
df.loc[max_score_ids].item
1 1
10 15
59 25
24 38
27 42
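As a possible variant of the answer above, the category can be derived arithmetically instead of with pd.cut, so every item lands in a category no matter how large the biggest item is. This is only a sketch: the dataframe is rebuilt from the question's sample, and the names category and best are made up for illustration.

```python
import pandas as pd

# Sample data copied from the question (index labels kept as shown there).
df = pd.DataFrame(
    {
        "item": [1, 4, 6, 11, 12, 13, 15, 17, 18, 20, 22, 25,
                 28, 29, 34, 37, 38, 39, 40, 42, 45, 48, 53, 58],
        "score": [10, 1, 6, 2, 1, 1, 24, 9, 12, 7, 1, 3,
                  3, 2, 2, 1, 3, 2, 2, 3, 1, 1, 4, 4],
    },
    index=[1, 3, 4, 39, 8, 9, 10, 11, 12, 13, 14, 59,
           18, 19, 22, 23, 24, 25, 26, 27, 29, 31, 32, 33],
)

# Items 1-10 -> category 0, 11-20 -> category 1, and so on.
category = (df["item"] - 1) // 10

# idxmax returns the index label of the first row with the highest score
# in each category; loc then fetches those rows.
best = df.loc[df.groupby(category)["score"].idxmax()]
print(best)
```

On ties, idxmax keeps the first row it meets, so items 25 and 53 are returned rather than 28 and 58.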

Related

Pandas dataframe problem. Create column where a row cell gets the value of another row cell

I have this pandas dataframe. It is sorted by the "h" column. What I want is to add two new columns, as follows:
The items of each zone will have a max boundary and a min boundary (the same for every item in the zone). The max boundary will be the minimum "h" value of the previous zone, and the min boundary will be the maximum "h" value of the next zone.
name h w set row zone
ZZON5 40 36 A 0 0
DWOPN 38 44 A 1 0
5SWYZ 37 22 B 2 0
TFQEP 32 55 B 3 0
OQ33H 26 41 A 4 1
FTJVQ 24 25 B 5 1
F1RK2 20 15 B 6 1
266LT 18 19 A 7 1
HSJ3X 16 24 A 8 2
L754O 12 86 B 9 2
LWHDX 11 68 A 10 2
ZKB2F 9 47 A 11 2
5KJ5L 7 72 B 12 3
CZ7ET 6 23 B 13 3
SDZ1B 2 10 A 14 3
5KWRU 1 59 B 15 3
What I hope for:
name   h   w   set  row  zone  maxB  minB
ZZON5  40  36  A    0    0           26
DWOPN  38  44  A    1    0           26
5SWYZ  37  22  B    2    0           26
TFQEP  32  55  B    3    0           26
OQ33H  26  41  A    4    1     32    16
FTJVQ  24  25  B    5    1     32    16
F1RK2  20  15  B    6    1     32    16
266LT  18  19  A    7    1     32    16
HSJ3X  16  24  A    8    2     18    7
L754O  12  86  B    9    2     18    7
LWHDX  11  68  A    10   2     18    7
ZKB2F  9   47  A    11   2     18    7
5KJ5L  7   72  B    12   3     9
CZ7ET  6   23  B    13   3     9
SDZ1B  2   10  A    14   3     9
5KWRU  1   59  B    15   3     9
Any ideas?
First, group by zone and find the minimum and maximum "h" of each zone:
min_max_zone = df.groupby('zone').agg(min=('h', 'min'), max=('h', 'max'))
Now you can use apply to look up the previous zone's minimum and the next zone's maximum (this needs numpy imported as np; zones without a neighbour get NaN):
df['maxB'] = df['zone'].apply(lambda x: min_max_zone.loc[x-1, 'min']
                              if x-1 in min_max_zone.index else np.nan)
df['minB'] = df['zone'].apply(lambda x: min_max_zone.loc[x+1, 'max']
                              if x+1 in min_max_zone.index else np.nan)
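The same lookup can probably be written without a row-wise lambda by shifting the per-zone statistics and mapping them back onto the rows. The sketch below rebuilds only the h and zone columns from the question; zone_stats is a made-up name, and it assumes the zone labels sort in the same order as the zones follow each other.

```python
import pandas as pd

# Only the columns needed for the boundaries, copied from the question.
df = pd.DataFrame({
    "h": [40, 38, 37, 32, 26, 24, 20, 18, 16, 12, 11, 9, 7, 6, 2, 1],
    "zone": [0] * 4 + [1] * 4 + [2] * 4 + [3] * 4,
})

# One row per zone with the smallest and largest h.
zone_stats = df.groupby("zone")["h"].agg(["min", "max"])

# shift(1) hands each zone the minimum of the zone before it;
# shift(-1) hands each zone the maximum of the zone after it.
df["maxB"] = df["zone"].map(zone_stats["min"].shift(1))
df["minB"] = df["zone"].map(zone_stats["max"].shift(-1))
print(df)
```

The first zone has no previous zone and the last has no next one, so those cells come out as NaN and the two new columns are floats.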

get only previous three values from the dataframe

I am new to Python and pandas. What I have is a dataframe that looks like this:
Id Offset feature
0 0 2
0 5 2
0 11 0
0 21 22
0 28 22
1 32 0
1 38 21
1 42 21
1 52 21
1 55 0
1 58 0
1 62 1
1 66 1
1 70 1
2 73 0
2 78 1
2 79 1
From this, for every row where the feature is 0, I am trying to get the previous three rows together with their offsets.
So the output would be like:
offset Feature
11 2
21 22
28 22
These three rows are the ones preceding the 0 at offset 32.
In the same dataframe, for the next places where the feature is 0:
38 21
42 21
52 21
58 0
62 1
66 1
Is there any way I can get this?
Thanks
This will be on the basis of the document ID.
I am also quite new to pandas, but I have attempted to answer your question.
I saved your data as comma-separated values in data.csv and then used slicing to get the previous 3 rows.
import pandas as pd
df = pd.read_csv('./data.csv')
# For every row where Feature is 0, print the three rows before it
for index in df.loc[df['Feature'] == 0].index:
    print(df.loc[index-3:index-1])
The output looks like this. The leftmost column is the index, which you can discard if you don't want it. Is this what you were looking for?
Offset Feature
2 11 2
3 21 22
4 28 22
Offset Feature
6 38 21
7 42 21
8 52 21
Offset Feature
7 42 21
8 52 21
9 55 0
Offset Feature
11 62 1
12 66 1
13 70 1
Note: There might be a more pythonic way to do this.
You can take the 3 rows preceding each 0 value in the column using loc.
The code:
import pandas as pd
df = pd.read_csv("<path_of_the_file>")
# Index labels of the rows where Feature is 0
zero_indexes = list(df[df['Feature'] == 0].index)
for each_zero_index in zero_indexes:
    # loc slicing includes both ends, so this returns 4 rows:
    # the previous three plus the zero row itself.
    df1 = df.loc[each_zero_index - 3: each_zero_index]
    print(df1)
Output:
Offset Feature
2 11 2
3 21 22
4 28 22
5 32 0
Offset Feature
6 38 21
7 42 21
8 52 21
9 55 0
Offset Feature
7 42 21
8 52 21
9 55 0
10 58 0
Offset Feature
11 62 1
12 66 1
13 70 1
14 73 0
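Both answers slice by label with loc, which relies on the default 0, 1, 2, ... index and can behave unexpectedly when a 0 sits within the first three rows. A position-based sketch is given below. The helper name previous_rows is made up, and the sample values follow the index and offsets printed in the answers above (with feature 2 at offset 11), so treat the data as an assumption.

```python
import pandas as pd

# Assumed sample: offsets from the question, features as printed in the answers.
df = pd.DataFrame({
    "Offset": [0, 5, 11, 21, 28, 32, 38, 42, 52, 55, 58, 62, 66, 70, 73, 78, 79],
    "Feature": [2, 2, 2, 22, 22, 0, 21, 21, 21, 0, 0, 1, 1, 1, 0, 1, 1],
})


def previous_rows(frame, n=3):
    """Map the offset of each zero row to the (up to) n rows before it."""
    result = {}
    for pos, value in enumerate(frame["Feature"]):
        if value == 0:
            # iloc slices by position; max() keeps the start from going negative.
            start = max(pos - n, 0)
            result[frame["Offset"].iloc[pos]] = frame.iloc[start:pos]
    return result


for offset, rows in previous_rows(df).items():
    print(f"rows before the 0 at offset {offset}:")
    print(rows)
```

Because it slices by position, the sketch should also work on a dataframe whose index is not the default range.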

python pandas assign yyyy-mm-dd from multiple years into accumulated week numbers

Given a file with the following columns:
date, userid, amount
where date is in yyyy-mm-dd format. I am trying to use python pandas to assign yyyy-mm-dd from multiple years into accumulated week numbers. For example:
2017-01-01 => 1
2017-12-31 => 52
2018-01-01 => 53
df_counts_dates=pd.read_csv("counts.csv")
print (df_counts_dates['date'].unique())
df = pd.to_datetime(df_counts_dates['date'])
print (df.unique())
print (df.dt.week.unique())
Since the data contains dates from Aug 2017 to Aug 2018, the above returns:
[33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 1 2 3 4 5
6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
31 32]
I am wondering if there is any easy way to make the first date "week 1", and make the week number accumulate across years instead of becoming 1 at the beginning of each year?
I believe you need a slightly different approach: subtract the first date from every value in the column, convert the resulting timedeltas to days, floor-divide by 7, and finally add 1 so the numbering does not start at 0:
rng = pd.date_range('2017-08-01', periods=365)
df = pd.DataFrame({'date': rng, 'a': range(365)})
print (df.head())
date a
0 2017-08-01 0
1 2017-08-02 1
2 2017-08-03 2
3 2017-08-04 3
4 2017-08-05 4
w = ((df['date'] - df['date'].iloc[0]).dt.days // 7 + 1).unique()
print (w)
[ 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48
49 50 51 52 53]
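Applied to a handful of concrete dates, the calculation above might look like the sketch below. The dates are made up for illustration, and dates.min() is used instead of iloc[0] on the assumption that the column may not be sorted.

```python
import pandas as pd

# Hypothetical dates spanning a year boundary.
dates = pd.to_datetime(pd.Series([
    "2017-08-01", "2017-08-07", "2017-08-08",
    "2017-12-31", "2018-01-01", "2018-07-31",
]))

# Whole days since the earliest date, grouped into 7-day blocks, numbered from 1.
week = (dates - dates.min()).dt.days // 7 + 1
print(week.tolist())
```

These are 7-day blocks counted from the first date, not calendar weeks, so a block does not necessarily start on a Monday; here 2017-12-31 and 2018-01-01 share block 22.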

grouping by id and a condition

I have a dataframe df
df=DataFrame({'id': ['a','a','a','a','a','a','a','b','b','b','b','b','b','b','b','b','b'],
'min':[10,17,21,22,22,7,58,15,17,19,19,19,19,19,25,26,26],
'day':[15,15,15,15,15,17,17,41,41,41,41,41,41,41,57,57,57]})
that looks like
id min day
0 a 10 15
1 a 17 15
2 a 21 15
3 a 30 15
4 a 50 15
5 a 57 17
6 a 58 17
7 b 15 41
8 b 17 41
9 b 19 41
10 b 19 41
11 b 19 41
12 b 19 41
13 b 19 41
14 b 25 57
15 b 26 57
16 b 26 57
I want a new column that categorizes the data based on the id and the relationship between consecutive rows: if the difference in min between consecutive rows is less than 8 and the day value is the same, I want to assign them to the same group. My output would look like this:
id min day category
0 a 10 15 1
1 a 17 15 1
2 a 21 15 1
3 a 30 15 2
4 a 50 15 3
5 a 57 17 4
6 a 58 17 4
7 b 15 41 5
8 b 17 41 5
9 b 19 41 5
10 b 19 41 5
11 b 19 41 5
12 b 19 41 5
13 b 19 41 5
14 b 25 57 6
15 b 26 57 6
16 b 26 57 6
Hope this helps. Let me know your views.
All the best.
import pandas as pd
df=pd.DataFrame({'id': ['a','a','a','a','a','a','a','b','b','b','b','b','b','b','b','b','b'],
'min':[10,17,21,22,22,7,58,15,17,19,19,19,19,19,25,26,26],
'day':[15,15,15,15,15,17,17,41,41,41,41,41,41,41,57,57,57]})
# initialize the category counter to 1
cat = 1
# the first row always belongs to category 1
new_series = [cat]
# start from 1, not 0, because row 0 has no previous row to compare with
for i in range(1, len(df)):
    if df.iloc[i]['day'] == df.iloc[i-1]['day']:
        # same day: start a new category only if min jumped by 8 or more
        if df.iloc[i]['min'] - df.iloc[i-1]['min'] >= 8:
            cat += 1
    else:
        # a different day always starts a new category
        cat += 1
    new_series.append(cat)
df['category'] = new_series
print(df)
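The loop above can likely be replaced by a vectorized version: flag every row that starts a new group, then take a running sum of the flags. This sketch uses the min values from the table printed in the question (not the ones in the DataFrame constructor, which differ) and also treats a change of id as a new group, which is an assumption based on the question's wording.

```python
import pandas as pd

# Values taken from the table shown in the question.
df = pd.DataFrame({
    "id": ["a"] * 7 + ["b"] * 10,
    "min": [10, 17, 21, 30, 50, 57, 58, 15, 17, 19, 19, 19, 19, 19, 25, 26, 26],
    "day": [15, 15, 15, 15, 15, 17, 17, 41, 41, 41, 41, 41, 41, 41, 57, 57, 57],
})

# A row starts a new group when the id changes, the day changes,
# or min grew by 8 or more compared with the previous row.
new_group = (
    df["id"].ne(df["id"].shift())
    | df["day"].ne(df["day"].shift())
    | df["min"].diff().ge(8)
)

# The running count of group starts is the category number (first row -> 1).
df["category"] = new_group.cumsum()
print(df)
```

The first row is always flagged because shift() leaves a missing value there, so the numbering starts at 1.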

Pandas replace issue

I can use pandas replace to replace values in a dataframe using a dictionary:
prod_dict = {1:'Productive',2:'Moderate',3:'None'}
df['val'].replace(prod_dict,inplace=True)
What do I do if I want to replace a set of values in the dataframe with a single number? E.g. I want to map all values from 1 to 20 to 1, all values from 21 to 40 to 2, and all values from 41 to 100 to 3. How do I specify this in a dictionary and use it in pandas replace?
You can do that with apply, which calls a function on every element, and a lambda that returns the replacement value for the range each element falls into.
I will go through a quick example here.
First, I will create a dataframe to showcase the algorithm:
df = pd.DataFrame(range(50), columns=list('B'))
This function generates a list of the integers from i to j, inclusive:
def genValues(i, j):
    return [x for x in range(j+1) if x >= i]
I will create a lambda function to map the values (note that this example maps 0 to 20, not 1 to 20, to 1):
df['E']= df['B'].apply(lambda x: 1 if x in genValues(0,20) else 2 if x in genValues(21,40) else 3 if x in genValues(41,100) else x)
print(df)
The output:
B E
0 0 1
1 1 1
2 2 1
3 3 1
4 4 1
5 5 1
6 6 1
7 7 1
8 8 1
9 9 1
10 10 1
11 11 1
12 12 1
13 13 1
14 14 1
15 15 1
16 16 1
17 17 1
18 18 1
19 19 1
20 20 1
21 21 2
22 22 2
23 23 2
24 24 2
25 25 2
26 26 2
27 27 2
28 28 2
29 29 2
30 30 2
31 31 2
32 32 2
33 33 2
34 34 2
35 35 2
36 36 2
37 37 2
38 38 2
39 39 2
40 40 2
41 41 3
42 42 3
43 43 3
44 44 3
45 45 3
46 46 3
47 47 3
48 48 3
49 49 3
You can also overwrite the original column by assigning the result back to it:
df['B']= df['B'].apply(lambda x: 1 if x in genValues(0,20) else 2 if x in genValues(21,40) else 3 if x in genValues(41,100) else x)
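For range-to-label mappings like this, pd.cut may be a simpler alternative to building lists inside a lambda. The sketch below is not the answer's method: the val column follows the question's example, while the bucket column name and the sample values are made up.

```python
import pandas as pd

# Hypothetical values at the edges of each range from the question.
df = pd.DataFrame({"val": [1, 20, 21, 40, 41, 100]})

# Bins are right-inclusive by default: (0, 20], (20, 40], (40, 100].
df["bucket"] = pd.cut(df["val"], bins=[0, 20, 40, 100], labels=[1, 2, 3]).astype(int)
print(df)
```

Values outside the outermost edges would become NaN, in which case the final astype(int) would fail, so the edges need to cover the whole range of the data.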
