Choosing the larger probability from a specific indexID - python

I have a dataframe as follows:
indexID matchID order userClean Probability
0 0 1 0 clean 35
1 0 2 1 clean 75
2 0 2 2 clean 25
5 3 4 5 clean 40
6 3 5 6 clean 85
9 4 5 9 clean 74
12 6 7 12 clean 23
13 6 8 13 clean 72
14 7 8 14 clean 85
15 9 10 15 clean 76
16 10 11 16 clean 91
19 13 14 19 clean 27
23 13 17 23 clean 10
28 13 18 28 clean 71
32 20 21 32 clean 97
33 20 22 33 clean 30
For each repeated indexID, I would like to keep the entry with the higher probability marked as clean and mark the other as dirty.
The output should look something like this:
indexID matchID order userClean Probability
0 0 1 0 dirty 35
1 0 2 1 clean 75
2 0 2 2 dirty 25
5 3 4 5 dirty 40
6 3 5 6 clean 85
9 4 5 9 clean 74
12 6 7 12 dirty 23
13 6 8 13 clean 72
14 7 8 14 clean 85
15 9 10 15 clean 76
16 10 11 16 clean 91
19 13 14 19 dirty 27
23 13 17 23 dirty 10
28 13 18 28 clean 71
32 20 21 32 clean 97
33 20 22 33 dirty 30

If you need a pandas solution, create a boolean mask by comparing the Probability column via Series.ne (!=) against the per-group maxima from groupby + transform, which returns a Series the same size as df:
mask = df['Probability'].ne(df.groupby('indexID')['Probability'].transform('max'))
df.loc[mask, 'userClean'] = 'dirty'
print (df)
indexID matchID order userClean Probability
0 0 1 0 dirty 35
1 0 2 1 clean 75
2 0 2 2 dirty 25
5 3 4 5 dirty 40
6 3 5 6 clean 85
9 4 5 9 clean 74
12 6 7 12 dirty 23
13 6 8 13 clean 72
14 7 8 14 clean 85
15 9 10 15 clean 76
16 10 11 16 clean 91
19 13 14 19 dirty 27
23 13 17 23 dirty 10
28 13 18 28 clean 71
32 20 21 32 clean 97
33 20 22 33 dirty 30
Detail:
print (df.groupby('indexID')['Probability'].transform('max'))
0 75
1 75
2 75
5 85
6 85
9 74
12 72
13 72
14 85
15 76
16 91
19 71
23 71
28 71
32 97
33 97
Name: Probability, dtype: int64
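As a compact alternative (a sketch, not from the original answer), numpy.where can assign both labels in one step; on ties within a group both rows stay clean, the same behavior as the mask approach above:

import numpy as np

# True where the row holds its group's maximum Probability
is_max = df['Probability'].eq(df.groupby('indexID')['Probability'].transform('max'))
df['userClean'] = np.where(is_max, 'clean', 'dirty')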
If you instead want to compare each value against the overall mean with gt (>):
mask = df['Probability'].gt(df['Probability'].mean())
df.loc[mask, 'userClean'] = 'dirty'
print (df)
indexID matchID order userClean Probability
0 0 1 0 clean 35
1 0 2 1 dirty 75
2 0 2 2 clean 25
5 3 4 5 clean 40
6 3 5 6 dirty 85
9 4 5 9 dirty 74
12 6 7 12 clean 23
13 6 8 13 dirty 72
14 7 8 14 dirty 85
15 9 10 15 dirty 76
16 10 11 16 dirty 91
19 13 14 19 clean 27
23 13 17 23 clean 10
28 13 18 28 dirty 71
32 20 21 32 dirty 97
33 20 22 33 clean 30

Related

Adding a new column filled with the numbers 1 to 6 for the first 6 rows, the same for the next 6 rows, with python

I am trying to add a new column in which every 6 rows of the dataframe are filled with the numbers 1 to 6, repeating for all rows in the dataframe. The illustration below shows what the output should look like.
input
ID
0 20
1 20
2 20
3 20
4 20
5 20
6 34
7 34
8 34
9 34
10 34
11 34
12 67
13 67
14 67
15 67
16 67
17 67
output
ID 6_months
0 20 1
1 20 2
2 20 3
3 20 4
4 20 5
5 20 6
6 34 1
7 34 2
8 34 3
9 34 4
10 34 5
11 34 6
12 67 1
13 67 2
14 67 3
15 67 4
16 67 5
17 67 6
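No answer was included in this excerpt; here is a minimal sketch (assuming the frame is ordered as shown, with a length that is a multiple of 6): derive the 1-to-6 cycle from the row position, or from the ID runs.

import numpy as np
import pandas as pd

df = pd.DataFrame({'ID': [20] * 6 + [34] * 6 + [67] * 6})

# Cycle 1..6 by row position: rows 0-5 get 1-6, rows 6-11 get 1-6 again, ...
df['6_months'] = np.arange(len(df)) % 6 + 1

# Equivalent if each ID block spans exactly six rows, as in the illustration:
# df['6_months'] = df.groupby('ID').cumcount() + 1
print(df)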

Get Max Value of a Row in a subset of Columns respecting a condition

I have a dataframe that looks like this:
FakeDist -5 -4 -3 -2 -1 0 1 2 3 4 5
1 37 14 17 29 31 34 32 31 21 17 18
2 12 13 12 16 30 33 37 32 32 15 42
3 40 16 29 31 36 32 30 19 16 15 12
4 12 14 12 28 28 30 29 27 16 18 33
5 12 13 16 17 28 32 33 30 29 17 35
I want to add a column that will be the Column_Name of the Maximum Value per Row.
I did that with:
df['MaxVal_Dist'] = df.idxmax(axis=1)
Which gives me this df:
FakeDist -5 -4 ... MaxVal_Dist
1 37 14 ... -5
2 12 13 ... 5
3 40 16 ... -5
4 12 14 ... 5
5 12 13 ... 5
But my real goal is to add a condition: I want the max value only among the columns where 'FakeDist' is between -2 and 2, to get the following result:
FakeDist -5 -4 ... MaxVal_Dist
1 37 14 ... 0
2 12 13 ... 1
3 40 16 ... -1
4 12 14 ... 0
5 12 13 ... 1
I did try to use df.apply but couldn't find how to make it work.
I have a workaround idea: store a subset of columns (from -2 to 2) in a new dataframe, create my new column with the max there, and then add that result column back to my initial dataframe. But that seems very inelegant, and I am sure there is a much better way.
I would be really glad to learn the elegant way to do that from you!
You can use boolean indexing with loc to filter the columns in the range -2 to 2, then use idxmax along axis=1:
c = df.columns.astype(int)
df['MaxVal_Dist'] = df.loc[:, (c >= -2) & (c <= 2)].idxmax(1)
Result:
FakeDist -5 -4 -3 -2 -1 0 1 2 3 4 5 MaxVal_Dist
1 37 14 17 29 31 34 32 31 21 17 18 0
2 12 13 12 16 30 33 37 32 32 15 42 1
3 40 16 29 31 36 32 30 19 16 15 12 -1
4 12 14 12 28 28 30 29 27 16 18 33 0
5 12 13 16 17 28 32 33 30 29 17 35 1
You can also try a list comprehension:
In [1159]: cols = [i for i in df.columns[1:] if -2 <= int(i) <= 2]
In [1161]: df['MaxVal_Dist'] = df[cols].idxmax(axis=1)
In [1162]: df
Out[1162]:
FakeDist -5 -4 -3 -2 -1 0 1 2 3 4 5 MaxVal_Dist
0 1 37 14 17 29 31 34 32 31 21 17 18 0
1 2 12 13 12 16 30 33 37 32 32 15 42 1
2 3 40 16 29 31 36 32 30 19 16 15 12 -1
3 4 12 14 12 28 28 30 29 27 16 18 33 0
4 5 12 13 16 17 28 32 33 30 29 17 35 1
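For reference, a self-contained setup reproducing the example above (a sketch; the column labels are assumed to be the integers -5 through 5):

import pandas as pd

data = [[37, 14, 17, 29, 31, 34, 32, 31, 21, 17, 18],
        [12, 13, 12, 16, 30, 33, 37, 32, 32, 15, 42],
        [40, 16, 29, 31, 36, 32, 30, 19, 16, 15, 12],
        [12, 14, 12, 28, 28, 30, 29, 27, 16, 18, 33],
        [12, 13, 16, 17, 28, 32, 33, 30, 29, 17, 35]]
df = pd.DataFrame(data, columns=range(-5, 6), index=range(1, 6))
df.index.name = 'FakeDist'

# Restrict idxmax to the columns between -2 and 2
c = df.columns.astype(int)
df['MaxVal_Dist'] = df.loc[:, (c >= -2) & (c <= 2)].idxmax(axis=1)
print(df)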

Python Heap Sort is changing numbers to 0 instead of sorting them

My professor gave us some code to implement heapsort in our sorting class, and I can't seem to get it to work right. Every time I print the result, some of the numbers have been changed to 0 (or 1s with a random fill) and are not sorted. I know this because I have a fill function that creates an array of increasing numbers that it is supposed to sort.
def heapsort(self):
    n = self.size  # Doing this for simplicity
    for k in range((n - 2) // 2, -1, -1):
        self.downheap(n, k)
    for m in range(n - 1, 0, -1):
        self.data[m], self.data[0] = self.data[0], self.data[m]
        self.downheap(m, 0)

def downheap(self, n, k):
    if n > 1:
        key = self.data[k]
        isHeap = False
        while (k <= (n - 2) // 2) and not isHeap:
            j = 2 * k + 1
            if j + 1 < n:
                if self.data[j] < self.data[j + 1]:
                    j += 1
            if key >= self.data[j]:
                isHeap = True
            else:
                k = j
        self.data[k] = key
Unsorted list looks like-
17 19 8 8 9 3 17 13 9 1
14 19 15 12 19 4 12 6 1 8
13 8 10 5 6 6 9 17 6 5
12 5 7 16 9 10 11 3 10 14
5 3 12 1 3 10 18 10 4 19
5 10 14 9 16 8 3 14 4 13
12 8 13 10 16 17 16 10 11 3
16 9 3 16 15 3 2 11 15 3
3 3 18 7 9 6 10 4 1 4
15 10 9 1 2 18 14 11 4 3
"sorted" list looks like-
1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 8 13
1 3 5 6 6 9 12 6 5 12
1 1 3 3 1 1 3 10 8 5
3 12 1 1 1 1 10 4 3 5
10 6 9 9 8 3 6 4 5 12
8 8 1 2 1 2 3 3 2 1
9 3 11 6 3 2 1 10 3 3
3 5 7 3 6 1 1 1 4 3
1 1 1 2 10 5 4 4 3 17
And here's what it does to the increasing numbers -
0 1 2 3 4 5 6 7 8 9
10 11 12 13 14 15 16 17 18 19
20 21 22 23 24 25 26 27 28 29
30 31 32 33 34 35 36 37 38 39
40 41 42 43 44 45 46 47 48 49
50 51 52 53 54 55 56 57 58 59
60 61 62 63 64 65 66 67 68 69
70 71 72 73 74 75 76 77 78 79
80 81 82 83 84 85 86 87 88 89
90 91 92 93 94 95 96 97 98 99
"Sorted"
0 0 3 4 0 0 7 8 9 10
11 12 0 0 15 16 17 18 19 20
21 22 23 24 25 26 27 28 0 0
31 32 33 34 35 36 37 38 39 40
41 42 43 44 45 46 47 48 49 24
51 25 12 5 55 27 13 28 59 29
0 0 63 31 15 32 67 33 16 7
71 35 17 36 75 37 3 8 79 39
19 40 83 41 20 9 87 43 21 44
91 45 4 1 95 47 23 48 11 0
I've been poring over this the last couple of days; I know I wrote everything down correctly, and I can't for the life of me figure out what is going on. Any help would be appreciated. Thanks.
Figured it out!
I had to put
self.data[k] = self.data[j]
just above k = j in downheap.
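For clarity, here is downheap with that one line added (the larger child must be copied up before descending, otherwise entries are overwritten by stale values):

def downheap(self, n, k):
    if n > 1:
        key = self.data[k]
        isHeap = False
        while (k <= (n - 2) // 2) and not isHeap:
            j = 2 * k + 1                        # left child
            if j + 1 < n and self.data[j] < self.data[j + 1]:
                j += 1                           # right child is larger
            if key >= self.data[j]:
                isHeap = True                    # key is already in heap position
            else:
                self.data[k] = self.data[j]      # move the larger child up
                k = j
        self.data[k] = key                       # place key in its final slot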

Columns located within a column

I am trying to extract a dataframe from a web API and can't work out how to break the columns out. Home and Away have breakdowns nested inside them, so the columns should read Home Wins, Home Draws, etc.
import json
import requests
import pandas as pd

url = "http://api.football-data.org/v1/soccerseasons/398/leagueTable/?matchday=38"
response = requests.get(url)
response_json = response.content
result = json.loads(response_json)
football = pd.DataFrame(result['standing'],
                        columns=['position', 'teamName', 'playedGames', 'wins', 'draws',
                                 'losses', 'goals', 'goalsAgainst', 'home', 'away',
                                 'goalDifference', 'points'])
football
football.home
this shows the problem:
0 {u'wins': 12, u'losses': 1, u'draws': 6, u'goa...
I think you can use json_normalize:
import pandas as pd
import json
import requests
from pandas.io.json import json_normalize
url = "http://api.football-data.org/v1/soccerseasons/398/leagueTable/?matchday=38"
result = json.loads(requests.get(url).text)
#print (result)
df = json_normalize(result["standing"])
print (df)
_links.team.href away.draws away.goals \
0 http://api.football-data.org/v1/teams/338 6 33
1 http://api.football-data.org/v1/teams/57 7 34
2 http://api.football-data.org/v1/teams/73 7 34
3 http://api.football-data.org/v1/teams/65 7 24
4 http://api.football-data.org/v1/teams/66 4 22
5 http://api.football-data.org/v1/teams/340 6 20
6 http://api.football-data.org/v1/teams/563 7 31
7 http://api.football-data.org/v1/teams/64 4 30
8 http://api.football-data.org/v1/teams/70 5 19
9 http://api.football-data.org/v1/teams/61 5 27
10 http://api.football-data.org/v1/teams/62 9 24
11 http://api.football-data.org/v1/teams/72 5 22
12 http://api.football-data.org/v1/teams/346 3 20
13 http://api.football-data.org/v1/teams/74 8 14
14 http://api.football-data.org/v1/teams/354 6 20
15 http://api.football-data.org/v1/teams/1044 4 22
16 http://api.football-data.org/v1/teams/71 6 25
17 http://api.football-data.org/v1/teams/67 3 12
18 http://api.football-data.org/v1/teams/68 2 13
19 http://api.football-data.org/v1/teams/58 3 13
away.goalsAgainst away.losses away.wins \
0 18 2 11
1 25 4 8
2 20 3 9
3 20 5 7
4 26 8 7
5 19 6 7
6 25 5 7
7 28 7 8
8 31 8 6
9 23 7 7
10 25 5 5
11 32 10 4
12 31 10 6
13 22 7 4
14 28 8 5
15 33 9 6
16 42 10 3
17 41 14 2
18 37 14 3
19 41 15 1
crestURI draws goalDifference \
0 http://upload.wikimedia.org/wikipedia/en/6/63/... 12 32
1 http://upload.wikimedia.org/wikipedia/en/5/53/... 11 29
2 http://upload.wikimedia.org/wikipedia/de/b/b4/... 13 34
3 http://upload.wikimedia.org/wikipedia/de/f/fd/... 9 30
4 http://upload.wikimedia.org/wikipedia/de/d/da/... 9 14
5 http://upload.wikimedia.org/wikipedia/de/c/c9/... 9 18
6 http://upload.wikimedia.org/wikipedia/de/e/e0/... 14 14
7 http://upload.wikimedia.org/wikipedia/de/0/0a/... 12 13
8 http://upload.wikimedia.org/wikipedia/de/a/a3/... 9 -14
9 http://upload.wikimedia.org/wikipedia/de/5/5c/... 14 6
10 http://upload.wikimedia.org/wikipedia/de/f/f9/... 14 4
11 http://upload.wikimedia.org/wikipedia/de/a/ab/... 11 -10
12 https://upload.wikimedia.org/wikipedia/en/e/e2... 9 -10
13 http://upload.wikimedia.org/wikipedia/de/8/8b/... 13 -14
14 http://upload.wikimedia.org/wikipedia/de/b/bf/... 9 -12
15 https://upload.wikimedia.org/wikipedia/de/4/41... 9 -22
16 http://upload.wikimedia.org/wikipedia/de/6/60/... 12 -14
17 http://upload.wikimedia.org/wikipedia/de/5/56/... 10 -21
18 http://upload.wikimedia.org/wikipedia/de/8/8c/... 7 -28
19 http://upload.wikimedia.org/wikipedia/de/9/9f/... 8 -49
goals ... home.goals home.goalsAgainst home.losses home.wins \
0 68 ... 35 18 1 12
1 65 ... 31 11 3 12
2 69 ... 35 15 3 10
3 71 ... 47 21 5 12
4 49 ... 27 9 2 12
5 59 ... 39 22 5 11
6 65 ... 34 26 3 9
7 63 ... 33 22 3 8
8 41 ... 22 24 7 8
9 59 ... 32 30 5 5
10 59 ... 35 30 8 6
11 42 ... 20 20 5 8
12 40 ... 20 19 7 6
13 34 ... 20 26 8 6
14 39 ... 19 23 10 6
15 45 ... 23 34 9 5
16 48 ... 23 20 7 6
17 44 ... 32 24 5 7
18 39 ... 26 30 8 6
19 27 ... 14 35 12 2
losses playedGames points position teamName wins
0 3 38 81 1 Leicester City FC 23
1 7 38 71 2 Arsenal FC 20
2 6 38 70 3 Tottenham Hotspur FC 19
3 10 38 66 4 Manchester City FC 19
4 10 38 66 5 Manchester United FC 19
5 11 38 63 6 Southampton FC 18
6 8 38 62 7 West Ham United FC 16
7 10 38 60 8 Liverpool FC 16
8 15 38 51 9 Stoke City FC 14
9 12 38 50 10 Chelsea FC 12
10 13 38 47 11 Everton FC 11
11 15 38 47 12 Swansea City FC 12
12 17 38 45 13 Watford FC 12
13 15 38 43 14 West Bromwich Albion FC 10
14 18 38 42 15 Crystal Palace FC 11
15 18 38 42 16 AFC Bournemouth 11
16 17 38 39 17 Sunderland AFC 9
17 19 38 37 18 Newcastle United FC 9
18 22 38 34 19 Norwich City FC 9
19 27 38 17 20 Aston Villa FC 3
[20 rows x 22 columns]
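A side note: in pandas 1.0 and later, json_normalize is available directly in the top-level namespace, so the pandas.io.json import can be dropped:

df = pd.json_normalize(result["standing"])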

How to plot multiple lines as histograms per group from a pandas DataFrame

I am trying to look at 'time of day' effects on my users on a week over week basis to get a quick visual take on how consistent time of day trends are. So as a first start I've used this:
df[df['week'] < 10][['realLocalTime', 'week']].hist(by = 'week', bins = 24, figsize = (15, 15))
This produces a grid of 24-bin histograms, one subplot per week (plot image omitted).
This is a nice easy start, but what I would really like is to represent the histogram as a line plot, and overlay all the lines, one for each week on the same plot. Is there a way to do this?
I have a bit more experience with ggplot, where I would just do this by adding a factor-level dependency on color. Is there a similarly easy way to do this with pandas and/or matplotlib?
Here's what my data looks like:
realLocalTime week
1 12 10
2 12 10
3 12 10
4 12 10
5 13 5
6 17 5
7 17 5
8 6 6
9 17 5
10 20 6
11 18 5
12 18 5
13 19 6
14 21 6
15 21 6
16 14 6
17 6 6
18 0 6
19 21 5
20 17 6
21 23 6
22 22 6
23 22 6
24 17 6
25 22 5
26 13 6
27 23 6
28 22 5
29 21 6
30 17 6
... ... ...
70 14 5
71 9 5
72 19 6
73 19 6
74 21 6
75 20 5
76 20 5
77 21 5
78 15 6
79 22 6
80 23 6
81 15 6
82 12 6
83 7 6
84 9 6
85 8 6
86 22 6
87 22 6
88 22 6
89 8 5
90 8 5
91 8 5
92 9 5
93 7 5
94 22 5
95 8 6
96 10 6
97 0 6
98 22 5
99 14 6
Maybe you can simply use crosstab to compute the number of elements per week and plot it.
import pandas as pd

# Test data
d = {'realLocalTime': ['12', '14', '14', '12', '13', '17', '14', '17'],
     'week': ['10', '10', '10', '10', '5', '5', '6', '6']}
df = pd.DataFrame(d)
ax = pd.crosstab(df['realLocalTime'], df['week']).plot()
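Note that with string dtypes, as in the test data above, labels sort lexicographically ('10' comes before '5'), so casting to int first keeps the axis and legend in numeric order (a small tweak, not part of the original answer):

ax = pd.crosstab(df['realLocalTime'].astype(int), df['week'].astype(int)).plot()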
Alternatively, use groupby and value_counts:
df.groupby('week').realLocalTime.value_counts().unstack(0).fillna(0).plot()
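A slightly fuller version of that one-liner (a sketch with axis labels added; assumes matplotlib is installed and df has the realLocalTime and week columns shown above):

import matplotlib.pyplot as plt

counts = (df.groupby('week')['realLocalTime']
            .value_counts()
            .unstack(0)     # one column, hence one line, per week
            .fillna(0)
            .sort_index())  # hours in order along the x-axis
ax = counts.plot()
ax.set_xlabel('realLocalTime (hour of day)')
ax.set_ylabel('count')
plt.show()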
