Columns located within a column - python

I am trying to build a DataFrame from a web API and can't work out how to break the nested columns out. The home and away columns each contain a breakdown, so the result should have columns such as Home Wins, Home Draws, etc.
url = "http://api.football-data.org/v1/soccerseasons/398/leagueTable/?matchday=38"
response = requests.get(url)
response_json = response.content
result = json.loads(response_json)
football = pd.DataFrame(result['standing'], columns=['position','teamName','playedGames','wins','draws','losses','goals',
'goalsAgainst','home','away','goalDifference','points'])
football
football.home
This shows the problem:
0 {u'wins': 12, u'losses': 1, u'draws': 6, u'goa...

I think you can use json_normalize:
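import pandas as pd
import json
import requests
# In pandas >= 1.0, json_normalize is available as pd.json_normalize;
# older versions need: from pandas.io.json import json_normalize
url = "http://api.football-data.org/v1/soccerseasons/398/leagueTable/?matchday=38"
result = json.loads(requests.get(url).text)
#print (result)
# Flatten nested dicts such as "home" and "away" into dotted column names
df = pd.json_normalize(result["standing"])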
print (df)
_links.team.href away.draws away.goals \
0 http://api.football-data.org/v1/teams/338 6 33
1 http://api.football-data.org/v1/teams/57 7 34
2 http://api.football-data.org/v1/teams/73 7 34
3 http://api.football-data.org/v1/teams/65 7 24
4 http://api.football-data.org/v1/teams/66 4 22
5 http://api.football-data.org/v1/teams/340 6 20
6 http://api.football-data.org/v1/teams/563 7 31
7 http://api.football-data.org/v1/teams/64 4 30
8 http://api.football-data.org/v1/teams/70 5 19
9 http://api.football-data.org/v1/teams/61 5 27
10 http://api.football-data.org/v1/teams/62 9 24
11 http://api.football-data.org/v1/teams/72 5 22
12 http://api.football-data.org/v1/teams/346 3 20
13 http://api.football-data.org/v1/teams/74 8 14
14 http://api.football-data.org/v1/teams/354 6 20
15 http://api.football-data.org/v1/teams/1044 4 22
16 http://api.football-data.org/v1/teams/71 6 25
17 http://api.football-data.org/v1/teams/67 3 12
18 http://api.football-data.org/v1/teams/68 2 13
19 http://api.football-data.org/v1/teams/58 3 13
away.goalsAgainst away.losses away.wins \
0 18 2 11
1 25 4 8
2 20 3 9
3 20 5 7
4 26 8 7
5 19 6 7
6 25 5 7
7 28 7 8
8 31 8 6
9 23 7 7
10 25 5 5
11 32 10 4
12 31 10 6
13 22 7 4
14 28 8 5
15 33 9 6
16 42 10 3
17 41 14 2
18 37 14 3
19 41 15 1
crestURI draws goalDifference \
0 http://upload.wikimedia.org/wikipedia/en/6/63/... 12 32
1 http://upload.wikimedia.org/wikipedia/en/5/53/... 11 29
2 http://upload.wikimedia.org/wikipedia/de/b/b4/... 13 34
3 http://upload.wikimedia.org/wikipedia/de/f/fd/... 9 30
4 http://upload.wikimedia.org/wikipedia/de/d/da/... 9 14
5 http://upload.wikimedia.org/wikipedia/de/c/c9/... 9 18
6 http://upload.wikimedia.org/wikipedia/de/e/e0/... 14 14
7 http://upload.wikimedia.org/wikipedia/de/0/0a/... 12 13
8 http://upload.wikimedia.org/wikipedia/de/a/a3/... 9 -14
9 http://upload.wikimedia.org/wikipedia/de/5/5c/... 14 6
10 http://upload.wikimedia.org/wikipedia/de/f/f9/... 14 4
11 http://upload.wikimedia.org/wikipedia/de/a/ab/... 11 -10
12 https://upload.wikimedia.org/wikipedia/en/e/e2... 9 -10
13 http://upload.wikimedia.org/wikipedia/de/8/8b/... 13 -14
14 http://upload.wikimedia.org/wikipedia/de/b/bf/... 9 -12
15 https://upload.wikimedia.org/wikipedia/de/4/41... 9 -22
16 http://upload.wikimedia.org/wikipedia/de/6/60/... 12 -14
17 http://upload.wikimedia.org/wikipedia/de/5/56/... 10 -21
18 http://upload.wikimedia.org/wikipedia/de/8/8c/... 7 -28
19 http://upload.wikimedia.org/wikipedia/de/9/9f/... 8 -49
goals ... home.goals home.goalsAgainst home.losses home.wins \
0 68 ... 35 18 1 12
1 65 ... 31 11 3 12
2 69 ... 35 15 3 10
3 71 ... 47 21 5 12
4 49 ... 27 9 2 12
5 59 ... 39 22 5 11
6 65 ... 34 26 3 9
7 63 ... 33 22 3 8
8 41 ... 22 24 7 8
9 59 ... 32 30 5 5
10 59 ... 35 30 8 6
11 42 ... 20 20 5 8
12 40 ... 20 19 7 6
13 34 ... 20 26 8 6
14 39 ... 19 23 10 6
15 45 ... 23 34 9 5
16 48 ... 23 20 7 6
17 44 ... 32 24 5 7
18 39 ... 26 30 8 6
19 27 ... 14 35 12 2
losses playedGames points position teamName wins
0 3 38 81 1 Leicester City FC 23
1 7 38 71 2 Arsenal FC 20
2 6 38 70 3 Tottenham Hotspur FC 19
3 10 38 66 4 Manchester City FC 19
4 10 38 66 5 Manchester United FC 19
5 11 38 63 6 Southampton FC 18
6 8 38 62 7 West Ham United FC 16
7 10 38 60 8 Liverpool FC 16
8 15 38 51 9 Stoke City FC 14
9 12 38 50 10 Chelsea FC 12
10 13 38 47 11 Everton FC 11
11 15 38 47 12 Swansea City FC 12
12 17 38 45 13 Watford FC 12
13 15 38 43 14 West Bromwich Albion FC 10
14 18 38 42 15 Crystal Palace FC 11
15 18 38 42 16 AFC Bournemouth 11
16 17 38 39 17 Sunderland AFC 9
17 19 38 37 18 Newcastle United FC 9
18 22 38 34 19 Norwich City FC 9
19 27 38 17 20 Aston Villa FC 3
[20 rows x 22 columns]
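To get names closer to the "Home Wins" style asked for in the question, json_normalize also accepts a sep argument. The following is only a sketch, assuming pandas 1.0 or later. The two records in sample are hand-made and shaped like the API response (they are not a live request), and the rename step is just one possible way to prettify the names:

```python
import pandas as pd

# Hand-made sample shaped like result["standing"]; not fetched from the API.
sample = [
    {"position": 1, "teamName": "Leicester City FC",
     "home": {"wins": 12, "draws": 6, "losses": 1},
     "away": {"wins": 11, "draws": 6, "losses": 2}},
    {"position": 2, "teamName": "Arsenal FC",
     "home": {"wins": 12, "draws": 4, "losses": 3},
     "away": {"wins": 8, "draws": 7, "losses": 4}},
]

# sep="_" yields home_wins, away_draws, ... instead of home.wins, away.draws, ...
flat = pd.json_normalize(sample, sep="_")

# Turn only the flattened home_/away_ names into "Home Wins", "Away Draws", ...
flat = flat.rename(
    columns=lambda c: c.replace("_", " ").title()
    if c.startswith(("home_", "away_")) else c
)
print(flat.columns.tolist())
```

The top-level columns (position, teamName) are left untouched, so existing code that selects them keeps working.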

Related

create dataframe with increasing numbers in python

I want to create the following dataframe: n is the number of rows, and m is the columns.
In R, this would be generated by:
ia=array((1:m),c(m,n))
But I do not know how I can achieve the same in Python.
Kind regards,
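Use numpy.broadcast_to with the DataFrame constructor:
import numpy as np
import pandas as pd
m = 24
n = 13
# Turn 1..m into a column vector, then broadcast it across n columns
df = pd.DataFrame(np.broadcast_to(np.arange(1, m + 1)[:, None], (m, n)))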
print (df)
0 1 2 3 4 5 6 7 8 9 10 11 12
0 1 1 1 1 1 1 1 1 1 1 1 1 1
1 2 2 2 2 2 2 2 2 2 2 2 2 2
2 3 3 3 3 3 3 3 3 3 3 3 3 3
3 4 4 4 4 4 4 4 4 4 4 4 4 4
4 5 5 5 5 5 5 5 5 5 5 5 5 5
5 6 6 6 6 6 6 6 6 6 6 6 6 6
6 7 7 7 7 7 7 7 7 7 7 7 7 7
7 8 8 8 8 8 8 8 8 8 8 8 8 8
8 9 9 9 9 9 9 9 9 9 9 9 9 9
9 10 10 10 10 10 10 10 10 10 10 10 10 10
10 11 11 11 11 11 11 11 11 11 11 11 11 11
11 12 12 12 12 12 12 12 12 12 12 12 12 12
12 13 13 13 13 13 13 13 13 13 13 13 13 13
13 14 14 14 14 14 14 14 14 14 14 14 14 14
14 15 15 15 15 15 15 15 15 15 15 15 15 15
15 16 16 16 16 16 16 16 16 16 16 16 16 16
16 17 17 17 17 17 17 17 17 17 17 17 17 17
17 18 18 18 18 18 18 18 18 18 18 18 18 18
18 19 19 19 19 19 19 19 19 19 19 19 19 19
19 20 20 20 20 20 20 20 20 20 20 20 20 20
20 21 21 21 21 21 21 21 21 21 21 21 21 21
21 22 22 22 22 22 22 22 22 22 22 22 22 22
22 23 23 23 23 23 23 23 23 23 23 23 23 23
23 24 24 24 24 24 24 24 24 24 24 24 24 24
df = df.rename(index = lambda x: x+1, columns=lambda x: x+1)
print (df)
1 2 3 4 5 6 7 8 9 10 11 12 13
1 1 1 1 1 1 1 1 1 1 1 1 1 1
2 2 2 2 2 2 2 2 2 2 2 2 2 2
3 3 3 3 3 3 3 3 3 3 3 3 3 3
4 4 4 4 4 4 4 4 4 4 4 4 4 4
5 5 5 5 5 5 5 5 5 5 5 5 5 5
6 6 6 6 6 6 6 6 6 6 6 6 6 6
7 7 7 7 7 7 7 7 7 7 7 7 7 7
8 8 8 8 8 8 8 8 8 8 8 8 8 8
9 9 9 9 9 9 9 9 9 9 9 9 9 9
10 10 10 10 10 10 10 10 10 10 10 10 10 10
11 11 11 11 11 11 11 11 11 11 11 11 11 11
12 12 12 12 12 12 12 12 12 12 12 12 12 12
13 13 13 13 13 13 13 13 13 13 13 13 13 13
14 14 14 14 14 14 14 14 14 14 14 14 14 14
15 15 15 15 15 15 15 15 15 15 15 15 15 15
16 16 16 16 16 16 16 16 16 16 16 16 16 16
17 17 17 17 17 17 17 17 17 17 17 17 17 17
18 18 18 18 18 18 18 18 18 18 18 18 18 18
19 19 19 19 19 19 19 19 19 19 19 19 19 19
20 20 20 20 20 20 20 20 20 20 20 20 20 20
21 21 21 21 21 21 21 21 21 21 21 21 21 21
22 22 22 22 22 22 22 22 22 22 22 22 22 22
23 23 23 23 23 23 23 23 23 23 23 23 23 23
24 24 24 24 24 24 24 24 24 24 24 24 24 24
You can use np.repeat or np.tile:
n = 5 # 13
m = 8 # 24
# Enhanced by #mozway
df = pd.DataFrame(np.tile(np.arange(1, m+1), (n, 1)).T)
# OR: repeat each value n times, then reshape into n columns
df = pd.DataFrame(np.repeat(np.arange(1, m+1), n).reshape(-1, n))
print(df)
# Output
0 1 2 3 4
0 1 1 1 1 1
1 2 2 2 2 2
2 3 3 3 3 3
3 4 4 4 4 4
4 5 5 5 5 5
5 6 6 6 6 6
6 7 7 7 7 7
7 8 8 8 8 8
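As a cross-check, the three constructions from the answers can be compared directly. This is a minimal sketch; the helper name increasing_frame is made up for illustration, and the 1-based labels are only there to mirror the R result:

```python
import numpy as np
import pandas as pd

def increasing_frame(m, n):
    """Return an m x n DataFrame whose i-th row (1-based) is filled with i."""
    values = np.arange(1, m + 1)
    by_broadcast = np.broadcast_to(values[:, None], (m, n))
    by_tile = np.tile(values, (n, 1)).T
    by_repeat = np.repeat(values, n).reshape(-1, n)
    # All three approaches should build the same m x n array.
    assert np.array_equal(by_broadcast, by_tile)
    assert np.array_equal(by_broadcast, by_repeat)
    return pd.DataFrame(by_broadcast,
                        index=range(1, m + 1),
                        columns=range(1, n + 1))

df = increasing_frame(8, 5)
print(df.shape)  # (8, 5)
```

np.broadcast_to returns a read-only view without copying the data, while np.tile and np.repeat allocate the full array up front; for small frames the difference is negligible.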

Replace the value in a row if it is greater than the value in the previous row in my DataFrame (Pandas)

This may be a simple problem, but I am a novice. I have a DataFrame and, in a particular column, I want to replace the value in a row if it is greater than the value in the previous row.
The steps I am following are:
df1 = pd.read_csv("M10_10.txt") (Reading my CSV)
In that CSV, there is a column named m_Crit200. That is the column whose values I want to replace if they match my condition.
I am using this, but it is not working:
for i in range(1,len(df6)):
if df6.m_Crit200[i] < df6.m_Crit200[i+1]:
df6.m_Crit200[i]=df6.m_Crit200[i+1]
else:
df6.m_Crit200[i]=df6.m_Crit200[i]
This is the "if" code I am using, but it is not working. Sorry about my explanation; as I said, I am a novice and this is my first time here.
Thanks in advance
This is an example of what I want: I have this
Value
0 10
1 7
2 6
3 12
4 3
5 2
6 1
I want this
Value
0 10
1 7
2 6
3 6
4 3
5 2
6 1
I want to replace the value with the value from the previous row if it is greater.
This is my second error. When I use the methods in the answers below, I get this:
0 6.540991
1 6.540991
2 6.971319
3 6.971319
4 6.971319
5 6.971319
6 7.057385
7 6.540991
8 6.540991
9 6.282794
10 6.282794
11 6.540991
12 6.540991
13 7.315582
14 8.176239
15 8.090173
16 7.831976
17 5.594269
18 3.959021
19 3.528693
20 3.528693
21 3.528693
22 3.528693
23 3.528693
24 3.700824
25 3.614758
26 3.614758
27 3.356561
28 3.356561
29 2.926233
30 2.754101
31 2.754101
32 2.754101
33 2.840167
34 2.323773
35 2.323773
36 2.495904
37 2.495904
38 2.323773
39 1.463116
40 1.032788
41 1.032788
Name: m_Crit200, dtype: float64
And here is my original data:
0 6.540991
1 6.971319
2 6.971319
3 6.971319
4 6.971319
5 7.057385
6 7.057385
7 6.540991
8 6.627057
9 6.282794
10 6.713122
11 6.540991
12 7.315582
13 8.348371
14 8.176239
15 8.090173
16 7.831976
17 5.594269
18 3.959021
19 3.528693
20 3.528693
21 3.786890
22 3.528693
23 3.700824
24 3.700824
25 3.614758
26 3.872955
27 3.356561
28 3.356561
29 2.926233
30 2.754101
31 2.754101
32 2.840167
33 3.098364
34 2.323773
35 2.754101
36 2.495904
37 2.495904
38 2.323773
39 1.463116
40 1.032788
41 1.032788
Name: m_Crit200, dtype: float64
You can use shift for this:
df.Value.where(df.Value.shift(periods=1).fillna(np.inf)>df.Value, df.Value.shift(periods=1))
#output
0 10
1 7
2 6
3 6
4 3
5 2
6 1
Since you want a non-increasing Series, you can use an expanding window and choose the minimum value for each window:
df6["m_Crit200"] = df6["m_Crit200"].expanding().min()
>>> df6
0 6.540991
1 6.540991
2 6.540991
3 6.540991
4 6.540991
5 6.540991
6 6.540991
7 6.540991
8 6.540991
9 6.282794
10 6.282794
11 6.282794
12 6.282794
13 6.282794
14 6.282794
15 6.282794
16 6.282794
17 5.594269
18 3.959021
19 3.528693
20 3.528693
21 3.528693
22 3.528693
23 3.528693
24 3.528693
25 3.528693
26 3.528693
27 3.356561
28 3.356561
29 2.926233
30 2.754101
31 2.754101
32 2.754101
33 2.754101
34 2.323773
35 2.323773
36 2.323773
37 2.323773
38 2.323773
39 1.463116
40 1.032788
41 1.032788
Name: m_Crit200, dtype: float64
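The two answers are not equivalent, which may explain why the output from the shift approach in the question still contains increases. A small sketch, using a made-up Series s (not the m_Crit200 data), compares a single shift-based pass with a running minimum:

```python
import pandas as pd

# Made-up example values chosen so that the two approaches differ.
s = pd.Series([5, 8, 7, 9, 2], name="Value")

prev = s.shift(1)

# One pass: replace a value only when it exceeds the ORIGINAL previous value.
one_pass = s.where(~(s > prev), prev)   # values 5, 5, 7, 7, 2

# Running minimum: each value becomes the smallest value seen so far.
running_min = s.cummin()                # values 5, 5, 5, 5, 2

print(one_pass.tolist())
print(running_min.tolist())
```

The single pass compares against the unmodified neighbour, so 7 survives because the original previous value was 8. If the goal is a series that never increases, the running minimum (cummin, which gives the same values as an expanding minimum) is the one that guarantees it. On the small example from the question (10, 7, 6, 12, 3, 2, 1) both give the same result.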

Python Heap Sort is changing numbers to 0 instead of sorting them

My professor gave us some code to implement heapsort into our sorting class, and I can't seem to get it to work right. Every time I print it out, some of the numbers are converting into 0 (or 1s with a random fill) and not getting sorted. I know this because I have a fill function that just creates an array of numbers with increasing value that it is supposed to sort.
def heapsort(self):
n = self.size # Doing this for simplicity
for k in range((n-2) // 2, -1, -1):
self.downheap(n, k)
for m in range(n - 1, 0, -1):
self.data[m], self.data[0] = self.data[0], self.data[m]
self.downheap(m, 0)
def downheap(self, n, k):
if n > 1:
key = self.data[k]
isHeap = False
while (k <= (n-2) // 2) and not isHeap:
j = 2 * k + 1
if j + 1 < n:
if self.data[j] < self.data[j + 1]:
j += 1
if key >= self.data[j]:
isHeap = True
else:
k = j
self.data[k] = key
Unsorted list looks like-
17 19 8 8 9 3 17 13 9 1
14 19 15 12 19 4 12 6 1 8
13 8 10 5 6 6 9 17 6 5
12 5 7 16 9 10 11 3 10 14
5 3 12 1 3 10 18 10 4 19
5 10 14 9 16 8 3 14 4 13
12 8 13 10 16 17 16 10 11 3
16 9 3 16 15 3 2 11 15 3
3 3 18 7 9 6 10 4 1 4
15 10 9 1 2 18 14 11 4 3
"sorted" list looks like-
1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 8 13
1 3 5 6 6 9 12 6 5 12
1 1 3 3 1 1 3 10 8 5
3 12 1 1 1 1 10 4 3 5
10 6 9 9 8 3 6 4 5 12
8 8 1 2 1 2 3 3 2 1
9 3 11 6 3 2 1 10 3 3
3 5 7 3 6 1 1 1 4 3
1 1 1 2 10 5 4 4 3 17
And here's what it does to the inc. numbers-
0 1 2 3 4 5 6 7 8 9
10 11 12 13 14 15 16 17 18 19
20 21 22 23 24 25 26 27 28 29
30 31 32 33 34 35 36 37 38 39
40 41 42 43 44 45 46 47 48 49
50 51 52 53 54 55 56 57 58 59
60 61 62 63 64 65 66 67 68 69
70 71 72 73 74 75 76 77 78 79
80 81 82 83 84 85 86 87 88 89
90 91 92 93 94 95 96 97 98 99
"Sorted"
0 0 3 4 0 0 7 8 9 10
11 12 0 0 15 16 17 18 19 20
21 22 23 24 25 26 27 28 0 0
31 32 33 34 35 36 37 38 39 40
41 42 43 44 45 46 47 48 49 24
51 25 12 5 55 27 13 28 59 29
0 0 63 31 15 32 67 33 16 7
71 35 17 36 75 37 3 8 79 39
19 40 83 41 20 9 87 43 21 44
91 45 4 1 95 47 23 48 11 0
I've been poring over this for the last couple of days. I know I wrote everything down correctly, and I can't for the life of me figure out what is going on. Any help would be appreciated. Thanks.
Figured it out!
I had to put
self.data[k] = self.data[j]
immediately before k = j, so that the larger child is moved up into the parent's slot before descending.
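For reference, here is a sketch of the routine with that one-line fix applied. It is written as standalone functions on a plain list rather than as methods, because the rest of the sorting class from the question is not shown; the function names simply mirror the original methods:

```python
def downheap(data, n, k):
    """Sift data[k] down within the first n items so they form a max-heap."""
    if n > 1:
        key = data[k]
        is_heap = False
        while k <= (n - 2) // 2 and not is_heap:
            j = 2 * k + 1                      # left child
            if j + 1 < n and data[j] < data[j + 1]:
                j += 1                         # right child is larger
            if key >= data[j]:
                is_heap = True
            else:
                data[k] = data[j]              # the missing line: move child up
                k = j
        data[k] = key


def heapsort(data):
    """Sort the list in place in ascending order and return it."""
    n = len(data)
    # Build a max-heap, starting from the last parent node.
    for k in range((n - 2) // 2, -1, -1):
        downheap(data, n, k)
    # Repeatedly move the current maximum to the end of the unsorted part.
    for m in range(n - 1, 0, -1):
        data[m], data[0] = data[0], data[m]
        downheap(data, m, 0)
    return data


print(heapsort([17, 19, 8, 8, 9, 3, 17, 13, 9, 1]))
```

Without the marked line, the child value is never copied into the parent's position, so the final data[k] = key overwrites it and values are lost, which matches the duplicated and missing numbers in the "sorted" output above.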

Choosing the larger probability from a specific indexID

I have a database as follows:
indexID matchID order userClean Probability
0 0 1 0 clean 35
1 0 2 1 clean 75
2 0 2 2 clean 25
5 3 4 5 clean 40
6 3 5 6 clean 85
9 4 5 9 clean 74
12 6 7 12 clean 23
13 6 8 13 clean 72
14 7 8 14 clean 85
15 9 10 15 clean 76
16 10 11 16 clean 91
19 13 14 19 clean 27
23 13 17 23 clean 10
28 13 18 28 clean 71
32 20 21 32 clean 97
33 20 22 33 clean 30
What I want to do is, for each repeated indexID, I would like to choose the entry that is of higher probability and mark that as clean and the other as dirty.
The output should look something like this:
indexID matchID order userClean Probability
0 0 1 0 dirty 35
1 0 2 1 clean 75
2 0 2 2 dirty 25
5 3 4 5 dirty 40
6 3 5 6 clean 85
9 4 5 9 clean 74
12 6 7 12 dirty 23
13 6 8 13 clean 72
14 7 8 14 clean 85
15 9 10 15 clean 76
16 10 11 16 clean 91
19 13 14 19 dirty 27
23 13 17 23 dirty 10
28 13 18 28 clean 71
32 20 21 32 clean 97
33 20 22 33 dirty 30
If you need a pandas solution, create a boolean mask by comparing the Probability column, using Series.ne (!=), with the maximum value per group. Compute the maximum with transform, because the comparison needs a Series of the same size as df:
mask = df['Probability'].ne(df.groupby('indexID')['Probability'].transform('max'))
df.loc[mask, 'userClean'] = 'dirty'
print (df)
indexID matchID order userClean Probability
0 0 1 0 dirty 35
1 0 2 1 clean 75
2 0 2 2 dirty 25
5 3 4 5 dirty 40
6 3 5 6 clean 85
9 4 5 9 clean 74
12 6 7 12 dirty 23
13 6 8 13 clean 72
14 7 8 14 clean 85
15 9 10 15 clean 76
16 10 11 16 clean 91
19 13 14 19 dirty 27
23 13 17 23 dirty 10
28 13 18 28 clean 71
32 20 21 32 clean 97
33 20 22 33 dirty 30
Detail:
print (df.groupby('indexID')['Probability'].transform('max'))
0 75
1 75
2 75
5 85
6 85
9 74
12 72
13 72
14 85
15 76
16 91
19 71
23 71
28 71
32 97
33 97
Name: Probability, dtype: int64
If you want to compare with the mean instead, using gt (>):
mask = df['Probability'].gt(df['Probability'].mean())
df.loc[mask, 'userClean'] = 'dirty'
print (df)
indexID matchID order userClean Probability
0 0 1 0 clean 35
1 0 2 1 dirty 75
2 0 2 2 clean 25
5 3 4 5 clean 40
6 3 5 6 dirty 85
9 4 5 9 dirty 74
12 6 7 12 clean 23
13 6 8 13 dirty 72
14 7 8 14 dirty 85
15 9 10 15 dirty 76
16 10 11 16 dirty 91
19 13 14 19 clean 27
23 13 17 23 clean 10
28 13 18 28 dirty 71
32 20 21 32 dirty 97
33 20 22 33 clean 30
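The group-wise maximum comparison can also assign both labels in one step, which helps when the userClean column does not already contain "clean" everywhere. A minimal sketch, assuming a reduced copy of the data from the question (only the first six rows and the two relevant columns) and using np.where as one possible way to write both labels:

```python
import numpy as np
import pandas as pd

# Reduced sample: the first six rows of the question's data.
df = pd.DataFrame({
    "indexID": [0, 0, 0, 3, 3, 4],
    "Probability": [35, 75, 25, 40, 85, 74],
})

# Maximum Probability per indexID, aligned row by row with df.
group_max = df.groupby("indexID")["Probability"].transform("max")

# Rows that hold their group's maximum are "clean", all others "dirty".
df["userClean"] = np.where(df["Probability"].eq(group_max), "clean", "dirty")
print(df)
```

If two rows in the same group tie for the maximum, both are marked "clean" here; picking exactly one row per group would need an extra rule that the question does not specify.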

How to plot multiple lines as histograms per group from a pandas DataFrame

I am trying to look at 'time of day' effects on my users on a week over week basis to get a quick visual take on how consistent time of day trends are. So as a first start I've used this:
df[df['week'] < 10][['realLocalTime', 'week']].hist(by = 'week', bins = 24, figsize = (15, 15))
To produce the following:
This is a nice easy start, but what I would really like is to represent the histogram as a line plot, and overlay all the lines, one for each week on the same plot. Is there a way to do this?
I have a bit more experience with ggplot, where I would just do this by adding a factor level dependency on color and by. Is there a similarly easy way to do this with pandas and or matplotlib?
Here's what my data looks like:
realLocalTime week
1 12 10
2 12 10
3 12 10
4 12 10
5 13 5
6 17 5
7 17 5
8 6 6
9 17 5
10 20 6
11 18 5
12 18 5
13 19 6
14 21 6
15 21 6
16 14 6
17 6 6
18 0 6
19 21 5
20 17 6
21 23 6
22 22 6
23 22 6
24 17 6
25 22 5
26 13 6
27 23 6
28 22 5
29 21 6
30 17 6
... ... ...
70 14 5
71 9 5
72 19 6
73 19 6
74 21 6
75 20 5
76 20 5
77 21 5
78 15 6
79 22 6
80 23 6
81 15 6
82 12 6
83 7 6
84 9 6
85 8 6
86 22 6
87 22 6
88 22 6
89 8 5
90 8 5
91 8 5
92 9 5
93 7 5
94 22 5
95 8 6
96 10 6
97 0 6
98 22 5
99 14 6
Maybe you can simply use crosstab to count the number of elements per hour for each week and plot it.
import pandas as pd
# Test data
d = {'realLocalTime': ['12','14','14','12','13','17','14', '17'],
'week': ['10','10','10','10','5','5','6', '6']}
df = pd.DataFrame(d)
# One column per week, one row per hour; plot() draws one line per column
ax = pd.crosstab(df['realLocalTime'], df['week']).plot()
Use groupby and value_counts
df.groupby('week').realLocalTime.value_counts().unstack(0).fillna(0).plot()
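Either answer produces a table of counts with one column per week, and plotting that table draws one line per week on shared axes. The sketch below assumes integer hours (0 to 23) and uses small made-up test data; the helper plot_weekly_lines and the axis labels are illustrative, and matplotlib is only imported when the helper is called:

```python
import pandas as pd

# Small made-up sample; hours and weeks are integers here.
df = pd.DataFrame({
    "realLocalTime": [12, 14, 14, 12, 13, 17, 14, 17],
    "week": [10, 10, 10, 10, 5, 5, 6, 6],
})

# Rows: hour of day, columns: week, values: number of observations.
# reindex adds the hours with no observations so every line spans 0..23.
counts = (
    pd.crosstab(df["realLocalTime"], df["week"])
    .reindex(range(24), fill_value=0)
)

def plot_weekly_lines(table):
    """Overlay one line per week (column) on a single set of axes."""
    import matplotlib.pyplot as plt
    fig, ax = plt.subplots(figsize=(10, 5))
    table.plot(ax=ax)
    ax.set_xlabel("hour of day")
    ax.set_ylabel("count")
    return ax

print(counts.loc[[12, 13, 14, 17]])
```

Calling plot_weekly_lines(counts) then gives the overlaid line version of the per-week histograms. Dividing each column by its sum first (counts / counts.sum()) would make weeks with different user volumes easier to compare.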
