I am trying to grab the data from https://www.espn.com/nhl/standings.
When I grab it, Florida Panthers ends up one row too high, which misaligns all the data; every team name needs to be shifted down a row. I have tried mutating the data with
dataset_one = dataset_one.shift(1)
and then joining with the stats table, but I am getting NaN.
The docs show a lot of ways of joining and merging data on shared column headers, but I am not sure of the best solution here without a shared column header to join on.
Code:
import pandas as pd
page = pd.read_html('https://www.espn.com/nhl/standings')
dataset_one = page[0] # Team Names
dataset_two = page[1] # Stats
combined_data = dataset_one.join(dataset_two)
print(combined_data)
Output:
FLAFlorida Panthers GP W L OTL ... GF GA DIFF L10 STRK
0 CBJColumbus Blue Jackets 6 5 0 1 ... 22 16 6 5-0-1 W2
1 CARCarolina Hurricanes 10 4 3 3 ... 24 28 -4 4-3-3 L1
2 DALDallas Stars 6 5 1 0 ... 18 10 8 5-1-0 W4
3 TBTampa Bay Lightning 6 4 1 1 ... 23 14 9 4-1-1 L2
4 CHIChicago Blackhawks 6 4 1 1 ... 19 14 5 4-1-1 W1
5 NSHNashville Predators 10 3 4 3 ... 26 31 -5 3-4-3 W1
6 DETDetroit Red Wings 8 4 4 0 ... 20 24 -4 4-4-0 L1
Desired:
GP W L OTL ... GF GA DIFF L10 STRK
0 FLAFlorida Panthers 6 5 0 1 ... 22 16 6 5-0-1 W2
1 CBJColumbus Blue Jackets 10 4 3 3 ... 24 28 -4 4-3-3 L1
2 CARCarolina Hurricanes 6 5 1 0 ... 18 10 8 5-1-0 W4
3 DALDallas Stars 6 4 1 1 ... 23 14 9 4-1-1 L2
4 TBTampa Bay Lightning 6 4 1 1 ... 19 14 5 4-1-1 W1
5 CHIChicago Blackhawks 10 3 4 3 ... 26 31 -5 3-4-3 W1
6 NSHNashville Predators 8 4 4 0 ... 20 24 -4 4-4-0 L1
7 DETDetroit Red Wings 10 2 6 2 ... 20 35 -15 2-6-2 L6
Providing an alternative approach to #Noah's answer: you can first add an extra row, shift the df down by one row, and then assign the old header as the index-0 value.
import pandas as pd
page = pd.read_html('https://www.espn.com/nhl/standings')
dataset_one = page[0] # Team Names
dataset_two = page[1] # Stats
# Append an empty row, then shift everything down by one
dataset_one.loc[max(dataset_one.index) + 1, :] = None
dataset_one = dataset_one.shift(1)
# The first team name was parsed as the header; put it back as row 0
dataset_one.iloc[0] = dataset_one.columns
dataset_one.columns = ['team']
combined_data = dataset_one.join(dataset_two)
Just create the df slightly differently so it has the proper header:
dataset_one = pd.DataFrame(page[0].values, columns=["Team Name"])
Then when you join, it should be aligned properly.
Another alternative is to do the following (squeeze turns the one-column frame into a Series, since to_frame is a Series method):
dataset_one = page[0].squeeze().to_frame(name='Team Name')
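For completeness, a minimal sketch of the header-row fix, assuming the page still parses into a one-column team table followed by a stats table (the 'team' column name is my own choice):
import pandas as pd

page = pd.read_html('https://www.espn.com/nhl/standings')
teams, stats = page[0], page[1]

# read_html parses the first team (e.g. 'FLAFlorida Panthers') as the
# header of the team table, so push it back in as the first data row
header_team = teams.columns[0]
teams.columns = ['team']
teams = pd.concat([pd.DataFrame({'team': [header_team]}), teams],
                  ignore_index=True)

combined_data = teams.join(stats)
print(combined_data)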
Related
I am having trouble applying some logic across my entire dataset. I am able to apply the logic to a small "group" but not to all of the groups (note: the groups are defined by primaryFilter and secondaryFilter). Do you all mind pointing me in the right direction?
Entire Data
import pandas as pd
import numpy as np
myInput = {
    'primaryFilter': [100,100,100,100,100,100,100,100,100,100,200,200,200,200,200,200,200,200,200,200],
    'secondaryFilter': [1,1,1,1,2,2,2,3,3,3,1,1,2,2,2,2,3,3,3,3],
    'constantValuePerGroup': [15,15,15,15,20,20,20,17,17,17,10,10,30,30,30,30,22,22,22,22],
    'someValue': [3,1,4,7,9,9,2,7,3,7,6,4,7,10,10,3,4,6,7,5]
}
df_input = pd.DataFrame(data=myInput)
df_input
Test Data (First Group)
df_test = df_input[df_input.primaryFilter.isin([100])]
df_test = df_test[df_test.secondaryFilter == 1.0]
df_test['newColumn'] = np.nan
for index, row in df_test.iterrows():
    if index == 0:
        print("start")
        df_test.loc[0, 'newColumn'] = 0
    elif index == df_test.shape[0] - 1:
        df_test.loc[index, 'newColumn'] = df_test.loc[index - 1, 'newColumn'] + df_test.loc[index - 1, 'someValue']
        print("end")
    else:
        print("inter")
        df_test.loc[index, 'newColumn'] = df_test.loc[index - 1, 'newColumn'] + df_test.loc[index - 1, 'someValue']
df_test["delta"] = df_test["constantValuePerGroup"] - df_test['newColumn']
df_test.head()
Here is the output of the test:
   primaryFilter  secondaryFilter  constantValuePerGroup  someValue  newColumn  delta
0            100                1                     15          3        0.0   15.0
1            100                1                     15          1        3.0   12.0
2            100                1                     15          4        4.0   11.0
3            100                1                     15          7        8.0    7.0
I would now like to apply the above logic to the remaining groups: (100, 2), (100, 3), (200, 1), and so forth.
No need to use iterrows here. You can group the dataframe on the primaryFilter and secondaryFilter columns, then for each group take the cumulative sum of the values in someValue and shift it one position downwards to obtain newColumn. Finally, subtract newColumn from constantValuePerGroup to get the delta.
df_input['newColumn'] = df_input.groupby(['primaryFilter', 'secondaryFilter'])['someValue'].apply(lambda s: s.cumsum().shift(fill_value=0))
df_input['delta'] = df_input['constantValuePerGroup'] - df_input['newColumn']
>>> df_input
primaryFilter secondaryFilter constantValuePerGroup someValue newColumn delta
0 100 1 15 3 0 15
1 100 1 15 1 3 12
2 100 1 15 4 4 11
3 100 1 15 7 8 7
4 100 2 20 9 0 20
5 100 2 20 9 9 11
6 100 2 20 2 18 2
7 100 3 17 7 0 17
8 100 3 17 3 7 10
9 100 3 17 7 10 7
10 200 1 10 6 0 10
11 200 1 10 4 6 4
12 200 2 30 7 0 30
13 200 2 30 10 7 23
14 200 2 30 10 17 13
15 200 2 30 3 27 3
16 200 3 22 4 0 22
17 200 3 22 6 4 18
18 200 3 22 7 10 12
19 200 3 22 5 17 5
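As an aside, the shifted cumulative sum can also be written without apply, since an exclusive cumsum is just the inclusive cumsum minus the row's own value (a sketch using the same frame as above):
# Exclusive cumulative sum per group: inclusive cumsum minus the current value
g = df_input.groupby(['primaryFilter', 'secondaryFilter'])['someValue']
df_input['newColumn'] = g.cumsum() - df_input['someValue']
df_input['delta'] = df_input['constantValuePerGroup'] - df_input['newColumn']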
Currently I'm working with weekly data for different subjects, but it may contain long gaps without data, so what I want to do is keep only the longest streak of consecutive weeks for every id. My data looks like this:
id week
1 8
1 15
1 60
1 61
1 62
2 10
2 11
2 12
2 13
2 25
2 26
My expected output would be:
id week
1 60
1 61
1 62
2 10
2 11
2 12
2 13
I got close by trying to mark a 1 when week == week.shift() + 1. The problem is that this approach doesn't mark the first occurrence in a streak, and I also can't filter for the longest one:
df.loc[ (df['id'] == df['id'].shift())&(df['week'] == df['week'].shift()+1),'streak']=1
This, according to my example, would bring this:
id week streak
1 8 nan
1 15 nan
1 60 nan
1 61 1
1 62 1
2 10 nan
2 11 1
2 12 1
2 13 1
2 25 nan
2 26 1
Any ideas on how to achieve what I want?
Try this:
df['consec'] = df.groupby(['id', df['week'].diff(-1).ne(-1).shift().bfill().cumsum()])['week'].transform('count')
df[df.groupby('id')['consec'].transform('max') == df.consec]
Output:
id week consec
2 1 60 3
3 1 61 3
4 1 62 3
5 2 10 4
6 2 11 4
7 2 12 4
8 2 13 4
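Unpacking the grouper above: diff(-1) compares each week with the next row's week, ne(-1) flags the end of each run, and the shift/bfill/cumsum combination turns those flags into run labels. A step-by-step sketch of the same logic:
ends = df['week'].diff(-1).ne(-1)     # True where the NEXT row is not week + 1
runs = ends.shift().bfill().cumsum()  # shift so True marks a run's start, then label runs
df['consec'] = df.groupby(['id', runs])['week'].transform('count')  # run length per row
longest = df[df.groupby('id')['consec'].transform('max') == df['consec']]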
Not as concise as #ScottBoston's answer, but I like this approach:
import numpy as np
import pandas as pd

def max_streak(s):
    a = s.values  # Let's deal with an array
    # I need to know where the differences are not `1`.
    # Also, because I plan to use `diff` again, I'll wrap
    # the boolean array with `True` to make things cleaner
    b = np.concatenate([[True], np.diff(a) != 1, [True]])
    # Tell the locations of the breaks in streak
    c = np.flatnonzero(b)
    # `diff` again tells me the length of the streaks
    d = np.diff(c)
    # `argmax` will tell me the location of the largest streak
    e = d.argmax()
    return c[e], d[e]

def make_thing(df):
    start, length = max_streak(df.week)
    return df.iloc[start:start + length].assign(consec=length)

pd.concat([
    make_thing(g) for _, g in df.groupby('id')
])
id week consec
2 1 60 3
3 1 61 3
4 1 62 3
5 2 10 4
6 2 11 4
7 2 12 4
8 2 13 4
Hi, I've been trying to replace string values in a dataframe (the strings are abbreviations of NFL teams). I have something like this:
Index IDMatch Usr1 Usr2 Usr3 Usr4 Usr5
0 1 Phi Atl Phi Phi Phi
1 2 Bal Bal Bal Buf Bal
2 3 Ind Ind Cin Cin Ind
3 4 NE NE Hou NE NE
4 5 Jax Jax NYG NYG NYG
and a Dataframe with the mapping, something like this:
Index TEAM_YH TeamID
0 ARI 1
1 ATL 2
2 BAL 3
...
31 WAS 32
I want to replace every string with the TeamID to compute basic statistics (frequencies). I've tried the following:
## Dataframe with strings and Team ID
dfDicTeams = dfTeams[['TEAM_YH','TeamID']].to_dict('dict')
## Dataframe with selections by users
dfW1.replace(dfDicTeams[['TEAM_YH']],dfDicTeams[['TeamID']]) ## Error: unhashable type: 'list'
dfW1.replace(dfDicTeams) ## Error: Replacement not allowed with overlapping keys and values
What am I doing wrong? Is it possible to do this?
I'm using Python 3, and I want something like this:
Index IDMatch Usr1 Usr2 Usr3 Usr4 Usr5
0 1 26 2 26 26 26
1 2 3 3 3 4 3
2 3 14 14 7 7 14
3 4 21 21 13 21 21
4 5 15 15 23 23 23
to aggregate the options:
IDMatch ATeam Count HTeam Count
1 26 4 2 1
2 3 4 4 1
3 14 3 7 2
4 21 4 13 1
5 15 2 23 3
Given a main input dataframe df and a mapping dataframe df_map, you can create a mapping Series, then use pd.DataFrame.applymap with a custom function:
s = df_map.set_index('TEAM_YH')['TeamID']
df.iloc[:, 2:] = df.iloc[:, 2:].applymap(lambda x: s.get(x.upper(), -1))
print(df)
Index IDMatch Usr1 Usr2 Usr3 Usr4 Usr5
0 0 1 7 2 7 7 7
1 1 2 3 3 3 4 3
2 2 3 5 5 -1 -1 5
3 3 4 -1 -1 -1 -1 -1
4 4 5 6 6 -1 -1 -1
The example df_map used to calculate the above result:
Index TEAM_YH TeamID
0 ARI 1
1 ATL 2
2 BAL 3
3 BUF 4
4 IND 5
5 JAX 6
6 PHI 7
32 WAS 32
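For reference, the dict approach from the question can also be made to work: to_dict('dict') produces a nested {column: {index: value}} dict, which is why replace choked on it. A sketch using the question's frame names (dfW1, dfTeams; the -1 sentinel follows the answer above):
# Flat {abbreviation -> TeamID} mapping instead of the nested dict
mapping = dfTeams.set_index('TEAM_YH')['TeamID'].to_dict()

user_cols = ['Usr1', 'Usr2', 'Usr3', 'Usr4', 'Usr5']
# Upper-case to match the mapping's keys; unmatched abbreviations become -1
dfW1[user_cols] = dfW1[user_cols].apply(
    lambda col: col.str.upper().map(mapping).fillna(-1).astype(int)
)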
I have a pandas DataFrame with a two-level MultiIndex: a Date level and a Gender level. It looks like this:
Division North South West East
Date Gender
2016-05-16 19:00:00 F 0 2 3 3
M 12 15 12 12
2016-05-16 20:00:00 F 12 9 11 11
M 10 13 8 9
2016-05-16 21:00:00 F 9 4 7 1
M 5 1 12 10
Now if I want to find the average values for each hour, I know I can do something like:
df.groupby(df.index.hour).mean()
but this does not seem to work when you have a MultiIndex. I found that I could reach the Date level like:
df.groupby(df.index.get_level_values('Date').hour).mean()
which sort of averages over the hours of the day, but I lose track of the Gender level...
so my question is: how can I find the average hourly values for each Division by Gender?
I think you can add a level of the MultiIndex to the grouping keys; this needs pandas 0.20.1+:
df1 = df.groupby([df.index.get_level_values('Date').hour,'Gender']).mean()
print (df1)
North South West East
Date Gender
19 F 0 2 3 3
M 12 15 12 12
20 F 12 9 11 11
M 10 13 8 9
21 F 9 4 7 1
M 5 1 12 10
Another solution:
df1 = df.groupby([df.index.get_level_values('Date').hour,
df.index.get_level_values('Gender')]).mean()
print (df1)
North South West East
Date Gender
19 F 0 2 3 3
M 12 15 12 12
20 F 12 9 11 11
M 10 13 8 9
21 F 9 4 7 1
M 5 1 12 10
Or simply create columns from MultiIndex:
df = df.reset_index()
df1 = df.groupby([df['Date'].dt.hour, 'Gender']).mean()
print (df1)
North South West East
Date Gender
19 F 0 2 3 3
M 12 15 12 12
20 F 12 9 11 11
M 10 13 8 9
21 F 9 4 7 1
M 5 1 12 10
I have one massive pandas dataframe with this structure:
df1:
A B
0 0 12
1 0 15
2 0 17
3 0 18
4 1 45
5 1 78
6 1 96
7 1 32
8 2 45
9 2 78
10 2 44
11 2 10
And a second one, smaller like this:
df2
G H
0 0 15
1 1 45
2 2 31
I want to add a column to my first dataframe following this rule: column df1.C = df2.H when df1.A == df2.G
I managed to do it with for loops, but the dataframe is massive and the code runs really slowly, so I am looking for a pandas or NumPy way to do it.
Many thanks,
Boris
If you want to match rows between the two dataframes (non-matching rows get NaN):
import pandas as pd
df1 = pd.DataFrame({'Name':['Sara'],'Special ability':['Walk on water']})
df1
Name Special ability
0 Sara Walk on water
df2 = pd.DataFrame({'Name':['Sara', 'Gustaf', 'Patrik'],'Age':[4,12,11]})
df2
Name Age
0 Sara 4
1 Gustaf 12
2 Patrik 11
df = df2.merge(df1, left_on='Name', right_on='Name', how='left')
df
Name Age Special ability
0 Sara 4 Walk on water
1 Gustaf 12 NaN
2 Patrik 11 NaN
This can also be done with more than one matching argument. (In this example, Patrik from df1 does not exist in df2 because they have different ages, and therefore the rows will not merge.)
df1 = pd.DataFrame({'Name':['Sara','Patrik'],'Special ability':['Walk on water','FireBalls'],'Age':[12,83]})
df1
Name Special ability Age
0 Sara Walk on water 12
1 Patrik FireBalls 83
df2 = pd.DataFrame({'Name':['Sara', 'Gustaf', 'Patrik'],'Age':[12,12,11]})
df2
Name Age
0 Sara 12
1 Gustaf 12
2 Patrik 11
df = df2.merge(df1,left_on=['Name','Age'],right_on=['Name','Age'],how='left')
df
Name Age Special ability
0 Sara 12 Walk on water
1 Gustaf 12 NaN
2 Patrik 11 NaN
You probably want to use a merge:
df = df1.merge(df2, left_on="A", right_on="G").drop(columns="G")
will give you a dataframe with three columns (merge keeps both join keys, hence the drop of G), but the third one's name will be H.
df.columns = ["A", "B", "C"]
will then give you the column names you want.
You can use map with a Series created by set_index:
df1['C'] = df1['A'].map(df2.set_index('G')['H'])
print (df1)
A B C
0 0 12 15
1 0 15 15
2 0 17 15
3 0 18 15
4 1 45 45
5 1 78 45
6 1 96 45
7 1 32 45
8 2 45 31
9 2 78 31
10 2 44 31
11 2 10 31
Or merge with drop and rename:
df = (df1.merge(df2, left_on="A", right_on="G", how='left')
         .drop('G', axis=1)
         .rename(columns={'H': 'C'}))
print (df)
A B C
0 0 12 15
1 0 15 15
2 0 17 15
3 0 18 15
4 1 45 45
5 1 78 45
6 1 96 45
7 1 32 45
8 2 45 31
9 2 78 31
10 2 44 31
11 2 10 31
Here's one vectorized NumPy approach -
idx = np.searchsorted(df2.G.values, df1.A.values)
df1['C'] = df2.H.values[idx]
idx could be computed more simply with df2.G.searchsorted(df1.A), but I don't think that would be any more efficient, because we want to use the underlying arrays with .values for performance, as done earlier. Note that searchsorted assumes df2.G is sorted.
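If that sortedness isn't guaranteed, a hedged sketch that sorts the lookup table first:
import numpy as np

# searchsorted requires a sorted lookup array, so sort G (and H alongside it)
order = np.argsort(df2.G.values)
g_sorted = df2.G.values[order]
h_sorted = df2.H.values[order]

idx = np.searchsorted(g_sorted, df1.A.values)
df1['C'] = h_sorted[idx]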