How to calculate a win streak in Python/Pandas - python

I'm trying to calculate the win-streak or losing-streak going into a game. My goal is to generate a betting decision based on these streak factors or a recent record. I am new to Python and Pandas (and programming in general), so any detailed explanation of what code does would be welcome.
Here's my data
| | Season | Game Date | Game Index | Away Team | Away Score | Home Team | Home Score | Winner | Loser |
| 0 | 2014 Regular Season | Saturday, March 22, 2014 | 2014032201 | Los Angeles Dodgers | 3 | Arizona D'Backs | 1 | Los Angeles Dodgers | Arizona D'Backs |
| 1 | 2014 Regular Season | Sunday, March 23, 2014 | 2014032301 | Los Angeles Dodgers | 7 | Arizona D'Backs | 5 | Los Angeles Dodgers | Arizona D'Backs |
| 2 | 2014 Regular Season | Sunday, March 30, 2014 | 2014033001 | Los Angeles Dodgers | 1 | San Diego Padres | 3 | San Diego Padres | Los Angeles Dodgers |
| 3 | 2014 Regular Season | Monday, March 31, 2014 | 2014033101 | Seattle Mariners | 10 | Los Angeles Angels | 3 | Seattle Mariners | Los Angeles Angels |
| 4 | 2014 Regular Season | Monday, March 31, 2014 | 2014033102 | San Francisco Giants | 9 | Arizona D'Backs | 8 | San Francisco Giants | Arizona D'Backs |
| 5 | 2014 Regular Season | Monday, March 31, 2014 | 2014033103 | Boston Red Sox | 1 | Baltimore Orioles | 2 | Baltimore Orioles | Boston Red Sox |
| 6 | 2014 Regular Season | Monday, March 31, 2014 | 2014033104 | Minnesota Twins | 3 | Chicago White Sox | 5 | Chicago White Sox | Minnesota Twins |
| 7 | 2014 Regular Season | Monday, March 31, 2014 | 2014033105 | St. Louis Cardinals | 1 | Cincinnati Reds | 0 | St. Louis Cardinals | Cincinnati Reds |
| 8 | 2014 Regular Season | Monday, March 31, 2014 | 2014033106 | Kansas City Royals | 3 | Detroit Tigers | 4 | Detroit Tigers | Kansas City Royals |
| 9 | 2014 Regular Season | Monday, March 31, 2014 | 2014033107 | Colorado Rockies | 1 | Miami Marlins | 10 | Miami Marlins | Colorado Rockies |
Dictionary below:
{'Away Score': {0: 3, 1: 7, 2: 1, 3: 10, 4: 9},
'Away Team': {0: 'Los Angeles Dodgers',
1: 'Los Angeles Dodgers',
2: 'Los Angeles Dodgers',
3: 'Seattle Mariners',
4: 'San Francisco Giants'},
'Game Date': {0: 'Saturday, March 22, 2014',
1: 'Sunday, March 23, 2014',
2: 'Sunday, March 30, 2014',
3: 'Monday, March 31, 2014',
4: 'Monday, March 31, 2014'},
'Game Index': {0: 2014032201,
1: 2014032301,
2: 2014033001,
3: 2014033101,
4: 2014033102},
'Home Score': {0: 1, 1: 5, 2: 3, 3: 3, 4: 8},
'Home Team': {0: "Arizona D'Backs",
1: "Arizona D'Backs",
2: 'San Diego Padres',
3: 'Los Angeles Angels',
4: "Arizona D'Backs"},
'Loser': {0: "Arizona D'Backs",
1: "Arizona D'Backs",
2: 'Los Angeles Dodgers',
3: 'Los Angeles Angels',
4: "Arizona D'Backs"},
'Season': {0: '2014 Regular Season',
1: '2014 Regular Season',
2: '2014 Regular Season',
3: '2014 Regular Season',
4: '2014 Regular Season'},
'Winner': {0: 'Los Angeles Dodgers',
1: 'Los Angeles Dodgers',
2: 'San Diego Padres',
3: 'Seattle Mariners',
4: 'San Francisco Giants'}}
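For anyone who wants to reproduce this, the dictionary can be turned straight back into a frame (a quick note of mine, not part of the original post; `data` is a hypothetical name for the dictionary pasted above):
import pandas as pd

# data is the pasted dictionary, in df.head().to_dict() form: {column: {index: value}}
df = pd.DataFrame(data)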
I've tried looping through the season and the team, and then creating a streak count based on this GitHub project: https://github.com/nhcamp/EPL-Betting/blob/master/EPL%20Match%20Results%20DF.ipynb
I run into key errors early in building my loops, and I have trouble identifying the data I need.
game_table = pd.read_csv('MLB_Scores_2014_2018.csv')
# Get Team List
team_list = game_table['Away Team'].unique()
# Get Season List
season_list = game_table['Season'].unique()
#Defining "chunks" to append gamedata to the total dataframe
chunks = []
for season in season_list:
    # Looping through seasons. Streaks reset for each season
    season_games = game_table[game_table['Season'] == season]
    for team in team_list:
        # Looping through teams
        season_team_games = season_games[(season_games['Away Team'] == team | season_games['Home Team'] == team)]
        # Setting streak list and streak counter values
        streak_list = []
        streak = 0
        # Looping through each game
        for game in season_team_games.iterrow():
            # Check if team is a winner, and up the streak
            if game_table['Winner'] == team:
                streak_list.append(streak)
                streak += 1
            # If not the winner, append streak and set to zero
            elif game_table['Winner'] != team:
                streak_list.append(streak)
                streak = 0
            # Just in case something weird happens with the scores
            else:
                streak_list.append(streak)
        game_table['Streak'] = streak_list
        chunks.append(game_table)
And that's kind of where I lose it. How do I append the streaks separately depending on whether each team is the home team or the away team? Is there a better way to display this data?
As a general matter, I want to add a win-streak and/or losing-streak for each team in each game. Headers would look like this:
| Season | Game Date | Game Index | Away Team | Away Score | Home Team | Home Score | Winner | Loser | Away Win Streak | Away Lose Streak | Home Win Streak | Home Lose Streak |
Edit: the error message below has been resolved.
I also get an error creating the dataframe season_team_games:
TypeError: cannot compare a dtyped [object] array with a scalar of type [bool]

The error you are seeing comes from the statement
season_team_games = season_games[(season_games['Away Team'] == team | season_games['Home Team'] == team)]
When you combine two boolean conditions with |, you need to wrap each one in parentheses, because the | operator has higher precedence than ==. So this should become:
season_team_games = season_games[(season_games['Away Team'] == team) | (season_games['Home Team'] == team)]
I know there is more to the question than this error, but as mentioned in the comments, it will be easier to help once you provide some text-based data.
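Beyond that fix, here is one possible way to compute the pregame streaks without the nested loops. This is only a sketch, not the accepted approach from the thread: the reshape-then-groupby idea and the pregame_streak helper are my own, and the column names are taken from the sample data above.
import pandas as pd

games = pd.read_csv('MLB_Scores_2014_2018.csv')

# One row per (game, team): stack each team's away and home appearances.
away = games[['Season', 'Game Index', 'Away Team', 'Winner']].rename(columns={'Away Team': 'Team'})
home = games[['Season', 'Game Index', 'Home Team', 'Winner']].rename(columns={'Home Team': 'Team'})
long_games = pd.concat([away, home]).sort_values('Game Index')
long_games['won'] = (long_games['Winner'] == long_games['Team']).astype(int)

def pregame_streak(won):
    # Shift by one so each game only sees earlier results, then count
    # consecutive identical results within each run.
    prior = won.shift(fill_value=0)
    runs = (prior != prior.shift()).cumsum()  # new run id whenever the result flips
    return prior.groupby(runs).cumsum()

# Streaks reset per season and per team.
grouped = long_games.groupby(['Season', 'Team'])['won']
long_games['Win Streak'] = grouped.transform(pregame_streak)
long_games['Lose Streak'] = grouped.transform(lambda w: pregame_streak(1 - w))

# Merge the streaks back so each game row carries away and home values.
streaks = long_games[['Game Index', 'Team', 'Win Streak', 'Lose Streak']]
games = games.merge(
    streaks.rename(columns={'Team': 'Away Team', 'Win Streak': 'Away Win Streak',
                            'Lose Streak': 'Away Lose Streak'}),
    on=['Game Index', 'Away Team'])
games = games.merge(
    streaks.rename(columns={'Team': 'Home Team', 'Win Streak': 'Home Win Streak',
                            'Lose Streak': 'Home Lose Streak'}),
    on=['Game Index', 'Home Team'])
The key trick is the shift, which ensures each game only counts the results before it, plus the (prior != prior.shift()).cumsum() run labeling, which restarts the count whenever a streak is broken.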

Related

Split a row into more rows based on a string (regex)

I have this df and I want to split it:
cities3 = {'Metropolitan': ['New York', 'Los Angeles', 'San Francisco'],
           'NHL': ['RangersIslandersDevils', 'KingsDucks', 'Sharks']}
cities4 = pd.DataFrame(cities3)
to get a new df with one NHL team per row.
What code can I use?
You can split your column based on an upper-case letter preceded by a lower-case one using this regex:
(?<=[a-z])(?=[A-Z])
and then you can use the technique described in this answer to replace the column with its exploded version:
cities4 = cities4.assign(NHL=cities4['NHL'].str.split(r'(?<=[a-z])(?=[A-Z])')).explode('NHL')
Output:
Metropolitan NHL
0 New York Rangers
0 New York Islanders
0 New York Devils
1 Los Angeles Kings
1 Los Angeles Ducks
2 San Francisco Sharks
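To see what that regex does on a single string, here is a quick standalone check (my own illustration; splitting on a zero-width match like this needs Python 3.7+):
import re

print(re.split(r'(?<=[a-z])(?=[A-Z])', 'RangersIslandersDevils'))
# ['Rangers', 'Islanders', 'Devils']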
If you want to reset the index (to 0..5) you can do this (either after the above command or as a part of it)
cities4.reset_index().reindex(cities4.columns, axis=1)
Output:
Metropolitan NHL
0 New York Rangers
1 New York Islanders
2 New York Devils
3 Los Angeles Kings
4 Los Angeles Ducks
5 San Francisco Sharks
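If you don't need the old index at all, a simpler equivalent (my suggestion, not part of the original answer) is:
cities4 = cities4.reset_index(drop=True)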

Extracting country information from description using geograpy

PROBLEM: I want to extract country information from a user description. So far, I'm giving the geograpy package a try. I like its behavior when the input is not very clear, for example Evesham or Rochdale; however, the package interprets some strings like Zaragoza,España as two mentions, while the user is clearly saying that the location is in Spain. Still, I don't know why amsterdam does not give Holland as output... How can I improve the outputs? Am I missing anything important? Is there a better package to achieve this?
DATA: My data example is:
user_location
2 Socialist Republic of Alachua
3 Hérault, France
4 Gwalior, India
5 Zaragoza,España
7 amsterdam
8 Evesham
9 Rochdale
I want to get something like this:
user_location country
2 Socialist Republic of Alachua ['USSR', 'United States']
3 Hérault, France ['France']
4 Gwalior, India ['India']
5 Zaragoza,España ['Spain']
7 amsterdam ['Holland']
8 Evesham ['United Kingdom']
9 Rochdale ['United Kingdom', 'United States']
REPREX:
import pandas as pd
import geograpy  # installed via the geograpy3 package
df = pd.DataFrame.from_dict({'user_location': {2: 'Socialist Republic of Alachua', 3: 'Hérault, France', 4: 'Gwalior, India', 5: 'Zaragoza,España', 7: 'amsterdam ', 8: 'Evesham', 9: 'Rochdale'}})
df['country'] = df['user_location'].apply(lambda x: geograpy.get_place_context(text=x).countries if pd.notnull(x) else x)
print(df)
#> user_location country
#> 2 Socialist Republic of Alachua [USSR, Union of Soviet Socialist Republics, Al...
#> 3 Hérault, France [France, Hérault]
#> 4 Gwalior, India [British Indian Ocean Territory, Gwalior, India]
#> 5 Zaragoza,España [Zaragoza, España, Spain, El Salvador]
#> 7 amsterdam []
#> 8 Evesham [Evesham, United Kingdom]
#> 9 Rochdale [Rochdale, United Kingdom, United States]
Created on 2020-06-02 by the reprexpy package
geograpy3 was not behaving correctly anymore regarding country lookup, since it didn't check whether pycountry returned None. As a committer, I just fixed this.
I have added your example, slightly modified to avoid the pandas import, as a unit test case:
def testStackoverflow62152428(self):
    '''
    see https://stackoverflow.com/questions/62152428/extracting-country-information-from-description-using-geograpy?noredirect=1#comment112899776_62152428
    '''
    examples = {2: 'Socialist Republic of Alachua', 3: 'Hérault, France', 4: 'Gwalior, India',
                5: 'Zaragoza,España', 7: 'amsterdam ', 8: 'Evesham', 9: 'Rochdale'}
    for index, text in examples.items():
        places = geograpy.get_geoPlace_context(text=text)
        print("example %d: %s" % (index, places.countries))
and the result is now:
example 2: ['United States']
example 3: ['France']
example 4: ['British Indian Ocean Territory', 'India']
example 5: ['Spain', 'El Salvador']
example 7: []
example 8: ['United Kingdom']
example 9: ['United Kingdom', 'United States']
Indeed, there is room for improvement for example 5. I have added an issue, https://github.com/somnathrakshit/geograpy3/issues/7 - please stay tuned...

Adding column of values to pandas DataFrame

I'm doing a simple sentiment analysis and am stuck on something that I feel is very simple. I'm trying to add a new column with a set of values, in this example compound values. But after the for loop iterates, it adds the same value for all the rows rather than a value for each iteration. The compound values are the last column in the DataFrame. There should be a quick fix. Thanks!
for i, row in real.iterrows():
    real['compound'] = sid.polarity_scores(real['title'][i])['compound']
title text subject date compound
0 As U.S. budget fight looms, Republicans flip t... WASHINGTON (Reuters) - The head of a conservat... politicsNews December 31, 2017 0.2263
1 U.S. military to accept transgender recruits o... WASHINGTON (Reuters) - Transgender people will... politicsNews December 29, 2017 0.2263
2 Senior U.S. Republican senator: 'Let Mr. Muell... WASHINGTON (Reuters) - The special counsel inv... politicsNews December 31, 2017 0.2263
3 FBI Russia probe helped by Australian diplomat... WASHINGTON (Reuters) - Trump campaign adviser ... politicsNews December 30, 2017 0.2263
4 Trump wants Postal Service to charge 'much mor... SEATTLE/WASHINGTON (Reuters) - President Donal... politicsNews December 29, 2017 0.2263
IIUC: each pass through your loop assigns a single scalar to the entire compound column, so after the final iteration every row holds the same value. Compute the score row-wise instead:
real['compound'] = real.apply(lambda row: sid.polarity_scores(row['title'])['compound'], axis=1)
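If you'd rather keep the explicit loop, a minimal fix (my sketch, equivalent in result) is to assign one cell at a time with .at instead of reassigning the whole column:
for i, row in real.iterrows():
    # assign only this row's cell, using the row object iterrows() already provides
    real.at[i, 'compound'] = sid.polarity_scores(row['title'])['compound']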

Python get a list of cities, states, region

I have a dataframe that contains a column of cities. I am looking to match the city with its region. For example, San Francisco would be West.
Here is my original dataframe:
data = {'city': ['San Francisco', 'New York', 'Chicago', 'Philadelphia', 'Boston'],
        'year': [2012, 2012, 2013, 2014, 2014],
        'reports': [4, 24, 31, 2, 3]}
df = pd.DataFrame(data, index=['Cochice', 'Pima', 'Santa Cruz', 'Maricopa', 'Yuma'])
df
            city           year  reports
Cochice     San Francisco  2012  4
Pima        New York       2012  24
Santa Cruz  Chicago        2013  31
Maricopa    Philadelphia   2014  2
Yuma        Boston         2014  3
Here I pull data that contains region by state. However, it does not contain city.
pd.read_csv('https://raw.githubusercontent.com/cphalpert/census-regions/master/us%20census%20bureau%20regions%20and%20divisions.csv')
How do I get the state per city? That way I can then join the original dataframe including state with the second dataframe that has region.
On this GitHub project there is a CSV that the creator claims contains all American cities and states.
The following data is presented:
City|State short name|State full name|County|City Alias Mixed Case
Example:
San Francisco|CA|California|SAN FRANCISCO|San Francisco
San Francisco|CA|California|SAN MATEO|San Francisco Intnl Airport
San Francisco|CA|California|SAN MATEO|San Francisco
San Francisco|CA|California|SAN FRANCISCO|Presidio
San Francisco|CA|California|SAN FRANCISCO|Bank Of America
San Francisco|CA|California|SAN FRANCISCO|Wells Fargo Bank
San Francisco|CA|California|SAN FRANCISCO|First Interstate Bank
San Francisco|CA|California|SAN FRANCISCO|Uc San Francisco
San Francisco|CA|California|SAN FRANCISCO|Union Bank Of California
San Francisco|CA|California|SAN FRANCISCO|Irs Service Center
San Francisco|CA|California|SAN FRANCISCO|At & T
San Francisco|CA|California|SAN FRANCISCO|Pacific Gas And Electric
Sacramento|CA|California|SACRAMENTO|Sacramento
Sacramento|CA|California|SACRAMENTO|Ca Franchise Tx Brd Brm
Sacramento|CA|California|SACRAMENTO|Ca State Govt Brm
I suggest you parse the above file to extract the info you need (in this case, the state for a given city), then correlate it with the region in the other CSV you have.
Better still would be to build your own table from all the CSVs you have access to, containing only the info you really need.
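As a sketch of that join, assuming the pipe-delimited file has been saved locally as us_cities.csv (a hypothetical file name) with the columns listed above, and assuming the census-regions CSV names its state column State:
import pandas as pd

cols = ['City', 'State short name', 'State full name', 'County', 'City Alias Mixed Case']
cities = pd.read_csv('us_cities.csv', sep='|', names=cols)

# Collapse to one state per city name; real data needs more care for
# city names that exist in several states.
city_state = cities.drop_duplicates(subset='City')[['City', 'State full name']]

regions = pd.read_csv('https://raw.githubusercontent.com/cphalpert/census-regions/master/us%20census%20bureau%20regions%20and%20divisions.csv')

# city -> state, then state -> region
df = df.merge(city_state, left_on='city', right_on='City', how='left')
df = df.merge(regions, left_on='State full name', right_on='State', how='left')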

How can I improve performance on my apply() with fuzzy matching statement

I've written a function called muzz that leverages the fuzzywuzzy module to 'merge' two pandas dataframes. It works great, but the performance is pretty bad on larger frames. Please take a look at my apply() that does the extracting/scoring and let me know if you have any ideas that could speed it up.
import pandas as pd
import numpy as np
import fuzzywuzzy as fw
import fuzzywuzzy.process  # the process submodule must be imported explicitly
Create a frame of raw data
dfRaw = pd.DataFrame({'City': {0: u'St Louis',
                               1: 'Omaha',
                               2: 'Chicogo',
                               3: 'Kansas city',
                               4: 'Des Moine'},
                      'State': {0: 'MO', 1: 'NE', 2: 'IL', 3: 'MO', 4: 'IA'}})
Which yields
City State
0 St Louis MO
1 Omaha NE
2 Chicogo IL
3 Kansas city MO
4 Des Moine IA
Then a frame that represents the good data that we want to look up
dfLocations = pd.DataFrame({'City': {0: 'Saint Louis',
                                     1: u'Omaha',
                                     2: u'Chicago',
                                     3: u'Kansas City',
                                     4: u'Des Moines'},
                            'State': {0: 'MO', 1: 'NE', 2: 'IL',
                                      3: 'KS', 4: 'IA'},
                            u'Zip': {0: '63201', 1: '68104', 2: '60290',
                                     3: '68101', 4: '50301'}})
Which yields
City State Zip
0 Saint Louis MO 63201
1 Omaha NE 68104
2 Chicago IL 60290
3 Kansas City KS 68101
4 Des Moines IA 50301
and now the muzz function. EDIT: Added the choices = right[match_col_name] line and used choices in the apply, per Brenbarn's suggestion. Also per Brenbarn's suggestion, I ran some tests with extractOne() outside the apply, and it appears to be the bottleneck. Maybe there's a faster way to do the fuzzy matching?
def muzz(left, right, on, match_col_name='match_on', score_col_name='score_match',
         right_suffix='_match', score_cutoff=80):
    right[match_col_name] = np.sum(right[on], axis=1)
    choices = right[match_col_name]
    ### The offending statement ###
    left[[match_col_name, score_col_name]] = pd.Series(np.sum(left[on], axis=1)).apply(
        lambda x: pd.Series(fw.process.extractOne(x, choices, score_cutoff=score_cutoff)))
    dfTemp = pd.merge(left, right, how='left', on=match_col_name, suffixes=('', right_suffix))
    return dfTemp.drop(match_col_name, axis=1)
Calling muzz
muzz(dfRaw.copy(),dfLocations,on=['City','State'], score_cutoff=85)
Which yields
City State score_match City_match State_match Zip
0 St Louis MO 87 Saint Louis MO 63201
1 Omaha NE 100 Omaha NE 68104
2 Chicogo IL 89 Chicago IL 60290
3 Kansas city MO NaN NaN NaN NaN
4 Des Moine IA 96 Des Moines IA 50301
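One idea, since extractOne itself appears to be the bottleneck: rapidfuzz is a much faster reimplementation of fuzzywuzzy with a near-identical process.extractOne. This is only a sketch of the call shape on made-up data, not a drop-in edit of muzz (note that rapidfuzz returns a (choice, score, index) triple):
from rapidfuzz import process, fuzz

choices = ['Saint Louis MO', 'Omaha NE', 'Chicago IL', 'Kansas City KS', 'Des Moines IA']
# WRatio mirrors fuzzywuzzy's default scorer; extractOne returns None
# if nothing clears the cutoff.
best = process.extractOne('St Louis MO', choices, scorer=fuzz.WRatio, score_cutoff=85)
print(best)  # e.g. ('Saint Louis MO', <score>, 0)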
