With this dataset I need to find which team has scored the most goals in non-friendly games since 2010 (i.e. from line 31992 of the CSV onwards).
I started by isolating non-friendly games with:
import numpy as np
conditions = [df['tournament'] != 'Friendly']
values = ['FIFAEVENT']
df['FE'] = np.select(conditions, values)
I don't know how to proceed from here, to be honest. Any help or suggestions are greatly appreciated.
Dataset : https://www.kaggle.com/martj42/international-football-results-from-1872-to-2017
You can try this:
import pandas as pd
import datetime
df = pd.read_csv('../input/international-football-results-from-1872-to-2017/results.csv')
df['date'] = pd.to_datetime(df['date']).dt.date
df_sub = df[(df.date>datetime.date(2009,12,31))&(df.tournament!='Friendly')] #Selecting dates from 2010 and excluding friendly matches
unique_teams = list(df_sub.home_team.unique())
unique_teams.extend(list(df_sub.away_team.unique()))
unique_teams = set(unique_teams) # finding the unique set of teams
goals = []
for team in unique_teams:
    home_goal = df_sub[df_sub['home_team']==team]['home_score'].sum()
    away_goal = df_sub[df_sub['away_team']==team]['away_score'].sum()
    goals.append(home_goal + away_goal) # calculate and append the total goals for each team
df_most_goals = pd.DataFrame(data={'teams':list(unique_teams),'total_goals':goals}) #creating a dataframe with team and the total goals scored
df_most_goals = df_most_goals.sort_values(by='total_goals',ascending=False) #Sorting based on descending order
df_most_goals = df_most_goals.reset_index(drop=True)
print(df_most_goals.head()) # printing top5 teams scored most number of goals after 2010
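If you prefer to avoid the explicit loop over teams, here is a vectorized sketch of the same idea (same column names as above): stack the home and away goals into one long frame, then sum per team.
import pandas as pd
import datetime
df = pd.read_csv('../input/international-football-results-from-1872-to-2017/results.csv')
df['date'] = pd.to_datetime(df['date']).dt.date
df_sub = df[(df.date > datetime.date(2009, 12, 31)) & (df.tournament != 'Friendly')]
# Stack home and away goals into one long "team / goals" frame
goals = pd.concat([
    df_sub[['home_team', 'home_score']].rename(columns={'home_team': 'team', 'home_score': 'goals'}),
    df_sub[['away_team', 'away_score']].rename(columns={'away_team': 'team', 'away_score': 'goals'}),
])
# Sum per team and sort descending
df_most_goals = (goals.groupby('team', as_index=False)['goals'].sum()
                      .sort_values('goals', ascending=False)
                      .reset_index(drop=True))
print(df_most_goals.head()) # top 5 teams by total goals since 2010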
I'm working on a script that calculates a sample size, and then extracts samples from each category in a dataframe evenly. I want to re-use this code for various dataframes with different categories, but I'm having trouble figuring out the for loop to do this:
df2 = df.loc[(df['Track Item']=='Y')]
categories = df2['Category'].unique()
categories_total = len(categories)
total_rows = len(df2.axes[0])
ss = (2.58**2)*(0.5)*(1-0.5)/.04**2
ss2 = ss / categories_total
ss3 = round(ss2)
one = df.loc[(df['Category']=='HOUSEHOLD FANS')].sample(ss3)
two = df.loc[(df['Category']=='HUMIDIFIERS')].sample(ss3)
three = df.loc[(df['Category']=='HOME WATER FILTERS')].sample(ss3)
four = df.loc[(df['Category']=='CAMPING & HIKING WATER FILTERS')].sample(ss3)
five = df.loc[(df['Category']=='THERMOMETERS')].sample(ss3)
six = df.loc[(df['Category']=='AIR PURIFIERS')].sample(ss3)
seven = df.loc[(df['Category']=='DETECTORS')].sample(ss3)
eight = df.loc[(df['Category']=='AIR CONDITIONERS')].sample(ss3)
nine = df.loc[(df['Category']=='AROMATHERAPY')].sample(ss3)
ten = df.loc[(df['Category']=='AIR HEATING')].sample(ss3)
eleven = df.loc[(df['Category']=='HOUSEHOLD FANS')].sample(ss3)
I need to loop through each category, taking a sample from each one evenly. Any idea how I can accomplish this task?
How about a groupby with sample instead:
df.groupby('Category').apply(lambda x: x.sample(ss3))
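Note that sample(ss3) will raise an error for any category with fewer than ss3 rows. A small variant (just a sketch, reusing df and ss3 from your code) caps the sample size per group and flattens the result back into one dataframe:
# Cap the sample size at each group's size so small categories don't raise an error
sampled = (df.groupby('Category', group_keys=False)
             .apply(lambda x: x.sample(min(len(x), ss3)))
             .reset_index(drop=True))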
I am a complete newbie to Python 3 and coding, so go easy on me please. :)
As a project I'm creating a football league table based on 2018 EPL results. I have been able to split the .csv file containing an entire season's worth of data into round-by-round .csv files using the pandas module. Now I need to extract the table points for each team by round, based on the home and away goals for each team. I'm having a hard time associating the goals with the teams in each fixture. I can figure out how to apply win/draw/lose (3/1/0) points, but only manually per fixture, not dynamically for all fixtures in the round. Then I need to write the table to another .csv file.
FTHG-Full Time Home Goals, FTAG-Full Time Away Goals, FTR-Full Time Result
Example Data
Unnamed: 0,Date,HomeTeam,AwayTeam,FTHG,FTAG,FTR
0,10/08/2018,Man United,Leicester,2,1,H
1,11/08/2018,Bournemouth,Cardiff,2,0,H
2,11/08/2018,Fulham,Crystal Palace,0,2,A
3,11/08/2018,Huddersfield,Chelsea,0,3,A
4,11/08/2018,Newcastle,Tottenham,1,2,A
5,11/08/2018,Watford,Brighton,2,0,H
6,11/08/2018,Wolves,Everton,2,2,D
7,12/08/2018,Arsenal,Man City,0,2,A
8,12/08/2018,Liverpool,West Ham,4,0,H
9,12/08/2018,Southampton,Burnley,0,0,D
Example Code
import pandas as pd
results = pd.read_csv("2018 Round 1.csv")
team = results.iloc[2,2]
if results.iloc[2, 4] > results.iloc[2, 5]:
    points = 3
elif results.iloc[2, 4] < results.iloc[2, 5]:
    points = 0
else:
    points = 1
table_entry = (team + " " + str(points))
print(table_entry)
pd.DataFrame([table_entry]).to_csv("EPL Table Round 1.csv", index=False)
Thanks for your help.
I hope this helps :)
Please feel free to ask if the code isn't clear.
import pandas as pd
import numpy as np
df = pd.read_csv('foot.txt')
# Make a list with all team names (deduplicated, since most teams appear both home and away)
Home_teams = pd.unique(df['HomeTeam'])
Away_teams = pd.unique(df['AwayTeam'])
teams = pd.unique(np.concatenate((Home_teams, Away_teams)))
df_teams = pd.DataFrame(columns=['team', 'points'])
#For each team in the list...
for team in teams:
    print("******* " + team + " *******")
    points = 0
    df_home = df[df['HomeTeam'] == team]
    res_home = df_home['FTR'].value_counts()
    try:
        points += res_home['H'] * 3
    except KeyError:
        print("Didn't win when home")
    try:
        points += res_home['D'] * 1
    except KeyError:
        print("No home draws")
    df_away = df[df['AwayTeam'] == team]
    res_away = df_away['FTR'].value_counts()
    try:
        points += res_away['A'] * 3
    except KeyError:
        print("Didn't win when away")
    try:
        points += res_away['D'] * 1  # away draws are worth a point too
    except KeyError:
        print("No away draws")
    # DataFrame.append was removed in pandas 2.0, so build the row with concat instead
    df_teams = pd.concat([df_teams, pd.DataFrame([{'team': team, 'points': points}])], ignore_index=True)
    print(team + " has " + str(points) + " points")
I am writing an emulation of a bank deposit account in pandas.
I got stuck with compound interest (the result of reinvesting interest, so that interest in the next period is earned on the principal sum plus the previously accumulated interest).
So far I have the following code:
import pandas as pd
from pandas.tseries.offsets import MonthEnd
from datetime import datetime
# Create a date range
start = '21/11/2017'
now = datetime.now()
date_rng = pd.date_range(start=start, end=now, freq='d')
# Create an example data frame with the timestamp data
df = pd.DataFrame(date_rng, columns=['Date'])
# Add column (EndOfMonth) - shows the last day of the current month
df['LastDayOfMonth'] = pd.to_datetime(df['Date']) + MonthEnd(0)
# Add columns for Debit, Credit, Total and Description
df['Debit'] = 0
df['Credit'] = 0
df['Total'] = 0
df['Description'] = ''
# Set "IsItLastDay" (True on the last day of each month)
df['IsItLastDay'] = df['LastDayOfMonth'] == df['Date']
# Add the transaction of the first deposit
df.loc[df.Date == '2017-11-21', ['Debit', 'Description']] = 10000, "First deposit"
# Calculate the principal sum (the sum of all deposits minus all withdrawals, plus all compound interest)
df['Total'] = (df.Debit - df.Credit).cumsum()
# Calculate interest per day and Cumulative interest
# 11% is the interest rate per year
df['InterestPerDay'] = (df['Total'] * 0.11) / 365
df['InterestCumulative'] = ((df['Total'] * 0.11) / 365).cumsum()
# Change the order of columns
df = df[['Date', 'LastDayOfMonth', 'IsItLastDay', 'InterestPerDay', 'InterestCumulative', 'Debit', 'Credit', 'Total', 'Description']]
df.to_excel("results.xlsx")
The output file looks fine, but I need the following:
The "InterestCumulative" column should be added to the "Total" column on the last day of each month (compounding the interest).
At the beginning of each month the "InterestCumulative" column should be cleared (because that interest was already added to the principal sum).
How can I do this?
You're going to need to loop, as your total changes depending on previous rows, which then affects the later rows. As a result your current interest calculations are wrong.
total = 0
cumulative_interest = 0
total_per_day = []
interest_per_day = []
cumulative_per_day = []
for day in df.itertuples():
    total += day.Debit - day.Credit
    interest = total * 0.11 / 365
    cumulative_interest += interest
    if day.IsItLastDay:
        total += cumulative_interest
    total_per_day.append(total)
    interest_per_day.append(interest)
    cumulative_per_day.append(cumulative_interest)
    if day.IsItLastDay:
        cumulative_interest = 0
df.Total = total_per_day
df.InterestPerDay = interest_per_day
df.InterestCumulative = cumulative_per_day
This is unfortunately a lot more confusing looking, but that's what happens when values depend on previous values. Depending on your exact requirements there may be nice ways to simplify this using math, but otherwise this is what you've got.
I've written this directly into stackoverflow so it may not be perfect.
For every user, I'd like to find the date of their earliest visit that falls within a 90 day lookback window from their first order date.
data = {"date":{"145586":"2016-08-02","247940":"2016-10-04","74687":"2017-01-05","261739":"2016-10-05","121154":"2016-10-07","82658":"2016-12-01","196680":"2016-12-06","141277":"2016-12-15","189763":"2016-12-18","201564":"2016-12-20","108930":"2016-12-23"},"fullVisitorId":{"145586":643786734868244401,"247940":7634897085866546110,"74687":7634897085866546110,"261739":7634897085866546110,"121154":7634897085866546110,"82658":7634897085866546110,"196680":7634897085866546110,"141277":7634897085866546110,"189763":643786734868244401,"201564":643786734868244401,"108930":7634897085866546110},"sessionId":{"145586":"0643786734868244401_1470168779","247940":"7634897085866546110_1475590935","74687":"7634897085866546110_1483641292","261739":"7634897085866546110_1475682997","121154":"7634897085866546110_1475846055","82658":"7634897085866546110_1480614683","196680":"7634897085866546110_1481057822","141277":"7634897085866546110_1481833373","189763":"0643786734868244401_1482120932","201564":"0643786734868244401_1482246921","108930":"7634897085866546110_1482521314"},"orderNumber":{"145586":0.0,"247940":0.0,"74687":1.0,"261739":0.0,"121154":0.0,"82658":0.0,"196680":0.0,"141277":0.0,"189763":1.0,"201564":0.0,"108930":0.0}}
import pandas as pd
from datetime import timedelta

test = pd.DataFrame(data=data)
test.date = pd.to_datetime(test.date)
lookback = test[test['orderNumber']==1]['date'].apply(lambda x: x - timedelta(days=90))
lookback.name = 'window_min'
ids = test['fullVisitorId']
ids = ids.reset_index()
ids = ids.set_index('index')
lookback = lookback.reset_index()
lookback['fullVisitorId'] = lookback['index'].map(ids['fullVisitorId'])
lookback = lookback.set_index('fullVisitorId')
test['window'] = test['fullVisitorId'].map(lookback['window_min'])
test = test[test['window']<test['date']]
test.loc[test.groupby('fullVisitorId')['date'].idxmin()]
This works, but I feel like there ought to be a cleaner way...
How about this? Basically we assign a new column (the first order date minus 90 days) to help us filter out the rows that fall outside the window.
We then apply groupby and pick the first (0-th) element.
import pandas as pd
data = {"date":{"145586":"2016-08-02","247940":"2016-10-04","74687":"2017-01-05","261739":"2016-10-05","121154":"2016-10-07","82658":"2016-12-01","196680":"2016-12-06","141277":"2016-12-15","189763":"2016-12-18","201564":"2016-12-20","108930":"2016-12-23"},"fullVisitorId":{"145586":643786734868244401,"247940":7634897085866546110,"74687":7634897085866546110,"261739":7634897085866546110,"121154":7634897085866546110,"82658":7634897085866546110,"196680":7634897085866546110,"141277":7634897085866546110,"189763":643786734868244401,"201564":643786734868244401,"108930":7634897085866546110},"sessionId":{"145586":"0643786734868244401_1470168779","247940":"7634897085866546110_1475590935","74687":"7634897085866546110_1483641292","261739":"7634897085866546110_1475682997","121154":"7634897085866546110_1475846055","82658":"7634897085866546110_1480614683","196680":"7634897085866546110_1481057822","141277":"7634897085866546110_1481833373","189763":"0643786734868244401_1482120932","201564":"0643786734868244401_1482246921","108930":"7634897085866546110_1482521314"},"orderNumber":{"145586":0.0,"247940":0.0,"74687":1.0,"261739":0.0,"121154":0.0,"82658":0.0,"196680":0.0,"141277":0.0,"189763":1.0,"201564":0.0,"108930":0.0}}
test = pd.DataFrame(data=data)
test.date = pd.to_datetime(test.date)
test.sort_values(by='date', inplace=True)
firstorder = test[test.orderNumber > 0].set_index('fullVisitorId').date
test['firstorder_90'] = test.fullVisitorId.map(firstorder - pd.Timedelta(days=90))
test.query('date >= firstorder_90').groupby('fullVisitorId', as_index=False).nth(0)
We get:
date fullVisitorId sessionId \
121154 2016-10-07 7634897085866546110 7634897085866546110_1475846055
189763 2016-12-18 643786734868244401 0643786734868244401_1482120932
orderNumber firstorder_90
121154 0.0 2016-10-07
189763 1.0 2016-09-19
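If you only need the earliest qualifying visit date per user rather than the whole row, the same filter followed by min() does it (a small sketch reusing test and firstorder_90 from the code above):
earliest = (test.query('date >= firstorder_90')
                .groupby('fullVisitorId')['date'].min())
print(earliest)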
I want to calculate the percentage of broken water points per community. So far, I am able to get the list of communities and the broken water points.
This is my code so far:
import pandas as pd
df = pd.DataFrame((data))
gb = df.groupby(['water_point_condition'])
grouped = gb[["communities_villages", "water_point_condition"]].get_group("broken")
print(grouped)
This prints the communities and the condition for every broken water point, but not the percentage per community.
This fixed my problem and I was able to get the percentage of broken water points per community:
df = pd.DataFrame(data)
grouped = df.groupby(['water_point_condition'])
rank_by_percentage = (
    100
    * df[df.water_point_condition == 'broken'].communities_villages.value_counts()
    / grouped['water_point_condition'].get_group('broken').count()
)
print(rank_by_percentage)
You need to group by communities_villages and not water_point_condition:
df.groupby('communities_villages')['water_point_condition'].apply(lambda x: (x == 'broken').mean())
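To turn that into percentages sorted from most to least broken, multiply by 100 and sort (a minimal sketch on made-up data with the same two columns):
import pandas as pd
# Hypothetical example data with the same two columns
df = pd.DataFrame({
    'communities_villages': ['A', 'A', 'A', 'B', 'B'],
    'water_point_condition': ['broken', 'functional', 'broken', 'functional', 'broken'],
})
pct_broken = (df.groupby('communities_villages')['water_point_condition']
                .apply(lambda x: (x == 'broken').mean() * 100)
                .sort_values(ascending=False))
print(pct_broken)  # A: 66.67, B: 50.0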