Associating .csv column data for calculations - python

I am a complete nube to Python3 and coding so go easy on me please. :)
As a project I'm creating a football league table based on 2018 EPL results. I have been able to break the .csv file containing an entire seasons worth of data into round by round results, into .csv using Pandas module. Now I need to extract the table points for each team by round, based on the home and away goals for each team. I'm having a hard time associating the goals with the teams in each fixture. I can figure out how to apply win/draw/lose (3/1/0) points but only mandrolically per fixture, not dynamically for all fixtures in the round. Then I need to write the table to another .csv file.
FTHG-Full Time Home Goals, FTAG-Full Time Away Goals, FTR-Full Time Result
Example Data
Unnamed: 0,Date,HomeTeam,AwayTeam,FTHG,FTAG,FTR
0,10/08/2018,Man United,Leicester,2,1,H
1,11/08/2018,Bournemouth,Cardiff,2,0,H
2,11/08/2018,Fulham,Crystal Palace,0,2,A
3,11/08/2018,Huddersfield,Chelsea,0,3,A
4,11/08/2018,Newcastle,Tottenham,1,2,A
5,11/08/2018,Watford,Brighton,2,0,H
6,11/08/2018,Wolves,Everton,2,2,D
7,12/08/2018,Arsenal,Man City,0,2,A
8,12/08/2018,Liverpool,West Ham,4,0,H
9,12/08/2018,Southampton,Burnley,0,0,D
Example Code
import pandas as pd
results = pd.read_csv("2018 Round 1.csv")
team = results.iloc[2,2]
if results.iloc[2,4] > results.iloc[2,5]:
points = 3
elif results.iloc[2, 4] < results.iloc[2, 5]:
points = 0
else:
results.iloc[2, 4] = results.iloc[2, 5]
points = 1
table_entry = (team + " " + str(points))
print(table_entry)
table_entry = pd.to_csv("EPL Table Round 1.csv", index = False)
Thanks for your help.

I hope this helps :)
Please fell free to ask if the code it's not clear
import pandas as pd
import numpy as np
df = pd.read_csv('foot.txt')
#Make a list with all tema names
Home_teams = pd.unique(df['HomeTeam'])
Away_teams = pd.unique(df['AwayTeam'])
teams = np.concatenate((Home_teams, Away_teams))
df_teams = pd.DataFrame(columns=['team', 'points'])
#For each team in the list...
for team in teams:
print("*******" + team+ "*****")
points = 0
df_home = df[(df['HomeTeam'] == team)]
res_home = df_home['FTR'].value_counts()
try:
points += res_home['H']*3;
except:
print("Didn't win when Home")
try:
points += res_home['D']*1;
except:
print("No Draws")
df_away = df[(df['AwayTeam'] == team)]
res_away = df_away['FTR'].value_counts()
try:
points += res_away['A']*3;
except:
print("Didn't win when Away")
df_teams = df_teams.append({'team': team, 'points': points}, ignore_index=True)
print(team +"has "+ str(points) +" points" )

Related

Problem generating a list with a numeric qualifier

I am working on a course with low code requirements, and have one step where I am stuck.
I have this code that creates a list of restaurants and the number of reviews each has:
Filter the rated restaurants
df_rated = df[df['rating'] != 'Not given'].copy()
df_rated['rating'] = df_rated['rating'].astype('int')
df_rating_count = df_rated.groupby(['restaurant_name'])['rating'].count().sort_values(ascending = False).reset_index()
df_rating_count.head()
From there I am supposed to create a list limited to those above 50 reviews, starting from this base:
# Get the restaurant names that have rating count more than 50
rest_names = df_rating_count['______________']['restaurant_name']
# Filter to get the data of restaurants that have rating count more than 50
df_mean_4 = df_rated[df_rated['restaurant_name'].isin(rest_names)].copy()
# Group the restaurant names with their ratings and find the mean rating of each restaurant
df_mean_4.groupby(['_______'])['_______'].mean().sort_values(ascending = False).reset_index().dropna() ## Complete the code to find the mean rating
Where I am stuck is on the first step.
rest_names = df_rating_count['______________']['restaurant_name']
I am pretty confident in the other 2 steps.
df_mean_4 = df_rated[df_rated['restaurant_name'].isin(rest_names)].copy()
df_mean_4.groupby(['restaurant_name'])['rating'].mean().sort_values(ascending = False).reset_index().dropna()
I have frankly tried so many different things I don't even know where to start.
Does anyone have any hints to at least point me in the right direction?
you can index and filter using [].
# Get the restaurant names that have rating count more than 50
rest_names = df_rating_count[df_rating_count['rating'] > 50]['restaurant_name']
#function to determine the revenue
def compute_rev(x):
if x > 20:
return x*0.25
elif x > 5:
return x*0.15
else:
return x*0
## Write the appropriate column name to compute the revenue
df['Revenue'] = df['________'].apply(compute_rev)
df.head()

Mix and match columns - python/pandas

With this set I need to find which team has scored the most goals in non-friendly games since 2010 (or line 31992).
I started by isolating non-friendly games with:
conditions = [df['tournament'] != ('Friendly')]
values = ['FIFAEVENT']
df['FE'] = np.select(conditions, values)
Don't know how to proceed from here tbh. Any help or suggestions is greatly appreciated.
Dataset : https://www.kaggle.com/martj42/international-football-results-from-1872-to-2017
You can try this:
import pandas as pd
import datetime
df = pd.read_csv('../input/international-football-results-from-1872-to-2017/results.csv')
df['date'] = pd.to_datetime(df['date']).dt.date
df_sub = df[(df.date>datetime.date(2009,12,31))&(df.tournament!='Friendly')] #Selecting dates from 2010 and excluding friendly matches
unique_teams = list(df_sub.home_team.unique())
unique_teams.extend(list(df_sub.away_team.unique()))
unique_teams = set(unique_teams) # finding the unique set of teams
goals = []
for team in unique_teams:
home_goal = df_sub[df_sub['home_team']==team]['home_score'].sum()
away_goal = df_sub[df_sub['away_team']==team]['away_score'].sum()
goals.append(home_goal+away_goal) # calculate and append total goal for each team
df_most_goals = pd.DataFrame(data={'teams':list(unique_teams),'total_goals':goals}) #creating a dataframe with team and the total goals scored
df_most_goals = df_most_goals.sort_values(by='total_goals',ascending=False) #Sorting based on descending order
df_most_goals = df_most_goals.reset_index(drop=True)
print(df_most_goals.head()) # printing top5 teams scored most number of goals after 2010

Concatenating tables with axis=1 in Orange python

I'm fairly new to Orange.
I'm trying to separate rows of angle (elv) into intervals.
Let's say, if I want to separate my 90-degree angle into 8 intervals, or 90/8 = 11.25 degrees per interval.
Here's the table I'm working with
Here's what I did originally, separating them by their elv value
Here's the result that I want, x rows 16 columns separated by their elv value.
But I want them done dynamically.
I list them out and turn each list into a table with x rows and 2 columns.
This is what I originally did
from Orange.data.table import Table
from Orange.data import Domain, Domain, ContinuousVariable, DiscreteVariable
import numpy
import pandas as pd
from pandas import DataFrame
df = pd.DataFrame()
num = 10 #number of intervals that we want to seperate our elv into.
interval = 90.00/num #separating them into degree/interval
low = 0
high = interval
table = []
first = []
second = []
for i in range(num):
between = []
if i != 0: #not the first run
low = high
high = high + interval
for row in in_data: #Run through the whole table to see if the elv falls in between interval
if row[0] >= low and row[0] < high:
between.append(row)
elv = "elv" + str(i)
err = "err" + str(i)
domain = Domain([ContinuousVariable.make(err)],[ContinuousVariable.make(elv)])
data = Table.from_numpy(domain, numpy.array(between))
print("table number ", i)
print(data[:3])
Here's the output
But as you can see, these are separated tables being assigned every loop.
And I have to find a way to concatenate axis = 1 for these tables.
Even the source code for Orange3 forbids this for some reason.

I want create a list from values in a dataset based on a specific condition

I am working with a Dataset that contains the information of every March Madness game since 1985. I want to know which teams have won it all and how many times each.
I masked the main dataset and created a new one containing only information about the championship game. Now I am trying to create a loop that compares the scores from both teams that played in the championship game, detects the winner and adds that team to a list. This is how the dataset looks like: https://imgur.com/tXhPYSm
tourney = pd.read_csv('ncaa.csv')
champions = tourney.loc[tourney['Region Name'] == "Championship", ['Year','Seed','Score','Team','Team.1','Score.1','Seed.1']]
list_champs = []
for i in champions:
if champions['Score'] > champions['Score.1']:
list_champs.append(i['Team'])
else:
list_champs.append(i['Team.1'])
Why do you need to loop through the DataFrame?
Basic filtering should work well. Something like this:
champs1 = champions.loc[champions['Score'] > champions['Score.1'], 'Team']
champs2 = champions.loc[champions['Score'] < champions['Score.1'], 'Team.1']
list_champs = list(champs1) + list(champs2)
A minimalist change (not the most efficient) to get your code working:
tourney = pd.read_csv('ncaa.csv')
champions = tourney.loc[tourney['Region Name'] == "Championship", ['Year','Seed','Score','Team','Team.1','Score.1','Seed.1']]
list_champs = []
for row in champions.iterrows():
if row['Score'] > row['Score.1']:
list_champs.append(row['Team'])
else:
list_champs.append(row['Team.1'])
Otherwise, you could simply do:
df.apply(lambda row: row['Team'] if row['Score'] > row['Score.1'] else row['Team.1'], axis=1).values

How to compare these data sets from a csv? Python 2.7

I have a project where I'm trying to create a program that will take a csv data set from www.transtats.gov which is a data set for airline flights in the US. My goal is to find the flight from one airport to another that had the worst delays overall, meaning it is the "worst flight". So far I have this:
`import csv
with open('826766072_T_ONTIME.csv') as csv_infile: #import and open CSV
reader = csv.DictReader(csv_infile)
total_delay = 0
flight_count = 0
flight_numbers = []
delay_totals = []
dest_list = [] #create empty list of destinations
for row in reader:
if row['ORIGIN'] == 'BOS': #only take flights leaving BOS
if row['FL_NUM'] not in flight_numbers:
flight_numbers.append(row['FL_NUM'])
if row['DEST'] not in dest_list: #if the dest is not already in the list
dest_list.append(row['DEST']) #append the dest to dest_list
for number in flight_numbers:
for row in reader:
if row['ORIGIN'] == 'BOS': #for flights leaving BOS
if row['FL_NUM'] == number:
if float(row['CANCELLED']) < 1: #if the flight is not cancelled
if float(row['DEP_DELAY']) >= 0: #and the delay is greater or equal to 0 (some flights had negative delay?)
total_delay += float(row['DEP_DELAY']) #add time of delay to total delay
flight_count += 1 #add the flight to total flight count
for row in reader:
for number in flight_numbers:
delay_totals.append(sum(row['DEP_DELAY']))`
I was thinking that I could create a list of flight numbers and a list of the total delays from those flight numbers and compare the two and see which flight had the highest delay total. What is the best way to go about comparing the two lists?
I'm not sure if I understand you correctly, but I think you should use dict for this purpose, where key is a 'FL_NUM' and value is total delay.
In general I want to eliminate loops in Python code. For files that aren't massive I'll typically read through a data file once and build up some dicts that I can analyze at the end. The below code isn't tested because I don't have the original data but follows the general pattern I would use.
Since a flight is identified by the origin, destination, and flight number I would capture them as a tuple and use that as the key in my dict.
from collections import defaultdict
flight_delays = defaultdict(list) # look this up if you aren't familiar
for row in reader:
if row['ORIGIN'] == 'BOS': #only take flights leaving BOS
if row['CANCELLED'] > 0:
flight = (row['ORIGIN'], row['DEST'], row['FL_NUM'])
flight_delays[flight].append(float(row['DEP_DELAY']))
# Finished reading through data, now I want to calculate average delays
worst_flight = ""
worst_delay = 0
for flight, delays in flight_delays.items():
average_delay = sum(delays) / len(delays)
if average_delay > worst_delay:
worst_flight = flight[0] + " to " + flight[1] + " on FL#" + flight[2]
worst_delay = average_delay
A very simple solution would be. Adding two new variables:
max_delay = 0
delay_flight = 0
# Change: if float(row['DEP_DELAY']) >= 0: FOR:
if float(row['DEP_DELAY']) > max_delay:
max_delay = float(row['DEP_DELAY'])
delay_flight = #save the row number or flight number for reference.

Categories