I have the following data set. I want to create a dataframe that contains all teams, along with the number of games played, wins, losses, draws, and average point differential in 2017 (Y = 17).
Date Y HomeTeam AwayTeam HomePoints AwayPoints
2014-08-16 14 Arsenal Crystal Palace 2 1
2014-08-16 14 Leicester Everton 2 2
2014-08-16 14 Man United Swansea 1 2
2014-08-16 14 QPR Hull 0 1
2014-08-16 14 Stoke Aston Villa 0 1
I wrote the following code:
df17 = df[df['Y'] == 17]
df17['differential'] = abs(df17['HomePoints'] - df17['AwayPoints'])
df17['home_wins'] = np.where(df17['HomePoints'] > df17['AwayPoints'], 1, 0)
df17['home_losses'] = np.where(df17['HomePoints'] < df17['AwayPoints'], 1, 0)
df17['home_ties'] = np.where(df17['HomePoints'] == df17['AwayPoints'], 1, 0)
df17['game_count'] = 1
df17.groupby("HomeTeam").agg({"differential": np.mean, "home_wins": np.sum, "home_losses": np.sum, "home_ties": np.sum, "game_count": np.sum}).sort_values(["differential"], ascending = False)
But I don't think this is correct, as I'm only accounting for the home team. Does someone have a cleaner method?
Melting the dataframe gives us two rows per original row: one for the HomeTeam and one for the AwayTeam.
Please find the documentation for the melt method here: https://pandas.pydata.org/docs/reference/api/pandas.melt.html
# keep only the 2017 games
df = df[df['Y'] == 17]
# melt so each game contributes one row for the home team and one for the away team
df = pd.melt(df, id_vars=['Date', 'Y', 'HomePoints', 'AwayPoints'],
             value_vars=['HomeTeam', 'AwayTeam'])
df = df.rename({'value': 'Team', 'variable': 'Home/Away'}, axis=1)
# signed differential: positive when the team on this row outscored its opponent
df['Differential'] = df['Home/Away'].replace({'HomeTeam': 1, 'AwayTeam': -1}) * (df['HomePoints'] - df['AwayPoints'])
def count_wins(x):
    return (x > 0).sum()
def count_losses(x):
    return (x < 0).sum()
def count_draws(x):
    return (x == 0).sum()
df = df.groupby('Team')['Differential'].agg(['count', count_wins, count_losses, count_draws, 'mean'])
df = df.rename({'count': 'Number of games', 'count_wins': 'Wins', 'count_losses': 'Losses',
                'count_draws': 'Draws', 'mean': 'Average differential'}, axis=1)
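As a small follow-up, the trailing agg plus rename pair can be collapsed using named aggregation (available since pandas 0.25); this is a minor stylistic variant of the same computation, not a change to the method:

# equivalent to the final agg + rename above, applied to the melted frame
summary = df.groupby('Team')['Differential'].agg(
    **{'Number of games': 'count',
       'Wins': count_wins,
       'Losses': count_losses,
       'Draws': count_draws,
       'Average differential': 'mean'}
)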
I have a dataframe with six columns that are coded 1 for yes and 0 for no. There is also a column for year. The output I need is the conditional probability of each pair of columns being coded 1, by year. I tried incorporating some suggestions from this post: Pandas - Conditional Probability of a given specific b, but with no luck. Other things I came up with were inefficient. I am really struggling to find the best way to go about this.
To get your wide-formatted data into the long format of the linked post, run melt and then a self-merge by year for all pairwise combinations (avoiding identical keys and reverse duplicates). Then calculate as the linked post shows:
# wide -> long: one row per (Year, Key, Value)
long_df = current_df.melt(
    id_vars = "Year",
    var_name = "Key",
    value_name = "Value"
)
# self-merge by year for all pairwise key combinations;
# "Key1 < Key2" drops same-key pairs and reverse duplicates
pairwise_df = (
    long_df.merge(
        long_df,
        on = "Year",
        suffixes = ["1", "2"]
    ).query("Key1 < Key2")
    .assign(
        Both_Occur = lambda x: np.where(
            (x["Value1"] == 1) & (x["Value2"] == 1),
            1,
            0
        )
    )
)
# per (Year, Key1, Key2): share of rows where both keys are coded 1
prob_df = (
    (pairwise_df.groupby(["Year", "Key1", "Key2"])["Both_Occur"].value_counts() /
     pairwise_df.groupby(["Year", "Key1", "Key2"])["Both_Occur"].count()
    ).to_frame(name = "Prob")
    .reset_index()
    .query("Both_Occur == 1")
    .drop(["Both_Occur"], axis = "columns")
)
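Since Both_Occur is a 0/1 flag, the value_counts/count ratio can also be written as a plain mean. One small behavioral difference: this version keeps pairs whose probability is exactly 0, which the query-based version drops:

prob_df = (
    pairwise_df.groupby(["Year", "Key1", "Key2"])["Both_Occur"]
    .mean()  # mean of a 0/1 flag = share of rows where both keys are 1
    .to_frame(name="Prob")
    .reset_index()
)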
To demonstrate with reproducible data:
import numpy as np
import pandas as pd
np.random.seed(112621)
random_df = pd.DataFrame({
'At least one tree': np.random.randint(0, 2, 100),
'At least two trees': np.random.randint(0, 2, 100),
'Clouds': np.random.randint(0, 2, 100),
'Grass': np.random.randint(0, 2, 100),
'At least one mountain': np.random.randint(0, 2, 100),
'Lake': np.random.randint(0, 2, 100),
'Year': np.random.randint(1983, 1995, 100)
})
# ...same code as above...
prob_df
Year Key1 Key2 Prob
0 1983 At least one mountain At least one tree 0.555556
2 1983 At least one mountain At least two trees 0.555556
5 1983 At least one mountain Clouds 0.416667
6 1983 At least one mountain Grass 0.555556
8 1983 At least one mountain Lake 0.555556
.. ... ... ... ...
351 1994 At least two trees Grass 0.490000
353 1994 At least two trees Lake 0.420000
355 1994 Clouds Grass 0.280000
357 1994 Clouds Lake 0.240000
359 1994 Grass Lake 0.420000
I want this dataframe as the outcome:
inv_data = {
"vendor":['A','B','C','D','E'],
"country":['Spain','Spain','Germany','Italy','Italy'],
"yeardum":['2015','2020','2014','2016','2019'],
"sales_year_data":['15','205','24','920','1310'],
"country_image_data":['2','5','-6','7','-1'],
}
df_inv = pd.DataFrame(inv_data)
The data in the column "sales_year_data" of df_inv comes from df1:
sales_year_data = {
"country":['Spain','France','Germany','Belgium','Italy'],
"2014":['45','202','24','216','219'],
"2015":['15','55','214','2016','209'],
"2016":['615','2333','205','207','920'],
"2017":['1215','255','234','2116','101'],
"2018":['415','1320','214','2516','2019'],
"2019":['215','220','5614','416','1310'],
"2020":['205','202','44','296','2011'],
}
df1 = pd.DataFrame(sales_year_data)
As you can see in the column "sales_year_data" of df_inv, the number 15 is the intersection in df1 between year 2015 and Spain, the number 205 is the intersection between Spain and 2020, 24 is the intersection between Germany and 2014, and so on.
The data in the column "country_image_data" of df_inv comes from df2:
country_change_data = {
"country":['Spain','Spain','Germany','Italy','Italy'],
"2014":['4','2','-6','6','9'],
"2015":['2','5','-5','2','3'],
"2016":['5','3','5','7','9'],
"2017":['8','7','5','6','1'],
"2018":['5','1','4','6','2'],
"2019":['1','2','4','6','-1'],
"2020":['5','2','4','6','2'],
}
df2 = pd.DataFrame(country_change_data)
As you can see in the column "country_image_data" of df_inv, the number 2 is the intersection in df2 between year 2015 and Spain, the number 5 is the intersection between Spain and 2020, -6 is the intersection between Germany and 2014, and so on.
If my original dataframe is:
inv = {
"vendor":['A','B','C','D','E'],
"country":['Spain','Spain','Germany','Italy','Italy'],
"yeardum":['2015','2020','2014','2016','2019'],
}
df0 = pd.DataFrame(inv)
How could I automate the search across df1 and df2 for the intersections of interest, building df_inv starting from df0?
This does it:
sales_counters = {}
country_counters = {}
new_df_data = []
# walk df0 row by row; the per-country counters pick successive rows of
# df1/df2 when the same country appears more than once
for _, row in df0.iterrows():
    c = row['country']
    y = row['yeardum']
    sales_idx = sales_counters[c] = sales_counters.get(c, -1) + 1
    country_idx = country_counters[c] = country_counters.get(c, -1) + 1
    d1 = df1[df1['country'] == c]
    d2 = df2[df2['country'] == c]
    # clamp the index so a country with fewer rows reuses its last row
    sales_year = d1.iloc[min(sales_idx, d1.shape[0]-1)][y]
    country_image = d2.iloc[min(country_idx, d2.shape[0]-1)][y]
    new_df_data.append([sales_year, country_image])
df0 = pd.concat([df0, pd.DataFrame(new_df_data)], axis=1).rename(
    {0: 'sales_year_data', 1: 'country_image_data'}, axis=1)
Test:
>>> df0
vendor country yeardum sales_year_data country_image_data
0 A Spain 2015 15 2
1 B Spain 2020 205 2
2 C Germany 2014 24 -6
3 D Italy 2016 920 7
4 E Italy 2019 1310 -1
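For a vectorized alternative, both lookup tables can be melted to one row per (country, year) and joined on those keys. This is only a sketch: df2 contains duplicate country rows, which the counter logic above resolves by position, so the drop_duplicates here is an assumption that the first row per country should win:

import pandas as pd

# reshape the lookup tables to long format: one row per (country, year)
df1_long = df1.melt(id_vars='country', var_name='yeardum', value_name='sales_year_data')
df2_long = (df2.melt(id_vars='country', var_name='yeardum', value_name='country_image_data')
               .drop_duplicates(subset=['country', 'yeardum']))  # assumption: first row per country wins

# join both value columns onto df0 by country and year
df_out = (df0.merge(df1_long, on=['country', 'yeardum'], how='left')
             .merge(df2_long, on=['country', 'yeardum'], how='left'))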
I have a dataframe of people with Age as a column. I would like to match this age to a group, i.e. Baby=0-2 years old, Child=3-12 years old, Young=13-18 years old, Young Adult=19-30 years old, Adult=31-50 years old, Senior Adult=51-65 years old.
I created the lists that define these year groups, e.g. Adult=list(range(31,51)) etc.
How do I match the name of the list 'Adult' to the dataframe by creating a new column?
A small note: the dataframe is made up of three columns: df['Name'], df['Country'], df['Age'].
Name Country Age
Anthony France 15
Albert Belgium 54
.
.
.
Zahra Tunisia 14
So I need to match the age column with lists that I already have. The output should look like:
Name Country Age Group
Anthony France 15 Young
Albert Belgium 54 Adult
.
.
.
Zahra Tunisia 14 Young
Thanks!
IIUC I would go with np.select:
import pandas as pd
import numpy as np
df = pd.DataFrame({'Age': [3, 20, 40]})
condlist = [df.Age.between(0,2),
df.Age.between(3,12),
df.Age.between(13,18),
df.Age.between(19,30),
df.Age.between(31,50),
df.Age.between(51,65)]
choicelist = ['Baby', 'Child', 'Young',
'Young Adult', 'Adult', 'Senior Adult']
df['Group'] = np.select(condlist, choicelist)
Output:
Age Group
0 3 Child
1 20 Young Adult
2 40 Adult
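Note that np.select falls back to its default value (0) for any age outside the listed ranges; passing an explicit default makes that case visible:

df['Group'] = np.select(condlist, choicelist, default='Unknown')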
Here's a way to do that using pd.cut:
df = pd.DataFrame({"person_id": range(25), "age": np.random.randint(0, 100, 25)})
print(df.head(10))
==>
person_id age
0 0 30
1 1 42
2 2 78
3 3 2
4 4 44
5 5 43
6 6 92
7 7 3
8 8 13
9 9 76
df["group"] = pd.cut(df.age, [0, 18, 50, 100], labels=["child", "adult", "senior"])
print(df.head(10))
==>
person_id age group
0 0 30 adult
1 1 42 adult
2 2 78 senior
3 3 2 child
4 4 44 adult
5 5 43 adult
6 6 92 senior
7 7 3 child
8 8 13 child
9 9 76 senior
Per your question, if you have a few lists (like the ones below) and would like to use them for binning, you can do:
# for example, these are the lists
Adult = list(range(18,50))
Child = list(range(0, 18))
Senior = list(range(50, 100))
# Creating bins out of the lists.
bins = [min(l) for l in [Child, Adult, Senior]]
bins.append(max([max(l) for l in [Child, Adult, Senior]]))
labels = ["Child", "Adult", "Senior"]
# using the bins (include_lowest so that age 0 falls into the first bin):
df["group"] = pd.cut(df.age, bins, labels=labels, include_lowest=True)
To make things clearer for beginners, you can define a function that returns the age group of each row, then use pandas.apply() to apply that function and create a 'Group' column:
import pandas as pd
def age(row):
a = row['Age']
if 0 < a <= 2:
return 'Baby'
elif 2 < a <= 12:
return 'Child'
elif 12 < a <= 18:
return 'Young'
elif 18 < a <= 30:
return 'Young Adult'
elif 30 < a <= 50:
return 'Adult'
elif 50 < a <= 65:
return 'Senior Adult'
df = pd.DataFrame({'Name':['Anthony','Albert','Zahra'],
'Country':['France','Belgium','Tunisia'],
'Age':[15,54,14]})
df['Group'] = df.apply(age, axis=1)
print(df)
Output:
Name Country Age Group
0 Anthony France 15 Young
1 Albert Belgium 54 Senior Adult
2 Zahra Tunisia 14 Young
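For completeness, the same six groups can also be expressed with pd.cut from the earlier answer; the bin edges below just restate the question's ranges:

import pandas as pd

df = pd.DataFrame({'Age': [15, 54, 14]})
# right-closed bins matching Baby 0-2, Child 3-12, Young 13-18,
# Young Adult 19-30, Adult 31-50, Senior Adult 51-65
bins = [0, 2, 12, 18, 30, 50, 65]
labels = ['Baby', 'Child', 'Young', 'Young Adult', 'Adult', 'Senior Adult']
df['Group'] = pd.cut(df['Age'], bins=bins, labels=labels, include_lowest=True)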
I have a dataframe as shown below
Token Label StartID EndID
0 Germany Country 0 2
1 Berlin Capital 6 9
2 Frankfurt City 15 18
3 four million Number 21 24
4 Sweden Country 26 27
5 United Kingdom Country 32 34
6 ten million Number 40 45
7 London Capital 50 55
I am trying to get rows based on a certain condition, i.e. associate the label Number with the closest Capital, i.e. Berlin:
3 four million Number 21 24 -> 1 Berlin Capital 6 9
or something like:
df[row3] -> df [row1]
A pseudo-logic:
First check for the rows with the label Number; the assumption is that the capital is always '2 rows' above or below and has the label Capital. Also, a Capital row is always located after a Country row.
What I have done until now:
columnsName =['Token', 'Label', 'StartID', 'EndID']
df = pd.read_csv('resources/testcsv.csv', index_col= 0, skip_blank_lines=True, header=0)
print(df)
key_number = 'Number'
df_with_number = (df[df['Label'].str.lower().str.contains(r"\b{}\b".format(key_number), regex=True, case=False)])
print(df_with_number)
key_capital = 'Capital'
df_with_capitals = (df[df['Label'].str.lower().str.contains(r"\b{}\b".format(key_capital), regex=True, case=False)])
print(df_with_capitals)
key_country = 'Country'
df_with_country = (df[df['Label'].str.lower().str.contains(r"\b{}\b".format(key_country), regex=True, case=False)])
print(df_with_country)
The logic is to compare the indexes and then make the possible relations,
i.e.
df[row3] -> [ df [row1], df[row7]]
You could use merge_asof with the parameter direction='nearest' (note that both frames must be sorted by the on key; the reset index already is), for example:
df_nb_cap = pd.merge_asof(df_with_number.reset_index(),
df_with_capitals.reset_index(),
on='index',
suffixes=('_nb', '_cap'), direction='nearest')
print (df_nb_cap)
index Token_nb Label_nb StartID_nb EndID_nb Token_cap Label_cap \
0 3 four_million Number 21 24 Berlin Capital
1 6 ten_million Number 40 45 London Capital
StartID_cap EndID_cap
0 6 9
1 50 55
from io import StringIO

import numpy as np
import pandas as pd

# adjusted sample data
s = """Token,Label,StartID,EndID
Germany,Country,0,2
Berlin,Capital,6,9
Frankfurt,City,15,18
four million,Number,21,24
Sweden,Country,26,27
United Kingdom,Country,32,34
ten million,Number,40,45
London,Capital,50,55
ten million,Number,40,45
ten million,Number,40,45"""
df = pd.read_csv(StringIO(s))
# create a mask for number where capital is 2 above or below
# and where country is three above number or one below number
mask = (df['Label'] == 'Number') & (((df['Label'].shift(2) == 'Capital') |
(df['Label'].shift(-2) == 'Capital')) &
(df['Label'].shift(3) == 'Country') |
(df['Label'].shift(-1) == 'Country'))
# create a mask for capital where number is 2 above or below
# and where country is one above capital
mask2 = (df['Label'] == 'Capital') & (((df['Label'].shift(2) == 'Number') |
(df['Label'].shift(-2) == 'Number')) &
(df['Label'].shift(1) == 'Country'))
# hstack your two masks and create a frame
new_df = pd.DataFrame(np.hstack([df[mask].to_numpy(), df[mask2].to_numpy()]))
print(new_df)
0 1 2 3 4 5 6 7
0 four million Number 21 24 Berlin Capital 6 9
I have a sample schema which consists of 12 columns, and each column has a certain set of categories. Now I need to simulate that data in a dataframe of around 1000 rows. How do I go about it?
I have used the code below to generate data for each column:
import random

Location = ['USA','India','Prague','Berlin','Dubai','Indonesia','Vienna']
Location = random.choice(Location)
Age = ['Under 18','Between 18 and 64','65 and older']
Age = random.choice(Age)
Gender = ['Female','Male','Other']
Gender = random.choice(Gender)
and so on
I need the output as below:
Location Age Gender
Dubai Under 18 Female
India 65 and older Male
.
.
.
.
You can create each column one by one using np.random.choice:
import numpy as np
import pandas as pd

# assumes Location, Age, Gender are the original category lists
df = pd.DataFrame()
N = 1000
df["Location"] = np.random.choice(Location, size=N)
df["Age"] = np.random.choice(Age, size=N)
df["Gender"] = np.random.choice(Gender, size=N)
Or do that using a list comprehension:
column_to_choice = {"Location": Location, "Age": Age, "Gender": Gender}
df = pd.DataFrame(
[np.random.choice(column_to_choice[c], N) for c in column_to_choice]
).T
df.columns = list(column_to_choice.keys())
Result:
>>> print(df.head())
Location Age Gender
0 India 65 and older Female
1 Berlin Between 18 and 64 Female
2 USA Between 18 and 64 Male
3 Indonesia Under 18 Male
4 Dubai Under 18 Other
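A compact variant builds every column in a single DataFrame constructor; the seeded default_rng is an assumption added for reproducibility:

import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # any fixed seed makes the sample reproducible
N = 1000
column_to_choice = {"Location": Location, "Age": Age, "Gender": Gender}
# one sampled column per entry in the mapping
df = pd.DataFrame({name: rng.choice(values, size=N)
                   for name, values in column_to_choice.items()})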
You can create a for loop for the number of rows you want in your dataframe and generate a list of dictionaries, then use the list of dictionaries to build the dataframe.
In [15]: list2 = []
In [16]: for i in range(5):
    ...:     loc = random.choice(Location)
    ...:     age = random.choice(Age)
    ...:     gen = random.choice(Gender)
    ...:     k = {'Location':loc, 'Age':age, 'Gender':gen}
    ...:     list2.append(k)
...:
In [17]: import pandas as pd
In [18]: df = pd.DataFrame(list2)
In [19]: df
Out[19]:
Age Gender Location
0 Between 18 and 64 Other Berlin
1 65 and older Other USA
2 65 and older Male Dubai
3 Between 18 and 64 Male Dubai
4 Between 18 and 64 Male Indonesia