I have a large dataset (circa. 200,000 rows x 30 columns) as a CSV. I need to use pandas to pre-process this data. I have included a dummy dataset below to help visualise the problem.
data = {'Batsman':['Tom', 'Nick', 'Nick', 'Nick', 'Tom', 'Nick', 'Nick', 'Pete', 'Pete'],
'Outcome':[1, 0, 1, 'Wide', 'Out', 4, 1, 2, 0],
'Bowler':['Bill', 'Bill', 'Bill', 'Bill', 'Bill', 'Bill', 'Bill', 'Ben', 'Ben']
}
df = pd.DataFrame(data)
df
The goal is to have individual columns that show the probability of each outcome for a batsman & bowler. By way of an example from the dummy dataset, Tom would have a 50% chance of an outcome of '1' or 'Out'
This is calculated by:
Batsman column - The total number of rows with batsman 'X';
Outcome column - The total number of outcomes with 'X';
Point 2. / Point 1. to determine the probability of each outcome;
Repeat the above to determine the Bowler probabilities
The final dataframe from the dummy data should look similar to:
data = {'Batsman':['Tom', 'Nick', 'Nick', 'Nick', 'Tom', 'Nick', 'Nick', 'Pete', 'Pete'],
'zero_prob_bat':[0,0.4,0.4,0.4,0,0.4,0.4,0.5,0.5],
'one_prob_bat':[0.5,0.4,0.4,0.4,0.5,0.4,0.4,0,0],
'two_prob_bat':[0,0,0,0,0,0,0,0.5,0.5],
'three_prob_bat':[0,0,0,0,0,0,0,0,0],
'four_prob_bat':[0,0.2,0.2,0.2,0,0.2,0.2,0,0],
'six_prob_bat':[0,0,0,0,0,0,0,0,0],
'out_prob_bat':[0.5,0,0,0,0.5,0,0,0,0],
'Bowler':['Bill', 'Bill', 'Bill', 'Bill', 'Bill', 'Bill', 'Bill', 'Ben', 'Ben'],
'zero_prob_bowl':[0.1428,0.1428,0.1428,0.1428,0.1428,0.1428,0.1428,0.5,0.5],
'one_prob_bowl':[0.4285,0.4285,0.4285,0.4285,0.4285,0.4285,0.4285,0,0],
'two_prob_bowl':[0,0,0,0,0,0,0,0.5,0.5],
'three_prob_bowl':[0,0,0,0,0,0,0,0,0],
'four_prob_bowl':[0.1428,0.1428,0.1428,0.1428,0.1428,0.1428,0.1428,0,0],
'six_prob_bowl':[0.1428,0.1428,0.1428,0.1428,0.1428,0.1428,0.1428,0,0],
'out_prob_bowl':[0.1428,0.1428,0.1428,0.1428,0.1428,0.1428,0.1428,0,0],
'Outcome':[1, 0, 1, 'Wide', 'Out', 4, 1, 2, 0]
}
One issue is that with my original dataset there are over 600 unique names. I could manually .groupby each unique name in the batsman/bowler columns, but this is not a scalable solution as new names will continually be added.
I am tempted to:-
.count the number of instances of each unique name for batsman/bowler;
.count the number of different outcomes for each unique batsman/bowler;
Perform a lookup to match the probability next to each batsman/bowler;
However, I am cautious about implementing a lookup function as detailed in the answer here due to my dataset size which will continuously grow. In the past this has also created numerous issues when I have worked with excel/CSVs so I do not want to fall into any similar traps.
If someone could explain how they would go about solving this problem, so that I have something to aim towards, then it would be much appreciated.
Not sure how well this scales with your actual dataset, but I find it hard to think of a better solution than using groupby on the "Batsman" column and then value_counts on the grouped "Outcome" column. Example:
import pandas as pd
data = {'Batsman':['Tom', 'Nick', 'Nick', 'Nick', 'Tom', 'Nick', 'Nick', 'Pete', 'Pete'],
'Outcome':[1, 0, 1, 'Wide', 'Out', 4, 1, 2, 0],
'Bowler':['Bill', 'Bill', 'Bill', 'Bill', 'Bill', 'Bill', 'Bill', 'Ben', 'Ben']
}
df = pd.DataFrame(data)
# normalize=True returns proportions within each batsman instead of raw counts
grouped_data = df.groupby('Batsman')['Outcome'].value_counts(normalize=True)
print(grouped_data)
Output:
Batsman Outcome
Nick 1 0.4
0 0.2
4 0.2
Wide 0.2
Pete 0 0.5
2 0.5
Tom 1 0.5
Out 0.5
Name: Outcome, dtype: float64
Note that we did not need to groupby over each unique name manually, since groupby already does that for us.
The same logic can be applied to the "Bowler" column by simply replacing the "Batsman" string in the groupby call.
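One way to get from these grouped proportions to the wide, per-row layout in the question is sketched below. It assumes pd.crosstab with normalize='index' is acceptable, and it casts outcomes to strings so that mixed values such as 1 and 'Wide' can share one column axis. The helper name outcome_probs and the column naming ('1_prob_bat' rather than 'one_prob_bat') are illustrative choices, not part of the original posts.

```python
import pandas as pd

data = {'Batsman': ['Tom', 'Nick', 'Nick', 'Nick', 'Tom', 'Nick', 'Nick', 'Pete', 'Pete'],
        'Outcome': [1, 0, 1, 'Wide', 'Out', 4, 1, 2, 0],
        'Bowler': ['Bill', 'Bill', 'Bill', 'Bill', 'Bill', 'Bill', 'Bill', 'Ben', 'Ben']}
df = pd.DataFrame(data)


def outcome_probs(frame, key, suffix):
    """Hypothetical helper: one row per unique name in `key`, one column per outcome."""
    # normalize='index' divides each row by its total, giving per-name proportions.
    probs = pd.crosstab(frame[key], frame['Outcome'].astype(str), normalize='index')
    probs.columns = [f"{outcome}_prob_{suffix}" for outcome in probs.columns]
    return probs


bat = outcome_probs(df, 'Batsman', 'bat')
bowl = outcome_probs(df, 'Bowler', 'bowl')

# Attach the per-name probabilities to every original row by joining on the name.
result = df.join(bat, on='Batsman').join(bowl, on='Bowler')
print(result)
```

Because the join is driven by whatever names appear in the data, new batsmen or bowlers need no code changes. Outcomes that never occur in the data (for example 3 or 6 here) get no column in this sketch.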
I think this answers your question...
import pandas as pd
import numpy as np
data = {'Batsman':['Tom', 'Nick', 'Nick', 'Nick', 'Tom', 'Nick', 'Nick', 'Pete', 'Pete'],
'Outcome':[1, 0, 1, 'Wide', 'Out', 4, 1, 2, 0],
'Bowler':['Bill', 'Bill', 'Bill', 'Bill', 'Bill', 'Bill', 'Bill', 'Ben', 'Ben']
}
df = pd.DataFrame(data)
display(df)
batsman = df['Batsman'].unique()
bowler = df['Bowler'].unique()
print(sorted(batsman))
print(sorted(bowler))
final_df = pd.DataFrame()
for man in batsman:
    df1 = df[df['Batsman'] == man]
    count_man = len(df1)
    outcome = df['Outcome'].unique()
    count_outcome = len(outcome)
    # rows for this batsman divided by the number of distinct outcomes in the whole dataset
    batsman_prob = np.array(count_man/count_outcome)
    batsman_df = pd.DataFrame(data=[batsman_prob], columns=[man], index=['Batsman'])
    final_df = pd.concat([final_df, batsman_df], axis=1)
for man in bowler:
    df1 = df[df['Bowler'] == man]
    count_man = len(df1)
    outcome = df['Outcome'].unique()
    count_outcome = len(outcome)
    # rows for this bowler divided by the number of distinct outcomes in the whole dataset
    bowler_prob = np.array(count_man/count_outcome)
    bowler_df = pd.DataFrame(data=[bowler_prob], columns=[man], index=['Bowler'])
    final_df = pd.concat([final_df, bowler_df], axis=1)
# display() is available in Jupyter/IPython; use print() elsewhere
display(final_df)
Here is the output:
Tom Nick Pete Bill Ben
Batsman 0.333333 0.833333 0.333333 NaN NaN
Bowler NaN NaN NaN 1.166667 0.333333
In the DataFrame below, I want to rearrange the nested columns - i.e. to have 'region_sea' appearing before 'region_inland'
df = pd.DataFrame( {'state': ['WA', 'CA', 'NY', 'NY', 'CA', 'CA', 'WA' ]
, 'region': ['region_sea', 'region_inland', 'region_sea', 'region_inland', 'region_sea', 'region_sea', 'region_inland',]
, 'count': [1, 3, 4, 6, 7, 8, 4]
, 'income': [100, 200, 300, 400, 600, 400, 300]
}
)
df = df.pivot_table(index='state', columns='region', values=['count', 'income'], aggfunc={'count': 'sum', 'income': 'mean'})
df
I tried the code below but it's not working...any idea how to do this? Thanks
df[['count']]['region_sea', 'region_inland']
You can use sort_index to sort the columns. However, because the columns are nested, sorting on level 0 in descending order also reorders income and count.
df.sort_index(axis='columns', level=0, ascending=False, inplace=True)
Alternatively, sort on the 'region' level; the columns are then ordered by region first, so count and income no longer share a common header.
df.sort_index(axis='columns', level='region', ascending=False, inplace=True)
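If neither sort order gives the layout you want, a more explicit alternative is to spell out the target column order and reindex to it. This is only a sketch: the names pivoted, wanted and reordered are illustrative, and it assumes you know the level values up front.

```python
import pandas as pd

df = pd.DataFrame({
    'state': ['WA', 'CA', 'NY', 'NY', 'CA', 'CA', 'WA'],
    'region': ['region_sea', 'region_inland', 'region_sea', 'region_inland',
               'region_sea', 'region_sea', 'region_inland'],
    'count': [1, 3, 4, 6, 7, 8, 4],
    'income': [100, 200, 300, 400, 600, 400, 300],
})
pivoted = df.pivot_table(index='state', columns='region', values=['count', 'income'],
                         aggfunc={'count': 'sum', 'income': 'mean'})

# Build the desired (measure, region) order explicitly, keeping the level names.
wanted = pd.MultiIndex.from_product(
    [['count', 'income'], ['region_sea', 'region_inland']],
    names=pivoted.columns.names,
)
reordered = pivoted.reindex(columns=wanted)
print(reordered)
```

This keeps count and income under their own headers while putting region_sea before region_inland within each.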
I have a column in a dataset. I need to compare each value from that column to a list. After comparison, if it satisfies a condition, the value of another column should change.
for example,
List- james, michael, clara
According to the code, if a name in col A is in the list, col B should be 1, else 0.
How to solve this in python
Change B column where value A is in List
Using the loc operator you can easily select the rows where the item in the A column is in your List, and change the B column of these rows.
df.loc[(df["A"].isin(List)), "B"] = 1
Use fillna (a DataFrame method, not a NumPy function) to fill empty cells with zeros.
df.fillna(0, inplace=True)
Full Code
import numpy as np
import pandas as pd
names = ['james', 'randy', 'mona', 'lisa', 'paul', 'clara']
List = ["james", "michael", "clara"]
df = pd.DataFrame(data=names, columns=['A'])
# start with B empty, flag the matches, then zero out the rest
df["B"] = np.nan
df.loc[(df["A"].isin(List)), "B"] = 1
df.fillna(0, inplace=True)
This would be a good time to use np.where()
import pandas as pd
import numpy as np
name_list = ['James', 'Sally', 'Sarah', 'John']
df = pd.DataFrame({
'Names' : ['James', 'Roberts', 'Stephen', 'Hannah', 'John', 'Sally']
})
df['ColumnB'] = np.where(df['Names'].isin(name_list), 1, 0)
df
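As a further variation on the same idea, the boolean mask returned by isin can be cast straight to integers, which avoids both the NaN placeholder and np.where. This is a minimal sketch; the name targets is illustrative and the data is taken from the first answer.

```python
import pandas as pd

names = ['james', 'randy', 'mona', 'lisa', 'paul', 'clara']
targets = ['james', 'michael', 'clara']

df = pd.DataFrame({'A': names})
# isin() yields True/False per row; astype(int) turns that into 1/0.
df['B'] = df['A'].isin(targets).astype(int)
print(df)
```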
I have 3 dataframes with the same format.
Then I combine them horizontally and get
I would like to add a row to denote the name of each dataframe, i.e.,
I got the form above by copying the data to MS Excel and manually adding the row. Is there any way to do this directly for display in Python?
import pandas as pd
data = {'Name': ['Tom', 'Joseph'], 'Age': [20, 21]}
df1 = pd.DataFrame(data)
data = {'Name': ['John', 'Kim'], 'Age': [15, 17]}
df2 = pd.DataFrame(data)
data = {'Name': ['Paul', 'Dood'], 'Age': [10, 5]}
df3 = pd.DataFrame(data)
pd.concat([df1, df2, df3], axis = 1)
Use the keys parameter in concat:
df = pd.concat([df1, df2, df3], axis = 1, keys=('df1','df2','df3'))
print (df)
df1 df2 df3
Name Age Name Age Name Age
0 Tom 20 John 15 Paul 10
1 Joseph 21 Kim 17 Dood 5
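A closely related form, sketched here, passes a dict to concat; the dict keys are then used as the top-level labels, so the names and frames stay paired in one place. The variable names frames and combined are illustrative.

```python
import pandas as pd

frames = {
    'df1': pd.DataFrame({'Name': ['Tom', 'Joseph'], 'Age': [20, 21]}),
    'df2': pd.DataFrame({'Name': ['John', 'Kim'], 'Age': [15, 17]}),
    'df3': pd.DataFrame({'Name': ['Paul', 'Dood'], 'Age': [10, 5]}),
}

# With a mapping, concat uses the keys as the outer column level.
combined = pd.concat(frames, axis=1)
print(combined)

# A single frame's columns can be pulled back out by its top-level label.
print(combined['df2'])
```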
The row is actually a first-level column. You can get it by adding this level to each dataframe before concatenating:
for df_name, df in zip(("df1", "df2", "df3"), (df1, df2, df3)):
    # prefix every existing column with the dataframe's name
    df.columns = pd.MultiIndex.from_tuples([(df_name, col) for col in df])
pd.concat([df1, df2, df3], axis = 1)
Very niche case, but you can use MultiIndex objects to build what you want.
What you need is a two-level header to display the information as you want. A MultiIndex on the columns can accomplish that.
To understand the code better, read about MultiIndex objects in pandas. You basically create the labels (called levels) and then use integer positions pointing to those labels (called codes) to build the object.
Here is how to do it:
data = {'Name': ['Tom', 'Joseph'], 'Age': [20, 21]}
df1 = pd.DataFrame(data)
data = {'Name': ['John', 'Kim'], 'Age': [15, 17]}
df2 = pd.DataFrame(data)
data = {'Name': ['Paul', 'Dood'], 'Age': [10, 5]}
df3 = pd.DataFrame(data)
df1.columns = pd.MultiIndex(levels=[['df1', 'df2', 'df3'], ['Name', 'Age']], codes=[[0, 0], [0, 1]])
df2.columns = pd.MultiIndex(levels=[['df1', 'df2', 'df3'], ['Name', 'Age']], codes=[[1, 1], [0, 1]])
df3.columns = pd.MultiIndex(levels=[['df1', 'df2', 'df3'], ['Name', 'Age']], codes=[[2, 2], [0, 1]])
And after the concatenation, you will have:
pd.concat([df1, df2, df3], axis = 1)
I have a challenge in a pandas dataframe.
Basically, I have 2 columns. In the first one, I have 3 different classes and in the second a list of students that are enrolled in the subject. The example is as follow:
df = pd.DataFrame({'Class': ['1A', '2B', '2C'],
'Students': [['Alice', 'Philips', 'John'],
['Philips', 'John', 'Anna', 'William'],
['Arthur', 'Alice', 'Anna', 'William']]
})
I would like to have a second dataframe with the number of students that are present in more than one class. In other words, the size of the intersection between each pair of classes, as follows:
result= pd.DataFrame({'Comparison': ['1A-2B','1A-2C', '2B-2C'],
'Intersection size': [2, 1, 2]})
Thank you for your help and attention!
import pandas as pd
import itertools
df = pd.DataFrame({'Class': ['1A', '2B', '2C'],
'Students': [['Alice', 'Philips', 'John'],
['Philips', 'John', 'Anna', 'William'],
['Arthur', 'Alice', 'Anna', 'William']]
})
combination: to generate all pairwise combinations of the classes, use itertools.
col=list(itertools.combinations(df.Class,2))
> col Out[68]:
> [('1A','2B'), ('1A', '2C'), ('2B', '2C')]
explode: to form a structured dataframe
df1=df.explode('Students')
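Then loop over the pairs:
d={}
for c in col:
    # keep only the rows belonging to the two classes being compared
    tmp=df1[(df1['Class']==c[0]) | (df1['Class']==c[1])]
    # rows minus unique students = students that appear in both classes
    count=len(tmp)-tmp.Students.nunique()
    d[str(c[0])+'-'+str(c[1])]=count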
The dictionary d has what you want:
d
Out[71]: {'1A-2B': 2, '1A-2C': 1, '2B-2C': 2}
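The same pairwise counts can also be obtained with plain Python sets, which skips the explode step. The sketch below assumes each class's student list has no meaningful duplicates; the names students, rows and result are illustrative.

```python
import itertools

import pandas as pd

df = pd.DataFrame({'Class': ['1A', '2B', '2C'],
                   'Students': [['Alice', 'Philips', 'John'],
                                ['Philips', 'John', 'Anna', 'William'],
                                ['Arthur', 'Alice', 'Anna', 'William']]})

# Map each class to the set of its students.
students = {cls: set(names) for cls, names in zip(df['Class'], df['Students'])}

# For every pair of classes, count the students they have in common.
rows = [
    {'Comparison': f'{a}-{b}', 'Intersection size': len(students[a] & students[b])}
    for a, b in itertools.combinations(df['Class'], 2)
]
result = pd.DataFrame(rows)
print(result)
```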
You can try the following:
import pandas as pd
df = pd.DataFrame({
'Class': ['1A', '2B', '2C'],
'Students': [['Alice', 'Philips', 'John'],
['Philips', 'John', 'Anna', 'William'],
['Arthur', 'Alice', 'Anna', 'William']]
})
ddf = df.explode("Students")
ddf = pd.crosstab(ddf["Students"], ddf["Class"])
A = ddf.values
result = pd.DataFrame(A.T @ A, index=ddf.columns, columns=ddf.columns)
print(result)
It gives:
Class 1A 2B 2C
Class
1A 3 2 1
2B 2 4 2
2C 1 2 4
The intersection of every row and column gives the number of students taking both classes. Diagonal entries give numbers of students in each individual class.
If you want to get a dataframe listing only combinations of different classes with non-zero intersection values, then the following should work:
import pandas as pd
import numpy as np
df = pd.DataFrame({
'Class': ['1A', '2B', '2C'],
'Students': [['Alice', 'Philips', 'John'],
['Philips', 'John', 'Anna', 'William'],
['Arthur', 'Alice', 'Anna', 'William']]
})
ddf = df.explode("Students")
ddf = pd.crosstab(ddf["Students"], ddf["Class"])
A = ddf.values
result = pd.DataFrame(np.tril(A.T @ A, k=-1).T,
                      index=ddf.columns,
                      columns=ddf.columns).stack()
result.index = result.index.map(lambda x: f"{x[0]}-{x[1]}")
result[result > 0]
It gives:
1A-2B 2.0
1A-2C 1.0
2B-2C 2.0
I have a list of dictionaries of the following form:
lst = [{"Name":'Nick','Hour':0,'Value':2.75},
{"Name":'Sam','Hour':1,'Value':7.0},
{"Name":'Nick','Hour':0,'Value':2.21},
{'Name':'Val',"Hour":1,'Value':10.1},
{'Name':'Nick','Hour':1,'Value':2.1},
{'Name':'Val',"Hour":1,'Value':11},]
I want to be able to sum all values for a name for a particular hour, e.g. if Name == Nick and Hour == 0, I want value to give me the sum of all values meeting the condition. 2.75 + 2.21, according to the piece above.
I have already tried the following but it doesn't help me out with both conditions.
import collections
finalList = collections.defaultdict(float)
for info in lst:
    finalList[info['Name']] += info['Value']
finalList = [{'Name': c, 'Value': finalList[c]} for c in finalList]
This sums up all the values for a particular Name, not checking if the Hour was the same. How can I incorporate that condition into my code as well?
My expected output :
finalList = [{"Name":'Nick','Hour':0,'Value':4.96},
{"Name":'Sam','Hour':1,'Value':7.0},
{'Name':'Val',"Hour":1,'Value':21.1},
{'Name':'Nick','Hour':1,'Value':2.1}...]
Consider using the pandas module; it's very convenient for such data sets:
import pandas as pd
In [109]: lst
Out[109]:
[{'Hour': 0, 'Name': 'Nick', 'Value': 2.75},
{'Hour': 1, 'Name': 'Sam', 'Value': 7.0},
{'Hour': 0, 'Name': 'Nick', 'Value': 2.21},
{'Hour': 1, 'Name': 'Val', 'Value': 10.1},
{'Hour': 1, 'Name': 'Nick', 'Value': 2.1}]
In [110]: df = pd.DataFrame(lst)
In [111]: df
Out[111]:
Hour Name Value
0 0 Nick 2.75
1 1 Sam 7.00
2 0 Nick 2.21
3 1 Val 10.10
4 1 Nick 2.10
In [123]: df.groupby(['Name','Hour']).sum().reset_index()
Out[123]:
Name Hour Value
0 Nick 0 4.96
1 Nick 1 2.10
2 Sam 1 7.00
3 Val 1 10.10
export it to CSV:
df.groupby(['Name','Hour']).sum().reset_index().to_csv('/path/to/file.csv', index=False)
result:
Name,Hour,Value
Nick,0,4.96
Nick,1,2.1
Sam,1,7.0
Val,1,10.1
if you want to have it as a dictionary:
In [125]: df.groupby(['Name','Hour']).sum().reset_index().to_dict('records')
Out[125]:
[{'Hour': 0, 'Name': 'Nick', 'Value': 4.96},
{'Hour': 1, 'Name': 'Nick', 'Value': 2.1},
{'Hour': 1, 'Name': 'Sam', 'Value': 7.0},
{'Hour': 1, 'Name': 'Val', 'Value': 10.1}]
you can do many fancy things using pandas:
In [112]: df.loc[(df.Name == 'Nick') & (df.Hour == 0), 'Value'].sum()
Out[112]: 4.96
In [121]: df.groupby('Name')['Value'].agg(['sum','mean'])
Out[121]:
sum mean
Name
Nick 7.06 2.353333
Sam 7.00 7.000000
Val 10.10 10.100000
[{'Name':name, 'Hour':hour, 'Value': sum(d['Value'] for d in lst if d['Name']==name and d['Hour']==hour)} for hour in hours for name in names]
if you don't already have all names and hours in lists (or sets) you can get them like so:
names = {d['Name'] for d in lst}
hours= {d['Hour'] for d in lst}
You can use any (hashable) object as a key for a python dictionary, so just use a tuple containing Name and Hour as the key:
from collections import defaultdict
d = defaultdict(float)
for item in lst:
    d[(item['Name'], item['Hour'])] += item['Value']
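To get from the tuple-keyed dictionary back to the list-of-dicts shape asked for in the question, the keys can be unpacked in a comprehension. This is a minimal, self-contained sketch using the question's data; the names totals and final_list are illustrative.

```python
from collections import defaultdict

lst = [{'Name': 'Nick', 'Hour': 0, 'Value': 2.75},
       {'Name': 'Sam', 'Hour': 1, 'Value': 7.0},
       {'Name': 'Nick', 'Hour': 0, 'Value': 2.21},
       {'Name': 'Val', 'Hour': 1, 'Value': 10.1},
       {'Name': 'Nick', 'Hour': 1, 'Value': 2.1},
       {'Name': 'Val', 'Hour': 1, 'Value': 11}]

# Accumulate values per (Name, Hour) pair.
totals = defaultdict(float)
for item in lst:
    totals[(item['Name'], item['Hour'])] += item['Value']

# Unpack each (Name, Hour) key back into a flat dictionary.
final_list = [{'Name': name, 'Hour': hour, 'Value': value}
              for (name, hour), value in totals.items()]
print(final_list)
```

Since dictionaries preserve insertion order, the entries come out in the order each (Name, Hour) pair was first seen. Floating-point sums may show small rounding artifacts, so round the values for display if needed.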