Iterate through and overwrite specific values in a pandas dataframe - python

I have a large dataframe collating a bunch of basketball data (screenshot below). Every column to the right of Opp Lineup is a dummy variable indicating whether that player (named in the column header) is in the current lineup (the last part of the column name is the team name, which needs to be compared against the Opponent column so that two players with the same number and name on different teams don't get mixed up). I know several ways of iterating through a pandas dataframe (iterrows, itertuples, iteritems), but I don't know how to accomplish what I need, which is, for each row in each column:
Compare the team (columnname.split()[2:]) to the Opponent column (except for LSU players)
See if the name (columnname.split()[:2]) is in Opp Lineup or, for LSU players, lineup
If the above conditions are satisfied, replace that value with 1, otherwise leave it as 0
What is the best method for looping through the dataframe and accomplishing this task? Speed doesn't really matter in this instance. I understand all of the logic involved; I'm just not familiar enough with pandas to know how to loop through it, and the various things I've tried from Google aren't working.

Consider a reshape/pivot solution, since your data is in wide format but you need to compare values row-wise in long format. So, first melt your data so that all column headers become an actual column 'Player' and their corresponding values become 'IsInLineup'. Run your conditional comparison for the dummy values, and then pivot back to the original structure with players across the column headers. Of course, I do not have actual data to test this example fully.
# MELT
reshapedf = pd.melt(df, id_vars=['Opponent', 'Lineup', 'Minutes', 'Plus Minus',
                                 'Plus Minus Per Minute', 'Opp Lineup'],
                    var_name='Player', value_name='IsInLineup')

# APPLY FUNCTION (SPLITTING VALUE AND THEN JOINING FOR SUBSET STRING)
reshapedf['IsInLineup'] = reshapedf.apply(
    lambda row: (' '.join(row['Player'].split(' ')[:2]) in row['Opp Lineup'] and
                 ' '.join(row['Player'].split(' ')[2:]) in row['Opponent']) * 1,
    axis=1)

# PIVOT (UNMELT)
df2 = reshapedf.pivot_table(index=['Opponent', 'Lineup', 'Minutes', 'Plus Minus',
                                   'Plus Minus Per Minute', 'Opp Lineup'],
                            columns='Player').reset_index()
df2.columns = df2.columns.droplevel(0).rename(None)
df2.columns = df.columns
If the above lambda function looks a little complex, try the equivalent apply with a defined function:
# APPLY FUNCTION (SPLITTING VALUE AND THEN JOINING FOR SUBSET STRING)
def f(row):
    if (' '.join(row['Player'].split(' ')[:2]) in row['Opp Lineup'] and
            ' '.join(row['Player'].split(' ')[2:]) in row['Opponent']):
        return 1
    else:
        return 0

reshapedf['IsInLineup'] = reshapedf.apply(f, axis=1)
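Note that the question's LSU exception (check the Lineup column instead of Opp Lineup, and skip the team comparison) isn't handled above; a hedged sketch of f extended for that case might look like:
def f(row):
    name = ' '.join(row['Player'].split(' ')[:2])
    team = ' '.join(row['Player'].split(' ')[2:])
    if team == 'LSU':
        # LSU players: check the team's own lineup, no opponent comparison
        return 1 if name in row['Lineup'] else 0
    return 1 if (team in row['Opponent'] and name in row['Opp Lineup']) else 0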

I ended up using a workaround. I iterated through with df.iterrows and, on each pass, built a temporary list by checking for the value I wanted and appending 0 or 1. Then I simply inserted the list into the dataframe. Possibly not the most memory-efficient approach, but it worked.
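For reference, a minimal sketch of that workaround, assuming the column names from the question and that the player columns sit to the right of 'Opp Lineup':
# hypothetical layout: the first six columns are the id columns from the
# melt example above; everything after 'Opp Lineup' is a player dummy column
player_cols = df.columns[6:]
for col in player_cols:
    name = ' '.join(col.split(' ')[:2])
    team = ' '.join(col.split(' ')[2:])
    flags = []  # temporary list, one 0/1 per row
    for _, row in df.iterrows():
        if team == 'LSU':
            flags.append(1 if name in row['Lineup'] else 0)
        else:
            flags.append(1 if (team in row['Opponent'] and
                               name in row['Opp Lineup']) else 0)
    df[col] = flags  # insert the completed column back into the dataframe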

Related

Populating a subset of rows in a dataframe with values from another column / Collapsing several columns

First time posting here. I expect there's a better way of doing what I'm trying to do. I've been going round in circles for days and would really appreciate some help.
I am working with survey data about prisoners and their sentences.
Each prisoner has a type for the purpose of the survey, and this is stored in the column 'prisoner_type'. For each prisoner type, there is a group of 5 columns where their offenses can be recorded (not all columns are necessarily used). I'd like to collapse these groups of columns into one set of 5 columns and add these to the dataset so that, on each row, there is one set of 5 columns where I can find the offenses.
I have created a dictionary to look up the column names that the offence codes and offence types are stored in for each prisoner type. The key in the outer dictionary is the prisoner type. Here is an abridged version:
offense_variables = {
    3: {'codes': {1: 'V0114', 2: 'V0115', 3: 'V0116', 4: 'V0117', 5: 'V0118'},
        'off_types': {1: 'V0124', 2: 'V0125', 3: 'V0126', 4: 'V0127', 5: 'V0128'}},
    8: {'codes': {1: 'V0270', 2: 'V0271', 3: 'V0272', 4: 'V0273', 5: 'V0274'},
        'off_types': {1: 'V0280', 2: 'V0281', 3: 'V0282', 4: 'V0283', 5: 'V0285'}}
}
I am first creating 10 new columns: offense_1...offense_5 and type_1...type_5.
I am then trying to:
Use pandas .loc to locate all the rows for a given prisoner type
Set the values for the new columns by looking up, in the dictionary, the variable holding each offense number for that prisoner type, and assigning that column as the new values.
Problems:
The code doesn't terminate. I'm not sure why it's running on and on.
I receive the error message "A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead"
pris_types = [3, 8]
for pt in pris_types:
    # five offenses are listed in the survey, so we need five columns to hold
    # offense codes and five to hold offense types
    # '1' and '2' are just placeholder values
    for item in [i+1 for i in range(5)]:
        dataset[f'off_{item}_code'] = '1'
        dataset[f'off_{item}_type'] = '2'
    # then use .loc to get indexes for this prisoner type and
    # look up the variable of the column that we need to take the values from,
    # using the dictionary shown above
    for item in [i+1 for i in range(5)]:
        dataset.loc[dataset['prisoner_type'] == pt,
                    dataset[f'off_{item}_code']] = \
            dataset[offense_variables[pt]['codes'][item]]
        dataset.loc[dataset[prisoner_type] == pt,
                    dataset[f'off_{item}_type']] = \
            dataset[offense_variables[pt]['types'][item]]
The problem is that in your .loc[] sections, you just need to use the column label (a string) to identify the column where values are to be set, not the entire series/column object, as you are currently doing. With your current code, you are creating new columns named after the values stored in the dataset[f'off_{item}_type'] columns. So, instead of:
for item in [i+1 for i in range(5)]:
    dataset.loc[dataset['prisoner_type'] == pt,
                dataset[f'off_{item}_code']] = \
        dataset[offense_variables[pt]['codes'][item]]
    dataset.loc[dataset[prisoner_type] == pt,
                dataset[f'off_{item}_type']] = \
        dataset[offense_variables[pt]['types'][item]]
use:
for item in range(1, 6):
    dataset.loc[dataset['prisoner_type'] == pt,
                f'off_{item}_code'] = \
        dataset[offense_variables[pt]['codes'][item]]
    dataset.loc[dataset['prisoner_type'] == pt,
                f'off_{item}_type'] = \
        dataset[offense_variables[pt]['off_types'][item]]
(I simplified your range loop line too.)
Also, the statements creating the 10 new columns don't need to sit inside the loop over prisoner types; you can move them outside of that loop. In fact, you don't need to create them manually at all: the .loc[] assignments will create them for you.
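For instance, a minimal sketch (with a hypothetical two-row dataset) showing that assigning through .loc with a new column label creates the column on the fly:
import pandas as pd

dataset = pd.DataFrame({'prisoner_type': [3, 8], 'V0114': ['a', 'b']})
# 'off_1_code' does not exist yet; .loc creates it, filling unmatched rows with NaN
dataset.loc[dataset['prisoner_type'] == 3, 'off_1_code'] = dataset['V0114']
print(dataset)
#    prisoner_type V0114 off_1_code
# 0              3     a          a
# 1              8     b        NaN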

Apply function to several DataFrames by creating new DataFrame

Maybe someone can help me find the solution:
I have 100 dataframes. Each dataframe contains time / High_Price / Low_Price.
I would like to create a new DataFrame, which contains the Gain from each DataFrame.
Example:
df1 = pd.DataFrame({"high": [5, 4, 5, 2],
                    "low":  [1, 2, 2, 1]},
                   index=["2019-04-06", "2019-04-07", "2019-04-08", "2019-04-09"])
df100 = pd.DataFrame({"high": [7, 5, 6, 7],
                      "low":  [1, 2, 3, 4]},
                     index=["2019-04-06", "2019-04-07", "2019-04-08", "2019-04-09"])
Function:
def myfunc(data, amount):
    data = data.loc[(data != 0).any(axis=1)]
    profit = (amount / data.iloc[0]['low']) * data.iloc[-1]['high']
    return profit
Output should be:
output = pd.DataFrame({"Gain": [1, 6]},
                      index=["df1", "df100"])
How can I apply the function to all 100 DataFrames and collect only the Gains into a new DataFrame, where we see the name of each DataFrame and its Gain?
Put your dataframes in a list and access them by integer index. Having variables named df1 to df100 is bad programming style because a) the dataframes belong together, so put them in a collection (e.g. list) and b) you cannot get "the" name of an object from its value, leading to complications such as the one you are facing now.
So let dfs be your list of 100 dataframes, starting at index 0.
Use
amount = ... # the value you want to use
output = pd.DataFrame([myfunc(df, amount) for df in dfs], columns=['Gain'])
The index of output now corresponds to the index of dfs, starting at 0. There's no reason to rename it to 'df1' ... 'df100', you gain no information and the output becomes harder to handle.
In case of arbitrary dataframe names, use a dictionary that maps name to df. Let's call it dfs again. Then use
amount = ... # the value you want to use
output = pd.DataFrame([myfunc(df, amount) for df in dfs.values()], columns=['Gain'], index=list(dfs.keys()))
I'm assuming myfunc is correct, I did not debug it.
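To illustrate with the two example frames from the question (a sketch; amount=1 is an assumed stake):
dfs = {'df1': df1, 'df100': df100}
amount = 1  # assumed stake

output = pd.DataFrame([myfunc(df, amount) for df in dfs.values()],
                      columns=['Gain'], index=list(dfs.keys()))
# with amount=1 this yields 2 and 7; the question's expected Gains of 1 and 6
# suggest the intended Gain is profit minus the stake (profit - amount)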

Python loop through two dataframes and find similar column

I am currently working on a project where my goal is to get the game score for each NCAA men's basketball game. In order to do this, I need to use the python package sportsreference. I need to use two dataframes, one called df, which has the game date, and one called box_index (shown below), which has the unique link of each game. I need to get the date column replaced by the unique link of each game. These unique links start with the date (formatted exactly as in the date column of df), which makes it easier to do this with regex or .contains(). I keep getting a KeyError: 0. Can someone help me figure out what is wrong with my logic below?
from sportsreference.ncaab.schedule import Schedule

def get_team_schedule(name):
    combined = Schedule(name).dataframe
    box_index = combined["boxscore_index"]
    box = box_index.to_frame()
    #print(box)
    for i in range(len(df)):
        for j in range(len(box)):
            if box.loc[i, "boxscore_index"].contains(df.loc[i, "date"]):
                df.loc[i, "date"] = box.loc[i, "boxscore_index"]

get_team_schedule("Virginia")
It seems like "box" and "df" are pandas DataFrames, and since you are iterating through all the rows, it may be more efficient to use iterrows (instead of searching by index with ".loc"):
for i, row_df in df.iterrows():
    for j, row_box in box.iterrows():
        if row_df["date"] in row_box["boxscore_index"]:
            df.at[i, 'date'] = row_box["boxscore_index"]
the ".at" function will overwrite the value at a given cell
Just fyi, iterrows is more efficient than .loc., however itertuples is about 10x faster, and zip about 100xs.
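For example, a rough itertuples version of the same scan (a sketch, assuming plain-string dates and boxscore indexes):
for row_df in df.itertuples():
    for row_box in box.itertuples():
        # namedtuple attributes mirror the column names; Index is the row label
        if row_df.date in row_box.boxscore_index:
            df.at[row_df.Index, 'date'] = row_box.boxscore_index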
The KeyError: 0 is saying you can't get the row at index 0, because there is no index value of 0 when using box.loc[i, "boxscore_index"] (the index values are the dates, for example '2020-12-22-14-virginia'). You could use .iloc though, like box.iloc[i]["boxscore_index"]. You'd have to convert all the .loc calls to that.
Like the other post said, though, I wouldn't go down that path. I actually wouldn't even use iterrows here. I would put the box_index into a list, then iterate through that, and use pandas to filter your df dataframe. I'm making some assumptions about what df looks like, so if this doesn't work or isn't what you're looking to do, please share some sample rows of df:
from sportsreference.ncaab.schedule import Schedule

def get_team_schedule(name):
    combined = Schedule(name).dataframe
    box_index_list = list(combined["boxscore_index"])
    for box_index in box_index_list:
        temp_game_data = df[df["date"] == box_index]
        print(box_index)
        print(temp_game_data, '\n')

get_team_schedule("Virginia")

Given an index label, how would you extract the index position in a dataframe?

New to python, trying to take a csv and get the country that has the max number of gold medals. I can get the country name as a type Index but need a string value for the submission.
The csv has countries as the row indices, and columns with stats.
ind = DataFrame.index.get_loc(index_result) doesn't work because it doesn't have a valid key.
If I run dataframe.loc[ind], it returns the entire row.
df = read_csv('csv', index_col=0, skiprows=1)
# for loop to get the most gold medals:
mostMedals = iterator
getIndex = df[df['medals'] == mostMedals].index  # check the column medals
# for mostMedals cell to see what country won that many
ind = dataframe.index.get_loc(getIndex)  # doesn't like the key
What I'm going for is to get the index position from getIndex, so I can run something like dataframe.index[getIndex] and get the string I need, but I can't figure out how to get that integer index position.
Expanding on my comments above, this is how I would approach it. There may be better/other ways, pandas is a pretty enormous library with lots of neat functionality that I don't know yet, either!
import pandas as pd

df = pd.read_csv('csv', index_col=0, skiprows=1)
max_medals = df['medals'].max()
countries = list(df.where(df['medals'] == max_medals).dropna().index)
Unpacking that expression, the where method returns a frame based on df that matches the condition expressed. dropna() tells us to remove any rows that are NaN values, and index returns the remaining row index. Finally, I wrap that all in list, which isn't strictly necessary but I prefer working with simple built-in types unless I have a greater need.
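If you just need a single label and ties don't matter, a shorter route (a sketch, assuming the same 'medals' column) is Series.idxmax, which returns the index label of the first row holding the maximum:
# already a plain string when the index holds country names
best_country = df['medals'].idxmax()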

Return subset/slice of Pandas dataframe based on matching column of other dataframe, for each element in column?

So I think this is a relatively simple question:
I have a Pandas data frame (A) that has a key column (which is not unique/will have repeats of the key)
I have another Pandas data frame (B) that has a key column, which may have many matching entries/repeats.
So what I'd like is a bunch of data frames (a list, or a bunch of slice parameters, etc.), one for each key in A (regardless of whether it's unique or not)
In [bad] pseudocode:
for each key in A:
    resultDF[] = Rows in B where B.key = key
I can easily do this iteratively with loops, but I've read that you're supposed to slice/merge/join data frames holistically, so I'm trying to see if I can find a better way of doing this.
A join will give me all the stuff that matches, but that's not exactly what I'm looking for, since I need a resulting dataframe for each key (i.e. for every row) in A.
Thanks!
EDIT:
I was trying to be brief, but here are some more details:
Eventually, what I need to do is generate some simple statistical metrics for elements in the columns of each row.
In other words, I have a DF, call it A, and it has r rows and c columns, one of which is a key. There may be repeats on the key.
I want to "match" that key against another [set of?] dataframe, returning however many rows match the key. Then, for that set of rows, I want to, say, determine the min and max of a certain element (and std. dev., variance, etc.) and then determine whether the corresponding element in A falls within that range.
You're absolutely right that it's possible that if row 1 and row 3 of DF A have the same key -- but potentially DIFFERENT elements -- they'd be checked against the same result set (the ranges of which obviously won't change). That's fine. These won't likely ever be big enough to make that an issue (but if there's the better way of doing it, that's great).
The point is that I need to be able to do the "in range" and stat summary computation for EACH key in A.
Again, I can easily do all of this iteratively. But this seems like the sort of thing pandas could do well, and I'm just getting into using it.
Thanks again!
FURTHER EDIT
The DF looks like this:
df = pd.DataFrame([[1, 2, 3, 4, 1, 2, 3, 4],
                   [28, 15, 13, 11, 12, 23, 21, 15],
                   ['keyA', 'keyB', 'keyC', 'keyD',
                    'keyA', 'keyB', 'keyC', 'keyD']]).T
df.columns = ['SEQ', 'VAL', 'KEY']
  SEQ VAL   KEY
0   1  28  keyA
1   2  15  keyB
2   3  13  keyC
3   4  11  keyD
4   1  12  keyA
5   2  23  keyB
6   3  21  keyC
7   4  15  keyD
Both DF's A and B are of this format.
I can iteratively get the resultant sets by:
loop_iter = len(A) // max(A['SEQ_NUM'])
for start in range(0, loop_iter):
    matchA = A.iloc[start::loop_iter, :]['KEY']
That's simple. But I guess I'm wondering if I can do this "inline". Also, if for some reason the numeric ordering breaks (i.e. the SEQ values get out of order), then this won't work. There seems to be no reason NOT to split explicitly on the keys, right? So perhaps I have TWO questions: 1) How to split on keys, iteratively (i.e. accessing a DF one row at a time), and 2) How to match a DF and do summary statistics, etc., on a DF that matches on the key.
So, once again:
1). Iterate through DF A, going one at a time, and grabbing a key.
2). Match the key to the SET (matchB) of keys in B that match
3). Do some stats on "values" of matchB, check to see if val.A is in range, etc.
4). Profit!
Ok, from what I understand, the problem at its most simple is that you have a pd.Series of values (i.e. a["key"], which let's just call keys), which correspond to the rows of a pd.DataFrame (the df called b), such that set(b["key"]).issuperset(set(keys)). You then want to apply some function to each group of rows in b where the b["key"] is one of the values in keys.
I'm purposefully disregarding the other df -- a -- that you mention in your prompt, because it doesn't seem to bear any significance to the problem, other than being the source of keys.
Anyway, this is a fairly standard sort of operation -- it's a groupby-apply.
def descriptive_func(df):
    """
    Takes a df where key is always equal and returns some summary.

    :type df: pd.DataFrame
    :rtype: pd.Series|pd.DataFrame
    """
    pass

# filter down to those rows we're interested in
valid_rows = b[b["key"].isin(set(keys))]
# this groups by the key value and applies the descriptive func to each sub-df in turn
summary = valid_rows.groupby("key").apply(descriptive_func)
There are a few built-in methods on the groupby object that are useful. For example, check out valid_rows.groupby("key").sum() or valid_rows.groupby("key").describe(). Under the covers, these are really similar uses of apply. The shape of the returned summary is determined by the applied function. The unique grouped-by values -- those of b["key"] -- always constitute the index, but if the applied function returns a scalar, summary is a Series; if the applied function returns a Series, summary is made up of the returned Series as rows; and if the applied function returns a DataFrame, the result is a DataFrame with a MultiIndex. This is a core pattern in Pandas, and there's a whole, whole lot to explore here.
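As a concrete illustration (a sketch reusing the KEY/VAL layout from the question; the min/max range check stands in for whatever statistics you need):
import pandas as pd

a = pd.DataFrame({'KEY': ['keyA', 'keyB'], 'VAL': [20, 99]})
b = pd.DataFrame({'KEY': ['keyA', 'keyA', 'keyB', 'keyB'],
                  'VAL': [12, 28, 15, 23]})

# summarise each key group in b, then test whether each row of a
# falls inside its group's [min, max] range
ranges = b.groupby('KEY')['VAL'].agg(['min', 'max'])
merged = a.merge(ranges, left_on='KEY', right_index=True)
merged['in_range'] = merged['VAL'].between(merged['min'], merged['max'])
print(merged)
#     KEY  VAL  min  max  in_range
# 0  keyA   20   12   28      True
# 1  keyB   99   15   23     False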
