I am trying to replace multiple names throughout my entire DF to match a certain output. For example, how can I make the DF replace all "Ronald Acuna" with "Ronald Acuna Jr." and all "Corbin Burnes" with "Corbin B"?
lineups.replace(to_replace=['Corbin Burnes'], value='Corbin B')
This works, but then when I make another line for Ronald Acuna, Corbin B goes back to his full name. I'm sure there is a way to somehow loop it all together, but I can't find it.
Thanks
Most likely you need to assign the result back to the dataframe — by default, replace returns a new dataframe rather than modifying the original in place:
lineups = lineups.replace(to_replace=['Corbin Burnes'], value='Corbin B')
lineups = lineups.replace(to_replace=['Ronald Acuna'], value='Ronald Acuna Jr.')
And so on.
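Alternatively, to "loop it all together", replace accepts a dict, so every mapping can be applied in one call — a sketch with made-up lineup data:

```python
import pandas as pd

# toy stand-in for the real lineups DF
lineups = pd.DataFrame({'player': ['Corbin Burnes', 'Ronald Acuna', 'Mike Trout']})

# one replace call handles every name mapping at once
lineups = lineups.replace({'Corbin Burnes': 'Corbin B',
                           'Ronald Acuna': 'Ronald Acuna Jr.'})
```

This avoids one line per player and still needs the assignment back to lineups.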
I am trying to assign an object (it could be a list, tuple, or string) to a specific cell in a dataframe, but it does not work. I am filtering first and then trying to assign the value.
I am using df.loc[df['name']=='aagicampus'].reset_index(drop=True).at[0,'words']='test'
The expected result is a dataframe where the 'words' cell of the 'aagicampus' row holds 'test'.
It works if I create a copy of the dataframe, but I must keep the original dataframe to iterate later over a list and perform this procedure many times.
Thanks for your help.
You can do it by first getting the indices of the row(s) that you want to change, and then setting cells at one of those locations to the desired value.
This code gets the locations of rows that satisfy your condition of df['name'] == 'aagicampus':
locations = df.index[df['name'] == 'aagicampus']
then you just use .loc with locations[0] to change the first row that satisfies the condition. Here it is all together:
df = pd.DataFrame({'name':['something','aagicampus','something'], 'words':['unchanged', 'unchanged', 'unchanged'] })
locations = df.index[df['name'] == 'aagicampus']
df.loc[locations[0], 'words'] = 'CHANGED'
df.head()
this will return a table:
name words
0 something unchanged
1 aagicampus CHANGED
2 something unchanged
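As an aside, the original one-liner fails because filtering plus reset_index produces a copy, so the assignment never reaches the original frame. If every matching row should get the value, a single .loc call with both the row mask and the column name writes into df directly — a sketch on the same toy data:

```python
import pandas as pd

df = pd.DataFrame({'name': ['something', 'aagicampus', 'something'],
                   'words': ['unchanged', 'unchanged', 'unchanged']})

# one indexing step selects rows and column together,
# so the assignment modifies df itself, not a temporary copy
df.loc[df['name'] == 'aagicampus', 'words'] = 'test'
```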
I'm trying to learn more about re-indexing.
For background context, I have a data frame called sleep_cycle.
In this data frame, the columns are: Name, Age, Average sleep time.
I want to pick out only those whose names begin with the letter 'B'.
I then want to re-index these 'B' people, so that I have a new data frame with the same columns but only the rows whose name begins with 'B'.
Here was my attempt to do it:
info = list(sleep_cycle.columns) #this is just to set a list of the existing columns
b_names = [name for name in sleep_cycle['name'] if name[0] == 'B']
b_sleep_cycle = sleep_cycle.reindex(b_names, columns = info) #re-index with the 'B' people, and set columns to the list I saved earlier.
Results: Re-indexing was successful; it picked out only those whose names began with the letter 'B', and the columns remained the same. Great! Problem was: all the data had been replaced with NaN.
Can someone help me with this one? What am I doing wrong? It would be best appreciated if you could suggest a solution that is only in one line of code.
Based on your description (example data and expected output would be better), this would work:
sleep_cycle[sleep_cycle['name'].str.startswith('B')]
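A quick check with made-up data (the real frame isn't shown), using reset_index to give the filtered frame a clean index:

```python
import pandas as pd

# hypothetical stand-in for the sleep_cycle frame
sleep_cycle = pd.DataFrame({'name': ['Ben', 'Alice', 'Bob'],
                            'age': [25, 30, 35],
                            'avg_sleep': [7.5, 8.0, 6.5]})

# boolean mask keeps only rows whose name starts with 'B';
# reindex() isn't needed at all for row filtering
b_people = sleep_cycle[sleep_cycle['name'].str.startswith('B')].reset_index(drop=True)
```

reindex with a list of names produced NaN because the frame's index holds integer positions, not names, so none of the requested labels existed.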
I'm looking to create/update a new column, 'dept' if the text in column
A contains a string. It's working without a for loop involved, but when I try to iterate it is setting the default instead of the detected value.
Surely I shouldn't manually add the same line 171 times, I've scoured the internet and SO for possible hints and or solutions and can't seem to locate any good info.
Working Code:
df['dept'] = np.where(df.a.str.contains("PHYS"), "PHYS", "Unknown")
But when I try:
depts = ['PHYS', 'PSYCH']
for dept in depts:
    df['dept'] = np.where(df.a.str.contains(dept), dept, "Unknown")
    print(dept)
I get all "Unknown" values, though it properly prints out each dept. I've also tried to make sure dept is fed in as a string by explicitly stating dept = str(dept), to no avail.
Thanks in advance for any and all help. I feel like this is a simple issue that should be easily sorted but I'm experiencing a block.
We usually do
df['dept'] = df.a.str.findall('|'.join(depts)).str[0]
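With made-up data matching the question, findall returns an empty list for rows that hit no department, so .str[0] leaves NaN there, which can then be filled:

```python
import pandas as pd

depts = ['PHYS', 'PSYCH']
df = pd.DataFrame({'a': ['ewfefPHYS', 'QWQiPSYCH', 'fwfew']})

# findall collects every dept substring per row; .str[0] keeps the first hit,
# and rows with no hit become NaN, which fillna maps to "Unknown"
df['dept'] = df.a.str.findall('|'.join(depts)).str[0].fillna('Unknown')
```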
I prefer str.extract:
df['dept'] = df['a'].str.extract(f"({'|'.join(depts)})").fillna("Unknown")
Or:
df['dept'] = df['a'].str.extract('(' + '|'.join(depts) + ')').fillna("Unknown")
Both codes output:
>>> df
           a     dept
0  ewfefPHYS     PHYS
1  QWQiPSYCH    PSYCH
2      fwfew  Unknown
>>>
@U-12-Forward has a great solution if there is only supposed to be one new column titled specifically with the string 'dept', not the value of each dept variable in the loop.
If the intent is to create a new column for each dept in depts, then remove the quotation marks around "dept" in the column indexer:
for dept in depts:
    df[dept] = np.where(df.a.str.contains(dept), dept, "Unknown")
The example is confusing because it is not clear from the variable names whether there is supposed to be a new column for each dept (i.e., PHYS, PSYCH).
This excerpt will not "work" because it would overwrite df['dept'] on the second assignment with something that is only a combination of 'PSYCH' and 'Unknown' (there would be no 'PHYS').
df['dept'] = np.where(df.a.str.contains("PHYS"), "PHYS", "Unknown")
df['dept'] = np.where(df.a.str.contains("PSYCH"), "PSYCH", "Unknown")
What you are describing would certainly happen if no strings in column a contain the final element in depts: the condition in the last np.where would be all False, so it would return a full series of 'Unknown'.
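One way to keep earlier matches while still looping is to use the column built so far as the np.where fallback instead of the literal "Unknown" — a sketch, not the only fix (np.select would also work):

```python
import numpy as np
import pandas as pd

depts = ['PHYS', 'PSYCH']
df = pd.DataFrame({'a': ['ewfefPHYS', 'QWQiPSYCH', 'fwfew']})

df['dept'] = 'Unknown'  # set the default once, up front
for dept in depts:
    # fall back to the existing column, so matches from earlier
    # iterations are no longer overwritten with "Unknown"
    df['dept'] = np.where(df.a.str.contains(dept), dept, df['dept'])
```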
New to python, trying to take a csv and get the country that has the max number of gold medals. I can get the country name as a type Index but need a string value for the submission.
csv has rows of countries as the indices, and columns with stats.
ind = DataFrame.index.get_loc(index_result) doesn't work because it doesn't have a valid key.
If I run dataframe.loc[ind], it returns the entire row.
df = read_csv('csv', index_col=0,skiprows=1)
for loop to get the most gold medals:
mostMedals = iterator
getIndex = df[df['medals'] == mostMedals].index  # check the column medals
# for mostMedals cell to see what country won that many
ind = dataframe.index.get_loc(getIndex)  # doesn't like the key
What I'm going for is to get the index position of the getIndex so I can run something like dataframe.index[getIndex] and that will give me the string I need but I can't figure out how to get that index position integer.
Expanding on my comments above, this is how I would approach it. There may be better/other ways, pandas is a pretty enormous library with lots of neat functionality that I don't know yet, either!
df = read_csv('csv', index_col=0,skiprows=1)
max_medals = df['medals'].max()
countries = list(df.where(df['medals'] == max_medals).dropna().index)
Unpacking that expression: the where method returns a frame based on df that matches the condition expressed, dropna() removes any rows that are NaN values, and index returns the remaining row labels. Finally, I wrap that all in list, which isn't strictly necessary, but I prefer working with simple built-in types unless I have a greater need.
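If there can only be one top country, idxmax returns the index label of the maximum directly — a shorter alternative, sketched here on a made-up stand-in for the medals CSV (assuming a numeric 'medals' column with countries as the index):

```python
import pandas as pd

# toy stand-in: country names as the index, medal counts as data
df = pd.DataFrame({'medals': [10, 46, 27]},
                  index=['Germany', 'USA', 'China'])

# idxmax gives the index label (a string here) of the row with the max value
winner = df['medals'].idxmax()
```

Note that idxmax returns only the first winner if there is a tie, whereas the where/dropna approach returns all of them.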
I have a large dataframe collating a bunch of basketball data (screenshot below). Every column to the right of Opp Lineup is a dummy variable indicating if that player (indicated in the column name) is in the current lineup (the last part of the column name is team name, which needs to be compared to the opponent column to make sure two players with the same number and name on different teams don't mess it up). I know several ways of iterating through a pandas dataframe (iterrows, itertuples, iteritems), but I don't know the way to accomplish what I need to, which is for each line in each column:
Compare the team (columnname.split()[2:]) to the Opponent column (except for LSU players)
See if the name (columnname.split()[:2]) is in Opp Lineup or, for LSU players, lineup
If the above conditions are satisfied, replace that value with 1, otherwise leave it as 0
What is the best method for looping through the dataframe and accomplishing this task? Speed doesn't really matter in this instance. I understand all of the logic involved, except I'm not familiar enough with pandas to know how to loop through it, and trying various things I've seen on Google isn't working.
Consider a reshape/pivot solution as your data is in wide format but you need to compare values row-wise in long format. So, first melt your data so all column headers become an actual column 'Player' and its corresponding value to 'IsInLineup'. Run your conditional comparison for dummy values, and then pivot back to original structure with players across column headers. Of course, I do not have actual data to test this example fully.
# MELT
reshapedf = pd.melt(df, id_vars=['Opponent', 'Lineup', 'Minutes', 'Plus Minus',
'Plus Minus Per Minute', 'Opp Lineup'],
var_name='Player', value_name='IsInLineup')
# APPLY FUNCTION (SPLITTING VALUE AND THEN JOINING FOR SUBSET STRING)
reshapedf['IsInLineup'] = reshapedf.apply(lambda row: (' '.join(row['Player'].split(' ')[:2]) in row['Opp Lineup'] and
' '.join(row['Player'].split(' ')[2:]) in row['Opponent'])*1, axis=1)
# PIVOT (UNMELT)
df2 = reshapedf.pivot_table(index=['Opponent', 'Lineup', 'Minutes', 'Plus Minus',
'Plus Minus Per Minute', 'Opp Lineup'], columns='Player').reset_index()
df2.columns = df2.columns.droplevel(0).rename(None)
df2.columns = df.columns
If the above lambda function looks a little complex, try the equivalent named function with apply:
# APPLY FUNCTION (SPLITTING VALUE AND THEN JOINING FOR SUBSET STRING)
def f(row):
    if (' '.join(row['Player'].split(' ')[:2]) in row['Opp Lineup'] and
            ' '.join(row['Player'].split(' ')[2:]) in row['Opponent']):
        return 1
    else:
        return 0

reshapedf['IsInLineup'] = reshapedf.apply(f, axis=1)
I ended up using a workaround. I iterated through with df.iterrows and, for each row, built a temporary list by checking for the value I wanted and appending 0 or 1. Then I simply inserted the list into the dataframe. Possibly not the most efficient memory-wise, but it worked.
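That workaround might look roughly like this on heavily simplified, hypothetical data (the real frame has one such column per player):

```python
import pandas as pd

# simplified stand-in: one player column, opposing lineup stored as a string
df = pd.DataFrame({'Opp Lineup': ['John Smith Duke', 'Jane Doe UNC'],
                   'John Smith Duke': [0, 0]})

player_col = 'John Smith Duke'
name = ' '.join(player_col.split(' ')[:2])  # "John Smith"

flags = []
for _, row in df.iterrows():
    # 1 if the player's name appears in the opposing lineup, else 0
    flags.append(1 if name in row['Opp Lineup'] else 0)

df[player_col] = flags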