I'm trying to learn more about re-indexing.
For background context, I have a data frame called sleep_cycle.
In this data frame, the columns are: Name, Age, Average sleep time.
I want to pick out only those whose names begin with the letter 'B'.
I then want to re-index these 'B' people, so that I have a new data frame that has the same columns but only contains those whose names begin with 'B'.
Here was my attempt to do it:
info = list(sleep_cycle.columns) #this is just to set a list of the existing columns
b_names = [name for name in sleep_cycle['name'] if name[0] == 'B']
b_sleep_cycle = sleep_cycle.reindex(b_names, columns = info) #re-index with the 'B' people, and set columns to the list I saved earlier.
Results: Re-indexing was successful. It picked only those whose names begin with the letter 'B', and the columns remained the same. Great! The problem: all the data has been replaced with NaN.
Can someone help me with this one? What am I doing wrong? I would especially appreciate a solution that fits in one line of code.
Based on your description (example data and expected output would be better), this would work:
sleep_cycle[sleep_cycle['name'].str.startswith('B')]
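As for why the original attempt produced NaN: reindex looks up the given labels in the existing index, and here the names live in a column rather than in the index, so none of the labels are found. A minimal sketch of both behaviours, assuming made-up rows (only the 'name' column label comes from the question; the other columns and values are invented for illustration):

```python
import pandas as pd

# Hypothetical sample data; only the 'name' column label comes from the question
sleep_cycle = pd.DataFrame({
    "name": ["Bob", "Alice", "Bea", "Carl"],
    "age": [31, 28, 45, 39],
    "avg_sleep": [7.5, 6.0, 8.0, 6.5],
})

# reindex() matches labels against the index (0..3 here), so the names find nothing
reindexed = sleep_cycle.reindex(["Bob", "Bea"])

# A boolean mask keeps the matching rows with their data intact
b_sleep_cycle = sleep_cycle[sleep_cycle["name"].str.startswith("B")]

# Optional: renumber the kept rows from 0
b_sleep_cycle = b_sleep_cycle.reset_index(drop=True)
print(b_sleep_cycle)
```

If the names really should be the row labels, calling set_index('name') first would let label-based lookups find them.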
I am trying to replace multiple names throughout my entire DataFrame to match a certain output. For example, how can I replace all "Ronald Acuna" with "Ronald Acuna Jr." and "Corbin Burnes" with "Corbin B"?
lineups.replace(to_replace = ['Corbin Burnes'], value ='Corbin B')
This works, but when I add another line for Ronald Acuna, Corbin B goes back to his full name. I'm sure there is a way to loop it all together, but I can't find it.
Thanks
Most likely you need to assign the result back to the dataframe, because replace returns a new dataframe instead of modifying the original:
lineups = lineups.replace(to_replace = ['Corbin Burnes'], value ='Corbin B')
lineups = lineups.replace(to_replace = ['Ronald Acuna'], value ='Ronald Acuna Jr')
And so on.
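The repeated calls can also be collapsed into a single one by passing replace a dictionary that maps each old value to its new value. A minimal sketch, assuming a made-up lineups frame (the column names and the third player are invented for illustration):

```python
import pandas as pd

# Hypothetical lineup data for illustration
lineups = pd.DataFrame({
    "player": ["Corbin Burnes", "Ronald Acuna", "Mookie Betts"],
    "team": ["MIL", "ATL", "LAD"],
})

# Map each old name to its replacement; only whole-cell matches are replaced
renames = {
    "Corbin Burnes": "Corbin B",
    "Ronald Acuna": "Ronald Acuna Jr.",
}

# replace() returns a new DataFrame, so assign the result back
lineups = lineups.replace(renames)
print(lineups)
```

Keeping the mapping in one dictionary also makes it easy to add further names later without another line of code per player.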
I have an initial dummy dataframe with 1 row and 7 named columns, initialised to zeros:
import numpy as np
import pandas as pd
d = pd.DataFrame(np.zeros((1, 7)))
d = d.rename(columns={0: "Gender_M",
                      1: "Gender_F",
                      2: "Employed_Self",
                      3: "Employed_Employee",
                      4: "Married_Y",
                      5: "Married_N",
                      6: "Salary"})
Now I have a single record
data = [['M', 'Employee', 'Y',85412]]
data_test = pd.DataFrame(data, columns = ['Gender', 'Employed', 'Married','Salary'])
From the single record I have to create a new dataframe, where if the
Gender column has M, then Gender_M should be changed to 1 and Gender_F left at zero
Employed column has Employee, then Employed_Employee should be changed to 1 and Employed_Self left at zero
The same goes for Married. For the integer column Salary, just set the value 85412. I tried if statements, but that takes a lot of code. Is there a simpler way?
Here is one way using update twice (df here is the single-record dataframe, called data_test in the question):
d.update(df)
df.columns=df.columns+'_'+df.astype(str).iloc[0]
df.iloc[:]=1
d.update(df)
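A different route to the same result, not the one used in the answer above, is to one-hot encode the record with pd.get_dummies and then reindex it onto the target columns. This is only a sketch: the target column list is copied from the question, and dtype=int is passed so that the indicator columns hold 0/1 rather than booleans.

```python
import pandas as pd

# Target layout, copied from the question
columns = ["Gender_M", "Gender_F", "Employed_Self", "Employed_Employee",
           "Married_Y", "Married_N", "Salary"]

data_test = pd.DataFrame([["M", "Employee", "Y", 85412]],
                         columns=["Gender", "Employed", "Married", "Salary"])

# One-hot encode the categorical columns; new names look like 'Gender_M'
encoded = pd.get_dummies(data_test, columns=["Gender", "Employed", "Married"], dtype=int)

# Add the categories missing from this record as zeros, in the target column order
result = encoded.reindex(columns=columns, fill_value=0)
print(result)
```

This relies on the prefix_value naming that get_dummies produces matching the target column names exactly, which happens to hold for the names in the question.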
Alas homework is often designed to be boring and repetitive ...
You do not have a problem - rather you want other people to do the work for you. SO is not for this purpose - post a problem, you will find many people willing to help.
So show your FULL answer then ask for "Is there a better way"
I have two dataframes: Instructor_Info and Operator_Info
Instructor_Info contains columns called Names and OperatorName, and Operator_Info also has a column called Names. Every name in Instructor_Info has an associated name in Operator_Info. I want to use fuzz.token_sort_ratio() to find these matches by comparing each name in Instructor_Info to every name in Operator_Info and storing the string with the highest score in the OperatorName column.
This is what I have so far:
for index, row in Instructor_Info.iterrows():
    match = 0
    for index1, row1 in Operator_Info.iterrows():
        if fuzz.token_sort_ratio(row['Names'], row1['Names']) > match:
            row['OperatorName'] = row1['Names']
This code runs extremely slowly and produces a couple of false matches (I can fix these manually, so speed is the main issue). If anyone has any faster ideas, it would be much appreciated. Thanks in advance.
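Two things stand out in the loop as posted: match is never updated, so every later name with a non-zero score overwrites the previous one, and values assigned to row inside iterrows may not be written back to the dataframe. One possible restructuring is to pull both name columns into plain Python lists, pick the best candidate per name, and assign the whole column once. The sketch below is an assumption-laden illustration: the sample names are invented, and token_sort_score is a stand-in built on the standard library's difflib, not fuzz.token_sort_ratio itself; the real scorer can be passed in through the scorer argument.

```python
import difflib
import pandas as pd

def token_sort_score(a, b):
    """Stand-in scorer (an assumption, not fuzz.token_sort_ratio itself):
    sort the lowercase tokens of each name, then compare with difflib."""
    a_sorted = " ".join(sorted(a.lower().split()))
    b_sorted = " ".join(sorted(b.lower().split()))
    return 100 * difflib.SequenceMatcher(None, a_sorted, b_sorted).ratio()

def best_match(name, candidates, scorer=token_sort_score):
    """Return the candidate with the highest score for `name`."""
    return max(candidates, key=lambda candidate: scorer(name, candidate))

# Hypothetical sample data
Instructor_Info = pd.DataFrame({"Names": ["Smith John", "Jane A. Doe"]})
Operator_Info = pd.DataFrame({"Names": ["Jane Doe", "John Smith", "Alan Turing"]})

# Work on plain Python lists instead of calling iterrows() on both frames
candidates = Operator_Info["Names"].tolist()
Instructor_Info["OperatorName"] = [
    best_match(name, candidates) for name in Instructor_Info["Names"]
]
print(Instructor_Info)
```

The number of comparisons is still the product of the two table sizes, so most of the gain comes from avoiding per-row Series creation; a faster scoring library would help further.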
I have a list of lists and I want to assign each of the lists to a specific column, I have created the columns of the Dataframe. But in each column, the elements are coming as a list. I want each element of this list to be a separate row as part of that particular column.
Here's what I did:
df = pd.DataFrame([np.array(dataset).T],columns=list1)
print(df)
Attached screenshot for the output.
I want each element of that list to be a row, as my output.
This should do the work for you:
import pandas as pd
Fasteners = ['Screws & Bolts', 'Threaded Rods & Studs', 'Eyebolts', 'U-Bolts']
Adhesives_and_Tape = ['Adhesives','Tape','Hook & Loop']
Weld_Braz_Sold = ['Electrodes & Wire','Gas Regulators','Welding Gloves','Welding Helmets & Glasses','Protective Screens']
df = pd.DataFrame({'Fastener': pd.Series(Fasteners), 'Adhesives_and_Tape': pd.Series(Adhesives_and_Tape), 'Weld_Braz_Sold': pd.Series(Weld_Braz_Sold)})
print(df)
Please provide the structure of the database you are starting from or the structure of the respective lists. I can then give you a more focused answer to your specific problem.
If the structure is getting larger, you can also iterate through all lists when generating the data frame. This is just the basic process to solve your question.
Feel free to comment for further help.
EDIT
If you want to loop through a database of lists, additionally use the following code:
for i in range(len(list1)): df.iloc[:,i] = pd.Series(dataset[i])
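The loop can also be folded into the constructor by building the dictionary of Series with a comprehension. A minimal sketch, assuming list1 holds the column names and dataset holds one list per column, as the question suggests (the values below are hypothetical stand-ins):

```python
import pandas as pd

# Hypothetical stand-ins for the question's `list1` (column names)
# and `dataset` (one list of values per column)
list1 = ["Fasteners", "Adhesives_and_Tape"]
dataset = [
    ["Screws & Bolts", "Eyebolts", "U-Bolts"],
    ["Adhesives", "Tape"],
]

# Wrapping each list in a Series lets lists of unequal length share one frame;
# shorter columns are padded with NaN
df = pd.DataFrame({name: pd.Series(values) for name, values in zip(list1, dataset)})
print(df)
```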
I have a large dataframe collating a bunch of basketball data (screenshot below). Every column to the right of Opp Lineup is a dummy variable indicating if that player (indicated in the column name) is in the current lineup (the last part of the column name is team name, which needs to be compared to the opponent column to make sure two players with the same number and name on different teams don't mess it up). I know several ways of iterating through a pandas dataframe (iterrows, itertuples, iteritems), but I don't know the way to accomplish what I need to, which is for each line in each column:
Compare the team (columnname.split()[2:]) to the Opponent column (except for LSU players)
See if the name (columnname.split()[:2]) is in Opp Lineup or, for LSU players, lineup
If the above conditions are satisfied, replace that value with 1, otherwise leave it as 0
What is the best method for looping through the dataframe and accomplishing this task? Speed doesn't really matter in this instance. I understand all of the logic involved, except I'm not familiar enough with pandas to know how to loop through it, and trying various things I've seen on Google isn't working.
Consider a reshape/pivot solution as your data is in wide format but you need to compare values row-wise in long format. So, first melt your data so all column headers become an actual column 'Player' and its corresponding value to 'IsInLineup'. Run your conditional comparison for dummy values, and then pivot back to original structure with players across column headers. Of course, I do not have actual data to test this example fully.
# MELT
reshapedf = pd.melt(df, id_vars=['Opponent', 'Lineup', 'Minutes', 'Plus Minus',
'Plus Minus Per Minute', 'Opp Lineup'],
var_name='Player', value_name='IsInLineup')
# APPLY FUNCTION (SPLITTING VALUE AND THEN JOINING FOR SUBSET STRING)
reshapedf['IsInLineup'] = reshapedf.apply(lambda row: (' '.join(row['Player'].split(' ')[:2]) in row['Opp Lineup'] and
' '.join(row['Player'].split(' ')[2:]) in row['Opponent'])*1, axis=1)
# PIVOT (UNMELT)
df2 = reshapedf.pivot_table(index=['Opponent', 'Lineup', 'Minutes', 'Plus Minus',
'Plus Minus Per Minute', 'Opp Lineup'], columns='Player').reset_index()
df2.columns = df2.columns.droplevel(0).rename(None)
df2.columns = df.columns
If the lambda function above looks a little complex, try an equivalent named function passed to apply():
# APPLY FUNCTION (SPLITTING VALUE AND THEN JOINING FOR SUBSET STRING)
def f(row):
    if (' '.join(row['Player'].split(' ')[:2]) in row['Opp Lineup'] and \
        ' '.join(row['Player'].split(' ')[2:]) in row['Opponent']):
        return 1
    else:
        return 0
reshapedf['IsInLineup'] = reshapedf.apply(f,axis=1)
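To make the melt, apply, and pivot steps concrete, here is a self-contained miniature. It is a sketch under stated assumptions: the table, team names, and player names are invented; the special handling of LSU players from the question is left out; and it differs slightly from the code above by passing values= and aggfunc="max" to pivot_table and by restoring the column order through selection rather than reassignment.

```python
import pandas as pd

# Hypothetical miniature of the lineup table: two id columns, two player columns
df = pd.DataFrame({
    "Opponent": ["Duke", "Kentucky"],
    "Opp Lineup": ["1 Smith, 2 Jones", "3 Brown, 4 White"],
    "1 Smith Duke": [0, 0],
    "3 Brown Kentucky": [0, 0],
})
id_cols = ["Opponent", "Opp Lineup"]

# Wide to long: one row per (game row, player column)
long_df = df.melt(id_vars=id_cols, var_name="Player", value_name="IsInLineup")

def in_lineup(row):
    parts = row["Player"].split(" ")
    name, team = " ".join(parts[:2]), " ".join(parts[2:])
    # 1 only when both the player name and the team match this row
    return int(name in row["Opp Lineup"] and team in row["Opponent"])

long_df["IsInLineup"] = long_df.apply(in_lineup, axis=1)

# Long back to wide; rows sharing the same id values would be merged by aggfunc
wide = long_df.pivot_table(index=id_cols, columns="Player",
                           values="IsInLineup", aggfunc="max").reset_index()
wide.columns.name = None

# pivot_table sorts the player columns, so select them in the original order
wide = wide[df.columns]
print(wide)
```

Because pivot_table aggregates rows with identical id values, the id columns need to identify each original row uniquely for the round trip to preserve the row count.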
I ended up using a workaround. I iterated through the rows with df.iterrows, and for each column built a temporary list: on each iteration I checked for the value I wanted and appended a 0 or 1 to that list. Then I simply inserted the list into the dataframe. It is possibly not the most memory-efficient approach, but it worked.