Beginner pandas: change row data based upon coded logic - Python

I'm a beginner in pandas and Python, trying to learn them.
I would like to iterate over pandas rows to apply simple coded logic.
Instead of fancy mapping functions, I just want simple coded logic,
so that I can easily adapt it later for other coded logic rules as well.
In my dataframe dc,
I'd like to check whether column AgeUnknown == 1 (or > 0).
If so, it should move the value of column Age to AgeUnknown,
and then set Age to 0.0.
I tried various combinations of the code below, but it won't work.
# using a row reference #########
for index, row in dc.iterrows():
    r = row['AgeUnknown']
    if r > 0:
        w = dc.at[index, 'Age']
        dc.at[index, 'AgeUnknown'] = w
        dc.at[index, 'Age'] = 0
Another attempt:
for index in dc.index:
    r = dc.at[index, 'AgeUnknown'].[0]  # also tried .sum here
    if r > 0:
        w = dc.at[index, 'Age']
        dc.at[index, 'AgeUnknown'] = w
        dc.at[index, 'Age'] = 0
Also tried:
if dc[index, 'Age'] > 0  # wasn't allowed either
Why isn't this working? As far as I understood, a dataframe should be addressable like the above.

I realize you requested a solution involving iterating the df, but I thought I'd provide one that I think is more traditional.
A non-iterating solution to your problem is something like this: 1) get all the indexes that meet your criteria, then 2) set those indexes of the df to what you want.
# indexes where column AgeUnknown is >0
inds = dc[dc['AgeUnknown'] > 0].index.tolist()
# set AgeUnknown to the Age column at those indexes
dc.loc[inds, 'AgeUnknown'] = dc.loc[inds, 'Age']
# change the Age to 0 at those indexes
dc.loc[inds, 'Age'] = 0
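If you'd rather skip building the intermediate list of indexes, a boolean mask can be passed to .loc directly. A minimal sketch (the sample dc here is made up purely for illustration):
import pandas as pd

# tiny made-up frame, just to have something to run against
dc = pd.DataFrame({'Age': [25.0, 40.0, 33.0], 'AgeUnknown': [0, 1, 0]})

# boolean mask of the rows where AgeUnknown > 0
mask = dc['AgeUnknown'] > 0

# copy Age into AgeUnknown on those rows, then zero out Age
dc.loc[mask, 'AgeUnknown'] = dc.loc[mask, 'Age']
dc.loc[mask, 'Age'] = 0.0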


Python loop through two dataframes and find similar column

I am currently working on a project where my goal is to get the game scores for each NCAA men's basketball game. To do this, I need to use the Python package sportsreference. I need to use two dataframes: one called df, which has the game date, and one called box_index (shown below), which has the unique link of each game. I need to get the date column replaced by the unique link of each game. These unique links start with the date (formatted exactly as in the date column of df), which makes it easier to do this with regex or .contains(). I keep getting a KeyError: 0 error. Can someone help me figure out what is wrong with my logic below?
from sportsreference.ncaab.schedule import Schedule

def get_team_schedule(name):
    combined = Schedule(name).dataframe
    box_index = combined["boxscore_index"]
    box = box_index.to_frame()
    #print(box)
    for i in range(len(df)):
        for j in range(len(box)):
            if box.loc[i, "boxscore_index"].contains(df.loc[i, "date"]):
                df.loc[i, "date"] = box.loc[i, "boxscore_index"]

get_team_schedule("Virginia")
It seems like box and df are pandas DataFrames, and since you are iterating through all the rows, it may be more efficient to use iterrows (instead of searching by index with .loc):
for i, row_df in df.iterrows():
    for j, row_box in box.iterrows():
        # plain strings have no .contains method; use the `in` operator instead
        if row_df["date"] in row_box["boxscore_index"]:
            df.at[i, 'date'] = row_box["boxscore_index"]
the ".at" function will overwrite the value at a given cell
Just fyi, iterrows is more efficient than .loc., however itertuples is about 10x faster, and zip about 100xs.
The KeyError: 0 error is saying you can't get the row at index 0, because there is no index value of 0 when using box.loc[i, "boxscore_index"] (the index values are the dates, for example '2020-12-22-14-virginia'). You could use .iloc though, like box.iloc[i]["boxscore_index"]. You'd have to convert all the .loc calls to that.
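For concreteness, that positional conversion might look like the sketch below (also indexing box by its own counter j, which the original loop never used):
for i in range(len(df)):
    for j in range(len(box)):
        # .iloc is positional, so plain integer counters work even though
        # the index labels are date strings like '2020-12-22-14-virginia'
        if df.iloc[i]["date"] in box.iloc[j]["boxscore_index"]:
            df.iloc[i, df.columns.get_loc("date")] = box.iloc[j]["boxscore_index"]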
Like the other post said though, I wouldn't go that path. I actually wouldn't even use iterrows here. I would put the box_index into a list, then iterate through that, and use pandas to filter your df dataframe. I'm sort of making some assumptions about what df looks like, so if this doesn't work, or isn't what you're looking to do, please share some sample rows of df:
from sportsreference.ncaab.schedule import Schedule

def get_team_schedule(name):
    combined = Schedule(name).dataframe
    box_index_list = list(combined["boxscore_index"])
    for box_index in box_index_list:
        # filter df down to the rows whose date matches this boxscore index
        temp_game_data = df[df["date"] == box_index]
        print(box_index)
        print(temp_game_data, '\n')

get_team_schedule("Virginia")

Unable to loop through Dataframe rows: Length of values does not match length of index

I'm not entirely sure why I am getting this error, as the dataframe I am currently working with is very simple. Here is a sample of the dataframe (the date column is the index):
date        News
2021-02-01  This is a news headline. This is a news summary.
2021-02-02  This is another headline. This is another summary.
So basically, all I am trying to do is loop through the dataframe one row at a time, pull the News item, run the Sentiment Intensity Analyzer on it, and store the compound value in a separate list (which I am appending to an empty list). However, when I run the loop, it gives me this error:
Length of values (5085) does not match the length of index (2675)
Here is a sample of the code that I have so far:
sia = SentimentIntensityAnalyzer()
news_sentiment_list = []
for i in range(0, (df_news.shape[0]-1)):
    n = df_news.iloc[i][0]
    news_sentiment_list.append(sia.polarity_scores(n)['compound'])
df['News Sentiment'] = news_sentiment_list
I've tried the loop statement a number of different ways using the FOR loop, and it always returns that error. I am honestly lost at this point =(
Edit: the shape of the dataframe is (5087, 1).
The target dataframe is df, whereas you loop on df_news; the indexes are probably not the same. You might need to merge the dataframes first.
Moreover, there is an easier approach to your problem that avoids looping entirely. Assuming your dataframe df_news holds the column News (as shown in your table), you can add a column to it simply by doing:
sia = SentimentIntensityAnalyzer()
df_news['News Sentiment'] = df_news['News'].apply(lambda x: sia.polarity_scores(x)['compound'])
A general rule when using pandas is to avoid for-loops as much as possible; except for very specific edge cases, pandas' built-in methods will be sufficient.
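For completeness, if you do keep the loop, the two fixes are the assignment target (df_news, the frame you actually iterated) and the range, whose -1 silently drops the last row. A sketch, assuming the NLTK VADER analyzer the question appears to use:
from nltk.sentiment.vader import SentimentIntensityAnalyzer

sia = SentimentIntensityAnalyzer()
news_sentiment_list = []
# range(df_news.shape[0]) covers every row; range(0, shape[0] - 1) skips the last
for i in range(df_news.shape[0]):
    n = df_news.iloc[i, 0]
    news_sentiment_list.append(sia.polarity_scores(n)['compound'])
# assign back to the same dataframe that was looped over
df_news['News Sentiment'] = news_sentiment_list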

Select specific rows on pandas based on condition

I have a dataframe containing a column called bmi (Body Mass Index) containing int values.
I have to separate the values in the bmi column into Underweight, Normal, Overweight and Obese based on the values. Below is the loop for the same.
However, I am getting an error. I am a beginner; I just started coding 2 weeks ago.
Generally speaking, using a for loop in pandas is usually a bad idea. Pandas allows you to manipulate data easily. For example, if you want to filter by some condition:
print(df[df["bmi"] > 30])
will print all rows where bmi > 30. It works as follows: df[condition]. The condition in this case is "bmi" larger than 30, so our condition is df["bmi"] > 30. Notice that the line df[df["bmi"] > 30] returns all rows that satisfy the condition. I printed them, but you can manipulate them however you like.
Even though it's a bad technique (or one used only for specific needs), you can of course iterate through the dataframe. This is not done via for l in df, as df is a dataframe object. To iterate through it you can use iterrows:
for index, row in df.iterrows():
    if row["bmi"] > 30:
        print("Obese")
Also, for next time, please provide your code inline; don't paste an image of it.
If your goal is to separate into different labels, I suggest the following:
df.loc[df[df["bmi"] > 30, "NewColumn"] = "Obese"
df.loc[df[df["bmi"] < 18.5, "NewColumn"] = "Underweight"
The .loc operator lets you manipulate only part of the data. Its format is [rows, columns]. So the code above takes only the rows where bmi > 30, and only the "NewColumn" column (name it whatever you like), which is a new column. It puts the value on the right into this column. That way, after that operation, you have a new column in your dataframe which holds "Obese"/"Underweight" as appropriate.
As a side note, there are better ways to map values (e.g. pandas' map and others), but if you are a beginner, it's important to understand simple methods of manipulating data before diving into more complex ones. That's why I am avoiding explaining the more complex methods.
First of all, as mentioned in the comments, you should post text/code instead of screenshots.
You could do binning in pandas:
bmi_labels = ['Normal', 'Overweight', 'Obese']
cut_bins = [18.5, 24.9, 29.9, df["bmi"].max()]
df['bmi_label'] = pd.cut(df['bmi'], bins=cut_bins, labels=bmi_labels)
Here, I have made a separate column (bmi_label) to store the label, but you can do it in the same column (bmi) too.
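Note that with those edges, any bmi below 18.5 falls outside the bins and comes out as NaN. If you also want the Underweight label from the question, one option (a sketch, assuming bmi values are non-negative) is to add a lower edge:
import pandas as pd

bmi_labels = ['Underweight', 'Normal', 'Overweight', 'Obese']
# 0 serves as the lower edge so values under 18.5 are labeled too
cut_bins = [0, 18.5, 24.9, 29.9, df["bmi"].max()]
df['bmi_label'] = pd.cut(df['bmi'], bins=cut_bins, labels=bmi_labels)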

Given an index label, how would you extract the index position in a dataframe?

New to Python, trying to take a CSV and get the country that has the max number of gold medals. I can get the country name as a type Index, but I need a string value for the submission.
The CSV has rows of countries as the indices, and columns with stats.
ind = DataFrame.index.get_loc(index_result) doesn't work because it doesn't have a valid key.
If I run dataframe.loc[ind], it returns the entire row.
df = read_csv('csv', index_col=0, skiprows=1)
# for loop to get the most gold medals:
mostMedals = iterator
getIndex = df[df['medals'] == mostMedals].index  # check the medals column
# for the mostMedals cell to see what country won that many
ind = dataframe.index.get_loc(getIndex)  # doesn't like the key
What I'm going for is to get the index position of getIndex, so I can run something like dataframe.index[getIndex] and that will give me the string I need, but I can't figure out how to get that index position as an integer.
Expanding on my comments above, this is how I would approach it. There may be better/other ways; pandas is a pretty enormous library with lots of neat functionality that I don't know yet, either!
import pandas as pd

df = pd.read_csv('csv', index_col=0, skiprows=1)
max_medals = df['medals'].max()
countries = list(df.where(df['medals'] == max_medals).dropna().index)
Unpacking that expression: the where method returns a frame based on df that matches the expressed condition, dropna() removes any rows with NaN values, and index returns the remaining row index. Finally, I wrap that all in list, which isn't strictly necessary, but I prefer working with simple built-in types unless I have a greater need.
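If you only need a single winner, idxmax returns the index label directly, and Index.get_loc converts a label into the integer position the question asks about. A minimal sketch, assuming the same 'medals' column:
# label of the row with the most medals; since countries form the
# index, this is already the string you need
top_country = df['medals'].idxmax()

# integer position of that label within the index
pos = df.index.get_loc(top_country)
print(top_country, pos)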

How to maintain lexsort status when adding to a multi-indexed DataFrame?

Say I construct a dataframe with pandas, having multi-indexed columns:
import numpy as np
import pandas as pd

mi = pd.MultiIndex.from_product([['trial_1', 'trial_2', 'trial_3'], ['motor_neuron', 'afferent_neuron', 'interneuron'], ['time', 'voltage', 'calcium']])
ind = np.arange(1, 11)
df = pd.DataFrame(np.random.randn(10, 27), index=ind, columns=mi)
Link to image of output dataframe
Say I want only the voltage data from trial 1. I know that the following code fails, because the indices are not sorted lexically:
idx = pd.IndexSlice
df.loc[:,idx['trial_1',:,'voltage']]
As explained in another post, the solution is to sort the dataframe's indices (sortlevel in older pandas, sort_index today), which works as expected:
dfSorted = df.sort_index(axis=1)
dfSorted.loc[:, idx['trial_1', :, 'voltage']]
I understand why this is necessary. However, say I want to add a new column:
dfSorted.loc[:, ('trial_1', 'interneuron', 'scaledTime')] = 100 * dfSorted.loc[:, ('trial_1', 'interneuron', 'time')]
Now dfSorted is not sorted anymore, since the new column was tacked onto the end rather than slotted into order. Again, I have to call sort_index before selecting multiple columns.
I feel this makes for repetitive, bug-prone code, especially when adding lots of columns to the much bigger dataframe in my own project. Is there a (preferably clean-looking) way of inserting new columns in lexical order without having to call sort_index over and over again?
One approach would be to use filter which does a text filter on the column names:
In [117]: df['trial_1'].filter(like='voltage')
Out[117]:
   motor_neuron  afferent_neuron  interneuron
        voltage          voltage      voltage
1     -0.548699         0.986121    -1.339783
2     -1.320589        -0.509410    -0.529686
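filter sidesteps the lexsort requirement when selecting, but it doesn't keep new insertions in order. One way to avoid sprinkling sort calls everywhere is a small helper that assigns and re-sorts in one step (a sketch; insert_sorted is a hypothetical name, not a pandas API):
def insert_sorted(frame, key, values):
    # assign the new multi-indexed column, then hand back a
    # copy whose columns are re-sorted lexically
    frame.loc[:, key] = values
    return frame.sort_index(axis=1)

# usage: reassign so the sorted result is what you keep working with
dfSorted = insert_sorted(dfSorted, ('trial_1', 'interneuron', 'scaledTime'),
                         100 * dfSorted.loc[:, ('trial_1', 'interneuron', 'time')])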
