I'm trying to add each value of one column ('smoking') to the corresponding value of another column ('sex') and put the result in a new column called 'something'. The dataset is a DataFrame called 'data', and the values in 'smoking' and 'sex' are int64.
The column 'smoking' contains 1 or 0: 1 means the person smokes and 0 means they don't. The column 'sex' also contains 0 and 1: 0 for female and 1 for male.
for index, row in data.iterrows():
    data.loc[index, 'something'] = row['smoking'] + row['sex']
data
The problem is that the column 'something' contains only the number 2.0: even when 'smoking' is 0 and 'sex' is 1 in a row, the sum in 'something' is 2.0.
I don't understand the error.
I'm using python 3.9.2
The dataset is in this link of kaggle: https://www.kaggle.com/andrewmvd/heart-failure-clinical-data
I see #Vishnudev just posted the solution in a comment, but allow me to explain what is going wrong:
The issue here is that the addition somehow results in a float instead of an int. There are two solutions:
With the loop, casting the result to int:
for index, row in data.iterrows():
    data.loc[index, 'something'] = row['smoking'] + row['sex']
data['something'] = data['something'].astype(int)
data
Without the loop (as #Vishnudev suggested):
data['something'] = data['smoking'] + data['sex']
data
You need not iterate over the rows at all to do that; you could just use:
data['something'] = data['smoking'] + data['sex']
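For illustration, here is a minimal runnable sketch (using a few made-up rows instead of the Kaggle file) showing that the vectorized addition keeps the int64 dtype, so no cast is needed:

import pandas as pd

# Toy frame standing in for the Kaggle dataset (values are made up)
data = pd.DataFrame({'smoking': [1, 0, 0, 1], 'sex': [0, 1, 0, 1]})

# Vectorized addition: no row-by-row loop, no float cast needed
data['something'] = data['smoking'] + data['sex']
print(data['something'].dtype)  # int64
print(data)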
I have a pandas DataFrame where, for each set of rows with a duplicate patient ID, I want to grab the row with the latest CONDITION_STOP value.
Below is a screenshot of the dataframe:
Case 1:
For example, for Patient = '90008189-5c4f-f24d-ecc4-e41919d547e1' (rows 21-22), I expect to return back CONDITION_DESCRIPTION = 'COVID-19', because the CONDITION_STOP value in row 22 is greater (later) than the CONDITION_STOP value in row 21.
Case 2:
The situation is a little different when CONDITION_STOP is NaT for a patient.
For example, for Patient = '0a47e17e-70a1-c91b-9f50-b06804878e2b' (rows 28-29), I expect to return back CONDITION_DESCRIPTION = 'COVID-19' where CONDITION_STOP = 'NaT'.
Thus far, I have tried using drop_duplicates, which works in case 1 but not in case 2.
pd_covid_conditions1 = pd_covid_conditions.sort_values(by=['CONDITION_STOP'], ascending = False)\
.drop_duplicates(subset=['PATIENT'], keep = 'first')
The result I get is this (expected):
However, when I try to run it for the NaT scenario, the result is not as expected (I am expecting it to return CONDITION_DESCRIPTION = 'COVID-19'):
It doesn't appear that the drop_duplicates() function will work for my scenario. Can anyone advise what I should change?
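One possible fix, sketched under the assumption that a missing stop date (NaT) should be treated as the most recent record for a patient, is to tell sort_values to put NaT rows first when sorting descending, so keep='first' retains them:

pd_covid_conditions1 = (
    pd_covid_conditions
    .sort_values(by=['CONDITION_STOP'], ascending=False, na_position='first')
    .drop_duplicates(subset=['PATIENT'], keep='first')
)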
I have data as follows:
import pandas as pd
url_cities="https://population.un.org/wup/Download/Files/WUP2018-F12-Cities_Over_300K.xls"
df_cities = pd.read_excel(url_cities)
print(df_cities.iloc[0:20,])
The column names can be found in row 15, but I would like this row number to be determined automatically. I thought the best way would be to take the first row for which only a few of the values are NaN.
I combined this answer with this one to do the following:
amount_Nan = df_cities.shape[1] - df_cities.count(axis=1)
# OR df.isnull().sum(axis=1).tolist()
print(amount_Nan)
col_names_index = next(i for i in amount_Nan if i < 3)
print(col_names_index)
df_cities.columns = df_cities.iloc[col_names_index]
The problem is that col_names_index keeps returning 0, while it should be 15. I think it is because amount_Nan returns rows and columns, which makes next(i for i in amount_Nan if i < 3) behave differently than expected.
The thing is, I do not really understand why. Can anyone help?
IIUC, you can get the first index of a non-missing value in the second column with DataFrame.iloc together with Series.notna and Series.idxmax, set the column names from that row, and then filter out the rows before it by position:
i = df_cities.iloc[:, 1].notna().idxmax()
df_cities.columns = df_cities.iloc[i].tolist()
df_cities = df_cities.iloc[i+1:]
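For completeness, a sketch that stays closer to the original "count the NaNs per row" idea (assuming the frame still has the default RangeIndex from read_excel, so labels and positions coincide):

amount_nan = df_cities.isna().sum(axis=1)          # NaN count per row
col_names_index = (amount_nan < 3).idxmax()        # label of the first row with fewer than 3 NaNs
df_cities.columns = df_cities.iloc[col_names_index]
df_cities = df_cities.iloc[col_names_index + 1:]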
I'm trying to write a line of code that drops a row of a DataFrame if the p-value is lower than 1.3 in 3 out of 5 columns. If the p-value is greater than 1.3 in 3 out of 5 columns, I keep the row. The code looks like this:
for i in np.arange(pvalue.shape[0]):
    if (pvalue.iloc[i, 1:] < 1.3).count() > 2:
        pvalue.drop(index=pvalue.index[i], axis=0, inplace=True)
    else:
        None
The pvalue DataFrame has 6 columns: the first column is a string and the next 5 are p-values from an experiment. I get this error:
IndexError: single positional indexer is out-of-bounds
and I don't know how to fix it. I appreciate every bit of help. BTW, I'm a complete Python beginner, so be patient with me! :) Thanks, and looking forward to your solutions!
I am not very knowledgeable with pandas, so there is probably a better way to go about it, but this should work:
By using iterrows(), you can iterate over each row of a DataFrame.
for idx, row in pvalue.iterrows():
In the loop you will have access to the idx variable, which is the index of the row you're currently iterating on, and to the row values themselves in the row variable.
Then for every row, you can iterate through each column value with a simple for loop.
for val in row[1:]:
while making sure you start with the 2nd value (or in other words, by ignoring the index 0 and starting with index 1).
The rest is pretty straightforward.
threshold = 1.3

for idx, row in pvalue.iterrows():
    count = 0
    for val in row[1:]:
        if val < threshold:
            count += 1
    if count > 2:
        pvalue.drop(idx, inplace=True)
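As a side note, the same filtering can be done without any explicit loop. A vectorized sketch (not part of the original answer) that counts the below-threshold columns per row and keeps the rest:

threshold = 1.3
below = (pvalue.iloc[:, 1:] < threshold).sum(axis=1)   # how many of the 5 p-value columns are below the threshold
pvalue = pvalue[below <= 2]                            # keep rows with at most 2 such columns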
I have some DataFrames with information about some elements, for instance:
my_df1=pd.DataFrame([[1,12],[1,15],[1,3],[1,6],[2,8],[2,1],[2,17]],columns=['Group','Value'])
my_df2=pd.DataFrame([[1,5],[1,7],[1,23],[2,6],[2,4]],columns=['Group','Value'])
I have used something like dfGroups = df.groupby('group').apply(my_agg).reset_index(), so now I have DataFrames with information on groups of the previous elements, say:
my_df1_Group=pd.DataFrame([[1,57],[2,63]],columns=['Group','Group_Value'])
my_df2_Group=pd.DataFrame([[1,38],[2,49]],columns=['Group','Group_Value'])
Now I want to clean my groups according to properties of their elements. Let's say that I want to discard groups containing an element with Value greater than 16. So in my_df1_Group, there should only be the first group left, while both groups qualify to stay in my_df2_Group.
As I don't know how to get my_df1_Group and my_df2_Group from my_df1 and my_df2 in Python (I know other languages where it would simply be name + "_Group" with name looping over [my_df1, my_df2], but how do you do that in Python?), I build a list of lists:
SampleList = [[my_df1,my_df1_Group],[my_df2,my_df2_Group]]
Then, I simply try this:
my_max = 16
Bad = []
for Sample in SampleList:
    for n in Sample[1]['Group']:
        # This is inelegant, but trying to work with Sample[1] in the for doesn't work
        df = Sample[0].loc[Sample[0]['Group'] == n]
        if df['Value'].max() > my_max:
            Bad.append(1)
        else:
            Bad.append(0)
    Sample[1] = Sample[1].assign(Bad_Row=pd.Series(Bad))
    Sample[1] = Sample[1].query('Bad_Row == 0')
This runs without errors, but doesn't work. In particular, it doesn't add the column Bad_Row to my df, nor does it modify my DataFrame (yet the query runs smoothly even though the Bad_Row column doesn't seem to exist...). On the other hand, if I run this technique manually on a df (i.e. not in a loop), it works.
How should I do this?
Based on your comment below, I think you want to check whether a Group in your aggregated data frame has a Value in the input data greater than 16. One solution is to perform a row-wise calculation using a criterion on the input data. To accomplish this, my_func accepts a row from the aggregated data frame and the input data as a pandas groupby object. For each group in your grouped data frame, it subsets your initial data and uses boolean logic to see if any of the 'Values' in your input data meet your specified criterion.
def my_func(row, grouped_df1):
    if (grouped_df1.get_group(row['Group'])['Value'] > 16).any():
        return 'Bad Row'
    else:
        return 'Good Row'

my_df1 = pd.DataFrame([[1,12],[1,15],[1,3],[1,6],[2,8],[2,1],[2,17]], columns=['Group','Value'])
my_df1_Group = pd.DataFrame([[1,57],[2,63]], columns=['Group','Group_Value'])

grouped_df1 = my_df1.groupby('Group')
my_df1_Group['Bad_Row'] = my_df1_Group.apply(lambda x: my_func(x, grouped_df1), axis=1)
Returns:
   Group  Group_Value   Bad_Row
0      1           57  Good Row
1      2           63   Bad Row
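To then discard the flagged groups, one could simply filter on the new column, for example:

my_df1_Group = my_df1_Group[my_df1_Group['Bad_Row'] == 'Good Row']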
Based on dubbbdan's idea, here is code that works:
my_max = 16

def my_func(row, grouped_df1):
    if (grouped_df1.get_group(row['Group'])['Value'] > my_max).any():
        return 1
    else:
        return 0

SampleList = [[my_df1, my_df1_Group], [my_df2, my_df2_Group]]
for Sample in SampleList:
    grouped_df = Sample[0].groupby('Group')
    Sample[1]['Bad_Row'] = Sample[1].apply(lambda x: my_func(x, grouped_df), axis=1)
    Sample[1].drop(Sample[1][Sample[1]['Bad_Row'] != 0].index, inplace=True)
    Sample[1].drop(['Bad_Row'], axis=1, inplace=True)
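An alternative sketch (my own shorthand, not from the answers above): compute each group's maximum element value once with groupby, then drop the non-qualifying groups via isin. It keeps the in-place drop so the changes are visible outside the list, as in the code above:

my_max = 16
for Sample in SampleList:
    # Maximum element Value per group in the raw data
    group_max = Sample[0].groupby('Group')['Value'].max()
    good_groups = group_max[group_max <= my_max].index
    # Drop aggregated rows whose group contains an element above my_max
    Sample[1].drop(Sample[1][~Sample[1]['Group'].isin(good_groups)].index, inplace=True)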
I have written some code to essentially do an Excel-style VLOOKUP on two pandas DataFrames and want to speed it up.
The structure of the data frames is as follows:
dbase1_df.columns:
'VALUE', 'COUNT', 'GRID', 'SGO10GEO'
merged_df.columns:
'GRID', 'ST0', 'ST1', 'ST2', 'ST3', 'ST4', 'ST5', 'ST6', 'ST7', 'ST8', 'ST9', 'ST10'
sgo_df.columns:
'mkey', 'type'
To combine them, I do the following:
1. For each row in dbase1_df, find the row in sgo_df where 'mkey' matches that row's 'SGO10GEO' value, and obtain the 'type' from that sgo_df row.
2. 'type' contains an integer ranging from 0 to 10. Create a column name by appending 'ST' to type.
3. Find the value in merged_df where its 'GRID' value matches the 'GRID' value in dbase1_df and the column name is the one we obtained in step 2. Output this value into a CSV file.
# Read in dbase1 dbf into data frame
dbase1_df = pandas.DataFrame.from_csv(dbase1_file, index_col=False)
merged_df = pandas.DataFrame.from_csv('merged.csv', index_col=False)

lup_out.writerow(["VALUE", "TYPE", EXTRACT_VAR.upper()])

# For each unique value in dbase1 data frame:
for index, row in dbase1_df.iterrows():
    # 1. Find the soil type corresponding to the mukey
    tmp = sgo_df.type.values[sgo_df['mkey'] == int(row['SGO10GEO'])]
    if tmp.size > 0:
        s_type = 'ST' + tmp[0]
        val = int(row['VALUE'])
        # 2. Obtain hmu value
        tmp_val = merged_df[s_type].values[merged_df['GRID'] == int(row['GRID'])]
        if tmp_val.size > 0:
            hmu_val = tmp_val[0]
            # 3. Output into data frame: VALUE, hmu value
            lup_out.writerow([val, s_type, hmu_val])
    else:
        err_out.writerow([merged_df['GRID'], type, row['GRID']])
Is there anything here that might be a speed bottleneck? Currently it takes around 20 minutes for ~500,000 rows in dbase1_df, ~1,000 rows in merged_df and ~500,000 rows in sgo_df.
Thanks!
You need to use the merge operation in pandas to get better performance. I'm not able to test the code below since I don't have the data, but at a minimum it should help you get the idea:
import pandas as pd
dbase1_df = pd.DataFrame.from_csv('dbase1_file.csv',index_col=False)
sgo_df = pd.DataFrame.from_csv('sgo_df.csv',index_col=False)
merged_df = pd.DataFrame.from_csv('merged_df.csv',index_col=False)
#You need to use the same column names for common columns to be able to do the merge operation in pandas, so we change the column name to mkey
dbase1_df.columns = [u'VALUE', u'COUNT', u'GRID', u'mkey']
#Below operation merges the two dataframes
Step1_Merge = pd.merge(dbase1_df,sgo_df)
#We need to add a new column to concatenate ST and type
Step1_Merge['type_2'] = Step1_Merge['type'].map(lambda x: 'ST'+str(x))
# We need to change the shape of merged_df and move columns to rows to be able to do another merge operation
id = merged_df.ix[:,['GRID']]
a = pd.merge(merged_df.stack(0).reset_index(1), id, left_index=True, right_index=True)
# We also need to change the automatically generated name to type_2 to be able to do the next merge operation
a.columns = [u'type_2', 0, u'GRID']
result = pd.merge(Step1_Merge,a,on=[u'type_2',u'GRID'])
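For reference, a compact sketch of the same idea against current pandas (read_csv instead of the older from_csv, and melt to reshape merged_df from wide to long; the file names are placeholders):

import pandas as pd

dbase1_df = pd.read_csv('dbase1_file.csv')   # placeholder file names
sgo_df = pd.read_csv('sgo_df.csv')
merged_df = pd.read_csv('merged_df.csv')

# Join dbase1 to sgo on SGO10GEO == mkey and build the 'ST<type>' column name
step1 = dbase1_df.merge(sgo_df, left_on='SGO10GEO', right_on='mkey')
step1['type_2'] = 'ST' + step1['type'].astype(str)

# Reshape merged_df so each (GRID, ST column) pair becomes a row, then join on both keys
long_df = merged_df.melt(id_vars='GRID', var_name='type_2', value_name='hmu_val')
result = step1.merge(long_df, on=['GRID', 'type_2'])

result[['VALUE', 'type_2', 'hmu_val']].to_csv('lookup_out.csv', index=False)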