I don't get why python won't update my dataframe object:
The code snippet is this:
for index, row in df.iterrows():
    t = df.loc[index, :"score"]
    b = [float(i) for i in t if i != 's']
    m = sum(b) / len(b)
    df.at[index, "score"] = m
    print(df.at[index, "score"])  # Does not print m; it prints 0, the default value
This snippet should collect all the values in a row, compute their average, and write that average back into the dataframe.
Iterating over rows in a DataFrame is very seldom the way to go.
Instead, use
df.loc[:, :'score'].mean(axis='columns')
which is more readable and much faster.
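For example, a minimal sketch of the full replacement, writing the row means back in one vectorised step (this assumes all the sliced columns hold numeric values):

# Note: like the original slice, :"score" includes the "score" column itself.
df["score"] = df.loc[:, :"score"].mean(axis="columns")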
To answer your question directly (why your way doesn't work), we would need more information (see comments).
I have a dataframe (very simplified version below):
d = {'col1': [1, '', 2], 'col2': ['', '', 3], 'col3': [4, 5, 6]}
df = pd.DataFrame(data=d)
I need to loop through the dataframe and check how many columns are populated per row. If a row has just one column populated, I can continue on to the next row. If, however, a row has more than one non-NaN value, I need to make all the columns NaN apart from one, based on some hierarchy.
For example, let's say the hierarchy is:
col1 is the most important
col2 second etc.
Therefore, if there were two or more columns with data and one of them happened to be col1, I would drop all the other column values; otherwise I would check whether col2 has a value, and so on, then repeat for the next row.
I have something like this as an idea:
nrows = df.shape[0]
for index in range(0, nrows):
    print(index)
    # check if the row has only one column populated
    if (df.iloc[[index]].notna().sum() == 1):
        continue
    # check if more than one column is populated for that row
    elif (df.iloc[[index]].notna().sum() >= 1):
        if (index['col1'].notna() == True):
            df.loc[:, df.columns != 'col1'] == 'NaN'
            # continue down the hierarchy
but this is not correct, as it returns True/False for every column and I cannot read the result the way I need.
Any suggestions are very welcome! I was thinking of creating some sort of key, but I feel there may be a simpler way to get there with the code I already have?
Edit:
Another important point which I should have included is that my index is not integers - it consists of unique identifiers that look something like '123XYZ', which is why I used range(0, n) and reshaped the df.
For the example dataframe you gave, I don't think it would change after applying this algorithm, so I didn't test it thoroughly, but something like this should work:
import numpy as np

hierarchy = ['col1', 'col2', 'col3']

# rows with more than one populated (non-NaN) column need fixing
counts = df.notna().sum(axis=1)
inds = counts[counts >= 2].index

for i in inds:
    for col in hierarchy:
        if not pd.isna(df.loc[i, col]):
            tmp = df.loc[i, col]    # keep the highest-priority value
            df.loc[i, :] = np.nan   # blank the whole row
            df.loc[i, col] = tmp    # restore the kept value
            break                   # stop at the first populated column
Note that I'm assuming you actually mean NaN and not the empty string as in your example. If you want to look for empty strings instead, then inds and the if statement above would change.
I also think this should be faster than what you have, since it only loops through the rows with more than one populated value.
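If you do want to treat the empty strings from your example as missing, one hedged option is to normalise them to NaN first, so the NaN-based code above applies unchanged:

# Assumption: empty strings mean "missing".
df = df.replace('', np.nan)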
I am coming from an R background and used to being able to retrieve the value from a dataframe by using syntax like:
r_dataframe$some_column_name[row_number]
And I can assign a value to the dataframe by the following syntax:
r_dataframe$some_column_name[row_number] <- some_value
or without the arrow:
r_dataframe$some_column_name[row_number] = some_value
For example:
#create R dataframe data
employee <- c('John Doe','Peter Gynn','Jolie Hope')
salary <- c(21000, 23400, 26800)
startdate <- as.Date(c('2010-11-1','2008-3-25','2007-3-14'))
employ.data <- data.frame(employee, salary, startdate)
#print out the name of this employee
employ.data$employee[2]
#assign the name
employ.data$employee[2] <- 'Some other name'
I'm now learning some Python, and from what I can see the most straightforward way to retrieve a value from a pandas dataframe is:
pandas_dataframe['SomeColumnName'][row_number]
I can see the similarities to R.
However, what confuses me is that when it comes to modifying/assigning the value in the pandas dataframe I need to completely change the syntax to something like:
pandas_dataframe.at[row_number, 'SomeColumnName'] = some_value
Reading this code will require a lot more concentration because the column name and row number have swapped order.
Is this the only way to perform this pair of operations? Is there a more logical way to do this that respects the consistent use of column name and row number order?
If I understand what you mean correctly, as @sammywemmy mentioned, you can use .loc and .iloc to get/change a value in any row and column.
If the order of your dataframe rows can change, you should define an explicit index, so that you can still retrieve every row (datapoint) by its label even after the order has changed.
Like below:
df = pd.DataFrame(index=['a', 'b', 'c'], columns=['time', 'date', 'name'])
Now you can get the first row by its index:
df.loc['a'] # equivalent to df.iloc[0]
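As a small sketch of the symmetry (the labels come from the frame above), .loc takes the same [row, column] order whether you read or write:

value = df.loc['a', 'name']        # retrieve
df.loc['a', 'name'] = 'Some name'  # assign, same [row, column] order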
It turns out that pandas_dataframe.at[row_number, 'SomeColumnName'] can be used to modify AND retrieve information.
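A two-line sketch of that, assuming a default integer index (the labels here are illustrative):

current = df.at[0, 'SomeColumnName']    # retrieve
df.at[0, 'SomeColumnName'] = 'new val'  # modify, identical syntax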
I was wondering if there is an easy way to get only the first row of each grouped object (subject id for example) in a dataframe. Doing this:
for index, row in df.iterrows():
    # do stuff
gives us each one of the rows, but I am interested in doing something like this:
groups = df.groupby('Subject id')
for index, row in groups.iterrows():
    # give me the first row of each group
    continue
Is there a pythonic way to do the above?
Direct solution - without .groupby() - by .drop_duplicates()
What you want is to keep only the rows with the first occurrence of each value in a specific column:
df.drop_duplicates(subset='Subject id', keep='first')
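A quick usage sketch on a small made-up frame:

df = pd.DataFrame({'Subject id': [1, 1, 2, 2], 'val': [20, 32, 12, 34]})
print(df.drop_duplicates(subset='Subject id', keep='first'))
#    Subject id  val
# 0           1   20
# 2           2   12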
General solution
Using the .apply(func) in Pandas:
df.groupby('Subject id').apply(lambda df: df.iloc[0, :])
It applies a function (often generated on the fly with lambda) to each sub-dataframe produced by df.groupby() and aggregates the results into a single final dataframe.
However, the solution by @AkshayNevrekar using .first() is really nice. And as he did there, you could also attach a .reset_index() at the end.
Let's say this is the more general solution - you could also take any nth row this way - however, it works only if all sub-dataframes have at least n rows.
Otherwise, use:
n = 3
col = 'Subject id'
res_df = pd.DataFrame()
for name, group in df.groupby(col):
    if n < group.shape[0]:
        res_df = res_df.append(group.reset_index().iloc[n, :])
Or as a function:
def group_by_select_nth_row(df, col, n):
    res_df = pd.DataFrame()
    for name, group in df.groupby(col):
        if n < group.shape[0]:
            res_df = res_df.append(group.reset_index().iloc[n, :])
    return res_df
Quite confusing is that df.append(), in contrast to list.append(), returns a new dataframe with the row appended and leaves the original df unchanged.
Therefore you should always reassign the result if you want 'in place' appending like you are used to from list.append().
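Note also that DataFrame.append was removed in pandas 2.0. A hedged rewrite of the loop above with pd.concat (same col and n as before) could look like:

rows = [g.reset_index().iloc[[n]] for _, g in df.groupby(col) if n < g.shape[0]]
res_df = pd.concat(rows, ignore_index=True) if rows else pd.DataFrame()

For the common cases, pandas also has a built-in for this: df.groupby(col).nth(n).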
Use first() to get the first row of each group.
df = pd.DataFrame({'subject_id': [1,1,2,2,2,3,4,4], 'val':[20,32,12,34,45,43,23,10]})
# print(df.groupby('subject_id').first().reset_index())
print(df.groupby('subject_id', as_index=False).first())
Output:
   subject_id  val
0           1   20
1           2   12
2           3   43
3           4   23
Could I ask how to retrieve an index of a row in a DataFrame?
Specifically, I am able to retrieve the index of rows from df.loc.
idx = data.loc[data.name == "Smith"].index
I can even retrieve row index from df.loc by using data.index like this:
idx = data.loc[data.index == 5].index
However, I cannot retrieve the index directly from the row itself (i.e., from row.index, instead of df.loc[].index). I tried this code:
idx = data.iloc[5].index
The result of this code is the column names.
To provide context, the reason I need to retrieve the index of a specific row (instead of rows from df.loc) is to use df.apply for each row.
I plan to use df.apply to apply a function to each row that copies data from the row immediately above it.
def retrieve_gender(row):
    # This is panel data; only the data for year 2000 is already keyed in.
    # Time-invariant data in later years are the same as in 2000.
    if row["Year"] == 2000:
        pass
    elif row["Year"] == 2001:  # To avoid complexity, let's use only year 2001 as an example.
        idx = row.index  # This is the wrong code.
        row["Gender"] = row.iloc[idx - 1]["Gender"]
    return row["Gender"]

data["Gender"] = data.apply(retrieve_gender, axis=1)
With Pandas you can loop through your dataframe like this (assuming a default integer index):

for index in range(len(df)):
    if df.loc[index, 'Year'] == 2001:
        df.loc[index, 'Gender'] = df.loc[index - 1, 'Gender']
apply gives series indexed by column labels
The problem with idx = data.iloc[5].index is that data.iloc[5] converts the row to a pd.Series object indexed by column labels (the row's own label is only available as the series' .name attribute).
More to the point, what you are asking for is not achievable via pd.DataFrame.apply, because the series that feeds your retrieve_gender function gives it no way to reach the neighbouring rows.
Use vectorised logic instead
With Pandas, row-wise logic is inefficient and not recommended; it involves a Python-level loop. Use column-wise logic instead. Taking a step back, it seems you wish to implement 2 rules:
If Year is not 2001, leave Gender unchanged.
If Year is 2001, use Gender from previous row.
np.where + shift
For the above logic, you can use np.where with pd.Series.shift:
import numpy as np

data['Gender'] = np.where(data['Year'] == 2001, data['Gender'].shift(), data['Gender'])
mask + shift
Alternatively, you can use mask + shift:
data['Gender'] = data['Gender'].mask(data['Year'] == 2001, data['Gender'].shift())
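A small self-contained sketch (the data here is made up) showing the shift-based rule in action:

import numpy as np
import pandas as pd

# Hypothetical panel data: Gender is only keyed in for year 2000.
data = pd.DataFrame({'Year': [2000, 2001, 2000, 2001],
                     'Gender': ['F', None, 'M', None]})
data['Gender'] = np.where(data['Year'] == 2001,
                          data['Gender'].shift(), data['Gender'])
print(data['Gender'].tolist())  # ['F', 'F', 'M', 'M']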
I have a dataframe (df) and want to print the unique values from each column in the dataframe.
I need to substitute the variable (the column name) into the print statement.
column_list = df.columns.values.tolist()
for column_name in column_list:
    print(df."[column_name]".unique()
Update
When I use this, I get "Unexpected EOF Parsing" with no extra details:
column_list = sorted_data.columns.values.tolist()
for column_name in column_list:
    print(sorted_data[column_name].unique()
What is the difference between your syntax, YS-L (above), and the one below:
for column_name in sorted_data:
    print(column_name)
    s = sorted_data[column_name].unique()
    for i in s:
        print(str(i))
It can be written more concisely like this:
for col in df:
    print(df[col].unique())
Generally, you can access a column of the DataFrame through indexing using the [] operator (e.g. df['col']), or through attribute (e.g. df.col).
Attribute access makes the code a bit more concise when the target column name is known beforehand, but it has several caveats -- for example, it does not work when the column name is not a valid Python identifier (e.g. df.123), or when it clashes with a built-in DataFrame attribute (e.g. df.index). On the other hand, the [] notation should always work.
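A short sketch of those caveats, with column names chosen to trigger them:

df = pd.DataFrame({'col': [1], '123': [2], 'index': [3]})
df['col']    # works, and so does df.col
df['123']    # works; df.123 is a SyntaxError
df['index']  # the column; df.index is the DataFrame's row-label attribute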
The most upvoted answer is a loop solution, hence adding a one-line solution using the pandas apply() method and a lambda function.
print(df.apply(lambda col: col.unique()))
This will get the unique values in proper format:
pd.Series({col: df[col].unique() for col in df})
If you're trying to create multiple separate dataframes as mentioned in your comments, create a dictionary of dataframes:
df_dict = {col: pd.DataFrame(df[col].unique(), columns=[col]) for col in df.columns}
Then you can access any dataframe easily using the name of the column:
df_dict['some_column_name']
We can make this even more concise:
df.describe(include='all').loc['unique', :]
Pandas describe gives a few key statistics about each column, but we can just grab the 'unique' statistic and leave it at that.
Note that this will give a unique count of NaN for numeric columns - if you want to include those columns as well, you can do something like this:
df.astype('object').describe(include='all').loc['unique', :]
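Relatedly, if you only need how many distinct values each column has, rather than the values themselves, df.nunique() reports exactly that:

# Count of distinct values per column (NaN excluded by default).
print(df.nunique())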
I was seeking a solution to this problem as well, and the code below proved to be most helpful in my situation:
for col in df:
    print(col)
    print(df[col].unique())
    print('\n')
It gives something like below:
Fuel_Type
['Diesel' 'Petrol' 'CNG']
HP
[ 90 192 69 110 97 71 116 98 86 72 107 73]
Met_Color
[1 0]
The code below will give you a list of unique values for each field; I find it very useful when you want to take a deeper look at the data frame:
for col in list(df):
    print(col)
    print(df[col].unique())
You can also sort the unique values if you want them to be sorted:
import numpy as np

for col in list(df):
    print(col)
    print(np.sort(df[col].unique()))
cu = []  # unique values per column
i = []   # column names
for cn in card.columns[:7]:
    cu.append(card[cn].unique())
    i.append(cn)

# one column of unique values per original column
pd.DataFrame(cu, index=i).T
Simply do this:

for i in df.columns:
    print(df[i].unique())

Or, to print the unique values of a single column:

for val in df['column_name'].unique():
    print(val)
Even better, here's code to view all the unique values as a dataframe, transposed column-wise:
columns = [*df.columns]
unique_values = {}
for i in columns:
    unique_values[i] = df[i].unique()

unique = pd.DataFrame({k: pd.Series(v) for k, v in unique_values.items()})
unique.fillna('').T
This solution constructs a dataframe of unique values with some stats and gracefully handles any unhashable column types.
Resulting dataframe columns are: col, unique_len, df_len, perc_unique, unique_values
df_len = len(df)
unique_cols_list = []

for col in df:
    try:
        unique_values = df[col].unique()
        unique_len = len(unique_values)
    except TypeError:  # not all cols are hashable
        unique_values = ""
        unique_len = -1
    perc_unique = unique_len * 100 / df_len
    unique_cols_list.append((col, unique_len, df_len, perc_unique, unique_values))

df_unique_cols = pd.DataFrame(unique_cols_list, columns=["col", "unique_len", "df_len", "perc_unique", "unique_values"])
df_unique_cols = df_unique_cols[df_unique_cols["unique_len"] > 0].sort_values("unique_len", ascending=False)
print(df_unique_cols)
The best way to do that:
Series.unique()
For example, students.age.unique() will output the different values that occur in the age column of the students data frame.
To get only the number of distinct values:
Series.nunique()
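A tiny usage sketch (the students frame is made up):

students = pd.DataFrame({'age': [18, 19, 18, 21]})
print(students.age.unique())   # [18 19 21]
print(students.age.nunique())  # 3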