Looping through rows based on column values - python

I'm trying to loop through a report and eliminate/hide/replace cell values if they are repeated in the row above. This applies to certain columns in the row but not the entire row, since each row contains at least one piece of data that is unique to it. I know I am close but I'm missing the mark and am looking for a nudge in the right direction. The goal is to eliminate redundant information and increase the legibility of the final report. Essentially what I am trying to do is:
for cell in row:
    if column["column_name"] == (line above):
        cell.value = " "
Because each row has a unique piece of data, drop_duplicates() does not work.
Once I can clear the intended column in each row where applicable, I will expand the process to loop through and apply it to other columns where the initial is blanked out. I should be able to work that out once the first domino falls. Any advice is appreciated.
I've tried
np.where(cell) = [iloc-1]
and
masking based on the same parameter.
I get errors saying that 'row' and 'iloc' are undefined, or "None of [Index (all content)] are in the [index]".

You can use shift() to compare the row elements. If I understand your issue, the example code below shows an approach you can use (it replaces duplicated numbers with 0):
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 2, 4, 5],
                   'B': ['a', 'b', 'c', 'd', 'e']})

df['A'] = df['A'].where(df['A'] != df.shift(-1)['A'], 0)
print(df)
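Since the question's goal is to blank a value when it repeats the one in the row above (keeping the first occurrence), here is a minimal sketch of that variant using shift(1); the column name 'region' and the data are made up for illustration:

import pandas as pd

df = pd.DataFrame({'region': ['East', 'East', 'West', 'West', 'West'],
                   'unique_id': [1, 2, 3, 4, 5]})

# shift(1) looks at the row above; where() keeps the value only when it
# differs from the previous row and otherwise substitutes a blank string.
df['region'] = df['region'].where(df['region'] != df['region'].shift(1), ' ')
print(df)

The same pattern can then be repeated column by column once the first one works.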

Related

Check for populated columns and delete values based on hierarchy?

I have a dataframe (very simplified version below):
d = {'col1': [1, '', 2], 'col2': ['', '', 3], 'col3': [4, 5, 6]}
df = pd.DataFrame(data=d)
I need to loop through the dataframe and check how many columns are populated per row. If the row has just one column populated, then I can continue on to the next row. If, however, the row has more than one non-NaN value, I need to make all the columns NaN apart from one, based on some hierarchy.
For example, let's say the hierarchy is:
col1 is the most important
col2 second etc.
Therefore, if there were two or more columns with data and one of them happened to be column 1, I would drop all other column values, otherwise I would defer to check if col2 has a value etc and then repeat for the next row.
I have something like this as an idea:
nrows = df.shape[0]
for index in range(0, nrows):
    print(index)
    # check if the row has only one column populated
    if (df.iloc[[index]].notna().sum() == 1):
        continue
    # check if more than one column is populated for that row
    elif (df.iloc[[index]].notna().sum() >= 1):
        if (index['col1'].notna() == True):
            df.loc[:, df.columns != 'col1'] == 'NaN'
            # continue down the hierarchy
but this is not correct, as it gives True/False for every column and I cannot read it the way I need.
Any suggestions are very welcome! I was thinking of creating some sort of key, but I feel there may be a simpler way to get there with the code I already have.
Edit:
Another important point which I should have included is that my index is not integers - it is unique identifiers which look something like this: '123XYZ', which is why I used range(0,n) and reshaped the df.
For the example dataframe you gave, I don't think it would change after applying this algorithm, so I didn't test it thoroughly, but something like this should work:
import numpy as np
import pandas as pd

hierarchy = ['col1', 'col2', 'col3']

# rows with more than one populated (non-NaN) column
inds = df.notna().sum(axis=1)
inds = inds[inds >= 2].index

for i in inds:
    for col in hierarchy:
        if not pd.isna(df.loc[i, col]):
            # keep the highest-priority populated column and blank the rest
            tmp = df.loc[i, col]
            df.loc[i, :] = np.nan
            df.loc[i, col] = tmp
            break
Note that I'm assuming you actually mean NaN and not the empty string like you have in your example. If you want to look for empty strings, then inds and the if statement above would change.
I also think this should be faster than what you have above, since it only loops through the rows that have more than one populated value.
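For completeness, a fully vectorized sketch of the same idea, assuming missing values are NaN and that the column order col1, col2, col3 already reflects the hierarchy (the example data and index labels are made up):

import numpy as np
import pandas as pd

d = {'col1': [1, np.nan, 2], 'col2': [np.nan, np.nan, 3], 'col3': [4, 5, 6]}
df = pd.DataFrame(d, index=['123XYZ', '456ABC', '789DEF'])

# Mark, per row, only the first populated column in hierarchy order;
# where() then replaces everything else with NaN.
keep = df.notna() & df.notna().cumsum(axis=1).eq(1)
cleaned = df.where(keep)
print(cleaned)

This avoids looping over rows entirely, which matters once the frame gets large.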

Updating element of dataframe while referencing column name and row number

I am coming from an R background and used to being able to retrieve the value from a dataframe by using syntax like:
r_dataframe$some_column_name[row_number]
And I can assign a value to the dataframe by the following syntax:
r_dataframe$some_column_name[row_number] <- some_value
or without the arrow:
r_dataframe$some_column_name[row_number] = some_value
For example:
#create R dataframe data
employee <- c('John Doe','Peter Gynn','Jolie Hope')
salary <- c(21000, 23400, 26800)
startdate <- as.Date(c('2010-11-1','2008-3-25','2007-3-14'))
employ.data <- data.frame(employee, salary, startdate)
#print out the name of this employee
employ.data$employee[2]
#assign the name
employ.data$employee[2] <- 'Some other name'
I'm now learning some Python, and from what I can see the most straightforward way to retrieve a value from a pandas dataframe is:
pandas_dataframe['SomeColumnName'][row_number]
I can see the similarities to R.
However, what confuses me is that when it comes to modifying/assigning the value in the pandas dataframe I need to completely change the syntax to something like:
pandas_dataframe.at[row_number, 'SomeColumnName'] = some_value
Reading this code is going to require a lot more concentration because the column name and row number have swapped order.
Is this the only way to perform this pair of operations? Is there a more logical way to do this that respects the consistent use of column name and row number order?
If I understand what you mean correctly, as @sammywemmy mentioned you can use .loc and .iloc to get/change the value in any row and column.
If the order of your dataframe rows can change, you should define an index so you can get every row (data point) by its label, even after the order has changed.
Like below:
df = pd.DataFrame(index=['a', 'b', 'c'], columns=['time', 'date', 'name'])
Now you can get the first row by its index:
df.loc['a'] # equivalent to df.iloc[0]
It turns out that pandas_dataframe.at[row_number, 'SomeColumnName'] can be used to modify AND retrieve information.
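As a short sketch of that consistency (the employee data below just recreates the R example in pandas):

import pandas as pd

employ_data = pd.DataFrame({'employee': ['John Doe', 'Peter Gynn', 'Jolie Hope'],
                            'salary': [21000, 23400, 26800]})

# .at takes [row_label, column_name] for both reading and writing,
# so the ordering is the same for the pair of operations.
print(employ_data.at[1, 'employee'])                # retrieve
employ_data.at[1, 'employee'] = 'Some other name'   # modify
print(employ_data.at[1, 'employee'])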

Remove index from MultiIndex dataframe if child index has column value meeting criteria

I had originally asked this question here, and I believe it was incorrectly marked as a duplicate. I will do my best here to clarify my question and how I believe it is unique.
Given the following example MultiIndex dataframe:
import pandas as pd
import numpy as np
first = ['A', 'B', 'C']
second = ['a', 'b', 'c', 'd']
third = ['1', '2', '3']
indices = [first, second, third]
index = pd.MultiIndex.from_product(indices, names=['first', 'second', 'third'])
df = pd.DataFrame(np.random.randint(10, size=(len(first)*len(second)*len(third), 4)), index=index, columns=['Val1', 'Val2', 'Val3', 'Val4'])
Goal: I would like to retain a specific level=1 index (such as 'a') if the value of column 'Val2' corresponding to index value '1' in level=2 is greater than 5 for that level=1 index. If this criterion is not met (i.e. column 'Val2' is less than or equal to 5 for index '1' in level=2), then the corresponding level=1 index should be removed from the dataframe. If no level=1 index meets the criterion for a given level=0 index, then that level=0 index should also be removed. My previous post contains my expected output (I can add it here, but I wanted this post to be as succinct as possible for clarity).
Here is my current solution, the performance of which I'm sure can be improved:
grouped = df.groupby(level=0)
output = pd.concat([grouped.get_group(key)
                           .groupby(level=1)
                           .filter(lambda x: (x.loc[pd.IndexSlice[:, :, '1'], 'Val2'] > 5).any())
                    for key, group in grouped])
This does produce my desired output, but for a dataframe with 100,000's of rows, the performance is rather poor. Is there something obvious I am missing here to better utilize the under-the-hood optimization of pandas?
I got the same result as your example solution by doing the following:
df.loc[df.xs('1', level=2)['Val2'] > 5]
Comparing time performance, this is ~15x faster (on my machine your example takes 36 ms while this takes 2 ms).
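If it helps, the same selection can be spelled out step by step; this is just a sketch of what the one-liner is doing, using an explicit membership test on the (first, second) pairs:

# Boolean per (first, second) pair: is 'Val2' greater than 5 where third == '1'?
keep = df.xs('1', level='third')['Val2'] > 5
keep_pairs = keep[keep].index

# Retain only rows whose (first, second) labels are among the kept pairs;
# level-0 groups with no surviving level-1 index disappear automatically.
filtered = df[df.index.droplevel('third').isin(keep_pairs)]
print(filtered)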

Python Pandas Series - order output via Slice

I'm trying to produce an 'ordered' output from a Pandas Series using slice notation. The example below seems to be in error.
x = pd.Series([2,4,6], index=['a', 'b', 'c'])
print(x[2,1]) #Fail
To expand a little bit on the comments, pandas uses the [...] to figure out the row, column coordinates it's looking up. So if you want to pick out two items (the item at row 2 and the item at row 1), you need to tell pandas that it's looking for a list of rows. Hence:
# x[rows, columns]
x[[2,1]]
will return what you want.
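A slightly more explicit spelling, as a sketch, is to use positional indexing, which avoids any ambiguity with the string index labels:

import pandas as pd

x = pd.Series([2, 4, 6], index=['a', 'b', 'c'])

# .iloc takes a list of integer positions and returns them in that order
print(x.iloc[[2, 1]])
# c    6
# b    4
# dtype: int64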

Adding a column to a python pandas data frame based on the value of another column

I have some pandas data frame, and I would like to add a column that is the difference of a column, based on the value of a third column. Here is a toy example:
import pandas as pd
import numpy as np
d = {'one': pd.Series(range(4), index=['a', 'b', 'c', 'd']),
     'two': pd.Series(range(4), index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)
df['three'] = [2, 2, 3, 3]

four = []
for i in set(df['three']):
    for j in range(len(df) - 1):
        four.append(df[df['three'] == i]['two'][j + 1] - df[df['three'] == i]['two'][j])
    four.append(0)
df['four'] = four
The final column should be [1, 1, 1, NaN], since that is the difference between each of the rows in the 'two' column.
This makes more sense in the context of my original code: my data frame is organized by some IDs and then by time, and when I take the subset of the data frame by ID, I'm left with the time-series evolution of the variables for each individual ID. However, I keep either receiving a KeyError or a warning about attempting to edit a copy of the original data frame. What is the right way to go about this?
You could replace df[df['three'] == i] with a groupby on column 'three', and perhaps replace ['two'][j + 1] - ['two'][j] with df['two'].shift(-1) - df['two'].
I think that would be identical to what you are doing now within the nested loop. How you implement this depends a bit on what format you want as a result. One way would be:
df.groupby('three').apply(lambda grp: pd.Series(grp['two'].shift(-1) - grp['two']))
Which would result in:
two      a    b
three
2        1  NaN
3        1  NaN
The column names become a bit meaningless after this operation.
If all you want to do is get the difference between the rows of column 'two', you can use the shift method.
df['four'] = df.two.shift(-1) - df.two
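If the difference should in fact be taken within each ID group (as in the original data described in the question), a sketch using transform assigns the per-group result straight back onto the frame; here 'three' stands in for the real ID column:

# per-group difference, aligned back to the original index
df['four'] = df.groupby('three')['two'].transform(lambda s: s.shift(-1) - s)
print(df)
# 'four' comes out as [1.0, NaN, 1.0, NaN] for the toy data above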
