Python iterating through data and returning deltas

Python newbie here with a challenge I'm working to solve...
My goal is to iterate through a data frame and return what changed line by line. Here's what I have so far:
Pseudocode (may not be the correct method):
step 1: set row 0 to an initial value
step 2: compare row 1 to row 0, add changes to a list and record row number
step 3: set current row to new initial
step 4: compare row 2 to row 1, add changes to a list and record row number
step 5: iterate through all rows
step 6: return a table with changes and row index where change occurred
import pandas as pd

d = {
    'col1': [1, 1, 2, 2, 3],
    'col2': [1, 2, 2, 2, 2],
    'col3': [1, 1, 2, 2, 2]
}
df = pd.DataFrame(data=d)

def delta():
    changes = []
    initial = df.loc[0]
    for row in df:
        if row[i] != initial:
            changes.append[i]

delta()
Changes I expect to see:
index 1: col2 changed from 1 to 2, so 2 should be added to the changes list
index 2: col1 and col3 changed from 1 to 2, so both 2s should be added to the changes list
index 4: col1 changed from 2 to 3, so 3 should be added to the changes list

You can check where each of the columns has changed using the shift method, and then use a mask to keep only the values that have changed:
df.loc[:, 'col1_changed'] = df['col1'].mask(df['col1'].eq(df['col1'].shift()))
df.loc[:, 'col2_changed'] = df['col2'].mask(df['col2'].eq(df['col2'].shift()))
df.loc[:, 'col3_changed'] = df['col3'].mask(df['col3'].eq(df['col3'].shift()))
Once you have identified the changes, you can aggregate them together:
import numpy as np

# We don't consider the first row
df.loc[0, ['col1_changed', 'col2_changed', 'col3_changed']] = [np.nan] * 3
df[['col1_changed', 'col2_changed', 'col3_changed']].astype('str').agg(','.join, axis=1).str.replace('nan', 'no change')
#0 no change,no change,no change
#1 no change,2.0,no change
#2 2.0,no change,2.0
#3 no change,no change,no change
#4 3.0,no change,no change
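If you have more than a handful of columns, the same shift/mask idea can be applied to every column at once. A minimal sketch, assuming the df defined in the question (and np imported as above):
changed = df.mask(df.eq(df.shift()))   # NaN wherever a value repeats the row above
changed.iloc[0] = np.nan               # ignore the first row, as above
print(changed.dropna(how='all'))       # keep only rows where something changed
#    col1  col2  col3
# 1   NaN   2.0   NaN
# 2   2.0   NaN   2.0
# 4   3.0   NaN   NaN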

You can use the pandas function diff(), which already provides the increment relative to the previous row:
import pandas as pd

d = {
    'col1': [1, 1, 2, 2, 3],
    'col2': [1, 2, 2, 2, 2],
    'col3': [1, 1, 2, 2, 2]
}
df = pd.DataFrame(data=d)

def delta(df):
    deltas = df.diff()                  # converts to float, needed to hold the NaNs in the first row
    deltas.iloc[0] = df.iloc[0]         # replace the NaNs in the first row with the original data
    deltas = deltas.astype(df.dtypes)   # restore the dtypes of the input data
    keep = (deltas != 0).any(axis=1)    # keep only rows where at least one value is non-zero
    keep.iloc[0] = True                 # include the first row even if it held only zeros
    return deltas.loc[keep]             # actually apply the filter

print(delta(df))
This prints:
col1 col2 col3
0 1 1 1
1 0 1 0
2 1 0 1
4 1 0 0
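If you also want the changes in the form the question asks for (row index, column, new value), a short sketch building on the same df: keep the new value wherever the delta is non-zero, then stack() the result.
deltas = df.diff()
changes = df[deltas != 0].iloc[1:].stack()  # skip row 0, which holds the initial values
print(changes)
# 1  col2    2.0
# 2  col1    2.0
#    col3    2.0
# 4  col1    3.0
# dtype: float64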

Related

How to compare and replace individual cell values in data according to a list? (pandas)

I have a dataframe containing numerical values. I want to replace all values in the dataframe by comparing individual cell values to the respective elements of a list. The length of the list and the number of columns are the same. Here's an example:
df = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
                  columns=['a', 'b', 'c'])
Output
     a    b  c
0  101    2  3
1    4  500  6
2  712    8  9
list_numbers = [100,100,100]
I want to compare individual cell values to the respective elements of the list.
So, the column 'a' will be compared to 100. If a value is greater than a hundred, I want to replace it with another number.
Here is my code so far:
df = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
                  columns=['a', 'b', 'c'])
df_columns = df.columns
df_index = df.index

# Creating a new dataframe to store the values.
df1 = pd.DataFrame(index=df_index, columns=df_columns)
df1 = df1.fillna(0)

for index, value in enumerate(df.columns):
    # df.where replaces values where the condition is False
    df1[[value]] = df[[value]].where(df[[value]] > list_numbers[index], -1)
    df1[[value]] = df[[value]].where(df[[value]] < list_numbers[index], 1)
# I am getting something like: nan for column a and an error for the other columns.
# The output should look something like:
Output
   a  b  c
0  1 -1 -1
1 -1  1 -1
2  1 -1 -1
Iterating over a DataFrame iterates over its column names. So you could simply do:
df1 = pd.DataFrame()
for i, c in enumerate(df):
    df1[c] = np.where(df[c] >= list_numbers[i], 1, -1)
You can avoid iterating over the columns entirely and use numpy broadcasting instead (which is more efficient):
df1 = pd.DataFrame(
    np.where(df.values > np.array(list_numbers), 1, -1),
    columns=df.columns)
df1
Output:
   a  b  c
0  1 -1 -1
1 -1  1 -1
2  1 -1 -1
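If you prefer to stay in pandas, a sketch of the same comparison: DataFrame.ge() broadcasts a list-like across the columns, and the boolean result maps onto 1/-1 (this assumes the df and list_numbers from above).
df1 = df.ge(list_numbers).astype(int).mul(2).sub(1)  # True -> 1, False -> -1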

Update column in pandas dataframe based on another column of the same dataframe

I am struggling with updating a dataframe column. Here is a sample of my dataframe:
data1 = {'UserId': [1, 2, 3], 'OldAnswer': [4, 4, None]}
df1 = pd.DataFrame.from_dict(data1)
data2 = {'UserId': [1, 2, 3], 'NewAnswer': [4, 5, None]}
df2 = pd.DataFrame.from_dict(data2)
merged = pd.merge(df1, df2, on='UserId', how='outer')
Which gives me:
   UserId  OldAnswer  NewAnswer
0       1        4.0        4.0
1       2        4.0        5.0
2       3        NaN        NaN
Now I want to update "OldAnswer" with "NewAnswer" row by row, but when I check the difference between the two columns, it says that on the third row OldAnswer and NewAnswer are different. The following code gives me this result:
merged['OldAnswer'] != merged['NewAnswer']
> False
> True
> True
I thought I would have been able to update my column by doing this :
i = 0
while i < len(merged):
    if merged['OldAnswer'].iloc[i] != merged['NewAnswer'].iloc[i]:
        merged['OldAnswer'].iloc[i] = merged['NewAnswer'].iloc[i]
        i += 1
    else:
        i += 1
But it doesn't work either.
I feel a bit dumb right now! The third row shows as different because NaN never compares equal to anything, not even to another NaN. The simple following code solved it:
merged['OldAnswer'] = merged['NewAnswer']
merged = merged.drop(columns='NewAnswer')
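If you only want to overwrite rows where a non-null NewAnswer actually differs, a small sketch (assuming the merged frame from above):
needs_update = merged['NewAnswer'].notna() & merged['OldAnswer'].ne(merged['NewAnswer'])
merged.loc[needs_update, 'OldAnswer'] = merged.loc[needs_update, 'NewAnswer']
merged = merged.drop(columns='NewAnswer')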

Replace values by result of a function

I have the following dataframe:
df = pd.DataFrame({'A': [0, 1, 0],
                   'B': [1, 1, 1]},
                  index=['2020-01-01', '2020-02-01', '2020-03-01'])
I'm trying to replace every cell where a 1 is present with an increasing number, since I want to use it as an ID. I'm looking for something like:
df.replace(1, value=3)
which works great, but instead of the fixed number 3 I need the number to keep increasing:
number += 1
If I join those together, it doesn't work (or at least I'm not able to find the correct syntax). I'd like to obtain the following result:
df = pd.DataFrame({'A': [0, 2, 0],
                   'B': [1, 3, 4]},
                  index=['2020-01-01', '2020-02-01', '2020-03-01'])
Note: I cannot use any command that relies on specifying column or row names, because the table has 2600 columns and 5000 rows.
Element-wise assignment on a copy of df.values can work.
More specifically, a range starting from 1 to the number of 1's (inclusive) is assigned onto the location of 1 elements in the value array. The assigned array is then put back into the original dataframe.
Code
(Data as given)
1. Row-first ordering (what the OP wants)
arr = df.values
mask = (arr > 0)
arr[mask] = range(1, mask.sum() + 1)
for i, col in enumerate(df.columns):
    df[col] = arr[:, i]
# Result
print(df)
            A  B
2020-01-01  0  1
2020-02-01  2  3
2020-03-01  0  4
2. Column-first ordering (another possibility)
arr_tr = df.values.transpose()
mask_tr = (arr_tr > 0)
arr_tr[mask_tr] = range(1, mask_tr.sum() + 1)
for i, col in enumerate(df.columns):
    df[col] = arr_tr[i, :]
# Result
print(df)
            A  B
2020-01-01  0  2
2020-02-01  1  3
2020-03-01  0  4
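As a side note, the per-column write-back loop in both variants can be replaced by a single assignment. A sketch under the same assumptions (df as given, numpy imported as np):
arr = df.to_numpy()
mask = (arr > 0)
arr[mask] = np.arange(1, mask.sum() + 1)  # boolean indexing enumerates in row-major order
df.loc[:, :] = arr                        # write the modified array back in one step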

Pandas: using iloc to retrieve data does not match input index

I have a dataset which contains contributor's id and contributor_message. I wanted to retrieve all samples with the same message, say, contributor_message == 'I support this proposal because...'.
I use data.loc[data.contributor_message == 'I support this proposal because...'].index to get the indices of all rows in the DataFrame with the same message, say those indices are 1, 2, 50, 9350, 30678, ...
Then I tried data.iloc[[1, 2, 50]] and this gives me the correct answer, i.e. the indices match the DataFrame indices.
However, when I use data.iloc[9350] or higher indices, I will NOT get the corresponding DataFrame index. Say I got 15047 in the DataFrame this time.
Can anyone advise how to fix this problem?
This occurs when your indices are not aligned with their integer location.
Note that pd.DataFrame.loc is used to slice by index and pd.DataFrame.iloc is used to slice by integer location.
Below is a minimal example.
df = pd.DataFrame({'A': [1, 2, 1, 1, 5]}, index=[0, 1, 2, 4, 5])
idx = df[df['A'] == 1].index
print(idx) # Int64Index([0, 2, 4], dtype='int64')
res1 = df.loc[idx]
res2 = df.iloc[idx]
print(res1)
# A
# 0 1
# 2 1
# 4 1
print(res2)
# A
# 0 1
# 2 1
# 5 5
You have 2 options to resolve this problem.
Option 1
Use pd.DataFrame.loc to slice by index, as above.
Option 2
Reset index and use pd.DataFrame.iloc:
df = df.reset_index(drop=True)
idx = df[df['A'] == 1].index
res2 = df.iloc[idx]
print(res2)
# A
# 0 1
# 2 1
# 3 1
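A third possibility, if you want to keep the original index, is to translate the labels into integer locations first; a short sketch assuming the same df and idx as above:
pos = df.index.get_indexer(idx)  # label -> integer location
res3 = df.iloc[pos]              # selects the same rows as df.loc[idx]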

Python - insert multiple rows into an existing data frame

I am trying to insert two lines into an existing data frame, but can't seem to get it to work. The existing df is:
df1 = pd.DataFrame({"a" : [1,2,3,4,5,6], "block" : [1, 1, 2, 2, 3, 3]})
I want to add two blank rows after the 1st and 2nd block rows. I would like the new data frame to look like this:
df_new = pd.DataFrame({"a" : [1,2,0,3,4,0,5,6], "block" : [1, 1, 0, 2, 2, 0, 3, 3]})
There doesn't need to be any values in the rows, I'm planning on using them as placeholders for something else. I've looked into adding rows, but most posts suggest appending one row to the beginning or end of a data frame, which won't work in my case.
Any suggestions as to my dilemma?
import pandas as pd

# Adds a new row to a DataFrame
# oldDf   - The DataFrame to which the row will be added
# index   - The position where the row will be inserted
# rowData - The new data to be added to the row
# returns - A new DataFrame with the row added
def AddRow(oldDf, index, rowData):
    # pd.concat replaces DataFrame.append, which was removed in pandas 2.0
    newDf = pd.concat([oldDf.head(index),
                       pd.DataFrame(rowData),
                       oldDf.tail(-index)])
    # Clean up the row indexes so there aren't any doubles.
    # Figured you may want this.
    newDf = newDf.reset_index(drop=True)
    return newDf
# Initial data
df1 = pd.DataFrame({"a" : [1,2,3,4,5,6], "block" : [1, 1, 2, 2, 3, 3]})
# Insert rows
blankRow = {"a": [0], "block": [0]}
df2 = AddRow(df1, 2, blankRow)
df2 = AddRow(df2, 5, blankRow)
For the sake of performance, you can remove the reset_index() call from the AddRow() function and simply call it once after you've added all your rows.
If you always want to insert the new row of zeros after each group of values in the block column you can do the following:
Start with your data frame:
df1 = pd.DataFrame({"a" : [1,2,3,4,5,6], "block" : [1, 1, 2, 2, 3, 3]})
Group it using the values in the block column:
gr = df1.groupby('block')
Add a row of zeros to the end of each group:
df_new = gr.apply(lambda x: pd.concat([x, pd.DataFrame({'a': [0], 'block': [0]})], ignore_index=True))
Reset the indexes of the new dataframe:
df_new.reset_index(drop = True, inplace=True)
You can simply group the data by the block column, then concat the placeholder to the bottom of each group and append the result to a new dataframe.
df1 = pd.DataFrame({"a" : [1,2,3,4,5,6], "block" : [1, 1, 2, 2, 3, 3]})
df1 # original data
Out[67]:
a block
0 1 1
1 2 1
2 3 2
3 4 2
4 5 3
5 6 3
df_group = df1.groupby('block')
df = pd.DataFrame({"a": [], "block": []})  # final data to be appended
for name, group in df_group:
    group = pd.concat([group, pd.DataFrame({"a": [0], "block": [0]})])
    df = pd.concat([df, group], ignore_index=True)  # DataFrame.append was removed in pandas 2.0
df
df
Out[71]:
a block
0 1 1
1 2 1
2 0 0
3 3 2
4 4 2
5 0 0
6 5 3
7 6 3
8 0 0
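A more compact alternative, as a sketch (assuming df1 from above): give the placeholder rows fractional index positions just after the last row of each block, then sort the index.
last_in_block = df1.drop_duplicates('block', keep='last').index          # rows 1, 3, 5
blanks = pd.DataFrame({"a": 0, "block": 0}, index=last_in_block + 0.5)   # lands between blocks
df_new = pd.concat([df1, blanks]).sort_index().reset_index(drop=True)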
