Python - insert multiple rows into an existing data frame

I am trying to insert two lines into an existing data frame, but can't seem to get it to work. The existing df is:
df1 = pd.DataFrame({"a" : [1,2,3,4,5,6], "block" : [1, 1, 2, 2, 3, 3]})
I want to add two blank rows after the 1st and 2nd block rows. I would like the new data frame to look like this:
df_new = pd.DataFrame({"a" : [1,2,0,3,4,0,5,6], "block" : [1, 1, 0, 2, 2, 0, 3, 3]})
There doesn't need to be any values in the rows, I'm planning on using them as placeholders for something else. I've looked into adding rows, but most posts suggest appending one row to the beginning or end of a data frame, which won't work in my case.
Any suggestions as to my dilemma?

import pandas as pd

# Adds a new row to a DataFrame.
# oldDf   - the DataFrame to which the row will be added
# index   - the position at which the row will be inserted
# rowData - the data for the new row
# returns - a new DataFrame with the row added
def AddRow(oldDf, index, rowData):
    # pd.concat replaces DataFrame.append, which was removed in pandas 2.0.
    newDf = pd.concat([oldDf.head(index),
                       pd.DataFrame(rowData),
                       oldDf.tail(-index)])
    # Clean up the row indexes so there aren't any duplicates.
    # Figured you may want this.
    newDf = newDf.reset_index(drop=True)
    return newDf

# Initial data
df1 = pd.DataFrame({"a": [1, 2, 3, 4, 5, 6], "block": [1, 1, 2, 2, 3, 3]})

# Insert rows
blankRow = {"a": [0], "block": [0]}
df2 = AddRow(df1, 2, blankRow)
df2 = AddRow(df2, 5, blankRow)
For the sake of performance, you can remove the reset_index() call from the AddRow() function and simply call it once after you've added all your rows.
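A minimal sketch of that variant (a hypothetical lowercase add_row helper, to avoid clashing with AddRow above; pd.concat is used since DataFrame.append was removed in pandas 2.0):

```python
import pandas as pd

# Slice with head/tail, concatenate, and reset the index only once at the end.
def add_row(old_df, index, row_data):
    return pd.concat([old_df.head(index),
                      pd.DataFrame(row_data),
                      old_df.tail(len(old_df) - index)])

df1 = pd.DataFrame({"a": [1, 2, 3, 4, 5, 6], "block": [1, 1, 2, 2, 3, 3]})
blank = {"a": [0], "block": [0]}

out = add_row(df1, 2, blank)
out = add_row(out, 5, blank)
out = out.reset_index(drop=True)  # single cleanup at the end
print(out["a"].tolist())  # [1, 2, 0, 3, 4, 0, 5, 6]
```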

If you always want to insert the new row of zeros after each group of values in the block column you can do the following:
Start with your data frame:
df1 = pd.DataFrame({"a" : [1,2,3,4,5,6], "block" : [1, 1, 2, 2, 3, 3]})
Group it using the values in the block column:
gr = df1.groupby('block', group_keys=False)
Add a row of zeros to the end of each group (pd.concat replaces DataFrame.append, which was removed in pandas 2.0):
df_new = gr.apply(lambda x: pd.concat([x, pd.DataFrame({'a': [0], 'block': [0]})], ignore_index=True))
Reset the indexes of the new dataframe:
df_new.reset_index(drop=True, inplace=True)
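Put together as a self-contained sketch (pd.concat stands in for the removed DataFrame.append; note this appends a zero row after every group, including the last one):

```python
import pandas as pd

df1 = pd.DataFrame({"a": [1, 2, 3, 4, 5, 6], "block": [1, 1, 2, 2, 3, 3]})
blank = pd.DataFrame({"a": [0], "block": [0]})

# Append the zero row to every group, then flatten the group index away.
df_new = (df1.groupby("block", group_keys=False)
             .apply(lambda g: pd.concat([g, blank]))
             .reset_index(drop=True))
print(df_new["block"].tolist())
```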

You can simply group the data by the block column, concatenate a placeholder row at the bottom of each group, and then append each group to a new dataframe.
df1 = pd.DataFrame({"a" : [1,2,3,4,5,6], "block" : [1, 1, 2, 2, 3, 3]})
df1 # original data
Out[67]:
   a  block
0  1      1
1  2      1
2  3      2
3  4      2
4  5      3
5  6      3
df_group = df1.groupby('block')
pieces = []  # the groups plus their placeholders, combined at the end
for name, group in df_group:
    # pd.concat replaces DataFrame.append, which was removed in pandas 2.0
    pieces.append(pd.concat([group, pd.DataFrame({"a": [0], "block": [0]})]))
df = pd.concat(pieces, ignore_index=True)  # final data
df
Out[71]:
   a  block
0  1      1
1  2      1
2  0      0
3  3      2
4  4      2
5  0      0
6  5      3
7  6      3
8  0      0

Related

Python iterating through data and returning deltas

Python newbie here with a challenge I'm working to solve...
My goal is to iterate through a data frame and return what changed line by line. Here's what I have so far:
pseudo code (may not be correct method)
step 1: set row 0 to an initial value
step 2: compare row 1 to row 0, add changes to a list and record row number
step 3: set current row to new initial
step 4: compare row 2 to row 1, add changes to a list and record row number
step 5: iterate through all rows
step 6: return a table with changes and row index where change occurred
d = {
    'col1': [1, 1, 2, 2, 3],
    'col2': [1, 2, 2, 2, 2],
    'col3': [1, 1, 2, 2, 2]
}
df = pd.DataFrame(data=d)

def delta():
    changes = []
    initial = df.loc[0]
    for row in df:
        if row[i] != initial:
            changes.append[i]

delta()
changes I expect to see:
index 1: col2 changed from 1 to 2, 2 should be added to changes list
index 2: col 1 and col3 changed from 1 to 2, both 2s should be added to changes list
index 4: col 1 changed from 2 to 3, 3 should be added to changes list
You can check where each of the columns has changed using the shift method, then use mask to keep only the values that changed:
df.loc[:, 'col1_changed'] = df['col1'].mask(df['col1'].eq(df['col1'].shift()))
df.loc[:, 'col2_changed'] = df['col2'].mask(df['col2'].eq(df['col2'].shift()))
df.loc[:, 'col3_changed'] = df['col3'].mask(df['col3'].eq(df['col3'].shift()))
Once you have identified the changes, you can aggregate them together:
import numpy as np
# We don't consider the first row
df.loc[0, ['col1_changed', 'col2_changed', 'col3_changed']] = [np.nan] * 3
df[['col1_changed', 'col2_changed', 'col3_changed']].astype('str').agg(','.join, axis=1).str.replace('nan', 'no change')
#0 no change,no change,no change
#1 no change,2.0,no change
#2 2.0,no change,2.0
#3 no change,no change,no change
#4 3.0,no change,no change
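The three near-identical assignments can also be written as a loop over the columns; a small sketch of the same logic:

```python
import pandas as pd

d = {'col1': [1, 1, 2, 2, 3],
     'col2': [1, 2, 2, 2, 2],
     'col3': [1, 1, 2, 2, 2]}
df = pd.DataFrame(data=d)

for col in ['col1', 'col2', 'col3']:
    # Keep the value only where it differs from the previous row; NaN elsewhere.
    df[f'{col}_changed'] = df[col].mask(df[col].eq(df[col].shift()))
```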
You can use the pandas function diff() which will already provide the increment compared to the previous row:
import pandas as pd

d = {
    'col1': [1, 1, 2, 2, 3],
    'col2': [1, 2, 2, 2, 2],
    'col3': [1, 1, 2, 2, 2]
}
df = pd.DataFrame(data=d)

def delta(df):
    deltas = df.diff()                 # becomes float, needed so the first row can hold NaNs
    deltas.iloc[0] = df.iloc[0]        # replace the NaNs in the first row with the original data
    deltas = deltas.astype(df.dtypes)  # restore the dtypes of the input data
    keep = (deltas != 0).any(axis=1)   # keep rows where at least one value changed
    keep.iloc[0] = True                # include the first row even if it holds only zeros
    return deltas.loc[keep]            # actually apply the filter

print(delta(df))
This prints:
   col1  col2  col3
0     1     1     1
1     0     1     0
2     1     0     1
4     1     0     0
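A related sketch, if you want a mapping from row index to the columns that changed there (closer to the expected changes listed in the question):

```python
import pandas as pd

d = {'col1': [1, 1, 2, 2, 3],
     'col2': [1, 2, 2, 2, 2],
     'col3': [1, 1, 2, 2, 2]}
df = pd.DataFrame(data=d)

changed = df.ne(df.shift())  # True where a value differs from the row above
changed.iloc[0] = False      # the first row has nothing to compare against

# Map each row index to the list of columns that changed there.
changes = {i: row[row].index.tolist() for i, row in changed.iterrows() if row.any()}
print(changes)  # {1: ['col2'], 2: ['col1', 'col3'], 4: ['col1']}
```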

Change column values based on other dataframe columns

I have two dataframes that look like this
df1 ==
IDLocation  x-coord  y-coord
1            -1.546    7.845
2             3.256    1.965
.
.
35            5.723   -2.724
df2 ==
PIDLocation  DIDLocation
14           5
3            2
7            26
I want to replace the columns PIDLocation and DIDLocation with Px-coord, Py-coord, Dx-coord, Dy-coord, where both PIDLocation and DIDLocation are IDLocation values, and each IDLocation corresponds to an x-coord and y-coord in the first dataframe.
If you set the ID column as the index of df1, you can get the coord values by indexing. I changed the values in df2 in the example below to avoid index errors that would result from not having the full dataset.
import pandas as pd

df1 = pd.DataFrame({'IDLocation': [1, 2, 35],
                    'x-coord': [-1.546, 3.256, 5.723],
                    'y-coord': [7.845, 1.965, -2.724]})
df2 = pd.DataFrame({'PIDLocation': [35, 1, 2],
                    'DIDLocation': [2, 1, 35]})

df1.set_index('IDLocation', inplace=True)
df2['Px-coord'] = [df1['x-coord'].loc[i] for i in df2.PIDLocation]
df2['Py-coord'] = [df1['y-coord'].loc[i] for i in df2.PIDLocation]
df2['Dx-coord'] = [df1['x-coord'].loc[i] for i in df2.DIDLocation]
df2['Dy-coord'] = [df1['y-coord'].loc[i] for i in df2.DIDLocation]
del df2['PIDLocation']
del df2['DIDLocation']
print(df2)

   Px-coord  Py-coord  Dx-coord  Dy-coord
0     5.723    -2.724     3.256     1.965
1    -1.546     7.845    -1.546     7.845
2     3.256     1.965     5.723    -2.724
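A vectorized alternative, sketched with Series.map instead of list comprehensions (same example data as above):

```python
import pandas as pd

df1 = pd.DataFrame({'IDLocation': [1, 2, 35],
                    'x-coord': [-1.546, 3.256, 5.723],
                    'y-coord': [7.845, 1.965, -2.724]})
df2 = pd.DataFrame({'PIDLocation': [35, 1, 2],
                    'DIDLocation': [2, 1, 35]})

lookup = df1.set_index('IDLocation')  # ID -> coordinates
for prefix, col in (('P', 'PIDLocation'), ('D', 'DIDLocation')):
    df2[prefix + 'x-coord'] = df2[col].map(lookup['x-coord'])
    df2[prefix + 'y-coord'] = df2[col].map(lookup['y-coord'])
df2 = df2.drop(columns=['PIDLocation', 'DIDLocation'])
print(df2)
```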

Update column in pandas dataframe based on another column of the same dataframe

I am struggling with updating a dataframe columns. Here is a sample of my dataframe :
data1={'UserId': [1, 2, 3], 'OldAnswer': [4, 4, None]}
df1 = pd.DataFrame.from_dict(data1)
data2={'UserId': [1, 2, 3], 'NewAnswer' : [4, 5, None]}
df2 = pd.DataFrame.from_dict(data2)
merged = pd.merge(df1, df2, on ='UserId', how='outer')
Which gives me :
UserId  OldAnswer  NewAnswer
1       4          4
2       4          5
3       NaN        NaN
Now I want to update "OldAnswer" with "NewAnswer" row by row, but when I check the difference between the two columns, it says that on the third row OldAnswer and NewAnswer are different. The following code gives me the following result:
merged['OldAnswer'] != merged['NewAnswer']
> False
> True
> True
I thought I would have been able to update my column by doing this :
i = 0
while i < len(merged):
    if merged['OldAnswer'].iloc[i] != merged['NewAnswer'].iloc[i]:
        merged['OldAnswer'].iloc[i] = merged['NewAnswer'].iloc[i]
        i += 1
    else:
        i += 1
But it doesn't work either.
I feel a bit dumb right now! The simple following code solved it:
merged['OldAnswer'] = merged['NewAnswer']
merged = merged.drop(columns='NewAnswer')
(Note that drop returns a copy, so the result must be assigned back or called with inplace=True.)
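For reference, the comparison flagged the third row because NaN compares unequal to everything, including itself. A small sketch of a NaN-aware difference check:

```python
import pandas as pd

merged = pd.DataFrame({'UserId': [1, 2, 3],
                       'OldAnswer': [4.0, 4.0, None],
                       'NewAnswer': [4.0, 5.0, None]})

# != flags row 3 even though both answers are missing there.
naive_diff = merged['OldAnswer'] != merged['NewAnswer']

# Treat two NaNs as equal by requiring that not both sides are missing.
really_diff = merged['OldAnswer'].ne(merged['NewAnswer']) & ~(
    merged['OldAnswer'].isna() & merged['NewAnswer'].isna())
print(really_diff.tolist())  # [False, True, False]
```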

How to name Pandas Dataframe Columns automatically?

I have a Pandas dataframe df with 102 columns. Each column is named differently, say A, B, C etc. to give the original dataframe following structure
Column A. Column B. Column C. ....
Row 1.
Row 2.
---
Row n
I would like to change the column names from A, B, C etc. to F1, F2, F3, ..., F102. I tried using df.columns but wasn't successful in renaming them this way. Is there a simple way to automatically rename all columns to F1 through F102, instead of renaming each column individually?
df.columns=["F"+str(i) for i in range(1, 103)]
Note:
Instead of a “magic” number 103 you may use the calculated number of columns (+ 1), e.g.
len(df.columns) + 1, or
df.shape[1] + 1.
(Thanks to ALollz for this tip in his comment.)
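For example, using the calculated column count (a small sketch with a three-column frame):

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [3, 4], 'C': [5, 6]})
# Works for any number of columns, no magic number needed.
df.columns = ["F" + str(i) for i in range(1, df.shape[1] + 1)]
print(list(df.columns))  # ['F1', 'F2', 'F3']
```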
One way to do this is to convert it to a pair of lists, and rename the entries of the column-names list in a loop:
import pandas as pd

d = {'Column A': [1, 2, 3, 4, 5, 4, 3, 2, 1],
     'Column B': [1, 2, 3, 4, 5, 4, 3, 2, 1],
     'Column c': [1, 2, 3, 4, 5, 4, 3, 2, 1]}
dataFrame = pd.DataFrame(data=d)

cols = list(dataFrame.columns.values)  # list of the original column names
index = 1  # start at 1
for column in cols:
    cols[index - 1] = "F" + str(index)  # rename the column based on its index
    index += 1  # add one to index

vals = dataFrame.values.tolist()  # get the values for the rows
newDataFrame = pd.DataFrame(vals, columns=cols)  # new dataframe with the new column names
print(newDataFrame)
Output:
   F1  F2  F3
0   1   1   1
1   2   2   2
2   3   3   3
3   4   4   4
4   5   5   5
5   4   4   4
6   3   3   3
7   2   2   2
8   1   1   1

How to delete the randomly sampled rows of a dataframe, to avoid sampling them again?

I have dataframe (df) of 12 rows x 5 columns. I sample 1 row from each label and create a new dataframe (df1) of 3 rows x 5 columns. I need that the next time I sample more rows from df I will not choose the same ones that are already in df1. So how can I delete the already sampled rows from df?
import pandas as pd
import numpy as np
# 12x5
df = pd.DataFrame(np.random.rand(12, 5))
label=np.array([1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3])
df['label'] = label
#3x5
df1 = pd.concat(g.sample(1) for idx, g in df.groupby('label'))
#My attempt. It should be a 9x5 dataframe
df2 = pd.concat(f.drop(idx) for idx, f in df1.groupby('label'))
df
df1
df2
Starting with this DataFrame:
df = pd.DataFrame(np.random.rand(12, 5))
label=np.array([1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3])
df['label'] = label
Your first sample is this:
df1 = pd.concat(g.sample(1) for idx, g in df.groupby('label'))
For the second sample, you can drop df1's indices from df:
pd.concat(g.sample(1) for idx, g in df.drop(df1.index).groupby('label'))
Out:
          0         1         2         3         4  label
2  0.188005  0.765640  0.549734  0.712261  0.334071      1
4  0.599812  0.713593  0.366226  0.374616  0.952237      2
8  0.631922  0.585104  0.184801  0.147213  0.804537      3
This is not an inplace operation. It doesn't modify the original DataFrame. It just drops the rows, returns a copy, and samples from that copy. If you want it to be permanent, you can do:
df2 = df.drop(df1.index)
And sample from df2 afterwards.
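Putting it together as a runnable sketch (the seed is there only to make the sketch repeatable):

```python
import pandas as pd
import numpy as np

np.random.seed(0)  # assumption: seeded only for repeatability
df = pd.DataFrame(np.random.rand(12, 5))
df['label'] = [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3]

# First sample: one row per label.
df1 = pd.concat(g.sample(1) for _, g in df.groupby('label'))

# Permanently exclude those rows before sampling again.
df2 = df.drop(df1.index)  # the 9 remaining rows
df3 = pd.concat(g.sample(1) for _, g in df2.groupby('label'))
```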
