Let's say we have the data frame below:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(0, 5, size=(5, 4)), columns=list('ABCD'))
df
   A  B  C  D
0  3  3  0  0
1  0  3  3  2
2  1  0  0  0
3  2  4  4  0
4  3  2  2  4
I want to append a new row copied from an existing one and modify several of its columns:
newrow = df.loc[0].copy()
newrow.A = 99
newrow.B = 90
df.append(newrow)
Doing this, I got a warning when trying to modify the row:
<string>:23: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
What would be a clean way to achieve this? I don't have an index to use with loc, because the row is not in the df yet.
If I later want to come back to this row, how can I retrieve its index at the moment of appending?
newrow = df.loc[0].copy()
df.append(newrow)
df.loc[which index to use, "A"] = 99
In other words, if I add the row first and modify it later, how can I get the added row's index?
As far as I can see, you modify every value of the current df row, so it might be unnecessary to copy the current row and trigger the warning.
Just create a dict with your values and append it to the df:
newrow = {'A':99,'B':90,'C':92, 'D':93}
df = df.append(newrow, ignore_index=True)
Use ignore_index=True and the new row will simply get the last index in your df.
If you didn't use ignore_index=True, use df.iloc[-1] to find the appended row.
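Note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so on recent versions the same idea has to go through pd.concat. Below is a minimal sketch of that, assuming a small hand-written frame instead of the random one from the question; the name new_label is only illustrative.

```python
import pandas as pd

# Small fixed frame standing in for the random one in the question.
df = pd.DataFrame({"A": [3, 0, 1], "B": [3, 3, 0], "C": [0, 3, 0], "D": [0, 2, 0]})

# Copy row 0 as a Series and change the values that should differ.
newrow = df.loc[0].copy()
newrow["A"] = 99
newrow["B"] = 90

# Turn the Series back into a one-row frame and concatenate it.
# ignore_index=True renumbers the rows, so the new row gets the last label.
df = pd.concat([df, newrow.to_frame().T], ignore_index=True)

# The label of the row that was just added is the last entry of the index.
new_label = df.index[-1]
df.loc[new_label, "C"] = 92
```

Reading df.index[-1] right after the concatenation answers the "which index to use" part: the label can be stored and passed to loc later.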
So I have a dataframe like this:-
       0          1           2  ...
0  Index  Something  Something2  ...
1      1          5           8  ...
2      2          6           9  ...
3      3          7          10  ...
Now, I want to append some columns in between those "Something" column names, for which I have used this code:-
j = 1
for i in range(2, 51):
    if i % 2 != 0 and i != 4:
        df.insert(i, f"% Difference {j}", " ")
        j += 1
where df is the dataframe. Now what happens is that the columns do get inserted but like this:-
       0          1  Difference 1           2  ...
0  Index  Something           NaN  Something2  ...
1      1          5           NaN           8  ...
2      2          6           NaN           9  ...
3      3          7           NaN          10  ...
whereas what I wanted was this:-
       0          1             2           3  ...
0  Index  Something  Difference 1  Something2  ...
1      1          5           NaN           8  ...
2      2          6           NaN           9  ...
3      3          7           NaN          10  ...
Edit 1 Using jezrael's logic:-
df.columns = df.iloc[0].tolist()
df = df.iloc[1:].reset_index(drop = True)
print(df)
The output of that is still this:-
       0          1           2  ...
0  Index  Something  Something2  ...
1      1          5           8  ...
2      2          6           9  ...
3      3          7          10  ...
Any ideas or suggestions as to where or how I am going wrong?
If your dataframe looks like what you've shown in your first code block, your column names aren't Index, Something, etc. - they're actually 0, 1, etc.
Pandas is seeing Index, Something, etc. as data in row 0, NOT as column names (which exist above row 0). So when you add a column with the name Difference 1, you're adding a column above row 0, which is where the range of integers is located.
A couple potential solutions to this:
If you'd like the actual column names to be Index, Something, etc. then the best solution is to import the data with that row as the headers. What is the source of your data? If it's a csv, make sure to NOT use the header = None option. If it's from somewhere else, there is likely an option to pass in a list of the column names to use. I can't think of any reason why you'd want to have a range of integer values as your column names rather than the more descriptive names that you have listed.
Alternatively, you can do what @jezrael suggested and convert your first row of data to column names then delete that data row. I'm not sure why their solution isn't working for you, since the code seems to work fine in my testing. Here's what it's doing:
df.columns = df.iloc[0].tolist()
df.columns tells pandas what to (re)name the columns of the dataframe. df.iloc[0].tolist() creates a list out of the first row of data, which in your case is the column names that you actually want.
df = df.iloc[1:].reset_index(drop = True)
This grabs the 2nd through last rows of data to recreate the dataframe. So you have new column names based on the first row, then you recreate the dataframe starting at the second row. The .reset_index(drop = True) isn't totally necessary to include. That just restarts your actual data rows with an index value of 0 rather than 1.
If for some reason you want to keep the column names as they currently exist (as integers rather than labels), you could do something like the following under the if statement in your for loop:
df.insert(i, i, np.nan, allow_duplicates = True)
df.iat[0, i] = f"%Difference {j}"
df.columns = np.arange(len(df.columns))
The first line inserts a column with an integer label filled with NaN values to start with (assuming you have numpy imported). You need to allow duplicates otherwise you'll get an error since the integer value will be the name of a pre-existing column
The second line changes the value in the 1st row of the newly-created column to what you want.
The third line resets the column names to be a range of integers like you had to start with.
As @jezrael suggested, it seems like you might be a little unclear about the difference between column names, indices, and data rows and columns. An index is its own thing, so it's not usually necessary to have a column named Index like you have in your dataframe, especially since that column has the same values in it as the actual index. Clarifying those sorts of things at import can help prevent a lot of hassle later on, so I'd recommend taking a good look at your data source to see if you can create a clearer dataframe to start with!
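To make the header-promotion approach concrete, here is a minimal sketch. It assumes the frame was loaded without headers, so the real names sit in row 0; the variable raw and the single inserted column are illustrative only, not the asker's full loop.

```python
import numpy as np
import pandas as pd

# Frame as it looks after loading without headers: names live in row 0.
raw = pd.DataFrame(
    [["Index", "Something", "Something2"], [1, 5, 8], [2, 6, 9], [3, 7, 10]]
)

# Promote row 0 to column names, then drop that row and renumber the rest.
raw.columns = raw.iloc[0].tolist()
df = raw.iloc[1:].reset_index(drop=True)

# With real column names in place, insert() puts the new name in the header.
df.insert(2, "% Difference 1", np.nan)

print(df.columns.tolist())
```

Once the labels are real column names, the inserted name shows up alongside them in the header instead of above a data row.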
I want to append some columns in between those "Something" column names
No, there are no columns named Something; for that you first need to set the first row of data as the column names:
print (df.columns)
Int64Index([0, 1, 2], dtype='int64')
print (df.iloc[0].tolist())
['Index', 'Something', 'Something2']
df.columns = df.iloc[0].tolist()
df = df.iloc[1:].reset_index(drop=True)
print (df)
  Index Something Something2
0     1         5          8
1     2         6          9
2     3         7         10
print (df.columns)
Index(['Index', 'Something', 'Something2'], dtype='object')
Then your solution creates the Difference columns, but the output is different from what you expected: there are no columns 0, 1, 2, 3.
I am trying to split an option chain into a separate data frame for rows with just the calls ('C') from the column, Right.
options_df
Index  Right
0      P
1      P
2      P
3      C
4      C
5      C
I try to make a new data frame, df, to hold the calls ('C'):
df = options_df
df.drop(df[df["Right"] == 'P'].index)
This returns the data frame, df, but unfortunately it keeps the indexing from the original data frame, options_df:
df
Index  Right
3      C
4      C
5      C
Ideally, the data frame for df would look like this:
Index  Right
0      C
1      C
2      C
But, it does not.
I've tried to correct with resetting the index, as below:
df.reset_index(drop=True)
But it also does not work and gives me back the entire original data frame, options_df:
df
Index  Right
0      P
1      P
2      P
3      C
4      C
5      C
I'm sure there is a simple solution, but I just cannot figure this one out. Thank you for your help!
You don't need to use .drop(), just select the rows of the condition you want and then reset the index by reset_index(drop=True), as follows:
df = df[df["Right"] == 'C'].reset_index(drop=True)
print(df)
Right
0 C
1 C
2 C
When you reset your index you need to add inplace=True
df.reset_index(drop=True, inplace=True)
Or, assign the result back to df with the line as you've written it:
df = df.reset_index(drop=True)
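The underlying point in both answers is that drop() and reset_index() return new frames unless inplace=True is passed, so a result that is not assigned is lost. A minimal sketch, assuming a six-row options_df like the one in the question (the names dropped and calls are illustrative):

```python
import pandas as pd

options_df = pd.DataFrame({"Right": ["P", "P", "P", "C", "C", "C"]})

# drop() returns a new frame; options_df itself is left untouched.
dropped = options_df.drop(options_df[options_df["Right"] == "P"].index)

# The surviving rows keep their original labels until the index is reset,
# and reset_index() also returns a new frame that has to be assigned.
calls = dropped.reset_index(drop=True)

print(dropped.index.tolist())
print(calls.index.tolist())
```

Also note that df = options_df only creates a second name for the same object, so options_df.copy() is the safer starting point if the original must stay unchanged.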
As the title says, I would like to find a way to erase the values in a data frame from a given column to the last column, but I can't find any way to do so.
I would like to start with
A B C
-----------
1 1 1
1 1 1
1 1 1
and get
A B C
-----------
1
1
1
I was trying with
df.drop(df.loc[:, 'B':].columns, axis = 1, inplace = True)
But this deletes the columns themselves too:
A
-
1
1
1
Am I missing something?
If you only know the column name that you want to keep:
import pandas as pd
new_df = pd.DataFrame(df["A"])
If you only know the column names that you want to drop:
new_df = df.drop(["B", "C"], axis=1)
For your case, to keep the columns, but remove the content, one possible way is:
new_df = pd.DataFrame(df["A"], columns=df.columns)
The resulting df still contains columns "B" and "C", but without values (NaN instead).
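Another way to keep the columns while clearing their contents is to overwrite them with NaN on a copy. This is a sketch under the assumption that "from column B to the end" means every column positioned at or after B; the names start and to_blank are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1, 1, 1], "B": [1, 1, 1], "C": [1, 1, 1]})

# Position of the first column to clear, then every column from there on.
start = df.columns.get_loc("B")
to_blank = list(df.columns[start:])

# Work on a copy so the original frame is not modified.
new_df = df.copy()
new_df[to_blank] = np.nan

print(new_df)
```

Assigning through new_df[to_blank] replaces those columns outright, so the integer columns turn into float columns holding NaN.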
I have a dataframe that is a result of some multiple step processing. I am adding one row to this dataframe like so:
df.loc['newindex'] = 0
Where 'newindex' is unique to the dataframe. I expect the new row to show up as the last row in the dataframe. But the row shows up somewhere near the middle of the dataframe.
What could be the reason of such behavior? I have to add row exactly at the last position, with its index name preserved.
* update *
I was wrong about uniqueness of the df index. The value has already been there.
I think the value newindex is already in the index, so loc selects and overwrites the existing row instead of appending a new one:
df = pd.DataFrame({'a':range(5)}, index=['a','s','newindex','d','f'])
print (df)
a
a 0
s 1
newindex 2
d 3
f 4
df.loc['newindex'] = 0
df.loc['newindex1'] = 0
print (df)
a
a 0
s 1
newindex 0
d 3
f 4
newindex1 0
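A quick membership test on the index shows in advance whether loc will overwrite or append. A minimal sketch, reusing the frame from above; the fallback label is a hypothetical choice for illustration.

```python
import pandas as pd

df = pd.DataFrame({"a": range(5)}, index=["a", "s", "newindex", "d", "f"])

label = "newindex"
if label in df.index:
    # Label already exists: assigning through loc would overwrite that row.
    # Picking a different label here is just one hypothetical way to react.
    label = "newindex1"

# The label is now absent from the index, so loc appends a row at the end.
df.loc[label] = 0

print(df.index.tolist())
```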
I have a pandas dataframe that looks like this:
I would like to iterate through column 3 and if an element exists, add a new row to the dataframe, using the value in column 3 as the new value in column 2, while also using the data in columns 0 and 1 from the row where it was found as the values for columns 0 and 1 in the newly added row:
Here, row 2 is the newly added row. The values in columns 0 and 1 in this row come from the row where "D" was found, and now column 2 of the new row contains the value from column 3 in the first row, "D".
Here is one way to do it, but surely there must be a more general solution, especially if I wish to scan more than a single column:
a = pd.DataFrame([['A','B','C','D'],[1,2,'C']])
b = a.copy()
for tu in a.itertuples(index=False):  # Iterate by row
    if tu[3]:  # If exists
        b = b.append([[tu[0], tu[1], tu[3]]], ignore_index=True)  # Append with new row using correct tuple elements.
You can do this without any loops by creating a new df with the columns you want and appending it to the original.
import pandas as pd
import numpy as np
df = pd.DataFrame([['A','B','C','D'],[1,2,'C']])
ndf = df[pd.notnull(df[3])][[0,1,3]]
ndf.columns = [0,1,2]
df = df.append(ndf, ignore_index=True)
This will leave NaN for the new missing values, which you can then change to None.
df[3] = df[3].where((pd.notnull(df[3])), None)
prints
0 1 2 3
0 A B C D
1 1 2 C None
2 A B D None
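Since DataFrame.append is gone as of pandas 2.0, the same loop-free idea can be written with pd.concat. A minimal sketch, assuming the missing cell is written out explicitly as None; the names extra and result are illustrative.

```python
import pandas as pd

# Same data as above, with the missing cell spelled out as None.
df = pd.DataFrame([["A", "B", "C", "D"], [1, 2, "C", None]])

# Rows that have a value in column 3, keeping columns 0, 1 and 3.
extra = df.loc[df[3].notna(), [0, 1, 3]]

# Relabel so that the value from column 3 lands in column 2.
extra.columns = [0, 1, 2]

# Stack the new rows under the original ones and renumber the index.
result = pd.concat([df, extra], ignore_index=True)

print(result)
```

Column 3 of the added row is left missing because extra has no such column.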
This may be a bit more general (assuming your columns are integers and that you are always looking to fill the previous columns in this pattern)
import pandas as pd
def append_rows(scan_row, scanned_dataframe):
    new_df = pd.DataFrame()
    for _, row in scanned_dataframe.iterrows():
        if row[scan_row]:
            # Carry over the leading columns, then add the scanned value
            new_row = [row[i] for i in range(scan_row - 1)]
            new_row.append(row[scan_row])
            print(new_row)
            # DataFrame.append requires pandas < 2.0
            new_df = new_df.append([new_row], ignore_index=True)
    return new_df
a = pd.DataFrame([['A','B','C','D'],[1,2,'C']])
b = a.copy()
b = b.append(append_rows(3,a))