I have a pandas dataframe that looks like this:
I would like to iterate through column 3, and wherever an element exists there, add a new row to the dataframe: the value from column 3 becomes the new row's value in column 2, while columns 0 and 1 are copied from the row where the value was found:
Here, row 2 is the newly added row. The values in columns 0 and 1 in this row come from the row where "D" was found, and now column 2 of the new row contains the value from column 3 in the first row, "D".
Here is one way to do it, but surely there must be a more general solution, especially if I wish to scan more than a single column:
a = pd.DataFrame([['A','B','C','D'],[1,2,'C']])
b = a.copy()
for tu in a.itertuples(index=False):  # iterate row by row
    if pd.notnull(tu[3]):  # only if column 3 has a value (NaN is truthy, so test explicitly)
        b = b.append([[tu[0], tu[1], tu[3]]], ignore_index=True)  # append a new row from the tuple elements
You can do this without any loops by creating a new df with the columns you want and appending it to the original.
import pandas as pd
import numpy as np
df = pd.DataFrame([['A','B','C','D'],[1,2,'C']])
ndf = df[pd.notnull(df[3])][[0,1,3]]
ndf.columns = [0,1,2]
df = df.append(ndf, ignore_index=True)
This will leave NaN for the new missing values, which you can then change to None:
df[3] = df[3].where((pd.notnull(df[3])), None)
prints
0 1 2 3
0 A B C D
1 1 2 C None
2 A B D None
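Note that DataFrame.append was removed in pandas 2.0; a sketch of the same approach using pd.concat instead:

```python
import pandas as pd

df = pd.DataFrame([['A', 'B', 'C', 'D'], [1, 2, 'C']])

# Select rows where column 3 is non-null, keep columns 0, 1 and 3,
# and relabel column 3 as column 2 so it lines up on concat.
ndf = df.loc[df[3].notna(), [0, 1, 3]].rename(columns={3: 2})

# Stack the new rows under the original frame with a fresh index.
df = pd.concat([df, ndf], ignore_index=True)
```

The rename step replaces the manual `ndf.columns = [0,1,2]` assignment, and pd.concat fills the untouched column 3 of the new rows with NaN, as before.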
This may be a bit more general (assuming your columns are labeled with integers and that you always want to fill the preceding columns in this pattern):
import pandas as pd
def append_rows(scan_col, scanned_dataframe):
    new_df = pd.DataFrame()
    for i, row in scanned_dataframe.iterrows():
        if pd.notnull(row[scan_col]):  # skip rows where the scanned column is empty
            new_row = [row[j] for j in range(scan_col - 1)]  # copy the preceding columns
            new_row.append(row[scan_col])
            print(new_row)
            new_df = new_df.append([new_row], ignore_index=True)
    return new_df
a = pd.DataFrame([['A','B','C','D'],[1,2,'C']])
b = a.copy()
b = b.append(append_rows(3,a))
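If you want to scan several columns at once, one loop-free generalisation is to build one block per scanned column and concatenate. This is a sketch on current pandas (pd.concat rather than the removed append); the parameter names are my own:

```python
import pandas as pd

def append_rows(df, scan_cols, keep_cols, target_col):
    """For every scanned column, take the rows where it is non-null,
    copy keep_cols unchanged, and move the scanned value into target_col."""
    blocks = []
    for col in scan_cols:
        block = df.loc[df[col].notna(), keep_cols + [col]].rename(columns={col: target_col})
        blocks.append(block)
    # Stack the original frame and all generated blocks with a fresh index.
    return pd.concat([df] + blocks, ignore_index=True)

a = pd.DataFrame([['A', 'B', 'C', 'D'], [1, 2, 'C']])
b = append_rows(a, scan_cols=[3], keep_cols=[0, 1], target_col=2)
```

Passing `scan_cols=[3, 4]` (say) would handle several source columns with the same frame-level selection instead of per-row iteration.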
I am stuck on an issue with a CSV file: I need to keep all the headers in row 1 in the specific order I was given, but the data from row 2 down is displaced in some columns. For example, the data in column C (excluding the row 1 header) needs to move to column F. The solutions I found on Stack Overflow move entire columns, headers included, but my goal is to move only the data while leaving the headers exactly where they are.
Please note that I am not allowed to use Excel to easily move the data over, but instead will need to work with just a common CSV file.
A B C D
4 1 10 7
5 2 11 8
6 3 12 9
For example, I will need to keep the Column headers in row 1 in the exact same order, but rearrange the data in rows 2-4 from Column B to Column A and the Data from Column D to Column C.
import pandas as pd

df = pd.read_csv("csv file path")

# swap Col A and Col B via a temporary column
df['F'] = df['A']
df['A'] = df['B']
df['B'] = df['F']

# swap Col C and Col D
df['F'] = df['C']
df['C'] = df['D']
df['D'] = df['F']

df = df.drop('F', axis=1)  # delete the temporary column (drop returns a copy, so reassign)
I guess you mean that?
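The temporary column is not strictly needed. A sketch of the same swap that assigns through a NumPy array, which keeps the headers in place:

```python
import pandas as pd

df = pd.DataFrame({'A': [4, 5, 6], 'B': [1, 2, 3],
                   'C': [10, 11, 12], 'D': [7, 8, 9]})

# .to_numpy() drops the column labels, so pandas assigns by position
# instead of realigning on names (which would silently undo the swap).
df[['A', 'B']] = df[['B', 'A']].to_numpy()
df[['C', 'D']] = df[['D', 'C']].to_numpy()
```

The header row is untouched; only the data underneath moves.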
Why do I get NaN values when adding values to the b column, but not for a?
This is the code:
df = pd.DataFrame({'grps': list('aaabbcaabcccbbc'),
'vals': [12,345,3,1,45,14,4,52,54,23,235,21,57,3,87]})
#extract all rows where a is present in the grps column
#for each a in a row, create an entry in a column (index 'a') in newdf from corresponding value on row
newdf = pd.DataFrame()
newdf['a'] = df[df["grps"] == 'a']['vals']
#print(df[df["grps"] == 'b']['vals'])
newdf['b']=df[df["grps"] == 'b']['vals']
print(newdf)
This is the output:
a b
0 12 NaN
1 345 NaN
2 3 NaN
6 4 NaN
7 52 NaN
Try the following:
df = pd.DataFrame({'grps': list('aaabbcaabcccbbc'),
'vals': [12,345,3,1,45,14,4,52,54,23,235,21,57,3,87]})
#extract all rows where a is present in the grps column
#for each a in a row, create an entry in a column (index 'a') in newdf from corresponding value on row
newdf = pd.DataFrame()
newdf['a'] = df[df["grps"] == 'a']['vals'].values
#print(df[df["grps"] == 'b']['vals'])
newdf['b']=df[df["grps"] == 'b']['vals'].values
print(newdf)
The problem is that df[df["grps"] == 'b']['vals'] is a pd.Series, which carries an index alongside its values, and newdf already inherited its index (0, 1, 2, 6, 7) from the previous line, newdf['a'] = df[df["grps"] == 'a']['vals']. When you assign the second Series, pandas aligns on index labels; the 'b' rows live at labels 3, 4, 8, 12 and 13, none of which appear in newdf's index, so every value in the new column becomes NaN.
By adding the .values accessor you assign only the value array, so the new DataFrame gets a default index, which is simply [0, 1, 2, 3, 4].
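An equivalent fix, sketched below, is to discard the old index explicitly with reset_index instead of reaching for .values:

```python
import pandas as pd

df = pd.DataFrame({'grps': list('aaabbcaabcccbbc'),
                   'vals': [12, 345, 3, 1, 45, 14, 4, 52, 54, 23, 235, 21, 57, 3, 87]})

newdf = pd.DataFrame()
# reset_index(drop=True) relabels both selections 0..n-1,
# so the two columns line up row by row instead of by original position.
newdf['a'] = df.loc[df['grps'] == 'a', 'vals'].reset_index(drop=True)
newdf['b'] = df.loc[df['grps'] == 'b', 'vals'].reset_index(drop=True)
```

This keeps everything as pandas objects (dtype and all) rather than dropping to a raw NumPy array.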
As the title says, I would like to find a way to drop (erase) the data in a dataframe from a given column to the end of the dataframe, but I can't find a way to do so.
I would like to start with
A B C
-----------
1 1 1
1 1 1
1 1 1
and get
A B C
-----------
1
1
1
I was trying with
df.drop(df.loc[:, 'B':].columns, axis = 1, inplace = True)
But this deletes the columns themselves too:
A
-
1
1
1
Am I missing something?
If you only know the column name that you want to keep:
import pandas as pd
new_df = pd.DataFrame(df["A"])
If you only know the column names that you want to drop:
new_df = df.drop(["B", "C"], axis=1)
For your case, to keep the columns, but remove the content, one possible way is:
new_df = pd.DataFrame(df["A"], columns=df.columns)
The resulting df keeps all the original columns; "A" retains its values while "B" and "C" contain only NaN.
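Another way to keep every header but blank out the data from "B" onward, sketched here, is to assign NaN to the column slice in place:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'A': [1, 1, 1], 'B': [1, 1, 1], 'C': [1, 1, 1]})

# .loc[:, 'B':] addresses every row of columns B through the end;
# assigning a scalar NaN empties them without dropping the headers.
df.loc[:, 'B':] = np.nan
```

Unlike drop, this mutates df directly and generalises to any number of trailing columns.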
I have a dataframe that is a result of some multiple step processing. I am adding one row to this dataframe like so:
df.loc['newindex'] = 0
Where 'newindex' is unique to the dataframe. I expect the new row to show up as the last row in the dataframe, but it shows up somewhere near the middle instead.
What could be the reason for this behavior? I need the row added at exactly the last position, with its index label preserved.
* update *
I was wrong about the uniqueness of the df index: the value was already there.
The value newindex is already in the index, so .loc selects and overwrites the existing row instead of appending a new one:
df = pd.DataFrame({'a':range(5)}, index=['a','s','newindex','d','f'])
print (df)
a
a 0
s 1
newindex 2
d 3
f 4
df.loc['newindex'] = 0
df.loc['newindex1'] = 0
print (df)
a
a 0
s 1
newindex 0
d 3
f 4
newindex1 0
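If the label might already exist but you always want a new last row, one defensive sketch is to check membership first and fall back to pd.concat, which always appends rather than overwriting:

```python
import pandas as pd

df = pd.DataFrame({'a': range(5)}, index=['a', 's', 'newindex', 'd', 'f'])

label = 'newindex'
if label in df.index:
    # pd.concat always appends, even when the label is already present,
    # so the original 'newindex' row is left untouched.
    df = pd.concat([df, pd.DataFrame({'a': [0]}, index=[label])])
else:
    df.loc[label] = 0
```

Be aware that this leaves a duplicate label in the index, so a later `df.loc['newindex']` will return both rows.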
Let's say we have a data frame below
import numpy as np

df = pd.DataFrame(np.random.randint(0, 5, size=(5, 4)), columns=list('ABCD'))
df
A B C D
0 3 3 0 0
1 0 3 3 2
2 1 0 0 0
3 2 4 4 0
4 3 2 2 4
I want to append a new row built from the existing data and modify several columns:
newrow = df.loc[0].copy()
newrow.A = 99
newrow.B = 90
df.append(newrow)
By doing this I got a warning when trying to modify the row
<string>:23: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
What would be the clean way of achieving what I intend to do? I can't use .loc with an index because the row is not inside the df yet.
If I later want to come back to this row, how can I retrieve its index at the moment of appending?
newrow = df.loc[0].copy()
df.append(newrow)
df.loc[which index to use, "A"] = 99
In other words, if I add the row first and then modify it later, how can I get the added row's index?
As far as I can see, you replace every value of the row anyway, so there is no need to copy an existing row (which is what triggers the warning).
Just create a dict with your values and append it to the df:
newrow = {'A':99,'B':90,'C':92, 'D':93}
df = df.append(newrow, ignore_index=True)
Use ignore_index=True and newrow will simply be the last row in your df, so its index is len(df) - 1. Either way, df.iloc[-1] retrieves the appended row by position.
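Since DataFrame.append was removed in pandas 2.0, a sketch of the same workflow with pd.concat; copying the source row as a one-row frame (double brackets) avoids the copy-of-a-slice warning, and with ignore_index=True the new label is simply len(df) - 1:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randint(0, 5, size=(5, 4)), columns=list('ABCD'))

newrow = df.loc[[0]].copy()          # double brackets -> one-row DataFrame, explicit copy
newrow['A'] = 99
newrow['B'] = 90
df = pd.concat([df, newrow], ignore_index=True)

new_index = len(df) - 1              # label of the appended row
df.loc[new_index, 'C'] = 42          # come back to it later by label
```

Because ignore_index=True relabels everything 0..n-1, the position-based df.iloc[-1] and the label-based df.loc[new_index] refer to the same row.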