I have a large dataframe containing lots of columns.
For each row/index in the dataframe I do some operations, read in some ancilliary ata, etc and get a new value. Is there a way to add that new value into a new column at the correct row/index?
I can use .assign to add a new column but as I'm looping over the rows and only generating the data to add for one value at a time (generating it is quite involved). When it's generated I'd like to immediately add it to the dataframe rather than waiting until I've generated the entire series.
This doesn't work and gives a key error:
df['new_column_name'].iloc[this_row]=value
Do I need to initialise the column first or something?
There are two steps to created & populate a new column using only a row number...
(in this approach iloc is not used)
First, get the row index value by using the row number
rowIndex = df.index[someRowNumber]
Then, use row index with the loc function to reference the specific row and add the new column / value
df.loc[rowIndex, 'New Column Title'] = "some value"
These two steps can be combine into one line as follows
df.loc[df.index[someRowNumber], 'New Column Title'] = "some value"
If you have a dataframe like
import pandas as pd
df = pd.DataFrame(data={'X': [1.5, 6.777, 2.444, pd.np.NaN], 'Y': [1.111, pd.np.NaN, 8.77, pd.np.NaN], 'Z': [5.0, 2.333, 10, 6.6666]})
Instead of iloc,you can use .loc with row index and column name like df.loc[row_indexer,column_indexer]=value
df.loc[[0,3],'Z'] = 3
Output:
X Y Z
0 1.500 1.111 3.000
1 6.777 NaN 2.333
2 2.444 8.770 10.000
3 NaN NaN 3.000
If you want to add values to certain rows in a new column, depending on values in other cells of the dataframe you can do it like this:
import pandas as pd
df = pd.DataFrame(data={"A":[1,1,2,2], "B":[1,2,3,4]})
Add value in a new column based on the values in cloumn "A":
df.loc[df.A == 2, "C"] = 100
This creates the column "C" and addes the value 100 to it, if column "A" is 2.
Output:
A B C
0 1 1 NaN
1 1 2 NaN
2 2 3 100
3 2 4 100
It is not necessary to initialise the column first.
You can just use pandas built in function DataFrame.at
You can chose a list on several index or a single index and column
df.at[4, 'B'] = 10
Related
So I have a dataframe like this:-
0 1 2 ...
0 Index Something Something2 ...
1 1 5 8 ...
2 2 6 9 ...
3 3 7 10 ...
Now, I want to append some columns in between those "Something" column names, for which I have used this code:-
j = 1
for i in range(2, 51):
if i % 2 != 0 and i != 4:
df.insert(i, f"% Difference {j}", " ")
j += 1
where df is the dataframe. Now what happens is that the columns do get inserted but like this:-
0 1 Difference 1 2 ...
0 Index Something NaN Something2 ...
1 1 5 NaN 8 ...
2 2 6 NaN 9 ...
3 3 7 NaN 10 ...
whereas what I wanted was this:-
0 1 2 3 ...
0 Index Something Difference 1 Something2 ...
1 1 5 NaN 8 ...
2 2 6 NaN 9 ...
3 3 7 NaN 10 ...
Edit 1 Using jezrael's logic:-
df.columns = df.iloc[0].tolist()
df = df.iloc[1:].reset_index(drop = True)
print(df)
The output of that is still this:-
0 1 2 ...
0 Index Something Something2 ...
1 1 5 8 ...
2 2 6 9 ...
3 3 7 10 ...
Any ideas or suggestions as to where or how I am going wrong?
If your dataframe looks like what you've shown in your first code block, your column names aren't Index, Something, etc. - they're actually 0, 1, etc.
Pandas is seeing Index, Something, etc. as data in row 0, NOT as column names (which exist above row 0). So when you add a column with the name Difference 1, you're adding a column above row 0, which is where the range of integers is located.
A couple potential solutions to this:
If you'd like the actual column names to be Index, Something, etc. then the best solution is to import the data with that row as the headers. What is the source of your data? If it's a csv, make sure to NOT use the header = None option. If it's from somewhere else, there is likely an option to pass in a list of the column names to use. I can't think of any reason why you'd want to have a range of integer values as your column names rather than the more descriptive names that you have listed.
Alternatively, you can do what #jezrael suggested and convert your first row of data to column names then delete that data row. I'm not sure why their solution isn't working for you, since the code seems to work fine in my testing. Here's what it's doing:
df.columns = df.iloc[0].tolist()
df.columns tells pandas what to (re)name the columns of the dataframe. df.iloc[0].tolist() creates a list out of the first row of data, which in your case is the column names that you actually want.
df = df.iloc[1:].reset_index(drop = True)
This grabs the 2nd through last rows of data to recreate the dataframe. So you have new column names based on the first row, then you recreate the dataframe starting at the second row. The .reset_index(drop = True) isn't totally necessary to include. That just restarts your actual data rows with an index value of 0 rather than 1.
If for some reason you want to keep the column names as they currently exist (as integers rather than labels), you could do something like the following under the if statement in your for loop:
df.insert(i, i, np.nan, allow_duplicates = True)
df.iat[0, i] = f"%Difference {j}"
df.columns = np.arange(len(df.columns))
The first line inserts a column with an integer label filled with NaN values to start with (assuming you have numpy imported). You need to allow duplicates otherwise you'll get an error since the integer value will be the name of a pre-existing column
The second line changes the value in the 1st row of the newly-created column to what you want.
The third line resets the column names to be a range of integers like you had to start with.
As #jezrael suggested, it seems like you might be a little unclear about the difference between column names, indices, and data rows and columns. An index is its own thing, so it's not usually necessary to have a column named Index like you have in your dataframe, especially since that column has the same values in it as the actual index. Clarifying those sorts of things at import can help prevent a lot of hassle later on, so I'd recommend taking a good look at your data source to see if you can create a clearer dataframe to start with!
I want to append some columns in between those "Something" column names
No, there are no columns names Something, for it need set first row of data to columns names:
print (df.columns)
Int64Index([0, 1, 2], dtype='int64')
print (df.iloc[0].tolist())
['Index', 'Something', 'Something2']
df.columns = df.iloc[0].tolist()
df = df.iloc[1:].reset_index(drop=True)
print (df)
Index Something Something2
0 1 5 8
1 2 6 9
2 3 7 10
print (df.columns)
Index(['Index', 'Something', 'Something2'], dtype='object')
Then your solution create columns Difference, but output is different - no columns 0,1,2,3.
I have a dataset where I need to match the column A and fetch its corresponding next value in next column B.
For example , I have to check if 1 is matched in column A, If true then print "First Page"
Similarly for all the values in column A has to be matched with say X , if true, then print its next value in column B.
Example:
By using df.iloc you can can get the row or column you want by index.
By using mask you can filter the data frame to get the row you want (where column a == some value) and take the value in the second column by df.iloc[0,1].
import pandas as pd
d = {'col1': [1, 2,3,4], 'col2': [4,3,2,1]}
df = pd.DataFrame(data=d)
df
col1 col2
0 1 4
1 2 3
2 3 2
3 4 1
# a is the value in the first column and df is the data frame
def a2b(a,df):
return df[df.iloc[:,0]==a].iloc[0,1]
a2b(2,df)
returns 3
I wanted to see how to express counts of unique values of column B for each unique value in column A where corresponding column C >0.
df:
A B C
1 10 0
1 12 3
2 3 1
I tried this but its missing the where clause to filter for C>0. How do I add it?
df.groupby(['A'])['B'].apply(lambda b : b.astype(int).nunique())
Let's first start by creating the dataframe that OP mentions in the question
import pandas as pd
df = pd.DataFrame({'A': [1,1,2], 'B': [10,12,3], 'C': [0,3,1]})
Now, in order to achieve what OP wants, there are various options. One of the ways is by selecting the df where column C is greater than 0, then use pandas.DataFrame.groupby to group by column A. Finally, use pandas.DataFrame.nunique to count the unique values of column B. In one line it would look like the following
count = df[df['C'] > 0].groupby('A')['B'].nunique()
[Out]:
A
1 1
2 1
If one wants to sum every number in of unique items that satisfy the condition, from the dataframe count, one can simply do
count = count.sum()
[Out]:
2
Assuming one wants to do everything in one line, one can use pandas.DataFrame.sum as
count = df[df['C'] > 0].groupby('A')['B'].nunique().sum()
[Out]:
2
For example, I have a dataframe called dat, then I want to apply a function on each column of the dataframe, if the return value is Ture, then keep this column and turn to next column, if the return value is False, then drop this column and turn to next column.
I know I can write a for loop to do this, but is there a efficient way to do this?
You could do it like this using boolean index on df.columns:
I want to drop all columns where the 'sum' for simplicity is greater than 50
df = pd.DataFrame({'A':[2,4,6,8],'B':[101,102,102,102]})
r = df.apply(np.sum) # applies the sum function to all columns
c = r <= 50 #create boolean test for columns
df[c[c].index] #Use boolea indexing to get columns and column filter for dataframe
Output:
A
0 2
1 4
2 6
3 8
Updating an old answer:
df.loc[:, df.sum() <= 50]
I have a dataframe with millions of rows with unique indexes and a column('b') that has several repeated values.
I would like to generate a dataframe without the duplicated data but I do not want to lose the index information. I want the new dataframe to have an index that is a concatenation of the indexes ("old_index1,old_index2") where 'b' had duplicated values but remains unchanged for rows where 'b' had unique values. The values of the 'b' column should remain unchanged like in a keep=first strategy. Example below.
Input dataframe:
df = pd.DataFrame(data = [[1,"non_duplicated_1"],
[2,"duplicated"],
[2,"duplicated"],
[3,"non_duplicated_2"],
[4,"non_duplicated_3"]],
index=['one','two','three','four','five'],
columns=['a','b'])
desired output:
a b
one 1 non_duplicated_1
two,three 2 duplicated
four 3 non_duplicated_2
five 4 non_duplicated_3
The actual dataframe is quite large so I would like to avoid non-vectorized operations.
I am finding this surprisingly difficult...Any ideas?
You can use transform on the index column (after you use reset_index). Then, drop duplicates in column b:
df.index = df.reset_index().groupby('b')['index'].transform(','.join)
df.drop_duplicates('b',inplace=True)
>>> df
a b
index
one 1 non_duplicated_1
two,three 2 duplicated
four 3 non_duplicated_2
five 4 non_duplicated_3
Setup
dct = {'index': ','.join, 'a': 'first'}
You can reset_index before using groupby, although it's unclear to me why you want this:
df.reset_index().groupby('b', as_index=False, sort=False).agg(dct).set_index('index')
b a
index
one non_duplicated_1 1
two,three duplicated 2
four non_duplicated_2 3
five non_duplicated_3 4