Replace column name by Index - python

I have the below data in a Dataframe.
+----+------+----+------+
| Id | Name | Id | Name |
+----+------+----+------+
| 1  | A    | 1  | C    |
| 2  | B    | 2  | B    |
+----+------+----+------+
Though the column names repeat, this is really a comparison of the first two columns (old data) with the last two columns (new data).
I tried to rename the second-to-last column by appending _New to it, referencing it by index with the code below. Unfortunately, _New also gets appended to the first column.
df.rename(columns={df.columns[2]: df.columns[2] + '_New'}, inplace=True)
Here's the result I am getting using the above code.
+--------+------+--------+------+
| Id_New | Name | Id_New | Name |
+--------+------+--------+------+
| 1      | A    | 1      | C    |
| 2      | B    | 2      | B    |
+--------+------+--------+------+
My understanding is that it should add _New to only the second-to-last column. Below is the expected result.
+----+------+--------+------+
| Id | Name | Id_New | Name |
+----+------+--------+------+
| 1  | A    | 1      | C    |
| 2  | B    | 2      | B    |
+----+------+--------+------+
Is there any way to accomplish this?

The problem is that df.rename maps old labels to new labels, and both Id columns carry the same label, so both get renamed. Instead, you can rebuild the column list with a simple loop, using a dictionary to keep track of the increments. I generalized the logic here to handle an arbitrary number of duplicates:
cols = {}
new_cols = []
for c in df.columns:
    if c in cols:
        new_cols.append(f'{c}_New{cols[c]}')
        cols[c] += 1
    else:
        new_cols.append(c)
        cols[c] = 1
df.columns = new_cols
output:
   Id Name  Id_New1 Name_New1
0   1    A        1         C
1   2    B        2         B
If you really want Id_New, then Id_New2, etc., change:
new_cols.append(f'{c}_New{cols[c]}')
to
i = cols[c] if cols[c] != 1 else ''
new_cols.append(f'{c}_New{i}')
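If you prefer to avoid the explicit loop, the same numbering can be derived with groupby().cumcount() on the column labels. A minimal sketch of that idea (an alternative formulation, not the code above):
import pandas as pd

s = pd.Series(df.columns)
dup = s.groupby(s).cumcount()  # 0 for the first occurrence, then 1, 2, ...
df.columns = [f'{c}_New{i}' if i else c for c, i in zip(s, dup)]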

Related

Appending column values into new cell in the same row in Pandas dataframe

I have a CSV file with the columns name, sub_a, sub_b, sub_c, sub_d, segment and gender. I would like to create a new column classes containing, separated by commas, all the classes (sub-columns) that each student takes.
What would be the easiest way to accomplish this?
The result dataframe should look like this:
+------+-------+-------+-------+-------+---------+--------+---------------------+
| name | sub_a | sub_b | sub_c | sub_d | segment | gender | classes             |
+------+-------+-------+-------+-------+---------+--------+---------------------+
| john | 1     | 1     | 0     | 1     | 1       | 0      | sub_a, sub_b, sub_d |
+------+-------+-------+-------+-------+---------+--------+---------------------+
| mike | 1     | 0     | 1     | 1     | 0       | 0      | sub_a, sub_c, sub_d |
+------+-------+-------+-------+-------+---------+--------+---------------------+
| mary | 1     | 1     | 0     | 1     | 1       | 1      | sub_a, sub_b, sub_d |
+------+-------+-------+-------+-------+---------+--------+---------------------+
| fred | 1     | 0     | 1     | 0     | 0       | 0      | sub_a, sub_c        |
+------+-------+-------+-------+-------+---------+--------+---------------------+
Let us try dot: cast the sub columns to bool, then take the dot product with the column names (each with a separator appended) and strip the trailing separator. Since True * 'sub_a,' is 'sub_a,' and False * 'sub_a,' is '', the dot product concatenates exactly the names of the columns that are 1.
s = df.filter(like='sub')
df['classes'] = s.astype(bool).dot(s.columns + ',').str[:-1]
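A self-contained demo of the trick, with sample data reconstructed from the question (here using ', ' as the separator to match the expected output, hence the .str[:-2]):
import pandas as pd

df = pd.DataFrame({'name': ['john', 'fred'],
                   'sub_a': [1, 1], 'sub_b': [1, 0],
                   'sub_c': [0, 1], 'sub_d': [1, 0]})
s = df.filter(like='sub')
df['classes'] = s.astype(bool).dot(s.columns + ', ').str[:-2]
print(df['classes'].tolist())  # ['sub_a, sub_b, sub_d', 'sub_a, sub_c']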
You can use apply with axis=1.
For example, if your dataframe looks like:
df
   A_a  A_b  B_b  B_c
0    1    0    0    1
1    0    1    0    1
2    1    0    1    0
you can do
df['classes'] = df.apply(lambda x: ', '.join(df.columns[x==1]), axis = 1)
df
   A_a  A_b  B_b  B_c   classes
0    1    0    0    1  A_a, B_c
1    0    1    0    1  A_b, B_c
2    1    0    1    0  A_a, B_b
To apply it only to specific columns, you can filter first using loc:
# for your sample data
df_ = df.loc[:, 'sub_a':'sub_d']  # or df.loc[:, ['sub_a', 'sub_b', 'sub_c', 'sub_d']]
df['classes'] = df_.apply(lambda x: ', '.join(df_.columns[x == 1]), axis=1)
You could indeed iterate through the rows. However, you cannot add the classes to the DataFrame row by row, as all columns of a DataFrame need to be equally long. So the trick is to first generate the column and then add it later:
subjects = ['sub_a', 'sub_b', 'sub_c', 'sub_d']
classes_per_student = []  # the empty column
for _, student in df.iterrows():
    # first create a list of the classes taken by this student
    classes = [subj for subj in subjects if student[subj]]
    # create a single string
    classes = ', '.join(classes)
    # append to the column under construction
    classes_per_student.append(classes)
# and finally add the column to the DataFrame
df['classes'] = classes_per_student
You can use apply only on the sub-columns to apply a lambda function that will join the names of the sub-columns where the values of the columns equal 1:
sub_cols = ['sub_a', 'sub_b', 'sub_c', 'sub_d']
df['classes'] = df[sub_cols].apply(lambda x: ', '.join(df[sub_cols].columns[x == 1]), axis=1)
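Since the lambda receives each row as a Series whose index holds the column names, a small variant avoids re-referencing df inside the lambda:
df['classes'] = df[sub_cols].apply(lambda x: ', '.join(x.index[x == 1]), axis=1)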

Creating a new conditional column in a dataframe with values from multiple columns

I need help with creating a conditional column using values from multiple other columns with pandas.
Column1 | Column2 | Column3 | Column4
   1    |    2    |    5    |    A
   2    |    3    |    4    |    B
   3    |    4    |    3    |    C
   4    |    5    |    2    |    B
   5    |    1    |    1    |    C
What I want is to create a new column such that, if Column4 equals A, the new column takes the value from Column1 (and similarly Column2 for B, Column3 for C, as in my function below), so the final dataframe would look like this:
Column1 | Column2 | Column3 | Column4 | column5
   1    |    2    |    5    |    A    |    1
   2    |    3    |    4    |    B    |    3
   3    |    4    |    3    |    C    |    3
   4    |    5    |    2    |    B    |    5
   5    |    1    |    1    |    C    |    1
Here is what I have tried so far, but I keep getting an error saying the object data.column1(x) is not callable:
def column5(x):
    if x['column4'] == 'A':
        return data.column1(x)
    elif x['column4'] == 'B':
        return data.column2(x)
    elif x['column4'] == 'C':
        return data.column3(x)
You got the error because data.column1 is a pandas.Series; you cannot call it like a function with data.column1(x).
Also, your desired value differs for each row based on the value of Column4, so you will need either a loop or, better, pandas' apply() function.
Try this:
# map value to column
val_to_col = {
    'A': 'Column1',
    'B': 'Column2',
    'C': 'Column3',
}
# get data from col, based on row[col4]
df['column5'] = df.apply(lambda row: row[val_to_col.get(row['Column4'])], axis=1)
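If the frame is large, apply with axis=1 can be slow. A vectorized alternative is to translate Column4 into column positions and pick the values with NumPy fancy indexing; a minimal sketch, with the sample frame reconstructed from the question:
import numpy as np
import pandas as pd

df = pd.DataFrame({'Column1': [1, 2, 3, 4, 5],
                   'Column2': [2, 3, 4, 5, 1],
                   'Column3': [5, 4, 3, 2, 1],
                   'Column4': list('ABCBC')})
# positional index of the source column for each row
pos = df['Column4'].map({'A': 0, 'B': 1, 'C': 2}).to_numpy()
df['column5'] = df[['Column1', 'Column2', 'Column3']].to_numpy()[np.arange(len(df)), pos]
print(df['column5'].tolist())  # [1, 3, 3, 5, 1]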

What is the smartest way to get the rest of a pandas.DataFrame?

Here is a pandas.DataFrame df.
| Foo | Bar |
|-----|-----|
| 0   | A   |
| 1   | B   |
| 2   | C   |
| 3   | D   |
| 4   | E   |
I selected some rows and defined a new dataframe with df1 = df.iloc[[1,3], :].
| Foo | Bar |
|-----|-----|
| 1   | B   |
| 3   | D   |
What is the best way to get the rest of df, like the following?
| Foo | Bar |
|-----|-----|
| 0   | A   |
| 2   | C   |
| 4   | E   |
Fast set-based diffing.
df2 = df.loc[df.index.difference(df1.index)]
df2
   Foo Bar
0    0   A
2    2   C
4    4   E
Works as long as your index values are unique.
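Under the same uniqueness assumption, the operation can also be phrased as dropping the selected labels:
# drop the rows whose index labels appear in df1
df2 = df.drop(df1.index)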
If I'm understanding correctly, you want to take a dataframe, select some rows from it into df1, and then select the rows of df that are not in df1.
If that's the case, you can do df[~df.isin(df1)].dropna().
df[x] subsets the dataframe df based on the boolean mask x.
~df.isin(df1) is the negation of df.isin(df1), which evaluates to True for elements of df that also appear in df1 (at the same index and column).
.dropna() drops rows with a NaN value. In this case the rows we don't want were coerced to NaN by the masking expression above, so we get rid of those.
I assume that Foo can be treated as a unique index.
First select Foo values from df1:
idx = df1['Foo'].values
Then filter your original dataframe:
df2 = df[~df['Foo'].isin(idx)]

Creating new complex index with pandas Dataframe columns

I'm trying to concatenate the columns 'A' and 'C' in a Dataframe like the following to use it as a new Index:
   A |  B  | C | ...
--------------------
0  5 | djn | 0 | ...
1  5 | vlv | 1 | ...
2  5 | bla | 2 | ...
3  5 | ses | 3 | ...
4  5 | dug | 4 | ...
The desired result would be a Dataframe which is similar to the following result:
       A |  B  | C | ...
------------------------
05000  5 | djn | 0 | ...
05001  5 | vlv | 1 | ...
05002  5 | bla | 2 | ...
05003  5 | ses | 3 | ...
05004  5 | dug | 4 | ...
I've searched my eyes off; does someone know how to manipulate a dataframe to get such a result?
import pandas as pd

# dummy up a dataframe
cf = pd.DataFrame()
cf['A'] = 5 * [5]
cf['C'] = range(5)
cf['B'] = list('qwert')
# put the two columns together into a new one, zero-padded to five digits
cf['D'] = (1000 * cf['A'] + cf['C']).astype(str).str.zfill(5)
# use it as the index
cf.index = cf['D']
# we don't need it as a column
cf.drop('D', axis=1, inplace=True)
print(cf.to_csv())
D,A,C,B
05000,5,0,q
05001,5,1,w
05002,5,2,e
05003,5,3,r
05004,5,4,t
That said, I suspect you'd be safer with multi-indexing (what if the values in C go above 999...), or with sorting or grouping on multiple columns.
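For reference, the multi-index route is a one-liner; a sketch, applied to the dummy frame above before the D manipulation:
# keep A and C as a two-level index instead of fusing them into a string
cf_multi = cf.set_index(['A', 'C'])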

Python Pandas: count unique values in row [duplicate]

So I have a dataframe with some values. This is my dataframe:
| in | x | y | z |
+----+---+---+---+
| 1  | a | a | b |
| 2  | a | b | b |
| 3  | a | b | c |
| 4  | b | b | c |
I would like to get the number of unique values in each row, and the number of values that are not equal to the value in column x. The result should look like this:
| in | x | y | z   | count of not x | unique |
+----+---+---+-----+----------------+--------+
| 1  | a | a | b   | 1              | 2      |
| 2  | a | b | b   | 2              | 2      |
| 3  | a | b | c   | 2              | 3      |
| 4  | b | b | nan | 0              | 1      |
I could come up with some dirty solutions here, but there must be some elegant way of doing this. My mind is circling around drop_duplicates (which does not work on a Series); converting to an array and using .unique(); df.iterrows(), which I want to avoid; and .apply on each row.
Here are solutions using apply.
df['count of not x'] = df.apply(lambda x: (x[['y','z']] != x['x']).sum(), axis=1)
df['unique'] = df.apply(lambda x: x[['x','y','z']].nunique(), axis=1)
A non-apply solution for getting count of not x:
df['count of not x'] = (~df[['y','z']].isin(df['x'])).sum(1)
Can't think of anything great for unique. This uses apply, but may be faster, depending on the shape of the data.
df['unique'] = df[['x','y','z']].T.apply(lambda x: x.nunique())
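For the unique count, DataFrame.nunique also accepts axis=1 (in pandas 0.20 and later), which avoids both the transpose and a hand-written apply:
df['unique'] = df[['x','y','z']].nunique(axis=1)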
