Rename Dataframe column based on column index - python

Is there a built-in function to rename a pandas DataFrame column by index?
I thought I knew the names of my column headers, but it turns out the second column has some hexadecimal characters in it. I will likely come across this issue with column 2 in the future based on the way I receive my data, so I cannot hard-code those specific hex characters into a dataframe.rename() call.
Is there a function, which would appropriately be named rename_col_by_index(), that I have not been able to find?
Ex:
>>> df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
>>> df.rename_col_by_index(1, 'new_name')
>>> df
   a  new_name
0  1         3
1  2         4

@MaxU's answer is better:
df.rename(columns={"col1": "New name"})
More in the docs.

UPDATE: thanks to @Vincenzzzochi:

In [138]: df.rename(columns={df.columns[1]: 'new'})
Out[138]:
   a  new  c
0  1    3  5
1  2    4  6

In [140]: df
Out[140]:
   a  b  c
0  1  3  5
1  2  4  6
or a bit more flexible:

In [141]: mapping = {df.columns[0]: 'new0', df.columns[1]: 'new1'}

In [142]: df.rename(columns=mapping)
Out[142]:
   new0  new1  c
0     1     3  5
1     2     4  6
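There is no built-in rename_col_by_index(), but the mapping trick above is easy to wrap in a small helper (a sketch; the function name is invented here to match the question):

def rename_col_by_index(df, idx, new_name):
    # Return a copy of df with the column at position idx renamed
    return df.rename(columns={df.columns[idx]: new_name})

df = rename_col_by_index(df, 1, 'new_name')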

Related

Appending rows to existing pandas dataframe

I have a pandas dataframe df1:
   a  b
0  1  2
1  3  4
I have another dataframe in the form of a dictionary
dictionary = {'2' : [5, 6], '3' : [7, 8]}
I want to append the dictionary values as rows in dataframe df1. I am using pandas.DataFrame.from_dict() to convert the dictionary into a dataframe. The constraint is that, when I do this, I cannot provide any value for the columns argument of from_dict().
So when I try to concatenate the two dataframes, pandas adds the contents of the new dataframe as new columns. I do not want that. The final output I want is in the format:
   a  b
0  1  2
1  3  4
2  5  6
3  7  8
Can someone tell me how to do this in the least painful way?
Use concat with the help of pd.DataFrame.from_dict, setting the columns of df1 during the conversion:
out = pd.concat([df1,
                 pd.DataFrame.from_dict(dictionary, orient='index',
                                        columns=df1.columns)])
Output:
   a  b
0  1  2
1  3  4
2  5  6
3  7  8
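Note that the dictionary keys are strings, so with from_dict(orient='index') the appended index labels come out as '2' and '3' rather than integers. If integer labels are wanted to match df1, one option (a minor addition to the approach above) is to cast the index before concatenating:

new_rows = pd.DataFrame.from_dict(dictionary, orient='index', columns=df1.columns)
new_rows.index = new_rows.index.astype(int)  # '2', '3' -> 2, 3
out = pd.concat([df1, new_rows])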
Another possible solution, which uses numpy.vstack:
import numpy as np

pd.DataFrame(np.vstack([df1.values,
                        np.array(list(dictionary.values()))]),
             columns=df1.columns)
Output:
   a  b
0  1  2
1  3  4
2  5  6
3  7  8
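For a handful of rows it is also possible to skip building a second DataFrame entirely and assign through .loc, which appends a row when the label does not exist yet (a sketch; it assumes the dictionary keys are meant to become the new integer index labels):

for key, row in dictionary.items():
    df1.loc[int(key)] = row  # enlargement: adds a new row labeled int(key)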

Pivot table based on the first value of the group in Pandas

I have the following DataFrame:
I'm trying to pivot it in pandas and achieve the following format:
Actually I tried the classical approach with pd.pivot_table(), but it did not work out:
pd.pivot_table(df, values='col2', index=[df.index], columns='col1')
Would appreciate some suggestions :) Thanks!
You can use pivot and then dropna for each column:
>>> df.pivot(columns='col1', values='col2').apply(lambda x: x.dropna().tolist()).astype(int)
col1  a  b  c
0     1  2  9
1     4  5  0
2     6  8  7
Another option is to create a Series of lists using groupby.agg; then construct a DataFrame:
out = (df.groupby('col1')['col2'].agg(list)
         .pipe(lambda x: pd.DataFrame(zip(*x), columns=x.index.tolist())))
Output:
   a  b  c
0  1  2  9
1  4  5  0
2  6  8  7
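Both answers above assume every group has the same number of rows. If that might not hold, a common, more defensive pattern is to number the rows within each group with groupby.cumcount and pivot on that (a sketch, assuming the positional order within each group is what should align the rows):

out = (df.assign(row=df.groupby('col1').cumcount())  # 0, 1, 2, ... per group
         .pivot(index='row', columns='col1', values='col2'))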

Pandas Sum & Count Across Only Certain Columns

I have just started learning pandas, and this is a very basic question. Believe me, I have searched for an answer, but can't find one.
Can you please run this python code?
import pandas as pd
df = pd.DataFrame({'A': [1, 0], 'B': [2, 4], 'C': [4, 4], 'D': [1, 4],
                   'count__4s_abc': [1, 2], 'sum__abc': [7, 8]})
df
How do I create column 'count__4s_abc' in which I want to count how many times the number 4 appears in just columns A-C? (While ignoring column D.)
How do I create column 'sum__abc' in which I want to sum the amounts in just columns A-C? (While ignoring column D.)
Thanks much for any help!
Using drop:
df.assign(
    count__4s_abc=df.drop('D', axis=1).eq(4).sum(axis=1),
    sum__abc=df.drop('D', axis=1).sum(axis=1)
)
Or explicitly choosing the 3 columns:
df.assign(
    count__4s_abc=df[['A', 'B', 'C']].eq(4).sum(axis=1),
    sum__abc=df[['A', 'B', 'C']].sum(axis=1)
)
Or using iloc to get the first 3 columns:
df.assign(
    count__4s_abc=df.iloc[:, :3].eq(4).sum(axis=1),
    sum__abc=df.iloc[:, :3].sum(axis=1)
)
All give
   A  B  C  D  count__4s_abc  sum__abc
0  1  2  4  1              1         7
1  0  4  4  4              2         8
One additional option:
In [158]: formulas = """
     ...: new_count__4s_abc = (A==4)*1 + (B==4)*1 + (C==4)*1
     ...: new_sum__abc = A + B + C
     ...: """

In [159]: df.eval(formulas)
Out[159]:
   A  B  C  D  count__4s_abc  sum__abc  new_count__4s_abc  new_sum__abc
0  1  2  4  1              1         7                  1             7
1  0  4  4  4              2         8                  2             8
The DataFrame.eval() method can be (though is not always) faster than regular Pandas arithmetic.
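Since A, B and C are adjacent, a label slice with .loc also works and avoids repeating the column list (this assumes the three columns really are contiguous and ordered A..C in the frame):

df.assign(
    count__4s_abc=df.loc[:, 'A':'C'].eq(4).sum(axis=1),  # count of 4s per row
    sum__abc=df.loc[:, 'A':'C'].sum(axis=1)              # row sum over A..C
)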

Pandas: Combination of two data frames

I have two data frames, old and new. Both have identical columns.
I want to, by the index:
Add rows to old that exist in new but not in old.
Update rows in old with the data in new.
Is there any efficient way of doing this in pandas? I found update(), which does exactly the second step. However, it doesn't add rows. I could do the first step using some set logic on the indices, but that does not appear to be efficient. What's the best way to do these two operations?
Example
old
   a  b
0  1  1
1  3  3
new
   a  b
1  1  2
2  1  2
result
   a  b
0  1  1
1  1  2
2  1  2
You could first find the indices common to both dataframes, assign the values of df2 to df1 at those indices, and then get the result with combine_first:

In [35]: df1
Out[35]:
   a  b
0  1  1
1  3  3

In [36]: df2
Out[36]:
   a  b
1  1  2
2  1  2

idx = df1.index.intersection(df2.index)
df1.loc[idx, :] = df2.loc[idx, :]
df1 = df1.combine_first(df2)

In [39]: df1
Out[39]:
   a  b
0  1  1
1  1  2
2  1  2
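In fact, since df2's values should take priority, both steps collapse into a single call with the operands swapped (a shortcut that assumes df2 has no NaN cells of its own, because combine_first fills only missing values):

df2.combine_first(df1)  # df2 wins where present, df1 fills the rest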
You can do the first step using the df.reindex() method, taking the union of both indices so that existing rows in old are not dropped:
old = old.reindex(index=old.index.union(new.index))
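Putting the two steps together, a minimal runnable sketch using the example data above (assuming, as stated, that old and new share identical columns):

import pandas as pd

old = pd.DataFrame({'a': [1, 3], 'b': [1, 3]})
new = pd.DataFrame({'a': [1, 1], 'b': [2, 2]}, index=[1, 2])

# Step 1: add the rows that exist in new but not in old;
# the newly created cells start as NaN, so int columns turn float here
old = old.reindex(index=old.index.union(new.index))

# Step 2: overwrite old's values wherever new has data (in place)
old.update(new)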

Remove lines in dataframe using a list in Pandas

It's a generic question about filtering a pandas dataframe using a list. The problem is the following:
I have a pandas dataframe df with a column field.
I have a list of banned fields, for example ban_field = ['field1', 'field2', 'field3'].
All elements of ban_field appear in df.field.
For the moment, to retrieve the dataframe without the banned fields, I proceed as follows:
for f in ban_field:
    df = df[df.field != f]
Is there a more pythonic way to proceed (in one line)?
Method #1: use isin and a boolean array selector:
In [47]: df = pd.DataFrame({"a": [2]*10, "field": range(10)})

In [48]: ban_field = [3, 4, 6, 7, 8]

In [49]: df[~df.field.isin(ban_field)]
Out[49]:
   a  field
0  2      0
1  2      1
2  2      2
5  2      5
9  2      9

[5 rows x 2 columns]
Method #2: use query:
In [51]: df.query("field not in @ban_field")
Out[51]:
   a  field
0  2      0
1  2      1
2  2      2
5  2      5
9  2      9

[5 rows x 2 columns]
You can remove them by using the isin function and the negation (~) operator:
df[~df.field.isin(ban_field)]
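If you instead have a whitelist of fields to keep, the same isin mask works without the negation (allowed here is a hypothetical list, not from the question):

allowed = ['field4', 'field5']  # hypothetical whitelist
df[df.field.isin(allowed)]      # keep only the allowed fields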
