Pandas: Create new dataframe based on existing dataframe - python

What is the most elegant way to create a new dataframe from an existing dataframe by (1) selecting only certain columns and (2) renaming them at the same time?
For instance, I have the following dataframe, where I want to pick columns B, D and F and rename them to X, Y and Z:
base dataframe
A B C D E F
1 2 3 4 5 6
1 2 3 4 5 6
new dataframe
X Y Z
2 4 6
2 4 6

You can select and rename the columns in one line:
df2 = df[['B', 'D', 'F']].rename({'B': 'X', 'D': 'Y', 'F': 'Z'}, axis=1)
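A minimal, self-contained sketch of the same idea, rebuilding the example frame from the question. The name mapping is my own; it drives both the selection and the rename, so the two cannot drift apart.

```python
import pandas as pd

# Example frame from the question: columns A..F, two identical rows.
df = pd.DataFrame([[1, 2, 3, 4, 5, 6], [1, 2, 3, 4, 5, 6]],
                  columns=list('ABCDEF'))

# One dict holds both the columns to keep (keys) and their new names (values).
mapping = {'B': 'X', 'D': 'Y', 'F': 'Z'}

# Select by the dict's keys, then rename with the same dict.
df2 = df[list(mapping)].rename(columns=mapping)
print(df2)
```

Because rename returns a new object, df itself keeps its original column names.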

Slightly more general selection of every other column:
df = pd.DataFrame({'A':[1,2,3], 'B':[4,5,6],
                   'C':[7,8,9], 'D':[10,11,12]})
df_half = df.iloc[:, ::2]
with df_half being:
A C
0 1 7
1 2 8
2 3 9
You can then use the rename method mentioned in the answer by @G. Anderson or directly assign to the columns:
df_half.columns = ['X','Y']
returning:
X Y
0 1 7
1 2 8
2 3 9
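If you would rather not assign to .columns after the fact, one possible alternative is set_axis, which returns a relabelled object and can be chained onto the selection. This is only a sketch using the same sample data as above:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6],
                   'C': [7, 8, 9], 'D': [10, 11, 12]})

# Take every other column, then relabel the result in the same expression.
df_half = df.iloc[:, ::2].set_axis(['X', 'Y'], axis=1)
print(df_half)
```

The list passed to set_axis must have exactly as many labels as there are selected columns.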

Related

Dividing pandas dataframe into separate dataframes by rows

I have a dataframe, for example:
a b c
0 1 2
3 4 5
6 7 8
and I need to separate it by rows and create a new dataframe from each row.
I tried to iterate over the rows, and then for each row (which is a Series) I tried the command row.to_df(), but it gives me a weird result.
Basically I'm looking to create new dataframes such as:
a b c
0 1 2
a b c
3 4 5
a b c
6 7 8
You can simply iterate row-by-row and use .to_frame(). For example:
for _, row in df.iterrows():
    print(row.to_frame().T)
    print()
Prints:
a b c
0 0 1 2
a b c
1 3 4 5
a b c
2 6 7 8
You can try doing:
for _, row in df.iterrows():
    new_df = pd.DataFrame(row).T.reset_index(drop=True)
This will create a new DataFrame object from each row (Series Object) in the original DataFrame df.
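As a possible variation on the answers above, the one-row frames can also be collected in a list by slicing with iloc and a one-element list. iterrows hands back each row as a Series, so columns of mixed dtypes may be upcast to a common type, whereas slicing keeps each column's dtype. A sketch with the question's sample data:

```python
import pandas as pd

df = pd.DataFrame([[0, 1, 2], [3, 4, 5], [6, 7, 8]], columns=list('abc'))

# df.iloc[[i]] (note the inner list) returns a one-row DataFrame, not a Series.
frames = [df.iloc[[i]] for i in range(len(df))]

for frame in frames:
    print(frame)
    print()
```

Each element keeps its original index label; add .reset_index(drop=True) if every frame should start at 0.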

Pandas: How to merge columns containing the same name within a single data frame?

I have a dataframe extracted from an Excel file which I have manipulated to be in the following form (there are multiple rows, but this is reduced to make my question as clear as possible):
|A|B|C|A|B|C|
index 0: 1 2 3 4 5 6
As you can see there are repetitions of the column names. I would like to merge this dataframe to look like the following:
|A|B|C|
index 0: 1 2 3
index 1: 4 5 6
I have tried to use the melt function but have not had any success thus far.
import pandas as pd
df = pd.DataFrame([[1,2,3,4,5,6]], columns = ['A', 'B','C','A', 'B','C'])
df
A B C A B C
0 1 2 3 4 5 6
pd.concat(x for _, x in df.groupby(df.columns.duplicated(), axis=1))
A B C
0 1 2 3
0 4 5 6
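Note that, as far as I know, passing axis=1 to DataFrame.groupby is deprecated in recent pandas releases (2.1 onward), so the one-liner above may emit a warning or stop working. A sketch of one possible alternative that avoids it: number each occurrence of a column name, then stack the blocks that share an occurrence number. The names occurrence and pieces are my own.

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3, 4, 5, 6]],
                  columns=['A', 'B', 'C', 'A', 'B', 'C'])

# For every column, count how many times its name has already appeared:
# A, B, C, A, B, C  ->  0, 0, 0, 1, 1, 1
names = pd.Series(df.columns)
occurrence = names.groupby(names).cumcount()

# Pick the columns of each occurrence with a boolean mask (this works even
# though the labels are duplicated), then stack the blocks vertically.
pieces = [df.loc[:, (occurrence == k).to_numpy()]
          for k in sorted(occurrence.unique())]
result = pd.concat(pieces, ignore_index=True)
print(result)
```

With ignore_index=True the stacked rows are renumbered from 0 instead of repeating the original index.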

Pandas: remove old DataFrame from memory after groupby

value Group something
0 a 1 1
1 b 1 2
2 c 1 4
3 c 2 9
4 b 2 10
5 x 2 5
6 d 2 3
7 e 3 5
8 d 2 10
9 a 3 5
I want to select the last 3 rows of each group (from the above df) like the following, but perform the operation in place. I want to ensure that I am keeping only the new df object in memory after assignment. What would be an efficient way of doing it?
df = df.groupby('Group').tail(3)
The result should look like the following:
value Group something
0 a 1 1
1 b 1 2
2 c 1 4
5 x 2 5
6 d 2 3
7 e 3 5
8 d 2 10
9 a 3 5
N.B.: This question is related to "Keeping the last N duplicates in pandas".
df = df.groupby('Group').tail(3) is already an efficient way of doing it. Because you are overwriting the df variable, Python will take care of releasing the memory of the old dataframe, and you will only have access to the new one.
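A runnable sketch of that rebinding, with the question's data typed in by hand. After the assignment, the name df refers only to the filtered frame:

```python
import pandas as pd

df = pd.DataFrame({
    'value': ['a', 'b', 'c', 'c', 'b', 'x', 'd', 'e', 'd', 'a'],
    'Group': [1, 1, 1, 2, 2, 2, 2, 3, 2, 3],
    'something': [1, 2, 4, 9, 10, 5, 3, 5, 10, 5],
})

# Rebind the name: once nothing else references the old frame,
# Python is free to reclaim its memory.
df = df.groupby('Group').tail(3)
print(df)
```

This only helps if no other variable, list or closure still holds a reference to the original frame.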
Trying way too hard to guess what you want.
NOTE: using Pandas inplace argument where it is available is NO guarantee that a new DataFrame won't be created in memory. In fact, it may very well create a new DataFrame in memory and replace the old one behind the scenes.
from collections import defaultdict
def f(s):
    c = defaultdict(int)
    for i, x in zip(s.index[::-1], s.values[::-1]):
        c[x] += 1
        if c[x] > 3:
            yield i
df.drop([*f(df.Group)], inplace=True)
df
value Group something
0 a 1 1
1 b 1 2
2 c 1 4
5 x 2 5
6 d 2 3
7 e 3 5
8 d 2 10
9 a 3 5
Your answer is already in the post. However, as said earlier in the comments, you are overwriting the existing df, so to avoid that, assign the result to a new variable name like below:
new_df = df.groupby('Group').tail(3)
However, out of curiosity, if you are not concerned about the groupby and are only looking for the last N lines of the df, you can do it like below:
df[-2:] # last 2 rows

Modifying DataFrames in loop

Given this data frame:
import pandas as pd
df=pd.DataFrame({'A':[1,2,3],'B':[4,5,6],'C':[7,8,9]})
df
A B C
0 1 4 7
1 2 5 8
2 3 6 9
I'd like to create 3 new data frames; one from each column.
I can do this one at a time like this:
a=pd.DataFrame(df[['A']])
a
A
0 1
1 2
2 3
But instead of doing this for each column, I'd like to do it in a loop.
Here's what I've tried:
a=b=c=df.copy()
dfs=[a,b,c]
fields=['A','B','C']
for d,f in zip(dfs,fields):
    d=pd.DataFrame(d[[f]])
...but when I then print each one, I get the whole original data frame as opposed to just the column of interest.
a
A B C
0 1 4 7
1 2 5 8
2 3 6 9
Update:
My actual data frame will have some columns that I do not need and the columns will not be in any sort of order, so I need to be able to get the columns by name.
Thanks in advance!
A simple list comprehension should be enough.
In [68]: df_list = [df[[x]] for x in df.columns]
Printing out the list, this is what you get:
In [69]: for d in df_list:
    ...:     print(d)
    ...:     print('-' * 5)
    ...:
A
0 1
1 2
2 3
-----
B
0 4
1 5
2 6
-----
C
0 7
1 8
2 9
-----
Each element in df_list is its own data frame, corresponding to one column of the original. Furthermore, you don't even need fields; use df.columns instead.
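Since the update to the question asks for columns by name, and only some of them, a dict keyed by column name may be more convenient than a list. A sketch under that assumption; the wanted list and the frames name are hypothetical:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]})

# Hypothetical subset of columns, in any order.
wanted = ['C', 'A']

# One single-column DataFrame per requested name; .copy() makes each
# result independent of the original frame.
frames = {name: df[[name]].copy() for name in wanted}
print(frames['A'])
```

Looking up frames['A'] then replaces the separate variables a, b and c.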
Or you can try this instead of creating copies of df. This method stores each result in its own variable as a single DataFrame, not in a list. However, I think saving the DataFrames in a list is better.
dfs=['a','b','c']
fields=['A','B','C']
variables = locals()
for d,f in zip(dfs,fields):
    variables["{0}".format(d)] = df[[f]]
a
Out[743]:
A
0 1
1 2
2 3
b
Out[744]:
B
0 4
1 5
2 6
c
Out[745]:
C
0 7
1 8
2 9
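You can use iloc to select by position (loc expects labels, so df.loc[:, 0] fails when the columns are named 'A', 'B', 'C'):
a = df.iloc[:, [0]]
and then loop through like:
for i in range(df.columns.size):
    dfs[i] = df.iloc[:, [i]]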

Renaming tuple column name in dataframe

I am new to python and pandas. I have attached a picture of a pandas dataframe,
I need to know how I can fetch data from the last column and how to rename the last column.
You can use:
df = df.rename(columns = {df.columns[-1] : 'newname'})
Or:
df.columns = df.columns[:-1].tolist() + ['new_name']
It seems the solution:
df.columns.values[-1] = 'newname'
is buggy, because after the rename pandas functions return weird errors.
To fetch data from the last column, you can select by position with iloc:
s = df.iloc[:,-1]
And after rename:
s1 = df['newname']
print (s1)
Sample:
df = pd.DataFrame({'R':[7,8,9],
'T':[1,3,5],
'E':[5,3,6],
('Z', 'a'):[7,4,3]})
print (df)
E T R (Z, a)
0 5 1 7 7
1 3 3 8 4
2 6 5 9 3
s = df.iloc[:,-1]
print (s)
0 7
1 4
2 3
Name: (Z, a), dtype: int64
df.columns = df.columns[:-1].tolist() + ['new_name']
print (df)
E T R new_name
0 5 1 7 7
1 3 3 8 4
2 6 5 9 3
df = df.rename(columns = {('Z', 'a') : 'newname'})
print (df)
E T R newname
0 5 1 7 7
1 3 3 8 4
2 6 5 9 3
s = df['newname']
print (s)
0 7
1 4
2 3
Name: newname, dtype: int64
df.columns.values[-1] = 'newname'
s = df['newname']
print (s)
>KeyError: 'newname'
fetch data from the last column
Retrieving the last column using df.iloc[:,-1] as suggested by other answers works fine only when it is indeed the last column.
However, using absolute column positions like -1 is not a stable solution, i.e. if you add some other column, your code will break.
A stable, generic approach
First of all, make sure all your column names are strings:
# rename columns
df.columns = [str(s) for s in df.columns]
# access column by name
df['(vehicle_id, reservation_count)']
rename the last column
It is preferable to have similar column names for all columns, without brackets in them. This makes your code more readable and your dataset easier to use:
# access column by name
df['vehicle_id_reservation_count']
This is a straightforward conversion on all columns that are named by a tuple:
# rename columns
def rename(col):
    if isinstance(col, tuple):
        col = '_'.join(str(c) for c in col)
    return col
df.columns = [rename(col) for col in df.columns]
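To see the conversion end to end, here is a small sketch on the sample frame used in the other answers, with one tuple-named column. The helper name flatten is my own; it does the same thing as rename above.

```python
import pandas as pd

df = pd.DataFrame({'R': [7, 8, 9],
                   'T': [1, 3, 5],
                   'E': [5, 3, 6],
                   ('Z', 'a'): [7, 4, 3]})

def flatten(col):
    # Join the parts of a tuple label with underscores; leave other labels alone.
    if isinstance(col, tuple):
        return '_'.join(str(part) for part in col)
    return col

df.columns = [flatten(col) for col in df.columns]

# The former ('Z', 'a') column is now reachable by a plain string.
print(df['Z_a'])
```

After this, every column can be accessed by name, regardless of its position in the frame.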
You can drop the last column and reassign it with a different name.
This isn't technically renaming the column. However, I think it's intuitive.
Using @jezrael's setup
df = pd.DataFrame({'R':[7,8,9],
'T':[1,3,5],
'E':[5,3,6],
('Z', 'a'):[7,4,3]})
print(df)
R T E (Z, a)
0 7 1 5 7
1 8 3 3 4
2 9 5 6 3
How can I fetch the last column?
You can use iloc
df.iloc[:, -1]
0 7
1 4
2 3
Name: (Z, a), dtype: int64
You can rename the column after you've extracted it
df.iloc[:, -1].rename('newcolumn')
0 7
1 4
2 3
Name: newcolumn, dtype: int64
To rename it within the dataframe, there are a great number of ways. To continue with the theme that I've started, namely, fetching the column, then renaming it:
option 1
start by dropping the last column with iloc[:, :-1]
use join to add the renamed column referenced above
df.iloc[:, :-1].join(df.iloc[:, -1].rename('newcolumn'))
R T E newcolumn
0 7 1 5 7
1 8 3 3 4
2 9 5 6 3
option 2
Or we can use assign to put it back and save the rename
df.iloc[:, :-1].assign(newname=df.iloc[:, -1])
R T E newname
0 7 1 5 7
1 8 3 3 4
2 9 5 6 3
For changing the column name:
columns = df.columns.tolist()
columns[-1] = "Column name"
df.columns = columns
For fetching data from the dataframe:
You can use the loc and iloc methods. (The older ix method accepted both positions and labels, but it was deprecated and then removed in pandas 1.0.)
loc fetches values by label
iloc fetches values by integer position
Learn about loc and iloc
http://pandas.pydata.org/pandas-docs/stable/dsintro.html#indexing-selection
Learn more about indexing and selecting data
http://pandas.pydata.org/pandas-docs/stable/indexing.html
