Python Pandas working with dataframes in functions

I have a DataFrame which I want to pass to a function, derive some information from and then return that information. Originally I set up my code like:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': [1,1,1,1,2,2,2,3,3,4,4,4],
    'B': [5,5,6,7,5,6,6,7,7,6,7,7],
    'C': [1,1,1,1,1,1,1,1,1,1,1,1]
})

def test_function(df):
    # this adds column D to the caller's DataFrame as well
    df['D'] = 0
    df.D = np.random.rand(len(df))
    grouped = df.groupby('A')
    df = grouped.first()
    df = df['D']
    return df

Ds = test_function(df)
print(df)
print(Ds)
Which returns:
A B C D
0 1 5 1 0.582319
1 1 5 1 0.269779
2 1 6 1 0.421593
3 1 7 1 0.797121
4 2 5 1 0.366410
5 2 6 1 0.486445
6 2 6 1 0.001217
7 3 7 1 0.262586
8 3 7 1 0.146543
9 4 6 1 0.985894
10 4 7 1 0.312070
11 4 7 1 0.498103
A
1 0.582319
2 0.366410
3 0.262586
4 0.985894
Name: D, dtype: float64
My thinking was: I don't want to copy my large dataframe, so I will add a working column to it and then just return the information I want without affecting the original dataframe. This of course doesn't work: because I didn't copy the dataframe, adding a column inside the function adds it to the original. Currently I'm doing something like:
add column
results = Derive information
delete column
return results
which feels a bit kludgy to me, but I can't think of a better way to do it without copying the dataframe. Any suggestions?

If you do not want to add a column to your original DataFrame, you could create an independent Series and apply the groupby method to the Series instead:
def test_function(df):
    ser = pd.Series(np.random.rand(len(df)))
    grouped = ser.groupby(df['A'])
    return grouped.first()

Ds = test_function(df)
yields
A
1 0.017537
2 0.392849
3 0.451406
4 0.234016
dtype: float64
Thus, test_function does not modify df at all. Notice that ser.groupby can be passed a sequence of values (such as df['A']) by which to group, instead of just the name of a column.

Related

How to apply .astype() method to a dataframe in Python?

I want to convert multiple columns in a dataframe (pandas) to the type "category" using the method .astype. Here is my code:
df['Field_1'].astype('category').cat.codes
works however
categories = df.select_types('objects')
categories['Field_1'].cat.codes
doesn't.
Would someone please tell me why?
In general, the question is how to apply a method (.astype) to a whole dataframe. I know how to apply a method to a single column, but applying it to a dataframe hasn't been successful, even with a for loop, since the for loop returns a series and the method .cat.codes is not applicable to the series.

I think you need to process each column separately with DataFrame.apply and a lambda function. Your code failed because .cat.codes is implemented for a Series, not for a DataFrame:
df = pd.DataFrame({
'A':list('acbdac'),
'B':[4,5,4,5,5,4],
'C':[7,8,9,4,2,3],
'D':list('dddbbb')
})
cols = df.select_dtypes('object').columns
df[cols] = df[cols].apply(lambda x: x.astype('category').cat.codes)
print (df)
A B C D
0 0 4 7 1
1 2 5 8 1
2 1 4 9 1
3 3 5 4 0
4 0 5 2 0
5 2 4 3 0
A similar idea is to convert all the columns to categorical in a first step with DataFrame.astype, although I am not sure the output is always the same:
cols = df.select_dtypes('object').columns
df[cols] = df[cols].astype('category').apply(lambda x: x.cat.codes)
print (df)
A B C D
0 0 4 7 1
1 2 5 8 1
2 1 4 9 1
3 3 5 4 0
4 0 5 2 0
5 2 4 3 0
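For this sample data, the two variants above can be compared directly. The sketch below is only a check on this one frame, not a general proof. It selects the non-numeric columns with select_dtypes(exclude='number') instead of 'object'; that substitution is an assumption made here so the selection does not depend on how a given pandas version stores strings.

```python
import pandas as pd

df = pd.DataFrame({
    'A': list('acbdac'),
    'B': [4, 5, 4, 5, 5, 4],
    'C': [7, 8, 9, 4, 2, 3],
    'D': list('dddbbb'),
})

# Assumption: pick every non-numeric column (A and D here).
cols = df.select_dtypes(exclude='number').columns

# Variant 1: convert and encode one column at a time.
per_column = df[cols].apply(lambda x: x.astype('category').cat.codes)

# Variant 2: convert the whole sub-frame first, then encode each column.
whole_frame = df[cols].astype('category').apply(lambda x: x.cat.codes)

print(per_column['A'].tolist())        # [0, 2, 1, 3, 0, 2]
print(per_column.equals(whole_frame))  # True
```

Neither variant modifies df until the result is assigned back with df[cols] = ....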

Pandas: How to merge columns containing the same name within a single data frame?

I have a dataframe extracted from an Excel file which I have manipulated to be in the following form (there are multiple rows, but this is reduced to make my question as clear as possible):
|A|B|C|A|B|C|
index 0: 1 2 3 4 5 6
As you can see there are repetitions of the column names. I would like to merge this dataframe to look like the following:
|A|B|C|
index 0: 1 2 3
index 1: 4 5 6
I have tried to use the melt function but have not had any success thus far.

import pandas as pd
df = pd.DataFrame([[1,2,3,4,5,6]], columns = ['A', 'B','C','A', 'B','C'])
df
A B C A B C
0 1 2 3 4 5 6
pd.concat(x for _, x in df.groupby(df.columns.duplicated(), axis=1))
A B C
0 1 2 3
0 4 5 6
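Grouping along axis=1 has been deprecated in recent pandas releases, so the one-liner above may warn or fail depending on the installed version. A minimal sketch that avoids groupby is shown below. It assumes each column name appears at most twice, as in the question, and it uses ignore_index=True to get the 0, 1 row labels the question asks for.

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3, 4, 5, 6]],
                  columns=['A', 'B', 'C', 'A', 'B', 'C'])

# True for every column whose name has already been seen further left.
dup = df.columns.duplicated()

# Stack the first occurrences on top of the repeated ones.
# Assumption: each name occurs at most twice.
merged = pd.concat([df.loc[:, ~dup], df.loc[:, dup]], ignore_index=True)

print(merged.values.tolist())  # [[1, 2, 3], [4, 5, 6]]
```

With more than two repetitions per name, the second piece would itself still contain duplicate column names, so this sketch would need to be applied repeatedly.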

How to assign values to multiple non existing columns in a pandas dataframe?

So what I want to do is to add columns to a dataframe and fill them (all rows respectively) with a single value.
import pandas as pd
import numpy as np
df = pd.DataFrame(np.array([[1,2],[3,4]]), columns = ["A","B"])
arr = np.array([7,8])
# this is what I would like to do
df[["C","D"]] = arr
# and this is what I want to achieve
# A B C D
# 0 1 2 7 8
# 1 3 4 7 8
# but it yields a "KeyError" sadly
# KeyError: "['C' 'D'] not in index"
I do know about the assign-functionality and how I would tackle this issue if I only were to add one column at once. I just want to know whether there is a clean and simple way to do this with multiple new columns as I was not able to find one.

This worked for me:
df[["C","D"]] = pd.DataFrame([arr], index=df.index)
Or join:
df = df.join(pd.DataFrame([arr], columns=['C','D'], index=df.index))
Or assign:
df = df.assign(**pd.Series(arr, index=['C','D']))
print (df)
A B C D
0 1 2 7 8
1 3 4 7 8

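You can use assign and pass it a dict. Note that this version writes arr down each new column (C and D both become [7, 8]), which differs from the layout requested in the question: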
df.assign(**dict(zip(['C','D'],[arr.tolist()]*2)))
Out[755]:
A B C D
0 1 2 7 7
1 3 4 8 8
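For the layout requested in the question (each new column filled with a single value), one option is to zip the new column names with the values and pass the resulting dict to assign. This is a minimal sketch; new_cols is a name introduced here for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.array([[1, 2], [3, 4]]), columns=["A", "B"])
arr = np.array([7, 8])

new_cols = ["C", "D"]  # illustrative name, not from the question

# zip pairs "C" with 7 and "D" with 8; assign broadcasts each scalar
# down the whole column and returns a new DataFrame.
result = df.assign(**dict(zip(new_cols, arr)))

print(result.values.tolist())  # [[1, 2, 7, 8], [3, 4, 7, 8]]
```

Because assign returns a new object, df itself stays unchanged unless the result is assigned back to it.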

Modifying DataFrames in loop

Given this data frame:
import pandas as pd
df=pd.DataFrame({'A':[1,2,3],'B':[4,5,6],'C':[7,8,9]})
df
A B C
0 1 4 7
1 2 5 8
2 3 6 9
I'd like to create 3 new data frames; one from each column.
I can do this one at a time like this:
a=pd.DataFrame(df[['A']])
a
A
0 1
1 2
2 3
But instead of doing this for each column, I'd like to do it in a loop.
Here's what I've tried:
a=b=c=df.copy()
dfs=[a,b,c]
fields=['A','B','C']
for d,f in zip(dfs,fields):
    d=pd.DataFrame(d[[f]])
...but when I then print each one, I get the whole original data frame as opposed to just the column of interest.
a
A B C
0 1 4 7
1 2 5 8
2 3 6 9
Update:
My actual data frame will have some columns that I do not need and the columns will not be in any sort of order, so I need to be able to get the columns by name.
Thanks in advance!

A simple list comprehension should be enough.
In [68]: df_list = [df[[x]] for x in df.columns]
Printing out the list, this is what you get:
In [69]: for d in df_list:
...: print(d)
...: print('-' * 5)
...:
A
0 1
1 2
2 3
-----
B
0 4
1 5
2 6
-----
C
0 7
1 8
2 9
-----
Each element in df_list is its own data frame, corresponding to one column of the original. Furthermore, you don't even need fields; use df.columns instead.
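Since the update to the question asks for columns to be picked by name, a dict keyed by column name may be more convenient than a list. The sketch below assumes a hypothetical list wanted that holds the names of interest, in any order.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]})

wanted = ['C', 'A']  # hypothetical: the columns needed, in any order

# df[[name]] (double brackets) keeps each result a one-column DataFrame.
frames = {name: df[[name]] for name in wanted}

print(list(frames))                 # ['C', 'A']
print(frames['A'].shape)            # (3, 1)
print(frames['C']['C'].tolist())    # [7, 8, 9]
```

Each frame is then looked up as frames['A'] rather than through a separate variable per column.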

Alternatively, instead of creating copies of df, you can bind each single-column DataFrame to its own variable name rather than collecting them in a list. However, I think storing the DataFrames in a list is better.
dfs=['a','b','c']
fields=['A','B','C']
# writing to locals() only takes effect at module level (e.g. in an interactive session)
variables = locals()
for d,f in zip(dfs,fields):
    variables["{0}".format(d)] = df[[f]]
a
Out[743]:
A
0 1
1 2
2 3
b
Out[744]:
B
0 4
1 5
2 6
c
Out[745]:
C
0 7
1 8
2 9

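You can use iloc to select by position (loc selects by label, so df.loc[:, 0] raises a KeyError here because the columns are labelled 'A', 'B', 'C'):
a = df.iloc[:, [0]]
and then loop through like
dfs = []
for i in range(df.columns.size):
    dfs.append(df.iloc[:, [i]])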

Renaming tuple column name in dataframe

I am new to python and pandas. I have attached a picture of a pandas dataframe,
I need to know how I can fetch data from the last column and how to rename the last column.

You can use:
df = df.rename(columns = {df.columns[-1] : 'newname'})
Or:
df.columns = df.columns[:-1].tolist() + ['new_name']
The following solution seems to be buggy, because pandas functions return weird errors after the rename:
df.columns.values[-1] = 'newname'
To fetch data from the last column, you can select by position with iloc:
s = df.iloc[:,-1]
And after rename:
s1 = df['newname']
print (s1)
Sample:
df = pd.DataFrame({'R':[7,8,9],
'T':[1,3,5],
'E':[5,3,6],
('Z', 'a'):[7,4,3]})
print (df)
E T R (Z, a)
0 5 1 7 7
1 3 3 8 4
2 6 5 9 3
s = df.iloc[:,-1]
print (s)
0 7
1 4
2 3
Name: (Z, a), dtype: int64
df.columns = df.columns[:-1].tolist() + ['new_name']
print (df)
E T R new_name
0 5 1 7 7
1 3 3 8 4
2 6 5 9 3
df = df.rename(columns = {('Z', 'a') : 'newname'})
print (df)
E T R newname
0 5 1 7 7
1 3 3 8 4
2 6 5 9 3
s = df['newname']
print (s)
0 7
1 4
2 3
Name: newname, dtype: int64
df.columns.values[-1] = 'newname'
s = df['newname']
print (s)
>KeyError: 'newname'

fetch data from the last column
Retrieving the last column with df.iloc[:,-1], as suggested by other answers, works fine only when the column you want is indeed the last one.
However, using absolute column positions like -1 is not a stable solution: if you add some other column, your code will break.
A stable, generic approach
First of all, make sure all your column names are strings:
# rename columns
df.columns = [str(s) for s in df.columns]
# access column by name (str() of a tuple keeps the quotes and parentheses)
df["('vehicle_id', 'reservation_count')"]
rename the last column
It is preferable to give all columns similar names, without brackets in them. This makes your code more readable and your dataset easier to use:
# access column by name
df['vehicle_id_reservation_count']
This is a straightforward conversion of all columns that are named by a tuple:
# rename columns
def rename(col):
    if isinstance(col, tuple):
        col = '_'.join(str(c) for c in col)
    return col
df.columns = [rename(col) for col in df.columns]
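The tuple-flattening step above can be tried on the sample frame used in the other answers. This is a minimal sketch; the helper name flatten is introduced here for illustration, and the frame is the one from the first answer rather than the asker's data, which is only available as a picture.

```python
import pandas as pd

df = pd.DataFrame({'R': [7, 8, 9],
                   'T': [1, 3, 5],
                   'E': [5, 3, 6],
                   ('Z', 'a'): [7, 4, 3]})

def flatten(col):
    # Join the parts of a tuple label with "_"; leave other labels alone.
    if isinstance(col, tuple):
        return '_'.join(str(c) for c in col)
    return col

df.columns = [flatten(col) for col in df.columns]

print('Z_a' in df.columns)   # True
print(df['Z_a'].tolist())    # [7, 4, 3]
```

After this, the column can be fetched by a plain string name regardless of its position in the frame.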

You can drop the last column and reassign it with a different name.
This isn't technically renaming the column. However, I think it's intuitive.
Using @jezrael's setup
df = pd.DataFrame({'R':[7,8,9],
'T':[1,3,5],
'E':[5,3,6],
('Z', 'a'):[7,4,3]})
print(df)
R T E (Z, a)
0 7 1 5 7
1 8 3 3 4
2 9 5 6 3
How can I fetch the last column?
You can use iloc
df.iloc[:, -1]
0 7
1 4
2 3
Name: (Z, a), dtype: int64
You can rename the column after you've extracted it
df.iloc[:, -1].rename('newcolumn')
0 7
1 4
2 3
Name: newcolumn, dtype: int64
There are many ways to rename it within the dataframe. To continue with the theme that I've started, namely fetching the column and then renaming it:
option 1
start by dropping the last column with iloc[:, :-1]
use join to add the renamed column referenced above
df.iloc[:, :-1].join(df.iloc[:, -1].rename('newcolumn'))
R T E newcolumn
0 7 1 5 7
1 8 3 3 4
2 9 5 6 3
option 2
Or we can use assign to put it back and save the rename
df.iloc[:, :-1].assign(newname=df.iloc[:, -1])
R T E newname
0 7 1 5 7
1 8 3 3 4
2 9 5 6 3

To change the column name (note that another answer here reports this in-place edit of df.columns.values as unreliable):
columns=df.columns.values
columns[-1]="Column name"
To fetch data from the dataframe:
You can use the loc and iloc indexers. Older pandas versions also had ix, which was deprecated in 0.20 and removed in 1.0.
loc fetches values by label
iloc fetches values by integer position
ix could fetch data by either position or label
Learn about loc and iloc
http://pandas.pydata.org/pandas-docs/stable/dsintro.html#indexing-selection
Learn more about indexing and selecting data
http://pandas.pydata.org/pandas-docs/stable/indexing.html
