Create a function to extract specific columns and rename them in pandas - python

I have a target table structure (3 columns). I have multiple sources, each with its own nuances, but ultimately I want to use each table to populate the target table (appending entries).
I want to use a function (I know I can do it without one, but being able to use a function will help me out in the long run).
I have the following source table:
id col1 col2 col3 col4
1  a    b    c    g
1  a    b    d    h
1  c    d    e    i
I want this final structure:
id num group
1  a   b
1  a   b
1  c   d
So all I am doing is returning id, col1 and col2 from the source table (note the column name changes; for different source tables it will be a different set of 3 columns to extract, hence the use of a function).
The function I am using currently returns only 1 column (instead of 3).
Defining the function:
def func(x, col1='id', col2='num', col3='group'):
    d = [{'id': x[col1], 'num': x[col2], 'group': x[col3]}]
    return pd.DataFrame(d)
Applying the function to a source table:
target = source.apply(func, axis=1)
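Worth noting, as a likely explanation of the symptom: when the function passed to apply(axis=1) returns a whole DataFrame for each row, pandas cannot expand the results into columns, so target comes back as a single Series whose elements are one-row DataFrames. A minimal sketch of a fix, assuming the defaults are remapped to this source's actual column names, is to return a Series per row instead:
def func(x, col1='id', col2='col1', col3='col2'):
    # returning a Series lets apply(axis=1) expand the result into columns
    return pd.Series({'id': x[col1], 'num': x[col2], 'group': x[col3]})

target = source.apply(func, axis=1)  # columns: id, num, group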

Here's a flexible way to write this function:
def func(dframe, **kwargs):
    return dframe.filter(items=kwargs.keys()).rename(columns=kwargs)

func(df, id="id", col1="num", col2="group")
#   group id num
# 0     b  1  a
# 1     b  1  a
# 2     d  1  c
To ensure that your new dataframe preserves the column order of the original, you can sort the argument keys first:
def func(dframe, **kwargs):
    keys = sorted(kwargs.keys(), key=lambda x: list(dframe).index(x))
    return dframe.filter(items=keys).rename(columns=kwargs)

func(df, id="id", col1="num", col2="group")
#   id num group
# 0  1   a     b
# 1  1   a     b
# 2  1   c     d
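One caveat (my own sketch, not from the answer above): passing the mapping as keyword arguments only works when the source column names are valid Python identifiers. Accepting a plain dict lifts that restriction, and the selection then simply follows the dict's key order:
def func(dframe, mapping):
    # mapping: {old_name: new_name}; selection follows the key order
    return dframe[list(mapping)].rename(columns=mapping)

func(df, {'id': 'id', 'col1': 'num', 'col2': 'group'})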

You can also do:
def func(df, *cols):
    d = pd.DataFrame(df, columns=cols)
    d.rename(columns={'col1': 'num', 'col2': 'group'}, inplace=True)
    return d

df2 = func(df, 'id', 'col1', 'col2')
print(df2)
  id num group
0  1   a     b
1  1   a     b
2  1   c     d

Related

Python: creating new comparison columns based on column references in a separate datatable

I have a dataset with columns a_x, b_x, c_x, d_x, a_z, b_z, c_z, d_z:
df = pd.DataFrame({'a_x': ['a','b','c'], 'b_x': ['a','b','c'], 'c_x': ['a','b','c'], 'd_x': ['a','b','c'], 'a_z': ['a','b','i'], 'b_z': ['a','t','c'], 'c_z': ['c','c','c'], 'd_z': ['a','b','c']})
I have another dataset with the columns original, _x, _z:
header_comp = pd.DataFrame({'original': ['a','b','c','d'], '_x': ['a_x','b_x','c_x','d_x'], '_z': ['a_z','b_z','c_z','d_z']})
I'm trying to create a loop using header_comp to compare the _x columns to the corresponding _z columns, such that new columns are created in the original df dataset: a_comp, b_comp, c_comp, d_comp.
Each of these columns checks whether i_x equals i_z and yields either 1 or 0.
The output should therefore look like this:
df = pd.DataFrame({'a_x': ['a','b','c'], 'b_x': ['a','b','c'], 'c_x': ['a','b','c'], 'd_x': ['a','b','c'], 'a_z': ['a','b','i'], 'b_z': ['a','t','c'], 'c_z': ['c','c','c'], 'd_z': ['a','b','c'], 'a_comp': [1,1,0], 'b_comp': [1,0,1], 'c_comp': [0,0,1], 'd_comp': [1,1,1]})
So far, my code looks like this:
for i in range(0, len(header_match)):
    df[header_matrch.iloc[i,0] + ' comp'] = (df[header_match.iloc[i,1]==df[header_match.iloc[i,2]]).astype(int)
However, this is not working; it raises an error of 'Pivotrelease_x'. Is anyone able to troubleshoot this for me?
If I just use the code for individual columns outside of the for loop, there are no problems. e.g.
df[header_matrch.iloc[1,0] + ' comp'] = (df[header_match.iloc[1,1]==df[header_match.iloc[1,2]]).astype(int)
Thanks.
You can just use the values in header_comp to index the values in df; calling .to_numpy() on the left-hand block strips its column labels, so the comparison is done element-wise instead of failing on pandas' label alignment:
df[header_comp['original'] + '_comp'] = (df[header_comp['_x']].to_numpy() == df[header_comp['_z']]).astype(int)
Output:
>>> df
  a_x b_x c_x d_x a_z b_z c_z d_z  a_comp  b_comp  c_comp  d_comp
0   a   a   a   a   a   a   c   a       1       1       0       1
1   b   b   b   b   b   t   c   b       1       0       0       1
2   c   c   c   c   i   c   c   c       0       1       1       1
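For completeness, here is a repaired reading of the original loop (the vectorized line above replaces it); the failures appear to come from the misspelled header_matrch, a misplaced bracket, and the ' comp' vs '_comp' mismatch:
for i in range(len(header_comp)):
    # compare each _x column to its _z counterpart, row by row
    df[header_comp.iloc[i, 0] + '_comp'] = (
        df[header_comp.iloc[i, 1]] == df[header_comp.iloc[i, 2]]
    ).astype(int)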

How to lookup value in another table in Python

I have two datasets (actually many, but let's stick with two) and I need to merge them together. However, they do not cover the same range and they have different reference values. Let's consider
a 1
b 2
c 3
e 4
and
a 2
b 3
d 7
e 2
I tried to simulate Excel's INDEX and MATCH functions, but I am not able to get the right result:
b = []
f = []
for i in data1["c1"]:
    if i in data2["c1"]:
        a = d3[data2["c4"].index[i]]
        f = b.append(a)
    else:
        continue
print(f)
Can you please help me work out how to do this? I would also welcome a link with further information about this topic. Thank you.
If you want to create a consolidated file from the two above like:
Col1 Col2 Col3
a    1    2
b    2    3
c    3    7
d    4    2
You can simply use a dictionary, with your column-1 values a, b, c, d as keys, and as values lists of the corresponding second-column values from your two DataFrames:
your_dict = {'a': [1, 2], 'b': [2, 3], 'c': [3, 7], 'd': [4, 2]}
Then, to output that into one DataFrame such as the one above, just use the .from_dict() method in pandas with the orient parameter set to 'index' (see the pandas documentation).
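A minimal sketch of that approach (the Col1/Col2/Col3 names are just taken from the table above):
import pandas as pd

your_dict = {'a': [1, 2], 'b': [2, 3], 'c': [3, 7], 'd': [4, 2]}
out = (pd.DataFrame.from_dict(your_dict, orient='index', columns=['Col2', 'Col3'])
         .reset_index()
         .rename(columns={'index': 'Col1'}))
print(out)
#   Col1  Col2  Col3
# 0    a     1     2
# 1    b     2     3
# 2    c     3     7
# 3    d     4     2
For what it's worth, a plain pd.merge(data1, data2, on='c1') would also perform this lookup directly, keeping only the keys present in both tables.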

Add column to DataFrame in a loop

Let's say I have a very simple pandas dataframe, containing a single indexed column with "initial values". I want to read N other dataframes in a loop to fill a single "comparison" column, matching on index.
For instance, with my initial dataframe as
Initial
0 a
1 b
2 c
3 d
and the following two dataframes to read in a loop
Comparison
0 e
1 f
Comparison
2 g
3 h
4 i <= note that this index doesn't exist in Initial so won't be matched
I would like to produce the following result
Initial Comparison
0 a e
1 b f
2 c g
3 d h
Using merge, concat or join, I only ever seem to be able to create a new column for each iteration of the loop, filling the blanks with NaN.
What's the most pandas-pythonic way of achieving this?
Below is an example based on the proposed duplicate's solution:
import pandas as pd
import numpy as np

df1 = pd.DataFrame(np.array([['a'], ['b'], ['c'], ['d']]), columns=['Initial'])
print(df1)
df2 = pd.DataFrame(np.array([['e'], ['f']]), columns=['Compare'])
print(df2)
df3 = pd.DataFrame(np.array([[2, 'g'], [3, 'h'], [4, 'i']]), columns=['', 'Compare'])
df3 = df3.set_index('')
print(df3)
print(df1.merge(df2, left_index=True, right_index=True).merge(df3, left_index=True, right_index=True))
>>
Initial
0 a
1 b
2 c
3 d
Compare
0 e
1 f
Compare
2 g
3 h
4 i
Empty DataFrame
Columns: [Initial, Compare_x, Compare_y]
Index: []
Second edit: @W-B, the following seems to work, but surely there must be a simpler option using proper pandas methods. It also requires turning off warnings, which might be dangerous...
pd.options.mode.chained_assignment = None
df1["Compare"] = pd.Series()
for ind in df1.index.values:
    if ind in df2.index.values:
        df1["Compare"][ind] = df2.T[ind]["Compare"]
    if ind in df3.index.values:
        df1["Compare"][ind] = df3.T[ind]["Compare"]
print(df1)
>>
Initial Compare
0 a e
1 b f
2 c g
3 d h
OK, since the OP needs more info:
Data input
import functools
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.array([['a'], ['b'], ['c'], ['d']]), columns=['Initial'])
df1['Compare'] = np.nan
df2 = pd.DataFrame(np.array([['e'], ['f']]), columns=['Compare'])
df3 = pd.DataFrame(np.array(['g', 'h', 'i']), columns=['Compare'], index=[2, 3, 4])
Solution
newdf = functools.reduce(lambda x, y: x.fillna(y), [df1, df2, df3])
newdf
Out[639]:
Initial Compare
0 a e
1 b f
2 c g
3 d h
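Since the question reads the comparison frames in a loop, the same trick works incrementally; a sketch under that assumption (Series.fillna aligns on the index, so each pass fills only the still-missing rows, and indices absent from df1, such as 4, are simply ignored):
df1['Compare'] = np.nan
for other in (df2, df3):  # e.g. each frame as it is read inside the loop
    df1['Compare'] = df1['Compare'].fillna(other['Compare'])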

How to transform the result of a Pandas `GROUPBY` function to the original dataframe

Suppose I have a Pandas DataFrame with 6 columns and a custom function that takes counts of the elements in 2 or 3 columns and produces a boolean output. When a groupby object is created from the original dataframe and the custom function is applied with df.groupby('col1').apply(myfunc), the result is a series whose length equals the number of categories in col1. How do I expand this output to match the length of the original dataframe? I tried transform, but was not able to use the custom function myfunc with it.
EDIT:
Here is an example code:
A = pd.DataFrame({'X': ['a','b','c','a','c'], 'Y': ['at','bt','ct','at','ct'], 'Z': ['q','q','r','r','s']})
print(A)
def myfunc(df):
    return (df['Z'].nunique() >= 2) and (df['Y'].nunique() < 2)
A.groupby('X').apply(myfunc)
I would like to expand this output into a new column Result such that, for example, wherever there is an 'a' in column X, Result will be True.
You can map the groupby result back onto the original dataframe:
A['Result'] = A['X'].map(A.groupby('X').apply(myfunc))
Result would look like:
   X   Y  Z  Result
0  a  at  q    True
1  b  bt  q   False
2  c  ct  r    True
3  a  at  r    True
4  c  ct  s    True
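To spell out why this works: groupby('X').apply(myfunc) returns a Series indexed by the group keys, and Series.map looks each row's X value up in that index. A sketch of the intermediate steps:
per_group = A.groupby('X').apply(myfunc)
print(per_group)
# X
# a     True
# b    False
# c     True
# dtype: bool
A['Result'] = A['X'].map(per_group)  # broadcast back to row level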
My solution may not be the best one, since it uses a loop, but I think it's pretty good.
The core idea is that you can traverse all the sub-dataframes (gdf) with for i, gdf in gp. Then add the result column (in my example it is c) to each sub-dataframe. Finally, concat all the sub-dataframes into one.
Here is an example:
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 1, 2], 'b': ['a', 'b', 'c', 'd']})
gp = df.groupby('a')    # group
s = gp.apply(sum)['a']  # apply a func
adf = []
# then create a new dataframe
for i, gdf in gp:
    tdf = gdf.copy()
    tdf.loc[:, 'c'] = s.loc[i]
    adf.append(tdf)
pd.concat(adf)
from:
a b
0 1 a
1 2 b
2 1 c
3 2 d
to:
a b c
0 1 a 2
2 1 c 2
1 2 b 4
3 2 d 4
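As an aside (not part of the answer above): for a plain reduction like this sum, groupby.transform broadcasts the per-group result back to the original shape directly, without the loop or the concat, and it also preserves the original row order:
df['c'] = df.groupby('a')['a'].transform('sum')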

Extract rows with conditions and with a newly created column in Python

I have data like this:
id name sub marks
1 a m 52
1 a s 69
1 a p 63
2 b m 36
2 b s 52
2 b p 56
3 c m 85
3 c s 62
3 c p 56
And I want an output table which contains the columns id, name, and a new column result (using the criterion that if the marks in all subjects are greater than 40, the student passes):
id name result
1 a pass
2 b fail
3 c pass
I would like to do this in Python.
Create a boolean mask from marks, and then use groupby (on id and name) + all:
import numpy as np
import pandas as pd

df = pd.read_csv('file.csv')
v = df.assign(result=df.marks.gt(40))\
      .groupby(['id', 'name'])\
      .result\
      .all()\
      .reset_index()
v['result'] = np.where(v['result'], 'pass', 'fail')
v
id name result
0 1 a pass
1 2 b fail
2 3 c pass
Here's one way
In [127]: df.groupby(['id', 'name']).marks.agg(
     ...:     lambda x: 'pass' if x.ge(40).all() else 'fail'
     ...: ).reset_index(name='result')
Out[127]:
id name result
0 1 a pass
1 2 b fail
2 3 c pass
Another way, inspired by jpp's solution, is to use replace or map:
In [132]: df.groupby(['id', 'name']).marks.min().ge(40).replace(
     ...:     {True: 'pass', False: 'fail'}
     ...: ).reset_index(name='result')
Out[132]:
id name result
0 1 a pass
1 2 b fail
2 3 c pass
Here is one way via pandas. Note that your criterion is equivalent to the minimum mark being above 40, and checking the minimum is computationally more efficient.
import numpy as np
import pandas as pd

df = pd.read_csv('file.csv')
df = df.groupby(['id', 'name'])['marks'].apply(min).reset_index()
df['result'] = np.where(df['marks'] > 40, 'pass', 'fail')
df = df[['id', 'name', 'result']]
Result
id name result
0 1 a pass
1 2 b fail
2 3 c pass
Explanation
First take the per-group minimum of marks with a groupby on id and name.
Then assign the result column a string depending on whether that minimum exceeds 40.
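A further variant along the same lines (my own sketch, not one of the answers above): transform('min') broadcasts the group minimum back onto every row, which is convenient if you want the flag next to the original rows before collapsing to one row per student:
df['result'] = np.where(df.groupby(['id', 'name'])['marks'].transform('min') > 40, 'pass', 'fail')
out = df[['id', 'name', 'result']].drop_duplicates().reset_index(drop=True)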
