Operations on multiple DataFrames in Python

Data frames are provided:
a = pd.DataFrame({'A':[1, 2]})
b = pd.DataFrame({'B':[2, 3]})
C = pd.DataFrame({'C':[4, 5]})
and a list d = [A, C, B, B].
How can I apply the mathematical operation (((A + C) * B) - B) to the frame values to create a new data frame?
The result would be, for example, a frame of the form:
e = pd.DataFrame({'E':[8, 18]})

If I understand correctly:
In [132]: formula = "E = (((A + C) * B) - B)"
In [133]: pd.concat([a,b,C], axis=1).eval(formula, inplace=False)
Out[133]:
   A  B  C   E
0  1  2  4   8
1  2  3  5  18
In [134]: pd.concat([a,b,C], axis=1).eval(formula, inplace=False)[['E']]
Out[134]:
E
0 8
1 18
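The same arithmetic can also be written without eval. The following is a minimal sketch, assuming all three frames share the default integer index so that the rows line up; the result name e is taken from the question.

```python
import pandas as pd

a = pd.DataFrame({'A': [1, 2]})
b = pd.DataFrame({'B': [2, 3]})
C = pd.DataFrame({'C': [4, 5]})

# Take one Series per frame; arithmetic between Series aligns on the index,
# so each row is combined with the matching row of the other frames.
e = (((a['A'] + C['C']) * b['B']) - b['B']).to_frame('E')
print(e)
```

If the frames did not share an index, the values would be aligned by label rather than by position, so resetting the index first may be needed.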

Related

change value in pandas dataframe using iteration

I have a training data set of the following format:
print(data.head(5))
#Output
           0  1
0  a b c d e  1
1  a b c d e  1
2  a b c d e  1
3  a b c d e  1
4  a b c d e  1
It is a text classification task and I am trying to split the text "a b c d e" into a Python list. I tried iteration:
data #the dataset
len_data = len(data)
for row_num in range(len_data):
    data.loc[row_num, 0] = data.loc[row_num, 0].split(" ")
However, this doesn't work and raises the error "Must have equal len keys and value when setting with an iterable". Could someone help me with this problem? Many thanks!
Use str.split:
df[0] = df[0].str.split()
print(df)
# Output
                 0  1
0  [a, b, c, d, e]  1
1  [a, b, c, d, e]  1
2  [a, b, c, d, e]  1
3  [a, b, c, d, e]  1
4  [a, b, c, d, e]  1
Setup:
data = {0: {0: 'a b c d e', 1: 'a b c d e'}, 1: {0: 1, 1: 1}}
df = pd.DataFrame(data)
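As a small illustration of what str.split does here, the sketch below rebuilds the two-row setup and shows both the list-per-row form and the expand=True form. The names tokens and wide are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({0: ['a b c d e', 'a b c d e'], 1: [1, 1]})

# With no arguments, str.split splits on runs of whitespace and
# stores one Python list per row.
tokens = df[0].str.split()
print(tokens)

# expand=True spreads the pieces over separate columns instead of lists.
wide = df[0].str.split(expand=True)
print(wide)
```

Because the whole column is processed at once, no row-by-row assignment with .loc is needed.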

How to apply function on checking the specific column Null Values

I am trying to apply a function to a dataframe while checking for NULL values in each row of a specific column.
I have written the function, but I cannot work out how to apply it only to the rows that have values.
Input:
   A  B  C  D  E    F
0  f  e  b  a  d    a
1  c  b  a  c  b  NaN
2  f  f  a  b  c    c
3  d  c  c  d  c    d
4  f  b  b  b  e    b
5  b  a  f  c  d    a
Expected Output
   A  B  C  D  E    F           MATCHES                  Comments
0  f  e  b  a  d    a  AD, BC Unmatched
1  c  b  a  c  b  NaN      BC Unmatched  F is having blank values
2  f  f  a  b  c    c  AD, BC Unmatched
3  d  c  c  d  c    d       ALL MATCHED
4  f  b  b  b  e    b      AD Unmatched
5  b  a  f  c  d    a  AD, BC Unmatched
The script works when we don't have to check for NaN values in the df['F'] column, but when we check for empty rows in df['F'] it gives an error.
Code I have been trying:
def test(x):
    try:
        for idx in df.index:
            unmatch_list = []
            if not df.loc[idx, 'A'] == df.loc[idx, 'D']:
                unmatch_list.append('AD')
            if not df.loc[idx, 'B'] == df.loc[idx, 'C']:
                unmatch_list.append('BC')
            # etcetera...
            if len(unmatch_list):
                unmatch_string = ', '.join(unmatch_list) + ' Unmatched'
            else:
                unmatch_string = 'ALL MATCHED'
            df.loc[idx, 'MATCHES'] = unmatch_string
    except ValueError:
        print ('error')
    return df
## df = df.apply(lambda x: test(x) if(pd.notna(df['F'])) else x)
for row in df:
    if row['F'].isna() == True:
        row['Comments'] = "F is having blank values"
    else:
        df = test(df)
Please suggest how I can use the function.
You could try something like this:
# get combis
df1 = df.copy().reset_index().melt(id_vars=['index'])
df1 = df1.merge(df1, on=['index', 'value'], how='inner')
df1 = df1[df1['variable_x'] != df1['variable_y']]
df1['combis'] = df1['variable_x'] + ':' + df1['variable_y']
df1 = df1.groupby(['index'])['combis'].apply(list)
# get empty rows
df2 = df.copy().reset_index().melt(id_vars=['index'])
df2 = df2[df2['value'].isna()]
df2 = df2.groupby(['index'])['variable'].apply(list)
# combine
df.join(df1).join(df2)
# A B C ... F combis variable
# 0 f e b ... a [D:F, F:D] NaN
# 1 c b a ... None [A:D, D:A, B:E, E:B] [F]
# 2 f f a ... c [A:B, B:A, E:F, F:E] NaN
# 3 d c c ... d [A:D, A:F, D:A, D:F, F:A, F:D, B:C, B:E, C:B, ... NaN
# 4 f b b ... b [B:C, B:D, B:F, C:B, C:D, C:F, D:B, D:C, D:F, ... NaN
# 5 b a f ... a [B:F, F:B] NaN
# [6 rows x 8 columns]
If you are only interested in the unmatched combinations you can use this:
import itertools
combis = [x+':'+y for x,y in itertools.permutations(df.columns, 2)]
df.join(df1).join(df2)['combis'].map(lambda lst: list(set(combis) - set(lst)))
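To get the exact MATCHES and Comments columns from the question, a more direct route is possible. This is a minimal sketch, not the approach above: it assumes only the AD and BC pairs are compared, as in the expected output, and the helper name describe_row and the pairs list are hypothetical.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': list('fcfdfb'),
    'B': list('ebfcba'),
    'C': list('baacbf'),
    'D': list('acbdbc'),
    'E': list('dbcced'),
    'F': ['a', np.nan, 'c', 'd', 'b', 'a'],
})

# Column pairs to compare (assumed from the expected output).
pairs = [('A', 'D'), ('B', 'C')]

def describe_row(row):
    """Return the label for one row: unmatched pair names or 'ALL MATCHED'."""
    unmatched = [x + y for x, y in pairs if row[x] != row[y]]
    return ', '.join(unmatched) + ' Unmatched' if unmatched else 'ALL MATCHED'

# axis=1 passes each row to the helper as a Series.
df['MATCHES'] = df.apply(describe_row, axis=1)

# The blank check on F is a single vectorised step; no loop over rows is needed.
df['Comments'] = np.where(df['F'].isna(), 'F is having blank values', '')
print(df)
```

Iterating with "for row in df" yields column names rather than rows, which is why the original loop fails; apply with axis=1 is one way around that.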

Generate pairs from columns in Pandas

What's the easiest way in Pandas to turn this
df = pd.DataFrame({'Class': [1, 2], 'Students': ['A,B,C,D', 'E,A,C']})
df
Class Students
0 1 A,B,C,D
1 2 E,A,C
Into this?
Let's try combinations:
from functools import partial
from itertools import combinations
(df.set_index('Class')['Students']
   .str.split(',')
   .map(partial(combinations, r=2))
   .map(list)
   .explode()
   .reset_index())
   Class Students
0      1   (A, B)
1      1   (A, C)
2      1   (A, D)
3      1   (B, C)
4      1   (B, D)
5      1   (C, D)
6      2   (E, A)
7      2   (E, C)
8      2   (A, C)
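With pandas only, this needs several steps: split and explode, a self-merge on Class, then drop_duplicates. The code below uses the singular column name Student, so the column is renamed first. Because drop_duplicates ignores Class, a pair already seen in an earlier class (A,C here) is not repeated.
import numpy as np
df=df.rename(columns={'Students':'Student'})
df.Student=df.Student.str.split(',')
df=df.explode('Student')
df=df.merge(df,on='Class')
df[['Student_x','Student_y']]=np.sort(df[['Student_x','Student_y']].values, axis=1)
df=df.query('Student_x!=Student_y').drop_duplicates(['Student_x','Student_y'])
df['Student']=df[['Student_x','Student_y']].agg(','.join,axis=1)
df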
Out[100]:
    Class Student_x Student_y Student
1       1         A         B     A,B
2       1         A         C     A,C
3       1         A         D     A,D
6       1         B         C     B,C
7       1         B         D     B,D
11      1         C         D     C,D
17      2         A         E     A,E
18      2         C         E     C,E
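For comparison, the pairs can also be built in plain Python and wrapped in a frame at the end. This is a rough sketch, assuming pairs should be kept per class (so A,C appears for both classes) and sorted alphabetically within each pair; the names rows and pairs are illustrative.

```python
from itertools import combinations

import pandas as pd

df = pd.DataFrame({'Class': [1, 2], 'Students': ['A,B,C,D', 'E,A,C']})

# For every class, split the student string, sort the names and
# emit each unordered pair exactly once.
rows = [
    (cls, x, y)
    for cls, students in zip(df['Class'], df['Students'])
    for x, y in combinations(sorted(students.split(',')), 2)
]
pairs = pd.DataFrame(rows, columns=['Class', 'Student_x', 'Student_y'])
print(pairs)
```

This avoids the self-merge, which grows quadratically per class before duplicates are filtered out.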

Find columns where values are greater than column-wise mean

How can I print the column headers for which a row's value is greater than the mean (or median) of that column?
For example:
df =
    a   b   c   d
0  12  11  13  45
1   6  13  12  23
2   5  12   6  35
the output should be 0: a, c, d. 1: a, b, c. 2: b.
In [22]: df.gt(df.mean()).T.agg(lambda x: df.columns[x].tolist())
Out[22]:
0 [a, c, d]
1 [b, c]
2 [d]
dtype: object
or:
In [23]: df.gt(df.mean()).T.agg(lambda x: ', '.join(df.columns[x]))
Out[23]:
0 a, c, d
1 b, c
2 d
dtype: object
You can also do this in pandas by reshaping the frame; the steps are broken down below:
df=df.reset_index().melt('index')
df['MEAN']=df.groupby('variable')['value'].transform('mean')
df[df.value>df.MEAN].groupby('index').variable.apply(list)
Out[1016]:
index
0 [a, c, d]
1 [b, c]
2 [d]
Name: variable, dtype: object
Use df.apply to generate a mask, which you can then iterate over and index into df.columns:
mask = df.apply(lambda x: x > x.mean())
out = [(i, ', '.join(df.columns[x])) for i, x in mask.iterrows()]
print(out)
[(0, 'a, c, d'), (1, 'b, c'), (2, 'd')]
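from collections import defaultdict
import numpy as np
d = defaultdict(list)
v = df.values
# np.where gives the (row, column) positions where a value exceeds its column mean
for r, c in zip(*np.where(v > v.mean(0))):
    d[df.index[r]].append(df.columns[c])
dict(d)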
{0: ['a', 'c', 'd'], 1: ['b', 'c'], 2: ['d']}
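Since the question also mentions the median, the comparison can be wrapped in a small helper that accepts any per-column threshold. This is a sketch only; the function name columns_above is hypothetical and the frame is rebuilt from the question's data.

```python
import pandas as pd

df = pd.DataFrame({'a': [12, 6, 5], 'b': [11, 13, 12],
                   'c': [13, 12, 6], 'd': [45, 23, 35]})

def columns_above(frame, threshold):
    """Map each row label to the columns whose value exceeds `threshold`.

    `threshold` is a Series indexed by column name, e.g. frame.mean().
    """
    # gt aligns the Series with the columns, so each column is compared
    # with its own threshold value.
    mask = frame.gt(threshold)
    return {idx: [col for col, flag in row.items() if flag]
            for idx, row in mask.iterrows()}

above_mean = columns_above(df, df.mean())
above_median = columns_above(df, df.median())
print(above_mean)
print(above_median)
```

The comparison is strict, so a value equal to the column mean or median is not reported.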

Combination aggregations in Pandas

I have data in this format
ID Val
1 A
1 B
1 C
2 A
2 C
2 D
I want to group the data by ID, find the pairs of values that occur together within each ID, and count how many IDs each pair appears in. The resulting output should look like
v1 v2 count
A B 1
A C 2
A D 1
B C 1
C D 1
Is there a smart way to get this instead of looping through each possible combination?
This should work:
>>> ts = df.groupby('Val')['ID'].aggregate(lambda ts: set(ts))
>>> ts
Val
A set([1, 2])
B set([1])
C set([1, 2])
D set([2])
Name: ID, dtype: object
>>> from itertools import product
>>> pd.DataFrame([[i, j, len(ts[i] & ts[j])] for i, j in product(ts.index, ts.index) if i < j],
... columns=['v1', 'v2', 'count'])
v1 v2 count
0 A B 1
1 A C 2
2 A D 1
3 B C 1
4 B D 0
5 C D 1
What I came up with:
1. Use pd.merge to create the cartesian product
2. Filter the cartesian product to include only combinations of the form that you desire
3. Count the number of combinations
4. Convert to the desired dataframe format
Unsure if it is faster than looping through all possible combinations.
#!/usr/bin/env python2.7
# encoding: utf-8
'''
'''
import pandas as pd
try:
    from itertools import izip  # Python 2
except ImportError:
    izip = zip  # Python 3: the built-in zip is already lazy
# Create the dataframe
df = pd.DataFrame([
[1, 'A'],
[1, 'B'],
[1, 'C'],
[2, 'A'],
[2, 'C'],
[2, 'D'],
], columns=['ID', 'Val'])
'''
ID Val
0 1 A
1 1 B
2 1 C
3 2 A
4 2 C
5 2 D
[6 rows x 2 columns]
'''
# Create the cartesian product
df2 = pd.merge(df, df, on='ID')
'''
ID Val_x Val_y
0 1 A A
1 1 A B
2 1 A C
3 1 B A
4 1 B B
5 1 B C
6 1 C A
7 1 C B
8 1 C C
9 2 A A
10 2 A C
11 2 A D
12 2 C A
13 2 C C
14 2 C D
15 2 D A
16 2 D C
17 2 D D
[18 rows x 3 columns]
'''
# Count the pairs, filtering out (A, A) pairs and reversed (B, A) pairs.
counts = pd.Series([
v for v in izip(df2.Val_x, df2.Val_y)
if v[0] != v[1] and v[0] < v[1]
]).value_counts(sort=False).sort_index()
'''
(A, B) 1
(A, C) 2
(A, D) 1
(B, C) 1
(C, D) 1
dtype: int64
'''
# Combine the counts
df3 = pd.DataFrame(dict(
v1=[v1 for v1, _ in counts.index],
v2=[v2 for _, v2 in counts.index],
count=counts.values
))
'''
count v1 v2
0 1 A B
1 2 A C
2 1 A D
3 1 B C
4 1 C D
'''
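A shorter variant that skips the cartesian product is sketched below. It is not taken from either answer: it assumes Python 3, counts each pair once per ID, and uses collections.Counter together with itertools.combinations. The names counter and result are illustrative.

```python
from collections import Counter
from itertools import combinations

import pandas as pd

df = pd.DataFrame({'ID': [1, 1, 1, 2, 2, 2],
                   'Val': ['A', 'B', 'C', 'A', 'C', 'D']})

# For each ID, take the distinct values in sorted order and count
# every unordered pair once.
counter = Counter(
    pair
    for _, vals in df.groupby('ID')['Val']
    for pair in combinations(sorted(set(vals)), 2)
)

result = pd.DataFrame(
    [(v1, v2, n) for (v1, v2), n in sorted(counter.items())],
    columns=['v1', 'v2', 'count'],
)
print(result)
```

Only pairs that actually occur are listed, so a pair such as (B, D) with a count of zero does not appear.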
