Check two series are equal with a condition - python

I have two series and want to check whether they are equal (ignoring case), with the added condition that any combination of 'a' and 'b' also counts as equal:
first = pd.Series(['a', 'a', 'b', 'c', 'd'])
second = pd.Series(['A', 'B', 'C', 'C', 'K'])
expected output :
0 True
1 True
2 False
3 True
4 False
So far I know eq can compare the two series, but I am not sure how to include the condition:
def helper(s1, s2):
    return s1.str.lower().eq(s2.str.lower())

You can use bitwise logic operators to combine the case-insensitive comparison with your additional condition.
So that's:
condition_1 = first.str.casefold().eq(second.str.casefold())
condition_2 = first.str.casefold().isin(['a', 'b']) & second.str.casefold().isin(['a', 'b'])
result = condition_1 | condition_2
Or with numpy:
import numpy

condition_1 = first.str.casefold().eq(second.str.casefold())
condition_2 = numpy.bitwise_and(
    first.str.casefold().isin(['a', 'b']),
    second.str.casefold().isin(['a', 'b'])
)
result = numpy.bitwise_or(condition_1, condition_2)
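For the sample series in the question, both variants produce True, True, False, True, False, matching the expected output.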

You can use replace to map all a to b:
def transform(s):
    return s.str.lower().replace({'a': 'b'})

transform(first).eq(transform(second))
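For the sample data this also gives True, True, False, True, False: mapping 'a' to 'b' before the comparison makes the a/b combination compare equal, while the case-insensitive comparison handles the rest.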

You can specify an "ascii_distance" as follows:
import pandas as pd

s1 = pd.Series(['a', 'a', 'b', 'c', 'd'])
s2 = pd.Series(['A', 'A', 'b', 'C', 'F'])

def helper(s1, s2, ascii_distance):
    s1_processed = [ord(c1) for c1 in s1.str.lower()]
    s2_processed = [ord(c2) for c2 in s2.str.lower()]
    print(f'ascii_distance = {ascii_distance}')
    print(f's1_processed = {s1_processed}')
    print(f's2_processed = {s2_processed}')
    result = []
    for i in range(len(s1)):
        result.append(abs(s1_processed[i] - s2_processed[i]) <= ascii_distance)
    return result

ascii_distance = 2
print(helper(s1, s2, ascii_distance))
Output:
ascii_distance = 2
s1_processed = [97, 97, 98, 99, 100]
s2_processed = [97, 97, 98, 99, 102]
[True, True, True, True, True]
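A vectorized sketch of the same idea, assuming every value is a single character as in the sample series (map(ord) is used here to get the code points):
diff = (s1.str.lower().map(ord) - s2.str.lower().map(ord)).abs()
result = diff.le(ascii_distance)
print(result.tolist())
# [True, True, True, True, True]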


Aggregate by percentile and count for groups in python

I'm a new python user familiar with R.
I want to calculate user-defined quantiles for groups complete with the count of observations in each group.
In R I would do:
df_sum <- df %>%
  group_by(group) %>%
  dplyr::summarise(q85 = quantile(obsval, probs = 0.85, type = 8),
                   n = n())
In python I can get the grouped percentile by:
df_sum = df.groupby(['group'])['obsval'].quantile(0.85)
How do I add the group count to this?
I have tried:
df_sum = df.groupby(['group'])['obsval'].describe(percentile=[0.85])[[count]]
df_sum = df.groupby(['group'])['obsval'].quantile(0.85).describe(['count'])
Example data:
data = {'group':['A', 'B', 'A', 'A', 'B', 'B', 'B', 'A', 'A'], 'obsval':[1, 3, 3, 5, 4, 6, 7, 7, 8]}
df = pd.DataFrame(data)
df
Expected result:
group percentile count
A 7.4 5
B 6.55 4
You can use agg() to apply multiple functions to the grouped column.
For the percentile itself, use numpy.quantile().
import pandas as pd
import numpy as np
data = {'group':['A', 'B', 'A', 'A', 'B', 'B', 'B', 'A', 'A'], 'obsval':[1, 3, 3, 5, 4, 6, 7, 7, 8]}
df = pd.DataFrame(data)
df_sum = df.groupby(['group'])['obsval'].agg([lambda x : np.quantile(x, q=0.85), "count"])
df_sum.columns = ['percentile', 'count']
print(df_sum)
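For the sample data this prints a percentile of 7.4 and a count of 5 for group A, and 6.55 and 4 for group B, matching the expected result in the question.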

Count values of column A observed only with a single distinct value in columns B and/or C

I'd like to summarise a large dataframe in terms of the distinct values of one column, according to whether they occur with a single or with multiple distinct values of other column(s). My current approach is really convoluted, and I'm looking for a pandas pattern for solving these kinds of problems.
Given the following example dataframe:
import pandas as pd
pd.DataFrame({'c': ['x', 'x', 'y', 'y', 'z', 'z'],
              's': ['a1', 'a1', 'a1', 'a1', 'a1', 'a2'],
              't': [1, 1, 1, 2, 1, 1]})
How may I obtain (and count) the distinct values of column c:
1) that are observed only in conjunction with a single value of columns s and t.
Desired output: set('x') and/or its length: 1
2) that are observed only in conjunction with a single value of column s but >1 values of column t.
Desired output: set('y') and/or its length: 1
3) that are observed in conjunction with >1 values of column s and any number of distinct column t values.
Desired output: set('z') and/or its length: 1
Edit:
One more question, using the following revised df!
df = pd.DataFrame({'c': ['x', 'x', 'y', 'y', 'z', 'z', 'z1', 'z1', 'z2'],
                   's': ['a1', 'a1', 'a1', 'a1', 'a1', 'a2', 'a3', 'a3', 'a4'],
                   't': [1, 1, 1, 2, 1, 1, 3, 3, 1],
                   'cat': ['a', 'a', 'a', 'a', 'a', 'a', 'a', 'b', 'a']})
4) observed twice or more, and only in conjunction with a single value of columns s and t, and also restricted to cat 'a'
Desired output: set('x') and/or its length: 1
Thanks!
The idea is to use DataFrame.duplicated on multiple columns with keep=False to mark all duplicates, then filter with boolean indexing:
m1 = df.duplicated(['c','s','t'], keep=False)
m2 = df.duplicated(['c','s'], keep=False) & ~m1
m3 = df.duplicated(['c','t'], keep=False) & ~m1
a = df.loc[m1, 'c']
print (a)
0 x
1 x
Name: c, dtype: object
b = df.loc[m2, 'c']
print (b)
2 y
3 y
Name: c, dtype: object
c = df.loc[m3, 'c']
print (c)
4 z
5 z
Name: c, dtype: object
And then convert the filtered columns to sets:
out1, out2, out3 = set(a), set(b), set(c)
print (out1)
{'x'}
print (out2)
{'y'}
print (out3)
{'z'}
And for lengths:
out11, out21, out31 = len(out1), len(out2), len(out3)
print (out11)
print (out21)
print (out31)
1
1
1
Another idea is to create a new column with concat and DataFrame.dot:
df1 = pd.concat([m1, m2, m3], axis=1, keys=('s&t','s','t'))
print (df1)
s&t s t
0 True False False
1 True False False
2 False True False
3 False True False
4 False False True
5 False False True
df['new'] = df1.dot(df1.columns)
And then aggregate with sets and nunique:
df2 = (df.groupby('new')['c']
         .agg([('set', lambda x: set(x)), ('count', 'nunique')])
         .reset_index())
print (df2)
new set count
0 s {y} 1
1 s&t {x} 1
2 t {z} 1
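For case 4) from the edit, a hedged sketch that extends the same duplicated-based masking (it assumes df is the revised 9-row frame from the edit, and out4 is just an illustrative name):
# same logic as m1: c values repeated with an identical (s, t) combination
m4 = df.duplicated(['c', 's', 't'], keep=False)
# keep only c values whose rows all have cat 'a'
only_cat_a = df.groupby('c')['cat'].transform(lambda s: s.eq('a').all())
out4 = set(df.loc[m4 & only_cat_a, 'c'])
print (out4)
{'x'}
print (len(out4))
1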

Left Outer join with condition

I would like to merge two data frames (how='left'), not only on a key column but also on a condition.
E.g. assume two data frames:
        C1  C2
A =     I   3
        K   2
        L   5

        C1  C2  C3
B =     I   5   T
        I   0   U
        K   1   X
        L   7   Z

Now I would like to left outer join table A with B using index C1, under the condition that A.C2 > B.C2. That is, the final result should look like:
        A.C1  A.C2  B.C2  B.C3
A<-B =  I     3     0     U
        K     2     1     X
        L     5     Null  Null
P.S.: If you want to test it yourself:
import pandas as pd
df_A = pd.DataFrame([], columns={'C 1', 'C2'})
df_A['C 1'] = ['I', 'K', 'L']
df_A['C2'] = [3, 2, 5]
df_B = pd.DataFrame([], columns={'C1', 'C2', 'C3'})
df_B['C1'] = ['I', 'I', 'K', 'L']
df_B['C2'] = [5, 0, 2, 7]
df_B['C3'] = ['T', 'U', 'X', 'Z']
The quick and dirty solution would be to simply join on the C1 column and then put NULL or NaN into the B columns for every row where the condition on C2 is not met.
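A minimal pandas-only sketch of that merge-then-filter idea (the frames below follow the A and B tables shown at the top of the question, so treat it as an illustration rather than the poster's exact method):
import pandas as pd

df_A = pd.DataFrame({'C1': ['I', 'K', 'L'], 'C2': [3, 2, 5]})
df_B = pd.DataFrame({'C1': ['I', 'I', 'K', 'L'],
                     'C2': [5, 0, 1, 7],
                     'C3': ['T', 'U', 'X', 'Z']})

# Inner join on the key, keep only rows satisfying A.C2 > B.C2 and at most one
# qualifying match per key, then left-join back onto A so that keys without
# any qualifying match survive with NaN in the B columns.
merged = df_A.merge(df_B, on='C1', suffixes=('_A', '_B'))
matched = merged[merged['C2_A'] > merged['C2_B']].drop_duplicates('C1')
result = df_A.merge(matched[['C1', 'C2_B', 'C3']], on='C1', how='left')
print(result)
which should print (roughly):
  C1  C2  C2_B   C3
0  I   3   0.0    U
1  K   2   1.0    X
2  L   5   NaN  NaN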
Method: run a SQL query directly against the pandas frames using the pandasql library (see the references below).
import pandas as pd
df_A = pd.DataFrame([], columns={'C1', 'C2'})
df_A['C1'] = ['I', 'K']
df_A['C2'] = [3, 2]
df_B = pd.DataFrame([], columns={'C1', 'C2', 'C3'})
df_B['C1'] = ['I', 'I', 'K']
df_B['C2'] = [5, 0, 2]
df_B['C3'] = ['T', 'U', 'X']
It appears to me that the join condition you specified (A.C1 = B.C1) does not produce the expected result on its own. I needed to add GROUP BY A.C1 in order to drop duplicate rows having the same value of A.C1 after the join.
import pandasql as ps
q = """
SELECT A.C1 as 'A.C1',
A.C2 as 'A.C2',
B.C2 as 'B.C2',
B.C3 as 'B.C3'
FROM df_A AS A
LEFT OUTER JOIN df_B AS B
--ON A.C1 = B.C1 AND A.C2 = B.C2
WHERE A.C2 > B.C2
GROUP BY A.C1
"""
print(ps.sqldf(q, locals()))
Output
A.C1 A.C2 B.C2 B.C3
0 I 3 2 X
1 K 2 0 U
Other References
https://www.zentut.com/sql-tutorial/sql-outer-join/
Executing an SQL query over a pandas dataset
How to do a conditional join in python Pandas?
https://github.com/pandas-dev/pandas/issues/7480
https://medium.com/jbennetcodes/how-to-rewrite-your-sql-queries-in-pandas-and-more-149d341fc53e
I found one non-pandas-native solution:
import pandas as pd
from pandasql import sqldf
pysqldf = lambda q: sqldf(q, globals())
df_A = pd.DataFrame([], columns={'C1', 'C2'})
df_A['C1'] = ['I', 'K', 'L']
df_A['C2'] = [3, 2, 5]
cols = df_A.columns
cols = cols.map(lambda x: x.replace(' ', '_'))
df_A.columns = cols
df_B = pd.DataFrame([], columns={'C1', 'C2', 'C3'})
df_B['C1'] = ['I', 'I', 'K', 'L']
df_B['C2'] = [5, 0, 2, 7]
df_B['C3'] = ['T', 'U', 'X', 'Z']
# df_merge = pd.merge(left=df_A, right=df_B, how='left', on='C1')
df_sql = pysqldf("""
select *
from df_A t_1
left join df_B t_2 on t_1.C1 = t_2.C1 and t_1.C2 >= t_2.C2
;
""")
However, for big tables, pandasql turns out to be less performant.
Output:
C2 C1 C3 C2 C1
0 3 I U 0.0 I
1 2 K X 2.0 K
2 5 L None NaN None

Pandas test reappearance of values based on a rolling period

import pandas as pd
import numpy as np
df = pd.DataFrame()
df['ColN']=['AAA', 'AAA', 'AAA', 'AAA', 'ABC']
df['ColN_dt']=['03-01-2018', '03-04-2018', '03-05-2018', \
'03-08-2018', '03-12-2018']
df['ColN_ext']=['A', 'B', 'B', 'B', 'B']
df['ColN_dt'] = pd.to_datetime(df['ColN_dt'])
I am trying to solve the following problem based on the above DataFrame:
within a window of (say) 5 days, I want to check whether a ColN_ext value appears both before and after a particular row, within its ColN group.
I.e. I am trying to create a flag:
df['flag'] = [NaN, 0, 1, NaN, NaN]. Any help would be appreciated.
I was able to do this by defining a custom function:
import numpy as np
import pandas as pd

flag_list = []

def create_flag(dt, lookupdf):
    stdt = dt - lkfwd
    enddt = dt + lkfwd
    bckset_ext = set(lookupdf.loc[(lookupdf['ColN_dt'] >= stdt) &
                                  (lookupdf['ColN_dt'] < dt)]['ColN_ext'])
    fwdset_ext = set(lookupdf.loc[(lookupdf['ColN_dt'] > dt) &
                                  (lookupdf['ColN_dt'] <= enddt)]['ColN_ext'])
    flag_list.append(bool(bckset_ext.intersection(fwdset_ext)))
    return None

# Define the rolling days
lkfwd = pd.Timedelta(days=5)

df = pd.DataFrame()
df['ColN'] = ['AAA', 'AAA', 'AAA', 'AAA', 'AAA', 'AAA', 'AAA', 'ABC']
df['ColN_dt'] = ['03-12-2018', '03-13-2018', '03-13-2018', '03-01-2018', '03-05-2018', '03-04-2018', '03-08-2018', '02-04-2018']
df['ColN_ext'] = ['A', 'B', 'A', 'A', 'B', 'B', 'C', 'A']
df['ColN_dt'] = pd.to_datetime(df['ColN_dt'])

dfs = df.sort_values(by=['ColN', 'ColN_dt']).reset_index(drop=True)
dfg = dfs.groupby('ColN')
for _, grpdf in dfg:
    grpdf['ColN_dt'].apply(create_flag, args=(grpdf,))
dfs['flag'] = flag_list
This generates:
dfs['flag'] = [False, False, False, True, False, False, False, False]
I am now trying to achieve the same using pandas.groupby + rolling + (maybe) resample.
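A hedged sketch of how the same interval logic could be pushed into groupby().apply() instead of the module-level flag_list (this is not a true rolling/resample solution; flag_group and has_reappearance are illustrative names, and dfs and lkfwd are reused from the snippet above):
def flag_group(g, lkfwd):
    # For each row, collect the ColN_ext values seen within lkfwd before and
    # after it inside this group, and flag the row if any value appears on both sides.
    def has_reappearance(row):
        before = g.loc[(g['ColN_dt'] >= row['ColN_dt'] - lkfwd) &
                       (g['ColN_dt'] < row['ColN_dt']), 'ColN_ext']
        after = g.loc[(g['ColN_dt'] > row['ColN_dt']) &
                      (g['ColN_dt'] <= row['ColN_dt'] + lkfwd), 'ColN_ext']
        return bool(set(before) & set(after))
    return g.apply(has_reappearance, axis=1)

dfs['flag'] = (dfs.groupby('ColN', group_keys=False)
                  .apply(lambda g: flag_group(g, lkfwd)))
# dfs['flag'] should come out the same as the flag_list result above.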

updating a list based on the values in other lists

I have a list of lists; each list contains four elements, and the elements represent id, age, val1, val2. I am manipulating each list in such a way that its val1 and val2 values always depend on the most recent values seen in the previous lists. The previous lists for a given list are those lists for which the age difference is not more than timeDelta. The list of lists is in sorted order by age.
My code works, but it is slow. I feel that the line marked ** generates too many sublists and could be avoided by deleting lists from the beginning once I know that the age difference between a list and the next one is more than timeDelta.
from functools import reduce  # needed on Python 3

myList = [
    [1, 20, '', 'x'],
    [1, 25, 's', ''],
    [1, 26, '', 'e'],
    [1, 30, 'd', 's'],
    [1, 50, 'd', 'd'],
    [1, 52, 'f', 'g']
]
age_Idx = 1
timeDelta = 10

def collapseListTogether(li, age_Idx, currage, timeDelta):
    finalList = []
    for xl in reversed(li):
        #print(xl)
        oldage = float(xl[age_Idx])
        if (currage - timeDelta) <= oldage < currage:
            finalList.append(xl)
        else:
            break
    return [reduce(lambda a, b: b or a, tup) for tup in zip(*finalList[::-1])]

for i in range(len(myList))[1:]:
    newList = myList[:i+1]  # Subset of lists. #********
    respList = newList.pop(-1)
    currage = float(respList[age_Idx])
    retval = collapseListTogether(newList, age_Idx, currage, timeDelta)
    if len(retval) == 0:
        continue
    retval[0:2] = respList[0:2]
    print(retval)
Example
[1, 20, '', 'x'] ==> Not dependent on anything. Skip this list
[1, 25, 's', ''] == > [1, 25, '', 'x']
[1, 26, '', 'e'] ==> [1, 26, 's', 'x']
[1, 30, 'd', 's'] ==> [1, 30, 's', 'e']
[1, 50, 'd', 'd'] ==> Age difference (50-30 = 20) which is more than 10
[1, 52, 'f', 'g'] ==> [1, 52, 'd', 'd']
I'm just rewriting your data structure and your code:
from collections import namedtuple
from functools import reduce

Record = namedtuple('Record', ['id', 'age', 'val1', 'val2'])

myList = [
    Record._make([1, 20, '', 'x']),
    Record._make([1, 25, 's', '']),
    Record._make([1, 26, '', 'e']),
    Record._make([1, 30, 'd', 's']),
    Record._make([1, 50, 'd', 'd']),
    Record._make([1, 52, 'f', 'g'])
]
timeDelta = 10

def collapseListTogether(lst, age, tdelta):
    finalLst = []
    [finalLst.append(ele) if age - float(ele.age) <= tdelta and age > float(ele.age)
     else None for ele in lst]
    return [reduce(lambda a, b: b or a, tup) for tup in zip(*finalLst[::-1])]

for i in range(1, len(myList)):
    subList = list(myList[:i+1])
    rec = subList.pop(-1)
    age = float(rec.age)
    retval = collapseListTogether(subList, age, timeDelta)
    if len(retval) == 0:
        continue
    retval[0:2] = rec.id, rec.age
    print(retval)
Your code was not very readable to me. I did not change the logic; I only modified a few places for performance.
One way out is to replace your 4-element lists with tuples, or better with namedtuples, a well-known high-performance container in Python. Also, explicit for-loops are relatively slow in interpreted languages; in Python one would use comprehensions instead of a for-loop where possible to improve performance. Your list is not too large, so the time saved by the comprehension should outweigh losing the early break.
To me, your original code should not work as posted, but I am not sure.
Assuming your example is correct, I see no reason you can't do this in a single pass, since they're sorted by age. If the last sublist you inspected has too great a difference, you know nothing earlier will count, so you should just leave the current sublist unmodified.
previous_age = None
previous_val1 = ''
previous_val2 = ''

for sublist in myList:
    age = sublist[1]
    latest_val1 = sublist[2]
    latest_val2 = sublist[3]
    if previous_age is not None and (age - previous_age) <= timeDelta:
        # there is at least one previous list
        sublist[2] = previous_val1
        sublist[3] = previous_val2
    previous_age = age
    previous_val1 = latest_val1 or previous_val1
    previous_val2 = latest_val2 or previous_val2
When testing, that code produces this modified value for your initial myList:
[[1, 20, '', 'x'],
[1, 25, '', 'x'],
[1, 26, 's', 'x'],
[1, 30, 's', 'e'],
[1, 50, 'd', 'd'],
[1, 52, 'd', 'd']]
It's a straightforward modification to build a new list rather than edit one in place, or to entirely omit the skipped lines rather than just leave them unchanged.
reduce and list comprehensions are powerful tools, but they're not right for all problems.
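As a small illustration of that, a sketch of the build-a-new-list variant run on the original myList (new_rows is an illustrative name; the loop body mirrors the in-place version above):
new_rows = []
previous_age = None
previous_val1 = ''
previous_val2 = ''

for rec_id, age, val1, val2 in myList:
    if previous_age is not None and (age - previous_age) <= timeDelta:
        # within the window: take over the most recent previous values
        new_rows.append([rec_id, age, previous_val1, previous_val2])
    else:
        # no previous list in range: keep the row unchanged
        new_rows.append([rec_id, age, val1, val2])
    previous_age = age
    previous_val1 = val1 or previous_val1
    previous_val2 = val2 or previous_val2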
