import pandas as pd
import numpy as np
df = pd.DataFrame({
    'ColN': ['AAA', 'AAA', 'AAA', 'AAA', 'ABC'],
    'ColN_dt': ['03-01-2018', '03-04-2018', '03-05-2018', '03-08-2018', '03-12-2018'],
    'ColN_ext': ['A', 'B', 'B', 'B', 'B'],
})
df['ColN_dt'] = pd.to_datetime(df['ColN_dt'])
I am trying to solve the following problem based on the above DataFrame: within a window of (say) 5 days, I want to check, per ColN group, whether any ColN_ext value appears both before and after a particular row.
That is, I am trying to create a flag: df['flag'] = [NaN, 0, 1, NaN, NaN]. Any help would be appreciated.
I was able to do this by defining a custom function:
import numpy as np
import pandas as pd
flag_list = []

def create_flag(dt, lookupdf):
    # Collect the ColN_ext values seen in the lkfwd window before dt and
    # after dt, and flag the row if any value appears on both sides.
    stdt = dt - lkfwd
    enddt = dt + lkfwd
    bckset_ext = set(lookupdf.loc[(lookupdf['ColN_dt'] >= stdt) &
                                  (lookupdf['ColN_dt'] < dt), 'ColN_ext'])
    fwdset_ext = set(lookupdf.loc[(lookupdf['ColN_dt'] > dt) &
                                  (lookupdf['ColN_dt'] <= enddt), 'ColN_ext'])
    flag_list.append(bool(bckset_ext.intersection(fwdset_ext)))
    return None
# Define the rolling window width
lkfwd = pd.Timedelta(days=5)

df = pd.DataFrame({
    'ColN': ['AAA', 'AAA', 'AAA', 'AAA', 'AAA', 'AAA', 'AAA', 'ABC'],
    'ColN_dt': ['03-12-2018', '03-13-2018', '03-13-2018', '03-01-2018',
                '03-05-2018', '03-04-2018', '03-08-2018', '02-04-2018'],
    'ColN_ext': ['A', 'B', 'A', 'A', 'B', 'B', 'C', 'A'],
})
df['ColN_dt'] = pd.to_datetime(df['ColN_dt'])

dfs = df.sort_values(by=['ColN', 'ColN_dt']).reset_index(drop=True)

# Apply create_flag row by row within each ColN group; the flags accumulate
# in flag_list in the sorted row order.
for _, grpdf in dfs.groupby('ColN'):
    grpdf['ColN_dt'].apply(create_flag, args=(grpdf,))
dfs['flag'] = flag_list
This generates:
dfs['flag'] = [False, False, False, True, False, False, False, False]
I am now trying to achieve the same thing using pandas groupby + rolling + (maybe) resample.
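As an aside, here is a minimal sketch (my own, not a rolling-based solution) that keeps the same window logic but avoids the global flag_list by returning a Series per group from groupby().apply():
def flag_group(g):
    # For each row, gather the ColN_ext values in the lkfwd window on each
    # side and flag the row if any value occurs both before and after.
    flags = []
    for dt in g['ColN_dt']:
        before = set(g.loc[(g['ColN_dt'] >= dt - lkfwd) & (g['ColN_dt'] < dt), 'ColN_ext'])
        after = set(g.loc[(g['ColN_dt'] > dt) & (g['ColN_dt'] <= dt + lkfwd), 'ColN_ext'])
        flags.append(bool(before & after))
    return pd.Series(flags, index=g.index)

dfs['flag'] = dfs.groupby('ColN', group_keys=False)[['ColN_dt', 'ColN_ext']].apply(flag_group)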
Related
I have two series and want to check if they are equal (ignoring case), with the condition that any combination of 'a' and 'b' is also acceptable:
first = pd.Series(['a', 'a', 'b', 'c', 'd'])
second = pd.Series(['A', 'B', 'C', 'C', 'K'])
Expected output:
0 True
1 True
2 False
3 True
4 False
So far I know eq can compare the two series, but I am not sure how to include the condition:
def helper(s1, s2):
    return s1.str.lower().eq(s2.str.lower())
You can use bitwise logic operations to include your additional logic.
So that's:
condition_1 = first.str.casefold().eq(second.str.casefold())
condition_2 = first.str.casefold().isin(['a', 'b']) & second.str.casefold().isin(['a', 'b'])
result = condition_1 | condition_2
Or with NumPy:
import numpy

condition_1 = first.str.casefold().eq(second.str.casefold())
condition_2 = numpy.bitwise_and(
    first.str.casefold().isin(['a', 'b']),
    second.str.casefold().isin(['a', 'b'])
)
result = numpy.bitwise_or(condition_1, condition_2)
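In both versions result matches the expected output; a quick check with the series defined above:
print(result.tolist())  # [True, True, False, True, False]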
You can use replace to map all 'a' to 'b':
def transform(s):
    return s.str.lower().replace({'a': 'b'})

transform(first).eq(transform(second))
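For the example series this again reproduces the expected output (my own check, not part of the original answer):
print(transform(first).eq(transform(second)).tolist())  # [True, True, False, True, False]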
You can specify an "ascii_distance" as follows:
import pandas as pd
s1 = pd.Series(['a', 'a', 'b', 'c', 'd'])
s2 = pd.Series(['A', 'A', 'b', 'C', 'F'])
def helper(s1, s2, ascii_distance):
    # Compare the ASCII codes of the lowercased characters position by position.
    s1_processed = [ord(c1) for c1 in s1.str.lower()]
    s2_processed = [ord(c2) for c2 in s2.str.lower()]
    print(f'ascii_distance = {ascii_distance}')
    print(f's1_processed = {s1_processed}')
    print(f's2_processed = {s2_processed}')
    result = []
    for i in range(len(s1)):
        result.append(abs(s1_processed[i] - s2_processed[i]) <= ascii_distance)
    return result
ascii_distance = 2
print(helper(s1, s2, ascii_distance))
Output:
ascii_distance = 2
s1_processed = [97, 97, 98, 99, 100]
s2_processed = [97, 97, 98, 99, 102]
[True, True, True, True, True]
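A vectorized variant of the same idea (my own sketch, not from the original answer) avoids the explicit Python loop:
distance = (s1.str.lower().map(ord) - s2.str.lower().map(ord)).abs()
print((distance <= ascii_distance).tolist())  # [True, True, True, True, True]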
Example DataFrame:
df = pd.DataFrame({'ID': [1, 1, 2, 2, 2, 3, 3, 3],
                   'Type': ['b', 'b', 'b', 'a', 'a', 'a', 'a', 'a']})
I would like to return the counts grouped by ID, with a column for each unique value in Type holding the count of that Type for the grouped row, plus a total count:
pd.DataFrame({'ID': [1, 2, 3], 'CountTypeA': [0, 2, 3], 'CountTypeB': [2, 1, 0], 'TotalCount': [2, 3, 3]})
Is there an easy way to do this using the groupby function in pandas?
For what you need you can use the get_dummies method from pandas, which converts a categorical variable into dummy/indicator variables (see the pandas documentation for details).
Check if this meets your requirements:
import pandas as pd

df = pd.DataFrame({'ID': [1, 1, 2, 2, 2, 3, 3, 3],
                   'Type': ['b', 'b', 'b', 'a', 'a', 'a', 'a', 'a']})

# One indicator column per distinct Type value
dummy_var = pd.get_dummies(df["Type"])
dummy_var.rename(columns={'a': 'CountTypeA', 'b': 'CountTypeB'}, inplace=True)

# Summing the indicators per ID yields the per-type counts
df1 = pd.concat([df['ID'], dummy_var], axis=1)
df_group1 = df1.groupby(by=['ID'], as_index=False).sum()
df_group1['TotalCount'] = df_group1['CountTypeA'] + df_group1['CountTypeB']
print(df_group1)
This will print the following result:
   ID  CountTypeA  CountTypeB  TotalCount
0   1           0           2           2
1   2           2           1           3
2   3           3           0           3
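A more compact alternative (my own sketch, not part of the original answer) is pd.crosstab, whose margins option adds the row totals directly:
counts = pd.crosstab(df['ID'], df['Type'], margins=True, margins_name='TotalCount')
counts = counts.rename(columns={'a': 'CountTypeA', 'b': 'CountTypeB'}).drop(index='TotalCount')
print(counts.reset_index())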
I'm a new Python user familiar with R.
I want to calculate user-defined quantiles for groups complete with the count of observations in each group.
In R I would do:
df_sum <- df %>% group_by(group) %>%
  dplyr::summarise(q85 = quantile(obsval, probs = 0.85, type = 8),
                   n = n())
In python I can get the grouped percentile by:
df_sum = df.groupby(['group'])['obsval'].quantile(0.85)
How do I add the group count to this?
I have tried:
df_sum = df.groupby(['group'])['obsval'].describe(percentile=[0.85])[[count]]
df_sum = df.groupby(['group'])['obsval'].quantile(0.85).describe(['count'])
Example data:
data = {'group':['A', 'B', 'A', 'A', 'B', 'B', 'B', 'A', 'A'], 'obsval':[1, 3, 3, 5, 4, 6, 7, 7, 8]}
df = pd.DataFrame(data)
df
Expected result:
group  percentile  count
A             7.4      5
B            6.55      4
You can use pandas.DataFrame.agg() to apply multiple functions.
In this case you should use numpy.quantile().
import pandas as pd
import numpy as np
data = {'group':['A', 'B', 'A', 'A', 'B', 'B', 'B', 'A', 'A'], 'obsval':[1, 3, 3, 5, 4, 6, 7, 7, 8]}
df = pd.DataFrame(data)
df_sum = df.groupby(['group'])['obsval'].agg([lambda x : np.quantile(x, q=0.85), "count"])
df_sum.columns = ['percentile', 'count']
print(df_sum)
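This prints the expected result:
       percentile  count
group
A            7.40      5
B            6.55      4
Note that np.quantile defaults to linear interpolation (R's type = 7), which happens to give the expected values here. If you specifically need R's type = 8, newer NumPy versions accept a method argument, e.g. np.quantile(x, q=0.85, method='median_unbiased'); to my knowledge that corresponds to type 8, but verify against your NumPy version.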
I expect the DataFrame to print as an 'Excel'-like table, but instead I get an index error:
IndexError: too many indices for array
import numpy as np
import pandas as pd
from numpy.random import randn
rowi = ['A', 'B', 'C', 'D', 'E']
coli = ['W', 'X', 'Y', 'Z']
df = pd.DataFrame(randn[5, 4], rowi, coli)  # data, index, columns
print(df)
How do I solve the problem?
Is this what you want? randn is a function, so it must be called with parentheses, not indexed with square brackets:
df = pd.DataFrame(randn(5, 4), rowi, coli)
Out[583]:
W X Y Z
A -0.630006 -0.033165 -1.005409 -0.827504
B 0.044278 0.526636 1.082062 -1.664397
C 0.523847 -0.688798 -0.626712 0.149128
D 0.541975 -1.448316 -0.961484 -0.526547
E 0.066888 0.238089 1.180641 0.462298
I'm trying to change the values of only certain values in a dataframe:
test = pd.DataFrame({'col1': ['a', 'a', 'b', 'c'], 'col2': [1, 2, 3, 4]})
dict_curr = {'a':2}
test['col2'] = np.where(test.col1 == 'a', test.col1.map(lambda x: dict_curr[x]), test.col2)
However, this doesn't seem to work: even though I'm only looking at the values in col1 that are 'a', the error says
KeyError: 'b'
implying that it also looks at the rows where col1 is 'b'. Why is this? And how do I fix it?
The error is originating from the test.col1.map(lambda x: dict_curr[x]) part. You look up the values from col1 in dict_curr, which only has an entry for 'a', not for 'b'.
You can also just index the dataframe:
test.loc[test.col1 == 'a', 'col2'] = 2
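If dict_curr held several keys, the same indexing pattern generalizes (a hypothetical extension, not from the original answer):
for key, value in dict_curr.items():
    test.loc[test.col1 == key, 'col2'] = value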
The problem is that when you call np.where all of its parameters are evaluated first, and then the result is decided depending on the condition. So the dictionary is queried also for 'b' and 'c', even if those values will be discarded later. Probably the easiest fix is:
import pandas as pd
import numpy as np
test = pd.DataFrame({'col1': ['a', 'a', 'b', 'c'], 'col2': [1, 2, 3, 4]})
dict_curr = {'a': 2}
test['col2'] = np.where(test.col1 == 'a', test.col1.map(lambda x: dict_curr.get(x, 0)), test.col2)
This will give the value 0 for keys not in the dictionary, but since it will be discarded later it does not matter which value you use.
Another easy way of getting the same result is:
import pandas as pd
test = pd.DataFrame({'col1': ['a', 'a', 'b', 'c'], 'col2': [1, 2, 3, 4]})
dict_curr = {'a': 2}
test['col2'] = test.apply(lambda x: dict_curr.get(x.col1, x.col2), axis=1)
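A map-based variant (my own sketch, not from the original answers) relies on map returning NaN for keys missing from the dictionary and on fillna to keep the old values:
test['col2'] = test['col1'].map(dict_curr).fillna(test['col2']).astype(int)
print(test['col2'].tolist())  # [2, 2, 3, 4]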