Create a column based on a condition python pandas - python

Here is the sample data
import pandas as pd
df=pd.DataFrame({'P_Name':['ABC','ABC','ABC','ABC','PQR','PQR','PQR','PQR','XYZ','XYZ','XYZ','XYZ'],
'Date':['11/01/2020','12/01/2020','13/01/2020','14/01/2020','11/01/2020','12/01/2020','13/01/2020','14/01/2020','11/01/2020','12/01/2020','13/01/2020','14/01/2020'],
'Open':['242.584','238.179','233.727','229.441','241.375','28.965','235.96','233.193','280.032','78.472','277.592','276.71'],
'End':['4.405','4.452','4.286','4.405','2.41','3.005','2.767','3.057','1.56','0.88','0.882','0.88'],
'Close':['238.179','233.727','229.441','225.036','238.965','235.96','233.193','230.136','278.472','277.592','276.71','275.83']})
I'm trying to create a new column, Check. The first row of each product should get 1. For every later row, the value should be 1 if the row's Open equals the previous row's Close (e.g. df['Close'][0] == df['Open'][1]) and 0 if they differ (e.g. df['Close'][8] and df['Open'][9]).
Expected df after applying these conditions:
P_Name Date Open End Close Check
0 ABC 11/01/2020 242.584 4.405 238.179 1
1 ABC 12/01/2020 238.179 4.452 233.727 1
2 ABC 13/01/2020 233.727 4.286 229.441 1
3 ABC 14/01/2020 229.441 4.405 225.036 1
4 PQR 11/01/2020 241.375 2.41 238.965 1
5 PQR 12/01/2020 28.965 3.005 235.96 0
6 PQR 13/01/2020 235.96 2.767 233.193 1
7 PQR 14/01/2020 233.193 3.057 230.136 1
8 XYZ 11/01/2020 280.032 1.56 278.472 1
9 XYZ 12/01/2020 78.472 0.88 277.592 0
10 XYZ 13/01/2020 277.592 0.882 276.71 1
11 XYZ 14/01/2020 276.71 0.88 275.83 1

You can compare each Open with the previous Close within each group using DataFrameGroupBy.shift and Series.eq. The first row of each group has no previous value, so replace the missing values with the Open column via Series.fillna, then cast the boolean mask to 0/1 with Series.astype:
df['Check'] = df.Open.eq(df.groupby('P_Name').Close.shift().fillna(df.Open)).astype(int)
Another idea is to compare without groups and chain a second mask, built with Series.duplicated, that matches the first row of each group (this assumes the rows of each product are contiguous):
df['Check'] = (~df.P_Name.duplicated() | df.Open.eq(df.Close.shift())).astype(int)
print (df)
P_Name Date Open End Close Check
0 ABC 11/01/2020 242.584 4.405 238.179 1
1 ABC 12/01/2020 238.179 4.452 233.727 1
2 ABC 13/01/2020 233.727 4.286 229.441 1
3 ABC 14/01/2020 229.441 4.405 225.036 1
4 PQR 11/01/2020 241.375 2.41 238.965 1
5 PQR 12/01/2020 28.965 3.005 235.96 0
6 PQR 13/01/2020 235.96 2.767 233.193 1
7 PQR 14/01/2020 233.193 3.057 230.136 1
8 XYZ 11/01/2020 280.032 1.56 278.472 1
9 XYZ 12/01/2020 78.472 0.88 277.592 0
10 XYZ 13/01/2020 277.592 0.882 276.71 1
11 XYZ 14/01/2020 276.71 0.88 275.83 1
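As a self-contained sketch of the grouped comparison, the snippet below rebuilds only the columns the check needs and flags the first row of each product explicitly instead of using fillna. The names prev_close and first_row are illustrative and do not come from the answer.

```python
import pandas as pd

# Only the columns needed for the check, copied from the sample data.
df = pd.DataFrame({
    "P_Name": ["ABC"] * 4 + ["PQR"] * 4 + ["XYZ"] * 4,
    "Open": ["242.584", "238.179", "233.727", "229.441",
             "241.375", "28.965", "235.96", "233.193",
             "280.032", "78.472", "277.592", "276.71"],
    "Close": ["238.179", "233.727", "229.441", "225.036",
              "238.965", "235.96", "233.193", "230.136",
              "278.472", "277.592", "276.71", "275.83"],
})

# Previous Close within the same product; missing on each product's first row.
prev_close = df.groupby("P_Name")["Close"].shift()
first_row = prev_close.isna()

# 1 when the row starts a product or its Open equals the previous Close.
df["Check"] = (first_row | df["Open"].eq(prev_close)).astype(int)
print(df)
```

Because the shift happens per group, this version does not depend on the rows of a product being adjacent.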

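# Row-by-row alternative; the first row of each product always gets 1.
check = [1]
for i in range(1, len(df)):
    new_product = df['P_Name'][i] != df['P_Name'][i - 1]
    if new_product or df['Open'][i] == df['Close'][i - 1]:
        check.append(1)
    else:
        check.append(0)
df['Check'] = check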

Related

keep only ids that have all three values of the column Mode

I have a pandas dataframe with multiple columns, which looks like the following:
Index ID Year Code Type Mode
0 100 2018 ABC 1 1
1 100 2019 DEF 2 2
2 100 2019 GHI 3 3
3 102 2018 JKL 4 1
4 103 2019 MNO 5 1
5 103 2018 PQR 6 2
6 102 2019 PQR 3 2
I only want to keep ids that have rows against all the values for the column Mode. An example would look like this:
Index ID Year Code Type Mode
0 100 2018 ABC 1 1
1 100 2019 DEF 2 2
2 100 2019 GHI 3 3
I have already tried doing so by using the following code:
df = data.groupby('ID').filter(lambda x: {1, 2, 3}.issubset(x['Mode']))
but this returns an empty result. Can someone help me here?
TIA
You can try
out = df.groupby('ID').filter(lambda x : pd.Series([1,2,3]).isin(x['Mode']).all())
Out[9]:
Index ID Year Code Type Mode
0 0 100 2018 ABC 1 1
1 1 100 2019 DEF 2 2
2 2 100 2019 GHI 3 3
Your code works just fine (on pandas 1.3, python 3.9):
out = df.groupby('ID').filter(lambda x: {1,2,3}.issubset(x['Mode']))
Output:
Index ID Year Code Type Mode
0 0 100 2018 ABC 1 1
1 1 100 2019 DEF 2 2
2 2 100 2019 GHI 3 3
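For reference, here is a minimal reproduction on the sample data. The second half illustrates one possible cause of an empty result, namely Mode being stored as strings; that is only an assumption, since the question does not show the column's dtype. The names required, as_text and empty are illustrative.

```python
import pandas as pd

data = pd.DataFrame({
    "ID": [100, 100, 100, 102, 103, 103, 102],
    "Year": [2018, 2019, 2019, 2018, 2019, 2018, 2019],
    "Code": ["ABC", "DEF", "GHI", "JKL", "MNO", "PQR", "PQR"],
    "Type": [1, 2, 3, 4, 5, 6, 3],
    "Mode": [1, 2, 3, 1, 1, 2, 2],
})

required = {1, 2, 3}

# Keep a group only when its Mode values cover every required value.
out = data.groupby("ID").filter(lambda g: required.issubset(set(g["Mode"])))
print(out)

# Assumption for illustration: if Mode held strings such as "1",
# the integer set above would match nothing and the result would be empty.
as_text = data.assign(Mode=data["Mode"].astype(str))
empty = as_text.groupby("ID").filter(lambda g: required.issubset(set(g["Mode"])))
print(len(empty))
```

Checking data['Mode'].dtype before filtering is a quick way to rule this out.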

Split a Range of Numbers into different Rows - Pandas

I have a dataframe having column values like this:
num_range id description
'5000-6000' 1 lmn
'6100-6102' 1 lmn
'6363-6363' 3 xyz
'Q7890-Q8000' 2 pqr
So is there a way I can write a loop which will split into rows and give me the values, for ex. for the first num_range value, something like this:
num_range id description
5000 1 lmn
5001 1 lmn
5002 1 lmn
..... ... ....
5999 1 lmn
6000 1 lmn
Q7891 2 pqr
Q7892 2 pqr
... ... ...
Q8000 2 pqr
Like wise I want rows for all the num_range values along with the id and the description.
Use Series.str.findall to get the numeric values; this also works when non-numeric characters precede them, like the Q in the last row. Then create a Series with a list comprehension and join it to the original:
print (df)
num_range id description
0 5000-5005 1 lmn
1 6100-6102 1 lmn
2 6363-6363 3 xyz
3 Q7890-Q7893 2 pqr
s = df.pop('num_range').str.findall(r'\d+')
a = [(i, x) for i, (a, b) in s.items() for x in range(int(a), int(b) + 1)]
s = pd.DataFrame(a).set_index(0)[1].rename('num_range')
df = df.join(s)
print (df)
id description num_range
0 1 lmn 5000
0 1 lmn 5001
0 1 lmn 5002
0 1 lmn 5003
0 1 lmn 5004
0 1 lmn 5005
1 1 lmn 6100
1 1 lmn 6101
1 1 lmn 6102
2 3 xyz 6363
3 2 pqr 7890
3 2 pqr 7891
3 2 pqr 7892
3 2 pqr 7893
If you also need the non-numeric prefix, first extract it with Series.str.extract, replace - with an empty string, and map it in the list comprehension:
d = df['num_range'].str.extract(r'(\D+)\d+', expand=False).replace('-','').to_dict()
print (d)
{0: '', 1: '', 2: '', 3: 'Q'}
s = df.pop('num_range').str.findall(r'\d+')
a = [(i, '{}{}'.format(d.get(i), x))
     for i, (a, b) in s.items() for x in range(int(a), int(b) + 1)]
s = pd.DataFrame(a).set_index(0)[1].rename('num_range')
df = df.join(s).reset_index(drop=True)
print (df)
id description num_range
0 1 lmn 5000
1 1 lmn 5001
2 1 lmn 5002
3 1 lmn 5003
4 1 lmn 5004
5 1 lmn 5005
6 1 lmn 6100
7 1 lmn 6101
8 1 lmn 6102
9 3 xyz 6363
10 2 pqr Q7890
11 2 pqr Q7891
12 2 pqr Q7892
13 2 pqr Q7893
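The same expansion can also be sketched with DataFrame.explode (available since pandas 0.25), which is a different technique from the join used above. The named groups prefix, start and end are illustrative, and the pattern assumes every value has the form prefix-digits, a dash, then the same kind of bound.

```python
import pandas as pd

df = pd.DataFrame({
    "num_range": ["5000-5005", "6100-6102", "6363-6363", "Q7890-Q7893"],
    "id": [1, 1, 3, 2],
    "description": ["lmn", "lmn", "xyz", "pqr"],
})

# Split each range into an optional non-digit prefix and its two numeric bounds.
parts = df["num_range"].str.extract(
    r"^(?P<prefix>\D*)(?P<start>\d+)-\D*(?P<end>\d+)$"
)

# Build the list of expanded labels for every row (the end bound is inclusive).
df["num_range"] = [
    [f"{p}{n}" for n in range(int(s), int(e) + 1)]
    for p, s, e in zip(parts["prefix"], parts["start"], parts["end"])
]

# explode turns each list element into its own row, repeating id and description.
out = df.explode("num_range").reset_index(drop=True)
print(out)
```

The expanded values stay strings here so that prefixed and unprefixed labels share one column type.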
This is a bit brute force, but it shows an explicit way of doing it.
One can also use .apply etc. in fancier ways to cut out some of the loops.
# collect the expanded rows here
rows = []
for _, row in df.iterrows():
    # split num_range into start and end and cast both to int
    # (this assumes purely numeric bounds; a prefix such as Q must be stripped first)
    s, e = map(int, row.num_range.split("-"))
    # add one to e because the end of the range is inclusive
    for n in range(s, e + 1):
        # copy the row's other fields and replace num_range with the single number
        rows.append({**row.to_dict(), "num_range": n})
# DataFrame.append was removed in pandas 2.0, so build the frame once at the end
newdf = pd.DataFrame(rows)

"Rank" DataFrame columns per row

Given a Time Series DataFrame is it possible to create a new DataFrame with the same dimensions but the values are the ranking for each row compared to other columns (ordered smallest value first)?
Example:
ABC DEFG HIJK XYZ
date
2018-01-14 0.110541 0.007615 0.063217 0.002543
2018-01-21 0.007012 0.042854 0.061271 0.007988
2018-01-28 0.085946 0.177466 0.046432 0.069297
2018-02-04 0.018278 0.065254 0.038972 0.027278
2018-02-11 0.071785 0.033603 0.075826 0.073270
The first row would become:
ABC DEFG HIJK XYZ
date
2018-01-14 4 2 3 1
as XYZ has the smallest value in that row and ABC the largest.
numpy.argsort looks like it might help however as it outputs the location itself I have not managed to get it to work.
Many thanks
Use a double argsort to rank the values within each row, and pass the result to the DataFrame constructor:
df1 = pd.DataFrame(df.values.argsort().argsort() + 1, index=df.index, columns=df.columns)
print (df1)
ABC DEFG HIJK XYZ
date
2018-01-14 4 2 3 1
2018-01-21 1 3 4 2
2018-01-28 3 4 1 2
2018-02-04 1 4 3 2
2018-02-11 2 1 4 3
Or use DataFrame.rank with method='dense':
df1 = df.rank(axis=1, method='dense').astype(int)
print (df1)
ABC DEFG HIJK XYZ
date
2018-01-14 4 2 3 1
2018-01-21 1 3 4 2
2018-01-28 3 4 1 2
2018-02-04 1 4 3 2
2018-02-11 2 1 4 3
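The two approaches agree on this sample because no row contains repeated values. The sketch below rebuilds the sample to check that, then uses a small made-up frame (tied, not part of the question) to show how dense ranking treats a tie.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "ABC": [0.110541, 0.007012, 0.085946, 0.018278, 0.071785],
        "DEFG": [0.007615, 0.042854, 0.177466, 0.065254, 0.033603],
        "HIJK": [0.063217, 0.061271, 0.046432, 0.038972, 0.075826],
        "XYZ": [0.002543, 0.007988, 0.069297, 0.027278, 0.073270],
    },
    index=pd.to_datetime(
        ["2018-01-14", "2018-01-21", "2018-01-28", "2018-02-04", "2018-02-11"]
    ),
)
df.index.name = "date"

# Double argsort along each row: position in sorted order, shifted to start at 1.
by_argsort = pd.DataFrame(
    df.to_numpy().argsort(axis=1).argsort(axis=1) + 1,
    index=df.index,
    columns=df.columns,
)

# Built-in row-wise ranking.
by_rank = df.rank(axis=1, method="dense").astype(int)
print(by_rank)

# Hypothetical row with a tie: dense ranking gives equal values the same rank,
# whereas a double argsort would assign them two different ranks.
tied = pd.DataFrame({"a": [0.5], "b": [0.5], "c": [0.1]})
tied_rank = tied.rank(axis=1, method="dense").astype(int)
print(tied_rank)
```

If ties can occur in the real data, DataFrame.rank makes the tie-handling rule explicit through its method argument.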

How to transpose dataframe using classification variable [duplicate]

This question already has answers here:
How can I pivot a dataframe?
(5 answers)
Closed 4 years ago.
I have a dataframe in the following format:
Index Stock Open
1 ABC 10
2 ABC 12
: : :
1 PQR 12
2 PQR 23
: : :
1 XYZ 0.5
2 XYZ 0.9
: : :
I would like to transform the above dataframe using the classification variable (Stock), so that each stock becomes its own column. The required format is shown below:
Index ABC PQR XYZ
1 10 12 0.5
2 12 23 0.9
: : : :
Note: there might be multiple value columns, not only Open.
Please tell me how I can transform or transpose the dataframe into the above format.
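I think you are looking for a pivot (positional arguments to pivot are no longer accepted in pandas 2.0, so pass them by keyword):
df.pivot(index='Index', columns='Stock', values=['Open', 'Close'])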
Open Close
Stock ABC PQR XYZ ABC PQR XYZ
Index
1 10.0 12.0 0.5 13.0 15.0 0.13
2 12.0 23.0 0.9 14.0 16.0 0.14
I used a test dataframe constructed like this:
s = '''Index Stock Open Close
1 ABC 10 13
2 ABC 12 14
1 PQR 12 15
2 PQR 23 16
1 XYZ 0.5 .13
2 XYZ 0.9 .14'''
from io import StringIO
df = pd.read_table(StringIO(s), sep=r'\s+', engine='python')
Index Stock Open Close
0 1 ABC 10.0 13.00
1 2 ABC 12.0 14.00
2 1 PQR 12.0 15.00
3 2 PQR 23.0 16.00
4 1 XYZ 0.5 0.13
5 2 XYZ 0.9 0.14
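For the single-column layout asked for in the question, a minimal sketch is to pivot only Open. It reuses the test data above; the variable name wide is illustrative.

```python
from io import StringIO

import pandas as pd

s = """Index Stock Open Close
1 ABC 10 13
2 ABC 12 14
1 PQR 12 15
2 PQR 23 16
1 XYZ 0.5 .13
2 XYZ 0.9 .14"""
df = pd.read_csv(StringIO(s), sep=r"\s+")

# One row per Index, one column per Stock, filled with the Open values.
wide = df.pivot(index="Index", columns="Stock", values="Open")
print(wide)
```

DataFrame.pivot raises a ValueError when an (Index, Stock) pair occurs more than once; in that case DataFrame.pivot_table with an aggregation function is the usual alternative.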

Pandas deleting rows in order

Given a particular df:
ID Text
1 abc
1 xyz
2 xyz
2 abc
3 xyz
3 abc
3 ijk
4 xyz
I want to apply a condition: within each ID group, if abc exists, delete the rows with xyz. The outcome would be:
ID Text
1 abc
2 abc
3 abc
3 ijk
4 xyz
Usually I would group them by Id and apply np.where(...). However, I don't think this approach would work for this case since it's based on rows.
Many thanks!
To the best of my knowledge, you can vectorize this with a groupby + transform:
df[~(df.Text.eq('abc').groupby(df.ID).transform('any') & df.Text.eq('xyz'))]
ID Text
0 1 abc
3 2 abc
5 3 abc
6 3 ijk
7 4 xyz
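Split into named steps, the same mask can be sketched as follows. The names has_abc and is_xyz are illustrative and not part of the answer.

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [1, 1, 2, 2, 3, 3, 3, 4],
    "Text": ["abc", "xyz", "xyz", "abc", "xyz", "abc", "ijk", "xyz"],
})

# True on every row whose ID group contains at least one abc.
has_abc = df["Text"].eq("abc").groupby(df["ID"]).transform("any")

# True on the rows that are candidates for deletion.
is_xyz = df["Text"].eq("xyz")

# Drop xyz rows only in groups that also contain abc.
out = df[~(has_abc & is_xyz)]
print(out)
```

ID 4 keeps its xyz row because that group has no abc, so has_abc is False there.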
I am using crosstab
s=pd.crosstab(df.ID,df.Text)
s.xyz=s.xyz.mask(s.abc.eq(1)&s.xyz.eq(1))
s
Out[162]:
Text abc ijk xyz
ID
1 1 0 NaN
2 1 0 NaN
3 1 1 NaN
4 0 0 1.0
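# needs import numpy as np; drop(0,1) no longer works in pandas 2.0, and dropna() keeps the result stable if stack retains NaN
s.replace(0,np.nan).stack().dropna().reset_index().drop(columns=0)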
Out[167]:
ID Text
0 1 abc
1 2 abc
2 3 abc
3 3 ijk
4 4 xyz
