I have a dataframe having column values like this:
num_range id description
'5000-6000' 1 lmn
'6100-6102' 1 lmn
'6363-6363' 3 xyz
'Q7890-Q8000' 2 pqr
Is there a way I can write a loop that will split each range into rows and give me the values? For example, for the first num_range value, something like this:
num_range id description
5000 1 lmn
5001 1 lmn
5002 1 lmn
..... ... ....
5999 1 lmn
6000 1 lmn
Q7891 2 pqr
Q7892 2 pqr
... ... ...
Q8000 2 pqr
Likewise, I want rows for all the num_range values, along with the id and the description.
Use Series.str.findall to get the numeric values (this also works if there is a non-numeric prefix, like Q in the last row), then create a Series by list comprehension and join it to the original:
print (df)
num_range id description
0 5000-5005 1 lmn
1 6100-6102 1 lmn
2 6363-6363 3 xyz
3 Q7890-Q7893 2 pqr
s = df.pop('num_range').str.findall(r'\d+')
a = [(i, x) for i, (a, b) in s.items() for x in range(int(a), int(b) + 1)]
s = pd.DataFrame(a).set_index(0)[1].rename('num_range')
df = df.join(s)
print (df)
id description num_range
0 1 lmn 5000
0 1 lmn 5001
0 1 lmn 5002
0 1 lmn 5003
0 1 lmn 5004
0 1 lmn 5005
1 1 lmn 6100
1 1 lmn 6101
1 1 lmn 6102
2 3 xyz 6363
3 2 pqr 7890
3 2 pqr 7891
3 2 pqr 7892
3 2 pqr 7893
If you also need the prefix before the numeric values, first extract it with Series.str.extract, replace - with an empty string, and map it in the list comprehension:
d = df['num_range'].str.extract(r'(\D+)\d+', expand=False).replace('-','').to_dict()
print (d)
{0: '', 1: '', 2: '', 3: 'Q'}
s = df.pop('num_range').str.findall(r'\d+')
a = [(i, '{}{}'.format(d.get(i), x))
for i, (a, b) in s.items() for x in range(int(a), int(b) + 1)]
s = pd.DataFrame(a).set_index(0)[1].rename('num_range')
df = df.join(s).reset_index(drop=True)
print (df)
id description num_range
0 1 lmn 5000
1 1 lmn 5001
2 1 lmn 5002
3 1 lmn 5003
4 1 lmn 5004
5 1 lmn 5005
6 1 lmn 6100
7 1 lmn 6101
8 1 lmn 6102
9 3 xyz 6363
10 2 pqr Q7890
11 2 pqr Q7891
12 2 pqr Q7892
13 2 pqr Q7893
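As a sketch of an alternative (assuming pandas >= 1.1 for explode(ignore_index=True); the frame below re-creates the sample data from the answer above): build each expanded range as a list of strings, then explode:

```python
import pandas as pd

df = pd.DataFrame({'num_range': ['5000-5005', '6100-6102', '6363-6363', 'Q7890-Q7893'],
                   'id': [1, 1, 3, 2],
                   'description': ['lmn', 'lmn', 'xyz', 'pqr']})

# capture an optional non-digit prefix and the two numeric endpoints
parts = df['num_range'].str.extract(r'(?P<prefix>\D*)(?P<start>\d+)-\D*(?P<end>\d+)')

# one list of prefixed values per row, inclusive of the end value
df['num_range'] = [['{}{}'.format(p, n) for n in range(int(a), int(b) + 1)]
                   for p, a, b in zip(parts['prefix'], parts['start'], parts['end'])]

# one row per value; ignore_index renumbers the result 0..n-1
df = df.explode('num_range', ignore_index=True)
print(df)
```

This keeps the id and description columns aligned automatically, at the cost of materializing the intermediate lists.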
This is a bit brute-force, but it shows a way of doing it explicitly.
One can also use .apply etc. in fancy ways to cut out some of the loops.
# going to collect the expanded rows here
# (DataFrame.append was removed in pandas 2.0, so collect rows and build once)
rows = []
for _, row in df.iterrows():
    # split num_range and cast both ends to int
    s, e = map(int, row.num_range.split("-"))
    # need to add one to e because the end of the range is inclusive
    for n in range(s, e + 1):
        # replace the number on a copy of the row you've iterated on
        new_row = row.copy()
        new_row.num_range = n
        rows.append(new_row)
newdf = pd.DataFrame(rows)
I have a dataframe like this:
num text foreign
1   abc  4
2   bcd  1
3   efg  3
4   jkl  2
4   jkl  1
I want to make a new column where the 'foreign' column is matched against the 'id' column to get the corresponding 'text' value.
So I'm expecting:
num text foreign foreign_txt
1   abc  4       jkl
2   bcd  1       abc
3   efg  3       bcd
4   jkl  2       efg
4   jkl  1       abc
What is the syntax to make 'foreign_txt'?
I can't drop any rows.
I forget how to do it. Can you help me?
Use Series.map with a Series created by DataFrame.set_index:
df['foreign_txt'] = df['foreign'].map(df.set_index('id')['text'])
print (df)
id text foreign foreign_txt
0 1 abc 4 jkl
1 2 bcd 1 abc
2 3 efg 3 efg
3 4 jkl 2 bcd
Or:
df = (df.merge(df.drop('foreign', axis=1)
.rename(columns={'id':'foreign', 'text':'foreign_txt'}), how='left'))
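A minimal runnable sketch of the merge route (sample data re-created from the question, with the key column named 'id' as in the answer):

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3, 4],
                   'text': ['abc', 'bcd', 'efg', 'jkl'],
                   'foreign': [4, 1, 3, 2]})

# a lookup frame whose key column is named 'foreign', so merge aligns on it
lookup = (df.drop('foreign', axis=1)
            .rename(columns={'id': 'foreign', 'text': 'foreign_txt'}))

# how='left' keeps every original row, even if a foreign key has no match
out = df.merge(lookup, how='left')
print(out['foreign_txt'].tolist())  # ['jkl', 'abc', 'efg', 'bcd']
```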
EDIT: If you need the first value of text per id, add DataFrame.drop_duplicates; to avoid removing unique texts per id, aggregate with join instead:
print (df)
id text foreign
0 1 abc 4
1 2 bcd 1
2 3 efg 3
3 4 jkl 2
4 4 jkl 1
5 3 aaa 8
#join unique duplicates
s = df.drop_duplicates(['id','text']).groupby('id')['text'].agg(','.join)
df['foreign_txt1'] = df['foreign'].map(s)
#get first duplicates
df['foreign_txt2'] = df['foreign'].map(df.drop_duplicates('id').set_index('id')['text'])
#get last duplicates
df['foreign_txt3'] = df['foreign'].map(df.drop_duplicates('id', keep='last').set_index('id')['text'])
print (df)
id text foreign foreign_txt1 foreign_txt2 foreign_txt3
0 1 abc 4 jkl jkl jkl
1 2 bcd 1 abc abc abc
2 3 efg 3 efg,aaa efg aaa
3 4 jkl 2 bcd bcd bcd
4 4 jkl 1 abc abc abc
5 3 aaa 8 NaN NaN NaN
You can apply the map method!
Code:
import pandas as pd
data = {'num': [1, 2, 3, 4, 4],
        'text': ['abc', 'bcd', 'efg', 'jkl', 'jkl'],
        'foreign': [4, 1, 3, 2, 1]}
df = pd.DataFrame(data)
foreign_dict = df.set_index('num')['text'].to_dict()
#print(foreign_dict) #{1: 'abc', 2: 'bcd', 3: 'efg', 4: 'jkl'}
df['foreign_txt'] = df['foreign'].map(foreign_dict)
print(df)
Output:
num text foreign foreign_txt
0 1 abc 4 jkl
1 2 bcd 1 abc
2 3 efg 3 efg
3 4 jkl 2 bcd
4 4 jkl 1 abc
Here is the sample data
import pandas as pd
df=pd.DataFrame({'P_Name':['ABC','ABC','ABC','ABC','PQR','PQR','PQR','PQR','XYZ','XYZ','XYZ','XYZ'],
'Date':['11/01/2020','12/01/2020','13/01/2020','14/01/2020','11/01/2020','12/01/2020','13/01/2020','14/01/2020','11/01/2020','12/01/2020','13/01/2020','14/01/2020'],
'Open':['242.584','238.179','233.727','229.441','241.375','28.965','235.96','233.193','280.032','78.472','277.592','276.71'],
'End':['4.405','4.452','4.286','4.405','2.41','3.005','2.767','3.057','1.56','0.88','0.882','0.88'],
'Close':['238.179','233.727','229.441','225.036','238.965','235.96','233.193','230.136','278.472','277.592','276.71','275.83']})
I'm trying to create a new column with the condition that, for every new product entry, the value will be 1, AND for the remaining rows it checks whether the previous Close equals the current Open (e.g. df['Close'][0] == df['Open'][1]); the value will be 1 if they are the same, and 0 if not (e.g. df['Close'][8] == df['Open'][9]).
df after these conditions
P_Name Date Open End Close Check
0 ABC 11/01/2020 242.584 4.405 238.179 1
1 ABC 12/01/2020 238.179 4.452 233.727 1
2 ABC 13/01/2020 233.727 4.286 229.441 1
3 ABC 14/01/2020 229.441 4.405 225.036 1
4 PQR 11/01/2020 241.375 2.41 238.965 1
5 PQR 12/01/2020 28.965 3.005 235.96 0
6 PQR 13/01/2020 235.96 2.767 233.193 1
7 PQR 14/01/2020 233.193 3.057 230.136 1
8 XYZ 11/01/2020 280.032 1.56 278.472 1
9 XYZ 12/01/2020 78.472 0.88 277.592 0
10 XYZ 13/01/2020 277.592 0.882 276.71 1
11 XYZ 14/01/2020 276.71 0.88 275.83 1
You can compare shifted values per group with DataFrameGroupBy.shift and Series.eq, replace the missing value in the first row of each group by the other column with Series.fillna, and cast the mask to 0/1 with Series.astype:
df['Check'] = df.Open.eq(df.groupby('P_Name').Close.shift().fillna(df.Open)).astype(int)
Another idea is to compare without groups, but chain another mask with Series.duplicated to match the first row of each group:
df['Check'] = (~df.P_Name.duplicated() | df.Open.eq(df.Close.shift())).astype(int)
print (df)
P_Name Date Open End Close Check
0 ABC 11/01/2020 242.584 4.405 238.179 1
1 ABC 12/01/2020 238.179 4.452 233.727 1
2 ABC 13/01/2020 233.727 4.286 229.441 1
3 ABC 14/01/2020 229.441 4.405 225.036 1
4 PQR 11/01/2020 241.375 2.41 238.965 1
5 PQR 12/01/2020 28.965 3.005 235.96 0
6 PQR 13/01/2020 235.96 2.767 233.193 1
7 PQR 14/01/2020 233.193 3.057 230.136 1
8 XYZ 11/01/2020 280.032 1.56 278.472 1
9 XYZ 12/01/2020 78.472 0.88 277.592 0
10 XYZ 13/01/2020 277.592 0.882 276.71 1
11 XYZ 14/01/2020 276.71 0.88 275.83 1
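A self-contained sketch of the groupby-shift idea on a hypothetical four-row slice (numeric values instead of the strings in the sample data):

```python
import pandas as pd

df = pd.DataFrame({'P_Name': ['ABC', 'ABC', 'PQR', 'PQR'],
                   'Open':  [242.584, 238.179, 241.375, 28.965],
                   'Close': [238.179, 233.727, 238.965, 235.96]})

# previous Close within each product; the first row of a group has no
# previous value, so fill it with its own Open (forcing a match -> 1)
prev_close = df.groupby('P_Name')['Close'].shift().fillna(df['Open'])
df['Check'] = df['Open'].eq(prev_close).astype(int)
print(df['Check'].tolist())  # [1, 1, 1, 0]
```

Row 3 gets 0 because its Open (28.965) does not match the previous Close (238.965) within the PQR group.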
check = [1]  # the first row has no previous Close to compare, treat it as a match
for i in range(len(df) - 1):
    if df['Close'][i] == df['Open'][i + 1]:
        check.append(1)
    else:
        check.append(0)
# note: unlike the groupby solution above, this does not restart at each new P_Name
df['Check'] = check
I am new to pandas and trying to complete the following:
I have a dataframe which look like this:
row A B
1 abc abc
2 abc
3 abc
4
5 abc abc
My desired output would look like this:
row A B
1 abc abc
2 abc
3 abc
5 abc abc
I am trying to drop rows if there is no value in both A and B columns:
if finalized_export_cf[finalized_export_cf['A']].str.len()<2:
    if finalized_export_cf[finalized_export_cf['B']].str.len()<2:
        finalized_export_cf[finalized_export_cf['B']].drop()
But that gives me the following error:
ValueError: cannot index with vector containing NA / NaN values
How could I drop values when both columns have an empty cell?
Thank you for your suggestions.
You can check whether all values in a row are null by chaining .isnull() and .all(). isnull() produces a dataframe of booleans, and all(axis=1) checks whether all values in a given row are True. If that's the case, all values in that row are nulls:
inds = df[["A", "B"]].isnull().all(axis=1)
You can then use inds to clean up all rows that have only nulls. First negate it using the tilde ~, or else you would keep only the rows with missing values:
df = df.loc[~inds, :]
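A runnable sketch of the two lines above, with hypothetical data mirroring the question:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'row': [1, 2, 3, 4, 5],
                   'A': ['abc', 'abc', np.nan, np.nan, 'abc'],
                   'B': ['abc', np.nan, 'abc', np.nan, 'abc']})

# True only where both A and B are NaN (row 4 here)
inds = df[['A', 'B']].isnull().all(axis=1)
df = df.loc[~inds, :]
print(df['row'].tolist())  # [1, 2, 3, 5]
```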
For your use case you can create a mask and get the values where A & B are not True:
mask = df.isna()
df[~(mask.A & mask.B)]
output:
row A B
0 1 abc abc
1 2 abc NaN
2 3 NaN abc
4 5 abc abc
If missing values are NaNs then use DataFrame.dropna with all and subset parameter:
print (df)
row A B
0 1 abc abc
1 2 abc NaN
2 3 NaN abc
3 4 NaN NaN
4 5 abc abc
df = df.dropna(how='all', subset=['A','B'])
print (df)
row A B
0 1 abc abc
1 2 abc NaN
2 3 NaN abc
4 5 abc abc
Or if the empty values are empty strings, compare not-equal to '' and use DataFrame.any:
print (df)
row A B
0 1 abc abc
1 2 abc
2 3 abc
3 4
4 5 abc abc
df = df[df[['A','B']].ne('').any(axis=1)]
print (df)
row A B
0 1 abc abc
1 2 abc
2 3 abc
4 5 abc abc
If you have only two columns, you can use the how attribute of pandas.DataFrame.dropna by setting it to 'all':
df.dropna(how='all')
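For instance, on a hypothetical two-column frame:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'A': ['abc', 'abc', np.nan, np.nan],
                   'B': ['abc', np.nan, 'abc', np.nan]})

# only the row where every column is NaN is removed
out = df.dropna(how='all')
print(len(out))  # 3
```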
First of all, we need to change the blank cells to NaN:
import numpy as np
df = df.replace(r'^\s*$', np.nan, regex=True)
Then drop NaNs while restricting the check to the A and B columns:
df.dropna(subset=['A','B'],how='all').fillna(' ') # if you want spaces for na
print(df)
row A B
0 1 abc abc
1 2 abc
2 3 abc
4 5 abc abc
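Putting both steps together as a runnable sketch (blank and whitespace-only cells assumed in the data):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'row': [1, 2, 3, 4, 5],
                   'A': ['abc', 'abc', '', '', 'abc'],
                   'B': ['abc', '', 'abc', ' ', 'abc']})

# blank or whitespace-only cells become NaN so dropna can see them
df = df.replace(r'^\s*$', np.nan, regex=True)
# drop a row only when both A and B are missing
df = df.dropna(subset=['A', 'B'], how='all')
print(df['row'].tolist())  # [1, 2, 3, 5]
```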
Given a particular df:
ID Text
1 abc
1 xyz
2 xyz
2 abc
3 xyz
3 abc
3 ijk
4 xyz
I want to apply condition where: Grouping by ID, if abc exists then delete row with xyz. The outcome would be:
ID Text
1 abc
2 abc
3 abc
3 ijk
4 xyz
Usually I would group them by Id and apply np.where(...). However, I don't think this approach would work for this case since it's based on rows.
Many thanks!
To the best of my knowledge, you can vectorize this with a groupby + transform:
df[~(df.Text.eq('abc').groupby(df.ID).transform('any') & df.Text.eq('xyz'))]
ID Text
0 1 abc
3 2 abc
5 3 abc
6 3 ijk
7 4 xyz
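Reproduced as a self-contained sketch with the question's data:

```python
import pandas as pd

df = pd.DataFrame({'ID': [1, 1, 2, 2, 3, 3, 3, 4],
                   'Text': ['abc', 'xyz', 'xyz', 'abc', 'xyz', 'abc', 'ijk', 'xyz']})

# True for every row whose ID group contains 'abc' somewhere
has_abc = df['Text'].eq('abc').groupby(df['ID']).transform('any')
# drop 'xyz' rows only inside those groups; group 4 keeps its 'xyz'
out = df[~(has_abc & df['Text'].eq('xyz'))]
print(out['Text'].tolist())  # ['abc', 'abc', 'abc', 'ijk', 'xyz']
```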
I am using crosstab
s=pd.crosstab(df.ID,df.Text)
s.xyz=s.xyz.mask(s.abc.eq(1)&s.xyz.eq(1))
s
Out[162]:
Text abc ijk xyz
ID
1 1 0 NaN
2 1 0 NaN
3 1 1 NaN
4 0 0 1.0
s.replace(0, np.nan).stack().reset_index().drop(columns=0)
Out[167]:
ID Text
0 1 abc
1 2 abc
2 3 abc
3 3 ijk
4 4 xyz
I have a following pivoted multilevel pandas dataframe structure:
Example1 Example2 Weight Rank Difference
VC X Y X Y
0 ABC XYZ 1 2 1 2 0
1 PQR BCD 3 4 3 4 1
I want to melt the data frame and get the following structure:
VC Example1 Example2 Weight Rank Difference
X ABC XYZ 1 1 0
Y ABC XYZ 2 2 0
X PQR BCD 3 3 1
Y PQR BCD 4 4 1
Code:
df = df.pivot_table(index =
['Example1','Example2'],columns='VC', values=
['Weight','Rank']).reset_index()
df['Difference'] = (df['Rank']['X']-df['Rank']['Y'])
The above code got me to the pivoted frame; the original frame is the required output. So basically, I pivoted a dataframe and now want to melt it back to the same structure.
Original Dataframe:
VC Example1 Example2 Weight Rank
X ABC XYZ 1 1
Y ABC XYZ 2 2
X PQR BCD 3 3
Y PQR BCD 4 4
Any help is appreciated! Thanks!
IIUC, I think you need stack, groupby, ffill:
df2.stack(1).groupby(level=0).ffill().dropna().reset_index().drop('level_0', axis=1)
Or
df.fillna(9999999).stack(1).groupby(level=0).bfill().dropna().reset_index().drop('level_0', axis=1)
EXAMPLE
df_in
VC Example1 Example2 Weight Rank
0 X ABC XYZ 1 1
1 Y ABC XYZ 2 2
2 X PQR BCD 3 3
3 Y PQR BCD 4 4
Your code:
df2 = df_in.pivot_table(index =
['Example1','Example2'],columns='VC', values=
['Weight','Rank']).reset_index()
df2['Difference'] = (df2['Rank']['X']-df2['Rank']['Y'])
df2
Example1 Example2 Rank Weight Difference
VC X Y X Y
0 ABC XYZ 1 2 1 2 -1
1 PQR BCD 3 4 3 4 -1
Reshaping:
df2.stack(1).groupby(level=0).ffill().dropna().reset_index().drop('level_0', axis=1)
Output:
VC Difference Example1 Example2 Rank Weight
0 X -1.0 ABC XYZ 1.0 1.0
1 Y -1.0 ABC XYZ 2.0 2.0
2 X -1.0 PQR BCD 3.0 3.0
3 Y -1.0 PQR BCD 4.0 4.0
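The full round trip as a runnable sketch (classic stack behavior assumed; newer pandas versions may emit a FutureWarning for stack(1)):

```python
import pandas as pd

df = pd.DataFrame({'VC': ['X', 'Y', 'X', 'Y'],
                   'Example1': ['ABC', 'ABC', 'PQR', 'PQR'],
                   'Example2': ['XYZ', 'XYZ', 'BCD', 'BCD'],
                   'Weight': [1, 2, 3, 4],
                   'Rank': [1, 2, 3, 4]})

# pivot as in the question, then compute the per-pair difference
wide = df.pivot_table(index=['Example1', 'Example2'], columns='VC',
                      values=['Weight', 'Rank']).reset_index()
wide['Difference'] = wide['Rank']['X'] - wide['Rank']['Y']

# melt back: stack the VC level, forward-fill the labels that only
# exist on the helper rows, then drop those all-NaN helper rows
out = (wide.stack(1)
           .groupby(level=0).ffill()
           .dropna()
           .reset_index()
           .drop('level_0', axis=1))
print(out)
```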