I have the following pivoted multilevel pandas dataframe structure:
   Example1 Example2 Weight    Rank    Difference
VC                        X  Y    X  Y
0       ABC      XYZ      1  2    1  2           0
1       PQR      BCD      3  4    3  4           1
I want to melt the data frame and get the following structure:
VC Example1 Example2 Weight Rank Difference
X ABC XYZ 1 1 0
Y ABC XYZ 2 2 0
X PQR BCD 3 3 1
Y PQR BCD 4 4 1
Code:
df = df.pivot_table(index=['Example1', 'Example2'], columns='VC',
                    values=['Weight', 'Rank']).reset_index()
df['Difference'] = df['Rank']['X'] - df['Rank']['Y']
The above code got me to the pivoted frame; the original frame is the required output. So basically, I pivoted a dataframe and now want to melt it to get back to the same structure.
Original Dataframe:
VC Example1 Example2 Weight Rank
X ABC XYZ 1 1
Y ABC XYZ 2 2
X PQR BCD 3 3
Y PQR BCD 4 4
Any help is appreciated! Thanks!
IIUC, I think you need stack, groupby, ffill:
df2.stack(1).groupby(level=0).ffill().dropna().reset_index().drop('level_0', axis=1)
Or
df2.fillna(9999999).stack(1).groupby(level=0).bfill().dropna().reset_index().drop('level_0', axis=1)
EXAMPLE
df_in
VC Example1 Example2 Weight Rank
0 X ABC XYZ 1 1
1 Y ABC XYZ 2 2
2 X PQR BCD 3 3
3 Y PQR BCD 4 4
Your code:
df2 = df_in.pivot_table(index=['Example1', 'Example2'], columns='VC',
                        values=['Weight', 'Rank']).reset_index()
df2['Difference'] = (df2['Rank']['X']-df2['Rank']['Y'])
df2
   Example1 Example2 Rank    Weight    Difference
VC                      X  Y      X  Y
0       ABC      XYZ    1  2      1  2          -1
1       PQR      BCD    3  4      3  4          -1
Reshaping:
df2.stack(1).groupby(level=0).ffill().dropna().reset_index().drop('level_0', axis=1)
Output:
VC Difference Example1 Example2 Rank Weight
0 X -1.0 ABC XYZ 1.0 1.0
1 Y -1.0 ABC XYZ 2.0 2.0
2 X -1.0 PQR BCD 3.0 3.0
3 Y -1.0 PQR BCD 4.0 4.0
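For reference, a minimal end-to-end sketch of the round trip (assuming a pandas version where DataFrame.stack still accepts the column level positionally):
import pandas as pd

# rebuild the example input
df_in = pd.DataFrame({'VC': ['X', 'Y', 'X', 'Y'],
                      'Example1': ['ABC', 'ABC', 'PQR', 'PQR'],
                      'Example2': ['XYZ', 'XYZ', 'BCD', 'BCD'],
                      'Weight': [1, 2, 3, 4],
                      'Rank': [1, 2, 3, 4]})

# pivot, then add the Difference column
df2 = df_in.pivot_table(index=['Example1', 'Example2'], columns='VC',
                        values=['Weight', 'Rank']).reset_index()
df2['Difference'] = df2['Rank']['X'] - df2['Rank']['Y']

# melt back: stack the VC column level, forward-fill the columns that had
# no VC level (Example1, Example2, Difference) within each original row,
# then drop the leftover all-NaN helper rows
out = (df2.stack(1)
          .groupby(level=0).ffill()
          .dropna()
          .reset_index()
          .drop('level_0', axis=1))
print(out)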
I have a dataframe like this:
num  text  foreign
1    abc   4
2    bcd   1
3    efg   3
4    jkl   2
4    jkl   1
I want to make a new column by matching the 'foreign' column against the id column ('num' above) and pulling in the corresponding 'text' value. So I'm expecting:
num  text  foreign  foreign_txt
1    abc   4        jkl
2    bcd   1        abc
3    efg   3        efg
4    jkl   2        bcd
4    jkl   1        abc
What is the syntax to make 'foreign_txt'? I can't drop any rows. I forgot how to do it. Can you help me?
Use Series.map with a Series created by DataFrame.set_index:
df['foreign_txt'] = df['foreign'].map(df.set_index('id')['text'])
print (df)
id text foreign foreign_txt
0 1 abc 4 jkl
1 2 bcd 1 abc
2 3 efg 3 efg
3 4 jkl 2 bcd
Or:
df = df.merge(df.drop('foreign', axis=1)
                .rename(columns={'id': 'foreign', 'text': 'foreign_txt'}),
              how='left')
EDIT: If an id can repeat, you have to decide which text wins. To avoid losing texts that differ per id, aggregate them with a join, or take the first/last duplicate with DataFrame.drop_duplicates:
print (df)
id text foreign
0 1 abc 4
1 2 bcd 1
2 3 efg 3
3 4 jkl 2
4 4 jkl 1
5 3 aaa 8
#join unique duplicates
s = df.drop_duplicates(['id','text']).groupby('id')['text'].agg(','.join)
df['foreign_txt1'] = df['foreign'].map(s)
#get first duplicates
df['foreign_txt2'] = df['foreign'].map(df.drop_duplicates('id').set_index('id')['text'])
#get last duplicates
df['foreign_txt3'] = df['foreign'].map(df.drop_duplicates('id', keep='last').set_index('id')['text'])
print (df)
id text foreign foreign_txt1 foreign_txt2 foreign_txt3
0 1 abc 4 jkl jkl jkl
1 2 bcd 1 abc abc abc
2 3 efg 3 efg,aaa efg aaa
3 4 jkl 2 bcd bcd bcd
4 4 jkl 1 abc abc abc
5 3 aaa 8 NaN NaN NaN
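A note on why the duplicates have to be dealt with first: Series.map with a Series mapper looks values up via the mapper's index, and a non-unique index typically raises an InvalidIndexError in recent pandas versions:
# ids 3 and 4 are duplicated in the index, so this lookup is ambiguous
# and typically raises pandas.errors.InvalidIndexError
df['foreign'].map(df.set_index('id')['text'])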
You can also use the map method:
Code:
import pandas as pd

data = {'num': [1, 2, 3, 4, 4],
        'text': ['abc', 'bcd', 'efg', 'jkl', 'jkl'],
        'foreign': [4, 1, 3, 2, 1]}
df = pd.DataFrame(data)

foreign_dict = df.set_index('num')['text'].to_dict()
# print(foreign_dict)  # {1: 'abc', 2: 'bcd', 3: 'efg', 4: 'jkl'}
df['foreign_txt'] = df['foreign'].map(foreign_dict)
print(df)
Output:
num text foreign foreign_txt
0 1 abc 4 jkl
1 2 bcd 1 abc
2 3 efg 3 efg
3 4 jkl 2 bcd
4 4 jkl 1 abc
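Note that set_index('num')['text'].to_dict() keeps the last text seen for a duplicated num (here both rows with num 4 carry 'jkl', so it makes no difference), which is why building a plain dict sidesteps the non-unique-index problem that Series.map hits with a Series mapper.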
I am new to pandas and am trying to complete the following. I have a dataframe which looks like this:
row  A    B
1    abc  abc
2    abc
3         abc
4
5    abc  abc
My desired output would look like this:
row  A    B
1    abc  abc
2    abc
3         abc
5    abc  abc
I am trying to drop rows if there is no value in both A and B columns:
if finalized_export_cf[finalized_export_cf['A']].str.len() < 2:
    if finalized_export_cf[finalized_export_cf['B']].str.len() < 2:
        finalized_export_cf[finalized_export_cf['B']].drop()
But that gives me the following error:
ValueError: cannot index with vector containing NA / NaN values
How could I drop values when both columns have an empty cell?
Thank you for your suggestions.
You can check whether all values in a row are null by chaining .isnull() and .all(). isnull() produces a dataframe of booleans, and all(axis=1) checks whether all values in a given row are True. If that's the case, all values in that row are nulls:
inds = df[["A", "B"]].isnull().all(axis=1)
You can then use inds to clean up all rows that have only nulls. First negate it using the tilde ~, or else you would keep only the rows with missing values:
df = df.loc[~inds, :]
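Put together (assuming the empty cells really are NaN rather than empty strings), this is a one-liner:
df = df[~df[['A', 'B']].isnull().all(axis=1)]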
For your use case you can also build a mask of the missing values and keep the rows where A and B are not both missing:
mask = df.isna()
df[~(mask.A & mask.B)]
output:
row A B
0 1 abc abc
1 2 abc NaN
2 3 NaN abc
4 5 abc abc
If the missing values are NaNs, use DataFrame.dropna with the how='all' and subset parameters:
print (df)
row A B
0 1 abc abc
1 2 abc NaN
2 3 NaN abc
3 4 NaN NaN
4 5 abc abc
df = df.dropna(how='all', subset=['A','B'])
print (df)
row A B
0 1 abc abc
1 2 abc NaN
2 3 NaN abc
4 5 abc abc
Or, if the empty values are empty strings, compare with ne('') and use DataFrame.any:
print (df)
   row    A    B
0    1  abc  abc
1    2  abc
2    3       abc
3    4
4    5  abc  abc
df = df[df[['A','B']].ne('').any(axis=1)]
print (df)
   row    A    B
0    1  abc  abc
1    2  abc
2    3       abc
4    5  abc  abc
If you have only two columns, you can use the how parameter of pandas.DataFrame.dropna by setting it to 'all':
df.dropna(how='all')
First of all we need to change the blank strings to NaN:
import numpy as np

df = df.replace(r'^\s*$', np.nan, regex=True)
then dropna while subsetting to your two columns:
df.dropna(subset=['A','B'], how='all').fillna(' ')  # fillna if you want spaces back for the NAs
print(df)
   row    A    B
0    1  abc  abc
1    2  abc
2    3       abc
4    5  abc  abc
I have a dataframe having column values like this:
num_range id description
'5000-6000' 1 lmn
'6100-6102' 1 lmn
'6363-6363' 3 xyz
'Q7890-Q8000' 2 pqr
So is there a way I can write a loop which splits each range into rows and gives me the values? For example, for the first num_range value, something like this:
num_range id description
5000 1 lmn
5001 1 lmn
5002 1 lmn
..... ... ....
5999 1 lmn
6000 1 lmn
Q7890 2 pqr
Q7891 2 pqr
... ... ...
Q8000 2 pqr
Likewise, I want rows for all the num_range values, along with the id and the description.
Use Series.str.findall to get the numeric values (this also works when there is a non-numeric prefix like the Q in the last row), then create a Series with a list comprehension and join it back to the original:
print (df)
num_range id description
0 5000-5005 1 lmn
1 6100-6102 1 lmn
2 6363-6363 3 xyz
3 Q7890-Q7893 2 pqr
s = df.pop('num_range').str.findall(r'\d+')
a = [(i, x) for i, (a, b) in s.items() for x in range(int(a), int(b) + 1)]
s = pd.DataFrame(a).set_index(0)[1].rename('num_range')
df = df.join(s)
print (df)
id description num_range
0 1 lmn 5000
0 1 lmn 5001
0 1 lmn 5002
0 1 lmn 5003
0 1 lmn 5004
0 1 lmn 5005
1 1 lmn 6100
1 1 lmn 6101
1 1 lmn 6102
2 3 xyz 6363
3 2 pqr 7890
3 2 pqr 7891
3 2 pqr 7892
3 2 pqr 7893
If you also need the prefix before the numbers, first extract those values with Series.str.extract, replace '-' with an empty string, and use the mapping in the list comprehension:
d = df['num_range'].str.extract(r'(\D+)\d+', expand=False).replace('-', '').to_dict()
print (d)
{0: '', 1: '', 2: '', 3: 'Q'}
s = df.pop('num_range').str.findall(r'\d+')
a = [(i, '{}{}'.format(d.get(i), x))
     for i, (a, b) in s.items() for x in range(int(a), int(b) + 1)]
s = pd.DataFrame(a).set_index(0)[1].rename('num_range')
df = df.join(s).reset_index(drop=True)
print (df)
id description num_range
0 1 lmn 5000
1 1 lmn 5001
2 1 lmn 5002
3 1 lmn 5003
4 1 lmn 5004
5 1 lmn 5005
6 1 lmn 6100
7 1 lmn 6101
8 1 lmn 6102
9 3 xyz 6363
10 2 pqr Q7890
11 2 pqr Q7891
12 2 pqr Q7892
13 2 pqr Q7893
This is a bit brute force, but it shows a way of doing it explicitly. One can use .apply, explode, etc. in fancier ways to cut out some loops; see the sketch after this block.
# collect the expanded rows here (handles purely numeric ranges)
rows = []
for _, row in df.iterrows():
    # split num_range and cast both ends to int
    s, e = map(int, row.num_range.split("-"))
    # need to add one to e because the range is inclusive
    for n in range(s, e + 1):
        # replace the number on a copy of the row you've iterated on
        new_row = row.copy()
        new_row.num_range = n
        rows.append(new_row)
newdf = pd.DataFrame(rows)
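One such fancier way is a hedged sketch using Series.str.findall plus DataFrame.explode (pandas 0.25+); like the simpler variant of the first answer, it drops any non-numeric prefix such as the Q:
# expand each 'a-b' range into an inclusive list of ints, then explode into rows
df2 = df.copy()
df2['num_range'] = (df2['num_range'].str.findall(r'\d+')
                    .map(lambda ab: list(range(int(ab[0]), int(ab[1]) + 1))))
df2 = df2.explode('num_range').reset_index(drop=True)
print(df2)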
Given a particular df:
ID Text
1 abc
1 xyz
2 xyz
2 abc
3 xyz
3 abc
3 ijk
4 xyz
I want to apply a condition where, grouping by ID, if 'abc' exists in the group then the row with 'xyz' is deleted. The outcome would be:
ID Text
1 abc
2 abc
3 abc
3 ijk
4 xyz
Usually I would group them by ID and apply np.where(...). However, I don't think that approach works for this case, since the condition depends on other rows.
Many thanks!
To the best of my knowledge, you can vectorize this with a groupby + transform:
df[~(df.Text.eq('abc').groupby(df.ID).transform('any') & df.Text.eq('xyz'))]
ID Text
0 1 abc
3 2 abc
5 3 abc
6 3 ijk
7 4 xyz
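Broken into named steps (same logic, just more readable):
# True for every row whose ID group contains at least one 'abc'
has_abc = df.Text.eq('abc').groupby(df.ID).transform('any')
# True for the rows that are themselves 'xyz'
is_xyz = df.Text.eq('xyz')
# drop the 'xyz' rows that live in a group that also has 'abc'
out = df[~(has_abc & is_xyz)]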
I am using crosstab:
s=pd.crosstab(df.ID,df.Text)
s.xyz=s.xyz.mask(s.abc.eq(1)&s.xyz.eq(1))
s
Out[162]:
Text abc ijk xyz
ID
1 1 0 NaN
2 1 0 NaN
3 1 1 NaN
4 0 0 1.0
s.replace(0, np.nan).stack().reset_index().drop(0, axis=1)
Out[167]:
ID Text
0 1 abc
1 2 abc
2 3 abc
3 3 ijk
4 4 xyz
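Note the trade-off: crosstab materializes a full ID × Text matrix before stacking it back, which can get large for high-cardinality data, whereas the transform mask in the previous answer filters the original rows directly.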
I have a dataframe:
id category value
1 1 abc
2 2 abc
3 1 abc
4 4 abc
5 4 abc
6 3 abc
Category 1 = best, 2 = good, 3 = bad, 4 = ugly
I want to create a new column such that for category 1 the value should be cat_1, for category 2 it should be cat_2, and so on. In new_col2, category 1 should give cat_best, category 2 cat_good, etc.
df['new_col'] = ''
My final df:
id category value new_col new_col2
1 1 abc cat_1 cat_best
2 2 abc cat_2 cat_good
3 1 abc cat_1 cat_best
4 4 abc cat_4 cat_ugly
5 4 abc cat_4 cat_ugly
6 3 abc cat_3 cat_bad
I can iterate it in a for loop:
for index, row in df.iterrows():
    df.loc[df.id == row.id, 'new_col'] = 'cat_' + str(row['category'])
Is there a better way of doing it (least time consuming)?
I think you need to concatenate the string with the column converted to string, and map the dictionary for the second column:
d = {1:'best', 2: 'good', 3 : 'bad', 4 :'ugly'}
df['new_col'] = 'cat_'+ df['category'].astype(str)
df['new_col2'] = 'cat_'+ df['category'].map(d)
Or:
df = df.assign(new_col='cat_' + df['category'].astype(str),
               new_col2='cat_' + df['category'].map(d))
print (df)
id category value new_col new_col2
0 1 1 abc cat_1 cat_best
1 2 2 abc cat_2 cat_good
2 3 1 abc cat_1 cat_best
3 4 4 abc cat_4 cat_ugly
4 5 4 abc cat_4 cat_ugly
5 6 3 abc cat_3 cat_bad
You can do it by using apply also:
df['new_col']=df['category'].apply(lambda x: "cat_"+str(x))
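The vectorized concatenation above is usually the fastest of these on large frames. A hedged micro-benchmark sketch (numbers vary by machine and frame size):
import timeit

import numpy as np
import pandas as pd

# a larger throwaway frame just for timing
big = pd.DataFrame({'category': np.random.randint(1, 5, 100_000)})

vectorized = lambda: 'cat_' + big['category'].astype(str)
row_wise = lambda: big['category'].apply(lambda x: 'cat_' + str(x))

print('vectorized:', timeit.timeit(vectorized, number=10))
print('apply:     ', timeit.timeit(row_wise, number=10))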