I want to slice this string using indexes from another column. I'm getting NaN instead of slices of the string.
import pandas as pd

sales = {'name': ['MSFTCA', 'GTX', 'MSFTUSA'],
         'n_chars': [2, 2, 3],
         'Jan': [150, 200, 50],
         'Feb': [200, 210, 90],
         'Mar': [140, 215, 95]}
df = pd.DataFrame.from_dict(sales)
df

def extract_location(name, n_chars):
    return name.str[-n_chars:]

df.assign(location=(lambda x: extract_location(x['name'], x['n_chars']))).to_dict()
Gives:
{'Feb': {0: 200, 1: 210, 2: 90},
'Jan': {0: 150, 1: 200, 2: 50},
'Mar': {0: 140, 1: 215, 2: 95},
'location': {0: nan, 1: nan, 2: nan},
'n_chars': {0: 2, 1: 2, 2: 3},
'name': {0: 'MSFTCA', 1: 'GTX', 2: 'MSFTUSA'}}
You need apply with axis=1 for row-wise processing:

def extract_location(name, n_chars):
    return name[-n_chars:]

df = df.assign(location=df.apply(lambda x: extract_location(x['name'], x['n_chars']), axis=1))
print(df)
Feb Jan Mar n_chars name location
0 200 150 140 2 MSFTCA CA
1 210 200 215 2 GTX TX
2 90 50 95 3 MSFTUSA USA
df = df.assign(location=df.apply(lambda x: x['name'][-x['n_chars']:], axis=1))
print(df)
Feb Jan Mar n_chars name location
0 200 150 140 2 MSFTCA CA
1 210 200 215 2 GTX TX
2 90 50 95 3 MSFTUSA USA
Using a comprehension
df.assign(location=[name[-n:] for n, name in zip(df.n_chars, df.name)])
Feb Jan Mar n_chars name location
0 200 150 140 2 MSFTCA CA
1 210 200 215 2 GTX TX
2 90 50 95 3 MSFTUSA USA
You can speed it up a bit by iterating over the underlying NumPy arrays:
df.assign(location=[name[-n:] for n, name in zip(df.n_chars.values, df.name.values)])
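Underlying cause, for reference: Series.str[...] only supports a fixed slice applied to every element; it cannot take per-row slice bounds from another Series, and in that case it yields NaN rather than raising. A small sketch with toy data:

```python
import pandas as pd

s = pd.Series(['MSFTCA', 'GTX', 'MSFTUSA'])
n = pd.Series([2, 2, 3])

# A fixed integer slice works element-wise:
fixed = s.str[-2:]
print(fixed.tolist())  # ['CA', 'TX', 'SA']

# Per-row slice lengths need explicit pairing, e.g. via zip:
varying = [name[-k:] for name, k in zip(s, n)]
print(varying)  # ['CA', 'TX', 'USA']
```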
The following is the first couple of columns of a data frame. I want to calculate V1_x - V1_y, V2_x - V2_y, V3_x - V3_y, etc. The paired column names differ only by the last character (either x or y).
import pandas as pd
data = {'Name': ['Tom', 'Joseph', 'Krish', 'John'],
        'Address': ['xx', 'yy', 'zz', 'ww'],
        'V1_x': [20, 21, 19, 18], 'V2_x': [233, 142, 643, 254], 'V3_x': [343, 543, 254, 543],
        'V1_y': [20, 21, 19, 18], 'V2_y': [233, 142, 643, 254], 'V3_y': [343, 543, 254, 543]}
df = pd.DataFrame(data)
df
Name Address V1_x V2_x V3_x V1_y V2_y V3_y
0 Tom xx 20 233 343 20 233 343
1 Joseph yy 21 142 543 21 142 543
2 Krish zz 19 643 254 19 643 254
3 John ww 18 254 543 18 254 543
I currently do the calculation by manually defining the column names:
new_df = pd.DataFrame()
new_df['Name'] = df['Name']
new_df['Address'] = df['Address']
new_df['Col1'] = df['V1_x'] - df['V1_y']
new_df['Col2'] = df['V2_x'] - df['V2_y']
new_df['Col3'] = df['V3_x'] - df['V3_y']
Is there an approach I can use to detect column names that differ only in the last character and compute their difference automatically?
Try creating a MultiIndex header using .str.split, then reshape the dataframe and use pd.DataFrame.eval for the calculation, then reshape back to the original form with the additional columns. Lastly, flatten the MultiIndex header using a list comprehension with f-string formatting:
dfi = df.set_index(['Name', 'Address'])
dfi.columns = dfi.columns.str.split('_', expand=True)
dfs = dfi.stack(0).eval('diff=x-y').unstack()
dfs.columns = [f'{j}_{i}' for i, j in dfs.columns]
dfs
Output:
V1_x V2_x V3_x V1_y V2_y V3_y V1_diff V2_diff V3_diff
Name Address
John ww 18 254 543 18 254 543 0 0 0
Joseph yy 21 142 543 21 142 543 0 0 0
Krish zz 19 643 254 19 643 254 0 0 0
Tom xx 20 233 343 20 233 343 0 0 0
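If the reshape feels heavy, a plainer alternative (a sketch, assuming every _x column has a matching _y partner) is to loop over the shared prefixes directly:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Tom', 'Joseph'], 'Address': ['xx', 'yy'],
                   'V1_x': [20, 21], 'V2_x': [233, 142],
                   'V1_y': [18, 20], 'V2_y': [200, 140]})

# For each base name that appears with an _x suffix, subtract its _y partner
for base in (c[:-2] for c in df.columns if c.endswith('_x')):
    df[f'{base}_diff'] = df[f'{base}_x'] - df[f'{base}_y']

print(df[['V1_diff', 'V2_diff']])
```

This trades the one-liner elegance of stack/eval for explicitness, which can be easier to debug when some pairs are missing.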
I have this dataframe that I need to pivot or reshape based on the frame column:
import pandas as pd
import numpy as np

df = pd.DataFrame({'frame': [0, 1, 2, 0, 1, 2],
                   'pvol': [np.nan, np.nan, np.nan, 23.1, 24.3, 25.6],
                   'vvol': [109.8, 140.5, 160.4, np.nan, np.nan, np.nan],
                   'area': [120, 130, 140, 110, 110, 112],
                   'label': ['v', 'v', 'v', 'p', 'p', 'p']})
Current dataframe
frame pvol vvol area label
0 NaN 109.8 120 v
1 NaN 140.5 130 v
2 NaN 160.4 140 v
0 23.1 NaN 110 p
1 24.3 NaN 110 p
2 25.6 NaN 112 p
Expected output
frame pvol vvol v_area p_area
0 23.1 109.8 110 110
1 24.3 140.5 110 110
2 25.6 160.4 112 112
The prefixes v and p aren't necessary; I just need a way to tell the columns apart.
This is how I got it to work, but it seems long; I'm sure there is a better way:
for name, tdf in df.groupby('label'):
    df.loc[tdf.index, '{}_area'.format(name)] = tdf['area']

pdf = df[df['label'].eq('p')][['frame', 'label', 'pvol', 'p_area']]
vdf = df[df['label'].eq('v')][['frame', 'vvol', 'v_area']]
df = pdf.merge(vdf, on='frame', how='outer')
Let's try pivot and dropna:
out = df.pivot(index='frame', columns='label').dropna(axis=1)
out.columns = [f'{y}_{x}' for x,y in out.columns]
Output:
p_pvol v_vvol p_area v_area
frame
0 23.1 109.8 110 120
1 24.3 140.5 110 130
2 25.6 160.4 112 140
Let us try set_index with unstack:
s = df.set_index(['frame','label']).unstack().dropna(axis=1)
s.columns=s.columns.map('_'.join)
s
Out[102]:
pvol_p vvol_v area_p area_v
frame
0 23.1 109.8 110 120
1 24.3 140.5 110 130
2 25.6 160.4 112 140
I feel stupid for posting this after the gems dropped by #BEN_YO and #Quang_Hoang...
df.set_index('label', inplace=True)
d1 = df.loc['v', ['frame', 'vvol', 'area']].rename(columns={'area':'v_area'})
d2 = df.loc['p', ['frame', 'pvol', 'area']].rename(columns={'area':'p_area'})
pd.merge(d1, d2, on='frame')
frame vvol v_area pvol p_area
0 0 109.8 120 23.1 110
1 1 140.5 130 24.3 110
2 2 160.4 140 25.6 112
I have 2 dataframes:
print(d)
Year Salary Amount Amount1 Amount2
0 2019 1200 53 53 53
1 2020 3443 455 455 455
2 2021 6777 123 123 123
3 2019 5466 313 313 313
4 2020 4656 545 545 545
5 2021 4565 775 775 775
6 2019 4654 567 567 567
7 2020 7867 657 657 657
8 2021 6766 567 567 567
print(d1)
Year Salary Amount Amount1 Amount2
0 2019 1200 53 73 63
import pandas as pd

d = pd.DataFrame({
    'Year': [2019, 2020, 2021] * 3,
    'Salary': [1200, 3443, 6777, 5466, 4656, 4565, 4654, 7867, 6766],
    'Amount': [53, 455, 123, 313, 545, 775, 567, 657, 567],
    'Amount1': [53, 455, 123, 313, 545, 775, 567, 657, 567],
    'Amount2': [53, 455, 123, 313, 545, 775, 567, 657, 567],
})

d1 = pd.DataFrame({
    'Year': [2019],
    'Salary': [1200],
    'Amount': [53],
    'Amount1': [73],
    'Amount2': [63],
})
I want to compare the 'Salary' value of dataframe d1 (i.e. 1200) with all the values of 'Salary' in dataframe d, and count how many satisfy >= or < (a Boolean comparison). This is to be done for all the columns (Amount, Amount1, Amount2, etc.). If the value in any column of d1 is NaN/None, no comparison needs to be done for that column. The column names will always be the same, so it is basically a one-to-one column comparison.
My approach and thoughts:
I can get the values of d1 in a list by doing

l = []
for i in range(len(d1.columns.values)):
    if i == 0:
        continue
    num = d1.iloc[0, i]
    l.append(num)
print(l)

# list comprehension equivalent
lst = [d1.iloc[0, i] for i in range(len(d1.columns.values)) if i != 0]
[1200, 53, 73, 63]
and then use iterrows to iterate over all the columns and rows in dataframe d, or I can iterate over d and perform a similar comparison by looping over d1. Either way would be time-consuming for a high-dimensional dataframe (d in this case).
What would be the more efficient or Pythonic way of doing it?
IIUC, you can do:
(d >= d1.values).sum()
Output:
Year 9
Salary 9
Amount 9
Amount1 8
Amount2 8
dtype: int64
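One caveat the answer leaves open: the question asks to skip columns where d1 holds NaN, but with a plain >= a NaN in d1 simply compares False everywhere and silently counts as 0 rather than "skipped". A sketch of one way to mark those columns explicitly (toy frames, not the originals):

```python
import pandas as pd
import numpy as np

d = pd.DataFrame({'Salary': [1200, 3443, 800], 'Amount': [53, 455, 20]})
d1 = pd.DataFrame({'Salary': [1200], 'Amount': [np.nan]})

row = d1.iloc[0]                  # the single comparison row
counts = (d >= row).sum()         # NaN columns count as 0 here
counts = counts.mask(row.isna())  # mark skipped columns as NaN instead
print(counts)
```

Comparing against the Series aligns on column labels, so this keeps the one-to-one column matching the question describes.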
student,total,m1,m2,m3
a,500,120,220,160
b,600,180,120,200
This is my dataframe, and I just want to calculate the m1, m2, m3 columns as percentages of the total column. I need output like the following dataframe:
student,total,m1,m2,m3,m1(%),m2(%),m3(%)
a,500,120,220,160,24,44,32
...
For example, the m1(%) column will be calculated as (m1/total)*100.
I think you can use div:
df = pd.DataFrame({'total': {0: 500, 1: 600},
'm1': {0: 120, 1: 180},
'm3': {0: 160, 1: 200},
'student': {0: 'a', 1: 'b'},
'm2': {0: 220, 1: 120}},
columns=['student','total','m1','m2','m3'])
print(df)
student total m1 m2 m3
0 a 500 120 220 160
1 b 600 180 120 200
df[['m1(%)','m2(%)','m3(%)']] = df[['m1','m2','m3']].div(df.total, axis=0)*100
print(df)
student total m1 m2 m3 m1(%) m2(%) m3(%)
0 a 500 120 220 160 24.0 44.0 32.000000
1 b 600 180 120 200 30.0 20.0 33.333333
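The same thing can be written without listing the target names twice, using add_suffix to build the new labels (a sketch; the (%) suffix is just a naming choice):

```python
import pandas as pd

df = pd.DataFrame({'student': ['a', 'b'], 'total': [500, 600],
                   'm1': [120, 180], 'm2': [220, 120], 'm3': [160, 200]})

# Divide each mark column by total, scale to percent, and join suffixed copies
pct = df[['m1', 'm2', 'm3']].div(df['total'], axis=0).mul(100)
df = df.join(pct.add_suffix('(%)'))
print(df['m1(%)'].tolist())  # [24.0, 30.0]
```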
(Note that this SO question is similar-looking but different.)
I have a MultiIndexed DataFrame with columns representing yearly data:
>>> x = pd.DataFrame({
...     'country': {0: 4.0, 1: 8.0, 2: 12.0},
...     'series': {0: 553.0, 1: 553.0, 2: 553.0},
...     '2000': {0: '1100', 1: '28', 2: '120'},
...     '2005': {0: '730', 1: '24', 2: '100'}
... }).set_index(['country', 'series'])
>>> x
2000 2005
country series
4 553 1100 730
8 553 28 24
12 553 120 100
When I stack the years, the new index level has no name:
>>> x.stack()
country series
4 553 2000 1100
2005 730
8 553 2000 28
2005 24
12 553 2000 120
2005 100
dtype: object
Is there a nice way to tell stack I'd like the new level to be called 'year'? The docs don't mention one.
I can always do:
>>> x.columns.name = 'year'
>>> x.stack()
But, to my mind, this doesn't qualify as very 'nice'. Can anyone do it in one line?
There is a chaining-friendly way to do it in one line (although admittedly not much nicer) using DataFrame.rename_axis:
x.rename_axis('year', axis=1).stack()
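A quick check of the effect (a sketch on the data from the question):

```python
import pandas as pd

x = pd.DataFrame({'country': [4.0, 8.0, 12.0],
                  'series': [553.0, 553.0, 553.0],
                  '2000': ['1100', '28', '120'],
                  '2005': ['730', '24', '100']}).set_index(['country', 'series'])

# rename_axis names the column axis, so stack() carries that name into the index
stacked = x.rename_axis('year', axis=1).stack()
print(stacked.index.names)  # ['country', 'series', 'year']
```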