Manipulate string values in pandas - python

I have a pandas dataframe with different formats for one column, like this:
Name    Values
First   5-9
Second  7
Third   -
Fourth  12-16
I need to iterate over the Values column and, where the format is like the first row (5-9) or the fourth row (12-16), replace the string with the mean of the two numbers: 5-9 becomes 7, and 12-16 becomes 14. Where the format is like the third row (-), replace it with 0.
I have tried:
if df["Value"].str.len() > 1:
    df["Value"] = df["Value"].str.split('-')
    df["Value"] = (df["Value"][0] + df["Value"][1]) / 2
elif df["Value"].str.len() == 1:
    df["Value"] = df["Value"].str.replace('-', 0)
Expected output:
Name    Values
First   7
Second  7
Third   0
Fourth  14

Let us split and expand the column, then cast the values to float and take the mean along the column axis:
s = df['Values'].str.split('-', expand=True)
df['Values'] = s[s != ''].astype(float).mean(1).fillna(0)
Name Values
0 First 7.0
1 Second 7.0
2 Third 0.0
3 Fourth 14.0
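To see why this works, the intermediate frame s looks roughly like this (sketched from the sample data): rows without a '-' are padded with None, the s != '' mask turns the empty strings from the bare '-' row into NaN, astype(float) converts the rest, mean(1) skips NaN, and fillna(0) catches the all-NaN row.
print(s)
#     0     1
# 0   5     9
# 1   7  None
# 2
# 3  12    16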

You can use str.replace with a customized replacement function:
mint = lambda s: int(s or 0)  # '' -> 0, for the bare '-' case
repl = lambda m: str(sum(map(mint, map(m.group, [1, 2]))) / 2)  # mean of the two captured numbers
df['Values'] = df['Values'].str.replace(r'(\d*)-(\d*)', repl, regex=True)
print(df)
Name Values
0 First 7.0
1 Second 7
2 Third 0.0
3 Fourth 14.0
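A quick check of how the pattern behaves on each format (illustrative only): '5-9' captures ('5', '9'); a bare '-' captures ('', ''), which mint turns into 0; and '7' contains no '-', so it never matches and is left untouched, which is why it stays '7' in the output while the replaced values become float strings.
import re

print(re.search(r'(\d*)-(\d*)', '5-9').groups())  # ('5', '9')
print(re.search(r'(\d*)-(\d*)', '-').groups())    # ('', '')
print(re.search(r'(\d*)-(\d*)', '7'))             # None -> no replacement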

Related

Replacing string with value calculated from the max of another column in a dataframe

I have a dataframe with an ID column that has dtype object (as it contains both ints and strings), so I am trying to use np.where to replace each string in turn with the next highest number. However, for some reason in the example below it's only replacing one of the two strings, and I have no idea why.
df = pd.DataFrame({'IDstr': ['480610_ABC_087', '78910_ABC_087', '4806105017087', '414149'],
                   'IDint': [0, 0, 4806105017087, 414149]})
print(df)
unique_str_IDs = df['IDstr'][df['IDstr'].str.contains("ABC", na=False)].unique()
for i in range(len(unique_str_IDs)):
    df['SKUintTEST'] = np.where(df['IDstr'] == unique_str_IDs[i].strip(),
                                df['SKUint_y'].max() + i + 1, df['SKUint_y'])
Has anyone got any ideas?
You can use map with a dictionary that assigns an incremental value to each unique ID, then fillna with the original values for the rows that were not mapped. (The loop in the question fails because each iteration rebuilds SKUintTEST from scratch out of SKUint_y, so only the replacement from the last iteration survives.)
df = pd.DataFrame({'IDstr': ['480610_ABC_087', '78910_ABC_087', '4806105017087', '414149'],
                   'IDint': [0, 0, 4806105017087, 414149],
                   'SKUint_y': range(10, 14)})
unique_str_IDs = df.loc[df['IDstr'].str.contains("ABC", na=False), 'IDstr'].unique()
df['SKUintTEST'] = df['IDstr'].map({idx: i for i, idx in enumerate(unique_str_IDs, df.SKUint_y.max() + 1)})\
                              .fillna(df.SKUint_y)
print(df)
print (df)
IDstr IDint SKUint_y SKUintTEST
0 480610_ABC_087 0 10 14.0
1 78910_ABC_087 0 11 15.0
2 4806105017087 4806105017087 12 12.0
3 414149 414149 13 13.0
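Note that SKUintTEST comes back as float because the unmapped rows pass through NaN before fillna; if integers are needed, a cast afterwards is safe (my addition, not part of the original answer):
df['SKUintTEST'] = df['SKUintTEST'].astype(int)  # no NaN remains after fillna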

Groupby when given the start positional index of each group

I have one series of values that I would like to group, and another series containing the starting positional index of each group after the first (the first group is understood to begin at positional index 0). The series of values can have an arbitrary index. Is there a way to use this to produce a groupby-aggregate? Ideally empty groups would be preserved. Example:
values = pd.Series(np.arange(10, 20), index=np.arange(110, 120))
group_indices = pd.Series([3, 3, 8])
The grouping should be such that the first group is values.iloc[:3], the second is values.iloc[3:3] (an empty group), the third is values.iloc[3:8], and the fourth is values.iloc[8:], so the aggregate values.groupby(group_indices).mean() would be pd.Series([11.0, NaN, 15.0, 18.5]).
Here is an easy way:
values.groupby(values.index.isin(group_indices).cumsum()).mean()
Out[454]:
1 11.0
2 15.0
3 18.5
dtype: float64
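Note this answer assumes values carries the default RangeIndex and that group_indices also marks the start of the first group (0), as in the next answer. With the arbitrary 110-119 index from the question, a positional sketch using np.searchsorted (my addition, assuming the boundaries are sorted as in the example) produces the expected output, empty group included:
import numpy as np

# label each position by how many boundaries lie at or before it
labels = np.searchsorted(group_indices.to_numpy(), np.arange(len(values)), side='right')
values.groupby(labels).mean().reindex(range(len(group_indices) + 1))
# 0    11.0
# 1     NaN
# 2    15.0
# 3    18.5
# dtype: float64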
Straightforwardly with the numpy.split routine:
In [1286]: values = pd.Series(np.arange(10, 20))
In [1287]: group_indices = pd.Series([0, 3, 8])
In [1288]: pd.Series([s.mean() for s in np.split(values, group_indices) if s.size])
Out[1288]:
0 11.0
1 15.0
2 18.5
dtype: float64
To account for the "empty" group, just remove the if s.size check:
In [1304]: group_indices = pd.Series([3, 3, 8])
In [1305]: pd.Series([s.mean() for s in np.split(values, group_indices)])
Out[1305]:
0 11.0
1 NaN
2 15.0
3 18.5
dtype: float64
Given your update, here's an odd way to do this with pd.merge_asof. Some care needs to be taken to deal with the first group, which runs from 0 to your first index in the Series.
import pandas as pd
import numpy as np
(pd.merge_asof(values.to_frame('val'),
               values.iloc[np.r_[group_indices]].reset_index().reset_index().drop(columns=0),
               left_index=True, right_on='index',
               direction='backward')
   .fillna({'level_0': -1})  # because your first group is 0: first index
   .groupby('level_0').val.mean()
   .reindex([-1] + [*range(len(group_indices))])  # get 0-size groups in the output
)
level_0
-1 11.0
0 NaN
1 15.0
2 18.5
Name: val, dtype: float64
Let's change group_indices a bit, so that the group names (1, 2, 3) are visible:
group_indices = pd.Series([1, 2, 3], index=[0, 3, 8])
Then
values.groupby(group_indices.reindex(values.index, method='ffill')).mean()
would give you what you want. (Like the answers above, this assumes values keeps the default RangeIndex; with the 110-119 index from the question, put those labels in group_indices' index instead.)
Note that group_indices.reindex(values.index,method='ffill') gives you
0 1
1 1
2 1
3 2
4 2
5 2
6 2
7 2
8 3
9 3
which assigns each row of values with a group number.
My solution involves keeping the inputs as they are and doing some ugly adjustments:
pd.DataFrame(values).assign(group=pd.cut(pd.DataFrame(values).index,
                                         [-1, 2, 7, np.inf], labels=[0, 1, 2])).groupby('group').mean()
Output
0
group
0 11.0
1 15.0
2 18.5
Thanks to all the answers, especially WeNYoBen's. The following will produce the correct groups and skip over empty groups.
# First, add the final index to `group_indices` so that
# we have a series of right endpoints, or interval upper bounds
upper_bounds = pd.concat([group_indices, pd.Series(values.shape[0])], ignore_index=True)  # Series.append is deprecated
# Compute indices of nonempty groups
lower_bounds = upper_bounds.shift(fill_value=0)
nonempty_group_idxs = upper_bounds != lower_bounds
# Get means indexed from 0 to n_nonempty_groups-1
means = values.groupby(pd.RangeIndex(values.shape[0]).isin(upper_bounds).cumsum()).mean()
# Reassign index for the correct groups
means.index = nonempty_group_idxs.index[nonempty_group_idxs]
This will have a non-contiguous index, with skipped elements corresponding to empty groups in the original groupby. If you want to place NaN in those spots, you can do
means = means.reindex(index=pd.RangeIndex(group_indices.shape[0] + 1))  # n boundaries give n + 1 groups
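With the sample data this yields the expected series:
print(means)
# 0    11.0
# 1     NaN
# 2    15.0
# 3    18.5
# dtype: float64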

Extracting Max value along each row from strings in column

I've a column of strings in a DataFrame which contains comma-separated numbers. I need to extract the maximum value along each row from the strings, considering only the values up to the 13th position from the beginning.
I've tried splitting the string on ',' as a separator with the expand option enabled to convert it into columns, then using the assign method of pandas to find the max value along each row.
sample_dt1 = sample_dt['pyt_hist'].str.split(',', expand=True).astype(float)
sample_dt = sample_dt.assign(max_value=sample_dt1.max(axis=1))
Sample Data:
index pyt_hist
0 0,0,0,0,0,0,0,0,0,0,0
1 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2 0,0,0,360,420,392,361,330,300,269,239,208,177
3 0,0,0,0,0,0,0,0,0,0,0,0,0,0,23,0,23,0,0,56,0
Expected Result:
index pyt_hist max_value
0 0,0,0,0,0,0,0,0,0,0,0 0
1 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 0
2 0,0,0,360,420,392,361,330,300,269,239,208,177 420
3 0,0,0,0,0,0,0,0,0,0,0,0,0,0,23,0,23,0,0,56,0 0
Results obtained using my code:
index pyt_hist max_value
0 0,0,0,0,0,0,0,0,0,0,0 0.0
1 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 0.0
2 0,0,0,360,420,392,361,330,300,269,239,208,177 420.0
3 0,0,0,0,0,0,0,0,0,0,0,0,0,0,23,0,23,0,0,56,0 56.0
You are very close: sample_dt1.iloc[:, :13] gives you the first 13 columns of sample_dt1, so you can do:
sample_dt = sample_dt.assign(max_value=sample_dt1.iloc[:,:13].max(axis=1))
Alternatively, split on ',' and take the max of the first 13 entries in plain Python (the if i guards against empty strings):
df.pyt_hist.str.split(',').apply(lambda x: max([int(i) for i in x[:13] if i]))
Output
0 0
1 0
2 420
3 0
Name: pyt_hist, dtype: int64

How to get columns index which meet some condition in pandas?

I have the following:
x = pd.DataFrame({'a':[1,5,5], 'b':[7,0,7]})
And for every row, I want to get the index of the first column whose value is greater than some threshold, let's say greater than 4.
In this example, the answer is 1 for the first row (the position of the value 7), 0 for the second row (the position of the value 5), and 0 for the third row (the position of the value 5 in column a).
Which means the answer is [1, 0, 0].
I tried it with the apply method:
def get_values_from_row(row, th=0.9):
    """Get the first column whose value is larger than a threshold.

    Args:
        row (pd.Series): a row of the frame.
        th (float): the threshold.

    Returns:
        The label of the first column whose value exceeds th.
    """
    return row[row > th].index.tolist()[0]
It works, but I have a large data set and it's quite slow. What is a better alternative?
I think you need first_valid_index with get_loc:
print (x[x > 4])
a b
0 NaN 7.0
1 5.0 NaN
2 7.0 5.0
print (x[x > 4].apply(lambda x: x.index.get_loc(x.first_valid_index()), axis=1))
0 1
1 0
2 0
dtype: int64
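For a large frame, a fully vectorized sketch with NumPy (my addition, not part of the original answer): argmax over the boolean mask returns the first True position in each row. One caveat: argmax also returns 0 for rows where nothing exceeds the threshold, so guard such rows with any if they can occur.
import numpy as np
import pandas as pd

x = pd.DataFrame({'a': [1, 5, 5], 'b': [7, 0, 7]})
mask = (x > 4).to_numpy()
pos = mask.argmax(axis=1)                  # first position where value > 4
pos = np.where(mask.any(axis=1), pos, -1)  # -1 flags rows with no match
print(pos)  # [1 0 0]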

Pandas substring

I have the following dataframe:
contract
0 WTX1518X22
1 WTX1518X20.5
2 WTX1518X19
3 WTX1518X15.5
I need to add a new column containing everything following the last 'X' from the first column. So the result would be:
contract result
0 WTX1518X22 22
1 WTX1518X20.5 20.5
2 WTX1518X19 19
3 WTX1518X15.5 15.5
So I figure I first need to find the string index position of the last 'X' (because there may be more than one 'X' in the string). Then get a substring containing everything following that index position for each row.
EDIT:
I have managed to get the index position of 'X' as required:
df['index_pos'] = df['contract'].str.rfind('X', start=0, end=None)
But I still can't seem to get a new column containing all characters following the 'X'. I am trying:
df['index_pos'] = df['index_pos'].convert_objects(convert_numeric=True)
df['result'] = df['contract'].str[df['index_pos']:]
But this just gives me an empty column called 'result'. This is strange because if I do the following then it works correctly:
df['result'] = df['contract'].str[8:]
So I just need a way to not hardcode '8' but to instead use the column 'index_pos'. Any suggestions?
Use vectorised str.split to split the string and cast the last split to float:
In [10]:
df['result'] = df['contract'].str.split('X').str[-1].astype(float)
df
Out[10]:
contract result
0 WTX1518X22 22.0
1 WTX1518X20.5 20.5
2 WTX1518X19 19.0
3 WTX1518X15.5 15.5
import pandas as pd
import re

df['result'] = df['contract'].map(lambda x: float(re.findall(r'([0-9\.]+)$', x)[0]))
Out[34]:
contract result
0 WTX1518X22 22.0
1 WTX1518X20.5 20.5
2 WTX1518X19 19.0
3 WTX1518X15.5 15.5
A similar approach to the one by EdChump, but using regular expressions; this one only assumes that the number is at the end of the string.
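For completeness, the rfind positions from the question can also be used directly. .str[] cannot take a per-row Series of slice bounds (which is why the 'result' column came out empty), but a row-wise apply can; a sketch, with + 1 to skip past the 'X' itself:
df['index_pos'] = df['contract'].str.rfind('X')  # position of the last 'X' in each string
df['result'] = df.apply(lambda r: float(r['contract'][r['index_pos'] + 1:]), axis=1)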
