Pandas DataFrame: Multiplying Two Columns - python

I am trying to multiply two columns (ActualSalary * FTE) within the dataframe (OPR) to create a new column (FTESalary), but somehow it stops producing values at row 21357, and I don't understand what went wrong or how to fix it. The two columns came from importing a CSV file with the line: OPR = pd.read_csv('OPR.csv', encoding='latin1')
[In] OPR
[Out]
ActualSalary    FTE
44600           1
58,000.00       1
70,000.00       1
17550           1
34693           1
15674           0.4
[In] OPR["FTESalary"] = OPR["ActualSalary"].str.replace(",", "").astype("float")*OPR["FTE"]
[In] OPR
[Out]
ActualSalary    FTE    FTESalary
44600           1      44600
58,000.00       1      58000
70,000.00       1      70000
17550           1      NaN
34693           1      NaN
15674           0.4    NaN
I am not expecting any null values in the output at all; I am really struggling with this and would really appreciate the help.
Many thanks in advance! (I am new to both coding and this site; please let me know via message if I have made mistakes or can improve the way I post questions here.)
Sharing the data, @oppresiveslayer:
[In] OPR[0:6].to_dict()
[Out]
{'ActualSalary': {0: '44600',
  1: '58,000.00',
  2: '70,000.00',
  3: '39,780.00',
  4: '0.00',
  5: '78,850.00'},
 'FTE': {0: 1.0, 1: 1.0, 2: 1.0, 3: 1.0, 4: 1.0, 5: 1.0}}
For more information on the two columns, @charlesreid1:
[In] OPR['ActualSalary'].astype
[Out]
Name: ActualSalary, Length: 21567, dtype: object>
[In] OPR['FTE'].astype
[Out]
Name: FTE, Length: 21567, dtype: float64>
The versions I am using:
Python 3.7.3, pandas 0.25.1, Jupyter Notebook 6.0.0

I believe that your ActualSalary column is a mix of strings and integers. That is the only way I've been able to recreate your error:
import pandas as pd

df = pd.DataFrame(
    {'ActualSalary': ['44600', '58,000.00', '70,000.00', 17550, 34693, 15674],
     'FTE': [1, 1, 1, 1, 1, 0.4]})
>>> df['ActualSalary'].str.replace(',', '').astype(float) * df['FTE']
0 44600.0
1 58000.0
2 70000.0
3 NaN
4 NaN
5 NaN
dtype: float64
The issue arises when you try to remove the commas:
>>> df['ActualSalary'].str.replace(',', '')
0 44600
1 58000.00
2 70000.00
3 NaN
4 NaN
5 NaN
Name: ActualSalary, dtype: object
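The .str accessor returns NaN for any element that is not actually a string, which is why the non-string rows come back as NaN. A quick way to confirm the mix (a sketch on made-up data mirroring the question):
import pandas as pd

df = pd.DataFrame({'ActualSalary': ['44600', '58,000.00', 17550, 34693]})
# Count the distinct Python types in the column; a mix of str and int
# confirms the diagnosis.
print(df['ActualSalary'].map(type).value_counts())
# <class 'str'>    2
# <class 'int'>    2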
First convert everything to strings, then strip the commas before converting to floats:
fte_salary = (
    df['ActualSalary'].astype(str).str.replace(',', '')  # Remove commas, e.g. '55,000.00' -> '55000.00'.
    .astype(float)   # Convert the string column to floats.
    .mul(df['FTE'])  # Multiply the salary by the Full-Time-Equivalent (FTE) column.
)
>>> df.assign(FTESalary=fte_salary) # Assign new column to dataframe.
ActualSalary FTE FTESalary
0 44600 1.0 44600.0
1 58,000.00 1.0 58000.0
2 70,000.00 1.0 70000.0
3 17550 1.0 17550.0
4 34693 1.0 34693.0
5 15674 0.4 6269.6

This should work (each value is passed through str() first, since the column mixes strings and integers):
OPR['FTESalary'] = OPR.apply(lambda x: pd.to_numeric(str(x['ActualSalary']).replace(',', ''), errors='coerce') * x['FTE'], axis=1)
Output:
ActualSalary FTE FTESalary
0 44600 1.0 44600.0
1 58,000.00 1.0 58000.0
2 70,000.00 1.0 70000.0
3 17550 1.0 17550.0
4 34693 1.0 34693.0
5 15674 0.4 6269.6
OK, I think you need to do this:
OPR['FTESalary'] = OPR.reset_index().apply(lambda x: pd.to_numeric(str(x['ActualSalary']).replace(',', ''), errors='coerce') * x['FTE'], axis=1).to_numpy().tolist()

I was able to do it in a couple of steps, but with a list comprehension, which might be less readable for a beginner. It makes an intermediate column that does the float conversion, passing each value through str() first, since your ActualSalary column mixes strings and numbers at the start.
OPR["X"] = [float(str(x).replace(",", "")) for x in OPR["ActualSalary"]]
OPR["FTESalary"] = OPR["X"] * OPR["FTE"]
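As a side note, the commas could also be handled at import time: read_csv accepts a thousands parameter, so the column should arrive as numeric from the start (a sketch, using the same OPR.csv and encoding as the question):
import pandas as pd

# ',' is treated as a thousands separator, so ActualSalary parses as float64
# and no string cleanup is needed before the multiplication.
OPR = pd.read_csv('OPR.csv', encoding='latin1', thousands=',')
OPR['FTESalary'] = OPR['ActualSalary'] * OPR['FTE']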

Related

How to fill an empty df with a FOR loop

I need to create a dataframe with two columns: a variable, and a function based on that variable. The following code produces an error:
test = pd.DataFrame({'Column_1': pd.Series([], dtype='int'),
                     'Column_2': pd.Series([], dtype='float')})
for i in range(1, 30):
    k = 0.5**i
    test.append(i, k)
print(test)
TypeError: cannot concatenate object of type '<class 'int'>'; only Series and DataFrame objs are valid
What do I need to fix here? It looks like the answer should be easy, but it's not easy to find...
Many thanks for your help
Is there a specific reason you are trying to use the loop? You can create the df with Column_1 and use pandas' vectorized operations to create Column_2:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(1, 30), columns=['Column_1'])
df['Column_2'] = 0.5**df['Column_1']
Column_1 Column_2
0 1 0.50000
1 2 0.25000
2 3 0.12500
3 4 0.06250
4 5 0.03125
I like Vaishali's way of approaching it. If you really want to use the for loop, this is how I would have done it:
import pandas as pd

test = pd.DataFrame({'Column_1': pd.Series([], dtype='int'),
                     'Column_2': pd.Series([], dtype='float')})
for i in range(1, 30):
    test = test.append({'Column_1': i, 'Column_2': 0.5**i}, ignore_index=True)
test = test.round(5)
print(test)
Column_1 Column_2
0 1.0 0.50000
1 2.0 0.25000
2 3.0 0.12500
3 4.0 0.06250
4 5.0 0.03125
5 6.0 0.01562
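Note that append copies the whole frame on every iteration, so for larger ranges it is usually better to collect the rows first and build the frame once; a minimal sketch of that pattern:
import pandas as pd

# Build a list of row dicts, then construct the DataFrame in one call.
rows = [{'Column_1': i, 'Column_2': round(0.5**i, 5)} for i in range(1, 30)]
test = pd.DataFrame(rows)
print(test.head())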

Create a new column in a dataframe and add 1 to the previous row of that column

I am looking to derive a new column from a current column in my dataframe, adding 1 to the previous row's value to keep a kind of running total:
df['Touch_No'] = np.where((df.Time_btween_steps.isnull()) | (df.Time_btween_steps > 30), 1, df.First_touch.shift().add(1))
I basically want to check if the column value is null: if it is, treat that row as the "first activity" and reset the counter; if not, add 1 to the "previous activity". This gives me a running total of the number of outreaches we are making to specific people:
Expected outcome:
Time Between Steps | Touch_No
Null               | 1
0                  | 2
5.4                | 3
6.7                | 4
2                  | 5
Null               | 1
1                  | 2
Answer using this combo of cumsum(), groupby(), and cumcount():
import numpy as np
import pandas as pd

df = pd.DataFrame(data=[None, 0, 5.4, 6.7, 2, None, 1], columns=['Time_btween_steps'])
# Mark rows where the counter should reset (null or gap > 30) with 0, others with 1.
df['Touch_No'] = np.where(df.Time_btween_steps.isnull() | (df.Time_btween_steps > 30), 0, 1)
# Split into groups at each reset and number the rows within each group.
df['consec'] = df['Touch_No'].groupby((df['Touch_No'] == 0).cumsum()).cumcount() + 1
df.head(10)
Edited according to your clarification (note the parentheses around the comparison; | binds tighter than >, so they are required):
df = pd.DataFrame({'Time_btween_steps': [None, 0, 5.4, 6.7, 2, None, 1],
                   'Touch_No': [50, 1, 2, 3, 4, 35, 1]})
mask = df['Time_btween_steps'].isna() | (df['Time_btween_steps'] > 30)
df.loc[~mask, 'Touch_No'] += 1
df.loc[mask, 'Touch_No'] = 1
Returns:
   Time_btween_steps  Touch_No
0                NaN         1
1                0.0         2
2                5.4         3
3                6.7         4
4                2.0         5
5                NaN         1
6                1.0         2
In my opinion a solution like this is much more readable. We increment by 1 where the condition is not met, and we set the rows where the condition is true to 1. You can combine these into a single line if you wish, as sketched below.
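For instance, the two assignments could collapse into one np.where call (a sketch, reusing the df and mask definitions from the snippet above in place of the two .loc lines):
import numpy as np

# Reset to 1 where the condition holds, otherwise increment the previous value.
df['Touch_No'] = np.where(mask, 1, df['Touch_No'] + 1)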
Old answer for posterity.
Here is a simple solution using pandas apply functionality which takes a function.
import pandas as pd
df = pd.DataFrame(data=[1,2,3,4,None,5,0],columns=['test'])
df.test.apply(lambda x: 0 if pd.isna(x) else x+1)
Which returns:
0 2.0
1 3.0
2 4.0
3 5.0
4 0.0
5 6.0
6 1.0
Here I wrote the function in place, but if you have more complicated logic, such as resetting when the number is something else, you can write a custom function and pass it in instead of the lambda. This is not the only way to do it, but if your data frame isn't huge (hundreds of thousands of rows), it should be performant. If you don't want a copy but want to overwrite the original column, assign the result back by prepending df['test'] = to the last line.
If you want the output to be ints, you can also do df['test'].astype(int), but be careful about converting None/Null to int.
Using np.where, index values with ffill for partitioning, and a simple rank:
import numpy as np
import pandas as pd

sodf = pd.DataFrame({'time_bw_steps': [None, 0, 5.4, 6.7, 2, None, 1]})
# Mark the start of each partition (a null step) with its index, then forward-fill.
sodf['touch_partition'] = np.where(sodf.time_bw_steps.isna(), sodf.index, np.NaN)
sodf['touch_partition'] = sodf['touch_partition'].fillna(method='ffill')
# Rank within each partition in order of appearance to get the running count.
sodf['touch_no'] = sodf.groupby('touch_partition')['touch_partition'].rank(method='first', ascending=True)
sodf.drop(columns=['touch_partition'], inplace=True)
sodf
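For comparison, the same split-and-number idea can be written more compactly with groupby and cumcount (a sketch; the reset condition mirrors the question's null-or-over-30 rule):
import pandas as pd

df = pd.DataFrame({'Time_btween_steps': [None, 0, 5.4, 6.7, 2, None, 1]})
# Each True marks the start of a new run; the cumulative sum labels the runs.
reset = df['Time_btween_steps'].isna() | (df['Time_btween_steps'] > 30)
df['Touch_No'] = df.groupby(reset.cumsum()).cumcount() + 1
print(df)  # Touch_No: 1, 2, 3, 4, 5, 1, 2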

Change number format in dataframe index

I would like to change the number format of the index of a dataframe.
From the screenshot below, Paper ID is all in e+07 format (scientific notation; I didn't know what to call this btw) and I would like to change it to a plain number such as 1147687 instead of 1.147687e+07.
Here's my dataframe:
You can convert your index values to int, but you must do it carefully, because you can lose some ids:
df = pd.DataFrame({'PaperId': [1000000000.0, 2.0, 3.0, 4.0],
                   'memberNum': [1, 2, 3, 4]})
df = df.set_index('PaperId')
df
              memberNum
PaperId
1.000000e+09          1
2.000000e+00          2
3.000000e+00          3
4.000000e+00          4
df['PaperId'] = df.index
df['PaperId'] = df['PaperId'].astype('int')
df = df.set_index('PaperId')
df
            memberNum
PaperId
1000000000          1
2                   2
3                   3
4                   4
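The round trip through a column is not strictly necessary; the index itself can usually be cast, with the same caveat about ids that do not fit exactly (a sketch on made-up values):
import pandas as pd

df = pd.DataFrame({'memberNum': [1, 2]}, index=[1.147687e+07, 2.0])
df.index = df.index.astype('int64')  # becomes 11476870, 2
df.index.name = 'PaperId'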

Groupby when given the start positional index of each group

I have one series of values that I would like to group, and another series containing the starting positional index of each group after the first (the first group is understood to begin at positional index 0). The series of values can have an arbitrary index. Is there a way to use this to produce a groupby-aggregate? Ideally empty groups would be preserved. Example:
values = pd.Series(np.arange(10, 20), index=np.arange(110, 120))
group_indices = pd.Series([3, 3, 8])
Now, values.groupby(group_indices) should be grouped so that the first group is values.iloc[:3], the second is values.iloc[3:3] (an empty group), the third is values.iloc[3:8], and the fourth is values.iloc[8:], and values.groupby(group_indices).mean() would be pd.Series([11.0, NaN, 15.0, 18.5]).
Here is an easy way, marking the boundary positions and cumulatively summing them to form group labels (positions rather than index values, since values carries an arbitrary index):
values.groupby(np.isin(np.arange(len(values)), group_indices).cumsum()).mean()
Out[454]:
0    11.0
1    15.0
2    18.5
dtype: float64
Note that the duplicate boundary at 3 produces an empty group, which this drops.
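To also surface that empty group, a searchsorted-based variant can label each position by how many boundaries lie at or before it, then reindex over all group ids (a sketch, not from the original answer):
import numpy as np
import pandas as pd

values = pd.Series(np.arange(10, 20), index=np.arange(110, 120))
group_indices = pd.Series([3, 3, 8])
# Label each position by the count of boundaries at or before it;
# the duplicate boundary skips a label, which reindex turns into NaN.
labels = group_indices.searchsorted(np.arange(len(values)), side='right')
result = values.groupby(labels).mean().reindex(range(len(group_indices) + 1))
print(result)  # 11.0, NaN, 15.0, 18.5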
Straightforwardly, with the numpy.split routine:
In [1286]: values = pd.Series(np.arange(10, 20))
In [1287]: group_indices = pd.Series([0, 3, 8])
In [1288]: pd.Series([s.mean() for s in np.split(values, group_indices) if s.size])
Out[1288]:
0 11.0
1 15.0
2 18.5
dtype: float64
To account for "empty" groups, just remove the if s.size check:
In [1304]: group_indices = pd.Series([3, 3, 8])
In [1305]: pd.Series([s.mean() for s in np.split(values, group_indices)])
Out[1305]:
0 11.0
1 NaN
2 15.0
3 18.5
dtype: float64
Given your update, here's an odd way to do this with pd.merge_asof. Some care needs to be taken to deal with the first group that's from 0 to your first index in the Series.
import pandas as pd
import numpy as np
(pd.merge_asof(values.to_frame('val'),
               values.iloc[np.r_[group_indices]].reset_index().reset_index().drop(columns=0),
               left_index=True, right_on='index',
               direction='backward')
   .fillna({'level_0': -1})  # Because your first group is 0: first index
   .groupby('level_0').val.mean()
   .reindex([-1] + [*range(len(group_indices))])  # Get 0-size groups in output
)
level_0
-1 11.0
0 NaN
1 15.0
2 18.5
Name: val, dtype: float64
Let's change group_indices a bit, so that the group names (1, 2, 3) are visible:
group_indices = pd.Series([1,2,3],index=[0, 3, 8])
then
values.groupby(group_indices.reindex(values.index, method='ffill')).mean()
would give you what you want.
Note that group_indices.reindex(values.index, method='ffill') gives you
0 1
1 1
2 1
3 2
4 2
5 2
6 2
7 2
8 3
9 3
which assigns a group number to each row of values.
My solution involves keeping the inputs as they are and doing some ugly adjustments:
pd.DataFrame(values).assign(group=pd.cut(pd.DataFrame(values).index,
                                         [-1, 2, 7, np.inf], labels=[0, 1, 2])).groupby('group').mean()
Output:
          0
group
0      11.0
1      15.0
2      18.5
Thanks to all the answers, especially WeNYoBen's. The following will produce the correct groups and skip over empty groups.
# First, add the final index to `group_indices` so that
# we have a series of right endpoints, or interval upper bounds
upper_bounds = group_indices.append(pd.Series(values.shape[0]), ignore_index=True)
# Compute indices of nonempty groups
lower_bounds = upper_bounds.shift(fill_value=0)
nonempty_group_idxs = upper_bounds != lower_bounds
# Get means indexed from 0 to n_nonempty_groups-1
means = values.groupby(pd.RangeIndex(values.shape[0]).isin(upper_bounds).cumsum()).mean()
# Reassign index for the correct groups
means.index = nonempty_group_idxs.index[nonempty_group_idxs]
This will have a noncontiguous index, with skipped elements corresponding to empty groups in the original groupby. If you want to place NaN in those spots, you can do
means = means.reindex(index=pd.RangeIndex(group_indices.shape[0] + 1))

Find mean of the grouped rows of pandas dataframe

I am at a very basic level of Python. Here I am stuck with a problem; can someone help me out?
I have a large pandas dataframe, and I want to find rows and take their mean if the first column of each row has a similar value (e.g. some integer separated by '_' from another integer).
I tried to use .split to match the first number; it works for a single row, but when I iterate over rows, it throws an error.
My data frame looks like:
d = {'ID': pd.Series(['1_1', '2_1', '1_2', '2_2'], index=['0', '1', '2', '3']),
     'one': pd.Series([2.5, 2, 3.5, 2.5], index=['0', '1', '2', '3']),
     'two': pd.Series([1, 2, 3, 4], index=['0', '1', '2', '3'])}
df2 = pd.DataFrame(d)
Requirement:
the mean of the rows which have the same ID at the first position after the split, e.g. the mean of 1_1 and 1_2, and of 2_1 and 2_2
output:
  ID   one  two
0  1  3.00    2
1  2  2.25    3
Here is my code.
Working version: ((df2.ix[0,0]).split('_'))[0]
Error version:
for i in df2.iterrows():
    df2[df2.columns[((df2.ix[0,0]).split('_'))[0] == ((df2.ix[0,0]).split('_'))[0]]]
Looking forward to a reply.
Thanks in advance.
You could create a new column containing only the first number of your ID column with [str methods](http://pandas.pydata.org/pandas-docs/stable/text.html#splitting-and-replacing-strings) and then use the `groupby` method:
df['groupedID'] = df.ID.str.split('_').str.get(0)
In [347]: df
Out[347]:
ID one two groupedID
0 10_1 2.5 1 10
1 2_1 2.0 2 2
2 10_2 3.5 3 10
3 2_2 2.5 4 2
df1 = df.groupby('groupedID').mean()
In [349]: df1
Out[349]:
one two
groupedID
10 3.00 2
2 2.25 3
If you need to change the name of the index back to 'ID':
df1.index.name = 'ID'
In [351]: df1
Out[351]:
one two
ID
10 3.00 2
2 2.25 3
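The helper column can also be skipped by grouping directly on the derived key (a sketch built on the question's df2):
import pandas as pd

df2 = pd.DataFrame({'ID': ['1_1', '2_1', '1_2', '2_2'],
                    'one': [2.5, 2, 3.5, 2.5],
                    'two': [1, 2, 3, 4]})
# Group on the first number of ID without storing it as a column.
key = df2['ID'].str.split('_').str.get(0).rename('ID')
print(df2.groupby(key)[['one', 'two']].mean())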