Problems multiplying dataframe cells by matrix values - python

Noob (trying to learn data science) here with a simple portfolio in a dataframe. I want to sell a certain number of shares of each company, multiply the number of shares sold by the price, and add the proceeds to the existing cash value (15000), rounding to 2 decimal places. Briefly:
new_port_df =
   Name  Price  Starting_Number_of_shares
0   MMM  10.00                         50
1   AXP  20.00                        100
2  AAPL  30.00                       1000
3  Cash   1.00                      15000
shares_sold = [[ 5.] [ 15.] [75.] [ 0.]] #(numpy.ndarray, shape (4,1))
new_port_df['Price'] =
0 10.00
1 20.00
2 30.00
3 1.00
Name: Low, dtype: float64 # pandas.core.series.Series
so basically Cash += 5 * 10 + 15 * 20 + 75 * 30 + 0 * 1 or 15000 + 2600 = 17600
As an intermediate step (after googling and reading other posts on here), I've tried:
cash_proceeds = np.dot(shares_sold, new_port['Price'])
which raises ValueError: shapes (4,1) and (4,) not aligned: 1 (dim 1) != 4 (dim 0). I think I should be reshaping, but haven't had any luck.
Desired result is below (all working except for the 17600 cell)
updated_port_df =
   Name  Price  Starting_Number_of_shares
0   MMM  10.00                         45
1   AXP  20.00                         85
2  AAPL  30.00                        925
3  Cash   1.00                      17600   # only the 17600 not working
Simple answers I can understand are preferred to complex ones I can't. Thanks for any help.

Rather than initializing shares_sold as a list of lists, i.e. [[],[],[]], you can just create a flat list of numbers to resolve your np.dot() error.
shares_sold = [5,15,75,0]
cash_proceeds = np.dot(new_port_df['Price'], shares_sold)
or, as Andy pointed out, if shares_sold is already initialized as a list of lists, you can convert it to an array, flatten it, and proceed from there. My answer won't address the change of approach that entails.
You can then change the last item in your shares_sold list/array to reflect the change in cash from the sale of stock (note that it is saved as a negative value because shares_sold will be subtracted from your Number_of_shares column):
shares_sold[3] = -cash_proceeds
Now you can subtract shares_sold from the Number_of_shares column to reflect the change (you indicate you want updated_port_df to hold this information, so first duplicate the initial portfolio and then make the change):
updated_port_df = new_port_df.copy()
updated_port_df['Number_of_shares'] = updated_port_df['Number_of_shares'] - shares_sold
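Putting the whole thing together, a minimal end-to-end sketch of this approach (mine, not part of the original answer; I'm assuming the column is actually named Starting_Number_of_shares as in the question's output, so swap in whatever your frame uses):

import numpy as np
import pandas as pd

new_port_df = pd.DataFrame({
    'Name': ['MMM', 'AXP', 'AAPL', 'Cash'],
    'Price': [10.00, 20.00, 30.00, 1.00],
    'Starting_Number_of_shares': [50, 100, 1000, 15000],
})

shares_sold = [5, 15, 75, 0]                                          # flat list, not a list of lists
cash_proceeds = round(np.dot(new_port_df['Price'], shares_sold), 2)   # 2600.0
shares_sold[3] = -cash_proceeds                                       # negative, so subtracting it adds the cash

updated_port_df = new_port_df.copy()
updated_port_df['Starting_Number_of_shares'] = updated_port_df['Starting_Number_of_shares'] - shares_sold
# the Cash row ends up at 15000 - (-2600) = 17600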

You may use pandas' dot instead of np.dot. Calling dot on a Series requires a 1-d numpy array, so you need to convert shares_sold to 1-d:
shares_sold = np.array([[ 5.], [ 15.], [75.] ,[ 0.]])
shares_sold_1d = shares_sold.flatten()
cash_proceeds = new_port_df['Price'].dot(shares_sold_1d)
In [226]: print(cash_proceeds)
2600.0
To get your desired output, simply use .loc assignment and subtraction:
new_port_df.loc[new_port_df.Name.eq('Cash'), 'Starting_Number_of_shares'] = (
    new_port_df.loc[new_port_df.Name.eq('Cash'), 'Starting_Number_of_shares']
    + cash_proceeds
)
new_port_df['Starting_Number_of_shares'] = new_port_df['Starting_Number_of_shares'] - shares_sold_1d
Out[235]:
Name Price Starting_Number_of_shares
0 MMM 10.0 45.0
1 AXP 20.0 85.0
2 AAPL 30.0 925.0
3 Cash 1.0 17600.0
Note: If you really want to use np.dot, you need to swap the order of the arguments, as follows
In [237]: np.dot(new_port_df['Price'], shares_sold)
Out[237]: array([2600.])
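If you would rather keep the original argument order (my note, not part of the answer), flattening the (4, 1) array first also resolves the shape mismatch from the question's error:
np.dot(shares_sold.flatten(), new_port_df['Price'])   # 2600.0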

Related

Apply multiple condition groupby + sort + sum to pandas dataframe rows

I have a dataframe that has the following columns:
Acct Num, Correspondence Date, Open Date
For each opened account, I am being asked to look back at all the correspondences that happened within
30 days of opendate of that account, then assigning points as following to the correspondences:
Forty-twenty-forty: Attribute 40% (0.4 points) of the attribution to the first touch,
40% to the last touch, and divide the remaining 20% between all touches in between
So I know the apply and groupby functions, but this is beyond my pay grade.
I have to group by account, with a condition based on comparing two columns against each other.
I have to do that to get a total number of correspondences, and I guess they have to be sorted as well, since the next step of assigning points to correspondences depends on the order in which they occurred.
I would like to do this efficiently, as I have a ton of rows. I know apply() can be fast, but I am pretty bad at using it when the row-level operation I am trying to do gets even a little complex.
I appreciate any help, as I am not good at pandas.
EDIT
as per request
Acct, ContactDate, OpenDate, Points (what I need to calculate)
123, 1/1/2018, 1/1/2021, 0 (because correspondence not within 30 days of open)
123, 12/10/2020, 1/1/2021, 0.4 (first touch gets 0.4)
123, 12/11/2020, 1/1/2021, 0.2 (other 'touches' get 0.2/(num of touches-2) 'points')
123, 12/12/2020, 1/1/2021, 0.4 (last touch gets 0.4)
456, 1/1/2018, 1/1/2021, 0 (again, because correspondence not within 30 days of open)
456, 12/10/2020, 1/1/2021, 0.4 (first touch gets 0.4)
456, 12/11/2020, 1/1/2021, 0.1 (other 'touches' get 0.2/(num of touches-2) 'points')
456, 12/11/2020, 1/1/2021, 0.1 (other 'touches' get 0.2/(num of touches-2) 'points')
456, 12/12/2020, 1/1/2021, 0.4 (last touch gets 0.4)
This returns a reduced dataframe in that it excludes timeframes exceeding 30 days, and then merges the original df into it to get all the data in one df. This assumes your date sorting is correct; otherwise, you may have to sort upfront before applying the function below.
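For reference, a minimal setup of the sample data might look like this (my sketch; the function below assumes ContactDate and OpenDate are real datetimes, hence the pd.to_datetime calls):

from datetime import timedelta
import pandas as pd

df = pd.DataFrame({
    'Acct': [123, 123, 123, 123, 456, 456, 456, 456, 456],
    'ContactDate': ['1/1/2018', '12/10/2020', '12/11/2020', '12/12/2020',
                    '1/1/2018', '12/10/2020', '12/11/2020', '12/11/2020', '12/12/2020'],
    'OpenDate': ['1/1/2021'] * 9,
})
df['ContactDate'] = pd.to_datetime(df['ContactDate'])
df['OpenDate'] = pd.to_datetime(df['OpenDate'])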
df['Points'] = 0  # add column to dataframe before analysis
# df.columns
# Index(['Acct', 'ContactDate', 'OpenDate', 'Points'], dtype='object')

def points(x):
    newx = x.loc[(x['OpenDate'] - x['ContactDate']) <= timedelta(days=30)]  # drop rows more than 30 days out
    # print(newx.Acct)
    if newx.Acct.count() > 2:  # check that more than two dates exist
        newx['Points'].iloc[0] = .4   # first row
        newx['Points'].iloc[-1] = .4  # last row
        newx['Points'].iloc[1:-1] = .2 / newx['Points'].iloc[1:-1].count()  # middle rows: split .2 by the count of those rows
        return newx
    elif newx.Acct.count() == 2:  # placeholder for later
        # edge-case logic here for two occurrences
        return newx
    elif newx.Acct.count() == 1:  # placeholder for later
        # edge-case logic here for one occurrence
        return newx

# groupby Acct, then clean up the indices so it can be merged back into the original df
dft = df.groupby('Acct', as_index=False).apply(points).reset_index().set_index('level_1').drop('level_0', axis=1)
# merge on index
df_points = df[['Acct', 'ContactDate', 'OpenDate']].merge(dft['Points'], how='left', left_index=True, right_index=True).fillna(0)
Output:
Acct ContactDate OpenDate Points
0 123 2018-01-01 2021-01-01 0.0
1 123 2020-12-10 2021-01-01 0.4
2 123 2020-12-11 2021-01-01 0.2
3 123 2020-12-12 2021-01-01 0.4
4 456 2018-01-01 2021-01-01 0.0
5 456 2020-12-10 2021-01-01 0.4
6 456 2020-12-11 2021-01-01 0.1
7 456 2020-12-11 2021-01-01 0.1
8 456 2020-12-12 2021-01-01 0.4
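Since you mention having a ton of rows, a vectorized alternative (my sketch, not part of the answer above, and again assuming rows are already sorted by ContactDate within each Acct) avoids apply entirely:

within = (df['OpenDate'] - df['ContactDate']) <= pd.Timedelta(days=30)
sub = df[within]
pos = sub.groupby('Acct').cumcount()                   # position of each touch within its account
n = sub.groupby('Acct')['Acct'].transform('size')      # number of in-window touches per account
points = pd.Series(0.0, index=sub.index)
points[pos == 0] = 0.4                                 # first touch
points[pos == n - 1] = 0.4                             # last touch
mid = (pos > 0) & (pos < n - 1)
points[mid] = 0.2 / (n[mid] - 2)                       # split the remaining 20% among the middle touches
df['Points'] = points.reindex(df.index, fill_value=0)

Accounts with only one or two in-window touches simply get 0.4 per touch here, so adjust those edge cases to whatever the placeholders above are meant to do.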

parsing through and editing csv with pandas

I'm trying to parse through all of the cells in a csv file that represent heights and round what's after the decimal to match a number in a list (to round down to the nearest inch). After a few days of banging my head against the wall, this is the code I've been able to get working:
import math
import pandas as pd
inch = [.0, .08, .16, .25, .33, .41, .50, .58, .66, .75, .83, .91, 1]
df = pd.read_csv("sample_csv.csv")
def to_number(s):
    for index, row in df.iterrows():
        try:
            num = float(s)
            num = math.modf(num)
            num = list(num)
            for i, j in enumerate(inch):
                if num[0] < j:
                    num[0] = inch[i-1]
                    break
                elif num[0] == j:
                    num[0] = inch[i]
                    break
            newnum = num[0] + num[1]
            return newnum
        except ValueError:
            return s

df = df.apply(lambda f: to_number(f[0]), axis=1).fillna('')
with open('new.csv', 'a') as f:
    df.to_csv(f, index=False)
Ideally I'd like to have it parse over an entire CSV with n headers, ignoring all strings and rounding the floats to match the list. Is there a simpler way to achieve this with Pandas? And would it be possible (or a good idea?) to have it edit the existing excel workbook instead of creating a new csv I'd have to copy/paste over?
Any help or suggestions would be greatly appreciated as I'm very new to Pandas and it's pretty god damn intimidating!
Helping would be a lot easier if you included a sample mock of the data you're trying to parse. To clarify the points you don't specify, here is how I understand it:
By "an entire CSV with n headers, ignoring all strings and round the floats to match the list" you mean some n-column dataframe with k numeric columns each of which describe someone's height in inches.
The entries in the numeric columns are measured in units of feet.
You want to ignore the non-numeric columns and transform the data as 6.14 -> 6 feet, 1 inches (I'm implicitly assuming that by "round down" you want an integer floor; i.e. 6.14 feet is 6 feet, 0.14*12 = 1.68 inches; it's up to you whether this is floored or rounded to the nearest integer).
Now for a subset of random heights measured in feet sampled uniformly over 5.1 feet and 6.9 feet, we could do the following:
In [1]: import numpy as np
In [2]: import pandas as pd
In [3]: df = pd.DataFrame(np.random.uniform(5.1, 6.9, size=(10,3)))
In [4]: df
Out[4]:
0 1 2
0 6.020613 6.315707 5.413499
1 5.942232 6.834540 6.761765
2 5.715405 6.162719 6.363224
3 6.416955 6.511843 5.512515
4 6.472462 5.789654 5.270047
5 6.370964 5.509568 6.113121
6 6.353790 6.466489 5.460961
7 6.526039 5.999284 6.617608
8 6.897215 6.016648 5.681619
9 6.886359 5.988068 5.575993
In [5]: np.fix(df) + np.floor(12*(df - np.fix(df)))/12
Out[5]:
0 1 2
0 6.000000 6.250000 5.333333
1 5.916667 6.833333 6.750000
2 5.666667 6.083333 6.333333
3 6.416667 6.500000 5.500000
4 6.416667 5.750000 5.250000
5 6.333333 5.500000 6.083333
6 6.333333 6.416667 5.416667
7 6.500000 5.916667 6.583333
8 6.833333 6.000000 5.666667
9 6.833333 5.916667 5.500000
We're using np.fix to extract the integral part of the height value. Likewise, df - np.fix(df) represents the fractional remainder in feet, or in inches when multiplied by 12. np.floor just truncates this to the nearest inch below, and the final division by 12 converts the unit of measurement back from inches to feet.
You can change np.floor to np.round to get an answer rounded to the nearest inch rather than truncated to the previous whole inch. Finally, you can specify the precision of the output to insist that the decimal portion is selected from your list.
In [6]: (np.fix(df) + np.round(12*(df - np.fix(df)))/12).round(2)
Out[6]:
0 1 2
0 6.58 5.25 6.33
1 5.17 6.42 5.67
2 6.42 5.83 6.33
3 5.92 5.67 6.33
4 6.83 5.25 6.58
5 5.83 5.50 6.92
6 6.83 6.58 6.25
7 5.83 5.33 6.50
8 5.25 6.00 6.83
9 6.42 5.33 5.08
Adding onto the other answer to address your problem with strings:
# Break the dataframe with a string
df = pd.DataFrame(np.random.uniform(5.1, 6.9, size=(10,3)))
df.loc[0, 0] = 'str'   # .ix is removed in modern pandas; .loc with the default integer labels does the same thing
# Find out which things can be cast to numerics and put NaNs everywhere else
df_safe = df.apply(pd.to_numeric, axis=0, errors="coerce")
df_safe = (np.fix(df_safe) + np.round(12*(df_safe - np.fix(df_safe)))/12).round(2)
# Replace all the NaNs with the original data
df_safe[df_safe.isnull()] = df[df_safe.isnull()]
df_safe should be what you want. Despite the name, this isn't particularly safe and there are probably edge conditions that will be a problem.
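If your non-numeric data lives in whole columns rather than scattered cells, a simpler route (my suggestion, not from either answer; np.trunc stands in for np.fix, which is equivalent for positive heights) is to restrict the transformation to the numeric columns up front:

num_cols = df.select_dtypes(include='number').columns
feet = np.trunc(df[num_cols])                                        # whole feet
df[num_cols] = (feet + np.round(12 * (df[num_cols] - feet)) / 12).round(2)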

Efficient operation over grouped dataframe Pandas

I have a very big Pandas dataframe where I need an ordering within groups based on another column. I know how to iterate over the groups, do an operation on each group, and union all those groups back into one dataframe; however, this is slow and I feel like there is a better way to achieve it. Here is the input and what I want out of it. Input:
ID price
1 100.00
1 80.00
1 90.00
2 40.00
2 40.00
2 50.00
Output:
ID price order
1 100.00 3
1 80.00 1
1 90.00 2
2 40.00 1
2 40.00 2 (could be 1, doesn't matter too much)
2 50.00 3
Since this is over about 5 million records with around 250,000 IDs, efficiency is important.
If speed is what you want, then the following should be pretty good, although it is a bit more complicated as it makes use of complex-number sorting in numpy. This is similar to the approach used (by me) when writing the aggregate-sort method in the numpy-groupies package.
# get global sort order, for sorting by ID then price
full_idx = np.argsort(df['ID'] + 1j*df['price'])
# get min of full_idx for each ID (note that there are multiple ways of doing this)
n_for_id = np.bincount(df['ID'])
first_of_idx = np.cumsum(n_for_id) - n_for_id
# subtract first_of_idx from full_idx
rank = np.empty(len(df), dtype=int)
rank[full_idx] = np.arange(len(df)) - first_of_idx[df['ID'][full_idx]]
df['rank'] = rank + 1
It takes 2s for 5m rows on my machine, which is about 100x faster than using groupby.rank from pandas (although I didn't actually run the pandas version with 5m rows because it would take too long; I'm not sure how @ayhan managed to do it in only 30s, perhaps a difference in pandas versions?).
If you do use this, then I recommend testing it thoroughly, as I have not.
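For example, a quick sanity check of my own (not from the answer) against groupby.rank on tie-free data, where method='first' is unambiguous, could look like this:

import numpy as np
import pandas as pd

# tie-free test data: every price is unique
df = pd.DataFrame({'ID': np.random.randint(0, 1000, 100_000),
                   'price': np.random.permutation(100_000).astype(float)})

full_idx = np.argsort(df['ID'] + 1j*df['price'])
n_for_id = np.bincount(df['ID'])
first_of_idx = np.cumsum(n_for_id) - n_for_id
rank = np.empty(len(df), dtype=int)
rank[full_idx] = np.arange(len(df)) - first_of_idx[df['ID'][full_idx]]
df['rank'] = rank + 1

expected = df.groupby('ID')['price'].rank(method='first').astype(int)
print((df['rank'] == expected).all())   # should print True when there are no ties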
You can use rank:
df["order"] = df.groupby("ID")["price"].rank(method="first")
df
Out[47]:
ID price order
0 1 100.0 3.0
1 1 80.0 1.0
2 1 90.0 2.0
3 2 40.0 1.0
4 2 40.0 2.0
5 2 50.0 3.0
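If you want integer ranks like the desired output shows (a small note of mine), cast the result:
df["order"] = df.groupby("ID")["price"].rank(method="first").astype(int)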
It takes about 30s on a dataset of 5m rows with 250,000 IDs (i5-3330):
df = pd.DataFrame({"price": np.random.rand(5000000), "ID": np.random.choice(np.arange(250000), size = 5000000)})
%time df["order"] = df.groupby("ID")["price"].rank(method="first")
Wall time: 36.3 s

scale numerical values for different groups in python

I want to scale the numerical values (similar to R's scale function) based on different groups.
Note: when I talk about scaling, I am referring to this metric:
(x-group_mean)/group_std
Dataset (to demonstrate the idea), for example:
advertiser_id value
10 11
10 22
10 2424
11 34
11 342342
.....
Desired results:
advertiser_id scaled_value
10 -0.58
10 -0.57
10 1.15
11 -0.707
11 0.707
.....
Referring to this link: implementing R scale function in pandas in Python? I used the scale function defined there and want to apply it in this fashion:
dt.groupby("advertiser_id").apply(scale)
but get an error:
ValueError: Shape of passed values is (2, 15770), indices imply (2, 23375)
In my original dataset the number of rows is 15770, but I don't think the scale function should be mapping a single value to multiple results in my case.
I would appreciate if you can give me some sample code or some suggestions into how to modify it, thanks!
First, np.std behaves differently than most other languages in that its delta degrees of freedom (ddof) defaults to 0. Therefore:
In [9]:
print df
advertiser_id value
0 10 11
1 10 22
2 10 2424
3 11 34
4 11 342342
In [10]:
print df.groupby('advertiser_id').transform(lambda x: (x-np.mean(x))/np.std(x, ddof=1))
value
0 -0.581303
1 -0.573389
2 1.154691
3 -0.707107
4 0.707107
This matches the R result.
Second, if any of your groups (by advertiser_id) happens to contain just one item, its std would be 0 and you will get NaN. Check whether you are getting NaN for this reason; R would return NaN in this case as well.
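For completeness, a minimal sketch of what the scale helper plus transform might look like (using the question's dt, advertiser_id and value names; pandas' Series.std already uses ddof=1, matching R):

def scale(x):
    return (x - x.mean()) / x.std(ddof=1)

dt['scaled_value'] = dt.groupby('advertiser_id')['value'].transform(scale)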

pandas DataFrame Dividing a column by itself

I have a pandas dataframe that I filled with this:
import pandas.io.data as web
test = web.get_data_yahoo('QQQ')
The dataframe looks like this in iPython:
In [13]: test
Out[13]:
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 729 entries, 2010-01-04 00:00:00 to 2012-11-23 00:00:00
Data columns:
Open 729 non-null values
High 729 non-null values
Low 729 non-null values
Close 729 non-null values
Volume 729 non-null values
Adj Close 729 non-null values
dtypes: float64(5), int64(1)
When I divide one column by another, I get a float64 result that has a satisfactory number of decimal places. I can even divide one column by another column offset by one, for instance test.Open[1:]/test.Close[:], and get a satisfactory number of decimal places. When I divide a column by an offset of itself, however, I get just 1:
In [83]: test.Open[1:] / test.Close[:]
Out[83]:
Date
2010-01-04 NaN
2010-01-05 0.999354
2010-01-06 1.005635
2010-01-07 1.000866
2010-01-08 0.989689
2010-01-11 1.005393
...
In [84]: test.Open[1:] / test.Open[:]
Out[84]:
Date
2010-01-04 NaN
2010-01-05 1
2010-01-06 1
2010-01-07 1
2010-01-08 1
2010-01-11 1
I'm probably missing something simple. What do I need to do in order to get a useful value out of that sort of calculation? Thanks in advance for the assistance.
If you're looking to do operations between the column and lagged values, you should be doing something like test.Open / test.Open.shift().
shift realigns the data and takes an optional number of periods.
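For instance, a small sketch using the question's test frame:
prev_open_ratio = test.Open / test.Open.shift(1)      # each open divided by the previous row's open
prev_close_ratio = test.Open / test.Close.shift(1)    # each open divided by the previous row's close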
You may not be getting what you think you are when you do test.Open[1:]/test.Close. Pandas matches up the rows based on their index, so you're still getting each element of one column divided by its corresponding element in the other column (not the element one row back). Here's an example:
>>> print d
A B C
0 1 3 7
1 -2 1 6
2 8 6 9
3 1 -5 11
4 -4 -2 0
>>> d.A / d.B
0 0.333333
1 -2.000000
2 1.333333
3 -0.200000
4 2.000000
>>> d.A[1:] / d.B
0 NaN
1 -2.000000
2 1.333333
3 -0.200000
4 2.000000
Notice that the values returned are the same for both operations. The second one just has nan for the first one, since there was no corresponding value in the first operand.
If you really want to operate on offset rows, you'll need to dig down to the numpy arrays that underpin the pandas DataFrame, to bypass pandas's index-aligning features. You can get at these innards with the values attribute of a column.
>>> d.A.values[1:] / d.B.values[:-1]
array([-0.66666667, 8. , 0.16666667, 0.8 ])
Now you really are getting each value divided by the one before it in the other column. Note that here you have to explicitly slice the second operand to leave off the last element, to make them equal in length.
So you can do the same to divide a column by an offset version of itself:
>>> d.A.values[1:] / d.A.values[:-1]
array([-2.   , -4.   ,  0.125, -4.   ])
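As a side note of mine, the shift approach from the first answer gives the same numbers while keeping the result index-aligned with the frame:
>>> d.A / d.A.shift()
0         NaN
1   -2.000000
2   -4.000000
3    0.125000
4   -4.000000
Name: A, dtype: float64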
