I am trying to understand how Pandas DataFrames work to copy information downward, and then reset until the next variable changes... Specifically below, how do I make Share_Amt_To_Buy reset to 0 once my Signal or Signal_Diff switches from 1 to 0?
Using .cumsum() on Share_Amt_To_Buy ends up carrying the values down and accumulating them, which is not exactly what I would like to do.
My goal is that when Signal changes from 0 to 1, the Share_Amt_To_Buy is calculated and copied until Signal switches back to 0. Then if Signal turns to 1 again, I want Share_Amt_To_Buy to be recalculated based on that point in time.
Hopefully this makes sense - please let me know.
Signal  Signal_Diff  Share_Amt_To_Buy (Correctly)  Share_Amt_To_Buy (Currently)
0       0            0                             0
0       0            0                             0
0       0            0                             0
1       1            100                           100
1       0            100                           100
1       0            100                           100
0       -1           0                             100
0       0            0                             100
1       1            180                           280
1       0            180                           280
As you can see, my signals alternate from 0 to 1, and this means the following:
0 = no trade (or position)
1 = trade (with a position)
Signal_Diff is calculated as follows
portfolio['Signal_Diff'] = portfolio['Signal'].diff().fillna(0.0)
The column 'Share_Amt_To_Buy' is calculated when Signal changes from 0 to 1. I have used the following as an example to calculate this:
initial_cap = 100000.0
portfolio['close'] = my stock's closing prices as a float
portfolio['Share_Amt'] = np.where(portfolio['Signal'] == 1.0, np.round(initial_cap / portfolio['close'] * 0.25 * portfolio['Signal']), 0.0).cumsum()
portfolio['Share_Amt_To_Buy'] = portfolio['Share_Amt'] * portfolio['Signal']
From what I understand, there is no built-in formula mechanism in pandas. You can apply formulas to columns, cells, and arrays and generate new arrays or values from them (df[column].count() is an example), and do plenty of work like that, but there is no method for dynamically updating a column based on another value in that column the way an Excel formula would.
You could always do the procedure iteratively and say:
>>> for index in df.index:
>>>     if df.loc[index, 'Signal_Diff'] == 0:
>>>         df.loc[index, 'Signal_Diff'] = some_value
>>>     elif df.loc[index, 'Signal_Diff'] == 1:
>>>         df.loc[index, 'Signal_Diff'] = some_other_value
Or you could create a custom function via the map tool:
https://stackoverflow.com/a/19226745/4131059
EDIT:
Another solution would be to query for all indexes with a value of 1 in the old array and the new array upon some change to the array:
>>> df_old_list = df[df.Signal_Diff == 1].index.tolist()
>>> ...
>>> df_new_list = df[df.Signal_Diff == 1].index.tolist()
>>>
>>> for x in df_old_list:
>>> if x in df_new_list:
>>> df_new_list.remove(x)
Then recalculate for only the indexes in df_new_list.
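For the reset-and-recalculate behaviour described in the question, a minimal vectorized sketch is shown below; it is separate from the iterative approach above, the close prices are made up, and the 0.25 sizing is copied from the question.
import numpy as np
import pandas as pd

initial_cap = 100000.0

# Hypothetical data; replace with the real portfolio frame
portfolio = pd.DataFrame({
    'Signal': [0, 0, 0, 1, 1, 1, 0, 0, 1, 1],
    'close':  [250.0, 250.0, 250.0, 250.0, 250.0,
               250.0, 140.0, 140.0, 140.0, 140.0],
})
portfolio['Signal_Diff'] = portfolio['Signal'].diff().fillna(0.0)

# Shares are computed only on the bars where Signal flips 0 -> 1
entry = portfolio['Signal_Diff'] == 1.0
shares_at_entry = np.round(initial_cap / portfolio['close'] * 0.25)

# Keep the entry value while the position stays open, reset to 0 when flat
portfolio['Share_Amt_To_Buy'] = (
    shares_at_entry.where(entry)  # NaN everywhere except at entries
    .ffill()                      # copy the entry value downward
    .fillna(0.0)
    * portfolio['Signal']         # zero out when Signal is 0
)
print(portfolio)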
Related
I'm new to Python.
I have a data frame (DF), for example:
id  type
1   A
1   B
2   C
2   B
I would like to add a column, e.g. A_flag, grouped by id.
In the end I want the data frame (DF) to look like:
id  type  A_flag
1   A     1
1   B     1
2   C     0
2   B     0
I can do this in two steps:
DF['A_flag_tmp'] = [1 if x.type=='A' else 0 for x in DF.itertuples()]
DF['A_flag'] = DF.groupby(['id'])['A_flag_tmp'].transform(np.max)
It works, but it's very slow for a big data frame.
Is there any way to optimize this case?
Thanks for the help.
Change your slow iterative code to fast vectorized code by replacing the first step with a boolean series generated by Pandas built-in functions, e.g.
df['type'].eq('A')
Then, you can attach it to the groupby statement for second step, as follows:
df['A_flag'] = df['type'].eq('A').groupby(df['id']).transform('max').astype(int)
Result
print(df)
id type A_flag
0 1 A 1
1 1 B 1
2 2 C 0
3 2 B 0
In general, if you have more complicated conditions, you can also define them in a vectorized way, e.g. define the boolean series m by:
m = df['type'].eq('A') & df['type1'].gt(1) | (df['type2'] != 0)
Then, use it in step 2 as follows:
df['A_flag'] = m.groupby(df['id']).transform('max').astype(int)
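A runnable illustration of that pattern is sketched below; the type1 and type2 columns and their values are made up just to exercise the condition.
import pandas as pd

# Hypothetical frame containing the extra columns used in the condition above
df = pd.DataFrame({'id':    [1, 1, 2, 2],
                   'type':  ['A', 'B', 'C', 'B'],
                   'type1': [2, 0, 0, 0],
                   'type2': [0, 0, 0, 0]})

m = df['type'].eq('A') & df['type1'].gt(1) | (df['type2'] != 0)
df['A_flag'] = m.groupby(df['id']).transform('max').astype(int)
print(df)
#    id type  type1  type2  A_flag
# 0   1    A      2      0       1
# 1   1    B      0      0       1
# 2   2    C      0      0       0
# 3   2    B      0      0       0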
I have a CSV file with several columns and I want to write code that reads a specific column called 'ARPU average 6 month w/t roaming and discount' and then creates a new column called "Logical" based on numpy.where(). Here is what I have at the moment:
csv_data = pd.read_csv("Results.csv")
data = csv_data[['ARPU average 6 month w/t roaming and discount']]
data = data.to_numpy()
sol = []
for target in data:
    if1 = np.where(data < 0, 1, 0)
    sol.append(if1)
csv_data["Logical"] = [sol].values
csv_data.to_csv ('Results2.csv', index = False, header=True)
This loop is made incorrectly and does not work: it does not create a new column with the corresponding value for each row. To make it clear: if the value in the column is bigger than 0, it should record "1", otherwise "0". The solution can use any approach (neither np.where() nor a loop is required).
In case you are wondering what "Results.csv" is: it is a big data file, and the column we work with is the one named above. The code needs to check whether the value in that column is bigger than 0 and write 1 or 0 to the new column (as described in the question).
updated answer
import pandas as pd
f1 = pd.read_csv("L1.csv")
f2 = pd.read_csv("L2.csv")
f3 = pd.merge(f1, f2, on='CTN', how='inner')
# f3.to_csv("Results.csv") # -> you do not need to save the file to a csv unless you really want to
# csv_data = pd.read_csv("Results.csv") # -> f3 is already saved in memory you do not need to read it again
# data = csv_data[['ARPU average 6 month w/t roaming and discount']] # -> you do not need this line
f3['Logical'] = (f3['ARPU average 6 month w/t roaming and discount']>0).astype(int)
f3.to_csv('Results2.csv', index = False, header=True)
original answer
Generally you do not need to use a loop when using pandas or numpy. Take this sample dataframe: df = pd.DataFrame([0,1,2,3,0,0,0,1], columns=['data'])
You can simply use the boolean values returned (where the column is greater than 0, return 1, else return 0) to create a new column.
df['new_col'] = (df['data'] > 0).astype(int)
data new_col
0 0 0
1 1 1
2 2 1
3 3 1
4 0 0
5 0 0
6 0 0
7 1 1
or if you want to use numpy:
df['new_col'] = np.where(df['data']>0, 1, 0)
data new_col
0 0 0
1 1 1
2 2 1
3 3 1
4 0 0
5 0 0
6 0 0
7 1 1
We have a large dataset that needs to be modified based on specific criteria.
Here is a sample of the data:
Input
BL.DB BL.KB MI.RO MI.RA MI.XZ MAY.BE
0 0 1 1 1 0 1
1 0 0 1 0 0 1
SampleData1 = pd.DataFrame([[0,1,1,1,0],[0,0,1,0,0]],
                           columns=['BL.DB',
                                    'BL.KB',
                                    'MI.RO',
                                    'MI.RA',
                                    'MI.XZ'])
The fields of this data are all formatted 'family.member', and a family may have any number of members. We need to remove all rows of the dataframe which have all 0's for any family.
Simply put, we want to only keep rows of the data that contain at least one member of every family.
We have no reproducible code for this problem because we are unsure of where to start.
We thought about using iterrows() but the documentation says:
#You should **never modify** something you are iterating over.
#This is not guaranteed to work in all cases. Depending on the
#data types, the iterator returns a copy and not a view, and writing
#to it will have no effect.
Other questions on S.O. do not quite solve our problem.
Here is what we want the SampleData to look like after we run it:
Expected output
BL.DB BL.KB MI.RO MI.RA MI.XZ MAY.BE
0 0 1 1 1 0 1
SampleData1 = pd.DataFrame([[0,1,1,1,0]],
                           columns=['BL.DB',
                                    'BL.KB',
                                    'MI.RO',
                                    'MI.RA',
                                    'MI.XZ'])
Also, could you please explain why we should not modify data we are iterating over, given that we seem to do that all the time with for loops, and what the correct way to modify a DataFrame is?
Thanks for the help in advance!
Start from copying df and reformatting its columns into a MultiIndex:
df2 = df.copy()
df2.columns = df.columns.str.split(r'\.', expand=True)
The result is:
BL MI
DB KB RO RA XZ
0 0 1 1 1 0
1 0 0 1 0 0
To generate "family totals", i.e. sums of elements in rows over the top
(0) level of column index, run:
df2.groupby(level=[0], axis=1).sum()
The result is:
BL MI
0 1 2
1 0 1
But actually we want to count zeroes in each row of the above table,
so extend the above code to:
(df2.groupby(level=[0], axis=1).sum() == 0).astype(int).sum(axis=1)
The result is:
0 0
1 1
dtype: int64
meaning:
row with index 0 has no "family zeroes",
row with index 1 has one such zero (for one family).
And to print what we are looking for, run:
df[(df2.groupby(level=[0], axis=1).sum() == 0)\
.astype(int).sum(axis=1) == 0]
i.e. print rows from df, with indices for which the count of
"family zeroes" in df2 is zero.
It's possible to group along axis=1. For each row, check that all families (grouped on the column name before '.') have at least one 1, then slice by this Boolean Series to retain these rows.
m = df.groupby(df.columns.str.split('.').str[0], axis=1).any(1).all(1)
df[m]
# BL.DB BL.KB MI.RO MI.RA MI.XZ MAY.BE
#0 0 1 1 1 0 1
As an illustration, here's what grouping along axis=1 looks like; it partitions the DataFrame by columns.
for idx, gp in df.groupby(df.columns.str.split('.').str[0], axis=1):
print(idx, gp, '\n')
#BL BL.DB BL.KB
#0 0 1
#1 0 0
#MAY MAY.BE
#0 1
#1 1
#MI MI.RO MI.RA MI.XZ
#0 1 1 0
#1 1 0 0
Now it's rather straightforward to find the rows where every one of these groups has at least one non-zero column, which is what chaining any and all along axis=1 does above.
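On newer pandas versions, groupby(..., axis=1) is deprecated; a rough equivalent of the mask above can be built by transposing first. This is only a sketch, using a made-up six-column frame shaped like the question's data.
import pandas as pd

df = pd.DataFrame([[0, 1, 1, 1, 0, 1], [0, 0, 1, 0, 0, 1]],
                  columns=['BL.DB', 'BL.KB', 'MI.RO', 'MI.RA', 'MI.XZ', 'MAY.BE'])

families = df.columns.str.split('.').str[0]
# group the transposed frame by family; each row must have at least one member per family
m = df.T.groupby(families).any().all()
print(df[m])
#    BL.DB  BL.KB  MI.RO  MI.RA  MI.XZ  MAY.BE
# 0      0      1      1      1      0       1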
You basically want to group on families and retain rows where there is one or more member for all families in the row.
One way to do this is to transpose the original dataframe and then split the index on the period, taking the first element which is the family identifier. The columns are the index values in the original dataframe.
We can then group on the families (level=0) and sum the number of members in each family for every record (df2.groupby(level=0).sum()). Now we retain the index values where every family has at least one member (.gt(0).all()). We create a mask from these values and apply it as a boolean index on the original dataframe to get the relevant rows.
df2 = SampleData1.T
df2.index = [idx.split('.')[0] for idx in df2.index]
# >>> df2
# 0 1
# BL 0 0
# BL 1 0
# MI 1 1
# MI 1 0
# MI 0 0
# >>> df2.groupby(level=0).sum()
# 0 1
# BL 1 0
# MI 2 1
mask = df2.groupby(level=0).sum().gt(0).all()
>>> SampleData1[mask]
BL.DB BL.KB MI.RO MI.RA MI.XZ
0 0 1 1 1 0
I'm trying to avoid for loops when applying a function on a per-row basis to a pandas df. I have looked at many vectorization examples but have not come across anything that works completely. Ultimately I am trying to add a column containing, for each row, the sum of the points awarded by each condition that is met, with a specified value per condition.
I have looked at np.apply_along_axis, but that's just a hidden loop, and at np.where, but I could not see it working for the 25 conditions that I am checking.
A B C ... R S T
0 0.279610 0.307119 0.553411 ... 0.897890 0.757151 0.735718
1 0.718537 0.974766 0.040607 ... 0.470836 0.103732 0.322093
2 0.222187 0.130348 0.894208 ... 0.480049 0.348090 0.844101
3 0.834743 0.473529 0.031600 ... 0.049258 0.594022 0.562006
4 0.087919 0.044066 0.936441 ... 0.259909 0.979909 0.403292
[5 rows x 20 columns]
def point_calc(row):
    points = 0
    if row[2] >= row[13]:
        points += 1
    if row[2] < 0:
        points -= 3
    if row[4] >= row[8]:
        points += 2
    if row[4] < row[12]:
        points += 1
    if row[16] == row[18]:
        points += 4
    return points

points_list = []
for indx, row in df.iterrows():
    value = point_calc(row)
    points_list.append(value)
df['points'] = points_list
This is obviously not efficient but I am not sure how I can vectorize my code since it requires the values per row for each column in the df to get a custom summation of the conditions.
Any help in pointing me in the right direction would be much appreciated.
Thank you.
UPDATE:
I was able to get a little more speed replacing the df.iterrows section with df.apply.
df['points'] = df.apply(lambda row: point_calc(row), axis=1)
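As a side note, the lambda isn't strictly needed here; passing the function directly to apply should behave the same:
df['points'] = df.apply(point_calc, axis=1)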
UPDATE2:
I updated the function as follows and substantially decreased the run time, with a 10x speed increase over df.apply with the initial function.
def point_calc(row):
    a1 = np.where(row[:, 2] >= row[:, 13], 1, 0)
    a2 = np.where(row[:, 2] < 0, -3, 0)
    a3 = np.where(row[:, 4] >= row[:, 8], 2, 0)
    # etc. for the remaining conditions
    all_points = a1 + a2 + a3  # + etc.
    return all_points

df['points'] = point_calc(df.to_numpy())
What I am still working on is using np.vectorize on the function itself to see if that can be improved upon as well.
You can try it in the following way:
# this is a small version of your dataframe
df = pd.DataFrame(np.random.random((10,4)), columns=list('ABCD'))
It looks like this:
A B C D
0 0.724198 0.444924 0.554168 0.368286
1 0.512431 0.633557 0.571369 0.812635
2 0.680520 0.666035 0.946170 0.652588
3 0.467660 0.277428 0.964336 0.751566
4 0.762783 0.685524 0.294148 0.515455
5 0.588832 0.276401 0.336392 0.997571
6 0.652105 0.072181 0.426501 0.755760
7 0.238815 0.620558 0.309208 0.427332
8 0.740555 0.566231 0.114300 0.353880
9 0.664978 0.711948 0.929396 0.014719
You can create a Series which counts your points and is initialized with zeros:
points = pd.Series(0, index=df.index)
It looks like this:
0 0
1 0
2 0
3 0
4 0
5 0
6 0
7 0
8 0
9 0
dtype: int64
Afterwards you can add and subtract values line by line if you want.
The condition within the brackets selects the rows where the condition is true,
so -= and += are only applied in those rows.
points.loc[df.A < df.C] += 1
points.loc[df.B < 0] -= 3
At the end you can extract the values of the series as a numpy array if you want (optional):
point_list = points.values
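Applied to the point_calc conditions from the question, the same pattern might look like the sketch below; a random 20-column frame stands in for the real data, and the .iloc positions mirror the row[n] lookups in the question.
import numpy as np
import pandas as pd

# Hypothetical stand-in for the question's 5 x 20 frame
df = pd.DataFrame(np.random.random((5, 20)), columns=list('ABCDEFGHIJKLMNOPQRST'))

points = pd.Series(0, index=df.index)
points.loc[df.iloc[:, 2] >= df.iloc[:, 13]] += 1
points.loc[df.iloc[:, 2] < 0] -= 3
points.loc[df.iloc[:, 4] >= df.iloc[:, 8]] += 2
points.loc[df.iloc[:, 4] < df.iloc[:, 12]] += 1
points.loc[df.iloc[:, 16] == df.iloc[:, 18]] += 4

df['points'] = points
print(df['points'])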
Does this solve your problem?
I have a pandas series, and a function that takes a value in the series and returns a dataframe. Is there a way to apply the function to the series and collate the results in a natural way?
What I am really trying to do is to use pandas series/multiindex to keep track of the results in each step of my data analysis pipeline, where the multiindex holds the parameters used to get the values. For example, the series (s below) is the result of step 0 in my data analysis pipeline. In step 1, I want to try x more dimensions (2 below, thus the dataframe) and collate the results into another series.
Can we do better than the approach below, where the stack() calls seem a bit excessive? Would the xarray library be a good fit for my use case?
In [112]: s
Out[112]:
a 0
b 1
c 2
dtype: int64
In [113]: d = s.apply(lambda x: pd.DataFrame([[x,x*2],[x*3,x*4]]).stack()).stack().stack()
In [114]: d
Out[114]:
a 0 0 0
1 0
1 0 0
1 0
b 0 0 1
1 3
1 0 2
1 4
c 0 0 2
1 6
1 0 4
1 8
dtype: int64
This should give you a Dataset of 2D arrays and align them for you. You may want to set the dimensions beforehand if you want them to be named a certain way / be a certain size.
xr.Dataset({k: func(v) for k, v in series.items()})
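A minimal runnable sketch of that idea, using the series and function from the question; each DataFrame is wrapped in a DataArray, and the default dim_0/dim_1 dimension names are an assumption you would likely rename.
import pandas as pd
import xarray as xr

s = pd.Series([0, 1, 2], index=['a', 'b', 'c'])

def func(x):
    # the question's example: one value -> a 2x2 DataFrame
    return pd.DataFrame([[x, x * 2], [x * 3, x * 4]])

# one 2D data variable per entry of the series
ds = xr.Dataset({k: xr.DataArray(func(v)) for k, v in s.items()})
print(ds)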