I have a dataframe with four columns.
Normally I should have conso_HC = index_fin_HC - index_debut_HC, but as you can see that is not always the case: the subtraction does not always equal conso_HC. The problem is that to recover conso_HC you sometimes need to add 100000 to either index_fin_HC or index_debut_HC.
x = fichier['index_fin_HC'] - fichier['index_debut_HC']
y = fichier['conso_HC']

def conditions(x, y):
    if x + 100000 == y:
        return x
    elif x == y + 100000:
        return y

fichier['test'] = conditions(x, y)
It is easy to build a new Series that tests for the condition:
>>> (df.soustraction + 100000 == df.conso) | (df.soustraction - 100000 == df.conso)
0 False
1 False
2 False
3 False
4 True
5 True
6 False
dtype: bool
You can then:
add it to the dataframe as a new column:
df['condition'] = (df.soustraction + 100000 == df.conso) | (df.soustraction - 100000 == df.conso)
select rows in the dataframe matching the condition
>>> df[(df.soustraction + 100000 == df.conso) | (df.soustraction - 100000 == df.conso)]
debut fin conso soustraction
4 99193.0 526.0 1333.0 -98667.0
5 91833.0 6407.0 14574.0 -85426.0
select rows in the dataframe not matching the condition
>>> df[~ ((df.soustraction + 100000 == df.conso) | (df.soustraction - 100000 == df.conso))]
debut fin conso soustraction
0 34390.0 414.0 452.0 -33976.0
1 18117.0 85.0 216.0 -18032.0
2 37588.0 234.0 8468.0 -37354.0
3 49060.0 53.0 1399.0 -49007.0
6 38398.0 1594.0 1994.0 -36804.0
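Going one step further, here is a vectorized version of the conditions function from the question (a minimal sketch, not part of the original answer, assuming the soustraction and conso column names used above); rows matching neither rule become NaN:
import numpy as np

x = df.soustraction   # index_fin_HC - index_debut_HC
y = df.conso
df['test'] = np.where(x + 100000 == y, x,
             np.where(x == y + 100000, y, np.nan))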
Business Problem: For each row in a Pandas data frame where a condition is true, set a value in a column. When successive rows meet the condition, increase the value by one. The end goal is to create a column containing integers (e.g., 1, 2, 3, 4, ..., n) on which a pivot table can be made. As a side note, there will be a second index on which the pivot will be made.
Below is my attempt, but I'm new to using Pandas.
sales_data_cleansed_2.loc[sales_data_cleansed_2['Duplicate'] == 'FALSE', 'sales_index'] = 1
j = 2

# loop through whether a duplicate exists
for i in range(0, len(sales_data_cleansed_2)):
    while sales_data_cleansed_2.loc[i, 'Duplicate'] == 'TRUE':
        sales_data_cleansed_2.loc[i, 'sales_index'] = j
        j = j + 1
        break
    j = 2
You can try:
import pandas as pd
import numpy as np

# sample DataFrame
df = pd.DataFrame(np.random.randint(0, 2, 15).astype(str), columns=["Duplicate"])
df = df.replace({'1': 'TRUE', '0': 'FALSE'})

df['sales_index'] = ((df['Duplicate'] == 'TRUE')
                     .groupby((df['Duplicate'] != 'TRUE')
                     .cumsum()).cumsum() + 1)
print(df)
This gives:
Duplicate sales_index
0 FALSE 1
1 FALSE 1
2 TRUE 2
3 TRUE 3
4 TRUE 4
5 TRUE 5
6 TRUE 6
7 TRUE 7
8 TRUE 8
9 FALSE 1
10 FALSE 1
11 TRUE 2
12 TRUE 3
13 TRUE 4
14 FALSE 1
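To see why this works, the one-liner can be unpacked into its intermediate steps (an equivalent sketch, using the same df):
is_dup = df['Duplicate'] == 'TRUE'                 # True for rows that continue a duplicate run
group_key = (df['Duplicate'] != 'TRUE').cumsum()   # increments at every FALSE row, so each FALSE row
                                                   # and the TRUE rows after it share one key
df['sales_index'] = is_dup.groupby(group_key).cumsum() + 1   # running count within each run, starting at 1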
So I've narrowed a previous problem down to this: I have a DataFrame that looks like this
id temp1 temp2
9 10.0 True False
10 10.0 True False
11 10.0 False True
12 10.0 False True
17 15.0 True False
18 15.0 True False
19 15.0 True False
20 15.0 True False
21 15.0 False False
33 27.0 True False
34 27.0 True False
35 27.0 False True
36 27.0 False False
40 31.0 True False
41 31.0 False True
.
.
.
and in reality, it's a few million lines long (and has a few other columns).
What I have it currently doing is
grouped = coinc.groupby('id')
final = grouped.filter(lambda x: ( x['temp2'].any() and x['temp1'].any()))
lanif = final.drop(['temp1','temp2'],axis = 1 )
(coinc is the name of the dataframe)
which only keeps rows (grouped by id) if there is a True in both temp1 and temp2 for some rows with the same id. For example, with the above dataframe, it would get rid of the rows with id 15, but keep everything else.
This, however, is deathly slow and I was wondering if there was a faster way to do this.
Using filter with a lambda function here is slowing you down a lot. You can speed things up by removing that.
u = coinc.groupby('id')
m = u.temp1.any() & u.temp2.any()
res = coinc.loc[coinc.id.isin(m[m].index), ['id']]
Comparing this to your approach on a larger frame.
a = np.random.randint(1, 1000, 100_000)
b = np.random.randint(0, 2, 100_000, dtype=bool)
c = ~b
coinc = pd.DataFrame({'id': a, 'temp1': b, 'temp2': c})
In [295]: %%timeit
...: u = coinc.groupby('id')
...: m = u.temp1.any() & u.temp2.any()
...: res = coinc.loc[coinc.id.isin(m[m].index), ['id']]
...:
13.5 ms ± 476 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [296]: %%timeit
...: grouped = coinc.groupby('id')
...: final = grouped.filter(lambda x: ( x['temp2'].any() and x['temp1'].any()))
...: lanif = final.drop(['temp1','temp2'],axis = 1 )
...:
527 ms ± 7.91 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
np.array_equal(res.values, lanif.values)
True
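Another option, shown below, is a NumPy-based approach: pd.factorize maps each id to a group position, and np.logical_or.at accumulates the temp1/temp2 flags per id without calling groupby at all.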
i, u = pd.factorize(coinc.id)             # i: group position per row, u: unique ids
t = np.zeros((len(u), 2), bool)           # per-id flags for temp1 / temp2
c = np.column_stack([coinc.temp1.to_numpy(), coinc.temp2.to_numpy()])
np.logical_or.at(t, i, c)                 # OR-accumulate each row's flags into its id
final = coinc.loc[t.all(1)[i], ['id']]    # keep rows whose id has a True in both columns
final
id
9 10.0
10 10.0
11 10.0
12 10.0
33 27.0
34 27.0
35 27.0
36 27.0
40 31.0
41 31.0
The problem isn't the groupby, it's the lambda. Lambda operations are not vectorized*. You can get the same result faster using agg. I'd do:
groupdf = coinc.groupby('id').agg(any)
# select ids where both temp1 and temp2 contain at least one True
mask = groupdf[['temp1', 'temp2']].all(axis=1)
lanif = groupdf[mask].drop(['temp1', 'temp2'], axis=1)
*This is a pretty nuanced issue that I'm waaaay oversimplifying, sorry.
Here is another alternative solution
f = coinc.groupby('id').transform('any')
result = coinc.loc[f['temp1'] & f['temp2'], coinc.columns.drop(['temp1', 'temp2'])]
My dataframe looks like this:
s gamma_star
0 0.000000 0.6261
1 0.000523 0.6262
2 0.000722 0.6263
3 0.000861 0.6267
4 0.000972 0.6269
5 0.001061 0.6260
6 0.001147 0.6263
7 0.001218 0.6261
I have a value s = 0.000871, and I need to look up the gamma_star that belongs to this s. In the example above, s falls between row 3 (s = 0.000861) and row 4 (s = 0.000972), so its related gamma_star is 0.6267 (the value in row 3). I am a bit stuck and do not know how to start, any idea? Thanks!
You could do:
df.loc[(df.s > s).idxmax()-1, 'gamma_star']
# 0.6267
Here the condition indicates the point from which the given s is exceeded:
(df.s > s)
0 False
1 False
2 False
3 False
4 True
5 True
6 True
7 True
Name: s, dtype: bool
and by taking the idxmax() we can find the beginning of the interval:
(df.s > s).idxmax()-1
# 3
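An alternative sketch (not from the original answer), assuming df.s is sorted ascending as in the example, is to use searchsorted to find the insertion point and step back one row:
pos = df.s.searchsorted(s) - 1   # position just before where s would be inserted
df.gamma_star.iloc[pos]
# 0.6267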
You could do it with a loop, if you go through the s-values.
for i in range(len(s) - 1):
    if given_value > s[i] and given_value < s[i + 1]:
        gamma_s_wanted = gamma_star[i]
        break
else:
    # no break: given_value lies beyond the last s value
    gamma_s_wanted = gamma_star[-1]
First Part
I have a dataframe with finance data (33023 rows, here the link to the
data: https://mab.to/Ssy3TelRs); df.open is the price of the title and
df.close is the closing price.
I have been trying to see how many times in a row the title closed
with a gain and with a loss.
The result that I'm looking for should tell me that the title was
positive 2 days in a row x times, 3 days in a row y times, 4 days in a
row z times and so forth.
I have started with a for loop:
for x in range(1, df.close.count()):
    y = df.close[x] - df.open[x]
and then an unsuccessful series of if statements...
Thank you for your help.
CronosVirus00
EDITS:
>>> df.head(7)
data ora open max min close Unnamed: 6
0 20160801 0 1.11781 1.11781 1.11772 1.11773 0
1 20160801 100 1.11774 1.11779 1.11773 1.11777 0
2 20160801 200 1.11779 1.11800 1.11779 1.11795 0
3 20160801 300 1.11794 1.11801 1.11771 1.11771 0
4 20160801 400 1.11766 1.11772 1.11763 1.11772 0
5 20160801 500 1.11774 1.11798 1.11774 1.11796 0
6 20160801 600 1.11796 1.11796 1.11783 1.11783 0
Ifs:
for x in range(1, df.close.count()):
    y = df.close[x] - df.open[x]
    if y > 0:
        green += 1
        y = df.close[x+1] - df.close[x+1]
        twotimes += 1
        if y > 0:
            green += 1
            y = df.close[x+2] - df.close[x+2]
            threetimes += 1
            if y > 0:
                green += 1
                y = df.close[x+3] - df.close[x+3]
                fourtimes += 1
FINAL SOLUTION
Thank you all! In the end I did this:
df['test'] = df.close - df.open > 0
green = df.test  # days that it was positive

def gg(z):
    tot = green.count()
    giorni = range(1, z + 1)  # days in a row I want to check
    for x in giorni:
        y = (green.rolling(x).sum() > x - 1).sum()
        print(x, " ", y, " ", round((y / tot) * 100, 1), "%")

gg(5)
1 14850 45.0 %
2 6647 20.1 %
3 2980 9.0 %
4 1346 4.1 %
5 607 1.8 %
If I understood your question correctly, you can do it this way:
In [76]: df.groupby((df.close.diff() < 0).cumsum()).cumcount()
Out[76]:
0 0
1 1
2 2
3 0
4 1
5 2
6 0
7 0
dtype: int64
The result that I'm looking for should tell me that the title was
positive 2 days in a row x times, 3 days in a row y times, 4 days in a
row z times and so forth.
In [114]: df.groupby((df.close.diff() < 0).cumsum()).cumcount().value_counts().to_frame('count')
Out[114]:
count
0 4
2 2
1 2
Data set:
In [78]: df
Out[78]:
data ora open max min close
0 20160801 0 1.11781 1.11781 1.11772 1.11773
1 20160801 100 1.11774 1.11779 1.11773 1.11777
2 20160801 200 1.11779 1.11800 1.11779 1.11795
3 20160801 300 1.11794 1.11801 1.11771 1.11771
4 20160801 400 1.11766 1.11772 1.11763 1.11772
5 20160801 500 1.11774 1.11798 1.11774 1.11796
6 20160801 600 1.11796 1.11796 1.11783 1.11783
7 20160801 700 1.11783 1.11799 1.11783 1.11780
In [80]: df.close.diff()
Out[80]:
0 NaN
1 0.00004
2 0.00018
3 -0.00024
4 0.00001
5 0.00024
6 -0.00013
7 -0.00003
Name: close, dtype: float64
It sounds like what you want to do is:
compute the difference of the two series (close & open), e.g. diff = df.close - df.open
apply a condition to the result to get a boolean series diff > 0
pass the resulting boolean series to the DataFrame to get a subset of the DataFrame where the condition is true df[diff > 0]
Find all contiguous subsequences by applying a column-wise function to identify and count them (see the sketch below)
I need to board a plane, but I will provide a sample of what the last step looks like when I can.
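In the meantime, here is a minimal sketch of what that last step could look like (an assumption about the intended approach, not the author's code): label each run of consecutive gains, then count how often each run length occurs.
gain = (df.close - df.open) > 0
run_id = (gain != gain.shift()).cumsum()      # new id whenever the gain/loss state flips
run_lengths = gain.groupby(run_id).sum()      # length of each run (0 for runs of losses)
counts = run_lengths[run_lengths > 0].value_counts().sort_index()
# counts[k] = number of times the title closed positive k days in a row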
If I understood you correctly, you want the number of days for which that day and the n-1 days before it were all positive.
Similarly to what @Thang suggested, you can use rolling:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.rand(10, 2), columns=["open", "close"])
# This just sets up random test data, for example:
# open close
# 0 0.997986 0.594789
# 1 0.052712 0.401275
# 2 0.895179 0.842259
# 3 0.747268 0.919169
# 4 0.113408 0.253440
# 5 0.199062 0.399003
# 6 0.436424 0.514781
# 7 0.180154 0.235816
# 8 0.750042 0.558278
# 9 0.840404 0.139869
positiveDays = df["close"]-df["open"] > 0
# This will give you a series that is True for positive days:
# 0 False
# 1 True
# 2 False
# 3 True
# 4 True
# 5 True
# 6 True
# 7 True
# 8 False
# 9 False
# dtype: bool
daysToCheck = 3
positiveDays.rolling(daysToCheck).sum()>daysToCheck-1
This will now give you a series, indicating for every day, whether it has been positive for daysToCheck number of days in a row:
0 False
1 False
2 False
3 False
4 False
5 True
6 True
7 True
8 False
9 False
dtype: bool
Now you can use (positiveDays.rolling(daysToCheck).sum()>daysToCheck-1).sum() to get the number of days (in the example 3) that obey this, which is what you want, as far as I understand.
This should work:
import pandas as pd
import numpy as np
test = pd.DataFrame(np.random.randn(100,2), columns = ['open','close'])
test['gain?'] = (test['open']-test['close'] < 0)
test['cumulative'] = 0
for i in test.index[1:]:
    if test['gain?'][i]:
        test['cumulative'][i] = test['cumulative'][i-1] + 1
        test['cumulative'][i-1] = 0
results = test['cumulative'].value_counts()
Ignore the '0' row in the results. It can be modified without too much trouble if you want to, e.g., count both days in a run-of-two as runs-of-one as well.
Edit: without the warnings -
import pandas as pd
import numpy as np
test = pd.DataFrame(np.random.randn(100,2), columns = ['open','close'])
test['gain?'] = (test['open']-test['close'] < 0)
test['cumulative'] = 0
for i in test.index[1:]:
    if test['gain?'][i]:
        test.loc[i, 'cumulative'] = test.loc[i-1, 'cumulative'] + 1
        test.loc[i-1, 'cumulative'] = 0
results = test['cumulative'].value_counts()
I have a DataFrame which looks like below. I am trying to count the number of elements less than 2.0 in each column, then I will visualize the result in a bar plot. I did it using lists and loops, but I wonder if there is a "Pandas way" to do this quickly.
x = []
for i in range(6):
    x.append(df[df.ix[:, i] < 2.0].count()[i])
then I can get a bar plot using list x.
A B C D E F
0 2.142 1.929 1.674 1.547 3.395 2.382
1 2.077 1.871 1.614 1.491 3.110 2.288
2 2.098 1.889 1.610 1.487 3.020 2.262
3 1.990 1.760 1.479 1.366 2.496 2.128
4 1.935 1.765 1.656 1.530 2.786 2.433
In [96]:
df = pd.DataFrame({'a': np.random.randn(10), 'b': np.random.randn(10), 'c': np.random.randn(10)})
df
Out[96]:
a b c
0 -0.849903 0.944912 1.285790
1 -1.038706 1.445381 0.251002
2 0.683135 -0.539052 -0.622439
3 -1.224699 -0.358541 1.361618
4 -0.087021 0.041524 0.151286
5 -0.114031 -0.201018 -0.030050
6 0.001891 1.601687 -0.040442
7 0.024954 -1.839793 0.917328
8 -1.480281 0.079342 -0.405370
9 0.167295 -1.723555 -0.033937
[10 rows x 3 columns]
In [97]:
df[df > 1.0].count()
Out[97]:
a 0
b 2
c 2
dtype: int64
So in your case:
df[df < 2.0].count()
should work
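Since the end goal is a bar plot, the per-column counts can be plotted directly (a small sketch, assuming matplotlib is installed):
import matplotlib.pyplot as plt

counts = (df < 2.0).sum()   # count of values below 2.0 in each column
counts.plot(kind='bar')
plt.show()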
EDIT
some timings
In [3]:
%timeit df[df < 1.0 ].count()
%timeit (df < 1.0).sum()
%timeit (df < 1.0).apply(np.count_nonzero)
1000 loops, best of 3: 1.47 ms per loop
1000 loops, best of 3: 560 us per loop
1000 loops, best of 3: 529 us per loop
So @DSM's suggestions are correct and much faster than my suggestion.
Method chaining is also possible (each comparison operator has a corresponding method, e.g. < is lt() and <= is le()):
df.lt(2).sum()
If you have multiple conditions to consider, e.g. counting the number of values between 2 and 10, you can combine two boolean Series with boolean operators:
(df.gt(2) & df.lt(10)).sum()
or you can use pd.eval():
pd.eval("2 < df < 10").sum()
Count the number of values less than 2 or greater than 10:
(df.lt(2) | df.gt(10)).sum()
# or
pd.eval("df < 2 or df > 10").sum()