How to make Python not break the DataFrame description into blocks?

The code below outputs the DataFrame description in two blocks, even though display.max_rows and display.max_columns are set to high values.
I'd like to print it without the breaks. Is there a way to do that?
import pandas as pd
pd.set_option('display.max_rows', 1000)
pd.set_option('display.max_columns', 1000)
data = {'Age': [20, 21, 19, 18],
        'Height': [120, 121, 119, 118],
        'Very_very_long_variable_name': [40, 71, 49, 78]
        }
df = pd.DataFrame(data)
print(df.describe().transpose())

Try adding this setting as well at the top:
pd.set_option('display.expand_frame_repr', False)
And now:
import pandas as pd
pd.set_option('display.expand_frame_repr', False)
pd.set_option('display.max_rows', 1000)
pd.set_option('display.max_columns', 1000)
data = {'Age': [20, 21, 19, 18],
        'Height': [120, 121, 119, 118],
        'Very_very_long_variable_name': [40, 71, 49, 78]
        }
df = pd.DataFrame(data)
print(df.describe().transpose())
Output:
                              count   mean        std    min     25%    50%     75%    max
Age                             4.0   19.5   1.290994   18.0   18.75   19.5   20.25   21.0
Height                          4.0  119.5   1.290994  118.0  118.75  119.5  120.25  121.0
Very_very_long_variable_name    4.0   59.5  17.935068   40.0   46.75   60.0   72.75   78.0

No need to change any options; just use the to_string method:
print(df.describe().T.to_string())
Output:
                              count   mean        std    min     25%    50%     75%    max
Age                             4.0   19.5   1.290994   18.0   18.75   19.5   20.25   21.0
Height                          4.0  119.5   1.290994  118.0  118.75  119.5  120.25  121.0
Very_very_long_variable_name    4.0   59.5  17.935068   40.0   46.75   60.0   72.75   78.0
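If you only need the wide output for a single print and don't want to change the global options, pandas also has an option_context context manager (a small sketch along the same lines, not from the original answers):
import pandas as pd

data = {'Age': [20, 21, 19, 18],
        'Height': [120, 121, 119, 118],
        'Very_very_long_variable_name': [40, 71, 49, 78]}
df = pd.DataFrame(data)

# The options only apply inside the with-block and are restored afterwards.
with pd.option_context('display.expand_frame_repr', False,
                       'display.max_rows', 1000,
                       'display.max_columns', 1000):
    print(df.describe().transpose())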

Related

line_profiler could not tell pandas assign details

def test_lprun():
    data = {'Name': ['Tom', 'Brad', 'Kyle', 'Jerry'],
            'Age': [20, 21, 19, 18],
            'Height': [6.1, 5.9, 6.0, 6.1]
            }
    df = pd.DataFrame(data)
    df = df.assign(A=123,
                   B=lambda x: x.Age + x.Height,
                   C=lambda x: x.Name.str.upper(),
                   D=lambda x: x.Name.str.lower()
                   )
    return df
In [8]: %lprun -f test_lprun test_lprun()
Timer unit: 1e-07 s
Total time: 0.0044901 s
File: <ipython-input-7-eaf21639fb5f>
Function: test_lprun at line 1
Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     1                                           def test_lprun():
     2         1         21.0     21.0      0.0      data = {'Name': ['Tom', 'Brad', 'Kyle', 'Jerry'],
     3         1         13.0     13.0      0.0              'Age': [20, 21, 19, 18],
     4         1         15.0     15.0      0.0              'Height': [6.1, 5.9, 6.0, 6.1]
     5                                                       }
     6         1       8651.0   8651.0     19.3      df = pd.DataFrame(data)
     7         1         19.0     19.0      0.0      df = df.assign(A=123,
     8         1         11.0     11.0      0.0                     B=lambda x: x.Age + x.Height,
     9         1         10.0     10.0      0.0                     C=lambda x: x.Name.str.upper(),
    10         1      36147.0  36147.0     80.5                     D=lambda x: x.Name.str.lower()
    11                                                              )
    12         1         14.0     14.0      0.0      return df
When using pandas assign, line_profiler cannot tell which of the assigned columns takes the most time; it only reports the overall result for the assign call.
Goal: have line_profiler report a result for each row of the pandas assign call, e.g. Line 6 %Time is 10, Line 7 %Time is 30, and so on.
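One possible workaround (an assumption on my part, not from the original post): split the chained assign into separate statements so that each lambda is executed on its own profiled line, e.g.:
import pandas as pd

def test_lprun_split():
    data = {'Name': ['Tom', 'Brad', 'Kyle', 'Jerry'],
            'Age': [20, 21, 19, 18],
            'Height': [6.1, 5.9, 6.0, 6.1]}
    df = pd.DataFrame(data)
    # Each assign call is now its own line, so %lprun reports time per column.
    df = df.assign(A=123)
    df = df.assign(B=lambda x: x.Age + x.Height)
    df = df.assign(C=lambda x: x.Name.str.upper())
    df = df.assign(D=lambda x: x.Name.str.lower())
    return df

# In IPython: %lprun -f test_lprun_split test_lprun_split()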

Iterate over pandas dataframe columns containing nested arrays (reformulated request)

I hope you can help. A few weeks ago you gave me a huge amount of help with a similar issue regarding nested arrays.
Today I have a similar issue, and I've tried all the solutions you provided in the link below:
Iterate over pandas dataframe columns containing nested arrays
My data is an ORB vector containing descriptor points. It returns a list. When I convert the list into an array I get this output:
import numpy as np

data = np.asarray([['Test /file0090',
                    np.asarray([[ 84,  55, 189],
                                [248, 100,  18],
                                [ 68,   0,  88]])],
                   ['aa file6565',
                    np.asarray([[ 86,  58, 189],
                                [ 24,  10, 118],
                                [ 68,  11,   0]])],
                   ['aa filejjhgjgj',
                    None],
                   ['Test /file0088',
                    np.asarray([[ 54,  58, 787],
                                [  4,   1,  18],
                                [  8,   1,   0]])]])
This is a small sample; the real data is an array of 800,000 x 2.
Some images do not return any descriptor points, and their value shows None.
Below is an example where I've selected 2 rows whose values were None:
array([['/00cbbc837d340fa163d11e169fbdb952.jpg',
None],
['/078b35be31e8ac99b0cbb817dab4c23f.jpg',
None]], dtype=object)
Once again, I need to get this into an n x 4 shape (in this case we have 4 variables, but in my real data there are 33 variables), like this:
col0, Col1, Col2, col3,
Test /file0090 84, 55, 189
Test /file0090 248, 100, 18
Test /file0090 84, 55, 189
'aa file6565' 86, 58, 189
'aa file6565' 24, 10, 118
'aa file6565' 68, 11, 0
'aa filejjhgjgj' 0 0 0
'Test /file0088 54, 58, 787
'Test /file0088 4, 1, 18
'Test /file0088 8, 1, 0
The issue with the solution provided in the link is that when we have these None values in the array it returns:
ufunc 'add' did not contain a loop with signature matching types (dtype('<U21'), dtype('<U21')) -> dtype('<U21')
I'd appreciate any help getting past this.
You can modify @anky's answer to handle null values by using df.fillna(''):
df = pd.DataFrame(data).add_prefix('col')                   # columns become col0, col1
df = df.fillna('').explode('col1').reset_index(drop=True)   # None -> '' so explode keeps those rows
df = df.join(pd.DataFrame(df.pop('col1').tolist()).add_prefix('Col')).fillna(0)  # expand each point into Col0..Col2
Returns
          col0   Col0   Col1   Col2
Test /file0090   84.0   55.0  189.0
Test /file0090  248.0  100.0   18.0
Test /file0090   68.0    0.0   88.0
   aa file6565   86.0   58.0  189.0
   aa file6565   24.0   10.0  118.0
   aa file6565   68.0   11.0    0.0
aa filejjhgjgj    0.0    0.0    0.0
Test /file0088   54.0   58.0  787.0
Test /file0088    4.0    1.0   18.0
Test /file0088    8.0    1.0    0.0

Is there a more elegant way to do conditional cumulative sums in pandas?

I'm trying to build a little portfolio app and calculate my average entry price and the realised gains off the back of that. Here's what I have so far. It works, but I'm curious whether there's a more elegant way to get conditional cumulative sums without creating extra columns. It seems like a lot of steps for what is effectively a SUMIFS statement in Excel.
Input dataframe:
import pandas as pd
import numpy as np
from datetime import datetime

hist_pos = pd.DataFrame(data=[
    [datetime(2020, 5, 1), 'PPT.AX', 30, 20.00, 15.00, 'Buy'],
    [datetime(2020, 5, 2), 'RIO.AX', 25, 25.00, 15.00, 'Buy'],
    [datetime(2018, 5, 3), 'BHP.AX', 100, 4.00, 15.00, 'Buy'],
    [datetime(2019, 5, 3), 'BHP.AX', 50, 4.00, 15.00, 'Sell'],
    [datetime(2019, 12, 3), 'PPT.AX', 80, 4.00, 15.00, 'Buy'],
    [datetime(2020, 5, 3), 'RIO.AX', 100, 4.00, 15.00, 'Buy'],
    [datetime(2020, 5, 5), 'PPT.AX', 50, 40.00, 15.00, 'Sell'],
    [datetime(2020, 5, 10), 'PPT.AX', 15, 45.00, 15.00, 'Sell'],
    [datetime(2020, 5, 18), 'PPT.AX', 30, 20.00, 15.00, 'Sell']],
    columns=['Date', 'Ticker', 'Quantity', 'Price', 'Fees', 'Direction'])
Code base:
hist_pos.sort_values(['Ticker', 'Date'], inplace=True)
hist_pos.Quantity = pd.to_numeric(hist_pos.Quantity) #convert to number
# where direction is sale, make quantity negative
hist_pos['AdjQ'] = np.where(
hist_pos.Direction == 'Buy', 1, -1)*hist_pos.Quantity
#Sum quantity to get closing quantity for each ticker using the AdjQ column
hist_pos['CumQuan'] = hist_pos.groupby('Ticker')['AdjQ'].cumsum()
Expected Output:
        Date  Ticker  Quantity  Price  Fees Direction  AdjQ  CumQuan
2 2018-05-03  BHP.AX       100    4.0  15.0       Buy   100      100
3 2019-05-03  BHP.AX        50    4.0  15.0      Sell   -50       50
4 2019-12-03  PPT.AX        80    4.0  15.0       Buy    80       80
0 2020-05-01  PPT.AX        30   20.0  15.0       Buy    30      110
6 2020-05-05  PPT.AX        50   40.0  15.0      Sell   -50       60
7 2020-05-10  PPT.AX        15   45.0  15.0      Sell   -15       45
8 2020-05-18  PPT.AX        30   20.0  15.0      Sell   -30       15
1 2020-05-02  RIO.AX        25   25.0  15.0       Buy    25       25
5 2020-05-03  RIO.AX       100    4.0  15.0       Buy   100      125
The code above works fine and produces the expected output for the CumQuan column. However, I have broader code (here in Repl) where I need to go through this process a number of times for various columns, so I'm wondering whether there is a simpler way to combine the group by and cumulative sum with a conditional.
Grouping the aggregations together is the only simplification I can think of (CumCost uses the CFBuy column from the broader code linked in the question, which isn't shown here):
hist_pos2 = hist_pos.groupby('Ticker').agg(CumQuan=('AdjQ', 'cumsum'), CumCost=('CFBuy', 'cumsum'))
   CumQuan  CumCost
2      100   -415.0
3       50   -415.0
4       80   -335.0
0      110   -950.0
6       60   -950.0
7       45   -950.0
8       15   -950.0
1       25   -640.0
5      125  -1055.0
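If the main concern is the extra AdjQ column on hist_pos, one alternative (a sketch, not from the original answer) is to create it only inside a method chain, so the original frame stays untouched:
import numpy as np

# AdjQ exists only in the temporary frame created by assign.
hist_pos['CumQuan'] = (
    hist_pos.assign(AdjQ=np.where(hist_pos.Direction == 'Buy', 1, -1) * hist_pos.Quantity)
            .groupby('Ticker')['AdjQ']
            .cumsum()
)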

Calculate source value from deviation of Bollinger Bands with Python and Pandas

I am calculating Bollinger Bands (the rolling mean plus/minus a multiple of the rolling standard deviation; the example here is very simplified) in a pandas dataframe like this:
import pandas as pd
import numpy as np
no_of_std = 3
window = 20
df = pd.DataFrame({'A': [34, 34, 34, 33, 32, 34, 35.0, 21, 22, 25, 23, 21, 39, 26, 31, 34, 38, 26, 21, 39, 31]})
rolling_mean = df['A'].rolling(window).mean()
rolling_std = df['A'].rolling(window).std(ddof=0)
df['M'] = rolling_mean
df['BBL'] = rolling_mean - (rolling_std * no_of_std)
df['BBH'] = rolling_mean + (rolling_std * no_of_std)
print (df)
The result looks like this:
       A      M        BBL        BBH
0   34.0    NaN        NaN        NaN
1   34.0    NaN        NaN        NaN
2   34.0    NaN        NaN        NaN
3   33.0    NaN        NaN        NaN
4   32.0    NaN        NaN        NaN
5   34.0    NaN        NaN        NaN
6   35.0    NaN        NaN        NaN
7   21.0    NaN        NaN        NaN
8   22.0    NaN        NaN        NaN
9   25.0    NaN        NaN        NaN
10  23.0    NaN        NaN        NaN
11  21.0    NaN        NaN        NaN
12  39.0    NaN        NaN        NaN
13  26.0    NaN        NaN        NaN
14  31.0    NaN        NaN        NaN
15  34.0    NaN        NaN        NaN
16  38.0    NaN        NaN        NaN
17  26.0    NaN        NaN        NaN
18  21.0    NaN        NaN        NaN
19  39.0  30.10  11.633544  48.566456
20  31.0  29.95  11.665375  48.234625
Now I want to calculate in the other direction: which value does the last value in column 'A' need to have in order to hit exactly the 3rd standard deviation of the rolling mean?
In other words: which value would A need to have in a next row (nr. 15) so that it is exactly equal to the value in BBH or BBL?
I can do this by recursive approximation, but that costs a lot of performance and I think there must be a better way. Here is an example of that solution, which I think is too slow; there must be a better, faster way:
import pandas as pd
odf = pd.DataFrame({'A': [34, 34, 34, 33, 32, 34, 35.0, 21, 22, 25, 23, 21, 39, 26, 31, 34, 38, 26, 21, 39, 31]})
def get_last_bbh_bbl(idf):
    xdf = idf.copy()
    no_of_std = 3
    window = 20
    rolling_mean = xdf['A'].rolling(window).mean()
    rolling_std = xdf['A'].rolling(window).std()
    xdf['M'] = rolling_mean
    xdf['BBL'] = rolling_mean - (rolling_std * no_of_std)
    xdf['BBH'] = rolling_mean + (rolling_std * no_of_std)
    bbh = xdf.loc[len(xdf) - 1, 'BBH']
    bbl = xdf.loc[len(xdf) - 1, 'BBL']
    return bbh, bbl

def search_matching_value(idf, low, high, search_for):
    xdf = idf.copy()
    if abs(high - low) < 0.000001:
        return high
    middle = low + ((high - low) / 2)
    xdf = xdf.append({'A': middle}, ignore_index=True)
    bbh, bbl = get_last_bbh_bbl(xdf)
    if search_for == 'bbh':
        if bbh < middle:
            result = search_matching_value(idf, low, middle, search_for)
        elif bbh > middle:
            result = search_matching_value(idf, middle, high, search_for)
        else:
            return middle
    elif search_for == 'bbl':
        if bbl > middle:
            result = search_matching_value(idf, middle, high, search_for)
        elif bbl < middle:
            result = search_matching_value(idf, low, middle, search_for)
        else:
            return middle
    return result
actual_bbh, actual_bbl = get_last_bbh_bbl(odf)
last_value = odf.loc[len(odf) - 1, 'A']
print('last_value: {}, actual bbh: {}, actual bbl: {}'.format(last_value, actual_bbh, actual_bbl))
low = last_value
high = actual_bbh * 10
next_value_that_hits_bbh = search_matching_value(odf, low, high, 'bbh')
print ('next_value_that_hits_bbh: {}'.format(next_value_that_hits_bbh))
low=0
high=last_value
next_value_that_hits_bbl = search_matching_value(odf, low, high, 'bbl')
print ('next_value_that_hits_bbl: {}'.format(next_value_that_hits_bbl))
The result looks like this:
last_value: 31.0, actual bbh: 48.709629106422284, actual bbl: 11.190370893577711
next_value_that_hits_bbh: 57.298733206475276
next_value_that_hits_bbl: 2.174952656030655
Here is one solution to calculate the next value with a fast algorithm: the optimized and the classic Newton methods are both faster than dichotomy, and this solution doesn't use a dataframe to recalculate the different values; it uses the statistics functions from the standard library module of the same name directly.
Some info on scipy.optimize.newton.
from scipy import misc
import pandas as pd
import statistics
from scipy.optimize import newton
# scipy.optimize if you want to test the optimized Newton function

def get_last_bbh_bbl(idf):
    xdf = idf.copy()
    rolling_mean = xdf['A'].rolling(window).mean()
    rolling_std = xdf['A'].rolling(window).std()
    xdf['M'] = rolling_mean
    xdf['BBL'] = rolling_mean - (rolling_std * no_of_std)
    xdf['BBH'] = rolling_mean + (rolling_std * no_of_std)
    bbh = xdf.loc[len(xdf) - 1, 'BBH']
    bbl = xdf.loc[len(xdf) - 1, 'BBL']
    lastvalue = xdf.loc[len(xdf) - 1, 'A']
    return lastvalue, bbh, bbl

# classic Newton
def NewtonsMethod(f, x, tolerance=0.00000001):
    while True:
        x1 = x - f(x) / misc.derivative(f, x)
        t = abs(x1 - x)
        if t < tolerance:
            break
        x = x1
    return x

# to calculate the result of function bbl(x) - x (we want 0!)
def low(x):
    l = lastlistofvalue[:-1]
    l.append(x)
    avg = statistics.mean(l)
    std = statistics.stdev(l, avg)
    return avg - std * no_of_std - x

# to calculate the result of function bbh(x) - x (we want 0!)
def high(x):
    l = lastlistofvalue[:-1]
    l.append(x)
    avg = statistics.mean(l)
    std = statistics.stdev(l, avg)
    return avg + std * no_of_std - x
odf = pd.DataFrame({'A': [34, 34, 34, 33, 32, 34, 35.0, 21, 22, 25, 23, 21, 39, 26, 31, 34, 38, 26, 21, 39, 31]})
no_of_std = 3
window = 20
lastlistofvalue = odf['A'].shift(0).to_list()[::-1][:window]
"""" Newton classic method """
x = odf.loc[len(odf) - 1, 'A']
x0 = NewtonsMethod(high, x)
print(f'value to hit bbh: {x0}')
odf = pd.DataFrame({'A': [34, 34, 34, 33, 32, 34, 35.0, 21, 22, 25, 23, 21, 39, 26, 31, 34, 38, 26, 21, 39, 31, x0]})
lastvalue, new_bbh, new_bbl = get_last_bbh_bbl(odf)
print(f'value to hit bbh: {lastvalue} -> check new bbh: {new_bbh}')
x0 = NewtonsMethod(low, x)
print(f'value to hit bbl: {x0}')
odf = pd.DataFrame({'A': [34, 34, 34, 33, 32, 34, 35.0, 21, 22, 25, 23, 21, 39, 26, 31, 34, 38, 26, 21, 39, 31, x0]})
lastvalue, new_bbh, new_bbl = get_last_bbh_bbl(odf)
print(f'value to hit bbl: {lastvalue} -> check new bbl: {new_bbl}')
output:
value to hit bbh: 57.298732375228624
value to hit bbh: 57.298732375228624 -> check new bbh: 57.29873237527272
value to hit bbl: 2.1749518354059636
value to hit bbl: 2.1749518354059636 -> check new bbl: 2.1749518353102992
You can compare with the optimized Newton method like this:
""" Newton optimized method """
x = odf.loc[len(odf) - 1, 'A']
x0 = newton(high, x, fprime=None, args=(), tol=1.00e-08, maxiter=50, fprime2=None)
print(f'Newton opt value to hit bbh: {x0}')
x0 = newton(low, x, fprime=None, args=(), tol=1.48e-08, maxiter=50, fprime2=None)
print(f'Newton value to hit bbl: {x0}')
output:
Newton opt value to hit bbh: 57.29873237532118
Newton value to hit bbl: 2.1749518352051225
With the optimized Newton you can also play with the maximum number of iterations, and the optimized version is faster than the classic one. Timings for each calculation:
0.002 sec for optimized
0.005 sec for classic
Remarks:
If you use rolling(window).std() you are using the sample standard deviation, so you have to use std = statistics.stdev(l, avg), which divides by N-1 items.
If you use rolling(window).std(ddof=0) you are using the population standard deviation, so you have to use std = statistics.pstdev(l, avg), which divides by N items.
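A quick sanity check of that pairing (a minimal sketch using only pandas and the statistics module):
import statistics
import pandas as pd

values = [34, 34, 33, 32, 35.0]
s = pd.Series(values)

# .std() defaults to ddof=1 and matches statistics.stdev (divides by N-1)
print(abs(s.std() - statistics.stdev(values)) < 1e-9)         # True
# .std(ddof=0) matches statistics.pstdev (divides by N)
print(abs(s.std(ddof=0) - statistics.pstdev(values)) < 1e-9)  # True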

pandas combining dataframe

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import pickle
java = pickle.load(open('JavaSafe.p','rb')) ##import 2d array
python = pickle.load(open('PythonSafe.p','rb')) ##import 2d array
javaFrame = pd.DataFrame(java,columns=['Town','Java Jobs'])
pythonFrame = pd.DataFrame(python,columns=['Town','Python Jobs'])
javaFrame = javaFrame.sort_values(by='Java Jobs',ascending=False)
pythonFrame = pythonFrame.sort_values(by='Python Jobs',ascending=False)
print(javaFrame,"\n",pythonFrame)
This code comes out with the following:
              Town  Java Jobs
435        York,NY       3593
212     NewYork,NY       3585
584     Seattle,WA       2080
624     Chicago,IL       1920
301      Boston,MA       1571
...
79      Holland,MI          5
38    Manhattan,KS          5
497      Vernon,IL          5
30      Clayton,MO          5
90     Waukegan,IL          5
[653 rows x 2 columns]

              Town  Python Jobs
160     NewYork,NY         2949
11         York,NY         2938
349     Seattle,WA         1321
91      Chicago,IL         1312
167      Boston,MA         1117
...
383     Hanover,NH            5
209    Bulverde,TX            5
203   Salisbury,NC            5
67     Rockford,IL            5
256     Ventura,CA            5
[416 rows x 2 columns]
I want to make a new dataframe that uses the town names as the index and has one column each for the Java and Python jobs. However, some of the towns only have results for one of the languages.
import pandas as pd
javaFrame = pd.DataFrame({'Java Jobs': [3593, 3585, 2080, 1920, 1571, 5, 5, 5, 5, 5],
'Town': ['York,NY', 'NewYork,NY', 'Seattle,WA', 'Chicago,IL', 'Boston,MA', 'Holland,MI', 'Manhattan,KS', 'Vernon,IL', 'Clayton,MO', 'Waukegan,IL']}, index=[435, 212, 584, 624, 301, 79, 38, 497, 30, 90])
pythonFrame = pd.DataFrame({'Python Jobs': [2949, 2938, 1321, 1312, 1117, 5, 5, 5, 5, 5],
'Town': ['NewYork,NY', 'York,NY', 'Seattle,WA', 'Chicago,IL', 'Boston,MA', 'Hanover,NH', 'Bulverde,TX', 'Salisbury,NC', 'Rockford,IL', 'Ventura,CA']}, index=[160, 11, 349, 91, 167, 383, 209, 203, 67, 256])
result = pd.merge(javaFrame, pythonFrame, how='outer').set_index('Town')
#               Java Jobs  Python Jobs
# Town
# York,NY          3593.0       2938.0
# NewYork,NY       3585.0       2949.0
# Seattle,WA       2080.0       1321.0
# Chicago,IL       1920.0       1312.0
# Boston,MA        1571.0       1117.0
# Holland,MI          5.0          NaN
# Manhattan,KS        5.0          NaN
# Vernon,IL           5.0          NaN
# Clayton,MO          5.0          NaN
# Waukegan,IL         5.0          NaN
# Hanover,NH          NaN          5.0
# Bulverde,TX         NaN          5.0
# Salisbury,NC        NaN          5.0
# Rockford,IL         NaN          5.0
# Ventura,CA          NaN          5.0
pd.merge will by default join two DataFrames on all columns they share. In this case, javaFrame and pythonFrame share only the Town column, so by default pd.merge joins the two DataFrames on Town.
how='outer' causes pd.merge to use the union of the keys from both frames. In other words, it returns rows whose data come from either javaFrame or pythonFrame, even if only one DataFrame contains that Town. Missing data is filled with NaNs.
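Being explicit about the join key gives the same result and arguably reads more clearly (a small variation on the code above, not from the original answer):
result = pd.merge(javaFrame, pythonFrame, on='Town', how='outer').set_index('Town')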
Use pd.concat
df = pd.concat([df.set_index('Town') for df in [javaFrame, pythonFrame]], axis=1)
              Java Jobs  Python Jobs
Boston,MA        1571.0       1117.0
Bulverde,TX         NaN          5.0
Chicago,IL       1920.0       1312.0
Clayton,MO          5.0          NaN
Hanover,NH          NaN          5.0
Holland,MI          5.0          NaN
Manhattan,KS        5.0          NaN
NewYork,NY       3585.0       2949.0
Rockford,IL         NaN          5.0
Salisbury,NC        NaN          5.0
Seattle,WA       2080.0       1321.0
Ventura,CA          NaN          5.0
Vernon,IL           5.0          NaN
Waukegan,IL         5.0          NaN
York,NY          3593.0       2938.0
