Low speed of Python using while, for, and if-else conditions

Hi, I have the code below in Python.
I am comparing multiple columns to a single value each row, based on the presence of a certain flag value. Two such sets are shown here: ST9 and ST1.
I am using a while loop together with for and if-else conditions, but it takes almost 13 minutes to process. How can I increase the computation speed?
i = 0
Reslt_9 = []
while i < len(xx):
    for val in xx.ST9:
        if val == 4:
            xz = xx[['SV1','SV2','SV3','SV4','SV5','SV6','SV7','SV8','SV9','SV10','SV11','SV12']].eq(xx['G9'], axis=0).assign(no=True)
            i1 = xz.values.argmax(axis=1)
            Reslt_9 = np.append(Reslt_9, xx['R9'][i] - xx[['PR1','PR2','PR3','PR4','PR5','PR6','PR7','PR8','PR9','PR10','PR11','PR12']].assign(no=np.nan).values[xx.index, i1][i])
        elif val == 2:
            Reslt_9 = np.append(Reslt_9, 0)
        i = i + 1
a2 = 0
Reslt_1 = []
aa = read_data
while a2 < len(aa):
    # Debug area
    for val1 in aa.ST1:
        if val1 == 4:
            a = aa[['SV1','SV2','SV3','SV4','SV5','SV6','SV7','SV8','SV9','SV10','SV11','SV12']].eq(aa['G1'], axis=0).assign(no=True)
            a1 = a.values.argmax(axis=1)
            Reslt_1 = np.append(Reslt_1, aa['R1'][a2] - aa[['PR1','PR2','PR3','PR4','PR5','PR6','PR7','PR8','PR9','PR10','PR11','PR12']].assign(no=np.nan).values[aa.index, a1][a2])
        elif val1 == 2:
            Reslt_1 = np.append(Reslt_1, 0)
        a2 = a2 + 1
# Input data
# I have 60 columns and 10871 rows.
# For every row I compare one item of set-1 (say SV1) with set-2 when the
# corresponding flag column is 4; otherwise the result defaults to zero.
set-1:  3  6 11 31 22 23 14 17 19  1
set-2:  1 14  3 23  6 11 31 17  9 19 22 10
flags:  4  4  4  4  4  4  4  4  2  4  4  2
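One possible vectorized rewrite of the ST9 block (a sketch, not a tested drop-in: it assumes the intent is one result per row, with xx and the column names taken from the question; the flag handling for values other than 4 and 2 is an assumption):
import numpy as np
import pandas as pd

# Assumed column groups, taken from the expressions in the question.
sv_cols = [f'SV{k}' for k in range(1, 13)]
pr_cols = [f'PR{k}' for k in range(1, 13)]

# For every row, find the first SV column that equals G9 (position 12 = "no match").
match = xx[sv_cols].eq(xx['G9'], axis=0).assign(no=True)
first_match = match.values.argmax(axis=1)

# Look up the PR value in that same position (NaN when nothing matched).
pr_vals = xx[pr_cols].assign(no=np.nan).values[np.arange(len(xx)), first_match]

# R9 - PR where the flag is 4, 0 where the flag is 2, NaN otherwise.
Reslt_9 = np.where(xx['ST9'] == 4, xx['R9'].values - pr_vals,
                   np.where(xx['ST9'] == 2, 0, np.nan))
This computes the whole-frame comparison once instead of once per row, which removes the quadratic cost of the loop.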

Related

Editing Values in the List of Dictionaries in Python

I have a list of dictionaries in Python. Each dictionary contains a "sequenceId" that I need to update across all dictionaries, making sure each sequenceId is even and non-repeating.
To update the sequenceIds, I am using a for loop, but the behavior is not what I expect.
seqId = 0
for index in range(20):
    FinalNodes[index]['sequenceId'] = seqId
    seqId += 2
print(FinalNodes[7]['sequenceId'])
Expected output: 14
Observed output: 38
Here is the full code snippet:
import json
import time

with open('test.json', "r") as json_file:
    data = json.load(json_file)

numberofJobs = 10
NodesList = data['nodes']
nNodes = len(NodesList)

# Divide all nodes into first, main and last
MainNodes = NodesList[1:nNodes-1]
FirstNode = NodesList[0:1]
LastNode = NodesList[nNodes-1:nNodes]

# Prepare final nodes
FinalNodes = FirstNode.copy()
for i in range(numberofJobs):
    FinalNodes.extend(MainNodes)
FinalNodes.extend(LastNode)

print(FinalNodes[7]['sequenceId'])

seqId = 0
for index in range(0, 20):
    FinalNodes[index]['sequenceId'] = seqId
    seqId += 2
    print(FinalNodes[index]['sequenceId'], index)
print(FinalNodes[7]['sequenceId'])
Output inside the loop:
0 0
2 1
4 2
6 3
8 4
10 5
12 6
14 7
16 8
18 9
20 10
22 11
24 12
26 13
28 14
30 15
32 16
34 17
36 18
38 19
Please check your FinalNodes list, as I ran the same code and it works fine!
FinalNodes = []
for i in range(20):
    FinalNodes.append({"sequenceId": 0})

seqId = 0
for index in range(8):
    FinalNodes[index]['sequenceId'] = seqId
    seqId += 2
print(FinalNodes[7]['sequenceId'])  # 14
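If it still misbehaves with your real data, the observed 38 is consistent with aliasing rather than a loop bug: FinalNodes.extend(MainNodes) appends references to the same dictionaries on every pass, so if the number of main nodes divides 12 (for example 2, 3, 4, 6, or 12), FinalNodes[7] and FinalNodes[19] are the same object and the later write (38) overwrites the 14. A minimal sketch of that effect (the two-element MainNodes is made up for illustration):
import copy

MainNodes = [{"sequenceId": 0}, {"sequenceId": 0}]   # pretend there are 2 main nodes
FinalNodes = []
for _ in range(10):
    FinalNodes.extend(MainNodes)          # appends references to the SAME 2 dicts

FinalNodes[0]['sequenceId'] = 99
print(FinalNodes[2]['sequenceId'])        # 99 -- index 2 is the same object as index 0

# Extending with deep copies keeps every position independent:
MainNodes = [{"sequenceId": 0}, {"sequenceId": 0}]
FinalNodes = []
for _ in range(10):
    FinalNodes.extend(copy.deepcopy(MainNodes))

FinalNodes[0]['sequenceId'] = 99
print(FinalNodes[2]['sequenceId'])        # 0 -- unaffected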

Python: How to create weighted quantiles in Pandas?

I understand how to create simple quantiles in Pandas using pd.qcut. But after searching around, I don't see anything to create weighted quantiles. Specifically, I wish to create a variable which bins the values of a variable of interest (from smallest to largest) such that each bin contains an equal weight. So far this is what I have:
def wtdQuantile(dataframe, var, weight=None, n=10):
    if weight == None:
        return pd.qcut(dataframe[var], n, labels=False)
    else:
        dataframe.sort_values(var, ascending=True, inplace=True)
        cum_sum = dataframe[weight].cumsum()
        cutoff = max(cum_sum) / n
        quantile = cum_sum / cutoff
        quantile[-1:] -= 1
        return quantile.map(int)
Is there an easier way, or something prebuilt from Pandas that I'm missing?
Edit: As requested, I'm providing some sample data. In the following, I'm trying to bin the "Var" variable using "Weight" as the weight. Using pd.qcut, we get an equal number of observations in each bin. Instead, I want an equal weight in each bin, or in this case, as close to equal as possible.
Weight  Var  pd.qcut(n=5)  Desired_Rslt
    10    1             0             0
    14    2             0             0
    18    3             1             0
    15    4             1             1
    30    5             2             1
    12    6             2             2
    20    7             3             2
    25    8             3             3
    29    9             4             3
    45   10             4             4
I don't think this is built-in to Pandas, but here is a function that does what you want in a few lines:
import numpy as np
import pandas as pd
from pandas._libs.lib import is_integer

def weighted_qcut(values, weights, q, **kwargs):
    'Return weighted quantile cuts from a given series, values.'
    if is_integer(q):
        quantiles = np.linspace(0, 1, q + 1)
    else:
        quantiles = q
    order = weights.iloc[values.argsort()].cumsum()
    bins = pd.cut(order / order.iloc[-1], quantiles, **kwargs)
    return bins.sort_index()
We can test it on your data this way:
data = pd.DataFrame({
    'var': range(1, 11),
    'weight': [10, 14, 18, 15, 30, 12, 20, 25, 29, 45]
})
data['qcut'] = pd.qcut(data['var'], 5, labels=False)
data['weighted_qcut'] = weighted_qcut(data['var'], data['weight'], 5, labels=False)
print(data)
The output matches your desired result from above:
   var  weight  qcut  weighted_qcut
0    1      10     0              0
1    2      14     0              0
2    3      18     1              0
3    4      15     1              1
4    5      30     2              1
5    6      12     2              2
6    7      20     3              2
7    8      25     3              3
8    9      29     4              3
9   10      45     4              4
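For intuition on why this works, here is a small check using the data frame defined just above (the printed numbers are simply the normalised cumulative weights that pd.cut bins):
# The cumulative weight, normalised to [0, 1], is what pd.cut bins; since rows
# are visited in value order, each bin ends up holding roughly 1/5 of the total weight.
order = data['weight'].iloc[data['var'].argsort()].cumsum()
print((order / order.iloc[-1]).round(3).tolist())
# [0.046, 0.11, 0.193, 0.261, 0.399, 0.454, 0.546, 0.661, 0.794, 1.0]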

pandas display: truncate column display rather than wrapping

With lengthy column names, DataFrames will display in a very messy form seemingly no matter what options are set.
Info: I'm in Jupyter QtConsole, pandas 0.20.1, with the following relevant options specified at startup:
pd.set_option('display.max_colwidth', 20)
pd.set_option('expand_frame_repr', False)
pd.set_option('display.max_rows', 25)
Question: how can I truncate the DataFrame if necessary rather than wrapping the columns to the next line, while keeping expand_frame_repr=False?
Here's an example. Again, the issue doesn't depend on the number of columns but on the length of the column names.
This will not cause an issue:
df = pd.DataFrame(np.random.randn(1000, 1000),
                  columns=['col' + str(i) for i in range(1000)])
The output is perfectly readable (the screenshot is not reproduced here).
The same DataFrame with long column names causes the issue I'm talking about:
df = pd.DataFrame(np.random.randn(1000, 1000),
                  columns=['very_long_col_name_' + str(i) for i in range(1000)])
Is there any way to conform the second output to be like the first that I'm missing? (Through specifying an option, not through using .iloc every time I want to view.)
Use max_columns
from string import ascii_letters

df = pd.DataFrame(np.random.randint(10, size=(5, 52)), columns=list(ascii_letters))

with pd.option_context(
    'display.max_colwidth', 20,
    'expand_frame_repr', False,
    'display.max_rows', 25,
    'display.max_columns', 5,
):
    print(df.add_prefix('really_long_column_name_'))
   really_long_column_name_a  really_long_column_name_b  ...  really_long_column_name_Y  really_long_column_name_Z
0                          8                          1  ...                          1                          9
1                          8                          5  ...                          2                          1
2                          5                          0  ...                          9                          9
3                          6                          8  ...                          0                          9
4                          1                          2  ...                          7                          1

[5 rows x 52 columns]
Another idea... Obviously not exactly what you want, but maybe you can twist it to your needs.
d1 = df.add_suffix('_really_long_column_name')
with pd.option_context('display.max_colwidth', 4, 'expand_frame_repr', False):
    mw = pd.get_option('display.max_colwidth')
    print(d1.rename(columns=lambda x: x[:mw-3] + '...' if len(x) > mw else x))
   a...  b...  c...  d...  e...  f...  g...  h...  i...  j...  ...  Q...  R...  S...  T...  U...  V...  W...  X...  Y...  Z...
0     6     5     5     5     8     3     5     0     7     6  ...     9     0     6     9     6     8     4     0     6     7
1     0     5     4     7     2     5     4     3     8     7  ...     8     1     5     3     5     9     4     5     5     3
2     7     2     1     6     5     1     0     1     3     1  ...     6     7     0     9     9     5     2     8     2     2
3     1     8     7     1     4     5     5     8     8     3  ...     3     6     5     7     1     0     8     1     4     0
4     7     5     6     2     4     9     7     9     0     5  ...     6     8     1     6     3     5     4     2     3     2
Looks like it will need an enhancement. The relevant code in the repr function appears to be here:
max_rows = get_option("display.max_rows")
max_cols = get_option("display.max_columns")
show_dimensions = get_option("display.show_dimensions")
if get_option("display.expand_frame_repr"):
    width, _ = console.get_console_size()
else:
    width = None
self.to_string(buf=buf, max_rows=max_rows, max_cols=max_cols,
               line_width=width, show_dimensions=show_dimensions)
So either you pass expand_frame_repr=True and it wraps on the line width, or you pass expand_frame_repr=False and it shouldn't. But it looks like there is a bug in the code (this should be pandas 0.20.3 iirc):
in pd.io.formats.format.DataFrameFormatter:
def _chk_truncate(self):
    """
    Checks whether the frame should be truncated. If so, slices
    the frame up.
    """
    from pandas.core.reshape.concat import concat

    # Column of which first element is used to determine width of a dot col
    self.tr_size_col = -1

    # Cut the data to the information actually printed
    max_cols = self.max_cols
    max_rows = self.max_rows

    if max_cols == 0 or max_rows == 0:  # assume we are in the terminal
                                        # (why else = 0)
        (w, h) = get_terminal_size()
        self.w = w
        self.h = h
        if self.max_rows == 0:
            dot_row = 1
            prompt_row = 1
            if self.show_dimensions:
                show_dimension_rows = 3
            n_add_rows = (self.header + dot_row + show_dimension_rows +
                          prompt_row)
            # rows available to fill with actual data
            max_rows_adj = self.h - n_add_rows
            self.max_rows_adj = max_rows_adj

        # Format only rows and columns that could potentially fit the
        # screen
        if max_cols == 0 and len(self.frame.columns) > w:
            max_cols = w
        if max_rows == 0 and len(self.frame) > h:
            max_rows = h
Looks like it intended to do what you wanted, but was unfinished. It's checking max_cols against the number of columns, not the total width of the columns.
So you could either create a show_df function that would calculate the correct number of columns and show it in an option_context like pi2Squared's answer, or fix it here (and maybe submit a patch if you need it distributed).
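The show_df idea could look roughly like this (a sketch under assumptions: show_df is a hypothetical helper, not a pandas API, and the column-count estimate is only approximate):
import shutil
import pandas as pd

def show_df(df, col_width=20):
    """Estimate how many columns of roughly col_width characters fit in the
    terminal and let display.max_columns truncate the rest with '...'."""
    term_width = shutil.get_terminal_size().columns
    n_cols = max(term_width // (col_width + 2), 1)
    with pd.option_context('display.max_colwidth', col_width,
                           'display.max_columns', n_cols,
                           'expand_frame_repr', False):
        print(df)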
As others have pointed out, Pandas itself seems to be bugged or badly designed here, so a workaround is required.
Most of the time this problem occurs with numerical columns, since numbers are relatively short. Pandas will split the column heading onto multiple lines if there are spaces in it, so you can "hack in" the correct behavior by inserting spaces into column headings for numerical columns when you display the dataframe. I have a one-liner to do this:
def colfix(df, L=5):
    return df.rename(columns=lambda x: ' '.join(x.replace('_', ' ')[i:i+L]
                                                for i in range(0, len(x), L))
                     if df[x].dtype in ['float64', 'int64'] else x)
To display your dataframe, simply type
colfix(your_df)
Note that the renaming does not permanently change the dataframe; it only adds spaces to the names for the purpose of displaying it that one time.
Results (in a Jupyter Notebook): screenshots comparing the display with and without colfix are not reproduced here.

Pandas track consecutive near numbers via compare-cumsum-groupby pattern

I am trying to extend my current pattern to accommodate an extra condition: values within ± a percentage of the last value, rather than a strict match with the previous value.
data = np.array([[2,30],[2,900],[2,30],[2,30],[2,30],[2,1560],[2,30],
                 [2,300],[2,30],[2,450]])
df = pd.DataFrame(data)
df.columns = ['id', 'interval']
UPDATE 2 (id fix): Updated Data 2 with more data:
data2 = np.array([[2,30],[2,900],[2,30],[2,29],[2,31],[2,30],[2,29],[2,31],
                  [2,1560],[2,30],[2,300],[2,30],[2,450],
                  [3,40],[3,900],[3,40],[3,39],[3,41],[3,40],[3,39],[3,41],
                  [3,1560],[3,40],[3,300],[3,40],[3,450]])
df2 = pd.DataFrame(data2)
df2.columns = ['id', 'interval']
for i, g in df.groupby([(df.interval != df.interval.shift()).cumsum()]):
    if len(g.interval.tolist()) >= 3:
        print(g.interval.tolist())
This results in [30, 30, 30].
However, I really want to catch near-number conditions, say when a number is within ±10% of the previous number.
So, looking at df2, I would like to pick up the series [30, 29, 31]:
for i, g in df2.groupby([(df2.interval != <??? +-10% magic ???>).cumsum()]):
    if len(g.interval.tolist()) >= 3:
        print(g.interval.tolist())
UPDATE: Here is the end-of-line processing code, where I store the gathered lists in a dictionary keyed by ID:
leak_intervals = {}
final_leak_intervals = {}
serials = []
for i, g in df.groupby([(df.interval != df.interval.shift()).cumsum()]):
    if len(g.interval.tolist()) >= 3:
        print(g.interval.tolist())
        serial = g.id.values[0]
        if serial not in serials:
            serials.append(serial)
        if serial not in leak_intervals:
            leak_intervals[serial] = g.interval.tolist()
        else:
            leak_intervals[serial] = leak_intervals[serial] + (g.interval.tolist())
UPDATE:
In [116]: df2.groupby(df2.interval.pct_change().abs().gt(0.1).cumsum()) \
              .filter(lambda x: len(x) >= 3)
Out[116]:
    id  interval
2    2        30
3    2        29
4    2        31
5    2        30
6    2        29
7    2        31
15   3        40
16   3        39
17   3        41
18   3        40
19   3        39
20   3        41
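Spelling out what that one-liner does, with named intermediates (same logic as above; note that pct_change() runs across id boundaries too, so add 'id' to the grouping if runs must not span ids):
# A new group starts every time a value jumps by more than 10% from the
# previous row, so runs of "near" numbers share a group key.
jump = df2.interval.pct_change().abs().gt(0.1)   # True where the chain breaks
key = jump.cumsum()                              # constant within each run
runs = df2.groupby(key).filter(lambda g: len(g) >= 3)
print(runs)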

How to Align Unicode-Type Values of a Column?

I have the following output containing two columns (line# and ID):
1 Q50331
2 P75247
3 P75544
4 P22446
5 P78027
6 P75271
7 P75176
8 P0ABB4
9 P63284
10 P0A6M8
11 P0AES4
12 P39452
13 P0A8T7
14 P0A698
How can I make the ID values in the second column line up with each other, like the following:
1  Q50331
2  P75247
3  P75544
4  P22446
5  P78027
6  P75271
7  P75176
8  P0ABB4
9  P63284
10 P0A6M8
11 P0AES4
12 P39452
13 P0A8T7
14 P0A698
The problem I am facing is how to incorporate a solution into my code. I tried to use python tabulate, but found it did not work properly, since what I am printing, row[0], is a unicode string from the tuple row (see the following code).
count = 0
for row in c:
    count += 1
    print count, row[0]
Any idea how I can incorporate tabulate or another method to align the unicode-type values in the column?
Use alignment specifiers:
data = {
    1: 'Q50331',
    2: 'P75247',
    3: 'P75544',
    4: 'P22446',
    5: 'P78027',
    6: 'P75271',
    7: 'P75176',
    8: 'P0ABB4',
    9: 'P63284',
    10: 'P0A6M8',
    11: 'P0AES4',
    12: 'P39452',
    13: 'P0A8T7',
    14: 'P0A698',
    333: 'P00bar'
}

length = len(str(max(data.keys()))) + 1
for k, v in data.items():
    print "{:<{}}{}".format(k, length, v)
Output:
1   Q50331
2   P75247
3   P75544
4   P22446
5   P78027
6   P75271
7   P75176
8   P0ABB4
9   P63284
10  P0A6M8
11  P0AES4
12  P39452
13  P0A8T7
14  P0A698
333 P00bar
I've created length, which holds the length of the largest key in data, plus 1. I then pass that length to the alignment specifier, which in this case is 4:
{:<4}{}
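To fold this back into the loop from the question, one option (a sketch that assumes c is a cursor or other iterable of tuples whose first element is the ID, as in the post) is:
rows = list(c)                       # materialise the cursor so the row count is known
width = len(str(len(rows))) + 1      # widest line number, plus one space
count = 0
for row in rows:
    count += 1
    print "{:<{}}{}".format(count, width, row[0])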
