line_profiler could not tell pandas assign details - python

import pandas as pd

def test_lprun():
    data = {'Name': ['Tom', 'Brad', 'Kyle', 'Jerry'],
            'Age': [20, 21, 19, 18],
            'Height': [6.1, 5.9, 6.0, 6.1]
            }
    df = pd.DataFrame(data)
    df = df.assign(A=123,
                   B=lambda x: x.Age + x.Height,
                   C=lambda x: x.Name.str.upper(),
                   D=lambda x: x.Name.str.lower()
                   )
    return df
In [8]: %lprun -f test_lprun test_lprun()

Timer unit: 1e-07 s

Total time: 0.0044901 s
File: <ipython-input-7-eaf21639fb5f>
Function: test_lprun at line 1

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     1                                           def test_lprun():
     2         1         21.0     21.0      0.0      data = {'Name': ['Tom', 'Brad', 'Kyle', 'Jerry'],
     3         1         13.0     13.0      0.0              'Age': [20, 21, 19, 18],
     4         1         15.0     15.0      0.0              'Height': [6.1, 5.9, 6.0, 6.1]
     5                                                       }
     6         1       8651.0   8651.0     19.3      df = pd.DataFrame(data)
     7         1         19.0     19.0      0.0      df = df.assign(A=123,
     8         1         11.0     11.0      0.0                     B=lambda x: x.Age + x.Height,
     9         1         10.0     10.0      0.0                     C=lambda x: x.Name.str.upper(),
    10         1      36147.0  36147.0     80.5                     D=lambda x: x.Name.str.lower()
    11                                                              )
    12         1         14.0     14.0      0.0      return df
When profiling pandas assign, line_profiler cannot tell which of the lambdas takes the most time; it attributes the whole cost of the assign statement to a single line instead of breaking it down.
Goal: have line_profiler report a per-line result for the pandas assign call, e.g. Line 6 %Time is 10, Line 7 %Time is 30, and so on.
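One possible workaround (a sketch of mine, not from the original post): line_profiler only gives per-line timings for functions explicitly registered with -f, and the whole assign(...) call executes as one statement, so the lambdas are lumped together. Pulling the lambdas out into named functions and registering each one gives a per-callable breakdown:

import pandas as pd

# Named helpers stand in for the lambdas so each one can be profiled.
def calc_b(x):
    return x.Age + x.Height

def calc_c(x):
    return x.Name.str.upper()

def calc_d(x):
    return x.Name.str.lower()

def test_lprun():
    data = {'Name': ['Tom', 'Brad', 'Kyle', 'Jerry'],
            'Age': [20, 21, 19, 18],
            'Height': [6.1, 5.9, 6.0, 6.1]}
    df = pd.DataFrame(data)
    return df.assign(A=123, B=calc_b, C=calc_c, D=calc_d)

# In IPython, after %load_ext line_profiler, pass each function
# with its own -f flag:
# %lprun -f test_lprun -f calc_b -f calc_c -f calc_d test_lprun()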

Related

delete redundant rows in a dataframe with set in columns

I have a dataframe df:
                          Cluster  OsId  BrowserId                                             PageId  VolumePred  ConversionPred
0                               0    11         11  {789615, 955761, 1149586, 955764, 955767, 1187...       147.0            71.0
1                               0    11         12  {1184903, 955761, 1149586, 1158132, 955764, 10...        73.0            38.0
2                               0    11         15  {1184903, 1109643, 955761, 955764, 1074581, 95...        72.0            40.0
3                               0    11         16  {1123200, 1184903, 1109643, 1018637, 1005581, ...      7815.0          5077.0
4                               0    11         17  {1184903, 789615, 1016529, 955761, 955764, 955...        52.0            47.0
..                            ...   ...        ...                                                ...         ...             ...
307  {0, 4, 7, 9, 12, 15, 18, 21}    99         16                                            1154705       220.0           182.0
308                          {18}    99         16                                            1155314        12.0             6.0
309                           {9}    99         16                                            1158132         4.0             4.0
310  {0, 4, 7, 9, 12, 15, 18, 21}    99         16                                            1184903       966.0           539.0
This dataframe contains redundant rows that I need to delete, so I tried:
df.drop_duplicates()
But I got this error: TypeError: unhashable type: 'set'
Any idea how to fix this error? Thanks!
Convert sets to frozensets to avoid the unhashable type error, then use DataFrame.duplicated and filter with boolean indexing, inverting the mask with ~:
# sets may appear in any column
df1 = df.applymap(lambda x: frozenset(x) if isinstance(x, set) else x)
df[~df1.duplicated()]
If no row is removed, it means no row has duplicates (all columns together are tested).
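A minimal runnable sketch of the same idea on toy data (my example, not the asker's frame); note that in pandas 2.1+ DataFrame.applymap is deprecated in favour of DataFrame.map:

import pandas as pd

df = pd.DataFrame({'a': [{1, 2}, {1, 2}, {3}],
                   'b': ['x', 'x', 'y']})

# frozenset is hashable, so duplicated() can compare rows that contain sets
df1 = df.applymap(lambda x: frozenset(x) if isinstance(x, set) else x)

print(df[~df1.duplicated()])   # row 1 (a duplicate of row 0) is dropped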

How to make python not break the dataframe description into blocks?

The code below prints the df description in two blocks, even though display.max_rows and display.max_columns are set to high values.
I'd like to print it without the breaks. Is there any way to do it?
import pandas as pd

pd.set_option('display.max_rows', 1000)
pd.set_option('display.max_columns', 1000)

data = {'Age': [20, 21, 19, 18],
        'Height': [120, 121, 119, 118],
        'Very_very_long_variable_name': [40, 71, 49, 78]
        }
df = pd.DataFrame(data)
print(df.describe().transpose())
Try adding this setting as well at the top:
pd.set_option('display.expand_frame_repr', False)
And now:
import pandas as pd

pd.set_option('display.expand_frame_repr', False)
pd.set_option('display.max_rows', 1000)
pd.set_option('display.max_columns', 1000)

data = {'Age': [20, 21, 19, 18],
        'Height': [120, 121, 119, 118],
        'Very_very_long_variable_name': [40, 71, 49, 78]
        }
df = pd.DataFrame(data)
print(df.describe().transpose())
Output:
                              count   mean        std    min     25%    50%     75%    max
Age                             4.0   19.5   1.290994   18.0   18.75   19.5   20.25   21.0
Height                          4.0  119.5   1.290994  118.0  118.75  119.5  120.25  121.0
Very_very_long_variable_name    4.0   59.5  17.935068   40.0   46.75   60.0   72.75   78.0
No need to change any option, just use the "to_string" method:
print(df.describe().T.to_string())
output:
                              count   mean        std    min     25%    50%     75%    max
Age                             4.0   19.5   1.290994   18.0   18.75   19.5   20.25   21.0
Height                          4.0  119.5   1.290994  118.0  118.75  119.5  120.25  121.0
Very_very_long_variable_name    4.0   59.5  17.935068   40.0   46.75   60.0   72.75   78.0
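As a side note (my addition, not part of the original answers): if you only need the wide output for a single print, pd.option_context applies the setting temporarily and restores it when the block exits:

import pandas as pd

data = {'Age': [20, 21, 19, 18],
        'Height': [120, 121, 119, 118],
        'Very_very_long_variable_name': [40, 71, 49, 78]}
df = pd.DataFrame(data)

# The option only holds inside the with-block
with pd.option_context('display.expand_frame_repr', False):
    print(df.describe().transpose())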

Finding peaks in pandas series with non integer index

I have the following series and I am trying to find the index of the peaks, which should be [1.0, 8.5], or the peak values, which should be [279, 139]. The threshold used is 100. I tried many ways, but it always ignores the series index and returns [1, 16].
0.5 0
1.0 279
1.5 256
2.0 84
2.5 23
3.0 11
3.5 3
4.0 2
4.5 7
5.0 5
5.5 4
6.0 4
6.5 10
7.0 30
7.5 88
8.0 133
8.5 139
9.0 84
9.5 55
10.0 26
10.5 10
11.0 8
11.5 4
12.0 4
12.5 1
13.0 0
13.5 0
14.0 1
14.5 0
I tried this code
thresh = 100
peak_idx, _ = find_peaks(out.value_counts(sort=False), height=thresh)
plt.plot(out.value_counts(sort=False).index[peak_idx], out.value_counts(sort=False)[peak_idx], 'r.')
out.value_counts(sort=False).plot.bar()
plt.show()
peak_idx
here is the output
array([ 1, 16], dtype=int64)
You are doing it right; the only thing you misunderstood is that find_peaks returns the positional indices of the peaks, not the peaks themselves.
Here is the documentation, which clearly states that:
Returns: peaks : ndarray
    Indices of peaks in x that satisfy all given conditions.
Reference: https://docs.scipy.org/doc/scipy/reference/generated/scipy.signal.find_peaks.html
Try this code:
from scipy.signal import find_peaks

thresh = 100
y = [0, 279, 256, 84, 23, 11, 3, 2, 7, 5, 4, 4, 10, 30, 88, 133, 139, 84, 55, 26, 10, 8, 4, 4, 1, 0, 0, 1, 0]
x = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0, 5.5, 6.0, 6.5, 7.0, 7.5, 8.0, 8.5, 9.0, 9.5, 10.0, 10.5, 11.0, 11.5, 12.0, 12.5, 13.0, 13.5, 14.0, 14.5]

# find_peaks must run on the values, not on the (monotonic) index
peak_idx, _ = find_peaks(y, height=thresh)

# map the positional indices back to the index values
out_values = [x[peak] for peak in peak_idx]
Here out_values will contain what you want: [1.0, 8.5].
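Equivalently, if the data is already a pandas Series, the positional indices from find_peaks can be mapped straight through the series index (a sketch assuming the values and index above):

import pandas as pd
from scipy.signal import find_peaks

s = pd.Series([0, 279, 256, 84, 23, 11, 3, 2, 7, 5, 4, 4, 10, 30,
               88, 133, 139, 84, 55, 26, 10, 8, 4, 4, 1, 0, 0, 1, 0],
              index=[i / 2 for i in range(1, 30)])  # 0.5, 1.0, ..., 14.5

peak_idx, _ = find_peaks(s.values, height=100)
print(s.index[peak_idx].tolist())   # [1.0, 8.5]
print(s.iloc[peak_idx].tolist())    # [279, 139]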

Convert list to dataframe

I am running a loop that appends three fields. predictfinal is a list, though it does not have to be one.
predictfinal.append(y_hat_orig[0])
predictfinal.append(mape)
predictfinal.append(length)
At the end, predictfinal is one long list. But I really want to shape the list into a DataFrame where each row has 3 columns. However, the list does not distinguish between the 3 columns; it's just a long flat list with commas in between. How can I slice predictfinal into 3 columns and build a DataFrame from the current unstructured list?
predictfinal
Out[88]:
[1433.0459967608983,
1.6407741379111223,
23,
1433.6389125340916,
1.6474721044455922,
22,
1433.867408791692,
1.6756763089082383,
21,
1433.8484984008207,
1.6457581105556003,
20,
1433.6340460965778,
1.6380908467895527,
19,
1437.0294365907992,
1.6147672264908473,
18,
1439.7485102740507,
1.5010415925555876,
17,
1440.950406295299,
1.433891246672529,
16,
1434.837060644701,
1.5252803314930383,
15,
1434.9716303636983,
1.6125952442799232,
14,
1441.3153523102953,
3.2633984339696185,
13,
1435.6932462859334,
3.2703435261200497,
12,
1419.9057834496082,
1.9100005818319687,
11,
1426.0739741342488,
1.947684057178654,
10]
Based on https://stackoverflow.com/a/48347320/6926444
We can achieve it with zip() and iter(): [iter(predictfinal)] * 3 creates three references to the same iterator, so zip() consumes three consecutive elements for each row.
res = pd.DataFrame(list(zip(*([iter(predictfinal)] * 3))), columns=['a', 'b', 'c'])
Result:
              a         b   c
0   1433.045997  1.640774  23
1   1433.638913  1.647472  22
2   1433.867409  1.675676  21
3   1433.848498  1.645758  20
4   1433.634046  1.638091  19
5   1437.029437  1.614767  18
6   1439.748510  1.501042  17
7   1440.950406  1.433891  16
8   1434.837061  1.525280  15
9   1434.971630  1.612595  14
10  1441.315352  3.263398  13
11  1435.693246  3.270344  12
12  1419.905783  1.910001  11
13  1426.073974  1.947684  10
You could do:
import numpy as np

pd.DataFrame(np.array(predictfinal).reshape(-1, 3), columns=['origin', 'mape', 'length'])
Output:
         origin      mape  length
0   1433.045997  1.640774    23.0
1   1433.638913  1.647472    22.0
2   1433.867409  1.675676    21.0
3   1433.848498  1.645758    20.0
4   1433.634046  1.638091    19.0
5   1437.029437  1.614767    18.0
6   1439.748510  1.501042    17.0
7   1440.950406  1.433891    16.0
8   1434.837061  1.525280    15.0
9   1434.971630  1.612595    14.0
10  1441.315352  3.263398    13.0
11  1435.693246  3.270344    12.0
12  1419.905783  1.910001    11.0
13  1426.073974  1.947684    10.0
Or you can also modify your loop:
predictfinal = []
for i in some_list:
    predictfinal.append([y_hat_orig[0], mape, length])

# output dataframe
pd.DataFrame(predictfinal, columns=['origin', 'mape', 'length'])
Here is a pandas solution:
s = pd.Series(l)
s.index = pd.MultiIndex.from_product([range(len(l) // 3), ['origin', 'map', 'len']])
s = s.unstack()
Out[268]:
     len       map       origin
0   23.0  1.640774  1433.045997
1   22.0  1.647472  1433.638913
2   21.0  1.675676  1433.867409
3   20.0  1.645758  1433.848498
4   19.0  1.638091  1433.634046
5   18.0  1.614767  1437.029437
6   17.0  1.501042  1439.748510
7   16.0  1.433891  1440.950406
8   15.0  1.525280  1434.837061
9   14.0  1.612595  1434.971630
10  13.0  3.263398  1441.315352
11  12.0  3.270344  1435.693246
12  11.0  1.910001  1419.905783
13  10.0  1.947684  1426.073974
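One caveat worth adding (my note): unstack sorts the new columns, which is why they come out as len, map, origin; reorder them if the original order matters:

s = s[['origin', 'map', 'len']]  # restore the intended column order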

pandas dataframe: make new dataframe from lookup and computation of existing dataframes

I'm trying to use data from two dataframes to create a new dataframe:
lookup_data = [
    {'item': 'apple',
     'attribute_1': 3,
     'attribute_2': 2,
     'attribute_3': 10,
     'attribute_4': 0,
     },
    {'item': 'orange',
     'attribute_1': 0.4,
     'attribute_2': 20,
     'attribute_3': 1,
     'attribute_4': 9,
     },
    {'item': 'pear',
     'attribute_1': 0,
     'attribute_2': 0,
     'attribute_3': 30,
     'attribute_4': 0,
     },
    {'item': 'peach',
     'attribute_1': 2,
     'attribute_2': 2,
     'attribute_3': 3,
     'attribute_4': 6,
     },
]
df_lookup_data = pd.DataFrame(lookup_data, dtype=float)
df_lookup_data.set_index('item', inplace=True, drop=True)

collected_data = [
    {'item': 'apple', 'qnt': 4},
    {'item': 'orange', 'qnt': 2},
    {'item': 'pear', 'qnt': 7},
]
df_collected_data = pd.DataFrame(collected_data, dtype=float)
df_collected_data.set_index('item', inplace=True, drop=True)
df_result = pd.DataFrame(
    # .... first column is item type
    # .... second column is qnt*attribute_1
    # .... third column is qnt*attribute_2
    # .... fourth column is qnt*attribute_3
    # .... fifth column is qnt*attribute_4
)
df_result.columns = ['item', 'attribute_1', 'attribute_2', 'attribute_3', 'attribute_4']
print(df_result)
the result should print:
     item attribute_1 attribute_2 attribute_3 attribute_4
0   apple          12           8          40           0
1  orange         0.8          40           2          18
2    pear           0           0         210           0
but I'm really not sure how to get the data from these two dataframes and build this new one.
No need to merge or concat here. Since the indexes match, simply mul across axis=0:
>>> df_lookup_data.mul(df_collected_data.qnt, axis=0)
        attribute_1  attribute_2  attribute_3  attribute_4
item
apple          12.0          8.0         40.0          0.0
orange          0.8         40.0          2.0         18.0
peach           NaN          NaN          NaN          NaN
pear            0.0          0.0        210.0          0.0
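To also drop peach from the result above (it is in the lookup table but has no collected quantity) and match the expected output, a possible touch-up (my addition to this answer):

result = (df_lookup_data.mul(df_collected_data.qnt, axis=0)
          .dropna()         # drops 'peach', which has no collected quantity
          .reset_index())   # turns the 'item' index back into a column
print(result)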
Or use:
df_lookup_data = pd.DataFrame(lookup_data,dtype=float)
items = [i['item'] for i in collected_data]
qnts = [i['qnt'] for i in collected_data]
print(df_lookup_data[df_lookup_data['item'].isin(items)].set_index('item').mul(qnts, axis=0))
Output:
        attribute_1  attribute_2  attribute_3  attribute_4
item
apple          12.0          8.0         40.0          0.0
orange          0.8         40.0          2.0         18.0
pear            0.0          0.0        210.0          0.0
