I am working on building a dynamic table by adding a computed column to each row; however, the range of columns used is determined by the difference between two columns (high - low):
import numpy as np
import pandas as pd

df = pd.DataFrame({
'10': [1, 10, 20, 30, 40, 50],
'20': [20, 15, 12, 18, 32, 12],
'30': [3, 11, 25, 32, 13, 4],
'40': [32, 11, 9, 82, 2, 1],
'50': [9, 5, 11, 11, 2, 5],
'low': [12, 22, 18, 27, 23, 15],
'high': [45, 41, 33, 54, 35, 45],
})
df
Index 10 20 30 40 50 low high
0 1 20 3 32 9 12 45
1 10 15 11 11 5 22 41
2 20 12 25 9 11 18 33
3 30 18 32 82 11 27 54
4 40 32 13 2 2 23 35
5 50 12 4 1 5 15 45
The low and high values are then used to determine which columns are selected, and the selected values are finally summed per row. My initial code starts with a helper that finds the nearest column label to a given value, and queries the columns (cols) to be used for the operation:
def colrange(first, last):
    return (first - last).abs().argsort()[0]

cols = df.columns[:-2]
Then I used iterrows() to go through every row and sum over the column range:
c = cols.to_series().astype(int)
for idx, row in df.iterrows():
    df.loc[idx, 'result'] = row[cols[colrange(c, row.low):colrange(c, row.high)]].sum()
So my df['result'] should look like:
Index 10 20 30 40 50 low high result
0 1 20 3 32 9 12 45 1+20+3 = 24
1 10 15 11 11 5 22 41 15+11 = 26
2 20 12 25 9 11 18 33 12 = 12
3 30 18 32 82 11 27 54 32+82 = 114
4 40 32 13 2 2 23 35 32 = 32
5 50 12 4 1 5 15 45 50+12+4 = 66
My problem is that this method is too slow. Could you advise any other way to solve this? I appreciate any thoughts in advance.
This is about 5 times faster on your example.
It should also scale pretty well as the DataFrame size increases:
# For each row, find the position of the column label nearest to low (start) and to high (stop)
start = np.abs((c.to_frame().to_numpy().T - df['low'].to_frame().to_numpy())).argsort()[:, 0]
stop = np.abs((c.to_frame().to_numpy().T - df['high'].to_frame().to_numpy())).argsort()[:, 0]
df['result'] = [*map(lambda first, last, row: df.iloc[row, first:last].sum(), start, stop, range(len(df)))]
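If that is still too slow, a fully vectorized variant is possible: build a boolean mask over the column positions and sum the masked values, skipping the per-row iloc slicing entirely. A minimal sketch reusing start, stop, and cols from above (result_vec is a hypothetical column name):
vals = df[cols].to_numpy()
pos = np.arange(len(cols))
# keep positions in [start, stop) for each row, zero out the rest, then sum
mask = (pos >= start[:, None]) & (pos < stop[:, None])
df['result_vec'] = np.where(mask, vals, 0).sum(axis=1)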
Reproducible data:
import random

import numpy as np
import pandas as pd

data = {'test_1_a': random.sample(range(1, 50), 7),
        'test_1_b': random.sample(range(1, 50), 7),
        'test_1_c': random.sample(range(1, 50), 7),
        'test_2_a': random.sample(range(1, 50), 7),
        'test_2_b': random.sample(range(1, 50), 7),
        'test_2_c': random.sample(range(1, 50), 7),
        'test_3_a': random.sample(range(1, 50), 7),
        'test_4_b': random.sample(range(1, 50), 7),
        'test_4_c': random.sample(range(1, 50), 7)}
df = pd.DataFrame(data)
Description:
I have a data frame similar to the example above, but with 1000ish columns. The column names follow the format test_number_family, so test_1_c has the number 1 and the family "c".
I want to classify the df by column names of the same "family". So my final output needs to be a list of lists of same-family values:
Output example:
[ [a_family values], [b_family values], ...]
In terms of column names, it would look like:
[ [test_1_a, test_2_a, test_3_a], [test_1_b, test_2_b, test_4_b], ...]
What I have:
# transfer the data frame into a dict sorted by column name, with columns as keys
col_names = [i for (i, j) in df.iteritems()]
col_vals = [j for (i, j) in df.iteritems()]
df_dict = dict(zip(col_names, col_vals))
families = np.unique([i.split("_")[2] for i in df_dict.keys()])
I have paired each column name with its associated values and extracted the unique groups I want in the final output as "families". I am now seeking help in categorizing the data frame into len(families) lists, identical to the output example given above.
I hope my explanation has been clear, thank you for your time!
Let's keep track of the different families in a dictionary: the keys are the letters (the families) and the values are lists holding the columns from a certain family.
Since we know that each column name ends with the letter of its family, we can use that last character as the key in the dictionary.
from collections import defaultdict

families = defaultdict(list)
for col in df.columns:
    families[col[-1]].append(df[col])
Now for example, in families["a"], we have:
[0 26
1 13
2 11
3 35
4 43
5 45
6 46
Name: test_1_a, dtype: int64,
0 10
1 15
2 20
3 43
4 40
5 35
6 22
Name: test_2_a, dtype: int64,
0 35
1 48
2 38
3 13
4 3
5 10
6 25
Name: test_3_a, dtype: int64]
We can easily get a per-family dataframe with concat.
df_a = pd.concat(families["a"], axis=1)
Gets us:
test_1_a test_2_a test_3_a
0 26 10 35
1 13 15 48
2 11 20 38
3 35 43 13
4 43 40 3
5 45 35 10
6 46 22 25
To create a dictionary of dataframes, one per family:
dfs = {f"df_{fam}" : pd.concat(families[fam], axis=1) for fam in families.keys()}
Now, the dictionary dfs contains:
{'df_a': test_1_a test_2_a test_3_a
0 26 10 35
1 13 15 48
2 11 20 38
3 35 43 13
4 43 40 3
5 45 35 10
6 46 22 25,
'df_b': test_1_b test_2_b test_4_b
0 18 4 44
1 48 43 2
2 30 21 4
3 46 12 16
4 42 14 25
5 22 24 13
6 43 40 43,
'df_c': test_1_c test_2_c test_4_c
0 25 15 5
1 36 39 28
2 6 3 37
3 22 48 16
4 2 34 25
5 39 16 30
6 32 36 2}
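If you need the exact list-of-lists output from the question rather than dataframes, here is a short sketch building it from the same families dictionary (ordered by family letter):
# [[values of all "a" columns], [values of all "b" columns], ...]
output = [[col.tolist() for col in families[fam]] for fam in sorted(families)]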
What do you think of an approach like this? Use pd.wide_to_long to get a long dataframe with split columns: one with the whole classification like 1_a, one with only the number, one with the family, and one with their values.
df = (pd.wide_to_long(
          df.reset_index(), stubnames='test_', i='index', j='classification', suffix=r'\d_\w')
        .reset_index()
        .drop('index', axis=1)
        .rename(columns={'test_': 'values'}))
df[['number', 'family']] = df['classification'].str.split('_', expand=True)
df = df.reindex(columns=['classification', 'number', 'family', 'values'])
print(df)
classification number family values
0 1_a 1 a 29
1 1_a 1 a 46
2 1_a 1 a 2
3 1_a 1 a 6
4 1_a 1 a 16
.. ... ... ... ...
58 4_c 4 c 30
59 4_c 4 c 23
60 4_c 4 c 26
61 4_c 4 c 40
62 4_c 4 c 39
Easy to groupby or filter for more analysis.
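For example, a per-family aggregation over the long frame (a sketch):
# sum of all values in each family
df.groupby('family')['values'].sum()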
If you want to get dicts or lists of specific data, here are some examples:
filter1 = df.loc[df['classification']=='1_a',:]
filter2 = df.loc[df['number']=='2','values']
filter1.to_dict(orient='list')
Output:
{'classification': ['1_a', '1_a', '1_a', '1_a', '1_a', '1_a', '1_a'],
'number': ['1', '1', '1', '1', '1', '1', '1'],
'family': ['a', 'a', 'a', 'a', 'a', 'a', 'a'],
'values': [29, 46, 2, 6, 16, 12, 38]}
filter2.tolist()
Output:
[8, 2, 43, 9, 5, 30, 28, 26, 25, 49, 3, 1, 47, 44, 16, 9, 8, 15, 24, 36, 1]
Not sure I understand the question completely; is this what you have in mind:
dict(list(df.groupby(df.columns.str[-1], axis=1)))
{'a': test_1_a test_2_a test_3_a
0 20 36 14
1 4 7 16
2 28 13 28
3 3 40 9
4 38 41 5
5 34 47 18
6 49 25 46,
'b': test_1_b test_2_b test_4_b
0 35 10 44
1 46 14 23
2 26 11 36
3 17 27 4
4 13 16 42
5 20 38 9
6 41 22 18,
'c': test_1_c test_2_c test_4_c
0 22 2 26
1 42 24 3
2 15 16 41
3 7 11 16
4 40 37 47
5 38 7 33
6 39 22 24}
This groups the columns on the last letter in the column name.
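Note that in recent pandas versions groupby(axis=1) is deprecated; the same grouping can be done via the transpose, as in this sketch:
grouped = {fam: sub.T for fam, sub in df.T.groupby(df.columns.str[-1])}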
If this is not what you have in mind, kindly comment and maybe explain a bit better where I misunderstood your intent.
I have a pandas dataframe df1 that looks like this:
import pandas as pd
d = {'node1': [47, 24, 19, 77, 24, 19, 77, 24, 56, 92, 32, 77], 'node2': [24, 19, 77, 24, 19, 77, 24, 19, 92, 32, 77, 24], 'user': ['A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C', 'C']}
df1 = pd.DataFrame(data=d)
df1
node1 node2 user
47 24 A
24 19 A
19 77 A
77 24 A
24 19 A
19 77 B
77 24 B
24 19 B
56 92 C
92 32 C
32 77 C
77 24 C
And a second pandas dataframe df2 that looks like this:
d2 = {'way_id': [4, 3, 1, 8, 5, 2, 7, 9, 6, 10], 'source': [24, 19, 84, 47, 19, 16, 77, 56, 32, 92], 'target': [19, 43, 67, 24, 77, 29, 24, 92, 77, 32]}
df2 = pd.DataFrame(data=d2)
df2
way_id source target
4 24 19
3 19 43
1 84 67
8 47 24
5 19 77
2 16 29
7 77 24
9 56 92
6 32 77
10 92 32
In a new dataframe, I would like to count how often each value pair in the node1 and node2 columns of df1 occurs in the rows of the source and target columns of df2. The order is relevant, and the corresponding user should be added in a new column. That's why the desired output should be like this:
way_id source target count user
4 24 19 2 A
3 19 43 0 A
1 84 67 0 A
8 47 24 1 A
5 19 77 1 A
2 16 29 0 A
7 77 24 1 A
9 56 92 0 A
6 32 77 0 A
10 92 32 0 A
4 24 19 1 B
3 19 43 0 B
1 84 67 0 B
8 47 24 0 B
5 19 77 1 B
2 16 29 0 B
7 77 24 1 B
9 56 92 0 B
6 32 77 0 B
10 92 32 0 B
4 24 19 0 C
3 19 43 0 C
1 84 67 0 C
8 47 24 0 C
5 19 77 0 C
2 16 29 0 C
7 77 24 1 C
9 56 92 1 C
6 32 77 1 C
10 92 32 1 C
Since you don't care about the source/target order, you need to duplicate the data (once per orientation) and then merge:
(pd.concat([df1.rename(columns={'node1': 'source', 'node2': 'target'}),
            df1.rename(columns={'node2': 'source', 'node1': 'target'})])
   .merge(df2, on=['source', 'target'], how='outer')
   .groupby(['source', 'target', 'user'], as_index=False)['way_id'].count()
)
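If the source/target order does matter, as the question states, and the zero counts per user are needed, a possible sketch counts only direct-order matches and reindexes over the full user x way_id grid:
# count direct (node1, node2) -> (source, target) matches per user
counts = (df1.rename(columns={'node1': 'source', 'node2': 'target'})
             .merge(df2, on=['source', 'target'])
             .groupby(['user', 'way_id']).size())
# expand to every (user, way_id) combination, filling missing counts with 0
grid = pd.MultiIndex.from_product([df1['user'].unique(), df2['way_id']],
                                  names=['user', 'way_id'])
result = (counts.reindex(grid, fill_value=0)
                .reset_index(name='count')
                .merge(df2, on='way_id'))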
I want to find the max difference between two sequential occurrences of the same integer in an efficient way. I could try a loop, but my dataset is >100,000 rows, which would be incredibly cumbersome. Does anyone have any suggestions?
import numpy as np
import pandas as pd

data = np.random.randint(5, 30, size=100000)
df = pd.DataFrame(data, columns=['random_numbers'])
Example:
In my sample, the max difference between these sequential occurrences of 5 is 29 - 5 = 24.
df.loc[79:93].values
array([[ 5],
[17],
[ 7],
[15],
[25],
[23],
[24],
[22],
[21],
[29],
[25],
[28],
[13],
[19],
[ 5]])
You can try this:
g = df['random_numbers'].eq(5).cumsum()
df.groupby(g).max() - 5
Output with smaller data:
data = np.random.randint(5,30,size=30)
# array([28, 19, 29, 22, 10, 18, 13, 14, 25, 24, 21, 24, 10, 20, 20, 5, 23,
# 8, 29, 22, 24, 24, 24, 19, 12, 5, 6, 14, 5, 15])
df = pd.DataFrame(data, columns=['rand_nums'])
g = df['rand_nums'].eq(5).cumsum()
# Look at both df and g side by side:
# print(pd.concat([df, g], axis=1))  # just for explanation
rand_nums rand_nums
0 28 0 ⟶ group 1 starts here
1 19 0
2 29 0
3 22 0
4 10 0
5 18 0
6 13 0
7 14 0 # we take max from here i.e. 29.
8 25 0
9 24 0
10 21 0
11 24 0
12 10 0
13 20 0
14 20 0 ⟶ group1 ends here
15 5 1 ⟶ group2 starts here
16 23 1
17 8 1
18 29 1
19 22 1
20 24 1 # take max from here i.e 29
21 24 1
22 24 1
23 19 1
24 12 1 ⟶ group2 ends here.
25 5 2 ⟶ grp 3 starts here.
26 6 2 # take max from here i.e. 14
27 14 2 ⟶ grp 3 ends here.
28 5 3 ⟶ grp4 starts here. # take max from here i.e. 15
29 15 3 ⟶ grp4 ends here.
That gives us:
df.groupby(g).max() - 5
rand_nums
rand_nums
0 24
1 24
2 9
3 10
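To collapse this to the single largest gap the question asks for, take the overall max (a sketch; group 0 covers the rows before the first 5, so it is dropped):
gaps = df.groupby(g)['rand_nums'].max() - 5
overall_max = gaps.iloc[1:].max()  # drop group 0: rows before the first 5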
For the example slice between the two 5s, the gap is just the spread of the slice:
df.loc[79:93].max() - df.loc[79:93].min()
EDIT:
idx = df.index[df['random_numbers'] == 5]  # change 5 to your number
max_displ = []
for i in range(len(idx) - 1):
    segment = df.loc[idx[i]:idx[i + 1], 'random_numbers']
    max_displ.append(segment.max() - segment.min())
Using a list comprehension:
idx = df.index[df['random_numbers'] == 5]  # change 5 to your number
max_displ = [df.loc[idx[i]:idx[i + 1], 'random_numbers'].max()
             - df.loc[idx[i]:idx[i + 1], 'random_numbers'].min()
             for i in range(len(idx) - 1)]
I was trying to compute a weighted average and I came across a doubt:
Problem
I wanted to create a new column named answer that holds, for each line, the average of the demands weighted by a list of values named month. If I used df.mean() I would get a simple average per month, and that is not what I want. The idea is to give more importance to demand at the end of the year and less importance to demand at the beginning of the year; that's why I would like to use a weighted average.
In Excel I would use the formula below. I'm having trouble converting this calculation to a pandas data frame.
=SUMPRODUCT( demands[#[1]:[12]] ; month )/SUM(month)
I couldn't find a solution to this problem and I would really appreciate help with it. Thank you in advance.
Here's a dummy dataframe that serves as an example:
Example Code
import numpy as np
import pandas as pd

demand = pd.DataFrame({'1': [360, 40, 100, 20, 55],
'2': [500, 180, 450, 60, 50],
'3': [64, 30, 60, 10, 0],
'4': [50, 40, 30, 60, 50],
'5': [40, 24, 45, 34, 60],
'6': [30, 34, 65, 80, 78],
'7': [56, 45, 34, 90, 58],
'8': [32, 12, 45, 55, 66],
'9': [32, 56, 89, 67, 56],
'10': [57, 35, 75, 48, 9],
'11': [56, 33, 11, 6, 78],
'12': [23, 65, 34, 8, 67]
})
months = list(range(1, 13))
[Image: visualization of the problem]
Just use numpy.average, specifying weights:
demand["result"]=np.average(demand, weights=months, axis=1)
https://docs.scipy.org/doc/numpy-1.15.1/reference/generated/numpy.average.html
Outputs:
1 2 3 4 5 6 ... 8 9 10 11 12 result
0 360 500 64 50 40 30 ... 32 32 57 56 23 58.076923
1 40 180 30 40 24 34 ... 12 56 35 33 65 43.358974
2 100 450 60 30 45 65 ... 45 89 75 11 34 58.884615
3 20 60 10 60 34 80 ... 55 67 48 6 8 43.269231
4 55 50 0 50 60 78 ... 66 56 9 78 67 55.294872
This can be done by the following:
demand['result'] = (demand * months).sum(axis=1)/sum(months)
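Equivalently, since the weighted sum is just a dot product, a NumPy sketch (assuming the columns are in month order):
demand['result'] = demand[[str(m) for m in months]].to_numpy() @ np.array(months) / sum(months)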
You can try this code:
den = np.sum(months)
demand['average'] = (demand['1'].mul(1/den).add(demand['2'].mul(2/den))
                     .add(demand['3'].mul(3/den)).add(demand['4'].mul(4/den))
                     .add(demand['5'].mul(5/den)).add(demand['6'].mul(6/den))
                     .add(demand['7'].mul(7/den)).add(demand['8'].mul(8/den))
                     .add(demand['9'].mul(9/den)).add(demand['10'].mul(10/den))
                     .add(demand['11'].mul(11/den)).add(demand['12'].mul(12/den)))
The Output:
1 2 3 4 5 6 7 8 9 10 11 12 average
0 360 500 64 50 40 30 56 32 32 57 56 23 58.076923
1 40 180 30 40 24 34 45 12 56 35 33 65 43.358974
2 100 450 60 30 45 65 34 45 89 75 11 34 58.884615
3 20 60 10 60 34 80 90 55 67 48 6 8 43.269231
4 55 50 0 50 60 78 58 66 56 9 78 67 55.294872
I have a dataframe df as below:
import numpy as np
import pandas as pd

df = pd.DataFrame({
'A': [20,30,40,-50,60,-70 ],
'B': [21, -19, 20, 18, 17, -21],
'C': [1,12,-13,14,15,16],
'D': [-88, 92, 9, 70, -6, 78]})
I want every value in columns ['C','D'] to be zero where the value is between -10 and 10; the rest of the values should remain the same.
Is there something similar to pd.Series.between that can be applied to a whole data frame?
df[df[['C','D']].between(-10, 10, inclusive=True)] = 0
output should be :
A B C D
0 20 21 0 -88
1 30 -19 12 92
2 40 20 -13 0
3 -50 18 14 70
4 60 17 15 0
5 -70 -21 16 78
You can use df.mask() here after comparing with df.ge and df.le:
df[['C','D']] = df[['C','D']].mask(df[['C','D']].ge(-10) & df[['C','D']].le(10), 0)
Or np.where():
df[['C','D']] = np.where(df[['C','D']].ge(-10) & df[['C','D']].le(10), 0, df[['C','D']])
A B C D
0 20 21 0 -88
1 30 -19 12 92
2 40 20 -13 0
3 -50 18 14 70
4 60 17 15 0
5 -70 -21 16 78
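Since the interval is symmetric around zero, a shorter equivalent is possible (a sketch):
df[['C','D']] = df[['C','D']].mask(df[['C','D']].abs().le(10), 0)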