I need to compute a statistic on some data: how many times a value "j" appears next to a value "i". The code below is a gross simplification of what I need to do, but it contains the problem I have.
Let's say that you have this data frame.
import numpy as np
import pandas as pd
a_df=pd.DataFrame({"a_col":np.random.randint(10, size=1000), "b_col":np.random.randint(10, size=1000)})
I generate a matrix that will contain our statistics:
res_matrix=np.zeros((10, 10))
By looking at res_matrix[i][j], we will know how many times the number "j" was next to the number "i" in our data frame.
I know that "for loops" are bad in pandas, but again, this is a simplification.
I generate a sub-table for each value "i", and on this table I run value_counts() on the column "b_col":
for i in a_df["a_col"].unique():
    temp_df = a_df[a_df["a_col"] == i]
    table_count = temp_df["b_col"].value_counts()
    for val, cnt in table_count.iteritems():
        res_matrix[i][val] += int(cnt)
Is there an efficient way to populate res_matrix without changing the topmost for loop?
I am thinking of something like a list comprehension, but I cannot wrap my mind around it.
Please focus ONLY on these two lines:
for val, cnt in table_count.iteritems():
    res_matrix[i][val] += int(cnt)
I can't use groupby because my project requires many more operations on the dataframe.
There's a function crosstab in pandas that does just that:
pd.crosstab(a_df['a_col'], a_df['b_col'])
Output:
b_col 0 1 2 3 4 5 6 7 8 9
a_col
0 10 10 10 12 14 9 10 5 13 16
1 16 9 13 14 14 8 4 11 9 12
2 10 8 12 13 9 12 13 7 10 5
3 11 7 10 17 6 9 6 8 7 14
4 9 8 4 5 7 13 12 8 11 6
5 14 9 8 15 6 10 12 9 7 9
6 11 13 10 9 7 5 8 11 13 21
7 8 9 11 8 8 10 11 15 10 12
8 6 17 11 4 12 9 6 10 10 13
9 12 6 14 3 11 11 7 5 14 14
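If you also need the counts in the question's 10x10 NumPy matrix, a minimal sketch (assuming the possible values are exactly 0-9, as in the question) is to reindex the crosstab so values that never occur still get zero rows and columns:
ct = pd.crosstab(a_df['a_col'], a_df['b_col'])
# make sure every value 0..9 has a row and a column, filling gaps with 0
res_matrix = ct.reindex(index=range(10), columns=range(10), fill_value=0).to_numpy()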
Update: if the outer loop must remain for other reasons, you can set values in res_matrix inside the loop:
res_matrix = np.zeros((10, 10))
for i in a_df["a_col"].unique():
    temp_df = a_df[a_df["a_col"] == i]
    table_count = temp_df["b_col"].value_counts()
    # set values in res_matrix
    res_matrix[i, table_count.index] = table_count
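Note that this assignment overwrites row i rather than accumulating into it, which is fine when res_matrix starts at zero. If the counts must accumulate (as the original += did), a sketch using NumPy's unbuffered in-place add:
# accumulate the counts instead of overwriting them
np.add.at(res_matrix, (i, table_count.index.to_numpy()), table_count.to_numpy())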
Don't loop; it is slow. If you think there is a good reason to loop, please explain it and provide an appropriate example.
Here is another method.
You can groupby both columns and get the group size, then unstack to get a 2D shape:
a_df.groupby(['a_col', 'b_col']).size().unstack()
Output:
b_col 0 1 2 3 4 5 6 7 8 9
a_col
0 16 2 4 11 9 13 11 11 8 6
1 10 12 7 6 6 11 10 8 2 12
2 9 12 10 22 12 13 8 11 9 8
3 13 11 11 14 7 11 9 7 8 14
4 14 7 17 5 8 6 15 8 11 8
5 10 12 7 14 6 16 11 12 6 8
6 13 10 9 12 11 14 8 10 6 8
7 9 12 12 9 11 9 8 14 5 12
8 7 8 9 8 10 14 9 8 8 18
9 13 6 13 11 13 11 8 7 11 11
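One caveat: (a_col, b_col) combinations that never occur come out as NaN after unstack. A sketch that produces the dense 10x10 integer matrix from the question (again assuming the values are 0-9):
res_matrix = (a_df.groupby(['a_col', 'b_col']).size()
                  .unstack(fill_value=0)
                  .reindex(index=range(10), columns=range(10), fill_value=0)
                  .to_numpy())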
Related
I need to duplicate each row 3 times and add two new columns. The new column values are different for each row.
import pandas as pd
df = {'A': [ 8,9,12],
'B': [ 1,11,3],
'C': [ 7,9,13],
'D': [81,92,121]}
df = pd.DataFrame(df)
#####################################################
#input
A B C D
8 1 7 81
9 11 9 92
12 3 13 121
####################################################
#expected output
A B C D E F
8 1 7 81 9 8   E=A+1, F=C+1
8 1 7 81 8 7   E=A,   F=C
8 1 7 81 7 6   E=A-1, F=C-1
9 11 9 92 10 10
9 11 9 92 9 9
9 11 9 92 8 8
12 3 13 121 13 14
12 3 13 121 12 13
12 3 13 121 11 12
To repeat the DataFrame, you can use np.repeat().
Afterwards, you can create a list of offsets to add to "A" and "C".
import numpy as np

# repeat every row 3 times
df = pd.DataFrame(np.repeat(df.to_numpy(), 3, axis=0), columns=df.columns)
# one offset per repeated row: [1, 0, -1] for each of the 3 original rows
extra = [1, 0, -1] * 3
df['E'] = df['A'] + extra
df['F'] = df['C'] + extra
This gives:
A B C D E F
0 8 1 7 81 9 8
1 8 1 7 81 8 7
2 8 1 7 81 7 6
3 9 11 9 92 10 10
4 9 11 9 92 9 9
5 9 11 9 92 8 8
6 12 3 13 121 13 14
7 12 3 13 121 12 13
8 12 3 13 121 11 12
Use Index.repeat with DataFrame.loc to repeat the rows, then tile the integers [1, 0, -1] with numpy.tile and create the new columns E and F:
import numpy as np

df1 = df.loc[df.index.repeat(3)]
g = np.tile([1, 0, -1], len(df))
df1[['E', 'F']] = df1[['A', 'C']].add(g, axis=0).to_numpy()
df1 = df1.reset_index(drop=True)
print(df1)
A B C D E F
0 8 1 7 81 9 8
1 8 1 7 81 8 7
2 8 1 7 81 7 6
3 9 11 9 92 10 10
4 9 11 9 92 9 9
5 9 11 9 92 8 8
6 12 3 13 121 13 14
7 12 3 13 121 12 13
8 12 3 13 121 11 12
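For what it's worth, here is a sketch with the offsets factored out into one variable (the name offsets is mine, not from the answers), so a different pattern or length adapts the repeat and the tile automatically:
import numpy as np

offsets = [1, 0, -1]
out = df.loc[df.index.repeat(len(offsets))].reset_index(drop=True)
shift = np.tile(offsets, len(df))  # one offset per repeated row
out['E'] = out['A'] + shift
out['F'] = out['C'] + shift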
By default, when there are too many rows to display, pandas shows the top and bottom 5 rows of a dataframe in Jupyter:
>>> df.shape
(100, 4)
    col0  col1  col2  col3
0      7    17    15     2
1      6     5     5    12
2     10    15     5    15
3      6    19    19    14
4     12     7     4    12
..   ...   ...   ...   ...
95     2    14     8    16
96     8     8     5    16
97     6     8     9     1
98     1     5    10    15
99    15     9     1    18
I know that this setting exists:
pd.set_option("display.max_rows", 20)
however, that yields the same result. Using df.head(10) and df.tail(10) in two consecutive cells is an option, but less clean. The same goes for concatenation. Is there another pandas setting like display.max_rows for this default view? How can I expand it to, let's say, the top and bottom 10?
IIUC, use display.min_rows; when a frame has more rows than display.max_rows, pandas truncates the display to display.min_rows rows:
pd.set_option("display.min_rows", 20)
print(df)
# Output:
0 1 2 3
0 18 8 12 2
1 2 13 13 14
2 8 7 9 2
3 17 19 9 3
4 14 18 12 3
5 11 5 9 18
6 4 5 12 3
7 12 8 2 7
8 11 2 14 13
9 6 6 3 6
.. .. .. .. ..
90 8 2 1 9
91 7 19 4 6
92 4 3 17 12
93 19 6 5 18
94 3 5 15 5
95 16 3 13 13
96 11 3 18 8
97 1 9 18 4
98 13 10 18 15
99 16 3 5 9
[100 rows x 4 columns]
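If you only want the longer view occasionally, a sketch using a context manager keeps the global defaults untouched:
# temporarily show the top and bottom 10 rows for this one display
with pd.option_context("display.min_rows", 20, "display.max_rows", 20):
    print(df)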
Hello there, I was coding in pandas when I found this problem:
for label, content in data_temp.items():
    print(len(label))  # as we can see, here it prints
print(len(data_temp.columns))
Firstly, I was trying to print the label, which is the name of the column, right? It outputs these different numbers:
7
9
9
7
10
12
8
24
9
11
11
15
13
17
11
18
5
12
16
12
9
5
8
12
5
12
12
15
11
14
17
10
9
6
9
11
9
7
14
14
15
10
23
12
5
15
12
16
10
15
17
17
8
9
7
7
22
34
And when I print len(data_temp.columns), it outputs:
58
Why does data_temp.columns give me a different number from the labels in the for loop over data_temp.items()? Aren't the labels of the for loop the entries of data_temp.columns?
You are printing the length of each label, not the labels themselves.
Try print(label) and print(data_temp.columns); that should output the labels one by one in the for loop, and then the column names as a list.
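A minimal sketch of that suggestion:
for label, content in data_temp.items():
    print(label)                  # the column name itself, one per iteration
print(list(data_temp.columns))    # all column names at once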
Here are my three time series:
t1 t2 t3
3 8 17
1 8 18
0 8 17
0 8 18
2 8 17
3 8 17
0 8 18
0 8 17
2 8 17
3 8 18
1 8 17
0 8 17
0 8 17
1 8 17
2 8 16
2 8 16
3 8 16
0 8 16
2 8 16
2 8 16
3 8 16
1 8 17
1 8 16
2 8 16
3 8 16
1 8 17
2 8 16
4 8 17
0 8 16
1 8 17
3 8 16
0 8 16
3 8 16
2 8 16
2 8 17
0 8 16
2 8 16
2 8 17
3 8 16
3 8 16
3 8 16
2 8 16
4 8 16
1 8 16
0 8 17
0 8 17
2 8 17
1 8 17
2 8 17
2 8 18
0 8 18
1 8 18
0 8 17
0 8 17
2 8 17
1 8 17
2 8 17
0 8 17
0 8 17
0 8 17
As I have seen, DTW can give us an output that tells us the similarity between time series, but I don't know how we can do that.
How can we read similarity from the output of DTW?
Which distance is good: high or low?
Help me to solve this problem.
Thanks
Using DTW via the dtaidistance package:
import pandas as pd
from io import StringIO
from dtaidistance import dtw
data = StringIO("""
t1 t2 t3
3 8 17
1 8 18
. . .
. . .
0 8 17
0 8 17
""")
# load data into data frame
df = pd.read_csv(data, sep=' ', engine='python', dtype=float)
# transpose data
transposed_matrix = df.values.transpose()
# calculate series cost
results = dtw.distance_matrix_fast(transposed_matrix, compact=True)
Output:
Cost results for comparing the 3 time series; a lower cost (distance) means the two series are more similar.
[ 51.4392846 118.73078792 67.71262807]
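For reference, assuming the usual condensed-distance ordering (an assumption, since it is not shown in the output itself), the three values are the pairwise distances (t1, t2), (t1, t3), (t2, t3), so t1 and t2 would be the most similar pair here. A sketch to expand them into a full square matrix, assuming SciPy is available:
from scipy.spatial.distance import squareform

# symmetric 3x3 matrix; entry [i, j] is the DTW distance between series i and j
dist_square = squareform(results)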
My question is related to pivot table and merging.
I have a main dataframe that I use to create a pivot table. Later, I perform some calculations on that pivot and add a new column. Finally, I want to merge this new column back into the main dataframe, but I am not getting the desired result.
The steps I performed are as follows:
Step 1.
df:
items cat section weight factor
0 1 7 abc 3 80
1 1 7 abc 3 80
2 2 7 xyz 5 60
3 2 7 xyz 5 60
4 2 7 xyz 5 60
5 2 7 xyz 5 60
6 3 7 abc 3 80
7 3 7 abc 3 80
8 3 7 abc 3 80
9 1 8 abc 2 80
10 1 8 abc 2 60
11 2 8 xyz 6 60
12 2 8 xyz 6 60
12 2 8 xyz 6 60
13 2 8 xyz 6 60
14 3 8 abc 2 80
15 1 9 abc 4 80
16 2 9 xyz 9 60
17 2 9 xyz 9 60
18 3 9 abc 4 80
The main dataframe (df) holds a number of items, each identified by a number, and each item belongs to a dedicated section. Each item is given a weight that varies based on category (cat) and section. In addition, there is another column named 'factor' whose value is constant for a given section.
Step 2.
I need to create a pivot from the above df as follows:
pivot = df.pivot_table(index=['section'], values=['weight', 'factor', 'items'], columns=['cat'], aggfunc={'weight': np.max, 'factor': np.max, 'items': np.sum})
pivot:
weight factor items
cat 7 8 9 7 8 9 7 8 9
section
abc 3 2 4 80 80 80 5 3 2
xyz 5 6 9 60 60 60 4 4 2
Step 3:
Now I want to perform some calculations on that pivot and then add the results as new columns, as follows:
pivot['w_n',7] = pivot['factor', 7]/pivot['items', 7]
pivot['w_n',8] = pivot['factor', 8]/pivot['items', 8]
pivot['w_n',9] = pivot['factor', 9]/pivot['items', 9]
pivot:
weight factor items w_n
cat 7 8 9 7 8 9 7 8 9 7 8 9
section
abc 3 2 4 80 80 80 5 3 2 16 27 40
xyz 5 6 9 60 60 60 4 4 2 15 15 30
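As an aside (a sketch, not part of the original question), the three assignments in Step 3 can be written as one vectorized step, because dividing the 'factor' sub-frame by the 'items' sub-frame aligns on the cat level:
# divide sub-frames; the columns align on cat (7, 8, 9)
w_n = pivot['factor'] / pivot['items']
# re-attach under a 'w_n' top level so the MultiIndex columns stay consistent
pivot = pivot.join(pd.concat({'w_n': w_n}, axis=1))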
Step 4:
Finally, I want to merge that new column back into the main df,
with a desired result of a single 'w_n' column, but instead I am getting 3 columns, one for each cat.
Current result:
df:
items cat section weight factor w_n_7 w_n_8 w_n_9
0 1 7 abc 3 80 16 27 40
1 1 7 abc 3 80 16 27 40
2 2 7 xyz 5 60 15 15 30
3 2 7 xyz 5 60 15 15 30
4 2 7 xyz 5 60 15 15 30
5 2 7 xyz 5 60 15 15 30
6 3 7 abc 3 80 16 27 40
7 3 7 abc 3 80 16 27 40
8 3 7 abc 3 80 16 27 40
9 1 8 abc 2 80 16 27 40
10 1 8 abc 2 60 16 27 40
11 2 8 xyz 6 60 15 15 30
12 2 8 xyz 6 60 15 15 30
12 2 8 xyz 6 60 15 15 30
13 2 8 xyz 6 60 15 15 30
14 3 8 abc 2 80 16 27 40
15 1 9 abc 4 80 16 27 40
16 2 9 xyz 9 60 15 15 30
17 2 9 xyz 9 60 15 15 30
18 3 9 abc 4 80 16 27 40
Desired result:
------------------
df:
items cat section weight factor w_n
0 1 7 abc 3 80 16
1 1 7 abc 3 80 16
2 2 7 xyz 5 60 15
3 2 7 xyz 5 60 15
4 2 7 xyz 5 60 15
5 2 7 xyz 5 60 15
6 3 7 abc 3 80 16
7 3 7 abc 3 80 16
8 3 7 abc 3 80 16
9 1 8 abc 2 80 27
10 1 8 abc 2 60 27
11 2 8 xyz 6 60 15
12 2 8 xyz 6 60 15
12 2 8 xyz 6 60 15
13 2 8 xyz 6 60 15
14 3 8 abc 2 80 27
15 1 9 abc 4 80 40
16 2 9 xyz 9 60 30
17 2 9 xyz 9 60 30
18 3 9 abc 4 80 40
Use DataFrame.join with a MultiIndex Series created by Series.unstack:
df = df.join(pivot['w_n'].unstack().rename('W_n'), on=['cat','section'])
print(df)
items cat section weight factor W_n
0 1 7 abc 3 80 7.272727
1 1 7 abc 3 80 7.272727
2 2 7 xyz 5 60 7.500000
3 2 7 xyz 5 60 7.500000
4 2 7 xyz 5 60 7.500000
5 2 7 xyz 5 60 7.500000
6 3 7 abc 3 80 7.272727
7 3 7 abc 3 80 7.272727
8 3 7 abc 3 80 7.272727
9 1 8 abc 2 80 16.000000
10 1 8 abc 2 60 16.000000
11 2 8 xyz 6 60 7.500000
12 2 8 xyz 6 60 7.500000
12 2 8 xyz 6 60 7.500000
13 2 8 xyz 6 60 7.500000
14 3 8 abc 2 80 16.000000
15 1 9 abc 4 80 20.000000
16 2 9 xyz 9 60 15.000000
17 2 9 xyz 9 60 15.000000
18 3 9 abc 4 80 20.000000
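As a cross-check (a sketch, not from the original answer), the same per-(cat, section) value can be computed directly on df with groupby.transform, assuming w_n = max(factor) / sum(items) as in the pivot's aggfunc, which avoids building the pivot at all:
g = df.groupby(['cat', 'section'])
# max(factor) and sum(items) broadcast back to every row of their group
df['W_n'] = g['factor'].transform('max') / g['items'].transform('sum')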