Hello there, I was coding in pandas when I ran into this problem:
for label, content in data_temp.items():
    print(len(label))  # As we can see, this prints
print(len(data_temp.columns))
First, I was trying to print the label, which is the column's name, right? It outputs these different numbers:
7
9
9
7
10
12
8
24
9
11
11
15
13
17
11
18
5
12
16
12
9
5
8
12
5
12
12
15
11
14
17
10
9
6
9
11
9
7
14
14
15
10
23
12
5
15
12
16
10
15
17
17
8
9
7
7
22
34
And when I print len(data_temp.columns), it outputs:
58
Why does data_temp.columns give me a different number from the labels in the for loop over data_temp.items()? Aren't the labels of the for loop the entries of data_temp.columns?
You are printing the length of the labels, not the labels themselves.
Try print(label) and print(data_temp.columns); that should output the labels one by one in the loop, and then the column names as a list.
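A minimal sketch of the difference, using a made-up stand-in for data_temp (the real frame has 58 columns):

```python
import pandas as pd

# Hypothetical stand-in for the question's data_temp
data_temp = pd.DataFrame({"revenue": [1, 2], "cost": [3, 4]})

for label, content in data_temp.items():
    print(len(label))  # length of the label *string*: 7, then 4
    print(label)       # the label itself: "revenue", then "cost"

print(len(data_temp.columns))  # number of columns: 2
```

So the loop prints one number per column (each label's string length), while len(data_temp.columns) prints the column count once.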
Hi, I am trying to solve a TSP problem using PyGAD. I did get a result, but the result contains duplicated numbers.
I passed this as initial_population:
[[ 1 12 26 19 22 20 6 15 17 23 21 7 28 5 13 14 16 2 24 4 3 10 9 8 18 25 27 11]
[ 2 17 23 27 22 12 20 21 24 25 13 5 10 4 9 26 7 1 11 3 15 18 16 14 8 19 28 6]
[ 3 23 12 7 2 11 15 13 19 26 21 14 9 5 24 20 25 1 8 16 22 28 27 10 4 6 18 17]
[ 4 19 2 25 21 13 9 8 18 28 7 27 20 11 23 22 14 1 10 16 12 5 26 24 17 3 15 6]
[ 5 9 19 7 22 10 11 13 1 25 6 17 8 12 2 24 28 20 26 4 15 14 18 23 21 27 3 16]]
but the result was:
[ 6 8 13 1 19 10 6 23 18 22 5 3 *21* 11 6 16 28 1 4 10 6 25 7 22 5 3 *21* 11]
You can see here that some values are duplicated.
This is part of my code:
import pygad
import numpy as np
import copy

def fitness_func(solution, solution_idx):
    distance_treshold = np.load('distance.npy')
    # distance_simple is a helper defined elsewhere in the project
    function_inputs = distance_simple(distance_treshold)
    a1 = treshold(function_inputs)
    f = 0
    for i in range(len(solution)):
        if i == 0:
            f += distance_treshold[solution[0]][solution[i+1]]
        else:
            try:
                f += distance_treshold[solution[i]][solution[i+1]]
            except IndexError:
                # last city: wrap around to the start of the tour
                f += distance_treshold[solution[i]][solution[0]]
    fitness_score = pow(a1 / f, 2)  # fitness
    return fitness_score

def treshold(solution):
    distance_treshold = np.load('distance.npy')
    f = 0
    for i in range(len(solution)):
        if i == 0:
            f += distance_treshold[solution[0]][solution[i+1]]
        else:
            try:
                f += distance_treshold[solution[i]][solution[i+1]]
            except IndexError:
                f += distance_treshold[solution[i]][solution[0]]
    return f

distance_treshold = np.load('distance.npy')
function_inputs = distance_simple(distance_treshold)
a1 = treshold(function_inputs)
print(a1)

#print(initial_pop)
initial_population = np.load('inital_generation.npy')
print(initial_population)

num_parents_mating = 2
num_generations = 30
parent_selection_type = 'sus'
mutation_type = "swap"
keep_parents = 0
mutation_num_genes = 1
mutation_percent_genes = 3
crossover_type = "single_point"
allow_duplicate_genes = False
gene_type = int
mutation_probability = 0.03

print("GA start")
ga_instance = pygad.GA(num_generations=num_generations,
                       mutation_probability=mutation_probability,
                       parent_selection_type=parent_selection_type,
                       initial_population=initial_population,
                       num_parents_mating=num_parents_mating,
                       fitness_func=fitness_func,
                       gene_type=gene_type,
                       mutation_percent_genes=mutation_percent_genes,
                       mutation_num_genes=mutation_num_genes,
                       mutation_type=mutation_type,
                       allow_duplicate_genes=False)
ga_instance.run()
ga_instance.plot_fitness()
best_solution, best_solution_fitness, best_match_idx = ga_instance.best_solution()
print(best_solution)
fitness_func(best_solution, 0)
print(best_solution_fitness)
I also saw "How to solve TSP problem using pyGAD package?", so I tried allow_duplicate_genes=False, but it doesn't work. I also tried passing initial_population as a NumPy array, but it still doesn't work.
Thank you for your help, it helps me a lot.
Thanks for using PyGAD :)
For the allow_duplicate_genes parameter to work and prevent duplicate genes, the number of distinct gene values must be greater than or equal to the number of genes. Let me explain further.
Assume that the gene space is set to [0.4, 7, 9, 2.3] (with only 4 values) and there are 5 genes. In this case, it is impossible to prevent duplicates because at least 2 genes will share the same value. To solve this issue, you have to add other values to the gene space so that the number of values in the space is >= the number of genes (5 in this case).
To solve your issue, you can use the gene_space parameter and give it enough values to prevent duplicates. This is already done in the question you mentioned.
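A minimal sketch of that constraint for the 28-city tour above; the check itself is plain Python, and in the real script you would pass gene_space to pygad.GA alongside the other parameters:

```python
num_genes = 28  # one gene per city in the tour above

# Give the GA one distinct value per gene, so duplicate-free
# chromosomes (i.e. valid permutations of 1..28) are possible.
gene_space = list(range(1, num_genes + 1))

# allow_duplicate_genes=False can only be honoured when the space
# has at least as many distinct values as there are genes.
assert len(set(gene_space)) >= num_genes

print(gene_space[:5])  # first few candidate gene values
```

With such a gene_space, a call like pygad.GA(..., gene_space=gene_space, allow_duplicate_genes=False, ...) has enough distinct values to keep every chromosome duplicate-free.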
Here is my DataFrame. This is a representation of an 8-hour day, and the many different combinations of schedules. The time is in 24hr time.
Input:
solutions = problem.getSolutions()
pd.options.display.max_columns = None
df = pd.DataFrame(solutions)
Output:
WorkHr1 WorkHr2 WorkHr3 WorkHr4 WorkOut Lunch FreeHour Cleaning
0 13 14 15 16 11 10 9 12
1 13 14 15 16 11 10 12 9
2 13 14 15 16 11 12 10 9
3 13 14 15 16 11 12 9 10
4 13 14 15 16 12 11 10 9
.. ... ... ... ... ... ... ... ...
I can create a series using:
series1 = pd.Series(solutions[0])
print(series1)
And I get this output:
WorkHr1 13
WorkHr2 14
WorkHr3 15
WorkHr4 16
WorkOut 11
Lunch 10
FreeHour 9
Cleaning 12
How can I switch the columns of this series so that the time is first?
Also, is there any possible way to display the rows in order of time? Like this:
9 FreeHour
10 Lunch
11 WorkOut
12 Cleaning
13 WorkHr1
14 WorkHr2
15 WorkHr3
16 WorkHr4
You can reverse it by passing its index as data and data as index to a Series constructor:
out = pd.Series(series1.index, index=series1).sort_index()
Output:
9 FreeHour
10 Lunch
11 WorkOut
12 Cleaning
13 WorkHr1
14 WorkHr2
15 WorkHr3
16 WorkHr4
dtype: object
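A self-contained sketch with the schedule from the question (the values are copied from the printed series):

```python
import pandas as pd

series1 = pd.Series(
    {"WorkHr1": 13, "WorkHr2": 14, "WorkHr3": 15, "WorkHr4": 16,
     "WorkOut": 11, "Lunch": 10, "FreeHour": 9, "Cleaning": 12}
)

# Swap: old values become the index, the old index becomes the data,
# then sort by the (time) index.
out = pd.Series(series1.index, index=series1).sort_index()
print(out)
```

The sort then puts the activities in chronological order, 9 through 16.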
I need to run a statistic on some data: see how many times a value "j" is next to a value "i". The code I put hereafter is a gross simplification of what I need to do, but it contains the problem I have.
Let's say that you have this data frame.
import numpy as np
import pandas as pd
a_df=pd.DataFrame({"a_col":np.random.randint(10, size=1000), "b_col":np.random.randint(10, size=1000)})
I generate a matrix that will contain our statistics:
res_matrix=np.zeros((10, 10))
By looking at res_matrix[i][j] we will know how many times the number "j" was next to the number "i" in our data frame.
I know that "for loops" are bad in pandas, but again, this is a simplification.
I generate a sub-table for the value "i", and on this table I run value_counts() on the column "b_col":
for i in a_df["a_col"].unique():
    temp_df = a_df[a_df["a_col"] == i]
    table_count = temp_df["b_col"].value_counts()
    for val, cnt in table_count.items():  # iteritems() was removed in pandas 2.0
        res_matrix[i][val] += int(cnt)
Is there an efficient way to populate res_matrix without changing the topmost for loop?
I am thinking of something like a list comprehension, but I cannot wrap my mind around it.
Please, focus ONLY on these two lines:
for val, cnt in table_count.items():
    res_matrix[i][val] += int(cnt)
I can't use groupby because my project requires many more operations on the dataframe.
There's a function crosstab in pandas that does just that:
pd.crosstab(a_df['a_col'], a_df['b_col'])
Output:
b_col 0 1 2 3 4 5 6 7 8 9
a_col
0 10 10 10 12 14 9 10 5 13 16
1 16 9 13 14 14 8 4 11 9 12
2 10 8 12 13 9 12 13 7 10 5
3 11 7 10 17 6 9 6 8 7 14
4 9 8 4 5 7 13 12 8 11 6
5 14 9 8 15 6 10 12 9 7 9
6 11 13 10 9 7 5 8 11 13 21
7 8 9 11 8 8 10 11 15 10 12
8 6 17 11 4 12 9 6 10 10 13
9 12 6 14 3 11 11 7 5 14 14
Update: if the outer loop must remain for other reasons, you can set values in res_matrix inside the loop:
res_matrix = np.zeros((10, 10))
for i in a_df["a_col"].unique():
    temp_df = a_df[a_df["a_col"] == i]
    table_count = temp_df["b_col"].value_counts()
    # set values in res_matrix without the inner loop
    res_matrix[i, table_count.index] = table_count
Don't loop, this is slow. If you think there is a good reason to loop, please explain it and provide an appropriate example.
Here is another method.
You can groupby both columns and get the group size, then unstack to get a 2D shape:
a_df.groupby(['a_col', 'b_col']).size().unstack()
output:
b_col 0 1 2 3 4 5 6 7 8 9
a_col
0 16 2 4 11 9 13 11 11 8 6
1 10 12 7 6 6 11 10 8 2 12
2 9 12 10 22 12 13 8 11 9 8
3 13 11 11 14 7 11 9 7 8 14
4 14 7 17 5 8 6 15 8 11 8
5 10 12 7 14 6 16 11 12 6 8
6 13 10 9 12 11 14 8 10 6 8
7 9 12 12 9 11 9 8 14 5 12
8 7 8 9 8 10 14 9 8 8 18
9 13 6 13 11 13 11 8 7 11 11
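If the goal is ultimately the res_matrix array from the question, either aggregation can be dropped back to NumPy; a small sketch, with data generated randomly as in the question:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a_df = pd.DataFrame({"a_col": rng.integers(10, size=1000),
                     "b_col": rng.integers(10, size=1000)})

# Cross-tabulate, make sure all 10 labels appear on both axes even if
# some value never occurs, then convert to a plain NumPy array.
counts = pd.crosstab(a_df["a_col"], a_df["b_col"])
res_matrix = counts.reindex(index=range(10), columns=range(10),
                            fill_value=0).to_numpy()

print(res_matrix.shape)  # (10, 10)
print(res_matrix.sum())  # 1000, one count per row of a_df
```

The reindex step only matters when a label is missing from the sample; with 1000 rows it is usually a no-op, but it keeps the matrix shape stable.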
I have a dataframe df as shown:
1-1 1-2 1-3 2-1 2-2 3-1 3-2 4-1 5-1
10 3 9 1 3 9 33 10 11
21 31 3 22 21 13 11 7 13
33 22 61 31 35 34 8 10 16
6 9 32 5 4 8 9 6 8
The explanation of the columns is as follows: the first digit is the group number and the second is its part, or subgroup. In this example we have groups 1, 2, 3, 4, 5, and group 1 consists of 1-1, 1-2, 1-3.
I would like to create a new dataframe that has only the groups 1, 2, 3, 4, 5 without subgroups, taking for each row the max value among the group's subgroups, and that stays flexible to any new modifications or additional groups or subgroups.
The new dataframe I need looks like this:
1 2 3 4 5
10 3 33 10 11
31 22 13 7 13
61 35 34 10 16
32 5 9 6 8
You can aggregate by columns with axis=1 and DataFrame.groupby, using a lambda function to split each name and select its first part, then take max:
This works correctly even if group numbers contain 2 or more digits.
df1 = df.groupby(lambda x: x.split('-')[0], axis=1).max()
Alternative is pass splitted columns names:
df1 = df.groupby(df.columns.str.split('-').str[0], axis=1).max()
print (df1)
1 2 3 4 5
0 10 3 33 10 11
1 31 22 13 7 13
2 61 35 34 10 16
3 32 5 9 6 8
You can use .str[] or .str.get here (note this takes only the first character, so it assumes single-digit group numbers):
df.groupby(df.columns.str[0], axis=1).max()
1 2 3 4 5
0 10 3 33 10 11
1 31 22 13 7 13
2 61 35 34 10 16
3 32 5 9 6 8
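A self-contained sketch of the split-based approach with the question's data; since groupby(axis=1) is deprecated in recent pandas, this version groups the transposed frame instead, which gives the same result:

```python
import pandas as pd

df = pd.DataFrame({"1-1": [10, 21, 33, 6], "1-2": [3, 31, 22, 9],
                   "1-3": [9, 3, 61, 32], "2-1": [1, 22, 31, 5],
                   "2-2": [3, 21, 35, 4], "3-1": [9, 13, 34, 8],
                   "3-2": [33, 11, 8, 9], "4-1": [10, 7, 10, 6],
                   "5-1": [11, 13, 16, 8]})

# Group columns by the part before "-", take the row-wise max per group.
groups = df.columns.str.split("-").str[0]
df1 = df.T.groupby(groups).max().T

print(df1)
```

Because the grouping key comes from splitting on "-", this keeps working if a group number ever grows to two digits (e.g. "12-1").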
I am trying to look at 'time of day' effects on my users on a week over week basis to get a quick visual take on how consistent time of day trends are. So as a first start I've used this:
df[df['week'] < 10][['realLocalTime', 'week']].hist(by = 'week', bins = 24, figsize = (15, 15))
To produce the following:
This is a nice easy start, but what I would really like is to represent the histogram as a line plot, and overlay all the lines, one for each week on the same plot. Is there a way to do this?
I have a bit more experience with ggplot, where I would just do this by adding a factor level dependency on color and by. Is there a similarly easy way to do this with pandas and or matplotlib?
Here's what my data looks like:
realLocalTime week
1 12 10
2 12 10
3 12 10
4 12 10
5 13 5
6 17 5
7 17 5
8 6 6
9 17 5
10 20 6
11 18 5
12 18 5
13 19 6
14 21 6
15 21 6
16 14 6
17 6 6
18 0 6
19 21 5
20 17 6
21 23 6
22 22 6
23 22 6
24 17 6
25 22 5
26 13 6
27 23 6
28 22 5
29 21 6
30 17 6
... ... ...
70 14 5
71 9 5
72 19 6
73 19 6
74 21 6
75 20 5
76 20 5
77 21 5
78 15 6
79 22 6
80 23 6
81 15 6
82 12 6
83 7 6
84 9 6
85 8 6
86 22 6
87 22 6
88 22 6
89 8 5
90 8 5
91 8 5
92 9 5
93 7 5
94 22 5
95 8 6
96 10 6
97 0 6
98 22 5
99 14 6
Maybe you can simply use crosstab to compute the number of elements by week and plot it:

import pandas as pd

# Test data
d = {'realLocalTime': ['12', '14', '14', '12', '13', '17', '14', '17'],
     'week': ['10', '10', '10', '10', '5', '5', '6', '6']}
df = pd.DataFrame(d)
ax = pd.crosstab(df['realLocalTime'], df['week']).plot()
Use groupby and value_counts
df.groupby('week').realLocalTime.value_counts().unstack(0).fillna(0).plot()
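A minimal sketch of the table that chain produces before .plot(), using the same small test data as the other answer:

```python
import pandas as pd

df = pd.DataFrame({"realLocalTime": [12, 14, 14, 12, 13, 17, 14, 17],
                   "week": [10, 10, 10, 10, 5, 5, 6, 6]})

# Count hours within each week, then pivot weeks into columns,
# so plotting draws one line per week.
table = df.groupby("week").realLocalTime.value_counts().unstack(0).fillna(0)

print(table)
# table.plot() would draw one line per week
```

fillna(0) matters here: hours that never occur in a given week would otherwise be NaN and leave gaps in that week's line.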