Formatting output of series in python pandas - python

Here is my DataFrame. It represents an 8-hour day and the many different combinations of schedules. Times are in 24-hour format.
Input:
solutions = problem.getSolutions()
pd.options.display.max_columns = None
df = pd.DataFrame(solutions)
Output:
WorkHr1 WorkHr2 WorkHr3 WorkHr4 WorkOut Lunch FreeHour Cleaning
0 13 14 15 16 11 10 9 12
1 13 14 15 16 11 10 12 9
2 13 14 15 16 11 12 10 9
3 13 14 15 16 11 12 9 10
4 13 14 15 16 12 11 10 9
.. ... ... ... ... ... ... ... ...
I can create a series using:
series1 = pd.Series(solutions[0])
print(series1)
And I get this output:
WorkHr1 13
WorkHr2 14
WorkHr3 15
WorkHr4 16
WorkOut 11
Lunch 10
FreeHour 9
Cleaning 12
How can I switch the columns of this series so that the time is first?
Also, is there any possible way to display the rows in order of time? Like this:
9 FreeHour
10 Lunch
11 WorkOut
12 Cleaning
13 WorkHr1
14 WorkHr2
15 WorkHr3
16 WorkHr4

You can reverse it by passing the series' index as the data and its data as the index to the Series constructor, then sorting (here s is your series1):
out = pd.Series(s.index, index=s).sort_index()
Output:
9 FreeHour
10 Lunch
11 WorkOut
12 Cleaning
13 WorkHr1
14 WorkHr2
15 WorkHr3
16 WorkHr4
dtype: object
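For reference, the answer can be reproduced end to end with the row shown above hardcoded (a minimal, self-contained sketch):

```python
import pandas as pd

# One row of the schedule, as an activity -> hour series.
s = pd.Series({'WorkHr1': 13, 'WorkHr2': 14, 'WorkHr3': 15, 'WorkHr4': 16,
               'WorkOut': 11, 'Lunch': 10, 'FreeHour': 9, 'Cleaning': 12})

# Swap data and index, then sort by the (hour) index.
out = pd.Series(s.index, index=s).sort_index()
print(out)
```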

Related

combining specific row conditionally and add output to existing row in pandas

Suppose I have the following data frame:
data = {'age':  [10, 11, 12, 11, 11, 10, 11, 13, 13, 13, 14, 14, 15, 15, 15],
        'num1': [10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24],
        'num2': [20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34]}
df = pd.DataFrame(data)
I want to sum the rows for ages 14 and 15 and keep those summed values as age 14. My expected output would be like this:
age num1 num2
1 10 10 20
2 11 11 21
3 12 12 22
4 11 13 23
5 11 14 24
6 10 15 25
7 11 16 26
8 13 17 27
9 13 18 28
10 13 19 29
11 14 110 160
In the code below, I tried to group by age, but it does not work for me:
df1 =df.groupby(age[age >=14])['num1', 'num2'].apply(', '.join).reset_index(drop=True).to_frame()
limit_age = 14
new = df.query("age < @limit_age").copy()
new.loc[len(new)] = [limit_age,
                     *df.query("age >= @limit_age").drop(columns="age").sum()]
First get the "before 14" dataframe, then assign it to a new row where:
  age is 14
  the other values are the column sums of the "after 14" dataframe
to get
>>> new
age num1 num2
0 10 10 20
1 11 11 21
2 12 12 22
3 11 13 23
4 11 14 24
5 10 15 25
6 11 16 26
7 13 17 27
8 13 18 28
9 13 19 29
10 14 110 160
(new.index += 1 can be used for a 1-based index at the end.)
I would use a mask and concat:
m = df['age'].isin([14, 15])
out = pd.concat([df[~m],
                 df[m].agg({'age': 'min', 'num1': 'sum', 'num2': 'sum'})
                      .to_frame().T],
                ignore_index=True)
Output:
age num1 num2
0 10 10 20
1 11 11 21
2 12 12 22
3 11 13 23
4 11 14 24
5 10 15 25
6 11 16 26
7 13 17 27
8 13 18 28
9 13 19 29
10 14 110 160
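The mask-and-concat answer above can be run end to end; this sketch simply reproduces it with the question's data:

```python
import pandas as pd

df = pd.DataFrame({'age':  [10, 11, 12, 11, 11, 10, 11, 13, 13, 13, 14, 14, 15, 15, 15],
                   'num1': list(range(10, 25)),
                   'num2': list(range(20, 35))})

# Mask the rows to collapse (ages 14 and 15) ...
m = df['age'].isin([14, 15])
# ... and concatenate the untouched rows with a single aggregated row.
out = pd.concat([df[~m],
                 df[m].agg({'age': 'min', 'num1': 'sum', 'num2': 'sum'})
                      .to_frame().T],
                ignore_index=True)
print(out)
```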

In PyGAD, how can I get non-duplicate genes even though I pass allow_duplicate_genes=False?

Hi, I am trying to solve a TSP problem using PyGAD. I do get a result, but the result contains duplicated numbers.
I passed this as initial_population:
[[ 1 12 26 19 22 20 6 15 17 23 21 7 28 5 13 14 16 2 24 4 3 10 9 8 18 25 27 11]
[ 2 17 23 27 22 12 20 21 24 25 13 5 10 4 9 26 7 1 11 3 15 18 16 14 8 19 28 6]
[ 3 23 12 7 2 11 15 13 19 26 21 14 9 5 24 20 25 1 8 16 22 28 27 10 4 6 18 17]
[ 4 19 2 25 21 13 9 8 18 28 7 27 20 11 23 22 14 1 10 16 12 5 26 24 17 3 15 6]
[ 5 9 19 7 22 10 11 13 1 25 6 17 8 12 2 24 28 20 26 4 15 14 18 23 21 27 3 16]]
but the result was:
[ 6 8 13 1 19 10 6 23 18 22 5 3 *21* 11 6 16 28 1 4 10 6 25 7 22 5 3 *21* 11]
You can see here that some values (marked with *) are duplicated.
This is part of my code:
import pygad
import numpy as np
import copy

def fitness_func(solution, solution_idx):
    distance_treshold = np.load('distance.npy')
    function_inputs = distance_simple(distance_treshold)  # distance_simple is defined elsewhere
    a1 = treshold(function_inputs)
    f = 0
    for i in range(len(solution)):
        if i == 0:
            f += distance_treshold[solution[0]][solution[i + 1]]
        else:
            try:
                f += distance_treshold[solution[i]][solution[i + 1]]
            except IndexError:
                f += distance_treshold[solution[i]][solution[0]]
    fitness_score = pow(a1 / f, 2)  # fitness
    return fitness_score

def treshold(solution):
    distance_treshold = np.load('distance.npy')
    f = 0
    for i in range(len(solution)):
        if i == 0:
            f += distance_treshold[solution[0]][solution[i + 1]]
        else:
            try:
                f += distance_treshold[solution[i]][solution[i + 1]]
            except IndexError:
                f += distance_treshold[solution[i]][solution[0]]
    return f

function_inputs = distance_simple(distance_treshold)
a1 = treshold(function_inputs)
print(a1)
np.load('distance.npy')
#print(initial_pop)
initial_population = np.load('inital_generation.npy')
print(initial_population)
num_parents_mating = 2
num_generations = 30
parent_selection_type = 'sus'
mutation_type = "swap"
keep_parents = 0
mutation_num_genes = 1
mutation_percent_genes = 3
crossover_type = "single_point"
allow_duplicate_genes = False
gene_type = int
mutation_probability = 0.03
print("GA start")
ga_instance = pygad.GA(num_generations=num_generations, mutation_probability=mutation_probability,
                       parent_selection_type=parent_selection_type, initial_population=initial_population,
                       num_parents_mating=num_parents_mating, fitness_func=fitness_func, gene_type=gene_type,
                       mutation_percent_genes=mutation_num_genes, mutation_num_genes=mutation_percent_genes,
                       mutation_type=mutation_type, allow_duplicate_genes=False)
ga_instance.run()
ga_instance.plot_fitness()
best_solution, best_solution_fitness, best_match_idx = ga_instance.best_solution()
print(best_solution)
fitness_func(best_solution, 0)
print(best_solution_fitness)
I also saw How to solve TSP problem using pyGAD package?, so I tried allow_duplicate_genes=False, but it doesn't work. I also tried passing initial_population as a numpy array, but that doesn't work either.
Thank you for your help.
Thanks for using PyGAD :)
For the allow_duplicate_genes parameter to prevent duplicate genes, the number of distinct gene values must be greater than or equal to the number of genes. Let me explain further.
Assume that the gene space is set to [0.4, 7, 9, 2.3] (with only 4 values) and there are 5 genes. In this case, it is impossible to prevent duplicates because at least 2 genes will share the same value. To solve this issue, you have to add other values to the gene space so that the number of values in the space is >= the number of genes (5 in this case).
To solve your issue, you may use the gene_space parameter and give it enough values to prevent duplicates. This is already used in the question you mentioned.
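As a sketch of that advice for this 28-city tour: build a gene_space with 28 distinct values, so allow_duplicate_genes=False has enough values to work with (the pygad.GA call is left as a comment since it depends on the rest of the asker's setup):

```python
# For a 28-city TSP, the gene space needs at least 28 distinct values
# so that allow_duplicate_genes=False can actually prevent duplicates.
num_cities = 28
gene_space = list(range(1, num_cities + 1))  # 28 distinct city labels

# Sanity check: enough distinct values for the number of genes.
assert len(set(gene_space)) >= num_cities

# Then pass it to the GA, e.g.:
# ga_instance = pygad.GA(..., gene_space=gene_space,
#                        allow_duplicate_genes=False, gene_type=int)
print(len(gene_space))
```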

Hash table mapping in Pandas

I have a large dataset with millions of rows of data. One of the data columns is ID.
I also have another (hash)table that maps the range of indices to a specific group that meets a certain criteria.
What is an efficient way to map the range of indices to include them as an additional column on my dataset in pandas?
As an example, let's say the dataset looks like this:
In [18]:
print(df_test)
Out [19]:
ID
0 13
1 14
2 15
3 16
4 17
5 18
6 19
7 20
8 21
9 22
10 23
11 24
12 25
13 26
14 27
15 28
16 29
17 30
18 31
19 32
Now the hash table with the range of indices looks like this:
In [20]:
print(df_hash)
Out [21]:
ID_first
0 0
1 2
2 10
where the index specifies the group number that I need.
I tried doing something like this:
for index in range(df_hash.size):
    try:
        df_test.loc[df_hash.ID_first[index]:df_hash.ID_first[index + 1], 'Group'] = index
    except:
        df_test.loc[df_hash.ID_first[index]:, 'Group'] = index
Which works well, except that it is really slow as it loops over the length of the hash table dataframe (hundreds of thousands of rows). It produces the following answer (which I want):
In [23]:
print(df_test)
Out [24]:
ID Group
0 13 0
1 14 0
2 15 1
3 16 1
4 17 1
5 18 1
6 19 1
7 20 1
8 21 1
9 22 1
10 23 2
11 24 2
12 25 2
13 26 2
14 27 2
15 28 2
16 29 2
17 30 2
18 31 2
19 32 2
Is there a way to do this more efficiently?
You can map the index of df_test to the index of df_hash via ID_first, and then ffill. You need to construct a Series because the pd.Index class doesn't have an ffill method.
df_test['group'] = (pd.Series(df_test.index.map(dict(zip(df_hash.ID_first, df_hash.index))),
                              index=df_test.index)
                      .ffill(downcast='infer'))
# ID group
#0 13 0
#1 14 0
#2 15 1
#...
#9 22 1
#10 23 2
#...
#17 30 2
#18 31 2
#19 32 2
You can do Series.isin with Series.cumsum (the output below was produced with ID equal to the positional index 0-19; chain .sub(1) if you want 0-based group numbers):
df_test['group'] = df_test['ID'].isin(df_hash['ID_first']).cumsum() #.sub(1)
print(df_test)
ID group
0 0 1
1 1 1
2 2 2
3 3 2
4 4 2
5 5 2
6 6 2
7 7 2
8 8 2
9 9 2
10 10 3
11 11 3
12 12 3
13 13 3
14 14 3
15 15 3
16 16 3
17 17 3
18 18 3
19 19 3
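An alternative that avoids both the Python loop and the dict construction is np.searchsorted on the boundary positions. This sketch uses the question's data and assumes, as in the question, that ID_first holds positional indices into df_test:

```python
import numpy as np
import pandas as pd

df_test = pd.DataFrame({'ID': range(13, 33)})
df_hash = pd.DataFrame({'ID_first': [0, 2, 10]})

# For each row position, find the last boundary <= that position;
# that boundary's index in df_hash is the group number.
df_test['Group'] = np.searchsorted(df_hash['ID_first'].to_numpy(),
                                   df_test.index, side='right') - 1
print(df_test)
```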

Understanding for loop in pandas dataframe

Hello there, I ran into this problem while coding in pandas:
for label, content in data_temp.items():
    print(len(label))  # as we can see, this prints
    print(len(data_temp.columns))
Firstly, I was trying to print the label, which is the column name, right? It outputs these different numbers:
7
9
9
7
10
12
8
24
9
11
11
15
13
17
11
18
5
12
16
12
9
5
8
12
5
12
12
15
11
14
17
10
9
6
9
11
9
7
14
14
15
10
23
12
5
15
12
16
10
15
17
17
8
9
7
7
22
34
And when I print len(data_temp.columns), it outputs:
58
Why does data_temp.columns give me a different number from the labels in the for loop over data_temp.items()? Aren't the labels of the for loop the same as the entries of data_temp.columns?
You are printing the length of the labels, not the labels themselves.
Try print(label) and print(data_temp.columns); that should output the labels one by one in the for loop, and then the column names as a list.
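To see the difference concretely, here is a small sketch (with made-up column names): items() yields (column_name, column) pairs, so len(label) is the length of each column-name string, while len(df.columns) is the number of columns:

```python
import pandas as pd

df = pd.DataFrame({'alpha': [1, 2], 'beta': [3, 4]})

for label, content in df.items():
    print(label, len(label))   # column name and its string length
print(len(df.columns))         # number of columns
```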

Python: Get the total hours from a start time column and an end time column in a CSV file

I have 2 columns, 'Start' and 'End', in a CSV file. I want to add a column named 'Total' containing the total hours.
My data:
Start End
-----------
16 18
3 15
13 23
22 15
9 1
The data I want:
Start End Total
----------------------
16 18 2
3 15 12
13 23 10
22 15 17
9 1 16
Use np.where:
import numpy as np

dif = df['End'] - df['Start']
df['Total'] = np.where(dif > 0, dif, 24 + dif)
print(df)
Start End Total
0 16 18 2
1 3 15 12
2 13 23 10
3 22 15 17
4 9 1 16
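An equivalent one-liner is a modulo-24 difference (note it maps a zero difference to 0, whereas the np.where version above maps it to 24):

```python
import pandas as pd

df = pd.DataFrame({'Start': [16, 3, 13, 22, 9], 'End': [18, 15, 23, 15, 1]})

# Wrap negative differences around midnight with modulo arithmetic.
df['Total'] = (df['End'] - df['Start']) % 24
print(df)
```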
