I have a table in the shape of a symmetric matrix that tells me which components are compatible with each other. Here is an example:
Components | A | B | C | D | E | F | G |
-----------+---+---+---+---+---+---+---+
A          |   |   | 1 | 1 | 1 | 1 |   |
-----------+---+---+---+---+---+---+---+
B          |   |   |   |   | 1 |   | 1 |
-----------+---+---+---+---+---+---+---+
C          | 1 |   |   |   |   | 1 |   |
-----------+---+---+---+---+---+---+---+
D          | 1 |   |   |   |   | 1 | 1 |
-----------+---+---+---+---+---+---+---+
E          | 1 | 1 |   |   |   |   | 1 |
-----------+---+---+---+---+---+---+---+
F          | 1 |   | 1 | 1 |   |   | 1 |
-----------+---+---+---+---+---+---+---+
G          |   | 1 |   | 1 | 1 | 1 |   |
-----------+---+---+---+---+---+---+---+
The 1s mark compatible pairs and the blanks mark incompatible pairs. The actual table has many more components. The real table currently lives in an Excel spreadsheet, but it could easily be converted to CSV or plain text for convenience.
What I need to do is create a list of the possible combinations. I know there are tools like itertools, but I only want the combinations whose components are all compatible and to ignore the rest. I feed this into a .dat file that I pull in when I run Pyomo:
set NODES := A B C D E F G;
param: ARCS:=
A
B
C
...
A C
A D
B E
...
A C F
B E G
...
Everything listed together has to be mutually compatible. So A C F can appear as a combination because all three are compatible with each other, but A D G cannot because G is not compatible with A.
Long Term Plan:
Eventually I plan to use Pyomo to find the best combination of components, minimizing the resources associated with each component. The .dat file will therefore eventually carry an additional cost for each combination.
import pandas as pd
import networkx as nx

# Read the compatibility matrix; blank cells come in as NaN, so turn them into 0s
df = pd.read_excel(r"/path/to/file.xlsx", sheet_name="Sheet4", index_col=0, usecols="A:H")
adj = df.fillna(0)

# Build an undirected graph from the adjacency matrix; every clique in it
# is a set of components that are all compatible with each other
G = nx.from_pandas_adjacency(adj)
cliques = list(nx.enumerate_all_cliques(G))
print(cliques)
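To turn those cliques into the kind of list shown in the .dat snippet above, something along these lines could work (a rough sketch; the output file name and the plain space-separated format are assumptions, so adapt it to whatever your Pyomo model actually expects):
# Write every clique as one space-separated line, e.g. "A C F"
# ("compatible_combinations.txt" is a made-up name; adjust as needed)
with open("compatible_combinations.txt", "w") as f:
    for clique in cliques:
        f.write(" ".join(clique) + "\n")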
Related
I have a massive dataset and I need to subset the data using criteria. This is for illustration:
| Group | Name    | Value |
|-------|---------|-------|
| A     | Bill    | 256   |
| A     | Jack    | 268   |
| A     | Melissa | 489   |
| B     | Amanda  | 787   |
| B     | Eric    | 485   |
| C     | Matt    | 1236  |
| C     | Lisa    | 1485  |
| D     | Ben     | 785   |
| D     | Andrew  | 985   |
| D     | Cathy   | 1025  |
| D     | Suzanne | 1256  |
| D     | Jim     | 1520  |
I know how to handle this manually, for example:
import pandas as pd

df = pd.read_csv('Test.csv')
A = df[df.Group == "A"].to_numpy()
B = df[df.Group == "B"].to_numpy()
C = df[df.Group == "C"].to_numpy()
D = df[df.Group == "D"].to_numpy()
But given the size of the data, handling it this way will take a lot of time.
With that in mind, I would like to know whether it is possible to build a loop with an IF statement that looks at the values in the "Group" column (table above): if the first value matches one of the values below it, group those rows together and create a new array/DataFrame for them.
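For reference, pandas' groupby can usually replace this kind of manual IF loop entirely. A minimal sketch (assuming the Test.csv file and the Group column from above):
import pandas as pd

df = pd.read_csv('Test.csv')

# One NumPy array per group, without writing a separate line for each group
groups = {name: sub.to_numpy() for name, sub in df.groupby('Group')}

# groups['A'] now holds the rows where Group == 'A', and so on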
My aim is to zero-pad my data so that all the subset datasets have equal length. I have data as follows:
| server | users            | power      | Throughput range     | time |
|:------:|:----------------:|:----------:|:--------------------:|:----:|
| 0      | [5, 3, 4, 1]     | -4.2974843 | [5.23243, 5.2974843] | 0    |
| 1      | [8, 6, 2, 7]     | -6.4528433 | [6.2343, 7.0974845]  | 1    |
| 2      | [9, 12, 10, 11]  | -3.5322451 | [4.31240, 4.9073840] | 2    |
| 3      | [14, 13, 16, 17] | -5.9752843 | [5.2243, 5.2974843]  | 3    |
| 0      | [22, 18, 19, 21] | -1.2974652 | [3.12843, 4.2474643] | 4    |
| 1      | [22, 23, 24, 25] | -9.884843  | [8.00843, 8.0974843] | 5    |
| 2      | [27, 26, 28, 29] | -2.3984843 | [7.23843, 8.2094845] | 6    |
| 3      | [30, 32, 31, 33] | -4.5654566 | [3.1233, 4.2474643]  | 7    |
| 1      | [36, 34, 37, 35] | -1.2974652 | [3.12843, 4.2474643] | 8    |
| 2      | [40, 41, 38, 39] | -3.5322451 | [4.31240, 4.9073840] | 9    |
| 1      | [42, 43, 45, 44] | -5.9752843 | [6.31240, 6.9073840] | 10   |
The aim is to analyze each server using its respective data, which was done with the code below:
# Boolean masks select the rows belonging to each server;
# grp[mask] is already a DataFrame, so no extra pd.DataFrame() wrapper is needed
server0 = grp[grp['server'].values == 0]
server1 = grp[grp['server'].values == 1]
server2 = grp[grp['server'].values == 2]
server3 = grp[grp['server'].values == 3]
The results of this code provide the different servers and their respective data features. For example, the server0 output becomes:
| server | users            | power      | Throughput range     | time |
|:------:|:----------------:|:----------:|:--------------------:|:----:|
| 0      | [5, 3, 4, 1]     | -4.2974843 | [5.23243, 5.2974843] | 0    |
| 0      | [22, 18, 19, 21] | -1.2974652 | [3.12843, 4.2474643] | 4    |
The results obtained for the individual servers have different lengths, so I tried padding using the code below:
from keras.preprocessing.sequence import pad_sequences

man = [server0, server1, server2, server3]
new = pad_sequences(man)
The result shows that the padding works and all the servers end up with equal length, but the output no longer contains the column names, and I want the final data to keep the columns. Any suggestions?
The aim is to apply machine learning to the data, so I would like to have the servers concatenated. This is what I later did, and it worked for the application I wanted it for.
from sklearn.preprocessing import MinMaxScaler

man = [server0, server1, server2, server3]

# Use 'time' as the index and drop the list-valued 'users' column from each server frame
for cel in man:
    cel.set_index('time', inplace=True)
    cel.drop(['users'], axis=1, inplace=True)

scl = MinMaxScaler()  # scaler set up for later use; it is not applied in this snippet

# One 2-D array per server (rows x remaining features), ready for padding
vals = [cel.values.reshape(cel.shape[0], -1) for cel in man]
I then applied pad_sequences and it worked as follows:
from keras.preprocessing.sequence import pad_sequences
new = pad_sequences(vals)
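If the column names are needed after the padding, one option is to wrap each padded slice back into a labelled DataFrame. A small sketch, assuming the man, vals, and new variables from above and that every server frame ends up with the same columns (note that pad_sequences defaults to dtype='int32', so passing dtype='float32' may be needed to keep the float values):
import pandas as pd

cols = man[0].columns  # all server frames share the same columns after the drops above

# new has shape (n_servers, max_len, n_features); rebuild one labelled frame per server
padded_frames = [pd.DataFrame(arr, columns=cols) for arr in new]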
My DataFrame:
+-----+--------+-------+
| val | id | reRnk |
+-----+--------+-------+
| 2 | a | yes |
| 1 | b | no |
| 3 | c | no |
| 8 | d | yes |
| 7 | e | yes |
| 9 | f | no |
+-----+--------+-------+
In my desired output I will re-rank only the rows where reRnk == yes; the ranking will be based on "val".
I don't want to move the rows where reRnk == no. For example, id=b has reRnk=no, so I want to keep that row in position 2.
my desired output will look like this:
+-----+--------+-------+
| val | id | reRnk |
+-----+--------+-------+
| 8 | d | yes |
| 1 | b | no |
| 3 | c | no |
| 7 | e | yes |
| 2 | a | yes |
| 9 | f | no |
+-----+--------+-------+
From what I'm reading, PySpark DataFrames do not have an index by default, so you might need to add one.
I don't know the exact PySpark syntax, but since it has many similarities with pandas, this pandas version might point you in the right direction:
mask = df.reRnk == 'yes'
ranked = df.loc[mask, ['val', 'id']].sort_values('val', ascending=False)
df.loc[mask, ['val', 'id']] = ranked.set_index(df.loc[mask].index)
Basically we isolate the rows with reRnk == 'yes', sort them by 'val', put back the original index of those rows, and then assign the re-ordered values onto the original rows of df.
For .loc, https://spark.apache.org/docs/3.2.0/api/python/reference/pyspark.pandas/api/pyspark.pandas.DataFrame.loc.html might be worth a try.
For .sort_values see: https://sparkbyexamples.com/pyspark/pyspark-orderby-and-sort-explained/
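For a native PySpark attempt, here is an untested sketch using only standard pyspark.sql functions; it assumes the DataFrame's current row order is the order to preserve:
from pyspark.sql import functions as F, Window

# Tag every row with its current position; Spark has no implicit row index
df = df.withColumn("pos", F.monotonically_increasing_id())

yes = df.filter(F.col("reRnk") == "yes")

# The positions of the 'yes' rows, in their original order
slots = yes.select("pos").withColumn("slot", F.row_number().over(Window.orderBy("pos")))
# The 'yes' values, ranked by val descending
ranked = yes.select("val", "id").withColumn("slot", F.row_number().over(Window.orderBy(F.desc("val"))))

# Put the i-th largest val into the i-th original 'yes' position
reranked = slots.join(ranked, "slot").select("pos", "val", "id").withColumn("reRnk", F.lit("yes"))

result = (df.filter(F.col("reRnk") == "no")
            .select("pos", "val", "id", "reRnk")
            .unionByName(reranked)
            .orderBy("pos")
            .drop("pos"))
result.show()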
I'm trying to find a trend in attendance. I filtered my existing df down to this so I can look at one activity at a time.
+---+-----------+-------+----------+-------+---------+
| | Date | Org | Activity | Hours | Weekday |
+---+-----------+-------+----------+-------+---------+
| 0 | 8/3/2020 | Org 1 | Gen Ab | 10.5 | Monday |
| 1 | 8/25/2020 | Org 1 | Gen Ab | 2 | Tuesday |
| 3 | 8/31/2020 | Org 1 | Gen Ab | 8.5 | Monday |
| 7 | 8/10/2020 | Org 2 | Gen Ab | 1 | Monday |
| 8 | 8/14/2020 | Org 3 | Gen Ab | 3.5 | Friday |
+---+-----------+-------+----------+-------+---------+
This code:
gen_ab = att_df.loc[att_df['Activity'] == "Gen Ab"]
sum_gen_ab = gen_ab.groupby(['Date', 'Activity']).sum()
sum_gen_ab.head()
Returns this:
+------------+----------+------------+
| | | Hours |
+------------+----------+------------+
| Date | Activity | |
| 06/01/2020 | Gen Ab | 347.250000 |
| 06/02/2020 | Gen Ab | 286.266667 |
| 06/03/2020 | Gen Ab | 169.583333 |
| 06/04/2020 | Gen Ab | 312.633333 |
| 06/05/2020 | Gen Ab | 317.566667 |
+------------+----------+------------+
How do I make the summed column name 'Hours'? I still get the same result when I do this:
sum_gen_ab['Hours'] = gen_ab.groupby(['Date', 'Activity']).sum()
What I eventually want to do is have a line graph that shows the sum of hours for the activity over time. The time of course would be the dates in my df.
plt.plot(sum_gen_ab['Date'], sum_gen_ab['Hours'])
plt.show()
returns KeyError: 'Date'
Once you've used groupby(['Date', 'Activity']), Date and Activity have become index levels and can no longer be referenced as sum_gen_ab['Date'].
To keep them as regular columns, use groupby(['Date', 'Activity'], as_index=False) instead.
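Put together with the code from the question (a short sketch; it assumes the gen_ab frame built above and matplotlib imported as plt):
# With as_index=False, Date and Activity stay regular columns and can be plotted directly
sum_gen_ab = gen_ab.groupby(['Date', 'Activity'], as_index=False).sum()

plt.plot(sum_gen_ab['Date'], sum_gen_ab['Hours'])
plt.show()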
I typically use the pandasql library to reshape my DataFrames into different datasets. It lets you manipulate a pandas DataFrame with SQL, and it works alongside regular pandas.
EXAMPLE:
import pandas as pd
import pandasql as psql

# df is the DataFrame you want to query

new_dataset = psql.sqldf('''
    SELECT Date, Activity, SUM(Hours) AS SUM_OF_HOURS
    FROM df
    GROUP BY Date, Activity''')

new_dataset.head()  # shows the first 5 rows of the result
I have two pandas DataFrames.
old:
| alpha | beta | zeta     | id   | rand | numb |
|-------|------|----------|------|------|------|
| 1     | LA   | bev      | A100 | D    | 100  |
| 1     | LA   | malib    | C150 | Z    | 150  |
| 2     | NY   | queens   | B200 | N    | 200  |
| 2     | NY   | queens   | B200 | N    | 200  |
| 3     | Chic | lincpark | E300 | T    | 300  |
| 3     | NY   | Bronx    | F300 | M    | 300  |
new:
| alpha | beta | zeta     | id | numb |
|-------|------|----------|----|------|
| 1     | LA   | Hwood    | Q  | Q400 |
| 2     | NY   | queens   | B  | B200 |
| 3     | Chic | lincpark | D  | D300 |
(Columns and data don't mean anything in particular, just an example).
I want to merge the datasets so that:
If old.alpha, old.beta, and old.zeta equal their corresponding new columns and old.id equals new.numb, only the entry from the old table is kept. (In this case the queens rows in old would be kept rather than row 2 of new with queens.)
Note that rows 3 and 4 in old are identical, but we still keep both. If new contained 2 duplicates of these rows we would treat them as corresponding 1-to-1. If new contained 3 duplicates of old's rows 3 and 4, then 2 of them would be considered copies (and not added), but the third would be added when merging.
If old.alpha, old.beta, and old.zeta equal their corresponding new columns and old.numb is contained inside new.numb, only the entry from the old table is kept. (In this case row 5 of old with lincpark would be kept rather than row 3 of new with lincpark, because 300 is contained in new.numb.)
Otherwise, add the new data as genuinely new rows, keeping the new table's id and numb and filling any extra columns that only the old table has with null (new's row 1 with Hwood).
I have tried various merging methods along with drop_duplicates. The problem with the latter is that when I dropped duplicates having the same alpha, beta, and zeta, rows were often deleted from the same data source because they were exactly the same.
This is what ultimately needs to come out of the merge: 2 of the rows in new were duplicates and one was genuinely new.
| alpha | beta | zeta     | id   | rand | numb |
|-------|------|----------|------|------|------|
| 1     | LA   | bev      | A100 | D    | 100  |
| 1     | LA   | malib    | C150 | Z    | 150  |
| 2     | NY   | queens   | B200 | N    | 200  |
| 2     | NY   | queens   | B200 | N    | 200  |
| 3     | Chic | lincpark | E300 | T    | 300  |
| 3     | NY   | Bronx    | F300 | M    | 300  |
| 1     | LA   | Hwood    | Q    |      | Q400 |
We can merge two DataFrames in several ways. The most common way in Python is the merge operation in pandas.
Assuming df1 is your new table and df2 is the old one,
start with a merge and then apply your IF conditions:
import pandas as pd

dfinal = df1.merge(df2, on="alpha", how='inner')
To merge on columns that have different names in the two DataFrames, specify the left and right column names explicitly; this is especially useful when the same logical column goes by two different names in the two tables.
dfinal = df1.merge(df2, how='inner', left_on='alpha', right_on='id')
If you want to be more specific, read the documentation for the pandas merge operation.
You can also express your IF conditions row by row after the merge, drop the columns you no longer need in a temporary DataFrame, and add values to it according to your conditions.
I understand the answer is a little bit complex, but so is your question. Cheers :)
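If it helps, here is a rough sketch of the matching logic described above (untested; it ignores the duplicate-multiplicity rule and assumes the frames are named old and new with the column names from the example tables):
import pandas as pd

def is_covered(new_row, old):
    # An entry in new is "covered" if some old row shares alpha/beta/zeta and
    # either old.id equals new.numb or str(old.numb) is contained in new.numb
    same = old[(old.alpha == new_row.alpha) &
               (old.beta == new_row.beta) &
               (old.zeta == new_row.zeta)]
    contained = same['numb'].astype(str).apply(lambda n: n in str(new_row.numb))
    return ((same['id'] == new_row.numb) | contained).any()

# Keep only the rows of new that are not already represented in old
to_add = new[~new.apply(lambda r: is_covered(r, old), axis=1)]

# Append them; columns that exist only in old (e.g. rand) become NaN for the added rows
merged = pd.concat([old, to_add], ignore_index=True)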