Link lists that share common elements - python

I have an issue similar to this one, with a few differences/complications.
I have a list of groups containing members. Rather than merging the groups that share members, I need to preserve the groupings and create a new set of edges based on which groups have members in common, and do so conditionally based on attributes of the groups.
The source data looks like this:
+----------+------------+-----------+
| Group ID | Group Type | Member ID |
+----------+------------+-----------+
| A        | Type 1     | 1         |
| A        | Type 1     | 2         |
| B        | Type 1     | 2         |
| B        | Type 1     | 3         |
| C        | Type 1     | 3         |
| C        | Type 1     | 4         |
| D        | Type 2     | 4         |
| D        | Type 2     | 5         |
+----------+------------+-----------+
Desired output is this:
+----------+-----------------+
| Group ID | Linked Group ID |
+----------+-----------------+
| A        | B               |
| B        | C               |
+----------+-----------------+
A is linked to B because they share member 2.
B is linked to C because they share member 3.
C is not linked to D: they have a member in common, but the groups are of different types.
The number of shared members doesn't matter for my purposes; a single member in common means the groups are linked.
The output is being used as the edges of a graph, so if the output is a graph that fits the rules, that's fine.
The source dataset is large (hundreds of millions of rows), so performance is a consideration.
This question poses a similar problem; however, I'm new to Python and can't figure out how to get the source data to a point where I can use that answer, or how to work in the additional requirement that the group types match.

Try something like this:
df1 = df.groupby(['Group Type', 'Member ID'])['Group ID'].apply(','.join).reset_index()
df2 = df1[df1['Group ID'].str.contains(',')]
This might not handle the case of cyclic grouping.
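If it helps, here is a minimal sketch of one way to get from the source table to the edge list with a pandas self-merge; the column names follow the question, but the DataFrame construction is just illustrative sample data. On hundreds of millions of rows you would want to drop duplicate (Group ID, Group Type, Member ID) rows first, since the join grows with the square of the number of groups sharing a member.
import pandas as pd

# Illustrative sample data matching the question's table.
df = pd.DataFrame({
    'Group ID': ['A', 'A', 'B', 'B', 'C', 'C', 'D', 'D'],
    'Group Type': ['Type 1'] * 6 + ['Type 2'] * 2,
    'Member ID': [1, 2, 2, 3, 3, 4, 4, 5],
})

# Self-merge on (Group Type, Member ID): two groups of the same type
# that share a member end up on the same row.
pairs = df.merge(df, on=['Group Type', 'Member ID'])

# Keep each unordered pair once and drop self-links.
edges = (
    pairs.loc[pairs['Group ID_x'] < pairs['Group ID_y'],
              ['Group ID_x', 'Group ID_y']]
    .drop_duplicates()
    .rename(columns={'Group ID_x': 'Group ID',
                     'Group ID_y': 'Linked Group ID'})
)
print(edges)  # A-B and B-C, as in the desired output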

Related

How to create a table from another table with GridDB?

I have a GridDB container where I have stored my database. I want to copy the table, but excluding a few columns. The function I need should extract all columns matching a given keyword and then create a new table from them. It must always include the first column, *id, because it is needed in every table.
For example, in the table given below:
+-------------+---------------+---------------------+--------------------+-----------------+
| employee_id | department_id | employee_first_name | employee_last_name | employee_gender |
+-------------+---------------+---------------------+--------------------+-----------------+
| 1           | 1             | John                | Matthew            | M               |
| 2           | 1             | Alexandra           | Philips            | F               |
| 3           | 2             | Hen                 | Lotte              | M               |
+-------------+---------------+---------------------+--------------------+-----------------+
Suppose I need to get the first column and every other column starting with "employee". How can I do this through a Python function?
I am using the GridDB Python client on my Ubuntu machine, and I have already stored the database.csv file in the container. Thanks in advance for your help!
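Since no GridDB-specific answer is recorded here, the following is only a sketch of the column-selection logic, assuming the rows have already been read out of the container into a pandas DataFrame (for example from the database.csv mentioned above); writing the result back to a new GridDB container is left to the client's own container APIs, which are not shown.
import pandas as pd

df = pd.read_csv('database.csv')  # assumed export of the container

def columns_matching(df, keyword):
    # Always keep the first column (the *id column), then every other
    # column whose name starts with the given keyword.
    first = df.columns[0]
    keep = [first] + [c for c in df.columns[1:] if c.startswith(keyword)]
    return df[keep]

employees = columns_matching(df, 'employee')
# -> employee_id, employee_first_name, employee_last_name, employee_gender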

Efficient way to write Pandas groupby code by eliminating repetition

I have a DataFrame as below:
import pandas as pd

df = pd.DataFrame({
    'Country': ['A', 'A', 'A', 'A', 'A', 'A', 'B', 'B', 'B'],
    'City': ['C 1', 'C 1', 'C 1', 'B 2', 'B 2', 'B 2', 'C 1', 'C 1', 'C 1'],
    'Date': ['7/1/2020', '7/2/2020', '7/3/2020', '7/1/2020', '7/2/2020',
             '7/3/2020', '7/1/2020', '7/2/2020', '7/3/2020'],
    'Value': [46, 90, 23, 84, 89, 98, 31, 84, 41]
})
I need to create two averages:
First, with both Country and City as the criteria.
Second, an average for the Country only.
To achieve this, we can easily write the code below:
df.groupby(['Country','City']).agg('mean')
+---------+------+-------+
| Country | City | Value |
+---------+------+-------+
| A       | B 2  | 90.33 |
| A       | C 1  | 53    |
| B       | C 1  | 52    |
+---------+------+-------+
df.groupby(['Country']).agg('mean')
+---------+-------+
| Country | Value |
+---------+-------+
| A       | 71.67 |
| B       | 52    |
+---------+-------+
The only change between the two snippets above is the groupby criterion City; apart from that everything is the same, so there's a clear repetition/duplication of code (especially when it comes to complex scenarios).
Now my question is: is there any way to write one piece of code that incorporates both scenarios at once? DRY - Don't Repeat Yourself.
What I have in mind is something like below:
Choice = 'City'  # Here I type either City or None based on the requirement. E.g. if None, the code below will ignore that criterion.
df.groupby(['Country', Choice]).agg('mean')
Is this possible? Or what is the best way to write the above code efficiently, without repetition?
I am not sure what you want to accomplish, but why not just use an if?
columns = ['Country']
if Choice:
    columns.append(Choice)
df.groupby(columns).agg('mean')
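Building on that answer, a small helper keeps the optional key in one place; the function name and the restriction to the Value column are my own choices for the sketch, not from the question.
def grouped_mean(df, choice=None):
    # Always group by Country; add a second key (e.g. 'City') only
    # when one is given.
    keys = ['Country'] + ([choice] if choice else [])
    return df.groupby(keys)['Value'].mean()

grouped_mean(df)          # Country-level averages
grouped_mean(df, 'City')  # Country + City averages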

Filter SQL elements with adjacent ID

I don't really know how to properly state this question in the title.
Suppose I have a table Word like the following:
| id  | text  |
| --- | ----- |
| 0   | Hello |
| 1   | Adam  |
| 2   | Hello |
| 3   | Max   |
| 4   | foo   |
| 5   | bar   |
Is it possible to query this table based on text and receive the objects whose primary key (id) is exactly one off?
So, if I do
Word.objects.filter(text='Hello')
I get a QuerySet containing the rows
| id  | text  |
| --- | ----- |
| 0   | Hello |
| 2   | Hello |
but I want the rows
| id  | text  |
| --- | ----- |
| 1   | Adam  |
| 3   | Max   |
I guess I could do
word_ids = Word.objects.filter(text='Hello').values_list('id', flat=True)
word_ids = [w_id + 1 for w_id in word_ids] # or use a numpy array for this
Word.objects.filter(id__in=word_ids)
but that doesn't seem overly efficient. Is there a straight SQL way to do this in one call? Preferably directly using Django's QuerySets?
EDIT: The idea is that, in fact, I want to filter the words that are in the second QuerySet. Something like:
Word.objects.filter(text__of__previous__word='Hello', text='Max')
In plain Postgres you could use the lag window function (https://www.postgresql.org/docs/current/static/functions-window.html):
SELECT
    id,
    text
FROM (
    SELECT
        *,
        lag(text) OVER (ORDER BY id) AS prev_text
    FROM word
) s
WHERE prev_text = 'Hello'
The lag function adds a column with the text of the previous row, so you can filter on this text in the outer query.
demo: db<>fiddle
I am not really into Django, but according to the documentation, window function support was added in version 2.0.
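For completeness, here is a hedged sketch of how the same lag idea could look with Django 2.0+ window expressions, assuming the Word model from the question; since window annotations can't be used directly in filter() in those versions, this sketch simply filters in Python afterwards.
from django.db.models import F, Window
from django.db.models.functions import Lag

# Annotate each word with the text of the previous row (ordered by id).
words = Word.objects.annotate(
    prev_text=Window(expression=Lag('text'), order_by=F('id').asc())
)
# Keep the words whose predecessor is 'Hello'.
hits = [w for w in words if w.prev_text == 'Hello']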
If by "1 off" you mean that the difference is exactly 1, then you can do:
select w.*
from w
where w.id in (select w2.id + 1 from words w2 where w2.text = 'Hello');
lag() is also a very reasonable solution. This seems like a direct interpretation of your question. If you have gaps (and the intention is + 1), then lag() is a bit trickier.

SQL/SqlAlchemy: Querying all objects in a dependency tree

I have a table with a self-referential, asymmetric many-to-many relationship of dependencies between objects. I use that relationship to create a dependency tree between objects.
Given a set of object IDs, I would like to fetch all objects that are somewhere in that dependency tree.
Here's an example objects table:
+----+------+
| ID | Name |
+----+------+
| 1  | A    |
| 2  | B    |
| 3  | C    |
| 4  | D    |
| 5  | E    |
+----+------+
And a table of relationships:
+------------+-----------+
| Dependency | Dependant |
+------------+-----------+
| 2          | 1         |
| 3          | 2         |
| 4          | 1         |
+------------+-----------+
This shows that A (ID 1) depends on both B (2) and D (4), and that B (2) depends on C (3).
Now I would like to construct a single SQL query that, given {1} as a set with a single ID, will return the four objects in A's dependency tree: A, B, D, and C.
Alternatively, using one query to fetch all needed object IDs and another to fetch their actual data is also acceptable.
This should work regardless of the number of levels in the dependency/hierarchy tree.
I'll be happy with either an SQLAlchemy example or plain SQL for a PostgreSQL 10 database (which I'll figure out how to implement with SQLAlchemy later on).
Thanks!
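No answer is recorded here, but a recursive CTE is the standard PostgreSQL tool for this; below is a sketch run through SQLAlchemy's text() construct, assuming tables named objects (id, name) and dependencies (dependency, dependant) as above, and a hypothetical connection string.
from sqlalchemy import create_engine, text

engine = create_engine('postgresql:///mydb')  # hypothetical DSN

# Walk the dependency edges downward from the starting set of IDs.
query = text("""
    WITH RECURSIVE tree AS (
        SELECT id FROM objects WHERE id = ANY(:ids)
        UNION
        SELECT d.dependency
        FROM dependencies d
        JOIN tree t ON d.dependant = t.id
    )
    SELECT o.*
    FROM objects o
    JOIN tree USING (id)
""")

with engine.connect() as conn:
    rows = conn.execute(query, {'ids': [1]}).fetchall()  # A, B, C, D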

How do you convert two columns of vectors into one PySpark data frame?

After having run PolynomialExpansion on a Pyspark dataframe, I have a data frame (polyDF) that looks like this:
+--------------------+--------------------+
| features| polyFeatures|
+--------------------+--------------------+
|(81,[2,9,26],[13....|(3402,[5,8,54,57,...|
|(81,[4,16,20,27,3...|(3402,[14,19,152,...|
|(81,[4,27],[1.0,1...|(3402,[14,19,405,...|
|(81,[4,27],[1.0,1...|(3402,[14,19,405,...|
The "features" column includes the features included in the original data. Each row represents a different user. There are 81 total possible features for each user in the original data. The "polyFeatures" column includes the features after the polynomial expansion has been run. There are 3402 possible polyFeatures after running PolynomialExpansion. So what each row of both columns contain are:
An integer representing the number of possible features (each user may or may not have had a value in each of the features).
A list of integers that contains the feature indexes for which that user had a value.
A list of numbers that contains the values of each of the features mentioned in #2 above.
My question is, how can I take these two columns, create two sparse matrices, and subsequently join them together to get one, full, sparse Pyspark matrix? Ideally it would look like this:
+---+----+----+----+------+----+----+----+----+---+---
| 1 | 2 | 3 | 4 | ... |405 |406 |407 |408 |409|...
+---+----+----+----+------+----+----+----+----+---+---
| 0 | 13 | 0 | 0 | ... | 0 | 0 | 0 | 6 | 0 |...
| 0 | 0 | 0 | 9 | ... | 0 | 0 | 0 | 0 | 0 |...
| 0 | 0 | 0 | 1.0| ... | 3 | 0 | 0 | 0 | 0 |...
| 0 | 0 | 0 | 1.0| ... | 3 | 0 | 0 | 0 | 0 |...
I have reviewed the Spark documentation for PolynomialExpansion located here, but it doesn't cover this particular issue. I have also tried to apply the SparseVector class, which is documented here, but it seems to be useful for only one vector rather than a data frame of vectors.
Is there an effective way to accomplish this?
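One possible route, assuming Spark 3.0+ where pyspark.ml.functions.vector_to_array is available, is to turn each vector column into a dense array and then split the arrays into individual columns; note that this densifies the data (3402 extra columns here), which can be expensive at scale.
from pyspark.ml.functions import vector_to_array
from pyspark.sql import functions as F

# Expand both sparse vector columns into dense arrays.
dense = polyDF.select(
    vector_to_array('features').alias('f'),
    vector_to_array('polyFeatures').alias('pf'),
)

# Split each array into one column per feature index.
num_features, num_poly = 81, 3402
result = dense.select(
    *[F.col('f')[i].alias('f_{}'.format(i)) for i in range(num_features)],
    *[F.col('pf')[i].alias('pf_{}'.format(i)) for i in range(num_poly)],
)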
