meanshift clustering using pyspark - python

We're trying to migrate a vanilla Python code base to PySpark. The goal is to do some filtering on a dataframe (previously pandas, now Spark), group it by user-ids, and finally apply meanshift clustering on top.
I'm using pandas_udf(df.schema, PandasUDFType.GROUPED_MAP) on the grouped data. But now there's a problem with how the final output should be represented.
Let's say we have two columns in the input dataframe, user-id and location. For each user we need to compute all clusters (on the location), retain only the biggest one, and then return its attributes, which form a 3-dimensional vector; call its components col-1, col-2 and col-3. The only approach I can think of is to create the original dataframe with 5 columns, with these 3 fields set to None using something like withColumn('col-i', lit(None).astype(FloatType())), and then populate the three columns in the first row of each user. But this seems like a really ugly way of doing it, and it unnecessarily wastes a lot of space, because apart from the first row, every entry in col-1, col-2 and col-3 would be null. The output dataframe would look something like this:
+---------+----------+-------+-------+-------+
| user-id | location | col-1 | col-2 | col-3 |
+---------+----------+-------+-------+-------+
| 02751a9 | 0.894956 | 21.9  | 31.5  | 54.1  |
| 02751a9 | 0.811956 | null  | null  | null  |
| 02751a9 | 0.954956 | null  | null  | null  |
| ...     | ...      | ...   | ...   | ...   |
| 02751a9 | 0.811956 | null  | null  | null  |
+---------+----------+-------+-------+-------+
| 0af2204 | 0.938011 | 11.1  | 12.3  | 53.3  |
| 0af2204 | 0.878081 | null  | null  | null  |
| 0af2204 | 0.933054 | null  | null  | null  |
| 0af2204 | 0.921342 | null  | null  | null  |
| ...     | ...      | ...   | ...   | ...   |
| 0af2204 | 0.978081 | null  | null  | null  |
+---------+----------+-------+-------+-------+
This feels so wrong. Is there an elegant way of doing it?
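One way to avoid the mostly-null columns is to let the GROUPED_MAP udf return a reduced frame: the schema passed to pandas_udf does not have to match the input schema, and the returned pandas DataFrame can contain a single row per group. A minimal sketch, assuming scikit-learn's MeanShift; the three attributes used here (centre, size and spread of the biggest cluster) are only placeholders for the real 3-tuple:

import numpy as np
import pandas as pd
from sklearn.cluster import MeanShift
from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

def biggest_cluster_attrs(locations):
    # fit meanshift on one user's locations and describe the largest cluster
    X = np.asarray(locations, dtype=float).reshape(-1, 1)
    ms = MeanShift().fit(X)
    labels, counts = np.unique(ms.labels_, return_counts=True)
    biggest = labels[counts.argmax()]
    members = X[ms.labels_ == biggest]
    # placeholder attributes: cluster centre, member count, spread
    return float(ms.cluster_centers_[biggest][0]), float(members.size), float(members.std())

out_schema = StructType([StructField("user-id", StringType()),
                         StructField("col-1", DoubleType()),
                         StructField("col-2", DoubleType()),
                         StructField("col-3", DoubleType())])

@pandas_udf(out_schema, PandasUDFType.GROUPED_MAP)
def summarise_user(pdf):
    # pdf is the pandas frame for one user; return exactly one summary row
    c1, c2, c3 = biggest_cluster_attrs(pdf["location"].values)
    return pd.DataFrame({"user-id": [pdf["user-id"].iloc[0]],
                         "col-1": [c1], "col-2": [c2], "col-3": [c3]})

result = df.groupBy("user-id").apply(summarise_user)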

What I ended up doing was grouping the dataframe by user-id and applying functions.collect_list to the columns, so that each cell contains a list and each user has only one row. I then applied meanshift clustering to each row's data.
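A minimal sketch of that collect_list route, reusing the biggest_cluster_attrs helper from the sketch above (the attribute names are still placeholders):

from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, DoubleType

# plain UDF: one list of locations in, one 3-element attribute vector out
attrs_udf = F.udf(lambda locs: list(biggest_cluster_attrs(locs)), ArrayType(DoubleType()))

per_user = (df.groupBy("user-id")
              .agg(F.collect_list("location").alias("locations"))
              .withColumn("attrs", attrs_udf("locations"))
              .select("user-id",
                      F.col("attrs")[0].alias("col-1"),
                      F.col("attrs")[1].alias("col-2"),
                      F.col("attrs")[2].alias("col-3")))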

Related

How to map 2 dataset to check if a value from Dataset_A is present in Dataset_B and create a new column in Dataset_A as 'Present or Not'?

I am working on two datasets in PySpark, let's say Dataset_A and Dataset_B. I want to check whether the 'P/N' column in Dataset_A is present in the 'Assembly_P/N' column in Dataset_B, and then create a new column in Dataset_A titled 'Present or Not' with the value 'Present' or 'Not Present' depending on the result.
PS. Both datasets are huge and I am trying to figure out an efficient way to do this without actually joining the tables.
Sample data:
Dataset_A
| P/N |
| -------- |
| 1bc |
| 2df |
| 1cd |
Dataset_B
| Assembly_P/N |
| -------- |
| 1bc |
| 6gh |
| 2df |
Expected Result
Dataset_A
| P/N | Present or Not |
| -------- | -------- |
| 1bc | Present |
| 2df | Present |
| 1cd | Not Present |
from pyspark.sql.functions import udf
from pyspark.sql.functions import when, col, lit
from pyspark.sql.types import StringType

def check_value(PN):
    if dataset_B(col("Assembly_P/N")).isNotNull().rlike("%PN%"):
        return 'Present'
    else:
        return 'Not Present'

check_value_udf = udf(check_value, StringType())
dataset_A = dataset_A.withColumn('Present or Not', check_value_udf(dataset_A['P/N']))
I am getting a PicklingError when I run this.
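A minimal sketch of one common way to express this lookup without a Python UDF (the PicklingError most likely comes from the UDF closing over dataset_B, which cannot be serialized). Although the question asks to avoid a join, a left join against the distinct Assembly_P/N values, broadcast only if that key set is small enough, is usually the efficient way to do this in Spark; the _match helper column is purely illustrative:

from pyspark.sql import functions as F

# distinct lookup keys from Dataset_B, renamed to match Dataset_A's column
b_keys = (dataset_B
          .select(F.col("Assembly_P/N").alias("P/N"))
          .distinct()
          .withColumn("_match", F.lit(1)))

# left join: rows with no match get a null _match
dataset_A = (dataset_A
             .join(F.broadcast(b_keys), on="P/N", how="left")   # drop broadcast() if b_keys is large
             .withColumn("Present or Not",
                         F.when(F.col("_match").isNotNull(), "Present")
                          .otherwise("Not Present"))
             .drop("_match"))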

Update a column data w.r.t values in other columns regex match in dataframes

I have a data frame with more than 1,000,000 rows and 15 columns.
I have to create new columns and assign their values based on the string values in other columns, matching them either with a regex or with an exact character match.
For example, there is a column called File path. I have to create a feature column whose value is assigned by matching a given folder path (full or partial) against the file path.
I thought about iterating with a for loop, but that takes a lot of time, and iterating in pandas would only get slower if the number of components to loop over grows in the future.
Is there an efficient way to do this type of operation in pandas? Please help me with this.
Example:
I have a df as:
| ID | File |
| -------- | -------------- |
| 1 | SWE_Toot |
| 2 | SWE_Thun |
| 3 | IDH_Toet |
| 4 | SDF_Then |
| 5 | SWE_Toot |
| 6 | SWE_Thun |
| 7 | SEH_Toot |
| 8 | SFD_Thun |
I will get the components in other tables, like:
| Component | File |
| -------- | -------------- |
| Software | */SWE_Toot/*.h |
|          | */IDH_Toet/*.c |
|          | */SFD_Toto/*.c |
and a second one as:
| Component | File |
| -------- | -------------- |
| Wire | */SDF_Then/*.h |
|      | */SFD_Thun/*.c |
|      | */SFD_Toto/*.c |
and so on; in total there are around 1,000,000 files and 278 components.
The result I want is:
| ID | File |Component|
| -------- | -------------- |---------|
| 1 | SWE_Toot |Software |
| 2 | SWE_Thun |Other |
| 3 | IDH_Toet |Software |
| 4 | SDF_Then |Wire |
| 5 | SWE_Toto |Various |
| 6 | SWE_Thun |Other |
| 7 | SEH_Toto |Various |
| 8 | SFD_Thun |Wire |
Other - filled in at the end, once all the regexes have been checked and the file does not belong to any component.
Various - it may belong to more than one component (or we can give the list of components it belongs to).
I was able to read the component tables and build the regexes, but to create the Component column I would have to write for loops over all 278 components and scan the whole table once per component.
Is there an easier way to do this with pandas, given that the data will be very large?
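A vectorised sketch of one way to do this in pandas, assuming the component tables have been read into a dict that maps each component name to its list of patterns (the dict below just mirrors the sample tables). Each component's patterns are reduced to the folder token between the slashes and combined into a single regex, so there is one str.contains pass per component instead of a Python loop over rows:

import re
import numpy as np
import pandas as pd

# component name -> list of glob-like patterns, as read from the component tables
components = {
    "Software": ["*/SWE_Toot/*.h", "*/IDH_Toet/*.c", "*/SFD_Toto/*.c"],
    "Wire":     ["*/SDF_Then/*.h", "*/SFD_Thun/*.c", "*/SFD_Toto/*.c"],
}

def to_regex(patterns):
    # keep only the folder name between the slashes and OR the escaped tokens together
    tokens = [re.escape(p.split("/")[1]) for p in patterns]
    return "|".join(tokens)

# one boolean column per component, computed vectorised over the File column
matches = pd.DataFrame({name: df["File"].str.contains(to_regex(pats))
                        for name, pats in components.items()})

n_hits = matches.sum(axis=1)
df["Component"] = np.where(n_hits == 0, "Other",
                  np.where(n_hits > 1, "Various",
                           matches.idxmax(axis=1)))   # the single matching component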

How do I reference a sub-column from a pivoted table inside a method such as sort_values?

This table has two levels of columns after pivoting col2. I want to sort the table by df['col3']['A'], but in .sort_values() you can apparently only use a string or a list of strings to reference column(s).
I know that for this specific problem I can just sort the table before pivoting, but the same issue applies to other methods, e.g. df.style.set_properties(subset=...), on any dataframe with multi-level/hierarchical columns.
EDIT:
Table here:
+------+---------------------+---------------------+
| | col3 | col4 |
+------+----------+----------+----------+----------+
| col2 | A | B | A | B |
+------+----------+----------+----------+----------+
| col1 | | | | |
+------+----------+----------+----------+----------+
| 1 | 0.242926 | 0.181175 | 0.189465 | 0.338340 |
| 2 | 0.240864 | 0.494611 | 0.211883 | 0.614739 |
| 3 | 0.052051 | 0.757591 | 0.361446 | 0.389341 |
+------+----------+----------+----------+----------+
Found the answer here: Sort Pandas Pivot Table by the margin ('All') values column
Basically, just put the column and sub-column(s) in a tuple; in my case it is just .sort_values(('col3', 'A')).
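For completeness, a small self-contained example (the data is made up to mirror the pivoted table above):

import numpy as np
import pandas as pd

df = pd.DataFrame({"col1": [1, 1, 2, 2, 3, 3],
                   "col2": ["A", "B", "A", "B", "A", "B"],
                   "col3": np.random.rand(6),
                   "col4": np.random.rand(6)})
pivoted = df.pivot(index="col1", columns="col2", values=["col3", "col4"])

# sort by the ('col3', 'A') sub-column
pivoted = pivoted.sort_values(("col3", "A"))

# the same tuple style works for other column arguments on MultiIndex columns, e.g.
# pivoted.style.set_properties(subset=[("col3", "A")], **{"background-color": "yellow"})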

graphlab - sframe : How to remove rows which have same ids and condition on a column?

I have a GraphLab SFrame in which a few rows have the same id value in the "uid" column.
+-------------------+----------------------+----------------------+--------------+
| VIM Document Type | Vendor Number & Zone | Value <5000 or >5000 | Today Status |
+-------------------+----------------------+----------------------+--------------+
| PO_VR_GLB | 1613407EMEAi | Less than 5000 | 0 |
| PO_VR_GLB | 249737LATIN AMERICA | More than 5000 | 1 |
| PO_MN_GLB | 1822317NORTH AMERICA | Less than 5000 | 1 |
| PO_MN_GLB | 1822317NORTH AMERICA | Less than 5000 | 1 |
| PO_MN_GLB | 1822317NORTH AMERICA | Less than 5000 | 1 |
| PO_MN_GLB | 1216902NORTH AMERICA | More than 5000 | 1 |
| PO_MN_GLB | 1213709EMEAi | Less than 5000 | 0 |
| PO_MN_GLB | 882843NORTH AMERICA | More than 5000 | 1 |
| PO_MN_GLB | 2131503ASIA PACIFIC | More than 5000 | 1 |
| PO_MN_GLB | 2131503ASIA PACIFIC | More than 5000 | 1 |
+-------------------+----------------------+----------------------+--------------+
+---------------------+
| uid |
+---------------------+
| 63068$#069 |
| 5789$#13 |
| 12933036$#IN6532618 |
| 12933022$#IN6590132 |
| 12932349$#IN6636468 |
| 12952077$#203250 |
| 13012770$#MUML04184 |
| 12945049$#112370 |
| 13582330$#CI160118 |
| 13012770$#MUML04184 |
+---------------------+
Here I want to retain all rows with unique uids, and for rows that share a uid, keep only one of them, preferably a row with Today Status = 1 (there can be rows where the uid and Today Status are the same but other fields differ; in that case any one of those rows can be kept). I want to do these operations on GraphLab SFrames, but I am unable to figure out how to proceed.
You may use SFrame.unique(), which gives you unique rows:
sf = sf.unique()
Another way is to use the groupby() or join() methods, where you can specify the column names and work from there. You can read their documentation on turi.com for the various options.
Yet another way (that I personally prefer) is to convert the SFrame to a pandas DataFrame, do the data operations there, and convert the result back to an SFrame. It depends on your choice, and I hope this helps.
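Following that pandas route, a minimal sketch (column names as in the question, with Today Status assumed to be 0/1 so that a descending sort puts the status-1 rows first):

import graphlab as gl

pdf = sf.to_dataframe()                                   # SFrame -> pandas DataFrame
pdf = (pdf.sort_values("Today Status", ascending=False)   # rows with Today Status == 1 first
          .drop_duplicates(subset="uid", keep="first"))   # keep one row per uid
sf = gl.SFrame(pdf)                                       # back to an SFrame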

How to efficiently generate a special co-author network in python pandas?

I'm trying to generate a network graph of individual authors given a table of articles. The table I start with lists articles, with a single column for the "lead author" and a single column for the "co-author". Since each article can have up to 5 authors, article rows may repeat, as such:
| paper_ID | project_name | lead_id | co_lead_id | published |
|----------+--------------+---------+------------+-----------|
| 1234 | "fubar" | 999 | 555 | yes |
| 1234 | "fubar" | 999 | 234 | yes |
| 1234 | "fubar" | 999 | 115 | yes |
| 2513 | "fubar2" | 765 | 369 | no |
| 2513 | "fubar2" | 765 | 372 | no |
| 5198 | "fubar3" | 369 | 325 | yes |
My end goal is to have a nodes table, where each row is a unique author, and an edge table, where each row contains source and target author_id columns. The edges table is trivial, as I can merely create a dataframe using the requisite columns of the article table.
For example, for the above table I would have the following node table:
| author_id | is_published |
|-----------+--------------|
| 999 | yes |
| 555 | yes |
| 234 | yes |
| 115 | yes |
| 765 | no |
| 369 | yes |
| 372 | no |
| 325 | yes |
Notice how "is_published" shows whether the author was ever a lead or co-author on at least one published paper. This is where I'm running into trouble creating a nodes table efficiently. Currently I iterate through every row in the article table and check whether an author already exists in the nodes table and whether to turn on the "is_published" flag. See the following code snippet as an example:
articles = pd.read_excel('excel_file_with_articles_table')
nodes = pd.DataFrame(columns=['is_published'])
nodes.index.name = 'author_id'
for row in articles.itertuples():
    if row.lead_id not in nodes.index:
        author = pd.Series([False], index=["is_published"], name=row.lead_id)
        nodes = nodes.append(author)
    if row.co_lead_id not in nodes.index:
        investigator = pd.Series([False], index=["is_published"], name=row.co_lead_id)
        nodes = nodes.append(investigator)
    if row.published == "yes":
        nodes.at[row.lead_id, "is_published"] = True
        nodes.at[row.co_lead_id, "is_published"] = True
For my data set (with tens of thousands of rows), this is somewhat slow, and I understand that loops should be avoided when possible when using pandas dataframes. I feel like the pandas apply function may be able to do what I need, but I'm at a loss as to how to implement it.
With df as your first DataFrame, you should be able to:
nodes = pd.concat([
    df.loc[:, ['lead_id', 'is_published']].rename(columns={'lead_id': 'author_id'}),
    df.loc[:, ['co_lead_id', 'is_published']].rename(columns={'co_lead_id': 'author_id'}),
]).drop_duplicates()
for a unique list of author ids (taken from both lead_id and co_lead_id) with their respective is_published information.
To keep only the is_published=True entry when an author has both a True and a False entry:
nodes = nodes.sort_values('is_published', ascending=False).drop_duplicates(subset=['author_id'])
.sort_values() will sort True (==1) before False, and .drop_duplicates() by default keeps the first occurrence (see docs). With this addition I guess you don't really need the first .drop_duplicates() anymore.
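For reference, a small end-to-end run on the sample table. The question's published column holds 'yes'/'no' strings, so it is first converted to a boolean is_published column, which is what the snippet above assumes:

import pandas as pd

df = pd.DataFrame({"lead_id":    [999, 999, 999, 765, 765, 369],
                   "co_lead_id": [555, 234, 115, 369, 372, 325],
                   "published":  ["yes", "yes", "yes", "no", "no", "yes"]})
df["is_published"] = df["published"].eq("yes")

nodes = pd.concat([
    df.loc[:, ["lead_id", "is_published"]].rename(columns={"lead_id": "author_id"}),
    df.loc[:, ["co_lead_id", "is_published"]].rename(columns={"co_lead_id": "author_id"}),
]).drop_duplicates()

# keep the True row when an author has both True and False entries
nodes = (nodes.sort_values("is_published", ascending=False)
              .drop_duplicates(subset=["author_id"]))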
