Let's assume that we have data like this (sorted by time) and have created dummy columns for the classes in a PySpark dataframe:
ID class e_TYPE_B e_TYPE_C e_TYPE_L e_TYPE_A e_TYPE_E e_TYPE_G
1 G 0 0 0 0 0 1
1 B 1 0 0 0 0 0
1 B 1 0 0 0 0 0
2 E 0 0 0 0 1 0
2 E 0 0 0 0 1 0
2 C 0 1 0 0 0 0
2 C 0 1 0 0 0 0
2 E 0 0 0 0 1 0
2 E 0 0 0 0 1 0
3 L 0 0 1 0 0 0
3 L 0 0 1 0 0 0
3 B 1 0 0 0 0 0
3 E 0 0 0 0 1 0
4 A 0 0 0 1 0 0
4 A 0 0 0 1 0 0
5 B 1 0 0 0 0 0
5 B 1 0 0 0 0 0
5 A 0 0 0 1 0 0
5 A 0 0 0 1 0 0
Now, I am trying to find out the count of IDs moving from one class to another. The move can be consecutive or have other classes in between. The relation should be created for each class from top to bottom on a per-ID basis.
For Example,
ID 1 goes from G to B then 1 should be added to G to B counter,
ID 2 goes from E to C then 1 should be added to E to C counter,
ID 2 goes from C to E then 1 should be added to C to E counter,
ID 3 goes from L to B then 1 should be added to L to B counter,
ID 3 goes from B to E then 1 should be added to B to E counter,
Also ID 3 goes from L to E then 1 should be added to L to E counter,
ID 4 has only one class, so it should be discarded.
I thought of using a Window operation partitioned on ID, but what I am struggling with is how to iterate over each partition to calculate the counts of the class relations above.
Please provide the solution/code snippet for this.
Thanks.
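To illustrate the pairing logic described above, here is a minimal pandas/itertools sketch (the frame below is my reconstruction of the sample data; a PySpark version could collect each ID's ordered classes with collect_list and apply the same pairing):

```python
from collections import Counter
from itertools import combinations

import pandas as pd

# Hypothetical recreation of the sample data, already in time order
df = pd.DataFrame({
    'ID':    [1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 5, 5, 5, 5],
    'class': ['G', 'B', 'B', 'E', 'E', 'C', 'C', 'E', 'E',
              'L', 'L', 'B', 'E', 'A', 'A', 'B', 'B', 'A', 'A'],
})

counter = Counter()
for _, grp in df.groupby('ID', sort=False):
    # collapse consecutive repeats of the same class, keeping time order
    seq = grp['class'][grp['class'].shift() != grp['class']].tolist()
    if len(seq) < 2:
        continue  # IDs with a single class are discarded
    # every ordered pair (earlier class -> later class), consecutive or not
    for a, b in combinations(seq, 2):
        if a != b:
            counter[(a, b)] += 1

print(counter)
```

Consecutive duplicates are collapsed first, so repeated rows of the same class do not inflate the counts; whether same-class pairs such as E to E should also count is an open assumption and is excluded here.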
My apologies, SO community; I am a newbie on the platform, and in the pursuit of making this question precise and straight to the point, I didn't give relevant info.
My Input Dataframe is:
import pandas as pd
data = {'user_id': ['abc','def','ghi'],
        'alpha': ['A','B,C,D,A','B,C,A'],
        'beta': ['1|20|30','350','376|98']}
df = pd.DataFrame(data = data, columns = ['user_id','alpha','beta'])
print(df)
Looks like this,
user_id alpha beta
0 abc A 1|20|30
1 def B,C,D,A 350
2 ghi B,C,A 376|98
I want something like this,
user_id alpha beta a_A a_B a_C a_D b_1 b_20 b_30 b_350 b_376 b_98
0 abc A 1|20|30 1 0 0 0 1 1 1 0 0 0
1 def B,C,D,A 350 1 1 1 1 0 0 0 1 0 0
2 ghi B,C,A 376|98 1 1 1 0 0 0 0 0 1 1
My original data contains 11K rows, and the distinct values in alpha & beta number around 550.
I created a list from all the values in the alpha & beta columns and applied pd.get_dummies, but it results in a lot of rows like the output displayed by @wwwnde. I would like all the rows to be rolled up based on user_id.
A similar idea is used by CountVectorizer on documents, where it creates columns based on all the words in a sentence and checks the frequency of each word. However, I am guessing Pandas has a better and more efficient way to do that.
Grateful for all your assistance. :)
You will have to achieve that in a series of steps.
Sample Data
id ALPHA BETA
0 1 A 1|20|30
1 2 B,C,D,A 350
2 3 B,C,A 395|45|90
Create Lists for values in ALPHA and BETA
df.BETA=df.BETA.apply(lambda x: x.split('|'))
df.ALPHA=df.ALPHA.apply(lambda x: x.split(','))
Disintegrate the list elements into individuals
df=df.explode('ALPHA')
df=df.explode('BETA')
Extract the indicator variables using get_dummies (remember to assign the result back):
df = pd.get_dummies(df)
Strip the columns of the prefix (regex=True is required in recent pandas for the alternation pattern):
df.columns=df.columns.str.replace('ALPHA_|BETA_','',regex=True)
id A B C D 1 20 30 350 395 45 90
0 1 1 0 0 0 1 0 0 0 0 0 0
0 1 1 0 0 0 0 1 0 0 0 0 0
0 1 1 0 0 0 0 0 1 0 0 0 0
1 2 0 1 0 0 0 0 0 1 0 0 0
1 2 0 0 1 0 0 0 0 1 0 0 0
1 2 0 0 0 1 0 0 0 1 0 0 0
1 2 1 0 0 0 0 0 0 1 0 0 0
2 3 0 1 0 0 0 0 0 0 1 0 0
2 3 0 1 0 0 0 0 0 0 0 1 0
2 3 0 1 0 0 0 0 0 0 0 0 1
2 3 0 0 1 0 0 0 0 0 1 0 0
2 3 0 0 1 0 0 0 0 0 0 1 0
2 3 0 0 1 0 0 0 0 0 0 0 1
2 3 1 0 0 0 0 0 0 0 1 0 0
2 3 1 0 0 0 0 0 0 0 0 1 0
2 3 1 0 0 0 0 0 0 0 0 0 1
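Since the desired output has one row per user_id, the exploded dummies above can be rolled back up with a groupby; a minimal end-to-end sketch under that assumption (column names follow the sample data in this answer):

```python
import pandas as pd

# Sample data as in the answer above
df = pd.DataFrame({'id': [1, 2, 3],
                   'ALPHA': ['A', 'B,C,D,A', 'B,C,A'],
                   'BETA': ['1|20|30', '350', '395|45|90']})
df.ALPHA = df.ALPHA.apply(lambda x: x.split(','))
df.BETA = df.BETA.apply(lambda x: x.split('|'))
df = df.explode('ALPHA').explode('BETA')

dummies = pd.get_dummies(df.set_index('id'))
dummies.columns = dummies.columns.str.replace('ALPHA_|BETA_', '', regex=True)

# collapse the exploded rows back into one indicator row per id
rolled = dummies.groupby(level='id').max().astype(int)
print(rolled)
```

Using max (rather than sum) keeps the columns as 0/1 indicators even though the double explode produces a cartesian product of ALPHA and BETA values per id.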
I have a dictionary table with bit mask ids
see below:
I'd like to transform it to this structure:
each row's tag will become a column, and its value will be the (bitwise) combination.
example:
the value 3 is a combination of 1 and 2, so a will be given 1, b will be given 1, and all the rest of the columns will be 0.
I've implemented it using a SQL Server stored procedure with bitwise operator "&".
I'd like to implement this transform using python (I assume it would be done with pandas),
As you can see, each tag is 2 to the power of n, so I tried to tackle it using a transformation from decimal to binary - which gives me exactly what I need - but I'm missing the stage of how to attach each bit to the correct column.
For example, 3 is represented as 11 in binary, so I'd like to assign a with 1 and b with 1, and all the rest should be 0.
The source table may have additional entries added, so the output should alter the destination table with the new row (for example n, 4096) as a new column m, which will be assigned 1 or 0 depending on the value.
Any suggestions how to approach this using python/pandas?
Use numpy broadcasting with bit shifting (>>) to convert the integers into columns of binary flags; then, for the final column with all the combinations, use DataFrame.dot with the column names plus a separator:
import numpy as np
import pandas as pd

df = pd.DataFrame({'mask_id':range(1, 17)})
#list or Series of tags
L = list('abcdefghijklm')
#L = df2['Tags']
a = df.mask_id.to_numpy()
n = len(L)
data = (a[:, None] >> np.arange(n)) & 1
df1 = pd.DataFrame(data, index=df.index, columns=L)
df1['combinations'] = df1.dot(df1.columns + ',').str.rstrip(',')
print (df1)
a b c d e f g h i j k l m combinations
0 1 0 0 0 0 0 0 0 0 0 0 0 0 a
1 0 1 0 0 0 0 0 0 0 0 0 0 0 b
2 1 1 0 0 0 0 0 0 0 0 0 0 0 a,b
3 0 0 1 0 0 0 0 0 0 0 0 0 0 c
4 1 0 1 0 0 0 0 0 0 0 0 0 0 a,c
5 0 1 1 0 0 0 0 0 0 0 0 0 0 b,c
6 1 1 1 0 0 0 0 0 0 0 0 0 0 a,b,c
7 0 0 0 1 0 0 0 0 0 0 0 0 0 d
8 1 0 0 1 0 0 0 0 0 0 0 0 0 a,d
9 0 1 0 1 0 0 0 0 0 0 0 0 0 b,d
10 1 1 0 1 0 0 0 0 0 0 0 0 0 a,b,d
11 0 0 1 1 0 0 0 0 0 0 0 0 0 c,d
12 1 0 1 1 0 0 0 0 0 0 0 0 0 a,c,d
13 0 1 1 1 0 0 0 0 0 0 0 0 0 b,c,d
14 1 1 1 1 0 0 0 0 0 0 0 0 0 a,b,c,d
15 0 0 0 0 1 0 0 0 0 0 0 0 0 e
If you need the combinations as lists, use a list comprehension (select only the tag columns, since df1 now also holds the combinations column):
cols = np.array(L)
df1['combinations'] = [cols[x].tolist() for x in df1[L].to_numpy().astype(bool)]
print (df1)
a b c d e f g h i j k l m combinations
0 1 0 0 0 0 0 0 0 0 0 0 0 0 [a]
1 0 1 0 0 0 0 0 0 0 0 0 0 0 [b]
2 1 1 0 0 0 0 0 0 0 0 0 0 0 [a, b]
3 0 0 1 0 0 0 0 0 0 0 0 0 0 [c]
4 1 0 1 0 0 0 0 0 0 0 0 0 0 [a, c]
5 0 1 1 0 0 0 0 0 0 0 0 0 0 [b, c]
6 1 1 1 0 0 0 0 0 0 0 0 0 0 [a, b, c]
7 0 0 0 1 0 0 0 0 0 0 0 0 0 [d]
8 1 0 0 1 0 0 0 0 0 0 0 0 0 [a, d]
9 0 1 0 1 0 0 0 0 0 0 0 0 0 [b, d]
10 1 1 0 1 0 0 0 0 0 0 0 0 0 [a, b, d]
11 0 0 1 1 0 0 0 0 0 0 0 0 0 [c, d]
12 1 0 1 1 0 0 0 0 0 0 0 0 0 [a, c, d]
13 0 1 1 1 0 0 0 0 0 0 0 0 0 [b, c, d]
14 1 1 1 1 0 0 0 0 0 0 0 0 0 [a, b, c, d]
15 0 0 0 0 1 0 0 0 0 0 0 0 0 [e]
Assuming you want binary representations, here is an approach that doesn't need a previous dataset:
import pandas as pd

cols = ['a','b','c','d','e','f','g','h','i','j','k','l']
df = [list(('0'*(12-1)+"{0:b}".format(1))[::-1])]
for i in range(16):
    n = "{0:b}".format(i)
    df = df + [list(('0'*(12-len(n))+n)[::-1])]
df = pd.DataFrame(df, columns = cols)
df["combinations"] = df.apply(lambda x: list(x[x == '1'].index) ,axis = 1)
Output:
a b c d e f g h i j k l combinations
0 1 0 0 0 0 0 0 0 0 0 0 0 [a]
1 0 0 0 0 0 0 0 0 0 0 0 0 []
2 1 0 0 0 0 0 0 0 0 0 0 0 [a]
3 0 1 0 0 0 0 0 0 0 0 0 0 [b]
4 1 1 0 0 0 0 0 0 0 0 0 0 [a, b]
5 0 0 1 0 0 0 0 0 0 0 0 0 [c]
6 1 0 1 0 0 0 0 0 0 0 0 0 [a, c]
7 0 1 1 0 0 0 0 0 0 0 0 0 [b, c]
8 1 1 1 0 0 0 0 0 0 0 0 0 [a, b, c]
9 0 0 0 1 0 0 0 0 0 0 0 0 [d]
10 1 0 0 1 0 0 0 0 0 0 0 0 [a, d]
11 0 1 0 1 0 0 0 0 0 0 0 0 [b, d]
12 1 1 0 1 0 0 0 0 0 0 0 0 [a, b, d]
13 0 0 1 1 0 0 0 0 0 0 0 0 [c, d]
14 1 0 1 1 0 0 0 0 0 0 0 0 [a, c, d]
15 0 1 1 1 0 0 0 0 0 0 0 0 [b, c, d]
16 1 1 1 1 0 0 0 0 0 0 0 0 [a, b, c, d]
In the pandas data frame, the one-hot encoded vectors are present as columns, i.e.:
Rows A B C D E
0 0 0 0 1 0
1 0 0 1 0 0
2 0 1 0 0 0
3 0 0 0 1 0
4 1 0 0 0 0
5 0 0 0 0 1
How to convert these columns into one data frame column by label encoding them in python? i.e:
Rows A
0 4
1 3
2 2
3 4
4 1
5 5
I also need a suggestion on rows that have multiple 1s: how should those rows be handled, given that we can have only one category at a time?
Try with argmax
#df=df.set_index('Rows')
df['New']=df.values.argmax(1)+1
df
Out[231]:
A B C D E New
Rows
0 0 0 0 1 0 4
1 0 0 1 0 0 3
2 0 1 0 0 0 2
3 0 0 0 1 0 4
4 1 0 0 0 0 1
5 0 0 0 0 1 5
argmax is the way to go; adding another way using idxmax and get_indexer:
df['New'] = df.columns.get_indexer(df.idxmax(1))+1
#df.idxmax(1).map(df.columns.get_loc)+1
print(df)
Rows A B C D E New
0 0 0 0 1 0 4
1 0 0 1 0 0 3
2 0 1 0 0 0 2
3 0 0 0 1 0 4
4 1 0 0 0 0 1
5 0 0 0 0 1 5
Also need suggestion on this that some rows have multiple 1s, how to
handle those rows because we can have only one category at a time.
In this case you dot your DataFrame of dummies with an array of all the powers of 2 (based on the number of columns). This ensures that the presence of any unique combination of dummies (A, A+B, A+B+C, B+C, ...) will have a unique category label. (Added a few rows at the bottom to illustrate the unique counting)
df['Category'] = df.dot(2**np.arange(df.shape[1]))
A B C D E Category
Rows
0 0 0 0 1 0 8
1 0 0 1 0 0 4
2 0 1 0 0 0 2
3 0 0 0 1 0 8
4 1 0 0 0 0 1
5 0 0 0 0 1 16
6 1 0 0 0 1 17
7 0 1 0 0 1 18
8 1 1 0 0 1 19
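As a sanity check on the uniqueness claim, the category number round-trips back to the original dummies with the same bit shifting used in the broadcasting answer above (the small frame here is a made-up illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical dummy rows mirroring the table above
df = pd.DataFrame({'A': [0, 1, 1], 'B': [0, 0, 1], 'C': [0, 0, 0],
                   'D': [1, 0, 0], 'E': [0, 1, 1]})
cat = df.dot(2 ** np.arange(df.shape[1]))  # encode: D -> 8, A+E -> 17, A+B+E -> 19

# unpack each category number back into its bits
decoded = (cat.to_numpy()[:, None] >> np.arange(df.shape[1])) & 1
assert (decoded == df.to_numpy()).all()  # round-trips exactly
```

Because each column maps to a distinct power of 2, no two different dummy combinations can produce the same category number.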
Another readable solution on top of the other great solutions provided; it works for any truthy values in your dataframe:
df['variables'] = np.where(df.values)[1]+1
output:
A B C D E variables
0 0 0 0 1 0 4
1 0 0 1 0 0 3
2 0 1 0 0 0 2
3 0 0 0 1 0 4
4 1 0 0 0 0 1
5 0 0 0 0 1 5
I have a data set in Excel. A sample of the data is given below. Each row contains a number of items, one item in each column. The data has no headers either.
a b a d
g z f d a
e
dd gg dd g f r t
I want to create a table which should look like the one below. It should count the items in each row and display the counts by row. I don't know a priori how many items are in the table.
row# a b d g z f e dd gg r t
1 2 1 1 0 0 0 0 0 0 0 0
2 1 0 1 1 1 1 0 0 0 0 0
3 0 0 0 0 0 0 1 0 0 0 0
4 0 0 0 1 0 1 0 2 1 1 1
I am not an expert in python and any assistance is very much appreciated.
Use get_dummies + sum:
df = pd.read_csv(file, sep=r'\s+', names=range(100)).stack()  # pad columns to account for ragged rows
df.str.get_dummies().groupby(level=0).sum()  # sum(level=0) was removed in pandas 2.0
a b d dd e f g gg r t z
0 2 1 1 0 0 0 0 0 0 0 0
1 1 0 1 0 0 1 1 0 0 0 1
2 0 0 0 0 1 0 0 0 0 0 0
3 0 0 0 2 0 1 1 1 1 1 0
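For a self-contained run, the same two steps can be tried against an inline stand-in for the file (the StringIO literal below is my assumption; the real data would come from the Excel export):

```python
import io

import pandas as pd

# Inline stand-in for the Excel export
raw = """a b a d
g z f d a
e
dd gg dd g f r t
"""

# pad to enough columns so the ragged rows parse, then stack away the NaNs
df = pd.read_csv(io.StringIO(raw), sep=r'\s+', names=range(10),
                 engine='python').stack()
counts = df.str.get_dummies().groupby(level=0).sum()
print(counts)
```

names=range(10) just needs to be at least as wide as the longest row; unused columns become NaN and are dropped by stack.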
With this code:
from itertools import product
for a, b, c, d in product(range(low, high), repeat=4):
    print(a, b, c, d)
I have an output like this:
0 0 0 0
0 0 0 1
0 0 0 2
0 0 1 0
0 0 1 1
0 0 1 2
0 0 2 0
0 0 2 1
0 0 2 2
but how can I create an algorithm capable of this:
0 0 0 0
0 0 0 1
0 0 0 2
0 0 0 3
0 0 0 4
0 0 1 1
0 0 1 2
0 0 1 3
0 0 1 4
0 0 2 2
0 0 2 3
0 0 2 4
0 0 3 3
0 0 3 4
0 0 4 4
More important: every column of the output must have a different range, for example: first column 0-4, second column 0-10, etc.
And the number of columns (a, b, c, d) isn't fixed; depending on other parts of the program, it can be anywhere from 2 to 200.
UPDATE: to be more comprehensible and clear
what I need is something like that:
for a in range(0, 10):
    for b in range(a, 10):
        for c in range(b, 10):
            for d in range(c, 10):
                print(a, b, c, d)
The question has been partially resolved, but I still have problems with how to change the range parameters, as in the above example.
Excuse me for the mess! :)
itertools.product can already do exactly what you are looking for, simply by passing it multiple iterables (in this case the ranges you want). It will collect one element from each iterable passed. For example:
for a, b, c in product(range(2), range(3), range(4)):
    print(a, b, c)
Outputs:
0 0 0
0 0 1
0 0 2
0 0 3
0 1 0
0 1 1
0 1 2
0 1 3
0 2 0
0 2 1
0 2 2
0 2 3
1 0 0
1 0 1
1 0 2
1 0 3
1 1 0
1 1 1
1 1 2
1 1 3
1 2 0
1 2 1
1 2 2
1 2 3
If your input ranges are variable, just place the loop in a function and call it with different parameters. You can also use something along the lines of
for elements in product(*(range(i) for i in [1,2,3,4])):
    print(*elements)
if you have a large number of input iterables.
With your updated request for the variable ranges, a nice short-circuiting approach with itertools.product is not as clear, although you can always just check that each tuple is sorted in ascending order (as this is essentially what your variable ranges ensure). As per your example:
for elements in product(*(range(i) for i in [10,10,10,10])):
    if all(elements[i] <= elements[i+1] for i in range(len(elements)-1)):
        print(*elements)
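As an aside, when every column shares one range, the nondecreasing tuples kept by the filter above are exactly what itertools.combinations_with_replacement generates directly, without producing and discarding the rest:

```python
from itertools import combinations_with_replacement, product

# filtered-product version from above: keep only nondecreasing 4-tuples
filtered = [e for e in product(range(10), repeat=4)
            if all(e[i] <= e[i + 1] for i in range(3))]

# combinations_with_replacement yields the same tuples, in the same order
direct = list(combinations_with_replacement(range(10), 4))
assert filtered == direct
print(len(direct))  # 715 nondecreasing 4-tuples
```

For per-column maxima that differ, the filtered product remains the general approach; combinations_with_replacement only covers the shared-range case.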
Are you looking for something like this?
# the program would modify these variables below
column1_max = 2
column2_max = 3
column3_max = 4
column4_max = 5
# now generate the list
for a in range(column1_max+1):
    for b in range(column2_max+1):
        for c in range(column3_max+1):
            for d in range(column4_max+1):
                if c > d or b > c or a > b:
                    pass
                else:
                    print(a, b, c, d)
Output:
0 0 0 0
0 0 0 1
0 0 0 2
0 0 0 3
0 0 0 4
0 0 0 5
0 0 1 1
0 0 1 2
0 0 1 3
0 0 1 4
0 0 1 5
0 0 2 2
0 0 2 3
0 0 2 4
0 0 2 5
0 0 3 3
0 0 3 4
0 0 3 5
0 0 4 4
0 0 4 5
0 1 1 1
0 1 1 2
...