shopping basket analysis in python with pandas - python

I have a dataframe containing transaction data. Each row represents one transaction, and the columns indicate whether a product from a category (categories are A-F) was bought (1 = yes, 0 = no). I would like to count how often each pair of categories occurs together in a transaction. My dataframe looks as follows:
A B C D E F
1 1 0 0 0 0
1 0 1 1 0 0
The output should be a matrix counting each pair of categories in the dataframe, like so:
A B C D E F
A 4 2 1 0 4 2
B 5 6 7 3 5 1
C 1 6 5 8 7 9
D ...
E ...
F ...
Does anyone know how to solve this?
Thank you very much!

Use the dot product with its transpose:
df.T.dot(df)
Out:
A B C D E F
A 2 1 1 1 0 0
B 1 1 0 0 0 0
C 1 0 1 1 0 0
D 1 0 1 1 0 0
E 0 0 0 0 0 0
F 0 0 0 0 0 0
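Note that counting pairwise occurrences this way does not scale well: the result has one cell per pair of categories, so it grows quadratically with the number of categories. For larger problems, you might want to look at the apriori algorithm.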
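As a minimal sketch of the dot-product answer, the following rebuilds the two sample transactions from the question and computes the co-occurrence matrix. The variable name `cooc` is only illustrative and does not appear in the original answer.

```python
import pandas as pd

# The two sample transactions from the question (1 = bought, 0 = not bought).
df = pd.DataFrame(
    [[1, 1, 0, 0, 0, 0],
     [1, 0, 1, 1, 0, 0]],
    columns=list("ABCDEF"),
)

# Cell (i, j) sums df[i] * df[j] over all transactions, i.e. it counts the
# transactions in which categories i and j were both bought.
cooc = df.T.dot(df)

print(cooc)
print(cooc.loc["A", "C"])  # number of transactions containing both A and C
```

The diagonal holds the number of transactions per category, and the matrix is symmetric, so only the upper or lower triangle is needed when reading off pairs.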

Related

Delete row and column from pandas dataframe

I have a CSV file which contains a symmetric adjacency matrix, meaning the rows and columns have the same labels.
I would like to import this into a pandas dataframe, ideally have some GUI pop up and ask for a list of items to delete, then take that list, set the values in the corresponding rows and columns to zero, and return a separate altered dataframe.
In short, something that takes the following matrix
a b c d e
a 0 3 5 3 5
b 3 0 2 4 5
c 5 2 0 1 7
d 3 4 1 0 9
e 5 5 7 9 0
Pops up a simple interface asking "which regions should be deleted" and a line to enter those regions
and say c and e are entered
returns
a b c d e
a 0 3 0 3 0
b 3 0 0 4 0
c 0 0 0 0 0
d 3 4 0 0 0
e 0 0 0 0 0
with the altered entries as shown in bold
It should be able to do this for as many areas as are entered, which can be up to 379, ideally separated by commas.
Set columns and rows by index values with DataFrame.loc:
vals = ['c','e']
df.loc[vals, :] = 0
df[vals] = 0
#alternative
#df.loc[:, vals] = 0
print (df)
a b c d e
a 0 3 0 3 0
b 3 0 0 4 0
c 0 0 0 0 0
d 3 4 0 0 0
e 0 0 0 0 0
Another solution is to create a boolean mask with numpy broadcasting and set the values with DataFrame.mask:
# rows as a column vector, columns as a row vector, so the mask has the shape of df
mask = df.index.isin(vals)[:, None] | df.columns.isin(vals)
df = df.mask(mask, 0)
print (df)
a b c d e
a 0 3 0 3 0
b 3 0 0 4 0
c 0 0 0 0 0
d 3 4 0 0 0
e 0 0 0 0 0
Start by importing the csv:
import pandas as pd
adj_matrix = pd.read_csv("file/name/to/your.csv", index_col=0)
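Then request the input and split it into a list of labels:
# input() returns a single string, so split it on commas into a list of labels
regions = [r.strip() for r in input("Please enter the regions to delete, separated by commas: ").split(",")]
adj_matrix.loc[regions, :] = 0
adj_matrix.loc[:, regions] = 0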
Now adj_matrix should be in the form you want.
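The two steps (parsing the comma-separated input, then zeroing rows and columns) could be wrapped in one helper. This is only a sketch: `zero_regions` is a hypothetical name, and the check for unknown labels is an added assumption about the desired behaviour, not part of the original answers. It returns a copy, since the question asks for a separate altered dataframe.

```python
import pandas as pd

def zero_regions(adj, text):
    """Return a copy of adj with the rows and columns named in text set to 0."""
    # Split the comma-separated input; drop whitespace and empty items.
    labels = [part.strip() for part in text.split(",") if part.strip()]
    # Fail early on unknown labels instead of touching the matrix.
    unknown = [lab for lab in labels if lab not in adj.index]
    if unknown:
        raise KeyError(f"unknown regions: {unknown}")
    out = adj.copy()
    out.loc[labels, :] = 0
    out.loc[:, labels] = 0
    return out

names = list("abcde")
adj = pd.DataFrame(
    [[0, 3, 5, 3, 5],
     [3, 0, 2, 4, 5],
     [5, 2, 0, 1, 7],
     [3, 4, 1, 0, 9],
     [5, 5, 7, 9, 0]],
    index=names, columns=names,
)
result = zero_regions(adj, "c, e")
print(result)
```

In an interactive script the second argument would come from `input(...)`; a GUI prompt could supply the same string.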

Creating a data matrix

I am a data scientist and am working with a text file that specifies how many datasets I have for a specific participant by printing the participant's ID on a new line for each dataset. The second column counts the number of different participants, like so
a 1
a 1
a 1
b 2
b 2
c 3
d 4
d 4
d 4
I now need to create a matrix which has a column for each participant and specifies what lines refer to that participant by giving it a value of 1 vs 0. I have over 2000 participants, so I cannot do this by hand or write out all column numbers and what to print where but have to create a rule.
The number of columns in my file will be the number in the last row of column 2 + 2 (in the example that should be 4 + 2 = 6). Basically, for each row, I need to print a 1 in columns that match the (value in column 2 (participants number) + 2). For that row, all other columns get the value of 0. So for row 1, column (1+2=)3 gets a 1, all other columns get a value of 0. For row 2, column (1+2=)3 gets a 1, all other columns get a value of 0, etc.
This should look like this:
a 1 1 0 0 0
a 1 1 0 0 0
a 1 1 0 0 0
b 2 0 1 0 0
b 2 0 1 0 0
c 3 0 0 1 0
d 4 0 0 0 1
d 4 0 0 0 1
d 4 0 0 0 1
I wish I could provide code that I have tried, but I don't know where to start.
Hope anyone can help. Thanks!
awk to the rescue!
$ awk 'NR==FNR{if(max<$2)max=$2; next}
{printf "%s %s", $1,$2;
for(i=1;i<=max;i++) printf " %s", i==$2;
print ""}' file{,}
a 1 1 0 0 0
a 1 1 0 0 0
a 1 1 0 0 0
b 2 0 1 0 0
b 2 0 1 0 0
c 3 0 0 1 0
d 4 0 0 0 1
d 4 0 0 0 1
d 4 0 0 0 1
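With this double-scan approach (file{,} expands to file file in bash, so awk reads the file twice: first to find the maximum participant number, then to print the rows), the participant numbers don't need to be consecutive or sorted.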
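For readers who prefer Python, the same two-pass idea might look roughly like this. It is a sketch, not a translation of the awk script line by line; `indicator_rows` is a hypothetical helper name, and it assumes every non-empty line has exactly two whitespace-separated fields.

```python
def indicator_rows(lines):
    """Append one 0/1 column per participant number to each 'id number' line."""
    rows = [line.split() for line in lines if line.strip()]
    # First pass: find the largest participant number.
    width = max(int(num) for _, num in rows)
    out = []
    # Second pass: put a 1 in the column matching the participant number.
    for pid, num in rows:
        flags = ["1" if col == int(num) else "0" for col in range(1, width + 1)]
        out.append(" ".join([pid, num] + flags))
    return out

sample = ["a 1", "a 1", "a 1", "b 2", "b 2", "c 3", "d 4", "d 4", "d 4"]
for row in indicator_rows(sample):
    print(row)
```

To run it on a file, the list of lines can be read with `open(path).read().splitlines()`.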

Creating table python

I have a data set in excel. A sample of the data is given below. Each row contains a number of items; one item in each column. The data has no headers either.
a b a d
g z f d a
e
dd gg dd g f r t
I want to create a table which should look like the one below. It should count the items in each row and display the counts by row. I don't know a priori how many distinct items are in the table.
row# a b d g z f e dd gg r t
1 2 1 1 0 0 0 0 0 0 0 0
2 1 0 1 1 1 1 0 0 0 0 0
3 0 0 0 0 0 0 1 0 0 0 0
4 0 0 0 1 0 1 0 2 1 1 1
I am not an expert in python and any assistance is very much appreciated.
Use get_dummies + sum:
df = pd.read_csv(file, names=range(100)).stack() # setup to account for missing values
# sum(level=0) was removed in pandas 2.0; groupby(level=0).sum() is the equivalent
df.str.get_dummies().groupby(level=0).sum()
a b d dd e f g gg r t z
0 2 1 1 0 0 0 0 0 0 0 0
1 1 0 1 0 0 1 1 0 0 0 1
2 0 0 0 0 1 0 0 0 0 0 0
3 0 0 0 2 0 1 1 1 1 1 0
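To make the counting step explicit, here is a dependency-free sketch of the same table using `collections.Counter` instead of `get_dummies`. This is a different technique from the answer above, shown only to illustrate what is being computed; `count_table` is a hypothetical name.

```python
from collections import Counter

def count_table(rows):
    """Return (sorted item names, per-row count lists) for ragged rows of items."""
    counts = [Counter(row) for row in rows]
    # The set of distinct items is not known in advance, so collect it first.
    items = sorted(set().union(*counts)) if counts else []
    # A Counter returns 0 for items that do not occur in a row.
    table = [[c[item] for item in items] for c in counts]
    return items, table

rows = [
    ["a", "b", "a", "d"],
    ["g", "z", "f", "d", "a"],
    ["e"],
    ["dd", "gg", "dd", "g", "f", "r", "t"],
]
items, table = count_table(rows)
print(" ".join(items))
for line in table:
    print(" ".join(str(n) for n in line))
```

As in the pandas output, the columns come out in sorted order rather than in order of first appearance.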

for loop to extract header for a dataframe in pandas

I am a newbie in python. I have a data frame that looks like this:
A B C D E
0 1 0 1 0 1
1 0 1 0 0 1
2 0 1 1 1 0
3 1 0 0 1 0
4 1 0 0 1 1
How can I write a for loop to gather the column names for each row? I expect my result set to look like this:
A B C D E Result
0 1 0 1 0 1 ACE
1 0 1 0 0 1 BE
2 0 1 1 1 0 BCD
3 1 0 0 1 0 AD
4 1 0 0 1 1 ADE
Can anyone help me with that? Thank you!
The dot function does exactly this: you want the matrix product of your dataframe and the vector of column names:
df.dot(df.columns)
Out[5]:
0 ACE
1 BE
2 BCD
3 AD
4 ADE
If your dataframe is numeric, first obtain the boolean matrix by testing your df against 0:
(df!=0).dot(df.columns)
PS: Just assign the result to the new column
df['Result'] = df.dot(df.columns)
df
Out[7]:
A B C D E Result
0 1 0 1 0 1 ACE
1 0 1 0 0 1 BE
2 0 1 1 1 0 BCD
3 1 0 0 1 0 AD
4 1 0 0 1 1 ADE
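Why a dot product produces strings here can be seen in plain Python: multiplying a string by 1 keeps it, multiplying by 0 gives the empty string, and summing strings concatenates them. The sketch below reproduces the result for the sample data without pandas; `row_labels` is a hypothetical helper name.

```python
columns = ["A", "B", "C", "D", "E"]
rows = [
    [1, 0, 1, 0, 1],
    [0, 1, 0, 0, 1],
    [0, 1, 1, 1, 0],
    [1, 0, 0, 1, 0],
    [1, 0, 0, 1, 1],
]

def row_labels(row, names):
    # 1 * "A" is "A" and 0 * "A" is "", so joining the products concatenates
    # the names of the columns that hold a 1.
    return "".join(value * name for value, name in zip(row, names))

result = [row_labels(row, columns) for row in rows]
print(result)  # ['ACE', 'BE', 'BCD', 'AD', 'ADE']
```

This also shows why a value such as 2 would repeat the column name, which is what the boolean conversion with `df != 0` avoids.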

Multiple for loops in Python with variable range and variable number of loops

With this code:
from itertools import product
for a, b, c, d in product(range(low, high), repeat=4):
    print(a, b, c, d)
I have an output like this:
0 0 0 0
0 0 0 1
0 0 0 2
0 0 1 0
0 0 1 1
0 0 1 2
0 0 2 0
0 0 2 1
0 0 2 2
but how can I create an algorithm that produces this:
0 0 0 0
0 0 0 1
0 0 0 2
0 0 0 3
0 0 0 4
0 0 1 1
0 0 1 2
0 0 1 3
0 0 1 4
0 0 2 2
0 0 2 3
0 0 2 4
0 0 3 3
0 0 3 4
0 0 4 4
More important: every column of output must have different ranges, for example: first column: 0-4 second column: 0-10 etc.
And the number of columns ( a,b,c,d ) isn't fixed; depending on other parts of the program, can be in a range from 2 to 200.
UPDATE: to be more comprehensible and clear
what I need is something like that:
for a in range(0, 10):
    for b in range(a, 10):
        for c in range(b, 10):
            for d in range(c, 10):
                print(a, b, c, d)
The question has been partially resolved, but I still have problems with how to change the range parameters as in the above example.
Excuse me for the mess ! :)
itertools.product can already do exactly what you are looking for, simply by passing it multiple iterables (in this case the ranges you want). It will collect one element from each iterable passed. For example:
for a, b, c in product(range(2), range(3), range(4)):
    print(a, b, c)
Outputs:
0 0 0
0 0 1
0 0 2
0 0 3
0 1 0
0 1 1
0 1 2
0 1 3
0 2 0
0 2 1
0 2 2
0 2 3
1 0 0
1 0 1
1 0 2
1 0 3
1 1 0
1 1 1
1 1 2
1 1 3
1 2 0
1 2 1
1 2 2
1 2 3
If your input ranges are variable, just place the loop in a function and call it with different parameters. You can also use something along the lines of
for elements in product(*(range(i) for i in [1, 2, 3, 4])):
    print(*elements)
if you have a large number of input iterables.
With your updated request for variable ranges, a neat short-circuiting approach with itertools.product is less obvious, but you can always check that each generated tuple is in non-decreasing order (this is essentially what your variable ranges ensure). As per your example:
for elements in product(*(range(i) for i in [10, 10, 10, 10])):
    if all(elements[i] <= elements[i+1] for i in range(len(elements)-1)):
        print(*elements)
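Filtering the full product still generates every tuple before discarding most of them. A possible alternative, sketched here under the assumption that each column has its own exclusive upper bound, is a recursive generator that only ever produces non-decreasing tuples. `nondecreasing` and `stops` are hypothetical names; when all columns share one range, the standard `itertools.combinations_with_replacement` gives the same tuples.

```python
from itertools import combinations_with_replacement

def nondecreasing(stops, start=0):
    """Yield tuples t with start <= t[0] <= t[1] <= ... and t[i] < stops[i]."""
    if not stops:
        yield ()
        return
    for value in range(start, stops[0]):
        # The next column starts at the current value, like range(a, 10)
        # in the nested loops from the question.
        for rest in nondecreasing(stops[1:], value):
            yield (value,) + rest

# Different upper bound per column.
for combo in nondecreasing([2, 3, 4]):
    print(*combo)

# Same range for every column: the built-in helper yields the same tuples.
same_range = list(combinations_with_replacement(range(10), 4))
```

The number of columns is just the length of `stops`, so it can vary at run time. The recursion goes one level deep per column.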
Are you looking for something like this?
# the program would modify these variables below
column1_max = 2
column2_max = 3
column3_max = 4
column4_max = 5
# now generate the list
for a in range(column1_max+1):
    for b in range(column2_max+1):
        for c in range(column3_max+1):
            for d in range(column4_max+1):
                # keep only non-decreasing combinations
                if a <= b <= c <= d:
                    print(a, b, c, d)
Output:
0 0 0 0
0 0 0 1
0 0 0 2
0 0 0 3
0 0 0 4
0 0 0 5
0 0 1 1
0 0 1 2
0 0 1 3
0 0 1 4
0 0 1 5
0 0 2 2
0 0 2 3
0 0 2 4
0 0 2 5
0 0 3 3
0 0 3 4
0 0 3 5
0 0 4 4
0 0 4 5
0 1 1 1
0 1 1 2
...
