How to build sequence of purchases for each ID? - python

I want to create a dataframe that shows me the sequence of what users purchasing according to the sequence column. For example this is my current df:
user_id | sequence | product | price
1 | 1 | A | 10
1 | 2 | C | 15
1 | 3 | G | 1
2 | 1 | B | 20
2 | 2 | T | 45
2 | 3 | A | 10
...
I want to convert it to the following format:
user_id | source_product | target_product | cum_total_price
1 | A | C | 25
1 | C | G | 16
2 | B | T | 65
2 | T | A | 75
...
How can I achieve this?

shift + cumsum + groupby.apply:
def seq(g):
g['source_product'] = g['product']
g['target_product'] = g['product'].shift(-1)
g['price'] = g.price.cumsum().shift(-1)
return g[['user_id', 'source_product', 'target_product', 'price']].iloc[:-1]
df.sort_values('sequence').groupby('user_id', group_keys=False).apply(seq)
# user_id source_product target_product price
#0 1 A C 25.0
#1 1 C G 26.0
#3 2 B T 65.0
#4 2 T A 75.0

Related

Multiplying pandas columns based on multiple conditions

I have a df like this
| count | people | A | B | C |
|---------|--------|-----|-----|-----|
| yes | siya | 4 | 2 | 0 |
| no | aish | 4 | 3 | 0 |
| total | | 4 | | 0 |
| yes | dia | 6 | 4 | 0 |
| no | dia | 6 | 2 | 0 |
| total | | 6 | | 0 |
I want a output like below
| count | people | A | B | C |
|---------|--------|-----|-----|-----|
| yes | siya | 4 | 2 | 8 |
| no | aish | 4 | 3 | 0 |
| total | | 4 | | 0 |
| yes | dia | 6 | 4 | 0 |
| no | dia | 6 | 2 | 2 |
| total | | 6 | | 0 |
The goal is calculate column C by mulytiplying A and B only when the count value is "yes" but if the column People values are same that is yes for dia and no for also dia , then we have to calculate for the count value "no"
I tried this much so far
df.C= df.groupby("Host", as_index=False).apply(lambda dfx : df.A *
df.B if (df['count'] == 'no') else df.A *df.B)
But not able to achieve the goal, any idea how can I achieve the output
import numpy as np
#Set Condtions
c1=df.groupby('people')['count'].transform('nunique').eq(1)&df['count'].eq('yes')
c2=df.groupby('people')['count'].transform('nunique').gt(1)&df['count'].eq('no')
#Put conditions in list
c=[c1,c2]
#Mke choices corresponding to condition list
choice=[df['A']*df['B'],len(df[df['count'].eq('no')])]
#Apply np select
df['C']= np.select(c,choice,0)
print(df)
count people A B C
0 yes siya 4 2.0 8.0
1 no aish 4 3.0 0.0
2 total NaN 4 0.0 0.0
3 yes dia 6 4.0 0.0
4 no dia 6 2.0 2.0
5 total NaN 6 NaN 0.0

How to count the occurrence of a value and set that count as a new value for that value's row

Title is probably confusing, but let me make it clearer.
Let's say I have a df like this:
+----+------+---------------+
| Id | Name | reports_to_id |
+----+------+---------------+
| 0 | A | 10 |
| 1 | B | 10 |
| 2 | C | 11 |
| 3 | D | 12 |
| 4 | E | 11 |
| 10 | F | 20 |
| 11 | G | 21 |
| 12 | H | 22 |
+----+------+---------------+
I would want my resulting df to look like this:
+----+------+---------------+-------+
| Id | Name | reports_to_id | Count |
+----+------+---------------+-------+
| 0 | A | 10 | 0 |
| 1 | B | 10 | 0 |
| 2 | C | 11 | 0 |
| 3 | D | 12 | 0 |
| 4 | E | 11 | 0 |
| 10 | F | 20 | 2 |
| 11 | G | 21 | 2 |
| 12 | H | 22 | 1 |
+----+------+---------------+-------+
But this what I currently get as a result of my code (that is wrong):
+----+------+---------------+-------+
| Id | Name | reports_to_id | Count |
+----+------+---------------+-------+
| 0 | A | 10 | 2 |
| 1 | B | 10 | 2 |
| 2 | C | 11 | 2 |
| 3 | D | 12 | 1 |
| 4 | E | 11 | 2 |
| 10 | F | 20 | 0 |
| 11 | G | 21 | 0 |
| 12 | H | 22 | 0 |
+----+------+---------------+-------+
with this code:
df['COUNT'] = df.groupby(['reports_to_id'])['id'].transform('count')
Any suggestions or directions on how to get the result I want? All help is appreciated! and thank you in advance!
Use value_counts to count the reports_to_id by values, then map that to Id:
df['COUNT'] = df['Id'].map(df['reports_to_id'].value_counts()).fillna(0)
Output:
Id Name reports_to_id COUNT
0 0 A 10 0.0
1 1 B 10 0.0
2 2 C 11 0.0
3 3 D 12 0.0
4 4 E 11 0.0
5 10 F 20 2.0
6 11 G 21 2.0
7 12 H 22 1.0
Similar idea with reindex:
df['COUNT'] = df['reports_to_id'].value_counts().reindex(df['Id'], fill_value=0).values
which gives a better looking COUNT:
Id Name reports_to_id COUNT
0 0 A 10 0
1 1 B 10 0
2 2 C 11 0
3 3 D 12 0
4 4 E 11 0
5 10 F 20 2
6 11 G 21 2
7 12 H 22 1
You can try the following:
l=list[df['reports_to_id']
df['Count']=df['Id'].apply(lambda x: l.count(x))

Transposing group of data in pandas dataframe

I have a large dataframe like this:
|type| qt | vol|
|----|---- | -- |
| A | 1 | 10 |
| A | 2 | 12 |
| A | 1 | 12 |
| B | 3 | 11 |
| B | 4 | 20 |
| B | 4 | 20 |
| C | 4 | 20 |
| C | 4 | 20 |
| C | 4 | 20 |
| C | 4 | 20 |
How can I transpose to the dataframe with grouping horizontally like that?
|A. |B. |C. |
|--------------|--------------|--------------|
|type| qt | vol|type| qt | vol|type| qt | vol|
|----|----| ---|----|----| ---|----|----| ---|
| A | 1 | 10 | B | 3 | 11 | C | 4 | 20 |
| A | 2 | 12 | B | 4 | 20 | C | 4 | 20 |
| A | 1 | 12 | B | 4 | 20 | C | 4 | 20 |
| C | 4 | 20 |
You can group the dataframe on type then create key-value pairs of groups inside a dict comprehension, finally use concat along axis=1 and pass the optional keys parameter to get the final result:
d = {k:g.reset_index(drop=True) for k, g in df.groupby('type')}
pd.concat(d.values(), keys=d.keys(), axis=1)
Alternatively you can use groupby + cumcount to create a sequential counter per group, then create a multilevel index having two levels where the first level is counter and second level is column type itself, finally use stack followed by unstack to reshape:
c = df.groupby('type').cumcount()
df.set_index([c, df['type'].values]).stack().unstack([1, 2])
A B C
type qt vol type qt vol type qt vol
0 A 1 10 B 3 11 C 4 20
1 A 2 12 B 4 20 C 4 20
2 A 1 12 B 4 20 C 4 20
3 NaN NaN NaN NaN NaN NaN C 4 20
This is pretty much pivot by one column:
(df.assign(idx=df.groupby('type').cumcount())
.pivot(index='idx',columns='type', values=df.columns)
.swaplevel(0,1, axis=1)
.sort_index(axis=1)
)
Output:
type A B C
qt type vol qt type vol qt type vol
idx
0 1 A 10 3 B 11 4 C 20
1 2 A 12 4 B 20 4 C 20
2 1 A 12 4 B 20 4 C 20
3 NaN NaN NaN NaN NaN NaN 4 C 20

creating unique code for nested XML using element tree

I've below nested XML code. Refer image below
Yellow Highlighted codes are 1st Layer
Blue Highlighted codes are 2nd Layer
Red Highlighted codes are 3rd Layer
refer below for the xml data
<trx><invoice>27844173</invoice><total>52</total><item><code>110</code></item><item><code>304</code><items><item><code>54</code><items><item><code>174</code></item><item><code>600</code></item></items></item><item><code>478</code></item><item><code>810</code></item></items></item></trx>
My task is to create unique ids for all 3 layers. and below is my code I wrote.
import pandas as pd
import xml.etree.ElementTree as ET
xml_file_path = 'C:\Desktop\data.xml'
tree = ET.parse(xml_file_path)
root = tree.getroot()
sub_item_id = 0
cols = ['invoice','total','code','item_id','A','B','C']
dict_xml = {}
data = []
for trx in root.iter('trx'):
invoice = trx.find('invoice').text
total = trx.find('total').text
item_id = 0
a = 0
for it in trx.findall('item'):
a += 1
b = -1
for j in it.iter('item'):
b += 1
c = 0
code = j.find('code').text
item_id += 1
data.append({"invoice":invoice,"total":total,"code":code,
"item_id":item_id,"A":a,"B":b,"C":c})
data = pd.DataFrame(data)
data
And I get below output. where Column A is correct. not B and C
+---+----------+-------+------+---------+---+---+---+
| | invoice | total | code | item_id | A | B | C |
+---+----------+-------+------+---------+---+---+---+
| 0 | 27844173 | 52 | 110 | 1 | 1 | 0 | 0 |
+---+----------+-------+------+---------+---+---+---+
| 1 | 27844173 | 52 | 304 | 2 | 2 | 0 | 0 |
+---+----------+-------+------+---------+---+---+---+
| 2 | 27844173 | 52 | 54 | 3 | 2 | 1 | 0 |
+---+----------+-------+------+---------+---+---+---+
| 3 | 27844173 | 52 | 174 | 4 | 2 | 2 | 0 |
+---+----------+-------+------+---------+---+---+---+
| 4 | 27844173 | 52 | 600 | 5 | 2 | 3 | 0 |
+---+----------+-------+------+---------+---+---+---+
| 5 | 27844173 | 52 | 478 | 6 | 2 | 4 | 0 |
+---+----------+-------+------+---------+---+---+---+
| 6 | 27844173 | 52 | 810 | 7 | 2 | 5 | 0 |
+---+----------+-------+------+---------+---+---+---+
My expected result is as below.
+---+----------+-------+------+---------+---+---+---+
| | invoice | total | code | item_id | A | B | C |
+---+----------+-------+------+---------+---+---+---+
| 0 | 27844173 | 52 | 110 | 1 | 1 | 0 | 0 |
+---+----------+-------+------+---------+---+---+---+
| 1 | 27844173 | 52 | 304 | 2 | 2 | 0 | 0 |
+---+----------+-------+------+---------+---+---+---+
| 2 | 27844173 | 52 | 54 | 3 | 2 | 1 | 0 |
+---+----------+-------+------+---------+---+---+---+
| 3 | 27844173 | 52 | 174 | 4 | 2 | 1 | 1 |
+---+----------+-------+------+---------+---+---+---+
| 4 | 27844173 | 52 | 600 | 5 | 2 | 1 | 2 |
+---+----------+-------+------+---------+---+---+---+
| 5 | 27844173 | 52 | 478 | 6 | 2 | 2 | 0 |
+---+----------+-------+------+---------+---+---+---+
| 6 | 27844173 | 52 | 810 | 7 | 2 | 3 | 0 |
+---+----------+-------+------+---------+---+---+---+
how and where should I increment B and C variables to get the desired output
A preliminary observation first: while you used xml.etree, I prefer using the lxml library because of it has better support for xpath. Obviously, you can try to convert the code to xml.etree if you feel it's necessary.
There may be shorter ways of doing this, but for the time being let's use the following and I'll explain along the way:
import pandas as pd
from lxml import etree
stuff = """[your xml above]"""
doc = etree.XML(stuff.encode())
tree = etree.ElementTree(doc)
#first off, get the invoice number and total as integers
inv = int(doc.xpath('/trx/invoice/text()')[0])
total = int(doc.xpath('/trx/total/text()')[0])
#initialize a few lists:
levels = [] #we'll need this to determine programmatically how many levels deep the xml is
codes = [] #collect the codes
tiers = [] #create rows for each tier
#next - how many levels deep is the xml? Not easy to find out:
for e in doc.iter('item'):
path = tree.getpath(e)
tier = path.replace('/trx/','').replace('item','').replace('/s/',' ').replace('[','').replace(']','')
tiers.append(tier.split(' '))
codes.append(e.xpath('./code/text()')[0])
levels.append(path.count('[')) #we now have the depth of each tier
#the length of each tier is a function of its level; so we pad the length of that list to the highest level number (3 in this example):
for tier in tiers:
tiers[tiers.index(tier)] = [*tier, *["0"] * (max(levels)-len(tier))]
#so all that work with counting levels was just to use this max(levels) variable once...
#we now insert the other info you require in each row:
for t,c in zip(tiers,codes):
t.insert(0,c)
t.insert(0,inv)
t.insert(0,total)
#With all this prep out of the way, we get to the dataframe at last:
ids = list(range(1, len(tiers)+1)) #this is for the additional column you require
columns = ["total","invoice","code"," A"," B","C"]
df = pd.DataFrame(tiers,columns=columns)
df.insert(2, 'item_id', ids) #insert the extra column
df
Output:
total invoice item_id code A B C
0 52 27844173 1 110 1 0 0
1 52 27844173 2 304 2 0 0
2 52 27844173 3 54 2 1 0
3 52 27844173 4 174 2 1 1
4 52 27844173 5 600 2 1 2
5 52 27844173 6 478 2 2 0
6 52 27844173 7 810 2 3 0

Python - cumulative sum until sum matches an exact number

I have a column of numbers in a Python Pandas df: 1,8,4,3,1,5,1,4,2
If I create a cumulative sum column it returns the cumulative sum. How do I only return the rows that reaches a cumulative sum of 20 skipping numbers that take cumulative sum over 20?
+-----+-------+------+
| Var | total | cumu |
+-----+-------+------+
| a | 1 | 1 |
| b | 8 | 9 |
| c | 4 | 13 |
| d | 3 | 16 |
| e | 1 | 17 |
| f | 5 | 22 |
| g | 1 | 23 |
| h | 4 | 27 |
| i | 2 | 29 |
+-----+-------+------+
Desired output:
+-----+-------+------+
| Var | total | cumu |
+-----+-------+------+
| a | 1 | 1 |
| b | 8 | 9 |
| c | 4 | 13 |
| d | 3 | 16 |
| e | 1 | 17 |
| g | 1 | 18 |
| i | 2 | 20 |
+-----+-------+------+
If I understood your question correctly, you want only skip values that get you over cumulative sum of 20:
def acc(total):
s, rv = 0, []
for v, t in zip(total.index, total):
if s + t <= 20:
s += t
rv.append(v)
return rv
df = df[df.index.isin(acc(df.total))]
df['cumu'] = df.total.cumsum()
print(df)
Prints:
Var total cumu
0 a 1 1
1 b 8 9
2 c 4 13
3 d 3 16
4 e 1 17
6 g 1 18
8 i 2 20

Categories