creating unique code for nested XML using element tree - python

I have the nested XML below. In the original image, the layers were highlighted as follows:
Yellow highlighted codes are the 1st layer
Blue highlighted codes are the 2nd layer
Red highlighted codes are the 3rd layer
Here is the XML data:
<trx><invoice>27844173</invoice><total>52</total><item><code>110</code></item><item><code>304</code><items><item><code>54</code><items><item><code>174</code></item><item><code>600</code></item></items></item><item><code>478</code></item><item><code>810</code></item></items></item></trx>
My task is to create unique IDs for all 3 layers. Below is the code I wrote:
import pandas as pd
import xml.etree.ElementTree as ET

xml_file_path = r'C:\Desktop\data.xml'
tree = ET.parse(xml_file_path)
root = tree.getroot()

sub_item_id = 0
cols = ['invoice','total','code','item_id','A','B','C']
dict_xml = {}
data = []

for trx in root.iter('trx'):
    invoice = trx.find('invoice').text
    total = trx.find('total').text
    item_id = 0
    a = 0
    for it in trx.findall('item'):
        a += 1
        b = -1
        for j in it.iter('item'):
            b += 1
            c = 0
            code = j.find('code').text
            item_id += 1
            data.append({"invoice":invoice,"total":total,"code":code,
                         "item_id":item_id,"A":a,"B":b,"C":c})
data = pd.DataFrame(data)
data
And I get the output below, where column A is correct but B and C are not.
+---+----------+-------+------+---------+---+---+---+
| | invoice | total | code | item_id | A | B | C |
+---+----------+-------+------+---------+---+---+---+
| 0 | 27844173 | 52 | 110 | 1 | 1 | 0 | 0 |
+---+----------+-------+------+---------+---+---+---+
| 1 | 27844173 | 52 | 304 | 2 | 2 | 0 | 0 |
+---+----------+-------+------+---------+---+---+---+
| 2 | 27844173 | 52 | 54 | 3 | 2 | 1 | 0 |
+---+----------+-------+------+---------+---+---+---+
| 3 | 27844173 | 52 | 174 | 4 | 2 | 2 | 0 |
+---+----------+-------+------+---------+---+---+---+
| 4 | 27844173 | 52 | 600 | 5 | 2 | 3 | 0 |
+---+----------+-------+------+---------+---+---+---+
| 5 | 27844173 | 52 | 478 | 6 | 2 | 4 | 0 |
+---+----------+-------+------+---------+---+---+---+
| 6 | 27844173 | 52 | 810 | 7 | 2 | 5 | 0 |
+---+----------+-------+------+---------+---+---+---+
My expected result is as below.
+---+----------+-------+------+---------+---+---+---+
| | invoice | total | code | item_id | A | B | C |
+---+----------+-------+------+---------+---+---+---+
| 0 | 27844173 | 52 | 110 | 1 | 1 | 0 | 0 |
+---+----------+-------+------+---------+---+---+---+
| 1 | 27844173 | 52 | 304 | 2 | 2 | 0 | 0 |
+---+----------+-------+------+---------+---+---+---+
| 2 | 27844173 | 52 | 54 | 3 | 2 | 1 | 0 |
+---+----------+-------+------+---------+---+---+---+
| 3 | 27844173 | 52 | 174 | 4 | 2 | 1 | 1 |
+---+----------+-------+------+---------+---+---+---+
| 4 | 27844173 | 52 | 600 | 5 | 2 | 1 | 2 |
+---+----------+-------+------+---------+---+---+---+
| 5 | 27844173 | 52 | 478 | 6 | 2 | 2 | 0 |
+---+----------+-------+------+---------+---+---+---+
| 6 | 27844173 | 52 | 810 | 7 | 2 | 3 | 0 |
+---+----------+-------+------+---------+---+---+---+
How and where should I increment the B and C variables to get the desired output?

A preliminary observation first: while you used xml.etree, I prefer the lxml library because it has better support for XPath. Obviously, you can try to convert the code to xml.etree if you feel it's necessary.
There may be shorter ways of doing this, but for the time being let's use the following and I'll explain along the way:
import pandas as pd
from lxml import etree

stuff = """[your xml above]"""
doc = etree.XML(stuff.encode())
tree = etree.ElementTree(doc)

#first off, get the invoice number and total as integers
inv = int(doc.xpath('/trx/invoice/text()')[0])
total = int(doc.xpath('/trx/total/text()')[0])

#initialize a few lists:
levels = [] #we'll need this to determine programmatically how many levels deep the xml is
codes = [] #collect the codes
tiers = [] #create rows for each tier

#next - how many levels deep is the xml? Not easy to find out:
for e in doc.iter('item'):
    path = tree.getpath(e)
    tier = path.replace('/trx/','').replace('item','').replace('/s/',' ').replace('[','').replace(']','')
    tiers.append(tier.split(' '))
    codes.append(e.xpath('./code/text()')[0])
    levels.append(path.count('[')) #we now have the depth of each tier

#the length of each tier is a function of its level; so we pad the length of that list to the highest level number (3 in this example):
for tier in tiers:
    tiers[tiers.index(tier)] = [*tier, *["0"] * (max(levels)-len(tier))]
#so all that work with counting levels was just to use this max(levels) variable once...

#we now insert the other info you require in each row:
for t,c in zip(tiers,codes):
    t.insert(0,c)
    t.insert(0,inv)
    t.insert(0,total)

#With all this prep out of the way, we get to the dataframe at last:
ids = list(range(1, len(tiers)+1)) #this is for the additional column you require
columns = ["total","invoice","code"," A"," B","C"]
df = pd.DataFrame(tiers,columns=columns)
df.insert(2, 'item_id', ids) #insert the extra column
df
Output:
total invoice item_id code A B C
0 52 27844173 1 110 1 0 0
1 52 27844173 2 304 2 0 0
2 52 27844173 3 54 2 1 0
3 52 27844173 4 174 2 1 1
4 52 27844173 5 600 2 1 2
5 52 27844173 6 478 2 2 0
6 52 27844173 7 810 2 3 0
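If you would rather stay with xml.etree as in the question, the B and C counters fall out of a recursive walk that carries the sibling position of each ancestor. This is only a rough sketch for illustration (not part of the answer above); it assumes each deeper layer sits inside an <items> wrapper, exactly as in the sample XML:
import pandas as pd
import xml.etree.ElementTree as ET

xml_data = """<trx><invoice>27844173</invoice><total>52</total><item><code>110</code></item><item><code>304</code><items><item><code>54</code><items><item><code>174</code></item><item><code>600</code></item></items></item><item><code>478</code></item><item><code>810</code></item></items></item></trx>"""

root = ET.fromstring(xml_data)
invoice = root.find('invoice').text
total = root.find('total').text
rows = []

def walk(parent, counters):
    # visit every <item> directly under `parent`;
    # `counters` holds the sibling positions of the ancestors, e.g. [2] or [2, 1]
    for pos, item in enumerate(parent.findall('item'), start=1):
        path = counters + [pos]           # e.g. [2, 1, 2] for code 600
        a, b, c = (path + [0, 0, 0])[:3]  # pad with zeros up to 3 levels
        rows.append({'invoice': invoice, 'total': total,
                     'code': item.find('code').text,
                     'item_id': len(rows) + 1, 'A': a, 'B': b, 'C': c})
        nested = item.find('items')       # deeper layer, if present
        if nested is not None:
            walk(nested, path)

walk(root, [])
print(pd.DataFrame(rows))
With the sample XML this prints the seven rows of the expected table, including B = 1 and C = 1/2 for codes 174 and 600.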

Related

How to build sequence of purchases for each ID?

I want to create a dataframe that shows the sequence of what users are purchasing, according to the sequence column. For example, this is my current df:
user_id | sequence | product | price
1 | 1 | A | 10
1 | 2 | C | 15
1 | 3 | G | 1
2 | 1 | B | 20
2 | 2 | T | 45
2 | 3 | A | 10
...
I want to convert it to the following format:
user_id | source_product | target_product | cum_total_price
1 | A | C | 25
1 | C | G | 16
2 | B | T | 65
2 | T | A | 75
...
How can I achieve this?
shift + cumsum + groupby.apply:
def seq(g):
    g['source_product'] = g['product']
    g['target_product'] = g['product'].shift(-1)
    g['price'] = g.price.cumsum().shift(-1)
    return g[['user_id', 'source_product', 'target_product', 'price']].iloc[:-1]

df.sort_values('sequence').groupby('user_id', group_keys=False).apply(seq)
# user_id source_product target_product price
#0 1 A C 25.0
#1 1 C G 26.0
#3 2 B T 65.0
#4 2 T A 75.0

Mapping duplicate rows to originals with dictionary - Python 3.6

I am trying to locate duplicate rows in my pandas dataframe. In reality, df.shape is (438796, 4531), but I am using the toy example below as an MRE.
| id | ft1 | ft2 | ft3 | ft4 | ft5 | label |
|:------:|:---:|:---:|:---:|:---:|:---:|:------:|
| id_100 | 1 | 1 | 43 | 1 | 1 | High |
| id_101 | 1 | 1 | 33 | 0 | 1 | Medium |
| id_102 | 1 | 1 | 12 | 1 | 1 | Low |
| id_103 | 1 | 1 | 46 | 1 | 0 | Low |
| id_104 | 1 | 1 | 10 | 1 | 1 | High |
| id_105 | 0 | 1 | 99 | 0 | 1 | Low |
| id_106 | 0 | 0 | 0 | 0 | 0 | High |
| id_107 | 1 | 1 | 6 | 0 | 1 | High |
| id_108 | 1 | 1 | 29 | 1 | 1 | Medium |
| id_109 | 1 | 0 | 27 | 0 | 0 | Medium |
| id_110 | 0 | 1 | 32 | 0 | 1 | High |
What I am trying to accomplish is to look at a subset of the features and, if there are duplicate rows, keep the first and then denote which id: label pairs are its duplicates.
I have looked at the following posts:
find duplicate rows in a pandas dataframe
(I could not figure out how to replace col1 in df['index_original'] = df.groupby(['col1', 'col2']).col1.transform('idxmin') with my list of cols)
Find all duplicate rows in a pandas dataframe
I know pandas has a duplicated() call. So I tried implementing that and it sort of works:
import pandas as pd
# Read in example data
df = pd.read_clipboard()
# Declare columns I am interested in
cols = ['ft1', 'ft2', 'ft4', 'ft5']
# Create a subset of my dataframe with only the columns I care about
sub_df = df[cols]
# Create a list of duplicates
dupes = sub_df.index[sub_df.duplicated(keep='first')].tolist()
# Loop through the duplicates and print out the values I want
for idx in dupes:
    # print(df[:idx])
    print(df.loc[[idx],['id', 'label']])
However, what I am trying to do is, for a particular row, determine which rows are duplicates of it by saving those rows as id: label combinations. So while I'm able to extract the id and label for each duplicate, I have no way to map them back to the original row of which they are duplicates.
An ideal dataset would look like:
| id | ft1 | ft2 | ft3 | ft4 | ft5 | label | duplicates |
|:------:|:---:|:---:|:---:|:---:|:---:|:------:|:-------------------------------------------:|
| id_100 | 1 | 1 | 43 | 1 | 1 | High | {id_102: Low, id_104: High, id_108: Medium} |
| id_101 | 1 | 1 | 33 | 0 | 1 | Medium | {id_107: High} |
| id_102 | 1 | 1 | 12 | 1 | 1 | Low | |
| id_103 | 1 | 1 | 46 | 1 | 0 | Low | |
| id_104 | 1 | 1 | 10 | 1 | 1 | High | |
| id_105 | 0 | 1 | 99 | 0 | 1 | Low | {id_110: High} |
| id_106 | 0 | 0 | 0 | 0 | 0 | High | |
| id_107 | 1 | 1 | 6 | 0 | 1 | High | |
| id_108 | 1 | 1 | 29 | 1 | 1 | Medium | |
| id_109 | 1 | 0 | 27 | 0 | 0 | Medium | |
| id_110 | 0 | 1 | 32 | 0 | 1 | High | |
How can I take my duplicated values and map them back to their originals efficiently (understanding the size of my actual dataset)?
Working with dictionaries in columns is really complicated; here is one possible solution:
# Declare columns I am interested in
cols = ['ft1', 'ft2', 'ft4', 'ft5']
# Create a subset of my dataframe with only the columns I care about
sub_df = df[cols]
#mask for duplicates (all occurrences except the first)
m = sub_df.duplicated()
#create (id, label) tuples, aggregate them into a dict per group
s = (df.assign(a = df[['id','label']].apply(tuple, 1))[m]
       .groupby(cols)['a']
       .agg(lambda x: dict(list(x))))
#add new column
df = df.join(s.rename('duplicates'), on=cols)
#replace missing values and non-first duplicates with empty strings
df['duplicates'] = df['duplicates'].fillna('').mask(m, '')
print (df)
id ft1 ft2 ft3 ft4 ft5 label \
0 id_100 1 1 43 1 1 High
1 id_101 1 1 33 0 1 Medium
2 id_102 1 1 12 1 1 Low
3 id_103 1 1 46 1 0 Low
4 id_104 1 1 10 1 1 High
5 id_105 0 1 99 0 1 Low
6 id_106 0 0 0 0 0 High
7 id_107 1 1 6 0 1 High
8 id_108 1 1 29 1 1 Medium
9 id_109 1 0 27 0 0 Medium
10 id_110 0 1 32 0 1 High
duplicates
0 {'id_102': 'Low', 'id_104': 'High', 'id_108': ...
1 {'id_107': 'High'}
2
3
4
5 {'id_110': 'High'}
6
7
8
9
10
An alternative uses a custom function to assign all duplicates except the first to the first row of each group in a new column; the mask is changed here so that only those first rows keep their dictionaries and everything else is replaced with empty strings:
# Declare columns I am interested in
cols = ['ft1', 'ft2', 'ft4', 'ft5']
m = ~df.duplicated(subset=cols) & df.duplicated(subset=cols, keep=False)

def f(x):
    x.loc[x.index[0], 'duplicated'] = [dict(x[['id','label']].to_numpy()[1:])]
    return x

df = df.groupby(cols).apply(f)
df['duplicated'] = df['duplicated'].where(m, '')
print (df)
print (df)
id ft1 ft2 ft3 ft4 ft5 label \
0 id_100 1 1 43 1 1 High
1 id_101 1 1 33 0 1 Medium
2 id_102 1 1 12 1 1 Low
3 id_103 1 1 46 1 0 Low
4 id_104 1 1 10 1 1 High
5 id_105 0 1 99 0 1 Low
6 id_106 0 0 0 0 0 High
7 id_107 1 1 6 0 1 High
8 id_108 1 1 29 1 1 Medium
9 id_109 1 0 27 0 0 Medium
10 id_110 0 1 32 0 1 High
duplicated
0 {'id_102': 'Low', 'id_104': 'High', 'id_108': ...
1 {'id_107': 'High'}
2
3
4
5 {'id_110': 'High'}
6
7
8
9
10

Find top N values within each group

I have a dataset similar to the sample below:
| id | size | old_a | old_b | new_a | new_b |
|----|--------|-------|-------|-------|-------|
| 6 | small | 3 | 0 | 21 | 0 |
| 6 | small | 9 | 0 | 23 | 0 |
| 13 | medium | 3 | 0 | 12 | 0 |
| 13 | medium | 37 | 0 | 20 | 1 |
| 20 | medium | 30 | 0 | 5 | 6 |
| 20 | medium | 12 | 2 | 3 | 0 |
| 12 | small | 7 | 0 | 2 | 0 |
| 10 | small | 8 | 0 | 12 | 0 |
| 15 | small | 19 | 0 | 3 | 0 |
| 15 | small | 54 | 0 | 8 | 0 |
| 87 | medium | 6 | 0 | 9 | 0 |
| 90 | medium | 11 | 1 | 16 | 0 |
| 90 | medium | 25 | 0 | 4 | 0 |
| 90 | medium | 10 | 0 | 5 | 0 |
| 9 | large | 8 | 1 | 23 | 0 |
| 9 | large | 19 | 0 | 2 | 0 |
| 1 | large | 1 | 0 | 0 | 0 |
| 50 | large | 34 | 0 | 7 | 0 |
This is the input for the above table:
data=[[6,'small',3,0,21,0],[6,'small',9,0,23,0],[13,'medium',3,0,12,0],[13,'medium',37,0,20,1],[20,'medium',30,0,5,6],[20,'medium',12,2,3,0],[12,'small',7,0,2,0],[10,'small',8,0,12,0],[15,'small',19,0,3,0],[15,'small',54,0,8,0],[87,'medium',6,0,9,0],[90,'medium',11,1,16,0],[90,'medium',25,0,4,0],[90,'medium',10,0,5,0],[9,'large',8,1,23,0],[9,'large',19,0,2,0],[1,'large',1,0,0,0],[50,'large',34,0,7,0]]
data= pd.DataFrame(data,columns=['id','size','old_a','old_b','new_a','new_b'])
I want an output that groups the dataset on size and lists the top 2 ids based on the values of the 'new_a' column within each size group. Since some of the ids repeat multiple times, I want to sum the values of new_a for those ids and then find the top 2 values. My final table should look like the one below:
| size | id | new_a |
|--------|----|-------|
| large | 9 | 25 |
| large | 50 | 7 |
| medium | 13 | 32 |
| medium | 90 | 25 |
| small | 6 | 44 |
| small | 10 | 12 |
I have tried the code below, but it isn't showing the top 2 values of new_a for each group within the 'size' column.
nlargest = data.groupby(['size','id'])['new_a'].sum().nlargest(2).reset_index()
print(
    data.groupby('size').apply(
        lambda x: x.groupby('id').sum().nlargest(2, columns='new_a')
    ).reset_index()[['size', 'id', 'new_a']]
)
Prints:
size id new_a
0 large 9 25
1 large 50 7
2 medium 13 32
3 medium 90 25
4 small 6 44
5 small 10 12
You can set size, id as the index to avoid the double groupby here, and use Series.sum with the level parameter.
data.set_index(["size", "id"])["new_a"].groupby(level=0).apply(
    lambda x: x.sum(level=1).nlargest(2)
).reset_index()
size id new_a
0 large 9 25
1 large 50 7
2 medium 13 32
3 medium 90 25
4 small 6 44
5 small 10 12
You can chain two groupby methods:
data.groupby(['id', 'size'])['new_a'].sum().groupby('size').nlargest(2)\
.droplevel(0).to_frame('new_a').reset_index()
Output:
id size new_a
0 9 large 25
1 50 large 7
2 13 medium 32
3 90 medium 25
4 6 small 44
5 10 small 12

Add values in two Spark DataFrames, row by row

I have two Spark DataFrames, with values that I would like to add, and then multiply, and keep the lowest pair of values only. I have written a function that will do this:
def math_func(aValOne, aValTwo, bValOne, bValTwo):
    tmpOne = aValOne + bValOne
    tmpTwo = aValTwo + bValTwo
    final = tmpOne * tmpTwo
    return final
I would like to iterate through two Spark DataFrames, "A" and "B", row by row, and keep the lowest-value results. So if I have two DataFrames:
DataFrameA:
ID | ValOne | ValTwo
0 | 2 | 4
1 | 3 | 6
DataFrameB:
ID | ValOne | ValTwo
0 | 4 | 5
1 | 7 | 9
I would like to first take row 0 from DataFrameA, compare it to rows 0 and 1 of DataFrameB, and then keep the lowest value results. I have tried this:
results = DataFrameA.select('ID')(lambda i: DataFrameA.select('ID')(math_func(DataFrameA.ValOne, DataFrameA.ValTwo, DataFrameB.ValOne, DataFrameB.ValOne))
but I get errors about iterating through a DataFrame column. I know that in Pandas I would essentially make a nested "for loop", and then just write the results to another DataFrame and append the results. The results I would expect are:
Initial Results:
DataFrameA_ID | Value | DataFrameB_ID
0 | 54 | 0
0 | 117 | 1
1 | 77 | 0
1 | 150 | 1
Final Results:
DataFrameA_ID | Value | DataFrameB_ID
0 | 54 | 0
1 | 77 | 0
I am quite new at Spark, but I know enough to know I'm not approaching this the right way.
Any thoughts on how to go about this?
You will need multiple steps to achieve this.
Suppose you have data
DFA:
ID | ValOne | ValTwo
0 | 2 | 4
1 | 3 | 6
DFB:
ID | ValOne | ValTwo
0 | 4 | 5
1 | 7 | 9
Step 1.
Do a cartesian join on your 2 dataframes. That will give you:
Cartesian:
DFA.ID | DFA.ValOne | DFA.ValTwo | DFB.ID | DFB.ValOne | DFB.ValTwo
0 | 2 | 4 | 0 | 4 | 5
1 | 3 | 6 | 0 | 4 | 5
0 | 2 | 4 | 1 | 7 | 9
1 | 3 | 6 | 1 | 7 | 9
Step 2.
Multiply columns:
Multiplied:
DFA.ID | DFA.Mul | DFB.ID | DFB.Mul
0 | 8 | 0 | 20
1 | 18 | 0 | 20
0 | 8 | 1 | 63
1 | 18 | 1 | 63
Step 3.
Group by DFA.ID and select min from DFA.Mul and DFB.Mul
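For reference, a minimal PySpark sketch of this approach follows, applying the question's math_func to every cross-joined pair and keeping the lowest Value per DataFrameA ID; the column aliases and sample data are assumptions for illustration, not part of the original answer:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

dfa = spark.createDataFrame([(0, 2, 4), (1, 3, 6)], ["ID", "ValOne", "ValTwo"])
dfb = spark.createDataFrame([(0, 4, 5), (1, 7, 9)], ["ID", "ValOne", "ValTwo"])

# Step 1: cartesian join (rename the columns first so the two sides stay distinguishable)
a = dfa.select(F.col("ID").alias("DataFrameA_ID"),
               F.col("ValOne").alias("aValOne"),
               F.col("ValTwo").alias("aValTwo"))
b = dfb.select(F.col("ID").alias("DataFrameB_ID"),
               F.col("ValOne").alias("bValOne"),
               F.col("ValTwo").alias("bValTwo"))
joined = a.crossJoin(b)

# Step 2: add, then multiply, exactly as math_func does
joined = joined.withColumn(
    "Value",
    (F.col("aValOne") + F.col("bValOne")) * (F.col("aValTwo") + F.col("bValTwo"))
)

# Step 3: group by the DataFrameA ID and keep the row with the lowest Value
result = (joined.groupBy("DataFrameA_ID")
                .agg(F.min(F.struct("Value", "DataFrameB_ID")).alias("m"))
                .select("DataFrameA_ID", "m.Value", "m.DataFrameB_ID"))
result.show()
With the sample data this reproduces the question's final result: ID 0 with Value 54 and ID 1 with Value 77.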

Filter all rows from groupby object

I have a dataframe like below
+-----------+------------+---------------+------+-----+-------+
| InvoiceNo | CategoryNo | Invoice Value | Item | Qty | Price |
+-----------+------------+---------------+------+-----+-------+
| 1 | 1 | 77 | 128 | 1 | 10 |
| 1 | 1 | 77 | 101 | 1 | 11 |
| 1 | 2 | 77 | 105 | 3 | 12 |
| 1 | 3 | 77 | 129 | 2 | 10 |
| 2 | 1 | 21 | 145 | 1 | 9 |
| 2 | 2 | 21 | 130 | 1 | 12 |
+-----------+------------+---------------+------+-----+-------+
I want to filter the entire group if any of the items in the list item_list = [128,129,130] is present in that group, after grouping by 'InvoiceNo' & 'CategoryNo'.
My desired out put is as below
+-----------+------------+---------------+------+-----+-------+
| InvoiceNo | CategoryNo | Invoice Value | Item | Qty | Price |
+-----------+------------+---------------+------+-----+-------+
| 1 | 1 | 77 | 128 | 1 | 10 |
| 1 | 1 | 77 | 101 | 1 | 11 |
| 1 | 3 | 77 | 129 | 2 | 10 |
| 2 | 2 | 21 | 130 | 1 | 12 |
+-----------+------------+---------------+------+-----+-------+
I know how to filter a dataframe using isin(), but I'm not sure how to do it with groupby().
So far I have tried the below:
import pandas as pd
df = pd.read_csv('data.csv')
item_list = [128,129,130]
df.groupby(['InvoiceNo','CategoryNo'])['Item'].isin(item_list)
but nothing happens. Please guide me on how to solve this issue.
You can do something like this:
s = (df['Item'].isin(item_list)
       .groupby([df['InvoiceNo'], df['CategoryNo']])
       .transform('any')
     )
df[s]
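For completeness, an equivalent formulation (not from the original answer) uses GroupBy.filter, which maps more literally onto the stated intent, though it is usually slower than the transform('any') approach on large frames; the sample frame below is rebuilt from the question's table:
import pandas as pd

# sample data from the question
df = pd.DataFrame({
    'InvoiceNo':     [1, 1, 1, 1, 2, 2],
    'CategoryNo':    [1, 1, 2, 3, 1, 2],
    'Invoice Value': [77, 77, 77, 77, 21, 21],
    'Item':          [128, 101, 105, 129, 145, 130],
    'Qty':           [1, 1, 3, 2, 1, 1],
    'Price':         [10, 11, 12, 10, 9, 12],
})
item_list = [128, 129, 130]

# keep every row of any (InvoiceNo, CategoryNo) group that contains at least one listed item
filtered = df.groupby(['InvoiceNo', 'CategoryNo']).filter(
    lambda g: g['Item'].isin(item_list).any()
)
print(filtered)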
