Sorting Pandas Dataframe by Category - python

I have a .xlsx file with 9 variables and 898 observations. I read the file in and parsed it into a pandas DataFrame. I tried sorting the product_id column in ascending order, but got all of the columns in the result.
I followed the advice from another link, but still got an error.
Question: How can I get the top 10 highest occurring values from the product_id category in ascending order?
import pandas as pd
import xlrd
#Import data
trans = pd.ExcelFile('file.xlsx')
#parse xlsx file into dataframe
transdata = trans.parse('Orders')
#view head of dataframe
print transdata.head()
site_id visitor_id transaction_id transaction_date product_id price \
0 3 10001 20001 2014-10-31 48165 150
1 3 10002 20002 2014-10-31 48162 128
2 3 10002 20003 2014-10-30 48165 150
3 3 10003 20004 2014-10-31 48815 98
4 3 10003 20005 2014-10-29 48165 150
units sales_tax total
0 1 12.38 162.38
1 1 10.56 138.56
2 1 12.38 162.38
3 1 8.09 106.09
4 1 12.38 162.38
grouped = transdata.groupby(['product_id']).size()
print grouped
product_id
36959 78
44524 12
45956 33
46814 11
48162 50
48165 100
48412 12
48478 23
48500 13
48528 14
48552 101
48587 106
48593 104
48628 4
48810 25
48814 16
48815 33
48823 20
49418 11
49444 12
49882 102
51184 2
51380 15
dtype: int64
EDIT: I tried sorting the product_id counts, but got all of the columns printed again, followed by an error.
grouped = transdata.groupby(['product_id'])
counts = grouped.size().sort()
result = counts.head(10).index
print result
site_id visitor_id transaction_id transaction_date product_id price \
0 3 10001 20001 2014-10-31 48165 150
1 3 10002 20002 2014-10-31 48162 128
2 3 10002 20003 2014-10-30 48165 150
3 3 10003 20004 2014-10-31 48815 98
4 3 10003 20005 2014-10-29 48165 150
units sales_tax total
0 1 12.38 162.38
1 1 10.56 138.56
2 1 12.38 162.38
3 1 8.09 106.09
4 1 12.38 162.38
Traceback (most recent call last):
File "Trending.py", line 14, in <module>
result = counts.head(10).index
AttributeError: 'NoneType' object has no attribute 'head'
Desired output: a vector with the top 10 highest-occurring values from the product_id category.
product_id
48587
48593
49882
48552
48165
36959
48162
45956
48815
48478

From this:
grouped = transdata.groupby(by=['product_id'])
you just need the group sizes, sorted from largest to smallest. Note that Series.sort() sorts in place and returns None, which is why counts.head(10) raised the AttributeError:
counts = grouped.size()
counts = counts.sort_values(ascending=False)
print(counts.head(10).index)
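On recent pandas versions you can skip the explicit groupby entirely; a minimal sketch, assuming the same transdata DataFrame:
# value_counts sorts descending by default; keep the 10 most frequent product_ids
top10 = transdata['product_id'].value_counts().head(10)
print(top10.index)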

Related

Add/Update/Merge original DataFrame into a grouped DataFrame

I have a DataFrame with 22 rows and 78 columns. An internet-friendly version of the file can be found here. This a sample:
item_no code group gross_weight net_weight value ... ... +70 columns more
1 7417.85.24.25 0 18 17 13018.74
2 1414.19.00.62 1 35 33 0.11
3 7815.80.99.96 0 49 48 1.86
4 1414.19.00.62 1 30 27 2.7
5 5867.21.36.92 1 31 24 94
6 9227.71.84.12 1 24 17 56.4
7 1414.19.00.62 0 42 35 0.56
8 4465.58.84.31 0 50 42 0.94
9 1596.09.32.64 1 20 13 0.75
10 2194.64.27.41 1 38 33 1.13
11 1596.09.32.64 1 53 46 1.9
12 1596.09.32.64 1 18 15 10.44
13 1596.09.32.64 1 35 33 15.36
14 4835.09.81.44 1 55 47 10.44
15 5698.44.72.13 1 51 49 15.36
16 5698.44.72.13 1 49 45 2.15
17 5698.44.72.13 0 41 33 16
18 3815.79.80.69 1 25 21 4
19 3815.79.80.69 1 35 30 2.4
20 4853.40.53.94 1 53 46 3.12
21 4853.40.53.94 1 50 47 3.98
22 4853.40.53.94 1 16 13 6.53
The group column indicates that I should group all similar values in the code column and sum the values in the columns 'gross_weight', 'net_weight', 'value', and 'item_quantity'. Additionally, I have to modify 2 more columns as shown below:
#Group DF
grouped_df = df.groupby(['group', 'code'], as_index=False).agg({'item_quantity':'sum', 'gross_weight':'sum','net_weight':'sum', 'value':'sum'}).copy()
#Total items should be equal to the length of the DF
grouped_df['total_items'] = len(grouped_df)
#Item No.
grouped_df['item_no'] = [x+1 for x in range(len(grouped_df))]
This is the result:
group code item_quantity gross_weight net_weight value total_items item_no
0 0 1414.19.00.62 75.0 42 35 0.56 14 1
1 0 4465.58.84.31 125.0 50 42 0.94 14 2
2 0 5698.44.72.13 200.0 41 33 16.0 14 3
3 0 7417.85.24.25 1940.2 18 17 13018.74 14 4
4 0 7815.80.99.96 200.0 49 48 1.86 14 5
5 1 1414.19.00.62 275.0 65 60 2.81 14 6
6 1 1596.09.32.64 515.0 126 107 28.45 14 7
7 1 2194.64.27.41 151.0 38 33 1.13 14 8
8 1 3815.79.80.69 400.0 60 51 6.4 14 9
9 1 4835.09.81.44 87.0 55 47 10.44 14 10
10 1 4853.40.53.94 406.0 119 106 13.63 14 11
11 1 5698.44.72.13 328.0 100 94 17.51 14 12
12 1 5867.21.36.92 1000.0 31 24 94.0 14 13
13 1 9227.71.84.12 600.0 24 17 56.4 14 14
All of the columns in the grouped DF exist in the original DF but some have different values.
How can I merge, update, join, concat, or filter the original DF correctly so that I can have the complete 78 columns?
The objective DataFrame is the grouped DF.
The columns in the original DF that already exist in the Grouped DF should be omitted.
I should be able to take the first value of the columns in the original DF that aren't in the Grouped DF.
The column code does not have unique values.
The column part_number in the complete file does not have unique values.
I tried:
pd.merge(how='left') after creating a unique ID; it duplicates existing columns instead of updating values or overwriting.
join, concat, update: does not yield the expected results.
.agg(lambda x: x.iloc[0]) adds all the columns, but I don't know how to combine it with the current .agg({'item_quantity':'sum', 'gross_weight':'sum', 'net_weight':'sum', 'value':'sum'})
I know that .agg({'column_name':'first'}) returns the first value, but I don't know how to make it work for over 70 columns automatically.
You can achieve this by building the aggregation dictionary dynamically with a dict comprehension, like this:
df.groupby(['group', 'code'], as_index=False).agg({col: 'sum' for col in df.columns[3:]})
If item_no is your index, then change df.columns[3:] to df.columns[2:]
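If the remaining ~70 columns should keep their first value per group, the same pattern can mix aggregations. A sketch, assuming the same df and that only the four listed columns are summed:
# Columns to sum; every other non-key column keeps its first value within each group
sum_cols = ['item_quantity', 'gross_weight', 'net_weight', 'value']
agg_map = {col: ('sum' if col in sum_cols else 'first')
           for col in df.columns if col not in ('group', 'code')}
grouped_df = df.groupby(['group', 'code'], as_index=False).agg(agg_map)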

Pandas Collapse and Stack Multi-level columns

I want to break down the multi-level columns and have them as column values.
Original data input (excel):
As read in dataframe:
Company Name Company code 2017-01-01 00:00:00 Unnamed: 3 Unnamed: 4 Unnamed: 5 2017-02-01 00:00:00 Unnamed: 7 Unnamed: 8 Unnamed: 9 2017-03-01 00:00:00 Unnamed: 11 Unnamed: 12 Unnamed: 13
0 NaN NaN Product A Product B Product C Product D Product A Product B Product C Product D Product A Product B Product C Product D
1 Company A #123 1 5 3 5 0 2 3 4 0 1 2 3
2 Company B #124 600 208 30 20 600 213 30 15 600 232 30 12
3 Company C #125 520 112 47 15 520 110 47 10 520 111 47 15
4 Company D #126 420 165 120 31 420 195 120 30 420 182 120 58
Intended data frame:
I have tried stack() and unstack() and also swaplevel, but I couldn't get the dates column to 'drop as a row'. It looks like the merged cells in Excel produce NaN in the dataframe, and if it is the columns that are merged, I end up with an unnamed column. How do I work around it? Am I missing something really simple here?
Using stack
df.stack(level=0).reset_index(level=1)
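A fuller sketch of that idea, assuming the sheet is read with its two header rows as a MultiIndex so the dates sit at level 0 of the columns (the file name here is hypothetical, and this assumes the merged month cells resolve cleanly into the first header level):
import pandas as pd
# Read both header rows so the columns form a MultiIndex of (month, product);
# the first two columns (Company Name, Company code) become the row index
df = pd.read_excel('sales.xlsx', header=[0, 1], index_col=[0, 1])
# Move the month level of the columns down into the rows and flatten the index
long_df = df.stack(level=0).reset_index()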

Mark every Nth row per group using pandas

I have a Dataframe with customer info and their purchase details. I am trying to add a new column that flags every 3rd purchase made by the same customer.
Given below is the Dataframe
customer_name,bill_no,date
Mark,101,2018-10-01
Scott,102,2018-10-01
Pete,103,2018-10-02
Mark,104,2018-10-02
Mark,105,2018-10-04
Scott,106,2018-10-21
Julie,107,2018-10-03
Kevin,108,2018-10-07
Steve,109,2018-10-02
Mark,110,2018-10-06
Mark,111,2018-10-02
Mark,112,2018-10-05
Mark,113,2018-10-05
I want to flag every 3rd purchase made by the same customer. So in this case, I would like to add a flag for the bill_no values below:
Mark,105,2018-10-04
Mark,112,2018-10-05
Basically, flag every bill whose running count for the same customer is a multiple of 3.
Using groupby.cumcount:
n = 3
df['flag'] = df.groupby('customer_name').cumcount() + 1
df['flag'] = ((df['flag'] % n) == 0).astype(int)
print(df)
customer_name bill_no date flag
0 Mark 101 2018-10-01 0
1 Scott 102 2018-10-01 0
2 Pete 103 2018-10-02 0
3 Mark 104 2018-10-02 0
4 Mark 105 2018-10-04 1
5 Scott 106 2018-10-21 0
6 Julie 107 2018-10-03 0
7 Kevin 108 2018-10-07 0
8 Steve 109 2018-10-02 0
9 Mark 110 2018-10-06 0
10 Mark 111 2018-10-02 0
11 Mark 112 2018-10-05 1
12 Mark 113 2018-10-05 0
If actually getting the indices is important, you should use groupby + apply with slicing on the index:
n = 3
idx = df.groupby('customer_name', group_keys=False).apply(
    lambda x: x.index[n-1::n].to_series())
# So you can query these rows easily.
df.loc[idx]
customer_name bill_no date
4 Mark 105 2018-10-04
11 Mark 112 2018-10-05
Now, mark them using the indices:
df['flag'] = 0
df.loc[idx, 'flag'] = 1
df
customer_name bill_no date flag
0 Mark 101 2018-10-01 0
1 Scott 102 2018-10-01 0
2 Pete 103 2018-10-02 0
3 Mark 104 2018-10-02 0
4 Mark 105 2018-10-04 1
5 Scott 106 2018-10-21 0
6 Julie 107 2018-10-03 0
7 Kevin 108 2018-10-07 0
8 Steve 109 2018-10-02 0
9 Mark 110 2018-10-06 0
10 Mark 111 2018-10-02 0
11 Mark 112 2018-10-05 1
12 Mark 113 2018-10-05 0
If performance is important, use Sandeep's solution instead.

Pandas: Adding two dataframes on the common columns

I have 2 tables with the same columns, and I want to add the numbers where the key matches; if a key only appears in one table, it should just be carried over as-is to the output df. I tried combine_first, merge, concat, and join. They all create 2 separate columns for t1 and t2, but it's the same key, so the values should be combined. I know this must be something very basic; could someone please help? Thanks!
df1:
t1 a b
0 USD 2,877 -2,418
1 CNH 600 -593
2 AUD 756 -106
3 JPY 113 -173
4 XAG 8 0
df2:
t2 a b
0 CNH 64 -44
1 USD 756 -774
2 JPY 1,127 -2,574
3 TWO 56 -58
4 TWD 38 -231
Output:
t a b
USD 3,633 -3,192
CNH 664 -637
AUD 756 -106
JPY 1,240 -2,747
XAG 8 0
TWO 56 -58
TWD 38 -231
First set_index in both DataFrames by their first columns, then use add with the parameter fill_value=0:
print (df1.set_index('t1').add(df2.set_index('t2'), fill_value=0)
.reset_index()
.rename(columns={'index':'t'}))
t a b
0 AUD 756.0 -106.0
1 CNH 664.0 -637.0
2 JPY 1240.0 -2747.0
3 TWD 38.0 -231.0
4 TWO 56.0 -58.0
5 USD 3633.0 -3192.0
6 XAG 8.0 0.0
If need convert output to int:
print (df1.set_index('t1').add(df2.set_index('t2'), fill_value=0)
.astype(int)
.reset_index()
.rename(columns={'index':'t'}))
t a b
0 AUD 756 -106
1 CNH 664 -637
2 JPY 1240 -2747
3 TWD 38 -231
4 TWO 56 -58
5 USD 3633 -3192
6 XAG 8 0
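An alternative sketch using concat and groupby, assuming the same df1/df2 and that columns a and b are already numeric:
import pandas as pd
# Give both frames the same key column name, stack them, then sum per key
combined = pd.concat([df1.rename(columns={'t1': 't'}),
                      df2.rename(columns={'t2': 't'})])
result = combined.groupby('t', as_index=False).sum()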

Pandas read_table error

I am trying to read a tab delimited text file into a dataframe.
This is the how the file looks in Excel:
CALENDAR_DATE ORDER_NUMBER INVOICE_NUMBER TRANSACTION_TYPE CUSTOMER_NUMBER CUSTOMER_NAME
5/13/2016 0:00 13867666 6892372 S 2026 CUSTOMER 1
Import into a df:
df = p.read_table("E:/FileLoc/ThisIsAFile.txt", encoding = "iso-8859-1")
Now it doesn't see the first 3 columns as part of the column index (df[0] = Transaction Type) and all of the headers shift over to reflect this.
CALENDAR_DATE ORDER_NUMBER INVOICE_NUMBER
5/13/2016 0:00 13867666 6892372 S 2026 CUSTOMER 1
I am trying to manipulate the text file and then import it into a MySQL database as the end result.
You can use read_csv with a separator of 2 or more whitespace characters:
import pandas as pd
import io
temp=u"""CALENDAR_DATE ORDER_NUMBER INVOICE_NUMBER TRANSACTION_TYPE CUSTOMER_NUMBER CUSTOMER_NAME
5/13/2016 0:00 13867666 6892372 S 2026 CUSTOMER 1"""
#after testing replace io.StringIO(temp) to filename
df = pd.read_csv(io.StringIO(temp), sep=r'\s{2,}', engine='python', encoding="iso-8859-1")
print (df)
CALENDAR_DATE ORDER_NUMBER INVOICE_NUMBER TRANSACTION_TYPE \
0 5/13/2016 0:00 13867666 6892372 S
CUSTOMER_NUMBER CUSTOMER_NAME
0 2026 CUSTOMER 1
If the separator is a tab, use sep='\t'.
EDIT:
I tested it with your data and it works:
import pandas as pd
df = pd.read_csv('test/AnonymizedData.txt', sep='\t')
print (df)
CUSTOMER_NUMBER CUSTOMER_NAME CUSTOMER_BRANCH_CODE CUSTOMER_BRANCH_NAME \
0 2026 CUSTOMER 1 83 SALES BRANCH 1
1 2359 CUSTOMER 2 76 SALES BRANCH 2
2 100662 CUSTOMER 3 28 SALES BRANCH 3
3 3245 CUSTOMER 4 84 SALES BRANCH 4
4 3179 CUSTOMER 5 28 SALES BRANCH 5
5 39881 CUSTOMER 6 67 SALES BRANCH 6
6 37020 CUSTOMER 7 58 SALES BRANCH 7
7 1239 CUSTOMER 8 50 SALES BRANCH 8
8 2379 CUSTOMER 9 76 SALES BRANCH 9
CUSTOMER_CITY CUSTOMER_STATE ... PRICING_PRODUCT_TYPE_CODE \
0 TOWN 1 CO ... 11
1 TOWN 2 OH ... 11
2 TOWN 3 ME ... 11
3 TOWN 4 IL ... 11
4 TOWN 5 NH ... 11
5 TOWN 6 TX ... 11
6 TOWN 7 NC ... 11
7 TOWN 8 NY ... 11
8 TOWN 9 OH ... 11
PRICING_PRODUCT_TYPE ORGANIZATION_ID ORGANIZATION_NAME PRODUCT_LINE_CODE \
0 DISPOSABLES 83 ORGANIZATIONNAME 891
1 DISPOSABLES 83 ORGANIZATIONNAME 891
2 DISPOSABLES 83 ORGANIZATIONNAME 891
3 DISPOSABLES 83 ORGANIZATIONNAME 891
4 DISPOSABLES 83 ORGANIZATIONNAME 891
5 DISPOSABLES 83 ORGANIZATIONNAME 891
6 DISPOSABLES 83 ORGANIZATIONNAME 891
7 DISPOSABLES 83 ORGANIZATIONNAME 891
8 DISPOSABLES 83 ORGANIZATIONNAME 891
PRODUCT_LINE ROBOTIC_FLAG Unnamed: 52 Unnamed: 53 Unnamed: 54
0 PRODUCTNAME N N NaN 3
1 PRODUCTNAME N N NaN 3
2 PRODUCTNAME N N NaN 2
3 PRODUCTNAME N N NaN 7
4 PRODUCTNAME N N NaN 1
5 PRODUCTNAME N N NaN 4
6 PRODUCTNAME N N NaN 3
7 PRODUCTNAME N N NaN 5
8 PRODUCTNAME N N NaN 3
[9 rows x 55 columns]
