I have a data set which contains Customer Name, Ship Date, and PO Amount.
I would like to reshape the data frame into a table with the format:
cols: [Customer Name, 2016, 2017, 2018, 2019, 2020, 2021]
rows: one row for each customer, with the sum of POs within each year.
This is what I have tried:
The data is coming in from an Excel sheet, but assume ShipToName is a string, BillAmount is a float, and SellDate is the integer year taken from a datetime (datetime.year).
ShipToName = ['Bob', 'Joe', 'Josh', 'Bob', 'Joe', 'Josh']
BillAmount = [30.02, 23.2, 20, 45.32, 54.23, 65]
SellDate = [2016, 2016, 2018, 2020, 2021, 2018]
dfSales = {'Customer': ShipToName, 'Total Sales': BillAmount, 'Year': SellDate}
dfSales = pd.DataFrame(dfSales, columns=['Customer', 'Year', 'Total Sales'])
dfbyyear = dfSales.groupby(['Customer', 'Year'], as_index=False).sum().sort_values('Total Sales', ascending=False)
This gives me a new row for each customer/year combo.
I would like the output to look like:
Customer Name   2016   2017   2018   2019   2020   2021
Bob            30.02                       45.32
Joe            23.20                              54.23
Josh                         85.00
Edit v2
Using the data from the original version, we can create a temp dataframe dx that groups the data by Customer Name and Year. Then we can pivot the data to the format you wanted.
dx = df.groupby(['Customer Name','Year'])['PO Amount'].agg(Total_Amt='sum').reset_index()
dp = dx.pivot(index='Customer Name',columns='Year',values='Total_Amt')
print (dp)
The output of this will be:
Year 2020 2021
Customer Name
Boby 6754 6371
Jack 5887 6421
Jane 5161 4411
Jill 5857 5641
Kate 6205 6457
Suzy 5027 4561
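As an aside, pivot_table can do the groupby and the pivot in one step. Here is a minimal sketch using the sample lists from the question, with reindex supplying the full 2016-2021 column range (years with no sales come out as NaN):
import pandas as pd

ShipToName = ['Bob', 'Joe', 'Josh', 'Bob', 'Joe', 'Josh']
BillAmount = [30.02, 23.2, 20, 45.32, 54.23, 65]
SellDate = [2016, 2016, 2018, 2020, 2021, 2018]

dfSales = pd.DataFrame({'Customer': ShipToName,
                        'Year': SellDate,
                        'Total Sales': BillAmount})

# Sum per customer/year and reshape in one call, then make sure
# every year from 2016 through 2021 appears as a column.
wide = (dfSales.pivot_table(index='Customer', columns='Year',
                            values='Total Sales', aggfunc='sum')
               .reindex(columns=range(2016, 2022)))
print (wide)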
Original v1
I am making some assumptions about the data, as you haven't provided any.
Assumptions:
There are many customers in the dataframe - My example has 6 customers
Each customer has more than one Ship Date - My example has 1 shipment each month for 2 years
The shipment amount is a dollar amount - I used random integers from 100 to 900
The total dataframe size is 144 rows with 3 columns - Customer Name, Ship Date, and PO Amount
You are looking for an output by Customer, by Year, with the sum of all POs for that year
With these assumptions, here's the dataframe and the output.
import pandas as pd
import random
df = pd.DataFrame({'Customer Name': ['Jack'] * 24 + ['Jill'] * 24 + ['Jane'] * 24 +
                                    ['Kate'] * 24 + ['Suzy'] * 24 + ['Boby'] * 24,
                   'Ship Date': pd.date_range('2020-01-01', periods=24, freq='MS').tolist() * 6,
                   'PO Amount': [random.randint(100, 900) for _ in range(144)]})
print (df)
df['Year'] = df['Ship Date'].dt.year
print (df.groupby(['Customer Name','Year'])['PO Amount'].agg(Total_Amt='sum').reset_index())
Customer Name Ship Date PO Amount
0 Jack 2020-01-01 310
1 Jack 2020-02-01 677
2 Jack 2020-03-01 355
3 Jack 2020-04-01 571
4 Jack 2020-05-01 875
.. ... ... ...
139 Boby 2021-08-01 116
140 Boby 2021-09-01 822
141 Boby 2021-10-01 751
142 Boby 2021-11-01 109
143 Boby 2021-12-01 866
Each customer has data from 2020-01-01 through 2021-12-01.
The summary report will be as follows:
Customer Name Year Total_Amt
0 Boby 2020 7176
1 Boby 2021 6049
2 Jack 2020 6187
3 Jack 2021 5240
4 Jane 2020 4919
5 Jane 2021 6105
6 Jill 2020 6556
7 Jill 2021 5963
8 Kate 2020 6300
9 Kate 2021 6360
10 Suzy 2020 5969
11 Suzy 2021 4866
Related
I have the following data frame.
Names Counts Year
0 Jordan 1043 2000
1 Steve 204 2000
2 Brock 3 2000
3 Steve 33 2000
4 Mike 88 2000
... ... ... ...
20001 Bryce 2 2015
20002 Steve 11 2015
20003 Penny 24 2015
20004 Steve 15 2015
20005 Ryan 5 2015
I want to output the information about the name "Steve" over all years. The output should combine the "Counts" for the name "Steve" if the name appears multiple times within the same year.
Example output might look like:
Names Counts Year
0 Steve 237 2000
1 Steve 400 2001
2 Steve 35 2002
... ... ... ...
15 Steve 26 2015
Do you want something like this?
# first convert the numeric columns, then filter and aggregate
cols = ['Counts', 'Year']
df[cols] = df[cols].astype('int32')
df = df[df['Names'] == 'Steve']
df = df.groupby('Year')['Counts'].agg(['sum'])
Filter the records for Steve, then group by Year, and finally calculate the aggregates, i.e. first for Names and sum for Counts:
(df[df['Names'].eq('Steve')]
.groupby('Year')
.agg({'Names': 'first', 'Counts': 'sum'})
.reset_index())
Year Names Counts
0 2000 Steve 237
1 2015 Steve 26
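If the output should match the asker's column layout exactly (Names, Counts, Year as columns), a small variant keeps Names as a grouping key so it survives as a column:
out = (df[df['Names'].eq('Steve')]
       .groupby(['Names', 'Year'], as_index=False)['Counts']
       .sum())
print (out[['Names', 'Counts', 'Year']])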
I have a pandas DataFrame as below:
id age gender country sales_year
1 None M India 2016
2 23 F India 2016
1 20 M India 2015
2 25 F India 2015
3 30 M India 2019
4 36 None India 2019
I want to group by id and take the latest row as per sales_year, with all non-null elements.
Expected output:
id age gender country sales_year
1 20 M India 2016
2 23 F India 2016
3 30 M India 2019
4 36 None India 2019
In PySpark:
df = df.withColumn('age', f.first('age', True).over(Window.partitionBy("id").orderBy(df.sales_year.desc())))
But I need the same solution in pandas.
EDIT:
This can be the case with all the columns, not just age. I need it to pick up the latest non-null data (where the id exists) for all the ids.
Use GroupBy.first:
df1 = df.groupby('id', as_index=False).first()
print (df1)
id age gender country sales_year
0 1 20.0 M India 2016
1 2 23.0 F India 2016
2 3 30.0 M India 2019
3 4 36.0 NaN India 2019
If column sales_year is not sorted:
df2 = df.sort_values('sales_year', ascending=False).groupby('id', as_index=False).first()
print (df2)
id age gender country sales_year
0 1 20.0 M India 2016
1 2 23.0 F India 2016
2 3 30.0 M India 2019
3 4 36.0 NaN India 2019
import numpy as np

print(df.replace('None', np.nan).groupby('id').first())
first replace the 'None' strings with NaN
next use groupby() to group by 'id'
finally take the first non-null value in each column using first()
Use -
df.dropna(subset=['gender']).sort_values('sales_year', ascending=False).groupby('id')['age'].first()
Output
id
1 20
2 23
3 30
4 36
Name: age, dtype: object
Remove the ['age'] to get full rows -
df.dropna().sort_values('sales_year', ascending=False).groupby('id').first()
Output
age gender country sales_year
id
1 20 M India 2015
2 23 F India 2016
3 30 M India 2019
4 36 None India 2019
You can put the id back as a column with reset_index() -
df.dropna().sort_values('sales_year', ascending=False).groupby('id').first().reset_index()
Output
id age gender country sales_year
0 1 20 M India 2015
1 2 23 F India 2016
2 3 30 M India 2019
3 4 36 None India 2019
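Note that GroupBy.first takes the first non-null value of each column independently, so the returned row can mix values from different original rows. A minimal sketch on the sample data, assuming the missing values are real NaN rather than the string 'None':
import pandas as pd
import numpy as np

df = pd.DataFrame({'id': [1, 2, 1, 2, 3, 4],
                   'age': [np.nan, 23, 20, 25, 30, 36],
                   'gender': ['M', 'F', 'M', 'F', 'M', np.nan],
                   'country': ['India'] * 6,
                   'sales_year': [2016, 2016, 2015, 2015, 2019, 2019]})

# Latest year first, then the first non-null per column within each id.
out = (df.sort_values('sales_year', ascending=False)
         .groupby('id', as_index=False)
         .first())
print (out)
# For id 1, age (20) comes from the 2015 row while sales_year (2016)
# comes from the later row - exactly the mix in the expected output.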
There exists the following dataframe:
year  pop0  pop1  city0   city1
2019    20    40  Malibu  NYC
2018     8    60  Sydney  Dublin
2018    36    23  NYC     Malibu
2020    17    44  Malibu  NYC
2019     5    55  Sydney  Dublin
I would like to calculate the weighted average for the population of each city pair as a new column. For example, the w_mean for Malibu / NYC = (23+20+17)/(36+40+44) = 0.5.
Following is the desired output:
year  pop0  pop1  city0   city1   w_mean
2018    23    36  Malibu  NYC     0.5
2019    20    40  Malibu  NYC     0.5
2020    17    44  Malibu  NYC     0.5
2018     8    60  Sydney  Dublin  0.113
2019     5    55  Sydney  Dublin  0.113
I already sorted the dataframe by its columns, but I have issues swapping the 3rd row from NYC/Malibu to Malibu/NYC with its populations. Besides that, I can only calculate the w_mean for each row but not for each group. I tried groupby().mean() but didn't get any useful output.
Current code:
import pandas as pd
data = pd.DataFrame({'year': ["2019", "2018", "2018", "2020", "2019"], 'pop0': [20,8,36,17,5], 'pop1': [40,60,23,44,55], 'city0': ['Malibu','Sydney','NYC','Malibu','Sydney'], 'city1': ['NYC','Dublin','Malibu','NYC','Dublin']})
new = data.sort_values(by=['city0', 'city1'])
new['w_mean'] = new.apply(lambda row: row.pop0 / row.pop1, axis=1)
print(new)
What you can do is create tuples of (city, population), put the two tuples of each row into a list, and then sort it. By doing this for all rows, you can extract the new cities and populations (sorted alphabetically by city). This can be done as follows:
cities = [sorted([(e[0], e[1]), (e[2], e[3])]) for e in data[['city0','pop0','city1','pop1']].values]
data[['city0', 'pop0']] = [e[0] for e in cities]
data[['city1', 'pop1']] = [e[1] for e in cities]
Resulting dataframe:
year pop0 pop1 city0 city1
0 2019 20 40 Malibu NYC
1 2018 60 8 Dublin Sydney
2 2018 23 36 Malibu NYC
3 2020 17 44 Malibu NYC
4 2019 55 5 Dublin Sydney
Now, the w_mean column can be created using groupby and transform to create the two sums and then divide as follows:
data[['pop0_sum', 'pop1_sum']] = data.groupby(['city0', 'city1'])[['pop0', 'pop1']].transform('sum')
data['w_mean'] = data['pop0_sum'] / data['pop1_sum']
Result:
year pop0 pop1 city0 city1 pop0_sum pop1_sum w_mean
0 2019 20 40 Malibu NYC 60 120 0.500000
1 2018 60 8 Dublin Sydney 115 13 8.846154
2 2018 23 36 Malibu NYC 60 120 0.500000
3 2020 17 44 Malibu NYC 60 120 0.500000
4 2019 55 5 Dublin Sydney 115 13 8.846154
Any extra columns can now be dropped.
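For example:
data = data.drop(columns=['pop0_sum', 'pop1_sum'])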
If the resulting w_mean column should always be at most one, then the last division can be done as follows instead:
import numpy as np

data['w_mean'] = np.where(data['pop0_sum'] > data['pop1_sum'], data['pop1_sum'] / data['pop0_sum'], data['pop0_sum'] / data['pop1_sum'])
This will give 0.5 for the Malibu & NYC pair and 0.113043 for Dublin & Sydney.
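A compact variant of the same transform idea skips the helper columns entirely (assuming the city pairs were already normalized as above):
g = data.groupby(['city0', 'city1'])
data['w_mean'] = g['pop0'].transform('sum') / g['pop1'].transform('sum')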
I have a dataframe that looks like this:
df
Name year week date
0 Adam 2016 16 2016-04-24
1 Mary 2016 17 2016-05-01
2 Jane 2016 20 2016-05-22
3 Joe 2016 17 2016-05-01
4 Arthur 2017 44 2017-11-05
5 Liz 2017 41 2017-10-15
6 Janice 2016 47 2016-11-27
And I want to create a column season (df['season']) that attributes a season, MAM or OND, depending on the value in week.
The result should look like this:
df_final
Name year week date season
0 Adam 2016 16 2016-04-24 MAM
1 Mary 2016 17 2016-05-01 MAM
2 Jane 2016 20 2016-05-22 MAM
3 Joe 2016 17 2016-05-01 MAM
4 Arthur 2017 44 2017-11-05 OND
5 Liz 2017 41 2017-10-15 OND
6 Janice 2016 47 2016-11-27 OND
In essence, values of week below 40 should be paired with MAM, and values of 40 or above with OND.
So far I have this:
condition =df.week < 40
df['season'] = df[condition][[i for i in df.columns.values if i not in ['a']]].apply(lambda x: 'OND')
But it is clunky and does not produce the final response.
Thank you.
Use numpy.where:
import numpy as np

condition = df.week < 40
df['season'] = np.where(condition, 'MAM', 'OND')
print (df)
Name year week date season
0 Adam 2016 16 2016-04-24 MAM
1 Mary 2016 17 2016-05-01 MAM
2 Jane 2016 20 2016-05-22 MAM
3 Joe 2016 17 2016-05-01 MAM
4 Arthur 2017 44 2017-11-05 OND
5 Liz 2017 41 2017-10-15 OND
6 Janice 2016 47 2016-11-27 OND
EDIT:
To convert strings to integers, use astype:
condition = df.week.astype(int) < 40
Or convert column:
df.week = df.week.astype(int)
condition = df.week < 40
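If more than two seasons were ever needed, numpy.select generalizes the same idea. A sketch with hypothetical week ranges (only MAM and OND appear in the question):
import numpy as np

# Hypothetical bins; adjust the week ranges to your season definitions.
conditions = [df.week.between(9, 21), df.week.between(40, 52)]
choices = ['MAM', 'OND']
df['season'] = np.select(conditions, choices, default='other')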
I am trying to read a tab delimited text file into a dataframe.
This is the how the file looks in Excel:
CALENDAR_DATE ORDER_NUMBER INVOICE_NUMBER TRANSACTION_TYPE CUSTOMER_NUMBER CUSTOMER_NAME
5/13/2016 0:00 13867666 6892372 S 2026 CUSTOMER 1
Import into a df:
df = p.read_table("E:/FileLoc/ThisIsAFile.txt", encoding = "iso-8859-1")
Now it doesn't see the first 3 columns as part of the column index (df[0] = Transaction Type) and all of the headers shift over to reflect this.
CALENDAR_DATE ORDER_NUMBER INVOICE_NUMBER
5/13/2016 0:00 13867666 6892372 S 2026 CUSTOMER 1
I am trying to manipulate the text file and then import it to a mysql database as an end result.
You can use read_csv with a separator of 2 or more whitespaces:
import pandas as pd
import io
temp=u"""CALENDAR_DATE ORDER_NUMBER INVOICE_NUMBER TRANSACTION_TYPE CUSTOMER_NUMBER CUSTOMER_NAME
5/13/2016 0:00 13867666 6892372 S 2026 CUSTOMER 1"""
# after testing, replace io.StringIO(temp) with the filename
df = pd.read_csv(io.StringIO(temp), sep=r'\s{2,}', engine='python', encoding='iso-8859-1')
print (df)
CALENDAR_DATE ORDER_NUMBER INVOICE_NUMBER TRANSACTION_TYPE \
0 5/13/2016 0:00 13867666 6892372 S
CUSTOMER_NUMBER CUSTOMER_NAME
0 2026 CUSTOMER 1
If the separator is a tab, use sep='\t'.
EDIT:
I tested it with your data and it works:
import pandas as pd
df = pd.read_csv('test/AnonymizedData.txt', sep='\t')
print (df)
CUSTOMER_NUMBER CUSTOMER_NAME CUSTOMER_BRANCH_CODE CUSTOMER_BRANCH_NAME \
0 2026 CUSTOMER 1 83 SALES BRANCH 1
1 2359 CUSTOMER 2 76 SALES BRANCH 2
2 100662 CUSTOMER 3 28 SALES BRANCH 3
3 3245 CUSTOMER 4 84 SALES BRANCH 4
4 3179 CUSTOMER 5 28 SALES BRANCH 5
5 39881 CUSTOMER 6 67 SALES BRANCH 6
6 37020 CUSTOMER 7 58 SALES BRANCH 7
7 1239 CUSTOMER 8 50 SALES BRANCH 8
8 2379 CUSTOMER 9 76 SALES BRANCH 9
CUSTOMER_CITY CUSTOMER_STATE ... PRICING_PRODUCT_TYPE_CODE \
0 TOWN 1 CO ... 11
1 TOWN 2 OH ... 11
2 TOWN 3 ME ... 11
3 TOWN 4 IL ... 11
4 TOWN 5 NH ... 11
5 TOWN 6 TX ... 11
6 TOWN 7 NC ... 11
7 TOWN 8 NY ... 11
8 TOWN 9 OH ... 11
PRICING_PRODUCT_TYPE ORGANIZATION_ID ORGANIZATION_NAME PRODUCT_LINE_CODE \
0 DISPOSABLES 83 ORGANIZATIONNAME 891
1 DISPOSABLES 83 ORGANIZATIONNAME 891
2 DISPOSABLES 83 ORGANIZATIONNAME 891
3 DISPOSABLES 83 ORGANIZATIONNAME 891
4 DISPOSABLES 83 ORGANIZATIONNAME 891
5 DISPOSABLES 83 ORGANIZATIONNAME 891
6 DISPOSABLES 83 ORGANIZATIONNAME 891
7 DISPOSABLES 83 ORGANIZATIONNAME 891
8 DISPOSABLES 83 ORGANIZATIONNAME 891
PRODUCT_LINE ROBOTIC_FLAG Unnamed: 52 Unnamed: 53 Unnamed: 54
0 PRODUCTNAME N N NaN 3
1 PRODUCTNAME N N NaN 3
2 PRODUCTNAME N N NaN 2
3 PRODUCTNAME N N NaN 7
4 PRODUCTNAME N N NaN 1
5 PRODUCTNAME N N NaN 4
6 PRODUCTNAME N N NaN 3
7 PRODUCTNAME N N NaN 5
8 PRODUCTNAME N N NaN 3
[9 rows x 55 columns]
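The Unnamed: 52, Unnamed: 53 and Unnamed: 54 columns at the end most likely come from extra tab characters on each line; if they are unwanted, one way to drop them after reading is a sketch like:
df = df.loc[:, ~df.columns.str.startswith('Unnamed')]
Since the stated end goal is a MySQL import, the last step could look like the following sketch with SQLAlchemy (the connection string and table name here are hypothetical):
from sqlalchemy import create_engine

# Hypothetical credentials and table name - adjust to your setup.
engine = create_engine('mysql+pymysql://user:password@localhost/mydb')
df.to_sql('invoices', engine, if_exists='replace', index=False)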