Data frame formation - python

I need to create a data frame for 100 customer_ids along with their expenses for each day from 1st June 2019 to 31st August 2019. I already have the customer IDs in one list and the dates in another. How do I make a data frame in the format shown below?
CustomerID  TrxnDate
1           1-Jun-19
1           2-Jun-19
1           3-Jun-19
1           ...
1           31-Aug-19
2           1-Jun-19
2           2-Jun-19
2           3-Jun-19
2           ...
2           31-Aug-19
and so on for the rest of the 100 customer IDs.
I already have a customer_id dataframe built with pandas; now I need to map each customer_id to every date. That is, customer ID 1 should have all dates from 1st June 2019 to 31st August 2019, then customer ID 2 should have the same dates, and so on. Please see the required data frame above.

# import module
import pandas as pd
# list of dates
lst = ['1-Jun-19', '2-Jun-19', '3-Jun-19']
# calling the DataFrame constructor on the list
df = pd.DataFrame(lst)
Repeat the operation for the customer IDs, store the result in df2 or similar, and then:
frames = [df, df2]
result = pd.concat(frames)
There are simpler methods, but this will give you an idea of how it is carried out.
I see you want a specific layout, so first create the dataframe for customer ID 1, then repeat the same for customer ID 2, and then concat those dataframes.
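As one of those simpler methods, here is a minimal sketch of the cross-product approach, assuming customer IDs 1 to 100 and a daily date range (substitute your existing lists for customer_ids and dates):
import pandas as pd

# hypothetical inputs -- replace with your existing lists
customer_ids = list(range(1, 101))
dates = pd.date_range('2019-06-01', '2019-08-31', freq='D')

# pair every customer with every date via a cross product
idx = pd.MultiIndex.from_product([customer_ids, dates],
                                 names=['CustomerID', 'TrxnDate'])
result = pd.DataFrame(index=idx).reset_index()

# optional: render the dates like "01-Jun-19" (note %d keeps a leading zero)
result['TrxnDate'] = result['TrxnDate'].dt.strftime('%d-%b-%y')
print(result.head())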

ValueError: cannot reindex on an axis with duplicate labels (Pandas reindex dataframe)

I'm trying to create a dataframe using pandas that counts the number of engaged, repeaters, and inactive customers for a company based on a JSON file with the transaction data.
For context, the columns of the new dataframe would be each month from Jan to June, while the rows are:
Repeater (customers who purchased in the current and previous month)
Inactive (customers in all transactions over the months including the current month who have purchased in previous months but not the current month)
Engaged (customers in all transactions over the months including the current month who have purchased in every month)
Hence, I've written code that first fetches the month of each transaction from the transaction date of each record in the JSON. Then it creates another column ("month_no") containing the number of the month in which the transaction was made. Next, a function is defined with the metrics to apply to each group, and it is applied to the dataframe grouped by name.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_json('data/data.json')
df = (df.astype({'transaction_date': 'datetime64'})
        .assign(month=lambda x: x['transaction_date'].dt.month_name()))
months = {'January': 1, 'February': 2, 'March': 3, 'April': 4, 'May': 5, 'June': 6}
df['month_no'] = df['month'].map(months)
df = df.set_flags(allows_duplicate_labels=False)

def grpProc(grp):
    wrk = pd.DataFrame({'engaged': grp.drop_duplicates().sort_values('month_no').set_index('month_no').reindex(months).name.notna()})
    wrk['inactive'] = ~wrk.engaged
    wrk['repeaters'] = wrk.engaged & wrk.engaged.shift()
    return wrk

act = df.groupby('name').apply(grpProc)
result = act.groupby(level=1).sum().astype(int).T
result.columns = months.keys()
However, this code produces these errors:
FutureWarning: reindexing with a non-unique Index is deprecated and will raise in a future version.
wrk = pd.DataFrame({'engaged': grp.drop_duplicates().sort_values('month_no').set_index('month_no').reindex(months.values()).name.notna()})
...
ValueError: cannot reindex on an axis with duplicate labels
It highlights the line:
act = df.groupby('name').apply(grpProc)
For your reference, here are the important columns of the dataframe and some dummy data:
Name  Purchase Month
Mark  March
John  January
Luke  March
John  March
Mark  January
Mark  February
Luke  February
John  January
The goal is to create a pivot table based on the above table by counting the repeaters, inactive, and engaged members:
Status     January  February  March
Repeaters  0        1         2
Inactive   1        1         0
Engaged    2        1         1
How do you do this and fix the error? If you have a completely different solution that works, please share it as well.
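For what it's worth, the reindex fails because a customer can still have several rows with the same month_no after drop_duplicates() (rows that differ in other columns), so set_index('month_no') produces duplicate labels. Below is a sketch of an alternative that avoids per-group reindexing altogether: build a customer-by-month presence matrix with pd.crosstab and derive the three statuses from it. It uses the dummy data from the table above, and it reads "inactive" as every known customer who did not purchase in the current month, which is what the expected table implies:
import pandas as pd

months = {'January': 1, 'February': 2, 'March': 3}
df = pd.DataFrame({'name': ['Mark', 'John', 'Luke', 'John', 'Mark', 'Mark', 'Luke', 'John'],
                   'month_no': [3, 1, 3, 3, 1, 2, 2, 1]})

# one row per customer, one column per month; True if they purchased that month
purchased = (pd.crosstab(df['name'], df['month_no'])
               .reindex(columns=list(months.values()), fill_value=0)
               .gt(0))

repeaters = (purchased & purchased.shift(axis=1, fill_value=False)).sum()
engaged = purchased.cummin(axis=1).sum()   # purchased in every month so far
inactive = (~purchased).sum()              # did not purchase in the current month

result = pd.DataFrame([repeaters, inactive, engaged],
                      index=['Repeaters', 'Inactive', 'Engaged'])
result.columns = list(months)
print(result)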

Count occurrences in column based on another column (date)

I am trying to count the number of "Type" occurrences by what month they are in.
Daily data is given, so to group by month I tried using .resample(), but the problem is that it combines all the strings together into one LONG string, and then I can't count the number of occurrences using str.count(), as it returns the wrong value (it finds too many matches because it isn't looking for the EXACT pattern).
I think it has to be done in more than one step...
I have tried SO many things... I even heard there is a pivot table?
Sample data:
Type  Date
Cat   2020-01-01
Cat   2020-01-01
Bird  2020-01-01
Dog   2020-01-01
Cat   2020-02-01
Cat   2020-03-01
Bird  2020-03-01
Cat   2020-05-02
... for all the months over a few years ...
Converted to the following format (the header titles can be in numeric form as well):
      January 2020  February 2020
Cat   4             1
Bird  1             0
Dog   1             0
As far as I know, pandas does not have a single standard function for this exact result, so below I've included a code snippet that produces it.
If you do not mind using extra packages, there are packages for quicker/easier binary encoding (e.g. category_encoders).
import pandas as pd

# your data in dictionary format
d = {
    "Type": ["Cat", "Cat", "Bird", "Dog", "Cat", "Cat", "Bird", "Cat"],
    "Date": ["2020-01-01", "2020-01-01", "2020-01-01", "2020-01-01",
             "2020-02-01", "2020-03-01", "2020-03-01", "2020-05-02"]
}

# create a dataframe with the dates as index
df = pd.DataFrame(data=d['Type'], index=pd.to_datetime(d['Date']))
animals = list(df[0].unique())     # a list containing all unique animals
ndf = pd.DataFrame(index=animals)  # empty new dataframe with all animals as index

for animal in animals:
    ndf.loc[animal, df.index.month.unique()] = (  # at row = animal, insert all unique months
        (df == animal).groupby(df.index.month)    # group by month, using .month (returns 1 for Jan)
        .sum()                                    # sum, since we use a boolean comparison
        .transpose()                              # transpose due to the desired output format
        .values                                   # array of values to insert
    )

# convert column names back to datetime and save as strings in the desired format
ndf.columns = pd.to_datetime(ndf.columns, format='%m').strftime('%B 2020')
Result
      January 2020  February 2020  March 2020  May 2020
Cat   2             1              1           1
Bird  1             0              1           0
Dog   1             0              0           0
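For what it's worth, pd.crosstab reaches the same table more directly; here is a sketch on the same dummy data, grouping by monthly periods so the columns stay in chronological order:
import pandas as pd

df = pd.DataFrame({"Type": ["Cat", "Cat", "Bird", "Dog", "Cat", "Cat", "Bird", "Cat"],
                   "Date": pd.to_datetime(["2020-01-01", "2020-01-01", "2020-01-01",
                                           "2020-01-01", "2020-02-01", "2020-03-01",
                                           "2020-03-01", "2020-05-02"])})

# count Type occurrences per calendar month
out = pd.crosstab(df['Type'], df['Date'].dt.to_period('M'))
out = out.reindex(df['Type'].unique())        # keep the first-seen row order
out.columns = out.columns.strftime('%B %Y')   # e.g. "January 2020"
print(out)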

find first unique items selected by user and ranking them in order of user selection by date

I am trying to identify only the first orders of unique "items" purchased by "test" customers, using the simplified sample dataframe created below:
import pandas as pd

df = pd.DataFrame({"cust": ['A55', 'A55', 'A55', 'B080', 'B080', 'D900', 'D900', 'D900',
                            'D900', 'C019', 'C019', 'Z09c', 'A987', 'A987', 'A987'],
                   "date": ['01/11/2016', '01/11/2016', '01/11/2016', '08/17/2016', '6/17/2016',
                            '03/01/2016', '04/30/2016', '05/16/2016', '09/27/2016', '04/20/2016',
                            '04/29/2016', '07/07/2016', '1/29/2016', '10/17/2016', '11/11/2016'],
                   "item": ['A10BABA', 'A10BABA', 'A10DBDB', 'A9GABA', 'A11AD', 'G198A', 'G198A',
                            'F673', 'A11BB', 'CBA1', 'CBA1', 'DA21', 'BG10A', 'CG10BA', 'BG10A']
                   })
df.date = pd.to_datetime(df.date)
df = df.sort_values(["cust", "date"], ascending=True)
The desired output would look as shown in the picture: all unique items ordered by date of purchase in a new column called "cust_item_rank", with any repeated (duplicated) orders of the same item by the same user removed.
To clarify further, items purchased on the same date by the same user should share the same rank, as shown in the picture for customer A55 (A10BABA and A10DBDB are both ranked 1).
I have spent a fair bit of time on combinations of groupby and rank operations, but without success thus far. As an example:
df["cust_item_rank"] = df.groupby("cust")["date"]["item"].rank(ascending = 1, method = "min")
yields an error (Exception: Column(s) date already selected).
Can somebody please guide me to the desired solution here?
# Remove duplicates
df2 = (df.loc[~df.groupby(['cust'])['item'].apply(pd.Series.duplicated)]
.reset_index(drop=True))
df2['cust_item_rank'] = df2.groupby('cust').cumcount().add(1)
df2
cust date item cust_item_rank
0 A55 2016-01-11 A10BABA 1
1 A55 2016-11-01 A10DBDB 2
2 A987 2016-01-29 BG10A 1
3 A987 2016-10-17 CG10BA 2
4 B080 2016-06-17 A11AD 1
5 B080 2016-08-17 A9GABA 2
6 C019 2016-04-20 CBA1 1
7 D900 2016-03-01 G198A 1
8 D900 2016-05-16 F673 2
9 D900 2016-09-27 A11BB 3
10 Z09c 2016-07-07 DA21 1
To solve this, I built upon the excellent initial answer by cs95 and called the rank function in pandas as follows:
# remove duplicates as recommended by cs95
df2 = (df.loc[~df.groupby(['cust'])['item'].apply(pd.Series.duplicated)]
       .reset_index(drop=True))
# rank by date after grouping by customer
df2["cust_item_rank"] = df2.groupby(["cust"])["date"].rank(ascending=1, method='dense').astype(int)
This produced the desired output.
It appears this problem can be solved with either the "min" or the "dense" ranking method; I chose "dense" to avoid skipping any ranks.
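A compact variant of the same idea, assuming df is sorted by cust and date as above (drop_duplicates then keeps the earliest order of each cust/item pair):
# keep only the first order of each item per customer
df2 = df.drop_duplicates(['cust', 'item']).reset_index(drop=True)
# same-date purchases share a rank thanks to method='dense'
df2['cust_item_rank'] = (df2.groupby('cust')['date']
                            .rank(method='dense')
                            .astype(int))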

How to count the number of dropoffs per month for dataframe column

I have a dataframe with records from 2011 to 2018. One of the columns holds the drop_off_date, the date when the customer left the rewards program. For each month between 2011 and 2018 I want to count how many people dropped off during that month, so for the 84-month period I want the count of dropoffs per month using the drop_off_date column.
I changed the column to datetime, and I know I can use the .agg and .count methods, but I am not sure how to count per month. I honestly do not know what the next step would be.
Example of the data:
Record ID | store ID | drop_off_date
a1274c212| 12876| 2011-01-27
a1534c543| 12877| 2011-02-23
a1232c952| 12877| 2018-12-02
The result should look like this:
Month: | #of dropoffs:
Jan 2011 | 15
........
Dec 2018 | 6
What I suggest is to work directly with the date strings, stripping them so that only the year and month remain:
df['drop_off_ym'] = df.drop_off_date.apply(lambda x: x[:-3])
Then apply a groupby on the newly created column, followed by a count():
df_counts_by_month = df.groupby('drop_off_ym')['store ID'].count()
Using your data, I'm assuming your date has been cast to a datetime value, with errors='coerce' used to handle outliers.
You should then drop any NAs from this, so you're only dealing with customers who dropped off.
You can do this in a multitude of ways; I would do a simple df.dropna(subset=['drop_off_date']).
print(df)
Record ID store ID drop_off_date
0 a1274c212 12876 2011-01-27
1 a1534c543 12877 2011-02-23
2 a1232c952 12877 2018-12-02
Let's create a month column to use as an aggregate:
df['Month'] = df['drop_off_date'].dt.strftime('%b')
Then we can do a simple groupby and count the Record IDs (assuming you only want to count unique IDs):
df1 = df.groupby(df['Month'])['Record ID'].count().reset_index()
print(df1)
Month Record ID
0 Dec 1
1 Feb 1
2 Jan 1
EDIT: to account for years, first let's create a year helper column:
df['Year'] = df['drop_off_date'].dt.year
df1 = df.groupby(['Month', 'Year'])['Record ID'].count().reset_index()
print(df1)
Month Year Record ID
0 Dec 2018 1
1 Feb 2011 1
2 Jan 2011 1
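As a sketch of a more direct route, grouping on monthly periods keeps year and month together in one key, so the counts come out in chronological order and the empty months in the 2011-2018 span can be filled in:
# assuming drop_off_date is already datetime
df = df.dropna(subset=['drop_off_date'])
counts = df.groupby(df['drop_off_date'].dt.to_period('M')).size()

# include months with zero dropoffs across the whole span
full_range = pd.period_range('2011-01', '2018-12', freq='M')
counts = counts.reindex(full_range, fill_value=0)
counts.index = counts.index.strftime('%b %Y')   # e.g. "Jan 2011"
print(counts)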

Counting total monthly values from a CSV in Python

I am trying to record monthly sales totals over the course of 2.5 years from a CSV data set.
I started with a CSV file of transaction history for a SKU, sorted by date (MM/DD/YYYY), with varying statuses indicating whether the item was sold, archived (quoted, not sold), or open. I managed to figure out how to display only the "Sold" rows, but cannot figure out how to display a total amount sold per month.
Here's what I have thus far.
#Import Libraries
from pandas import DataFrame, read_csv
import pandas as pd
#Set Variables
fields = ['Date', 'Qty', 'Status']
file = r'kp4.csv'
df = pd.read_csv(file, usecols=fields)
# Filters Dataset to only display "Sold" items in Status column
data = (df[df['Status'] == "Sold"])
print (data)
Output:
Date Qty Status
4 2/21/2018 5 Sold
4 2/21/2018 5 Sold
11 2/16/2018 34 Sold
14 3/16/2018 1 Sold
My ideal output would look something like this:
Date Qty Status
4 02/2018 39 Sold
5 03/2018 1 Sold
I've tried groupby, manipulating the year format, and assigning indexes per other tutorials, and have gotten nothing but errors. If anyone can point me in the right direction, it would be greatly appreciated.
Thanks!
IIUC
df.Date=pd.to_datetime(df.Date)
df=df.drop_duplicates()
df.groupby(df.Date.dt.strftime('%m/%Y')).agg({'Qty':'sum','Status':'first'})
Out[157]:
Qty Status
Date
02/2018 39 Sold
03/2018 1 Sold
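One caveat with the '%m/%Y' strings: they sort lexically, so '01/2019' would land before '02/2018'. A sketch that groups on monthly periods instead keeps chronological order across years:
df.Date = pd.to_datetime(df.Date)
df = df.drop_duplicates()
out = df.groupby(df.Date.dt.to_period('M')).agg({'Qty': 'sum', 'Status': 'first'})
out.index = out.index.strftime('%m/%Y')   # same display format as above
print(out)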
