I am fairly new to Python and have been using Apriori to analyse my baskets. That said, my team has requested that I identify the top 3 products sold with certain ranges, and I am unsure how to go about this considering I only have access to Excel and Python.
My data is structured in columns listed below.
DocumentNumber - This is the sales document number
DisplayName - product display name
MasterCategory - First hierarchy of the product
Category - second product hierarchy
SubCategory - third product hierarchy
Range - Collection name
Quantity - Number of units sold on that sales document
ProductCode - Product Internal ID
The task is to identify the top 3 Sofas (Category) Ranges, and the top 3 Occasional Chairs (Category), top 3 Coffee Tables (SubCategory), and top 3 Side Tables (SubCategory) that these are often sold with.
I cannot for the life of me figure out how to do this with apriori, and I have over 68,000 rows of transaction data with 33,059 unique transactions to scan for the data above.
Would one of you kind souls please be able to guide me in the right direction?
I have tried Apriori Algorithm in Python, but I am unsure that is the correct way to approach this problem.
You can try pandas. The code will look like this:
import pandas as pd

# Load your data into a pandas DataFrame
df = pd.read_csv("data.csv")

# Filter to the category you're interested in (Sofas is a Category, not a Range)
sofas = df[df['Category'] == 'Sofas']

# Top 3 Ranges by units sold within that category
top_3_sofa_ranges = (sofas.groupby('Range')['Quantity']
                     .sum()
                     .sort_values(ascending=False)
                     .head(3))
Repeat for the other categories.
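For the "often sold with" part, one rough approach without Apriori is to treat each DocumentNumber as a basket: take the documents that contain one of the top 3 sofa Ranges found above, then rank what else was sold on those same documents. A minimal sketch, assuming 'Coffee Tables' is the exact SubCategory label in your data:

# Documents (baskets) that contain one of the top sofa Ranges found above
top_ranges = top_3_sofa_ranges.index
basket_ids = df.loc[df['Range'].isin(top_ranges), 'DocumentNumber'].unique()
baskets = df[df['DocumentNumber'].isin(basket_ids)]

# Top 3 Coffee Table Ranges sold on those same documents
top_3_coffee_tables = (baskets[baskets['SubCategory'] == 'Coffee Tables']
                       .groupby('Range')['Quantity']
                       .sum()
                       .sort_values(ascending=False)
                       .head(3))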
Looks like you are studying data science. You can check Kaggle for more problems with solutions and explanations.
Related
I have a dataframe new_df which has a list of customer id's, dates, and a customer segment for each day. Customer segment can take multiple values. I am looking to identify a list of customers whose segment has changed more than twice in the past 15 days.
Currently, I am using the following to check how many times each segment appears for each customer id.
segment_count = new_df.groupby(new_df['customer_id'].ne(new_df['customer_id'].shift()).cumsum())['segment'].value_counts()
My thinking is if a customer has more than 2 segments which have a count of >1, then they must have migrated from one segment to another at least twice. 2 sample customers may look like this:
|customer_id|day|segment|
|-----------|---|-------|
|12345|'2021-01-01'|G|
|12345|'2021-01-02'|G|
|12345|'2021-01-03'|M|
|12345|'2021-01-04'|G|
|12345|'2021-01-05'|M|
|12345|'2021-01-06'|M|
|6789|'2021-01-01'|G|
|6789|'2021-01-02'|G|
|6789|'2021-01-03'|G|
|6789|'2021-01-04'|G|
|6789|'2021-01-05'|G|
|6789|'2021-01-06'|M|
As an output, I would want to return the following:
|customer_id|migration_count|
|-----------|---------------|
|12345|3|
|6789|1|
Anyone have any advice on the best way to tackle this, or any built-in functions I can use to simplify? Thanks!
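One possible way to count the changes directly, shown only as a rough sketch (it assumes the customer_id, day, and segment columns from the sample above, with new_df already limited to the last 15 days):

new_df = new_df.sort_values(['customer_id', 'day'])

# Count rows where a customer's segment differs from their previous row;
# subtract 1 because each customer's first row always differs from NaN
migrations = (new_df.groupby('customer_id')['segment']
              .apply(lambda s: (s != s.shift()).sum() - 1)
              .rename('migration_count')
              .reset_index())

# Customers whose segment changed more than twice
changed = migrations.loc[migrations['migration_count'] > 2, 'customer_id'].tolist()

On the two sample customers this gives migration counts of 3 and 1, matching the desired output.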
I use datasets that are often CSV flat files that are the result of a variety of other source tables.
I import them using pandas read_csv.
For example, I get a table like this:
|ID|Amount|Client|Company|Company Long Name|ID2|
|--|------|------|-------|-----------------|---|
|0|12|ClientA|CompanyA|The Company A|AA123|
|1|2|ClientA|CompanyA|The Company A|AA2339|
|2|32|ClientB|CompanyA|The Company A|AA3833|
|3|1|ClientB|CompanyB|The Company B|BB3933|
Now I suppose there is a "Company" table somewhere. And I would like to find a way to find columns that are very likely to be from this company table.
So I want to ask, with Python: working from the Company column, are there any "good candidates" for a potential Company table?
In my example, Company Long Name is a good candidate because if I group by Company and count how many unique values I have for the column Company Long Name, the answer is 1.
Also, what I wish to find is whether part of a column would be a good match as well. In my example, the first two characters of ID2 are a good fit.
Ideally, I would like to just provide a column and have the code check every other column (I have hundreds of them) and suggest good candidates, with maybe some idea of a matching score, like 99% meaning that there may be an occurrence where I have more than 1 distinct value for this company ID.
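A rough sketch of one way to do that scan, using only the column names from the example (the score here is simply the share of Company values that map to exactly one distinct value in the other column, so 1.0 means a perfect candidate):

import pandas as pd

def candidate_scores(df, key_col='Company'):
    scores = {}
    for col in df.columns:
        if col == key_col:
            continue
        # distinct values of this column within each Company
        nunique_per_key = df.groupby(key_col)[col].nunique()
        # share of Companies that have exactly one distinct value
        scores[col] = (nunique_per_key == 1).mean()
    return pd.Series(scores).sort_values(ascending=False)

print(candidate_scores(df, 'Company'))

The prefix idea (like the first two characters of ID2) could reuse the same check on df[col].astype(str).str[:2], though that part is not shown here.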
So I have the following dataset of trade flows that track imports, exports, by reporting country and partner countries. After I remove some unwanted columns, I edit my data frame such that trade flows between country A and country B is showing. I'm left with something like this:
[My data frame image]
My issue is that I want to be able to take the average of imports and exports for every partner country ('partner_code') per year, but when I run the following:
x = df[(df.location_code.isin(["IRN"])) &
       (df.partner_code.isin(['TCD']))]
grouped = x.groupby(['partner_code']).mean()
I end up getting an average of all exports divided by all instances where there is a 'product_id' (so a much higher number) rather than averaging imports or exports by total for all the years.
Taking the average of the following 5 export values gives an incorrect average:
[5 export values image]
[Wrong average image]
In pandas, we can group by multiple columns. Based on my understanding, you want to group by partner, country, and year.
The following line would work:
df = df.groupby(['partner_code', 'location_code', 'year'])[['import_value', 'export_value']].mean()
Please note that the resulting DataFrame has a MultiIndex.
For reference, the official documentation: DataFrame.groupby documentation
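Since the result carries a MultiIndex, a short usage note may help; this is just a sketch with the same column names (the year 2019 is made up):

grouped = (df.groupby(['partner_code', 'location_code', 'year'])[['import_value', 'export_value']]
             .mean())

# Rows are addressed by (partner, reporter, year) tuples on the MultiIndex
grouped.loc[('TCD', 'IRN', 2019)]

# Or flatten it back into ordinary columns
grouped = grouped.reset_index()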
I want to generate a DataFrame which involves three industries, 100 investors, and 2,000 firms in 50,000 different deals. So, I have 50,000 rows and four columns, including the deal's date.
I randomly assigned the 100 investors and 2,000 firms to all deals in this dataset.
My problem is that I need to determine the industry for each deal, subject to a couple of conditions: 1) if a firm is randomly assigned to an industry, that firm should stay in the same industry across the whole dataset; 2) all investors invest in at least one industry, 15 investors invest in at least two industries, and four investors invest in all three industries. I don't know how I can do this.
This seems like 2 separate questions.
For 1, one idea is to just hash the firm name and then that will give you a deterministic way of setting which firm goes to which industry.
For 2, one idea is to use a permutation function. For example, you can use numpy.random.shuffle where the first 4 belong to all 3, then the next 15 belong to a random subset of 2... etc, etc.
You could then generate tables off of all this information and use joins to concatenate your final table, where you initially start off with just trade dates, firms, and deal info.
(This is just a starting idea)
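A rough sketch of both ideas together; the seed, the made-up names, and the final DataFrame layout are all assumptions, not part of the question:

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
industries = ['Industry_A', 'Industry_B', 'Industry_C']
firms = [f'Firm_{i}' for i in range(2000)]
investors = [f'Investor_{i}' for i in range(100)]

# 1) A fixed lookup keeps every firm in the same industry across all deals
#    (Python's built-in hash is salted per run, but it stays consistent within one run)
firm_industry = {f: industries[hash(f) % 3] for f in firms}

# 2) Shuffle investors: the first 4 get all three industries, the next 15 get two, the rest one
order = list(investors)
rng.shuffle(order)
investor_industries = {}
for i, inv in enumerate(order):
    if i < 4:
        investor_industries[inv] = list(industries)
    elif i < 19:
        investor_industries[inv] = list(rng.choice(industries, 2, replace=False))
    else:
        investor_industries[inv] = [str(rng.choice(industries))]

# Investors eligible for each industry, precomputed once
eligible = {ind: [inv for inv, inds in investor_industries.items() if ind in inds]
            for ind in industries}

# 50,000 deals: pick a firm, keep its fixed industry, pick an eligible investor and a date
deals = []
for _ in range(50000):
    firm = str(rng.choice(firms))
    ind = firm_industry[firm]
    deals.append({'date': pd.Timestamp('2020-01-01') + pd.Timedelta(days=int(rng.integers(0, 365))),
                  'investor': str(rng.choice(eligible[ind])),
                  'firm': firm,
                  'industry': ind})
df = pd.DataFrame(deals)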
In Python I have created a dataframe that has four columns, related to a list of shares listed on various exchanges around the world. The columns are:
Stock ticker,
Country of listing,
Industry Sector,
PE multiple
I want to create a matrix, that collates the data so that along the X axis I see the various countries, and along the Y axis I see the various sectors. Within each cell I'd like to get the median PE multiple of all the companies that are in each combination of country and sector - e.g. banks in USA and Engineering in UK.
You will have to explain a little more... for example: where are the 'companies' listed?
But from what I understand, you're looking for something like
df.groupby(by=['Country of listing','Industry Sector'])['PE multiple'].median().unstack(level=0)
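If you prefer, pivot_table builds the same country-by-sector matrix in a single call (assuming the column names you listed):

df.pivot_table(index='Industry Sector',
               columns='Country of listing',
               values='PE multiple',
               aggfunc='median')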