I have a dataset that has been recorded periodically with quantities of items sold. The dataset contains the following columns: Item ID, Date (2015-01-01 to 2022-12-01), and Quantity of items sold. How do I split the dataset by which item IDs have 12 or fewer data points historically? I am trying to forecast item sales for the next 6 months, in Python.
grouped_data = df.groupby('item_id').filter(lambda x: x['date'].count() <= 12)  # keeps only the rows of items with at most 12 data points
I have been able to do this by just applying this code:
Good_data = p[p['Create month/year'] > 48].index.values
Good_data.shape
This filters our data to get only the item indexes that have more than 4 years of data points (48 monthly observations), and then looks at the shape of the resulting array.
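A minimal sketch for the original split, assuming the columns are named item_id and date as in the snippet above:
import pandas as pd

# number of historical data points per item
counts = df.groupby('item_id')['date'].count()

short_history = df[df['item_id'].isin(counts[counts <= 12].index)]   # items with at most 12 points
long_history  = df[df['item_id'].isin(counts[counts > 12].index)]    # the rest, used for the 6-month forecast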
I have a dataset of sales data over the last year.
For each unique item in the product category I'd like to calculate the number of days between the first solDate and the last solDate.
The data contains two columns relevant to this: 'solDate', which is a datetime object, and 'product', which is a string. Each row in the dataset is an individual sale of the item.
My first instinct was to use a for loop to iterate through, check each entry and un-check it when the next entry is found, to figure out the last time sold, but I know there must be an easier way. Plus the dataset is 100,000 entries, so I need something semi-efficient that completes in a reasonable time.
I'm using the pandas package for the analysis.
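One possible approach (a sketch, assuming the columns are named solDate and product as described, and that solDate is already a datetime):
import pandas as pd

# days between the first and the last sale of each product
days_on_sale = (
    df.groupby('product')['solDate']
      .agg(lambda s: (s.max() - s.min()).days)
)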
I need the top 10 products sorted by total sales from a pandas dataframe. I can output a list with all the values and product names, but cannot find a way of outputting a sorted list of values. I have tried creating lists, tuples and dictionaries, but I keep getting an error that these cannot be sorted.
petproducts = df_cleaned['prod_title'].unique()
for x in petproducts:
    totaltoys = df_cleaned.loc[df_cleaned['prod_title'] == x, 'total_sales'].sum()
    print(x, totaltoys)
This gives me a series of pairs, which I need to turn into something from which I can extract the top 10 products. I am a beginner and I have been working on this for days.
We can use groupby + sum to calculate total sales per prod_title, then use nlargest to get the 10 products with the highest sales:
top10 = df_cleaned.groupby('prod_title')['total_sales'].sum().nlargest(10)
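If you then need plain Python lists, or a small dataframe, from that result:
top10_names = top10.index.tolist()    # the 10 product titles
top10_values = top10.tolist()         # their total sales, in the same order
top10_df = top10.reset_index()        # or keep both together as a dataframe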
I have downloaded historical prices for 2 stocks from Yahoo Finance and merged the 2 data frames in order to compute the correlation of their close prices over different periods of time (see the attached picture of the merged data frame):
2-day (intraday)
3-day
5-day
One way I am thinking of is to iterate the rows from the bottom of the data frame, take subsets of the two columns Close_x and Close_y of 2/3/5 rows, and calculate the correlations respectively. The calculated correlations will be added as columns to the merged data frame.
I am a novice with Pandas data frames, and I think it's against the nature of a data frame to iterate over each row/column. I was wondering if there's a more efficient way to achieve my goal.
The color-coded boxes are:
red: correlation over 2 days of close prices
blue: correlation over 3 days ...
green: correlation over 5 days ...
df = pd.DataFrame({'Close_x': [23.02000046, 23.13999939, 24.21999931, 26.70000076, 28.03000069],
                   'Close_y': [445.9200134, 446.9700012, 444.0400085, 439.1799927, 439.8599854]})
For the extracted data in the code above, the expected result would be:
The correlation of the last 2 rows is 1.
The correlation of the last 3 rows is -0.8867.
The correlation of the last 5 rows is -0.9510.
The final output will have the correlation coefficients as new columns, like this:
Close_x      Close_y      2D_Corr  3D_Corr  5D_Corr
23.02000046  445.9200134  ...      ...      ...
23.13999939  446.9700012  ...      ...      ...
24.21999931  444.0400085  ...      ...      ...
26.70000076  439.1799927  ...      ...      ...
28.03000069  439.8599854  1        -0.8867  -0.9510
As per TM Bailey's comment, you can use rolling:
import pandas as pd

Close_x = [23.02000046, 23.13999939, 24.21999931, 26.70000076, 28.03000069]
Close_y = [445.9200134, 446.9700012, 444.0400085, 439.1799927, 439.8599854]
s1 = pd.Series(Close_x)
s2 = pd.Series(Close_y)
s1.rolling(5).corr(s2)  # the last value is the 5-row correlation, -0.9510
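To get the layout shown in the expected output, a short sketch assuming the merged frame is called df and has the Close_x / Close_y columns:
df = pd.DataFrame({'Close_x': Close_x, 'Close_y': Close_y})
for window, col in [(2, '2D_Corr'), (3, '3D_Corr'), (5, '5D_Corr')]:
    # trailing correlation over the last `window` rows, written into a new column
    df[col] = df['Close_x'].rolling(window).corr(df['Close_y'])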
How can I compute, in a reasonable amount of time, the correlation between the price time series of two products?
I have a set of 8485 products, so there are around 36 million possible pairs.
Each product is a pandas Series with a TimeStamp index (in days) and the price values. The data spans about 1 year.
For example, the data of some product is like:
price
2020-01-01 200
2020-01-02 250
... ...
2021-02-01 600
I store the data as a tuple of (product id, series) pairs:
products = ((111, series_product_111), (222, series_product_222), ...)
len(products) = 8485
I need the maximum cross-correlation of the prices for each pair of products (I'm using the pandas shift function to lag the data and the pandas corr function to calculate the correlation) in a nested loop. For that, I created a list called list_products with all possible combinations (35M) of indexes:
list_products= [(i,j) for i in range(len(products)) for j in range(len(products)) if i<j]
On the other hand, the time series have different sizes, so to calculate the correlation over the same window of time, I created a function called subset_datatime:
correlation = list()
for i, j in list_products:
    series_1, series_2 = subset_datatime(products[i][1], products[j][1])
    correlation.append((
        [series_1.corr(series_2.shift(t)) for t in range(-10, 10)],
        products[i][0],
        products[j][0],
    ))
Unfortunately this can take up to 3 days on my computer. Is there a more efficient way to achieve it?
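One direction that might help (a sketch, not a drop-in answer): if every series can be reindexed onto one common daily date range and the gaps filled, for example forward-filled, the 36M-pair Python loop can be replaced by matrix algebra on a single prices matrix. Here `products` is the tuple of (id, series) pairs from the question and the lag range mirrors the loop above.
import numpy as np
import pandas as pd

# one column per product, aligned on the union of all dates (assumption: gaps are small enough to fill)
prices = pd.DataFrame({pid: s for pid, s in products}).sort_index().ffill().bfill()
X = prices.to_numpy(dtype=float)          # shape: (n_days, n_products)

def cross_corr_matrix(X, t):
    # Pearson correlation of column i against column j shifted by t, for all pairs at once
    A, B = (X[t:], X[:X.shape[0] - t]) if t >= 0 else (X[:t], X[-t:])
    A = (A - A.mean(axis=0)) / A.std(axis=0)
    B = (B - B.mean(axis=0)) / B.std(axis=0)
    return (A.T @ B) / A.shape[0]

# running maximum over the same lags as in the loop above
max_corr = np.full((X.shape[1], X.shape[1]), -np.inf)
for t in range(-10, 10):
    np.maximum(max_corr, cross_corr_matrix(X, t), out=max_corr)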
I'm new to Python and I'm trying to find my way by performing some calculations (I can do them easily in Excel, but now I want to know how to do them in Python).
One calculation is the covariance.
I have a simple example where I have 3 items that are sold, and we have the demand per item over 24 months.
Here you see a snapshot of the Excel file:
Items and their demand over 24 months
The goal is to measure the covariance between all the three items. Thus the covariance between item 1 and 2, 1 and 3 and 2 and 3. But also, I want to know how to do it for more than 3 items, let's say for a thousand items.
The calculations are as follows:
First I have to calculate the averages per item. This is already something I found by doing the following code:
after importing the following:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
I imported the file:
df = pd.read_excel("Directory\\Covariance.xlsx")
And calculated the average per row:
x=df.iloc[:,1:].values
df['avg'] = x.mean(axis=1)
This gives the file with an extra column, the average (avg):
Items, their demand and the average
The next calculation is the covariance between, let's say, items 1 and 2. Mathematically this is done as follows:
(column "1" of item 1 - column "avg" of item 1) * (column "1" of item 2 - column "avg" of item 2). This has to be done for columns "1" to "24", so 24 times, which would add 24 columns to the dataframe df.
After this, we take the average of these 24 columns, and that gives the covariance between items 1 and 2. Because we have to do this N-1 times per item, in this simple case we get 2 covariance numbers per item (for the first item, the covariance with items 2 and 3; for the second item, with items 1 and 3; and for the third item, with items 1 and 2).
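Written as code, that manual recipe for items 1 and 2 would be something like the sketch below (the row positions are assumed from the screenshot; also note it divides by N, whereas pandas' cov() divides by N-1, so the numbers will differ slightly):
import numpy as np

item1 = df.iloc[0, 1:].astype(float).values   # the 24 demand values of item 1 (assumed to be row 0)
item2 = df.iloc[1, 1:].astype(float).values   # the 24 demand values of item 2

# 24 products of deviations from each item's own average, then their mean
cov_12 = np.mean((item1 - item1.mean()) * (item2 - item2.mean()))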
So the first question is: how can I achieve this for these 3 items, so that the file has columns displaying the 2 covariance outcomes per item (the first item should have a column with the covariance between items 1 and 2 and a second column with the covariance between items 1 and 3, and so on)?
The second question is of course: what if I have 1000 items? Then I have 999 covariance numbers per item and thus 999 extra columns, plus another 999*25 columns if I calculate it via the above methodology. So how do I perform this calculation for every item as efficiently as possible?
Pandas has a built-in function to calculate the covariance matrix, but first you need to make sure your dataframe is in the correct format. The first column in your data actually contains the row labels, so let's put those in the index:
df = pd.read_excel("Directory\\Covariance.xlsx", index_col=0)
Then you can also calculate the mean more easily, but don't put it back in your dataframe yet:
avg = df.mean(axis=1)
To calculate the covariance matrix, just call .cov(). This, however, calculates pairwise covariances of the columns, so transpose the dataframe first:
cov = df.T.cov()
If you want, you can put everything together in 1 dataframe:
df['avg'] = avg
df = df.join(cov, rsuffix='_cov')
Note: the covariance matrix also includes the covariance of each item with itself, which is the variance per item.
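If you then want to look up a single pair from that matrix, for example items 1 and 2, you can index it by label (the exact labels depend on what the first column of your Excel file contains):
cov_item1_item2 = cov.loc['Item 1', 'Item 2']   # 'Item 1' / 'Item 2' are hypothetical labels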