New to Python (and Stack Overflow), I am struggling to find a way to take my ['Product_Name', 'Date_of_Sale', 'Quantity'] data and output the relative frequencies of the daily sale quantities per product.
As an example, Product 1 sells 8 units (Day 1), 6 (Day 2), 6 (Day 3), 5 (Day 4), 8 (Day 5), 7 (Day 6), 6 (Day 7) over 7 days, giving relative frequencies for Product 1 of {5 units: 0.143, 6: 0.429, 7: 0.143, 8: 0.286}.
How can I do this for all products for a period?
Normalize the value counts:
>>> df['Product1'].value_counts(normalize=True)
6 0.428571
8 0.285714
7 0.142857
5 0.142857
Name: Product1, dtype: float64
Doing this "for all products for a period" depends on the structure of your data. You would need to provide a sample and your expected result.
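That said, if the data is in long format with the 'Product_Name', 'Date_of_Sale' and 'Quantity' columns mentioned in the question, a minimal sketch might look like the following (the sample values and the date range are assumptions for illustration):
import pandas as pd

# assumed long-format data: one row per product per day
df = pd.DataFrame({
    'Product_Name': ['Product1'] * 7 + ['Product2'] * 7,
    'Date_of_Sale': pd.date_range('2021-01-01', periods=7).tolist() * 2,
    'Quantity': [8, 6, 6, 5, 8, 7, 6, 3, 3, 4, 2, 3, 4, 3]})

# restrict to the period of interest, then normalize the value counts per product
period = df[df['Date_of_Sale'].between('2021-01-01', '2021-01-07')]
period.groupby('Product_Name')['Quantity'].value_counts(normalize=True)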
Use value_counts() and to_dict():
import pandas as pd
df = pd.DataFrame({'Day': [1, 2, 3, 4, 5, 6, 7],
'Product1': [8, 6, 6, 5, 8, 7, 6]})
df['Product1'].value_counts().div(df.shape[0]).to_dict()
Yields:
{6: 0.42857142857142855, 8: 0.2857142857142857, 7: 0.14285714285714285, 5: 0.14285714285714285}
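To extend this to every product, assuming a long-format frame (call it sales) with the 'Product_Name' and 'Quantity' columns from the question, a rough sketch could build a dict of dicts:
# relative frequencies per product as nested dictionaries
freqs = {name: grp['Quantity'].value_counts(normalize=True).to_dict()
         for name, grp in sales.groupby('Product_Name')}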
I have an array which contains 50 time series. Each time series has 50 values.
The shape of my array is therefore:
print(arr.shape)   # (50, 50)
I want to extract the 50 time series and I want to assign a year to each of them:
years = list(range(1900,1950))
print(len(years))   # 50
The order should be maintained. years[0] should correspond with arr[0,:] (this is the first time series).
I would be glad for any help!
Edit: here is a small example:
import random
import numpy as np

years = list(range(1900, 1904))
values = random.sample(range(10, 30), 16)
arr = np.reshape(values, (4, 4))
Let's say you have the following data:
import numpy as np
data = np.random.randint(low=1, high=9, size=(5, 4))
years = np.arange(1900, 1905)
You can use np.concatenate:
>>> arr = np.concatenate([years[:, None], data], axis=1)
>>> arr
array([[1900,    5,    8,    1,    2],
       [1901,    3,    3,    1,    5],
       [1902,    7,    4,    7,    5],
       [1903,    1,    6,    6,    4],
       [1904,    4,    5,    3,    8]])
or maybe use a pandas.DataFrame:
>>> import pandas as pd
>>> df = pd.DataFrame(data)
>>> df = df.assign(year=years)
>>> df = df.set_index("year")
>>> df
      0  1  2  3
year            
1900  3  2  8  1
1901  5  8  5  2
1902  3  5  4  3
1903  6  2  7  6
1904  8  8  4  6
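If a DataFrame is more than you need, a plain dictionary keyed by year also preserves the pairing; this is just a sketch reusing the years and data arrays from above:
# map each year to its corresponding row (one time series per year)
series_by_year = dict(zip(years, data))
series_by_year[1900]   # same as data[0, :]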
Given a DataFrame that represents instances of called customers:
import pandas as pd
import numpy as np
df_1 = pd.DataFrame({"customer_id" : [1, 1, 1, 2, 2, 3, 3, 3, 3, 4, 5, 5]})
The data is ordered by time such that every customer is a time-series and every customer has different timestamps. Thus I need a column that consists of the ranked timepoints:
df_2 = pd.DataFrame({"customer_id" : [1, 1, 1, 2, 2, 3, 3, 3, 3, 4, 5, 5],
"call_nr" : [0,1,2,0,1,0,1,2,3,0,0,1]})
After trying different approaches I came up with this to create call_nr:
np.concatenate([np.arange(df_1["customer_id"].value_counts().loc[i]) for i in df_1["customer_id"].unique()])
It works, but I doubt this is best practice. Is there a better solution?
A simpler solution would be to groupby your 'customer_id' and use cumcount:
>>> df_1.groupby('customer_id').cumcount()
0 0
1 1
2 2
3 0
4 1
5 0
6 1
7 2
8 3
9 0
10 0
11 1
which you can assign back as a column in your dataframe.
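For example (call_nr is the column name used in the question):
df_1['call_nr'] = df_1.groupby('customer_id').cumcount()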
I'm trying to extract a dataframe which only shows values that appear, e.g., 3 or more times in a column. For example:
df = pd.DataFrame({
'one': pd.Series(['Berlin', 'Berlin', 'Tokyo', 'Stockholm','Berlin','Stockholm','Amsterdam']),
'two': pd.Series([1, 2, 3, 4, 5, 6, 7]),
'three': pd.Series([8, 9, 10, 11, 12])
})
Expected output:
      one  two  three
0  Berlin    1      8
The extraction should only show the row of the first duplicate.
You could do it like this:
rows = df.groupby('one').filter(lambda group: group.shape[0] >= 3).groupby('one').first()
Output:
>>> rows
        two  three
one               
Berlin    1    8.0
It works with multiple groups of 3+ duplicates, too. I tested it.
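If you want to keep the original row intact (including its index and the 'one' column, as in the expected output), a rough alternative sketch is:
# keep rows whose value in 'one' occurs 3 or more times, then keep only the first such row
counts = df['one'].value_counts()
rows = df[df['one'].isin(counts[counts >= 3].index)].drop_duplicates('one')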
I have a dataframe below:
import pandas as pd
d = {'id': [1, 2, 3, 4, 4, 6, 1, 8, 9], 'cluster': [7, 2, 3, 3, 3, 6, 7, 8, 8]}
df = pd.DataFrame(data=d)
df = df.sort_values('cluster')
I want to keep ALL rows of a cluster if it contains at least two different ids, i.e. keep every row of that cluster (even rows with a repeated id) as long as a different id occurs at least once within that cluster.
The code I have been using to achieve this is below, but the problem is that it drops too many rows for what I am looking for.
df = (df.assign(counts=df.count(axis=1))
        .sort_values(['id', 'counts'])
        .drop_duplicates(['id', 'cluster'], keep='last')
        .drop('counts', axis=1))
The output I am expecting (which the code above does not produce) would drop the rows at dataframe indexes 1, 5, 0 and 6 but keep indexes 2, 3, 4, 7 and 8, essentially resulting in what the code below produces:
df = df.loc[[2, 3, 4, 7, 8]]
I have looked at many deduplication pandas posts on stack overflow but have yet to find this
scenario. Any help would be greatly appreciated.
I think we can do this with a single boolean mask, using .groupby().nunique():
con1 = df.groupby('cluster')['id'].nunique() > 1
which gives:
cluster
2    False
3     True
6    False
7    False
8     True
Name: id, dtype: bool
Of these we only want the True indexes:
df.loc[df['cluster'].isin(con1[con1].index)]
   id  cluster
2   3        3
3   4        3
4   4        3
7   8        8
8   9        8
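A more compact variant of the same idea (a sketch, not part of the answer above) uses transform so the mask lines up with the rows directly:
# keep rows whose cluster contains more than one distinct id
df[df.groupby('cluster')['id'].transform('nunique') > 1]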
I have a dataset of stores with 2D locations at daily timestamps. I am trying to match up each row with weather measurements made at stations at some other locations, also with daily timestamps, such that the Cartesian distance between each store and matched station is minimized. The weather measurements have not been performed daily, and the station positions may vary, so this is a matter of finding the closest station for each specific store at each specific day.
I realize that I can construct nested loops to perform the matching, but I am wondering if anyone here can think of some neat way of using pandas dataframe operations to accomplish this. A toy example dataset is shown below. For simplicity, it has static weather station positions.
store_df = pd.DataFrame({
'store_id': [1, 1, 1, 2, 2, 2, 3, 3, 3],
'x': [1, 1, 1, 4, 4, 4, 4, 4, 4],
'y': [1, 1, 1, 1, 1, 1, 4, 4, 4],
'date': [1, 2, 3, 1, 2, 3, 1, 2, 3]})
weather_station_df = pd.DataFrame({
'station_id': [1, 1, 1, 2, 2, 3, 3, 3],
'weather': [20, 21, 19, 17, 16, 18, 19, 17],
'x': [0, 0, 0, 5, 5, 3, 3, 3],
'y': [2, 2, 2, 1, 1, 3, 3, 3],
'date': [1, 2, 3, 1, 3, 1, 2, 3]})
The data below is the desired outcome. I have included station_id only for clarification.
   store_id  date  station_id  weather
0         1     1           1       20
1         1     2           1       21
2         1     3           1       19
3         2     1           2       17
4         2     2           3       19
5         2     3           2       16
6         3     1           3       18
7         3     2           3       19
8         3     3           3       17
The idea of the solution is to build the table of all combinations,
df = store_df.merge(weather_station_df, on='date', suffixes=('_store', '_station'))
calculate the distance
df['dist'] = (df.x_store - df.x_station)**2 + (df.y_store - df.y_station)**2
and choose the minimum per group:
df.groupby(['store_id', 'date']).apply(lambda x: x.loc[x.dist.idxmin(), ['station_id', 'weather']]).reset_index()
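An apply-free sketch of the same step, assuming the merged frame df and the dist column from above, indexes the per-group minima directly:
# idxmin returns the row label of the smallest distance within each (store_id, date) group
best = df.loc[df.groupby(['store_id', 'date'])['dist'].idxmin()]
best[['store_id', 'date', 'station_id', 'weather']].reset_index(drop=True)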
If you have a lot of data, you can do the join per group.
import numpy as np

def distance(x1, x2, y1, y2):
    return np.sqrt((x2 - x1)**2 + (y2 - y1)**2)
#Join on date to get all combinations of stores and stations per day
#(store columns get the default _x suffix, station columns _y)
df_all = store_df.merge(weather_station_df, on=['date'])
#Apply the vectorized distance formula to each combination
df_all['distances'] = distance(df_all['x_y'], df_all['x_x'], df_all['y_y'], df_all['y_x'])
#Get Minimum distance for each day Per store_id
df_mins = df_all.groupby(['date', 'store_id'])['distances'].min().reset_index()
#Use resulting minimums to get the station_id matching the min distances
closest_stations_df = df_mins.merge(df_all, on=['date', 'store_id', 'distances'], how='left')
#filter out the unnecessary columns
result_df = closest_stations_df[['store_id', 'date', 'station_id', 'weather', 'distances']].sort_values(['store_id', 'date'])
Edited to use a vectorized distance formula.