I have a huge dataframe with multiple time series like in the link below.
Dataframe with multiple time series
The result should be a dataframe with the same values as before, except where there is no value (NaN) in column g. In short: if there is a value in column g, all the other values in that row should stay the same; if there is a NaN in column g, all the other values in that row should also become NaN. There are up to 200 columns in this dataframe, so is it possible to write this without listing the individual column names, as many examples do? I tried it with df.iloc and np.where, but honestly I cannot define the conditions to make it work.
Solution with rows adapted
I hope someone can help me.
Thanks in advance.
import pandas as pd
import numpy as np
columns = ['Date','a','b','c','d','e','f','g']
data = [
['01.06.2022','0,574','0,2342','0,574','0,2342','0,574','0,2342','0,574'],
['02.06.2022','0,574','0,2342','0,574','0,2342','0,574','0,2342',np.nan],
]
df = pd.DataFrame(data, columns=columns)
check_col = 'g'
rows_to_adapt = [x for x in df.columns if x != check_col]
for col in rows_to_adapt:
    df[col] = np.where(df['g'].isnull(), np.nan, df[col])
for index in df.index:
    if pd.isna(df.loc[index, "g"]):
        df.loc[index, :] = np.nan
This should work if I understand your question correctly.
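As an aside, a vectorized sketch of the same idea (not part of the answer above, and assuming column g holds real NaN values as in the sample data):
# Blank out the whole row wherever g is missing, without listing column names
df_masked = df.mask(df['g'].isna())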
Thank you r.uzunok! Unfortunately the output is the same. On 2022-01-02 and 2022-01-05 all values in the two rows should be replaced by NaN.
That's the code now. What do I overlook?
import pandas as pd
import numpy as np
d = {'price': [10, 11, 9, 13, 14, 18, 17, 19],
'volume': [50, 60, 40, 100, 50, 100, 40, 50],
'age': [14, 67, 91, 10, 54, 33, 28, 81],
'g': [50, 'nan', 40, 100, 'nan', 100, 40, 50],
'Date':["2022-01-01", "2022-01-02", "2022-01-03", "2022-01-04", "2022-01-05", "2022-01-06", "2022-01-07", "2022-01-08"]}
df = pd.DataFrame(d)
df = df.set_index("Date")
for index in df.index:
    if pd.isna(df.loc[index, "g"]):
        df.loc[index, :] = np.nan
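One thing that may be worth checking (an assumption based on the sample dictionary above, not something confirmed in the thread): the missing entries of 'g' here are the string 'nan', not np.nan, so pd.isna() returns False for them. A quick sketch of a fix:
# Coerce 'g' to numeric so the string 'nan' becomes a real NaN
df['g'] = pd.to_numeric(df['g'], errors='coerce')
print(df['g'].isna())  # should now be True on 2022-01-02 and 2022-01-05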
I have a dataframe like
df = pd.DataFrame({'time': [1, 5, 100, 250, 253, 260, 700], 'qty': [3, 6, 2, 5, 64, 2, 5]})
df['time_delta'] = df.time.diff()
and I would like to group by time_delta such that all rows where the time_delta is less than 10 are grouped together, the time_delta column is dropped, and qty is summed.
The expected result is
pd.DataFrame({'time': [1, 100, 250, 700], 'qty': [9, 2, 71, 5]})
Basically I am hoping there is something like a df.groupby(time_delta_func(10)).agg({'time': 'min', 'qty': 'sum'}) func. I read up on pd.Grouper but it seems like the grouping based on time is very strict and interval based.
You can do it with gt (greater than) and cumsum, creating a new group each time time_delta is greater than 10:
res = (
    df.groupby(df['time_delta'].gt(10).cumsum(), as_index=False)
    .agg({'time': 'first', 'qty': 'sum'})
)
print(res)
time qty
0 1 9
1 100 2
2 250 71
3 700 5
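If you prefer the aggregation spelled out as in the question, the same grouping key works with 'min' for time as well, since time increases within each group (a small variation on the answer above, not from the original post):
grouper = df['time_delta'].gt(10).cumsum()
res = df.groupby(grouper, as_index=False).agg({'time': 'min', 'qty': 'sum'})
The helper time_delta column is simply left out of the aggregation, so it does not appear in the result.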
I need some help with labeling data inside a dataframe, based on dynamic conditions.
I have a dataframe
df3 = pd.DataFrame({
'first_name': ['John', 'John', 'Jane', 'Jane', 'Jane','Marry', 'Victoria', 'Gabriel', 'John'],
'id': [1, 1, 2, 2, 2, 3, 4, 5, 1],
'age': [30, 30, 25, 25, 25, 30, 45, 15, 30],
'group': [0, 0, 0, 0, 0, 0, 0, 0, 0],
'product_type': [1, 1, 2, 1, 2, 1, 2, 1, 2],
'quantity': [10, 15, 10, 10, 15, 30, 30, 10, 10]
})
df3['agemore'] = (df3['age'] > 20)
df3
So I need to take the first person with id=1 and group=0 and label him with group=1 (on all of his rows).
This person appears on three rows (indexes 0, 1, 8) and has agemore=True, product_type = 1, 1, 2 and quantity = 10, 15, 10.
The conditions for finding matching persons are based on the product_type, quantity and agemore columns.
The slice for the first taken person:
df6=df3.loc[lambda df: (df['id'] ==1) &(df['product_type'] ==1), :]
df6
I need to take agemore = True, product_type = 1 (which is on two rows) and the quantities of this product type (10, 15) as the conditions.
Then I will look for persons who have agemore = True, product_type = 2 (two, i.e. a cross-column search) on two rows and quantities of product_type = 2 of (10, 15). The matching person has id 2, and I must put this person in group 1 as well.
Then take the next person with the lowest id and group=0, take his conditions, look for similar persons, group them together, and so on.
The output I would like to have:
df4 = pd.DataFrame({
'first_name': ['John', 'John', 'Jane', 'Jane', 'Jane','Marry', 'Victoria', 'Gabriel', 'John'],
'id': [1, 1, 2, 2, 2, 3, 4, 5, 1],
'age': [30, 30, 25, 25, 25, 30, 45, 15, 30],
'group': [1, 1, 1, 1, 1, 2, 2, 3, 1],
'product_type': [1, 1, 2, 1, 2, 1, 2, 1, 2],
'quantity': [10, 15, 10, 10, 15, 30, 30, 10, 10]
})
df4
Set 2:
import pandas as pd
data = pd.DataFrame({
'first_name': ['John', 'John', 'Jane', 'Jane', 'Jane','Marry', 'Victoria', 'Gabriel', 'John'],
'id': [1, 1, 2, 2, 2, 3, 4, 5, 1],
'age': [30, 30, 25, 25, 25, 30, 45, 15, 3],
'group': [0, 0, 0, 0, 0, 0, 0, 0, 0],
'product_type': [1, 1, 2, 1, 2, 1, 2, 1, 1],
'quantity': [10, 15, 10, 10, 15, 30, 30, 10, 10]
})
data['agemore'] = (data['age'] > 20)
rm1991, thanks for clarifying your question.
From the information provided, I gathered that you are trying to group customers by their behavior and age group. I can also infer that the IDs are assigned to customers when they first make a transaction with you, which means that the higher the ID value, the newer the customer is to the company.
If this is the case, I would suggest you use an unsupervised learning method to cluster the data points by their similarity regarding the product type, quantity purchased, and age group. Have a look at the SKLearn suite of clustering algorithms for further information.
NB: upon further clarification from rm1991, it seems that product_type is not a "clustering" criteria.
I have replicated your output using only Pandas logic within a loop, as you can see below:
import pandas as pd
data = pd.DataFrame({
'first_name': ['John', 'John', 'Jane', 'Jane', 'Jane','Marry', 'Victoria', 'Gabriel', 'John'],
'id': [1, 1, 2, 2, 2, 3, 4, 5, 1],
'age': [30, 30, 25, 25, 25, 30, 45, 15, 30],
'group': [0, 0, 0, 0, 0, 0, 0, 0, 0],
'product_type': [1, 1, 2, 1, 2, 1, 2, 1, 2],
'quantity': [10, 15, 10, 10, 15, 30, 30, 10, 10]
})
data['agemore'] = (data['age'] > 20)
group_val = 0
for id in data['id'].unique():
    age_param = list(set([age_bool for age_bool in data.loc[data['id'] == id, 'agemore']]))
    # Product type removed as per latest requirements
    # product_type_param = list(set([prod_type for prod_type in data.loc[data['id'] == id, 'product_type']]))
    quantity_param = list(set([qty for qty in data.loc[data['id'] == id, 'quantity']]))
    if data.loc[(data['id'] == id)
                & (data['group'] == 0), :].shape[0] > 0:
        group_val += 1
        data.loc[(data['group'] == 0)
                 & (data['agemore'].isin(age_param))
                 # Product_type removed as per latest requirements
                 # & (data['product_type'].isin(product_type_param))
                 & (data['quantity'].isin(quantity_param)), 'group'] = group_val
Now the output does match what you've posted earlier:
first_name id age group product_type quantity agemore
0 John 1 30 1 1 10 True
1 John 1 30 1 1 15 True
2 Jane 2 25 1 2 10 True
3 Jane 2 25 1 1 10 True
4 Jane 2 25 1 2 15 True
5 Marry 3 30 2 1 30 True
6 Victoria 4 45 2 2 30 True
7 Gabriel 5 15 3 1 10 False
8 John 1 30 1 2 10 True
It remains unclear to me why Victoria, with ID = 4, would be assigned to the same group as Marry (ID = 3), given that they have not purchased the same product_type.
I hope this is helpful.
I am trying to compute a rolling 2-day sum (based on Trans_Date) of the Amount column, grouped by ID, for the table below, using Python.
ID  Trans_Date  Trans_Time  Amount
1   03/23/2019  06:51:03    100
1   03/24/2019  12:32:48    600
1   03/24/2019  14:15:35    50
1   06/05/2019  16:18:21    75
2   02/01/2019  18:02:52    200
2   02/02/2019  10:03:02    150
2   02/03/2019  23:47:51    800
3   01/18/2019  11:12:58    1000
3   01/23/2019  22:12:41    15
Ultimately, I am trying to achieve the result below:
ID  Trans_Date  Trans_Time  Amount  2d_Running_Total
1   03/23/2019  06:51:03    100     100
1   03/24/2019  12:32:48    600     700
1   03/24/2019  14:15:35    250     950
1   06/05/2019  16:18:21    75      75
2   02/01/2019  18:02:52    200     200
2   02/02/2019  10:03:02    150     350
2   02/03/2019  23:47:51    800     950
3   01/18/2019  11:12:58    1000    1000
3   01/23/2019  22:12:41    15      15
This hyperlink came very close to solving this, but the issue is that for records with multiple transactions on the same day, it returns the same value for the whole day.
https://python-forum.io/Thread-Rolling-sum-for-a-window-of-2-days-Pandas
This should do it:
import pandas as pd
# create dummy data
df = pd.DataFrame(
columns = ['ID', 'Trans_Date', 'Trans_Time', 'Amount'],
data = [
[1, '03/23/2019', '06:51:03', 100],
[1, '03/24/2019', '12:32:48', 600],
[1, '03/24/2019', '14:15:35', 250],
[1, '06/05/2019', '16:18:21', 75],
[2, '02/01/2019', '18:02:52', 200],
[2, '02/02/2019', '10:03:02', 150],
[2, '02/03/2019', '23:47:51', 800],
[3, '01/18/2019', '11:12:58', 1000],
[3, '01/23/2019', '22:12:41', 15]
]
)
df_out = pd.DataFrame(
columns = ['ID', 'Trans_Date', 'Trans_Time', 'Amount', '2d_Running_Total'],
data = [
[1, '03/23/2019', '06:51:03', 100, 100],
[1, '03/24/2019', '12:32:48', 600, 700],
[1, '03/24/2019', '14:15:35', 250, 950],
[1, '06/05/2019', '16:18:21', 75, 75],
[2, '02/01/2019', '18:02:52', 200, 200],
[2, '02/02/2019', '10:03:02', 150, 350],
[2, '02/03/2019', '23:47:51', 800, 950],
[3, '01/18/2019', '11:12:58', 1000, 1000],
[3, '01/23/2019', '22:12:41', 15, 15]
]
)
# convert into datetime object and set as index
df['Trans_DateTime'] = pd.to_datetime(df['Trans_Date'] + ' ' + df['Trans_Time'])
df = df.set_index('Trans_DateTime')
# group by ID and apply rolling window to the amount column
df['2d_Running_Total'] = df.groupby('ID')['Amount'].rolling('2d').sum().values.astype(int)
df.reset_index(drop=True)
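To see why this handles the same-day transactions, you can bring the datetime index back as a column and inspect the result (a small usage sketch, assuming the code above has been run):
# Because rolling('2d') works on the full timestamp index, the two 03/24/2019
# rows for ID 1 get different running totals (700 and 950) instead of one value
print(df.reset_index()[['ID', 'Trans_Date', 'Trans_Time', 'Amount', '2d_Running_Total']])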
I want to rank each row by how many of the other columns in df are greater than or equal to a reference column. Given testdf:
testdf = pd.DataFrame({'RefCol': [10, 20, 30, 40],
'Col1': [11, 19, 29, 40],
'Col2': [12, 21, 28, 39],
'Col3': [13, 22, 31, 38]
})
I am using the helper function:
def sorter(row):
    sortedrow = row.sort_values()
    return sortedrow.index.get_loc('RefCol')
as:
testdf['Score'] = testdf.apply(sorter, axis=1)
With the actual data this method is very slow; how can I speed it up? Thanks.
Looks like you need to compare against RefCol and count how many columns are less than RefCol; use:
testdf.lt(testdf['RefCol'],axis=0).sum(1)
0 0
1 1
2 2
3 2
For greater than or equal to, use:
testdf.drop('RefCol', axis=1).ge(testdf['RefCol'], axis=0).sum(axis=1)
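To store this as a Score column, as in the question (a small usage sketch combining the question's goal with the expression above):
testdf['Score'] = testdf.drop('RefCol', axis=1).ge(testdf['RefCol'], axis=0).sum(axis=1)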