I am doing the following:
# Load data
data = pd.read_csv('C:/Users/user/Desktop/STOCKS.txt', keep_default_na=True, sep='\t', nrows=5)
# Convert dates from object columns to datetime columns
data['DATE'] = pd.to_datetime(data['DATE'])
print(data.columns)
# Index(['COUNTRY_ID', 'STOCK_ID', 'DATE', 'STOCK_VALUE'], dtype='object')
# Count of stock per country per day
data_agg= data.groupby(['COUNTRY_ID'], as_index=False).agg({'DATE': 'count'})
print(data_agg.columns)
# Index(['COUNTRY_ID', 'DATE'], dtype='object')
# Rename count column
data_agg.rename({'DATE': 'Count'}, inplace=True)
print(data_agg.columns)
# Index(['COUNTRY_ID', 'DATE'], dtype='object')
As you can see in the last lines above, I try to rename the aggregated column after the groupby, but for some reason this does not work (the column is still named DATE instead of Count).
How can I fix this?
You need the columns keyword; if it is omitted, rename tries to change the index values instead:
data_agg.rename(columns={'DATE': 'Count'}, inplace=True)
rng = pd.date_range('2017-04-03', periods=10)
data = pd.DataFrame({'DATE': rng, 'COUNTRY_ID': [3]*3+ [4]*5 + [1]*2})
print (data)
DATE COUNTRY_ID
0 2017-04-03 3
1 2017-04-04 3
2 2017-04-05 3
3 2017-04-06 4
4 2017-04-07 4
5 2017-04-08 4
6 2017-04-09 4
7 2017-04-10 4
8 2017-04-11 1
9 2017-04-12 1
data_agg= data.groupby(['COUNTRY_ID'], as_index=False).agg({'DATE': 'count'})
data_agg.rename({'DATE': 'Count', 1:'aaa'}, inplace=True)
print (data_agg)
COUNTRY_ID DATE
0 1 2
aaa 3 3
2 4 5
data_agg.rename(columns={'DATE': 'Count', 1:'aaa'}, inplace=True)
print (data_agg)
COUNTRY_ID Count
0 1 2
1 3 3
2 4 5
Another solution is to remove as_index=False and use DataFrameGroupBy.count with Series.reset_index(name='Count'):
data_agg= data.groupby('COUNTRY_ID')['DATE'].count().reset_index(name='Count')
print (data_agg)
COUNTRY_ID Count
0 1 2
1 3 3
2 4 5
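As a side note, if your pandas version supports named aggregation (0.25 or newer), the rename can be skipped entirely by naming the output column directly in agg; a minimal sketch on the sample data above:
data_agg = (data.groupby('COUNTRY_ID', as_index=False)
                .agg(Count=('DATE', 'count')))
print (data_agg)
   COUNTRY_ID  Count
0           1      2
1           3      3
2           4      5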
I think this solves your problem:
data_agg = data_agg.rename(columns={'DATE': 'Count'})
The problem:
The input table, let's say, is a merged table of calls and bills, with a TIME column for the call and one column per billing month. The idea is to get a table that holds the last 3 bills the person paid, counting back from the time of the call, thereby putting the bills in the context of the call.
The Example input and output:
# INPUT:
# df
# TIME ID 2019-08-01 2019-09-01 2019-10-01 2019-11-01 2019-12-01
# 2019-12-01 1 1 2 3 4 5
# 2019-11-01 2 6 7 8 9 10
# 2019-10-01 3 11 12 13 14 15
# EXPECTED OUTPUT:
# df_context
# TIME ID 0 1 2
# 2019-12-01 1 3 4 5
# 2019-11-01 2 7 8 9
# 2019-10-01 3 11 12 13
EXAMPLE INPUT CREATION:
df = pd.DataFrame({
'TIME': ['2019-12-01','2019-11-01','2019-10-01'],
'ID': [1,2,3],
'2019-08-01': [1,6,11],
'2019-09-01': [2,7,12],
'2019-10-01': [3,8,13],
'2019-11-01': [4,9,14],
'2019-12-01': [5,10,15],
})
The code I have got so far:
# HOW DOES ONE GET THE col_to FOR EVERY ROW?
col_to = df.columns.get_loc(df['TIME'].astype(str).values[0])
col_from = col_to - 3
df_context = pd.DataFrame()
df_context = df_context.append(pd.DataFrame(df.iloc[:, col_from : col_to].values))
df_context["TIME"] = df["TIME"]
cols = df_context.columns.tolist()
df_context = df_context[cols[-1:] + cols[:-1]]
df_context.head()
OUTPUT of my code:
# OUTPUTS:
# TIME 0 1 2
# 0 2019-12-01 2 3 4 should be 3 4 5
# 1 2019-11-01 7 8 9 all good
# 2 2019-10-01 12 13 14 should be 11 12 13
What my code seems to lack is a for loop or two around the first two lines to do what I want, but I just can't believe that there isn't a better solution than the one I am concocting right now.
I would suggest the following steps so that you can avoid dynamic column selection altogether.
Convert the wide table (reference date as columns) to a long table (reference date as rows)
Compute the difference in months between time of the call TIME and reference date
Select only those with difference >= 0 and difference < 3
Format the output table (add a running number, pivot it) according to your requirements
# Initialize dataframe
df = pd.DataFrame({
'TIME': ['2019-12-01','2019-11-01','2019-10-01'],
'ID': [1,2,3],
'2019-08-01': [1,6,11],
'2019-09-01': [2,7,12],
'2019-10-01': [3,8,13],
'2019-11-01': [4,9,14],
'2019-12-01': [5,10,15],
})
# Convert the wide table to a long table by melting the date columns
# Name the new date column as REF_TIME, and the bill column as BILL
date_cols = ['2019-08-01', '2019-09-01', '2019-10-01', '2019-11-01', '2019-12-01']
df = df.melt(id_vars=['TIME','ID'], value_vars=date_cols, var_name='REF_TIME', value_name='BILL')
# Convert TIME and REF_TIME to datetime type
df['TIME'] = pd.to_datetime(df['TIME'])
df['REF_TIME'] = pd.to_datetime(df['REF_TIME'])
# Find out difference between TIME and REF_TIME
df['TIME_DIFF'] = (df['TIME'] - df['REF_TIME']).dt.days
df['TIME_DIFF'] = (df['TIME_DIFF'] / 30).round()
# Keep only the preceding 3 months (including the month = TIME)
selection = (
(df['TIME_DIFF'] < 3) &
(df['TIME_DIFF'] >= 0)
)
# Apply selection, sort the columns and keep only columns needed
df_out = (
df[selection]
.sort_values(['TIME','ID','REF_TIME'])
[['TIME','ID','BILL']]
)
# Add a running number, let's call this BILL_NO
df_out = df_out.assign(BILL_NO = df_out.groupby(['TIME','ID']).cumcount() + 1)
# Pivot the output table to the format needed
df_out = df_out.pivot(index=['ID','TIME'], columns='BILL_NO', values='BILL')
Output:
BILL_NO 1 2 3
ID TIME
1 2019-12-01 3 4 5
2 2019-11-01 7 8 9
3 2019-10-01 11 12 13
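One caveat on the month arithmetic above: dividing the day difference by 30 and rounding works for the sample here but can drift for long spans. If an exact calendar-month difference is needed, it can be computed from the year and month components instead; a minimal sketch, assuming TIME and REF_TIME have already been converted to datetime as in the code above:
# Exact difference in calendar months between TIME and REF_TIME
df['TIME_DIFF'] = (
    (df['TIME'].dt.year - df['REF_TIME'].dt.year) * 12
    + (df['TIME'].dt.month - df['REF_TIME'].dt.month)
)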
Here is my (newbie's) solution; it only works if the dates in the column names are in ascending order:
# Initializing Dataframe
df = pd.DataFrame({
'TIME': ['2019-12-01','2019-11-01','2019-10-01'],
'ID': [1,2,3],
'2019-08-01': [1,6,11],
'2019-09-01': [2,7,12],
'2019-10-01': [3,8,13],
'2019-11-01': [4,9,14],
'2019-12-01': [5,10,15],})
cols = list(df.columns)
new_df = pd.DataFrame([], columns=["0","1","2"])
# Iterating over rows, selecting desired slices and appending them to a new DataFrame:
for i in range(len(df)):
    searched_date = df.iloc[i, 0]
    searched_column_index = cols.index(searched_date)
    searched_row = df.iloc[[i], searched_column_index-2:searched_column_index+1]
    mapping_column_names = {searched_row.columns[0]: "0", searched_row.columns[1]: "1", searched_row.columns[2]: "2"}
    searched_df = searched_row.rename(mapping_column_names, axis=1)
    new_df = pd.concat([new_df, searched_df], ignore_index=True)
new_df = pd.merge(df.iloc[:,0:2], new_df, left_index=True, right_index=True)
new_df
Output:
TIME ID 0 1 2
0 2019-12-01 1 3 4 5
1 2019-11-01 2 7 8 9
2 2019-10-01 3 11 12 13
Anyway, I think @Toukenize's solution is better since it doesn't require iterating.
I have a dataframe with this format:
ID measurement_1 measurement_2
0 3 NaN
1 NaN 5
2 NaN 7
3 NaN NaN
I want to combine to:
ID measurement measurement_type
0 3 1
1 5 2
2 7 2
For each row there will be a value in either the measurement_1 or the measurement_2 column, not in both; the other column will be NaN.
In some rows both columns will be NaN.
I want to add a column for the measurement type (depending on which column holds the value), move the actual value into a single column, and remove the rows that have NaN in both columns.
Is there an easy way of doing this?
Thanks!
Use DataFrame.stack to reshape the dataframe, then reset_index, and use DataFrame.assign to create the measurement_type column by applying Series.str.split('_') + Series.str[-1] to level_1:
df1 = (
df.set_index('ID').stack().reset_index(name='measurement')
.assign(measurement_type=lambda x: x.pop('level_1').str.split('_').str[-1])
)
Result:
print(df1)
ID measurement measurement_type
0 0 3.0 1
1 1 5.0 2
2 2 7.0 2
Maybe combine_first could help?
import numpy as np
df["measurement"] = df["measurement_1"].combine_first(df["measurement_2"])
df["measurement_type"] = np.where(df["measurement_1"].notnull(), 1, 2)
df.drop(["measurement_1", "measurement_2"], 1)
ID measurement measurement_type
0 0 3 1
1 1 5 2
2 2 7 2
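If rows where both measurement columns are NaN also need to be removed (as in the question), dropping rows with a missing combined value covers that. A small sketch building on the same idea, with a hypothetical fourth all-NaN row added to the sample:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "ID": [0, 1, 2, 3],
    "measurement_1": [3, np.nan, np.nan, np.nan],
    "measurement_2": [np.nan, 5, 7, np.nan],
})

df["measurement"] = df["measurement_1"].combine_first(df["measurement_2"])
df["measurement_type"] = np.where(df["measurement_1"].notnull(), 1, 2)
# drop the helper columns and the rows where both measurements were NaN
df = (df.drop(columns=["measurement_1", "measurement_2"])
        .dropna(subset=["measurement"]))
print(df)
   ID  measurement  measurement_type
0   0          3.0                 1
1   1          5.0                 2
2   2          7.0                 2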
Set a threshold and drop any row that has more than one NaN (thresh=2 keeps rows with at least two non-NaN values). Then use df.assign to fillna() measurement_1 from measurement_2 and apply np.where on measurement_2 for the type:
df = (df.dropna(thresh=2)
        .assign(measurement=lambda d: d.measurement_1.fillna(d.measurement_2),
                measurement_type=lambda d: np.where(d.measurement_2.isna(), 1, 2))
        .drop(columns=['measurement_1', 'measurement_2']))
ID measurement measurement_type
0 0 3 1
1 1 5 2
2 2 7 2
You could use pandas melt:
(
df.melt("ID", var_name="measurement_type", value_name="measurement")
.dropna()
.assign(measurement_type=lambda x: x.measurement_type.str[-1])
.iloc[:, [0, -1, 1]]
.astype("int8")
)
or wide_to_long:
(
pd.wide_to_long(df, stubnames="measurement", i="ID",
j="measurement_type", sep="_")
.dropna()
.reset_index()
.astype("int8")
.iloc[:, [0, -1, 1]]
)
ID measurement measurement_type
0 0 3 1
1 1 5 2
2 2 7 2
Motivation: I want to check whether customers have bought anything during the 2 months after their first purchase (retention).
Resources: I have 2 tables:
Buy date, ID and purchase code
ID and the date of the first purchase
Sample data:
Table 1
Date ID Purchase_code
2019-01-01 1 AQT1
2019-01-02 1 TRR1
2019-03-01 1 QTD1
2019-02-01 2 IGJ5
2019-02-05 2 ILW2
2019-02-20 2 WET2
2019-02-28 2 POY6
Table 2
ID First_Buy_Date
1 2019-01-01
2 2019-02-01
The expected result:
ID First_Buy_Date Retention Frequency_buy_at_first_month
1 2019-01-01 1 2
2 2019-02-01 0 4
First convert the columns to datetimes if necessary, then add the first-purchase dates by DataFrame.merge and create the new columns by comparing with Series.le or Series.gt and converting to integers:
df1['Date'] = pd.to_datetime(df1['Date'])
df2['First_Buy_Date'] = pd.to_datetime(df2['First_Buy_Date'])
df = df1.merge(df2, on='ID', how='left')
df['Retention'] = (df['First_Buy_Date'].add(pd.DateOffset(months=2))
.le(df['Date'])
.astype(int))
df['Frequency_buy_at_first_month'] = (df['First_Buy_Date'].add(pd.DateOffset(months=1))
.gt(df['Date'])
.astype(int))
Last, aggregate by GroupBy.agg, using max for Retention (if only a 0 or 1 output is needed) and sum to count the values:
df1 = (df.groupby(['ID','First_Buy_Date'], as_index=False)
.agg({'Retention':'max', 'Frequency_buy_at_first_month':'sum'}))
print (df1)
ID First_Buy_Date Retention Frequency_buy_at_first_month
0 1 2019-01-01 1 2
1 2 2019-02-01 0 4
I have a dataframe with a bunch of columns labelled in 'YYYY-MM' format, along with several other columns. I need to collapse the date columns into calendar quarters and take the mean; I was able to do it manually, but there are a few hundred date columns in my real data and I'd rather not map every single one of them by hand. I'm generating the initial df from a CSV; I didn't see anything in read_csv that seemed like it would help, but if there's anything I can leverage there that would be great. I found dataframe.dt.to_period("Q"), which will convert a datetime object to a quarter, but I'm not quite sure how to apply that here, if I can at all.
Here's a sample df (code below):
foo bar 2016-04 2016-05 2016-06 2016-07 2016-08
0 6 5 3 3 5 8 1
1 9 3 6 9 9 7 8
2 8 5 8 1 9 9 4
3 5 8 1 2 3 5 6
4 4 5 1 2 7 2 6
This code will do what I'm looking for, but I had to generate mapping by hand:
mapping = {'2016-04':'2016q2', '2016-05':'2016q2', '2016-06':'2016q2', '2016-07':'2016q3', '2016-08':'2016q3'}
df = df.set_index(['foo', 'bar']).groupby(mapping, axis=1).mean().reset_index()
New df:
foo bar 2016q2 2016q3
0 6 5 3.666667 4.5
1 9 3 8.000000 7.5
2 8 5 6.000000 6.5
3 5 8 2.000000 5.5
4 4 5 3.333333 4.0
Code to generate the initial df:
df = pd.DataFrame(np.random.randint(1, 11, size=(5, 7)), columns=('foo', 'bar', '2016-04', '2016-05', '2016-06', '2016-07', '2016-08'))
Use a callable that gets applied to the index values. Use axis=1 to apply it to the column labels instead.
(df.set_index(['foo', 'bar'])
.groupby(lambda x: pd.Period(x, 'Q'), axis=1)
.mean().reset_index())
foo bar 2016Q2 2016Q3
0 6 5 3.666667 4.5
1 9 3 8.000000 7.5
2 8 5 6.000000 6.5
3 5 8 2.000000 5.5
4 4 5 3.333333 4.0
The solution is quite short:
Start by copying the "monthly" columns to another DataFrame and converting the
column names to a PeriodIndex:
df2 = df.iloc[:, 2:]
df2.columns = pd.PeriodIndex(df2.columns, freq='M')
Then, to get the result, resample columns by quarter,
compute the mean (for each quarter) and join with 2 "initial" columns:
df.iloc[:, :2].join(df2.resample('Q', axis=1).agg('mean'))
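Note that in recent pandas versions (2.x) the axis=1 keyword for groupby and resample is deprecated, so the column-wise calls above may warn or stop working there. One way around that, under the same PeriodIndex setup, is to transpose, aggregate the rows, and transpose back; a rough sketch (quarterly and result are names used here only for illustration):
df2 = df.iloc[:, 2:]
df2.columns = pd.PeriodIndex(df2.columns, freq='M')
# group the transposed monthly rows by their quarter, average, and transpose back
quarterly = df2.T.groupby(df2.columns.asfreq('Q')).mean().T
result = df.iloc[:, :2].join(quarterly)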
data = [[2,2,2,3,3,3],[1,2,2,3,4,5],[1,2,2,3,4,5],[1,2,2,3,4,5],[1,2,2,3,4,5],[1,2,2,3,4,5]]
df = pd.DataFrame(data, columns = ['A','1996-04','1996-05','2000-07','2000-08','2010-10'])
# separate year columns
df3 = df.iloc[:, 1:]
# separate other columns
df2 = df.iloc[:,0]
#apply groupby using period index
df3=df3.groupby(pd.PeriodIndex(df3.columns, freq='Q'), axis=1).mean()
final_df = pd.concat([df3,df2], axis=1)
print(final_df)
The output (the quarterly means concatenated with column A) was attached as an image in the original post.
I have pandas dataframe something like:
my_df =
chr PI
2 5
2 5
2 5
2 6
2 6
2 8
2 8
2 8
2 8
2 8
3 5
3 5
3 5
3 5
3 9
3 9
3 9
3 9
3 9
3 9
3 9
3 7
3 7
3 4
......
......
I want to convert it into a new dataframe that summarizes this information, something like:
chr: unique chromosomes
unq_PI : number of unique PIs within each chromosome
PIs : list of "PI" values in that chromosome
PI_freq: frequency (count) of each "PI" within the respective chromosome
So, expected output would be:
chr unq_PI PIs PI_freq
2 3 5,6,8 3,2,5
3 4 5,9,7,4 4,7,2,1
I was thinking something like:
new_df = pd.DataFrame({'chr': my_df['chr'].unique(),
'unq_PI': my_df('chr')['unq_PI'].nunique()),
'PIs': .......................,
'PI_freq': ..................})
The only code that works is for `chr` when used alone; any additional code just throws an error. How can I fix this?
Use groupby + value_counts, followed by groupby + agg.
v = (df.groupby('chr')
.PI
.apply(pd.Series.value_counts, sort=False)
.reset_index(level=1)
.astype(str)
.groupby(level=0)
.agg(','.join)
.rename(columns={'level_1' : 'PIs', 'PI' : 'PI_freq'})
)
This doesn't account for the count of unique values; that can be computed using groupby + nunique:
v.insert(0, 'unq_PI', df.groupby('chr').PI.nunique())
v
unq_PI PIs PI_freq
chr
2 3 5,6,8 3,2,5
3 4 4,5,7,9 1,4,2,7
You can use value_counts. Here s is the PI column grouped by chr:
s = df.groupby('chr')['PI']
yourdf = pd.concat([s.nunique(),
                    s.value_counts().to_frame('n').reset_index()
                     .groupby('chr').agg(lambda x: ','.join(x.astype(str)))],
                   axis=1)
yourdf
Out[90]:
PI PI n
chr
2 3 8,5,6 5,3,2
3 4 9,5,7,4 7,4,2,1
yourdf.columns=['unq_PI','PIs','PI_freq']
yourdf
Out[93]:
unq_PI PIs PI_freq
chr
2 3 8,5,6 5,3,2
3 4 9,5,7,4 7,4,2,1
If order is important, use a custom function:
def f(x):
    a = x.value_counts().astype(str).reindex(x.unique())
    i = ['unq_PI','PIs','PI_freq']
    return pd.Series([x.nunique(), ','.join(a.index), ','.join(a)], index=i)
df = df['PI'].astype(str).groupby(df['chr'], sort=False).apply(f).unstack().reset_index()
Another solution:
df = (df.rename(columns={'PI' : 'PIs'})
.groupby(['chr','PIs'], sort=False)
.size()
.rename('PI_freq')
.reset_index(level=1)
.astype(str)
.groupby(level=0)
.agg(','.join)
.assign(unq_PI=lambda x: x['PIs'].str.count(',') + 1)
.reset_index()
.reindex(columns=['chr','unq_PI','PIs','PI_freq'])
)
print (df)
chr unq_PI PIs PI_freq
0 2 3 5,6,8 3,2,5
1 3 4 5,9,7,4 4,7,2,1
Explanation:
You can groupby by both columns and take the size, which gives the unique values of PI and their frequencies per group. Then use reset_index to move the second level of the MultiIndex to a column and cast to string:
df1 = (df.rename(columns={'PI' : 'PIs'})
.groupby(['chr','PIs'], sort=False)
.size()
.rename('PI_freq')
.reset_index(level=1)
.astype(str)
)
print (df1)
PIs PI_freq
chr
2 5 3
2 6 2
2 8 5
3 5 4
3 9 7
3 7 2
3 4 1
Then groupby the index by level=0 and aggregate with join:
df1 = (df.rename(columns={'PI' : 'PIs'})
.groupby(['chr','PIs'], sort=False)
.size()
.rename('PI_freq')
.reset_index(level=1)
.astype(str)
.groupby(level=0)
.agg(','.join)
)
print (df1)
PIs PI_freq
chr
2 5,6,8 3,2,5
3 5,9,7,4 4,7,2,1
Last, get the number of unique values by counting the commas (with assign for the new column), and use reindex for a custom order of the final columns:
df1 = (df.rename(columns={'PI' : 'PIs'})
.groupby(['chr','PIs'], sort=False)
.size()
.rename('PI_freq')
.reset_index(level=1)
.astype(str)
.groupby(level=0)
.agg(','.join)
.assign(unq_PI=lambda x: x['PIs'].str.count(',') + 1)
.reset_index()
.reindex(columns=['chr','unq_PI','PIs','PI_freq'])
)
print (df1)
chr unq_PI PIs PI_freq
0 2 3 5,6,8 3,2,5
1 3 4 5,9,7,4 4,7,2,1