Pandas Dataframe filter based on time - python

From the datetime object in the dataframe I created two new columns based on month and day.
data["time"] = pd.to_datetime(data["datetime"])
data['month']= data['time'].apply(lambda x: x.month)
data['day']= data['time'].apply(lambda x: x.day)
The resultant data had the correct month and day added to the specific columns.
Then I tried to filter it based on a specific day:
data = data[data['month']=='9']
data = data[data['day']=='2']
These values were visible in the dataframe before filtering, yet the filter returns an empty dataframe. What did I do wrong?

Compare with the integers 9 and 2, without quotes, because the month and day columns hold integers rather than strings:
data = data[(data['month']==9) & (data['day']==2)]
Or:
data = data[(data['time'].dt.month == 9) & (data['time'].dt.day == 2)]
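A minimal sketch of why the string comparison comes back empty, using a small made-up frame (the sample timestamps are assumptions for illustration only). The month and day extracted from a datetime column are integers, so comparing them with '9' matches no rows:

```python
import pandas as pd

# Hypothetical sample data, for illustration only
data = pd.DataFrame(
    {"datetime": ["2019-09-02 10:00", "2019-09-03 11:30", "2019-10-02 08:15"]}
)
data["time"] = pd.to_datetime(data["datetime"])
data["month"] = data["time"].dt.month  # integer column
data["day"] = data["time"].dt.day      # integer column

# Comparing an integer column with a string matches nothing
empty = data[data["month"] == "9"]

# Comparing with integers selects the expected row
matched = data[(data["month"] == 9) & (data["day"] == 2)]

print(len(empty), len(matched))
```

Using the .dt accessor instead of apply with a lambda gives the same columns and is usually faster on large frames.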

Related

How to iterate through data set in pandas based on days?

I have the following code
data
for label, content in data.items():
    print('label:', label)
    print('content:', content, sep='\n')
That is all. IGNORE IT
You can subset by index
data2 = data.loc[(data.index.month == 11) & (data.index.day == 10)]
Your index is a datetime type, and you want it converted to strings. First we need to reset the index:
data2 = data2.reset_index()
data2["Date"] = data2["Date"].astype(str)
Then get the rows as list of lists (your example output)
data2.values.tolist()
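The steps above can be sketched end to end on a small frame. The sample values and the index name "Date" are assumptions chosen to match the snippet; the real data may differ:

```python
import pandas as pd

# Hypothetical frame with a DatetimeIndex named "Date"
idx = pd.to_datetime(["2018-11-10", "2019-11-10", "2019-11-11"])
data = pd.DataFrame({"value": [1.0, 2.0, 3.0]}, index=idx)
data.index.name = "Date"

# Keep only the rows that fall on 10 November, in any year
data2 = data.loc[(data.index.month == 11) & (data.index.day == 10)]

# Move the index into a column and turn the dates into strings
data2 = data2.reset_index()
data2["Date"] = data2["Date"].astype(str)

# One inner list per row
rows = data2.values.tolist()
print(rows)
```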
You would need to split the date into three columns: "Year", "Month", "Day".
Then simply do a selection like this (each comparison needs its own parentheses, because & binds more tightly than ==):
newdf = df[(df["Day"] == 20) & (df["Month"] == 11)]
Then you can perform your analysis on this new df.
In plain words, the code means "copy all the rows from df where Month is 11 and Day is 20".
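The answer does not show how to build the three columns. One possible sketch, assuming the dates live in a column named "Date" (a hypothetical name) that can be parsed by pd.to_datetime:

```python
import pandas as pd

# Hypothetical frame; the "Date" and "value" columns are assumptions
df = pd.DataFrame(
    {
        "Date": pd.to_datetime(["2019-11-20", "2019-11-21", "2020-11-20"]),
        "value": [10, 20, 30],
    }
)

# Split the date into three integer columns
df["Year"] = df["Date"].dt.year
df["Month"] = df["Date"].dt.month
df["Day"] = df["Date"].dt.day

# Each comparison is wrapped in parentheses before combining with &
newdf = df[(df["Day"] == 20) & (df["Month"] == 11)]
print(newdf)
```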

Pandas - How to insert a new column with the count when there are multiple clauses

I have the following excel sheet, which I've imported into pandas using read_csv
df
<table><tbody><tr><th>Order ID</th><th>Platform</th><th>Media Source</th><th>Campaign</th><th>1st order</th><th>Order fulfilled</th><th>Date</th></tr><tr><td>1</td><td>Web</td><td>Google</td><td>Cmp1</td><td>TRUE</td><td>TRUE</td><td>1/1/2019</td></tr><tr><td>2</td><td>Web</td><td>Facebook</td><td>FBCmp</td><td>FALSE</td><td>TRUE</td><td>2/1/2019</td></tr><tr><td>3</td><td>Web</td><td>Google</td><td>Cmp1</td><td>TRUE</td><td>FALSE</td><td>1/1/2019</td></tr><tr><td>4</td><td>Web</td><td>Facebook</td><td>FBCmp</td><td>TRUE</td><td>FALSE</td><td>1/1/2019</td></tr><tr><td>5</td><td>Mobile</td><td>Google</td><td>Cmp1</td><td>FALSE</td><td>TRUE</td><td>2/1/2019</td></tr><tr><td>6</td><td>Web</td><td>Google</td><td>Cmp2</td><td>TRUE</td><td>FALSE</td><td>1/1/2019</td></tr><tr><td>7</td><td>Mobile</td><td>Facebook</td><td>FBCmp</td><td>TRUE</td><td>TRUE</td><td>1/1/2019</td></tr><tr><td>8</td><td>Web</td><td>Google</td><td>Cmp2</td><td>FALSE</td><td>FALSE</td><td>2/1/2019</td></tr><tr><td>9</td><td>Mobile</td><td>Google</td><td>Cmp1</td><td>TRUE</td><td>TRUE</td><td>1/1/2019</td></tr><tr><td>10</td><td>Mobile</td><td>Google</td><td>Cmp1</td><td>TRUE</td><td>TRUE</td><td>1/1/2019</td></tr></tbody></table>
I want to add a new column NewOrderForDate which gives me a count of all the orders for that campaign for that date AND 1st Order = TRUE
Here's how the dataframe should look after adding this column
<table><tbody><tr><th>Order ID</th><th>Platform</th><th>Media Source</th><th>Campaign</th><th>1st order</th><th>Order fulfilled</th><th>Date</th><th>NewOrderForDate </th></tr><tr><td>1</td><td>Web</td><td>Google</td><td>Cmp1</td><td>FALSE</td><td>TRUE</td><td>1/1/2019</td><td>5</td></tr><tr><td>2</td><td>Web</td><td>Facebook</td><td>FBCmp</td><td>FALSE</td><td>TRUE</td><td>2/1/2019</td><td>2</td></tr><tr><td>3</td><td>Web</td><td>Google</td><td>Cmp1</td><td>TRUE</td><td>FALSE</td><td>1/1/2019</td><td>5</td></tr><tr><td>4</td><td>Web</td><td>Facebook</td><td>FBCmp</td><td>TRUE</td><td>FALSE</td><td>1/1/2019</td><td>5</td></tr><tr><td>5</td><td>Mobile</td><td>Google</td><td>Cmp1</td><td>TRUE</td><td>TRUE</td><td>2/1/2019</td><td>2</td></tr><tr><td>6</td><td>Web</td><td>Google</td><td>Cmp2</td><td>TRUE</td><td>FALSE</td><td>1/1/2019</td><td>5</td></tr><tr><td>7</td><td>Mobile</td><td>Facebook</td><td>FBCmp</td><td>TRUE</td><td>TRUE</td><td>1/1/2019</td><td>5</td></tr><tr><td>8</td><td>Web</td><td>Google</td><td>Cmp2</td><td>TRUE</td><td>FALSE</td><td>2/1/2019</td><td>2</td></tr><tr><td>9</td><td>Mobile</td><td>Google</td><td>Cmp1</td><td>TRUE</td><td>TRUE</td><td>1/1/2019</td><td>5</td></tr><tr><td>10</td><td>Mobile</td><td>Google</td><td>Cmp1</td><td>FALSE</td><td>TRUE</td><td>1/1/2019</td><td>5</td></tr></tbody></table>
If I had to do this in Excel, I'd probably use
=COUNTIFS(G$2:G$11,G2,E$2:E$11,"TRUE")
Basically, I want to group by column and date and get a count of all the orders where 1st order = TRUE and write these values to a new column
GroupBy 'Campaign', count the '1st order' and add 'NewOrderForDate' column for each group.
def udf(grp_df):
    grp_df['NewOrderForDate'] = len(grp_df[grp_df['1st order']==True])
    return grp_df
result = df.groupby('Campaign', as_index=False, group_keys=False).apply(udf)
Use transform to keep the index shape, and sum the bool value of 1st Order:
df['NewOrderForDate'] = df.groupby(['Date', 'Campaign'])['1st order'].transform(lambda x: x.sum())
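A small sketch of the transform approach on a cut-down version of the data (the four rows below are made up for illustration, and '1st order' is assumed to have been parsed as booleans). Passing the string "sum" is equivalent to the lambda here, since summing booleans counts the True values:

```python
import pandas as pd

# Hypothetical cut-down data; '1st order' is assumed to be boolean
df = pd.DataFrame(
    {
        "Campaign": ["Cmp1", "FBCmp", "Cmp1", "Cmp1"],
        "1st order": [True, False, True, False],
        "Date": ["1/1/2019", "2/1/2019", "1/1/2019", "2/1/2019"],
    }
)

# Count first orders per (Date, Campaign) and broadcast back to every row
df["NewOrderForDate"] = (
    df.groupby(["Date", "Campaign"])["1st order"].transform("sum")
)
print(df)
```

Because transform returns a result aligned with the original index, the new column can be assigned directly without a merge.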

Retrieve an entire row for a DataFrame

I need to check a column called last_login, and if the date matches today's date I want to retrieve the whole row into a list, or something similar:
def joined_today(self, df):
    users_joined_today = []
    date_joined = pd.DataFrame(df)
    today = datetime.date.today()
    for i in date_joined['last login']:
        i = i.date()
        if i == today:
            users_joined_today.append(i)
    return users_joined_today
I am just wondering what can be an efficient way to retrieve the whole row matching with the values returned by joined_today() function?
With Pandas, you should aim to use vectorised operations:
# convert series to datetime, if not already
df['last_login'] = pd.to_datetime(df['last_login'])
# calculate Boolean series mask
mask = df['last_login'].dt.normalize() == pd.Timestamp.today().normalize()
# apply mask
df_filtered = df[mask]
# optionally, convert to list of lists
df_filtered_L = df_filtered.values.tolist()
Normalizing a datetime series flattens the time component to zero, so you can compare it with a timestamp for today that has been normalized the same way. Depending on the pandas version, pd.to_datetime('today') may carry the current time of day, so normalizing it explicitly is the safer choice.
For example, pd.Timestamp.now().normalize() == pd.Timestamp.today().normalize() returns True.
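A minimal, self-contained sketch of the vectorised filter, with both sides of the comparison normalized explicitly. The "user" column and the sample rows are assumptions for illustration:

```python
import pandas as pd

now = pd.Timestamp.now()

# Hypothetical data: two logins today, one yesterday
df = pd.DataFrame(
    {
        "user": ["a", "b", "c"],
        "last_login": [now, now - pd.Timedelta(days=1), now.normalize()],
    }
)

# Midnight of the current day
today = pd.Timestamp.today().normalize()

# Boolean mask: True where the login date equals today's date
mask = df["last_login"].dt.normalize() == today

# Whole rows for the matching users, as a list of lists
rows_today = df[mask].values.tolist()
print(df.loc[mask, "user"].tolist())
```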

How to modify index behavior when pivoting dataframe in Python Pandas?

I'm working with futures timeseries where the trading day starts at 17:00:00 CT and ends at 15:15:00 CT of the next day. To account for this, I shift the index, but the pivoted dataframe seems to ignore the change.
Let's look at it with an example:
# Dummy Data
rng = pd.date_range('1/1/2011', periods=5000, freq='min')
ts = pd.Series(np.random.randn(len(rng)), index=rng)
df = pd.DataFrame(ts, columns=['ts'])
df_1 = df.resample('5min').last()
# Change index to account for futures hours
df_1.index = pd.to_datetime(df_1.index.values + np.where((df_1.index.time >= datetime.time(17)), pd.offsets.Day(1).nanos, 0))
# Pivoting df_1 and making some formatting changes
df_2 = pd.pivot_table(df_1, index=df_1.index.date, columns=df_1.index.time, values='ts')
df_2.columns = df_2.columns.map(lambda t: t.strftime('%H%M'))
df_2_cols = df_2.columns.tolist()
for i in range(len(df_2_cols)):
    if df_2_cols[i][0] == '0':
        df_2_cols[i] = df_2_cols[i][1:4]
After doing all this, the dataframe is in the shape and format I want, but the first column, corresponding to the first timestamp of the day, is 00:00:00 instead of 17:00:00 as I intended with the index modification.
How can I fix this?
After pivoting the columns get sorted chronologically. But you can reorder them. Say, the columns are already formatted, so we look for '1700':
pos = np.nonzero(df_2.columns == '1700')[0][0]
(np.nonzero returns a tuple of arrays, hence those double [0]). Then
new_cols = df_2.columns[pos:].append(df_2.columns[:pos])
df_2 = df_2.reindex(columns = new_cols)
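The rotation can be tried on a toy frame. The column labels below are assumptions standing in for the formatted time columns:

```python
import numpy as np
import pandas as pd

# Hypothetical pivoted frame with chronologically sorted time columns
df_2 = pd.DataFrame([[1, 2, 3, 4]], columns=["000", "800", "1700", "2000"])

# Position of the column that should come first
pos = np.nonzero(df_2.columns == "1700")[0][0]

# Move everything before that position to the end
new_cols = df_2.columns[pos:].append(df_2.columns[:pos])
df_2 = df_2.reindex(columns=new_cols)

print(df_2.columns.tolist())
```

Only the column order changes; the values stay attached to their original labels.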

Divide entire DataFrame based on dates in specific columns into quarterly dataframes

New to pandas.
Have a DataFrame of the order:
A B C Date1 Date2 D with multiple rows with values. I want to divide the entire DataFrame into multiple dataframes based on quarters, i.e (Jan-Mar, Apr-Jun, Jul-Sep, Oct-Dec). I am trying to use only the Date1 column values for the same. I tried the following so far:
data_q = data.groupby(pandas.TimeGrouper(freq = '3M'))
The dates are in the form 2009-11-03.
There are a few ways to do this.
I would first check that the Date1 column is a datetime type using the .dtype attribute.
e.g. df['Date1'].dtype
If it's not, cast to datetime object using:
df.Date1 = pd.to_datetime(df.Date1)
Add a quarters column for eventual data frame slicing:
df['quarters'] = df.Date1.dt.quarter
Create your data frames:
q1 = df[df.quarters == 1]
q2 = df[df.quarters == 2]
q3 = df[df.quarters == 3]
q4 = df[df.quarters == 4]
So the approach that appears easiest to me is to convert Date1 to your index, then groupby on the quarter.
df2 = df.set_index('Date1')
quardfs = list(df2.groupby(df2.index.quarter))
This will leave you with quardfs, which is a list of (quarter, DataFrame) tuples.
If you don't want to set Date1 to an index, you can also copy it out of the DataFrame and use it:
quars = pd.DatetimeIndex(df['Date1']).quarter
quardfs = list(df.groupby(quars))
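As a variation on the groupby approach, the groups can be collected into a dictionary keyed by quarter number, which avoids unpacking tuples later. This is a sketch on made-up rows; only the Date1 column name comes from the question:

```python
import pandas as pd

# Hypothetical data; column "A" and the dates are assumptions
df = pd.DataFrame(
    {
        "A": [1, 2, 3, 4],
        "Date1": pd.to_datetime(
            ["2009-01-15", "2009-05-03", "2009-11-03", "2009-12-25"]
        ),
    }
)

# One DataFrame per quarter that actually occurs in the data
quarter_frames = {
    quarter: group for quarter, group in df.groupby(df["Date1"].dt.quarter)
}

print(sorted(quarter_frames))
print(quarter_frames[4])
```

Note that this groups by quarter number only, so rows from the same quarter of different years end up together; group by year and quarter if they should be kept apart.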
