From the datetime object in the dataframe I created two new columns based on month and day.
data["time"] = pd.to_datetime(data["datetime"])
data['month']= data['time'].apply(lambda x: x.month)
data['day']= data['time'].apply(lambda x: x.day)
The resultant data had the correct month and day added to the specific columns.
Then I tried to filter it for a specific day:
data = data[data['month']=='9']
data = data[data['day']=='2']
These values were visible in the dataframe before filtering.
This returns an empty dataframe. What did I do wrong?
Compare with the integers 9 and 2, without quotes, because .month and .day return integers, not strings:
data = data[(data['month']==9) & (data['day']==2)]
Or:
data = data[(data['time'].dt.month == 9) & (data['time'].dt.day == 2)]
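A minimal sketch of why the string comparison comes back empty, using a small made-up frame (the sample dates are assumptions, only the column names follow the question). The month and day columns hold integers, and an integer column never equals the string '9':

```python
import pandas as pd

# Hypothetical sample data; only the column names follow the question.
data = pd.DataFrame(
    {"datetime": ["2021-09-02 10:00", "2021-09-03 11:30", "2021-10-02 08:15"]}
)
data["time"] = pd.to_datetime(data["datetime"])
data["month"] = data["time"].dt.month
data["day"] = data["time"].dt.day

# Comparing the integer column with a string matches no rows.
empty = data[data["month"] == "9"]

# Comparing with integers selects the expected row.
matched = data[(data["month"] == 9) & (data["day"] == 2)]

print(len(empty), len(matched))
```

Checking data.dtypes before filtering is a quick way to see which type a comparison needs.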
I have the following code
data
for label, content in data.items():
    print('label:', label)
    print('content:', content, sep='\n')
That is all. IGNORE IT
You can subset by index
data2 = data.loc[(data.index.month == 11) & (data.index.day == 10)]
Your index is of datetime type, and you want it converted to strings. First, reset the index:
data2 = data2.reset_index()
data2["Date"] = data2["Date"].astype(str)
Then get the rows as list of lists (your example output)
data2.values.tolist()
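Put together, the steps above might look like the following sketch. The frame, its "value" column and the sample dates are hypothetical; only the index name "Date" is taken from the answer.

```python
import pandas as pd

# Hypothetical frame with a DatetimeIndex named "Date".
idx = pd.DatetimeIndex(["2019-11-10", "2019-11-11", "2020-11-10"], name="Date")
data = pd.DataFrame({"value": [1, 2, 3]}, index=idx)

# Keep every 10 November, regardless of year.
data2 = data.loc[(data.index.month == 11) & (data.index.day == 10)]

# Move the index into a column and turn the dates into strings.
data2 = data2.reset_index()
data2["Date"] = data2["Date"].astype(str)

# One inner list per row.
rows = data2.values.tolist()
print(rows)
```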
You would need to split the date into three columns: "Year", "Month" and "Day".
Then simply do a selection like this (each comparison needs its own parentheses, because & binds more tightly than ==):
newdf = df[(df["Day"] == 20) & (df["Month"] == 11)]
Then you can perform your analysis on this new df.
In plain words, the code means "copy all the rows from df where Month is 11 and Day is 20".
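A small sketch of a Day/Month selection on a made-up frame (the values are assumptions). It also shows what happens when the parentheses around each comparison are left out: Python evaluates & before ==, the expression becomes a chained comparison on a Series, and pandas raises a ValueError.

```python
import pandas as pd

# Hypothetical data already split into Year / Month / Day columns.
df = pd.DataFrame(
    {"Year": [2019, 2019, 2020], "Month": [11, 11, 12], "Day": [20, 21, 20]}
)

# Without parentheses the expression is ambiguous and fails.
try:
    df[df["Day"] == 20 & df["Month"] == 11]
    raised = False
except ValueError:
    raised = True

# With parentheses each comparison yields a Boolean Series first.
newdf = df[(df["Day"] == 20) & (df["Month"] == 11)]
print(raised, len(newdf))
```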
I have the following excel sheet, which I've imported into pandas using read_csv
df
<table><tbody><tr><th>Order ID</th><th>Platform</th><th>Media Source</th><th>Campaign</th><th>1st order</th><th>Order fulfilled</th><th>Date</th></tr><tr><td>1</td><td>Web</td><td>Google</td><td>Cmp1</td><td>TRUE</td><td>TRUE</td><td>1/1/2019</td></tr><tr><td>2</td><td>Web</td><td>Facebook</td><td>FBCmp</td><td>FALSE</td><td>TRUE</td><td>2/1/2019</td></tr><tr><td>3</td><td>Web</td><td>Google</td><td>Cmp1</td><td>TRUE</td><td>FALSE</td><td>1/1/2019</td></tr><tr><td>4</td><td>Web</td><td>Facebook</td><td>FBCmp</td><td>TRUE</td><td>FALSE</td><td>1/1/2019</td></tr><tr><td>5</td><td>Mobile</td><td>Google</td><td>Cmp1</td><td>FALSE</td><td>TRUE</td><td>2/1/2019</td></tr><tr><td>6</td><td>Web</td><td>Google</td><td>Cmp2</td><td>TRUE</td><td>FALSE</td><td>1/1/2019</td></tr><tr><td>7</td><td>Mobile</td><td>Facebook</td><td>FBCmp</td><td>TRUE</td><td>TRUE</td><td>1/1/2019</td></tr><tr><td>8</td><td>Web</td><td>Google</td><td>Cmp2</td><td>FALSE</td><td>FALSE</td><td>2/1/2019</td></tr><tr><td>9</td><td>Mobile</td><td>Google</td><td>Cmp1</td><td>TRUE</td><td>TRUE</td><td>1/1/2019</td></tr><tr><td>10</td><td>Mobile</td><td>Google</td><td>Cmp1</td><td>TRUE</td><td>TRUE</td><td>1/1/2019</td></tr></tbody></table>
I want to add a new column NewOrderForDate which gives me a count of all the orders for that campaign for that date AND 1st Order = TRUE
Here's how the dataframe should look after adding this column
<table><tbody><tr><th>Order ID</th><th>Platform</th><th>Media Source</th><th>Campaign</th><th>1st order</th><th>Order fulfilled</th><th>Date</th><th>NewOrderForDate </th></tr><tr><td>1</td><td>Web</td><td>Google</td><td>Cmp1</td><td>FALSE</td><td>TRUE</td><td>1/1/2019</td><td>5</td></tr><tr><td>2</td><td>Web</td><td>Facebook</td><td>FBCmp</td><td>FALSE</td><td>TRUE</td><td>2/1/2019</td><td>2</td></tr><tr><td>3</td><td>Web</td><td>Google</td><td>Cmp1</td><td>TRUE</td><td>FALSE</td><td>1/1/2019</td><td>5</td></tr><tr><td>4</td><td>Web</td><td>Facebook</td><td>FBCmp</td><td>TRUE</td><td>FALSE</td><td>1/1/2019</td><td>5</td></tr><tr><td>5</td><td>Mobile</td><td>Google</td><td>Cmp1</td><td>TRUE</td><td>TRUE</td><td>2/1/2019</td><td>2</td></tr><tr><td>6</td><td>Web</td><td>Google</td><td>Cmp2</td><td>TRUE</td><td>FALSE</td><td>1/1/2019</td><td>5</td></tr><tr><td>7</td><td>Mobile</td><td>Facebook</td><td>FBCmp</td><td>TRUE</td><td>TRUE</td><td>1/1/2019</td><td>5</td></tr><tr><td>8</td><td>Web</td><td>Google</td><td>Cmp2</td><td>TRUE</td><td>FALSE</td><td>2/1/2019</td><td>2</td></tr><tr><td>9</td><td>Mobile</td><td>Google</td><td>Cmp1</td><td>TRUE</td><td>TRUE</td><td>1/1/2019</td><td>5</td></tr><tr><td>10</td><td>Mobile</td><td>Google</td><td>Cmp1</td><td>FALSE</td><td>TRUE</td><td>1/1/2019</td><td>5</td></tr></tbody></table>
If I had to do this in Excel, I'd probably use
=COUNTIFS(G$2:G$11,G2,E$2:E$11,"TRUE")
Basically, I want to group by column and date and get a count of all the orders where 1st order = TRUE and write these values to a new column
GroupBy 'Campaign', count the '1st order' and add 'NewOrderForDate' column for each group.
def udf(grp_df):
    grp_df['NewOrderForDate'] = len(grp_df[grp_df['1st order']==True])
    return grp_df
result = df.groupby('Campaign', as_index=False, group_keys=False).apply(udf)
Use transform to keep the index shape, and sum the bool value of 1st Order:
df['NewOrderForDate'] = df.groupby(['Date', 'Campaign'])['1st order'].transform(lambda x: x.sum())
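As a rough illustration of the transform approach, here is a sketch on a cut-down, made-up version of the table (the rows are assumptions). Casting the Boolean column to int first is an extra precaution of this sketch, so the sums are plain integer counts.

```python
import pandas as pd

# Hypothetical subset of the orders table.
df = pd.DataFrame({
    "Campaign": ["Cmp1", "Cmp1", "Cmp1", "FBCmp", "FBCmp"],
    "Date": ["1/1/2019", "1/1/2019", "2/1/2019", "1/1/2019", "1/1/2019"],
    "1st order": [True, False, True, True, True],
})

# Count first orders per (Date, Campaign) and broadcast the count to every row.
first_orders = df["1st order"].astype(int)
df["NewOrderForDate"] = first_orders.groupby([df["Date"], df["Campaign"]]).transform("sum")
print(df)
```

Because transform returns a result aligned with the original index, the counts can be assigned straight to a new column without a merge.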
I need to compare a column, called last_login, with today's date, and if the date matches I want to retrieve the whole row into a list, or something similar:
def joined_today(self, df):
    users_joined_today = []
    date_joined = pd.DataFrame(df)
    today = datetime.date.today()
    for i in date_joined['last login']:
        i = i.date()
        if i == today:
            users_joined_today.append(i)
    return users_joined_today
I am just wondering what can be an efficient way to retrieve the whole row matching with the values returned by joined_today() function?
With Pandas, you should aim to use vectorised operations:
# convert series to datetime, if not already
df['last_login'] = pd.to_datetime(df['last_login'])
# calculate Boolean series mask
mask = df['last_login'].dt.normalize() == pd.Timestamp('today').normalize()
# apply mask
df_filtered = df[mask]
# optionally, convert to list of lists
df_filtered_L = df_filtered.values.tolist()
Normalizing a datetime series sets the time component to midnight, so each value can be compared with today's date at midnight. In recent pandas versions pd.to_datetime('today') carries the current time of day, so the right-hand side is normalized explicitly as well.
For example, pd.Timestamp.now().normalize() == pd.Timestamp('today').normalize() returns True.
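A self-contained sketch of the mask on made-up data (the "user" column and the timestamps are assumptions). Both sides of the comparison are normalized to midnight here, which avoids depending on how a given pandas version parses 'today'.

```python
import pandas as pd

now = pd.Timestamp.now()
today = now.normalize()  # today's date at midnight

# Hypothetical logins: right now, one day ago, and today at midnight.
df = pd.DataFrame({
    "user": ["a", "b", "c"],
    "last_login": [now, now - pd.Timedelta(days=1), today],
})

# Boolean mask: True where the login date is today's date.
mask = df["last_login"].dt.normalize() == today
df_filtered = df[mask]
print(df_filtered["user"].tolist())
```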
I'm working with futures timeseries where the trading day starts at 17:00:00 CT and ends at 15:15:00 CT of the next day. To account for this, I make a change in the index, however, when pivoting the dataframe it ignores this change....
Let's look at it with an example:
import datetime
import numpy as np
import pandas as pd
# Dummy Data
rng = pd.date_range('1/1/2011', periods=5000, freq='min')
ts = pd.Series(np.random.randn(len(rng)), index=rng)
df = pd.DataFrame(ts, columns=['ts'])
df_1 = df.resample('5min').last()
# Change index to account for futures hours
df_1.index = pd.to_datetime(df_1.index.values + np.where((df_1.index.time >= datetime.time(17)), pd.offsets.Day(1).nanos, 0))
# Pivoting df_1 and making some formatting changes
df_2 = pd.pivot_table(df_1, index=df_1.index.date, columns=df_1.index.time, values='ts')
df_2.columns = df_2.columns.map(lambda t: t.strftime('%H%M'))
df_2_cols = df_2.columns.tolist()
for i in range(len(df_2_cols)):
    if df_2_cols[i][0] == '0':
        df_2_cols[i] = df_2_cols[i][1:4]
After doing all this, the dataframe is in the shape and format I want but the first column, corresponding to the first timestamp of the day is 00:00:00 instead of 17:00:00, as I intended with the index modification.
How can I fix this??
After pivoting the columns get sorted chronologically. But you can reorder them. Say, the columns are already formatted, so we look for '1700':
pos = np.nonzero(df_2.columns == '1700')[0][0]
(np.nonzero returns a tuple of arrays, hence those double [0]). Then
new_cols = df_2.columns[pos:].append(df_2.columns[:pos])
df_2 = df_2.reindex(columns = new_cols)
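To see the rotation in isolation, here is a sketch on a tiny made-up frame whose column labels stand in for the formatted times (the labels and values are assumptions):

```python
import numpy as np
import pandas as pd

# Hypothetical pivoted frame with chronologically sorted time labels.
df_2 = pd.DataFrame([[1, 2, 3, 4]], columns=["000", "1655", "1700", "1705"])

# Position of the column that should come first.
pos = np.nonzero(df_2.columns == "1700")[0][0]

# Rotate: everything from '1700' onwards, then the columns before it.
new_cols = df_2.columns[pos:].append(df_2.columns[:pos])
df_2 = df_2.reindex(columns=new_cols)
print(df_2.columns.tolist())
```

Note that this only changes the display order of the columns; the values in each column stay as they were.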
New to pandas.
Have a DataFrame of the order:
A B C Date1 Date2 D with multiple rows with values. I want to divide the entire DataFrame into multiple dataframes based on quarters, i.e (Jan-Mar, Apr-Jun, Jul-Sep, Oct-Dec). I am trying to use only the Date1 column values for the same. I tried the following so far:
data_q = data.groupby(pandas.TimeGrouper(freq = '3M'))
The dates are in the form 2009-11-03.
There are a few ways to do this.
I would first check that the Date1 column has a datetime dtype, using the .dtype attribute,
e.g. df['Date1'].dtype
If it's not, cast to datetime object using:
df.Date1 = pd.to_datetime(df.Date1)
Add a quarters column for eventual data frame slicing:
df['quarters'] = df.Date1.dt.quarter
Create your data frames:
q1 = df[df.quarters == 1]
q2 = df[df.quarters == 2]
q3 = df[df.quarters == 3]
q4 = df[df.quarters == 4]
So the approach that appears easiest to me is to convert Date1 to your index, then groupby on the quarter.
df2 = df.set_index('Date1')
quardfs = list(df2.groupby(df2.index.quarter))
This will leave you with quardfs, which is a list of (quarter, DataFrame) tuples.
If you don't want to set Date1 as the index, you can also copy it out of the DataFrame and use it:
quars = pd.DatetimeIndex(df['Date1']).quarter
quardfs = list(df.groupby(quars))
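One possible variation, sketched on made-up data (the rows and the name quarter_frames are assumptions): collecting the groups in a dict keyed by quarter number, so each quarter's DataFrame can be looked up directly. Quarters with no rows simply do not appear.

```python
import pandas as pd

# Hypothetical data; only the Date1 column name follows the question.
df = pd.DataFrame({
    "A": [1, 2, 3, 4],
    "Date1": ["2009-01-15", "2009-05-03", "2009-11-03", "2009-02-20"],
})
df["Date1"] = pd.to_datetime(df["Date1"])

# Map quarter number -> DataFrame holding that quarter's rows.
quarter_frames = {q: frame for q, frame in df.groupby(df["Date1"].dt.quarter)}

print(sorted(quarter_frames))
print(quarter_frames[1])
```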