I have a pandas dataframe with a column that indicates which hour of the day a particular action was performed. So df['hour'] is many rows each with a value from 0 to 23.
I am trying to create dummy variables for things like 'is_morning', for example:
if df['hour'] >= 5 and < 12 then return 1, else return 0
A for loop doesn't work given the size of the data set, and I've tried some other stuff like
df['is_morning'] = df['hour'] >= 5 and < 12
Any suggestions??
You can just do:
df['is_morning'] = (df['hour'] >= 5) & (df['hour'] < 12)
i.e. wrap each condition in parentheses, and use &, which is an and operation that works across the whole vector/column.
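For instance, a minimal sketch with made-up hours (the `.astype(int)` cast at the end turns the boolean mask into the 1/0 dummy the question asked for):

```python
import pandas as pd

df = pd.DataFrame({'hour': [3, 5, 11, 12, 23]})

# Each comparison yields a boolean Series; & combines them element-wise.
df['is_morning'] = ((df['hour'] >= 5) & (df['hour'] < 12)).astype(int)

print(df['is_morning'].tolist())  # [0, 1, 1, 0, 0]
```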

---
I am trying to figure out a way to do this sum in one line, or without having to create another dataframe in memory.
I have a DF with 3 columns. ['DateCreated', 'InvoiceNumber', 'InvoiceAmount']
I am trying to SUM the invoice amount during certain date ranges.
I have this working, but I want to do it without having to create a DF then sum the column. Any help is appreciated.
yesterday_sales_df = df[(df['DateCreated'] > yesterday_date) & (df['DateCreated'] < tomorrow_date)]
yesterday_sales_total = yesterday_sales_df['InvoiceAmount'].sum()
print(yesterday_sales_total)
Thanks
You can do it in one line with loc:
yesterday_sales_total = df.loc[(df['DateCreated'] > yesterday_date) & (df['DateCreated'] < tomorrow_date), 'InvoiceAmount'].sum()
You can use this as well
# filter df with query
yesterday_sales_total = df.query("@yesterday_date < DateCreated < @tomorrow_date")['InvoiceAmount'].sum()
try between:
sales_total = df[df['DateCreated'].between(yesterday_date, tomorrow_date)]['InvoiceAmount'].sum()
If necessary, set the inclusive argument (inclusive='both' by default; use inclusive='neither' to match the strict > / < comparisons above)
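To see that the approaches agree, here is a quick sketch with made-up invoice data (note that between is inclusive on both ends by default, so inclusive='neither' is passed to match the strict comparisons):

```python
import pandas as pd

df = pd.DataFrame({
    'DateCreated': pd.to_datetime(['2022-01-01', '2022-01-02', '2022-01-03']),
    'InvoiceAmount': [100.0, 250.0, 75.0],
})
yesterday_date = pd.Timestamp('2022-01-01')
tomorrow_date = pd.Timestamp('2022-01-03')

# loc with a boolean mask, summing only the selected column
total_loc = df.loc[(df['DateCreated'] > yesterday_date)
                   & (df['DateCreated'] < tomorrow_date), 'InvoiceAmount'].sum()

# between with inclusive='neither' reproduces the strict > / < logic
total_between = df[df['DateCreated'].between(
    yesterday_date, tomorrow_date, inclusive='neither')]['InvoiceAmount'].sum()

print(total_loc, total_between)  # 250.0 250.0
```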
I have this df:
id started completed
1 2022-02-20 15:00:10.157 2022-02-20 15:05:10.044
and I have this other one data:
timestamp x y
2022-02-20 14:59:47.329 16 0.0
2022-02-20 15:01:10.347 16 0.2
2022-02-20 15:06:35.362 16 0.3
what I wanna do is filter the rows in data where timestamp > started and timestamp < completed (which will leave me with the middle row only)
I tried to do it like this:
res = data[(data['timestamp'] > '2022-02-20 15:00:10.157')]
res = res[(res['timestamp'] < '2022-02-20 15:05:10.044')]
and it works.
But when I wanted to combine the two like this:
res = data[(data['timestamp'] > df['started']) and (data['timestamp'] < df['completed'])]
I get ValueError: Can only compare identically-labeled Series objects
Can anyone please explain why, and where I am making the mistake? Do I have to convert df['started'] to a string or something?
You have two issues here.
The first is the use of and. If you want to combine multiple masks (boolean arrays) element-wise with "and" logic, you want to use & instead of and.
The second is comparing against df['started'] and df['completed'] directly. If you use a debugger, you can see that df['started'] is a Series with its own index, and the same goes for data['timestamp']. The rules for comparing two Series are described in the pandas documentation: essentially, you can only compare Series that share the same index. But here df has only one row while data has several, so the comparison fails. Extract the values from df as scalars instead, using loc for instance.
For instance :
Using masks
import numpy as np
import pandas as pd
from string import ascii_lowercase

n = 10
np.random.seed(0)
df = pd.DataFrame(
    {
        "x": np.random.choice(np.array([*ascii_lowercase]), size=n),
        "y": np.random.normal(size=n),
    }
)
df2 = pd.DataFrame(
    {
        "max_val": [0],
        "min_val": [-0.5]
    }
)
df[(df.y < df2.loc[0, 'max_val']) & (df.y > df2.loc[0, 'min_val'])]
Out[95]:
x y
2 v -0.055035
3 a -0.107310
5 d -0.097696
7 j -0.453056
8 t -0.470771
Using query
df2 = pd.DataFrame(
    {
        "max_val": np.repeat(0, n),
        "min_val": np.repeat(-0.5, n)
    }
)
df.query("y < @df2.max_val and y > @df2.min_val")
Out[124]:
x y
2 v -0.055035
3 a -0.107310
5 d -0.097696
7 j -0.453056
8 t -0.470771
To make the comparison, pandas needs the same row count in both DataFrames: the first row of the data['timestamp'] Series is compared with the first row of the df['started'] Series, and so on.
The error occurs because the second row of the data['timestamp'] Series has nothing to compare with.
To make the code work, you can add a row in df for every row of data to match against. That way, pandas returns a boolean result for every row, and you can combine the two masks to keep the results that are both True.
Pandas doesn't accept Python's and operator here, so you need to use & instead, and your code will look like this:
data[(data['timestamp'] > df['started']) & (data['timestamp'] < df['completed'])]
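Since df in the question has a single row, the simplest fix is to pull started and completed out as scalars, so each timestamp is compared against one value rather than a mis-aligned Series. A sketch using the question's own data:

```python
import pandas as pd

df = pd.DataFrame({'id': [1],
                   'started': pd.to_datetime(['2022-02-20 15:00:10.157']),
                   'completed': pd.to_datetime(['2022-02-20 15:05:10.044'])})
data = pd.DataFrame({
    'timestamp': pd.to_datetime(['2022-02-20 14:59:47.329',
                                 '2022-02-20 15:01:10.347',
                                 '2022-02-20 15:06:35.362']),
    'x': [16, 16, 16],
    'y': [0.0, 0.2, 0.3],
})

# Extract scalars from the one-row frame; comparing a Series with a
# scalar broadcasts, so no index alignment is involved.
start = df.loc[0, 'started']
end = df.loc[0, 'completed']
res = data[(data['timestamp'] > start) & (data['timestamp'] < end)]

print(res['y'].tolist())  # [0.2] -- the middle row only
```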
This question already has answers here:
Add/fill pandas column based on range in rows from another dataframe
(3 answers)
Closed 3 years ago.
I have two dataframes, a and b
b has a datetime index, while a has a Start and End datetime columns
I need to 'Label' to True, all the rows of b whose indexes fall within any [Start,End] intervals from a
Right now I doing:
for _, r in a.iterrows():
    b.loc[np.logical_and(b.index >= r.Start,
                         b.index <= r.End), 'Label'] = True
but this is extremely slow when b is large.
How to optimize the provided code snippet?
MVCE:
b=pd.DataFrame(index=[pd.Timestamp('2017-01-01'),pd.Timestamp('2018-01-01')],columns=['Label'])
a=pd.DataFrame.from_dict([{'Start':pd.Timestamp('2018-01-01'),'End':pd.Timestamp('2020-01-01')}])
EDIT:
the solution at
Add/fill pandas column based on range in rows from another dataframe
does not work for me (they use range to fill the intervals, while we are working with datetimes)
Here's one solution using apply -
Dummy CSV data
Date,Start,End
01-08-2019,01-02-2019, 01-10-2019
01-08-2019,01-02-2020, 01-10-2020
Code
df = pd.read_csv('dummy.csv').apply(pd.to_datetime)
df.apply(lambda x: x['Start'] < x['Date'] and x['End'] > x['Date'], axis=1)
Result
0 True
1 False
dtype: bool
How about doing something like this?
def func(date):  # called with each value of b.index
    # True if `date` falls inside any [Start, End] interval of `a`
    mask = (a['Start'] <= date) & (a['End'] >= date)
    return mask.any()

b['Label'] = b.index.to_series().apply(func)
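Another option (my addition, not taken from the answers above) is to drop the Python-level loop entirely and compare every index of b against every interval of a with NumPy broadcasting. This trades memory (a len(b) × len(a) boolean matrix) for speed, using the question's MVCE data:

```python
import numpy as np
import pandas as pd

b = pd.DataFrame(index=[pd.Timestamp('2017-01-01'), pd.Timestamp('2018-01-01')],
                 columns=['Label'])
a = pd.DataFrame([{'Start': pd.Timestamp('2018-01-01'),
                   'End': pd.Timestamp('2020-01-01')}])

# Shape (len(b), 1) vs shape (len(a),): broadcasting yields a
# (len(b), len(a)) boolean matrix; any(axis=1) collapses it per row of b.
ts = b.index.values[:, None]
inside = (ts >= a['Start'].values) & (ts <= a['End'].values)
b['Label'] = inside.any(axis=1)

print(b['Label'].tolist())  # [False, True]
```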
I have a question regarding combining functions.
My purpose is to apply two conditions at the same time. Basically, I want to cut the extreme values from my dataset by dropping everything below the 5% quantile at the low end and everything above the 95% quantile at the top end.
df = df[df.temperature >= df.temperature.quantile(.05)]
gets me values that are above the 5% quantile
df = df[df.temperature <= df.temperature.quantile(.95)]
gets me all the values that are below the 95% quantile.
My current problem is that
df = df[df.temperature >= df.temperature.quantile(.05)]
df = df[df.temperature <= df.temperature.quantile(.95)]
works, but it's not precise, because the second filter computes its 95% quantile on the already-cut data. So how can I cut both at once?
df = df[df.temperature >= df.temperature.quantile(.05) & <= df.temperature.quantile(.95)]
does not work.
Thanks for support!
Solved:
df = df[(df.temperature >= df.temperature.quantile(.05)) & (df.temperature <= (df.temperature.quantile(.95)))]
You need parentheses around the conditions due to operator precedence:
df = df[(df.temperature >= df.temperature.quantile(.05)) & (df.temperature <= df.temperature.quantile(.95))]
The docs show that >= has lower precedence than &, hence the parentheses; without them, your code raises an error instead of silently doing the wrong thing.
code style wise it is more readable to have your conditions as variables so I would rewrite it to this:
low_limit = df.temperature >= df.temperature.quantile(.05)
upper_limit = df.temperature <= df.temperature.quantile(.95)
then your filtering becomes:
df[low_limit & upper_limit]
Since each condition is already stored in a variable as a boolean Series, no extra parentheses are needed in the filter expression.
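As a further option (my addition, not from the answer above), Series.between expresses the same two-sided condition as a single call, inclusive on both ends by default:

```python
import pandas as pd

df = pd.DataFrame({'temperature': list(range(1, 101))})  # values 1..100

low = df.temperature.quantile(.05)    # 5.95 with linear interpolation
high = df.temperature.quantile(.95)   # 95.05

# between(low, high) keeps rows where low <= temperature <= high,
# and both quantiles are computed on the original, uncut data.
trimmed = df[df.temperature.between(low, high)]

print(len(trimmed))  # 90 -- the values 6..95 survive
```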
so I have a large pandas DataFrame that contains about two months of information with a line of info per second. Way too much information to deal with at once, so I want to grab specific timeframes. The following code will grab everything before February 5th 2012:
sunflower[sunflower['time'] < '2012-02-05']
I want to do the equivalent of this:
sunflower['2012-02-01' < sunflower['time'] < '2012-02-05']
but that is not allowed. Now I could do this with these two lines:
step1 = sunflower[sunflower['time'] < '2012-02-05']
data = step1[step1['time'] > '2012-02-01']
but I have to do this with 20 different DataFrames and a multitude of times and being able to do this easily would be nice. I know pandas is capable of this because if my dates were the index rather than a column, it's easy to do, but they can't be the index because dates are repeated and therefore you receive this error:
Exception: Reindexing only valid with uniquely valued Index objects
So how would I go about doing this?
You could define a mask separately:
df = pd.DataFrame({'a': np.random.randn(100), 'b': np.random.randn(100)})
mask = (df.b > -.5) & (df.b < .5)
df_masked = df[mask]
Or in one line:
df_masked = df[(df.b > -.5) & (df.b < .5)]
You can use query for a more concise option:
df.query("'2012-02-01' < time < '2012-02-05'")
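A quick sketch of the query version on made-up data (note that the date literals inside the expression are quoted strings, and chained comparisons are allowed inside query strings even though they are not allowed on Series directly):

```python
import pandas as pd

sunflower = pd.DataFrame({
    'time': pd.to_datetime(['2012-01-30', '2012-02-02', '2012-02-04', '2012-02-06']),
    'value': [1, 2, 3, 4],
})

# pandas coerces the quoted date strings for comparison with the
# datetime64 'time' column.
data = sunflower.query("'2012-02-01' < time < '2012-02-05'")

print(data['value'].tolist())  # [2, 3]
```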