I have a large pandas DataFrame that contains about two months of data, with one row per second. That is far too much to deal with at once, so I want to grab specific timeframes. The following code grabs everything before February 5th, 2012:
sunflower[sunflower['time'] < '2012-02-05']
I want to do the equivalent of this:
sunflower['2012-02-01' < sunflower['time'] < '2012-02-05']
but that is not allowed. Now I could do this with these two lines:
step1 = sunflower[sunflower['time'] < '2012-02-05']
data = step1[step1['time'] > '2012-02-01']
but I have to do this with 20 different DataFrames and a multitude of times and being able to do this easily would be nice. I know pandas is capable of this because if my dates were the index rather than a column, it's easy to do, but they can't be the index because dates are repeated and therefore you receive this error:
Exception: Reindexing only valid with uniquely valued Index objects
So how would I go about doing this?
You could define a mask separately:
import numpy as np
import pandas as pd
df = pd.DataFrame({'a': np.random.randn(100), 'b': np.random.randn(100)})
mask = (df.b > -.5) & (df.b < .5)
df_masked = df[mask]
Or in one line:
df_masked = df[(df.b > -.5) & (df.b < .5)]
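Since the question mentions repeating this for 20 DataFrames, the mask can be wrapped in a small helper. The following is only a sketch: the function name select_time_range, its column parameter, and the sample data are made up for illustration, and it assumes the time column holds datetime values.

```python
import pandas as pd

def select_time_range(df, start, end, column="time"):
    """Return the rows where start < df[column] < end (both bounds exclusive)."""
    mask = (df[column] > start) & (df[column] < end)
    return df[mask]

# Hypothetical sample data; note the repeated timestamp, which is fine
# because the dates stay in a regular column rather than the index.
sunflower = pd.DataFrame({
    "time": pd.to_datetime([
        "2012-01-31", "2012-02-01", "2012-02-02",
        "2012-02-02", "2012-02-04", "2012-02-05",
    ]),
    "value": [1, 2, 3, 4, 5, 6],
})

data = select_time_range(sunflower, "2012-02-01", "2012-02-05")
print(data)
```

The same function can then be called on each DataFrame with whatever start and end values are needed.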
You can use query for a more concise option:
df.query("'2012-02-01' < time < '2012-02-05'")
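If the bounds live in Python variables rather than in the query string, query can reference them with an @ prefix. A minimal sketch, assuming a datetime column named time; the variable names start and end and the sample rows are invented for the example:

```python
import pandas as pd

# Hypothetical sample data
df = pd.DataFrame({
    "time": pd.to_datetime(["2012-01-31", "2012-02-02", "2012-02-03", "2012-02-06"]),
    "value": [10, 20, 30, 40],
})

start = pd.Timestamp("2012-02-01")
end = pd.Timestamp("2012-02-05")

# @name refers to a local Python variable; bare names refer to columns
subset = df.query("@start < time < @end")
print(subset)
```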
I am trying to figure out a way to do this sum in one line, or without having to create another dataframe in memory.
I have a DF with 3 columns. ['DateCreated', 'InvoiceNumber', 'InvoiceAmount']
I am trying to SUM the invoice amount during certain date ranges.
I have this working, but I want to do it without having to create a DF then sum the column. Any help is appreciated.
yesterday_sales_df = df[(df['DateCreated'] > yesterday_date) & (df['DateCreated'] < tomorrow_date)]
yesterday_sales_total = yesterday_sales_df['InvoiceAmount'].sum()
print(yesterday_sales_total)
Thanks
You can try with loc
yesterday_sales_total = df.loc[(df['DateCreated'] > yesterday_date) & (df['DateCreated'] < tomorrow_date), 'InvoiceAmount'].sum()
You can use this as well
# filter df with query
yesterday_sales_total = df.query("@yesterday_date < DateCreated < @tomorrow_date")['InvoiceAmount'].sum()
try between:
sales_total = df[df['DateCreated'].between(yesterday_date, tomorrow_date)]['InvoiceAmount'].sum()
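If necessary, set the inclusive argument (inclusive='both' by default). Note that the default includes both endpoints, unlike the strict > and < comparisons in the question.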
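To illustrate how the inclusive argument changes the total, here is a small sketch with invented invoice data. It assumes pandas 1.3 or later, where inclusive accepts the strings 'both', 'neither', 'left' and 'right'.

```python
import pandas as pd

# Hypothetical invoices; the dates and amounts are made up
df = pd.DataFrame({
    "DateCreated": pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-03"]),
    "InvoiceAmount": [100.0, 250.0, 40.0],
})
yesterday_date = pd.Timestamp("2024-01-01")
tomorrow_date = pd.Timestamp("2024-01-03")

# Default: both endpoints are included
in_range_inclusive = df["DateCreated"].between(yesterday_date, tomorrow_date)
# Strict bounds, equivalent to (col > lower) & (col < upper)
in_range_strict = df["DateCreated"].between(
    yesterday_date, tomorrow_date, inclusive="neither"
)

total_inclusive = df.loc[in_range_inclusive, "InvoiceAmount"].sum()
total_strict = df.loc[in_range_strict, "InvoiceAmount"].sum()
print(total_inclusive, total_strict)
```

With this data the inclusive total covers all three invoices, while the strict total covers only the middle one.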
I have a dataframe in which I'm trying to create a binary 1/0 column when certain conditions are met. The code I'm using is as follows:
sd_threshold = 5
df1["signal"] = np.where(np.logical_and(df1["high"] >= df1["break"], df1["low"]
<= df1["break"], df1["sd_round"] > sd_threshold), 1, 0)
The code returns TypeError: return arrays must be of ArrayType when the last condition df1["sd_round"] > sd_threshold is included, otherwise it works fine. There isn't any issue with the data in the df1["sd_round"] column.
Any insight would be much appreciated, thank you!
Check the documentation: np.logical_and() takes only two input arrays, and a third positional argument is treated as the output array that the result is written to. You could use a nested call, but I would just go with & (pandas boolean indexing):
df1["signal"] = np.where((df1["high"] >= df1["break"]) &
(df1["low"] <= df1["break"]) &
(df1["sd_round"] > sd_threshold),
1, 0)
EDIT: you could actually just skip numpy and cast your boolean Series to int to yield 1s and 0s:
mask = ((df1["high"] >= df1["break"]) &
(df1["low"] <= df1["break"]) &
(df1["sd_round"] > sd_threshold))
df1["signal"] = mask.astype(int)
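If you prefer to stay with NumPy, one alternative to nesting np.logical_and calls is np.logical_and.reduce, which combines any number of conditions. This is a sketch with made-up values for the high, low, break and sd_round columns, not the asker's data:

```python
import numpy as np
import pandas as pd

# Hypothetical data using the column names from the question
df1 = pd.DataFrame({
    "high": [10, 12, 9],
    "low": [8, 11, 7],
    "break": [9, 10, 8],
    "sd_round": [6, 7, 3],
})
sd_threshold = 5

conditions = [
    df1["high"] >= df1["break"],
    df1["low"] <= df1["break"],
    df1["sd_round"] > sd_threshold,
]

# reduce ANDs the conditions together element-wise, however many there are
df1["signal"] = np.logical_and.reduce(conditions).astype(int)
print(df1)
```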
I'd like to create a new column to a Pandas dataframe populated with True or False based on the other values in each specific row. My approach to solve this task was to apply a function checking boolean conditions across each row in the dataframe and populate the new column with either True or False.
This is the dataframe:
l={'DayTime':['2018-03-01','2018-03-02','2018-03-03'],'Pressure':
[9,10.5,10.5], 'Feed':[9,10.5,11], 'Temp':[9,10.5,11]}
df1=pd.DataFrame(l)
This is the function I wrote:
def ops_on(row):
    return row[('Feed' > 10)
               & ('Pressure' > 10)
               & ('Temp' > 10)
               ]
The function ops_on is used to create the new column ['ops_on']:
df1['ops_on'] = df1.apply(ops_on, axis='columns')
Unfortunately, I get this error message:
TypeError: ("'>' not supported between instances of 'str' and 'int'", 'occurred at index 0')
Thankful for help.
You should work column-wise (vectorised, efficient) rather than row-wise (inefficient, Python loop):
df1['ops_on'] = (df1['Feed'] > 10) & (df1['Pressure'] > 10) & (df1['Temp'] > 10)
The & ("and") operator is applied to Boolean series element-wise. An arbitrary number of such conditions can be chained.
Alternatively, for the special case where you are performing the same comparison multiple times:
df1['ops_on'] = df1[['Feed', 'Pressure', 'Temp']].gt(10).all(axis=1)
In your current setup, just re-write your function like this:
def ops_on(row):
    return (row['Feed'] > 10) & (row['Pressure'] > 10) & (row['Temp'] > 10)
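As a quick check, the three approaches can be run side by side on the DataFrame from the question. The variable names vectorised, same_comparison and row_wise are only for this sketch:

```python
import pandas as pd

df1 = pd.DataFrame({
    "DayTime": ["2018-03-01", "2018-03-02", "2018-03-03"],
    "Pressure": [9, 10.5, 10.5],
    "Feed": [9, 10.5, 11],
    "Temp": [9, 10.5, 11],
})

# Column-wise: one comparison per column, combined element-wise with &
vectorised = (df1["Feed"] > 10) & (df1["Pressure"] > 10) & (df1["Temp"] > 10)

# Same threshold on several columns: compare them all, then require every one
same_comparison = df1[["Feed", "Pressure", "Temp"]].gt(10).all(axis=1)

# Row-wise: works, but runs a Python function once per row
def ops_on(row):
    return (row["Feed"] > 10) & (row["Pressure"] > 10) & (row["Temp"] > 10)

row_wise = df1.apply(ops_on, axis="columns")

df1["ops_on"] = vectorised
print(df1)
```

All three give False for the first row and True for the other two.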
My data frame looks as follows:
df_data = pd.read_csv("SKU12345.csv", index_col=0)
where the CSV I refer to has the following values:
SKU,Tag,Fall,Wert
0,12345,1,WE,1000
1,12345,1,ABV,10
2,12345,1,PRO,0
3,23456,2,WE,10000
I want to make an if-condition which reads as follows:
If 'Fall' == 'WE'
and if 'Wert' of this row > 100:
print('Wert' of row with 'Fall' == 'WE')
The outcome I wish to get is
1000
10000
Thank you so much!
Is this what you're looking for?
for v in df[(df['Wert'] > 100) & (df['Fall'] == 'WE')]['Wert'].values:
    print(v)
Essentially, you first filter the DataFrame to the rows where column Wert is greater than 100 and column Fall equals WE:
df[(df['Wert'] > 100) & (df['Fall'] == 'WE')]
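The & is required to combine several conditions element-wise. It cannot be replaced by and, which raises a ValueError because the truth value of a Series is ambiguous, or by &&, which is not valid Python syntax.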
and then directly access the values within the column Wert.
df[(df['Wert'] > 100) & (df['Fall'] == 'WE')]['Wert'].values
.values of a column returns the column's data as a numpy.array, so you can just loop over it like a list.
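Put together, a runnable version might look like the sketch below. It builds the DataFrame directly from the values in the question instead of reading SKU12345.csv, and uses .loc to filter rows and pick the column in one step; the name selected is just for illustration.

```python
import pandas as pd

# Same values as the CSV in the question, built in memory
df = pd.DataFrame({
    "SKU": [12345, 12345, 12345, 23456],
    "Tag": [1, 1, 1, 2],
    "Fall": ["WE", "ABV", "PRO", "WE"],
    "Wert": [1000, 10, 0, 10000],
})

# Row condition first, then the column to return
selected = df.loc[(df["Wert"] > 100) & (df["Fall"] == "WE"), "Wert"]

for v in selected:
    print(v)
```

This prints 1000 and 10000, matching the desired outcome.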
I have a pandas dataframe with a column that indicates which hour of the day a particular action was performed. So df['hour'] is many rows each with a value from 0 to 23.
I am trying to create dummy variables for things like 'is_morning', for example:
if df['hour'] >= 5 and < 12 then return 1, else return 0
A for loop doesn't work given the size of the data set, and I've tried some other stuff like
df['is_morning'] = df['hour'] >= 5 and < 12
Any suggestions??
You can just do:
df['is_morning'] = (df['hour'] >= 5) & (df['hour'] < 12)
i.e. wrap each condition in parentheses (needed because & binds more tightly than comparison operators such as >= and <), and use &, which is an element-wise and operation that works across the whole vector/column.
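Since the question asks for 1/0 dummy variables rather than True/False, the boolean result can be cast to int. A small sketch with invented hour values; the between variant assumes pandas 1.3 or later for inclusive="left":

```python
import pandas as pd

# Hypothetical hours of the day
df = pd.DataFrame({"hour": [0, 5, 11, 12, 23]})

# Boolean mask cast to 1/0
df["is_morning"] = ((df["hour"] >= 5) & (df["hour"] < 12)).astype(int)

# Equivalent using between: include the lower bound, exclude the upper
is_morning_between = df["hour"].between(5, 12, inclusive="left").astype(int)

print(df)
```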