Using a value's method in the condition of numpy.where - Python

let's say you have this data frame:
import numpy as np
import pandas as pd

df = pd.DataFrame(data=['2014-04-07 10:55:35.087000+00:00',
                        '2014-04-07 13:59:37.251500+00:00',
                        '2014-04-02 13:23:59.629000+00:00',
                        '2014-04-07 12:17:48.182000+00:00',
                        '2014-04-06 17:00:23.912000+00:00'],
                  columns=['timestamp'],
                  dtype=np.datetime64)
and you want to create a new column whose values are 1 if the timestamp falls on a weekday and 0 otherwise. I would run something like this:
df['weekday'] = df['timestamp'].apply(lambda x: 1 if x.weekday() < 5 else 0 )
So far so good. However, in my case I have about 10 million rows of such timestamp values and it just takes forever to run. So, I looked around for vectorization options and I found numpy.where(). But, of course, this does not work: np.where(df['timestamp'].weekday() < 5, 1, 0)
So, is there a way to access the .weekday() method of the timestamps when using numpy.where or is there any other way to produce the weekday column when having 10 million rows? Thanks.

Use Series.dt.dayofweek / Series.dt.weekday with Series.lt and Series.astype:
df['weekday'] = df['timestamp'].dt.dayofweek.lt(5).astype(int)
print(df)
timestamp weekday
0 2014-04-07 10:55:35.087000 1
1 2014-04-07 13:59:37.251500 1
2 2014-04-02 13:23:59.629000 1
3 2014-04-07 12:17:48.182000 1
4 2014-04-06 17:00:23.912000 0
I recommend you read: When should I (not) want to use pandas apply() in my code?
We could also use np.where:
df['weekday'] = np.where(df['timestamp'].dt.dayofweek.lt(5), 1, 0)
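For reference, here is a minimal self-contained sketch (using two of the sample timestamps from the question) that shows the Monday=0 through Sunday=6 mapping the < 5 test relies on:
import pandas as pd

df = pd.DataFrame({'timestamp': pd.to_datetime(['2014-04-07 10:55:35.087',
                                                '2014-04-06 17:00:23.912'])})
# Monday -> 0, ..., Saturday -> 5, Sunday -> 6, so dayofweek < 5 means weekday
print(df['timestamp'].dt.dayofweek.tolist())   # [0, 6]
df['weekday'] = df['timestamp'].dt.dayofweek.lt(5).astype(int)
print(df['weekday'].tolist())                  # [1, 0]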

Related

How to check the dates in different columns?

I have a dataframe "expeditions" where there are 3 columns ("basecamp_date", "highpoint_date" and "termination_date"). I would like to check that the basecamp date is before the highpoint date and before the termination date because I noticed that there are rows where this is not the case (see picture)
Do you have any idea what I should do (a loop, a new dataframe...?)
Code
import pandas as pd
expeditions = pd.read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-09-22/expeditions.csv")
I would start by converting the three date columns to datetime, to be extra sure (df below refers to the expeditions frame from the question; looping over the date columns only avoids trying to parse the text columns):
df = expeditions
for x in ['basecamp_date', 'highpoint_date', 'termination_date']:
    df[x] = pd.to_datetime(df[x], infer_datetime_format=True)
And then follow it with the comparison using np.where():
df['Check'] = np.where((df['basecamp_date'] < df['highpoint_date']) & (df['basecamp_date'] < df['termination_date']),True,False)
EDIT: Based on OP's follow-up question, I propose the following solution:
filtered = df[((df['basecamp_date'] < df['highpoint_date']) & (df['basecamp_date'] < df['termination_date'])) | (df.isna().sum(axis=1) != 0)]
Example:
basecamp_date highpoint_date termination_date
0 2008-01-04 2008-05-01 2008-04-05
1 NaN 2008-05-03 2008-06-03
2 2008-01-04 2008-01-01 2009-01-01
Rows 0 and 1 are kept: row 0 satisfies the date conditions, row 1 is kept because it has a null value, and row 2 is dropped because it fails the date conditions. Using the proposed code, the output is:
basecamp_date highpoint_date termination_date
0 2008-01-04 2008-05-01 2008-04-05
1 NaT 2008-05-03 2008-06-03
You should convert your data to datetime format:
df['date_col'] = pd.to_datetime(df['date_col'])
and then do like this:
df[df['date_col1'] < df['date_col2']]
In your case, replace date_col with your actual column names.
All the other answers are working solutions but I find this much easier:
df.query("basecamp_date < highpoint_date and basecamp_date
< termination_date")
Use Series.lt to compare the columns and chain the masks with & for bitwise AND:
m = (df['basecamp_date'].lt(df['highpoint_date']) &
     df['basecamp_date'].lt(df['termination_date']) &
     df[['basecamp_date', 'termination_date', 'highpoint_date']].notna().all(axis=1))
If you need to check the matched values:
df1 = df[m]
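A runnable sketch of that mask, rebuilding the three example rows shown in the EDIT above (only row 0 passes, because row 1 has a missing basecamp_date and row 2 has basecamp_date after highpoint_date):
import pandas as pd

df = pd.DataFrame({'basecamp_date': ['2008-01-04', None, '2008-01-04'],
                   'highpoint_date': ['2008-05-01', '2008-05-03', '2008-01-01'],
                   'termination_date': ['2008-04-05', '2008-06-03', '2009-01-01']})
df = df.apply(pd.to_datetime)

m = (df['basecamp_date'].lt(df['highpoint_date']) &
     df['basecamp_date'].lt(df['termination_date']) &
     df[['basecamp_date', 'termination_date', 'highpoint_date']].notna().all(axis=1))
print(df[m])   # only row 0 remains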

Pandas: creating values in a column based on the previous value in that column

Quick example:
Before:
In Out
1 5
10 0
2 3
After:
In Out Value
1 5 -4
10 0 6
2 3 5
So the formula here is Value(row x) = Value(row x - 1) + In(row x) - Out(row x).
I started by adding a Value column where each cell is 0. I then looked at shift(), but that uses the value the previous row had at the start of the operation, so it will always use 0 as the previous Value. Is there a way of doing this without using something like iterrows() or a for loop?
It seems that in your calculation you could first compute In - Out and then take a cumulative sum with cumsum():
import pandas as pd
data = {
    'In': [1, 10, 2],
    'Out': [5, 0, 3]
}
df = pd.DataFrame(data)
df['Value'] = df['In'] - df['Out']
df['Value'] = df['Value'].cumsum()
print(df)
or even shorter
df['Value'] = (df['In'] - df['Out']).cumsum()
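Running either version reproduces the Value column from the After table in the question:
   In  Out  Value
0   1    5     -4
1  10    0      6
2   2    3      5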

Lambda Operations in Pandas

I have two columns, hours and minutes, stored separately in columns a and b.
I want to calculate the sum of both in terms of minutes (DURATION).
To convert the hours into minutes I have used the following:
train['duration'] = train.a.apply(lambda x: x * 60)
Now I want to add the minutes to the newly created duration column,
so that my final value is duration = (a * 60) + b.
I am unable to perform this operation using a lambda, and a for loop takes forever to execute in pandas.
You can do it using lambda as follows.
import pandas as pd
df = pd.DataFrame({'x': [1, 2, 3], 'y': [4, 5, 6]})
df["sum"] = df.apply(lambda row: row.x + row.y, axis=1)
print(df)
It will give the following output:
x y sum
0 1 4 5
1 2 5 7
2 3 6 9
Hope it helps. You can also do it using a list comprehension, which is fast as well, or do it as @Karl Olufsen suggested; a quick sketch of the list-comprehension variant follows below.
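A minimal sketch of that list-comprehension variant, rebuilding the same small frame so it runs on its own:
import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3], 'y': [4, 5, 6]})
# zip the two columns and add element by element; avoids building a Series per row like apply(axis=1) does
df["sum"] = [x + y for x, y in zip(df["x"], df["y"])]
print(df["sum"].tolist())   # [5, 7, 9]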
Pandas is a very powerful module with neat features. If you simply want to add the values of one column to the values of another, element by element, you can use the + operator on the columns.
Here is my example:
import pandas as pd
df = pd.DataFrame({"col1":[1,2,3,4], "col2":[10,20,30,40]})
df["sum of 2 columns"] = df["col1"] + df["col2"]
print(df)
And this is the output:
col1 col2 sum of 2 columns
0 1 10 11
1 2 20 22
2 3 30 33
3 4 40 44
Best would be to use vectorized code, which will be faster than apply:
df['duration_in_min'] = df['hour'] * 60 + df['min']
Typical execution time of the various approaches, from fastest to slowest, is:
Cython procedures < Vectorized code < apply < itertuples < iterrows
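Applied to the column names from the question (a = hours, b = minutes), a self-contained sketch of the vectorized version could look like this (the sample values are made up):
import pandas as pd

train = pd.DataFrame({'a': [1, 2, 0], 'b': [30, 15, 45]})

# whole-column arithmetic, no apply or loop needed
train['duration'] = train['a'] * 60 + train['b']
print(train['duration'].tolist())   # [90, 135, 45]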

Summing up values from one column based on values in other column

I have a dataframe something like below,
Timestamp count
20180702-06:26:20 50
20180702-06:27:11 10
20180702-07:05:10 20
20180702-07:10:10 30
20180702-08:27:11 40
I want output something like below,
Timestamp Sum_of_count
20180702-06 60
20180702-07 50
20180702-08 40
Basically, I need to find sum of count for every hour.
Any help is really appreciated.
You need to separate the hour part some way - one option is to split on ':' and select the first part with str[0], and then aggregate with sum:
s = df['Timestamp'].str.split(':', n=1).str[0]
df1 = df['count'].groupby(s).sum().reset_index(name='Sum_of_count')
Or convert the values to datetimes with to_datetime and format the hour back out with strftime:
df['Timestamp'] = pd.to_datetime(df['Timestamp'], format='%Y%m%d-%H:%M:%S')
s = df['Timestamp'].dt.strftime('%Y%m%d-%H')
df1 = df['count'].groupby(s).sum().reset_index(name='Sum_of_count')
print (df1)
Timestamp Sum_of_count
0 20180702-06 60
1 20180702-07 50
2 20180702-08 40
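For anyone who wants to run the second option end to end, here is a self-contained version that rebuilds the sample data from the question:
import pandas as pd

df = pd.DataFrame({'Timestamp': ['20180702-06:26:20', '20180702-06:27:11',
                                 '20180702-07:05:10', '20180702-07:10:10',
                                 '20180702-08:27:11'],
                   'count': [50, 10, 20, 30, 40]})

df['Timestamp'] = pd.to_datetime(df['Timestamp'], format='%Y%m%d-%H:%M:%S')
s = df['Timestamp'].dt.strftime('%Y%m%d-%H')
df1 = df['count'].groupby(s).sum().reset_index(name='Sum_of_count')
print(df1)   # 20180702-06 -> 60, 20180702-07 -> 50, 20180702-08 -> 40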
Use
In [252]: df.groupby(df.Timestamp.dt.strftime('%Y-%m-%d-%H'))['count'].sum()
Out[252]:
Timestamp
2018-07-02-06 60
2018-07-02-07 50
2018-07-02-08 40
Name: count, dtype: int64
In [254]: (df.groupby(df.Timestamp.dt.strftime('%Y-%m-%d-%H'))['count'].sum()
.reset_index(name='Sum_of_count'))
Out[254]:
Timestamp Sum_of_count
0 2018-07-02-06 60
1 2018-07-02-07 50
2 2018-07-02-08 40

Drop rows if value in a specific column is not an integer in pandas dataframe

If I have a dataframe and want to drop any rows where the value in one column is not an integer how would I do this?
The alternative is to drop rows if the value is not within the range 0-2, but since I am not sure how to do either of them I was hoping someone else might.
Here is what I tried, but it didn't work and I'm not sure why:
df = df[(df['entrytype'] != 0) | (df['entrytype'] !=1) | (df['entrytype'] != 2)].all(1)
There are 2 approaches I propose:
In [212]:
df = pd.DataFrame({'entrytype':[0,1,np.NaN, 'asdas',2]})
df
Out[212]:
entrytype
0 0
1 1
2 NaN
3 asdas
4 2
If the range of values is as restricted as you say then using isin will be the fastest method:
In [216]:
df[df['entrytype'].isin([0,1,2])]
Out[216]:
entrytype
0 0
1 1
4 2
Otherwise we could cast to a str and then call .isdigit()
In [215]:
df[df['entrytype'].apply(lambda x: str(x).isdigit())]
Out[215]:
entrytype
0 0
1 1
4 2
str("-1").isdigit() is False
str("-1").lstrip("-").isdigit() works but is not nice.
df.loc[df['Feature'].str.match(r'^[+-]?\d+$')]
and, for your question, the reverse set:
df.loc[~(df['Feature'].str.match(r'^[+-]?\d+$'))]
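A runnable sketch of the regex approach with a made-up Feature column (the astype(str) cast is an addition here so .str.match also works when the column holds mixed types):
import pandas as pd

df = pd.DataFrame({'Feature': [0, 1, None, 'asdas', 2, '-1']})
mask = df['Feature'].astype(str).str.match(r'^[+-]?\d+$', na=False)
print(df.loc[mask])    # rows whose value looks like an integer: 0, 1, 2, -1
print(df.loc[~mask])   # everything else: None and 'asdas'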
There are multiple ways to do the same thing, but I find this method easy and efficient.
Quick examples:
# Using drop() to delete rows based on a column value
df.drop(df[df['Fee'] >= 24000].index, inplace=True)
# Boolean filtering: keep only the rows where Fee >= 24000
df2 = df[df.Fee >= 24000]
# If the column name contains a space, refer to it in single quotes inside brackets
df2 = df[df['column name'] >= 24000]
# Using loc
df2 = df.loc[df["Fee"] >= 24000]
# Filter rows based on multiple column values
df2 = df[(df['Fee'] >= 22000) & (df['Discount'] == 2300)]
# Drop rows with None/NaN in Discount
df2 = df[df.Discount.notnull()]
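A small runnable sketch with made-up Fee/Discount data, to show what the drop() variant above actually does:
import pandas as pd

df = pd.DataFrame({'Fee': [20000, 22000, 24000, 25000],
                   'Discount': [1000, 2300, 2500, 1200]})

# drop() removes the rows whose index matches the boolean filter
df.drop(df[df['Fee'] >= 24000].index, inplace=True)
print(df)   # only the rows with Fee < 24000 remain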
