I have a dataframe "expeditions" where there are 3 columns ("basecamp_date", "highpoint_date" and "termination_date"). I would like to check that the basecamp date is before the highpoint date and before the termination date because I noticed that there are rows where this is not the case (see picture)
Do you have any idea what I should do (a loop, a new dataframe...?)
Code
import pandas as pd
expeditions = pd.read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-09-22/expeditions.csv")
I would start by converting the date columns to datetime format to be extra sure:
date_cols = ['basecamp_date', 'highpoint_date', 'termination_date']
for col in date_cols:
    df[col] = pd.to_datetime(df[col], infer_datetime_format=True)
And then follow it with the comparison using np.where() (this assumes import numpy as np):
df['Check'] = np.where((df['basecamp_date'] < df['highpoint_date']) & (df['basecamp_date'] < df['termination_date']), True, False)
EDIT: Based on the OP's follow-up question, I propose the following solution:
filtered = df[((df['basecamp_date'] < df['highpoint_date']) & (df['basecamp_date'] < df['termination_date'])) | (df.isna().sum(axis=1) != 0)]
Example:
basecamp_date highpoint_date termination_date
0 2008-01-04 2008-05-01 2008-04-05
1 NaN 2008-05-03 2008-06-03
2 2008-01-04 2008-01-01 2009-01-01
Rows 0 and 1 are kept: row 0 satisfies the date conditions, and row 1 has a null value so it cannot be checked. Row 2 fails the conditions and is dropped. Using the proposed code, the output is:
basecamp_date highpoint_date termination_date
0 2008-01-04 2008-05-01 2008-04-05
1 NaT 2008-05-03 2008-06-03
You should convert your data to datetime format:
df['date_col'] = pd.to_datetime(df['date_col'])
and then filter like this:
df[df['date_col1'] < df['date_col2']]
In your case, replace date_col1 and date_col2 with your actual column names.
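For example, a minimal sketch applied to the expeditions columns (the date_cols/valid/bad_rows names here are just illustrative, not from the question):
import pandas as pd

expeditions = pd.read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-09-22/expeditions.csv")

# Parse the three date columns; unparseable values become NaT
date_cols = ['basecamp_date', 'highpoint_date', 'termination_date']
for col in date_cols:
    expeditions[col] = pd.to_datetime(expeditions[col], errors='coerce')

# True where basecamp comes before both the highpoint and the termination
valid = ((expeditions['basecamp_date'] < expeditions['highpoint_date']) &
         (expeditions['basecamp_date'] < expeditions['termination_date']))

# Rows that have all three dates but still fail the check
bad_rows = expeditions[expeditions[date_cols].notna().all(axis=1) & ~valid]
print(len(bad_rows))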
All the other answers are working solutions but I find this much easier:
df.query("basecamp_date < highpoint_date and basecamp_date
< termination_date")
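A possible follow-up (my addition, not part of the answer above): flip the comparisons to surface the offending rows instead of the valid ones. Rows with missing dates drop out of both queries, because comparisons against NaT evaluate to False.
bad = df.query("basecamp_date >= highpoint_date or basecamp_date >= termination_date")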
Use Series.lt to compare the columns and chain the masks with & for bitwise AND:
m = (df['basecamp_date'].lt(df['highpoint_date']) &
     df['basecamp_date'].lt(df['termination_date']) &
     df[['basecamp_date', 'termination_date', 'highpoint_date']].notna().all(axis=1))
If you need to check the matched values:
df1 = df[m]
Related
I have the following data in my DataFrame
I want to create another DataFrame by applying one of the filter ( Get all data which belongs to current month ).
To achieve this I am using the following code :
search = '2021-07'
month_dataframe=all_dataframe["date"].str.findall(str(search))
I think the filter is running properly, because when I print month_dataframe I can see the data, but it only returns the date column. I want all of the columns/data from all_dataframe.
Input df
Name date address
0 A 2021-07-03 X
1 B 2021-07-03 Y
2 C 2021-07-03 Z
3 D 2021-08-01 M
4 E 2021-08-01 N
5 F 2021-08-01 O
If date col is object/string type and we want to keep it as is
search = '2021-07'
month_dataframe = df[df["date"].str.contains(str(search))]
month_dataframe
If date col is object/str type and we are okay with converting it to datetime
The benefit in this case is that we don't need to define the variable search = '2021-07', and this solution will work for whichever month is current.
df['date'] = pd.to_datetime(df.date)
month_dataframe = df[df.date.dt.month == pd.Timestamp('today').month]
month_dataframe
If date col is datetime type
(Just an option, not the best way)
search = '2021-07'
month_dataframe = df[df["date"].astype(str).str.contains(str(search))]
month_dataframe
Output
Name date address
0 A 2021-07-03 X
1 B 2021-07-03 Y
2 C 2021-07-03 Z
Try changing the column data type and then use a filter.
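In case that is too terse, here is a minimal sketch of what it could look like, assuming the date column is first converted to datetime (the to_period('M') comparison is just one way to express "same month"):
import pandas as pd

df = pd.DataFrame({'Name': list('ABCDEF'),
                   'date': ['2021-07-03'] * 3 + ['2021-08-01'] * 3,
                   'address': list('XYZMNO')})

df['date'] = pd.to_datetime(df['date'])

# Keep rows from July 2021
month_dataframe = df[df['date'].dt.to_period('M') == pd.Period('2021-07', freq='M')]

# Or, for whatever month it is right now:
# month_dataframe = df[df['date'].dt.to_period('M') == pd.Timestamp('today').to_period('M')]
print(month_dataframe)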
I've got a dataframe that looks something like this:
user    current_date    prior_date    points_scored
1       2021-01-01      2020-10-01    5
2       2021-01-01      2020-10-01    4
2       2021-01-21      2020-10-21    4
2       2021-05-01      2021-02-01    4
The prior_date column is simply current_date minus 3 months, and points_scored is the number of points scored on current_date. I'd like to identify the rows where sum(points_scored) >= 8 for a given user, where the rows considered are that user's rows whose current_date falls between the row's prior_date and current_date. It is guaranteed that no single row has points_scored >= 8.
For example, in the example above, I'd like something like this returned:
user    current_date    prior_date    points_scored    flag
1       2021-01-01      2020-10-01    5                0
2       2021-01-01      2020-10-01    4                0
2       2021-01-21      2020-10-21    4                1
2       2021-05-01      2021-02-01    4                0
The third row shows flag=1 because for row 3's values of current_date=2021-01-21 and prior_date=2020-10-21, the rows to consider would be rows 2 and 3. We consider row 2 because row 2's current_date=2021-01-01 which is between row 3's current_date and prior_date.
Ultimately, I'd like to end up with a data structure where it shows distinct user and flag. It could be a dataframe or a dictionary-- anything easily referencable.
user    flag
1       0
2       1
To do this, I'm doing something like this:
flags = {}
ids = list(df['user'].value_counts()[df['user'].value_counts() > 2].index)
for id in ids:
    temp_df = df[df['user'] == id]
    for idx, row in temp_df.iterrows():
        cur_date = row['current_date']
        prior_date = row['prior_date']
        temp_total = temp_df[(temp_df['current_date'] <= cur_date) & (temp_df['current_date'] >= prior_date)]['points_scored'].sum()
        if temp_total >= 8:
            flags[id] = 1
            break
The code above works, but just takes way too long to actually execute.
You are right, performing loops on large data can be quite time-consuming. This is where the power of NumPy comes into full play. I am still not sure exactly what you want, but I can help address the speed.
numpy.select can perform your if/else logic efficiently.
import pandas as pd
import numpy as np
condition = [df['points_scored']==5, df['points_scored']==4, df['points_scored'] ==3] # <-- put your condition here
choices = ['okay', 'hmmm!', 'yes'] #<--what you want returned (the order is important)
np.select(condition,choices,default= 'default value')
Also, you might want to state more succinctly what you want. In the meantime, you can refactor your loops with np.select().
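If the slow part is the windowed sum itself rather than the if/else, here is one loop-free sketch (my own reading of the intended logic, using the column names from the question; not from the answer above): pair each row with every row of the same user via a self-merge, keep the pairs whose current_date falls inside the window, and sum.
import pandas as pd

df = pd.DataFrame({
    'user': [1, 2, 2, 2],
    'current_date': pd.to_datetime(['2021-01-01', '2021-01-01', '2021-01-21', '2021-05-01']),
    'prior_date': pd.to_datetime(['2020-10-01', '2020-10-01', '2020-10-21', '2021-02-01']),
    'points_scored': [5, 4, 4, 4],
})

# Pair every row with every row of the same user, then keep only the pairs
# whose partner current_date lies inside [prior_date, current_date]
pairs = df.merge(df, on='user', suffixes=('', '_other'))
in_window = pairs['current_date_other'].between(pairs['prior_date'], pairs['current_date'])

# Sum of points over each row's 3-month window
window_sum = (pairs[in_window]
              .groupby(['user', 'current_date'], as_index=False)['points_scored_other']
              .sum())

out = df.merge(window_sum, on=['user', 'current_date'], how='left')
out['flag'] = (out['points_scored_other'].fillna(0) >= 8).astype(int)

# Collapse to one flag per user, as requested
flags = out.groupby('user', as_index=False)['flag'].max()
print(flags)
Note that the self-merge grows with the square of each user's row count, so this trades memory for the speed of avoiding iterrows.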
Let's say you have this data frame:
import numpy as np
import pandas as pd

df = pd.DataFrame(data=['2014-04-07 10:55:35.087000+00:00',
                        '2014-04-07 13:59:37.251500+00:00',
                        '2014-04-02 13:23:59.629000+00:00',
                        '2014-04-07 12:17:48.182000+00:00',
                        '2014-04-06 17:00:23.912000+00:00'],
                  columns=['timestamp'],
                  dtype=np.datetime64)
and you want to create a new column where the values are 1 if the timestamp is a weekday or 0 if it is not. Then I would run something like this:
df['weekday'] = df['timestamp'].apply(lambda x: 1 if x.weekday() < 5 else 0 )
So far so good. However, in my case I have about 10 million rows of such timestamp values and it just takes forever to run. So, I looked around for vectorization options and I found numpy.where(). But, of course, this does not work: np.where(df['timestamp'].weekday() < 5, 1, 0)
So, is there a way to access the .weekday() method of the timestamps when using numpy.where or is there any other way to produce the weekday column when having 10 million rows? Thanks.
Use Series.dt.dayofweek / Series.dt.weekday with Series.lt and Series.astype:
df['weekday'] = df['timestamp'].dt.dayofweek.lt(5).astype(int)
print(df)
timestamp weekday
0 2014-04-07 10:55:35.087000 1
1 2014-04-07 13:59:37.251500 1
2 2014-04-02 13:23:59.629000 1
3 2014-04-07 12:17:48.182000 1
4 2014-04-06 17:00:23.912000 0
I recommend you see: when should I ever want to use apply in my code
We could also use np.where:
df['weekday'] = np.where(df['timestamp'].dt.dayofweek.lt(5), 1, 0)
I have a pandas dataframe, in which a column is a string formatted as
yyyymmdd
which should be a date. Is there an easy way to convert it to a recognizable form of date?
And then what python libraries should I use to handle them?
Let's say, for example, that I would like to consider all the events (rows) whose date field is a working day (so mon-fri). What is the smoothest way to handle such a task?
OK, so you want to select Monday-Friday. Do that by converting your column to datetime and checking whether dt.dayofweek is lower than 5 (Mon-Fri --> 0-4):
m = pd.to_datetime(df['date']).dt.dayofweek < 5
df2 = df[m]
Full example:
import pandas as pd
df = pd.DataFrame({
'date': [
'20180101',
'20180102',
'20180103',
'20180104',
'20180105',
'20180106',
'20180107'
],
'value': range(7)
})
m = pd.to_datetime(df['date']).dt.dayofweek < 5
df2 = df[m]
print(df2)
Returns:
date value
0 20180101 0
1 20180102 1
2 20180103 2
3 20180104 3
4 20180105 4
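One aside (my addition, not part of the answer above): since the strings are strictly yyyymmdd, passing the format explicitly is a small extra safeguard and is usually faster on large frames:
df['date'] = pd.to_datetime(df['date'], format='%Y%m%d')
weekday_rows = df[df['date'].dt.dayofweek < 5]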
If I have a dataframe and want to drop any rows where the value in one column is not an integer how would I do this?
The alternative is to drop rows if the value is not within the range 0-2, but since I am not sure how to do either of them, I was hoping someone else might.
Here is what I tried, but it didn't work and I'm not sure why:
df = df[(df['entrytype'] != 0) | (df['entrytype'] !=1) | (df['entrytype'] != 2)].all(1)
There are 2 approaches I propose:
In [212]:
df = pd.DataFrame({'entrytype':[0,1,np.NaN, 'asdas',2]})
df
Out[212]:
entrytype
0 0
1 1
2 NaN
3 asdas
4 2
If the range of values is as restricted as you say then using isin will be the fastest method:
In [216]:
df[df['entrytype'].isin([0,1,2])]
Out[216]:
entrytype
0 0
1 1
4 2
Otherwise we could cast to a str and then call .isdigit()
In [215]:
df[df['entrytype'].apply(lambda x: str(x).isdigit())]
Out[215]:
entrytype
0 0
1 1
4 2
str("-1").isdigit() is False
str("-1").lstrip("-").isdigit() works but is not nice.
df.loc[df['Feature'].str.match('^[+-]?\d+$')]
for your question the reverse set
df.loc[ ~(df['Feature'].str.match('^[+-]?\d+$')) ]
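A self-contained sketch of that regex approach, reusing the entrytype column from the question (casting to str first is my addition, so that mixed numeric and NaN values can be matched at all):
import numpy as np
import pandas as pd

df = pd.DataFrame({'entrytype': [0, 1, np.nan, 'asdas', 2, -1]})

mask = df['entrytype'].astype(str).str.match(r'^[+-]?\d+$')
print(df.loc[mask])    # integer-like rows, including -1
print(df.loc[~mask])   # the reverse set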
There are multiple ways to do the same thing, but I find this method easy and efficient.
Quick Examples
# Using drop() to delete rows based on a column value
df.drop(df[df['Fee'] >= 24000].index, inplace=True)
# Keep only the rows where Fee >= 24000 (boolean indexing)
df2 = df[df.Fee >= 24000]
# If you have a space in the column name,
# specify the column name in quotes with bracket notation
df2 = df[df['column name'] >= 24000]
# Using loc
df2 = df.loc[df["Fee"] >= 24000]
# Select rows based on multiple column values
df2 = df[(df['Fee'] >= 22000) & (df['Discount'] == 2300)]
# Drop rows with None/NaN in Discount
df2 = df[df.Discount.notnull()]