I need to compare a column called last_login, and if the date matches today's date I want to append the whole row to a list, or something similar:
def joined_today(self, df):
    users_joined_today = []
    date_joined = pd.DataFrame(df)
    today = datetime.date.today()
    for i in date_joined['last_login']:
        i = i.date()
        if i == today:
            users_joined_today.append(i)
    return users_joined_today
I am just wondering what would be an efficient way to retrieve the whole rows matching the values returned by the joined_today() function?
With Pandas, you should aim to use vectorised operations:
# convert series to datetime, if not already
df['last_login'] = pd.to_datetime(df['last_login'])
# calculate Boolean series mask
mask = df['last_login'].dt.normalize() == pd.Timestamp.now().normalize()
# apply mask
df_filtered = df[mask]
# optionally, convert to list of lists
df_filtered_L = df_filtered.values.tolist()
Normalizing a datetime series flattens the time component to zero, so you can compare it against today's date at midnight. In older pandas versions pd.to_datetime('today') was already normalized; in recent versions it is safer to normalize explicitly, e.g. pd.Timestamp.now().normalize().
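As a quick self-contained sketch of the mask approach (the user/last_login sample data here is made up):

```python
import pandas as pd

# hypothetical sample data: one login today, one yesterday
now = pd.Timestamp.now()
df = pd.DataFrame({
    "user": ["alice", "bob"],
    "last_login": [now, now - pd.Timedelta(days=1)],
})

# normalize() strips the time component, so equality means "same calendar day"
mask = df["last_login"].dt.normalize() == pd.Timestamp.now().normalize()
df_filtered = df[mask]

# optionally, convert the matching rows to a list of lists
rows = df_filtered.values.tolist()
```

Only the row whose last_login falls on today's date survives the mask.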
I have a dataframe where date and datetime values were added to a column that was expected to contain only strings. What would be the best way to filter all the date and datetime values out of a pandas dataframe column and replace those values with blanks?
Thank you!
In general, if you provided a minimal working example of your problem, one could help more specifically, but assuming you have the following column:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.zeros(shape=(10, 1)), columns=["Mixed"])
df["Mixed"] = "foobar"
df.loc[2, "Mixed"] = pd.to_datetime("2022-08-22")
df.loc[7, "Mixed"] = pd.to_datetime("2022-08-21")
#print("Before Fix", df)
You can use apply(type) on the column to obtain the datatype of each cell, and then use a list comprehension, [x != str for x in types], to check whether each cell's datatype is a string or not. After that, just replace the values that are not the desired datatype with a value of your choosing.
types = df["Mixed"].apply(type).values
mask = [x!=str for x in types]
df.loc[mask,"Mixed"] = "" #Or None, or whatever you want to overwrite it with
#print("After Fix", df)
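As an alternative to collecting the types first, the same cleanup can be sketched with an isinstance mask built directly on the column (the sample data here is made up):

```python
import pandas as pd

# hypothetical mixed column: strings plus one Timestamp
df = pd.DataFrame({"Mixed": ["foobar"] * 4})
df.loc[2, "Mixed"] = pd.to_datetime("2022-08-22")

# True where the cell is a string, False for dates/datetimes
is_str = df["Mixed"].apply(lambda x: isinstance(x, str))

# blank out everything that is not a string
df.loc[~is_str, "Mixed"] = ""
```

isinstance also catches subclasses, which a strict `type(x) != str` comparison would miss.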
From the datetime object in the dataframe I created two new columns based on month and day.
data["time"] = pd.to_datetime(data["datetime"])
data['month']= data['time'].apply(lambda x: x.month)
data['day']= data['time'].apply(lambda x: x.day)
The resultant data had the correct month and day added to the specific columns.
Then I tried to filter it based on specific day
data = data[data['month']=='9']
data = data[data['day']=='2']
These values were visible in the dataframe before filtering, yet this returns an empty dataframe. What did I do wrong?
Compare with the integers 9 and 2, without quotes:
data = data[(data['month']==9) & (data['day']==2)]
Or:
data = data[(data['time'].dt.month == 9) & (data['time'].dt.day == 2)]
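Put together on a small made-up sample, the integer comparison works because .dt.month and .dt.day return integers, not strings:

```python
import pandas as pd

# hypothetical sample: only the first row is September 2nd
data = pd.DataFrame({
    "datetime": ["2019-09-02 10:00", "2019-09-05 11:00", "2019-10-02 12:00"],
})
data["time"] = pd.to_datetime(data["datetime"])

# compare against the integers 9 and 2, not the strings '9' and '2'
filtered = data[(data["time"].dt.month == 9) & (data["time"].dt.day == 2)]
```

Comparing an integer series against a string always yields False, which is why the original filter produced an empty dataframe.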
Could I ask how to retrieve an index of a row in a DataFrame?
Specifically, I am able to retrieve the index of rows from a df.loc.
idx = data.loc[data.name == "Smith"].index
I can even retrieve row index from df.loc by using data.index like this:
idx = data.loc[data.index == 5].index
However, I cannot retrieve the index directly from the row itself (i.e., from row.index, instead of df.loc[].index). I tried the following:
idx = data.iloc[5].index
The result of this code is the column names.
To provide context, the reason I need to retrieve the index of a specific row (instead of rows from df.loc) is to use df.apply for each row.
I plan to use df.apply to apply a code to each row and copy the data from the row immediately above them.
def retrieve_gender(row):
    # This is panel data; only the data for 2000 is already keyed in.
    # Time-invariant data in later years are the same as in 2000.
    if row["Year"] == 2000:
        pass
    elif row["Year"] == 2001:  # To avoid complexity, let's use only year 2001 as an example.
        idx = row.index  # This is the wrong code.
        row["Gender"] = row.iloc[idx - 1]["Gender"]
    return row["Gender"]

data["Gender"] = data.apply(retrieve_gender, axis=1)
With Pandas you can loop through your dataframe like this:
for index in range(len(df)):
    if df.loc[index, 'Year'] == 2001:
        df.loc[index, 'Gender'] = df.loc[index - 1, 'Gender']
apply gives series indexed by column labels
The problem with idx = data.iloc[5].index is data.iloc[5] converts a row to a pd.Series object indexed by column labels.
In fact, what you are asking for is impossible via pd.DataFrame.apply because the series that feeds your retrieve_gender function does not include any index identifier.
Use vectorised logic instead
With Pandas, row-wise logic is inefficient and not recommended, as it involves a Python-level loop. Use column-wise logic instead. Taking a step back, it seems you wish to implement 2 rules:
If Year is not 2001, leave Gender unchanged.
If Year is 2001, use Gender from previous row.
np.where + shift
For the above logic, you can use np.where with pd.Series.shift:
data['Gender'] = np.where(data['Year'] == 2001, data['Gender'].shift(), data['Gender'])
mask + shift
Alternatively, you can use mask + shift:
data['Gender'] = data['Gender'].mask(data['Year'] == 2001, data['Gender'].shift())
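The two rules above can be sketched end-to-end on a tiny made-up panel:

```python
import numpy as np
import pandas as pd

# hypothetical two-person panel: Gender keyed in only for year 2000
data = pd.DataFrame({
    "Year":   [2000, 2001, 2000, 2001],
    "Gender": ["F", None, "M", None],
})

# where Year is 2001, copy Gender from the row immediately above;
# everywhere else, keep Gender unchanged
data["Gender"] = np.where(data["Year"] == 2001,
                          data["Gender"].shift(),
                          data["Gender"])
```

shift() moves every value down one row, so the 2001 rows pick up the 2000 value above them in a single vectorised pass, with no Python-level loop.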
I have some data in a data frame like this:
Point,x,y,z,Description,Layer,Date
1,224939.203,1243008.651,1326.774,F,C-GRAD-FILL,09/22/18 07:24:34,
1,225994.242,1243021.426,1301.772,BS,C-GRAD-FILL,09/24/18 08:24:18,
451,225530.332,1243016.186,1316.173,GRD,C-TOE,10/02/18 11:49:13,
452,225522.429,1242996.017,1319.168,GRD,C-TOE KEY,10/02/18 11:49:46,
I would like to check the Description column against a list of strings to see if it matches, and then change another column's value.
myList = ["BS", "C"]
if df['Description'].isin(myList) == True:
    df['Layer'] = "INLIST"
    df['Date'] = "1/1/2018"
Use a mask with pd.DataFrame.loc:
mask = df['Description'].isin(['BS', 'C'])
df.loc[mask, 'Layer'] = 'INLIST'
df.loc[mask, 'Date'] = '1/1/2018'
Alternatively, use pd.Series.mask:
mask = df['Description'].isin(['BS', 'C'])
df['Layer'].mask(mask, 'INLIST', inplace=True)
df['Date'].mask(mask, '1/1/2018', inplace=True)
Note you should probably store dates as datetime objects, e.g. pd.Timestamp('2018-01-01'), or convert your series via pd.to_datetime. There's usually no need for a Python-level loop.
Borrowing the mask from jpp and using a list to assign both values at once:
df.loc[mask,['Layer','Date']]=['INLIST','1/1/2018']
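Run on a small made-up subset of the question's data, the mask-and-assign pattern looks like this:

```python
import pandas as pd

# hypothetical subset of the question's data
df = pd.DataFrame({
    "Description": ["F", "BS", "GRD", "C"],
    "Layer": ["C-GRAD-FILL", "C-GRAD-FILL", "C-TOE", "C-TOE"],
    "Date": ["09/22/18", "09/24/18", "10/02/18", "10/02/18"],
})

# boolean mask: rows whose Description is in the list
mask = df["Description"].isin(["BS", "C"])

# assign both columns in one statement for the matching rows
df.loc[mask, ["Layer", "Date"]] = ["INLIST", "1/1/2018"]
```

The right-hand list is broadcast across all masked rows, so both columns are updated in a single vectorised assignment.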
I have a list of indices, label_index. I want to extract the corresponding values from a dataframe label_file based on those indices. The values of label_index appear in the image_num column of the dataframe, and the goal is to get a list of the corresponding values in the Thermal conductivity(W/mK) column.
label_file = pd.read_excel("/Users/yixuansun/Documents/Research/ThermalConductiviy/Anisotropic/anisotropic_porous_media/data.xlsx",
sheet_name = "total")
label = []
for i in label_index:
    for j in range(len(label_file)):
        if i == label_file.iloc[j]["image_num"]:
            label.append(label_file.iloc[j]["Thermal conductivity(W/mK)"])
I used brute force to find the matches (two for loops). It takes a very long time to get through. I am wondering if there is a more efficient way to do this.
Get column "Thermal conductivity(W/mK)" where "image_num" column has one of the values specified in label_index list:
series = label_file.loc[
label_file['image_num'].isin(label_index),
'Thermal conductivity(W/mK)']
EDIT 1:
For sorting by label_index you can use an auxiliary column as follows:
df = label_file.loc[
label_file['image_num'].isin(label_index),
['Thermal conductivity(W/mK)', 'image_num']]
# create aux. column to sort by
df['sortbyme'] = df['image_num'].apply(lambda x: label_index.index(x))
# sort by aux. column and get only 'Thermal conductivity(W/mK)' column
series = df.sort_values('sortbyme').reset_index()['Thermal conductivity(W/mK)']
I actually found a faster and cleaner way myself.
ther = []
for i in label_index:
    ther.append(label_file.loc[i]["Thermal conductivity(W/mK)"])
This will do the work.
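Note that label_file.loc[i] selects by index label, so this only works when image_num happens to equal the row index. A sketch that makes that relationship explicit and keeps the requested order (assuming image_num values are unique; the sample data is made up):

```python
import pandas as pd

# hypothetical stand-in for label_file
label_file = pd.DataFrame({
    "image_num": [10, 20, 30],
    "Thermal conductivity(W/mK)": [1.5, 2.5, 3.5],
})
label_index = [30, 10]

# index the conductivity column by image_num, then look up
# the labels in the order given by label_index
lookup = label_file.set_index("image_num")["Thermal conductivity(W/mK)"]
ther = lookup.loc[label_index].tolist()
```

Because .loc with a list of labels returns results in the order of that list, this preserves the ordering of label_index without any auxiliary sort column.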