Pandas select rows based on a function of a column - python

I am trying to learn Pandas. I have found several examples of how to construct a pandas dataframe and how to add columns, and they work nicely. I would now like to learn how to select all rows based on the value of a column. I have found multiple examples of selecting rows where a column value is smaller or greater than a certain number, and that also works. My question is how to do a more general selection: I would like to first compute a function of a column, then select all rows for which the value of that function is greater or smaller than a certain number.
import names
import numpy as np
import pandas as pd
from datetime import date
import random
def randomBirthday(startyear, endyear):
    T1 = date.today().replace(day=1, month=1, year=startyear).toordinal()
    T2 = date.today().replace(day=1, month=1, year=endyear).toordinal()
    return date.fromordinal(random.randint(T1, T2))
def age(birthday):
    today = date.today()
    return today.year - birthday.year - ((today.month, today.day) < (birthday.month, birthday.day))
N_PEOPLE = 20
dict_people = { }
dict_people['gender'] = np.array(['male','female'])[np.random.randint(0, 2, N_PEOPLE)]
dict_people['names'] = [names.get_full_name(gender=g) for g in dict_people['gender']]
peopleFrame = pd.DataFrame(dict_people)
# Example 1: Add new columns to the data frame
peopleFrame['birthday'] = [randomBirthday(1920, 2020) for i in range(N_PEOPLE)]
# Example 2: Select all people with a certain age
peopleFrame.loc[age(peopleFrame['birthday']) >= 20]
This code works except for the last line. What is the correct way to write this line? I have considered adding an extra column holding the value of the function age and then selecting based on that column. That would work, but I am wondering whether I have to do it. What if I don't want to store the age of a person and only want to use it for selection?

Use Series.apply:
peopleFrame.loc[peopleFrame['birthday'].apply(age) >= 20]
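The original line fails because age expects a single date, while peopleFrame['birthday'] is a whole Series. A minimal self-contained sketch of the apply approach follows. The three people, their birthdays, and the fixed reference date are made up for illustration, and age takes an extra today argument here (an assumption, not part of the question's code) so that the result is reproducible.

```python
from datetime import date

import pandas as pd


def age(birthday, today):
    # Same arithmetic as in the question, with the reference date passed in.
    return today.year - birthday.year - (
        (today.month, today.day) < (birthday.month, birthday.day)
    )


# Hypothetical sample data, not from the question.
people = pd.DataFrame({
    "names": ["Ann", "Bob", "Cid"],
    "birthday": [date(1950, 6, 1), date(2015, 1, 15), date(1990, 12, 31)],
})

reference_day = date(2024, 1, 1)  # fixed so the example is deterministic

# apply() calls age once per element and returns a Series of ages;
# comparing it with 20 gives the boolean mask used for selection.
mask = people["birthday"].apply(lambda b: age(b, reference_day)) >= 20
adults = people.loc[mask]
print(adults)
```

The mask is only a temporary Series, so no age column is ever stored in the frame.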

Related

Parse CSV in 2D Python Object

I am trying to analyse a CSV file which looks like this:
timestamp      value
1594512094.39  51
1594512094.74  76
1594512098.07  50.9
1594512099.59  76.80000305
1594512101.76  50.9
I am using pandas to import each column:
dataFrame = pandas.read_csv('iot_telemetry_data.csv')
graphDataHumidity: object = dataFrame.loc[:, "humidity"]
graphTime: object = dataFrame.loc[:, "ts"]
My problem is that I need to pair the two columns so that I can filter the values by a specific time range. For example, given a timestampBegin of 1594512109.13668 and a timestampEnd of 1594512129.37415, I want the corresponding values so that I can compute, for example, their mean over that time range.
I didn't find any solution to this online, and I don't know of any library that solves this problem.
You can first filter the rows whose timestamp values lie between start and end, and then compute the statistic on the filtered rows, as follows.
(Note that the sample data appears to contain no row whose timestamp falls between 1594512109.13668 and 1594512129.37415. You can adjust the range values as needed.)
import pandas as pd
df = pd.read_csv('iot_telemetry_data.csv')
start = 1594512109.13668
end = 1594512129.37415
df = df[(df['timestamp'] >= start) & (df['timestamp'] <= end)]
average = df['value'].mean()
print(average)
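For a version that runs without the file, here is a small sketch that loads the sample rows from the question out of a string and uses Series.between for the range test (inclusive on both ends by default). The start and end values are hypothetical, chosen only so that some of the sample rows fall inside the range.

```python
from io import StringIO

import pandas as pd

# The sample rows from the question, inlined so the example is self-contained.
csv_text = """timestamp,value
1594512094.39,51
1594512094.74,76
1594512098.07,50.9
1594512099.59,76.80000305
1594512101.76,50.9
"""
df = pd.read_csv(StringIO(csv_text))

# Hypothetical range, picked to cover three of the five sample rows.
start, end = 1594512094.5, 1594512100.0

# between() builds the same mask as (ts >= start) & (ts <= end).
in_range = df["timestamp"].between(start, end)
average = df.loc[in_range, "value"].mean()
print(average)
```

Keeping the mask separate from df leaves the full frame available for other ranges.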

How can I add a day to a datetime object using Python, using two conditions from a dataframe?

I am trying to add a day to my column CutoffSLAII if the time in that column is after 1:59 am. However, I only want to add the day if the column FILE_START_TIME also has a time prior to 12:00 am. Otherwise, the value in CutoffSLAII should be retained.
The code I am using does run, but nothing changes in the dataframe:
from datetime import datetime, time, timedelta
import pandas as pd
def add_a_day(row: pd.Series) -> datetime:
    if pd.isna(x['CutoffSLAII', 'FILE_START_TIME']):
        return x['CutoffSLAII', 'FILE_START_TIME']
    tme = x['CutoffSLAII'].time()
    tme1 = x['FILE_START_TIME'].time()
    if tme < time(12, 0, 0) and tme1 > time(1, 59, 0):
        return x['CutoffSLAII'] + timedelta(days=1)
data: df2['CutoffSLAII'] = df2.apply(add_a_day, axis=1)
Data that I wish to add a day to:
Both FILE_START_TIME and CutoffSLAII have the dtype datetime64[ns]; however, when I access a single value in either column, it is returned as a Timestamp.
in: df2.iloc[0]['FILE_START_TIME']
out: Timestamp('2020-11-02 19:23:47')
The data is not embedded as I do not have enough reputation points, sorry for that.
The error message is now:
TypeError: string indices must be integers
Your function add_a_day takes a parameter named row but operates on a different name, x.
You may want to fix that first!
I'm a little confused about what's going on. Has x been defined somewhere else, or is it supposed to refer to row? Also, you annotate row as a Series and the return value as a datetime while applying the function over the whole data frame. Lastly, I think you are indexing the row incorrectly; check each column separately:
if pd.isna(x['CutoffSLAII']) or pd.isna(x['FILE_START_TIME']):
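Putting both remarks together, one possible corrected version is sketched below. It follows the prose description in the question (CutoffSLAII later than 01:59 and FILE_START_TIME earlier than 12:00) and assumes that the 12:00 threshold means noon, as time(12, 0, 0) in the question's code suggests. The three sample rows are made up, since the original data was not posted.

```python
from datetime import time, timedelta

import pandas as pd


def add_a_day(row: pd.Series) -> pd.Timestamp:
    cutoff = row["CutoffSLAII"]
    start = row["FILE_START_TIME"]
    # Check each column separately; leave the value untouched if either is missing.
    if pd.isna(cutoff) or pd.isna(start):
        return cutoff
    # Assumption: thresholds taken from the question's prose, 12:00 read as noon.
    if cutoff.time() > time(1, 59, 0) and start.time() < time(12, 0, 0):
        return cutoff + timedelta(days=1)
    # Always return a value, otherwise apply() would fill in None.
    return cutoff


# Hypothetical sample data, not from the question.
df2 = pd.DataFrame({
    "FILE_START_TIME": pd.to_datetime(
        ["2020-11-02 09:00:00", "2020-11-02 19:23:47", None]
    ),
    "CutoffSLAII": pd.to_datetime(
        ["2020-11-02 03:00:00", "2020-11-02 03:00:00", "2020-11-02 03:00:00"]
    ),
})

df2["CutoffSLAII"] = df2.apply(add_a_day, axis=1)
print(df2)
```

Only the first row satisfies both conditions, so only its CutoffSLAII moves forward by a day.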

column / list operators for datetime values in python

I have a pandas dataframe where one of the columns consists of datetime values with varying frequencies.
I want to create a new column which flags whenever the gap between two consecutive datetime values is greater than one day (datetime current row + timedelta(days=1) < datetime next row).
However, I want to do this with a column-wise (vectorized) operation rather than a for loop.
Had the values been int values, you could do something like:
df_ship["gap_gt_1"] = (df_ship['datetime']+1).lt(df_ship['datetime'].shift().bfill()).astype(int)
However, lt and similar operators don't work with datetime objects.
I've tried to do the following, but it only returns 'false' values.
df_ship["gap_gt_1"] = ((df_ship['datetime'] + timedelta(days=1)) < (df_ship['datetime'].shift()))
You can try to do:
import numpy as np
import pandas as pd
# Take the difference in dates
df["timedelta"] = df['date'] - df['date'].shift(1)
# To make the flags
conditions, type_choices = ([df['timedelta'] > pd.Timedelta(days=1)], [1])
df["flag"] = np.select(conditions, type_choices, default=0)
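As a self-contained variant that matches the condition stated in the question (current row plus one day is earlier than the next row), the sketch below compares each value with the following one via shift(-1). The four timestamps are made up for illustration; the column names follow the question.

```python
import pandas as pd

# Hypothetical sample data with one gap larger than a day (row 1 -> row 2).
df_ship = pd.DataFrame({
    "datetime": pd.to_datetime([
        "2021-01-01 00:00",
        "2021-01-01 12:00",
        "2021-01-03 00:00",
        "2021-01-03 06:00",
    ])
})

# shift(-1) aligns every row with the value of the NEXT row.
gap_to_next = df_ship["datetime"].shift(-1) - df_ship["datetime"]

# The last row has no successor (NaT), which compares as False.
df_ship["gap_gt_1"] = (gap_to_next > pd.Timedelta(days=1)).astype(int)
print(df_ship)
```

Note that shift() without an argument moves values down, so it pairs each row with the previous one rather than the next.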

Python equivalent to Spark rangeBetween for window?

I am trying to find out whether there is a way in Python to do the equivalent of rangeBetween in a rolling aggregation. In Spark, you can use rangeBetween so that the window does not have to be symmetrical around the targeted row; i.e., for each row I can look from -5h to +3h: all rows that occur between 5 hours before and 3 hours after, based on a datetime column. I know that pandas has the rolling option, but after reading all the documentation I can find on it, it looks like it only takes one input as the window. You can change whether that window is centered on each row or not, but I can't find a way to explicitly set it so that it looks at a range of my choosing.
Does anyone know of another function or functionality that I am not aware of that would do this?
I'm not sure if it's the best answer, but it's mine and it works, so I guess it'll have to do until there is a better option. I made a Python function out of it so you can substitute whatever aggregation function you want.
import pandas as pd
from datetime import timedelta

def rolling_stat(pdf, lower_bound, upper_bound, group, statistic='mean'):
    # pdf needs a DatetimeIndex, a numeric 'nbr' column and a grouping column named by `group`.
    # lower_bound and upper_bound are given in minutes.
    results = []
    for grp in pdf[group].drop_duplicates():
        # sort_index() returns a new frame, so the result has to be assigned
        dataframe_grp = pdf[pdf[group] == grp].sort_index()
        for index in dataframe_grp.index:
            lower = index - timedelta(minutes=lower_bound)
            upper = index + timedelta(minutes=upper_bound)
            # label slicing on a sorted DatetimeIndex includes both end points
            window = dataframe_grp.loc[lower:upper, 'nbr']
            dataframe_grp.at[index, 'agg'] = window.agg(statistic)
        results.append(dataframe_grp)
    # collect the per-group frames and return them
    return pd.concat(results)
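If the row-by-row loop becomes too slow, the same asymmetric window can be sketched without a Python loop for the special case of a mean over a single group. This is only an illustration under stated assumptions: the helper name asymmetric_window_mean is hypothetical, the input is a Series with a DatetimeIndex and no missing values, and both window ends are inclusive. It locates the window edges with np.searchsorted and takes differences of a cumulative sum.

```python
import numpy as np
import pandas as pd


def asymmetric_window_mean(series, before, after):
    """Mean over [t - before, t + after] for every timestamp t (hypothetical helper)."""
    series = series.sort_index()
    times = series.index.to_numpy()
    values = series.to_numpy(dtype=float)
    before = pd.Timedelta(before).to_timedelta64()
    after = pd.Timedelta(after).to_timedelta64()
    # First position inside each window and one past the last position.
    lo = np.searchsorted(times, times - before, side="left")
    hi = np.searchsorted(times, times + after, side="right")
    # Window sums from a cumulative sum with a leading zero.
    csum = np.concatenate(([0.0], np.cumsum(values)))
    # Each window contains at least its own row, so hi - lo >= 1.
    return pd.Series((csum[hi] - csum[lo]) / (hi - lo), index=series.index)


# Made-up data: look 5 hours back and 3 hours ahead, as in the question.
idx = pd.to_datetime([
    "2021-01-01 00:00", "2021-01-01 02:00",
    "2021-01-01 06:00", "2021-01-01 10:00",
])
nbr = pd.Series([10.0, 20.0, 30.0, 40.0], index=idx)
result = asymmetric_window_mean(nbr, before="5h", after="3h")
print(result)
```

For grouped data, one option is to call the helper once per group and concatenate the results, mirroring the loop over groups above.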

Adding up the values with the same index using numpy/python

I am new to Python and NumPy. I want to find the total number of rainfall days per year (i.e., the sum of column E for each year; see the attached image).
I am using numpy.unique to find the unique elements of the year array.
Following is my attempt:
import numpy as np
data = np.genfromtxt("location/ofthe/file", delimiter = " ")
unique_year = np.unique(data[:,0], return_index=True)
print(unique_year)
j= input('select one of the unique year: >>> ')
#Then I want to give the sum of the rainfall days in that year.
I would appreciate if someone could help me.
Thanks in advance.
For such tasks, Pandas (which builds on NumPy) is more easily adaptable.
Here, you can use GroupBy to create a series mapping. You can then use your input to query your series:
import pandas as pd
# read file into dataframe
df = pd.read_excel('file.xlsx')
# create series mapping from GroupBy object
rain_days_by_year = df.groupby('year')['Rain days(in numbers)'].sum()
# get input as integer
j = int(input('select one of the unique year: >>> '))
# extract data
res = rain_days_by_year[j]
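If you would rather stay in NumPy, a rough equivalent can be sketched with np.unique and np.bincount. The array below is made up, with the year in column 0 and the rain days in column 1; the column positions in the real file are an assumption and need adjusting.

```python
import numpy as np

# Hypothetical data: column 0 = year, column 1 = rain days.
data = np.array([
    [2001, 3],
    [2001, 5],
    [2002, 2],
    [2003, 7],
    [2002, 4],
], dtype=float)

# return_inverse maps every row to the position of its year in `years`.
years, inverse = np.unique(data[:, 0], return_inverse=True)

# bincount with weights sums the rain days that share the same position.
totals = np.bincount(inverse, weights=data[:, 1])

rain_days_by_year = dict(zip(years.astype(int).tolist(), totals.tolist()))
print(rain_days_by_year)
```

The total for a chosen year can then be looked up with rain_days_by_year[j], where j is the integer read from input.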
