I have a dataframe with temperature data for a certain period. With this data, I want to calculate the relative frequency of August being warmer than 20° and of January being colder than 2°. I have already extracted these two columns into a separate dataframe, counted each temperature value, and used the normalize parameter of value_counts to get the frequency of each value in percent (see code).
df_temp1[df_temp1.aug >= 20]                        # Augusts at or above 20°
df_temp1[df_temp1.jan <= 2]                         # Januaries at or below 2°
df_temp1['aug'].value_counts()                      # count of each August temperature
df_temp1['jan'].value_counts()                      # count of each January temperature
df_temp1['aug'].value_counts(normalize=True)*100    # frequency of each value in percent
df_temp1['jan'].value_counts(normalize=True)*100
What I haven't managed is to calculate the relative frequency of aug >= 20 and of jan <= 2, as well as of aug >= 20 AND jan <= 2, and of aug >= 20 OR jan <= 2.
Maybe someone could help me with this problem. Thanks.
I would try something like this:
proportion_of_augusts_above_20 = (df_temp1['aug'] >= 20).mean()
proportion_of_januaries_below_2 = (df_temp1['jan'] <= 2).mean()
This works in two steps. First, df_temp1['aug'] >= 20 creates a boolean Series, with True for Augusts at or above 20° and False otherwise.
Then, mean() reinterprets True and False as 1 and 0; the average is the proportion of months fulfilling the criterion, i.e. the percentage divided by 100.
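The same trick extends directly to the combined conditions from your question: build one boolean mask per condition and combine them with & (AND) or | (OR) before taking the mean. A minimal sketch, assuming the column names from your code:

aug_hot = df_temp1['aug'] >= 20
jan_cold = df_temp1['jan'] <= 2

freq_aug    = aug_hot.mean() * 100               # % with aug >= 20
freq_jan    = jan_cold.mean() * 100              # % with jan <= 2
freq_both   = (aug_hot & jan_cold).mean() * 100  # % with aug >= 20 AND jan <= 2
freq_either = (aug_hot | jan_cold).mean() * 100  # % with aug >= 20 OR jan <= 2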
As an aside, I would recommend posting sample data in your question, so that people answering can check whether their solution works.
Related
I downloaded some stock data from CRSP and need the variance of each company's stock returns over the last 36 months.
So, basically the variance based on two conditions:
Same PERMCO (company number)
Monthly stock returns of the last 3 years.
However, I excluded penny stocks from my sample (stocks with prices < $2). Hence, some months are missing and, e.g., April's and June's monthly returns end up directly on top of each other.
If I am not mistaken, a rolling function (grouped by PERMCO) would just take the 36 monthly returns above. But when months are missing, the rolling function would actually take more than 3 years of data (since the last 36 monthly returns would then span a longer timeframe).
Usually I work with MS Excel. However, in this case the amount of data is too big and Excel takes forever to calculate anything. That's why I want to tackle the problem with Python.
The sample is organized as follows:
PERMNO date SHRCD PERMCO PRC RET
When I have figured out how to make a proper table in here I will show you a sample of my data.
What I have tried so far:
data["RET"]=data["RET"].replace(["C","B"], np.nan)
data["date"] = pd.to_datetime(date["date"])
data=data.sort_values[("PERMCO" , "date"]).reset_index()
L3Yvariance=data.groupby("PERMCO")["RET"].rolling(36).var().reset_index()
Sometimes there are C and B codes instead of actual returns; that's why the first line replaces them.
You can replace the missing values with the mean value. It will barely affect the variance, since the variance is calculated after subtracting the mean: each imputed point contributes 0 to the sum of squared deviations.
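To make the 36-row window correspond to exactly three calendar years, one option is to reindex each PERMCO to a gap-free monthly index first, so the excluded months reappear as NaN and can be mean-filled as suggested above. A hedged sketch (the helper name and the min_periods guard are assumptions, and it assumes at most one observation per PERMCO and month):

import pandas as pd

def rolling_var_3y(group):
    # gap-free monthly index from first to last observation of this PERMCO
    idx = pd.period_range(group['date'].min(), group['date'].max(), freq='M')
    ret = group.set_index(group['date'].dt.to_period('M'))['RET'].reindex(idx)
    # mean-fill the reinserted months, so 36 rows == 36 calendar months
    ret = ret.fillna(ret.mean())
    return ret.rolling(36, min_periods=12).var()

L3Yvariance = data.groupby('PERMCO').apply(rolling_var_3y)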
I calculated the number of quarters between two dates. Now I want to test whether that gap is bigger than 2.
Thank you for your comments!
I'm actually running code from WRDS (Wharton Research Data Services). Below, fst_vint is a DataFrame with two date variables, rdate and lag_rdate. The first line seems to convert them to quarter periods (e.g., 9/8/2019 to 2019Q3) and then takes the difference, storing it in a new column qtr.
fst_vint.qtr >= 2 creates a problem, because the former is a QuarterEnd object, while the latter is an integer. How do I deal with this problem?
fst_vint['qtr'] = (fst_vint['rdate'].dt.to_period('Q')
                   - fst_vint['lag_rdate'].dt.to_period('Q'))
# label first_report flag
fst_vint['first_report'] = (fst_vint.qtr.isnull()) | (fst_vint.qtr >= 2)
Converting the quarter periods to integers with .astype(int) and then using .diff() gives the desired answer. So the code in your case would be:
fst_vint['qtr'] = fst_vint['rdate'].dt.to_period('Q').astype(int).diff()  # integer quarter gap between consecutive rows
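If you want to keep using both date columns instead of .diff(), another hedged option is to pull the integer count out of the QuarterEnd offsets via their .n attribute (assuming pandas period arithmetic that returns offset objects):

import pandas as pd

diff = (fst_vint['rdate'].dt.to_period('Q')
        - fst_vint['lag_rdate'].dt.to_period('Q'))
# .n is the integer number of quarters inside the offset; keep NaN where a date is missing
fst_vint['qtr'] = diff.apply(lambda d: d.n if pd.notnull(d) else float('nan'))
fst_vint['first_report'] = fst_vint['qtr'].isnull() | (fst_vint['qtr'] >= 2)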
The dataset records occurrences of a particular insect at a location for a given year and month, and covers about 30 years. Given a location and a future year and month, I want the probability of finding that insect in that place, based on the historic data.
I tried to frame it as a classification problem by labelling all available data as 1 and checking the probability of a new data point being label 1, but an error was thrown because at least two classes are needed to train.
The data looks like this (x and y are longitude and latitude):

x      y      year  month
17.01  22.87  2013  01
42.32  33.09  2015  12
Think about the problem as a map. You'll need a map for each time period you're interested in, so sum all the occurrences in each month and year for each location. Unless the locations are already binned, you'll need some binning, as otherwise the counts are pretty meaningless: round the values in x and y to a reasonable precision, or use numpy to bin the data. Then you can build a map of the counts, or use a Markov model to predict the occurrence.
The reason you're not getting anywhere at the moment is that the chance of finding an insect at any random point is virtually 0.
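A hedged sketch of the binning-and-counting idea (column names taken from the sample above; the 0.5-degree grid size and the frame name df are assumptions):

import pandas as pd

bin_size = 0.5  # assumed grid resolution in degrees
df['x_bin'] = (df['x'] / bin_size).round() * bin_size
df['y_bin'] = (df['y'] / bin_size).round() * bin_size

# fraction of the observed years in which each grid cell had at least one
# occurrence in a given calendar month -- a simple empirical probability
n_years = df['year'].nunique()
seen = df.groupby(['x_bin', 'y_bin', 'month'])['year'].nunique()
prob = seen / n_years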
So I'm doing some time series analysis in pandas and have a peculiar pattern of outliers which I'd like to remove. The plot below is based on a dataframe with the first column as a date and the second column the data.
As you can see, those interspersed points of similar values, which look like lines, are likely instrument quirks and should be removed. I've tried rolling mean, rolling median, and removal based on standard deviation, to no avail. For an idea of density, it's daily measurements from 1984 to the present. Any ideas?
gauge = pd.read_csv('GaugeData.csv', parse_dates=[0], header=None)
gauge.columns = ['Date', 'Gauge']
gauge = gauge.set_index(['Date'])
gauge['1990':'1995'].plot(style='*')
And the result of applying rolling median
gauge = gauge.rolling(5, center=True).median()  # modern replacement for the removed pd.rolling_* functions
gauge['1990':'1995'].plot(style='*')
After rolling median
You can demand that each data point has at least "N" "nearby" data points within a certain distance "D".
N can be 2 or more.
Nearby for element gauge[i] can be a pair like gauge[i-1] and gauge[i+1], but since some points only have neighbors on one side, you can instead ask for at least two elements within an index (date) distance of 2. So, at least 2 of {gauge[i-2], gauge[i-1], gauge[i+1], gauge[i+2]} should satisfy Distance(gauge[i], gauge[ix]) < D.
D - you can decide this based on how close you expect those real data points to be.
It won't be perfect, but it should get most of the noise out of the dataset.
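A hedged sketch of that neighbour test, assuming gauge is the DataFrame from the question (Date index, 'Gauge' column); the concrete N and D values are placeholders to tune:

import numpy as np

N = 2    # required number of close neighbours
D = 5.0  # assumed maximum distance in gauge units

vals = gauge['Gauge'].to_numpy()

def has_support(i):
    # up to two neighbours on each side (index distance <= 2)
    neighbours = np.concatenate([vals[max(0, i - 2):i], vals[i + 1:i + 3]])
    return np.sum(np.abs(vals[i] - neighbours) < D) >= N

mask = np.array([has_support(i) for i in range(len(vals))])
cleaned = gauge[mask]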
This post is quite long and I will be very grateful to everybody who reads it to the end. :)
I am running into Python execution-speed issues and would like to know if you have a better way of doing what I want.
Let me explain my problem briefly. I have a lot of solar panel measurements, taken every 3 minutes. Unfortunately, some measurements can fail. The goal is to compare the timestamps in order to keep only the values that were measured in the same minutes, and then retrieve them. A GUI is also included in my software, so each time the user changes the panels to compare, the calculation has to be done again. To do so, I have implemented 2 parts: the first creates a vector of True or False for each panel for each minute, and the second compares those vectors and keeps only the common measures.
All the data is contained in the pandas DataFrame energiesDatas. The relevant columns are:
name: contains the name of the panel (length 1)
date: contains the day of the measurement (length 1)
list_time: contains a list of all time of measurement of a day (length N)
list_energy_prod: contains the corresponding measures (length N)
The first part loops over all possible minutes from the beginning to the end of the measurements. If a measure was taken at a given minute, it stores True, otherwise False.
self.ListCompare2 = pd.DataFrame()
for n in self.NameList:  # loop over all my solar panels
    m = self.energiesDatas[self.energiesDatas['Name'] == n]  # all data for this panel
    # table_date (built elsewhere) contains all the possible dates from the 1st measure, with an interval of 1 min.
    table_list = [1 for i in range(len(table_date))]
    pointerDate = 0  # pointer to the current value of time
    # all the measures of a given day are transformed into a str of hour-minutes
    DateString = [b.strftime('%H-%M') for b in m['list_time'].iloc[pointerDate]]
    # bookkeeping flags
    changeDate = 0
    count = 0
    # store the currently pointed date
    m_date = m['Date'].iloc[pointerDate]
    # for all possible times
    for curr_date in table_date:
        # if the considered date is bigger, move the pointer to the next day
        while curr_date.date() > m_date:
            pointerDate += 1
            changeDate = 1
            m_date = m['Date'].iloc[pointerDate]
        # if the day changed, recalculate the measures of this new day
        if changeDate:
            DateString = [b.strftime('%H-%M') for b in m['list_time'].iloc[pointerDate]]
            changeDate = 0
        # check if a measure was taken at the considered time
        table_list[count] = curr_date.strftime('%H-%M') in DateString
        count += 1
    # add to a dataframe
    self.ListCompare2[n] = table_list
l2 = self.ListCompare2
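As an aside, part 1 can be expressed more compactly by testing membership on the whole minute grid at once. A hedged sketch, assuming the entries of list_time can be flattened into full timestamps (minutes plays the role of table_date; start and end are assumed known):

minutes = pd.date_range(start, end, freq='1min')
for n in self.NameList:
    m = self.energiesDatas[self.energiesDatas['Name'] == n]
    # flatten the per-day lists of measurement times into one DatetimeIndex
    measured = pd.to_datetime([t for day in m['list_time'] for t in day])
    # floor to the minute and test the whole grid in one vectorized call
    self.ListCompare2[n] = minutes.isin(measured.floor('min'))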
The second part is the following: given a ListOfName of modules to compare, check whether they were measured at the same time and keep only the values measured in the same minute.
ListToKeep = self.ListCompare2[ListOfName[0]]  # list of True/False computed before
for i in ListOfName[1:]:  # for each other panel, AND the masks together
    ListToKeep = ListToKeep & self.ListCompare2[i]

for i in ListOfName:  # for each module, recover the values
    tmp = self.energiesDatas[self.energiesDatas['Name'] == i]
    count = 0
    # loop over the values we want to keep (also energy produced and the interval of time)
    for j, k, l, m, n in zip(tmp['list_time'], tmp['Date'], tmp['list_energy_prod'],
                             tmp['list_energy_rec'], tmp['list_interval']):
        # calculation of the index
        delta_day = (k - self.dt.date()).days * (18 * 60)
        # if the value of ListToKeep corresponding to the index is True, we keep the value
        tmp['list_energy_prod'].iloc[count] = [l[index] for index, a in enumerate(j)
                                               if ListToKeep.iloc[delta_day + (a.hour - 4) * 60 + a.minute]]
        tmp['list_energy_rec'].iloc[count] = [m[index] for index, a in enumerate(j)
                                              if ListToKeep.iloc[delta_day + (a.hour - 4) * 60 + a.minute]]
        tmp['list_interval'].iloc[count] = [n[index] for index, a in enumerate(j)
                                            if ListToKeep.iloc[delta_day + (a.hour - 4) * 60 + a.minute]]
        count += 1
    self.store_compare = self.store_compare.append(tmp)
Actually, this second part is the one that takes a very long time.
My question is: is there a way to save time, using built-in functions or anything else?
Thank you very much
Kilian
The answer of chris-sc solved my problem:
I believe your data structure isn't appropriate for your problem, especially the lists stored inside fields of a DataFrame; they make loops or apply almost unavoidable. Could you in principle restructure the data? (For example, one df per solar panel with columns date, time, energy.)
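A hedged sketch of what that restructuring could look like: one long-format frame with a row per panel and per measurement minute (the frame long and its column names are hypothetical). Selecting the common minutes then becomes a pivot plus a dropna, with no Python-level loops:

import pandas as pd

def common_measures(long, panels):
    # keep only the selected panels
    sub = long[long['name'].isin(panels)]
    # one column per panel, indexed by measurement minute
    wide = sub.pivot_table(index='timestamp', columns='name', values='energy_prod')
    # keep only the minutes where every selected panel has a value
    return wide.dropna()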