I have a DataFrame like this one:
[image: raw DataFrame]
and want to get a DataFrame like this manually crafted one:
[image: wanted outcome]
I have tried to do it with the following commands:
import pandas as pd

# group by ELEMENT and collect the unique VALUEs of each group
SD = station_data.set_index('ELEMENT')
SD.reset_index(inplace=True)
SD = SD.groupby(['ELEMENT'])
result = SD['VALUE'].unique()
result.reset_index(inplace=False)
SD = pd.DataFrame(result)
SD = SD.transpose()
The problem with this is that .unique() is wrong: the same value can occur multiple times, and the values are no longer single objects. So I am looking for a way to get the wanted DataFrame with all values.
The output of the command block is:
VALUE
ELEMENT
HUMIDITY [98.0, 97.0, 96.0, 95.0, 94.0, 92.0, 93.0, 91....
PRECIPITATION_FORM [<NA>]
PRECIPITATION_HEIGHT [0.0, 0.1, 0.2, 0.3, 0.5, 0.4, 0.6, 0.8, 1.5,...
...
Let us use the following stripped-down DataFrame as an example:
import pandas as pd

test = pd.DataFrame({'STATION_ID': [1207, 1207, 1207, 1207, 1207, 1207, 1207, 1207, 1207],
                     'DATE': ['2019-01-01 00:00:00', '2019-01-01 01:00:00', '2019-01-01 02:00:00',
                              '2019-01-01 00:00:00', '2019-01-01 01:00:00', '2019-01-01 02:00:00',
                              '2019-01-01 00:00:00', '2019-01-01 01:00:00', '2019-01-01 02:00:00'],
                     'ELEMENT': ['TEMPERATURE_AIR_200', 'TEMPERATURE_AIR_200', 'TEMPERATURE_AIR_200',
                                 'HUMIDITY', 'HUMIDITY', 'HUMIDITY',
                                 'TEMPERATURE_DEW_POINT_200', 'TEMPERATURE_DEW_POINT_200', 'TEMPERATURE_DEW_POINT_200'],
                     'VALUE': [0.1, 0.1, 0.4, 98.0, 96.0, 98.0, -0.3, 0.1, 0.4]})
You currently have data in "long" format, and you want your data in "wide" format. This can be done in Pandas using either pivot_table() or unstack().
Here's an example of how to do it with unstack().
import pandas as pd
test = pd.DataFrame({'STATION_ID': [1207, 1207, 1207, 1207, 1207, 1207, 1207, 1207, 1207],
                     'DATE': ['2019-01-01 00:00:00', '2019-01-01 01:00:00', '2019-01-01 02:00:00',
                              '2019-01-01 00:00:00', '2019-01-01 01:00:00', '2019-01-01 02:00:00',
                              '2019-01-01 00:00:00', '2019-01-01 01:00:00', '2019-01-01 02:00:00'],
                     'ELEMENT': ['TEMPERATURE_AIR_200', 'TEMPERATURE_AIR_200', 'TEMPERATURE_AIR_200',
                                 'HUMIDITY', 'HUMIDITY', 'HUMIDITY',
                                 'TEMPERATURE_DEW_POINT_200', 'TEMPERATURE_DEW_POINT_200', 'TEMPERATURE_DEW_POINT_200'],
                     'VALUE': [0.1, 0.1, 0.4, 98.0, 96.0, 98.0, -0.3, 0.1, 0.4]})
# Set indexes so pandas knows which columns to unstack by
test = test.set_index(['STATION_ID', 'DATE', 'ELEMENT'])
test = test.unstack()
# The column names are now a multiindex. Fix that
test.columns = test.columns.get_level_values(1)
# Put the index back how it was. Optional.
test = test.reset_index()
print(test)
Output:
ELEMENT STATION_ID DATE HUMIDITY TEMPERATURE_AIR_200 TEMPERATURE_DEW_POINT_200
0 1207 2019-01-01 00:00:00 98.0 0.1 -0.3
1 1207 2019-01-01 01:00:00 96.0 0.1 0.1
2 1207 2019-01-01 02:00:00 98.0 0.4 0.4
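For comparison, here is a minimal sketch of the same reshape with pivot_table(), assuming the original long-format test frame from above and that each (STATION_ID, DATE, ELEMENT) combination occurs only once, so the aggregation function never actually has anything to combine:
import pandas as pd

# pivot the long frame to wide; 'first' simply keeps the single value per cell
wide = test.pivot_table(index=['STATION_ID', 'DATE'],
                        columns='ELEMENT',
                        values='VALUE',
                        aggfunc='first').reset_index()
wide.columns.name = None  # drop the leftover 'ELEMENT' axis name
print(wide)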
I am working with a dataset whose records range from 16-02-2022 00:00 to 01/04/2022 11:30. I want to plot the daily and monthly means of four variables (H, LE, co2, h2o) and then compare them with the same variables from another dataset. I filtered the variables of interest by their quality flags and removed outliers using the interquartile range. My problem is that I cannot get the real dates when I plot the averaged values. For example, I should get about two months; instead I get many more.
As you can see, it is not the correct plot of the sensible heat flux monthly mean cycle, because I have more or less two months.
I used this script to import the data:
datasetBio = pd.read_csv(io.BytesIO(uploaded["eddypro_Bioesame_full_output_exp2.csv"]),
                         sep=';', header=1, parse_dates=['day&time'],
                         index_col=['day&time'], na_values="-9999")
and my DateTimeIndex look like this:
DatetimeIndex(['2022-02-16 00:00:00', '2022-02-16 00:30:00',
'2022-02-16 01:00:00', '2022-02-16 01:30:00',
'2022-02-16 02:00:00', '2022-02-16 02:30:00',
'2022-02-16 03:00:00', '2022-02-16 03:30:00',
'2022-02-16 04:00:00', '2022-02-16 04:30:00',
...
'2022-01-04 07:00:00', '2022-01-04 07:30:00',
'2022-01-04 08:00:00', '2022-01-04 08:30:00',
'2022-01-04 09:00:00', '2022-01-04 09:30:00',
'2022-01-04 10:00:00', '2022-01-04 10:30:00',
'2022-01-04 11:00:00', '2022-01-04 11:30:00'],
dtype='datetime64[ns]', name='day&time', length=2136, freq=None)
In my dataset there is also a DOY (day of year) column, but I tried to use it without success. I also tried the datetime module, strftime and strptime, without success.
I also tried this:
# list comprehension: label each row with its hour-minute combination
datasetBio['HM'] = ['%s-%s' % (el.hour, el.minute) for el in datasetBio.index]

list_m = []
list_h = []
for el in datasetBio['HM'].unique():
    list_m.append(datasetBio['H'].loc[datasetBio['HM'] == el].mean())
    list_h.append(el)

# Look at the groups
for el in datasetBio['HM'].unique():
    print(datasetBio.loc[datasetBio['HM'] == el])
partial output:
2022-02-19 12:30:00 0.0 ... 0.575134 0.424103 0.066102 0.041973
2022-02-20 12:30:00 1.0 ... 0.898857 0.551975 0.069380 0.221436
2022-02-21 12:30:00 0.0 ... 221.180000 234.682000 0.369427 0.161920
2022-02-22 12:30:00 1.0 ... 0.521469 0.673882 0.074374 0.312831
2022-02-23 12:30:00 0.0 ... 0.303948 0.630388 0.069664 0.283314
When I try to plot the variables from the two datasets together, the problem with the days obviously remains.
Instead of covering the correct time range, the plot runs from January to December.
Please, someone help me solve this problem, because I don't know what to do anymore.
Thanks in advance.
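A note on the index shown above: it jumps from 2022-02-16 ... to 2022-01-04 ..., which is exactly the pattern you get when day-first strings such as 01/04/2022 (1 April) are parsed month-first as January 4. If the CSV really does store day-first dates, a hedged sketch of the import with explicit day-first parsing would be (file and column names taken from the question):
import io
import pandas as pd

# dayfirst=True makes pandas read 01/04/2022 as 1 April rather than January 4
datasetBio = pd.read_csv(io.BytesIO(uploaded["eddypro_Bioesame_full_output_exp2.csv"]),
                         sep=';', header=1, parse_dates=['day&time'],
                         index_col=['day&time'], na_values="-9999",
                         dayfirst=True)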
I want to write a piece of code that returns the difference in days, hours and minutes between the values of backtest_results and realtime_test. So I want to perform 2022-01-24 10:05:00 - 2022-01-30 14:09:03, 2022-01-27 01:54:00 - 2022-02-02 09:34:06, and so on:
backtest_results[0] - realtime_test[0]
backtest_results[1] - realtime_test[1]
...
How would I get the code below to do that?
import pandas as pd
import numpy
backtest_results = pd.to_datetime(['2022-01-24 10:05:00', '2022-01-27 01:54:00',
'2022-01-30 19:08:00','2022-02-02 14:32:00',
'2022-02-10 02:58:00', '2022-02-10 14:01:00',
'2022-02-11 00:25:00' '2022-02-16 13:49:00'])
realtime_test = pd.to_datetime([
'2022-01-30 14:09:03', '2022-02-02 09:34:06',
'2022-02-08 07:37:03', '2022-02-09 22:07:02',
'2022-02-10 09:02:03', '2022-02-10 19:32:25',
'2022-02-12 16:42:03', '2022-02-15 23:19:03'])
result = backtest_results - realtime_test
You're missing a comma in backtest_results in the last row:
backtest_results = pd.to_datetime(['2022-01-24 10:05:00', '2022-01-27 01:54:00',
'2022-01-30 19:08:00','2022-02-02 14:32:00',
'2022-02-10 02:58:00', '2022-02-10 14:01:00',
'2022-02-11 00:25:00', '2022-02-16 13:49:00'])
^^ here
Then, you can simply subtract one from the other.
If you want the raw difference:
>>> backtest_results - realtime_test
TimedeltaIndex(['-7 days +19:55:57', '-7 days +16:19:54', '-9 days +11:30:57',
'-8 days +16:24:58', '-1 days +17:55:57', '-1 days +18:28:35',
'-2 days +07:42:57', '0 days 14:29:57'],
dtype='timedelta64[ns]', freq=None)
If you want the difference in days:
>>> (backtest_results - realtime_test).astype('timedelta64[D]')
Float64Index([-7.0, -7.0, -9.0, -8.0, -1.0, -1.0, -2.0, 0.0], dtype='float64')
If you want the difference in hours:
>>> (backtest_results - realtime_test).astype('timedelta64[h]')
Float64Index([-149.0, -152.0, -205.0, -176.0, -7.0, -6.0, -41.0, 14.0], dtype='float64')
If you want the difference in minutes:
>>> (backtest_results - realtime_test).astype('timedelta64[m]')
Float64Index([-8885.0, -9101.0, -12270.0, -10536.0, -365.0, -332.0, -2418.0, 869.0], dtype='float64')
df['diff'] = df.EndDate - df.StartDate
df['diff'] = df['diff'] / np.timedelta64(1, 'D')
'D' for days, 'W' for weeks, 'M' for months, 'Y' for years
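The same division works directly on the TimedeltaIndex from the question; a small sketch, assuming the corrected backtest_results and realtime_test from above:
import numpy as np

diff = backtest_results - realtime_test

# total difference as fractional days / hours / minutes
days = diff / np.timedelta64(1, 'D')
hours = diff / np.timedelta64(1, 'h')
minutes = diff / np.timedelta64(1, 'm')
print(minutes)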
I have the below data frame, which is time-series data that I process as input to my prediction models.
df = pd.DataFrame({"timestamp": [pd.Timestamp('2019-01-01 01:00:00', tz=None),
pd.Timestamp('2019-01-01 01:00:00', tz=None),
pd.Timestamp('2019-01-01 01:00:00', tz=None),
pd.Timestamp('2019-01-01 02:00:00', tz=None),
pd.Timestamp('2019-01-01 02:00:00', tz=None),
pd.Timestamp('2019-01-01 02:00:00', tz=None),
pd.Timestamp('2019-01-01 03:00:00', tz=None),
pd.Timestamp('2019-01-01 03:00:00', tz=None),
pd.Timestamp('2019-01-01 03:00:00', tz=None)],
"value":[5.4,5.1,100.8,20.12,21.5,80.08,150.09,160.12,20.06]
})
From this, I take the mean of the value for each timestamp and send it as input to the predictor. Currently I am just using thresholds to filter out the outliers, but those seem to filter out real values and also to miss some outliers.
For example, I kept
df[(df['value'] > 3) & (df['value'] < 120)]
and then this does not filter out
2019-01-01 01:00:00 100.8
which is an outlier for that timestamp and does filter out
2019-01-01 03:00:00 150.09
2019-01-01 03:00:00 160.12
which are not outliers for that timestamp.
So how do I filter out, for each timestamp, the outliers that do not fit that group?
Any help is appreciated.
OK, let's assume you are looking for a confidence interval to detect outliers.
Then you have to compute the mean and the confidence interval for each timestamp group, which you can do as follows:
import math

# per-group mean, count and standard deviation
stats = df.groupby(['timestamp'])['value'].agg(['mean', 'count', 'std'])

# 95% confidence interval around each group mean
ci95_hi = []
ci95_lo = []
for i in stats.index:
    m, c, s = stats.loc[i]
    ci95_hi.append(m + 1.96 * s / math.sqrt(c))
    ci95_lo.append(m - 1.96 * s / math.sqrt(c))

stats['ci95_hi'] = ci95_hi
stats['ci95_lo'] = ci95_lo

# attach the group statistics to every row
df = pd.merge(df, stats, how='left', on='timestamp')
which merges the per-timestamp statistics onto every row. Then you can add a filter column:
import numpy as np
df['Outlier'] = np.where(df['value'] >= df['ci95_hi'], 1, np.where(df['value']<= df['ci95_lo'], 1, 0))
Then everything with a 1 in the Outlier column is an outlier. You can adjust the 1.96 factor to play with the sensitivity a little.
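To actually drop the flagged rows, a short sketch reusing the Outlier column built above:
# keep only the rows that were not flagged as outliers
filtered = df[df['Outlier'] == 0]
print(filtered[['timestamp', 'value']])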
I have a pandas dataframe indexed by DateTime from hour "00:00:00" until hour "23:59:00" (increments by minute, seconds not counted).
in: df.index
out: DatetimeIndex(['2018-10-08 00:00:00', '2018-10-08 00:00:00',
'2018-10-08 00:00:00', '2018-10-08 00:00:00',
'2018-10-08 00:00:00', '2018-10-08 00:00:00',
'2018-10-08 00:00:00', '2018-10-08 00:00:00',
'2018-10-08 00:00:00', '2018-10-08 00:00:00',
...
'2018-10-08 23:59:00', '2018-10-08 23:59:00',
'2018-10-08 23:59:00', '2018-10-08 23:59:00',
'2018-10-08 23:59:00', '2018-10-08 23:59:00',
'2018-10-08 05:16:00', '2018-10-08 07:08:00',
'2018-10-08 13:58:00', '2018-10-08 09:30:00'],
dtype='datetime64[ns]', name='DateTime', length=91846, freq=None)
Now I want to choose specific intervals, say every 1 minute or every 1 hour, starting from "00:00:00", and retrieve all the rows that are that interval apart, consecutively.
I can grab an entire interval, say the first hour, with
df.between_time("00:00:00", "01:00:00")
But I want to be able to
(a) get only the times that are a specific interval apart, and
(b) get all the 1-hour intervals without having to ask for them manually 24 times. How do I increment the DatetimeIndex inside the between_time command? Is there a better way?
I would solve this problem with masking rather than making new dataframes. For example, you can add a column df['which_one'] and set a different number for each subset. Then you can access a subset by calling df[df['which_one'] == x], where x is the subset you want to select. You can still apply other conditional statements, and just about everything else pandas has to offer, by accessing the data this way.
P.S. There are other methods of accessing data that might be faster; I just used what I am most comfortable with. Another way would be df[df['which_one'].eq(x)].
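A minimal sketch of that idea for the hourly case, assuming the DateTime-indexed df from the question (the which_one labels are simply the hour of each row):
# label every row with its hour; each label marks one 1-hour subset
df['which_one'] = df.index.hour

# all rows of the first hour interval
first_hour = df[df['which_one'] == 0]

# visit all 24 subsets without asking for them manually
for hour, subset in df.groupby('which_one'):
    print(hour, len(subset))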
If you are dead set on dataframes, I would suggest a dictionary of dataframes, such as:
import pandas as pd

dfdict = {}
for i in range(0, 10):
    dfdict[i] = pd.DataFrame()
print(dfdict)
As you will see, they are indeed dataframes:
Out[1]:
{0: Empty DataFrame
Columns: []
Index: [], 1: Empty DataFrame
Columns: []
Index: [], 2: Empty DataFrame
Columns: []
Index: [], 3: Empty DataFrame
Columns: []
Index: [], 4: Empty DataFrame
Columns: []
Index: [], 5: Empty DataFrame
Columns: []
Index: [], 6: Empty DataFrame
Columns: []
Index: [], 7: Empty DataFrame
Columns: []
Index: [], 8: Empty DataFrame
Columns: []
Index: [], 9: Empty DataFrame
Columns: []
Index: []}
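In practice you would fill the dictionary with slices of the real frame rather than empty ones; one hedged way, assuming the DateTime-indexed df from the question, is a dict comprehension over an hourly groupby:
# one dataframe per hour of the day, keyed by the hour
dfdict = {hour: subset for hour, subset in df.groupby(df.index.hour)}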
Although, as others have suggested, there might be a more practical approach to your problem (it is difficult to say without more specifics of the issue).
How do I check whether a numpy datetime is between time1 and time2 (ignoring the date)?
Say I have a series of datetimes; I want to check the weekday and whether the time is between 13:00 and 13:30. For example,
2014-03-05 22:55:00
is a Wednesday, and it is not between 13:00 and 13:30.
Using pandas, you could use the DatetimeIndex.indexer_between_time method to find those dates whose time is between 13:00 and 13:30.
For example,
import pandas as pd
dates = pd.date_range('2014-3-1 00:00:00', '2014-3-8 0:00:00', freq='50T')
dates_between = dates[dates.indexer_between_time('13:00','13:30')]
wednesdays_between = dates_between[dates_between.weekday == 2]
These are the first 5 items in dates:
In [95]: dates.tolist()[:5]
Out[95]:
[Timestamp('2014-03-01 00:00:00', tz=None),
Timestamp('2014-03-01 00:50:00', tz=None),
Timestamp('2014-03-01 01:40:00', tz=None),
Timestamp('2014-03-01 02:30:00', tz=None),
Timestamp('2014-03-01 03:20:00', tz=None)]
Notice that these dates are all between 13:00 and 13:30:
In [96]: dates_between.tolist()[:5]
Out[96]:
[Timestamp('2014-03-01 13:20:00', tz=None),
Timestamp('2014-03-02 13:30:00', tz=None),
Timestamp('2014-03-04 13:00:00', tz=None),
Timestamp('2014-03-05 13:10:00', tz=None),
Timestamp('2014-03-06 13:20:00', tz=None)]
And of those dates, here is the only one that is a Wednesday:
In [99]: wednesdays_between.tolist()
Out[99]: [Timestamp('2014-03-05 13:10:00', tz=None)]
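If you only have a single numpy datetime64 rather than an index, one hedged alternative is to go through pd.Timestamp, whose .time() and .weekday() allow the comparison directly (a sketch using the example value from the question):
import datetime
import numpy as np
import pandas as pd

ts = pd.Timestamp(np.datetime64('2014-03-05 22:55:00'))

is_wednesday = ts.weekday() == 2  # Monday == 0
in_window = datetime.time(13, 0) <= ts.time() <= datetime.time(13, 30)
print(is_wednesday, in_window)  # True False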