Prepare DataFrames to be compared: index manipulation, datetime and beyond - python

Ok, this is a question in two steps.
Step one: I have a pandas DataFrame like this:
       date  time  value
0  20100201     0     12
1  20100201     6     22
2  20100201    12     45
3  20100201    18     13
4  20100202     0     54
5  20100202     6     12
6  20100202    12     18
7  20100202    18     17
8  20100203     6     12
...
As you can see, data can be missing; for instance, between rows 7 and 8 the reading for time 0 of 20100203 is absent. Sometimes several hours, or even a full day, can be missing.
I would like to convert this DataFrame to the format like this:
                     value
2010-02-01 00:00:00     12
2010-02-01 06:00:00     22
2010-02-01 12:00:00     45
2010-02-01 18:00:00     13
2010-02-02 00:00:00     54
2010-02-02 06:00:00     12
2010-02-02 12:00:00     18
2010-02-02 18:00:00     17
...
I want this because I have another DataFrame (let's call it the "reliable DataFrame") in this format that I am sure has no missing values.
EDIT 2016/07/28: While studying the problem it turned out there was also duplicated data in the DataFrame. See the solution below, which also addresses this problem.
Step two: With the previous step done, I want to compare, row by row, the index of the "reliable DataFrame" with the index of the DataFrame with missing values.
I want to add a row with the value NaN wherever an entry is missing from the first DataFrame. The final check would be to make sure that both DataFrames have the same dimensions.
I know this is a long question, but I am stuck. I have tried to parse the dates with dateutil.parser.parse and to use set_index to set a new index, but I keep getting errors. I am afraid this is clearly above my pandas level.
Thank you in advance.

Step 1 Answer
import pandas as pd

# Combine the two columns into one string per row and parse in a single call
df['DateTime'] = pd.to_datetime(df['date'].astype(str) + ' ' + df['time'].astype(str) + ':00:00')
df.set_index('DateTime', inplace=True)
df.drop(columns=['date', 'time'], inplace=True)
If there are duplicates, they can be removed by:
df = df.reset_index().drop_duplicates(subset='DateTime', keep='last').set_index('DateTime')
Step 2
df_join = df.join(df1, how='outer', lsuffix='x', sort=True)
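If the goal is specifically to align the gappy frame to the reliable one and fill the holes with NaN, reindex is a compact alternative to the outer join. A minimal sketch on toy data (the date_range here is a stand-in for the reliable frame's complete index, assuming a 6-hourly frequency):
import pandas as pd

# Toy stand-ins: df1 (the "reliable" frame) covers every 6-hour step,
# df is missing the 2010-02-03 00:00 reading
idx_reliable = pd.date_range('2010-02-02 12:00', '2010-02-03 06:00', freq='6h')
df1 = pd.DataFrame({'value': [18, 17, 99, 12]}, index=idx_reliable)
df = pd.DataFrame({'value': [18, 17, 12]}, index=idx_reliable.delete(2))

# Align df to the reliable index; the missing timestamp becomes a NaN row
df_aligned = df.reindex(df1.index)
print(df_aligned)

# Final check from the question: both frames have the same dimensions
assert df_aligned.shape == df1.shape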

Related

Pandas dataframe: Sum up rows by date and keep only one row per day without timestamp

I have a dataframe like this:
ds y
2018-07-25 22:00:00 1
2018-07-25 23:00:00 2
2018-07-26 00:00:00 3
2018-07-26 01:00:00 4
2018-07-26 02:00:00 5
What I want to get is a new dataframe which looks like this
ds y
2018-07-25 3
2018-07-26 12
I want to get a new dataframe df1 where all the entries of one day are summed up in y, and I only want to keep one row per day, without a timestamp.
What I did so far is this:
df1 = df.groupby(df.index.date).transform(lambda x: x[:24].sum())
24 because I have 24 entries every day (for every hour). I get the correct sum for every day but I also get 24 rows for every day together with the existing timestamps. How can I achieve what I want?
If you need to sum all values per day, then filtering the first 24 rows is not necessary:
df1 = df.groupby(df.index.date)['y'].sum().reset_index()
Try out (ds is the index here, so the datetime attributes live on df.index; a bare DataFrame has no .dt accessor):
df.groupby([df.index.year, df.index.month, df.index.day])['y'].sum()
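For reference, a self-contained sketch of the per-day sum on the sample data (a minimal reconstruction; the column and index names are taken from the question):
import pandas as pd

# Rebuild the hourly sample frame with ds as a DatetimeIndex
idx = pd.to_datetime(['2018-07-25 22:00', '2018-07-25 23:00',
                      '2018-07-26 00:00', '2018-07-26 01:00',
                      '2018-07-26 02:00'])
df = pd.DataFrame({'y': [1, 2, 3, 4, 5]}, index=pd.Index(idx, name='ds'))

# One row per calendar day, timestamps dropped;
# df.resample('D')['y'].sum() would give the daily sums as well
df1 = df.groupby(df.index.date)['y'].sum().rename_axis('ds').reset_index()
print(df1)
#            ds   y
# 0  2018-07-25   3
# 1  2018-07-26  12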

How to replace a timedelta object by NaN in a pandas Series?

I would like to calculate the mean of a timedelta Series, excluding 00:00:00 values.
This is my Series:
1 00:28:00
3 01:57:00
5 00:00:00
7 01:27:00
9 00:00:00
11 01:30:00
I want to replace rows 5 and 9 with NaN and then apply .mean() to the Series. mean() doesn't include NaN values, so I would get the desired value.
How can I do that stuff?
I am trying:
`df["time_column"].replace('0 days 00:00:00', np.NaN).mean()`
but no values are replaced.
One idea is to use a zero Timedelta object:
out = df["time_column"].replace(pd.Timedelta(0), np.NaN).mean()
print(out)
0 days 01:20:30
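For completeness, a self-contained sketch reproducing the Series from the question; the string pattern never matches because the values are Timedelta objects, not strings:
import pandas as pd
import numpy as np

# Rebuild the Series from the question as real Timedelta values
s = pd.Series(pd.to_timedelta(['00:28:00', '01:57:00', '00:00:00',
                               '01:27:00', '00:00:00', '01:30:00']))

# Match the Timedelta itself, not its string repr; NaN is stored as NaT here
out = s.replace(pd.Timedelta(0), np.nan).mean()
print(out)  # 0 days 01:20:30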

Pandas filter data for line graph

I'm trying to use Pandas to filter the dataframe. The dataset runs from 1982-01 to 2019-11, and I want to filter the data from 2010 onwards, i.e. 2010-01 to 2019-11.
mydf = pd.read_csv('temperature_mean.csv')
df1 = mydf.set_index('month')
df= df1.loc['2010-01':'2019-11']
I set the index to month, and I'm able to get the mean temperature for the filtered data. However, I need the index values as my x labels for the line graph, and I'm not able to get them. I tried to use a regex to get the data from 201x onwards, but I still get an error.
How do I get the labels for the months, i.e. 2010-01, 2010-02, ..., 2019-10, 2019-11, for the line graph?
Thanks!
mydf = pd.read_csv('temperature_mean.csv')
month mean_temp
______________________
0 1982-01-01 39
1 1985-04-01 29
2 1999-03-01 19
3 2010-01-01 59
4 2013-05-01 32
5 2015-04-01 34
6 2016-11-01 59
7 2017-08-01 14
8 2017-09-01 7
df1 = mydf.set_index('month')
df= df1.loc['2010-01':'2019-11']
mean_temp
month
______________________
2010-01-01 59
2013-05-01 32
2015-04-01 34
2016-11-01 59
2017-08-01 14
2017-09-01 7
Drawing the line plot (default x & y arguments):
df.plot.line()
If for some reason you want to manually specify the column names:
df.reset_index().plot.line(x='month', y='mean_temp')
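Putting the pieces together, a minimal sketch (assuming the temperature_mean.csv layout shown above, with matplotlib for display); parsing month up front makes the partial-string slice operate on real dates:
import pandas as pd
import matplotlib.pyplot as plt

# Parse 'month' as datetimes so .loc['2010-01':'2019-11'] slices by date
mydf = pd.read_csv('temperature_mean.csv', parse_dates=['month'])
df = mydf.set_index('month').loc['2010-01':'2019-11']

# With a DatetimeIndex, the months label the x axis automatically
df.plot.line(y='mean_temp')
plt.show()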

Pandas error on unix datetime conversion -- OutOfBoundsDatetime: cannot convert input with unit 's'

I am getting this error
File "pandas/_libs/tslib.pyx", line 356, in pandas._libs.tslib.array_with_unit_to_datetime
pandas._libs.tslibs.np_datetime.OutOfBoundsDatetime: cannot convert input with unit 's'
when trying to convert a pandas column to datetime format.
I checked this answer Convert unix time to readable date in pandas dataframe
but it did not help me to solve the problem.
There is an issue on GitHub that seems to be closed, but at the same time people keep reporting problems:
https://github.com/pandas-dev/pandas/issues/10987
The DataFrame column is in unix time format; here is a printout of the first rows:
0 1420096800
1 1420096800
2 1420097100
3 1420097100
4 1420097400
5 1420097400
6 1420093800
7 1420097700
8 1420097700
9 1420098000
10 1420098480
11 1420098600
12 1420099200
13 1420099500
14 1420099500
15 1420100100
16 1420100400
17 1420096800
18 1420100700
19 1420100820
20 1420101840
Any ideas about how I might solve it?
I tried changing units from s to ms, but it did not help.
pd.__version__
'0.24.2'
The failing line:
df[key] = pd.to_datetime(df[key], unit='s')
It works if you add the origin='unix' parameter:
pd.to_datetime(df['date'], origin='unix', unit='s')
0 2015-01-01 07:20:00
1 2015-01-01 07:20:00
2 2015-01-01 07:25:00
3 2015-01-01 07:25:00
4 2015-01-01 07:30:00
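For a quick check, a sketch on a few of the sample values; the errors='coerce' variant is an assumption about the underlying cause (a stray out-of-range or malformed entry), turning such rows into NaT instead of raising:
import pandas as pd

df = pd.DataFrame({'date': [1420096800, 1420097100, 1420097400]})

# Interpret the integers as seconds since the unix epoch
print(pd.to_datetime(df['date'], origin='unix', unit='s'))

# Defensive variant: out-of-range or malformed entries become NaT
print(pd.to_datetime(df['date'], unit='s', errors='coerce'))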

How to access last element of a multi-index dataframe

I have a dataframe with IDs and timestamps as a multi-index. The index is sorted by IDs and timestamps, and I want to pick the latest timestamp for each ID. For example:
IDs timestamp value
0 2010-10-30 1
2010-11-30 2
1 2000-01-01 300
2007-01-01 33
2010-01-01 400
2 2000-01-01 11
So basically the result I want is
IDs timestamp value
0 2010-11-30 2
1 2010-01-01 400
2 2000-01-01 11
What is the command to do that in pandas?
Given this setup:
import pandas as pd
import io

content = io.StringIO("""\
IDs timestamp value
0 2010-10-30 1
0 2010-11-30 2
1 2000-01-01 300
1 2007-01-01 33
1 2010-01-01 400
2 2000-01-01 11""")
df = pd.read_csv(content, sep=r'\s+', header=0, parse_dates=['timestamp'])
df.set_index(['IDs', 'timestamp'], inplace=True)
Using reset_index followed by groupby:
df.reset_index(['timestamp'], inplace=True)
print(df.groupby(level=0).last())
yields
timestamp value
IDs
0 2010-11-30 00:00:00 2
1 2010-01-01 00:00:00 400
2 2000-01-01 00:00:00 11
This does not feel like the best solution, however. There should be a way to do this without calling reset_index...
As you point out in the comments, last ignores NaN values. To not skip NaN values, you could use groupby/agg like this:
df.reset_index(['timestamp'], inplace=True)
grouped = df.groupby(level=0)
print(grouped.agg(lambda x: x.iloc[-1]))
One can also use
df.groupby("IDs").tail(1)
This will take the last row of each label in level "IDs" and will not ignore NaN values.
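A quick check on the multi-indexed frame from the setup above (before the reset_index calls), where "IDs" resolves to an index level:
# Works directly on the multi-indexed df; both index levels are kept
print(df.groupby(level='IDs').tail(1))
#                 value
# IDs timestamp
# 0   2010-11-30      2
# 1   2010-01-01    400
# 2   2000-01-01     11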
