Add missing dates in pandas df, but date range has (valid) duplicates - python

I have a dataset that receives multiple values per second: up to 100 DFS (no more, but not consistently 100). The challenge is that the date field did not capture time at a granularity finer than one second, so multiple rows share the same hh:mm:ss timestamp. Those duplicates are fine, but several seconds are also missing from the set entirely, i.e., they do not appear at all.
Therefore my 2 initial columns might look like this, where I am missing the 54 sec step:
2020-08-24 03:36:53, 5
2020-08-24 03:36:53, 8
2020-08-24 03:36:53, 6
2020-08-24 03:36:55, 8
Because of the legit date "duplicates" and the information I need from this, I don't want to aggregate but I do need to create the missing seconds, insert them and fill (NaN, etc) so I can then manage them appropriately for aligning with other datasets.
The only way I can seem to do this is with a nested if loop that looks at the previous timestamp: if it is the same as the current cell (pt == ct), take no action; if it is exactly one second earlier (pt == ct - 1), also take no action; but if it trails the current cell by two or more seconds (pt <= ct - 2), insert the missing seconds. This feels a bit cumbersome (though workable). Am I missing an easier way to do this?
I have checked a lot of "fill missing dates" threads on here as well as in various functions on pandas.pydata.org but reindexing and the most common date fills all seem to rely on dates not having duplicates. Any advice would be fantastic.

This can be solved by creating a pandas series containing all the timepoints you want to consider and then merging it with the original dataframe.
For example:
import pandas as pd

start, end = df['date'].min(), df['date'].max()
all_timepoints = pd.date_range(start, end, freq='s').to_series(name='date')
df.merge(all_timepoints, on='date', how='outer', sort=True).fillna(0)
Will give:
date value
0 2020-08-24 03:36:53 5.0
1 2020-08-24 03:36:53 8.0
2 2020-08-24 03:36:53 6.0
3 2020-08-24 03:36:54 0.0
4 2020-08-24 03:36:55 8.0
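If you would rather keep the inserted seconds as NaN (as the question suggests) so they can be handled explicitly later, simply drop the fillna(0):
df_full = df.merge(all_timepoints, on='date', how='outer', sort=True)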

Related

What is happening when pandas.Series converts int64s into NaNs?

I have a csv with dates and integers (Headers: Date, Number), separated by a tab.
I'm trying to create a calendar heatmap with CalMap (demo on that page). The function that creates the chart takes data that's indexed by DateTime.
df = pd.read_csv("data.csv", delimiter="\t")
df['Date'] = df['Date'].astype('datetime64[ns]')
events = pd.Series(df['Date'], index=df['Number'])
calmap.yearplot(events)
But when I check events.head(5), it gives the date followed by NaN. I check df['Number'].head(5) and they appear as int64.
What am I doing wrong that is causing this conversion?
Edit: Data below
Date Number
7/9/2018 40
7/10/2018 40
7/11/2018 40
7/12/2018 70
7/13/2018 30
Edit: Output of events.head(5)
2018-07-09 NaN
2018-07-10 NaN
2018-07-11 NaN
2018-07-12 NaN
2018-07-13 NaN
dtype: float64
First of all, it is not NaN, it is NaT (Not a Time), which is unique to Pandas, though Pandas makes it compatible with NaN and uses it, like NaN in floating-point columns, to mark missing data.
What pd.Series(data, index=index) does depends on the type of data. If data is a list, then index has to be of equal length, and a new Series is constructed with data as the values and index as the labels. However, if data is already a Series (such as df['Date']), it will instead reindex: it takes the rows of that Series whose labels appear in index and constructs a new Series out of those rows. For example:
pd.Series(df['Date'], [1, 1, 4])
will give you
1 2018-07-10
1 2018-07-10
4 2018-07-13
Here 2018-07-10 comes from row #1, and 2018-07-13 from row #4 of df['Date']. However, there is no row with index 40, 70 or 30 in your sample input data, so missing data is presumed, and NaT is inserted instead.
In contrast, this is what you get when you use a list instead:
pd.Series(df['Date'].to_list(), index=df['Number'])
# => Number
# 40 2018-07-09
# 40 2018-07-10
# 40 2018-07-11
# 70 2018-07-12
# 30 2018-07-13
# dtype: datetime64[ns]
I was able to fix this by converting the series to lists via df['Date'].tolist() and df['Number'].tolist(). calmap.calendarplot(events) was able to accept these in place of the original series parameters.
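In code, that fix presumably looks something like the sketch below; note that calmap expects numeric values indexed by datetime, so the values and the index are swapped relative to the original snippet:
# Plain Python lists bypass the index-alignment behaviour described above
events = pd.Series(df['Number'].tolist(), index=pd.DatetimeIndex(df['Date'].tolist()))
calmap.calendarplot(events)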

How to find missing values in timestamps which are not at a regular interval

I have a dataset like this, with data at 10-second intervals.
rec NO2_RAW NO2
0 2019-05-31 13:42:15 0.01 9.13
1 2019-05-31 13:42:25 17.0 51.64
2 2019-05-31 13:42:35 48.4 111.69
The timestamps are not consistent throughout the table. There are instances where, after a long gap, the timestamps start again from a new time; for example, after 2019-05-31 16:00:00 they resume at 2019-06-01 00:00:08.
I want to fill in the missing values by checking the time difference between consecutive rows (10 s) and assigning NaN values at the missing times.
I saw this example Search Missing Timestamp and display in python? but it is meant for regularly spaced data. I want to calculate a 15-minute moving average from this data, so I need the data to be consistent.
Can someone please help?
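A minimal sketch of one approach (assuming the timestamp column is named rec, as in the sample): conform the index to a strict 10-second grid with asfreq, which inserts NaN rows at the missing times, then take a time-based rolling mean:
import pandas as pd

df['rec'] = pd.to_datetime(df['rec'])
df = df.set_index('rec').sort_index()

# asfreq inserts a NaN row wherever a 10-second step is missing
df = df.asfreq('10s')

# time-based window: a 15-minute moving average of the NO2 column
avg = df['NO2'].rolling('15min').mean()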

Inserting missing numbers in dataframe

I have a program that ideally measures the temperature every second. However, in reality this does not happen. Sometimes it skips a second, or it breaks down for 400 seconds and then decides to start recording again. This leaves gaps in my 2-by-n dataframe, where ideally n = 86400 (the number of seconds in a day). I want to apply some sort of moving/rolling average to it to get a nicer plot, but if I do that to the "raw" datafiles, the number of data points decreases. This is shown here, watch the x-axis. I know the "nice data" doesn't look nice yet; I'm just playing with some values.
So, I want to implement a data-cleaning method which adds data to the dataframe. I have thought about it, but don't know how to implement it. My idea is as follows:
If the index is not equal to the time, then we need to add a number at time = index. If the gap is only one value, then the average of the previous and the next number will do for me. But if it is bigger, say 100 seconds are missing, then a linear function is needed which increases or decreases the value steadily.
So I guess a training set could be like this:
index time temp
0 0 20.10
1 1 20.20
2 2 20.20
3 4 20.10
4 100 22.30
Here, I would like to get a value for index 3, time 3 and the values missing between time = 4 and time = 100. I'm sorry about my formatting skills, I hope it is clear.
How would I go about programming this?
Use merge with a complete time column and then interpolate:
import numpy as np
import pandas as pd

# Create your table
time = np.array([e for e in np.arange(20) if np.random.uniform() > 0.6])
temp = np.random.uniform(20, 25, size=len(time))
temps = pd.DataFrame([time, temp]).T
temps.columns = ['time', 'temperature']
>>> temps
time temperature
0 4.0 21.662352
1 10.0 20.904659
2 15.0 20.345858
3 18.0 24.787389
4 19.0 20.719487
The above is a random table generated with missing time data.
# modify it
filled = pd.Series(np.arange(temps.iloc[0,0], temps.iloc[-1, 0]+1))
filled = filled.to_frame()
filled.columns = ['time'] # Create a fully filled time column
merged = pd.merge(filled, temps, on='time', how='left') # merge it with original, time without temperature will be null
merged.temperature = merged.temperature.interpolate() # fill nulls linearly.
# Alternatively, use reindex, this does the same thing.
final = temps.set_index('time').reindex(np.arange(temps.time.min(),temps.time.max()+1)).reset_index()
final.temperature = final.temperature.interpolate()
>>> merged # or final
time temperature
0 4.0 21.662352
1 5.0 21.536070
2 6.0 21.409788
3 7.0 21.283505
4 8.0 21.157223
5 9.0 21.030941
6 10.0 20.904659
7 11.0 20.792898
8 12.0 20.681138
9 13.0 20.569378
10 14.0 20.457618
11 15.0 20.345858
12 16.0 21.826368
13 17.0 23.306879
14 18.0 24.787389
15 19.0 20.719487
First you can convert the second values into actual timestamps, like so:
df.index = pd.to_datetime(df['time'], unit='s')
After which you can use pandas' built-in time series operations to resample and fill in the missing values:
df = df.resample('s').interpolate('time')
Optionally, if you still want to do some smoothing you can use the following operation for that:
df.rolling(5, center=True, win_type='hann').mean()
This smooths with a 5-element-wide Hanning window. Note: any window-based smoothing will cost you value points at the edges.
Now your dataframe will have datetimes (including date) as index. This is required for the resample method. If you want to lose the date, you can simply use:
df.index = df.index.time
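Putting those steps together on the toy data from the question (a sketch; column names assumed):
import pandas as pd

df = pd.DataFrame({'time': [0, 1, 2, 4, 100],
                   'temp': [20.10, 20.20, 20.20, 20.10, 22.30]})

df.index = pd.to_datetime(df['time'], unit='s')
df = df.resample('s').interpolate('time')  # fill every missing second, linear in time

# optional smoothing; win_type='hann' requires scipy
smoothed = df['temp'].rolling(5, center=True, win_type='hann').mean()
df.index = df.index.time  # drop the dummy date part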

How to check the type of missing data in Python (randomly missing or not)?

I have a large amount of data (93 files, ~150 MB each). The data is a time series, i.e., information about a given set of coordinates (3.3 million latitude-longitude values) is recorded and stored every day for 93 days, and the whole data is broken up into 93 files respectively. Example of two such files:
Day 1:
lon lat A B day1
68.4 8.4 NaN 20 20
68.4 8.5 16 20 18
68.6 8.4 NaN NaN NaN
.
.
Day 2:
lon lat C D day2
68.4 8.4 NaN NaN NaN
68.4 8.5 24 25 24.5
68.6 8.4 NaN NaN NaN
.
.
I am interested in understanding the nature of the missing data in the columns 'day1', 'day2', 'day3', etc. For example, if the values missing in those columns are evenly distributed across the whole set of coordinates, then the data is probably missing at random, but if the missing values are concentrated in a particular subset of coordinates, then my data will become biased. Note that my data is split across multiple large files and is not in a very standard form to operate on, which makes it harder to use some tools.
I am looking for a diagnostic tool or visualization in python that can check/show how the missing data is distributed over the set of coordinates so I can impute/ignore it appropriately.
Thanks.
P.S: This is the first time I am handling missing data so it would be great to see if there exists a workflow which people who do similar kind of work follow.
Assuming that you read a file into a dataframe named df, you can count the number of NaNs using:
df.isnull().sum()
It will return the number of NaNs per column.
You could also use:
df.isnull().sum(axis=1).value_counts()
This, on the other hand, sums the number of NaNs per row and then counts how many rows have no NaNs, 1 NaN, 2 NaNs, and so on.
Regarding files of this size: to speed up loading and processing, I recommend using Dask and converting your files, preferably to Parquet, so that you can read and write them in parallel.
You can easily recreate the function above in Dask like this:
from dask import dataframe as dd
dd.read_parquet(file_path).isnull().sum().compute()
Answering the comment question:
Use .loc to slice your dataframe; in the code below I select all rows (:) and two columns (['col1', 'col2']).
df.loc[:, ['col1', 'col2']].isnull().sum(axis=1).value_counts()
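For the spatial part of the question (whether the NaNs cluster at particular coordinates), one possible sketch is to colour each coordinate by how often its value is missing; the column names are taken from the sample data and matplotlib is assumed:
import matplotlib.pyplot as plt

# fraction of missing values per row, across whichever day columns are present
day_cols = [c for c in df.columns if c.startswith('day')]
miss_rate = df[day_cols].isnull().mean(axis=1)

# an even colour spread suggests data missing at random;
# visible clusters suggest spatially biased missingness
plt.scatter(df['lon'], df['lat'], c=miss_rate, s=2, cmap='viridis')
plt.colorbar(label='fraction missing')
plt.xlabel('lon')
plt.ylabel('lat')
plt.show()
The missingno package also provides quick matrix/bar visualizations of missingness, which can help as a first diagnostic.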

Pandas: Filling NaN poor performance - avoid iterating over rows?

I have a performance problem with filling missing values in my dataset. This concerns a 500 MB / 5,000,000-row dataset (Kaggle: Expedia 2013).
It would be easiest to use df.fillna(), but it seems I cannot use this to fill every NaN with a different value.
I created a lookup table:
srch_destination_id | Value
2 0.0110
3 0.0000
5 0.0207
7 NaN
8 NaN
9 NaN
10 0.1500
12 0.0114
For each srch_destination_id, this table contains the corresponding value with which to replace NaN in the dataset.
# Iterate over the dataset row by row. If a value is missing (NaN), fill in
# the min. val found in the lookup table.
for row in range(len(dataset)):
    if pd.isnull(dataset.iloc[row]['prop_location_score2']):
        cell = dataset.iloc[row]['srch_destination_id']
        dataset.set_value(row, 'prop_location_score2', lookuptable.loc[cell])
This code works when iterating over 1000 rows, but when iterating over all 5 million rows, my computer never finishes (I waited hours).
Is there a better way to do what I'm doing? Did I make a mistake somewhere?
pd.Series.fillna does accept a series or a dictionary, as well as scalar replacement values.
Therefore, you can create a series mapping from the lookup table:
s = lookuptable.set_index('srch_destination_id')['Value']
Then use this to fill the NaN values in the dataset:
dataset['prop_location_score2'] = dataset['prop_location_score2'].fillna(
    dataset['srch_destination_id'].map(s))
Notice that in the fillna input we map the identifier column from the dataset through the lookup series, using pd.Series.map to produce the per-row replacement values.
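As a self-contained sketch on toy data shaped like the question's (names reused from above):
import numpy as np
import pandas as pd

dataset = pd.DataFrame({'srch_destination_id': [2, 7, 10, 5],
                        'prop_location_score2': [0.9, np.nan, np.nan, np.nan]})
lookuptable = pd.DataFrame({'srch_destination_id': [2, 5, 7, 10],
                            'Value': [0.0110, 0.0207, np.nan, 0.1500]})

s = lookuptable.set_index('srch_destination_id')['Value']
dataset['prop_location_score2'] = dataset['prop_location_score2'].fillna(
    dataset['srch_destination_id'].map(s))
# ids whose lookup value is itself NaN (id 7 here) simply remain NaN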
