What is happening when pandas.Series converts int64s into NaNs?

I have a csv with dates and integers (Headers: Date, Number), separated by a tab.
I'm trying to create a calendar heatmap with CalMap (demo on that page). The function that creates the chart takes data that's indexed by DateTime.
df = pd.read_csv("data.csv",delimiter="\t")
df['Date'] = df['Date'].astype('datetime64[ns]')
events = pd.Series(df['Date'],index = df['Number'])
calmap.yearplot(events)
But when I check events.head(5), it gives the date followed by NaN. I check df['Number'].head(5) and they appear as int64.
What am I doing wrong that is causing this conversion?
Edit: Data below
Date Number
7/9/2018 40
7/10/2018 40
7/11/2018 40
7/12/2018 70
7/13/2018 30
Edit: Output of events.head(5)
2018-07-09 NaN
2018-07-10 NaN
2018-07-11 NaN
2018-07-12 NaN
2018-07-13 NaN
dtype: float64

First of all, it is not NaN, it is NaT (Not a Time), which is unique to pandas. Pandas makes it compatible with NaN and uses it much like NaN in floating-point columns, namely to mark missing data.
What pd.Series(data, index=index) does depends on the type of data. If data is a list or array, index must have the same length, and a new Series is constructed with data as the values and index as the labels. However, if data is already a Series (such as df['Date']), the constructor instead reindexes it: it looks up the rows of data whose existing index labels match index and builds the new Series from those rows. For example:
pd.Series(df['Date'], [1, 1, 4])
will give you
1 2018-07-10
1 2018-07-10
4 2018-07-13
Here 2018-07-10 comes from row #1 and 2018-07-13 from row #4 of df['Date']. However, there is no row labelled 40, 70 or 30 in your sample input data, so missing data is presumed and NaT is inserted instead.
In contrast, this is what you get when you use a list instead:
pd.Series(df['Date'].to_list(), index=df['Number'])
# => Number
# 40 2018-07-09
# 40 2018-07-10
# 40 2018-07-11
# 70 2018-07-12
# 30 2018-07-13
# dtype: datetime64[ns]

I was able to fix this by converting the Series to lists via df['Date'].tolist() and df['Number'].tolist(). calmap.calendarplot(events) accepted these in place of the original Series arguments.
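For reference, here is a rough sketch of the construction calmap generally expects - numeric values indexed by a DatetimeIndex - using the file and column names from the question; treat it as illustrative rather than a drop-in answer:
import pandas as pd
import calmap

df = pd.read_csv("data.csv", delimiter="\t")
df["Date"] = pd.to_datetime(df["Date"])

# calmap wants the numbers as the values and the dates as the index;
# passing a plain array as data avoids the reindexing behaviour described above
events = pd.Series(df["Number"].to_numpy(), index=df["Date"])
calmap.yearplot(events)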

Related

Add missing dates in pandas df, but date range has (valid) duplicates

I have a dataset that has multiple values received per second - up to 100 DFS (no more, but not consistently 100). The challenge is that the date field did not capture time more granularly than second, so multiple rows have the same hh:mm:ss timestamp. These are fine, but I also have several seconds missing across the set, i.e., not showing at all.
Therefore my 2 initial columns might look like this, where I am missing the 54 sec step:
2020-08-24 03:36:53, 5
2020-08-24 03:36:53, 8
2020-08-24 03:36:53, 6
2020-08-24 03:36:55, 8
Because of the legit date "duplicates" and the information I need from this, I don't want to aggregate but I do need to create the missing seconds, insert them and fill (NaN, etc) so I can then manage them appropriately for aligning with other datasets.
The only way I can seem to do this is with a loop of nested ifs which looks at the previous timestamp: if it is the same as the current cell (pt == ct), no action; if it is 1 second earlier (pt == ct-1), no action; but if it is earlier than the current cell by 2 or more seconds (pt <= ct-2), insert the missing seconds. This feels a bit cumbersome (though workable). Am I missing an easier way to do this?
I have checked a lot of "fill missing dates" threads on here as well as in various functions on pandas.pydata.org but reindexing and the most common date fills all seem to rely on dates not having duplicates. Any advice would be fantastic.
This can be solved by creating a pandas Series containing all the timepoints you want to consider and then merging it with the original dataframe.
For example:
start, end = df['date'].min(), df['date'].max()
# one row per second over the whole span, as a named Series so it can be merged on 'date'
all_timepoints = pd.date_range(start, end, freq='s').to_series(name='date')
df.merge(all_timepoints, on='date', how='outer', sort=True).fillna(0)
Will give:
date value
0 2020-08-24 03:36:53 5.0
1 2020-08-24 03:36:53 8.0
2 2020-08-24 03:36:53 6.0
3 2020-08-24 03:36:54 0.0
4 2020-08-24 03:36:55 8.0
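If you would rather keep the inserted rows as NaN for later handling (as the question mentions), a sketch of the same merge without the fill - note it also assigns the result back, which the one-liner above does not:
df = df.merge(all_timepoints, on='date', how='outer', sort=True)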

Split dataframe into many sub-dataframes based on timestamp

I have a large csv with the following format:
timestamp,name,age
2020-03-01 00:00:01,nick
2020-03-01 00:00:01,john
2020-03-01 00:00:02,nick
2020-03-01 00:00:02,john
2020-03-01 00:00:04,peter
2020-03-01 00:00:05,john
2020-03-01 00:00:10,nick
2020-03-01 00:00:12,john
2020-03-01 00:00:54,hank
2020-03-01 00:01:03,peter
I load csv into a dataframe with:
df = pd.read_csv("/home/test.csv")
and then I want to create multiple dataframes every 2 seconds. For example:
df1 contains:
2020-03-01 00:00:01,nick
2020-03-01 00:00:01,john
2020-03-01 00:00:02,nick
2020-03-01 00:00:02,john
df2 contains :
2020-03-01 00:00:04,peter
2020-03-01 00:00:05,john
and so on.
I managed to split the timestamps with the command below:
full_idx = pd.date_range(start=df['timestamp'].min(), end=df['timestamp'].max(), freq='0.2T')
but how can I store these split dataframes? How can I split a dataset based on timestamps into multiple dataframes?
Perhaps this question can help: Pandas: Timestamp index rounding to the nearest 5th minute
import numpy as np
import pandas as pd

df = pd.read_csv("test.csv")
df['timestamp'] = pd.to_datetime(df['timestamp'])

ns2sec = 2 * 1000000000  # 2 seconds in nanoseconds
# floor each timestamp to the nearest 2-second boundary
df['full_idx'] = pd.to_datetime((df['timestamp'].astype(np.int64) // ns2sec) * ns2sec)

# store a sub-dataframe for each unique value of the rounded index
store_array = []
for value in df['full_idx'].unique():
    store_array.append(df[df['full_idx'] == value][['timestamp', 'name', 'age']])
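Equivalently, the explicit loop can be replaced with a groupby on the rounded column; a short sketch that should give the same list of sub-dataframes:
store_array = [g[['timestamp', 'name', 'age']] for _, g in df.groupby('full_idx')]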
How about .resample()?
#first loading your data
>>> import pandas as pd
>>>
>>> df = pd.read_csv('dates.csv', index_col='timestamp', parse_dates=True)
>>> df.head()
name age
timestamp
2020-03-01 00:00:01 nick NaN
2020-03-01 00:00:01 john NaN
2020-03-01 00:00:02 nick NaN
2020-03-01 00:00:02 john NaN
2020-03-01 00:00:04 peter NaN
#resampling it at a frequency of 2 seconds
>>> resampled = df.resample('2s')
>>> type(resampled)
<class 'pandas.core.resample.DatetimeIndexResampler'>
#iterating over the resampler object and storing the sliced dfs in a dictionary
>>> df_dict = {}
>>> for i, (timestamp, df) in enumerate(resampled):
...     df_dict[i] = df
>>> df_dict[0]
name age
timestamp
2020-03-01 00:00:01 nick NaN
2020-03-01 00:00:01 john NaN
Now for some explanation...
resample() is great for rebinning DataFrames based on time (I use it often for downsampling time series data), but it can also be used simply to cut up the DataFrame, as you want to do. Iterating over the resampler object produced by df.resample() yields tuples of (name of the bin, df corresponding to that bin): e.g. the first tuple is (timestamp of the first bin, data corresponding to the first 2 seconds). So to get the DataFrames out, we can loop over this object and store them somewhere, like a dict.
Note that this will produce every 2-second interval from the start to the end of the data, so many will be empty given your data. But you can add a step to filter those out if needed.
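If you do want to drop those empty intervals, a minimal sketch (reusing the resampled object from above, keyed by the bin timestamp rather than an integer):
df_dict = {timestamp: group for timestamp, group in resampled if not group.empty}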
Additionally, you could manually assign each sliced DataFrame to its own variable, but this would be cumbersome (you would probably need to write a line for each 2-second bin, rather than a single small loop). With a dictionary, you can still look up each DataFrame by a meaningful key. You could also use an OrderedDict or list or whatever collection suits you.
A couple of points on your script:
1) setting freq to "0.2T" gives 12-second bins (0.2 × 60 = 12); use freq="2s" instead.
2) The example df1 and df2 are "out of phase": one is binned in 2-second windows starting on an odd second (1-2 seconds), while the other starts on an even second (4-5 seconds). So the date_range you mentioned wouldn't create those exact bins; it would create dfs covering either 0-1s, 2-3s, 4-5s, ... or 1-2s, 3-4s, 5-6s, ..., depending on which timestamp it started on.
For the latter point, you can use the base argument of .resample() to set the "phase" of the resampling. So in the case above, base=0 would start bins on even numbers, and base=1 would start bins on odds.
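A minimal sketch of that idea; note that in newer pandas base is deprecated in favour of the offset argument, which plays the same role:
resampled_shifted = df.resample('2s', offset='1s')  # bins start on odd seconds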
This is assuming you are okay with that type of binning - if you really want 1-2 seconds and 4-5 seconds to be in different bins, you would have to do something more complicated I believe.

Pandas join/fillna of two data frames replaces all values with 0 and not only NaN

The following code updates the number of items in stock based on the index. The table dr with the old stock holds >1000 values. The updated data frame grp1 contains the number of sold items. I would like to subtract data frame grp1 from data frame dr and update dr. Everything is fine until I join grp1 to dr with pandas' join and fillna. First of all, the datatypes change from int to float, and not only the NaN values but also the non-null values are replaced by 0. Is this a problem with non-matching indices?
I tried to make the dtypes uniform but this did not change anything. Removing fillna from the join returns NaN for all columns.
dr has the following format (example):
druck_pseudonym lager_nr menge_im_lager
80009359 62808 1
80009360 62809 10
80009095 62810 0
80009364 62811 11
80009365 62812 10
80008572 62814 10
80009072 62816 18
80009064 62817 13
80009061 62818 2
80008725 62819 3
80008940 62820 12
dr.dtypes
lager_nr int64
menge_im_lager int64
dtype: object
and grp1 (example):
LagerArtikelNummer1 ArtMengen1
880211066 1
80211070 1
80211072 2
80211073 2
80211082 2
80211087 4
80211091 1
80211107 2
88889272 1
88889396 1
ArtMengen1 int64
dtype: object
#update list with "nicht_erledigt"
dr_update = dr.join(grp1).fillna(0)
dr_update["menge_im_lager"] = dr_update["menge_im_lager"] - dr_update["ArtMengen1"]
This returns:
lager_nr menge_im_lager ArtMengen1
druck_pseudonym
80009185 44402 26.0 0.0
80009184 44403 2.0 0.0
80009182 44405 16.0 0.0
80008894 44406 32.0 0.0
80008115 44407 3.0 0.0
80008974 44409 16.0 0.0
80008380 44411 4.0 0.0
dr_update.dtypes
lager_nr int64
menge_im_lager float64
ArtMengen1 float64
dtype: object
Editing after comment, indices are object.
Your indices are string objects. You need to convert them to numeric, and assign the sorted frames back (sort_index returns a new object rather than sorting in place). Use
dr.index = pd.to_numeric(dr.index)
grp1.index = pd.to_numeric(grp1.index)
dr = dr.sort_index()
grp1 = grp1.sort_index()
Then try the rest...
You can filter the old stock dataframe dr to match the sold stock, then subtract, and assign the result back to the original filtered rows.
# Filter the old stock dataframe so that its index matches the sold dataframe,
# restrict to menge_im_lager, then subtract the sold stock
dr.loc[dr.index.isin(grp1.index), "menge_im_lager"] = (
    dr.loc[dr.index.isin(grp1.index), "menge_im_lager"] - grp1["ArtMengen1"]
)
If I understand correctly, you want the non-matching indices to remain in your final dataset and you want the final dataset to stay integer-typed. You can use an 'outer' join and cast with astype(int).
So, at the join you can do it this way:
dr.join(grp1,how='outer').fillna(0).astype(int)
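Putting the two answers together, a rough end-to-end sketch; the tiny sample frames below are made up from the question's examples, so treat the exact values as illustrative:
import pandas as pd

# small frames mimicking the question's layout; the indices are strings on purpose
dr = pd.DataFrame(
    {"lager_nr": [62808, 62809], "menge_im_lager": [1, 10]},
    index=pd.Index(["80009359", "80009360"], name="druck_pseudonym"),
)
grp1 = pd.DataFrame(
    {"ArtMengen1": [1]},
    index=pd.Index(["80009359"], name="druck_pseudonym"),
)

# make the index types comparable first; indices that never match are what
# turned every joined ArtMengen1 into NaN and then 0 after fillna
dr.index = pd.to_numeric(dr.index)
grp1.index = pd.to_numeric(grp1.index)

# outer join keeps unmatched rows, fillna(0) now only fills genuine gaps,
# and astype(int) restores the integer dtypes lost to NaN
dr_update = dr.join(grp1, how="outer").fillna(0).astype(int)
dr_update["menge_im_lager"] -= dr_update["ArtMengen1"]
print(dr_update)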

Plotting counts of a dataframe grouped by timestamp

So I have a pandas dataframe which has a large number of columns, and one of the columns is a timestamp in datetime format. Each row in the dataframe represents a single "event". What I'm trying to do is graph the frequency of these events over time. Basically a simple bar graph showing how many events per month.
Started with this code:
data.groupby([(data.Timestamp.dt.year),(data.Timestamp.dt.month)]).count().plot(kind = 'bar')
plt.show()
This "kind of" works. But there are 2 problems:
1) The graph comes with a legend which includes all the columns in the original data (like 30+ columns). And each bar on the graph has a tiny sub-bar for each of the columns (all of which are the same value since I'm just counting events).
2) There are some months where there are zero events. And these months don't show up on the graph at all.
I finally came up with code to get the graph looking the way I wanted. But it seems to me that I'm not doing this the "correct" way, since this must be a fairly common usecase.
Basically I created a new dataframe with one column "count" and an index that's a string representation of month/year. I populated that with zeroes over the time range I care about and then I copied over the data from the first frame into the new one. Here is the code:
import pandas as pd
import matplotlib.pyplot as plt

cnt = data.groupby([(data.Timestamp.dt.year), (data.Timestamp.dt.month)]).count()

index = []
for year in [2015, 2016, 2017, 2018]:
    for month in range(1, 13):
        index.append('%04d-%02d' % (year, month))

cnt_new = pd.DataFrame(index=index, columns=['count'])
cnt_new = cnt_new.fillna(0)
for i, row in cnt.iterrows():
    cnt_new.at['%04d-%02d' % i, 'count'] = row[0]

cnt_new.plot(kind='bar')
plt.show()
Anyone know an easier way to go about this?
EDIT --> Per request, here's an idea of the type of dataframe. It's the results from an SQL query. Actual data is my company's so...
Timestamp FirstName LastName HairColor \
0 2018-11-30 02:16:11 Fred Schwartz brown
1 2018-11-29 16:25:55 Sam Smith black
2 2018-11-19 21:12:29 Helen Hunt red
OK, so I think I got it. Thanks to Yuca for the resample command. I just need to run it on the Timestamp series (rather than on the whole dataframe) and it gives me exactly what I was looking for.
> data.index = data.Timestamp
> data.Timestamp.resample('M').count()
Timestamp
2017-11-30 0
2017-12-31 0
2018-01-31 1
2018-02-28 2
2018-03-31 7
2018-04-30 9
2018-05-31 2
2018-06-30 6
2018-07-31 5
2018-08-31 4
2018-09-30 1
2018-10-31 0
2018-11-30 5
So the OP's request is: "Basically a simple bar graph showing how many events per month".
Using pd.resample with a monthly frequency yields the desired result:
df[['FirstName']].resample('M').count()
Output:
FirstName
Timestamp
2018-11-30 3
To include non-observed months, we need to create a baseline calendar:
df_a = pd.DataFrame(index = pd.date_range(df.index[0].date(), periods=12, freq='M'))
and then assign to it the result of our resample
df_a['count'] = df[['FirstName']].resample('M').count()
Output:
count
2018-11-30 3.0
2018-12-31 NaN
2019-01-31 NaN
2019-02-28 NaN
2019-03-31 NaN
2019-04-30 NaN
2019-05-31 NaN
2019-06-30 NaN
2019-07-31 NaN
2019-08-31 NaN
2019-09-30 NaN
2019-10-31 NaN
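To get from there to the bar chart the question asked for, one presumably fills the unobserved months with zero and plots; a minimal sketch reusing df_a from above:
import matplotlib.pyplot as plt

df_a['count'] = df_a['count'].fillna(0)  # months with no events show as zero-height bars
df_a.plot(kind='bar')
plt.show()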

Is this a Pandas bug with notnull() or a fundamental misunderstanding on my part (probably misunderstanding)

I have a pandas dataframe with two columns and default indexing. The first column is a string and the second is a date. The top date is NaN (though it should be NaT really).
index somestr date
0 ON NaN
1 1C 2014-06-11 00:00:00
2 2C 2014-07-09 00:00:00
3 3C 2014-08-13 00:00:00
4 4C 2014-09-10 00:00:00
5 5C 2014-10-08 00:00:00
6 6C 2014-11-12 00:00:00
7 7C 2014-12-10 00:00:00
8 8C 2015-01-14 00:00:00
9 9C 2015-02-11 00:00:00
10 10C 2015-03-11 00:00:00
11 11C 2015-04-08 00:00:00
12 12C 2015-05-13 00:00:00
Call this dataframe df.
When I run:
df[pd.notnull(df['date'])]
I expect the first row to go away. It doesn't.
If I remove the column with string by setting:
df=df[['date']]
Then apply:
df[pd.notnull(df['date'])]
then the first row with the null does go away.
Also, the row with the null always goes away if all columns are number/date types. When a column with a string appears, this problem occurs.
Surely this is a bug, right? I am not sure if others will be able to replicate this.
This was on my Enthought Canopy for Windows (I am not smart enough for UNIX/Linux command line noise)
Per requests below from Jeff and unutbu:
@unutbu -
df.dtypes
somestr object
date object
dtype: object
Also:
type(df.iloc[0]['date'])
pandas.tslib.NaTType
In the code this column was specifically assigned as pd.NaT
I also do not understand why it says NaN when it should say NaT. The filtering I used worked fine when I used this toy frame:
df=pd.DataFrame({'somestr' : ['aa', 'bb'], 'date' : [pd.NaT, dt.datetime(2014,4,15)]}, columns=['somestr', 'date'])
It should also be noted that although the table above shows NaN in the output, the following outputs NaT:
df['date'][0]
NaT
Also:
pd.notnull(df['date'][0])
False
pd.notnull(df['date'][1])
True
but....when evaluating the array, they all came back True - bizarre...
np.all(pd.notnull(df['date']))
True
@Jeff - this is 0.12. I am stuck with this. The frame was created by concatenating two different frames that were grabbed from database queries using psql. The date and some other float columns were then added by calculations I did. Of course, I filtered down to the two relevant columns here until I pinpointed that the string-valued columns were causing the problem.
************ How to Replicate **********
import pandas as pd
import datetime as dt
print(pd.__version__)
# 0.12.0
df = pd.DataFrame({'somestr': ['aa', 'bb'], 'date': ['cc', 'dd']},
                  columns=['somestr', 'date'])
df['date'].iloc[0] = pd.NaT
df['date'].iloc[1] = pd.to_datetime(dt.datetime(2014, 4, 15))
print(df[pd.notnull(df['date'])])
# somestr date
# 0 aa NaN
# 1 bb 2014-04-15 00:00:00
df2 = df[['date']]
print(df2[pd.notnull(df2['date'])])
# date
# 1 2014-04-15 00:00:00
So, this dataframe originally had all string entries - then the date column was converted to dates with an NaT at the top - note that in the table it is NaN, but when using df.iloc[0]['date'] you do see the NaT. Using the snippet above, you can see that the filtering by not null is bizarre with and without the somestr column. Again - this is Enthought Canopy for Windows with Pandas 0.12 and NumPy 1.8.
I encountered this problem also. Here's how I fixed it. isnull() checks whether something is NaN/missing, and the "~" (tilde) operator negates the following expression. So we are saying: give me a dataframe from the original dataframe, but only the rows where 'date' is NOT null.
df = df[~df['date'].isnull()]
Hope this helps!
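For what it's worth, a minimal sketch of another common fix, assuming a reasonably recent pandas: convert the object column to a real datetime64 column so the missing value is stored as a native NaT, which notnull() then filters as expected:
import pandas as pd
import datetime as dt

df = pd.DataFrame({'somestr': ['aa', 'bb'], 'date': ['cc', 'dd']},
                  columns=['somestr', 'date'])
df.loc[0, 'date'] = pd.NaT
df.loc[1, 'date'] = dt.datetime(2014, 4, 15)

# the 'date' column is object dtype here; coerce it to datetime64[ns]
df['date'] = pd.to_datetime(df['date'], errors='coerce')
print(df[df['date'].notnull()])
#   somestr       date
# 1      bb 2014-04-15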
