Using missingno but got incorrect result - python

I have a dataframe of air pollution data with several gaps of missing values, like this:
Date          AMB_TEMP  CO   PM10  ...  PM2.5
2010-01-01 0  8         10         ...  15
2010-01-01 1  10        15         ...  20
...
2010-01-02 0  5         ...
2010-01-02 1  ...       20
...
2010-01-03 1  4         13         ...  34
To be specific, here's the data link: shorturl.at/blBN1
Take the first column, ambient temperature (AMB_TEMP), for instance. Its missing-data statistics are:
Length of time series: 87648
Number of Missing Values: 746
Percentage of Missing Values: 0.85 %
Number of Gaps: 136
Average Gap Size: 5.485294
Longest NA gap (series of consecutive NAs): 32
Most frequent gap size (series of consecutive NAs): 1 (occurring 50 times)
I want to plot an overview of the missing values, and what I've done is:
import missingno as msno
missing_plot = msno.matrix(df, freq='Y')
and got a figure like this:
Obviously the first column (AMB_TEMP) is not consistent with the real data: it shows only three horizontal lines, but there should be at least 136.
Update: Thanks to Patrick, I also tried plotting only one column, and nothing improved.
Is there an error in the code, or something else?


Pandas month sequence is not getting sorted

My SQL table:
SDATETIME FE014BPV FE011BPV
0 2022-05-28 5.770000 13.735000
1 2022-05-30 16.469999 42.263000
2 2022-05-31 56.480000 133.871994
3 2022-06-01 49.779999 133.561996
4 2022-06-02 45.450001 132.679001
.. ... ... ...
93 2022-09-08 0.000000 0.050000
94 2022-09-09 0.000000 0.058000
95 2022-09-10 0.000000 0.051000
96 2022-09-11 0.000000 0.050000
97 2022-09-12 0.000000 0.038000
My code:
import pandas as pd
import pyodbc
monthSQL = pd.read_sql_query('SELECT SDATETIME,max(FE014BPV) as flow,max(FE011BPV) as steam FROM [SCADA].[dbo].[TOTALIZER] GROUP BY SDATETIME ORDER BY SDATETIME ASC', conn)
monthdata = monthSQL.groupby(monthSQL['SDATETIME'].dt.strftime("%b-%Y"), sort=True).sum()
print(monthdata)
Produces this incorrect output
flow steam
SDATETIME
Aug-2022 1800.970001 2580.276996
Jul-2022 1994.300014 2710.619986
Jun-2022 3682.329998 7633.660018
May-2022 1215.950003 3098.273025
Sep-2022 0.000000 1.705000
I want output something like below:
SDATETIME flow steam
May-2022 1215.950003 3098.273025
Jun-2022 3682.329998 7633.660018
Jul-2022 1994.300014 2710.619986
Aug-2022 1800.970001 2580.276996
Sep-2022 0.000000 1.705000
Also, I need a sum of the last 12 months of data.
The output is correct, just not in the order you expect. Try this:
# This keeps SDATETIME as datetime, not string
monthdata = monthSQL.groupby(pd.Grouper(key="SDATETIME", freq="MS")).sum()

# Rolling sum of the last 12 months
monthdata = pd.concat(
    [
        monthdata,
        monthdata.add_suffix("_LAST12").rolling("366D").sum(),
    ],
    axis=1,
)

# Keep SDATETIME as datetime for as long as you need to manipulate the
# dataframe in Python. When you need to export it, convert it to string
monthdata.index = monthdata.index.strftime("%b-%Y")
About the rolling(...) operation: it's easy to think that rolling(12) gives you the rolling sum of the last 12 months, given that each row represents a month. In fact, it returns the rolling sum of the last 12 rows. This is important because, if there are gaps in your data, 12 rows may cover more than 12 months. rolling("366D") makes sure that it only counts rows within the last 366 days, which is the maximum length of any 12-month period.
We can't use rolling("12M") because months do not have a fixed duration: there are between 28 and 31 days in a month.
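Here is a minimal sketch of that difference, using made-up monthly data with a deliberate gap (March is missing):
import pandas as pd

# Monthly data with a gap: March is missing entirely
idx = pd.to_datetime(["2022-01-01", "2022-02-01", "2022-04-01"])
s = pd.Series([1.0, 1.0, 1.0], index=idx)

# Row-based window: always the last 3 rows, even though they span 4 months
print(s.rolling(3).sum())      # NaN, NaN, 3.0

# Time-based window: only rows within the last 75 days
print(s.rolling("75D").sum())  # 1.0, 2.0, 2.0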
You are sorting the formatted date names in alphabetical order - you need to sort by the actual month order instead. You can see this from the starting letters of the dates:
SDATETIME
Aug-2022 # A goes before J, M, S in the alphabet
Jul-2022 # J goes after A, but before M and S in the alphabet
Jun-2022 # J goes after A, but before M and S in the alphabet
May-2022 # M goes after A, J but before S in the alphabet
Sep-2022 # S goes after A, J, M in the alphabet
To sort by real month order, you can build a dictionary and sort with a key function that uses .apply():
month_dict = {'Jan-2022': 1, 'Feb-2022': 2, 'Mar-2022': 3, 'Apr-2022': 4,
              'May-2022': 5, 'Jun-2022': 6, 'Jul-2022': 7, 'Aug-2022': 8,
              'Sep-2022': 9, 'Oct-2022': 10, 'Nov-2022': 11, 'Dec-2022': 12}
df = df.sort_values('SDATETIME', key=lambda col: col.apply(lambda m: month_dict[m]))
print(df)
SDATETIME flow steam
May-2022 1215.950003 3098.273025
Jun-2022 3682.329998 7633.660018
Jul-2022 1994.300014 2710.619986
Aug-2022 1800.970001 2580.276996
Sep-2022 0.000000 1.705000
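A simpler alternative (a sketch, assuming the month strings always follow the %b-%Y pattern used above) is to let pandas parse them back into datetimes inside the sort key, which avoids hard-coding a dictionary for every year:
df = df.sort_values('SDATETIME', key=lambda s: pd.to_datetime(s, format='%b-%Y'))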

How do you add the value for a certain column from a previous row to your current row in Python Pandas? [duplicate]

In Python, how can I reference the previous row and calculate something against it? Specifically, I am working with dataframes in pandas - I have a dataframe full of stock price information that looks like this:
Date Close Adj Close
251 2011-01-03 147.48 143.25
250 2011-01-04 147.64 143.41
249 2011-01-05 147.05 142.83
248 2011-01-06 148.66 144.40
247 2011-01-07 147.93 143.69
Here is how I created this dataframe:
import pandas
url = 'http://ichart.finance.yahoo.com/table.csv?s=IBM&a=00&b=1&c=2011&d=11&e=31&f=2011&g=d&ignore=.csv'
data = pandas.read_csv(url)
## now I sorted the data frame ascending by date
data = data.sort_values('Date')
Starting with row number 2, or in this case, I guess it's 250 (PS - is that the index?), I want to calculate the difference between 2011-01-03 and 2011-01-04, for every entry in this dataframe. I believe the appropriate way is to write a function that takes the current row, figures out the previous row, and calculates the difference between them, then use the pandas apply function to update the dataframe with the value.
Is that the right approach? If so, should I be using the index to determine the difference? (note - I'm still in python beginner mode, so index may not be the right term, nor even the correct way to implement this)
I think you want to do something like this:
In [26]: data
Out[26]:
Date Close Adj Close
251 2011-01-03 147.48 143.25
250 2011-01-04 147.64 143.41
249 2011-01-05 147.05 142.83
248 2011-01-06 148.66 144.40
247 2011-01-07 147.93 143.69
In [27]: data.set_index('Date').diff()
Out[27]:
Close Adj Close
Date
2011-01-03 NaN NaN
2011-01-04 0.16 0.16
2011-01-05 -0.59 -0.58
2011-01-06 1.61 1.57
2011-01-07 -0.73 -0.71
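For reference, diff() is shorthand for subtracting a shift()-ed copy of the column; shift() is the general way to reference the previous row if you need a calculation other than a plain difference:
# Equivalent to data['Close'].diff(): current row minus previous row
data['Close'] - data['Close'].shift(1)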
To calculate the difference of a single column, here is what you can do.
df =
    A   B
0  10  56
1  45  48
2  26  48
3  32  65
We want to compute the row difference in A only, and keep only the rows where that difference is less than 15.
df['A_dif'] = df['A'].diff()
df =
    A   B  A_dif
0  10  56    NaN
1  45  48   35.0
2  26  48  -19.0
3  32  65    6.0
df = df[df['A_dif'] < 15]
df =
    A   B  A_dif
2  26  48  -19.0
3  32  65    6.0
I don't know pandas, and I'm pretty sure it has something specific for this; however, I'll give you the pure-Python (Python 2) solution, which might be of some help even if you need to use pandas:
import csv
import urllib

# This basically retrieves the CSV file and loads it into a list, converting
# all numeric values to floats
url = 'http://ichart.finance.yahoo.com/table.csv?s=IBM&a=00&b=1&c=2011&d=11&e=31&f=2011&g=d&ignore=.csv'
reader = csv.reader(urllib.urlopen(url), delimiter=',')

# Sort the output list so the records are ordered by date
cleaned = sorted([[r[0]] + map(float, r[1:]) for r in list(reader)[1:]])

for i, row in enumerate(cleaned):  # enumerate() yields two-tuples: (<id>, <item>)
    if i == 0:
        # The first row has no previous row to compare against
        continue
    # Calculate the difference of each numeric field with the same field
    # in the previous row
    print row[0], [(row[j] - cleaned[i-1][j]) for j in range(1, 7)]

Bokeh Pandas plot using SQLite data

I am using an SQLite database with Pandas and want to display the dynamic data using Bokeh (varea_stack)
My dynamic data (df) structure looks like this:
id date site numberOfSessions ... avgSessionDuration uniqueDimensionCombinations events pageViews
0 1 2020-07-29 177777770 3 ... 11.00 2 4 3
1 2 2020-07-29 178888883 1 ... 11.00 1 4 3
2 3 2020-07-29 177777770 1 ... 11.00 1 4 3
3 4 2020-07-29 173333333 2 ... 260.50 2 23 10
4 5 2020-07-29 178888883 2 ... 260.50 2 23 10
5 6 2020-07-29 173333333 2 ... 260.50 2 23 10
6 7 2020-07-29 178888883 12 ... 103.75 12 143 36
7 8 2020-07-30 178376403 12 ... 103.75 12 143 36
8 9 2020-07-30 178376403 12 ... 103.75 12 143 36
9 10 2020-07-28 178376403 12 ... 103.75 12 143 36
I would like to create a varea_stack plot where the:
x-axis -> "date"
y-axis -> "numberOfSessions" stacked according to "site"
(I am thinking maybe using some sort of Pivot Table?)
this is what I have:
from bokeh.plotting import figure, output_file, show
from bokeh.embed import components
from bokeh.models import HoverTool
plot = figure()
plot.varea_stack(df.site.unique().tolist(), x=df.index.values.tolist(), source=df)
script, div = components(plot)
the Error I get:
Keyword argument sequences for broadcasting must be the same length as stackers
I have been searching online (https://docs.bokeh.org/en/latest/docs/reference/plotting.html#bokeh.plotting.figure.Figure.varea_stack) and through Stackoverflow. I can't seem to find an answer.
I can't really speak to the Pandas operations needed, but this is the general format the data needs to be in for varea_stack:
sites = [<list of sites>]

data = {
    'date' : <all the datetime values>,
    <site1> : <site1 values for every date>,
    <site2> : <site2 values for every date>,
    <site3> : <site3 values for every date>,
    ...
}

plot.varea_stack(sites, x='date', source=data)
Note that to be usable by varea_stack the following must be true:
every item in the sites list has to be a column in the data
every sites column has to be the same length (a value for every date)
Note that the above also assumes the dates are converted to real datetime values. If your dates are categoricals (i.e. strings, not real datetimes on a continuous datetime axis), then you will also need to pass the list of date strings to the x_range of figure (as with any categorical axis).
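On the pandas side, a pivot table is one way to get the dataframe from the question into that shape (a sketch, assuming the column names shown in the question; the site labels are converted to strings because the stacker names must be string column names):
# One row per date, one column per site, summing sessions per (date, site)
wide = df.pivot_table(index='date', columns='site',
                      values='numberOfSessions', aggfunc='sum').fillna(0)
wide.columns = wide.columns.astype(str)  # stacker names must be strings
wide = wide.reset_index()

sites = [c for c in wide.columns if c != 'date']
plot.varea_stack(sites, x='date', source=wide)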

Plotting counts of a dataframe grouped by timestamp

So I have a pandas dataframe which has a large number of columns, and one of the columns is a timestamp in datetime format. Each row in the dataframe represents a single "event". What I'm trying to do is graph the frequency of these events over time. Basically a simple bar graph showing how many events per month.
Started with this code:
data.groupby([(data.Timestamp.dt.year),(data.Timestamp.dt.month)]).count().plot(kind = 'bar')
plt.show()
This "kind of" works. But there are 2 problems:
1) The graph comes with a legend which includes all the columns in the original data (like 30+ columns). And each bar on the graph has a tiny sub-bar for each of the columns (all of which are the same value since I'm just counting events).
2) There are some months where there are zero events. And these months don't show up on the graph at all.
I finally came up with code to get the graph looking the way I wanted. But it seems to me that I'm not doing this the "correct" way, since this must be a fairly common use case.
Basically I created a new dataframe with one column "count" and an index that's a string representation of month/year. I populated that with zeroes over the time range I care about and then I copied over the data from the first frame into the new one. Here is the code:
import pandas as pd
import matplotlib.pyplot as plt

cnt = data.groupby([(data.Timestamp.dt.year), (data.Timestamp.dt.month)]).count()

index = []
for year in [2015, 2016, 2017, 2018]:
    for month in range(1, 13):
        index.append('%04d-%02d' % (year, month))

cnt_new = pd.DataFrame(index=index, columns=['count'])
cnt_new = cnt_new.fillna(0)

for i, row in cnt.iterrows():
    cnt_new.at['%04d-%02d' % i, 'count'] = row[0]

cnt_new.plot(kind='bar')
plt.show()
Anyone know an easier way to go about this?
EDIT --> Per request, here's an idea of the type of dataframe. It's the results from an SQL query. Actual data is my company's so...
Timestamp FirstName LastName HairColor \
0 2018-11-30 02:16:11 Fred Schwartz brown
1 2018-11-29 16:25:55 Sam Smith black
2 2018-11-19 21:12:29 Helen Hunt red
OK, so I think I got it. Thanks to Yuca for resample command. I just need to run that on the Timestamp data series (rather than on the whole dataframe) and it gives me exactly what I was looking for.
data.index = data.Timestamp
data.Timestamp.resample('M').count()
Timestamp
2017-11-30 0
2017-12-31 0
2018-01-31 1
2018-02-28 2
2018-03-31 7
2018-04-30 9
2018-05-31 2
2018-06-30 6
2018-07-31 5
2018-08-31 4
2018-09-30 1
2018-10-31 0
2018-11-30 5
So the OP's request is: "Basically a simple bar graph showing how many events per month".
Using resample with a monthly frequency yields the desired result:
df[['FirstName']].resample('M').count()
Output:
FirstName
Timestamp
2018-11-30 3
To include non-observed months, we need to create a baseline calendar:
df_a = pd.DataFrame(index = pd.date_range(df.index[0].date(), periods=12, freq='M'))
and then assign to it the result of our resample
df_a['count'] = df[['FirstName']].resample('M').count()
Output:
count
2018-11-30 3.0
2018-12-31 NaN
2019-01-31 NaN
2019-02-28 NaN
2019-03-31 NaN
2019-04-30 NaN
2019-05-31 NaN
2019-06-30 NaN
2019-07-31 NaN
2019-08-31 NaN
2019-09-30 NaN
2019-10-31 NaN
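Putting those two steps together (a sketch, reusing the Timestamp index and the imports from the question), you can fill the unobserved months with zero and plot in one go:
# Monthly event counts over a fixed calendar, with zeros for empty months
calendar = pd.date_range('2015-01-31', '2018-12-31', freq='M')
counts = data.Timestamp.resample('M').count().reindex(calendar, fill_value=0)
counts.plot(kind='bar')
plt.show()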

Need to fill NA values with the mean of the past three values in python

I need to fill the NA values with the mean of the three values before each NA.
This is my dataset:
RECEIPT_MONTH_YEAR NET_SALES
0 2014-01-01 818817.20
1 2014-02-01 362377.20
2 2014-03-01 374644.60
3 2014-04-01 NA
4 2014-05-01 NA
5 2014-06-01 NA
6 2014-07-01 NA
7 2014-08-01 46382.50
8 2014-09-01 55933.70
9 2014-10-01 292303.40
10 2014-10-01 382928.60
Is this dataset a .csv file or a dataframe? Is this NA a 'NaN' or a string?
import pandas as pd
import numpy as np

df = pd.read_csv('your dataset', sep=' ')
df = df.replace('NA', np.nan)  # replace() is not in-place, so assign the result
df.fillna(method='ffill', inplace=True)
You mention something about the mean of 3 values. The above simply forward-fills the last observation before the NaNs begin, which is often a good approach for forecasting (better than taking means in certain cases, if persistence is important).
ind = df['NET_SALES'].index[df['NET_SALES'].apply(np.isnan)]
Meanof3 = df['NET_SALES'].iloc[ind[0]-3:ind[0]].mean(skipna=True)
df['NET_SALES'] = df['NET_SALES'].fillna(Meanof3)
Maybe the answer can be generalised and improved if more is known about the dataset - for example, whether you always want to take the mean of the last 3 measurements before any NA. The above lets you check which indices are NaNs and then take the mean of the 3 values before them, while ignoring any NaNs.
This is simple, but it works:
df_data.fillna(0, inplace=True)
for i in range(0, len(df_data)):
    if df_data['NET_SALES'][i] == 0.00:
        condtn = (df_data['NET_SALES'][i-1] + df_data['NET_SALES'][i-2]
                  + df_data['NET_SALES'][i-3])
        df_data.loc[i, 'NET_SALES'] = condtn / 3
You could use fillna (assuming that your NA is already np.nan) and rolling mean:
import pandas as pd
import numpy as np
df = pd.DataFrame([818817.2,362377.2,374644.6,np.nan,np.nan,np.nan,np.nan,46382.5,55933.7,292303.4,382928.6], columns=["NET_SALES"])
df["NET_SALES"] = df["NET_SALES"].fillna(df["NET_SALES"].shift(1).rolling(3, min_periods=1).mean())
Out:
NET_SALES
0 818817.2
1 362377.2
2 374644.6
3 518613.0
4 368510.9
5 374644.6
6 NaN
7 46382.5
8 55933.7
9 292303.4
10 382928.6
If you want to include the imputed values I guess you'll need to use a loop.
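Such a loop might look like this (a sketch): walk the values in order and fill each NaN with the mean of the three preceding values, so later gaps can use earlier imputations:
vals = df["NET_SALES"].to_list()
for i, v in enumerate(vals):
    if pd.isna(v) and i >= 3:
        # Mean of the three previous values, imputed ones included
        vals[i] = sum(vals[i-3:i]) / 3
df["NET_SALES"] = vals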
