Generalized Data Quality Checks on Datasets - python

I am pulling in a handful of different datasets daily, performing a few simple data quality checks, and then shooting off emails if a dataset fails the checks.
My checks are as simple as checking for duplicates in the dataset, and checking that the number of rows and columns in a dataset hasn't changed -- see below.
assert df.shape == (1016545, 8)
assert len(df) - len(df.drop_duplicates()) == 0
Since these datasets are updated daily and may change the number of rows, is there a better way to check instead of hardcoding the specific number?
For instance, one dataset might have only 400 rows, and another might have 2 million.
Could I instead check that today's count is within, say, one standard deviation of the row counts from previous days? But in that case I would need to start collecting previous days' counts in a separate table, and that could get ugly.
Right now, for tables that change daily, I'm doing the following rudimentary check:
assert df.shape[0] <= 1016545 + 100
assert df.shape[0] >= 1016545 - 100
But obviously this is not sustainable.
Any suggestions are much appreciated.

Yes, you would need to store some previous information, but since you don't seem to need perfect statistical accuracy, I think you can cheat a little. If you keep the average record count from previous samples, the deviation you previously calculated, and the number of samples you have taken, you can get reasonably close to what you are looking for by taking the weighted average of the previous deviation and the current deviation.
For example:
Suppose the average count has been 1016545 with a deviation of 85, captured over 10 samples, and today's count is 1016612. The difference from the mean is 1016612 - 1016545 = 67, so the weighted average of the previous deviation and the current deviation is (85*10 + 67)/11 ≈ 83.
This way you only store a handful of variables for each dataset instead of every historical record count, though it also means the result is not a true standard deviation.
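The bookkeeping described above can be sketched as a small helper. The function name, the k threshold, and how the (avg, dev, n) state is persisted between daily runs are all assumptions on my part:

```python
# Sketch of the running-deviation check described above; names and the
# k-times-deviation threshold are hypothetical choices, not the OP's.

def check_row_count(count, avg, dev, n, k=3.0):
    """Return (passed, new_avg, new_dev, new_n)."""
    diff = abs(count - avg)
    # flag the dataset when today's count deviates from the running
    # average by more than k times the running deviation
    passed = diff <= k * max(dev, 1.0)  # max() avoids a zero threshold
    # weighted-average updates, exactly as in the example above
    new_dev = (dev * n + diff) / (n + 1)
    new_avg = (avg * n + count) / (n + 1)
    return passed, new_avg, new_dev, n + 1

ok, avg, dev, n = check_row_count(1016612, avg=1016545, dev=85, n=10)
```

On the example numbers this passes and updates the deviation to roughly 83, matching the hand calculation.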
As for storage, you could keep the data in a database, a JSON file, or any number of other places -- I won't go into detail since it's not clear what environment you are working in or what resources you have available.
Hope that helps!


Is the runtime of df.groupby().transform() linear in the number of groups in the groupby object?

BACKGROUND
I am calculating racial segregation statistics between and within firms using the Theil Index. The data structure is a multi-indexed pandas dataframe. The calculation involves a lot of df.groupby()['foo'].transform(), where the transformation is the entropy function from scipy.stats. I have to calculate entropy on smaller and smaller groups within this structure, which means calling entropy more and more times on the groupby objects. I get the impression that this is O(n), but I wonder whether there is an optimization that I am missing.
EXAMPLE
The key part of this dataframe comprises five variables: county, firm, race, occ, and size. The units of observation are counts: each row tells you the SIZE of the workforce of a given RACE in a given OCCupation in a FIRM in a specific COUNTY. Hence the multiindex:
df = df.set_index(['county', 'firm', 'occ', 'race']).sort_index()
The Theil Index is the size-weighted sum of sub-units' entropy deviations from the unit's entropy. To calculate segregation between counties, for example, you can do this:
from scipy.stats import entropy
from numpy import where
# Helper to calculate the actual components of the Theil statistic
def Hcmp(w_j, w, e_j, e):
    return where(e == 0, 0, (w_j / w) * ((e - e_j) / e))
df['size_county'] = df.groupby(['county', 'race'])['size'].transform('sum')
df['size_total'] = df['size'].sum()
# Create a dataframe with observations aggregated over county/race tuples
counties = df.groupby(['county', 'race'])[['size_county', 'size_total']].first()
counties['entropy_county'] = counties.groupby('county')['size_county'].transform(entropy, base=4) # <--
# The base for entropy is 4 because there are four recorded racial categories.
# Assume that counties['entropy_total'] has already been calculated.
counties['seg_cmpnt'] = Hcmp(counties['size_county'], counties['size_total'],
                             counties['entropy_county'], counties['entropy_total'])
county_segregation = counties['seg_cmpnt'].sum()
Focus on this line:
counties['entropy_county'] = counties.groupby('county')['size_county'].transform(entropy, base=4)
The starting dataframe has 3,130,416 rows. When grouped by county, though, the resulting groupby object has just 2,267 groups. This runs quickly enough. When I calculate segregation within counties and between firms, the corresponding line is this:
firms['entropy_firm'] = firms.groupby('firm')['size_firm'].transform(entropy, base=4)
Here, the groupby object has 86,956 groups (the count of firms in the data). This takes about 40 times as long as the county-level line, which looks suspiciously like O(n) in the number of groups. And when I try to calculate segregation within firms, between occupations...
# Grouping by firm and occupation because occupations are not nested within firms
occs['entropy_occ'] = occs.groupby(['firm', 'occ'])['size_occ'].transform(entropy, base=4)
...There are 782,604 groups. Eagle-eyed viewers will notice that this is exactly 1/4th the size of the raw dataset, because I have one observation for each firm/race/occupation tuple, and four racial categories. It is also nine times the number of groups in the by-firm groupby object, because the data break employment out into nine occupational categories.
This calculation takes about nine times as long: four or five minutes. When the underlying research project involves 40-50 years of data, this part of the process can take three or four hours.
THE PROBLEM, RESTATED
I think the issue is that, even though scipy.stats.entropy() is being applied in a smart, vectorized way, the necessity of calculating it over a very large number of small groups--and thus calling it many, many times--is swamping the performance benefits of vectorized calculations.
I could pre-calculate the necessary logarithms that entropy requires, for example with numpy.log(). If I did that, though, I'd still have to group the data to first get each firm/occupation/race's share within the firm/occupation. I would also lose any advantage of readable code that looks similar at different levels of analysis.
Thus my question, stated as clearly as I can: is there a more computationally efficient way to call something like this entropy function, when calculating it over a very large number of relatively small groups in a large dataset?
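One direction worth sketching (my own suggestion, not from the post): since entropy is just -sum(p*log(p))/log(base), it can be computed with vectorized column arithmetic plus a single groupby sum, avoiding a Python-level scipy.stats.entropy call per group. The toy firm/size data here is illustrative:

```python
# Hedged sketch: per-group entropy without calling entropy() per group.
# Compute each row's share within its group, then -p*log(p)/log(4) as a
# plain column, then one groupby transform('sum'). Column names are
# illustrative, not the original dataframe's.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    'firm': np.repeat(np.arange(1000), 4),          # 4 race rows per firm
    'size': rng.integers(1, 100, size=4000).astype(float),
})

totals = df.groupby('firm')['size'].transform('sum')
p = df['size'] / totals                             # within-firm shares
df['plogp'] = -p * np.log(p) / np.log(4)            # base-4 entropy terms
df['entropy_firm'] = df.groupby('firm')['plogp'].transform('sum')
```

Because transform('sum') dispatches to a fast cythonized path, the cost no longer scales with a Python function call per group.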

Removing points which deviate too much from adjacent point in Pandas

So I'm doing some time series analysis in Pandas and have a peculiar pattern of outliers which I'd like to remove. The plot below is based on a dataframe whose first column is a date and whose second column is the data.
As you can see, those points of similar values, interspersed so that they look like lines, are likely instrument quirks and should be removed. I've tried rolling mean, rolling median, and removal based on standard deviation, to no avail. For an idea of density, it's daily measurements from 1984 to the present. Any ideas?
gauge = pd.read_csv('GaugeData.csv', parse_dates=[0], header=None)
gauge.columns = ['Date', 'Gauge']
gauge = gauge.set_index(['Date'])
gauge['1990':'1995'].plot(style='*')
And the result of applying rolling median
gauge = pd.rolling_median(gauge, 5, center=True)
gauge['1990':'1995'].plot(style='*')
After rolling median
You can demand that each data point has at least "N" "nearby" data points within a certain distance "D".
N can be 2 or more.
"Nearby" for element gauge[i] could mean the pair gauge[i-1] and gauge[i+1], but since some points only have neighbors on one side, you can instead require at least two elements whose index (date) distance is less than 2. So at least 2 of {gauge[i-2], gauge[i-1], gauge[i+1], gauge[i+2]} should satisfy: Distance(gauge[i], gauge[ix]) < D
D - you can decide this based on how close you expect those real data points to be.
It won't be perfect, but it should get most of the noise out of the dataset.
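A rough sketch of this rule; the window size, the required neighbor count N, and the distance D are placeholders to tune against the real gauge data:

```python
# Keep a point only if at least n_required of its neighbors within
# `window` index positions lie within value distance max_dist.
# All parameter values here are assumptions to tune on the real data.
import numpy as np
import pandas as pd

def filter_outliers(series, window=2, n_required=2, max_dist=5.0):
    """Return the series with isolated points removed."""
    values = series.to_numpy(dtype=float)
    keep = np.zeros(len(values), dtype=bool)
    for i in range(len(values)):
        lo = max(0, i - window)
        hi = min(len(values), i + window + 1)
        # neighbors on both sides, excluding the point itself
        neighbors = np.concatenate([values[lo:i], values[i + 1:hi]])
        close = np.abs(neighbors - values[i]) < max_dist
        keep[i] = close.sum() >= n_required
    return series[keep]

s = pd.Series([10.0, 11.0, 10.5, 50.0, 10.8, 11.2, 10.9])
cleaned = filter_outliers(s)   # drops the isolated 50.0 spike
```

For 40 years of daily data the loop is still only ~15,000 iterations, so plain Python is fast enough here.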

Python Pandas optimization algorithm: compare datetime and retrieve datas

This post is quite long, and I will be very grateful to everybody who reads it to the end. :)
I am running into Python performance issues and would like to know if you have a better way of doing what I want.
Let me explain my problem briefly. I have plenty of solar panel measurements, one every 3 minutes per panel. Unfortunately, some measurements can fail. The goal is to compare the measurement times in order to keep only the values that were measured in the same minutes, and then retrieve them. A GUI is also included in my software, so each time the user changes the panels to compare, the calculation has to be done again. To do this I implemented two parts: the first creates a vector of True or False for each panel for each minute, and the second compares those vectors and keeps only the common measurements.
All the data is contained in the pandas DataFrame energiesDatas. The relevant columns are:
name: contains the name of the panel (length 1)
date: contains the day of the measurement (length 1)
list_time: contains a list of all time of measurement of a day (length N)
list_energy_prod: contains the corresponding measurements (length N)
The first part loops over all possible minutes from the beginning to the end of the measurements. If a measurement was made at that minute, it stores True, otherwise False.
self.ListCompare2 = pd.DataFrame()
for n in self.NameList:  # loop over all my solar panels
    m = self.energiesDatas[self.energiesDatas['Name'] == n]  # all data
    # table_date contains all the possible dates from the 1st measure, with an interval of 1 min.
    table_list = [1 for i in range(len(table_date))]
    pointerDate = 0  # pointer to the current value of time
    # all the measures of a given day are transformed into a str of hour-minutes
    DateString = [b.strftime('%H-%M') for b in m['list_time'].iloc[pointerDate]]
    changeDate = 0
    count = 0
    # store the currently pointed date
    m_date = m['Date'].iloc[pointerDate]
    # for all possible times
    for curr_date in table_date:
        # if the considered date is bigger, move the pointer to the next day
        while curr_date.date() > m_date:
            pointerDate += 1
            changeDate = 1
            m_date = m['Date'].iloc[pointerDate]
        # if the day changed, recalculate the measures for the new day
        if changeDate:
            DateString = [b.strftime('%H-%M') for b in m['list_time'].iloc[pointerDate]]
            changeDate = 0
        # check if a measure was done at the considered time
        table_list[count] = curr_date.strftime('%H-%M') in DateString
        count += 1
    # add to a dataframe
    self.ListCompare2[n] = table_list
l2 = self.ListCompare2
The second part is the following: given a "ListOfName" of modules to compare, check whether they were measured at the same time, and keep only the values measured in the same minute.
ListToKeep = self.ListCompare2[ListOfName[0]]  # take the list of True/False computed before
for i in ListOfName[1:]:  # for each other panel, check if True too
    ListToKeep = ListToKeep & self.ListCompare2[i]
for i in ListOfName:  # for each module, recover the values
    tmp = self.energiesDatas[self.energiesDatas['Name'] == i]
    count = 0
    # loop over the values we want to keep (also energy produced and the time interval)
    for j, k, l, m, n in zip(tmp['list_time'], tmp['Date'], tmp['list_energy_prod'], tmp['list_energy_rec'], tmp['list_interval']):
        # calculation of the index
        delta_day = (k - self.dt.date()).days * (18 * 60)
        # if the value of ListToKeep at that index is True, we keep the value
        tmp['list_energy_prod'].iloc[count] = [l[index] for index, a in enumerate(j) if ListToKeep.iloc[delta_day + (a.hour - 4) * 60 + a.minute]]
        tmp['list_energy_rec'].iloc[count] = [m[index] for index, a in enumerate(j) if ListToKeep.iloc[delta_day + (a.hour - 4) * 60 + a.minute]]
        tmp['list_interval'].iloc[count] = [n[index] for index, a in enumerate(j) if ListToKeep.iloc[delta_day + (a.hour - 4) * 60 + a.minute]]
        count += 1
    self.store_compare = self.store_compare.append(tmp)
This second part is the one that takes a very long time.
My question is: is there a way to save time, using built-in functions or anything else?
Thank you very much
Kilian
The answer from chris-sc solved my problem:
I believe your data structure isn't appropriate for your problem. Especially the list in fields of a DataFrame, they make loops or apply almost unavoidable. Could you in principle re-structure the data? (For example one df per solar panel with columns date, time, energy)
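A hedged sketch of that restructuring: one row per (panel, timestamp) instead of lists stored inside DataFrame cells. With that shape, finding timestamps common to a set of panels becomes a vectorized group count rather than nested Python loops. All names and the toy data are illustrative:

```python
# Long-format restructuring sketch: one row per (panel, timestamp).
# The data and column names are made up for illustration.
import pandas as pd

long_df = pd.DataFrame({
    'name': ['A', 'A', 'A', 'B', 'B', 'C', 'C', 'C'],
    'time': pd.to_datetime(['2015-06-01 00:00', '2015-06-01 00:03',
                            '2015-06-01 00:06', '2015-06-01 00:03',
                            '2015-06-01 00:06', '2015-06-01 00:00',
                            '2015-06-01 00:03', '2015-06-01 00:06']),
    'energy_prod': [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
})

def common_measurements(df, panels):
    """Keep only rows whose timestamp appears for every requested panel."""
    sub = df[df['name'].isin(panels)]
    counts = sub.groupby('time')['name'].nunique()
    common = counts[counts == len(panels)].index
    return sub[sub['time'].isin(common)]

# panel B is missing the 00:00 measurement, so it drops out everywhere
result = common_measurements(long_df, ['A', 'B'])
```

Each user-triggered recomputation is then a couple of vectorized operations instead of a per-minute Python loop.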

efficient, fast numpy histograms

I have a 2D numpy array consisting of ca. 15'000'000 datapoints. Each datapoint has a timestamp and an integer value (between 40 and 200). I must create histograms of the datapoint distribution (16 bins: 40-49, 50-59, etc.), sorted by year, by month within the current year, by week within the current year, and by day within the current month.
Now, I wonder what might be the most efficient way to accomplish this. Given the size of the array, performance is a conspicuous consideration. I am considering nested "for" loops, breaking down the arrays by year, by month, etc. But I was reading that numpy arrays are highly memory-efficient and have all kinds of tricks up their sleeve for fast processing. So I was wondering if there is a faster way to do that. As you may have realized, I am an amateur programmer (a molecular biologist in "real life") and my questions are probably rather naïve.
First, fill in your 16 bins without considering date at all.
Then, sort the elements within each bin by date.
Now, you can use binary search to efficiently locate a given year/month/week within each bin.
For this, numpy has a function, numpy.bincount, and it is blazingly fast -- so fast that you can afford a bin for each integer value (161 bins) and each day (maybe 30,000 different days?), resulting in a few million bins.
The procedure:
calculate an integer index for each bin (e.g. 17 × the number of days from the first day in the file + (value - 40)//10)
run np.bincount
reshape to the correct shape (number of days, 17)
Now you have the binned data which can then be clumped into whatever bins are needed in the time dimension.
Without knowing the form of your input data, the integer bin calculation could look something like this:
# let us assume we have the data as:
# timestamps: 64-bit integer (seconds since something)
# values: 8-bit unsigned integer with integers between 40 and 200
# find the first day in the sample (86400 seconds per day)
first_day = np.min(timestamps) // 86400
# we intend to do this but fast:
indices = (timestamps // 86400 - first_day) * 17 + (values - 40) // 10
# get the bincount vector
b = np.bincount(indices)
# calculate the number of days in the sample
no_days = (len(b) + 16) // 17
# reshape b
b.resize((no_days, 17))
It should be noted that the first and last days in b depend on the data. In testing, most of the time is spent calculating the indices (around 400 ms on an i7 processor). If that needs to be reduced, it can be done in approximately 100 ms with the numexpr module. However, the actual implementation depends heavily on the form of the timestamps; some are faster to calculate than others.
However, I doubt if any other binning method will be faster if the data is needed up to the daily level.
I did not quite understand from your question whether you want separate views of the data (one by year, one by week, etc.) or some other binning method. In any case, that boils down to summing the relevant rows together.
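The row-summing step can be sketched like this; day_months, mapping each daily row of b to a month index, is an assumption about what you would derive from the timestamps:

```python
# Collapse daily bin rows into monthly bin rows by summing the rows
# that share a month. b and day_months are stand-ins for the real data.
import numpy as np

b = np.arange(6 * 17).reshape(6, 17)        # stand-in daily bincount matrix
day_months = np.array([0, 0, 0, 1, 1, 1])   # first three days in month 0
monthly = np.zeros((day_months.max() + 1, 17), dtype=b.dtype)
np.add.at(monthly, day_months, b)           # accumulate each day's row into its month
```

The same pattern works for weeks or years; only the index array changes.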
Here is a solution employing the group_by functionality found at the link below:
http://pastebin.com/c5WLWPbp
import numpy as np
dates = np.arange('2004-02', '2005-05', dtype='datetime64[D]')
np.random.shuffle(dates)
values = np.random.randint(40,200, len(dates))
years = np.array(dates, dtype='datetime64[Y]')
months = np.array(dates, dtype='datetime64[M]')
weeks = np.array(dates, dtype='datetime64[W]')
from grouping import group_by
bins = np.linspace(40,200,17)
for m, g in zip(*group_by(months)(values)):
    print(m)
    print(np.histogram(g, bins=bins)[0])
Alternatively, you could take a look at the pandas package, which probably has an elegant solution to this problem as well.
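As a hedged sketch of that pandas route (not from the original answer): pd.cut plus a groupby produces per-month histograms in a few lines, using the question's 40-200 range in steps of 10 and made-up data:

```python
# Per-month histograms via pandas; timestamps and values are randomly
# generated stand-ins for the real 15M-point array.
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
df = pd.DataFrame({
    'date': pd.to_datetime(
        rng.integers(pd.Timestamp('2004-02-01').value,
                     pd.Timestamp('2005-05-01').value, size=10_000)),
    'value': rng.integers(40, 200, size=10_000),
})

bins = np.arange(40, 201, 10)                      # 16 bins: [40,50) ... [190,200)
df['bin'] = pd.cut(df['value'], bins=bins, right=False)
monthly = (df.groupby([df['date'].dt.to_period('M'), 'bin'], observed=False)
             .size().unstack(fill_value=0))
```

Swapping to_period('M') for 'W', 'Y', or dt.date gives the weekly, yearly, and daily views with no other changes.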

pandas: smallest X for a defined probability

The data is financial data, with OHLC values in column, e.g.
Open High Low Close
Date
2013-10-20 1.36825 1.38315 1.36502 1.38029
2013-10-27 1.38072 1.38167 1.34793 1.34858
2013-11-03 1.34874 1.35466 1.32941 1.33664
2013-11-10 1.33549 1.35045 1.33439 1.34950
....
I am looking for the answer to the following question:
What is the smallest number X for which at least N% of the numbers in a large dataset are equal to or bigger than X?
And for our data with N=60 using the High column, the question would be: What is the smallest number X for which (at least) 60% of High column items are equal or bigger than X?
I know how to calculate the standard deviation, mean and the rest with pandas, but my understanding of statistics is rather poor, which keeps me from proceeding further. Please also point me to theoretical papers/tutorials if you know any.
Thank you.
For the sake of completeness, even though the question was essentially resolved in the comment by @haki above: suppose your data is in the DataFrame data. If you were looking for the high price for which 25% of observed high prices are lower, you would use
data['High'].quantile(q=0.25)
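Applying the same idea to the question's N=60: "at least 60% of the values are equal to or bigger than X" makes X the 40th percentile. A tiny self-contained check, with made-up prices standing in for the High column:

```python
# X = 40th percentile means roughly 60% of High values are >= X.
# The five prices below are illustrative, not the question's data.
import pandas as pd

data = pd.DataFrame({'High': [1.38315, 1.38167, 1.35466, 1.35045, 1.36000]})
x = data['High'].quantile(q=0.40)
share_at_least_x = (data['High'] >= x).mean()  # fraction of High values >= x
```

With pandas' default linear interpolation the "at least N%" guarantee holds up to interpolation between sample points; `interpolation='lower'` would make it exact.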
