Perform data assay with pandas? - python

I have a data file with fields separated by commas that I received from someone. I have to systematically go through each column to understand things like the usual descriptive statistics:
- Min
- Max
- Mean
- 25th percentile
- 50th percentile
- 75th percentile
or, if it's text:
- number of distinct values
but I also need to find:
- number of null or missing values
- number of zeroes
Sometimes the oddities of a feature mean something, i.e. contain information. I might need to circle back with the client about oddities I find, or, if I'm going to replace values, I have to make sure I'm not steamrolling over something recklessly.
So my question is this: is there a package in Python that will find this for me without my presupposing the data type? And if it did exist, would pandas be a good home for it?
I see that pandas makes it easy peezy to replace values, but in the beginning I just want to look.

You can use the describe method:
In [1]: import pandas as pd

In [2]: from numpy.random import randn

In [3]: df = pd.DataFrame(randn(10, 3), columns=list('ABC'))

In [4]: df
Out[4]:
          A         B         C
0  1.389738 -0.205485 -0.775810
1 -1.166596 -0.898761 -1.805333
2 -1.016509 -0.816037  0.169265
3 -0.440860 -1.147164  1.558606
4  0.763012  1.068694 -0.711795
5  0.075961 -0.597715  0.699023
6  3.006095 -0.354879 -0.718440
7 -1.249588 -0.372235  1.611717
8  0.518770 -0.742766  1.956372
9  1.304080 -0.803262 -0.609970

In [5]: df.describe()
Out[5]:
               A          B          C
count  10.000000  10.000000  10.000000
mean    0.318410  -0.486961   0.137363
std     1.360633   0.616566   1.266616
min    -1.249588  -1.147164  -1.805333
25%    -0.872596  -0.812843  -0.716779
50%     0.297366  -0.670240  -0.220352
75%     1.168813  -0.359218   1.343710
max     3.006095   1.068694   1.956372
In older pandas versions describe() also took a percentile_width argument (defaulting to 50); in current versions you pass an explicit list instead, e.g. df.describe(percentiles=[.25, .5, .75]).
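describe() only covers the summary statistics, though. For the rest of the assay in the question (null counts, zero counts, distinct values), here is a minimal sketch, assuming the file has been read into df with pd.read_csv (filename assumed) and without presupposing any column's dtype:
import pandas as pd

df = pd.read_csv('data.csv')   # the comma-separated file from the client

# combined summary: numeric stats plus count/unique/top/freq for text columns
print(df.describe(include='all'))

print(df.isnull().sum())                        # null / missing values per column
print((df.select_dtypes('number') == 0).sum())  # zeroes in the numeric columns
print(df.nunique())                             # distinct values per column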

Related

Rolling mean and standard deviation without zeros

I have a data frame in which one of the columns represents how much corn was produced at that timestamp.
For example:
timestamp  corns_produced  another_column
1          5               4
2          0               1
3          0               3
4          3               4
The dataframe is big, 100,000+ rows.
I want to calculate the moving average and std over 1000 timestamps of corns_produced.
Luckily it is pretty easy using rolling:
my_df.rolling(1000).mean()
my_df.rolling(1000).std()
But the problem is that I want to ignore the zeros, meaning that if in the last 1000 timestamps there are only 5 instances in which corn was produced, I want to compute the mean and std on those 5 elements.
How do I ignore the zeros?
Just to clarify, I don't want to do the following, x = my_df[my_df['corns_produced'] != 0], and then do rolling on x, because that ignores the timestamps and doesn't give me the result I need.
You can use Rolling.apply:
print(my_df.rolling(1000).apply(lambda x: x[x != 0].mean()))
print(my_df.rolling(1000).apply(lambda x: x[x != 0].std()))
A faster solution: first set all zeros to np.nan, then take a rolling mean. If you are dealing with large data, it will be much faster
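A minimal sketch of that faster approach, assuming the column is named corns_produced as above (min_periods=1 keeps windows that contain NaNs from collapsing to NaN, so the statistics are computed over the non-zero entries only):
import numpy as np

s = my_df['corns_produced'].replace(0, np.nan)  # zeros no longer count as observations
rolling_mean = s.rolling(1000, min_periods=1).mean()
rolling_std = s.rolling(1000, min_periods=1).std()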

Standard deviation of time series

I wanted to calculate the mean and standard deviation of a sample. The sample has two columns: the first is a time and the second, separated by a space, is a value. I don't know how to calculate the mean and standard deviation of the second column of values using Python, maybe scipy? I want to use that method for large sets of data.
I also want to check which numbers in the set are seven times higher than the standard deviation.
Thanks for the help.
time  value
1     1.17e-5
2     1.27e-5
3     1.35e-5
4     1.53e-5
5     1.77e-5
The mean is 1.418e-5 and the standard deviation is 2.369e-6.
To answer your first question, assuming your sample's dataframe is df, the following should work:
import pandas as pd
df = pd.DataFrame({'time': [1, 2, 3, 4, 5], 'value': [1.17e-5, 1.27e-5, 1.35e-5, 1.53e-5, 1.77e-5]})
df will be something like this:
>>> df
   time     value
0     1  0.000012
1     2  0.000013
2     3  0.000013
3     4  0.000015
4     5  0.000018
Then to obtain the standard deviation and mean of the value column respectively, run the following and you will get the outputs:
>>> df['value'].std()
2.368966019173766e-06
>>> df['value'].mean()
1.418e-05
To answer your second question, try the following:
std = df['value'].std()
df = df[(df.value > 7*std)]
I am assuming you want to obtain the rows at which value is greater than 7 times the sample standard deviation. If you actually want greater than or equal to, just change > to >=. You should then be able to obtain the following:
>>> df
   time     value
4     5  0.000018
Also, following @Mad Physicist's suggestion of adding Delta Degrees of Freedom ddof=0 (if you are unfamiliar with this, check out the Delta Degrees of Freedom Wiki), doing so results in the following:
std = df['value'].std(ddof=0)
df = df[(df.value > 7*std)]
with output:
>>> df
   time     value
3     4  0.000015
4     5  0.000018
P.S. If I am not wrong, it's a convention here to stick to one question per post, not two.

Conditional Rolling Sum using filter on groupby group rows

I've been trying without success to find a way to create an "average_gain_up" in Python and have gotten a bit stuck. Being new to groupby, there is something about how it treats functions that I've not managed to grasp, so any intuition behind how to think through these types of problems would be helpful.
Problem:
Create a rolling 14-day sum, only summing if the value is > 0.
new = pd.DataFrame([[1, -2, 3, -2, 4, 5], ['a', 'a', 'a', 'b', 'b', 'b']])
new = new.T  # transposing into a friendly groupby format
# Group by a or b, filter to only the positive values and then sum rolling;
# we keep NAs to ensure the sum is run over 14 values.
groupby = new.groupby(1)[0].filter(lambda x: x > 0, dropna=False).rolling(14).sum()
Intended sum frame and x.all()/len(x) result: (images omitted).
This throws a TypeError, "the filter must return a boolean result". From reading other answers, I understand this is because I'm asking whether a whole series/frame is greater than 0. The above code works with len(x), which again makes sense in that context. I tried with .all() as well, but it doesn't behave as intended: .all() returns a single boolean per group, and the sum is then just a simple rolling sum.
I've also tried creating a list of booleans to say which values are positive and which are not, but that also yields an error, and this time I'm not sure why.
groupby1 = new.groupby(1)[0]
groupby2 = [y > 0 for x in groupby1 for y in x[1]]
groupby_try = new.groupby(1)[0].filter(lambda x: groupby2, dropna=False).rolling(2).sum()
1) How do I make the above code work, and what is wrong in how I am thinking about it?
2) Is this the "best practice" way to do these types of operations?
Any help appreciated; let me know if I've missed anything or any further clarification is needed.
According to the docs on filter after a groupby, it is not supposed to filter values within a group but rather whole groups that don't meet some criterion; in the first example given in the docs, for instance, a group is kept only if the sum of all its elements is above 2.
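For instance, a small sketch with the question's own frame showing that filter keeps or drops whole groups, never individual rows:
# group 'a' sums to 2 and is dropped; group 'b' sums to 7 and is kept in full
print(new.groupby(1).filter(lambda g: g[0].sum() > 2))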
One way could be to replace all the negative values with 0 in new[0] first, using np.clip for example, and then groupby, rolling and sum, such as:
print(np.clip(new[0], 0, np.inf).groupby(new[1]).rolling(2).sum())
1
a  0    NaN
   1    1.0
   2    3.0
b  3    NaN
   4    4.0
   5    9.0
Name: 0, dtype: float64
This way avoids modifying the data in new; if you don't mind, you can change column 0 with new[0] = np.clip(new[0], 0, np.inf) and then do new.groupby(1)[0].rolling(2).sum(), which gives the same result.

How to replace NaN values?

I have a feature called smoking_status; it has 3 different values:
1) smokes
2) formerly smoked
3) never smoked
The feature column (smoking_status) has the above 3 values as well as a lot of NaN values. How can I treat the NaN values? My data is not numerical; if it were numerical I could have replaced them using the median or mean. How can I replace NaN values in my case?
There might be two better options than replacing NaN with unknown - at least in the context of a data science challenge, which I think this is:
replace the NaN with the most common value (mode);
predict the missing value using the data you have.
Getting the most common value is easy. For this purpose you can use <column>.value_counts() to get the frequencies, followed by .idxmax(), which gives you the index element from value_counts() with the highest frequency. After that you just call fillna():
import pandas as pd
import numpy as np
df = pd.DataFrame(['formerly', 'never', 'never', 'never',
                   np.nan, 'formerly', 'never', 'never',
                   np.nan, 'never', 'never'], columns=['smoked'])
print(df)
print('--')
print(df.smoked.fillna(df.smoked.value_counts().idxmax()))
Gives:
smoked
0 formerly
1 never
2 never
3 never
4 NaN
5 formerly
6 never
7 never
8 NaN
9 never
10 never
--
0 formerly
1 never
2 never
3 never
4 never
5 formerly
6 never
7 never
8 never
9 never
10 never
You don't have the data for those rows. You could simply fill them with the median, the mean, or the most common value in that feature, but in this particular case that's a bad idea considering the feature.
A better approach would be to fill with a string saying 'unknown'/'na':
df['smoking_status'] = df['smoking_status'].fillna('NA')
Then you can label encode it or convert the column to a one-hot encoding.
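For example, a minimal sketch of that one-hot step with pd.get_dummies, assuming df already holds the data and the column is named smoking_status as in the question:
import pandas as pd

df['smoking_status'] = df['smoking_status'].fillna('NA')
dummies = pd.get_dummies(df['smoking_status'], prefix='smoking_status')
df = df.drop(columns='smoking_status').join(dummies)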
Example categorical data:
ser = pd.Categorical(['non', 'non', 'never', 'former', 'never', np.nan])
Fill it:
ser.add_categories(['unknown']).fillna('unknown')
Gives you:
[non, non, never, former, never, unknown]
Categories (4, object): [former, never, non, unknown]
Looks like the question is about methodology, not a technical issue.
So you can try:
1) the most frequent value among those three;
2) statistics from other categorical fields of your dataset (e.g. the most common smoking status per group);
3) random values;
4) an "UNKNOWN" category.
Then you can do one-hot encoding and definitely check your models with cross-validation to choose the proper way.
Also, there is a more tricky way: use this status as a target variable and try to predict those NaNs with scikit-learn, using all the other data.
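A rough sketch of that idea, assuming the frame has some other usable feature columns (the names age and bmi here are purely hypothetical placeholders):
from sklearn.ensemble import RandomForestClassifier

features = ['age', 'bmi']  # hypothetical predictor columns; use whatever else you have
known = df[df['smoking_status'].notna()]
missing = df['smoking_status'].isna()

clf = RandomForestClassifier().fit(known[features], known['smoking_status'])
df.loc[missing, 'smoking_status'] = clf.predict(df.loc[missing, features])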

Definitive way to parse alphanumeric CSVs in Python with scipy/numpy

I've been trying to find a good and flexible way to parse CSV files in Python but none of the standard options seem to fit the bill. I am tempted to write my own but I think that some combination of what exists in numpy/scipy and the csv module can do what I need, and so I don't want to reinvent the wheel.
I'd like the standard features of being able to specify delimiters, specify whether or not there's a header, how many rows to skip, comments delimiter, which columns to ignore, etc. The central feature I am missing is being able to parse CSV files in a way that gracefully handles both string data and numeric data. Many of my CSV files have columns that contain strings (not of the same length necessarily) and numeric data. I'd like to be able to have numpy array functionality for this numeric data, but also be able to access the strings. For example, suppose my file looks like this (imagine columns are tab-separated):
# my file
name favorite_integer favorite_float1 favorite_float2 short_description
johnny 5 60.2 0.52 johnny likes fruitflies
bob 1 17.52 0.001 bob, bobby, robert
data = loadcsv('myfile.csv', delimiter='\t', parse_header=True, comment='#')
I'd like to be able to access data in two ways:
As a matrix of values: it's important for me to get a numpy.array so that I can easily transpose and access the columns that are numeric. In this case, I want to be able to do something like:
floats_and_ints = data.matrix
floats_and_ints[:, 0] # access the integers
floats_and_ints[:, 1:3] # access some of the floats
transpose(floats_and_ints) # etc..
As a dictionary-like object where I don't have to know the order of the headers: I'd like to also access the data by the header order. For example, I'd like to do:
data['favorite_float1'] # get all the values of the column with header "favorite_float1"
data['name'] # get all the names of the rows
I don't want to have to know that favorite_float1 is the second column in the table, since this might change.
It's also important for me to be able to iterate through the rows and access the fields by name. For example:
for row in data:
    # print names and favorite integers of all
    print "Name: ", row["name"], row["favorite_int"]
The representation in (1) suggests a numpy.array, but as far as I can tell, this does not deal well with strings and requires me to specify the data type ahead of time, as well as the header labels.
The representation in (2) suggests a list of dictionaries, and this is what I have been using. However, this is really bad for CSV files that have two string fields while the rest of the columns are numeric. For the numeric values, you really do want to be able to sometimes get access to the matrix representation and manipulate it as a numpy.array.
Is there a combination of csv/numpy/scipy features that allows the flexibility of both worlds? Any advice on this would be greatly appreciated.
In summary, the main features are:
Standard ability to specify delimiters, number of rows to skip, columns to ignore, etc.
The ability to get a numpy.array/matrix representation of the data so that the numeric values can be manipulated
The ability to extract columns and rows by header name (as in the above example)
Have a look at pandas, which is built on top of numpy.
Here is a small example:
In [7]: df = pd.read_csv('data.csv', sep='\t', index_col='name')
In [8]: df
Out[8]:
        favorite_integer  favorite_float1  favorite_float2        short_description
name
johnny                 5            60.20            0.520  johnny likes fruitflies
bob                    1            17.52            0.001       bob, bobby, robert
In [9]: df.describe()
Out[9]:
       favorite_integer  favorite_float1  favorite_float2
count          2.000000         2.000000         2.000000
mean           3.000000        38.860000         0.260500
std            2.828427        30.179317         0.366988
min            1.000000        17.520000         0.001000
25%            2.000000        28.190000         0.130750
50%            3.000000        38.860000         0.260500
75%            4.000000        49.530000         0.390250
max            5.000000        60.200000         0.520000
In [13]: df.ix['johnny', 'favorite_integer']
Out[13]: 5
In [15]: df['favorite_float1'] # or attribute: df.favorite_float1
Out[15]:
name
johnny 60.20
bob 17.52
Name: favorite_float1
In [16]: df['mean_favorite'] = df.mean(axis=1)
In [17]: df.ix[:, 3:]
Out[17]:
short_description mean_favorite
name
johnny johnny likes fruitflies 21.906667
bob bob, bobby, robert 6.173667
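Note that .ix has since been removed from pandas; in current versions the equivalent lookups use .loc (label-based) and .iloc (position-based):
df.loc['johnny', 'favorite_integer']  # label-based lookup
df.iloc[:, 3:]                        # positional column slice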
matplotlib.mlab.csv2rec returns a numpy recarray, so you can do all the great numpy things to this that you would do with any numpy array. The individual rows, being record instances, can be indexed as tuples but also have attributes automatically named for the columns in your data:
rows = matplotlib.mlab.csv2rec('data.csv')
row = rows[0]
print row[0]
print row.name
print row['name']
csv2rec also understands "quoted strings", unlike numpy.genfromtxt.
In general, I find that csv2rec combines some of the best features of csv.reader and numpy.genfromtxt.
numpy.genfromtxt()
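For example, a minimal sketch of genfromtxt on the tab-separated file from the question: dtype=None lets it infer a per-column type, names=True takes the header row as field names, and the result is a structured array addressable by column name:
import numpy as np

data = np.genfromtxt('myfile.csv', delimiter='\t', names=True,
                     dtype=None, comments='#', encoding=None)

print(data['favorite_float1'])  # a column, by header name
print(data[0]['name'])          # a field of the first row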
Why not just use the stdlib csv.DictReader?
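A minimal sketch of that, again assuming the tab-separated file from the question; it gives dict-like access to each row by header name, although every field comes back as a string:
import csv

with open('myfile.csv', newline='') as f:
    non_comment = (line for line in f if not line.startswith('#'))
    for row in csv.DictReader(non_comment, delimiter='\t'):
        print(row['name'], float(row['favorite_float1']))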
