Equivalent of arcpy.Statistics_analysis using NumPy (or other) - python

I am having a problem (I think memory related) when trying to do an arcpy.Statistics_analysis on an approximately 40 million row table. I am trying to count the number of non-null values in various columns of the table per category (e.g. there are x non-null values in column 1 for category A). After this, I need to join the statistics results to the input table.
Is there a way of doing this using numpy (or something else)?
The code I currently have is like this:
arcpy.Statistics_analysis(input_layer, output_layer, "'Column1' COUNT; 'Column2' COUNT; 'Column3' COUNT", "Categories")
I am very much a novice with arcpy/numpy so any help much appreciated!

You can convert the table to a NumPy array with arcpy.da.TableToNumPyArray and then wrap that array in a pandas.DataFrame.
Here is an example (I assume you are working with a feature class in a geodatabase, since you use the term "null values"; if you work with a shapefile you will need to adapt the code, because nulls are not supported there and are stored as a single space string ' '):
import arcpy
import pandas as pd
# Change these values
gdb_path = 'path/to/your/geodatabase.gdb'
table_name = 'your_table_name'
cat_field = 'Categories'
fields = ['Column1', 'Column2', 'Column3', 'Column4']
# Do not change
null_value = -9999
input_table = gdb_path + '\\' + table_name
# Convert to pandas DataFrame
array = arcpy.da.TableToNumPyArray(input_table,
                                   [cat_field] + fields,
                                   skip_nulls=False,
                                   null_value=null_value)
df = pd.DataFrame(array)

# Count the number of non-null values per category
not_null_count = {field: {cat: 0 for cat in df[cat_field].unique()}
                  for field in fields}
for cat in df[cat_field].unique():
    _df = df.loc[df[cat_field] == cat]
    len_cat = len(_df)
    for field in fields:
        counts = _df[field].value_counts()
        # Nulls show up as the numeric placeholder in integer/float fields
        # and as its string form in text fields; if neither key exists,
        # the field has no nulls in this category.
        null_count = counts.get(null_value, counts.get(str(null_value), 0))
        not_null_count[field][cat] = len_cat - null_count
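As a side note, the nested loops above can be collapsed into a single pandas groupby. This is a minimal sketch assuming the same df, fields, cat_field and null_value as above (nulls appear as the numeric placeholder in number fields and as its string form in text fields):
import numpy as np

# Mask both forms of the null placeholder, then count what is left per category
masked = df[fields].replace([null_value, str(null_value)], np.nan)
not_null_per_cat = masked.groupby(df[cat_field]).count()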
Concerning joining the results back to the input table: without more information it is hard to give an exact answer that will meet your expectations, because there are multiple columns and it is unclear which value you want to attach to each row. A rough sketch of one way to do it in pandas is shown below.
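For example, if each row should simply carry the non-null count of Column1 for its own category, a minimal sketch using the not_null_count dictionary built above (the output column name is made up):
# Map each row's category to its pre-computed count of non-null Column1 values
df['Column1_not_null_count'] = df[cat_field].map(not_null_count['Column1'])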
EDIT:
Here is some additional code following your clarifications:
# Create a copy of the table
copy_name = '' # name of the copied table
copy_path = gdb_path + '\\' + copy_name
arcpy.Copy_management(input_table, copy_path)
# Divide the copied data by the summary counts
# This step does not require converting the dict (not_null_count) to a table
with arcpy.da.UpdateCursor(copy_path, [cat_field] + fields) as cur:
    for row in cur:
        category = row[0]
        for i, fld in enumerate(fields):
            row[i + 1] /= not_null_count[fld][category]
        cur.updateRow(row)

# Save the summary table as a csv file (if needed)
df_summary = pd.DataFrame(not_null_count)
df_summary.index.name = 'Food Area'  # Or any name
df_summary.to_csv('path/to/file.csv')  # Change path

# Summary to ArcMap table (also if needed)
arcpy.TableToTable_conversion('path/to/file.csv',
                              gdb_path,
                              'name_of_your_new_table')

Related

Can I accomplish this in an easier more fluid way? (in ArcPy)

For a project I am adding fields and then populating those fields with data already contained in the table. The adding fields is easy.
arcpy.AddField_management("PLSSFirstDivision","TRS","TEXT","","",20)
arcpy.AddField_management("PLSSFirstDivision","TWN","TEXT","","",20)
arcpy.AddField_management("PLSSFirstDivision","SEC","TEXT","","",20)
arcpy.AddField_management("PLSSFirstDivision","RNG","TEXT","","",20)
arcpy.AddField_management("PLSSFirstDivision","TWN_D","TEXT","","",20)
arcpy.AddField_management("PLSSFirstDivision","RNG_D","TEXT","","",20)
Then I need to take specific numbers out of a string field, and I could only get it to work in ArcMap's Field Calculator, not in the Python window. The data looks like this (the characters I need were marked in bold in the original post):
LA180230N0120E0SN100
TWN = MID([FRSTDIVID],6,2)
RNG = MID([FRSTDIVID],11,2)
SEC = MID([FRSTDIVID],18,2)
Then I needed to strip the initial "0" for those 3 fields:
TWN = !TWN!.lstrip('0')
RNG = !RNG!.lstrip('0')
SEC = !SEC!.lstrip('0')
Then adding it all together in a final field:
TRS = "T"+ [TWN]+ [TWN_D]+"R" + [RNG]+ [RNG_D]+"-" + "SEC" + [SEC]
Thanks for any help, just trying to learn more
I haven't actually run these, so my syntax could be off a tad, but you need to come up with a python expression that results in the strings you want, then use arcpy's CalculateField method to update your table. If you test your expressions in the field calculator window, you should be able to just copy/paste your final expression into the statements like below.
arcpy.CalculateField_management("PLSSFirstDivision", "TWN", "!FRSTDIVID![6:2].lstrip('0')", "PYTHON3")
arcpy.CalculateField_management("PLSSFirstDivision", "TRS", "'T' + !TWN!+ !TWN_D!+'R' + !RNG!+ !RNG_D!+'-SEC' + !SEC!", "PYTHON3")
These are the sort of complex attribute manipulations that I like to use an UpdateCursor for. You can manipulate the contents of multiple fields in one iteration, and then write the updates out for each row at once.
with arcpy.da.UpdateCursor("PLSSFirstDivision", ["FRSTDIVID","TRS","TWN","SEC","RNG","TWN_D","RNG_D"]) as cursor:
for row in cursor:
frstdivid = row[0]
# try string slicing for this instead of the `MID` function
# and you can strip leading zeroes in the same line
twn = frstdivid[5:7].lstrip('0')
rng = frstdivid[10:12].lstrip('0')
sec = frstdivid[17:19].lstrip('0')
# was not sure how twn_d and rng_d are calculated based on your provided code, but...
twn_d = foo
rng_d = bar
# use all these to calculate trs
trs = 'T{}{}R{}{}-{}SEC'.format(twn, twn_d, rng, rng_d, sec)
# assign the calculated values back to row positions
row[1] = trs
row[2] = twn
row[3] = sec
row[4] = rng
row[5] = twn_d
row[6] = rng_d
# write the new row with complete values from memory to your table
cursor.updateRow(row)

df.replace() not being converted into the text or csv file

When I use:
df = df.replace(oldvalue, newvalue)
it replaces the values in the DataFrame, but when I write the new dataframe out to either a text file or a csv file, the output does not update and still shows the original values from before the replace.
I am getting the data from two files and trying to add them together. Right now I am trying to change the formatting to match the original formatting.
I have tried altering the placement of the replacement, as well as editing my df.replace command numerous times to include regex=True, to_replace, value=, and other small things. Below is a small sampling of code.
drdf['adg'] = adgvals  # adds adg values into dataframe
for column, valuex in drdf.iteritems():
    #value = value.replace('444.000', '444.0')
    for indv in valuex:
        valuex = valuex.replace('444.000', '444.0')
    for difindv in valuex:
        fourspace = '    '  # four spaces
        if len(difindv) == 2:
            indv1 = difindv + fourspace
            value1 = valuex.replace(difindv, indv1)
            drdf = drdf.replace(to_replace=valuex, value=value1)

# Transfers new dataframe into new text file
np.savetxt(r'/Users/username/test.txt', drdf.values, fmt='%s', delimiter='')
drdf.to_csv(r'/Users/username/089010219.tot')
It should be replacing the values (for example, 40 with 40 followed by four spaces). It does this within the Spyder interface, but the change does not carry over into the files that are being created.
Did you try:
df.replace(old, new, inplace=True)
inplace=True essentially puts the new value 'in place' of the old one, i.e. it modifies the DataFrame itself instead of returning a modified copy. However, I do not claim to know all the inner technical workings of inplace.
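In other words, the usual cause of "the file still shows the old values" is that replace() returns a new DataFrame that is never assigned back before writing. A minimal sketch of both working patterns, with made-up values and a made-up output path:
import pandas as pd

df = pd.DataFrame({'adg': ['444.000', '40', '444.000']})

# Pattern 1: reassign the result of replace()
df = df.replace('444.000', '444.0')

# Pattern 2: modify the DataFrame in place, no reassignment needed
df.replace('40', '40    ', inplace=True)

# Only write the file after the DataFrame has actually been modified
df.to_csv('/Users/username/test_replaced.csv')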
This is how I would do it with map:
drdf['adg'] = adgvals  # adds adg values into dataframe
for column, valuex in drdf.iteritems():
    #value = value.replace('444.000', '444.0')
    for indv in valuex:
        valuex = valuex.map({'444.000': '444.0'})
    for difindv in valuex:
        fourspace = '    '  # four spaces
        if len(difindv) == 2:
            indv1 = difindv + fourspace
            value1 = valuex.map({difindv: indv1})
            drdf = drdf.replace(valuex, value1)

# Transfers new dataframe into new text file
np.savetxt(r'/Users/username/test.txt', drdf.values, fmt='%s', delimiter='')
drdf.to_csv(r'/Users/username/089010219.tot')

Pandas + HDF5 Panel data storage for large data

As part of my research, I am looking for a good storage design for my panel data. I am using pandas for all in-memory operations. I've had a look at the following two questions/contributions, Large Data Workflows using Pandas and Query HDF5 Pandas, as they come closest to my set-up. However, I have a couple of questions left. First, let me define my data and some requirements:
Size: I have around 800 dates, 9,000 IDs and up to 200 variables. Hence, flattening the panel (along dates and IDs) corresponds to 7.2 million rows and 200 columns. This might all fit in memory or not; let's assume it does not. Disk space is not an issue.
Variables are typically calculated once, but updates/changes probably happen from time to time. Once updates occur, old versions don't matter anymore.
New variables are added from time to time, mostly one at a time only.
New rows are not added.
Querying takes place. For example, often I need to select only a certain date range like date>start_date & date<end_date. But some queries need to consider rank conditions on dates. For example, get all data (i.e. columns) where rank(var1)>500 & rank(var1)<1000, where rank is as of date.
The objective is to achieve fast reading/querying of data. Data writing is not so critical.
I thought of the following HDF5 design:
Follow the groups_map approach (of 1) to store variables in different tables. Limit the number of columns for each group to 10 (to avoid large memory loads when updating single variables, see point 3).
Each group represents one table, where I use the multi-index based on dates & ids for each table stored.
Create an update function to update variables. The function loads the table with all (10) columns into memory as a df, deletes the table on disk, replaces the updated variable in the df and saves the table from memory back to disk.
Create an add function that adds var1 to a group with fewer than 10 columns, or creates a new group if required. Saving works as in 3: load the current group into memory, delete the table on disk, add the new column and save it back to disk.
Calculate ranks as of date for the relevant variables and add them to disk storage as rank_var1, which should reduce the as-of-date query to simply rank_var1 > 500 & rank_var1 < 1000.
I have the following questions:
Updating HDFTable, I suppose I have to delete the entire table in order to update a single column?
When to use 'data_columns', or should I simply assign True in HDFStore.append()?
If I want to query based on the condition rank_var1 > 500 & rank_var1 < 1000, but I need columns from other groups, can I feed the index received from the rank_var1 condition into the query to get other columns based on this index (the index is a multi-index with date and ID)? Or would I need to loop over this index by date and then chunk the IDs, similar to what is proposed in 2, and repeat the procedure for each group I need? Alternatively, (a) I could add rank columns to each group's table, but that seems extremely inefficient in terms of disk storage. Note that the number of variables where rank filtering is relevant is limited (say 5). Or (b) I could simply use the df_rank received from the rank_var1 query and use in-memory operations via df_rank.merge(df_tmp, left_index=True, right_index=True, how='left'), looping through the groups (df_tmp) where I select the desired columns.
Say I have some data in different frequencies. Having different group_maps (or different storages) for different freq is the way to go I suppose?
Copies of the storage might be used on win/ux systems. I assume it is perfectly compatible, anything to consider here?
I plan to use pd.HDFStore(str(self.path), mode='a', complevel=9, complib='blosc'). Any concerns regarding complevel or complib?
I've started to write up some code, once I have something to show I'll edit and add it if desired. Please, let me know if you need any more information.
EDIT: here is a first version of my storage class; please adjust the path at the bottom accordingly. Sorry for the length of the code, comments welcome.
import pandas as pd
import numpy as np
import string
class LargeDFStorage():
    # TODO add index features to ensure correct indexes
    # index_names = ('date', 'id')
    def __init__(self, h5_path, groups_map):
        """
        Parameters
        ----------
        h5_path: str
            hdf5 storage path
        groups_map: dict
            where keys are group_names and values are dict, with at least key
            'columns' where the value is list of column names.
            A special group_name is reserved for group_name/key "query", which
            can be used as the querying and conditioning table when getting data,
            see :meth:`.get`.
        """
        self.path = str(h5_path)
        self.groups_map = groups_map
        self.column_map = self._get_column_map()
        # if desired make part of arguments
        self.complib = 'blosc'
        self.complevel = 9

    def _get_column_map(self):
        """ Calc the inverse of the groups_map/ensures uniqueness of cols

        Returns
        -------
        dict: with cols as keys and group_names as values
        """
        column_map = dict()
        for g, value in self.groups_map.items():
            if len(set(column_map.keys()) & set(value['columns'])) > 0:
                raise ValueError('Columns have to be unique')
            for col in value['columns']:
                column_map[col] = g
        return column_map
    @staticmethod
    def group_col_names(store, group_name):
        """ Returns all column names of specific group

        Parameters
        ----------
        store: pd.HDFStore
        group_name: str

        Returns
        -------
        list:
            of all column names in the group
        """
        if group_name not in store:
            return []
        # hack to get column names, straightforward way!?
        return store.select(group_name, start=0, stop=0).columns.tolist()

    @staticmethod
    def stored_cols(store):
        """ Collects all columns stored in HDF5 store

        Parameters
        ----------
        store: pd.HDFStore

        Returns
        -------
        list:
            a list of all columns currently in the store
        """
        stored_cols = list()
        for x in store.items():
            group_name = x[0][1:]
            stored_cols += LargeDFStorage.group_col_names(store, group_name)
        return stored_cols
    def _find_groups(self, columns):
        """ Searches all groups required for covering columns

        Parameters
        ----------
        columns: list
            list of valid columns

        Returns
        -------
        list:
            of unique groups
        """
        groups = list()
        for column in columns:
            groups.append(self.column_map[column])
        return list(set(groups))

    def add_columns(self, df):
        """ Adds columns to storage for the first time. If columns should
        be updated, use :meth:`.update` instead.

        Parameters
        ----------
        df: pandas.DataFrame
            with new columns (not yet stored in any of the tables)

        Returns
        -------
        """
        store = pd.HDFStore(self.path, mode='a', complevel=self.complevel,
                            complib=self.complib)
        # check if any column has been stored already
        if df.columns.isin(self.stored_cols(store)).any():
            store.close()
            raise ValueError('Some cols are already in the store')
        # find all groups needed to store the data
        groups = self._find_groups(df.columns)
        for group in groups:
            v = self.groups_map[group]
            # select columns of current group in df
            select_cols = df.columns[df.columns.isin(v['columns'])].tolist()
            tmp = df.reindex(columns=select_cols, copy=False)
            # set data_columns only for the query group
            dc = None
            if group == 'query':
                dc = True
            stored_cols = self.group_col_names(store, group)
            # no columns in group (group does not exist yet)
            if len(stored_cols) == 0:
                store.append(group, tmp, data_columns=dc)
            else:
                # load current disk data to memory
                df_grp = store.get(group)
                # remove data from disk
                store.remove(group)
                # add new column(s) to df_disk
                df_grp = df_grp.merge(tmp, left_index=True, right_index=True,
                                      how='left')
                # save old data with new, additional columns
                store.append(group, df_grp, data_columns=dc)
        store.close()
    def _query_table(self, store, columns, where):
        """ Selects data from table 'query' and uses where expression

        Parameters
        ----------
        store: pd.HDFStore
        columns: list
            desired data columns
        where: str
            a valid select expression

        Returns
        -------
        """
        query_cols = self.group_col_names(store, 'query')
        if len(query_cols) == 0:
            store.close()
            raise ValueError('No data to query table')
        get_cols = list(set(query_cols) & set(columns))
        if len(get_cols) == 0:
            # load only one column to minimize memory usage
            df_query = store.select('query', columns=query_cols[:1],
                                    where=where)
            add_query = False
        else:
            # load columns which are needed anyway
            df_query = store.select('query', columns=get_cols, where=where)
            add_query = True
        return df_query, add_query
    def get(self, columns, where=None):
        """ Retrieve data from storage

        Parameters
        ----------
        columns: list/str
            list of columns to use, or use 'all' if all columns should be
            retrieved
        where: str
            a valid select statement

        Returns
        -------
        pandas.DataFrame
            with all requested columns and considering where
        """
        store = pd.HDFStore(str(self.path), mode='r')
        # get all columns stored in the HDFStore
        stored_cols = self.stored_cols(store)
        if columns == 'all':
            columns = stored_cols
        # check if all desired columns can be found in storage
        if len(set(columns) - set(stored_cols)) > 0:
            store.close()
            raise ValueError('Column(s): {}. not in storage'.format(
                set(columns) - set(stored_cols)))
        # get all relevant groups (where columns are taken from)
        groups = self._find_groups(columns)
        # if where query is defined retrieve data from storage; possibly
        # only the index of df_query will be used
        if where is not None:
            df_query, add_df_query = self._query_table(store, columns, where)
        else:
            df_query, add_df_query = None, False
        # df collector
        df = list()
        for group in groups:
            # skip in case where was used and the query columns were already loaded
            if where is not None and group == 'query':
                continue
            # all columns which are in group but also requested
            get_cols = list(
                set(self.group_col_names(store, group)) & set(columns))
            tmp_df = store.select(group, columns=get_cols)
            if df_query is None:
                df.append(tmp_df)
            else:
                # align query index with df index from storage
                df_query, tmp_df = df_query.align(tmp_df, join='left', axis=0)
                df.append(tmp_df)
        store.close()
        # if any data of query should be added
        if add_df_query:
            df.append(df_query)
        # combine all columns
        df = pd.concat(df, axis=1)
        return df
    def update(self, df):
        """ Updates data in storage, all columns have to be stored already in
        order to be accepted for updating (use :meth:`.add_columns` instead)

        Parameters
        ----------
        df: pd.DataFrame
            with index as in storage, and columns as desired

        Returns
        -------
        """
        store = pd.HDFStore(self.path, mode='a', complevel=self.complevel,
                            complib=self.complib)
        # check if all columns have been stored already
        if not df.columns.isin(self.stored_cols(store)).all():
            store.close()
            raise ValueError('Some cols have not been stored yet')
        # find all groups needed to store the data
        groups = self._find_groups(df.columns)
        for group in groups:
            dc = None
            if group == 'query':
                dc = True
            # load current disk data to memory
            group_df = store.get(group)
            # remove data from disk
            store.remove(group)
            # update with new data
            group_df.update(df)
            # save updated df back to disk
            store.append(group, group_df, data_columns=dc)
        store.close()
class DataGenerator():
    np.random.seed(1282)

    @staticmethod
    def get_df(rows=100, cols=10, freq='M'):
        """ Simulate data frame
        """
        if cols < 26:
            col_name = list(string.ascii_lowercase[:cols])
        else:
            col_name = range(cols)
        if rows > 2000:
            freq = 'Min'
        index = pd.date_range('19870825', periods=rows, freq=freq)
        df = pd.DataFrame(np.random.standard_normal((rows, cols)),
                          columns=col_name, index=index)
        df.index.name = 'date'
        df.columns.name = 'ID'
        return df

    @staticmethod
    def get_panel(rows=1000, cols=500, items=10):
        """ Simulate panel data
        """
        if items < 26:
            item_names = list(string.ascii_lowercase[:items])
        else:
            item_names = range(items)
        panel_ = dict()
        for item in item_names:
            panel_[item] = DataGenerator.get_df(rows=rows, cols=cols)
        return pd.Panel(panel_)
def main():
    # Example with DataFrame
    path = 'D:\\fc_storage.h5'
    groups_map = dict(
        a=dict(columns=['a', 'b', 'c', 'd', 'k']),
        query=dict(columns=['e', 'f', 'g', 'rank_a']),
    )
    storage = LargeDFStorage(path, groups_map=groups_map)
    df = DataGenerator.get_df(rows=200000, cols=15)
    storage.add_columns(df[['a', 'b', 'c', 'e', 'f']])
    storage.update(df[['a']]*3)
    storage.add_columns(df[['d', 'g']])
    print(storage.get(columns=['a', 'b', 'f'], where='f<0 & e<0'))

    # Example with panel and rank condition
    path2 = 'D:\\panel_storage.h5'
    storage_pnl = LargeDFStorage(path2, groups_map=groups_map)
    panel = DataGenerator.get_panel(rows=800, cols=2000, items=24)
    df = panel.to_frame()
    df['rank_a'] = df[['a']].groupby(level='date').rank()
    storage_pnl.add_columns(df[['a', 'b', 'c', 'e', 'f']])
    storage_pnl.update(df[['a']]*3)
    storage_pnl.add_columns(df[['d', 'g', 'rank_a']])
    print(storage_pnl.get(columns=['a', 'b', 'e', 'f', 'rank_a'],
                          where='f>0 & e>0 & rank_a < 100'))


if __name__ == '__main__':
    main()
It's a bit difficult to answer these questions without concrete examples...
Updating HDFTable, I suppose I have to delete the entire table in
order to update a single column?
AFAIK yes, unless you are storing single columns separately; but it will be done automatically, you just have to write your DF/Panel back to the HDF store.
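A minimal sketch of that read-modify-rewrite cycle (the file name, group key and column names are made up for illustration):
import numpy as np
import pandas as pd

# toy setup: one group table with two columns
with pd.HDFStore('storage.h5', mode='w') as store:
    store.append('grp1', pd.DataFrame({'var1': np.arange(5.0), 'var2': np.ones(5)}))

# read-modify-rewrite to "update" a single column
with pd.HDFStore('storage.h5', mode='a') as store:
    grp = store.get('grp1')        # load the whole group table into memory
    grp['var1'] = grp['var1'] * 2  # change just one column in memory
    store.remove('grp1')           # drop the old table on disk
    store.append('grp1', grp)      # write the full table back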
When to use 'data_columns', or should I simply assign True in
HDFStore.append()?
data_columns=True will index all your columns - IMO it's a waste of resources unless you are going to use all of them in the where parameter (i.e. if all columns should be indexed).
I would specify there only those columns that will often be used for searching in the where= clause. Consider those columns as the indexed columns of a database table.
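For example, a minimal sketch (the file name and column names are made up for illustration):
import numpy as np
import pandas as pd

df = pd.DataFrame({'rank_var1': np.arange(2000), 'var1': np.random.rand(2000)})

with pd.HDFStore('storage.h5', mode='a') as store:
    # index only the columns you expect to filter on in where= clauses
    store.append('query', df, data_columns=['rank_var1'])
    # indexed (data) columns can then be used directly in where expressions
    subset = store.select('query', where='rank_var1 > 500 & rank_var1 < 1000')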
If I want to query based on condition of rank_var1 > 500 &
rank_var1<1000, but I need columns from other groups. Can I enter the
index received from the rank_var1 condition into the query to get
other columns based on this index (the index is a multi-index with
date and ID)?
I think we would need some reproducible sample data and examples of your queries in order to give a reasonable answer...
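That said, if the group tables are guaranteed to hold exactly the same rows in the same order, one pattern worth trying is to resolve the where condition on the query table into row coordinates and reuse them for the other groups (pandas also offers HDFStore.select_as_multiple for this kind of multi-table select). A rough sketch with made-up table and column names:
import numpy as np
import pandas as pd

# toy setup: two tables with exactly the same rows in the same order
idx = pd.date_range('2000-01-01', periods=1000)
with pd.HDFStore('storage.h5', mode='w') as store:
    store.append('query', pd.DataFrame({'rank_var1': np.arange(1000)}, index=idx),
                 data_columns=['rank_var1'])
    store.append('group_a', pd.DataFrame({'var2': np.random.rand(1000)}, index=idx))

with pd.HDFStore('storage.h5', mode='r') as store:
    # resolve the rank condition into row coordinates on the 'query' table
    coords = store.select_as_coordinates('query', 'rank_var1 > 500 & rank_var1 < 600')
    # reuse those coordinates to pull the matching rows from another group
    other = store.select('group_a', where=coords)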
Copies of the storage might be used on win/ux systems. I assume it is
perfectly compatible, anything to consider here?
Yes, it should be fully compatible
I plan to use pd.HDFStore(str(self.path), mode='a', complevel=9,
complib='blosc'). Any concerns regarding complevel or complib?
Test it with your data - results might depend on dtypes, the number of unique values, etc. You may also want to consider the lzo complib - it might be faster in some use cases. Check this. Sometimes a high complevel doesn't give you a better compression ratio, but will be slower (see the results of my old comparison).
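A quick way to decide is to benchmark on your own data: write the same frame with different complib/complevel combinations and compare timing and file size. A minimal sketch (random data, made-up paths; lzo only works if the lzo libraries are installed):
import os
import time
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.standard_normal((500000, 20)))

for complib in ('blosc', 'zlib', 'lzo'):
    path = 'test_{}.h5'.format(complib)
    start = time.time()
    df.to_hdf(path, key='df', format='table', complib=complib, complevel=9)
    elapsed = time.time() - start
    print(complib, round(elapsed, 2), 'seconds,',
          round(os.path.getsize(path) / 1e6, 1), 'MB')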

Python : Separating a .txt file into columns and finding the most frequent data item in one of the columns

I read from a file and stored the data into artists_tags with column names.
Now this file has multiple columns, and I need to generate a new data structure that keeps 2 columns from artists_tags as they are and has the most frequent value from the 'Tag' column as the 3rd column value.
Here is what I have written as of now:
import pandas as pd
from collections import Counter

def parse_artists_tags(filename):
    df = pd.read_csv(filename, sep="|", names=["ArtistID", "ArtistName", "Tag", "Count"])
    return df

def parse_user_artists_matrix(filename):
    df = pd.read_csv(filename)
    return df

# artists_tags = parse_artists_tags(DATA_PATH + "\\artists-tags.txt")
artists_tags = parse_artists_tags("C:\\Users\\15-J001TX\\Documents\\ml_task\\artists-tags.txt")
#print(artists_tags)
user_art_mat = parse_user_artists_matrix("C:\\Users\\15-J001TX\\Documents\\ml_task\\userart-mat-training.csv")
#print ("Number of tags {0}".format(len(artists_tags))) # Change this line. Should be 952803
#print ("Number of artists {0}".format(len(user_art_mat))) # Change this line. Should be 17119

# TODO Implement this. You can change the function arguments if necessary
# Return a data structure that contains (artist id, artist name, top tag) for every artist
def calculate_top_tag(all_tags):
    temp = all_tags.Tag
    a = Counter(temp)
    a = a.most_common()
    print(a)
    top_tags = all_tags.ArtistID, all_tags.ArtistName, a
    return top_tags

top_tags = calculate_top_tag(artists_tags)

# Print the top tag for Nirvana
# Artist ID for Nirvana is 5b11f4ce-a62d-471e-81fc-a69a8278c7da
# Should be 'Grunge'
print("Top tag for Nirvana is {0}".format(top_tags)) # Complete this line
In the last method calculate_top_tag I don't understand how to choose the most frequent value from the 'Tag' column and put it as the third column for top_tags before returning it.
I am new to Python and my knowledge of syntax and data structures is limited. I did try the various solutions mentioned for finding the most frequent value in a list, but they seem to display the entire column rather than one particular value. I know this is some trivial syntax issue, but after searching for a long time I still cannot figure it out.
edit 1 :
I need to find the most common tag for a particular artist and not the most common overall.
But again, I don't know how to.
edit 2 :
here is the link to the data files:
https://github.com/amplab/datascience-sp14/raw/master/hw2/hw2data.tar.gz
I'm sure there is a more succinct way of doing it, but this should get you started:
# returns a df grouped by ArtistID and Tag
tag_counts = artists_tags.groupby(['ArtistID', 'Tag'])
# sum up tag counts and sort in descending order
tag_counts = tag_counts.sum().sort_values('Count', ascending=False).reset_index()
# keep only the top-ranking tag per artist
top_tags = tag_counts.groupby('ArtistID').first()
# top_tags is now a dataframe which contains the top tag for every artist
# We can simply look up the top tag for Nirvana via its index:
top_tags.loc['5b11f4ce-a62d-471e-81fc-a69a8278c7da', 'Tag']
# 'Grunge'

Python Import data dictionary and pattern

If I have data as:
Code, data_1, data_2, data_3, [....], data204700
a,1,1,0, ... , 1
b,1,0,0, ... , 1
a,1,1,0, ... , 1
c,0,1,0, ... , 1
b,1,0,0, ... , 1
etc.: the same code appears on several rows with different values (0, 1, or ? when not known).
I need to build a big matrix that I can then analyze.
How can I import the data into a dictionary?
I want to use a dictionary keyed by column (204,700 + 1 columns).
Is there a built-in function (or package) that returns the pattern to me?
(I expect a percentage pattern.) I mean something like: 90% of 1s in column 1, 80% of 1s in column 2.
Alright, so I am going to assume you want this in a dictionary for storage purposes, and I will tell you that you don't want that with this kind of data: use a pandas DataFrame.
This is how you get your data into a dataframe:
import pandas as pd
my_file = 'file_name'
df = pd.read_csv(my_file)
Now, you don't need a package to return the pattern you are looking for; just write a simple algorithm that returns it:
def one_percentage(data):
    # get the total number of rows for calculating percentages
    size = len(data)
    # get the dtype of the data columns so we only grab the correct columns
    x = data.columns[1]
    x = data[x].dtype
    # list of tuples holding the number of 1s and the column names
    ones = [(i, sum(data[i])) for i in data if data[i].dtype == x]
    my_dict = {}
    # create a dictionary with column names and their fraction of 1s
    for x in ones:
        percent = x[1] / float(size)
        my_dict[x[0]] = percent
    return my_dict
Now, if you want to get the percentage of ones in any one column, this is what you do:
percentages = one_percentage(df)
column_name = 'any_column_name'
print(percentages[column_name] * 100)
Now, if you want it to do every single column, you can grab all of the column names and loop through them:
columns = [name for name in percentages]
for name in columns:
    print(str(percentages[name] * 100) + "% of 1s in column " + name)
let me know if you need anything else!
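As a follow-up, if all you need is the share of 1s per data column, a vectorized pandas version gives the same kind of result in a couple of lines. This is a sketch that assumes the df read above, where the first column is Code and the rest hold 0/1/? values ('?' entries are ignored in the mean):
import pandas as pd

# Coerce the data columns to numeric ('?' becomes NaN), then take the mean:
# for 0/1 data the mean is the fraction of 1s; multiply by 100 for a percentage.
data_cols = df.columns[1:]
percentages = df[data_cols].apply(pd.to_numeric, errors='coerce').mean() * 100
print(percentages)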
