pandas SparseDataFrame insertion - python

I would like to create a pandas SparseDataFrame with dimensions 250,000 x 250,000. In the end, my aim is to come up with a big adjacency matrix.
So far, creating that data frame is no problem:
from numpy import arange
from pandas import SparseDataFrame
df = SparseDataFrame(columns=arange(250000), index=arange(250000))
But when I try to update the DataFrame, I run into massive memory/runtime problems:
index = 1000
col = 2000
value = 1
df.set_value(index, col, value)
I checked the source:
def set_value(self, index, col, value):
    """
    Put single value at passed column and index

    Parameters
    ----------
    index : row label
    col : column label
    value : scalar value

    Notes
    -----
    This method *always* returns a new object. It is currently not
    particularly efficient (and potentially very expensive) but is provided
    for API compatibility with DataFrame
...
The last sentence of the notes describes the problem with using pandas in this case. I really would like to keep on using pandas, but as it stands that seems totally impossible!
Does someone have an idea, how to solve this problem more efficiently?
My next idea is to work with something like nested lists/dicts or so...
thanks for your help!

Do it this way
df = pd.SparseDataFrame(columns=np.arange(250000), index=np.arange(250000))
s = df[2000].to_dense()
s[1000] = 1
df[2000] = s
In [11]: df.ix[1000,2000]
Out[11]: 1.0
So the procedure is to swap out an entire series at a time. The SDF will convert the passed-in series to a SparseSeries. (You can do it yourself to see what they look like with s.to_sparse().) The SparseDataFrame is basically a dict of these SparseSeries, which themselves are immutable. Sparseness will have some changes in 0.12 to better support these types of operations (e.g. setting will work efficiently).
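If you need to set cells repeatedly, the swap can be wrapped in a small helper (a minimal sketch of exactly the pattern above; set_sparse_value is just an illustrative name):
def set_sparse_value(df, index, col, value):
    # densify just the one column, mutate it cheaply as a dense series,
    # then swap it back in; reassignment converts it back to a SparseSeries
    s = df[col].to_dense()
    s[index] = value
    df[col] = s

set_sparse_value(df, 1000, 2000, 1)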

Related

Select all rows in Python pandas

I have a function that aims at printing the sum along a column of a pandas DataFrame after filtering on some rows to be defined, and the percentage this quantity makes up of the same sum without any filter:
import numpy as np

def my_function(df, filter_to_apply, col):
    my_sum = np.sum(df[filter_to_apply][col])
    print(my_sum)
    print(my_sum / np.sum(df[col]))
Now I am wondering if there is any way to have a filter_to_apply that actually doesn't do any filter (i.e. keeps all rows), to keep using my function (that is actually a bit more complex and convenient) even when I don't want any filter.
So, some filter_f1 that would do: df[filter_f1] = df and could be used with other filters: filter_f1 & filter_f2.
One possible answer is: df.index.isin(df.index) but I am wondering if there is anything easier to understand (e.g. I tried to use just True but it didn't work).
A Python slice object, i.e. slice(None), acts as an object that selects all indexes in an indexable object. So df[slice(None)] would select all rows in the DataFrame. You can store that in a variable as an initial value which you can further refine in your logic:
filter_to_apply = slice(None) # initialize to select all rows
... # logic that may set `filter_to_apply` to something more restrictive
my_function(df, filter_to_apply, col)
This is a way to select all rows:
df.iloc[range(len(df))]
as is
df[:]
But I haven't figured out a way to pass : as an argument.
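For what it's worth, the bare : inside square brackets is just syntactic sugar for a slice object, so slice(None) is the thing you can pass around as an argument:
select_all = slice(None)  # equivalent to the : in df[:]
df[select_all]            # same as df[:], selects all rows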
There's an indexer called loc on pandas DataFrames that filters rows. You could do something like this:
df2 = df.loc[<Filter here>]
#Filter can be something like df['price']>500 or df['name'] == 'Brian'
#basically something that for each row returns a boolean
total = df2['ColumnToSum'].sum()
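As a concrete sketch of that pattern (price and ColumnToSum are just the placeholder names from the comments above):
import pandas as pd

df = pd.DataFrame({'price': [100, 600, 700], 'ColumnToSum': [1, 2, 3]})
# filter and sum in one step: rows where price > 500 contribute 2 + 3
total = df.loc[df['price'] > 500, 'ColumnToSum'].sum()
print(total)  # 5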

How to return the index value of an element in a pandas dataframe

I have a dataframe of corporate actions for a specific equity. it looks something like this:
0          Declared Date     Ex-Date Record Date
BAR_DATE
2018-01-17    2017-02-21  2017-08-09  2017-08-11
2018-01-16    2017-02-21  2017-05-10  2017-06-05
except that it has hundreds of rows, but that is unimportant. I created the index "BAR_DATE" from one of the columns, which is where the 0 above BAR_DATE comes from.
What I want to do is to be able to reference a specific element of the dataframe and return the index value, or BAR_DATE. I think it would go something like this:
index_value = cacs.iloc[5, :].index.get_values()
except index_value becomes the column names, not the index. Now, this may stem from a poor understanding of indexing in pandas dataframes, so this may or may not be really easy to solve for someone else.
I have looked at a number of other questions including this one, but it returns column values as well.
Your code is really close, but you took it just one step further than you needed to.
# creates a slice of the dataframe where the row is at iloc 5 (row number 5) and where the slice includes all columns
slice_of_df = cacs.iloc[5, :]
# returns the index of the slice
# this will be an Index() object
index_of_slice = slice_of_df.index
From here we can use the documentation on the Index object: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Index.html
# turns the index into a list of values in the index
index_list = index_of_slice.to_list()
# gets the first index value
first_value = index_list[0]
The most important thing to remember about the Index is that it is an object of its own, and thus we need to change it to the type we expect to work with if we want something other than an index. This is where documentation can be a huge help.
EDIT: It turns out that iloc in this case returns a Series object, which is why the solution above returns the wrong value. Knowing this, the new solution would be:
# creates a Series object from row 5 (technically the 6th row)
row_as_series = cacs.iloc[5, :]
# when a row is extracted as a Series, its name is that row's index label
index_of_series = row_as_series.name
This would be the approach for single-row indexing. You would use the former approach with multi-row indexing where the return value is a DataFrame and not a Series.
Unfortunately, I don't know how to coerce the Series into a DataFrame for single-row slicing beyond explicit conversion:
row_as_df = pd.DataFrame(cacs.iloc[5, :])
While this will work, and the first approach will happily take this and return the index, there is likely a reason why pandas doesn't return a DataFrame for single-row slicing, so I am hesitant to offer this as a solution.
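For a concrete illustration of the .name approach, here is a toy frame standing in for cacs:
import pandas as pd

cacs = pd.DataFrame(
    {'Declared Date': ['2017-02-21'], 'Ex-Date': ['2017-08-09']},
    index=pd.Index(['2018-01-17'], name='BAR_DATE'),
)
row_as_series = cacs.iloc[0, :]  # a Series: one row, indexed by the column names
print(row_as_series.name)        # '2018-01-17' -- the BAR_DATE index value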

Efficient way of applying specific function to structured column in Spark data frame?

I have data in a Spark data frame, with a column col that contains structured data of the form:
--------- col ---------   # column whose elements are structures
field0   field1   ...     # StructType with StructFields (variable names and count)
[1,2,3]  [4,5]    [6]     # each field is of type ArrayType
[1,2]    [3]      []
...
where the number and the names of the fields are not fixed.
What is the most efficient way of calculating the total number of elements in each row? In the above example, the expected resulting data frame would thus be:
num_elements
6
3
…
There is always the solution of a user defined function:
import pyspark.sql.functions
from pyspark.sql.types import IntegerType

def num_elements(all_arrays_in_row):
    return sum(map(len, all_arrays_in_row))

num_elements = pyspark.sql.functions.udf(num_elements, IntegerType())
data_frame.select(num_elements(data_frame.col)).show()  # number of elements in each row
Now, I am not sure whether this is generally efficient, because:
Function num_elements() is in Python.
If the fields happen to not be stored together for some reason, the map() forces a fetch of each array before calculating their length.
More generally, a "pure" Spark approach would be more efficient, but it is eluding me. What I tried so far is the following, but this is way more cumbersome than the approach above, and is also not complete:
Get the field names field0, etc. with [field.name for field in data_frame.select("col").schema.fields[0].dataType.fields] (cumbersome).
For each field name, efficiently calculate the size of its array:
sizes_one_field = data_frame.select(pyspark.sql.functions.size(
    data_frame.col.getField(field_name)))
Now, I am stuck at this point because I am not sure how to sum together the 1-column data frames sizes_one_field (there is one for each field name). Plus, maybe there is a way of directly applying the size() function to each field of column col in Spark (through some kind of map?)? Or some completely different approach to getting the total number of elements in each row?
You can try something like the following:
from pyspark.sql import functions as f
result = df.select(sum((f.size(df[col_name]) for col_name in df.columns), f.lit(0)))
This solution uses the pyspark.sql built-in functions and will be executed in an optimized way. For more information about these functions, you can check the pyspark documentation.
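If the arrays live inside a single struct column named col, as in the question, the same trick can be pointed at the struct's fields instead of the top-level columns (a sketch, assuming the schema described in the question):
from pyspark.sql import functions as f

fields = df.schema["col"].dataType.fields  # the StructFields of the struct column
result = df.select(
    sum((f.size(f.col("col").getField(fld.name)) for fld in fields),
        f.lit(0)).alias("num_elements")
)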

Should pandas dataframes be nested?

I am creating a python script that drives an old fortran code to locate earthquakes. I want to vary the input parameters to the fortran code in the python script and record the results, as well as the values that produced them, in a dataframe. The results from each run are also convenient to put in a dataframe, leading me to a situation where I have a nested dataframe (i.e. a dataframe assigned to an element of a dataframe). So for example:
import pandas as pd
import numpy as np

def some_operation(row):
    results = np.random.rand(50, 3) * row['p1'] / row['p2']
    res = pd.DataFrame(results, columns=['foo', 'bar', 'rms'])
    return res

# Init master df
df_master = pd.DataFrame(columns=['p1', 'p2', 'results'], index=range(3))
df_master['p1'] = np.random.rand(len(df_master))
df_master['p2'] = np.random.rand(len(df_master))
df_master = df_master.astype(object)  # make sure generic types can be used

# loop over each row, call some_operation and store results DataFrame
for ind, row in df_master.iterrows():
    df_master.loc[ind, "results"] = some_operation(row)
Which raises this exception:
ValueError: Incompatible indexer with DataFrame
It works as expected, however, if I change the last line to this:
df_master["results"][ind] = some_operation(row)
I have a few questions:
Why does .loc (and .ix) fail when the slice assignment succeeds? If the some_operation function returned a list, dictionary, etc., it seems to work fine.
Should the DataFrame be used in this way? I know that dtype object can be ultra slow for sorting and whatnot, but I am really just using the dataframe as a convenient container, because the column/index notation is quite slick. If DataFrames should not be used in this way, is there a similar alternative? I was looking at the Panel class but I am not sure if it is the proper solution for my application. I would hate to forge ahead and apply the hack shown above to some code and then have it not supported in future releases of pandas.
Why does .loc (and .ix) fail when the slice assignment succeeds? If the some_operation function returned a list, dictionary, etc. it seems to work fine.
This is a strange little corner case of the code. It stems from the fact that if the item being assigned is a DataFrame, loc and ix assume that you want to fill the given indices with the content of the DataFrame. For example:
>>> df1 = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
>>> df2 = pd.DataFrame({'a': [100], 'b': [200]})
>>> df1.loc[[0], ['a', 'b']] = df2
>>> df1
     a    b
0  100  200
1    2    5
2    3    6
If this syntax also allowed storing a DataFrame as an object, it's not hard to imagine a situation where the user's intent would be ambiguous, and ambiguity does not make a good API.
Should the DataFrame be used in this way?
As long as you know the performance drawbacks of the method (and it sounds like you do) I think this is a perfectly suitable way to use a DataFrame. For example, I've seen a similar strategy used to store the trained scikit-learn estimators in cross-validation across a large grid of parameters (though I can't recall the exact context of this at the moment...)
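As an aside, one way to sidestep the indexer quirk entirely is to build the whole object column in one go, since plain column assignment accepts a list of arbitrary objects (a sketch based on the question's setup; I'd expect it to behave the same as the slice-assignment workaround):
# one results DataFrame per row, assigned as a single object column
df_master["results"] = [some_operation(row) for _, row in df_master.iterrows()]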

How to self-reference column in pandas Data Frame?

In Python's Pandas, I am using the Data Frame as such:
drinks = pandas.read_csv(data_url)
Where data_url is a string URL to a CSV file
When indexing the frame for all "light drinkers", where a light drinker is defined as having 1 drink, the following is written:
drinks.light_drinker[drinks.light_drinker == 1]
Is there a more DRY-like way to self-reference the "parent"? I.e. something like:
drinks.light_drinker[self == 1]
You can now use query or assign depending on what you need:
drinks.query('light_drinker == 1')
or to mutate the df:
df.assign(strong_drinker = lambda x: x.light_drinker + 100)
Old answer
Not at the moment, but an enhancement along these lines is being discussed here. For simple cases, where might be enough. The proposed API might look like this:
df.set(new_column=lambda self: self.light_drinker*2)
In the most current version of pandas, .where() also accepts a callable!
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.where.html?highlight=where#pandas.DataFrame.where
So, the following is now possible:
drinks.light_drinker.where(lambda x: x == 1)
which is particularly useful in method-chains. However, this will return only the Series (not the DataFrame filtered based on the values in the light_drinker column). This is consistent with your question, but I will elaborate for the other case.
To get a filtered DataFrame, use:
drinks.where(lambda x: x.light_drinker == 1)
Note that this will keep the shape of the DataFrame (meaning you will have rows where all entries are NaN, because the condition failed for the light_drinker value at that index).
If you don't want to preserve the shape of the DataFrame (i.e you wish to drop the NaN rows), use:
drinks.query('light_drinker == 1')
Note that the items in DataFrame.index and DataFrame.columns are placed in the query namespace by default, meaning that you don't have to reference self.
I don't know of any way to reference parent objects like self or this in Pandas, but perhaps another way of doing what you want which could be considered more DRY is where().
drinks.where(drinks.light_drinker == 1, inplace=True)
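To see the shape difference between where and query concretely (a toy frame with just the assumed light_drinker column):
import pandas as pd

drinks = pd.DataFrame({'light_drinker': [0, 1, 2]})
print(drinks.where(drinks.light_drinker == 1))  # same shape, non-matches become NaN
print(drinks.query('light_drinker == 1'))       # non-matching rows are dropped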