I have a DataFrame which contains the results of multiple aggregation functions applied to multiple columns, for example:
import numpy as np
import pandas as pd

bar = pd.DataFrame([
{'a': 1, 'b': 2, 'grp': 0}, {'a': 3, 'b': 8, 'grp': 0},
{'a': 2, 'b': 2, 'grp': 1}, {'a': 4, 'b': 5, 'grp': 1}
])
bar.groupby('grp').agg([np.mean, np.std])
a b
mean std mean std
grp
0 2 1.414214 5.0 4.242641
1 3 1.414214 3.5 2.121320
I want to combine the aggregation results to lists (or tuples):
grp a b
0 [2, 1.414214] [5.0, 4.242641]
1 [3, 1.414214] [3.5, 2.121320]
What would be the proper way to do this?
Thanks in advance!
If you have to use lists in columns, you can do:
In [60]: bar.groupby('grp').agg(lambda x: [x.mean(), x.std()])
Out[60]:
a b
grp
0 [2.0, 1.4142135623730951] [5.0, 4.242640687119285]
1 [3.0, 1.4142135623730951] [3.5, 2.1213203435596424]
Storing data like this is not recommended with pandas, though.
What would be the proper way to do this?
There is no proper way. Pandas was never designed to hold lists in series / columns. You can concoct expensive workarounds, but these are not recommended.
The main reason holding lists in series is not recommended is that you lose all the vectorised functionality that comes with numeric series backed by NumPy arrays held in contiguous memory blocks. Your series will be of object dtype, which represents a sequence of pointers, and you will lose the associated memory and performance benefits.
See also What are the advantages of NumPy over regular Python lists? The arguments in favour of Pandas are the same as for NumPy.
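As a quick illustration (a minimal sketch reusing the bar frame from the question), the list-valued result ends up with object dtype, whereas the plain MultiIndex-column aggregation keeps numeric dtypes:
import numpy as np
import pandas as pd

bar = pd.DataFrame([
    {'a': 1, 'b': 2, 'grp': 0}, {'a': 3, 'b': 8, 'grp': 0},
    {'a': 2, 'b': 2, 'grp': 1}, {'a': 4, 'b': 5, 'grp': 1}
])

# Lists per cell -> object dtype, so no vectorised operations
print(bar.groupby('grp').agg(lambda x: [x.mean(), x.std()]).dtypes)

# MultiIndex columns -> numeric dtypes are preserved
print(bar.groupby('grp').agg(['mean', 'std']).dtypes)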
Here is my data:
df:
id sub_id
A 1
A 2
B 3
B 4
and I have the following array:
[[1,2],
[2,5],
[1,4],
[7,8]]
Here is my code:
from collections import defaultdict
sub_id_array_dict = defaultdict(dict)
for i, s, a in zip(df['id'].to_list(), df['sub_id'].to_list(), arrays):
sub_id_array_dict[i][s] = a
Now, my actual dataframe includes a total of 100M rows (unique sub_id) with 500K unique ids. Ideally, I'd like to avoid a for loop.
Any help would be much appreciated.
Assuming the arrays variable has the same number of rows as the DataFrame:
df['value'] = arrays
Then convert into a dictionary by grouping:
df.groupby('id').apply(lambda x: dict(zip(x.sub_id, x.value))).to_dict()
Output
{'A': {1: [1, 2], 2: [2, 5]}, 'B': {3: [1, 4], 4: [7, 8]}}
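Put together as a self-contained sketch with the sample data from the question (note that groupby().apply() still iterates over the groups in Python, so with 500K unique ids it is not truly vectorised, but it does avoid the explicit per-row loop):
import pandas as pd

df = pd.DataFrame({'id': ['A', 'A', 'B', 'B'], 'sub_id': [1, 2, 3, 4]})
arrays = [[1, 2], [2, 5], [1, 4], [7, 8]]

df['value'] = arrays
nested = df.groupby('id').apply(lambda x: dict(zip(x.sub_id, x.value))).to_dict()
print(nested['A'][1])   # [1, 2]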
Is there an easy and straightforward way to load the output from sp.stats.describe() into a DataFrame, including the value names? It doesn't seem to be a dictionary or similar format. Of course I can manually attach the relevant column names (see below), but I was wondering whether it is possible to load it directly into a DataFrame with named columns.
import pandas as pd
import scipy as sp
import scipy.stats  # needed so that sp.stats is available

data = pd.DataFrame({'a': [1, 2, 3, 4, 5], 'b': [1, 2, 3, 4, 5]})
a = sp.stats.describe(data['a'])
pd.DataFrame(a)
pd.DataFrame(a).transpose().rename(columns={0: 'N', 1: 'Min,Max',
2: 'Mean', 3: 'Var',
4: 'Skewness',
5: 'Kurtosis'})
You can use _fields for column names from the named tuple:
a = sp.stats.describe(data['a'])
df = pd.DataFrame([a], columns=a._fields)
print (df)
nobs minmax mean variance skewness kurtosis
0 5 (1, 5) 3.0 2.5 0.0 -1.3
It is also possible to create a dictionary from the named tuple with _asdict:
d = sp.stats.describe(data['a'])._asdict()
df = pd.DataFrame([d], columns=d.keys())
print (df)
nobs minmax mean variance skewness kurtosis
0 5 (1, 5) 3.0 2.5 0.0 -1.3
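If you need the same summary for every column, here is a small sketch building on the answer above (one output row per column of data):
rows = {col: sp.stats.describe(data[col])._asdict() for col in data.columns}
summary = pd.DataFrame.from_dict(rows, orient='index')
print(summary)
   nobs  minmax  mean  variance  skewness  kurtosis
a     5  (1, 5)   3.0       2.5       0.0      -1.3
b     5  (1, 5)   3.0       2.5       0.0      -1.3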
I have a dict like this:
dic = {}
dic['A'] = 1
dic['B'] = np.array([1,2,3])
dic['C'] = np.array([1,2,3,4])
dic['D'] = np.array([6,7])
Then I tried to put them into a DataFrame (I may also insert more rows later, and the array length for each element may vary). For various reasons, I want to keep each array as a single object per column, so that when printed it looks like:
A B C D
1 [1,2,3] [1,2,3,4] [6,7]
......
[2,3] [7,8] [5,6,7,2] 4
When I try to do this with:
pd.DataFrame.from_dict(dic)
I always get the error : ValueError: arrays must all be same length
Is there any way to keep each entire array as a single element, given that sometimes I also have plain scalar values?
I am not sure why you need the input to be a dictionary, but if you pass the elements as NumPy arrays (one per row), pandas fills the missing values with NaN.
pd.DataFrame([np.array([1,2,3]),np.array([1,2,3,4]),np.array([6,7])],columns=['A','B','C','D'])
Output:
A B C D
0 1 2 3.0 NaN
1 1 2 3.0 4.0
2 6 7 NaN NaN
IIUC this should work
import pandas as pd
import numpy as np
df = pd.DataFrame({"A":[1, np.array([2,3])],
"B":[np.array([1,2,3]), np.array([7,8])],
"C":[np.array([1,2,3,4]), np.array([5,6,7,2])],
"D":[np.array([6,7]), 4]})
So df.to_dict() returns
{'A': {0: 1, 1: array([2, 3])},
'B': {0: array([1, 2, 3]), 1: array([7, 8])},
'C': {0: array([1, 2, 3, 4]), 1: array([5, 6, 7, 2])},
'D': {0: array([6, 7]), 1: 4}}
UPDATE
If you want to save this to a file, you should consider using lists instead of NumPy arrays and a delimiter such as ';', or convert the arrays to strings first if you want to maintain this shape.
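A minimal sketch of the string-conversion option (the file name and delimiter here are just illustrative):
# Turn each array cell into a single string field so the row shape survives in the CSV
df_str = df.applymap(lambda v: str(list(v)) if isinstance(v, np.ndarray) else v)
df_str.to_csv('data.csv', sep=';', index=False)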
I'm trying to slice into a DataFrame that has a MultiIndex composed of an IntervalIndex and a regular Index. Example code:
import pandas as pd
from pandas import Interval as ntv
df = pd.DataFrame.from_records([
{'id': 1, 'var1': 0.1, 'ntv': ntv(0,10), 'E': 1},
{'id':2, 'var1': 0.5, 'ntv': ntv(0,12), 'E': 0}
], index=('ntv', 'id'))
Looks like this:
E var1
ntv id
(0, 10] 1 1 0.1
(0, 12] 2 0 0.5
What I would like to do is slice into the DataFrame at a specific value and return all rows whose interval contains that value. Ex:
df.loc[4]
should return (trivially)
E var1
id
1 1 0.1
2 0 0.5
The problem is I keep getting a TypeError about the index, and the docs show a similar operation (but on a single-level index) that does produce what I'm looking for.
TypeError: only integer scalar arrays can be converted to a scalar index
I've tried many things, nothing seems to work normally. I could include the id column inside the dataframe, but I'd rather keep my index unique, and I would constantly be calling set_index('id').
I feel like either a) I'm missing something about MultiIndexes or b) there is a bug / ambiguity with using an IntervalIndex in a MultiIndex.
Since we are dealing with intervals, there is a method called get_loc that finds the rows whose interval contains the value. To show what I mean:
from pandas import Interval as ntv
df = pd.DataFrame.from_records([
{'id': 1, 'var1': 0.1, 'ntv': ntv(0,10), 'E': 1},
{'id':2, 'var1': 0.5, 'ntv': ntv(0,12), 'E': 0}
], index=('ntv', 'id'))
df.iloc[(df.index.get_level_values(0).get_loc(4))]
E var1
ntv id
(0, 10] 1 1 0.1
(0, 12] 2 0 0.5
df.iloc[(df.index.get_level_values(0).get_loc(11))]
E var1
ntv id
(0, 12] 2 0 0.5
This also works if you have multiple rows of data for one interval, i.e.
df = pd.DataFrame.from_records([
{'id': 1, 'var1': 0.1, 'ntv': ntv(0,10), 'E': 1},
{'id': 3, 'var1': 0.1, 'ntv': ntv(0,10), 'E': 1},
{'id':2, 'var1': 0.5, 'ntv': ntv(0,12), 'E': 0}
], index=('ntv', 'id'))
df.iloc[(df.index.get_level_values(0).get_loc(4))]
E var1
ntv id
(0, 10] 1 1 0.1
3 1 0.1
(0, 12] 2 0 0.5
If you time this against a list comprehension, this approach is much faster for large DataFrames, i.e.
ndf = pd.concat([df]*10000)
%%timeit
ndf.iloc[ndf.index.get_level_values(0).get_loc(4)]
10 loops, best of 3: 32.8 ms per loop
%%timeit
intervals = ndf.index.get_level_values(0)
mask = [4 in i for i in intervals]
ndf.loc[mask]
1 loop, best of 3: 193 ms per loop
So I did a bit of digging to try and understand the problem. If I try to run your code the following happens.
You end up trying to index into the index's labels with
"slice(array([0, 1], dtype=int64), array([1, 2], dtype=int64), None)"
(when I say index_type I mean the Pandas datatype)
An index_type's label is a list of indices that map to the index_type's levels array. Here is an example from the documentation.
>>> arrays = [[1, 1, 2, 2], ['red', 'blue', 'red', 'blue']]
>>> pd.MultiIndex.from_arrays(arrays, names=('number', 'color'))
MultiIndex(levels=[[1, 2], ['blue', 'red']],
labels=[[0, 0, 1, 1], [1, 0, 1, 0]],
names=['number', 'color'])
Notice how the second list in labels connects to the order of levels: levels[1][1] is equal to 'red', and levels[1][0] is equal to 'blue'.
Anyhow, this is all to say that I don't believe IntervalIndex is meant to be used in an overlapping fashion. If you look at the original proposal for it:
https://github.com/pandas-dev/pandas/issues/7640
"A IntervalIndex would be a monotonic and non-overlapping one-dimensional array of intervals."
My suggestion is to move the interval into a column (a rough sketch follows below). You could probably write a simple function with numba to test whether a number falls in each interval. Do you mind explaining how you're benefiting from the interval?
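For illustration, here is a minimal sketch of that idea without numba, keeping the interval as an ordinary column (named df2 here to avoid clashing with the question's df) and filtering with a per-row Python check, so it is not vectorised:
import pandas as pd
from pandas import Interval as ntv

# Interval kept as an ordinary column, id as the index
df2 = pd.DataFrame.from_records([
    {'id': 1, 'var1': 0.1, 'ntv': ntv(0, 10), 'E': 1},
    {'id': 2, 'var1': 0.5, 'ntv': ntv(0, 12), 'E': 0},
], index='id')

# Boolean mask: keep the rows whose interval contains the value
print(df2[df2['ntv'].map(lambda iv: 4 in iv)])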
Piggybacking off of @Dark's solution, Index.get_loc just calls Index.get_indexer under the hood, so it might be more efficient to call the underlying method when you don't have additional parameters and red tape.
idx = df.index.get_level_values(0)
df.iloc[idx.get_indexer([4])]
My originally proposed solution:
intervals = df.index.get_level_values(0)
mask = [4 in i for i in intervals]
df.loc[mask]
Regardless, it's certainly strange that these return two different results, but it does look like it has to do with the index being unique/monotonic/neither of the two:
df.reset_index(level=1, drop=True).loc[4] # good
df.loc[4] # TypeError
This is not really a solution and I don't fully understand it, but I think it may have to do with your interval index not being monotonic (in that you have overlapping intervals). I suppose overlapping intervals could in a sense still be considered monotonic, so perhaps alternatively you could say the overlap means the index is not unique?
Anyway, check out this github issue:
ENH: Implement MultiIndex.is_monotonic_decreasing #17455
And here's an example with your data, but changing the intervals to be non-overlapping (0,6) & (7,12):
df = pd.DataFrame.from_records([
{'id': 1, 'var1': 0.1, 'ntv': ntv(0, 6), 'E': 1},
{'id': 2, 'var1': 0.5, 'ntv': ntv(7,12), 'E': 0}
], index=('ntv', 'id'))
Now, loc works OK:
df.loc[4]
E var1
id
1 1 0.1
def check_value(num):
    return df[[num in i for i in map(lambda x: x[0], df.index)]]
a = check_value(4)
a
>>
E var1
ntv id
(0, 10] 1 1 0.1
(0, 12] 2 0 0.5
If you want to drop the index level, you can add:
a.index = a.index.droplevel(0)
I have two dictionaries that I use as sparse vectors:
dict1 = {'a': 1, 'b': 4}
dict2 = {'a': 2, 'c': 2}
I wrote my own __add__ function to get this desired result:
dict1 = {'a': 3, 'b': 4, 'c': 2}
It is important that I know the strings 'a', 'b' and 'c' for each corresponding value. Just making sure that I add up the correct dimensions is not enough. I will also get many more, previously unknown strings with some values that I just add to my dictionary at the moment.
Now my question: Is there a more efficient data structure out there? I looked at NumPy's arrays and SciPy's sparse matrices but, as far as I understand, they are not really of any help here. Or am I just not seeing the solution?
I could keep keys and values in separate arrays but I don't think I can just use any already existing function to get the desired result.
dict1_keys = np.array(['a', 'b'])
dict1_values = np.array([1, 4])
dict2_keys = np.array(['a', 'c'])
dict2_values = np.array([2, 2])
# is there anything that will efficiently produce the following?
dict1_keys = np.array(['a', 'b', 'c'])
dict1_values = np.array([3, 4, 2])
Perhaps pandas is what you're looking for:
import numpy
import pandas

d1 = pandas.DataFrame(numpy.array([1, 4]), index=['a', 'b'], dtype="int32")
d2 = pandas.DataFrame(numpy.array([2, 2]), index=['a', 'c'], dtype="int32")
d1.add(d2, fill_value=0)
result:
0
a 3
b 4
c 2
@sirfz's pandas approach could be a one-liner using pandas Series:
>>> pd.Series(dict1).add(pd.Series(dict2), fill_value=0)
a 3.0
b 4.0
c 2.0
Or, if your API requires dicts:
>>> dict(pd.Series(dict1).add(pd.Series(dict2), fill_value=0))
{'a': 3.0, 'b': 4.0, 'c': 2.0}
Plus, this should handle mixed inputs of dicts or Series, or even scipy sparse matrix rows and sklearn Vectorizer output (sparse vectors/mappings).
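For example, a minimal sketch of the mixed dict/Series case:
import pandas as pd

dict1 = {'a': 1, 'b': 4}
s2 = pd.Series({'a': 2, 'c': 2})

print(pd.Series(dict1).add(s2, fill_value=0).to_dict())
{'a': 3.0, 'b': 4.0, 'c': 2.0}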