pandas, convert DataFrame to MultiIndex'ed DataFrame

I have a pandas.DataFrame that I want to convert to a MultiIndexed pandas.DataFrame.
import numpy
import pandas
import itertools
xs = numpy.linspace(0, 10, 100)
ys = numpy.linspace(0, 0.1, 20)
zs = numpy.linspace(0, 5, 200)
def func(x, y, z):
    return x * y / z
vals = list(itertools.product(xs, ys, zs))
result = [func(x, y, z) for x, y, z in vals]
# Original DataFrame.
df = pandas.DataFrame(vals, columns=['x', 'y', 'z'])
df = pandas.concat((pandas.DataFrame(result, columns=['result']), df), axis=1)
# I want to turn `df` into this `df2`.
index = pandas.MultiIndex.from_tuples(vals, names=['x', 'y', 'z'])
df2 = pandas.DataFrame(result, columns=['result'], index=index)
Note that in this example I create what I want and what I have.
So, in real life I would start with df and want to turn it into df2 (without access to vals and result); how do I do this?

You need set_index:
print (df2.head())
                      result
x   y   z
0.0 0.0 0.000000         NaN
        0.025126         0.0
        0.050251         0.0
        0.075377         0.0
        0.100503         0.0
print (df.set_index(['x','y','z']).head())
                      result
x   y   z
0.0 0.0 0.000000         NaN
        0.025126         0.0
        0.050251         0.0
        0.075377         0.0
        0.100503         0.0
If you need to compare both DataFrames, you first have to replace NaN with a common value, otherwise you get False (NaN never compares equal to NaN):
print (df.set_index(['x','y','z']).eq(df2).all())
result False
dtype: bool
print (numpy.nan == numpy.nan)
False
print (df.fillna(1).set_index(['x','y','z']).eq(df2.fillna(1)).all())
result True
dtype: bool
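Alternatively (a sketch, not part of the original answer): DataFrame.equals treats NaNs at matching positions as equal, so the whole-frame comparison works without the fillna round-trip:
# equals() considers NaNs in the same locations equal,
# so no fillna is needed here.
print (df.set_index(['x','y','z']).equals(df2))  # True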


Pandas - aggregate multiple columns with pivot_table

I have a dataframe like this:
import pandas as pd
import numpy as np
df = pd.DataFrame({"ind0": list("QQQWWWW"), "ind1": list("RRRRSSS"), "vals": range(7), "cols": list("XXYXXYY")})
print(df)
Output:
  ind0 ind1  vals cols
0    Q    R     0    X
1    Q    R     1    X
2    Q    R     2    Y
3    W    R     3    X
4    W    S     4    X
5    W    S     5    Y
6    W    S     6    Y
I want to aggregate the values while creating columns from col, so I thought of using pivot_table:
df_res = df.pivot_table(index=["ind0", "ind1"], columns="cols", values="vals", aggfunc=np.sum).fillna(0)
print(df_res)
This gives me:
cols        X     Y
ind0 ind1
Q    R    1.0   2.0
W    R    3.0   0.0
     S    4.0  11.0
However, I would rather get the sum independent of ind1 categories while keeping the information in this column. So, the desired output would be:
cols        X     Y
ind0 ind1
Q    R    1.0   2.0
W    R,S  7.0  11.0
Is there a way to use pivot_table or pivot to this end or do I have to aggregate for ind1 in a second step? If the latter, how?
You could reset_index on df_res, groupby "ind0", and with agg apply a different function to each column: joining the unique values of "ind1" and summing "X" and "Y".
out = df_res.reset_index().groupby('ind0').agg({'ind1': lambda x: ', '.join(x.unique()), 'X':'sum', 'Y':'sum'})
Or if you have multiple columns that you need to do the same function on, you could also use a dict comprehension:
funcs = {'ind1': lambda x: ', '.join(x.unique()), **{k:'sum' for k in ('X','Y')}}
out = df_res.reset_index().groupby('ind0').agg(funcs)
Output:
cols  ind1    X     Y
ind0
Q        R  1.0   2.0
W     R, S  7.0  11.0
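A one-pass variant (a sketch reusing the same column names, not from the original answer) skips the intermediate df_res and works straight from df: sum per ind0 with pivot_table, then attach the joined ind1 labels:
# Sum per (ind0, cols) directly, then join the unique ind1 labels.
sums = df.pivot_table(index="ind0", columns="cols", values="vals",
                      aggfunc="sum", fill_value=0)
labels = df.groupby("ind0")["ind1"].agg(lambda x: ", ".join(x.unique()))
out = labels.to_frame("ind1").join(sums)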

nested list comprehension to populate dataframe

Objective: Compute some bivariate function, e.g. f(x,y) = sin(x^2 + y^2), for x ∈ [-1,1] and y ∈ [-1,1], and stick the values in a dataframe.
What I have...
import numpy as np
import pandas as pd

def sunbrero(x, y):
    return np.sin(x**2 + y**2)
lower=-1
upper=1
length=1000
X = np.linspace(lower, upper, num=length)
Y = np.linspace(lower, upper, num=length)
Z = pd.DataFrame(index=X,columns=Y)
# [[sunbrero(x,y) for x in X] for y in Y]
for y in Y:
    Z[y] = [sunbrero(x,y) for x in X]
What I'm hoping to do is something that replaces...
for y in Y:
    Z[y] = [sunbrero(x,y) for x in X]
...with something like...
[[Z[y] = sunbrero(x,y) for x in X] for y in Y]
But obviously the above doesn't work.
I know that this works...
Z = [[sunbrero(x,y) for x in X] for y in Y]
...but it creates a list of lists rather than a dataframe.
Note 1: if others think a 2D array is more sensible than a dataframe, I'm open to that as well.
Note 2: I don't think lambda functions work, as they only allow one variable to be defined. Happy to be corrected.
I think the more Panda-esque way of doing this would be to calculate the values first and put them into a dataframe afterwards, not vice versa. Performing the calculations in a list comprehension does not put the internal vector optimizations of Numpy and Pandas to good use.
Instead, you can make use of Numpy's broadcasting to get the matrix first:
length = 5
X = np.linspace(lower, upper, num=length)
Y = np.linspace(lower, upper, num=length)
result = sunbrero(X[:, None], Y)
array([[0.90929743, 0.94898462, 0.84147098, 0.94898462, 0.90929743],
[0.94898462, 0.47942554, 0.24740396, 0.47942554, 0.94898462],
[0.84147098, 0.24740396, 0. , 0.24740396, 0.84147098],
[0.94898462, 0.47942554, 0.24740396, 0.47942554, 0.94898462],
[0.90929743, 0.94898462, 0.84147098, 0.94898462, 0.90929743]])
and put that in a dataframe like so:
df = pd.DataFrame(result, index=X, columns=Y)
          -1.0      -0.5       0.0       0.5       1.0
-1.0  0.909297  0.948985  0.841471  0.948985  0.909297
-0.5  0.948985  0.479426  0.247404  0.479426  0.948985
 0.0  0.841471  0.247404  0.000000  0.247404  0.841471
 0.5  0.948985  0.479426  0.247404  0.479426  0.948985
 1.0  0.909297  0.948985  0.841471  0.948985  0.909297
You're almost there:
df = pd.DataFrame([[sunbrero(x,y) for x in X] for y in Y])
You can do your list comprehension, then have pandas create a dataframe from a list of lists, for example:
list_of_lists = [[1, 2, 3], [4, 5, 6]]
df = pd.DataFrame(list_of_lists)
to get
   0  1  2
0  1  2  3
1  4  5  6
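One caveat worth noting (my addition, not from the answers above): the plain list-of-lists constructor loses the coordinate labels, and since the outer comprehension runs over Y, the rows correspond to y values. Passing index and columns keeps the labels:
# Rows correspond to y (outer loop), columns to x (inner loop).
df = pd.DataFrame([[sunbrero(x, y) for x in X] for y in Y],
                  index=Y, columns=X)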

Generate large amount of sentences based on frequency of the input

My goal is to generate sentences based on the frequency of the input. For example I have input like this:
>>> df = pd.DataFrame({"s":["a", "a", "b", "b", "c", "c"], "m":[["x", "y"], ["x", "z"], ["y", "w", "z"], ["y"], ["z"], ["z"]]})
>>> df.set_index("s")
m
s
a [x, y]
a [x, z]
b [y, w, z]
b [y]
c [z]
c [z]
I want a function gen_sentence(s) that takes an s and generates a random non-empty sentence based on the frequency of the letters in column m. So gen_sentence("a") should generate sentences that all contain x, while 50% of them contain y and 50% contain z.
My intuition is to transform the DataFrame into a DataFrame of frequency, so for the example something like this:
     w  x    y    z
s
a  0.0  1  0.5  0.5
b  0.5  0  1.0  0.5
c  0.0  0  0.0  1.0
And then roll a random number for each column given an s:
def gen_sentence(fdf, s):
    return fdf.columns[np.random.random(len(fdf.columns)) < fdf.loc[s]]
However, I have no clue how to transform the DataFrame in the frequency DataFrame.
The solution will probably involve df.groupby("s").agg(...), but what function do I apply in the aggregation?
In reality the dataset is pretty big, with over 1 million rows, about 500 different words in m and about 100 different values for s, and the frequency table will be sparse: most s's have a frequency of zero for most words in m. Furthermore, I need to generate at least a couple of hundred thousand sentences, so I'm trying to find an implementation that can generate a sentence as fast as possible. Also, the solution doesn't have to use pandas; I was just thinking that the vectorized implementation of most of its functions is the fastest solution.
So in short, first, how do I transform the DataFrame into the frequency DataFrame and second, is there a faster method of generating sentences?
I've tested my implementation to see if it's fast enough, and it's decent: a frequency DataFrame with 100 rows and 500 columns can generate 5000 sentences in about 1.2 seconds on my machine.
If you want to test your own method against mine, here's my test:
import timeit
setup = '''
import pandas as pd
import numpy as np
def val():
    v = np.random.normal(0, 0.2)
    return v if 0 <= v <= 1 else 0
def gen_sentence(fdf, s):
    return fdf.columns[np.random.random(len(fdf.columns)) < fdf.loc[s]]
n = 500
m = 100
fdf = pd.DataFrame([[val() for _ in range(n)] for _ in range(m)])
fdf = fdf.join(pd.DataFrame({"s": [i for i in range(m)]}))
fdf = fdf.set_index("s")
fdf.columns = ["w%d" % i for i in range(n)]
'''
test = "x = np.random.randint(0, m); gen_sentence(fdf, x)"
print(timeit.timeit(test, setup=setup, number=5000))
To transform to a frequency dataframe, try this (not the best solution, but it works):
for letter in ['x', 'y', 'w', 'z']:
    df.loc[:, letter] = df.m.apply(lambda x: x.count(letter))
df = df.drop(['m'], axis=1)
df_1 = df.groupby('s').agg(lambda x: sum(x)).reset_index()
print(df_1)
Output:
   s  x  y  w  z
0  a  2  1  0  1
1  b  0  2  1  1
2  c  0  0  0  2
Another alternative (without for loop, using stack and pivot_table):
import numpy as np
df_1 = (df.m.apply(pd.Series).stack().to_frame('m')).reset_index().set_index('level_0')['m']
df_1 = pd.concat([df['s'], df_1], axis=1).reset_index()[['s', 'm']]
df_1.insert(1, 'freq', 1)
df_1 = pd.pivot_table(df_1, values='freq', index='s', columns='m', aggfunc=np.sum).fillna(0)
df_1 = df_1.div(df_1.max(axis=1), axis=0)
df_1.columns.name=None
print(df_1)
Output:
     w    x    y    z
s
a  0.0  1.0  0.5  0.5
b  0.5  0.0  1.0  0.5
c  0.0  0.0  0.0  1.0
With the help of Alla Tarighati I now have this solution for the first part of my question:
letters = set(x for l in df["m"] for x in l)
for letter in letters:
    df.loc[:, letter] = df.m.apply(lambda x: letter in x)
df = df.drop(["m"], axis=1)
gdf = df.groupby("s")
fdf = gdf.agg(lambda x: sum(x))
fdf = fdf.divide(gdf.size(), axis="index")
print(fdf)
output:
     y    x    z    w
s
a  0.5  1.0  0.5  0.0
b  1.0  0.0  0.5  0.5
c  0.0  0.0  1.0  0.0
Note that in line three I changed the lambda function to letter in x so that duplicate letters in a sentence aren't counted multiple times.
And, like Alla Tarighati's, this isn't a very fast solution, so improvements are welcome!
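For what it's worth, a fully vectorised construction of the same frequency table is possible with explode and crosstab (a sketch, assuming pandas >= 0.25 for explode and starting again from the original df; not benchmarked against the loop above):
# One row per (original row, letter); dropping duplicates within a row
# counts a letter repeated in one sentence only once, as above.
tmp = df.reset_index().explode("m").drop_duplicates(["index", "m"])
counts = pd.crosstab(tmp["s"], tmp["m"])
fdf = counts.div(df.groupby("s").size(), axis=0)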

Pandas DataFrame from MultiIndex and NumPy structured array (recarray)

First I create a two-level MultiIndex:
import numpy as np
import pandas as pd
ind = pd.MultiIndex.from_product([('X','Y'), ('a','b')])
I can use it like this:
pd.DataFrame(np.zeros((3,4)), columns=ind)
Which gives:
     X         Y
     a    b    a    b
0  0.0  0.0  0.0  0.0
1  0.0  0.0  0.0  0.0
2  0.0  0.0  0.0  0.0
But now I'm trying to do this:
dtype = [('Xa','f8'), ('Xb','i4'), ('Ya','f8'), ('Yb','i4')]
pd.DataFrame(np.zeros(3, dtype), columns=ind)
But that gives:
Empty DataFrame
Columns: [(X, a), (X, b), (Y, a), (Y, b)]
Index: []
I expected something like the previous result, with three rows.
Perhaps more generally, what I want to do is to generate a Pandas DataFrame with MultiIndex columns where the columns have distinct types (as in the example, a is float but b is int).
This looks like a bug, and worth reporting as an issue on GitHub.
A workaround is to set the columns manually after construction:
In [11]: df1 = pd.DataFrame(np.zeros(3, dtype))
In [12]: df1.columns = ind
In [13]: df1
Out[13]:
     X       Y
     a  b    a  b
0  0.0  0  0.0  0
1  0.0  0  0.0  0
2  0.0  0  0.0  0
pd.DataFrame(np.zeros(3, dtype), columns=ind)
Empty DataFrame
Columns: [(X, a), (X, b), (Y, a), (Y, b)]
Index: []
is just showing the textual representation of the dataframe output, and
Columns: [(X, a), (X, b), (Y, a), (Y, b)]
is then just the text representation of the column index.
If you instead do:
df = pd.DataFrame(np.zeros(3, dtype), columns=ind)
print(type(df.columns))
<class 'pandas.indexes.multi.MultiIndex'>
you see it is indeed a pd.MultiIndex.
With that out of the way, what I don't understand is why specifying the columns in the DataFrame constructor removes the values.
A workaround is this:
df = pd.DataFrame(np.zeros(3, dtype))
df.columns = ind
print(df)
     X       Y
     a  b    a  b
0  0.0  0  0.0  0
1  0.0  0  0.0  0
2  0.0  0  0.0  0
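More generally (a sketch, my own assumption rather than part of the answers above): if you can enumerate the columns, building the frame from a dict keyed by tuples gives MultiIndex columns and per-column dtypes in one step:
# Dict keys that are tuples become MultiIndex columns; each column
# keeps the dtype of the array it is built from.
df = pd.DataFrame({('X', 'a'): np.zeros(3, 'f8'),
                   ('X', 'b'): np.zeros(3, 'i4'),
                   ('Y', 'a'): np.zeros(3, 'f8'),
                   ('Y', 'b'): np.zeros(3, 'i4')})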

Return list of indices/index where a min/max value occurs in a pandas dataframe

I'd like to search a pandas DataFrame for minimum values. I need the min in the entire dataframe (across all values) analogous to df.min().min(). However I also need the know the index of the location(s) where this value occurs.
I've tried a number of different approaches:
df.where(df == (df.min().min())),
df.where(df == df.min().min()).notnull(), and
val_mask = df == df.min().min(); df[val_mask].
These return a dataframe of NaNs in the non-min positions (or the corresponding booleans), but I can't figure out a way to get the (row, col) locations from them.
Is there a more elegant way of searching a dataframe for a min/max and returning a list containing all of the locations of the occurrence(s)?
import pandas as pd
keys = ['x', 'y', 'z']
vals = [[1,2,-1], [3,5,1], [4,2,3]]
data = dict(zip(keys,vals))
df = pd.DataFrame(data)
list_of_lowest = []
for column_name, column in df.items():  # iteritems() in older pandas
    if len(df[column == df.min().min()]) > 0:
        print(column_name, column.where(column == df.min().min()).dropna())
        list_of_lowest.append([column_name, column.where(column == df.min().min()).dropna()])
list_of_lowest
output: [['x', 2   -1.0
Name: x, dtype: float64]]
Based on your revised update:
In [209]:
keys = ['x', 'y', 'z']
vals = [[1,2,-1], [3,5,-1], [4,2,3]]
data = dict(zip(keys,vals))
df = pd.DataFrame(data)
df
Out[209]:
   x  y  z
0  1  3  4
1  2  5  2
2 -1 -1  3
Then the following would work:
In [211]:
df[df==df.min().min()].dropna(axis=1, thresh=1).dropna()
Out[211]:
     x    y
2 -1.0 -1.0
So this uses the boolean mask on the df:
In [212]:
df[df==df.min().min()]
Out[212]:
     x    y   z
0  NaN  NaN NaN
1  NaN  NaN NaN
2 -1.0 -1.0 NaN
and we call dropna with param thresh=1 this drops columns that don't have at least 1 non-NaN value:
In [213]:
df[df==df.min().min()].dropna(axis=1, thresh=1)
Out[213]:
     x    y
0  NaN  NaN
1  NaN  NaN
2 -1.0 -1.0
It's probably safer to call dropna again with thresh=1 instead of the bare dropna, in case the minima fall in different rows:
In [214]:
df[df==df.min().min()].dropna(axis=1, thresh=1).dropna(thresh=1)
Out[214]:
     x    y
2 -1.0 -1.0
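An alternative that returns the (row, col) locations directly (a sketch using NumPy, not from the answers above):
import numpy as np
# np.where on the boolean mask gives positional indices; map them back
# to the frame's labels to get (row, col) pairs.
rows, cols = np.where(df.values == df.min().min())
locations = list(zip(df.index[rows], df.columns[cols]))
# [(2, 'x'), (2, 'y')] for the example above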
