nested list comprehension to populate dataframe - python

Objective: Compute some bivariate function, e.g. f(x,y) = sin(x^2 + y^2) for x ∈ [-1,1] and y ∈ [-1,1], and stick the values in a dataframe.
What I have...
import numpy as np
import pandas as pd

def sunbrero(x, y):
    return np.sin(x**2 + y**2)
lower=-1
upper=1
length=1000
X = np.linspace(lower, upper, num=length)
Y = np.linspace(lower, upper, num=length)
Z = pd.DataFrame(index=X,columns=Y)
# [[sunbrero(x,y) for x in X] for y in Y]
for y in Y:
    Z[y] = [sunbrero(x, y) for x in X]
What I'm hoping to do is something that replaces...
for y in Y:
    Z[y] = [sunbrero(x, y) for x in X]
...with something like...
[[Z[y] = sunbrero(x,y) for x in X] for y in Y]
But obviously the above doesn't work.
I know that this works...
Z = [[sunbrero(x,y) for x in X] for y in Y]
...but it creates a list of lists rather than a dataframe.
Note 1: if others think a 2D array is more sensible than a dataframe, I'm open to that as well.
Note 2: I don't think lambda functions work, as they only allow one variable to be defined. Happy to be corrected.

I think the more pandas-esque way of doing this would be to calculate the values first and put them into a dataframe afterwards, not vice versa. Performing the calculations in a list comprehension does not put the internal vectorized optimizations of NumPy and pandas to good use.
Instead, you can make use of NumPy's broadcasting to get the matrix first:
length = 5
X = np.linspace(lower, upper, num=length)
Y = np.linspace(lower, upper, num=length)
result = sunbrero(X[:, None], Y)
array([[0.90929743, 0.94898462, 0.84147098, 0.94898462, 0.90929743],
[0.94898462, 0.47942554, 0.24740396, 0.47942554, 0.94898462],
[0.84147098, 0.24740396, 0. , 0.24740396, 0.84147098],
[0.94898462, 0.47942554, 0.24740396, 0.47942554, 0.94898462],
[0.90929743, 0.94898462, 0.84147098, 0.94898462, 0.90929743]])
and put that in a dataframe like so:
df = pd.DataFrame(result, index=X, columns=Y)
-1.0 -0.5 0.0 0.5 1.0
-1.0 0.909297 0.948985 0.841471 0.948985 0.909297
-0.5 0.948985 0.479426 0.247404 0.479426 0.948985
0.0 0.841471 0.247404 0.000000 0.247404 0.841471
0.5 0.948985 0.479426 0.247404 0.479426 0.948985
1.0 0.909297 0.948985 0.841471 0.948985 0.909297
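As a side note (my gloss, not part of the original answer): X[:, None] adds a new axis, so it has shape (length, 1) and broadcasts against Y's shape (length,) inside sunbrero to produce a (length, length) grid. An equivalent, more explicit construction uses np.meshgrid:

XX, YY = np.meshgrid(X, Y, indexing='ij')  # both have shape (length, length)
result = sunbrero(XX, YY)                  # same matrix as sunbrero(X[:, None], Y)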

You're almost there:
df = pd.DataFrame([[sunbrero(x,y) for x in X] for y in Y])
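One caveat worth noting: the outer comprehension runs over Y, so rows correspond to Y values and columns to X values. To keep the axis labels, you can pass them explicitly (a minimal sketch using the question's X and Y):

df = pd.DataFrame([[sunbrero(x, y) for x in X] for y in Y], index=Y, columns=X)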

You can do your list comprehension, then have pandas create a dataframe from a list of lists, for example:
list_of_lists = [[1, 2, 3], [4, 5, 6]]
df = pd.DataFrame(list_of_lists)
to get
0 1 2
0 1 2 3
1 4 5 6
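If you want meaningful labels instead of the default integer ranges, pass index and columns to the constructor (the labels below are made up for illustration):

df = pd.DataFrame(list_of_lists, index=['r1', 'r2'], columns=['a', 'b', 'c'])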

Related

How can I append the values which I got from the def function return value to a DataFrame?

How can I append the values which I got from the def function return value to a DataFrame? I would like to return a list from a def function which gives me a scalar for each loop, but the program shows me
TypeError: 'int' object is not subscriptable.
def para(x, y):
    z = x + y
    return z
x = [1,2,3]
y = [4,5,6]
list_z = []
for i in range(2):
    z = para(x[i], y[i])
    list_z.append(z[0])
print(list_z)
The first error is that the para method returns an int and you try to subscript it; you can't do 4[0], that's not possible.
Then you use range(2), which generates the values [0, 1], so you're missing the last values of the arrays; use range(3) or, better, the length of the arrays directly:
def para(x, y):
    z = x + y
    return z

x = [1, 2, 3]
y = [4, 5, 6]
list_z = []
for i in range(len(x)):
    z = para(x[i], y[i])
    list_z.append(z)
print(list_z)  # [5, 7, 9]
As an improvement, use zip to iterate over both arrays at the same time, directly on the values:
for x_val, y_val in zip(x, y):
    list_z.append(para(x_val, y_val))
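Or equivalently, as a one-liner with a list comprehension:

list_z = [para(x_val, y_val) for x_val, y_val in zip(x, y)]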
If you actually meant a pandas DataFrame rather than manually building a list-of-lists, then simply declare it:
df = pd.DataFrame({'x': x, 'y': y})
>>> df
x y
0 1 4
1 2 5
2 3 6
pandas DataFrames have built-in support for element-wise operations like +, -, /, *, so you can simply do:
df['z'] = df['x'] + df['y']
without needing your function para:
>>> df
x y z
0 1 4 5
1 2 5 7
2 3 6 9
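If para were something more complicated than addition that doesn't vectorize over whole columns, a row-wise apply would still work (slower, but fully general); a sketch:

df['z'] = df.apply(lambda row: para(row['x'], row['y']), axis=1)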

Generate a large number of sentences based on the frequency of the input

My goal is to generate sentences based on the frequency of the input. For example I have input like this:
>>> df = pd.DataFrame({"s":["a", "a", "b", "b", "c", "c"], "m":[["x", "y"], ["x", "z"], ["y", "w", "z"], ["y"], ["z"], ["z"]]})
>>> df = df.set_index("s")
>>> df
m
s
a [x, y]
a [x, z]
b [y, w, z]
b [y]
c [z]
c [z]
I want to have a function gen_sentence(s) that takes an s and generates a random non-empty sentence based on the frequency of the letters in column m. So gen_sentence("a") should generate sentences all of which contain x, 50% of which contain y, and 50% of which contain z.
My intuition is to transform the DataFrame into a DataFrame of frequency, so for the example something like this:
w x y z
s
a 0.0 1 0.5 0.5
b 0.5 0 1.0 0.5
c 0.0 0 0.0 1.0
And then roll a random number for each column given an s:
def gen_sentence(fdf, s):
    return fdf.columns[np.random.random(len(fdf.columns)) < fdf.loc[s]]
However, I have no clue how to transform the DataFrame in the frequency DataFrame.
The solution will probably be to use df.agg["s"] but what function do I apply on the aggregate?
In reality the dataset is pretty big, with over 1 million rows, about 500 different words in m, and about 100 different values for s, and the frequency table will be sparse: most s's have a frequency of zero for most words in m. Furthermore, I need to generate at least a couple of hundred thousand sentences, so I'm trying to find an implementation that can generate a sentence as fast as possible. Also, the solution doesn't have to use pandas; I was just thinking that the vectorized implementation of most of its functions is the fastest solution.
So in short, first, how do I transform the DataFrame into the frequency DataFrame and second, is there a faster method of generating sentences?
I've tested my implementation to see if it's fast enough and it's decent: a frequency DataFrame with 100 rows and 500 columns can generate 5000 sentences in about 1.2 seconds on my machine.
If you want to test your own method against mine, here's my test:
import timeit
setup = '''
import pandas as pd
import numpy as np
def val():
    v = np.random.normal(0, 0.2)
    return v if 0 <= v <= 1 else 0

def gen_sentence(fdf, s):
    return fdf.columns[np.random.random(len(fdf.columns)) < fdf.loc[s]]
n = 500
m = 100
fdf = pd.DataFrame([[val() for _ in range(n)] for _ in range(m)])
fdf = fdf.join(pd.DataFrame({"s": [i for i in range(m)]}))
fdf = fdf.set_index("s")
fdf.columns = ["w%d" % i for i in range(n)]
'''
test = "x = np.random.randint(0, m); gen_sentence(fdf, x)"
print(timeit.timeit(test, setup=setup, number=5000))
To transform to a frequency dataframe, try this (not the best solution, but it works):
for letter in ['x', 'y', 'w', 'z']:
    df.loc[:, letter] = df.m.apply(lambda x: x.count(letter))
df = df.drop(['m'], axis=1)
df_1 = df.groupby('s').agg(lambda x: sum(x)).reset_index()
print(df_1)
Output:
s x y w z
0 a 2 1 0 1
1 b 0 2 1 1
2 c 0 0 0 2
Another alternative (without for loop, using stack and pivot_table):
import numpy as np
df_1 = (df.m.apply(pd.Series).stack().to_frame('m')).reset_index().set_index('level_0')['m']
df_1 = pd.concat([df['s'], df_1], axis=1).reset_index()[['s', 'm']]
df_1.insert(1, 'freq', 1)
df_1 = pd.pivot_table(df_1, values='freq', index='s', columns='m', aggfunc=np.sum).fillna(0)
df_1 = df_1.div(df_1.max(axis=1), axis=0)
df_1.columns.name=None
print(df_1)
Output:
w x y z
s
a 0.0 1.0 0.5 0.5
b 0.5 0.0 1.0 0.5
c 0.0 0.0 0.0 1.0
With the help of Alla Tarighati I now have this solution for the first part of my question:
letters = set(x for l in df["m"] for x in l)
for letter in letters:
    df.loc[:, letter] = df.m.apply(lambda x: letter in x)
df = df.drop(["m"], axis=1)
gdf = df.groupby("s")
fdf = gdf.agg(lambda x: sum(x))
fdf = fdf.divide(gdf.size(), axis="index")
print(fdf)
output:
y x z w
s
a 0.5 1.0 0.5 0.0
b 1.0 0.0 0.5 0.5
c 0.0 0.0 1.0 0.0
Note that in line three I changed the lambda function to letter in x so that duplicate letters in a sentence aren't counted multiple times.
And like Alla Tarighati's, this isn't a very fast solution, so improvements are welcome!
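One possible direction for the speed question (my own sketch, assuming the fdf frequency DataFrame built above): draw a single random matrix for a whole batch of sentences instead of calling gen_sentence once per sentence:

def gen_sentences(fdf, s, k):
    # One (k, n_words) random draw covers k sentences at once.
    probs = fdf.loc[s].to_numpy()
    words = fdf.columns.to_numpy()
    mask = np.random.random((k, len(probs))) < probs
    # Note: empty rows would still need to be redrawn or filtered
    # to satisfy the non-empty requirement.
    return [words[row] for row in mask]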

How to create a copy of an existing DataFrame (pandas)?

I have just started exploring pandas. I tried applying logarithmic scaling to a DataFrame column without affecting the source DataFrame. I passed the existing DataFrame (data_source) to the DataFrame constructor, thinking that it would create a copy.
data_source = pd.read_csv("abc.csv")
log_data = pd.DataFrame(data = data_source).apply(lambda x: np.log(x + 1))
I think it works properly, but is it a recommended/correct way of applying scaling to a copied DataFrame? How is it different from the DataFrame.copy function?
pd.DataFrame(data = data_source) does not make a copy. This is documented in the docs for the copy argument to the constructor:
copy : boolean, default False
Copy data from inputs. Only affects DataFrame / 2d ndarray input
This is also easily observed by trying to mutate the result:
>>> x = pandas.DataFrame({'x': [1, 2, 3], 'y': [1., 2., 3.]})
>>> y = pandas.DataFrame(x)
>>> x
x y
0 1 1.0
1 2 2.0
2 3 3.0
>>> y
x y
0 1 1.0
1 2 2.0
2 3 3.0
>>> y.iloc[0, 0] = 2
>>> x
x y
0 2 1.0
1 2 2.0
2 3 3.0
If you want a copy, call the copy method. You don't need a copy, though. apply already returns a new dataframe, and better yet, you can call numpy.log or numpy.log1p on dataframes directly:
>>> x = pandas.DataFrame({'x': [1, 2, 3], 'y': [1., 2., 3.]})
>>> numpy.log1p(x)
x y
0 0.693147 0.693147
1 1.098612 1.098612
2 1.386294 1.386294
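For completeness, a small sketch of the copy method mentioned above; mutating the copy leaves the original untouched:

>>> z = x.copy()
>>> z.iloc[0, 0] = 99
>>> x.iloc[0, 0]
1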
DataFrame.apply, .applymap and np.log do not change the original data, so it is not necessary to call copy().
Also, np.log accepts arrays, so in this particular case it would be better to write:
log_data = pd.DataFrame(np.log(data_source.values + 1),
                        columns=data_source.columns,
                        index=data_source.index)
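As a side note, calling np.log directly on the DataFrame (as the previous answer does with np.log1p) already preserves the index and columns, since NumPy ufuncs applied to a DataFrame return a labeled DataFrame, so the explicit rebuild is optional:

log_data = np.log(data_source + 1)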

Conditionally select dataframes from a list of dfs

I'm trying to subset a list of dataframes with a function. This function would need to return only the dfs which, for example, have a Z-column total of > 14 and X-column values (rows 0-4) that are all within 30% of the average of those 5 values. So, in the example below, df1 would be returned and df2 not.
Can this be done, evaluating every dataframe with these kinds of conditions? Could anyone point me in the right direction?
N = 5
np.random.seed(0)
df1 = pd.DataFrame({
    'X': np.random.uniform(0, 5, N),
    'Y': np.random.uniform(0, 5, N),
    'Z': np.random.uniform(0, 5, N),
})
df2 = pd.DataFrame({
    'X': np.random.uniform(0, 5, N),
    'Y': np.random.uniform(0, 5, N),
    'Z': np.random.uniform(0, 5, N),
})
df1.loc['total'] = df1.sum()
df2.loc['total'] = df2.sum()
df_list = (df1, df2)
X Y Z
0 2.744068 3.229471 3.958625
1 3.575947 2.187936 2.644475
2 3.013817 4.458865 2.840223
3 2.724416 4.818314 4.627983
4 2.118274 1.917208 0.355180
total 14.176521 16.611793 14.426486
--------------------------------------
X Y Z
0 0.435646 4.893092 3.199605
1 0.101092 3.995793 0.716766
2 4.163099 2.307397 4.723345
3 3.890784 3.902646 2.609242
4 4.350061 0.591372 2.073310
total 12.940682 15.690299 13.322267
A list comprehension can be used, with the two stated conditions.
The Z condition is pretty straightforward and easy to implement. Regarding the X condition, I created a little function that returns True if the dataframe matches the condition, else False.
In [156]: def check_X(df):
     ...:     avg = df.drop('total')['X'].mean()
     ...:     for val in df.drop('total')['X']:
     ...:         if val/avg < 0.7 or val/avg > 1.3:  # 30% more or less
     ...:             return False
     ...:     return True
     ...:
Therefore, we can get the expected result by doing:
In [157]: [df for df in df_list if df.drop('total')['Z'].sum() > 14 and check_X(df)]
Out[157]:
[ X Y Z
0 2.744068 3.229471 3.958625
1 3.575947 2.187936 2.644475
2 3.013817 4.458865 2.840223
3 2.724416 4.818314 4.627983
4 2.118274 1.917208 0.355180
total 14.176522 16.611794 14.426486]
Edit: a better, one-liner solution that doesn't use any user-defined function:
In [205]: [df for df in df_list if df['Z'].sum() > 14 and ((df['X'] > df['X'].mean()*0.7) & (df['X'] < df['X'].mean()*1.3)).all()]
Out[205]:
[ X Y Z
0 2.744068 3.229471 3.958625
1 3.575947 2.187936 2.644475
2 3.013817 4.458865 2.840223
3 2.724416 4.818314 4.627983
4 2.118274 1.917208 0.355180]
For simplicity, I dropped the 'total' row from both df before processing:
In [204]: df_list = [df.drop('total') for df in df_list]
If you have a list of dataframes, you can conditionally select dataframes with a list comprehension, using slicing (iloc[0:-1]) to exclude the last ('total') row:
new_list = [x for x in df_list if (x.loc['total', 'Z'] > 14) and
            ((x.iloc[0:-1]['X'] > x.iloc[0:-1]['X'].mean()*0.7) &
             (x.iloc[0:-1]['X'] < x.iloc[0:-1]['X'].mean()*1.3)).all()]
Output:
[ X Y Z
0 2.744068 3.229471 3.958625
1 3.575947 2.187936 2.644475
2 3.013817 4.458865 2.840223
3 2.724416 4.818314 4.627983
4 2.118274 1.917208 0.355180
total 14.176521 16.611793 14.426486]
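If this selection is needed in several places, both conditions can be wrapped in a small helper function (a sketch assuming each frame carries a 'total' row, as in the question; note that Series.between is inclusive at the boundaries, unlike the strict comparisons above):

def keep(df, z_min=14, tol=0.3):
    body = df.drop('total')  # rows 0-4 only
    x = body['X']
    within = x.between(x.mean() * (1 - tol), x.mean() * (1 + tol))
    return body['Z'].sum() > z_min and within.all()

new_list = [df for df in df_list if keep(df)]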

pandas, convert DataFrame to MultiIndex'ed DataFrame

I have a pandas.DataFrame that I want to convert to a MultiIndexed pandas.DataFrame.
import numpy
import pandas
import itertools
xs = numpy.linspace(0, 10, 100)
ys = numpy.linspace(0, 0.1, 20)
zs = numpy.linspace(0, 5, 200)
def func(x, y, z):
    return x * y / z
vals = list(itertools.product(xs, ys, zs))
result = [func(x, y, z) for x, y, z in vals]
# Original DataFrame.
df = pandas.DataFrame(vals, columns=['x', 'y', 'z'])
df = pandas.concat((pandas.DataFrame(result, columns=['result']), df), axis=1)
# I want to turn `df` into this `df2`.
index = pandas.MultiIndex.from_tuples(vals, names=['x', 'y', 'z'])
df2 = pandas.DataFrame(result, columns=['result'], index=index)
Note that in this example I create both what I have and what I want.
So, in reality I would start with df and want to turn it into df2 (without access to vals and result); how do I do this?
You need set_index:
print (df2.head())
result
x y z
0.0 0.0 0.000000 NaN
0.025126 0.0
0.050251 0.0
0.075377 0.0
0.100503 0.0
print (df.set_index(['x','y','z']).head())
result
x y z
0.0 0.0 0.000000 NaN
0.025126 0.0
0.050251 0.0
0.075377 0.0
0.100503 0.0
If you need to compare both DataFrames, you need to replace NaN with the same values first, otherwise you get False:
print (df.set_index(['x','y','z']).eq(df2).all())
result False
dtype: bool
print (np.nan == np.nan)
False
print (df.fillna(1).set_index(['x','y','z']).eq(df2.fillna(1)).all())
result True
dtype: bool
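Alternatively, DataFrame.equals treats NaN values in matching positions as equal, so the fillna workaround can be skipped:

print (df.set_index(['x','y','z']).equals(df2))
True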
