My goal is to generate sentences based on the frequency of the input. For example I have input like this:
>>> df = pd.DataFrame({"s":["a", "a", "b", "b", "c", "c"], "m":[["x", "y"], ["x", "z"], ["y", "w", "z"], ["y"], ["z"], ["z"]]})
>>> df.set_index("s")
>>> df
m
s
a [x, y]
a [x, z]
b [y, w, z]
b [y]
c [z]
c [z]
I want to have a function gen_sentence(s) that takes an s and generates random non-empty sentence based on the frequency of the letters in column m. So gen_sentence("a") should generate sentences where all of them would contain x, 50% of them should contain y and 50% z.
My intuition is to transform the DataFrame into a DataFrame of frequency, so for the example something like this:
w x y z
s
a 0.0 1 0.5 0.5
b 0.5 0 1.0 0.5
c 0.0 0 0.0 1.0
And then roll a random number for each column given an s:
def gen_sentence(fdf, s):
return fdf.columns[np.random.random(len(fdf.columns)) < fdf.loc[s]]
However, I have no clue how to transform the DataFrame in the frequency DataFrame.
The solution will probably be to use df.agg["s"] but what function do I apply on the aggregate?
In reality the dataset is pretty big with over 1 million rows, about 500 different words in m en about 100 different values for s and the frequency table will be sparse: most s's have a frequency of zero for most words in m. Furthermore, I need to generate at least a couple of hundred thousand sentences so I'm trying to find an implementation can generate a sentence as fast as possible. Also, the solution doesn't have to use Panda's, I was just thinking that the vectorized implementation of most of its functions is the fastest solution.
So in short, first, how do I transform the DataFrame into the frequency DataFrame and second, is there a faster method of generating sentences?
I've tested my implementation to see if it's fast enough and it's decent: a frequency DataFrame with 500 rows and 100 columns can generate 5000 sentences in about 1.2 seconds on my machine.
If you want to test your own method against mine, here's my test:
import timeit
setup = '''
import pandas as pd
import numpy as np
def val():
v = np.random.normal(0, 0.2)
return v if 0 <= v <= 1 else 0
def gen_sentence(fdf, s):
return fdf.columns[np.random.random(len(fdf.columns)) < fdf.loc[s]]
n = 500
m = 100
fdf = pd.DataFrame([[val() for _ in range(n)] for _ in range(m)])
fdf = fdf.join(pd.DataFrame({"s": [i for i in range(m)]}))
fdf = fdf.set_index("s")
fdf.columns = ["w%d" % i for i in range(n)]
'''
test = "x = np.random.randint(0, m); gen_sentence(fdf, x)"
print(timeit.timeit(test, setup=setup, number=5000))
To transform to frequency dataframe try this (not the best solution, but it works):
for letter in ['x', 'y', 'w', 'z']:
df.loc[:, letter] = df.m.apply(lambda x: x.count(letter))
df = df.drop(['m'], axis=1)
df_1 = df.groupby('s').agg(lambda x: sum(x)).reset_index()
print(df_1)
Output:
s x y w z
0 a 2 1 0 1
1 b 0 2 1 1
2 c 0 0 0 2
Another alternative (without for loop, using stack and pivot_table):
import numpy as np
df_1 = (df.m.apply(pd.Series).stack().to_frame('m')).reset_index().set_index('level_0')['m']
df_1 = pd.concat([df['s'], df_1], axis=1).reset_index()[['s', 'm']]
df_1.insert(1, 'freq', 1)
df_1 = pd.pivot_table(df_1, values='freq', index='s', columns='m', aggfunc=np.sum).fillna(0)
df_1 = df_1.div(df_1.max(axis=1), axis=0)
df_1.columns.name=None
print(df_1)
Output:
w x y z
s
a 0.0 1.0 0.5 0.5
b 0.5 0.0 1.0 0.5
c 0.0 0.0 0.0 1.0
With the help of Alla Tarighati I now have this solution for the first part of my question:
letters = set(x for l in df["m"] for x in l)
for letter in letters:
df.loc[:, letter] = df.m.apply(lambda x: letter in x)
df = df.drop(["m"], axis=1)
gdf = df.groupby("s")
fdf = gdf.agg(lambda x: sum(x))
fdf = fdf.divide(gdf.size(), axis="index")
print(fdf)
output:
y x z w
s
a 0.5 1.0 0.5 0.0
b 1.0 0.0 0.5 0.5
c 0.0 0.0 1.0 0.0
Note that in line three I changed the lambda function to letter in x so that duplicate letters in a sentence aren't counted multiple times.
And like Alla Tarighati, this isn't a very fast solution, so improvements are welcome!
Related
I have a dataframe like this:
import pandas as pd
import numpy as np
df = pd.DataFrame({"ind0": list("QQQWWWW"), "ind1": list("RRRRSSS"), "vals": range(7), "cols": list("XXYXXYY")})
print(df)
Output:
ind0 ind1 vals cols
0 Q R 0 X
1 Q R 1 X
2 Q R 2 Y
3 W R 3 X
4 W S 4 X
5 W S 5 Y
6 W S 6 Y
I want to aggregate the values while creating columns from col, so I thought of using pivot_table:
df_res = df.pivot_table(index=["ind0", "ind1"], columns="cols", values="vals", aggfunc=np.sum).fillna(0)
print(df_res)
This gives me:
cols X Y
ind0 ind1
Q R 1.0 2.0
W R 3.0 0.0
S 4.0 11.0
However, I would rather get the sum independent of ind1 categories while keeping the information in this column. So, the desired output would be:
cols X Y
ind0 ind1
Q R 1.0 2.0
W R,S 7.0 11.0
Is there a way to use pivot_table or pivot to this end or do I have to aggregate for ind1 in a second step? If the latter, how?
You could reset_index of df_res and groupby "ind0" and using agg, use different functions on columns: joining unique values of "ind1" and summing "X" and "Y".
out = df_res.reset_index().groupby('ind0').agg({'ind1': lambda x: ', '.join(x.unique()), 'X':'sum', 'Y':'sum'})
Or if you have multiple columns that you need to do the same function on, you could also use a dict comprehension:
funcs = {'ind1': lambda x: ', '.join(x.unique()), **{k:'sum' for k in ('X','Y')}}
out = df_res.reset_index().groupby('ind0').agg(funcs)
Output:
cols ind1 X Y
ind0
Q R 1.0 2.0
W R, S 7.0 11.0
Objective: Compute some bivariate polynomial e.g. f(x,y) = sin(x^2 + y^2) for x ∈ [-1,1] and y ∈ [-1,1] and stick values in a dataframe.
What I have...
def sunbrero(x,y):
return np.sin(x**2 + y**2)
lower=-1
upper=1
length=1000
X = np.linspace(lower, upper, num=length)
Y = np.linspace(lower, upper, num=length)
Z = pd.DataFrame(index=X,columns=Y)
# [[sunbrero(x,y) for x in X] for y in Y]
for y in Y:
Z[y] = [sunbrero(x,y) for x in X]
What I'm hoping to do is something that replaces...
for y in Y:
Z[y] = [sunbrero(x,y) for x in X]
...with something like...
[[Z[y] = sunbrero(x,y) for x in X] for y in Y]
But obviously the above doesn't work.
I know that this works...
Z = [[sunbrero(x,y) for x in X] for y in Y]
...but it creates a list of lists rather than a dataframe.
Note 1: if others think a 2D vector is more sensible c.f dataframe, I'm open to that as well.
Note 2: I don't think lambda functions work as it only allows one variable to be defined. Happy to be corrected.
I think the more Panda-esque way of doing this would be to calculate the values first and put them into a dataframe afterwards, not vice versa. Performing the calculations in a list comprehension does not put the internal vector optimizations of Numpy and Pandas to good use.
Instead, you can make use of Numpy's broadcasting to get the matrix first:
length = 5
X = np.linspace(lower, upper, num=length)
Y = np.linspace(lower, upper, num=length)
result = sunbrero(X[:, None], Y)
array([[0.90929743, 0.94898462, 0.84147098, 0.94898462, 0.90929743],
[0.94898462, 0.47942554, 0.24740396, 0.47942554, 0.94898462],
[0.84147098, 0.24740396, 0. , 0.24740396, 0.84147098],
[0.94898462, 0.47942554, 0.24740396, 0.47942554, 0.94898462],
[0.90929743, 0.94898462, 0.84147098, 0.94898462, 0.90929743]])
and put that in a dataframe like so:
df = pd.DataFrame(result, index=X, columns=Y)
-1.0 -0.5 0.0 0.5 1.0
-1.0 0.909297 0.948985 0.841471 0.948985 0.909297
-0.5 0.948985 0.479426 0.247404 0.479426 0.948985
0.0 0.841471 0.247404 0.000000 0.247404 0.841471
0.5 0.948985 0.479426 0.247404 0.479426 0.948985
1.0 0.909297 0.948985 0.841471 0.948985 0.909297
You're almost there:
df = pd.DataFrame([[sunbrero(x,y) for x in X] for y in Y])
You can do your list comprehension, then have pandas create a dataframe from a list of lists, for example:
list_of_lists = [[1, 2, 3], [4, 5, 6]]
df = pd.DataFrame(list_of_lists)
to get
0 1 2
0 1 2 3
1 4 5 6
Hi so I have created a function to check the correlation between 2 variables, anyone knows how can I create a new data frame from this?
In [1]:from scipy.stats import pearsonr
for colY in Y.columns:
for colX in X.columns:
#print('Pearson Correlation')
corr, _ = pearsonr(numerical_cols_target[colX], numerical_cols_target[colY])
alpha = 0.05
print('Pearson Correlation', (alpha, corr))
if corr <= alpha:
print(colX +' and ' +colY+ ' two ariables are not correlated ')
else:
print(colX +' and ' +colY+ ' two variables are highly correlated ')
print('\n')
print('\n')
here's a sample output from the correlation function:
Out [1]:
Pearson Correlation (0.05, -0.1620045985125294)
banana and orange are not correlated
Pearson Correlation (0.05, 0.2267582070839226)
apple and orange are highly correlated
```
I would avoid using two for loops. Depending on the size of your dataset this will be very slow.
Pandas provides a correlation function with might come in hand here:
import pandas as pd
df = pd.DataFrame({'A': range(4), 'B': [2*i for i in range(4)]})
using corr() will give you the pairwise correlations then and returns a new dataframe as well:
df.corr()
For more infos you can check the manual: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.corr.html
You can just do the following.
df = pd.DataFrame(index=X.columns, columns=Y.columns)
#In your loop
df[colY][colX] = corr
Your loop would then be
for colY in Y.columns:
for colX in X.columns:
#print('Pearson Correlation')
corr, _ = pearsonr(numerical_cols_target[colX], numerical_cols_target[colY])
alpha = 0.05
print('Pearson Correlation', (alpha, corr))
df[colY][colX] = corr
if corr <= alpha:
print(colX +' and ' +colY+ ' two ariables are not correlated ')
else:
print(colX +' and ' +colY+ ' two variables are highly correlated ')
print('\n')
print('\n')
I think you are looking for this:
This will get a column-wise correlation of every two pairs of columns between X and Y dataframes and create another dataframe that keeps all the correlations and whether they pass a threshold alpha:
This assumes Y has less or equal number of columns as X. If not simply switch X and Y places:
import collections
corr_df = pd.DataFrame(columns=['col_X', 'col_Y', 'corr', 'is_correlated'])
d = collections.deque(X.columns)
Y_cols = Y.columns
alpha = 0.05
for i in range(len(d)):
d.rotate(i)
X = X[d]
corr = Y.corrwith(X, axis=0)
corr_df = corr_df.append(pd.DataFrame({'col_X':list(d)[:len(Y_cols)], 'col_Y':Y.columns, 'corr':corr[:len(Y_cols)], 'is_correlated':corr[:len(Y_cols)]>alpha}))
print(corr_df.reset_index())
sample input and output:
X:
A B C
0 2 2 10
1 4 0 2
2 8 0 1
3 0 0 8
Y:
B C
0 2 10
1 0 2
2 0 1
3 0 8
correlation(X, Y):
col_X col_Y corr is_correlated
0 A B 1.0 True
1 B C 1.0 True
2 C B 1.0 True
3 A C 1.0 True
4 A B 1.0 True
5 B C 1.0 True
I have a df like this:
frame = pd.DataFrame({'a' : ['a,b,c', 'a,c,f', 'b,d,f','a,z,c']})
And a list of items:
letters = ['a','c']
My goal is to get all the rows from frame that contain at least the 2 elements in letters
I came up with this solution:
for i in letters:
subframe = frame[frame['a'].str.contains(i)]
This gives me what I want, but it might not be the best solution in terms of scalability.
Is there any 'vectorised' solution?
Thanks
I would build a list of Series, and then apply a vectorized np.all:
contains = [frame['a'].str.contains(i) for i in letters]
resul = frame[np.all(contains, axis=0)]
It gives as expected:
a
0 a,b,c
1 a,c,f
3 a,z,c
One way is to split the column values into lists using str.split, and check if set(letters) is a subset of the obtained lists:
letters_s = set(letters)
frame[frame.a.str.split(',').map(letters_s.issubset)]
a
0 a,b,c
1 a,c,f
3 a,z,c
Benchmark:
def serge(frame):
contains = [frame['a'].str.contains(i) for i in letters]
return frame[np.all(contains, axis=0)]
def yatu(frame):
letters_s = set(letters)
return frame[frame.a.str.split(',').map(letters_s.issubset)]
def austin(frame):
mask = frame.a.apply(lambda x: np.intersect1d(x.split(','), letters).size > 0)
return frame[mask]
def datanovice(frame):
s = frame['a'].str.split(',').explode().isin(letters).groupby(level=0).cumsum()
return frame.loc[s[s.ge(2)].index.unique()]
perfplot.show(
setup=lambda n: pd.concat([frame]*n, axis=0).reset_index(drop=True),
kernels=[
lambda df: serge(df),
lambda df: yatu(df),
lambda df: df[df['a'].apply(lambda x: np.all([*map(lambda l: l in x, letters)]))],
lambda df: austin(df),
lambda df: datanovice(df),
],
labels=['serge', 'yatu', 'bruno','austin', 'datanovice'],
n_range=[2**k for k in range(0, 18)],
equality_check=lambda x, y: x.equals(y),
xlabel='N'
)
This also solves it:
frame[frame['a'].apply(lambda x: np.all([*map(lambda l: l in x, letters)]))]
You can use np.intersect1d:
import pandas as pd
import numpy as np
frame = pd.DataFrame({'a' : ['a,b,c', 'a,c,f', 'b,d,f','a,z,c']})
letters = ['a','c']
mask = frame.a.apply(lambda x: np.intersect1d(x.split(','), letters).size > 0)
print(frame[mask])
a
0 a,b,c
1 a,c,f
3 a,z,c
Use set.issubset:
frame = pd.DataFrame({'a' : ['a,b,c', 'a,c,f', 'b,d,f','a,z,c','x,y']})
letters = ['a','c']
frame[frame['a'].apply(lambda x: set(letters).issubset(x))]
Out:
a
0 a,b,c
1 a,c,f
3 a,z,c
IIUC, explode and a boolean filter
the idea is to create a single series then we can groupby the index the count the true occurrences of your list using a cumulative sum
s = frame['a'].str.split(',').explode().isin(letters).groupby(level=0).cumsum()
print(s)
0 1.0
0 1.0
0 2.0
1 1.0
1 2.0
1 2.0
2 0.0
2 0.0
2 0.0
3 1.0
3 1.0
3 2.0
frame.loc[s[s.ge(2)].index.unique()]
out:
a
0 a,b,c
1 a,c,f
3 a,z,c
frame.iloc[[x for x in range(len(frame)) if set(letters).issubset(frame.iloc[x,0])]]
output:
a
0 a,b,c
1 a,c,f
3 a,z,c
timeit
%%timeit
#hermes
frame.iloc[[x for x in range(len(frame)) if set(letters).issubset(frame.iloc[x,0])]]
output
300 µs ± 32.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
I'm trying to subset a list of dataframes with a function. This function would need to return only the df's which for example have a Z-column-total of > 14 and X-column-values (rows 0-4) which are 30% below or above the average of those 5 values. So, in the example below df1 would be returned and df2 not.
Can this be done, evaluating every dataframe with these kinds of conditions? Could anyone point me in the right direction?
N = 5
np.random.seed(0)
df1 = pd.DataFrame(
{'X':np.random.uniform(0,5,N),
'Y':np.random.uniform(0,5,N),
'Z':np.random.uniform(0,5,N),
})
df2 = pd.DataFrame(
{'X':np.random.uniform(0,5,N),
'Y':np.random.uniform(0,5,N),
'Z':np.random.uniform(0,5,N),
})
df1.loc['total'] = df1.sum()
df2.loc['total'] = df2.sum()
df_list = (df1, df2)
X Y Z
0 2.744068 3.229471 3.958625
1 3.575947 2.187936 2.644475
2 3.013817 4.458865 2.840223
3 2.724416 4.818314 4.627983
4 2.118274 1.917208 0.355180
total 14.176521 16.611793 14.426486
--------------------------------------
X Y Z
0 0.435646 4.893092 3.199605
1 0.101092 3.995793 0.716766
2 4.163099 2.307397 4.723345
3 3.890784 3.902646 2.609242
4 4.350061 0.591372 2.073310
total 12.940682 15.690299 13.322267
List comprehension can be used, with the 2 stated conditions.
The Z condition is pretty straightforward and easy to implement. Regarding the X condition, I created a little function that returns True if the dataframe matches the condition, else False.
In [156]: def check_X(df):
...: avg = df.drop('total')['X'].mean()
...: for val in df.drop('total')['X']:
...: if val/avg < 0.7 or val/avg > 1.3: #30% more or less
...: return False
...: return True
...:
Therefore, we can get the expected result by doing:
In [157]: [df for df in df_list if df.drop('total')['Z'].sum() > 14 and check_X(df)]
Out[157]:
[ X Y Z
0 2.744068 3.229471 3.958625
1 3.575947 2.187936 2.644475
2 3.013817 4.458865 2.840223
3 2.724416 4.818314 4.627983
4 2.118274 1.917208 0.355180
total 14.176522 16.611794 14.426486]
Edit: a better, one-liner solution that doesn't use any user-defined function:
In [205]: [df for df in df_list if df['Z'].sum() > 14 and ((df['X'] > df['X'].mean()*0.7) & (df['X'] < df['X'].mean()*1.3)).all()]
Out[205]:
[ X Y Z
0 2.744068 3.229471 3.958625
1 3.575947 2.187936 2.644475
2 3.013817 4.458865 2.840223
3 2.724416 4.818314 4.627983
4 2.118274 1.917208 0.355180]
For simplicity, I dropped the 'total' row from both df before processing:
In [204]: df_list = [df.drop('total') for df in df_list]
If you have a list of dataframes then conditionally select the dataframe using list comprehension and you can use slicing (iloc[0:-1] for excluding last row).
new_list= [x for x in df_list if (x.loc['total','Z']>14) and
((x.iloc[0:-1]['X'] > x.iloc[0:-1]['X'].mean()*0.7) & (x.iloc[0:-1]['X'] < x.iloc[0:-1]['X'].mean()*1.3)).all()]
Output:
[ X Y Z
0 2.744068 3.229471 3.958625
1 3.575947 2.187936 2.644475
2 3.013817 4.458865 2.840223
3 2.724416 4.818314 4.627983
4 2.118274 1.917208 0.355180
total 14.176521 16.611793 14.426486]