Splitting columns containing comma separated string to new row values - python

I have a data frame of the below format
variable val
0 'a','x','y' 10
I would like to unnest (explode) the data into the format below.
variable1 variable2 value
0 a x 10
1 a y 10
2 x y 10
I tried df.explode, but it does not give me the pairing between x and y. My code is below. Can anyone guide me on how to proceed so that I also get the x and y row? Thanks in advance.
import pandas as pd
from ast import literal_eval
data = {'name':["'a','x','y'"], 'val' : [10]}
df = pd.DataFrame(data)
df2 = (df['name'].str.split(',',expand = True, n = 1)
.rename(columns = {0 : 'variable 1', 1 : 'variable 2'})
.join(df.drop(columns = 'name')))
df2['variable 2']=df2['variable 2'].map(literal_eval)
df2=df2.explode('variable 2',ignore_index=True)
print(df2)
OUTPUT:
variable 1 variable 2 val
0 'a' x 10
1 'a' y 10

If you need every pairwise combination of the comma-separated values, use:
print (df)
variable val
0 'a','x','y' 10
1 'a','x','y','f' 80
2 's' 4
from itertools import combinations
df['variable'] = df['variable'].str.replace("'", "", regex=True)
s = [x.split(',') if ',' in x else (x,x) for x in df['variable']]
L = [(*y, z) for x, z in zip(s, df['val']) for y in combinations(x, 2)]
df = pd.DataFrame(L, columns=['variable 1','variable 2','val'])
print (df)
variable 1 variable 2 val
0 a x 10
1 a y 10
2 x y 10
3 a x 80
4 a y 80
5 a f 80
6 x y 80
7 x f 80
8 y f 80
9 s s 4


Randomly select cells in df pandas

From this pandas df
1 1 1 1
1 1 1 1
1 1 1 1
1 1 1 1
samples_indices = df.sample(frac=0.5, replace=False).index
df.loc[samples_indices] = 'X'
will assign 'X' to all columns in randomly selected rows corresponding to 50% of df, like so:
X X X X
1 1 1 1
X X X X
1 1 1 1
But how do I assign 'X' to 50% randomly selected cells in the df?
For example like this:
X X X 1
1 X 1 1
X X X 1
1 1 1 X
Use numpy and boolean indexing, for an efficient solution:
import numpy as np
df[np.random.choice([True, False], size=df.shape)] = 'X'
# with a custom probability:
N = 0.5
df[np.random.choice([True, False], size=df.shape, p=[N, 1-N])] = 'X'
Example output:
0 1 2 3
0 X 1 X X
1 X X 1 X
2 X X X 1
3 X X 1 X
If you need an exact proportion, you can use:
frac = 0.5
# mark the cells whose permuted rank falls below the requested fraction
df[np.random.permutation(df.size).reshape(df.shape) < df.size*frac] = 'X'
Example:
0 1 2 3
0 X 1 X 1
1 X 1 X 1
2 1 1 X 1
3 X X 1 X
@mozway's answer sets cells to 'X' with a certain probability. If you want exactly 50% of the cells marked as 'X', this is how you can do it:
import numpy as np
# df.size - df.size // 2 keeps the mask the right length when df.size is odd
df[np.random.permutation(np.hstack([np.ones(df.size // 2), np.zeros(df.size - df.size // 2)])).astype(bool).reshape(df.shape)] = 'X'
Example output:
X X X 1
1 X 1 1
X X X 1
1 1 1 X
Create a MultiIndex Series with DataFrame.stack, sample it with Series.sample, and finally fill the removed values with X in Series.unstack:
N = 0.5
df = (df.stack().sample(frac=1-N).unstack(fill_value='X')
.reindex(index=df.index, columns=df.columns, fill_value='X'))
print (df)
0 1 2 3
0 X X 1 1
1 X 1 X 1
2 1 X X X
3 1 1 1 X
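The exact-proportion idea above can also be sketched by drawing distinct flat cell positions, which works for any fraction and for frames with an odd number of cells. This is only a sketch: the helper name mask_exact_fraction and its seed parameter are hypothetical, and the frame is cast to object dtype first so that mixing numbers and strings is explicit.

```python
import numpy as np
import pandas as pd

def mask_exact_fraction(df, frac, marker="X", seed=None):
    """Return a copy of df with round(df.size * frac) cells replaced by marker."""
    rng = np.random.default_rng(seed)
    n_cells = df.size
    n_marked = int(round(n_cells * frac))
    # Draw distinct flat positions, then turn them into a boolean mask.
    flat_idx = rng.choice(n_cells, size=n_marked, replace=False)
    mask = np.zeros(n_cells, dtype=bool)
    mask[flat_idx] = True
    # Object dtype lets integers and the string marker share a column.
    out = df.astype(object)
    return out.mask(mask.reshape(df.shape), marker)

df = pd.DataFrame(np.ones((3, 5), dtype=int))
masked = mask_exact_fraction(df, 0.4, seed=0)
n_marked = int((masked == "X").to_numpy().sum())
print(masked)
print(n_marked)
```

Returning a copy instead of assigning into df leaves the original data untouched, which makes it easier to repeat the sampling with a different seed.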

How to fit a column as a function of another column in Python dataframe

For example, I have a dataframe (df) whose target column is df['Z'], plus two other columns, df['X'] and df['Y']. All of this data comes from real-world measurements.
How can I fit Z as a function of X and Y in Python, i.e. find an equation for each of the following forms?
> 1. Z = f(X)
> 2. Z = f(X,Y)
That's how you do that:
def function(x, y):
    return x+y+4 # Obviously the function can be more complex
df["Z"] = function(df["A"], df["B"])
Example
data = {'A': [x for x in range(5)], 'B': [x for x in range(6,11)]}
df = pd.DataFrame(data)
def function(x,y):
    return x+y+4
df["Z"] = function(df["A"], df["B"])
print(df)
Output:
A B Z
0 0 6 10
1 1 7 12
2 2 8 14
3 3 9 16
4 4 10 18
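The answer above computes Z from a function that is already known. If the goal is instead to estimate the function from the data, one possible approach is a least-squares fit. The sketch below assumes that Z is roughly linear in X and Y and uses synthetic data in place of the real measurements; the coefficients 2.0, -1.5 and 4.0 are made up for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the measured columns (assumed linear relationship).
rng = np.random.default_rng(0)
df = pd.DataFrame({"X": rng.uniform(0, 10, 50), "Y": rng.uniform(0, 10, 50)})
df["Z"] = 2.0 * df["X"] - 1.5 * df["Y"] + 4.0

x = df["X"].to_numpy()
y = df["Y"].to_numpy()
z = df["Z"].to_numpy()

# 1. Z = f(X): straight line Z ~ a*X + b fitted with numpy.polyfit.
a, b = np.polyfit(x, z, deg=1)

# 2. Z = f(X, Y): plane Z ~ c0*X + c1*Y + c2 solved by linear least squares.
A = np.column_stack([x, y, np.ones(len(df))])
coeffs, *_ = np.linalg.lstsq(A, z, rcond=None)
df["Z_fit"] = A @ coeffs

print("line:", a, b)
print("plane:", coeffs)
```

For relationships that are not linear, the same design-matrix idea extends to polynomial terms, and scipy.optimize.curve_fit is a common choice for arbitrary model functions.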

limiting data set to be used xlim

I have many files that contain x, y and yerr columns. I read them, apply a transformation to the x values, and then want to restrict the data to a range of the transformed values (newx) for everything that follows:
for key, value in files_data.items():
    file_short_name = key
    D_value_sale = value[1]
    data = pd.DataFrame(value[0])
    if data.shape[1] == 3:
        data.columns = ["x", "y", "yerr"]
    else:
        data.columns = ["x", "y"]
    D = D_value_sale
    b = 111
    c = 222
    data["newx"] = -c*(((data.x*(1/(1+D)))-b)/b)
    data["newy"] = (data.y-data.y.min())/(data.y.max()-data.y.min())
    w = data[(data.newx < 20000) & (data.newx > 8000)]
    dfx = w["newx"]
    dfy = w["newy"]
    peak = GaussianModel()
    pars = offset.make_params(c=np.median(dfy))
    pars += peak.guess(dfy, x= dfy, amplitude=-0.5)
    result = model.fit(dfy, pars, dfx)
If I'm understanding correctly what you are asking this is what you could do:
for key, value in files_data.items():
    file_short_name = key
    # main = value[1]
    data = pd.DataFrame(value[0])
    if data.shape[1] == 3:
        data.columns = ["x", "y", "yerr"]
    else:
        # Here you should define what happens in case
        # the data isn't what you expected it to be
        continue # placeholder (an assumption): skip this file
    data["newx"] = data.x + 1 # Perform whatever transformation you need
    # data["newy"] = data.y * (1.01234) # Etc.
    # Then you can limit the newx column by doing:
    data[(data.newx < upper_limit) & (data.newx > lower_limit)]
What you're doing won't work if you want to preserve the relationship between columns. When you assign the data columns to their own variables xval, yval and error, you are implicitly "losing" their relationship.
I'll open with the same caveat, "if I'm understanding you correctly": the crux of what you are looking for is the boolean array that you created to apply your limits:
data = data[(data[0] >= xlim[0]) & (data[0] <= xlim[1])]
This boolean array can be saved and applied to any array of the same shape.
idx = (data[0] >= xlim[0]) & (data[0] <= xlim[1])
filtered_data = data[0][idx]
filtered_newxval = newxval[idx]
For a more complete, self-contained example, see the code below, where the same concept is applied to multidimensional arrays and pandas dataframes:
import numpy as np
import pandas as pd
np.random.seed(42)
x = np.random.randint(0, 20, 10)
y = np.random.randint(0, 20, 10)
print("x", x)
# >>> x [ 6 19 14 10 7 6 18 10 10 3]
print("y", y)
# >>> y [ 7 2 1 11 5 1 0 11 11 16]
xmin = 3
xmax = 17
idx = (x >= xmin) & (x <= xmax)
data = np.vstack((x, y))
print("filtered_data:\n", data[:, idx])
# >>> filtered_data:
# [[ 6 14 10 7 6 10 10 3]
# [ 7 1 11 5 1 11 11 16]]
df = pd.DataFrame({"x": x, "y": y})
df["xnew"] = df["x"] * 2
print(df[idx])
# >>> x y xnew
# >>> 0 6 7 12
# >>> 2 14 1 28
# >>> 3 10 11 20
# >>> 4 7 5 14
# >>> 5 6 1 12
# >>> 7 10 11 20
# >>> 8 10 11 20
# >>> 9 3 16 6

pandas convert text feature to numeric value

I can convert all text features in a pandas dataframe by casting to 'category' using the df.astype() method, as below. However, I find category hard to work with (e.g. for plotting data) and would prefer to create a new column of integers.
#convert all objects to categories
object_types = dataset.select_dtypes(include=['O'])
for col in object_types:
    dataset['{0}_category'.format(col)] = dataset[col].astype('category')
I can convert the text to integers using this hack:
#convert all objects to int values
object_types = dataset.select_dtypes(include=['O'])
new_cols = {}
for col in object_types:
    data_set = set(dataset[col].tolist())
    data_indexed = {}
    for i, item in enumerate(data_set):
        data_indexed[item] = i
    new_list = []
    for item in dataset[col].tolist():
        new_list.append(data_indexed[item])
    new_cols[col]=new_list
for key, val in new_cols.items():
    dataset['{0}_int_value'.format(key)] = val
But is there a better (or existing) way to do the same?
I would use the factorize method, which is designed for this particular task:
In [90]: x
Out[90]:
A B
9 c z
10 c z
4 b x
5 b y
1 a w
7 b z
In [91]: x.apply(lambda col: pd.factorize(col, sort=True)[0])
Out[91]:
A B
9 2 3
10 2 3
4 1 1
5 1 2
1 0 0
7 1 3
or:
In [92]: x.apply(lambda col: pd.factorize(col)[0])
Out[92]:
A B
9 0 0
10 0 0
4 1 1
5 1 2
1 2 3
7 1 0
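To make the difference between the two calls above concrete: pd.factorize returns both the integer codes and the unique values, so the codes can be mapped back to the original labels. A small sketch on a made-up Series (the variable names are illustrative only):

```python
import pandas as pd

s = pd.Series(["c", "c", "b", "b", "a", "b"])

# By default, codes follow the order in which values first appear.
codes, uniques = pd.factorize(s)

# With sort=True, codes follow the sorted order of the unique values.
sorted_codes, sorted_uniques = pd.factorize(s, sort=True)

# The uniques map the integer codes back to the original labels.
restored = pd.Series(uniques.take(codes), index=s.index)

print(codes, list(uniques))
print(sorted_codes, list(sorted_uniques))
```

Keeping uniques around is useful when the integer column later needs to be turned back into readable labels, for instance for plot tick labels.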
Consider df:
import numpy as np
import pandas as pd
df = pd.DataFrame(dict(A=list('aaaabbbbcccc'),
                       B=list('wwxxxyyzzzzz')))
df
You can convert to integers like this:
def intify(s):
    u = np.unique(s)
    i = np.arange(len(u))
    return s.map(dict(zip(u, i)))
Or, as a shorter version:
def intify(s):
    u = np.unique(s)
    return s.map({k: i for i, k in enumerate(u)})
df.apply(intify)
Or in a single line (here the codes follow order of first appearance rather than sorted order):
df.apply(lambda s: s.map({k:i for i,k in enumerate(s.unique())}))
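Since the question already casts columns to 'category', another option worth noting is the integer codes that a categorical column carries, available through the .cat.codes accessor. A minimal sketch, assuming there are no missing values (a missing value would be coded as -1); the data is made up:

```python
import pandas as pd

df = pd.DataFrame({"A": list("aabbcc"), "B": list("wxxyzz")})

# Cast each column to category and keep only its integer codes.
as_int = df.apply(lambda col: col.astype("category").cat.codes)

# For string data the categories are sorted, so this matches factorize(sort=True).
same_as_factorize = df.apply(lambda col: pd.factorize(col, sort=True)[0])

print(as_int)
```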

Keep certain columns in a pandas DataFrame, deleting everything else

Say I have a data table
1 2 3 4 5 6 .. n
A x x x x x x .. x
B x x x x x x .. x
C x x x x x x .. x
And I want to slim it down so that I only have, say, columns 3 and 5, deleting all the others while maintaining the structure. How could I do this with pandas? I think I understand how to delete a single column, but I don't know how to keep a select few and delete all the others.
If you have a list of columns you can just select those:
In [11]: df
Out[11]:
1 2 3 4 5 6
A x x x x x x
B x x x x x x
C x x x x x x
In [12]: col_list = [3, 5]
In [13]: df = df[col_list]
In [14]: df
Out[14]:
3 5
A x x
B x x
C x x
How do I keep certain columns in a pandas DataFrame, deleting everything else?
The answer to this question is the same as the answer to "How do I delete certain columns in a pandas DataFrame?" Here are some additional options to those mentioned so far, along with timings.
DataFrame.loc
One simple option is selection, as mentioned in other answers:
# Setup.
df
1 2 3 4 5 6
A x x x x x x
B x x x x x x
C x x x x x x
cols_to_keep = [3,5]
df[cols_to_keep]
3 5
A x x
B x x
C x x
Or,
df.loc[:, cols_to_keep]
3 5
A x x
B x x
C x x
DataFrame.reindex with axis=1 or 'columns' (0.21+)
However, we also have reindex; in recent versions you specify axis=1 so that the labels are treated as the columns to keep:
df.reindex(cols_to_keep, axis=1)
# df.reindex(cols_to_keep, axis='columns')
# for versions < 0.21, use
# df.reindex(columns=cols_to_keep)
3 5
A x x
B x x
C x x
On older versions, you can also use reindex_axis: df.reindex_axis(cols_to_keep, axis=1).
DataFrame.drop
Another alternative is to use drop, computing the columns to remove with pd.Index.difference:
# df.drop(cols_to_drop, axis=1)
df.drop(df.columns.difference(cols_to_keep), axis=1)
3 5
A x x
B x x
C x x
Performance
The methods are roughly the same in terms of performance; reindex is faster for smaller N, while drop is faster for larger N. The performance is relative as the Y-axis is logarithmic.
Setup and Code
import numpy as np
import pandas as pd
import perfplot
def make_sample(n):
    np.random.seed(0)
    df = pd.DataFrame(np.full((n, n), 'x'))
    cols_to_keep = np.random.choice(df.columns, max(2, n // 4), replace=False)
    return df, cols_to_keep
perfplot.show(
    setup=lambda n: make_sample(n),
    kernels=[
        lambda inp: inp[0][inp[1]],
        lambda inp: inp[0].loc[:, inp[1]],
        lambda inp: inp[0].reindex(inp[1], axis=1),
        lambda inp: inp[0].drop(inp[0].columns.difference(inp[1]), axis=1)
    ],
    labels=['__getitem__', 'loc', 'reindex', 'drop'],
    n_range=[2**k for k in range(2, 13)],
    xlabel='N',
    logy=True,
    equality_check=lambda x, y: (x.reindex_like(y) == y).values.all()
)
You could reassign a new value to your DataFrame, df:
df = df.loc[:,[3, 5]]
As long as there are no other references to the original DataFrame, the old DataFrame will get garbage collected.
Note that when using df.loc, the index is specified by labels. Thus above 3 and 5 are not ordinals, they represent the label names of the columns. If you wish to specify the columns by ordinal index, use df.iloc.
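The label-versus-position distinction can be illustrated with a small sketch. It assumes columns labelled 1 to 6 as in the question, so label 3 sits at position 2; the numeric cell values are made up so that the two selections can be compared.

```python
import numpy as np
import pandas as pd

# Columns are labelled 1..6, rows A..C; values are arbitrary.
df = pd.DataFrame(np.arange(18).reshape(3, 6),
                  index=list("ABC"), columns=range(1, 7))

by_label = df.loc[:, [3, 5]]      # columns whose labels are 3 and 5
by_position = df.iloc[:, [2, 4]]  # third and fifth columns, counted from 0

print(by_label)
print(by_label.equals(by_position))
```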
For those who are searching for a method to do this in place:
from pandas import DataFrame
from typing import Set, Any
def remove_others(df: DataFrame, columns: Set[Any]):
    cols_total: Set[Any] = set(df.columns)
    diff: Set[Any] = cols_total - columns
    df.drop(diff, axis=1, inplace=True)
This computes the set difference between all the columns in the dataframe and the columns which should be kept. The columns in that difference can safely be removed. Drop works even on an empty set.
>>> df = DataFrame({"a":[1,2,3],"b":[2,3,4],"c":[3,4,5]})
>>> df
a b c
0 1 2 3
1 2 3 4
2 3 4 5
>>> remove_others(df, {"a","b","c"})
>>> df
a b c
0 1 2 3
1 2 3 4
2 3 4 5
>>> remove_others(df, {"a"})
>>> df
a
0 1
1 2
2 3
>>> remove_others(df, {"a","not","existent"})
>>> df
a
0 1
1 2
2 3
Another approach is to use filter:
In [5]: df.filter([3, 5])
Out[5]:
3 5
A x x
B x x
C x x
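Besides a list of exact labels, filter can also select columns by substring or regular expression through its like and regex parameters. A short sketch on a made-up frame (the column names are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 2], "price_usd": [9.5, 3.0],
                   "price_eur": [8.7, 2.8], "note": ["a", "b"]})

# Exact labels; labels that do not exist in df are silently ignored.
keep_items = df.filter(items=["id", "note", "missing"])
# Labels containing the given substring.
keep_like = df.filter(like="price")
# Labels matching the regular expression.
keep_regex = df.filter(regex=r"_usd$")

print(list(keep_items.columns), list(keep_like.columns), list(keep_regex.columns))
```

Note that filter returns a new DataFrame, so assign the result back (df = df.filter(...)) if the other columns should be discarded.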
