Duplicating a Pandas DF N times - python

So right now, if I multiply a list, i.e. x = [1,2,3] * 2, I get x as [1,2,3,1,2,3]. But this doesn't work with Pandas.
So if I want to duplicate a pandas DF, I have to make a column a list and multiply:
col_x_duplicates = list(df['col_x']) * N
new_df = pd.DataFrame(col_x_duplicates, columns=['col_x'])
Then do a join on the original data:
pd.merge(new_df, df, on='col_x', how='left')
This duplicates the pandas DF N times. Is there an easier way, or even a quicker way?

Actually, since you want to duplicate the entire dataframe (and not each element), numpy.tile() may be better:
In [69]: import pandas as pd

In [70]: import numpy as np

In [71]: arr = np.array([[1, 2, 3], [4, 5, 6]])

In [72]: arr
Out[72]:
array([[1, 2, 3],
       [4, 5, 6]])

In [73]: df = pd.DataFrame(np.tile(arr, (5, 1)))

In [74]: df
Out[74]:
   0  1  2
0  1  2  3
1  4  5  6
2  1  2  3
3  4  5  6
4  1  2  3
5  4  5  6
6  1  2  3
7  4  5  6
8  1  2  3
9  4  5  6

[10 rows x 3 columns]

In [75]: df = pd.DataFrame(np.tile(arr, (1, 3)))

In [76]: df
Out[76]:
   0  1  2  3  4  5  6  7  8
0  1  2  3  1  2  3  1  2  3
1  4  5  6  4  5  6  4  5  6

[2 rows x 9 columns]
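For completeness, here is a minimal sketch of the same idea applied to an actual DataFrame (the col_y column is made up for the example): np.tile returns a plain ndarray, so the column labels have to be restored by hand.
import numpy as np
import pandas as pd

df = pd.DataFrame({'col_x': [1, 2, 3], 'col_y': [4, 5, 6]})
N = 2

# repeat the raw values N times along the row axis, then rebuild the frame
duplicated = pd.DataFrame(np.tile(df.values, (N, 1)), columns=df.columns)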

Here is a one-liner to make a DataFrame with n copies of a DataFrame df:
n_df = pd.concat([df] * n)
Example:
df = pd.DataFrame(
    data=[[34, 'null', 'mark'], [22, 'null', 'mark'], [34, 'null', 'mark']],
    columns=['id', 'temp', 'name'],
    index=pd.Index([1, 2, 3], name='row')
)
n = 4
n_df = pd.concat([df] * n)
Then n_df is the following DataFrame:
     id  temp  name
row
1    34  null  mark
2    22  null  mark
3    34  null  mark
1    34  null  mark
2    22  null  mark
3    34  null  mark
1    34  null  mark
2    22  null  mark
3    34  null  mark
1    34  null  mark
2    22  null  mark
3    34  null  mark
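Note that concat keeps the original index labels, which is why row cycles through 1, 2, 3 four times above. If a fresh RangeIndex is preferred, ignore_index=True is the usual fix:
# discard the repeated 'row' labels and renumber the rows 0..11
n_df = pd.concat([df] * n, ignore_index=True)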


Difference many columns from a baseline column in pandas

I have a baseline column (base) in a pandas data frame and I want to difference all other columns x* from this column while preserving two groups group1, group2:
The easiest way is simply to difference column by column:
df = pd.DataFrame({'group1': [0, 0, 1, 1], 'group2': [2, 2, 3, 4],
                   'base': [0, 1, 2, 3], 'x1': [3, 4, 5, 6], 'x2': [5, 6, 7, 8]})
df['diff_x1'] = df['x1'] - df['base']
df['diff_x2'] = df['x2'] - df['base']
   group1  group2  base  x1  x2  diff_x1  diff_x2
0       0       2     0   3   5        3        5
1       0       2     1   4   6        3        5
2       1       3     2   5   7        3        5
3       1       4     3   6   8        3        5
But I have hundreds of columns I need to do this for, so I'm looking for a more efficient way.
You can subtract a Series from a dataframe column-wise using the sub method with axis=0, which saves you from doing the subtraction for each column individually:
to_sub = df.filter(regex='x.*') # filter based on your actual logic
pd.concat([
    df,
    to_sub.sub(df.base, axis=0).add_prefix('diff_')
], axis=1)
#    group1  group2  base  x1  x2  diff_x1  diff_x2
# 0       0       2     0   3   5        3        5
# 1       0       2     1   4   6        3        5
# 2       1       3     2   5   7        3        5
# 3       1       4     3   6   8        3        5
Another way is to use df.drop(..., axis=1) and then pass the remaining dataframe into sub(..., axis=0). This guarantees you catch all remaining columns and preserve their order, without needing a regex.
df_diff = df.drop(['group1','group2','base'], axis=1).sub(df['base'], axis=0).add_prefix('diff_')
   diff_x1  diff_x2
0        3        5
1        3        5
2        3        5
3        3        5
Hence your full solution is:
pd.concat([df, df_diff], axis=1)
   group1  group2  base  x1  x2  diff_x1  diff_x2
0       0       2     0   3   5        3        5
1       0       2     1   4   6        3        5
2       1       3     2   5   7        3        5
3       1       4     3   6   8        3        5
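If maintaining a regex or an explicit column list is a concern, a sketch along the same lines (assuming group1, group2 and base are the only non-value columns) selects everything else with columns.difference. Note that difference returns the labels sorted, so the original column order is not preserved:
# every column except the grouping keys and the baseline itself
value_cols = df.columns.difference(['group1', 'group2', 'base'])
df_diff = df[value_cols].sub(df['base'], axis=0).add_prefix('diff_')
result = pd.concat([df, df_diff], axis=1)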

How to merge multiple columns of the same data frame

How do I merge multiple column values into one column of the same data frame and get a new column with unique values?
  Column1  Column2  Column3  Column4  Column5
0       a        1        2        3        4
1       a        3        4        5
2       b        6        7        8
3       c        7        7
Output:
Column A
a
a
b
c
1
3
6
7
2
4
5
8
Use unstack or melt to reshape, remove missing values with dropna and duplicates with drop_duplicates:
df1 = df.unstack().dropna().drop_duplicates().reset_index(drop=True).to_frame('A')
df1 = df.melt(value_name='A')[['A']].dropna().drop_duplicates().reset_index(drop=True)
print (df1)
A
0 a
1 b
2 c
3 1
4 3
5 6
6 7
7 2
8 4
9 5
10 8
Here is another way to do it if you are OK using numpy. This will handle either NaNs or empty strings in the original dataframe and is a bit faster than unstack or melt.
import pandas as pd
import numpy as np
df = pd.DataFrame({'Column1': ['a', 'a', 'b', 'c'],
                   'Column2': [1, 3, 6, 7],
                   'Column3': [2, 4, 7, 7],
                   'Column4': [3, 5, 8, np.nan],
                   'Column5': [4, '', '', np.nan]})
u = pd.unique(df.values.flatten(order='F'))
u = u[np.where(~np.isin(u, ['']) & ~pd.isnull(u))[0]]
df1 = pd.DataFrame(u, columns=['A'])
print(df1)
A
0 a
1 b
2 c
3 1
4 3
5 6
6 7
7 2
8 4
9 5
10 8
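A stack-based sketch is also possible, with two caveats: stack drops NaN by default but not empty strings, so blanks need converting first, and it walks the frame row-wise, so the output order differs from the column-wise order shown above.
# blanks become NaN, which stack then drops; dedupe what remains
s = df.replace('', np.nan).stack()
df1 = s.drop_duplicates().reset_index(drop=True).to_frame('A')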

Pandas - Using a list of values to create a smaller frame

I have a list of values that are found in a large pandas dataframe:
value_list = [1, 4, 5, 6, 54]
Example DataFrame df is below:
   column    x
0       1    3
1       4    6
2       5    8
3       6   19
4       8   21
5      12   97
6      54  102
I would like to create a subset of the data frame using only these values:
df_new = df[df['column'] is in value_list] # pseudo code
Is this possible?
You might be looking for the isin operation.
In [60]: df[df['column'].isin(value_list)]
Out[60]:
   column    x
0       1    3
1       4    6
2       5    8
3       6   19
6      54  102
Also, you can use query like:
In [63]: df.query('column in @value_list')
Out[63]:
   column    x
0       1    3
1       4    6
2       5    8
3       6   19
6      54  102
You missed a for loop: you can build the boolean mask with a comprehension over the column, although isin is faster:
df_new = df[[elem in value_list for elem in df['column']]]
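As a side note, since isin returns a boolean Series, the complementary subset (rows not in the list) is just the negated mask:
# ~ inverts the boolean mask element-wise
df_excluded = df[~df['column'].isin(value_list)]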

Pandas: How to filter dataframe for duplicate items that occur at least n times in a dataframe

I have a Pandas DataFrame that contains duplicate entries; some items are listed twice or three times. I would like to filter it so that it only shows items that are listed at least n times:
The DataFrame contains 3 columns: ['colA', 'colB', 'colC']. It should only consider 'colB' in determining whether the item is listed multiple times.
Note: this is not drop_duplicates(). It's the opposite: I would like to drop items that appear in the dataframe fewer than n times.
The end result should list each item only once.
You can use value_counts to get the item counts, construct a boolean mask from the result, take its index, and test membership with isin:
In [3]:
df = pd.DataFrame({'a':[0,0,0,1,2,2,3,3,3,3,3,3,4,4,4]})
df
Out[3]:
a
0 0
1 0
2 0
3 1
4 2
5 2
6 3
7 3
8 3
9 3
10 3
11 3
12 4
13 4
14 4
In [8]:
df[df['a'].isin(df['a'].value_counts()[df['a'].value_counts()>2].index)]
Out[8]:
a
0 0
1 0
2 0
6 3
7 3
8 3
9 3
10 3
11 3
12 4
13 4
14 4
So breaking the above down:
In [9]:
df['a'].value_counts() > 2
Out[9]:
3 True
4 True
0 True
2 False
1 False
Name: a, dtype: bool
In [10]:
# filter the counts using the boolean mask
df['a'].value_counts()[df['a'].value_counts()>2]
Out[10]:
3 6
4 3
0 3
Name: a, dtype: int64
In [11]:
# we're interested in the index here, pass this to isin
df['a'].value_counts()[df['a'].value_counts()>2].index
Out[11]:
Int64Index([3, 4, 0], dtype='int64')
EDIT
As user @JonClements suggested, a simpler and faster method is to groupby on the col of interest and filter it:
In [4]:
df.groupby('a').filter(lambda x: len(x) > 2)
Out[4]:
a
0 0
1 0
2 0
6 3
7 3
8 3
9 3
10 3
11 3
12 4
13 4
14 4
EDIT 2
To get just a single entry for each repeat, call drop_duplicates and pass subset='a':
In [2]:
df.groupby('a').filter(lambda x: len(x) > 2).drop_duplicates(subset='a')
Out[2]:
a
0 0
6 3
12 4
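For large frames, a vectorized sketch using transform('size') will often beat filter with a Python lambda: the group sizes are broadcast back onto the original index, so the result can be used directly as a boolean mask.
# every row receives the size of its group; compare against the threshold
df[df.groupby('a')['a'].transform('size') > 2]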

Pandas lookup from one of multiple columns, based on value

I have the following DataFrame:
Date  best   a  b   c   d
1990     a   5  4   7   2
1991     c  10  1   2   0
1992     d   2  1   4  12
1993     a   5  8  11   6
I would like to make a dataframe as follows:
Date  best  value
1990     a      5
1991     c      2
1992     d     12
1993     a      5
So I am looking to find a value based on another column's value in the same row, using column names. For instance, the row for 1990 in the second df should look up "a" from the first df, and the second row should look up "c" (= 2).
Any ideas?
There is a built-in lookup function that can handle this type of situation (it looks up by row/column). I don't know how optimized it is, but it may be faster than the apply solution.
In [9]: df['value'] = df.lookup(df.index, df['best'])
In [10]: df
Out[10]:
   Date best   a  b   c   d  value
0  1990    a   5  4   7   2      5
1  1991    c  10  1   2   0      2
2  1992    d   2  1   4  12     12
3  1993    a   5  8  11   6      5
You can create a lookup function and call apply on your dataframe row-wise; this isn't very efficient for large dfs, though.
In [245]:
def lookup(x):
    return x[x.best]

df['value'] = df.apply(lambda row: lookup(row), axis=1)
df
Out[245]:
   Date best   a  b   c   d  value
0  1990    a   5  4   7   2      5
1  1991    c  10  1   2   0      2
2  1992    d   2  1   4  12     12
3  1993    a   5  8  11   6      5
You can do this using np.where as shown below; I think it will be more efficient:
import numpy as np
import pandas as pd

df = pd.DataFrame([['1990', 'a', 5, 4, 7, 2], ['1991', 'c', 10, 1, 2, 0],
                   ['1992', 'd', 2, 1, 4, 12], ['1993', 'a', 5, 8, 11, 6]],
                  columns=('Date', 'best', 'a', 'b', 'c', 'd'))
arr = df.best.values
cols = df.columns[2:]
for col in cols:
    # where the 'best' label matches this column, take this column's value
    arr2 = df[col].values
    arr = np.where(arr == col, arr2, arr)
df.drop(columns=cols, inplace=True)
df["values"] = arr
df
Result
   Date best values
0  1990    a      5
1  1991    c      2
2  1992    d     12
3  1993    a      5
lookup is deprecated since version 1.2.0. With melt you can 'unpivot' columns to the row axis: the column names are stored by default in the column variable and their values in value. query then returns only the rows where the columns best and variable are equal, and drop and sort_values are used to match your requested format.
df_new = (
    df.melt(id_vars=['Date', 'best'], value_vars=['a', 'b', 'c', 'd'])
      .query('best == variable')
      .drop('variable', axis=1)
      .sort_values('Date')
)
Output:
    Date best  value
0   1990    a      5
9   1991    c      2
14  1992    d     12
3   1993    a      5
A simple solution that uses a mapper dictionary:
vals = df[['a','b','c','d']].to_dict('list')
mapper = {k: vals[v][k] for k,v in zip(df.index, df['best'])}
df['value'] = df.index.map(mapper).to_numpy()
Output:
   Date best   a  b   c   d  value
0  1990    a   5  4   7   2      5
1  1991    c  10  1   2   0      2
2  1992    d   2  1   4  12     12
3  1993    a   5  8  11   6      5
Look up the values by row position and column label instead, since DataFrame.lookup is deprecated since version 1.2.0:
import numpy as np

idx, cols = pd.factorize(df['best'])
df['value'] = df.reindex(cols, axis=1).to_numpy()[np.arange(len(df)), idx]
print(df)
   Date best   a  b   c   d  value
0  1990    a   5  4   7   2      5
1  1991    c  10  1   2   0      2
2  1992    d   2  1   4  12     12
3  1993    a   5  8  11   6      5
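A closely related sketch (a variant, not from the answers above) skips factorize and uses get_indexer instead: it maps each 'best' label to its column position in the frame, then plain NumPy fancy indexing picks one value per row.
import numpy as np

# position of each row's 'best' label among df's columns
col_idx = df.columns.get_indexer(df['best'])
df['value'] = df.to_numpy()[np.arange(len(df)), col_idx]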
