How to repeat a pandas data frame column x amount of times? - python

If I have a Pandas Dataframe like this:
A
1 8
2 9
3 7
4 2
How do I repeat it x number of times? For example, if I wanted to repeat it 3 times I would get something like this:
A B C D
1 8 8 8 8
2 9 9 9 9
3 7 7 7 7
4 2 2 2 2

Use concat:
n = 3
pd.concat([df] * (n+1), axis=1, ignore_index=True)
0 1 2 3
1 8 8 8 8
2 9 9 9 9
3 7 7 7 7
4 2 2 2 2
If you want the columns renamed, use rename:
(pd.concat([df] * (n+1), axis=1, ignore_index=True)
.rename(lambda x: chr(ord('A')+x), axis=1))
A B C D
1 8 8 8 8
2 9 9 9 9
3 7 7 7 7
4 2 2 2 2
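For reference, the same approach as a self-contained sketch. The sample frame is rebuilt from the question, and naming the columns with string.ascii_uppercase is an assumption made here for illustration (it only covers up to 26 columns):

```python
import string

import pandas as pd

# Sample frame from the question: one column "A", index 1..4.
df = pd.DataFrame({"A": [8, 9, 7, 2]}, index=[1, 2, 3, 4])
n = 3  # number of extra copies of the column

# Concatenate n + 1 copies side by side; ignore_index=True relabels the columns 0..n.
repeated = pd.concat([df] * (n + 1), axis=1, ignore_index=True)

# Relabel 0..n as "A", "B", "C", ... (assumes n + 1 <= 26).
repeated.columns = list(string.ascii_uppercase[: n + 1])

print(repeated)
```

The row index 1..4 is carried over unchanged because only the column labels are reset.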

You can use Numpy to repeat the values and reconstruct the dataframe.
n = 3
pd.DataFrame(np.tile(df.values, n + 1), columns = df.columns.tolist()+list('BCD'))
A B C D
0 8 8 8 8
1 9 9 9 9
2 7 7 7 7
3 2 2 2 2
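Note that building the frame from a bare array resets the index to 0..3, as the output above shows. A hedged variant that keeps the original index and derives the column names from n (the naming scheme is an assumption for illustration, not part of the answer):

```python
import numpy as np
import pandas as pd

# Sample frame from the question: one column "A", index 1..4.
df = pd.DataFrame({"A": [8, 9, 7, 2]}, index=[1, 2, 3, 4])
n = 3  # number of extra copies of the column

# Hypothetical naming scheme: "A", "B", "C", ... for n + 1 columns.
names = [chr(ord("A") + i) for i in range(n + 1)]

# np.tile repeats the (4, 1) array n + 1 times along the column axis;
# passing index=df.index keeps the original row labels.
tiled = pd.DataFrame(np.tile(df.to_numpy(), n + 1), index=df.index, columns=names)

print(tiled)
```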

You can use concat as @coldspeed did, or you can set the columns manually:
df['B'] = df.A
df['C'] = df.A
df['D'] = df.A
print(df)
A B C D
1 8 8 8 8
2 9 9 9 9
3 7 7 7 7
4 2 2 2 2

Related

Update values in dataframe based on dictionary and condition

I have a dataframe and a dictionary that contains some of the columns of the dataframe and some values. I want to update the dataframe based on the dictionary values, and pick the higher value.
>>> df1
a b c d e f
0 4 2 6 2 8 1
1 3 6 7 7 8 5
2 2 1 1 6 8 7
3 1 2 7 3 3 1
4 1 7 2 6 7 6
5 4 8 8 2 2 1
and the dictionary is
compare = {'a':4, 'c':7, 'e':3}
So I want to check the values in columns ['a','c','e'] and replace with the value in the dictionary, if it is higher.
What I have tried is this:
comp = pd.DataFrame(pd.Series(compare).reindex(df1.columns).fillna(0)).T
df1[df1.columns] = df1.apply(lambda x: np.where(x>comp, x, comp)[0] ,axis=1)
Expected output:
>>>df1
a b c d e f
0 4 2 7 2 8 1
1 4 6 7 7 8 5
2 4 1 7 6 8 7
3 4 2 7 3 3 1
4 4 7 7 6 7 6
5 4 8 8 2 3 1
Another possible solution, based on numpy:
cols = list(compare.keys())
df[cols] = np.maximum(df[cols].values, np.array(list(compare.values())))
Output:
a b c d e f
0 4 2 7 2 8 1
1 4 6 7 7 8 5
2 4 1 7 6 8 7
3 4 2 7 3 3 1
4 4 7 7 6 7 6
5 4 8 8 2 3 1
limits = df.columns.map(compare).to_series(index=df.columns)
new = df.mask(df < limits, limits, axis=1)
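obtain a Series whose index is the columns of df and whose values come from the dictionary (columns missing from the dictionary get NaN)
check whether the frame's values are less than the limits; if so, take the limit; otherwise keep the value as is (comparisons against NaN are False, so those columns are left untouched)
to get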
>>> new
a b c d e f
0 4 2 7 2 8 1
1 4 6 7 7 8 5
2 4 1 7 6 8 7
3 4 2 7 3 3 1
4 4 7 7 6 7 6
5 4 8 8 2 3 1
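A pandas-only alternative is clip with a per-column lower bound. This is a sketch rather than one of the original answers; it rebuilds df1 and compare from the question:

```python
import pandas as pd

# Frame and dictionary from the question.
df1 = pd.DataFrame(
    {
        "a": [4, 3, 2, 1, 1, 4],
        "b": [2, 6, 1, 2, 7, 8],
        "c": [6, 7, 1, 7, 2, 8],
        "d": [2, 7, 6, 3, 6, 2],
        "e": [8, 8, 8, 3, 7, 2],
        "f": [1, 5, 7, 1, 6, 1],
    }
)
compare = {"a": 4, "c": 7, "e": 3}

cols = list(compare)
# Raise every value below its column's limit up to that limit;
# axis=1 aligns the Series of limits with the column labels.
df1[cols] = df1[cols].clip(lower=pd.Series(compare), axis=1)

print(df1)
```

Only the columns named in the dictionary are touched, so b, d and f keep their original values.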

Slicing Pandas data frame into two parts

Actually I thought this should be very easy. I have a pandas data frame with, let's say, 100 columns and I want a subset containing columns 0:30 and 77:99.
What I've done so far is:
df_1 = df.iloc[:,0:30]
df_2 = df.iloc[:,77:99]
df2 = pd.concat([df_1, df_2], axis=1)  # join_axes was removed in pandas 1.0; both slices already share df's index
Is there an easier way?
Use numpy.r_ to concatenate the index ranges:
df2 = df.iloc[:, np.r_[0:30, 77:99]]
Sample:
df = pd.DataFrame(np.random.randint(10, size=(5,15)))
print (df)
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
0 6 2 9 5 4 6 9 9 7 9 6 6 1 0 6
1 5 6 7 0 7 8 7 9 4 8 1 2 0 8 5
2 5 6 1 6 7 6 1 5 5 4 6 3 2 3 0
3 4 3 1 3 3 8 3 6 7 1 8 6 2 1 8
4 3 8 2 3 7 3 6 4 4 6 2 6 9 4 9
df2 = df.iloc[:, np.r_[0:3, 7:9]]
print (df2)
0 1 2 7 8
0 6 2 9 9 7
1 5 6 7 9 4
2 5 6 1 5 5
3 4 3 1 6 7
4 3 8 2 4 4
df_1 = df.iloc[:,0:3]
df_2 = df.iloc[:,7:9]
df2 = pd.concat([df_1, df_2], axis=1)  # join_axes was removed in pandas 1.0; both slices already share df's index
print (df2)
0 1 2 7 8
0 6 2 9 9 7
1 5 6 7 9 4
2 5 6 1 5 5
3 4 3 1 6 7
4 3 8 2 4 4
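Applied to the 100-column case from the question, a minimal sketch (the frame contents are made up for illustration). Slice ends are exclusive, so 77:99 stops at column 98; use 77:100 if column 99 should be included:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with 100 columns labelled 0..99.
df = pd.DataFrame(np.arange(500).reshape(5, 100))

# np.r_ expands both slices and joins them into one integer array.
idx = np.r_[0:30, 77:99]
subset = df.iloc[:, idx]

print(subset.shape)          # 30 + 22 columns
print(subset.columns[[0, 29, 30, -1]].tolist())
```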

Python - Filter DataFrame by Sub DataFrame

I have a DataFrame like this:
A B C
1 1 2 3
2 4 5 6
3 7 8 9
And I want to filter it by a second DataFrame that holds a subset of its columns:
A C
1 4 6
2 7 9
Finally, I can get this output:
A B C
2 4 5 6
3 7 8 9
How can I do this?
In [92]: d1.merge(d2)
Out[92]:
A B C
0 4 5 6
1 7 8 9
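With no arguments, merge performs an inner join on the columns the two frames share (A and C here), and the result gets a fresh 0..n-1 index, as the output above shows. If the original row labels matter, one possible sketch is to carry the index through the merge as a column; d1 and d2 are rebuilt from the question, and the reset_index/set_index round trip is an assumption added here, not part of the answer:

```python
import pandas as pd

# Frames from the question.
d1 = pd.DataFrame({"A": [1, 4, 7], "B": [2, 5, 8], "C": [3, 6, 9]}, index=[1, 2, 3])
d2 = pd.DataFrame({"A": [4, 7], "C": [6, 9]}, index=[1, 2])

# Inner join on the shared columns A and C; the index is reset.
filtered = d1.merge(d2)

# Keep d1's row labels: move the index into a column named "index",
# merge on A and C, then restore it as the index and drop its name.
kept = d1.reset_index().merge(d2).set_index("index").rename_axis(None)

print(filtered)
print(kept)
```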

Shorter notation for columns in pandas DataFrame

Take a random DataFrame:
df = pd.DataFrame(np.random.rand(3, 2), columns=['a', 'b'])
Pandas allows defining new columns in two ways:
df['c'] = df.a + df.b
df['c'] = df['a'] + df['b']
As the DataFrame name gets longer, this notation becomes less readable.
And then there's the query function:
df.query('a > b')
It returns the slices of the df that match the condition.
Is there a way to run something like DataFrame.query() but for operations on the frame?
Function DataFrame.eval() does exactly this:
df.eval('c = a + b')
And warning-free assignment:
df.eval('c = a + b', inplace=True)
More generally, pandas.eval():
The following arithmetic operations are supported: +, -, *, /, **, %,
// (python engine only) along with the following boolean operations: |
(or), & (and), and ~ (not). Additionally, the 'pandas' parser allows
the use of and, or, and not with the same semantics as the
corresponding bitwise operators.
Pandas docs say that eval supports only Python expression statements (e.g., a == b), but pandas silently supports abs(a - b) and maybe other statements. The rest throw an error. For example:
df.eval('del(a)')
raises NotImplementedError: 'Delete' nodes are not implemented.
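A short sketch of the behaviour described above, using a small fixed frame instead of random data (the values are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [0.5, 2.5, 1.0]})

# Without inplace=True, eval returns a new frame and leaves df unchanged.
with_c = df.eval("c = a + b")

# With inplace=True, the new column is written into df itself.
df.eval("d = abs(a - b)", inplace=True)

print(with_c)
print(df)
```

Column names are referenced bare inside the expression string, so the length of the DataFrame's variable name no longer matters.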
Here's a way using assign and add:
df.assign(c=df.a.add(df.b))
a b c
0 0.086468 0.978044 1.064512
1 0.270727 0.789762 1.060489
2 0.150097 0.662430 0.812527
Note: assign returns a new DataFrame and leaves the original unchanged, so you need to assign the result to a new variable or back to df.
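assign also accepts a callable, which receives the frame as its argument, so a long variable name only has to be written once. A minimal sketch with a made-up frame and name:

```python
import pandas as pd

# Hypothetical frame with a deliberately long variable name.
my_long_dataframe_name = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# The lambda's parameter d is the frame being assigned to.
result = my_long_dataframe_name.assign(c=lambda d: d["a"] + d["b"])

print(result)
```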
Consider the dataframe named my_obnoxiously_long_dataframe_name
np.random.seed([3,1415])
my_obnoxiously_long_dataframe_name = pd.DataFrame(
    np.random.randint(10, size=(10, 10)),
    columns=list('ABCDEFGHIJ')
)
my_obnoxiously_long_dataframe_name
A B C D E F G H I J
0 0 2 7 3 8 7 0 6 8 6
1 0 2 0 4 9 7 3 2 4 3
2 3 6 7 7 4 5 3 7 5 9
3 8 7 6 4 7 6 2 6 6 5
4 2 8 7 5 8 4 7 6 1 5
5 2 8 2 4 7 6 9 4 2 4
6 6 3 8 3 9 8 0 4 3 0
7 4 1 5 8 6 0 8 7 4 6
8 3 5 8 5 1 5 1 4 3 9
9 5 5 7 0 3 2 5 8 8 9
If you want cleaner code, create a temporary variable with a shorter name. It refers to the same object, so the new columns appear in the original frame:
d_ = my_obnoxiously_long_dataframe_name
d_['K'] = abs(d_.J - d_.D)
d_['L'] = d_.A + d_.B
del d_
my_obnoxiously_long_dataframe_name
A B C D E F G H I J K L
0 0 2 7 3 8 7 0 6 8 6 3 2
1 0 2 0 4 9 7 3 2 4 3 1 2
2 3 6 7 7 4 5 3 7 5 9 2 9
3 8 7 6 4 7 6 2 6 6 5 1 15
4 2 8 7 5 8 4 7 6 1 5 0 10
5 2 8 2 4 7 6 9 4 2 4 0 10
6 6 3 8 3 9 8 0 4 3 0 3 9
7 4 1 5 8 6 0 8 7 4 6 2 5
8 3 5 8 5 1 5 1 4 3 9 4 8
9 5 5 7 0 3 2 5 8 8 9 9 10

How do I put a series (such as) the result of a pandas groupby.apply(f) into a new column of the dataframe?

I have a dataframe that I want to calculate statistics on (value_count, mode, mean, etc.) and then put the result in a new column. My current solution is O(n**2) or so, and I'm sure there is a faster, obvious method that I'm overlooking.
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(10, size=(100, 10)),
                  columns=list('abcdefghij'))
df['result'] = 0
groups = df.groupby([df.i, df.j])
for g in groups:
    # g is a (key, sub-frame) pair, where key is the (i, j) tuple
    icol_eq = df.i == g[0][0]
    jcol_eq = df.j == g[0][1]
    i_and_j = icol_eq & jcol_eq
    # .loc avoids chained assignment, which may not write back to df
    df.loc[i_and_j, 'result'] = len(g[1])
The above works, but is extremely slow for large dataframes.
I tried
df['result'] = df.groupby([df.i, df.j]).apply(len)
but it doesn't seem to work.
Nor does
def f(g):
    g['result'] = len(g)
    return g
df.groupby([df.i, df.j]).apply(f)
Nor can I merge the resulting series of a df.groupby.apply(lambda x: len(x))
You want to use transform. Select a single column before calling it, so that the result is a Series rather than a frame with one column per input column:
df['result'] = df.groupby(['i', 'j'])['a'].transform('size')
transform returns a Series whose index is aligned with your original df, so you can add it directly as a column. Each row of result then holds the number of rows in its (i, j) group.
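A small deterministic frame makes the index alignment easy to check. This is a sketch with made-up data; it selects one column before calling transform so that the result is a single Series of group sizes:

```python
import pandas as pd

# Hypothetical data: four (i, j) groups of sizes 2, 2, 1 and 1.
df = pd.DataFrame(
    {
        "i": [0, 0, 1, 1, 1, 2],
        "j": [5, 5, 6, 6, 7, 5],
        "a": [10, 11, 12, 13, 14, 15],
    }
)

# "size" counts the rows in each group; transform broadcasts that count
# back onto every row of the group, aligned with df's index.
df["result"] = df.groupby(["i", "j"])["a"].transform("size")

print(df)
```

Because the result has one value per original row, no merge or loop is needed to attach it.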
