This is my DataFrame, which should be repeated 5 times:
>>> x = pd.DataFrame({'a':1,'b':2}, index = range(1))
>>> x
a b
0 1 2
I want to have the result like this:
>>> x.append(x).append(x).append(x)
a b
0 1 2
0 1 2
0 1 2
0 1 2
But there must be a smarter way than appending 4 times; the DataFrame I'm actually working on needs to be repeated 50 times.
I haven't found anything practical, including np.repeat, which simply doesn't work on a DataFrame.
Could anyone help?
You can use the concat function:
In [13]: pd.concat([x]*5)
Out[13]:
a b
0 1 2
0 1 2
0 1 2
0 1 2
0 1 2
If you only want to repeat the values and not the index, you can do:
In [14]: pd.concat([x]*5, ignore_index=True)
Out[14]:
a b
0 1 2
1 1 2
2 1 2
3 1 2
4 1 2
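For the 50 repeats the question actually needs, the same call scales directly:
In [15]: pd.concat([x] * 50, ignore_index=True)  # 50 identical rows with a fresh RangeIndex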
I think it's cleaner/faster to use iloc nowadays:
In [11]: np.full(3, 0)
Out[11]: array([0, 0, 0])
In [12]: x.iloc[np.full(3, 0)]
Out[12]:
a b
0 1 2
0 1 2
0 1 2
More generally, you can use tile or repeat with arange:
In [21]: df = pd.DataFrame([[1, 2], [3, 4]], columns=["A", "B"])
In [22]: df
Out[22]:
A B
0 1 2
1 3 4
In [23]: np.tile(np.arange(len(df)), 3)
Out[23]: array([0, 1, 0, 1, 0, 1])
In [24]: np.repeat(np.arange(len(df)), 3)
Out[24]: array([0, 0, 0, 1, 1, 1])
In [25]: df.iloc[np.tile(np.arange(len(df)), 3)]
Out[25]:
A B
0 1 2
1 3 4
0 1 2
1 3 4
0 1 2
1 3 4
In [26]: df.iloc[np.repeat(np.arange(len(df)), 3)]
Out[26]:
A B
0 1 2
0 1 2
0 1 2
1 3 4
1 3 4
1 3 4
Note: This will work with non-integer indexed DataFrames (and Series).
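For example, a quick sketch of the same trick on a string-indexed frame (my own illustration):
In [27]: s = pd.DataFrame({'a': [1, 2]}, index=['x', 'y'])
In [28]: s.iloc[np.tile(np.arange(len(s)), 2)]
Out[28]:
   a
x  1
y  2
x  1
y  2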
Try using numpy.repeat:
>>> import numpy as np
>>> df = pd.DataFrame(np.repeat(x.to_numpy(), 5, axis=0), columns=x.columns)
>>> df
a b
0 1 2
1 1 2
2 1 2
3 1 2
4 1 2
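One caveat worth adding (my note, not part of the original answer): to_numpy() collapses the frame to a single common dtype, so with mixed-dtype columns everything comes back as object. You can restore the original dtypes afterwards:
>>> df = df.astype(x.dtypes.to_dict())  # reapply the per-column dtypes lost by to_numpy()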
I would generally not repeat and/or append unless your problem really makes it necessary: it is highly inefficient and typically comes from not understanding the proper way to attack a problem.
I don't know your exact use case, but if you have your values stored as an array, then
import numpy as np
values = np.array([1, 2])
df2 = pd.DataFrame(index=np.arange(0, 50), columns=['a', 'b'])
df2[['a', 'b']] = values  # broadcasts the row across all 50 rows
will do the job. Perhaps you want to better explain what you're trying to achieve?
Append should work too:
In [589]: x = pd.DataFrame({'a':1,'b':2},index = range(1))
In [590]: x
Out[590]:
a b
0 1 2
In [591]: x.append([x]*5, ignore_index=True) #Ignores the index as per your need
Out[591]:
a b
0 1 2
1 1 2
2 1 2
3 1 2
4 1 2
5 1 2
In [592]: x.append([x]*5)
Out[592]:
a b
0 1 2
0 1 2
0 1 2
0 1 2
0 1 2
0 1 2
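Two notes on this: x.append([x]*5) returns six rows (the original plus five copies), so use [x]*4 if you want exactly five. Also, DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so on current versions you need the concat equivalent:
In [593]: pd.concat([x] * 6, ignore_index=True)  # same result as x.append([x]*5, ignore_index=True)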
Without numpy, we could also use Index.repeat and loc (or reindex):
x.loc[x.index.repeat(5)].reset_index(drop=True)
or
x.reindex(x.index.repeat(5)).reset_index(drop=True)
Output:
a b
0 1 2
1 1 2
2 1 2
3 1 2
4 1 2
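Note that Index.repeat repeats each row consecutively (np.repeat ordering, not np.tile); a quick sketch on a two-row frame:
df2 = pd.DataFrame([[1, 2], [3, 4]], columns=['A', 'B'])
df2.loc[df2.index.repeat(2)].reset_index(drop=True)
   A  B
0  1  2
1  1  2
2  3  4
3  3  4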
Applying a lambda per column is a universal approach in my opinion (with axis=0 the lambda receives each column, and Series.repeat duplicates every element):
df = pd.DataFrame([[1, 2], [3, 4]], columns=["A", "B"])
df.apply(lambda col: col.repeat(2), axis=0)  # add .reset_index(drop=True) for a fresh RangeIndex
Out[1]:
A B
0 1 2
0 1 2
1 3 4
1 3 4
Let's consider a very simple data frame:
import pandas as pd
df = pd.DataFrame([[0, 1, 2, 3, 2, 5], [3, 4, 5, 0, 2, 7]]).transpose()
df.columns = ["A", "B"]
A B
0 0 3
1 1 4
2 2 5
3 3 0
4 2 2
5 5 7
I want to do two things with this dataframe:
All numbers below 3 have to be changed to 0
All numbers equal to 0 have to be changed to 10
The problem is that when we apply:
df[df < 3] = 0
df[df == 0] = 10
we are also going to change numbers which were initially not 0, obtaining:
A B
0 10 3
1 10 4
2 10 5
3 3 10
4 10 10
5 5 7
which is not the desired output; it should look like this:
A B
0 10 3
1 0 4
2 0 5
3 3 10
4 0 0
5 5 7
My question is: is there any way to change both of those things at the same time? I.e., I want to change numbers smaller than 3 to 0 and numbers equal to 0 to 10, independently of each other.
Note! This example was created just to outline the problem. An obvious solution is to change the order of replacement: first change 0 to 10, and then numbers smaller than 3 to 0. But I'm struggling with a much more complex problem, and I want to know whether it is possible to change both of these at once.
Use applymap() to apply a function to each element in the DataFrame:
df.applymap(lambda x: 10 if x == 0 else (0 if x < 3 else x))
results in
A B
0 10 3
1 0 4
2 0 5
3 3 10
4 0 0
5 5 7
I would do it the following way:
import pandas as pd
df = pd.DataFrame([[0, 1, 2, 3, 2, 5], [3, 4, 5, 0, 2, 7]]).transpose()
df.columns = ["A", "B"]
df_orig = df.copy()
df[df_orig < 3] = 0
df[df_orig == 0] = 10
print(df)
output
A B
0 10 3
1 0 4
2 0 5
3 3 10
4 0 0
5 5 7
Explanation: I use the .copy method to get a copy of the DataFrame, stored in df_orig; that DataFrame is never altered during the run, so it can be used to select where to put 0 and 10.
You can create both masks first, then change the values:
m1 = df < 3
m2 = df == 0
df[m1] = 0
df[m2] = 10
print(df)
A B
0 10 3
1 0 4
2 0 5
3 3 10
4 0 0
5 5 7
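For completeness, another way to apply both rules in one pass is numpy.select, which evaluates every condition against the original values, with the first matching condition winning. A minimal sketch:
import numpy as np
import pandas as pd

df = pd.DataFrame([[0, 1, 2, 3, 2, 5], [3, 4, 5, 0, 2, 7]]).transpose()
df.columns = ["A", "B"]
df[:] = np.select(
    condlist=[df == 0, df < 3],   # evaluated in order against the original values
    choicelist=[10, 0],           # first matching condition wins
    default=df.to_numpy(),        # everything else keeps its value
)
print(df)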
So I have a dataframe like this
df = pd.DataFrame({
'A': [1,1,2,2,3,3,3],
'B': [1,3,1,3,1,2,1],
'C': [1,3,5,3,7,7,1]})
A B C
0 1 1 1
1 1 3 3
2 2 1 5
3 2 3 3
4 3 1 7
5 3 2 7
6 3 1 1
I want to create binned counts of columns B and C, grouped by column A.
For example, B_bin1 where B < 3 and B_bin2 for the rest (>= 3), and C_bin1 for C < 5 and C_bin2 for the rest.
From that example the output I want is like this
A B_bin1 B_bin2 C_bin1 C_bin2
0 1 1 1 2 0
1 2 1 1 1 1
2 3 3 0 1 2
I found the similar question Pandas groupby with bin counts,
which works for one binned column:
bins = [0,2,10]
temp_df=df.groupby(['A', pd.cut(df['B'], bins)])
temp_df.size().unstack()
B (0, 2] (2, 10]
A
1 1 1
2 1 1
3 3 0
but when I try using more than one binned column, it does not give what I want (my real data has many binning groups):
bins = [0,2,10]
bins2 = [0,4,10]
temp_df=df.groupby(['A', pd.cut(df['B'], bins), pd.cut(df['C'], bins2)])
temp_df.size().unstack()
C (0, 4] (4, 10]
A B
1 (0, 2] 1 0
(2, 10] 1 0
2 (0, 2] 0 1
(2, 10] 1 0
3 (0, 2] 1 2
(2, 10] 0 0
My workaround is to create small temporary DataFrames, bin them one group at a time, and merge them at the end.
I am also still experimenting with aggregation (probably with pd.NamedAgg too), similar to this, but I wonder whether it can work:
df.groupby('A').agg(
b_count = ('B', 'count'),
b_sum = ('B', 'sum'),
c_count = ('C', 'count'),
c_sum = ('C', 'sum')
)
Does anyone have another idea for this?
Because each bin needs to be processed separately, instead of groupby + size + unstack you can use crosstab, then join the DataFrames with concat:
bins = [0,2,10]
bins2 = [0,4,10]
temp_df1=pd.crosstab(df['A'], pd.cut(df['B'], bins, labels=False)).add_prefix('B_')
temp_df2=pd.crosstab(df['A'], pd.cut(df['C'], bins2, labels=False)).add_prefix('C_')
out = pd.concat([temp_df1, temp_df2], axis=1).reset_index()
print(out)
A B_0 B_1 C_0 C_1
0 1 1 1 2 0
1 2 1 1 1 1
2 3 3 0 1 2
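If you want the B_bin1/B_bin2 names from the question instead of B_0/B_1, pass explicit labels to pd.cut rather than labels=False (a small variation, starting again from the original df):
temp_df1 = pd.crosstab(df['A'], pd.cut(df['B'], bins, labels=['bin1', 'bin2'])).add_prefix('B_')
temp_df2 = pd.crosstab(df['A'], pd.cut(df['C'], bins2, labels=['bin1', 'bin2'])).add_prefix('C_')
out = pd.concat([temp_df1, temp_df2], axis=1).reset_index()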
One option is get_dummies before the aggregation; this works since you have a limited number of bins (I'm skipping pd.cut and using a comparison instead):
temp = (df
.assign(B = df.B.lt(3), C = df.C.lt(5))
.replace({True:1, False:2})
)
(pd
.get_dummies(temp, columns = ['B','C'], prefix_sep='_bin')
.groupby('A')
.sum()
)
B_bin1 B_bin2 C_bin1 C_bin2
A
1 1 1 2 0
2 1 1 1 1
3 3 0 1 2
You could use the bins, along with pd.factorize and get_dummies:
temp = df.copy()
temp['B'] = pd.cut(df.B, bins)
temp['B'] = pd.factorize(temp.B)[0] + 1
temp['C'] = pd.cut(df.C, bins2)
temp['C'] = pd.factorize(temp.C)[0] + 1
(pd
.get_dummies(temp, columns = ['B','C'], prefix_sep='_bin')
.groupby('A')
.sum()
)
B_bin1 B_bin2 C_bin1 C_bin2
A
1 1 1 2 0
2 1 1 1 1
3 3 0 1 2
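One caveat (my observation): pd.factorize numbers categories by order of first appearance, so the _bin1/_bin2 suffixes only line up with the interval order if the lower interval happens to appear first in the data. Taking the categorical codes from pd.cut directly avoids that:
temp['B'] = pd.cut(df.B, bins).cat.codes + 1    # codes follow interval order, not appearance order
temp['C'] = pd.cut(df.C, bins2).cat.codes + 1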
I'm new to Python and could not find the answer I'm looking for anywhere.
I have a DataFrame that has the following structure:
df = pd.DataFrame(index=list('abc'), data={'A1': range(3), 'A2': range(3),'B1': range(3), 'B2': range(3), 'C1': range(3), 'C2': range(3)})
df
Out[1]:
A1 A2 B1 B2 C1 C2
a 0 0 0 0 0 0
b 1 1 1 1 1 1
c 2 2 2 2 2 2
Here the numbers are periods and the letters are variables. I'd like to transform the columns so that the periods and variables are split into a MultiIndex. The desired output would look like this:
A B C
1 2 1 2 1 2
a 0 0 0 0 0 0
b 1 1 1 1 1 1
c 2 2 2 2 2 2
I've tried the following:
periods = list(range(1, 3))
df.columns = df.columns.str.replace(r'\d+', '', regex=True)
df.columns = pd.MultiIndex.from_product([df.columns, periods])
That seems to multiply the columns and raises a ValueError: Length mismatch;
in my DataFrame I have 72 periods and 12 variables.
Thanks in advance for your help!
Edit: I realized that I haven't been precise enough. My column names are actually like Impressions1, Impressions2...Impressions72 and hhi1, hhi2...hhi72. So df.columns.str[0], df.columns.str[1] does not work for me, as the column names have different lengths. I think the solution might involve a regex, but I can't figure out how to do it. Any ideas?
Use pd.MultiIndex.from_tuples:
df.columns = pd.MultiIndex.from_tuples(list(zip(df.columns.str[0],df.columns.str[1])))
print(df)
A B C
1 2 1 2 1 2
a 0 0 0 0 0 0
b 1 1 1 1 1 1
c 2 2 2 2 2 2
Alternative:
pd.MultiIndex.from_tuples([tuple(name) for name in df.columns])
or
pd.MultiIndex.from_tuples(map(tuple, df.columns))
You can also use .str.extract and from_frame:
df.columns = pd.MultiIndex.from_frame(df.columns.str.extract('(.)(.)'), names=[None, None])
Output:
A B C
1 2 1 2 1 2
a 0 0 0 0 0 0
b 1 1 1 1 1 1
c 2 2 2 2 2 2
Here is what actually solved my issue:
df.columns = pd.MultiIndex.from_frame(df.columns.str.extract(r'([a-zA-Z]+)([0-9]+)'), names=[None, None])
Thanks @Scott Boston for the inspiration for this solution!
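For reference, a quick check of why the regex handles the real column names from the edit (using a hypothetical subset of them):
cols = pd.Index(['Impressions1', 'Impressions72', 'hhi1', 'hhi72'])
cols.str.extract(r'([a-zA-Z]+)([0-9]+)')
             0   1
0  Impressions   1
1  Impressions  72
2          hhi   1
3          hhi  72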
I was trying to find a type function in Pandas that would help me see how many strings and ints there are per column.
Example:
A B C D
1 H 3 20
3 5 2 1
2 Y M
Should give me something like
A B C D
int 3 1 2 2
str 0 2 0 1
NA 0 0 1 0
Is there any function that does this? I was thinking of making up something like if (A == int) then ... with a counter, but I guess that would go element by element and be super inefficient.
Setup
df = pd.DataFrame(dict(
A=[1, 3, 2],
B=['H', 5, 'Y'],
C=[3, 2, None],
D=[20, 1, 'M']
), dtype=object)
df
A B C D
0 1 H 3 20
1 3 5 2 1
2 2 Y None M
Solution
Use pd.DataFrame.applymap with type and pd.value_counts
df.applymap(type).apply(pd.value_counts).fillna(0, downcast='infer')
A B C D
<class 'int'> 3 1 2 2
<class 'str'> 0 2 0 1
<class 'NoneType'> 0 0 1 0
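Note (my addition): applymap was deprecated in pandas 2.1 in favour of DataFrame.map, so on recent versions the same idea reads:
df.map(type).apply(lambda s: s.value_counts()).fillna(0).astype(int)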
I want to index a Pandas dataframe using a boolean mask, then set a value in a subset of the filtered dataframe based on an integer index, and have this value reflected in the dataframe. That is, I would be happy if this worked on a view of the dataframe.
Example:
In [293]:
df = pd.DataFrame({'a': [0, 1, 2, 3, 4, 5, 6, 7],
'b': [5, 5, 2, 2, 5, 5, 2, 2],
'c': [0, 0, 0, 0, 0, 0, 0, 0]})
mask = (df['a'] < 7) & (df['b'] == 2)
df.loc[mask, 'c']
Out[293]:
2 0
3 0
6 0
Name: c, dtype: int64
Now I would like to set the values of the first two elements returned in the filtered dataframe. Chaining an iloc onto the loc call above works to index:
In [294]:
df.loc[mask, 'c'].iloc[0: 2]
Out[294]:
2 0
3 0
Name: c, dtype: int64
But not to assign:
In [295]:
df.loc[mask, 'c'].iloc[0: 2] = 1
print(df)
a b c
0 0 5 0
1 1 5 0
2 2 2 0
3 3 2 0
4 4 5 0
5 5 5 0
6 6 2 0
7 7 2 0
Making the assigned value the same length as the slice (i.e. = [1, 1]) also doesn't work. Is there a way to assign these values?
This does work, but it is a little ugly: basically we take the index generated by the mask-plus-iloc chain and make an additional call to loc:
In [57]:
df.loc[df.loc[mask,'c'].iloc[0:2].index, 'c'] = 1
df
Out[57]:
a b c
0 0 5 0
1 1 5 0
2 2 2 1
3 3 2 1
4 4 5 0
5 5 5 0
6 6 2 0
7 7 2 0
So breaking the above down:
In [60]:
# take the index from the mask and iloc
df.loc[mask, 'c'].iloc[0: 2]
Out[60]:
2 0
3 0
Name: c, dtype: int64
In [61]:
# call loc with this index; we can now select column 'c' and set the value
df.loc[df.loc[mask,'c'].iloc[0:2].index]
Out[61]:
a b c
2 2 2 0
3 3 2 0
How about:
ix = df.index[mask][:2]
df.loc[ix, 'c'] = 1
Same idea as EdChum's, but more elegant, as suggested in the comments.
EDIT: Have to be a little bit careful with this one as it may give unwanted results with a non-unique index, since there could be multiple rows indexed by either of the label in ix above. If the index is non-unique and you only want the first 2 (or n) rows that satisfy the boolean key, it would be safer to use .iloc with integer indexing with something like
ix = np.where(mask)[0][:2]
df.iloc[ix, df.columns.get_loc('c')] = 1  # iloc needs a positional column index, not the label 'c'
I don't know if this is any more elegant, but it's a little different:
mask = mask & (mask.cumsum() < 3)
df.loc[mask, 'c'] = 1
a b c
0 0 5 0
1 1 5 0
2 2 2 1
3 3 2 1
4 4 5 0
5 5 5 0
6 6 2 0
7 7 2 0
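The trick here is that mask.cumsum() numbers the True positions 1, 2, 3, ..., so mask & (mask.cumsum() < 3) keeps only the first two matches. To take the first n matching rows in general (a small generalization):
n = 2
df.loc[mask & (mask.cumsum() <= n), 'c'] = 1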