This is my DataFrame, which should be repeated 5 times:
>>> x = pd.DataFrame({'a':1,'b':2}, index = range(1))
>>> x
a b
0 1 2
I want to have the result like this:
>>> x.append(x).append(x).append(x)
a b
0 1 2
0 1 2
0 1 2
0 1 2
But there must be a smarter way than appending 4 times; the DataFrame I'm actually working on needs to be repeated 50 times.
I haven't found anything practical, including np.repeat, which simply doesn't work on a DataFrame.
Could anyone help?
You can use the concat function:
In [13]: pd.concat([x]*5)
Out[13]:
a b
0 1 2
0 1 2
0 1 2
0 1 2
0 1 2
If you only want to repeat the values and not the index, you can do:
In [14]: pd.concat([x]*5, ignore_index=True)
Out[14]:
a b
0 1 2
1 1 2
2 1 2
3 1 2
4 1 2
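For the 50 repeats the question actually needs, the same call scales directly:
In [15]: pd.concat([x] * 50, ignore_index=True)  # 50 identical rows with a fresh RangeIndex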
I think it's cleaner/faster to use iloc nowadays:
In [11]: np.full(3, 0)
Out[11]: array([0, 0, 0])
In [12]: x.iloc[np.full(3, 0)]
Out[12]:
a b
0 1 2
0 1 2
0 1 2
More generally, you can use tile or repeat with arange:
In [21]: df = pd.DataFrame([[1, 2], [3, 4]], columns=["A", "B"])
In [22]: df
Out[22]:
A B
0 1 2
1 3 4
In [23]: np.tile(np.arange(len(df)), 3)
Out[23]: array([0, 1, 0, 1, 0, 1])
In [24]: np.repeat(np.arange(len(df)), 3)
Out[24]: array([0, 0, 0, 1, 1, 1])
In [25]: df.iloc[np.tile(np.arange(len(df)), 3)]
Out[25]:
A B
0 1 2
1 3 4
0 1 2
1 3 4
0 1 2
1 3 4
In [26]: df.iloc[np.repeat(np.arange(len(df)), 3)]
Out[26]:
A B
0 1 2
0 1 2
0 1 2
1 3 4
1 3 4
1 3 4
Note: This will work with non-integer indexed DataFrames (and Series).
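For example, a quick sketch of the same trick on a string-indexed frame (my own illustration):
In [27]: s = pd.DataFrame({'a': [1, 2]}, index=['x', 'y'])
In [28]: s.iloc[np.tile(np.arange(len(s)), 2)]
Out[28]:
   a
x  1
y  2
x  1
y  2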
Try using numpy.repeat:
>>> import numpy as np
>>> df = pd.DataFrame(np.repeat(x.to_numpy(), 5, axis=0), columns=x.columns)
>>> df
a b
0 1 2
1 1 2
2 1 2
3 1 2
4 1 2
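One caveat worth adding (my note, not part of the original answer): to_numpy() collapses the frame to a single common dtype, so with mixed-dtype columns everything comes back as object. You can restore the original dtypes afterwards:
>>> df = df.astype(x.dtypes.to_dict())  # reapply the per-column dtypes lost by to_numpy()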
I would generally not repeat and/or append unless your problem really makes it necessary: it is highly inefficient and typically comes from not understanding the proper way to attack a problem.
I don't know your exact use case, but if you have your values stored as an array, then
import numpy as np
values = np.array([1, 2])
df2 = pd.DataFrame(index=np.arange(0, 50), columns=['a', 'b'])
df2[['a', 'b']] = values  # broadcasts the row across all 50 rows
will do the job. Perhaps you want to better explain what you're trying to achieve?
Append should work too:
In [589]: x = pd.DataFrame({'a':1,'b':2},index = range(1))
In [590]: x
Out[590]:
a b
0 1 2
In [591]: x.append([x]*5, ignore_index=True) #Ignores the index as per your need
Out[591]:
a b
0 1 2
1 1 2
2 1 2
3 1 2
4 1 2
5 1 2
In [592]: x.append([x]*5)
Out[592]:
a b
0 1 2
0 1 2
0 1 2
0 1 2
0 1 2
0 1 2
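Two notes on this: x.append([x]*5) returns six rows (the original plus five copies), so use [x]*4 if you want exactly five. Also, DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so on current versions you need the concat equivalent:
In [593]: pd.concat([x] * 6, ignore_index=True)  # same result as x.append([x]*5, ignore_index=True)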
Without numpy, we could also use Index.repeat and loc (or reindex):
x.loc[x.index.repeat(5)].reset_index(drop=True)
or
x.reindex(x.index.repeat(5)).reset_index(drop=True)
Output:
a b
0 1 2
1 1 2
2 1 2
3 1 2
4 1 2
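Note that Index.repeat repeats each row consecutively (np.repeat ordering, not np.tile); a quick sketch on a two-row frame:
df2 = pd.DataFrame([[1, 2], [3, 4]], columns=['A', 'B'])
df2.loc[df2.index.repeat(2)].reset_index(drop=True)
   A  B
0  1  2
1  1  2
2  3  4
3  3  4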
Applying a lambda per column is a universal approach in my opinion (with axis=0 the lambda receives each column, and Series.repeat duplicates every element):
df = pd.DataFrame([[1, 2], [3, 4]], columns=["A", "B"])
df.apply(lambda col: col.repeat(2), axis=0)  # add .reset_index(drop=True) for a fresh RangeIndex
Out[1]:
A B
0 1 2
0 1 2
1 3 4
1 3 4
Let's consider a very simple data frame:
import pandas as pd
df = pd.DataFrame([[0, 1, 2, 3, 2, 5], [3, 4, 5, 0, 2, 7]]).transpose()
df.columns = ["A", "B"]
A B
0 0 3
1 1 4
2 2 5
3 3 0
4 2 2
5 5 7
I want to do two things with this dataframe:
All numbers below 3 have to be changed to 0
All numbers equal to 0 have to be changed to 10
The problem is that when we apply:
df[df < 3] = 0
df[df == 0] = 10
we are also going to change numbers which were initially not 0, obtaining:
A B
0 10 3
1 10 4
2 10 5
3 3 10
4 10 10
5 5 7
which is not the desired output; it should look like this:
A B
0 10 3
1 0 4
2 0 5
3 3 10
4 0 0
5 5 7
My question is: is there any way to change both of those things at the same time? I.e., I want to change numbers smaller than 3 to 0 and numbers equal to 0 to 10, independently of each other.
Note! This example was created just to outline the problem. An obvious solution is to change the order of replacement: first change 0 to 10, and then numbers smaller than 3 to 0. But I'm struggling with a much more complex problem, and I want to know whether it is possible to change both of these at once.
Use applymap() to apply a function to each element in the DataFrame:
df.applymap(lambda x: 10 if x == 0 else (0 if x < 3 else x))
results in
A B
0 10 3
1 0 4
2 0 5
3 3 10
4 0 0
5 5 7
I would do it the following way:
import pandas as pd
df = pd.DataFrame([[0, 1, 2, 3, 2, 5], [3, 4, 5, 0, 2, 7]]).transpose()
df.columns = ["A", "B"]
df_orig = df.copy()
df[df_orig < 3] = 0
df[df_orig == 0] = 10
print(df)
output
A B
0 10 3
1 0 4
2 0 5
3 3 10
4 0 0
5 5 7
Explanation: I use the .copy method to get a copy of the DataFrame, stored in df_orig; that DataFrame is never altered during the run, so it can be used to select where to put 0 and 10.
You can create both masks first, then change the values:
m1 = df < 3
m2 = df == 0
df[m1] = 0
df[m2] = 10
print(df)
A B
0 10 3
1 0 4
2 0 5
3 3 10
4 0 0
5 5 7
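For completeness, another way to apply both rules in one pass is numpy.select, which evaluates every condition against the original values, with the first matching condition winning. A minimal sketch:
import numpy as np
import pandas as pd

df = pd.DataFrame([[0, 1, 2, 3, 2, 5], [3, 4, 5, 0, 2, 7]]).transpose()
df.columns = ["A", "B"]
df[:] = np.select(
    condlist=[df == 0, df < 3],   # evaluated in order against the original values
    choicelist=[10, 0],           # first matching condition wins
    default=df.to_numpy(),        # everything else keeps its value
)
print(df)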
So I have a dataframe like this
df = pd.DataFrame({
'A': [1,1,2,2,3,3,3],
'B': [1,3,1,3,1,2,1],
'C': [1,3,5,3,7,7,1]})
A B C
0 1 1 1
1 1 3 3
2 2 1 5
3 2 3 3
4 3 1 7
5 3 2 7
6 3 1 1
I want to create binned counts of columns B and C, grouped by column A.
For example, B_bin1 where B < 3 and B_bin2 for the rest (>= 3), and C_bin1 for C < 5 and C_bin2 for the rest.
From that example the output I want is like this
A B_bin1 B_bin2 C_bin1 C_bin2
0 1 1 1 2 0
1 2 1 1 1 1
2 3 3 0 1 2
I found the similar question Pandas groupby with bin counts,
which works for one binned column:
bins = [0,2,10]
temp_df=df.groupby(['A', pd.cut(df['B'], bins)])
temp_df.size().unstack()
B (0, 2] (2, 10]
A
1 1 1
2 1 1
3 3 0
but when I try using more than one binned column, it does not give what I want (my real data has many binning groups):
bins = [0,2,10]
bins2 = [0,4,10]
temp_df=df.groupby(['A', pd.cut(df['B'], bins), pd.cut(df['C'], bins2)])
temp_df.size().unstack()
C (0, 4] (4, 10]
A B
1 (0, 2] 1 0
(2, 10] 1 0
2 (0, 2] 0 1
(2, 10] 1 0
3 (0, 2] 1 2
(2, 10] 0 0
My workaround is to create small temporary DataFrames, bin them one group at a time, and merge them at the end.
I am also still experimenting with aggregation (probably with pd.NamedAgg too), similar to this, but I wonder whether it can work:
df.groupby('A').agg(
b_count = ('B', 'count'),
b_sum = ('B', 'sum'),
c_count = ('C', 'count'),
c_sum = ('C', 'sum')
)
Does anyone have another idea for this?
Because each bin needs to be processed separately, instead of groupby + size + unstack you can use crosstab, then join the DataFrames with concat:
bins = [0,2,10]
bins2 = [0,4,10]
temp_df1=pd.crosstab(df['A'], pd.cut(df['B'], bins, labels=False)).add_prefix('B_')
temp_df2=pd.crosstab(df['A'], pd.cut(df['C'], bins2, labels=False)).add_prefix('C_')
out = pd.concat([temp_df1, temp_df2], axis=1).reset_index()
print(out)
A B_0 B_1 C_0 C_1
0 1 1 1 2 0
1 2 1 1 1 1
2 3 3 0 1 2
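If you want the B_bin1/B_bin2 names from the question instead of B_0/B_1, pass explicit labels to pd.cut rather than labels=False (a small variation, starting again from the original df):
temp_df1 = pd.crosstab(df['A'], pd.cut(df['B'], bins, labels=['bin1', 'bin2'])).add_prefix('B_')
temp_df2 = pd.crosstab(df['A'], pd.cut(df['C'], bins2, labels=['bin1', 'bin2'])).add_prefix('C_')
out = pd.concat([temp_df1, temp_df2], axis=1).reset_index()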
One option is get_dummies before the aggregation; this works since you have a limited number of bins (I'm skipping pd.cut and using a comparison instead):
temp = (df
.assign(B = df.B.lt(3), C = df.C.lt(5))
.replace({True:1, False:2})
)
(pd
.get_dummies(temp, columns = ['B','C'], prefix_sep='_bin')
.groupby('A')
.sum()
)
B_bin1 B_bin2 C_bin1 C_bin2
A
1 1 1 2 0
2 1 1 1 1
3 3 0 1 2
You could use the bins, along with pd.factorize and get_dummies:
temp = df.copy()
temp['B'] = pd.cut(df.B, bins)
temp['B'] = pd.factorize(temp.B)[0] + 1
temp['C'] = pd.cut(df.C, bins2)
temp['C'] = pd.factorize(temp.C)[0] + 1
(pd
.get_dummies(temp, columns = ['B','C'], prefix_sep='_bin')
.groupby('A')
.sum()
)
B_bin1 B_bin2 C_bin1 C_bin2
A
1 1 1 2 0
2 1 1 1 1
3 3 0 1 2
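One caveat (my observation): pd.factorize numbers categories by order of first appearance, so the _bin1/_bin2 suffixes only line up with the interval order if the lower interval happens to appear first in the data. Taking the categorical codes from pd.cut directly avoids that:
temp['B'] = pd.cut(df.B, bins).cat.codes + 1    # codes follow interval order, not appearance order
temp['C'] = pd.cut(df.C, bins2).cat.codes + 1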
I'm new to Python and could not find the answer I'm looking for anywhere.
I have a DataFrame that has the following structure:
df = pd.DataFrame(index=list('abc'), data={'A1': range(3), 'A2': range(3),'B1': range(3), 'B2': range(3), 'C1': range(3), 'C2': range(3)})
df
Out[1]:
A1 A2 B1 B2 C1 C2
a 0 0 0 0 0 0
b 1 1 1 1 1 1
c 2 2 2 2 2 2
Here the numbers are periods and the letters are variables. I'd like to transform the columns so that the periods and variables are split into a MultiIndex. The desired output would look like this:
A B C
1 2 1 2 1 2
a 0 0 0 0 0 0
b 1 1 1 1 1 1
c 2 2 2 2 2 2
I've tried the following:
periods = list(range(1, 3))
df.columns = df.columns.str.replace(r'\d+', '', regex=True)
df.columns = pd.MultiIndex.from_product([df.columns, periods])
That seems to multiply the columns and raises a ValueError: Length mismatch;
in my DataFrame I have 72 periods and 12 variables.
Thanks in advance for your help!
Edit: I realized that I haven't been precise enough. My column names are actually like Impressions1, Impressions2...Impressions72 and hhi1, hhi2...hhi72. So df.columns.str[0], df.columns.str[1] does not work for me, as the column names have different lengths. I think the solution might involve a regex, but I can't figure out how to do it. Any ideas?
Use pd.MultiIndex.from_tuples:
df.columns = pd.MultiIndex.from_tuples(list(zip(df.columns.str[0],df.columns.str[1])))
print(df)
A B C
1 2 1 2 1 2
a 0 0 0 0 0 0
b 1 1 1 1 1 1
c 2 2 2 2 2 2
Alternative:
pd.MultiIndex.from_tuples([tuple(name) for name in df.columns])
or
pd.MultiIndex.from_tuples(map(tuple, df.columns))
You can also use .str.extract and from_frame:
df.columns = pd.MultiIndex.from_frame(df.columns.str.extract('(.)(.)'), names=[None, None])
Output:
A B C
1 2 1 2 1 2
a 0 0 0 0 0 0
b 1 1 1 1 1 1
c 2 2 2 2 2 2
Here is what actually solved my issue:
df.columns = pd.MultiIndex.from_frame(df.columns.str.extract(r'([a-zA-Z]+)([0-9]+)'), names=[None, None])
Thanks @Scott Boston for the inspiration for this solution!
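For reference, a quick check of why the regex handles the real column names from the edit (using a hypothetical subset of them):
cols = pd.Index(['Impressions1', 'Impressions72', 'hhi1', 'hhi72'])
cols.str.extract(r'([a-zA-Z]+)([0-9]+)')
             0   1
0  Impressions   1
1  Impressions  72
2          hhi   1
3          hhi  72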
I was trying to find a type function in Pandas that would help me see how many strings and ints there are per column.
Example:
A B C D
1 H 3 20
3 5 2 1
2 Y M
Should give me something like
A B C D
int 3 1 2 2
str 0 2 0 1
NA 0 0 1 0
Is there any function that does this? I was thinking of making up something like if (A == int) then ... with a counter, but I guess that would go element by element and be super inefficient.
Setup
df = pd.DataFrame(dict(
A=[1, 3, 2],
B=['H', 5, 'Y'],
C=[3, 2, None],
D=[20, 1, 'M']
), dtype=object)
df
A B C D
0 1 H 3 20
1 3 5 2 1
2 2 Y None M
Solution
Use pd.DataFrame.applymap with type and pd.value_counts
df.applymap(type).apply(pd.value_counts).fillna(0, downcast='infer')
A B C D
<class 'int'> 3 1 2 2
<class 'str'> 0 2 0 1
<class 'NoneType'> 0 0 1 0
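Note (my addition): applymap was deprecated in pandas 2.1 in favour of DataFrame.map, so on recent versions the same idea reads:
df.map(type).apply(lambda s: s.value_counts()).fillna(0).astype(int)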
I want to index a Pandas dataframe using a boolean mask, then set a value in a subset of the filtered dataframe based on an integer index, and have this value reflected in the dataframe. That is, I would be happy if this worked on a view of the dataframe.
Example:
In [293]:
df = pd.DataFrame({'a': [0, 1, 2, 3, 4, 5, 6, 7],
'b': [5, 5, 2, 2, 5, 5, 2, 2],
'c': [0, 0, 0, 0, 0, 0, 0, 0]})
mask = (df['a'] < 7) & (df['b'] == 2)
df.loc[mask, 'c']
Out[293]:
2 0
3 0
6 0
Name: c, dtype: int64
Now I would like to set the values of the first two elements returned in the filtered dataframe. Chaining an iloc onto the loc call above works to index:
In [294]:
df.loc[mask, 'c'].iloc[0: 2]
Out[294]:
2 0
3 0
Name: c, dtype: int64
But not to assign:
In [295]:
df.loc[mask, 'c'].iloc[0: 2] = 1
print(df)
a b c
0 0 5 0
1 1 5 0
2 2 2 0
3 3 2 0
4 4 5 0
5 5 5 0
6 6 2 0
7 7 2 0
Making the assigned value the same length as the slice (i.e. = [1, 1]) also doesn't work. Is there a way to assign these values?
This does work, but it is a little ugly: basically we take the index generated by the mask-plus-iloc chain and make an additional call to loc:
In [57]:
df.loc[df.loc[mask,'c'].iloc[0:2].index, 'c'] = 1
df
Out[57]:
a b c
0 0 5 0
1 1 5 0
2 2 2 1
3 3 2 1
4 4 5 0
5 5 5 0
6 6 2 0
7 7 2 0
So breaking the above down:
In [60]:
# take the index from the mask and iloc
df.loc[mask, 'c'].iloc[0: 2]
Out[60]:
2 0
3 0
Name: c, dtype: int64
In [61]:
# call loc with this index; we can now select column 'c' and set the value
df.loc[df.loc[mask,'c'].iloc[0:2].index]
Out[61]:
a b c
2 2 2 0
3 3 2 0
How about:
ix = df.index[mask][:2]
df.loc[ix, 'c'] = 1
Same idea as EdChum's, but more elegant, as suggested in the comments.
EDIT: Have to be a little bit careful with this one as it may give unwanted results with a non-unique index, since there could be multiple rows indexed by either of the label in ix above. If the index is non-unique and you only want the first 2 (or n) rows that satisfy the boolean key, it would be safer to use .iloc with integer indexing with something like
ix = np.where(mask)[0][:2]
df.iloc[ix, df.columns.get_loc('c')] = 1  # iloc needs a positional column index, not the label 'c'
I don't know if this is any more elegant, but it's a little different:
mask = mask & (mask.cumsum() < 3)
df.loc[mask, 'c'] = 1
a b c
0 0 5 0
1 1 5 0
2 2 2 1
3 3 2 1
4 4 5 0
5 5 5 0
6 6 2 0
7 7 2 0
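The trick here is that mask.cumsum() numbers the True positions 1, 2, 3, ..., so mask & (mask.cumsum() < 3) keeps only the first two matches. To take the first n matching rows in general (a small generalization):
n = 2
df.loc[mask & (mask.cumsum() <= n), 'c'] = 1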