I have a huge data set in a pandas DataFrame. It looks something like this:
df = pd.DataFrame([[1,2,3,4],[31,14,13,11],[115,613,1313,1]], columns=['c1','c1','c2','c2'])
Here the first two columns have the same name, so they should be concatenated into a single column with the values stacked one below another. The DataFrame should then look something like this:
df1 = pd.DataFrame([[1,3],[31,13],[115,1313],[2,4],[14,11],[613,1]], columns=['c1','c2'])
Note: my original DataFrame has many columns, so I cannot use a simple concat call to stack them. I have also tried stack, in addition to concat, without success. What can I do?
Use groupby + cumcount to create a pd.MultiIndex, reassign the columns with the new pd.MultiIndex, and stack:
import pandas as pd

df = pd.DataFrame(
    [[1, 2, 3, 4], [31, 14, 13, 11], [115, 613, 1313, 1]],
    columns=['c1', 'c1', 'c2', 'c2'])

df1 = df.copy()
# second level: a running counter (0, 1, ...) within each duplicated column name
df1.columns = [df.columns, df.columns.to_series().groupby(level=0).cumcount()]
print(df1.stack().reset_index(drop=True))
c1 c2
0 1 3
1 2 4
2 31 13
3 14 11
4 115 1313
5 613 1
Or with a bit of creativity, in one line
df.T.set_index(
    df.T.groupby([df.columns]).cumcount(),
    append=True
).unstack().T.reset_index(drop=True)
c1 c2
0 1 3
1 2 4
2 31 13
3 14 11
4 115 1313
5 613 1
You could melt the DataFrame, then number the entries within each column name to use as an index for the new DataFrame, and then unstack it back, like this:
import pandas as pd

df = pd.DataFrame(
    [[1, 2, 3, 4], [31, 14, 13, 11], [115, 613, 1313, 1]],
    columns=['c1', 'c1', 'c2', 'c2'])

df1 = (pd.melt(df, var_name='column')
         .assign(n=lambda x: x.groupby('column').cumcount())  # row number within each column name
         .set_index(['n', 'column'])
         .unstack())
df1.columns = df1.columns.get_level_values(1)
print(df1)
Which produces
column c1 c2
n
0 1 3
1 31 13
2 115 1313
3 2 4
4 14 11
5 613 1
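Optionally, if you want the result to match the desired df1 exactly, with no leftover axis names, a small follow-up cleanup:

df1 = df1.reset_index(drop=True)   # drop the 'n' index
df1.columns.name = None            # drop the 'column' axis label
print(df1)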
I have a loop which generates DataFrames with 2 columns each. When I try to append the DataFrames vertically (stacking them one on top of another) using pd.concat within the loop, the code adds them horizontally instead. The columns (which have the same length) are not merged; rather, 2 new columns are added on every loop iteration, creating a bunch of NaNs. How do I solve this?
df_master = pd.DataFrame()
columns = list(df_master)
data = []
for i in range(1, 3):
    # ... do something and return a df2 with 2 columns ...
    data.append(df2)
df_master = pd.concat(data, axis=1)
df_master.head()
How do I stack the 2 new columns from every iteration into one DataFrame?
If you don't need to keep the column labels of the original dataframes, you can try renaming the column labels of each dataframe to the same values (e.g. 0 and 1) before concat, for example:
df_master = pd.concat([dfi.rename({old: new for new, old in enumerate(dfi.columns)}, axis=1) for dfi in data], ignore_index=True)
Demo
df1
57 59
0 1 2
1 3 4
df2
138 140
0 11 12
1 13 14
data = [df1, df2]
df_master = pd.concat([dfi.rename({old: new for new, old in enumerate(dfi.columns)}, axis=1) for dfi in data], ignore_index=True)
df_master
0 1
0 1 2
1 3 4
2 11 12
3 13 14
I suppose the problem is that your columns have different names on each iteration, so you could easily solve it by calling df2.rename() and giving the columns the same names every time.
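For example, a minimal sketch of that fix, where make_df2 is a hypothetical stand-in for whatever step produces your two-column frame on each iteration:

import pandas as pd

data = []
for i in range(1, 3):
    df2 = make_df2(i)                   # hypothetical: returns a 2-column DataFrame
    df2.columns = ['col_a', 'col_b']    # force identical labels on every iteration
    data.append(df2)

# with matching column names, the default axis=0 stacks the frames vertically
df_master = pd.concat(data, ignore_index=True)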
It works for me if I change axis to 0 inside the concat command.
df_master = pd.concat(data, axis=0)
Pandas fills the empty cells with NaNs in either scenario, as in the examples below.
df1 = pd.DataFrame({'col1':[11,12,13], 'col2': [21,22,23], 'col3':[31,32,33]})
df2 = pd.DataFrame({'col1':[111,112,113, 114], 'col2': [121,122,123,124]})
merge / join / concatenate data frames [df1, df2] vertically - add rows
pd.concat([df1,df2], ignore_index=True)
# output
col1 col2 col3
0 11 21 31.0
1 12 22 32.0
2 13 23 33.0
3 111 121 NaN
4 112 122 NaN
5 113 123 NaN
6 114 124 NaN
merge / join / concatenate data frames horizontally (aligning by index)
pd.concat([df1,df2], axis=1)
# output
col1 col2 col3 col1 col2
0 11.0 21.0 31.0 111 121
1 12.0 22.0 32.0 112 122
2 13.0 23.0 33.0 113 123
3 NaN NaN NaN 114 124
I have a time series DataFrame df1 with prices in a ticker column, from which a new DataFrame df2 is created by concatenating df1 with 3 other columns sharing the same DateTimeIndex.
Now I need the ticker name "Equity(42950 [FB])" to become the new top-level header, with the 3 other columns nested under it, and the ticker's prices replaced by the values in the "closePrice" column.
How can I achieve this in Python?
Use a pd.MultiIndex:
import numpy as np
import pandas as pd

d = pd.DataFrame(np.arange(20).reshape(5, 4),
                 columns=['Equity', 'closePrice', 'mMb', 'mMv'])
arrays = [['Equity', 'Equity', 'Equity'], ['closePrice', 'mMb', 'mMv']]
tuples = list(zip(*arrays))
index = pd.MultiIndex.from_tuples(tuples)
df = pd.DataFrame(d.values[:, 1:], columns=index)
df
Equity
closePrice mMb mMv
0 1 2 3
1 5 6 7
2 9 10 11
3 13 14 15
4 17 18 19
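Applied to the question's frame, a minimal sketch; the column names below are assumptions based on the question's description (adjust them to your actual df2), and pd.MultiIndex.from_product builds the nesting in one call:

import pandas as pd

ticker = 'Equity(42950 [FB])'
cols = ['closePrice', 'mMb', 'mMv']    # assumed column names from the question
df3 = df2[cols].copy()                 # keep only the three value columns
df3.columns = pd.MultiIndex.from_product([[ticker], cols])  # nest them under the ticker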
I have a "1000 rows * 4 columns" DataFrame:
a b c d
1 aa 93 4
2 bb 32 3
...
1000 nn 78 2
[1283 rows x 4 columns]
and I use groupby to group them based on 3 of the columns:
df = df.groupby(['a','b','c']).sum()
print(df)
a b c d
1 aa 93 12
2 bb 32 53
...
1000 nn 78 38
[1283 rows x 1 columns]
However, the result gives me a "1000 rows * 1 column" DataFrame. So my question is: does groupby concatenate the columns into one column? If yes, how can I prevent that? I want to plot my data after grouping it, but I can't, since it only sees one column instead of all 4.
Edit: when I call the columns I only get the last column; it means it can't read 'a', 'b', 'c' as columns. Why is that, and how can I mark them as columns again?
df.columns
Index([u'd'], dtype='object')
You can do it this way:
df.groupby(['a','b','c'], as_index=False).sum()
or:
df.groupby(['a','b','c']).sum().reset_index()
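A minimal demo with made-up data, showing that as_index=False keeps 'a', 'b' and 'c' available as regular columns:

import pandas as pd

df = pd.DataFrame({'a': [1, 1, 2], 'b': ['x', 'x', 'y'],
                   'c': [10, 10, 20], 'd': [5, 7, 9]})

res = df.groupby(['a', 'b', 'c'], as_index=False).sum()
print(res.columns)
# Index(['a', 'b', 'c', 'd'], dtype='object') -- all four are plottable columns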
Let's say I have a data frame with 4 rows, 3 columns. I'd like to stack the rows horizontally so that I get one row with 12 columns. How to do it and how to handle colliding column names?
You can achieve this by stacking the frame to produce a series of all the values. We then convert this back to a DataFrame using to_frame, call reset_index to drop the index levels, and transpose using .T:
In [1]:
import numpy as np
import pandas as pd

In [2]:
df = pd.DataFrame(np.random.randn(4,3), columns=list('abc'))
df
Out[2]:
a b c
0 -1.744219 -2.475923 1.794151
1 0.952148 -0.783606 0.784224
2 0.386506 -0.242355 -0.799157
3 -0.547648 -0.139976 -0.717316
In [3]:
df.stack().to_frame().reset_index(drop=True).T
Out[3]:
0 1 2 3 4 5 6 \
0 -1.744219 -2.475923 1.794151 0.952148 -0.783606 0.784224 0.386506
7 8 9 10 11
0 -0.242355 -0.799157 -0.547648 -0.139976 -0.717316
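As for the colliding column names: the transposed result above just uses positions 0-11, so nothing collides, but if you'd rather keep the original names, one sketch is to flatten the (row, column) tuples into unique labels yourself:

out = df.stack().to_frame().T
# each column label is a (row, column) tuple; join them into names like 'a_0', 'b_0', ...
out.columns = ['{}_{}'.format(col, row) for row, col in out.columns]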
I have the following DataFrame:
index PUBLICO CLASSIFICACAO_PUBLICO
0 19 143643 1
1 34 111879 2
2 31 50382 3
3 9 49204 4
4 32 37541 5
5 4 36095 6
I need to convert the column named index into the index of the DataFrame.
For example:
index PUBLICO CLASSIFICACAO_PUBLICO
19 143643 1
34 111879 2
31 50382 3
9 49204 4
32 37541 5
4 36095 6
I tried using df.set_index('index'), but it didn't work.
The column named index was previously the index of the DataFrame, but I used reset_index(); now I need to do the reverse.
The set_index method doesn't modify the dataframe in place by default, so you have to either reassign your dataframe or pass the option inplace=True:
df = df.set_index('index')
or
df.set_index('index',inplace = True)
see http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.set_index.html
You can try it this way:
df.set_index(df['index'], inplace=True)
This will set your index column as the index of your dataframe; because a Series (rather than a column label) was passed, the index column also remains in the dataframe. Then, you can just drop that column.
df.drop('index', axis=1, inplace=True)