Reshaping a Pandas DataFrame: slicing columns and adding them as rows - python

I have a pandas DataFrame named 'trdf' with shape [1 row x 420 columns].
            0                                                  1           2  \
0  B0742F7GT8  Stone & Beam Modern Tripod Floor Lamp, 61"H, W...  2018-04-22

             3        4       5      6  7       8  9  ...  \
0  24-Apr-2018  100.00%  17.06%  0.00%  5  66.67%  8  ...

   410  411  412       413  414  415    416      417  418                    419
0   56  161   -8  -166.67%    0    1  0.00%  100.00%    8  Planned Replenishment
I want to slice every 20 columns starting from the last and add the column values as new rows. Here is my code:
for i in range(420, 20, -20):
    trdf.append(trdf.loc[:, i:i-20])
print(trdf)
However, the dataframe still has the same shape and values. Where is the error?

I believe you should first create a MultiIndex in the columns and then stack:
df.columns = [df.columns % 20, df.columns // 20]
df = df.stack().reset_index(level=0, drop=True)
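For example, on a toy 1 x 6 frame with a chunk size of 2 instead of 20 (a minimal sketch, assuming the columns are a plain integer RangeIndex as in the question):
import pandas as pd

df = pd.DataFrame([list('abcdef')])             # stands in for the 1 x 420 frame
df.columns = [df.columns % 2, df.columns // 2]  # chunk size 2 instead of 20
print(df.stack().reset_index(level=0, drop=True))
#    0  1
# 0  a  b
# 1  c  d
# 2  e  f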
Or use a numpy solution with reshape; but because the single row mixes types, all the reshaped data end up as objects (strings):
df = pd.DataFrame(df.values.reshape(21, 20))
If you want to use your own approach, create a list of one-row DataFrames and concat them together. Note that .loc slicing is inclusive on both ends, so .iloc is used here to select exactly 20 columns per chunk, and the range runs down to 0 so the first chunk is included:
L = []
for i in range(420, 0, -20):
    # change the slice order for selecting; iloc's end is exclusive
    df2 = df.iloc[:, i-20:i]
    # give every chunk the same columns so concat aligns them
    df2.columns = range(20)
    L.append(df2)
df1 = pd.concat(L)
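The same loop on a fresh toy 1 x 6 frame (a sketch with chunk size 2; note the chunks come out last-first):
import pandas as pd

df = pd.DataFrame([list('abcdef')])
L = []
for i in range(6, 0, -2):
    df2 = df.iloc[:, i-2:i]   # exclusive end: exactly 2 columns per chunk
    df2.columns = range(2)
    L.append(df2)
print(pd.concat(L))
#    0  1
# 0  e  f
# 0  c  d
# 0  a  b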
Also, if the expected output needs the chunks joined from the last columns to the first, for the stack solution:
df.columns = [df.columns % 20, 20 - df.columns // 20]
df = df.stack().reset_index(level=0, drop=True)
And for the reshape solution:
df1 = pd.DataFrame(df.values.reshape(21, 20)[::-1])

Related

Create a DataFrame from df1 and df2, taking the value from df2 when the column value is empty or missing in df1

df1 = pd.DataFrame({'call_sign': ['ASD','BSD','CDSF','GFDFD','FHHH'],'frn':['123','124','','656','']})
df2 = pd.DataFrame({'call_sign': ['ASD','CDSF','BSD','GFDFD','FHHH'],'frn':['1234','','124','','765']})
I need to get a new df like:
new_df = pd.DataFrame({'call_sign': ['ASD','CDSF','BSD','GFDFD','FHHH'],'frn':['123','','124','656','765']})
I need to take frn from df2 when it is missing in df1 and create a new df.
Replace empty strings with missing values and use DataFrame.set_index with DataFrame.fillna; because the ordering of df2.call_sign is needed, add DataFrame.reindex:
import numpy as np

df = (df1.set_index('call_sign').replace('', np.nan)
         .fillna(df2.set_index('call_sign').replace('', np.nan))
         .reindex(df2['call_sign']).reset_index())
print(df)
call_sign frn
0 ASD 123
1 CDSF NaN
2 BSD 124
3 GFDFD 656
4 FHHH 765
If you want to update df2 you can use boolean indexing:
# is frn empty string?
m = df2['frn'].eq('')
# update those rows from the value in df1
df2.loc[m, 'frn'] = df2.loc[m, 'call_sign'].map(df1.set_index('call_sign')['frn'])
Updated df2:
call_sign frn
0 ASD 1234
1 CDSF
2 BSD 124
3 GFDFD 656
4 FHHH 765
Alternatively, merge the frames on call_sign and keep df1's frn wherever it is non-empty, falling back to df2's:
temp = df1.merge(df2, how='left', on='call_sign')
df1['frn'] = temp.frn_x.where(temp.frn_x != '', temp.frn_y)
  call_sign  frn
0       ASD  123
1       BSD  124
2      CDSF
3     GFDFD  656
4      FHHH  765
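Not from the answers above, but a related sketch: after replacing empty strings with NaN, DataFrame.combine_first fills the missing values in df1 from df2 by aligned index and gives the same result:
import numpy as np
import pandas as pd

a = df1.set_index('call_sign').replace('', np.nan)
b = df2.set_index('call_sign').replace('', np.nan)
df = a.combine_first(b).reindex(df2['call_sign']).reset_index()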

How to combine two datasets vertically in pandas?

I have a loop which generates dataframes with 2 columns each. When I try to stack the dataframes vertically, pd.concat within the loop adds them horizontally instead. Rather than merging the columns (which have the same length), it adds 2 new columns for every loop iteration, creating a bunch of NaNs. How do I solve this?
df_master = pd.DataFrame()
columns = list(df_master)
data = []
for i in range(1, 3):
    # ...do something and return a df2 with 2 columns
    data.append(df2)
df_master = pd.concat(data, axis=1)
df_master.head()
How do I collapse the 2 new columns from every iteration into a single pair of columns in one dataframe?
If you don't need to keep the column labels of the original dataframes, you can try renaming the column labels of each dataframe to the same ones (e.g. 0 and 1) before concat, for example:
df_master = pd.concat([dfi.rename({old: new for new, old in enumerate(dfi.columns)}, axis=1) for dfi in data], ignore_index=True)
Demo
df1
57 59
0 1 2
1 3 4
df2
138 140
0 11 12
1 13 14
data = [df1, df2]
df_master = pd.concat([dfi.rename({old: new for new, old in enumerate(dfi.columns)}, axis=1) for dfi in data], ignore_index=True)
df_master
0 1
0 1 2
1 3 4
2 11 12
3 13 14
I suppose the problem is that your columns have different names in each iteration, so you could easily solve it by calling df2.rename() and renaming them to the same names.
It works for me if I change axis to 0 inside the concat command.
df_master = pd.concat(data, axis=0)
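Putting that fix together with the loop from the question (a sketch; the loop body that produces each df2 is a hypothetical stand-in):
import pandas as pd

data = []
for i in range(1, 3):
    # stand-in for the real loop body, which returns a 2-column df2
    df2 = pd.DataFrame({'a': [i, i], 'b': [i * 10, i * 10]})
    data.append(df2)

df_master = pd.concat(data, axis=0, ignore_index=True)  # stack vertically
print(df_master)
#    a   b
# 0  1  10
# 1  1  10
# 2  2  20
# 3  2  20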
Pandas fills empty cells with NaN in each scenario, as in the examples below.
df1 = pd.DataFrame({'col1':[11,12,13], 'col2': [21,22,23], 'col3':[31,32,33]})
df2 = pd.DataFrame({'col1':[111,112,113, 114], 'col2': [121,122,123,124]})
merge / join / concatenate data frames [df1, df2] vertically - add rows
pd.concat([df1,df2], ignore_index=True)
# output
col1 col2 col3
0 11 21 31.0
1 12 22 32.0
2 13 23 33.0
3 111 121 NaN
4 112 122 NaN
5 113 123 NaN
6 114 124 NaN
merge / join / concatenate data frames horizontally (aligning by index)
pd.concat([df1,df2], axis=1)
# output
col1 col2 col3 col1 col2
0 11.0 21.0 31.0 111 121
1 12.0 22.0 32.0 112 122
2 13.0 23.0 33.0 113 123
3 NaN NaN NaN 114 124

Stacking columns one below the other when the column names are the same

I have a huge data set in a pandas data frame. It looks something like this
df = pd.DataFrame([[1,2,3,4],[31,14,13,11],[115,613,1313,1]], columns=['c1','c1','c2','c2'])
Here the first two columns have the same name, so they should be concatenated into a single column with the values one below another. The dataframe should look something like this:
df1 = pd.DataFrame([[1,3],[31,13],[115,1313],[2,4],[14,11],[613,1]], columns=['c1','c2'])
Note: My original dataframe has many columns, so I cannot use a simple concat to stack them. I have also tried the stack function in addition to concat. What can I do?
Use groupby + cumcount to create a pd.MultiIndex, reassign the columns with the new pd.MultiIndex, and stack:
df = pd.DataFrame(
    [[1,2,3,4],[31,14,13,11],[115,613,1313,1]],
    columns=['c1','c1','c2','c2'])
df1 = df.copy()
df1.columns = [df.columns, df.columns.to_series().groupby(level=0).cumcount()]
print(df1.stack().reset_index(drop=True))
c1 c2
0 1 3
1 2 4
2 31 13
3 14 11
4 115 1313
5 613 1
Or with a bit of creativity, in one line
df.T.set_index(
df.T.groupby([df.columns]).cumcount(),
append=True
).unstack().T.reset_index(drop=True)
c1 c2
0 1 3
1 2 4
2 31 13
3 14 11
4 115 1313
5 613 1
You could melt the dataframe, then count the entries within each column to use as an index for the new dataframe, and then unstack it back, like this:
import pandas as pd
df = pd.DataFrame(
    [[1,2,3,4],[31,14,13,11],[115,613,1313,1]],
    columns=['c1','c1','c2','c2'])
df1 = (pd.melt(df, var_name='column')
         .assign(n=lambda x: x.groupby('column').cumcount())
         .set_index(['n','column'])
         .unstack())
df1.columns = df1.columns.get_level_values(1)
print(df1)
Which produces
column c1 c2
n
0 1 3
1 31 13
2 115 1313
3 2 4
4 14 11
5 613 1
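A further compact sketch (not from the original answers) that leans on the fact that selecting a duplicated label returns all matching columns; the duplicate count of 2 and the names are hardcoded here for illustration:
import pandas as pd

df = pd.DataFrame([[1,2,3,4],[31,14,13,11],[115,613,1313,1]],
                  columns=['c1','c1','c2','c2'])

# df[name] returns a 2-column frame for a duplicated label; take slice i of each
parts = [pd.DataFrame({name: df[name].iloc[:, i] for name in ['c1', 'c2']})
         for i in range(2)]
print(pd.concat(parts, ignore_index=True))
#     c1    c2
# 0    1     3
# 1   31    13
# 2  115  1313
# 3    2     4
# 4   14    11
# 5  613     1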

How to remove data from DataFrame permanently

After reading CSV data file with:
import pandas as pd
df = pd.read_csv('data.csv')
print(df.shape)
I get a DataFrame 99 rows (indexes) long:
(99, 2)
To clean up the DataFrame I go ahead and apply the dropna() method, which reduces it to 33 rows:
df = df.dropna()
print(df.shape)
which prints:
(33, 2)
Now when I iterate over a column it prints out all 99 rows as if they weren't dropped:
for index, value in df['column1'].items():
    print(index)
which gives me this:
0
1
2
.
.
.
97
98
99
It appears that dropna() simply made the data "hidden". That hidden data comes back when I iterate over the DataFrame. How can I make sure the dropped data is actually removed from the DataFrame instead of just hidden?
You're being confused by the fact that the row labels have been preserved, so the last row label is still 99.
Example:
In [2]:
df = pd.DataFrame({'a': [0, 1, np.nan, np.nan, 4]})
df
Out[2]:
a
0 0
1 1
2 NaN
3 NaN
4 4
After calling dropna the index row labels are preserved:
In [3]:
df = df.dropna()
df
Out[3]:
a
0 0
1 1
4 4
If you want to reset so that they are contiguous then call reset_index(drop=True) to assign a new index:
In [4]:
df = df.reset_index(drop=True)
df
Out[4]:
a
0 0
1 1
2 4
Or you can use the inplace parameter; note that dropna(inplace=True) modifies df in place and returns None, so do not assign the result:
df.dropna(inplace=True)
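Putting the pieces together for the original CSV case (a sketch; 'data.csv' and 'column1' are the names from the question):
import pandas as pd

df = pd.read_csv('data.csv')
df = df.dropna().reset_index(drop=True)  # drop NaN rows, then renumber 0..n-1

for index, value in df['column1'].items():
    print(index)                         # labels now run 0..32 for 33 rows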

Breaking out column by groups in Pandas

If I have a DataFrame like this:
type value group
a 10 one
b 45 one
a 224 two
b 119 two
a 33 three
b 44 three
how do I make it into this:
type one two three
a 10 224 33
b 45 119 44
I thought it'd be pivot_table, but that just gives me a re-grouped list.
I think you need pivot with rename_axis (new in pandas 0.18.0) and reset_index:
print(df.pivot(index='type', columns='group', values='value')
        .rename_axis(None, axis=1)
        .reset_index())
type one three two
0 a 10 33 224
1 b 45 44 119
If ordering of columns is important:
df = df.pivot(index='type', columns='group', values='value').rename_axis(None, axis=1)
print(df[['one','two','three']].reset_index())
type one two three
0 a 10 224 33
1 b 45 119 44
EDIT:
With your real data you can get an error:
print(df.pivot(index='type', columns='group', values='value')
        .rename_axis(None, axis=1)
        .reset_index())
ValueError: Index contains duplicate entries, cannot reshape
print(df)
type value group
0 a 10 one
1 a 20 one
2 b 45 one
3 a 224 two
4 b 119 two
5 a 33 three
6 b 44 three
The problem is in the second row - for index value a and column one you get two values, 10 and 20. The pivot_table function aggregates the data in this case. The default aggregating function is np.mean (numpy imported as np), but you can change it with the aggfunc parameter:
print(df.pivot_table(index='type', columns='group', values='value', aggfunc=np.mean)
        .rename_axis(None, axis=1)
        .reset_index())
type one three two
0 a 15 33 224
1 b 45 44 119
print(df.pivot_table(index='type', columns='group', values='value', aggfunc='first')
        .rename_axis(None, axis=1)
        .reset_index())
type one three two
0 a 10 33 224
1 b 45 44 119
print(df.pivot_table(index='type', columns='group', values='value', aggfunc=sum)
        .rename_axis(None, axis=1)
        .reset_index())
type one three two
0 a 30 33 224
1 b 45 44 119
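For reference, a self-contained version of the duplicate-rows case (a sketch; the data is taken from the printout above):
import pandas as pd

df = pd.DataFrame({
    'type':  ['a', 'a', 'b', 'a', 'b', 'a', 'b'],
    'value': [10, 20, 45, 224, 119, 33, 44],
    'group': ['one', 'one', 'one', 'two', 'two', 'three', 'three'],
})

print(df.pivot_table(index='type', columns='group', values='value', aggfunc='sum')
        .rename_axis(None, axis=1)
        .reset_index())
#   type  one  three  two
# 0    a   30     33  224
# 1    b   45     44  119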
