How to stack two columns of a pandas DataFrame in Python

I want to stack two columns on top of each other.
I have Left and Right values in one column each and want to combine them into a single column. How do I do this in Python?
I'm working with pandas DataFrames.
Basically, from this:
Left Right
0 20 25
1 15 18
2 10 35
3 0 5
To this:
New Name
0 20
1 15
2 10
3 0
4 25
5 18
6 35
7 5
It doesn't matter how they are combined as I will plot it anyway, and the new column name also doesn't matter because I can rename it.

You can create a list of the columns, calling squeeze on each so that it is a plain Series and concat does not try to align on column names, and then call concat on this list. Passing ignore_index=True creates a new index; otherwise the original index values are repeated:
cols = [df[col].squeeze() for col in df]
pd.concat(cols, ignore_index=True)
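A minimal sketch of what ignore_index changes, assuming the Left/Right frame from the question (rebuilt here so the snippet runs on its own):

```python
import pandas as pd

# Sample frame from the question
df = pd.DataFrame({"Left": [20, 15, 10, 0], "Right": [25, 18, 35, 5]})

cols = [df[col] for col in df]  # one Series per column

# Without ignore_index the original row labels are kept, so 0..3 appears twice
kept = pd.concat(cols)

# With ignore_index=True a fresh 0..7 index is built
stacked = pd.concat(cols, ignore_index=True)

print(kept.index.tolist())
print(stacked.tolist())
```

A repeated index is harmless for plotting the values, but it makes label-based lookups such as kept.loc[0] return two rows.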

Many options, stack, melt, concat, ...
Here's one:
>>> df.melt(value_name='New Name').drop(columns='variable')
New Name
0 20
1 15
2 10
3 0
4 25
5 18
6 35
7 5
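The stack option mentioned above is not shown, so here is a rough sketch of it, assuming the same Left/Right frame. stack walks the frame row by row, so the frame is transposed first to keep all Left values ahead of all Right values:

```python
import pandas as pd

df = pd.DataFrame({"Left": [20, 15, 10, 0], "Right": [25, 18, 35, 5]})

# Transpose so each original column becomes a row, then stack row by row
out = df.T.stack().reset_index(drop=True).to_frame("New Name")
print(out)
```

Calling df.stack() without the transpose should also give a single column, just with Left and Right values alternating.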

You can also use np.ravel:
import numpy as np
out = pd.DataFrame(np.ravel(df.values.T), columns=['New name'])
print(out)
# Output
New name
0 20
1 15
2 10
3 0
4 25
5 18
6 35
7 5
Update
If you have only 2 cols:
out = pd.concat([df['Left'], df['Right']], ignore_index=True).to_frame('New name')
print(out)
# Output
New name
0 20
1 15
2 10
3 0
4 25
5 18
6 35
7 5

Solution with unstack:
import numpy as np
df2 = df.unstack()
# recreate the index as 0..n-1
df2.index = np.arange(len(df2))

A solution with masking.
# Your data
import numpy as np
import pandas as pd
df = pd.DataFrame({"Left":[20,15,10,0], "Right":[25,18,35,5]})
# Select the columns and flatten them; np.ravel goes row by row,
# so Left and Right values alternate in the result
df2 = pd.DataFrame({"New Name":np.ravel(df[["Left","Right"]])})
df2
New Name
0 20
1 25
2 15
3 18
4 10
5 35
6 0
7 5
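If the order from the question is wanted (all Left values, then all Right values), the order argument of np.ravel can be used. A small sketch, assuming the same frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Left": [20, 15, 10, 0], "Right": [25, 18, 35, 5]})
values = df[["Left", "Right"]].to_numpy()

# Default order="C" reads row by row: Left and Right values alternate
interleaved = np.ravel(values)

# order="F" reads column by column: all Left values, then all Right values
stacked = np.ravel(values, order="F")

df2 = pd.DataFrame({"New Name": stacked})
print(df2)
```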

I ended up using this solution, which seems to work fine:
df1 = dfTest[['Left']].copy()
df2 = dfTest[['Right']].copy()
df2.columns=['Left']
df3 = pd.concat([df1, df2],ignore_index=True)

Related

Retrieving Unknown Column Names from DataFrame.apply

How I can retrieve column names from a call to DataFrame apply without knowing them in advance?
What I'm trying to do is apply a mapping from column names to functions to arbitrary DataFrames. Those functions might return multiple columns. I would like to end up with a DataFrame that contains the original columns as well as the new ones, the amount and names of which I don't know at build-time.
Other solutions here are Series-based. I'd like to do the whole frame at once, if possible.
What am I missing here? Are the columns coming back from apply lost in destructuring unless I know their names? It looks like assign might be useful, but will likely require a lot of boilerplate.
import pandas as pd
def fxn(col):
    return pd.Series(col * 2, name=col.name+'2')
df = pd.DataFrame({'A': range(0, 10), 'B': range(10, 0, -1)})
print(df)
# [Edit:]
# A B
# 0 0 10
# 1 1 9
# 2 2 8
# 3 3 7
# 4 4 6
# 5 5 5
# 6 6 4
# 7 7 3
# 8 8 2
# 9 9 1
df = df.apply(fxn)
print(df)
# [Edit:]
# Observed: columns changed in-place.
# A B
# 0 0 20
# 1 2 18
# 2 4 16
# 3 6 14
# 4 8 12
# 5 10 10
# 6 12 8
# 7 14 6
# 8 16 4
# 9 18 2
df[['A2', 'B2']] = df.apply(fxn)
print(df)
# [Edit: I am doubling column values, so missing something, but the question about the column counts stands.]
# Expected: new columns added. How can I do this at runtime without knowing column names?
# A B A2 B2
# 0 0 40 0 80
# 1 4 36 8 72
# 2 8 32 16 64
# 3 12 28 24 56
# 4 16 24 32 48
# 5 20 20 40 40
# 6 24 16 48 32
# 7 28 12 56 24
# 8 32 8 64 16
# 9 36 4 72 8
You need to concat the result of your function with the original df.
Use pd.concat:
x = df.apply(fxn)  # apply the function on df and store the result separately
df = pd.concat([df, x], axis=1)  # concat with the original df to get all columns
Rename duplicate column names by adding suffixes:
from collections import Counter
mylist = df.columns.tolist()
d = {a: list(range(1, b+1)) if b > 1 else '' for a, b in Counter(mylist).items()}
df.columns = [i + str(d[i].pop(0)) if len(d[i]) else i for i in mylist]
df
Output:
A1 B1 A2 B2
0 0 10 0 20
1 1 9 2 18
2 2 8 4 16
3 3 7 6 14
4 4 6 8 12
5 5 5 10 10
6 6 4 12 8
7 7 3 14 6
8 8 2 16 4
9 9 1 18 2
You can assign directly with:
df[df.columns + '2'] = df.apply(fxn)
Output:
A B A2 B2
0 0 10 0 20
1 1 9 2 18
2 2 8 4 16
3 3 7 6 14
4 4 6 8 12
5 5 5 10 10
6 6 4 12 8
7 7 3 14 6
8 8 2 16 4
9 9 1 18 2
Alternatively, you can build on @MayankPorwal's answer by applying .add_suffix('2') to the output of your apply call:
pd.concat([df, df.apply(fxn).add_suffix('2')], axis=1)
which will return the same output.
In your function, name=col.name+'2' is doing nothing (it's basically returning just col * 2). That's because apply returns the values back to the original column.
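The point about the name can be checked directly. A small sketch using the function and data from the question:

```python
import pandas as pd

def fxn(col):
    # The name set here does not end up in the column labels of the result
    return pd.Series(col * 2, name=col.name + "2")

df = pd.DataFrame({"A": range(0, 10), "B": range(10, 0, -1)})
doubled = df.apply(fxn)

# The result is labelled with the original column names
print(doubled.columns.tolist())
```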
Anyways, it's possible to take the MayankPorwal approach: pd.concat + managing duplicated columns (make them unique). Another possible way to do that:
# Use pd.concat as mentioned in the first answer from Mayank Porwal
df = pd.concat([df, df.apply(fxn)], axis=1)
# Rename duplicated columns
suffix = (pd.Series(df.columns).groupby(df.columns).cumcount()+1).astype(str)
df.columns = df.columns + suffix.replace('1', '')
which returns the same output, and additionally manages further duplicated columns.
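The renaming step can also be wrapped in a small function. This is only a sketch: make_unique is a hypothetical helper name, not a pandas API, and it assumes the labels are strings.

```python
import pandas as pd

def make_unique(columns):
    """Append 2, 3, ... to the second and later occurrences of each label.

    Hypothetical helper for illustration; assumes string labels.
    """
    labels = pd.Series(list(columns))
    # cumcount numbers the occurrences of each label: 0 for the first one
    counts = labels.groupby(labels).cumcount()
    return [lab if n == 0 else f"{lab}{n + 1}" for lab, n in zip(labels, counts)]

print(make_unique(["A", "B", "A", "B", "A"]))
```

Usage would be df.columns = make_unique(df.columns) after the concat.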
Answer on behalf of the OP:
This code does what I wanted:
import pandas as pd
# Simulated business logic: for an input row, return a number of columns
# related to the input, and generate names for them, such that we don't
# know the shape of the output or the names of its columns before the call.
def fxn(row):
    # positional access; plain row[0] is deprecated on a labelled Series
    length = row.iloc[0]
    indices = [row.index[0] + str(i) for i in range(0, length)]
    series = pd.Series([i for i in range(0, length)], index=indices)
    return series
# Sample data: 0 to 18, inclusive, counting by 2.
df1 = pd.DataFrame(list(range(0, 20, 2)), columns=['A'])
# Randomize the rows to simulate different input shapes.
df1 = df1.sample(frac=1)
# Apply fxn to rows to get new columns (with expand). Concat to keep inputs.
df1 = pd.concat([df1, df1.apply(fxn, axis=1, result_type='expand')], axis=1)
print(df1)

Pandas : While adding new rows, its replacing my existing dataframe values? [duplicate]

This question already has answers here:
Is it possible to insert a row at an arbitrary position in a dataframe using pandas?
(4 answers)
Closed 2 years ago.
import pandas as pd
data = {'term':[2, 7,10,11,13],'pay':[22,30,50,60,70]}
df = pd.DataFrame(data)
pay term
0 22 2
1 30 7
2 50 10
3 60 11
4 70 13
df.loc[2] = [49,9]
print(df)
pay term
0 22 2
1 30 7
2 49 9
3 60 11
4 70 13
Expected output :
pay term
0 22 2
1 30 7
2 49 9
3 50 10
4 60 11
5 70 13
If we run above code, it is replacing the values at 2 index. I want to add new row with desired value as above to my existing dataframe without replacing the existing values. Please suggest.
You cannot insert a new row by assigning values to df.loc[2], because that overwrites the existing row. Instead, slice the DataFrame into two parts and concat them with the new row in between.
Try this:
new_df = pd.DataFrame({"pay": 49, "term": 9}, index=[2])
df = pd.concat([df.loc[:1], new_df, df.loc[2:]]).reset_index(drop=True)
print(df)
Output:
term pay
0 2 22
1 7 30
2 9 49
3 10 50
4 11 60
5 13 70
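The same idea can be written as a reusable function. A minimal sketch, where insert_row is a hypothetical helper (not a pandas API) that slices by position with iloc, so it does not depend on the index labels:

```python
import pandas as pd

def insert_row(df, pos, row):
    """Return a new frame with `row` (a dict) inserted at integer position `pos`.

    Hypothetical helper for illustration; not part of pandas.
    """
    new = pd.DataFrame([row], columns=df.columns)
    # iloc slices by position, so this also works when the index is not 0..n-1
    return pd.concat([df.iloc[:pos], new, df.iloc[pos:]], ignore_index=True)

df = pd.DataFrame({"term": [2, 7, 10, 11, 13], "pay": [22, 30, 50, 60, 70]})
df = insert_row(df, 2, {"pay": 49, "term": 9})
print(df)
```

Passing the row as a dict keeps the values tied to column names, so the column order of the frame does not matter.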
A possible way is to prepare an empty slot in the index, add the row, and sort by the index:
df.index = list(range(2)) + list(range(3, len(df) + 1))
df.loc[2] = [9, 49]  # values in column order: term, pay
It gives:
term pay
0 2 22
1 7 30
3 10 50
4 11 60
5 13 70
2 9 49
Time to sort it:
df = df.sort_index()
term pay
0 2 22
1 7 30
2 9 49
3 10 50
4 11 60
5 13 70
That is because loc and iloc select a row that already exists in the DataFrame, so assigning to it overwrites that row. Assigning through loc only adds a row when the label does not exist yet, which normally means appending at the end.
To insert in the middle, split the DataFrame, add the new row after the first part, concatenate the second part, and finally reset the index (if you want to keep a consecutive integer index):
# location where the new row goes
i = 2
# data to insert
data_to_insert = pd.DataFrame({'pay': 49, 'term': 9}, index=[i])
# split, add the new row, add the rest of the original
# (loc slices include the end label, so the first part stops at i - 1)
df = pd.concat([df.loc[:i - 1], data_to_insert, df.loc[i:]]).reset_index(drop=True)
Keep in mind that this slicing works because the index consists of consecutive integers: loc slices by label and includes both endpoints.
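Label-based slices with loc include the end label, while position-based slices with iloc exclude the end position, which matters when choosing the split point. A quick check on the frame from the question:

```python
import pandas as pd

df = pd.DataFrame({"term": [2, 7, 10, 11, 13], "pay": [22, 30, 50, 60, 70]})
i = 2

by_label = df.loc[:i]      # labels 0, 1, 2 (end label included)
by_position = df.iloc[:i]  # positions 0, 1 (end position excluded)

print(len(by_label), len(by_position))
```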

How do I reorder by column totals?

For example, how do I reorder each column sum and row sum in the following data with summed rows and columns?
import pandas as pd
data=[['fileA',47,15,3,5,7],['fileB',33,13,4,7,2],['fileC',25,17,9,3,5],
['fileD',25,7,1,4,2],['fileE',19,15,3,8,4], ['fileF',11,17,8,4,5]]
df = pd.DataFrame(data, columns=['filename','rows_cnt','cols_cnt','col_A','col_B','col_C'])
print(df)
filename rows_cnt cols_cnt col_A col_B col_C
0 fileA 47 15 3 5 7
1 fileB 33 13 4 7 2
2 fileC 25 17 9 3 5
3 fileD 25 7 1 4 2
4 fileE 19 15 3 8 4
5 fileF 11 17 8 4 5
df.loc[6]= df.sum(0)
filename rows_cnt cols_cnt col_A col_B col_C
0 fileA 47 15 3 5 7
1 fileB 33 13 4 7 2
2 fileC 25 17 9 3 5
3 fileD 25 7 1 4 2
4 fileE 19 15 3 8 4
5 fileF 11 17 8 4 5
6 fileA... 160 84 28 31 25
I made an image of the question.
How do I reorder the red frame in this image by the standard?
df.reindex([2,5,0,4,1,3,6], axis='index')
Is the only way to create the index manually like this?
data=[['fileA',47,15,3,5,7],['fileB',33,13,4,7,2],['fileC',25,17,9,3,5],
['fileD',25,7,1,4,2],['fileE',19,15,3,8,4], ['fileF',11,17,8,4,5]]
df = pd.DataFrame(data, columns=['filename','rows_cnt','cols_cnt','col_A','col_B','col_C'])
df = df.sort_values(by='cols_cnt', axis=0, ascending=False)
df.loc[6]= df.sum(0)
# keep the original index numbers as an 'index' column
df = df.reset_index(drop=False)
# move filename into the index: sorting along axis=1 fails
# when a row mixes str and integer values
df = df.set_index('filename', drop=True)
df = df.sort_values(by=df.index[-1], axis=1, ascending=False)
# restore the original index
df = df.reset_index(drop=False)
df = df.set_index('index', drop=True)
# columns that should stay first
fixed_cols = ['filename', 'rows_cnt','cols_cnt']
# new column order: the three fixed columns first,
# then the remaining columns
new_cols = fixed_cols + (df.columns.drop(fixed_cols).tolist())
df[new_cols]
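A shorter variant is to sort before adding the totals row, so the totals never take part in a sort. This is a sketch under assumptions: rows are ordered by cols_cnt, largest first (which matches the manual reindex in the question), the col_* columns are ordered by their totals, largest first, and the 'total' label is made up for illustration.

```python
import pandas as pd

data = [['fileA', 47, 15, 3, 5, 7], ['fileB', 33, 13, 4, 7, 2],
        ['fileC', 25, 17, 9, 3, 5], ['fileD', 25, 7, 1, 4, 2],
        ['fileE', 19, 15, 3, 8, 4], ['fileF', 11, 17, 8, 4, 5]]
df = pd.DataFrame(data, columns=['filename', 'rows_cnt', 'cols_cnt',
                                 'col_A', 'col_B', 'col_C'])

fixed_cols = ['filename', 'rows_cnt', 'cols_cnt']
value_cols = [c for c in df.columns if c not in fixed_cols]

# Rows: largest cols_cnt first; mergesort is stable, so ties keep their order
df = df.sort_values('cols_cnt', ascending=False, kind='mergesort')

# Columns: order the value columns by their totals, largest first
totals = df[value_cols].sum().sort_values(ascending=False)
df = df[fixed_cols + totals.index.tolist()]

# Append the totals row last, under the next free integer label
sums = df.drop(columns='filename').sum().to_dict()
total_row = pd.DataFrame([{'filename': 'total', **sums}], index=[len(df)])
df = pd.concat([df, total_row])
print(df)
```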

Combining columns in pandas

I have a script whose output is multiple columns placed beneath each other. I would like the columns to be merged together, dropping the duplicates. I've tried merge, combine, concatenate and join, but I can't seem to figure it out. I also tried merging them as a list, but that doesn't seem to help either. Below is my code:
import pandas as pd
data = pd.ExcelFile('path')
newlist = [x for x in data.sheet_names if x.startswith("ZZZ")]
for x in newlist:
    # the keyword is sheet_name in current pandas (sheetname was removed)
    sheets = pd.read_excel(data, sheet_name=x)
    column = sheets.loc[:, 'YYY']
Any help is really appreciated!
Edit
Some more info about the code: data is where an excelfile is loaded. Then at newlist, the sheetnames that start with ZZZ are shown. Then in the for-loop, these sheets are called. At column, the columns named YYY are called. These columns are put beneath each other, but aren't merged yet. For example:
Here is the output of the columns now and I would like them to be one list from 1 to 17.
I hope it is more clear now!
Edit 2.0
Here I tried the concat method that is mentioned below. However, I still get the output as the picture above shows instead of a list from 1 to 17.
my_concat_series = pd.Series()
for x in newlist:
    sheets = pd.read_excel(data, sheet_name=x)
    column = sheets.loc[:, 'YYY']
    my_concat_series = pd.concat([my_concat_series, column]).drop_duplicates()
print(my_concat_series)
I don't see why pandas.concat wouldn't work. Let's try an example corresponding to the data picture you posted:
import numpy as np
import pandas as pd
col1 = pd.Series(np.arange(1,12))
0 1
1 2
2 3
3 4
4 5
5 6
6 7
7 8
8 9
9 10
10 11
dtype: int64
col2 = pd.Series(np.arange(7,18))
0 7
1 8
2 9
3 10
4 11
5 12
6 13
7 14
8 15
9 16
10 17
dtype: int64
And then use pd.concat and drop_duplicates
pd.concat([col1,col2]).drop_duplicates()
0 1
1 2
2 3
3 4
4 5
5 6
6 7
7 8
8 9
9 10
10 11
5 12
6 13
7 14
8 15
9 16
10 17
dtype: int64
You can then reshape your data the way you want them, for instance if you don't want a duplicate index:
pd.concat([col1,col2]).drop_duplicates().reset_index(drop = True)
or if you want the values as a numpy array instead of a pandas series:
pd.concat([col1,col2]).drop_duplicates().values
Note that in the last case you can also use numpy arrays from the beginning, which is faster:
import numpy as np
np.unique(np.concatenate((col1.values,col2.values)))
If you want them as a list:
list(pd.concat([col1,col2]).drop_duplicates())
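Applied to the loop in the question, one option is to collect the columns first and concatenate once at the end. A minimal sketch: the sheets dict below is made-up data standing in for the Excel file, with the names ZZZ_1, ZZZ_2 and other chosen only for illustration.

```python
import pandas as pd

# Stand-in for the workbook: sheet name -> DataFrame (hypothetical data)
sheets = {
    "ZZZ_1": pd.DataFrame({"YYY": range(1, 12)}),
    "ZZZ_2": pd.DataFrame({"YYY": range(7, 18)}),
    "other": pd.DataFrame({"YYY": [99]}),
}

# Collect the YYY column of every matching sheet, then concatenate once
columns = [frame["YYY"] for name, frame in sheets.items()
           if name.startswith("ZZZ")]
merged = (pd.concat(columns, ignore_index=True)
            .drop_duplicates()
            .reset_index(drop=True))

print(merged.tolist())
```

With a real file, each frame would come from pd.read_excel(data, sheet_name=x) inside the loop instead of the dict.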

Python pandas constructing dataframe by looping over columns

I am trying to build a new pandas DataFrame from the data in an existing DataFrame, where each row also depends on the previously calculated row of the new DataFrame.
As an example, here are two dataframes with the same size.
df1 = pd.DataFrame(np.random.randint(0,10, size = (5, 4)), columns=['1', '2', '3', '4'])
df2 = pd.DataFrame(np.zeros(df1.shape), index=df1.index, columns=df1.columns)
Then I created a list that serves as the starting row of my second dataframe df2
L = [2,5,6,7]
df2.loc[0] = L
Then for the remaining rows of df2 I want to take the value from the previous time step (df2) and add the value of df1.
for i in df2.loc[1:]:
df2.ix[i] = df2.ix[i-1] + df1
As an example my dataframes should look like this:
>>> df1
1 2 3 4
0 4 6 0 6
1 7 0 7 9
2 9 1 9 9
3 5 2 3 6
4 0 3 2 9
>>> df2
1 2 3 4
0 2 5 6 7
1 9 5 13 16
2 18 6 22 25
3 23 8 25 31
4 23 11 27 40
I know there is something wrong with the indication of indexes in the for loop but I cannot figure out how the argument must be formulated. I would be very thankful for any help on this.
This is a simple cumsum:
df2 = df1.copy()
df2.loc[0] = [2,5,6,7]
desired_df = df2.cumsum()
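To check this against the example in the question, here is the same approach with df1 fixed to the printed values instead of random numbers:

```python
import pandas as pd

# df1 as printed in the question
df1 = pd.DataFrame(
    [[4, 6, 0, 6], [7, 0, 7, 9], [9, 1, 9, 9], [5, 2, 3, 6], [0, 3, 2, 9]],
    columns=['1', '2', '3', '4'],
)

df2 = df1.copy()
df2.loc[0] = [2, 5, 6, 7]  # the starting values replace the first row
df2 = df2.cumsum()         # each row = previous row of df2 + current row of df1
print(df2)
```

This works because the first row of df1 is never used in the recurrence, so it can be overwritten with the starting values before accumulating.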
