I have a script with an output of multiple columns which are put beneath each other. I would like the columns to be merged together and to drop the duplicates. I've tried merge, combine, concatenate and joining, but I can't seem to figure it out. I also tried to merge as a list, but that doesn't seem to help as well. Below is my code:
import pandas as pd
data = pd.ExcelFile('path')
newlist = [x for x in data.sheet_names if x.startswith("ZZZ")]
for x in newlist:
sheets = pd.read_excel(data, sheetname = x)
column = sheets.loc[:,'YYY']
Any help is really appreciated!
Edit
Some more info about the code: data is where an excelfile is loaded. Then at newlist, the sheetnames that start with ZZZ are shown. Then in the for-loop, these sheets are called. At column, the columns named YYY are called. These columns are put beneath each other, but aren't merged yet. For example:
Here is the output of the columns now and I would like them to be one list from 1 to 17.
I hope it is more clear now!
Edit 2.0
Here I tried the concat method that is mentioned below. However, I still get the output as the picture above shows instead of a list from 1 to 17.
my_concat_series = pd.Series()
for x in newlist:
sheets = pd.read_excel(data, sheetname = x)
column = sheets.loc[:,'YYY']
my_concat_series = pd.concat([my_concat_series,column]).drop_duplicates()
print(my_concat_series)
I don't see how pandas.concat doesn't work, let's try an example corresponding to the data picture you posted:
import pandas as pd
col1 = pd.Series(np.arange(1,12))
0 1
1 2
2 3
3 4
4 5
5 6
6 7
7 8
8 9
9 10
10 11
dtype: int64
col2 = pd.Series(np.arange(7,18))
0 7
1 8
2 9
3 10
4 11
5 12
6 13
7 14
8 15
9 16
10 17
dtype: int64
And then use pd.concat and drop_duplicates
pd.concat([col1,col2]).drop_duplicates()
0 1
1 2
2 3
3 4
4 5
5 6
6 7
7 8
8 9
9 10
10 11
5 12
6 13
7 14
8 15
9 16
10 17
dtype: int64
You can then reshape your data the way you want them, for instance if you don't want a duplicate index:
pd.concat([col1,col2]).drop_duplicates().reset_index(drop = True),
or if you want the values as a numpy array instead of a pandas series:
pd.concat([col1,col2]).drop_duplicates()
Note that in the last case you can also use numpy arrays from the begginning, which is faster:
import numpy as np
np.unique(np.concatenate((col1.values,col2.values)))
If you want them as a list:
list(pd.concat([col1,col2]).drop_duplicates())
Related
My current data frame is comprised of 10 rows and thousands of columns. The setup currently looks similar to this:
A B A B
1 2 3 4
5 6 7 8
But I desire something more like below, where essentially I would transpose the columns into rows once the headers start repeating themselves.
A B
1 2
5 6
3 4
7 8
I've been trying df.reshape but perhaps can't get the syntax right. Any suggestions on how best to transpose the data like this?
I'd probably go for stacking, grouping and then building a new DataFrame from scratch, eg:
pd.DataFrame({col: vals for col, vals in df.stack().groupby(level=1).agg(list).items()})
That'll also give you:
A B
0 1 2
1 3 4
2 5 6
3 7 8
Try with stack, groupby and pivot:
stacked = df.T.stack().to_frame().assign(idx=df.T.stack().groupby(level=0).cumcount()).reset_index()
output = stacked.pivot("idx", "level_0", 0).rename_axis(None, axis=1).rename_axis(None, axis=0)
>>> output
A B
0 1 2
1 5 6
2 3 4
3 7 8
I want to stack two columns on top of each other
So I have Left and Right values in one column each, and want to combine them into a single one. How do I do this in Python?
I'm working with Pandas Dataframes.
Basically from this
Left Right
0 20 25
1 15 18
2 10 35
3 0 5
To this:
New Name
0 20
1 15
2 10
3 0
4 25
5 18
6 35
7 5
It doesn't matter how they are combined as I will plot it anyway, and the new column name also doesn't matter because I can rename it.
You can create a list of the cols, and call squeeze to anonymise the data so it doesn't try to align on columns, and then call concat on this list, passing ignore_index=True creates a new index, otherwise you'll get the names as index values repeated:
cols = [df[col].squeeze() for col in df]
pd.concat(cols, ignore_index=True)
Many options, stack, melt, concat, ...
Here's one:
>>> df.melt(value_name='New Name').drop('variable', 1)
New Name
0 20
1 15
2 10
3 0
4 25
5 18
6 35
7 5
You can also use np.ravel:
import numpy as np
out = pd.DataFrame(np.ravel(df.values.T), columns=['New name'])
print(out)
# Output
New name
0 20
1 15
2 10
3 0
4 25
5 18
6 35
7 5
Update
If you have only 2 cols:
out = pd.concat([df['Left'], df['Right']], ignore_index=True).to_frame('New name')
print(out)
# Output
New name
0 20
1 15
2 10
3 0
4 25
5 18
6 35
7 5
Solution with unstack
df2 = df.unstack()
# recreate index
df2.index = np.arange(len(df2))
A solution with masking.
# Your data
import numpy as np
import pandas as pd
df = pd.DataFrame({"Left":[20,15,10,0], "Right":[25,18,35,5]})
# Masking columns to ravel
df2 = pd.DataFrame({"New Name":np.ravel(df[["Left","Right"]])})
df2
New Name
0 20
1 25
2 15
3 18
4 10
5 35
6 0
7 5
I ended up using this solution, seems to work fine
df1 = dfTest[['Left']].copy()
df2 = dfTest[['Right']].copy()
df2.columns=['Left']
df3 = pd.concat([df1, df2],ignore_index=True)
How I can retrieve column names from a call to DataFrame apply without knowing them in advance?
What I'm trying to do is apply a mapping from column names to functions to arbitrary DataFrames. Those functions might return multiple columns. I would like to end up with a DataFrame that contains the original columns as well as the new ones, the amount and names of which I don't know at build-time.
Other solutions here are Series-based. I'd like to do the whole frame at once, if possible.
What am I missing here? Are the columns coming back from apply lost in destructuring unless I know their names? It looks like assign might be useful, but will likely require a lot of boilerplate.
import pandas as pd
def fxn(col):
return pd.Series(col * 2, name=col.name+'2')
df = pd.DataFrame({'A': range(0, 10), 'B': range(10, 0, -1)})
print(df)
# [Edit:]
# A B
# 0 0 10
# 1 1 9
# 2 2 8
# 3 3 7
# 4 4 6
# 5 5 5
# 6 6 4
# 7 7 3
# 8 8 2
# 9 9 1
df = df.apply(fxn)
print(df)
# [Edit:]
# Observed: columns changed in-place.
# A B
# 0 0 20
# 1 2 18
# 2 4 16
# 3 6 14
# 4 8 12
# 5 10 10
# 6 12 8
# 7 14 6
# 8 16 4
# 9 18 2
df[['A2', 'B2']] = df.apply(fxn)
print(df)
# [Edit: I am doubling column values, so missing something, but the question about the column counts stands.]
# Expected: new columns added. How can I do this at runtime without knowing column names?
# A B A2 B2
# 0 0 40 0 80
# 1 4 36 8 72
# 2 8 32 16 64
# 3 12 28 24 56
# 4 16 24 32 48
# 5 20 20 40 40
# 6 24 16 48 32
# 7 28 12 56 24
# 8 32 8 64 16
# 9 36 4 72 8
You need to concat the result of your function with the original df.
Use pd.concat:
In [8]: x = df.apply(fxn) # Apply function on df and store result separately
In [10]: df = pd.concat([df, x], axis=1) # Concat with original df to get all columns
Rename duplicate column names by adding suffixes:
In [82]: from collections import Counter
In [38]: mylist = df.columns.tolist()
In [41]: d = {a:list(range(1, b+1)) if b>1 else '' for a,b in Counter(mylist).items()}
In [62]: df.columns = [i+str(d[i].pop(0)) if len(d[i]) else i for i in mylist]
In [63]: df
Out[63]:
A1 B1 A2 B2
0 0 10 0 20
1 1 9 2 18
2 2 8 4 16
3 3 7 6 14
4 4 6 8 12
5 5 5 10 10
6 6 4 12 8
7 7 3 14 6
8 8 2 16 4
9 9 1 18 2
You can assign directly with:
df[df.columns + '2'] = df.apply(fxn)
Outut:
A B A2 B2
0 0 10 0 20
1 1 9 2 18
2 2 8 4 16
3 3 7 6 14
4 4 6 8 12
5 5 5 10 10
6 6 4 12 8
7 7 3 14 6
8 8 2 16 4
9 9 1 18 2
Alternatively, you can leverage the #MayankPorwal answer by using .add_suffix('2') to the output from your apply function:
pd.concat([df, df.apply(fxn).add_suffix('2')], axis=1)
which will return the same output.
In your function, name=col.name+'2' is doing nothing (it's basically returning just col * 2). That's because apply returns the values back to the original column.
Anyways, it's possible to take the MayankPorwal approach: pd.concat + managing duplicated columns (make them unique). Another possible way to do that:
# Use pd.concat as mentioned in the first answer from Mayank Porwal
df = pd.concat([df, df.apply(fxn)], axis=1)
# Rename duplicated columns
suffix = (pd.Series(df.columns).groupby(df.columns).cumcount()+1).astype(str)
df.columns = df.columns + suffix.rename('1', '')
which returns the same output, and additionally manage further duplicated columns.
Answer on the behalf of OP:
This code does what I wanted:
import pandas as pd
# Simulated business logic: for an input row, return a number of columns
# related to the input, and generate names for them, such that we don't
# know the shape of the output or the names of its columns before the call.
def fxn(row):
length = row[0]
indicies = [row.index[0] + str(i) for i in range(0, length)]
series = pd.Series([i for i in range(0, length)], index=indicies)
return series
# Sample data: 0 to 18, inclusive, counting by 2.
df1 = pd.DataFrame(list(range(0, 20, 2)), columns=['A'])
# Randomize the rows to simulate different input shapes.
df1 = df1.sample(frac=1)
# Apply fxn to rows to get new columns (with expand). Concat to keep inputs.
df1 = pd.concat([df1, df1.apply(fxn, axis=1, result_type='expand')], axis=1)
print(df1)
I'm having problems with pd.rolling() method that returns several outputs even though the function returns a single value.
My objective is to:
Calculate the absolute percentage difference between two DataFrames with 3 columns in each df.
Sum all values
I can do this using pd.iterrows(). But working with larger datasets makes this method ineffective.
This is the test data im working with:
#import libraries
import pandas as pd
import numpy as np
#create two dataframes
values = {'column1': [7,2,3,1,3,2,5,3,2,4,6,8,1,3,7,3,7,2,6,3,8],
'column2': [1,5,2,4,1,5,5,3,1,5,3,5,8,1,6,4,2,3,9,1,4],
"column3" : [3,6,3,9,7,1,2,3,7,5,4,1,4,2,9,6,5,1,4,1,3]
}
df1 = pd.DataFrame(values)
df2 = pd.DataFrame([[2,3,4],[3,4,1],[3,6,1]])
print(df1)
print(df2)
column1 column2 column3
0 7 1 3
1 2 5 6
2 3 2 3
3 1 4 9
4 3 1 7
5 2 5 1
6 5 5 2
7 3 3 3
8 2 1 7
9 4 5 5
10 6 3 4
11 8 5 1
12 1 8 4
13 3 1 2
14 7 6 9
15 3 4 6
16 7 2 5
17 2 3 1
18 6 9 4
19 3 1 1
20 8 4 3
0 1 2
0 2 3 4
1 3 4 1
2 3 6 1
This method produces the output I want by using pd.iterrows()
RunningSum = []
for index, rows in df1.iterrows():
if index > 3:
Div = abs((((df2 / df1.iloc[index-3+1:index+1].reset_index(drop="True").values)-1)*100))
Average = Div.sum(axis=0)
SumOfAverages = np.sum(Average)
RunningSum.append(SumOfAverages)
#printing my desired output values
print(RunningSum)
[991.2698412698413,
636.2698412698412,
456.19047619047626,
616.6666666666667,
935.7142857142858,
627.3809523809524,
592.8571428571429,
350.8333333333333,
449.1666666666667,
1290.0,
658.531746031746,
646.031746031746,
597.4603174603175,
478.80952380952385,
383.0952380952381,
980.5555555555555,
612.5]
Finally, below is my attemt to use pd.rolling() so that I dont need to loop through each row.
def SumOfAverageFunction(vals):
Div = abs((((df2.values / vals.reset_index(drop="True").values)-1)*100))
Average = Div.sum()
SumOfAverages = np.sum(Average)
return SumOfAverages
RunningSums = df1.rolling(window=3,axis=0).apply(SumOfAverageFunction)
Here is my problem because printing RunningSums from above outputs several values and is not close to the results I'm getting using iterrows method. How do I solve this?
print(RunningSums)
column1 column2 column3
0 NaN NaN NaN
1 NaN NaN NaN
2 702.380952 780.000000 283.333333
3 533.333333 640.000000 533.333333
4 1200.000000 475.000000 403.174603
5 833.333333 1280.000000 625.396825
6 563.333333 760.000000 1385.714286
7 346.666667 386.666667 1016.666667
8 473.333333 573.333333 447.619048
9 533.333333 1213.333333 327.619048
10 375.000000 746.666667 415.714286
11 408.333333 453.333333 515.000000
12 604.166667 338.333333 1250.000000
13 1366.666667 577.500000 775.000000
14 847.619048 1400.000000 683.333333
15 314.285714 733.333333 455.555556
16 533.333333 441.666667 474.444444
17 347.619048 616.666667 546.666667
18 735.714286 466.666667 1290.000000
19 350.000000 488.888889 875.000000
20 525.000000 1361.111111 1266.666667
It's just the way rolling behaves, it's going to window around all of the columns and I don't know that there is a way around it. One solution is to apply rolling to a single column, and use the indexes from those windows to slice the dataframe inside your function. Still expensive, but probably not as bad as what you're doing.
Also the output of your first method looks wrong. You're actually starting your calculations a few rows too late.
import numpy as np
def SumOfAverageFunction(vals):
return (abs(np.divide(df2.values, df1.loc[vals.index].values)-1)*100).sum()
vals = df1.column1.rolling(3)
vals.apply(SumOfAverageFunction, raw=False)
For a DataFrame in Pandas, how can I select both the first 5 values and last 5 values?
For example
In [11]: df
Out[11]:
A B C
2012-11-29 0 0 0
2012-11-30 1 1 1
2012-12-01 2 2 2
2012-12-02 3 3 3
2012-12-03 4 4 4
2012-12-04 5 5 5
2012-12-05 6 6 6
2012-12-06 7 7 7
2012-12-07 8 8 8
2012-12-08 9 9 9
How to show the first two and the last two rows?
You can use iloc with numpy.r_:
print (np.r_[0:2, -2:0])
[ 0 1 -2 -1]
df = df.iloc[np.r_[0:2, -2:0]]
print (df)
A B C
2012-11-29 0 0 0
2012-11-30 1 1 1
2012-12-07 8 8 8
2012-12-08 9 9 9
df = df.iloc[np.r_[0:4, -4:0]]
print (df)
A B C
2012-11-29 0 0 0
2012-11-30 1 1 1
2012-12-01 2 2 2
2012-12-02 3 3 3
2012-12-05 6 6 6
2012-12-06 7 7 7
2012-12-07 8 8 8
2012-12-08 9 9 9
You can use df.head(5) and df.tail(5) to get first five and last five.
Optionally you can create new data frame and append() head and tail:
new_df = df.tail(5)
new_df = new_df.append(df.head(5))
Not quite the same question but if you just want to show the top / bottom 5 rows (eg with display in jupyter or regular print, there's potentially a simpler way than this if you use the pd.option_context context.
#make 100 3d random numbers
df = pd.DataFrame(np.random.randn(100,3))
# sort them by their axis sum
df = df.loc[df.sum(axis=1).index]
with pd.option_context('display.max_rows',10):
print(df)
Outputs:
0 1 2
0 -0.649105 -0.413335 0.374872
1 3.390490 0.552708 -1.723864
2 -0.781308 -0.277342 -0.903127
3 0.433665 -1.125215 -0.290228
4 -2.028750 -0.083870 -0.094274
.. ... ... ...
95 0.443618 -1.473138 1.132161
96 -1.370215 -0.196425 -0.528401
97 1.062717 -0.997204 -1.666953
98 1.303512 0.699318 -0.863577
99 -0.109340 -1.330882 -1.455040
[100 rows x 3 columns]
Small simple function:
def ends(df, x=5):
return df.head(x).append(df.tail(x))
And use like so:
df = pd.DataFrame(np.random.rand(15,6))
ends(df,2)
I actually use this so much, I think it would be a great feature to add to pandas. (No features are to be added to pandas.DataFrame core API) I add it after import like so:
import pandas as pd
def ends(df, x=5):
return df.head(x).append(df.tail(x))
setattr(pd.DataFrame,'ends',ends)
Use like so:
import numpy as np
df = pd.DataFrame(np.random.rand(15,6))
df.ends(2)
You should use both head() and tail() for this purpose. I think the easiest way to do this is:
df.head(5).append(df.tail(5))
In Jupyter, expanding on #bolster's answer, we'll create a reusable convenience function:
def display_n(df,n):
with pd.option_context('display.max_rows',n*2):
display(df)
Then
display_n(df,2)
Returns
0 1 2
0 0.167961 -0.732745 0.952637
1 -0.050742 -0.421239 0.444715
... ... ... ...
98 0.085264 0.982093 -0.509356
99 -0.758963 -0.578267 -0.115865
(except as a nicely formatted HTML table)
when df is df = pd.DataFrame(np.random.randn(100,3))
Notes:
Of course you could make the same thing print as text by modifying display to print above.
On unix-like systems, you can the autoload the above function in all notebooks by placing it in a py or ipy file in ~/.ipython/profile_default/startup as described here.
If you want to keep it to just Pandas, you can use apply() to concatenate the head and tail:
import pandas as pd
from string import ascii_lowercase, ascii_uppercase
df = pd.DataFrame(
{"upper": list(ascii_uppercase), "lower": list(ascii_lowercase)}, index=range(1, 27)
)
df.apply(lambda x: pd.concat([x.head(2), x.tail(2)]))
upper lower
1 A a
2 B b
25 Y y
26 Z z
Associated with Linas Fx.
Defining below
pd.DataFrame.less = lambda df, n=10: df.head(n//2).append(df.tail(n//2))
then you can type only df.less()
It's same as type df.head().append(df.tail())
If you type df.less(2), the result is same as df.head(1).append(df.tail(1))
Combining #ic_fl2 and #watsonic to give the below in Jupyter:
def ends_attr():
def display_n(df,n):
with pd.option_context('display.max_rows',n*2):
display(df)
# set pd.DataFrame attribute where .ends runs display_n() function
setattr(pd.DataFrame,'ends',display_n)
ends_attr()
View first and last 3 rows of your df:
your_df.ends(3)
I like this because I can copy a single function and know I have everything I need to use the ends attribute.