Using itertools.combinations with columns - python

I have a DataFrame df with 3 columns: A, B and C
A B C
2 4 4
5 2 5
6 9 5
My goal is to use itertools.combinations to find all non-repeating column pairs and to put the first column of each pair in one DataFrame and the second in the other. All the pairs here would be A:B, A:C, B:C.
So the first DataFrame df1 would have the first column of each of those pairs:
df1:
A A B
2 2 4
5 5 2
6 6 9
and the second df2:
B C C
4 4 4
2 5 5
9 5 5
I'm trying to do something with itertools like:
for cola, colb in itertools.combinations(df, 2):
    df1[cola] = cola
    df2[colb] = colb
I know that makes no sense, but I could change each column to a list, run itertools over a list of lists, append the first and second of each pair to lists A and B, and then turn those lists back into DataFrames, except then I'm missing the headers. I tried adding the headers to the lists, but when I try to remake the DataFrame the indexing seems off and I can't fix it. So I'm just trying to see if there is a way to run itertools over entire columns, headers included.

Use the zip function to group the columns for each DataFrame separately, then use pandas.concat to construct your new DataFrames:
import pandas as pd
from itertools import combinations

df1_cols, df2_cols = zip(*combinations(df.columns, 2))
df1 = pd.concat([df[col] for col in df1_cols], axis=1)
df2 = pd.concat([df[col] for col in df2_cols], axis=1)
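For the three columns above, combinations(df.columns, 2) yields the pairs ('A', 'B'), ('A', 'C') and ('B', 'C'), and the zip splits them into the two column groups. A minimal sketch of the intermediate values, assuming the sample data from the question:
from itertools import combinations
import pandas as pd

df = pd.DataFrame({'A': [2, 5, 6], 'B': [4, 2, 9], 'C': [4, 5, 5]})

# each pair contributes its first column to df1 and its second column to df2
df1_cols, df2_cols = zip(*combinations(df.columns, 2))
print(df1_cols)  # ('A', 'A', 'B')
print(df2_cols)  # ('B', 'C', 'C')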


How to sort DataFrame columns based on 2 indexes?

I have a data frame like this
df:
Index C-1 C-2 C-3 C-4 ........
Ind-1 3 9 5 4
Ind-2 5 2 8 3
Ind-3 0 1 1 0
.
.
The data frame has more than a hundred columns and rows, with whole numbers (0-60) as values.
The first two rows (indexes) have values in the range 2-12.
I want to sort the columns based on the values in the first and second rows (indexes), in ascending order. I don't care about the sort order within the remaining rows.
Can anyone help me with this?
pandas.DataFrame.sort_values
In the first argument (by) you pass the rows you need to sort on, and then axis=1 to sort the columns.
Mind that the order of the list matters: if you want to sort first by the second row and then by the first, pass ['Ind-2', 'Ind-1'], which will give a different result.
df.sort_values(by=['Ind-1', 'Ind-2'], axis=1)
Output
C-1 C-4 C-3 C-2
Index
Ind-1 3 4 5 9
Ind-2 5 3 8 2
Ind-3 0 0 1 1
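As a self-contained sketch (the sample values are reconstructed from the question, so treat them as an assumption):
import pandas as pd

df = pd.DataFrame(
    {'C-1': [3, 5, 0], 'C-2': [9, 2, 1], 'C-3': [5, 8, 1], 'C-4': [4, 3, 0]},
    index=['Ind-1', 'Ind-2', 'Ind-3'],
)
df.index.name = 'Index'

# axis=1 sorts the columns; by= names the rows whose values drive the ordering
sorted_df = df.sort_values(by=['Ind-1', 'Ind-2'], axis=1)
print(sorted_df.columns.tolist())  # ['C-1', 'C-4', 'C-3', 'C-2']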

Why do we need to add : when defining a new column using .iloc

When we make a new column in a dataset in pandas
df["Max"] = df.iloc[:, 5:7].sum(axis=1)
If we are only getting the columns from index 5 to index 7, why do we need to pass : as well?
pandas.DataFrame.iloc is used purely for integer-location based indexing, i.e. selection by position (see the documentation). The : means all rows in the selected columns, here column indices 5 and 6 (iloc is not inclusive of the last index).
You are using .iloc to take a slice out of the DataFrame and apply an aggregate function across the columns of the slice.
Consider an example:
df = pd.DataFrame({"a":[0,1,2],"b":[2,3,4],"c":[4,5,6]})
df
would produce the following dataframe
a b c
0 0 2 4
1 1 3 5
2 2 4 6
You are using iloc to avoid dealing with named columns, so that
df.iloc[:,1:3]
would look as follows
b c
0 2 4
1 3 5
2 4 6
Now a slight modification of your code would get you a new column containing sums across columns
df.iloc[:,1:3].sum(axis=1)
0 6
1 8
2 10
Alternatively you could use function application:
df.apply(lambda x: x.iloc[1:3].sum(), axis=1)
0 6
1 8
2 10
Thus you explicitly tell pandas to apply sum across the columns of each row. However, your syntax is more succinct and preferable to explicit function application; the result is the same either way.
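To tie this back to the original question, the same slice-and-sum can be assigned directly as a new column (a sketch on the small example frame above; the column name "Max" is taken from the question, and positions 1:3 are used instead of 5:7 because the toy frame only has three columns):
# assign the row-wise sum of the columns at positions 1 and 2 as a new column
df["Max"] = df.iloc[:, 1:3].sum(axis=1)
df
#    a  b  c  Max
# 0  0  2  4    6
# 1  1  3  5    8
# 2  2  4  6   10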

I want to pick out one column of the DataFrame but the result is automatically ordered by values

I just need one column of my DataFrame, but in the original order. When I pull it out, it is sorted by the values and I can't understand why. I've tried different ways to pick out one column, but every time it ends up sorted by the values.
this is my code:
import pandas
data = pandas.read_csv('/data.csv', sep=';')
longti = data.iloc[:,4]
To return the first column, your code should work as written.
import pandas as pd
df = pd.DataFrame(dict(A=[1,2,3,4,5,6], B=['A','B','C','D','E','F']))
df.iloc[:, 0]
Out:
0 1
1 2
2 3
3 4
4 5
5 6
If you want to return the second column, you can use the following:
df.iloc[:, 1]
Out:
0 A
1 B
2 C
3 D
4 E
5 F
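Selecting by column label behaves the same way and also preserves the original row order; a small sketch using the same example frame:
df = pd.DataFrame(dict(A=[1, 2, 3, 4, 5, 6], B=['A', 'B', 'C', 'D', 'E', 'F']))

# label-based selection, equivalent to df.iloc[:, 1] here
df['B']
# 0    A
# 1    B
# 2    C
# 3    D
# 4    E
# 5    F
# Name: B, dtype: object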

Drop a column which is a subset of any other column in a dataframe

I have a pandas dataframe as below. How can I drop any column which is a subset of any of the remaining columns? I would like to do this without using fillna.
import numpy as np
import pandas as pd
df = pd.DataFrame([[1, 1, 3, 3], [np.nan, 2, np.nan, 4]], columns=['A', 'B', 'C', 'D'])
df
A B C D
0 1.0 1 3.0 3
1 NaN 2 NaN 4
I can identify here that column A is a subset of B and column C is a subset of D with something like this:
if all(df['A'][df['A'].notnull()].isin(df['B']))
I could run a loop over all columns and drop the subset columns. But is there a more efficient way to accomplish this, so that I have the following result:
df
B D
0 1 3
1 2 4
Thanks.
It still requires iteration, but you can use this list comprehension (with an if statement similar to the one you provided) to get columns to keep:
keep_cols = [x for x in df if not any(df.drop(x, axis=1).apply(lambda y: df[x].dropna().isin(y).all()))]
# ['B', 'D']
And then use the result with filter:
df.filter(items=keep_cols)
# B D
# 0 1 3
# 1 2 4
This should be fast enough, since it still uses apply at its core, and seems to be safer/more efficient than dropping columns within a loop.
If you're keen on a one-line solution, of course assigning the list to a variable is an optional step:
df.filter(items=[x for x in df if not any(df.drop(x, axis=1).apply(lambda y: df[x].dropna().isin(y).all()))])
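Put together as a runnable sketch on the question's own frame (assuming the usual pandas and numpy imports):
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 1, 3, 3], [np.nan, 2, np.nan, 4]], columns=['A', 'B', 'C', 'D'])

# keep a column only if its non-NaN values are not fully contained in any other column
keep_cols = [x for x in df
             if not any(df.drop(x, axis=1).apply(lambda y: df[x].dropna().isin(y).all()))]
df.filter(items=keep_cols)
#    B  D
# 0  1  3
# 1  2  4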

Sorting DateTimeSeries duplicates in DataFrame

Alright so I hope we don't need an example for this, but let's say we have a DataFrame with 100k rows and 50+ instances of the Index being the exact same DateTime.
What would be the fastest way to sort my DataFrame by Time, but then if there is a tie choose a second column to sort by.
So:
Sort By Time
If Duplicate Time, sort by 'Cost'
If you pass a list of the columns in the order you want them sorted by, it will sort by the first column and then by the second column:
df = DataFrame({'a':[1,2,1,1,3], 'b':[1,2,2,2,1]})
df
Out[11]:
a b
0 1 1
1 2 2
2 1 2
3 1 2
4 3 1
In [13]:
df.sort(columns=['a','b'], inplace=True)
df
Out[13]:
a b
0 1 1
2 1 2
3 1 2
1 2 2
4 3 1
So for your example
df.sort(columns=['Time', 'Cost'],inplace=True)
would work
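Note that DataFrame.sort was deprecated and removed in later pandas versions; in current pandas the equivalent call is sort_values:
# modern pandas equivalent of the df.sort call above
df.sort_values(by=['Time', 'Cost'], inplace=True)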
EDIT
It has been pointed out (by @AndyHayden) that there is a bug if you have nested NaN in supplementary columns (see this SO question and the related GitHub issue); this may not be an issue in your case, but it is something to be aware of.
