Pandas Multiindex get values from first entry of index - python

I have the following multiindex dataframe:
from io import StringIO
import pandas as pd
datastring = StringIO("""File,no,runtime,value1,value2
A,0, 0,12,34
A,0, 1,13,34
A,0, 2,23,34
A,1, 6,23,38
A,1, 7,22,38
B,0,17,15,35
B,0,18,17,35
C,0,34,23,32
C,0,35,21,32
""")
df = pd.read_csv(datastring, sep=',')
df.set_index(['File','no',df.index], inplace=True)
>>> df
              runtime  value1  value2
File no
A    0  0           0      12      34
        1           1      13      34
        2           2      23      34
     1  3           6      23      38
        4           7      22      38
B    0  5          17      15      35
        6          18      17      35
C    0  7          34      23      32
        8          35      21      32
What I would like to get is just the first value2 of every (File, no) combination, i.e. the first entry for each new file and number:
A 0 34
A 1 38
B 0 35
C 0 32
The most similar questions I could find were these:
Resample pandas dataframe only knowing result measurement count
MultiIndex-based indexing in pandas
Select rows in pandas MultiIndex DataFrame
but I was unable to construct a solution from them. The best I got was slicing with pd.IndexSlice, but since the rows are technically all still there (the repeated index labels are just not displayed), something like
idx = pd.IndexSlice
df.loc[idx[:,0],:]
can, for example, filter for no == 0, but it still returns every row of those groups rather than just the first one.
Is a MultiIndex even the right tool for the task at hand? How can I solve this?

Use GroupBy.first on the first and second levels of the MultiIndex:
s = df.groupby(level=[0,1])['value2'].first()
print (s)
File  no
A     0     34
      1     38
B     0     35
C     0     32
Name: value2, dtype: int64
If you need a one-column DataFrame, use a one-element list:
df1 = df.groupby(level=[0,1])[['value2']].first()
print (df1)
         value2
File no
A    0       34
     1       38
B    0       35
C    0       32
Another idea is to remove the third level with DataFrame.reset_index and keep only the first row of each (File, no) pair with Index.duplicated and boolean indexing:
df2 = df.reset_index(level=2, drop=True)
s = df2.loc[~df2.index.duplicated(), 'value2']
print (s)
File  no
A     0     34
      1     38
B     0     35
C     0     32
Name: value2, dtype: int64

For the sake of completeness, I would like to add another method (which I would not have found without the answer by jezrael).
s = df.groupby(level=[0,1])['value2'].nth(0)
This can be generalized to finding any entry, not merely the first:
t = df.groupby(level=[0,1])['value1'].nth(1)
Note that the selection was changed from value2 to value1, because for the former the results of nth(0) and nth(1) would have been identical.
Pandas documentation link: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.core.groupby.GroupBy.nth.html
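As a quick sanity check, these are the values the nth(1) call yields for the sample frame above (a sketch; note the index of the result differs across pandas versions, since in recent versions nth behaves as a filter keeping the original row index rather than reducing to the group keys):
t = df.groupby(level=[0, 1])['value1'].nth(1)
print(t.tolist())
# [13, 22, 17, 21]  <- second row of each (File, no) group: A/0, A/1, B/0, C/0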

Related

Compare two columns in the same dataframe and find which row of the first column matches which row of the 2nd column

I've been trying to figure out how to compare two columns that share some values between them, but at different rows.
For example:
col_index  col_1  col_2
        1     12     34
        2     16     42
        3     58     35
        4     99     60
        5      2     12
       12     35     99
In the above example, col_1 and col_2 match on several occasions: e.g. values '12' and '99'.
I need to be able to find the rows where these match so that I can get the corresponding col_index values.
What would be the best way to do that?
IIUC only row 2 should be removed from col_index.
You can use np.intersect1d to find the common values between the two columns and then check if these values are in your columns using isin:
import numpy as np
common_values = np.intersect1d(df.col_1,df.col_2)
res = df[(df.col_1.isin(common_values))|(df.col_2.isin(common_values))]
res
   col_index  col_1  col_2
0          1     12     34   # 12
2          3     58     35   # 35
3          4     99     60   # 99
4          5      2     12   # 12
5         12     35     99   # 99
res[['col_index']]
   col_index
0          1
2          3
3          4
4          5
5         12
You could use the isin method to get a mask, and then use it to filter the matches. Finally, you take the col_index column and that's all. So, using your dataframe:
mask = df.col_1.isin(df.col_2)
print(df[mask].col_index.to_list())  # to_list only converts the Series to a plain Python list
Result: [1, 4, 12]
Simply loop over the values that are present in both columns, using the Series.isin method:
import pandas as pd

# test data:
a = 12,16,58,99
b = 34,99,35,12
c = 1,2,3,5
d = pd.DataFrame({"col_1":a, "col_2":b, 'col_idx':c})
# col_1 col_2 col_idx
#0 12 34 1
#1 16 99 2
#2 58 35 3
#3 99 12 5
for _,row in d.loc[d.col_1.isin(d.col_2)].iterrows():
val = row.col_1
idx1 = row.col_idx
print(val, idx1, d.query("col_2==%d" % val).col_idx.values)
#12 1 [5]
#99 5 [2]
If your values are strings (instead of integers as in this example), change the query argument accordingly: query("col_2=='%s'" % val).
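As an aside, query can also reference local variables directly with the @ prefix, which sidesteps the string formatting entirely; a sketch of the same loop:
for _, row in d.loc[d.col_1.isin(d.col_2)].iterrows():
    val = row.col_1
    # @val refers to the local Python variable; works for strings and ints alike
    print(val, row.col_idx, d.query("col_2 == @val").col_idx.values)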

How to stack two columns of a pandas dataframe in python

I want to stack two columns on top of each other
So I have Left and Right values in one column each, and want to combine them into a single one. How do I do this in Python?
I'm working with Pandas Dataframes.
Basically from this
   Left  Right
0    20     25
1    15     18
2    10     35
3     0      5
To this:
   New Name
0        20
1        15
2        10
3         0
4        25
5        18
6        35
7         5
It doesn't matter how they are combined as I will plot it anyway, and the new column name also doesn't matter because I can rename it.
You can create a list of the columns, calling squeeze on each to anonymise the data so that concat doesn't try to align on columns, and then call concat on this list, passing ignore_index=True to create a new index (otherwise you'll get the index values repeated):
cols = [df[col].squeeze() for col in df]
pd.concat(cols, ignore_index=True)
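For reference, a self-contained version on the sample frame (a sketch; the frame is rebuilt from the question's data):
import pandas as pd

df = pd.DataFrame({"Left": [20, 15, 10, 0], "Right": [25, 18, 35, 5]})
# squeeze each column selection down to a plain Series, then concatenate
cols = [df[col].squeeze() for col in df]
out = pd.concat(cols, ignore_index=True)
print(out.tolist())  # [20, 15, 10, 0, 25, 18, 35, 5]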
Many options, stack, melt, concat, ...
Here's one:
>>> df.melt(value_name='New Name').drop(columns='variable')
   New Name
0        20
1        15
2        10
3         0
4        25
5        18
6        35
7         5
You can also use np.ravel:
import numpy as np
out = pd.DataFrame(np.ravel(df.values.T), columns=['New name'])
print(out)
# Output
   New name
0        20
1        15
2        10
3         0
4        25
5        18
6        35
7         5
Update
If you have only 2 cols:
out = pd.concat([df['Left'], df['Right']], ignore_index=True).to_frame('New name')
print(out)
# Output
   New name
0        20
1        15
2        10
3         0
4        25
5        18
6        35
7         5
Solution with unstack:
import numpy as np

df2 = df.unstack()
# unstack returns a Series with a MultiIndex; recreate a flat index
df2.index = np.arange(len(df2))
A solution that selects the columns and ravels them.
# Your data
import numpy as np
import pandas as pd
df = pd.DataFrame({"Left":[20,15,10,0], "Right":[25,18,35,5]})
# Select the columns and ravel them (row-major order, so the values interleave)
df2 = pd.DataFrame({"New Name":np.ravel(df[["Left","Right"]])})
df2
   New Name
0        20
1        25
2        15
3        18
4        10
5        35
6         0
7         5
I ended up using this solution; it seems to work fine:
df1 = dfTest[['Left']].copy()
df2 = dfTest[['Right']].copy()
df2.columns=['Left']
df3 = pd.concat([df1, df2],ignore_index=True)

Selecting all column names where value is greater than 0

I have an undefined number of columns that have some values. For the example, let's say there are 4 columns: [a,b,c,d], and there is a value associated with each column name, like this:
 a   b   c   d
 0  23  11   0
11  43  33  22
12   0  12   0
I want to write another column right next to d that lists the names of the columns whose value is greater than 0, like this:
 a   b   c   d   e
 0  23  11   0   b,c
11  43  33  22   a,b,c,d
12   0  12   0   a,c
my attempt:
dic2 = {'a':[12,0,23],'b':[21,23,0],'c':[0,22,33],'d':[0,22,0]}
df = pd.DataFrame(dic2)
df[df>0]
This returns NaN wherever there is a zero, but I don't know how to get the names of the columns that hold the non-NaN values.
You can compare the values against 0 to get a boolean DataFrame and then use DataFrame.dot for matrix multiplication with the column names; last, remove the trailing separator by indexing with str:
df['e'] = df.gt(0).dot(df.columns + ',').str[:-1]
print (df)
    a   b   c   d      e
0  12  21   0   0    a,b
1   0  23  22  22  b,c,d
2  23   0  33   0    a,c
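To see how the one-liner works, here is the same computation broken into steps (a sketch, using the frame built from dic2 above):
mask = df.gt(0)                      # boolean DataFrame: True where value > 0
joined = mask.dot(df.columns + ',')  # row-wise concatenation of matching names, e.g. 'a,b,'
df['e'] = joined.str[:-1]            # strip the trailing comma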
You can create a new column and use the max function on all other columns:
df['D'] = df.max(axis=1)
This checks every column. If you want the max only from specific columns, specify them like this:
df['D'] = df[column].max(axis=1)
or with a list of columns:
df['D'] = df[[column1, column2]].max(axis=1)
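For reference, a runnable sketch of this on the question's sample data (note it yields the row-wise maximum value, not the comma-joined column names produced by the approach above):
import pandas as pd

df = pd.DataFrame({'a': [12, 0, 23], 'b': [21, 23, 0],
                   'c': [0, 22, 33], 'd': [0, 22, 0]})
df['D'] = df.max(axis=1)   # row-wise maximum across all columns
print(df['D'].tolist())    # [21, 23, 33]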

DataFrame transform column values to new columns

I have the following Series:
project  id         type
First    130403725  PRODUCT    68
                    EMPTY       2
Six      130405706  PRODUCT    24
         132517244  PRODUCT    33
         132607436  PRODUCT    87
How can I transform the type values into new columns:
                    PRODUCT  EMPTY
project  id
First    130403725       68      2
Six      130405706       24      0
         132517244       33      0
         132607436       87      0
This is a classic pivot table:
df_pivoted = df.pivot(index=["project", "id"], columns=["type"], values=[3])
I've used 3 as the index of the value column, but it would be clearer if you had named it.
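A sketch of how the pivot route can be made concrete, assuming s is the MultiIndex Series from the question: give the value column a name first so pivot can refer to it (list-valued index arguments require pandas 1.1+; 'count' is a hypothetical name).
df_pivoted = (s.rename('count')   # name the unnamed values
                .reset_index()
                .pivot(index=['project', 'id'], columns='type', values='count')
                .fillna(0)
                .astype(int))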
Use unstack, because this is a MultiIndex Series:
s1 = s.unstack(fill_value=0)
print (s1)
type                 EMPTY  PRODUCT
project  id
First    130403725       2       68
Six      130405706       0       24
         132517244       0       33
         132607436       0       87
For DataFrame:
df = s.unstack(fill_value=0).reset_index().rename_axis(None, axis=1)
print (df)
  project         id  EMPTY  PRODUCT
0   First  130403725      2       68
1     Six  130405706      0       24
2     Six  132517244      0       33
3     Six  132607436      0       87

Pandas individual item using index and column

I have a csv file test.csv. I am trying to use pandas to select items depending on whether the second value is above a certain value. E.g.
index   A    B
0      44    1
1      45    2
2      46   57
3      47  598
4      48    5
So what I would like is: if B is larger than 50, give me the values in A as integers that I can assign to a variable.
edit 1:
Sorry for the poor explanation. The final purpose of this is that I want to look in table 1:
index   A    B
0      44    1
1      45    2
2      46   57
3      47  598
4      48    5
for any values above 50 in column B, get the corresponding column A value, and then look in table 2:
index   A   B
5      44  12
6      45  13
7      46  14
8      47  15
9      48  16
So in the end I want to end up with the values in column B of table 2, which I can print out as integers and not as a Series. If this is not possible using pandas then OK, but is there a way to do it in any case?
You can use DataFrame slicing to get the values you want:
import pandas as pd
f = pd.read_csv('yourfile.csv')
f[f['B'] > 50].A
In this code,
f['B'] > 50
is the condition, returning a boolean array of True/False for the values meeting the condition or not; the corresponding A values are then selected.
This would be the output:
2 46
3 47
Name: A, dtype: int64
Is this what you wanted?
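For the two-table lookup described in the edit, a sketch assuming the tables are read into DataFrames t1 and t2, each with columns A and B:
a_vals = t1.loc[t1['B'] > 50, 'A']           # A values in table 1 where B > 50
matches = t2.loc[t2['A'].isin(a_vals), 'B']  # corresponding B values in table 2
print(matches.tolist())                       # [14, 15] for the sample tables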
