I have the following multiindex dataframe:
from io import StringIO
import pandas as pd
datastring = StringIO("""File,no,runtime,value1,value2
A,0, 0,12,34
A,0, 1,13,34
A,0, 2,23,34
A,1, 6,23,38
A,1, 7,22,38
B,0,17,15,35
B,0,18,17,35
C,0,34,23,32
C,0,35,21,32
""")
df = pd.read_csv(datastring, sep=',')
df.set_index(['File','no',df.index], inplace=True)
>>> df
             runtime  value1  value2
File no
A    0  0          0      12      34
        1          1      13      34
        2          2      23      34
     1  3          6      23      38
        4          7      22      38
B    0  5         17      15      35
        6         18      17      35
C    0  7         34      23      32
        8         35      21      32
What I would like to get is just the first value2 of every (File, no) combination:
A 0 34
A 1 38
B 0 35
C 0 32
The most similar questions I could find were these:
Resample pandas dataframe only knowing result measurement count
MultiIndex-based indexing in pandas
Select rows in pandas MultiIndex DataFrame
but I was unable to construct a solution from them. The best I got was slicing with pd.IndexSlice:
idx = pd.IndexSlice
df.loc[idx[:,0],:]
which can, for example, filter for no == 0, but still returns every row of the matching groups rather than just the first one.
Is a multiindex even the right tool for the task at hand? How to solve this?
Use GroupBy.first on the first and second levels of the MultiIndex:
s = df.groupby(level=[0,1])['value2'].first()
print (s)
File no
A 0 34
1 38
B 0 35
C 0 32
Name: value2, dtype: int64
If you need a one-column DataFrame, use a one-element list:
df1 = df.groupby(level=[0,1])[['value2']].first()
print (df1)
value2
File no
A 0 34
1 38
B 0 35
C 0 32
Another idea is to remove the third level with DataFrame.reset_index and keep only the first row of each (File, no) pair by filtering out duplicated index entries with Index.duplicated and boolean indexing:
df2 = df.reset_index(level=2, drop=True)
s = df2.loc[~df2.index.duplicated(), 'value2']
print (s)
File no
A 0 34
1 38
B 0 35
C 0 32
Name: value2, dtype: int64
For the sake of completeness, I would like to add another method (which I would not have found without the answer by jezrael).
s = df.groupby(level=[0,1])['value2'].nth(0)
This can be generalized to retrieve any entry, not merely the first:
t = df.groupby(level=[0,1])['value1'].nth(1)
Note that the selection was changed from value2 to value1 because for value2 the results of nth(0) and nth(1) would be identical (value2 is constant within each group).
Pandas documentation link: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.core.groupby.GroupBy.nth.html
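Putting the pieces together, here is a self-contained sketch of the nth approach, with the question's data rebuilt from scratch. Note that in recent pandas versions nth acts as a filter and keeps the original index rather than the group keys; the values, however, are the same.

```python
from io import StringIO

import pandas as pd

# Rebuild the question's DataFrame with a (File, no, row) MultiIndex
datastring = StringIO("""File,no,runtime,value1,value2
A,0,0,12,34
A,0,1,13,34
A,0,2,23,34
A,1,6,23,38
A,1,7,22,38
B,0,17,15,35
B,0,18,17,35
C,0,34,23,32
C,0,35,21,32
""")
df = pd.read_csv(datastring)
df.set_index(['File', 'no', df.index], inplace=True)

# First entry of every (File, no) group
first = df.groupby(level=[0, 1])['value2'].nth(0)

# Second entry of every (File, no) group
second = df.groupby(level=[0, 1])['value1'].nth(1)
```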
I've been trying to figure out how to compare two columns that share some values between them, but at different rows.
For example
col_index  col_1  col_2
        1     12     34
        2     16     42
        3     58     35
        4     99     60
        5      2     12
       12     35     99
In the above example, col_1 and col_2 match on several occasions: e.g. values '12' and '99'.
I need to be able to find which rows these match at so that I can get the result of col_index.
What would be the best way to do that?
If I understand correctly, only row 2 (col_index 2) has no match and should be excluded.
You can use np.intersect1d to find the common values between the two columns and then check if these values are in your columns using isin:
import numpy as np
common_values = np.intersect1d(df.col_1,df.col_2)
res = df[(df.col_1.isin(common_values))|(df.col_2.isin(common_values))]
res
col_index col_1 col_2
0 1 12 34 # 12
2 3 58 35 # 35
3 4 99 60 # 99
4 5 2 12 # 12
5 12 35 99 # 99
res[['col_index']]
col_index
0 1
2 3
3 4
4 5
5 12
You could use the isin method to get a mask, and then use it to filter the matches. Finally, take the col_index column and that's all. So, using your dataframe:
mask = df.col_1.isin(df.col_2)
print(df[mask].col_index.to_list())  # to_list just converts the Series to a plain Python list
Result: [1, 4, 12]
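For completeness, a merge-based sketch (not from the answers above) that pairs each matching value with the col_index of both sides at once; the suffix names are my own choice:

```python
import pandas as pd

df = pd.DataFrame({
    'col_index': [1, 2, 3, 4, 5, 12],
    'col_1': [12, 16, 58, 99, 2, 35],
    'col_2': [34, 42, 35, 60, 12, 99],
})

# Inner-join col_1 against col_2: one row per matching value,
# carrying the col_index of each side along
pairs = df[['col_index', 'col_1']].merge(
    df[['col_index', 'col_2']],
    left_on='col_1', right_on='col_2',
    suffixes=('_left', '_right'),
)
```

Each row of pairs then says which col_index holds the value in col_1 and which col_index holds the same value in col_2.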
Simply loop over the values that are present in both columns, using the Series.isin method
# test data:
a = 12,16,58,99
b = 34,99,35,12
c = 1,2,3,5
d = pd.DataFrame({"col_1":a, "col_2":b, 'col_idx':c})
# col_1 col_2 col_idx
#0 12 34 1
#1 16 99 2
#2 58 35 3
#3 99 12 5
for _,row in d.loc[d.col_1.isin(d.col_2)].iterrows():
val = row.col_1
idx1 = row.col_idx
print(val, idx1, d.query("col_2==%d" % val).col_idx.values)
#12 1 [5]
#99 5 [2]
If your values are strings (instead of integers as in this example), change the query argument accordingly: query("col_2=='%s'" % val) .
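Instead of interpolating the value into the query string, query can also reference local Python variables with the @ prefix, which works the same for strings and numbers (a small sketch with the test data from above):

```python
import pandas as pd

d = pd.DataFrame({"col_1": [12, 16, 58, 99],
                  "col_2": [34, 99, 35, 12],
                  "col_idx": [1, 2, 3, 5]})

val = 12
# @val refers to the local variable, so no string formatting is needed
matches = d.query("col_2 == @val").col_idx.values
```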
I want to stack two columns on top of each other
So I have Left and Right values in one column each, and want to combine them into a single one. How do I do this in Python?
I'm working with Pandas Dataframes.
Basically from this
Left Right
0 20 25
1 15 18
2 10 35
3 0 5
To this:
New Name
0 20
1 15
2 10
3 0
4 25
5 18
6 35
7 5
It doesn't matter how they are combined as I will plot it anyway, and the new column name also doesn't matter because I can rename it.
You can create a list of the columns and call squeeze on each to anonymise the data so that it doesn't try to align on column names, then call concat on this list. Passing ignore_index=True creates a new index; otherwise you'd get the column names repeated as index values:
cols = [df[col].squeeze() for col in df]
pd.concat(cols, ignore_index=True)
Many options, stack, melt, concat, ...
Here's one:
>>> df.melt(value_name='New Name').drop(columns='variable')
New Name
0 20
1 15
2 10
3 0
4 25
5 18
6 35
7 5
You can also use np.ravel:
import numpy as np
out = pd.DataFrame(np.ravel(df.values.T), columns=['New name'])
print(out)
# Output
New name
0 20
1 15
2 10
3 0
4 25
5 18
6 35
7 5
Update
If you have only 2 cols:
out = pd.concat([df['Left'], df['Right']], ignore_index=True).to_frame('New name')
print(out)
# Output
New name
0 20
1 15
2 10
3 0
4 25
5 18
6 35
7 5
Solution with unstack
df2 = df.unstack()
# recreate a simple 0..n-1 index
df2 = df2.reset_index(drop=True)
A solution with np.ravel (note the row-wise interleaving in the result).
# Your data
import numpy as np
import pandas as pd
df = pd.DataFrame({"Left":[20,15,10,0], "Right":[25,18,35,5]})
# Select the two columns and ravel them row by row
df2 = pd.DataFrame({"New Name":np.ravel(df[["Left","Right"]])})
df2
New Name
0 20
1 25
2 15
3 18
4 10
5 35
6 0
7 5
I ended up using this solution, seems to work fine
df1 = dfTest[['Left']].copy()
df2 = dfTest[['Right']].copy()
df2.columns=['Left']
df3 = pd.concat([df1, df2],ignore_index=True)
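The same idea can be written more compactly by renaming on the fly, so the two columns align under one name (a sketch using the sample data):

```python
import pandas as pd

dfTest = pd.DataFrame({"Left": [20, 15, 10, 0], "Right": [25, 18, 35, 5]})

# Rename 'Right' to 'Left' so concat stacks the columns instead of aligning them
stacked = pd.concat(
    [dfTest[['Left']], dfTest[['Right']].rename(columns={'Right': 'Left'})],
    ignore_index=True,
)
```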
I have an undefined number of columns that have some values. For the example, let's say there are 4 columns: [a,b,c,d], and there is a value associated with each column name, like this:
a b c d
0 23 11 0
11 43 33 22
12 0 12 0
I want to write another column right next to d that lists the names of all columns whose value is greater than 0, for example:
Like this:
a b c d e
0 23 11 0 b,c
11 43 33 22 a,b,c,d
12 0 12 0 a,c
my attempt:
dic2 = {'a':[12,0,23],'b':[21,23,0],'c':[0,22,33],'d':[0,22,0]}
df = pd.DataFrame(dic2)
df[df>0]
This will return NaN wherever there is a zero, but I don't know how to get the names of the columns that hold the remaining (non-NaN) values.
You can compare values against 0 with DataFrame.gt to get a boolean DataFrame, then use DataFrame.dot for matrix multiplication with the column names; finally remove the trailing separator with str indexing:
df['e'] = df.gt(0).dot(df.columns + ',').str[:-1]
print (df)
a b c d e
0 12 21 0 0 a,b
1 0 23 22 22 b,c,d
2 23 0 33 0 a,c
You can create a new column and use max function on all other columns
df['D'] = df.max(axis=1)
This checks all columns. If you want the max only from specific columns, specify them like this:
df['D'] = df[column].max(axis=1)
or with a list of columns:
df['D'] = df[[column1, column2]].max(axis=1)
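For completeness, the dot approach from the first answer applied to the question's own sample data reproduces the expected e column (a self-contained sketch):

```python
import pandas as pd

df = pd.DataFrame({'a': [0, 11, 12],
                   'b': [23, 43, 0],
                   'c': [11, 33, 12],
                   'd': [0, 22, 0]})

# Boolean mask of positive cells, matrix-multiplied with the column names,
# then the trailing comma stripped off
df['e'] = df.gt(0).dot(df.columns + ',').str[:-1]
```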
I have the following Series:
project id type
First 130403725 PRODUCT 68
EMPTY 2
Six 130405706 PRODUCT 24
132517244 PRODUCT 33
132607436 PRODUCT 87
How can I transform the type values into new columns?
PRODUCT EMPTY
project id
First 130403725 68 2
Six 130405706 24 0
132517244 33 0
132607436 87 0
This is a classic pivot table. Since s is a MultiIndex Series, reset the index first so that pivot has named columns to work with:
df_pivoted = s.reset_index(name="value").pivot(index=["project", "id"], columns="type", values="value")
I've named the value column "value" here; it would be clearer if it had a name in your data.
Use unstack, since this is a Series with a MultiIndex:
s1 = s.unstack(fill_value=0)
print (s1)
type EMPTY PRODUCT
project id
First 130403725 2 68
Six 130405706 0 24
132517244 0 33
132607436 0 87
For DataFrame:
df = s.unstack(fill_value=0).reset_index().rename_axis(None, axis=1)
print (df)
project id EMPTY PRODUCT
0 First 130403725 2 68
1 Six 130405706 0 24
2 Six 132517244 0 33
3 Six 132607436 0 87
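A self-contained sketch, rebuilding the Series from the question and unstacking it:

```python
import pandas as pd

# Rebuild the MultiIndex Series from the question
idx = pd.MultiIndex.from_tuples(
    [('First', 130403725, 'PRODUCT'),
     ('First', 130403725, 'EMPTY'),
     ('Six', 130405706, 'PRODUCT'),
     ('Six', 132517244, 'PRODUCT'),
     ('Six', 132607436, 'PRODUCT')],
    names=['project', 'id', 'type'],
)
s = pd.Series([68, 2, 24, 33, 87], index=idx)

# Move the innermost level into columns; missing combinations become 0
out = s.unstack(fill_value=0)
```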
I have a csv file test.csv. I am trying to use pandas to select rows depending on whether the value in the second column is above a certain threshold, e.g.:
index A B
0 44 1
1 45 2
2 46 57
3 47 598
4 48 5
So what I would like is: if B is larger than 50, give me the corresponding values in A as integers that I can assign to a variable.
edit 1:
Sorry for the poor explanation. The final purpose of this is that I want to look in table 1:
index A B
0 44 1
1 45 2
2 46 57
3 47 598
4 48 5
for any values above 50 in column B and get the column A value and then look in table 2:
index A B
5 44 12
6 45 13
7 46 14
8 47 15
9 48 16
so in the end I want to end up with the values in column B of table two, which I can print out as integers and not as a Series. If this is not possible using pandas that's OK, but is there a way to do it in any case?
You can use boolean indexing (dataframe slicing) to get the values you want:
import pandas as pd
f = pd.read_csv('yourfile.csv')
f[f['B'] > 50].A
In this code,
f['B'] > 50
is the condition: it returns a boolean array of True/False for every row depending on whether it meets the condition, and then the corresponding A values are selected.
This would be the output:
2 46
3 47
Name: A, dtype: int64
Is this what you wanted?
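To get plain integers instead of a Series, and to finish the two-table lookup from the edit, something along these lines should work (table1 and table2 are my own names for your two tables):

```python
import pandas as pd

table1 = pd.DataFrame({'A': [44, 45, 46, 47, 48],
                       'B': [1, 2, 57, 598, 5]})
table2 = pd.DataFrame({'A': [44, 45, 46, 47, 48],
                       'B': [12, 13, 14, 15, 16]})

# Column A values of table1 where B exceeds 50, as a plain list of ints
a_values = table1.loc[table1['B'] > 50, 'A'].tolist()

# Look those A values up in table2 and pull out the matching B values
result = table2.loc[table2['A'].isin(a_values), 'B'].tolist()
```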