I'm having problems with MultiIndex and stack(). The following example is based on a solution from Calvin Cheung on Stack Overflow.
=== multi.csv ===
h1,main,h3,sub,h5
a,A,1,A1,1
b,B,2,B1,2
c,B,3,A1,3
d,A,4,B2,4
e,A,5,B3,5
f,B,6,A2,6
=== multi.py ===
#!/usr/bin/env python
import pandas as pd
df1 = pd.read_csv('multi.csv')
df2 = df1.pivot(index='main', columns='sub').stack()
print(df2)
=== output ===
h1 h3 h5
main sub
A A1 a 1 1
B2 d 4 4
B3 e 5 5
B A1 c 3 3
A2 f 6 6
B1 b 2 2
This works as long as the entries in the sub column are unique with respect to the corresponding entry in the main column. But if we change the sub column entry in row e to B2, then B2 is no longer unique in the group of A rows and we get an error message: "pandas.core.reshape.ReshapeError: Index contains duplicate entries, cannot reshape".
I was expecting the shape of the sub index to behave like the shape of the primary index, where duplicates are indicated with blank entries under the first row entry.
=== expected output ===
h1 h3 h5
main sub
A A1 a 1 1
B2 d 4 4
e 5 5
B A1 c 3 3
A2 f 6 6
B1 b 2 2
So my question is, how can I structure a MultiIndex in a way that allows duplicates in sub-levels?
Rather than do a pivot*, just set_index directly (this works for both examples):
In [11]: df
Out[11]:
h1 main h3 sub h5
0 a A 1 A1 1
1 b B 2 B1 2
2 c B 3 A1 3
3 d A 4 B2 4
4 e A 5 B2 5
5 f B 6 A2 6
In [12]: df.set_index(['main', 'sub'])
Out[12]:
h1 h3 h5
main sub
A A1 a 1 1
B B1 b 2 2
A1 c 3 3
A B2 d 4 4
B2 e 5 5
B A2 f 6 6
*You're not really doing a pivot here anyway, it just happens to work in the above case.
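If you also want the sparsified display from the expected output above (repeated main labels shown blank), sort the index after setting it; pandas only blanks out consecutive duplicate index labels. A minimal sketch, not part of the original answer:
import pandas as pd

df1 = pd.read_csv('multi.csv')
# sorting groups duplicate ('main', 'sub') labels together, so the console
# display blanks the repeats, matching the expected output above
df2 = df1.set_index(['main', 'sub']).sort_index()
print(df2)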
Data frame issue
ID  C1  C2  M1
1   A   B   X
2   A       Y
3   C       W
4   G   H   Z
result wanted:
ID  C
1   A
1   B
2   A
3   C
4   G
4   H
The main problem is that today's dataset has C1 and C2; tomorrow we could have C1, C2, C3, ..., Cn. The filename will be provided, and my task is to read it and produce the result regardless of how many C-related columns the file has. Column M1 is not needed.
What I tried:
import pandas as pd
df = pd.read_csv(r"C:\Users\JIRAdata_TEST.csv")
df = df.filter(regex='ID|C')
print(df)
This returns all the ID and C-related columns and removes the M1 column as part of data cleanup; don't know if that helps.
Then... I'm stuck!
Use df.melt with df.dropna:
In [1295]: x = df.filter(regex='ID|C').melt('ID', value_name='C').sort_values('ID').dropna().drop(columns='variable')
In [1296]: x
Out[1296]:
ID C
0 1 A
4 1 B
5 2 A
2 3 C
3 4 G
7 4 H
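Broken into steps (my annotated version of the same chain, assuming the path from the question), the pipeline reads:
import pandas as pd

df = pd.read_csv(r"C:\Users\JIRAdata_TEST.csv")  # path taken from the question
kept = df.filter(regex='ID|C')           # keep ID plus C1..Cn, drop M1
long = kept.melt('ID', value_name='C')   # stack the C columns into one 'C' column
long = long.sort_values('ID').dropna()   # drop rows where a Cn cell was empty
result = long.drop(columns='variable')   # discard the C1/C2 source-column label
print(result)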
There is a pandas dataframe:
df = pd.DataFrame({'c1':['a','b','c','d'],'c2':[1,2,3,4]})
c1 c2
0 a 1
1 b 2
2 c 3
3 d 4
And a pandas Series:
list1 = pd.Series(['b','c','e','f'])
Out[6]:
0 b
1 c
2 e
3 f
How do I create a new dataframe that contains only the rows where c1 is in list1?
output:
c1 c2
0 b 2
1 c 3
You can use Series.isin:
In [582]: df[df.c1.isin(list1)]
Out[582]:
c1 c2
1 b 2
2 c 3
Or, using df.loc, if you want to modify your slice:
In [584]: df.loc[df.c1.isin(list1), :]
Out[584]:
c1 c2
1 b 2
2 c 3
Using query (the @ prefix lets the query string refer to the Python variable list1):
In [1133]: df.query('c1 in @list1')
Out[1133]:
c1 c2
1 b 2
2 c 3
Or, using isin
In [1134]: df[df.c1.isin(list1)]
Out[1134]:
c1 c2
1 b 2
2 c 3
Both @JohnGalt's and @COLDSPEED's answers are more idiomatic pandas. Please don't use these answers; they are intended to be fun and illustrative of other parts of the pandas and numpy API.
Alt 1
This uses numpy.in1d, which acts as a proxy for pd.Series.isin:
df[np.in1d(df.c1.values, list1.values)]
c1 c2
1 b 2
2 c 3
Alt 2
Use set logic (a membership test against a set built from list1)
s = set(list1)
df[df.c1.apply(lambda x: x in s)]
c1 c2
1 b 2
2 c 3
Alt 3
Use pd.Series.str.match (the joined values form a regex alternation, so this assumes they contain no regex metacharacters)
df[df.c1.str.match('|'.join(list1))]
c1 c2
1 b 2
2 c 3
For the sake of completeness, yet another way (definitely not the best one) to achieve this:
In [4]: df.merge(list1.to_frame(name='c1'))
Out[4]:
c1 c2
0 b 2
1 c 3
I have two dataframes:
a1 = pd.DataFrame({'A': [1,2,3], 'B': [2,3,4]})
b2 = pd.DataFrame({'A': [1,4], 'B': [3,6]})
and I want to get
c = pd.DataFrame({'A': [1,2,3,4], 'B': [3,3,4,6]})
That is, merge a1 and b2 on the key 'A', but when 'A' is equal and B differs, take the value from b2. How can I get this to work? I have no idea.
First concatenate both dataframes, one on top of the other, to get one big dataframe:
c = pd.concat([a1, b2], axis=0)
A B
0 1 2
1 2 3
2 3 4
0 1 3
1 4 6
Then group on column A to get only the unique values of A; by using last() you make sure that when there is a duplicate, the value from b2 is used. This gives:
c = c.groupby('A').last()
B
A
1 3
2 3
3 4
4 6
Then reset the index to get a clean numerical index:
c = c.reset_index()
which returns:
A B
0 1 3
1 2 3
2 3 4
3 4 6
To do it all in one go, just enter the following lines of code:
c = pd.concat([a1, b2], axis=0)
c = c.groupby('A').last().reset_index()
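An alternative one-liner, my addition rather than part of the original answer, keeps the last occurrence of each 'A' with drop_duplicates instead of a groupby:
import pandas as pd

a1 = pd.DataFrame({'A': [1, 2, 3], 'B': [2, 3, 4]})
b2 = pd.DataFrame({'A': [1, 4], 'B': [3, 6]})

# keep='last' prefers the b2 row whenever the same 'A' occurs in both frames
c = (pd.concat([a1, b2])
       .drop_duplicates(subset='A', keep='last')
       .sort_values('A')
       .reset_index(drop=True))
print(c)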
I have the following pandas DataFrame:
df
A B
1 b0
2 a0
3 c0
5 c1
6 a1
7 b1
8 b2
The first row which starts with a is
df[df.B.str.startswith("a")]
A B
2 a0
I would like to extract the first row in column B that starts with a and every row after. My desired result is below
A B
2 a0
3 c0
5 c1
6 a1
7 b1
8 b2
How can this be done?
One option is to create a mask and forward-fill it, so everything from the first match onward stays selected:
mask = df.B.str.startswith("a")
mask = mask.where(mask)                     # keep True, turn False into NaN
df[mask.ffill().fillna(False).astype(bool)]
Another option is to build an index range:
first = df[df.B.str.startswith("a")].index[0]
df.loc[first:]
The latter approach assumes that an "a" is always present.
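A compact variant that drops that assumption (my addition, not one of the original answers) uses cummax, which makes a boolean mask True from its first True onward; with no match it simply returns an empty frame:
# True from the first row starting with 'a' onward; all False -> empty result
df[df.B.str.startswith('a').cummax()]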
Using idxmax to find the first True:
df.loc[df.B.str[0].eq('a').idxmax():]
A B
1 2 a0
2 3 c0
3 5 c1
4 6 a1
5 7 b1
6 8 b2
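One caveat worth adding here: when no value starts with 'a', the mask is all False and idxmax returns the first label, so this would silently select the whole frame. A guarded version (my addition):
m = df.B.str[0].eq('a')
# only slice from the first match if there actually is one
result = df.loc[m.idxmax():] if m.any() else df.iloc[0:0]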
If I understand your question correctly, here is how you do it:
df = pd.DataFrame(data={'A': [1, 2, 3, 5, 6, 7, 8],
                        'B': ['b0', 'a0', 'c0', 'c1', 'a1', 'b1', 'b2']})
# index of the first row whose B value begins with "a"
index = df[df.B.str.startswith("a")].index[0]
desired_df = pd.concat([df.A[index:], df.B[index:]], axis=1)
print(desired_df)
and you get:
   A   B
1  2  a0
2  3  c0
3  5  c1
4  6  a1
5  7  b1
6  8  b2
After executing a groupby followed by size() as seen below (df = DataFrame) in the pandas module, I've obtained a new column beside the one labeled No.. However, I'm not sure how to manipulate this new column, because I don't know its label/key.
For example, I want to express the values generated (in the new column) as a fraction of the sum of all these values in a new column. How can I do so?
JuncNo = pd.read_csv(filename)
JuncNo_group = JuncNo.groupby('No.')
JuncSize = JuncNo_group.size()
JuncSize.head(n=6)
No.
1 122
2 2136
3 561
4 91
5 10
6 3
dtype: int64
You have to set the name of the new Series and reset the index:
JuncSize = JuncSize.groupby('No').size()
JuncSize.name = 'size'
JuncSize = JuncSize.reset_index()
print(JuncSize)
But if you need to add a new column with the same number of rows as the original dataframe, you can use transform:
JuncSize['size'] = JuncSize.groupby('No')['Code'].transform('size')
Example:
print(JuncSize)
No Code
0 D B2
1 B B2
2 B B3
3 B B3
4 G B3
5 B B3
JuncSize['size'] = JuncSize.groupby('No')['Code'].transform('size')
print(JuncSize)
No Code size
0 D B2 1
1 B B2 4
2 B B3 4
3 B B3 4
4 G B3 1
5 B B3 4
JuncSize = JuncSize.groupby('No').size()
print(JuncSize)
No
B 4
D 1
G 1
JuncSize.name = 'size'
print(JuncSize)
No
B 4
D 1
G 1
Name: size, dtype: int64
JuncSize = JuncSize.reset_index()
print(JuncSize)
No size
0 B 4
1 D 1
2 G 1
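To answer the original question's "fraction of the sum" part, a minimal sketch building on the recipe above (my addition; filename is the variable from the question):
import pandas as pd

JuncNo = pd.read_csv(filename)
JuncSize = JuncNo.groupby('No.').size().rename('size').reset_index()
# each group's size as a fraction of the total number of rows
JuncSize['fraction'] = JuncSize['size'] / JuncSize['size'].sum()
print(JuncSize.head(n=6))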