Let's say I have a pandas DataFrame of the following form:
a b c
a_1 1 4 2
a_2 3 3 5
a_3 4 7 2
b_1 2 9 8
b_2 7 2 6
b_3 5 4 1
c_1 3 1 3
c_2 8 6 6
c_3 9 3 7
Is there a way I could select only rows that have similar names? In the case of the DataFrame above that would mean selecting only the rows that start with a, or the rows that start with b, etc.
Using @Akavall's setup code:
In [1]: my_data = np.arange(8).reshape(4,2)
In [2]: my_data[0,0] = 4
In [3]: df = pd.DataFrame(data = my_data, index=['a_1', 'a_2', 'b_1', 'b_2'], columns=['a', 'b'])
In [5]: df.filter(regex='a',axis=0)
Out[5]:
a b
a_1 4 1
a_2 2 3
[2 rows x 2 columns]
Note that in general this is better posed as a multi-index
In [6]: df.index = pd.MultiIndex.from_product([['a','b'],[1,2]])
In [7]: df
Out[7]:
a b
a 1 4 1
2 2 3
b 1 4 5
2 6 7
[4 rows x 2 columns]
In [8]: df.loc['a']
Out[8]:
a b
1 4 1
2 2 3
[2 rows x 2 columns]
In [9]: df.loc[['a']]
Out[9]:
a b
a 1 4 1
2 2 3
[2 rows x 2 columns]
I don't think that there is a built-in pandas way to do it, but here is one way:
import numpy as np
import pandas as pd
my_data = np.arange(8).reshape(4,2)
my_data[0,0] = 4
df = pd.DataFrame(data = my_data, index=['a_1', 'a_2', 'b_1', 'b_2'], columns=['a', 'b'])
Result:
>>> df
a b
a_1 4 1
a_2 2 3
b_1 4 5
b_2 6 7
>>> start_with_a = [ind for ind, ele in enumerate(df.index) if ele[0] == 'a']
>>> start_with_a
[0, 1]
>>> df.iloc[start_with_a]
a b
a_1 4 1
a_2 2 3
In general you can access the row index and the columns with the .index and .columns attributes, so you can easily get the rows that start with a programmatically:
needed_rows = [row for row in df.index if row.startswith('a')]
Then you can use these rows like this:
df.loc[needed_rows]
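A vectorized alternative is the .str accessor on the index, which avoids the Python-level loop entirely; here is a sketch using the same example frame as above:

```python
import numpy as np
import pandas as pd

# Same setup as above: rows labelled with a group prefix
df = pd.DataFrame(np.arange(8).reshape(4, 2),
                  index=['a_1', 'a_2', 'b_1', 'b_2'],
                  columns=['a', 'b'])

# Boolean mask computed over all index labels at once
mask = df.index.str.startswith('a')
result = df[mask]
print(result)
```

This selects the same rows as the list comprehension, but the masking is done inside pandas.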
Example:
import pandas as pd
test = {
't':[0,1,2,3,4,5],
'A':[1,1,1,2,2,2],
'B':[9,9,9,9,8,8],
'C':[1,2,3,4,5,6]
}
df = pd.DataFrame(test)
df
I tried using a window and concat:
window_size = 2
for row_idx in range(df.shape[0] - window_size):
    print(
        pd.concat(
            [df.iloc[[row_idx]],
             df.loc[:, df.columns != 't'].iloc[[row_idx + window_size - 1]],
             df.loc[:, df.columns != 't'].iloc[[row_idx + window_size]]],
            axis=1
        )
    )
But I get a wrong DataFrame like this:
Is it possible to use a sliding window to concat data?
pd.concat aligns on indices, so you have to make sure that they fit. You could try the following:
window_size = 2
dfs = []
for n in range(window_size + 1):
    sdf = df.iloc[n:df.shape[0] - window_size + n]
    if n > 0:
        sdf = (
            sdf.drop(columns="t").rename(columns=lambda c: f"{c}_{n}")
            .reset_index(drop=True)
        )
    dfs.append(sdf)
res = pd.concat(dfs, axis=1)
Result for the sample:
t A B C A_1 B_1 C_1 A_2 B_2 C_2
0 0 1 9 1 1 9 2 1 9 3
1 1 1 9 2 1 9 3 2 9 4
2 2 1 9 3 2 9 4 2 8 5
3 3 2 9 4 2 8 5 2 8 6
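An alternative sketch of the same windowing uses DataFrame.shift instead of slicing, so the indices stay aligned automatically (shift(-n) pulls row i+n up to row i; the tail rows that would contain NaN are trimmed before casting back to int):

```python
import pandas as pd

df = pd.DataFrame({
    't': [0, 1, 2, 3, 4, 5],
    'A': [1, 1, 1, 2, 2, 2],
    'B': [9, 9, 9, 9, 8, 8],
    'C': [1, 2, 3, 4, 5, 6],
})

window_size = 2
parts = [df]
for n in range(1, window_size + 1):
    # shift(-n) moves each row's successors up; suffix keeps column names distinct
    parts.append(df.drop(columns='t').shift(-n).add_suffix(f'_{n}'))

# Trim the incomplete windows at the end, then restore integer dtypes
res = pd.concat(parts, axis=1).iloc[:df.shape[0] - window_size].astype(int)
print(res)
```

This produces the same table as the slicing approach above.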
Have a look at this example below:
df1 = pd.DataFrame([['a', 1], ['b', 2]],
columns=['letter', 'number'])
df4 = pd.DataFrame([['bird', 'polly'], ['monkey','george']],
columns=['animal', 'name'])
pd.concat([df1, df4], axis=1)
# Returns the following output
letter number animal name
0 a 1 bird polly
1 b 2 monkey george
It was taken from the following pandas doc.
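That doc example works because both frames share the default 0..n index. A quick sketch of what happens when the indices do not match, and the usual reset_index fix:

```python
import pandas as pd

df1 = pd.DataFrame({'letter': ['a', 'b']}, index=[0, 1])
df2 = pd.DataFrame({'number': [1, 2]}, index=[5, 6])

# Mismatched indices: concat aligns them, producing NaN-padded rows
misaligned = pd.concat([df1, df2], axis=1)
print(misaligned)

# Resetting both indices makes the rows line up positionally
aligned = pd.concat([df1.reset_index(drop=True),
                     df2.reset_index(drop=True)], axis=1)
print(aligned)
```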
I have a "sample.txt" like this.
idx A B C D cat
J 1 2 3 1 x
K 4 5 6 2 x
L 7 8 9 3 y
M 1 2 3 4 y
N 4 5 6 5 z
O 7 8 9 6 z
With this dataset, I want to get sums across rows and across columns.
Across rows it is not a big deal.
I got a result like this.
### MY CODE ###
import pandas as pd
df = pd.read_csv('sample.txt',sep="\t",index_col='idx')
df.info()
df2 = df.groupby('cat').sum()
print( df2 )
The result is like this.
A B C D
cat
x 5 7 9 3
y 8 10 12 7
z 11 13 15 11
But I don't know how to write code to get a result like this
(simply adding the values in columns A and B, and likewise columns C and D):
AB CD
J 3 4
K 9 8
L 15 12
M 3 7
N 9 11
O 15 15
Could anybody help me write this code?
By the way, I don't want to do it like this
(it looks too dull, but if it is the only way, I'll accept it):
df2 = df['A'] + df['B']
df3 = df['C'] + df['D']
df = pd.DataFrame([df2,df3],index=['AB','CD']).transpose()
print( df )
When you pass a dictionary or callable to groupby, it gets applied along an axis. Here I specified axis=1, i.e. the columns.
d = dict(A='AB', B='AB', C='CD', D='CD')
df.groupby(d, axis=1).sum()
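Note that groupby(..., axis=1) is deprecated in recent pandas. An equivalent, version-safe sketch transposes, groups the (former) column labels with the same d mapping, and transposes back:

```python
import pandas as pd

# Small sample in the same shape as the question's data
df = pd.DataFrame({'A': [1, 4], 'B': [2, 5], 'C': [3, 6], 'D': [1, 2]},
                  index=['J', 'K'])

d = dict(A='AB', B='AB', C='CD', D='CD')

# Group the transposed frame by the column mapping, then transpose back
res = df.T.groupby(d).sum().T
print(res)
```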
Use concat with sum:
df = df.set_index('idx')
df = pd.concat([df[['A', 'B']].sum(1), df[['C', 'D']].sum(1)], axis=1, keys=['AB','CD'])
print( df)
AB CD
idx
J 3 4
K 9 8
L 15 12
M 3 7
N 9 11
O 15 15
Does this do what you need? By using axis=1 with DataFrame.apply, you can use the data that you want in a row to construct a new column. Then you can drop the columns that you don't want anymore.
In [1]: import pandas as pd
In [5]: df = pd.DataFrame(columns=['A', 'B', 'C', 'D'], data=[[1, 2, 3, 4], [1, 2, 3, 4]])
In [6]: df
Out[6]:
A B C D
0 1 2 3 4
1 1 2 3 4
In [7]: df['CD'] = df.apply(lambda x: x['C'] + x['D'], axis=1)
In [8]: df
Out[8]:
A B C D CD
0 1 2 3 4 7
1 1 2 3 4 7
In [13]: df.drop(['C', 'D'], axis=1)
Out[13]:
A B CD
0 1 2 7
1 1 2 7
I'm looking at creating a Dataframe that is the combination of two unrelated series.
If we take two dataframes:
A = ['a','b','c']
B = [1,2,3,4]
dfA = pd.DataFrame(A)
dfB = pd.DataFrame(B)
I'm looking for this output:
A B
0 a 1
1 a 2
2 a 3
3 a 4
4 b 1
5 b 2
6 b 3
7 b 4
8 c 1
9 c 2
10 c 3
11 c 4
One way could be to loop over the lists directly and create the DataFrame, but there must be a better way. I'm sure I'm missing something from the pandas documentation.
result = []
for i in A:
    for j in B:
        result.append([i, j])
result_DF = pd.DataFrame(result, columns=['A', 'B'])
Ultimately I'm looking at combining months and UUID, I have something working but it takes ages to compute and relies too much on the index. A generic solution would clearly be better:
from datetime import datetime
from uuid import UUID
start = datetime(year=2016, month=1, day=1)
end = datetime(year=2016, month=4, day=1)
months = pd.date_range(start=start, end=end, freq="MS")
benefit = pd.DataFrame(index=months)
A = [UUID('d48259a6-80b5-43ca-906c-8405ab40f9a8'),
UUID('873a65d7-582c-470e-88b6-0d02df078c04'),
UUID('624c32a6-9998-49f4-92b6-70e712355073'),
UUID('7207ab0c-3c7f-477e-b5bc-fbb8059c1dec')]
dfA = pd.DataFrame(A)
result = pd.DataFrame(columns=['A','month'])
for i in dfA.index:
    newdf = pd.DataFrame(index=benefit.index)
    newdf['A'] = dfA.iloc[i, 0]
    newdf['month'] = newdf.index
    result = pd.concat([result, newdf])
result
You can use np.meshgrid:
pd.DataFrame(np.array(np.meshgrid(dfA, dfB, )).T.reshape(-1, 2))
0 1
0 a 1
1 a 2
2 a 3
3 a 4
4 b 1
5 b 2
6 b 3
7 b 4
8 c 1
9 c 2
10 c 3
11 c 4
This gives a roughly 2000x speedup on DataFrame objects of length 300 and 400, respectively:
A = ['a', 'b', 'c'] * 100
B = [1, 2, 3, 4] * 100
dfA = pd.DataFrame(A)
dfB = pd.DataFrame(B)
np.meshgrid:
%%timeit
pd.DataFrame(np.array(np.meshgrid(dfA, dfB, )).T.reshape(-1, 2))
100 loops, best of 3: 8.45 ms per loop
vs cross:
%timeit cross(dfA, dfB)
1 loop, best of 3: 16.3 s per loop
So if I understand your example correctly, you could:
A = ['a', 'b', 'c']
dfA = pd.DataFrame(A)
start = datetime(year=2016, month=1, day=1)
end = datetime(year=2016, month=4, day=1)
months = pd.date_range(start=start, end=end, freq="MS")
dfB = pd.DataFrame(months.month)
pd.DataFrame(np.array(np.meshgrid(dfA, dfB, )).T.reshape(-1, 2))
to also get:
0 1
0 a 1
1 a 2
2 a 3
3 a 4
4 b 1
5 b 2
6 b 3
7 b 4
8 c 1
9 c 2
10 c 3
11 c 4
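In recent pandas (1.2+) a Cartesian product is built in via merge(how='cross'); a minimal sketch with the same data:

```python
import pandas as pd

dfA = pd.DataFrame({'A': ['a', 'b', 'c']})
dfB = pd.DataFrame({'B': [1, 2, 3, 4]})

# Cross join: every row of dfA paired with every row of dfB, no join key needed
result = dfA.merge(dfB, how='cross')
print(result)
```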
Using itertools.product:
from itertools import product
result = pd.DataFrame(list(product(dfA.iloc[:,0], dfB.iloc[:,0])))
Not quite as efficient as np.meshgrid, but it's more efficient than the other solutions.
Alternatively
a = [1,2,3]
b = ['a','b','c']
x,y = zip(*[i for i in zip(np.tile(a,len(a)),np.tile(b,len(a)))])
pd.DataFrame({'x':x,'y':y})
Outputs:
x y
0 1 a
1 2 b
2 3 c
3 1 a
4 2 b
5 3 c
6 1 a
7 2 b
8 3 c
%%timeit
1000 loops, best of 3: 559 µs per loop
EDIT: You don't actually need np.tile; a simple comprehension will do:
x,y = zip(*[(i,j) for i in a for j in b])
One liner approach
pd.DataFrame(0, A, B).stack().index.to_series().apply(pd.Series).reset_index(drop=True)
Or:
pd.MultiIndex.from_product([A, B]).to_series().apply(pd.Series).reset_index(drop=True)
From dataframes, assuming the information is in the first column.
pd.MultiIndex.from_product([dfA.iloc[:, 0], dfB.iloc[:, 0]]).to_series().apply(pd.Series).reset_index(drop=True)
Functionalized:
def cross(df1, df2):
    s1 = df1.iloc[:, 0]
    s2 = df2.iloc[:, 0]
    midx = pd.MultiIndex.from_product([s1, s2])
    df = midx.to_series().apply(pd.Series).reset_index(drop=True)
    df.columns = [s1.name, s2.name if s1.name != s2.name else 1]
    return df

print(cross(dfA, dfB))
0 1
0 a 1
1 a 2
2 a 3
3 a 4
4 b 1
5 b 2
6 b 3
7 b 4
8 c 1
9 c 2
10 c 3
11 c 4
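In newer pandas the apply(pd.Series) step can be skipped: MultiIndex has a to_frame method (a sketch, assuming pandas 0.24 or later):

```python
import pandas as pd

A = ['a', 'b', 'c']
B = [1, 2, 3, 4]

# to_frame(index=False) materializes the product directly as columns
result = pd.MultiIndex.from_product([A, B], names=['A', 'B']).to_frame(index=False)
print(result)
```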
I have created the following pandas DataFrame based on lists of ids.
In [8]: df = pd.DataFrame({'groups' : [1,2,3,4],
'id' : ["[1,3]","[2]","[5]","[4,6,7]"]})
Out[9]:
groups id
0 1 [1,3]
1 2 [2]
2 3 [5]
3 4 [4,6,7]
There is another DataFrame like following.
In [12]: df2 = pd.DataFrame({'id' : [1,2,3,4,5,6,7],
'path' : ["p1,p2,p3,p4","p1,p2,p1","p1,p5,p5,p7","p1,p2,p3,p3","p1,p2","p1","p2,p3,p4"]})
I need to get the path values for each group, e.g.:
groups path
1 p1,p2,p3,p4
p1,p5,p5,p7
2 p1,p2,p1
3 p1,p2
4 p1,p2,p3,p3
p1
p2,p3,p4
I'm not sure this is quite the best way to do it, but it worked for me. Incidentally, this only works if you create the id variable in df without the quote marks, i.e. as lists, not strings:
import itertools
df = pd.DataFrame({'groups' : [1,2,3,4],
'id' : [[1,3],[2],[5],[4,6,7]]})
df2 = pd.DataFrame({'id' : [1,2,3,4,5,6,7],
'path' : ["p1,p2,p3,p4","p1,p2,p1","p1,p5,p5,p7","p1,p2,p3,p3","p1,p2","p1","p2,p3,p4"]})
paths = [[] for group in df.groups.unique()]
for x in df.index:
    paths[x].extend(itertools.chain(*[list(df2[df2.id == int(y)]['path']) for y in df.id[x]]))
df['paths'] = pd.Series(paths)
df
There is probably a much neater way of doing this, but it's an odd data structure in a way. This gives the following output:
groups id paths
0 1 [1, 3] [p1,p2,p3,p4, p1,p5,p5,p7]
1 2 [2] [p1,p2,p1]
2 3 [5] [p1,p2]
3 4 [4, 6, 7] [p1,p2,p3,p3, p1, p2,p3,p4]
You shouldn't construct your DataFrame to have embedded list objects. Instead, repeat the groups according to the length of the ids and then use pandas.merge, like so:
In [143]: groups = list(range(1, 5))
In [144]: ids = [[1, 3], [2], [5], [4, 6, 7]]
In [145]: df = DataFrame({'groups': np.repeat(groups, list(map(len, ids))),
                          'id': reduce(lambda x, y: x + y, ids)})
In [146]: df2 = pd.DataFrame({'id' : [1,2,3,4,5,6,7],
                              'path' : ["p1,p2,p3,p4","p1,p2,p1","p1,p5,p5,p7","p1,p2,p3,p3","p1,p2","p1","p2,p3,p4"]})
In [147]: df
Out[147]:
groups id
0 1 1
1 1 3
2 2 2
3 3 5
4 4 4
5 4 6
6 4 7
[7 rows x 2 columns]
In [148]: df2
Out[148]:
id path
0 1 p1,p2,p3,p4
1 2 p1,p2,p1
2 3 p1,p5,p5,p7
3 4 p1,p2,p3,p3
4 5 p1,p2
5 6 p1
6 7 p2,p3,p4
[7 rows x 2 columns]
In [149]: pd.merge(df, df2, on='id', how='outer')
Out[149]:
groups id path
0 1 1 p1,p2,p3,p4
1 1 3 p1,p5,p5,p7
2 2 2 p1,p2,p1
3 3 5 p1,p2
4 4 4 p1,p2,p3,p3
5 4 6 p1
6 4 7 p2,p3,p4
[7 rows x 3 columns]
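If the list-valued id column is unavoidable, modern pandas (0.25+) can flatten it with DataFrame.explode before merging; a sketch with the same data:

```python
import pandas as pd

df = pd.DataFrame({'groups': [1, 2, 3, 4],
                   'id': [[1, 3], [2], [5], [4, 6, 7]]})
df2 = pd.DataFrame({'id': [1, 2, 3, 4, 5, 6, 7],
                    'path': ["p1,p2,p3,p4", "p1,p2,p1", "p1,p5,p5,p7",
                             "p1,p2,p3,p3", "p1,p2", "p1", "p2,p3,p4"]})

# explode gives each list element its own row; then an ordinary merge applies
result = df.explode('id').astype({'id': int}).merge(df2, on='id')
print(result)
```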
I have a pandas dataframe whose indices look like:
df.index
['a_1', 'b_2', 'c_3', ... ]
I want to rename these indices to:
['a', 'b', 'c', ... ]
How do I do this without specifying a dictionary with explicit keys for each index value?
I tried:
df.rename( index = lambda x: x.split( '_' )[0] )
but this throws up an error:
AssertionError: New axis must be unique to rename
Perhaps you could get the best of both worlds by using a MultiIndex:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.arange(8).reshape(4,2), index=['a_1', 'b_2', 'c_3', 'c_4'])
print(df)
# 0 1
# a_1 0 1
# b_2 2 3
# c_3 4 5
# c_4 6 7
index = pd.MultiIndex.from_tuples([item.split('_') for item in df.index])
df.index = index
print(df)
# 0 1
# a 1 0 1
# b 2 2 3
# c 3 4 5
# 4 6 7
This way, you can access things according to first level of the index:
In [30]: df.loc['c']
Out[30]:
0 1
3 4 5
4 6 7
or according to both levels of the index:
In [31]: df.loc[('c','3')]
Out[31]:
0 4
1 5
Name: (c, 3)
Moreover, all the DataFrame methods are built to work with DataFrames with MultiIndices, so you lose nothing.
However, if you really want to drop the second level of the index, you could do this:
df.reset_index(level=1, drop=True, inplace=True)
print(df)
# 0 1
# a 0 1
# b 2 3
# c 4 5
# c 6 7
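Another payoff of keeping the MultiIndex: per-prefix aggregation becomes a single groupby on the first level. A sketch with the same example frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(8).reshape(4, 2),
                  index=['a_1', 'b_2', 'c_3', 'c_4'])
df.index = pd.MultiIndex.from_tuples([i.split('_') for i in df.index])

# Aggregate all rows sharing the same first-level label (the old prefix)
res = df.groupby(level=0).sum()
print(res)
```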
That's the error you'd get if your function produced duplicate index values:
>>> df = pd.DataFrame(np.random.random((4,3)),index="a_1 b_2 c_3 c_4".split())
>>> df
0 1 2
a_1 0.854839 0.830317 0.046283
b_2 0.433805 0.629118 0.702179
c_3 0.390390 0.374232 0.040998
c_4 0.667013 0.368870 0.637276
>>> df.rename(index=lambda x: x.split("_")[0])
[...]
AssertionError: New axis must be unique to rename
If you really want that, I'd use a list comp:
>>> df.index = [x.split("_")[0] for x in df.index]
>>> df
0 1 2
a 0.854839 0.830317 0.046283
b 0.433805 0.629118 0.702179
c 0.390390 0.374232 0.040998
c 0.667013 0.368870 0.637276
but I'd think about whether that's really the right direction.
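For the assignment itself, a vectorized sketch using the index's .str accessor is equivalent to the list comprehension (direct index assignment allows duplicates, unlike rename):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.random((4, 3)),
                  index="a_1 b_2 c_3 c_4".split())

# Split each label on '_' and keep the first piece; duplicate labels are fine here
df.index = df.index.str.split('_').str[0]
print(df.index.tolist())
```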