Using pandas to order results that are outputted every two rows - python

A program I am working with outputted a tab-delimited file that looks like this:
marker A B C
Bin_1 1 2 1
marker C G H B T
Bin_2 3 1 1 1 2
marker B H T Z Y A C
Bin_3 1 1 2 1 3 4 5
I want to fix it so that it looks like this:
marker A B C G H T Y Z
Bin_1 1 2 1 0 0 0 0 0
Bin_2 0 1 3 1 1 2 0 0
Bin_3 4 1 5 0 1 2 3 1
This is what I have so far:
import pandas as pd
from collections import OrderedDict
df = pd.read_csv('markers.txt', header=None, sep='\t')
x = list(map(list, df.values))  # wrap in list() so the rows can be indexed in Python 3
list_of_dicts = []
s = 0
e = 1
g = len(x) + 1
while e < g:
    new_dict = OrderedDict(zip(x[s], x[e]))
    list_of_dicts.append(new_dict)
    s += 2
    e += 2
Initially I was converting these to dictionaries and was then going to do some kind of count and recreate a dataframe, but that seems to take a lot of time and memory for what seems like an easy task. Any suggestions on a better way to approach this?

lines = [l.strip().split() for l in open('markers.txt')]
dicts = {b[0]: pd.Series(dict(zip(m[1:], b[1:])))
         for m, b in zip(lines[::2], lines[1::2])}
pd.concat(dicts).unstack(fill_value=0)
A B C G H T Y Z
Bin_1 1 2 1 0 0 0 0 0
Bin_2 0 1 3 1 1 2 0 0
Bin_3 4 1 5 0 1 2 3 1
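A quick note, not part of the original answer: since everything is read from text, the counts land in the frame as strings; if numeric values are needed, an astype(int) at the end takes care of it:
pd.concat(dicts).unstack(fill_value=0).astype(int)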

The insight is that when you "append" DataFrames, the result is a DataFrame with columns that are the union of the columns, with NaNs or whatever in the holes. So:
$ cat test.py
import pandas as pd
frame = pd.DataFrame()
with open('/tmp/foo.tsv') as markers:
    while True:
        line = markers.readline()
        if not line:
            break
        columns = line.strip().split('\t')
        data = markers.readline().strip().split('\t')
        new = pd.DataFrame(data=[data], columns=columns)
        frame = frame.append(new)
frame = frame.fillna(0)
print(frame)
$ python test.py
A B C G H T Y Z marker
0 1 2 1 0 0 0 0 0 Bin_1
0 0 1 3 1 1 2 0 0 Bin_2
0 4 1 5 0 1 2 3 1 Bin_3
If you aren't using pandas anywhere else, then this might (or might not) be overkill. But if you are already using it anyway, then I think this is totally reasonable.
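A side note of my own: DataFrame.append was deprecated and then removed in pandas 2.0, so on current versions the same idea reads more naturally as collecting the two-line frames in a list and concatenating once at the end (same assumed foo.tsv layout as above):
import pandas as pd

pieces = []
with open('/tmp/foo.tsv') as markers:
    while True:
        line = markers.readline()
        if not line:
            break
        columns = line.strip().split('\t')
        data = markers.readline().strip().split('\t')
        pieces.append(pd.DataFrame(data=[data], columns=columns))

# concat unions the columns; fillna plugs the holes with zeros
frame = pd.concat(pieces, ignore_index=True).fillna(0)
print(frame)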

Not the most elegant thing in the world, but... (this assumes the file was read into a single column of raw lines, so that column 0 of df holds one whole line per row):
headers = df.iloc[::2][0].apply(lambda x: x.split()[1:])
data = df.iloc[1::2][0].apply(lambda x: x.split()[1:])
result = []
for h, d in zip(headers.values, data.values):
    result.append(pd.Series(d, index=h))
pd.concat(result, axis=1).fillna(0).T
A B C G H T Y Z
0 1 2 1 0 0 0 0 0
1 0 1 3 1 1 2 0 0
2 4 1 5 0 1 2 3 1

Why not manipulate the data into a dict on input and then construct the DataFrame:
>>> with open(...) as f:
...     d = {}
...     for marker, bins in zip(f, f):
...         z = zip(marker.split(), bins.split())
...         _, bin = next(z)
...         d[bin] = dict(z)
>>> pd.DataFrame(d).fillna(0).T
A B C G H T Y Z
Bin_1 1 2 1 0 0 0 0 0
Bin_2 0 1 3 1 1 2 0 0
Bin_3 4 1 5 0 1 2 3 1
If you really need the column axis name:
>>> pd.DataFrame(d).fillna(0).rename_axis('marker').T
marker A B C G H T Y Z
Bin_1 1 2 1 0 0 0 0 0
Bin_2 0 1 3 1 1 2 0 0
Bin_3 4 1 5 0 1 2 3 1

Related

How to count cumulatively with conditions on a groupby?

Say I have a DataFrame, filled as below, with the column 'Key' taking one of five possible values A, B, C, D, X. I would like to add a new column 'Res' that counts these letters cumulatively and resets each time it hits an X.
For example:
Key Res
0 D 1
1 X 0
2 B 1
3 C 2
4 D 3
5 X 0
6 A 1
7 C 2
8 X 0
9 X 0
Can anyone assist with how I can achieve this?
A possible solution: take the running count of non-'X' rows and subtract its value as of the most recent 'X', which resets the count after every 'X':
a = df.Key.ne('X')
df['new'] = (a.cumsum() - a.cumsum().where(~a).ffill().fillna(0)).astype(int)
Another possible solution, which is more basic than the previous one, but much faster (several orders of magnitude):
import numpy as np

s = np.zeros(len(df), dtype=int)
for i in range(len(df)):
    if df.Key[i] != 'X':
        s[i] = s[i-1] + 1  # on the first iteration s[-1] is still 0, so this is safe
df['new'] = s
Output:
Key Res new
0 D 1 1
1 X 0 0
2 B 1 1
3 C 2 2
4 D 3 3
5 X 0 0
6 A 1 1
7 C 2 2
8 X 0 0
9 X 0 0
Example
df = pd.DataFrame(list('DXBCDXACXX'), columns=['Key'])
df
Key
0 D
1 X
2 B
3 C
4 D
5 X
6 A
7 C
8 X
9 X
Code (a copy of the first row is prepended so the leading run, which has no 'X' before it, still starts counting at 1; the extra row is dropped again with iloc[1:]):
df1 = pd.concat([df.iloc[[0]], df])
grouper = df1['Key'].eq('X').cumsum()
df1.assign(Res=df1.groupby(grouper).cumcount()).iloc[1:]
result:
Key Res
0 D 1
1 X 0
2 B 1
3 C 2
4 D 3
5 X 0
6 A 1
7 C 2
8 X 0
9 X 0

Pandas: Remove rows except the first new occurrence of a value

I have a dataframe
x y
a 1
b 1
c 1
d 0
e 0
f 0
g 1
h 1
i 0
j 0
I want to remove the rows with 0 except every first new occurrence of a 0 after a 1, so the resultant dataframe should be:
x y
a 1
b 1
c 1
d 0
g 1
h 1
i 0
Is it possible to do this without creating groups or row-by-row iteration, to keep it fast, since I have a big dataframe?
Let us try diff with cumsum to create the consecutive-value groups, then use duplicated:
out = df[~df.y.diff().ne(0).cumsum().duplicated() | df.y.eq(1)].copy()
Out[352]:
x y
0 a 1
1 b 1
2 c 1
3 d 0
6 g 1
7 h 1
8 i 0
Check consecutive similarity using shift():
df[df.y.ne(0) | (df.y.eq(0) & df.y.shift(1).ne(0))]
x y
0 a 1
1 b 1
2 c 1
3 d 0
6 g 1
7 h 1
8 i 0

Pandas crosstab on very large matrix?

I have a dataframe of dimensions (42 million rows, 6 columns) that I need to do a crosstab on to get counts of specific events for each person in the dataset that will result in a very large sparse matrix of size ~1.5 million rows by 36,000 columns. When I try this with pandas crosstab (pd.crosstab) function I run out of memory on my system. Is there some way to do this crosstab in chunks and join the resulting dataframes? To be clear, each row of the crosstab will count the number of times an event occurred for each person in the dataset (i.e. each row is a person, each column entry is the count of the times that person participated in a specific event). The ultimate goal is to factor the resulting person-event matrix using PCA/SVD.
Setup
import numpy as np
import pandas as pd

source_0 = [*'ABCDEFGHIJ']
source_1 = [*'abcdefghij']
np.random.seed([3, 1415])
df = pd.DataFrame({
    'source_0': np.random.choice(source_0, 100),
    'source_1': np.random.choice(source_1, 100),
})
df
source_0 source_1
0 A b
1 C b
2 H f
3 D a
4 I h
.. ... ...
95 C f
96 F a
97 I j
98 I d
99 J b
Use pd.factorize to get an integer factorization... and unique values
ij, tups = pd.factorize(list(zip(*map(df.get, df))))
result = dict(zip(tups, np.bincount(ij)))
This is already a compact form. But you can convert it to a pandas.Series and unstack to verify it is what we want.
pd.Series(result).unstack(fill_value=0)
a b c d e f g h i j
A 2 1 0 0 0 1 0 2 1 1
B 0 1 0 0 0 1 0 1 0 1
C 0 3 1 3 0 2 0 0 0 0
D 3 0 0 2 0 0 1 3 0 2
E 3 0 0 1 0 1 2 5 0 0
F 4 0 2 1 1 1 1 1 1 0
G 0 2 1 0 0 2 3 0 3 1
H 1 3 2 0 2 1 1 1 0 2
I 2 2 1 1 2 0 1 2 0 2
J 0 1 1 0 1 1 0 1 0 1
Using sparse
from scipy.sparse import csr_matrix
i, r = pd.factorize(df['source_0'])
j, c = pd.factorize(df['source_1'])
ij, tups = pd.factorize(list(zip(i, j)))
a = csr_matrix((np.bincount(ij), tuple(zip(*tups))))
b = pd.DataFrame.sparse.from_spmatrix(a, r, c).sort_index().sort_index(axis=1)
b
a b c d e f g h i j
A 2 1 0 0 0 1 0 2 1 1
B 0 1 0 0 0 1 0 1 0 1
C 0 3 1 3 0 2 0 0 0 0
D 3 0 0 2 0 0 1 3 0 2
E 3 0 0 1 0 1 2 5 0 0
F 4 0 2 1 1 1 1 1 1 0
G 0 2 1 0 0 2 3 0 3 1
H 1 3 2 0 2 1 1 1 0 2
I 2 2 1 1 2 0 1 2 0 2
J 0 1 1 0 1 1 0 1 0 1
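Since the stated end goal is factoring the person-event matrix with PCA/SVD, here is a minimal sketch of that last step (my addition, not part of the original answer; it assumes scikit-learn is available and reuses the toy csr_matrix a from above):
from sklearn.decomposition import TruncatedSVD

# TruncatedSVD works directly on scipy sparse input, so the matrix never has to be densified.
# n_components=5 fits the 10x10 toy matrix; the real ~1.5M x 36k matrix would take a larger value.
svd = TruncatedSVD(n_components=5, random_state=0)
embedding = svd.fit_transform(a)  # rows stay aligned with the row labels in r
print(embedding.shape)
print(svd.explained_variance_ratio_.sum())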

Delete row and column from pandas dataframe

I have a CSV file which contains a symmetric adjacency matrix, which means the rows and columns have equivalent labels.
I would like to import this into a pandas dataframe, ideally have some GUI pop up and ask for a list of items to delete, then take that list, set the values in the corresponding rows and columns to zero, and return a separate altered dataframe.
In short, something that takes the following matrix
a b c d e
a 0 3 5 3 5
b 3 0 2 4 5
c 5 2 0 1 7
d 3 4 1 0 9
e 5 5 7 9 0
pops up a simple interface asking "which regions should be deleted?" with a line to enter those regions,
and, if say c and e are entered,
returns
a b c d e
a 0 3 0 3 0
b 3 0 0 4 0
c 0 0 0 0 0
d 3 4 0 0 0
e 0 0 0 0 0
with the altered entries set to zero.
It should be able to do this for as many regions as are entered, which can be up to 379, ideally separated by commas.
Set columns and rows by index values with DataFrame.loc:
vals = ['c','e']
df.loc[vals, :] = 0
df[vals] = 0
#alternative
#df.loc[:, vals] = 0
print (df)
a b c d e
a 0 3 0 3 0
b 3 0 0 4 0
c 0 0 0 0 0
d 3 4 0 0 0
e 0 0 0 0 0
Another solution is create boolean mask with numpy broadcasting and set values by DataFrame.mask:
mask = df.index.isin(vals) | df.columns.isin(vals)[:, None]
df = df.mask(mask, 0)
print (df)
a b c d e
a 0 3 0 3 0
b 3 0 0 4 0
c 0 0 0 0 0
d 3 4 0 0 0
e 0 0 0 0 0
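A side note of my own, since the title says delete: if the rows and columns should actually be dropped rather than zeroed out, DataFrame.drop takes both axes at once:
df.drop(index=vals, columns=vals)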
Start by importing the csv:
import pandas as pd
adj_matrix = pd.read_csv("file/name/to/your.csv", index_col=0)
Then request the input:
regions = [r.strip() for r in input("Please enter the regions to delete, separated by commas: ").split(',')]
adj_matrix.loc[regions, :] = 0
adj_matrix.loc[:, regions] = 0
Now adj_matrix should be in the form you want.
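And if an actual pop-up is preferred over a console prompt (the question mentions a GUI), here is a minimal sketch using the standard-library tkinter.simpledialog; the dialog wording and the comma-separated input format are my assumptions:
import tkinter as tk
from tkinter import simpledialog

root = tk.Tk()
root.withdraw()  # hide the empty main window; only the dialog is wanted
answer = simpledialog.askstring("Delete regions",
                                "Which regions should be deleted (comma-separated)?")
regions = [r.strip() for r in answer.split(',')] if answer else []

altered = adj_matrix.copy()  # leave the original untouched and return an altered copy
altered.loc[regions, :] = 0
altered.loc[:, regions] = 0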

Creating a table in python

I have a data set in Excel. A sample of the data is given below. Each row contains a number of items, one item in each column. The data has no headers either.
a b a d
g z f d a
e
dd gg dd g f r t
I want to create a table which should look like the one below. It should count the items in each row and display the count by row. I don't know a priori how many items are in the table.
row# a b d g z f e dd gg r t
1 2 1 1 0 0 0 0 0 0 0 0
2 1 0 1 1 1 1 0 0 0 0 0
3 0 0 0 0 0 0 1 0 0 0 0
4 0 0 0 1 0 1 0 2 1 1 1
I am not an expert in python and any assistance is very much appreciated.
Use get_dummies + sum:
df = pd.read_csv(file, sep=r'\s+', names=range(100)).stack()  # whitespace-separated; the wide names range accounts for the ragged rows and stack drops the gaps
df.str.get_dummies().sum(level=0)
a b d dd e f g gg r t z
0 2 1 1 0 0 0 0 0 0 0 0
1 1 0 1 0 0 1 1 0 0 0 1
2 0 0 0 0 1 0 0 0 0 0 0
3 0 0 0 2 0 1 1 1 1 1 0
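One caveat from my side: the level argument of sum was removed in pandas 2.0, so on current versions the same aggregation is spelled with a groupby on the first index level:
df.str.get_dummies().groupby(level=0).sum()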
