I have several sets of data, comprising about 3,000,000 measurements each. In many cases, 'identical' tests yielded different results. I am trying to create a scheme such that these inconsistencies are apparent in the processed data. I am using pandas for my analyses.
Though not my real case, here is a similar example: Suppose I have a set of chemicals, A, B, C, ..., and I mix them together pairwise and note whether a reaction takes place. Let's denote no reaction by '0' and a measurable reaction by '1'. Our data might look like this:
chemical 1  chemical 2  Do they react?  comments
A           B           1               comment 1
D           E           0               comment 2
A           B           1               comment 3
F           G           1               comment 4
A           B           0               comment 5
I am thinking that a workable aim would be to use a sentinel (say, '2') to indicate this inconsistency and summarize the comments for later examination:
chemical 1  chemical 2  Do they react?  comments
A           B           2               comment 1; comment 3; comment 5
D           E           0               comment 2
F           G           1               comment 4
I have developed code to identify the tuples (chemical 1, chemical 2) that lead to these inconsistent results. There are about 30,000 of them in my first dataset. This calculation runs fairly quickly.
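(A minimal sketch of that idea, using the column names above: group by the pair and count distinct outcomes with nunique. This is for illustration only, not necessarily the exact code.)
# Pairs whose replicate measurements disagree are those whose reaction
# column takes more than one distinct value within the group.
outcome_counts = df.groupby(["chemical 1", "chemical 2"])["Do they react?"].nunique()
replicates = list(outcome_counts[outcome_counts > 1].index)  # [(chem1, chem2), ...]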
I have also created some pandas code to give me the desired dataframe. Here is an adaptation for the type of data above:
def uniquify_rows(df, replicates, sentinel=2):
    for chemical_1, chemical_2 in replicates:
        # Boolean mask selecting every row that measured this pair
        indices = (df["chemical 1"] == chemical_1) & (df["chemical 2"] == chemical_2)
        rows = df[indices]
        comments = ";".join(c for c in rows["comments"].values)
        field_vals = rows.iloc[0].values  # chem1, chem2, react, comments
        field_vals[2], field_vals[3] = sentinel, comments
        # Overwrite all matching rows with the summarized values (.loc, since
        # indices is a boolean mask), then collapse them into a single row
        df.loc[indices] = field_vals
    df.drop_duplicates(inplace=True)
    return df
This code seems to work on my data, but is extremely slow. Being relatively new to pandas, I suspect that I am doing something very inefficient.
Any ideas to speed up this task?
Thank you.
Kind regards,
gyro
Let's try with groupby aggregate:
sentinel = 2
df = df.groupby(['chemical 1', 'chemical 2'], as_index=False).aggregate({
    'Do they react?': lambda s: s.iat[0] if (s.iat[0] == s).all() else sentinel,
    'comments': ';'.join
})
df:
chemical 1 chemical 2 Do they react? comments
0 A B 2 comment 1;comment 3;comment 5
1 D E 0 comment 2
2 F G 1 comment 4
So, I am working with data from my research lab and am trying to sort it and move it around, etc. Most of the details aren't important to my issue, and I don't want to go into them for confidentiality reasons, but I have a big table with columns and rows, and I want to switch the elements of two columns ONLY in one row.
The extremely bad attempt at code I have for it is this (I renamed the variables to be more generic, though, so they still make sense):
for x in df.columna.values:
    *some if statements*
    df.loc[df.index([df.loc[df['columna'] == x]]), ['columnb', 'columna']] = df[df.index([df.loc[df['columna'] == x]]), ['columna', 'columnb']].numpy()
I am aware that the code I have is trash (and so is the method, with the for loops and if statements; I know I can abstract it a TON, but I just want to figure out a way to make it work first, and then I will clean it up and make it prettier and more efficient. I learned pandas existed on Tuesday, so I am not an expert), but I think my issue lies in the way I'm getting the row.
One error I was getting for a while was that the method I used to get the row gave me 1 row x 22 columns, and I think I needed the name/index of the row instead, which is why the index function is now there. However, I am now getting the error:
TypeError: 'RangeIndex' object is not callable
And I am just so confused all around. Sorry I've written a ton of text; basically: is there any simpler way to just switch the elements of two columns for one specific row (in terms of x, an element in that row)?
I think my biggest issue is getting the row's "name" in the format pandas wants, although I may have a ton of other problems because honestly I am just really lost.
You're sooooo close! The error you're getting stems from calling df.index([df.loc[df['columna'] == x]]): df.index is an index object, not a function, so the parentheses are unneeded here and it should read as df.index[df.loc[df['columna'] == x]].
However, here's an example of how to swap values between columns when provided a value (or multiple values) to swap at.
Sample Data
import pandas as pd

df = pd.DataFrame({
    "A": list("abcdefg"),
    "B": [1, 2, 3, 4, 5, 6, 7]
})
print(df)
A B
0 a 1
1 b 2
2 c 3
3 d 4
4 e 5
5 f 6
6 g 7
Let's say we're going to swap the values where A is either "c" or "f". To do this we need to first create a mask that just selects those rows. To accomplish this, we can use .isin. Then to perform our swap, we actually take the same exact approach you had! Including the .to_numpy() is very important, because without it Pandas will actually realign your columns for you and cause the values to not be swapped. Putting it all together:
swap_at = ["c", "f"]
swap_at_mask = df["A"].isin(swap_at) # mask where columns "A" is either equal to "c" or "f"
# Without the `.to_numpy()` at the end, pandas will realign the Dataframe
# and no values will be swapped
df.loc[swap_at_mask, ["A", "B"]] = df.loc[swap_at_mask, ["B", "A"]].to_numpy()
print(df)
A B
0 a 1
1 b 2
2 3 c
3 d 4
4 e 5
5 6 f
6 g 7
I think it was probably a syntax problem. I am assuming you are using tensorflow with the numpy() function? Try this; it switches the columns based on the code you provided:
for x in df.columna.values:
    # *some if statements*
    df.loc[
        (df["columna"] == x),
        ['columna', 'columnb']
    ] = df.loc[(df["columna"] == x), ['columnb', 'columna']].values  # .values avoids column realignment
I am also a beginner and would recommend you aim to make it pretty from the get go. It will save you a lot of extra time in the long run. Trial and error!
I don't understand why I am getting the dreaded warning when I am doing exactly as instructed by the official documentation.
We have a dataframe called a
a = pd.DataFrame(data = [['Tom',1],
                         ['Tom',1],
                         ['Dick',1],
                         ['Dick',1],
                         ['Harry',1],
                         ['Harry',1]], columns = ['Col1', 'Col2'])
a
Out[377]:
Col1 Col2
0 Tom 1
1 Tom 1
2 Dick 1
3 Dick 1
4 Harry 1
5 Harry 1
First we create a "holder" dataframe:
holder = a
Then we create a subset of a:
c = a.loc[a['Col1'] == 'Tom',:]
c
Out[379]:
Col1 Col2
0 Tom 1
1 Tom 1
We create another subset, d, which will be added to (a slice of) the previous subset c, but once we try to add d to c, we get the warning:
d = a.loc[a['Col1'] == 'Tom','Col2']
d
Out[389]:
0 1
1 1
c.loc[:,'Col2'] += d
C:\Users\~\anaconda3\lib\site-packages\pandas\core\indexing.py:494: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
self.obj[item] = s
I would like to understand what I am doing wrong, because I use this logic very often (coming from R, where everything is not a darn object).
After noticing a different issue, I found a solution.
Whenever you say
dataframe_A = dataframe_B
you need to proceed with caution because Python, it seems, joins these two dataframes by the hip, so to speak. If you make changes to dataframe_B your dataframe_A will also change!
I understand just enough to fix the problem by using .copy(deep=True) where python will create a full and independent copy so that you can make changes to one without affecting the other one.
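A minimal sketch of the fix, applied to the frames above:
holder = a.copy(deep=True)   # an independent copy, not a second name for the same object

# Likewise, take an explicit copy of the subset; c then owns its own data,
# so modifying it neither warns about nor affects the original frame a.
c = a.loc[a['Col1'] == 'Tom', :].copy(deep=True)
d = a.loc[a['Col1'] == 'Tom', 'Col2']
c.loc[:, 'Col2'] += d   # no SettingWithCopyWarning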
On further investigation, and for those interested, it comes down to the fact that assignment binds a second name to the same underlying object (Python works with references rather than copies), a concept whose full scope is beyond this specific question.
I have a numpy array that looks like the example below.
The different columns represent different events.
The 1s and 0s represent whether each event occurs. To be explicit, a 1 in column A means that event A occurs.
The columns moving from left to right show progression through time, so A occurs first, then B, then C, etc.
There are 20 columns and about 1M rows.
A B C D ...
1 1 1 1
1 1 1 0
1 1 0 0
1 0 0 0
...
The events are, however, related. Let's say that events A and C are related. If A does not occur, then C cannot. As such, what I am trying to do is, as I understand it:
For each pair of related columns (say A and C):
Select rows where A = 0 and C = 1
Set the elements in C in this subset of rows to 0
To be clear, there are situations where there are more than one related event. For example, A, C, F and J could be related. Also, the columns in the numpy array are not named -- I included the A, B, C, &c headings for ease of explanation.
I have been doing this manually via:
df[(df[:,0] == 0) & (df[:,2] == 1), 2] = 0
How can I store the related events so that I can loop through them, and then use this to update the numpy array?
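For illustration, a minimal sketch of one way to store the related events and apply the rule (the array and the (parent, child) pairs here are hypothetical stand-ins):
import numpy as np

# Small stand-in for the real (1M x 20) array; columns are A, B, C, D
arr = np.array([
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [0, 1, 1, 0],   # A did not occur, so C should be forced to 0
    [1, 0, 0, 0],
])

# Hypothetical (parent, child) column-index pairs: if the parent event
# did not occur, the child event cannot occur either.
related_pairs = [(0, 2)]   # "if A does not occur, then C cannot"

for parent, child in related_pairs:
    mask = (arr[:, parent] == 0) & (arr[:, child] == 1)
    arr[mask, child] = 0   # force the child event to 0 in those rows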
For data like this:
import pandas as pd
df=pd.DataFrame({'group1': list('AABBCCAABBCC'),'group2':list('ZYYXYXXYZXYZ')})
I figured out with some difficulty that to make a frequency table of rows and columns, the most common way is as follows:
print(df.pivot_table(index='group1', columns='group2', aggfunc=len, fill_value=0))
by which I get
group2 X Y Z
group1
A 1 2 1
B 2 1 1
C 1 2 1
I am just wondering if there is any 'faster' way to generate the same table. Not that there is anything wrong with it, but what I mean is something that involves less typing (without me having to write a custom function).
I am just comparing this with R, where the same result could have been achieved by
table(df$group1,df$group2)
Compared to this, entering non-default parameters like aggfunc and fill_value, and typing out the argument names index and columns, seems like a lot of additional effort.
In general my experience (very limited) is that the R-equivalent functions in Python are very similar in conciseness.
Any suggestions on alternative methods would be great. I will need to make several of these tables with my data.
Here is an alternative method.
>>> df.groupby(['group1', 'group2']).group2.count().unstack().fillna(0)
group2 X Y Z
group1
A 1 2 1
B 2 1 1
C 1 2 1
pd.crosstab(df['group1'],df['group2'])
This was exactly what I was looking for. Did not find it when I was searching for it initially.
I have a list of tuples which I turned into a DataFrame with thousands of rows, like this:
frag mass prot_position
0 TFDEHNAPNSNSNK 1573.675712 2
1 EPGANAIGMVAFK 1303.659458 29
2 GTIK 417.258734 2
3 SPWPSMAR 930.438172 44
4 LPAK 427.279469 29
5 NEDSFVVWEQIINSLSALK 2191.116099 17
...
and I have the following rule:
def are_dif(m1, m2, ppm=10):
    # True if m1 and m2 differ by more than `ppm` parts per million
    if abs((m1 - m2) / m1) < ppm * 0.000001:
        v = False
    else:
        v = True
    return v
So, I only want the "frag"s whose mass differs from all the other fragments' masses. How can I achieve that "selection"?
Then, I have a list named "pinfo" that contains:
d = {'id':id, 'seq':seq_code, "1HW_fit":hits_fit}
# one for each protein
# each dictionary has the position of the protein that it describes.
So, I want to add 1 to the "hits_fit" value in the dictionary corresponding to that protein.
If I'm understanding correctly (not sure if I am), you can accomplish quite a bit by just sorting. First though, let me adjust the data to have a mix of close and far values for mass:
Unnamed: 0 frag mass prot_position
0 0 TFDEHNAPNSNSNK 1573.675712 2
1 1 EPGANAIGMVAFK 1573.675700 29
2 2 GTIK 417.258734 2
3 3 SPWPSMAR 417.258700 44
4 4 LPAK 427.279469 29
5 5 NEDSFVVWEQIINSLSALK 2191.116099 17
Then I think you can do something like the following to select the "good" ones. First, create 'pdiff' (percent difference) to see how close mass is to the nearest neighbors:
ppm = .00001
df = df.sort_values('mass')
df['pdiff'] = (df.mass - df.mass.shift()) / df.mass
Unnamed: 0 frag mass prot_position pdiff
3 3 SPWPSMAR 417.258700 44 NaN
2 2 GTIK 417.258734 2 8.148421e-08
4 4 LPAK 427.279469 29 2.345241e-02
1 1 EPGANAIGMVAFK 1573.675700 29 7.284831e-01
0 0 TFDEHNAPNSNSNK 1573.675712 2 7.625459e-09
5 5 NEDSFVVWEQIINSLSALK 2191.116099 17 2.817926e-01
The first and last data lines make this a little tricky so this next line backfills the first line and repeats the last line so that the following mask works correctly. This works for the example here, but might need to be tweaked for other cases (but only as far as the first and last lines of data are concerned).
df = df.iloc[list(range(len(df))) + [-1]].bfill()
df[ (df['pdiff'] > ppm) & (df['pdiff'].shift(-1) > ppm) ]
Results:
Unnamed: 0 frag mass prot_position pdiff
4 4 LPAK 427.279469 29 0.023452
5 5 NEDSFVVWEQIINSLSALK 2191.116099 17 0.281793
Sorry, I don't understand the second part of the question at all.
Edit to add: As mentioned in a comment to @AmiTavory's answer, I think possibly the sorting approach and groupby approach could be combined for a simpler answer than this. I might try at a later time, but everyone should feel free to give this a shot themselves if interested.
Here's something that's slightly different from what you asked, but it is very simple, and I think gives a similar effect.
Using numpy.round, you can create a new column
import numpy as np
df['roundedMass'] = np.round(df.mass, 6)
Following that, you can do a groupby of the frags on the rounded mass, and use nunique to count the number of unique frags in each group. Then filter for the groups of size 1.
So, the number of frags per bin is:
df.frag.groupby(np.round(df.mass, 6)).nunique()
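A sketch of the filtering step might then look like this (keeping only the rows whose rounded mass appears exactly once; the column names follow the question's dataframe):
rounded = np.round(df.mass, 6)
bin_sizes = rounded.groupby(rounded).transform('size')   # how many rows share each rounded mass
unique_frags = df[bin_sizes == 1]                         # frags whose mass matches no other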
Another solution is to create a duplicate of your list (if you need to preserve the original for further processing later), iterate over it, and remove all elements that do not satisfy your rule (comparing m1 and m2).
You will get a new list with all unique masses.
Just don't forget that if you do need to use the original list later, you will need to use deepcopy.
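A rough sketch of that idea, reusing the are_dif rule from the question (the sample masses here are made up):
import copy

masses = [1573.675712, 1573.675700, 417.258734, 930.438172, 427.279469, 2191.116099]

# Work on a copy so the original list is preserved for later processing
candidates = copy.deepcopy(masses)

# Keep only the masses that differ (by more than 10 ppm) from every other mass
unique_masses = [
    m1 for i, m1 in enumerate(candidates)
    if all(are_dif(m1, m2) for j, m2 in enumerate(masses) if i != j)
]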