Converting a single pandas index into a three-level MultiIndex in Python

I have some data in a pandas dataframe which looks like this:
gene VIM
time:2|treatment:TGFb|dose:0.1 -0.158406
time:2|treatment:TGFb|dose:1 0.039158
time:2|treatment:TGFb|dose:10 -0.052608
time:24|treatment:TGFb|dose:0.1 0.157153
time:24|treatment:TGFb|dose:1 0.206030
time:24|treatment:TGFb|dose:10 0.132580
time:48|treatment:TGFb|dose:0.1 -0.144209
time:48|treatment:TGFb|dose:1 -0.093910
time:48|treatment:TGFb|dose:10 -0.166819
time:6|treatment:TGFb|dose:0.1 0.097548
time:6|treatment:TGFb|dose:1 0.026664
time:6|treatment:TGFb|dose:10 -0.008032
where the left is the index. This is just a subsection of the data, which is actually much larger. The index is composed of three components: time, treatment and dose. I want to reorganize this data so that I can access it easily by slicing. The way to do this is with a pandas MultiIndex, but I don't know how to convert my DataFrame with one index into one with three. Does anybody know how to do this?
To clarify, the desired output here is the same data with a three-level index, the outer level being treatment, the middle being dose, and the inner being time. This would be useful because I could then access the data with something like df['time']['dose'] or `df[0]` (or something to that effect at least).

You can first remove the unnecessary substrings with replace (the index has to be converted to a Series by to_series, because replace doesn't work on an Index yet) and then use split with expand=True. Finally, set the index names with rename_axis (new in pandas 0.18.0):
df.index = df.index.to_series().replace({'time:':'','treatment:': '','dose:':''}, regex=True)
df.index = df.index.str.split('|', expand=True)
df = df.rename_axis(('time','treatment','dose'))
print (df)
                          VIM
time treatment dose
2    TGFb      0.1  -0.158406
               1     0.039158
               10   -0.052608
24   TGFb      0.1   0.157153
               1     0.206030
               10    0.132580
48   TGFb      0.1  -0.144209
               1    -0.093910
               10   -0.166819
6    TGFb      0.1   0.097548
               1     0.026664
               10   -0.008032
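Once the MultiIndex is in place, you can reorder the levels to the order asked for in the question (treatment outer, dose middle, time inner) and slice with xs or loc. A minimal sketch of that follow-up, assuming the df built above (note the level values are strings such as '2' and '0.1', because they came from str.split):
df = df.reorder_levels(['treatment', 'dose', 'time']).sort_index()

# every row for treatment 'TGFb'
print(df.xs('TGFb', level='treatment'))

# dose 0.1 of 'TGFb' at all time points
print(df.loc[('TGFb', '0.1'), 'VIM'])

# a cross-section at a single time point, across all doses
print(df.xs('2', level='time'))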

Related

Removing numbers from dataframe that lie within range

I have a pandas dataframe that contains values between -1000 and 1000. I want to eliminate all the numbers in the range -0.00001 to 0.00001, i.e. replace them with NaN. It is worth mentioning that my df contains numerous instances of very small positive and negative numbers that I want to include within this range as well, e.g. 6.26478E-52.
How do I go about doing this?
P.S. I am attaching an image of the first few rows of my df for reference.
IIUC, if you need to mask values less than -0.00001 or less than 0.00001:
df = df.mask(df.lt(-0.00001) | df.lt(0.00001))
which is the same as masking everything below 0.00001:
df = df.mask(df.lt(0.00001))
Or, if you need the values between the two bounds:
df = df.mask(df.gt(-0.00001) & df.lt(0.00001))
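For illustration, a small self-contained sketch of the "between" version on made-up data, chosen only to show that tiny magnitudes such as 6.26478E-52 fall inside the masked range:
import pandas as pd

# toy frame (hypothetical values)
df = pd.DataFrame({'a': [-1000.0, -0.000005, 6.26478e-52, 0.5],
                   'b': [1000.0, 0.000003, -2.0, 0.0]})

# replace everything strictly between -0.00001 and 0.00001 with NaN
df = df.mask(df.gt(-0.00001) & df.lt(0.00001))
print(df)
# -0.000005, 6.26478e-52, 0.000003 and 0.0 become NaN; the larger values are kept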

Assign value to dataframe from another dataframe based on two conditions

I am trying to assign values from a column in df2['values'] to a column df1['values']. However values should only be assigned if:
df2['category'] is equal to the df1['category'] (rows are part of the same category)
df1['date'] is in df2['date_range'] (date is in a certain range for a specific category)
So far I have this code, which works, but is far from efficient, since it takes me two days to process the two dfs (df1 has ca. 700k rows).
for i in df1.category.unique():
    for j in df2.category.unique():
        if i == j:  # matching categories
            for ia, ra in df1.loc[df1['category'] == i].iterrows():
                for ib, rb in df2.loc[df2['category'] == j].iterrows():
                    if df1['date'][ia] in df2['date_range'][ib]:
                        df1.loc[ia, 'values'] = rb['values']
                        break
I read that I should try to avoid using for-loops when working with dataframes. List comprehensions are great, but since I do not have a lot of experience yet, I struggle to formulate more complicated code.
How can I approach this problem more efficiently? What key aspects should I think about when iterating over dataframes with conditions?
The code above also tends to skip some rows or assign them wrongly, so I need to do a cleanup afterwards. And the biggest problem is that it is really slow.
Thank you.
Some df1 insight:
df1.head()
date category
0 2015-01-07 f2
1 2015-01-26 f2
2 2015-01-26 f2
3 2015-04-08 f2
4 2015-04-10 f2
Some df2 insight:
df2.date_range[0]
DatetimeIndex(['2011-11-02', '2011-11-03', '2011-11-04', '2011-11-05',
'2011-11-06', '2011-11-07', '2011-11-08', '2011-11-09',
'2011-11-10', '2011-11-11', '2011-11-12', '2011-11-13',
'2011-11-14', '2011-11-15', '2011-11-16', '2011-11-17',
'2011-11-18'],
dtype='datetime64[ns]', freq='D')
df2 other two columns:
df2[['values','category']].head()
values category
0 01 f1
1 02 f1
2 2.1 f1
3 2.2 f1
4 03 f1
Edit: Corrected erroneous code and added OP input from a comment
Alright, so if you want to join the dataframes on matching categories, you can merge them:
import pandas as pd
df3 = df1.merge(df2, on = "category")
Next, since date is a timestamp and the "date_range" is actually generated from two columns (per the OP's comment), we instead use:
mask = (df3["startdate"] <= df3["date"]) & (df3["date"] <= df3["enddate"])
subset = df3.loc[mask]
Now we get back to df1 and merge on the common dates while keeping all the values from df1. This will create NaN for the subset values where they didn't match with df1 in the earlier merge.
As such, we set df1["values"] where the entries in common are not NaN and we leave them be otherwise.
import numpy as np

common_dates = df1.merge(subset, on="date", how="left")  # keeping df1 values
df1["values"] = np.where(common_dates["values_y"].notna(),
                         common_dates["values_y"], df1["values"])
N.B.: If more than one df1["date"] matches the date range, you'll have to drop some values first, otherwise the duplicates mess up the approach described above.
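Putting the steps together, here is a minimal end-to-end sketch with toy data (the startdate/enddate column names follow the OP's comment; the frames and values below are made up purely for illustration):
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'date': pd.to_datetime(['2015-01-07', '2015-01-26']),
                    'category': ['f2', 'f2'],
                    'values': [np.nan, np.nan]})
df2 = pd.DataFrame({'category': ['f2'],
                    'startdate': pd.to_datetime(['2015-01-01']),
                    'enddate': pd.to_datetime(['2015-01-10']),
                    'values': [7.0]})

# merge on category; the two 'values' columns get suffixed to values_x / values_y
df3 = df1.merge(df2, on="category")
mask = (df3["startdate"] <= df3["date"]) & (df3["date"] <= df3["enddate"])
subset = df3.loc[mask]

# left-merge back onto df1 and keep df2's value only where a match exists
common_dates = df1.merge(subset, on="date", how="left")
df1["values"] = np.where(common_dates["values_y"].notna(),
                         common_dates["values_y"], df1["values"])
print(df1)  # only 2015-01-07 falls inside the range, so only that row gets 7.0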
You could accomplish the first point:
1. df2['category'] is equal to the df1['category']
with the use of a join.
You could then use a for loop to filter out the data points from df1[date] inside the merged dataframe that are not covered by df2[date_range]. Unfortunately, I would need more information about the content of df1[date] and df2[date_range] to write code here that does exactly that.

Create Multiple New Columns Based on Pipe-Delimited Column in Pandas

I have a pandas dataframe with a pipe delimited column with an arbitrary number of elements, called Parts. The number of elements in these pipe-strings varies from 0 to over 10. The number of unique elements contained in all pipe-strings is not much smaller than the number of rows (which makes it impossible for me to manually specify all of them while creating new columns).
For each row, I want to create a new column that acts as an indicator variable for each element of the pipe-delimited list. For instance, a row like
...'Parts'...
...'12|34|56'...
should be transformed to
...'Part_12' 'Part_34' 'Part_56'...
...    1         1         1    ...
Because there are a lot of unique parts, these columns are obviously going to be sparse - mostly zeros, since each row only contains a small fraction of the unique parts.
I haven't found any approach that doesn't require manually specifying the columns (for instance, Pandas Dataframe: split column into multiple columns, right-align inconsistent cell entries).
I've also looked at pandas' melt, but I don't think that's the appropriate tool.
The way I know how to solve it would be to pipe the raw CSV to another python script and deal with it on a char-by-char basis, but I need to work within my existing script since I will be processing hundreds of CSVs in this manner.
Here's a better illustration of the data
  ID  YEAR      AMT           PARTZ
1202  2007    99.34
9321  1988  1012.99       2031|8942
2342  2012   381.22  1939|8321|Amx3
You can use get_dummies and add_prefix:
df.Parts.str.get_dummies().add_prefix('Part_')
Output:
Part_12 Part_34 Part_56
0 1 1 1
Edit, per the comment, for counting duplicates:
df = pd.DataFrame({'Parts':['12|34|56|12']}, index=[0])
pd.get_dummies(df.Parts.str.split('|',expand=True).stack()).sum(level=0).add_prefix('Part_')
Output:
Part_12 Part_34 Part_56
0 2 1 1
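Applied to the question's own illustration (where the column is called PARTZ), a sketch could look like the following; rows with a missing PARTZ simply get all-zero indicator columns, and pd.concat joins the indicators back onto the original frame. The frame below is rebuilt by hand from the question:
import numpy as np
import pandas as pd

df = pd.DataFrame({'ID': [1202, 9321, 2342],
                   'YEAR': [2007, 1988, 2012],
                   'AMT': [99.34, 1012.99, 381.22],
                   'PARTZ': [np.nan, '2031|8942', '1939|8321|Amx3']})

dummies = df['PARTZ'].str.get_dummies().add_prefix('Part_')  # NaN row -> all zeros
out = pd.concat([df, dummies], axis=1)
print(out)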

how to unstack a pandas dataframe with two sets of variables

I have a table that looks like this. Read from a CSV file, so no levels, no fancy indices, etc.
ID date1 amount1 date2 amount2
x 15/1/2015 100 15/1/2016 80
The actual file I have goes up to date5 and amount5.
How can I convert it to:
ID date amount
x 15/1/2015 100
x 15/1/2016 80
If I only had one variable, I would use pandas.melt(), but with two variables I really don't know how to do it quickly.
I could do it manually by exporting to an in-memory sqlite3 database and doing a union. Doing unions in pandas is more annoying because, unlike SQL, it requires all field names to be the same, so in pandas I'd have to create a temporary dataframe and rename all the fields: a dataframe for date1 and amount1, rename the fields to just date and amount, then do the same for all the other events, and only then can I do pandas.concat.
Any suggestions? Thanks!
Here is one way:
>>> pandas.concat(
...     [pandas.melt(x, id_vars='ID', value_vars=x.columns[1::2].tolist(), value_name='date'),
...      pandas.melt(x, value_vars=x.columns[2::2].tolist(), value_name='amount')
...     ],
...     axis=1
... ).drop('variable', axis=1)
  ID       date amount
0  x  15/1/2015    100
1  x  15/1/2016     80
The idea is to do two melts, one for each set of columns, then concat them. This assumes that the two kinds of columns are in alternating order, so that the columns[1::2] and columns[2::2] select them correctly. If not, you'd have to modify that part of it to choose the columns you want.
You can also do it with the little-known lreshape:
>>> pandas.lreshape(x, {'date': x.columns[1::2], 'amount': x.columns[2::2]})
ID date amount
0 x 15/1/2015 100
1 x 15/1/2016 80
However, lreshape is not really documented and it's not clear if it's supposed to be used.
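For completeness, the frame x that both snippets above assume can be built roughly like this (only the two column pairs shown in the question):
import pandas

x = pandas.DataFrame({'ID': ['x'],
                      'date1': ['15/1/2015'], 'amount1': [100],
                      'date2': ['15/1/2016'], 'amount2': [80]},
                     columns=['ID', 'date1', 'amount1', 'date2', 'amount2'])

print(pandas.lreshape(x, {'date': x.columns[1::2], 'amount': x.columns[2::2]}))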
If I assume that the columns always repeat, a simple trick provides the solution you want.
The trick lies in making a list of lists of the columns that go together, then looping over the main list appending as necessary. It does involve a call to pd.DataFrame() each time the loop runs. I am kind of pressed for time right now to find a way to avoid that. But it does work like you would expect it to, and for a small file, you should not have any problems (that is, run time).
In [1]: columns = [['date1', 'amount1'], ['date2', 'amount2'], ...]

In [2]: df_clean = pd.DataFrame(columns=['date', 'amount'])
        for cols in columns:
            df_clean = df_clean.append(pd.DataFrame(df.loc[:, cols].values,
                                                    columns=['date', 'amount']),
                                       ignore_index=True)
        df_clean

Out[2]:
        date amount
0  15/1/2015    100
1  15/1/2016     80
The neat thing about this is that it only runs over the DataFrame once, picking all the rows under the columns it is looping over. So if you have 5 column pairs, with 'n' rows under it, the loop will only run 5 times. For each run, it will append all 'n' rows below the columns to the clean DataFrame to give you a consistent result. You can then eliminate any NaN values and sort by date, or do whatever you want to do with the clean DF.
What do you think, does this beat creating an in-memory sqlite3 database?
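Another option, not mentioned in the answers above but worth noting if the columns really are named date1..dateN and amount1..amountN, is pandas.wide_to_long; a rough sketch:
import pandas as pd

df = pd.DataFrame({'ID': ['x'],
                   'date1': ['15/1/2015'], 'amount1': [100],
                   'date2': ['15/1/2016'], 'amount2': [80]})

# the stubnames 'date'/'amount' plus the numeric suffix become a long format
long_df = pd.wide_to_long(df, stubnames=['date', 'amount'], i='ID', j='num')
long_df = long_df.reset_index().drop('num', axis=1)
print(long_df)
#   ID       date  amount
# 0  x  15/1/2015     100
# 1  x  15/1/2016      80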

Pandas Dataframes - How do you maintain an index post a group by/aggregation operation?

This should be easy but I'm having a surprisingly annoying time at it. The code below shows me doing a Pandas groupby operation so I can calculate variance by symbol. Unfortunately what happens is that the aggregation command seems to get rid of the integer index, so I am trying to create a new integer list and add this as a column to the table and set as a new index.
vardataframe = voldataframe.groupby('Symbol')
vardataframe = vardataframe.aggregate(np.var)
vardataframe['newindex']= np.arange(1,(len(vardataframe)+1))
vardataframe.set_index(['newindex'])
vardataframe = vardataframe.ix[:,['newindex','Symbol','volatility']]
However what comes out is the below vardataframe.head() result, which does not properly change the index of the table from Symbol back to numeric. And this hurts me in a line or two when I try to do a merge command.
        newindex  Symbol  volatility
Symbol
A              1     NaN    0.000249
AA             2     NaN    0.000413
AAIT           3     NaN    0.000237
AAL            4     NaN    0.001664
AAME           5     NaN    0.001283
As you see the problems with the above are now there are two Symbol columns and the index hasn't been set correctly. What I'd like to do is get rid of the second Symbol column and make newindex the new index. Anyone know what I'm doing wrong here? (Perhaps a misunderstanding of the ix command). Much appreciated!
You can use as_index=False to preserve the integer index. You need only one line to do what you need:
vardataframe = voldataframe.groupby('Symbol', as_index=False).var()
A couple of things in your code:
vardataframe.set_index(['newindex'])
will set newindex as the index, but it returns a new dataframe which is not used. You can do vardataframe.set_index(['newindex'], inplace=True) if you want this.
vardataframe.ix[:,['newindex','Symbol','volatility']]
gives you a Symbol column of all NaN because Symbol is not a column of vardataframe; it only exists in its index. Querying a non-existent column with ix gives all NaN. As #user2600939 mentioned, you can do vardataframe.reset_index(inplace=True) (or vardataframe = vardataframe.reset_index()) to put Symbol back as a column.
Instead of making a new index manually, just reset it:
df = df.reset_index()
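A small toy example of the as_index=False approach (made-up symbols and volatilities): Symbol stays a regular column and the result keeps a plain integer index, so a later merge on Symbol works directly.
import pandas as pd

voldataframe = pd.DataFrame({'Symbol': ['A', 'A', 'AA', 'AA'],
                             'volatility': [1.0, 3.0, 2.0, 6.0]})

# variance per symbol; 'Symbol' remains an ordinary column
vardataframe = voldataframe.groupby('Symbol', as_index=False).var()
print(vardataframe)
#   Symbol  volatility
# 0      A         2.0
# 1     AA         8.0

# equivalent: group as usual, then reset the index afterwards
vardataframe = voldataframe.groupby('Symbol').var().reset_index()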
