I'm trying to reshape a multi-indexed data frame so that the values from the second level of the index are incorporated into the column names in the new data frame. In the data frame below, I want to move A and B from "source" into the columns so that I have s1_A, s1_B, s2_A, ..., s3_B.
I've tried creating the structure of the new data frame explicitly and populating it with a nested for loop to reassign the values, but it is excruciatingly slow. I've tried a number of functions from the pandas API, but without much luck. Any help would be much appreciated.
midx = pd.MultiIndex.from_product( [[1,2,3], ['A','B']], names=["sample","source"])
df = pd.DataFrame( index=midx, columns=['s1', 's2', 's3'], data=np.ndarray(shape=(6,3)) )
>>> df
s1 s2 s3
sample source
1 A 1.2 3.4 5.6
B 1.2 3.4 5.6
2 A 1.2 3.4 5.6
B 1.2 3.4 5.6
3 A 1.2 3.4 5.6
B 1.2 3.4 5.6
# Want to build a new data frame that looks like this:
>>> df_new
s1_A s1_B s2_A s2_B s3_A s3_B
sample
1 1.2 1.2 3.4 3.4 5.6 5.6
2 1.2 1.2 3.4 3.4 5.6 5.6
3 1.2 1.2 3.4 3.4 5.6 5.6
Here's how I'm currently doing it. It's extremely slow, and I know there must be a more idiomatic way to do this with pandas, but I'm still new to its API:
substances = df.columns.values
sources = ['A','B']
subst_and_src = sorted([ subst + "_" + src for src in sources for subst in substances ])
df_new = pd.DataFrame(index=df.index.unique(0), columns=subst_and_src)
# Runs forever
for (sample, source) in df.index:
    for subst in df.columns:
        # .loc assigns a single cell; plain df_new[...] would create a new column instead
        df_new.loc[sample, subst + "_" + source] = df.loc[(sample, source), subst]
df = df.unstack(level=1)
df.columns = ['_'.join(col).strip() for col in df.columns.values]
print(df)
Prints:
s1_A s1_B s2_A s2_B s3_A s3_B
sample
1 4.665045e-310 6.904071e-310 0.0 0.0 6.903913e-310 2.121996e-314
2 6.904071e-310 0.000000e+00 0.0 0.0 3.458460e-323 0.000000e+00
3 0.000000e+00 0.000000e+00 0.0 0.0 0.000000e+00 0.000000e+00
Unstack into a new dataframe and collapse the multi-level columns of the resulting frame using str.format
df1= df.unstack()
df1.columns = df1.columns.map('{0[0]}_{0[1]}'.format)
s1_A s1_B s2_A s2_B s3_A s3_B
sample
1 1.2 1.2 3.4 3.4 5.6 5.6
2 1.2 1.2 3.4 3.4 5.6 5.6
3 1.2 1.2 3.4 3.4 5.6 5.6
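For reference, here is a minimal self-contained sketch of the same unstack-and-flatten idea. It assumes the frame is filled with the example values from the question (built here with np.tile, which is my choice and not part of the original post), and the name wide is only illustrative.

```python
import numpy as np
import pandas as pd

midx = pd.MultiIndex.from_product([[1, 2, 3], ["A", "B"]], names=["sample", "source"])
# Assumption: fill every row with the example values 1.2, 3.4, 5.6 from the question
df = pd.DataFrame(np.tile([1.2, 3.4, 5.6], (6, 1)), index=midx, columns=["s1", "s2", "s3"])

# Move the "source" level from the row index into the columns
wide = df.unstack(level="source")
# Flatten the (substance, source) column tuples into names like "s1_A"
wide.columns = [f"{subst}_{src}" for subst, src in wide.columns]
print(wide)
```

Naming the level ("source") instead of using its position keeps the call readable if the index order ever changes.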
I have two pd.dataframes:
df1:
Year Replaced Not_replaced
2015 1.5 0.1
2016 1.6 0.3
2017 2.1 0.1
2018 2.6 0.5
df2:
Year HI LO RF
2015 3.2 2.9 3.0
2016 3.0 2.8 2.9
2017 2.7 2.5 2.6
2018 2.6 2.2 2.3
I need to create a third df3 by using the following equation:
df3['column1'] = df1['Replaced'] - df1['Not_replaced'] + df2['HI']
df3['column2'] = df1['Replaced'] - df1['Not_replaced'] + df2['LO']
df3['column3'] = df1['Replaced'] - df1['Not_replaced'] + df2['RF']
I can merge the two dataframes and manually create 3 new columns one by one, but I can't figure out how to use the loop function to create the results.
You can create an empty dataframe & fill it with values while looping
(Note: col_names & df3.columns must be of the same length)
df3 = pd.DataFrame(columns = ['column1','column2','column3'])
col_names = ["HI", "LO","RF"]
for incol, df3column in zip(col_names, df3.columns):
    df3[df3column] = df1['Replaced'] - df1['Not_replaced'] + df2[incol]
print(df3)
output
column1 column2 column3
0 4.6 4.3 4.4
1 4.3 4.1 4.2
2 4.7 4.5 4.6
3 4.7 4.3 4.4
For the for loop, I would first merge df1 and df2 to create a new df, called df3. Then, I would create a list of the names of the columns you want to iterate through:
col_names = ["HI", "LO","RF"]
for col in col_names:
    df3[f"column_{col}"] = df3['Replaced'] - df3['Not_replaced'] + df3[col]
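A loop is not strictly needed here. The sketch below is one possible vectorized variant, assuming df1 and df2 list the same years in the same row order (so the rows line up by index). The name net and the choice to carry Year over into df3 are my own additions for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({"Year": [2015, 2016, 2017, 2018],
                    "Replaced": [1.5, 1.6, 2.1, 2.6],
                    "Not_replaced": [0.1, 0.3, 0.1, 0.5]})
df2 = pd.DataFrame({"Year": [2015, 2016, 2017, 2018],
                    "HI": [3.2, 3.0, 2.7, 2.6],
                    "LO": [2.9, 2.8, 2.5, 2.2],
                    "RF": [3.0, 2.9, 2.6, 2.3]})

# Common term shared by all three result columns
net = df1["Replaced"] - df1["Not_replaced"]
# Add that term to each of HI, LO and RF row by row (axis=0 aligns on the row index)
df3 = df2[["HI", "LO", "RF"]].add(net, axis=0)
df3.columns = ["column1", "column2", "column3"]
df3.insert(0, "Year", df1["Year"])
print(df3.round(1))
```

If the two frames are not in the same row order, merging them on Year first (as suggested above) would be the safer route.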
I am manipulating a data frame using Pandas in Python to match a specific format.
I currently have a data frame with a row for each measurement location (A or B). Each row has a nominal target and multiple measured data points.
This is the format I currently have:
df=
Location Nominal Meas1 Meas2 Meas3
A 4.0 3.8 4.1 4.3
B 9.0 8.7 8.9 9.1
I need to manipulate this data so there is only one measured data point per row, and copy the Location and Nominal values from the source rows to the new rows. The measured data also needs to be put in the first column.
This is the format I need:
df =
Meas Location Nominal
3.8 A 4.0
4.1 A 4.0
4.3 A 4.0
8.7 B 9.0
8.9 B 9.0
9.1 B 9.0
I have tried concat and append functions with and without transpose() with no success.
This is the most similar example I was able to find, but it did not get me there:
for index, row in df.iterrows():
pd.concat([row]*3, ignore_index=True)
Thank you!
It's a wide-to-long problem:
pd.wide_to_long(df, 'Meas', i=['Location', 'Nominal'], j='drop').reset_index().drop(columns='drop')
Out[637]:
Location Nominal Meas
0 A 4.0 3.8
1 A 4.0 4.1
2 A 4.0 4.3
3 B 9.0 8.7
4 B 9.0 8.9
5 B 9.0 9.1
Another solution, using melt:
new_df = (df.melt(['Location','Nominal'],
['Meas1', 'Meas2', 'Meas3'],
value_name = 'Meas')
.drop('variable', axis=1)
.sort_values('Location'))
>>> new_df
Location Nominal Meas
0 A 4.0 3.8
2 A 4.0 4.1
4 A 4.0 4.3
1 B 9.0 8.7
3 B 9.0 8.9
5 B 9.0 9.1
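Neither output above puts the measurement in the first column, which the question asked for. Here is a small sketch of the melt approach with that reordering added; the name long_df is only illustrative, and sorting by the variable column assumes the Meas1..Meas3 names sort in the order you want.

```python
import pandas as pd

df = pd.DataFrame({"Location": ["A", "B"],
                   "Nominal": [4.0, 9.0],
                   "Meas1": [3.8, 8.7],
                   "Meas2": [4.1, 8.9],
                   "Meas3": [4.3, 9.1]})

long_df = (
    df.melt(id_vars=["Location", "Nominal"], value_name="Meas")
      # Keep each location's measurements together, in Meas1..Meas3 order
      .sort_values(["Location", "variable"])
      .drop(columns="variable")
      .reset_index(drop=True)
      # Reorder so the measured value comes first
      [["Meas", "Location", "Nominal"]]
)
print(long_df)
```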
I want to use the value of df.d to select the row against which I calculate a relative value, using the formula df.a/df.a[x], where x is given by df.d. But somehow this doesn't work. My approach so far is this one:
import pandas as pd
import numpy as np
import datetime
randn = np.random.randn
rng = pd.date_range('1/1/2011', periods=10, freq='D')
df = pd.DataFrame({'a': [1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9, 2.0], 'b': [1.1, 1.7, 1.3, 1.6, 1.5, 1.1, 1.5, 1.7, 2.1, 1.9],'c':[None] * 10},index=rng)
df["d"]= [0,0,0,0,4,4,4,4,8,8]
df["c"] =df.a/df.a[df.d]
All I get is the error: ValueError: cannot reindex from a duplicate axis
To clarify this: I want to set df.a/df.a[0] for the first 4 rows, df.a/df.a[4] for the next 4 and df.a/df.a[8] for the last 2 rows according to df["d"]= [0,0,0,0,4,4,4,4,8,8]
So how can I refer to a value in the dataframe correctly, without getting this error?
The output I seek looks like this:
a b c d
2011-01-01 1.1 1.1 1 0 # df.a/df.a[0]
2011-01-02 1.2 1.7 1.090909090909091 0 # df.a/df.a[0]
2011-01-03 1.3 1.3 1.181818181818182 0 # df.a/df.a[0]
2011-01-04 1.4 1.6 1.272727272727273 0 # df.a/df.a[0]
2011-01-05 1.5 1.5 1 4 # df.a/df.a[4]
2011-01-06 1.6 1.1 1.066666666666667 4 # df.a/df.a[4]
2011-01-07 1.7 1.5 1.133333333333333 4 # df.a/df.a[4]
2011-01-08 1.8 1.7 1.2 4 # df.a/df.a[4]
2011-01-09 1.9 2.1 1 8 # df.a/df.a[8]
2011-01-10 2.0 1.9 1.052631578947368 8 # df.a/df.a[8]
The pandas version used is 0.16.0
Thanks a lot for your support!
Regarding your original error, I get a different one: Unsupported Iterator Index.
That's because I am trying to get values from df.a using a Series (df.d) rather than an index value (I have pandas version 0.13.1). To solve your actual problem, though,
here's how I would go about it.
df['d'] = pd.Series([0,0,0,0,4,4,4,4,8,8], index=rng)
x = df.a.iloc[df.d]
Note that the x you get has a different date index than df.a, so simply
df['c'] = df.a/x # incorrect
won't work. We are only interested in the values, so we take them out and assign (ignoring the index).
df['c'] = df.a/x.values # We ignore the index of 'x'
or as a short form
df['c'] = df.a/df.a.iloc[df.d].values
What is not clear to me yet is - even though the index of df.d is correct why simple df.a.iloc won't work.
Hope that helps.
You might want to use this instead of your last line:
df["c"] = df.a.values / df.a[df.d].values
print(df)
Which yields:
a b c d
2011-01-01 1.1 1.1 1.000 0
2011-01-02 1.2 1.7 1.091 0
2011-01-03 1.3 1.3 1.182 0
2011-01-04 1.4 1.6 1.273 0
2011-01-05 1.5 1.5 1.000 4
2011-01-06 1.6 1.1 1.067 4
2011-01-07 1.7 1.5 1.133 4
2011-01-08 1.8 1.7 1.200 4
2011-01-09 1.9 2.1 1.000 8
2011-01-10 2.0 1.9 1.053 8
The error occurs because the two Series you tried to divide have indexes that do not align: df.a[df.d] repeats index labels, so pandas cannot reindex it against df.a. Adding .values drops the indexes, so the division is done element by element, which solves the issue.
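The same idea can be written with explicit positional lookups. This is a minimal sketch, assuming the values in d are row positions (as in the question); the name reference is my own, and to_numpy() is used in place of .values.

```python
import pandas as pd

rng = pd.date_range("2011-01-01", periods=10, freq="D")
df = pd.DataFrame(
    {"a": [1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9, 2.0],
     "d": [0, 0, 0, 0, 4, 4, 4, 4, 8, 8]},
    index=rng,
)

# Look up the reference values by position, as a plain array without an index
reference = df["a"].to_numpy()[df["d"].to_numpy()]
# Dividing a Series by an array is element by element, so no index alignment happens
df["c"] = df["a"] / reference
print(df)
```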
I'm reading a csv file with Pandas. The format is:
Date Time x1 x2 x3 x4 x5
3/7/2012 11:09:22 13.5 2.3 0.4 7.3 6.4
12.6 3.4 9.0 3.0 7.0
3.6 4.4 8.0 6.0 5.0
10.6 3.5 1.0 3.0 8.0
...
3/7/2012 11:09:23 10.5 23.2 0.3 7.8 4.4
11.6 13.4 19.0 13.0 17.0
...
As you can see, not every row has a timestamp. Every row without a timestamp is from the same 1-second interval as the closest row above it that does have a timestamp.
I am trying to do 3 things:
1. combine the Date and Time columns to get a single timestamp column.
2. convert that column to have units of seconds.
3. fill empty cells to have the appropriate timestamp.
The desired end result is an array with the timestamp, in seconds, at each row.
I am not sure how to quickly convert the timestamps into units of seconds, other than to do a slow for loop and use the Python builtin time.mktime method.
Then when I fill in missing timestamp values, the problem is that the cells in the Date and Time columns which did not have a timestamp each get a "nan" value and when merged give a cell with the value "nan nan". Then when I use the fillna() method, it doesn't interpret "nan nan" as being a nan.
I am using the following code to get the problem result (not including the part of trying to convert to seconds):
import pandas as pd
df = pd.read_csv('file.csv', delimiter=',', parse_dates={'CorrectTime':[0,1]}, usecols=[0,1,2,4,6], names=['Date','Time','x1','x3','x5'])
df.fillna(method='ffill', axis=0, inplace=True)
Thanks for your help.
Assuming you want seconds since Jan 1, 1900...
import pandas
from io import StringIO
import datetime
data = StringIO("""\
Date,Time,x1,x2,x3,x4,x5
3/7/2012,11:09:22,13.5,2.3,0.4,7.3,6.4
,,12.6,3.4,9.0,3.0,7.0
,,3.6,4.4,8.0,6.0,5.0
,,10.6,3.5,1.0,3.0,8.0
3/7/2012,11:09:23,10.5,23.2,0.3,7.8,4.4
,,11.6,13.4,19.0,13.0,17.0
""")
df = pandas.read_csv(data, parse_dates=['Date']).ffill()  # forward-fill the missing Date/Time cells
def dealwithdates(row):
    # Rebuild the full timestamp from the parsed date and the time string
    datestring = row['Date'].strftime('%Y-%m-%d')
    dtstring = '{} {}'.format(datestring, row['Time'])
    date = datetime.datetime.strptime(dtstring, '%Y-%m-%d %H:%M:%S')
    refdate = datetime.datetime(1900, 1, 1)
    # Seconds elapsed since the reference date
    return (date - refdate).total_seconds()
df['ordinal'] = df.apply(dealwithdates, axis=1)
print(df)
Date Time x1 x2 x3 x4 x5 ordinal
0 2012-03-07 11:09:22 13.5 2.3 0.4 7.3 6.4 3540107362
1 2012-03-07 11:09:22 12.6 3.4 9.0 3.0 7.0 3540107362
2 2012-03-07 11:09:22 3.6 4.4 8.0 6.0 5.0 3540107362
3 2012-03-07 11:09:22 10.6 3.5 1.0 3.0 8.0 3540107362
4 2012-03-07 11:09:23 10.5 23.2 0.3 7.8 4.4 3540107363
5 2012-03-07 11:09:23 11.6 13.4 19.0 13.0 17.0 3540107363
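If the row-wise apply is too slow, the conversion can also be done on whole columns. The following is only a sketch under two assumptions: the dates are month/day/year (as the output above suggests), and the reference date is Jan 1, 1900 as in this answer. The names combined, stamps and the seconds column are illustrative.

```python
from io import StringIO

import pandas as pd

data = StringIO("""\
Date,Time,x1,x2,x3,x4,x5
3/7/2012,11:09:22,13.5,2.3,0.4,7.3,6.4
,,12.6,3.4,9.0,3.0,7.0
,,3.6,4.4,8.0,6.0,5.0
,,10.6,3.5,1.0,3.0,8.0
3/7/2012,11:09:23,10.5,23.2,0.3,7.8,4.4
,,11.6,13.4,19.0,13.0,17.0
""")
df = pd.read_csv(data)

# Join Date and Time; rows missing either part stay missing instead of becoming "nan nan"
combined = df["Date"].str.cat(df["Time"], sep=" ")
# Assumed format: month/day/year; missing rows become NaT and are then forward-filled
stamps = pd.to_datetime(combined, format="%m/%d/%Y %H:%M:%S").ffill()
# Seconds elapsed since the assumed reference date
df["seconds"] = (stamps - pd.Timestamp("1900-01-01")).dt.total_seconds()
print(df[["x1", "seconds"]])
```

Combining the strings before parsing avoids the "nan nan" problem described in the question, because a missing Date or Time leaves the combined value missing.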
I want to apply a function f to many slices within each row of a pandas DataFrame.
For example, DataFrame df would look as such:
df = pandas.DataFrame(np.round(np.random.normal(size=(2,49)), 2))
So, I have a dataframe of 2 rows by 49 columns, and my function needs to be applied to every consecutive slice of 7 data points in both rows, so that the resulting dataframe looks identical to the input dataframe.
I was doing it as such:
df1=df.copy()
df1.T[:7], df1.T[7:14], df1.T[14:21],..., df1.T[43:50] = f(df.T.iloc[:7,:]), f(df.T.iloc[7:14,:]),..., f(df.T.iloc[43:50,:])
As you can see that's a whole lot of redundant code.. so I would like to create a loop or something so that it applies the function to every 7 subsequent data point...
I have no idea how to approach this. Is there a more elegant way to do this?
I thought I could maybe use a transform function for this, but in the pandas documentation I can only see that applied to a dataframe that has been grouped and not on slices of the data....
Hopefully this is clear.. let me know.
Thank you.
To avoid redundant code you can just do a loop like this:
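STEP = 7
df1 = df.copy()
for i in range(0, df.shape[1], STEP):  # step over the 49 columns, not the 2 rows
    block = f(df.T.iloc[i:i+STEP])  # same transposed 7-row slice as in the question
    df1.iloc[:, i:i+STEP] = np.asarray(block).T  # write back by position; could also do an apply here somehow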
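For comparison, here is a loop-free sketch of the same block-wise idea using groupby on the transposed frame. The function demean is a hypothetical stand-in for f (the question does not say what f does), and block_ids is an illustrative name.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(np.round(rng.normal(size=(2, 49)), 2))

STEP = 7
# Label each column with the number of the 7-column block it belongs to
block_ids = np.arange(df.shape[1]) // STEP

def demean(block):
    # Hypothetical stand-in for f: subtract the mean of each 7-point slice
    return block - block.mean()

# Transpose so blocks become groups of rows, transform each group, then transpose back
out = df.T.groupby(block_ids).transform(demean).T
print(out.shape)
```

transform keeps the original shape and ordering, which matches the requirement that the result look like the input frame. Whether this works for the real f depends on f returning something of the same shape as its input.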
Don't Repeat Yourself
You don't provide any examples of your desired output, so here's my best guess at what you want...
If your data are lumped into groups of seven, then you need to come up with a way to label them as such.
In other words, if you want to work with arbitrary arrays, use numpy. If you want to work with labeled, meaningful data and its associated metadata, then use pandas.
Also, pandas works more efficiently when operating on (and displaying!) row-wise data. So that means storing data long (49x2), not wide (2x49).
Here's an example of what I mean. I have the same 49x2 random array, but assigned grouping labels to the rows ahead of time.
Let's say you're reading in some wide-ish data as follows:
import pandas
import numpy
from io import StringIO # python 3
# from StringIO import StringIO # python 2
datafile = StringIO("""\
A,B,C,D,E,F,G,H,I,J
0.0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9
1.0,1.1,1.2,1.3,1.4,1.5,1.6,1.7,1.8,1.9
2.0,2.1,2.2,2.3,2.4,2.5,2.6,2.7,2.8,2.9
""")
df = pandas.read_csv(datafile)
print(df)
A B C D E F G H I J
0 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9
1 1 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9
2 2 2.1 2.2 2.3 2.4 2.5 2.6 2.7 2.8 2.9
You could add a cluster value to the columns, like so:
cluster_size = 3
col_vals = []
for n, col in enumerate(df.columns):
    cluster = n // cluster_size  # integer division gives the cluster number
    col_vals.append((cluster, col))
df.columns = pandas.Index(col_vals)
print(df)
0 1 2 3
A B C D E F G H I J
0 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9
1 1 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9
2 2 2.1 2.2 2.3 2.4 2.5 2.6 2.7 2.8 2.9
By default, the groupby method tries to group rows, but you can group columns (I just figured this out) by passing axis=1 when you create the object. So the sum of each cluster of columns for each row is as follows:
df.groupby(axis=1, level=0).sum()
0 1 2 3
0 0.3 1.2 2.1 0.9
1 3.3 4.2 5.1 1.9
2 6.3 7.2 8.1 2.9
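Note that passing axis=1 to DataFrame.groupby is deprecated as of pandas 2.1. As a sketch of an equivalent that avoids it, you can group the transposed frame and transpose back. The example below rebuilds the same 3x10 data inline; the names clusters and cluster_sums are illustrative.

```python
import pandas as pd

# Same 3x10 example data as above, built inline so the snippet is self-contained
df = pd.DataFrame(
    [[row + col / 10 for col in range(10)] for row in range(3)],
    columns=list("ABCDEFGHIJ"),
)

cluster_size = 3
clusters = [n // cluster_size for n in range(df.shape[1])]
# Two-level columns: cluster number on top, original column name below
df.columns = pd.MultiIndex.from_arrays([clusters, df.columns])

# Group the transposed frame by the cluster level, sum, then transpose back
cluster_sums = df.T.groupby(level=0).sum().T
print(cluster_sums.round(1))
```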
But again, if all you're doing is more "global" operations, there's no need to do any of this.
In-place column cluster operation
df[0] *= 25  # the output below shows cluster 0 (A, B, C) scaled by 25
print(df)
0 1 2 3
A B C D E F G H I J
0 0 2.5 5 0.3 0.4 0.5 0.6 0.7 0.8 0.9
1 25 27.5 30 1.3 1.4 1.5 1.6 1.7 1.8 1.9
2 50 52.5 55 2.3 2.4 2.5 2.6 2.7 2.8 2.9
In-place row operation
df.loc[0] += 20  # add 20 to every value in row 0
0 1 2 3
A B C D E F G H I J
0 20 22.5 25 20.3 20.4 20.5 20.6 20.7 20.8 20.9
1 25 27.5 30 1.3 1.4 1.5 1.6 1.7 1.8 1.9
2 50 52.5 55 2.3 2.4 2.5 2.6 2.7 2.8 2.9
Operate on the entire dataframe at once
def myFunc(x):
    return 5 + x**2
myFunc(df)
0 1 2 3
A B C D E F G H I J
0 405 511.25 630 417.09 421.16 425.25 429.36 433.49 437.64 441.81
1 630 761.25 905 6.69 6.96 7.25 7.56 7.89 8.24 8.61
2 2505 2761.25 3030 10.29 10.76 11.25 11.76 12.29 12.84 13.41