I want to turn this DataFrame
x K
methane 0.006233 109.237632
ethane 0.110002 6.189667
propane 0.883765 0.770425
into something like this
0.006233 0.110002 0.883765
methane 109.237632 - -
ethane - 6.189667 -
propane - - 0.770425
I keep going back and forth between assuming this is a standard operation (and digging through the docs for it) and just coding something myself. I don't even know what I would call this operation.
Thanks @RomanPekar for the test case; you can pivot with:
>>> df = pd.DataFrame({'x':[0.006233,0.110002,0.883765], 'K':[109.237632,6.189667,0.770425]}, index=['methane','ethane','propane'])
>>> df['name'] = df.index
>>> df.pivot(index='name', columns='x', values='K')
x 0.006233 0.110002 0.883765
name
ethane NaN 6.189667 NaN
methane 109.237632 NaN NaN
propane NaN NaN 0.770425
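As a side note, depending on your pandas version the helper column may be unnecessary: pivot falls back to the existing index when index is not given, so something like this sketch may work directly:
>>> df.pivot(columns='x', values='K')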
Related
I have the following table:
df = pd.DataFrame(({'code':['A121','A121','A121','H812','H812','H812','Z198','Z198','Z198','S222','S222','S222'],
'mode':['stk','sup','cons','stk','sup','cons','stk','sup','cons','stk','sup','cons'],
datetime.date(year=2021,month=5,day=1):[4,2,np.nan,2,2,np.nan,6,np.nan,np.nan,np.nan,2,np.nan],
datetime.date(year=2021,month=5,day=2):[1,np.nan,np.nan,3,np.nan,np.nan,2,np.nan,np.nan,np.nan,np.nan,np.nan],
datetime.date(year=2021,month=5,day=3):[12,5,np.nan,13,5,np.nan,12,np.nan,np.nan,np.nan,5,np.nan],
datetime.date(year=2021,month=5,day=4):[np.nan,1,np.nan,np.nan,4,np.nan,np.nan,np.nan,np.nan,np.nan,7,np.nan]}))
df = df.set_index('mode')
I want to achieve the following: the rows where the mode is cons should be set according to an arithmetic calculation:
cons for the corresponding date and code needs to be set to prev_date stk - current_date stk + sup
I have tried the code below:
dates = list(df.columns)
dates.remove('code')
for date in dates:
    prev_date = date - datetime.timedelta(days=1)
    if df.loc["stk"].get(prev_date, None) is not None:
        opn_stk = df.loc["stk", prev_date].reset_index(drop=True)
        cls_stk = df.loc["stk", date].reset_index(drop=True)
        sup = df.loc["sup", date].fillna(0).reset_index(drop=True)
        cons = opn_stk - cls_stk + sup
        df.loc["cons", date] = cons
I do not receive any error; however, the cons values do not change at all.
I suspect this is because df.loc["cons", date] is a Series with the original index, while the result of opn_stk - cls_stk + sup is a Series with a plain reset index.
Any idea how to fix this?
P.S. I am also using a loop to calculate this; is there a vectorized way that would be more efficient?
Expected Output
Let's try a groupby apply instead:
def calc_cons(g):
    # Transpose so the modes become columns
    t = g[g.columns[g.columns != 'code']].T
    # Update the cons row
    g.loc[g.index == 'cons', g.columns != 'code'] = (-t['stk'].diff() +
                                                     t['sup'].fillna(0)).to_numpy()
    return g
df = df.groupby('code', as_index=False, sort=False).apply(calc_cons)
# print(df[df.index == 'cons'])
print(df)
code 2021-05-01 2021-05-02 2021-05-03 2021-05-04
mode
stk A121 4.0 1.0 12.0 NaN
sup A121 2.0 NaN 5.0 1.0
cons A121 NaN 3.0 -6.0 NaN
stk H812 2.0 3.0 13.0 NaN
sup H812 2.0 NaN 5.0 4.0
cons H812 NaN -1.0 -5.0 NaN
stk Z198 6.0 2.0 12.0 NaN
sup Z198 NaN NaN NaN NaN
cons Z198 NaN 4.0 -10.0 NaN
stk S222 NaN NaN NaN NaN
sup S222 2.0 NaN 5.0 7.0
cons S222 NaN NaN NaN NaN
*Assumes the date columns are in sorted order at 1-day intervals.
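For intuition, -t['stk'].diff() is previous minus current, so adding t['sup'].fillna(0) reproduces the cons row; a tiny illustration with the A121 numbers from the output above:
import numpy as np
import pandas as pd

stk = pd.Series([4, 1, 12, np.nan])
sup = pd.Series([2, np.nan, 5, 1])
print(-stk.diff() + sup.fillna(0))   # NaN, 3.0, -6.0, NaN  (the cons row for A121)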
Although @Henry Ecker's answer is very elegant, it is much slower than my approach (over 10x slower), so I would like to go ahead with my implementation, fixed.
My implementation, fixed as per Henry Ecker's suggestion to assign with df.loc["cons", date] = cons.to_numpy():
dates = list(df.columns)
dates.remove('code')
for date in dates:
    prev_date = date - datetime.timedelta(days=1)
    if df.loc["stk"].get(prev_date, None) is not None:
        opn_stk = df.loc["stk", prev_date].reset_index(drop=True)   # stock of the previous date
        cls_stk = df.loc["stk", date].reset_index(drop=True)        # stock of the current date
        sup = df.loc["sup", date].fillna(0).reset_index(drop=True)  # supply of the current date
        cons = opn_stk - cls_stk + sup
        df.loc["cons", date] = cons.to_numpy()
Just as a sidenote:
My implementation runs on the full data (not this toy example, which I created for the question) in 0:00:00.053309 seconds, while Henry Ecker's implementation ran in 0:00:00.568888 seconds, so more than 10x slower.
This is probably because he is iterating over the codes whereas I am iterating over dates. At any given point in time I will have at most 30 dates, but there can be more than 500 codes.
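For completeness, a fully vectorized sketch that drops the date loop entirely (it assumes, like the note in the answer above, that the date columns are sorted at 1-day intervals and that every code has its stk/sup/cons rows in the same order):
stk = df.loc["stk"].set_index('code')[dates]
sup = df.loc["sup"].set_index('code')[dates].fillna(0)
cons = stk.shift(1, axis=1) - stk + sup   # prev_date stk - current_date stk + sup
df.loc["cons", dates] = cons.to_numpy()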
I have a dataframe called 'Adj_Close' which looks like this:
AAPL TSLA GOOG
0 3.478462 NaN NaN
1 3.185191 NaN NaN
2 3.231803 NaN NaN
3 2.952128 NaN NaN
4 3.091966 NaN NaN
... ... ... ...
5005 261.779999 333.040009 1295.339966
5006 266.369995 336.339996 1306.689941
5007 264.290009 328.920013 1313.550049
5008 267.839996 331.290009 1312.989990
5009 267.250000 329.940002 1304.959961
I want to save each column ('AAPL', 'TSLA' & 'GOOG') in a new dataframe.
The code should look like this:
i = 0
n = 3
while i < n:
    df_{i} = Adj_Close.iloc[:, i]
    i += 1
Unfortunately, this is not valid syntax. I hope someone can help me...
The natural way to do that in Python would be to create a list of DataFrames, as in:
dataframes = []
for col in df.columns:
    new_df = pd.DataFrame(df[col])
    dataframes.append(new_df)
The result is a list (dataframes) that contains three separate DataFrames, one per ticker: AAPL, TSLA and GOOG.
[ One can also define new variables using
globals()[my_var_name] = <some_value>
But I don't believe that's what you're looking for.
]
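Another sketch, keeping the list-based idea but keyed by ticker instead of a numbered variable, is a dictionary of DataFrames:
frames = {col: pd.DataFrame(Adj_Close[col]) for col in Adj_Close.columns}
frames['AAPL']   # the single-column DataFrame for AAPL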
I have a data series which looks like this:
print mys
id_L1
2 NaN
3 NaN
4 NaN
5 NaN
6 NaN
7 NaN
8 NaN
I would like to check if all the values are NaN.
My attempt:
pd.isnull(mys).all()
Output:
True
Is this the correct way to do it?
Yes, that's correct, but I think a more idiomatic way would be:
mys.isnull().all()
For a DataFrame, this gives the per-column result as an array:
df.isnull().values.all(axis=0)
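For a DataFrame (rather than the Series in the question), the same idea gives either a per-column result or a single boolean; a small illustration:
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [np.nan, np.nan], 'b': [1.0, np.nan]})
print(df.isnull().all())        # per column: a -> True, b -> False
print(df.isnull().all().all())  # one boolean for the whole frame: False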
if df['col'].count() > 0:
    # then ...
This works well, but it can be quite a slow approach. I made the mistake of embedding it in a loop that runs 6000 times to test four columns, and it was brutal, though the blame clearly lies with the programmer :)
Obviously, don't be like me. Always test your columns for all-null once, store the "empty" / "not empty" result in a variable, and then loop.
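As a sketch of that advice (the column names here are placeholders), compute the all-null flags once and reuse them inside the loop:
is_empty = {col: df[col].isnull().all() for col in ('colA', 'colB', 'colC', 'colD')}
for i in range(6000):
    if not is_empty['colA']:
        pass  # do the per-iteration work only for columns that are not all-null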
I would like to synchronize two very long data frames, performance is key in this use case. The two data frames are indexed in chronological order (this should be exploited to be as fast as possible) using datetimes or Timestamps.
One way to synch is provided in this example:
import pandas as pd
df1=pd.DataFrame({'A':[1,2,3,4,5,6], 'B':[1,5,3,4,5,7]}, index=pd.date_range('20140101 101501', freq='u', periods=6))
df2=pd.DataFrame({'D':[10,2,30,4,5,10], 'F':[1,5,3,4,5,70]}, index=pd.date_range('20140101 101501.000003', freq='u', periods=6))
# synch data frames
df3=df1.merge(df2, how='outer', right_index=True, left_index=True).fillna(method='ffill')
My question is whether this is the most efficient way to do it. I am ready to explore other solutions (e.g. using numpy or cython) if there are faster ways to solve this task.
Thanks
Note: in general the timestamps are not equally spaced (unlike in the example above); the method should also work in that case.
Comment after reading the answers
I think there are many use cases in which neither align nor merge/join helps. The point is not to use DB-related semantics for aligning (which, for time series, are not so relevant in my opinion). For me, aligning means mapping series A onto B with a way to deal with missing values (typically the sample-and-hold method); align and join cause unwanted effects, like several repeated timestamps in the result of a join. I still do not have a perfect solution, but it seems np.searchsorted can help (it is much faster than using several calls to join/align to do what I need). I could not find a pandas way to do this so far.
How can I map A into B so that the result has all the timestamps of A and B but no repetitions (beyond those already present in A or B)?
Another typical use case is sample-and-hold synchronization, which can be solved efficiently as follows (synch A with B, i.e. take for every timestamp in A the corresponding values in B):
idx = np.searchsorted(B.index.values, A.index.values, side='right') - 1
df = A.copy()
for col in B:
    df[col] = B[col].iloc[idx].values
The resulting df has the same index as A, with the synchronized values from B.
Is there an effective way to do such things directly in pandas?
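One pandas-native sketch of the sample-and-hold case (assuming both indexes are sorted and the column names do not overlap) is to reindex B onto A's index with a forward fill and join:
synced = A.join(B.reindex(A.index, method='ffill'))
In newer pandas versions, pd.merge_asof covers the same last-known-value pattern.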
If you need to synchronize, then use align; the docs are here. Otherwise merge is a good option.
In [18]: N=100000
In [19]: df1=pd.DataFrame({'A':[1,2,3,4,5,6]*N, 'B':[1,5,3,4,5,7]*N}, index=pd.date_range('20140101 101501', freq='u', periods=6*N))
In [20]: df2=pd.DataFrame({'D':[10,2,30,4,5,10]*N, 'F':[1,5,3,4,5,70]*N}, index=pd.date_range('20140101 101501.000003', freq='u', periods=6*N))
In [21]: %timeit df1.merge(df2, how='outer', right_index=True, left_index=True).fillna(method='ffill')
10 loops, best of 3: 69.3 ms per loop
In [22]: %timeit df1.align(df2)
10 loops, best of 3: 36.5 ms per loop
In [24]: pd.set_option('max_rows',10)
In [25]: x, y = df1.align(df2)
In [26]: x
Out[26]:
A B D F
2014-01-01 10:15:01 1 1 NaN NaN
2014-01-01 10:15:01.000001 2 5 NaN NaN
2014-01-01 10:15:01.000002 3 3 NaN NaN
2014-01-01 10:15:01.000003 4 4 NaN NaN
2014-01-01 10:15:01.000004 5 5 NaN NaN
... .. .. .. ..
2014-01-01 10:15:01.599998 5 5 NaN NaN
2014-01-01 10:15:01.599999 6 7 NaN NaN
2014-01-01 10:15:01.600000 NaN NaN NaN NaN
2014-01-01 10:15:01.600001 NaN NaN NaN NaN
2014-01-01 10:15:01.600002 NaN NaN NaN NaN
[600003 rows x 4 columns]
In [27]: y
Out[27]:
A B D F
2014-01-01 10:15:01 NaN NaN NaN NaN
2014-01-01 10:15:01.000001 NaN NaN NaN NaN
2014-01-01 10:15:01.000002 NaN NaN NaN NaN
2014-01-01 10:15:01.000003 NaN NaN 10 1
2014-01-01 10:15:01.000004 NaN NaN 2 5
... .. .. .. ..
2014-01-01 10:15:01.599998 NaN NaN 2 5
2014-01-01 10:15:01.599999 NaN NaN 30 3
2014-01-01 10:15:01.600000 NaN NaN 4 4
2014-01-01 10:15:01.600001 NaN NaN 5 5
2014-01-01 10:15:01.600002 NaN NaN 10 70
[600003 rows x 4 columns]
If you wish to use the index of one of your DataFrames as the pattern for synchronizing, this may be useful:
df3 = df1.loc[df1.index.isin(df2.index)]
Note: this assumes that df1 is larger (has more rows) than df2.
The previous snippet gives you only the elements present in both df1 and df2; if you instead want to add the new indexes to df2, you may prefer:
new_indexes = df1.index.difference(df2.index)  # indexes of df1 that are not in df2
default_values = np.zeros((new_indexes.shape[0], df2.shape[1]))
df2 = pd.concat([df2, pd.DataFrame(default_values, index=new_indexes, columns=df2.columns)]).sort_index()
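A shorter sketch with the same effect, assuming the goal is to extend df2 to cover df1's extra timestamps with zeros, is a reindex on the union of the two indexes:
df2 = df2.reindex(df2.index.union(df1.index), fill_value=0)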
You can see another way to synchronize in this post
In my view, syncing time series is a very simple procedure. Assume ts# (#=0,1,2) to be filled with
ts#[0,:] - time
ts#[1,:] - ask
ts#[2,:] - bid
ts#[3,:] - asksz
ts#[4,:] - bidsz
output is
totts[0,:] - sync time
totts[1-4,:] - ask/bid/asksz/bidsz of ts0
totts[5-8,:] - ask/bid/asksz/bidsz of ts1
totts[9-12,:] - ask/bid/asksz/bidsz of ts2
function:
def syncTS(ts0, ts1, ts2):
    ti0 = ts0[0, :]
    ti1 = ts1[0, :]
    ti2 = ts2[0, :]
    totti = np.union1d(ti0, ti1)
    totti = np.union1d(totti, ti2)
    totts = np.ndarray((13, len(totti)))
    it0 = it1 = it2 = 0
    nT0 = len(ti0) - 1
    nT1 = len(ti1) - 1
    nT2 = len(ti2) - 1
    for it, tim in enumerate(totti):
        # advance each series' pointer while its next timestamp has been reached
        if tim >= ti0[it0] and it0 < nT0:
            it0 += 1
        if tim >= ti1[it1] and it1 < nT1:
            it1 += 1
        if tim >= ti2[it2] and it2 < nT2:
            it2 += 1
        totts[0, it] = tim
        for k in range(1, 5):
            totts[k, it] = ts0[k, it0]
            totts[k + 4, it] = ts1[k, it1]
            totts[k + 8, it] = ts2[k, it2]
    return totts
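A pandas sketch of the same idea, assuming the three series live in DataFrames df0, df1 and df2 indexed by time (hypothetical names) rather than in the raw 2-D arrays above: build the union of the time indexes and forward-fill each frame onto it.
import pandas as pd

# df0, df1, df2 are assumed to have sorted DatetimeIndex and distinct column names
common = df0.index.union(df1.index).union(df2.index)
totdf = pd.concat([d.reindex(common, method='ffill') for d in (df0, df1, df2)], axis=1)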
I'm attempting to read in a flat-file to a DataFrame using pandas but can't seem to get the format right. My file has a variable number of fields represented per line and looks like this:
TIME=20131203004552049|CHAN=FCJNJKDCAAANPCKEAAAAAAAA|EVNT=NVOCinpt|MIME=application/synthesis+ssml|TXID=NUAN-20131203004552049-FCJNJKDCAAANPCKEAAAAAAAA-txt|TXSZ=1167|UCPU=31|SCPU=15
TIME=20131203004552049|CHAN=FCJNJKDCAAANPCKEAAAAAAAA|EVNT=NVOCsynd|INPT=1167|DURS=5120|RSTT=stop|UCPU=31|SCPU=15
TIME=20131203004552049|CHAN=FCJNJKDCAAANPCKEAAAAAAAA|EVNT=NVOClise|LUSED=0|LMAX=100|OMAX=95|LFEAT=tts|UCPU=0|SCPU=0
The field separator is |; I've pulled a list of all unique keys into keylist, and I'm trying to use the following to read in the data:
keylist = ['TIME',
'CHAN',
# [truncated]
'DURS',
'RSTT']
test_fp = 'c:\\temp\\test_output.txt'
df = pd.read_csv(test_fp, sep='|', names=keylist)
This incorrectly builds the DataFrame as I'm not specifying any way to recognize the key label in the line. I'm a little stuck and am not sure which way to research -- should I be using .read_json() for example?
Not sure if there's a slick way to do this. Sometimes when the data structure is different enough from the norm it's easiest to preprocess it on the Python side. Sure, it's not as fast, but since you could immediately save it in a more standard format it's usually not worth worrying about.
One way:
with open("wfield.txt") as fp:
rows = (dict(entry.split("=",1) for entry in row.strip().split("|")) for row in fp)
df = pd.DataFrame.from_dict(rows)
which produces
>>> df
CHAN DURS EVNT INPT LFEAT LMAX LUSED \
0 FCJNJKDCAAANPCKEAAAAAAAA NaN NVOCinpt NaN NaN NaN NaN
1 FCJNJKDCAAANPCKEAAAAAAAA 5120 NVOCsynd 1167 NaN NaN NaN
2 FCJNJKDCAAANPCKEAAAAAAAA NaN NVOClise NaN tts 100 0
MIME OMAX RSTT SCPU TIME \
0 application/synthesis+ssml NaN NaN 15 20131203004552049
1 NaN NaN stop 15 20131203004552049
2 NaN 95 NaN 0 20131203004552049
TXID TXSZ UCPU
0 NUAN-20131203004552049-FCJNJKDCAAANPCKEAAAAAAA... 1167 31
1 NaN NaN 31
2 NaN NaN 0
[3 rows x 15 columns]
After you've got this, you can reshape as needed. (I'm not sure if you wanted to combine rows with the same TIME & CHAN or not.)
Edit: if you're using an older version of pandas which doesn't support passing a generator to from_dict, you can build it from a list instead:
df = pd.DataFrame(list(rows))
but note that you may have to convert columns from strings to numeric dtypes after the fact.
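For example (a sketch that converts only the columns shown above that look numeric):
for col in ['DURS', 'INPT', 'LMAX', 'LUSED', 'OMAX', 'SCPU', 'TXSZ', 'UCPU']:
    df[col] = pd.to_numeric(df[col], errors='coerce')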