Moving pandas series value by switching column name? - python

I have a DataFrame, but the last value of some series should be placed in a different one. This happened because column names were not standardized - i.e., some are "Wx_y_x_PRED" and some are "Wx_x_y_PRED". I'm having difficulty writing a function that simply finds the columns with >= 225 NaNs and changes the column the value is assigned to.
I've written a function that for some reason sometimes works and sometimes doesn't. When it does work, it creates roughly 850 extra columns in its wake (the original DataFrame has around 420, counting the duplicate columns). I'm hoping for something that just reassigns the value. If it also deletes the incorrect column, that's great, but I just used .dropna(thresh=2) when my function worked originally.
Here's what it looks like originally:
in: df = pd.DataFrame(data={'W10_IND_JAC_PRED': ['NaN','NaN','NaN','NaN','NaN',2],
                            'W10_JAC_IND_PRED': [1,2,1,2,1,'NAN']})
out: df
   W10_IND_JAC_PRED  W10_JAC_IND_PRED
0               NaN                 1
1               NaN                 2
2               NaN                 1
3               NaN                 2
4               NaN                 1
W                 2               NAN
I wrote this, which occasionally works but most of the time doesn't, and I'm not sure why.
def switch_cols(x):
    """Takes mismatched columns (where only the last value != NaN) and changes order of team column names"""
    if x.isna().sum() == 5:
        col_string = x.name.split('_')
        col_to_switch = '_'.join([col_string[0], col_string[2], col_string[1], 'PRED'])
        df[col_to_switch]['row_name'] = x[-1]
    else:
        pass
    return x
Most of the time it just returns to me the exact same DF, but this is the desired outcome.
   W10_IND_JAC_PRED  W10_JAC_IND_PRED
0               NaN                 1
1               NaN                 2
2               NaN                 1
3               NaN                 2
4               NaN                 1
W                 2                 2
Does anyone have any tips, or could anyone share why my function only works maybe 10% of the time?
Edit:
This is an ugly for loop I wrote that works. I know there has to be a much more Pythonic way of doing this while preserving the original column names, though.
for i in range(df.shape[1]):
    if df.iloc[:, i].isna().sum() == 5:
        split_nan_col = df.columns[i].split('_')
        correct_col_name = '_'.join([split_nan_col[0], split_nan_col[2], split_nan_col[1], split_nan_col[3]])
        df.loc[5, correct_col_name] = df.loc[5, df.columns[i]]

Split the column names on '_' and map them to frozenset (which ignores token order), then join back, so both spellings collapse to the same name. Notice this solution can be applied to more columns:
df.columns=df.columns.str.split('_').map(frozenset).map('_'.join)
df.mask(df=='NaN').groupby(level=0,axis=1).first() # groupby first will return the first not null value
   PRED_JAC_W10_IND
0                 1
1                 2
2                 1
3                 2
4                 1
5                 2

Related

Python: take log difference for each column in a dataframe

I have a list of dataframes and would like to take the log of every element in these dataframes and find the first difference. In time series econometrics, this procedure gives an approximate growth rate. The following code
for i in [0, 1, 2, 5]:
    df1_list[i] = 100 * np.log(df_list[i]).diff()
gives an error
__main__:7: RuntimeWarning: divide by zero encountered in log
__main__:7: RuntimeWarning: invalid value encountered in log
When I look at the result, many of the elements of the resulting dataframes are NaN. How can I fix the code? Thanks!!
The problem is not with your code, but with your data. You do not get an error but two warnings. The most likely reasons are the following kinds of values within your DataFrames:
Zeros
Negative numbers
Non-numerical values
The logarithm of any of these is just not defined, so you get NaN.
Some test data
df = pd.DataFrame(np.random.rand(5, 5))
df = df.mask(np.random.random(df.shape) < .1)
          0         1         2         3         4
0  0.579643  0.614592  0.333945  0.241791  0.426162
1  0.576076  0.841264  0.235148  0.577707  0.278260
2  0.735097  0.594789  0.640693  0.913639  0.620021
3  0.015446       NaN  0.062203  0.253076  0.042025
4  0.401775  0.522634  0.521139  0.032310       NaN
Applying your code
for c in df:
    print(100 * np.log(df[c]).diff())
yields output like this (for c = 1):
0          NaN
1    31.394708
2   -34.670002
3          NaN
4          NaN
You can remove nans with .dropna()
for c in df:
    print(100 * np.log(df[c].dropna()).diff())
which yields (for c = 1)
0          NaN
1    31.394708
2   -34.670002
4   -12.932474
As you can see, we have "lost" one row as a consequence of .dropna() and your 0th row will always be nan as there is no difference to take.
If you are interested in replacing nans with other values, there are different techniques such as fillna or imputation.
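As a sketch of the fillna route (whether forward-filling is statistically appropriate for a growth-rate calculation depends on your data; the toy series here is mine):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 4.0, 8.0])

# Option 1: drop missing rows first (loses the row, as shown above)
growth_drop = 100 * np.log(s.dropna()).diff()

# Option 2: forward-fill first, so the full index is preserved
growth_ffill = 100 * np.log(s.ffill()).diff()
print(growth_ffill.round(2).tolist())
```

With forward-filling, the filled row produces a growth rate of exactly zero rather than disappearing from the output.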

Python pandas groupby category and integer variables results in pandas last and tail difference

UPDATE:
Please download my full dataset here.
my datatype is:
>>> df.dtypes
increment       int64
spread        float64
SYM_ROOT     category
dtype: object
I have realized that the problem might have been caused by the fact that my SYM_ROOT is a category variable.
To replicate the issue you might want to do the following first:
df=pd.read_csv("sf.csv")
df['SYM_ROOT']=df['SYM_ROOT'].astype('category')
But I am still puzzled as to why my SYM_ROOT results in the gaps in increment being filled with NA - unless a groupby over a category and an integer column produces a balanced panel by default.
I noticed that the behaviour of df.groupby().last() is different from that of df.groupby().tail(1).
For example, suppose I have the following data:
increment is an integer that spans from 0 to 4680. However, for some SYM_ROOT variable, there are gaps in between. For example, 4 could be missing from it.
What I want to do is to keep the last observation per group.
If I do df.groupby(['SYM_ROOT','increment']).last(), the dataframe becomes:
While if I do df.groupby(['SYM_ROOT','increment']).tail(1), the dataframe becomes:
It looks to me that the last() statement will create a balanced time-series data and fill in the gaps with NaN, while the tail(1) statement doesn't. Is it correct?
Update :
Your column increment is a category:
df=pd.DataFrame({'A':[1,1,2,2],'B':[1,1,2,3],'C':[1,1,1,1]})
df.B=df.B.astype('category')
df.groupby(['A','B']).last()
Out[590]:
       C
A B
1 1  1.0
  2  NaN
  3  NaN
2 1  NaN
  2  1.0
  3  1.0
When you use tail it will not make up the missing levels, since tail operates on the whole DataFrame rather than on single columns:
df.groupby(['A','B']).tail(1)
Out[593]:
   A  B  C
1  1  1  1
2  2  2  1
3  2  3  1
After changing it back using astype:
df.B=df.B.astype('int')
df.groupby(['A','B']).last()
Out[591]:
     C
A B
1 1  1
2 2  1
  3  1
It is actually a known issue on GitHub, where the problem is mainly caused by groupby expanding categorical columns to all category levels.
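On pandas 0.23 or later there is also a third option (my suggestion, not from the answer above): keep the category dtype but pass observed=True, which restricts the groupby to the category combinations that actually occur in the data:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 2, 2], 'B': [1, 1, 2, 3], 'C': [1, 1, 1, 1]})
df['B'] = df['B'].astype('category')

# observed=True: only the (A, B) pairs present in the data, no NaN padding
res = df.groupby(['A', 'B'], observed=True).last()
print(res)
```

This keeps the category dtype intact, which matters when the categories carry meaning (ordering, memory savings) that an astype('int') round-trip would lose.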

Slice column in pandas DataFrame and average results

If I have a pandas DataFrame such as:
timestamp  label  value  new
etc.       a      1      3.5
           b      2      5
           a      5      ...
           b      6      ...
           a      2      ...
           b      4      ...
I want the new column to be the average of the last two a's and the last two b's - so for the first row it would be the average of 5 and 2, giving 3.5. It will be sorted by the timestamp. I know I could use a groupby to get the average of all the a's or all the b's, but I'm not sure how to get an average of just the last two. I'm kind of new to Python and coding, so this might not be possible.
Edit: I should also mention this is not for a class or anything; it's just for something I'm doing on my own, and it will run on a very large dataset - I'm just using this as an example. Also, I would want each a and each b to have its own value for the last-2 average, so the dimension of the new column will be the same as the others. So for the third line it would be the average of 2 and whatever the next a would be in the data set.
IIUC one way (among many) to do that:
In [139]: df.groupby('label').tail(2).groupby('label').mean().reset_index()
Out[139]:
  label  value
0     a    3.5
1     b    5.0
Edited to reflect a change in the question specifying the last two, not the ones following the first, and that you wanted the same dimensionality with values repeated.
import pandas as pd

data = {'label': ['a', 'b', 'a', 'b', 'a', 'b'], 'value': [1, 2, 5, 6, 2, 4]}
df = pd.DataFrame(data)
grouped = df.groupby('label')
results = {'label': [], 'tail_mean': []}
for item, grp in grouped:
    subset_mean = grp['value'].tail(2).mean()  # select the numeric column explicitly
    results['label'].append(item)
    results['tail_mean'].append(subset_mean)
res_df = pd.DataFrame(results)
df = df.merge(res_df, on='label', how='left')
Outputs:
>> res_df
  label  tail_mean
0     a        3.5
1     b        5.0
>> df
  label  value  tail_mean
0     a      1        3.5
1     b      2        5.0
2     a      5        3.5
3     b      6        5.0
4     a      2        3.5
5     b      4        5.0
Now you have a dataframe of your results only, if you need them, plus a column with it merged back into the main dataframe. Someone else posted a more succinct way to get to the results dataframe; probably no reason to do it the longer way I showed here unless you also need to perform more operations like this that you could do inside the same loop.
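For what it's worth, groupby(...).transform collapses the loop-plus-merge into one line, since it broadcasts the per-group scalar back onto every row of that group (a sketch on the same toy data):

```python
import pandas as pd

df = pd.DataFrame({'label': ['a', 'b', 'a', 'b', 'a', 'b'],
                   'value': [1, 2, 5, 6, 2, 4]})

# The lambda returns one scalar per group; transform repeats it on each row
df['tail_mean'] = df.groupby('label')['value'].transform(lambda s: s.tail(2).mean())
print(df['tail_mean'].tolist())
```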

Create value if missing for this identifier

I want to solve a problem that essentially boils down to this:
I have identifier numbers (thousands of them), and each should be uniquely linked to a set of letters - let's call them a through e. These can be filled from another column (y) if that helps.
Occasionally one of the letters is missing and is registered as NaN. How can I replace these so that each identifier ends up with all the required letters?
Idnumber  X    y
1         a    a
2         a    a
1         b    b
1         NaN  d
2         b    NaN
1         d    c
2         c    NaN
1         NaN  e
2         d    d
2         e    e
Any given x can be missing.
The dataset is too big to simply add all possibilities and drop duplicates.
The idea is to get:
Idnumber  X
1         a
2         a
1         b
1         c
2         b
1         d
2         c
1         e
2         d
2         e
The main issue is getting a unique solution - making sure that I replace one NaN with c and one with e.
Is this what you're looking for? Or does this use too much RAM? If it does use too much RAM, you can use the chunksize parameter in read_csv. Then write results (with duplicates and nans dropped) for each individual chunk to csv, then load those and drop duplicates again - this time just dropping duplicates that conflict across chunks.
#Loading Dataframe
import pandas as pd
from io import StringIO  # on Python 2: from StringIO import StringIO

x = StringIO('''Idnumber,X,y
1,a,a
2,a,a
1,b,b
1,NaN,d
2,b,NaN
1,d,c
2,c,NaN
1,NaN,e
2,d,d
2,e,e''')
#Operations on Dataframe
df = pd.read_csv(x)
df1 = df[['Idnumber', 'X']]
df2 = df[['Idnumber', 'y']].rename(columns={'y': 'X'})  # rename the copy to avoid a SettingWithCopyWarning
pd.concat([df1, df2]).dropna().drop_duplicates()
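The same stack-then-deduplicate idea can also be written with melt, which reshapes X and y into one column in a single call (a sketch on a reduced version of the data; the column name letter is my choice, since melt's value_name must not collide with an existing column):

```python
import pandas as pd

df = pd.DataFrame({'Idnumber': [1, 2, 1, 1, 2],
                   'X': ['a', 'a', 'b', None, 'b'],
                   'y': ['a', 'a', 'b', 'd', None]})

# Stack X and y into a single 'letter' column, then clean up
long = df.melt(id_vars='Idnumber', value_vars=['X', 'y'], value_name='letter')
result = long.drop(columns='variable').dropna().drop_duplicates()
print(result)
```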

Pandas. Selection by label. One-row output

I'm trying to select every entry in a pandas DataFrame D corresponding to some certain userid, filling missing etime values with zeros, as follows:
user_entries = D.loc[userid]
user_entries.index = user_entries.etime
user_entries = user_entries.reindex(range(distinct_time_entries_num))
user_entries = user_entries.fillna(0)
The problem is that for some ids there exists exactly one entry, and thus the .loc method returns a Series object with an unexpected index:
(Pdb) user_entries.index = user_entries.etime
*** TypeError: Index(...) must be called with a collection of some kind, 388 was passed
(Pdb) user_entries
etime 388
requested 1
rejected 0
Name: 351, dtype: int64
(Pdb) user_entries.index
Index([u'etime', u'requested', u'rejected'], dtype='object')
which is painful to handle. I'd seriously prefer a DataFrame object with one row. Is there any way around it? Thanks.
UPD: I have to apologize for the unintelligible formulation; this is my first post here. I'll try again.
So the deal is: there is a dataframe indexed by userid. Every userid can have up to some number N of corresponding dataframe rows (columns are 'etime', 'requested', 'rejected'), for which 'etime' is basically the key. For some 'userid' all N corresponding entries exist, but for most of them entries are missing for some 'etime'.
My intention is: for every 'userid', construct an explicit DataFrame object containing all N entries indexed by 'etime', filled with zeros for the missing entries. That's why I'm changing the index to 'etime' and then reindexing the selected row subset with the full 'etime' range.
The problem is: for some 'userid' there is exactly one corresponding 'etime', for which .loc subsetting returns not a dataframe with one row indexed by 'userid' but a Series object indexed by the array:
Index([u'etime', u'requested', u'rejected'], dtype='object')
And that's why changing the index fails. Checking dimensions and the index every time I select some dataframe subset looks pretty ugly. What else can I do about it?
UPD2: here is the script demonstrating the case
from pandas import DataFrame

full_etime_range = range(10)
df = DataFrame(index=[0, 0, 1],
               columns=['etime', 'requested'],
               data=[[0, 1], [1, 1], [1, 1]])
for i in df.index:
    tmp = df.loc[i]
    tmp.index = tmp['etime']
    tmp = tmp.reindex(full_etime_range, fill_value=0)
    print(tmp)
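One general pattern that sidesteps the Series-vs-DataFrame ambiguity entirely (an aside of mine, not from the answers): pass a list of labels to .loc, which always returns a DataFrame even when it matches a single row:

```python
import pandas as pd

df = pd.DataFrame(index=[0, 0, 1],
                  columns=['etime', 'requested'],
                  data=[[0, 1], [1, 1], [1, 1]])

# df.loc[1] would return a Series; df.loc[[1]] keeps it a DataFrame
single = df.loc[[1]]
print(type(single).__name__, single.shape)
```

With this, tmp.index = tmp['etime'] in the loop above works uniformly for one-row and many-row userids.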
So, starting with df being your dataframe, we can do the following safely:
In[215]: df.set_index([df.index, 'etime'], inplace=True)
In[216]: df
Out[216]:
         requested
  etime
0 0              1
  1              1
1 1              1
DF = pd.DataFrame(index=full_etime_range, columns=[])
df0 = DF.copy()
In[225]: df0.join(df.loc[0])
Out[225]:
  requested
0         1
1         1
2       NaN
3       NaN
4       NaN
5       NaN
6       NaN
7       NaN
8       NaN
9       NaN
In[230]: df1 = DF.copy()
In[231]: df1.join(df.loc[1])
Out[231]:
  requested
0       NaN
1         1
2       NaN
3       NaN
4       NaN
5       NaN
6       NaN
7       NaN
8       NaN
9       NaN
which is technically what you want. But behold, we can do this nicer:
listOfDf = [DF.copy().join(df.loc[i]) for i in df.index.get_level_values(0).unique()]
I wanted to do it even one level nicer, but the following did not work - maybe someone can chip in why.
df.groupby(level=0).apply(lambda x: DF.copy().join(x))
Are you just trying to fill nas? Why are you reindexing the dataframe?
Just
user_entries = D.loc[userid]
user_entries = user_entries.fillna(0)  # fillna returns a new object, so reassign
should do the trick. But if you want to fillna just for the etime field, what you should do is:
user_entries = D.loc[userid]
temp = user_entries["etime"].fillna(0)
user_entries["etime"] = temp
Hope it helps. If not, clarify what you're trying to achieve.
