Subsetting a Pandas series - Python

I have a Pandas series, basically one specific row of a Pandas data frame:
NY.GDP.DEFL.ZS_logdiff 0.341671
NY.GDP.DISC.CN 0.078261
NY.GDP.DISC.KN 0.083890
NY.GDP.FRST.RT.ZS 0.296574
NY.GDP.MINR.RT.ZS 0.264811
NY.GDP.MKTP.CD_logdiff 0.522725
NY.GDP.MKTP.CN_logdiff 0.884601
NY.GDP.MKTP.KD_logdiff 0.990679
NY.GDP.MKTP.KD.ZG 0.992603
NY.GDP.MKTP.KN_logdiff -0.077253
NY.GDP.MKTP.PP.CD_logDiff 0.856861
NY.GDP.MKTP.PP.KD_logdiff 0.990679
NY.GDP.NGAS.RT.ZS -0.018126
NY.GDP.PCAP.CD_logdiff 0.523433
NY.GDP.PCAP.KD.ZG 1.000000
NY.GDP.PCAP.KN_logdiff 0.999456
NY.GDP.PCAP.PP.CD_logdff 0.857321
NY.GDP.PCAP.PP.KD_logdiff 0.999456
Name: NY.GDP.PCAP.KD.ZG, dtype: float64
The first column is the index, as you would find in a series. Now I want to get all these index names in a list, keeping only those whose absolute value in the right column is less than 0.5. For context: this series is the row of a correlation matrix corresponding to the variable NY.GDP.PCAP.KD.ZG, and I want to retain this variable along with those variables whose correlation with it is less than 0.5. The rest of the variables I will drop from the dataframe.
Currently I do something like this, but it also keeps NaNs:
print(tourism[columns].corr().ix[14].where(np.absolute(tourism[columns].corr().ix[14]<0.5)))
where tourism is the data frame, columns is the set of columns on which I did the correlation analysis, and 14 is the row in the correlation matrix corresponding to the column mentioned above. This gives:
NY.GDP.DEFL.ZS_logdiff 0.341671
NY.GDP.DISC.CN 0.078261
NY.GDP.DISC.KN 0.083890
NY.GDP.FRST.RT.ZS 0.296574
NY.GDP.MINR.RT.ZS 0.264811
NY.GDP.MKTP.CD_logdiff NaN
NY.GDP.MKTP.CN_logdiff NaN
NY.GDP.MKTP.KD_logdiff NaN
NY.GDP.MKTP.KD.ZG NaN
NY.GDP.MKTP.KN_logdiff -0.077253
NY.GDP.MKTP.PP.CD_logDiff NaN
NY.GDP.MKTP.PP.KD_logdiff NaN
NY.GDP.NGAS.RT.ZS -0.018126
NY.GDP.PCAP.CD_logdiff NaN
NY.GDP.PCAP.KD.ZG NaN
NY.GDP.PCAP.KN_logdiff NaN
NY.GDP.PCAP.PP.CD_logdff NaN
NY.GDP.PCAP.PP.KD_logdiff NaN
Name: NY.GDP.PCAP.KD.ZG, dtype: float64

If x is your series, then:
x[x.abs() < 0.5].index
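Applied to the question's setup, a fuller sketch (assuming tourism and columns as described; note that .ix has since been removed from pandas, so label-based indexing is used instead):

corr = tourism[columns].corr()
x = corr.loc['NY.GDP.PCAP.KD.ZG']        # the row for the target variable (was .ix[14])

keep = x[x.abs() < 0.5].index.tolist()   # variables with |correlation| < 0.5
keep.append('NY.GDP.PCAP.KD.ZG')         # retain the target variable itself

tourism = tourism.drop(columns=[c for c in columns if c not in keep])

This avoids the NaNs entirely: boolean indexing drops the entries that fail the condition, whereas .where() keeps them as NaN.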

Related

Search long series for non-NaN entries

I am looking through a DataFrame with different kinds of data whose usefulness I'm trying to evaluate. I go through each column and check what kind of data it is. E.g.
print(extract_df['Auslagenersatz'])
For some I get responses like this:
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
..
263 NaN
264 NaN
265 NaN
266 NaN
267 NaN
I would like to check whether that column contains any information at all, so what I am looking for is something like:
s = extract_df['Auslagenersatz']
print(s.loc[s == True])
where I am assuming that NaN is interpreted as False in the same way an empty set is. I would like it to return only those elements of the series that satisfy this condition (being not empty). The code above does not work, however: I get an empty set even for columns that I know have non-NaN entries.
I oriented myself with this post: How to select rows from a DataFrame based on column values, but I can't figure out where I'm going wrong or how to do this instead. The problem comes up often, so any help is appreciated.
import pandas as pd
df = pd.DataFrame({'A':[2,3,None, 4,None], 'B':[2,13,None, None,None], 'C':[None,3,None, 4,None]})
If you want to see non-NA values of column A then:
df[~df['A'].isna()]
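To answer the broader question (does the column contain any information at all?), comparing to True doesn't work because NaN is a float, not False, so s == True is False for every element. A short sketch using the df above:

s = df['A']
s.notna().any()    # True if at least one non-NaN value exists
s.dropna()         # only the non-NaN elements, original index preserved
s[s.notna()]       # the same selection via a boolean mask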

Compare and find duplicated values (not entire columns) across a data frame with Python

I have a large data frame of schedules, and I need to count the number of experiments run. The challenge is that usage is repeated across rows (which is OK), but is duplicated in some, but not all, columns. I want to remove the second entry (if duplicated), but I can't delete the entire second column because it will contain some new values too. How can I compare individual entries for two columns side by side and delete the second if there is a duplicate?
The duration for this is a maximum of two days, so three days in a row is a new event with the same name starting on the third day.
The actual text for the experiment names is complicated and the data frame is 120 columns wide, so typing this in as a list or dictionary isn't possible. I'm hoping for a Python or numpy function, but could use a loop.
Here are pictures for an example of the starting data frame and the desired output: [starting data frame example] [de-duplicated data frame example]
This is a hack, similar to @Params' answer, but faster because you aren't calling .iloc a lot. The basic idea is to transpose the data frame and repeat an operation as many times as needed to compare all of the columns, then transpose it back to get the result in the OP's format. After transposing, shifting moves along the days: df == df.shift(1) flags an entry that repeats the previous day's value, while df != df.shift(2) spares the third day of a run, which counts as a new event.
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'Monday':    ['exp_A','exp_A','exp_A','exp_A','exp_B',np.nan,np.nan,np.nan,'exp_D','exp_D'],
    'Tuesday':   ['exp_A','exp_A','exp_A','exp_A','exp_B','exp_B','exp_C','exp_C','exp_D','exp_D'],
    'Wednesday': ['exp_A','exp_D',np.nan,np.nan,np.nan,'exp_B','exp_C','exp_C','exp_C',np.nan],
    'Thursday':  ['exp_A','exp_D',np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,'exp_C',np.nan]
})

df = df.T  # days become rows, so shift() moves along days
for i in range(int(np.ceil(df.shape[0]/2))):
    # blank an entry that repeats the previous day, unless it also
    # matches two days back (then it is the start of a new event)
    df[(df == df.shift(1)) & (df != df.shift(2))] = np.nan
df = df.T  # back to the original orientation
Monday Tuesday Wednesday Thursday
0 exp_A NaN exp_A NaN
1 exp_A NaN exp_D NaN
2 exp_A NaN NaN NaN
3 exp_A NaN NaN NaN
4 exp_B NaN NaN NaN
5 NaN exp_B NaN NaN
6 NaN exp_C NaN NaN
7 NaN exp_C NaN NaN
8 exp_D NaN exp_C NaN
9 exp_D NaN NaN NaN

How to apply the "diff()" method to multiple columns?

I'm trying to apply the diff() method to multiple columns to make the data stationary for a time series.
x1 = frc_data['004L004T10'].diff(periods=8)
x1
Date
2013-10-01 NaN
2013-11-01 NaN
2013-12-01 NaN
2014-01-01 NaN
2014-02-01 NaN
So diff is working for a single column. However, diff is not working for all the columns:
for x in frc_data.columns:
    frc_data[x].diff(periods=1)
No errors, although the data remains unchanged.
In order to change the DataFrame, you need to assign the diff back to the column, i.e.:
for x in frc_data.columns:
    frc_data[x] = frc_data[x].diff(periods=1)
The loop is not necessary, though; just remove the [x] to difference all columns at once:
frc_data = frc_data.diff(periods=1)
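A quick check on a toy frame (made-up numbers) confirms that DataFrame.diff applies column-wise:

import pandas as pd

frc = pd.DataFrame({'a': [1, 3, 6], 'b': [2, 2, 5]})
print(frc.diff(periods=1))
#      a    b
# 0  NaN  NaN
# 1  2.0  0.0
# 2  3.0  3.0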

How to perform operation on a grouped data set and replace the corresponding values in original df?

For a given dataframe, I need to group the data (via groupby) and perform an operation on the grouped data (e.g. grouped_data - 100). After this, I need to replace the old values in the dataframe with the new values I just calculated.
I have tried assigning the obtained values to the grouped data frame, but it does not seem to work.
Grouped Dataframe
altitude_feet Column_irrelevant_A Column_irrelevant_B
1889.155095 NaN NaN
1889.155095 NaN NaN
1889.155095 NaN NaN
1889.155095 NaN NaN
1889.155095 NaN NaN
1889.155095 NaN NaN
new_values= (df.groupby('columnA').get_group(101)['altitude_feet'])-100
df.groupby('columnA').get_group(101)['altitude_feet'] = new_values
I wish to subtract 100 ft from the above and store that as the new altitude_feet in the initial dataframe prior to grouping.
I expect approximately 1789 for the altitude.
I have also considered avoiding grouping and simply finding the index instead:
data[data['columnA']==101].assign(...)
Any suggestions would be appreciated.
Just define a function and use apply(). I used an example dataset from seaborn:
import pandas as pd
import numpy as np
import seaborn as sns
df=sns.load_dataset('iris')
def f(grp):
    dfx = grp.copy()
    dfx['new_sepal_length'] = dfx.sepal_length - 100
    return dfx
df_new=df.groupby('species').apply(f).reset_index(drop=True)
Instead of df, I created the new dataframe df_new so you can compare df with df_new.
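For an element-wise operation like this, groupby().transform() is a lighter alternative to apply() (a sketch on the same iris frame):

df['new_sepal_length'] = df.groupby('species')['sepal_length'].transform(lambda s: s - 100)

And for the original question, the grouping can be skipped entirely: select the group's rows with a boolean mask and assign through .loc (assuming columnA and the group key 101 from the question):

mask = df['columnA'] == 101
df.loc[mask, 'altitude_feet'] -= 100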

pandas column division ValueError (putmask: mask and data must be the same size)

I am attempting to divide one column by another inside of a function:
lcontrib=lcontrib_lev.div(lcontrib_lev['base'],axis='index')
As can be seen, I am dividing by a column within the DataFrame, but I am getting a rather strange error:
ValueError: putmask: mask and data must be the same size
I must confess, this is the first time I have seen this error. It seems to suggest that the DF and the column are of different lengths, but clearly (since the column comes from the DataFrame) they are not.
A further twist is that I am using this function to loop a data-management procedure over year-specific sets (the data are from the Quarterly Census of Employment and Wages 'singlefiles' in the beta series). The sets associated with the 1990-2000 time period go off without a hitch, but 2001 throws this error. I am afraid I have not been able to identify a difference in structure across years, and even if I could, how would it explain the length mismatch?
Any thoughts would be greatly appreciated.
EDIT (2/1/2014): Thanks for taking a look, Tom. As requested, the pandas version is 0.13.0, and the data file in question is located here on the BLS FTP site. Just to clarify what I meant by consistent structure: every year has the same variable set and dtype (in addition to a consistent data code structure).
EDIT (2/1/2014): Perhaps it would be useful to share the entire function:
def qcew(f,m_dict):
    '''Function reads in file and captures county level aggregations with government contributions'''
    #Read in file
    cew=pd.read_csv(f)
    #Create string version of area fips
    cew['fips']=cew['area_fips'].astype(str)
    #Generate description variables
    cew['area']=cew['fips'].map(m_dict['area'])
    cew['industry']=cew['industry_code'].map(m_dict['industry'])
    cew['agglvl']=cew['agglvl_code'].map(m_dict['agglvl'])
    cew['own']=cew['own_code'].map(m_dict['ownership'])
    cew['size']=cew['size_code'].map(m_dict['size'])
    #Generate boolean masks
    lagg_mask=cew['agglvl_code']==73
    lsize_mask=cew['size_code']==0
    #Subset data to above specifications
    cew_super=cew[lagg_mask & lsize_mask]
    #Define column subset
    lsub_cols=['year','fips','area','industry_code','industry','own','annual_avg_estabs_count','annual_avg_emplvl',\
               'total_annual_wages','own_code']
    #Subset to desired columns
    cew_sub=cew_super[lsub_cols]
    #Rename columns
    cew_sub.columns=['year','fips','cty','ind_code','industry','own','estabs','emp','tot_wages','own_code']
    #Set index
    cew_sub.set_index(['year','fips','cty'],inplace=True)
    #Capture total wage base and the contributions of Federal, State, and Local
    cew_base=cew_sub['tot_wages'].groupby(level=['year','fips','cty']).sum()
    cew_fed=cew_sub[cew_sub['own_code']==1]['tot_wages'].groupby(level=['year','fips','cty']).sum()
    cew_st=cew_sub[cew_sub['own_code']==2]['tot_wages'].groupby(level=['year','fips','cty']).sum()
    cew_loc=cew_sub[cew_sub['own_code']==3]['tot_wages'].groupby(level=['year','fips','cty']).sum()
    #Convert to DFs for join
    lbase=DataFrame(cew_base).rename(columns={0:'base'})
    lfed=DataFrame(cew_fed).rename(columns={0:'fed_wage'})
    lstate=DataFrame(cew_st).rename(columns={0:'st_wage'})
    llocal=DataFrame(cew_loc).rename(columns={0:'loc_wage'})
    #Join these series
    lcontrib_lev=pd.concat([lbase,lfed,lstate,llocal],axis='index').fillna(0)
    #Diag prints
    print f
    print lcontrib_lev.head()
    print lcontrib_lev.describe()
    print '*****************************\n'
    #Calculate proportional contributions (failure point)
    lcontrib=lcontrib_lev.div(lcontrib_lev['base'],axis='index')
    #Group base data by year, county, and industry
    cew_g=cew_sub.reset_index().groupby(['year','fips','cty','ind_code','industry']).sum().reset_index()
    #Join contributions to joined data
    cew_contr=cew_g.set_index(['year','fips','cty']).join(lcontrib[['fed_wage','st_wage','loc_wage']])
    return cew_contr[[x for x in cew_contr.columns if x != 'own_code']]
Works OK for me (this is on 0.13.1, but IIRC I don't think anything in this particular area changed; it's possible it was a bug that was fixed).
In [48]: lcontrib_lev.div(lcontrib_lev['base'],axis='index').head()
Out[48]:
base fed_wage st_wage loc_wage
year fips cty
2001 1000 1000 NaN NaN NaN NaN
1000 NaN NaN NaN NaN
10000 10000 NaN NaN NaN NaN
10000 NaN NaN NaN NaN
10001 10001 NaN NaN NaN NaN
[5 rows x 4 columns]
In [49]: lcontrib_lev.div(lcontrib_lev['base'],axis='index').tail()
Out[49]:
base fed_wage st_wage loc_wage
year fips cty
2001 CS566 CS566 1 0.000000 0.000000 0.000000
US000 US000 1 0.022673 0.027978 0.073828
USCMS USCMS 1 0.000000 0.000000 0.000000
USMSA USMSA 1 0.000000 0.000000 0.000000
USNMS USNMS 1 0.000000 0.000000 0.000000
[5 rows x 4 columns]
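For reference, a toy sketch (independent of the QCEW data) of what the division is meant to do, dividing every column by 'base' with row-index alignment:

import pandas as pd

d = pd.DataFrame({'base': [10.0, 20.0], 'fed_wage': [2.0, 5.0]})
print(d.div(d['base'], axis='index'))
#    base  fed_wage
# 0   1.0      0.20
# 1   1.0      0.25

One thing worth checking in the question's data: the concat along axis='index' produces duplicate (year, fips, cty) index entries (visible in the head() output above), and duplicate labels can trip up alignment operations in older pandas versions.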
