How to refer to the 2nd duplicate column in pandas - python

My dataframe has two columns both called Scanned Blank. I want to always select the second column named 'Scanned Blank' below:
df['Scanned Blank'].head()
Scanned Blank Scanned Blank
1 NaN Y
2 NaN NaN
3 NaN NaN
4 NaN NaN
5 NaN NaN
I tried
df['Scanned Blank'][1]
which didn't work.
It's not feasible for me to use integer selection, because sometimes the columns will move around. For instance, sometimes the first Scanned Blank will be column 20 and the second one column 40; other times they'll be 21 and 41 respectively. But whatever the exact position of the first column, I know I will always want the one right after it.

Realized that df['Scanned Blank'] was just returning another dataframe, so:
df['Scanned Blank'].iloc[:,1]
Also, yes, I'm aware this is bad practice. Unfortunately, I don't have any control over this dataset, and this script needs to reliably run when other people use it.

Let us do duplicated
df.loc[:,df.columns.duplicated()]
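A minimal sketch putting both approaches side by side, using a throwaway frame with two columns that share the name 'Scanned Blank' (the data here is made up):
import pandas as pd

df = pd.DataFrame([[None, 'Y'], [None, None], [None, None]],
                  columns=['Scanned Blank', 'Scanned Blank'])

# df['Scanned Blank'] returns a DataFrame holding both columns, so take its second column by position
second = df['Scanned Blank'].iloc[:, 1]

# duplicated() marks every occurrence after the first, so this keeps only the repeated column
second_again = df.loc[:, df.columns.duplicated()]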

Related

pandas Third Column answer is based on column 1 and column 2

First - I have tried reviewing similar posts, but I am still not getting it.
I have data with corporate codes that I have to reclassify. First, I created a new column, ['corp_reclassed'].
I populate that column with the use of the map function and a dictionary.
Most of the original corporate numbers do not change, so I have NaNs in the new column (see below).
corp_number corp_reclassed
100 nan
110 nan
120 160
130 nan
150 170
I want to create a final column: if ['corp_reclassed'] is NaN, then populate it with ['corp_number']; otherwise, populate it with ['corp_reclassed'].
I have tried many ways, but I keep running into problems. For instance, this is my latest try:
df['final_number'] = df.['corp_number'].where(df.['gem_reclassed'] = isnull, eq['gem_reclassed'])
Please help.
FYI - I am using pandas 0.19.2. I can't upgrade because of restrictions at work.
Just a fillna?
df['final_number'] = df['corp_reclassed'].fillna(df['corp_number'])
# or, as a .loc equivalent: start from corp_number and overwrite wherever a reclassified value exists
df['final_number'] = df['corp_number']
df.loc[df['corp_reclassed'].notnull(), 'final_number'] = df['corp_reclassed']
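For reference, a quick sketch running the fillna line on the sample rows from the question (this works on pandas 0.19.2 too); the frame is rebuilt here purely for illustration:
import pandas as pd
import numpy as np

df = pd.DataFrame({'corp_number': [100, 110, 120, 130, 150],
                   'corp_reclassed': [np.nan, np.nan, 160, np.nan, 170]})
df['final_number'] = df['corp_reclassed'].fillna(df['corp_number'])
# final_number comes out as 100.0, 110.0, 160.0, 130.0, 170.0 (float, since corp_reclassed holds NaN)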

Specify datatype when reading in excel data to pandas/python

I have an excel file along the lines of
gdp gdp (2009)
1929 104.6 1056.7
1930 173.6 962.0
1931 72.3 846.6
I want to read in the file and specify that the first column (which has no header information) is an integer. I don't need column B.
I am reading in the file using the following
import pandas as pd
from pandas import ExcelFile
gdp = pd.read_excel('gdpfile.xls', skiprows=2, parse_cols="A,C")
This reads in fine, except the years all get turned into floats, e.g. 1929.0, 1930.0, 1931.0. The first two rows are NaN.
I want to specify that it should be integer. I have tried adding converters = {"A":int,"C":float} in the read_excel command, as suggested by Python pandas: how to specify data types when reading an Excel file? but this did not fix things.
I have tried to convert after the fact, which I've previously done to convert strings to float, however this also did not work.
gdp.columns = ['Year','GDP 2009']
gdp['Year'] = gdp['Year'].astype(int)
I also tried using dtypes = int as suggested in one of the comments at the above link, however this also does not work.
Note that the skiprows is necessary as my actual excel file has a few rows at the top I do not want.
As per the sample given here, two blank rows are present after the heading. So if you want the heading, you can pass skiprows as a list:
pd.read_excel("test.xls",parse_cols="A,C",skiprows=[1,2])
Also, can you please confirm whether there are any other NaN cells in that column? If there are NaN values in the column, the column dtype will be promoted to float.
Please see the link below:
http://pandas.pydata.org/pandas-docs/stable/gotchas.html#support-for-integer-na
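A tiny illustration of that gotcha (values made up):
import pandas as pd
import numpy as np

print(pd.Series([1929, 1930, 1931]).dtype)   # int64: no missing values
print(pd.Series([1929, np.nan]).dtype)       # float64: a NaN forces the whole column to float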
Also note that since the first column has no heading, pandas takes the first column as the index while importing.
To avoid that, I followed the steps below.
My excel file looks like this
NaN gdp gdp (2009)
NaN NaN NaN
NaN NaN NaN
1929 104.6 1056.7
1930 173.6 962
1931 72.3 846.6
NaN NaN NaN
1952 45.3 56.6
I removed the default headers and supplied my own names to avoid the indexing issue:
test = pd.read_excel("test.xls",skiprows=[0,3],header=None,names=['Year','gdp (2009)'],parse_cols="A,C")
As stated above, since the column contains NaN values, the column type will be converted to float. You can drop the NaN rows or fill them with 0 or some other value. In this case I'm dropping the NaN rows.
test = test.dropna(axis=0, how='all')
Once you have removed the NaN values, you can use astype to convert it to int:
test['Year'] = test['Year'].astype(int)
Please check if this works for you and let me know if you need more clarification on this.
Thanks,

How to subtract two partial columns with pandas?

I'm just getting started with Pandas so I may be missing something important, but I can't seem to successfully subtract two columns I'm working with. I have a spreadsheet in excel that I imported as follows:
df = pd.read_excel('/path/to/file.xlsx',sheetname='Sheet1')
My table when doing df.head() looks similar to the following:
a b c d
0 stuff stuff stuff stuff
1 stuff stuff stuff stuff
2 data data data data
... ... ... ... ...
89 data data data data
I don't care about the "stuff;" I would like to subtract two columns of just the data and make this its own column. Therefore, it seemed obvious that I should trim off the rows I'm not interested in and work with what remains, so I have tried the following:
dataCol1 = df.ix[2:,0:1]
dataCol2 = df.ix[2:,1:2]
print(dataCol1.sub(dataCol2,axis=0))
But it results in
a b
2 NaN NaN
3 NaN NaN
4 NaN NaN
... ... ...
89 NaN NaN
I get the same result if I also simply try print(dataCol1-dataCol2). I really don't understand how both of these subtraction operations not only result in all NaN's, but also two columns instead of just one with the end result. Because when I print(dataCol1), for example, I do obtain the column I want to work with:
a
2 data
3 data
4 data
... ...
89 data
Is there any way to both work simply and directly from an Excel spreadsheet and perform basic operations with a truncated portion of the columns of said spreadsheet? Maybe there is a better way to go about this than using df.ix and I am definitely open to those methods as well.
The problem is misalignment: DataFrame subtraction aligns on both the row index and the column labels, and since one frame only has column a and the other only column b, there is no overlap in columns, so every aligned cell is missing. That is why you get two columns of NaN instead of one column of differences.
One thing to do would be to subtract the values, so you don't have to deal with alignment issues:
dataCol1 = df.iloc[2: , 0:1] # ix is deprecated
dataCol2 = df.iloc[2: , 1:2]
result = pd.DataFrame(dataCol1.values - dataCol2.values)
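An alternative sketch that sidesteps the label mismatch by working with Series rather than one-column DataFrames; the pd.to_numeric call is an assumption added to guard against the non-numeric "stuff" rows:
# Series subtraction aligns on the row index only, so the differing column names no longer matter
col1 = pd.to_numeric(df['a'].iloc[2:], errors='coerce')
col2 = pd.to_numeric(df['b'].iloc[2:], errors='coerce')
df['diff'] = col1 - col2   # rows 0 and 1 get NaN automatically when the result is assigned back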

Compare Columns In Dataframe

I have two data frames that I have concatenated into one. What I ultimately want to end up with is a list of all the columns that exist in both. The data frames come from two different db tables, and I need to generate queries based on the ones that exist in both tables.
I tried doing the following: concat_per.query('doe_per==focus_per') but it returned an empty data frame.
doe_per focus_per
2 NaN Period_02
3 Period_01 Period_06
4 Period_02 Period_08
5 Period_03 NaN
6 Period_04 NaN
7 Period_05 NaN
8 Period_06 NaN
9 Period_07 NaN
10 Period_08 NaN
Also, you can use the function isin().
First, turn the first column into a set or list to use as your base values. Then use isin() to filter the second dataframe.
firstList = set(df1st.doe_per)
targetDF = df2nd[df2nd.focus_per.isin(firstList)]
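If the end goal is the set of period values present in both columns of the concatenated frame, a one-line sketch (assuming the combined frame is named concat_per, as in the question):
common = sorted(set(concat_per['doe_per'].dropna()) & set(concat_per['focus_per'].dropna()))
# for the sample data above this gives ['Period_02', 'Period_06', 'Period_08']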
If you want to combine two dataframes into one, you can use
pd.merge(df1st, df2nd, left_on='doe_per', right_on='focus_per', how='inner')
or
pd.concat([df1st, df2nd], join='inner', ignore_index=True)
If you want to combine two dataframes into one, these two functions are what you need. DataFrame.combine() may also work; you can look up the pandas API.

Pandas Dataframes - How do you maintain an index post a group by/aggregation operation?

This should be easy but I'm having a surprisingly annoying time at it. The code below shows me doing a Pandas groupby operation so I can calculate variance by symbol. Unfortunately what happens is that the aggregation command seems to get rid of the integer index, so I am trying to create a new integer list and add this as a column to the table and set as a new index.
vardataframe = voldataframe.groupby('Symbol')
vardataframe = vardataframe.aggregate(np.var)
vardataframe['newindex']= np.arange(1,(len(vardataframe)+1))
vardataframe.set_index(['newindex'])
vardataframe = vardataframe.ix[:,['newindex','Symbol','volatility']]
However what comes out is the below vardataframe.head() result, which does not properly change the index of the table from Symbol back to numeric. And this hurts me in a line or two when I try to do a merge command.
newindex Symbol volatility
Symbol
A 1 NaN 0.000249
AA 2 NaN 0.000413
AAIT 3 NaN 0.000237
AAL 4 NaN 0.001664
AAME 5 NaN 0.001283
As you see the problems with the above are now there are two Symbol columns and the index hasn't been set correctly. What I'd like to do is get rid of the second Symbol column and make newindex the new index. Anyone know what I'm doing wrong here? (Perhaps a misunderstanding of the ix command). Much appreciated!
You can use as_index=False to preserve integer index. You need only one line to do what you need:
vardataframe = voldataframe.groupby('Symbol', as_index=False).var()
A couple of things in your code:
vardataframe.set_index(['newindex'])
will set newindex as the index, but it returns a new dataframe which is not used. You can do vardataframe.set_index(['newindex'], inplace=True) if you want this.
vardataframe.ix[:,['newindex','Symbol','volatility']]
gives you a column Symbol of all NaN because Symbol is not a column of vardataframe; it only exists in its index. Querying a non-existent column with ix gives all NaN. As #user2600939 mentioned, you can do vardataframe.reset_index(inplace=True) (or vardataframe = vardataframe.reset_index()) to put Symbol back as a column.
Instead of making a new index manually, just reset it:
df = df.reset_index()
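Putting the accepted approach together on some made-up volatility data (a sketch, not the asker's actual frame):
import pandas as pd

voldataframe = pd.DataFrame({'Symbol': ['A', 'A', 'AA', 'AA'],
                             'volatility': [0.01, 0.03, 0.02, 0.05]})

vardataframe = voldataframe.groupby('Symbol', as_index=False).var()
# Symbol stays an ordinary column and the index is a plain 0, 1, ... range,
# so a later merge on 'Symbol' needs no reset_index step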
