pandas Third Column answer is based on column 1 and column 2 - python

First - I have tried reviewing similar posts, but I am still not getting it.
I have data with corporate codes that I have to reclassify. First, I created a new column, ['corp_reclassed'].
I populate that column using the map function and a dictionary.
Most of the original corporate numbers do not change, so I have NaNs in the new column (see below).
corp_number corp_reclassed
100 nan
110 nan
120 160
130 nan
150 170
I want to create a final column where, if ['corp_reclassed'] is NaN, the value comes from ['corp_number']; otherwise it comes from ['corp_reclassed'].
I have tried many ways, but I keep running into problems. For instance, this is my latest try:
df['final_number'] = df.['corp_number'].where(df.['gem_reclassed'] = isnull, eq['gem_reclassed'])
Please help.
FYI: I am using pandas 0.19.2. I can't upgrade because of restrictions at work.

Just a fillna?
df['final_number'] = df['corp_reclassed'].fillna(df['corp_number'])
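
For anyone who wants to try the fillna approach on the sample data, here is a minimal sketch. The frame is rebuilt by hand from the table in the question, and the final astype(int) is my own addition (not part of the answer) to undo the float conversion that NaN forces on the column.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question; NaN marks codes that were not reclassified.
df = pd.DataFrame({
    "corp_number": [100, 110, 120, 130, 150],
    "corp_reclassed": [np.nan, np.nan, 160, np.nan, 170],
})

# Take the reclassified code where present, otherwise fall back to the original.
df["final_number"] = df["corp_reclassed"].fillna(df["corp_number"])

# Assumption: the codes are whole numbers, so cast back to int once the gaps are filled.
df["final_number"] = df["final_number"].astype(int)

print(df)
```

fillna aligns the two columns on the row index, so each missing value is filled from the same row of corp_number.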

df['final_number'] = df['corp_reclassed']
df.loc[df['corp_reclassed'].isnull(), 'final_number'] = df['corp_number']

Related

Find string in one csv and replace with string in a different csv in a loop

I have two csv files. csv1 looks like this:
Title,glide gscore,IFDScore
235,-9.01,-1020.18
235,-8.759,-1020.01
235,-7.301,-1019.28
while csv2 looks like this:
ID,smiles,number
28604361,NC(=O)CNC(=O)CC(c(cc1)cc(c12)OCO2)c3ccccc3,102
14492699,COc1cccc(c1OC)C(=O)N2CCCC(C2)CCC(=O)Nc3ccc(F)cc3C,235
16888863,COc1cc(ccc1O)CN2CCN(CC=C(C)C)C(C2)CCO,108
Both are much larger than what I show here. I need some way to match each value in the Title column of csv1 to the corresponding value in the number column of csv2. When a match is found, I need the value in the Title column of csv1 to be replaced with the corresponding value in the ID column of csv2. Thus I would want my desired output to be:
Title,glide gscore,IFDScore
14492699,-9.01,-1020.18
14492699,-8.759,-1020.01
14492699,-7.301,-1019.28
I am looking for a way to do it through pandas, bash or python.
This answer is close but gives me an ambiguous truth value of a DataFrame.
I also tried update in pandas without luck.
I'm not pasting the exact code I've tried yet because it would be overwhelming to see faulty code in pandas, bash and python all at once.
You could map it; then use fillna in case there were any "Titles" that did not have a matching "number":
csv1 = pd.read_csv('first_csv.csv')
csv2 = pd.read_csv('second_csv.csv')
csv1['Title'] = csv1['Title'].map(csv2.set_index('number')['ID']).fillna(csv1['Title']).astype(int)
Output:
Title glide gscore IFDScore
0 14492699 -9.010 -1020.18
1 14492699 -8.759 -1020.01
2 14492699 -7.301 -1019.28
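
A self-contained sketch of the map-then-fillna idea, in case it helps to see the fallback in action. The CSV text is inlined with io.StringIO instead of read from files, and the 999 row is a made-up Title with no match in csv2, added only to show what fillna does. It assumes each number appears only once in csv2; duplicates would make the lookup ambiguous.

```python
import io
import pandas as pd

# Inline stand-ins for the two files; the 999 row is hypothetical (no match in csv2).
csv1_text = """Title,glide gscore,IFDScore
235,-9.01,-1020.18
235,-8.759,-1020.01
999,-7.301,-1019.28
"""
csv2_text = """ID,smiles,number
28604361,NC(=O)CNC(=O)CC(c(cc1)cc(c12)OCO2)c3ccccc3,102
14492699,COc1cccc(c1OC)C(=O)N2CCCC(C2)CCC(=O)Nc3ccc(F)cc3C,235
16888863,COc1cc(ccc1O)CN2CCN(CC=C(C)C)C(C2)CCO,108
"""

csv1 = pd.read_csv(io.StringIO(csv1_text))
csv2 = pd.read_csv(io.StringIO(csv2_text))

# Lookup Series: index is number, values are ID.
lookup = csv2.set_index("number")["ID"]

# map() gives NaN where a Title has no matching number;
# fillna() then restores the original Title for those rows.
csv1["Title"] = csv1["Title"].map(lookup).fillna(csv1["Title"]).astype(int)

print(csv1)
```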
You can load both files with pandas and then use merge to get what you are looking for:
import pandas as pd
df1 = pd.read_csv("df1.csv")
df2 = pd.read_csv("df2.csv")
merged = df1.merge(df2, left_on="Title", right_on="number", how="right")
merged["Title"] = merged["ID"]
merged
Output:
   Title     glide gscore  IFDScore  ID        smiles                                             number
0  28604361  NaN           NaN       28604361  NC(=O)CNC(=O)CC(c(cc1)cc(c12)OCO2)c3ccccc3         102
1  14492699  -9.01         -1020.18  14492699  COc1cccc(c1OC)C(=O)N2CCCC(C2)CCC(=O)Nc3ccc(F)cc3C  235
2  14492699  -8.759        -1020.01  14492699  COc1cccc(c1OC)C(=O)N2CCCC(C2)CCC(=O)Nc3ccc(F)cc3C  235
3  14492699  -7.301        -1019.28  14492699  COc1cccc(c1OC)C(=O)N2CCCC(C2)CCC(=O)Nc3ccc(F)cc3C  235
4  16888863  NaN           NaN       16888863  COc1cc(ccc1O)CN2CCN(CC=C(C)C)C(C2)CCO              108
Note that the NaN values appear for rows of df2 whose number has no matching Title in df1, because how="right" keeps every row of df2. If df1 contains those values too, the NaNs will not appear.
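
If the goal is the exact output asked for in the question (only the rows of the first file, with Title replaced), a left merge may be a closer fit than a right merge. This is a sketch under that assumption, not the answerer's code; the data is inlined with io.StringIO, and the fillna fallback is my own addition for Titles that have no match.

```python
import io
import pandas as pd

df1 = pd.read_csv(io.StringIO("""Title,glide gscore,IFDScore
235,-9.01,-1020.18
235,-8.759,-1020.01
235,-7.301,-1019.28
"""))
df2 = pd.read_csv(io.StringIO("""ID,smiles,number
28604361,NC(=O)CNC(=O)CC(c(cc1)cc(c12)OCO2)c3ccccc3,102
14492699,COc1cccc(c1OC)C(=O)N2CCCC(C2)CCC(=O)Nc3ccc(F)cc3C,235
16888863,COc1cc(ccc1O)CN2CCN(CC=C(C)C)C(C2)CCO,108
"""))

# how="left" keeps every row of df1 and only pulls in matching rows of df2.
merged = df1.merge(df2[["ID", "number"]], left_on="Title", right_on="number", how="left")

# Use the ID where a match was found; otherwise keep the original Title.
merged["Title"] = merged["ID"].fillna(merged["Title"]).astype(int)

# Drop the helper columns so the result has the same layout as the first file.
result = merged[["Title", "glide gscore", "IFDScore"]]
print(result)
```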

How to refer to the 2nd duplicate column in pandas

My dataframe has two columns both called Scanned Blank. I want to always select the second column named 'Scanned Blank' below:
df['Scanned Blank'].head()
Scanned Blank Scanned Blank
1 NaN Y
2 NaN NaN
3 NaN NaN
4 NaN NaN
5 NaN NaN
I tried
df['Scanned Blank'][1]
which didn't work.
It's not feasible for me to use integer selection, because sometimes the columns will move around. For instance, sometimes the first Scanned Blank will be column 20 and the second one column 40; sometimes they will be 21 and 41, respectively. Whatever the exact position of the first column, I know I will always want the one after it.
Realized that I was just returning another dataframe, so,
df['Scanned Blank'].iloc[:,1]
Also, yes, I'm aware this is bad practice. Unfortunately, I don't have any control over this dataset, and this script needs to reliably run when other people use it.
You can use duplicated, which flags every repeat of a column name after its first occurrence:
df.loc[:,df.columns.duplicated()]
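
A small sketch of how the duplicated mask behaves, using a made-up three-column frame (the id column and the values are hypothetical). Note that the loc selection still returns a DataFrame, so one more positional step is needed to get a Series.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with two columns sharing the name 'Scanned Blank'.
df = pd.DataFrame(
    [[1, np.nan, "Y"], [2, np.nan, np.nan]],
    columns=["id", "Scanned Blank", "Scanned Blank"],
)

# duplicated() is False for the first occurrence of a name and True for later repeats.
mask = df.columns.duplicated()

# DataFrame holding only the repeated columns, wherever they sit in the frame.
second = df.loc[:, mask]

# With exactly one repeat, its first column is the second 'Scanned Blank'.
second_series = second.iloc[:, 0]
print(second_series)
```

If other column names are duplicated as well, the mask picks those up too, so it may be worth combining it with a check on the name, for example mask & (df.columns == 'Scanned Blank').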

Compare Columns In Dataframe

I have two data frames that I have concatenated into one. What I ultimately want to end up with is a list of all the columns that exist in both. The data frames come from two different db tables, and I need to generate queries based on the ones that exist in both tables.
I tried doing the following: concat_per.query('doe_per==focus_per') but it returned an empty data frame.
doe_per focus_per
2 NaN Period_02
3 Period_01 Period_06
4 Period_02 Period_08
5 Period_03 NaN
6 Period_04 NaN
7 Period_05 NaN
8 Period_06 NaN
9 Period_07 NaN
10 Period_08 NaN
You can also use the isin() function.
First, turn the first column into a set or list to serve as your base values. Then use isin() to filter the second dataframe.
firstList = set(df1st.doe_per)
targetDF = df2nd[df2nd.focus_per.isin(firstList)]
If you want to combine the two dataframes into one, you can use
pd.merge(df1st, df2nd, left_on='doe_per', right_on='focus_per', how='inner')
or
pd.concat([df1st, df2nd], join='inner', ignore_index=True)
I'm sorry that I forgot some parameters in these functions. But if you want to combine several dataframes into one, these are the two functions to use. DataFrame.combine() may also work. You can look it up in the pandas API documentation.
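
Here is a minimal sketch of the isin idea applied directly to the concatenated frame from the question (rebuilt by hand from the printed table). It also shows why a row-by-row comparison comes back empty: the matching values never sit on the same row.

```python
import numpy as np
import pandas as pd

# Rebuilt from the table in the question.
concat_per = pd.DataFrame(
    {
        "doe_per": [np.nan] + ["Period_0%d" % i for i in range(1, 9)],
        "focus_per": ["Period_02", "Period_06", "Period_08"] + [np.nan] * 6,
    },
    index=range(2, 11),
)

# Row-wise equality only matches when both values are on the same row, so this is empty.
row_wise = concat_per[concat_per["doe_per"] == concat_per["focus_per"]]

# Membership test instead: which focus_per values appear anywhere in doe_per?
doe_values = set(concat_per["doe_per"].dropna())
common = concat_per.loc[concat_per["focus_per"].isin(doe_values), "focus_per"].tolist()

print(common)
```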

Pandas Dataframes - How do you maintain an index post a group by/aggregation operation?

This should be easy but I'm having a surprisingly annoying time at it. The code below shows me doing a Pandas groupby operation so I can calculate variance by symbol. Unfortunately what happens is that the aggregation command seems to get rid of the integer index, so I am trying to create a new integer list and add this as a column to the table and set as a new index.
vardataframe = voldataframe.groupby('Symbol')
vardataframe = vardataframe.aggregate(np.var)
vardataframe['newindex']= np.arange(1,(len(vardataframe)+1))
vardataframe.set_index(['newindex'])
vardataframe = vardataframe.ix[:,['newindex','Symbol','volatility']]
However what comes out is the below vardataframe.head() result, which does not properly change the index of the table from Symbol back to numeric. And this hurts me in a line or two when I try to do a merge command.
newindex Symbol volatility
Symbol
A 1 NaN 0.000249
AA 2 NaN 0.000413
AAIT 3 NaN 0.000237
AAL 4 NaN 0.001664
AAME 5 NaN 0.001283
As you see the problems with the above are now there are two Symbol columns and the index hasn't been set correctly. What I'd like to do is get rid of the second Symbol column and make newindex the new index. Anyone know what I'm doing wrong here? (Perhaps a misunderstanding of the ix command). Much appreciated!
You can use as_index=False to preserve integer index. You need only one line to do what you need:
vardataframe = voldataframe.groupby('Symbol', as_index=False).var()
A couple of things in your code:
vardataframe.set_index(['newindex'])
will set newindex as the index, but it returns a new dataframe which is not used. You can do vardataframe.set_index(['newindex'], inplace=True) if you want this.
vardataframe.ix[:,['newindex','Symbol','volatility']]
gives you a column Symbol of all NaN because Symbol is not a column of vardataframe, but only exists in its index. Querying a non-existent column with ix gives all NaN. As @user2600939 mentioned, you can do vardataframe.reset_index(inplace=True) (or vardataframe = vardataframe.reset_index()) to put Symbol back as a column.
Instead of making a new index manually, just reset it:
df = df.reset_index()
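
To compare the two suggestions side by side, here is a small sketch with made-up volatility numbers (the values are hypothetical; only the column names come from the question). Both routes should leave Symbol as a regular column and give a plain 0-based integer index.

```python
import pandas as pd

# Hypothetical input: two observations per symbol.
voldataframe = pd.DataFrame({
    "Symbol": ["A", "A", "AA", "AA"],
    "volatility": [0.01, 0.03, 0.02, 0.06],
})

# Option 1: ask groupby not to move the key into the index.
by_flag = voldataframe.groupby("Symbol", as_index=False).var()

# Option 2: aggregate normally, then move the index back into a column.
by_reset = voldataframe.groupby("Symbol").var().reset_index()

print(by_flag)
print(by_reset)
```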

Fill pandas Panel object with data

This is probably very very basic but I can't seem to find a solution anywhere. I'm trying to construct a 3D panel object in pandas and then fill it with data which I read from several csv files. An example of what I'm trying to do would be the following:
import numpy as np
import pandas as pd
year = np.arange(2000,2005)
obs = np.arange(1,5)
variables = ['x1','x2']
data = pd.Panel(items = obs, major_axis = year, minor_axis = variables)
So that data[i] gives me all the data belonging to one of the observation units in the panel:
data[1]
x1 x2
2000 NaN NaN
2001 NaN NaN
2002 NaN NaN
2003 NaN NaN
2004 NaN NaN
Then, I read in data from a csv which gives me a DataFrame that looks like this (I'm just creating an equivalent object here to make this a working example):
x1data = pd.DataFrame(data = zip(year, np.random.randn(5)), columns = ['year', 'x1'])
x1data
year x1
0 2000 -0.261514
1 2001 0.474840
2 2002 0.021714
3 2003 -1.939358
4 2004 1.167545
Now I would like to replace the NaNs in the x1 column of data[1] with the data that is in the x1data dataframe. My first idea (given that I'm coming from R) was to simply make sure that I select an object from x1data that has the same dimension as the x1 column in my panel and assign it to the panel:
data[1].x1 = x1data.x1
However, this doesn't work, which I guess is because in x1data the years are a column of the dataframe, whereas in the panel they are whatever shows up to the left of the columns (the "row names"; would this be an index?).
As you can probably tell from my question I'm far from really understanding what's going on in the pandas data structure so any help would be greatly appreciated!
I'm guessing this question didn't elicit a lot of replies as it was simply too stupid, but just in case anyone ever comes across this and is as clueless as I was, the very simple answer is to access the panel by label using the .loc method, as:
data.loc[item, major_axis, minor_axis]
where each of the arguments can be single elements or lists, in order to write on slices of the panel. My question above would have been solved by
data.loc[1, np.arange(2000,2005), 'x1'] = np.asarray(x1data.x1)
or
data.loc[1, year, 'x1'] = np.asarray(x1data.x1)
Note that had I not used np.asarray, nothing would have happened, as data.loc[] creates an object that has the years as index, while x1data.x1 has an index starting at 0.
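
For readers on current pandas: pd.Panel was deprecated in pandas 0.20 and removed in 0.25, so the snippets in this thread only run on older versions. Below is a rough sketch of the same idea with a plain dict of DataFrames keyed by observation. This is my substitution, not the original poster's method, and the x1 values are simply copied from the printout in the question. It also shows the index alignment issue described above.

```python
import numpy as np
import pandas as pd

year = np.arange(2000, 2005)
obs = np.arange(1, 5)
variables = ["x1", "x2"]

# Stand-in for the Panel: one NaN-filled DataFrame per observation unit.
data = {int(i): pd.DataFrame(index=year, columns=variables, dtype=float) for i in obs}

# Values copied from the x1data printout in the question.
x1data = pd.DataFrame({
    "year": year,
    "x1": [-0.261514, 0.474840, 0.021714, -1.939358, 1.167545],
})

# Assigning the raw column fails silently: its index is 0..4, the frame's index is
# the years, so pandas aligns on labels and finds no matches (all NaN).
misaligned = data[2].copy()
misaligned["x1"] = x1data["x1"]

# Putting the years into the index first makes the labels line up.
data[1]["x1"] = x1data.set_index("year")["x1"]

print(data[1])
```

Passing a plain array, as with np.asarray in the answer above, also works because it skips alignment and assigns by position.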
