How to subtract two partial columns with pandas?

I'm just getting started with Pandas so I may be missing something important, but I can't seem to successfully subtract two columns I'm working with. I have an Excel spreadsheet that I imported as follows:
df = pd.read_excel('/path/to/file.xlsx', sheet_name='Sheet1')  # sheetname was renamed to sheet_name in newer pandas
My table when doing df.head() looks similar to the following:
a b c d
0 stuff stuff stuff stuff
1 stuff stuff stuff stuff
2 data data data data
... ... ... ... ...
89 data data data data
I don't care about the "stuff;" I would like to subtract two columns of just the data and make this its own column. Therefore, it seemed obvious that I should trim off the rows I'm not interested in and work with what remains, so I have tried the following:
dataCol1 = df.ix[2:,0:1]
dataCol2 = df.ix[2:,1:2]
print(dataCol1.sub(dataCol2,axis=0))
But it results in
a b
2 NaN NaN
3 NaN NaN
4 NaN NaN
... ... ...
89 NaN NaN
I get the same result if I simply try print(dataCol1-dataCol2). I really don't understand why both of these subtraction operations not only produce all NaNs but also return two columns instead of one. When I print(dataCol1), for example, I do obtain the column I want to work with:
a
2 data
3 data
4 data
... ...
89 data
Is there any way to both work simply and directly from an Excel spreadsheet and perform basic operations with a truncated portion of the columns of said spreadsheet? Maybe there is a better way to go about this than using df.ix and I am definitely open to those methods as well.

The problem is the misalignment of your labels: the two selections are one-column DataFrames with different column names, so pandas aligns them on the union of the columns ('a' and 'b') and fills everything with NaN.
One way around this is to subtract the raw values, so you don't have to deal with alignment at all:
dataCol1 = df.iloc[2:, 0:1]  # .ix is deprecated; use .iloc for positional indexing
dataCol2 = df.iloc[2:, 1:2]
result = pd.DataFrame(dataCol1.values - dataCol2.values)  # raw NumPy arrays carry no labels to misalign
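Alternatively (a minimal sketch, assuming the data columns are numeric and you want to keep the original row index), select each column as a Series rather than a one-column DataFrame; two Series that share the same row index subtract cleanly:
dataCol1 = df.iloc[2:, 0]   # a single integer picks the column as a Series
dataCol2 = df.iloc[2:, 1]
diff = dataCol1 - dataCol2  # rows 2..89 align on the shared index, no stray NaN columns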

Related

Pandas loop over 2 dataframe and drop duplicates

I have 2 csv files with some random numbers, as follow:
csv1.csv
0 906018
1 007559
2 910475
3 915104
4 600393
...
5070 907525
5071 903079
5072 001910
5073 909735
5074 914861
length 5075
csv2.csv
0 5555
1 7859
2 501303
3 912414
4 913257
...
7497 915031
7498 915030
7499 915033
7500 902060
7501 915038
length 7502
Some elements in csv1 are present in csv2, but I don't know exactly which ones, and I would like to extract those unique values. So my idea was to start by merging the two data frames together and then remove the duplicates.
So I wrote the following code:
import pandas as pd
import csv
unique_users = pd.read_csv('./csv1.csv')
unique_users['id']
identity = pd.read_csv('./csv2.csv')
identityNumber = identity['IDNumber']
identityNumber
df = pd.concat([identityNumber, unique_users])
Up to here everything is fine, and the length of df is the sum of the two lengths, but then I realised where I got stuck:
the concat did its job and concatenated based on the index, so now I have tons of NaNs.
and when I use the code:
final_result = df.drop_duplicates(keep=False)
The data frame does not drop any values, because the df structure now looks like this:
IDNumber id
5555 NaN
so I guess that drop_duplicates is looking for exactly matching rows, and as those don't exist it just keeps everything.
So what I would like to do is loop over both data frames, and if a value in csv1 exists in csv2, I want it to be dropped.
Can anyone help with this please?
And if you need more info, please just let me know.
UPDATE:
I think I found the reason why it is not working, but I am not sure how to solve it.
my csv1 looks like this:
id
906018,
007559,
910475,
915104,
600393,
007992,
502313,
004609,
910017,
007954,
006678,
In a Jupyter notebook, when I open the csv, it looks this way:
id
906018 NaN
007559 NaN
910475 NaN
915104 NaN
600393 NaN
... ...
907525 NaN
903079 NaN
001910 NaN
909735 NaN
914861 NaN
and I do not understand why it is reading the ids as NaN.
In fact, I tried to add a new column to csv2 and passed the id from csv1 as its value, and I can confirm that they are all NaN.
So I believe this is surely the source of the problem, which then reflects on everything else.
Can anyone help me understand how to solve this issue?
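A likely culprit, judging by the sample, is the trailing comma at the end of every row in csv1: each data line then parses as two fields while the header row has only one name, so pandas promotes the ids to the index and fills the lone 'id' column with NaN. A minimal sketch of a workaround (assuming the file looks exactly like the sample above):
unique_users = pd.read_csv('./csv1.csv', index_col=False)  # don't treat the first field as the index
print(unique_users['id'].head())                           # the ids are now a real column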
You can achieve this using df.merge():
import pandas as pd

# Data samples
data_1 = {'col_a': [906018, 7559, 910475, 915104, 600393, 907525, 903079, 1910, 909735, 914861]}
data_2 = {'col_b': [5555, 7859, 914861, 912414, 913257, 915031, 1910, 915104, 7559, 915038]}
df1 = pd.DataFrame(data_1)
df2 = pd.DataFrame(data_2)
# approach 1: inner merge to find the values common to both, then filter with isin()
unique_vals = df1.merge(df2, right_on='col_b', left_on='col_a')['col_a']
new_df1 = df1[~df1.col_a.isin(unique_vals)]
# approach 2: left merge; rows with no match in df2 end up with NaN in col_b
new_df1 = df1[df1.merge(df2, right_on='col_b', left_on='col_a', how='left')['col_b'].isna()]
print(new_df1)
# col_a
# 0 906018
# 2 910475
# 4 600393
# 5 907525
# 6 903079
# 8 909735
This will remove the duplicates between your two dataframes and keep one copy of every record in one dataframe df:
df = pd.concat([df1, df2]).drop_duplicates().reset_index(drop=True)
You are getting NaN because, when you concatenate, pandas doesn't know what you want done with the different column names of your two dataframes. One of your dataframes has an IDNumber column and the other has an id column; pandas can't figure out what you want, so it puts both columns into the resulting dataframe.
Try this:
pd.concat([df1["IDNumber"], df2["ID"]]).drop_duplicates().reset_index(drop=True)

How to extract data matrix using pandas?

I have a csv file with 6901 rows x 42 columns. 39 columns of this file form a matrix of data that I would like to do some analysis on. I do not know how to extract this data from pandas as a matrix, without the index, and treat it as a numerical matrix.
df1 = pd.read_csv(fileName, sep='\t', lineterminator='\r', engine='python', header='infer')
df1.info()
< bound method DataFrame.info of Protein.IDs ... Ratio.H.L.33
0 A0A024QZP7;P06493;P06493-2;E5RIU6;A0A087WZZ9 ... 47.88100
1 A0A024QZX5;A0A087X1N8;P35237 ... 0.13615
2 A0A024R0T9;K7ER74;P02655;Q6P163;V9GYJ8 ... NaN
3 A0A024R4E5;Q00341;Q00341-2;H0Y394;H7C0A4;C9J5E... ... 5.97650
4 A0A087WZA9;A0A024R4K9;A0A087X266;Q9BXJ8-2;Q9BXJ8 ... NaN
... ... ...
6896 V9GYT7 ... NaN
6897 V9GZ54 ... NaN
6898 X5CMH5;A0A140T9S0;A0A0G2JLV0;A0A087WYD6;E7ENX8... ... NaN
6899 X6RAL5;H7BZW6;U3KPY7 ... NaN
6900 X6RJP6 ... NaN
[6901 rows x 42 columns] >
Then I would like to take columns 4 to 42 as a plain matrix for computation. Does anyone know how to do it?
You can convert your DataFrame into a numpy ndarray using
df1.values
or
df1.to_numpy()
If you want to extract only specific columns:
cols = ['A', 'B', 'C']
df1[cols].to_numpy()
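For the "columns 4 to 42" part of the question specifically, positional slicing is handy (a sketch, assuming the question counts columns starting from 1):
matrix = df1.iloc[:, 3:42].to_numpy()  # 0-based positions 3..41 = columns 4 to 42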
pandas provides you with everything you need. :)
You don't need to convert it to a numpy array; that way you keep a couple of handy methods from pandas DataFrames. :)
You have a .csv file, which means "comma separated values" - that has historical reasons, but nowadays the values are separated by different signs, or in pandas terms by different separators, sep for short: for example commas, semicolons, or tabs.
Note that the semicolons in your sample actually sit inside the first column (Protein.IDs), while the columns themselves are tab-separated, so keep sep='\t' in your pd.read_csv command.
You want to ignore the first 3 columns, as I understood. So you just set the pd.read_csv parameter usecols (= use columns):
usecols=range(3, 42)
usecols expects you to tell it exactly which columns you want to use, by 0-based position, so columns 4 to 42 are positions 3 to 41. You can give it range(3, 42) or pass a list
a = [3, 4, 5, 6, ..., 41]
which is obviously only handy if you want to pick specific columns; the Python function range does this messy job for you.
So your command should look like this:
df1 = pd.read_csv(fileName, sep='\t', lineterminator='\r', engine='python', header='infer', usecols=range(3, 42))
Best regards

Pandas: how to add data to a MultiIndex empty DataFrame?

I would like to use a MultiIndex DataFrame to easily select portions of the DataFrame. I created an empty DataFrame as follows:
mi = {'input':['a','b','c'],'optim':['pareto','alive']}
mi = pd.MultiIndex.from_tuples([(c,k) for c in mi.keys() for k in mi[c]])
mc = pd.MultiIndex(names=['Generation','Individual'],labels=[[],[]],levels=[[],[]])  # note: in newer pandas the labels= argument is called codes=
population = pd.DataFrame(index=mi,columns=mc)
which seems to be good.
However, I do not know how to insert a single value to start populating my DataFrame. I tried the following:
population.loc[('optim','pareto'),(0,0)]=True
where I tried to define a new two-level column index (0,0), which led to a NotImplementedError. I also tried with (0,1), which gave a ValueError.
I also tried without any column index:
population.loc[('optim','pareto')]=True
Which gave no error...but no change in the DataFrame either...
Any help? Thanks in advance.
EDIT
To clarify my question, once populated, my DataFrame should look like this:
Generation 1 2
Individual 1 2 3 4 5 6
input a 1 1 2 ...
b 1 2 2 ...
c 1 1 2 ...
optim pareto True True False ...
alive True True False ...
EDIT 2
I found out that what I was doing works if I define my first column at the DataFrame creation. In particular with:
mc = pd.MultiIndex.from_tuples([(0,0)])
I get a first column full of NaN, and I can add data as I wanted to (also for new columns):
population.loc[('optim','pareto'),(0,1)]=True
I still do not know what is wrong with my first definition...
Even if I do not know why my initial definition was wrong, the following works as expected:
mi = {'input':['a','b','c'],'optim':['pareto','alive']}
mi = pd.MultiIndex.from_tuples([(c,k) for c in mi.keys() for k in mi[c]])
mc = pd.MultiIndex.from_tuples([(0,0)],names=['Generation','Individual'])
population = pd.DataFrame(index=mi,columns=mc)
It looks like the solution was to initialize the columns at the DataFrame creation (here with a (0,0) column). The created DataFrame is then:
Generation 0
Individual 0
input a NaN
b NaN
c NaN
optim pareto NaN
alive NaN
which can then be populated by adding values to the current column or to new columns/rows.
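For example (a short usage sketch continuing the snippet above), once that first column exists you can fill it and grow new columns on the fly with .loc:
population.loc[('optim', 'pareto'), (0, 0)] = True  # fill the existing column
population.loc[('optim', 'alive'), (0, 1)] = True   # creates a new (Generation, Individual) column
population.loc[('input', 'a'), (1, 0)] = 1          # and another one for the next generation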

Compare Columns In Dataframe

I have two data frames that I have concatenated into one. What I ultimately want to end up with is a list of all the columns that exist in both. The data frames come from two different db tables, and I need to generate queries based on the ones that exist in both tables.
I tried doing the following: concat_per.query('doe_per==focus_per') but it returned an empty data frame.
doe_per focus_per
2 NaN Period_02
3 Period_01 Period_06
4 Period_02 Period_08
5 Period_03 NaN
6 Period_04 NaN
7 Period_05 NaN
8 Period_06 NaN
9 Period_07 NaN
10 Period_08 NaN
You can also use the isin() function.
First, transform the first column into a set or list to serve as your base values; then use isin() to filter the second dataframe:
firstList = set(df1st.doe_per)
targetDF = df2nd[df2nd.focus_per.isin(firstList)]
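A small usage sketch of that filter, reconstructing the two columns from the question as hypothetical frames:
import pandas as pd

df1st = pd.DataFrame({'doe_per': ['Period_01', 'Period_02', 'Period_03', 'Period_06', 'Period_08']})
df2nd = pd.DataFrame({'focus_per': ['Period_02', 'Period_06', 'Period_08']})

firstList = set(df1st.doe_per)
targetDF = df2nd[df2nd.focus_per.isin(firstList)]
print(targetDF.focus_per.tolist())  # ['Period_02', 'Period_06', 'Period_08']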
If you want to combine the two dataframes into one, you can use
pd.merge(df1st, df2nd, left_on='doe_per', right_on='focus_per', how='inner')
or
pd.concat([df1st, df2nd], join='inner', ignore_index=True)
These two functions cover most ways of combining dataframes into one; DataFrame.combine() may also be worth a look. You can look up the pandas API reference for the details.
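A runnable sketch of the inner merge with hypothetical frames; only the periods present in both tables survive:
import pandas as pd

df1st = pd.DataFrame({'doe_per': ['Period_01', 'Period_02', 'Period_06']})
df2nd = pd.DataFrame({'focus_per': ['Period_02', 'Period_06', 'Period_08']})

both = pd.merge(df1st, df2nd, left_on='doe_per', right_on='focus_per', how='inner')
print(both['doe_per'].tolist())  # ['Period_02', 'Period_06']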

Fill pandas Panel object with data

This is probably very very basic but I can't seem to find a solution anywhere. I'm trying to construct a 3D panel object in pandas and then fill it with data which I read from several csv files. An example of what I'm trying to do would be the following:
import numpy as np
import pandas as pd
year = np.arange(2000,2005)
obs = np.arange(1,5)
variables = ['x1','x2']
data = pd.Panel(items = obs, major_axis = year, minor_axis = variables)
So that data[i] gives me all the data belonging to one of the observation units in the panel:
data[1]
x1 x2
2000 NaN NaN
2001 NaN NaN
2002 NaN NaN
2003 NaN NaN
2004 NaN NaN
Then, I read in data from a csv which gives me a DataFrame that looks like this (I'm just creating an equivalent object here to make this a working example):
x1data = pd.DataFrame(data = zip(year, np.random.randn(5)), columns = ['year', 'x1'])
x1data
year x1
0 2000 -0.261514
1 2001 0.474840
2 2002 0.021714
3 2003 -1.939358
4 2004 1.167545
Now I would like to replace the NaNs in the x1 column of data[1] with the data that is in the x1data dataframe. My first idea (given that I'm coming from R) was to simply make sure that I select an object from x1data that has the same dimension as the x1 column in my panel and assign it to the panel:
data[1].x1 = x1data.x1
However, this doesn't work, which I guess is due to the fact that in x1data the years are a column of the dataframe, whereas in the panel they are whatever shows up to the left of the columns (the "row names" - would this be an index?).
As you can probably tell from my question I'm far from really understanding what's going on in the pandas data structure so any help would be greatly appreciated!
I'm guessing this question didn't elicit a lot of replies as it was simply too stupid, but just in case anyone ever comes across this and is as clueless as I was: the very simple answer is to access the panel using the .iloc method, as
data.iloc[item, major_axis, minor_axis]
where each of the arguments can be single elements or lists, in order to write on slices of the panel. My question above would have been solved by
data.iloc[1, np.arange(2000,2005), 'x1'] = np.asarray(x1data.x1)
or
data.iloc[1, year, 'x1'] = np.asarray(x1data.x1)
Note that, had I not used np.asarray, nothing would have happened, as data.iloc[] creates an object that has the years as its index, while x1data.x1 has an index starting at 0.
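A minimal illustration of that alignment pitfall using a plain DataFrame, so it runs on current pandas (where Panel has since been removed); the names are made up for the sketch:
import numpy as np
import pandas as pd

years = np.arange(2000, 2005)
panel_like = pd.DataFrame(index=years, columns=['x1'])            # rows labelled by year
x1data = pd.DataFrame({'year': years, 'x1': np.random.randn(5)})  # rows labelled 0..4

panel_like['x1'] = x1data['x1']             # aligns on row labels: nothing matches, still all NaN
panel_like['x1'] = x1data['x1'].to_numpy()  # a bare array is assigned by position instead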
