How to extract data matrix using pandas? - python

I have a csv file with 6901 rows x 42 columns. 39 of those columns form a matrix of data that I would like to do some analysis on. I do not know how to extract this data from pandas as a matrix that does not carry the index, so that I can treat it as a purely numerical matrix.
df1=pd.read_csv(fileName, sep='\t', lineterminator='\r', engine='python', header='infer')
df1.info()
< bound method DataFrame.info of Protein.IDs ... Ratio.H.L.33
0 A0A024QZP7;P06493;P06493-2;E5RIU6;A0A087WZZ9 ... 47.88100
1 A0A024QZX5;A0A087X1N8;P35237 ... 0.13615
2 A0A024R0T9;K7ER74;P02655;Q6P163;V9GYJ8 ... NaN
3 A0A024R4E5;Q00341;Q00341-2;H0Y394;H7C0A4;C9J5E... ... 5.97650
4 A0A087WZA9;A0A024R4K9;A0A087X266;Q9BXJ8-2;Q9BXJ8 ... NaN
... ... ...
6896 V9GYT7 ... NaN
6897 V9GZ54 ... NaN
6898 X5CMH5;A0A140T9S0;A0A0G2JLV0;A0A087WYD6;E7ENX8... ... NaN
6899 X6RAL5;H7BZW6;U3KPY7 ... NaN
6900 X6RJP6 ... NaN
[6901 rows x 42 columns] >
Then I would like to take columns 4 to 42 as a normal matrix for computation. Does anyone know how to do it?

You can convert your DataFrame into a numpy ndarray using
df1.values
or
df1.to_numpy()
If you want to extract only specific columns:
cols = ['A', 'B', 'C']
df1[cols].to_numpy()
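For the layout in the question specifically, here is a minimal sketch (assuming the numeric block really is the last 39 of the 42 columns, i.e. 0-based positions 3 to 41, that those columns are all numeric (NaN is fine in a float array), and that the pandas version has to_numpy()):
import pandas as pd

# read as in the question: tab-separated with '\r' line endings
df1 = pd.read_csv(fileName, sep='\t', lineterminator='\r', engine='python', header='infer')

# pick the numeric block by position (columns 4..42, i.e. 0-based 3..41)
# and convert it to a plain float ndarray, dropping the index
matrix = df1.iloc[:, 3:42].to_numpy(dtype=float)
print(matrix.shape)  # expected (6901, 39) for the file described above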

pandas provides you with everything you need. :)
You don't need to convert it to a numpy array. This way you keep a couple of handy methods that pandas DataFrames provide :)
You have a .csv file, which means "comma separated values" - that is for historical reasons, but nowadays the values may be separated by different characters, or in pandas terms by different separators, sep for short. For example commas, semicolons, tabs.
Your data shows a separation by semicolons, so you should use sep=';' in your pd.read_csv command.
You want to ignore the first 3 columns, as I understood it. So you just set the pd.read_csv parameter usecols (= use columns):
usecols=range(3, 42)
usecols expects you to name exactly the columns you want to use. You can just give it a range from 3 to 42 (the 0-based indices of columns 4 to 42, end exclusive) or you can pass a list
a = [3, 4, 5, 6, ..., 41]
which is obviously only handy if you want to pick out specific columns. The Python function range does this messy job for you.
So your command should look like this:
df1=pd.read_csv(fileName, sep=';', lineterminator='\r', engine='python', header='infer', usecols=range(3, 42))
Best regards

Related

Pandas loop over 2 dataframe and drop duplicates

I have 2 csv files with some random numbers, as follow:
csv1.csv
0 906018
1 007559
2 910475
3 915104
4 600393
...
5070 907525
5071 903079
5072 001910
5073 909735
5074 914861
length 5075
csv2.csv
0 5555
1 7859
2 501303
3 912414
4 913257
...
7497 915031
7498 915030
7499 915033
7500 902060
7501 915038
length 7502
Some elements in csv1 are present in csv2, but I don't know exactly which ones, and I would like to extract the unique values. So my idea was to merge the 2 data frames together and then remove the duplicates.
So I wrote the following code:
import pandas as pd
import csv
unique_users = pd.read_csv('./csv1.csv')
unique_users['id']
identity = pd.read_csv('./csv2.csv')
identityNumber = identity['IDNumber']
identityNumber
df = pd.concat([identityNumber, unique_users])
Up to this point everything is perfect and the length of df is the sum of the 2 lengths, but then I realised where I got stuck:
the concat did its job and concatenated based on the index, so now I have tons of NaN.
and when I use the code:
final_result = df.drop_duplicates(keep=False)
The data frame does not drop any values because the df structure now looks like this:
IDNumber    id
5555        NaN
so I guess that drop_duplicates is looking for exactly the same values, but as they don't exist it just keeps everything.
So what I would like to do is loop over both data frames, and if a value in csv1 exists in csv2, I want it to be dropped.
Can anyone help with this please?
And please if you need more info just let me know.
UPDATE:
I think I found the reason why it is not working, but I am not sure how to solve it.
my csv1 looks like this:
id
906018,
007559,
910475,
915104,
600393,
007992,
502313,
004609,
910017,
007954,
006678,
In a Jupyter notebook, when I open the csv, it looks this way:
id
906018 NaN
007559 NaN
910475 NaN
915104 NaN
600393 NaN
... ...
907525 NaN
903079 NaN
001910 NaN
909735 NaN
914861 NaN
and I do not understand why it is seeing the id as NaN.
In fact I tried to add a new column to csv2, and as its values I passed the ids from csv1, and I can confirm that they are all NaN.
So I believe this is surely the source of the problem, which then reflects on everything else.
Can anyone help me understand how I can solve this issue?
You can achieve this using df.merge():
# Data samples
data_1 = {'col_a': [906018,7559,910475,915104,600393,907525,903079,1910,909735,914861]}
data_2 = {'col_b': [5555,7859,914861,912414,913257,915031,1910,915104,7559,915038]}
df1 = pd.DataFrame(data_1)
df2 = pd.DataFrame(data_2)
# first approach: find the values common to both frames, then drop them with isin()
common_vals = df1.merge(df2, right_on='col_b', left_on='col_a')['col_a']
new_df1 = df1[~df1.col_a.isin(common_vals)]
# another approach
new_df1 = df1[df1.merge(df2, right_on='col_b', left_on='col_a', how='left')['col_b'].isna()]
print(new_df1)
# col_a
# 0 906018
# 2 910475
# 4 600393
# 5 907525
# 6 903079
# 8 909735
This will remove the duplicates between your two dataframes and keep all the records in one dataframe df.
df = pd.concat([df1, df2]).drop_duplicates().reset_index(drop=True)
You are getting NaN because, when you concatenate, pandas doesn't know what you want to do with the different column names of your two dataframes. One of your dataframes has an IDNumber column and the other has an id column. Pandas can't figure out what you want, so it puts both columns into the resulting dataframe.
Try this:
pd.concat([identity["IDNumber"], unique_users["id"]]).drop_duplicates().reset_index(drop=True)

Specify datatype when reading in excel data to pandas/python

I have an excel file along the lines of
gdp gdp (2009)
1929 104.6 1056.7
1930 173.6 962.0
1931 72.3 846.6
I want to read in the file and specify that the first column (which has no header information) is an integer. I don't need column B.
I am reading in the file using the following
import pandas as pd
from pandas import ExcelFile
gdp = pd.read_excel('gdpfile.xls', skiprows=2, parse_cols="A,C")
This reads in fine, except the years all get turned into floats, e.g. 1929.0, 1930.0, 1931.0. The first two rows are NaN.
I want to specify that it should be integer. I have tried adding converters={"A": int, "C": float} to the read_excel command, as suggested by "Python pandas: how to specify data types when reading an Excel file?", but this did not fix things.
I have tried to convert after the fact, which I've previously done to convert strings to float, however this also did not work.
gdp.columns = ['Year','GDP 2009']
gdp['Year'] = gdp['Year'].astype(int)
I also tried using dtypes = int as suggested in one of the comments at the above link, however this also does not work.
Note that the skiprows is necessary as my actual excel file has a few rows at the top I do not want.
As per the sample given here, two blank rows are present after the heading. So if you want the heading, you can pass skiprows as a list:
pd.read_excel("test.xls",parse_cols="A,C",skiprows=[1,2])
Also, can you please confirm whether there are any other NaN cells in that column? If there are NaN values in the column, the column dtype will be promoted to float.
Please see the link below:
http://pandas.pydata.org/pandas-docs/stable/gotchas.html#support-for-integer-na
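A tiny illustration of that promotion (a sketch):
import numpy as np
import pandas as pd

print(pd.Series([1929, 1930, 1931]).dtype)    # int64
print(pd.Series([1929, 1930, np.nan]).dtype)  # float64 - the NaN forces the promotion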
Also please note that since the first column has no heading, the import takes the first column as the index.
To avoid that, I followed the steps below:
My excel file looks like this
NaN gdp gdp (2009)
NaN NaN NaN
NaN NaN NaN
1929 104.6 1056.7
1930 173.6 962
1931 72.3 846.6
NaN NaN NaN
1952 45.3 56.6
I removed the default headers and supplied my own to avoid the indexing issue:
test = pd.read_excel("test.xls",skiprows=[0,3],header=None,names=['Year','gdp (2009)'],parse_cols="A,C")
As stated above, since the column contains NaN values, the column type will be converted to float. You can drop the NaN values or fill them with 0 or some other value. In this case I'm dropping the NaN rows.
test = test.dropna(axis=0, how='all')
Once you have removed the NaN values, you can use astype to convert the column to int:
test['Year']=test.Year.astype(int)
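As an alternative (a sketch, assuming a reasonably recent pandas that supports the nullable 'Int64' extension dtype), you can keep the rows with missing values and still get integer years:
test['Year'] = test['Year'].astype('Int64')  # assumption: recent pandas; missing years show as <NA> instead of forcing the column to float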
Please check if this works for you and let me know if you need more clarification on this.
Thanks,

How to subtract two partial columns with pandas?

I'm just getting started with Pandas so I may be missing something important, but I can't seem to successfully subtract two columns I'm working with. I have a spreadsheet in excel that I imported as follows:
df = pd.read_excel('/path/to/file.xlsx',sheetname='Sheet1')
My table when doing df.head() looks similar to the following:
a b c d
0 stuff stuff stuff stuff
1 stuff stuff stuff stuff
2 data data data data
... ... ... ... ...
89 data data data data
I don't care about the "stuff"; I would like to subtract two columns of just the data and make this its own column. Therefore, it seemed obvious that I should trim off the rows I'm not interested in and work with what remains, so I have tried the following:
dataCol1 = df.ix[2:,0:1]
dataCol2 = df.ix[2:,1:2]
print(dataCol1.sub(dataCol2,axis=0))
But it results in
a b
2 NaN NaN
3 NaN NaN
4 NaN NaN
... ... ...
89 NaN NaN
I get the same result if I simply try print(dataCol1 - dataCol2). I really don't understand how both of these subtraction operations not only result in all NaNs, but also produce two columns instead of just one in the end result. When I print(dataCol1), for example, I do obtain the column I want to work with:
a
2 data
3 data
4 data
... ...
89 data
Is there any way to both work simply and directly from an Excel spreadsheet and perform basic operations with a truncated portion of the columns of said spreadsheet? Maybe there is a better way to go about this than using df.ix and I am definitely open to those methods as well.
The problem is the misalignment of your column labels: dataCol1 has column a and dataCol2 has column b, so pandas aligns by label and fills everything else with NaN.
One thing to do would be to subtract the underlying values, so you don't have to deal with alignment issues:
dataCol1 = df.iloc[2: , 0:1] # ix is deprecated
dataCol2 = df.iloc[2: , 1:2]
result = pd.DataFrame(dataCol1.values - dataCol2.values)
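Alternatively (a sketch, assuming columns a and b hold the numeric data and the first two rows are the "stuff" to skip), subtracting the columns as Series sidesteps the label mismatch and keeps the original index, so the result can be stored as its own column as asked:
# slice off the first two rows, coerce to float, and subtract; both Series share
# the index 2..89, so they line up element by element
diff = df['a'].iloc[2:].astype(float) - df['b'].iloc[2:].astype(float)
df['a_minus_b'] = diff  # rows 0 and 1 simply come out as NaN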

Using Pandas to Manipulate Multiple Columns

I have a 30+ million row data set that I need to apply a whole host of data transformation rules to. For this task, I am trying to explore Pandas as a possible solution because my current solution isn't very fast.
Currently, I am performing a row by row manipulation of the data set, and then exporting it to a new table (CSV file) on disk.
There are 5 functions users can perform on the data within a given column:
remove white space
Capitalize all text
format date
replace letter/number
replace word
My first thought was to use the dataframe's apply or applymap, but this can only be used on a single column.
Is there a way to use apply or applymap to many columns instead of just one?
Is there a better workflow I should consider since I could be doing manipulations to 1:n columns in my dataset, where the maximum number of columns is currently around 30.
Thank you
You can use a list comprehension with concat if you need to apply a function that only works on a Series:
import pandas as pd
data = pd.DataFrame({'A':[' ff ','2','3'],
'B':[' 77','s gg','d'],
'C':['s',' 44','f']})
print (data)
A B C
0 ff 77 s
1 2 s gg 44
2 3 d f
print (pd.concat([data[col].str.strip().str.capitalize() for col in data], axis=1))
A B C
0 Ff 77 S
1 2 S gg 44
2 3 D F
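If every column gets the same treatment, DataFrame.apply runs the function once per column, so the explicit loop is not strictly needed (a sketch, assuming all columns are strings as in the sample above):
cleaned = data.apply(lambda col: col.str.strip().str.capitalize())
print(cleaned)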

Definitive way to parse alphanumeric CSVs in Python with scipy/numpy

I've been trying to find a good and flexible way to parse CSV files in Python but none of the standard options seem to fit the bill. I am tempted to write my own but I think that some combination of what exists in numpy/scipy and the csv module can do what I need, and so I don't want to reinvent the wheel.
I'd like the standard features of being able to specify delimiters, specify whether or not there's a header, how many rows to skip, comments delimiter, which columns to ignore, etc. The central feature I am missing is being able to parse CSV files in a way that gracefully handles both string data and numeric data. Many of my CSV files have columns that contain strings (not of the same length necessarily) and numeric data. I'd like to be able to have numpy array functionality for this numeric data, but also be able to access the strings. For example, suppose my file looks like this (imagine columns are tab-separated):
# my file
name favorite_integer favorite_float1 favorite_float2 short_description
johnny 5 60.2 0.52 johnny likes fruitflies
bob 1 17.52 0.001 bob, bobby, robert
data = loadcsv('myfile.csv', delimiter='\t', parse_header=True, comment='#')
I'd like to be able to access data in two ways:
As a matrix of values: it's important for me to get a numpy.array so that I can easily transpose and access the columns that are numeric. In this case, I want to be able to do something like:
floats_and_ints = data.matrix
floats_and_ints[:, 0] # access the integers
floats_and_ints[:, 1:3] # access some of the floats
transpose(floats_and_ints) # etc..
As a dictionary-like object where I don't have to know the order of the headers: I'd like to also access the data by the header order. For example, I'd like to do:
data['favorite_float1'] # get all the values of the column with header "favorite_float1"
data['name'] # get all the names of the rows
I don't want to have to know that favorite_float1 is the second column in the table, since this might change.
It's also important for me to be able to iterate through the rows and access the fields by name. For example:
for row in data:
    # print names and favorite integers of all
    print "Name: ", row["name"], row["favorite_int"]
The representation in (1) suggests a numpy.array, but as far as I can tell, this does not deal well with strings and requires me to specify the data type ahead of time as well as the header labels.
The representation in (2) suggests a list of dictionaries, and this is what I have been using. However, this is really bad for csv files that have two string fields while the rest of the columns are numeric. For the numeric values, you really do want to be able to sometimes get access to the matrix representation and manipulate it as a numpy.array.
Is there a combination of csv/numpy/scipy features that allows the flexibility of both worlds? Any advice on this would be greatly appreciated.
In summary, the main features are:
Standard ability to specify delimiters, number of rows to skip, columns to ignore, etc.
The ability to get a numpy.array/matrix representation of the data so that the numeric values can be manipulated
The ability to extract columns and rows by header name (as in the above example)
Have a look at pandas, which is built on top of numpy.
Here is a small example:
In [7]: df = pd.read_csv('data.csv', sep='\t', index_col='name')
In [8]: df
Out[8]:
favorite_integer favorite_float1 favorite_float2 short_description
name
johnny 5 60.20 0.520 johnny likes fruitflies
bob 1 17.52 0.001 bob, bobby, robert
In [9]: df.describe()
Out[9]:
favorite_integer favorite_float1 favorite_float2
count 2.000000 2.000000 2.000000
mean 3.000000 38.860000 0.260500
std 2.828427 30.179317 0.366988
min 1.000000 17.520000 0.001000
25% 2.000000 28.190000 0.130750
50% 3.000000 38.860000 0.260500
75% 4.000000 49.530000 0.390250
max 5.000000 60.200000 0.520000
In [13]: df.ix['johnny', 'favorite_integer']
Out[13]: 5
In [15]: df['favorite_float1'] # or attribute: df.favorite_float1
Out[15]:
name
johnny 60.20
bob 17.52
Name: favorite_float1
In [16]: df['mean_favorite'] = df.mean(axis=1)
In [17]: df.ix[:, 3:]
Out[17]:
short_description mean_favorite
name
johnny johnny likes fruitflies 21.906667
bob bob, bobby, robert 6.173667
matplotlib.mlab.csv2rec returns a numpy recarray, so you can do all the great numpy things to this that you would do with any numpy array. The individual rows, being record instances, can be indexed as tuples but also have attributes automatically named for the columns in your data:
rows = matplotlib.mlab.csv2rec('data.csv')
row = rows[0]
print row[0]
print row.name
print row['name']
csv2rec also understands "quoted strings", unlike numpy.genfromtxt.
In general, I find that csv2rec combines some of the best features of csv.reader and numpy.genfromtxt.
numpy.genfromtxt()
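A sketch of how numpy.genfromtxt() covers the requirements above, assuming the tab-separated example file from the question: dtype=None lets it infer a per-column type, names=True takes the field names from the header line, and the result is a structured array that supports both column access by name and row-wise iteration:
import numpy as np

data = np.genfromtxt('myfile.csv', delimiter='\t', names=True, dtype=None,
                     comments='#', encoding=None)  # encoding=None gives str rather than bytes on newer numpy

print(data['favorite_float1'])   # whole column by header name
print(data[0]['name'])           # single row, then field by name
floats = np.column_stack([data['favorite_float1'], data['favorite_float2']])  # numeric block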
Why not just use the stdlib csv.DictReader?
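And a sketch of the csv.DictReader route for the row-wise, access-by-name part; note that every field comes back as a string, so numeric columns need an explicit conversion:
import csv

with open('myfile.csv', newline='') as f:
    # skip the '#' comment line, then let DictReader take field names from the header row
    reader = csv.DictReader((line for line in f if not line.startswith('#')), delimiter='\t')
    for row in reader:
        print(row['name'], int(row['favorite_integer']), float(row['favorite_float1']))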
