Retrieve TF-IDF Values for Textual Documents in CSV File - python

I have a CSV file with two columns (no header), loaded into a variable called 'dataset':
Year Document Text
0 ['1991'] ['FACTSHEET ', 'WHAT ', 'IS ', 'AIDS', 'AIDS '...
1 ['1991'] ['HIV ', 'IT', "'S ", 'YOUR ', 'CHOICE', 'Ever...
2 ['1991'] ['ACET ', 'AIDS ', 'CARE ', 'EDUCATION ', 'AND...
I'm attempting to construct a Bag of Words model using scikit-learn and gather the weightings using TF-IDF. However, I'm having difficulty interpreting the results: the code below returns 2480 rows (correct) × 346862 columns (corrected by @Jarad). I would appreciate help deciphering these results, and a pointer in the right direction on either formatting them (for clarity) or correcting them (for validity), so that I can progress to the later stages of a Bag of Words implementation.
Python Code:
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

v = TfidfVectorizer()
x = v.fit_transform(dataset.iloc[:, 1])   # second column holds the document text
df1 = pd.DataFrame(x.toarray(), columns=v.get_feature_names())
print(df1)
Output:
00 000 0000 00000 00000000 00000001 0000001 00001
0 0.000000 0.011453 0.000000 0.0 0.0 0.0 0.0 0.0
1 0.000000 0.022032 0.000000 0.0 0.0 0.0 0.0 0.0
2 0.006352 0.009717 0.000000 0.0 0.0 0.0 0.0 0.0
3 0.001422 0.015949 0.000000 0.0 0.0 0.0 0.0 0.0
4 0.000000 0.002377 0.000000 0.0 0.0 0.0 0.0 0.0
Should I tokenize the documents before storing them in the CSV file? I decided against it because I hope to analyse sentence structure at a later stage as well.
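A rough sketch of one way to read these weights without materialising the full 2480 × 346862 table (assuming the same dataset variable as above): keep the sparse matrix and pull out only the top-weighted terms per document.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

v = TfidfVectorizer()
x = v.fit_transform(dataset.iloc[:, 1])   # sparse matrix: documents x vocabulary terms
terms = v.get_feature_names()             # get_feature_names_out() in newer scikit-learn versions

# ten highest-weighted terms in the first document
row = x[0].toarray().ravel()
print(pd.Series(row, index=terms).nlargest(10))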

Related

drop_duplicates in pandas for a large data set

I am new to pandas, so sorry for the naiveté.
I have two dataframes.
One is out.hdf:
999999 2014 1 2 15 19 45.19 14.095 -91.528 69.7 4.5 0.0 0.0 0.0 603879074
999999 2014 1 2 23 53 57.58 16.128 -97.815 23.2 4.8 0.0 0.0 0.0 603879292
999999 2014 1 9 12 27 10.98 13.265 -89.835 55.0 4.5 0.0 0.0 0.0 603947030
999999 2014 1 9 20 57 44.88 23.273 -80.778 15.0 5.1 0.0 0.0 0.0 603947340
and another one is out.res (the first column is station name):
061Z 56.72 0.0 P 603879074
061Z 29.92 0.0 P 603879074
0614 46.24 0.0 P 603879292
109C 87.51 0.0 P 603947030
113A 66.93 0.0 P 603947030
113A 26.93 0.0 P 603947030
121A 31.49 0.0 P 603947340
The last columns in both dataframes are ID.
I want to create a new dataframe that puts rows with the same ID from the two dataframes together, like this: first a line from hdf, then, beneath it, the lines from res with the same ID (without keeping the ID column from res).
The new dataframe:
"999999 2014 1 2 15 19 45.19 14.095 -91.528 69.7 4.5 0.0 0.0 0.0 603879074"
061Z 56.72 0.0 P
061Z 29.92 0.0 P
"999999 2014 1 2 23 53 57.58 16.128 -97.815 23.2 4.8 0.0 0.0 0.0 603879292"
0614 46.24 0.0 P
"999999 2014 1 9 12 27 10.98 13.265 -89.835 55.0 4.5 0.0 0.0 0.0 603947030"
109C 87.51 0.0 P
113A 66.93 0.0 P
113A 26.93 0.0 P
"999999 2014 1 9 20 57 44.88 23.273 -80.778 15.0 5.1 0.0 0.0 0.0 603947340"
121A 31.49 0.0 P
My code to do this is:
import csv
import pandas as pd
import numpy as np

path = './'
hdf = pd.read_csv(path + 'out.hdf', delimiter='\t', header=None)
res = pd.read_csv(path + 'out.res', delimiter='\t', header=None)

### creating input in the format of ph2dt-jp/ph
with open('./new_df', 'w', encoding='UTF8') as f:
    writer = csv.writer(f, delimiter='\t')
    i = 0
    with open('./out.hdf', 'r') as a_file:
        for line in a_file:
            liney = line.strip()
            writer.writerow(np.array([liney]))
            print(liney)
            j = 0
            with open('./out.res', 'r') as a_file:
                for line in a_file:
                    if res.iloc[j, 4] == hdf.iloc[i, 14]:
                        strng = res.iloc[j, [0, 1, 2, 3]]
                        print(strng)
                        writer.writerow(np.array(strng))
                    j += 1
            i += 1
The goal is to keep just unique stations in the 3rd dataframe. I used these commands for res to keep unique stations before creating the 3rd dataframe:
res.drop_duplicates([0], keep = 'last', inplace = True)
and
res.groupby([0], as_index = False).last()
and it works fine. The problem is that for a large data set, with thousands of lines, using these commands causes some lines of the res file to be omitted in the third dataframe.
Could you please let me know what I should do to get the same result for a large dataset?
Thanks in advance for your time and help.
I found the problem and hope this is helpful for others in the future.
In the large data set, duplicated stations were repeated many times, but not consecutively, and drop_duplicates() was keeping just one of them.
However, I wanted to drop only consecutive repeats, not all of them, and I've done this using shift:
unique_stations = res.loc[res[0].shift() != res[0]]
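A minimal toy example (with made-up station rows, not the real data) shows the difference from drop_duplicates: only consecutive repeats are dropped, while a station that reappears later is kept.
import pandas as pd

res = pd.DataFrame({0: ['061Z', '061Z', '0614', '113A', '113A', '061Z'],
                    1: [56.72, 29.92, 46.24, 66.93, 26.93, 30.00]})

# keep a row only when its station differs from the station on the previous row
unique_stations = res.loc[res[0].shift() != res[0]]
print(unique_stations[0].tolist())   # ['061Z', '0614', '113A', '061Z']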

Degree Centrality and Clustering Coefficient from an Adjacency Matrix

Based on a dataset extracted from this link: Brain and Cosmic Web samples, I'm trying to do some complex network analysis.
The paper The Quantitative Comparison Between the Neuronal Network and the Cosmic Web claims to have used this dataset, as well as its adjacency matrices:
"Mij, i.e., a matrix with rows/columns equal to the number of detected nodes, with value Mij = 1 if the nodes are separated by a distance ≤ llink, or Mij = 0 otherwise".
I then probed into the matrix, like so:
from astropy.io import fits
import pandas as pd

with fits.open('mind_dataset/matrix_CEREBELLUM_large.fits') as data:
    matrix_cerebellum = pd.DataFrame(data[0].data)
which does not give a binary adjacency matrix, but rather a matrix of distances between nodes expressed in pixels.
I've learned that the correspondence between 1 pixel and scale is:
neuronal_web_pixel = 0.32 # micrometers
And came up with a function to convert pixels to microns:
def pixels_to_scale(df, mind=False, cosmos=False):
    one_pixel_equals_parsec = cosmic_web_pixel
    one_pixel_equals_micron = neuronal_web_pixel
    if mind:
        df = df / one_pixel_equals_micron
    if cosmos:
        df = df / one_pixel_equals_parsec
    return df
Then, another function to binarize the matrix after the conversion:
def binarize_matrix(df, mind=False, cosmos=False):
    if mind:
        brain_Llink = 16.0  # microns
        # distances less than 16 microns
        brain_mask = (df <= brain_Llink)
        # convert to 1
        df = df.where(brain_mask, 1.0)
    if cosmos:
        cosmos_Llink = 1.2  # Mpc
        brain_mask = (df <= cosmos_Llink)
        df = df.where(brain_mask, 1.0)
    return df
Finally, with:
matrix_cerebellum = pixels_to_scale(matrix_cerebellum, mind=True)
matrix_cerebellum = binarize_matrix(matrix_cerebellum, mind=True)
matrix_cerebellum.head(5) prints my sparse matrix of (mostly) 0.0s and 1.0s:
0 1 2 3 4 5 6 7 8 9 ... 1848 1849 1850 1851 1852 1853 1854 1855 1856 1857
0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
5 rows × 1858 columns
Now I would like to calculate:
Degree Centrality of the network, given by the formula:
Cd(j) = kj / (n - 1)
where kj is the number of (undirected) connections to/from the j-th node and n is the total number of nodes in the entire network.
Clustering Coefficient, which quantifies the existence of infrastructure within the local vicinity of nodes, given by the formula:
C(j) = 2*yj / (kj * (kj - 1))
in which yj is the number of links between the neighboring nodes of the j-th node.
For finding Degree Centrality, I have tried:
# find connections by adding matrix row values
matrix_cerebellum['K'] = matrix_cerebellum.sum(axis=1)
# applying formula
matrix_cerebellum['centrality'] = matrix_cerebellum['K']/matrix_cerebellum.shape[0]-1
Generates:
... K centrality
9.0 -0.995156
6.0 -0.996771
7.0 -0.996771
11.0 -0.996233
11.0 -0.994080
According to the paper, I should be finding:
"For the cerebellum slices we measured 〈k〉 ∼ 1.9 − 3.7"
for the average number of connections per node. Instead, I'm getting negative centralities.
Does anyone know how to apply any of these formulas based on the dataframe above?
This is not really a programming question, but I will try to answer it. The webpage with the data sources states that the adjacency matrix files for the brain samples give distances between connected nodes, expressed in pixels of the images used to reconstruct the networks. The paper then explains that to get the real adjacency matrix Mij (with 0 and 1 values only), the authors consider as connected those nodes whose distance is at most 16 micrometers. I don't see information on how many pixels in the image correspond to one micrometer; this would be needed to compute the same matrix Mij that the authors used in their calculations.
Furthermore, the value 〈k〉 is not the degree centrality or the clustering coefficient (those depend on a node), but rather the average number of connections per node in the network, computed using the matrix Mij. The paper then compares the observed distributions of degree centralities and clustering coefficients in the brain and cosmic networks to the distributions one would see in a random network with the same number of nodes and the same value of 〈k〉. The conclusion is that the brain and cosmic networks are highly non-random.
Edits:
1. The conversion of 0.32 micrometers per pixel seems to be right. In the files with data on brain samples (both for cortex and cerebellum) the largest value is 50 pixels, which with this conversion corresponds to 16 micrometers. This suggests that the authors of the paper already thresholded the matrices, listing in them only distances not exceeding 16 micrometers. In view of this, to obtain the matrix Mij with 0 and 1 values only, one simply needs to replace all non-zero values with 1. An issue is that using the matrices obtained in this way one gets 〈k〉 = 9.22 for cerebellum and 〈k〉 = 7.13 for cortex, which is somewhat outside the ranges given in the paper. I don't know how to account for this discrepancy.
2. Negative centrality values are due to a mistake (missing parentheses) in the code. It should be:
matrix_cerebellum['centrality'] = matrix_cerebellum['K']/(matrix_cerebellum.shape[0] - 1)
3. Clustering coefficient and degree centrality of each node can be computed using tools provided by the networkx library:
from astropy.io import fits
import networkx as nx

# get the adjacency matrix for cortex
with fits.open('matrix_CORTEX_large.fits') as data:
    M = data[0].data
    M[M > 0] = 1

# create a graph object
G_cortex = nx.from_numpy_matrix(M)

# compute degree centrality of all nodes
centrality = nx.degree_centrality(G_cortex)

# compute clustering coefficient of all nodes
clustering = nx.clustering(G_cortex)
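As a small follow-up sketch (reusing M and G_cortex from the snippet above), the average number of connections per node, 〈k〉, that the paper quotes can be computed directly:
# average degree <k>: mean number of connections per node
k_mean = M.sum(axis=1).mean()

# equivalently, from the networkx graph
k_mean_nx = sum(dict(G_cortex.degree()).values()) / G_cortex.number_of_nodes()
print(k_mean, k_mean_nx)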

k-means returns NaN values?

I recently came across a k-means tutorial that looks a bit different from how I remember the algorithm, but it should still do the same thing; after all, it's k-means. So I gave it a try with some data. Here's how the code looks:
# Assignment stage:
def assignment(data, centroids):
    for i in centroids.keys():
        # sqrt((x1-x2)^2 + (y1-y2)^2 + ...)
        data['distance_from_{}'.format(i)] = (
            np.sqrt((data['soloRatio'] - centroids[i][0])**2
                    + (data['secStatus'] - centroids[i][1])**2
                    + (data['shipsDestroyed'] - centroids[i][2])**2
                    + (data['combatShipsLost'] - centroids[i][3])**2
                    + (data['miningShipsLost'] - centroids[i][4])**2
                    + (data['exploShipsLost'] - centroids[i][5])**2
                    + (data['otherShipsLost'] - centroids[i][6])**2))
        print(data['distance_from_{}'.format(i)])
    centroid_distance_cols = ['distance_from_{}'.format(i) for i in centroids.keys()]
    data['closest'] = data.loc[:, centroid_distance_cols].idxmin(axis=1)
    data['closest'] = data['closest'].astype(str).str.replace(r'\D+', '', regex=True)
    return data

data = assignment(data, centroids)
and:
# Update stage:
import copy

old_centroids = copy.deepcopy(centroids)

def update(k):
    for i in centroids.keys():
        centroids[i][0] = np.mean(data[data['closest'] == i]['soloRatio'])
        centroids[i][1] = np.mean(data[data['closest'] == i]['secStatus'])
        centroids[i][2] = np.mean(data[data['closest'] == i]['shipsDestroyed'])
        centroids[i][3] = np.mean(data[data['closest'] == i]['combatShipsLost'])
        centroids[i][4] = np.mean(data[data['closest'] == i]['miningShipsLost'])
        centroids[i][5] = np.mean(data[data['closest'] == i]['exploShipsLost'])
        centroids[i][6] = np.mean(data[data['closest'] == i]['otherShipsLost'])
    return k
# TODO: add graphical representation?
while True:
    closest_centroids = data['closest'].copy(deep=True)
    centroids = update(centroids)
    data = assignment(data, centroids)
    if closest_centroids.equals(data['closest']):
        break
When I run the initial assignment stage, it returns the distances, but when I run the update stage all distance values become NaN, and I just don't know why or at which point exactly this happens. Maybe I made a mistake I can't spot?
Here's an excerpt of the data I'm working with:
Unnamed: 0 characterID combatShipsLost exploShipsLost miningShipsLost \
0 0 90000654.0 8.0 4.0 5.0
1 1 90001581.0 97.0 5.0 1.0
2 2 90001595.0 61.0 0.0 0.0
3 3 90002023.0 22.0 1.0 0.0
4 4 90002030.0 74.0 0.0 1.0
otherShipsLost secStatus shipsDestroyed soloRatio
0 0.0 5.003100 1.0 10.0
1 0.0 2.817807 6251.0 6.0
2 0.0 -2.015310 752.0 0.0
3 4.0 5.002769 43.0 5.0
4 1.0 3.090204 301.0 7.0
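One way to narrow this down (a diagnostic sketch, not a confirmed cause): right before the update stage, check whether every centroid key actually matches rows in data['closest']. np.mean over an empty selection is NaN, and a single NaN centroid coordinate then turns every distance computed in the next assignment stage into NaN.
# diagnostic sketch: count how many rows are currently assigned to each centroid key
for i in centroids.keys():
    n_assigned = (data['closest'] == i).sum()
    print(i, type(i), 'rows assigned:', n_assigned)
# if some count is 0 (for example because 'closest' holds strings while the
# centroid keys are of another type), the means in update() become NaN for that centroid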

Having difficulty getting multiple columns in HDF5 Table Data

I am new to HDF5 and am trying to store a DataFrame row in the HDF5 format. I want to append a row at different locations within the file; however, every time I append, the row shows up as an array in a single column rather than as single values in multiple columns.
I have tried both h5py and pandas, and it seems like pandas is the better option for appending. I have tried a lot of different methods; truly, any help would be greatly appreciated.
Here I am sending an array into the HDF5 file multiple times:
import pandas as pd
import numpy as np

data = np.zeros((1, 48), dtype=float)
columnName = ['Hello' + str(y) for (x, y), item in np.ndenumerate(data)]
df = pd.DataFrame(data=data, columns=columnName)
file = pd.HDFStore('file.hdf5', mode='a', complevel=9, comlib='blosc')
for x in range(0, 11):
    file.put('/data', df, column_data=columnName, append=True, format='table')
This seems to work fine:
In [243]: store = pd.HDFStore('test.h5')
In [247]: store.put('foo',df,append=True,format='table')
In [248]: store.put('foo',df,append=True,format='table')
In [249]: store.put('foo',df,append=True,format='table')
In [250]: store['foo']
Out[250]:
Hello0 Hello1 Hello2 Hello3 Hello4 ... Hello43 Hello44 Hello45 Hello46 Hello47
0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0
0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0
0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0
[3 rows x 48 columns]
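For completeness, a self-contained sketch of the same working pattern (note that the compression keyword in pandas is complib, and put only needs append=True and format='table'):
import numpy as np
import pandas as pd

data = np.zeros((1, 48))
df = pd.DataFrame(data, columns=['Hello' + str(i) for i in range(48)])

with pd.HDFStore('file.hdf5', mode='a', complevel=9, complib='blosc') as store:
    for _ in range(3):
        store.append('data', df)   # same effect as store.put('data', df, append=True, format='table')
    print(store['data'].shape)     # (3, 48): three rows, one value per column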

Transform Pandas DataFrame to LIBFM format txt file

I want to transform a pandas DataFrame in Python into a sparse matrix txt file in the LIBFM format.
The format needs to look like this:
4 0:1.5 3:-7.9
2 1:1e-5 3:2
-1 6:1
This file contains three cases. The first column states the target of each of the three cases: i.e. 4 for the first case, 2 for the second and -1 for the third. After the target, each line contains the non-zero elements of x, where an entry like 0:1.5 reads x0 = 1.5 and 3:-7.9 means x3 = -7.9, etc. In other words, the left side of INDEX:VALUE states the index within x, whereas the right side states the value of x.
In total the data from the example describes the following design matrix X and target vector y:
X = [ 1.5   0.0   0.0  -7.9   0.0   0.0   0.0
      0.0   1e-5  0.0   2.0   0.0   0.0   0.0
      0.0   0.0   0.0   0.0   0.0   0.0   1.0 ]

y = [ 4, 2, -1 ]
This is also explained in the Manual file under chapter 2.
Now here is my problem: I have a pandas dataframe that looks like this:
overall reviewerID asin brand Positive Negative \
0 5.0 A2XVJBSRI3SWDI 0000031887 Boutique Cutie 3.0 -1
1 4.0 A2G0LNLN79Q6HR 0000031887 Boutique Cutie 5.0 -2
2 2.0 A2R3K1KX09QBYP 0000031887 Boutique Cutie 3.0 -2
3 1.0 A19PBP93OF896 0000031887 Boutique Cutie 2.0 -3
4 4.0 A1P0IHU93EF9ZK 0000031887 Boutique Cutie 2.0 -2
LDA_0 LDA_1 ... LDA_98 LDA_99
0 0.000833 0.000833 ... 0.000833 0.000833
1 0.000769 0.000769 ... 0.000769 0.000769
2 0.000417 0.000417 ... 0.000417 0.000417
3 0.000137 0.014101 ... 0.013836 0.000137
4 0.000625 0.000625 ... 0.063125 0.000625
Here "overall" is the target column and all other 105 columns are features.
The 'reviewerID', 'asin' and 'brand' columns need to be changed into dummy variables, so each unique reviewer, asin and brand gets its own column. This means that if 'reviewerID' has 100 unique values, you get 100 columns where the value is 1 if that row represents the specific reviewer and 0 otherwise.
All other columns don't need to be reformatted, so the index for those columns can just be the column number.
So the first 3 rows in the above pandas data frame need to be transformed to the following output:
5 0:1 5:1 6:1 7:3 8:-1 9:0.000833 10:0.000833 ... 107:0.000833 108:0.00833
4 1:1 5:1 6:1 7:5 8:-2 9:0.000769 10:0.000769 ... 107:0.000769 108:0.00769
2 2:1 5:1 6:1 7:3 8:-2 9:0.000417 10:0.000417 ... 107:0.000417 108:0.000417
In the LIBFM package there is a program that can transform user-item-rating data into the LIBFM input format; however, that program can't cope with this many columns.
Is there an easy way to do this? I have 1 million rows in total.
The LIBFM executable expects the input in the libSVM format that you have explained here. If the file converter in the LIBFM package does not work for your data, try scikit-learn's sklearn.datasets.dump_svmlight_file method.
Ref: http://scikit-learn.org/stable/modules/generated/sklearn.datasets.dump_svmlight_file.html
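A rough sketch of that approach (assuming a dataframe df with the column names shown in the question): one-hot encode the three categorical columns, stack them with the numeric features into one sparse matrix, and let dump_svmlight_file write the INDEX:VALUE lines.
import pandas as pd
import scipy.sparse as sp
from sklearn.datasets import dump_svmlight_file

y = df['overall'].values

# dummy-encode the categorical columns; each unique value becomes its own column
dummies = pd.get_dummies(df[['reviewerID', 'asin', 'brand']])

# remaining numeric feature columns (Positive, Negative, LDA_0 ... LDA_99)
numeric = df.drop(columns=['overall', 'reviewerID', 'asin', 'brand'])

# one sparse feature matrix: dummy columns first, then the numeric features
X = sp.hstack([sp.csr_matrix(dummies.values), sp.csr_matrix(numeric.values)]).tocsr()

# writes one line per row: target followed by the non-zero INDEX:VALUE pairs
dump_svmlight_file(X, y, 'libfm_input.txt', zero_based=True)
The exact column ordering will differ slightly from the hand-written example above; what matters is that each feature keeps a consistent index across all rows.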
