I have a set of texts contained in a list, which I loaded from a csv file
texts=['this is text1', 'this would be text2', 'here we have text3']
and I would like to create a document-term matrix, by using stemmed words.
I have also stemmed them to have:
[['text1'], ['would', 'text2'], ['text3']]
What I would like to do is to create a DTM that counts all the stemmed terms (then I would need to do some operations on the rows).
As for the unstemmed texts, I am able to build the DTM for short texts by using the function fn_tdm_df reported here.
What would be more practical for me, though, is to build a DTM of the stemmed words. To be clearer, this is the output I get from applying "fn_tdm_df":
be have here is text1 text2 text3 this we would
0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1 1.0 1.0
1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1 0.0 0.0
First, I do not know why I have only two rows instead of three. Second, my desired output would be something like:
text1 would text2 text3
0 1 0 0 0
1 0 1 1 0
2 0 0 0 1
I am sorry, but I am really stuck on this output. I also tried exporting the stemmed texts and re-importing them in R, but the encoding does not come through correctly. Given the huge amount of data, I would probably need to work with DataFrames. What would you suggest?
----- UPDATE
Using CountVectorizer I am not fully satisfied, because I do not get a tractable matrix in which I can easily normalize and sum rows/columns.
Here is the code I am using, but it hangs Python (the dataset is too large). How can I run it efficiently?
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

vect = CountVectorizer(min_df=0., max_df=1.0)
X = vect.fit_transform(texts)
# Converting the sparse matrix to a dense array (X.A / X.toarray()) is what exhausts memory on a large corpus
print(pd.DataFrame(X.A, columns=vect.get_feature_names()).to_string())
df = pd.DataFrame(X.toarray().transpose(), index=vect.get_feature_names())
Why don't you use sklearn? The CountVectorizer() class converts a collection of text documents to a matrix of token counts. What's more, it gives a sparse representation of the counts using scipy.
You can either give your raw entries to the method or preprocess them as you have done (stemming + stop-word removal).
Check this out: CountVectorizer()
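For example, here is a minimal sketch, assuming the stemmed tokens are joined back into space-separated strings (stemmed_texts below is a made-up stand-in). It keeps the counts sparse so row/column operations stay tractable on a large corpus:
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

# Hypothetical stemmed documents, joined back into plain strings
stemmed_texts = ['text1', 'would text2', 'text3']

vect = CountVectorizer()
X = vect.fit_transform(stemmed_texts)   # scipy sparse matrix, shape (n_docs, n_terms)

# Sums and normalization can be done directly on the sparse matrix
row_totals = X.sum(axis=1)              # total term count per document
col_totals = X.sum(axis=0)              # total count per term

# Only densify if the vocabulary and corpus are small enough to fit in memory
dtm = pd.DataFrame(X.toarray(), columns=vect.get_feature_names_out())
print(dtm)
If memory is the bottleneck, keep working on X itself rather than on the dense DataFrame.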
Related
I have this dataset for sentiment analysis, loading the data with this code:
url = 'https://raw.githubusercontent.com/jdvelasq/datalabs/master/datasets/amazon_cells_labelled.tsv'
df = pd.read_csv(url, sep='\t', names=["Sentence", "Feeling"])
The issue is that the DataFrame ends up with rows whose Feeling is NaN, even though those rows are just the continuation of the previous sentence.
Right now the output looks like this:
sentence feeling
I do not like it. NaN
I give it a bad score. 0
The output should look like this:
sentence feeling
I do not like it. I give it a bad score 0
Can you help me concatenate the rows, or load the dataset so that each full sentence ends up matched with its score?
Create virtual group labels, then groupby and aggregate the rows:
grp = df['Feeling'].notna().cumsum().shift(fill_value=0)
out = df.groupby(grp).agg({'Sentence': ' '.join, 'Feeling': 'last'})
print(out)
# Output:
Sentence Feeling
Feeling
0 I try not to adjust the volume setting to avoi... 0.0
1 Good case, Excellent value. 1.0
2 I thought Motorola made reliable products!. Ba... 1.0
3 When I got this item it was larger than I thou... 0.0
4 The mic is great. 1.0
... ... ...
996 But, it was cheap so not worth the expense or ... 0.0
997 Unfortunately, I needed them soon so i had to ... 0.0
998 The only thing that disappoint me is the infra... 0.0
999 No money back on this one. You can not answer ... 0.0
1000 It's rugged. Well this one is perfect, at the ... NaN
[1001 rows x 2 columns]
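To see why the grouping key works, here is the same trick on a tiny made-up frame (hypothetical rows, not taken from the real dataset): notna() marks the rows that carry a score, cumsum() numbers them, and shift(fill_value=0) pulls each label back one row so that a NaN continuation row falls into the same group as the row that completes its sentence.
import numpy as np
import pandas as pd

mini = pd.DataFrame({
    'Sentence': ['I do not like it.', 'I give it a bad score.', 'Good case, Excellent value.'],
    'Feeling': [np.nan, 0, 1],
})
grp = mini['Feeling'].notna().cumsum().shift(fill_value=0)
print(grp.tolist())   # [0, 0, 1] -> the first two rows share one group
print(mini.groupby(grp).agg({'Sentence': ' '.join, 'Feeling': 'last'}))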
I have the following dataset:
data = {'ROC_9': [0.006250, 0.087230, 0.045028, 0.165738, -0.006993, -0.432736, -0.11162, 0.057466, 0.203138, -0.008234]}
price_data = pd.DataFrame(data)
It is an indicator derived from a stock price, namely the rate of change.
I want to write code that creates a new feature (column) on the pandas DataFrame whenever an existing feature goes from positive to negative, or vice versa.
It is easier to explain through an example: let's use the feature ROC_9.
I create a new variable called ROC9_signal and set it equal to 0:
price_data['ROC9_signal'] = 0
When ROC_9 goes from negative to positive, I want to change the ROC9_signal from 0 to 1.
When ROC_9 goes from positive to negative, I want to change the ROC9_signal from 0 to -1.
Looking at the data, I would like ROC9_signal to change from 0 to -1, since the value has gone from 0.16 (positive) to -0.006 (negative).
Looking at the data, I would then like ROC9_signal to change from 0 to 1, since the value has gone from -0.11 (negative) to 0.05 (positive).
Looking at the data, I would like ROC9_signal to change from 0 to -1, since the value has gone from 0.20 (positive) to -0.008 (negative).
It is only the row where the change happens that I want to change from 0 to 1 or 0 to -1, the other rows must remain at 0.
I will then apply this same logic to create a momentum10_signal column and a chaikin_money_flow_signal column. Therefore I want a solution that can be applied to different columns, rather than handled manually for each one.
Thanks in advance for the help.
You can use np.sign to extract the signs. Something like this:
import numpy as np

signs = np.sign(price_data.ROC_9)
price_data['signal'] = np.sign(signs.diff()).fillna(0)
Output:
ROC_9 signal
0 0.006250 0.0
1 0.087230 0.0
2 0.045028 0.0
3 0.165738 0.0
4 -0.006993 -1.0
5 -0.432736 0.0
6 -0.111620 0.0
7 0.057466 1.0
8 0.203138 0.0
9 -0.008234 -1.0
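Since you want to reuse this for other indicators, the same two lines can be wrapped in a small helper and applied column by column. A sketch, assuming price_data from above; momentum10 and chaikin_money_flow are placeholder names for whatever your full data actually uses:
import numpy as np

def sign_change_signal(series):
    # +1 on the row where the series turns positive, -1 where it turns negative, 0 elsewhere
    signs = np.sign(series)
    return np.sign(signs.diff()).fillna(0)

# Hypothetical column names; swap in the indicator columns from your full data
for col in ['ROC_9', 'momentum10', 'chaikin_money_flow']:
    if col in price_data.columns:
        price_data[f'{col}_signal'] = sign_change_signal(price_data[col])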
I want to calculate tf and idf separately from the documents below. I'm using Python and pandas.
import pandas as pd
df = pd.DataFrame({'docId': [1,2,3],
'sent': ['This is the first sentence','This is the second sentence', 'This is the third sentence']})
I want to calculate Tf-Idf using the formulas, not the sklearn library.
After tokenization, I have used this for the TF calculation:
tf = df.sent.apply(pd.value_counts).fillna(0)
but this gives me counts, whereas I want the ratio (count / total number of words in the document).
For Idf:
df[df['sent'] > 0] / (1 + len(df['sent']))
but it doesn't seem to work.
I want both Tf and Idf in pandas Series format.
Edit
For tokenization I used df['sent'] = df['sent'].apply(word_tokenize)
I got the idf scores with:
tfidf = TfidfVectorizer()
feature_array = tfidf.fit_transform(df['sent'])
d=(dict(zip(tfidf.get_feature_names(), tfidf.idf_)))
How can I get the tf scores separately?
You'll need to do a little more work to compute this.
import numpy as np
import pandas as pd
df = pd.DataFrame({'docId': [1,2,3],
'sent': ['This is the first sentence',
'This is the second sentence',
'This is the third sentence']})
# Tokenize and generate count vectors
word_vec = df.sent.apply(str.split).apply(pd.value_counts).fillna(0)
# Compute term frequencies
tf = word_vec.divide(np.sum(word_vec, axis=1), axis=0)
# Compute inverse document frequencies
idf = np.log10(len(tf) / word_vec[word_vec > 0].count())
# Compute TF-IDF vectors
tfidf = tf * idf   # multiplying by the idf Series aligns on the column labels
print(tfidf)
is the first This sentence second third
0 0.0 0.0 0.095424 0.0 0.0 0.000000 0.000000
1 0.0 0.0 0.000000 0.0 0.0 0.095424 0.000000
2 0.0 0.0 0.000000 0.0 0.0 0.000000 0.095424
Depending on your situation, you may want to normalize:
# L2 (Euclidean) normalization
l2_norm = np.sqrt(np.sum(np.square(tfidf), axis=1))
# Normalized TF-IDF vectors
tfidf_norm = tfidf.divide(l2_norm, axis=0)
print(tfidf_norm)
is the first This sentence second third
0 0.0 0.0 1.0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0 0.0 1.0 0.0
2 0.0 0.0 0.0 0.0 0.0 0.0 1.0
Here is my solution:
First tokenize; for convenience, keep the tokens in a separate column:
df['tokens'] = [x.lower().split() for x in df.sent.values]
Then TF as you did, but with the normalize parameter (for technical reasons you need a lambda):
tf = df.tokens.apply(lambda x: pd.Series(x).value_counts(normalize=True)).fillna(0)
Then IDF (one value per word in the vocabulary):
idf = pd.Series([np.log10(float(df.shape[0])/len([x for x in df.tokens.values if token in x])) for token in tf.columns])
idf.index = tf.columns
Then, if you want TF-IDF:
tfidf = tf.copy()
for col in tfidf.columns:
tfidf[col] = tfidf[col]*idf[col]
I think I had the same issue as you.
I wanted to use TfidfVectorizer, but its default tf-idf definition is not standard (tf-idf = tf + tf*idf instead of the usual tf-idf = tf*idf).
TF: the term "frequency" is generally used here to mean the raw count. For that you can use CountVectorizer() from sklearn.
You can then log-transform and normalize the counts if needed.
The option using numpy was much slower in processing time (> 50 times slower).
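A minimal sketch of that route, assuming df['sent'] still holds the raw strings from the DataFrame defined at the top of this question (the log base and the lack of smoothing are assumptions; plug in whatever formula you need):
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np
import pandas as pd

vect = CountVectorizer()
counts = pd.DataFrame(vect.fit_transform(df['sent']).toarray(),
                      columns=vect.get_feature_names_out())

tf = counts.div(counts.sum(axis=1), axis=0)        # count / total words in each document
idf = np.log10(len(counts) / (counts > 0).sum())   # one value per vocabulary term
tfidf = tf * idf                                   # broadcast the idf Series across rows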
I have an aviation dataset that I am trying to clean. There are some missing values for the NumEngines feature, but in some cases a missing value can be derived from an entry elsewhere in the DataFrame (this is not always possible). Below is a mini example of my dataset to illustrate both cases. Note that the first Cessna entry can be used to fill in the second one, but this is not the case for Piper.
df = pd.DataFrame()
df["Make"] = ["Cessna","Piper","Cessna","Boeing"]
df["Model"] = ["Citation","PA32RT","Citation","737-300"]
df["NumEngines"] = [2,None,None,2]
How can I make it so that the resulting DataFrame would be
Make Model NumEngines
0 Cessna Citation 2.0
1 Piper PA32RT NaN
2 Cessna Citation 2.0
3 Boeing 737-300 2.0
I would bet transform('first') can do the trick again here:
df.groupby(['Make', 'Model']).transform('first')
Out[179]:
NumEngines
0 2.0
1 NaN
2 2.0
3 2.0
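To write those values back into the original frame, you can restrict the transform to the NumEngines column and assign the result, for example:
# Fill missing NumEngines from another row with the same Make/Model, when one exists
df['NumEngines'] = df.groupby(['Make', 'Model'])['NumEngines'].transform('first')
print(df)
This reproduces the desired output above: Piper stays NaN because no row in that group carries a value.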
I have two arrays, nlxTTL and ttlState. Both arrays consist of a repeating pattern of 0s and 1s indicating an input voltage, which can be HIGH (1) or LOW (0), and both are recorded from the same source, which sends a TTL pulse (HIGH and LOW) with a 1-second pulse width.
But due to a logging mistake, some drops happen in the ttlState list, i.e. it doesn't log a strictly alternating sequence of 0s and 1s and ends up dropping values.
The good part is that I also log a timestamp for each TTL input received, for both lists. The inter-event timestamp differences clearly show where a pulse has been missed.
Here is an example of what data looks like:
nlxTTL, ttlState, nlxTime, ttlTime
0,0,1000,1000
1,1,2000,2000
0,1,3000,4000
1,1,4000,6000
0,0,5000,7000
1,1,6000,8000
0,0,7000,9000
1,1,8000,10000
As you can see, nlxTime and ttlTime clearly differ from each other. How can I then use these timestamps to align all 4 lists?
When dealing with tabular data such as a CSV file, it's a good idea to use a library to make the process easier. I like the pandas dataframe library.
Now for your question: one way to think about this problem is that you really have two datasets, an nlx dataset and a ttl dataset. You want to join those datasets together by timestamp. Pandas makes tasks like this very easy.
import pandas as pd
from io import StringIO
data = """\
nlxTTL, ttlState, nlxTime, ttlTime
0,0,1000,1000
1,1,2000,2000
0,1,3000,4000
1,1,4000,6000
0,0,5000,7000
1,1,6000,8000
0,0,7000,9000
1,1,8000,10000
"""
# Load data into dataframe.
df = pd.read_csv(StringIO(data))
# Remove spaces from column names.
df.columns = [x.strip() for x in df.columns]
# Split the data into an nlx dataframe and a ttl dataframe.
nlx = df[['nlxTTL', 'nlxTime']].reset_index()
ttl = df[['ttlState', 'ttlTime']].reset_index()
# Merge the dataframes back together based on their timestamps.
# Use an outer join so missing data gets filled with NaNs instead
# of just dropping the rows.
merged_df = nlx.merge(ttl, left_on='nlxTime', right_on='ttlTime', how='outer')
# Get back to the original set of columns
merged_df = merged_df[df.columns]
# Print out the results.
print(merged_df)
This produces the following output.
nlxTTL ttlState nlxTime ttlTime
0 0.0 0.0 1000.0 1000.0
1 1.0 1.0 2000.0 2000.0
2 0.0 NaN 3000.0 NaN
3 1.0 1.0 4000.0 4000.0
4 0.0 NaN 5000.0 NaN
5 1.0 1.0 6000.0 6000.0
6 0.0 0.0 7000.0 7000.0
7 1.0 1.0 8000.0 8000.0
8 NaN 0.0 NaN 9000.0
9 NaN 1.0 NaN 10000.0
You'll notice that it fills in the dropped values with NaN because we are doing an outer join. If this is undesirable, change the how='outer' parameter to how='inner' to perform an inner join, which will only keep records for which you have both an nlx and a ttl response at that timestamp.
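For instance, the inner-join variant (dropping timestamps that appear in only one recording) would just be:
# Keep only timestamps present in both recordings
merged_inner = nlx.merge(ttl, left_on='nlxTime', right_on='ttlTime', how='inner')
merged_inner = merged_inner[df.columns]
print(merged_inner)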