Running a loop over documents in Python

I've got one folder on my desktop called 'companyfollowerstweets' which contains 91 folders, each called 'followerstweets(company name)', and each of those contains 200 CSV files holding the most recent Tweets of a follower of that company on Twitter. I'm performing sentiment analysis on the first 200 rows containing Tweets of each of the 200 followers per company; the results are added to one list, which eventually gives me one outcome per company: the percentage of negative and positive Tweets out of all 40,000 Tweets (200 Tweets for each of 200 followers). Hope this makes sense. Right now I have only managed to run a loop over the 200 CSV files of a single folder, where I manually enter the company's name each time. However, I want it to run over each of the 91 folders without me having to enter the company name. Here's my code:
import nltk
import csv
import sklearn
import string, numpy
from sklearn.feature_extraction.text import TfidfVectorizer
from collections import defaultdict
from sklearn.feature_extraction.text import CountVectorizer
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfTransformer
import math
import sentiment_mod as s
import glob
import itertools

columns = defaultdict(list)

lijst = glob.glob('companyfollowerstweets/followerstweetsCisco/*.csv')
tweets1 = []
sent1 = []
print(lijst[0])

for item in lijst:
    stopwords_set = set(stopwords.words("english"))
    with open(item, encoding='latin-1') as d:
        reader1 = csv.reader(d)
        next(reader1)
        for row in itertools.islice(reader1, 200):
            tweets1.extend([row[2]])

words_cleaned = [" ".join([words for words in sentence.split() if 'http' not in words and not words.startswith('#')]) for sentence in tweets1]
words_filtered = [e.lower() for e in words_cleaned]
words_without_stopwords = [word for word in words_filtered if not word in stopwords_set]
tweets1 = words_without_stopwords
tweets1 = list(filter(None, tweets1))

for d in tweets1:
    new1 = s.sentiment(d)
    sent1.extend(new1)

total1 = len(sent1) / 2
neg_percentage1 = (sent1.count("neg") / total1) * 100
pos_percentage1 = (sent1.count("pos") / total1) * 100
res = sum(sent1[1::2]) / total1
low = min(sent1[1::2])
high = max(sent1[1::2])

print("% of negative Tweets:", neg_percentage1)
print("% of positive Tweets:", pos_percentage1)
print("Total number of Tweets:", total1)
print("Average confidence:", res)
print("min confidence:", low)
print("max confidence:", high)
This specific example is for the company 'Cisco', as you can see. How do I make this code run over every one of the 91 folders like this one?

os.walk() on your current directory (os.getcwd()) is what you want. That will recursively iterate over everything in your current working directory.

You can use a nested glob like this:
from glob import glob
[glob(i+'/*.csv') for i in glob('companyfollowerstweets/followerstweets*')]
This will return a list of lists (a list of CSV file paths for each company).
Note that this won't have any particular order.
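If you also want the company name rather than just the paths, one option — a small, untested sketch that assumes the folder names keep the 'followerstweets(company name)' pattern from the question — is to derive it from each folder path:
import os
from glob import glob

for folder in sorted(glob('companyfollowerstweets/followerstweets*')):
    # e.g. 'followerstweetsCisco' -> 'Cisco'
    company = os.path.basename(folder).replace('followerstweets', '', 1)
    csv_files = glob(os.path.join(folder, '*.csv'))
    print(company, len(csv_files))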

You can either do two loops or use os.walk; the latter is cleaner:
import os

company_results = {}
for root, dirs, files in os.walk('companyfollowerstweets'):
    if len(dirs) == 0:
        # `files` holds bare file names; join each with `root` (os.path.join) before opening
        results = do_analysis(files)
        company_results[root] = results
I suggest putting all your analysis in a function; it makes for much cleaner code. Then you can get a dictionary of all the results with the above code.
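For example, a rough, untested sketch of such a do_analysis() function, assembled from the per-company part of the code in the question. It assumes, as there, that s.sentiment() returns a (label, confidence) pair and that the Tweet text is in the third CSV column:
import os
import csv
import itertools
import sentiment_mod as s
from nltk.corpus import stopwords

stopwords_set = set(stopwords.words("english"))

def do_analysis(csv_paths):
    # gather up to 200 Tweets from each follower file
    tweets = []
    for path in csv_paths:
        with open(path, encoding='latin-1') as f:
            reader = csv.reader(f)
            next(reader)  # skip the header row
            for row in itertools.islice(reader, 200):
                tweets.append(row[2])
    # same cleaning as in the question: drop links and hashtags, lowercase, drop stopwords/empties
    cleaned = [" ".join(w for w in t.split() if 'http' not in w and not w.startswith('#')).lower()
               for t in tweets]
    cleaned = [t for t in cleaned if t and t not in stopwords_set]
    sent = []
    for t in cleaned:
        sent.extend(s.sentiment(t))  # assumed to return a (label, confidence) pair
    total = len(sent) / 2
    return {
        'neg %': sent.count("neg") / total * 100,
        'pos %': sent.count("pos") / total * 100,
        'total': total,
    }

company_results = {}
for root, dirs, files in os.walk('companyfollowerstweets'):
    if not dirs:  # a leaf folder, i.e. one company
        paths = [os.path.join(root, name) for name in files if name.endswith('.csv')]
        company_results[root] = do_analysis(paths)
The keys of company_results are the folder paths, so the company name can be read off with os.path.basename(root) as in the earlier sketch.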

Related

how to iterate pandas frame from csv and apply text summarisation

I'm working on extractive text summarization of long text data. I have multiple users' text data in the input CSV file, but the current code appends all of the text column data into one list of sentences and then applies the logic. How do I apply the code to each row instead of merging all the column values? Any help will be appreciated.
Input.csv (^ delimited)
uid^name^text
36d73f013aa7^Don Howard^The Irvine Foundation has entered into a partnership with College Futures Foundation that starts a new chapter in our support of postsecondary success in California.To achieve Irvine’s singular goal.
36d73f013aa8^Simon Haris^That’s why we have long provided funding to expand postsecondary success. Now with our focus on low-wage workers, we have decided to split our postsecondary funding into two parts:. Strengthening and expanding work-ready credentialing programs (which we will do directly, primarily as part of our Better Careers initiative).
36d73f013aa8^David^Accelerating and expanding the attainment of bachelor’s degrees (which we will fund through our partnership with College Futures). We believe that College Futures is in a stronger position than we are to make grants to support improvements in how the CSUs and the California Community Colleges can better serve students.
Pseudo code:
for each record:
    apply the logic below to the text column to get its summary
Code: text summarization code (a per-row sketch is shown after the expected output below)
import numpy as np
import pandas as pd
import nltk
from nltk.tokenize import sent_tokenize
nltk.download('punkt')  # one time execution
from nltk.corpus import stopwords
from sklearn.metrics.pairwise import cosine_similarity  # needed for the similarity matrix below
import re
# Read the CSV file
import io
df = pd.read_csv('/home/sshuser/textsummerisation/input.csv', sep='^')

# split the text in the articles into sentences
sentences = []
for s in df['text']:
    sentences.append(sent_tokenize(s))
# flatten the list
sentences = [y for x in sentences for y in x]

# remove punctuation, numbers and special characters
clean_sentences = pd.Series(sentences).str.replace("[^a-zA-Z]", " ")
# make all letters lowercase
clean_sentences = [s.lower() for s in clean_sentences]

nltk.download('stopwords')  # one time execution
stop_words = stopwords.words('english')

def remove_stopwords(sen):
    sen_new = " ".join([i for i in sen if i not in stop_words])
    return sen_new

# remove stopwords from the sentences
clean_sentences = [remove_stopwords(r.split()) for r in clean_sentences]

# Extract word vectors
word_embeddings = {}
fopen = open('/home/sshuser/textsummerisation/glove.6B.100d.txt', encoding='utf-8')
for line in fopen:
    values = line.split()
    word = values[0]
    print(values)
    print(word)
    coefs = np.asarray(values[1:], dtype='float32')
    word_embeddings[word] = coefs
fopen.close()

sentence_vectors = []
for i in clean_sentences:
    if len(i) != 0:
        v = sum([word_embeddings.get(w, np.zeros((100,))) for w in i.split()]) / (len(i.split()) + 0.001)
    else:
        v = np.zeros((100,))
    sentence_vectors.append(v)
len(sentence_vectors)

# similarity matrix
sim_mat = np.zeros([len(sentences), len(sentences)])
for i in range(len(sentences)):
    for j in range(len(sentences)):
        if i != j:
            sim_mat[i][j] = cosine_similarity(sentence_vectors[i].reshape(1, 100), sentence_vectors[j].reshape(1, 100))[0, 0]

import networkx as nx
nx_graph = nx.from_numpy_array(sim_mat)
scores = nx.pagerank(nx_graph)
ranked_sentences = sorted(((scores[i], s) for i, s in enumerate(sentences)), reverse=True)

# Specify number of sentences to form the summary
sn = 10
# Generate summary
for i in range(sn):
    print(ranked_sentences[i][1])
Expected output: the output of the above code should go into a summary column for each record:
uid^name^text^summary
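One way to apply the logic per row — a rough, untested sketch that reuses remove_stopwords, word_embeddings, and the imports defined above, with the output path only a placeholder — is to wrap the pipeline in a function and apply it with DataFrame.apply:
from sklearn.metrics.pairwise import cosine_similarity

def summarize(text, sn=10):
    sentences = sent_tokenize(text)
    if len(sentences) <= sn:
        return " ".join(sentences)  # too short to shorten further
    clean = pd.Series(sentences).str.replace("[^a-zA-Z]", " ").str.lower()
    clean = [remove_stopwords(s.split()) for s in clean]
    vectors = [sum(word_embeddings.get(w, np.zeros((100,))) for w in s.split()) / (len(s.split()) + 0.001)
               if s else np.zeros((100,))
               for s in clean]
    sim_mat = np.zeros([len(sentences), len(sentences)])
    for i in range(len(sentences)):
        for j in range(len(sentences)):
            if i != j:
                sim_mat[i][j] = cosine_similarity(vectors[i].reshape(1, 100), vectors[j].reshape(1, 100))[0, 0]
    scores = nx.pagerank(nx.from_numpy_array(sim_mat))
    ranked = sorted(((scores[i], s) for i, s in enumerate(sentences)), reverse=True)
    return " ".join(s for _, s in ranked[:sn])

# one summary per record instead of one for the whole file
df['summary'] = df['text'].apply(summarize)
df.to_csv('/home/sshuser/textsummerisation/output.csv', sep='^', index=False)  # placeholder path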

Removing all punctuation from string in dataframe

This is officially doing my head in. I am web scraping a collection of tweets for text analysis. The tweets have been scraped and put into a dataframe, where each row is a string containing the entire tweet. I can't for the life of me remove quotation marks or apostrophes, but removing all other punctuation is OK.
What I am trying to do is extract just the verbs, nouns and adjectives from each of the scraped tweets, which I have done, but anything in quotation marks is excluded.
The code that I have been using so far is below, but I can't for the life of me add quotation marks or apostrophes to the pattern. I have also tried every other method I can find on this site, but it either does nothing or produces errors.
tweets['Text_processed'] = tweets['Text'].map(lambda x: re.sub('[,\##.!?]', "", x))
The entire code base up until this point is:
import GetOldTweets3 as got
import pandas as pd
import re
from wordcloud import WordCloud  # Join the different processed titles together.
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.simplefilter("ignore", DeprecationWarning)
# Load the LDA model from sk-learn
from sklearn.decomposition import LatentDirichletAllocation as LDA
import os
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
import plotly as py
import plotly.graph_objs as go
import gensim
from gensim import corpora, models, similarities
import logging
import tempfile
from nltk.corpus import stopwords
from string import punctuation
from collections import OrderedDict
import pyLDAvis.gensim
%matplotlib inline
init_notebook_mode(connected=True)  # do not miss this line
warnings.filterwarnings("ignore")

# Function that pulls tweets based on a general search query and turns them into a csv file
# Parameters: (text query you want to search), (max number of most recent tweets to pull from)
def text_query_to_csv(text_query, count):
    # Creation of query object
    tweetCriteria = got.manager.TweetCriteria().setQuerySearch(text_query)\
                                               .setMaxTweets(count)
    # Creation of list that contains all tweets
    tweets = got.manager.TweetManager.getTweets(tweetCriteria)
    # Creating list of chosen tweet data
    text_tweets = [[tweet.date, tweet.text] for tweet in tweets]
    # Creation of dataframe from tweets
    tweets_df = pd.DataFrame(text_tweets, columns=['Datetime', 'Text'])
    # Converting tweets dataframe to csv file
    tweets_df.to_csv('{}-{}-tweets.csv'.format(text_query, int(count)), sep=',')

############################################
# Search word and number of tweets to scrape
############################################
text_query = '#barackobama'
count = 5
# Calling function to query X amount of relevant tweets and create a CSV file
text_query_to_csv(text_query, count)

filename = '#barackobama-5-tweets.csv'
tweets = pd.read_csv(filename)
# Convert tweets to strings and lower case
tweets['Text'] = tweets['Text'].astype(str)
tweets['Text'] = tweets['Text'].map(lambda x: x.lower())
tweets
This is the offending bit of code below...
# remove punctuation
tweets['Text_processed'] = tweets['Text'].map(lambda x: re.sub('[,\##.!?]', "", x))
tweets['Text_processed'].head()
#####################################
# Extract nouns, verbs and adjectives
#####################################
import nltk
from nltk import word_tokenize, pos_tag
from nltk.corpus import wordnet
from IPython.display import display
lemmatizer = nltk.WordNetLemmatizer()
def leaves(tree):
    """Finds NP (nounphrase) leaf nodes of a chunk tree."""
    for subtree in tree.subtrees(filter=lambda t: t.label() == 'NP'):
        yield subtree.leaves()

def get_word_postag(word):
    if pos_tag([word])[0][1].startswith('J'):
        return wordnet.ADJ
    if pos_tag([word])[0][1].startswith('V'):
        return wordnet.VERB
    if pos_tag([word])[0][1].startswith('N'):
        return wordnet.NOUN
    else:
        return wordnet.NOUN

def normalise(word):
    """Normalises words to lowercase and stems and lemmatizes it."""
    word = word.lower()
    postag = get_word_postag(word)
    word = lemmatizer.lemmatize(word, postag)
    return word

def get_terms(tree):
    for leaf in leaves(tree):
        terms = [normalise(w) for w, t in leaf]
        yield terms

tidied_tweets = []
for t in tweets['Text']:
    # word tokenizing and part-of-speech tagging
    document = t
    tokens = [nltk.word_tokenize(sent) for sent in [document]]
    postag = [nltk.pos_tag(sent) for sent in tokens][0]
    # Rule for NP chunk and VB chunk
    grammar = r"""
        NBAR:
            {<NN.*|JJ>*<NN.*>}  # Nouns and Adjectives, terminated with Nouns
            {<RB.?>*<VB.?>*<JJ>*<VB.?>+<VB>?}  # Verbs and Verb Phrases
        NP:
            {<NBAR>}
            {<NBAR><IN><NBAR>}  # Above, connected with in/of/etc...
    """
    # Chunking
    cp = nltk.RegexpParser(grammar)
    # the result is a tree
    tree = cp.parse(postag)
    terms = get_terms(tree)
    features = []
    for term in terms:
        _term = ''
        for word in term:
            _term += ' ' + word
        features.append(_term.strip())
    tidied_tweets.append(features)
tidied_tweets
The code base I have after this works OK, but the inability to remove quoted text or anything with an apostrophe is causing real problems.
EDITED TO ADD
I've managed to solve the problem, but in doing so, created another. The latest bit of code to extract the words sans the punctuation is:
tweet_list = []
ind_tweet = []
for tweets in tidied_tweets:
    for words in tweets:
        a = re.findall(r"[\w']+", words)
        ind_tweet.append(a)
    tweet_list.append(ind_tweet)
re.findall(r"[\w']+", words) does the job of extracting the word, but I can't create the same structured list I started with. What I wanted is 'tweet_list' to act as the parent list, and 'ind_tweet' to act as a sucession of child lists (nested). When I print out the result of the code above, I'm not able to create the nested list I am looking for. ind_tweet produces the output but in a single list with no nesting, and tweet_list duplicates ind_tweet. It probably isn't helping that it's 2:30am on a Saturday, but this should be much easier than I am making it...
And the answer is...
tweet_list = [[] for i in range(len(tidied_tweets))]
for tweets, t in zip(tidied_tweets, range(len(tidied_tweets))):
    for words in tweets:
        a = re.findall(r"[\w']+", words)
        tweet_list[t].append(a)
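As an alternative, a small untested sketch: rather than extracting words afterwards, the quotation marks and apostrophes can be stripped up front by extending the re.sub character class (the curly-quote characters included here are an assumption about what the scraped text contains):
import re

def strip_punct(text):
    # straight and curly quotes/apostrophes plus the punctuation from the original pattern
    return re.sub(r"[,\#.!?\"'\u2018\u2019\u201c\u201d]", "", text)

tweets['Text_processed'] = tweets['Text'].map(strip_punct)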

Adding new columns to a CSV file with values from different dictionary comprehensions

Below is my code. I would like to write new columns into my original CSV; the columns are supposed to contain the values of each dictionary created during my code, and for the last dictionary, since it contains 3 values, I would like each value to be inserted into its own column. The code that writes to the CSV is at the end, but maybe there is a way to write the values each time I produce a new dictionary.
My code for the CSV part: I cannot figure out how to add the columns without deleting the content of the original file.
# -*- coding: UTF-8 -*-
import codecs
import re
import os
import sys, argparse
import subprocess
import pprint
import csv
from itertools import islice
import pickle
import nltk
from nltk import tokenize
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
import pandas as pd
try:
    import treetaggerwrapper
    from treetaggerwrapper import TreeTagger, make_tags
    print("import TreeTagger OK")
except:
    print("Import TreeTagger pas Ok")  # i.e. "TreeTagger import not OK"
from collections import defaultdict

# load the sentiment lexicon
pickle_in = open("dict_pickle", "rb")
dico_lexique = pickle.load(pickle_in)

# extract the verbatim column
d_verbatim = {}
with open(sys.argv[1], 'r', encoding='cp1252') as csv_file:
    csv_file.readline()
    for line in csv_file:
        token = line.split(';')
        try:
            d_verbatim[token[0]] = token[1]
        except:
            print(line)
#print(d_verbatim)

# Using treetagger
tagger = treetaggerwrapper.TreeTagger(TAGLANG='fr')
d_tag = {}
for key, val in d_verbatim.items():
    newvalues = tagger.tag_text(val)
    d_tag[key] = newvalues
#print(d_tag)

# lemmatisation
d_lemma = defaultdict(list)
for k, v in d_tag.items():
    for p in v:
        parts = p.split('\t')
        try:
            if parts[2] == '':
                d_lemma[k].append(parts[0])
            else:
                d_lemma[k].append(parts[2])
        except:
            print(parts)
#print(d_lemma)

stopWords = set(stopwords.words('french'))
d_filtered_words = {k: [w for w in l if w not in stopWords and w.isalpha()] for k, l in d_lemma.items()}
print(d_filtered_words)

d_score = {k: [0, 0, 0] for k in d_filtered_words.keys()}
for k, v in d_filtered_words.items():
    for word in v:
        if word in dico_lexique:
            if word  # the question's code is cut off here (see the answer below)
                print(word, dico_lexique[word])
Your edit seemed to make things worse; you've ended up deleting a lot of relevant context. I think I've pieced together what you are trying to do: the core of it seems to be a routine that performs sentiment analysis on text.
I'd start by creating a class that keeps track of this, e.g:
class Sentiment:
    __slots__ = ('positive', 'neutral', 'negative')

    def __init__(self, positive=0, neutral=0, negative=0):
        self.positive = positive
        self.neutral = neutral
        self.negative = negative

    def __repr__(self):
        return f'<Sentiment {self.positive} {self.neutral} {self.negative}>'

    def __add__(self, other):
        return Sentiment(
            self.positive + other.positive,
            self.neutral + other.neutral,
            self.negative + other.negative,
        )
This will allow you to replace convoluted bits of code like [a + b for a, b in zip(map(int, dico_lexique[word]), d_score[k])] with score += sentiment in the function below, and lets us refer to the various values by name.
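For instance, a quick illustration (with made-up numbers) of how the class composes:
a = Sentiment(positive=2, neutral=1, negative=0)
b = Sentiment(positive=0, neutral=1, negative=3)
print(a + b)  # <Sentiment 2 2 3>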
I'd then suggest preprocessing your pickled data so you don't have to convert things to ints in the middle of unrelated code, e.g.:
with open("dict_pickle", "rb") as fd:
    dico_lexique = {}
    # if the pickle holds a dict, iterate over pickle.load(fd).items() instead
    for word, (pos, neu, neg) in pickle.load(fd):
        dico_lexique[word] = Sentiment(int(pos), int(neu), int(neg))
This puts them directly into the above class and seems to match up with other constraints in your code, but I don't have your data, so I can't check.
After pulling apart all your comprehensions and loops, we are left with a single, tidy routine for processing one piece of text:
def process_text(text):
    """process the specified text

    returns (words, filtered words, total sentiment score)
    """
    words = []
    filtered = []
    score = Sentiment()
    for tag in make_tags(tagger.tag_text(text)):
        word = tag.lemma
        words.append(word)
        if word not in stopWords and word.isalpha():
            filtered.append(word)
            sentiment = dico_lexique.get(word)
            if sentiment is not None:
                score += sentiment
    return words, filtered, score
We can then put this into a loop that reads lines from the input and sends them to an output file:
filename = sys.argv[1]
tempname = filename + '~'
with open(filename) as fdin, open(tempname, 'w') as fdout:
    inp = csv.reader(fdin, delimiter=';')
    out = csv.writer(fdout, delimiter=';')
    # get the header, and blindly append our column names
    header = next(inp)
    out.writerow(header + [
        'd_lemma', 'd_filtered_words', 'Positive Score', 'Neutral Score', 'Negative Score',
    ])
    for row in inp:
        # assume that the second item contains the text we want to process
        words, filtered, score = process_text(row[1])
        extra_values = [
            words, filtered,
            score.positive, score.neutral, score.negative,
        ]
        # add the values and write out
        assert len(row) == len(header), "code needed to pad the columns out"
        out.writerow(row + extra_values)
# only replace the input if everything succeeds
os.rename(tempname, filename)
We write out to a different file and only rename on success; this means that if the code crashes, it won't leave partially written files around. I'd discourage working like this though, and tend to make my scripts read from stdin and write to stdout. That way I can run:
$ python script.py < input.csv > output.csv
when all is OK, but it also lets me run:
$ head input.csv | python script.py
if I just want to test with the first few lines of input, or:
$ python script.py < input.csv | less
if I want to check the output as it's generated.
Note that none of this code has been run, so there are probably bugs in it, but at least I can actually see what the code is trying to do like this. Comprehensions and 'functional'-style code are great, but they can easily become unreadable if you're not careful.

Python tfidf returning same values regardless of idf

I am trying to build a small program that calculates tf-idf in Python. There are two very nice tutorials which I have used (I have code from here and another function from Kaggle).
import nltk
import string
import os
from bs4 import *
import re
from nltk.corpus import stopwords  # Import the stop word list
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.stem.porter import PorterStemmer

path = 'my/path'
token_dict = {}
stemmer = PorterStemmer()

def stem_tokens(tokens, stemmer):
    stemmed = []
    for item in tokens:
        stemmed.append(stemmer.stem(item))
    return stemmed

def tokenize(text):
    tokens = nltk.word_tokenize(text)
    stems = stem_tokens(tokens, stemmer)
    return stems

def review_to_words(raw_review):
    # 1. Remove HTML
    review_text = BeautifulSoup(raw_review).get_text()
    # 2. Remove non-letters
    letters_only = re.sub("[^a-zA-Z]", " ", review_text)
    # 3. Convert to lower case, split into individual words
    words = letters_only.lower().split()
    # 4. In Python, searching a set is much faster than searching
    #    a list, so convert the stop words to a set
    stops = set(stopwords.words("english"))
    # 5. Remove stop words
    meaningful_words = [w for w in words if not w in stops]
    # 6. Join the words back into one string separated by space,
    #    and return the result.
    return " ".join(meaningful_words)

for subdir, dirs, files in os.walk(path):
    for file in files:
        file_path = subdir + os.path.sep + file
        shakes = open(file_path, 'r')
        text = shakes.read()
        token_dict[file] = review_to_words(text)

tfidf = TfidfVectorizer(tokenizer=tokenize, stop_words='english')
tfs = tfidf.fit_transform(token_dict.values())

str = 'this sentence has unseen text such as computer but also king lord lord this this and that lord juliet'  # test string
response = tfidf.transform([str])
feature_names = tfidf.get_feature_names()
for col in response.nonzero()[1]:
    print feature_names[col], ' - ', response[0, col]
The code seems to work fine, but then I have a look at the results:
thi - 0.612372435696
text - 0.204124145232
sentenc - 0.204124145232
lord - 0.612372435696
king - 0.204124145232
juliet - 0.204124145232
ha - 0.204124145232
comput - 0.204124145232
The IDFs seem to be the same for all the words because the TF-IDFs are just n * 0.204. I have checked with tfidf.idf_ and this seems to be the case.
Is there something in the method that I have not implemented correctly?
Do you know why the idf_s are the same?
Since you provided a list containing 1 document, all terms' IDFs will have an equal 'binary frequency'.
IDF is the inverted term frequency over the set of documents (or just inverse document frequency). Most if not all IDF formulas only check for term presence in a document, so it does not matter how many times it appears per document.
Try feeding a list with 3 distinct documents, for instance; this way the IDFs will not be the same.
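A minimal sketch of that suggestion (the three documents are made up for illustration; it uses the same get_feature_names() call as the question, which newer scikit-learn versions rename to get_feature_names_out()):
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    'the king and the lord met juliet',
    'the computer parsed some unseen text',
    'lord juliet typed this sentence on a computer',
]
tfidf = TfidfVectorizer(stop_words='english')
tfidf.fit(docs)
# terms appearing in only one document get a higher idf than terms appearing in several
for term, idf in zip(tfidf.get_feature_names(), tfidf.idf_):
    print(term, idf)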
The inverse document frequency of a term t is calculated as idf_t = log(N / df_t), where N is the total number of documents and df_t is the number of documents in which the term t appears.
In this case, your program has one document (the str variable).
Therefore, both N and df_t equal 1.
As a result, the IDF is the same for all terms.

Finding document frequency using Python

Hey everyone, I know that this has been asked a couple of times here already, but I am having a hard time finding document frequency using Python. I am trying to compute TF-IDF and then the cosine scores between the documents and a query, but am stuck at finding document frequency. This is what I have so far:
# includes
import re
import os
import operator
import glob
import sys
import math
from collections import Counter

# number of command line arguments checker
if len(sys.argv) != 3:
    print 'usage: ./part3_soln2.py "path to folder in quotation marks" query.txt'
    sys.exit(1)

# Read in the directory of the files
path = sys.argv[1]

# Read in the query
y = sys.argv[2]
querystart = re.findall(r'\w+', open(y).read().lower())
query = [Z for Z in querystart]
Query_vec = Counter(query)
print Query_vec

# counts total number of documents in the directory
doccounter = len(glob.glob1(path, "*.txt"))

if os.path.exists(path) and os.path.isfile(y):
    word_TF = []
    word_IDF = {}
    TFvec = []
    IDFvec = []

    # this is my attempt at finding IDF
    for filename in glob.glob(os.path.join(path, '*.txt')):
        words_IDF = re.findall(r'\w+', open(filename).read().lower())
        doc_IDF = [A for A in words_IDF if len(A) >= 3 and A.isalpha()]
        word_IDF = doc_IDF
        # pseudocode!!
        """
        for key in word_idf:
            if key in word_idf:
                word_idf[key] += 1
            else:
                word_idf[key] = 1
        print word_IDF
        """

    # goes to that directory and reads in the files there
    for filename in glob.glob(os.path.join(path, '*.txt')):
        words_TF = re.findall(r'\w+', open(filename).read().lower())
        # scans each document for words greater or equal to 3 in length
        doc_TF = [A for A in words_TF if len(A) >= 3 and A.isalpha()]
        # this assigns values to each term; this is my TF for each vector
        TFvec = Counter(doc_TF)
        # weighting the TF with a log function
        for key in TFvec:
            TFvec[key] = 1 + math.log10(TFvec[key])
        # placed here so I don't get a command line full of text
        print TFvec
# Error checker
else:
    print "That path does not exist"
I am using Python 2, and so far I don't really have any idea how to count how many documents a term appears in. I can find the total number of documents, but I am really stuck on finding the number of documents a term appears in. I was just going to create one large dictionary that held all of the terms from all of the documents, which could be fetched later when a query needed those terms. Thank you for any help you can give me.
DF for a term x is the number of documents in which x appears. In order to find that, you need to iterate over all documents first. Only then can you compute IDF from DF.
You can use a dictionary for counting DF:
1. Iterate over all documents.
2. For each document, retrieve the set of its words (without repetitions).
3. Increase the DF count for each word from stage 2. This way you increase the count by exactly one, regardless of how many times the word appears in the document.
Python code could look like this:
from collections import defaultdict
import math
DF = defaultdict(int)
for filename in glob.glob(os.path.join(path, '*.txt')):
words = re.findall(r'\w+', open(filename).read().lower())
for word in set(words):
if len(word) >= 3 and word.isalpha():
DF[word] += 1 # defaultdict simplifies your "if key in word_idf: ..." part.
# Now you can compute IDF.
IDF = dict()
for word in DF:
IDF[word] = math.log(doccounter / float(DF[word])) # Don't forget that python2 uses integer division.
PS: It's good for learning to implement things manually, but if you ever get stuck, I suggest you look at the NLTK package. It provides useful functions for working with corpora (collections of texts).
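To tie this back to the TF weighting in the question, here is a small untested sketch (reusing path, glob, re, math, and the IDF dictionary computed above) that builds a tf-idf vector per document:
from collections import Counter

tfidf_per_doc = {}
for filename in glob.glob(os.path.join(path, '*.txt')):
    words = re.findall(r'\w+', open(filename).read().lower())
    terms = [w for w in words if len(w) >= 3 and w.isalpha()]
    tf = Counter(terms)
    # log-weighted TF (as in the question) multiplied by the IDF computed above
    tfidf_per_doc[filename] = dict(
        (term, (1 + math.log10(count)) * IDF[term]) for term, count in tf.items()
    )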
