What causes the problem: csv, pandas or nltk? - python

I have a strange problem resulting in wrong output from NLTK collocations. In short, when I pass a pandas object created in a Python environment (PyCharm or Jupyter) to the function, I get the correct result. When I save this object to CSV and load it back into a pandas object, the function returns single letters and/or numbers instead of full words. There must be something wrong with the CSV round trip through pandas, but I have no idea what.
Here is the code. The function that is applied:
def counts(x):
    trigram_measures = nltk.collocations.BigramAssocMeasures()
    finder = BigramCollocationFinder.from_documents(x)
    finder.nbest(trigram_measures.pmi, 100)
    s = pd.Series(x)
    ngram_list = [pair for row in s for pair in ngrams(row, 3)]
    c = Counter(ngram_list).most_common(3)
    return pd.DataFrame([(x.name, ) + element for element in c], columns=['group', 'Ngram', 'Frequency'])
Here is the object:
d = {'words' : pd.Series((['coffee', 'maker', 'brewing', 'properly', '2', '420', '420', '420'],
                          ['galley', 'work', 'table', 'stuck'],
                          ['cloth', 'stuck'],
                          ['stuck', 'coffee'])),
     'group' : pd.Series([1, 2, 1, 2])}
df_cleaned = pd.DataFrame(d)
Then I apply the function from above plus some extra functions:
output = df_cleaned.groupby('group', as_index=False).words.apply(counts).reset_index(drop=True)
The result is correct:
But when the pandas object is saved and reloaded, the result is something like this:
Here is the code for saving and reloading:
df.to_csv('test_file.csv', index=False, sep=',')
df = pd.read_csv('path/test_file.csv', sep=',', usecols=['group', 'words'])
I found quotes in the reloaded pandas object, so I removed them before applying the function:
df = df.replace({'\'': ''}, regex=True)
output = df_cleaned.groupby('group', as_index=False).words.apply(counts).reset_index(drop=True)
Now it returns wrong results.
Do you have any suggestions on which way I should go?

I reproduced what you described in the following steps and I don't see any errors.
import pandas as pd
d = {'words' : pd.Series((['coffee', 'maker', 'brewing', 'properly', '2', '420', '420', '420'],
                          ['galley', 'work', 'table', 'stuck'],
                          ['cloth', 'stuck'],
                          ['stuck', 'coffee'])),
     'group' : pd.Series([1, 2, 1, 2])}
df_cleaned = pd.DataFrame(d)
df_cleaned
The function you're using is
import nltk
from nltk.util import ngrams
from nltk.collocations import *
from collections import Counter
def counts(x):
    trigram_measures = nltk.collocations.BigramAssocMeasures()
    finder = BigramCollocationFinder.from_documents(x)
    finder.nbest(trigram_measures.pmi, 100)
    s = pd.Series(x)
    ngram_list = [pair for row in s for pair in ngrams(row, 3)]
    c = Counter(ngram_list).most_common(3)
    return pd.DataFrame([(x.name, ) + element for element in c], columns=['group', 'Ngram', 'Frequency'])
You then apply counts to the data
output = df_cleaned.groupby('group', as_index=False).words.apply(counts).reset_index(drop=True)
and save the results to file
output.to_csv('test_file.csv', index=False, sep=',')
df = pd.read_csv('test_file.csv',sep=',')
I don't see any problems
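One thing worth checking on your side, though: if you round-trip the original df_cleaned (with its list-valued words column) through CSV, read_csv gives the words column back as plain strings like "['coffee', 'maker', ...]", not lists, so the n-gram code then iterates over characters. A minimal sketch of restoring the lists, assuming the saved column contains Python list literals:
import ast
import pandas as pd

df = pd.read_csv('test_file.csv', sep=',', usecols=['group', 'words'])
# read_csv returns strings; ast.literal_eval turns "['a', 'b']" back into a list
df['words'] = df['words'].apply(ast.literal_eval)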

Related

How can I exclude words from frequency word analysis in a list of articles in python?

I have a dataframe df with a column "Content" that contains a list of articles extracted from the internet. I already have the code for constructing a dataframe with the expected output (two columns, one for the word and the other for its frequency). However, I would like to exclude some words (connectors, for instance) from the analysis. Below you will find my code; what should I add to it?
Is it possible to use get_stop_words('fr') for more efficient use, since my articles are in French?
Source Code
import csv
from collections import Counter
from collections import defaultdict
import pandas as pd

df = pd.read_excel('C:/.../df_clean.xlsx', sheet_name='Articles Scraping')
df = df[df['Content'].notnull()]
d1 = dict()
for line in df[df.columns[6]]:
    words = line.split()
    # print(words)
    for word in words:
        if word in d1:
            d1[word] += 1
        else:
            d1[word] = 1
sort_words = sorted(d1.items(), key=lambda x: x[1], reverse=True)
There are a few ways you can achieve this. You can either use the isin() method with a list comprehension,
data = {'test': ['x', 'NaN', 'y', 'z', 'gamma',]}
df = pd.DataFrame(data)
words = ['x', 'y', 'NaN']
df = df[~df.test.isin([word for word in words])]
Or you can negate str.contains combined with a join:
df = df[~df.test.str.contains('|'.join(words))]
If you want to utilize the stop words package for French, you can also do that, but you must preprocess all of your texts before you start doing any frequency analysis.
french_stopwords = set(stopwords.stopwords("fr"))
STOPWORDS = list(french_stopwords)
STOPWORDS.extend(['add', 'new', 'words', 'here'])
I think the extend() will help you tremendously.
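To tie this back to the frequency count, here is a minimal sketch of excluding stop words while counting. It assumes the stop_words package that provides get_stop_words (as referenced in the question) and uses a tiny inline 'Content' column standing in for your real data; the extra excluded words are placeholders:
from collections import Counter
import pandas as pd
from stop_words import get_stop_words

# French stop words plus any extra connectors you want to drop (placeholders here)
STOPWORDS = set(get_stop_words('fr'))
STOPWORDS.update(['plus', 'ainsi'])

df = pd.DataFrame({'Content': ["le chat et le chien", "le chat dort"]})

counter = Counter()
for line in df['Content']:
    # count only words that are not in the stop word set
    counter.update(w for w in line.lower().split() if w not in STOPWORDS)

freq_df = pd.DataFrame(counter.most_common(), columns=['word', 'frequency'])
print(freq_df)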

pandas str.contains match exact substring not working with regex boundary

I have two dataframes, and am trying to find a way to match the exact substring from one dataframe to another.
First DataFrame:
import pandas as pd
import numpy as np
random_data = {'Place Name':['TS~HOT_MD~h_PB~progra_VV~gogl', 'FM~uiosv_PB~emo_SZ~1x1_TG~bhv'],
'Site':['DV360', 'Adikteev']}
dataframe = pd.DataFrame(random_data)
print(dataframe)
Second DataFrame
test_data = {'code name': ['PB', 'PB', 'PB'],
'Actual':['programmatic me', 'emoteev', 'programmatic-mechanics'],
'code':['progra', 'emo', 'prog']}
test_dataframe = pd.DataFrame(test_data)
Approach
for k, l, m in zip(test_dataframe.iloc[:, 0], test_dataframe.iloc[:, 1], test_dataframe.iloc[:, 2]):
    dataframe['Site'] = np.select([dataframe['Place Name'].str.contains(r'\b{}~{}\b'.format(k, m), regex=False)],
                                  [l], default=dataframe['Site'])
The current output is below; I am expecting an exact substring match, which the code above does not produce.
Current Output:
Place Name Site
TS~HOT_MD~h_PB~progra_VV~gogl programmatic-mechanics
FM~uiosv_PB~emo_SZ~1x1_TG~bhv emoteev
Expected Output:
Place Name Site
TS~HOT_MD~h_PB~progra_VV~gogl programmatic me
FM~uiosv_PB~emo_SZ~1x1_TG~bhv emoteev
Data
import pandas as pd
import numpy as np
random_data = {'Place Name':['TS~HOT_MD~h_PB~progra_VV~gogl',
'FM~uiosv_PB~emo_SZ~1x1_TG~bhv'], 'Site':['DV360', 'Adikteev']}
dataframe = pd.DataFrame(random_data)
test_data = {'code name': ['PB', 'PB', 'PB'], 'Actual':['programmatic me', 'emoteev', 'programmatic-mechanics'],
'code':['progra', 'emo', 'prog']}
test_dataframe = pd.DataFrame(test_data)
Map the test_dataframe code and Actual columns into a dictionary as keys and values respectively:
keys=test_dataframe['code'].values.tolist()
dicto=dict(zip(test_dataframe.code, test_dataframe.Actual))
dicto
Join the keys, separated by |, so the search matches any of the phrases:
k = '|'.join(r"{}".format(x) for x in dicto.keys())
k
Extract the substring from the dataframe matching any of the phrases in k and map it through the dictionary:
dataframe['Site'] = dataframe['Place Name'].str.extract('('+ k + ')', expand=False).map(dicto)
dataframe
Output
                      Place Name             Site
0  TS~HOT_MD~h_PB~progra_VV~gogl  programmatic me
1  FM~uiosv_PB~emo_SZ~1x1_TG~bhv          emoteev
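One detail worth noting about this approach: regex alternation is tried left to right, so if a shorter code were listed before a longer one sharing its prefix (here 'prog' before 'progra'), the shorter one would win. Sorting the keys longest first is a defensive tweak (a sketch, reusing the dicto built above):
# build the pattern with longer codes first so 'progra' is tried before 'prog'
k = '|'.join(sorted(dicto.keys(), key=len, reverse=True))
dataframe['Site'] = dataframe['Place Name'].str.extract('(' + k + ')', expand=False).map(dicto)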
Not the most elegant solution, but this does the trick.
Set up data
import pandas as pd
import numpy as np
random_data = {'Place Name':['TS~HOT_MD~h_PB~progra_VV~gogl',
'FM~uiosv_PB~emo_SZ~1x1_TG~bhv'], 'Site':['DV360', 'Adikteev']}
dataframe = pd.DataFrame(random_data)
test_data = {'code name': ['PB', 'PB', 'PB'], 'Actual':['programmatic me', 'emoteev', 'programmatic-mechanics'],
'code':['progra', 'emo', 'prog']}
test_dataframe = pd.DataFrame(test_data)
Solution
Create a column in test_dataframe with the substring to match:
test_dataframe['match_str'] = test_dataframe['code name'] + '~' + test_dataframe.code
print(test_dataframe)
code name Actual code match_str
0 PB programmatic me progra PB~progra
1 PB emoteev emo PB~emo
2 PB programmatic-mechanics prog PB~prog
Define a function to apply to test_dataframe:
def match_string(row, dataframe):
    ind = row.name
    try:
        if row[-1] in dataframe.loc[ind, 'Place Name']:
            return row[1]
        else:
            return dataframe.loc[ind, 'Site']
    except KeyError:
        # More rows in test_dataframe than there are in dataframe
        pass
# Apply match_string and assign back to dataframe
dataframe['Site'] = test_dataframe.apply(match_string, args=(dataframe,), axis=1)
Output:
Place Name Site
0 TS~HOT_MD~h_PB~progra_VV~gogl programmatic me
1 FM~uiosv_PB~emo_SZ~1x1_TG~bhv emoteev

Pandas read (Excel) cells, and return looked up values

A column in an Excel file shows the short forms of some descriptions. They have a one-to-one relationship in a dictionary.
I want to look each one up and write the looked-up values to a new file, side by side with the short forms.
xlrd and xlwt are basic, so I used them:
product_dict = {
    "082" : "Specified brand(s)",
    "035" : "Well known brand",
    "069" : "Brandless ",
    "054" : "Good middle class restaurant",
    "062" : "Modest class restaurant"}
import xlwt, xlrd

workbook = xlrd.open_workbook("C:\\file.xlsx")
old_sheet = workbook.sheet_by_index(0)
book = xlwt.Workbook(encoding='cp1252', style_compression = 0)
sheet = book.add_sheet('Sheet1', cell_overwrite_ok = True)
for row_index in range(1, old_sheet.nrows):
    new_list = []
    Cell_a = str(old_sheet.cell(row_index, 0).value)
    for each in Cell_a.split(", "):
        new_list.append(product_dict[each])
    sheet.write(row_index, 0, Cell_a)
    sheet.write(row_index, 1, "; ".join(new_list))
book.save("C:\\files-1.xls")
It works OK, but I want to learn the Pandas way to do the same.
What would the Pandas way look like, starting from the below? Thank you.
data = {'Code': ["082","069","054"]}
df = pd.DataFrame(data)
If you have a lookup dictionary in the form of a python dictionary, you can do this:
import pandas as pd
lookup_dict = {'1': 'item_1', '2':'item_2'}
# Create example dataframe
df_to_process = pd.DataFrame()
df_to_process['code'] = ['1, 2', '1', '2']
# Use .apply and lambda function to split 'code' and then do a lookup on each item
df_to_process['code_items'] = df_to_process['code'].apply(lambda x: '; '.join([lookup_dict[code] for code in x.replace(' ','').split(',')]))
With your examples:
import pandas as pd
product_dict = {
"082" : "Specified brand(s)",
"035" : "Well known brand",
"069" : "Brandless ",
"054" : "Good middle class restaurant",
"062" : "Modest class restaurant"}
data = {'Code': ["082","069","054"]}
df = pd.DataFrame(data)
df['code_items'] = df['Code'].apply(lambda x: '; '.join([product_dict[code] for code in x.replace(' ','').split(',')]))
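With the example data, df['code_items'] should come out as 'Specified brand(s)', 'Brandless ', and 'Good middle class restaurant' respectively, since each Code value here maps to a single dictionary entry; a multi-code cell like "082, 069" would be looked up item by item and joined with '; ', just as in the xlwt version.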
With the data given, I would first map the dictionary to a new column, then aggregate with ','.join:
final=df.assign(New=df.Code.map(product_dict)).agg(','.join).to_frame().T
Code New
0 082,069,054 Specified brand(s),Brandless ,Good middle clas...
Where:
print(df.assign(New=df.Code.map(product_dict)))
is:
Code New
0 082 Specified brand(s)
1 069 Brandless
2 054 Good middle class restaurant

savWriter writerows with Dataframe

I am trying to use the savReaderWriter library with Python. I have a dataframe which is read in via df = pd.read_csv. However, using the following piece of code, it won't write the rows to the file.
with savReaderWriter.SavWriter(savFileName, *args) as writer:
    writer.writerows(df)
I am getting the following error: TypeError: 'str' object does not support item assignment. Any help is greatly appreciated.
This is the sample on https://pythonhosted.org/savReaderWriter/
savFileName = 'someFile.sav'
records = [[b'Test1', 1, 1], [b'Test2', 2, 1]]
varNames = ['var1', 'v2', 'v3']
varTypes = {'var1': 5, 'v2': 0, 'v3': 0}
with savReaderWriter.SavWriter(savFileName, varNames, varTypes) as writer:
    for record in records:
        writer.writerow(record)
I think you can split your DataFrame into the three pieces the sample needs (records, varNames, varTypes).
By the way, you can use pandas methods to pull these out of the DataFrame:
import pandas as pd
sensor_values = pd.DataFrame([[1,'aaa','bbb'],[2,'ppp','xxx']], columns=['A','B','C'])
varNames=sensor_values.columns
records = sensor_values.values
varType = {key: 0 for x, key in enumerate(sensor_values.columns)}
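Putting the pieces together, here is a minimal sketch of writing the DataFrame with SavWriter. The byte widths for the string columns and the encoding of string values to bytes are assumptions on my part, not something savReaderWriter infers for you:
import pandas as pd
import savReaderWriter

sensor_values = pd.DataFrame([[1, 'aaa', 'bbb'], [2, 'ppp', 'xxx']], columns=['A', 'B', 'C'])

varNames = list(sensor_values.columns)
# 0 means numeric; string variables need a byte width (8 is an assumed width here)
varTypes = {'A': 0, 'B': 8, 'C': 8}
records = sensor_values.values.tolist()  # mixed dtypes -> plain Python ints and strs

with savReaderWriter.SavWriter('someFile.sav', varNames, varTypes) as writer:
    for record in records:
        # encode strings to bytes, as in the documented sample ([b'Test1', 1, 1])
        row = [v.encode('utf-8') if isinstance(v, str) else v for v in record]
        writer.writerow(row)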

How can I select data from a dask dataframe by a list of indices?

I want to select rows from a dask dataframe based on a list of indices. How can I do that?
Example:
Let's say, I have the following dask dataframe.
dict_ = {'A':[1,2,3,4,5,6,7], 'B':[2,3,4,5,6,7,8], 'index':['x1', 'a2', 'x3', 'c4', 'x5', 'y6', 'x7']}
pdf = pd.DataFrame(dict_)
pdf = pdf.set_index('index')
ddf = dask.dataframe.from_pandas(pdf, npartitions = 2)
Furthermore, I have a list of indices, that I am interested in, e.g.
indices_i_want_to_select = ['x1','x3', 'y6']
From this, I would like to generate a dask dataframe containing only the rows specified in indices_i_want_to_select
Edit: dask now supports loc on lists:
ddf_selected = ddf.loc[indices_i_want_to_select]
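A short usage note: the selection stays lazy until you call .compute(), which materializes it as a pandas DataFrame (a sketch using the example data above):
result = ddf.loc[indices_i_want_to_select].compute()
print(result)  # the rows for 'x1', 'x3' and 'y6' from the original frame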
The following should still work, but is not necessary anymore:
import pandas as pd
import dask.dataframe as dd
#generate example dataframe
pdf = pd.DataFrame(dict(A = [1,2,3,4,5], B = [6,7,8,9,0]), index=['i1', 'i2', 'i3', 4, 5])
ddf = dd.from_pandas(pdf, npartitions = 2)
#list of indices I want to select
l = ['i1', 4, 5]
#generate new dask dataframe containing only the specified indices
ddf_selected = ddf.map_partitions(lambda x: x[x.index.isin(l)], meta = ddf.dtypes)
Using dask version '1.2.0' results in an error due to the mixed index type.
In any case, there is an option to use loc.
import pandas as pd
import dask.dataframe as dd
#generate example dataframe
pdf = pd.DataFrame(dict(A = [1,2,3,4,5], B = [6,7,8,9,0]), index=['i1', 'i2', 'i3', '4', '5'])
ddf = dd.from_pandas(pdf, npartitions = 2,)
# #list of indices I want to select
l = ['i1', '4', '5']
# #generate new dask dataframe containing only the specified indices
# ddf_selected = ddf.map_partitions(lambda x: x[x.index.isin(l)], meta = ddf.dtypes)
ddf_selected = ddf.loc[l]
ddf_selected.head()
