Please don't flag my question instantaneously; I searched several other questions, like this, and they didn't solve my problem.
I'm trying to generate a Python set of strings from a CSV file. The printed pandas dataframe of the loaded CSV file has the following structure:
     0
0   me
1  yes
2   it
For a project, I need this to be formatted like this:
STOPWORDS = {'me', 'yes', 'it'}
I tried to do this with the following code:
import pandas as pd
df_stopwords = pd.read_csv("C:/Users/Jakob/stopwords.csv", encoding = 'iso8859-15', header=-1)
STOPWORDS = {}
for index, row in df_stopwords.iterrows():
    STOPWORDS.update(str(row))
print(STOPWORDS)
However, I get this error:
dictionary update sequence element #0 has length 1; 2 is required
When I use STOPWORDS.add(str(row)) instead, I get this error:
'dict' object has no attribute 'add'
Thank you all in advance!
You can directly create a set from the values in the dataframe with:
set(df.values.ravel())
{'me', 'yes', 'it'}
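For completeness, a self-contained sketch of that approach, using a small stand-in frame instead of the real CSV:

```python
import pandas as pd

# A small single-column frame standing in for the loaded CSV
df = pd.DataFrame({0: ['me', 'yes', 'it']})

# .values is a 2-D numpy array; ravel() flattens it and set() deduplicates
STOPWORDS = set(df.values.ravel())
print(STOPWORDS == {'me', 'yes', 'it'})  # True
```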
A dictionary is a mapping of keys to values, like an object in many other languages. Since you need a set, define it as a set from the start instead of creating a dict and converting it later.
import pandas as pd

df_stopwords = pd.read_csv("C:/Users/Jakob/stopwords.csv", encoding='iso8859-15', header=None)  # header=None rather than header=-1
STOPWORDS = set()
for index, row in df_stopwords.iterrows():
    STOPWORDS.add(row.iloc[0])  # add the cell value; str(row) would add the whole Series repr
print(STOPWORDS)
It looks like you need to convert the values in your column to a list and then use that list as your stop words. With no header row, read_csv labels the column 0 (an integer), not '0':
stopwords = df_stopwords[0].tolist()
--> ['me', 'yes', 'it']
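Since the goal is a set like STOPWORDS = {'me', 'yes', 'it'}, you can wrap that list in set(); a small sketch, using a stand-in frame and the integer column label 0 that read_csv assigns when there is no header:

```python
import pandas as pd

# Stand-in for df_stopwords as loaded with no header row: the column label is 0
df_stopwords = pd.DataFrame({0: ['me', 'yes', 'it']})

STOPWORDS = set(df_stopwords[0].tolist())
print(STOPWORDS == {'me', 'yes', 'it'})  # True
```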
As mentioned in the accepted answer here, you might want to use itertuples(), since it is faster than iterrows().
STOPWORDS = set()
for index, row in df_stopwords.itertuples():
    STOPWORDS.add(row)
print(STOPWORDS)
I have a spreadsheet with fields containing a body of text.
I want to calculate the Gunning fog score on each row and have the value output to that same Excel file as a new column. To do that, I first need to calculate the score for each row. The code below works if I hard-code the text into the df variable. However, it does not work when I define the field in the sheet (i.e., rfds) and pass that through to my r variable. I get the following error, even though the two fields I am testing contain 3,896 and 4,843 words respectively.
readability.exceptions.ReadabilityException: 100 words required.
Am I missing something obvious? Disclaimer, I am very new to python and coding in general! Any help is appreciated.
from readability import Readability
import pandas as pd
df = pd.read_excel(r"C:/Users/name/edgar/test/item1a_sandbox.xls")
rfd = df["Item 1A"]
rfds = rfd.to_string() # to fix "TypeError: expected string or buffer"
r = Readability(rfds)
fog = r.gunning_fog()
print(fog.score)
TL;DR: You need to pass the cell value and are currently passing a column of cells.
The line rfd = df["Item 1A"] returns a reference to a whole column, not a single cell. rfd.to_string() then renders that column (index labels included) into one string, so Readability is not run against the text you expect. The earlier TypeError was thrown because the column reference itself is not a string.
Rather than taking a column and going down it, approach it from the other direction. Take the rows and then pull out the column:
for index, row in df.iterrows():
    print(row.iloc[2])
The [2] is the column index.
Now that we have a single cell's value, it can be passed to the Readability calculator:
r = Readability(row.iloc[2])
fog = r.gunning_fog()
print(fog.score)
Note that these can be combined together into one command:
print(Readability(row.iloc[2]).gunning_fog())
This shows how commands can be chained together; which style you find easier is up to you. Chaining is useful when you pass the expression to something like apply or applymap.
Putting the whole thing together (the step by step way):
from readability import Readability
import pandas as pd
df = pd.read_excel(r"C:/Users/name/edgar/test/item1a_sandbox.xls")
for index, row in df.iterrows():
    r = Readability(row.iloc[2])
    fog = r.gunning_fog()
    print(fog.score)
Or the clever way:
from readability import Readability
import pandas as pd
df = pd.read_excel(r"C:/Users/name/edgar/test/item1a_sandbox.xls")
print(df["Item 1A"].apply(lambda x: Readability(x).gunning_fog()))
I'm trying to replace every occurrence of an empty list [] in my script output with an empty cell value, but am struggling with identifying what object it is.
So the data output after running .to_excel looks like:
The data originally exists in JSON format, and I'm normalizing it with data_normalized = pd.json_normalize(data). I'm trying to filter out the empty-list occurrences right after that with filtered = data_normalized.loc[data_normalized['focuses'] == []], but that isn't working. I've also tried filtered = data_normalized.loc[data_normalized['focuses'] == '[]'].
The dtype for column focuses is Object if that helps. So I'm stuck as to how to select this data.
Eventually, I want to just instead run data_normalized.replace('[]', '') but with the first parameter updated so that I can select the empty lists properly.
You could cast the df to string type with pd.DataFrame.astype(str) and then do the replace with the regex parameter set to False:
df.astype(str).replace('[]','',regex=False)
Example:
df = pd.DataFrame({'a': [[], 1, 2, 3]})
df.astype(str).replace('[]', '', regex=False)

   a
0   
1  1
2  2
3  3
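If casting the whole frame to strings is too blunt (it changes every column's dtype), an alternative is to mask only the empty-list cells in focuses; the frame below is a made-up stand-in for the result of pd.json_normalize(data):

```python
import pandas as pd

# Hypothetical frame standing in for pd.json_normalize(data)
data_normalized = pd.DataFrame({'focuses': [[], ['a'], [], ['b', 'c']]})

# Flag the rows whose cell is an empty list
mask = data_normalized['focuses'].apply(lambda x: isinstance(x, list) and len(x) == 0)

# Replace only those cells, leaving other columns and dtypes untouched
data_normalized.loc[mask, 'focuses'] = ''
print(data_normalized['focuses'].tolist())  # ['', ['a'], '', ['b', 'c']]
```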
I have relatively little experience with pandas, but since you cannot identify the object, try converting the values to strings and comparing them to '[]'. Note that string() is not a Python function; use astype(str), which converts element-wise. For example:
filtered = data_normalized.loc[data_normalized['focuses'].astype(str) == '[]']
Hello,
I have a problem with my Python 3 code. I want to copy a tuple into a dataframe cell, but Python returns a warning message: ...SettingWithCopyWarning...
import pandas as pd

data = {'Debut': ['19/12/2016', '18/1/2017', '13/2/2017', '10/3/2017']}
df = pd.DataFrame(data, columns=['Début'], index=['P1', 'P2', 'P3', 'P4'])
d = data['Début'][0]
d = d.split("/")
d.reverse()
d = tuple(list(map(int, d)))
df.Début[i] = d
I read the pandas docs and tried this, but Python returns an error (Must have equal len keys and value when setting with an iterable):
df.loc[0,'Début']=d
Another way doesn't work either; it's the same error:
df.at[0,'Début']=d
As pointed out, the issue is that your dataframe is already using a copy of the data dictionary as its data, so you get warnings about modifying copied data. One way to avoid this is to process your data the way you want it before you put it in the dataframe. For instance:
import pandas as pd
data={'Debut': ['19/12/2016','18/1/2017','13/2/2017','10/3/2017']}
df = pd.DataFrame(data, columns = ['Début'], index = ['P1','P2','P3','P4'])
# Split your data, make a tuple out of it, and reverse it in a list iteration
date_tuples = [tuple(map(int, i.split("/")))[::-1] for i in data['Debut']]
df['Début'] = date_tuples
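If the goal is to write a tuple into one existing cell, two details matter: the index labels here are 'P1'..'P4' rather than 0, and the column must have object dtype before it can hold a tuple. A minimal sketch, assuming a recent pandas version:

```python
import pandas as pd

df = pd.DataFrame({'Début': ['19/12/2016', '18/1/2017']}, index=['P1', 'P2'])

# The column must hold arbitrary objects before a tuple can go into one cell
df['Début'] = df['Début'].astype(object)

# Use the actual index label ('P1'), not the position 0
df.at['P1', 'Début'] = (2016, 12, 19)
print(df.loc['P1', 'Début'])
```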
I am trying to convert data (from integer to string) in a list generated using pandas. The data came from a CSV file.
Here is my code (excluding the part that generates the object InFile, the CSV file):
import pandas as pd
....
with open(InFile) as fp:
    # it.ifilter is the Python 2 itertools spelling; on Python 3 use the built-in filter
    skip = next(it.ifilter(
        lambda x: x[1].startswith('ID'),
        enumerate(fp)
    ))[0]
dg = pd.read_csv(InFile, usecols=['ID'], skiprows=skip)
dgl = dg['ID'].values.tolist()
Currently, output is a List (example below).
[111111, 2222, 3333333, 444444]
I am trying to match this data against another list (whose elements are strings, stored as VARCHAR in MySQL), but somehow I cannot come up with any match. My previous post -> How to find match from two Lists (from MySQL and csv)
So, I am guessing that the data type from the List generated by Pandas is an Integer.
So, how do I convert the data type from Integer to String?
Which line should I add something like str(10), for an example?
You can use pd.Series.astype:
dgl = dg['ID'].astype(str).values.tolist()
print(dgl)
Output:
['111111', '2222', '3333333', '444444']
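If you already have the integer list, converting after the fact also works; a small sketch with the example values from above:

```python
# Stand-in for the list produced by dg['ID'].values.tolist()
dgl = [111111, 2222, 3333333, 444444]

# Convert each element to a string for comparison with the MySQL VARCHAR list
dgl_str = [str(x) for x in dgl]
print(dgl_str)  # ['111111', '2222', '3333333', '444444']
```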
I am trying to work with the table generated by nltk.ConditionalFreqDist but I can't seem to find any documentation on either writing the table to a csv file or exporting to other formats. I'd love to work with it in a pandas dataframe object, which is also really easy to write to a csv. The only thread I could find recommended pickling the CFD object which doesn't really solve my problem.
I wrote the following function to convert an nltk.ConditionalFreqDist object to a pd.DataFrame:
def nltk_cfd_to_pd_dataframe(cfd):
    """ Converts an nltk.ConditionalFreqDist object into a pandas DataFrame object. """
    df = pd.DataFrame()
    for cond in cfd.conditions():
        col = pd.DataFrame(pd.Series(dict(cfd[cond])))
        col.columns = [cond]
        df = df.join(col, how='outer')
    df = df.fillna(0)
    return df
But if I am going to do that, perhaps it would make sense to just write a new ConditionalFreqDist function that produces a pd.DataFrame in the first place. But before I reinvent the wheel, I wanted to see if there are any tricks that I am missing - either in NLTK or elsewhere to make the ConditionalFreqDist object talk with other formats and most importantly to export it to csv files.
Thanks.
pd.DataFrame(freq_dist.items(), columns=['word', 'frequency'])
You can treat a FreqDist as a dict and create a dataframe from there using from_dict:
fdist = nltk.FreqDist( ... )
df_fdist = pd.DataFrame.from_dict(fdist, orient='index')
df_fdist.columns = ['Frequency']
df_fdist.index.name = 'Term'
print(df_fdist)
df_fdist.to_csv(...)
output:
      Frequency
Term           
is        70464
a         26429
the       15079
Ok, so I went ahead and wrote a conditional frequency distribution function that takes a list of tuples like the nltk.ConditionalFreqDist function but returns a pandas Dataframe object. Works faster than converting the cfd object to a dataframe:
def cond_freq_dist(data):
    """ Takes a list of tuples and returns a conditional frequency distribution as a pandas dataframe. """
    cfd = {}
    for cond, freq in data:
        try:
            cfd[cond][freq] += 1
        except KeyError:
            try:
                cfd[cond][freq] = 1
            except KeyError:
                cfd[cond] = {freq: 1}
    return pd.DataFrame(cfd).fillna(0)
This is a nice place to use a collections.defaultdict:
from collections import defaultdict
import pandas as pd
def cond_freq_dist(data):
    """ Takes a list of tuples and returns a conditional frequency
    distribution as a pandas dataframe. """
    cfd = defaultdict(lambda: defaultdict(int))  # the factory must be a callable
    for cond, freq in data:
        cfd[cond][freq] += 1
    return pd.DataFrame(cfd).fillna(0)
Explanation: a defaultdict essentially handles the exception handling in #primelens's answer behind the scenes. Instead of raising KeyError when referring to a key that doesn't exist yet, a defaultdict first creates an object for that key using the provided constructor function, then continues with that object. For the inner dict, the default is int() which is 0 to which we then add 1.
Note that such an object may not pickle nicely due to the default constructor function in the defaultdicts - to pickle a defaultdict, you need to convert it to a dict first: dict(myDefaultDict).
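A quick sketch of that pickling caveat, using a nested defaultdict like the one above:

```python
import pickle
from collections import defaultdict

cfd = defaultdict(lambda: defaultdict(int))
cfd['noun']['dog'] += 1

# Pickling directly fails: the lambda default_factory cannot be pickled
try:
    pickle.dumps(cfd)
    print("pickled")
except Exception:
    print("defaultdict with a lambda factory is not picklable")

# Converting to plain dicts first works
plain = {cond: dict(freqs) for cond, freqs in cfd.items()}
restored = pickle.loads(pickle.dumps(plain))
print(restored)  # {'noun': {'dog': 1}}
```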