NLTK ConditionalFreqDist to Pandas dataframe - python

I am trying to work with the table generated by nltk.ConditionalFreqDist but I can't seem to find any documentation on either writing the table to a csv file or exporting to other formats. I'd love to work with it in a pandas dataframe object, which is also really easy to write to a csv. The only thread I could find recommended pickling the CFD object which doesn't really solve my problem.
I wrote the following function to convert an nltk.ConditionalFreqDist object to a pd.DataFrame:
def nltk_cfd_to_pd_dataframe(cfd):
    """ Converts an nltk.ConditionalFreqDist object into a pandas DataFrame object. """
    df = pd.DataFrame()
    for cond in cfd.conditions():
        col = pd.DataFrame(pd.Series(dict(cfd[cond])))
        col.columns = [cond]
        df = df.join(col, how='outer')
    df = df.fillna(0)
    return df
But if I am going to do that, perhaps it would make sense to just write a new ConditionalFreqDist function that produces a pd.DataFrame in the first place. But before I reinvent the wheel, I wanted to see if there are any tricks that I am missing - either in NLTK or elsewhere to make the ConditionalFreqDist object talk with other formats and most importantly to export it to csv files.
Thanks.

pd.DataFrame(freq_dist.items(), columns=['word', 'frequency'])

You can treat a FreqDist as a dict, and create a dataframe from there using from_dict:
fdist = nltk.FreqDist( ... )
df_fdist = pd.DataFrame.from_dict(fdist, orient='index')
df_fdist.columns = ['Frequency']
df_fdist.index.name = 'Term'
print(df_fdist)
df_fdist.to_csv(...)
output:
         Frequency
Term
is           70464
a            26429
the          15079
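
The same dict-based idea extends to the ConditionalFreqDist from the question, since a CFD behaves like a dict of FreqDists (a sketch; cfd is assumed to be an existing nltk.ConditionalFreqDist):
import pandas as pd

# One column per condition, one row per outcome; missing counts become 0.
df_cfd = pd.DataFrame({cond: pd.Series(dict(cfd[cond])) for cond in cfd.conditions()})
df_cfd = df_cfd.fillna(0)
df_cfd.to_csv('cfd.csv')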

Ok, so I went ahead and wrote a conditional frequency distribution function that takes a list of tuples like the nltk.ConditionalFreqDist function, but returns a pandas DataFrame object. It works faster than converting the cfd object to a dataframe:
def cond_freq_dist(data):
    """ Takes a list of tuples and returns a conditional frequency distribution as a pandas dataframe. """
    cfd = {}
    for cond, freq in data:
        try:
            cfd[cond][freq] += 1
        except KeyError:
            try:
                cfd[cond][freq] = 1
            except KeyError:
                cfd[cond] = {freq: 1}
    return pd.DataFrame(cfd).fillna(0)
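
For example, fed with (condition, outcome) pairs such as word bigrams, it produces one column per condition (a small illustrative example, not from the original post):
pairs = [('the', 'dog'), ('the', 'cat'), ('the', 'dog'), ('a', 'cat')]
df = cond_freq_dist(pairs)
# One column per condition ('the', 'a'), one row per outcome ('dog', 'cat'),
# with missing combinations filled with 0.
print(df)
df.to_csv('cond_freq_dist.csv')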

This is a nice place to use a collections.defaultdict:
from collections import defaultdict
import pandas as pd

def cond_freq_dist(data):
    """ Takes a list of tuples and returns a conditional frequency
    distribution as a pandas dataframe. """
    cfd = defaultdict(lambda: defaultdict(int))
    for cond, freq in data:
        cfd[cond][freq] += 1
    return pd.DataFrame(cfd).fillna(0)
Explanation: a defaultdict essentially handles the exception handling in @primelens's answer behind the scenes. Instead of raising KeyError when referring to a key that doesn't exist yet, a defaultdict first creates an object for that key using the provided constructor function, then continues with that object. For the inner dict, the default is int(), which is 0, to which we then add 1. Note that the inner default has to be wrapped in a lambda so that the outer defaultdict receives a callable factory.
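A minimal illustration of that behaviour (not from the original answer):
from collections import defaultdict

counts = defaultdict(int)
counts['x'] += 1    # no KeyError: the missing key defaults to int() == 0, then 1 is added
print(counts['x'])  # 1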
Note that such an object may not pickle nicely due to the default constructor function in the defaultdicts - to pickle a defaultdict, you need to convert it to a plain dict first: dict(myDefaultDict).

Related

Trying to Pass Pandas DataFrame to a Function and Return a Modified DataFrame

I'm trying to pass different pandas dataframes to a function that does some string modification (usually a str.replace operation on columns based on mapping tables stored in CSV files) and returns the modified dataframes, and I'm encountering errors, especially with handling the dataframe as a parameter.
The mapping table in CSV is structured as follows:
From(Str)     To(Str)   Regex(True/False)
A             A2
B             B2
CD (.*) FG    CD FG     True
My code looks something like this:
def apply_mapping_table(p_df, p_df_col_name, p_mt_name):
    df_mt = pd.read_csv(p_mt_name)
    for index in range(df_mt.shape[0]):
        # If regex is true
        if df_mt.iloc[index][2] is True:
            # perform regex replacing
            df_p[p_df_col_name] = df_p[p_df_col_name].replace(to_replace=df_mt.iloc[index][0], value=df_mt.iloc[index][1], regex=True)
        else:
            # perform normal string replacing
            p_df[p_df_col_name] = p_df[p_df_col_name].replace(df_mt.iloc[index][0], df_mt.iloc[index][1])
    return df_p

df_new1 = apply_mapping_table1(df_old1, 'Target_Column1', 'MappingTable1.csv')
df_new2 = apply_mapping_table2(df_old2, 'Target_Column2', 'MappingTable2.csv')
I'm getting 'IndexError: single positional indexer is out-of-bounds' for 'df_mt.iloc[index][2]' and haven't gone to the portion where the actual replacement is happening. Any suggestions to make it work or even a better way to do the dataframe string replacements based on mapping tables?
You can use the .iterrows() function to iterate through the lookup table rows. Generally, .iterrows() is slow, but because the lookup table should be a small, manageable table, it will be completely fine here.
You can adapt your given function as in the following snippet:
def apply_mapping_table(p_df, p_df_col_name, p_mt_name):
    df_mt = pd.read_csv(p_mt_name)
    for _, row in df_mt.iterrows():
        # If regex is true
        if row['Regex(True/False)']:
            # perform regex replacing
            p_df[p_df_col_name] = p_df[p_df_col_name].replace(to_replace=row['From(Str)'], value=row['To(Str)'], regex=True)
        else:
            # perform normal string replacing
            p_df[p_df_col_name] = p_df[p_df_col_name].replace(row['From(Str)'], row['To(Str)'])
    return p_df
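
As a possible alternative (a sketch, not part of the original answer), the whole mapping table can be turned into two replacement dicts so that .replace() is only called once per mode; the column names are assumed to match the CSV headers shown above:
import pandas as pd

def apply_mapping_table_vectorized(p_df, p_df_col_name, p_mt_name):
    df_mt = pd.read_csv(p_mt_name)
    is_regex = df_mt['Regex(True/False)'] == True
    # Exact-match replacements and regex replacements, each applied in one call
    plain_map = dict(zip(df_mt.loc[~is_regex, 'From(Str)'], df_mt.loc[~is_regex, 'To(Str)']))
    regex_map = dict(zip(df_mt.loc[is_regex, 'From(Str)'], df_mt.loc[is_regex, 'To(Str)']))
    p_df[p_df_col_name] = (p_df[p_df_col_name]
                           .replace(plain_map)
                           .replace(regex_map, regex=True))
    return p_df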

pytest assert for pyspark dataframe comparison

I have 2 pyspark dataframes, expected_df and actual_df, as shown in the attached file.
In my unit test I am trying to check whether both are equal or not, for which my code is:
expected = map(lambda row: row.asDict(), expected_df.collect())
actual = map(lambda row: row.asDict(), actaual_df.collect())
assert expected = actual
Since both dfs are the same but the row order is different, the assert fails here.
What is the best way to compare such dfs?
You can try pyspark-test
https://pypi.org/project/pyspark-test/
This is inspired by the pandas testing module, built for pyspark.
Usage is simple:
from pyspark_test import assert_pyspark_df_equal
assert_pyspark_df_equal(df_1, df_2)
Also, apart from just comparing dataframes, just like the pandas testing module it accepts many optional params that you can check in the documentation.
Note:
The datatypes in pandas and pyspark are a bit different, which is why directly converting with .toPandas and using the pandas testing module might not be the right approach.
This package is for unit/integration testing, so it is meant to be used with small-sized dfs.
This is done in some of the pyspark documentation:
assert sorted(expected_df.collect()) == sorted(actaual_df.collect())
We solved this by hashing each row with Spark's hash function and then summing the resultant column.
from pyspark.sql import DataFrame
import pyspark.sql.functions as F

def hash_df(df):
    """Hashes a DataFrame for comparison.

    Arguments:
        df (DataFrame): A dataframe to generate a hash from

    Returns:
        int: Summed value of hashed rows of an input DataFrame
    """
    # Hash every row into a new hash column
    df = df.withColumn('hash_value', F.hash(*sorted(df.columns))).select('hash_value')
    # Sum the hashes, see https://shortest.link/28YE
    value = df.agg(F.sum('hash_value')).collect()[0][0]
    return value

expected_hash = hash_df(expected_df)
actual_hash = hash_df(actual_df)
assert expected_hash == actual_hash
Unfortunately this cannot be done without applying a sort on one of the columns (especially on the key column), the reason being that there isn't any guarantee for the ordering of records in a DataFrame. You cannot predict the order in which the records are going to appear in the dataframe. The approach below works fine for me:
expected = expected_df.orderBy('period_start_time').collect()
actual = actaual_df.orderBy('period_start_time').collect()
assert expected == actual
If the overhead of an additional library such as pyspark_test is a problem, you could try sorting both dataframes by the same columns, converting them to pandas, and using pd.testing.assert_frame_equal.
I know that the .toPandas method for pyspark dataframes is generally discouraged because the data is loaded into the driver's memory (see the pyspark documentation here), but this solution works for relatively small unit tests.
For example:
sort_cols = actual_df.columns
pd.testing.assert_frame_equal(
    actual_df.sort(sort_cols).toPandas(),
    expected_df.sort(sort_cols).toPandas()
)
Try using "==" instead of "=":
assert expected == actual
I have two Dataframes with the same order.
Comparing these two I use:
def test_df(df1, df2):
    assert df1.values.tolist() == df2.values.tolist()
Another way to go about that, ensuring the sort order, would be:
from pandas.testing import assert_frame_equal

def assert_frame_with_sort(results, expected, key_columns):
    results_sorted = results.sort_values(by=key_columns).reset_index(drop=True)
    expected_sorted = expected.sort_values(by=key_columns).reset_index(drop=True)
    assert_frame_equal(results_sorted, expected_sorted)
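A quick usage sketch (the dataframes are assumed to be small enough to convert to pandas; the key column name is borrowed from an earlier answer):
actual_pd = actual_df.toPandas()
expected_pd = expected_df.toPandas()
assert_frame_with_sort(actual_pd, expected_pd, ['period_start_time'])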

Warning: ....SettingWithCopyWarning don't understand

Hello,
I have a problem with my Python 3 code. I want to copy a tuple into a dataframe cell, but Python returns the warning message ...SettingWithCopyWarning...
data={'Debut': ['19/12/2016','18/1/2017','13/2/2017','10/3/2017']}
df=pd.DataFrame(data,columns=['Début'],index=['P1','P2','P3','P4'])
d=data['Début'][0]
d=d.split("/")
d.reverse()
d= tuple(list(map(int,d)))
df.Début[i]=d
I read the pandas doc and tried this... but Python returns an error (Must have equal len keys and value when setting with an iterable):
df.loc[0,'Début']=d
Another way... it does not work either, it's the same error:
df.at[0,'Début']=d
As pointed out, there is an issue that your dataframe is already using a copy of the data dictionary as its data, and so there are issues with copied data. One way you can avoid this is by processing your data the way you want it before you put it in the dataframe. For instance:
import pandas as pd
data={'Debut': ['19/12/2016','18/1/2017','13/2/2017','10/3/2017']}
df = pd.DataFrame(data, columns = ['Début'], index = ['P1','P2','P3','P4'])
# Split your data, make a tuple out of it, and reverse it in a list iteration
date_tuples = [tuple(map(int, i.split("/")))[::-1] for i in data['Debut']]
df['Début'] = date_tuples
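The 'Début' column then holds one tuple per row; printing the dataframe should show something along these lines (display formatting may differ):
print(df)
#              Début
# P1  (2016, 12, 19)
# P2   (2017, 1, 18)
# P3   (2017, 2, 13)
# P4   (2017, 3, 10)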

Converting list of strings to list of floats in pandas

I have what I assumed would be a super basic problem, but I'm unable to find a solution. The short of it is that I have a column in a csv that is a list of numbers. This csv was generated by pandas with to_csv. When trying to read it back in with read_csv, it automatically converts this list of numbers into a string.
When then trying to use it I obviously get errors. When I try using the to_numeric function I get errors as well, because it is a list, not a single number.
Is there any way to solve this? Posting code below for form, but probably not extremely helpful:
def write_func(dataset):
    features = featurize_list(dataset[column])  # Returns numpy array
    new_dataset = dataset.copy()  # Don't want to modify the underlying dataframe
    new_dataset['Text'] = features
    new_dataset.rename(columns={'Text': 'Features'}, inplace=True)
    write(new_dataset, dataset_name)

def write(new_dataset, dataset_name):
    dump_location = feature_set_location(dataset_name, self)
    featurized_dataset.to_csv(dump_location)

def read_func(read_location):
    df = pd.read_csv(read_location)
    df['Features'] = df['Features'].apply(pd.to_numeric)
The Features column is the one in question. When I attempt to run the apply currently in read_func I get this error:
ValueError: Unable to parse string "[0.019636873200000002, 0.10695576670000001,...]" at position 0
I can't be the first person to run into this issue, is there some way to handle this at read/write time?
You want to use literal_eval as a converter passed to pd.read_csv. Below is an example of how that works.
from ast import literal_eval
from io import StringIO
import pandas as pd

txt = """col1|col2
a|[1,2,3]
b|[4,5,6]"""

df = pd.read_csv(StringIO(txt), sep='|', converters=dict(col2=literal_eval))
print(df)

  col1       col2
0    a  [1, 2, 3]
1    b  [4, 5, 6]
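Applied to the read_func from the question, that would look something like this (a sketch; the 'Features' column name is taken from the question):
from ast import literal_eval
import pandas as pd

def read_func(read_location):
    # Parse the stringified lists back into real Python lists of floats at read time
    df = pd.read_csv(read_location, converters={'Features': literal_eval})
    return df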
I have modified your last function a bit and it works fine.
def read_func(read_location):
    df = pd.read_csv(read_location)
    df['Features'] = df['Features'].apply(lambda x: pd.to_numeric(x))

pandas SparseDataFrame insertion

I would like to create a pandas SparseDataFrame with the dimensions 250,000 x 250,000. In the end my aim is to come up with a big adjacency matrix.
So far it is no problem to create that data frame:
df = SparseDataFrame(columns=arange(250000), index=arange(250000))
But when I try to update the DataFrame, I run into massive memory/runtime problems:
index = 1000
col = 2000
value = 1
df.set_value(index, col, value)
I checked the source:
def set_value(self, index, col, value):
    """
    Put single value at passed column and index

    Parameters
    ----------
    index : row label
    col : column label
    value : scalar value

    Notes
    -----
    This method *always* returns a new object. It is currently not
    particularly efficient (and potentially very expensive) but is provided
    for API compatibility with DataFrame
    ...
The last sentence describes the problem in this case when using pandas. I really would like to keep on using pandas here, but this way it seems totally impossible!
Does someone have an idea, how to solve this problem more efficiently?
My next idea is to work with something like nested lists/dicts or so...
thanks for your help!
Do it this way
df = pd.SparseDataFrame(columns=np.arange(250000), index=np.arange(250000))
s = df[2000].to_dense()
s[1000] = 1
df[2000] = s
In [11]: df.ix[1000,2000]
Out[11]: 1.0
So the procedure is to swap out an entire series at a time. The SDF will convert the passed-in series to a SparseSeries. (You can do it yourself to see what they look like with s.to_sparse().) The SparseDataFrame is basically a dict of these SparseSeries, which themselves are immutable. Sparse handling will have some changes in 0.12 to better support these types of operations (e.g. setting will work efficiently).
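Wrapping that swap-out procedure in a small helper (a sketch based on the answer above, relying on the same old SparseDataFrame API, which only exists in pre-1.0 pandas):
def sdf_set_value(sdf, index, col, value):
    # Densify the one affected column, set the single cell,
    # then assign the series back; the SparseDataFrame converts
    # it back to a SparseSeries on assignment.
    s = sdf[col].to_dense()
    s[index] = value
    sdf[col] = s
    return sdf

df = sdf_set_value(df, 1000, 2000, 1)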
