Converting list of strings to list of floats in pandas - python

I have what I assumed would be a super basic problem, but I'm unable to find a solution. The short version is that I have a column in a csv that is a list of numbers. The csv was generated by pandas with to_csv. When I read it back in with read_csv, that list of numbers is automatically converted to a string.
When I then try to use it I obviously get errors. The to_numeric function also fails, because the value is a list, not a single number.
Is there any way to solve this? Posting code below for form's sake, but it's probably not extremely helpful:
def write_func(dataset):
    features = featurize_list(dataset[column])  # Returns numpy array
    new_dataset = dataset.copy()  # Don't want to modify the underlying dataframe
    new_dataset['Text'] = features
    new_dataset.rename(columns={'Text': 'Features'}, inplace=True)
    write(new_dataset, dataset_name)

def write(new_dataset, dataset_name):
    dump_location = feature_set_location(dataset_name, self)
    new_dataset.to_csv(dump_location)

def read_func(read_location):
    df = pd.read_csv(read_location)
    df['Features'] = df['Features'].apply(pd.to_numeric)
The Features column is the one in question. When I attempt to run the apply currently in read_func I get this error:
ValueError: Unable to parse string "[0.019636873200000002, 0.10695576670000001,...]" at position 0
I can't be the first person to run into this issue, is there some way to handle this at read/write time?

You want to use literal_eval as a converter passed to pd.read_csv. Below is an example of how that works.
from ast import literal_eval
from io import StringIO
import pandas as pd
txt = """col1|col2
a|[1,2,3]
b|[4,5,6]"""
df = pd.read_csv(StringIO(txt), sep='|', converters=dict(col2=literal_eval))
print(df)
  col1       col2
0    a  [1, 2, 3]
1    b  [4, 5, 6]
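Applied to the question's setup, the read would look something like this (a sketch: read_location is the path from the question, and the Features column is assumed to hold the stringified lists written by to_csv):
from ast import literal_eval
import pandas as pd

# literal_eval safely parses the stringified list, e.g.
# "[0.019636873200000002, 0.10695576670000001]" -> [0.0196..., 0.1069...]
df = pd.read_csv(read_location, converters={'Features': literal_eval})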

I have modified your last function a bit and it works fine.
def read_func(read_location):
    df = pd.read_csv(read_location)
    df['Features'] = df['Features'].apply(lambda x: pd.to_numeric(x))

Related

Importing numbers as string into a dataframe from text

I'm trying to import a text file into Python as a dataframe.
My text file essentially consists of 2 columns, both of which are numbers.
The problem is: I want one of the columns to be imported as a string (since many of the 'numbers' start with a zero, e.g. 0123, and I will need this column to merge the df with another later on)
My code looks like this:
mydata = pd.read_csv("text_file.txt", sep = "\t", dtype = {"header_col2": str})
However, I still lose the zeros in the output, so a 4-digit number is turned into a 3-digit number.
I'm assuming there is something wrong with my import code but I could not find any solution yet.
I'm new to python/pandas, so any help/suggestions would be much appreciated!
It's hard to see why your original code is not working:
from io import StringIO
import pandas as pd
# this mimics your data
mock_txt = StringIO("""header_col2\theader_col3
0123\t5
0333\t10
""")
# same reading as you suggested
df = pd.read_csv(mock_txt, sep = "\t", dtype = {"header_col2": str})
# are they really strings?
assert isinstance(df.header_col2[0], str)
assert isinstance(df.header_col2[1], str)
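If you ever need more control than dtype gives you, a converters entry does the same job and lets you add cleanup such as stripping whitespace; a minimal sketch under the question's column name:
# A converter runs on each raw field before any numeric inference happens,
# so the leading zeros survive (str here could be any callable).
df = pd.read_csv("text_file.txt", sep="\t", converters={"header_col2": str})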
P.S. As always at SO, it's really nice to have some of the data and a minimal working example with code in the original post.

Warning: SettingWithCopyWarning - don't understand

Hello,
I have a problem with my Python 3 code. I want to copy a tuple into a cell of a dataframe, but Python returns the warning message ...SettingWithCopyWarning...
data={'Debut': ['19/12/2016','18/1/2017','13/2/2017','10/3/2017']}
df=pd.DataFrame(data,columns=['Début'],index=['P1','P2','P3','P4'])
d=data['Début'][0]
d=d.split("/")
d.reverse()
d= tuple(list(map(int,d)))
df.Début[0]=d
I read the pandas docs and tried this... but Python returns an error (Must have equal len keys and value when setting with an iterable):
df.loc[0,'Début']=d
This other way doesn't work either; it's the same error:
df.at[0,'Début']=d
As pointed out, the issue is that your dataframe is already using a copy of the data dictionary as its data, so there are issues with copied data. One way you can avoid this is by processing your data the way you want it before you put it in the dataframe. For instance:
import pandas as pd
data={'Debut': ['19/12/2016','18/1/2017','13/2/2017','10/3/2017']}
df = pd.DataFrame(data, columns = ['Début'], index = ['P1','P2','P3','P4'])
# Split your data, make a tuple out of it, and reverse it in a list iteration
date_tuples = [tuple(map(int, i.split("/")))[::-1] for i in data['Debut']]
df['Début'] = date_tuples
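If you really do need to drop a tuple into a single existing cell instead of rebuilding the column, one approach is to cast the column to object dtype first and then address the cell by label with .at. A sketch under that assumption:
import pandas as pd

data = {'Début': ['19/12/2016', '18/1/2017', '13/2/2017', '10/3/2017']}
df = pd.DataFrame(data, index=['P1', 'P2', 'P3', 'P4'])

# An object column can hold a tuple in a single cell; .at targets exactly
# one cell by label, so pandas does not try to broadcast the tuple's elements.
df['Début'] = df['Début'].astype(object)
df.at['P1', 'Début'] = (2016, 12, 19)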

Handling ragged CSV columns in pandas

I have a CSV file containing data (only the first rows of data are listed):
0,11,31,65,67
1,31,33,67
2,33,43,67
3,31,33,67
4,24,31,33,65,67,68,71,75,76,93,97
5,31,33,67
6,65,93
7,2,33,34,51,66,67,84
8,44,55,66
9,2,33,51,54,67,84
10,33,51,66,67,84
The first column indicates the row number (e.g. the first field in the first row is 0). When I try to use
import pandas as pd
df0 = pd.read_csv('df0.txt', header=None, sep=',')
Error occurs as below:
pandas.errors.ParserError: Error tokenizing data. C error: Expected 5 fields in line 5, saw 12
I guess pandas infers the number of columns from the first row (5 columns). How can I declare the number of columns myself? It is known that there are 120 class labels in total, so 121 columns should be enough.
Further, how can I transform it into one-hot encoding format? I want to use a neural network model to process the data.
For your first problem, you can pass a names=... parameter to read_csv:
df = pd.read_csv('df0.txt', header=None, names=range(121), sep=',')
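Since the first field of each row is the row number, you may also want it as the index rather than as data; a sketch of that variant (assuming the same df0.txt):
import pandas as pd

# names fixes a wide set of column labels so ragged rows parse;
# index_col=0 turns the leading row number into the index.
df = pd.read_csv('df0.txt', header=None, names=range(121), sep=',', index_col=0)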
As for your second problem, there's an existing solution here that uses sklearn.OneHotEncoder. If you are looking to convert each column to a one hot encoding, you may use it.
I gave this my best shot, but I don't think it's too good. Based on my own ML knowledge and your question, I took you to be asking the following:
1.) You have a csv of numbers
2.) This is for a problem with 120 classes
3.) You want a matrix with 1s and 0s for each class
4.) For example, a csv such as:
1, 3
2, 3, 6
would become the feature matrix below, with one column per class label:
1, 2, 3, 6
1, 0, 1, 0
0, 1, 1, 1
Thus this code achieves that, but it is surely not optimized:
def func(df1, df2):
    # We can't join if columns overlap. Use set operations to identify them.
    non_overlapping_columns = list(set(df2.columns) - set(df1.columns))
    overlapping_columns = list(set(df2.columns) - set(non_overlapping_columns))
    # Join where possible
    df2_join = df2[non_overlapping_columns]
    df3 = df1.join(df2_join)
    # Manually add columns for overlaps
    for k in overlapping_columns:
        df3[k] = df3[k] + df2[k]
    return df3

df = pd.read_csv(file, header=None, names=range(121), sep=',')
one_hot = []
for k in df.columns:
    one_hot.append(pd.get_dummies(df[k]))
for n, l in enumerate(one_hot):
    if n == 0:
        df = one_hot[n]
    else:
        df = func(df1=df, df2=one_hot[n])
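For what it's worth, a more compact route to the same 0/1 matrix is to stack the frame and collapse the per-value dummies row by row; a sketch, assuming the df0.txt layout above with the row number as the first field:
import pandas as pd

df = pd.read_csv('df0.txt', header=None, names=range(121), sep=',', index_col=0)
# stack() drops the NaN padding, get_dummies makes one indicator column per
# class label, and max() per original row collapses back to one row per sample.
one_hot = pd.get_dummies(df.stack()).groupby(level=0).max()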
From here you could feed it into sklearn's OneHotEncoder, as @cᴏʟᴅsᴘᴇᴇᴅ noted.
That would look like this:
from sklearn.preprocessing import OneHotEncoder
import sys

enc = OneHotEncoder()
onehot = enc.fit_transform(df)  # fit_transform returns a sparse matrix
sys.getsizeof(onehot)  # smaller than the pandas dataframe
sys.getsizeof(df)
I guess I'm unsure whether the assumptions I noted above are what you want done with your data; perhaps they aren't.
I thought that a given line in your csv indicated the classes that exist for that sample, but I'm still a little unclear on it.

Py Pandas .format(dataframe)

As a Python newbie I recently discovered that with Python 2.7 I can do something like:
print '{:20,.2f}'.format(123456789)
which will give the resulting output:
      123,456,789.00
I'm now looking to get a similar outcome for a pandas df, so my code was:
import pandas as pd
import random
data = [[random.random()*10000 for i in range(1,4)] for j in range (1,8)]
df = pd.DataFrame (data)
print '{:20,.2f}'.format(df)
In this case I have the error:
Unknown format code 'f' for object of type 'str'
Any suggestions to perform something like '{:20,.2f}'.format(df) ?
My idea for now is to index the dataframe (it's a small one), format each individual float within it, maybe assign astype(str), and rebuild the DF... but that looks so ugly :-( and I'm not even sure it'll work.
What do you think? I'm stuck... and would like a better format for my dataframes when they are converted to reportlab grids.
import pandas as pd
import numpy as np
data = np.random.random((8,3))*10000
df = pd.DataFrame (data)
pd.options.display.float_format = '{:20,.2f}'.format
print(df)
yields (random output similar to)
                    0                    1                    2
0            4,839.01             6,170.02               301.63
1            4,411.23             8,374.36             7,336.41
2            4,193.40             2,741.63             7,834.42
3            3,888.27             3,441.57             9,288.64
4              220.13             6,646.20             3,274.39
5            3,885.71             9,942.91             2,265.95
6            3,448.75             3,900.28             6,053.93
The docstring for pd.set_option or pd.describe_option explains:
display.float_format: [default: None] [currently: None] : callable
The callable should accept a floating point number and return
a string with the desired format of the number. This is used
in some places like SeriesFormatter.
See core.format.EngFormatter for an example.
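If you only want this formatting for a single printout rather than as a global display option, to_string accepts a callable of the same shape (a minimal sketch):
# Formats just this one render; the global display options stay untouched.
print(df.to_string(float_format='{:20,.2f}'.format))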

NLTK ConditionalFreqDist to Pandas dataframe

I am trying to work with the table generated by nltk.ConditionalFreqDist, but I can't seem to find any documentation on writing the table to a csv file or exporting it to other formats. I'd love to work with it as a pandas dataframe object, which is also really easy to write to a csv. The only thread I could find recommended pickling the CFD object, which doesn't really solve my problem.
I wrote the following function to convert an nltk.ConditionalFreqDist object to a pd.DataFrame:
def nltk_cfd_to_pd_dataframe(cfd):
    """ Converts an nltk.ConditionalFreqDist object into a pandas DataFrame object. """
    df = pd.DataFrame()
    for cond in cfd.conditions():
        col = pd.DataFrame(pd.Series(dict(cfd[cond])))
        col.columns = [cond]
        df = df.join(col, how='outer')
    df = df.fillna(0)
    return df
But if I am going to do that, perhaps it would make sense to just write a new ConditionalFreqDist function that produces a pd.DataFrame in the first place. But before I reinvent the wheel, I wanted to see if there are any tricks that I am missing - either in NLTK or elsewhere to make the ConditionalFreqDist object talk with other formats and most importantly to export it to csv files.
Thanks.
For a plain FreqDist (rather than a conditional one), the items view can be passed straight to the DataFrame constructor:
pd.DataFrame(freq_dist.items(), columns=['word', 'frequency'])
You can treat a FreqDist as a dict, and create a dataframe from there using from_dict:
fdist = nltk.FreqDist( ... )
df_fdist = pd.DataFrame.from_dict(fdist, orient='index')
df_fdist.columns = ['Frequency']
df_fdist.index.name = 'Term'
print(df_fdist)
df_fdist.to_csv(...)
output:
      Frequency
Term
is        70464
a         26429
the       15079
Ok, so I went ahead and wrote a conditional frequency distribution function that takes a list of tuples, like the nltk.ConditionalFreqDist function, but returns a pandas DataFrame object. It works faster than converting the cfd object to a dataframe:
def cond_freq_dist(data):
    """ Takes a list of tuples and returns a conditional frequency distribution as a pandas dataframe. """
    cfd = {}
    for cond, freq in data:
        try:
            cfd[cond][freq] += 1
        except KeyError:
            try:
                cfd[cond][freq] = 1
            except KeyError:
                cfd[cond] = {freq: 1}
    return pd.DataFrame(cfd).fillna(0)
This is a nice place to use a collections.defaultdict:
from collections import defaultdict
import pandas as pd
def cond_freq_dist(data):
    """ Takes a list of tuples and returns a conditional frequency
    distribution as a pandas dataframe. """
    cfd = defaultdict(lambda: defaultdict(int))
    for cond, freq in data:
        cfd[cond][freq] += 1
    return pd.DataFrame(cfd).fillna(0)
Explanation: a defaultdict essentially handles the exception handling in @primelens's answer behind the scenes. Instead of raising KeyError when referring to a key that doesn't exist yet, a defaultdict first creates an object for that key using the provided constructor function, then continues with that object. For the inner dict the default is int(), which is 0, to which we then add 1.
Note that such an object may not pickle nicely due to the default constructor function in the defaultdicts - to pickle a defaultdict, convert it to a dict first: dict(myDefaultDict).
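A quick usage sketch with made-up (condition, value) pairs, just to show the shape of the result:
# Hypothetical data: the condition is a POS tag, the value is the word.
pairs = [('DT', 'the'), ('DT', 'the'), ('DT', 'a'), ('NN', 'dog')]
print(cond_freq_dist(pairs))
# Expected shape (row order may vary):
#       DT   NN
# the  2.0  0.0
# a    1.0  0.0
# dog  0.0  1.0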
