I have a CSV file containing data (only the first rows are listed):
0,11,31,65,67
1,31,33,67
2,33,43,67
3,31,33,67
4,24,31,33,65,67,68,71,75,76,93,97
5,31,33,67
6,65,93
7,2,33,34,51,66,67,84
8,44,55,66
9,2,33,51,54,67,84
10,33,51,66,67,84
The first column indicates the row number (e.g. the first column in the first row is 0). When I try to use
import pandas as pd
df0 = pd.read_csv('df0.txt', header=None, sep=',')
an error occurs, as below:
pandas.errors.ParserError: Error tokenizing data. C error: Expected 5 fields in line 5, saw 12
I guess pandas determines the number of columns when it reads the first row (5 columns). How can I declare the number of columns myself? It is known that there are 120 class labels in total, so I guess 121 columns should be enough.
Further, how can I transform it into a one-hot-encoded format? I want to use a neural network model to process the data.
For your first problem, you can pass a names=... parameter to read_csv:
df = pd.read_csv('df0.txt', header=None, names=range(121), sep=',')
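As a quick sanity check (just a sketch using the sample rows above), the shorter rows are simply padded with NaN out to 121 columns:
print(df.shape)                      # (number_of_rows, 121)
print(df.iloc[1].dropna().tolist())  # [1.0, 31.0, 33.0, 67.0]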
As for your second problem, there's an existing solution here that uses sklearn.OneHotEncoder. If you are looking to convert each column to a one hot encoding, you may use it.
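If what you actually need is a single multi-hot matrix with one 0/1 column per class label (rather than encoding each raw column separately), a minimal sketch could look like this; the file name and the 121-column assumption are taken from the question, and the first column is treated as the row number:
import pandas as pd

# Read the ragged rows, padding short rows with NaN (as above).
df = pd.read_csv('df0.txt', header=None, names=range(121), sep=',')

# Column 0 is the row number; the remaining values are the class labels present.
labels = df.set_index(0).stack().astype(int)

# One 0/1 column per class label, one row per original row.
one_hot = pd.get_dummies(labels).groupby(level=0).max()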
I gave this my best shot, but I don't think it's too good. Based on my own ML knowledge and your question, I took you to be asking the following:
1.) You have a csv of numbers
2.) This is for a problem with 120 classes
3.) You want a matrix with 1s and 0s for each class
4.) For example, a csv such as:
1, 3
2, 3, 6
would become the feature matrix:
Columns:  1, 2, 3, 6
Row 1:    1, 0, 1, 0
Row 2:    0, 1, 1, 1
Thus this code achieves that, but it is surely not optimized:
df = pd.read_csv(file, header=None, names=range(121), sep=',')

def func(df1, df2):
    # We can't join if columns overlap. Use set operations to identify them.
    non_overlapping_columns = list(set(df2.columns) - set(df1.columns))
    overlapping_columns = list(set(df2.columns) - set(non_overlapping_columns))
    # Join where possible
    df2_join = df2[non_overlapping_columns]
    df3 = df1.join(df2_join)
    # Manually add columns for overlaps
    for k in overlapping_columns:
        df3[k] = df3[k] + df2[k]
    return df3

one_hot = []
for k in df.columns:
    one_hot.append(pd.get_dummies(df[k]))

for n, l in enumerate(one_hot):
    if n == 0:
        df = one_hot[n]
    else:
        df = func(df1=df, df2=one_hot[n])
From here you could feed it into sklearn's OneHotEncoder, as @cᴏʟᴅsᴘᴇᴇᴅ noted.
That would look like this:
from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder()
onehot = encoder.fit_transform(df)  # returns a scipy sparse matrix

import sys
sys.getsizeof(onehot)  # smaller than the pandas DataFrame
sys.getsizeof(df)
I'm unsure whether the assumptions I noted above are what you want done with your data; it seems perhaps they aren't.
I thought that a given line in your csv indicated the classes that are present. I guess I'm a little unclear on it still.
Related
I have a rather large (1.3 GB, unzipped) csv file, with 2 dense columns and 1.4 K sparse columns, about 1 M rows.
I need to make a pandas.DataFrame from it.
For small files I can simply do:
df = pd.read_csv('file.csv')
For the large file I have now, I get a memory error, clearly due to the DataFrame size (tested with sys.getsizeof(df)).
Based on this document:
https://pandas.pydata.org/pandas-docs/stable/user_guide/sparse.html#migrating
it looks like I can make a DataFrame with mixed dense and sparse columns.
However, I can only see instructions to add individual sparse columns, not a chunk of them all together, from the csv file.
Reading the csv sparse columns one by one and adding them to df using:
for colname_i in names_of_sparse_columns:
    data = pd.read_csv('file.csv', usecols = [colname_i])
    df[colname_i] = pd.arrays.SparseArray(data.values.transpose()[0])
works, and df stays very small, as desired, but the execution time is absurdly long.
I tried of course:
pd.read_csv(path_to_input_csv, usecols = names_of_sparse_columns, dtype = "Sparse[float]")
but that generates this error:
NotImplementedError: Extension Array: <class 'pandas.core.arrays.sparse.array.SparseArray'> must implement _from_sequence_of_strings in order to be used in parser methods
Any idea how I can do this more efficiently?
I checked several posts, but they all seem to be after something slightly different from this.
EDIT: adding a small example to clarify
import numpy as np
import pandas as pd
import sys
# Create an unpivoted sparse dataset
lengths = list(np.random.randint(low = 1, high = 5, size = 10000))
cols = []
for l in lengths:
    cols.extend(list(np.random.choice(100, size = l, replace = False)))
rows = np.repeat(np.arange(10000), lengths)
vals = np.repeat(1, sum(lengths))
df_unpivoted = pd.DataFrame({"row" : rows, "col" : cols, "val" : vals})
# Pivot and save to a csv file
df = df_unpivoted.pivot(index = "row", columns = "col", values = "val")
df.to_csv("sparse.csv", index = False)
This file occupies 1 MB on my PC.
Instead:
sys.getsizeof(df)
# 8080016
This looks like 8 MB to me.
So there is clearly a large increase in size when making a pd.DataFrame from a sparse csv file (in this case I made the file from the data frame, but it's the same as reading in the csv file using pd.read_csv()).
And this is my point: I cannot use pd.read_csv() to load the whole csv file into memory.
Here it's only 8 MB, that's no problem at all; with the actual 1.3 GB csv I referred to, it goes to such a huge size that it crashes our machine's memory.
I guess it's easy to try that, by replacing 10000 with 1000000 and 100 with 1500 in the above simulation.
If I do instead:
names_of_sparse_columns = df.columns.values
df_sparse = pd.DataFrame()
for colname_i in names_of_sparse_columns:
    data = pd.read_csv('sparse.csv', usecols = [colname_i])
    df_sparse[colname_i] = pd.arrays.SparseArray(data.values.transpose()[0])
The resulting object is much smaller:
sys.getsizeof(df_sparse)
# 416700
In fact even smaller than the file.
And this is my second point: doing this column-by-column addition of sparse columns is very slow.
I was looking for advice on how to make df_sparse from a file like "sparse.csv" faster / more efficiently.
In fact, while I was writing this example, I noticed that:
sys.getsizeof(df_unpivoted)
# 399504
So maybe the solution could be to read the csv file line by line and unpivot it. The rest of the handling I need to do, however, would still require writing out a pivoted csv, so I'd be back to square one.
EDIT 2: more information
I should also describe the rest of the handling I need to do.
When I can use a non-sparse data frame, there is an ID column in the file:
df["ID"] = list(np.random.choice(20, df.shape[0]))
I need to make a summary of how many data points exist per ID, per data column:
df.groupby("ID").count()
The unfortunate bit is that the sparse data frame does not support this.
I found a workaround, but it's very inefficient and slow.
If anyone can advise on that aspect, too, it would be useful.
I would have guessed there would be a way to load the sparse part of the csv into some form of sparse array, and make a summary by ID.
Maybe I'm approaching this completely the wrong way, and that's why I am asking this large competent audience for advice.
I don't have the faintest idea why someone would have made a CSV in that format. I would just read it in as chunks and fix the chunks.
# Read in chunks of data, melt each chunk into a dataframe that makes sense
data = [c.melt(id_vars=dense_columns, var_name="Column_label", value_name="Thing").dropna()
        for c in pd.read_csv('file.csv', iterator=True, chunksize=100000)]
# Concat the data together
data = pd.concat(data, axis=0)
Change the chunksize and the name of the value column as needed. You could also read in chunks and turn the chunks into a sparse dataframe if needed, but it seems that you'd be better off with a melted dataframe for what you want to do, IMO.
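For the per-ID summary mentioned in the question, the melted frame makes that a single groupby; a rough sketch, assuming the ID column is among dense_columns and reusing the column names from the snippet above:
# Count non-missing values per ID and per original data column.
counts = (data.groupby(["ID", "Column_label"])["Thing"]
              .count()
              .unstack(fill_value=0))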
You can always chunk it again going the other way as well. Change the number of chunks as needed for your data.
with open('out_file.csv', mode='w') as out:
    for i, chunk in enumerate(np.array_split(df, 100)):
        chunk.iloc[:, 2:] = chunk.iloc[:, 2:].sparse.to_dense()
        chunk.to_csv(out, header=i==0)
The same file.csv should not be read on every iteration; this line of code:
data = pd.read_csv('file.csv', ...)
should be moved ahead of the for-loop.
To iterate through names_of_sparse_columns:
df = pd.read_csv('file.csv', header = 0).copy()
data = pd.read_csv('file.csv', header = 0).copy()

for colname_i in names_of_sparse_columns:
    dataFromThisSparseColumn = data[colname_i]
    df[colname_i] = np.reshape(dataFromThisSparseColumn, -1)
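Building on that, here is a minimal sketch that reads the file only once and converts the sparse columns in a single pass; dense_columns is a hypothetical list holding the two dense column names from the question:
data = pd.read_csv('file.csv', header=0)
df = data[dense_columns].copy()
for colname_i in names_of_sparse_columns:
    df[colname_i] = pd.arrays.SparseArray(data[colname_i].to_numpy())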
Empirically it seems that whenever you set_index on a Dask dataframe, Dask will always put rows with equal indexes into a single partition, even if it results in wildly imbalanced partitions.
Here is a demonstration:
import pandas as pd
import dask.dataframe as dd
users = [1]*1000 + [2]*1000 + [3]*1000
df = pd.DataFrame({'user': users})
ddf = dd.from_pandas(df, npartitions=1000)
ddf = ddf.set_index('user')
counts = ddf.map_partitions(lambda x: len(x)).compute()
counts.loc[counts > 0]
# 500 1000
# 999 2000
# dtype: int64
However, I found no guarantee of this behaviour anywhere.
I have tried to sift through the code myself but gave up. I believe one of these inter-related functions probably holds the answer:
set_index
set_partitions
rearrange_by_column
rearrange_by_column_tasks
SimpleShuffleLayer
When you set_index, is it the case that a single index can never be in two different partitions? If not, then under what conditions does this property hold?
Bounty: I will award the bounty to an answer that draws from a reputable source. For example, referring to the implementation to show that this property has to hold.
is it the case that a single index can never be in two different partitions?
No, it's certainly allowed. Dask will even intend for this to happen. However, because of a bug in set_index, all the data will still end up in one partition.
An extreme example (every row is the same value except one):
In [1]: import dask.dataframe as dd
In [2]: import pandas as pd
In [3]: df = pd.DataFrame({"A": [0] + [1] * 20})
In [4]: ddf = dd.from_pandas(df, npartitions=10)
In [5]: s = ddf.set_index("A")
In [6]: s.divisions
Out[6]: (0, 0, 0, 0, 0, 0, 0, 1)
As you can see, Dask intends for the 0s to be split up between multiple partitions. Yet when the shuffle actually happens, all the 0s still end up in one partition:
In [7]: import dask
In [8]: dask.compute(s.to_delayed()) # easy way to see the partitions separately
Out[8]:
([Empty DataFrame
Columns: []
Index: [],
Empty DataFrame
Columns: []
Index: [],
Empty DataFrame
Columns: []
Index: [],
Empty DataFrame
Columns: []
Index: [],
Empty DataFrame
Columns: []
Index: [],
Empty DataFrame
Columns: []
Index: [],
Empty DataFrame
Columns: []
Index: [0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]],)
This is because the code deciding which output partition a row belongs to doesn't consider duplicates in divisions. Treating divisions as a Series, it uses searchsorted with side="right", hence why all the data always ends up in the last partition.
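A toy illustration of that effect (this is not Dask's actual code, just numpy's searchsorted applied to the duplicated divisions shown above):
import numpy as np

divisions = np.array([0, 0, 0, 0, 0, 0, 0, 1])  # from Out[6] above
rows = np.array([0, 0, 1, 1])
# side="right" returns the same position for every duplicated boundary value,
# so the 0s are never spread across the earlier partitions.
print(np.searchsorted(divisions, rows, side="right"))  # [7 7 8 8]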
I'll update this answer when the issue is fixed.
Is it the case that a single index can never be in two different partitions?
IIUC, the answer for practical purposes is yes.
A dask dataframe will in general have multiple partitions and dask may or may not know about the index values associated with each partition (see Partitions). If dask does know which partition contains which index range, then this will be reflected in df.divisions output (if not, the result of this call will be None).
When running .set_index, dask will compute divisions and it seems that in determining the divisions it will require that divisions are sequential and unique (except for the last element). The relevant code is here.
So two potential follow-up questions: why not allow any non-sequential indexing, and as a specific case of the previous, why not allow duplicate indexes in partitions.
With regards to the first question: for smallish data it might be feasible to think about a design that allows non-sorted indexing, but you can imagine that a general non-sorted indexing won't scale well, since dask will need to store indexes for each partition somehow.
With regards to the second question: it seems that this should be possible, but it also seems that right now it's not implemented correctly. See the snippet below:
# use this to generate 10 indexed partitions
import pandas as pd
for user in range(10):
    df = pd.DataFrame({'user_col': [user//3]*100})
    df['user'] = df['user_col']
    df = df.set_index('user')
    df.index.name = 'user_index'
    df.to_parquet(f'test_{user}.parquet', index=True)
# now load them into a dask dataframe
import dask.dataframe as dd
ddf = dd.read_parquet('test_*.parquet')
# dask will know about the divisions
print(ddf.known_divisions) # True
# further evidence
print(ddf.divisions) # (0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3)
# this should show three partitions, but will show only one
print(ddf.loc[0].npartitions) # 1
I have just noticed that Dask's documentation for shuffle says
After this operation, rows with the same value of on will be in the same partition.
This seems to confirm my empirical observation.
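A quick way to check that documented behaviour on the toy frame from the question (a sketch, assuming a reasonably recent dask; it reuses df and dd from the snippet above):
shuffled = dd.from_pandas(df, npartitions=1000).shuffle('user')
counts = shuffled.map_partitions(len).compute()
print(counts.loc[counts > 0])  # each user's rows land in a single partition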
I'm a novice in Python.
I'm writing up a jupyter notebook for data analysis which is supposed to work on already provided datafiles.
These datafiles (.txt) each contain a large table of floats, with delimiter ' '. They are ugly in the sense that they have relatively few rows (~2k) and a lot of columns (~100k).
The "single-file" detailed analysis works fine (I have more than enough RAM to load one of these files entirely in memory, e.g. via np.loadtxt(), and work on it); but I then wanted to attempt a multi-file cross analysis in which I would be only interested in the last column of each file. I cannot find a fast/efficient/nice way of doing this.
What I can do is to np.loadtxt() these files one at a time, then each time copy the last column of the resulting array and delete the rest; and repeat. This is painfully slow but it's working. I was wondering if I could do better!
I also tried this, inspired by something I saw searching the web:
data = []
for i in range(N_istar):
    for j in range(N_col_pos):
        with open(filename(i,j), 'r') as f:
            lastcol = []
            line = f.readline()
            while line:
                sp = line.split()
                lastcol.append(sp[-1])
            data.append(lastcol)
but this either goes on forever or takes a ridiculous amount of time.
Any suggestions?
You can use pandas read_csv(usecols=). You must know the index or name of the column. The code is clean and short; see the example below.
In case you do not know the index of the last column, you can read the first row and count the number of separators.
Example
test.csv
a b c d
0 1 2 3
2 4 6 8
python code
import pandas as pd
seperator = r"\s*" # default this will be ",". Using a regex does make it slower.
# column names
pd.read_csv('test.csv', sep=seperator, usecols=['d'])
# column index
pd.read_csv('test.csv', sep=seperator, header=None, usecols=[3])
# Unknown number of columns
with open('test.csv') as current_file:
    last_column_index = len(current_file.readline().split()) - 1
pd.read_csv('test.csv', sep=seperator, header=None, usecols=[last_column_index])
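For the original multi-file case, this could be combined with the question's own loop; a rough sketch, where N_istar, N_col_pos and filename(i, j) come from the question's code and the files are assumed to be whitespace-separated with no header:
last_cols = []
for i in range(N_istar):
    for j in range(N_col_pos):
        # find the index of the last column from the first line
        with open(filename(i, j)) as f:
            last_index = len(f.readline().split()) - 1
        col = pd.read_csv(filename(i, j), sep=r"\s+", header=None,
                          usecols=[last_index])
        last_cols.append(col.iloc[:, 0].to_numpy())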
You were on the right track with np.loadtxt. Files are usually read and processed quickly in Python using either np or pd.
The best trick would be to read each file using pandas/numpy and concatenate the last columns (if you have enough memory). If you run out of memory, drop the columns you don't use and keep only the last one.
Just to give you the right direction:
df1 = pd.DataFrame({'A': np.random.choice([1,2,3,4,5,6], size=5),
                    'B': np.random.choice([1,2,3,4,5,6,7,8], size=5)})
print(df1)
df2 = pd.DataFrame({'A': np.random.choice([1,2,3,4,5,6], size=5),
                    'B': np.random.choice([1,2,3,4,5,6,7,8], size=5)})
print(df2)
conc_df = pd.concat([df1['B'], df2['B']])
print(conc_df.head(10))
NumPy also provides a reshape() method.
ex: f.csv
1,2,3
4,5,6
7,8,9
code
import numpy as np
data = np.loadtxt('f.csv', delimiter=',')
last_col = data[:, -1:].reshape(1,-1)[0]
result
>>>last_col.tolist()
[3.0, 6.0, 9.0]
I have the below data which I store in a csv (df_sample.csv). I have the column names in a list called cols_list.
df_data_sample:
df_data_sample = pd.DataFrame({
    'new_video':['BASE','SHIVER','PREFER','BASE+','BASE+','EVAL','EVAL','PREFER','ECON','EVAL'],
    'ord_m1':[0,1,1,0,0,0,1,0,1,0],
    'rev_m1':[0,0,25.26,0,0,9.91,'NA',0,0,0],
    'equip_m1':[0,0,0,'NA',24.9,20,76.71,57.21,0,12.86],
    'oev_m1':[3.75,8.81,9.95,9.8,0,0,'NA',10,56.79,30],
    'irev_m1':['NA',19.95,0,0,4.95,0,0,29.95,'NA',13.95]
})
attribute_dict = {
    'new_video': 'CAT',
    'ord_m1':'NUM',
    'rev_m1':'NUM',
    'equip_m1':'NUM',
    'oev_m1':'NUM',
    'irev_m1':'NUM'
}
Then I read each column and do some data processing as below:
cols_list = df_data_sample.columns
# Write to csv.
df_data_sample.to_csv("df_seg_sample.csv", index=False)
#df_data_sample = pd.read_csv("df_seg_sample.csv")

# Create empty dataframe to hold final processed data for each income level.
df_final = pd.DataFrame()

# Read in each column, process, and write to a csv - using csv module
for column in cols_list:
    df_column = pd.read_csv('df_seg_sample.csv', usecols=[column], delimiter=',')
    if (((attribute_dict[column] == 'CAT') & (df_column[column].unique().size <= 100)) == True):
        df_target_attribute = pd.get_dummies(df_column[column], dummy_na=True, prefix=column)
        # Check and remove duplicate columns if any:
        df_target_attribute = df_target_attribute.loc[:, ~df_target_attribute.columns.duplicated()]
        for target_column in list(df_target_attribute.columns):
            # If variance of the dummy created is zero: append it to a list and print to log file.
            if ((np.var(df_target_attribute[[target_column]])[0] != 0) == True):
                df_final[target_column] = df_target_attribute[[target_column]]
    elif (attribute_dict[column] == 'NUM'):
        # Let's impute with 0 for numeric variables:
        df_target_attribute = df_column
        df_target_attribute.fillna(value=0, inplace=True)
        df_final[column] = df_target_attribute
attribute_dict is a dictionary containing the mapping of variable name to variable type:
{
    'new_video': 'CAT',
    'ord_m1': 'NUM',
    'rev_m1': 'NUM',
    'equip_m1': 'NUM',
    'oev_m1': 'NUM',
    'irev_m1': 'NUM'
}
However, this column-by-column operation takes a long time to run on a dataset of size (5 million rows x 3400 columns). Currently the run time is approximately 12+ hours.
I want to reduce this as much as possible, and one of the ways I can think of is to process all NUM columns at once and then go column by column for the CAT variables.
However, I am neither sure how to achieve this in Python nor whether it will really speed up the process.
Can someone kindly help me out!
For numeric columns it is simple:
num_cols = [k for k, v in attribute_dict.items() if v == 'NUM']
print (num_cols)
['ord_m1', 'rev_m1', 'equip_m1', 'oev_m1', 'irev_m1']
df1 = pd.read_csv('df_seg_sample.csv', usecols=num_cols).fillna(0)
But the first part of the code is the performance problem, especially get_dummies called for 5 million rows:
df_target_attribute = pd.get_dummies(df_column[column], dummy_na=True, prefix=column)
Unfortunately, there is a problem with processing get_dummies in chunks.
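One way around that could be to handle the CAT columns in a single pass as well, instead of chunking get_dummies; a rough sketch, reusing df1 from above, with filters that roughly mirror the original loop (at most 100 distinct values, non-zero variance):
cat_cols = [k for k, v in attribute_dict.items() if v == 'CAT']
df_cat = pd.read_csv('df_seg_sample.csv', usecols=cat_cols)
# keep only categorical columns with at most 100 distinct values
df_cat = df_cat.loc[:, df_cat.nunique() <= 100]
dummies = pd.get_dummies(df_cat, dummy_na=True)
# drop zero-variance dummies, as the original loop does
dummies = dummies.loc[:, dummies.var() != 0]
df_final = pd.concat([df1, dummies], axis=1)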
There are three things I would advise to speed up your computations:
1. Take a look at pandas' HDF5 capabilities. HDF is a binary file format for fast reading and writing of data to disk.
2. I would read in bigger chunks (several columns) of your csv file at once, depending on how big your memory is.
3. There are many pandas operations you can apply to every column at once, for example nunique(), which gives you the number of unique values, so you don't need unique().size. With these column-wise operations you can easily filter columns by selecting with a boolean vector. For example:
df = df.loc[:, df.nunique() > 100]
# filter out every column with 100 or fewer unique values
Also this answer from the author of pandas on large data workflow might be interesting for you.
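A minimal sketch of point 1, assuming PyTables is installed (the HDF5 file name and key are made up here):
# write once in a queryable ("table") format...
df_data_sample.to_hdf('df_seg_sample.h5', key='data', mode='w', format='table')
# ...then read back only the columns you need
subset = pd.read_hdf('df_seg_sample.h5', key='data', columns=['ord_m1', 'rev_m1'])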
I have what I assumed would be a super basic problem, but I'm unable to find a solution. The short version is that I have a column in a csv that is a list of numbers. This csv was generated by pandas with to_csv. When trying to read it back in with read_csv, the list of numbers is automatically converted into a string.
When I then try to use it, I obviously get errors. When I try using the to_numeric function I get errors as well, because it is a list, not a single number.
Is there any way to solve this? Posting code below for form, but probably not extremely helpful:
def write_func(dataset):
    features = featurize_list(dataset[column])  # Returns numpy array
    new_dataset = dataset.copy()  # Don't want to modify the underlying dataframe
    new_dataset['Text'] = features
    new_dataset.rename(columns={'Text': 'Features'}, inplace=True)
    write(new_dataset, dataset_name)

def write(new_dataset, dataset_name):
    dump_location = feature_set_location(dataset_name, self)
    new_dataset.to_csv(dump_location)

def read_func(read_location):
    df = pd.read_csv(read_location)
    df['Features'] = df['Features'].apply(pd.to_numeric)
The Features column is the one in question. When I attempt to run the apply currently in read_func I get this error:
ValueError: Unable to parse string "[0.019636873200000002, 0.10695576670000001,...]" at position 0
I can't be the first person to run into this issue; is there some way to handle this at read/write time?
You want to use literal_eval as a converter passed to pd.read_csv. Below is an example of how that works.
from ast import literal_eval
from io import StringIO
import pandas as pd
txt = """col1|col2
a|[1,2,3]
b|[4,5,6]"""
df = pd.read_csv(StringIO(txt), sep='|', converters=dict(col2=literal_eval))
print(df)
col1 col2
0 a [1, 2, 3]
1 b [4, 5, 6]
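Applied to the question's read_func, a minimal sketch might look like this (read_location and the Features column name are taken from the question):
from ast import literal_eval

def read_func(read_location):
    df = pd.read_csv(read_location, converters={'Features': literal_eval})
    return df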
I have modified your last function a bit and it works fine.
def read_func(read_location):
    df = pd.read_csv(read_location)
    df['Features'] = df['Features'].apply(lambda x: pd.to_numeric(x))