Creating two lists from one randomly - python

I'm using pandas to import a lot of data from a CSV file, and once read I format it to contain only numerical data. This then returns a list within a list. Each list then contains around 140k bits of data. numericalData[][].
From this list, I wish to create Testing and Training Data. For my testing data, I want to have 30% of my read data numericalData, so I use this following bit of code;
testingAmount = len(numericalData0[0]) * trainingDataPercentage / 100
Works a treat. Then, I use numpy to select that amount of data from each column of my imported numericalData;
testingData.append(np.random.choice(numericalData[x], testingAmount) )
This then returns a sample with 38 columns (running in a loop), where each column has around 49k elements of data randomly selected from my imported numericalData.
The issue is, my trainingData needs to hold the other 70% of the data, but I'm unsure on how to do this. I've tried to compare each element in my testingData, and if both elements aren't equal, then add it to my trainingData. This resulted in an error and didn't work. Next, I tried to delete the selected testingData from my imported data, and then save that new column to my trainingData, alas, that didn't work eiher.
I've only been working with python for the past week so I'm a bit lost on what to try now.

You can use random.shuffle and split list after that. For toy example:
import random
data = range(1, 11)
random.shuffle(data)
training = data[:5]
testing = data[5:]
To get more information, read the docs.

Related

BatchDataset not subscriptable when trying to format Python dictionary as table

I'm working through the TensorFlow Load pandas.DataFrame tutorial, and I'm trying to modify the output from a code snippet that creates the dictionary slices:
dict_slices = tf.data.Dataset.from_tensor_slices((df.to_dict('list'), target.values)).batch(16)
for dict_slice in dict_slices.take(1):
print (dict_slice)
I find the following output sloppy, and I want to put it into a more readable table format.
I tried to format the for loop, based on this recommendation
Which gave me the error that the BatchDataset was not subscriptable
Then I tried to use the range and leng function on the dict_slices, so that i would be an integer index and not a slice
Which gave me the following error (as I understand, because the dict_slices is still an array, and each iteration is one vector of the array, not one index of the vector):
Refer here for solution. To summarize we need to use as_numpy_iterator
example = list(dict_slices.as_numpy_iterator())
example[0]['age']
BatchDataset is a tf.data.Dataset instance that has been batches by calling it's .batch(..) method. You cannot "index" a tensorflow Dataset, or call the len function on it. I suggest iterating through it like you did in the first code snippet.
However in your dataset you are using .to_dict('list'), which means that a key in your dictionary is mapped to a list as value. Basically you have "columns" for every key and not rows, is this what you want? This would make printing line-by-line (shown in the table printing example you linked) a lot more difficult, since you do not have different features in a row. Also it is different from the example in the official Tensorflow code, where a datapoint consists of multiple features, and not one feature with multiple values.
Combining the Tensorflow code and pretty printing:
columns = list(df.columns.values)+['target']
dict_slices = tf.data.Dataset.from_tensor_slices((df.values, target.values)).batch(1) # batch = 1 because otherwise you will get multiple dict_slice - target pairs in one iteration below!
print(*columns, sep='\t')
for dict_slice, target in dict_slices.take(1):
print(*dict_slice.numpy(), target.numpy(), sep='\t')
This needs a bit of formatting, because column widths are not equal.

Dataframe Generator Based On Conditions Pandas

I have manually created a bunch of dataframes to later concatenate back together based on a list of bigrams I have(my reason for doing this is out of the scope of this question). The problem is, I want to set this code to run daily or weekly and the manually created dataframes I have created will no longer work if the data has changed once refreshed. For instance, looking at the code below, what if "data_science," is no longer a bigram being pulled from my code next week and I have another bigram like "hello_world," that is not listed below in my code. I need to set up one function that will do all of these for me. I have about 50 dataframes I am making from my real data so even without the automation purposes, it would be a huge time saver to get a function going for this. One KEY point to make is that I am grabbing all of these bigrams from a list and naming a dataframe for each one of them. My function below with the list_input is what I am using that for.
data_science = df[df['column_name'].str.contains("data") &
df['column_name'].str.contains("science")]
data_science['bigram'] = "(data_science)"
p_value = df[df['column_name'].str.contains("p") &
df['column_name'].str.contains("value")]
p_value['bigram'] = "(p_value)"
ab_testing = df[df['column_name'].str.contains("ab") &
df['column_name'].str.contains("testing")]
ab_testing['bigram'] = "(ab_texting)"```
I am trying something like this code below but have not figured out how to make it work yet.
```def df_creator(a,b, my_list):
for a,b in my_list:
a_b = df[df['Message_stop'].str.contains(a) &
df['Message_stop'].str.contains(b)]
a_b['bigram'] = "(a_b)"```

How to efficiently delete multiple columns in a huge dataframe Python

I have a dataframe that consists of 75750 columns.
I'm trying to automatically grab 5 specific columns because I need the data from each of those 5 columns to generate a plot.
I'm using a for loop which is incredibly slow.
The max_list contains 5 labels, that are generated, so I don't know what columns each label might refer to in the huge data frame. So the columns can't be selected manually or be known before the max_list is generated.
max_list = ["column7000", "column200", "column15000", "column30", "column2"]
for i in max_frame.columns:
if i not in max_list:
del max_frame[i]
The code works, but it takes foreveeeer! And no other code will run until it's finished running.
I've tried to get cython, but it won't work properly. I'm using the latest version of Jupyter notebook with Python 3.6.
Any help would be greatly appreciated.
Understand a bit of the problem, Suppose we want to slice all the columns except the columns in the max_list and we may be having many columns and rows in a dataset.
During iteration we are going to remove the item which is not in the list and add to the desired new list.
max_list = ["column7000", "column200", "column15000", "column30", "column2"]
max_frame_1 = max_frame[:] # let's take a copy of actual dataset
desired = [max_frame_1.remove(item) for item in max_frame_1 if not in max_list]
If this works, hope this is the shortest and quick method.
Moreover, when we have lots of data and the workout is less, we need to try to be as simple as possible.

Deleting rows of data for multiple variables

I have over 500 files that I cleaned up using a pandas data frame, and read in later as a matrix. I now want to delete missing rows of data from multiple variables for the entirety of my files. Each variable is pretty lengthy for its shape, for example, tc and wspd have the shape (84479, 558) and pressure has the shape (558,). I have tried the following example before and has worked in the past for single dimensional arrays with the same shape, but will no longer work with a two dimensional array.
bad=[]
for i in range(len(p)):
if p[i]==-9999 or tc[i]==-9999:
bad.append(i)
p=numpy.delete(p, bad)
tc=numpy.delete(tc, bad)
I tried using the following code instead but with no success (unfortunately).
import numpy as n
import pandas as pd
wspd=pd.read_pickle('/home/wspd').as_matrix()
tc=pd.read_pickle('/home/tc').as_matrix()
press=n.load('/home/file1.npz')
p=press['press']
names=press['names']
length=n.arange(0,84479)
for i in range(len(names[0])): #using the first one as a trial to run faster
print i #used later to see how far we have come in the 558 files
bad=[]
for j in range(len(length)):
if (wspd[j,i]==n.nan or tc[j,i]==n.nan):
bad.append(j)
print bad
From there I plan on deleting missing data as I had done previously except indexing which dimension I am deleting from within my first forloop.
new_tc=n.delete(tc[j,:], bad)
Unfortunately, this has not worked. I have also tried masking the array which also has not worked.
The reason I need to delete the data is my next library does not understand nan values, it requires strictly integers, floats, etc.
I am open to new methods for removing rows of data if anyone has any guidance. I greatly appreciate it.
I would load your 2 dimensional arrays as pandas DataFrames and then use the dropna function to drop any rows that contain a null value
wspd = pd.read_pickle('/home/wspd').dropna()
tc = pd.read_pickle('/home/tc').dropna()
The documentation for pandas.DataFrame.dropna is here

Looking for a quick way to speed up my code

I am looking for a way to speed up my code. I managed to speed up most parts of my code, reducing runtime to about 10 hours, but it's still not fast enough and since I'm running out of time I'm looking for a quick way to optimize my code.
An example:
text = pd.read_csv(os.path.join(dir,"text.csv"),chunksize = 5000)
new_text = [np.array(chunk)[:,2] for chunk in text]
new_text = list(itertools.chain.from_iterable(new_text))
In the code above I read in about 6 million rows of text documents in chunks and flatten them. This code takes about 3-4 hours to execute. This is the main bottleneck of my program. edit: I realized that I wasn't very clear on what the main issue was, The flattening is the part which takes the most amount of time.
Also this part of my program takes a long time:
train_dict = dict(izip(text,labels))
result = [train_dict[test[sample]] if test[sample] in train_dict else predictions[sample] for sample in xrange(len(predictions))]
The code above first zips the text documents with their corresponding labels (This a machine learning task, with the train_dict being the training set). Earlier in the program I generated predictions on a test set. There are duplicates between my train and test set so I need to find those duplicates. Therefore, I need to iterate over my test set row by row (2 million rows in total), when I find a duplicate I actually don't want to use the predicted label, but the label from the duplicate in the train_dict. I assign the result of this iteration to the variable result in the above code.
I heard there are various libraries in python that could speed up parts of your code, but I don't know which of those could do the job and right know I do not have the time to investigate this, that is why I need someone to point me in the right direction. Is there a way with which I could speed the code snippets above up?
edit2
I have investigated again. And it is definitely a memory issue. I tried to read the file in a row by row manner and after a while the speed declined dramatically, furthermore my ram usage is nearly 100%, and python's disk usage increased sharply. How can I decrease the memory footprint? Or should I find a way to make sure that I don't hold everything into memory?
edit3
As memory is the main issue of my problems I'll give an outline of a part of my program. I have dropped the predictions for the time being, which reduced the complexity of my program significantly, instead I insert a standard sample for every non duplicate in my test set.
import numpy as np
import pandas as pd
import itertools
import os
train = pd.read_csv(os.path.join(dir,"Train.csv"),chunksize = 5000)
train_2 = pd.read_csv(os.path.join(dir,"Train.csv"),chunksize = 5000)
test = pd.read_csv(os.path.join(dir,"Test.csv"), chunksize = 80000)
sample = list(np.array(pd.read_csv(os.path.join(dir,"Samples.csv"))[:,2]))#this file is only 70mb
sample = sample[1]
test_set = [np.array(chunk)[:,2] for chunk in test]
test_set = list(itertools.chain.from_iterable(test_set))
train_set = [np.array(chunk)[:,2] for chunk in train]
train_set = list(itertools.chain.from_iterable(train_set))
labels = [np.array(chunk)[:,3] for chunk in train_2]
labels = list(itertools.chain.from_iterable(labels))
"""zipping train and labels"""
train_dict = dict(izip(train,labels))
"""finding duplicates"""
results = [train_dict[test[item]] if test[item] in train_dict else sample for item in xrange(len(test))]
Although this isn't my entire program, this is the part of my code that needs optimization. As you can see I am only using three important modules in this part, pandas, numpy and itertools. The memory issues arise when flattening train_set and test_set. The only thing I am doing is reading in the files, getting the necessary parts zipping the train documents with the corresponding labels in a dictionary. And then search for duplicates.
edit 4
As requested I'll give an explanation of my data sets. My Train.csv contains 4 columns. The first columns contain ID's for every sample, the second column contains titles and the third column contains text body samples(varying from 100-700 words). The fourth column contains category labels. Test.csv contains only the ID's and text bodies and titles. The columns are separated by commas.
Could you please post a dummy sample data set of a half dozen rows or so?
I can't quite see what your code is doing and I'm not a Pandas expert, but I think we can greatly speed up this code. It reads all the data into memory and then keeps re-copying the data to various places.
By writing "lazy" code we should be able to avoid all the re-copying. The ideal would be to read one line, transform it as we want, and store it into its final destination. Also this code uses indexing when it should be just iterating over values; we can pick up some speed there too.
Is the code you posted your actual code, or something you made just to post here? It appears to contain some mistakes so I am not sure what it actually does. In particular, train and labels would appear to contain identical data.
I'll check back and see if you have posted sample data. If so I can probably write "lazy" code for you that will have less re-copying of data and will be faster.
EDIT: Based on your new information, here's my dummy data:
id,title,body,category_labels
0,greeting,hello,noun
1,affirm,yes,verb
2,deny,no,verb
Here is the code that reads the above:
def get_train_data(training_file):
with open(training_file, "rt") as f:
next(f) # throw away "headers" in first line
for line in f:
lst = line.rstrip('\n').split(',')
# lst contains: id,title,body,category_labels
yield (lst[1],lst[2])
train_dict = dict(get_train_data("data.csv"))
And here is a faster way to build results:
results = [train_dict.get(x, sample) for x in test]
Instead of repeatedly indexing test to find the next item, we just iterate over the values in test. The dict.get() method handles the if x in train_dict test we need.
You can try Cython. It supports numpy and can give you a nice speedup.
Here is an introduction and explanation of what needs to be done
http://www.youtube.com/watch?v=Iw9-GckD-gQ
If the order of your rows is not important you can use sets to find elements that are in train set but not in test set (intersection trainset & testset) and add them first to your "result" and after that use set difference (testset-trainset) to add elements that are in your test set but not in the train set. This will allow to save on checking if sample is in trainset.

Categories