How to efficiently delete multiple columns in a huge dataframe in Python

I have a dataframe that consists of 75750 columns.
I'm trying to automatically grab 5 specific columns, because I need the data from each of those 5 columns to generate a plot.
I'm using a for loop, which is incredibly slow.
max_list contains 5 labels that are generated at runtime, so I don't know in advance which columns of the huge dataframe each label refers to; the columns can't be selected manually or known before max_list is generated.
```
max_list = ["column7000", "column200", "column15000", "column30", "column2"]
for i in max_frame.columns:
    if i not in max_list:
        del max_frame[i]
```
The code works, but it takes forever, and no other code will run until it's finished.
I've tried Cython, but I couldn't get it to work properly. I'm using the latest version of Jupyter Notebook with Python 3.6.
Any help would be greatly appreciated.

To restate the problem: we want to keep only the columns that appear in max_list and drop everything else, from a dataset that may have many columns and rows.
Rather than removing unwanted columns one at a time during iteration, select all of the desired columns in a single step:
```
max_list = ["column7000", "column200", "column15000", "column30", "column2"]
desired = max_frame[max_list].copy()  # keep only the wanted columns; .copy() makes it independent of max_frame
```
This is the shortest and quickest method. When there is a lot of data, keep the operation as simple as possible.
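If some generated labels might not exist as columns (they are produced automatically, after all), a guarded variant avoids a KeyError. This is a minimal sketch, assuming max_frame is the large dataframe from the question:
```
# Keep only the labels that actually exist as columns.
present = [c for c in max_list if c in max_frame.columns]
desired = max_frame[present].copy()

# Equivalent alternative: drop every other column in one call.
# desired = max_frame.drop(columns=max_frame.columns.difference(max_list))
```
Either form touches the data once, instead of deleting 75745 columns one by one.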

Related

Pandas - value changes when adding a new column to a dataframe

I'm trying to add a new column to a dataframe using the following code.
```
labels = df1['labels']
df2['labels'] = labels
```
However, later in my program I found that something might be wrong with the assignment. So I checked it, immediately after the assignment, with
```
labels.equals(other=df2['labels'])
```
and got False.
I also tried to:
- print out parts of labels and df2, and it turns out some lines are indeed different
- check the max and min values of both series, and they are different
- check the number of unique values in both series using len(set(labels)) and len(set(df2['labels'])), and they differ a lot
- test with a smaller amount of data, which works totally fine
My dataframe is rather large (40 million+ rows), so I cannot print it all out and check the values. Does anyone have an idea what might cause this kind of problem, or suggestions for further tests?
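One likely cause, though not confirmed in the question, is index alignment: when you assign a Series into a dataframe, pandas aligns it on the index, so if df1 and df2 have different indexes the values get reordered (or become NaN where an index label is missing). A minimal sketch of the effect and a common fix, using tiny stand-ins for df1 and df2:
```
import pandas as pd

df1 = pd.DataFrame({'labels': [10, 20, 30]}, index=[0, 1, 2])
df2 = pd.DataFrame({'x': [1, 2, 3]}, index=[2, 1, 0])  # same index labels, different order

df2['labels'] = df1['labels']               # aligned on index: rows receive 30, 20, 10
print(df1['labels'].equals(df2['labels']))  # False, as in the question

df2['labels'] = df1['labels'].values        # copy by position, ignoring the index
print(list(df2['labels']))                  # [10, 20, 30]
```
With 40 million rows, a reset or reordered index somewhere upstream would produce exactly the symptoms described.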

Working with .csv data as a Pandas DataFrame, getting redundancy error when applying logic

Been working on this project all day and it's destroying me. I have finished the web scraping and have a final .csv which contains the elements of a pandas dataframe. Working with this dataframe in a new file, I currently have the following:
```
df = pd.read_csv('active_homes.csv')

for i in range(len(df)):
    add = df['Address'][i]
    price = df['Price'][i]
    if (price < 100000) == True:
        print(price)
```
'active_homes.csv' looks like this:
```
Address,Status,Price,Meta
"387 8th St, Burlington, CO 80807",For Sale,169500,"4bed2bath1,560sqft"
```
and the resulting df's shape is (1764, 4).
This should, in theory, print the price for every row where price < 100000. In practice, it prints the price of a single house over and over.
I have confirmed that at each iteration of the above for loop it collects the correct 'Price' and 'Address' information, and that the logic (price < 100000) evaluates correctly at each step. Yet it still prints the output above. I originally tried to simply drop the rows of the dataframe below 100000, but that didn't do anything. I also tried to reassign the data to a new dataframe, which would either return an empty dataframe or a dataframe filled with duplicate data for one house (with a 'Price' of 58900).
So far I believe the program is recognizing the correct number of houses below 100000, but for some reason the assignment sticks on that one address. It does the same thing without assignment, as in:
```
for i in range(len(df)):
    if (df['Price'][i] < 100000) == True:
        print(df['Price'][i])
```
Any help in identifying the error would be much appreciated.
With pandas you should almost never iterate rows the traditional Python way. Instead, you can achieve the desired result with a boolean mask:
```
df = pd.read_csv('active_homes.csv')
temp_df = df[df["Price"] < 100000]  # initiating a new df isn't required, just a force of habit
print(temp_df["Price"])             # displays a Series of the houses below 100K; imo a prettier print
```
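Since the question also mentions wanting to drop the cheap rows rather than print them, the same mask handles that too. A minimal sketch, assuming the same active_homes.csv layout:
```
import pandas as pd

df = pd.read_csv('active_homes.csv')

# Keep only the rows at or above 100K; the boolean mask selects rows without any loop.
df = df[df['Price'] >= 100000].reset_index(drop=True)
print(df.shape)
```
The mask df['Price'] >= 100000 is evaluated for the whole column at once, which is both faster and sidesteps the manual indexing in the loop.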

Is there a difference between a Series and a 1-dimensional dataframe in Python

I have the following code to address columns and rows in a dataframe in Python:
```
y_train = features.iloc[start:end][[1]]
y_train_noDoppleBracket = features.iloc[start:end][1]
y_train_noIloc = features[start:end][[1]]
y_train_noIloc_noDoppleBracket = features[start:end][1]
```
In the cases without double brackets I get a Series of size (300693,), and in the cases with double brackets I get a dataframe of size (300693, 1). This also holds for the iloc examples. However, if I look at them in the Variable Explorer of Spyder, they look exactly the same. So is there a difference between them? And why do I get a dataframe when using double brackets but only a Series when using single brackets?
I'd appreciate every comment.
Reminder: as I still do not understand whether there is a difference between them, I would like to remind you of my question (the comments say yes, but in the Variable Explorer they look exactly the same). I'd be quite happy for your help.
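There is a real difference, even though Spyder's Variable Explorer renders them alike: single brackets return a pandas Series (one-dimensional), while double brackets return a DataFrame with a single column (two-dimensional). A minimal sketch with a toy frame standing in for features:
```
import pandas as pd

features = pd.DataFrame({0: [1, 2, 3], 1: [4, 5, 6]})

single = features[1]    # Series, shape (3,)
double = features[[1]]  # DataFrame with one column, shape (3, 1)

print(type(single), single.shape)  # <class 'pandas.core.series.Series'> (3,)
print(type(double), double.shape)  # <class 'pandas.core.frame.DataFrame'> (3, 1)
```
The distinction matters in practice: many APIs expect a 1-D Series or array for a target variable, and a (n, 1) DataFrame may trigger warnings or need reshaping.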

Dataframe Generator Based On Conditions Pandas

I have manually created a bunch of dataframes to later concatenate back together, based on a list of bigrams I have (my reasons for doing this are out of the scope of this question). The problem is that I want this code to run daily or weekly, and the manually created dataframes will no longer work once the data has been refreshed. For instance, looking at the code below, what if "data_science" is no longer a bigram being pulled from my code next week, and I have another bigram like "hello_world" that is not listed in my code? I need to set up one function that will do all of this for me. I am making about 50 dataframes from my real data, so even without the automation, a function would be a huge time saver. One key point: I am grabbing all of these bigrams from a list and naming a dataframe for each of them, which is what my_list in my function below is for.
```
data_science = df[df['column_name'].str.contains("data") &
                  df['column_name'].str.contains("science")]
data_science['bigram'] = "(data_science)"

p_value = df[df['column_name'].str.contains("p") &
             df['column_name'].str.contains("value")]
p_value['bigram'] = "(p_value)"

ab_testing = df[df['column_name'].str.contains("ab") &
                df['column_name'].str.contains("testing")]
ab_testing['bigram'] = "(ab_testing)"
```
I am trying something like the code below but have not figured out how to make it work yet.
```
def df_creator(df, my_list):
    frames = {}
    for a, b in my_list:
        a_b = df[df['Message_stop'].str.contains(a) &
                 df['Message_stop'].str.contains(b)].copy()
        a_b['bigram'] = f"({a}_{b})"  # build the label from the pair, not the literal string "a_b"
        frames[f"{a}_{b}"] = a_b
    return frames
```
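A usage sketch for the fixed function, assuming my_list holds (first_word, second_word) pairs and df has the Message_stop column from the question:
```
import pandas as pd

my_list = [("data", "science"), ("p", "value"), ("ab", "testing")]

frames = df_creator(df, my_list)                          # one labeled dataframe per bigram
combined = pd.concat(frames.values(), ignore_index=True)  # concatenate back together
print(combined['bigram'].value_counts())
```
Returning a dict keyed by bigram name replaces the 50 manually named variables, and a new bigram like "hello_world" works automatically as long as its word pair appears in my_list.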

Creating two lists from one randomly

I'm using pandas to import a lot of data from a CSV file, and once read I format it to contain only numerical data. This returns a list of lists, where each inner list contains around 140k items: numericalData[][].
From this list, I wish to create testing and training data. For my testing data, I want to have 30% of my read data numericalData, so I use the following bit of code:
```
testingAmount = len(numericalData[0]) * trainingDataPercentage / 100
```
Works a treat. Then, I use numpy to select that amount of data from each column of my imported numericalData:
```
testingData.append(np.random.choice(numericalData[x], testingAmount))
```
This then returns a sample with 38 columns (running in a loop), where each column has around 49k elements randomly selected from my imported numericalData.
The issue is that my trainingData needs to hold the other 70% of the data, but I'm not sure how to do this. I tried comparing each element of my testingData and, if the elements weren't equal, adding them to my trainingData; that resulted in an error and didn't work. Next, I tried to delete the selected testingData from my imported data and then save that new column to my trainingData; alas, that didn't work either.
I've only been working with Python for the past week, so I'm a bit lost on what to try now.
You can use random.shuffle and split the list afterwards. A toy example:
```
import random

data = list(range(1, 11))  # in Python 3, range() must be turned into a list before shuffling
random.shuffle(data)       # shuffles in place
training = data[:5]
testing = data[5:]
```
To get more information, read the docs.
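For the question's actual setup there is one more subtlety: np.random.choice samples with replacement by default, so the testing data can contain duplicates, which would explain why comparing elements to build the remaining 70% failed. A duplicate-free 30/70 split can be done with a shuffled index instead. A minimal sketch, where column stands in for one inner list of numericalData:
```
import numpy as np

column = np.arange(140000)                # stand-in for one ~140k-element column

idx = np.random.permutation(len(column))  # every position exactly once, in random order
cut = int(len(column) * 30 / 100)         # 30% of the positions for testing

testing = column[idx[:cut]]
training = column[idx[cut:]]              # the remaining 70%, with no overlap
```
Reusing the same idx split for every column keeps the rows consistent across all 38 columns.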
