I am trying to generate a huge dataset in Python 3.6 using Pandas to test some code, but the method I came up with is too slow, and I would like to know if there is a more efficient way of doing it.
I create a smaller dataframe with a few columns and around 3 million rows (for example):
import pandas as pd

# assume we have relevant information in 'a', 'b', 'c'
data = pd.DataFrame(index=range(int(3e6)))
data['a'] = 0
data['b'] = 0
data['c'] = 0
Now I would like to take random rows of this dataframe and build a bigger one of approximately 15 million rows from them, with rows repeating, of course.
I tried taking samples and appending to the new dataframe like this:
data_tot = pd.DataFrame(columns=data.columns)
for i in range(int(15e6)):
    samp = data.sample(1)
    data_tot = data_tot.append(samp)
This looks very inefficient, but I have never had to generate such an amount of data before. I also tried preallocating the big dataframe and assigning with iloc, but it is still very slow.
You can use np.random.choice to generate random indices and then index the dataframe with them:

import numpy as np

idx = np.random.choice(len(data), int(15e6))
data_tot = data.iloc[idx, :]
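For what it's worth, a minimal sketch of an equivalent one-liner: pandas' own sample with replace=True draws all 15 million rows in a single call (the frame below is just a stand-in for the data frame from the question).

import pandas as pd

# stand-in for the 3-million-row frame from the question
data = pd.DataFrame(0, index=range(int(3e6)), columns=['a', 'b', 'c'])

# draw 15 million rows with replacement in one call
data_tot = data.sample(n=int(15e6), replace=True).reset_index(drop=True)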
I have a large pandas DataFrame consisting of 1 million rows, and I want to get the Levenshtein distance between every pair of entries in one column of the DataFrame. I tried merging the column with itself to generate the Cartesian product and then applying the Levenshtein distance function to the resulting columns, but this is too computationally expensive: it would require a dataframe of 1 trillion rows, and I'm working from a personal computer.
import pandas as pd

# dataframe with 1m rows
df = pd.read_csv('titles_dates_links.csv')
df1 = pd.DataFrame(df['title'])
df2 = pd.DataFrame(df['title'])

# df3 is just too big for me to work with, 1 trillion rows
df3 = df1.merge(df2, how='cross')

# something like this is the function I want to apply
df3['distance'] = df3.apply(lambda x: distance(x.title_x, x.title_y), axis=1)
I was thinking that a 1m x 1m matrix with each element as a pair of titles ("title 1", "title 2") would be cheaper, but I'm having a hard time getting that data structure right, and furthermore I don't know whether it is the right solution, since ultimately I'm just interested in calculating the distance between every possible combination of titles.
I've been trying to use pivot functions in Pandas but these require the complete dataset to exist in the first place, and the issue is that I can't generate the table that I would pivot off of, since it's too large with the approaches I've been trying.
Using product from itertools should work for your case as it generates everything lazily.
from itertools import product
titles = df['title'].tolist()
result = product(titles, titles)
And from there you can just iterate over the lazy iterator and apply your Levenshtein distance function :)
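A small usage sketch of that idea; the python-Levenshtein package and the toy title list are assumptions here, and any pairwise distance function would do in their place.

from itertools import product

from Levenshtein import distance  # assumption: pip install python-Levenshtein

titles = ["The Matrix", "The Matrix Reloaded", "Inception"]  # stand-in for df['title'].tolist()

# product is lazy, so the full Cartesian product is never materialised in memory
for a, b in product(titles, repeat=2):
    d = distance(a, b)
    if d <= 10:  # e.g. keep only reasonably similar pairs
        print(a, b, d)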
I am a newbie to Pandas, and somewhat of a newbie to Python.
I am looking at stock data, which I read in as CSV; a typical size is 500,000 rows.
The data has a datetime column plus floating-point price columns such as "Open", "High", and "Low" (sample omitted).
I need to check the data against itself - the basic algorithm is a loop similar to:
row = 0
x = get the "Low" price in row `row`
y = CalculateSomething(x)
go through the rest of the data, comparing against y:
    if (a):
        append "A" at the end of row `row` in the dataframe
    else:
        append "B" at the end of row `row`
row = row + 1
On the next iteration, the data pointer should reset to row 1 and go through the same process again.
Each time, it adds notes to the dataframe at the current row index.
I looked at Pandas and figured the way to try this would be to use two loops, copying the dataframe to maintain two separate instances.
The actual code looks like this (simplified):
import pandas as pd

df = pd.read_csv('data.csv')
calc1 = 1  # this part is confidential so set to something simple
calc2 = 2  # this part is confidential so set to something simple

def func3_df_index(df):
    dfouter = df.copy()
    for outerindex in dfouter.index:
        dfouter_openval = dfouter.at[outerindex, "Open"]
        for index in df.index:
            if df.at[index, "Low"] <= calc1 and index >= outerindex:
                dfouter.at[outerindex, 'notes'] = "message 1"
                break
            elif df.at[index, "High"] >= calc2 and index >= outerindex:
                dfouter.at[outerindex, 'notes'] = "message2"
                break
            else:
                dfouter.at[outerindex, 'notes'] = "message3"
This method is taking a long time (7+ minutes per 5,000 rows), which will be far too long for 500,000 rows; some datasets may exceed 1 million rows.
I have tried the two-loop method with the following variants:
using iloc - e.g. df.iloc[index, 2]
using at - e.g. df.at[index, "Low"]
using numpy & at - e.g. df.at[index, "Low"] = np.where((df.at[index, "Low"] < ..."
The data consists of floating-point values and a datetime string.
Is it better to use NumPy? Or is there an alternative to using two loops?
Any other methods - using R, Mongo, some other database, etc., i.e. something other than Python - would also be useful; I just need the results and am not necessarily tied to Python.
Any help and constructs would be greatly appreciated.
Thanks in advance.
You are copying the dataframe and manually looping over the indices. This will almost always be slower than vectorized operations.
If you only care about one row at a time, you can simply use the csv module.
numpy is not "better"; pandas internally uses numpy.
Alternatively, load the data into a database - for example SQLite, MySQL/MariaDB, PostgreSQL, or maybe DuckDB - and run query commands against that. This has the added advantage of converting the types from strings to floats, so numerical analysis is easier (a rough sketch follows below).
If you really want to process the file in parallel directly from Python, you could move to Dask or PySpark, although Pandas should work with some tuning; for a start, Pandas' read_sql function would work better than row-by-row loops.
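A rough sketch of that SQLite route; the file name, table name, and threshold are made up for illustration, and the column names follow the ones used in the question's code.

import sqlite3

import pandas as pd

# hypothetical file and table names, just for illustration
conn = sqlite3.connect("prices.db")
pd.read_csv("data.csv").to_sql("prices", conn, if_exists="replace", index=False)

# query only the rows and columns you need; numeric columns come back as floats
subset = pd.read_sql('SELECT "Open", "High", "Low" FROM prices WHERE "Low" <= 1.0', conn)
print(subset.head())
conn.close()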
You could split the main dataset into smaller datasets - e.g. 50 sub-datasets with 10,000 rows each - to increase speed. Run your functions on each sub-dataset using threading or concurrency and then combine the final results, as sketched below.
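A minimal sketch of that chunk-and-process pattern; process_chunk is a placeholder for the confidential per-row logic from the question, and since that logic compares each row against later rows, chunk boundaries would need extra care in practice.

import pandas as pd
from concurrent.futures import ProcessPoolExecutor

def process_chunk(chunk: pd.DataFrame) -> pd.DataFrame:
    # placeholder for the real (confidential) per-chunk work
    chunk = chunk.copy()
    chunk["notes"] = "message3"
    return chunk

def run(df: pd.DataFrame, chunk_size: int = 10_000) -> pd.DataFrame:
    # e.g. 50 sub-datasets of 10,000 rows each for a 500,000-row frame
    chunks = [df.iloc[i:i + chunk_size] for i in range(0, len(df), chunk_size)]
    with ProcessPoolExecutor() as pool:  # separate processes sidestep the GIL
        results = list(pool.map(process_chunk, chunks))
    return pd.concat(results)  # combine the final results

if __name__ == "__main__":
    out = run(pd.read_csv("data.csv"))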
I have a dataframe with order_id, product_id, and a value column (the sample itself is omitted here).
I built this dfp with 100 rows of the original for testing,
and then I tried a pivot operation to get a wide dataframe with one column per product_id.
The problem with the pivot operation on all the data is that the result would have 131,209 rows and 39,123 columns. When I try the operation, the memory fills up and my PC restarts.
I tried segmenting the dataframe into 10 or 20 pieces. The pivot works, but when I then do a concat operation it exhausts the memory again.
My PC has 16 GB of memory. I have also tried Google Colab, but it runs out of memory there too.
Is there a format or another strategy to make this operation work?
You may try this,
dfu = dfp.groupby(['order_id','product_id'])[['my_column']].sum().unstack().fillna(0)
Another way is to split product_id into two parts, process each part separately, and concatenate the results back together:
front_part = []
rear_part = []
dfp_f = dfp[dfp['product_id'].isin(front_part)]
dfp_r = dfp[dfp['product_id'].isin(rear_part)]
dfs_f = dfp_f.pivot(index='order_id', columns='product_id', values=['my_column']).fillna(0)
dfs_r = dfp_r.pivot(index='order_id', columns='product_id', values=['my_column']).fillna(0)
dfs = pd.concat([dfs_f, dfs_r], axis=1)
front_part and rear_part mean we want to separate product_id into two parts, but we have to spell out the discrete product_id values in those lists; one way to build them is sketched below.
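A hedged sketch of building those two lists, continuing with the dfp from the snippet above (splitting the unique product_ids in half is just an illustrative choice):

# collect the unique product_ids and split them into two halves
product_ids = sorted(dfp['product_id'].unique())
mid = len(product_ids) // 2
front_part = product_ids[:mid]
rear_part = product_ids[mid:]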
I have this 5,000-row dataframe.
I want to draw 4 random samples of 300 rows each from the dataframe. I want each sample to contain no duplicates, but I also want no duplicates across samples, i.e. I don't want a row to appear in both sample 1 and sample 3, for example.
I have tried df.sample(300, replace=False), but it's not enough.
I have also searched the forum but didn't find what I want.
How can I code this in pandas without doing batch groups?
I don't think there is a pandas function specifically for that, but how about doing this:
df = pd.DataFrame({"col": range(5000)})
sample = df.sample(1200, replace=False)
sample.duplicated().any()
>> False # <-- no duplicates
samples = [sample.iloc[i-300:i] for i in range(300, 1500, 300)] # <-- 4 samples
Considering that .sample will return a random selection without replacement, this would achieve what you want.
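A quick sanity check on that result, under the assumption that the original dataframe has a unique index so that index labels identify rows:

import pandas as pd

# no row is shared between the 4 samples, and each sample has 300 rows
combined = pd.concat(samples)
assert not combined.index.duplicated().any()
assert all(len(s) == 300 for s in samples)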
So I have this giant matrix (~1.5 million rows x 7 columns) and am trying to figure out an efficient way to split it up. For simplicity, I'll work with a much smaller matrix as an example. The 7 columns consist of (in this order): item number, an x and y coordinate, a 1st label (non-numeric), data #1, data #2, and a 2nd label (non-numeric). Using pandas, I've imported my matrix, called A, from an Excel sheet, and it looks like this:
What I need to do is partition this based on both labels (i.e. so I have one matrix that is all the 13G + Aa together, another matrix that is 14G + Aa, and another one that is 14G + Ab together -- this would have me wind up with 3 separate 2x7 matrices). The reason for this is because I need to run a bunch of statistics on the dataset of numbers of the "Marker" column for each individual matrix (e.g. in this example, break the 6 "marker" numbers into three sets of 2 "marker" numbers, and then run statistics on each set of two numbers). Since there are going to be hundreds of these smaller matrices on the real data set I have, I was trying to figure out some way to make the smaller matrices be labeled something like M1, M2, ..., M500 (or whatever number it ends up being) so that way later, I can use some loops to apply statistics to each individual matrix all at once without having to write it 500+ times.
What I've done so far is to use pandas to import my data set into python as a matrix with the command:
import pandas as pd
import numpy as np
df = pd.read_csv(r"C:\path\cancerdata.csv")
A = df.as_matrix() #Convert excel sheet to matrix
A = np.delete(A, (0),axis=0) #Delete header row
Unfortunately, I haven't come across many resources for how to do what I want, which is why I wanted to ask here to see if anyone knows how to split up a matrix into smaller matrices based on multiple labels.
Your question has many implications, so instead of giving you a straight answer I'll try to give you some pointers on how to tackle this problem.
First off, don't transform your DataFrame into a Matrix. DataFrames are well-optimised for slicing and indexing operations (a Pandas Series object is in reality a fancy Numpy array anyway), so you only lose functionality by converting it to a Matrix.
You could probably convert your label columns into a MultiIndex. This way, you'll be able to access slices of your original DataFrame using df.loc, with a syntax similar to df.loc[label1].loc[label2].
A MultiIndex may sound confusing at first, but it really isn't. Try executing this code block and see for yourself what the resulting DataFrame looks like:
df = pd.read_csv("C:\path\cancerdata.csv")
labels01 = df["Label 1"].unique()
labels02 = df["Label 2"].unique()
index = pd.MultiIndex.from_product([labels01, labels02])
df.set_index(index, inplace=True)
print(df)
Here, we extracted all unique values in the columns "Label 1" and "Label 2" (we'll use them for looping later), and in the df.set_index line we turned those two columns into a MultiIndex - they now act as indices for your other columns. For example, in order to access the slice of your original DataFrame whose Label 1 = 13G and Label 2 = Aa, you can simply write:
sliced_df = df.loc["13G"].loc["Aa"]
And perform whatever calculations/statistics you need with it.
Lastly, instead of saving each sliced DataFrame into a list or dictionary and then iterating over them to perform the calculations, consider rearranging your code so that, as soon as you create a sliced DataFrame, you perform the calculations, save the results to an output/results file/DataFrame, and move on to the next slicing operation. Something like:
for L1 in labels01:
    for L2 in labels02:
        sliced_df = df.loc[L1].loc[L2]
        results = perform_calculations(sliced_df)
        save_results(results)
This will both improve memory consumption and performance, which may be important considering your large dataset.
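For completeness, a hedged variation on the same loop (keeping the hypothetical perform_calculations and save_results helpers from above): pandas' groupby walks the same label pairs in one pass and skips combinations that don't occur in the data.

# group directly on the two index levels created by set_index
for (l1, l2), sliced_df in df.groupby(level=["Label 1", "Label 2"]):
    results = perform_calculations(sliced_df)
    save_results(results)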