Quickest way to access & compare huge data in Python

Quickest way to access & compare huge data in Python - python

I am a newbie to Pandas, and somewhat newbie to python
I am looking at stock data, which I read in as CSV and typical size is 500,000 rows.
The data looks like this
'''
'''
I need to check the data against itself - the basic algorithm is a loop similar to
Row = 0
x = get "low" price in row ROW
y = CalculateSomething(x)
go through the rest of the data, compare against y
if (a):
append ("A") at the end of row ROW # in the dataframe
else
print ("B") at the end of row ROW
Row = Row +1
the next iteration, the datapointer should reset to ROW 1. then go through same process
each time, it adds notes to the dataframe at the ROW index
I looked at Pandas, and figured the way to try this would be to use two loops, and copying the dataframe to maintain two separate instances
The actual code looks like this (simplified)
df = pd.read_csv('data.csv')
calc1 = 1 # this part is confidential so set to something simple
calc2 = 2 # this part is confidential so set to something simple
def func3_df_index(df):
dfouter = df.copy()
for outerindex in dfouter.index:
dfouter_openval = dfouter.at[outerindex,"Open"]
for index in df.index:
if (df.at[index,"Low"] <= (calc1) and (index >= outerindex)) :
dfouter.at[outerindex,'notes'] = "message 1"
break
elif (df.at[index,"High"] >= (calc2) and (index >= outerindex)):
dfouter.at[outerindex,'notes'] = "message2"
break
else:
dfouter.at[outerindex,'notes'] = "message3"
this method is taking a long time (7 minutes+) per 5K - which will be quite long for 500,000 rows. There may be data exceeding 1 million rows
I have tried using the two loop method with the following variants:
using iloc - e.g df.iloc[index,2]
using at - e,g df.at[index,"low"]
using numpy& at - eg df.at[index,"low"] = np.where((df.at[index,"low"] < ..."
The data is floating point values, and datetime string.
Is it better to use numpy? maybe an alternative to using two loops?
any other methods, like using R, mongo, some other database etc - different from python would also be useful - i just need the results, not necessarily tied to python.
any help and constructs would be greatly helpful
Thanks in advance

You are copying the dataframe and manually looping over the indicies. This will almost always be slower than vectorized operations.
If you only care about one row at a time, you can simply use csv module.
numpy is not "better"; pandas internally uses numpy
Alternatively, load the data into a database. Examples include sqlite, mysql/mariadb, postgres, or maybe DuckDB, then use query commands against that. This will have the added advantage of allowing for type-conversion from stings to floats, so numerical analysis is easier.
If you really want to process a file in parallel directly from Python, then you could move to Dask or PySpark, although, Pandas should work with some tuning, though Pandas read_sql function would work better, for a start.

You have to split main dataset in smaller datasets for eg. 50 sub-datasets with 10.000 rows each to increase speed. Do functions in each sub-dataset using threading or concurrency and then combine your final results.

Related

Select specific rows on pandas based on condition

I have a dataframe containing a column called bmi (Body Mass Index) containing int values
I have to separate the values in bmi column into Under weight, Normal, Over weight and Obese based on the values. Below is the loop for the same
However I am getting an error. I am a beginner. Just started coding 2 weeks back.

Generally speaking, using a for loop in pandas is usually a bad idea. Pandas allows you to manipulate data easily. For example, if you want to filter by some condition:
print(df[df["bmi"] > 30])
will print all rows where there bmi>30. It works as follows: df[condition]. Condition in this case is "bmi" is larger then 30, so our condition is df["bmi"] > 30. Notice the line df[df["bmi"] > 30] returns all rows that satisfy the condition. I printed them, but you can manipulate them whatever you like.
Even though it's a bad technique (or used only for specific need), you can of course iterate through dataframe. This is not done via for l in df, as df is a dataframe object. To iterate through it you can use iterrows:
for index, row in df.iterrows():
if (row["bmi"] > 30)
print("Obese")
Also for next time please provide your code inline. Don't paste an image of it
If your goal is to separate into different labels, I suggest the following:
df.loc[df[df["bmi"] > 30, "NewColumn"] = "Obese"
df.loc[df[df["bmi"] < 18.5, "NewColumn"] = "Underweight"
.loc operator allows me to manipulate only part of the data. It's format is [rows, columns]. So the above code takes on rows where bmi>30, and it takes only "NewColumn" (change it whatever you like) which is a new column. It puts the value on the right to this column. That way, after that operation, you have a new column in your dataframe which has "Obese/Underweight" as you like.
As side note - there are better ways to map values (e.g pandas' map and others) but if you are a beginner, it's important to understand simple methods to manipulate data before diving into more complex one. That's why I am avoiding into explaining more complex method

First of all, As mentioned in the comment you should post text/code instead of screenshots.
You could do binning in pandas:
bmi_labels = ['Normal', 'Overweight', 'Obese']
cut_bins = [18.5, 24.9, 29.9, df["bmi"].max()]
df['bmi_label'] = pd.cut(df['bmi'], bins=cut_bins, labels=bmi_labels)
Here, i have made a seperate column (bmi_label) to store label but you could can do it in same column (bmi) too.

pandas inserting rows in a monotonically increasing dataframe using itertuples

I've been searching for a solution to this for a while, and I'm really stuck! I have a very large text file, imported as a panda dataframe containing just two columns but with hundreds of thousands to millions of rows. The columns contain packet dumps: one is the data of the packets formatted as ascii representations of monotonically increasing integers, and the second the packet time.
I want to go through this dataframe, and make sure that the dataframe is monotonically increasing, and if there are missing data, to insert a new rows in order to make the list monotonically increasing. i.e the 'data' column should be filled in with the appropriate value but the time should be changed to 'NaN' or 'NULL', etc.
The following is a sample of the data:
data frame_time_epoch
303030303030303000 1527986052.485855896
303030303030303100 1527986052.491020305
303030303030303200 1527986052.496127062
303030303030303300 1527986052.501301944
303030303030303400 1527986052.506439335
So I have two questions:
1) I've been trying to loop through the dataframe using itertuples to try to get the next row do a comparison with the current row and if the difference s more than the 100 to add a new row, but unfortunately I've struggled with this since, there doesn't seem to be a good way to retreive the row after the one called.
2) Is there a better way (faster) way to do this other than the way I've proposed?
This may be trivial, though I've really struggled with it. Thank you in advance for your help.

A problem at a time. You can do a verbatim check df.data.is_monotonic_increasing.
Inserting new indices: it is better to go the other way around. You already know the index you want. It is given by range(min_val, max_val+1, 100). You can create a blank DataFrame with this index and update it using your data.
This may be memory intensive so you may need to go over your data in chunks. In that case, you may need to provide index range ahead of time.
import pandas as pd
# test data
df = pd.read_csv(
pd.compat.StringIO(
"""data frame_time_epoch
303030303030303000 1527986052.485855896
303030303030303100 1527986052.491020305
303030303030303200 1527986052.496127062
303030303030303300 1527986052.501301944
303030303030303500 1527986052.506439335"""
),
sep=r" +",
)
# check if the data is increasing
assert df.data.is_monotonic_increasing
# desired index range
rng = range(df.data.iloc[0], df.data.iloc[-1] + 1, 100)
# blank frame with full index
df2 = pd.DataFrame(index=rng, columns=["frame_time_epoch"])
# update with existing data
df2.update(df.set_index("data"))
# result
# frame_time_epoch
# 303030303030303000 1.52799e+09
# 303030303030303100 1.52799e+09
# 303030303030303200 1.52799e+09
# 303030303030303300 1.52799e+09
# 303030303030303400 NaN
# 303030303030303500 1.52799e+09

Just for examination: Did you try sth like
delta = df['data'].diff()
delta[delta>0]
delta[delta<100]

Python pandas loop efficient through two dataframes with different lengths

I have two dataframes with different lengths(df,df1). They share one similar label "collo_number". I want to search the second dataframe for every collo_number in the first data frame. Problem is that the second date frame contains multiple rows for different dates for every collo_nummer. So i want to sum these dates and add this in a new column in the first database.
I now use a loop but it is rather slow and has to perform this operation for al 7 days in a week. Is there a way to get a better performance? I tried multiple solutions but keep getting the error that i cannot use the equal sign for two databases with different lenghts. Help would really be appreciated! Here is an example of what is working but with a rather bad performance.
df5=[df1.loc[(df1.index == nasa) & (df1.afleverdag == x1) & (df1.ind_init_actie=="N"), "aantal_colli"].sum() for nasa in df.collonr]

Your description is a bit vague (hence my comment). First what you good do is to select the rows of the dataframe that you want to search:
dftmp = df1[(df1.afleverdag==x1) & (df1.ind_init_actie=='N')]
so that you don't do this for every item in the loop.
Second, use .groupby.
newseries = dftmp['aantal_colli'].groupby(dftmp.index).sum()
newseries = newseries.ix[df.collonr.unique()]

How to write a pandas DataFrame to a CSV by fixed size chunks

I need to output data from pandas into CSV files to interact with a 3rd party developed process.
The process requires that I pass it no more than 100,000 records in a file, or it will cause issues (slowness, perhaps a crash).
That said, how can I write something that takes a dataframe in pandas and splits it into 100,000 records frames? Nothing would be different other than the exported dataframes would be subsets of the parent dataframe.
I assume I could do a loop with something like this, but I assume it would be remarkably inefficient..
First, taking recordcount=len(df.index) to get the number of records and then looping until I get there using something like
df1 = df[currentrecord:currentrecord+100000,]
And then exporting that to a CSV file
There has to be an easier way.

You can try smth like this:
def save_df(df, chunk_size=100000):
df_size=len(df)
for i, start in enumerate(range(0, df_size, chunk_size)):
df[start:start+chunk_size].to_csv('df_name_{}.csv'.format(i))

You could add a column with a group, and then use the function groupby:
df1['Dummy'] = [a for b in zip(*[range(N)] * 100000) for a in b][:len(df1)]
Where N is set to a value large enough, the minimum being:
N = int(np.ceil(df1.len() / 100000))
Then group by that column and apply function to_csv():
def save_group(df):
df.drop('Dummy', axis=1).to_csv("Dataframe_" + str(df['Dummy'].iloc[0]) + ".csv")
df1.groupby('Dummy').apply(save_group)

Creating dataframe by merging a number of unknown length dataframes

I am trying to do some analysis on baseball pitch F/x data. All the pitch data is stored in a pandas dataframe with columns like 'Pitch speed' and 'X location.' I have a wrapper function (using pandas.query) that, for a given pitch, will find other pitches with similar speed and location. This function returns a pandas dataframe of unknown size. I would like to use this function over large numbers of pitches; for example, to find all pitches similar to those thrown in a single game. I have a function that does this correctly, but it is quite slow (probably because it is constantly resizing resampled_pitches):
def get_pitches_from_templates(template_pitches, all_pitches):
resampled_pitches = pd.DataFrame(columns = all_pitches.columns.values.tolist())
for i, row in template_pitches.iterrows():
resampled_pitches = resampled_pitches.append( get_pitches_from_template( row, all_pitches))
return resampled_pitches
I have tried to rewrite the function using pandas.apply on each row, or by creating a list of dataframes and then merging, but can't quite get the syntax right.
What would be the fastest way to this type of sampling and merging?

it sounds like you should use pd.concat for this.
res = []
for i, row in template_pitches.iterrows():
res.append(resampled_pitches.append(get_pitches_from_template(row, all_pitches)))
return pd.concat(res)

I think that a merge might be even faster. Usage of df.iterrows() isn't recommended as it generates a series for every row.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.