What's the fastest way to do these tasks? - python

I have some time series data, which looks like this, and I have to do the following:
1. First, import it as a dataframe
2. Set the date column as a datetime index
3. Add some indicators, such as a moving average, as new columns
4. Do some rounding (on the values of a whole column)
5. Shift a column one row up or down (just to manipulate the data)
6. Then convert the df to a list (because I need to loop over it based on some conditions, and looping over a list is a lot faster than looping over a df)
7. But now I want to convert the df to a dict instead of a list, because I want to keep the column names; it's more convenient
8. But now I have found that converting to a dict takes a lot longer than converting to a list, even if I do it manually instead of using the built-in method.
My question is: is there a better way to do it? Maybe by not importing as a dataframe in the first place, while still being able to do Point 2 to Point 5? In the end I need a dict that lets me do the loop and keeps the column names as keys. Thanks.
P.S. The dict should look something like this; the format is similar to the df: each row is basically the date with the corresponding data.

On item #7: If you want to convert to a dictionary, you can use df.to_dict().
On item #6: You don't need to convert the df to a list or loop over it. Here are better options; look for the second answer (it says DON'T).
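As a rough sketch of what the df.to_dict() suggestion could look like for the steps in the question: the column names (date, close, ma2, prev_close) and the sample values below are made up for illustration, since the original data is not shown. The orient="index" form keys each row by its date, while orient="records" gives a list of one dict per row, which keeps the column names as keys and is usually convenient to loop over.

```python
import pandas as pd

# Hypothetical sample data; the real columns and values are not shown in the question.
df = pd.DataFrame(
    {
        "date": ["2021-01-01", "2021-01-02", "2021-01-03"],
        "close": [10.123, 11.456, 12.789],
    }
)

# Point 2: set the date column as a datetime index.
df["date"] = pd.to_datetime(df["date"])
df = df.set_index("date")

# Point 3: add an indicator (a 2-row moving average) as a new column.
df["ma2"] = df["close"].rolling(2).mean()

# Point 4: round a whole column.
df["close"] = df["close"].round(2)

# Point 5: shift a column one row down.
df["prev_close"] = df["close"].shift(1)

# One dict per date: {Timestamp: {column name: value, ...}, ...}
rows_by_date = df.to_dict(orient="index")

# A list of row dicts, with the date kept as an ordinary key.
records = df.reset_index().to_dict(orient="records")

for row in records:
    # Each row is a plain dict, so columns are accessed by name.
    print(row["date"], row["close"], row["prev_close"])
```

Whether this is fast enough depends on the size of the data; it is worth timing both forms against the list conversion on the real dataset.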

Related

How do you check if all the values in a column in a dataframe exist in another column in another dataframe using Vaex?

I have a dataframe with 160,000 rows and I need to know if these values exist in another column in another different dataframe that has over 7 million rows using Vaex.
I have tried doing this in pandas but it takes way too long to run.
Once I run this code I would like a list or a column that says either "True" or "False" about whether the value exists.
There are a few tricks you can try.
Some ideas:
You can try an inner join and then get the list of unique values that appear in both dataframes. Then you can use the isin method on the smaller dataframe with that list to get your answer.
I don't know if this will work out of the box, but it would be something like:
df_join = df_small.join(df_big, on='key', how='inner', allow_duplication=True)
common_samples = df_join['key'].tolist()
df_small['is_in_df_big'] = df_small.key.isin(common_samples)
# If it is something you are going to reuse a lot, it may be worth doing:
df_small = df_small.materialize('is_in_df_big') # to put it in memory, otherwise it will be lazily recomputed each time you need it.
A similar idea: instead of doing a join, do something like:
unique_samples = df_small.key.unique()
common_samples = df_big[df_big.key.isin(unique_samples)].key.unique()
df_small['is_in_df_big'] = df_small.key.isin(common_samples)
I don't know which one would be faster. I hope this will at least lead to some inspiration, if not to the full solution.
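To make the logic of the second idea concrete, here is a tiny sketch of the same unique-then-isin steps. It is written with pandas on toy data purely so that it runs anywhere; it is not a Vaex benchmark, and the frames and the key column are invented for illustration.

```python
import pandas as pd

# Toy stand-ins for the 160,000-row and 7-million-row dataframes.
df_small = pd.DataFrame({"key": [1, 2, 3, 4]})
df_big = pd.DataFrame({"key": [2, 2, 4, 5, 6, 7]})

# Step 1: only the distinct values of the small frame matter.
unique_samples = df_small["key"].unique()

# Step 2: keep the rows of the big frame whose key is one of those values,
# then reduce them to their distinct keys.
common_samples = df_big.loc[df_big["key"].isin(unique_samples), "key"].unique()

# Step 3: flag each row of the small frame.
df_small["is_in_df_big"] = df_small["key"].isin(common_samples)

print(df_small["is_in_df_big"].tolist())  # [False, True, False, True]
```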

Correct way of adding new columns/headers to a dataframe

I need to add new columns to a dataframe. Every column has a header and a single value across all its rows (the value is the same for all the columns).
Right now I'm doing something like this:
array_of_new_headers = [...]
for column in array_of_new_headers:
    df[column] = 0
As a result I'm getting this message:
PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling frame.insert many
times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use newframe = frame.copy()
It tells me to use concat, but I don't really need to concatenate two dataframes. Should I use concat for better performance and better code? To me it doesn't really make sense, unless I think of the arrays as dataframes too.
You can pass an unpacked dictionary, with the column names as keys and the column values as values, to pandas.DataFrame.assign:
>>> array_of_new_headers = [...]
>>> df.assign(**{c:0 for c in array_of_new_headers})
Note that assign returns a new DataFrame and leaves df unchanged, so make sure to assign the result back to the required variable.
Regarding "should I use concat for better performance":
Beware of so-called premature optimization: if your code already runs fast enough for your needs, you may simply end up wasting your time trying to make it faster.
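For completeness, this is roughly what the warning's pd.concat(axis=1) suggestion could look like: the new columns are built as one small DataFrame and attached in a single step, so yes, the list of headers is effectively treated as a dataframe. The frame contents and the header names below are placeholders for illustration.

```python
import pandas as pd

# Placeholder data; the real df and headers are not shown in the question.
df = pd.DataFrame({"a": [1, 2, 3]})
array_of_new_headers = ["x", "y", "z"]

# Build all new columns at once, aligned on the existing index and filled with 0.
new_columns = pd.DataFrame(0, index=df.index, columns=array_of_new_headers)

# Attach them side by side in one operation instead of one insert per column.
df = pd.concat([df, new_columns], axis=1)

print(df.columns.tolist())  # ['a', 'x', 'y', 'z']
```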

Why do double square brackets create a DataFrame with loc or iloc?

Comparing:
(1) df.loc[:,'col1']
(2) df.loc[:,['col1']]
Why does (2) create a DataFrame, while (1) creates a Series?
In principle, when the selector is a list, it can hold more than one column name, so it's natural for pandas to give you a DataFrame, because only a DataFrame can hold more than one column. When it's a string instead of a list, pandas knows that it's exactly one column, so giving you a Series is not a problem. Treat the two forms and their two outcomes as a reasonable flexibility that lets you get whichever you need, a Series or a DataFrame; sometimes you need specifically one of the two.
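A small check of the two forms, using a made-up two-column frame:

```python
import pandas as pd

df = pd.DataFrame({"col1": [1, 2, 3], "col2": [4, 5, 6]})

as_series = df.loc[:, "col1"]    # string selector: exactly one column
as_frame = df.loc[:, ["col1"]]   # list selector: could name several columns

print(type(as_series).__name__, as_series.shape)  # Series (3,)
print(type(as_frame).__name__, as_frame.shape)    # DataFrame (3, 1)
```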

add array to pandas data frame

I have an issue here and would like to ask for support.
Suppose you have the following frame:
frame=pd.DataFrame({"Arbitary Number":[1,2,3,4]})
I want to add an additional column whose entries are np.arrays. I add the entry the following way:
frame["new col"]='[8,8,8,8]'
However, at a later stage I need the entries as arrays. If I apply
frame["new col"]=frame["new col"].apply(np.array)
I still get object as the column type and cannot use the entries to do any math. I have to go via
np.array([eval(xxx)])
to get an array.
The question is: is there a nice and clean way to add arrays as column values without turning them into strings before assigning them?
Or, if that is not possible and I do need to assign the list as a string, is there a way to change the column type to an np.array format?
The solution I mentioned is not working.
Thanks a lot for any kind of help.
Cheers
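One possible approach, offered as a sketch rather than as an answer from the thread: assign a list containing one array per row, so no string conversion is needed. The column dtype still shows as object (pandas has no dedicated array dtype for this), but each entry is a real np.ndarray and can be used for math. If the values already exist as strings, ast.literal_eval is a safer way to parse them than eval. The column names "as text" and "parsed" are invented for the example.

```python
import ast

import numpy as np
import pandas as pd

frame = pd.DataFrame({"Arbitary Number": [1, 2, 3, 4]})

# Assign one array per row directly; no strings involved.
frame["new col"] = [np.array([8, 8, 8, 8]) for _ in range(len(frame))]

# Each entry is an ndarray, so element-wise math works per row.
doubled = frame["new col"].apply(lambda arr: arr * 2)

# If the data only exists as text, parse it without eval.
frame["as text"] = "[8,8,8,8]"
frame["parsed"] = frame["as text"].apply(lambda s: np.array(ast.literal_eval(s)))

print(type(frame["new col"].iloc[0]).__name__)  # ndarray
print(doubled.iloc[0].tolist())                 # [16, 16, 16, 16]
```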

How to add new values to dataframe's columns based on specific row without overwrite existing data

I have a batch of identifier and a pair of values that behave in following manner within an iteration.
For example,
print(indexIDs[i], (coordinate_x, coordinate_y))
Sample output looks like
I would like to add this data to a dataframe, where I can use indexIDs[i] as the row and append each incoming pair of values with the same identifier in the next consecutive column.
I tried the following code, which didn't work:
spatio_location = pd.DataFrame()
spatio_location.loc[indexIDs[i], column_counter] = (coordinate_x, coordinate_y)
It was a good start for associating indexIDs[i] with a row, but I could not get further and take in incoming data without overwriting the previous dataframe. I am aware it has something to do with the second line, which uses the "=" sign.
That second line keeps overwriting the previous result over and over again. I am looking for an appropriate way to change it so that new incoming data is inserted into the existing dataframe without overwriting what is already there.
I appreciate your time and effort, thanks.
I'm a bit confused about the nature of coordinate_x (is it a list, or what?). Anyway, maybe try to use append (note that DataFrame.append was removed in pandas 2.0, so this needs an older pandas version).
You could define an empty df with three columns:
df=pd.DataFrame([],columns=['a','b','c'])
Then populate it with a loop over your lists:
for i in range(TOFILL):  # TOFILL = number of entries to add
    df=df.append({'a':indexIDs[i],'b':coordinate_x[i],'c':coordinate_y[i]},ignore_index=True)
Finally, set a column as the index:
df=df.set_index('a')
Hope it helps.
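On pandas 2.0 and later, where DataFrame.append no longer exists, the same idea can be sketched by collecting the rows in a plain Python list and building the dataframe once at the end. The identifiers and coordinates below are invented, and the column names a, b, c simply follow the answer above; this assumes the coordinates arrive as lists of equal length.

```python
import pandas as pd

# Invented sample inputs standing in for the values produced in the loop.
indexIDs = ["id_a", "id_b", "id_a"]
coordinate_x = [1.0, 3.0, 5.0]
coordinate_y = [2.0, 4.0, 6.0]

# Collect one dict per incoming pair; nothing is overwritten.
rows = []
for i in range(len(indexIDs)):
    rows.append({"a": indexIDs[i], "b": coordinate_x[i], "c": coordinate_y[i]})

# Build the dataframe in one step, then use the identifier as the index.
df = pd.DataFrame(rows, columns=["a", "b", "c"]).set_index("a")

# All pairs recorded for one identifier, in arrival order.
print(df.loc["id_a"])
```

Building the frame once is also generally cheaper than growing it row by row, because each append or concat call copies the existing data.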
