There must be a simple answer to this, but for some reason I can't find it. Apologies if this is a duplicate question.
I have a dataframe with shape on the order of (1000,100). I want to concatenate ALL items in the dataframe into a single series (or list). Order doesn't matter (so it doesn't matter what axis to concatenate along). I don't want/need to keep any column names or indices. Dropping NaNs and duplicates is ok but not required.
What's the easiest way to do this?
This will yield a one-dimensional NumPy array with the lowest common dtype of all the elements:
df.values.ravel()
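A minimal sketch of wrapping that in a Series (or list), with the optional NaN/duplicate clean-up; the toy DataFrame is made up just for illustration:
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, 2.0, np.nan], 'b': [3.0, 2.0, 5.0]})   # stand-in data

flat = pd.Series(df.values.ravel())        # one Series holding every element
flat = flat.dropna().drop_duplicates()     # optional, per the question
print(flat.tolist())                       # [1.0, 3.0, 2.0, 5.0]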
In pandas I have a dataframe (df1) of X by Y dimensions containing values.
Then I have a second dataframe (df2) of the same X by Y dimensions containing True/False values.
I want to return only the elements of df1 where the value at the same location in df2 is True.
What is the fastest way to do this? Is there a way to do this without converting to numpy array?
Without a reproducible example I may be missing a couple of tweaks/details here, but I think you can accomplish this with dataframe multiplication
df1.mul(df2)
This multiplies each element by the corresponding element in the other dataframe: True acts as 1 and returns the element unchanged, while False acts as 0 and zeroes it out, so this works best for numeric data.
It is also possible to use where
df1.where(df2)
This is equivalent to df1[df2]: it keeps the values where df2 is True and replaces the rest with NaN, although you can choose the replacement value with the other argument. (mask is the inverse and hides the values where the condition is True, so df1.mask(~df2) would also work.)
A quick benchmark on a 10x10 dataframe suggests that the df.mul approach is ~5 times faster
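For illustration, a minimal sketch of both approaches on small made-up frames (where keeps the elements flagged True and fills the rest with NaN, or with whatever is passed as other):
import pandas as pd

df1 = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
df2 = pd.DataFrame({'a': [True, False], 'b': [False, True]})

print(df1.where(df2))             # NaN wherever df2 is False
print(df1.where(df2, other=0))    # 0 wherever df2 is False
print(df1.mul(df2))               # numeric data only: False zeroes the element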
I've got two pd.Series: one is short and holds datetimes, the other is long and holds datetimes with matching values.
I want to get a dataframe with the indexes from the first series and the corresponding values from the second series. Both have some duplicates. I could build the new object by looping through the indexes, but there has to be a better way. I tried join, merge and loc; each time the resulting dataframe ends up longer than the first series of datetimes.
You can try the merge() method:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html
It would help a lot if you could provide a snippet of the dataframes you are working with.
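In the meantime, a minimal sketch under the stated assumptions (the series names, datetimes and values are all made up); dropping duplicate datetimes on the lookup side is what keeps the merge from growing longer than the first series:
import pandas as pd

short = pd.Series(pd.to_datetime(['2021-01-01', '2021-01-02', '2021-01-02']), name='dt')
long = pd.Series(
    [10, 20, 20, 30],
    index=pd.to_datetime(['2021-01-01', '2021-01-02', '2021-01-02', '2021-01-03']),
    name='value',
)

lookup = long[~long.index.duplicated(keep='first')]   # one value per datetime
result = short.to_frame().merge(
    lookup.rename_axis('dt').reset_index(), on='dt', how='left'
)
print(result)   # exactly one row per entry of `short`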
I'd like to take a list of 1000 np.ndarrays (each element in the list is an array whose shape is 3x3x8) and use this list as a pandas DataFrame column, so that each cell in the column is a matrix.
How can it be accomplished?
You may want to look at xarray.
I've found this really useful for abstracting "square" data where all of the arrays in your list have the same shape.
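A minimal sketch of both options; the list of random arrays here just stands in for your data:
import numpy as np
import pandas as pd
import xarray as xr

arrays = [np.random.rand(3, 3, 8) for _ in range(1000)]   # stand-in data

# xarray: stack into a single labelled (1000, 3, 3, 8) array.
da = xr.DataArray(np.stack(arrays), dims=['sample', 'x', 'y', 'z'])
print(da.shape)   # (1000, 3, 3, 8)

# Plain pandas alternative: an object-dtype column where each cell is a 3x3x8 ndarray.
df = pd.DataFrame({'matrix': pd.Series(arrays, dtype=object)})
print(df['matrix'].iloc[0].shape)   # (3, 3, 8)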
I have a dataframe with 15 columns. Five of those columns contain numbers, but some of the entries are blanks or words. I want to convert those to zero.
I am able to convert the entries in one of the columns to zero, but when I try to do the same for multiple columns it does not work. I tried this for one column:
pd.to_numeric(Tracker_sample['Product1'],errors='coerce').fillna(0)
and it works, but when I try this for multiple columns:
pd.to_numeric(Tracker_sample[['product1','product2','product3','product4','Total']],errors='coerce').fillna(0)
I get the error : arg must be a list, tuple, 1-d array, or Series
I think it is the way I am calling the columns to be fixed. I am new to pandas so any help would be appreciated. Thank you
You can use:
cols = ['product1', 'product2', 'product3', 'product4', 'Total']
Tracker_sample[cols] = Tracker_sample[cols].apply(pd.to_numeric, errors='coerce').fillna(0)
With a for loop?
for col in ['product1', 'product2', 'product3', 'product4', 'Total']:
    Tracker_sample[col] = pd.to_numeric(Tracker_sample[col], errors='coerce').fillna(0)
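For illustration, the loop on a tiny made-up frame (toy column names and values) turns the blanks and words into zeros:
import pandas as pd

Tracker_sample = pd.DataFrame({
    'product1': [1, '', 'n/a'],
    'product2': ['5', 'two', 3],
})

for col in ['product1', 'product2']:
    Tracker_sample[col] = pd.to_numeric(Tracker_sample[col], errors='coerce').fillna(0)

print(Tracker_sample)   # blanks and words are now 0.0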
What exactly is the lexsort_depth of a multi-index dataframe? Why does it have to be sorted for indexing?
For example, I have noticed that, after manually building a multi-index dataframe df with columns organized in three levels, if I try to do:
idx = pd.IndexSlice
df[idx['foo', 'bar']]
I get:
KeyError: 'Key length (2) was greater than MultiIndex lexsort depth (0)'
and at this point, df.columns.lexsort_depth is 0
However, if I do, as recommended here and here:
df = df.sortlevel(0,axis=1)
then the cross-section indexing works. Why? What exactly is lexsort_depth, and why does sorting with sortlevel fix this type of indexing?
lexsort_depth is the number of levels of a multi-index that are sorted lexically. That is, in an a-b-c-1-2-3 order (normal sort order).
So element indexing will work even if a multi-index is not sorted, but the lookups may be quite a bit slower (in 0.15.2, this will show a PerformanceWarning for these kinds of lookups, see here).
The reason that sorting is in general a good idea is that pandas is able to use hash-based indexing to figure out where the locations are within each level independently; these per-level indexers can then be combined to find the final locations.
Pandas takes advantage of np.searchsorted to find these locations when the index is sorted. If it is not sorted, pandas has to fall back to different (and slower) methods.
Here is the code that does this.
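A minimal sketch of the failure and the fix, using made-up column tuples; note that modern pandas has replaced sortlevel with sort_index(axis=1), which is what the sketch uses:
import numpy as np
import pandas as pd

# Three-level columns deliberately built out of lexical order, so the lexsort depth is 0.
cols = pd.MultiIndex.from_tuples(
    [('foo', 'baz', 'y'), ('qux', 'bar', 'z'), ('foo', 'bar', 'x')]
)
df = pd.DataFrame(np.arange(9).reshape(3, 3), columns=cols)

idx = pd.IndexSlice
# df[idx['foo', 'bar']]        # on the unsorted columns this raises the
                               # "Key length (2) was greater than MultiIndex
                               # lexsort depth (0)" error shown above

df = df.sort_index(axis=1)     # lexically sort every column level
print(df[idx['foo', 'bar']])   # the partial (cross-section) lookup now works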