Efficient method to append pandas rows into a list - python

I have a pandas DataFrame with ~5m rows. I am looking for an efficient method to append / store each row in a list.
import pandas as pd
df = pd.DataFrame({
'id': [0, 1, 2, 3],
'val': ['w','x','y','z'],
'pos': ['p1','p2','p3','p4']
})
# Using List comprehensions
df_lst = []
[df_lst.append(rows) for rows in df.iterrows()]
Given the size of the DataFrame, I am looking for other methods that are more efficient at storing rows in a list. Is there a vectorized solution to this?

I'd recommend .tolist(), as others have also mentioned in the comments, so I'll give an example of it.
df_lst = df.values.tolist()
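For the sample df above, this produces a list of row lists:
>>> df.values.tolist()
[[0, 'w', 'p1'], [1, 'x', 'p2'], [2, 'y', 'p3'], [3, 'z', 'p4']]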
In terms of efficiency (I see some commenters have asked why you would want to do this at all), it depends on the use case. A DataFrame is of course more efficient for performing various operations on the data, so converting it can seem redundant, but a plain list can be more memory-efficient than a DataFrame, so the conversion is not unreasonable if you don't need the DataFrame's features.

From here:
You can use df.to_dict('records') to convert the rows into a list of dicts. Whether this is useful depends on what you want to do with the list afterwards.
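For the sample df above, the records form looks like this:
>>> df.to_dict('records')
[{'id': 0, 'val': 'w', 'pos': 'p1'}, {'id': 1, 'val': 'x', 'pos': 'p2'}, {'id': 2, 'val': 'y', 'pos': 'p3'}, {'id': 3, 'val': 'z', 'pos': 'p4'}]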

Selecting Various "Pieces" of a List

I have a list of columns from a Pandas DataFrame and I am looking to create a list of certain columns without manual entry.
My issue is that I am still learning and not knowledgeable enough yet.
I have tried searching around the internet but nothing quite matched my case. I apologize if this is a duplicate.
The list I am trying to cut from looks like this:
['model',
'displ',
'cyl',
'trans',
'drive',
'fuel',
'veh_class',
'air_pollution_score',
'city_mpg',
'hwy_mpg',
'cmb_mpg',
'greenhouse_gas_score',
'smartway']
Here is the code that I wrote on my own: dataframe.columns.tolist()[:6,8:10,11]
In this scenario I am trying to select everything except 'air_pollution_score' and 'greenhouse_gas_score'.
My ultimate goal is to understand the syntax and how to select pieces of a list.
You could do that, or you could just use drop to remove the columns you don't want:
dataframe.drop(['air_pollution_score', 'greenhouse_gas_score'], axis=1).columns
Note that you need to specify axis=1 so that pandas knows you want to remove columns, not rows.
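If you need an actual Python list rather than an Index, .tolist() converts the result:
keep_cols = dataframe.drop(['air_pollution_score', 'greenhouse_gas_score'], axis=1).columns.tolist()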
Even if you wanted to use list syntax, I would say that it's better to use a list comprehension instead; something like this:
exclude_columns = ['air_pollution_score', 'greenhouse_gas_score']
[col for col in dataframe.columns if col not in exclude_columns]
This gets all the columns in the dataframe unless they are present in exclude_columns.
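As a usage sketch (reusing the names above), you can then select the remaining columns directly:
subset = dataframe[[col for col in dataframe.columns if col not in exclude_columns]]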
Let's say df is your dataframe. You can actually use filter and a lambda, though it quickly becomes rather long. I present this as a "one-liner" alternative to the answer by #gmds.
df[
    list(filter(
        lambda x: ('air_pollution_score' not in x) and ('greenhouse_gas_score' not in x),
        df.columns.values
    ))
]
What's happening here:
filter applies a function to an iterable and keeps only the elements for which that function returns True.
We define that function with a lambda that checks whether 'air_pollution_score' or 'greenhouse_gas_score' appears in a column name, keeping the name only if neither does.
We filter the df.columns.values list, so the resulting list retains only the columns we did not exclude.
We then use the df[['column1', 'column2']] syntax, which means "make a new dataframe containing only the columns I list."
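Applied to the column list from the question, the filter step on its own would look like this (a standalone sketch):
cols = ['model', 'displ', 'cyl', 'trans', 'drive', 'fuel', 'veh_class',
        'air_pollution_score', 'city_mpg', 'hwy_mpg', 'cmb_mpg',
        'greenhouse_gas_score', 'smartway']
kept = list(filter(lambda c: 'air_pollution_score' not in c and 'greenhouse_gas_score' not in c, cols))
# kept now holds every name except the two score columns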
Simple solution with pandas
import pandas as pd
data = pd.read_csv('path to your csv file')
df = data[['column1', 'column2', 'column3']]  # add further column names as needed
Note: data is the source you have already loaded with pandas; the selected columns are stored in a new DataFrame df.

Pandas: how to compare several cells with a list/tuple

I need to compare some columns in a dataframe as a whole, for example:
df = pd.DataFrame({'A':[1,1,3],'B':[4,5,6]})
#Select condition: If df['A'] == 1 and df['B'] == 4, then pick up this row.
For this simple example, I can use below method:
df.loc[(df['A']==1)&(df['B']==4),'A':'B']
However, in reality my dataframe has tens of columns which should be compared as a whole. The above solution becomes very messy if I list all of them, so I think comparing them as a whole against a list may solve the problem:
#something just like this:
df.loc[df.loc[:,'A':'B']==[1,4],'A':'B']
That did not work. So I came up with the idea of first combining all the desired columns into a new column holding a list value, and then comparing this new column with the list. The latter has been solved in Pandas: compare list objects in Series.
Although I've generally solved my case, I still want to know if there is an easier way to solve this problem. Thanks.
Or use [[]] for getting multiple columns:
df[(df[['A','B']].values==[1,4]).all(1)]
Demo:
>>> df = pd.DataFrame({'A':[1,1,3],'B':[4,5,6]})
>>> df[(df[['A','B']].values==[1,4]).all(1)]
A B
0 1 4
>>>
You can use a Boolean mask via a NumPy array representation of your data:
df = pd.DataFrame({'A':[1,1,3],'B':[4,5,6]})
res = df[(df.loc[:, 'A':'B'].values == [1, 4]).all(1)]
print(res)
A B
0 1 4
In this situation, never combine your columns into a single series of lists. This is inefficient as you will lose all vectorisation benefits, and any processing thereafter will involve Python-level loops.
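If you really do have tens of columns, one option (a sketch, not taken from the answers above) is to keep the target values in a dict keyed by column name and let pandas align the comparison:
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 3], 'B': [4, 5, 6]})
targets = {'A': 1, 'B': 4}  # hypothetical mapping of column -> required value
mask = (df[list(targets)] == pd.Series(targets)).all(axis=1)
print(df[mask])  # keeps only row 0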

Is there a way to create a single column pandas DataFrame from list without copying the list?

Suppose I have this code:
import pandas as pd
mylist = [item for item in range(100000)]
df = pd.DataFrame()
df["col1"] = mylist
Is the data in mylist copied when it is assigned to df["col1"] ? If so, is there a way to avoid this copy?
Edit: My list in this case is a list of strings. One thing I am getting from these answers is that if I instead create a NumPy array of these strings, no data duplication will occur when I call df["col1"] = mynparray?
When you assign your list to a series, a new NumPy array is created. This data structure permits vectorised computations for numeric types. Such series are laid out in contiguous memory blocks. See Why NumPy instead of Python lists? for more details.
Therefore, you will need enough memory to hold duplicate data. This is unavoidable. There is no way to "convert" a list into a Pandas series in place.
Note: the above does not relate to what happens when you assign a NumPy array to a series.
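As a rough illustration of that note (a sketch; exact copy behaviour can differ between pandas versions), a Series built from a NumPy array can reuse the array's buffer, whereas one built from a list always gets a freshly allocated array:
import numpy as np
import pandas as pd

arr = np.arange(100000)
s = pd.Series(arr)  # constructed from an ndarray
print(np.shares_memory(arr, s.values))  # typically True: the buffer is reused

lst = list(range(100000))
s2 = pd.Series(lst)  # constructed from a list: pandas must allocate a new array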
Just a thought: can you delete the list after creating the df, if memory is a concern?
import pandas as pd
mylist = [item for item in range(100000)]
df = pd.Series(mylist).to_frame()
del mylist

Appending a list of lists and adding a column efficiently

Appending works fine with this method:
for poss in pos:
    df = df.append([[poss,'1']], ignore_index=True)
Is this possible to write as a one-liner? This way is showing a syntax error.
df = df.append([[poss,'1']], ignore_index=True) for poss in pos
No, it's not. It seems like you are trying to use list-comprehension syntax to collapse multiple method calls into a single line. You can't do that.
Instead, what you can do is aggregate your list elements into a dataframe and append in one go. Calling df.append inside a loop is also inefficient compared with building a dataframe from the list of lists and appending just once.
df = df.append(pd.DataFrame(pos, columns=df.columns))
This assumes pos is a list of lists with columns aligned with df.
If you need to add an extra column "in one line", try this:
df = df.append(pd.DataFrame(pos, columns=df.columns)).assign(newcol=1)
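A minimal end-to-end sketch of the build-once approach (the sample data below is made up; df.append is kept to match the answer, though newer pandas versions replace it with pd.concat):
import pandas as pd

df = pd.DataFrame([['a', '1']], columns=['name', 'flag'])  # hypothetical existing frame
pos = [['b', '1'], ['c', '1']]  # list of lists aligned with df's columns
df = df.append(pd.DataFrame(pos, columns=df.columns), ignore_index=True)
print(df)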
I think it is possible, but I cannot test it, because I do not work with those data types, and a normal list.append does not return the resulting list (whereas DataFrame.append does, which is what reduce needs).
from functools import reduce # Python3 only
df = reduce(lambda d, p: d.append([[p,'1']], ignore_index=True), pos, df)
Anyway, the for loop is much better; this is just a rewrite into one line. There might be other ways to modify your data.

pandas iterrows changes ints into floats

I'm trying to iterate over the rows of a DataFrame that contains some int64s and some floats. iterrows() seems to be turning my ints into floats, which breaks everything I want to do downstream:
>>> import pandas as pd
>>> df = pd.DataFrame([[10000000000000001, 1.5], [10000000000000002, 2.5]], columns=['id', 'prc'])
>>> [id for id in df.id]
[10000000000000001, 10000000000000002]
>>> [r['id'] for (idx,r) in df.iterrows()]
[10000000000000000.0, 10000000000000002.0]
Iterating directly over df.id is fine. But through iterrows(), I get different values. Is there a way to iterate over the rows in such a way that I can still index by column name and get all the correct values?
Here's the relevant part of the docs:
Because iterrows returns a Series for each row, it does not preserve dtypes across the rows (dtypes are preserved across columns for DataFrames) [...] To preserve dtypes while iterating over the rows, it is better to use itertuples() which returns namedtuples of the values and which is generally faster than iterrows.
Example for your data:
>>> df = pd.DataFrame([[10000000000000001, 1.5], [10000000000000002, 2.5]], columns=['id', 'prc'])
>>> [t[1] for t in df.itertuples()]
[10000000000000001, 10000000000000002]
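Since itertuples returns namedtuples by default, you can also read the value by field name instead of position:
>>> [t.id for t in df.itertuples()]
[10000000000000001, 10000000000000002]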
If possible you're better off avoiding iteration. Check if you can vectorize your work first.
If vectorization is impossible, you probably want DataFrame.itertuples. That will return an iterable of (named)tuples where the first element is the index label.
In [2]: list(df.itertuples())
Out[2]:
[Pandas(Index=0, id=10000000000000001, prc=1.5),
Pandas(Index=1, id=10000000000000002, prc=2.5)]
iterrows returns a Series for each row. Since series are backed by numpy arrays, whose elements must all share a single type, your ints were cast as floats.
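You can see the upcast directly by checking the dtype of a single row returned by iterrows (using the same df as above):
>>> next(df.iterrows())[1].dtype
dtype('float64')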
