Appending a list of lists and adding a column efficiently - python

Appending works fine with this method:
for poss in pos:
    df = df.append([[poss,'1']], ignore_index=True)
Is it possible to write this as a one-liner? The following raises a syntax error:
df = df.append([[poss,'1']], ignore_index=True) for poss in pos

No, it's not. It looks like you want to use list comprehension syntax to collapse multiple method calls into a single line. You can't do this.
Instead, collect your list elements into a dataframe and append in one go. Calling df.append inside a loop is also inefficient compared with building a dataframe from a list of lists and appending just once.
df = df.append(pd.DataFrame(pos, columns=df.columns))
This assumes pos is a list of lists with columns aligned with df.
If you need to add an extra column "in one line", try this:
df = df.append(pd.DataFrame(pos, columns=df.columns)).assign(newcol=1)
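DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so on current versions the same "build once, add once" idea is written with pd.concat. The following is a minimal sketch, not the original poster's data: the column names name and flag and the sample values are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical starting frame and values to add (names are illustrative only)
df = pd.DataFrame({"name": ["a"], "flag": ["1"]})
pos = ["b", "c"]

# Build all new rows in one DataFrame; the scalar "1" is broadcast to every row
new_rows = pd.DataFrame({"name": pos, "flag": "1"})

# Concatenate once instead of appending inside a loop
df = pd.concat([df, new_rows], ignore_index=True)
print(df)
```

Building new_rows first keeps the cost at one concatenation, regardless of how many items pos holds.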

I think it is possible, but I cannot test it, because I do not work with those datatypes and normal list.append does not return the resulting list.
from functools import reduce # needed in Python 3, where reduce is no longer a builtin
df = reduce(lambda d, p: d.append([[p,'1']], ignore_index=True), pos, df)
Anyway, the for loop is much better. Also this is just a rewrite to one line. There might be other ways to modify your data.

Related

Selecting Various "Pieces" of a List

I have a list of columns in a Pandas DataFrame and am looking to create a list of certain columns without manual entry.
My issue is that I am still learning and not knowledgeable enough yet.
I have tried searching around the internet, but nothing quite matched my case. I apologize if this is a duplicate.
The list I am trying to cut from looks like this:
['model',
'displ',
'cyl',
'trans',
'drive',
'fuel',
'veh_class',
'air_pollution_score',
'city_mpg',
'hwy_mpg',
'cmb_mpg',
'greenhouse_gas_score',
'smartway']
Here is the code that I wrote on my own: dataframe.columns.tolist()[:6,8:10,11]
In this case I am trying to select everything but 'air_pollution_score' and 'greenhouse_gas_score'.
My ultimate goal is to understand the syntax and how to select pieces of a list.
You could do that, or you could just use drop to remove the columns you don't want:
dataframe.drop(['air_pollution_score', 'greenhouse_gas_score'], axis=1).columns
Note that you need to specify axis=1 so that pandas knows you want to remove columns, not rows.
Even if you wanted to use list syntax, I would say that it's better to use a list comprehension instead; something like this:
exclude_columns = ['air_pollution_score', 'greenhouse_gas_score']
[col for col in dataframe.columns if col not in exclude_columns]
This gets all the columns in the dataframe unless they are present in exclude_columns.
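On the syntax question itself: a Python list index accepts one integer or one slice, so a comma-separated group such as [:6,8:10,11] raises a TypeError. Several pieces can instead be sliced separately and joined with +. The sketch below uses the column list from the question; the variable names columns and selected are only illustrative.

```python
columns = ['model', 'displ', 'cyl', 'trans', 'drive', 'fuel', 'veh_class',
           'air_pollution_score', 'city_mpg', 'hwy_mpg', 'cmb_mpg',
           'greenhouse_gas_score', 'smartway']

# Slices exclude their end index:
# [:7] covers indexes 0-6, [8:11] covers 8-10, [12:] covers 12 to the end,
# which skips index 7 ('air_pollution_score') and 11 ('greenhouse_gas_score')
selected = columns[:7] + columns[8:11] + columns[12:]
print(selected)
```

This works, but it depends on the column positions never changing, which is why filtering by name is usually the safer choice.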
Let's say df is your dataframe. You can also use filter and lambda, though it quickly becomes too long. I present this as a "one-liner" alternative to the answer of @gmds.
df[
    list(filter(
        lambda x: ('air_pollution_score' not in x) and ('greenhouse_gas_score' not in x),
        df.columns.values
    ))
]
What's happening here:
filter applies a function to a list and keeps only the elements for which that function returns True.
We define that function with lambda; it returns True only when the column name contains neither 'air_pollution_score' nor 'greenhouse_gas_score'.
We're filtering on the df.columns.values list, so the resulting list retains only the elements that weren't the ones we mentioned.
We're using the df[['column1', 'column2']] syntax, which is "make a new dataframe but only containing the 2 columns I define."
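A simple solution with pandas:
import pandas as pd
data = pd.read_csv('path to your csv file')
df = data[['column1', 'column2', 'column3', ....]]
Note: data is the source you have already loaded using pandas; the selected columns are stored in a new data frame, df. The double brackets matter: a list of names selects several columns, whereas data['column1', 'column2'] is treated as a lookup of a single tuple key and fails.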

Pandas: how to compare several cells with a list/tuple

I need to compare some columns in a dataframe as a whole, for example:
df = pd.DataFrame({'A':[1,1,3],'B':[4,5,6]})
#Select condition: If df['A'] == 1 and df['B'] == 4, then pick up this row.
For this simple example, I can use below method:
df.loc[(df['A']==1)&(df['B']==4),'A':'B']
However, in reality my dataframe has tens of columns that should be compared as a whole. The solution above becomes very messy if I have to list all of them. So I thought that treating them as a whole and comparing them with a list might solve the problem:
#something just like this:
df.loc[df.loc[:,'A':'B']==[1,4],'A':'B')]
This did not work. So I came up with the idea of first combining all desired columns into a new column holding a list value, then comparing this new column with the list. The latter has been solved in Pandas: compare list objects in Series
Although I have generally solved my case, I would still like to know whether there is an easier way to solve this problem. Thanks.
Or use [[]] for getting multiple columns:
df[(df[['A','B']].values==[1,4]).all(1)]
Demo:
>>> df = pd.DataFrame({'A':[1,1,3],'B':[4,5,6]})
>>> df[(df[['A','B']].values==[1,4]).all(1)]
   A  B
0  1  4
>>>
You can use a Boolean mask via a NumPy array representation of your data:
df = pd.DataFrame({'A':[1,1,3],'B':[4,5,6]})
res = df[(df.loc[:, 'A':'B'].values == [1, 4]).all(1)]
print(res)
   A  B
0  1  4
In this situation, never combine your columns into a single series of lists. This is inefficient as you will lose all vectorisation benefits, and any processing thereafter will involve Python-level loops.
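When there are tens of columns, relying on the positional order of a plain list can be error-prone. One possible variant, sketched here with a hypothetical target dict that maps column names to the required values, keeps the comparison vectorised while matching by label.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 3], 'B': [4, 5, 6]})

# Hypothetical target values, one entry per column to compare
target = {'A': 1, 'B': 4}

# eq aligns the Series index with the column labels; all(axis=1)
# then keeps only the rows where every compared column matches
mask = df[list(target)].eq(pd.Series(target)).all(axis=1)
res = df[mask]
print(res)
```

Adding another column to the comparison then only requires a new entry in target.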

Iterate two pandas dataframes error when one dataframe is empty

I have been trying to iterate over two pandas dataframes using zip. It works perfectly as long as both dataframes contain values. If one of the dataframes is empty, the loop does not iterate and returns nothing.
for (kin_index, kin_row), (sub_index, sub_row) in zip(df1.iterrows(), df2.iterrows()):
    print(kin_index,sub_index)
I want to iterate over both dataframes even if one is empty.
As written, the loop body never runs if one of the dataframes is empty.
zip only runs as far as the shortest iterable. If one of the iterables is empty, you won't be able to iterate any values.
itertools.zip_longest iterates to the longest iterable, but to ensure this works with unpacking you need to specify fillvalue as a tuple of length 2:
from itertools import zip_longest
df1 = pd.DataFrame([[0, 1], [2, 3]])
df2 = pd.DataFrame()
zipper = zip_longest(df1.iterrows(), df2.iterrows(), fillvalue=(None, None))
for (idx1, row1), (idx2, row2) in zipper:
    print(idx1, idx2)
0 None
1 None
But there are very few occasions when you should need to iterate rows like this. In fact, it should be avoided if at all possible. You should consider refactoring your logic to use vectorised functionality.

Looping through a list of pandas dataframes

Two quick pandas questions for you.
I have a list of dataframes I would like to apply a filter to.
countries = [us, uk, france]
for df in countries:
    df = df[(df["Send Date"] > '2016-11-01') & (df["Send Date"] < '2016-11-30')]
When I run this, the dataframes are unchanged afterwards. Why is that?
If I loop through the dataframes to create a new column, as below, this works fine, and changes each df in the list.
for df in countries:
    df["Continent"] = "Europe"
As a follow up question, I noticed something strange when I created a list of dataframes for different countries. I defined the list then applied transformations to each df in the list. After I transformed these different dfs, I called the list again. I was surprised to see that the list still pointed to the unchanged dataframes, and I had to redefine the list to update the results. Could anybody shed any light on why that is?
Taking a look at this answer, you can see that for df in countries: is equivalent to something like
for idx in range(len(countries)):
    df = countries[idx]
    # do something with df
Assigning a new value to df inside this loop only rebinds the local name df; nothing is ever written back to countries[idx], so the list is left untouched. It is also generally bad practice to modify a list while iterating over it in a loop like this.
A better approach would be a list comprehension. You can try something like
countries = [us, uk, france]
countries = [df[(df["Send Date"] > '2016-11-01') & (df["Send Date"] < '2016-11-30')]
             for df in countries]
Notice that with a list comprehension like this, we aren't actually modifying the original list - instead we are creating a new list, and assigning it to the variable which held our original list.
Also, you might consider placing all of your data in a single DataFrame with an additional country column or something along those lines - Python-level loops are generally slower and a list of DataFrames is often much less convenient to work with than a single DataFrame, which can fully leverage the vectorized pandas methods.
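As a rough sketch of the single-DataFrame idea, the frames can be tagged with a country column and concatenated, after which one filter covers every country. The sample dates and the column name country are assumptions made for illustration; only "Send Date" comes from the question.

```python
import pandas as pd

# Hypothetical sample data standing in for the per-country frames
us = pd.DataFrame({"Send Date": pd.to_datetime(["2016-10-15", "2016-11-10"])})
uk = pd.DataFrame({"Send Date": pd.to_datetime(["2016-11-20", "2016-12-05"])})

# Tag each frame with its country, then stack them into one DataFrame
combined = pd.concat(
    [us.assign(country="us"), uk.assign(country="uk")],
    ignore_index=True,
)

# A single vectorised filter now applies to all countries at once
in_range = (combined["Send Date"] > "2016-11-01") & (combined["Send Date"] < "2016-11-30")
november = combined[in_range]
print(november)
```

Per-country views remain available afterwards through november.groupby("country") or a simple boolean filter on the country column.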
For why
for df in countries:
    df["Continent"] = "Europe"
modifies the DataFrames in countries, while
for df in countries:
    df = df[(df["Send Date"] > '2016-11-01') & (df["Send Date"] < '2016-11-30')]
does not, see why should I make a copy of a data frame in pandas. Inside the loop, df is a name bound to the same DataFrame object that countries holds, so a change made through df, such as adding a column, changes that object in place. Taking a subset is different: it builds a new DataFrame and rebinds the name df to it, while the list keeps pointing at the original, unfiltered DataFrame.
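The difference between mutating an object and rebinding a name can be checked directly. This is a small sketch with made-up data; the column x and the threshold are assumptions chosen only to keep the example short.

```python
import pandas as pd

# Hypothetical frames standing in for the country data
us = pd.DataFrame({"x": [1, 2, 3]})
uk = pd.DataFrame({"x": [4, 5, 6]})
countries = [us, uk]

# Mutation: adds a column to the very objects the list refers to
for df in countries:
    df["Continent"] = "Europe"

# Rebinding: df now names a new, filtered frame; the list is untouched
for df in countries:
    df = df[df["x"] > 2]

lengths_after_loop = [len(d) for d in countries]

# Building a new list is what actually stores the filtered frames
countries = [d[d["x"] > 2] for d in countries]
lengths_after_rebuild = [len(d) for d in countries]

print(lengths_after_loop, lengths_after_rebuild)
```

The same reasoning covers the follow-up question: a list stores references to objects, so it only reflects transformations that mutate those objects, not ones that produce new DataFrames assigned to other names.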

Filtering pandas DataFrame

I'm reading in a .csv file using pandas, and then I want to filter out the rows where a specified column's value is not in a dictionary for example. So something like this:
df = pd.read_csv('mycsv.csv', sep='\t', encoding='utf-8', index_col=0,
                 names=['col1', 'col2','col3','col4'])
c = df.col4.value_counts(normalize=True).head(20)
values = dict(zip(c.index.tolist()[1::2], c.tolist()[1::2])) # Take the odd-positioned entries and build a dict
# df_filtered = (all rows of df where col4 is in values)
After searching around a bit I tried using the following to filter it:
df_filtered = df[df.col4 in values]
but that unfortunately didn't work.
I've done the following to make it work for what I want to do, but it's incredibly slow for a large .csv file, so I thought there must be a way to do it that's built into pandas:
t = [(list(df.col1) + list(df.col2) + list(df.col3)) for i in range(len(df.col4)) if list(df.col4)[i] in values]
If you want to check against the dictionary values:
df_filtered = df[df.col4.isin(values.values())]
If you want to check against the dictionary keys:
df_filtered = df[df.col4.isin(values.keys())]
As A.Kot mentioned, you could use the values method of the dict to search. Note that values returns a list in Python 2 and a view object in Python 3.
If your only reason for creating that dict is membership testing, and you only ever look at the values of the dict, then you are using the wrong data structure.
A set is a better fit for membership testing, and it can be passed straight to isin, which simplifies your check to:
df_filtered = df[df.col4.isin(values)]
If you use values elsewhere, and you want to check against the keys, then you're OK because membership testing against keys is efficient.
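A minimal sketch of set-based filtering follows. The in operator cannot be applied element-wise to a Series, so Series.isin performs the membership test row by row. The sample data and the name allowed are assumptions for illustration; only col4 comes from the question.

```python
import pandas as pd

# Hypothetical data with the column name used in the question
df = pd.DataFrame({"col4": ["a", "b", "c", "a", "d"]})

# A set of the values that should be kept
allowed = {"a", "d"}

# isin returns a boolean Series, which is then used as a row mask
df_filtered = df[df["col4"].isin(allowed)]
print(df_filtered)
```

To keep the rows that are not in the set, the mask can be negated with ~, as in df[~df["col4"].isin(allowed)].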
