Filtering pandas DataFrame - python

I'm reading in a .csv file using pandas, and then I want to filter out the rows where a specified column's value is not in a dictionary for example. So something like this:
df = pd.read_csv('mycsv.csv', sep='\t', encoding='utf-8', index_col=0,
names=['col1', 'col2','col3','col4'])
c = df.col4.value_counts(normalize=True).head(20)
values = dict(zip(c.index.tolist()[1::2], c.tolist()[1::2])) # Get odd and create dict
df_filtered = filter out all rows where col4 not in values
After searching around a bit I tried using the following to filter it:
df_filtered = df[df.col4 in values]
but that unfortunately didn't work.
I've done the following to make it work for what I want to do, but it's incredibly slow for a large .csv file, so I thought there must be a way to do it that's built in to pandas:
t = [(list(df.col1) + list(df.col2) + list(df.col3)) for i in range(len(df.col4)) if list(df.col4)[i] in values]

If you want to check against the dictionary values:
df_filtered = df[df.col4.isin(values.values())]
If you want to check against the dictionary keys:
df_filtered = df[df.col4.isin(values.keys())]

As A.Kot mentioned, you could use the dict's values method to search. But values returns a list in Python 2 and a view object in Python 3.
If your only reason for creating that dict is membership testing, and you only ever look at the values of the dict then you are using the wrong data structure.
A set will improve your lookup performance, and it can be passed straight to isin:
df_filtered = df[df.col4.isin(values)]  # values is now a set
If you use values elsewhere, and you want to check against the keys, then you're ok because membership testing against keys is efficient.
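For illustration, a minimal sketch of the set-based filter; the file name and column names are taken from the question, and the data itself isn't shown, so treat this as illustrative only:
import pandas as pd

df = pd.read_csv('mycsv.csv', sep='\t', encoding='utf-8', index_col=0,
                 names=['col1', 'col2', 'col3', 'col4'])

# Top-20 relative frequencies, then keep every other value as a plain set
c = df.col4.value_counts(normalize=True).head(20)
keep = set(c.index.tolist()[1::2])

# isin accepts any iterable, so a set works and keeps lookups fast
df_filtered = df[df.col4.isin(keep)]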

Related

Transitioning from pandas .apply to a vectorization approach

I am trying to improve a pandas iteration with a purely vectorized approach. I am a little new to vectorization and am having trouble getting it to work.
Within one dataframe field, I am finding all the unique string-based address records. I need to search the dataframe for each unique address individually and assign a single unique identifier to the returned records. In this way, I can have one UID for each address regardless of multiple occurrences in the dataframe.
I have developed an approach that utilizes vectorization with the pandas .apply method.
def addr_id(x):
    global df
    df['Unq_ID'][df['address'] == x] = uuid.uuid4()

pd.DataFrame(df['address'].unique(), columns=["column1"]).apply(lambda x: addr_id(x["column1"]), axis=1)
However, I am trying to do away with the .apply method completely. This is where I am stuck.
df['Unq_ID'][df['address'] == (pd.DataFrame(df['address'].unique(), columns=["column1"]))["column1"]] = uuid.uuid4()
I keep getting a ValueError: Can only compare identically-labeled Series objects
You want to get rid of the Pandas apply due to performance reasons, right?
May I suggest a different approach to your problem?
You can construct a dict with the unique addresses as keys and freshly generated uuids as values, and then map it onto the DataFrame:
uuid_dict = {key: uuid.uuid4() for key in df['address'].unique()}
df['Unq_ID'] = df['address'].map(uuid_dict)
This would be very fast because it avoids looping in Python (which Pandas apply does under the hood).
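A self-contained sketch of that idea, using the question's address and Unq_ID column names (the sample rows are made up purely for illustration):
import uuid
import pandas as pd

df = pd.DataFrame({'address': ['1 Main St', '2 Oak Ave', '1 Main St']})

# One uuid per unique address, mapped back onto every row
uuid_dict = {key: uuid.uuid4() for key in df['address'].unique()}
df['Unq_ID'] = df['address'].map(uuid_dict)

print(df)  # rows with the same address share the same Unq_ID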

Proper way to extract value from DataFrame with composite index?

I have a dataframe, call it current_data. This dataframe is generated via running statistical functions over another dataframe, current_data_raw. It has a compound index on columns "Method" and "Request.Name"
current_data = current_data_raw.groupby(['Name', 'Request.Method']).size().reset_index().set_index(['Name', 'Request.Method'])
I then run a bunch of statistical functions over current_data_raw adding new columns to current_data
I then need to query that dataframe for specific values of columns. I would love to do something like:
val = df['Request.Name' == some_name, 'Method' = some_method]['Average']
However this isn't working, nor are the variants I have attempted. .xs returns a Series. I could grab the only row in the Series, but that doesn't seem proper.
To select from a MultiIndex you can use a tuple with one value per level, in level order (note that the index created above has levels 'Name' and 'Request.Method', so there is no level actually named 'Request.Name'):
val = df.loc[(some_name, some_method), 'Average']
Another way is DataFrame.query; if level names contain spaces or a ., backticks are necessary:
val = df.query("`Request.Name`=='some_name' & `Request.Method`=='some_method'")['Average']
If the level names are single words:
val = df.query("Name=='some_name' & Method=='some_method'")['Average']

Selecting Various "Pieces" of a List

I have a list of columns in a Pandas DataFrame and am looking to create a list of certain columns without manual entry.
My issue is that I am learning and not knowledgeable enough yet.
I have tried searching around the internet but nothing quite matched my case. I apologize if this is a duplicate.
The list I am trying to cut from looks like this:
['model',
'displ',
'cyl',
'trans',
'drive',
'fuel',
'veh_class',
'air_pollution_score',
'city_mpg',
'hwy_mpg',
'cmb_mpg',
'greenhouse_gas_score',
'smartway']
Here is the code that I wrote on my own: dataframe.columns.tolist()[:6,8:10,11]
In this scenario I am trying to select everything except 'air_pollution_score' and 'greenhouse_gas_score'.
My ultimate goal is to understand the syntax and how to select pieces of a list.
You could do that, or you could just use drop to remove the columns you don't want:
dataframe.drop(['air_pollution_score', 'greenhouse_gas_score'], axis=1).columns
Note that you need to specify axis=1 so that pandas knows you want to remove columns, not rows.
Even if you wanted to use list syntax, I would say that it's better to use a list comprehension instead; something like this:
exclude_columns = ['air_pollution_score', 'greenhouse_gas_score']
[col for col in dataframe.columns if col not in exclude_columns]
This gets all the columns in the dataframe unless they are present in exclude_columns.
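If what you ultimately want is the trimmed DataFrame rather than just the list of names, the same comprehension can drive the selection (a small sketch reusing the names above):
keep = [col for col in dataframe.columns if col not in exclude_columns]
trimmed = dataframe[keep]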
Let's say df is your dataframe. You can actually use filter and lambda, though it quickly becomes rather long. I present this as a "one-liner" alternative to the answer of @gmds.
df[
    list(filter(
        lambda x: ('air_pollution_score' not in x) and ('greenhouse_gas_score' not in x),
        df.columns.values
    ))
]
What's happening here is:
filter applies a function to an iterable and keeps only the elements for which that function returns True.
We defined that function with lambda to check that neither 'air_pollution_score' nor 'greenhouse_gas_score' appears in the column name.
We're filtering on the df.columns.values list, so the resulting list only retains the columns we didn't exclude.
We're using the df[['column1', 'column2']] syntax, which means "make a new dataframe containing only the columns I list."
Simple solution with pandas
import pandas as pd
data = pd.read_csv('path to your csv file')
df = data[['column1', 'column2', 'column3']]  # list as many column names as you like
Note: data is the source you have already loaded with pandas; the selected columns are stored in a new DataFrame df.

Dataframe iteration better practices for values assignment [duplicate]

This question already has answers here:
Pandas DataFrame to List of Dictionaries
(5 answers)
Closed 4 years ago.
I was wondering how to make cleaner code, so I started to pay attention to some of my daily code routines. I frequently have to iterate over a dataframe to update a list of dicts:
foo = []
for index, row in df.iterrows():
    bar = {}
    bar['foobar0'] = row['foobar0']
    bar['foobar1'] = row['foobar1']
    foo.append(bar)
I think it is hard to maintain, because if the df keys change, the loop will no longer work. Besides that, writing the same keys for two data structures is a kind of code duplication.
The context is, I frequently make api calls to a specific endpoint that receives a list of dicts.
I'm looking for improvements to that routine: how can I replace the explicit key assignments with some map/lambda trick, in order to avoid errors caused by key changes in a given dataframe (frequently resulting from some query in the database)?
In other words, if a column name in the database changes, the dataframe keys will change too. So I'd like to create dicts on the fly with the same keys as a given dataframe, filling each entry with the corresponding dataframe values.
How can I do that?
The simple way to do this is to_dict, which takes an orient argument that you can use to specify how you want the result structured.
In particular, orient='records' gives you a list of records, each one a dict in {col1name: col1value, col2name: col2value, ...} format.
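A tiny sketch with made-up data (the foobar column names follow the question):
import pandas as pd

df = pd.DataFrame({'foobar0': [1, 2], 'foobar1': ['a', 'b']})

foo = df.to_dict(orient='records')
# -> [{'foobar0': 1, 'foobar1': 'a'}, {'foobar0': 2, 'foobar1': 'b'}]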
(Your question is a bit confusing. At the very end, you say, "I'd like to create a dict on the fly with the same keys as a given dataframe and fill each dict entry with the corresponding dataframe values." This makes it sound like you want a dict of lists (that's to_dict(orient='list')) or maybe a dict of dicts (that's to_dict(orient='dict'), or just to_dict(), because that's the default), not a list of dicts.)
If you want to know how to do this manually (which you don't want to actually do, but it's worth understanding): a DataFrame acts like a dict, with the column names as the keys and the Series as the values. So you can get a list of the column names the same way you do with a normal dict:
columns = list(df)
Then:
foo = []
for index, row in df.iterrows():
    bar = {}
    for key in columns:
        bar[key] = row[key]
    foo.append(bar)
Or, more compactly:
foo = [{key: row[key] for key in columns} for _, row in df.iterrows()]

Convert multiple columns to string in pandas dataframe

I have a pandas data frame with different data types. I want to convert more than one column in the data frame to string type. I have done this individually for each column, but want to know if there is a more efficient way.
So at present I am doing something like this:
repair['SCENARIO']=repair['SCENARIO'].astype(str)
repair['SERVICE_TYPE']= repair['SERVICE_TYPE'].astype(str)
I want a function that would help me pass multiple columns and convert them to strings.
To convert multiple columns to string, pass a list of columns to your above-mentioned command:
df[['one', 'two', 'three']] = df[['one', 'two', 'three']].astype(str)
# add as many column names as you like.
That means that one way to convert all columns is to construct the list of columns like this:
all_columns = list(df) # Creates list of all column headers
df[all_columns] = df[all_columns].astype(str)
Note that converting every column can also be done directly with df = df.astype(str).
I know this is an old question, but I was looking for a way to turn all columns with an object dtype to strings as a workaround for a bug I discovered in rpy2. I'm working with large dataframes, so didn't want to list each column explicitly. This seemed to work well for me so I thought I'd share in case it helps someone else.
stringcols = df.select_dtypes(include='object').columns
df[stringcols] = df[stringcols].fillna('').astype(str)
The "fillna('')" prevents NaN entries from getting converted to the string 'nan' by replacing with an empty string instead.
You can also use a list comprehension, wrapping it in pd.concat so the result is still a DataFrame:
df = pd.concat([df[col_name].astype(str) for col_name in df.columns], axis=1)
You can also insert a condition to test whether a column should be converted, for example:
df = pd.concat([df[col_name].astype(str) if 'to_str' in col_name else df[col_name] for col_name in df.columns], axis=1)
