Pandas Filtering Column - python

Following up on the accepted answer at another question on SO (Filtering a dataframe by column name based on multiple conditions):
import pandas as pd
c = ["XYYZ 1011", "XYYZ 1021", "XYYZ 1031", "XXYZ 1011", "XXYZ 1021", "XXYZ 1031","XYY 1001", "XYY 1002", "XXZ 1001"]
df = pd.DataFrame(columns=c)
print(df)
df = df.filter(regex='X[XY|YY]Z 10[1|2|3]1')
print(df)
The output of print misses the XXYZ 1011, XXYZ 1021, XXYZ 1031, etc. columns. Why?

IIUC, your regex should be 'X(XY|YY)Z 10[123]1':
df.filter(regex='X(XY|YY)Z 10[123]1')
output:
Empty DataFrame
Columns: [XYYZ 1011, XYYZ 1021, XYYZ 1031, XXYZ 1011, XXYZ 1021, XXYZ 1031]
Index: []
You have to differentiate between character groups ([123] matches exactly one character: "1", "2" or "3") and alternation patterns ((ABC|DEF) matches the string ABC or the string DEF).
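For reference, a small standalone sketch (using plain re rather than df.filter, purely for illustration) of the difference between a character class and an alternation group:
import re

pattern = r'^X(XY|YY)Z 10[123]1$'
columns = ["XYYZ 1011", "XXYZ 1021", "XYY 1001", "XXZ 1001"]

# (XY|YY) matches the two-character string "XY" or "YY";
# [123] matches exactly one character that is "1", "2" or "3"
print([c for c in columns if re.match(pattern, c)])
# ['XYYZ 1011', 'XXYZ 1021']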


Using map to convert pandas dataframe to list

I am using map to convert some columns in a dataframe to list of dicts. Here is a MWE illustrating my question.
import pandas as pd
df = pd.DataFrame()
df['Col1'] = [197, 1600, 1200]
df['Col2'] = [297, 2600, 2200]
df['Col1_a'] = [198, 1599, 1199]
df['Col2_a'] = [296, 2599, 2199]
print(df)
The output is
Col1 Col2 Col1_a Col2_a
0 197 297 198 296
1 1600 2600 1599 2599
2 1200 2200 1199 2199
Now say I want to extract only those columns whose name ends with a suffix "_a". One way to do it is the following -
list_col = ["Col1","Col2"]
cols_w_suffix = map(lambda x: x + '_a', list_col)
print(df[cols_w_suffix].to_dict('records'))
[{'Col1_a': 198, 'Col2_a': 296}, {'Col1_a': 1599, 'Col2_a': 2599}, {'Col1_a': 1199, 'Col2_a': 2199}]
This is the expected answer. However, if I try to print the same expression again, I get an empty dataframe.
print(df[cols_w_suffix].to_dict('records'))
[]
Why does it evaluate to an empty dataframe? I think I am missing something about the behavior of map, because when I directly pass the column names, the output is still as expected.
df[["Col1_a","Col2_a"]].to_dict('records')
[{'Col1_a': 198, 'Col2_a': 296}, {'Col1_a': 1599, 'Col2_a': 2599}, {'Col1_a': 1199, 'Col2_a': 2199}]
Your map iterator is exhausted.
Use cols_w_suffix = list(map(lambda x: x + '_a', list_col)) or a list comprehension cols_w_suffix = [f'{x}_a' for x in list_col].
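For illustration, a minimal sketch of the behaviour at play (Python 3): a map object is a one-shot iterator, so a second pass over it yields nothing.
list_col = ["Col1", "Col2"]
cols_w_suffix = map(lambda x: x + '_a', list_col)

print(list(cols_w_suffix))  # ['Col1_a', 'Col2_a'] -- the iterator is consumed here
print(list(cols_w_suffix))  # [] -- nothing left on the second pass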
That said, a better method to select the columns would be:
df.filter(regex='_a$')
Or:
df.loc[:, df.columns.str.endswith('_a')]

how to drop multiple columns in pandas?

How to drop multiple columns in pandas and python?
import pandas as pd
df = pd.DataFrame({
    "source_number": [11199, 11328, 11287, 32345, 12342, 1232, 13456, 123244, 13456],
    "location": ["loc2", "loc1-loc3", "loc3", "loc1", "loc2-loc1", "loc2", "loc3-loc2", "loc2", "loc1"],
    "category": ["cat1", "cat2", "cat1", "cat3", "cat3", "cat3", "cat2", "cat3", "cat2"],
})
def remove_columns(dataset, cols):
    for col in cols:
        del dataset[col]
    return dataset

for col in df.columns:
    df = remove_columns(df, col)

df.head()
In the code above the task is done and the columns are dropped.
But when I tried this code in Streamlit, where the user selects multiple columns that he wants to remove from the dataframe, the problem is that the system only takes the first element and not all the items in the list.
For example, if the user selects location and source_number, the col variable contains just location and the error below is displayed:
KeyError: 'location'
Traceback:
File "f:\aienv\lib\site-packages\streamlit\script_runner.py", line 333, in _run_script
exec(code, module.__dict__)
File "F:\AIenv\streamlit\app.py", line 373, in <module>
sidebars[y]=st.sidebar.multiselect('Filter '+y, df[y].unique(),key="1")
File "f:\aienv\lib\site-packages\pandas\core\frame.py", line 2902, in __getitem__
indexer = self.columns.get_loc(key)
File "f:\aienv\lib\site-packages\pandas\core\indexes\base.py", line 2893, in get_loc
raise KeyError(key) from err
Streamlit code:
import numpy as np
import pandas as pd
import streamlit as st
#function drop unwanted columns
def remove_columns(dataset, cols):
    for col in cols:
        del dataset[col]
    return dataset

df = pd.DataFrame({
    "source_number": [11199, 11328, 11287, 32345, 12342, 1232, 13456, 123244, 13456],
    "location": ["loc2", "loc1-loc3", "loc3", "loc1", "loc2-loc1", "loc2", "loc3-loc2", "loc2", "loc1"],
    "category": ["cat1", "cat2", "cat1", "cat3", "cat3", "cat3", "cat2", "cat3", "cat2"],
})
drop_button = st.sidebar.button("Remove")
columns = st.sidebar.multiselect("Select column/s", df.columns)
sidebars = {}
for y in columns:
    ucolumns = list(df[y].unique())
    st.write(y)
    if (drop_button):
        df_drop = df.drop(y, axis=1, inplace=True)
        print(y)
st.table(df)
Use DataFrame.drop:
def remove_columns(dataset, cols):
    return dataset.drop(cols, axis=1)
And when calling, pass to the function with no loop; it is possible to pass a scalar or a list:
df = remove_columns(df,'location')
df = remove_columns(df,['location','category'])
EDIT:
If you need to remove the columns selected in the multiselect, use:
drop_button = st.sidebar.button("Remove")
# the columns variable holds the selected values
columns = st.sidebar.multiselect("Select column/s", df.columns)
print(columns)
# so if the button is pressed, remove the columns listed in the columns variable
if (drop_button):
    df.drop(columns, axis=1, inplace=True)
st.table(df)
Pandas has already implemented this inside of the drop function.
You can use DataFrame.drop with the parameter columns=[columns that you want to drop], like this instead:
df.drop(columns=["source_number", "location"])
I hope this is what you are looking for.
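As a small sketch of that point (a trimmed version of the question's frame): drop(columns=...) returns a new DataFrame and leaves the original untouched unless inplace=True is passed.
import pandas as pd

df = pd.DataFrame({
    "source_number": [11199, 11328, 11287],
    "location": ["loc2", "loc1-loc3", "loc3"],
    "category": ["cat1", "cat2", "cat1"],
})

# returns a new DataFrame; df itself still has all three columns
smaller = df.drop(columns=["source_number", "location"])
print(smaller.columns.tolist())  # ['category']
print(df.columns.tolist())       # ['source_number', 'location', 'category']

# with inplace=True the call returns None and modifies df directly
df.drop(columns=["source_number", "location"], inplace=True)
print(df.columns.tolist())       # ['category']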

How Do I Write A Function That Counts Identical and Different IDs in Two Columns of Different Sizes

Given a reference dataframe A with one column "ID" (50,000 rows), and dataframes B, C, D with an "ID" column of 45,000, 55,000 and 70,000 rows respectively, where each "ID" is a large (seventeen-digit) integer value and many identical values appear across the columns, but not necessarily in the same row:
How do I write a function that counts the number of identical and different values in two of these columns?
COLUMNS:
A ['ID', 196, 202, 443, 781, 557]
B ['ID', 781, 488, 712, 202, 482, 311]
C ['ID', 889, 196, 302, 444]
D ['ID', 444, 202, 675]
INPUT:
A, B
OUTPUT:
Matches: 2 Difference: 3
You can try .isin(). Example with pd.Series:
A = pd.Series([196, 202, 443, 781, 557])
B = pd.Series([781, 488, 712, 202, 482, 311])
if len(A) >= len(B):
    matches = A.isin(B)
else:
    matches = B.isin(A)
mismatches = ~matches
print('matches: {}, mismatches: {}'.format(sum(matches), sum(mismatches)))
Comparing lengths is done so that the right number of mismatches is found; it would not matter for finding the right number of matches, of course. Interpreting True as 1 and False as 0 allows summing up the numbers.
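As a worked sketch of the isin() idea wrapped into the function the question asks for (a hypothetical count_matches helper, counting relative to the reference column A, which reproduces the Matches: 2 Difference: 3 output from the question):
import pandas as pd

def count_matches(ref, other):
    # ref and other are assumed to be pandas Series of IDs
    matches = ref.isin(other)
    return int(matches.sum()), int((~matches).sum())

A = pd.Series([196, 202, 443, 781, 557])
B = pd.Series([781, 488, 712, 202, 482, 311])
same, different = count_matches(A, B)
print('Matches: {} Difference: {}'.format(same, different))
# Matches: 2 Difference: 3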

KeyError using `dask.merge()`

So I have two pandas dataframes created via
df1 = pd.read_csv("first1.csv")
df2 = pd.read_csv("second2.csv")
These both have the column column1. To double check,
print(df1.columns)
print(df2.columns)
both return a column 'column1'.
So, I would like to merge these two dataframes with dask, using 60 threads locally (using an outer merge):
dd1 = dd.merge(df1, df2, on="column1", how="outer", suffixes=("","_repeat")).compute(num_workers=60)
That fails with a KeyError, KeyError: 'column1'
Traceback (most recent call last):
File "INSTALLATIONPATH/python3.5/site-packages/pandas/indexes/base.py", line 2134, in get_loc
return self._engine.get_loc(key)
File "pandas/index.pyx", line 139, in pandas.index.IndexEngine.get_loc (pandas/index.c:4443)
File "pandas/index.pyx", line 161, in pandas.index.IndexEngine.get_loc (pandas/index.c:4289)
File "pandas/src/hashtable_class_helper.pxi", line 732, in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:13733)
File "pandas/src/hashtable_class_helper.pxi", line 740, in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:13687)
KeyError: 'column1'
I would think this is a parallelizable task, i.e. dd.merge(df1, df2, on='id')
Is there a "dask-equivalent" operation for this? I also tried reindexing the pandas dataframes on chr (i.e. df1 = df1.reset_index('chr') ) and then tried joining on the index
dd.merge(df1, df2, left_index=True, right_index=True)
That didn't work either, same error.
http://dask.pydata.org/en/latest/dataframe-overview.html
From your error, I would double check your initial dataframes to make sure you do have column1 in both (no extra spaces or anything) as an actual column, because it should work fine (there is no error in the following code).
Additionally, there's a difference between calling merge on pandas.DataFrame or on Dask.dataframe.
Here's some example data:
import numpy as np
import pandas as pd
import dask.dataframe as dd

df1 = pd.DataFrame(np.transpose([np.arange(1000),
                                 np.arange(1000)]), columns=['column1', 'column1_1'])
df2 = pd.DataFrame(np.transpose([np.arange(1000),
                                 np.arange(1000, 2000)]), columns=['column1', 'column1_2'])
And their dask equivalent:
ddf1 = dd.from_pandas(df1, npartitions=100)
ddf2 = dd.from_pandas(df2, npartitions=100)
Using pandas.DataFrame:
In [1]: type(dd.merge(df1, df2, on="column1", how="outer"))
Out [1]: pandas.core.frame.DataFrame
So this returns a pandas.DataFrame, so you cannot call compute() on it.
Using dask.dataframe:
In [2]: type(dd.merge(ddf1, ddf2, on="column1", how="outer"))
Out[2]: dask.dataframe.core.DataFrame
Here you can call compute:
In [3]: dd.merge(ddf1,ddf2, how='outer').compute(num_workers=60)
Out[3]:
column1 column1_1 column1_2
0 0 0 1000
1 400 400 1400
2 100 100 1100
3 500 500 1500
4 300 300 1300
Side Note: depending on the size of your data and your hardware, you might want to check if doing a pandas.join wouldn't be faster:
df1.set_index('column1').join(df2.set_index('column1'), how='outer').reset_index()
With a size of (1,000,000, 2) for each df, it's faster than the dask solution on my hardware.
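If the goal is still to run the original outer merge through dask, a minimal sketch (assuming first1.csv and second2.csv really do both contain column1) would be to wrap the pandas frames first:
import pandas as pd
import dask.dataframe as dd

df1 = pd.read_csv("first1.csv")
df2 = pd.read_csv("second2.csv")

# wrap the pandas frames so merge and compute run through dask
ddf1 = dd.from_pandas(df1, npartitions=60)
ddf2 = dd.from_pandas(df2, npartitions=60)

merged = dd.merge(ddf1, ddf2, on="column1", how="outer",
                  suffixes=("", "_repeat")).compute(num_workers=60)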

How to get n longest entries of DataFrame?

I'm trying to get the n longest entries of a dask DataFrame. I tried calling nlargest on a dask DataFrame with two columns like this:
import dask.dataframe as dd
df = dd.read_csv("opendns-random-domains.txt", header=None, names=['domain_name'])
df['domain_length'] = df.domain_name.map(len)
print(df.head())
print(df.dtypes)
top_3 = df.nlargest(3, 'domain_length')
print(top_3.head())
The file opendns-random-domains.txt contains just a long list of domain names. This is what the output of the above code looks like:
domain_name domain_length
0 webmagnat.ro 12
1 nickelfreesolutions.com 23
2 scheepvaarttelefoongids.nl 26
3 tursan.net 10
4 plannersanonymous.com 21
domain_name object
domain_length float64
dtype: object
Traceback (most recent call last):
File "nlargest_test.py", line 9, in <module>
print(top_3.head())
File "/home/work/Dokumente/ModZero/Commerzbank/DNS_und_Proxylog-Analyse/dask-log-analyzer/venv/lib/python3.5/site-packages/dask/dataframe/core.py", line 382, in head
result = result.compute()
File "/home/work/Dokumente/ModZero/Commerzbank/DNS_und_Proxylog-Analyse/dask-log-analyzer/venv/lib/python3.5/site-packages/dask/base.py", line 86, in compute
return compute(self, **kwargs)[0]
File "/home/work/Dokumente/ModZero/Commerzbank/DNS_und_Proxylog-Analyse/dask-log-analyzer/venv/lib/python3.5/site-packages/dask/base.py", line 179, in compute
results = get(dsk, keys, **kwargs)
File "/home/work/Dokumente/ModZero/Commerzbank/DNS_und_Proxylog-Analyse/dask-log-analyzer/venv/lib/python3.5/site-packages/dask/threaded.py", line 57, in get
**kwargs)
File "/home/work/Dokumente/ModZero/Commerzbank/DNS_und_Proxylog-Analyse/dask-log-analyzer/venv/lib/python3.5/site-packages/dask/async.py", line 484, in get_async
raise(remote_exception(res, tb))
dask.async.TypeError: Cannot use method 'nlargest' with dtype object
Traceback
---------
File "/home/work/Dokumente/ModZero/Commerzbank/DNS_und_Proxylog-Analyse/dask-log-analyzer/venv/lib/python3.5/site-packages/dask/async.py", line 267, in execute_task
result = _execute_task(task, data)
File "/home/work/Dokumente/ModZero/Commerzbank/DNS_und_Proxylog-Analyse/dask-log-analyzer/venv/lib/python3.5/site-packages/dask/async.py", line 249, in _execute_task
return func(*args2)
File "/home/work/Dokumente/ModZero/Commerzbank/DNS_und_Proxylog-Analyse/dask-log-analyzer/venv/lib/python3.5/site-packages/dask/dataframe/core.py", line 2040, in <lambda>
f = lambda df: df.nlargest(n, columns)
File "/home/work/Dokumente/ModZero/Commerzbank/DNS_und_Proxylog-Analyse/dask-log-analyzer/venv/lib/python3.5/site-packages/pandas/core/frame.py", line 3355, in nlargest
return self._nsorted(columns, n, 'nlargest', keep)
File "/home/work/Dokumente/ModZero/Commerzbank/DNS_und_Proxylog-Analyse/dask-log-analyzer/venv/lib/python3.5/site-packages/pandas/core/frame.py", line 3318, in _nsorted
ser = getattr(self[columns[0]], method)(n, keep=keep)
File "/home/work/Dokumente/ModZero/Commerzbank/DNS_und_Proxylog-Analyse/dask-log-analyzer/venv/lib/python3.5/site-packages/pandas/util/decorators.py", line 91, in wrapper
return func(*args, **kwargs)
File "/home/work/Dokumente/ModZero/Commerzbank/DNS_und_Proxylog-Analyse/dask-log-analyzer/venv/lib/python3.5/site-packages/pandas/core/series.py", line 1898, in nlargest
return algos.select_n(self, n=n, keep=keep, method='nlargest')
File "/home/work/Dokumente/ModZero/Commerzbank/DNS_und_Proxylog-Analyse/dask-log-analyzer/venv/lib/python3.5/site-packages/pandas/core/algorithms.py", line 559, in select_n
raise TypeError("Cannot use method %r with dtype %s" % (method, dtype))
I'm confused, because I'm calling nlargest on the column which is of type float64 but still get this error saying it cannot be called on dtype object. Also this works fine in pandas. How can I get the n longest entries from a DataFrame?
I was helped by explicit type conversion:
df['column'].astype(str).astype(float).nlargest(5)
This is how my first data frame looks. This is how my new data frame looks after getting the top 5:
station_count.nlargest(5, 'count')
You have to give the nlargest command a column that has an int data type, not a string, so it can calculate the count: always the top n number followed by its corresponding column, which must be of int type.
I tried to reproduce your problem but things worked fine. Can I recommend that you produce a Minimal Complete Verifiable Example?
Pandas example
In [1]: import pandas as pd
In [2]: df = pd.DataFrame({'x': ['a', 'bb', 'ccc', 'dddd']})
In [3]: df['y'] = df.x.map(len)
In [4]: df
Out[4]:
x y
0 a 1
1 bb 2
2 ccc 3
3 dddd 4
In [5]: df.nlargest(3, 'y')
Out[5]:
x y
3 dddd 4
2 ccc 3
1 bb 2
Dask dataframe example
In [1]: import pandas as pd
In [2]: df = pd.DataFrame({'x': ['a', 'bb', 'ccc', 'dddd']})
In [3]: import dask.dataframe as dd
In [4]: ddf = dd.from_pandas(df, npartitions=2)
In [5]: ddf['y'] = ddf.x.map(len)
In [6]: ddf.nlargest(3, 'y').compute()
Out[6]:
x y
3 dddd 4
2 ccc 3
1 bb 2
Alternatively, perhaps this is just working now on the git master version?
You only need to change the type of the respective column to int or float using .astype().
For example, in your case:
top_3 = df['domain_length'].astype(float).nlargest(3)
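For context, a rough sketch of how that cast might slot into the dask pipeline from the question (using a small in-memory sample instead of opendns-random-domains.txt):
import pandas as pd
import dask.dataframe as dd

# small stand-in for the domain list in the question
pdf = pd.DataFrame({'domain_name': ['webmagnat.ro', 'nickelfreesolutions.com',
                                    'scheepvaarttelefoongids.nl', 'tursan.net']})
ddf = dd.from_pandas(pdf, npartitions=2)
ddf['domain_length'] = ddf.domain_name.map(len)

# cast to a numeric dtype before nlargest, then compute to materialize the result
top_3 = ddf['domain_length'].astype(float).nlargest(3).compute()
print(top_3)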
If you want to get the values with the most occurrences from a string-type column, you may use value_counts() with nlargest(n), where n is the number of elements you want returned.
df['your_column'].value_counts().nlargest(3)
It will return the top 3 occurrences from that column.
