Python pandas filter by word

I have a CSV file:
df = pd.read_csv(Path(os.getcwd() + r'\all_files.csv'), sep=',', on_bad_lines='skip', index_col=False, dtype='unicode')
a column:
column = input("Column:")
and a word:
word = input("Word:")
I want to filter the CSV file:
df2 = df[(df[column].dropna().str.contains(word.lower()))]
But when I enter the column ЄДРПОУ(Гр.8)
I get an error:
Warning (from warnings module):
File "C:\python\python\FilterExcelFiles.py", line 35
df2=df[(df[column].dropna().str.contains(word.lower()))]
UserWarning: Boolean Series key will be reindexed to match DataFrame index.
Traceback (most recent call last):
File "C:\python\python\FilterExcelFiles.py", line 51, in <module>
s()
File "C:\python\python\FilterExcelFiles.py", line 35, in s
df2=df[(df[column].dropna().str.contains(word.lower()))]
File "C:\Users\Станислав\AppData\Local\Programs\Python\Python310\lib\site-packages\pandas\core\frame.py", line 3496, in __getitem__
return self._getitem_bool_array(key)
File "C:\Users\Станислав\AppData\Local\Programs\Python\Python310\lib\site-packages\pandas\core\frame.py", line 3549, in _getitem_bool_array
key = check_bool_indexer(self.index, key)
File "C:\Users\Станислав\AppData\Local\Programs\Python\Python310\lib\site-packages\pandas\core\indexing.py", line 2383, in check_bool_indexer
raise IndexingError(
pandas.core.indexing.IndexingError: Unalignable boolean Series provided as indexer (index of the boolean Series and of the indexed object do not match).
I also want to lowercase df[column].

You're dropping the NaNs in the indexer, which likely makes the boolean Series shorter than the DataFrame's index; that mismatch is what breaks the boolean indexing.
Don't dropna; the NaNs will be treated as False anyway:
df2 = df[df[column].str.contains(word.lower())]
Alternatively, if you had an operation that would return NaNs, you could fill them with False:
df2 = df[df[column].str.contains(word.lower()).fillna(False)]
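For the "lowercase df[column]" part, str.contains can do the case-insensitive match and the NaN handling in one call; a minimal sketch (assuming word is meant as a literal substring, not a regex):
# case=False makes the match case-insensitive, so lowercasing by hand is
# unnecessary; na=False turns NaN rows into False so the mask aligns with
# df's index; regex=False treats word as a plain substring.
df2 = df[df[column].str.contains(word, case=False, na=False, regex=False)]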

I searched around for an answer and came across a similar post that might have the solution to your problem.
According to that post, the error is due to Python's default encoding, which is usually ascii; the encoding can be checked with:
import sys
sys.getdefaultencoding()
To solve your problem, you need to change it to UTF-8, using the following (note that reload(sys) and sys.setdefaultencoding only exist in Python 2):
import sys
reload(sys)  # Note: this line is essential for the change
sys.setdefaultencoding('utf-8')
Credit for the original solution goes to @jochietoch

Related

Pysolar get_azimuth function applied to pandas DataFrame

I have a pandas DataFrame with columns latitude and longitude (integer type) and a date column (datetime64[ns, UTC], as needed for the function). I use the following line to produce a new column with the sun's azimuth:
daa['azimuth'] = daa.apply(lambda row: get_azimuth(row['latitude'], row['longitude'], row['date']), axis=1)
It crashes and I cannot figure out why; the only thing I know is that there is a problem with the date:
File "pandas\_libs\tslibs\timestamps.pyx", line 1332, in pandas._libs.tslibs.timestamps.Timestamp.__new__
TypeError: an integer is required
If anyone had an idea what I am supposed to do with the date, it would be great, thanks.
This goes back to a bug in pandas; see issue #32174. pysolar.solar.get_azimuth calls the .utctimetuple() method of the given datetime object (or pd.Timestamp), which fails:
import pandas as pd
s = pd.to_datetime(pd.Series(["2020-01-01", "2020-01-02"])).dt.tz_localize('UTC')
s.iloc[0]
Out[3]: Timestamp('2020-01-01 00:00:00+0000', tz='UTC')
s.iloc[0].utctimetuple()
Traceback (most recent call last):
File "<ipython-input-4-f5e393f18fdb>", line 1, in <module>
s.iloc[0].utctimetuple()
File "pandas\_libs\tslibs\timestamps.pyx", line 1332, in pandas._libs.tslibs.timestamps.Timestamp.__new__
TypeError: an integer is required
You can work around this by converting the pandas Timestamp to a Python datetime object, where utctimetuple works as expected. For the given example, you can use:
daa.apply(lambda row: get_azimuth(row['latitude'], row['longitude'], row['date'].to_pydatetime()), axis=1)
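If applying row by row is slow, one possible variant is to convert the whole column once (a sketch, assuming daa is set up as in the question):
# .dt.to_pydatetime() yields plain datetime.datetime objects, which
# pysolar accepts and which avoid the Timestamp.utctimetuple() bug.
dates = daa['date'].dt.to_pydatetime()
daa['azimuth'] = [
    get_azimuth(lat, lon, d)
    for lat, lon, d in zip(daa['latitude'], daa['longitude'], dates)
]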

Merging two dataframes with pd.NA in merge column yields 'TypeError: boolean value of NA is ambiguous'

With Pandas 1.0.1, I'm unable to merge if the merge column contains pd.NA:
df = df.merge(df2, on=some_column)
yields
File /home/torstein/code/fintechdb/Sheets/sheets/gild.py, line 42, in gild
df = df.merge(df2, on=some_column)
File /home/torstein/anaconda3/lib/python3.7/site-packages/pandas/core/frame.py, line 7297, in merge
validate=validate,
File /home/torstein/anaconda3/lib/python3.7/site-packages/pandas/core/reshape/merge.py, line 88, in merge
return op.get_result()
File /home/torstein/anaconda3/lib/python3.7/site-packages/pandas/core/reshape/merge.py, line 643, in get_result
join_index, left_indexer, right_indexer = self._get_join_info()
File /home/torstein/anaconda3/lib/python3.7/site-packages/pandas/core/reshape/merge.py, line 862, in _get_join_info
(left_indexer, right_indexer) = self._get_join_indexers()
File /home/torstein/anaconda3/lib/python3.7/site-packages/pandas/core/reshape/merge.py, line 841, in _get_join_indexers
self.left_join_keys, self.right_join_keys, sort=self.sort, how=self.how
File /home/torstein/anaconda3/lib/python3.7/site-packages/pandas/core/reshape/merge.py, line 1311, in _get_join_indexers
zipped = zip(*mapped)
File /home/torstein/anaconda3/lib/python3.7/site-packages/pandas/core/reshape/merge.py, line 1309, in <genexpr>
for n in range(len(left_keys))
File /home/torstein/anaconda3/lib/python3.7/site-packages/pandas/core/reshape/merge.py, line 1918, in _factorize_keys
rlab = rizer.factorize(rk)
File pandas/_libs/hashtable.pyx, line 77, in pandas._libs.hashtable.Factorizer.factorize
File pandas/_libs/hashtable_class_helper.pxi, line 1817, in pandas._libs.hashtable.PyObjectHashTable.get_labels
File pandas/_libs/hashtable_class_helper.pxi, line 1732, in pandas._libs.hashtable.PyObjectHashTable._unique
File pandas/_libs/missing.pyx, line 360, in pandas._libs.missing.NAType.__bool__
TypeError: boolean value of NA is ambiguous
while this works:
df[some_column].fillna(np.nan, inplace=True)
df2[some_column].fillna(np.nan, inplace=True)
df = df.merge(df2, on=some_column)
# Works
If instead, I do
df[some_column].fillna(pd.NA, inplace=True)
then the error returns.
This has to do with pd.NA being introduced in pandas 1.0.0 and how the pandas team decided it should work in a boolean context. Also, take into account that it is an experimental feature, hence it shouldn't be used for anything but experimenting:
Warning Experimental: the behaviour of pd.NA can still change without warning.
Another page of the pandas documentation, the one covering working with missing values, is where I believe the reason and the answer you are looking for can be found:
NA in a boolean context:
Since the actual value of an NA is unknown, it is ambiguous to convert NA to a boolean value. The following raises an error: TypeError: boolean value of NA is ambiguous
Furthermore, it provides a valuable piece of advice:
"This also means that pd.NA cannot be used in a context where it is evaluated to a boolean, such as if condition: ... where condition can potentially be pd.NA. In such cases, isna() can be used to check for pd.NA or condition being pd.NA can be avoided, for example by filling missing values beforehand."
I decided that the pd.NA instances in my data were valid, and hence I needed to deal with them rather than fill them with fillna(). If you're in the same situation, convert the check into either True or False by simply using pd.isna(val). Only you can decide whether the null should come out True or False, but here's a simple example:
val = pd.NA
if pd.isna(val):
    print('it is null')
else:
    print('it is not null')
returns: it is null
Then,
val = 7
if pd.isna(val):
    print('it is null')
else:
    print('it is not null')
returns: it is not null
Hope this helps others trying to get a definitive course of action (Celius's answer is accurate, but I wanted to provide actionable code for those struggling with this).

IndexError when replacing missing values with mode using groupby in pandas

I have a dataset which requires missing value treatment.
Column Missing Values
Complaint_ID 0
Date_received 0
Transaction_Type 0
Complaint_reason 0
Company_response 22506
Date_sent_to_company 0
Complaint_Status 0
Consumer_disputes 7698
Now the problem is, when I try to replace the missing values with the per-group mode using groupby:
Code:
data11["Company_response"] =
data11.groupby("Complaint_reason").transform(lambda x: x.fillna(x.mode()
[0]))["Company_response"]
data11["Consumer_disputes"] =
data11.groupby("Transaction_Type").transform(lambda x: x.fillna(x.mode()
[0]))["Consumer_disputes"]
I get the following error:
Stacktrace
Traceback (most recent call last):
File "<ipython-input-89-8de6a010a299>", line 1, in <module>
data11["Company_response"] = data11.groupby("Complaint_reason").transform(lambda x: x.fillna(x.mode()[0]))["Company_response"]
File "C:\Anaconda3\lib\site-packages\pandas\core\groupby.py", line 3741, in transform
return self._transform_general(func, *args, **kwargs)
File "C:\Anaconda3\lib\site-packages\pandas\core\groupby.py", line 3699, in _transform_general
res = path(group)
File "C:\Anaconda3\lib\site-packages\pandas\core\groupby.py", line 3783, in <lambda>
lambda x: func(x, *args, **kwargs), axis=self.axis)
File "C:\Anaconda3\lib\site-packages\pandas\core\frame.py", line 4360, in apply
ignore_failures=ignore_failures)
File "C:\Anaconda3\lib\site-packages\pandas\core\frame.py", line 4456, in _apply_standard
results[i] = func(v)
File "C:\Anaconda3\lib\site-packages\pandas\core\groupby.py", line 3783, in <lambda>
lambda x: func(x, *args, **kwargs), axis=self.axis)
File "<ipython-input-89-8de6a010a299>", line 1, in <lambda>
data11["Company_response"] = data11.groupby("Complaint_reason").transform(lambda x: x.fillna(x.mode()[0]))["Company_response"]
File "C:\Anaconda3\lib\site-packages\pandas\core\series.py", line 601, in __getitem__
result = self.index.get_value(self, key)
File "C:\Anaconda3\lib\site-packages\pandas\core\indexes\base.py", line 2434, in get_value
return libts.get_value_box(s, key)
File "pandas\_libs\tslib.pyx", line 923, in pandas._libs.tslib.get_value_box (pandas\_libs\tslib.c:18843)
File "pandas\_libs\tslib.pyx", line 939, in pandas._libs.tslib.get_value_box (pandas\_libs\tslib.c:18560)
IndexError: ('index out of bounds', 'occurred at index Consumer_disputes')
I have checked the length of the dataframe and all of its columns, and it is the same: 43266.
I have also found a question similar to this one, but it does not have a correct answer: Click here
Please help resolve the error.
IndexError: ('index out of bounds', 'occurred at index Consumer_disputes')
Here is a snapshot of the dataset if it helps in any way: Dataset Snapshot
I am using the code below successfully, but it does not serve my purpose exactly: it fills the missing values with each column's overall mode rather than the per-group mode.
data11['Company_response'].fillna(data11['Company_response'].mode()[0], inplace=True)
data11['Consumer_disputes'].fillna(data11['Consumer_disputes'].mode()[0], inplace=True)
Edit 1 (attaching a sample): the input and expected output were posted as screenshots.
In them, the missing Company_response values of Tr-1 and Tr-3 are filled by taking the mode within each Complaint_reason group, and similarly Consumer_disputes for Tr-5 by taking the mode within each Transaction_Type group.
The below snippet consists of the dataframe and the code for those who want to replicate and give it a try.
Replication Code
import pandas as pd
import numpy as np
data11 = pd.DataFrame({'Complaint_ID': ['Tr-1', 'Tr-2', 'Tr-3', 'Tr-4', 'Tr-5', 'Tr-6'],
                       'Transaction_Type': ['Mortgage', 'Credit card', 'Bank account or service', 'Debt collection', 'Credit card', 'Mortgage'],
                       'Complaint_reason': ['Loan servicing, payments, escrow account', 'Incorrect information on credit report', "Cont'd attempts collect debt not owed", "Cont'd attempts collect debt not owed", 'Payoff process', 'Loan servicing, payments, escrow account'],
                       'Company_response': [np.nan, 'Company chooses not to provide a public response', np.nan, 'Company believes it acted appropriately as authorized by contract or law', 'Company has responded to the consumer and the CFPB and chooses not to provide a public response', 'Company disputes the facts presented in the complaint'],
                       'Consumer_disputes': ['Yes', 'No', 'No', 'No', np.nan, 'Yes']})
data11.isnull().sum()
data11["Company_response"] = data11.groupby("Complaint_reason").transform(lambda x: x.fillna(x.mode()[0]))["Company_response"]
data11["Consumer_disputes"] = data11.groupby("Transaction_Type").transform(lambda x: x.fillna(x.mode()[0]))["Consumer_disputes"]
The error is raised because, for at least one of the groups, the corresponding column contains only np.nan values. In that case pd.Series([np.nan]).mode() returns an empty Series, which leads to an error when you take the first value.
So you may use something like transform(lambda x: x.fillna(x.mode()[0] if not x.mode().empty else "Empty")).
Try:
data11["Company_response"] = data11.groupby("Complaint_reason")['Company_response'].transform(lambda x: x.fillna(x.mode()[0]))
data11["Consumer_disputes"] = data11.groupby("Transaction_Type")['Consumer_disputes'].transform(lambda x: x.fillna(x.mode()[0]))
@Mikhail Berlinkov is almost certainly correct. I was able to reproduce your error, and then avoid it by using dropna() (column names as in the replication code above, with underscores):
data11.groupby("Transaction_Type").transform(
    lambda x: x.fillna(x.mode()[0]))["Consumer_disputes"]
# Returns IndexError

data11.dropna().groupby("Transaction_Type").transform(
    lambda x: x.fillna(x.mode()[0]))["Consumer_disputes"]
# Works

Python "InterfaceError: Error binding parameter 2 - probably unsupported type."

When I run the following code, I keep getting the "InterfaceError: Error binding parameter 2 - probably unsupported type" error, and I need help identifying where the problem is. Everything works fine up until I try to send the data to SQL through:
anagramsdf.to_sql('anagrams',con=conn,if_exists='replace',index=False)
cdf=pd.read_sql("select (distinct ID) from anagrams;",conn)
import pandas as pd
import sqlite3
conn = sqlite3.connect("anagrams")
xsorted=sorted(anagrams,key=sorted)
xunique=[x[0] for x in anagrams]
xunique=pd.Series(xunique)
xanagrams=pd.Series(anagrams)
anagramsdf=pd.concat([xunique,dfcount,xanagrams],axis=1)
anagramsdf.columns=['ID','anagram_count','anagram_list']
c=conn.cursor()
c.execute("create table anagrams(ID, anagram_count, anagram_list)")
conn.commit()
anagramsdf.to_sql('anagrams',con=conn,if_exists='replace',index=False)
cdf=pd.read_sql("select (distinct ID) from anagrams;",conn)
cdf=pd.read_sql("select max(anagram_count) from anagrams;",conn)
cdf
def print_full(x):
    pd.set_option('display.max_rows', len(x))
    print(x)
    pd.reset_option('display.max_rows')
cdf=pd.read_sql("select * from anagrams where anagram_count=12;",conn)
pd.set_option('max_colwidth',200)
Full traceback error:
Traceback (most recent call last):
File "sqlpandas.py", line 88, in <module>
anagramsdf.to_sql('anagrams',con=conn,if_exists='replace',index=False)
File "/Users/andrewclark/anaconda/lib/python2.7/site-packages/pandas/core/generic.py", line 982, in to_sql
dtype=dtype)
File "/Users/andrewclark/anaconda/lib/python2.7/site-packages/pandas/io/sql.py", line 549, in to_sql
chunksize=chunksize, dtype=dtype)
File "/Users/andrewclark/anaconda/lib/python2.7/site-packages/pandas/io/sql.py", line 1567, in to_sql
table.insert(chunksize)
File "/Users/andrewclark/anaconda/lib/python2.7/site-packages/pandas/io/sql.py", line 728, in insert
self._execute_insert(conn, keys, chunk_iter)
File "/Users/andrewclark/anaconda/lib/python2.7/site-packages/pandas/io/sql.py", line 1357, in _execute_insert
conn.executemany(self.insert_statement(), data_list)
sqlite3.InterfaceError: Error binding parameter 2 - probably unsupported type.
Snippet from Dataframe:
ID anagram_count anagram_list
0 aa 1 (aa,)
1 anabaena 1 (anabaena,)
2 baaskaap 1 (baaskaap,)
3 caracara 1 (caracara,)
4 caragana 1 (caragana,)
I used the following code to change the datatypes to strings, and this solved the problem:
anagramsdf.dtypes
anagramsdf['ID']= anagramsdf['ID'].astype('str')
anagramsdf['anagram_list']= anagramsdf['anagram_list'].astype('str')
anagramsdf.to_sql('anagramsdf',con=conn,if_exists='append',index=False)
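For context on why this works: sqlite3 can only bind None, int, float, str, and bytes, and the anagram_list column holds tuples. A minimal reproduction sketch with made-up data:
import sqlite3
import pandas as pd

conn = sqlite3.connect(':memory:')
df = pd.DataFrame({'ID': ['aa'], 'anagram_count': [1], 'anagram_list': [('aa',)]})

# df.to_sql('anagrams', con=conn, index=False)  # raises InterfaceError:
#                                               # a tuple cannot be bound
df['anagram_list'] = df['anagram_list'].astype(str)  # stringify the tuples
df.to_sql('anagrams', con=conn, index=False)         # now works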
Using Pandas 0.23.4, I had a column with datetime values (format '%Y-%m-%d %H:%M:%S') of data type "string" that was throwing the same error when I passed it to the to_sql method. After converting it to "datetime" dtype, it worked. Hope that's helpful to someone with the same issue :).
To convert:
df['date'] = pd.to_datetime(df['date'],format=datetimeFormat,errors='coerce')
Sparrow's solution worked for me. If the index is not converted to a datetime, SQL will throw "error binding parameter".
I used a column that had the datetimes, first converting it to the correct format and then using it as the index:
df.set_index(pd.to_datetime(df['datetime']), inplace=True)

Appending DataFrame to List in Pandas, Python

I have a file of data and want to select a specific State. From there I need to return the counts in a list, but some years will have missing data, so I need to replace the missing values.
I am having some issues with my code; likely something is slightly off in my for loop:
def stateCountAsList(filepath, state):
    import pandas as pd
    pd.set_option('display.width', 200)
    import numpy as np
    dataFrame = pd.read_csv(filepath, header=0, sep='\t')
    df = dataFrame.iloc[0:638, :]
    dfState = df[df['State'] == state]
    yearList = range(1999, 2012)
    countsList = []
    for dfState['Year'] in yearList:
        countsList = dfState['Count']
    else:
        countsList.append(np.nan)
    return countsList
    print countsList.tolist()

stateCountAsList(filepath, state)
state = 'California'
Traceback:
C:\Users\Michael\workspace\UCIIntrotoPythonDA\src\Michael_Madani_week3.py:59: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
for dfState['Year'] in yearList:
Traceback (most recent call last):
File "C:\Users\Michael\workspace\UCIIntrotoPythonDA\src\Michael_Madani_week3.py", line 67, in <module>
stateCountAsList(filepath, state)
File "C:\Users\Michael\workspace\UCIIntrotoPythonDA\src\Michael_Madani_week3.py", line 62, in stateCountAsList
countsList.append(np.nan)
File "C:\Users\Michael\Anaconda\lib\site-packages\pandas\core\series.py", line 1466, in append
verify_integrity=verify_integrity)
File "C:\Users\Michael\Anaconda\lib\site-packages\pandas\tools\merge.py", line 754, in concat
copy=copy)
File "C:\Users\Michael\Anaconda\lib\site-packages\pandas\tools\merge.py", line 805, in __init__
raise TypeError("cannot concatenate a non-NDFrame object")
TypeError: cannot concatenate a non-NDFrame object
You have at least two different issues in your code:
The warning
A value is trying to be set on a copy of a slice from a DataFrame.
is triggered by for dfState['Year'] in yearList (line 59 in your code). In this line you try to loop over a range of years (1999 to 2012), but instead you implicitly assign each year value to dfState['Year']. And dfState is not a copy, but a "view" (http://pandas.pydata.org/pandas-docs/stable/indexing.html#returning-a-view-versus-a-copy), since df = dataFrame.iloc[0:638,:] returns a view.
But as mentioned earlier, you don't want to assign a value to the DataFrame here, only loop over years. So the for-loop should look like:
for year in range(1999,2012):
...
The second issue is in line 62. Here, you try to append np.nan to your "list" countsList, but countsList is not a list anymore; it's a pandas Series!
Two lines before, you assigned it a pd.Series (countsList = dfState['Count']), effectively changing the type. This gives you the TypeError: cannot concatenate a non-NDFrame object.
With this information you should be able to correct your loop.
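A minimal corrected version of the loop, assuming the goal is one Count per year with NaN for missing years (as described above):
import numpy as np
import pandas as pd

def stateCountAsList(filepath, state):
    df = pd.read_csv(filepath, header=0, sep='\t').iloc[0:638, :]
    dfState = df[df['State'] == state]

    countsList = []
    for year in range(1999, 2012):
        counts = dfState.loc[dfState['Year'] == year, 'Count']
        # Use the year's count if present, NaN for missing years.
        countsList.append(counts.iloc[0] if len(counts) else np.nan)
    return countsList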
As an alternative, you can get the desired result using Pandas query method (http://pandas.pydata.org/pandas-docs/stable/indexing.html#the-query-method-experimental):
def stateCountAsList(filepath, state):
    import pandas as pd
    import numpy as np
    dataFrame = pd.read_csv(filepath, header=0, sep='\t')
    df = dataFrame.iloc[0:638, :]
    stateList = df.query("(State == @state) & (Year > 1999) & (Year < 2005)").Count.tolist()
    return stateList
