Odd behavior of to_dict - python

I'm building a fuzzy search program, using FuzzyWuzzy, to find matching names in a dataset. My data is in a DataFrame of about 10378 rows and len(df['Full name']) is 10378, as expected. But len(choices) is only 1695.
I'm running Python 2.7.10 and pandas 0.17.0, in an IPython Notebook.
choices = df['Full name'].astype(str).to_dict()

def fuzzy_search_to_df(term, choices=choices):
    search = process.extract(term, choices, limit=len(choices))  # does the search itself
    rslts = pd.DataFrame(data=search, index=None, columns=['name', 'rel', 'df_ind'])  # puts the results in DataFrame form
    return rslts

results = fuzzy_search_to_df(term='Ben Franklin')  # returns the search result for the given term
matches = results[results.rel > 85]  # subset of results, these are the best search results
find = df.iloc[matches['df_ind']]  # matches in the main df
As you can probably tell, I'm getting the index of the result in the choices dict as df_ind, which I had assumed would be the same as the index in the main dataframe.
I'm fairly certain that the issue is in the first line, with the to_dict() function, as len(df['Full name'].astype(str)) results in 10378 and len(df['Full name'].to_dict()) results in 1695.

The issue is that your dataframe has multiple rows with the same index. Since a Python dictionary can only hold a single value per key, and the Series.to_dict() method uses the index as the key, the values from earlier rows get overwritten by the values that come later.
A very simple example to show this behavior -
In [36]: df = pd.DataFrame([[1],[2]], index=[1,1], columns=['A'])

In [37]: df
Out[37]:
   A
1  1
1  2

In [38]: df['A'].to_dict()
Out[38]: {1: 2}
This is what is happening in your case. As noted in the comments, the number of unique values in the index is only 1695, which you can confirm by checking len(df.index.unique()).
If you are content with having numbers (the positional index of the dataframe) as keys, you can reset the index using DataFrame.reset_index() and then call .to_dict() on that. Example -
choices = df.reset_index()['Full name'].astype(str).to_dict()
Demo from above example -
In [40]: df.reset_index()['A'].to_dict()
Out[40]: {0: 1, 1: 2}
This is the same as the solution OP found - choices = dict(zip(df['n'], df['Full name'].astype(str))) (as can be seen from the comments) - but this method would be faster than using zip and dict.
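Putting it together, here is a minimal sketch of the fixed lookup, reusing df, fuzzy_search_to_df and the 85-point cutoff from the question: after reset_index(), the dictionary keys are positional (0 to len(df)-1), so they line up with df.iloc.
choices = df.reset_index()['Full name'].astype(str).to_dict()       # keys are now 0 .. len(df)-1
results = fuzzy_search_to_df(term='Ben Franklin', choices=choices)  # pass explicitly; the def-time default still holds the old dict
matches = results[results.rel > 85]
find = df.iloc[matches['df_ind']]                                   # positional keys line up with df.iloc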

Related

How can you extract a scalar in a for loop?

I'm using a for loop to slice a dataframe and then extract information from each slice. I then store that information in a dict so I can append it to a list for later use. My problem is that the information is not usable: it exists as a pandas Series rather than as the actual scalar value of the cell I'm trying to extract. Below is an example of the process I'm trying to execute:
df = pd.DataFrame({'c1': np.arange(0,15), 'c2': np.arange(0,15), 'c3': ['A']*5 + ['B']*5 + ['C']*5})
iterable = ['A', 'B', 'C']
dict_list = []
for i in iterable:
    out_dict = dict()
    data = df[df.c3==i]
    out = data.c1[-1:].iloc[0]
    out_dict['out'] = out
    dict_list.append(out_dict)
out_df = pd.DataFrame.from_records(dict_list)
Bizarrely, the code above works, but when I change the dataframe to my real data, I get an IndexError: single positional indexer is out-of-bounds error at line 7, which I believe means that there is no index. In both my data and the example above, the type of data.c1[-1:] is pandas.core.series.Series and they both have length 1.
Even stranger is that if I run out = data.c1[-1:] inside the for loop, and then run out.iloc[0] outside the for loop, I don't get an error.
Does anyone know why iloc would fail in this case? Is there a way to force out to be indexable?
This happens when you index a row/column with a number that is larger than the dimensions of your dataframe. For example, df.iloc[:, 10] would refer to the eleventh column, which is out of bounds if the dataframe has fewer than eleven columns.
You may also want to fill any missing values first, for example:
dataframe1.fillna("nan")  # or whatever you want as a fill value
dataframe2.fillna("nan")
Okay, I don't have an answer to the original question, but replacing .iloc[0] with .squeeze() solved my issue, like so: out = data.c1[-1:].squeeze()
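A likely explanation, sketched below under the assumption that some group in the real data matches no rows: the slice data.c1[-1:] then comes back empty, .iloc[0] raises the out-of-bounds IndexError on a length-0 Series, while .squeeze() just returns the object unchanged (and the scalar when the slice has exactly one element).
import pandas as pd

s = pd.Series([], dtype=float)   # what data.c1[-1:] becomes when the filter matches no rows
# s.iloc[0]                      # would raise IndexError: single positional indexer is out-of-bounds
print(s.squeeze())               # returns the empty Series itself, no error

t = pd.Series([42])
print(t.squeeze())               # 42 - the scalar, same as t.iloc[0]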

How do I pull the index(es) and column(s) of a specific value from a dataframe?

Hello, everyone! New student of Python's Pandas here.
I have a dataframe I artificially constructed here: https://i.stack.imgur.com/cWgiB.png. Below is a text reconstruction.
df_dict = {
    'header0' : [55,12,13,14,15],
    'header1' : [21,22,23,24,25],
    'header2' : [31,32,55,34,35],
    'header3' : [41,42,43,44,45],
    'header4' : [51,52,53,54,33]
}
index_list = {
    0:'index0',
    1:'index1',
    2:'index2',
    3:'index3',
    4:'index4'
}
df = pd.DataFrame(df_dict).rename(index=index_list)
GOAL:
I want to pull the index row(s) and column header(s) of any ARBITRARY value(s) (int, float, str, etc.). So, for example, if I look for the value 55, the code should return: header0, index0, header2, index2 in some format (a list, a tuple, a print, etc.).
CLARIFICATIONS:
Imagine the dataframe is of a large enough size that I cannot "just find it manually"
I do not know how large this value is in comparison to other values (so a "simple .idxmax()" probably won't cut it)
I do not know where this value is column or index wise (so "just .loc,.iloc where the value is" won't help either)
I do not know whether this value has duplicates or not, but if it does, return all its column/indexes.
WHAT I'VE TRIED SO FAR:
I've played around with .columns, .index, .loc, but just can't seem to get the answer. The farthest I've gotten is creating a boolean dataframe with df.values == 55 or df == 55, but cannot seem to do anything with it.
Another "farthest" way I've gotten is using df.unstack().idxmax(), which returns a tuple of the column and index, but it has 2 major problems:
Only returns the max/min as per the .idxmax(), .idxmin() functions
Only returns the FIRST column/index matching my value, which doesn't help if there are duplicates
I know I could do a for loop to iterate through the entire dataframe, tracking which column and index I am on in temporary variables. Once I hit the value I am looking for, I'll break and return the current column and index. Was just hoping there was a less brute-force-y method out there, since I'd like a "high-speed calculation" method that would work on any dataframe of any size.
Thanks.
EDIT: Added text database, clarified questions.
Use np.where:
r, c = np.where(df == 55)
list(zip(df.index[r], df.columns[c]))
Output:
[('index0', 'header0'), ('index2', 'header2')]
There is a function in pandas that gives duplicate rows.
duplicate = df[df.duplicated()]
print(duplicate)
Use DataFrame.unstack to get a Series with a MultiIndex, then filter duplicates with Series.duplicated using keep=False:
s = df.unstack()
out = s[s.duplicated(keep=False)].index.tolist()
If you also need the duplicates with their values:
df1 = (s[s.duplicated(keep=False)]
         .sort_values()
         .rename_axis(index='idx', columns='cols')
         .reset_index(name='val'))
If you instead need a specific value, change the mask to use Series.eq (==):
s = df.unstack()
out = s[s.eq(55)].index.tolist()
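With the sample frame from the question, this should give the locations of 55 in (column, index) order, since unstacking puts the column label first:
out
# [('header0', 'index0'), ('header2', 'index2')]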
So, in the code below, there is an iteration. However, it doesn't iterate over the whole DataFrame; it just iterates over the columns and uses .any() to check whether any of them contain the desired value. It then uses loc to locate the value and finally returns the index.
wanted_value = 55

for col in list(df.columns):
    if df[col].eq(wanted_value).any():
        print("row:", *list(df.loc[df[col].eq(wanted_value)].index), ' col', col)
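With the sample frame above, this loop should print something like:
row: index0  col header0
row: index2  col header2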

How to drop a series of rows from dataframe in a faster way

I have a data set and I want to drop some rows in a faster way. I tried the following code, but it took a long time.
I want to drop every user who makes fewer than 3 operations.
Every operation is stored in a row, in which user_id is not the ID of my data.
undesirable_users = []
for i in range(len(operations_per_user)):
    if operations_per_user.get_value(operations_per_user.index[i]) <= 3:
        undesirable_users.append(operations_per_user.index[i])

for i in range(len(undesirable_users)):
    data = data.drop(data[data.user_id == undesirable_users[i]].index)
data is a dataframe and operations_per_user is a series created by: operations_per_user = data['user_id'].value_counts().
Why not just filter them? You don't need to loop at all.
You can get the filtered indexes by:
operations_per_user.index[operations_per_user <= 3]
And then you can filter these indexes from the df, making the solution:
data = data[~data['user_id'].isin(operations_per_user.index[operations_per_user <= 3])]
EDIT
My understanding is that you want to remove any user that occurs less than 3 times in the data. You won't need to create a value_counts list for that, you could do a groupby and find the counts and then filter on that basis.
filtered_user_ids = data.groupby('user_id').filter(lambda x: len(x) <= 3)['user_id'].tolist()
data = data[~data['user_id'].isin(filtered_user_ids)]
If data is a pandas DataFrame, and it contains both user_id and operations_per_user as columns, you should perform the drop with:
data = data.drop(data.loc[data['operations_per_user'] <= 3].index)
Edit
Instead of creating a separate series, you could add operations_per_user as a column of data (mapping each row's user_id to its count so the values align with data's index):
data['operations_per_user'] = data['user_id'].map(data['user_id'].value_counts())
You could either perform the drop as above or perform the selection with the inverse logical condition:
data = data.loc[data['operations_per_user'] > 3]
Original
It would be preferable if you could supply some more information about the variables used in your code.
If operations_per_user is a pandas Series, your first loop could be improved with:
undesirable_users = []
for i in operations_per_user.index:
    if operations_per_user.loc[i] <= 3:
        undesirable_users.append(i)
The function get_value() is deprecated, use loc or iloc instead. This is a good summary of loc and iloc, and here is a great pandas cheatsheet to reference.
You can iterate over Python lists directly; for your second loop:
for user in undesirable_users:
    data = data.drop(data.loc[data['user_id'] == user].index)
Rather than dropping, you can simply select the rows you want to keep by reversing the logical condition.
First, select only the users to keep.
Then build a boolean list whose length equals the number of rows in data.
Finally, select the rows to keep.
keepusers = operations_per_user.loc[operations_per_user > 3]
tokeep = [uid in keepusers.index for uid in data['user_id']]
newdata = data.loc[tokeep]
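Combining the ideas above into one vectorised step (a sketch, assuming data has a user_id column as in the question): compute the counts once with value_counts, map them onto the rows, and keep only the rows whose user appears more than 3 times.
counts = data['user_id'].value_counts()        # operations per user
data = data[data['user_id'].map(counts) > 3]   # keep users with more than 3 operations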

Pandas dataFrame.nunique() : ("unhashable type : 'list'", 'occured at index columns')

I want to apply the .nunique() function to a full dataFrame.
On the following screenshot, we can see that it contains 130 features. Screenshot of shape and columns of the dataframe.
The goal is to get the number of different values per feature.
I use the following code (that worked on another dataFrame).
def nbDifferentValues(data):
    total = data.nunique()
    total = total.sort_values(ascending=False)
    percent = (total/data.shape[0]*100)
    return pd.concat([total, percent], axis=1, keys=['Total', 'Pourcentage'])
diffValues = nbDifferentValues(dataFrame)
And the code fails at the first line and I get the following error which I don't know how to solve ("unhashable type : 'list'", 'occured at index columns'):
Trace of the error
You probably have a column whose contents are lists.
Since lists in Python are mutable, they are unhashable.
import pandas as pd

df = pd.DataFrame([
    (0, [1,2]),
    (1, [2,3])
])

# raises "unhashable type : 'list'" error
df.nunique()
SOLUTION: Don't use mutable structures (like lists) in your dataframe:
df = pd.DataFrame([
    (0, (1,2)),
    (1, (2,3))
])

df.nunique()
# 0    2
# 1    2
# dtype: int64
To get nunique or unique in a pandas.Series, my preferred approaches are:
Quick Approach
NOTE: It doesn't matter whether the column values are lists or strings. Also, nested lists might need to be flattened first.
_unique_items = df.COL_LIST.explode().unique()
or
_unique_count = df.COL_LIST.explode().nunique()
Alternate Approach
Alternatively, if I wish not to explode the items,
# If col values are strings
_unique_items = df.COL_STR_LIST.apply("|".join).unique()
# A lambda helps if col values are non-strings
_unique_items = df.COL_LIST.apply(lambda _l: "|".join([str(_y) for _y in _l])).unique()
Bonus
df.COL.apply(json.dumps) might handle all the cases.
OP's solution
df['uniqueness'] = df.apply(lambda _x: json.dumps(_x.to_list()), axis=1)
...
# Plug more code
...
I have come across this problem with .nunique() when converting results from a REST API from dict (or list) to a pandas dataframe. The problem is that one of the columns is stored as a list or dict (a common situation with nested JSON results). Here is some sample code to remove the columns causing the error.
# this is the dataframe that is causing your issues
df = data.copy()
print(f"Rows and columns: {df.shape} \n")
print(f"Null values per column: \n{df.isna().sum()} \n")

# check which columns error when counting number of uniques
ls_cols_nunique = []
ls_cols_error_nunique = []
for each_col in df.columns:
    try:
        df[each_col].nunique()
        ls_cols_nunique.append(each_col)
    except:
        ls_cols_error_nunique.append(each_col)

print(f"Unique values per column: \n{df[ls_cols_nunique].nunique()} \n")
print(f"Columns error nunique: \n{ls_cols_error_nunique} \n")
This code should split your dataframe columns into 2 lists:
Column that can calculate .nunique()
Column that errors when running .nunique()
Then just calculate the .nunique() on the columns without errors.
As far as converting the columns with errors, there are other resources that address that with .apply(pd.Series).
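Building on the answers above, a minimal sketch (assuming the problem columns contain plain lists, as in the toy frame below) that makes them hashable before counting, either with tuple or with json.dumps:
import json
import pandas as pd

df = pd.DataFrame({'a': [0, 1], 'b': [[1, 2], [2, 3]]})

df['b'].apply(tuple).nunique()        # 2 - lists converted to hashable tuples
df['b'].apply(json.dumps).nunique()   # 2 - works for nested/mixed structures too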

Drop Pandas DataFrame lines according to a GroupBy property

I have some DataFrames with information about some elements, for instance:
my_df1=pd.DataFrame([[1,12],[1,15],[1,3],[1,6],[2,8],[2,1],[2,17]],columns=['Group','Value'])
my_df2=pd.DataFrame([[1,5],[1,7],[1,23],[2,6],[2,4]],columns=['Group','Value'])
I have used something like dfGroups = df.groupby('group').apply(my_agg).reset_index(), so now I have DataFrames with information on groups of the previous elements, say
my_df1_Group=pd.DataFrame([[1,57],[2,63]],columns=['Group','Group_Value'])
my_df2_Group=pd.DataFrame([[1,38],[2,49]],columns=['Group','Group_Value'])
Now I want to clean my groups according to properties of their elements. Let's say that I want to discard groups containing an element with Value greater than 16. So in my_df1_Group, there should only be the first group left, while both groups qualify to stay in my_df2_Group.
As I don't know how to get my_df1_Group and my_df2_Group from my_df1 and my_df2 in Python (I know other languages where it would simply be name+"_Group" with name looping over [my_df1, my_df2], but how do you do that in Python?), I build a list of lists:
SampleList = [[my_df1,my_df1_Group],[my_df2,my_df2_Group]]
Then, I simply try this:
my_max = 16
Bad = []
for Sample in SampleList:
    for n in Sample[1]['Group']:
        df = Sample[0].loc[Sample[0]['Group'] == n]  # This is inelegant, but trying to work
                                                     # with Sample[1] in the for doesn't work
        if (df['Value'].max() > my_max):
            Bad.append(1)
        else:
            Bad.append(0)
    Sample[1] = Sample[1].assign(Bad_Row=pd.Series(Bad))
    Sample[1] = Sample[1].query('Bad_Row == 0')
Which runs without errors, but doesn't work. In particular, it doesn't add the column Bad_Row to my df, nor modify my DataFrame (yet the query runs smoothly even though the Bad_Row column doesn't seem to exist...). On the other hand, if I run this technique manually on a df (i.e. not in a loop), it works.
How should I do this?
Based on your comment below, I think you want to check whether a Group in your aggregated data frame has a Value in the input data greater than 16. One solution is to perform a row-wise calculation using a criterion on the input data. To accomplish this, my_func accepts a row from the aggregated data frame and the input data as a pandas groupby object. For each group in your grouped data frame, it subsets your initial data and uses boolean logic to see if any of the 'Values' in your input data meet your specified criterion.
def my_func(row, grouped_df1):
    if (grouped_df1.get_group(row['Group'])['Value'] > 16).any():
        return 'Bad Row'
    else:
        return 'Good Row'

my_df1 = pd.DataFrame([[1,12],[1,15],[1,3],[1,6],[2,8],[2,1],[2,17]], columns=['Group','Value'])
my_df1_Group = pd.DataFrame([[1,57],[2,63]], columns=['Group','Group_Value'])

grouped_df1 = my_df1.groupby('Group')
my_df1_Group['Bad_Row'] = my_df1_Group.apply(lambda x: my_func(x, grouped_df1), axis=1)
Returns:
   Group  Group_Value   Bad_Row
0      1           57  Good Row
1      2           63   Bad Row
Based on dubbbdan's idea, here is code that works:
my_max = 16

def my_func(row, grouped_df1):
    if (grouped_df1.get_group(row['Group'])['Value'] > my_max).any():
        return 1
    else:
        return 0

SampleList = [[my_df1, my_df1_Group], [my_df2, my_df2_Group]]

for Sample in SampleList:
    grouped_df = Sample[0].groupby('Group')
    Sample[1]['Bad_Row'] = Sample[1].apply(lambda x: my_func(x, grouped_df), axis=1)
    Sample[1].drop(Sample[1][Sample[1]['Bad_Row'] != 0].index, inplace=True)
    Sample[1].drop(['Bad_Row'], axis=1, inplace=True)
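A hedged alternative sketch of the same cleanup without the row-wise apply: find the groups containing any Value above my_max, then drop those groups from the aggregated frame in place (in place, like the code above, so my_df1_Group and my_df2_Group themselves are modified).
my_max = 16

for Sample in SampleList:
    # groups with at least one element above the threshold
    bad_groups = Sample[0].loc[Sample[0]['Value'] > my_max, 'Group'].unique()
    # drop those groups from the aggregated frame, in place
    Sample[1].drop(Sample[1][Sample[1]['Group'].isin(bad_groups)].index, inplace=True)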
