Remake dataframe based of fuzzywuzzy matches - python

i have a dataframes now it have 5 rows(in future will have more). In column names there 5 values, if those 5 names the same(their fuzz.ratio close to each other) then ok no changes needed.
But there is cases where:
4 values good(their fuzz.ratio close) and 1 value different, bad.
3 values good, 2 bad,
3 values good, 1 bad and 1 bad.
2 values the same, other 2 the same, and 1 different, bad.
2 values the same, other 1 and 1 and 1 values bad.
So I need dataframes where at least 2 rows the same, 3 better, 4 good, 5 the best.
Here is some simple example, of course series will have row index based on that it will be easier to select needed rows.
fruits_4_1 = ['banana', 'bananas', 'bananos', 'banandos','cherry']
fruits_3_2 = ['tomato','tamato','tomatos','apple','apples']
fruits_3_1_1 = ['orange','orangad','orandges','ham', 'beef']
fruits_2_2_1 = ['kiwi', 'kiwiss', 'mango','mangas', 'grapes']
fruits_2_1_1_1 = ['kiwi', 'kiwiss', 'mango','apples', 'beefs']
for f in fruits_4_1:
score_1 = process.extract(f, fruits_2_1_1_1, limit=10, scorer=fuzz.ratio)
print(score_1)
I need implement logic, that will check dataframe`s series and determine what type it is 4+1\3+2 etc. And based on that will create new dataframes, with only similar rows. How do i do that?

Related

Create a column based on multiple column distinct count pandas [duplicate]

I want to add an aggregate, grouped, nunique column to my pandas dataframe but not aggregate the entire dataframe. I'm trying to do this in one line and avoid creating a new aggregated object and merging that, etc.
my df has track, type, and id. I want the number of unique ids for each track/type combination as a new column in the table (but not collapse track/type combos in the resulting df). Same number of rows, 1 more column.
something like this isn't working:
df['n_unique_id'] = df.groupby(['track', 'type'])['id'].nunique()
nor is
df['n_unique_id'] = df.groupby(['track', 'type'])['id'].transform(nunique)
this last one works with some aggregating functions but not others. the following works (but is meaningless on my dataset):
df['n_unique_id'] = df.groupby(['track', 'type'])['id'].transform(sum)
in R this is easily done in data.table with
df[, n_unique_id := uniqueN(id), by = c('track', 'type')]
thanks!
df.groupby(['track', 'type'])['id'].transform(nunique)
Implies that there is a name nunique in the name space that performs some function. transform will take a function or a string that it knows a function for. nunique is definitely one of those strings.
As pointed out by #root, often the method that pandas will utilize to perform a transformation indicated by these strings are optimized and should generally be preferred to passing your own functions. This is True even for passing numpy functions in some cases.
For example transform('sum') should be preferred over transform(sum).
Try this instead
df.groupby(['track', 'type'])['id'].transform('nunique')
demo
df = pd.DataFrame(dict(
track=list('11112222'), type=list('AAAABBBB'), id=list('XXYZWWWW')))
print(df)
id track type
0 X 1 A
1 X 1 A
2 Y 1 A
3 Z 1 A
4 W 2 B
5 W 2 B
6 W 2 B
7 W 2 B
df.groupby(['track', 'type'])['id'].transform('nunique')
0 3
1 3
2 3
3 3
4 1
5 1
6 1
7 1
Name: id, dtype: int64

How to select data with list of indexes from a partitioned DF (non-unique indexes)?

Problem
I have a dataframe df with indexes not monotonically increasing over 4 partitions, meaning every partition is indexed with [0..N].
I need to select rows based on a indexes list [0..M] where M > N.
Using loc would yield to an inconsistent output as there are multiple rows indexed by 0 (see example).
In other words, I'd need to overcome the difference between Dask's and Pandas' reset_index, as it'd easily solve my issue.
Example
print df.loc[0].compute() results in:
Unnamed: 0 best_answer thread_id ty_avc ty_ber ty_cjr ty_cpc \
0 0 1 1 1 0.052174 9 18
0 0 1 5284 12 0.039663 34 60
0 0 1 18132 2 0.042254 7 20
0 0 1 44211 4 0.025000 5 5
Possible solutions
repartition df to 1 single partition and reset_index, don't like as won't fit in memory;
add a column with [0..M] indexes and use set_index, discouraged in performance tips;
solution to this question solves a different problem as his df has unique indexes;
split the indexes list in npartitions parts, apply offset computation and use map_partitions
I cannot think of other solutions... probably the last one is more efficient although not sure if it's actually feasible.
Generally Dask.dataframe does not track the lengths of the pandas dataframes that make up the dask.dataframe. I suspect that your option 4 is best. You might also consider using dask.delayed
See also http://dask.pydata.org/en/latest/delayed-collections.html

Slice column in panda database and averaging results

If I have a pandas database such as:
timestamp label value new
etc. a 1 3.5
b 2 5
a 5 ...
b 6 ...
a 2 ...
b 4 ...
I want the new column to be the average of the last two a's and the last two b's... so for the first it would be the average of 5 and 2 to get 3.5. It will be sorted by the timestamp. I know I could use a groupby to get the average of all the a's or all the b's but I'm not sure how to get an average of just the last two. I'm kinda new to python and coding so this might not be possible idk.
Edit: I should also mention this is not for a class or anything this is just for something I'm doing on my own and that this will be on a very large dataset. I'm just using this as an example. Also I would want each A and each B to have its own value for the last 2 average so the dimension of the new column will be the same as the others. So for the third line it would be the average of 2 and whatever the next a would be in the data set.
IIUC one way (among many) to do that:
In [139]: df.groupby('label').tail(2).groupby('label').mean().reset_index()
Out[139]:
label value
0 a 3.5
1 b 5.0
Edited to reflect a change in the question specifying the last two, not the ones following the first, and that you wanted the same dimensionality with values repeated.
import pandas as pd
data = {'label': ['a','b','a','b','a','b'], 'value':[1,2,5,6,2,4]}
df = pd.DataFrame(data)
grouped = df.groupby('label')
results = {'label':[], 'tail_mean':[]}
for item, grp in grouped:
subset_mean = grp.tail(2).mean()[0]
results['label'].append(item)
results['tail_mean'].append(subset_mean)
res_df = pd.DataFrame(results)
df = df.merge(res_df, on='label', how='left')
Outputs:
>> res_df
label tail_mean
0 a 3.5
1 b 5.0
>> df
label value tail_mean
0 a 1 3.5
1 b 2 5.0
2 a 5 3.5
3 b 6 5.0
4 a 2 3.5
5 b 4 5.0
Now you have a dataframe of your results only, if you need them, plus a column with it merged back into the main dataframe. Someone else posted a more succinct way to get to the results dataframe; probably no reason to do it the longer way I showed here unless you also need to perform more operations like this that you could do inside the same loop.

Fastest way to compare row and previous row in pandas dataframe with millions of rows

I'm looking for solutions to speed up a function I have written to loop through a pandas dataframe and compare column values between the current row and the previous row.
As an example, this is a simplified version of my problem:
User Time Col1 newcol1 newcol2 newcol3 newcol4
0 1 6 [cat, dog, goat] 0 0 0 0
1 1 6 [cat, sheep] 0 0 0 0
2 1 12 [sheep, goat] 0 0 0 0
3 2 3 [cat, lion] 0 0 0 0
4 2 5 [fish, goat, lemur] 0 0 0 0
5 3 9 [cat, dog] 0 0 0 0
6 4 4 [dog, goat] 0 0 0 0
7 4 11 [cat] 0 0 0 0
At the moment I have a function which loops through and calculates values for 'newcol1' and 'newcol2' based on whether the 'User' has changed since the previous row and also whether the difference in the 'Time' values is greater than 1. It also looks at the first value in the arrays stored in 'Col1' and 'Col2' and updates 'newcol3' and 'newcol4' if these values have changed since the previous row.
Here's the pseudo-code for what I'm doing currently (since I've simplified the problem I haven't tested this but it's pretty similar to what I'm actually doing in ipython notebook):
def myJFunc(df):
... #initialize jnum counter
... jnum = 0;
... #loop through each row of dataframe (not including the first/zeroeth)
... for i in range(1,len(df)):
... #has user changed?
... if df.User.loc[i] == df.User.loc[i-1]:
... #has time increased by more than 1 (hour)?
... if abs(df.Time.loc[i]-df.Time.loc[i-1])>1:
... #update new columns
... df['newcol2'].loc[i-1] = 1;
... df['newcol1'].loc[i] = 1;
... #increase jnum
... jnum += 1;
... #has content changed?
... if df.Col1.loc[i][0] != df.Col1.loc[i-1][0]:
... #record this change
... df['newcol4'].loc[i-1] = [df.Col1.loc[i-1][0], df.Col2.loc[i][0]];
... #different user?
... elif df.User.loc[i] != df.User.loc[i-1]:
... #update new columns
... df['newcol1'].loc[i] = 1;
... df['newcol2'].loc[i-1] = 1;
... #store jnum elsewhere (code not included here) and reset jnum
... jnum = 1;
I now need to apply this function to several million rows and it's impossibly slow so I'm trying to figure out the best way to speed it up. I've heard that Cython can increase the speed of functions but I have no experience with it (and I'm new to both pandas and python). Is it possible to pass two rows of a dataframe as arguments to the function and then use Cython to speed it up or would it be necessary to create new columns with "diff" values in them so that the function only reads from and writes to one row of the dataframe at a time, in order to benefit from using Cython? Any other speed tricks would be greatly appreciated!
(As regards using .loc, I compared .loc, .iloc and .ix and this one was marginally faster so that's the only reason I'm using that currently)
(Also, my User column in reality is unicode not int, which could be problematic for speedy comparisons)
I was thinking along the same lines as Andy, just with groupby added, and I think this is complementary to Andy's answer. Adding groupby is just going to have the effect of putting a NaN in the first row whenever you do a diff or shift. (Note that this is not an attempt at an exact answer, just to sketch out some basic techniques.)
df['time_diff'] = df.groupby('User')['Time'].diff()
df['Col1_0'] = df['Col1'].apply( lambda x: x[0] )
df['Col1_0_prev'] = df.groupby('User')['Col1_0'].shift()
User Time Col1 time_diff Col1_0 Col1_0_prev
0 1 6 [cat, dog, goat] NaN cat NaN
1 1 6 [cat, sheep] 0 cat cat
2 1 12 [sheep, goat] 6 sheep cat
3 2 3 [cat, lion] NaN cat NaN
4 2 5 [fish, goat, lemur] 2 fish cat
5 3 9 [cat, dog] NaN cat NaN
6 4 4 [dog, goat] NaN dog NaN
7 4 11 [cat] 7 cat dog
As a followup to Andy's point about storing objects, note that what I did here was to extract the first element of the list column (and add a shifted version also). Doing it like this you only have to do an expensive extraction once and after that can stick to standard pandas methods.
Use pandas (constructs) and vectorize your code i.e. don't use for loops, instead use pandas/numpy functions.
'newcol1' and 'newcol2' based on whether the 'User' has changed since the previous row and also whether the difference in the 'Time' values is greater than 1.
Calculate these separately:
df['newcol1'] = df['User'].shift() == df['User']
df.ix[0, 'newcol1'] = True # possibly tweak the first row??
df['newcol1'] = (df['Time'].shift() - df['Time']).abs() > 1
It's unclear to me the purpose of Col1, but general python objects in columns doesn't scale well (you can't use fast path and the contents are scattered in memory). Most of the time you can get away with using something else...
Cython is the very last option, and not needed in 99% of use-cases, but see enhancing performance section of the docs for tips.
In your problem, it seems like you want to iterate through row pairwise. The first thing you could do is something like this:
from itertools import tee, izip
def pairwise(iterable):
"s -> (s0,s1), (s1,s2), (s2, s3), ..."
a, b = tee(iterable)
next(b, None)
return izip(a, b)
for (idx1, row1), (idx2, row2) in pairwise(df.iterrows()):
# you stuff
However you cannot modify row1 and row2 directly you will still need to use .loc or .iloc with the indexes.
If iterrows is still too slow I suggest to do something like this:
Create a user_id column from you unicode names using pd.unique(User) and mapping the name with a dictionary to integer ids.
Create a delta dataframe: to a shifted dataframe with the user_id and time column you substract the original dataframe.
df[[col1, ..]].shift() - df[[col1, ..]])
If user_id > 0, it means that the user changed in two consecutive row. The time column can be filtered directly with delta[delta['time' > 1]]
With this delta dataframe you record the changes row-wise. You can use it a a mask to update the columns you need from you original dataframe.

Applying Operation to Pandas column if other column meets criteria

I'm relatively new to Python and totally new to Pandas, so my apologies if this is really simple. I have a dataframe, and I want to operate over all elements in a particular column, but only if a different column with the same index meets a certain criteria.
float_col int_col str_col
0 0.1 1 a
1 0.2 2 b
2 0.2 6 None
3 10.1 8 c
4 NaN -1 a
For example, if the value in float_col is greater than 5, I want to multiply the value in in_col (in the same row) by 2. I'm guessing I'm supposed to use one of the map apply or applymap functions, but I'm not sure which, or how.
There might be more elegant ways to do this, but once you understand how to use things like loc to get at a particular subset of your dataset, you can do it like this:
df.loc[df['float_col'] > 5, 'int_col'] = df.loc[df['float_col'] > 5, 'int_col'] * 2
You can also do it a bit more succinctly like this, since pandas is smart enough to match up the results based on the index of your dataframe and only use the relevant data from the df['int_col'] * 2 expression:
df.loc[df['float_col'] > 5, 'int_col'] = df['int_col'] * 2

Categories