I would like to sort the following dataframe:
Region  LSE          North       South
0       Cn       33.330367    9.178917
1       Develd  -36.157025  -27.669988
2       Wetnds  -38.480206  -46.089908
3       Oands   -47.986764  -32.324991
4       Otherg  323.209834   28.486310
5       Soys     34.936147    4.072872
6       Wht       0.983977  -14.972555
I would like to sort it so the LSE column is reordered based on the list:
lst = ['Oands','Wetnds','Develd','Cn','Soys','Otherg','Wht']
Of course, the other columns will need to be reordered accordingly as well. Is there any way to do this in pandas?
The improved support for Categoricals in pandas version 0.15 allows you to do this easily:
df['LSE_cat'] = pd.Categorical(
    df['LSE'],
    categories=['Oands','Wetnds','Develd','Cn','Soys','Otherg','Wht'],
    ordered=True
)
df.sort('LSE_cat')
Out[5]:
   Region  LSE          North       South  LSE_cat
3       3  Oands   -47.986764  -32.324991    Oands
2       2  Wetnds  -38.480206  -46.089908   Wetnds
1       1  Develd  -36.157025  -27.669988   Develd
0       0  Cn       33.330367    9.178917       Cn
5       5  Soys     34.936147    4.072872     Soys
4       4  Otherg  323.209834   28.486310   Otherg
6       6  Wht       0.983977  -14.972555      Wht
If this is only a temporary ordering then keeping the LSE column as
a Categorical may not be what you want, but if this ordering is
something that you want to be able to make use of a few times
in different contexts, Categoricals are a great solution.
In later versions of pandas, sort has been replaced by sort_values, so you would instead need:
df.sort_values('LSE_cat')
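If you don't want to keep a helper column at all, newer pandas (1.1+) also lets you pass a key function to sort_values. A minimal sketch, reusing the lst from the question:
order = {v: i for i, v in enumerate(lst)}
# sort by each LSE value's position in lst; no extra column is created
df.sort_values('LSE', key=lambda s: s.map(order))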
I am working with a large dataset which I've stored in a pandas dataframe. All of my methods I've written to operate on this dataset work on dataframes, but some of them don't work on GroupBy objects.
I've come to a point in my code where I would like to group all data by author name (which I was able to achieve easily via .groupby()). Unfortunately, this outputs a GroupBy object which isn't very useful to me when I want to use dataframe only methods.
I've searched tons of other posts but not found any satisfying answer... how do I convert this GroupBy object back into a DataFrame? (Note: It is much too large for me to manually select groups and concatenate them into a dataframe, I need something automated).
Not exactly sure I understand, so if this isn't what you are looking for, please comment.
Creating a dataframe:
import pandas as pd

df = pd.DataFrame({'author': ['gatsby', 'king', 'michener', 'michener',
                              'king', 'king', 'tolkein', 'gatsby'],
                   'b': range(13, 21)})
     author   b
0    gatsby  13
1      king  14
2  michener  15
3  michener  16
4      king  17
5      king  18
6   tolkein  19
7    gatsby  20
#create the groupby object
dfg = df.groupby('author')
In [44]: dfg
Out[44]: <pandas.core.groupby.generic.DataFrameGroupBy object at 0x000002169D24DB20>
#show groupby works using count()
dfg.count()
          b
author
gatsby    2
king      3
michener  2
tolkein   1
But I think this is what you want: how to revert dfg back to a dataframe. You just need to apply some function to it that doesn't change the data. This is one way:
df_reverted = dfg.apply(lambda x: x)
     author   b
0    gatsby  13
1      king  14
2  michener  15
3  michener  16
4      king  17
5      king  18
6   tolkein  19
7    gatsby  20
This is another way and may be faster; note the dataframe names df and dfg.
df[dfg['b'].transform('count') > 0]
It tests the groupby counts and keeps every group with more than zero rows (so everything); transform returns a boolean Series aligned with the original dataframe df, which is then used as a mask against it.
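If all you need is the original rows back in one frame, another simple option (just a sketch, there are other ways) is to concatenate the groups yourself:
import pandas as pd

# stitch the (name, group) pairs back together, in group order
df_reverted = pd.concat(group for _, group in dfg)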
I have two dataframes; both have an ID column and a Name column containing strings. They might look like this:
Dataframes:
DF-1                       DF-2
---------------------      ---------------------
    ID  Name                   ID  Name
1   56  aaeessa            1   12  H.P paRt 1
2   98  1o7v9sM            2   76  aa3esza
3  175  HP. part 1         3  762  stakoverfl
4    2  stackover          4    2  lo7v9Sm
I would like to compute the string similarity (Ex: Jaccard, Levenshtein) between one element with all the others and select the one that has the highest score. Then match the two IDs so I can join the complete Dataframes later. The resulting table should look like this:
Result
-----------------
   ID1  ID2
1   56   76
2   98    2
3  175   12
4    2  762
This could be easily achieved using a double for loop, but I'm looking for an elegant (and faster) way to accomplish this, maybe with lambdas, a list comprehension, or some pandas tool. Maybe some combination of groupby and idxmax for the similarity score, but I can't quite come up with the solution by myself.
EDIT: The DataFrames are of different lengths; one of the purposes of this function is to determine which elements of the smaller dataframe appear in the larger dataframe and match those, discarding the rest. So the resulting table should only contain pairs of IDs that match, or pairs of ID1 - NaN (assuming DF-1 has more rows than DF-2).
Using the pandas dedupe package: https://pypi.org/project/pandas-dedupe/
You need to train the classifier with human input, and it will then use the learned settings to match the whole dataframe.
First pip install pandas-dedupe and try this:
import pandas as pd
import pandas_dedupe

df1 = pd.DataFrame({'ID': [56, 98, 175],
                    'Name': ['aaeessa', '1o7v9sM', 'HP. part 1']})
df2 = pd.DataFrame({'ID': [12, 76, 762, 2],
                    'Name': ['H.P paRt 1', 'aa3esza', 'stakoverfl ', 'lo7v9Sm']})

# initiate matching
df_final = pandas_dedupe.link_dataframes(df1, df2, ['Name'])

# reset index
df_final = df_final.reset_index(drop=True)

# print result
print(df_final)
    ID  Name        cluster id  confidence
0   98  1o7v9sm            0.0    1.000000
1    2  lo7v9sm            0.0    1.000000
2  175  hp. part 1         1.0    0.999999
3   12  h.p part 1         1.0    0.999999
4   56  aaeessa            2.0    0.999967
5   76  aa3esza            2.0    0.999967
6  762  stakoverfl         NaN         NaN
You can see that matched pairs are assigned a cluster and a confidence level, while unmatched rows get NaN. You can now analyse this information however you wish; perhaps only keep results with a confidence level above 80%, for example.
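For example, a minimal sketch of that last filtering step, assuming the column names printed above:
# keep only pairs matched with more than 80% confidence
confident = df_final[df_final['confidence'] > 0.8]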
I suggest a library called the Python Record Linkage Toolkit.
Once you import the library, you must index the sources you intend to compare, something like this:
import recordlinkage

indexer = recordlinkage.Index()
# block on the 'id' column to limit the candidate pairs
indexer.block('id')
candidate_links = indexer.index(df_1, df_2)

c = recordlinkage.Compare()
Let's say you want to compare based on the similarities of strings that don't match exactly:
c.string('name', 'name', method='jarowinkler', threshold=0.85)
And if you want an exact match you should use:
c.exact('name')
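Either way, you then compute the comparison features over the candidate links. A short sketch, assuming df_1 and df_2 from above:
# one feature vector per candidate pair; pairs scoring above 0 count as matches here
features = c.compute(candidate_links, df_1, df_2)
matches = features[features.sum(axis=1) > 0]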
Using my fuzzy_wuzzy function from the linked answer:
from fuzzywuzzy import fuzz
from fuzzywuzzy import process

mrg = fuzzy_merge(df1, df2, 'Name', 'Name', threshold=70)\
        .merge(df2, left_on='matches', right_on='Name', suffixes=['1', '2'])\
        .filter(like='ID')
Output
   ID1  ID2
0   56   76
1   98    2
2  175   12
3    2  762
I have a data frame with columns:
User_id  PQ_played  PQ_offered
1                5          15
2               12          75
3               25          50
I need to divide PQ_played by PQ_offered to calculate the % of games played. This is what I've tried so far:
new_df['%_PQ_played'] = df.groupby('User_id').((df['PQ_played']/df['PQ_offered'])*100),as_index=True
I know that I am terribly wrong.
It's much simpler than you think.
df['%_PQ_played'] = df['PQ_played'] / df['PQ_offered'] * 100
         PQ_offered  PQ_played  %_PQ_played
User_id
1                15          5    33.333333
2                75         12    16.000000
3                50         25    50.000000
You can use lambda functions:
df.groupby('User_id').apply(lambda x: (x['PQ_played'] / x['PQ_offered']) * 100)\
  .reset_index(1, drop=True).reset_index().rename(columns={0: '%_PQ_played'})
You get
   User_id  %_PQ_played
0        1    33.333333
1        2    16.000000
2        3    50.000000
I totally agree with @mVChr and think you are overcomplicating what you need to do. If you are simply trying to add an additional column, then his response is spot on. If you truly need to groupby, it is worth noting that groupby is typically used for aggregation, e.g., sum(), count(), etc. If, for example, you had several records with non-unique values in the User_id column, then you could create the additional column using
df['%_PQ_played'] = df['PQ_played'] / df['PQ_offered'] * 100
and then perform an aggregation. Let's say you wanted to know, for each user, the average percentage of offered games that were played; you could do something like
new_df = df.groupby('User_id', as_index=False)['%_PQ_played'].mean()
This would yield (numbers are arbitrary)
   User_id  %_PQ_played
0        1    52.777778
1        2    29.250000
2        3    65.000000
I am trying to join two dataframes but cannot get my head around the possibilities Python has to offer.
First dataframe:
ID  MODEL    REQUESTS  ORDERS
1   Golf          123       4
2   Passat         34       5
3   Model 3       500       8
4   M3              5       0
Second dataframe:
MODEL    TYPE   MAKE
Golf     Sedan  Volkswagen
M3       Coupe  BMW
Model 3  Sedan  Tesla
What I want is to add another column in the first dataframe called "make" so that it looks like this:
ID  MODEL    MAKE        REQUESTS  ORDERS
1   Golf     Volkswagen       123       4
2   Passat   Volkswagen        34       5
3   Model 3  Tesla            500       8
4   M3       BMW                5       0
I already looked at merge, join and map but all examples just appended the required information at the end of the dataframe.
I think you can use insert together with map using a Series created from df2 (if some value of MODEL is missing from df2, you get NaN):
df1.insert(2, 'MAKE', df1['MODEL'].map(df2.set_index('MODEL')['MAKE']))
print (df1)
   ID  MODEL    MAKE        REQUESTS  ORDERS
0   1  Golf     Volkswagen       123       4
1   2  Passat   NaN               34       5
2   3  Model 3  Tesla            500       8
3   4  M3       BMW                5       0
Although not needed in this case, there might be scenarios where df2 has more than two columns and you just want to add one of them to df1, based on a specific column as the key. Here is a generic version that you may find useful:
df = pd.merge(df1, df2[['MODEL', 'MAKE']], on = 'MODEL', how = 'left')
The join method acts very similarly to a VLOOKUP: it joins on a column in the first dataframe against the index of the second dataframe, so you must set MODEL as the index of the second dataframe and grab only the MAKE column.
df1.join(df2.set_index('MODEL')['MAKE'], on='MODEL')
Take a look at the documentation for join as it actually uses the word VLOOKUP.
I always found merge to be an easy way to do this:
df1.merge(df2[['MODEL', 'MAKE']], how = 'left')
However, I must admit it would not be as short and nice if you wanted to call the new column something other than 'MAKE'.
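If you did want another name, one option (a sketch; MANUFACTURER is just an example name) is to rename the column before merging:
# rename the incoming column first, then merge on the shared MODEL column
df1.merge(df2[['MODEL', 'MAKE']].rename(columns={'MAKE': 'MANUFACTURER'}),
          how='left')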
I have two dataframes, each of them looking like
date      country   value
20100101  country1  1
20100102  country1  2
20100103  country1  3

date      country   value
20100101  country2  4
20100102  country2  5
20100103  country2  6
I want to merge them into one dataframe looking like
date      country1  country2
20100101  1         4
20100102  2         5
20100103  3         6
Is there any clever way to do this in pandas?
This looks like a pivot table, which in pandas can be done with unstack (pandas also has pivot and pivot_table for the same idea).
An example analogous to the one used in Wes McKinney's "Python for Data Analysis" book:
bytz = df.groupby(['tz', opersystem])  # opersystem: a column name or a precomputed Series
counts = bytz.size().unstack().fillna(0)
(This groups by time zone and operating system in rows, which is then pivoted so that operating system becomes the columns, just like your "country*" values.)
P.S. For concatenating dataframes you can use pandas.concat. It's also often good to call .reset_index on the resulting dataframe, because in some (many?) cases duplicate values in the index can make pandas go haywire, throwing strange exceptions on .apply and the like.
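Applied to the two frames in the question (assuming they are named df1 and df2), a minimal sketch:
import pandas as pd

# stack the two frames, then pivot countries into columns
combined = pd.concat([df1, df2])
result = combined.pivot(index='date', columns='country', values='value')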