I am trying to manipulate some data in pandas so that it's compatible with an existing piece of software; the operation to perform would be similar to this:
original dataframe:
some_data language spelling
1 12 french un
1 12 english one
1 12 spanish uno
2 52 french deux
2 52 english two
2 52 spanish dos
target dataframe:
some_data lang_en lang_fr lang_sp
1 12 one un uno
2 52 two deux dos
So it should merge rows that share an index and pivot the language/spelling pairs into columns, while keeping any supplementary column data.
All the columns that are not to be split ('some_data', in this example) contain duplicate data across a single index, and many such columns exist in the real data.
I would definitely be able to do it by looping on the dataframe, but am trying to figure out if it's possible to do this entirely with pandas.
You can use:
df.set_index(['some_data','language'])['spelling']\
.unstack()\
.rename(columns=lambda x: 'lang_' + x[:2])\
.rename_axis([None], axis=1)\
.reset_index()
Output:
some_data lang_en lang_fr lang_sp
0 12 one un uno
1 52 two deux dos
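For reference, a near-equivalent sketch using pivot instead of set_index + unstack (same column names assumed):
df.pivot(index='some_data', columns='language', values='spelling')\
  .rename(columns=lambda x: 'lang_' + x[:2])\
  .rename_axis(None, axis=1)\
  .reset_index()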
I am working with a large dataset which I've stored in a pandas dataframe. All of my methods I've written to operate on this dataset work on dataframes, but some of them don't work on GroupBy objects.
I've come to a point in my code where I would like to group all data by author name (which I was able to achieve easily via .groupby()). Unfortunately, this outputs a GroupBy object which isn't very useful to me when I want to use dataframe only methods.
I've searched tons of other posts but not found any satisfying answer... how do I convert this GroupBy object back into a DataFrame? (Note: It is much too large for me to manually select groups and concatenate them into a dataframe, I need something automated).
Not exactly sure I understand, so if this isn't what you are looking for, please comment.
Creating a dataframe:
import pandas as pd

df = pd.DataFrame({'author': ['gatsby', 'king', 'michener', 'michener', 'king', 'king', 'tolkein', 'gatsby'],
                   'b': range(13, 21)})
author b
0 gatsby 13
1 king 14
2 michener 15
3 michener 16
4 king 17
5 king 18
6 tolkein 19
7 gatsby 20
# Create the groupby object:
dfg = df.groupby('author')
In [44]: dfg
Out[44]: <pandas.core.groupby.generic.DataFrameGroupBy object at 0x000002169D24DB20>
# Show that the groupby works, using count():
dfg.count()
b
author
gatsby 2
king 3
michener 2
tolkein 1
But I think this is what you want: how to revert dfg back to a dataframe. You just need to apply a function to it that doesn't change the data. This is one way:
df_reverted = dfg.apply(lambda x: x)
author b
0 gatsby 13
1 king 14
2 michener 15
3 michener 16
4 king 17
5 king 18
6 tolkein 19
7 gatsby 20
This is another way, and it may be faster; note that it uses both dataframes, df and dfg.
df[dfg['b'].transform('count') > 0]
The transform counts the rows in each group and returns a boolean Series that is True for every row (all counts are greater than zero), which is then used to index the original dataframe, df.
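One more minimal sketch (not from the original answers): you can also rebuild a plain DataFrame yourself, since iterating a GroupBy yields (key, group) pairs.
# Each iteration yields an (author, sub-DataFrame) pair; concatenating
# the groups gives back a plain DataFrame (pandas already imported as pd):
df_reverted = pd.concat([group for _, group in dfg])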
I have two pandas dataframes with some columns in common. These columns are of type category but unfortunately the category codes don't match for the two dataframes. For example I have:
>>> df1
artist song
0 The Killers Mr Brightside
1 David Guetta Memories
2 Estelle Come Over
3 The Killers Human
>>> df2
artist date
0 The Killers 2010
1 David Guetta 2012
2 Estelle 2005
3 The Killers 2006
But:
>>> df1['artist'].cat.codes
0 55
1 78
2 93
3 55
Whereas:
>>> df2['artist'].cat.codes
0 99
1 12
2 23
3 99
What I would like is for my second dataframe df2 to take the same category codes as the first one df1 without changing the category values. Is there any way to do this?
(Edit)
Here is a screenshot of my two dataframes. Essentially, I want song_tags to have the same category codes for artist_name and track_name as the songs dataframe. song_tags is created from a merge between songs and another tag dataframe (which contains song data and their tags, without the user information), then saved and loaded through pickle. It may also be relevant that I had to cast artist_name and track_name in song_tags from type object to type category.
I think essentially my question is: how to modify category codes of an existing dataframe column?
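One minimal sketch of an approach (not from the thread; it assumes every artist in df2 also appears in df1): rebuild df2's column as a Categorical that reuses df1's categories, since a value's code is just its position in the categories array.
import pandas as pd

# Reusing df1's category order makes equal values get equal codes.
# Caveat: values absent from df1's categories would become NaN.
df2['artist'] = pd.Categorical(df2['artist'],
                               categories=df1['artist'].cat.categories)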
I have a DataFrame, and one column is "lang" for "language."
Two different values in this column are "en" for "English" and "en-gb" for "British English."
There are numerous other values in this column, including "es" for "Spanish," "fr" for "French," and so on.
So it looks something like this:
user lang id
joe en 77788
jim en-gb 23323
pedro es 12134
tom en 53892
juan es 24434
phillippe fr 04211
george en-gb 99999
For the purposes of my analysis, I want to count the 'en' and 'en-gb' values together as the same "en" or "English" value. Perhaps I could put just this column into a Series and count them as one, or I could replace the "en-gb" values with "en".
If you want the first two letters, you can use string slicing, i.e. .str[:2], so that regional variants of a language are counted as one:
df['lang'].str[:2]
0 en
1 en
2 es
3 en
4 es
5 fr
6 en
Name: lang, dtype: object
Now that you have the Series, store it in a new column:
df['new'] = df['lang'].str[:2]
Then merge or group using 'new' as the key. Hope it helps.
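If you only need the counts rather than a new column, a one-liner sketch of the same idea:
df['lang'].str[:2].value_counts()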
You can change the column using .str[:2] as Bharath suggested. If you want to keep the column unchanged, you can use groupby on that column directly.
Say you want to find the count of users for each language,
df_new = df.groupby(df.lang.str[:2]).user.count()
Or
df_new = df.groupby(df.lang.str.split('-').str[0]).user.count()
will return
lang
en 4
es 2
fr 1
And your original data is unaffected.
You can do this by using replace:
df=df.replace({'en-gb':'en'})
df
Out[358]:
user lang id
0 joe en 77788
1 jim en 23323
2 pedro es 12134
3 tom en 53892
4 juan es 24434
5 phillippe fr 4211
6 george en 99999
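Note that replace on the whole frame touches every column; if you want to restrict it to the lang column, a narrower variant of the same idea:
df['lang'] = df['lang'].replace({'en-gb': 'en'})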
I have a dataset of 200k rows with two columns: (1) a unique customer id and address combination and (2) revenue. The table is sorted by revenue, and the goal is to clean up column 1 by fuzzy-matching it against itself: if a close-enough customer-address combination with higher revenue exists, it should replace the lower-revenue combination, which most likely differs only by spelling.
Example:
In the above example, the third row is very similar to the first row, so I want it to take the value of the first row.
I have a working python code but it is too slow:
import pandas as pd
import datetime
import time
import numpy as np
from pyxdameraulevenshtein import normalized_damerau_levenshtein_distance, normalized_damerau_levenshtein_distance_ndarray

data = pd.read_csv("CustomerMaster.csv", encoding="ISO-8859-1")

# Create lookup column from the dataframe itself:
lookup_data = data['UNIQUE_ID']
lookup_data = pd.Series.to_frame(lookup_data)

# Iterate row by row over the lookup data to find the first close enough
# fuzzy match, and write it back into the dataframe:
start = time.time()
for index, row in data.iterrows():
    if index % 5000 == 0:
        print(index, time.time() - start)
    for index2, row2 in lookup_data.iterrows():
        ratio_val = normalized_damerau_levenshtein_distance(row['UNIQUE_ID'], row2['UNIQUE_ID'])
        if ratio_val < 0.15:
            data.at[index, 'UPDATED_ID'] = row2['UNIQUE_ID']
            data.at[index, 'Ratio_Val'] = ratio_val
            break
Currently this fuzzy matching block takes far too long to run: about 8 hours for the first 15k rows, with the runtime growing quadratically as one would expect. Any suggestions on how to write this code more efficiently?
One immediate suggestion: since matching is symmetric, you need to match each row only against the rows that have not been matched yet. Rewrite the inner loop to skip over the previously visited rows, e.g. by adding this at the top of the inner loop:
if index2 <= index:
    continue
This alone will speed up the matching by a factor of 2.
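Beyond that, most of the cost is the Python-level inner loop. A rough sketch that pushes it into the library, using the normalized_damerau_levenshtein_distance_ndarray helper the question already imports (the 0.15 threshold and first-match rule are kept from the original; treat this as an untested outline):
import numpy as np

ids = data['UNIQUE_ID'].to_numpy()
for index, row in data.iterrows():
    # One call computes the distance from this row's ID to every UNIQUE_ID:
    ratios = normalized_damerau_levenshtein_distance_ndarray(row['UNIQUE_ID'], ids)
    matches = np.nonzero(ratios < 0.15)[0]
    if len(matches) > 0:
        first = matches[0]  # first match = highest-revenue candidate
        data.at[index, 'UPDATED_ID'] = ids[first]
        data.at[index, 'Ratio_Val'] = ratios[first]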
I had the same issue and resolved it with a combination of the levenshtein package (to create a distance matrix) and scikit-learn's DBSCAN (to cluster similar strings and assign the same value to every element within each cluster).
You can check it out here: https://github.com/ebravofm/e_utils (homog_lev())
>>> from e_utils.utils import clean_df
>>> from e_utils.utils import homog_lev
>>> series
0 Bad Bunny
1 bad buny
2 bag bunny
3 Ozuna
4 De La Ghetto
5 de la geto
6 Daddy Yankee
7 dade yankee
8 Nicky Jam
9 nicky jam
10 J Balvin
11 jbalvin
12 Maluma
13 maluma
14 Anuel AA
>>> series2 = clean_df(series)
>>> series2 = homog_lev(series2, eps=3)
>>> pd.concat([series, series2.str.title()], axis=1, keys=['*Original*', '*Fixed*'])
*Original* *Fixed*
0 Bad Bunny Bad Bunny
1 bad buny Bad Bunny
2 bag bunny Bad Bunny
3 Ozuna Ozuna
4 De La Ghetto De La Ghetto
5 de la geto De La Ghetto
6 Daddy Yankee Daddy Yankee
7 dade yankee Daddy Yankee
8 Nicky Jam Nicky Jam
9 nicky jam Nicky Jam
10 J Balvin J Balvin
11 jbalvin J Balvin
12 Maluma Maluma
13 maluma Maluma
14 Anuel AA Anuel Aa
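If you don't want to depend on that repo, here is a rough sketch of the same distance-matrix + DBSCAN idea with common libraries (assumes python-Levenshtein and scikit-learn; eps=3 and "first member of the cluster wins" are illustrative choices, not from the answer):
import numpy as np
from Levenshtein import distance
from sklearn.cluster import DBSCAN

names = series.str.lower().str.strip()
unique = names.unique()
# Pairwise edit-distance matrix between the unique strings:
dist = np.array([[distance(a, b) for b in unique] for a in unique])
labels = DBSCAN(eps=3, min_samples=1, metric='precomputed').fit_predict(dist)
# Map every member of a cluster onto the cluster's first string:
canon = {u: unique[labels == lab][0] for u, lab in zip(unique, labels)}
fixed = names.map(canon)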
DataFrame1:
Device MedDescription Quantity
RWCLD Acetaminophen (TYLENOL) 325 mg Tab 54
RWCLD Ampicillin Inj (AMPICILLIN) 2 g Each 13
RWCLD Betamethasone Inj *5mL* (CELESTONE SOLUSPAN) 30 mg (5 mL) Each 2
RWCLD Calcium Carbonate Chew (500mg) (TUMS) 200 mg Tab 17
RWCLD Carboprost Inj *1mL* (HEMABATE) 250 mcg (1 mL) Each 5
RWCLD Chlorhexidine Gluc Liq *UD* (PERIDEX/PERIOGARD) 0.12 % (15 mL) Each 5
DataFrame2:
Device DrwSubDrwPkt MedDescription BrandName MedID PISAlternateID CurrentQuantity Min Max StandardStock ActiveOrders DaysUnused
RWC-LD RWC-LD_MAIN Drw 1-Pkt 12 Mag/AlOH/Smc 200-200-20/5 *UD* (MYLANTA/MAALOX) (30 mL) Each MYLANTA/MAALOX A03518 27593 7 4 10 N Y 3
RWC-LD RWC-LD_MAIN Drw 1-Pkt 20 ceFAZolin in Dextrose(ISO-OS) (ANCEF/KEFZOL) 1 g (50 mL) Each ANCEF/KEFZOL A00984 17124 6 5 8 N N 2
RWC-LD RWC-LD_MAIN Drw 1-Pkt 22 Clindamycin Phosphate/D5W (CLEOCIN) 900 mg (50 mL) IV Premix CLEOCIN A02419 19050 7 6 8 N N 2
What I want to do is append DataFrame2 values to DataFrame1 ONLY if the 'MedDescription' matches. When it finds a match, I would like to add only certain columns from DataFrame2 (Min, Max, DaysUnused), which are all integers.
I had an iterative solution where I access DataFrame1 one row at a time, check for a match in DataFrame2, and once found, append those columns to the original DataFrame.
Is there a better way? It is slowing my computer to a crawl, as I have thousands upon thousands of rows.
It sounds like you want to merge the target columns ('MedDescription', 'Min', 'Max', 'Days Unused') to df1 based on a matching 'MedDescription'.
I believe the best way to do this is as follows:
target_cols = ['MedDescription', 'Min', 'Max', 'Days Unused']
df1.merge(df2[target_cols], on='MedDescription', how='left')
how='left' ensures that all the data in df1 is returned, and only the target columns in df2 are appended if MedDescription matches.
Note: It is easier for others if you copy the results of df1/df2.to_dict(). The data above is difficult to parse.
This sounds like an opportunity to use Pandas' built-in functions for joining datasets - you should be able to join on MedDescription with the desired columns from DataFrame2. The join function in Pandas is very efficient and should far outperform your method of looping through.
Pandas has documentation on merging datasets that includes some good examples, and you can find ample literature on the concepts of joins in SQL tutorials.
pd.merge(ld,ldAc,on='MedDescription',how='outer')
This is how I joined the two DataFrames; it seems to work, although it dropped one of the indexes that contained the devices.
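If the lost index matters, a common workaround (a sketch; merge builds a fresh RangeIndex, which is likely why the device index disappeared) is to move the index into a column before merging:
merged = pd.merge(ld.reset_index(), ldAc, on='MedDescription', how='outer')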