Convert text to binary columns - Python
I have a column in my dataframe that contains many different companies separated by commas (assume there are additional rows with even more companies).
company
apple,microsoft,disney,nike
microsoft,adidas,amazon,eBay
I want to convert this to binary columns for every possible company that appears. It should ultimately look like this:
   adidas  apple  amazon  eBay  disney  microsoft  nike  ...  last_store
        0      1       0     0       1          1     1  ...           0
        1      0       1     1       0          1     0  ...           0
Let us try str.get_dummies:

s = df.company.str.get_dummies(',')

   adidas  amazon  apple  disney  eBay  microsoft  nike
0       0       0      1       1     0          1     1
1       1       1      0       0     1          1     0
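A self-contained sketch of the whole thing, assuming the column is named company as in the question (the data below is just the two example rows rebuilt by hand):

import pandas as pd

# Reconstruct the example frame from the question (hypothetical data)
df = pd.DataFrame({
    'company': ['apple,microsoft,disney,nike',
                'microsoft,adidas,amazon,eBay']
})

# str.get_dummies splits on the separator and builds one 0/1 column
# per distinct value that appears anywhere in the column
dummies = df['company'].str.get_dummies(',')
print(dummies)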
Related
Count per category on separate columns - Pandas Python
I have a pandas dataframe with 3 columns:

Brand    Model       car_age
PEUGEOT  207         4. 6-8
BMW      3ER REIHE   2. 1-2
FIAT     FIAT DOBLO  3. 3-5
PEUGEOT  207         1. 0
BMW      3ER REIHE   2. 1-2
PEUGEOT  308         2. 1-2
BMW      520D        2. 1-2
...      ...         ...

And I want to group by Brand and Model and calculate the count per car_age category:

Brand    Model       "1. 0"  "2. 1-2"  "3. 3-5"  "4. 6-8"
PEUGEOT  207         1       0         0         1
PEUGEOT  308         0       1         0         0
BMW      3ER REIHE   0       2         0         0
BMW      520D        0       1         0         0
FIAT     FIAT DOBLO  0       0         1         0

PS: "1. 0" means category one, which corresponds to a car age of zero; "2. 1-2" means category two, which corresponds to car ages between 1 and 2. I enumerate my categories so they appear in the correct order.

I tried this:

output_count = pd.DataFrame({'Count': df.groupby('Brand','Model','car_age').size()})

but it raised an error:

ValueError: No axis named Model for object type <class 'pandas.core.frame.DataFrame'>

Could anyone help me with this issue? I think I provided enough information, but let me know if I can provide more.
Use pd.crosstab:

pd.crosstab([df['Brand'], df['Model']], df['car_age']).reset_index()

Output:

car_age    Brand       Model  1. 0  2. 1-2  3. 3-5  4. 6-8
0            BMW   3ER REIHE     0       2       0       0
1            BMW        520D     0       1       0       0
2           FIAT  FIAT DOBLO     0       0       1       0
3        PEUGEOT         207     1       0       0       1
4        PEUGEOT         308     0       1       0       0
The correct way to group a dataframe by multiple columns is to pass the column names as a list:

df.groupby(['Brand', 'Model', 'car_age'])

I hope this helps you solve your problem.
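A minimal sketch of how the corrected groupby can be reshaped into the wide table from the question, assuming the column names above (the data is a hypothetical reconstruction of the question's rows); size counts rows per group and unstack moves car_age into columns:

import pandas as pd

# Hypothetical reconstruction of the question's data
df = pd.DataFrame({
    'Brand':   ['PEUGEOT', 'BMW', 'FIAT', 'PEUGEOT', 'BMW', 'PEUGEOT', 'BMW'],
    'Model':   ['207', '3ER REIHE', 'FIAT DOBLO', '207', '3ER REIHE', '308', '520D'],
    'car_age': ['4. 6-8', '2. 1-2', '3. 3-5', '1. 0', '2. 1-2', '2. 1-2', '2. 1-2'],
})

# Count rows per (Brand, Model, car_age), then pivot car_age into columns
out = (df.groupby(['Brand', 'Model', 'car_age'])
         .size()
         .unstack(fill_value=0)
         .reset_index())
print(out)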
Here is a function you can call if you want to see how it works more granularly:

def group_by(df):
    data_dumm = pd.get_dummies(df['car_age'])
    data = df.drop(columns='car_age')
    X = pd.concat([data, data_dumm], axis=1).groupby(['Brand', 'Model']).sum()
    return X.reset_index()

group_by(df)

Output:

     Brand       Model  1. 0  2. 1-2  3. 3-5  4. 6-8
0      BMW   3ER REIHE     0       2       0       0
1      BMW        520D     0       1       0       0
2     FIAT  FIAT DOBLO     0       0       1       0
3  PEUGEOT         207     1       0       0       1
4  PEUGEOT         308     0       1       0       0
How to create a matrix when two values are in the same groupby column in pandas?
So I basically have a dataframe of products and orders:

product  order
apple    111
orange   111
apple    121
beans    121
rice     131
orange   131
apple    141
orange   141

What I need to do is group the products based on the id of the order and generate this matrix with the number of times they appeared together in the same order. I don't know any efficient way of doing this; if someone could help me!

        apple  orange  beans  rice
apple   x      2       1      0
orange  2      x       0      1
beans   1      0       x      0
rice    0      1       0      x
One option is to join the dataframe with itself on order and then calculate the co-occurrences using crosstab on the two product columns:

df.merge(df, on='order').pipe(lambda df: pd.crosstab(df.product_x, df.product_y))

product_y  apple  beans  orange  rice
product_x
apple          3      1       2     0
beans          1      1       0     0
orange         2      0       3     1
rice           0      0       1     1
Another way is to perform a crosstab between product and order, then do a matrix multiplication (@) with the transpose:

a_ = pd.crosstab(df['product'], df['order'])
res = a_ @ a_.T
print(res)

product  apple  beans  orange  rice
product
apple        3      1       2     0
beans        1      1       0     0
orange       2      0       3     1
rice         0      0       1     1

Or, using pipe to write it as a one-liner:

res = pd.crosstab(df['product'], df['order']).pipe(lambda x: x @ x.T)
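Both answers count each product with itself on the diagonal. If you want something closer to the matrix shown in the question, where the diagonal is ignored, here is a small sketch that blanks it out; the data is a hypothetical reconstruction of the question's rows and the res frame comes from the crosstab approach above:

import numpy as np
import pandas as pd

# Hypothetical reconstruction of the question's data
df = pd.DataFrame({
    'product': ['apple', 'orange', 'apple', 'beans',
                'rice', 'orange', 'apple', 'orange'],
    'order':   [111, 111, 121, 121, 131, 131, 141, 141],
})

# Co-occurrence via crosstab + matrix product
a_ = pd.crosstab(df['product'], df['order'])
res = a_ @ a_.T

# Blank the diagonal (a product paired with itself is not of interest here)
res = res.mask(np.eye(len(res), dtype=bool), 0)
print(res)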
Boolean Masking on a Pandas Dataframe where columns may not exist
I have a dataframe called compare that looks like this:

Resident   1xdisc  1xdisc_doc  conpark  parking  parking_doc  conmil  conmil_doc  pest  pest_doc  pet  pet1x  pet_doc  rent  rent_doc  stlc  storage  trash  trash_doc  water  water_doc
John            0        -500        0       50           50       0           0     3         3    0      0        0  1803      1803     0        0     30         30      0          0
Cheldone     -500           0        0       50           50       0           0  1.25      1.25    0      0        0  1565      1565     0        0     30         30      0          0
Dieu         -300        -300        0        0            0       0           0     3         3    0      0        0  1372      1372     0        0     18         18      0          0

Here is the dataframe in a form that can be copied and pasted:

,Resident,1xdisc,1xdisc_doc,conpark,parking,parking_doc,conmil,conmil_doc,pest,pest_doc,pet,pet1x,pet_doc,rent,rent_doc,stlc,storage,trash,trash_doc,water,water_doc
0,Acacia,0,0,0,0,0,0,-500,3.0,3.0,0,0,70,2067,2067,0,0,15,15,0,0
1,ashley,0,0,0,0,0,0,0,3.0,3.0,0,0,0,2067,2067,0,0,15,15,0,0
2,Sheila,0,0,0,0,0,0,0,0.0,0.0,0,0,0,1574,1574,0,0,0,0,0,0
3,Brionne,0,0,0,0,0,0,0,0.0,0.0,0,0,0,1787,1787,0,0,0,0,0,0
4,Danielle,0,0,0,0,0,0,0,0.0,0.0,0,0,0,1422,0,0,0,0,0,0,0
5,Nmesomachi,0,0,0,0,0,0,0,0.0,0.0,0,0,0,1675,1675,0,0,0,0,0,0
6,Doaa,0,0,0,0,0,0,0,0.0,0.0,0,0,0,1306,1306,0,0,0,0,0,0
7,Reynaldo,0,0,0,0,0,0,0,0.0,0.0,0,0,0,1685,1685,0,0,0,0,0,0
8,Shajuan,0,0,0,0,0,0,0,0.0,0.0,0,0,0,1768,0,0,0,0,0,0,0
9,Dalia,0,0,0,0,0,0,0,0.0,0.0,0,0,0,1596,1596,0,0,0,0,0,0

I want to create another dataframe, using boolean masking, that only contains rows where there are mismatches between various pairs of columns. For example, where parking doesn't match parking_doc, or conmil doesn't match conmil_doc.

Here is the code I am using currently:

nonmatch = compare[((compare['1xdisc']!=compare['1xdisc_doc']) & (compare['conpark']!=compare['1xdisc'])) |
                   (compare['rent']!=compare['rent_doc']) |
                   (compare['parking']!=compare['parking_doc']) |
                   (compare['trash']!=compare['trash_doc']) |
                   (compare['pest']!=compare['pest_doc']) |
                   (compare['stlc']!=compare['stlc_doc']) |
                   (compare['pet']!=compare['pet_doc']) |
                   (compare['conmil']!=compare['conmil_doc'])]

The problem I'm having is that some columns may not always exist, for example stlc_doc or pet_doc. How do I select rows with mismatches, but only check for mismatches in particular columns if those columns exist?
If the column names don't always exist, you could add the missing columns, but I don't think that is a good idea, since you would have to replicate the corresponding columns and that would eventually increase the size of the dataframe. So another approach might be to filter the column names themselves and take only the column pairs that exist.

Given DataFrame:

>>> df.head(3)
  Resident  1xdisc  1xdisc_doc  conpark  parking  parking_doc  conmil  conmil_doc  pest  pest_doc  pet  pet1x  pet_doc  rent  rent_doc  stlc  storage  trash  trash_doc  water  water_doc
0   Acacia       0           0        0        0            0       0        -500   3.0       3.0    0      0       70  2067      2067     0        0     15         15      0          0
1   ashley       0           0        0        0            0       0           0   3.0       3.0    0      0        0  2067      2067     0        0     15         15      0          0
2   Sheila       0           0        0        0            0       0           0   0.0       0.0    0      0        0  1574      1574     0        0      0          0      0          0

Take out the column pairs:

>>> maskingCols = [(col[:-4], col) for col in df if col[:-4] in df and col.endswith('_doc')]
>>> maskingCols
[('1xdisc', '1xdisc_doc'), ('parking', 'parking_doc'), ('conmil', 'conmil_doc'), ('pest', 'pest_doc'), ('pet', 'pet_doc'), ('rent', 'rent_doc'), ('trash', 'trash_doc')]

Now that you have the column pairs, you can create the expression required to mask the dataframe:

>>> "|".join(f"(df['{col1}'] != df['{col2}'])" for col1, col2 in maskingCols)
"(df['1xdisc'] != df['1xdisc_doc'])|(df['parking'] != df['parking_doc'])|(df['conmil'] != df['conmil_doc'])|(df['pest'] != df['pest_doc'])|(df['pet'] != df['pet_doc'])|(df['rent'] != df['rent_doc'])|(df['trash'] != df['trash_doc'])"

You can simply pass this expression string to the eval function to evaluate it:

>>> eval("|".join(f"(df['{col1}'] != df['{col2}'])" for col1, col2 in maskingCols))

You can add other criteria on top of this mask:

>>> eval("|".join(f"(df['{col1}'] != df['{col2}'])" for col1, col2 in maskingCols)) | ((df['1xdisc']!=df['1xdisc_doc']) & (df['conpark']!=df['1xdisc']))
0     True
1    False
2    False
3    False
4     True
5    False
6    False
7    False
8     True
9    False
dtype: bool

You can use it to get your desired dataframe:

>>> df[eval("|".join(f"(df['{col1}'] != df['{col2}'])" for col1, col2 in maskingCols)) | ((df['1xdisc']!=df['1xdisc_doc']) & (df['conpark']!=df['1xdisc']))]

Output:

   Resident  1xdisc  1xdisc_doc  conpark  parking  parking_doc  conmil  conmil_doc  pest  pest_doc  pet  pet1x  pet_doc  rent  rent_doc  stlc  storage  trash  trash_doc  water  water_doc
0    Acacia       0           0        0        0            0       0        -500   3.0       3.0    0      0       70  2067      2067     0        0     15         15      0          0
4  Danielle       0           0        0        0            0       0           0   0.0       0.0    0      0        0  1422         0     0        0      0          0      0          0
8   Shajuan       0           0        0        0            0       0           0   0.0       0.0    0      0        0  1768         0     0        0      0          0      0          0
I'm not a pandas expert, so there might be a simpler, library way to do this, but here's a relatively Pythonic, adaptable implementation:

mask = True
for col_name in df.columns:
    # inefficient but readable, could get this down
    # to O(n) with a better data structure
    if col_name + '_doc' in df.columns:
        mask = mask & (df[col_name] != df[col_name + '_doc'])

non_match = df[mask]
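A small variation on the same idea, sketched under the assumption that the frame is named df and the paired columns follow the _doc naming convention; it collects the per-pair comparisons and combines them without eval (the data below is a hypothetical, trimmed-down frame):

import pandas as pd

# Hypothetical small frame with the same naming convention as the question
df = pd.DataFrame({
    'Resident': ['Acacia', 'Sheila', 'Danielle'],
    'rent': [2067, 1574, 1422],
    'rent_doc': [2067, 1574, 0],
    'trash': [15, 0, 0],
    'trash_doc': [15, 0, 0],
})

# Keep only the base/_doc pairs that actually exist in this frame
pairs = [(c[:-4], c) for c in df.columns
         if c.endswith('_doc') and c[:-4] in df.columns]

# One boolean column per pair, True where the pair disagrees;
# any(axis=1) then flags rows with at least one mismatch
mask = pd.concat([df[a].ne(df[b]) for a, b in pairs], axis=1).any(axis=1)
nonmatch = df[mask]
print(nonmatch)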
how to utilize Pandas aggregate functions on this DataFrame?
This is the table:

order_id  product_id  reordered  department_id
2         33120       1          16
2         28985       1          4
2         9327        0          13
2         45918       1          13
3         17668       1          16
3         46667       1          4
3         17461       1          12
3         32665       1          3
4         46842       0          3

I want to group by department_id, summing the number of orders that come from that department, as well as the number of orders from that department where reordered == 0. The resulting table would look like this:

department_id  number_of_orders  number_of_reordered_0
3              2                 1
4              2                 0
12             1                 0
13             2                 1
16             2                 0

I know this can be done in SQL (I forget what the query for that would look like as well; if anyone can refresh my memory on that, that'd be great too). But what are the Pandas functions to make that work? I know that it starts with df.groupby('department_id').sum(). Not sure how to flesh out the rest of the line.
Use GroupBy.agg with DataFrameGroupBy.size and a lambda function that compares values with Series.eq and counts them with sum of the True values (True values are processed like 1):

df1 = (df.groupby('department_id')['reordered']
         .agg([('number_of_orders','size'),
               ('number_of_reordered_0', lambda x: x.eq(0).sum())])
         .reset_index())
print (df1)
   department_id  number_of_orders  number_of_reordered_0
0              3                 2                      1
1              4                 2                      0
2             12                 1                      0
3             13                 2                      1
4             16                 2                      0

If the values are only 1 and 0, it is possible to use sum and subtract at the end:

df1 = (df.groupby('department_id')['reordered']
         .agg([('number_of_orders','size'),
               ('number_of_reordered_0','sum')])
         .reset_index())
df1['number_of_reordered_0'] = df1['number_of_orders'] - df1['number_of_reordered_0']
print (df1)
   department_id  number_of_orders  number_of_reordered_0
0              3                 2                      1
1              4                 2                      0
2             12                 1                      0
3             13                 2                      1
4             16                 2                      0
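The same result can also be written with named aggregation (available since pandas 0.25); a minimal sketch, assuming the column names from the question and rebuilding its sample data by hand:

import pandas as pd

# Hypothetical reconstruction of the question's data
df = pd.DataFrame({
    'order_id':      [2, 2, 2, 2, 3, 3, 3, 3, 4],
    'product_id':    [33120, 28985, 9327, 45918, 17668, 46667, 17461, 32665, 46842],
    'reordered':     [1, 1, 0, 1, 1, 1, 1, 1, 0],
    'department_id': [16, 4, 13, 13, 16, 4, 12, 3, 3],
})

# Named aggregation: one output column per (input column, aggregation) pair
df1 = (df.groupby('department_id')
         .agg(number_of_orders=('reordered', 'size'),
              number_of_reordered_0=('reordered', lambda x: x.eq(0).sum()))
         .reset_index())
print(df1)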
In SQL it would be a simple aggregation:

select department_id,
       count(*) as number_of_orders,
       sum(case when reordered = 0 then 1 else 0 end) as number_of_reordered_0
from tabl_name
group by department_id
Split ("extract") Columns using Panda [duplicate]
This question already has answers here: Python Pandas: create a new column for each different value of a source column (with boolean output as column values) (4 answers).

I currently have a column called Country that can have a value of USA, Canada, or Japan. For example:

Country
-------
Japan
Japan
USA
....
Canada

I want to split ("extract") the values into three individual columns (Country_USA, Country_Canada, and Country_Japan), and basically, a column will have a value of 1 if it matches the original value from the Country column. For example:

Country  -->  Country_Japan  Country_USA  Country_Canada
-------       -------------  -----------  --------------
Japan         1              0            0
USA           0              1            0
Japan         1              0            0
....

Is there a simple (non-tedious) way to do this using Pandas / Python 3.x? Thanks!
Use join with get_dummies and add_prefix:

print(df.join(df['Country'].str.get_dummies().add_prefix('Country_')))

Demo:

df = pd.DataFrame({'Country':['Japan','USA','Japan','Canada']})
print(df.join(df['Country'].str.get_dummies().add_prefix('Country_')))

Output:

  Country  Country_Canada  Country_Japan  Country_USA
0   Japan               0              1            0
1     USA               0              0            1
2   Japan               0              1            0
3  Canada               1              0            0

Better version, thanks to Scott:

print(df.join(pd.get_dummies(df)))

Output:

  Country  Country_Canada  Country_Japan  Country_USA
0   Japan               0              1            0
1     USA               0              0            1
2   Japan               0              1            0
3  Canada               1              0            0

Another good version from Scott:

print(df.assign(**pd.get_dummies(df)))

Output (same as above):

  Country  Country_Canada  Country_Japan  Country_USA
0   Japan               0              1            0
1     USA               0              0            1
2   Japan               0              1            0
3  Canada               1              0            0
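One caveat, sketched as an assumption about your pandas version: in recent pandas releases (2.0 and later, if I recall correctly) pd.get_dummies returns boolean columns by default, so passing dtype=int restores the 0/1 output shown above:

import pandas as pd

df = pd.DataFrame({'Country': ['Japan', 'USA', 'Japan', 'Canada']})

# dtype=int keeps the dummy columns as 0/1 integers instead of True/False
print(df.join(pd.get_dummies(df, dtype=int)))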