I have a dataset.
I want to get to know our customers by looking at typical shared characteristics (e.g. "Married customers in their 40s like wine"). This would correspond to the itemset {Married, 40s, Wine}.
How can I create a new dataframe called customer_data_onehot such that rows correspond to customers (as in the original dataset) and columns correspond to the categories of each of the ten categorical attributes in the data? The new dataframe should contain only boolean values (True/False or 0/1) such that the value in row i and column j is True (or 1) if and only if the attribute value corresponding to column j holds for the customer corresponding to row i. Display the dataframe.
I have this hint: "For example, for the attribute "Education" there are 5 possible categories: 'Graduation', 'PhD', 'Master', 'Basic', '2n Cycle'. Therefore, the new dataframe must contain one column for each of those attribute values." But I don't understand how I can achieve this.
Can someone guide me to the correct solution?
I have this code, which imports the CSV file and selects 90% of the data from the original dataset:
import pandas as pd

pre_process = pd.read_csv('customer_data.csv')
pre_process = pre_process.sample(frac=0.9, random_state=413808)
pre_process.to_csv('customer_data_2.csv', index=False)  # write in a separate step; to_csv returns None
Use get_dummies:
Set up an MRE (minimal reproducible example):
data = {'Customer': ['A', 'B', 'C'],
'Marital_Status': ['Together', 'Married', 'Single'],
'Age_Group': ['40s', '60s', '20s']}
df = pd.DataFrame(data)
print(df)
# Output
Customer Marital_Status Age_Group
0 A Together 40s
1 B Married 60s
2 C Single 20s
out = pd.get_dummies(df.set_index('Customer')).reset_index()
print(out)
# Output
Customer Marital_Status_Married Marital_Status_Single Marital_Status_Together Age_Group_20s Age_Group_40s Age_Group_60s
0 A 0 0 1 0 1 0
1 B 1 0 0 0 0 1
2 C 0 1 0 1 0 0
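Applied to your case, a minimal sketch (the column list here is hypothetical; substitute the actual names of your ten categorical attributes):

import pandas as pd

customer_data = pd.read_csv('customer_data_2.csv')
# Hypothetical list of the ten categorical attributes; replace with your real columns
categorical_cols = ['Education', 'Marital_Status']  # ...plus the other eight
# One boolean column per category, one row per customer
customer_data_onehot = pd.get_dummies(customer_data[categorical_cols]).astype(bool)
print(customer_data_onehot)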
Say I have two tables:
Table A:
state value
0 A 100
Table B:
state 100 200
0 A 1 4
1 B 2 5
2 C 3 6
I want to create a new field for Table A called "Factor" that returns the respective value from Table B:
state value factor
0 A 100 1
How would I do this in Python/Pandas?
In Excel, I would do: INDEX('Table B'!B2:C4, MATCH('Table A'!A2, 'Table B'!A:A, 0), MATCH('Table A'!B2, 'Table B'!B1:C1, 0))
Pivot dfB from wide to long format using melt, and ensure the column names get converted to a numeric type.
Merge the long-form data onto dfA.
melted = dfB.melt(id_vars=['state'], var_name='value', value_name='factor')  # lookup table to long format
melted['value'] = melted['value'].astype(int)  # column labels were strings; match dfA's numeric 'value'
dfA = dfA.merge(melted, on=['state', 'value'])  # look up factor by (state, value)
Result:
state value factor
0 A 100 1
This may feel like overkill for this example, but it could be helpful for larger lookups.
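One caveat: merge defaults to an inner join, so any dfA row without a matching (state, value) pair in dfB is silently dropped. To keep such rows (with NaN as the factor), pass how='left':

dfA = dfA.merge(melted, on=['state', 'value'], how='left')  # keep unmatched dfA rows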
You can use loc for this; try the following example.
Assuming Table A (dfA) and Table B (dfB) are both pandas dataframes:
>>> A = {
'state' : ['A'],
'value' : [100],
}
>>> B = {
'state' : ['A','B','C'],
'100' : [1,2,3],
'200' : [4,5,6]
}
>>> dfA = pd.DataFrame(A)
>>> dfB = pd.DataFrame(B)
which gives you your tables as pandas dataframes
>>> dfA
state value
0 A 100
>>> dfB
state 100 200
0 A 1 4
1 B 2 5
2 C 3 6
Then we search a specific column in the dataframe for a specific value and return the value of a different column from that same row.
extracted_value = dfB.loc[dfB['state'] == 'A', '100'].iloc[0]
We create a column and set that value for it:
dfA['factor'] = extracted_value
and we now have that extracted value in the appropriate column
>>> dfA
state value factor
0 A 100 1
Now I'm sure you want to do this in a loop to address a particular list of values, and I hope this helps you start. Note that dfA['factor'] = extracted_value sets every value in that column to extracted_value; if you want different values per row, you'll want to use .loc again to assign values based on the row index.
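As a rough sketch of that per-row idea (my own addition, assuming dfB's column labels are the strings '100' and '200' as defined above):

for idx, row in dfA.iterrows():
    # look up the factor for this row's (state, value) pair in dfB
    match = dfB.loc[dfB['state'] == row['state'], str(row['value'])]
    dfA.loc[idx, 'factor'] = match.iloc[0]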
I have a dataframe as shown below:
df = pd.DataFrame(
{'supplier_id':[1,1,1,1],
'prod_id':[123,456,789,342],
'country' : ['UK', 'UK', 'UK','US'],
'transaction_date' : ['13/11/2020', '10/1/2018','11/11/2017', '27/03/2016'],
'industry' : ['STA','STA','PSA','STA'],
'segment' : ['testa','testb','testa','testc'],
'label':[1,1,1,0]})
My objective is to answer the questions below.
a) From the current row, how many times prior has the same supplier succeeded and failed in the same country? (Use the supplier_id and country columns.) Here label = 1 means success and label = 0 means failure.
Similarly, I would like to compute the success and failure counts based on industry, country and segment as well.
Note that the first transaction always starts with 0, because the supplier has no previous transactions for that column.
As we are looking at chronological order of business done, we need to first sort the dataframe based on transaction_date.
So, I tried the below
df.sort_values(by=['supplier_id','transaction_date'],inplace=True)
df['prev_biz_country_success_count'] = df.groupby(['supplier_id', 'country']).cumcount()
df['prev_biz_country_failure_count'] = df.groupby(['supplier_id', 'country']).cumcount()
but as you can see, I am not sure how to include the label column value. Meaning, we need separate counts for label=1 and label=0.
I expect my output to be as shown below.
We can group the dataframe by the supplier_id and country columns, then apply shift + cumsum on the label column to count, for each row, how many previous rows in the group meet the criterion:
g = df.groupby(['supplier_id', 'country'])
for criteria, label in dict(success=1, failure=0).items():
    # count previous rows in the group whose label matches this criterion
    df[f'prev_biz_country_{criteria}_count'] = \
        g['label'].apply(lambda s: s.eq(label).shift(fill_value=0).cumsum())
supplier_id prod_id country transaction_date industry segment label prev_biz_country_success_count prev_biz_country_failure_count
1 1 456 UK 10/1/2018 STA testb 1 0 0
2 1 789 UK 11/11/2017 PSA testa 1 1 0
0 1 123 UK 13/11/2020 STA testa 1 2 0
3 1 342 US 27/03/2016 STA testc 0 0 0
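One caveat from me, since the posted output is sorted on the raw strings: transaction_date is a string here, so sorting it lexically is not chronological. A minimal sketch that parses the dates first:

# parse day-first date strings so the sort is chronological, not lexical
df['transaction_date'] = pd.to_datetime(df['transaction_date'], dayfirst=True)
df.sort_values(by=['supplier_id', 'transaction_date'], inplace=True)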
I am transitioning from Excel to Python and finding the process a little daunting. I have a pandas dataframe and cannot find out how to count the total of each cluster of 1s per row, grouped by each ID (example data below).
ID 20-21 19-20 18-19 17-18 16-17 15-16 14-15 13-14 12-13 11-12
0 335344 0 0 1 1 1 0 0 0 0 0
1 358213 1 1 0 1 1 1 1 0 1 0
2 358249 0 0 0 0 0 0 0 0 0 0
3 365663 0 0 0 1 1 1 1 1 0 0
The result of the above, in the format
ID
last column heading where a '1' occurs: count of '1's in that cluster
would be:
335344
16-17: 3
358213
19-20: 2
14-15: 4
12-13: 1
365663
13-14: 5
There are more than 11,000 rows of data, and I would like to output the result to a txt file. I have been unable to find any examples of how the same values are clustered by row with a count for each cluster, but I am probably not using the correct Python terminology. I would be grateful if someone could point me in the right direction. Thanks in advance.
The first step is to reshape with DataFrame.set_index plus DataFrame.stack. Then create consecutive-group ids by comparing each value against its Series.shift-ed predecessor for inequality and taking the cumulative sum with Series.cumsum into a new column g. Finally, filter the rows equal to 1 and aggregate with named aggregation via GroupBy.agg, using GroupBy.last and GroupBy.size:
df = df.set_index('ID').stack().reset_index(name='value')  # long format: ID, level_1, value
df['g'] = df['value'].ne(df['value'].shift()).cumsum()  # id of each consecutive run
df1 = (df[df['value'].eq(1)].groupby(['ID', 'g'])
         .agg(a=('level_1', 'last'), b=('level_1', 'size'))  # last column label and run length
         .reset_index(level=1, drop=True)
         .reset_index())
print(df1)
ID a b
0 335344 16-17 3
1 358213 19-20 2
2 358213 14-15 4
3 358213 12-13 1
4 365663 13-14 5
Lastly, to write it to a txt file, use DataFrame.to_csv:
df1.to_csv('file.txt', index=False)
If you need your custom format in the text file, use:
with open("file.txt", "w") as f:
    for i, g in df1.groupby('ID'):
        f.write(f"{i}\n")
        for a, b in g[['a', 'b']].to_numpy():
            f.write(f"\t{a}: {b}\n")
You just need to use the sum method and specify which axis to sum on. To get the sum of each row, create a new column equal to the sum across the row.
# create new series equal to sum of values in the index row
df['sum'] = df.sum(axis=1) # specifies index (row) axis
The best method for getting the sum of each column depends on how you want to use that information, but in general the core is just to call the sum method on the series and assign the result to a variable.
# sum a column and assign result to variable
foo = df['20-21'].sum() # default axis=0
bar = df['16-17'].sum() # default axis=0
print(foo) # returns 1
print(bar) # returns 3
You can get the sum of each column using a for loop, adding each result to a dictionary. Here is a quick function I put together that gets the sum of each column and returns a dictionary of the results, so you know which total belongs to which column. The two inputs are (1) the dataframe and (2) a list of any column names you would like to ignore:
def get_df_col_sum(frame: pd.DataFrame, ignore: list) -> dict:
    """Get the sum of each column in a dataframe as a dictionary"""
    # get list of headers in dataframe
    dfcols = frame.columns.tolist()
    # create a blank dictionary to store results
    dfsums = {}
    # loop through each column and record its sum in the dictionary
    for dfcol in dfcols:
        if dfcol not in ignore:
            dfsums.update({dfcol: frame[dfcol].sum()})
    return dfsums
I then ran the following code
# read excel to dataframe
df = pd.read_excel(test_file)
# ignore the ID column
ignore_list = ['ID']
# get sum for each column
res_dict = get_df_col_sum(df, ignore_list)
print(res_dict)
and got the following result.
{'20-21': 1, '19-20': 1, '18-19': 1, '17-18': 3, '16-17': 3, '15-16':
2, '14-15': 2, '13-14': 1, '12-13': 1, '11-12': 0}
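For what it's worth, the same dictionary can be produced without a loop; a one-liner sketch using the same ignore idea:

# sum every column except 'ID' and convert the resulting Series to a dict
res_dict = df.drop(columns=['ID']).sum().to_dict()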
I'm trying to do a conditional count across records in a pandas dataframe. I'm new to Python and have a working solution using a for loop, but running it on a large dataframe with ~200k rows takes a long time. I believe there is a better way, by defining a function and using apply, but I'm having trouble figuring it out.
Here's a simple example.
Create a pandas dataframe with two columns:
import pandas as pd
data = {'color': ['blue','green','yellow','blue','green','yellow','orange','purple','red','red'],
'weight': [4,5,6,4,1,3,9,8,4,1]
}
df = pd.DataFrame(data)
# for each row, count the number of other rows with the same color and a lesser weight
counts = []
for i in df.index:
    c = df.loc[i, 'color']
    w = df.loc[i, 'weight']
    ct = len(df.loc[(df['color'] == c) & (df['weight'] < w)])
    counts.append(ct)
df['counts, same color & less weight'] = counts
For each record, the 'counts, same color & less weight' column is intended to get a count of the other records in the df with the same color and a lesser weight. For example, the result for row 0 (blue, 4) is zero because no other records with color=='blue' have lesser weight. The result for row 1 (green, 5) is 1 because row 4 is also color=='green' but weight==1.
How do I define a function that can be applied to the dataframe to achieve the same?
I'm familiar with apply, for example to square the weight column I'd use:
df['weight squared'] = df['weight'].apply(lambda x: x**2)
... but I'm unclear how to use apply to do a conditional calculation that refers to the entire df.
Thanks in advance for any help.
We can do this with transform('min') in a groupby:
df.weight.gt(df.groupby('color').weight.transform('min')).astype(int)
0 0
1 1
2 1
3 0
4 0
5 0
6 0
7 0
8 1
9 0
Name: weight, dtype: int64
#df['c...]=df.weight.gt(df.groupby('color').weight.transform('min')).astype(int)
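Note this produces a 0/1 indicator, which happens to match the expected output because each color appears at most twice here. For a true count of same-color rows with strictly smaller weight, a sketch of my own using rank (rank(method='min') equals one plus that count):

# rank(method='min') = 1 + number of same-color rows with strictly smaller weight
df['counts, same color & less weight'] = (
    df.groupby('color')['weight'].rank(method='min').astype(int) - 1
)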
I have a DataFrame with a multiindex in the columns and would like to use dictionaries to append new rows.
Let's say that each row in the DataFrame is a city. The columns contain "distance" and "vehicle", and each cell holds the percentage of the population that chooses that vehicle for that distance.
I'm constructing an index like this:
index_tuples = []
for distance in ["near", "far"]:
    for vehicle in ["bike", "car"]:
        index_tuples.append([distance, vehicle])
index = pd.MultiIndex.from_tuples(index_tuples, names=["distance", "vehicle"])
Then I'm creating a dataframe:
dataframe = pd.DataFrame(index=["city"], columns = index)
The structure of the dataframe looks good, although pandas has added NaNs as default values.
Now I would like to set up a dictionary for the new city and add it:
my_home_city = {"near":{"bike":1, "car":0},"far":{"bike":0, "car":1}}
dataframe["my_home_city"] = my_home_city
But this fails:
ValueError: Length of values does not match length of index
UPDATE:
Thank you for all the good answers. I'm afraid I oversimplified the problem in my example: my index is actually nested with 3 levels (and it could become more).
So I've accepted the universal answer of converting my dictionary into a list of tuples. This might not be as clean as the other approaches, but it works for any multiindex setup.
A MultiIndex is a list of tuples, so we just need to flatten your nested dict into tuple keys; then we can assign the values directly:
d = {(x,y):my_home_city[x][y] for x in my_home_city for y in my_home_city[x]}
df.loc['my_home_city',:]=d
df
Out[994]:
distance near far
vehicle bike car bike car
city NaN NaN NaN NaN
my_home_city 1 0 0 1
More Info
d
Out[995]:
{('far', 'bike'): 0,
('far', 'car'): 1,
('near', 'bike'): 1,
('near', 'car'): 0}
df.columns.values
Out[996]: array([('near', 'bike'), ('near', 'car'), ('far', 'bike'), ('far', 'car')], dtype=object)
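Since the update mentions three or more levels, here is a rough generalisation of the same flattening (a hypothetical recursive helper of my own, not from the original answer):

def flatten(nested, prefix=()):
    """Recursively turn a nested dict into a {tuple_key: value} dict."""
    flat = {}
    for k, v in nested.items():
        if isinstance(v, dict):
            flat.update(flatten(v, prefix + (k,)))
        else:
            flat[prefix + (k,)] = v
    return flat

d = flatten(my_home_city)  # same tuple-keyed dict as above, for any nesting depth
df.loc['my_home_city', :] = d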
You can append to your dataframe like this:
my_home_city = {"near":{"bike":1, "car":0},"far":{"bike":0, "car":1}}
dataframe.append(pd.DataFrame.from_dict(my_home_city).unstack().rename('my_home_city'))
Output:
distance near far
vehicle bike car bike car
city NaN NaN NaN NaN
my_home_city 1 0 0 1
The trick is to create the dataframe row with from_dict, then unstack to get the structure of your original dataframe with multiindex columns, then rename to set the index label, and append.
Or if you don't want to create the empty dataframe first you can use this method to create the dataframe with the new data.
pd.DataFrame.from_dict(my_home_city).unstack().rename('my_home_city').to_frame().T
Output:
far near
bike car bike car
my_home_city 0 1 1 0
Explained:
pd.DataFrame.from_dict(my_home_city)
far near
bike 0 1
car 1 0
Now, let's unstack to create the multiindex and get that new data into the structure of the original dataframe.
pd.DataFrame.from_dict(my_home_city).unstack()
far bike 0
car 1
near bike 1
car 0
dtype: int64
We use rename to give that series a name which becomes the index label of that dataframe row when appended to the original dataframe.
far bike 0
car 1
near bike 1
car 0
Name: my_home_city, dtype: int64
Now, if you converted that series to a frame and transposed it, it would look very much like a new row. However, there is no need to do this: pandas does intrinsic data alignment, so appending this series to the dataframe will auto-align and add the new record.
dataframe.append(pd.DataFrame.from_dict(my_home_city).unstack().rename('my_home_city'))
distance near far
vehicle bike car bike car
city NaN NaN NaN NaN
my_home_city 1 0 0 1
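One version note from me: DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so on current pandas the same idea is spelled with pd.concat:

# pandas >= 2.0: append() is gone; build a one-row frame and concatenate instead
row = pd.DataFrame.from_dict(my_home_city).unstack().rename('my_home_city').to_frame().T
dataframe = pd.concat([dataframe, row])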
I don't think you even need to initialise an empty dataframe. With your original my_home_city dict, I can get your desired output with unstack and a transpose:
pd.DataFrame(my_home_city).unstack().to_frame().T
far near
bike car bike car
0 0 1 1 0
Initialize your empty dataframe using MultiIndex.from_product.
distances = ['near', 'far']
vehicles = ['bike', 'car']
df = pd.DataFrame([], columns=pd.MultiIndex.from_product([distances, vehicles]),
                  index=pd.Index([], name='city'))
Your dictionary results in a square matrix (distance by vehicle), so unstack it (which produces a Series), then convert it into a dataframe row by calling to_frame with the relevant city name and transposing the column into a row.
>>> df.append(pd.DataFrame(my_home_city).unstack().to_frame('my_home_city').T)
far near
bike car bike car
city
my_home_city 0 1 1 0
Just to add to all of the answers, here is another (maybe not too different) simple example, presented in a more reproducible way:
import itertools as it
from IPython.display import display  # this is just for displaying output
import numpy as np
import pandas as pd

col_1, col_2 = ['A', 'B'], ['C', 'D']
arr_size = len(col_2)
col = pd.MultiIndex.from_product([col_1, col_2])
tmp_df = pd.DataFrame(columns=col)
display(tmp_df)

for s in range(3):  # no of rows to add to tmp_df
    tmp_dict = {x: [np.random.random_sample(1)[0] for i in range(arr_size)]
                for x in range(arr_size)}
    tmp_ser = pd.Series(it.chain.from_iterable([tmp_dict[x] for x in tmp_dict]),
                        index=col)
    # display(tmp_dict, tmp_ser)
    tmp_df = tmp_df.append(tmp_ser[tmp_df.columns], ignore_index=True)

display(tmp_df)
Some things to note about the above:
The number of items to add should always match len(col_1) * len(col_2), i.e. the product of the lengths of the lists your multi-index is made from.
list(it.chain.from_iterable([[2, 3], [4, 5]])) simply flattens to [2, 3, 4, 5].
Try this workaround:
append to a dict,
then convert it to a pandas dataframe,
and at the very last step select the desired columns to create the multi-index with set_index().
d = dict()
for g in predictor_types:
    for col in predictor_types[g]:
        # number of missing values in this column
        tot = len(ames) - ames[col].count()
        if tot:
            d.setdefault('type', []).append(g)
            d.setdefault('predictor', []).append(col)
            d.setdefault('missing', []).append(tot)

pd.DataFrame(d).set_index(['type', 'predictor']).style.bar(color='DodgerBlue')
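The snippet above assumes the answerer's own objects: predictor_types (a dict mapping a group name to a list of column names) and ames (the dataframe being audited for missing values). Hypothetical stand-ins, just to make it runnable:

import pandas as pd

# hypothetical stand-ins for the answerer's predictor_types and ames
predictor_types = {'numeric': ['LotArea', 'GarageArea'], 'categorical': ['Alley']}
ames = pd.DataFrame({'LotArea': [8450, None, 11250],
                     'GarageArea': [548, 460, 608],
                     'Alley': [None, None, 'Pave']})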