Identify ID columns in a data frame - python

Is there any way to automatically identify columns such as Account_Number, Employee_ID, Transaction_ID, etc. in a data frame? These kinds of columns are usually not included in model building.
Note that there might be more than one record for the same employee across different dates. In short, how do I identify useless columns when they are not unique?

There are several ways to recognize the least important columns/classes/features in a dataset. Correlation is one of them. Follow the example below by first downloading this movies dataset (tmdb_5000_movies.csv) from Kaggle.
import pandas as pd

df = pd.read_csv("tmdb_5000_movies.csv")
df = df[["id", "budget", "popularity", "vote_average"]]
df.head()
This is how the dataframe looks:
id budget popularity vote_average
0 19995 237000000 150.437577 7.2
1 285 300000000 139.082615 6.9
2 206647 245000000 107.376788 6.3
3 49026 250000000 112.312950 7.6
4 49529 260000000 43.926995 6.1
We are looking for an automatic way of detecting that "id" is a useless column.
Let's find the correlation between each column and the other:
corr_df = pd.DataFrame(columns=list(df.columns))
for col_from in df.columns:
    for col_to in df.columns:
        corr_df.loc[col_from, col_to] = df[col_from].corr(df[col_to])
print(corr_df.head())
Correlation is simply a measure between -1 and 1. Numbers close to zero indicate that the two columns are uncorrelated; the further you go from zero (even in the negative direction), the more strongly the two parameters are coupled in some sense.
Observe how id has a very small correlation with budget and popularity:
id budget popularity vote_average
id 1 -0.0893767 0.031202 -0.270595
budget -0.0893767 1 0.505414 0.0931457
popularity 0.031202 0.505414 1 0.273952
vote_average -0.270595 0.0931457 0.273952 1
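As a side note, pandas can produce this matrix directly with DataFrame.corr(), which computes the pairwise Pearson correlation of the numeric columns; the explicit loop above just makes the computation visible:
corr_df = df.corr()  # equivalent matrix in one call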
Let's go a step further: take the absolute values and sum the correlations for each column. The column with the lowest total correlation score is considered the least useful:
corr_df = corr_df.abs()
corr_df["sum"] = corr_df.sum(axis=0) - 1
print(corr_df.head())
Result:
id budget popularity vote_average sum
id 1 0.0893767 0.031202 0.270595 0.391173
budget 0.0893767 1 0.505414 0.0931457 0.687936
popularity 0.031202 0.505414 1 0.273952 0.810568
vote_average 0.270595 0.0931457 0.273952 1 0.637692
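To make the selection automatic, you can simply take the column with the smallest total score; idxmin does that (a small addition for illustration):
least_useful = corr_df["sum"].idxmin()
print(least_useful)  # 'id' for this dataset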
Note that there are many issues with this method. For example, if ids increase from 0 to N and some other value also increases across the rows at a constant rate, their correlation will be high; moreover, some column X might yield a smaller correlation with column Y than the correlation between Y and id. Nevertheless, the absolute-sum result is good enough in most cases.

Related

How to group and sort columns by occurrence of a Letter in terms of %s and also find the mean of another column in a pandas dataframe?

I have the following pandas dataframe:
Date        PtsMoved  Profile  Type                  EventName                     Indy
2010-11-19  16.250    A        High Impact Expected  Fed Chairman Bernanke Speaks  29.2500
2010-11-23  43.000    NaN      High Impact Expected  Prelim GDP q/q                29.5500
2010-11-24  50.500    D        High Impact Expected  New Home Sales                35.2000
2010-11-30  34.870    NaN      High Impact Expected  CB Consumer Confidence        31.5240
2010-12-01  54.500    B        High Impact Expected  ISM Manufacturing PMI         32.3740
Some rows have NaN values in the Profile column, while others contain a single letter value, and others contain multiple letter values ranging from A to D. Each event under EventName has multiple occurrences throughout the dataset.
What I'm trying to do is group the table by EventName and show, for each event, the occurrences of the A, B, C or D Profile values with the percentage of times each one is present, together with the average (mean) Indy value per EventName.
EventName                     Indy   Profile
Fed Chairman Bernanke Speaks  XX.XX  A(34%), B(45%)
Prelim GDP q/q                XX.XX  D(14%)
I have tried groupby but am unsure how to aggregate multiple columns with it.
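A minimal sketch of one way to get that shape, assuming df is the dataframe shown above (the profile_pcts helper is hypothetical and the percentage logic is an assumption about the intended output; cells holding several letters would first need to be split, e.g. with str.split plus explode):
import pandas as pd

def profile_pcts(profiles):
    # share of each letter among all rows of the event (NaN rows stay in the denominator)
    counts = profiles.value_counts()
    pcts = (counts / len(profiles) * 100).round().astype(int)
    return ", ".join(f"{letter}({pct}%)" for letter, pct in pcts.items())

out = (df.groupby("EventName")
         .agg(Indy=("Indy", "mean"), Profile=("Profile", profile_pcts))
         .reset_index())
print(out)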

Change a percentage of dataframe column values according to the value of another column

I have a dataframe similar to the one below, in which activities assume binary values representing whether they require a doctor:
import pandas as pd

d = {'activity': ['Check-up', 'Assessment', 'Medication', 'Medication', 'Medication'], 'doctor_requirement': [1, 0, 0, 0, 0]}
df = pd.DataFrame(data=d)
df
activity doctor_requirement
0 Check-up 1
1 Assessment 0
2 Medication 0
3 Medication 0
4 Medication 0
I would like to consider that a percentage of 'Medication' activities require a doctor. That is, to assign binary 1 to doctor_requirement for a percentage of 'Medication' visits. For instance, such that 50% of the activity 'Medication' requires a doctor (i.e. doctor_requirement = 1).
I would greatly appreciate your help, I've been looking online and can't seem to find how to apply such a condition.
Thanks in advance!
If you'd like each 'Medication' row to have a 50% chance, then you can use:
import numpy as np

df.loc[df['activity'] == 'Medication', 'doctor_requirement'] = np.random.choice([0, 1], (df['activity'] == 'Medication').sum())
If you wish to control the probabilities of 0 and 1, you can use np.random.choice's p parameter to specify the odds:
df.loc[df['activity'] == 'Medication', 'doctor_requirement'] = np.random.choice([0, 1], (df['activity'] == 'Medication').sum(), p=[0.99, 0.01])
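Note that a per-row coin flip will not give exactly 50% of the 'Medication' rows a 1. If you need an exact share, one possible alternative (not from the original answer) is to sample the matching rows and flag only those:
medication_rows = df[df['activity'] == 'Medication']
chosen = medication_rows.sample(frac=0.5, random_state=0).index  # exactly half of the Medication rows (rounded)
df.loc[chosen, 'doctor_requirement'] = 1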

Groupby values, perform calculations and apply to repeating rows

I have the following df:
wallet position position_rewards position_type token_sales
0 0x123 SUSHI_LP 250 Sushi_LP 500
0 0x123 ALCX 750 LP Token 500
1 0xabc GAMMA 333.33 LP Token 750
1 0xabc FXS 666.66 LD 750
Note that the sum of the values in position_rewards for each wallet is the TOTAL for that wallet, and the token_sales column may show a lower amount that was sold out of that total. You can see that in the example: wallet 0x123 received 1000 rewards in total but sold only 500.
I want to create the following columns, which are calculations based on the already existing columns. Logic below too:
Column 1: df['position_rewards_pct']
This column is supposed to hold the corresponding percentage of the rewards per position over the total rewards per wallet.
My code:
df['position_rewards_pct'] = (df['position_rewards'] / sum(df['position_rewards'].apply(Decimal))) * 100
Problem: Outputting NaNs
Column 2: df['token_sales_per_type'] This column is supposed to show how many tokens have been sold (the token_sales column) for a given position_type.
Please note that for each value in the existing token_sales column, each wallet has only that value. That is, you will never have a different value in token_sales for a single wallet.
In the end, this column should show, repeatedly for every row in position_type, the amount of tokens sold for that specific type. So as rows in position_type repeat, so will the rows in df['token_sales_per_type'].
Note that all values are in Decimal object form.
Essentially, the structure of the final df should logically be the following:
In trying to formulate a response, I'm finding that the text of your question doesn't quite match up with the data and visualizations you've provided. Perhaps that DataFrame you're showing is a result of a preliminary grouping operation, rather than the underlying data?
In any event, your question is in the general category of split/apply/combine, for which Pandas has many tools, some of which may seem a bit tricky to grasp.
Usually when you want to perform a grouping, and then apply some operation back to the dataset on the basis of what you found in the grouping, you use .groupby() followed by .transform().
.transform() has the wonderful ability to take the result of an aggregation function, and apply it back to every member of the group. The classic example is subtracting the mean() for a group from every value within that group.
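For instance, a tiny self-contained illustration of that demeaning pattern (not using the question's data):
import pandas as pd

demo = pd.DataFrame({"group": ["a", "a", "b"], "value": [1.0, 3.0, 10.0]})
# transform("mean") broadcasts each group's mean back to every row of that group
demo["value_demeaned"] = demo["value"] - demo.groupby("group")["value"].transform("mean")
# group "a" rows become -1.0 and 1.0; the single "b" row becomes 0.0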
Examples using the dataframe you provided:
Group by the wallet and position, sum the rewards
someDF.groupby(by=["wallet", "position"]).transform(sum)["position_rewards"]
0 250.0
1 750.0
2 333.33
3 666.66
percentage of total (this one doesn't quite make sense in the context of the df you provided, since position column is all unique)
someDF["position_rewards"] / someDF.groupby(by=["wallet", "position"]).transform(sum)["position_rewards"]
0 1.0
1 1.0
2 1.0
3 1.0
Apply the sum of token_sales to each position type
someDF.groupby(by=["position_type"]).transform(sum)["token_sales"]
0 500
1 1250
2 1250
3 750
One final comment on decimal and percentage formatting - best to leave that for display, rather than modifying the data. You can do that with the pandas styler.
For the first question you need a groupby.transform('sum') and rdiv:
df['position_rewards_pct'] = (df.groupby('wallet')['position_rewards']
                                .transform('sum').rdiv(df['position_rewards'])
                                .mul(100).round(2))
output:
wallet position position_rewards position_type token_sales position_rewards_pct
0 0x123 SUSHI_LP 250.00 Sushi_LP 500 25.00
0 0x123 ALCX 750.00 LP Token 500 75.00
1 0xabc GAMMA 333.33 LP Token 750 33.33
1 0xabc FXS 666.66 LD 750 66.67
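The second column (token_sales_per_type) was not addressed here; a sketch following the same transform pattern shown earlier, assuming the goal is the per-position_type total repeated on every matching row:
df['token_sales_per_type'] = df.groupby('position_type')['token_sales'].transform('sum')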

How to detect outliers in a timeseries dataframe and write the "clean" ones in a new dataframe

I'm really new to Python (and programming in general, hihi) and I'm analyzing 2 years of meteorological data measured every 10 s. In total I have 12 meteorological parameters, and I've created my dataframe df with the time as the row index and the names of the meteorological parameters as the column names. Since I don't need super fine granularity, I've resampled the data to hourly values, so the dataframe looks something like this:
Time G_DIFF G_HOR G_INCL RAIN RH T_a V_a V_a_dir
2016-05-01 02:00:00 0.0 0.011111 0.000000 0.013333 100.0 9.128167 1.038944 175.378056
2016-05-01 03:00:00 0.0 0.200000 0.016667 0.020000 100.0 8.745833 1.636944 218.617500
2016-05-01 04:00:00 0.0 0.105556 0.013889 0.010000 100.0 8.295333 0.931000 232.873333
There are outliers, and I can get rid of them with a rolling standard deviation and mean, which is what I've done "by hand" with the following code for one of the columns (the ambient temperature), where the algorithm writes the clean data into another dataframe (tr in the example below).
roll = df["T_a"].rolling(24,center = True) #24h window
mean, std = roll.mean(), roll.std()
cut = std*3
low, up = mean - cut, mean+cut
tr.loc[(df["T_a"] < low) | (df["T_a"] > up) | (df["T_a"].isna()), "outliers"] = df["T_a"]
tr.loc[(df["T_a"] >= low) & (df["T_a"] <= up), "T_a"] = df["T_a"]
tr.loc[tr["T_a"].isna(),"T_a"] = tr["T_a"].bfill() #to input a value when a datum is NaN
Now, as I said, that works okay for one column, BUT I would like to be able to do it for all 12 columns, and also I'm almost sure there's a more Pythonic way to do it. I guess a for loop should be feasible, but nothing I've tried so far is working.
Could anyone give me some light, please? Thank you so much!!
all_columns = list(df.columns)  # list of all column names
all_columns.remove('G_DIFF')    # drop a column you don't want to process (list.remove works in place)

for column in all_columns:
    roll = df[column].rolling(24, center=True)  # 24h window
    mean, std = roll.mean(), roll.std()
    cut = std * 3
    low, up = mean - cut, mean + cut
    tr.loc[(df[column] < low) | (df[column] > up) | (df[column].isna()), "outliers"] = df[column]
    tr.loc[(df[column] >= low) & (df[column] <= up), column] = df[column]
    tr.loc[tr[column].isna(), column] = tr[column].bfill()  # to input a value when a datum is NaN
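Since the same rule is applied to every column, a loop isn't strictly necessary: rolling, the comparisons and where all broadcast over a whole dataframe. A vectorized sketch, assuming df holds only the numeric parameter columns you want to clean:
roll = df.rolling(24, center=True)        # 24 h window, computed column-wise
mean, std = roll.mean(), roll.std()
low, up = mean - 3 * std, mean + 3 * std

tr = df.where((df >= low) & (df <= up))   # out-of-band values and NaNs become NaN
tr = tr.bfill()                           # back-fill the gaps, as in the loop above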
There are two ways to remove outliers from time series data: one is calculating percentiles / mean / standard deviation, which I think you are already using; the other is looking at graphs, because sometimes the data spread gives more information visually.
I have worked on data for yellow taxi prediction in a certain area, so basically I have a model which can predict in which region of NYC a taxi can get more customers.
For that I had time series data with 10-second intervals and various features like trip distance, speed, working hours, and one of them was "total fare". I also wanted to remove the outliers from each column, so I started using the mean and percentiles to do so.
The thing with total fares was that the mean and percentiles were not giving an accurate threshold.
These were my percentile values:
0th percentile: -242.55
10th percentile: 6.3
20th percentile: 7.8
30th percentile: 8.8
40th percentile: 9.8
50th percentile: 11.16
60th percentile: 12.8
70th percentile: 14.8
80th percentile: 18.3
90th percentile: 25.8
100th percentile: 3950611.6
As you can see, a fare of 100 would be an OK fare but would be treated as an outlier by these thresholds, so I basically turned to visualization.
I sorted my fare values and plotted them; towards the end of the curve there is a slight steep rise, so I magnified that region, and then magnified it further between the 50th and the second-to-last percentile, and voilà, I got my threshold, i.e. 1000.
This method is actually called the "elbow method". What you are doing is the first step, and if you are not happy with the result, this can be the second step to find those thresholds.
I suggest you go column by column and use any of these techniques, because going column by column you know how much data you are losing, and losing data is losing information.
Personally, I follow visualization; in the end it really depends on the data.

Satisfying Cross tab constraints in Python by filling in Random Numbers

I have a problem: I need to modify a dataframe (actual data) to satisfy cross-tab constraints and generate a new dataframe, as described below.
In cross-tab 1 (see the code below), we have 2 tasks for John in Area A, 1 task for John in Area B, and so on. However, my desired distribution is as shown in cross-tab 2, i.e. John has 1 task in Area A, 4 tasks in Area B, etc. Thus I need to modify the original data depicted by cross-tab 1 to satisfy the row and column total constraints required in cross-tab 2, while the grand total should remain 18, as in both cross-tabs. The number filling may be random.
Another constraint is the average time, which should be, for example, 11 minutes for John (average of 3 tasks), 7 minutes for William and 5 minutes for Richard (3 tasks).
Thus, the task is to modify the original dataframe so that it satisfies the row and column totals as in cross-tab 2 and the average-time requirement. The final dataframe will have three columns (Person, Area of Work and Time) and will generate a cross-tab similar to cross-tab 2, while filling in numbers randomly.
Actual Data:
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [['John', 'A', 2, 8], ['John', 'B', 1, 9], ['John', 'C', 0, 12],
     ['William', 'A', 1, 14], ['William', 'B', 2, 10], ['William', 'C', 2, 9],
     ['Richard', 'A', 3, 8], ['Richard', 'B', 4, 7], ['Richard', 'C', 3, 5]],
    columns=['Person', 'AreaOfWork', 'Task', 'Time'])
1.1 Actual Cross-Tab:
pd.crosstab(df.AreaOfWork, df.Person, values=df.Task, aggfunc=np.sum, margins=True)
Required dataframe:
df1 = pd.DataFrame(
    [['John', 'A', 1, 10], ['John', 'B', 4, 11], ['John', 'C', 3, 12],
     ['William', 'A', 0, 9], ['William', 'B', 1, 7], ['William', 'C', 3, 5],
     ['Richard', 'A', 2, 5], ['Richard', 'B', 1, 3], ['Richard', 'C', 3, 8]],
    columns=['Person', 'AreaOfWork', 'Task', 'Time'])
2.1 Required crosstab
pd.crosstab(df1.AreaOfWork, df1.Person, values=df1.Task, aggfunc=np.sum, margins=True)
