Satisfying Cross tab constraints in Python by filling in Random Numbers - python

I need to modify a dataframe (the actual data) so that it satisfies cross-tab constraints, generating a new dataframe as described below.
In cross-tab 1 (attached picture and code), there are 2 tasks for John in Area A, 1 task for John in Area B, and so on. However, my desired distribution is the one shown in cross-tab 2, i.e. John has 1 task in Area A, 4 tasks in Area B, etc. So I need to modify the original data (depicted by cross-tab 1) to satisfy the row and column total constraints required in cross-tab 2, while the grand total stays 18, as in both cross-tabs. The cell values may be filled in randomly.
Another constraint is the average time, which should be, for example, 11 minutes for John (average of 3 tasks), 7 minutes for William and 5 minutes for Richard (3 tasks).
So the task is to modify the original dataframe so that it satisfies the row and column totals of cross-tab 2 and the average-time requirement. The final dataframe will have three columns, Person, Area of Work and Time, and will generate a crosstab similar to cross-tab 2, with the numbers filled in randomly.
Cross-tab 2 (required)
Cross-tab 1 (actual data)
Actual Data:
import numpy as np
import pandas as pd

df = pd.DataFrame([['John','A',2,8], ['John','B',1,9], ['John','C',0,12],
                   ['William','A',1,14], ['William','B',2,10], ['William','C',2,9],
                   ['Richard','A',3,8], ['Richard','B',4,7], ['Richard','C',3,5]],
                  columns=['Person', 'AreaOfWork', 'Task', 'Time'])
1.1 Actual Cross-Tab:
pd.crosstab(df.AreaOfWork, df.Person, values=df.Task, aggfunc=np.sum, margins=True)
Required-Dataframe
df1 = pd.DataFrame([['John','A',1,10], ['John','B',4,11], ['John','C',3,12],
                    ['William','A',0,9], ['William','B',1,7], ['William','C',3,5],
                    ['Richard','A',2,5], ['Richard','B',1,3], ['Richard','C',3,8]],
                   columns=['Person', 'AreaOfWork', 'Task', 'Time'])
2.1 Required crosstab
pd.crosstab(df1.AreaOfWork, df1.Person, values=df1.Task, aggfunc=np.sum, margins=True)
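For reference, here is a minimal sketch of one way to do this (not from the original post): pair shuffled labels to obtain a random contingency table with the required margins, then rescale random times so each person's mean Time equals the target. The margins and averages are read off cross-tab 2 and the question text; the uniform time range and the variable names are assumptions.

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Targets taken from cross-tab 2 and the question text
person_totals = {'John': 8, 'William': 4, 'Richard': 6}   # column totals (tasks per person)
area_totals = {'A': 3, 'B': 6, 'C': 9}                    # row totals (tasks per area)
avg_time = {'John': 11, 'William': 7, 'Richard': 5}       # required mean Time per person

# One label per task; shuffling one list and pairing it with the other yields a random
# contingency table whose row and column sums match the required margins.
persons = np.repeat(list(person_totals), list(person_totals.values()))
areas = np.repeat(list(area_totals), list(area_totals.values()))
rng.shuffle(areas)

counts = (pd.crosstab(pd.Series(persons, name='Person'), pd.Series(areas, name='AreaOfWork'))
            .stack().rename('Task').reset_index())

# Random times, rescaled so that the mean Time per person equals the required average
counts['Time'] = rng.uniform(1, 20, len(counts))
counts['Time'] *= counts['Person'].map(avg_time) / counts.groupby('Person')['Time'].transform('mean')

pd.crosstab(counts.AreaOfWork, counts.Person, values=counts.Task, aggfunc='sum', margins=True)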

Related

Groupby values, perform calculations and apply to repeating rows

I have the following df:
wallet position position_rewards position_type token_sales
0 0x123 SUSHI_LP 250 Sushi_LP 500
0 0x123 ALCX 750 LP Token 500
1 0xabc GAMMA 333.33 LP Token 750
1 0xabc FXS 666.66 LD 750
Note that the sum of the values in position_rewards for each wallet is the TOTAL for that wallet, and the token_sales column may show a lower amount that was actually sold out of that total. You can see that here: wallet 0x123 received 1000 rewards in total, but sold only 500.
I want to create the following columns, which are calculations based on the already existing columns. Logic below too:
Column 1: df['position_rewards_pct']
This column is supposed to hold the corresponding % of the rewards per position over the total rewards per wallet.
My code:
df['position_rewards_pct'] = (df['position_rewards'] / sum(df['position_rewards'].apply(Decimal))) * 100
Problem: Outputting NaNs
Column 2: df['token_sales_per_type'] This column is supposed to show how many tokens have been sold (token_sales column) for a given position_type.
Please note that for each value in the existing token_sales column, each wallet has only that value. That is, you will never have a different value in token_sales for a single wallet.
In the end, this column should show, repeatedly for every row of position_type, the amount of tokens sold for that specific type. So as rows in position_type repeat, so will the values in df['token_sales_per_type'].
Note that all values are in Decimal object form.
Essentially, the structure of the final df should logically be the following:
In trying to formulate a response, I'm finding that the text of your question doesn't quite match up with the data and visualizations you've provided. Perhaps that DataFrame you're showing is a result of a preliminary grouping operation, rather than the underlying data?
In any event, your question is in the general category of split/apply/combine, for which Pandas has many tools, some of which may seem a bit tricky to grasp.
Usually when you want to perform a grouping, and then apply some operation back to the dataset on the basis of what you found in the grouping, you use .groupby() followed by .transform().
.transform() has the wonderful ability to take the result of an aggregation function, and apply it back to every member of the group. The classic example is subtracting the mean() for a group from every value within that group.
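For instance, the mean-centering pattern looks like this (a tiny illustration with made-up data, not part of your dataset):

import pandas as pd

g = pd.DataFrame({"group": ["a", "a", "b", "b"], "value": [1.0, 3.0, 10.0, 20.0]})
# Subtract each group's mean from every row of that group
g["value_centered"] = g["value"] - g.groupby("group")["value"].transform("mean")
# group a rows become -1.0 and 1.0; group b rows become -5.0 and 5.0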
Examples using the dataframe you provided:
Group by the wallet and position, sum the rewards
someDF.groupby(by=["wallet", "position"]).transform(sum)["position_rewards"]
0 250.0
1 750.0
2 333.33
3 666.66
percentage of total (this one doesn't quite make sense in the context of the df you provided, since position column is all unique)
someDF["position_rewards"] / someDF.groupby(by=["wallet", "position"]).transform(sum)["position_rewards"]
0 1.0
1 1.0
2 1.0
3 1.0
Apply the sum of token_sales to each position type
someDF.groupby(by=["position_type"]).transform(sum)["token_sales"]
0 500
1 1250
2 1250
3 750
One final comment on decimal and percentage formatting - best to leave that for display, rather than modifying the data. You can do that with the pandas styler.
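For example (a hypothetical snippet, not part of the data pipeline above), the percentage column could be formatted at display time only:

# Format for display only; the underlying values stay numeric
df.style.format({'position_rewards_pct': '{:.2f}%'})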
For the first question you need a groupby.transform('sum') and rdiv:
df['position_rewards_pct'] = (df.groupby('wallet')['position_rewards']
                                .transform('sum').rdiv(df['position_rewards'])
                                .mul(100).round(2)
                              )
output:
wallet position position_rewards position_type token_sales position_rewards_pct
0 0x123 SUSHI_LP 250.00 Sushi_LP 500 25.00
0 0x123 ALCX 750.00 LP Token 500 75.00
1 0xabc GAMMA 333.33 LP Token 750 33.33
1 0xabc FXS 666.66 LD 750 66.67

Assigning different values to different rows based on the number of existing columns

I have a pandas data frame as below:
The demand_served column is the column I want to edit. For now, I just assign the same value to all rows. My first objective is to assign the average value to the different facilities (the first column). For example, if facility A covers 10 GEOIDs, all 10 of those rows should have 1836.988/10 = 183.6988; if facility B covers 20 GEOIDs, all 20 of those rows share 1836.988/20 = 91.8494.
Moreover, if the same GEOID is covered by 2 facilities, for example facilities A and B above, it should have demand_served 183.6988 + 91.8494 = 275.5482.
I am pretty stuck on this problem and cannot come up with any useful thoughts. Any ideas?
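Here is a minimal sketch of the logic described above, assuming hypothetical columns named facility, GEOID and demand (the facility's total demand repeated on each of its rows):

import pandas as pd

# Hypothetical data: facility A covers 3 GEOIDs, facility B covers 2, and GEOID 2 is covered by both
df = pd.DataFrame({
    'facility': ['A', 'A', 'A', 'B', 'B'],
    'GEOID':    [1, 2, 3, 2, 4],
    'demand':   [1836.988] * 5,
})

# Each facility's demand split evenly over the GEOIDs it covers
share = df['demand'] / df.groupby('facility')['GEOID'].transform('count')

# GEOIDs covered by several facilities get the sum of those shares
df['demand_served'] = share.groupby(df['GEOID']).transform('sum')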

Performing a calculation on a DataFrame based on condition in another DataFrame

I am working with COVID data and am trying to control for population and show incidence per 100,000.
I have one DataFrame with population:
Country    Population
China      1389102
Israel     830982
Iran       868912
I have a second DataFrame showing COVID data:
Date         Country    Confirmed
01/Jan/2020  China      8
01/Jan/2020  Israel     3
01/Jan/2020  Iran       2
02/Jan/2020  China      15
02/Jan/2020  Israel     5
02/Jan/2020  Iran       5
I wish to perform a calculation on my COVID DataFrame using info from the population DataFrame, that is, to normalise cases per 100,000 for each datapoint via:
(Chinese Datapoint/Chinese Population) * 100,000
Likewise for my other countries.
I am stumped on this one and not too sure how to achieve my result, whether via grouping data, zipping data, etc.
Any help welcome.
Edit: I should have added that confirmed cases are cumulative as each day goes on. So, for example, for Jan 1st for China I wish to compute (8 / China population) * 100000, and likewise for Jan 2nd, Jan 3rd, Jan 4th..., and again likewise for each country. Essentially, this is performing a calculation on the entire DataFrame based on data in another DataFrame.
You could merge the two dataframes and perform the operation:
# Define the norm operation (cases per 100,000 inhabitants)
def norm_cases(cases, population):
    return (cases / population) * 100000

# If the column name for country is the same in both dataframes:
covid_df = covid_df.merge(population_df, on='country_column', how='left')

# Or, if the column names differ:
covid_df = covid_df.merge(population_df, left_on='covid_country_column',
                          right_on='population_country_column', how='left')

covid_df['norm_cases'] = covid_df.apply(
    lambda x: norm_cases(x['cases_column'], x['population_column']), axis=1)
Assuming that your dataframes are called df1 and df2, and that by "Datapoint" you mean the column Confirmed:
normed_cases = (
    df2.reset_index().groupby(['Country', 'Date']).sum()['Confirmed']
    / df1.set_index('Country')['Population'] * 100000
)
reset the index of df2 to make the date a column (only applicable if Date was the index before)
group by country and date and sum the groups to get the total cases per country and date
set Country as the index of the first df, df1, to allow country-indexed division
divide by the population and multiply by 100,000
I took an approach combining many of your suggestions. Step one, I merged my two dataframes. Step two, I divided my confirmed column by the population. Step three, I multiplied the same column by 100,000. There probably is a more elegant approach but this works.
covid_df = covid_df.merge(population_df, on='Country', how='left')
covid_df["Confirmed"] = covid_df["Confirmed"].divide(covid_df["Population"], axis="index")
covid_df["Confirmed"] = covid_df["Confirmed"] *100000
Suppose the DataFrame with population is df_pop, and the COVID data is df_data.
# Set Country as the index of df_pop
df_pop = df_pop.set_index(['Country'])

# Normalisation factor
norm = 100000

# Calculate normalised cases
df_data['norm_cases'] = [(conf / df_pop.loc[country].Population) * norm
                         for (conf, country) in zip(df_data.Confirmed, df_data.Country)]
You can use df1.set_index('Country').join(df2.set_index('Country')) here; then it will be easy for you to perform these operations.
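A rough sketch of that join-based suggestion, assuming the column names from the tables above (the per_100k name is made up here):

# Join the COVID rows onto the population table by country, then normalise per 100,000
joined = df1.set_index('Country').join(df2.set_index('Country'))
joined['per_100k'] = joined['Confirmed'] / joined['Population'] * 100000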

pandas data frame, apply t-test on rows simultaneously grouping by column names (have duplicates!)

I have a data frame with particular readouts as the index (different types of measurements for a given sample); each column is a sample for which these readouts were taken. I also have a treatment group assigned as the column name of each sample. You can see the example below.
What I need to do: for a given readout (row), group the samples by treatment (column name) and perform a t-test (Welch's t-test) on each group (each treatment). The t-test must be done as a comparison with one fixed treatment (the control treatment). I do not care about keeping track of the sample ids (they were there originally; I dropped them on purpose), and I'm not going to do paired tests.
For example here, for readout1 I need to compare treatment1 vs treatment3 and treatment2 vs treatment3 (it's OK if I also get treatment3 vs treatment3).
Example of data frame:
frame = pd.DataFrame(np.arange(27).reshape((3, 9)),
                     index=['readout1', 'readout2', 'readout3'],
                     columns=['treatment1', 'treatment1', 'treatment1',
                              'treatment2', 'treatment2', 'treatment2',
                              'treatment3', 'treatment3', 'treatment3'])
frame
Out[757]:
treatment1 treatment1 ... treatment3 treatment3
readout1 0 1 ... 7 8
readout2 9 10 ... 16 17
readout3 18 19 ... 25 26
[3 rows x 9 columns]
I've been fighting with this for several days now. I tried to unstack/stack the data, transpose the data frame, then group by index, remove NaNs and apply a lambda. I tried other strategies but none worked. I will appreciate any help.
thank you!
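One possible sketch, assuming treatment3 is the control: run Welch's t-test (scipy.stats.ttest_ind with equal_var=False) for each treatment against the control, row by row. The variable names below, other than frame, are made up.

import numpy as np
import pandas as pd
from scipy import stats

frame = pd.DataFrame(np.arange(27).reshape((3, 9)),
                     index=['readout1', 'readout2', 'readout3'],
                     columns=['treatment1'] * 3 + ['treatment2'] * 3 + ['treatment3'] * 3)

control = 'treatment3'
pvals = {}
for treatment in frame.columns.unique():
    # frame[treatment] selects all replicate columns of that treatment (duplicate names included)
    _, p = stats.ttest_ind(frame[treatment], frame[control], axis=1, equal_var=False)
    pvals[treatment] = p

pvalues = pd.DataFrame(pvals, index=frame.index)
print(pvalues)  # one p-value per readout and treatment-vs-control comparison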

Identify ID columns in a data frame

Is there any way to automatically identify columns such as Account_Number, Employee_ID, Transaction_ID, etc. in a data frame? These columns are usually not included in model building.
Note that there might be more than one record for the same employee across different dates. In short, how do I identify useless columns when they are not unique?
There are several ways to recognize the least important columns/classes/features in a dataset. Correlation is one of them. Follow the example below after first downloading this movies dataset from Kaggle.
import pandas as pd

df = pd.read_csv("tmdb_5000_movies.csv")
df = df[["id", "budget", "popularity", "vote_average"]]
df.head()
This is how the dataframe looks:
id budget popularity vote_average
0 19995 237000000 150.437577 7.2
1 285 300000000 139.082615 6.9
2 206647 245000000 107.376788 6.3
3 49026 250000000 112.312950 7.6
4 49529 260000000 43.926995 6.1
We are looking for an automatic way of detecting that "id" is a useless column.
Let's find the correlation between each column and the other:
corr_df = pd.DataFrame(columns=list(df.columns))

for col_from in df.columns:
    for col_to in df.columns:
        corr_df.loc[col_from, col_to] = df[col_from].corr(df[col_to])

print(corr_df.head())
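As a side note, for an all-numeric dataframe the same Pearson matrix can be obtained in one call:

# Equivalent to the nested loop above for numeric columns
corr_df = df.corr()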
Correlation is simply a measure between -1 and 1. Values close to zero indicate that the two columns are uncorrelated; the further you move from zero (even in the negative direction), the more the two parameters are coupled in some sense.
Observe how id has a very small correlation with budget and popularity:
id budget popularity vote_average
id 1 -0.0893767 0.031202 -0.270595
budget -0.0893767 1 0.505414 0.0931457
popularity 0.031202 0.505414 1 0.273952
vote_average -0.270595 0.0931457 0.273952 1
Let's go a step further: take the absolute value and sum all the correlations. The column with the lowest total correlation score is considered the least useful:
corr_df = corr_df.abs()
corr_df["sum"] = corr_df.sum(axis=0) - 1
print(corr_df.head())
Result:
id budget popularity vote_average sum
id 1 0.0893767 0.031202 0.270595 0.391173
budget 0.0893767 1 0.505414 0.0931457 0.687936
popularity 0.031202 0.505414 1 0.273952 0.810568
vote_average 0.270595 0.0931457 0.273952 1 0.637692
Note that there are several issues with this method. For example, if the ids increase from 0 to N and another column also increases across the rows at a constant rate, their correlation will be high; moreover, some column X might yield a smaller correlation with column Y than the correlation between Y and id. Nevertheless, the absolute-sum result is good enough in most cases.
