Aggregation based on the median of a probability distribution - python

I have a big data set (75 million rows) consisting of 12 columns. The rows are repeated, holding the same values except for the last 2 columns, which constitute the probability distribution.
As we can see in this snippet, the rows are equal in the first 10 columns, and the last 2 (value_count, value) are the probability distribution for those rows. I want to aggregate those rows into one row based on the median of the probability distribution over value_count and value.

EDIT: edited after the comment.
You can get the median easily by using summary:
df_result = your_table.select("value_count", "value").summary("50%")
The result is a dataframe with one row and 2 columns. Since it has a single row and no join key, you can attach it to your original dataframe with a cross join if you want to: your_table.select("col1", .. , "coln").distinct().crossJoin(df_result)
Alternatively, there are the functions approxQuantile and percentile_approx, which could probably do the job without using a join like the one above.
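For example, a minimal sketch of the percentile_approx route, assuming your DataFrame is your_table and the first 10 columns are named col1 .. col10 (placeholder names, substitute your real schema):

from pyspark.sql import functions as F

group_cols = ["col1", "col2", "col3"]   # list all 10 identifying columns here
df_result = (
    your_table
    .groupBy(group_cols)
    .agg(
        F.expr("percentile_approx(value_count, 0.5)").alias("value_count_median"),
        F.expr("percentile_approx(value, 0.5)").alias("value_median"),
    )
)

This computes the medians per group directly, so no join back is needed.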

Related

How to modify column names when combining rows of multiple columns into single rows in a dataframe based on a categorical value. + selective sum/mean

I'm using the pandas .groupby function to combine multiple rows into a single row.
Currently I have a dataframe df_clean which has 405 rows and 85 columns. There are up to 3 rows that correspond to a single Batch.
My current code for combining the multiple rows is:
num_V = 84  # number of value columns: 85 columns minus the 'Batch' column they are grouped by
max_row = df_clean.groupby('Batch').Batch.count().max()
df2 = (
    df_clean.groupby('Batch')
    .apply(lambda x: x.values[:, 1:].reshape(1, -1)[0])
    .apply(pd.Series)
)
This code works, creating a dataframe df2 that groups the rows by Batch; however, the columns in the resulting dataframe are simply numbered (0, 1, 2, 3, ..., 249, 250, 251). Note that 84*3 = 252, i.e. (number of columns minus the Batch column) * 3 = 252, and Batch becomes the index.
I'm cleaning some data for analysis and I want to combine the data of several (generally 1-3) Sub_Batch values on separate rows into a single row based on their Batch. Ideally I would like to be able to determine which columns are grouped into a row and remain separate columns in the row, as well as for which columns the average or total value is reported.
For example, the desired input/output:
Original dataframe
Output dataframe
Note the naming of the columns, that all columns are copied over, and that the columns are ordered according to which Sub_Batch they belong to; i.e. Weight_2 will always correspond to the second Sub_Batch that is part of that Batch, and Weight_3 to the third.
Ideal output dataframe
Note the naming of the columns, and that in this dataframe there is only a single column recording the Color, as it is identical for all Sub_Batch values within a Batch. The individual Temperature values are recorded, as well as the average of the Temperature values for the Batch. The individual Weight values are recorded, as well as the sum of the Weight values in the column Total_weight.
I am 100% okay with the Output dataframe scenario, as I can simply add the values I want afterwards using .mean and .sum. I am simply asking whether it can be done using .groupby, as it is not something I have worked with before, and I know that it does have some ability to sum or average results.
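For what it's worth, a minimal sketch of one way to name the numbered columns after the reshape, assuming the same df_clean, df2, and max_row as above and that Batch is the first column of df_clean (the reshape skips column 0):

# columns other than 'Batch', in the order they appear in df_clean
value_cols = [c for c in df_clean.columns if c != 'Batch']
# names like Color_1, Temperature_1, Weight_1, Color_2, ... up to max_row sub-batches
new_names = [f'{col}_{i}' for i in range(1, max_row + 1) for col in value_cols]
# df2 has at most len(value_cols) * max_row columns; trim the list to match
df2.columns = new_names[:df2.shape[1]]

Aggregates such as Average_Temperature or Total_weight could then be added afterwards with .mean and .sum over the relevant groups of columns.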

averaging columns on a very big matrix of data (approximately 150,000*80) in a normal runtime

I am having problems with analyzing very big matrices.
I have 4 big matrices of numbers (between 0 to 1), each one of the matrices contains over 100,000 rows, with 70-80 columns.
I need to average each column and add them to lists according to the column title.
I tried to use the built-in mean() method of pandas (the input is a pandas dataframe), but the runtime is crazy (it takes numerous hours to calculate the mean of a single column) and I cannot afford that runtime.
Any suggestions on how I can do it with a normal runtime?
I am adding my code here-
def filtering(data):
    healthy = []
    nmibc_hg = []
    nmibc_lg = []
    mibc_hg = []
    for column in data:
        if 'HEALTHY' in column:
            healthy.append(data[column].mean())
        elif 'NMIBC_HG' in column:
            nmibc_hg.append(data[column].mean())
        elif 'NMIBC_LG' in column:
            nmibc_lg.append(data[column].mean())
        elif 'MIBC_HG' in column and 'NMIBC_HG' not in column:
            mibc_hg.append(data[column].mean())
Use df.mean() and it will generate a mean for every numeric column in the dataframe df. The column names and their means go into a pandas Series. I tried it on a dataframe with 190,000 rows and 12 numeric columns and it only took 4 seconds. It should only take a few seconds on your dataset too.
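As a rough sketch of that idea, assuming data is the DataFrame from the question and the group labels appear in the column titles as in the original function:

means = data.mean()  # one pass over the columns; a Series indexed by column name
healthy  = means[means.index.str.contains('HEALTHY')].tolist()
nmibc_hg = means[means.index.str.contains('NMIBC_HG')].tolist()
nmibc_lg = means[means.index.str.contains('NMIBC_LG')].tolist()
mibc_hg  = means[means.index.str.contains('MIBC_HG')
                 & ~means.index.str.contains('NMIBC_HG')].tolist()

This avoids calling .mean() once per column inside a Python loop.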

Sort columns by row correlation

I have a pandas dataframe with two or more rows and 42 columns. By transposing and plotting it, I get the profiles of the rows.
df.T.plot()
I want to sort the columns so that the columns where the rows are strongly correlated (similar profile, values go in the same direction) come first, and the columns where the rows are weakly correlated (opposite profile, values go in opposite directions) come later.
I could run a cluster algorithm on the columns, but clusters are not exactly what I want.
I think one solution would be to sort by the distance of the points from the linear regression line?
Correlation is a measure that describes the relationship between two variables as a whole, not at specific points. The metric you've described for sorting isn't correlation, but rather the absolute difference between the column values in the two rows. (With the transpose operation the two rows become two columns, and their lines on the graph you're making will 'go in opposite directions' when the values in the two columns are further apart.)
Achieving this with the dataframe you've described would look something like:
df_T = df.T
df_T['sort_column'] = (df_T.panB - df_T.panC).abs()  # pointwise distance between the two profiles
df_T.sort_values('sort_column', inplace=True)
df_T.drop(columns='sort_column', inplace=True)  # remove the helper column before plotting
df_T.plot()

Remove duplicate columns only by their values

I just got an assignment in which I have a lot of features (as columns) and records (as rows) in a CSV file.
I am cleaning the data using Python (including pandas). For example:
A,B,C
1,1,1
0,0,0
1,0,1
I would like to delete all the duplicate columns with the same values and keep only one of them; A and B will be the only columns that remain.
I would also like to combine the columns that have a high Pearson correlation with the target value. How can I do that?
Thanks.
I would like to delete all the duplicate columns with the same values and keep only one of them. A will be the only column that remains.
You mean that's the only one of A and C that's kept, right? (B doesn't duplicate anything.)
You can use DataFrame.drop_duplicates
df = df.T.drop_duplicates().T
It works on rows, not columns, so I transpose before/after calling it.
I would like to combine the columns that have high Pearson correlation with the target value, how can i do it?
You can loop over all pairs of columns and compute their correlation with DataFrame.corr or with numpy.corrcoef.
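A minimal sketch combining both steps, assuming the target column is named target (a placeholder, substitute your actual column name) and an arbitrarily chosen threshold of 0.8:

# drop duplicate columns: drop_duplicates works on rows, so transpose first
df = df.T.drop_duplicates().T
# Pearson correlation of every remaining column with the target
correlations = df.corr()['target'].drop('target')
# columns whose absolute correlation with the target exceeds the threshold
high_corr_cols = correlations[correlations.abs() > 0.8].index.tolist()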

Select columns in a pandas DataFrame

I have a pandas dataframe with hundreds of columns of antibiotic names. Each specific antibiotic is coded in the dataframe as ending in E, T, or P to indicate empirical, treatment, or prophylactic regimens.
An example excerpt from the column list is:
['MeropenemP', 'MeropenemE', 'MeropenemT', 'DoripenemP', 'DoripenemE',
 'DoripenemT', 'ImipenemP', 'ImipenemE', 'ImipenemT', 'BiapenemP',
 'BiapenemE', 'BiapenemT', 'PanipenemP', 'PanipenemE',
 'PanipenemT', 'PipTazP', 'PipTazE', 'PipTazT', 'PiperacillinP',
 'PiperacillinE', 'PiperacillinT']
A small sample of data is located here:
Sample antibiotic data
It is simple enough for me to separate out the columns of any one type into a separate dataframe with a regex; e.g. to select all the empirically prescribed antibiotic columns I use:
E_cols = master.filter(axis=1, regex=('[a-z]+E$'))
Each column has a binary value (0,1) for prescription of each antibiotic regimen type per person (row).
Question:
How would I go about summing across the rows of all the columns (the 1's) for each regimen type and generating a new column for each result in the dataframe, e.g. total_empirical, total_prophylactic, total_treatment?
The reason I want to add to the existing dataframe is that I wish to filter on other values for each regimen type.
Once you've generated the list of columns that match your regex, you can just create the new total columns like so:
df['total_empirical'] = df[E_cols].sum(axis=1)
and repeat for the other totals.
Passing axis=1 to sum will sum row-wise.
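Or, as a sketch, the same thing for all three regimen types in one loop, assuming master and the E/T/P suffix convention from the question:

for suffix, total_name in [('E', 'total_empirical'),
                           ('T', 'total_treatment'),
                           ('P', 'total_prophylactic')]:
    cols = master.filter(axis=1, regex='[a-z]+' + suffix + '$').columns
    master[total_name] = master[cols].sum(axis=1)  # row-wise sum of the 0/1 flags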
