How to group multiple columns with NA values and discrepancies? - python

I'm looking for a way to group a data frame on multiple columns that contain missing values.
I want to put every row that shares a common value in any of the inspected columns into the same group, ignoring missing values and discrepancies in the data.
The script should be independent of the order in which missing values appear.
I managed to do this by iteration, but I would like a more efficient, vectorized way of doing it. I used R, but I would also like to do it in Python.
As an example, if a data frame is
df = data.frame("ID1"=c(NA,NA,"A","A","A","B","C","B","C"), "ID2"=c("D","E",NA,"D","E","F","F",NA,NA))
I want to obtain a final grouping vector as
c(1,1,1,1,1,2,2,2,2)
where 1 and 2 can be any numbers; they only need to be shared by rows that have a common value in any column.
I hope that's understandable?
The easiest way I found was by using a double iteration:
df$GrpF = 1:dim(df)[1]
for (i in 1:dim(df)[1]){
  for (ID in c("ID1","ID2")){
    if (!is.na(df[i,ID])){
      df$GrpF[df[ID]==df[i,ID]] = min(df$GrpF[df[ID]==df[i,ID]], na.rm = T)
    }
  }
}
Here df$GrpF is my final grouping vector. It works well and I don't get any duplicates when I summarise the information:
library(dplyr)
library(plyr)
dfG = df %>% group_by_("GrpF") %>% summarise_all(
  function(x){
    x1 = unique(x)
    paste0(x1[!is.na(x1) & x1 != ""], collapse = "/")
  }
)
But when I use my real data (60,000 rows and 4 columns), it takes a lot of time (5 minutes).
I tried using a single iteration over the columns with the dplyr and plyr libraries:
grpData = function(df, colGrp, colData, colReplBy = NA){
  a = df %>% group_by_at(colGrp) %>%
    summarise_at(colData, function(x) { sort(x, na.last = T)[1] }) %>%
    filter_at(colGrp, all_vars(!is.na(.)))
  b = plyr::mapvalues(df[[colGrp]], from = a[[colGrp]], to = a[[colData]])
  if (is.na(colReplBy)) {
    b[which(is.na(b))] = NA
  } else if (colReplBy %in% colnames(df)) {
    b[which(is.na(b))] = df[[colReplBy]][which(is.na(b))] # Set old value for missing values
  } else {
    stop("Col to use as replacement for missing values not present in dataframe")
  }
  return(b)
}
df$GrpF = 1:dim(df)[1]
for (ID in c("ID1","ID2")){
  # Give all rows with the same old group the same ID
  df$IDN = grpData(df, "GrpF", ID)
  # Give all rows with the same new ID the same old group
  df$GrpN = grpData(df, "IDN", "GrpF")
  # Give all rows with the same ID the same new group
  df$GrpN = grpData(df, ID, "GrpN")
  # Give all rows with the same old group the same new group
  df$GrpF = grpData(df, "GrpF", "GrpN", colReplBy = "GrpF")
}
This does work (it takes 30 seconds on the real data), but I would like an even more efficient way of doing it.
Do you have any ideas?
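Since the question also asks about Python: below is a minimal sketch of the same transitive grouping in pandas, treating it as a connected-components problem over row indices and (column, value) pairs. It assumes networkx is an acceptable dependency and is not a drop-in for the R code above.
import pandas as pd
import networkx as nx  # assumption: networkx is acceptable as a dependency

df = pd.DataFrame({
    "ID1": [None, None, "A", "A", "A", "B", "C", "B", "C"],
    "ID2": ["D", "E", None, "D", "E", "F", "F", None, None],
})

# Link each row index to a (column, value) node for every non-missing cell;
# rows sharing a value in the same column end up in one connected component.
G = nx.Graph()
for col in ["ID1", "ID2"]:
    non_na = df[col].dropna()
    G.add_edges_from((idx, (col, val)) for idx, val in non_na.items())

# Number the components and keep only the row-index nodes.
group = {}
for k, comp in enumerate(nx.connected_components(G), start=1):
    for node in comp:
        if not isinstance(node, tuple):  # row indices are ints, value nodes are tuples
            group[node] = k

df["GrpF"] = df.index.map(group)
print(df)  # rows 0-4 share one group, rows 5-8 another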

Related

Improve performance of 8million iterations over a dataframe and query it

There is a for loop of 8 million iterations. Each iteration takes 2 sample values from a column of a 1-million-record dataframe (say df_original_nodes), queries those 2 samples in another dataframe (say df_original_rel), and, if the pair does not exist, adds it as a new row to the queried dataframe (df_original_rel). Finally, the dataframe (df_original_rel) is written to a CSV.
This loop is taking roughly 24+ hours to complete. How can it be made performant? I would be happy if it even took 8 hours rather than anything over 12.
Here is the piece of code:
for j in range(1, n_8000000):
    ran_num = random.randint(0, 1)
    ran_rel_type = rel_type[ran_num]
    df_ran_rel = df_original_nodes["UID"].sample(2, ignore_index=True)
    FROM = df_ran_rel[0]
    TO = df_ran_rel[1]
    if df_original_rel.query("@FROM == FROM and @TO == TO").empty:
        k += 1
        new_row = {"FROM": FROM, "TO": TO, "TYPE": ran_rel_type[0], "PART_OF": ran_rel_type[1]}
        df_original_rel = df_original_rel.append(new_row, ignore_index=True)
df_original_rel.to_csv("output/extra_rel.csv", encoding="utf-8", index=False)
My assumption is that querying the dataframe df_original_rel is the heavy-lifting part, and that df_original_rel also keeps growing as new rows are added.
In my view, lists are faster to traverse and maybe to query, but then there would be another layer of conversion from dataframe to list and vice versa, which could add further complexity.
Some things that should probably help – most of them around "do less Pandas".
Since I don't have your original data or anything like it, I can't test this.
# Grab a regular list of UIDs that we can use with `random.sample`
original_nodes_uid_list = df_original_nodes["UID"].tolist()

# Make a regular set of FROM-TO tuples
rel_from_to_pairs = set(df_original_rel[["FROM", "TO"]].apply(tuple, axis=1).tolist())

# Store new rows here instead of putting them in the dataframe; we'll also update rel_from_to_pairs as we go.
new_rows = []

for j in range(1, 8_000_000):
    # These two lines could probably also be a `random.choice`
    ran_num = random.randint(0, 1)
    ran_rel_type = rel_type[ran_num]

    # Grab a from-to pair from the UID list
    FROM, TO = random.sample(original_nodes_uid_list, 2)

    # If this pair isn't in the set of known pairs...
    if (FROM, TO) not in rel_from_to_pairs:
        # ... prepare a new row to be added later
        new_rows.append({"FROM": FROM, "TO": TO, "TYPE": ran_rel_type[0], "PART_OF": ran_rel_type[1]})
        # ... and since this from-to pair _would_ exist had df_original_rel
        # been updated, update the pairs set.
        rel_from_to_pairs.add((FROM, TO))

# Finally, make a dataframe of the new rows, concatenate it with the old, and output.
df_new_rel = pd.DataFrame(new_rows)
df_original_rel = pd.concat([df_original_rel, df_new_rel], ignore_index=True)
df_original_rel.to_csv("output/extra_rel.csv", encoding="utf-8", index=False)

Merge FAST two DataFrames based on specific conditions, row by row

I have been struggling with my case for the past 10 days and I can't find a fast and efficient solution.
Here is the case. I have one DF containing web traffic data of a human resources website.
Every row of this DataFrame refers to an application (i.e. someone reached the website via a specific web source and applied to a specific job offer at a specific time).
Here is an example :
import pandas as pd
web_data = {'source': ['Google', 'Facebook', 'Email'],
            'job_id': ['123456', '654321', '010101'],
            'd_date_hour_event': ['2019-11-01 00:09:59', '2019-11-01 00:10:41', '2019-11-01 00:19:20'],
            }
web_data = pd.DataFrame(web_data)
In the second DataFrame, I have an extract of a Human Resources internal tool where we gather all the received applications with some complementary data. Here is an example:
hr_data = {'candidate_id': ['ago23ak', 'bli78gro', '123tru456'],
           'job_id': ['675848', '343434', '010101'],
           'date_time_submission': ['2019-11-10 00:24:59', '2019-11-09 12:10:41', '2019-11-01 00:19:22'],
           'job_label': ['HR internship', 'Data Science Supervisor', 'Project Manager']
           }
hr_data = pd.DataFrame(hr_data)
Here are the difficulties I am facing :
There is no unique key I can use to merge those two tables. I have to use "job_id" (which is unique to every job) combined with the time the application occurred, via the columns "d_date_hour_event" (in the web_data DF) and "date_time_submission" (in the hr_data DF).
For the same application, the time registered in the two tables might not be the same (a difference of a few seconds).
Some of web_data values might not be present in hr_data
In the end, I would like to get one DataFrame that looks like the expected result shown in result_dataframe.png.
Actually, I have already coded a function to perform this merge. It looks like this:
from datetime import timedelta

for i, row in web_data.iterrows():
    # we store the values needed for the hr_data search
    date = row.d_date_hour_event
    job = row.job_id
    # we compute the time window
    inf = date - timedelta(seconds=10)
    sup = date + timedelta(seconds=10)
    # we check if there is a matching row in hr_data
    temp_df = pd.DataFrame()
    temp_df = hr_data[(hr_data.job_id == job) &
                      (hr_data.date_time_submission >= inf) &
                      (hr_data.date_time_submission <= sup)].tail(1)
    # if there is a matching row, we merge them and update the web_data table
    if not temp_df.empty:
        row = row.to_frame().transpose()
        join = pd.merge(row, temp_df, how='inner', on='job_id', left_index=False, right_index=True)
        web_data.update(join)
But because my web_data has over 250K rows and my hr_data over 140K rows, it takes hours (an estimated 35 hours of runtime...).
I am sure that iterrows is not optimal and that this code can be optimized. I tried using a custom function with .apply(lambda x: ...) but without success.
Any help would be more than welcome !
Please let me know if you need more explanations.
Many thanks !
Let's try this in a few steps:
1. Convert the datetime columns in both dataframes to pd.datetime format.
web_data = web_data.assign(d_date_hour_event= lambda x: pd.to_datetime(x['d_date_hour_event']))
hr_data = hr_data.assign(date_time_submission=lambda x: pd.to_datetime(x['date_time_submission']))
2. Rename the job_id column of the hr_data dataframe so it will not cause any conflicts when merging.
hr_data = hr_data.rename(columns={"job_id": "job_id_hr"})
3. Make numpy arrays from the timestamp and job_id columns of both dataframes, then use numpy broadcasting to find the rows where the web_data timestamp is within 10 seconds of the hr_data timestamp and the job_id matches.
import numpy as np

web_data_dates = web_data['d_date_hour_event'].values
hr_data_dates = hr_data['date_time_submission'].values
web_data_job_ids = web_data['job_id'].values
hr_data_job_ids = hr_data['job_id_hr'].values

i, j = np.where(
    (hr_data_dates[:, None] <= (web_data_dates + pd.Timedelta(10, 'S'))) &
    (hr_data_dates[:, None] >= (web_data_dates - pd.Timedelta(10, 'S'))) &
    (hr_data_job_ids[:, None] == web_data_job_ids)
)

overlapping_rows = pd.DataFrame(
    np.column_stack([web_data.values[j], hr_data.values[i]]),
    columns=web_data.columns.append(hr_data.columns)
)
4. Assign new columns to the original web_data dataframe, so we can update these rows with all the information in case any rows overlap
web_data = web_data.assign(candidate_id=np.nan, job_id_hr=np.nan, date_time_submission=np.datetime64('NaT'), job_label=np.nan)
Finally, just update the web_data dataframe (or first create a copy if you don't want to overwrite the original dataframe):
web_data.update(overlapping_rows)
This should be a lot faster than iterating over all rows.
This is the code I am using (which is not working unless you make the changes I described in comments)
web_data = {'source': ['Google', 'Facebook', 'Email'],
            'job_id': ['123456', '654321', '010101'],
            'd_date_hour_event': ['2019-11-01 00:09:59', '2019-11-01 00:10:41', '2019-11-01 00:19:20'],
            }
web_data = pd.DataFrame(web_data)

# placed '010101' and '2019-11-01 00:19:22' in second position instead of third position like it used to be
# if you switch these values back to third position in the 'job_id' and 'date_time_submission' arrays respectively, it should work
hr_data = {'candidate_id': ['ago23ak', 'bli78gro', '123tru456'],
           'job_id': ['675848', '010101', '343434'],
           'date_time_submission': ['2019-11-10 00:24:59', '2019-11-01 00:19:22', '2019-11-09 12:10:41'],
           'job_label': ['HR internship', 'Data Science Supervisor', 'Project Manager']
           }
hr_data = pd.DataFrame(hr_data)
hr_data = hr_data.rename(columns={"job_id": "job_id_hr"})
web_data = web_data.assign(d_date_hour_event= lambda x: pd.to_datetime(x['d_date_hour_event']))
hr_data = hr_data.assign(date_time_submission=lambda x: pd.to_datetime(x['date_time_submission']))
web_data_dates = web_data['d_date_hour_event'].values
hr_data_dates = hr_data['date_time_submission'].values
web_data_job_ids = web_data['job_id'].values
hr_data_job_ids = hr_data['job_id_hr'].values
i, j = np.where(
    (hr_data_dates[:, None] <= (web_data_dates + pd.Timedelta(10, 'S'))) &
    (hr_data_dates[:, None] >= (web_data_dates - pd.Timedelta(10, 'S'))) &
    (hr_data_job_ids[:, None] == web_data_job_ids[:, None])
)

overlapping_rows = pd.DataFrame(
    np.column_stack([web_data.values[j], hr_data.values[i]]),
    columns=web_data.columns.append(hr_data.columns)
)
overlapping_rows
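As a side note (not part of the original question or answer): once both time columns are converted with pd.to_datetime, pandas can also do this nearest-match-within-tolerance join directly with pd.merge_asof. A minimal sketch, using the frames as they stand after steps 1-2 (before the extra columns from step 4 are assigned):
# Hedged alternative: match each web event to the nearest HR submission for the
# same job_id within 10 seconds. Both frames must be sorted on their time key.
web_sorted = web_data.sort_values('d_date_hour_event')
hr_sorted = hr_data.sort_values('date_time_submission')

merged = pd.merge_asof(
    web_sorted,
    hr_sorted,
    left_on='d_date_hour_event',
    right_on='date_time_submission',
    left_by='job_id',
    right_by='job_id_hr',
    tolerance=pd.Timedelta('10s'),
    direction='nearest',
)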

Updating or adding multiple columns with pydatatable in style of R datable's .SDcols

Given the iris data, I'd like to add new columns corresponding to all numeric columns found. I can do it by explicitly listing each numeric column:
import datatable as dt
from datatable import fread, f, mean, update

iris_dt = fread("https://h2o-public-test-data.s3.amazonaws.com/smalldata/iris/iris.csv")
iris_dt[:, update(C0_dist_from_mean = dt.abs(f.C0 - mean(f.C0)),
                  C1_dist_from_mean = dt.abs(f.C1 - mean(f.C1)),
                  C2_dist_from_mean = dt.abs(f.C2 - mean(f.C2)),
                  C3_dist_from_mean = dt.abs(f.C3 - mean(f.C3)))]
But that way I have hard-coded the column names. A more robust way is readily available in R's data.table using .SDcols:
library(data.table)
iris = fread("https://h2o-public-test-data.s3.amazonaws.com/smalldata/iris/iris.csv")
cols = names(sapply(iris, class)[sapply(iris, class) == 'numeric'])
iris[, paste0(cols, "_dist_from_mean") := lapply(.SD, function(x) {abs(x - mean(x))}),
     .SDcols = cols]
Is there a way to take a similar approach with pydatatable today?
I do realize how to get all numeric columns in py-datatable, e.g. like this:
iris_dt[:, f[float]]
but it's the last part that uses .SDcols in R that evades me.
Create a dict comprehension of the new column names and the f expressions, then unpack it in the update method:
from datatable import f, update, abs, mean

aggs = {f"{col}_dist_from_mean": abs(f[col] - mean(f[col]))
        for col in iris_dt[:, f[float]].names}
iris_dt[:, update(**aggs)]
UPDATE:
Using the Type properties in v1.1, this is an alternative approach:
aggs = {f"{col}_dist_from_mean": dt.math.abs(f[col] - f[col].mean())
        for col, col_type in zip(iris_dt.names, iris_dt.types)
        if col_type.is_float}
You could also chunk the steps:
Create a Frame with the calculated values:
expression = f[float]-f[float].mean()
expression = dt.math.abs(expression)
compute = iris_dt[:, expression]
Rename the column labels for compute:
compute.names = [f"{name}_dist_from_mean" for name in compute.names]
Update iris_dt with compute (note that you could also use a cbind):
iris_dt[:, update(**compute)]
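The cbind route mentioned above would look roughly like this (a sketch applied to a fresh iris_dt, not part of the original answer):
# Attach the renamed compute frame as new columns instead of updating in place.
iris_dt = dt.cbind(iris_dt, compute)
print(iris_dt.names)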

How to find repeated words in each row of a CSV, use them as column names, and use the following row data as their instances

I have a CSV file containing two columns: the first column has dsyn, clnd and gngm repeatedly, and the second column has the corresponding disease name, drug name or gene name, like below.
abstract1.csv
clnd,Melatonin 3 MG
dsyn,Disease
dsyn,DYSFUNCTION
dsyn,Migraine Disorders
gngm,CD5L wt Allele
gngm,CD69 wt Allele
gngm,CLOCK gene
I want the output shown below:
dsyn                 clnd             gngm
Disease              Melatonin 3 MG   CD5L wt Allele
DYSFUNCTION                           CD69 wt Allele
Migraine Disorders                    CLOCK gene
If you're using R, then something like this may work for you.
# Read the csv into a data frame
df <- read.csv('PATH.TO.FILE.csv', header = FALSE, col.names = c('V1', 'V2'), stringsAsFactors = TRUE)

# Split the data frame into a list of data frames based on the unique values in the first column
ldf <- split(df, factor(df$V1))

# Get the maximum number of rows in any data frame
max.rows <- max(unlist(lapply(ldf, nrow)))

# Apply function to each data frame in the list
ldf <- lapply(ldf, function(x){
  # Remove unused factor levels from the data frame
  x <- droplevels(x)
  # Set the column name of the second column to the value of the first column
  colnames(x)[2] <- unique(levels(x$V1))
  # If there are fewer rows than the max, then add empty rows to ensure all data frames have the same length
  if(nrow(x) < max.rows){
    x[c((nrow(x) + 1):max.rows),] <- NA
  }
  # Remove the first column
  x$V1 <- NULL
  # Return the modified data frame
  return(x)
})

# Combine all data frames back into a single data frame
df2 <- do.call(cbind, ldf)
Or you could also try this approach using tidyr and dplyr
library(tidyr)
library(dplyr)
df <- read.csv('PATH.TO.FILE.csv', header = FALSE, col.names = c('V1', 'V2'), stringsAsFactors = TRUE)
df2 <- df %>%
  mutate(i = row_number()) %>%
  spread(V1, V2) %>%
  select(-i)

df2 <- as.data.frame(lapply(df2, sort, na.last = TRUE)) %>%
  filter(rowSums(is.na(.)) != ncol(.))
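Since most of this page leans on pandas, here is a hedged pandas sketch of the same reshaping (not from the original answers); the file name abstract1.csv is taken from the question:
import pandas as pd

# Read the two-column CSV (category, value).
df = pd.read_csv("abstract1.csv", header=None, names=["V1", "V2"])

# Pivot so each category becomes its own column, one value per original row...
wide = df.pivot(columns="V1", values="V2")

# ...then pack each column upward by dropping the NaN gaps.
packed = wide.apply(lambda s: s.dropna().reset_index(drop=True))
print(packed)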

Output unique values from a pandas dataframe without reordering the output

I know that a few posts have been made regarding how to output the unique values of a dataframe without reordering the data.
I have tried many times to implement these methods, however, I believe that the problem relates to how the dataframe in question has been defined.
Basically, I want to look into the dataframe named "C", and output the unique values into a new dataframe named "C1", without changing the order in which they are stored at the moment.
The line that I use currently is:
C1 = pd.DataFrame(np.unique(C))
However, this returns a list in ascending order (while I simply want the original order preserved, only with duplicates removed).
Once again, I apologise to the advanced users who will look at my code and shake their heads -- I'm still learning! And yes, I have tried numerous methods to solve this problem (redefining the C dataframe, converting the output to a list, etc.), to no avail unfortunately, so this is my cry for help to the Python gods. I defined both C and C1 as dataframes, as I understand that these are pretty much the best data structures to house data in, such that they can be recalled and used later; plus it is quite useful to name the columns without affecting the data contained in the dataframe.
Once again, your help would be much appreciated.
import numpy as np
import pandas as pd

F0 = ('08/02/2018','08/02/2018',50)
F1 = ('08/02/2018','09/02/2018',52)
F2 = ('10/02/2018','11/02/2018',46)
F3 = ('12/02/2018','16/02/2018',55)
F4 = ('09/02/2018','28/02/2018',48)
F_mat = [[F0, F1, F2, F3, F4]]
F_test = pd.DataFrame(np.array(F_mat).reshape(5, 3), columns=('startdate', 'enddate', 'price'))
#convert string dates into DateTime data type
F_test['startdate'] = pd.to_datetime(F_test['startdate'])
F_test['enddate'] = pd.to_datetime(F_test['enddate'])
#convert datetype to be datetime type for columns startdate and enddate
F['startdate'] = pd.to_datetime(F['startdate'])
F['enddate'] = pd.to_datetime(F['enddate'])
#create contract duration column
F['duration'] = (F['enddate'] - F['startdate']).dt.days + 1
#re-order the F matrix by column 'duration', ensure that the bootstrapping
#prioritises the shorter term contracts
F.sort_values(by=['duration'], ascending=[True])
# create prices P
P = pd.DataFrame()
for index, row in F.iterrows():
    new_P_row = pd.Series()
    for date in pd.date_range(row['startdate'], row['enddate']):
        new_P_row[date] = row['price']
    P = P.append(new_P_row, ignore_index=True)
P.fillna(0, inplace=True)
#create C matrix, which records the unique day prices across the observation interval
C = pd.DataFrame(np.zeros((1, intNbCalendarDays)))
C.columns = tempDateRange
#create the Repatriation matrix, which records the order in which contracts will be
#stored in the A matrix, which means that once results are generated
#from the linear solver, we know exactly which CalendarDays map to
#which columns in the results array
#this array contains numbers from 1 to NbContracts
R = pd.DataFrame(np.zeros((1, intNbCalendarDays)))
R.columns = tempDateRange
#define a zero filled matrix, P1, which will house the dominant daily prices
P1 = pd.DataFrame(np.zeros((intNbContracts, intNbCalendarDays)))
#rename columns of P1 to be the dates contained in matrix array D
P1.columns = tempDateRange
#create prices in correct rows in P
for i in list(range(0, intNbContracts)):
    for j in list(range(0, intNbCalendarDays)):
        if (P.iloc[i, j] != 0 and C.iloc[0, j] == 0):
            flUniqueCalendarMarker = P.iloc[i, j]
            C.iloc[0, j] = flUniqueCalendarMarker
            P1.iloc[i, j] = flUniqueCalendarMarker
            R.iloc[0, j] = i
            for k in list(range(j + 1, intNbCalendarDays)):
                if (C.iloc[0, k] == 0 and P.iloc[i, k] != 0):
                    C.iloc[0, k] = flUniqueCalendarMarker
                    P1.iloc[i, k] = flUniqueCalendarMarker
                    R.iloc[0, k] = i
        elif (C.iloc[0, j] != 0 and P.iloc[i, j] != 0):
            P1.iloc[i, j] = C.iloc[0, j]
#convert C dataframe into C_list, in preparation for converting C_list
#into a unique, order preserved list
C_list = C.values.tolist()
#create C1 matrix, which records the unique day prices across unique days in the observation period
C1 = pd.DataFrame(np.unique(C))
Use DataFrame.duplicated() to check whether your dataframe contains any duplicates.
If it does, you can try DataFrame.drop_duplicates().
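To address the ordering issue directly: pd.unique (and Series.drop_duplicates) keep first-appearance order, whereas np.unique sorts. A minimal sketch with illustrative values (not the asker's real C matrix):
import numpy as np
import pandas as pd

# Illustrative values only -- not the real C matrix from the question.
C = pd.DataFrame([[50, 52, 50, 46, 52, 55]])

C1_sorted = pd.DataFrame(np.unique(C))                   # 46, 50, 52, 55 (reordered)
C1_ordered = pd.DataFrame(pd.unique(C.values.ravel()))   # 50, 52, 46, 55 (original order kept)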
