Is there a Python equivalent for this R code?

I am working in Python and I want to group my data by two columns and, at the same time, add the missing dates from date1 (the date on which the event occurred) up to date2 (a date that I choose), then fill the missing values in the columns I select by forward fill.
I tried the code below in R and it works; I want to do the same in Python.
library(data.table)
library(padr)
library(dplyr)
data = fread("path", header = T)
data$ORDERDATE <- as.Date(data$ORDERDATE)
datemax = max(data$ORDERDATE)
data2 = data %>%
  group_by(Column1, Column2) %>%
  pad(., group = c('Column1', 'Column2'), end_val = as.Date(datemax), interval = "day", break_above = 100000000000) %>%
  tidyr::fill("Column3")
I searched for a Python equivalent of the padr package but couldn't find one.

Thank you for answering my request.
As requested, here is an example table:
```
import pandas as pd

dates = ['2014-01-14','2014-01-14','2014-01-15','2014-01-19','2014-01-18','2014-01-25','2014-01-28','2014-02-05','2014-02-14']
users = ['User1','User1','User2','User1','User2','User1','User2','User1','User2']
products = ['product1','product1','product1','product1','product1','product2','product2','product2','product2']
quantities = [5,6,8,10,4,5,2,9,7]
prices = [2,2,5,5,6,6,6,7,7]
data = pd.DataFrame({'date': dates, 'user': users, 'product': products, 'quantity': quantities, 'price': prices})
data['date'] = pd.to_datetime(data.date, format='%Y-%m-%d')
data2 = data.groupby(['user', 'product', 'date'], as_index=False).mean()
```
For User1 and product1, for example, I want to insert the missing dates, fill the quantity column with the value 0, and fill the price column with the preceding values, over a date range that I choose.
And do the same by user and by product for the remaining groups in my data.
The result should look like the image here: https://i.stack.imgur.com/qOOda.png
The R code I used to generate the image is as follows:
```
library(padr)
library(dplyr)
dates = c('2014-01-14','2014-01-14','2014-01-15','2014-01-19','2014-01-18','2014-01-25','2014-01-28','2014-02-05','2014-02-14')
users = c('User1','User1','User2','User1','User2','User1','User2','User1','User2')
products = c('product1','product1','product1','product1','product1','product2','product2','product2','product2')
quantities = c(5,6,8,10,4,5,2,9,7)
prices = c(2,2,5,5,6,6,6,7,7)
data = data.frame(date = dates, user = users, product = products, quantity = quantities, price = prices)
data$date <- as.Date(data$date)
datemax = max(data$date)
data2 = data %>% group_by(user, product) %>% pad(., group = c('user', 'product'), end_val = as.Date(datemax), interval = "day", break_above = 100000000000)
data3 = data2 %>% group_by(user, product, date) %>%
  summarize(quantity = sum(quantity), price = mean(price))
data4 = data3 %>% tidyr::fill("price") %>% fill_by_value(quantity, value = 0)
```
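For the Python side, here is a minimal pandas sketch of the same pad-and-fill logic (assuming the example `data` built above; I know of no padr port, so the padding is done by reindexing each user/product group to a daily range ending at the overall max date, filling quantity with 0 and carrying price forward):

```
import pandas as pd

datemax = data['date'].max()

def pad_group(g):
    # daily index from the group's first date up to the overall max date
    full_range = pd.date_range(g['date'].min(), datemax, freq='D')
    g = g.set_index('date').reindex(full_range)
    g.index.name = 'date'
    g['quantity'] = g['quantity'].fillna(0)   # padded days get quantity 0
    g['price'] = g['price'].ffill()           # carry the last known price forward
    return g.drop(columns=['user', 'product'], errors='ignore')

# aggregate duplicate days first (sum of quantity, mean of price), then pad each group
data2 = (data.groupby(['user', 'product', 'date'], as_index=False)
             .agg(quantity=('quantity', 'sum'), price=('price', 'mean')))
data4 = (data2.groupby(['user', 'product'], group_keys=True)
              .apply(pad_group)
              .reset_index())
```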

Related

How to add a new row with new header information in same dataframe

I have written code to retrieve JSON data from a URL. It works fine. I give the start and end date, and it loops through the date range and appends everything to a dataframe.
The columns are populated with the JSON data (each sensor and its corresponding values), hence the column names are like sensor_1. When I request the data from the URL, it sometimes happens that there are new sensors while the old ones are switched off and no longer deliver data, and often the number of columns changes. In that case my code just adds new columns.
What I want is, instead of new columns, a new header in the ongoing dataframe.
What I currently get with my code:
datetime;sensor_1;sensor_2;sensor_3;new_sensor_8;new_sensor_9;sensor_10;sensor_11;
2023-01-01;23.2;43.5;45.2;NaN;NaN;NaN;NaN;NaN;
2023-01-02;13.2;33.5;55.2;NaN;NaN;NaN;NaN;NaN;
2023-01-03;26.2;23.5;76.2;NaN;NaN;NaN;NaN;NaN;
2023-01-04;NaN;NaN;NaN;75;12;75;93;123;
2023-01-05;NaN;NaN;NaN;23;31;24;15;136;
2023-01-06;NaN;NaN;NaN;79;12;96;65;72;
What I want:
datetime;sensor_1;sensor_2;sensor_3;
2023-01-01;23.2;43.5;45.2;
2023-01-02;13.2;33.5;55.2;
2023-01-03;26.2;23.5;76.2;
datetime;new_sensor_8;new_sensor_9;sensor_10;sensor_11;
2023-01-04;75;12;75;93;123;
2023-01-05;23;31;24;15;136;
2023-01-06;79;12;96;65;72;
My loop to retrieve the data:
import datetime
from datetime import timedelta
import json

import numpy as np
import pandas as pd
import requests

start_date = datetime.datetime(2023, 1, 1, 0, 0)
end_date = datetime.datetime(2023, 1, 6, 0, 0)
sensor_data = pd.DataFrame()
while start_date < end_date:
    q = 'url'  # placeholder URL
    r = requests.get(q)
    j = json.loads(r.text)
    sub_data = pd.DataFrame()
    if 'result' in j:
        timestamps = pd.to_datetime(np.array(j['result']['data'])[:, 0])  # avoid shadowing the datetime module
        sensors = np.array(j['result']['sensors'])
        data = np.array(j['result']['data'])[:, 1:]
        df_new = pd.DataFrame(data, index=timestamps, columns=sensors)
        sub_data = pd.concat([sub_data, df_new])
    sensor_data = pd.concat([sensor_data, sub_data])
    start_date += timedelta(days=1)
If two DataFrames will do for you, then you can simply split using the column names:
df1 = df[['datetime', 'sensor_1', 'sensor_2', 'sensor_3']]
df2 = df[['datetime', 'new_sensor_8', 'new_sensor_9', 'sensor_10', 'sensor_11']]
Note the double brackets [[ ]] used.
Then use .dropna() to drop the NaN rows.
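For example, combining the column selection with .dropna() (column names taken from the sample above):

```
df1 = df[['datetime', 'sensor_1', 'sensor_2', 'sensor_3']].dropna()
df2 = df[['datetime', 'new_sensor_8', 'new_sensor_9', 'sensor_10', 'sensor_11']].dropna()
```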

How to group multiple columns with NA values and discrepancies?

I'm looking for a way to group a data frame that has multiple columns with missing values.
I want to group together every row that has a common value in any of the inspected columns, ignoring missing values and discrepancies in the data.
The script should be independent of the order of appearance of the missing values.
I succeeded in doing so by iteration, but I would like a more efficient way, by vectorizing the process. I used R, but I would also like to do it in Python.
As an example, if a data frame is
df = data.frame("ID1"=c(NA,NA,"A","A","A","B","C","B","C"), "ID2"=c("D","E",NA,"D","E","F","F",NA,NA))
I want to obtain a final grouping vector as
c(1,1,1,1,1,2,2,2,2)
where 1 and 2 can be any numbers; they only need to be shared between rows that have a common value in any column.
I hope this is understandable.
The easiest way I found was by using a double iteration:
df$GrpF = 1:dim(df)[1]
for (i in 1:dim(df)[1]){
  for (ID in c("ID1","ID2")){
    if (!is.na(df[i,ID])){
      df$GrpF[df[ID]==df[i,ID]] = min(df$GrpF[df[ID]==df[i,ID]], na.rm = T)
    }
  }
}
Where df$GrpF is my final grouping vector. It works well and I don't have any duplicates when I summarise the information.
library(dplyr)
library(plyr)
dfG = df %>% group_by_("GrpF") %>% summarise_all(
  function(x){
    x1 = unique(x)
    paste0(x1[!is.na(x1) & x1 != ""], collapse = "/")
  }
)
But when I use my real data (60,000 rows over 4 columns), it takes a lot of time (5 minutes).
I tried using a single iteration over the columns, using the dplyr and plyr libraries:
grpData = function(df, colGrp, colData, colReplBy = NA){
  a = df %>% group_by_at(colGrp) %>% summarise_at(colData, function(x) { sort(x, na.last = T)[1] }) %>% filter_at(colGrp, all_vars(!is.na(.)))
  b = plyr::mapvalues(df[[colGrp]], from = a[[colGrp]], to = a[[colData]])
  if (is.na(colReplBy)) {
    b[which(is.na(b))] = NA
  } else if (colReplBy %in% colnames(df)) {
    b[which(is.na(b))] = df[[colReplBy]][which(is.na(b))]  # Set old value for missing values
  } else {
    stop("Col to use as replacement for missing values not present in dataframe")
  }
  return(b)
}
df$GrpF = 1:dim(df)[1]
for (ID in c("ID1","ID2")){
  # Set all same old group the same ID
  df$IDN = grpData(df, "GrpF", ID)
  # Set all same new ID the same old group
  df$GrpN = grpData(df, "IDN", "GrpF")
  # Set all same ID the same new group
  df$GrpN = grpData(df, ID, "GrpN")
  # Set all same old group the same new group
  df$GrpF = grpData(df, "GrpF", "GrpN", colReplBy = "GrpF")
}
This does work (takes 30 sec for the real data) but I would like a more efficient way of doing so.
Do you have any ideas?
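In Python, one possible vectorized sketch (assuming pandas and scipy are available) is to treat rows as nodes of a graph, link rows whenever they share a non-missing value in the same column, and take connected components as the group ids:

```
import numpy as np
import pandas as pd
from scipy.sparse import coo_matrix
from scipy.sparse.csgraph import connected_components

df = pd.DataFrame({'ID1': [None, None, 'A', 'A', 'A', 'B', 'C', 'B', 'C'],
                   'ID2': ['D', 'E', None, 'D', 'E', 'F', 'F', None, None]})

n = len(df)
# long format: one line per non-missing (row, column, value) triple
long = (df.rename_axis('row').reset_index()
          .melt(id_vars='row', var_name='col', value_name='val')
          .dropna(subset=['val']))
# give each distinct (column, value) pair a token id
tokens = (long['col'].astype(str) + '=' + long['val'].astype(str)).factorize()[0]

# bipartite adjacency (rows x tokens), then rows connected through shared tokens
adj = coo_matrix((np.ones(len(long)), (long['row'].to_numpy(), tokens)),
                 shape=(n, tokens.max() + 1))
graph = adj @ adj.T

_, labels = connected_components(graph, directed=False)
df['GrpF'] = labels + 1   # gives c(1,1,1,1,1,2,2,2,2) for the example above
```

The connected-components step scales roughly with the number of non-missing cells, so 60,000 rows over 4 columns should run quickly.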

PYTHON, Pandas Dataframe: how to select and read only certain rows

To be clear, here is the code that works perfectly (of course I only include the beginning; the rest is not important here):
df = pd.read_csv(
'https://github.com/pcm-dpc/COVID-19/raw/master/dati-andamento-nazionale/'
'dpc-covid19-ita-andamento-nazionale.csv',
parse_dates=['data'], index_col='data')
df.index = df.index.normalize()
ts = df[['nuovi_positivi']].dropna()
sts = ts.nuovi_positivi
So basically it takes some data from the online GitHub CSV that you may find here:
Link NAZIONALE. It looks at "data" (which is Italian for "date") and, for every date, extracts the value nuovi_positivi and puts it into the program.
Now I have to do the same thing with this JSON, which you may find here:
Link Json
As you may see, for every date there are now 21 different values because Italy has 21 regions (Abruzzo, Basilicata, Campania, and so on), but I am interested ONLY in the values of the region "Veneto". I want to extract only the rows that contain "Veneto" under the label "denominazione_regione", to get the value "nuovi_positivi" for every day.
I tried with:
df = pd.read_json('https://raw.githubusercontent.com/pcm-dpc/COVID-19/master/dati-json/dpc-covid19-ita-regioni.json',
                  parse_dates=['data'], index_col='data', index_row='Veneto')
df.index = df.index.normalize()
ts = df[['nuovi_positivi']].dropna()
sts = ts.nuovi_positivi
but of course it doesn't work. How do I solve the problem? Thanks.
try this:
df = pd.read_json('https://raw.githubusercontent.com/pcm-dpc/COVID-19/master/dati-json/dpc-covid19-ita-regioni.json',
convert_dates =['data'])
df.index = df['data']
df.index = df.index.normalize()
df = df[df["denominazione_regione"] == 'Veneto']
ts = df[['nuovi_positivi']].dropna()
sts = ts.nuovi_positivi
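A slightly more compact variant of the same idea (same URL and column names as above):

```
df = (pd.read_json('https://raw.githubusercontent.com/pcm-dpc/COVID-19/master/dati-json/dpc-covid19-ita-regioni.json',
                   convert_dates=['data'])
        .set_index('data'))
df.index = df.index.normalize()
sts = df.loc[df['denominazione_regione'] == 'Veneto', 'nuovi_positivi'].dropna()
```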

How to find repetitive words in each row of a CSV, use them as column names, and use the following rows' data as their instances

I have a CSV file containing two columns: the first column has dsyn, clnd, and gngm repeatedly, and the second column has the respective disease name, drug name, or gene name, like below:
abstract1.csv
clnd,Melatonin 3 MG
dsyn,Disease
dsyn,DYSFUNCTION
dsyn,Migraine Disorders
gngm,CD5L wt Allele
gngm,CD69 wt Allele
gngm,CLOCK gene
I want the output shown below:
dsyn                 clnd            gngm
Disease              Melatonin 3 MG  CD5L wt Allele
DYSFUNCTION                          CD69 wt Allele
Migraine Disorders                   CLOCK gene
If you're using R, then something like this may work for you.
# Read the csv into a data frame
df <- read.csv('PATH.TO.FILE.csv', header = FALSE, col.names = c('V1', 'V2'), stringsAsFactors = TRUE)
# Split the data frame into a list of data frames based on the unique values in the first column
ldf <- split(df, factor(df$V1))
# Get the maximum number of rows in any data frame
max.rows <- max(unlist(lapply(ldf, nrow)))
# Apply function to each data frame in the list
ldf <- lapply(ldf, function(x){
  # Remove unused factor levels from the data frame
  x <- droplevels(x)
  # Set the column name of the second column to the value of the first column
  colnames(x)[2] <- unique(levels(x$V1))
  # If there are fewer rows than the max, then add empty rows to ensure all data frames have the same length
  if(nrow(x) < max.rows){
    x[c((nrow(x) + 1):max.rows),] <- NA
  }
  # Remove the first column
  x$V1 <- NULL
  # Return the modified data frame
  return(x)
})
# Combine all data frames back into a single data frame
df2 <- do.call(cbind, ldf)
Or you could also try this approach using tidyr and dplyr
library(tidyr)
library(dplyr)
df <- read.csv('PATH.TO.FILE.csv', header = FALSE, col.names = c('V1', 'V2'), stringsAsFactors = TRUE)
df2 <- df %>%
mutate(i = row_number()) %>%
spread(V1, V2) %>%
select(-i)
df2 <- as.data.frame(lapply(df2, sort, na.last = TRUE)) %>%
filter(rowSums(is.na(.)) != ncol(.))
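If you want the same reshaping in Python, a rough pandas equivalent of the spread approach is the sketch below (assuming the same two-column CSV without a header row):

```
import pandas as pd

df = pd.read_csv('PATH.TO.FILE.csv', header=None, names=['V1', 'V2'])
df2 = (df.assign(i=df.groupby('V1').cumcount())   # position of each value within its group
         .pivot(index='i', columns='V1', values='V2'))
```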

python pandas groupby complex calculations

I have a dataframe df with columns a, b, c, d, and I want to group the data by a and make some calculations. I will provide the code for these calculations in R. My main question is: how do I do the same in pandas?
library(dplyr)
df %>%
group_by(a) %>%
summarise(mean_b = mean(b),
qt95 = quantile(b, .95),
diff_b_c = max(b-c),
std_b_d = sd(b)-sd(d)) %>%
ungroup()
This example is synthetic; I just want to understand the pandas syntax.
I believe you need a custom function with GroupBy.apply:
def f(x):
    mean_b = x.b.mean()
    qt95 = x.b.quantile(.95)
    diff_b_c = (x.b - x.c).max()
    std_b_d = x.b.std() - x.d.std()
    cols = ['mean_b', 'qt95', 'diff_b_c', 'std_b_d']
    return pd.Series([mean_b, qt95, diff_b_c, std_b_d], index=cols)

df1 = df.groupby('a').apply(f)
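An alternative sketch using named aggregation (the b - c difference is precomputed as a helper column, since agg works on one column at a time):

```
df1 = (df.assign(b_minus_c=df['b'] - df['c'])
         .groupby('a')
         .agg(mean_b=('b', 'mean'),
              qt95=('b', lambda s: s.quantile(.95)),
              diff_b_c=('b_minus_c', 'max'),
              std_b=('b', 'std'),
              std_d=('d', 'std'))
         .assign(std_b_d=lambda d: d['std_b'] - d['std_d'])
         .drop(columns=['std_b', 'std_d']))
```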
