Concatenate on specific condition - python

I want to write an if condition for concatenating strings.
i.e. if cell A1 contains a specific format of text, only then concatenate, else leave it as is.
example:
If the bill number looks like CM2/0000/, then concatenate this string with the date column (month-year), else leave the bill number as it is.
Sample Data

You can create a function which does what you need and use df.apply() to execute it on all rows.
I use the example data from @Boomer's answer.
EDIT: you didn't show what you really have in the dataframe, and it seems you have datetime in bill_date, but I used strings, so I had to convert the strings to datetime to show how to work with this. It then needs .strftime('%m-%y') (or sometimes .dt.strftime('%m-%y')) instead of .str[3:].str.replace('/','-'), because str(x) on a datetime gives 2019-09-15 00:00:00 instead of your 15/09/19.
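For illustration, a minimal standalone sketch of that display difference:

import pandas as pd

ts = pd.Timestamp('2019-09-15')
print(str(ts))               # 2019-09-15 00:00:00
print(ts.strftime('%m-%y'))  # 09-19

The full example with your data: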
import pandas as pd

df = pd.DataFrame({
    'bill_number': ['CM2/0000/', 'CM2/0000', 'CM3/0000/', 'CM3/0000'],
    'bill_date': ['15/09/19', '15/09/19', '15/09/19', '15/09/19']
})
# dayfirst=True because dates like 15/09/19 are day/month/year
df['bill_date'] = pd.to_datetime(df['bill_date'], dayfirst=True)

def convert(row):
    if row['bill_number'].endswith('/'):
        #return row['bill_number'] + row['bill_date'].str[3:].replace('/','-')
        return row['bill_number'] + row['bill_date'].strftime('%m-%y')
    else:
        return row['bill_number']

df['bill_number'] = df.apply(convert, axis=1)
print(df)
Result (note that bill_date now displays in datetime format):

      bill_number  bill_date
0  CM2/0000/09-19 2019-09-15
1        CM2/0000 2019-09-15
2  CM3/0000/09-19 2019-09-15
3        CM3/0000 2019-09-15
The second idea is to create a mask:
mask = df['bill_number'].str.endswith('/')
and later use it for all values:
#df.loc[mask,'bill_number'] = df[mask]['bill_number'] + df[mask]['bill_date'].str[3:].str.replace('/','-')
df.loc[mask,'bill_number'] = df[mask]['bill_number'] + df[mask]['bill_date'].dt.strftime('%m-%y')
or
#df.loc[mask,'bill_number'] = df.loc[mask,'bill_number'] + df.loc[mask,'bill_date'].str[3:].str.replace('/','-')
df.loc[mask,'bill_number'] = df.loc[mask,'bill_number'] + df.loc[mask,'bill_date'].dt.strftime('%m-%y')
The left side needs .loc[mask,'bill_number'] instead of [mask]['bill_number'] to correctly assign values - but the right side doesn't need it.
import pandas as pd

df = pd.DataFrame({
    'bill_number': ['CM2/0000/', 'CM2/0000', 'CM3/0000/', 'CM3/0000'],
    'bill_date': ['15/09/19', '15/09/19', '15/09/19', '15/09/19']
})
df['bill_date'] = pd.to_datetime(df['bill_date'], dayfirst=True)

mask = df['bill_number'].str.endswith('/')

#df.loc[mask,'bill_number'] = df[mask]['bill_number'] + df[mask]['bill_date'].str[3:].str.replace('/','-')
# or
#df.loc[mask,'bill_number'] = df.loc[mask,'bill_number'] + df.loc[mask,'bill_date'].str[3:].str.replace('/','-')

df.loc[mask,'bill_number'] = df[mask]['bill_number'] + df[mask]['bill_date'].dt.strftime('%m-%y')
# or
#df.loc[mask,'bill_number'] = df.loc[mask,'bill_number'] + df.loc[mask,'bill_date'].dt.strftime('%m-%y')

print(df)
The third idea is to use numpy.where():
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'bill_number': ['CM2/0000/', 'CM2/0000', 'CM3/0000/', 'CM3/0000'],
    'bill_date': ['15/09/19', '15/09/19', '15/09/19', '15/09/19']
})
df['bill_date'] = pd.to_datetime(df['bill_date'], dayfirst=True)

df['bill_number'] = np.where(
    df['bill_number'].str.endswith('/'),
    #df['bill_number'] + df['bill_date'].str[3:].str.replace('/','-'),
    df['bill_number'] + df['bill_date'].dt.strftime('%m-%y'),
    df['bill_number'])

print(df)
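If "looks like CM2/0000/" means more than just a trailing slash, a regex mask is a possible variation. This is a sketch; the exact pattern below is an assumption, so adjust it to your real bill numbers:

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'bill_number': ['CM2/0000/', 'CM2/0000', 'CM3/0000/', 'CM3/0000'],
    'bill_date': pd.to_datetime(['15/09/19'] * 4, dayfirst=True)
})

# hypothetical format: two letters, one digit, '/', four digits, trailing '/'
mask = df['bill_number'].str.match(r'^[A-Z]{2}\d/\d{4}/$')

df['bill_number'] = np.where(mask,
                             df['bill_number'] + df['bill_date'].dt.strftime('%m-%y'),
                             df['bill_number'])
print(df)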

Maybe this will work for you. It would be nice to have a data sample, as @Mike67 was suggesting, but based on your information this is what I came up with. Bulky, but it works. I'm sure someone else will have a fancier version.
import pandas as pd

dat = {'num': ['CM2/0000/', 'CM2/0000', 'CM3/0000/', 'CM3/0000'],
       'date': ['15/09/19', '15/09/19', '15/09/19', '15/09/19']}
df = pd.DataFrame(dat)
df['date'] = df['date'].map(lambda x: str(x)[3:])
df['date'] = df['date'].str.replace('/', '-')
df.loc[df['num'].str.endswith('/'), 'num'] = df['num'] + df['date']
print(df)
Results:

              num   date
0  CM2/0000/09-19  09-19
1        CM2/0000  09-19
2  CM3/0000/09-19  09-19
3        CM3/0000  09-19

Related

Remove non-numeric rows from dataframe

I have a dataframe of patients and their gene expressions. It has this format:

Patient_ID | gene1 | gene2 | ... | gene10000
p1           0.142   0.233   ...   bla
p2           0.243   0.243   ...   -0.364
...
p4000        1.423   bla     ...   -1.222
As you can see, the dataframe contains noise: some cells hold values other than floats.
I want to remove every row that has a non-numeric value in any column.
I've managed to do this using apply and pd.to_numeric like this:
cols = df.columns[1:]
df[cols] = df[cols].apply(pd.to_numeric, errors='coerce')
df = df.dropna()
The problem is that it's taking forever to run, and I need a better and more efficient way of achieving this.
EDIT: To reproduce something like my data:
arr = np.random.random_sample((3000,10000))
df = pd.DataFrame(arr, columns=['gene' + str(i) for i in range(10000)])
df = pd.concat([pd.DataFrame(['p' + str(i) for i in range(10000)], columns=['Patient_ID']),df],axis = 1)
df['gene0'][2] = 'bla'
df['gene9998'][4] = 'bla'
It turns out it is worth trying numpy :)
I got a 30-60x faster version (the bigger the array, the larger the improvement).
Convert to a numpy array (.values)
Iterate through all rows
Try to convert each row to a row of floats
If the conversion fails (a non-numeric value is present), flag that row in a boolean array
Build the result from those flags
Code:
import pandas as pd
import numpy as np
from line_profiler_pycharm import profile
def op_version(df):
cols = df.columns[1:]
df[cols] = df[cols].apply(pd.to_numeric, errors='coerce')
return df.dropna()
def np_version(df):
keep = np.full(len(df), True)
for idx, row in enumerate(df.values[:, 1:]):
try:
row.astype(np.float)
except:
keep[idx] = False
pass # maybe its better to store to_remove list, depends on data
return df[keep]
#profile
def main():
arr = np.random.random_sample((3000, 5000))
df = pd.DataFrame(arr, columns=['gene' + str(i) for i in range(5000)])
df = pd.concat([pd.DataFrame(['p' + str(i) for i in range(3000)],
columns=['Patient_ID']), df], axis=1)
df['gene0'][2] = 'bla'
df['gene998'][4] = 'bla'
df2 = df.copy()
df = op_version(df)
df2 = np_version(df2)
Note that I decreased the number of columns so it is more feasible for tests.
Also, I fixed a small bug in your example: instead of
df = pd.concat([pd.DataFrame(['p' + str(i) for i in range(10000)], columns=['Patient_ID']),df],axis = 1)
it should be
df = pd.concat([pd.DataFrame(['p' + str(i) for i in range(3000)], columns=['Patient_ID']),df],axis = 1)
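A further possible speed-up, not benchmarked here: coerce the whole numeric block in a single pd.to_numeric call instead of column by column. A sketch assuming the same layout (Patient_ID first, then the numeric columns):

import numpy as np
import pandas as pd

def flat_version(df):
    # flatten the gene block, coerce everything at once, then reshape back
    vals = pd.to_numeric(df.iloc[:, 1:].to_numpy().ravel(), errors='coerce')
    keep = ~np.isnan(vals.reshape(len(df), -1)).any(axis=1)
    return df[keep]

# usage: df_clean = flat_version(df)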

How can I get the top 10 most frequent values between 2 dates from a csv with pandas?

Essentially I have a csv file with an OFFENCE_CODE column and a column of dates called OFFENCE_MONTH. The code I have provided retrieves the 10 most frequently occurring offence codes within the OFFENCE_CODE column, but I need to be able to do this between 2 dates from the OFFENCE_MONTH column.
import numpy as np
import pandas as pd
input_date1 = 2012/11/1
input_date2 = 2013/11/1
df = pd.read_csv("penalty_data_set.csv", dtype='unicode', usecols=['OFFENCE_CODE', 'OFFENCE_MONTH'])
print(df['OFFENCE_CODE'].value_counts().nlargest(10))
You can use pandas.Series.between:
df['OFFENCE_MONTH'] = pd.to_datetime(df['OFFENCE_MONTH'])
input_date1 = pd.to_datetime('2012/11/1')
input_date2 = pd.to_datetime('2013/11/1')
m = df['OFFENCE_MONTH'].between(input_date1, input_date2)
df.loc[m, 'OFFENCE_CODE'].value_counts().nlargest(10)
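A self-contained sketch of that approach with made-up data, since the real CSV isn't shown:

import pandas as pd

df = pd.DataFrame({
    'OFFENCE_MONTH': ['2012-10-01', '2012-12-01', '2013-05-01', '2013-12-01'],
    'OFFENCE_CODE': ['A1', 'B2', 'B2', 'C3'],
})
df['OFFENCE_MONTH'] = pd.to_datetime(df['OFFENCE_MONTH'])

m = df['OFFENCE_MONTH'].between(pd.to_datetime('2012/11/1'), pd.to_datetime('2013/11/1'))
print(df.loc[m, 'OFFENCE_CODE'].value_counts().nlargest(10))
# B2    2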
You can do this if it is per month:

import pandas as pd

# example dataframe
# df = pd.read_csv("penalty_data_set.csv", dtype='unicode', usecols=['OFFENCE_CODE', 'OFFENCE_MONTH'])
d = {'OFFENCE_MONTH': [1, 1, 1, 2, 3, 4, 4, 5, 6, 12],
     'OFFENCE_CODE': ['a', 'a', 'b', 'd', 'r', 'e', 'f', 'g', 'h', 'a']}
df = pd.DataFrame(d)
print(df)

# make a filter (example here: months 1 to 4)
df_filter = df.loc[(df['OFFENCE_MONTH'] >= 1) & (df['OFFENCE_MONTH'] < 5)]
print(df_filter)

# count the codes in the filtered frame
print(df_filter['OFFENCE_CODE'].value_counts().nlargest(10))
example result:

a    2
b    1
d    1
r    1
e    1
f    1
First you need to convert the dates and the OFFENCE_MONTH column to datetime (note: strptime works on a single string, not on a whole column, so use pd.to_datetime for the column):

import pandas as pd
from datetime import datetime

input_date1 = datetime.strptime('2012-11-01', "%Y-%m-%d")
input_date2 = datetime.strptime('2013-11-01', "%Y-%m-%d")
df['OFFENCE_MONTH'] = pd.to_datetime(df['OFFENCE_MONTH'])

Then select rows based on your conditions (use & with parentheses, since Python's and does not work element-wise on a Series):

rslt_df = df[(df['OFFENCE_MONTH'] >= input_date1) & (df['OFFENCE_MONTH'] <= input_date2)]
print(rslt_df['OFFENCE_CODE'].value_counts().nlargest(10))

pandas cumsum on lag-differenced dataframe

Say I have a pd.DataFrame() that I differenced with .diff(5), which works like "new number at idx i = (number at idx i) - (number at idx i-5)"
import pandas as pd
import random
example_df = pd.DataFrame(data=random.sample(range(1, 100), 20), columns=["number"])
df_diff = example_df.diff(5)
Now I want to undo this operation, using the first 5 entries of example_df together with df_diff.
If I had done .diff(1), I would simply use .cumsum(). But how can I achieve that it only sums up every 5th value?
My desired output is a df with the following values:

example_df[0]
example_df[1]
example_df[2]
example_df[3]
example_df[4]
df_diff[5] + example_df[0]
df_diff[6] + example_df[1]
df_diff[7] + example_df[2]
df_diff[8] + example_df[3]
...
You could shift the column, add the two together, and fill the NaNs:
df_diff["shifted"] = example_df.shift(5)
df_diff["undone"] = df_diff["number"] + df_diff["shifted"]
df_diff["undone"] = df_diff["undone"].fillna(example_df["number"])

How to extrapolate a periodic time series in Pandas?

In Python 3.5 and pandas 0.20, say I have a one-year periodic time series:
import pandas as pd
import numpy as np

start_date = pd.to_datetime("2015-01-01T01:00:00.000Z", infer_datetime_format=True)
end_date = pd.to_datetime("2015-12-31T23:00:00.000Z", infer_datetime_format=True)
# pd.DatetimeIndex(start=..., end=..., freq=...) worked in pandas 0.20;
# current pandas uses pd.date_range(start=..., end=..., freq=...) instead
index = pd.DatetimeIndex(start=start_date,
                         freq="60min",
                         end=end_date)
time = np.array((index - start_date) / np.timedelta64(1, 'h'), dtype=int)
df = pd.DataFrame(index=index)
df["foo"] = np.sin(2 * np.pi * time / len(time))
df.plot()
I want to do some periodic extrapolation of the time series for a new index, i.e. with:

new_start_date = pd.to_datetime("2017-01-01T01:00:00.000Z", infer_datetime_format=True)
new_end_date = pd.to_datetime("2019-12-31T23:00:00.000Z", infer_datetime_format=True)
new_index = pd.DatetimeIndex(start=new_start_date,
                             freq="60min",
                             end=new_end_date)
I would like to use some kind of extrapolate_periodic method to get:
# DO NOT RUN
new_df = df.extrapolate_periodic(index=new_index)
# END DO NOT RUN
new_df.plot()
What is the best way do such a thing in pandas?
How can I define a periodicity and get data from a new index easily?
I think I have what you are looking for, though it is not a simple pandas method.
Carrying on directly from where you left off:
def extrapolate_periodic(df, new_index):
    df_right = df.groupby([df.index.dayofyear, df.index.hour]).mean()
    df_left = pd.DataFrame({'new_index': new_index}).set_index('new_index')
    df_left = df_left.assign(dayofyear=lambda x: x.index.dayofyear,
                             hour=lambda x: x.index.hour)
    df = (pd.merge(df_left, df_right, left_on=['dayofyear', 'hour'],
                   right_index=True, suffixes=('', '_y'))
          .drop(['dayofyear', 'hour'], axis=1))
    return df.sort_index()

new_df = extrapolate_periodic(df, new_index)
# or in method style
# new_df = df.pipe(extrapolate_periodic, new_index)
new_df.plot()
If you have more than a year's worth of data, it will take the mean of each duplicated day-hour pair. Here mean could be changed to last if you wanted just the most recent reading.
This will not work if you do not have a full year's worth of data, but you could fix that by reindexing to complete the year and then using interpolate with a polynomial method to fill in the missing foo column.
Here is some code I've used to solve my problem. The assumption is that the initial series corresponds to exactly one period of the data.
def extrapolate_periodic(df, new_index):
    index = df.index
    start_date = np.min(index)
    end_date = np.max(index)
    period = np.array((end_date - start_date) / np.timedelta64(1, 'h'), dtype=int)
    time = np.array((new_index - start_date) / np.timedelta64(1, 'h'), dtype=int)
    new_df = pd.DataFrame(index=new_index)
    for col in list(df.columns):
        new_df[col] = np.array(df[col].iloc[time % period])
    return new_df
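Used with the new_index from the question, a sketch of the call:

new_df = extrapolate_periodic(df, new_index)
new_df.plot()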

Adding to date in pandas dataframe with other dataframe value

I want to increase the date in one dataframe column by an integer value from another, but I receive:
TypeError: unsupported type for timedelta days component: numpy.int64
My dataframes look like this:
import pandas as pd
import numpy as np
import datetime as dt

dfa = pd.DataFrame([
    ['5/15/17', 1],
    ['5/15/17', 1]],
    columns=['Start', 'Days'])
dfb = pd.DataFrame([
    ['5/15/17', 1],
    ['5/15/17', 1]],
    columns=['Start', 'Days'])
I format the 'Start' column to datetime with this code:
dfa['Start'] = dfa['Start'].apply(lambda x: dt.datetime.strptime(x, '%m/%d/%y'))
dfb['Start'] = dfb['Start'].apply(lambda x: dt.datetime.strptime(x, '%m/%d/%y'))
I try to change the values in the dfa dataframe. The dfb dataframe reference works for 'Days' but not for 'Start':
for i, row in dfb.iterrows():
    for j, row in dfa.iterrows():
        new = pd.DataFrame({"Start": dfa.loc[j, "Start"] + dt.timedelta(days=dfb.loc[i, "Days"]),
                            "Days": dfa.loc[j, "Days"] - dfb.loc[i, "Days"]}, index=[j + 1])
        dfa = pd.concat([dfa.ix[:j], new, dfa.ix[j + 1:]]).reset_index(drop=True)
This is the key component that raises the error:
"Start": dfa.loc[j, "Start"] + dt.timedelta(days=dfb.loc[i, "Days"])
It works fine if I use:
"Start": dfa.loc[j, "Start"] + dt.timedelta(days=1)
but I need it to take that value from dfb, not a static integer.
IIUC (I changed the input values a bit to clarify what is going on):
import pandas as pd

dfa = pd.DataFrame([
    ['5/15/17', 1],
    ['5/16/17', 1]],
    columns=['Start', 'Days'])
dfb = pd.DataFrame([
    ['5/15/17', 3],
    ['5/16/17', 4]],
    columns=['Start', 'Days'])

dfa['Start'] = pd.to_datetime(dfa['Start'])
dfb['Start'] = pd.to_datetime(dfb['Start'])

dfa['Start'] = dfa['Start'] + dfb['Days'].apply(pd.Timedelta, unit='D')
print(dfa)
Output:

       Start  Days
0 2017-05-18     1
1 2017-05-20     1
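A vectorized variation that avoids apply(), assuming the same dfa and dfb as above (a sketch):

dfa['Start'] = dfa['Start'] + pd.to_timedelta(dfb['Days'], unit='D')

Inside the original loop, casting the numpy.int64 to a plain int also avoids the TypeError:

dt.timedelta(days=int(dfb.loc[i, "Days"]))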
