dataframe if string contains substring divide by 1000 - python

I have a dataframe like the one below.
import pandas as pd
data = [['yellow', '800test'], ['red', '900ui'], ['blue', '900test'], ['indigo', '700test'], ['black', '400ui']]
df = pd.DataFrame(data, columns=['Donor', 'value'])
In the value field, if a string contains, say, 'test', I'd like to divide the number by 1000. What would be the best way to do this?

Check the below code using a lambda function:
df['value_2'] = df.apply(lambda x: str(int(x.value.replace('test',''))/1000)+'test' if x.value.find('test') > -1 else x.value, axis=1)
df
Output:
    Donor    value  value_2
0  yellow  800test  0.8test
1     red    900ui    900ui
2    blue  900test  0.9test
3  indigo  700test  0.7test
4   black    400ui    400ui

df["value\d"] = df.value.str.findall("\d+").str[0].astype(int)
df["value\w"] = df.value.str.findall("[^\d]+").str[0]
df.loc[df["value\w"] == "test", "value\d"] = df["value\d"]/1000
df["value"] = df["value\w"] + df["value\d"].astype(str)

Related

How to aggregate a dataframe then transpose it with Pandas

I'm trying to achieve this kind of transformation with Pandas.
I wrote this code, but unfortunately it doesn't give the result I'm looking for.
CODE :
import pandas as pd
df = pd.read_csv('file.csv', delimiter=';')
df = df.count().reset_index().T.reset_index()
df.columns = df.iloc[0]
df = df[1:]
df
RESULT :
Do you have any suggestions? Any help will be appreciated.
First create boolean columns flagging the nonOK tests, then use named aggregation: 'size' for the ID count, 'sum' for the Values column, and 'sum' again to count the True flags; finally add both test columns together:
df = (df.assign(NumberOfTest1 = df['Test one'].eq('nonOK'),
                NumberOfTest2 = df['Test two'].eq('nonOK'))
        .groupby('Category', as_index=False)
        .agg(NumberOfID = ('ID','size'),
             Values = ('Values','sum'),
             NumberOfTest1 = ('NumberOfTest1','sum'),
             NumberOfTest2 = ('NumberOfTest2','sum'))
        .assign(TotalTest = lambda x: x['NumberOfTest1'] + x['NumberOfTest2']))
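Since the question's sample data was posted as an image, here is a self-contained sketch on a hypothetical frame (the column values are assumptions), together with the output it produces:
import pandas as pd

# Hypothetical input; the question's real sample was an image.
df = pd.DataFrame({'Category': ['A', 'A', 'B'],
                   'ID': [1, 2, 3],
                   'Values': [10, 20, 30],
                   'Test one': ['OK', 'nonOK', 'nonOK'],
                   'Test two': ['nonOK', 'OK', 'nonOK']})

out = (df.assign(NumberOfTest1 = df['Test one'].eq('nonOK'),
                 NumberOfTest2 = df['Test two'].eq('nonOK'))
         .groupby('Category', as_index=False)
         .agg(NumberOfID = ('ID', 'size'),
              Values = ('Values', 'sum'),
              NumberOfTest1 = ('NumberOfTest1', 'sum'),
              NumberOfTest2 = ('NumberOfTest2', 'sum'))
         .assign(TotalTest = lambda x: x['NumberOfTest1'] + x['NumberOfTest2']))
print(out)
#   Category  NumberOfID  Values  NumberOfTest1  NumberOfTest2  TotalTest
# 0        A           2      30              1              1          2
# 1        B           1      30              1              1          2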

How to delete row in pandas dataframe based on condition if string is found in cell value of type list?

I've been struggling with the following issue. It sounds very easy, but I can't seem to figure it out, and I'm sure it's something very obvious that I'm missing.
I simply have a pandas dataframe looking like this:
I want to drop the rows whose jpgs cell value (a list) contains the value "123.jpg", so I should end up with a dataframe holding only the rows at index 1 and 3.
However I've tried a lot of methods and none of them works.
For example:
df = df["123.jpg" not in df.jpgs]
or
df = df[df.jpgs.tolist().count("123.jpg") == 0]
give the error KeyError: True.
df = df[df['jpgs'].str.contains('123.jpg') == False]
returns an empty dataframe.
df = df[df.jpgs.count("123.jpg") == 0]
and
df = df.drop(df["123.jpg" in df.jpgs].index)
gives KeyError: False.
This is my entire code, if needed. I would really appreciate it if someone could help me see what I'm doing wrong :( Thanks!!
import pandas as pd
df = pd.DataFrame(columns=["person_id", "jpgs"])
id = 1
pair1 = ["123.jpg", "124.jpg"]
pair2 = ["125.jpg", "300.jpg"]
pair3 = ["500.jpg", "123.jpg"]
pair4 = ["111.jpg", "122.jpg"]
row1 = {'person_id': id, 'jpgs': pair1}
row2 = {'person_id': id, 'jpgs': pair2}
row3 = {'person_id': id, 'jpgs': pair3}
row4 = {'person_id': id, 'jpgs': pair4}
df = df.append(row1, ignore_index=True)
df = df.append(row2, ignore_index=True)
df = df.append(row3, ignore_index=True)
df = df.append(row4, ignore_index=True)
print(df)
#df = df["123.jpg" not in df.jpgs]
#df = df[df['jpgs'].str.contains('123.jpg') == False]
#df = df[df.jpgs.tolist().count("123.jpg") == 0]
df = df.drop(df["123.jpg" in df.jpgs].index)
print("\n Final df")
print(df)
Since you filter on a list column, apply lambda would probably be the easiest:
df.loc[df.jpgs.apply(lambda x: "123.jpg" not in x)]
Quick comments on your attempts:
In df = df.drop(df["123.jpg" in df.jpgs].index) you are checking whether the exact value "123.jpg" is contained in the column ("123.jpg" in df.jpgs) rather than in any of the lists, which is not what you want.
df = df[df['jpgs'].str.contains('123.jpg') == False] goes in the right direction, but you are missing the regex=False keyword, as shown in Ibrahim's answer.
df[df.jpgs.count("123.jpg") == 0] is also not applicable here, since count returns the total number of non-NaN values in the Series.
For the str.contains approach, this is how it is done:
df[df.jpgs.str.contains("123.jpg", regex=False)]
You can try this:
mask = df.jpgs.apply(lambda x: '123.jpg' not in x)
df = df[mask]
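Putting it together, a self-contained sketch (DataFrame.append was removed in pandas 2.0, so the frame is built directly here):
import pandas as pd

# Same data as the question, built without the removed DataFrame.append.
df = pd.DataFrame({
    'person_id': [1, 1, 1, 1],
    'jpgs': [["123.jpg", "124.jpg"], ["125.jpg", "300.jpg"],
             ["500.jpg", "123.jpg"], ["111.jpg", "122.jpg"]],
})

# Keep only the rows whose list does NOT contain "123.jpg".
df = df[df['jpgs'].apply(lambda x: "123.jpg" not in x)]
print(df)  # only the rows at index 1 and 3 remain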

Concatenate on specific condition python

EDITED
I want to write an if condition for concatenating strings.
i.e. only if cell A1 contains a specific format of text do you concatenate, else you leave it as is.
example:
If bill number looks like: CM2/0000/, then concatenate this string with the date column (month - year), else leave the bill number as it is.
Sample Data
You can create function which does what you need and use df.apply() to execute it on all rows.
I use the example data from @Boomer's answer.
EDIT: you didn't show what you really have in the dataframe; it seems you have datetime objects in bill_date, but I used strings. I had to convert the strings to datetime to show how to work with this, and now it needs .strftime('%m-%y') (or sometimes .dt.strftime('%m-%y')) instead of .str[3:].str.replace('/','-'). Pandas uses its own format to display datetimes, so I couldn't use str(x): it gives 2019-09-15 00:00:00 instead of your 15/09/19.
import pandas as pd
df = pd.DataFrame({
    'bill_number': ['CM2/0000/', 'CM2/0000', 'CM3/0000/', 'CM3/0000'],
    'bill_date': ['15/09/19', '15/09/19', '15/09/19', '15/09/19']
})
df['bill_date'] = pd.to_datetime(df['bill_date'])
def convert(row):
    if row['bill_number'].endswith('/'):
        #return row['bill_number'] + row['bill_date'].str[3:].replace('/','-')
        return row['bill_number'] + row['bill_date'].strftime('%m-%y')
    else:
        return row['bill_number']
df['bill_number'] = df.apply(convert, axis=1)
print(df)
Result:
      bill_number  bill_date
0  CM2/0000/09-19 2019-09-15
1        CM2/0000 2019-09-15
2  CM3/0000/09-19 2019-09-15
3        CM3/0000 2019-09-15
The second idea is to create a mask:
mask = df['bill_number'].str.endswith('/')
and later use it for all values:
#df.loc[mask,'bill_number'] = df[mask]['bill_number'] + df[mask]['bill_date'].str[3:].str.replace('/','-')
df.loc[mask,'bill_number'] = df[mask]['bill_number'] + df[mask]['bill_date'].dt.strftime('%m-%y')
or
#df.loc[mask,'bill_number'] = df.loc[mask,'bill_number'] + df.loc[mask,'bill_date'].str[3:].str.replace('/','-')
df.loc[mask,'bill_number'] = df.loc[mask,'bill_number'] + df.loc[mask,'bill_date'].dt.strftime('%m-%y')
The left side needs .loc[mask,'bill_number'] instead of [mask]['bill_number'] to correctly assign values, but the right side doesn't need it.
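As a minimal illustration of that difference (the 'X' value is just a placeholder):
# Chained indexing selects first and then assigns to the temporary copy,
# so the original df may be left unchanged (pandas warns about this):
df[mask]['bill_number'] = 'X'    # SettingWithCopyWarning, df likely unchanged

# .loc performs the selection and the assignment in a single step:
df.loc[mask, 'bill_number'] = 'X'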
import pandas as pd
df = pd.DataFrame({
    'bill_number': ['CM2/0000/', 'CM2/0000', 'CM3/0000/', 'CM3/0000'],
    'bill_date': ['15/09/19', '15/09/19', '15/09/19', '15/09/19']
})
df['bill_date'] = pd.to_datetime(df['bill_date'])
mask = df['bill_number'].str.endswith('/')
#df.loc[mask,'bill_number'] = df[mask]['bill_number'] + df[mask]['bill_date'].str[3:].str.replace('/','-')
# or
#df.loc[mask,'bill_number'] = df.loc[mask,'bill_number'] + df.loc[mask,'bill_date'].str[3:].str.replace('/','-')
df.loc[mask,'bill_number'] = df[mask]['bill_number'] + df[mask]['bill_date'].dt.strftime('%m-%y')
# or
#df.loc[mask,'bill_number'] = df.loc[mask,'bill_number'] + df.loc[mask,'bill_date'].dt.strftime('%m-%y')
print(df)
The third idea is to use numpy.where():
import pandas as pd
import numpy as np
df = pd.DataFrame({
    'bill_number': ['CM2/0000/', 'CM2/0000', 'CM3/0000/', 'CM3/0000'],
    'bill_date': ['15/09/19', '15/09/19', '15/09/19', '15/09/19']
})
df['bill_date'] = pd.to_datetime(df['bill_date'])
df['bill_number'] = np.where(
    df['bill_number'].str.endswith('/'),
    #df['bill_number'] + df['bill_date'].str[3:].str.replace('/','-'),
    df['bill_number'] + df['bill_date'].dt.strftime('%m-%y'),
    df['bill_number'])
print(df)
Maybe this will work for you. It would be nice to have a data sample, as @Mike67 was suggesting, but based on your information this is what I came up with. It's bulky, but it works. I'm sure someone else will have a fancier version.
import pandas as pd
dat = {'num': ['CM2/0000/', 'CM2/0000', 'CM3/0000/', 'CM3/0000'],
       'date': ['15/09/19', '15/09/19', '15/09/19', '15/09/19']}
df = pd.DataFrame(dat)
df['date'] = df['date'].map(lambda x: str(x)[3:])
df['date'] = df['date'].str.replace('/','-')
# After the first pass 'num' no longer ends with '/', so on the second
# pass the mask is all False and the 'date' column is left untouched.
for cols in df.columns:
    df.loc[df['num'].str.endswith('/'), cols] = df['num'] + df['date']
print(df)
Results:
              num   date
0  CM2/0000/09-19  09-19
1        CM2/0000  09-19
2  CM3/0000/09-19  09-19
3        CM3/0000  09-19

Pandas Columns Operations with List

I have a pandas dataframe with two columns: the first with a single date ('action_date') and the second with a list of dates ('verification_date'). I'm trying to calculate the time difference between the date in 'action_date' and each of the dates in the corresponding 'verification_date' list, and then fill two new columns with the number of dates in verification_date whose difference is over or under 360 days.
Here is my code:
df = pd.DataFrame()
df['action_date'] = ['2017-01-01', '2017-01-01', '2017-01-03']
df['action_date'] = pd.to_datetime(df['action_date'], format="%Y-%m-%d")
df['verification_date'] = ['2016-01-01', '2015-01-08', '2017-01-01']
df['verification_date'] = pd.to_datetime(df['verification_date'], format="%Y-%m-%d")
df['user_name'] = ['abc', 'wdt', 'sdf']
df.index = df.action_date
df = df.groupby(pd.Grouper(freq='2D'))['verification_date'].apply(list).reset_index()
def make_columns(df):
    for i in range(len(df)):
        over_360 = []
        under_360 = []
        for w in [(df['action_date'][i] - x).days for x in df['verification_date'][i]]:
            if w > 360:
                over_360.append(w)
            else:
                under_360.append(w)
        df['over_360'] = len(over_360)
        df['under_360'] = len(under_360)
    return df
make_columns(df)
This kinda works, EXCEPT that the df has the same values in every row, which is wrong because the dates differ. For example, in the first row of the dataframe there IS a difference of over 360 days between the action_date and both items in the verification_date list, so the over_360 column should be populated with 2. However, it is empty, and instead the under_360 column is populated with 1, which is accurate only for the second row in 'action_date'.
I have a feeling I'm just messing up the looping but am really stuck. Thanks for all help!
Your problem was that you were always updating the whole column with the value of the last calculation with these lines:
df['over_360'] = len(over_360)
df['under_360'] = len(under_360)
What you want instead is to set the value for each row's calculation individually. You can do this by replacing the lines above with these:
df.set_value(i,'over_360',len(over_360))
df.set_value(i,'under_360',len(under_360))
This sets a value at row i in the over_360 or under_360 column; you can learn more about it in the pandas documentation.
If you don't like using set_value, you can also use this:
df.ix[i,'over_360'] = len(over_360)
df.ix[i,'under_360'] = len(under_360)
you can read about dataframe.ix in the pandas documentation.
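Note that in recent pandas versions both set_value and .ix have been removed, so the modern equivalent of the lines above uses df.at (or df.loc):
# Set a single cell at row label i and the given column.
df.at[i, 'over_360'] = len(over_360)
df.at[i, 'under_360'] = len(under_360)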
you might want to try this:
df['over_360'] = df.apply(lambda x: sum([((x['action_date'] - i).days >360) for i in x['verification_date']]) , axis=1)
df['under_360'] = df.apply(lambda x: sum([((x['action_date'] - i).days <360) for i in x['verification_date']]) , axis=1)
I believe it should be a bit faster.
You didn't specify what to do if == 360, so you can just change > or < into >= or <=.

looping through an index and the previous item

I would like to loop through a list of dates in order to obtain the difference between each date and the date before it (e.g. the difference between 4/13 and 3/13).
I'm looking for a for-loop able to scan through pairs of consecutive dates.
import pandas as pd
import numpy as np
raw_data = {'date': pd.to_datetime(pd.Series(['2017-03-30','2017-03-31','2017-04-03','2017-04-04'])),
            'age': [10, np.nan, 50, 30]}
df1 = pd.DataFrame(raw_data, columns = ['date','age'])
df1
input data
df2=df1.T
df2['new']=df2.iloc[:,3]-df2.iloc[:,2]
desired result:
raw_data = {'date': pd.to_datetime(pd.Series(['2017-03-30','2017-03-31','2017-04-03','2017-04-04'])),
            'age': [10, np.nan, 50, 30],
            'diff': [10, -10, 50, -20]}
output = pd.DataFrame(raw_data, columns=['date','age','diff'])
output
You don't need a loop if you just want the difference between two dates.
Maybe you should do this:
df1['difference'] = (df1['date'] - df1['date'].shift(1)).apply(lambda x: pd.NaT if x is pd.NaT else x.days)
or this:
df1['difference'] = (df1['date'] - df1['date'].shift(-1)).apply(lambda x: pd.NaT if x is pd.NaT else x.days)
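As an aside (an equivalent sketch, not part of the original answer), Series.diff with the .dt.days accessor expresses the same computation more directly:
# Difference to the previous row, in whole days (NaN for the first row).
df1['difference'] = df1['date'].diff().dt.days
# The shift(-1) variant, i.e. difference to the next row.
df1['difference'] = (df1['date'] - df1['date'].shift(-1)).dt.days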
