I have a simple CSV file named input.csv as follows:
name,money
Dan,200
Jimmy,xd
Alice,15
Deborah,30
I want to write a python script that sanitizes the data in the money column:
every value that has non-numerical characters needs to be replaced with 0
This is my attempt so far:
import pandas as pd
df = pd.read_csv(
    "./input.csv",
    sep = ","
)
# this line is the problem: it doesn't update on a row by row basis, it updates all rows
df['money'] = df['money'].replace(to_replace=r'[^0‐9]', value=0, regex=True)
df.to_csv("./output.csv", index = False)
The problem is that when the script runs, because the invalid money value xd exists on one of the rows, it changes ALL money values to 0 for ALL rows.
I want it to ONLY change the money value for the second data row (Jimmy) which has the invalid value.
this is what it gives at the end:
name,money
Dan,0
Jimmy,0
Alice,0
Deborah,0
but what I need it to give is this:
name,money
Dan,200
Jimmy,0
Alice,15
Deborah,30
What is the problem?
You can use:
df['money'] = pd.to_numeric(df['money'], errors='coerce').fillna(0).astype(int)
The above assumes all valid values are integers. You can leave off the .astype(int) if you want float values.
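As a quick check of the one-liner above, here is a small sketch that runs it on the sample data from the question. Reading the CSV from an in-memory string instead of input.csv is an assumption made only to keep the example self-contained.

```python
import io
import pandas as pd

# Sample data from the question, held in memory instead of input.csv.
csv_text = "name,money\nDan,200\nJimmy,xd\nAlice,15\nDeborah,30\n"
df = pd.read_csv(io.StringIO(csv_text))

# Non-numeric strings become NaN, NaN becomes 0, then cast back to int.
df['money'] = pd.to_numeric(df['money'], errors='coerce').fillna(0).astype(int)

print(df.to_csv(index=False))
```

Only Jimmy's row ends up as 0; the other rows keep their original values.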
Another option would be to use a converter function in the read_csv method. Again, this assumes integers. You can use float(x) in place of int(x) if you expect float money values:
def convert_to_int(x):
    try:
        return int(x)
    except ValueError:
        return 0
df = pd.read_csv(
    'input.csv',
    converters={'money': convert_to_int}
)
Some list comprehension could work for this (given the "money" column has no decimals). Because read_csv loads a column that contains a value like xd as strings, test the content of each string rather than its type:
df['money'] = [int(x) if str(x).isdecimal() else 0 for x in df['money']]
If you are dealing with decimals, then something like:
df['money'] = [float(x) if str(x).replace('.', '', 1).isdecimal() else 0 for x in df['money']]
... will work. Just know that pandas will convert the entire "money" column to float (decimals).
Probably easy, but I am still learning.
I am creating a new column in a dask dataframe. Its value depends on the year, which I get by extracting the last four characters of a date column stored as a string in ddmmyyyy format.
What I did:
I have a list of inv_years
extracted the last four characters of the date string
tried to define a function that returns 1 in a new column if the extracted year is in the inv_years list, else 0.
Issue: How do I write a working function or, better, a shorter lambda function?
def valid_yr(x):
    inv_years = ['1921','1969','2026','2030','2041','2060','2062']
    validity_year = ddf['string_ddmmyyyy'].str[-4:] #extract the last four to get the year
    if validity_year.isin(inv_years):
        x = 1
    else:
        x = 0
    return x
#create a new column and apply function
ddf['validity_year']= ??? # what to write here?
A very grumpy way I could come up with is
inv_years = ['1921','1969','2026','2030','2041','2060','2062']
ddf['validity_year'] = ddf.apply(lambda row: 1 if row.string_ddmmyyyy[-4:] in inv_years else 0, axis=1)
Or, to get your approach working, we first modify your function a bit so that its argument is a single row.
def valid_yr(row):
    inv_years = ['1921','1969','2026','2030','2041','2060','2062']
    validity_year = row.string_ddmmyyyy[-4:]
    if validity_year in inv_years:
        x = 1
    else:
        x = 0
    return x
Now we can apply this function to all rows.
ddf['validity_year'] = ddf.apply(valid_yr, axis=1)
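If the row-wise apply turns out to be slow, a vectorized form may be worth trying. The sketch below uses a plain pandas DataFrame with made-up dates. Dask mirrors the .str and .isin accessors used here, but that it behaves identically on your ddf is an assumption you should verify.

```python
import pandas as pd

inv_years = ['1921', '1969', '2026', '2030', '2041', '2060', '2062']

# Hypothetical sample data in ddmmyyyy string form.
df = pd.DataFrame({'string_ddmmyyyy': ['01011921', '15062020', '31122026']})

# Slice the last four characters of every value at once, test membership
# in inv_years, and turn the resulting booleans into 1/0.
df['validity_year'] = df['string_ddmmyyyy'].str[-4:].isin(inv_years).astype(int)

print(df)
```

This avoids calling a Python function once per row, which is usually the slow part of apply with axis=1.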
I have a list and a DataFrame as follows. I want to search for each element of the list in the lists stored in the mention column and, where one is found, put that word in a new column. I've tried, but my solution is not correct. Can anyone please help me?
The List :
list_m = ['KathyConWom',
'monkeyhead78',
'acorncarver',
'bonglez',
'9NewsQueensland',
'paulinedaniels',
'AdvoBarryRoux',
'_sara_jade_',
'theage',
'gaskell_mike',
'saidtarraf',
'BroHilderchump',
'jodyvance',
'COdendahl',
'pfizer',
'RobertKennedyJr',
'Real_Sheezzii',
'Kellie_Martin',
'ThatsOurWaldo',
'SCN_Nkosi',
'azsweetheart013']
name of DataFrame: test
user_id text tweet_id user_name mention
22 1334471712528855040 #KathyConWom #JamesDelingpole Time to stand-up... 1362119551375314948 #KYourrights [KathyConWom, JamesDelingpole]
23 334131548 #KathyConWom #Exp_Sec_Prof It seems like weste... 1362096715877212161 #GowTolson [KathyConWom, Exp]
24 1252182507715526657 #KathyConWom I guess that the hard part would ... 1362096654514552837 #Peterpu52451065 [KathyConWom]
What I want :
user_id text tweet_id user_name mention new_col
22 1334471712528855040 #KathyConWom #JamesDelingpole Time to stand-up... 1362119551375314948 #KYourrights [KathyConWom, JamesDelingpole] KathyConWom
23 334131548 #KathyConWom #Exp_Sec_Prof It seems like weste... 1362096715877212161 #GowTolson [KathyConWom, Exp] KathyConWom
24 1252182507715526657 #KathyConWom I guess that the hard part would ... 1362096654514552837 #Peterpu52451065 [azsweetheart013] azsweetheart013
what I tried :
for index, row in df.iterrows():
    for i in list_m:
        i in test.mention
        test["c"] = i
test
You can use the intersection operation of set to find the common part of two lists.
df['new_col'] = df['mention'].apply(lambda mentions: list(set(mentions).intersection(list_m)))
To turn the list into a string, you can use:
df['new_col'] = df['mention'].apply(lambda mentions: ', '.join(set(mentions).intersection(list_m)))
try this
def add(x):
    ret = ''
    for y in x:
        if y in list_m:
            if len(ret) > 0:
                ret += ','
            ret += y
    return ret
df['new_col'] = df['mention'].apply(lambda x: add(x))
You can also use np.intersect1d() to get the unique intersection in list format, as follows:
import numpy as np
df['new_col'] = df['mention'].map(lambda x: np.intersect1d(x, list_m))
If you want to convert the list to comma separated string, simply chain it with .str.join(), as follows:
import numpy as np
df['new_col'] = df['mention'].map(lambda x: np.intersect1d(x, list_m)).str.join(', ')
You can also simply use list comprehension in .apply(), as follows:
df['new_col'] = df['mention'].apply(lambda x: [y for y in x if y in list_m]).str.join(', ')
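For a long list_m, it may help to convert it to a set once so that each membership test is cheap, while still keeping the mentions in their original order (set.intersection does not guarantee order). The sketch below uses a shortened list_m and made-up mention rows, both hypothetical.

```python
import pandas as pd

# Shortened, hypothetical version of list_m from the question.
list_m = ['KathyConWom', 'pfizer', 'azsweetheart013']
allowed = set(list_m)  # set membership tests are O(1) on average

# Made-up rows; each cell of 'mention' is a list of user names.
test = pd.DataFrame({
    'mention': [['KathyConWom', 'JamesDelingpole'],
                ['Exp', 'pfizer', 'KathyConWom'],
                ['nobody']]
})

# Keep the mentions found in list_m, in the order they appear in each row.
test['new_col'] = test['mention'].apply(
    lambda mentions: ', '.join(m for m in mentions if m in allowed)
)

print(test)
```

Rows with no match get an empty string in new_col.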
So I have a dataframe called reactions_drugs
and I want to create a table called new_r_d where I keep track of how often I see a symptom for a given medication.
Here is the code I have, but I am running into errors such as "Unable to coerce to Series, length must be 3 given 0"
new_r_d = pd.DataFrame(columns = ['drugname', 'reaction', 'count'])
for i in range(len(reactions_drugs)):
    name = reactions_drugs.drugname[i]
    drug_rec_act = reactions_drugs.drug_rec_act[i]
    for rec in drug_rec_act:
        row = new_r_d.loc[(new_r_d['drugname'] == name) & (new_r_d['reaction'] == rec)]
        if row == []:
            # create new row
            new_r_d.append({'drugname': name, 'reaction': rec, 'count': 1})
        else:
            new_r_d.at[row,'count'] += 1
Assuming the rows in your current reactions (drug_rec_act) column contain one string enclosed in a list, you can convert the values in that column to lists of strings (by splitting each string on the comma delimiter) and then utilize the explode() function and value_counts() to get your desired result:
df['drug_rec_act'] = df['drug_rec_act'].apply(lambda x: x[0].split(','))
df_long = df.explode('drug_rec_act')
result = df_long.groupby('drugname')['drug_rec_act'].value_counts().reset_index(name='count')
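To make the three steps above concrete, here is a small end-to-end sketch. The drug names and reactions are made up, and the layout of drug_rec_act (a list holding one comma-separated string) follows the assumption stated in the answer.

```python
import pandas as pd

# Hypothetical data: each reactions cell is a list holding one comma-separated string.
df = pd.DataFrame({
    'drugname': ['aspirin', 'aspirin', 'ibuprofen'],
    'drug_rec_act': [['nausea,headache'], ['nausea'], ['rash']],
})

# 1. Split the single string in each cell into a list of reactions.
df['drug_rec_act'] = df['drug_rec_act'].apply(lambda x: x[0].split(','))

# 2. Give every reaction its own row.
df_long = df.explode('drug_rec_act')

# 3. Count how often each reaction occurs per drug.
result = (
    df_long.groupby('drugname')['drug_rec_act']
    .value_counts()
    .reset_index(name='count')
)

print(result)
```

The result has one row per (drugname, reaction) pair, which matches the drugname / reaction / count table the question was building by hand.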
I have a data frame in pandas, one of the columns contains time intervals presented as strings like 'P1Y4M1D'.
The example of the whole CSV:
oci,citing,cited,creation,timespan,journal_sc,author_sc
0200100000236252421370109080537010700020300040001-020010000073609070863016304060103630305070563074902,"10.1002/pol.1985.170230401","10.1007/978-1-4613-3575-7_2",1985-04,P2Y,no,no
...
I created a parsing function, that takes that string 'P1Y4M1D' and returns an integer number.
I am wondering how is it possible to change all the column values to parsed values using that function?
def do_process_citation_data(f_path):
    global my_ocan
    my_ocan = pd.read_csv("citations.csv",
                          names=['oci', 'citing', 'cited', 'creation', 'timespan', 'journal_sc', 'author_sc'],
                          parse_dates=['creation', 'timespan'])
    my_ocan = my_ocan.iloc[1:] # to remove the first row iloc - to select data by row numbers
    my_ocan['creation'] = pd.to_datetime(my_ocan['creation'], format="%Y-%m-%d", yearfirst=True)
    return my_ocan
def parse():
    mydict = dict()
    mydict2 = dict()
    i = 1
    r = 1
    for x in my_ocan['oci']:
        mydict[x] = str(my_ocan['timespan'][i])
        i +=1
    print(mydict)
    for key, value in mydict.items():
        is_negative = value.startswith('-')
        if is_negative:
            date_info = re.findall(r"P(?:(\d+)Y)?(?:(\d+)M)?(?:(\d+)D)?$", value[1:])
        else:
            date_info = re.findall(r"P(?:(\d+)Y)?(?:(\d+)M)?(?:(\d+)D)?$", value)
        year, month, day = [int(num) if num else 0 for num in date_info[0]] if date_info else [0,0,0]
        daystotal = (year * 365) + (month * 30) + day
        if not is_negative:
            #mydict2[key] = daystotal
            return daystotal
        else:
            #mydict2[key] = -daystotal
            return -daystotal
    #print(mydict2)
    #return mydict2
Probably I do not even need to change the whole column with new parsed values, the final goal is to write a new function that returns average time of ['timespan'] of docs created in a particular year. Since I need parsed values, I thought it would be easier to change the whole column and manipulate a new data frame.
Also, I am curious how to apply the parsing function to each ['timespan'] row without modifying the data frame. I assume it could be something like this, but I don't fully understand how to do it:
for x in my_ocan['timespan']:
    x = parse(str(my_ocan['timespan']))
How can I get a column with new values? Thank you! Peace :)
A df['timespan'].apply(parse) (as mentioned by @Dan) should work. You only need to modify the parse function so that it receives the string as an argument and returns the parsed value at the end. Something like this:
import pandas as pd
def parse_postal_code(postal_code):
    # Splitting postal code and getting first letters
    letters = postal_code.split('_')[0]
    return letters
# Example dataframe with three columns and three rows
df = pd.DataFrame({'Age': [20, 21, 22], 'Name': ['John', 'Joe', 'Carla'], 'Postal Code': ['FF_222', 'AA_555', 'BB_111']})
# This returns a new pd.Series
print(df['Postal Code'].apply(parse_postal_code))
# Can also be assigned to another column
df['Postal Code Letter'] = df['Postal Code'].apply(parse_postal_code)
print(df['Postal Code Letter'])
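Applied to the timespan case, the same idea could look like the sketch below. It reuses the regular expression and the 365/30-day approximation from the question, but takes one string and returns one number. The name parse_timespan is hypothetical.

```python
import re

# Pattern taken from the question: optional years, months and days after "P".
_DURATION = re.compile(r"P(?:(\d+)Y)?(?:(\d+)M)?(?:(\d+)D)?$")

def parse_timespan(value):
    """Convert a string such as 'P1Y4M1D' or '-P2M' to an approximate day count."""
    text = str(value)
    is_negative = text.startswith('-')
    match = _DURATION.match(text[1:] if is_negative else text)
    if match is None:
        return 0  # unparseable input falls back to 0, as in the question
    year, month, day = (int(num) if num else 0 for num in match.groups())
    # Same approximation as the question: 365-day years and 30-day months.
    days_total = year * 365 + month * 30 + day
    return -days_total if is_negative else days_total

print(parse_timespan('P1Y4M1D'))  # 486
print(parse_timespan('P2Y'))      # 730
```

With a function of this shape, my_ocan['timespan'].apply(parse_timespan) returns a new Series and leaves the data frame unchanged. Assigning that result to a column keeps the parsed values for later grouping by year.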
I am trying to create a new column in a pandas data frame by calculating the value from existing columns.
I have 3 existing columns ("launched_date", "item_published_at", "item_created_at")
However, my "if row[column_name] is not None:" statement is allowing columns with NaN value and not skipping to the next statement.
In the code below, I would not expect the value of "nan" to be printed after the first conditional, I would expect something like "2018-08-17"
df['adjusted_date'] = df.apply(lambda row: adjusted_launch(row), axis=1)
def adjusted_launch(row):
    if row['launched_date'] is not None:
        print(row['launched_date'])
        exit()
        adjusted_date = date_to_time_in_timezone(row['launched_date'])
    elif row['item_published_at'] is not None:
        adjusted_date = row['item_published_at'] #make datetime in PST
    else:
        adjusted_date = row['item_created_at'] #make datetime in PST
    return adjusted_date
How can I structure this conditional statement correctly?
First fill "nan" as string where the data is empty
df.fillna("nan",inplace=True)
Then in function you can apply if condition like:
def adjusted_launch(row):
    if row['launched_date'] !='nan':
        ......
Second solution:
import numpy as np
import pandas as pd
df.fillna(np.nan,inplace=True)
# suggested by @ShadowRanger
def funct(row):
    if pd.notnull(row['col']):
        pass
df = df.where((pd.notnull(df)), None)
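This will replace all NaNs with None, so the existing "is not None" checks work with no other modifications. On recent pandas versions, numeric columns may keep NaN unless they are cast to object first (for example with df.astype(object)).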
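Another option is to leave the NaN values in place and test for them with pd.notna, which is False for both NaN and None. The sketch below uses made-up dates and leaves out the timezone conversion, because date_to_time_in_timezone from the question is not shown.

```python
import numpy as np
import pandas as pd

# Hypothetical rows: some dates are missing (NaN after loading).
df = pd.DataFrame({
    'launched_date': ['2018-08-17', np.nan, np.nan],
    'item_published_at': ['2018-08-01', '2018-09-02', np.nan],
    'item_created_at': ['2018-07-30', '2018-09-01', '2018-10-05'],
})

def adjusted_launch(row):
    # pd.notna is False for both NaN and None, unlike "is not None",
    # which is True for NaN because NaN is a float, not None.
    if pd.notna(row['launched_date']):
        return row['launched_date']
    elif pd.notna(row['item_published_at']):
        return row['item_published_at']
    return row['item_created_at']

df['adjusted_date'] = df.apply(adjusted_launch, axis=1)

print(df['adjusted_date'])
```

Each row takes the first available date in the order launched_date, item_published_at, item_created_at.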