I am trying to create a new column in a pandas data frame by calculating the value from existing columns.
I have 3 existing columns ("launched_date", "item_published_at", "item_created_at")
However, my "if row[column_name] is not None:" statement is allowing columns with NaN value and not skipping to the next statement.
In the code below, I would not expect the value "nan" to be printed after the first conditional; I would expect something like "2018-08-17".
df['adjusted_date'] = df.apply(lambda row: adjusted_launch(row), axis=1)

def adjusted_launch(row):
    if row['launched_date'] is not None:
        print(row['launched_date'])
        exit()
        adjusted_date = date_to_time_in_timezone(row['launched_date'])
    elif row['item_published_at'] is not None:
        adjusted_date = row['item_published_at']  # make datetime in PST
    else:
        adjusted_date = row['item_created_at']  # make datetime in PST
    return adjusted_date
How can I structure this conditional statement correctly?
First fill "nan" as string where the data is empty
df.fillna("nan",inplace=True)
Then, in the function, you can apply the condition like:
def adjusted_launch(row):
    if row['launched_date'] != 'nan':
        ......
Second solution:
import numpy as np
df.fillna(np.nan,inplace=True)
#suggested by #ShadowRanger
def funct(row):
    if pd.notnull(row['col']):
        pass
df = df.where((pd.notnull(df)), None)
This will replace all NaNs with None; no other modifications are required.
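Putting the pieces together, here is a minimal sketch of the original function rewritten around pd.notnull, which treats both NaN and None as missing (date_to_time_in_timezone is the asker's own helper and is assumed to exist):

import pandas as pd

def adjusted_launch(row):
    # pd.notnull(...) is False for both NaN and None, so no fillna step is needed
    if pd.notnull(row['launched_date']):
        return date_to_time_in_timezone(row['launched_date'])
    elif pd.notnull(row['item_published_at']):
        return row['item_published_at']   # make datetime in PST
    else:
        return row['item_created_at']     # make datetime in PST

df['adjusted_date'] = df.apply(adjusted_launch, axis=1)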
I have a simple CSV file named input.csv as follows:
name,money
Dan,200
Jimmy,xd
Alice,15
Deborah,30
I want to write a python script that sanitizes the data in the money column:
every value that has non-numerical characters needs to be replaced with 0
This is my attempt so far:
import pandas as pd
df = pd.read_csv(
    "./input.csv",
    sep=","
)
# this line is the problem: it doesn't update on a row by row basis, it updates all rows
df['money'] = df['money'].replace(to_replace=r'[^0-9]', value=0, regex=True)
df.to_csv("./output.csv", index = False)
The problem is that when the script runs, because the invalid money value xd exists on one of the rows, it changes ALL money values to 0 for ALL rows.
I want it to ONLY change the money value for the second data row (Jimmy) which has the invalid value.
This is what it gives at the end:
name,money
Dan,0
Jimmy,0
Alice,0
Deborah,0
but what I need it to give is this:
name,money
Dan,200
Jimmy,0
Alice,15
Deborah,30
What is the problem?
You can use:
df['money'] = pd.to_numeric(df['money'], errors='coerce').fillna(0).astype(int)
The above assumes all valid values are integers. You can leave off the .astype(int) if you want float values.
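For reference, a quick sketch of what this produces on the sample file above (assuming it is saved as input.csv):

import pandas as pd

df = pd.read_csv("input.csv")
# to_numeric turns 'xd' into NaN, fillna(0) replaces it, astype(int) restores integers
df['money'] = pd.to_numeric(df['money'], errors='coerce').fillna(0).astype(int)
print(df)
#       name  money
# 0      Dan    200
# 1    Jimmy      0
# 2    Alice     15
# 3  Deborah     30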
Another option would be to use a converter function in the read_csv method. Again, this assumes integers. You can use float(x) in place of int(x) if you expect float money values:
def convert_to_int(x):
    try:
        return int(x)
    except ValueError:
        return 0

df = pd.read_csv(
    'input.csv',
    converters={'money': convert_to_int}
)
Some list comprehension could work for this (given the "money" column has no decimals):
df.money = [x if type(x) == int else 0 for x in df.money]
If you are dealing with decimals, then something like:
df.money = [x if (type(x) == int) or (type(x) == float) else 0 for x in df.money]
... will work. Just know that pandas will convert the entire "money" column to float (decimals).
I have a data frame in pandas, one of the columns contains time intervals presented as strings like 'P1Y4M1D'.
The example of the whole CSV:
oci,citing,cited,creation,timespan,journal_sc,author_sc
0200100000236252421370109080537010700020300040001-020010000073609070863016304060103630305070563074902,"10.1002/pol.1985.170230401","10.1007/978-1-4613-3575-7_2",1985-04,P2Y,no,no
...
I created a parsing function, that takes that string 'P1Y4M1D' and returns an integer number.
I am wondering how it is possible to change all the column values to parsed values using that function?
import re
import pandas as pd

def do_process_citation_data(f_path):
    global my_ocan
    my_ocan = pd.read_csv("citations.csv",
                          names=['oci', 'citing', 'cited', 'creation', 'timespan', 'journal_sc', 'author_sc'],
                          parse_dates=['creation', 'timespan'])
    my_ocan = my_ocan.iloc[1:]  # to remove the first row; iloc - to select data by row numbers
    my_ocan['creation'] = pd.to_datetime(my_ocan['creation'], format="%Y-%m-%d", yearfirst=True)
    return my_ocan

def parse():
    mydict = dict()
    mydict2 = dict()
    i = 1
    r = 1
    for x in my_ocan['oci']:
        mydict[x] = str(my_ocan['timespan'][i])
        i += 1
    print(mydict)
    for key, value in mydict.items():
        is_negative = value.startswith('-')
        if is_negative:
            date_info = re.findall(r"P(?:(\d+)Y)?(?:(\d+)M)?(?:(\d+)D)?$", value[1:])
        else:
            date_info = re.findall(r"P(?:(\d+)Y)?(?:(\d+)M)?(?:(\d+)D)?$", value)
        year, month, day = [int(num) if num else 0 for num in date_info[0]] if date_info else [0, 0, 0]
        daystotal = (year * 365) + (month * 30) + day
        if not is_negative:
            #mydict2[key] = daystotal
            return daystotal
        else:
            #mydict2[key] = -daystotal
            return -daystotal
    #print(mydict2)
    #return mydict2
Probably I do not even need to change the whole column to the new parsed values; the final goal is to write a new function that returns the average ['timespan'] of docs created in a particular year. Since I need the parsed values, I thought it would be easier to change the whole column and manipulate a new data frame.
Also, I am curious what a way would be to apply the parsing function to each ['timespan'] row without modifying the data frame. I can only assume it could be something like this, but I don't have a full understanding of how to do that:
for x in my_ocan['timespan']:
    x = parse(str(my_ocan['timespan']))
How can I get a column with new values? Thank you! Peace :)
A df['timespan'].apply(parse) (as mentioned by #Dan) should work. You would only need to modify the parse function so that it receives the string as an argument and returns the parsed value at the end. Something like this:
import pandas as pd
def parse_postal_code(postal_code):
    # Splitting postal code and getting first letters
    letters = postal_code.split('_')[0]
    return letters
# Example dataframe with three columns and three rows
df = pd.DataFrame({'Age': [20, 21, 22], 'Name': ['John', 'Joe', 'Carla'], 'Postal Code': ['FF_222', 'AA_555', 'BB_111']})
# This returns a new pd.Series
print(df['Postal Code'].apply(parse_postal_code))
# Can also be assigned to another column
df['Postal Code Letter'] = df['Postal Code'].apply(parse_postal_code)
print(df['Postal Code Letter'])
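Applied to the question's data, a minimal sketch along the same lines (assuming the timespan strings always follow the P...Y...M...D pattern shown, and reusing the question's regex and day arithmetic):

import re
import pandas as pd

def parse_timespan(value):
    # Convert an ISO-8601-style duration such as 'P1Y4M1D' into a day count
    value = str(value)
    is_negative = value.startswith('-')
    if is_negative:
        value = value[1:]
    date_info = re.findall(r"P(?:(\d+)Y)?(?:(\d+)M)?(?:(\d+)D)?$", value)
    year, month, day = [int(num) if num else 0 for num in date_info[0]] if date_info else [0, 0, 0]
    days_total = (year * 365) + (month * 30) + day
    return -days_total if is_negative else days_total

# New column with parsed values; the original 'timespan' column is left untouched
my_ocan['timespan_days'] = my_ocan['timespan'].apply(parse_timespan)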
When I try to add a new column to an existing dataframe, the new column only has empty values. However, when I print "result" before assigning it to the dataframe, it works fine! And thus I get this weird max() error:
ValueError: max() arg is an empty sequence
I'm using mplfinance to plot the data
strategy.py
def moving_average(self, df, i):
    signal = df['sma20'][i] * 1.10
    if (df['sma20'][i] > df['sma50'][i]) & (signal > df['Close'][i]):
        return df['Close'][i]
    else:
        return None
trading.py
for i in range(0, len(df['Close'])-1):
    result = strategy.moving_average(df, i)
    print(result)
    df['buy'] = result

df.to_csv('test.csv', encoding='utf-8')
apd = mpf.make_addplot(df['buy'], scatter=True, marker='^')
mpf.plot(df, type='candle', addplot=apd)
Based on the very small amount of information here, and on your comment
"because df['buy'] column has nan values only."
I'm going to guess that your problem is that strategy.moving_average() is returning None instead of nan when there is no signal.
There is a big difference between None and nan. (The main issue is that nan supports math, whereas None does not; and as a general rule plotting packages always do math).
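A tiny illustration of that difference:

import numpy as np

print(np.nan + 1)   # nan  -- arithmetic with nan quietly propagates
# print(None + 1)   # TypeError: unsupported operand type(s) for +: 'NoneType' and 'int'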
I suggest you import numpy as np and then in strategy.moving_average()
change return None
to return np.nan.
ALSO just saw another problem.
You are only assigning a single value to df['buy'].
You need to take it out of the loop.
I suggest initializing result as an empty list before the loop, then:
result = []
for i in range(0, len(df['Close'])-1):
    result.append(strategy.moving_average(df, i))
print(result)

df['buy'] = result
df.to_csv('test.csv', encoding='utf-8')
apd = mpf.make_addplot(df['buy'], scatter=True, marker='^')
mpf.plot(df, type='candle', addplot=apd)
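As a side note, a vectorized sketch that avoids the Python loop entirely (assuming the same sma20/sma50/Close columns as in strategy.py) could look like this:

import numpy as np

# Buy where sma20 > sma50 and 110% of sma20 exceeds Close; NaN everywhere else
signal = df['sma20'] * 1.10
df['buy'] = np.where((df['sma20'] > df['sma50']) & (signal > df['Close']), df['Close'], np.nan)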
I am trying to have Python Pandas compare two dataframes with each other. In dataframe 1, I have two columns (AC-Cat and Origin). I am trying to compare the AC-Cat column with the inputs of Dataframe 2. If a match is found between one of the columns of Dataframe 2 and the value of Dataframe 1 being studied, I want Pandas to copy the header of the column of Dataframe 2 in which the match is found to a new column in Dataframe 1.
DF1:
f = {'AC-Cat': pd.Series(['B737', 'A320', 'MD11']),
     'Origin': pd.Series(['AJD', 'JFK', 'LRO'])}
Flight_df = pd.DataFrame(f)
DF2:
w = {'CAT-C': pd.Series(['DC85', 'IL76', 'MD11', 'TU22', 'TU95']),
     'CAT-D': pd.Series(['A320', 'A321', 'AN12', 'B736', 'B737'])}
WCat_df = pd.DataFrame(w)
I imported pandas as pd and numpy as np and tried to define a function to compare these columns.
def get_wake_cat(AC_cat):
    try:
        Wcat = [WCat_df.columns.values[0]][WCat_df.iloc[:,1]==AC_cat].values[0]
    except:
        Wcat = np.NAN
    return Wcat
Flight_df.loc[:,'CAT'] = Flight_df.loc[:,'AC-Cat'].apply(lambda CT: get_wake_cat(CT))
However, the function does not result in the desired outputs. For example: Take the B737 AC-Cat value. I want Python Pandas to then find this value in DF2 in the column CAT-D and copy this header to the new column of DF 1. This does not happen. Can someone help me find out why my code is not giving the desired results?
Not pretty, but I think I got it working. Part of the error was that the function did not take WCat_df as an argument. I also changed the indexing into two steps:
def get_wake_cat(AC_cat, WCat_df):
    try:
        d = WCat_df[WCat_df.columns.values][WCat_df.iloc[:]==AC_cat]
        Wcat = d.columns[(d==AC_cat).any()][0]
    except:
        Wcat = np.NAN
    return Wcat
Then you need to change your next line to:
Flight_df.loc[:,'CAT'] = Flight_df.loc[:,'AC-Cat'].apply(lambda CT: get_wake_cat(CT,WCat_df ))
AC-Cat Origin CAT
0 B737 AJD CAT-D
1 A320 JFK CAT-D
2 MD11 LRO CAT-C
Hope that solves the problem
This will give you 2 new columns with the name(s) of the match(es) found:
Flight_df['CAT1'] = Flight_df['AC-Cat'].map(lambda x: 'CAT-C' if x in list(WCat_df['CAT-C']) else '')
Flight_df['CAT2'] = Flight_df['AC-Cat'].map(lambda x: 'CAT-D' if x in list(WCat_df['CAT-D']) else '')
Flight_df.loc[Flight_df['CAT1'] == '', 'CAT1'] = Flight_df['CAT2']
Flight_df.loc[Flight_df['CAT1'] == Flight_df['CAT2'], 'CAT2'] = ''
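On the example frames above, this should give roughly the following (a sketch of the expected result, with CAT2 left blank because every match here lands in a single category):

print(Flight_df)
#   AC-Cat Origin   CAT1 CAT2
# 0   B737    AJD  CAT-D
# 1   A320    JFK  CAT-D
# 2   MD11    LRO  CAT-C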
IIUC, you can do a stack and merge:
final = (Flight_df.merge(WCat_df.stack().reset_index(1, name='AC-Cat'), on='AC-Cat', how='left')
                  .rename(columns={'level_1': 'New'}))
print(final)
Or with melt:
final = Flight_df.merge(WCat_df.melt(var_name='New', value_name='AC-Cat'),
                        on='AC-Cat', how='left')
AC-Cat Origin New
0 B737 AJD CAT-D
1 A320 JFK CAT-D
2 MD11 LRO CAT-C
I'm trying to remove wrong values from my data (a series of 15 mln values, 700 MB). The values to be removed are values next to 'nan' values, e.g.:
Series: /1/,nan,/2/,3,/4/,nan,nan,nan,/8/,9
Numbers surrounded by slashes i.e. /1/,/2/,/4/,/8/ are values, which should be removed.
The problem is that it takes way too long to compute that with the following code that I have:
%%time
import numpy as np
import pandas as pd

# sample data
speed = np.random.uniform(0,25,15000000)
next_speed = speed[1:]

# create a dataframe
data_dict = {'speed': speed[:-1],
             'next_speed': next_speed}
df = pd.DataFrame(data_dict)

# calculate difference between the current speed and the next speed
list_of_differences = []
for i in df.index:
    difference = df.next_speed[i]-df.speed[i]
    list_of_differences.append(difference)
df['difference'] = list_of_differences

# add 'nan' to data in form of a string.
for i in range(len(df.difference)):
    # arbitrary condition
    if df.difference[i] < -2:
        df.difference[i] = 'nan'

#########################################
# THE TIME-INEFFICIENT LOOP
# remove wrong values before and after 'nan'.
for i in range(len(df)):
    # check if the value is a number to skip computations of the following "if" cases
    if not(isinstance(df.difference[i], str)):
        continue
    # case 1: where there's only one 'nan' surrounded by values.
    # Without this case the algo will miss some wrong values because 'nan' will be removed
    # Example of a series: /1/,nan,/2/,3,4,nan,nan,nan,8,9
    # A number surrounded by slashes e.g. /1/ is a value to be removed
    if df.difference[i] == 'nan' and df.difference[i-1] != 'nan' and df.difference[i+1] != 'nan':
        df.difference[i-1] = 'wrong'
        df.difference[i+1] = 'wrong'
    # case 2: where the following values are 'nan': /1/, nan, nan, 4
    # E.g.: /1/, nan,/2/,3,/4/,nan,nan,nan,8,9
    elif df.difference[i] == 'nan' and df.difference[i+1] == 'nan':
        df.difference[i-1] = 'wrong'
    # case 3: where next value is NOT 'nan' wrong, nan,nan,4
    # E.g.: /1/, nan,/2/,3,/4/,nan,nan,nan,/8/,9
    elif df.difference[i] == 'nan' and df.difference[i+1] != 'nan':
        df.difference[i+1] = 'wrong'
How to make it more time-efficient?
This is still a work in progress for me. I knocked 100x off your dummy data size to get down to something I could stand to wait for.
I also added this code at the top of my version:
import time
current_milli_time = lambda: int(round(time.time() * 1000))
def mark(s):
    print("[{}] {}".format(current_milli_time()/1000, s))
This just prints a string with a time-mark in front of it, to see what's taking so long.
With that done, in your 'difference' column computation, you can replace the manual list generation with a vector operation. This code:
df = pd.DataFrame(data_dict)
mark("Got DataFrame")

# calculate difference between the current speed and the next speed
list_of_differences = []
for i in df.index:
    difference = df.next_speed[i]-df.speed[i]
    list_of_differences.append(difference)
df['difference'] = list_of_differences
mark("difference 1")

df['difference2'] = df['next_speed'] - df['speed']
mark('difference 2')
print(df[:10])
Produces this output:
[1490943913.921] Got DataFrame
[1490943922.094] difference 1
[1490943922.096] difference 2
next_speed speed difference difference2
0 18.008314 20.182982 -2.174669 -2.174669
1 14.736095 18.008314 -3.272219 -3.272219
2 5.352993 14.736095 -9.383102 -9.383102
3 5.854199 5.352993 0.501206 0.501206
4 2.003826 5.854199 -3.850373 -3.850373
5 12.736061 2.003826 10.732236 10.732236
6 2.512623 12.736061 -10.223438 -10.223438
7 18.224716 2.512623 15.712093 15.712093
8 14.023848 18.224716 -4.200868 -4.200868
9 15.991590 14.023848 1.967741 1.967741
Notice that the two difference columns are the same, but the second version took about 8 seconds less time. (Presumably 800 seconds when you have 100x more data.)
I did the same thing in the 'nanify' process:
df.difference2[df.difference2 < -2] = np.nan
The idea here is that many of the binary operators actually generate either a placeholder, or a Series or vector. And that can be used as an index, so that df.difference2 < -2 becomes (in essence) a list of the places where that condition is true, and you can then index either df (the whole table) or any of the columns of df, like df.difference2, using that index. It's a fast shorthand for the otherwise-slow python for loop.
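For example, a minimal illustration of that boolean-mask indexing on a toy Series:

import numpy as np
import pandas as pd

s = pd.Series([1.0, -5.0, 3.0, -10.0])
mask = s < -2        # boolean Series: [False, True, False, True]
s[mask] = np.nan     # only the rows where the mask is True are modified
print(s.tolist())    # [1.0, nan, 3.0, nan]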
Update
Okay, finally, here is a version that vectorizes the "Time-inefficient Loop". I'm just pasting the whole thing in at the bottom, for copying.
The premise is that the Series.isnull() method returns a boolean Series (column) that is true if the contents are "missing" or "invalid" or "bogus." Generally, this means NaN, but it also recognizes Python None, etc.
The tricky part, in pandas, is shifting that column up or down by one to reflect "around"-ness.
That is, I want another boolean column, where col[n-1] is true if col[n] is null. That's my "before a nan" column. And likewise, I want another column where col[n+1] is true if col[n] is null. That's my "after a nan" column.
It turns out I had to take the damn thing apart! I had to reach in, extract the underlying numpy array using the Series.values attribute, so that the pandas index would be discarded. Then a new index is created, starting at 0, and everything works again. (If you don't strip the index, the columns "remember" what their numbers are supposed to be. So even if you delete column[0], the column doesn't shift down. Instead, is knows "I am missing my [0] value, but everyone else is still in the right place!")
Anyway, with that figured out, I was able to build three columns (needlessly - they could probably be parts of an expression) and then merge them together into a fourth column that indicates what you want: the column is True when the row is before, on, or after a nan value.
missing = df.difference2.isnull()
df['is_nan'] = missing
df['before_nan'] = np.append(missing[1:].values, False)
df['after_nan'] = np.insert(missing[:-1].values, 0, False)
df['around_nan'] = df.is_nan | df.before_nan | df.after_nan
Here's the whole thing:
import numpy as np
import pandas as pd
import time

current_milli_time = lambda: int(round(time.time() * 1000))

def mark(s):
    print("[{}] {}".format(current_milli_time()/1000, s))

# sample data
speed = np.random.uniform(0,25,150000)
next_speed = speed[1:]

# create a dataframe
data_dict = {'speed': speed[:-1],
             'next_speed': next_speed}
df = pd.DataFrame(data_dict)
mark("Got DataFrame")

# calculate difference between the current speed and the next speed
list_of_differences = []
#for i in df.index:
#    difference = df.next_speed[i]-df.speed[i]
#    list_of_differences.append(difference)
#df['difference'] = list_of_differences
#mark("difference 1")

df['difference'] = df['next_speed'] - df['speed']
mark('difference 2')

df['difference2'] = df['next_speed'] - df['speed']

# add 'nan' to data in form of a string.
#for i in range(len(df.difference)):
#    # arbitrary condition
#    if df.difference[i] < -2:
#        df.difference[i] = 'nan'
df.difference[df.difference < -2] = np.nan
mark('nanify')

df.difference2[df.difference2 < -2] = np.nan
mark('nanify 2')

missing = df.difference2.isnull()
df['is_nan'] = missing
df['before_nan'] = np.append(missing[1:].values, False)
df['after_nan'] = np.insert(missing[:-1].values, 0, False)
df['around_nan'] = df.is_nan | df.before_nan | df.after_nan
mark('looped')

#########################################
# THE TIME-INEFFICIENT LOOP
# remove wrong values before and after 'nan'.
for i in range(len(df)):
    # check if the value is a number to skip computations of the following "if" cases
    if not(isinstance(df.difference[i], str)):
        continue
    # case 1: where there's only one 'nan' surrounded by values.
    # Without this case the algo will miss some wrong values because 'nan' will be removed
    # Example of a series: /1/,nan,/2/,3,4,nan,nan,nan,8,9
    # A number surrounded by slashes e.g. /1/ is a value to be removed
    if df.difference[i] == 'nan' and df.difference[i-1] != 'nan' and df.difference[i+1] != 'nan':
        df.difference[i-1] = 'wrong'
        df.difference[i+1] = 'wrong'
    # case 2: where the following values are 'nan': /1/, nan, nan, 4
    # E.g.: /1/, nan,/2/,3,/4/,nan,nan,nan,8,9
    elif df.difference[i] == 'nan' and df.difference[i+1] == 'nan':
        df.difference[i-1] = 'wrong'
    # case 3: where next value is NOT 'nan' wrong, nan,nan,4
    # E.g.: /1/, nan,/2/,3,/4/,nan,nan,nan,/8/,9
    elif df.difference[i] == 'nan' and df.difference[i+1] != 'nan':
        df.difference[i+1] = 'wrong'
mark('time-inefficient loop done')
I am assuming that you don't want either the 'nan' or the wrong values, and that there are not many nan values compared to the size of the data. Please try this:
nan_idx = df[df['difference']=='nan'].index.tolist()
from copy import deepcopy
drop_list = deepcopy(nan_idx)
for i in nan_idx:
    if (i+1) not in drop_list and (i+1) < len(df):
        drop_list.append(i+1)
    if (i-1) not in drop_list and (i-1) >= 0:
        drop_list.append(i-1)
df = df.drop(df.index[drop_list])
If nan is not a string but an actual NaN (i.e. a missing value), use this instead to get its indexes:
nan_idx = df[pd.isnull(df['difference'])].index.tolist()
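For completeness, a fully vectorized sketch of the same neighbour removal (assuming 'difference' holds real NaN values rather than the string 'nan', and a pandas version recent enough for shift's fill_value):

missing = df['difference'].isnull()
# A row is dropped if it is NaN itself or sits directly before/after a NaN
around_nan = missing | missing.shift(1, fill_value=False) | missing.shift(-1, fill_value=False)
df = df[~around_nan]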