How to get the pandas series diff in a for loop?

I have a timeseries called ts, with some values as shown below:
import numpy as np
import pandas as pd
ts = pd.Series(range(10))
ts.index = pd.date_range('2019-01-01',periods=len(ts))
ts
I can apply the differencing multiple times like this:
ts.diff().dropna()
ts.diff().diff().dropna()
How can I do this with a for loop?
for d in range(7):
    tsx = ?  # I don't know what to do here

We have pd.eval:
for d in range(7):
    tsx = pd.eval('ts' + '.diff()' * d + '.dropna()')

You can simply use the built-in eval function:
for d in range(7):
    tsx = eval('ts' + '.diff()' * d + '.dropna()')
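If you would rather avoid eval entirely, you can accumulate the differencing in a plain loop; a minimal sketch (at the top of each iteration, tsx is ts differenced d times):
tsx = ts.dropna()
for d in range(7):
    # use tsx here; it is ts differenced d times
    print(d, len(tsx))
    tsx = tsx.diff().dropna()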

Related

Iterating over each row in pandas to evaluate a condition

I have the following code:
import pandas as pd
from pandas_datareader import data as web
import numpy as np
import math
data = web.DataReader('goog', 'yahoo')
data['lifetime'] = data['High'].asfreq('D').rolling(window=999999, min_periods=1).max()  # check whether it is a lifetime high
How can I compare the two so that I get a boolean (as 1 and 0, preferably) indicating whether data['High'] is close to data['lifetime'] for each row? This is what I tried:
data['isclose'] = math.isclose(data['High'], data['lifetime'], rel_tol=0.003)
Any help would be appreciated.
You can use np.where() together with np.isclose() (note that math.isclose() works on scalars, not on whole columns):
import numpy as np
data['isclose'] = np.where(np.isclose(data['High'], data['lifetime'], rtol=0.003), 1, 0)
You could also use pandas' apply() function:
import math
from pandas_datareader import data as web
data = web.DataReader("goog", "yahoo")
data["lifetime"] = data["High"].asfreq("D").rolling(window=999999, min_periods=1).max()
data["isclose"] = data.apply(
lambda row: 1 if math.isclose(row["High"], row["lifetime"], rel_tol=0.003) else 0,
axis=1,
)
print(data)
However, yudhiesh's solution using np.where() is faster.
See also: Why is np.where faster than pd.apply
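If you only need the 1/0 column, note that np.isclose() already returns a boolean array, so you can also cast it directly and skip both where() and apply(); a minimal sketch:
import numpy as np
# the boolean result of np.isclose casts straight to 0/1
data['isclose'] = np.isclose(data['High'], data['lifetime'], rtol=0.003).astype(int)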

Pandas: memory usage when working with very many columns using Groupby

I have a dataframe with over 1000 columns and I would like to know whether it makes a difference in memory usage and/or speed to run a groupby directly on the dataframe, or to first create a smaller columnwise subset of it:
df[['xnew','ynew','znew']] = df.groupby(['a','b'])[['x','y','z']].transform(lambda f: f.rolling(3).mean().shift())
or,
df2=df[['a','b','x','y','z']]
df2[['xnew','ynew','znew']] = df2.groupby(['a','b'])[['x','y','z']].transform(lambda f: f.rolling(3).mean().shift())
df=pd.concat([df,df2[['xnew','ynew','znew']]],axis=1)
I would like to test this myself but I am unfamiliar with how to do it. Advice on how to test this would be much appreciated.
The short answer is no, it doesn't matter on either dimension. From a Colab notebook:
%load_ext memory_profiler
import pandas as pd
import numpy as np
d = {'a': [1]*100 + [2]*100, 'b': [3]*50 + [4]*50 + [5]*50 + [6]*50}
for i in range(1000):
    d[i] = np.random.random(200)
for c in 'xyz':
    d[c] = np.random.random(200)
df = pd.DataFrame(d)
%time %memit df[['xnew','ynew','znew']] = df.groupby(['a','b'])[['x','y','z']].transform(lambda f: f.rolling(3).mean().shift())
%%time
%%memit
df2=df[['a','b','x','y','z']]
df2[['xnew','ynew','znew']] = df2.groupby(['a','b'])[['x','y','z']].transform(lambda f: f.rolling(3).mean().shift())
df=pd.concat([df,df2[['xnew','ynew','znew']]],axis=1)
A simple way to test this yourself is to record the start time, then subtract it from the time at the end of the process to get the elapsed time.
import time
start = time.time()
# run the code you want to measure here
process_time = time.time() - start
print(process_time)
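Applied to the question, a minimal sketch (assuming the df built in the notebook cell above; the second run writes to differently named columns so the two variants don't collide):
import time

start = time.time()
df[['xnew','ynew','znew']] = df.groupby(['a','b'])[['x','y','z']].transform(lambda f: f.rolling(3).mean().shift())
print('direct:', time.time() - start)

start = time.time()
df2 = df[['a','b','x','y','z']].copy()  # .copy() avoids SettingWithCopyWarning
df2[['xnew2','ynew2','znew2']] = df2.groupby(['a','b'])[['x','y','z']].transform(lambda f: f.rolling(3).mean().shift())
df = pd.concat([df, df2[['xnew2','ynew2','znew2']]], axis=1)
print('subset:', time.time() - start)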

How to optimize Levenshtein edit distance on a pandas dataframe using Python?

I am running a Levenshtein comparison on 50k records. I need to compare each record against every other record. Is there a way to optimize the following code to run faster? The data is stored in a pandas dataframe.
import pandas as pd
import numpy as np
import Levenshtein
df_s_sorted = df.sort_values(['nonascii_2', 'birth_date'])
df_similarity = pd.DataFrame()
q = 0
for index, p in df_s_sorted.iterrows():
    q = q + 1
    print(q)
    for index, p1 in df_s_sorted.iterrows():
        if (p["birth_date"] == p1["birth_date"]) & (p["name"] != p1["name"]):
            if Levenshtein.distance(p["name"], p1["name"]) == 1:
                df_similarity = df_similarity.append(p)
                print(p)
    df_s_sorted.drop(index, inplace=True)
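Since a pair can only ever match when the birth dates are equal, one common optimization is to group by birth_date first and run the Levenshtein comparison only within each group, which cuts the all-pairs scan down dramatically. A minimal sketch, assuming the df from the question and the python-Levenshtein package:
import pandas as pd
import Levenshtein
from itertools import combinations

similar_rows = []
# only rows sharing a birth_date can ever match, so group first
for _, group in df.groupby('birth_date'):
    names = group['name'].tolist()
    # compare each unordered pair of rows within the group
    for (i, a), (_, b) in combinations(enumerate(names), 2):
        if a != b and Levenshtein.distance(a, b) == 1:
            similar_rows.append(group.iloc[i])
df_similarity = pd.DataFrame(similar_rows)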

How to reference a dataframe name as a string in a for loop?

Here is my code.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import quandl
start = pd.to_datetime('2012-01-01')
end = pd.to_datetime('2017-01-01')
aapl = quandl.get('WIKI/aapl.11', start_date = start, end_date= end )
csco = quandl.get('WIKI/csco.11', start_date = start, end_date= end )
ibm = quandl.get('WIKI/ibm.11', start_date = start, end_date= end )
amzn = quandl.get('WIKI/amzn.11', start_date = start, end_date= end )
This creates 4 data frames. I want to be able to achieve this by using a for loop.
Here is what I imagine the for loop would look like.
for i in [aapl, csco, ibm, amzn]:
    a = 'WIKI/' + i + '.11'
    i = quandl.get(a, start_date=start, end_date=end)
I would like to be able to reference the name of each dataframe as a string inside the loop, so I can use it in other functions that take the dataframe's name as a string.
Can anyone help me with this or suggest an alternative approach that achieves the same result? I am hoping to do this in a way that scales to hundreds of loop iterations.
Thank you for your help.
Equivalent to G. B.'s answer, but without dictionary comprehension in case you are not familiar with it yet.
import pandas as pd
import quandl
start = pd.to_datetime('2012-01-01')
end = pd.to_datetime('2017-01-01')
data = {}
for key in ['aapl', 'csco', 'ibm', 'amzn']:
    name = 'WIKI/' + key + '.11'
    data[key] = quandl.get(name, start_date=start, end_date=end)
# Then you can use it like
data['aapl'].DoSomething()
For reference, G. B.'s dictionary-comprehension version:
inames = ['aapl', 'csco', 'ibm', 'amzn']
data = {name: quandl.get('WIKI/' + name + '.11', start_date=start, end_date=end) for name in inames}
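Either way, once the frames live in a dict keyed by ticker, the string name is always available alongside the data; for example:
# iterate over all frames by name
for name, frame in data.items():
    print(name, frame.shape)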
You are almost there! All you have to do is enclose the list elements in strings.
What you are doing now:
a = 'WIKI/'+ i + '.11' # Where i is an object, probably a pointer/instance
What you need to do:
a = 'WIKI/'+ i + '.11' # Where i is a good-old-happy string
So, your code should look something like:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import quandl
start = pd.to_datetime('2012-01-01')
end = pd.to_datetime('2017-01-01')
# You have to concatenate strings to build the query name (you were adding DataFrame objects)
L = ["aapl", "csco", "ibm", "amzn"]
L_i = []
for i in L:
    a = 'WIKI/' + i + '.11'
    i = quandl.get(a, start_date=start, end_date=end)
    L_i.append(i)  # then do whatever you want with each frame

Count occurrences of number from specific column in python

I am trying to do the equivalent of Excel's COUNTIF() function. I am stuck on how to tell the .count() function to read from a specific column.
I have
df = pd.read_csv('testdata.csv')
df.count('1')
but this does not work, and even if it did it is not specific enough.
I am thinking I may have to use read_csv to read specific columns individually.
Example:
Column name
4
4
3
2
4
1
The function would output that there is one '1', and I could run it again to find that there are three '4' values, etc.
I got it to work! Thank you
I used:
print(df.col.value_counts().loc['x'])
Here is an example of a simple 'countif' recipe you could try:
import pandas as pd
def countif(rng, criteria):
    # count the cells in rng equal to criteria, like Excel's COUNTIF
    return rng.eq(criteria).sum()
Example use:
df = pd.DataFrame({'column1': [4, 4, 3, 2, 4, 1],
                   'column2': [1, 2, 3, 4, 5, 6]})
countif(df['column1'], 1)
If all else fails, why not try something like this?
import numpy as np
import pandas
import matplotlib.pyplot as plt
df = pandas.DataFrame(data=np.random.randint(0, 100, size=100), columns=["col1"])
counters = {}
for i in range(len(df)):
    value = df.iloc[i]["col1"]
    if value in counters:
        counters[value] += 1
    else:
        counters[value] = 1
print(counters)
plt.bar(counters.keys(), counters.values())
plt.show()
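For what it's worth, pandas can produce the same counts in a single call; a minimal equivalent of the loop above:
counts = df["col1"].value_counts()  # Series mapping each value to its frequency
print(counts.to_dict())
counts.sort_index().plot.bar()
plt.show()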
