parsing CSV in pandas

I want to calculate the average number of successful Rattata catches per hour for this whole dataset. I am looking for an efficient way to do this with pandas. I'm new to Python and pandas.

You don't need any loops. Try this; I think the logic is fairly clear.
import pandas as pd
# read the CSV
df = pd.read_csv('pkmn.csv', header=0)
# convert the timestamp column to datetime so the date can be extracted
df['time'] = pd.to_datetime(df['time'].astype(str))
df['date'] = df['time'].dt.date
# main transformation: keep successful Rattata catches, then group by hour
grouped = df.query("Pokemon == 'rattata' and caught == True").groupby('hour')
result = pd.DataFrame()
result['caught total'] = grouped['hour'].count()
result['days'] = grouped['date'].nunique()
result['caught average'] = result['caught total'] / result['days']

If you have your pandas dataframe saved as df, and its time column holds datetime values sorted in ascending order, this should work:
rats = df.loc[df.Pokemon == "rattata"]  # subset of rows relating to Rattata
total = rats.Caught.sum()  # total number caught
diff = rats.time.iloc[-1] - rats.time.iloc[0]  # time between the first and last row
average = total / (diff / pd.Timedelta(hours=1))  # number caught per hour
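Since the layout of pkmn.csv is not shown, here is a minimal sketch of the grouped approach on made-up data. The column names `time`, `Pokemon`, and `caught`, the sample rows, and the helper `hourly_catch_average` are all assumptions for illustration, not the asker's real file.

```python
import pandas as pd

# Hypothetical sample data; the real pkmn.csv layout is not shown in the question.
df = pd.DataFrame({
    "time": pd.to_datetime([
        "2016-07-01 09:05", "2016-07-01 09:40", "2016-07-01 10:15",
        "2016-07-02 09:20", "2016-07-02 10:30", "2016-07-02 10:45",
    ]),
    "Pokemon": ["rattata", "pidgey", "rattata", "rattata", "rattata", "rattata"],
    "caught": [True, True, False, True, True, True],
})

def hourly_catch_average(frame, name="rattata"):
    """Successful catches of `name` per hour of day, averaged over the days seen."""
    hits = frame[(frame["Pokemon"] == name) & frame["caught"]]
    # Derive the grouping keys from the timestamp.
    hits = hits.assign(hour=hits["time"].dt.hour, date=hits["time"].dt.date)
    grouped = hits.groupby("hour")
    out = pd.DataFrame({
        "caught total": grouped.size(),
        "days": grouped["date"].nunique(),
    })
    out["caught average"] = out["caught total"] / out["days"]
    return out

result = hourly_catch_average(df)
print(result)
```

Note that `days` only counts days on which at least one catch happened in that hour, so hours with no catches on some days are averaged over fewer days than the dataset spans.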

Related

How to reduce the time complexity of KS test python code?

I am currently working on a project where I need to check whether two distributions are the same. I have two data frames, both containing numeric values only:
1) db_df - which comes from the db
2) data - which is the user-uploaded dataframe
I have to compare every column of db_df with every column of data, find the similar columns in data, and offer them to the user as suggestions for the db column.
Both data frames have 100 rows and 239 columns.
`
import time
import pandas as pd
from scipy.stats import kstest
row_list = []
suggestions = dict()
s = time.time()
db_data_columns = db_df.columns
data_columns = data.columns
for i in db_data_columns:
    col_list = list()
    for j in data_columns:
        # perform Kolmogorov-Smirnov test and keep the p-value
        col_list.append(kstest(
            db_df[i], data[j]
        )[1])
    row_list.append(col_list)
print(f"=== AFTER FOR TIME {time.time()-s}")
df = pd.DataFrame(row_list).T
df.columns = db_df.columns
df.index = data.columns
for i in df.columns:
    sorted_df = df.sort_values(by=[i], ascending=False)
    sorted_df = sorted_df[sorted_df > 0.05]
    sorted_df = sorted_df[:3].loc[:, i:i]
    sorted_df = sorted_df.dropna()
    suggestions[sorted_df.columns[0]] = list(sorted_df.to_dict().values())[0]
`
After getting the p-values for every column in db_df against every column in data, I need to select the top 3 columns from data for each column in db_df.
**The overall time taken is 14 seconds, which is very long. Is there any chance of reducing it to under 5 seconds?**
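The selection step described above (top 3 p-values above 0.05 per db column) can be sketched more directly with `Series.nlargest`. The p-value table, the column names, and the helper `top_suggestions` below are made up for illustration. This only simplifies the final loop and does not reduce the cost of the KS tests themselves, which is probably where most of the time goes.

```python
import pandas as pd

# Hypothetical p-value table: rows are uploaded columns, columns are db columns.
pvals = pd.DataFrame(
    {"db_a": [0.90, 0.02, 0.40, 0.07, 0.60],
     "db_b": [0.01, 0.03, 0.04, 0.20, 0.00]},
    index=["u1", "u2", "u3", "u4", "u5"],
)

def top_suggestions(pvals, alpha=0.05, k=3):
    """For each db column, keep up to k uploaded columns with the largest p-values above alpha."""
    suggestions = {}
    for db_col in pvals.columns:
        col = pvals[db_col]
        # Filter first, then take the k largest; the result is sorted in descending order.
        suggestions[db_col] = col[col > alpha].nlargest(k).to_dict()
    return suggestions

suggestions = top_suggestions(pvals)
print(suggestions)
```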

trying to figure out a pythonic way of code that is taking time even after using list comprehension and pandas

I have two dataframes: one comprising a large data set, allprice_df, with time price series for all stocks; and the other, init_df, comprising selective stocks and trade entry dates. I am trying to find the highest price for each ticker symbol and its associated date.
The following code works but it is time consuming, and I am wondering if there is a better, more Pythonic way to accomplish this.
import datetime
import itertools
import sys
import pandas as pd
# Initial call
init_df = init_df.assign(HighestHigh = lambda x:
    highestHigh(x['DateIdentified'], x['Ticker'], allprice_df))
# HighestHigh function in lambda call
def highestHigh(date1,ticker,allp_df):
    if date1.size == ticker.size:
        temp_df = pd.DataFrame(columns = ['DateIdentified','Ticker'])
        temp_df['DateIdentified'] = date1
        temp_df['Ticker'] = ticker
    else:
        print("dates and tickers size mismatching")
        sys.exit(1)
    counter = itertools.count(0)
    high_list = [getHigh(x,y,allp_df, next(counter)) for x, y in zip(temp_df['DateIdentified'],temp_df['Ticker'])]
    return high_list
# Getting high for each ticker
def getHigh(dateidentified,ticker,allp_df, count):
    print("trade %s" % count)
    currDate = datetime.datetime.now().date()
    allpm_df = allp_df.loc[((allp_df['Ticker']==ticker)&(allp_df['date']>dateidentified)&(allp_df['date']<=currDate)),['high','date']]
    hh = allpm_df.iloc[:,0].max()
    hd = allpm_df.loc[(allpm_df['high']==hh),'date']
    hh = round(hh,2)
    h_list = [hh,hd]
    return h_list
# Split the list in to 2 columns one with price and the other with the corresponding date
init_df = split_columns(init_df,"HighestHigh")
# The function to split the list elements in to different columns
def split_columns(orig_df,col):
    split_df = pd.DataFrame(orig_df[col].tolist(),columns=[col+"Mod", col+"Date"])
    split_df[col+"Date"] = split_df[col+"Date"].apply(lambda x: x.squeeze())
    orig_df = pd.concat([orig_df,split_df], axis=1)
    orig_df = orig_df.drop(col,axis=1)
    orig_df = orig_df.rename(columns={col+"Mod": col})
    return orig_df

There are a couple of obvious changes that would help reduce your runtime.
First, in your getHigh function, instead of using loc with a comparison to get the date associated with the maximum value of high, use idxmax to get the index label of the row holding the maximum and then access that row:
hh, hd = allpm_df.loc[allpm_df['high'].idxmax()]
This replaces two O(N) operations (finding the maximum, then doing a lookup by comparison) with one O(N) operation and one O(1) operation.
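As a small illustration of the idxmax suggestion, here is a sketch on made-up prices for a single ticker. The values and the name `best_label` are assumptions; only the `high` and `date` column names come from the question.

```python
import pandas as pd

# Hypothetical price rows for one ticker, already filtered by date.
allpm_df = pd.DataFrame({
    "high": [10.5, 12.25, 11.0],
    "date": pd.to_datetime(["2020-01-02", "2020-01-03", "2020-01-06"]),
})

# idxmax returns the index label of the first row holding the maximum 'high'.
best_label = allpm_df["high"].idxmax()
hh = allpm_df.loc[best_label, "high"]
hd = allpm_df.loc[best_label, "date"]
print(hh, hd)
```

If the filtered frame can be empty (a ticker with no rows after the identified date), it is worth checking for that first, since idxmax has nothing to return in that case.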
Edit
In light of your information on the size of your dataframes, my best guess is that this line is probably where most of your time is being consumed:
allpm_df = allp_df.loc[((allp_df['Ticker']==ticker)&(allp_df['date']>dateidentified)&(allp_df['date']<=currDate)),['high','date']]
To make this faster, I would set up your data frame with a multi-index when you first create it:
index = pd.MultiIndex.from_arrays(arrays = [ticker_symbols, dates], names = ['Symbol', 'Date'])
allp_df = pd.DataFrame(data, index = index)
allp_df = allp_df.sort_index(level = 0, sort_remaining = True)
This should create a dataframe with a sorted, multi-level index built from your ticker symbols and dates. Doing this will reduce your search time tremendously. Once you do that, you should be able to access all the data associated with a ticker symbol and a given date range by doing this (note that label slices include both endpoints):
allp_df.loc[(ticker, slice(dateidentified, currDate)), :]
which should return your data much more quickly. For more information on multi-indexing, check out this helpful Pandas tutorial.
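A minimal, self-contained sketch of the multi-index lookup, assuming a price table with `Symbol`, `Date`, and `high` columns. The data, the dates, and the variable names `raw`, `start`, `end`, and `window` are illustrative only.

```python
import pandas as pd

# Hypothetical price data; column and level names are illustrative only.
raw = pd.DataFrame({
    "Symbol": ["AAA", "AAA", "AAA", "BBB", "BBB"],
    "Date": pd.to_datetime(
        ["2020-01-02", "2020-01-03", "2020-01-06", "2020-01-02", "2020-01-03"]),
    "high": [10.5, 12.25, 11.0, 50.0, 49.5],
})

# Build the two-level index and sort it so that label slicing works and is fast.
allp_df = raw.set_index(["Symbol", "Date"]).sort_index()

start = pd.Timestamp("2020-01-03")
end = pd.Timestamp("2020-01-06")

# Label slices are inclusive at both ends.
window = allp_df.loc[("AAA", slice(start, end)), :]

# idxmax on a MultiIndex returns a (Symbol, Date) tuple; take the date part.
best_date = window["high"].idxmax()[1]
print(window)
print(best_date)
```

Because the slice includes its start label, a strict "after the identified date" filter like the one in the question would need the start bound moved forward or the first row dropped.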

How to return result based on a string found on a list?

I'm trying to return all rows from my Excel sheet where the column TOURNAMENT contains the string FIFA. I keep getting no results back and am not sure how to fix this. Below is a sample of data from my Excel file. Any insight would be helpful, thank you.
My excel:
import pandas as pd
import numpy as np
filename = ("results.csv")
df = pd.read_csv(filename)
#convert to datetime format
df['date'] = pd.to_datetime(df['date'], format='%Y/%m/%d')
#Which country has scored the most goals in FIFA events (qualifiers, cups, etc.) since 2010?
#To get the most goals by sum
df['total_score'] = df['home_score'] + df['away_score']
#Not sure how to check all data with the string "FIFA" in the column "Tournament"
sub_df = df[(df['date'].dt.year >= 2010)]
if "FIFA" in df['tournament']:
    sub_df2 = sub_df[sub_df['total_score'] == sub_df['total_score'].max()]
    print(sub_df2)
else:
    print("no results")

You can use Series.str.contains to check if a substring exists in the value, then use the masking to get only such occurrences:
>>> df[df['tournament'].str.contains('FIFA')]
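To show why the `in` check finds nothing while the mask works, here is a short sketch on made-up rows. The sample values are assumptions; the column names follow the question. The `na=False` argument is an addition that treats missing tournament names as non-matches.

```python
import pandas as pd

# Hypothetical rows shaped like the question's results.csv.
df = pd.DataFrame({
    "tournament": ["FIFA World Cup", "Friendly", "FIFA World Cup qualification", None],
    "home_score": [2, 1, 3, 0],
    "away_score": [1, 1, 0, 0],
})

# `in` on a Series tests the index labels, not the values, so this is False.
print("FIFA" in df["tournament"])

# Element-wise substring test; na=False treats missing names as non-matches.
mask = df["tournament"].str.contains("FIFA", na=False)
fifa_df = df[mask]
print(fifa_df)
```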

How to improve performance on average calculations in python dataframe

I am trying to improve the performance of a current piece of code, whereby I loop through a dataframe (dataframe 'r') and find the average values from another dataframe (dataframe 'p') based on criteria.
I want to find the average of all values (column 'Val') from dataframe 'p' where (r.RefDate == p.RefDate) & (r.Item == p.Item) & (p.StartDate >= r.PeriodStartDate) & (p.EndDate <= r.PeriodEndDate)
Dummy data for this can be generated as follows:
import pandas as pd
import numpy as np
from datetime import datetime
######### START CREATION OF DUMMY DATA ##########
rng = pd.date_range('2019-01-01', '2019-10-28')
daily_range = pd.date_range('2019-01-01','2019-12-31')
p = pd.DataFrame(columns=['RefDate','Item','StartDate','EndDate','Val'])
for item in ['A','B','C','D']:
    for date in daily_range:
        daily_p = pd.DataFrame({ 'RefDate': rng,
                                 'Item':item,
                                 'StartDate':date,
                                 'EndDate':date,
                                 'Val' : np.random.randint(0,100,len(rng))})
        p = pd.concat([p, daily_p])  # DataFrame.append was removed in pandas 2.0
r = pd.DataFrame(columns=['RefDate','Item','PeriodStartDate','PeriodEndDate','AvgVal'])
for item in ['A','B','C','D']:
    r1 = pd.DataFrame({ 'RefDate': rng,
                        'Item':item,
                        'PeriodStartDate':'2019-10-25',
                        'PeriodEndDate':'2019-10-31',#datetime(2019,10,31),
                        'AvgVal' : 0})
    r = pd.concat([r, r1])
r.reset_index(drop=True,inplace=True)
######### END CREATION OF DUMMY DATA ##########
The code I currently use for the calculation, and whose performance I would like to improve, is as follows:
for i in r.index:
    avg_price = p['Val'].loc[((p['StartDate'] >= r.loc[i]['PeriodStartDate']) &
                              (p['EndDate'] <= r.loc[i]['PeriodEndDate']) &
                              (p['RefDate'] == r.loc[i]['RefDate']) &
                              (p['Item'] == r.loc[i]['Item']))].mean()
    r.loc[i, 'AvgVal'] = avg_price

The first change is that, when generating the r DataFrame, both PeriodStartDate and
PeriodEndDate are created as datetime values. See the following fragment of your
initialization code, as changed by me:
r1 = pd.DataFrame({'RefDate': rng, 'Item': item,
    'PeriodStartDate': pd.to_datetime('2019-10-25'),
    'PeriodEndDate': pd.to_datetime('2019-10-31'), 'AvgVal': 0})
To get better speed, I then set the index in both DataFrames to RefDate and Item
(the two columns compared for equality) and sorted by index:
p.set_index(['RefDate', 'Item'], inplace=True)
p.sort_index(inplace=True)
r.set_index(['RefDate', 'Item'], inplace=True)
r.sort_index(inplace=True)
This way, access by index is significantly quicker.
Then I defined the following function, which computes the mean for the rows
from p "related to" the current row from r:
def myMean(row):
    pp = p.loc[row.name]
    return pp[pp.StartDate.ge(row.PeriodStartDate) &
        pp.EndDate.le(row.PeriodEndDate)].Val.mean()
The only thing left to do is to apply this function (to each row in r) and
save the result in AvgVal:
r.AvgVal = r.apply(myMean, axis=1)
Using %timeit, I compared the execution time of the code proposed by EdH with mine
and found that mine ran in almost one tenth of the time.
Check for yourself.
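To check that the indexed lookup gives the same averages as the row-by-row scan, here is a small self-contained sketch. The data set is a shrunken, made-up version of the question's dummy data, and the names `p_idx`, `r_idx`, `window_mean`, `slow`, and `fast` are introduced only for this example.

```python
import numpy as np
import pandas as pd

# Small hypothetical data set with the same column layout as the question.
rng = pd.date_range("2019-10-01", "2019-10-03")
days = pd.date_range("2019-10-24", "2019-11-02")
rs = np.random.default_rng(0)

p = pd.concat(
    [pd.DataFrame({"RefDate": rng, "Item": item, "StartDate": day,
                   "EndDate": day, "Val": rs.integers(0, 100, len(rng))})
     for item in ["A", "B"] for day in days],
    ignore_index=True,
)
r = pd.concat(
    [pd.DataFrame({"RefDate": rng, "Item": item,
                   "PeriodStartDate": pd.Timestamp("2019-10-25"),
                   "PeriodEndDate": pd.Timestamp("2019-10-31")})
     for item in ["A", "B"]],
    ignore_index=True,
)

# Reference result: one full boolean scan of p per row of r.
r["AvgVal"] = [
    p.loc[(p["StartDate"] >= row.PeriodStartDate)
          & (p["EndDate"] <= row.PeriodEndDate)
          & (p["RefDate"] == row.RefDate)
          & (p["Item"] == row.Item), "Val"].mean()
    for row in r.itertuples()
]
slow = r.set_index(["RefDate", "Item"]).sort_index()["AvgVal"]

# Indexed version: the equality keys go into a sorted MultiIndex.
p_idx = p.set_index(["RefDate", "Item"]).sort_index()
r_idx = r.drop(columns="AvgVal").set_index(["RefDate", "Item"]).sort_index()

def window_mean(row):
    block = p_idx.loc[row.name]  # rows of p sharing this (RefDate, Item) key
    keep = (block["StartDate"].ge(row["PeriodStartDate"])
            & block["EndDate"].le(row["PeriodEndDate"]))
    return block.loc[keep, "Val"].mean()

fast = r_idx.apply(window_mean, axis=1)
print(np.allclose(fast.to_numpy(), slow.to_numpy()))
```

The speed-up comes from each call touching only the handful of rows that share the key, rather than comparing every row of p four times per row of r.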

By using iterrows I managed to improve the performance, although there may still be quicker ways.
for index, row in r.iterrows():
    avg_price = p['Val'].loc[((p['StartDate'] >= row.PeriodStartDate) &
                              (p['EndDate'] <= row.PeriodEndDate) &
                              (p['RefDate'] == row.RefDate) &
                              (p['Item'] == row.Item))].mean()
    r.loc[index, 'AvgVal'] = avg_price

Why does performance decrease with the size of the data frame?

I'm working with a relatively large dataset (approx 5m observations, made up of about 5.5k firms).
I needed to run OLS regressions with a 60 month rolling window for each firm. I noticed that the performance was insanely slow when I ran the following code:
for idx, sub_df in master_df.groupby("firm_id"):
    # OLS code
However, when I first split my dataframe into about 5.5k dfs and then iterated over each of the dfs, the performance improved dramatically.
grouped_df = master_df.groupby("firm_id")
df_list = [group for group in grouped_df]
for df in df_list:
    my_df = df[1]
    # OLS code
I'm talking 1-2 weeks of time (24/7) to complete in the first version compared to 8-9 hours tops.
Can anyone please explain why splitting the master df into N smaller dfs and then iterating over each smaller df performs better than iterating over the same number of groups within the master df?
Thanks ever so much!

I'm unable to reproduce your observation. Here's some code that generates data and then times the direct and indirect methods separately. The time taken is very similar in either case.
Is it possible that you accidentally sorted the dataframe by the group key between the runs? Sorting by group key results in a noticeable difference in run time.
Otherwise, I'm beginning to think that there might be some other differences in your code. It would be great if you could post the full code.
import numpy as np
import pandas as pd
from datetime import datetime
def generate_data():
    ''' returns a Pandas DF with columns 'firm_id' and 'score' '''
    # configuration
    np.random.seed(22)
    num_groups = 50000  # number of distinct groups in the DF
    mean_group_length = 200  # how many records per group?
    cov_group_length = 0.10  # throw in some variability in the num records per group
    # simulate group lengths
    stdv_group_length = mean_group_length * cov_group_length
    group_lengths = np.random.normal(
        loc=mean_group_length,
        scale=stdv_group_length,
        size=(num_groups,)).astype(int)
    group_lengths[group_lengths <= 0] = mean_group_length
    # final length of DF
    total_length = sum(group_lengths)
    # compute entries for group key column
    firm_id_list = []
    for i, l in enumerate(group_lengths):
        firm_id_list.extend([(i + 1)] * l)
    # construct the DF; data column is 'score' populated with Numpy's U[0, 1)
    result_df = pd.DataFrame(data={
        'firm_id': firm_id_list,
        'score': np.random.rand(total_length)
    })
    # Optionally, shuffle or sort the DF by group keys
    # ALTERNATIVE 1: (badly) unsorted df
    result_df = result_df.sample(frac=1, random_state=13).reset_index(drop=True)
    # ALTERNATIVE 2: sort by group key
    # result_df.sort_values(by='firm_id', inplace=True)
    return result_df
def time_method(df, method):
    ''' time 'method' with 'df' as its argument '''
    t_start = datetime.now()
    method(df)
    t_final = datetime.now()
    delta_t = t_final - t_start
    print(f"Method '{method.__name__}' took {delta_t}.")
    return
def process_direct(df):
    ''' direct for-loop over groupby object '''
    for group, sub_df in df.groupby('firm_id'):
        m = sub_df.score.mean()
        s = sub_df.score.std()
    return
def process_indirect(df):
    ''' indirect method: generate groups first as list and then loop over list '''
    grouped_df = df.groupby('firm_id')
    group_list = [pair for pair in grouped_df]
    for pair in group_list:
        m = pair[1].score.mean()
        s = pair[1].score.std()
df = generate_data()
time_method(df, process_direct)
time_method(df, process_indirect)
