Pandas: how to make an algorithm faster - python

I have a task: I need to find some data in a big file and append it to another file.
The file I search in has 22 million rows, so I read it with chunksize.
The other file has a column with 600 user ids, and I look up information about each of those users in the big file.
First I split the big file into intervals (chunks), and then I search for information about every user in each of them.
I used a timer to see how long the writing takes: finding the information for one user in a 1-million-row chunk and writing it to file takes about 1.7 seconds on average, so the whole program comes out to roughly 6 hours (about 1.7 sec * 600 ids * 22 chunks).
I want to do it faster, but I don't know any approach besides chunksize.
Here is my code:
import time
import dateutil.relativedelta
import pandas as pd

el = pd.read_csv('df2.csv', iterator=True, chunksize=1000000)
buys = pd.read_excel('smartphone.xlsx')
buys['date'] = pd.to_datetime(buys['date'])
dates1 = buys['date']
ids1 = buys['id']
for i in el:
    i['used_at'] = pd.to_datetime(i['used_at'])
    df = i.sort_values(['ID', 'used_at'])
    dates = df['used_at']
    ids = df['ID']
    urls = df['url']
    for i, (id, date, url, id1, date1) in enumerate(zip(ids, dates, urls, ids1, dates1)):
        start = time.time()
        df1 = df[(df['ID'] == ids1[i])
                 & (df['used_at'] < (dates1[i] + dateutil.relativedelta.relativedelta(days=5)).replace(hour=0, minute=0, second=0))
                 & (df['used_at'] > (dates1[i] - dateutil.relativedelta.relativedelta(months=1)).replace(day=1, hour=0, minute=0, second=0))]
        df1 = pd.DataFrame(df1)
        if df1.empty:
            continue
        else:
            with open('3.csv', 'a') as f:
                df1.to_csv(f, header=False)
        end = time.time()
        print(end - start)

There are some issues in your code:
zip stops at the shortest of its arguments, and here you are zipping the chunk's columns (about a million rows) with the 600-row columns from buys, which have different lengths.
dateutil.relativedelta may not be compatible with a pandas Timestamp.
With pandas 0.18.1 and python 3.5, I'm getting this:
now = pd.Timestamp.now()
now
Out[46]: Timestamp('2016-07-06 15:32:44.266720')
now + dateutil.relativedelta.relativedelta(day=5)
Out[47]: Timestamp('2016-07-05 15:32:44.266720')
So it's better to use pd.Timedelta
now + pd.Timedelta(5, 'D')
Out[48]: Timestamp('2016-07-11 15:32:44.266720')
But it's somewhat inaccurate for months:
now - pd.Timedelta(1, 'M')
Out[49]: Timestamp('2016-06-06 05:03:38.266720')
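If you need calendar-aware month arithmetic instead of the fixed-length month that pd.Timedelta uses, pd.DateOffset is worth a look; a small illustration (same now as above, expected results as comments):
import pandas as pd

now = pd.Timestamp('2016-07-06 15:32:44.266720')
now - pd.DateOffset(months=1)   # Timestamp('2016-06-06 15:32:44.266720'), an exact calendar month
now - pd.offsets.MonthBegin(1)  # should roll back to the first of the month, 2016-07-01 15:32:44.266720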
This is a sketch of code. I didn't test and I may be wrong about what you want.
The crucial part is to merge the two data frames instead of iterating row by row.
# 1) convert to datetime here
# 2) optionally, you can select only relevant cols with e.g. usecols=['ID', 'used_at', 'url']
# 3) iterator is prob. superfluous (chunksize already returns an iterator)
el = pd.read_csv('df2.csv', chunksize=1000000, parse_dates=['used_at'])
buys = pd.read_excel('smartphone.xlsx')
buys['date'] = pd.to_datetime(buys['date'])
# consider loading only relevant columns to buys

# compute time intervals here (not in a loop!)
buys['date_min'] = buys['date'] - pd.Timedelta(1, unit='M')
buys['date_max'] = buys['date'] + pd.Timedelta(5, unit='D')
# now replace (probably it needs to be done row by row)
buys['date_min'] = buys['date_min'].apply(lambda x: x.replace(day=1, hour=0, minute=0, second=0))
buys['date_max'] = buys['date_max'].apply(lambda x: x.replace(hour=0, minute=0, second=0))

# not necessary
# dates1 = buys['date']
# ids1 = buys['id']

for chunk in el:
    # already converted to datetime
    # chunk['used_at'] = pd.to_datetime(chunk['used_at'])
    # defer sorting until later
    # df = chunk.sort_values(['ID', 'used_at'])

    # merge!
    # (option how='inner' selects only rows that have the same id in both data frames; it's the default)
    merged = pd.merge(chunk, buys, left_on='ID', right_on='id', how='inner')
    bool_idx = (merged['used_at'] < merged['date_max']) & (merged['used_at'] > merged['date_min'])
    selected = merged.loc[bool_idx]

    # probably don't need additional columns from buys,
    # so either drop them or select the ones from chunk (beware of possible duplicates in names)
    selected = selected[chunk.columns]

    # sort now (possibly a smaller frame)
    selected = selected.sort_values(['ID', 'used_at'])

    if selected.empty:
        continue

    with open('3.csv', 'a') as f:
        selected.to_csv(f, header=False)
Hope this helps. Please double check the code and adjust to your needs.
Please take a look at the docs to understand the options of merge.
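For instance, a tiny made-up example of how the how= option changes what survives the merge (frames invented purely for illustration):
import pandas as pd

left = pd.DataFrame({'ID': [1, 2, 3], 'url': ['a', 'b', 'c']})
right = pd.DataFrame({'id': [2, 3, 4], 'date': pd.to_datetime(['2016-01-01', '2016-02-01', '2016-03-01'])})

pd.merge(left, right, left_on='ID', right_on='id', how='inner')  # only IDs 2 and 3 survive
pd.merge(left, right, left_on='ID', right_on='id', how='left')   # IDs 1, 2, 3 kept; NaN/NaT where there is no match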

Related

Rearranging with pandas melt

I am trying to rearrange a DataFrame. Currently, I have 1035 rows and 24 columns, one for each hour of the day. I want to turn this into an array with 1035*24 rows. If you want to see the data, it can be extracted from the following JSON endpoint:
import json
from urllib.request import urlopen

url = "https://www.svk.se/services/controlroom/v2/situation?date={}&biddingArea=SE1"
svk = []
for i in parsing_range_svk:  # parsing_range_svk is the range of dates to download (defined elsewhere)
    data_json_svk = json.loads(urlopen(url.format(i)).read())
    svk.append([v["y"] for v in data_json_svk["Data"][0]["data"]])
This is the code I am using to rearrange the data, but it is not doing the job. The first observation ends up in the right place, but after that it gets messy, and I have not been able to figure out where each observation goes.
import pandas as pd
from datetime import datetime, timedelta

svk = pd.DataFrame(svk)
date_start1 = datetime(2020, 1, 1)
date_range1 = [date_start1 + timedelta(days=x) for x in range(1035)]
date_svk = pd.DataFrame(date_range1, columns=['date'])
svk['date'] = date_svk['date']
svk.drop(24, axis=1, inplace=True)
consumption_svk_1 = (svk.melt('date', value_name='SE1_C')
                        .assign(date=lambda x: x['date'] +
                                pd.to_timedelta(x.pop('variable').astype(float), unit='h'))
                        .sort_values('date', ignore_index=True))
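In case it helps to see the reshaping step in isolation, here is a minimal sketch on a small made-up frame; stack() walks the frame row by row and then column by column, so it is easy to check that each hour lands next to its day:
import pandas as pd

# made-up example: 3 days x 24 hourly columns
wide = pd.DataFrame([list(range(24)), list(range(24, 48)), list(range(48, 72))])
wide.index = pd.date_range('2020-01-01', periods=3, freq='D')

long = wide.stack().reset_index()
long.columns = ['date', 'hour', 'SE1_C']
long['date'] = long['date'] + pd.to_timedelta(long['hour'], unit='h')
long = long.drop(columns='hour')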

Pandas DF, DateOffset, creating new column

So I'm working with the JHU covid19 data, and they've dropped their recovered dataset: they're no longer tracking it, just confirmed cases and deaths. What I'm trying to do here is recreate it. The table holds the confirmed cases and deaths for every country and every date, sorted by date, and my getRecovered function below attempts to take the date of each row, find the date two weeks earlier for the same country, and return a 'Recovered' column, which is the confirmed count from two weeks ago minus the deaths today.
Maybe a pointless exercise, but I'd still like to know how to do it haha. I know it's a big dataset and there are a lot of operations in there, but it's been running for 20 minutes now and is still going. Did I do something wrong, or would it just take this long?
Thanks for any help, friends.
import wget
import pandas as pd

urls = [
    'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv',
    'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_global.csv'
]
[wget.download(url) for url in urls]
confirmed = pd.read_csv('time_series_covid19_confirmed_global.csv')
deaths = pd.read_csv('time_series_covid19_deaths_global.csv')

dates = confirmed.columns[4:]
confirmed_long_form = confirmed.melt(
    id_vars=['Province/State', 'Country/Region', 'Lat', 'Long'],
    value_vars=dates,
    var_name='Date',
    value_name='Confirmed'
)
deaths_long_form = deaths.melt(
    id_vars=['Province/State', 'Country/Region', 'Lat', 'Long'],
    value_vars=dates,
    var_name='Date',
    value_name='Deaths'
)
full_table = confirmed_long_form.merge(
    right=deaths_long_form,
    how='left',
    on=['Province/State', 'Country/Region', 'Date', 'Lat', 'Long']
)
full_table['Date'] = pd.to_datetime(full_table['Date'])
full_table = full_table.sort_values(by='Date', ascending=True)

def getRecovered(row):
    ts = row['Date']
    country = row['Country/Region']
    ts = pd.Timestamp(ts)
    do = pd.tseries.offsets.DateOffset(n=14)
    newTimeStamp = ts - do
    oldrow = full_table.loc[(full_table['Date'] == newTimeStamp) & (full_table['Country/Region'] == country)]
    return oldrow['Confirmed'] - row['Deaths']

full_table['Recovered'] = full_table.apply(lambda row: getRecovered(row), axis=1)
full_table
Your function is being applied row by row, which is likely why performance is suffering. Pandas is fastest when you make use of vectorised functions. For example you can use
pd.to_datetime(full_table['Date'])
to convert the whole date column much faster (see here: Convert DataFrame column type from string to datetime).
You can then add the date offset to that column, something like:
full_table['Recovery_date'] = pd.to_datetime(full_table['Date']) - pd.tseries.offsets.DateOffset(n = 14)
You can then self-merge the table on date == recovery_date (plus any other keys) and subtract the numbers.
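A rough, untested sketch of that self-merge, using the column names from the frames above (the suffixes and exact keys may need adjusting):
# sketch only: look up each row's Confirmed value from 14 days earlier via a self-merge
full_table['Recovery_date'] = full_table['Date'] - pd.tseries.offsets.DateOffset(n=14)
past = full_table[['Province/State', 'Country/Region', 'Date', 'Confirmed']]
merged = full_table.merge(
    past,
    left_on=['Province/State', 'Country/Region', 'Recovery_date'],
    right_on=['Province/State', 'Country/Region', 'Date'],
    how='left',
    suffixes=('', '_past')
)
merged['Recovered'] = merged['Confirmed_past'] - merged['Deaths']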

Dataframe drop between_time multiple rows by shifting timedelta

I would like to drop multiple groups of rows by a time criterion; the date part can be ignored.
I have a dataframe with 100 million rows, sampled at roughly 0.001 s intervals, although the sampling frequency varies between columns.
The goal is to drop rows in a repeating "shifting" pattern: for example, the leave duration might be 0.01 seconds and the drop duration 0.1 seconds (see the figure in the original post).
I am having trouble with Timestamp-to-Time conversions and with defining a one-liner that drops multiple groups of rows.
I have tried the following code:
import pandas as pd
from datetime import timedelta#, timestamp
from datetime import datetime
import numpy as np
# leave_duration=0.01 seconds
# drop_duration=0.1 seconds
i = pd.date_range('2018-01-01 00:01:15.004', periods=1000, freq='2ms')
i=i.append(pd.date_range('2018-01-01 00:01:15.004', periods=1000, freq='3ms'))
i=i.append(pd.date_range('2018-01-01 00:01:15.004', periods=1000, freq='0.5ms'))
df = pd.DataFrame({'A': range(len(i))}, index=i)
df=df.sort_index()
minimum_time=df.index.min()
print("Minimum time:",minimum_time)
maximum_time=df.index.max()
print("Maximum time:",maximum_time)
# futuredate = minimum_time + timedelta(microseconds=100)
print("Dataframe before dropping:\n",df)
df.drop(df.between_time(*pd.to_datetime([minimum_time, maximum_time]).time).index, inplace=True)
print("Dataframe after dropping:\n",df)
# minimum_time=str(minimum_time).split()
# minimum_time=minimum_time[1]
# print(minimum_time)
# maximum_time=str(maximum_time).split()
# maximum_time=maximum_time[1]
# print(maximum_time)
How can I drop rows by time criterion, with shifting?
Working for me:
df = df.loc[(df.index - df.index[0]) % pd.to_timedelta('110ms') > pd.to_timedelta('100ms')]
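A slightly more explicit version of the same idea, with the leave/drop durations from the question pulled out as variables (just a sketch; double-check the boundary condition against the rows you actually want to keep):
import pandas as pd

leave_duration = pd.to_timedelta('10ms')    # 0.01 s
drop_duration = pd.to_timedelta('100ms')    # 0.1 s
cycle = leave_duration + drop_duration      # 110 ms

# position of each row inside the repeating drop+leave cycle, measured from the first timestamp;
# keep only the rows that fall after the drop window (same condition as the one-liner above)
phase = (df.index - df.index[0]) % cycle
df = df.loc[phase > drop_duration]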
I *think* this is what you're looking for. If not, it hopefully gets you closer.
I define the drop periods by taking the minimum time and incrementing it by your leave/drop durations, appending each period to a dictionary whose keys are the start of a drop period and whose values are its end.
Then I iterate through the dictionary and drop the rows of your dataframe that fall between those two times, shedding rows at each step.
import datetime

drop_periods = {}
start_drop = minimum_time + datetime.timedelta(seconds=0.01)
end_drop = start_drop + datetime.timedelta(seconds=0.1)
drop_periods[start_drop] = end_drop
while end_drop < maximum_time:
    start_drop = end_drop + datetime.timedelta(seconds=0.01)
    end_drop = start_drop + datetime.timedelta(seconds=0.1)
    drop_periods[start_drop] = end_drop

for start, end in drop_periods.items():
    print("Dataframe before dropping:\n", len(df))
    df.drop(df.between_time(*pd.to_datetime([start, end]).time).index, inplace=True)
    print("Dataframe after dropping:\n", len(df))

How to fill missing date in timeSeries

Here's what my data looks like:
There are daily records, except for a gap from 2017-06-12 to 2017-06-16.
df2['timestamp'] = pd.to_datetime(df['timestamp'])
df2['timestamp'] = df2['timestamp'].map(lambda x: datetime.datetime.strftime(x, '%Y-%m-%d'))
df2 = df2.convert_objects(convert_numeric=True)
df2 = df2.groupby('timestamp', as_index=False).sum()
I need to fill this missing gap and others with values for all fields (e.g. timestamp, temperature, humidity, light, pressure, speed, battery_voltage, etc...).
How can I accomplish this with Pandas?
This is what I have done before:
weektime = pd.date_range(start='06/04/2017', end='12/05/2017', freq='W-SUN')
df['week'] = 'nan'
df['weektemp'] = 'nan'
df['weekhumidity'] = 'nan'
df['weeklight'] = 'nan'
df['weekpressure'] = 'nan'
df['weekspeed'] = 'nan'
df['weekbattery_voltage'] = 'nan'
for i in range(0, len(weektime)):
    df['week'][i+1] = weektime[i]
    df['weektemp'][i+1] = df['temperature'].iloc[7*i+1:7*i+7].sum()
    df['weekhumidity'][i+1] = df['humidity'].iloc[7*i+1:7*i+7].sum()
    df['weeklight'][i+1] = df['light'].iloc[7*i+1:7*i+7].sum()
    df['weekpressure'][i+1] = df['pressure'].iloc[7*i+1:7*i+7].sum()
    df['weekspeed'][i+1] = df['speed'].iloc[7*i+1:7*i+7].sum()
    df['weekbattery_voltage'][i+1] = df['battery_voltage'].iloc[7*i+1:7*i+7].sum()
    i = i + 1
The sums come out wrong, because the value recorded on 2017-06-17 is already a sum over 2017-06-12 to 2017-06-16, and I do not want to add those days in again. This is also not the only gap in the period; I want to fill all of them.
Here is a function I wrote that might be helpful to you. It looks for inconsistent jumps in time and fills them in. After using this function, try using a linear interpolation function (pandas has a good one) to fill in your null data values. Note: Numpy arrays are much faster to iterate over and manipulate than Pandas dataframes, which is why I switch between the two.
import numpy as np
import pandas as pd

data_arr = np.array(your_df)
periodicity = 'daily'

def fill_gaps(data_arr, periodicity):
    rows = data_arr.shape[0]
    data_no_gaps = np.copy(data_arr)  # avoid altering the thing you're iterating over
    data_no_gaps_idx = 0
    for row_idx in np.arange(1, rows):  # iterate once for each row (except the first record; nothing to compare)
        oldtimestamp_str = str(data_arr[row_idx-1, 0])
        oldtimestamp = np.datetime64(oldtimestamp_str)
        currenttimestamp_str = str(data_arr[row_idx, 0])
        currenttimestamp = np.datetime64(currenttimestamp_str)
        period = currenttimestamp - oldtimestamp
        if period != np.timedelta64(900, 's') and period != np.timedelta64(3600, 's') and period != np.timedelta64(86400, 's'):
            if periodicity == 'quarterly':
                desired_period = 900
            elif periodicity == 'hourly':
                desired_period = 3600
            elif periodicity == 'daily':
                desired_period = 86400
            periods_missing = int(period / np.timedelta64(desired_period, 's'))
            for missing in np.arange(1, periods_missing):
                new_time_orig = str(oldtimestamp + missing*(np.timedelta64(desired_period, 's')))
                new_time = new_time_orig.replace('T', ' ')
                data_no_gaps = np.insert(data_no_gaps, (data_no_gaps_idx + missing),
                                         np.array((new_time, np.nan, np.nan, np.nan, np.nan, np.nan)), 0)  # INSERT VALUES YOU WANT IN THE NEW ROW
            data_no_gaps_idx += (periods_missing-1)  # increment the index (zero-based => -1) in accordance with added rows
        data_no_gaps_idx += 1  # allow index to change as we iterate over original data array (main for loop)

    # create a dataframe:
    data_arr_no_gaps = pd.DataFrame(data=data_no_gaps, index=None, columns=['Time', 'temp', 'humidity', 'light', 'pressure', 'speed'])
    return data_arr_no_gaps
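A possible way to call it, following the suggestion above to interpolate afterwards (the filename and the list of value columns are assumptions; adjust them to your data):
# hypothetical usage of fill_gaps()
your_df = pd.read_csv('sensor_data.csv')          # made-up filename
filled = fill_gaps(np.array(your_df), 'daily')    # inserts NaN rows where days are missing
value_cols = ['temp', 'humidity', 'light', 'pressure', 'speed']
filled[value_cols] = filled[value_cols].astype(float).interpolate(method='linear')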
Fill time gaps and nulls
Use the function below to ensure expected date sequence exists, and then use forward fill to fill in nulls.
import pandas as pd
import os

def fill_gaps_and_nulls(df, freq='1D'):
    '''
    General steps:
        A) check for extra dates (out of expected frequency/sequence)
        B) check for missing dates (based on expected frequency/sequence)
        C) use forwardfill to fill nulls
        D) use backwardfill to fill remaining nulls
        E) append to file
    '''
    #rename the timestamp to 'date'
    df = df.rename(columns={"timestamp": "date"})
    #sort to make indexing faster
    df = df.sort_values(by=['date'], inplace=False)
    #create an artificial index of dates at frequency = freq, with the same beginning and ending as the original data
    all_dates = pd.date_range(start=df.date.min(), end=df.date.max(), freq=freq)
    #record column names
    df_cols = df.columns
    #delete ffill_df.csv so we can begin anew
    try:
        os.remove('ffill_df.csv')
    except FileNotFoundError:
        pass
    #check for extra dates and/or dates out of order. print warning statement for log
    extra_dates = set(df.date).difference(all_dates)
    #if there are extra dates (outside of expected sequence/frequency), deal with them
    if len(extra_dates) > 0:
        #############################
        #INSERT DESIRED BEHAVIOR HERE
        print('WARNING: Extra date(s):\n\t{}\n\t Shifting highlighted date(s) back by 1 day'.format(extra_dates))
        for date in extra_dates:
            #shift extra dates back one day
            df.date[df.date == date] = date - pd.Timedelta(days=1)
        #############################
    #check the artificial date index against df to identify missing gaps in time and fill them with nulls
    gaps = all_dates.difference(set(df.date))
    print('\n-------\nWARNING: Missing dates: {}\n-------\n'.format(gaps))
    #if there are time gaps, deal with them
    if len(gaps) > 0:
        #initialize df of correct size, filled with nulls
        gaps_df = pd.DataFrame(index=gaps, columns=df_cols.drop('date'))  #len(index) sets number of rows
        #give index a name
        gaps_df.index.name = 'date'
        #add the region and type (r and t are assumed to be defined in the surrounding code)
        gaps_df.region = r
        gaps_df.type = t
        #remove that index so gaps_df and df are compatible
        gaps_df.reset_index(inplace=True)
        #append gaps_df to df
        new_df = pd.concat([df, gaps_df])
        #sort on date
        new_df.sort_values(by='date', inplace=True)
        #fill nulls
        new_df.fillna(method='ffill', inplace=True)
        new_df.fillna(method='bfill', inplace=True)
        #append to file
        new_df.to_csv('ffill_df.csv', mode='a', header=False, index=False)
    return df_cols, regions, types, all_dates  # regions/types come from the author's surrounding code
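A minimal sketch of how this might be called (the filename is made up, and r, t, regions, types are assumed to exist as in the author's surrounding code; the function expects a 'timestamp' column and appends its output to ffill_df.csv):
# hypothetical usage of fill_gaps_and_nulls()
df = pd.read_csv('sensor_data.csv', parse_dates=['timestamp'])   # made-up filename
df_cols, regions, types, all_dates = fill_gaps_and_nulls(df, freq='1D')
filled = pd.read_csv('ffill_df.csv', header=None, names=df_cols)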

parsing CSV in pandas

I want to calculate the average number of successful Rattata catches per hour over this whole dataset. I am looking for an efficient way to do this with pandas; I'm new to Python and pandas.
You don't need any loops. Try this; I think the logic is fairly clear.
import pandas as pd

#read csv
df = pd.read_csv('pkmn.csv', header=0)
#we need to apply some transformations to extract the date and hour from the timestamp
df['time'] = df['time'].apply(lambda x: pd.to_datetime(str(x)))
df['date'] = df['time'].dt.date
df['hour'] = df['time'].dt.hour  # needed for the groupby below
#main transformations
df = df.query("Pokemon == 'rattata' and caught == True").groupby('hour')
result = pd.DataFrame()
result['caught total'] = df['hour'].count()
result['days'] = df['date'].nunique()
result['caught average'] = result['caught total'] / result['days']
If you have your pandas dataframe saved as df, this should work (assuming the time column is already a datetime):
rats = df.loc[df.Pokemon == "rattata"]            # gives you the subset of rows relating to Rattata
total = sum(rats.Caught)                          # gives you the number caught in total
diff = rats.time.iloc[-1] - rats.time.iloc[0]     # difference between the first and last timestamps
average = total / (diff / pd.Timedelta('1h'))     # number caught per hour
