Exception during groupby pandas - python

I am just beginning to learn analytics with Python for network analysis using the Python for Data Analysis book, and I'm getting confused by an exception I get while doing some groupbys... here's my situation.
I have a CSV of NetFlow data that I've imported to pandas. The data looks something like:
dt, srcIP, srcPort, dstIP, dstPort, bytes
2013-06-06 00:00:01.123, 123.123.1.1, 12345, 234.234.1.1, 80, 75
I've imported and indexed the data as follows:
df = pd.read_csv('mycsv.csv')
df.index = pd.to_datetime(df.pop('dt'))
What I want is a count of the unique srcIPs that visit my servers per time period (I have data over several days, and I'd like the time period to be by date and hour). I can obtain an overall traffic graph by grouping and plotting as follows:
df.groupby([lambda t: t.date(), lambda t: t.hour]).srcIP.nunique().plot()
However, I want to know how that overall traffic is split amongst my servers. My intuition was to additionally group by the 'dstIP' column (which only has 5 unique values), but I get errors when I try to aggregate on srcIP.
grouped = df.groupby([lambda t: t.date(), lambda t: t.hour, 'dstIP'])
grouped.srcIP.nunique()
...
Exception: Reindexing only valid with uniquely valued Index objects
So, my specific question is: how can I avoid this exception in order to create a plot where traffic is aggregated over 1-hour blocks, with a different series for each server?
More generally, please let me know what newb errors I'm making.
Also, the data does not have regular frequency timestamps and I don't want sampled data in case that makes any difference in your answer.
EDIT 1
This is my IPython session exactly as input. Output is omitted except for the deepest few calls in the error.
EDIT 2
Upgrading pandas from 0.8.0 to 0.12.0 yielded the more descriptive exception shown below.
import numpy as np
import pandas as pd
import time
import datetime
full_set = pd.read_csv('june.csv', parse_dates=True, index_col=0)
full_set.sort_index(inplace=True)
gp = full_set.groupby(lambda t: (t.date(), t.hour, full_set['dip'][t]))
gp['sip'].nunique()
...
/usr/local/lib/python2.7/dist-packages/pandas/core/groupby.pyc in _make_labels(self)
1239 raise Exception('Should not call this method grouping by level')
1240 else:
-> 1241 labs, uniques = algos.factorize(self.grouper, sort=self.sort)
1242 uniques = Index(uniques, name=self.name)
1243 self._labels = labs
/usr/local/lib/python2.7/dist-packages/pandas/core/algorithms.pyc in factorize(values, sort, order, na_sentinel)
123 table = hash_klass(len(vals))
124 uniques = vec_klass()
--> 125 labels = table.get_labels(vals, uniques, 0, na_sentinel)
126
127 labels = com._ensure_platform_int(labels)
/usr/local/lib/python2.7/dist-packages/pandas/hashtable.so in pandas.hashtable.PyObjectHashTable.get_labels (pandas/hashtable.c:12229)()
/usr/local/lib/python2.7/dist-packages/pandas/core/generic.pyc in __hash__(self)
52 def __hash__(self):
53 raise TypeError('{0!r} objects are mutable, thus they cannot be'
---> 54 ' hashed'.format(self.__class__.__name__))
55
56 def __unicode__(self):
TypeError: 'TimeSeries' objects are mutable, thus they cannot be hashed

So I'm not 100 percent sure why that exception was raised... but a few suggestions:
You can read in your data and parse the datetime and index by the datetime all at once with read_csv:
df = pd.read_csv('mycsv.csv', parse_dates=True, index_col=0)
Then you can form your groups by using a lambda function that returns a tuple of values:
gp = df.groupby(lambda t: (t.date(), t.hour, df['dstIP'][t]))
The input to this lambda function is the index; we can use it to go into the dataframe in the outer scope, retrieve the dstIP value at that index, and thus factor it into the grouping.
Now that we have the grouping, we can apply the aggregator:
gp['srcIP'].nunique()

I ended up solving my problem by adding a new column of hour-truncated datetimes to the original dataframe as follows:
f = lambda i: i.strftime('%Y-%m-%d %H:00:00')
full_set['hours'] = full_set.index.map(f)
Then I can group by 'dip' and loop through each destination IP, creating an hourly grouped plot as I go:
dipgroup = full_set.groupby('dip')
for d, g in dipgroup:
    g.groupby('hours').sip.nunique().plot()
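For what it's worth, on newer pandas versions the same hourly, per-server breakdown can be done without the helper column. A sketch, using the dstIP/srcIP column names from the original example rather than my actual sip/dip data:
hourly = df.groupby([pd.Grouper(freq='h'), 'dstIP'])['srcIP'].nunique()
hourly.unstack('dstIP').plot()   # one series per destination server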

Related

I'm getting a ValueError: could not convert string to float: 'XX'

Some background: I'm taking a machine learning class on customer segmentation. My code environment is pandas (Python) and sklearn. I have two datasets, a general population dataset and a customer demographics dataset with 85 identical columns.
I'm calling a function I created to run preprocessing steps on the 'customers' data, steps that were previously run outside this function on the general population dataset. Within the function is a loop that replaces missing values with np.nan. Here is the loop:
# replacing missing data with NaNs.
# feat_sum is a dataframe (feature_summary) of coded values
for i in range(len(feat_sum)):
    mi_unk = feat_sum.iloc[i]['missing_or_unknown']   # locate column and values
    mi_unk = mi_unk.strip('[').strip(']').split(',')  # strip the brackets, then split
    mi_unk = [int(val) if (val != '' and val != 'X' and val != 'XX') else val for val in mi_unk]
    if mi_unk != ['']:
        featsum_attrib = feat_sum.iloc[i]['attribute']
        df = df.replace({featsum_attrib: mi_unk}, np.nan)
Toward the end of the function I'm engineering new variables:
#Investigate "CAMEO_INTL_2015" and engineer two new variables.
df['WEALTH'] = df['CAMEO_INTL_2015']
df['LIFE_STAGE'] = df['CAMEO_INTL_2015']
mf_wealth_dict = {'11':1, '12':1, '13':1, '14':1, '15':1, '21':2, '22':2, '23':2, '24':2, '25':2, '31':3,'32':3, '33':3, '34':3, '35':3, '41':4, '42':4, '43':4, '44':4, '45':4, '51':5, '52':5, '53':5, '54':5, '55':5}
mf_lifestage_dict = {'11':1, '12':2, '13':3, '14':4, '15':5, '21':1, '22':2, '23':3, '24':4, '25':5, '31':1, '32':2, '33':3, '34':4, '35':5, '41':1, '42':2, '43':3, '44':4, '45':5, '51':1, '52':2, '53':3, '54':4, '55':5}
#replacing the 'WEALTH' and 'LIFE_STAGE' columns with values from the dictionaries
df['WEALTH'].replace(mf_wealth_dict, inplace=True)
df['LIFE_STAGE'].replace(mf_lifestage_dict, inplace=True)
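As an aside, a more compact equivalent of those two dictionaries would be a sketch like the following, assuming the codes are two-character strings whose first digit encodes wealth and second digit encodes life stage (which is what the mappings above imply):
codes = df['CAMEO_INTL_2015'].astype(str)
df['WEALTH'] = pd.to_numeric(codes.str[0], errors='coerce')      # first digit
df['LIFE_STAGE'] = pd.to_numeric(codes.str[1], errors='coerce')  # second digit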
Near the end of the project code, I'm running an imputer to replace the np.nans, which ran successfully on the general population dataset (azdias):
az_imp = Imputer(strategy="most_frequent")
azdias_cleaned_imp = pd.DataFrame(az_imp.fit_transform(azdias_cleaned_encoded))
So when I call the clean_data function passing the 'customers' dataframe, clean_data(customers), it gives me the ValueError: could not convert string to float: 'XX' on this line:
customers_imp = Imputer(strategy="most_frequent")
---> 19 customers_cleaned_imputed = pd.DataFrame(customers_imp.fit_transform(customers_cleaned_encoded))
In the data dictionary for the CAMEO_INTL_2015 column of the dataset, the very last category is 'XX': unknown. When I run a value count on the WEALTH and LIFE_STAGE columns, there are 124 occurrences of 'XX' in those two columns. No other columns in the dataset have the 'XX' value except these. Again, I did not run into this problem with the other dataset. I know this is wordy, but any help is appreciated, and I can provide the project code as well.
A mentor and I tried troubleshooting by looking at all the steps that were performed on both datasets, to no avail. I was expecting the 'XX' values to be dealt with by the loop I mentioned earlier.
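One thing worth checking (a hedged sketch, not a verified fix; column and dataframe names follow the question): converting the leftover 'XX' markers to NaN before the Imputer runs, since 'XX' is not a key in either dictionary and so survives the replace() calls.
for col in ['WEALTH', 'LIFE_STAGE']:
    # coerce 'XX' (and anything else non-numeric) to NaN so the imputer can handle it
    df[col] = pd.to_numeric(df[col], errors='coerce')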

Efficiently load and manipulate csv using dask DataFrame

I am trying to manipulate the CSV file from https://www.kaggle.com/raymondsunartio/6000-nasdaq-stocks-historical-daily-prices using dask.dataframe. The original dataframe has columns 'date', 'ticker', 'open', 'close', etc.
My goal is to create a new data frame with index 'date' and columns as the closing price of each unique ticker.
The following code does the trick, but it is quite slow, taking almost a minute for N = 6. I suspect that dask tries to read the CSV file multiple times in the for loop, but I don't know how I would go about making this faster. My initial guess is that using df.groupby('ticker') somewhere would help, but I am not familiar enough with pandas.
import dask.dataframe as dd
from functools import reduce
def load_and_fix_csv(path: str, N: int, tickers: list = None) -> dd.DataFrame:
    raw = dd.read_csv(path, parse_dates=["date"])
    if tickers is None:
        tickers = raw.ticker.unique().compute()[:N]  # get unique tickers
    dfs = []
    for tick in tickers:
        tmp = raw[raw.ticker == tick][["date", "close"]]  # temporary dataframe for a specific ticker, with columns date and close
        dfs.append(tmp)
    df = reduce(lambda x, y: dd.merge(x, y, how="outer", on="date"), dfs)  # merge all dataframes on date
    df = df.set_index("date").compute()
    return df
Every kind of help is appreciated!
Thank you.
I'm pretty sure you're right that Dask is going "back to the well" on each loop iteration; this is because Dask builds a graph of operations and defers computation until it is forced or necessary. One thing I like to do is cut the graph after the reading operations with Client.persist:
from distributed import Client
client = Client()
def persist_load_and_fix_csv(path: str, N: int, tickers: list = None) -> dd.DataFrame:
    raw = dd.read_csv(path, parse_dates=["date"])
    # This "cuts the graph" of prior operations (just the `read_csv` here)
    raw = client.persist(raw)
    if tickers is None:
        tickers = raw.ticker.unique().compute()[:N]  # get unique tickers
    dfs = []
    for tick in tickers:
        tmp = raw[raw.ticker == tick][["date", "close"]]  # temporary dataframe for a specific ticker, with columns date and close
        dfs.append(tmp)
    df = reduce(lambda x, y: dd.merge(x, y, how="outer", on="date"), dfs)  # merge all dataframes on date
    df = df.set_index("date").compute()
    return df
In a Kaggle session I tested both functions with persist_load_and_fix_csv(csv_path, N=3) and managed to cut the time in half. You'll also get better performance by only keeping the columns you end up using.
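For instance, a sketch of restricting the read to just the columns the function ends up using (the path parameter follows the function above; usecols is passed through to pandas.read_csv):
raw = dd.read_csv(path, usecols=["date", "ticker", "close"], parse_dates=["date"])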
(Note: I've found that, at least for me and my code, if I start seeing .compute() crop up inside functions, I should step back and reevaluate the code paths; I view it as a code smell.)
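Separately, a possible way to avoid the per-ticker merge loop altogether is dask's pivot_table. This is a sketch, not part of the original answer; it assumes the whole file is pivoted (no N cutoff) and that the ticker column can be categorized up front:
def pivot_close(path: str) -> dd.DataFrame:
    raw = dd.read_csv(path, usecols=["date", "ticker", "close"], parse_dates=["date"])
    # pivot_table requires the `columns` column to be categorical with known categories
    raw = raw.categorize(columns=["ticker"])
    # one column of closing prices per ticker, indexed by date
    return dd.pivot_table(raw, index="date", columns="ticker", values="close", aggfunc="mean")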

exploding single pandas row into multiple rows using itertools's chain

I have a pandas dataframe, sectors, where every value in each field is a string, and all the fields except for sector_id have null values within them.
sector_id   sector_code    sector_percent   sector
----------------------------------------------------
UB1274      230;455;621    20;30;50         some_sector1
AB12312     234;786;3049   45;45;10         some_sector2
WEU234I     2344;9813      70;30            some_sector3
U2J3        3498           10               some_sector4
ALK345      ;;1289;        25;50;10;5       some_sector5
YAB45       2498;456       80               some_sector6
I'm basically trying to explode each row into multiple rows. With some help from the Stack Overflow community (split-cell-into-multiple-rows-in-pandas-dataframe), this is how I have been trying to do it:
import numpy as np
import pandas as pd
from itertools import chain

def chainer(s):
    return list(chain.from_iterable(s.str.split(';')))

sectors['sector_code'].fillna(value='0', inplace=True)
sectors['sector'].fillna(value='unknown', inplace=True)
sectors['sector_percent'].fillna(value='100', inplace=True)

len_of_split = sectors['sector_code'].str.split(';').map(len) if isinstance(sectors['sector_code'], str) else 0

pd.DataFrame({
    'sector_id': np.repeat(sectors['sector_id'], len_of_split),
    'sector_code': chainer(sectors['sector_code']),
    'sector': np.repeat(sectors['sector'], len_of_split),
    'sector_percent': chainer(sectors['sector_percent'])
})
but as there are also NULL values in all the columns except for sector_id, I'm getting this error:
ValueError: arrays must all be same length
Here's sample code for creating the above dummy dataframe sectors:
sectors = pd.DataFrame({
    'sector_id': ['UB1274', 'AB12312', 'WEU234I', 'U2J3', 'ALK345', 'YAB45'],
    'sector_code': ['230;455;621', '234;786;3049', '2344;9813', '3498', ';;1289;', '2498;456'],
    'sector_percent': ['20;30;50', '45;45;10', '70;30', '10', '25;50;10;5', '80'],
    'sector': ['some_sector1', 'some_sector2', 'some_sector3', 'some_sector4', 'some_sector5', 'some_sector6']
})
How do I handle this? Any help is appreciated. Thanks.
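For what it's worth, a possible way around the length mismatch (a hedged sketch, assuming pandas >= 1.3 for multi-column explode): split both ';'-separated columns, pad the shorter list in each row so the per-row lengths match, then explode both columns together.
split_cols = ['sector_code', 'sector_percent']
tmp = sectors.copy()
for c in split_cols:
    tmp[c] = tmp[c].fillna('').str.split(';')

# pad each row's shorter list so both columns have the same length per row
row_len = tmp[split_cols].applymap(len).max(axis=1)
for c in split_cols:
    tmp[c] = [lst + [''] * (n - len(lst)) for lst, n in zip(tmp[c], row_len)]

exploded = tmp.explode(split_cols, ignore_index=True)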

Python fuzzy string matching as correlation style table/matrix

I have a file with x number of string names and their associated IDs. Essentially two columns of data.
What I would like is a correlation-style table of size x by x (with the data in question as both the x-axis and y-axis), but instead of correlation, I would like the output to be the fuzzywuzzy library's fuzz.ratio(x, y), using the string names as input. Essentially, running every entry against every entry.
This is sort of what I had in mind. Just to show my intent:
import pandas as pd
from fuzzywuzzy import fuzz
df = pd.read_csv('random_data_file.csv')
df = df[['ID','String']]
df['String_Dup'] = df['String'] #creating duplicate of data in question
df = df.set_index('ID')
df = df.groupby('ID')[['String','String_Dup']].apply(fuzz.ratio())
But clearly this approach is not working for me at the moment. Any help is appreciated. It doesn't have to be pandas; it's just an environment I'm relatively more familiar with.
I hope my issue is clearly worded, and really, any input is appreciated.
Use pandas' crosstab function, followed by a column-wise apply to compute the fuzz.
This is considerably more elegant than my first answer.
import pandas as pd
from fuzzywuzzy import fuzz
# Create sample data frame.
df = pd.DataFrame([(1, 'abracadabra'), (2, 'abc'), (3, 'cadra'), (4, 'brabra')],
                  columns=['id', 'strings'])
# Create the cartesian product between the strings column with itself.
ct = pd.crosstab(df['strings'], df['strings'])
# Note: for pandas versions <0.22, the two series must have different names.
# In case you observe a "Level XX not found" error, the following may help:
# ct = pd.crosstab(df['strings'].rename(), df['strings'].rename())
# Apply the fuzz (column-wise). Argument col has type pd.Series.
ct = ct.apply(lambda col: [fuzz.ratio(col.name, x) for x in col.index])
# This results in the following:
# strings abc abracadabra brabra cadra
# strings
# abc 100 43 44 25
# abracadabra 43 100 71 62
# brabra 44 71 100 55
# cadra 25 62 55 100
For simplicity, I omitted the groupby operation suggested in your question. In case you want to apply the fuzzy string matching on groups, simply create a separate function:
def cross_fuzz(df):
    ct = pd.crosstab(df['strings'], df['strings'])
    ct = ct.apply(lambda col: [fuzz.ratio(col.name, x) for x in col.index])
    return ct

df.groupby('id').apply(cross_fuzz)
In pandas, the cartesian cross product between two columns can be created using a dummy variable and pd.merge. The fuzz operation is applied using apply. A final pivot operation will extract the format you had in mind. For simplicity, I omitted the groupby operation, but of course, you could apply the procedure to all group-tables by moving the code below into a separate function.
Here is what this could look like:
import pandas as pd
from fuzzywuzzy import fuzz
# Create sample data frame.
df = pd.DataFrame([(1, 'abracadabra'), (2, 'abc'), (3, 'cadra'), (4, 'brabra')],
                  columns=['id', 'strings'])
# Cross product, using a temporary column.
df['_tmp'] = 0
mrg = pd.merge(df, df, on='_tmp', suffixes=['_1','_2'])
# Apply the function between the two strings.
mrg['fuzz'] = mrg.apply(lambda s: fuzz.ratio(s['strings_1'], s['strings_2']), axis=1)
# Reorganize data.
ret = mrg.pivot(index='strings_1', columns='strings_2', values='fuzz')
ret.index.name = None
ret.columns.name = None
# This results in the following:
# abc abracadabra brabra cadra
# abc 100 43 44 25
# abracadabra 43 100 71 62
# brabra 44 71 100 55
# cadra 25 62 55 100
import csv
import numpy as np
from fuzzywuzzy import fuzz

input_file = csv.DictReader(open('random_data_file.csv'))

string = []
for row in input_file:             # the file is read row by row into a Python dict
    string.append(row["String"])   # the dict keys are the CSV headers
# now you have a list of the string values

length = len(string)
resultMat = np.zeros((length, length))  # 2D matrix of zeros, with size X * X
for i in range(length):
    for j in range(length):
        resultMat[i][j] = fuzz.ratio(string[i], string[j])

print(resultMat)
I did the implementation with a numpy 2D matrix. I am not that good with pandas, but I think what you were doing is adding another column and comparing it to the string column, meaning string[i] would be matched with string_dup[i], so all results would be 100.
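Since fuzz.ratio is symmetric, roughly half of the calls in the double loop above can be skipped. A small sketch along the same lines, with a hypothetical list of strings:
import numpy as np
from fuzzywuzzy import fuzz

strings = ['abracadabra', 'abc', 'cadra', 'brabra']
n = len(strings)
result = np.zeros((n, n), dtype=int)
for i in range(n):
    result[i, i] = 100                       # a string always matches itself
    for j in range(i + 1, n):
        score = fuzz.ratio(strings[i], strings[j])
        result[i, j] = result[j, i] = score  # fill both halves at once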
Hope it helps

pandas.DatetimeIndex frequency is None and can't be set

I created a DatetimeIndex from a "date" column:
sales.index = pd.DatetimeIndex(sales["date"])
Now the index looks as follows:
DatetimeIndex(['2003-01-02', '2003-01-03', '2003-01-04', '2003-01-06',
'2003-01-07', '2003-01-08', '2003-01-09', '2003-01-10',
'2003-01-11', '2003-01-13',
...
'2016-07-22', '2016-07-23', '2016-07-24', '2016-07-25',
'2016-07-26', '2016-07-27', '2016-07-28', '2016-07-29',
'2016-07-30', '2016-07-31'],
dtype='datetime64[ns]', name='date', length=4393, freq=None)
As you see, the freq attribute is None. I suspect that errors down the road are caused by the missing freq. However, if I try to set the frequency explicitly, I get the following error:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-148-30857144de81> in <module>()
1 #### DEBUG
----> 2 sales_train = disentangle(df_train)
3 sales_holdout = disentangle(df_holdout)
4 result = sarima_fit_predict(sales_train.loc[5002, 9990]["amount_sold"], sales_holdout.loc[5002, 9990]["amount_sold"])
<ipython-input-147-08b4c4ecdea3> in disentangle(df_train)
2 # transform sales table to disentangle sales time series
3 sales = df_train[["date", "store_id", "article_id", "amount_sold"]]
----> 4 sales.index = pd.DatetimeIndex(sales["date"], freq="d")
5 sales = sales.pivot_table(index=["store_id", "article_id", "date"])
6 return sales
/usr/local/lib/python3.6/site-packages/pandas/util/_decorators.py in wrapper(*args, **kwargs)
89 else:
90 kwargs[new_arg_name] = new_arg_value
---> 91 return func(*args, **kwargs)
92 return wrapper
93 return _deprecate_kwarg
/usr/local/lib/python3.6/site-packages/pandas/core/indexes/datetimes.py in __new__(cls, data, freq, start, end, periods, copy, name, tz, verify_integrity, normalize, closed, ambiguous, dtype, **kwargs)
399 'dates does not conform to passed '
400 'frequency {1}'
--> 401 .format(inferred, freq.freqstr))
402
403 if freq_infer:
ValueError: Inferred frequency None from passed dates does not conform to passed frequency D
So apparently a frequency has been inferred, but is stored neither in the freq nor inferred_freq attribute of the DatetimeIndex - both are None. Can someone clear up the confusion?
You have a couple options here:
pd.infer_freq
pd.tseries.frequencies.to_offset
I suspect that errors down the road are caused by the missing freq.
You are absolutely right. Here's what I use often:
def add_freq(idx, freq=None):
    """Add a frequency attribute to idx, through inference or directly.

    Returns a copy. If `freq` is None, it is inferred.
    """
    idx = idx.copy()
    if freq is None:
        if idx.freq is None:
            freq = pd.infer_freq(idx)
        else:
            return idx
    idx.freq = pd.tseries.frequencies.to_offset(freq)
    if idx.freq is None:
        raise AttributeError('no discernible frequency found to `idx`. Specify'
                             ' a frequency string with `freq`.')
    return idx
An example:
idx=pd.to_datetime(['2003-01-02', '2003-01-03', '2003-01-06']) # freq=None
print(add_freq(idx)) # inferred
DatetimeIndex(['2003-01-02', '2003-01-03', '2003-01-06'], dtype='datetime64[ns]', freq='B')
print(add_freq(idx, freq='D')) # explicit
DatetimeIndex(['2003-01-02', '2003-01-03', '2003-01-06'], dtype='datetime64[ns]', freq='D')
Using asfreq will actually reindex (fill) missing dates, so be careful of that if that's not what you're looking for.
The primary function for changing frequencies is the asfreq function.
For a DatetimeIndex, this is basically just a thin, but convenient
wrapper around reindex which generates a date_range and calls reindex.
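A rough sketch of that equivalence (for a hypothetical df with a sorted DatetimeIndex; not from the original answer):
full_range = pd.date_range(df.index.min(), df.index.max(), freq='D')
filled = df.reindex(full_range)   # same rows as df.asfreq('D')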
It seems to relate to missing dates, as 3kt notes. You might be able to "fix" this with asfreq('D') as EdChum suggests, but that gives you a continuous index with missing data values. It works fine for some sample data I made up:
df = pd.DataFrame({'x': [1, 2, 4]},
                  index=pd.to_datetime(['2003-01-02', '2003-01-03', '2003-01-06']))
df
Out[756]:
x
2003-01-02 1
2003-01-03 2
2003-01-06 4
df.index
Out[757]: DatetimeIndex(['2003-01-02', '2003-01-03', '2003-01-06'],
dtype='datetime64[ns]', freq=None)
Note that freq=None. If you apply asfreq('D'), this changes to freq='D':
df.asfreq('D')
Out[758]:
x
2003-01-02 1.0
2003-01-03 2.0
2003-01-04 NaN
2003-01-05 NaN
2003-01-06 4.0
df.asfreq('d').index
Out[759]:
DatetimeIndex(['2003-01-02', '2003-01-03', '2003-01-04', '2003-01-05',
'2003-01-06'],
dtype='datetime64[ns]', freq='D')
More generally, and depending on what exactly you are trying to do, you might want to check out the following for other options like reindex & resample: Add missing dates to pandas dataframe
I'm not sure if earlier versions have this, but newer pandas versions offer this simple solution:
# 'b' stands for business days
# 'w' for weekly, 'd' for daily, and you get the idea...
df.index.freq = 'b'
It could happen if, for example, the dates you are passing aren't sorted.
Look at this example:
example_ts = pd.Series(data=range(10),
                       index=pd.date_range('2020-01-01', '2020-01-10', freq='D'))
example_ts.index = pd.DatetimeIndex(np.hstack([example_ts.index[-1:],
                                               example_ts.index[:-1]]), freq='D')
The previous code runs into your error because of the non-sequential dates.
example_ts = pd.Series(data=range(10),
                       index=pd.date_range('2020-01-01', '2020-01-10', freq='D'))
example_ts.index = pd.DatetimeIndex(np.hstack([example_ts.index[:-1],
                                               example_ts.index[-1:]]), freq='D')
This one runs correctly, instead.
I am not sure, but I was having the same error. I was not able to resolve my issue with the suggestions posted above, but solved it using the solution below:
Pandas DatetimeIndex + seasonal_decompose = missing frequency.
Best Regards
Similar to some of the other answers here, my problem was that my data had missing dates.
Instead of dealing with this issue in Python, I opted to change the SQL query I was using to source the data. So instead of skipping dates, I wrote the query such that it fills in missing dates with the value 0.
It seems to be an issue with missing values in the index. I simply rebuilt the index based on the original index, at the frequency I needed:
df.index = pd.date_range(start=df.index[0], end=df.index[-1], freq="h")
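One caveat worth adding (a hedged sketch, not from the original answer): reassigning the index like this only relabels the existing rows, so if any hourly timestamps are actually missing, the data silently shifts. Checking the length first guards against that:
expected = pd.date_range(start=df.index[0], end=df.index[-1], freq="h")
if len(expected) == len(df.index):
    df.index = expected    # safe: nothing was missing
else:
    df = df.asfreq("h")    # reindex instead, making the gaps explicit NaNs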
