This seems like very basic knowledge, but I got stuck despite having some theoretical background in data processing (via other software). It's worth mentioning that I'm new to Python and the pandas library.
So, I've got a data frame:
My task is to turn the values of the 'Series Name' column into separate columns (transform from long to wide). I've spent ages trying different methods, but only got errors.
For example:
mydata = mydata.pivot(index=['Country', 'Year'], columns='Series Name', values='Value')
And I got an error:
... a lot of text...
ValueError: Length of passed values is 2487175, index implies 2
Could anybody guide me through that process please? Thanks.
It's for this code:
mydata = mydata.pivot(index=['Country', 'Year'], columns='Series Name', values='Value')
Error message:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-4-8169d6d374c7> in <module>
----> 1 mydata = mydata.pivot(index=['Country', 'Year'], columns='Series Name', values='Value')
~/anaconda3_501/lib/python3.6/site-packages/pandas/core/frame.py in pivot(self, index, columns, values)
5192 """
5193 from pandas.core.reshape.reshape import pivot
-> 5194 return pivot(self, index=index, columns=columns, values=values)
5195
5196 _shared_docs['pivot_table'] = """
~/anaconda3_501/lib/python3.6/site-packages/pandas/core/reshape/reshape.py in pivot(self, index, columns, values)
412 else:
413 indexed = self._constructor_sliced(self[values].values,
--> 414 index=index)
415 return indexed.unstack(columns)
416
~/anaconda3_501/lib/python3.6/site-packages/pandas/core/series.py in __init__(self, data, index, dtype, name, copy, fastpath)
260 'Length of passed values is {val}, '
261 'index implies {ind}'
--> 262 .format(val=len(data), ind=len(index)))
263 except TypeError:
264 pass
ValueError: Length of passed values is 2487175, index implies 2
Try maybe:
mydata = mydata.pivot_table(index=['Country', 'Year'], columns='Series Name', values='Value', aggfunc='sum')
(if you want to sum your Value). It seems that you need to aggregate your data explicitly somehow.
Although it would be good if you shared the full error message.
I managed to reproduce your error. As I said, you need to provide an aggregating function:
import pandas as pd
df=pd.DataFrame({"a": list("xyzpqr"), "b": list("abbbaa"), "c": [4,3,6,2,7,5], "d": list("pqqppp")})
df2=df.pivot(index=["b", "d"], columns="a", values="c")
#ValueError: Length of passed values is 6, index implies 2
df2=df.pivot_table(index=["b", "d"], columns="a", values="c", aggfunc=set)
# works fine - you need an aggregation function, e.g. list/set to collect all/unique values, or sum/max to do a numeric operation
Nearly there. The resulting table is
How is it possible to put 'Country' and 'Year' on the same level as the other column names, so that I can export it normally to Excel? If I export it as it is now, 'Country' and 'Year' are not included in the table.
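One way to get them back as ordinary columns (just a sketch, assuming the pivoted frame is stored in a variable like df2 as in the example above; the output file name is only a placeholder) is to flatten the index with reset_index() before exporting:
df2 = df2.reset_index()                    # the index levels become regular columns again
df2.columns.name = None                    # drop the leftover columns label from the pivot
df2.to_excel('output.xlsx', index=False)   # placeholder file name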
I am cleaning my data for a machine learning project by replacing the missing values with zeros and with the mean for the 'Age' and 'Fare' columns respectively. The code for this is given below:
train_data['Age'] = train_data['Age'].fillna(0)
mean = train_data['Fare'].mean()
train_data['Fare'] = train_data['Fare'].fillna(mean)
Since I would have to do this multiple times for other sets of data, I want to automate this process by creating a generic function that takes the DataFrame as input, performs the operations to modify it, and returns the modified DataFrame. The code for that is given below:
def data_cleaning(df):
    df['Age'] = df['Age'].fillna(0)
    fare_mean = df['Fare'].mean()
    df['Fare'] = df['Fare'].fillna()
    return df
However when I pass the training data DataFrame:
train_data = data_cleaning(train_data)
I get the following error:
/opt/conda/lib/python3.7/site-packages/ipykernel_launcher.py:2:
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
/tmp/ipykernel_42/1440633985.py in <module>
1 #print(train_data)
----> 2 train_data = data_cleaning(train_data)
3 cross_val_data = data_cleaning(cross_val_data)
/tmp/ipykernel_42/3053068338.py in data_cleaning(df)
2 df['Age'] = df['Age'].fillna(0)
3 fare_mean = df['Fare'].mean()
----> 4 df['Fare'] = df['Fare'].fillna()
5 return df
/opt/conda/lib/python3.7/site-packages/pandas/util/_decorators.py in wrapper(*args,
**kwargs)
309 stacklevel=stacklevel,
310 )
--> 311 return func(*args, **kwargs)
312
313 return wrapper
/opt/conda/lib/python3.7/site-packages/pandas/core/series.py in fillna(self, value,
method, axis, inplace, limit, downcast)
4820 inplace=inplace,
4821 limit=limit,
-> 4822 downcast=downcast,
4823 )
4824
/opt/conda/lib/python3.7/site-packages/pandas/core/generic.py in fillna(self, value,
method, axis, inplace, limit, downcast)
6311 """
6312 inplace = validate_bool_kwarg(inplace, "inplace")
-> 6313 value, method = validate_fillna_kwargs(value, method)
6314
6315 self._consolidate_inplace()
/opt/conda/lib/python3.7/site-packages/pandas/util/_validators.py in
validate_fillna_kwargs(value, method, validate_scalar_dict_value)
368
369 if value is None and method is None:
--> 370 raise ValueError("Must specify a fill 'value' or 'method'.")
371 elif value is None and method is not None:
372 method = clean_fill_method(method)
ValueError: Must specify a fill 'value' or 'method'.
On doing some research, I found that I would have to use the apply() and map() functions instead, but I am not sure how to pass in the mean value of the column. Furthermore, this does not scale well, as I would have to calculate all the fillna values before passing them into the function, which is cumbersome. Therefore I want to ask: is there a better way to automate data cleaning?
In the line df['Fare'] = df['Fare'].fillna() in your function, you did not fill the NaNs with anything, which is why it raises an error. You should change it to df['Fare'] = df['Fare'].fillna(fare_mean).
If you intend to make this usable from another file in the same directory, you can just import it in that file:
from file_that_contain_function import function_name
And if you intend to make it reusable across your workspace/virtual environment, you may need to create your own Python package.
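For example, a minimal sketch of that layout (the file name is just a placeholder, and it uses the corrected fillna(fare_mean) line):
# cleaning.py  (placeholder module name)
def data_cleaning(df):
    df['Age'] = df['Age'].fillna(0)
    fare_mean = df['Fare'].mean()
    df['Fare'] = df['Fare'].fillna(fare_mean)  # pass the fill value explicitly
    return df

# another script in the same directory
# from cleaning import data_cleaning
# train_data = data_cleaning(train_data)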
So yes, the other answer explains where the error is coming from.
However, the warning at the beginning has nothing to do with filling NaNs. The warning is telling you that you are modifying a slice of a copy of your dataframe. Change your code to
def data_cleaning(df):
    df['Age'] = df.loc[:, 'Age'].fillna(0)
    fare_mean = df['Fare'].mean()
    df['Fare'] = df.loc[:, 'Fare'].fillna(fare_mean)  # <- and also fix this error
    return df
I suggest also searching that specific warning here, as there are hundreds of posts detailing this warning and how to deal with it. Here's a good one.
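One common way to avoid the warning altogether (just a sketch, and not the only option) is to work on an explicit copy inside the function:
def data_cleaning(df):
    df = df.copy()  # work on an explicit copy, so pandas never sees a chained slice
    df['Age'] = df['Age'].fillna(0)
    df['Fare'] = df['Fare'].fillna(df['Fare'].mean())
    return df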
I'm attempting to create a raw string variable from a pandas dataframe, which will eventually be written to a .cfg file, by first joining two columns together as shown below while avoiding None:
Section of df:
command value
...
439 sensitivity "0.9"
440 cl_teamid_overhead_always 1
441 host_writeconfig None
...
code:
...
df = df['value'].replace('None', np.nan, inplace=True)
print df
df = df['command'].astype(str)+' '+df['value'].astype(str)
print df
cfg_output = '\n'.join(df.tolist())
print cfg_output
I've attempted to first replace all the None values with NaN so that no lines in cfg_output contain "None" as part of the string. However, by doing so I seem to get a few undesired results. I made use of print statements to see what is going on.
It seems that df = df['value'].replace('None', np.nan, inplace=True) simply outputs None.
It seems that df = df['command'].astype(str)+' '+df['value'].astype(str) and cfg_output = '\n'.join(df.tolist()) cause the following error:
TypeError: 'NoneType' object has no attribute '__getitem__'
Therefore, I was thinking that by ignoring any occurrences of NaN, the code might run smoothly, although I'm unsure how to do so using pandas.
Ultimately, my desired output would be as follows:
sensitivity "0.9"
cl_teamid_overhead_always 1
host_writeconfig
First of all, df['value'].replace('None', np.nan, inplace=True) returns None because you're calling the method with the inplace=True argument. This argument tells replace not to return anything but instead modify the original dataframe in place, similar to how pop or append work on lists.
With that being said, you can also get the desired output by calling fillna with an empty string:
import pandas as pd
import numpy as np
d = {
'command': ['sensitivity', 'cl_teamid_overhead_always', 'host_writeconfig'],
'value': ['0.9', 1, None]
}
df = pd.DataFrame(d)
# df['value'].replace('None', np.nan, inplace=True)
df = df['command'].astype(str) + ' ' + df['value'].fillna('').astype(str)
cfg_output = '\n'.join(df.tolist())
>>> print(cfg_output)
sensitivity 0.9
cl_teamid_overhead_always 1
host_writeconfig
You can replace None with '':
df=df.replace('None','')
df['command'].astype(str)+' '+df['value'].astype(str)
Out[436]:
439 sensitivity 0.9
440 cl_teamid_overhead_always 1
441 host_writeconfig
dtype: object
I created a DatetimeIndex from a "date" column:
sales.index = pd.DatetimeIndex(sales["date"])
Now the index looks as follows:
DatetimeIndex(['2003-01-02', '2003-01-03', '2003-01-04', '2003-01-06',
'2003-01-07', '2003-01-08', '2003-01-09', '2003-01-10',
'2003-01-11', '2003-01-13',
...
'2016-07-22', '2016-07-23', '2016-07-24', '2016-07-25',
'2016-07-26', '2016-07-27', '2016-07-28', '2016-07-29',
'2016-07-30', '2016-07-31'],
dtype='datetime64[ns]', name='date', length=4393, freq=None)
As you can see, the freq attribute is None. I suspect that errors down the road are caused by the missing freq. However, if I try to set the frequency explicitly:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-148-30857144de81> in <module>()
1 #### DEBUG
----> 2 sales_train = disentangle(df_train)
3 sales_holdout = disentangle(df_holdout)
4 result = sarima_fit_predict(sales_train.loc[5002, 9990]["amount_sold"], sales_holdout.loc[5002, 9990]["amount_sold"])
<ipython-input-147-08b4c4ecdea3> in disentangle(df_train)
2 # transform sales table to disentangle sales time series
3 sales = df_train[["date", "store_id", "article_id", "amount_sold"]]
----> 4 sales.index = pd.DatetimeIndex(sales["date"], freq="d")
5 sales = sales.pivot_table(index=["store_id", "article_id", "date"])
6 return sales
/usr/local/lib/python3.6/site-packages/pandas/util/_decorators.py in wrapper(*args, **kwargs)
89 else:
90 kwargs[new_arg_name] = new_arg_value
---> 91 return func(*args, **kwargs)
92 return wrapper
93 return _deprecate_kwarg
/usr/local/lib/python3.6/site-packages/pandas/core/indexes/datetimes.py in __new__(cls, data, freq, start, end, periods, copy, name, tz, verify_integrity, normalize, closed, ambiguous, dtype, **kwargs)
399 'dates does not conform to passed '
400 'frequency {1}'
--> 401 .format(inferred, freq.freqstr))
402
403 if freq_infer:
ValueError: Inferred frequency None from passed dates does not conform to passed frequency D
So apparently a frequency has been inferred, but is stored neither in the freq nor inferred_freq attribute of the DatetimeIndex - both are None. Can someone clear up the confusion?
You have a couple options here:
pd.infer_freq
pd.tseries.frequencies.to_offset
I suspect that errors down the road are caused by the missing freq.
You are absolutely right. Here's what I use often:
def add_freq(idx, freq=None):
    """Add a frequency attribute to idx, through inference or directly.
    Returns a copy. If `freq` is None, it is inferred.
    """
    idx = idx.copy()
    if freq is None:
        if idx.freq is None:
            freq = pd.infer_freq(idx)
        else:
            return idx
    idx.freq = pd.tseries.frequencies.to_offset(freq)
    if idx.freq is None:
        raise AttributeError('no discernible frequency found to `idx`. Specify'
                             ' a frequency string with `freq`.')
    return idx
An example:
idx=pd.to_datetime(['2003-01-02', '2003-01-03', '2003-01-06']) # freq=None
print(add_freq(idx)) # inferred
DatetimeIndex(['2003-01-02', '2003-01-03', '2003-01-06'], dtype='datetime64[ns]', freq='B')
print(add_freq(idx, freq='D')) # explicit
DatetimeIndex(['2003-01-02', '2003-01-03', '2003-01-06'], dtype='datetime64[ns]', freq='D')
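Applied to the sales frame from the question, usage would look something like the sketch below (this assumes the dates actually conform to, or allow inference of, the frequency; otherwise the function raises):
sales.index = add_freq(sales.index)             # let it infer the frequency
# or, if you know the data is daily:
sales.index = add_freq(sales.index, freq='D')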
Using asfreq will actually reindex (fill) missing dates, so be careful of that if that's not what you're looking for.
The primary function for changing frequencies is the asfreq function.
For a DatetimeIndex, this is basically just a thin, but convenient
wrapper around reindex which generates a date_range and calls reindex.
It seems to relate to missing dates, as 3kt notes. You might be able to "fix" this with asfreq('D') as EdChum suggests, but that gives you a continuous index with missing data values. It works fine for some sample data I made up:
df=pd.DataFrame({ 'x':[1,2,4] },
index=pd.to_datetime(['2003-01-02', '2003-01-03', '2003-01-06']) )
df
Out[756]:
x
2003-01-02 1
2003-01-03 2
2003-01-06 4
df.index
Out[757]: DatetimeIndex(['2003-01-02', '2003-01-03', '2003-01-06'],
dtype='datetime64[ns]', freq=None)
Note that freq=None. If you apply asfreq('D'), this changes to freq='D':
df.asfreq('D')
Out[758]:
x
2003-01-02 1.0
2003-01-03 2.0
2003-01-04 NaN
2003-01-05 NaN
2003-01-06 4.0
df.asfreq('d').index
Out[759]:
DatetimeIndex(['2003-01-02', '2003-01-03', '2003-01-04', '2003-01-05',
'2003-01-06'],
dtype='datetime64[ns]', freq='D')
More generally, and depending on what exactly you are trying to do, you might want to check out the following for other options like reindex & resample: Add missing dates to pandas dataframe
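For reference, a reindex-based sketch of that idea on the same toy frame (frequency and fill value are only illustrative):
# build an explicit daily range and reindex against it; missing days become NaN
# (pass fill_value=0 to reindex if you want zeros instead)
full_range = pd.date_range(df.index.min(), df.index.max(), freq='D')
df_daily = df.reindex(full_range)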
I'm not sure if earlier versions of Python have this, but 3.6 has this simple solution:
# 'b' stands for business days
# 'w' for weekly, 'd' for daily, and you get the idea...
df.index.freq = 'b'
It could happen, for example, if the dates you are passing aren't sorted.
Look at this example:
example_ts = pd.Series(data=range(10),
index=pd.date_range('2020-01-01', '2020-01-10', freq='D'))
example_ts.index = pd.DatetimeIndex(np.hstack([example_ts.index[-1:],
example_ts.index[:-1]]), freq='D')
The previous code runs into your error because of the non-sequential dates.
example_ts = pd.Series(data=range(10),
index=pd.date_range('2020-01-01', '2020-01-10', freq='D'))
example_ts.index = pd.DatetimeIndex(np.hstack([example_ts.index[:-1],
example_ts.index[-1:]]), freq='D')
This one runs correctly, instead.
I am not sure, but I was having the same error. I was not able to resolve my issue with the suggestions posted above, but solved it using the solution below:
Pandas DatetimeIndex + seasonal_decompose = missing frequency.
Similar to some of the other answers here, my problem was that my data had missing dates.
Instead of dealing with this issue in Python, I opted to change the SQL query I was using to source the data: instead of skipping dates, I wrote the query so that it would fill in missing dates with the value 0.
It seems to be an issue with missing values in the index. I simply re-built the index, based on the original index, at the frequency I needed:
df.index = pd.date_range(start=df.index[0], end=df.index[-1], freq="h")
I am trying to split imported CSV files (time series) and manipulate them using the pandas MultiIndex slicer method .xs(). The following df replicates the structure of my imported CSV file.
import pandas as pd
df = pd.DataFrame(
{'Sensor ID': [14,1,3,14,3],
'Building ID': [109,109,109,109,109],
'Date/Time': ["26/10/2016 14:31:14","26/10/2016 14:31:16", "26/10/2016 14:32:17", "26/10/2016 14:35:14", "26/10/2016 14:35:38"],
'Reading': [20.95, 20.62, 22.45, 20.65, 22.83],
})
df.set_index(['Sensor ID','Date/Time'], inplace=True)
df.sort_index(inplace=True)
print(df)
SensorList = [1, 3, 14]
for s in SensorList:
    df1 = df.xs(s, level='Sensor ID')
I have tested the code on a small excerpt of CSV data and it works fine. However, when running it on the entire CSV file, I receive the error: ValueError: Length mismatch: Expected axis has 19562 elements, new values have 16874 elements.
Printing df.info() returns the following:
<class 'pandas.core.frame.DataFrame'>
MultiIndex: 65981 entries, (1, 2016-10-26 14:35:15) to (19, 2016-11-07 11:27:14)
Data columns (total 2 columns):
Building ID 65981 non-null int64
Reading 65981 non-null float64
dtypes: float64(1), int64(1)
memory usage: 1.5+ MB
None
Any tips on what may be causing the error?
EDIT
I inadvertently truncated my code, leaving it pointless in its current form. The original code resamples values into 15-minute and 1-hour intervals.
with:
units = ['D1','D3','D6','D10']
unit_output_path = './' + unit + '/'
the loop does:
for s in SensorList:
    ## Slice multi-index to isolate all readings for sensor s
    df1 = df_mi.xs(s, level='Sensor ID')
    df1.drop('Building ID', axis=1, inplace=True)
    ## Resample by 15min and 1hr intervals and export individual csv files
    df1_15min = df1.resample('15Min').mean().round(1)
    df1_hr = df1.resample('60Min').mean().round(1)
Traceback:
File "D:\AN6478\AN6478_POE_ABo.py", line 52, in <module>
df1 = df_mi.xs(s, level='Sensor ID')
File "C:\Program Files\Anaconda3\lib\site-packages\pandas\core\generic.py", line 1736, in xs
setattr(result, result._get_axis_name(axis), new_ax)
File "C:\Program Files\Anaconda3\lib\site-packages\pandas\core\generic.py", line 2685, in __setattr__
return object.__setattr__(self, name, value)
File "pandas\src\properties.pyx", line 65, in pandas.lib.AxisProperty.__set__ (pandas\lib.c:44748)
File "C:\Program Files\Anaconda3\lib\site-packages\pandas\core\generic.py", line 428, in _set_axis
self._data.set_axis(axis, labels)
File "C:\Program Files\Anaconda3\lib\site-packages\pandas\core\internals.py", line 2635, in set_axis
(old_len, new_len))
ValueError: Length mismatch: Expected axis has 19562 elements, new values have 16874 elements
I can't tell you why exactly df1 = df_mi.xs(s, level='Sensor ID') raises the ValueError here. Where does df_mi come from?
Here is an alternative using groupby which accomplishes what you want on your given dummy data frame, without relying on the MultiIndex and xs:
# reset index to have DatetimeIndex, otherwise resample won't work
df = df.reset_index(0)
df.index = pd.to_datetime(df.index)
# create data frame for each sensor, keep relevant "Reading" column
grouped = df.groupby("Sensor ID")["Reading"]
# iterate each sensor data frame
for sensor, sub_df in grouped:
    quarterly = sub_df.resample('15Min').mean().round(1)
    hourly = sub_df.resample('60Min').mean().round(1)
    # implement your to_csv saving here
Note: you could also use groupby on the MultiIndex with df.groupby(level="Sensor ID"); however, since you want to resample later on, it is easier to drop Sensor ID from the MultiIndex, which simplifies things overall.
I am just beginning to learn analytics with Python for network analysis, using the Python for Data Analysis book, and I'm getting confused by an exception I get while doing some groupbys... here's my situation.
I have a CSV of NetFlow data that I've imported to pandas. The data looks something like:
dt, srcIP, srcPort, dstIP, dstPort, bytes
2013-06-06 00:00:01.123, 123.123.1.1, 12345, 234.234.1.1, 80, 75
I've imported and indexed the data as follows:
df = pd.read_csv('mycsv.csv')
df.index = pd.to_datetime(df.pop('dt'))
What I want is a count of unique srcIPs which visit my servers per time period (I have data over several days, and I'd like the time period broken down by date and hour). I can obtain an overall traffic graph by grouping and plotting as follows:
df.groupby([lambda t: t.date(), lambda t: t.hour]).srcIP.nunique().plot()
However, I want to know how that overall traffic is split amongst my servers. My intuition was to additionally group by the 'dstIP' column (which only has 5 unique values), but I get errors when I try to aggregate on srcIP.
grouped = df.groupby([lambda t: t.date(), lambda t: t.hour, 'dstIP'])
grouped.sip.nunique()
...
Exception: Reindexing only valid with uniquely valued Index objects
So, my specific question is: how can I avoid this exception in order to create a plot where traffic is aggregated over 1-hour blocks and there is a different series for each server?
More generally, please let me know what newb errors I'm making.
Also, the data does not have regular frequency timestamps and I don't want sampled data in case that makes any difference in your answer.
EDIT 1
This is my IPython session exactly as input. Output omitted except for the deepest few calls in the error.
EDIT 2
Upgrading pandas from 0.8.0 to 0.12.0 has yielded the more descriptive exception shown below.
import numpy as np
import pandas as pd
import time
import datetime
full_set = pd.read_csv('june.csv', parse_dates=True, index_col=0)
full_set.sort_index(inplace=True)
gp = full_set.groupby(lambda t: (t.date(), t.hour, full_set['dip'][t]))
gp['sip'].nunique()
...
/usr/local/lib/python2.7/dist-packages/pandas/core/groupby.pyc in _make_labels(self)
1239 raise Exception('Should not call this method grouping by level')
1240 else:
-> 1241 labs, uniques = algos.factorize(self.grouper, sort=self.sort)
1242 uniques = Index(uniques, name=self.name)
1243 self._labels = labs
/usr/local/lib/python2.7/dist-packages/pandas/core/algorithms.pyc in factorize(values, sort, order, na_sentinel)
123 table = hash_klass(len(vals))
124 uniques = vec_klass()
--> 125 labels = table.get_labels(vals, uniques, 0, na_sentinel)
126
127 labels = com._ensure_platform_int(labels)
/usr/local/lib/python2.7/dist-packages/pandas/hashtable.so in pandas.hashtable.PyObjectHashTable.get_labels (pandas/hashtable.c:12229)()
/usr/local/lib/python2.7/dist-packages/pandas/core/generic.pyc in __hash__(self)
52 def __hash__(self):
53 raise TypeError('{0!r} objects are mutable, thus they cannot be'
---> 54 ' hashed'.format(self.__class__.__name__))
55
56 def __unicode__(self):
TypeError: 'TimeSeries' objects are mutable, thus they cannot be hashed
So I'm not 100 percent sure why that exception was raised, but here are a few suggestions:
You can read in your data and parse the datetime and index by the datetime all at once with read_csv:
df = pd.read_csv('mycsv.csv', parse_dates=True, index_col=0)
Then you can form your groups by using a lambda function that returns a tuple of values:
gp = df.groupby( lambda t: ( t.date(), t.hour, df['dstIP'][t] ) )
The input to this lambda function is the index; we can use this index to go into the dataframe in the outer scope, retrieve the dstIP value at that index, and thus factor it into the grouping.
Now that we have the grouping, we can apply the aggregator:
gp['srcIP'].nunique()
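To get one line per server from that result, one option (a sketch; the level names are just my own labels) is to turn the tuple index into a proper MultiIndex and unstack the dstIP level before plotting:
counts = gp['srcIP'].nunique()
# the groupby key above returns tuples, so the result's index is an index of tuples;
# convert it to a MultiIndex and pivot dstIP out into one column per server
counts.index = pd.MultiIndex.from_tuples(counts.index, names=['date', 'hour', 'dstIP'])
counts.unstack('dstIP').plot()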
I ended up solving my problem by adding a new column of hour-truncated datetimes to the original dataframe as follows:
f = lambda i: i.strftime('%Y-%m-%d %H:00:00')
full_set['hours'] = full_set.index.map(f)
Then I can groupby('dip') and loop through each destination IP, creating an hourly grouped plot as I go:
dipgroup = full_set.groupby('dip')
for d, g in dipgroup:
    g.groupby('hours').sip.nunique().plot()