Add third dimension to a 2-dimensional numpy.ndarray - python

I have an array which contains 50 time series. Each time series has 50 values.
The shape of my array is therefore:
print(arr.shape)  # (50, 50)
I want to extract the 50 time series and I want to assign a year to each of them:
years = list(range(1900,1950))
print(len(years))  # 50
The order should be maintained: years[0] should correspond to arr[0, :] (the first time series).
I'd be glad for any help!
Edit: Here is a small example:
import random
import numpy as np

years = list(range(1900, 1904))
values = random.sample(range(10, 30), 16)
arr = np.reshape(values, (4, 4))

Let's say you have the following data:
import numpy as np
data = np.random.randint(low=1, high=9, size=(5, 4))
years = np.arange(1900, 1905)
You can use np.concatenate:
>>> arr = np.concatenate([years[:, None], data], axis=1)
>>> arr
array([[1900,    5,    8,    1,    2],
       [1901,    3,    3,    1,    5],
       [1902,    7,    4,    7,    5],
       [1903,    1,    6,    6,    4],
       [1904,    4,    5,    3,    8]])
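A note on the syntax: years[:, None] adds a length-1 axis, turning the 1-D years into a (5, 1) column that can be concatenated next to data. A sketch of an equivalent approach, np.column_stack, which promotes 1-D inputs to columns automatically:
import numpy as np

data = np.random.randint(low=1, high=9, size=(5, 4))
years = np.arange(1900, 1905)

# column_stack treats each 1-D input as a column, so no manual
# reshaping of years is needed before stacking.
arr = np.column_stack([years, data])
print(arr.shape)  # (5, 5): year in column 0, the series values after it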
or maybe use a pandas.DataFrame:
>>> import pandas as pd
>>> df = pd.DataFrame(data)
>>> df = df.assign(year=years)
>>> df = df.set_index("year")
>>> df
      0  1  2  3
year
1900  3  2  8  1
1901  5  8  5  2
1902  3  5  4  3
1903  6  2  7  6
1904  8  8  4  6
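With the year as the index, each time series can then be retrieved by label; a small usage sketch, using the values shown above:
>>> df.loc[1900].to_numpy()
array([3, 2, 8, 1])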

Is it okay to use lambda in this case?

I'm trying to figure out a way to loop over a pandas DataFrame to generate a new column.
Here's an example of the dataframe:
df = pd.DataFrame({"pdb" : ["a", "b"], "beg": [1, 2], "end" : [10, 11]})
for index, row in df.iterrows():
    df['range'] = [list(x) for x in zip(df['beg'], df['end'])]
And now I want to create a new column that takes the first and last number of df["range"] and fills the list with the numbers in between (i.e., the first one will be [1 2 3 4 5 6 7 8 9 10]).
So far I think that I should be using something like this, but I could be completely wrong:
df["total"] = df["range"].map(lambda x: #and here I should append all the "x" that are betwen df["range"][0] and df["range"][1]
Here's an example of the result that I'm looking for:
  pdb  beg  end    range                  total
0   a    1   10  [1, 10]   1 2 3 4 5 6 7 8 9 10
1   b    2   11  [2, 11]  2 3 4 5 6 7 8 9 10 11
I could use some help with the lambda function; I get really confused by the syntax.
Try with apply:
df['new'] = df.apply(lambda x: list(range(x['beg'], x['end'] + 1)), axis=1)
Out[423]:
0     [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
1    [2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
dtype: object
This should work:
df['total'] = df['range'].apply(lambda x: [n for n in range(x[0], x[1]+1)])
As per your desired output, you need:
In [18]: df['new'] = df.apply(lambda x: " ".join(list(map(str, range(x['beg'], x['end'] + 1)))), axis=1)
In [19]: df
Out[19]:
  pdb  beg  end    range                    new
0   a    1   10  [1, 10]   1 2 3 4 5 6 7 8 9 10
1   b    2   11  [2, 11]  2 3 4 5 6 7 8 9 10 11
If you want to use iterrows then you can do it in the loop itself as follows:
Code:
import pandas as pd

df = pd.DataFrame({"pdb": ["a", "b"], "beg": [1, 2], "end": [10, 11]})
for index, row in df.iterrows():
    df['range'] = [list(x) for x in zip(df['beg'], df['end'])]
    df['total'] = [range(*x) for x in zip(df['beg'], df['end'])]
Output:
  pdb  beg  end    range                         total
0   a    1   10  [1, 10]   (1, 2, 3, 4, 5, 6, 7, 8, 9)
1   b    2   11  [2, 11]  (2, 3, 4, 5, 6, 7, 8, 9, 10)
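Note that range(*x) excludes the endpoint, which is why each tuple above stops one short of end. A minimal fix for the inclusive ranges the question asks for:
df['total'] = [list(range(b, e + 1)) for b, e in zip(df['beg'], df['end'])]
# 0     [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
# 1    [2, 3, 4, 5, 6, 7, 8, 9, 10, 11]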

Loss of numpy array dimensions when saving to and retrieving from a csv file using pandas

I have a numpy array and I want to write it to a .csv file with pandas, so I run this:
import numpy
import pandas as pd

data = numpy.array([1, 2, 3, 4, 5, 6])
print(data)
print(data.shape)
df = pd.DataFrame(columns=['content'])
df.loc[0, 'content'] = data
df.to_csv('data.csv', index=False)
print(df.head())
>>> [1 2 3 4 5 6]
>>> (6,)
>>>               content
>>> 0  [1, 2, 3, 4, 5, 6]
As seen in the output, the shape of the numpy array is (6,).
But the problem is that when I retrieve it from the .csv file, the array's shape is lost and changes to ():
data = pd.read_csv('data.csv')
val = numpy.array(data['content'][0])
print(val.shape)
print(val)
>>> ()
>>> [1 2 3 4 5 6]
Why is this happening? How can I solve this problem?
In [46]: import pandas as pd
    ...: import numpy as np
In [47]: data = np.arange(1, 7)
In [48]: data.shape
Out[48]: (6,)
The original dataframe:
In [49]: df = pd.DataFrame(columns=['content'])
    ...: df.loc[0, 'content'] = data
In [50]: df
Out[50]:
              content
0  [1, 2, 3, 4, 5, 6]
In [52]: df.to_numpy()
Out[52]: array([[array([1, 2, 3, 4, 5, 6])]], dtype=object)
to_numpy from a dataframe produces a 2d array, here with 1 element, and that element is an array itself.
In [54]: df.to_numpy()[0,0]
Out[54]: array([1, 2, 3, 4, 5, 6])
Look at the full file, not just the head:
In [55]: df.to_csv('data.csv', index = False)
In [56]: cat data.csv
content
[1 2 3 4 5 6]
That second line is the str(data) display, with [] and no commas.
read_csv loads that as a string. It does not try to convert it to an array; it can't.
In [57]: d = pd.read_csv('data.csv')
In [58]: d
Out[58]:
         content
0  [1 2 3 4 5 6]
In [59]: d.to_numpy()
Out[59]: array([['[1 2 3 4 5 6]']], dtype=object)
In [60]: d.to_numpy()[0,0]
Out[60]: '[1 2 3 4 5 6]'
csv is not a good format for saving a dataframe that contains objects such as arrays or lists as elements. It only works well for elements that are simple numbers and strings.
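If the data is already stuck in a csv of this shape, a minimal recovery sketch (assuming the bracketed, space-separated string format shown above):
import numpy as np
import pandas as pd

d = pd.read_csv('data.csv')
# Parse the '[1 2 3 4 5 6]' string back into a 1-D integer array.
val = np.array(d['content'][0].strip('[]').split(), dtype=int)
print(val.shape)  # (6,)
For round-tripping a bare array, np.save and np.load preserve dtype and shape exactly:
np.save('data.npy', data)  # data is the original (6,) array
val = np.load('data.npy')
print(val.shape)  # (6,)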

Non-fixed rolling window

I am looking to implement a rolling window on a list, but instead of a fixed window length, I would like to provide a list of window sizes:
Something like this:
l1 = [5, 3, 8, 2, 10, 12, 13, 15, 22, 28]
l2 = [1, 2, 2, 2, 3, 4, 2, 3, 5, 3]
get_custom_roling(l1, l2, np.average)
and the result would be:
[5, 4, 5.5, 5, 6.67, ....]
6.67 is calculated as the average of the 3 elements 10, 2, 8.
I implemented a slow solution below; every idea for making it quicker is welcome :)
import numpy as np
import pandas as pd

def get_the_list(end_point, number_points):
    """
    example: get_the_list(6, 3) ==> [6, 5, 4]
    example: get_the_list(9, 5) ==> [9, 8, 7, 6, 5]
    """
    if np.isnan(number_points):
        return []
    number_points = int(number_points)
    return list(range(end_point, end_point - number_points, -1))

def get_idx(s):
    # For each position, build the list of indices covered by its window.
    ss = list(enumerate(s))
    sss = (get_the_list(*elem) for elem in ss)
    return sss

def get_custom_roling(s, ss, funct):
    # s: the data as a Series; ss: the window sizes; funct: the aggregation.
    output_get_idx = get_idx(ss)
    agg_stuff = [s[elem] for elem in output_get_idx]
    res_agg_stuff = [funct(elem) for elem in agg_stuff]
    res_agg_stuff = pd.Series(data=res_agg_stuff, index=s.index)
    return res_agg_stuff
Pandas custom window rolling allows you to modify the size of the window.
Simple explanation: the start and end arrays hold the index values used to slice your data.
# start = [0 0 1 2 2 2 5 5 4 7]
# end   = [1 2 3 4 5 6 7 8 9 10]
The arguments to get_window_bounds are supplied by the rolling machinery; any keyword arguments passed to BaseIndexer (here custom_name_whatever) are stored as attributes on the indexer.
import pandas as pd
import numpy as np
from pandas.api.indexers import BaseIndexer
from typing import Optional, Tuple

class CustomIndexer(BaseIndexer):
    def get_window_bounds(self,
                          num_values: int = 0,
                          min_periods: Optional[int] = None,
                          center: Optional[bool] = None,
                          closed: Optional[str] = None
                          ) -> Tuple[np.ndarray, np.ndarray]:
        end = np.arange(1, num_values + 1, dtype=np.int64)
        start = end - np.array(self.custom_name_whatever, dtype=np.int64)
        return start, end

df = pd.DataFrame({"l1": [5, 3, 8, 2, 10, 12, 13, 15, 22, 28],
                   "l2": [1, 2, 2, 2, 3, 4, 2, 3, 5, 3]})

indexer = CustomIndexer(custom_name_whatever=df.l2)
df["variable_mean"] = df.l1.rolling(indexer).mean()
print(df)
Outputs:
   l1  l2  variable_mean
0   5   1       5.000000
1   3   2       4.000000
2   8   2       5.500000
3   2   2       5.000000
4  10   3       6.666667
5  12   4       8.000000
6  13   2      12.500000
7  15   3      13.333333
8  22   5      14.400000
9  28   3      21.666667
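The same indexer can drive any rolling aggregation, not just the mean; a small usage sketch:
df["variable_sum"] = df.l1.rolling(indexer).sum()  # variable-width sums
df["variable_max"] = df.l1.rolling(indexer).max()  # variable-width maxima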

Conditioning pandas Dataframe on an Array

I'm trying to figure out how to condition on an array I've created.
first6 = df["Tbl_Name_Dur"].unique()[0:6]
for element in first6:
    print(element)
df_test = df[df['Tbl_Name_Dur'] for element in first6]
I've printed the elements and that works. How do I select rows of my dataframe based on first6? I've tried the following:
df_test = df[df['Tbl_Name_Dur'] in first6]
df_test = df[df['Tbl_Name_Dur'] == first6]
Any help would be much appreciated!
You can use the isin method. Masking the frame with the boolean result leaves NaN where values do not match, which is why dropna is needed afterwards. Here is an example:
import pandas as pd
data_dict = {'col': pd.Series([1, 2, 3, 4, 4, 5, 6, 7, 8, 8])}
df = pd.DataFrame(data_dict)
first6 = df.col.unique()[0:6]
df = df[df.isin(first6)]
df.dropna(inplace=True)
print(df)
Alternatively, you can use a lambda function together with map:
import pandas as pd
data_dict = {'col': pd.Series([1, 2, 3, 4, 4, 5, 6, 7, 8, 8])}
df = pd.DataFrame(data_dict)
first6 = df.col.unique()[0:6]
df = df[df.col.map(lambda x : x in first6)]
print(df)
Output:
   col
0    1
1    2
2    3
3    4
4    4
5    5
6    6
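Applied directly to the question's dataframe, the Series-level isin avoids masking the whole frame and the dropna step; a sketch assuming df and Tbl_Name_Dur as in the question:
df_test = df[df['Tbl_Name_Dur'].isin(first6)]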

Count the individual values within the arrays stored in a pandas Series

Here's a simple example to set the stage:
import pandas as pd
import numpy as np
example_series = pd.Series([np.arange(5),
                            np.arange(15),
                            np.arange(12),
                            np.arange(7),
                            np.arange(3)])
print(example_series)
0 [0, 1, 2, 3, 4]
1 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...
2 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
3 [0, 1, 2, 3, 4, 5, 6]
4 [0, 1, 2]
I've got a pandas Series (example_series) that contains a bunch of arrays. I'm trying to count the number of instances each number appears in the series. So, I'm hoping to return something that looks like:
# Counts =
0: 5
1: 5
2: 5
3: 4
4: 4
5: 3
# ...and so on
And I'd prefer that it returned a Series, but it's OK if it's something else. This seems simple enough, but I can't figure it out. I'll post a few failed attempts below.
# None of these work
example_series.count(0)
example_series.count(lambda x: x == 0)
example_series[example_series == 0]
example_series.unique()
Thanks for any help!
Flatten the list, then use value_counts():
pd.Series([item for sublist in example_series for item in sublist]).value_counts()
2     5
1     5
0     5
4     4
3     4
6     3
5     3
11    2
10    2
9     2
8     2
7     2
14    1
13    1
12    1
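On pandas 0.25+, where Series.explode was added, the flattening can also be left to pandas; a one-line sketch:
counts = example_series.explode().value_counts()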
Not sure what the Pandas syntax is for this. But the pure numpy solution, which would be quite fast, would be to flatten your collection of arrays with np.concatenate and then count the values with np.unique or np.histogram. That returns numpy arrays, which can be wrapped into a Series with one line.
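A minimal sketch of that numpy approach, using np.unique with return_counts=True for exact per-value counts:
import numpy as np
import pandas as pd

# Flatten the ragged collection of arrays into one 1-D array.
flat = np.concatenate(example_series.tolist())

# Count each distinct value and wrap the result in a Series.
values, counts = np.unique(flat, return_counts=True)
counts_series = pd.Series(counts, index=values)
print(counts_series)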
