Creating index columns with Python

As a minimal working example, I have a file.txt containing a list of numbers:
1.1
2.1
3.1
4.1
5.1
6.1
7.1
8.1
which should actually be presented with indices, making it a 3D array:
0 0 1.1
1 0 2.1
0 1 3.1
1 1 4.1
0 2 5.1
1 2 6.1
0 3 7.1
1 3 8.1
I want to import the 3D array into Python. I have been using bash to generate the indices and then pasting the index columns onto file.txt before importing the resulting full.txt in Python using pandas:
for ((y=0;y<=3;y++)); do
  for ((x=0;x<=1;x++)); do
    echo -e "$x\t$y"
  done
done > index.txt
paste index.txt file.txt > full.txt
The writing of index.txt has been slow in my actual code, where x goes up to 9000 and y up to 5000. Is there a way to generate the indices into the first two columns of a 2D NumPy array so I only need to import the data from file.txt as the third column?

I would recommend using pandas for loading the data and managing columns with different types.
We can generate the indices with np.indices for the desired dimensions and reshape them to match your format.
Then concatenate the values read from 'file.txt'.
Creating the index for (9000, 5000) takes about 950 ms on a Colab instance.
import numpy as np
import pandas as pd

x, y = 2, 4  # dimensions; also works with 9000, 5000 but assumes 'file.txt' has the correct size
pd.concat([
    pd.DataFrame(np.indices((x, y)).ravel('F').reshape(-1, 2), columns=['ind1', 'ind2']),
    pd.read_csv('file.txt', header=None, names=['Value'])
], axis=1)
Out:
ind1 ind2 Value
0 0 0 1.1
1 1 0 2.1
2 0 1 3.1
3 1 1 4.1
4 0 2 5.1
5 1 2 6.1
6 0 3 7.1
7 1 3 8.1
How this works
First, create the indices for your desired dimensions with np.indices:
np.indices((2,4))
Out:
array([[[0, 0, 0, 0],
[1, 1, 1, 1]],
[[0, 1, 2, 3],
[0, 1, 2, 3]]])
This gives us the right indices, but in the wrong order.
With .ravel('F') we can flatten the array in column-first (Fortran) order:
np.indices((2,4)).ravel('F')
Out:
array([0, 0, 1, 0, 0, 1, 1, 1, 0, 2, 1, 2, 0, 3, 1, 3])
To get the desired columns, reshape into a 2D array with shape (8, 2); with (-1, 2) the first dimension is inferred.
np.indices((2,4)).ravel('F').reshape(-1,2)
Out:
array([[0, 0],
[1, 0],
[0, 1],
[1, 1],
[0, 2],
[1, 2],
[0, 3],
[1, 3]])
Then convert into a dataframe with columns ind1 and ind2.
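For completeness, that conversion step on its own (the same call used in the snippet at the top):
pd.DataFrame(np.indices((2, 4)).ravel('F').reshape(-1, 2), columns=['ind1', 'ind2'])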
Working with more dimensions
pd.DataFrame(np.indices((2,4,3)).ravel('F').reshape(-1,3)).add_prefix('ind')
Out:
ind0 ind1 ind2
0 0 0 0
1 1 0 0
2 0 1 0
3 1 1 0
4 0 2 0
5 1 2 0
6 0 3 0
7 1 3 0
8 0 0 1
9 1 0 1
10 0 1 1
11 1 1 1
12 0 2 1
13 1 2 1
14 0 3 1
15 1 3 1
16 0 0 2
17 1 0 2
18 0 1 2
19 1 1 2
20 0 2 2
21 1 2 2
22 0 3 2
23 1 3 2
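A small helper, as a sketch built only from the calls shown above, that generalizes this pattern to any shape:
import numpy as np
import pandas as pd

def index_frame(shape):
    # column-major index columns for an arbitrary shape
    n = len(shape)
    return pd.DataFrame(np.indices(shape).ravel('F').reshape(-1, n)).add_prefix('ind')

index_frame((2, 4, 3))  # same output as above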

Here is a quick example of how to create the 3D array from a 1D array. As dummy data I use random numbers; the code then creates tuples of (x, y, value).
It takes about a minute for 45M rows.
from random import randrange
x = 5000
y = 9000
numbers = [randrange(100000, 999999) for i in range(x * y)]
# a*y + b gives every (a, b) pair its own entry in numbers
array = [(a, b, numbers[a * y + b]) for a in range(x) for b in range(y)]
Output
pd.DataFrame(array)
Out[23]:
0 1 2
0 0 0 878704
1 0 1 524573
2 0 2 943657
3 0 3 496507
4 0 4 802714

If you want to stick with bash, you can drop the inner loop:
Code:
for ((y=0;y<=3;y++)); do
  echo -e "0\t$y\n1\t$y"
done
Output:
0 0
1 0
0 1
1 1
0 2
1 2
0 3
1 3
The above in Python is:
Code:
for y in range(4):
    print(f'0\t{y}\n1\t{y}')
Output:
0 0
1 0
0 1
1 1
0 2
1 2
0 3
1 3
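If you want the pairs in a file rather than on stdout, here is a minimal sketch with the same logic, writing index.txt directly:
Code:
# write the index pairs straight to index.txt instead of printing them
with open('index.txt', 'w') as fh:
    for y in range(4):
        fh.write(f'0\t{y}\n1\t{y}\n')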

Related

How to change several values of pandas DataFrame at once?

Let's consider a very simple data frame:
import pandas as pd
df = pd.DataFrame([[0, 1, 2, 3, 2, 5], [3, 4, 5, 0, 2, 7]]).transpose()
df.columns = ["A", "B"]
A B
0 0 3
1 1 4
2 2 5
3 3 0
4 2 2
5 5 7
I want to do two things with this dataframe:
All numbers below 3 have to be changed to 0
All numbers equal to 0 have to be changed to 10
The problem is that when we apply:
df[df < 3] = 0
df[df == 0] = 10
we are also going to change numbers which were initially not 0, obtaining:
A B
0 10 3
1 10 4
2 10 5
3 3 10
4 10 10
5 5 7
which is not the desired output, which should look like this:
A B
0 10 3
1 0 4
2 0 5
3 3 10
4 0 0
5 5 7
My question is: is there any way to change both of those things at the same time? I.e., I want to change numbers smaller than 3 to 0 and numbers equal to 0 to 10, independently of each other.
Note! This example was created just to outline the problem. An obvious solution is to change the order of replacement: first change 0 to 10, and then change numbers smaller than 3 to 0. But I'm struggling with a much more complex problem, and I want to know if it is possible to change both of these at once.
Use applymap() to apply a function to each element in the DataFrame:
df.applymap(lambda x: 10 if x == 0 else (0 if x < 3 else x))
results in
A B
0 10 3
1 0 4
2 0 5
3 3 10
4 0 0
5 5 7
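Note that in recent pandas versions (2.1 and later) applymap is deprecated in favour of DataFrame.map; the element-wise logic stays the same:
# pandas >= 2.1: same element-wise replacement via DataFrame.map
df.map(lambda x: 10 if x == 0 else (0 if x < 3 else x))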
I would do it the following way:
import pandas as pd
df = pd.DataFrame([[0, 1, 2, 3, 2, 5], [3, 4, 5, 0, 2, 7]]).transpose()
df.columns = ["A", "B"]
df_orig = df.copy()
df[df_orig < 3] = 0
df[df_orig == 0] = 10
print(df)
output
A B
0 10 3
1 0 4
2 0 5
3 3 10
4 0 0
5 5 7
Explanation: I use the .copy method to get a copy of the DataFrame, which is placed in the variable df_orig. I then use that DataFrame, which is not altered during the run of the program, to select the places to put 0 and 10.
You can create the masks first and then change the values:
m1 = df < 3
m2 = df == 0
df[m1] = 0
df[m2] = 10
print(df)
A B
0 10 3
1 0 4
2 0 5
3 3 10
4 0 0
5 5 7
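A vectorized alternative, as a sketch using np.select: both conditions are evaluated against the original values, so the two replacements cannot interfere with each other.
import numpy as np
import pandas as pd

df = pd.DataFrame([[0, 1, 2, 3, 2, 5], [3, 4, 5, 0, 2, 7]]).transpose()
df.columns = ["A", "B"]

# first matching condition wins: 0 -> 10, then anything below 3 -> 0
df[:] = np.select([df == 0, df < 3], [10, 0], df)
print(df)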

Dataframe column: to find local maxima

In the dataframe below, the column "CumRetperTrade" consists of a few vertical vectors (sequences of numbers) separated by zeros (these vectors correspond to the non-zero elements of column "Portfolio").
I would like to find the local maximum of every non-zero vector contained in column "CumRetperTrade".
To be precise, I would like to transform (using vectorized or other methods) column "CumRetperTrade" into the column "PeakCumRet" (the desired result), which assigns to every vector contained in "CumRetperTrade" its local maximum. A numeric example is below. Thanks in advance.
import numpy as np
import pandas as pd
df1 = pd.DataFrame({"Portfolio":      [1, 1, 1, 0, 0, 0, 1, 1, 1],
                    "CumRetperTrade": [3, 2, 1, 0, 0, 0, 4, 2, 1],
                    "PeakCumRet":     [3, 3, 3, 0, 0, 0, 4, 4, 4]})
df1
Portfolio CumRetperTrade PeakCumRet
1 3 3
1 2 3
1 1 3
0 0 0
0 0 0
0 0 0
1 4 4
1 2 4
1 1 4
You can use:
df1['PeakCumRet'] = (df1.groupby(df1['Portfolio'].ne(df1['Portfolio'].shift()).cumsum())
['CumRetperTrade'].transform('max')
)
Output:
Portfolio CumRetperTrade PeakCumRet
0 1 3 3
1 1 2 3
2 1 1 3
3 0 0 0
4 0 0 0
5 0 0 0
6 1 4 4
7 1 2 4
8 1 1 4
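To see why this works, look at the intermediate grouping key on the example data: each run of consecutive equal "Portfolio" values gets its own id, and the max is then taken per id.
# consecutive runs of equal "Portfolio" values get one group id each
group_id = df1['Portfolio'].ne(df1['Portfolio'].shift()).cumsum()
print(group_id.tolist())  # [1, 1, 1, 2, 2, 2, 3, 3, 3]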

Data frame mode function

Hi, I want to ask about the df.mode() function, which I am using with df.mode(axis=1) to find the most common value in each row. This gives me an extra column; how could I get only one column?
for example I have a data frame
0 1 2 3 4
1 1 0 1 1 1
2 0 1 0 0 1
3 0 0 1 1 0
so I want the output
1 1
2 0
3 0
but I am getting
1 1 NaN
2 0 NaN
3 0 NaN
Does anyone know why?
The code you tried gives the expected output in Python 3.7.6 with Pandas 1.0.3.
import pandas as pd
df = pd.DataFrame(
data=[[1, 0, 1, 1, 1], [0, 1, 0, 0, 1], [0, 0, 1, 1, 0]],
index=[1, 2, 3])
df
0 1 2 3 4
1 1 0 1 1 1
2 0 1 0 0 1
3 0 0 1 1 0
df.mode(axis=1)
0
1 1
2 0
3 0
There could be different data types in your columns, and mode cannot compare columns of different data types.
Use str() or int() to convert your Series to a suitable data type, and make sure that the data type is consistent across the df before employing mode(axis=1).
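A hypothetical illustration of that fix, assuming one column was read in as strings so that "1" and 1 count as different values:
import pandas as pd

# one column accidentally holds strings, so "1" and 1 are distinct values
df = pd.DataFrame(
    data=[[1, 0, "1", 1, 1], [0, 1, "0", 0, 1], [0, 0, "1", 1, 0]],
    index=[1, 2, 3])

# cast everything to a consistent dtype before taking the row-wise mode
print(df.astype(int).mode(axis=1))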

Vectorizing for-loop

I have a very large dataframe (~10^8 rows) where I need to change some values. The algorithm I use is complex, so I tried to break the issue down into the simple example below. I mostly programmed in C++, so I keep thinking in for-loops. I know I should vectorize, but I am new to Python and very new to pandas and cannot come up with a better solution. Any solutions which increase performance are welcome.
#!/usr/bin/python3
import numpy as np
import pandas as pd
data = {'eventID': [1, 1, 1, 2, 2, 3, 4, 5, 6, 6, 6, 6, 7, 8],
        'types':   [0, -1, -1, -1, 1, 0, 0, 0, -1, -1, -1, 1, -1, -1]}
mydf = pd.DataFrame(data, columns=['eventID', 'types'])
print(mydf)
MyIntegerCodes = np.array([0, 1])
eventIDs = np.unique(mydf.eventID.values) # can be up to 10^8 values
for val in eventIDs:
    currentTypes = mydf[mydf.eventID == val].types.values
    if (0 in currentTypes) and (1 not in currentTypes):
        mydf.loc[mydf.eventID == val, 'types'] = 0
    if (0 not in currentTypes) and (1 in currentTypes):
        mydf.loc[mydf.eventID == val, 'types'] = 1
print(mydf)
Any ideas?
EDIT: I was asked to explain what I do with my for-loops.
For every eventID I want to know whether the corresponding types contain a 1, a 0, or both. If they contain a 1, all values equal to -1 should be changed to 1. If they contain a 0, all values equal to -1 should be changed to 0. My problem is to do this efficiently for each eventID independently. There can be one or multiple entries per eventID.
Input of example:
eventID types
0 1 0
1 1 -1
2 1 -1
3 2 -1
4 2 1
5 3 0
6 4 0
7 5 0
8 6 -1
9 6 -1
10 6 -1
11 6 1
12 7 -1
13 8 -1
Output of example:
eventID types
0 1 0
1 1 0
2 1 0
3 2 1
4 2 1
5 3 0
6 4 0
7 5 0
8 6 1
9 6 1
10 6 1
11 6 1
12 7 -1
13 8 -1
First we create boolean masks m1 and m2 using Series.eq, group each mask by eventID with groupby, and transform using any; then np.select chooses 1 or 0 (keeping the original value otherwise) depending on the conditions m1 or m2:
m1 = mydf['types'].eq(1).groupby(mydf['eventID']).transform('any')
m2 = mydf['types'].eq(0).groupby(mydf['eventID']).transform('any')
mydf['types'] = np.select([m1 , m2], [1, 0], mydf['types'])
Result:
# print(mydf)
eventID types
0 1 0
1 1 0
2 1 0
3 2 1
4 2 1
5 3 0
6 4 0
7 5 0
8 6 1
9 6 1
10 6 1
11 6 1
12 7 -1
13 8 -1
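For reference, this is what the two masks look like on the example data: m1 is True for every row whose eventID group contains a 1, and m2 is True for every row whose group contains a 0.
print(m1.tolist())
# [False, False, False, True, True, False, False, False, True, True, True, True, False, False]
print(m2.tolist())
# [True, True, True, False, False, True, True, True, False, False, False, False, False, False]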

Stumped by pandas conditionals and/or boolean indexing

I am having trouble with conditionals / boolean indexing. I am trying to populate a dataframe (dfp) with logic which is conditional on data from a similarly shaped dataframe (dfs) plus the previous row of itself (dfp).
This is my latest fail...
import pandas as pd
dfs = pd.DataFrame({'a':[1,0,-1,0,1,0,0,-1,0,0],'b':[0,1,0,0,-1,0,1,0,-1,0]})
In [171]: dfs
Out[171]:
a b
0 1 0
1 0 1
2 -1 0
3 0 0
4 1 -1
5 0 0
6 0 1
7 -1 0
8 0 -1
9 0 0
dfp = pd.DataFrame(index=dfs.index,columns=dfs.columns)
dfp[(dfs==1)|((dfp.shift(1)==1)&(dfs!=-1))] = 1
In [166]: dfp.fillna(0)
Out[166]:
a b
0 1.0 0.0
1 0.0 1.0
2 0.0 0.0
3 0.0 0.0
4 1.0 0.0
5 0.0 0.0
6 0.0 1.0
7 0.0 0.0
8 0.0 0.0
9 0.0 0.0
So I would like dfp to have a 1 in row n if either of two conditions is met:
1) dfs in the same row equals 1, or 2) dfp in the previous row equals 1 and dfs in the same row is not -1.
I would like my final output to look like this:
a b
0 1 0
1 1 1
2 0 1
3 0 1
4 1 0
5 1 0
6 1 1
7 0 1
8 0 0
9 0 0
UPDATE / EDIT:
Sometimes a visual is more helpful: below is how it would map out in Excel (screenshot not included).
Thanks in advance, very grateful for your time.
Let's summarize the invariants:
If the dfs value is 1 then the dfp value is 1.
If the dfs value is -1 then the dfp value is 0.
If the dfs value is 0 then the dfp value is 1 if the previous dfp value is 1 otherwise it's 0.
Or to formulate in another way:
The dfp starts with 1 if the first value is 1, otherwise 0
The dfp values are 0 until there is a 1 in dfs.
The dfp values are 1 until there is a -1 in dfs.
This is very easy to formulate in Python:
import numpy as np

def create_new_column(dfs_col):
    newcol = np.zeros_like(dfs_col)
    if dfs_col[0] == 1:
        last = 1
    else:
        last = 0
    for idx, val in enumerate(dfs_col):
        if last == 1 and val == -1:
            last = 0
        if last == 0 and val == 1:
            last = 1
        newcol[idx] = last
    return newcol
And the test:
>>> create_new_column(dfs.a)
array([1, 1, 0, 0, 1, 1, 1, 0, 0, 0], dtype=int64)
>>> create_new_column(dfs.b)
array([0, 1, 1, 1, 0, 0, 1, 1, 0, 0], dtype=int64)
However, this is very inefficient in pure Python, because iterating over NumPy arrays (and pandas Series/DataFrames) is slow, and Python for-loops are inefficient as well.
If you have numba or Cython you can compile this, and it will (probably) be faster than any NumPy solution could be, because NumPy would require several rolling and/or accumulate operations.
For example with numba:
>>> import numba
>>> numba_version = numba.njit(create_new_column) # compilation step
>>> numba_version(np.asarray(dfs.a)) # need cast to np.array
array([1, 1, 0, 0, 1, 1, 1, 0, 0, 0], dtype=int64)
>>> numba_version(np.asarray(dfs.b)) # need cast to np.array
array([0, 1, 1, 1, 0, 0, 1, 1, 0, 0], dtype=int64)
Even if dfs has millions of rows the numba solution will take only milliseconds:
>>> dfs = pd.DataFrame({'a':np.random.randint(-1, 2, 1000000),'b':np.random.randint(-1, 2, 1000000)})
>>> %timeit numba_version(np.asarray(dfs.b))
100 loops, best of 3: 9.37 ms per loop
Not the best way to do it but something that works.
dfs = pd.DataFrame({'a':[1,0,-1,0,1,0,0,-1,0,0],'b':[0,1,0,0,-1,0,1,0,-1,0]})
dfp = dfs.copy()
Define the function as follows. Usage of 'last' here is a little hacky.
last = [0]

def f(x):
    if x == 1:
        x = 1
    elif x != -1 and last[0] == 1:
        x = 1
    else:
        x = 0
    last[0] = x
    return x
Simply apply the func f on each column.
dfp.a = dfp.a.apply(f)
dfp
a b
0 1 0
1 1 1
2 0 0
3 0 0
4 1 -1
5 1 0
6 1 1
7 0 0
8 0 -1
9 0 0
Similarly for column b. Don't forget to re-initialize 'last':
last[0] = 0
dfp.b = dfp.b.apply(f)
dfp
a b
0 1 0
1 1 1
2 0 1
3 0 1
4 1 0
5 1 0
6 1 1
7 0 1
8 0 0
9 0 0
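A small consolidation of the two answers above, as a sketch using the same switching rules: keeping the state local to the function avoids the module-level 'last' and the manual re-initialization between columns.
import pandas as pd

def propagate(col):
    # state machine over one column: a 1 switches the state on,
    # a -1 switches it off, and a 0 keeps the previous state
    last = 0
    out = []
    for val in col:
        if val == 1:
            last = 1
        elif val == -1:
            last = 0
        out.append(last)
    return pd.Series(out, index=col.index)

dfp = dfs.apply(propagate)  # no shared state to reset between columns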
