Python Pandas, Selecting Column with "loc" statement

I've got the following problem:
If I select some index labels of my Pandas DataFrame:
df = pd.DataFrame(data=CoordArray[0:,1:], index=CoordArray[:,0], columns=["x","y","z"])
like this:
print(df.loc[['1234567','7654321'],:])
it works fine.
But if I have those labels in a NumPy array, convert that array to a list, and select like this:
mynewlist = list(SomeNumpyArray)
print(df.loc[mynewlist])
I get the following error:
"None of [[1234567, 7654321]] are in the [index]"
I really don't know what's going wrong.

I haven't been able to replicate your issue. As @Wen commented, your list and NumPy array may not have the same types as your DataFrame index.
Here is an example demonstrating that lists or numpy arrays are acceptable as indexers:
import pandas as pd, numpy as np

df = pd.DataFrame(data=[[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]],
                  index=['1000', '2000', '3000', '4000'],
                  columns=['x', 'y', 'z'])

idx = np.array(['2000', '3000'])
df.loc[idx]
#       x  y  z
# 2000  4  5  6
# 3000  7  8  9

lst = list(idx)
df.loc[lst]
#       x  y  z
# 2000  4  5  6
# 3000  7  8  9

Related

Create Table of all Sums from two Arrays in Pandas

I have two arrays:
arr1 = [1,2,3]
arr2 = [5,10]
Now I want to create a DataFrame from the arrays which holds the sums of all combinations:
pd.DataFrame([[6, 7, 8], [11, 12, 13]],
             columns=['1', '2', '3'],
             index=['5', '10'])
     1   2   3
5    6   7   8
10  11  12  13
I know this can easily be done by iterating over the arrays, but I guess there is a built-in function that accomplishes the same thing much faster.
I've already looked at the documentation of different functions, like the merge function, but without success.
We can use NumPy broadcasting with addition, then build the resulting DataFrame by assigning the index and column names from the lists:
import numpy as np
import pandas as pd
arr1 = [1, 2, 3]
arr2 = [5, 10]
df = pd.DataFrame(
    np.array(arr1) + np.array(arr2)[:, None], index=arr2, columns=arr1
)
Or with np.add.outer (which works whether arr1 and arr2 are lists or arrays):
df = pd.DataFrame(np.add.outer(arr2, arr1), index=arr2, columns=arr1)
Note: if arr1 and arr2 are already arrays (instead of lists), it can simply look like:
import numpy as np
import pandas as pd
arr1 = np.array([1, 2, 3])
arr2 = np.array([5, 10])
df = pd.DataFrame(arr1 + arr2[:, None], index=arr2, columns=arr1)
All approaches produce the same df:
     1   2   3
5    6   7   8
10  11  12  13
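If you want the string row and column labels shown in the question's expected output (rather than the integer values themselves), the labels could be converted explicitly; a small variation on the sketches above:
import numpy as np
import pandas as pd

arr1 = [1, 2, 3]
arr2 = [5, 10]

# same broadcasted sums, but with string labels as in the expected output
df = pd.DataFrame(np.add.outer(arr2, arr1),
                  index=[str(v) for v in arr2],
                  columns=[str(v) for v in arr1])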

How to calculate number of rows between 2 indexes of pandas dataframe

I have the following Pandas dataframe in Python:
import pandas as pd
d = {'col1': [1, 2, 3, 4, 5], 'col2': [6, 7, 8, 9, 10]}
df = pd.DataFrame(data=d)
df.index=['A', 'B', 'C', 'D', 'E']
df
which gives the following output:
   col1  col2
A     1     6
B     2     7
C     3     8
D     4     9
E     5    10
I need to write a function (say getNrRows(fromIndex)) that takes an index value as input and returns the number of rows between that given index and the last index of the dataframe.
For instance:
nrRows = getNrRows("C")
print(nrRows)
> 2
Because it takes 2 steps (rows) from the index C to the index E.
How can I write such a function in the most elegant way?
The simplest way might be
len(df[row_index:]) - 1
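Wrapped into the function the question asks for, a minimal sketch (using .loc to make the label-based row slice explicit):
import pandas as pd

d = {'col1': [1, 2, 3, 4, 5], 'col2': [6, 7, 8, 9, 10]}
df = pd.DataFrame(data=d, index=['A', 'B', 'C', 'D', 'E'])

def getNrRows(fromIndex):
    # rows from the given label (inclusive) to the end, minus the starting row itself
    return len(df.loc[fromIndex:]) - 1

print(getNrRows("C"))   # 2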
For your information, we also have the built-in function get_indexer_for:
len(df)-df.index.get_indexer_for(['C'])-1
Out[179]: array([2], dtype=int64)
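That can be wrapped into the requested function as well; a brief sketch, reusing df from the question and assuming the label occurs exactly once in the index:
def getNrRows(fromIndex):
    # position of the label in the index, counted back from the last row
    return len(df) - df.index.get_indexer_for([fromIndex])[0] - 1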

numpy.argmax() not working on my sorted pandas.Series

Can someone please explain to me why the argmax() function does not work after using sort_values() on my pandas Series?
Below is an example of my code. The indices in the output are based on the original DataFrame, not on the sorted Series.
import pandas as pd
import numpy as np
df = pd.DataFrame({
    'a': [4, 5, 3, 1, 2],
    'b': [20, 10, 40, 50, 30],
    'c': [25, 20, 5, 15, 10]
})

def sec_largest(x):
    xsorted = x.sort_values(ascending=False)
    return xsorted.idxmax()

df.apply(sec_largest)
Then the output is
a 1
b 3
c 0
dtype: int64
And when I check the sorted Series using xsorted.iloc[0], it gives me the maximum value in the series.
Can someone explain to me how this works? Thank you very much.
The problem is that you are sorting the pandas Series, so the index labels are carried along with the sort, and idxmax returns the original index label of the highest value, not the position of that value in the sorted series.
def sec_largest(x):
    xsorted = x.sort_values(ascending=False)
    return xsorted.values.argmax()
By using the values of xsorted we work on the underlying NumPy array rather than the pandas data structure, and everything works as expected.
If you print xsorted in the function, you can see that the index labels get sorted along with the values:
1 5
0 4
2 3
4 2
3 1
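To see the difference in isolation, here is a small sketch using column 'a' from the example above:
import pandas as pd

s = pd.Series([4, 5, 3, 1, 2], name='a')
xsorted = s.sort_values(ascending=False)

print(xsorted.idxmax())          # 1 -> original index label of the maximum
print(xsorted.values.argmax())   # 0 -> position of the maximum in the sorted array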

how to add pound(#) symbol in pandas 'to_csv' header?

I have a pandas DataFrame and I would like to save it in a tab-separated file format with a pound (#) symbol at the beginning of the header.
Here is my demo code:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]), columns=['a', 'b', 'c'])
file_name = 'test.tsv'
df.to_csv(file_name, sep='\t', index=False)
The above code creates a DataFrame and saves it in a tab-separated format that looks like:
a b c
1 2 3
4 5 6
7 8 9
But how can I add the pound symbol to the header while saving the DataFrame?
I want the output to be like below:
#a b c
1 2 3
4 5 6
7 8 9
Hope I am clear with the question; thanks in advance for the help.
Note: I would like to keep the DataFrame header definition the same.
Using your code, just modify the a column to be #a, like below:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]), columns=['#a', 'b', 'c'])
file_name = 'test.tsv'
df.to_csv(file_name, sep='\t', index=False)
Edit
If you don't want to adjust the starting dataframe, use .rename before sending to csv:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]), columns=['a', 'b', 'c'])
file_name = 'test.tsv'
df.rename(columns={
    'a': '#a'
}).to_csv(file_name, sep='\t', index=False)
Use the header argument to create aliases for the columns.
df.to_csv(file_name, sep='\t', index=False,
          header=[f'#{x}' if x == df.columns[0] else x for x in df.columns])
#a b c
1 2 3
4 5 6
7 8 9
Here's another way to get your column aliases:
from itertools import zip_longest
header = [''.join(x) for x in zip_longest('#', df.columns, fillvalue='')]
#['#a', 'b', 'c']
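Putting it together with the DataFrame from the question, a short self-contained usage sketch passing those aliases to to_csv:
from itertools import zip_longest

import numpy as np
import pandas as pd

df = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]), columns=['a', 'b', 'c'])
file_name = 'test.tsv'

# build the aliases ('#a', 'b', 'c') and hand them to to_csv via the header argument
header = [''.join(x) for x in zip_longest('#', df.columns, fillvalue='')]
df.to_csv(file_name, sep='\t', index=False, header=header)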

Apply arithmetic calculations on specific rows of a large dataframe

Suppose that we have a DataFrame (df) with a large number of rows (1,600,000 x 4). Also, we have a list of lists such as this one:
inx = [[1,2],[4,5], [8,9,10], [15,16]]
We need to calculate the average of the first and third columns of this DataFrame and the median of the second and fourth columns for every list in inx. For example, for the first list of inx we should do this for the first and second rows, and replace all these rows with a single new row that contains the output of these calculations. What is the fastest way to do this?
import numpy as np
import pandas as pd
df = pd.DataFrame(np.array([[1, 2, 3, 3], [4, 5, 6, 1], [7, 8, 9, 3], [1, 1, 1, 1]]), columns=['a', 'b', 'c', 'd'])
   a  b  c  d
0  1  2  3  3
1  4  5  6  1
2  7  8  9  3
3  1  1  1  1
The output for just the first list inside of inx ([1,2]) will be something like this:
   a    b    c    d
0  1    2    3    3
1  5.5  6.5  7.5  2
3  1    1    1    1
As you can see, we don't change the first row (0), because it's not in the list. After that, we do the same for [4,5]. We don't change anything in row 3 either, because it's not in the list. inx is a large list of lists (more than 100000 elements).
EDIT: NEW APPROACH AVOIDING LOOPS
Below you find an approach relying on pandas and avoiding loops.
After generating some fake data of the same size as yours, I basically create a list of group labels from your inx list of rows; i.e., with your inx being:
[[2,3], [5,6,7], [10,11], ...]
the created list is:
[[1,1], [2,2,2], [3,3],...]
After that, this list is flattened and added to the original dataframe to mark various groups of rows to operate on.
After proper calculations, the resulting dataframe is joined back with original rows which don't need calculations (in my example above, rows: [0, 1, 4, 8, 9, ...]).
You find more comments in the code.
At the end of the answer I also leave my previous approach, for the record.
On my box, the old algo involving a loop takes more than 18 minutes... unbearable!
Using pandas only, it takes less than half a second!! Pandas is great!
import pandas as pd
import numpy as np
import random
# Prepare some fake data to test
data = np.random.randint(0, 9, size=(160000, 4))
df = pd.DataFrame(data, columns=['a', 'b', 'c', 'd'])
inxl = random.sample(range(1, 160000), 140000)
inxl.sort()
inx=[]
while len(inxl) > 3:
    i = random.randint(2, 3)
    l = inxl[0:i]
    inx.append(l)
    inxl = inxl[i:]
inx.append(inxl)
# flatten inx (used below)
flat_inx = [item for sublist in inx for item in sublist]
# for each element (list) in inx create equivalent list (same length)
# of increasing ints. They'll be used to group corresponding rows
gr=[len(sublist) for sublist in inx]
t = list(zip(gr, range(1, len(inx)+1)))
group_list = [a*[b] for (a,b) in t]
# the groups are flatten either
flat_group_list = [item for sublist in group_list for item in sublist]
# create a new dataframe to mark rows to group retaining
# original index for each row
df_groups = pd.DataFrame({'groups': flat_group_list}, index=flat_inx)
# and join the group dataframe to the original df
df['groups'] = df_groups
# rows not belonging to a group are marked with 0
df['groups']=df['groups'].fillna(0)
# save rows not belonging to a group for later
df_untouched = df[df['groups'] == 0]
df_untouched = df_untouched.drop('groups', axis=1)
# new dataframe containg only rows belonging to a group
df_to_operate = df[df['groups']>0]
df_to_operate = df_to_operate.assign(ind=df_to_operate.index)
# at last, we group the rows according to original inx
df_grouped = df_to_operate.groupby('groups')
# calculate mean and median
# for each group we retain the index of first row of group
df_operated = df_grouped.agg({'a': 'mean',
                              'b': 'median',
                              'c': 'mean',
                              'd': 'median',
                              'ind': 'first'})
# set correct index on dataframe
df_operated = df_operated.set_index('ind')
# finally, join the previous dataframe with saved
# dataframe of rows which don't need calcullations
df_final = df_operated.combine_first(df_untouched)
OLD ALGO, TOO SLOW FOR SO MUCH DATA
This algo involving a loop, though giving a correct result, takes too long for such a big amount of data:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.array([[1, 2, 3, 3], [4, 5, 6, 1], [7, 8, 9, 3], [1, 1, 1, 1]]), columns=['a', 'b', 'c', 'd'])
inx = [[1,2]]

for l in inx:
    means = df.iloc[l][['a', 'c']].mean()
    medians = df.iloc[l][['b', 'd']].median()
    df.iloc[l[0]] = pd.DataFrame([means, medians]).fillna(method='bfill').iloc[0]
    df.drop(index=l[1:], inplace=True)
