I am struggling to get the right (restricted to the selection) index when using the pandas xs method to select specific data in my dataframe. Let me demonstrate what I am doing:
print(df)
value
idx1 idx2 idx3 idx4 idx5
10 2.0 0.0010 1 2 6.0 ...
2 3 6.0 ...
...
7 8 6.0 ...
8 9 6.0 ...
20 2.0 0.0010 1 2 6.0 ...
2 3 6.0 ...
...
18 19 6.0 ...
19 20 6.0 ...
# get dataframe for idx1 = 10, idx2 = 2.0, idx3 = 0.0010
print(df.xs([10,2.0,0.0010]))
value
idx4 idx5
1 2 6.0 ...
2 3 6.0 ...
3 4 6.0 ...
4 5 6.0 ...
5 6 6.0 ...
6 7 6.0 ...
7 8 6.0 ...
8 9 6.0 ...
# get the first index list of this part of the dataframe
print(df.xs([10,2.0,0.0010]).index.levels[0])
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
So I do not understand why the full list of values that occur in idx4 is returned, even though we restricted the dataframe to a part where idx4 only takes values from 1 to 8. Am I using the index attribute in the wrong way?
This is a known feature, not a bug. pandas preserves all of the index information when slicing. You can determine which of the levels are actually expressed, and at which locations, via the labels attribute (renamed to codes in newer pandas versions).
If you are looking to create an index that is fresh and just contains the information relevant to the slice you just made, you can do this:
df_new = df.xs([10,2.0,0.0010])
idx_new = pd.MultiIndex.from_tuples(df_new.index.to_series(),
                                    names=df_new.index.names)
df_new.index = idx_new
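If your pandas version is recent enough (0.20 or later), MultiIndex.remove_unused_levels() does the same thing without the round trip through tuples; a minimal sketch:
df_new = df.xs((10, 2.0, 0.0010))
# remove_unused_levels() rebuilds the levels from the values actually present in the slice
df_new.index = df_new.index.remove_unused_levels()
print(df_new.index.levels[0])   # now only the idx4 values that occur in this slice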
I have a dataset in which a few values are null. I want to change them to either 4 or 5, randomly, but only in specific rows. How do I do that?
data.replace(np.nan, np.random.randint(4,5))
I tried this and every NaN value changed to 4 only, not 4 and 5 randomly. Also, I don't know how to replace NaN values for only specific rows, like rows 1, 4, 5, 8.
Use loc and select by index and isna. Change np.random.randint(4, 5) to np.random.randint(4, 6) to get both fours and fives (the upper bound is exclusive), and pass size so that each matching row gets its own draw rather than one shared value.
import pandas as pd
import numpy as np
data = {
    'A': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'B': [0, np.nan, 1, 2.0, 2, np.nan, 3, 2.0, 7, np.nan]}
df = pd.DataFrame(data)
#     A    B
# 0   1  0.0
# 1   2  NaN
# 2   3  1.0
# 3   4  2.0
# 4   5  2.0
# 5   6  NaN
# 6   7  3.0
# 7   8  2.0
# 8   9  7.0
# 9  10  NaN
# If index is 1 or 5, and the value is NaN, change B to 4 or 5
mask = df.index.isin([1, 5]) & df["B"].isna()
df.loc[mask, "B"] = np.random.randint(4, 6, size=mask.sum())
# (one possible outcome; the replacements are independent random draws of 4 or 5)
#     A    B
# 0   1  0.0
# 1   2  4.0
# 2   3  1.0
# 3   4  2.0
# 4   5  2.0
# 5   6  4.0
# 6   7  3.0
# 7   8  2.0
# 8   9  7.0
# 9  10  NaN
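If you need the draws to be reproducible, the newer NumPy Generator API works the same way (this assumes NumPy >= 1.17; integers() uses an exclusive upper bound just like randint):
rng = np.random.default_rng(42)                            # seeded generator for repeatable runs
mask = df.index.isin([1, 5]) & df["B"].isna()
df.loc[mask, "B"] = rng.integers(4, 6, size=mask.sum())    # each matching row gets its own 4 or 5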
I would like to create a filter that allows me to retrieve only the values with opposite signs in a certain column (for example 10, -10, 22, -22).
How can I do this? Thanks
I would like to keep only B codes whose opposite value is in A, typically:
The exact logic and expected output are unclear (please provide an example), but you could use the absolute value and the sign as groupers:
out = (df
       .assign(abs=df['col'].abs(), sign=np.sign(df['col']))
       .pivot(index='abs', columns='sign')
       )
output:
        id        col
sign    -1    1    -1      1
abs
4      NaN  4.0   NaN    4.0
7      5.0  NaN  -7.0    NaN
10     3.0  0.0 -10.0   10.0
22     2.0  1.0 -22.0   22.0
used input:
df = pd.DataFrame({'id': range(6),
                   'col': [10, 22, -22, -10, 4, -7],
                   })
   id  col
0   0   10
1   1   22
2   2  -22
3   3  -10
4   4    4
5   5   -7
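If the goal is simply to keep rows whose value also occurs with the opposite sign, a boolean mask may be enough; a sketch against the same input:
mask = df['col'].isin(-df['col'])   # True where the sign-flipped value also occurs in 'col'
print(df[mask])
#    id  col
# 0   0   10
# 1   1   22
# 2   2  -22
# 3   3  -10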
I am fairly new to Python and trying to figure out how to generate dataframes for multiple arrays. I have a list where the arrays are currently stored:
list = [ [1 2 3 4], [12 19 30 60 95 102] ]
What I want to do is take each array from this list and put them into separate dataframes, with the array contents populating a column of the dataframe like so:
Array2_df
1 12
2 19
3 30
4 60
I have found several answers involving the use of dictionaries, but am not sure how that would actually solve my problem... I also don't understand how naming the dataframes dynamically would work. I have tried playing around with for loops, but that just overwrote the same dataframe repeatedly. Please help!! Thanks :)
As mentioned in the comments, dynamically creating variables is a bad idea. Why not use a single dataframe, like so:
In [1]: zlist = [[1, 2, 3, 4], [12, 19, 30, 60, 95, 102], [1, 2, 4, 5, 1, 6, 1, 7, 8, 21]]
In [2]: pd.DataFrame({f"array_{i}": pd.Series(z) for i, z in enumerate(zlist)})
Out[2]:
array_0 array_1 array_2
0 1.0 12.0 1
1 2.0 19.0 2
2 3.0 30.0 4
3 4.0 60.0 5
4 NaN 95.0 1
5 NaN 102.0 6
6 NaN NaN 1
7 NaN NaN 7
8 NaN NaN 8
9 NaN NaN 21
If you really insist on separate dataframes, then you should store them in a dictionary:
df_dict = {f"array_{i}": pd.DataFrame({f"array_{i}": z}) for i, z in enumerate(zlist)}
Then, you can access a specific dataframe by name:
In [8]: df_dict["array_2"]
Out[8]:
array_2
0 1
1 2
2 4
3 5
4 1
5 6
6 1
7 7
8 8
9 21
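And if you later change your mind, the dictionary stitches back into a single frame; pd.concat pads the shorter arrays with NaN just like the first approach (a small usage note, not required for the answer above):
combined = pd.concat(df_dict.values(), axis=1)   # columns: array_0, array_1, array_2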
I have a pandas dataframe with two dimensions. I want to calculate the rolling standard deviation along axis 1 while also including datapoints in the rows above and below.
So say I have this df:
data = {'A': [1, 2, 3, 4],
        'B': [5, 6, 7, 8],
        'C': [9, 10, 11, 12]}
df = pd.DataFrame(data)
print(df)
A B C
0 1 5 9
1 2 6 10
2 3 7 11
3 4 8 12
I want a rectangular window 3 rows high and 2 columns across, moving from left to right. So, for example,
std_df.loc[1, 'C']
would be equal to
np.std([1, 5, 9, 2, 6, 10, 3, 7, 11])
But I have no idea how to achieve this without very slow iteration.
Looks like what you want is DataFrame.shift (df.shift).
import pandas as pd
import numpy as np
data = {'A': [1,2,3,4], 'B': [5,6,7,8], 'C': [9,10,11,12]}
df = pd.DataFrame(data)
print(df)
A B C
0 1 5 9
1 2 6 10
2 3 7 11
3 4 8 12
Shifting the dataframe you provided by 1 yields the row above
print(df.shift(1))
A B C
0 NaN NaN NaN
1 1.0 5.0 9.0
2 2.0 6.0 10.0
3 3.0 7.0 11.0
Similarly, shifting the dataframe you provided by -1 yields the row below
print(df.shift(-1))
A B C
0 2.0 6.0 10.0
1 3.0 7.0 11.0
2 4.0 8.0 12.0
3 NaN NaN NaN
so the code below should do what you're looking for (add_prefix prefixes the column names to make them unique)
above_df = df.shift(1).add_prefix('above_')
below_df = df.shift(-1).add_prefix('below_')
lagged = pd.concat([df, above_df, below_df], axis=1)
lagged['std'] = lagged.apply(np.std, axis=1)
print(lagged)
A B C above_A above_B above_C below_A below_B below_C std
0 1 5 9 NaN NaN NaN 2.0 6.0 10.0 3.304038
1 2 6 10 1.0 5.0 9.0 3.0 7.0 11.0 3.366502
2 3 7 11 2.0 6.0 10.0 4.0 8.0 12.0 3.366502
3 4 8 12 3.0 7.0 11.0 NaN NaN NaN 3.304038
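As a small follow-up, the apply call can be replaced by pandas' own row-wise std; passing ddof=0 matches np.std (population standard deviation), and the NaNs introduced by the shifts are skipped by default, so the numbers come out the same:
lagged = pd.concat([df, above_df, below_df], axis=1)
lagged['std'] = lagged.std(axis=1, ddof=0)   # vectorised, equivalent to the apply above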
I have a list of columns in a dataframe that shouldn't be empty.
I want to remove any rows that are empty in any of these columns. My solution would be to iterate through the required columns and set the 'excluded' column to the error message that the user will be shown before excluding those rows (I will present these to the user in the form of a report at the end of the process).
I'm currently trying something like this:
for col in requiredColumns:
df[pd.isnull(df[col])]['excluded'] = df[pd.isnull(df[col])]['excluded'].apply(lambda x: str(x) + col + ' empty, excluded')
but no luck - the columns aren't updated. The filter by itself (to get only the empty rows) works, but the update part doesn't seem to be working.
I'm used to SQL:
UPDATE df SET e = e & "empty, excluded" WHERE NZ(col, '') = ''
If you need to update a pandas DataFrame based on multiple conditions:
You can simply use .loc
>>> df
A B C
0 2 40 800
1 1 90 600
2 6 80 700
3 1998 70 55
4 1 90 300
5 7 80 700
6 4 20 300
7 1998 20 2
8 7 10 100
9 1998 60 2
>>> df.loc[(df['A'] > 7) & (df['B'] > 69) , 'C'] = 75
This will set 'C' = 75 where 'A' > 7 and 'B' > 69
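Applied to the original 'excluded' question, a minimal sketch could look like the following. The column names and message text are just assumptions based on the post; the key point is that the assignment must go through .loc on the original frame rather than on a filtered copy (which is why the chained indexing in the question had no effect):
import pandas as pd
import numpy as np

df = pd.DataFrame({'name': ['a', 'b', None, 'd'],
                   'qty':  [1, np.nan, 3, 4]})
requiredColumns = ['name', 'qty']          # hypothetical required columns

df['excluded'] = ''
for col in requiredColumns:
    mask = df[col].isna()
    # append one message per missing required column, mirroring the SQL UPDATE
    df.loc[mask, 'excluded'] += col + ' empty, excluded; '

print(df)   # rows with a missing 'name' or 'qty' now carry the exclusion message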
One way is to use numpy functions to create a column with the desired marker.
Setup
import pandas as pd, numpy as np
df = pd.DataFrame({'A': [1, np.nan, 2, 3, 4, 5],
                   'B': [2, 3, np.nan, 5, 1, 9],
                   'C': [5, 8, 1, 9, np.nan, 7]})
A B C
0 1.0 2.0 5.0
1 NaN 3.0 8.0
2 2.0 NaN 1.0
3 3.0 5.0 9.0
4 4.0 1.0 NaN
5 5.0 9.0 7.0
Solution
df['test'] = np.any(np.isnan(df.values), axis=1)
A B C test
0 1.0 2.0 5.0 False
1 NaN 3.0 8.0 True
2 2.0 NaN 1.0 True
3 3.0 5.0 9.0 False
4 4.0 1.0 NaN True
5 5.0 9.0 7.0 False
Explanation
np.isnan returns a Boolean array indicating whether the elements of a numpy array are NaN.
Use np.any or np.all, as required, to determine which rows are in scope.
Use df.values to extract the underlying numpy array from the dataframe. For selected columns, you can use df[['A', 'B']].values.