I would like to modify the cell value based on its size.
If the dataframe is as below:
A          B   C
25802523   X1  2
M25JK0010  Y1  1
K25JK0010  Y2  1
I would like to modify column 'A' and insert the result into another column.
For example, if the size of the first cell value in column A is 8, I would like to break it and keep the last 5 characters; similarly for the others, depending on the size of each cell.
Is there any way to do this?
You can do this:
t = pd.DataFrame({'A': ['25802523', 'M25JK00010', 'KRJOJR4445'],
                  'size': [2, 1, 8]})
Define a dictionary mapping each size to your desired final length. Here, if the size is 8, I will take the last 5 characters:
size_dict = {8: 5, 2: 3, 1: 4}
Then use a simple pandas apply:
t['A_bis'] = t.apply(lambda x: x['A'][len(x['A']) - size_dict[x['size']]:], axis=1)
The result is:
0      523   >> last 3 characters (key 2)
1     0010   >> last 4 characters (key 1)
2    R4445   >> last 5 characters (key 8)
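For reference, here is a self-contained sketch of the same approach; negative slicing (x['A'][-n:]) keeps the last n characters directly. The data is the same sample as above:
import pandas as pd

t = pd.DataFrame({'A': ['25802523', 'M25JK00010', 'KRJOJR4445'],
                  'size': [2, 1, 8]})

# Map each size to the number of trailing characters to keep.
size_dict = {8: 5, 2: 3, 1: 4}

# Row-wise apply: keep the last size_dict[size] characters of A.
t['A_bis'] = t.apply(lambda x: x['A'][-size_dict[x['size']]:], axis=1)
print(t)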
Another approach to do this:
Sample df:
t = pd.DataFrame({'A': ['25802523', 'M25JK00010', 'KRJOJR4445']})
Get the length of each element of A:
t['Count'] = t['A'].apply(len)
Then write a condition to replace:
t.loc[t.Count == 8, 'Number'] = t['A'].str[-5:]
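If several lengths need handling, one way is to repeat the .loc pattern per length; the cut-offs below are illustrative, not taken from the question:
import pandas as pd

t = pd.DataFrame({'A': ['25802523', 'M25JK00010', 'KRJOJR4445']})
t['Count'] = t['A'].apply(len)

# One assignment per observed length; each mask selects only matching rows.
t.loc[t.Count == 8, 'Number'] = t['A'].str[-5:]
t.loc[t.Count == 10, 'Number'] = t['A'].str[-4:]
print(t)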
Given a dataframe, I know I can select rows by condition using below syntax:
df[df['colname'] == 'Target Value']
But what about a Series? Series does not have a column (axis 1) name, right?
My scenario is that I have created a Series via the nunique() function:
sr = df.nunique()
And I want to list out the index names of those rows with value 1.
Having failed to find a clear answer on the Net, I resorted to below solution:
for (colname, coldata) in sr.iteritems():
    if coldata == 1:
        print(colname)
Question: what is a better way to get my answer (i.e., list out the index names of a Series, or the column names of the original DataFrame, which have just a single value)?
The ultimate objective was to find which columns in a DF have one and only one unique value. Since I did not know how to do that directly from a DF, I first used nunique(), which gave me a Series. Thus I needed to process the Series with "== 1" (i.e., one and only one).
I hope my question isn't silly.
It is unclear what you want: do you want to work on the DataFrame or on the Series?
Case 1: Working on DataFrame
In case you want to work on the dataframe to list out the index names of those rows with value 1, you can try:
df.index[df[df==1].any(axis=1)].tolist()
Demo
data = {'Col1': [0, 1, 2, 2, 0], 'Col2': [0, 2, 2, 1, 2], 'Col3': [0, 0, 0, 0, 1]}
df = pd.DataFrame(data)
Col1 Col2 Col3
0 0 0 0
1 1 2 0
2 2 2 0
3 2 1 0
4 0 2 1
Then, run the code, it gives:
[1, 3, 4]
Case 2: Working on Series
If you want to extract the index of a Series with value 1, you can extract it into a list, as follows:
sr.loc[sr == 1].index.tolist()
or use:
sr.index[sr == 1].tolist()
Both work the same way, since pandas overloads the == operator:
selected_series = series[series == my_value]
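Putting it together for the original nunique() scenario, a minimal sketch (the sample frame is made up):
import pandas as pd

df = pd.DataFrame({'x': [1, 1, 1], 'y': [1, 2, 3], 'z': ['a', 'a', 'a']})

sr = df.nunique()                  # x -> 1, y -> 3, z -> 1
print(sr.index[sr == 1].tolist())  # ['x', 'z']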
I have a pandas dataframe that looks something like this:
df = pd.DataFrame(np.array([[1, 1, 0], [5, 1, 4], [7, 8, 9]]), columns=['a', 'b', 'c'])
a b c
0 1 1 0
1 5 1 4
2 7 8 9
I want to find the first column in which the majority of elements in that column are equal to 1.0.
I currently have the following code, which works, but in practice my dataframes usually have thousands of columns, and this code is in a performance-critical part of my application, so I wanted to know if there is a way to do this faster:
for col in df.columns:
    amount_votes = len(df[df[col] == 1.0])
    if amount_votes > len(df) / 2:
        return col
In this case, the code should return 'b', since that is the first column in which the majority of elements are equal to 1.0
Try:
print((df.eq(1).sum() > len(df) // 2).idxmax())
Prints:
b
Find the columns where more than half of the values are equal to 1.0:
cols = df.eq(1.0).sum().gt(len(df) / 2)
Get the first one:
cols[cols].head(1)
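One caveat to hedge against with the idxmax approach above: idxmax returns the first label even when no column qualifies, so a guard may be wanted (sketch using the question's sample frame):
import numpy as np
import pandas as pd

df = pd.DataFrame(np.array([[1, 1, 0], [5, 1, 4], [7, 8, 9]]), columns=['a', 'b', 'c'])

mask = df.eq(1.0).sum().gt(len(df) / 2)        # one boolean per column
first = mask.idxmax() if mask.any() else None  # without the guard, 'a' would be returned even with no match
print(first)                                   # b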
I have a pandas DataFrame that I need to reindex in a specific fashion. There are several numbered indices, but the last one is a string. Without the inclusion of the string, the index goes in numerical order, 1-20, just fine.
However, as soon as I include the string index, the order switches to alpha-numeric (1, 11, 12... 18, 19, 2, 20, 3, 4, etc.). Is there any way where I can organize the list properly numerically, then add on the string index without changing how the list is organized?
[EDIT]:
Realized a shortcoming on my own part: I should have mentioned that the data frame is being converted to an HTML-safe table (DataTable) after construction and displayed on a web page. It is possible this is causing the issue I am having, though any insights on the matter are welcome.
An example of the kind of data frame I am looking at:
Column 1
0 Value 1
1 Value 2
2 Value 3
3 Value 4
...
18 Value 19
19 Value 20
string Value 21
Something along these lines should work:
new_index = list(df.index)
new_index[-1] = 'string'
df.index = new_index
For example:
df = pd.DataFrame(np.random.random(5))
>>> df
0
0 0.665922
1 0.591298
2 0.274722
3 0.561243
4 0.382927
new_index = list(df.index)
new_index[-1] = 'string'
df.index = new_index
returns a re-indexed df:
>>> df
0
0 0.665922
1 0.591298
2 0.274722
3 0.561243
string 0.382927
Here is one way. You can separate out numeric and non-numeric indices and sort them independently.
df = pd.DataFrame({1: ['Val1', 'Val2', 'Val3', 'Val4', 'Val5']},
index=['0', '1', '11', '2', 'string'])
order1 = sorted((x for x in df.index if x.isdigit()), key=int)
order2 = sorted(x for x in df.index if not x.isdigit())
df = df.loc[order1+order2]
# 1
# 0 Val1
# 1 Val2
# 2 Val3
# 11 Val4
# string Val5
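An alternative sketch, assuming the labels are strings as in the sample: sort the whole index with a single key that puts digit labels first in numeric order and everything else after, alphabetically:
import pandas as pd

df = pd.DataFrame({1: ['Val1', 'Val2', 'Val3', 'Val4', 'Val5']},
                  index=['0', '1', '11', '2', 'string'])

# Digit labels sort by numeric value; non-digit labels sort afterwards, alphabetically.
key = lambda x: (0, int(x), '') if x.isdigit() else (1, 0, x)
df = df.loc[sorted(df.index, key=key)]
print(df)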
I have a huge dataframe which has values and blanks/NAs in it. I want to remove the blanks from the dataframe and move the next values up in the column. Consider the sample dataframe below.
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(5,4))
df.iloc[1,2] = np.NaN
df.iloc[0,1] = np.NaN
df.iloc[2,1] = np.NaN
df.iloc[2,0] = np.NaN
df
0 1 2 3
0 1.857476 NaN -0.462941 -0.600606
1 0.000267 -0.540645 NaN 0.492480
2 NaN NaN -0.803889 0.527973
3 0.566922 0.036393 -1.584926 2.278294
4 -0.243182 -0.221294 1.403478 1.574097
I want my output to be as below
0 1 2 3
0 1.857476 -0.540645 -0.462941 -0.600606
1 0.000267 0.036393 -0.803889 0.492480
2 0.566922 -0.221294 -1.584926 0.527973
3 -0.243182 1.403478 2.278294
4 1.574097
I want the NaNs to be removed and the next value moved up. df.shift was not helpful. I tried with multiple loops and if statements and achieved the desired result, but is there a better way to get it done?
You can use apply with dropna:
np.random.seed(100)
df = pd.DataFrame(np.random.randn(5,4))
df.iloc[1,2] = np.NaN
df.iloc[0,1] = np.NaN
df.iloc[2,1] = np.NaN
df.iloc[2,0] = np.NaN
print (df)
0 1 2 3
0 -1.749765 NaN 1.153036 -0.252436
1 0.981321 0.514219 NaN -1.070043
2 NaN NaN -0.458027 0.435163
3 -0.583595 0.816847 0.672721 -0.104411
4 -0.531280 1.029733 -0.438136 -1.118318
df1 = df.apply(lambda x: pd.Series(x.dropna().values))
print (df1)
0 1 2 3
0 -1.749765 0.514219 1.153036 -0.252436
1 0.981321 0.816847 -0.458027 -1.070043
2 -0.583595 1.029733 0.672721 0.435163
3 -0.531280 NaN -0.438136 -0.104411
4 NaN NaN NaN -1.118318
Then, if you need to replace NaN with an empty space, note that this creates mixed values (strings with numerics), so some functions may break:
df1 = df.apply(lambda x: pd.Series(x.dropna().values)).fillna('')
print (df1)
0 1 2 3
0 -1.74977 0.514219 1.15304 -0.252436
1 0.981321 0.816847 -0.458027 -1.070043
2 -0.583595 1.02973 0.672721 0.435163
3 -0.53128 -0.438136 -0.104411
4 -1.118318
A numpy approach
The idea is to sort the columns by np.isnan so that the np.nans are put last. I use kind='mergesort' (a stable sort) to preserve the order among the non-NaN values. Finally, I slice the array and reassign it. I follow this up with a fillna.
v = df.values
i = np.arange(v.shape[1])
a = np.isnan(v).argsort(0, kind='mergesort')
v[:] = v[a, i]
print(df.fillna(''))
0 1 2 3
0 1.85748 -0.540645 -0.462941 -0.600606
1 0.000267 0.036393 -0.803889 0.492480
2 0.566922 -0.221294 -1.58493 0.527973
3 -0.243182 1.40348 2.278294
4 1.574097
If you didn't want to alter the dataframe in place:
v = df.values
i = np.arange(v.shape[1])
a = np.isnan(v).argsort(0, kind='mergesort')
pd.DataFrame(v[a, i], df.index, df.columns).fillna('')
The point of this is to leverage numpy's speed.
Naive time test (timing figure not shown).
Adding on to the solution by piRSquared:
This shifts all the values to the left instead of up.
If not all values are numbers, use pd.isnull:
v = df.values
a = [[n]*v.shape[1] for n in range(v.shape[0])]
b = pd.isnull(v).argsort(axis=1, kind = 'mergesort')
# a is a matrix used to reference the row index,
# b is a matrix used to reference the column index
# taking an entry from a and the respective entry from b (Same index),
# we have a position that references an entry in v
v[a, b]
A bit of explanation:
a is a list of length v.shape[0], and it looks something like this:
[[0, 0, 0, 0],
[1, 1, 1, 1],
[2, 2, 2, 2],
[3, 3, 3, 3],
[4, 4, 4, 4],
...
What happens here is that v is m x n, and I have made both a and b m x n as well. We pair up every entry (i, j) of a with the corresponding entry of b, which gives a position referencing v: the row is a[i][j] and the column is b[i][j]. So if a and b both looked like the matrix above, then v[a, b] would return a matrix whose first row contains n copies of v[0][0], whose second row contains n copies of v[1][1], and so on.
In piRSquared's solution, his i is a 1-D array, not a matrix, so it is reused v.shape[0] times, i.e. once for every row. Similarly, we could have done:
a = [[n] for n in range(v.shape[0])]
# which looks like
# [[0],[1],[2],[3]...]
# since we are trying to indicate the row indices of the matrix v as opposed to
# [0, 1, 2, 3, ...] which refers to column indices
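To turn v[a, b] back into a frame, the result can be wrapped in a new DataFrame; a minimal sketch with a made-up two-row example:
import numpy as np
import pandas as pd

df = pd.DataFrame([[1.0, np.nan, 3.0], [np.nan, 5.0, 6.0]])

v = df.values
a = [[n] * v.shape[1] for n in range(v.shape[0])]
b = pd.isnull(v).argsort(axis=1, kind='mergesort')

# Each row's non-null values shift left; NaNs end up on the right.
out = pd.DataFrame(v[a, b], index=df.index, columns=df.columns)
print(out)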
Let me know if anything is unclear,
Thanks :)
As a pandas beginner I wasn't immediately able to follow the reasoning behind @jezrael's
df.apply(lambda x: pd.Series(x.dropna().values))
but I figured out that it works by resetting the index of the column. df.apply (by default) works column by column, treating each column as a Series. Using dropna() removes the NaNs but doesn't change the index of the remaining numbers, so when the column is added back to the dataframe, the numbers go back to their original positions (their indices are unchanged) and the empty spaces are filled with NaN, recreating the original dataframe and achieving nothing.
By resetting the index of the column, in this case by changing the series to an array (using .values) and back to a series (using pd.Series), only the empty spaces after all the numbers (i.e. at the bottom of the column) are filled with NaN. The same can be accomplished by
df.apply(lambda x: x.dropna().reset_index(drop=True))
(drop = True) for reset_index keeps the old index from becoming a new column.
I would have posted this as a comment on @jezrael's answer but my rep isn't high enough!
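A minimal demonstration of the alignment point (the sample Series is made up):
import numpy as np
import pandas as pd

s = pd.Series([np.nan, 1.0, np.nan, 2.0])

# Without resetting: dropna keeps labels 1 and 3, so alignment restores the gaps.
print(pd.DataFrame({'orig': s, 'aligned': s.dropna()}))

# With resetting: the values pack to the top instead.
print(s.dropna().reset_index(drop=True))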
I am trying to select rows in a pandas data frame based on its values matching those of another data frame. Crucially, I only want to match values in rows, not throughout the whole series. For example:
df1 = pd.DataFrame({'a':[1, 2, 3], 'b':[4, 5, 6]})
df2 = pd.DataFrame({'a':[3, 2, 1], 'b':[4, 5, 6]})
I want to select rows where both 'a' and 'b' values from df1 match any row in df2. I have tried:
df1[(df1['a'].isin(df2['a'])) & (df1['b'].isin(df2['b']))]
This of course returns all rows, as all the values are present in df2 at some point, but not necessarily in the same row. How can I limit this so the values tested for 'b' are only those in rows where the 'a' value was found? So with the example above, I am expecting only row index 1 ([2, 5]) to be returned.
Note that data frames may be of different shapes, and contain multiple matching rows.
Similar to this post, here's one using broadcasting -
df1[(df1.values == df2.values[:,None]).all(-1).any(0)]
The idea is:
1) Use np.all for the "both" part in "both 'a' and 'b' values".
2) Use np.any for the "any" part in "from df1 match any row in df2".
3) Use broadcasting to do all of this in a vectorized fashion, extending dimensions with None/np.newaxis.
Sample run -
In [41]: df1
Out[41]:
a b
0 1 4
1 2 5
2 3 6
In [42]: df2 # Modified to add another row : [1,4] for variety
Out[42]:
a b
0 3 4
1 2 5
2 1 6
3 1 4
In [43]: df1[(df1.values == df2.values[:,None]).all(-1).any(0)]
Out[43]:
a b
0 1 4
1 2 5
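A shape-annotated version of the same expression, using the question's original frames, may help in following the broadcasting:
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
df2 = pd.DataFrame({'a': [3, 2, 1], 'b': [4, 5, 6]})

eq = df1.values == df2.values[:, None]  # (len(df2), len(df1), n_cols): elementwise comparisons
rows_match = eq.all(-1)                 # (len(df2), len(df1)): True where df2 row i == df1 row j
keep = rows_match.any(0)                # (len(df1),): True where a df1 row matches any df2 row
print(df1[keep])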
Use numpy broadcasting:
pd.DataFrame((df1.values[:, None] == df2.values).all(2),
pd.Index(df1.index, name='df1'),
pd.Index(df2.index, name='df2'))
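This builds the full cross-match matrix rather than the filtered rows; to recover the same selection as the first answer, reduce over df2's axis. A sketch under the question's sample data:
import pandas as pd

df1 = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
df2 = pd.DataFrame({'a': [3, 2, 1], 'b': [4, 5, 6]})

m = pd.DataFrame((df1.values[:, None] == df2.values).all(2),
                 pd.Index(df1.index, name='df1'),
                 pd.Index(df2.index, name='df2'))

# m.iloc[i, j] is True when row i of df1 equals row j of df2.
print(df1[m.any(axis=1)])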