change value in pandas dataframe based on length of current value [duplicate] - python

This question already has answers here:
Pandas - Add leading "0" to string values so all values are equal len
(3 answers)
Closed 6 years ago.
I have a pandas dataframe with a certain column whose values should have a length of four. If the length is three, I would like to add a '0' to the beginning of the value. For example:
a b c
1 2 0054
3 6 021
5 5 0098
8 2 012
So in column c I would like to change the second row to '0021' and the last row to '0012'. The values are already strings. I've tried doing:
df.loc[len(df['c']) == 3, 'c'] = '0' + df['c']
but it's not working out. Thanks for any help!

If the type of column c is int you can do something like this:
df['c'].apply(lambda x: ('0' * (4 - len(str(x)))) + str(x) if len(str(x)) < 4 else str(x))
In the lambda function, I check whether the number of digits/characters in x is less than four. If yes, I add zeros in front, so that the number of digits/characters in x will be four (this is also known as padding). If not, I return the value as string.
In case your type is string, you can remove the str() function calls, but it will work either way.
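Since the values in the question are already strings, a shorter vectorized alternative (a sketch, not part of the original answer) is the str.zfill string method, which left-pads with zeros to a fixed width:
import pandas as pd

df = pd.DataFrame({'a': [1, 3, 5, 8],
                   'b': [2, 6, 5, 2],
                   'c': ['0054', '021', '0098', '012']})

# zfill(4) left-pads each string with '0' until it is 4 characters long
df['c'] = df['c'].str.zfill(4)
print(df['c'].tolist())  # ['0054', '0021', '0098', '0012']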

Related

How to select rows from a pandas dataframe by looking at a feature's data type when a feature contains more than one type of value [duplicate]

This question already has answers here:
Select row from a DataFrame based on the type of the object(i.e. str)
(3 answers)
Closed 3 months ago.
I have a dataframe with 3 features: id, name and point. I need to select the rows where the type of the 'point' value is string.
id  name  point
0   x     5
1   y     6
2   z     ten
3   t     nine
4   q     two
How can I split the dataframe just by looking at the type of one feature's values?
I tried to adapt the select_dtypes method but got lost. I also tried to divide the dataset using
df[df[point].dtype == str] or df[df[point].dtype is str]
but that didn't work.
Technically, the answer would be:
out = df[df['point'].apply(lambda x: isinstance(x, str))]
But this would also select rows containing a string representation of a number ('5').
If you want to select "strings" as opposed to "numbers" whether those are real numbers or string representations, you could use:
m = pd.to_numeric(df['point'], errors='coerce')
out = df[df['point'].notna() & m.isna()]
The question is now: what if you have '1A' or 'AB123' as a value?
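For reference, a minimal sketch of both approaches on the question's sample data (assuming the mixed 'point' column shown above):
import pandas as pd

df = pd.DataFrame({'id': [0, 1, 2, 3, 4],
                   'name': ['x', 'y', 'z', 't', 'q'],
                   'point': [5, 6, 'ten', 'nine', 'two']})

# rows where 'point' holds a Python str (would also catch a value like '5')
by_type = df[df['point'].apply(lambda x: isinstance(x, str))]

# rows where 'point' cannot be parsed as a number -> 'ten', 'nine', 'two'
m = pd.to_numeric(df['point'], errors='coerce')
not_numeric = df[df['point'].notna() & m.isna()]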

sum columns in dataframe python (different columns each row) [duplicate]

This question already has answers here:
Dynamically evaluate an expression from a formula in Pandas
(2 answers)
Closed 2 years ago.
I have a dataframe with 3 columns a, b, c like below:
df = pd.DataFrame({'a':[1,1,5,3], 'b':[2,0,6,1], 'c':[4,3,1,4]})
I want to add a column d which is the sum of some columns in df, but not the same columns for each row. For example,
only rows 1 and 3 are sums of the same columns; rows 0 and 2 are sums of other columns.
What I found on Stack Overflow always sums a fixed set of columns for the whole dataframe, but in this case it is different.
What is the best way to do it?
Because column d is calculated differently for each row, the only way to do it is row by row.
df['d'] = 0
df.loc[0, 'd'] = df.loc[0, 'b']
df.loc[1, 'd'] = df.loc[1, 'a'] + df.loc[1, 'c']
df.loc[2, 'd'] = df.loc[2, 'a']
df.loc[3, 'd'] = df.loc[3, 'a'] + df.loc[3, 'c']
If rows 1 and 3 follow a rule:
mask = (df.index % 2) == 1
df.loc[mask, 'd'] = df.loc[mask, 'a'] + df.loc[mask, 'c']
Also, with a for-loop:
for i in range(len(df)):
    if i % 2 == 1:
        df.loc[i, 'd'] = df.loc[i, 'a'] + df.loc[i, 'c']
The dynamic way uses pd.eval(), as per [this solution][1]. This evaluates each row's formula individually, which allows df['formula'] to be different on each row, and nothing is hardcoded in your code. There's a lot going on in this one-liner; see the explanation in the Notes below.
df.apply(lambda row: pd.eval(row['formula'], local_dict=row.to_dict()), axis=1)
0 2
1 4
2 5
3 4
# ^--- this is the result
and if you want to assign that result to a dataframe column, say df['z']:
df['z'] = df.apply(lambda row: pd.eval(row['formula'], local_dict=row.to_dict()), axis=1)
Alternatively you could use pd.eval(..., inplace=True), but then the formula would need to contain an actual assignment, e.g. 'z=a+b', and the 'z' column would need to have been declared already: df['z'] = np.nan. That part is slightly annoying to implement, so I didn't.
NOTES:
- We use pd.eval(...) to dynamically evaluate the 'formula' column, and its local_dict=... argument to pass in the variables for that row.
- To evaluate an expression on each dataframe row, we use df.apply(..., axis=1). We have to provide some lambda function to tell it what to evaluate.
- So how does pd.eval() know how to map the strings a, b, c to their values on that individual row? When we call df.apply(..., axis=1) row-wise like that, each row gets passed in as an individual Series, so within our apply(..., axis=1) we can no longer reference the dataframe as df or its columns as df['a'], df['b'], ... Instead we pass that row in as a Python dict, hence the local_dict=row.to_dict() argument to pd.eval inside the lambda function.
- The pd.eval() approach can handle arbitrarily complicated formulas in the variables, not just simple sums; it can handle e.g. (a + c**2)/(b+c). You could reference external constants, or external functions e.g. log10.
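Putting it together, a self-contained sketch; the df['formula'] strings below are only an illustration (the question never specifies the exact per-row formulas) and were chosen to reproduce the output shown above:
import pandas as pd

df = pd.DataFrame({'a': [1, 1, 5, 3], 'b': [2, 0, 6, 1], 'c': [4, 3, 1, 4]})

# hypothetical per-row formulas, one expression string per row
df['formula'] = ['b', 'a + c', 'a', 'c']

# evaluate each row's formula against that row's own values
df['d'] = df.apply(lambda row: pd.eval(row['formula'], local_dict=row.to_dict()),
                   axis=1)
print(df)
#    a  b  c formula  d
# 0  1  2  4       b  2
# 1  1  0  3   a + c  4
# 2  5  6  1       a  5
# 3  3  1  4       c  4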
References:
[1]: Compute dataframe columns from a string formula in variables?

Changing pandas column values into another format [duplicate]

This question already has answers here:
How to convert string representation of list to a list
(19 answers)
Closed 3 years ago.
The labels column of my test dataframe, test['labels'], looks like:
0 ['Edit Distance']
1 ['Island Perimeter']
2 ['Longest Substring with At Most K Distinct Ch...
3 ['Valid Parentheses']
4 ['Intersection of Two Arrays II']
5 ['N-Queens']
For each value in the column, which is a string representation of a list ("['Edit Distance']"), I want to apply the function below to convert it into an actual list.
ast.literal_eval(VALUE HERE)
What is a straightforward way to do this?
Use:
import ast
test['labels'] = test['labels'].apply(ast.literal_eval)
print (test)
labels
0 [Edit Distance]
1 [Island Perimeter]
2 [Longest Substring with At Most K Distinct Ch]
3 [Valid Parentheses]
4 [Intersection of Two Arrays II]
5 [N-Queens]

Pandas: How to find index of value of one column whose cell value matches certain value

My dataframe consists of the following table:
Time X Y
0100 5 9
0200 7 10
0300 11 12
0400 3 13
0500 4 14
My goal is to find the index of the value of Y which corresponds to a certain number (e.g.: 9) and return the corresponding X value from the table.
My previous idea was to use a for-loop (as I have a number of Ys) to loop through, find all the values that match, and append the corresponding X values to an empty array, like so:
for i in (list of Ys):
    empty_storing_array.append(df[index_of_X].loc[df[Y] == i])
The problem is (if my newbie understanding of Pandas holds true) that what loc gives back is not a plain number, but something else. How should I do it so that empty_storing_array then lists the X values that correspond to the values in array Y?
You can use df.loc and then ask for the index explicitly. This will return an array, so we slice the first item to get the integer:
df.loc[df['Y']==9, 'X'].index.values[0]
Try this:
list_Ys = [9,8,15] #example
new_df = df[df['Y'].isin(list_Ys)]['X']
The isin method tells whether each element in the DataFrame is contained in the given values.
If you want to convert the resulting Series to an array:
new_df.values
If you need to have a way to retrieve which Y corresponds to a given X then keep both X and Y:
df.loc[df['Y'].isin(list_of_ys), ['Y', 'X']].values
Perhaps create a dictionary, keyed by Y, that holds all of the Xs corresponding to each Y in a tuple.
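A small sketch of that idea (the variable names here are made up), grouping the X values by Y:
import pandas as pd

df = pd.DataFrame({'Time': ['0100', '0200', '0300', '0400', '0500'],
                   'X': [5, 7, 11, 3, 4],
                   'Y': [9, 10, 12, 13, 14]})
list_of_ys = [9, 12, 14]

subset = df[df['Y'].isin(list_of_ys)]
# dictionary keyed by Y, each value a tuple of the matching Xs
x_by_y = {y: tuple(group['X']) for y, group in subset.groupby('Y')}
print(x_by_y)  # {9: (5,), 12: (11,), 14: (4,)}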

Slicing a Data frame by checking consecutive elements [duplicate]

This question already has answers here:
Pandas: Drop consecutive duplicates
(8 answers)
Closed 4 years ago.
I have a DataFrame indexed by time, and one of its columns (with 2 distinct values) looks like [x,x,y,y,x,x,x,y,y,y,y,x]. I want to slice this DataFrame so that I get this column without identical consecutive values, in this example [x,y,x,y,x], where each kept value is the first of its run.
Still trying to figure it out...
Thanks!!
Assuming you have a df like the one below:
df=pd.DataFrame(['x','x','y','y','x','x','x','y','y','y','y','x'])
We use shift to check whether each element is equal to the previous one, and keep only the rows where it is not:
df[df[0].shift()!=df[0]]
Out[142]:
0
0 x
2 y
4 x
7 y
11 x
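The same boolean mask slices the whole DataFrame, not just the one column, so a time index is preserved; a sketch with a made-up column name 'state':
import pandas as pd

df = pd.DataFrame(
    {'state': ['x','x','y','y','x','x','x','y','y','y','y','x']},
    index=pd.date_range('2021-01-01', periods=12, freq='D'),
)

# keep only the first row of each run of identical consecutive values
out = df[df['state'] != df['state'].shift()]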
You could also just loop through and save the last element used:
df = pd.DataFrame(['x','x','y','y','x','x','x','y','y','y','y','x'])
kept = []    # first element of each run
old = None   # last element used
for value in df[0]:
    if value != old:
        kept.append(value)
        old = value
df2 = pd.DataFrame(kept)
EDIT:
Or, if your data is in a plain list rather than a DataFrame, use itertools.groupby:
>>> L=[1,1,1,1,1,1,2,3,4,4,5,1,2]
>>> from itertools import groupby
>>> [x[0] for x in groupby(L)]
[1, 2, 3, 4, 5, 1, 2]
