In a DataFrame, I want to extract an integer (0-9) from a string which always comes after a specific word, and add it as a new column at a specific position (not the end). In the simplified example below I want to extract the integer which comes after the word 'number'.
DataFrame:
import re
import pandas as pd
testDf = ['Number1', 'number2', 'aNumber8', 'Number6b']
df = pd.DataFrame(testDf, columns=['Tagname'])
Tagname
Number1
number2
aNumber8
Number6b
The code below works, but since it adds the column at the end of the dataframe, I have to move the column.
df['Number'] = df['Tagname'].str.extract(r'number*(\d)', re.IGNORECASE)
Tagname Number
Number1 1
number2 2
aNumber8 8
Number6b 6
insertNum = df['Number']
df.drop(labels=['Number'], axis=1, inplace = True)
df.insert(0, 'Number', insertNum)
Number Tagname
1 Number1
2 number2
8 aNumber8
6 Number6b
What I hoped I could do is to use .insert(), but this raises the ValueError shown below.
df.insert(0, 'Number', df['Tagname'].str.extract(r'number*(\d)', re.IGNORECASE))
ValueError: Buffer has wrong number of dimensions (expected 1, got 2)
Is it possible to use .insert() this way?
Pass expand=False to Series.str.extract to get a Series. If you omit it, you get a DataFrame with one column per capture group, because the default is expand=True, and that 2-D result is what triggers the ValueError above:
Details:
print (df['Tagname'].str.extract(r'number*(\d)', re.IGNORECASE))
0
0 1
1 2
2 8
3 6
print (df['Tagname'].str.extract(r'number*(\d)', re.IGNORECASE, expand=False))
0 1
1 2
2 8
3 6
Name: Tagname, dtype: object
df.insert(0,'Number',df['Tagname'].str.extract(r'number*(\d)', re.IGNORECASE, expand=False))
print (df)
Number Tagname
0 1 Number1
1 2 number2
2 8 aNumber8
3 6 Number6b
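For reference, here is a minimal self-contained sketch of the same approach. The variable name digits is only illustrative, and the pattern is simplified to number(\d) on the assumption that the digit directly follows the word; flags is passed by keyword to make the call easier to read.

```python
import re
import pandas as pd

df = pd.DataFrame({"Tagname": ["Number1", "number2", "aNumber8", "Number6b"]})

# With a single capture group, expand=False returns a Series instead of a
# one-column DataFrame, which is the 1-D shape DataFrame.insert expects.
digits = df["Tagname"].str.extract(r"number(\d)", flags=re.IGNORECASE, expand=False)

# Insert the extracted digits as the first column (position 0), in place.
df.insert(0, "Number", digits)
print(df)
```

Note that the extracted values are strings; add .astype(int) to digits if you need integers and every row is guaranteed to match.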
Related
I would like to add a string at the beginning of each row label, either 'positive' or 'negative', depending on the value in the columns:
I keep getting ValueError, as per screenshot
For a generic method to handle any number of columns, use pandas.from_dummies:
cols = ['positive', 'negative']
user_input_1.index = (pd.from_dummies(user_input_1[cols]).squeeze()
+'_'+user_input_1.index
)
Example input:
Score positive negative
A 1 1 0
B 2 0 1
C 3 1 0
Output:
Score positive negative
positive_A 1 1 0
negative_B 2 0 1
positive_C 3 1 0
Use Series.map to choose a prefix by condition and prepend it to the index:
df.index = df['positive'].eq(1).map({True:'positive_', False:'negative_'}) + df.index
Or use numpy.where:
df.index = np.where(df['positive'].eq(1), 'positive_','negative_') + df.index
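A runnable sketch of the numpy.where variant on the example input from above. It assumes the index labels are (or can be cast to) strings, and it builds the new labels with a plain list comprehension rather than array addition, which is my own choice to avoid depending on how string arrays and the index add together.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"Score": [1, 2, 3], "positive": [1, 0, 1], "negative": [0, 1, 0]},
    index=["A", "B", "C"],
)

# Choose one prefix per row based on the 'positive' flag.
prefix = np.where(df["positive"].eq(1), "positive_", "negative_")

# Combine each prefix with its existing row label.
df.index = [p + str(label) for p, label in zip(prefix, df.index)]
print(df)
```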
I want to create two binary indicators by checking whether the characters in the first and third positions of column 'A' match the characters in the first and third positions of column 'B'.
Here is a sample data frame:
df = pd.DataFrame({'A' : ['a%d', 'a%', 'i%'],
'B' : ['and', 'as', 'if']})
A B
0 a%d and
1 a% as
2 i% if
I would like the data frame to look like below:
A B Match_1 Match_3
0 a%d and 1 1
1 a% as 1 0
2 i% if 1 0
I tried the following string comparison, but the match_1 column just contains '0' values.
df['match_1'] = np.where(df['A'][0] == df['B'][0], 1, 0)
I am wondering if there is a function that is similar to the substr function found in SQL.
You could use the pandas .str accessor, which supports indexing and slicing each element:
df['match_1'] = df['A'].str[0].eq(df['B'].str[0]).astype(int)
df['match_3'] = df['A'].str[2].eq(df['B'].str[2]).astype(int)
output:
A B match_1 match_3
0 a%d and 1 1
1 a% as 1 0
2 i% if 1 0
If you have many positions to test, you can use a loop:
for pos in (1, 3):
df['match_%d' % pos] = df['A'].str[pos-1].eq(df['B'].str[pos-1]).astype(int)
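Here is the loop wrapped into a small helper, as a sketch. The function name add_position_matches and its parameters are made up for illustration. Positions past the end of a string come back as missing values from .str[...], and a missing value never compares equal, so those rows get 0.

```python
import pandas as pd

df = pd.DataFrame({"A": ["a%d", "a%", "i%"], "B": ["and", "as", "if"]})


def add_position_matches(frame, left, right, positions):
    """Add a 0/1 column per 1-based position where left and right agree."""
    out = frame.copy()
    for pos in positions:
        a = out[left].str[pos - 1]   # character at this position, or NaN
        b = out[right].str[pos - 1]
        out[f"match_{pos}"] = a.eq(b).astype(int)
    return out


result = add_position_matches(df, "A", "B", (1, 3))
print(result)
```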
I have a dataframe and I want to change some elements of a column based on a condition.
In particular given this column:
... VALUE ....
0
"1076A"
12
9
"KKK0139"
5
I want to obtain this:
... VALUE ....
0
"1076A"
12
9
"0139"
5
The 'VALUE' column contains both strings and numbers. When I find a particular substring in a string value, I want to get the same value without that substring.
I have tried:
1) df['VALUE'] = np.where(df['VALUE'].str.contains('KKK', na=False), df['VALUE'].str[3:], df['VALUE'])
2) df.loc[df['VALUE'].str.contains('KKK', na=False), 'VALUE'] = df['VALUE'].str[3:]
But both attempts raise an IndexError: invalid index to scalar variable
Any advice?
As the column contains both numeric (non-string) values and string values, you cannot use .str.replace(), since it handles strings only and turns every non-string element into NaN. Use .replace() instead.
Here, you can use:
df['VALUE'] = df['VALUE'].replace(r'KKK', '', regex=True)
Input:
data = {'VALUE': [0, "1076A", 12, 9, "KKK0139", 5]}
df = pd.DataFrame(data)
Result:
0 0
1 1076A
2 12
3 9
4 0139
5 5
Name: VALUE, dtype: object
If you use .str.replace() instead, you will get the result below.
Note the NaN produced for every numeric (non-string) value:
0 NaN
1 1076A
2 NaN
3 NaN
4 0139
5 NaN
Name: VALUE, dtype: object
More generally, if you want to remove any leading alphabetic substring, you can use:
df['VALUE'] = df['VALUE'].replace(r'^[A-Za-z]+', '', regex=True)
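To see the difference side by side, here is a small sketch on the sample input. The names cleaned and via_str are only illustrative, and the pattern is anchored with ^ on the assumption that 'KKK' only ever appears as a prefix.

```python
import pandas as pd

df = pd.DataFrame({"VALUE": [0, "1076A", 12, 9, "KKK0139", 5]})

# Series.replace with regex=True only rewrites string elements;
# the integers pass through unchanged.
cleaned = df["VALUE"].replace(r"^KKK", "", regex=True)

# The .str accessor returns NaN for every non-string element.
via_str = df["VALUE"].str.replace("KKK", "", regex=False)

print(cleaned.tolist())
print(via_str.isna().tolist())
```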
>>> df['VALUE'].astype(str).str.replace('KKK', '', regex=False)  # cast to str first so numeric entries do not become NaN
0 0
1 1076A
2 12
3 9
4 0139
5 5
Name: VALUE, dtype: object
Your second solution fails because you also need to apply the row selector to the right side of your assignment.
df.loc[df['VALUE'].str.contains('KKK', na=False), 'VALUE'] = df.loc[df['VALUE'].str.contains('KKK', na=False), 'VALUE'].str[3:]
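The same idea with the mask computed once, as a sketch. I use str.startswith here instead of str.contains, on the assumption that 'KKK' is always a prefix, since slicing off the first three characters only makes sense in that case.

```python
import pandas as pd

df = pd.DataFrame({"VALUE": [0, "1076A", 12, 9, "KKK0139", 5]})

# na=False makes non-string entries count as "no match" instead of NaN,
# so the result can be used directly as a boolean mask.
mask = df["VALUE"].str.startswith("KKK", na=False)

# Only the selected rows are strings, so .str[3:] is safe on that subset.
df.loc[mask, "VALUE"] = df.loc[mask, "VALUE"].str[3:]
print(df)
```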
Looking at your sample data, if 'K' is the only problem, just replace it with an empty string:
df['VALUE'].str.replace('K', '')
0 0
1 "1076A"
2 12
3 9
4 "0139"
5 5
Name: text, dtype: object
If you want to do it only for specific occurrences or positions of 'K', you can do that as well.
I'm trying to define a function that will create a column and clean the numbers down to just their ten-digit area code and number. The DataFrame:
PNum1
0 18888888888
1 1999999999
2 +++(112)31243134
I have all the individual functions and even stored them into a DataFrame and Dictionary.
def GetGoodNumbers(col):
    column = col.copy()
    # Strip every non-digit character (raw string so \D is not treated as a string escape)
    Cleaned = column.replace(r'\D+', '', regex=True)
    NumberCount = Cleaned.astype(str).str.len()
    FirstNumber = Cleaned.astype(str).str[0]
    SummaryNum = {'Number': Cleaned, 'First': FirstNumber, 'Count': NumberCount}
    df = pd.DataFrame(data=SummaryNum)
    DecentNumbers = []
    return df
returns
Count First Number
0 11 1 18888888888
1 10 1 1999999999
2 11 1 11231243134
How can I loop through the dataframe column and return a new column that will:
-remove all non-digits.
-get the length (which will be usually 10 or 11)
-If length is 11, return the right 10 digits.
The desired output:
number
1231243134
1999999999
8888888888
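You can remove every non-digit and then slice the last 10 digits. Pass regex=True explicitly, since newer pandas versions treat the pattern as a literal string by default:
df.PNum1.str.replace(r'\D+', '', regex=True).str[-10:]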
0 8888888888
1 1999999999
2 1231243134
Name: PNum1, dtype: object
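A self-contained sketch of the above that adds the result as a new column. The astype(str) call is my addition, assuming some numbers might have been read in as integers; the column name number follows the desired output in the question.

```python
import pandas as pd

df = pd.DataFrame({"PNum1": ["18888888888", "1999999999", "+++(112)31243134"]})

# Drop every non-digit character, then keep the rightmost 10 digits.
# A 10-digit number is returned unchanged; an 11-digit one loses its
# leading digit.
digits = df["PNum1"].astype(str).str.replace(r"\D+", "", regex=True)
df["number"] = digits.str[-10:]
print(df)
```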
I have a column with values in degrees with the degree sign.
42.9377º
42.9368º
42.9359º
42.9259º
42.9341º
The digit 0 should replace the degree symbol
I tried using regex or str.replace but I can't figure out the exact unicode character.
The source xls has it as º
the error shows it as an obelus ÷
printing the dataframe shows it as ?
the exact position of the degree sign may vary, depending on rounding of the decimals, so I can't replace using exact string position.
Use str.replace:
df['a'] = df['a'].str.replace('º', '0')
print (df)
a
0 42.93770
1 42.93680
2 42.93590
3 42.92590
4 42.93410
#check hex format of char
print ("{:02x}".format(ord('º')))
ba
df['a'] = df['a'].str.replace(u'\xba', '0')
print (df)
a
0 42.93770
1 42.93680
2 42.93590
3 42.92590
4 42.93410
A solution that extracts the float and appends '0':
df['a'] = df['a'].str.extract(r'(\d+\.\d+)', expand=False) + '0'
print (df)
a
0 42.93770
1 42.93680
2 42.93590
3 42.92590
4 42.93410
Or, if the last character of every value is º, you can use indexing with str:
df['a'] = df['a'].str[:-1] + '0'
print (df)
a
0 42.93770
1 42.93680
2 42.93590
3 42.92590
4 42.93410
If you know that it's always the last character you could remove that character and append a "0".
s = "42.9259º"
s = s[:-1]+"0"
print(s) # 42.92590
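If you are not sure which character you actually have, unicodedata can tell you. The character in the question, º (U+00BA), is the masculine ordinal indicator, which looks very similar to the real degree sign ° (U+00B0). The sketch below replaces either one, which is an assumption on my part that both may occur in the data.

```python
import unicodedata
import pandas as pd

df = pd.DataFrame({"a": ["42.9377º", "42.9368°"]})

# Identify the look-alike characters by code point and official name.
for ch in ("º", "°"):
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))

# Replace either character with '0', wherever it appears in the string.
df["a"] = df["a"].str.replace("[\u00ba\u00b0]", "0", regex=True)
print(df)
```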