I am trying to detect values with some specific characters, e.g. ?, /, etc. Below you can see a small sample with some data.
import pandas as pd
import numpy as np
data = {
    'artificial_number': ['000100000', '000010000', '00001000/1', '00001000?', '0?00/10000'],
}
df1 = pd.DataFrame(data, columns=['artificial_number'])
Now I want to detect values with specific characters that are not numbers ('00001000/1', '00001000?', '0?00/10000'). I tried with the lines below:
import re
clean = re.sub(r'[^a-zA-Z0-9\._-]', '', df1['artificial_number'])
But this code is not working as I expected. Can anybody help me solve this problem?
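For reference, re.sub operates on a single string, not on a whole pandas Series, which is why the line above raises a TypeError. A minimal sketch applying it per element with Series.apply:
# map re.sub over each value; keeps only the allowed characters
clean = df1['artificial_number'].apply(lambda s: re.sub(r'[^a-zA-Z0-9\._-]', '', s))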
# replace non-digit characters with an empty string
df1['artificial_number'].str.replace(r'([^\d])','', regex=True)
0 000100000
1 000010000
2 000010001
3 00001000
4 00010000
Name: artificial_number, dtype: object
If you'd like to list the rows with non-digit values:
df1.loc[df1['artificial_number'].str.extract(r'([^\d])')[0].notna()]
artificial_number
2 00001000/1
3 00001000?
4 0?00/10000
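The same filter can also be written with Series.str.contains (a sketch of an equivalent one-liner):
# rows containing at least one non-digit character
df1[df1['artificial_number'].str.contains(r'[^\d]', regex=True)]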
Assuming a number in your case is an integer: to find the values that contain non-numbers, just count the digits and compare the count with the length of the string:
rows = [len(re.findall('[0-9]', s)) != len(s) for s in df1.artificial_number]
df1.loc[rows]
# artificial_number
#2 00001000/1
#3 00001000?
#4 0?00/10000
To detect which of the values aren't interpretable as numeric, you can also use str.isnumeric:
df1.loc[~df1.artificial_number.str.isnumeric()]
artificial_number
2 00001000/1
3 00001000?
4 0?00/10000
If all characters need to be digits, use str.isdigit (note that '.' is not a digit, so a value like 10.0 is excluded by str.isnumeric as well; the two methods differ only for special Unicode characters such as '½', which is numeric but not a digit):
df1.iloc[0, 0] = '000100000.0'
df1.loc[~df1.artificial_number.str.isdigit()]
artificial_number
0 000100000.0
2 00001000/1
3 00001000?
4 0?00/10000
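For context, a quick plain-Python sketch of how the two methods actually differ:
'10.0'.isnumeric()  # False - '.' is not numeric
'10.0'.isdigit()    # False - '.' is not a digit either
'½'.isnumeric()     # True  - vulgar fractions count as numeric
'½'.isdigit()       # False
'²'.isdigit()       # True  - superscript digits still count as digits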
Related
I am sure someone has asked a question like this before but my current attempts to search have not yielded a solution.
I have a column of text values, for example:
import pandas as pd
df2 = pd.DataFrame({'text':['a','bb','cc','4','m','...']})
print(df2)
text
0 a
1 bb
2 cc
3 4
4 m
5 ...
The 'text' column is composed of strings, ints, floats, and NaN-type data.
I am trying to combine (with a space ' ' between each text value) all the text values between each number (int/float) in the text column, ignoring NaN values, and making each concatenated set a separate row.
What would be the most efficient way to accomplish this?
I thought to possibly read all values into a string, strip the NaNs, then split it successively whenever a number is encountered, but this seems highly inefficient.
Thank you for your help!
edit:
desired sample output
text
0 'a bb cc'
1 'm ...'
You can convert the column to numeric and test which values are not missing, which gives True for the numeric rows. Then filter only the non-numeric rows with the inverted mask ~ in DataFrame.loc, group them by the cumulative sum of the mask (Series.cumsum), which assigns one group id to each block between numbers, and aggregate with a space join:
# remove NaNs before applying the solution
df2 = df2.dropna(subset=['text'])
m = pd.to_numeric(df2['text'], errors='coerce').notna()
df = df2.loc[~m, 'text'].groupby(m.cumsum()).agg(' '.join).reset_index(drop=True).to_frame()
print (df)
text
0 a bb cc
1 m ...
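To see why the cumulative sum forms the groups, here are the intermediate values for the sample data:
m = pd.to_numeric(df2['text'], errors='coerce').notna()
print(m.tolist())           # [False, False, False, True, False, False]
print(m.cumsum().tolist())  # [0, 0, 0, 1, 1, 1]
# 'a', 'bb', 'cc' share group 0; after the numeric row '4', 'm' and '...' share group 1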
I would avoid pandas for this operation altogether. Instead, use the library module more_itertools - namely, the split_at() function:
import more_itertools as mit
def test(x):  # test whether x parses as a number of some sort (NaN included)
    try:
        float(x)
        return True
    except (ValueError, TypeError):
        return False
result = [" ".join(x) for x in mit.split_at(df2['text'].dropna(), test)]
# ['a bb cc', 'm ...']
df3 = pd.DataFrame(result, columns=['text',])
P.S. On a dataframe of 13,000 rows with an average group length of 10, this solution is 2 times faster than the pandas solution proposed by jezrael (0.00087 sec vs 0.00156 sec). Not a huge difference, indeed.
I have some data that I put into a pandas dataframe. Inside of cell [0,5] I have a list of times that I want to call and have printed out.
Dataframe:
GAME_A PROCESSING_SPEED
yellow_selected 19
red_selected 0
yellow_total 19
red_total 60
counters [0.849998, 1.066601, 0.883263, 0.91658, 0.96668]
Code:
import pandas as pd
df = pd.read_csv('data.csv', sep = '>')
print(df.iloc[0])
proc_speed = df.iat[0,5]
print(proc_speed[2])
When I try to print the 3rd time in the list I get '.'. I tried to use a for loop to print the times, but instead I get this. How can I call the specific values from the list? How would I print out the 3rd time, 0.883263?
[
0
.
8
4
9
9
9
8
,
1
.
0
6
6
...
This happens because, with the way you are loading the data, the column 'PROCESSING_SPEED' is read as object dtype, so every element of that series is a string. In this case proc_speed = "[0.849998, 1.066601, 0.883263, 0.91658, 0.96668]", which is exactly the string the loop is printing character by character.
Before printing the values you desire to display (from that cell), one should convert the string to a list of numbers, for example:
proc_speed = df.iat[4,1]
proc_speed = [float(s) for s in proc_speed[1:-1].split(',')]
for num in proc_speed:
    print(num)
Where proc_speed[1:-1].split(',') takes the string containing the list, except for the brackets at the beginning and end, and splits it according to the commas separating values.
In general, we have to be careful when loading columns with varying or ambiguous data types, as Pandas could have trouble parsing them correctly or in the way we want/expect it to be.
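As a safer alternative (a sketch, assuming each such cell holds a valid Python list literal), ast.literal_eval parses the string without manual slicing and splitting:
import ast
proc_speed = ast.literal_eval(df.iat[4, 1])  # "[0.849998, ...]" -> list of floats
print(proc_speed[2])  # 0.883263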
You can simply call proc_speed[index], as you have set this variable to a list. Here is a working example; note my call to df.iat has different indexes:
d = {'GAME_A':['yellow_selected', 'red_selected', 'yellow_total', 'red_total', 'counters'],'PROCESSING_SPEED':[19,0,19,60,[0.849998, 1.066601, 0.883263, 0.91658, 0.96668]]}
df = pd.DataFrame(d)
proc_speed = df.iat[4, 1]
for i in proc_speed:
    print(i)
0.849998
1.066601
0.883263
0.91658
0.96668
proc_speed[1]
1.066601
proc_speed[3]
0.91658
You can convert with apply; it's easier than splitting, and it converts your ints to ints:
pd.read_clipboard(sep=r"(?!\s+(?<=,\s))\s+")['PROCESSING_SPEED'].apply(eval)[4][2]
# 0.883263
(Note that eval executes arbitrary strings as code; for untrusted data, ast.literal_eval is the safer choice.)
Consider a dataframe in Pandas, where one of the many columns has data with TWO decimal points in it.
Like
13.343.00
12.345.00
98.765.00
How can one get a new column (float) where the values keep only one decimal point, stripping that last part of 14.234(.00)?
Desired output should be a new column like
13.343
12.345
98.765
If the digits after the second period are not always 0s (and not always two), the following code is more robust:
df["col"] = df["col"].str.extract("(.+)\.[0-9]+").astype(float)
Use:
# remove the last 3 characters
df['col'] = df['col'].str[:-3].astype(float)
Or:
# get the values before the last '.'
df['col'] = df['col'].str.rsplit('.', n=1).str[0].astype(float)
Or:
# optional integer part \d*, a literal \., then digits \d+
df["col"] = df["col"].str.extract(r"(\d*\.\d+)").astype(float)
You can use:
print(df)
col
0 13.343.00
1 12.345.00
2 98.765.00
df.col = df.col.str.rstrip('.00')
print(df)
col
0 13.343
1 12.345
2 98.765
You can convert it back to float if you like with astype(float).
Note: rstrip strips a set of characters, not a suffix, so it also eats trailing zeros before the last '.' (e.g. 00.000.00 becomes an empty string). If your values can end in 0, don't use this; use the split-based solution instead.
If the second decimal part is not always 0, use:
df.col.str.rsplit(".", n=1).str[0]
I have an expression like the following (one row of a column, say 'old_col', in a pandas data frame; the top two rows of the column are shown):
abcd_6.9_uuu ghaha_12.8 _sksks
abcd_5.2_uuu ghaha_13.9 _sksks
I was trying to use str.extract on the dataframe to get the two floating-point numbers, but I ran into two issues: only the first number is picked up (6.9 from the first row and 5.2 from the second row).
1. So how can I get both?
2. Also, how can I make the extract pattern general enough to pick up numbers with any number of digits (5.7 or 12.9 alike)?
I am using:
df['newcol'] = df['old_col'].str.extract('(_\d.\d)')
To get more than one digit,
df['old_col'].str.extract(r'(_\d+\.\d+)')
      0
0  _6.9
1  _5.2
To get all occurrences, use str.extractall:
df['old_col'].str.extractall(r'(_\d+\.\d+)')
            0
  match
0 0      _6.9
  1     _12.8
1 0      _5.2
  1     _13.9
To assign back to df (with an unnamed group, the resulting column is labelled 0):
s = df['old_col'].str.extractall(r'(_\d+\.\d+)')[0]
df['new_col'] = s.groupby(s.index.get_level_values(0)).agg(list)
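As a side note, a named capture group gives the extractall result a labelled column that can be selected by name instead of 0 (a sketch):
s = df['old_col'].str.extractall(r'(?P<num>_\d+\.\d+)')['num']
df['new_col'] = s.groupby(s.index.get_level_values(0)).agg(list)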
You can use Series.str.findall:
import pandas as pd
df=pd.DataFrame({'old_col':['abcd_6.9_uuu ghaha_12.8 _sksks','abcd_5.2_uuu ghaha_13.9 _sksks']})
df['newcol'] = df['old_col'].str.findall(r'\d+(?:\.\d+)?')
df['newcol_str'] = df['old_col'].str.findall(r'\d+(?:\.\d+)?').str.join(', ')
# >>> df
# old_col newcol newcol_str
# 0 abcd_6.9_uuu ghaha_12.8 _sksks [6.9, 12.8] 6.9, 12.8
# 1 abcd_5.2_uuu ghaha_13.9 _sksks [5.2, 13.9] 5.2, 13.9
Regex details:
\d+(?:\.\d+)? - one or more digits followed by an optional occurrence of a . and one or more digits
\d+\.\d+ would match only float values, where the . must appear between digits.
Since .str.findall(r'\d+(?:\.\d+)?') returns a list, the newcol column contains lists; with .str.join(', '), the newcol_str column contains the found matches merged into a string.
If you must check if the numbers occur between underscores add them on both sides of the pattern and wrap the number matching pattern with parentheses:
.str.findall(r'_(\d+(?:\.\d+)?)_')
I'm having a problem trying to get a character-count column for the string values in another column, and I haven't figured out how to do it efficiently.
for index in range(len(df)):
    df['char_length'][index] = len(df['string'][index])
This apparently involves first creating a column of nulls and then rewriting it, and it takes a really long time on my data set. So what's the most effective way of getting something like
'string'  'char_length'
abcd      4
abcde     5
I've checked around quite a bit, but I haven't been able to figure it out.
Pandas has a vectorised string method for this: str.len(). To create the new column you can write:
df['char_length'] = df['string'].str.len()
For example:
>>> df
string
0 abcd
1 abcde
>>> df['char_length'] = df['string'].str.len()
>>> df
string char_length
0 abcd 4
1 abcde 5
This should be considerably faster than looping over the DataFrame with a Python for loop.
Many other familiar string methods from Python have been introduced to Pandas: for example, lower for converting to lowercase, count for counting occurrences of a particular substring, and replace for swapping one substring with another.
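For instance (a quick sketch):
s = pd.Series(['banana', 'Apple'])
print(s.str.lower().tolist())            # ['banana', 'apple']
print(s.str.count('a').tolist())         # [3, 0]  (count is case-sensitive)
print(s.str.replace('a', 'o').tolist())  # ['bonono', 'Apple']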
Here's one way to do it.
In [3]: df
Out[3]:
string
0 abcd
1 abcde
In [4]: df['len'] = df['string'].str.len()
In [5]: df
Out[5]:
string len
0 abcd 4
1 abcde 5