How to find a character in a (cell) array of strings, Python

I am loading a .csv with Pandas (pd.read_csv). Normally this yields floats, but a few of my datasets have a 'q' inside some of the > 100000 numbers (for instance a matrix of 33x60000) in the .csv file. Like this: '-13q27.20148186934421000000' (the q's are NOT always in the same place). This causes Pandas to see them as strings rather than numbers, which makes a conversion to float impossible. Hence my question: how can I easily find the 'q's and remove them?
I tried using a for loop to check, for each individual string, whether it contains a 'q'; this, however, takes ages:
for i in range(tmp.values.shape[0]):
    for j in range(tmp.values.shape[1]):
        if 'q' in tmp.values[i, j]:
            print('oh oh')
It is also possible that it is sometimes another letter than a 'q', so maybe it would be wise to look for letters in general; I have no idea how to do this in an efficient way.
Thanks in advance for your help!

Use pandas.DataFrame.replace with regex=True:
Given df:
   col1  col2  col3
0   1.1   2.2   3.3
1  2q.2  3.q4  q5.3
2   4.4   5.5   6.6
df = df.replace('q', '', regex=True).astype(float)
print(df.dtypes)
print(df)
Output:
col1    float64
col2    float64
col3    float64
dtype: object
   col1  col2  col3
0   1.1   2.2   3.3
1   2.2   3.4   5.3
2   4.4   5.5   6.6
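If the stray character is not always a 'q', the same idea extends to any letter. A minimal sketch (my addition, assuming every cell is otherwise a valid number):
# Locate cells that contain any ASCII letter
mask = df.apply(lambda s: s.astype(str).str.contains('[A-Za-z]'))
print(df[mask.any(axis=1)])
# Strip the letters everywhere and convert to float
df = df.replace('[A-Za-z]', '', regex=True).astype(float)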

You can strip a character (here q) from the start and end of the values in a specific column (here named result):
data['result'] = data['result'].map(lambda x: x.lstrip('q').rstrip('q'))
Afterwards you can convert your column to float.
data['result'] = data['result'].astype(float)
Or alternatively, remove every character that is not a digit, decimal point or minus sign:
df['result'] = df['result'].str.replace(r'[^\d.-]', '', regex=True).astype(float)

df.replace(['q'], 0.0, inplace=True)

Related

change number negative sign position

Hello, I have the following example df:
col1 col2
12.4 12.32
11.4- 2.3
2.0- 1.1
I need the negative sign to be at the beginning of the number and not at the end
col1 col2
12.4 12.32
-11.4 2.3
-2.0 1.1
I am trying with the following code; so far I can get the values with the sign and print them correctly, but I no longer know how to replace them in my column:
updated_data = ''  # iterate over the content
for line in df["col1"]:
    # removing last word
    updated_line = ' '.join(str(line).split('-')[:-1])
    print(updated_line)
Could you help me please? Or, if there is an easier way to do it, I would appreciate it.
Here is one way to do it, using np.where:
import numpy as np

# Check if the string ends with '-', then convert it to float after removing the '-' and multiplying by -1
df['col1'] = np.where(df['col1'].str.strip().str.endswith('-'),
                      df['col1'].str.replace(r'-', '', regex=True).astype('float') * (-1),
                      df['col1'])
df
    col1   col2
0   12.4  12.32
1  -11.4   2.30
2   -2.0   1.10
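An alternative sketch (not part of the answer above, assuming the raw values are strings such as '11.4-'): move the trailing minus to the front with a regex and convert the whole column to float in one go.
df['col1'] = (df['col1'].astype(str).str.strip()
              .str.replace(r'^(.*)-$', r'-\1', regex=True)
              .astype(float))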

Formatting a string containing currency and commas

Does anyone know how I'd format this string (which is a column in a dataframe) to be a float so I can sort by the column please?
£880,000
£88,500
£850,000
£845,000
i.e. I want this to become
88,500
845,000
850,000
880,000
Thanks in advance!
Assuming 'col' is the column name.
If you just want to sort, and keep as string, you can use natsorted:
from natsort import natsort_key
df.sort_values(by='col', key=natsort_key)
# OR
from natsort import natsort_keygen
df.sort_values(by='col', key=natsort_keygen())
output:
col
1 £88,500
3 £845,000
2 £850,000
0 £880,000
If you want to convert to floats:
df['col'] = pd.to_numeric(df['col'].str.replace(r'[^\d.]', '', regex=True))
df.sort_values(by='col')
output:
col
1 88500
3 845000
2 850000
0 880000
If you want strings, you can use str.lstrip:
df['col'] = df['col'].str.lstrip('£')
output:
col
0 880,000
1 88,500
2 850,000
3 845,000
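If you would rather keep the '£' strings for display and only sort numerically, sort_values also accepts a key callable. A sketch (assuming pandas 1.1+ and 'col' as the column name):
df_sorted = df.sort_values(
    by='col',
    key=lambda s: pd.to_numeric(s.str.replace(r'[^\d.]', '', regex=True))
)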

Pandas iterate over values of single column in data frame

I am a beginner to python and pandas.
I have a 5000-row data frame that looks something like this:
INDEX COL1 COL2 COL3
0 10.0 12.0 15.0
1 14.0 16.0 153.8
2 18.0 20.0 16.3
3 22.0 24.0 101.7
I wish to iterate over the values in COL3 and carry out calculations, such that:
For each row in the data frame, if the value in COL3 is <= 100.0, multiply that value by 10 and assign to variable "New_Value";
Else, multiply the value by 5 and assign to variable "New_Value"
I understand that an if statement cannot be applied directly to the data frame series, as it will lead to an ambiguous truth value error. However, I am stuck trying to find the right tool for this task, and would appreciate some guidance.
Cheers
Using np.where:
import numpy as np

df['New_Value'] = np.where(df['COL3'] <= 100, df['COL3']*10, df['COL3']*5)
One liner:
df.COL3.apply(lambda x: x*10 if x <= 100 else 5*x)
For this example, you can use apply, which applies a function to each value of the column.
lambda is a quick, anonymous function that you can define inline; it behaves slightly differently from a normal function.
The condition is x*10 if x<=100, so each x less than or equal to 100 is multiplied by 10; ELSE it is multiplied by 5.
Try this:
df['New_Value']=df.COL3.apply(lambda x: 10*x if x<=100 else 5*x)
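For completeness, a vectorised sketch without apply, using Series.where with the same column names:
# Multiply by 10 where COL3 <= 100, otherwise by 5, without a Python-level loop
df['New_Value'] = (df['COL3'] * 10).where(df['COL3'] <= 100, df['COL3'] * 5)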

How to get the count and proportion of a single DataFrame column in Python

Pretty new to Python, so any help is welcome.
I'm trying to get a count and a proportion of a single column in a DataFrame in Python. I'm wondering what would be the best way to do it.
This works:
df = pd.DataFrame({'Value': [1,0,1,0,1,1,1,0,1,0,0,1,0,1,0,1,0,0,0,0]})
colCount = pd.DataFrame(columns=['Count','Percentage'])
colCount['Count'] = df['Value'].value_counts()
colCount['Percentage'] = colCount['Count'] / colCount['Count'].sum()
display(colCount)
I thought this might work, but the output is not right:
df.groupby('Value').apply(lambda x: pd.Series( (x.count(), x.count()/len(df)), ['Count', 'Percentage']))
Output:
Count Percentage
Value
0 Value 11 dtype: int64 Value 0.55 dtype: float64
1 Value 9 dtype: int64 Value 0.45 dtype: float64
It is probably very simple, but I guess I'm missing something.
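For reference, a shorter sketch that builds the same table with value_counts (assuming the same df as above):
import pandas as pd

df = pd.DataFrame({'Value': [1,0,1,0,1,1,1,0,1,0,0,1,0,1,0,1,0,0,0,0]})
colCount = pd.DataFrame({
    'Count': df['Value'].value_counts(),
    'Percentage': df['Value'].value_counts(normalize=True),  # proportion of each value
})
print(colCount)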

Replace unwanted strings in pandas dataframe element wise and efficiently

I have a very large dataframe (thousands x thousands); I am only showing 5 x 3 here. time is the index:
col1 col2 col3
time
05/04/2018 05:14:52 AM +unend +unend 0
05/04/2018 05:14:57 AM 0 0 0
05/04/2018 05:15:02 AM 30.691 0.000 0.121
05/04/2018 05:15:07 AM 30.691 n. def. 0.108
05/04/2018 05:15:12 AM 30.715 0.000 0.105
As these are coming from some other device (the df is produced by pd.read_csv(filename)), the dataframe, instead of being entirely float, now ends up having unwanted strings like +unend and n. def.. These are not the classical +infinity or NaN that df.fillna() could take care of. I would like to replace the strings with 0.0. I saw these answers, Pandas replace type issue and replace string in pandas dataframe, which, although they try to do the same thing, work column- or row-wise rather than elementwise. However, in the comments there were some good hints for proceeding in the general case as well.
If I try to do
mask = df.apply(lambda x: x.str.contains(r'+unend|n. def.'))
df[mask] = 0.0
I get the error: nothing to repeat
If I do
mask = df.apply(lambda x: (str('n. def.') in (str(x)) or (str('unend') in str(x))))
df[mask] = 0.0
I would get a Series object with True or False for every column rather than an elementwise mask, and therefore an error:
TypeError: Cannot do inplace boolean setting on mixed-types with a non np.nan value.
The below
mask = df.applymap(lambda x: (str('n. def.') in (str(x)) or (str('unend') in str(x))))
df[mask.values] = 0.0
does give me the intended result, replacing all the unwanted strings with 0.0. However, it is slow (unpythonic?) and also, I am not sure if I can use regex for the check rather than in, especially if I know there are mixed datatypes. Is there an efficient, fast, robust, but also elementwise and general way to do this?
These are not the classical +infinity or NaN that df.fillna() could take care of
You can specify a list of strings to consider as NA when reading the csv file.
df = pd.read_csv(filename, na_values=['+unend', 'n. def.'])
And then fill the NA values with fillna
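Put together, that might look like this (a sketch; the 0.0 fill value is an assumption based on the question):
# Treat the device's placeholder strings as NaN while parsing, then fill with 0.0
df = pd.read_csv(filename, na_values=['+unend', 'n. def.'])
df = df.fillna(0.0)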
As pointed out by EdChum, if you need to replace all non-numeric values with 0: first, to_numeric with errors='coerce' creates NaNs for unparseable values, and then fillna converts them to 0:
df = df.apply(lambda x: pd.to_numeric(x, errors='coerce')).fillna(0)
If the values are not substrings, use DataFrame.isin or the very nice answer by Haleemur Ali:
df = df.mask(df.isin(['+unend','n. def.']), 0).astype(float)
For substrings with defined values:
There are special regex characters + and ., so they need to be escaped with \:
df = df.mask(df.astype(str).apply(lambda x: x.str.contains(r'(\+unend|n\. def\.)')), 0).astype(float)
Or use applymap for an elementwise check:
df = df.mask(df.applymap(lambda x: (str('n. def.') in (str(x)) or (str('unend') in str(x))) ), 0).astype(float)
print (df)
col1 col2 col3
time
05/04/2018 05:14:52 AM 0.000 0.0 0.000
05/04/2018 05:14:57 AM 0.000 0.0 0.000
05/04/2018 05:15:02 AM 30.691 0.0 0.121
05/04/2018 05:15:07 AM 30.691 0.0 0.108
05/04/2018 05:15:12 AM 30.715 0.0 0.105
Do not use pd.Series.str.contains or pd.Series.isin
A more efficient solution to this problem is to use pd.to_numeric to try and convert all the data to numeric.
Use errors='coerce' to default to NaN, which you can then use with pd.Series.fillna.
cols = ['col1', 'col2', 'col3']
df[cols] = df[cols].apply(pd.to_numeric, errors='coerce').fillna(0)
