I have a dataframe generated from an unformatted CSV, so I need to reformat some of the data. For example, some values are strings such as 12.323,03, and I'm trying to turn them into 12323.03 so that the string can be converted to a float in Python.
I'm trying to do it like this:
for column in data:
    if(data[column].name != 'blabla' and data[column].name != 'otherblabla'):
        for row_value in data[column]:
            if type(row_value) == str:
                float_format = row_value.replace('.','').replace(',','.')
                row_value = row_value.replace(row_value, float_format)
Here float_format converts the string "12.323,03" to "12323.03".
But the row values are not affected. What am I missing?
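To store the new value, you must assign it to its position in the DataFrame with
df.loc[row_index, column_name] = row_value
To get the row index, use enumerate (this works as long as the DataFrame has the default integer index):
for row_index, row_value in enumerate(data_column):
Here is an example to illustrate it: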
import pandas as pd

df = pd.DataFrame({'A':[1,2,3,4],'B':[5,6,7,8]})
print('Before change')
print(df)
for i,j in enumerate(df['B']):
    if j == 6:
        df.loc[i,'B'] = 4
print('After change')
print(df)
The variable row_value holds a single value of a row/column pair in the original df; it does not reference back to that position in the df. As the other answer has pointed out, with your approach you need to locate the value in order to change the df.
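The difference between rebinding the loop variable and writing back through .loc can be seen in a small sketch. The column name price and the sample values are made up for illustration, and the column is created with dtype=object so that it can hold both strings and floats:

```python
import pandas as pd

# Hypothetical sample data; dtype=object lets the column hold str and float.
data = pd.DataFrame({"price": ["12.323,03", "1,5"]}, dtype=object)

# Rebinding the loop variable only changes the local name, not the DataFrame.
for row_value in data["price"]:
    row_value = row_value.replace(".", "").replace(",", ".")
unchanged = data["price"].tolist()

# Writing back through .loc with the index label changes the DataFrame.
for idx, row_value in list(data["price"].items()):
    data.loc[idx, "price"] = float(row_value.replace(".", "").replace(",", "."))

print(unchanged)
print(data["price"].tolist())
```

Iterating with .items() yields the index label together with the value, so the write-back also works when the index is not the default 0..n-1 range.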
Additionally, the second replace on row_value could be substituted with simply row_value = float_format. Here is also an approach using apply, which I consider cleaner and which you might find useful:
import pandas as pd

df = pd.DataFrame(
    {
        'c1': ['100,12', 1.230, '30.000,4'],
        'c2': ['5.367,46', '10', 7.3],
        'c3': ['a', 'b', 'c']
    }
)
cols = ['c1', 'c2']
for col in cols:
    # Convert strings such as '5.367,46' to floats; leave existing numbers alone
    df[col] = df[col].apply(
        lambda x: float(x.replace('.','').replace(',','.')) if type(x) == str else x
    )
print(df)
print(df.dtypes)
This results in:
c1 c2 c3
0 100.12 5367.46 a
1 1.23 10.00 b
2 30000.40 7.30 c
c1 float64
c2 float64
c3 object
dtype: object
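Since the data originally comes from a CSV file, it may be simpler to let the parser handle the number format while reading. The following is a minimal sketch, assuming a semicolon-separated file in which every number uses "." as the thousands separator and "," as the decimal mark. The sample text and the column names name and amount are invented for illustration; decimal and thousands are existing pd.read_csv parameters.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for the real file.
raw = "name;amount\na;12.323,03\nb;100,12\n"

# decimal and thousands tell the parser how the numbers are written.
data = pd.read_csv(io.StringIO(raw), sep=";", decimal=",", thousands=".")

print(data)
print(data.dtypes)
```

This only helps when the whole column follows the same convention; columns that mix formats still need a per-value conversion like the ones above.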
I need to clean up some data. For items in a dataframe that are of the format '<x' I want to return x/2, so if the cell content is '<10' it should be replaced with 5, if the cell content is '<0.006' it should be replaced with 0.003, etc. I want the changed cells to be formatted red and bold. I have the following code, which operates in two steps. Each step does (almost) what I want, but I get TypeError: 'float' object is not iterable when I try to chain them using: fixed_df=df.style.apply(color_less_than,axis=None).applymap(lessthan)
Note that the actual dataset may be thousands of rows and will contain mixed data. Dummy data and code:
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': ['<10', '20', 'foo', '<30', '40'],
                   'B': ['baz', '<dlkj', 'bar', 'foo', '<5']})

def color_less_than(x):
    c1 = 'color: red; font-weight: bold'
    c2 = ''
    df1 = pd.DataFrame(c2, index=x.index, columns=x.columns)
    for col in x.columns:
        mask = x[col].str.startswith("<")
        #display(mask)
        df1.loc[mask, col] = c1
    return df1

def lessthan(x):
    #for x in df:
    if isinstance(x, np.generic):
        return x.item()
    elif type(x) is int:
        return x
    elif type(x) is float:
        return x
    elif type(x) is str and x[0]=="<":
        try:
            return float(x[1:])/2
        except ValueError:
            return x
    elif type(x) is str and len(x)<10:
        try:
            return float(x)
        except ValueError:
            return x
    else:
        return x

coloured=df.style.apply(color_less_than,axis=None)
halved=df.applymap(lessthan)
display(coloured)
display(halved)
Note that the df item <dlkj does not display at all after applying color_less_than and I don't know why. I want it to be returned unformatted, as it should not be changed (it's a string and can't be 'halved'). I have been trying to use the boolean mask to do both the calculation and the formatting, but I can't get it to work.
This code will loop through the entire dataset and change any value of the form '<' followed by an int or float to half of that number. It also checks whether the value is a string such as '<dlkj', where the color/bold style could then be added to the cell. You might have to test that line of code though; I did not attempt to run it.
for col in df:
    for value in df[col].values:
        if '<' in value:
            num = value.split('<')[1]
            try:
                df[col] = df[col].replace([value], int(num)/2)
            except ValueError:
                try:
                    df[col] = df[col].replace([value], float(num)/2)
                except ValueError:
                    print(num) # <-- should be your '<dlkj' value
                    # not sure if this line of code will work or not, wasnt able to test it
                    #df.style.set_properties(subset=df[col][value],**{'color': 'red', 'font-weight': 'bold'})
Without the style mapping, the desired output DataFrame can be reached like so:
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': ['<10', '20', 'foo', '<30', '40'],
                   'B': ['baz', '<dlkj', 'bar', 'foo', '<5']})
for col in df.columns:
    # True only for '<' followed by an integer or a decimal number
    mask = df[col].str.match('<[0-9]+$|<[0-9]+[.][0-9]+$')
    # Numeric value after the '<'; NaN where it is not a number
    tmp = pd.to_numeric(df[col].str.slice(1), errors='coerce')
    df[col] = np.where(mask, tmp/2, df[col])
print(df)
#       A      B
# 0   5.0    baz
# 1    20  <dlkj
# 2   foo    bar
# 3  15.0    foo
# 4    40    2.5
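The TypeError in the question most likely arises because Styler.applymap expects a function that returns CSS strings, not new cell values, so the halving cannot be chained onto the Styler. One way around this, sketched below, is to compute a single boolean mask first and then use it twice: once to halve the data and once to build the table of CSS strings. The names pattern, numbers, styles and render are illustrative and not from the original posts.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': ['<10', '20', 'foo', '<30', '40'],
                   'B': ['baz', '<dlkj', 'bar', 'foo', '<5']})

# True only where a cell is '<' followed by an integer or decimal number.
pattern = r'<\d+(?:\.\d+)?$'
mask = df.apply(lambda col: col.str.match(pattern))

# Numeric part after the '<' (NaN where it is not a number).
numbers = df.apply(lambda col: pd.to_numeric(col.str.slice(1), errors='coerce'))

# Keep the original cell where the mask is False, otherwise take half the number.
halved = df.astype(object).where(~mask, numbers / 2)

# CSS strings with the same shape as the data, built from the same mask.
styles = pd.DataFrame(np.where(mask, 'color: red; font-weight: bold', ''),
                      index=df.index, columns=df.columns)

def render(data, css):
    # Styler.apply with axis=None expects a DataFrame of CSS strings.
    return data.style.apply(lambda _: css, axis=None)

print(halved)
```

In a notebook, display(render(halved, styles)) should then show the halved values in red and bold. The missing '<dlkj' cell is probably an HTML rendering effect, since the browser reads it as a tag; escaping the output, for example with Styler.format(escape="html") in recent pandas versions, may help.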
I'm having issues building a find-and-replace tool in Python. The goal is to search a column in an Excel file for a string, swap out every letter of the string based on the key-value pairs of a dictionary, and then write the entire new string back to the same cell. So "ABC" should convert to "BCD". I have to find and replace any occurrence of individual characters.
The code below runs without errors, but newvalue never gets created and I don't know why. There are no issues writing data to the cell if newvalue gets created.
input: df = pd.DataFrame({'Code1': ['ABC1', 'B5CD', 'C3DE']})
expected output: df = pd.DataFrame({'Code1': ['BCD1', 'C5DE', 'D3EF']})
mycolumns = ["Col1", "Col2"]
mydictionary = {'A': 'B', 'B': 'C', 'C': 'D'}
for x in mycolumns:
    # 1. If the mycolumn value exists in the headerlist of the file
    if x in headerlist:
        # 2. Get column coordinate
        col = df.columns.get_loc(x) + 1
        # 3. iterate through the rows underneath that header
        for ind in df.index:
            # 4. log the row coordinate
            rangerow = ind + 2
            # 5. get the original value of that coordinate
            oldval = df[x][ind]
            for count, y in enumerate(oldval):
                # 6. generate replacement value
                newval = df.replace({y: mydictionary}, inplace=True, regex=True, value=None)
                print("old: " + str(oldval) + " new: " + str(newval))
                # 7. update the cell
                ws.cell(row=rangerow, column=col).value = newval
            else:
                print("not in the string")
    else:
        # print(df)
        print("column doesn't exist in workbook, moving on")
else:
    print("done")
wb.save(filepath)
wb.close()
I know there's something going on with enumerate and I'm probably not stitching the string back together after I do the replacements. Or maybe a dictionary is the wrong solution for what I am trying to do; the key:value pairs are what led me to use it. I have a little programming background but very little with Python. I appreciate any help.
newvalue never creates and I don't know why.
DataFrame.replace with inplace=True returns None, so newval is always None:
>>> df = pd.DataFrame({'Code1': ['ABC1', 'B5CD', 'C3DE']})
>>> df = df.replace('ABC1','999')
>>> df
Code1
0 999
1 B5CD
2 C3DE
>>> q = df.replace('999','zzz', inplace=True)
>>> print(q)
None
>>> df
Code1
0 zzz
1 B5CD
2 C3DE
>>>
An alternative could be to use str.translate on the column (through its str accessor) to translate the entire Series at once:
>>> df = pd.DataFrame({'Code1': ['ABC1', 'B5CD', 'C3DE']})
>>> mydictionary = {'A': 'B', 'B': 'C', 'C': 'D'}
>>> table = str.maketrans('ABC','BCD')
>>> df
Code1
0 ABC1
1 B5CD
2 C3DE
>>> df.Code1.str.translate(table)
0 BCD1
1 C5DD
2 D3DE
Name: Code1, dtype: object
>>>
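The translation table can also be built directly from the dictionary, since str.maketrans accepts a dict whose keys are single characters. The sketch below assumes the dictionary is extended with 'D' and 'E' entries (not in the original post) so that the result matches the expected output given in the question:

```python
import pandas as pd

df = pd.DataFrame({'Code1': ['ABC1', 'B5CD', 'C3DE']})

# 'D' and 'E' are assumed additions needed to reach the expected output.
mydictionary = {'A': 'B', 'B': 'C', 'C': 'D', 'D': 'E', 'E': 'F'}

# Build the table from the dict and translate every string in one pass.
table = str.maketrans(mydictionary)
df['Code1'] = df['Code1'].str.translate(table)

print(df)
```

Because translate maps each character exactly once, an 'A' that becomes 'B' is not turned into 'C' afterwards, which is what tends to happen when the replacements are applied one after another.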
How do I filter out a row's value in column B if another column, C, has a specific text, say "ABC"? In this case "google.com" would be filtered out.
       A             B    C        D
0   True  facebook.com  kxy    19999
1   True    google.com  ABC    21212
2  False     yahoo.com  PoP  3213231
Every time there is "ABC" in column C, the row's value from column B should be appended to a list.
pseudocode:
dataset = pd.read_csv('xyz.csv')
path = []
for value in dataset.C:
    if dataset['C'] == 'abc':
        #append path with row value of Col B
    else:
        #not append path
path = dataset.loc[dataset.C == 'ABC', 'B'].tolist()
will give you the desired list in one go.
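For a self-contained illustration, the one-liner can be tried on the sample table from the question, rebuilt here by hand. The second variant is an assumption on my part: the pseudocode compares against 'abc' while the data contains 'ABC', so a case-insensitive comparison may be what is wanted.

```python
import pandas as pd

# Sample data copied from the table in the question.
dataset = pd.DataFrame({
    "A": [True, True, False],
    "B": ["facebook.com", "google.com", "yahoo.com"],
    "C": ["kxy", "ABC", "PoP"],
    "D": [19999, 21212, 3213231],
})

# Boolean mask: one True/False per row, no explicit loop needed.
mask = dataset["C"] == "ABC"
path = dataset.loc[mask, "B"].tolist()

# Case-insensitive variant (assumed requirement, see above).
path_ci = dataset.loc[dataset["C"].str.lower() == "abc", "B"].tolist()

print(path)
print(path_ci)
```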
As an alternative, you can use where and list:
path = list(dataset.B.where(dataset.C == 'ABC').dropna())
print(path)
# ['google.com']
inds will be a pandas Series of boolean values, indicating whether a row's value in column 'C' is equal to 'ABC'.
Once we know that, we can subset dataset and take the values of column 'B':
inds = dataset['C'] == 'ABC'
list(dataset.loc[inds, 'B'])
Direct answer:
filtered_values = dataset.loc[dataset["C"]=='ABC']['B'].tolist()
For understanding purposes:
First get the rows where C="ABC"
filtered_rows = dataset.loc[dataset["C"]=='ABC']
filtered_rows
Output:
A B C D
1 True google.com ABC 21212
From these rows, get only the values of column B and convert the resulting Series into a list with the .tolist() method
filtered_values = filtered_rows["B"].tolist()
filtered_values
Output:
['google.com']
I have a dataframe from which I have to drop a row if some of its values match certain entries,
for instance,
x not in ['N/A', ''] where x is a column
Is there a way to do this with something like apply?
df[x] = df[x].apply(lambda x: x.lower())
I am thinking of something like:
df.drop.apply(lambda x: X not in ['N/A', ''])???
My DF
F T l
0 0 "0" "0"
1 1 "" "1"
2 2 "2" ""
drop row if T == "" or l == ""
F T l
0 0 "0" "0"
I could not use
df.drop(df.T == "") since the condition ("") depends on runtime data
If you are looking to remove any row that has 'N/A' or '' anywhere in the row, then you can use a boolean index; just take the inverse of isin(), e.g.:
In []:
df[~df.isin(['N/A', '']).any(axis=1)]
Out[]:
F T l
0 0 0 0
If you need to limit the check to just columns 'T' and 'l', then select them, e.g.:
df[~df[['T', 'l']].isin(['N/A', '']).any(axis=1)]
You could also use a dict with isin(), but that would only be useful if you had different values for the columns, e.g.:
df[~df.isin({'T': ['N/A', ''], 'l': ['']}).any(axis=1)]
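Put together on the sample frame from the question, the approach might look like the sketch below. The list bad_values is an illustrative name; it is an ordinary Python list, so it can be built at runtime, which addresses the concern that the condition depends on runtime data.

```python
import pandas as pd

# Sample frame from the question.
df = pd.DataFrame({"F": [0, 1, 2], "T": ["0", "", "2"], "l": ["0", "1", ""]})

# Values that mark a row for removal; could be computed at runtime.
bad_values = ["N/A", ""]

# True for rows where T or l contains one of the bad values.
has_bad = df[["T", "l"]].isin(bad_values).any(axis=1)

# Keep only the rows without bad values.
cleaned = df[~has_bad]

print(cleaned)
```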
From the following answer, the solution is:
mask = df.pipe(lambda x: (x['T'].isin(['N/A', ''])) | (x['T'].isna()))
df.drop(df[mask].index, inplace=True)
This allows you to provide different lambdas.
In my pandas DataFrame I want to add a new column (NewCol), based on some conditions that follow from data of another column (OldCol).
To be more specific, my column OldCol contains three types of strings:
BB_sometext
sometext1
sometext 1
I want to differentiate between these three types of strings. Right now, I did this using the following code:
df['NewCol'] = pd.Series()
for i in range(0, len(df)):
    if str(df.loc[i, 'OldCol']).split('_')[0] == "BB":
        df.loc[i, 'NewCol'] = "A"
    elif len(str(df.loc[i, 'OldCol']).split(' ')) == 1:
        df.loc[i, 'NewCol'] = "B"
    else:
        df.loc[i, 'NewCol'] = "C"
Even though this code seems to work, I'm sure there is a better way to do something like this, as this seems very inefficient. Does anyone know a better way to do this? Thanks in advance.
In general, you need something like the following formulation:
>>> df.loc[boolean_test, 'NewCol'] = desired_result
Or, for multiple conditions (Note the parentheses around each condition, and the rather unpythonic & instead of and):
>>> df.loc[(boolean_test1) & (boolean_test2), 'NewCol'] = desired_result
Example
Let's start with an example DataFrame:
>>> df = pd.DataFrame(dict(OldCol=['sometext1', 'sometext 1', 'BB_ccc', 'sometext1']))
Then you'd do:
>>> df.loc[df['OldCol'].str.split('_').str[0] == 'BB', 'NewCol'] = "A"
This sets NewCol to A for all BB_ rows. You could even (optionally, for readability) separate out the boolean condition onto its own line:
>>> oldcol_starts_BB = df['OldCol'].str.split('_').str[0] == 'BB'
>>> df.loc[oldcol_starts_BB, 'NewCol'] = "A"
I like this method because it means the reader doesn't have to work out the logic hidden within the split('_').str[0] part.
Then, to set all rows with no space that are still not set (i.e. where isnull is true) to B, as in the question:
>>> oldcol_has_no_space = df['OldCol'].str.find(' ') < 0
>>> newcol_is_null = df['NewCol'].isnull()
>>> df.loc[(oldcol_has_no_space) & (newcol_is_null), 'NewCol'] = 'B'
Then finally, set all remaining values of NewCol to C:
>>> df.loc[df['NewCol'].isnull(), 'NewCol'] = 'C'
>>> df
       OldCol NewCol
0   sometext1      B
1  sometext 1      C
2      BB_ccc      A
3   sometext1      B
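When there are several mutually exclusive cases like this, numpy.select can express them in one step: it takes a list of conditions and a list of results, and the first matching condition wins. The sketch below follows the rules from the question (BB_ prefix gives A, no space gives B, anything else gives C). It is an alternative I am suggesting, not part of the original answer, and it uses startswith("BB_") as a slight simplification of the split-based test.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"OldCol": ["sometext1", "sometext 1", "BB_ccc", "sometext1"]})

# Conditions are checked in order; the first True one decides the result.
starts_bb = df["OldCol"].str.startswith("BB_").to_numpy(dtype=bool)
has_no_space = (~df["OldCol"].str.contains(" ", regex=False)).to_numpy(dtype=bool)

# "A" for the BB_ prefix, "B" for no space, "C" for everything else.
df["NewCol"] = np.select([starts_bb, has_no_space], ["A", "B"], default="C")

print(df)
```

Because the conditions are evaluated in order, there is no need to track which rows are still unset with isnull.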