What I have:
df = pd.DataFrame(data = ["version11.11","version2.2","version3"], columns=["software_version"])
Index software_version
0 version11.11
1 version2.2
2 version3
What I am trying to do:
Detect the type of the second-to-last character in the dataframe column called software_version and create a new column in the dataframe based on that condition.
If the second-to-last character is a digit or a letter, keep the whole name minus the last character, so version11.11 becomes version11.1 and version3 becomes version. If it is a decimal point, cut off everything from the decimal point onward, so version2.2 becomes version2.
Output Should be:
Index software_version main_software
0 version11.11 version11.1
1 version2.2 version2
2 version3 version
What I did so far:
How can I cleanly add the main_software column shown above?
import pandas as pd
df = pd.DataFrame(data = ["version11.11","version2.2","version3"], columns=["software_version"])
for name in df.software_version:
    if name[-2].isalnum():
        print(name[:-1])
    elif name[-2] == ".":
        print(name[:-2])
    else:
        print("!Alphanum-dot")
You can first define a function that makes the necessary changes to the string.
def GetMainSoftware(string):
    new_string = string[:-1]  # first remove the last character
    if new_string[-1] == ".":  # if "." is now last, remove that too
        return new_string[:-1]
    else:
        return new_string
And then use apply on the dataframe to create a new column with it.
df["main_software"] = df.apply(lambda row: GetMainSoftware(row["software_version"]), axis=1)
df will now be:
software_version main_software
0 version11.11 version11.1
1 version2.2 version2
2 version3 version
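For reference, the same result can be obtained without apply by chaining vectorized string methods (a sketch: .str[:-1] drops the last character, and rstrip(".") removes a trailing dot if one is left behind, which matches the rule here since at most one can remain):
df["main_software"] = df["software_version"].str[:-1].str.rstrip(".")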
Related
I have a dataframe like this
Index  Identifier
0      10769289.0
1      1082471174.0
The "Identifier column is a string column" and I need to remove the ".0"
I'm using the following code:
Dataframe["Identifier"] = Dataframe["Identifier"].replace(regex=['.0'],value='')
But I got this:
Index  Identifier
0      769289
1      82471174
As you can see it removed more than just the ".0". I also tried to use
Dataframe["Identifier"] = Dataframe["Identifier"].str.replace(".0", "")
but I got the same result.
The dot (.) in a regex matches any character, so you have to escape the decimal point. Otherwise the pattern replaces any character followed by a zero, which in your case means the 10 at the beginning of 10769289.0 and 1082471174.0 as well as the .0 at the end of each number. By escaping the decimal point, it will only match a literal .0, which is what you intended. (Your str.replace attempt gave the same result because older pandas versions treated the pattern as a regex by default; from pandas 2.0 onward you must pass regex=True to get regex behavior.)
import pandas as pd
# Create the dataframe as per the example
Dataframe = pd.DataFrame({"Index": [0,1], "Identifier": ['10769289.0', '1082471174.0']})
# Replace the decimal and the zero at the end of each Identifier.
Dataframe["Identifier"] = Dataframe["Identifier"].str.replace("\.0", "")
# Print the dataframe
print(Dataframe)
OUTPUT:
Index Identifier
0 0 10769289
1 1 1082471174
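If a .0 could also occur in the middle of an identifier, anchoring the pattern to the end of the string is a safer variation on the same idea:
Dataframe["Identifier"] = Dataframe["Identifier"].str.replace(r"\.0$", "", regex=True)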
I'm trying to do some data cleaning using pandas. Imagine I have a data frame with a column called "Number" containing values like "1203.10", "4221", "3452.11", etc. I want to add an "M" before the numbers that have a decimal point and a zero at the end. For this example, that would mean turning "1203.10" into "M1203.10".
I know how to obtain a data frame containing the numbers with a point and ending with zero.
Suppose the data frame is called "df".
pointzero = '[0-9]+[.][0-9]+[0]$'
pz = df[df.Number.str.match(pointzero)]
But I'm not sure how to add the "M" at the beginning once I have "pz". The only way I know is using a for loop, but I think there is a better way. Any suggestions would be great!
You can use boolean indexing:
pointzero = '[0-9]+[.][0-9]+[0]$'
m = df.Number.str.match(pointzero)
df.loc[m, 'Number'] = 'M' + df.loc[m, 'Number']
Alternatively, using str.replace and a slightly different regex:
pointzero = '([0-9]+[.][0-9]+[0]$)'
df['Number'] = df['Number'].str.replace(pointzero, r'M\1', regex=True)
Example:
Number
0 M1203.10
1 4221
2 3452.11
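A third, equivalent spelling uses numpy.where, in case that reads more naturally (a sketch, reusing the pattern and df from above):
import numpy as np

m = df['Number'].str.match(pointzero)
df['Number'] = np.where(m, 'M' + df['Number'], df['Number'])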
You should include a reproducible DataFrame or Series example in your question. For example:
s1 = pd.Series(["1203.10", "4221","3452.11"])
s1
0    1203.10
1       4221
2    3452.11
dtype: object
str.contains + boolean masking
cond1 = s1.str.contains('[0-9]+[.][0-9]+[0]$')
s1.mask(cond1, 'M'+s1)
output:
0 M1203.10
1 4221
2 3452.11
dtype: object
I have a dataframe that contains numbers represented as strings which uses the comma separator (e.g. 150,000). There are also some values that are represented by "-".
I'm trying to convert all the numbers that are represented as strings into a float number. The "-" will remain as it is.
My current code uses a for loop to iterate over each column and row to check whether each cell has a comma. If so, it removes the comma and then converts the value to a number.
This works fine most of the time except some of the dataframes have duplicated column names and that's when it falls apart.
Is there a more efficient way of doing this update (i.e. not using loops) and also avoid the problem when there are duplicated column names?
Current code:
for col in statement_df.columns:
    row = 0
    while row < len(statement_df.index):
        row_name = statement_df.index[row]
        if statement_df[col][row] == "-":
            # do nothing
            print(statement_df[col][row])
        elif statement_df[col][row].find(",") >= 0:
            # statement_df.loc[col][row] = float(statement_df[col][row].replace(",", ""))
            x = float(statement_df[col][row].replace(",", ""))
            statement_df.at[row_name, col] = x
            print(statement_df[col][row])
        else:
            x = float(statement_df[col][row])
            statement_df.at[row_name, col] = x
            print(statement_df[col][row])
        row = row + 1
Use str.replace(',', '') on the column itself
For a dataframe like below
Name Count
Josh 12,33
Eric 24,57
Dany 9,678
apply it like this:
df['Count'] = df['Count'].str.replace(',', '')
df
It will give you the following output
Name Count
0 Josh 1233
1 Eric 2457
2 Dany 9678
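The replacement above only strips the commas. To also handle the float conversion, the "-" values, and the duplicated column names from the question, an element-wise pass over the whole frame avoids label-based indexing entirely. A sketch, assuming every cell is a string (applymap was renamed DataFrame.map in pandas 2.1):
def to_number(x):
    # keep "-" untouched; otherwise strip commas and convert to float
    return x if x == "-" else float(x.replace(",", ""))

statement_df = statement_df.applymap(to_number)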
You can use iloc for that:
for idx in range(len(df.columns)):
    df.iloc[:, idx] = df.iloc[:, idx].apply(your_function)
The code in your_function should handle a single cell's value. For example, to match the question's requirements (keep "-" as-is, strip commas, convert the rest to float):
def your_function(x):
    if x == "-":
        return x  # keep the dash untouched
    return float(x.replace(",", ""))
I have a pandas dataframe with a column called 'picture'; that column has values that either start with a number or letter. What I'm trying to do is create a new column that checks whether or not the value starts with a letter or number, and populate that new column accordingly. I'm using np.where, and my code is below (raw_master is the dataframe, 'database' is the new column):
def iaps_or_naps(x):
    if x in ["1","2","3","4","5","6","7","8","9"]:
        return True
    else:
        return False

raw_master['database'] = np.where(iaps_or_naps(raw_master.picture[?][0])==True, 'IAPS', 'NAPS')
My issue is that if I just do raw_master.picture[0], that checks the value of the entire string, which is not what I need. I need the first character; however, if I do raw_master.picture[0][0], that will just evaluate to the first character of the first row for the whole dataframe. BTW, the question mark just means I'm not sure what to put there.
How can I get it so it takes the first character of the string for every row?
Thanks so much!
You don't need to write your own function for this. Take this small df as an example:
s = pd.DataFrame(['3asd', 'asd', '3423', 'a123'])
looks like:
0
0 3asd
1 asd
2 3423
3 a123
using a pandas builtin:
# checking first column, s[0], first letter, str[0], to see if it is digit.
# if so, assigning IAPS, if not, assigning NAPS
s['database'] = np.where(s[0].str[0].str.isdigit(), 'IAPS', 'NAPS')
output:
0 database
0 3asd IAPS
1 asd NAPS
2 3423 IAPS
3 a123 NAPS
Applying this to your dataframe:
raw_master['database'] = np.where(raw_master['picture'].str[0].str.isdigit(), 'IAPS', 'NAPS')
IIUC you can just test whether the first char is a number using pd.to_numeric:
np.where(pd.to_numeric(df['your_col'].str[0], errors='coerce').notnull(),
         'IAPS',   # first character is a number
         'NAPS')   # first character is not a number
Note the .notnull(): pd.to_numeric with errors='coerce' turns non-numeric characters into NaN, so a non-null result means the first character is a digit.
You could use a mapping function such as apply, which iterates over each element in the column; that way the first character is accessible with plain indexing [0]:
df['new_col'] = df['picture'].apply(lambda x: 'IAPS' if x[0].isdigit() else 'NAPS')
I have 2 large dataframes I want to compare against each other.
I have .split(" ") one of the columns and placed the result in a new column of the dataframe.
I now want to check and see if a value exists in that new column, instead of using a .contains() in the original column, to avoid the value getting picked up within a word.
Here is what I've tried and why I'm frustrated.
row['company'][i]                                   # == 'nom'
L_df['Name split'][7126853]                         # == "['nom', '[this', 'is', 'nom]']"
row['company'][i] in L_df['Name split'][7126853]    # == True (this is the index where I know the specific value occurs)
row['company'][i] in L_df['Name split']             # WHAAT? == False (my attempt to check the entire column); why is this False when I've shown it exists?
L_df[L_df['Name split'].isin([row['company'][i]])]  # == [empty]
edit: I should additionally add that I am trying to set up a process where I can iterate to check entries in the smaller dataset against the larger one.
result = L_df[  # The [9] is a placeholder for our iterable 'i' that will go row by row
    L_df['Company name'].str.contains(row['company'][i], na=False)  # Can be difficult with names like 'Nom'
    # (row['company'][i] in L_df['Name split'])
    & L_df['Industry'].str.contains('marketing', na=False)  # Unreliable currently, need to get looser matches; min. reduction
    & L_df['Locality'].str.contains(row['city'][i], na=False)  # Reliable, but not always great at reducing results
    & (row['workers'][i] >= L_df['Emp Lower bound']) & (row['workers'][i] <= L_df['Emp Upper bound'])  # Unreliable
]
The first condition is what I am trying to replace with this new process, so I don't get matches when 'nom' appears in the middle of words.
Here is a solution that first merges the two dataframes into one and then uses a lambda to process the columns of interest. The result is placed in a new column, found. (Incidentally, this is why your column-wide check returned False: in on a pandas Series tests membership in the index, not in the values.)
import pandas as pd

df1 = pd.DataFrame(data={'company': ['findme', 'asdf']})
df2 = pd.DataFrame(data={'Name split': ["here is a string including findme and then some".split(" "),
                                        "something here".split(" ")]})
combined_df = pd.concat([df1, df2], axis=1)
combined_df['found'] = combined_df.apply(lambda row: row['company'] in row['Name split'], axis=1)
Result:
company Name split found
0 findme [here, is, a, string, including, findme, and, ... True
1 asdf [something, here] False
EDIT:
In order to compare each value from the company column to every cell in the Name split column in the other dataframe, and to have access to the whole row from the latter dataframe, I would simply loop through both dataframes row by row, see here:
df1 = pd.DataFrame(data={'company': ['findme', 'asdf']})
df2 = pd.DataFrame(data={'Name split': ["random text".split(" "),
                                        "here is a string including findme and then some".split(" "),
                                        "somethingasdfq here".split(" ")],
                         'another column': [3, 1, 2]})

for index1, row1 in df1.iterrows():
    for index2, row2 in df2.iterrows():
        if row1['company'] in row2['Name split']:
            # do something here with row2
            print(row2)
Probably not very efficient, but it could be improved by breaking out of the inner loop as soon as a match is found, if we only need one match.
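For large dataframes, a vectorized alternative is to explode the list column and merge on it (a sketch reusing df1 and df2 from above; each matching row of df2 comes back joined to its company):
exploded = df2.explode('Name split')
matches = df1.merge(exploded, left_on='company', right_on='Name split')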