When I look at the values in a column in my dataframe, I can see that due to user data entry errors, the same category has been entered incorrectly.
For my dataframe I use this code:
df['column_name'].value_counts()
output:
Targeted 523534
targeted 1
story 25425
story 2
multiple 2524543
For story, I guess there is a space?
I am trying to replace targeted with Targeted.
df['column_name'].replace("targeted","Targeted")
But nothing is happening, I still get the same value count.
Yes, it seems there are leading or trailing white-space(s).
You need str.strip first and then Series.replace or Series.str.replace:
df['column_name'] = df['column_name'].str.strip().replace("targeted","Targeted")
df['column_name'] = df['column_name'].str.strip().str.replace("targeted","Targeted")
Another possible solution is to convert all characters to lowercase:
df['column_name'] = df['column_name'].str.strip().str.lower()
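A quick way to confirm the cleanup is to re-run value_counts on the cleaned column. A minimal sketch, reproducing the whitespace and casing issues from the question:
import pandas as pd

# Sample data reproducing the problem: stray whitespace and casing differences
df = pd.DataFrame({'column_name': ['Targeted', 'targeted ', 'story', ' story', 'multiple']})

df['column_name'] = df['column_name'].str.strip().replace('targeted', 'Targeted')
print(df['column_name'].value_counts())
# Targeted    2
# story       2
# multiple    1   (ordering of tied counts may vary)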
Going on two months in Python, and I am focusing hard on pandas right now. In my current position I use VBA on data frames, so I am learning this to slowly replace it and further my career.
As of now I believe my true problem is a lack of understanding of a key concept (or concepts). Any help would be greatly appreciated.
That said here is my problem:
Where could I go to learn more about how to do this kind of precise filtering? I'm very close, but there is one key aspect I'm missing.
Goal(s)
Main goal: I need to skip certain values in my ID column.
The code below takes out the dashes ("-") and keeps only the first 9 digits. Yet I need to skip certain IDs because they are unique.
After that I'll start to work on comparing multiple sheets.
The main data frame's IDs are formatted as 000-000-000-000.
The other data frames that I will compare it to have the IDs with no dashes ("-"), as 000000000, i.e. three fewer digits, totaling nine.
The unique IDs that I need skipped are the same in both data frames, but they are formatted completely differently, ranging from 000-000-000_#12 and 000-000-000_35 to 000-000-000_z.
My code that I will use on each ID except the unique ones:
dfSS["ID"] = dfSS["ID"].str.replace("-", "").str[:9]
but I want to use an if statement like the following (this does not work):
lst = ["000-000-000_#69B", "000-000-000_a", "etc.. random IDs", ]
if ~dfSS["ID"].isin(lst ).any()
dfSS["ID"] = dfSS["ID"].str.replace("-", "").str[:9]
else:
pass
For more clarification my input DataFrame is this:
ID Street # Street Name
0 004-330-002-000 2272 Narnia
1 021-521-410-000_128 2311 Narnia
2 001-243-313-000 2235 Narnia
3 002-730-032-000 2149 Narnia
4 000-000-000_a 1234 Narnia
And I am looking to do this as the output:
ID Street # Street Name
0 004330002 2272 Narnia
1 021-521-410-000_128 2311 Narnia
2 001243313000 2235 Narnia
3 002730032000 2149 Narnia
4 000-000-000_a 1234 Narnia
Notes:
dfSS is my DataFrame variable name, i.e. the Excel file I am using. "ID" is
my column heading; I will make this the index after the fact.
My data frame for this job is small, with (rows, columns) of (2500, 125).
I do not get an error message, so I am guessing maybe I need a loop of some kind. I am starting to test for loops with this as well; no luck there... yet.
Here is where I have been to research this:
Comparison of a Dataframe column values with a list
How to filter Pandas dataframe using 'in' and 'not in' like in SQL
if statement with ~isin() in pandas
recordlinkage module-I didn't think this was going to work
Regular expression operations - Having a hard time fully understanding this at the moment
There are a number of ways to do this. The first way here doesn't involve writing a function.
# Create a placeholder column with all transformed IDs
dfSS["ID_trans"] = dfSS["ID"].str.replace("-", "").str[:9]
dfSS.loc[~dfSS["ID"].isin(lst), "ID"] = dfSS.loc[~dfSS["ID"].isin(lst), "ID_trans"] # conditional indexing
The second way is to write a function that conditionally converts the IDs, and it's not as fast as the first method.
def transform_ID(ID_val):
    if ID_val not in lst:
        return ID_val.replace("-", "")[:9]
    return ID_val  # leave the unique IDs unchanged

dfSS['ID_trans'] = dfSS['ID'].apply(transform_ID)
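For reference, here is a minimal sketch of the first (mask-based) approach applied to the sample frame from the question; the entries in lst are taken from the example rows that should stay untouched:
import pandas as pd

lst = ["021-521-410-000_128", "000-000-000_a"]  # the "unique" IDs to skip
dfSS = pd.DataFrame({"ID": ["004-330-002-000", "021-521-410-000_128",
                            "001-243-313-000", "002-730-032-000", "000-000-000_a"]})

mask = ~dfSS["ID"].isin(lst)
dfSS.loc[mask, "ID"] = dfSS.loc[mask, "ID"].str.replace("-", "").str[:9]
print(dfSS["ID"].tolist())
# ['004330002', '021-521-410-000_128', '001243313', '002730032', '000-000-000_a']
# Note: str[:9] keeps nine digits; drop it if you want the full 12-digit IDs.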
This is based on #xyzxyzjayne's answer, but I have two issues I cannot figure out.
First issue
is that I get this warning (see Edit):
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
Documentation for this warning
You'll see in the code below that I tried to put in .loc, but I can't seem to figure out how to eliminate this warning by using .loc correctly. Still learning it. No, I will not just ignore it even though the code works; this is a learning opportunity, I say.
Second issue
is that I do not understand this part of the code. I know the left side is supposed to be rows, and the right side is columns. That said, why does this work? ID is a column, not a row, when this code is run. (I make the ID the index afterwards.)
df.loc[~df["ID "].isin(uniqueID ), "ID "] = df.loc[~df["ID "].isin(uniqueID ), "Place Holder"]
The area I don't understand yet is the left side of the comma (,) in this part:
df.loc[~df["ID "].isin(uniqueID), "ID "]
That said, here is the final result. Basically, as I said, it's XYZ's help that got me here, but I'm adding more .locs and playing with the documentation until I can eliminate the warning.
# uniqueID is a list of 1000+ IDs I had to enter manually; these IDs get skipped.
uniqueID = ["032-234-987_#4256"]  # example entry; the real list is much longer
# get the columns I need to make the DataFrame smaller
df = df[['ID ', 'Street #', 'Street Name', 'Debris Finish', 'Number of Vehicles',
'Number of Vehicles Removed', 'County']]
# "Place Holder" will be our new column holding the transformed IDs
df.loc[:, "Place Holder"] = df.loc[:,"ID "].str.replace("-", "").str[:9]
# the next line is the filter that goes through the list and skips the unique IDs; work in progress to fully understand
df.loc[~df["ID "].isin(uniqueID ), "ID "] = df.loc[~df["ID "].isin(uniqueID ), "Place Holder"]
#Makes the ID our index
df = df.set_index("ID ")
#just here to add the date to our file name. Must import time for this to work
todaysDate = time.strftime("%m-%d-%y")
#make it an excel file
df.to_excel("ID TEXT " + todaysDate + ".xlsx")
I will edit this once I get rid of the warning and figure out the left side, so I can explain it for everyone who needs/sees this post.
Edit: SettingWithCopyWarning:
I fixed this chained indexing problem by making a copy of the original DataFrame before filtering and making everything .loc, as XYZ helped me with. Before you start to filter, use DataFrame.copy(), where DataFrame is the name of your own dataframe.
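A minimal sketch of that fix, using the same (assumed) column names as the code above:
# Take an explicit copy when narrowing to the columns we need, so the later
# .loc assignments modify an independent frame instead of a view of the original.
df = df[['ID ', 'Street #', 'Street Name']].copy()

df.loc[:, "Place Holder"] = df.loc[:, "ID "].str.replace("-", "").str[:9]
df.loc[~df["ID "].isin(uniqueID), "ID "] = df.loc[~df["ID "].isin(uniqueID), "Place Holder"]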
I am using the code below to remove all non-English characters:
df.text.replace({r'[^\x00-\x7F]+': ''}, regex=True, inplace=True)
where df has a column called text with text in it like below:
text
hi what are you saying?
okay let me know
sounds great, mikey
ok.
right
ご承知のとおり、残念ながら悪質な詐欺が増加しているようですのでお気を付けください。\n
¡Hola miguel! Lamento mucho la confusión cau
expected output:
text
hi what are you saying?
okay let me know
sounds great, mikey
ok.
right
For the rows where my code removes characters,
I want to delete those rows from the df completely: if the code replaces any non-English characters in a row, that row should be dropped entirely, to avoid keeping a row with either 0 characters or a few meaningless leftover characters after the replacement above.
You can use
df[~df['text'].str.contains(r'[^\x00-\x7F]')]
Pandas test:
import pandas as pd
df = pd.DataFrame({'text': ['hi what are you saying?', 'ご承知のとおり、残念ながら悪質な詐欺が増加しているようですのでお気を付けください。'], 'another_col':['demo 1', 'demo 2']})
df[~df['text'].str.contains(r'[^\x00-\x7F]')]
# text another_col
# 0 hi what are you saying? demo 1
Notes:
df['text'].str.contains(r'[^\x00-\x7F]') finds all values in the text column that contain a non-ASCII character (it is our "mask")
df[~...] only keeps those rows that did not match the regex.
str.contains() returns a Series of booleans that we can use to index our frame
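One caveat, in case the text column can contain missing values: str.contains returns NaN for them, and indexing with a mask that contains NaN raises an error. A hedged variant that treats missing text as "no non-ASCII found" (so those rows are kept):
mask = df['text'].str.contains(r'[^\x00-\x7F]', na=False)
df_clean = df[~mask]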
patternDel = "[^\x00-\x7F]"
filter = df['Event Name'].str.contains(patternDel)
I tend to keep the things we want as opposed to deleting rows. Since filter represents the rows we want to delete, we use ~ to get all the rows that don't match and keep them:
df = df[~filter]
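Applied to the sample rows from the question, a quick sketch (reset_index is optional, just to renumber the remaining rows):
import pandas as pd

df = pd.DataFrame({'text': ['hi what are you saying?', 'okay let me know',
                            'ご承知のとおり、残念ながら悪質な詐欺が増加しているようですのでお気を付けください。',
                            '¡Hola miguel! Lamento mucho la confusión cau']})

patternDel = r"[^\x00-\x7F]"
filter = df['text'].str.contains(patternDel)
df = df[~filter].reset_index(drop=True)
print(df)
#                       text
# 0  hi what are you saying?
# 1         okay let me know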
So I'm trying to get two CSVs to be output by splitting the data based upon the date. However, whenever I try to do this I get a syntax error when referencing the column heading.
ufp = pd.read_excel('pythoncsv/C20190723.xlsx')
ufp.head()
"Segmentation Account# Last Name First Name State Phone Alt Phone Language Therapy Last Order Date Next Order Date Customer Type"
Last Order Date is shown as datetime64.
ts = pd.to_datetime('2019-04-25')
So this is the gist of the code I want to work:
ufp.loc[ufp.'Last Order Date' >= ts, :]
but I get:
File "<ipython-input-22-0e4a7bef835a>", line 1
ufp.loc[ufp.'Last Order Date' >= ts, :]
^
SyntaxError: invalid syntax
Is the way I'm phrasing the column wrong? I'm extremely new to Python, so it's possible I just don't get what I'm doing.
ufp.loc[ufp['Last Order Date'] >= ts, :]
This should work
Column names used as attributes of a data frame (the .column_name notation) do not take quotation marks the way they do when referenced with brackets. To access columns this way you would need to remove the spaces from the column labels: you could refer to ufp.Last_Order_Date (after renaming) but not ufp.Last Order Date.
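If you do want attribute access, a small sketch of the rename route (purely optional; the bracket notation above works without it):
# Replace spaces with underscores so the columns can be used as attributes
ufp.columns = ufp.columns.str.replace(' ', '_')
recent = ufp.loc[ufp.Last_Order_Date >= ts, :]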
Try what Ramesh suggested above.
Just for completeness: starting with version 0.25.0, you can also use query with column names that contain spaces by surrounding them in backticks:
ufp.query("`Last Order Date` >= @ts")
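And to come back to the original goal of writing out two CSVs split on that date, a minimal sketch (the output file names are made up):
ts = pd.to_datetime('2019-04-25')

# Rows on/after the cutoff date and rows before it
after = ufp.loc[ufp['Last Order Date'] >= ts]
before = ufp.loc[ufp['Last Order Date'] < ts]

after.to_csv('orders_on_or_after_2019-04-25.csv', index=False)
before.to_csv('orders_before_2019-04-25.csv', index=False)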
I have a set of columns that contain dates (imported from Excel file), and I need to process them as follows:
If a cell in one of those columns is blank, set another column to 1; otherwise set it to 0. This allows me to sum all the 1's and show that those items are missing.
This is how I am doing that at present:
df_combined['CDR_Form_notfound'] = np.where(df_combined['CDR-Form'].mask(df_combined['CDR-Form'].str.len()==0).isnull(),1,0)
A problem I am having is that I have to format those columns so that A) dates are trimmed to show only the day/month/year, and B) some of the columns have a value of "see notes" in them instead of a date or a blank. That "see notes" is essential to properly accounting for missing items; it has to be there to keep the cell from flagging as empty and the item from counting as missing (adding to the 'blank cells' count). The actual problem is that if I run this code before the .isnull code above, every blank becomes a NaN or a nan or a NaT, and then NOTHING flags as null/missing.
This is the code I am using to trim the date strings and change the "see notes" to a string...because otherwise it just ends up blank in the output.
for c in df_combined[dateColumns]:
    df_combined[c] = df_combined[c].astype(str)  # comment this out if the columns are already dtype=str
    # keep only the first 10 characters (day/month/year) of anything that looks like a date
    df_combined[c] = np.where(df_combined[c].str.contains("20"), df_combined[c].str[:10], df_combined[c])
    # keep the literal "see notes" text instead of letting it end up blank
    df_combined[c] = np.where(df_combined[c].str.contains("see notes"), "see notes", df_combined[c])
I think my problem might have something to do with the dtypes of the columns. When I run print(df.dtypes), every column shows as 'object', except for one I specifically set to int using this:
df_combined['Num'] = df_combined['Num'].apply(lambda x: int(x) if x == x else "")
Are you trying to count NaNs?
If yes, you can just do:
df.isnull().sum()  # number of NaN values in each column
I see that you mention "blank" because the data is coming from Excel, so what you could do is transform these blanks into NaN before running the command above, using:
df['CDR-Form'].replace('', np.NaN,inplace=True)
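Putting the two steps together, a rough sketch of how the not-found flag could then be computed (column names are taken from the question; the exact cleaning you need may differ):
import numpy as np

# Turn empty and whitespace-only cells into real NaNs first
df_combined['CDR-Form'] = df_combined['CDR-Form'].replace(r'^\s*$', np.nan, regex=True)

# 1 where the cell is missing, 0 otherwise ("see notes" stays non-null, so it counts as present)
df_combined['CDR_Form_notfound'] = np.where(df_combined['CDR-Form'].isnull(), 1, 0)
print(df_combined['CDR_Form_notfound'].sum())  # total number of missing CDR-Form entries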
I have CSV file which has 3 columns.
Here is what I have to do:
I want to write an if condition (or whatever works) like: if Division == 'core', then I need the count of distinct tags, without redundancy, i.e. two 'sand1' values in the tag column for the core division should be counted only once.
One more if condition like: if Division == 'saturn' or 'core' and type == 'dev', then the same thing: count the number of distinct tags.
Can anyone help me out with this? This was just my idea; any new ideas will be accepted if they satisfy the requirement.
First, load up your data with pandas.
import pandas as pd
dataframe = pd.read_csv(path_to_csv)
Second, format your data properly (you might have lower case/upper case data as in column 'Division' from your example)
for column in dataframe.columns:
    dataframe[column] = dataframe[column].str.lower()  # assumes string columns; use .astype(str) first if needed
If you want to count frequency just by one column you can:
dataframe['Division'].value_counts()
If you want to count by two columns you can:
dataframe.groupby(['Division','tag']).count()
Hope that helps
edit:
While this won't give you just the count for when the two conditions are met, which is what you asked for, it will give you a more 'complete' answer, showing the count for all combinations of the two columns.
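For the specific conditional distinct counts asked about, a rough sketch using nunique (the column names 'Division', 'tag', and 'type' are assumptions based on the question):
# Distinct tags for the core division
core_tags = dataframe.loc[dataframe['Division'] == 'core', 'tag'].nunique()

# Distinct tags where Division is saturn or core AND type is dev
mask = dataframe['Division'].isin(['saturn', 'core']) & (dataframe['type'] == 'dev')
combo_tags = dataframe.loc[mask, 'tag'].nunique()

print(core_tags, combo_tags)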