How to process dataframe without creating NaN's everywhere? - python

I have a set of columns that contain dates (imported from Excel file), and I need to process them as follows:
If a cell in one of those columns is blank, set another column to 1, else that column is 0. This allows me to sum all the 1's and show that those items are missing.
This is how I am doing that at present:
df_combined['CDR_Form_notfound'] = np.where(df_combined['CDR-Form'].mask(df_combined['CDR-Form'].str.len()==0).isnull(),1,0)
The problem is that I also have to format those columns so that A) dates are trimmed to show only the day/month/year and B) the value "see notes", which some columns contain instead of a date or a blank, is preserved. That "see notes" is essential to properly accounting for missing items: it has to be there to keep the cell from flagging as empty and the item from counting as missing (adding to the 'blank cells' count). The actual problem is that if I run this formatting code before the .isnull code above, every blank becomes a NaN, a nan, or a NaT, and then NOTHING flags as null/missing.
This is the code I am using to trim the date strings and change the "see notes" to a string...because otherwise it just ends up blank in the output.
for c in df_combined[dateColumns]:
    df_combined[c] = df_combined[c].astype(str) # uncomment this if columns change from dtype=str
    # keep only the first 10 characters (YYYY-MM-DD) of anything that looks like a date
    df_combined[c] = np.where(df_combined[c].str.contains("20"), df_combined[c].str[:10], df_combined[c])
    # keep "see notes" as a plain string (passing the .str accessor itself is not a value)
    df_combined[c] = np.where(df_combined[c].str.contains("see notes"), "see notes", df_combined[c])
I think my problem might have something to do with the dtypes of the columns. When I run print(df.dtypes), every column shows as 'object', except for one I specifically set to int using this:
df_combined['Num'] = df_combined['Num'].apply(lambda x: int(x) if x == x else "")

Are you trying to count NaNs?
If yes, you can just do:
df.isnull().sum()
which returns the number of missing values in each column.
Since you mention "blank" cells coming from Excel, you could turn those blanks into NaN before running the command above:
df['CDR-Form'] = df['CDR-Form'].replace('', np.nan)
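To make the order of operations concrete, here is a minimal sketch that formats and flags in one pass, so blanks never turn into the string "nan". The sample data and the helper name `tidy_cell` are made up for illustration; only the column names come from the question.

```python
from datetime import datetime

import numpy as np
import pandas as pd

# Hypothetical sample: a real date, an empty string, a missing cell and a note.
df_combined = pd.DataFrame(
    {"CDR-Form": [pd.Timestamp("2021-03-04 10:15:00"), "", None, "see notes"]}
)


def tidy_cell(value):
    """Return NaN for blanks, 'YYYY-MM-DD' for dates, and any other text unchanged."""
    if pd.isna(value) or str(value).strip() == "":
        return np.nan  # blanks stay real missing values, never the string "nan"
    if isinstance(value, datetime):  # pd.Timestamp is a datetime subclass
        return value.strftime("%Y-%m-%d")
    return str(value)  # e.g. "see notes" is kept as text


df_combined["CDR-Form"] = df_combined["CDR-Form"].map(tidy_cell)
# Flag after formatting: 1 where the cell is missing, 0 otherwise.
df_combined["CDR_Form_notfound"] = df_combined["CDR-Form"].isna().astype(int)

print(df_combined)
print("missing items:", df_combined["CDR_Form_notfound"].sum())
```

The key design choice is to avoid `astype(str)` on the whole column, because that converts every missing value into text that `isnull()` no longer recognises.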

Related

How to detect #N/A in a data frame (data taken from xlsx file) using pandas?

The blank cells with no data can be checked with:
if pd.isna(dataframe.loc[index_name, column_name]):
but if the cell has #N/A, neither the above check nor
dataframe.loc[index, column_name] == '#N/A'
works. On reading that cell, it shows NaN, but the checks above do not catch it. My main goal is to capture the release dates and store them in a list.
If you're reading your dataframe tft from a spreadsheet (and that seems to be the case here), you can use the na_values parameter of pandas.read_excel to treat some values (e.g. #N/A) as NaN, like this:
tft = pd.read_excel("path_to_the_file.xlsx", na_values=["#N/A"])
Otherwise, if you want to preserve those #N/A strings, you can select them like this:
tft.loc[tft["Release Date"].eq("#N/A")] # will return a dataframe
In the first scenario, your code would be like this :
rel_date = []
for i in range(len(tft)):
    if pd.isna(tft.loc[i, "Release Date"]):
        continue
    else:
        rel_date.append(int(str(tft.loc[i, "Release Date"]).split()[1]))
However, there is no need for the loop here; you can build the list of release dates like this:
rel_date = (
    tft["Release Date"]
    .str.extract(r"Release (\d{8})", expand=False)
    .dropna()
    .astype(int)
    .drop_duplicates()
    .tolist()
)
print(rel_date)
[20220603, 20220610]
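The two scenarios above can be compared on in-memory data. This is only a sketch: the sample rows are invented, and `pd.read_csv` on a string stands in for `pd.read_excel` so that no file is needed (both readers accept `na_values` and `keep_default_na`).

```python
import io

import pandas as pd

# Hypothetical stand-in for the spreadsheet contents.
raw = "Name,Release Date\nA,Release 20220603\nB,#N/A\nC,Release 20220610\n"

# Default parsing: "#N/A" is among the strings pandas treats as missing.
as_nan = pd.read_csv(io.StringIO(raw))
# keep_default_na=False keeps the literal text "#N/A" instead.
as_text = pd.read_csv(io.StringIO(raw), keep_default_na=False)

nan_rows = as_nan["Release Date"].isna().tolist()
text_rows = as_text["Release Date"].eq("#N/A").tolist()

# Extract the 8-digit release dates, skipping the missing row.
rel_date = (
    as_nan["Release Date"]
    .str.extract(r"Release (\d{8})", expand=False)
    .dropna()
    .astype(int)
    .tolist()
)

print(nan_rows)   # rows seen as NaN with default parsing
print(text_rows)  # rows seen as the string "#N/A" when defaults are disabled
print(rel_date)
```

Which check works therefore depends on how the file was read: `pd.isna` for the first frame, `.eq("#N/A")` for the second.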

After inserting a column into dataframe using Pandas, the first element of the inserted column is deleted

I have a large dataframe and am trying to add a leading (far left, 0th position) column for descriptive purposes. The dataframe and column which I'm trying to insert both have the same number of lines.
The column I'm inserting looks like this:
Description 1
Description 2
Description 3
.
.
.
Description n
The code I'm using to attach the column is:
df.insert(loc=0, column='description', value=columnToInsert)
The code I'm using to write to file is:
df.to_csv('output', sep='\t', header=None, index=None)
(Note: I've written to file with and without the "header=None" option, doesn't change my problem)
Now after writing to file, what I end up getting is:
Description 2 E11 ... E1n
Description 3 E21 ... E2n
.
.
.
Description n E(n-1)1... E(n-1)n
NaN En1 ... Enn
So the first element of my descriptive, leading column is deleted, all the descriptions are off by one, and the last row has "not a number" as its description.
I have no idea what I'm doing which might cause this, and I'm not really sure where to start in correcting it.
Figured it out. The issue stemmed from the fact that I had deleted a row from my large dataframe before inserting my descriptive column, which caused the indices to line up improperly.
So now I included the line:
df.reset_index(drop=True, inplace=True)
Everything lines up properly now and no elements are deleted!
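The effect can be reproduced on a small frame. The sketch below uses made-up data; it relies on the fact that `DataFrame.insert` aligns a Series on index labels rather than on position.

```python
import pandas as pd

descriptions = pd.Series(["Description 1", "Description 2", "Description 3"])

# Four rows, then the first one is deleted: the remaining index is 1, 2, 3.
df = pd.DataFrame({"E1": [0, 11, 21, 31], "E2": [0, 12, 22, 32]}).drop(index=0)

# Without resetting the index, labels 1, 2, 3 are looked up in the Series
# (labels 0, 1, 2), so everything shifts by one and the last row gets NaN.
misaligned = df.copy()
misaligned.insert(loc=0, column="description", value=descriptions)

# After reset_index the labels are 0, 1, 2 again and match the Series.
aligned = df.reset_index(drop=True)
aligned.insert(loc=0, column="description", value=descriptions)

print(misaligned)
print(aligned)
```

Passing `descriptions.to_numpy()` instead of the Series would also avoid the alignment, since a plain array carries no index.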

How to modify multiple values in one column, but skip others in pandas python

I'm going on two months in Python and I am focusing hard on pandas right now. In my current position I use VBA on data frames, so I'm learning this to slowly replace it and further my career.
As of now I believe my true problem is a lack of understanding of one or more key concepts. Any help would be greatly appreciated.
That said, here is my problem:
Where could I go to learn more about how to do more precise filtering like this? I'm very close, but there is one key aspect I need.
Goal(s)
Main goal: I need to skip certain values in my ID column.
The code below takes out the dashes "-" and keeps only the first 9 digits. Yet I need to skip certain IDs because they are unique.
After that I'll start to work on comparing multiple sheets.
The main data frame's IDs are formatted as 000-000-000-000.
The other data frames that I will compare it to have no dashes "-" and drop the last three 0's, for a total of nine digits: 000000000.
The unique IDs that I need skipped are the same in both data frames, but are formatted completely differently, for example 000-000-000_#12, 000-000-000_35, or 000-000-000_z.
My code that I will use on each ID except the unique ones:
dfSS["ID"] = dfSS["ID"].str.replace("-", "").str[:9]
but I want to use an if statement like (This does not work)
lst = ["000-000-000_#69B", "000-000-000_a", "etc.. random IDs"]
if ~dfSS["ID"].isin(lst).any():
    dfSS["ID"] = dfSS["ID"].str.replace("-", "").str[:9]
else:
    pass
For more clarification my input DataFrame is this:
   ID                   Street #  Street Name
0  004-330-002-000      2272      Narnia
1  021-521-410-000_128  2311      Narnia
2  001-243-313-000      2235      Narnia
3  002-730-032-000      2149      Narnia
4  000-000-000_a        1234      Narnia
And I am looking to do this as the output:
   ID                   Street #  Street Name
0  004330002            2272      Narnia
1  021-521-410-000_128  2311      Narnia
2  001243313            2235      Narnia
3  002730032            2149      Narnia
4  000-000-000_a        1234      Narnia
Notes:
dfSS is my DataFrame variable name, i.e. the Excel sheet I am using. "ID" is my column heading; I will make it the index after the fact.
My data frame on this job is small, with (rows, columns) of (2500, 125).
I do not get an error message, so I am guessing maybe I need a loop of some kind. I'm starting to test for loops with this as well. No luck there... yet.
Here is where I have been to research this:
Comparison of a Dataframe column values with a list
How to filter Pandas dataframe using 'in' and 'not in' like in SQL
if statement with ~isin() in pandas
recordlinkage module-I didn't think this was going to work
Regular expression operations - Having a hard time fully understanding this at the moment
There are a number of ways to do this. The first way here doesn't involve writing a function.
# Create a placeholder column with all transformed IDs
dfSS["ID_trans"] = dfSS["ID"].str.replace("-", "").str[:9]
dfSS.loc[~dfSS["ID"].isin(lst), "ID"] = dfSS.loc[~dfSS["ID"].isin(lst), "ID_trans"] # conditional indexing
The second way is to write a function that conditionally converts the IDs; it's not as fast as the first method.
def transform_ID(ID_val):
    if ID_val not in lst:
        return ID_val.replace("-", "")[:9]
    return ID_val  # IDs in lst are returned unchanged instead of becoming None
dfSS['ID_trans'] = dfSS['ID'].apply(transform_ID)
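A third option, sketched here as an alternative rather than as part of the original answer, is `Series.where`, which keeps the values where a condition holds and takes replacements everywhere else. The frame reuses the sample rows from the question; the contents of `lst` are assumed for illustration.

```python
import pandas as pd

dfSS = pd.DataFrame(
    {
        "ID": [
            "004-330-002-000",
            "021-521-410-000_128",
            "001-243-313-000",
            "002-730-032-000",
            "000-000-000_a",
        ],
        "Street #": [2272, 2311, 2235, 2149, 1234],
        "Street Name": ["Narnia"] * 5,
    }
)

# Assumed list of unique IDs that must be left untouched.
lst = ["021-521-410-000_128", "000-000-000_a"]

keep = dfSS["ID"].isin(lst)  # True for the IDs to skip
cleaned = dfSS["ID"].str.replace("-", "", regex=False).str[:9]

# Keep the original value where `keep` is True, otherwise use the cleaned one.
dfSS["ID"] = dfSS["ID"].where(keep, cleaned)

print(dfSS)
```

This needs neither a placeholder column nor a Python-level function call per row.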
This is based on @xyzxyzjayne's answer, but I have two issues I cannot figure out.
First issue
is that I get this warning (see Edit):
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
Documentation for this warning
You'll see in the code below that I tried to put in .loc, but I can't seem to find out how to eliminate this warning by using .loc correctly. Still learning it. NO, I will not just ignore it even though it works. This is a learning opportunity, I say.
Second issue
is that I do not understand this part of the code. I know the left side is supposed to be rows, and the right side is columns. That said, why does this work? ID is a column, not a row, when this code is run. I make the ID :
df.loc[~df["ID "].isin(uniqueID ), "ID "] = df.loc[~df["ID "].isin(uniqueID ), "Place Holder"]
The area I don't understand yet is the left side of the comma (,) in this part:
df.loc[~df["ID "].isin(uniqueID), "ID "]
That said, here is the final result. As I said, it's XYZ's help that got me here, but I'm adding more .locs and playing with the documentation until I can eliminate the warning.
import time

# the whole list of IDs to skip goes here (1000+ entries I had to enter manually)
uniqueID = ["032-234-987_#4256"]  # example entry only
# keep only the columns I need, to make the DataFrame smaller
df = df[['ID ', 'Street #', 'Street Name', 'Debris Finish', 'Number of Vehicles',
         'Number of Vehicles Removed', 'County']]
# "Place Holder" is the new column that holds the transformed IDs
df.loc[:, "Place Holder"] = df.loc[:, "ID "].str.replace("-", "").str[:9]
# the filter that goes through the list and skips those IDs. Work in progress to fully understand.
df.loc[~df["ID "].isin(uniqueID), "ID "] = df.loc[~df["ID "].isin(uniqueID), "Place Holder"]
# make the ID our index
df = df.set_index("ID ")
# add the date to our file name (this is what the time import is for)
todaysDate = time.strftime("%m-%d-%y")
# write it to an Excel file
df.to_excel("ID TEXT " + todaysDate + ".xlsx")
I will edit this once I get rid of the warning and figure out the left side, so I can explain it for everyone who needs/sees this post.
Edit: SettingWithCopyWarning:
Fixed this chained-indexing problem by making a copy of the original DataFrame before filtering and using .loc everywhere, as XYZ helped me with. Before you start to filter, use DataFrame.copy(), where DataFrame is the name of your own dataframe.
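Both points can be seen in a small sketch with made-up data (the column is named "ID" without the trailing space here, and the helper name `row_mask` is hypothetical). The expression to the left of the comma is a boolean Series with one True/False per row, so it selects rows even though it is computed from a column; the name to the right of the comma then picks the column to assign to.

```python
import pandas as pd

full = pd.DataFrame(
    {
        "ID": ["004-330-002-000", "000-000-000_a"],
        "Street #": [2272, 1234],
        "County": ["X", "Y"],
    }
)
uniqueID = ["000-000-000_a"]

# .copy() makes df an independent frame, so later assignments are not
# treated as writes to a slice of `full` (the source of the warning).
df = full[["ID", "Street #"]].copy()

# One boolean per row: True where the ID is NOT in the skip list.
row_mask = ~df["ID"].isin(uniqueID)
print(row_mask.tolist())

# Rows where row_mask is True, column "ID": assign the cleaned value.
df.loc[row_mask, "ID"] = df.loc[row_mask, "ID"].str.replace("-", "", regex=False).str[:9]

print(df)
```

Because `df` is a copy, `full` keeps its original IDs.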

Record exact value that caused change in comparison of Row values python

I want to spot the differences between two columns in a pandas' DataFrame
Say I have two columns which I have compared to spot changes. Then I find a row like this in the output: M0089-'WR' --> M0089-'Wx'. This means that the row value was modified from the first to the second. What should I do to record the change in another column? I want 'x' stored in another column, since it is the one that caused the change.
Final_df["Unique"] = Final_df['New_Branching Logic'][~Final_df['New_Branching Logic'].isin(Final_df['Branching Logic'])].drop_duplicates()
Final_df
I have tried this code, but it does not capture the real value that caused the change. I want to create a column with the value that caused the change; in this case, x is what caused the change. HERE IS THE LINK TO TESTDATA SAMPLE TESTData.csv
This function returns all of the characters that are different in the second string (column 'New_Branching Logic'):
def string_comparison(row):
    return [row['New_Branching Logic'][i]
            for i in range(len(row['Branching Logic']))
            if row['Branching Logic'][i] != row['New_Branching Logic'][i]]
Then, you can apply it on every row of the DataFrame:
Final_df["Unique"] = Final_df.apply(string_comparison, axis=1)
When testing on the following DataFrame:
Final_df = pd.DataFrame([["M0089-'WR'","M0089-'Wx'"]],
                        columns=['Branching Logic', 'New_Branching Logic'])
I get this result:
  Branching Logic New_Branching Logic Unique
0      M0089-'WR'          M0089-'Wx'    [x]
In the 'Unique' column is a list of all the different characters.

Loop through column until blank

I am trying to loop through a pandas DataFrame until the columns are blank or do not contain the term 'Stock'. If a column contains a date, I want the word 'check' to be printed.
I am using:
print(df)
  Stock  15/12/2015  15/11/2015  15/10/2015
0    AA          10          11          11
1    BB          20          10           8
2    CC          30          33          26
3    DD          40          80          60
I have tried the below (which is wrong):
column = df
while column != ("") or 'Stock':
print ('Check'),
column += 1
print ("")
There's a few problems in your code. First of all you've screwed up indentation so it's not even valid code.
Second, your comparison is broken because it doesn't mean what you probably expect. column != ("") or 'Stock' will always be true: it first compares column with (""), and if they differ the expression is True; otherwise it evaluates 'Stock' and makes that the value of the expression (and in a boolean context a non-empty string is considered true). What you probably should have written instead is column != "" and column != "Stock", or possibly column not in ("", "Stock").
Then I'm not sure if you're looping the right way or using column the right way either. Is it correct to step to the next by using column += 1? I don't know pandas, but it seems odd. Also, comparing it to a string may be incorrect.
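The behaviour of the comparison discussed above can be checked in plain Python, without pandas. The function names `broken` and `fixed` are made up for this sketch.

```python
def broken(column):
    # Parsed as (column != "") or "Stock": the right operand is a non-empty
    # string, so the whole expression is truthy for every input.
    return column != "" or "Stock"


def fixed(column):
    # True only when column is neither empty nor the word "Stock".
    return column not in ("", "Stock")


samples = ["", "Stock", "15/12/2015"]
for value in samples:
    print(repr(value), "->", repr(broken(value)), "|", fixed(value))
```

For the empty string, `broken` returns the string "Stock" itself rather than a boolean, which is why a `while` loop built on it never stops.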
Your code really needs improvement. You should follow @skyking's advice. I'd like to add that you may want to transpose your dataframe and put the dates in as a variable.
Anyway, let me rephrase what you are looking for, to make sure I got it right: you want to iterate over the columns of your df and for every column which name is a date, you do print('Check'), otherwise nothing happens. Please, let us know if this is wrong.
To achieve that, here is a possible approach. You can iterate over the columns name and attempt to convert the string to a date, for instance, using pd.to_datetime. If successful, it prints a message.
for name in df.columns:
    print(name) # comment this line after testing
    try:
        pd.to_datetime(name)
    except ValueError:
        pass # or do something in case the column name is not a date
    else:
        print('Check')
This outputs
Stock
15/12/2015
Check
15/11/2015
Check
15/10/2015
Check
You can see that Check was only printed when the column name was, at least, coercible into a date.
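If the date layout of the column names is known, a stricter variant is possible. The sketch below assumes the day/month/year layout shown in the question and uses a hypothetical helper, `is_date_label`; with `errors="coerce"`, `pd.to_datetime` returns `NaT` instead of raising when a label does not match.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Stock": ["AA", "BB", "CC", "DD"],
        "15/12/2015": [10, 20, 30, 40],
        "15/11/2015": [11, 10, 33, 80],
        "15/10/2015": [11, 8, 26, 60],
    }
)


def is_date_label(label, fmt="%d/%m/%Y"):
    """Return True when the column label parses as a date in the given format."""
    return pd.notna(pd.to_datetime(label, format=fmt, errors="coerce"))


date_columns = [name for name in df.columns if is_date_label(name)]

for name in date_columns:
    print(name, "Check")
```

Fixing the format avoids accepting labels that only loosely resemble a date.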
