Delete row based on value in any column of the dataframe - python

There are several posts on how to drop rows if one column in a dataframe holds a certain undesired string, but I am struggling with how to do that if I have to check all columns in a dataset for that string, AND if I do not know beforehand exactly which column contains the string.
Suppose:
import numpy as np
import pandas as pd

data = pd.DataFrame({'col1': ['December 31,', 'December 31, 2019', 'countryB', 'countryC'],
                     'col2': ['December 31,', 21, 19, 18],
                     'col3': [np.NaN, 22, 23, 14]})
Which gives:
col1 col2 col3
0 December 31, December 31, NaN
1 December 31, 2019 21 22.0
2 countryB 19 23.0
3 countryC 18 14.0
I want to delete all rows that contain December 31, unless it is followed by a year in YYYY format. I use a regex for that: r'Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec(?!.*\d{4})', which properly identifies the bare December 31, only.
The problem is that I have many of such tables, and I do not know beforehand in which column the December 31, (or its equivalent for other months) appears.
What I currently do is:
import re
from statistics import mean  # assumption: mean here comes from the statistics module

delete = pd.DataFrame(columns=data.columns)
for name, content in data.iteritems():
    take = data[data[name].astype(str).str.contains(
        r'Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec(?!.*\d{4})',
        regex=True,
        flags=re.IGNORECASE | re.DOTALL, na=False)]
    delete = delete.append(take)
delete = delete.drop_duplicates()
index = mean(delete.index)
clean = data.drop([index])
Which returns, as desired:
col1 col2 col3
1 December 31, 2019 21 22.0
2 countryB 19 23.0
3 countryC 18 14.0
That is, I loop over all columns in data, store in delete the rows that I want to delete from data, delete duplicates (because December 31, appears in col1 and col2), get the index of the unique undesired row (0 here) and then delete that row in data based on the index. It does work, but that seems like a cumbersome way of achieving this.
I am wondering: Is there a better way of deleting all rows in which December 31, appears in any column?

import re

data[~data.apply(lambda row: any(re.match('December 31,$', str(y)) for y in row), axis=1)]
You can use the .apply method to filter rows like this.
Doesn't the r'December 31,$' regex work for your case? $ matches the end of the string. If not, just replace it with your working regex.

Use pd.DataFrame.any(...)
mask = data.astype(str).apply(lambda x: x.str.contains(
    r'Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec(?!.*\d{4})',
    regex=True, flags=re.IGNORECASE | re.DOTALL, na=False)).any(axis=1)
data.loc[~mask]
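For completeness, a minimal runnable sketch of this masking approach on the sample data from the question (re.IGNORECASE | re.DOTALL is assumed to be the intended flag combination):
import re
import numpy as np
import pandas as pd

data = pd.DataFrame({'col1': ['December 31,', 'December 31, 2019', 'countryB', 'countryC'],
                     'col2': ['December 31,', 21, 19, 18],
                     'col3': [np.NaN, 22, 23, 14]})

pattern = r'Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec(?!.*\d{4})'

# True for every row in which any column matches the pattern.
mask = data.astype(str).apply(
    lambda col: col.str.contains(pattern, regex=True,
                                 flags=re.IGNORECASE | re.DOTALL, na=False)).any(axis=1)

print(data.loc[~mask])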

Related

How to use pandas to read a .csv file and skip rows based on keywords?

Right now I am parsing my file by using skiprows, but skiprows is unreliable because the data can change. I want to skip rows based on keywords such as "Ferrari, Apple, Baseball". How can I accomplish this? Could you please provide examples?
EDIT: If possible, another solution that could work better for me is to skip n rows in the beginning and then stop reading values in the columns after a BLANK entry is reached. Is this possible?
import pandas as pd
import pyodbc
df = pd.read_csv(r'C://mycsvfile.csv', skiprows=[3,108,109,110,111,112,114,115,116,118])
"""
Step 2 Specify columns we want to import
"""
columns = ['Run Date','Action','Symbol','Security Description','Security Type','Quantity','Price ($)','Commission ($)','Fees ($)','Accrued Interest ($)','Amount ($)','Settlement Date']
df_data = df[columns]
records = df_data.values.tolist()
print(df)
You can try parsing every column, finding the keyword you need, and dropping the rows that contain it:
df = df[df["Run Date"].str.contains("Ferrari") == False]
Then make it a loop over every column, as sketched below.
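A minimal sketch of that loop, checking every column of the frame produced by read_csv above (the pattern building and column handling here are illustrative assumptions, not from the question):
import pandas as pd

keywords = ["Ferrari", "Apple", "Baseball"]
pattern = "|".join(keywords)

keep = pd.Series(True, index=df.index)
for col in df.columns:
    # Drop a row as soon as any of its columns contains one of the keywords.
    keep &= ~df[col].astype(str).str.contains(pattern, case=False, na=False)
df = df[keep]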
There are a few ways to do it. Here's my solution:
1. Make all keywords lower case to eliminate case sensitivity
2. Define which columns need to be checked for the keywords (this could be altered to check all columns if needed)
3. Concatenate those columns so they are all checked at once, rather than iterating through each one
4. Make the cells all lower case (see 1)
5. Keep rows that do not contain a keyword
Code:
import pandas as pd

df = pd.DataFrame([['I love apples.', '', 1, 'Jan 1, 2021'],
                   ['Apple is tasty.', 'Ferrari', 2, 'Jan 2, 2022'],
                   ['This does not contain a keyword', 'Nor does this.', 15, 'Mar 1, 2021'],
                   ['This row is ok', 'But it has baseball in it.', 34, 'Feb 1, 2021']],
                  columns=['A', 'B', 'Value', 'Date'])

keywords = ['Ferrari', 'Apple', 'Baseball']
keywords = '|'.join(keywords)
keywords = keywords.lower()

columns_to_check = ['A', 'B', 'Value']
df = df[~df[columns_to_check].astype(str).sum(1).str.lower().str.contains(keywords)]
Input:
print(df.to_string())
A B Value Date
0 I love apples. 1 Jan 1, 2021
1 Apple is tasty. Ferrari 2 Jan 2, 2022
2 This does not contain a keyword Nor does this. 15 Mar 1, 2021
3 This row is ok But it has baseball in it. 34 Feb 1, 2021
Output:
print(df.to_string())
A B Value Date
2 This does not contain a keyword Nor does this. 15 Mar 1, 2021
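The EDIT in the question (stop reading once a blank entry is reached) isn't covered by the answers above; one possible sketch, assuming a "BLANK entry" shows up as a completely empty row after parsing:
import pandas as pd

df = pd.read_csv(r'C://mycsvfile.csv', skiprows=3)  # skip n rows at the start, here n=3
blank = df.isna().all(axis=1)                       # rows where every cell is NaN/empty
if blank.any():
    df = df.iloc[:blank.values.argmax()]            # keep only the rows before the first blank row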

Create a dataframe from multiple list of dictionary values

I have a code as below,
safety_df = {}
for key3, safety in analy_df.items():
    safety = pd.DataFrame({"Year": safety['index'],
                           '{}'.format(key3) + "_CR": safety['CURRENT'],
                           '{}'.format(key3) + "_ICR": safety['ICR'],
                           '{}'.format(key3) + "_D/E": safety['D/E'],
                           '{}'.format(key3) + "_D/A": safety['D/A']})
    safety_df[key3] = safety
In this code I'm extracting values from another dictionary. It loops through the various companies, which is why I name the columns using format with the key. The output contains the above five columns for each company (Year, CR, ICR, D/E, D/A).
The output that gets printed contains plenty of NA values.
What I want is one common Year column for all companies, followed by the other columns side by side: C1_CR, C2_CR, C3_CR, C1_ICR, C2_ICR, C3_ICR, ... C3_D/A.
I tried to combine the frames with the following code:
pd.concat(safety_df.values())
This stacks the values from each frame, but NA values appear, presumably because of the for loop?
I also tried groupby, but that did not work either.
How can I set Year as the common column and print the other values side by side?
Thanks
Use axis=1 to concatenate along the columns:
import numpy as np
import pandas as pd

years = np.arange(2010, 2021)
n = len(years)
c1 = np.random.rand(n)
c2 = np.random.rand(n)
c3 = np.random.rand(n)

frames = {
    'a': pd.DataFrame({'year': years, 'c1': c1}),
    'b': pd.DataFrame({'year': years, 'c2': c2}),
    'c': pd.DataFrame({'year': years[1:], 'c3': c3[1:]}),
}

for key in frames:
    frames[key].set_index('year', inplace=True)

df = pd.concat(frames.values(), axis=1)
print(df)
which results in
c1 c2 c3
year
2010 0.956494 0.667499 NaN
2011 0.945344 0.578535 0.780039
2012 0.262117 0.080678 0.084415
2013 0.458592 0.390832 0.310181
2014 0.094028 0.843971 0.886331
2015 0.774905 0.192438 0.883722
2016 0.254918 0.095353 0.774190
2017 0.724667 0.397913 0.650906
2018 0.277498 0.531180 0.091791
2019 0.238076 0.917023 0.387511
2020 0.677015 0.159720 0.063264
Note that I have explicitly set the index to be the 'year' column, and in my example, I have removed the first year from the 'c' column. This is to show how the indices of the different dataframes are matched when concatenating. Had the index been left to its standard value, you would have gotten the years out of sync and a NaN value at the bottom of column 'c' instead.
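Applied to the question's safety_df, the same idea would look roughly like this (a sketch, assuming each per-company frame has the 'Year' column built above):
import pandas as pd

aligned = {key: frame.set_index('Year') for key, frame in safety_df.items()}
result = pd.concat(aligned.values(), axis=1).reset_index()
print(result)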

Pandas dataframe.set_index() deletes previous index and column

I just came across a strange phenomenon with Pandas DataFrames, when setting index using DataFrame.set_index('some_index') the old column that was also an index is deleted! Here is an example:
import pandas as pd
df = pd.DataFrame({'month': [1, 4, 7, 10],'year': [2012, 2014, 2013, 2014],'sale':[55, 40, 84, 31]})
df_mn=df.set_index('month')
>>> df_mn
sale year
month
1 55 2012
4 40 2014
7 84 2013
10 31 2014
Now I change the index to year:
df_mn.set_index('year')
sale
year
2012 55
2014 40
2013 84
2014 31
.. and the month column was removed along with the index. This is very irritating, because I just wanted to swap the DataFrame index.
Is there a way to not have the previous column that was an index from being deleted? Maybe through something like: DataFrame.set_index('new_index',delete_previous_index=False)
Thanks for any advice
You can do the following
>>> df_mn.reset_index().set_index('year')
month sale
year
2012 1 55
2014 4 40
2013 7 84
2014 10 31
The solution I found to retain the previous column is to set drop=False:
dataframe.set_index('some_column', drop=False). This is not the perfect answer but it works!
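For the question's df, a quick sketch of the drop=False variant; 'year' then stays available as a regular column even after it becomes the index:
df_yr = df.set_index('year', drop=False)
print(df_yr.columns.tolist())  # ['month', 'year', 'sale'] -- 'year' is kept as a column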
No, in such cases you have to save your previous column, as shown below:
import pandas as pd

df = pd.DataFrame({'month': [1, 4, 7, 10], 'year': [2012, 2014, 2013, 2014], 'sale': [55, 40, 84, 31]})
df_mn = df.set_index('month')
df_mn['month'] = df_mn.index  # save it as another column, then run set_index with the year column
df_mn.set_index('year')
Besides, you are working on a separate dataframe df_mn, so the original dataframe df remains unchanged and you can use it again.
Also, if you don't set the inplace argument of set_index to True, df_mn itself won't change even after you call set_index() on it.
And, like the other answer says, you can always use reset_index().
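To illustrate that last point, a small sketch:
df_mn.set_index('year')                # returns a new DataFrame; df_mn is left unchanged
df_mn.set_index('year', inplace=True)  # modifies df_mn itself (and returns None)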

grouping and reordering by partial identifiers in python

I have data from a csv that produces a dataframe that looks like the following:
d = {"clf_2007": [20],
"e_2007": [25],
"ue_2007": [17],
"clf_2008": [300],
"e_2008": [20],
"ue_2008": [10]}
df = pd.DataFrame(d)
which produces a data frame (forgive me for not knowing how to properly code that into stackoverflow)
clf_2007 clf_2008 e_2007 e_2008 ue_2007 ue_2008
0 20 300 25 20 17 10
I want to manipulate that data to produce something that looks like this:
clf e ue
2007 20 25 17
2008 300 20 10
2007 and 2008 in the original column names represent dates, but they don't need to be datetime now. I need to merge them with another dataframe that has the same "dates" eventually, but I can figure that out later.
Thus far, I've tried groupby, including grouping by string slices of the column names (like str[:8]), and, apart from it not working, I don't even think groupby is the right tool. I've also tried pd.PeriodIndex, but, again, that doesn't seem like the right tool to me.
Is there a standardized way to do something like this? Or is the brute force way (get it into an excel spreadsheet and just move the data around manually), the only way to get what I'm looking for here?
I think this will be a lot easier if you pre-process your data to have three columns: key, year and value. Something like:
rows = []
for k, v in d.items():
    key, year = k.split("_")
    for val in v:
        rows.append({'key': key, 'year': year, 'value': val})
Put those rows into a dataframe, call it dfA. I'm assuming you might have more than one value for each (key, year) pair and you want to aggregate them somehow. I'll assume you do that and end up with a dataframe called df, whose columns are still key, year, and value. At that point, you just need to pivot:
pd.pivot_table(df,index=['year'], columns=['key'])
You end up with multi-indexed rows/columns that you'll want to clean up, but I'll leave that to you.
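Putting the two steps together, a minimal end-to-end sketch for the question's data (with only one value per (key, year) pair, the default mean aggregation is trivial):
import pandas as pd

d = {"clf_2007": [20], "e_2007": [25], "ue_2007": [17],
     "clf_2008": [300], "e_2008": [20], "ue_2008": [10]}

rows = []
for k, v in d.items():
    key, year = k.split("_")
    for val in v:
        rows.append({'key': key, 'year': year, 'value': val})

dfA = pd.DataFrame(rows)
out = pd.pivot_table(dfA, index=['year'], columns=['key'], values='value')
print(out)
# key     clf     e    ue
# year
# 2007   20.0  25.0  17.0
# 2008  300.0  20.0  10.0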
You can generate a column multiindex:
df.columns = pd.MultiIndex.from_tuples([col.split("_") for col in df])
print(df.columns)
# clf e ue
# 2007 2008 2007 2008 2007 2008
And then stack the table:
df = df.stack()
print(df)
# clf e ue
#0 2007 20 25 17
# 2008 300 20 10
You can optionally flatten the index, too:
df.index = df.index.get_level_values(1)
print(df)
# clf e ue
#2007 20 25 17
#2008 300 20 10

Reindexing only valid with uniquely valued Index objects: Pandas DataFrame Panel

I am trying to average each cell of a bunch of .csv files to export as a single averaged .csv file using Pandas.
I have no problems creating the dataframe itself, but when I try to turn it into a Panel (i.e. panel=pd.Panel(dataFrame)), I get the error: InvalidIndexError: Reindexing only valid with uniquely valued Index objects
An example of what each csv file looks like:
Year, Month, Day, Latitude, Longitude, Value1, Value 2
2010, 06, 01, 23, 97, 1, 3.5
2010, 06, 01, 24, 97, 5, 8.2
2010, 06, 01, 25, 97, 6, 4.6
2010, 06, 01, 26, 97, 4, 2.0
Each .csv file is from gridded data so they have the same number of rows and columns, as well as some no data values (given a value of -999.9), which my code snippet below addresses.
The code that I have so far to do this is:
june = []
for csv1 in glob.glob(path + '\\' + '*.csv'):
    if csv1[-10:-8] == '06':
        june.append(csv1)

dfs = {i: pd.DataFrame.from_csv(i) for i in june}
panel = pd.Panel(dfs)
panels = panel.replace(-999.9, np.NaN)
dfs_mean = panels.mean(axis=0)
I have seen questions where the user is getting the same error, but the solutions for those questions don't seem to work with my issue. Any help fixing this, or ideas for a better approach, would be greatly appreciated.
pd.Panel has been deprecated
Use pd.concat with a dictionary comprehension and take the mean over level 1.
from glob import glob
import pandas as pd

df1 = pd.concat({f: pd.read_csv(f) for f in glob('meansample[0-9].csv')})
df1.mean(level=1)
Year Month Day Latitude Longitude Value1 Value 2
0 2010 6 1 23 97 1 3.5
1 2010 6 1 24 97 5 8.2
2 2010 6 1 25 97 6 4.6
3 2010 6 1 26 97 4 2.0
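On newer pandas versions, where the level argument to mean has been removed, the equivalent is a groupby on the index level:
df1.groupby(level=1).mean()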
I have a suggestion to change the approach a bit. Instead of converting the DataFrames into a Panel, just concat them into one big DataFrame, giving each one a unique ID. Afterwards you can group by that ID and use mean() to get the result.
It would look similar to this:
import glob
import numpy as np
import pandas as pd

df = pd.DataFrame()
for csv1 in glob.glob(path + '\\' + '*.csv'):
    if csv1[-10:-8] == '06':
        temp_df = pd.read_csv(csv1)
        temp_df['df_id'] = csv1  # unique ID per source file
        df = pd.concat([df, temp_df])

df = df.replace(-999.9, np.nan)
df = df.groupby("df_id").mean()
I hope it helps somehow; if you still have any issues with that, let me know.
