Remove mismatched data in columns using Pandas - python

How would I remove rows from a dataset where the values in two columns do not match a dictionary? For example, take this snippet of my data set.
I want to remove the rows where the data doesn't match the dictionary, as in row 6, and apply this to a large data set.
dictionary = SiteID, Location [188:A, 203:B, 206:C, 270:D, 463:E]

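Try doing it with merge. An inner merge on the shared SiteID and Location columns keeps only the rows that match the dictionary (a left merge would keep every row of df):
import pandas as pd
d = {188:'A', 203:'B', 206:'C', 270:'D', 463:'E'}
out = df.merge(pd.Series(d).rename_axis('SiteID').to_frame('Location').reset_index(),how='inner')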
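As an alternative to the merge above, the same filter can be sketched with a boolean mask. The sample frame below is made up for illustration (the original data snippet is not shown); only the column names SiteID and Location and the dictionary come from the question.

```python
import pandas as pd

# Hypothetical sample data; the last row's Location does not match the dictionary.
df = pd.DataFrame({
    "SiteID": [188, 203, 206, 270, 463, 188],
    "Location": ["A", "B", "C", "D", "E", "B"],
})
d = {188: "A", 203: "B", 206: "C", 270: "D", 463: "E"}

# Look up the expected Location for each SiteID and keep only the rows that agree.
# SiteIDs missing from the dictionary map to NaN and are therefore dropped too.
mask = df["SiteID"].map(d) == df["Location"]
filtered = df[mask].reset_index(drop=True)
print(filtered)
```

The mask approach avoids building a second frame, which may be convenient on a large data set, though either version should give the same rows here.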

Related

Replace column Values based on Index of other Dataframe

I am trying to replace the Values in the "All Assortment" column of the "buyer" data frame.
I need to replace them with the data from the "All Stores" column of the "asl" data frame. The twist is that the index values of the asl data frame are the values that need to match for the replacement to work.
Hard to say without a minimal reproducible example, but try mapping the values of buyer['All Assortment'] to corresponding values from the asl['All Stores'] column based on the asl index:
buyer['All Assortment'] = buyer['All Assortment'].map(asl['All Stores'])
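Since the question has no sample data, here is a minimal sketch of the mapping with invented values. The frame contents are assumptions; only the names buyer, asl, "All Assortment" and "All Stores" come from the question. It also assumes the asl index values are unique.

```python
import pandas as pd

# Hypothetical frames: asl is indexed by the codes that appear in buyer.
asl = pd.DataFrame({"All Stores": ["S1,S2", "S3"]}, index=["X", "Y"])
buyer = pd.DataFrame({"All Assortment": ["X", "Y", "Z"]})

# Passing a Series to map looks each value up in that Series' index.
mapped = buyer["All Assortment"].map(asl["All Stores"])

# Values with no match in asl.index come back as NaN; keep the original there.
buyer["All Assortment"] = mapped.fillna(buyer["All Assortment"])
print(buyer)
```

Whether unmatched values should be kept, as here, or left as NaN depends on what the replacement is meant to do.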

Aggregate Function to dataframe while retaining rows in Pandas

I want to aggregate my data based off a field known as COLLISION_ID and a count of each COLLISION_ID.
I want to remove repeating COLLISION_IDs since they have the same Coordinates, but retain a count of occurrences in original data-set.
My code is below
df2 = df1.groupby(['COLLISION_ID'])[['COLLISION_ID']].count()
This returns such:
I would like my data returned as the COLLISION_ID numbers, the count, and the remaining columns of my data which are not shown here(~40 additional columns that will be filtered later)
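If you want to filter on the count while keeping every row and column, use transform instead of an aggregation:
df1['count_col']=df1.groupby(['COLLISION_ID'])['COLLISION_ID'].transform('count')
You can then filter df1 on count_col, or drop the repeated COLLISION_IDs with drop_duplicates.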
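A minimal sketch of the transform approach, followed by dropping the repeated IDs. The data is invented, and the LAT column is a stand-in for the roughly 40 additional columns mentioned in the question.

```python
import pandas as pd

# Hypothetical data; LAT stands in for the other columns that should be retained.
df1 = pd.DataFrame({
    "COLLISION_ID": [1, 1, 2, 3, 3, 3],
    "LAT": [40.1, 40.1, 40.2, 40.3, 40.3, 40.3],
})

# transform('count') returns one value per original row, so no rows are lost.
df1["count_col"] = df1.groupby("COLLISION_ID")["COLLISION_ID"].transform("count")

# Keep the first row for each COLLISION_ID together with its count.
df2 = df1.drop_duplicates(subset="COLLISION_ID").reset_index(drop=True)
print(df2)
```

This keeps the first occurrence of each ID, which is reasonable only if the repeated rows really do carry the same values in the other columns.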

Select a subset of an object type cell in panda Dataframe

I am trying to select part of each cell in an object-type column with str.split:
dataset['pictures'].str.split(pat=",")
I want to get the numbers 40092 and 39097 and the two picture dates as two columns, ID and DATE, but as a result I get one column consisting of NaNs.
'pictures' column:
{"col1":"40092","picture_date":"2017-11-06"}
{"col1":"39097","picture_date":"2017-10-31"}
...
Here's what I understood from your question:
You have a pandas DataFrame in which one of the columns contains JSON strings (or any other strings that need to be parsed into multiple columns).
E.g.
df = pd.DataFrame({'pictures': [
'{"col1":"40092","picture_date":"2017-11-06"}',
'{"col1":"39097","picture_date":"2017-10-31"}']
})
You want to parse the two elements ('col1' and 'picture_date') into two separate columns for further processing (or perhaps just one of them)
Define a function for parsing the row:
import json
def parse_row(r):
    j = json.loads(r['pictures'])
    return j['col1'], j['picture_date']
And use Pandas DataFrame.apply() method as follows
df1=df.apply(parse_row, axis=1,result_type='expand')
The result is a new dataframe with two columns - each containing the parsed data:
       0           1
0  40092  2017-11-06
1  39097  2017-10-31
If you need just one column you can return a single element from parse_row (instead of a two element tuple in the above example) and just use df.apply(parse_row).
If the values are not in json format, just modify parse_row accordingly (Split, convert string to numbers, etc.)
Thanks for the replies, but I solved it by loading the 'pictures' column from the dataset into a list:
picturelist = dataset['pictures'].values.tolist()
Afterwards I created a dataframe from that list and concatenated it with the original dataset without the pictures column:
two_new_columns = pd.DataFrame(picturelist)
new_dataset = pd.concat([dataset.drop(columns='pictures'), two_new_columns], axis=1)
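The list-based approach can be sketched end to end as follows. It assumes the cells are JSON strings, as in the sample rows from the question, so they are parsed with json.loads first; if the cells are already dicts, that step can be skipped. The column names ID and DATE are the ones the question asks for.

```python
import json
import pandas as pd

dataset = pd.DataFrame({"pictures": [
    '{"col1":"40092","picture_date":"2017-11-06"}',
    '{"col1":"39097","picture_date":"2017-10-31"}',
]})

# Parse each JSON string into a dict, then build a frame from the list of dicts.
picturelist = dataset["pictures"].map(json.loads).tolist()
two_new_columns = pd.DataFrame(picturelist, index=dataset.index)
two_new_columns = two_new_columns.rename(columns={"col1": "ID", "picture_date": "DATE"})

# Attach the new columns side by side, without the original pictures column.
new_dataset = pd.concat([dataset.drop(columns="pictures"), two_new_columns], axis=1)
print(new_dataset)
```

Reusing dataset.index for the new frame keeps the rows aligned during the concat even if the original index is not 0, 1, 2, and so on.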

pandas dataset transformation to normalize the data

I have a csv file like this:
I want to transform it into a pandas dataframe like this:
Basically, I'm trying to normalize the dataset to populate a SQL table.
I have used json_normalize to create a separate dataset from genres column but I'm at a loss over how to transform both the columns as shown in the above depiction.
Some suggestions would be highly appreciated.
If the genre_id is the only numeric value (as shown in the picture), you can use the following:
# find every run of digits in the column and join each row's matches into a comma-separated string
df['genre_id'] = df['genres'].str.findall(r'(\d+)').apply(', '.join)
# split that string on commas and use pandas.DataFrame.explode to give each genre_id its own row
df = df.assign(genre_id = df.genre_id.str.split(',')).explode('genre_id')
# finally, remove the leading space left over from the ', ' separator
df['genre_id'] = df['genre_id'].str.lstrip()
# if required, keep only these two columns
df = df[['id','genre_id']]
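Because the question's data is only shown as an image, the sketch below uses an invented genres column to show a shorter variant of the same idea: findall already returns a list per row, so it can be exploded directly without the join and split steps. The sample values and the exact string layout of genres are assumptions, and it assumes every row contains at least one id.

```python
import pandas as pd

# Hypothetical data; the real layout of the genres column is not shown in the question.
df = pd.DataFrame({
    "id": [862, 8844],
    "genres": [
        "[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}]",
        "[{'id': 12, 'name': 'Adventure'}]",
    ],
})

# findall gives a list of digit strings per row; explode puts each one on its own row.
out = (
    df.assign(genre_id=df["genres"].str.findall(r"\d+"))
      .explode("genre_id")[["id", "genre_id"]]
      .reset_index(drop=True)
)
out["genre_id"] = out["genre_id"].astype(int)
print(out)
```

As with the original answer, this only works if the genre ids are the only digits in the column; a genre name containing a number would be picked up as well.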

Create a dictionary from DataFrame?

I want to create a dictionary from a dataframe in python.
In this dataframe, one column contains all the keys and another column contains the multiple values for each key.
DATAKEY DATAKEYVALUE
name mayank,deepak,naveen,rajni
empid 1,2,3,4
city delhi,mumbai,pune,noida
I tried this code to first convert it into simple data frame but all the values are not separating row-wise:
columnnames=finaldata['DATAKEY']
collist=list(columnnames)
dfObj = pd.DataFrame(columns=collist)
collen=len(finaldata['DATAKEY'])
for i in range(collen):
    colname=collist[i]
    keyvalue=finaldata.DATAKEYVALUE[i]
    valuelist2=keyvalue.split(",")
    dfObj = dfObj.append({colname: valuelist2}, ignore_index=True)
You should modify your question title. It is misleading because pandas dataframes are "kind of" dictionaries in themselves, which is why the first comment you got referred to pandas' built-in .to_dict() method.
What you actually want to do is iterate over your pandas dataframe row-wise and, for each row, generate a dictionary key from the first column and a list of values from the second column.
For that you will have to use:
an empty dictionary: dict()
the method for iterating over dataframe rows: dataframe.iterrows()
a method to split a single string of separator-delimited values, such as the split() method you suggested: str.split().
With all these tools all you have to do is:
output = dict()
for index, row in finaldata.iterrows():
    output[row['DATAKEY']] = row['DATAKEYVALUE'].split(',')
Note that this generates a dictionary whose values are lists of strings, and it will not work if the contents of the 'DATAKEYVALUE' column are not single strings.
Also note that this may not be the most efficient solution if you have a very large dataframe.
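For a larger dataframe, one possible alternative to iterrows is to split the whole column at once and zip it with the keys. This is a sketch using the sample rows from the question, not a benchmarked recommendation.

```python
import pandas as pd

finaldata = pd.DataFrame({
    "DATAKEY": ["name", "empid", "city"],
    "DATAKEYVALUE": [
        "mayank,deepak,naveen,rajni",
        "1,2,3,4",
        "delhi,mumbai,pune,noida",
    ],
})

# Series.str.split splits every cell in one call; zip pairs each key with its list.
output = dict(zip(finaldata["DATAKEY"], finaldata["DATAKEYVALUE"].str.split(",")))
print(output)

# Because every list has the same length here, the dict also converts to a row-wise frame.
table = pd.DataFrame(output)
print(table)
```

The values are still lists of strings, so a column such as empid would need an explicit conversion if numbers are required.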
