Replace some values with a string in a column in Python

I have a column "Country" in a data frame, and I would like to group the "Country" column into only two options: "Mainland China" and "Others". I have tried different options, e.g. filter, etc., but none of them works. How should I do it?
Here is the dataset https://drive.google.com/file/d/17DY8f-Jxba0Ky5iOUQqEZehhoWNO3vzR/view?usp=sharing
FYI, I have already grouped the different provinces in China as one country, "Mainland China".
Thanks for your help!

I think the quickest way to change the values would be using .loc instead of apply, since .loc is optimized in pandas:
df.loc[df.Country != 'Mainland China', 'Country'] = 'Others'
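For illustration, a minimal sketch of the before/after on a toy frame (the country values here are made up):

import pandas as pd

df = pd.DataFrame({'Country': ['Mainland China', 'Italy', 'Iran']})
df.loc[df.Country != 'Mainland China', 'Country'] = 'Others'
print(df.Country.tolist())  # ['Mainland China', 'Others', 'Others']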

Try (and group by Country):
import numpy as np
df["Country"]=np.where(df["Country"].eq("Mainland China"), "Mainland China", "Other")
Edit
timeit (note: I didn't include .loc[] since a lambda doesn't support assignment; feel free to suggest a way of adding it):
import pandas as pd
import numpy as np
import timeit
from timeit import Timer
#proportion-wise that's the dataframe, as per OP's question
df=pd.DataFrame({"Country": ["Mainland China"]*398+["a", "b","c"]*124})
df["otherCol"]=2
df["otherCol2"]=3
#shuffle
df2=df.copy().sample(frac=1)
df3=df2.copy()
df4=df3.copy()
op2=Timer(lambda: np.where(df2["Country"].eq("Mainland China"), "Mainland China", "Other"))
op3=Timer(lambda: df3.Country.map(lambda x: x if x == 'Mainland China' else 'Others'))
op4=Timer(lambda: df4["Country"].apply(lambda x: x if x == "Mainland China" else "Others"))
print(op2.timeit(number=1000))
print(op3.timeit(number=1000))
print(op4.timeit(number=1000))
Returns:
2.1856687490362674 #numpy
2.2388894270407036 #map
2.4437739049317315 #apply
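One way to include .loc in the comparison: Timer accepts any zero-argument callable, so wrapping the assignment in a plain def works. A sketch reusing the setup above (the inner copy() keeps each run starting from the same data, so the timing isn't directly comparable to the non-assigning ops):

df5 = df.copy().sample(frac=1)

def loc_assign():
    d = df5.copy()  # fresh copy each run, since .loc mutates in place
    d.loc[d.Country != 'Mainland China', 'Country'] = 'Others'

op5 = Timer(loc_assign)
print(op5.timeit(number=1000))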

Try using apply:
dataframe["Country"] = dataframe["Country"].apply(lambda x: x if x == "Mainland China" else "Others")

Assuming df is your pandas dataframe.
You could do:
df['Country'] = df.Country.map(lambda x: x if x == 'Mainland China' else 'Others')
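For completeness, Series.where expresses the same idea without a Python-level lambda: it keeps values where the condition holds and substitutes the second argument elsewhere. A sketch:

df['Country'] = df['Country'].where(df['Country'] == 'Mainland China', 'Others')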

Related

loop over columns in dataframes python

I want to loop over 2 columns in a specific dataframe, accessing the data by the name of the column, but it gives me a TypeError on line 3:
i = 0
for name, value in df.iteritems():
    q1 = df[name].quantile(0.25)
    q3 = df[name].quantile(0.75)
    IQR = q3 - q1
    min = q1 - 1.5*IQR
    max = q3 + 1.5*IQR
    minout = df[df[name] < min]
    maxout = df[df[name] > max]
    new_df = df[(df[name] < max) & (df[name] > min)]
    i += 1
    if i == 2:
        break
It looks like you want to exclude outliers based on the 1.5*IQR rule. Here is a simpler solution:
Input dummy data:
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame({'col%s' % (i + 1): np.random.normal(size=1000) for i in range(4)})
Removing the outliers (keep data: Q1-1.5IQR < data < Q3+1.5IQR):
Q1 = df.iloc[:, :2].quantile(.25)
Q3 = df.iloc[:, :2].quantile(.75)
IQR = Q3-Q1
non_outliers = (df.iloc[:, :2] > Q1-1.5*IQR) & (df.iloc[:, :2] < Q3+1.5*IQR)
new_df = df[non_outliers.all(axis=1)]
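A quick sanity check on the result (the exact count depends on the random draw, but with the 1.5*IQR rule on two normal columns you'd expect to drop only a percent or two of the rows):

print(len(df), len(new_df))  # e.g. 1000 vs. roughly 980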
A TypeError can happen for a lot of reasons, so it would be better if you added part of the DF to help pin down the issue.
Also, to loop over the rows and access each column by name, you can use the iterrows() function:
import pandas as pd

df = pd.read_csv('filename.csv')
for _, content in df.iterrows():
    print(content['columnname'])  # add the name of the column you want to read here
refer to the following link for more information
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.iterrows.html#pandas.DataFrame.iterrows
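Side note: iterrows() walks rows; the column-wise counterpart the question used is iteritems() (renamed items() in newer pandas). The original TypeError most likely came from calling quantile() on a non-numeric column, so restricting the loop to numeric columns is a reasonable guard; a sketch:

# .iteritems() on older pandas; .items() on 1.0+
for name, col in df.select_dtypes('number').items():
    q1, q3 = col.quantile(0.25), col.quantile(0.75)
    print(name, q1, q3)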

How to split out tagged column from df and fill new column?

I have a column in a Python df like:
TAGS
{user_type:active}
{session_type:session1}
{user_type:inactive}
How can I efficiently split this column out into its own column for each of the tags specified?
Desired:
TAGS                    |user_type|session_type
{user_type:active}      |active   |null
{session_type:session1} |null     |session1
{user_type:inactive}    |inactive |null
My attempt is only able to do this in a boolean sense (not what I want), and only if I specify the columns from the tags (which I don't know ahead of time):
mask = df['tags'].apply(lambda x: 'user_type' in x)
df['user_type'] = mask
There are better ways, but this works with what you've got:
import numpy as np

# .split(':')[1] leaves the trailing '}', so strip it off
df['user_type'] = df['tags'].apply(lambda x: x.split(':')[1].rstrip('}') if 'user_type' in x else np.nan)
df['session_type'] = df['tags'].apply(lambda x: x.split(':')[1].rstrip('}') if 'session_type' in x else np.nan)
You could use pandas.json_normalize() to convert the TAGS column to dict objects and check whether user_type is a key of that dict.
df2 = pd.json_normalize(df['TAGS'])
df2['user_type'] = df2['TAGS'].apply(lambda x: x['user_type'] if 'user_type' in x else 'null')
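Note that json_normalize() works on dict-like objects, so if TAGS holds raw JSON strings (quoted keys and values) you would parse them first; a sketch under that assumption (and assuming a default RangeIndex, so the concat aligns):

import json

import pandas as pd

parsed = df['TAGS'].apply(json.loads)          # string -> dict per row
tag_cols = pd.json_normalize(parsed.tolist())  # one column per tag key, NaN where absent
df2 = pd.concat([df, tag_cols], axis=1)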
This is what ended up working for me; I wanted to post a short working example using the json library that helped.
import json
import pandas as pd

# This example covers the case where the dataframe has other fields besides tags
def js(row):
    # Parse a JSON tag string into a dict; fall back to an empty mapping
    if row:
        return json.loads(row)
    else:
        return {'': ''}

df2 = df.copy()
# Make some dummy tags
df2['tags'] = ['{"user_type":"active","nonuser_type":"inactive"}'] * len(df2['tags'])
df2['tags'] = df2['tags'].apply(js)
df_temp = pd.DataFrame(df2['tags'].values.tolist())
df3 = pd.concat([df2.drop('tags', axis=1), df_temp], axis=1)
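With the dummy tags above, df3 should now carry user_type and nonuser_type as their own columns; a quick check:

print(df3[['user_type', 'nonuser_type']].head())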
@Ynjxsjmh your approach reminds me of something I had used in the past, but in this case I got the following error:
AttributeError: 'str' object has no attribute 'values'
@Bing Wang I am a big fan of list comprehensions, but in this case I don't know the names of the columns beforehand.

Pandas Filtering Data

I have a dataset from Bing, and it contains state- and county-level information. I am trying to create two different datasets, one for the county level and one for the state level.
How do I create a state-level-only data frame? Here is a picture of what the dataset looks like:
The counties dataframe worked with this code:
import pandas as pd
df = pd.read_csv("COVID19-DATA-WITHSTATES.csv")
only_counties = df[df['AdminRegion2'].str.contains("", na = True)]
It didn't work for the state level with this code:
only_states = df[df['AdminRegion2' != ""]]
EDIT: This is the code that worked
only_states = usa_only[lambda x: ~pd.notnull(x['AdminRegion2']) & (usa_only["AdminRegion1"].str.contains("", na = False))]
You can filter it with a lambda expression:
only_states = df[lambda x: ~pd.isnull(x['AdminRegion2'])]
For the second question the above solution works as well:
df[lambda x: x['date'] == "date"]
Here is the answer for only states that worked:
only_states = usa_only[lambda x: ~pd.notnull(x['AdminRegion2']) & (usa_only["AdminRegion1"].str.contains("", na = False))]
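For what it's worth, ~pd.notnull(...) is just pd.isnull(...), so the working filter can be written more directly; a sketch assuming state-level rows are exactly those with a missing AdminRegion2 and a present AdminRegion1:

only_states = usa_only[usa_only['AdminRegion2'].isnull() & usa_only['AdminRegion1'].notnull()]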

Dataframe with arrays and key-value pairs

I have a JSON structure which I need to convert into a data frame. I have converted it through the pandas library, but I am having issues with two columns: one is an array and the other one is a key-value pair.
Pito                    Value
{"pito-key": "Number"}  [{"WRITESTAMP": "2018-06-28T16:30:36Z", "S":"41bbc22","VALUE":"2"}]
How do I break these columns out in the data frame?
As far as I understood your question, you can apply regular expressions to do that.
import pandas as pd
import re

data = {'pito': ['{"pito-key": "Number"}'],
        'value': ['[{"WRITESTAMP": "2018-06-28T16:30:36Z", "S":"41bbc22","VALUE":"2"}]']}
df = pd.DataFrame(data)

def get_value(row):
    # Pull out the VALUE field, then slice off the 'VALUE":"' prefix and trailing quote
    s = row['value']
    v = re.findall(r'VALUE\":\".*\"', s)
    return int(v[0][8:-1])

def get_pito(row):
    # Pull out the pito-key field, then slice off the 'key": "' prefix and trailing quote
    s = row['pito']
    v = re.findall(r'key\": \".*\"', s)
    return v[0][7:-1]

df['value'] = df.apply(get_value, axis=1)
df['pito'] = df.apply(get_pito, axis=1)
df.head()
Here I create two functions that transform your scary strings into the values you want them to have.
Let me know if that's not what you meant.
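Alternatively, since both strings in the sample are valid JSON, parsing them may be more robust than regular expressions; a sketch, starting from a fresh df as constructed above:

import json

df['pito'] = df['pito'].map(lambda s: json.loads(s)['pito-key'])
df['value'] = df['value'].map(lambda s: int(json.loads(s)[0]['VALUE']))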

How can I remove the $

Hello, I have this file:
date;category_name;item_number;item_description;bottlevolume_ml;state_bottle_retail;bottles_sold;volume_sold_gallons
11/04/2015;APRICOT$ BRANDIES;54436;$Mr. Boston Apricot Brandy;750;6.75;12;2.38
03/02/2016;BLENDED WHISKIES;27605;Tin Cup;750;$20.63;2;0.40
02/11/2016;STRAIGHT BOURBON WHISKIES;19067;Jim Beam;1000;$18.89;24;6.34
02/03/2016;AMERICAN COCKTAILS;59154;1800 Ultimate Margarita;1750;$14.25;6;2.77
08/18/2015;VODKA 80 PROOF;35918;Five O'clock Vodka;1750;$10.80;12;5.55
I would like to remove the $ using pandas.
I tried this :
import pandas as pd
import numpy as np
df = pd.read_csv('data2.csv', delimiter=';')
df.date = [x.strip('$') for x in df.date]
df.category_name = [x.strip('$') for x in df.category_name]
df.item_number = [x.strip('$') for x in df.item_number]
But I would like to use pandas to remove the $ from all my columns.
Any ideas ?
Thank you !
for c in df.select_dtypes('object').columns:
    # regex=False treats '$' literally rather than as a regex end-of-string anchor
    df[c] = df[c].str.replace('$', '', regex=False)
Explanation:
If a column contains a '$', it will be an object-type column. It's useful to select only these, because then you can use .str.replace (https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.replace.html) to find all '$' signs in that column and replace them with an empty string.
Note that this solution also removes '$' in the middle of the string (in contrast to the .strip method you've used so far).
This should work:
df = df.apply(lambda x: x.str.strip('$') if x.dtype == "object" else x)
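Once the $ signs are gone, you will likely want the price column numeric again; a sketch using the column name from the sample file:

df['state_bottle_retail'] = pd.to_numeric(df['state_bottle_retail'])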
