How to replace string values in a pandas DataFrame with integers? - python

I have a Pandas DataFrame that contains several string values.
I want to replace them with integer values in order to calculate similarities.
For example:
stores[['CNPJ_Store_Code','region','total_facings']].head()
Out[24]:
CNPJ_Store_Code region total_facings
1 93209765046613 Geo RS/SC 1.471690
16 93209765046290 Geo RS/SC 1.385636
19 93209765044084 Geo PR/SPI 0.217054
21 93209765044831 Geo RS/SC 0.804633
23 93209765045218 Geo PR/SPI 0.708165
and I want to replace region == 'Geo RS/SC' ==> 1, region == 'Geo PR/SPI' ==> 2, etc.
Clarification: I want to do the replacement automatically, without creating a dictionary first, since I don't know in advance what my regions will be.
Any ideas? I am trying to use DictVectorizer, with no success.
I'm sure there's a way to do it in an intelligent way, but I just can't find it.
Anyone familiar with a solution?

You can use the .apply() function and a dictionary to map all known string values to their corresponding integer values:
region_dictionary = {'Geo RS/SC': 1, 'Geo PR/SPI' : 2, .... }
stores['region'] = stores['region'].apply(lambda x: region_dictionary[x])
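A slightly more idiomatic equivalent is Series.map, which does the same dictionary lookup without the lambda (unmapped values become NaN instead of raising a KeyError):
stores['region'] = stores['region'].map(region_dictionary)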

It looks to me like you really would like pandas categories
http://pandas-docs.github.io/pandas-docs-travis/categorical.html
I think you just need to change the dtype of your text column to "category" and you are done.
stores['region'] = stores["region"].astype('category')
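If you need actual integers out of the categories (e.g. for the similarity calculation), the automatically assigned codes are then exposed via .cat.codes; a minimal sketch:
stores['region_num'] = stores['region'].cat.codes  # 0-based integer code per region, assigned automatically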

You can do:
df = pd.read_csv(filename, index_col=0)  # Assuming it's a csv file.
def region_to_numeric(a):
    if a == 'Geo RS/SC':
        return 1
    if a == 'Geo PR/SPI':
        return 2
df['region_num'] = df['region'].apply(region_to_numeric)
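Given the clarification that the mapping should be automatic, with no dictionary built in advance, pd.factorize may be the closest fit; a minimal sketch (the + 1 is only there to match the desired 'Geo RS/SC' ==> 1 numbering):
codes, uniques = pd.factorize(stores['region'])
stores['region_num'] = codes + 1  # factorize assigns 0-based codes in order of first appearance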

Related

trim string to first space python

I have a dataframe of this style:
id patient_full_name
7805 TOMAS FRANCONI
7810 Camila Gualtieri
7821 Lola Borrego
7823 XIMENA ALVAREZ LANUS
7824 MONICA VIVIANA RODRIGUEZ DE MARENGO
I need to save the first name from the values in the second column. I want to trim each value down to the first space and I don't know how.
I would like it to stay in a structure like this:
patients_names = ["TOMAS","CAMILA","LOLA","XIMENA","MONICA",...."N-NAME"]
All of this done in pandas (Python).
You can use the split function in a list comprehension to do this:
df = pd.DataFrame([
    {"id": 7805, "patient_full_name": "TOMAS FRANCONI"},
    {"id": 7810, "patient_full_name": "Camila Gualtieri"},
    {"id": 7821, "patient_full_name": "Lola Borrego"}
])
df["first_name"] = [n.split(" ")[0] for n in df["patient_full_name"]]
That adds a column (first_name) with the output you wanted, which you can then pull off as a list or series if you want:
first_name_as_series = df["first_name"]
first_name_as_list = list(df["first_name"])
In your question, you show the desired output in all upper case. That's easy to get with a simple tweak to the list comprehension:
df["first_name"] = [n.split(" ")[0].upper() for n in df["patient_full_name"]]
You can do it by using str.extract as well, which does not rely on a Python-level loop; the pattern below captures the leading run of non-whitespace characters:
(df
 .assign(first_name=lambda x: x["patient_full_name"].str.extract(r"^(\S+)", expand=False))
)
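A vectorized alternative is the .str.split accessor; a minimal sketch, assuming whitespace separates the name parts:
df["first_name"] = df["patient_full_name"].str.split().str[0].str.upper()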

Extracting values to new columns with pandas

I have a dataframe where the coordinates column comes in this format
[-7.821, 37.033]
I would like to create two columns, where the first is lon and the second is lat
I've tried
my_dict = df_map['coordinates'].to_dict()
df_map_new = pd.DataFrame(list(my_dict.items()),columns = ['lon','lat'])
But the dictionary that is created does not split the values on the comma.
Instead it creates a dict with the following format
0: '[-7.821, 37.033]'
What is the best way to extract the values within [,] and put them into two new columns in the original dataframe df_map?
Thank you in advance!
You can parse the string:
pattern = r"\[(?P<lon>.*),\s*(?P<lat>.*)\]"
out = df_map['coordinates'].str.extract(pattern).astype(float)
print(out)
# Output
lon lat
0 -7.821 37.033
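To put the parsed values into two new columns of the original dataframe, as asked, joining on the shared index should work:
df_map = df_map.join(out)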
Convert the values to lists with ast.literal_eval, then build the DataFrame from lists instead of dicts:
import ast
my_L = df_map['coordinates'].apply(ast.literal_eval).tolist()
df_map_new = pd.DataFrame(my_L,columns = ['lon','lat'])
In addition to the answers already provided, you can also try this (note it assumes the coordinates column already holds real lists rather than strings):
ser_lon = df['coordinates'].apply(lambda x: x[0])
ser_lat = df['coordinates'].apply(lambda x: x[1])
df_map['lon'] = ser_lon
df_map['lat'] = ser_lat
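If the column does hold real lists, the .str accessor can also index into them without apply; a minimal equivalent sketch:
df_map['lon'] = df_map['coordinates'].str[0]
df_map['lat'] = df_map['coordinates'].str[1]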

Trying to find the nearest date before and after a specified date from a list of dates in a comma separated string within a pandas Dataframe

tldr; I have an index_date in dtype: datetime64[ns] <class 'pandas.core.series.Series'> and a list_of_dates of type <class 'list'> with individual elements in str format. What's the best way to convert these to the same data type so I can sort the dates into closest before and closest after index_date?
I have a pandas dataframe (df) with columns:
ID_string object
indexdate datetime64[ns]
XR_count int64
CT_count int64
studyid_concat object
studydate_concat object
modality_concat object
And it looks something like:
ID_string indexdate XR_count CT_count studyid_concat studydate_concat
0 55555555 2020-09-07 10 1 ['St1', 'St5'...] ['06/22/2019', '09/20/2020'...]
1 66666666 2020-06-07 5 0 ['St11', 'St17'...] ['05/22/2020', '06/24/2020'...]
Where the 0 element in studyid_concat ("St1") corresponds to the 0 element in studydate_concat, and in modality_concat, etc. I did not show modality_concat for space reasons, but it's something like ['XR', 'CT', ...]
My current goal is to find the closest X-ray study performed before and after my indexdate, as well as being able to rank studies from closest to furthest. I'm somewhat new to pandas, but here is my current attempt:
df = pd.read_excel(path_to_excel, sheet_name='Sheet1')
# Convert comma separated string from Excel to lists of strings
df.studyid_concat = df.studyid_concat.str.split(',')
df.studydate_concat = df.studydate_concat.str.split(',')
df.modality_concat = df.modality_concat.str.split(',')
for x in df['ID_string'].values:
    index_date = df.loc[df['ID_string'] == x, 'indexdate']
    # Had to use subscript [0] below because the result of the above was a list in an array
    studyid_list = df.loc[df['ID_string'] == x, 'studyid_concat'].values[0]
    date_list = df.loc[df['ID_string'] == x, 'studydate_concat'].values[0]
    modality_list = df.loc[df['ID_string'] == x, 'modality_concat'].values[0]
    xr_date_list = [date_list[i] for i in range(len(date_list)) if modality_list[i] == "XR"]
    xr_studyid_list = [studyid_list[i] for i in range(len(studyid_list)) if modality_list[i] == "XR"]
That's about as far as I got, because I'm somewhat confused about the data types here. My indexdate is currently dtype: datetime64[ns] in a <class 'pandas.core.series.Series'>, which I was thinking of converting using the datetime module, but I was having a hard time figuring out how, and I also wasn't sure whether I needed to. My xr_date_list is a list of strings containing dates in the format 'mm/dd/yyyy'. I think I could figure out the rest if I could get the data types into the right format: I'd just compare whether the dates are >= or < indexdate to sort them into before/after, then subtract indexdate from each date and sort. Whatever I do with my xr_date_list, I'd just have to be sure to do the same with xr_studyid_list to keep track of the unique study IDs.
Edit: Desired output dataframe would look like
ID_string indexdate StudyIDBefore StudyDateBefore
0 55555555 2020-09-07 ['St33', 'St1', ...] [2020-09-06, 2019-06-22, ...]
1 66666666 2020-06-07 ['St11', 'St2', ...] [2020-05-22, 2020-05-01, ...]
Where the "before" variables would be sorted from nearest to furthest, and similar "after" columns would exist. My current goal is just to check whether a study exists within 3 days before or after this indexdate, but having the above dataframe would give me the flexibility if I need to start looking beyond the nearest study.
Think I found my own answer after spending some more time thinking about it and referencing the pandas to_datetime documentation. I basically realized I could convert my list of string dates using pd.to_datetime:
date_list = pd.to_datetime(df.loc[df['ID_string'] == x, 'studydate_concat'].values[0]).values
Then could subtract my index date from this list. Opted to do this within a temporary dataframe so I could keep track of the other column values (like study ID, modality, etc.).
Full code is below:
for x in df['ID_string'].values:
    index_date = df.loc[df['ID_string'] == x, 'indexdate'].values[0]
    date_list = pd.to_datetime(df.loc[df['ID_string'] == x, 'studydate_concat'].values[0]).values
    modality_list = df.loc[df['ID_string'] == x, 'modality_concat'].values[0]
    studyid_list = df.loc[df['ID_string'] == x, 'studyid_concat'].values[0]
    tempdata = list(zip(studyid_list, date_list, modality_list))
    tempdf = pd.DataFrame(tempdata, columns=['studyid', 'studydate', 'modality'])
    tempdf['indexdate'] = index_date
    tempdf['timedelta'] = tempdf['studydate'] - tempdf['indexdate']
    tempdf['study_done_wi_3daysbefore'] = np.where((tempdf['timedelta'] >= np.timedelta64(-3, 'D')) & (tempdf['timedelta'] < np.timedelta64(0, 'D')), True, False)
    tempdf['study_done_wi_3daysafter'] = np.where((tempdf['timedelta'] <= np.timedelta64(3, 'D')) & (tempdf['timedelta'] >= np.timedelta64(0, 'D')), True, False)
    tempdf['study_done_onindex'] = np.where(tempdf['timedelta'] == np.timedelta64(0, 'D'), True, False)
    XRonindex[x] = True if len(tempdf.loc[(tempdf['study_done_onindex'] == True) & (tempdf['modality'] == 'XR'), 'studyid']) > 0 else False
    XRwi3days[x] = True if len(tempdf.loc[(tempdf['study_done_wi_3daysbefore'] == True) & (tempdf['modality'] == 'XR'), 'studyid']) > 0 else False
    # can later map these values back to my original dataframe as a new column
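For that mapping-back step, a minimal sketch, assuming XRonindex and XRwi3days were initialized as plain dicts keyed by ID_string (the new column names are just placeholders):
df['XRonindex'] = df['ID_string'].map(XRonindex)
df['XRwi3days'] = df['ID_string'].map(XRwi3days)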

Imputing missing string values using a condition (pandas DataFrame)

Kaggle dataset (working on): New York Airbnb
Code to load the raw data, included for a better explanation of the issue:
airbnb = pd.read_csv("https://raw.githubusercontent.com/rafagarciac/Airbnb_NYC-Data-Science_Project/master/input/new-york-city-airbnb-open-data/AB_NYC_2019.csv")
airbnb[airbnb["host_name"].isnull()][["host_name","neighbourhood_group"]]
(screenshot of the resulting DataFrame)
I would like to fill the null values of "host_name" based on the "neighbourhood_group" column entities.
like
if airbnb['host_name'].isnull():
    airbnb["neighbourhood_group"] == "Bronx"
    airbnb["host_name"] = "Vie"
elif:
    airbnb["neighbourhood_group"] == "Manhattan"
    airbnb["host_name"] = "Sonder (NYC)"
else:
    airbnb["host_name"] = "Michael"
(this is wrong, just to represent the output format I want)
I've tried using an if statement but I couldn't apply it in a correct way. Could you please help me solve this?
Thanks
You could try this -
airbnb.loc[(airbnb['host_name'].isnull()) & (airbnb["neighbourhood_group"]=="Bronx"), "host_name"] = "Vie"
airbnb.loc[(airbnb['host_name'].isnull()) & (airbnb["neighbourhood_group"]=="Manhattan"), "host_name"] = "Sonder (NYC)"
airbnb.loc[airbnb['host_name'].isnull(), "host_name"] = "Michael"
Pandas has a special method to fill NA values:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html
You may create a dict of "host_name" values keyed by "neighbourhood_group" values and do this:
host_dict = {'Bronx': 'Vie', 'Manhattan': 'Sonder (NYC)'}
airbnb['host_name'] = airbnb['host_name'].fillna(value=airbnb[airbnb['host_name'].isna()]['neighbourhood_group'].map(host_dict))
airbnb['host_name'] = airbnb['host_name'].fillna("Michael")
"value" argument here may be a Series of values.
So, first of all, we create a Series with "neighbourhood_group" values which correspond to our missing values by using this part:
neighbourhood_group_series = airbnb[airbnb['host_name'].isna()]['neighbourhood_group']
Then, using the map function together with "host_dict", we get a Series with the values that we want to impute:
neighbourhood_group_series.map(host_dict)
Finally, we just impute some default value into all remaining NA cells, in our case "Michael".
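Since fillna aligns a passed Series on the index and only touches NA cells anyway, the isna() pre-filter is optional; a slightly shorter equivalent sketch:
airbnb['host_name'] = airbnb['host_name'].fillna(airbnb['neighbourhood_group'].map(host_dict)).fillna("Michael")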
You can do it with:
ornek = pd.DataFrame({'samp1': [None, None, None],
                      'samp2': ["sezer", "bozkir", "farkli"]})
def filter_by_col(row):
    if row["samp2"] == "sezer":
        return "ping"
    if row["samp2"] == "bozkir":
        return "pong"
    return None
ornek.apply(lambda x: filter_by_col(x), axis=1)
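To actually keep the result, assign it back to the column; passing the function directly also avoids the extra lambda:
ornek['samp1'] = ornek.apply(filter_by_col, axis=1)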

Python Import data dictionary and pattern

If I have data as:
Code, data_1, data_2, data_3, [....], data204700
a,1,1,0, ... , 1
b,1,0,0, ... , 1
a,1,1,0, ... , 1
c,0,1,0, ... , 1
b,1,0,0, ... , 1
etc. — the same code with different values (0, 1, or ? (not known)).
I need to create a big matrix that I want to analyze.
How can I import the data into a dictionary?
I want to use a dictionary for the columns (204,700 + 1).
Is there a built-in function (or package) that returns the pattern to me?
(I expect a percentage pattern.) I mean something like 90% of 1s in column 1, 80% in column 2.
Alright, so I am going to assume you want this in a dictionary for storage purposes, and I will tell you that you don't want that with this kind of data. Use a pandas DataFrame.
this is how you will get your data into a dataframe:
import pandas as pd
my_file = 'file_name'
df = pd.read_csv(my_file)
now you don't need a package to return the pattern you are looking for, just write a simple algorithm for that!
def one_percentage(data):
    # get total number of rows for calculating percentages
    size = len(data)
    # get the dtype of the first data column so we only grab matching columns
    x = data.columns[1]
    x = data[x].dtype
    # list of tuples to hold the column names and the amount of 1s
    ones = [(i, sum(data[i])) for i in data if data[i].dtype == x]
    my_dict = {}
    # create a dictionary with column names and percentages
    for x in ones:
        percent = x[1] / float(size)
        my_dict[x[0]] = percent
    return my_dict
now if you want to get the percent of ones in any column, this is what you do:
percentages = one_percentage(df)
column_name = 'any_column_name'
print(percentages[column_name])
now if you want to have it do every single column, then you can grab all of the column names and loop through them:
columns = [name for name in percentages]
for name in columns:
    print(str(percentages[name] * 100) + "% of 1s in column " + name)
let me know if you need anything else!
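As a side note, because the data is all 0/1 (with ? for unknown), pandas can compute the same percentages without an explicit loop; a minimal vectorized sketch, assuming read_csv is told to parse ? as NaN so the data columns come out numeric:
df = pd.read_csv(my_file, na_values='?')
percentages = df.drop(columns='Code').mean() * 100  # per-column share of 1s as a percent, NaNs skipped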
