I have a dataframe of this style:
id patient_full_name
7805 TOMAS FRANCONI
7810 Camila Gualtieri
7821 Lola Borrego
7823 XIMENA ALVAREZ LANUS
7824 MONICA VIVIANA RODRIGUEZ DE MARENGO
I need to save the first name from each value in the second column. I want to trim each value down to the first space, and I don't know how.
I would like it to stay in a structure like this:
patients_names = ["TOMAS", "CAMILA", "LOLA", "XIMENA", "MONICA", ..., "N-NAME"]
All of this done in pandas (Python).
You can use the split method in a list comprehension to do this:
df = pd.DataFrame([
    {"id": 7805, "patient_full_name": "TOMAS FRANCONI"},
    {"id": 7810, "patient_full_name": "Camila Gualtieri"},
    {"id": 7821, "patient_full_name": "Lola Borrego"}
])
df["first_name"] = [n.split(" ")[0] for n in df["patient_full_name"]]
That adds a column (first_name) with the output you wanted, which you can then pull off as a list or series if you want:
first_name_as_series = df["first_name"]
first_name_as_list = list(df["first_name"])
In your question, you show the desired output in all upper case. That's easy to get with a simple tweak to the list comprehension:
df["first_name"] = [n.split(" ")[0].upper() for n in df["patient_full_name"]]
You can do it with extract as well, which does not rely on a Python-level loop. Note that a greedy pattern like r"(.*) " would capture everything up to the last space, so it is safer to grab the first run of non-space characters (the column here is patient_full_name, as in the question):
(df
    .assign(first_name=lambda x: x.patient_full_name.str.extract(r"(\S+)", expand=False))
)
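As a side note, the same extraction can be done entirely with pandas' vectorized string methods, avoiding the Python-level loop. A minimal sketch, assuming the column is named patient_full_name as above:
patients_names = (
    df["patient_full_name"]
    .str.split(" ", n=1)   # split on the first space only
    .str[0]                # keep the part before it
    .str.upper()           # match the upper-case output in the question
    .tolist()
)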
I am working on a data cleaning project in which I have to remove outliers from price_per_sqft. I used groupby on location and, for each group, kept only the rows within one standard deviation of the mean, concatenating the results into the output data frame.
But in the output, the location names come back with extra words attached, so how can I get a clean location name instead?
Code:
def remove_pps_outliers(df):
    df_out = pd.DataFrame()
    for key, subdf in df.groupby('location'):
        m = np.mean(subdf.price_per_sqft)
        st = np.std(subdf.price_per_sqft)
        reduced_df = subdf[(subdf.price_per_sqft > (m - st)) & (subdf.price_per_sqft <= (m + st))]
        df_out = pd.concat([df_out, reduced_df], ignore_index=True)
    return df_out
df6 = remove_pps_outliers(df5)
df6.head()
Output:
[screenshot of df6.head(); the location values still carry prefixes such as '1st Phase' and '1st Block']
How can I get the location names without the "1st Phase" or "1st Block" prefixes, like this?
[screenshot of the desired output with the prefixes removed]
A rudimentary fix would be to just replace the characters you do not want. Luckily, in this example both '1st Phase ' and '1st Block ' contain 10 characters, so you could use:
df6['location'] = df6['location'].str.slice_replace(0,10,'')
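If the unwanted prefixes are not always exactly 10 characters long, a regex replacement is more robust. A sketch, assuming every prefix looks like an ordinal followed by 'Phase' or 'Block' (e.g. '1st Phase ', '2nd Block '):
# Strip a leading "<ordinal> Phase " or "<ordinal> Block " prefix, if present
df6['location'] = df6['location'].str.replace(
    r'^\d+(?:st|nd|rd|th)\s+(?:Phase|Block)\s+', '', regex=True
)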
I have a dataframe where the coordinates column comes in this format
[-7.821, 37.033]
I would like to create two columns, where the first is lon and the second is lat.
I've tried
my_dict = df_map['coordinates'].to_dict()
df_map_new = pd.DataFrame(list(my_dict.items()),columns = ['lon','lat'])
But the dictionary that is created does not split the values on the comma.
Instead it creates a dict in the following format:
0: '[-7.821, 37.033]'
What is the best way to extract the values within [,] and put them into two new columns in the original dataframe df_map?
Thank you in advance!
You can parse the string with a regular expression:
pattern = r"\[(?P<lon>.*),\s*(?P<lat>.*)\]"
out = df_map['coordinates'].str.extract(pattern).astype(float)
print(out)
# Output
lon lat
0 -7.821 37.033
Alternatively, convert the values to lists with ast.literal_eval, then build the DataFrame from a list of lists instead of a dict:
import ast
my_L = df_map['coordinates'].apply(ast.literal_eval).tolist()
df_map_new = pd.DataFrame(my_L,columns = ['lon','lat'])
Additionally to the answers already provided, you can also try this; note that it assumes the coordinates column already holds two-element lists rather than strings:
ser_lon = df['coordinates'].apply(lambda x: x[0])
ser_lat = df['coordinates'].apply(lambda x: x[1])
df_map['lon'] = ser_lon
df_map['lat'] = ser_lat
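If the column actually holds strings such as '[-7.821, 37.033]' (as the question suggests), plain string methods also work, with no regex or eval. A sketch, assuming every value has that exact bracketed form:
df_map[['lon', 'lat']] = (
    df_map['coordinates']
    .str.strip('[]')              # drop the surrounding brackets
    .str.split(',', expand=True)  # one column per coordinate
    .astype(float)
)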
I have a Pandas DataFrame that contains several string values.
I want to replace them with integer values in order to calculate similarities.
For example:
stores[['CNPJ_Store_Code','region','total_facings']].head()
Out[24]:
CNPJ_Store_Code region total_facings
1 93209765046613 Geo RS/SC 1.471690
16 93209765046290 Geo RS/SC 1.385636
19 93209765044084 Geo PR/SPI 0.217054
21 93209765044831 Geo RS/SC 0.804633
23 93209765045218 Geo PR/SPI 0.708165
and I want to replace region == 'Geo RS/SC' ==> 1, region == 'Geo PR/SPI' ==> 2, etc.
Clarification: I want to do the replacement automatically, without creating a dictionary first, since I don't know in advance what my regions will be.
Any ideas? I am trying to use DictVectorizer, with no success.
I'm sure there's a way to do it in intelligent way, but I just can't find it.
Anyone familiar with a solution?
You can use the .apply() function and a dictionary to map all known string values to their corresponding integer values:
region_dictionary = {'Geo RS/SC': 1, 'Geo PR/SPI' : 2, .... }
stores['region'] = stores['region'].apply(lambda x: region_dictionary[x])
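A small note on this approach: Series.map performs the same dictionary lookup without a lambda, and yields NaN for regions missing from the dictionary instead of raising a KeyError:
stores['region'] = stores['region'].map(region_dictionary)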
It looks to me like you really want pandas categories:
http://pandas-docs.github.io/pandas-docs-travis/categorical.html
I think you just need to change the dtype of your text column to "category" and you are done.
stores['region'] = stores["region"].astype('category')
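To get actual integer values out of the categories, one more step is needed: the codes live under .cat.codes. A minimal sketch (codes start at 0; add 1 for the 1-based numbering used in the question):
stores['region'] = stores['region'].astype('category')
stores['region_num'] = stores['region'].cat.codes + 1  # one integer per distinct region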
You can do:
df = pd.read_csv(filename, index_col = 0) # Assuming it's a csv file.
def region_to_numeric(a):
    if a == 'Geo RS/SC':
        return 1
    if a == 'Geo PR/SPI':
        return 2

df['region_num'] = df['region'].apply(region_to_numeric)
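Since the question asks for the mapping to be generated automatically, without knowing the regions in advance, pd.factorize is also worth mentioning: it assigns an integer to each distinct value in order of appearance. A minimal sketch:
codes, uniques = pd.factorize(stores['region'])
stores['region_num'] = codes + 1  # factorize starts at 0; shift to start at 1
print(uniques)                    # shows which region each integer stands for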
I am trying to import a json file using the function:
sku = pandas.read_json('https://cws01.worldstores.co.uk/api/product.php?product_sku=125T:FT0111')
However, I keep getting the following error:
ValueError: arrays must all be same length
What should I do to import it correctly into a dataframe?
This is the structure of the JSON:
{
    "id": "5",
    "sku": "JOSH:BECO-BRN",
    "last_updated": "2013-06-10 15:46:22",
    ...
    "propertyType1": [
        "manufacturer_colour"
    ],
    "category": [
        {
            "category_id": "10",
            "category_name": "All Products"
        },
        ...
        {
            "category_id": "238",
            "category_name": "All Sofas"
        }
    ],
    "root_categories": [
        "516"
    ],
    "url": "/p/Beco Suede Sofa Bed?product_id=5",
    "item": [
        "2"
    ],
    "image_names": "[\"https:\\/\\/cdn.worldstores.co.uk\\/images\\/products\\/L\\/19\\/Beco_Suede_Sofa_Bed-1.jpg\",\"https:\\/\\/cdn.worldstores.co.uk\\/images\\/products\\/P\\/19\\/Beco_Suede_Sofa_Bed-1.jpg\",\"https:\\/\\/cdn.worldstores.co.uk\\/images\\/products\\/SP\\/19\\/Beco_Suede_Sofa_Bed-1.jpg\",\"https:\\/\\/cdn.worldstores.co.uk\\/images\\/products\\/SS\\/19\\/Beco_Suede_Sofa_Bed-1.jpg\",\"https:\\/\\/cdn.worldstores.co.uk\\/images\\/products\\/ST\\/19\\/Beco_Suede_Sofa_Bed-1.jpg\",\"https:\\/\\/cdn.worldstores.co.uk\\/images\\/products\\/WP\\/19\\/Beco_Suede_Sofa_Bed-1.jpg\",\"https:\\/\\/cdn.worldstores.co.uk\\/images\\/products\\/L\\/19\\/Beco_Suede_Sofa_Bed-2.jpg\",\"https:\\/\\/cdn.worldstores.co.uk\\/images\\/products\\/P\\/19\\/Beco_Suede_Sofa_Bed-2.jpg\",\"https:\\/\\/cdn.worldstores.co.uk\\/images\\/products\\/SP\\/19\\/Beco_Suede_Sofa_Bed-2.jpg\",\"https:\\/\\/cdn.worldstores.co.uk\\/images\\/products\\/SS\\/19\\/Beco_Suede_Sofa_Bed-2.jpg\",\"https:\\/\\/cdn.worldstores.co.uk\\/images\\/products\\/ST\\/19\\/Beco_Suede_Sofa_Bed-2.jpg\",\"https:\\/\\/cdn.worldstores.co.uk\\/images\\/products\\/WP\\/19\\/Beco_Suede_Sofa_Bed-2.jpg\"]"
}
The pandas.read_json function takes multiple formats.
Since you did not specify which format your json file is in (orient= attribute), pandas will default to believing your data is columnar. The different formats pandas expects are discussed below.
The data that you are trying to parse from https://cws01.worldstores.co.uk/api/product.php?product_sku=125T:FT0111 does not seem to conform to any of the supported formats, as it is only a single "record" while pandas expects some kind of collection.
You probably should collect multiple entries into a single file, then parse it with the read_json function.
EDIT:
A simple way of getting multiple rows and parsing them with the pandas.read_json function:
import urllib2
import pandas as pd

url_base = "https://cws01.worldstores.co.uk/api/product.php?product_sku={}"
products = ["125T:FT0111", "125T:FT0111", "125T:FT0111"]

raw_data_list = []
for sku in products:
    url = url_base.format(sku)
    raw_data_list.append(urllib2.urlopen(url).read())

data = "[" + ",".join(raw_data_list) + "]"
data = pd.read_json(data, orient='records')
data
/EDIT
My take on the pandas.read_json function formats.
The pandas.read_json function is yet another shining example of pandas trying to jam as much functionality as possible into a single function, which of course leads to a very complicated function.
Series
If your data is a Series, pandas.read_json(orient=) defaults to 'index'
The values allowed for orient while parsing a Series are: {'split','records','index'}
Note that the Series index must be unique for orient='index'.
DataFrame
If your data is a DataFrame, pandas.read_json(orient=) defaults to 'columns'
The values allowed for orient while parsing a DataFrame are:
{'split','records','index','columns','values'}
Note that the DataFrame index must be unique for orient='index' and orient='columns', and the DataFrame columns must be unique for orient='index', orient='columns', and orient='records'.
Format
Whether your data is a DataFrame or a Series, each orient= value expects data in the same format:
Split
Expects a string representation of a dict like what the DataFrame constructor takes:
{"index":[1,2,3,4], "columns":["col1","col2"], "data":[[8,7,6,5], [5,6,7,8]]}
Records
Expects a string representation of a list of dicts like:
[{"col1":8,"col2":5},{"col1":7,"col2":6},{"col1":6,"col2":7},{"col1":5,"col2":8}]
Note there is no index set here.
Index
Expects a string representation of a nested dict like:
{"1":{"col1":8,"col2":5},"2":{"col1":7,"col2":6},"3":{"col1":6,"col2":7},"4":{"col1":5,"col2":8}}
Note that it will not accept indices of types other than strings; this may be fixed in later versions.
Columns
Expects a string representation of a nested dict like:
{"col1":{"1":8,"2":7,"3":6,"4":5},"col2":{"1":5,"2":6,"3":7,"4":8}}
Values
Expects a string representation of a list like:
[[8, 5],[7, 6],[6, 7],[5, 8]]
Resulting dataframe
In most cases, given the JSON strings above, the dataframe you get will look like this:
col1 col2
1 8 5
2 7 6
3 6 7
4 5 8
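For a quick sanity check, each of the strings above can be fed straight to read_json with the matching orient. A minimal sketch using the 'split' payload (wrapped in StringIO, since newer pandas versions deprecate passing literal JSON strings directly):
from io import StringIO
import pandas as pd

payload = '{"index":[1,2,3,4], "columns":["col1","col2"], "data":[[8,5],[7,6],[6,7],[5,8]]}'
df = pd.read_json(StringIO(payload), orient='split')
print(df)  # same 4x2 frame as shown above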
Maybe this is not the most elegant solution, but it gives me back what I want, or at least I believe so; feel free to warn me if something is wrong.
url = "https://cws01.worldstores.co.uk/api/product.php?product_sku=125T:FT0111"
data = urllib2.urlopen(url).read()
data = json.loads(data)
data = pd.DataFrame(data.items())
data = data.transpose()
Another solution is to use a try/except. Note that the payload has to be fetched first, because the except branch must wrap the JSON text itself in brackets; wrapping the URL string would not parse:
json_path = 'https://cws01.worldstores.co.uk/api/product.php?product_sku=125T:FT0111'
payload = urllib2.urlopen(json_path).read()  # fetch the raw JSON text first
try:
    a = pd.read_json(payload)
except ValueError:
    a = pd.read_json("[" + payload + "]")
Iterating on #firelynx's answer:
#! /usr/bin/env python3
from urllib.request import urlopen
import pandas as pd

products = ["125T:FT0111", "125T:FT0111", "125T:FT0111"]
raw_lines = ""
for sku in products:
    url = f"https://cws01.worldstores.co.uk/api/product.php?product_sku={sku}"
    raw_lines += urlopen(url).read().decode("utf-8") + "\n"  # .read() returns bytes; decode before concatenating

data = pd.read_json(raw_lines, lines=True)
This would support any source returning a single JSON object or a bunch of newline ('\n') separated ones.
Or this one-liner(ish) should work the same:
#! /usr/bin/env python3
import pandas as pd

products = ["125T:FT0111", "125T:FT0111", "125T:FT0111"]
data = pd.concat(
    pd.read_json(
        f"https://cws01.worldstores.co.uk/api/product.php?product_sku={sku}",
        lines=True,
    )
    for sku in products
)
PS: Python 3 is only needed here for f-string support, so use str.format if you need Python 2 compatibility.