Extracting values to new columns with pandas - python

I have a dataframe where the coordinates column comes in this format:
[-7.821, 37.033]
I would like to create two columns, where the first is lon and the second is lat.
I've tried
my_dict = df_map['coordinates'].to_dict()
df_map_new = pd.DataFrame(list(my_dict.items()), columns=['lon', 'lat'])
But the resulting dictionary does not split the values on the comma. Instead it maps each index to the whole string:
0: '[-7.821, 37.033]'
What is the best way to extract the values between the brackets and put them into two new columns in the original dataframe df_map?
Thank you in advance!

You can parse the strings with a regular expression:
pattern = r"\[(?P<lon>.*),\s*(?P<lat>.*)\]"
out = df_map['coordinates'].str.extract(pattern).astype(float)
print(out)
# Output
     lon     lat
0 -7.821  37.033

Convert the values to lists with ast.literal_eval, then build the DataFrame from a list of lists instead of a dict:
import ast
my_L = df_map['coordinates'].apply(ast.literal_eval).tolist()
df_map_new = pd.DataFrame(my_L, columns=['lon', 'lat'])

In addition to the answers already provided, you can also try this:
ser_lon = df['coordinates'].apply(lambda x: x[0])
ser_lat = df['coordinates'].apply(lambda x: x[1])
df_map['lon'] = ser_lon
df_map['lat'] = ser_lat
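Putting the approaches above together, here is a minimal end-to-end sketch, assuming the coordinates column holds strings like '[-7.821, 37.033]' (the sample frame is hypothetical):

```python
import ast

import pandas as pd

# Hypothetical sample frame; the real df_map comes from the asker's data.
df_map = pd.DataFrame({'coordinates': ['[-7.821, 37.033]', '[2.35, 48.85]']})

# Parse each string into a [lon, lat] list, then expand into two new columns.
pairs = df_map['coordinates'].apply(ast.literal_eval)
df_map[['lon', 'lat']] = pd.DataFrame(
    pairs.tolist(), columns=['lon', 'lat'], index=df_map.index)

print(df_map)
```

If the column already holds real lists rather than strings, the ast.literal_eval step can be dropped.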

Related

How to convert and merge the values obtained from stats.linregress to csv in python?

From the following code, I filtered a dataset and obtained the statistic values using stats.linregress(x, y). I would like to merge the obtained lists into one table and then convert it to CSV. How do I merge the lists? I tried .append(), but it adds [...] at the end of each list. How do I write these lists to a single CSV? The code below only writes the last list. Also, where is the appropriate place to apply .2f formatting to shorten the digits? Many thanks!
for i in df.ingredient.unique():
    mask = df.ingredient == i
    x_data = df.loc[mask]["single"]
    y_data = df.loc[mask]["total"]
    ing = df.loc[mask]["ingredient"]
    res = stats.linregress(x_data, y_data)
    result_list = list(res)
    # sum_table = result_list.append(result_list)
    sum_table = result_list
    np.savetxt("sum_table.csv", sum_table, delimiter=',')
    # print(f"{i} = res")
    # print(f"{i} = result_list")
output:
[0.555725080482033, 15.369647540612188, 0.655901508882146, 0.34409849111785396, 0.45223586826559015, [...]]
[0.8240446598271236, 16.290731244189164, 0.7821893273053173, 0.00012525348188386877, 0.16409500805404134, [...]]
[0.6967783360917531, 25.8981921144781, 0.861561500951743, 0.13843849904825695, 0.29030899523536124, [...]]
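One way to address the overwrite (np.savetxt runs inside the loop, so each iteration replaces the file) is to collect one row per ingredient and write a single CSV at the end. A sketch under that assumption, with hypothetical sample data standing in for the asker's df:

```python
import pandas as pd
from scipy import stats

# Hypothetical sample data with the same columns as the asker's frame.
df = pd.DataFrame({
    'ingredient': ['a', 'a', 'a', 'b', 'b', 'b'],
    'single': [1.0, 2.0, 3.0, 1.0, 2.0, 4.0],
    'total': [2.0, 4.1, 5.9, 3.0, 5.0, 9.1],
})

rows = []
for i in df.ingredient.unique():
    mask = df.ingredient == i
    res = stats.linregress(df.loc[mask, 'single'], df.loc[mask, 'total'])
    # One labelled row per ingredient instead of overwriting the file each pass.
    rows.append([i, res.slope, res.intercept, res.rvalue, res.pvalue, res.stderr])

sum_table = pd.DataFrame(
    rows, columns=['ingredient', 'slope', 'intercept', 'rvalue', 'pvalue', 'stderr'])
# float_format is where the digit-shortening goes, e.g. two decimals.
sum_table.to_csv('sum_table.csv', index=False, float_format='%.2f')
```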

I have a comma separated list of email ids in a column in python. I want to extract a sorted list of unique domain names into a new column

I have a column in a python data frame with comma separated list of email ids. I want to extract unique list of domain names, sorted in alphabetical order.
Email Ids                                           Required Output
jgj#myu.com                                         myu.com
abc#gmail.com, lll#yyy.com,xyz#svc.com,abc#yyy.com  gmail.com, svc.com, yyy.com
zya#try.com,abs#cba.com                             cba.com, try.com
I tried the following code, however it returns the output of the first row for all rows:
def Dom1(lpo):
    mylist1 = []
    for i in lpo:
        domain = str(i).split("#")[1]
        domain1 = domain.replace('>', '')
        domain1 = domain1.replace(']', " ")
        if domain1 not in mylist1:
            mylist1.append(domain1)
            mylist1 = sorted(mylist1, key=str.lower)
    return mylist1

df['Email_Id1'] = df.apply(lambda row: Dom1(df['Email_Id']), axis=1)
How to fix this issue?
I assume that the column Email_Id is a list of email ids.
Here is how your dataframe should look. All the values should be lists, even if a list has only one item. I suspect that a single email is not being stored as a list of strings, and this is probably your source of error.
df = pd.DataFrame({ 'Email_Id': [['jgj#myu.com'], ['abc#gmail.com', 'lll#yyy.com', 'xyz#svc.com,abc#yyy.com'], ['zya#try.com','abs#cba.com']] })
df
Initial Dataframe
And then, with a few minor changes and cleanup, here is how you can apply the function:
Apply it to only a series instead of the whole dataframe.
Also, I am not sure why you are calling domain1 = domain.replace('>', '') and domain1 = domain1.replace(']', " "); domain names should not contain such characters.
You don't need to sort after every insertion. Just sort while returning the list, as that runs only once.
Change your variable names so that they make sense.
You could use a Python set, but if you do not have many emails in a single row, a list should do just fine.
def get_domain(emails):
    domains = []
    for email in emails:
        d = str(email).split("#")[1]
        if d not in domains:
            domains.append(d)
    return sorted(domains, key=str.lower)

df['Email_Id1'] = df['Email_Id'].apply(get_domain)
df
Final Dataframe
I would simply do a one-liner here:
df["domains"] = df["emails"].apply(lambda row: [email[email.find("#") + 1:] for email in row]).apply(sorted)
import re

import pandas as pd

col1 = ['jgj#myu.com', 'abc#gmail.com, lll#yyy.com,xyz#svc.com,abc#yyy.com', 'zya#try.com,abs#cba.com']
df1 = pd.DataFrame({'Email Ids': col1})

def getUniqueEmail(st1):
    result_obj = {}
    for i in st1.split(','):
        # Dict keys de-duplicate the domains; checking the full email
        # against the domain keys (as before) never matched anything.
        result_obj[re.sub('^.+#', '', i)] = 1
    return ','.join(sorted(result_obj.keys(), key=str.lower))

df1['Required output'] = df1['Email Ids'].apply(getUniqueEmail)
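If the column holds comma-separated strings (as in the sample table), a set-based sketch keeps the de-duplication and sorting in one place; the frame below is hypothetical sample data matching the question:

```python
import pandas as pd

# Hypothetical frame with comma-separated email strings, as in the question.
df = pd.DataFrame({'Email Ids': [
    'jgj#myu.com',
    'abc#gmail.com, lll#yyy.com,xyz#svc.com,abc#yyy.com',
    'zya#try.com,abs#cba.com',
]})

def unique_domains(cell):
    # Split on commas, take the part after '#', de-duplicate via a set, sort.
    domains = {e.split('#')[1].strip() for e in cell.split(',')}
    return ', '.join(sorted(domains, key=str.lower))

df['Required Output'] = df['Email Ids'].map(unique_domains)
print(df['Required Output'].tolist())
```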

Order file based on numbers in name

I have a bunch of files with names as follows:
tif_files = av_v5_1983_001.tif, av_v5_1983_002.tif, av_v5_1983_003.tif ... av_v5_1984_001.tif, av_v5_1984_002.tif ... av_v5_2021_001.tif, av_v5_2021_002.tif
However, they are not guaranteed to be in any sort of order.
I want to sort them based on names such that files from the same year are sorted together. When I do this
sorted(tif_files, key=lambda x:x.split('_')[-1][:-4])
I get the following result:
av_v5_1983_001.tif, av_v5_1984_001.tif, av_v5_1985_001.tif...av_v5_2021_001.tif
but I want this:
av_v5_1983_001.tif, av_v5_1983_002.tif, av_v5_1983_003.tif...av_v5_1984_001.tif, av_v5_1984_002.tif...av_v5_2021_001.tif, av_v5_2021_002.tif
Take the last two parts of the split name with [2:], e.g. ['1984', '001.tif'], so the year is compared before the index:
tif_files = 'av_v5_1983_001.tif', 'av_v5_1983_002.tif', 'av_v5_1983_003.tif', \
    'av_v5_1984_001.tif', 'av_v5_1984_002.tif', 'av_v5_2021_001.tif', 'av_v5_2021_002.tif'
sorted(tif_files, key=lambda x: x.split('_')[2:])
# ['av_v5_1983_001.tif',
# 'av_v5_1983_002.tif',
# 'av_v5_1983_003.tif',
# 'av_v5_1984_001.tif',
# 'av_v5_1984_002.tif',
# 'av_v5_2021_001.tif',
# 'av_v5_2021_002.tif']
If the version segment can vary (v1, v2, ..., v5, ...), you also need to include it in the sort key, like below:
tif_files = ['av_v1_1983_001.tif', 'av_v5_1983_002.tif', 'av_v6_1983_002.tif','av_v5_1984_001.tif', 'av_v5_1984_002.tif', 'av_v4_2021_001.tif','av_v5_2021_001.tif', 'av_v5_2021_002.tif', 'av_v4_1984_002.tif']
sorted(tif_files, key=lambda x: [x.split('_')[2:], x.split('_')[1]])
Output:
['av_v1_1983_001.tif',
'av_v5_1983_002.tif',
'av_v6_1983_002.tif',
'av_v5_1984_001.tif',
'av_v4_1984_002.tif',
'av_v5_1984_002.tif',
'av_v4_2021_001.tif',
'av_v5_2021_001.tif',
'av_v5_2021_002.tif']
What you did was sort by the 00x index first and then by the year, since x.split('_')[-1] produces 001 and so on. Because Python's sort is stable, sort by the index first, then sort again by the year:
tif_files = sorted(tif_files, key=lambda x: x.split('_')[-1][:-4])
tif_files = sorted(tif_files, key=lambda x: x.split('_')[2])
As long as your naming convention remains consistent, you should be able to just sort them alphanumerically. As such, the below code should work:
sorted(tif_files)
If you instead wanted to sort by the last two numbers in the file name while ignoring the prefix, you would need something a bit more dramatic that would break those numbers out and let you order by them. You could use something like the below:
import pandas as pd
tif_files_list = [[xx, int(xx.split("_")[2]), int(xx.split("_")[3].split(".")[0])] for xx in tif_files]
tif_files_frame = pd.DataFrame(tif_files_list, columns=["Name", "Primary Index", "Secondary Index"])
tif_files_frame_ordered = tif_files_frame.sort_values(["Primary Index", "Secondary Index"], axis=0)
tif_files_ordered = tif_files_frame_ordered["Name"].tolist()
This breaks the numbers in the names out into separate columns of a Pandas Dataframe, then sorts your entries by those broken out columns, at which point you can extract the ordered name column on its own.
If key returns a tuple of two values, the sort function will sort by the first value, then by the second.
please refer to: https://stackoverflow.com/a/5292332/9532450
tif_files = [
    "hea_der_1983_002.tif",
    "hea_der_1983_001.tif",
    "hea_der_1984_002.tif",
    "hea_der_1984_001.tif",
]

def parse(filename: str) -> tuple[str, str]:
    split = filename.split("_")
    return split[2], split[3]

sort = sorted(tif_files, key=parse)
print(sort)
output
['hea_der_1983_001.tif', 'hea_der_1983_002.tif', 'hea_der_1984_001.tif', 'hea_der_1984_002.tif']
Right click your folder and click Sort by >> Name.

I have multiple lists and I want to filter by the most current

I have the following bucket AWS schema:
In my python code, it returns a list of the buckets with their dates.
I need to stick with the most up-to-date of the two main buckets:
I am starting in Python, this is my code:
str_of_ints = [7100, 7144]
for get_in_scenarioid in str_of_ints:
    resultado = s3.list_objects(Bucket=source, Delimiter='/', Prefix=str(get_in_scenarioid) + '/')
    # print(resultado)
    sub_prefix = [val['Prefix'] for val in resultado['CommonPrefixes']]
    for get_in_sub_prefix in sub_prefix:
        resultado2 = s3.list_objects(Bucket=source, Delimiter='/', Prefix=get_in_sub_prefix)  # +'/')
        # print(resultado2)
        get_key_and_last_modified = [val['Key'] for val in resultado2['Contents']] + \
            [val['LastModified'].strftime('%Y-%m-%d %H:%M:%S') for val in resultado2['Contents']]
        print(get_key_and_last_modified)
I would recommend converting your array into a pandas DataFrame and using group by:
import pandas as pd
df = pd.DataFrame([["a", 1], ["a", 2], ["a", 3], ["b", 2], ["b", 4]], columns=["lbl", "val"])
df.groupby(['lbl'], sort=False)['val'].max()
lbl
a    3
b    4
In your case you would also have to split your label into two parts first; better to keep them in separate columns.
Update:
Once you split your label into bucket and sub_bucket, you can return the max values like this:
dfg = df.groupby("main_bucket")
dfm = dfg.max()
res = dfm.reset_index()
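To connect this back to the S3 listing, a sketch of the same idea using hypothetical Key/LastModified pairs shaped like the ones list_objects returns in resultado2['Contents'] (the keys and timestamps below are made up):

```python
import pandas as pd

# Hypothetical listing; in the real code these come from resultado2['Contents'].
listing = pd.DataFrame({
    'Key': ['7100/a/run1', '7100/a/run2', '7144/b/run1'],
    'LastModified': pd.to_datetime(
        ['2023-01-01 10:00', '2023-03-05 09:30', '2023-02-10 12:00']),
})

# Split the main bucket prefix out of the key, then keep the newest row per bucket.
listing['main_bucket'] = listing['Key'].str.split('/').str[0]
latest = listing.sort_values('LastModified').groupby('main_bucket').tail(1)
print(latest[['main_bucket', 'Key']])
```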

Dataframe with arrays and key-pairs

I have a JSON structure which I need to convert into a data frame. I have converted it through the pandas library, but I am having issues with two columns: one is an array and the other is a key-value pair.
Pito                    Value
{"pito-key": "Number"}  [{"WRITESTAMP": "2018-06-28T16:30:36Z", "S":"41bbc22","VALUE":"2"}]
How do I break these columns into data frames?
As far as I understood your question, you can apply regular expressions to do that.
import re

import pandas as pd

data = {'pito': ['{"pito-key": "Number"}'],
        'value': ['[{"WRITESTAMP": "2018-06-28T16:30:36Z", "S":"41bbc22","VALUE":"2"}]']}
df = pd.DataFrame(data)

def get_value(s):
    s = s['value']
    v = re.findall(r'VALUE\":\".*\"', s)
    return int(v[0][8:-1])

def get_pito(s):
    s = s['pito']
    v = re.findall(r'key\": \".*\"', s)
    return v[0][7:-1]

df['value'] = df.apply(get_value, axis=1)
df['pito'] = df.apply(get_pito, axis=1)
df.head()
Here I create two functions that transform your strings into the values you want them to have.
Let me know if that's not what you meant
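Since both cells appear to be valid JSON, json.loads could replace the regexes entirely; a sketch with the same sample data, on the assumption the real strings parse as JSON too:

```python
import json

import pandas as pd

data = {'pito': ['{"pito-key": "Number"}'],
        'value': ['[{"WRITESTAMP": "2018-06-28T16:30:36Z", "S":"41bbc22","VALUE":"2"}]']}
df = pd.DataFrame(data)

# Parse each cell once, then pull out just the fields of interest.
df['pito'] = df['pito'].map(lambda s: json.loads(s)['pito-key'])
df['value'] = df['value'].map(lambda s: int(json.loads(s)[0]['VALUE']))
print(df)
```

This avoids maintaining regex offsets like v[0][8:-1] if the field lengths ever change.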
