Looking for a better way to accomplish dataframe to dictionary by series - python

Here's a portion of what the Excel file looks like. Meant to include this the first time. Thanks for the help so far.
Name                 Phone Number  Carrier
FirstName LastName1  3410142531    Alltel
FirstName LastName2  2437201754    AT&T
FirstName LastName3  9247224091    Boost Mobile
FirstName LastName4  6548310018    Cricket Wireless
FirstName LastName5  8811620411    Project Fi
I am converting a list of names, phone numbers, and carriers to a dictionary for easy reference by other code. The idea is separate code will be able to call a name and access that person's phone number and carrier.
I got the output I need, but I'm wondering if there's an easier way I could have accomplished this task and gotten the same output. Though it's fairly concise, I'm interested in any module or built-in of which I'm not aware. My python skills are beginner at best. I wrote this in Thonny with Python 3.6.4. Thanks!
# Imports
import pandas as pd
import math

# Assign spreadsheet filename to `file`
file = 'Phone_Numbers.xlsx'

# Load spreadsheet
xl = pd.ExcelFile(file)

# Load a sheet into a DataFrame by name: df1
df1 = xl.parse('Sheet1', header=0)

# Put the dataframe into a dictionary to start
phone_numbers = df1.to_dict(orient='records')

# Converts Phone_Numbers.xlsx to a dictionary
x = 0
temp_dict = {}
for item in phone_numbers:
    temp_list = []
    for key in phone_numbers[x]:
        tempholder = phone_numbers[x][key]
        # Checks to see if there is a blank and if the phone number comes up as a float
        if (isinstance(tempholder, float) or isinstance(tempholder, int)) and not math.isnan(tempholder):
            # Converts any floats to strings for use in later code
            tempholder = str(int(tempholder))
        temp_list.append(tempholder)
    # Makes the first item in the list the key and adds the rest as values
    temp_dict[temp_list[0]] = temp_list[1:]
    x += 1
print(temp_dict)
Here's the desired output:
{'FirstName LastName1': ['3410142531', 'Alltel'], 'FirstName LastName2': ['2437201754', 'AT&T'], 'FirstName LastName3': ['9247224091', 'Boost Mobile'], 'FirstName LastName4': ['6548310018', 'Cricket Wireless'], 'FirstName LastName5': ['8811620411', 'Project Fi']}

One way to do it would be to iterate through the dataframe and use a dictionary comprehension:
temp_dict = {row['Name']:[row['Phone Number'], row['Carrier']] for _, row in df.iterrows()}
where df is your original dataframe (the result of xl.parse('Sheet1', header=0)). This iterates through all rows in your dataframe, creating a dictionary key for each Name, with phone number and carrier as its values (in a list), as you indicated in your output.
To make sure that the phone number is not null (as you did in your loop), you can add an if clause to the dict comprehension, like this:
temp_dict = {row['Name']: [row['Phone Number'], row['Carrier']]
             for _, row in df.iterrows()
             if not math.isnan(row['Phone Number'])}

df.set_index('Name').T.to_dict('list')
should do the job. Here df is your dataframe.
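To see what that one-liner produces, here is a minimal sketch on a made-up two-row frame (column names follow the question's spreadsheet):

```python
import pandas as pd

# Two sample rows shaped like the question's spreadsheet
df = pd.DataFrame({
    'Name': ['FirstName LastName1', 'FirstName LastName2'],
    'Phone Number': ['3410142531', '2437201754'],
    'Carrier': ['Alltel', 'AT&T'],
})

# set_index makes Name the index; .T turns each name into a column;
# to_dict('list') then maps each column label to its values
result = df.set_index('Name').T.to_dict('list')
print(result)
# {'FirstName LastName1': ['3410142531', 'Alltel'],
#  'FirstName LastName2': ['2437201754', 'AT&T']}
```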


Index count being wrapped into the data as one object

First time on stackoverflow so bear with me. Code is below. Basically, the df_history is a dataframe with different variables. I am trying to pull the 'close' variable and sort it based on the categorical type of the currency.
When I pull data over using the .query command, it gives me one object with all the individual observations together, separated by spaces. I know how to separate that back into independent data, but the issue is that it is pulling the index count along with the observations. In the image you can see 179, 178, 177, etc. in the BTC object. I don't want that there and didn't intend to pull it. How do I get rid of that?
additional_rows = []
for currency in selected_coins:
    df_history = df_history.sort_values(['date'], ascending=True)
    row_data = [currency,
                df_history.query('granularity == "daily" and currency == @currency')['close'],
                df_history.query('granularity == "daily" and currency == @currency').head(180)['close'].pct_change(),
                df_history['date']
                ]
    additional_rows.append(row_data)

df_additional_info = pd.DataFrame(additional_rows, columns=['currency',
                                                            'close',
                                                            'returns',
                                                            'df_history'])
df_additional_info.set_index('currency').transpose()

import ast
list_of_lists = df_additional_info.close.to_list()
flat_list = [i for sublist in list_of_lists for i in ast.literal_eval(sublist)]
uniq_list = list(set(flat_list))
len(uniq_list), len(flat_list)
I was trying to pull data from one data frame to the next and sort it based on a categorical input from the currency variable. It is not transferring over well
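For what it's worth, a likely cause is that df_history.query(...)['close'] is a pandas Series, and a Series carries its index along when it is stored in a cell and printed; converting it to a plain list drops the index. A minimal sketch with made-up data:

```python
import pandas as pd

# Made-up stand-in for df_history
df_history = pd.DataFrame({'currency': ['BTC', 'BTC', 'ETH'],
                           'close': [100.0, 110.0, 9.0]})

# A Series keeps its index, which is where 179, 178, 177... come from
as_series = df_history.query('currency == "BTC"')['close']

# .tolist() keeps only the values
as_list = df_history.query('currency == "BTC"')['close'].tolist()
print(as_list)  # [100.0, 110.0]
```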

I have a comma-separated list of email ids in a column in python. I want to extract a unique list of domain names (sorted) in a new column

I have a column in a python data frame with a comma-separated list of email ids. I want to extract a unique list of domain names, sorted in alphabetical order.
Email Ids                                             Required Output
jgj@myu.com                                           myu.com
abc@gmail.com, lll@yyy.com, xyz@svc.com, abc@yyy.com  gmail.com, svc.com, yyy.com
zya@try.com, abs@cba.com                              cba.com, try.com
I tried the following code; however, it returns the output of the first row for all rows:
def Dom1(lpo):
    mylist1 = []
    for i in lpo:
        domain = str(i).split("@")[1]
        domain1 = domain.replace('>', '')
        domain1 = domain1.replace(']', " ")
        if domain1 not in mylist1:
            mylist1.append(domain1)
        mylist1 = sorted(mylist1, key=str.lower)
    return mylist1

df['Email_Id1'] = df.apply(lambda row: Dom1(df['Email_Id']), axis=1)
How to fix this issue?
I assume that the column Email_Id is a list of email ids.
Here is how your dataframe should look. All the values should be lists, even rows with only one item. I have a feeling that a single email is not being stored as a list of strings, and this is probably your source of error.
df = pd.DataFrame({'Email_Id': [['jgj@myu.com'], ['abc@gmail.com', 'lll@yyy.com', 'xyz@svc.com', 'abc@yyy.com'], ['zya@try.com', 'abs@cba.com']]})
df
Initial Dataframe
And then, with a few minor changes and cleanup, here is how you can apply the lambda function.
Apply it to only a Series instead of the whole dataframe.
Also, I am not sure why you are calling domain1 = domain.replace('>', '') and domain1 = domain1.replace(']', ' '); domain names should not contain such characters.
You don't need to sort after every insertion. Just sort while returning the list, as that will be done only once.
Change your variable names so that they make sense.
You could use a python set, but if you do not have a lot of emails in a single row, a list should do just fine.
def get_domain(emails):
    domains = []
    for email in emails:
        d = str(email).split("@")[1]
        if d not in domains:
            domains.append(d)
    return sorted(domains, key=str.lower)

df['Email_Id1'] = df['Email_Id'].apply(lambda x: get_domain(x))
df
Final Dataframe
I would simply do a one-liner here:
df["domains"] = df["emails"].apply(lambda row: [email[email.find("@") + 1:] for email in row]).apply(sorted)
import re

col1 = ['jgj@myu.com', 'abc@gmail.com, lll@yyy.com,xyz@svc.com,abc@yyy.com', 'zya@try.com,abs@cba.com']
df1 = pd.DataFrame({'Email Ids': col1})

def getUniqueEmail(st1):
    result_obj = {}
    for i in st1.split(','):
        if i not in result_obj:
            result_obj[re.sub('^.+@', '', i)] = 1
    return ','.join(sorted(list(result_obj.keys()), key=str.lower))

df1['Required output'] = df1['Email Ids'].apply(lambda x: getUniqueEmail(x))

Parsing a zipcode boundary geojson file based on a condition, then appending to a new json file if the condition is met

I have a geojson file of zipcode boundaries.
with open('zip_geo.json') as f:
    gj = geojson.load(f)
gj['features'][0]['properties']
Prints out:
{'STATEFP10': '36',
'ZCTA5CE10': '12205',
'GEOID10': '3612205',
'CLASSFP10': 'B5',
'MTFCC10': 'G6350',
'FUNCSTAT10': 'S',
'ALAND10': 40906445,
'AWATER10': 243508,
'INTPTLAT10': '+42.7187855',
'INTPTLON10': '-073.8292399',
'PARTFLG10': 'N'}
I also have a pandas dataframe with one of the fields being the zipcode.
I want to create a new geojson file containing only the elements whose 'ZCTA5CE10' value is present in the zipcode column of my dataframe.
How would I go about doing this?
I was thinking of something like this (pseudo code):
new_dict = {}
for index, item in enumerate(gj):
    if item['features'][index]['properties']['ZCTA5CE10'] in df['zipcode']:
        new_dict += item
The syntax of the code above is obviously wrong, but I am not sure how to parse though multiple nested dictionaries and append the results to a new dictionary.
Link to the geojson file : https://github.com/OpenDataDE/State-zip-code-GeoJSON/blob/master/ny_new_york_zip_codes_geo.min.json
In short I want to remove all the elements relating to the zipcodes that are not there in the zipcode column in my dataframe.
Try this; just update your ziplist. Then you can save the new json to a local file.
import requests
import json

ziplist = ['12205', '14719', '12193', '12721']  # list of zips in your dataframe
url = 'https://github.com/OpenDataDE/State-zip-code-GeoJSON/raw/master/ny_new_york_zip_codes_geo.min.json'
gj = requests.get(url).json()

inziplist = []
for ft in gj['features']:
    if ft['properties']['ZCTA5CE10'] in ziplist:
        print(ft['properties']['ZCTA5CE10'])
        inziplist.append(ft)

print(len(inziplist))

new_zip_json = {}
new_zip_json['type'] = 'FeatureCollection'
new_zip_json['features'] = inziplist
new_zip_json = json.dumps(new_zip_json)
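To save the result to a local file as mentioned, a minimal sketch; the single inline feature and the output filename are made up for illustration:

```python
import json
import os
import tempfile

# Stand-in for the filtered features collected above
inziplist = [{'type': 'Feature',
              'properties': {'ZCTA5CE10': '12205'},
              'geometry': None}]

new_zip_json = {'type': 'FeatureCollection', 'features': inziplist}

# Write the FeatureCollection to disk as GeoJSON
out_path = os.path.join(tempfile.gettempdir(), 'ny_zips_subset.geojson')
with open(out_path, 'w') as f:
    json.dump(new_zip_json, f)

# Round-trip check that the file is valid JSON
with open(out_path) as f:
    reloaded = json.load(f)
print(len(reloaded['features']))  # 1
```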

Mining for Term that is "Included In" Entry Rather than "Equal To"

I am doing some data mining. I have a database that looks like this (pulling out three lines):
100324822$10032482$1$PS$BENICAR$OLMESARTAN MEDOXOMIL$1$Oral$UNK$$$Y$$$$021286$$$TABLET$
1014687010$10146870$2$SS$BENICAR HCT$HYDROCHLOROTHIAZIDE\OLMESARTAN MEDOXOMIL$1$Oral$1/2 OF 40/25MG TABLET$$$Y$$$$$.5$DF$FILM-COATED TABLET$QD
115700162$11570016$5$C$Olmesartan$OLMESARTAN$1$Unknown$UNK$$$U$U$$$$$$$
My Code looks like this :
with open('DRUG20Q4.txt') as fileDrug20Q4:
    drugTupleList20Q4 = [tuple(map(str, i.split('$'))) for i in fileDrug20Q4]

drug20Q4 = []
for entryDrugPrimaryID20Q4 in drugTupleList20Q4:
    drug20Q4.append((entryDrugPrimaryID20Q4[0], entryDrugPrimaryID20Q4[3], entryDrugPrimaryID20Q4[5]))

drugNameDataFrame20Q4 = pd.DataFrame(drug20Q4, columns=['PrimaryID', 'Role', 'Drug Name'])
drugNameDataFrame20Q4 = pd.DataFrame(drugNameDataFrame20Q4.loc[drugNameDataFrame20Q4['Drug Name'] == 'OLMESARTAN'])
Currently the code pulls only entries with the exact name "OLMESARTAN". How do I capture all the variations, for instance "OLMESARTAN MEDOXOMIL", etc.? I can't simply list all the varieties, as there are endless variations, so I need something that captures anything with the term "OLMESARTAN" in it.
Thanks!
You can use str.contains to get what you are looking for.
Here's an example (using a string I found in the documentation):
import pandas as pd

df = pd.DataFrame()
item = 'Return boolean Series or Index based on whether a given pattern or regex is contained within a string of a Series or Index.'
df['test'] = item.split(' ')
df[df['test'].str.contains('de')]
This outputs:
      test
4    Index
22  Index.
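Applied to the asker's frame, the same idea might look like this; the three rows below are the sample records from the question, trimmed to the relevant columns:

```python
import pandas as pd

# Sample rows from the question's $-delimited database
drugNameDataFrame20Q4 = pd.DataFrame(
    {'PrimaryID': ['100324822', '1014687010', '115700162'],
     'Role': ['PS', 'SS', 'C'],
     'Drug Name': ['OLMESARTAN MEDOXOMIL',
                   'HYDROCHLOROTHIAZIDE\\OLMESARTAN MEDOXOMIL',
                   'OLMESARTAN']})

# str.contains matches the term anywhere in the value,
# not just exact equality
matches = drugNameDataFrame20Q4[
    drugNameDataFrame20Q4['Drug Name'].str.contains('OLMESARTAN')]
print(len(matches))  # 3 - all sample rows contain the term
```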

Python reading Excel spreadsheet, creating multiple lists according to variables and conditions

Hi, there's an Excel spreadsheet showing Product ID and Location, like below.
I want to list all the locations of each product ID in sequence with no duplication.
For example:
53424 has Phoenix, Matsuyama, Phoenix, Matsuyama, Phoenix, Matsuyama, Phoenix.
56224 has Samarinda, Boise, Seoul.
etc.
What's the best way to achieve it with Python?
I can only read the cells in the spreadsheet but have no idea how best to proceed.
Thank you.
the_file = xlrd.open_workbook("C:\\excel file.xlsx")
the_sheet = the_file.sheet_by_name("Sheet1")
for row_index in range(0, the_sheet.nrows):
    product_id = the_sheet.cell(row_index, 0).value
    location = the_sheet.cell(row_index, 1).value
You need to make use of itertools' groupby() function to take away the duplicates, as follows:
from collections import defaultdict
from itertools import groupby
import xlrd

the_file = xlrd.open_workbook(r"excel file.xlsx")
the_sheet = the_file.sheet_by_name("Sheet1")
products = defaultdict(list)
for row_index in range(1, the_sheet.nrows):
    products[int(the_sheet.cell(row_index, 0).value)].append(the_sheet.cell(row_index, 1).value)
for product, v in sorted(products.items()):
    print("{} has {}.".format(product, ', '.join(k for k, g in groupby(v))))
This uses a defaultdict(list) to build your products: each key in the dictionary holds a product ID, and the value is automatically a list of the matching entries. Then groupby() is used to read out each raw value and give you only one entry where there are consecutive identical values. Finally, the list this produces is joined together with commas between the items.
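A minimal demo of the consecutive-duplicate collapse described above, on a made-up location list (note groupby() only merges adjacent repeats, which is exactly the "in sequence with no duplication" behavior asked for):

```python
from itertools import groupby

# Made-up list with consecutive repeats
locations = ['Phoenix', 'Phoenix', 'Matsuyama', 'Phoenix']

# groupby yields one (key, group) pair per run of equal values
deduped = [k for k, _ in groupby(locations)]
print(deduped)  # ['Phoenix', 'Matsuyama', 'Phoenix']
```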
You should use a dictionary to store the data from excel and then traverse it according to product ID.
So, the following code should help you out:
the_file = xlrd.open_workbook("C:\\excel file.xlsx")
the_sheet = the_file.sheet_by_name("Sheet1")
dataset = dict()
for row_index in range(0, the_sheet.nrows):
    product_id = the_sheet.cell(row_index, 0).value
    location = the_sheet.cell(row_index, 1).value
    if product_id in dataset:
        dataset[product_id].append(location)
    else:
        dataset[product_id] = [location]
for product_id in sorted(dataset.keys()):
    print("{0} has {1}.".format(product_id, ", ".join(dataset[product_id])))
The above will preserve the order of locations per product_id (in sequence).
