So the problem is that it is a CSV file, and when I open it with pandas it looks like this:
data=pd.read_csv('test.csv', sep=',', usecols=['properties'])
data.head()
It is like a dictionary in each row; I'm just confused about how to open it correctly, with gender, document_type, etc. as columns:
{'gender': 'Male', 'nationality': 'IRL', 'document_type': 'passport', 'date_of_expiry': '2019-08-12', 'issuing_country': 'IRL'}
{'gender': 'Female', 'document_type': 'driving_licence', 'date_of_expiry': '2023-02-28', 'issuing_country': 'GBR'}
{'gender': 'Male', 'nationality': 'ITA', 'document_type': 'passport', 'date_of_expiry': '2018-06-09', 'issuing_country': 'ITA'}
It looks like the CSV file is not properly formatted to be read by the default pandas function. You will need to create the columns yourself:
import ast

data['gender'] = data['properties'].apply(ast.literal_eval).str['gender']

And so on, for each one of the fields in the dictionary you have (each cell is a string, so it has to be parsed into a real dict first).
If there are too many columns to create by hand, you can evaluate the string for one row and loop over its keys, like this:
import ast

# discover the keys from the first row, then parse every cell and
# pull each key out into its own column
my_dict = ast.literal_eval(data.loc[0, 'properties'])
data['properties'] = data['properties'].apply(ast.literal_eval)
for key in my_dict.keys():
    data[key] = data['properties'].str[key]
This should build your DataFrame just fine
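If you'd rather avoid the explicit loop, here is a minimal alternative sketch, assuming pandas 1.0+ for pd.json_normalize and that every cell parses cleanly as a dict:

import ast
import pandas as pd

data = pd.read_csv('test.csv', usecols=['properties'])

# parse each string cell into a real dict, then expand every key into its
# own column; rows that lack a key (e.g. 'nationality') simply get NaN
expanded = pd.json_normalize(data['properties'].apply(ast.literal_eval).tolist())
print(expanded.head())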
I've created a generator object and want to write it out into a CSV file so I can upload it to an external tool. At the minute the generator returns records as separate dictionaries, but they don't appear to have any commas separating the records/dictionaries, and when I write the file out to a txt file and reload it back into the script it comes back as a <class 'str'>.
The generator object, matches, contains records like the following:
{'type_of_reference': 'JOUR', 'title': 'Ranking evidence in substance use and addiction', 'secondary_title': 'International Journal of Drug Policy', 'alternate_title1': 'Int. J. Drug Policy', 'volume': '83', 'year': '2020', 'doi': '10.1016/j.drugpo.2020.102840'}
{'type_of_reference': 'JOUR', 'title': 'Methods used in the selection of instruments for outcomes included in core outcome sets have improved since the publication of the COSMIN/COMET guideline', 'secondary_title': 'Journal of Clinical Epidemiology', 'alternate_title1': 'J. Clin. Epidemiol.', 'volume': '125', 'start_page': '64', 'end_page': '75', 'year': '2020', 'doi': '10.1016/j.jclinepi.2020.05.021',}
This is the result of the following generator function, which compares each record's 'doi' key against a set of DOIs from another file:
def match_record():
    with open(filename_ris) as f:
        ris_records = readris(f)
        for entry in ris_records:
            if entry['doi'] in doi_match:
                yield entry
I've written the generator matches out to a txt file using the following code, to check that the correct records have been kept.
with open('output.txt', 'w') as f:
    for x in matches:
        f.write(str(x))
What I have is not a list of dictionaries, nor dictionaries separated by commas, so I'm a bit confused about how to read/load it into pandas effectively. I want to load it into pandas to drop certain series [keys] and then write it out as a CSV once completed.
Reading it in with pd.read_csv just returns the key: value pairs of all the separate records as column headers, which is no surprise, but I don't know what to do before this step.
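For what it's worth, one way to skip the intermediate text file entirely is to build the DataFrame straight from the generator; a minimal sketch, assuming matches yields plain dicts as shown above (the dropped column name is only illustrative):

import pandas as pd

# each dict yielded by the generator becomes one row; keys missing from a
# record simply become NaN in that row
df = pd.DataFrame(list(match_record()))

# drop whichever series/keys you don't need (illustrative column name)
df = df.drop(columns=['alternate_title1'], errors='ignore')

df.to_csv('matches.csv', index=False)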
Given an output of df=pandas.read_csv(somePath,header=None):
0 1
0 Name Bambang
1 Gender Male
2 Age 25
How to convert it into:
dict_data = {
    'Name': 'Bambang',
    'Gender': 'Male',
    'Age': 25
}
I can do it, but in a roundabout way:
df = pandas.read_csv(somePath, header=None)
df = df.set_index([0])
theDict = df.to_dict()
theDict = theDict[1]
Is there a native and simple way to do it using pandas.read_csv() or a built-in Python command? Thank you.
The assumption is that you've read the data and want it as a dict. Something like this could work:
df.set_index(0).T.to_dict('records')[0]
{'Name': 'Bambang', 'Gender': 'Male', 'Age': '25'}
(Note that with header=None the column labels are integers, so the index column is 0, not '0'.)
Also, if you really want to do this, it would be better to just use Python's csv reader to get your dict, instead of the roundabout way of pandas first and then dict.
This is how the data looks in data.txt; I'm not sure if this replicates exactly what you have:
data = '''
Name Bambang
Gender Male
Age 25'''
import csv
A = []
with open('data.txt', newline='') as csvfile:
    content = csv.reader(csvfile, delimiter=' ')
    for row in content:
        # drop the empty strings produced by the repeated spaces
        A.append([entry for entry in row if entry != ''])

dict(A)
{'Name': 'Bambang', 'Gender': 'Male', 'Age': '25'}
UPDATE: thanks to @AMC, it is much simpler from the pandas end: get the numpy values (an array of [key, value] rows, which dict() accepts as an iterable of pairs) and apply dict:
dict(df.to_numpy())
{'Name': 'Bambang', 'Gender': 'Male', 'Age': '25'}
I am using pandas to read a CSV which contains a phone_number field (string); however, I need to convert this field into the JSON format below:
[{'phone_number':'+01 373643222'}]
and put it under a new column named phone_numbers. How can I do that? I searched online, but the examples I found convert all of the columns into JSON using to_json(), which apparently cannot solve my case.
Below is an example
import pandas as pd

df = pd.DataFrame({'user': ['Bob', 'Jane', 'Alice'],
                   'phone_number': ['+1 569-483-2388', '+1 555-555-1212', '+1 432-867-5309']})
Use the map function, like this:
df["phone_numbers"] = df["phone_number"].map(lambda x: [{"phone_number": x}] )
display(df)
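If the external tool needs an actual JSON string rather than a Python list of dicts (an assumption about the use case), json.dumps can be applied inside the same map; a minimal sketch:

import json

# serialize each entry as a JSON string instead of a Python list of dicts
df["phone_numbers"] = df["phone_number"].map(
    lambda x: json.dumps([{"phone_number": x}])
)
display(df)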
I was wondering if there was a way to read in Categorical values during the read_csv() process.
Normally you can do the conversion after the fact with something like:
df.zone = df.zone.astype('category')
At this point the df takes up more memory than it needs to, and I'm looking for a way to reduce that.
I've tried things like:
parking_meters = pd.read_csv('parking_meter_data.csv',
                             converters={'zone': pd.Categorical(),
                                         'sub_area': pd.Categorical(),
                                         'area': pd.Categorical(),
                                         'config_name': pd.Categorical(),
                                         'pole': str(),
                                         'longitude': np.float(),
                                         'latitude': np.float()})
parking_meters.memory_usage(deep=True).sum()
However, Categorical needs to be initialized with the actual data, which is inside the CSV file.
Let's try with dtype:
parking_meters = pd.read_csv('parking_meter_data.csv',
                             dtype={'zone': 'category',
                                    'sub_area': 'category',
                                    'area': 'category',
                                    'config_name': 'category'})
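To confirm the saving, you can compare both reads; a quick sketch, assuming the same file and the column names from the question:

import pandas as pd

plain = pd.read_csv('parking_meter_data.csv')
compact = pd.read_csv('parking_meter_data.csv',
                      dtype={'zone': 'category',
                             'sub_area': 'category',
                             'area': 'category',
                             'config_name': 'category'})

# deep=True counts the actual string storage, not just the object pointers
print(plain.memory_usage(deep=True).sum())
print(compact.memory_usage(deep=True).sum())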
I am working on loading a dataset from a pickle file like this
""" Load the dictionary containing the dataset """
with open("final_project_dataset.pkl", "r") as data_file:
data_dict = pickle.load(data_file)
It works fine and loads the data correctly. This is an example of one row:
'GLISAN JR BEN F': {'salary': 274975, 'to_messages': 873, 'deferral_payments': 'NaN', 'total_payments': 1272284, 'exercised_stock_options': 384728, 'bonus': 600000, 'restricted_stock': 393818, 'shared_receipt_with_poi': 874, 'restricted_stock_deferred': 'NaN', 'total_stock_value': 778546, 'expenses': 125978, 'loan_advances': 'NaN', 'from_messages': 16, 'other': 200308, 'from_this_person_to_poi': 6, 'poi': True, 'director_fees': 'NaN', 'deferred_income': 'NaN', 'long_term_incentive': 71023, 'email_address': 'ben.glisan@enron.com', 'from_poi_to_this_person': 52}
Now, how can I get the number of features, e.g. (salary, to_messages, ..., from_poi_to_this_person)?
I got this row by printing my whole dataset (print data_dict); this is one of the results. I want to know how many features there are in general, i.e. in the whole dataset, without specifying a key in the dictionary.
Thanks
Try this.
no_of_features = len(data_dict[data_dict.keys()[0]])
This will work only if all the keys in data_dict have the same number of features.
or simply
no_of_features = len(data_dict['GLISAN JR BEN F'])
""" Load the dictionary containing the dataset """
with open("final_project_dataset.pkl", "r") as data_file:
data_dict = pickle.load(data_file)
print len(data_dict)
I think you want to find out the size of the set of all unique field names used in the row dictionaries. You can find that like this:
data_dict = {
    'red': {'alpha': 1, 'bravo': 2, 'golf': 3, 'kilo': 4},
    'green': {'bravo': 1, 'delta': 2, 'echo': 3},
    'blue': {'foxtrot': 1, 'tango': 2}
}

unique_features = set(
    feature
    for row_dict in data_dict.values()
    for feature in row_dict.keys()
)
print(unique_features)
# {'golf', 'delta', 'foxtrot', 'alpha', 'bravo', 'echo', 'tango', 'kilo'}
print(len(unique_features))
# 8
Apply sum to the len of each nested dictionary:
sum(len(v) for _, v in data_dict.items())
v represents a nested dictionary object.
Dictionaries naturally yield their keys when you iterate over them, so calling len returns the number of keys in each nested dictionary, i.e. the number of features.
If the features may be duplicated across nested objects, then collect them in a set and apply len:
len(set(f for v in data_dict.values() for f in v.keys()))
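Since the rest of this thread is pandas-centric, here is one more sketch, assuming pandas is available and the nested dicts look like the row shown above: loading the dict of dicts into a DataFrame makes the columns the union of all feature names, so counting them gives the same answer.

import pandas as pd

# rows are people, columns are the union of all feature names;
# features missing from a given person become NaN
df = pd.DataFrame.from_dict(data_dict, orient='index')
print(len(df.columns))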
Here is the answer:
https://discussions.udacity.com/t/lesson-5-number-of-features/44253/4
where we choose one person, in this case SKILLING JEFFREY K, within the database called enron_data, and then print the length of the keys in that dictionary:
print len(enron_data["SKILLING JEFFREY K"].keys())