I have an API that returns a single row of data as a Python dictionary. Most of the keys have a single value, but some of the keys have values that are lists (or even lists-of-lists or lists-of-dictionaries).
When I pass the dictionary to pd.DataFrame to convert it to a pandas DataFrame, it throws an "Arrays must be the same length" error. This is because it cannot process the keys that have multiple values (i.e. the keys whose values are lists).
How do I get pandas to treat the lists as 'single values'?
As a hypothetical example:
data = {'building': 'White House', 'DC?': True,
        'occupants': ['Barack', 'Michelle', 'Sasha', 'Malia']}
I want to turn it into a DataFrame like this:
ix building DC? occupants
0 'White House' True ['Barack', 'Michelle', 'Sasha', 'Malia']
Passing the dict directly spreads the occupants list across several rows (this depends on the pandas version; you may get an error instead), whereas passing a list of rows (i.e. wrapping the dict in a list) keeps it as a single cell:
In [11]: pd.DataFrame(data)
Out[11]:
DC? building occupants
0 True White House Barack
1 True White House Michelle
2 True White House Sasha
3 True White House Malia
In [12]: pd.DataFrame([data])
Out[12]:
DC? building occupants
0 True White House [Barack, Michelle, Sasha, Malia]
This turns out to be very trivial in the end
data = { 'building': 'White House', 'DC?': True, 'occupants': ['Barack', 'Michelle', 'Sasha', 'Malia'] }
df = pandas.DataFrame([data])
print(df)
Which results in:
DC? building occupants
0 True White House [Barack, Michelle, Sasha, Malia]
Here is a solution for making a DataFrame from a dictionary of lists, where the keys become a sorted index and the column names are supplied explicitly. It is handy for building DataFrames from scraped HTML tables.
d = {'B': [10, 11], 'A': [20, 21]}
df = pd.DataFrame(list(d.values()), columns=['C1', 'C2'], index=list(d.keys())).sort_index()
df
C1 C2
A 20 21
B 10 11
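A related shortcut, assuming a reasonably recent pandas (from_dict only accepts a columns argument together with orient='index', added in 0.23): you can get the same result without building the values/keys lists by hand.
import pandas as pd

d = {'B': [10, 11], 'A': [20, 21]}

# keys become the (sorted) index, and column names are supplied directly
df = pd.DataFrame.from_dict(d, orient='index', columns=['C1', 'C2']).sort_index()
print(df)
#    C1  C2
# A  20  21
# B  10  11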
Would it be acceptable if, instead of having one entry with a list of occupants, you had individual entries for each occupant? If so, you could just do:
n = len(data['occupants'])
for key, val in data.items():
    if key != 'occupants':
        data[key] = n * [val]
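For what it's worth, here's a minimal sketch of what that pre-processing buys you, reusing the data dict from the question after the loop above has run (so every column now has length n):
import pandas as pd

df = pd.DataFrame(data)
print(df)
#       building   DC? occupants
# 0  White House  True    Barack
# 1  White House  True  Michelle
# 2  White House  True     Sasha
# 3  White House  True     Malia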
EDIT: Actually, I'm getting this behavior in pandas (i.e. just with pd.DataFrame(data)) even without this pre-processing. What version are you using?
I had a closely related problem, but my data structure was a multi-level dictionary with lists in the second level dictionary:
result = {'hamster': {'confidence': 1, 'ids': ['id1', 'id2']},
'zombie': {'confidence': 1, 'ids': ['id3']}}
When importing this with pd.DataFrame([result]), I end up with columns named hamster and zombie. The (for me) correct import would be to have these as row titles, and confidence and ids as column titles. To achieve this, I used pd.DataFrame.from_dict:
In [42]: pd.DataFrame.from_dict(result, orient="index")
Out[42]:
confidence ids
hamster 1 [id1, id2]
zombie 1 [id3]
This works for me with Python 3.8 + pandas 1.2.3.
If you know the keys of the dictionary beforehand, why not first create an empty DataFrame and then keep adding rows?
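A minimal sketch of that idea, reusing the example data from the question; note that growing a DataFrame row by row gets slow for many rows, so collecting the row dicts in a list and constructing the frame once is usually preferable:
import pandas as pd

data = {'building': 'White House', 'DC?': True,
        'occupants': ['Barack', 'Michelle', 'Sasha', 'Malia']}

# start with an empty frame that already has the known columns
df = pd.DataFrame(columns=list(data.keys()))

# add each incoming row by concatenating a one-row frame onto it
# (recent pandas versions may warn about concatenating an empty frame)
df = pd.concat([df, pd.DataFrame([data])], ignore_index=True)
print(df)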
I want to get a sub-dataframe whose rows contain all the elements in a list.
Take this DataFrame as an example:
my_dict = {
    'Job': ['Painting', 'Capentry', 'Teacher', 'Farming'],
    'Job_Detail': ['all sort of painting',
                   'kitchen utensils, all types of roofing etc.',
                   'skill and practical oriented teaching',
                   'all agricultural practices']
}
df = pd.DataFrame(my_dict)
The output looks like this:
Job Job_Detail
0 Painting all sort of painting
1 Capentry kitchen utensils, all types of roofing etc.
2 Teacher skill and practical oriented teaching
3 Farming all agricultural practices
my_lst = ['of', 'all']
I want to filter df with my_lst to get a sub-DataFrame that looks like this:
Job Job_Detail
0 Painting all sort of painting
1 Capentry kitchen utensils, all types of roofing etc.
I've tried df[df.Job_Detail.isin(['of', 'all'])] but it returns an empty DataFrame.
I'm no pandas expert, but the best function to use here seems to be str.contains.
From the docs:
Series.str.contains(pat, case=True, flags=0, na=None, regex=True)
Test if pattern or regex is contained within a string of a Series or Index.
Return boolean Series or Index based on whether a given pattern or regex is contained within a string of a Series or Index.
Edit: This masks using or, not and
import pandas as pd

my_dict = {
    'Job': ['Painting', 'Capentry', 'Teacher', 'Farming'],
    'Job_Detail': ['all sort of painting',
                   'kitchen utensils, all types of roofing etc.',
                   'skill and practical oriented teaching',
                   'all agricultural practices']
}
my_lst = ['of', 'all']
df = pd.DataFrame(my_dict)
print(df)

mask = df.Job_Detail.str.contains('|'.join(my_lst), regex=True)
print(df[mask])
Here's a solution that masks using and:
import pandas as pd

my_dict = {
    'Job': ['Painting', 'Capentry', 'Teacher', 'Farming'],
    'Job_Detail': ['all sort of painting',
                   'kitchen utensils, all types of roofing etc.',
                   'skill and practical oriented teaching',
                   'all agricultural practices']
}
my_lst = ['of', 'all']
df = pd.DataFrame(my_dict)
print(df)
print("------")

# build one boolean mask per word, then keep rows where every mask is True
masks = [df.Job_Detail.str.contains(word) for word in my_lst]
mask = pd.concat(masks, axis=1).all(axis=1)
print(df[mask])
@Lone Your code answered a different question, but it helped me arrive at the answer. Thank you, appreciated.
Here's the closest to what I needed:
df[(df.Job_Detail.str.contains('of')) & (df.Job_Detail.str.contains('all'))]
My output from the view.py query below looks like the following JSON. As you can see, the choices field contains multiple arrays, which I want to normalize into serial (row-wise) form. Here is my JSON:
{"pages":[{"name":"page1","title":"SurveyWindow Pvt. Ltd. Customer Feedback","description":"Question marked * are compulsory.",
"elements":[{"type":"radiogroup","name":"question1","title":"Do you like our product? *","isRequired":true,
"choices":[{"value":"Yes","text":"Yes"},{"value":"No","text":"No"}]},{"type":"checkbox","name":"question2","title":"Please Rate Our PM Skill","isRequired":false,"choices":[{"value":"High","text":"High"},{"value":"Low","text":"Low"},{"value":"Medium","text":"Medium"}]},{"type":"radiogroup","name":"question3","title":"Do you like our services? *","isRequired":true,"choices":[{"value":"Yes","text":"Yes"},{"value":"No","text":"No"}]}]}]}
This is my view.py:
jsondata = SurveyMaster.objects.all().filter(survey_id='1H2711202014572740')
q = jsondata.values('survey_json_design')
qs_json = pd.DataFrame.from_records(q)
datatotable = pd.json_normalize(qs_json['survey_json_design'], record_path=['pages', 'elements'])
qs_json = datatotable.to_html()
Based on your comments and picture, here's what I would do to go from the picture to something more SQL-friendly (what you refer to as "normalization"); keep in mind this might blow up if you don't have sufficient memory.
Create a new list which you'll fill with the new data, then iterate over the pandas table's rows, and within each row over every item in the list column. For every iteration of the inner loop, use the data from the row (minus the column you're iterating over). For convenience I added it as the last element.
# Example data
df = pd.DataFrame({"choices": [[{"text": "yes", "value": "yes"},
                                {"text": "no", "value": "no"}],
                               [{"ch1": 1, "ch2": 2}, {"ch3": "ch3"}]],
                   "name": ["kostas", "rajesh"]})

data = []
for i, row in df.iterrows():
    for val in row["choices"]:
        data.append((*row.drop("choices").values, val))

df = pd.DataFrame(data, columns=["names", "choices"])
print(df)
    names                          choices
0  kostas  {'text': 'yes', 'value': 'yes'}
1  kostas    {'text': 'no', 'value': 'no'}
2  rajesh             {'ch1': 1, 'ch2': 2}
3  rajesh                   {'ch3': 'ch3'}
This is where I guess you want to go. All that's left is to just modify the column / variable names with your own data.
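As a side note, if your pandas is recent enough (DataFrame.explode exists since 0.25), you can get the same row expansion without the explicit loop; here is a minimal sketch with the same example data:
import pandas as pd

df = pd.DataFrame({"choices": [[{"text": "yes", "value": "yes"},
                                {"text": "no", "value": "no"}],
                               [{"ch1": 1, "ch2": 2}, {"ch3": "ch3"}]],
                   "name": ["kostas", "rajesh"]})

# explode() gives each element of the 'choices' list its own row,
# repeating the other columns alongside it
exploded = df.explode("choices").reset_index(drop=True)
print(exploded)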
I am attempting to re-sort a list based on the order in which its elements appear in a dataframe column, even though the dataframe column is longer than the list.
enrolNo Surname
0 1 Jones
1 2 Smith
2 3 Henderson
3 4 Kilm
4 5 Henry
5 6 Joseph
late = ['Kilm', 'Henry', 'Smith']
Desired output:
sorted_late = ['Smith', 'Kilm', 'Henry']
My initial attempt was to add a new column to the existing dataframe and then extract it as a list, but this seems like a very long way around. Furthermore, I discovered that my attempt was not going to work due to the different lengths, as stated by the error message after trying this to start with:
df_register['late_arrivals'] = np.where((df_register['Surname'] == late), late, '')
Should I be using a 'for' loop instead?
Pluck the matching values out of the dataframe itself; there's no need to sort the list:
sorted_late = df[df.Surname.isin(late)].Surname.to_list()
If the full ordering were a plain list (say, master_list), you could be clever with that too:
sorted_late = [master_late for master_late in master_list if master_late in late]
Why not use the .isin() function?
df['Surname'].isin(late)
This gives a boolean mask that you can use to filter the dataframe and get the desired output.
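To turn that boolean mask into the ordered list itself (assuming df is the dataframe from the question), index with it and convert the column to a list, much like the first answer above:
sorted_late = df.loc[df['Surname'].isin(late), 'Surname'].tolist()
# ['Smith', 'Kilm', 'Henry']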
You can specify a custom key for the sort function:
import pandas
df = pandas.DataFrame([
    {"enrolNo": 1, "Surname": "Jones"},
    {"enrolNo": 2, "Surname": "Smith"},
    {"enrolNo": 3, "Surname": "Henderson"},
    {"enrolNo": 4, "Surname": "Kilm"},
    {"enrolNo": 5, "Surname": "Henry"},
    {"enrolNo": 6, "Surname": "Joseph"},
])
# set Surname as index so we can access enrolNo by it
df = df.set_index('Surname')
# now you can access enrolNo by Surname
assert df.loc['Kilm']['enrolNo'] == 4
# define the list to be sorted
late = ['Kilm', 'Henry', 'Smith']
# Sort late by enrolNo as listed in the dataframe
late_sorted = sorted(late, key=lambda n: df.loc[n]['enrolNo'])
# ['Smith', 'Kilm', 'Henry']
I have a list of nested dictionaries, shown below, from which I want to pull specific values into a new dictionary:
vid = [{'a':{'display':'axe', 'desc':'red'}, 'b':{'confidence':'good'}},
{'a':{'display':'book', 'desc':'blue'}, 'b':{'confidence':'poor'}},
{'a':{'display':'apple', 'desc':'green'}, 'b':{'confidence':'good'}}
]
I saw previous questions similar to this, but I still can't get the values such as 'axe' and 'red'. I would like the new dict to have a 'Description', 'Confidence' and other columns with the values from the nested dict.
I have tried this for loop:
new_dict = {}
for x in range(len(vid)):
    for y in vid[x]['a']:
        desc = y['desc']
        new_dict['Description'] = desc
I got many errors but mostly this error:
TypeError: string indices must be integers
Can someone please help solve how to get the values from the nested dictionary?
You don't need to iterate over the keys of the inner dictionary (the inner for-loop); just access the value you want. (The TypeError arises because iterating over vid[x]['a'] yields its keys, which are strings like 'display', and indexing a string with 'desc' fails.)
vid = [{'a':{'display':'axe', 'desc':'red'}, 'b':{'confidence':'good'} },
{'a':{'display':'book', 'desc':'blue'}, 'b':{'confidence':'poor'}},
{'a':{'display':'apple', 'desc':'green'}, 'b':{'confidence':'good'}}
]
new_dict = {}
list_of_dicts = []
for x in range(len(vid)):
    desc = vid[x]['a']['desc']
    list_of_dicts.append({'desc': desc})
I have found a temporary solution for this: I decided to use a pandas DataFrame instead.
df = pd.DataFrame(columns=['Desc'])
for x in range(len(vid)):
    desc = vid[x]['a']['desc']
    df.loc[len(df)] = [desc]
So you want to write this to a CSV later; pandas will help you a lot with this problem. Using pandas, you can get the desc values like this:
import pandas as pd

df = pd.DataFrame(vid)
descriptions = []
for index, row in df.iterrows():
    descriptions.append(row['a']['desc'])
new_dict = {'description': descriptions}
a b
0 {'display': 'axe', 'desc': 'red'} {'confidence': 'good'}
1 {'display': 'book', 'desc': 'blue'} {'confidence': 'poor'}
2 {'display': 'apple', 'desc': 'green'} {'confidence': 'good'}
This is how the DataFrame looks: a and b are the columns of the DataFrame, and your nested dicts are its rows.
Try using this list comprehension:
d = [{'Description': i['a']['desc'], 'Confidence': i['b']['confidence']} for i in vid]
print(d)
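If a DataFrame is ultimately what you're after (as in the workaround above), the same list of flat dicts can be passed straight to the constructor:
import pandas as pd

df = pd.DataFrame(d)
print(df)
#   Description Confidence
# 0         red       good
# 1        blue       poor
# 2       green       good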
I have a pandas series whose unique values are something like:
['toyota', 'toyouta', 'vokswagen', 'volkswagen', 'vw', 'volvo']
Now I want to fix some of these values like:
toyouta -> toyota
(Note that not all values have mistakes; volvo, toyota, etc. are fine.)
I've tried making a dictionary where the key is the correct word and the value is the word to be corrected, and then mapping that onto my series.
This is how my code looks:
corrections = {'maxda': 'mazda', 'porcshce': 'porsche', 'toyota': 'toyouta', 'vokswagen': 'vw', 'volkswagen': 'vw'}
df.brands = df.brands.map(corrections)
print(df.brands.unique())
>>> [nan, 'mazda', 'porsche', 'toyouta', 'vw']
As you can see, the problem is that all values not present in the dictionary are automatically converted to NaN. One solution is to map all the correct values to themselves, but I was hoping there is a better way to go about this.
Use:
df.brands = df.brands.map(corrections).fillna(df.brands)
Or:
df.brands = df.brands.map(lambda x: corrections.get(x, x))
Or:
df.brands = df.brands.replace(corrections)
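For what it's worth, the three options behave slightly differently around unmapped values: .map() returns NaN for anything not in the dictionary, so chaining .fillna(df.brands) puts the original values back; corrections.get(x, x) falls back to the original value whenever the key is missing; and .replace() only touches values that appear as keys in the dictionary and leaves everything else alone, though it tends to be slower than map on large Series.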