Having problems trying to convert JSON to a DataFrame - Python

I am trying to solve this problem:
1. Create a function 'wbURL' that takes an indicator, a country code, and begin and end years, and creates a URL of the following format:
http://api.worldbank.org/countries/COUNTRYCODE/indicators/INDICATOR?format=json&date=BEGIN_DATE:END_DATE
2. Create another function 'wbDF' that takes the same inputs and returns a dataframe constructed from the response. The response will come as a JSON list, and the element at index 1 of this list contains the relevant data. Extract this element, which is a list of dictionaries; this is what you want to construct the dataframe out of. Drop all columns except indicator, country, date and value. For the indicator and country columns, notice that the data is itself a dictionary: use apply to extract the value out of these dictionaries.
This is the code I wrote out:
import json

import pandas as pd
import requests

def wbURL(contcode, ind, begin, end):
    return f'http://api.worldbank.org/countries{contcode}/indicators/{ind}?format=json&date={begin}:{end}'

def wbDF(contcode, ind, begin, end):
    url = wbURL(contcode, ind, begin, end)
    response = requests.get(url)
    wb_raw = response.content
    wb = json.loads(wb_raw)
    data = wb[1]
    df = pd.DataFrame(data)
    df = df.drop(columns=['countryiso3code', 'unit', 'obs_status', 'decimal'])
    df['indicator'] = [d['id'] for d in [d['indicator'] for d in data]]
    df['country'] = [d['value'] for d in [d['country'] for d in data]]
    df['date'] = [d['date'] for d in data]
    df['value'] = [d['value'] for d in data]
    return df

test = wbDF('GBR', 'SP.DYN.LE00.IN', 2000, 2019)
print(test)
When I run it, I get an error:
IndexError: list index out of range
I would like someone to have a look at the code I have written and give me some advice on how I should change it to make it work.
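Judging from the URL format in the assignment, the f-string in wbURL is missing the slash between countries and the country code, so the request most likely returns the API's error document instead of data; that document is a single-element list, so wb[1] raises the IndexError you saw. A minimal sketch of the fix (same imports as above; the length check is an extra precaution, not part of the assignment):

def wbURL(contcode, ind, begin, end):
    # note the '/' between 'countries' and the country code
    return (f'http://api.worldbank.org/countries/{contcode}'
            f'/indicators/{ind}?format=json&date={begin}:{end}')

def wbDF(contcode, ind, begin, end):
    wb = requests.get(wbURL(contcode, ind, begin, end)).json()
    if len(wb) < 2 or wb[1] is None:
        # the API reports errors as a single-element list
        raise ValueError(f'World Bank API returned no data: {wb[0]}')
    data = wb[1]
    # keep only the four columns the assignment asks for
    df = pd.DataFrame(data)[['indicator', 'country', 'date', 'value']]
    # 'indicator' and 'country' hold dicts; use apply to pull out the fields
    df['indicator'] = df['indicator'].apply(lambda d: d['id'])
    df['country'] = df['country'].apply(lambda d: d['value'])
    return df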

Related

How to add a new row with new header information in the same dataframe

I have written code to retrieve JSON data from a URL. It works fine: I give a start and end date and it loops through the date range and appends everything to a dataframe.
The columns are populated with the JSON data's sensors and their corresponding values, hence column names like sensor_1. When I request the data from the URL, it sometimes happens that new sensors appear and the old ones are switched off and deliver no data anymore, so the number of columns changes. In that case my code just adds new columns.
What I want, instead of new columns, is a new header partway through the ongoing dataframe.
What I currently get with my code:
datetime;sensor_1;sensor_2;sensor_3;new_sensor_8;new_sensor_9;sensor_10;sensor_11;
2023-01-01;23.2;43.5;45.2;NaN;NaN;NaN;NaN;NaN;
2023-01-02;13.2;33.5;55.2;NaN;NaN;NaN;NaN;NaN;
2023-01-03;26.2;23.5;76.2;NaN;NaN;NaN;NaN;NaN;
2023-01-04;NaN;NaN;NaN;75;12;75;93;123;
2023-01-05;NaN;NaN;NaN;23;31;24;15;136;
2023-01-06;NaN;NaN;NaN;79;12;96;65;72;
What I want:
datetime;sensor_1;sensor_2;sensor_3;
2023-01-01;23.2;43.5;45.2;
2023-01-02;13.2;33.5;55.2;
2023-01-03;26.2;23.5;76.2;
datetime;new_sensor_8;new_sensor_9;sensor_10;sensor_11;
2023-01-04;75;12;75;93;123;
2023-01-05;23;31;24;15;136;
2023-01-06;79;12;96;65;72;
My loop to retrieve the data:
import datetime
import json
from datetime import timedelta

import numpy as np
import pandas as pd
import requests

start_date = datetime.datetime(2023, 1, 1, 0, 0)
end_date = datetime.datetime(2023, 1, 6, 0, 0)
sensor_data = pd.DataFrame()
while start_date < end_date:
    q = 'url'
    r = requests.get(q)
    j = json.loads(r.text)
    sub_data = pd.DataFrame()
    if 'result' in j:
        # renamed from `datetime` to avoid shadowing the datetime module
        timestamps = pd.to_datetime(np.array(j['result']['data'])[:, 0])
        sensors = np.array(j['result']['sensors'])
        data = np.array(j['result']['data'])[:, 1:]
        df_new = pd.DataFrame(data, index=timestamps, columns=sensors)
        sub_data = pd.concat([sub_data, df_new])
    sensor_data = pd.concat([sensor_data, sub_data])
    start_date += timedelta(days=1)
If two DataFrames will do for you, then you can simply split using the column names:
df1 = df[['datetime', 'sensor_1', 'sensor_2', 'sensor_3']]
df2 = df[['datetime', 'new_sensor_8', 'new_sensor_9', 'sensor_10', 'sensor_11']]
Note the [[ used.
Then use .dropna() to lose the NaN rows.
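Putting the two pieces together, a minimal sketch (assuming df is the combined frame shown under "What I currently get", with datetime as an ordinary column):

# Select each sensor generation by name, then drop the NaN rows.
# dropna() by default drops a row if any selected column is NaN; here the
# NaN blocks line up, so this removes exactly the other generation's rows.
df1 = df[['datetime', 'sensor_1', 'sensor_2', 'sensor_3']].dropna()
df2 = df[['datetime', 'new_sensor_8', 'new_sensor_9',
          'sensor_10', 'sensor_11']].dropna()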

Appending dictionaries generated from a loop to the same dataframe

I have a loop within a nested loop that at the end generates 6 dictionaries. Each dictionary has the same keys but different values. At the end of every iteration I would like to append the dictionary to the same dataframe, but it keeps failing.
At the end I would like to have a table with 6 columns plus an index which holds the keys.
This is the idea behind what I'm trying to do:
dictionary = dict()
for i in blahh:
    dictionary[i] = dict(zip(blahh['x'][i], blahh['y'][i]))
    df = pd.DataFrame(dictionary)
    df_final = pd.concat([dictionary, df])
I get the error:
cannot concatenate object of type '<class 'dict'>'; only series and dataframe objs are valid
I created a practice dataset, if necessary:
letts = [('a','b','c'), ('e','f','g'), ('h','i','j'), ('k','l','m'), ('n','o','p')]
numns = [(1,2,3), (4,5,6), (7,8,9), (10,11,12), (13,14,15)]
dictionary = dict()
for i in letts:
    for j in numns:
        dictionary = dict(zip(i, j))
I am confused by your practice dataset, but the modifications below could provide an idea:
df_final = pd.DataFrame()
dictionary = dict()
for i in blahh:
    dictionary[i] = dict(zip(blahh['x'][i], blahh['y'][i]))
    df = pd.DataFrame(dictionary)  # with dict values, no explicit index is needed
    df_final = pd.concat([df_final, df])
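Applied to the practice dataset, a sketch of the same idea. Since the practice set is ambiguous, I assume the intent is several dicts sharing the same keys (as the question describes), so the keys are taken from the first letts tuple; col_0, col_1, ... are made-up column names:

import pandas as pd

letts = [('a','b','c'), ('e','f','g'), ('h','i','j'), ('k','l','m'), ('n','o','p')]
numns = [(1,2,3), (4,5,6), (7,8,9), (10,11,12), (13,14,15)]

# Build one dict per iteration, all sharing the same keys, and collect them
# into an outer dict whose keys become the column names.
collected = {}
for n, j in enumerate(numns):
    collected[f'col_{n}'] = dict(zip(letts[0], j))

# Construct the frame once at the end: the shared keys become the index.
df_final = pd.DataFrame(collected)
print(df_final)
#    col_0  col_1  col_2  col_3  col_4
# a      1      4      7     10     13
# b      2      5      8     11     14
# c      3      6      9     12     15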

Cannot assign to function call when looping through and converting excel files

With this code:
xls = pd.ExcelFile('test.xlsx')
sn = xls.sheet_names
for i, snlist in list(zip(range(1, 13), sn)):
    'df{}'.format(str(i)) = pd.read_excel('test.xlsx', sheet_name=snlist, skiprows=range(6))
I get this error:
    'df{}'.format(str(i)) = pd.read_excel('test.xlsx', sheet_name=snlist,
                                          skiprows=range(6))
    ^ SyntaxError: cannot assign to function call
I can't understand the error or how to solve it. What's the problem?
df+str(i) also returns an error.
I want the result to be like:
df1 = pd.read_excel.. list1...
df2 = pd.read_excel... list2....
You can't assign the result of pd.read_excel to 'df{}'.format(str(i)), which is a string that looks like "df1", "df2", "df3", etc. That is why you get this error message; the message is probably confusing since Python treats this as assignment to a "function call".
It seems like you want a list or a dictionary of DataFrames instead.
To do this, assign the result of pd.read_excel to a variable, e.g. df, and then append it to a list or add it to a dictionary of DataFrames.
As a list:
dataframes = []
xls = pd.ExcelFile('test.xlsx')
sn = xls.sheet_names
for i, snlist in list(zip(range(1, 13), sn)):
    df = pd.read_excel('test.xlsx', sheet_name=snlist, skiprows=range(6))
    dataframes.append(df)
As a dictionary:
dataframes = {}
xls = pd.ExcelFile('test.xlsx')
sn = xls.sheet_names
for i, snlist in list(zip(range(1, 13), sn)):
    df = pd.read_excel('test.xlsx', sheet_name=snlist, skiprows=range(6))
    dataframes[i] = df
In both cases, you can access the DataFrames by indexing like this:
for i in range(len(dataframes)):
    print(dataframes[i])
    # Note: indexes will start at 0 here instead of 1.
    # You may want to change your `range` above to start at 0.
Or more simply:
for df in dataframes:
    print(df)
In the case of the dictionary, you'd probably want:
for i, df in dataframes.items():
    print(i, df)
    # Here, `i` is the key and `df` is the actual DataFrame.
If you really do want df1, df2, etc. as the keys, then use this inside the loop instead:
    dataframes[f'df{i}'] = df
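As a side note, the same dictionary can be built in one pass with a comprehension, keyed by sheet name rather than a counter; a sketch, assuming the same test.xlsx:

import pandas as pd

xls = pd.ExcelFile('test.xlsx')
# One DataFrame per sheet, keyed by the sheet's own name.
dataframes = {name: pd.read_excel(xls, sheet_name=name, skiprows=range(6))
              for name in xls.sheet_names}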

Change column values in pandas by applying another function

I have a data frame in pandas; one of the columns contains time intervals presented as strings, like 'P1Y4M1D'.
The example of the whole CSV:
oci,citing,cited,creation,timespan,journal_sc,author_sc
0200100000236252421370109080537010700020300040001-020010000073609070863016304060103630305070563074902,"10.1002/pol.1985.170230401","10.1007/978-1-4613-3575-7_2",1985-04,P2Y,no,no
...
I created a parsing function that takes a string like 'P1Y4M1D' and returns an integer number of days.
I am wondering how it is possible to change all the column values to parsed values using that function.
def do_process_citation_data(f_path):
    global my_ocan
    my_ocan = pd.read_csv("citations.csv",
                          names=['oci', 'citing', 'cited', 'creation', 'timespan', 'journal_sc', 'author_sc'],
                          parse_dates=['creation', 'timespan'])
    my_ocan = my_ocan.iloc[1:]  # to remove the first row; iloc selects data by row numbers
    my_ocan['creation'] = pd.to_datetime(my_ocan['creation'], format="%Y-%m-%d", yearfirst=True)
    return my_ocan
def parse():
    mydict = dict()
    mydict2 = dict()
    i = 1
    r = 1
    for x in my_ocan['oci']:
        mydict[x] = str(my_ocan['timespan'][i])
        i += 1
    print(mydict)
    for key, value in mydict.items():
        is_negative = value.startswith('-')
        if is_negative:
            date_info = re.findall(r"P(?:(\d+)Y)?(?:(\d+)M)?(?:(\d+)D)?$", value[1:])
        else:
            date_info = re.findall(r"P(?:(\d+)Y)?(?:(\d+)M)?(?:(\d+)D)?$", value)
        year, month, day = [int(num) if num else 0 for num in date_info[0]] if date_info else [0, 0, 0]
        daystotal = (year * 365) + (month * 30) + day
        if not is_negative:
            #mydict2[key] = daystotal
            return daystotal
        else:
            #mydict2[key] = -daystotal
            return -daystotal
    #print(mydict2)
    #return mydict2
Probably I do not even need to change the whole column to the new parsed values; the final goal is to write a new function that returns the average time of ['timespan'] of docs created in a particular year. Since I need parsed values, I thought it would be easier to change the whole column and manipulate a new data frame.
Also, I am curious what could be a way to apply the parsing function to each ['timespan'] row without modifying the data frame. I can only assume it could be something like this, but I don't have a full understanding of how to do that:
for x in my_ocan['timespan']:
    x = parse(str(x))
How can I get a column with new values? Thank you! Peace :)
A df['timespan'].apply(parse) (as mentioned by @Dan) should work. You would need to modify only the parse function so that it receives the string as an argument and returns the parsed value at the end. Something like this:
import pandas as pd
def parse_postal_code(postal_code):
    # Split the postal code and keep the first letters
    letters = postal_code.split('_')[0]
    return letters
# Example dataframe with three columns and three rows
df = pd.DataFrame({'Age': [20, 21, 22], 'Name': ['John', 'Joe', 'Carla'], 'Postal Code': ['FF_222', 'AA_555', 'BB_111']})
# This returns a new pd.Series
print(df['Postal Code'].apply(parse_postal_code))
# Can also be assigned to another column
df['Postal Code Letter'] = df['Postal Code'].apply(parse_postal_code)
print(df['Postal Code Letter'])
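Translated to the question's own parsing logic, a sketch of parse reworked to take one timespan string and return a day count, reusing the question's regex and its 365/30 approximation (this assumes my_ocan is the frame loaded above and that its timespan column holds plain strings):

import re
import pandas as pd

def parse_timespan(value):
    # One ISO-8601-style duration string such as 'P1Y4M1D' in,
    # approximate number of days out.
    is_negative = value.startswith('-')
    if is_negative:
        value = value[1:]
    date_info = re.findall(r"P(?:(\d+)Y)?(?:(\d+)M)?(?:(\d+)D)?$", value)
    year, month, day = [int(num) if num else 0 for num in date_info[0]] if date_info else [0, 0, 0]
    daystotal = year * 365 + month * 30 + day
    return -daystotal if is_negative else daystotal

# New column of parsed day counts, without modifying 'timespan' itself:
my_ocan['timespan_days'] = my_ocan['timespan'].apply(parse_timespan)

# The final goal from the question: average timespan per creation year.
print(my_ocan.groupby(my_ocan['creation'].dt.year)['timespan_days'].mean())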

How to stop apply() from changing order of columns?

I have a reproducible example, toy dataframe:
df = pd.DataFrame({'my_customers': ['John', 'Foo'], 'email': ['email@gmail.com', 'othermail@yahoo.com'], 'other_column': ['yes', 'no']})
print(df)

  my_customers                email other_column
0         John      email@gmail.com          yes
1          Foo  othermail@yahoo.com           no
And I apply() a function to the rows, creating a new column inside the function:
def func(row):
    # if this column is 'yes'
    if row['other_column'] == 'yes':
        # create a new column with 'Hello' in it
        row['new_column'] = 'Hello'
        # return to df
        return row
    # otherwise
    else:
        # just return the row
        return row
I then apply the function to the df, and we can see that the order has been changed. The columns are now in alphabetical order. Is there any way to avoid this? I would like to keep it in the original order.
df = df.apply(func, axis=1)
print(df)

                 email my_customers new_column other_column
0      email@gmail.com         John      Hello          yes
1  othermail@yahoo.com          Foo        NaN           no
Edited for clarification - the above code was too simple
Input:
df = pd.DataFrame({'my_customers': ['John', 'Foo'],
                   'email': ['email@gmail.com', 'othermail@yahoo.com'],
                   'api_status': ['data found', 'no data found'],
                   'api_response': ['huge json', 'huge json']})

  my_customers                email     api_status api_response
0         John      email@gmail.com     data found    huge json
1          Foo  othermail@yahoo.com  no data found    huge json
Parsing the api_response, I need to create many new columns in the DF:
def api_parse(row):
    # if we have response data
    if row['api_response'] == 'huge json':
        # get the response for parsing
        response_data = row['api_response']

        """Let's get associated URLs first"""
        # if there's a URL section in the response
        if 'urls' in response_data.keys():
            # get all associated URLs into a list
            urls = extract_values(response_data['urls'], 'url')
            row['Associated_Urls'] = urls

        """Get a list of jobs"""
        if 'jobs' in response_data.keys():
            # get all associated jobs and organizations into a list
            titles = extract_values(response_data['jobs'], 'title')
            organizations = extract_values(response_data['jobs'], 'organization')
            counter = 1
            # create a new column for each job
            for pair in zip(titles, organizations):
                row['Job' + '_' + str(counter)] = f'Title: {pair[0]}, Organization: {pair[1]}'
                counter += 1

        """Get a list of education"""
        if 'educations' in response_data.keys():
            # get all degrees into a list
            degrees = extract_values(response_data['educations'], 'display')
            counter = 1
            # create a new column for each degree
            for edu in degrees:
                row['education' + '_' + str(counter)] = edu
                counter += 1

        """Get a list of social profiles from the URLs we parsed earlier"""
        facebook = [i for i in urls if 'facebook' in i] or [np.nan]
        instagram = [i for i in urls if 'instagram' in i] or [np.nan]
        linkedin = [i for i in urls if 'linkedin' in i] or [np.nan]
        twitter = [i for i in urls if 'twitter' in i] or [np.nan]
        amazon = [i for i in urls if 'amazon' in i] or [np.nan]
        row['facebook'] = facebook
        row['instagram'] = instagram
        row['linkedin'] = linkedin
        row['twitter'] = twitter
        row['amazon'] = amazon
        return row
    elif row['api_status'] == 'no data found':
        # do nothing
        return row
Expected output:

  my_customers                email     api_status api_response job_1 job_2  \
0         John      email@gmail.com     data found    huge json   xyz  xyz2
1          Foo  othermail@yahoo.com  no data found    huge json   nan   nan

  education_1  facebook other api info
0         foo  profile1            etc
1         nan       nan            nan
You could adjust the order of the columns in your DataFrame after running the apply function. For example:
df = df.apply(func, axis=1)
df = df[['my_customers', 'email', 'other_column', 'new_column']]
To reduce the amount of duplication (i.e. having to retype all the column names), you could capture the existing columns before calling the apply function:
columns = list(df.columns)
df = df.apply(func, axis=1)
df = df[columns + ['new_column']]
Update based on the author's edits to the original question. Whilst I'm not sure the data structure chosen (storing API results in a DataFrame) is the best option, one simple solution could be to extract the new columns after calling the apply function:
# Store the existing columns before calling apply
existing_columns = list(df.columns)
df = df.apply(func, axis=1)
all_columns = list(df.columns)
new_columns = [column for column in all_columns if column not in existing_columns]
df = df[existing_columns + new_columns]
For a performance optimisation, you could store the existing columns in a set instead of a list, which yields constant-time membership lookups due to the hashed nature of a set in Python; the set would serve the `column not in existing_columns` test above, while a list of the columns is still needed for the final reordering. This would change existing_columns = list(df.columns) to existing_columns = set(df.columns).
Finally, as @Parfait very kindly points out in their comment, the code above may raise some deprecation warnings. Using pandas.DataFrame.reindex instead of df = df[existing_columns + new_columns] will make the warnings disappear:
new_columns_order = existing_columns + new_columns
df = df.reindex(columns=new_columns_order)
That occurs because you don't assign a value to the new column when row["other_column"] != 'yes'. Just try this:
def func(row):
    if row['other_column'] == 'yes':
        row['new_column'] = 'Hello'
        return row
    else:
        row['new_column'] = ''
        return row

df = df.apply(func, axis=1)
You can set the value of row["new_column"] for the 'no' rows to whatever you like; I just left it blank.
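As an aside, for a simple conditional column like this toy example, a vectorized approach sidesteps apply entirely (and so never reorders columns); np.where here is my suggestion, not something from the answers above:

import numpy as np
import pandas as pd

df = pd.DataFrame({'my_customers': ['John', 'Foo'],
                   'email': ['email@gmail.com', 'othermail@yahoo.com'],
                   'other_column': ['yes', 'no']})

# Build the new column in one shot; existing columns keep their order.
df['new_column'] = np.where(df['other_column'] == 'yes', 'Hello', '')
print(df)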
