I have a set of URLs pointing to JSON files and an empty pandas DataFrame whose columns represent the attributes of those JSON files. Not all of the JSON files have every attribute in the DataFrame. What I need to do is create a dictionary from each JSON file and append it to the DataFrame as a new row; where a JSON file doesn't have an attribute matching a column of the DataFrame, that cell has to be left blank.
I managed to create the dictionaries like this:
import urllib2
import json
url = "https://cws01.worldstores.co.uk/api/product.php?product_sku=ULST:7BIS01CF"
data = urllib2.urlopen(url).read()
data = json.loads(data)
and then I tried to create a for loop as follows:
row = -1
for i in links:
    row = row + 1
    data = urllib2.urlopen(str(i)).read()
    data = json.loads(data)
    for key in data.keys():
        for column in df.columns:
            if str(column) == str(key):
                df.loc[[str(column)],row] = data[str(key)]
            else:
                df.loc[[str(column)],row] = None
where df is the dataframe and links is the set of urls
However, I get the following error:
raise KeyError('%s not in index' % objarr[mask])
KeyError: "['2_seater_depth_mm'] not in index"
where ['2_seater_depth_mm'] is the first column of the pandas dataframe
The following code works for me:
row = -1
for i in links:
    row = row + 1
    data = urllib2.urlopen(str(i)).read()
    data = json.loads(data)
    for key in data.keys():
        df.loc[row,key] = data[key]
You have the order of the arguments to .loc mixed up, and you have one pair of [] too many.
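Here is a minimal sketch (using a made-up single-column frame) of the difference: .loc takes the row label first and the column label second, and passing the column name where a row label is expected produces exactly the KeyError you saw:
import pandas as pd
df = pd.DataFrame(columns=['2_seater_depth_mm'])
df.loc[0, '2_seater_depth_mm'] = 430       # row label first, then column label
# df.loc[['2_seater_depth_mm'], 0] = 430   # column name used as a row label -> KeyError: not in index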
Assuming that df is empty and has the same columns as the url dictionary keys, i.e.
list(df)
#[u'alternate_product_code',
# u'availability',
# u'boz',
# ...
len(df)
#0
then you can use pandas.DataFrame.append:
import pandas

for url in links:
    url_data = urllib2.urlopen(str(url)).read()
    url_dict = json.loads(url_data)
    a_dict = { k: pandas.Series([str(v)], index=[0]) for k, v in url_dict.iteritems() }
    new_df = pandas.DataFrame.from_dict(a_dict)
    df = df.append(new_df, ignore_index=True)  # append returns a new frame, so keep the result
I'm not too sure why your code won't work, but consider the following few edits, which should clean things up if you still want to use it:
for row,url in enumerate(links):
    data = urllib2.urlopen(str(url)).read()
    data_dict = json.loads(data)
    for key, val in data_dict.items():
        if key in list(df):
            df.ix[row, key] = val
I used enumerate to iterate over both the index and the value of the links array, so you don't need a separate index counter (row in your code), and I used the .items() dictionary method so I can iterate over keys and values at once. I believe pandas will automatically handle the empty dataframe entries.
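As a minimal sketch of that last point (using a made-up two-column frame): cells that are never assigned are simply left as NaN, so JSON files that lack some columns need no special handling:
import pandas as pd
df = pd.DataFrame(columns=['a', 'b'])
df.loc[0, 'a'] = 1    # only 'a' is set for row 0; 'b' stays empty
print(df)
#    a    b
# 0  1  NaN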
Related
I have some big Excel files like this (note: other variables are omitted for brevity):
and would need to build a corresponding Pandas DataFrame with the following structure.
I am trying to develop Pandas code for, at least, parsing the first column and extracting the id and the full name of each user. Could you help with this?
The way that I would tackle it (and I am assuming there are likely to be more efficient ways) is to import the Excel file into a dataframe and then iterate through it to grab the details you need for each line. Store that information in a dictionary and append each formed line to a list. This list of dictionaries can then be used to create the final dataframe.
Please note, I made the following assumptions:
Your excel file is named 'data.xlsx' and is in the current working directory
The index next to each person increments by one EVERY time
All people have a position described in brackets next to the name
I made up the column names, as none were provided
import pandas as pd
# import the excel file into a dataframe (df)
filename = 'data.xlsx'
df = pd.read_excel(filename, names=['col1', 'col2'])
# remove blank rows
df.dropna(inplace=True)
# reset the index of df
df.reset_index(drop=True, inplace=True)
# initialise the variables
counter = 1
name_pos = ''
name = ''
pos = ''
line_dict = {}
list_of_lines = []
# iterate through the dataframe
for i in range(len(df)):
    if df['col1'][i] == counter:
        name_pos = df['col2'][i].split(' (')
        name = name_pos[0]
        pos = name_pos[1].rstrip(name_pos[1][-1])
        p_index = counter
        counter += 1
    else:
        date = df['col1'][i].strftime('%d/%m/%Y')
        amount = df['col2'][i]
        line_dict = {'p_index': p_index, 'name': name, 'position': pos, 'date': date, 'amount': amount}
        list_of_lines.append(line_dict)
final_df = pd.DataFrame(list_of_lines)
OUTPUT: (the resulting final_df, one row per date/amount entry with the person's index, name, and position)
I have attached a csv file. I have written a Python script which reads the csv file into a dataframe, iterates over it, processes the contents, and inserts them into MongoDB.
Right now, all data is getting inserted into the DB.
Is there a way to iterate over the Python dict and only take the first 10 ranks' data (group rank column)? This column is grouped, as you can see in the attached img.
file = request.files['file']
client = pymongo.MongoClient("mongodb://localhost:27017")
df = pd.read_csv(file)
final_dict = {}
for row in df.iterrows():
    cluster_name = row[1][1]
    print(cluster_name)
    if cluster_name not in final_dict.keys():
        final_dict[cluster_name] = {}
        final_dict[cluster_name]["queries"] = []
        final_dict[cluster_name]["queries"].append(
            {"cluster_name": row[1][0], "cluster_rank": row[1][1],
             "cluster_size": row[1][2]})
    else:
        final_dict[cluster_name]["queries"].append(
            {"cluster_name": row[1][0], "cluster_rank": row[1][1], "cluster_size": row[1][2]})
db = client["db_name"]
for key in final_dict:
    db.testing.insert_one(final_dict[key])
To only get rows whose group rank is less than or equal to 10, you can use .loc:
df = df.loc[df['group rank'] <= 10]
df
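As a rough usage sketch (assuming the csv column is literally named 'group rank', and using a made-up file name in place of request.files['file']), the filter would sit right after read_csv, before the iterrows loop that builds final_dict:
import pandas as pd
df = pd.read_csv("clusters.csv")        # hypothetical file name
df = df.loc[df['group rank'] <= 10]     # keep only the first 10 ranks
for row in df.iterrows():
    cluster_name = row[1][1]
    # ... build final_dict exactly as in the posted code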
I have a column named events that comes from a csv file, which I have loaded into a dataframe.
This column contains the events of each soccer match:
here is an example (a sample of the data is attached).
I need each key to be a column and each row to hold its values, like this:
event_team  event_time  event_type ....
home        34          Yellow card
away        14          Goal
....
A sample file of the data is attached. How can I do this?
Pandas supports building a DataFrame straight from a list of dicts, like this:
list1 = [dict1, dict2, dict3]
df = pd.DataFrame(list1)
Using that you can later select a column using:
column = df["column_name"]
If you want a non-pandas way, you can do this:
list1 = [dict1, dict2, dict3]
columns = {}
# Initializing the keywords
for d in list1:
    for k in d:
        if k not in columns:
            columns[k] = []
for d in list1:
    for k in columns:
        if k in d:
            columns[k].append(d[k])
        else:
            # because you want all columns to have the same length
            columns[k].append(None)
print(columns)
EDIT: This script unpacks the "events_list" column into a new dataframe with the layout described by the OP.
import pandas as pd
import ast
df = pd.read_csv("Sampleofuefa.csv")
l = []
for d in df["events_list"]:
# the values in the columns are strings, you have to interpret them
# since ast.literal_eval returns a list of dicts, we extend the following
# list with that list of dict: l = l1 + l2
l.extend(ast.literal_eval(d))
event_df = pd.DataFrame(l)
I have a CSV file with 100K+ lines of data in this format:
"{'foo':'bar' , 'foo1':'bar1', 'foo3':'bar3'}"
"{'foo':'bar' , 'foo1':'bar1', 'foo4':'bar4'}"
The quotes are there before the curly braces because my data came in a CSV file.
I want to extract the key value pairs in all the lines to create a dataframe like so:
Column Headers: foo, foo1, foo3, foo...
Rows: bar, bar1, bar3, bar...
I've tried implementing something similar to what's explained here ( Python: error parsing strings from text file with Ast module).
I've gotten the ast.literal_eval function to work on my file to convert the contents into a dict but now how do I get the DataFrame function to work? I am very much a beginner so any help would be appreciated.
import pandas as pd
import ast
with open('file_name.csv') as f:
    for string in f:
        parsed = ast.literal_eval(string.rstrip())
        print(parsed)
pd.DataFrame(???)
You can turn a dictionary into a pandas dataframe using pd.DataFrame.from_dict, but it will expect each value in the dictionary to be in a list.
for key, value in parsed.items():
    parsed[key] = [value]
df = pd.DataFrame.from_dict(parsed)
You can do this iteratively by appending to your dataframe.
df = pd.DataFrame()
for string in f:
    parsed = ast.literal_eval(string.rstrip())
    for key, value in parsed.items():
        parsed[key] = [value]
    df = df.append(pd.DataFrame.from_dict(parsed))  # reassign: append returns a new frame
parsed is a dictionary; you make a dataframe from each one, then join all the frames together:
df = []
with open('file_name.csv') as f:
    for string in f:
        parsed = ast.literal_eval(string.rstrip())
        if type(parsed) != dict:
            continue
        subDF = pd.DataFrame(parsed, index=[0])
        df.append(subDF)
df = pd.concat(df, ignore_index=True, sort=False)
Calling pd.concat on a list of dataframes is faster than calling DataFrame.append repeatedly. sort=False means that pd.concat will not sort the column names when it encounters a new one, like foo4 on the second row.
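Here is a minimal sketch (with made-up frames) of what sort=False does when a later row introduces a new column such as foo4:
import pandas as pd
a = pd.DataFrame({'foo': ['bar'], 'foo3': ['bar3']})
b = pd.DataFrame({'foo': ['bar'], 'foo4': ['bar4']})
print(pd.concat([a, b], ignore_index=True, sort=False))
#    foo foo3 foo4
# 0  bar bar3  NaN
# 1  bar  NaN bar4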
The loop below is supposed to add multiple tables' rows (from an HTML page) into one dataframe. The loop works fine and creates a dataframe for each table one by one, but it also replaces the previous table's data in the dataframe, which is what I want to fix. It should append each table's data to the same dataframe, not replace the previous table's data. Please help me with this.
column_headers = ['state', 'sr_no', 'district_name', 'country']
headers = ['district_id']
district_link = [[li.get('href') for li in data_rows_link[i].findAll('a')]
                 for i in range(len(data_rows))]
district_data_02 = [] # create an empty list to hold all the data
for i in range(len(data_rows)): # for each table row
    district_row = []  # create an empty list for each pick/player
    district_row.append("a")
    # for each table data element from each table row
    for li in data_rows[i].findAll('li'):
        # get the text content and append to the district_row
        district_row.append(li.getText())
    # then append each pick/player to the district_data matrix
    district_data_02.append(district_row)
district_data == district_data_02
#dataframe - districtlist
districtlist = pd.DataFrame(district_data ,columns=column_headers)
districtid = pd.DataFrame(district_link, columns=headers)
#df_row_merged = pd.concat([df, df1])
#dataframe - districtid
final_districtlist =pd.concat([districtlist, districtid], axis=1)
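One pattern that avoids overwriting earlier tables (a sketch only, with made-up row data, assuming the block above runs once per table) is to build one small frame per table, collect the frames in a list, and concatenate them once at the end:
import pandas as pd

tables = [
    [['a', 1], ['b', 2]],   # rows parsed from the first table
    [['c', 3]],             # rows parsed from the second table
]

frames = []
for rows in tables:
    frames.append(pd.DataFrame(rows, columns=['name', 'value']))

combined = pd.concat(frames, ignore_index=True)
print(combined)
#   name  value
# 0    a      1
# 1    b      2
# 2    c      3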