Appending dictionaries generated from a loop to the same dataframe - python

I have a loop within a nested loop that at the end generates 6 dictionaries. Each dictionary has the same keys but different values. At the end of every iteration I would like to append the dictionary to the same dataframe, but it keeps failing.
At the end I would like to have a table with 6 columns plus an index which holds the keys.
This is the idea behind what I'm trying to do:
dictionary = dict()
for i in blahh:
    dictionary[i] = dict(zip(blahh['x'][i], blahh['y'][i]))
    df = pd.DataFrame(dictionary)
    df_final = pd.concat([dictionary, df])
I get the error:
cannot concatenate object of type '<class 'dict'>'; only series and dataframe objs are valid
I created a practice dataset here, if needed:
letts = [('a','b','c'),('e','f','g'),('h','i','j'),('k','l','m'),('n','o','p')]
numns = [(1,2,3),(4,5,6),(7,8,9),(10,11,12),(13,14,15)]
dictionary = dict()
for i in letts:
    for j in numns:
        dictionary = dict(zip(i, j))

I'm a bit confused about your practice dataset, but the modifications below could give you an idea...
df_final = pd.DataFrame()
dictionary = dict()
for i in blahh:
    dictionary[i] = dict(zip(blahh['x'][i], blahh['y'][i]))
    df = pd.DataFrame(dictionary)  # if the values were scalars, an index would have to be passed
    df_final = pd.concat([df_final, df])
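For what it is worth, a minimal sketch of this dict-of-dicts pattern applied to the practice dataset from the question above (assuming each letter/number tuple pair is meant to become one column; with this data the keys differ between columns, so most cells come out NaN, but it shows how the outer keys become columns and the inner keys become the index; the col_ labels are just placeholders):
import pandas as pd

letts = [('a','b','c'), ('e','f','g'), ('h','i','j'), ('k','l','m'), ('n','o','p')]
numns = [(1,2,3), (4,5,6), (7,8,9), (10,11,12), (13,14,15)]

# collect every inner dict under its own column label instead of
# overwriting `dictionary` on each iteration
dictionary = {}
for col, (i, j) in enumerate(zip(letts, numns)):
    dictionary[f'col_{col}'] = dict(zip(i, j))

# a dict of dicts becomes a DataFrame in one step:
# outer keys -> columns, inner keys -> index
df_final = pd.DataFrame(dictionary)
print(df_final)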

Related

Check for existence of data from two Dataframe's columns in List

I'm searching for the difference between columns in a DataFrame and data in a list.
I'm doing it this way:
# pickled_data => list of dicts
pickled_names = [d['company'] for d in pickled_data] # get values from dictionary to list
diff = df[~df['company_name'].isin(pickled_names)]
which works fine, but I realized that I need to check not only for company_name but also for place, because there could be two companies with the same name.
df also contains a place column, and pickled_data contains a place key in each dictionary.
I would like to be able to do something like this
pickled_data = [(d['company'], d['place']) for d in pickled_data]
diff = df[~df['company_name', 'place'].isin(pickled_data)] # For two values in same row
You can convert the values to a MultiIndex with MultiIndex.from_tuples, then convert both columns to a MultiIndex as well and compare:
pickled_data = [(d['company'], d['place']) for d in pickled_data]
mux = pd.MultiIndex.from_tuples(pickled_data)
diff = df[~df.set_index(['company_name', 'place']).index.isin(mux)]
Sample:
data = {'company_name':['A1','A2','A2','A1','A1','A3'],
        'place':list('sdasas')}
df = pd.DataFrame(data)
pickled_data = [('A1','s'),('A2','d')]
mux = pd.MultiIndex.from_tuples(pickled_data)
diff = df[~df.set_index(['company_name', 'place']).index.isin(mux)]
print (diff)
  company_name place
2           A2     a
4           A1     a
5           A3     s
You can form a set of tuples from your pickled_data for faster lookup, then, using a list comprehension over the company_name and place columns of the frame, build a boolean list indicating whether each pair is absent from that set. Then we use this list to index into the frame:
comps_and_places = set((d["company"], d["place"]) for d in pickled_data)
not_in_list = [(c, p) not in comps_and_places
               for c, p in zip(df.company_name, df.place)]
diff = df[not_in_list]
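For reference, running this set-based variant on the same sample data as above (with pickled_data written as the list of dicts the question describes) returns the same three rows:
import pandas as pd

df = pd.DataFrame({'company_name': ['A1','A2','A2','A1','A1','A3'],
                   'place': list('sdasas')})
pickled_data = [{'company': 'A1', 'place': 's'},
                {'company': 'A2', 'place': 'd'}]

comps_and_places = set((d["company"], d["place"]) for d in pickled_data)
not_in_list = [(c, p) not in comps_and_places
               for c, p in zip(df.company_name, df.place)]
print(df[not_in_list])
#   company_name place
# 2           A2     a
# 4           A1     a
# 5           A3     s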

Cannot assign to function call when looping through and converting excel files

With this code:
xls = pd.ExcelFile('test.xlsx')
sn = xls.sheet_names
for i,snlist in list(zip(range(1,13),sn)):
    'df{}'.format(str(i)) = pd.read_excel('test.xlsx', sheet_name=snlist, skiprows=range(6))
I get this error:
'df{}'.format(str(i)) = pd.read_excel('test.xlsx',sheet_name=snlist,
skiprows=range(6))
^ SyntaxError: cannot assign to function call
I can't understand the error or how to solve it. What's the problem?
df + str(i) also returns an error.
I want the result to be like:
df1 = pd.read_excel.. list1...
df2 = pd.read_excel... list2....
You can't assign the result of pd.read_excel to 'df{}'.format(str(i)) -- which is a string that looks like "df1", "df2", "df3", etc. That is why you get this error message. The error message is probably confusing, since it treats this as assignment to a "function call".
It seems like you want a list or a dictionary of DataFrames instead.
To do this, assign the result of pd.read_excel to a variable, e.g. df, and then append that to a list, or add it to a dictionary of DataFrames.
As a list:
dataframes = []
xls = pd.ExcelFile('test.xlsx')
sn = xls.sheet_names
for i, snlist in list(zip(range(1, 13), sn)):
    df = pd.read_excel('test.xlsx', sheet_name=snlist, skiprows=range(6))
    dataframes.append(df)
As a dictionary:
dataframes = {}
xls = pd.ExcelFile('test.xlsx')
sn = xls.sheet_names
for i, snlist in list(zip(range(1, 13), sn)):
    df = pd.read_excel('test.xlsx', sheet_name=snlist, skiprows=range(6))
    dataframes[i] = df
In both cases, you can access the DataFrames by indexing like this:
for i in range(len(dataframes)):
    print(dataframes[i])
    # Note indexes will start at 0 here instead of 1
    # You may want to change your `range` above to start at 0
Or more simply:
for df in dataframes:
    print(df)
In the case of the dictionary, you'd probably want:
for i, df in dataframes.items():
    print(i, df)
    # Here, `i` is the key and `df` is the actual DataFrame
If you really do want df1, df2 etc as the keys, then do this instead:
dataframes[f'df{i}'] = df
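As a side note, if keys by sheet name are acceptable, pandas can build the whole dictionary in one call, because sheet_name=None reads every sheet; a sketch keeping skiprows from the original code:
import pandas as pd

# sheet_name=None returns {sheet_name: DataFrame} for every sheet in the file
dataframes = pd.read_excel('test.xlsx', sheet_name=None, skiprows=range(6))

for name, df in dataframes.items():
    print(name, df.shape)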

Creating multiple dataframes using a for loop

Hi I have code which looks like this:
with open("file123.json") as json_file:
    data = json.load(json_file)
df_1 = pd.DataFrame(dict([(k,pd.Series(v)) for k,v in data["spt"][1].items()]))
df_1_made =pd.json_normalize(json.loads(df_1.to_json(orient="records"))).T.drop(["content.id","shortname","name"])
df_2 = pd.DataFrame(dict([(k,pd.Series(v)) for k,v in data["spt"][2].items()]))
df_2_made = pd.json_normalize(json.loads(df_2.to_json(orient="records"))).T.drop(["content.id","shortname","name"])
df_3 = pd.DataFrame(dict([(k,pd.Series(v)) for k,v in data["spt"][3].items()]))
df_3_made = pd.json_normalize(json.loads(df_3.to_json(orient="records"))).T.drop(["content.id","shortname","name"])
which the dataframe is built from a json file
the problem is that I am dealing with different json files, and each one of them can lead to a different number of dataframes. So the code above handles 3, but it may change to 7. Is there any way to make a for loop that takes the length of the data:
length = len(data["spt"])
and makes the correct number of dataframes from it, so I do not need to do it manually?
The simplest option here will be to put all your dataframes into a dictionary or a list. First define a function that creates the dataframe and then use a list comprehension.
def create_df(data):
    df = pd.DataFrame(
        dict(
            [(k, pd.Series(v)) for k, v in data]
        )
    )
    df = pd.json_normalize(
        json.loads(
            df.to_json(orient="records")
        )
    ).T.drop(["content.id", "shortname", "name"])
    return df

my_list_of_dfs = [create_df(item.items()) for item in data["spt"]]
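If you would rather keep names like df_1_made, df_2_made and so on, a dict comprehension gives the same result keyed by position; a sketch, assuming every element of data["spt"] should become a dataframe and keeping 1-based numbering:
my_dfs = {f"df_{i}_made": create_df(item.items())
          for i, item in enumerate(data["spt"], start=1)}

# access e.g. my_dfs["df_1_made"] instead of the variable df_1_made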

How to group by and sum when all elements of one list are in another list

I have a data frame df1. "transactions" column has an array of int.
id  transactions
1   [1,2,3]
2   [2,3]
data frame df2. "items" column has an array of int.
items  cost
[1,2]  2.0
[2]    1.0
[2,4]  4.0
I need to check whether all elements of items are in each transaction if so sum up the costs.
Expected Result
id  transactions  score
1   [1,2,3]       3.0
2   [2,3]         1.0
I did the following
# cross join
def cartesian_product_simplified(left, right):
    la, lb = len(left), len(right)
    ia2, ib2 = np.broadcast_arrays(*np.ogrid[:la, :lb])
    return pd.DataFrame(
        np.column_stack([left.values[ia2.ravel()],
                         right.values[ib2.ravel()]]))

out = cartesian_product_simplified(df1, df2)

# column names assigning
out.columns = ['id', 'transactions', 'cost', 'items']

# converting pandas Series to list
t = out["transactions"].tolist()
item = out["items"].tolist()

# check list present in another list
def check(trans, itm):
    out_list = list()
    for row in trans:
        ret = np.all(np.in1d(itm, row))
        out_list.append(ret)
    return out_list

# if true: group and sum
a = check(t, item)
for i in a:
    if i:
        print(out.groupby(['id', 'transactions']))['cost'].sum()
    else:
        print("no")
Throws TypeError: 'NoneType' object is not subscriptable.
I am new to Python and don't know how to put all these together. How do I group by and sum the cost when all items of one list are in another list?
The simplest way is just to check all items for all transactions:
# df1 and df2 are initialized
def sum_score(transaction):
    score = 0
    for _, row in df2.iterrows():
        if all(item in transaction for item in row["items"]):
            score += row["cost"]
    return score

df1["score"] = df1["transactions"].map(sum_score)
This will be extremely slow at scale. If that is a problem, we should not iterate over every row, but preselect only the possible ones. If you have enough memory, it can be done like this: for each item we remember all the row numbers in df2 where it appears, so for each transaction we collect the possible rows and check only those.
import collections

# df1 and df2 are initialized
def get_sum_score_precalculated_func(items_cost_df):
    # create a dict mapping each item to the df2 row indexes where it appears
    items_search_dict = collections.defaultdict(set)
    for i, (_, row) in enumerate(items_cost_df.iterrows()):
        for item in row["items"]:
            items_search_dict[item].add(i)

    def sum_score(transaction):
        possible_indexes = set()
        for i in transaction:
            possible_indexes |= items_search_dict[i]
        score = 0
        for i in possible_indexes:
            row = items_cost_df.iloc[i]
            if all(item in transaction for item in row["items"]):
                score += row["cost"]
        return score

    return sum_score

df1["score"] = df1["transactions"].map(get_sum_score_precalculated_func(df2))
Here I use:
set, which is an unordered collection of unique values (it helps to merge the possible line numbers and avoid double counting).
collections.defaultdict, which is a usual dict, except that accessing a missing key initializes it with the given default (an empty set in my case). That avoids writing if x not in my_dict: my_dict[x] = set().
I also use a so-called "closure": the inner sum_score function keeps access to items_cost_df and items_search_dict, which were in scope where it was declared, even after get_sum_score_precalculated_func has returned.
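For instance, a tiny illustration of that defaultdict behaviour:
import collections

d = collections.defaultdict(set)
d["apple"].add(3)   # the missing key is created as an empty set automatically
d["apple"].add(5)
print(d["apple"])   # {3, 5}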
That should be much faster when the items are fairly unique and appear in only a few rows of df2.
If instead you have many identical transactions, it is better to calculate the score once per unique transaction and then just join the result back:
transaction_score = []
# note: for .unique() and the merge to work, the transactions need to be hashable,
# e.g. tuples rather than lists
for transaction in df1["transactions"].unique():
    score = sum_score(transaction)
    transaction_score.append([transaction, score])

transaction_score = pd.DataFrame(
    transaction_score,
    columns=["transactions", "score"])

df1 = df1.merge(transaction_score, on="transactions", how="left")
Here I use sum_score from the first code example.
P.S. The Python error message normally includes a line number, which helps a lot in understanding the problem.
# convert df_1 to dictionary for iteration
df_1_dict = dict(zip(df_1["id"], df_1["transactions"]))
# convert df_2 to list for iteration as there is no unique column
df_2_list = df_2.values.tolist()
# iterate through each combination to find a valid one
new_data = []
for rows in df_2_list:
    items = rows[0]
    costs = rows[1]
    for key, value in df_1_dict.items():
        # find common items in both
        common = set(value).intersection(set(items))
        # keep the pair only if every item appears in the transaction
        if len(common) == len(items):
            new_row = {"id": key, "transactions": value, "costs": costs}
            new_data.append(new_row)
merged_df = pd.DataFrame(new_data)
merged_df = merged_df[["id", "transactions", "costs"]]
# group the data by id to get total cost for each id
merged_df = (
    merged_df
    .groupby(["id"])
    .agg({"costs": "sum"})
    .reset_index()
)
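One caveat with this approach: an id whose transaction matches no item set drops out of merged_df entirely. If every id from df_1 should appear in the result (as in the expected table), a left merge back onto df_1 restores it, filling the missing cost with 0; a minimal sketch reusing the names above:
result = (
    df_1.merge(merged_df.rename(columns={"costs": "score"}), on="id", how="left")
        .fillna({"score": 0})
)
print(result)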

Pandas - populate list with dataframe values as strings

I am reading csv files from a folder and filtering them into a pandas dataframe, like so:
results=[]
for filename in glob.glob(os.path.join('/path/*.csv')):
    with open(filename) as p:
        df = pd.read_csv(p)
        filtered = df[(df['duration'] > low1) & (df['duration'] < high1)]
        artist = filtered['artist'].values
        print artist
        track = filtered['track'].values
        print track
where low1 = 0 and high1 = 0.5
artist and track print hundreds of filtered items as normal strings, but if I try to append them to results in the loop:
artist = filtered['artist'].values
track = filtered['track'].values
results.append([track,artist])
I see that I am appending whole arrays (objects) rather than strings, and results ends up populated with only a fraction of the filtered items. I don't get what happens.
How do I populate results with all items as regular strings, in this fashion:
[['artist1', 'track1'], ['artist1', 'track2'], ...]
Create a list of DataFrames, then join them together with concat, and finally convert to nested lists:
results=[]
for filename in glob.glob(os.path.join('/path/*.csv')):
    df = pd.read_csv(filename)
    #filter by conditions and also columns by names with .loc
    filtered = df.loc[(df['duration'] > low1) & (df['duration'] < high1), ['artist','track']]
    #alternative solution
    filtered = df.loc[df['duration'].between(low1, high1, inclusive=False), ['artist','track']]
    results.append(filtered)

out = pd.concat(results).values.tolist()
Another solution is to append lists and finally flatten them with a list comprehension:
results=[]
for filename in glob.glob(os.path.join('/path/*.csv')):
    df = pd.read_csv(filename)
    #filter by conditions and also columns by names with .loc
    mask = df['duration'].between(low1, high1, inclusive=False)
    filtered = df.loc[mask, ['artist','track']].values.tolist()
    results.append(filtered)

out = [y for x in results for y in x]
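One small caveat about the between calls above: in newer pandas versions the boolean inclusive argument is no longer accepted and a string is expected instead, so the strict comparison would be written like this (a sketch of the equivalent call):
mask = df['duration'].between(low1, high1, inclusive='neither')  # strictly > low1 and < high1
filtered = df.loc[mask, ['artist','track']].values.tolist()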
