Iterate through multiple lists of dictionaries - python

I would like to iterate through list of dictionaries in order to get a specific value, but I can't figure it out.
I've made a simplified version of what I've been working with. These lists are much longer, with more dictionaries in them, but for the sake of an example, I hope this shortened dataset will be enough.
listOfResults = [{"29":2523,"30":626,"10":0,"32":128},{"29":2466,"30":914,"10":0,"32":69}]
For example, I need the values of the key "30" from the dictionaries above. I've managed to get those and stored them in a list of integers. ( [626, 914] )
These integers are basically IDs. After this, I need to get the value of these IDs from another list of dictionaries.
listOfTrack = [{"track_length": 1.26,"track_id": 626,"track_name": "Rainbow Road"},{"track_length": 6.21,"track_id": 914,"track_name": "Excalibur"}]
I would like to print/store the track_names and track_lengths of the IDs I've got from the listOfResults earlier. Unfortunately, I've ended up in a complete mess of for loops.

You want something like this:
ids = [626, 914]
result = { track for track in list_of_tracks if track.get("track_id") in ids }

I unfortunately can't comment on the answer given by Nathaniel Ford because I'm a new user so I just thought I'd share it here as an answer.
His answer is basically correct, but I believe you need to replace the curly braces with brackets or else you will get this error: TypeError: unhashable type: 'dict'
The answer should look like:
ids = [626, 914]
result = [track for track in listOfTrack if track.get("track_id") in ids]
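If the order of the IDs matters, or the track list is long, one option is to index the tracks by track_id first. A minimal sketch using the data from the question; the names tracks_by_id and selected are made up for illustration:

```python
listOfResults = [{"29": 2523, "30": 626, "10": 0, "32": 128},
                 {"29": 2466, "30": 914, "10": 0, "32": 69}]
listOfTrack = [{"track_length": 1.26, "track_id": 626, "track_name": "Rainbow Road"},
               {"track_length": 6.21, "track_id": 914, "track_name": "Excalibur"}]

# Collect the IDs stored under key "30", skipping dictionaries that lack it.
ids = [d["30"] for d in listOfResults if "30" in d]

# Index the tracks by ID once, so each lookup is a single dictionary access.
# (tracks_by_id is an illustrative name, not from the original posts.)
tracks_by_id = {track["track_id"]: track for track in listOfTrack}

# Keep (name, length) pairs in the same order as the IDs; unknown IDs are skipped.
selected = [(tracks_by_id[i]["track_name"], tracks_by_id[i]["track_length"])
            for i in ids if i in tracks_by_id]

for name, length in selected:
    print(name, length)
```

Unlike the "in ids" filter, this keeps the results in the order of the IDs rather than the order of listOfTrack.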

listOfResults = [{"29":2523,"30":626,"10":0,"32":128},{"29":2466,"30":914,"10":0,"32":69}]
ids = [x.get('30') for x in listOfResults]
listOfTrack = [{"track_length": 1.26,"track_id": 626,"track_name": "Rainbow Road"},{"track_length": 6.21,"track_id": 914,"track_name": "Excalibur"}]
out = [x for x in listOfTrack if x.get('track_id') in ids]
Alternatively, it may be time to learn a new library if you're going to be doing a lot of this.
import pandas as pd
results_df = pd.DataFrame(listOfResults)
track_df = pd.DataFrame(listOfTrack)
These look like:
# results_df
     29   30  10   32
0  2523  626   0  128
1  2466  914   0   69
# track_df
   track_length  track_id    track_name
0          1.26       626  Rainbow Road
1          6.21       914     Excalibur
Now we can answer your question:
# Creates a mask of rows where this is True.
mask = track_df['track_id'].isin(results_df['30'])
# Specifies that we want just those two columns.
cols = ['track_length', 'track_name']
out = track_df.loc[mask, cols]
print(out)
# Or we can make it back into a dictionary:
print(out.to_dict('records'))
Output:
   track_length    track_name
0          1.26  Rainbow Road
1          6.21     Excalibur
[{'track_length': 1.26, 'track_name': 'Rainbow Road'}, {'track_length': 6.21, 'track_name': 'Excalibur'}]

Related

How to use pandas to check for list of values from a csv spread sheet while filtering out certain keywords?

Hey guys, this is my first post. I am planning on building an anime recommendation engine using Python. I ran into a problem: I made a list called genre_list, which stores the genres I want to filter from the large spreadsheet I was given. I am using the pandas library, which has an isin() function that checks whether the values of a list are included in the data and is supposed to filter on them. I am using the function, but it is not able to detect "Action" in the data even though it is there. I have a feeling there's something wrong with the data types and that I probably have to work around it somehow, but I'm not sure how.
I downloaded my csv file from this link for anyone interested!
https://www.kaggle.com/datasets/marlesson/myanimelist-dataset-animes-profiles-reviews?resource=download
import pandas as pd
df = pd.read_csv('animes.csv')
genre = True
genre_list = []
while genre:
    genre_input = input("What genres would you like to watch?, input \"done\" when done listing!\n")
    if genre_input == "done":
        genre = False
    else:
        genre_list.append(genre_input)
print(genre_list)
df_genre = df[df["genre"].isin(genre_list)]
# df_genre = df["genre"]
print(df_genre)
Output:
[1]: https://i.stack.imgur.com/XZzcc.png
You want to check whether ANY value in your user input list appears in each of the list values in the "genre" column. The isin function checks whether a cell value as a whole equals one of your inputs, which is not what you want here. Change that line to this:
df_genre = df[df['genre'].apply(lambda x: any([i in x for i in genre_list]))]
Let me know if you need any more help.
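To see why isin() finds nothing, it helps to look at individual cells. A small sketch in plain Python, assuming (as a later answer reports) that each genre cell is a string such as "['Action', 'Comedy']"; the sample cells are invented:

```python
# Hypothetical cell values, written the way the CSV stores them: as strings.
cells = ["['Action', 'Comedy']", "['Drama']", "['Action']"]
genre_list = ["Action"]

# isin-style check: is the whole cell equal to one of the inputs?
exact = [cell in genre_list for cell in cells]

# Substring check, as in the apply/any line above.
partial = [any(g in cell for g in genre_list) for cell in cells]

print(exact)    # [False, False, False]
print(partial)  # [True, False, True]
```

One thing to keep in mind is that a substring test also matches longer genre names that happen to contain the input.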
import pandas as pd
df = pd.read_csv('animes.csv')
genre = True
genre_list = []
while genre:
    genre_input = input("What genres would you like to watch?, input \"done\" when done listing!\n")
    if genre_input == "done":
        genre = False
    else:
        genre_list.append(genre_input)
# List of all cells and their genre put into a list
col_list = df["genre"].values.tolist()
temp_list = []
# Each val in the list is compared with the genre_list to see if there is a match
for index, val in enumerate(col_list):
    if all(x in val for x in genre_list):
        # If there is a match, the UID of that cell is added to a temp_list
        temp_list.append(df['uid'].iloc[index])
print(temp_list)
# This checks if the UID is contained in the temp_list of UIDs that have these genres
df_genre = df["uid"].isin(temp_list)
new_df = df.loc[df_genre, "title"]
# Prints all Anime with the specified genres
print(new_df)
This is another approach I took, and it works as well. Thanks for all the help :D
To make a selection from a dataframe, you can write this:
df_genre = df.loc[df['genre'].isin(genre_list)]
I've downloaded the file animes.csv from Kaggle and read it into a dataframe. What I found is that the column genre actually contains strings (of lists), not lists. So one way to get what you want would be:
...
m = df["genre"].str.contains(r"'(?:" + "|".join(genre_list) + r")'")
df_genre = df[m]
Result for genre_list = ["Action"]:
uid ... link
3 5114 ... https://myanimelist.net/anime/5114/Fullmetal_A...
4 31758 ... https://myanimelist.net/anime/31758/Kizumonoga...
5 37510 ... https://myanimelist.net/anime/37510/Mob_Psycho...
7 38000 ... https://myanimelist.net/anime/38000/Kimetsu_no...
9 2904 ... https://myanimelist.net/anime/2904/Code_Geass_...
... ... ... ...
19301 10350 ... https://myanimelist.net/anime/10350/Hakuouki_S...
19303 1293 ... https://myanimelist.net/anime/1293/Urusei_Yatsura
19304 150 ... https://myanimelist.net/anime/150/Blood_
19305 4177 ... https://myanimelist.net/anime/4177/Bounen_no_X...
19309 450 ... https://myanimelist.net/anime/450/InuYasha_Mov...
[4215 rows x 12 columns]
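One caveat with building the pattern by joining user input: characters such as + or ( would be read as regex syntax. A hedged sketch using re.escape from the standard library; the helper name build_genre_pattern and the sample cells are illustrative only:

```python
import re

def build_genre_pattern(genres):
    # Escape each genre so user input is matched literally, then require
    # the surrounding single quotes so only whole genre names match.
    return r"'(?:" + "|".join(re.escape(g) for g in genres) + r")'"

pattern = re.compile(build_genre_pattern(["Action", "Sci-Fi"]))

# Hypothetical cell values in the string-of-list form found in the CSV.
cells = ["['Action', 'Comedy']", "['Drama', 'Sci-Fi']", "['Slice of Life']"]
matches = [bool(pattern.search(cell)) for cell in cells]
print(matches)
```

The string returned by build_genre_pattern could be passed to df["genre"].str.contains in place of the hand-built pattern.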
If you want to transform the values of the genre column for some reason into lists, then you could do either
df["genre"] = df["genre"].str[1:-1].str.replace("'", "").str.split(r",\s*")
or
df["genre"] = df["genre"].map(eval)
Afterwards
df_genre = df[~df["genre"].map(set(genre_list).isdisjoint)]
would give you the filtered dataframe.
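Since eval runs whatever code the string contains, ast.literal_eval from the standard library may be a safer way to parse the cells. A minimal sketch with invented sample cells, which also shows how the isdisjoint filter behaves:

```python
import ast

# Hypothetical cell values in the string-of-list form found in the CSV.
cells = ["['Action', 'Comedy']", "['Drama']"]

# literal_eval only accepts Python literals, so it cannot run arbitrary code.
parsed = [ast.literal_eval(cell) for cell in cells]

genre_list = ["Action", "Horror"]
wanted = set(genre_list)

# isdisjoint is True when nothing is shared, so negate it to keep matches.
keep = [not wanted.isdisjoint(genres) for genres in parsed]
print(parsed)
print(keep)
```

With the dataframe, the same idea would be df["genre"].map(ast.literal_eval), assuming every cell is a valid list literal.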

Iterating on a list based on different parameters

I'm once again asking for help on iterating over a list. This is the problem that eludes me this time:
I have a table that contains various combinations of countries with their respective trade flows.
Since trade goes both ways, my list has, for example, one value for ALB-ARM (how much Albania traded with Armenia that year) and then, further down the list, another value for ARM-ALB (the other way around).
I want to sum these two trade values for every pair of countries. I've been trying out some code, but I quickly realised that all my approaches are wrong.
How do I even set it up? I feel like it's too hard with a loop and would be easy with some function that I don't even know exists.
Example data in Table format:
from astropy.table import Table
country1 = ["ALB","ALB","ARM","ARM","AZE","AZE"]
country2 = ["ARM","AZE","ALB","AZE","ALB","ARM"]
flow = [500,0,200,300,90,20]
t = Table([country1,country2,flow],names=["1","2","flow"],meta={"Header":"Table"})
and the expected output would be:
trade = [700,90,700,320,90,320]
result = Table([country1,country2,flow,trade],names=["1","2","flow","trade"],meta={"Header":"Table"})
Thank you in advance all
Maybe this could help:
country1 = ["ALB","ALB","ARM","ARM","AZE","AZE"]
country2 = ["ARM","AZE","ALB","AZE","ALB","ARM"]
flow = [500,0,200,300,90,20]
trade = []
pairs = map(lambda t: '-'.join(t), zip(country1, country2))
flow_map = dict(zip(pairs, flow))
for left_country, right_country in zip(country1, country2):
    trade.append(flow_map['-'.join((left_country, right_country))] + flow_map['-'.join((right_country, left_country))])
print(trade)
outputs:
[700, 90, 700, 320, 90, 320]
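A variation on the same idea, sketched with a frozenset as the dictionary key so that both directions of a pair land in the same bucket; the name totals is illustrative:

```python
from collections import defaultdict

country1 = ["ALB", "ALB", "ARM", "ARM", "AZE", "AZE"]
country2 = ["ARM", "AZE", "ALB", "AZE", "ALB", "ARM"]
flow = [500, 0, 200, 300, 90, 20]

# A frozenset ignores order, so ALB-ARM and ARM-ALB share one key.
totals = defaultdict(int)
for a, b, f in zip(country1, country2, flow):
    totals[frozenset((a, b))] += f

# Look up the combined total for every row, in the original row order.
trade = [totals[frozenset((a, b))] for a, b in zip(country1, country2)]
print(trade)  # [700, 90, 700, 320, 90, 320]
```

This version does not raise a KeyError when the reverse direction of a pair is missing from the data; the row simply keeps its own flow.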

Count match in 2 pandas dataframes

I have two dataframes that contain text as a list in each row. This one is called df:
                 Datum                      File File_type                                              Text
Datum
2000-01-27  2000-01-27  0864820040_000127_04.txt       _04  [business, date, jan, heineken, starts, integr..
and I have another one, df_lm, which looks like this:
      List_type  Words
0  LM_cnstrain.  [abide, abiding, bound, bounded, commit, commi...
1  LM_litigius.  [abovementioned, abrogate, abrogated, abrogate...
2  LM_modal_me.  [can, frequently, generally, likely, often, ou...
3  LM_modal_st.  [always, best, clearly, definitely, definitive...
4  LM_modal_wk.  [almost, apparently, appeared, appearing, appe...
I want to create new columns in df that count the matching words, for example how many words from df_lm.Words[0] appear in df.Text[0].
Note: df has about 500 rows and df_lm has 6, so I need to create 6 new columns in df so that the updated df looks something like this:
Datum       ... LM_cnstrain  LM_litigius  LM_modal_me ...
2000-01-27  ...           5            3            4
2000-02-25  ...           7            1            0
I hope my question is clear.
Thanks in advance!
EDIT:
I have already done something similar by creating a list and looping over it, but as the lists in df_lm are very long, this is not an option.
The code looked like this:
result_list = []
for file in file_list:
    count_growth = 0
    for word in text.split():
        if word in growth:
            count_growth = count_growth + 1
    a = {'Growth': count_growth}
    result_list.append(a)
As I suggested in my comments, you can try something like this.
The code below has to run in a loop in which the text column of the first dataframe is matched against each of the 6 word lists in the second, creating a column with the value from len(c):
desc = df_lm.iloc[0,1]
matches = df.text.isin(desc)
result = df.text[matches]
If this helps you, let me know; otherwise I will update or delete the answer.
So I've come to the following solution:
for file in file_list:
    count_lm_constraint = 0
    count_lm_litigious = 0
    count_lm_modal_me = 0
    for word in text.split():
        if word in df_lm.iloc[0,1]:
            count_lm_constraint = count_lm_constraint + 1
        if word in df_lm.iloc[1,1]:
            count_lm_litigious = count_lm_litigious + 1
        if word in df_lm.iloc[2,1]:
            count_lm_modal_me = count_lm_modal_me + 1
    a={"File": name, "Text": text,'lm_uncertain':count_lm_uncertain,'lm_positive':count_lm_positive ....}
    result_list.append(a)
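Because the word lists are long, it may help to turn each one into a set once and then count per text, instead of keeping one counter variable per list. A rough sketch with invented stand-ins for df_lm and df.Text; all names and sample words here are hypothetical:

```python
# Hypothetical stand-ins for df_lm (list name -> words) and df.Text (token lists).
word_lists = {
    "LM_constraining": ["abide", "bound", "commit"],
    "LM_modal_strong": ["always", "best", "clearly"],
}
texts = [
    ["business", "always", "bound", "best", "bound"],
    ["commit", "date"],
]

# Convert each word list to a set once; membership tests are then O(1) on average.
word_sets = {name: set(words) for name, words in word_lists.items()}

# One dictionary of counts per text, one key per word list.
counts = [
    {name: sum(word in words for word in tokens) for name, words in word_sets.items()}
    for tokens in texts
]
print(counts)
```

With the real dataframes, the same counting could probably be applied per word list with something like df[name] = df["Text"].apply(lambda tokens: sum(w in words for w in tokens)), assuming the Text column really holds lists of tokens.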

Computing aggregate by creating nested dictionary on the fly

I'm new to python and I could really use your help and guidance at the moment. I am trying to read a csv file with three cols and do some computation based on the first and second column i.e.
A spent 100 A spent 2040
A earned 60
B earned 48
B earned 180
A spent 40
.
.
.
Where A spent 2040 would be the addition of all 'A' and 'spent' amounts. This does not give me an error but it's not logically correct:
for row in rows:
    cols = row.split(",")
    truck = cols[0]
    if (truck != 'A' and truck != 'B'):
        continue
    record = cols[1]
    if(record != "earned" and record != "spent"):
        continue
    amount = int(cols[2])
    #print(truck+" "+record+" "+str(amount))
    if truck in entries:
        #entriesA[truck].update(record)
        if record in records:
            records[record].append(amount)
        else:
            records[record] = [amount]
    else:
        entries[truck] = records
        if record in records:
            records[record].append(amount)
        else:
            entries[truck][record] = [amount]
print(entries)
I am aware that this part is incorrect because I would be adding the same inner dictionary list to the outer dictionary but I'm not sure how to go from there:
entries[truck] = records
if record in records:
    records[record].append(amount)
However, I'm not sure of the syntax to create a new dictionary on the fly that would not be 'records'.
I am getting:
{'B': {'earned': [60, 48], 'spent': [100]}, 'A': {'earned': [60, 48], 'spent': [100]}}
But hoping to get:
{'B': {'earned': [48]}, 'A': {'earned': [60], 'spent': [100]}}
Thanks.
For the kind of calculation you are doing here, I highly recommend Pandas.
Assuming in.csv looks like this:
truck,type,amount
A,spent,100
A,earned,60
B,earned,48
B,earned,180
A,spent,40
You can do the totalling with three lines of code:
import pandas
df = pandas.read_csv('in.csv')
totals = df.groupby(['truck', 'type']).sum()
totals now looks like this:
              amount
truck type
A     earned      60
      spent      140
B     earned     228
You will find that Pandas allows you to think on a much higher level and avoid fiddling with lower level data structures in cases like this.
if record in entries[truck]:
    entries[truck][record].append(amount)
else:
    entries[truck][record] = [amount]
I believe this is what you would want? Now we are directly accessing the truck's records, instead of trying to check a local dictionary called records. Just like you did if there wasn't any entry of a truck.
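For creating the inner dictionaries on the fly, collections.defaultdict is one option. A minimal sketch, assuming each row is a "truck,record,amount" string as in the sample CSV above:

```python
from collections import defaultdict

# Sample rows in the "truck,record,amount" form described in the question.
rows = ["A,spent,100", "A,earned,60", "B,earned,48", "B,earned,180", "A,spent,40"]

# Each new truck gets its own inner dictionary, and each new record its own list.
entries = defaultdict(lambda: defaultdict(list))
for row in rows:
    truck, record, amount = row.split(",")
    entries[truck][record].append(int(amount))

# Sum the collected amounts per truck and record type.
totals = {truck: {record: sum(amounts) for record, amounts in records.items()}
          for truck, records in entries.items()}
print(totals)
```

Because every truck gets a fresh inner dictionary, the shared-records problem from the question cannot occur.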

Compute values from sequential pandas rows

I'm a python novice trying to preprocess timeseries data so that I can compute some changes as an object moves over a series of nodes and edges so that I can count stops, aggregate them into routes, and understand behavior over the route. Data originally comes in the form of two CSV files (entrance, Typedoc = 0 and clearance, Typedoc = 1, each about 85k rows / 19MB) that I merged into 1 file and performed some dimensionality reduction. I've managed to get it into a multi-index dataframe. Here's a snippet:
In [1]: movements.head()
Out[1]:
                     Typedoc  Port   NRT   GRT      Draft
Vessname ECDate
400 L    2012-01-19        0  2394  2328  7762   4.166667
         2012-07-22        1  2394  2328  7762  17.000000
         2012-10-29        0  2395  2328  7762   6.000000
A 397    2012-05-27        1  3315  2928  2928  18.833333
         2012-06-01        0  3315  2928  2928   5.250000
I'm interested in understanding the changes for each level as it traverses through its timeseries. I'm going to represent this as a graph eventually. I think I'd really like this data in dictionary form where each entry for a unique Vessname is essentially a tokenized string of stops along the route:
stops_dict = {'400 L': [
    ['2012-01-19', 0, 2394, 4.166667],
    ['2012-07-22', 1, 2394, 17.000000],
    ['2012-10-29', 0, 2395, 6.000000]
    ]
}
Where the nested list values are:
[ECDate, Typedoc, Port, Draft]
If i = 0, then the values I'm interested in are the Dwell and Transit times and the Draft Change, calculated as:
t_dwell = stops_dict['400 L'][i+1][0] - stops_dict['400 L'][i][0]
d_draft = stops_dict['400 L'][i+1][3] - stops_dict['400 L'][i][3]
i += 1
and
t_transit = stops_dict['400 L'][i+1][0] - stops_dict['400 L'][i][0]
assuming all of the dtypes are correct (a big if, since I have not mastered getting pandas to want to parse my dates). I'm then going to extract the links as some form of:
link = str(stops_dict['400 L'][i][2])+'->'+str(stops_dict['400 L'][i+1][2]),t_transit,d_draft
The t_transit and d_draft values as edge weights. The nodes are list of unique Port values that get assigned the '400 L':[t_dwell,NRT,GRT] k,v pairs (somehow). I haven't figured that out exactly, but I don't think I need help with that process.
I couldn't figure out a simpler way, so I've tried defining a function that required starting over by writing my sorted dataframe out and reading it back in using:
with open(filename, 'r', newline='') as csvfile:
    datareader = csv.reader(csvfile, delimiter=",")
    next(datareader, None)
    <FLOW CONTROL> #based on Typedoc and ECDate values
The function adds to an empty dictionary:
stops_dict = {}
def createStopsDict(row):
    #this reads each row in a csv file,
    #creates a dict entry from row[0]: Vessname if not in dict
    #or appends things after row[0] to the dict entry if Vessname in dict
    ves = row[0]
    if ves in stops_dict:
        stops_dict[ves].append(row[1:])
    else:
        stops_dict[ves]=[row[1:]]
    return
This is an inefficient way of doing things...
I could possibly be using iterrows instead of a csv reader...
I've looked into melt and unstack and I don't think those are correct...
This seems essentially like a groupby effort, but I haven't managed to implement that correctly because of the multi-index...
Is there a simpler, dare I say 'elegant', way to map the dataframe rows based on the multi-index value directly into a reusable data structure (right now the dictionary stops_dict)?
I'm not tied to the dictionary or its structure, so if there's a better way I am open to suggestions.
Thanks!
UPDATE 2:
I think I have this mostly figured out...
Beginning with my original data frame movements:
movements.reset_index().apply(
    lambda x: makeRoute(x.Vessname,
                        [x.ECDate,
                         x.Typedoc,
                         x.Port,
                         x.NRT,
                         x.GRT,
                         x.Draft]),
    axis=1
)
where:
routemap = {}
def makeRoute(Vessname, info):
    if Vessname in routemap:
        route = routemap[Vessname]
        route.append(info)
    else:
        routemap[Vessname] = [info]
    return
returns a dictionary keyed to Vessname in the structure I need to compute things by calling list elements.
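Once the stops for a vessel are in a list, the differences between consecutive stops described in the question can be computed by pairing each stop with the next one. A rough sketch in plain Python, assuming the [ECDate, Typedoc, Port, Draft] layout and ISO-style date strings; the helper parse_date and the other names are illustrative:

```python
from datetime import datetime

# Sample stops for one vessel, in the [ECDate, Typedoc, Port, Draft] layout.
stops = [
    ["2012-01-19", 0, 2394, 4.166667],
    ["2012-07-22", 1, 2394, 17.000000],
    ["2012-10-29", 0, 2395, 6.000000],
]

def parse_date(text):
    # Assumes ISO-style dates such as 2012-01-19.
    return datetime.strptime(text, "%Y-%m-%d")

# Pair each stop with the next one and compute the changes between them.
changes = []
for current, following in zip(stops, stops[1:]):
    elapsed_days = (parse_date(following[0]) - parse_date(current[0])).days
    draft_change = following[3] - current[3]
    link = f"{current[2]}->{following[2]}"
    changes.append((link, elapsed_days, draft_change))

for change in changes:
    print(change)
```

Whether an interval counts as dwell or transit time would depend on the Typedoc values of the two stops involved, which this sketch leaves to the caller.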
