Fetching data from related tables - Python

I have 3 tables: Continent, Country and Story.
Country has ForeignKey(Continent) and Story has ManyToManyField(Country, blank=True) field.
What I need is a list of countries that have at least one story, grouped by continent.
How can I achieve that?

One way to do it is:
countries = {}
country_list = Country.objects.all()
for c in country_list:
    # stories that reference this country
    stories = Story.objects.filter(country_set__name__exact=c.name)
    # only keep the country if it has at least one story
    if stories.exists():
        # initialize the continent's list on first use
        if c.continent.name not in countries:
            countries[c.continent.name] = []
        # finally, add the country
        countries[c.continent.name].append(c)
That should do the job.
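For completeness, a sketch of a single-query alternative (my addition, not from the original answer): instead of one Story query per country, filter Country on the reverse relation to Story. This assumes the reverse relation uses Django's default name story (the lowercase model name), which depends on how the ManyToManyField was declared.

```python
from collections import defaultdict

# Countries that appear in at least one story; distinct() removes
# duplicates for countries linked to several stories, and
# select_related() avoids one continent query per country.
qs = (Country.objects
      .filter(story__isnull=False)
      .distinct()
      .select_related('continent'))

countries = defaultdict(list)
for c in qs:
    countries[c.continent.name].append(c)
```

This replaces N+1 queries with a single one plus the grouping loop in Python.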

Related

Using geocoder.osm to fetch state, county and country and appending each in a pandas dataframe column

I have a pandas dataframe with a location column. Each row has one such location entry. Some entries only show a country, some only a state, and some are fake, e.g. "Your Mom's Basement". I am only interested in US locations - specifically the county object fetched by Nominatim OSM, which I use to search. Not all searches fetch those county objects, since some locations are only states, e.g. Texas, US, while fake locations also do not provide this.
I tried to filter the results with the code below and append state, county and country entries to new columns in the dataframe. However, many of the values I get seem absurd, e.g. for the entry "Kansas City, MO" I get Puerto Rico as the state. It seems that some entries have shifted up relative to the correct row entries: in the above example, 4 rows above there is a location called Bayamon, Puerto Rico. The cells seem to have shifted, but I cannot find a clear pattern.
I would very much appreciate any help.
import geocoder
import numpy as np
import pandas as pd
from geopy.geocoders import Nominatim

states = []
counties = []
countries = []
for locat in df["user_location"][0:50]:
    # print(locat)
    try:
        g = geocoder.osm(locat)
        if g.accuracy >= 0.8:
            # try extracting county, state and country objects
            try:
                counties.append(g.county)
                print(g.county)
                if g.county == None:
                    counties.append(np.nan)
            except:
                counties.append(np.nan)
            try:
                states.append(g.state)
                print(g.state)
                if g.state == None:
                    g.state = np.nan
                    states.append(np.nan)
            except:
                states.append(np.nan)
            try:
                countries.append(g.country)
                print(g.country)
                if g.country == None:
                    g.country = np.nan
                    countries.append(np.nan)
            except:
                countries.append(np.nan)
    # Catching fake user_location names, e.g. "Your Mom's Basement"
    except:
        counties.append(np.nan)
        states.append(np.nan)
        countries.append(np.nan)

df["state"] = pd.Series(states)
df["county"] = pd.Series(counties)
df["country"] = pd.Series(countries)
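The shifting symptom fits the code shown: a row can append twice (e.g. counties.append(g.county) appends None, then the None check appends np.nan again) or zero times (when g.accuracy < 0.8, no branch runs). A minimal sketch of a fix, assuming the geocoder result exposes accuracy, county, state and country attributes as in the question: compute exactly one value per column per row, so the lists can never drift out of step.

```python
import numpy as np

def row_values(g):
    """Return exactly one (county, state, country) triple per lookup.

    `g` stands in for a geocoder.osm result object (hypothetical stand-in
    here); a missing result or a low-accuracy match yields NaNs, so each
    output list always grows by exactly one entry per input row.
    """
    if g is None or getattr(g, "accuracy", 0) < 0.8:
        return (np.nan, np.nan, np.nan)
    return (
        getattr(g, "county", None) or np.nan,
        getattr(g, "state", None) or np.nan,
        getattr(g, "country", None) or np.nan,
    )
```

Each location then contributes exactly one entry per column, e.g. c, s, n = row_values(g) followed by one append to each list, keeping all three lists the same length as the input.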

Replace a string with a string out of many in Pandas

So, I have a pandas data frame where one column contains a description of the nationality of a user, and I want to replace this whole description with the country they're from.
My inputs are the df and the list of countries:
Description                  ID
I am from Atlantis           1
My family comes from Narnia  2

["narnia", "uzbekistan", "Atlantis", ...]
I know that:
- I only have one country per description
- the description either contains the name of the country or does not; there is no need to infer the country from what the user says. I only want to map [phrase containing name of country] to [country].
If I had only one country to replace I could use something like
df.loc[df['description'].str.contains('Atlantis', case=False), 'description'] = 'Atlantis'
I know that, because the country names are organised in a list, I could cycle through it and apply this to all the elements, something like:
for country in country_list:
df.loc[df['description'].str.contains(country, case=False), 'description'] = country
but that seems quite unpythonic, so I was wondering if anyone could help me find a better way (which I'm sure exists).
The output should be:

Description  ID
Atlantis     1
Narnia       2
You can use pd.Series.str.extract:

import re
import pandas as pd

country_list = ["narnia", "uzbekistan", "Atlantis"]
df = pd.DataFrame({'Description': {0: 'I am from Atlantis',
                                   1: 'My family comes from Narnia'},
                   'ID': {0: 1, 1: 2}})
print(df["Description"].str.extract(f"({'|'.join(country_list)})", flags=re.I))

          0
0  Atlantis
1    Narnia
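One caveat worth adding (my note, not part of the original answer): if any country name contains a regex metacharacter, joining the raw names into a pattern will misbehave, so it is safer to escape each alternative first. The last list entry below is a made-up name chosen to contain parentheses.

```python
import re
import pandas as pd

country_list = ["narnia", "uzbekistan", "Atlantis", "Kongo (hypothetical)"]
df = pd.DataFrame({"Description": ["I am from Atlantis",
                                   "My family comes from Narnia"]})

# re.escape makes characters like '(' match literally instead of
# opening an extra regex group.
pattern = "|".join(re.escape(c) for c in country_list)
result = df["Description"].str.extract(f"({pattern})", flags=re.I)
```

str.extract requires exactly one capturing group, which is why the escaping matters: an unescaped '(' in a name would silently add a second group.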

Reference only DataFrames where condition is True Python Pandas

Similar to this question but somewhat different (and that answer did not work). I am trying to reference DataFrames where a condition is true. In my case, whether or not a word from a word bank is contained in the string. If the word is in the string, I want to be able to use that specific DataFrame later (like pull out the link if true and continue searching). So I have:
wordBank = ['bomb', 'explosion', 'protest',
            'port delay', 'port closure', 'hijack',
            'tropical storm', 'tropical depression']

rss = pd.read_csv('RSSfeed2019.csv')
# print(rss.head())

feeds = []  # list of feed objects
for url in rss['URL'].head(5):
    feeds.append(feedparser.parse(url))
# print(feeds)

posts = []  # list of posts [(title1, link1, summary1), (title2, link2, summary2) ... ]
for feed in feeds:
    for post in feed.entries:
        if hasattr(post, 'summary'):
            posts.append((post.title, post.link, post.summary))
        else:
            posts.append((post.title, post.link))

df = pd.DataFrame(posts, columns=['title', 'link', 'summary'])
if (df['summary'].str.find(wordBank)) or (df['title'].str.find(wordBank)):
    print(df['title'])
print(df['title'])
and tried from the other question...
df = pd.DataFrame(posts, columns=['title', 'link', 'summary'])
for word in wordBank:
    mask = (df['summary'].str.find(word)) or (df['title'].str.find(word))
    df.loc[mask, 'summary'] = word
    df.loc[mask, 'title'] = word
How can I get it to print only the titles of the rows where one of the words appears in either the summary or the title? I want to be able to manipulate only those rows further. The current code prints every title in the DataFrame - I THINK because once one match is true, it decides to print ALL the titles. How can I reference only the titles where the condition is true?
Given the following setup:
posts = [["Global protest Breaks Record", 'porttechnology.org/news/global-teu-breaks-record/', "The world’s total cellular containership fleet has passed 23 million TEU for the first time, according to shipping experts Alphaliner."],
         ["Global TEU Breaks Record", 'porttechnology.org/news/global-teu-breaks-record/', "The world’s total cellular containership fleet has passed 23 million TEU for the first time, according to shipping experts Alphaliner."],
         ["Global TEU Breaks Record", 'porttechnology.org/news/global-teu-breaks-record/', "There is a tropical depression"]]
df = pd.DataFrame(posts, columns=['title', 'link', 'summary'])
print(df)

SETUP
                          title  ...                                            summary
0  Global protest Breaks Record  ...  The world’s total cellular containership fleet...
1      Global TEU Breaks Record  ...  The world’s total cellular containership fleet...
2      Global TEU Breaks Record  ...                     There is a tropical depression
You could:
# build one regex of alternatives; the non-capturing group keeps \b applied to every word
pattern = rf"\b(?:{'|'.join(wordBank)})\b"
# create mask
mask = df['summary'].str.contains(pattern, case=False) | df['title'].str.contains(pattern, case=False)
# extract titles
titles = df['title'].values
# print them
for title in titles[mask]:
    print(title)
Output
Global protest Breaks Record
Global TEU Breaks Record
Notice that the first row has protest in the title, and the last row has tropical depression in the summary. The key idea is to use a regex to match any one of the alternatives in wordBank. See the Python re module documentation and the documentation of str.contains for more on regular expressions.
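Since the goal is to keep working with the matching rows rather than just print titles, the same mask can select a sub-DataFrame directly via boolean indexing. A self-contained sketch using shortened stand-in links and summaries:

```python
import pandas as pd

wordBank = ['bomb', 'explosion', 'protest',
            'port delay', 'port closure', 'hijack',
            'tropical storm', 'tropical depression']

posts = [["Global protest Breaks Record", "link1", "Fleet news."],
         ["Global TEU Breaks Record", "link2", "Fleet news."],
         ["Global TEU Breaks Record", "link3", "There is a tropical depression"]]
df = pd.DataFrame(posts, columns=['title', 'link', 'summary'])

pattern = rf"\b(?:{'|'.join(wordBank)})\b"
mask = (df['summary'].str.contains(pattern, case=False)
        | df['title'].str.contains(pattern, case=False))

# Boolean indexing keeps only the matching rows; work with `hits`
# from here on, e.g. iterate over hits['link'] to fetch each article.
hits = df[mask]
```

hits is a regular DataFrame containing only the matched rows, so any further filtering or column access applies to those rows alone.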

Iterating through form data

I have a QueryDict object in Django as follows:
{'ratingname': ['Beginner', 'Professional'], 'sportname': ['2', '3']}
where the mapping is such:
2 Beginner
3 Professional
and 2, 3 are the primary key values of the sport table in models.py:
class Sport(models.Model):
    name = models.CharField(unique=True, max_length=255)

class SomeTable(models.Model):
    sport = models.ForeignKey(Sport)
    rating = models.CharField(max_length=255, null=True)
My question here is, how do I iterate through ratingname such that I can save it as
st = SomeTable(sport=sportValue, rating=ratingValue)
st.save()
I have tried the following:
ratings = dict['ratingname']
sports = dict['sportname']
for s, i in enumerate(sports):
    sport = Sport.objects.get(pk=sports[int(s[1])])
    rate = SomeTable(sport=sport, rating=ratings[int(s)])
    rate.save()
However, this creates a wrong entry in the tables. For example, with the above given values it creates the following object in my table:
id: 1
sport: 2
rating: 'g'
How do I solve this issue or is there a better way to do something?
There are a couple of problems here. The main one is that QueryDicts return only the last value when accessed with ['sportname'] or the like. To get the list of values, use getlist('sportname'), as documented here:
https://docs.djangoproject.com/en/1.7/ref/request-response/#django.http.QueryDict.getlist
Your enumerate is off, too - enumerate yields the index first, which your code assigns to s. So s[1] will throw an exception. There's a better way to iterate through two sequences in step, though - zip.
ratings = query_dict.getlist('ratingname')  # don't reuse built-in names like dict
sports = query_dict.getlist('sportname')
for rating, sport_pk in zip(ratings, sports):
    sport = Sport.objects.get(pk=int(sport_pk))
    rate = SomeTable(sport=sport, rating=rating)
    rate.save()
You could also look into using a ModelForm based on your SomeTable model.
You may use zip (together with getlist, for the reason given above):
ratings = query_dict.getlist('ratingname')
sports = query_dict.getlist('sportname')
for rating, sport_id in zip(ratings, sports):
    sport = Sport.objects.get(pk=int(sport_id))
    rate = SomeTable(sport=sport, rating=rating)
    rate.save()
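A possible refinement (my addition, assuming the same models as above): the loop issues one Sport query plus one INSERT per pair. Sport.objects.in_bulk() can fetch all referenced sports in a single query, and bulk_create() saves all the rows in one more:

```python
ratings = query_dict.getlist('ratingname')
sports = query_dict.getlist('sportname')

# One query for all referenced sports, keyed by primary key.
sport_map = Sport.objects.in_bulk([int(pk) for pk in sports])

# One INSERT for all the new rows.
SomeTable.objects.bulk_create(
    SomeTable(sport=sport_map[int(pk)], rating=rating)
    for rating, pk in zip(ratings, sports)
)
```

Note that bulk_create skips the model's save() method and signals, which is fine for a plain model like SomeTable but worth knowing about.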

Count multiple occurrences in imported .csv file

Starting from a large imported data set, I am trying to identify and print each line corresponding to a city that has at least 2 unique colleges/universities in it.
So far (the relevant code):
for line in file:
    fields = line.split(",")
    ID, name, city = fields[0], fields[1], fields[3]
    count = line.count()
    if line.count(city) >= 2:
        if line.count(ID) < 2:
            print "ID:", ID, "Name: ", name, "City: ", city
In other words, I want to be able to eliminate 1) any duplicate school listings (by ID - this file has many institutions appearing repeatedly), 2) any cities that do not have two or more institutions there.
Thank you!
Dicts come in handy when you want to index data by some key. In your case, nested dicts that first index by city and then by ID should do the trick.
# will hold cities[city][ID] = fields for that line
cities = {}
for line in file:
    fields = line.split(",")
    ID, name, city = fields[0], fields[1], fields[3]
    cities.setdefault(city, {})[ID] = fields
# a city's value maps each unique ID to its line; keep cities with at least 2 unique IDs
multi_schooled_cities = [ids_by_city.values() for ids_by_city in cities.values() if len(ids_by_city) >= 2]
