Counting across multiple columns in Python - python

I'm pretty new to Python and have searched the web for an answer to this but it is tricky to find without showing it as an example!
The data I have is here:
Dataset
What I'm after is the number of times each 'HomeTeam' has appeared in both the 'HomeTeam' and 'AwayTeam' columns up to and including the date. So for the last row of data in the sample, the input would be 'Fulham', and the output = 4. This is because 'Fulham' has appeared 4 times in the 'HomeTeam' and 'AwayTeam' columns. For the first row of data, again, the input would be 'Fulham', but the output = 1, as it is the first time 'Fulham' has appeared. For the sample dataset, the output should be:
[1,1,2,1,3,1,4]
My code so far only allows me to get the number of times each team has appeared in the 'HomeTeam' column only:
df['H Count'] = df.groupby(['HomeTeam']).cumcount()+1
This gives me the output:
[1,1,1,1,2,1,2]
Any help would be much appreciated!

As I understand it, the team currently in the HomeTeam column is being used as the input.
I don't know how you read in the dataset, but I have just created lists below. The logic should, however, be clear.
With the lists below, I get [1, 1, 3]:
HomeTeam = list()
HomeTeam.append("Fulham")
HomeTeam.append("Tottenham")
HomeTeam.append("Fulham")

AwayTeam = list()
AwayTeam.append("Chelsea")
AwayTeam.append("Fulham")
AwayTeam.append("Liverpool")

H_Count = []
p = 1
# The team in the HomeTeam column is used as input
for team in HomeTeam:
    # Get the lists up until the current row
    tmp_Home = HomeTeam[:p]
    tmp_Away = AwayTeam[:p]
    # Count the number of times team has occurred in home and away
    H_Count.append(tmp_Home.count(team) + tmp_Away.count(team))
    p += 1
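If the data is already in a pandas DataFrame (as in the question), the same running count can be written against the two columns directly. A minimal sketch, using a hypothetical three-row frame shaped like the lists above:

```python
import pandas as pd

df = pd.DataFrame({
    'HomeTeam': ['Fulham', 'Tottenham', 'Fulham'],
    'AwayTeam': ['Chelsea', 'Fulham', 'Liverpool'],
})

# For each row, count appearances of that row's HomeTeam in both
# columns up to and including the current row (.loc[:i] is inclusive)
df['H Count'] = [
    (df.loc[:i, 'HomeTeam'] == team).sum() + (df.loc[:i, 'AwayTeam'] == team).sum()
    for i, team in enumerate(df['HomeTeam'])
]
# df['H Count'].tolist() -> [1, 1, 3]
```

This keeps the cumulative logic of the list version but avoids maintaining the `p` counter by hand.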

Related

Python Pandas row selection

I've tried doing some searching, but I'm having trouble finding what I specifically need. I currently have this:
location = 'Location'
data = pd.read_csv('testbook.csv')
df = pd.DataFrame(data)
search = 'OR' # This will be replaced with an input
row = df[df.eq(search).any(axis=1)]
print(row)
Location = row.at[0, location]
print(Location)
This outputs:
row print out
Location City Price Etc
0 FL OR 50 123
Location print out
FL
This is the CSV information that it's pulling the data from.
My main question concerns this specific line of code:
Location = row.at[0, location]
What I'm trying to do is make the [0, location] part work automatically. I want to automate this in the future, since, for example, instead of 'OR' I might need to find what data is in 'OR1'. The issue is that the [0] is tied to the row number, hence this (this is the entire df):
Location City Price Etc
0 FL OR 50 123
1 FL1 OR1 501 1231
2 FL2 OR2 502 1232
I would have to manually change the code every single time, which of course is unfeasible for what I'm trying to accomplish.
My main question is: how do I pull the specific row numbers on the far left and make that output a variable that I can use anywhere?
I'm having a bit of trouble figuring out what you are looking for, but this is my best guess:
import pandas as pd

data = {'Location': ['FL', 'FL1', 'FL2'],
        'City': ['OR', 'OR1', 'OR2'],
        'Price': [50, 501, 502],
        'Etc': [123, 1231, 1232]}
df = pd.DataFrame(data)

# Given search term -> find location
search = 'OR'
# Outputs 'FL'
df['Location'][df['City'] == search].iloc[0]
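To avoid hard-coding the 0 in row.at[0, location], one option (a sketch, assuming the same sample frame as above) is to ask the index itself which rows matched, and store that row number in a variable:

```python
import pandas as pd

df = pd.DataFrame({'Location': ['FL', 'FL1', 'FL2'],
                   'City': ['OR', 'OR1', 'OR2'],
                   'Price': [50, 501, 502],
                   'Etc': [123, 1231, 1232]})

search = 'OR1'  # this will be replaced with an input
# Index labels of every row where any cell equals the search term
matches = df.index[df.eq(search).any(axis=1)].tolist()

row_num = matches[0]                   # 1 for 'OR1'
location = df.at[row_num, 'Location']  # 'FL1'
```

`matches` is a plain list of row labels, so it can be reused anywhere the row number is needed.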

filling in columns with info from other file based on condition

So there are 2 csv files I'm working with:
file 1:
City KWR1 KWR2 KWR3
Killeen
Killeen
Houston
Whatever
file2:
location link reviews
Killeen www.example.com 300
Killeen www.differentexample.com 200
Killeen www.example3.com 100
Killeen www.extraexample.com 20
Here's what I'm trying to make this code do:
Look at the 'City' in file 1, take the top 3 links in file 2 (you can go ahead and assume the cities won't get mixed up), and then put these top 3 into the KWR1, KWR2 and KWR3 columns for all rows with the same 'City' value.
So it gets the top 3 and then just copies them to the right of all the same 'City' values.
Even asking this question correctly is difficult for me; I hope I've provided enough information.
I know how to read the file in with pandas and all that, I just can't code this exact situation in...
It is a slightly unusual requirement, but I think you need three steps:
1. Keep only the first three values you actually need:
df = df.sort_values(by='reviews',ascending=False).groupby('location').head(3).reset_index()
Hopefully this keeps only the first three from every city.
2. Then you somehow need to label your data. There might be better ways to do this, but here is one way: you assign a new column with numbers and create a user-defined function.
import numpy as np
df['nums'] = np.arange(len(df))
Now you have a column full of numbers (kind of like line numbers).
You then create the function that will label your data:
def my_func(index):
    if index % 3 == 0:
        x = 'KWR' + str(1)
    elif index % 3 == 1:
        x = 'KWR' + str(2)
    elif index % 3 == 2:
        x = 'KWR' + str(3)
    return x
You can then create the labels you need:
df['labels'] = df.nums.apply(my_func)
3. Then you can do:
my_df = pd.pivot_table(df, values='reviews', index=['location'], columns='labels', aggfunc='max').reset_index()
This literally pulls the labels out (pivots) and puts the values in the right places.
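Putting the three steps together, here is a self-contained sketch. Note two assumptions: it uses file2's column names from the question, and it pivots the link column into the KWR slots (which is what the question asks for), rather than the review counts:

```python
import pandas as pd

df = pd.DataFrame({
    'location': ['Killeen'] * 4,
    'link': ['www.example.com', 'www.differentexample.com',
             'www.example3.com', 'www.extraexample.com'],
    'reviews': [300, 200, 100, 20],
})

# Step 1: keep only the three most-reviewed links per city
top3 = (df.sort_values(by='reviews', ascending=False)
          .groupby('location').head(3)
          .reset_index(drop=True))

# Step 2: label the rows KWR1..KWR3 within each city
top3['labels'] = 'KWR' + (top3.groupby('location').cumcount() + 1).astype(str)

# Step 3: pivot the links out into columns
wide = top3.pivot(index='location', columns='labels', values='link').reset_index()
# wide columns: location, KWR1, KWR2, KWR3
```

Using groupby().cumcount() for the labels sidesteps the modulo function, since the count restarts at 0 for every city.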

find most frequent pairs in a dataframe

Suppose I have a two-column dataframe where the first column is the ID of a meeting and the second is the ID of one of the participants in that meeting. Like this:
meeting_id,person_id
meeting0,person1234
meeting0,person4321
meeting0,person5555
meeting1,person4321
meeting1,person9999
# ... ~1 million rows
I want to find each person's top 15 co-participants. E.g., I want to know which 15 people most frequently participate in meetings with Brad.
As an intermediate step I wrote a script that takes the original dataframe and makes a person-to-person dataframe, like this:
person1234,person4321
person1234,person5555
person4321,person5555
person4321,person9999
...
But I'm not sure this intermediate step is necessary. Also, it's taking forever to run (by my estimate it would take weeks!). Here's the monstrosity:
import pandas as pd

links = []
lic = pd.read_csv('meetings.csv', sep=';', names=['meeting_id', 'person_id'],
                  dtype={'meeting_id': str, 'person_id': str})
grouped = lic.groupby('person_id')
for i, group in enumerate(grouped):
    print(i, 'of', len(grouped))
    person_id = group[0].strip()
    if len(person_id) == 14:
        meetings = set(group[1]['meeting_id'])
        for meeting in meetings:
            lic_sub = lic[lic['meeting_id'] == meeting]
            people = set(lic_sub['person_id'])
            for person in people:
                if person != person_id:
                    tup = (person_id, person)
                    links.append(tup)
df = pd.DataFrame(links)
df.to_csv('links.csv', index=False)
Any ideas?
Here is one way: merge the frame with itself on meeting_id, then sort the two person columns into a canonical order:
import numpy as np

s = df.merge(df, on='meeting_id')
s[['person_id_x', 'person_id_y']] = np.sort(s[['person_id_x', 'person_id_y']].values, 1)
s = s.query('person_id_x != person_id_y').drop_duplicates()
s
meeting_id person_id_x person_id_y
1 meeting0 person1234 person4321
2 meeting0 person1234 person5555
5 meeting0 person4321 person5555
10 meeting1 person4321 person9999
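From the pair table to the actual "top 15 co-participants" answer, one sketch (using the question's sample rows, and keeping both directions of each pair so every person gets their own ranking):

```python
import pandas as pd

df = pd.DataFrame({
    'meeting_id': ['meeting0', 'meeting0', 'meeting0', 'meeting1', 'meeting1'],
    'person_id': ['person1234', 'person4321', 'person5555',
                  'person4321', 'person9999'],
})

# Self-merge gives every ordered co-attendance pair; drop self-pairs
s = df.merge(df, on='meeting_id')
s = s[s['person_id_x'] != s['person_id_y']]

# Count co-occurrences per ordered pair, then keep each person's top 15
top = (s.groupby(['person_id_x', 'person_id_y']).size()
        .reset_index(name='n')
        .sort_values('n', ascending=False)
        .groupby('person_id_x').head(15))
```

Here the pairs are deliberately not sorted into a canonical order: person_id_x is "the person", person_id_y is the co-participant, so a single groupby('person_id_x').head(15) reads off everyone's top 15 at once.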

How to iterate over a data frame

I have a dataset of users, books and ratings, and I want to find the users who rated a particular book highly, and then find which other books those users liked too.
My data looks like:
df.sample(5)
User-ID ISBN Book-Rating
49064 102967 0449244741 8
60600 251150 0452264464 9
376698 52853 0373710720 7
454056 224764 0590416413 7
54148 25409 0312421273 9
I did so far:
df_p = df.pivot_table(index='ISBN', columns='User-ID', values='Book-Rating').fillna(0)
lotr = df_p.ix['0345339703'] # Lord of the Rings Part 1
like_lotr = lotr[lotr > 7].to_frame()
users = like_lotr['User-ID']
The last line fails with:
KeyError: 'User-ID'
I want to obtain the users who rated LOTR > 7, and then, for those users, find other books they liked from the matrix.
Help would be appreciated. Thanks.
In your like_lotr dataframe, 'User-ID' is the name of the index, so you cannot select it like a normal column. That is why the line users = like_lotr['User-ID'] raises a KeyError: it is not a column.
Moreover, ix is deprecated; it is better to use loc in your case. And don't put quotes around the label: it needs to be an integer, since 'User-ID' was originally a column of integers (at least in your sample).
Try like this:
df_p = df.pivot_table(index='ISBN', columns='User-ID', values='Book-Rating').fillna(0)
lotr = df_p.loc[452264464] # used another number from your sample dataframe to test this code.
like_lotr = lotr[lotr > 7].to_frame()
users = like_lotr.index.tolist()
users is now a list with the ids you want.
Using your small sample above and the number I used to test, users is [251150].
An alternative solution is to use reset_index. The last two lines should look like this:
like_lotr = lotr[lotr > 7].to_frame().reset_index()
users = like_lotr['User-ID']
reset_index puts the index back into the columns.
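To also cover the second half of the question (finding the other books those users liked), here is one self-contained sketch. It uses hypothetical sample rows, and it stores ISBNs as strings, so in this variant the .loc label keeps its quotes:

```python
import pandas as pd

df = pd.DataFrame({
    'User-ID': [102967, 251150, 52853, 251150],
    'ISBN': ['0449244741', '0452264464', '0373710720', '0449244741'],
    'Book-Rating': [8, 9, 7, 9],
})

df_p = df.pivot_table(index='ISBN', columns='User-ID',
                      values='Book-Rating').fillna(0)

book = '0452264464'
lotr = df_p.loc[book]
users = lotr[lotr > 7].index.tolist()  # the users who rated this book > 7

# Ratings by those users for every other book; keep books any of them rated > 7
ratings = df_p[users].drop(book)
other_books = ratings[ratings.gt(7).any(axis=1)].index.tolist()
```

Selecting df_p[users] works because users is a plain list of column labels, so the whole matrix lookup stays vectorised rather than iterating over rows.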

Searching next item in list if object isn't in the list

I'm attempting to learn how to search csv files. In this example, I've worked out how to search a specific column (date of birth) and how to slice within that column to get the year of birth.
I can search for everything greater than a specific year (e.g. typing in 45 gives me everyone born in or after 1945), but the bit I'm stuck on is this: if I type in a year that isn't actually in the csv/list, I get an error saying the year isn't in the list (which it isn't).
What I'd like to do is iterate through the years in the column until the next year that is in the list is found and print anything greater than that.
I've tried a few bits with iteration, but my brain has finally ground to a halt. Here is my code so far...
import csv

data = []
with open("users.csv") as csvfile:
    reader = csv.reader(csvfile)
    for row in reader:
        data.append(row)
print(data)

lookup = input("Please enter a year of birth to start at (eg 67): ")
#lookupint = int(lookup)

# searching column 3, eg [3],
# but also slicing index 6-8 within column 3,
# eg [6:8] being the year of birth within the DOB field
col3 = [x[3][6:8] for x in data]
# just to check if col3 is showing the right data
print(col3)
print("test3")

# looks in column 3 for 'lookup', which is a string in the table
if lookup in col3:  # can get rid of this
    output = col3.index(lookup)
    print(col3.index(lookup))
    print("test2")

for k in range(0, len(col3)):
    # looks for data that is equal or greater than YOB
    if col3[k] >= lookup:
        print(data[k])
Thanks in advance!
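The "skip ahead to the next year actually in the list" idea can be sketched without iterating by hand. A standalone example with a hypothetical col3 (note this keeps the original code's two-digit string comparison, which works as long as all the years are in the same century):

```python
# Hypothetical two-digit years extracted from the DOB column
col3 = ['45', '52', '67', '45', '71']
lookup = '50'  # a year that does not appear in the list

# All years at or after the input; min() of these is the next year
# that actually occurs, so .index() can no longer raise ValueError
candidates = [y for y in col3 if y >= lookup]
if candidates:
    start_year = min(candidates)          # '52'
    first_match = col3.index(start_year)  # 1
else:
    print("No one was born in or after that year")
```

The final loop over range(len(col3)) in the question then works unchanged, since it already prints every row with a year >= lookup.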
